1
Mohammed T, Firoz A, Ramadan AM. RNA Editing in Chloroplast: Advancements and Opportunities. Curr Issues Mol Biol 2022; 44:5593-5604. [PMID: 36421663 PMCID: PMC9688838 DOI: 10.3390/cimb44110379] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/27/2022] [Revised: 11/05/2022] [Accepted: 11/10/2022] [Indexed: 07/25/2023]
Abstract
Many eukaryotic and prokaryotic organisms employ RNA editing (insertion, deletion, or conversion) as a post-transcriptional modification mechanism. RNA editing events are common in the organelles of plants and have gained particular attention due to their role in plant development and growth, as well as in the ability of plants to cope with abiotic stress. Owing to rapid developments in sequencing technologies and data analysis methods, such editing sites are being accurately predicted, and many factors that influence RNA editing are being discovered. The mechanism and role of the pentatricopeptide repeat (PPR) protein family in RNA editing are being uncovered, along with a growing recognition of accessory proteins that might assist these proteins. This review discusses the roles and types of RNA editing events in plants, with an emphasis on chloroplast RNA editing, the factors involved, gaps in knowledge, and future outlooks.
Affiliation(s)
- Taimyiah Mohammed
- Department of Biological Sciences, Faculty of Science, King Abdulaziz University (KAU), P.O. Box 80141, Jeddah 21589, Saudi Arabia
- Ahmad Firoz
- Department of Biological Sciences, Faculty of Science, King Abdulaziz University (KAU), P.O. Box 80141, Jeddah 21589, Saudi Arabia
- Princess Dr. Najla Bint Saud Al-Saud Center for Excellence Research in Biotechnology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
- Ahmed M. Ramadan
- Department of Biological Sciences, Faculty of Science, King Abdulaziz University (KAU), P.O. Box 80141, Jeddah 21589, Saudi Arabia
- Princess Dr. Najla Bint Saud Al-Saud Center for Excellence Research in Biotechnology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
- Agricultural Genetic Engineering Research Institute (AGERI), Agriculture Research Center (ARC), Giza 12619, Egypt
2
Qin S, Fan Y, Hu S, Wang Y, Wang Z, Cao Y, Liu Q, Tan S, Dai Z, Zhou W. iPReditor-CMG: Improving a predictive RNA editor for crop mitochondrial genomes using genomic sequence features and an optimal support vector machine. PHYTOCHEMISTRY 2022; 200:113222. [PMID: 35561852 DOI: 10.1016/j.phytochem.2022.113222] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2021] [Revised: 04/29/2022] [Accepted: 04/30/2022] [Indexed: 06/15/2023]
Abstract
In crops, RNA editing is one of the most important post-transcriptional processes, in which specific cytidines (C) in virtually all mitochondrial protein-coding genes are converted to uridines (U). Despite extensive recent research on RNA editing, efficiently exploring all C-to-U editing events on the genomic scale remains challenging. Developing accurate prediction methods for the detection of RNA editing sites would dramatically reduce the need for experimental determination. Therefore, we propose a novel method, iPReditor-CMG (improved predictive RNA editor for crop mitochondrial genomes), to predict crop mitochondrial editing sites using genome sequence and an optimised support vector machine (SVM). We first selected three mitochondrial genomes with known RNA editing sites from Arabidopsis thaliana, Brassica napus and Oryza sativa, released by NCBI, as the training and test sets. The genes and their transcripts from self-sequenced tobacco mitochondrial ATPase were selected as the validation set. iPReditor-CMG first coded the genome sequences as numerical vectors and then performed an efficient feature selection on the high-dimensional feature space, where the SVM was employed both in feature selection and in the subsequent modelling. The average independent prediction accuracy of intraspecific editing sites across three species was 0.85, and up to 0.91 in A. thaliana, which outperformed the reference models. For the interspecific independent prediction, the prediction accuracy between dicotyledons was 0.78 and the accuracy between dicotyledons and monocotyledons was 0.56, which implies that there might be similarity in the C-to-U editing mechanism in close relatives. Finally, the best model was identified with an independent test accuracy of 0.91 and an AUC of 0.88, which suggested that five unreported feature sequences, i.e. TGACA, ACAAC, GTAGA, CCGTT and TAACA, are closely associated with the editing phenomenon. Multiple tests supported that iPReditor-CMG could be effectively applied to predict editing sites in crop mitochondria, which may further contribute to understanding the mechanisms of site editing and post-transcriptional events in crop mitochondria.
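The abstract above says only that genome sequences are coded as numerical vectors before feature selection; it does not give the encoding. As a rough illustration, and assuming (this is not stated in the abstract) that overlapping 5-mer counts in a window around each candidate cytidine are a plausible encoding, given that the reported feature sequences are 5 nucleotides long, the step could be sketched as follows. The function names `kmer_counts` and `window_around` and the flank size are hypothetical.

```python
from itertools import product


def kmer_counts(seq, k=5):
    """Count overlapping k-mers in seq as a fixed-order vector (A/C/G/T only).

    Hypothetical encoding: the abstract does not specify how sequences
    are turned into numerical vectors.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    vec = [0] * len(index)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip k-mers containing N or other symbols
            vec[index[kmer]] += 1
    return vec


def window_around(genome, pos, flank=20):
    """Return the window centred on 0-based position pos, clipped at the ends.

    The flank size of 20 is an assumption made for illustration.
    """
    start = max(0, pos - flank)
    return genome[start:pos + flank + 1]
```

Each candidate C would then yield one vector of length 4^k, which a classifier such as an SVM could take as input after feature selection.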
Affiliation(s)
- Sidong Qin
- Hunan Provincial Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China; Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha, 410128, China
- Yanjun Fan
- Hunan Provincial Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China; Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha, 410128, China; Shanxi Province Jincheng City Landscaping Service Center, Shanxi, 048000, China
- Shengnan Hu
- Hunan Provincial Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China; Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha, 410128, China
- Yongqiang Wang
- Hunan Provincial Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China; Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha, 410128, China
- Ziqi Wang
- Hunan Provincial Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China; Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha, 410128, China
- Yixiang Cao
- Hunan Provincial Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China; Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha, 410128, China
- Qiyuan Liu
- Key Laboratory of Crop Physiology, Ecology and Genetic Breeding, Ministry of Education, College of Agronomy, Jiangxi Agricultural University, Nanchang, 330045, China
- Siqiao Tan
- College of Information and Intelligence, Hunan Agricultural University, Changsha, 410128, China
- Zhijun Dai
- Hunan Provincial Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China; Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha, 410128, China
- Wei Zhou
- Hunan Provincial Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China; Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha, 410128, China.
3
Affiliation(s)
- Markus Loecher
- Department of Business and Economics, Berlin School of Economics and Law, Berlin, Germany
4
Lo Giudice C, Hernández I, Ceci LR, Pesole G, Picardi E. RNA editing in plants: A comprehensive survey of bioinformatics tools and databases. PLANT PHYSIOLOGY AND BIOCHEMISTRY : PPB 2019; 137:53-61. [PMID: 30738217 DOI: 10.1016/j.plaphy.2019.02.001] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 10/05/2018] [Revised: 01/30/2019] [Accepted: 02/02/2019] [Indexed: 06/09/2023]
Abstract
RNA editing is a widespread epitranscriptomic mechanism by which primary RNAs are specifically modified through insertions/deletions or nucleotide substitutions. In plants, RNA editing occurs in organelles (plastids and mitochondria), involves the cytosine to uridine modification (rarely uridine to cytosine) within protein-coding and non-protein-coding regions of RNAs, and affects organelle biogenesis, adaptation to environmental changes and signal transduction. High-throughput sequencing technologies have dramatically improved the detection of RNA editing sites at genomic scale. Consequently, different bioinformatics resources have been released to discover and/or collect novel events. Here, we review and describe the state-of-the-art bioinformatics tools devoted to the characterization of RNA editing in plant organelles, with the aim of improving our knowledge about this fascinating but still under-investigated process.
Affiliation(s)
- Claudio Lo Giudice
- IBIOM-CNR, Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council, Italy
- Irene Hernández
- Departamento de Bioquímica y Biología Molecular y Celular, Facultad de Ciencias, Universidad de Zaragoza, C/ Pedro Cerbuna 12, 50009, Zaragoza, Spain
- Luigi R Ceci
- IBIOM-CNR, Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council, Italy
- Graziano Pesole
- IBIOM-CNR, Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council, Italy; Department of Biosciences, Biotechnology and Biopharmaceutics, University of Bari A. Moro, Bari, Italy
- Ernesto Picardi
- IBIOM-CNR, Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council, Italy; Department of Biosciences, Biotechnology and Biopharmaceutics, University of Bari A. Moro, Bari, Italy.
5
Couronné R, Probst P, Boulesteix AL. Random forest versus logistic regression: a large-scale benchmark experiment. BMC Bioinformatics 2018; 19:270. [PMID: 30016950 PMCID: PMC6050737 DOI: 10.1186/s12859-018-2264-5] [Citation(s) in RCA: 265] [Impact Index Per Article: 44.2] [Received: 12/04/2017] [Accepted: 06/27/2018] [Indexed: 11/10/2022]
Abstract
BACKGROUND AND GOAL: The Random Forest (RF) algorithm for regression and classification has gained considerable popularity since its introduction in 2001. Meanwhile, it has grown to a standard classification approach competing with logistic regression (LR) in many innovation-friendly scientific fields. RESULTS: In this context, we present a large-scale benchmarking experiment based on 243 real datasets comparing the prediction performance of the original version of RF with default parameters and LR as binary classification tools. Most importantly, the design of our benchmark experiment is inspired by clinical trial methodology, thus avoiding common pitfalls and major sources of bias. CONCLUSION: RF performed better than LR according to the considered accuracy measure in approximately 69% of the datasets. The mean difference between RF and LR was 0.029 (95%-CI = [0.022, 0.038]) for the accuracy, 0.041 (95%-CI = [0.031, 0.053]) for the Area Under the Curve, and -0.027 (95%-CI = [-0.034, -0.021]) for the Brier score, all measures thus suggesting a significantly better performance of RF. As a side-result of our benchmarking experiment, we observed that the results were noticeably dependent on the inclusion criteria used to select the example datasets, thus emphasizing the importance of clear statements regarding this dataset selection process. We also stress that neutral studies similar to ours, based on a high number of datasets and carefully designed, will be necessary in the future to evaluate further variants, implementations or parameters of random forests which may yield improved accuracy compared to the original version with default values.
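The three measures reported above point in different directions: higher is better for accuracy and AUC, lower is better for the Brier score, which is why a negative RF-minus-LR difference in Brier score favours RF. A minimal sketch of two of these measures for binary labels and predicted probabilities follows; the function names and the 0.5 threshold are illustrative choices, not taken from the paper.

```python
def accuracy(y_true, p_hat, threshold=0.5):
    """Fraction of cases whose thresholded probability matches the 0/1 label."""
    preds = [1 if p >= threshold else 0 for p in p_hat]
    return sum(int(a == b) for a, b in zip(y_true, preds)) / len(y_true)


def brier_score(y_true, p_hat):
    """Mean squared difference between predicted probability and 0/1 outcome.

    Lower values indicate better-calibrated, sharper predictions.
    """
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)
```

Computing both functions for two models on the same test set and subtracting gives per-dataset differences of the kind summarised in the abstract.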
Affiliation(s)
- Raphael Couronné
- Department of Medical Information Processing, Biometry and Epidemiology, LMU Munich, Marchioninistr. 15, Munich, 81377 Germany
- Philipp Probst
- Department of Medical Information Processing, Biometry and Epidemiology, LMU Munich, Marchioninistr. 15, Munich, 81377 Germany
- Anne-Laure Boulesteix
- Department of Medical Information Processing, Biometry and Epidemiology, LMU Munich, Marchioninistr. 15, Munich, 81377 Germany
6
Edera AA, Gandini CL, Sanchez-Puerta MV. Towards a comprehensive picture of C-to-U RNA editing sites in angiosperm mitochondria. PLANT MOLECULAR BIOLOGY 2018; 97:215-231. [PMID: 29761268 DOI: 10.1007/s11103-018-0734-9] [Citation(s) in RCA: 69] [Impact Index Per Article: 11.5] [Received: 11/16/2017] [Accepted: 05/02/2018] [Indexed: 06/08/2023]
Abstract
Our understanding of the dynamics and evolution of RNA editing in angiosperms is in part limited by the few editing sites identified to date. This study identified 10,217 editing sites from 17 diverse angiosperms. Our analyses confirmed the universality of certain features of RNA editing, and offered new evidence behind the loss of editing sites in angiosperms. RNA editing is a post-transcriptional process that converts cytidines (C) to uridines (U) in organellar transcripts of angiosperms. These substitutions mostly take place in mitochondrial messenger RNAs at specific positions called editing sites. By means of publicly available RNA-seq data, this study identified 10,217 editing sites in mitochondrial protein-coding genes of 17 diverse angiosperms. Even though other types of mismatches were also identified, we did not find evidence of non-canonical editing processes. The results showed an uneven distribution of editing sites among species, genes, and codon positions. The analyses revealed that editing sites were conserved across angiosperms but there were some species-specific sites. Non-synonymous editing sites were particularly highly conserved (~80%) across the plant species and were efficiently edited (80% editing extent). In contrast, editing sites at third codon positions were poorly conserved (~30%) and only partially edited (~40% editing extent). We found that the loss of editing sites along angiosperm evolution occurs mainly through the replacement of editing sites with thymidines, rather than through degradation of the editing recognition motif around editing sites. Consecutive and highly conserved editing sites had been replaced by thymidines as a result of retroprocessing, by which edited transcripts are reverse transcribed to cDNA and then integrated into the genome by homologous recombination. This phenomenon was more pronounced in eudicots, and in the gene cox1. These results suggest that retroprocessing is a widespread driving force underlying the loss of editing sites in angiosperm mitochondria.
Affiliation(s)
- Alejandro A Edera
- IBAM, Facultad de Ciencias Agrarias, CONICET, Universidad Nacional de Cuyo, M5528AHB, Chacras de Coria, Argentina.
- Carolina L Gandini
- IBAM, Facultad de Ciencias Agrarias, CONICET, Universidad Nacional de Cuyo, M5528AHB, Chacras de Coria, Argentina
- M Virginia Sanchez-Puerta
- IBAM, Facultad de Ciencias Agrarias, CONICET, Universidad Nacional de Cuyo, M5528AHB, Chacras de Coria, Argentina
- Facultad de Ciencias Exactas y Naturales, Universidad Nacional de Cuyo, 5500, Mendoza, Argentina
7
Epifanio I. Intervention in prediction measure: a new approach to assessing variable importance for random forests. BMC Bioinformatics 2017; 18:230. [PMID: 28464827 PMCID: PMC5414143 DOI: 10.1186/s12859-017-1650-8] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 01/13/2017] [Accepted: 04/25/2017] [Indexed: 12/20/2022]
Abstract
Background: Random forests are a popular method in many fields since they can be successfully applied to complex data, with a small sample size, complex interactions and correlations, mixed type predictors, etc. Furthermore, they provide variable importance measures that aid qualitative interpretation and also the selection of relevant predictors. However, most of these measures rely on the choice of a performance measure, and measures of prediction performance are not unique; in some cases, such as multivariate response random forests, they are not even clearly defined. Methods: A new alternative importance measure, called Intervention in Prediction Measure, is investigated. It depends on the structure of the trees, without depending on performance measures. It is compared with other well-known variable importance measures in different contexts, such as a classification problem with variables of different types, another classification problem with correlated predictor variables, and problems with multivariate responses and predictors of different types. Results: Several simulation studies are carried out, showing the new measure to be very competitive. In addition, it is applied in two well-known bioinformatics applications previously used in other papers. Improvements in performance are also provided for these applications by the use of this new measure. Conclusions: This new measure is expressed as a percentage, which makes it attractive in terms of interpretability. It can be used with new observations. It can be defined globally, for each class (in a classification problem) and case-wise. It can easily be computed for any kind of response, including multivariate responses. Furthermore, it can be used with any algorithm employed to grow each individual tree. It can be used in place of (or in addition to) other variable importance measures.
Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1650-8) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Irene Epifanio
- Departament de Matemàtiques and Institut de Matemàtiques i Aplicacions de Castelló, Universitat Jaume I, Campus del Riu Sec, Castelló, 12071, Spain.
8
Cahoon AB, Nauss JA, Stanley CD, Qureshi A. Deep Transcriptome Sequencing of Two Green Algae, Chara vulgaris and Chlamydomonas reinhardtii, Provides No Evidence of Organellar RNA Editing. Genes (Basel) 2017; 8:genes8020080. [PMID: 28230734 PMCID: PMC5333069 DOI: 10.3390/genes8020080] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 01/23/2017] [Accepted: 02/13/2017] [Indexed: 11/16/2022]
Abstract
Nearly all land plants post-transcriptionally modify specific nucleotides within RNAs, a process known as RNA editing. This adaptation allows the correction of deleterious mutations within the asexually reproducing and presumably non-recombinant chloroplast and mitochondrial genomes. There are no reports of RNA editing in any of the green algae, so this phenomenon is presumed to have originated in embryophytes either after the invasion of land or in the now extinct algal ancestor of all land plants. This was challenged when a recent in silico screen for RNA edit sites based on genomic sequence homology predicted edit sites in the green alga Chara vulgaris, a multicellular alga found within the Streptophyta clade and one of the closest extant algal relatives of land plants. In this study, the organelle transcriptomes of C. vulgaris and Chlamydomonas reinhardtii were deep sequenced for a comprehensive assessment of RNA editing. Initial analyses based solely on sequence comparisons suggested potential edit sites in both species, but subsequent high-resolution melt analysis, RNase H-dependent PCR (rhPCR), and Sanger sequencing of DNA and complementary DNAs (cDNAs) from each of the putative edit sites revealed them to be either single-nucleotide polymorphisms (SNPs) or spurious deep sequencing results. The lack of RNA editing in these two lineages is consistent with the current hypothesis that RNA editing evolved after embryophytes split from their ancestral algal lineage.
Affiliation(s)
- A Bruce Cahoon
- Department of Natural Sciences, University of Virginia's College at Wise, 1 College Ave., Wise, VA 24293, USA.
- John A Nauss
- Department of Natural Sciences, University of Virginia's College at Wise, 1 College Ave., Wise, VA 24293, USA.
- Conner D Stanley
- Department of Natural Sciences, University of Virginia's College at Wise, 1 College Ave., Wise, VA 24293, USA.
- Ali Qureshi
- Department of Natural Sciences, University of Virginia's College at Wise, 1 College Ave., Wise, VA 24293, USA.
9
Janitza S, Strobl C, Boulesteix AL. An AUC-based permutation variable importance measure for random forests. BMC Bioinformatics 2013; 14:119. [PMID: 23560875 PMCID: PMC3626572 DOI: 10.1186/1471-2105-14-119] [Citation(s) in RCA: 148] [Impact Index Per Article: 13.5] [Received: 11/23/2012] [Accepted: 03/21/2013] [Indexed: 11/30/2022]
Abstract
Background: The random forest (RF) method is a commonly used tool for classification with high dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). However the classification performance of RF is known to be suboptimal in case of strongly unbalanced data, i.e. data where response class sizes differ considerably. Suggestions were made to obtain better classification performance based either on sampling procedures or on cost sensitivity analyses. However to our knowledge the performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class imbalance. Results: We investigated the performance of the standard permutation VIM and of our novel AUC-based permutation VIM for different class imbalance levels using simulated data and real data. The results suggest that the new AUC-based permutation VIM outperforms the standard permutation VIM for unbalanced data settings while both permutation VIMs have equal performance for balanced data settings. Conclusions: The standard permutation VIM loses its ability to discriminate between associated predictors and predictors not associated with the response for increasing class imbalance. It is outperformed by our new AUC-based permutation VIM for unbalanced data settings, while the performance of both VIMs is very similar in the case of balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF variant based on conditional inference trees. The codes implementing our study are available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html.
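The idea of replacing the error rate with the AUC inside a permutation importance can be illustrated with a small sketch. This is a simplified, model-level version written for illustration: it permutes one feature column and records the drop in AUC for an arbitrary scoring function, whereas the method in the paper works per tree on out-of-bag observations within a random forest. The names `auc` and `auc_permutation_importance` and the `predict` callable are hypothetical.

```python
import random


def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_permutation_importance(predict, X, y, feature, n_repeats=20, seed=0):
    """Mean drop in AUC after shuffling one feature column.

    predict: callable mapping one row (a list) to a score (assumed interface).
    X: list of rows, each a list of feature values; y: 0/1 labels.
    """
    rng = random.Random(seed)
    baseline = auc(y, [predict(row) for row in X])
    drops = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)
        X_perm = [row[:feature] + [v] + row[feature + 1:]
                  for row, v in zip(X, column)]
        drops.append(baseline - auc(y, [predict(row) for row in X_perm]))
    return sum(drops) / n_repeats
```

Because the AUC compares ranks of positives against negatives rather than counting correct labels, it does not reward a model for simply predicting the majority class, which is the intuition behind its expected robustness to class imbalance.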
Affiliation(s)
- Silke Janitza
- Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, D-81377, Munich, Germany.
10
Lenz H, Knoop V. PREPACT 2.0: Predicting C-to-U and U-to-C RNA Editing in Organelle Genome Sequences with Multiple References and Curated RNA Editing Annotation. Bioinform Biol Insights 2013; 7:1-19. [PMID: 23362369 PMCID: PMC3547502 DOI: 10.4137/bbi.s11059] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.5] [Indexed: 02/02/2023]
Abstract
RNA editing is vast in some genetic systems, with up to thousands of targeted C-to-U and U-to-C substitutions in mitochondria and chloroplasts of certain plants. Efficient prognoses of RNA editing in organelle genomes will help to reveal overlooked cases of editing. We present PREPACT 2.0 (http://www.prepact.de) with numerous enhancements of our previously developed Plant RNA Editing Prediction & Analysis Computer Tool. Reference organelle transcriptomes for editing prediction have been extended and reorganized to include 19 curated mitochondrial and 13 chloroplast genomes, now allowing RNA editing sites to be distinguished from “pre-edited” sites. Queries may be run against multiple references, and a new “commons” function identifies and highlights orthologous candidate editing sites congruently predicted by multiple references. Enhancements to the BLASTX mode in PREPACT 2.0 allow querying of complete novel organelle genomes within a few minutes, identifying protein genes and candidate RNA editing sites simultaneously without prior user analyses.
Affiliation(s)
- Henning Lenz
- Abteilung Molekulare Evolution, Institut für Zelluläre und Molekulare Botanik, Universität Bonn, Bonn, Germany
11
12
Takala SL, Coulibaly D, Thera MA, Batchelor AH, Cummings MP, Escalante AA, Ouattara A, Traoré K, Niangaly A, Djimdé AA, Doumbo OK, Plowe CV. Extreme polymorphism in a vaccine antigen and risk of clinical malaria: implications for vaccine development. Sci Transl Med 2010; 1:2ra5. [PMID: 20165550 DOI: 10.1126/scitranslmed.3000257] [Citation(s) in RCA: 138] [Impact Index Per Article: 9.9] [Indexed: 11/02/2022]
Abstract
Vaccines directed against the blood stages of Plasmodium falciparum malaria are intended to prevent the parasite from invading and replicating within host cells. No blood-stage malaria vaccine has shown clinical efficacy in humans. Most malaria vaccine antigens are parasite surface proteins that have evolved extensive genetic diversity, and this diversity could allow malaria parasites to escape vaccine-induced immunity. We examined the extent and within-host dynamics of genetic diversity in the blood-stage malaria vaccine antigen apical membrane antigen-1 in a longitudinal study in Mali. Two hundred and fourteen unique apical membrane antigen-1 haplotypes were identified among 506 human infections, and amino acid changes near a putative invasion machinery binding site were strongly associated with the development of clinical symptoms, suggesting that these residues may be important to consider in designing polyvalent apical membrane antigen-1 vaccines and in assessing vaccine efficacy in field trials. This extreme diversity may pose a serious obstacle to an effective polyvalent recombinant subunit apical membrane antigen-1 vaccine.
Affiliation(s)
- Shannon L Takala
- Howard Hughes Medical Institute and Center for Vaccine Development, University of Maryland School of Medicine, 685 West Baltimore Street, Baltimore, MD 21201, USA
13
Salmans ML, Chaw SM, Lin CP, Shih ACC, Wu YW, Mulligan RM. Editing site analysis in a gymnosperm mitochondrial genome reveals similarities with angiosperm mitochondrial genomes. Curr Genet 2010; 56:439-46. [PMID: 20617318 PMCID: PMC2943580 DOI: 10.1007/s00294-010-0312-4] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 02/10/2010] [Revised: 06/11/2010] [Accepted: 06/16/2010] [Indexed: 11/30/2022]
Abstract
Sequence analysis of organelle genomes and comprehensive analysis of C-to-U editing sites from flowering and non-flowering plants have provided extensive sequence information from diverse taxa. This study includes the first comprehensive analysis of RNA editing sites from a gymnosperm mitochondrial genome, and utilizes informatics analyses to determine conserved features in the RNA sequence context around editing sites. We have identified 565 editing sites in 21 full-length and 4 partial cDNAs of the 39 protein-coding genes identified from the mitochondrial genome of Cycas taitungensis. The information profiles and RNA sequence context of C-to-U editing sites in the Cycas genome exhibit similarity in the immediate flanking nucleotides. Relative entropy analyses indicate that, within the 20 nucleotides of the 5' flank, the regions that carry information content are similar to those in angiosperm mitochondrial genomes. These results suggest that evolutionary constraints exist on the nucleotide sequences immediately adjacent to C-to-U editing sites, and that similar regions are utilized in editing site recognition.
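A relative entropy (Kullback-Leibler) profile of the kind mentioned above measures, position by position, how far the base frequencies around aligned editing sites depart from a background composition. The sketch below is one generic way to compute such a profile; the pseudocount, the background frequencies, and the function names are assumptions for illustration, since the abstract does not describe the exact procedure.

```python
import math
from collections import Counter


def relative_entropy(column, background, pseudocount=1.0):
    """KL divergence (bits) between base frequencies at one aligned position
    and a background distribution given as {base: probability}."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(background)
    kl = 0.0
    for base, q in background.items():
        p = (counts.get(base, 0) + pseudocount) / total
        if p > 0:  # a zero-frequency base contributes nothing to the sum
            kl += p * math.log2(p / q)
    return kl


def profile(windows, background):
    """Relative entropy at each position of equal-length aligned windows."""
    return [relative_entropy(col, background) for col in zip(*windows)]
```

Positions with high values are those where the sequence context is constrained, for example the nucleotides immediately flanking the edited C.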
Affiliation(s)
- Michael Lee Salmans
- Department of Developmental and Cell Biology, University of California, Irvine, 92697-2300, USA
14
Altmann A, Toloşi L, Sander O, Lengauer T. Permutation importance: a corrected feature importance measure. Bioinformatics 2010; 26:1340-7. [PMID: 20385727 DOI: 10.1093/bioinformatics/btq134] [Citation(s) in RCA: 618] [Impact Index Per Article: 44.1] [Indexed: 12/17/2022]
Abstract
MOTIVATION: In life sciences, interpretability of machine learning models is as important as their prediction accuracy. Linear models are probably the most frequently used methods for assessing feature relevance, despite their relative inflexibility. However, in the past years effective estimators of feature relevance have been derived for highly complex or non-parametric models such as support vector machines and RandomForest (RF) models. Recently, it has been observed that RF models are biased in such a way that categorical variables with a large number of categories are preferred. RESULTS: In this work, we introduce a heuristic for normalizing feature importance measures that can correct the feature importance bias. The method is based on repeated permutations of the outcome vector for estimating the distribution of measured importance for each variable in a non-informative setting. The P-value of the observed importance provides a corrected measure of feature importance. We apply our method to simulated data and demonstrate that (i) non-informative predictors do not receive significant P-values, (ii) informative variables can successfully be recovered among non-informative variables and (iii) P-values computed with permutation importance (PIMP) are very helpful for deciding the significance of variables, and therefore improve model interpretability. Furthermore, PIMP was used to correct RF-based importance measures for two real-world case studies. We propose an improved RF model that uses the significant variables with respect to the PIMP measure and show that its prediction accuracy is superior to that of other existing models. AVAILABILITY: R code for the method presented in this article is available at http://www.mpi-inf.mpg.de/~altmann/download/PIMP.R CONTACT: altmann@mpi-inf.mpg.de, laura.tolosi@mpi-inf.mpg.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
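The core of the heuristic described above, permuting the outcome vector to obtain a null distribution of importance and reading off a P-value for the observed importance, can be sketched in a few lines. Two assumptions are made here purely for illustration: absolute Pearson correlation stands in for the model-based importance measure (the paper targets RF importances), and the P-value is taken as an empirical tail fraction, which the abstract does not specify. All function names are hypothetical.

```python
import random


def abs_correlation(x, y):
    """Stand-in importance measure: absolute Pearson correlation of x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no information
    return abs(sxy) / (sxx * syy) ** 0.5


def pimp_p_value(x, y, importance=abs_correlation, n_perm=200, seed=0):
    """Empirical P-value of the observed importance of feature x for outcome y.

    The outcome is shuffled n_perm times to simulate a non-informative
    setting; the +1 terms keep the estimate strictly above zero.
    """
    rng = random.Random(seed)
    observed = importance(x, y)
    y_perm = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if importance(x, y_perm) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```

An informative feature yields a small P-value because shuffled outcomes rarely reproduce its observed importance, while a non-informative feature yields a P-value near 1.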
Affiliation(s)
- André Altmann
- Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbrücken, Germany.
|
15
|
Hammani K, Okuda K, Tanz SK, Chateigner-Boutin AL, Shikanai T, Small I. A study of new Arabidopsis chloroplast RNA editing mutants reveals general features of editing factors and their target sites. THE PLANT CELL 2009; 21:3686-99. [PMID: 19934379 PMCID: PMC2798323 DOI: 10.1105/tpc.109.071472] [Citation(s) in RCA: 145] [Impact Index Per Article: 9.7] [Received: 09/17/2009] [Revised: 10/09/2009] [Accepted: 10/30/2009] [Indexed: 05/18/2023]
Abstract
RNA editing in higher plant organelles results in the conversion of specific cytidine residues to uridine residues in RNA. The recognition of a specific target C site by the editing machinery involves trans-acting factors that bind to the RNA upstream of the C to be edited. In the last few years, analysis of mutants affected in chloroplast biogenesis has identified several pentatricopeptide repeat (PPR) proteins from the PLS subfamily that are essential for the editing of particular RNA transcripts. We selected other genes from the same subfamily and used a reverse genetics approach to identify six new chloroplast editing factors in Arabidopsis thaliana (OTP80, OTP81, OTP82, OTP84, OTP85, and OTP86). These six factors account for nine editing sites not previously assigned to an editing factor and, together with the nine PPR editing proteins previously described, explain more than half of the 34 editing events in Arabidopsis chloroplasts. OTP80, OTP81, OTP85, and OTP86 target only one editing site each, OTP82 two sites, and OTP84 three sites in different transcripts. An analysis of the target sites requiring the five editing factors involved in editing of multiple sites (CRR22, CRR28, CLB19, OTP82, and OTP84) suggests that editing factors can generally distinguish pyrimidines from purines and, at some positions, must be able to recognize specific bases.
Affiliation(s)
- Kamel Hammani
- Australian Research Council Centre of Excellence in Plant Energy Biology, University of Western Australia, Crawley 6009 WA, Australia
- Institut de Biologie Moléculaire des Plantes du Centre National de la Recherche Scientifique, Université de Strasbourg, 67084 Strasbourg Cedex, France
| | - Kenji Okuda
- Department of Botany, Graduate School of Science, Kyoto University, Kyoto 606-8502 Japan
| | - Sandra K. Tanz
- Australian Research Council Centre of Excellence in Plant Energy Biology, University of Western Australia, Crawley 6009 WA, Australia
| | - Anne-Laure Chateigner-Boutin
- Australian Research Council Centre of Excellence in Plant Energy Biology, University of Western Australia, Crawley 6009 WA, Australia
| | - Toshiharu Shikanai
- Department of Botany, Graduate School of Science, Kyoto University, Kyoto 606-8502 Japan
| | - Ian Small
- Australian Research Council Centre of Excellence in Plant Energy Biology, University of Western Australia, Crawley 6009 WA, Australia
|
16
|
Yura K, Sulaiman S, Hatta Y, Shionyu M, Go M. RESOPS: a database for analyzing the correspondence of RNA editing sites to protein three-dimensional structures. PLANT & CELL PHYSIOLOGY 2009; 50:1865-73. [PMID: 19808808 PMCID: PMC2775959 DOI: 10.1093/pcp/pcp132] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Received: 09/15/2009] [Accepted: 09/24/2009] [Indexed: 05/21/2023]
Abstract
Transcripts from mitochondrial and chloroplast DNA of land plants often undergo cytidine to uridine conversion-type RNA editing events. RESOPS is a newly built database that specializes in displaying RNA editing sites of land plant organelles on protein three-dimensional (3D) structures to help elucidate the mechanisms of RNA editing for gene expression regulation. RESOPS contains the following information: unedited and edited cDNA sequences with notes for the target nucleotides of RNA editing, conceptual translation from the edited cDNA sequence in pseudo-UniProt format, a list of proteins under the influence of RNA editing, multiple amino acid sequence alignments of edited proteins, the location of amino acid residues coded by codons under the influence of RNA editing in protein 3D structures and the statistics of biased distributions of the edited residues with respect to protein structures. Most of the data processing procedures are automated; hence, it is easy to keep abreast of updated genome and protein 3D structural data. In the RESOPS database, we clarified that the locations of residues switched by RNA editing are significantly biased to protein structural cores. The integration of different types of data in the database also helps advance the understanding of RNA editing mechanisms. RESOPS is accessible at http://cib.cf.ocha.ac.jp/RNAEDITING/.
Affiliation(s)
- Kei Yura
- Computational Biology, Graduate School of Humanities and Sciences, Ochanomizu University, 2-1-1 Otsuka, Bunkyo, Tokyo, 112-8610 Japan.
|
17
|
Mower JP. The PREP suite: predictive RNA editors for plant mitochondrial genes, chloroplast genes and user-defined alignments. Nucleic Acids Res 2009; 37:W253-9. [PMID: 19433507 PMCID: PMC2703948 DOI: 10.1093/nar/gkp337] [Citation(s) in RCA: 240] [Impact Index Per Article: 16.0] [Indexed: 11/14/2022] Open
Abstract
RNA editing alters plant mitochondrial and chloroplast transcripts by converting specific cytidines to uridines, which usually results in a change in the amino acid sequence of the translated protein. Systematic studies have experimentally identified sites of RNA editing in organellar transcriptomes from several species, but these analyses have not kept pace with the rate of genome sequencing. The PREP (predictive RNA editors for plants) suite was developed to computationally predict sites of RNA editing based on the well-known principle that editing in plant organelles increases the conservation of proteins across species. The PREP suite provides predictive RNA editors for plant mitochondrial genes (PREP-Mt), for chloroplast genes (PREP-Cp), and for alignments submitted by the user (PREP-Aln). These servers require minimal input, are very fast, and are highly accurate on all seed plants examined to date. PREP-Mt has proved useful in several research studies and the newly developed PREP-Cp and PREP-Aln servers should be of further assistance for analyses that require knowledge of the location of sites of RNA editing. The PREP suite is freely available at http://prep.unl.edu/.
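The conservation principle behind this kind of prediction can be illustrated with a toy sketch. This is not the PREP algorithm itself: it assumes a single ungapped homologous protein of matching length (real tools score against several aligned homologs with a cutoff), writes the coding sequence in DNA letters so that a C-to-U edit appears as C to T, and uses hypothetical names throughout.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def predict_edit_sites(cds, homolog_protein):
    """Return 0-based positions of C's whose C->U change restores the homolog residue."""
    sites = []
    for codon_index in range(len(cds) // 3):
        codon = cds[3 * codon_index: 3 * codon_index + 3]
        target = homolog_protein[codon_index]
        if CODON_TABLE[codon] == target:
            continue  # residue already matches the homolog; no edit needed
        for offset, base in enumerate(codon):
            if base != "C":
                continue
            edited = codon[:offset] + "T" + codon[offset + 1:]
            if CODON_TABLE[edited] == target:
                sites.append(3 * codon_index + offset)
    return sites

# Toy input: genomic codons ATG TCA CCA TAA encode M S P *,
# while the (hypothetical) homolog reads M L L *.
predicted = predict_edit_sites("ATGTCACCATAA", "MLL*")
print(predicted)  # [4, 7]
```

Both predicted positions are second-codon-position cytidines whose editing turns Ser or Pro into Leu, the kind of change that brings the translated protein closer to its homolog.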
Affiliation(s)
- Jeffrey P Mower
- Center for Plant Science Innovation and Department of Agronomy and Horticulture, University of Nebraska, Lincoln, NE 68588, USA.
|
18
|
Du P, Jia L, Li Y. CURE-Chloroplast: a chloroplast C-to-U RNA editing predictor for seed plants. BMC Bioinformatics 2009; 10:135. [PMID: 19422723 PMCID: PMC2688514 DOI: 10.1186/1471-2105-10-135] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Received: 10/23/2008] [Accepted: 05/08/2009] [Indexed: 12/04/2022] Open
Abstract
BACKGROUND RNA editing is a type of post-transcriptional modification of RNA and belongs to the class of mechanisms that contribute to the complexity of transcriptomes. C-to-U RNA editing is commonly observed in plant mitochondria and chloroplasts. The in vivo mechanism of recognizing C-to-U RNA editing sites is still unknown. In recent years, many efforts have been made to computationally predict C-to-U RNA editing sites in the mitochondria of seed plants, but there is still no algorithm available for C-to-U RNA editing site prediction in the chloroplasts of seed plants. RESULTS In this paper, we extend our algorithm CURE, which can accurately predict the C-to-U RNA editing sites in mitochondria, to predict C-to-U RNA editing sites in the chloroplasts of seed plants. The algorithm achieves over 80% sensitivity and over 99% specificity. We implement the algorithm as an online service called CURE-Chloroplast http://bioinfo.au.tsinghua.edu.cn/pure. CONCLUSION CURE-Chloroplast is an online service for predicting the C-to-U RNA editing sites in the chloroplasts of seed plants. The online service allows the processing of entire chloroplast genome sequences. Since CURE-Chloroplast performs very well, it could be a helpful tool in the study of C-to-U RNA editing in the chloroplasts of seed plants.
Affiliation(s)
- Pufeng Du
- MOE Key Laboratory of Bioinformatics and Bioinformatics Div. TNLIST/Department of Automation, Tsinghua University, Beijing 100084, PR China
| | - Liyan Jia
- MOE Key Laboratory of Bioinformatics and Bioinformatics Div. TNLIST/Department of Automation, Tsinghua University, Beijing 100084, PR China
| | - Yanda Li
- MOE Key Laboratory of Bioinformatics and Bioinformatics Div. TNLIST/Department of Automation, Tsinghua University, Beijing 100084, PR China
|
19
|
A Molecular Footprint of Limb Loss: Sequence Variation of the Autopodial Identity Gene Hoxa-13. J Mol Evol 2008; 67:581-93. [DOI: 10.1007/s00239-008-9156-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 01/31/2008] [Accepted: 08/05/2008] [Indexed: 10/21/2022]
|
20
|
Yura K, Miyata Y, Arikawa T, Higuchi M, Sugita M. Characteristics and prediction of RNA editing sites in transcripts of the Moss Takakia lepidozioides chloroplast. DNA Res 2008; 15:309-21. [PMID: 18650260 PMCID: PMC2575889 DOI: 10.1093/dnares/dsn016] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/14/2022] Open
Abstract
RNA editing in land plant organelles is a process primarily involving the conversion of cytidine to uridine in pre-mRNAs. The process is required for gene expression in plant organelles, because this conversion alters the encoded amino acid residues and improves the sequence identity to homologous proteins. A recent study uncovered that proteins encoded in the nuclear genome are essential for editing site recognition in chloroplasts; the mechanisms by which this recognition occurs remain unclear. To understand these mechanisms, we determined the genomic and cDNA sequences of moss Takakia lepidozioides chloroplast genes, then computationally analyzed the sequences within −30 to +10 nucleotides of RNA editing sites (neighbor sequences) likely to be recognized by trans-factors. As the T. lepidozioides chloroplast has many RNA editing sites, the analysis of these sequences provides a unique opportunity to perform statistical analyses of chloroplast RNA editing sites. We divided the 302 obtained neighbor sequences into eight groups based on sequence similarity to identify group-specific patterns. The patterns were then applied to predict novel RNA editing sites in T. lepidozioides transcripts; ∼60% of these predicted sites are true editing sites. The success of this prediction algorithm suggests that the obtained patterns are indicative of key sites recognized by trans-factors around editing sites of T. lepidozioides chloroplast genes.
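The "neighbor sequence" window used in this analysis can be sketched as a simple slice. The snippet below is only an illustration: it assumes 0-based coordinates and that the edited cytidine itself is included, giving 30 + 1 + 10 = 41 nucleotides; the function name and the choice to skip sites too close to a transcript end are assumptions, not details from the paper.

```python
def neighbor_sequence(transcript, site, upstream=30, downstream=10):
    """Return the -30..+10 window around an editing site, or None if it does not fit."""
    start = site - upstream
    end = site + downstream + 1  # +1 so the slice includes the +10 position
    if start < 0 or end > len(transcript):
        return None  # window would run off the transcript
    return transcript[start:end]

# Toy transcript: 30 upstream bases, the edited C, then 10 downstream bases.
toy_transcript = "A" * 30 + "C" + "G" * 10
window = neighbor_sequence(toy_transcript, site=30)
print(len(window))  # 41
```

Windows collected this way for every known site form the input to whatever grouping or pattern-finding step follows.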
Affiliation(s)
- Kei Yura
- Graduate School of Humanities and Sciences, Ochanomizu University, 2-1-1 Otsuka, Bunkyo, Tokyo 112-8610, Japan.
|
21
|
Millar AH, Small ID, Day DA, Whelan J. Mitochondrial biogenesis and function in Arabidopsis. THE ARABIDOPSIS BOOK 2008; 6:e0111. [PMID: 22303236 DOI: 10.1199/tab.0105] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 05/23/2023]
Abstract
Mitochondria represent the powerhouse of cells through their synthesis of ATP. However, understanding the role of mitochondria in the growth and development of plants will rely on a much deeper appreciation of the complexity of this organelle. Arabidopsis research has provided clear identification of mitochondrial components, allowed wide-scale analysis of gene expression, and has aided reverse genetic manipulation to test the impact of mitochondrial component loss on plant function. Forward genetics in Arabidopsis has identified mitochondrial involvement in mutations with notable impacts on plant metabolism, growth and development. Here we consider the evidence for components involved in mitochondria biogenesis, metabolism and signalling to the nucleus.
|
22
|
Millar AH, Small ID, Day DA, Whelan J. Mitochondrial biogenesis and function in Arabidopsis. THE ARABIDOPSIS BOOK 2008; 6:e0111. [PMID: 22303236 PMCID: PMC3243404 DOI: 10.1199/tab.0111] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.7] [Indexed: 05/04/2023]
Affiliation(s)
- A. Harvey Millar
- Australian Research Council (ARC) Centre of Excellence in Plant Energy Biology, The University of Western Australia, 35 Stirling Highway, Crawley, WA 6009
| | - Ian D. Small
- Australian Research Council (ARC) Centre of Excellence in Plant Energy Biology, The University of Western Australia, 35 Stirling Highway, Crawley, WA 6009
| | - David A. Day
- School of Biological Sciences, The University of Sydney 2006, NSW, Australia
| | - James Whelan
- Australian Research Council (ARC) Centre of Excellence in Plant Energy Biology, The University of Western Australia, 35 Stirling Highway, Crawley, WA 6009
|
23
|
|
24
|
Mower JP. Modeling Sites of RNA Editing as a Fifth Nucleotide State Reveals Progressive Loss of Edited Sites from Angiosperm Mitochondria. Mol Biol Evol 2007; 25:52-61. [DOI: 10.1093/molbev/msm226] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.9] [Indexed: 01/20/2023] Open
|
25
|
Mulligan RM, Chang KLC, Chou CC. Computational analysis of RNA editing sites in plant mitochondrial genomes reveals similar information content and a sporadic distribution of editing sites. Mol Biol Evol 2007; 24:1971-81. [PMID: 17591603 DOI: 10.1093/molbev/msm125] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Indexed: 11/13/2022] Open
Abstract
A computational analysis of RNA editing sites was performed on protein-coding sequences of plant mitochondrial genomes from Arabidopsis thaliana, Beta vulgaris, Brassica napus, and Oryza sativa. The distribution of nucleotides around edited and unedited cytidines was compared in 41 nucleotide segments and included 1481 edited cytidines and 21,390 unedited cytidines in the 4 genomes. The distribution of nucleotides was examined in 1, 2, and 3 nucleotide windows by comparison of nucleotide frequency ratios and relative entropy. The relative entropy analyses indicate that information is encoded in the nucleotide sequences in the 5 prime flank (-18 to -14, -13 to -10, -6 to -4, -2/-1) and the immediate 3 prime flanking nucleotide (+1), and these regions may be important in editing site recognition. The relative entropy was large when 2 or 3 nucleotide windows were analyzed, suggesting that several contiguous nucleotides may be involved in editing site recognition. RNA editing sites were frequently preceded by 2 pyrimidines or AU and followed by a guanidine (HYCG) in the monocot and dicot mitochondrial genomes, and rarely preceded by 2 purines. Analysis of chloroplast editing sites from a dicot, Nicotiana tabacum, and a monocot, Zea mays, revealed a similar distribution of nucleotides around editing sites (HYCA). The similarity of this motif around editing sites in monocots and dicots in both mitochondria and chloroplasts suggests that a mechanistic basis for this motif exists that is common in these different organelle and phylogenetic systems. The preferred sequence distribution around RNA editing sites may have an important impact on the acquisition of editing sites in evolution because the immediate sequence context of a cytidine residue may render a cytidine editable or uneditable, and consequently determine whether a T to C mutation at a specific position may be corrected by RNA editing. 
The distribution of editing sites in many protein-coding sequences is shown to be non-random with editing sites clustered in groups separated by regions with no editing sites. The sporadic distribution of editing sites could result from a mechanism of editing site loss by gene conversion utilizing edited sequence information, possibly through an edited cDNA intermediate.
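The relative entropy comparison mentioned in this abstract can be sketched per position as a Kullback-Leibler divergence between nucleotide frequencies around edited and unedited cytidines. The code below is a minimal sketch under stated assumptions: single-nucleotide windows only, an add-one pseudocount to avoid zero frequencies, toy 3-nt windows instead of the 41-nt segments used in the study, and hypothetical function names.

```python
import math
from collections import Counter

def position_frequencies(windows, position, alphabet="ACGU", pseudocount=1.0):
    """Smoothed nucleotide frequencies at one position across aligned windows."""
    counts = Counter(w[position] for w in windows)
    total = len(windows) + pseudocount * len(alphabet)
    return {b: (counts[b] + pseudocount) / total for b in alphabet}

def relative_entropy(edited, unedited, position):
    """KL divergence (in bits) of edited vs. unedited frequencies at a position."""
    p = position_frequencies(edited, position)
    q = position_frequencies(unedited, position)
    return sum(p[b] * math.log2(p[b] / q[b]) for b in p)

# Toy aligned windows: position 1 is the cytidine in both sets.
edited_windows = ["UCG", "CCG", "UCG", "UCA"]
unedited_windows = ["ACU", "GCA", "ACC", "GCG"]

upstream_bits = relative_entropy(edited_windows, unedited_windows, 0)
center_bits = relative_entropy(edited_windows, unedited_windows, 1)
```

In this toy set the central position carries no information because it is C in every window, whereas the upstream position differs between the two sets and therefore has positive relative entropy.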
Affiliation(s)
- R Michael Mulligan
- Department of Developmental and Cell Biology, University of California, Irvine, USA.
|
26
|
Strobl C, Boulesteix AL, Zeileis A, Hothorn T. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics 2007; 8:25. [PMID: 17254353 PMCID: PMC1796903 DOI: 10.1186/1471-2105-8-25] [Citation(s) in RCA: 1173] [Impact Index Per Article: 69.0] [Received: 09/18/2006] [Accepted: 01/25/2007] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories. RESULTS Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand. CONCLUSION We propose to employ an alternative implementation of random forests, that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. 
The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore the suggested method can be applied straightforwardly by scientists in bioinformatics research.
Affiliation(s)
- Carolin Strobl
- Institut für Statistik, Ludwig-Maximilians-Universität München, Ludwigstr. 33, 80539 München, Germany
| | - Anne-Laure Boulesteix
- Institut für medizinische Statistik und Epidemiologie, Technische Universität München, Ismaningerstr. 22, 81675 München, Germany
| | - Achim Zeileis
- Department für Statistik und Mathematik, Wirtschaftsuniversität Wien, Augasse 2-6, 1090 Wien, Austria
| | - Torsten Hothorn
- Institut für Medizininformatik, Biometrie und Epidemiologie, Friedrich-Alexander-Universität Erlangen-Nürnberg, Waldstr. 6, D-91054 Erlangen, Germany
|
27
|
Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics 2007. [PMID: 17254353 DOI: 10.1186/1471-2105-8-25] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022] Open
|
28
|
Thompson J, Gopal S. Correction: genetic algorithm learning as a robust approach to RNA editing site prediction. BMC Bioinformatics 2006; 7:406. [PMID: 16956416 PMCID: PMC1569880 DOI: 10.1186/1471-2105-7-406] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 08/16/2006] [Accepted: 09/06/2006] [Indexed: 11/10/2022] Open
Abstract
After the publication of [1], we were alerted to an error in our data. The error was a one-off miscalculation in the extraction of position information for our set of true negatives. Our data set should have used randomly selected non-edited cytosines (C) as true negatives, but the data generation phase resulted in a set of nucleotides that were each one nucleotide downstream of known, unedited cytosines. The consequences of this error are reflected in changes to our results, although the general conclusions presented in our original publication remain largely unchanged.
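The kind of positional slip described in this correction is easy to reproduce in a toy setting. The sketch below is purely illustrative and is not the authors' pipeline: the sequence, the edited site, and all names are hypothetical, and it simply contrasts correctly extracted unedited cytosines with the bases one position downstream.

```python
def unedited_cytosine_positions(seq, edited_sites):
    """0-based positions of cytosines that are not known editing sites."""
    return [i for i, base in enumerate(seq) if base == "C" and i not in edited_sites]

toy_seq = "ACGUCCAUCG"
positions = unedited_cytosine_positions(toy_seq, edited_sites={4})

# Intended negatives: the unedited cytosines themselves.
correct_negatives = [toy_seq[i] for i in positions]

# Shifted extraction: each index lands one nucleotide downstream,
# so the "negatives" are no longer cytosines at all.
shifted_negatives = [toy_seq[i + 1] for i in positions if i + 1 < len(toy_seq)]

print(correct_negatives)  # ['C', 'C', 'C']
print(shifted_negatives)  # ['G', 'A', 'G']
```

A simple check that every extracted negative is a C would catch this class of error before training.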
Affiliation(s)
- James Thompson
- Department of Biological Sciences, Rochester Institute of Technology, Rochester, NY 14623, USA
| | - Shuba Gopal
- Department of Biological Sciences, Rochester Institute of Technology, Rochester, NY 14623, USA
|
29
|
Thompson J, Gopal S. Genetic algorithm learning as a robust approach to RNA editing site prediction. BMC Bioinformatics 2006; 7:145. [PMID: 16542417 PMCID: PMC1459874 DOI: 10.1186/1471-2105-7-145] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 09/15/2005] [Accepted: 03/16/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND RNA editing is one of several post-transcriptional modifications that may contribute to organismal complexity in the face of limited gene complement in a genome. One form, known as C --> U editing, appears to exist in a wide range of organisms, but most instances of this form of RNA editing have been discovered serendipitously. With the large amount of genomic and transcriptomic data now available, a computational analysis could provide a more rapid means of identifying novel sites of C --> U RNA editing. Previous efforts have had some success but also some limitations. We present a computational method for identifying C --> U RNA editing sites in genomic sequences that is both robust and generalizable. We evaluate its potential use on the best data set available for these purposes: C --> U editing sites in plant mitochondrial genomes. RESULTS Our method is derived from a machine learning approach known as a genetic algorithm. REGAL (RNA Editing site prediction by Genetic Algorithm Learning) is 87% accurate when tested on three mitochondrial genomes, with an overall sensitivity of 82% and an overall specificity of 91%. REGAL's performance significantly improves on other ab initio approaches to predicting RNA editing sites in this data set. REGAL has a comparable sensitivity and higher specificity than approaches which rely on sequence homology, and it has the advantage that strong sequence conservation is not required for reliable prediction of edit sites. CONCLUSION Our results suggest that ab initio methods can generate robust classifiers of putative edit sites, and we highlight the value of combinatorial approaches as embodied by genetic algorithms. We present REGAL as one approach with the potential to be generalized to other organisms exhibiting C --> U RNA editing.
Affiliation(s)
- James Thompson
- Department of Biological Sciences, Rochester Institute of Technology, Rochester, NY 14623, USA
| | - Shuba Gopal
- Department of Biological Sciences, Rochester Institute of Technology, Rochester, NY 14623, USA
|
30
|
Mower JP. PREP-Mt: predictive RNA editor for plant mitochondrial genes. BMC Bioinformatics 2005; 6:96. [PMID: 15826309 PMCID: PMC1087475 DOI: 10.1186/1471-2105-6-96] [Citation(s) in RCA: 90] [Impact Index Per Article: 4.7] [Received: 02/10/2005] [Accepted: 04/12/2005] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In plants, RNA editing is a process that converts specific cytidines to uridines and uridines to cytidines in transcripts from virtually all mitochondrial protein-coding genes. There are thousands of plant mitochondrial genes in the sequence databases, but sites of RNA editing have not been determined for most. Accurate methods of RNA editing site prediction will be important in filling in this information gap and could reduce or even eliminate the need for experimental determination of editing sites for many sequences. Because RNA editing tends to increase protein conservation across species by "correcting" codons that specify unconserved amino acids, this principle can be used to predict editing sites by identifying positions where an RNA editing event would increase the conservation of a protein to homologues from other plants. PREP-Mt takes this approach to predict editing sites for any protein-coding gene in plant mitochondria. RESULTS To test the general applicability of the PREP-Mt methodology, RNA editing sites were predicted for 370 full-length or nearly full-length DNA sequences and then compared to the known sites of RNA editing for these sequences. Of 60,263 cytidines in this test set, PREP-Mt correctly classified 58,994 as either an edited or unedited site (accuracy = 97.9%). PREP-Mt properly identified 3,038 of the 3,698 known sites of RNA editing (sensitivity = 82.2%) and 55,956 of the 56,565 known unedited sites (specificity = 98.9%). Accuracy and sensitivity increased to 98.7% and 94.7%, respectively, after excluding the 489 silent editing sites (which have no effect on protein sequence or function) from the test set. CONCLUSION These results indicate that PREP-Mt is effective at identifying C to U RNA editing sites in plant mitochondrial protein-coding genes. Thus, PREP-Mt should be useful in predicting protein sequences for use in molecular, biochemical, and phylogenetic analyses. 
In addition, PREP-Mt could be used to determine functionality of a mitochondrial gene or to identify particular sequences with unusual editing properties. The PREP-Mt methodology should be applicable to any system where RNA editing increases protein conservation across species.
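The accuracy, sensitivity, and specificity figures quoted in this abstract follow directly from the reported counts. The short check below recomputes them; the counts are taken from the abstract, while the variable names are only illustrative.

```python
# Counts reported in the abstract.
true_positives = 3038    # known editing sites predicted as edited
known_edited = 3698
true_negatives = 55956   # known unedited cytidines predicted as unedited
known_unedited = 56565
total_cytidines = known_edited + known_unedited  # 60,263

sensitivity = true_positives / known_edited
specificity = true_negatives / known_unedited
accuracy = (true_positives + true_negatives) / total_cytidines

summary = f"{accuracy:.1%} {sensitivity:.1%} {specificity:.1%}"
print(summary)  # 97.9% 82.2% 98.9%
```

Because unedited cytidines outnumber edited ones by roughly 15 to 1, overall accuracy is dominated by specificity, which is why sensitivity is worth reporting separately.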
Affiliation(s)
- Jeffrey P Mower
- Department of Biology, Indiana University, Bloomington, IN 47405, USA.
|
31
|
Cummings MP, Segal MR. Few amino acid positions in rpoB are associated with most of the rifampin resistance in Mycobacterium tuberculosis. BMC Bioinformatics 2004; 5:137. [PMID: 15453919 PMCID: PMC524371 DOI: 10.1186/1471-2105-5-137] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Received: 04/15/2004] [Accepted: 09/28/2004] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Mutations in rpoB, the gene encoding the beta subunit of DNA-dependent RNA polymerase, are associated with rifampin resistance in Mycobacterium tuberculosis. Several studies have been conducted where minimum inhibitory concentration (MIC, which is defined as the minimum concentration of the antibiotic in a given culture medium below which bacterial growth is not inhibited) of rifampin has been measured and partial DNA sequences have been determined for rpoB in different isolates of M. tuberculosis. However, no model has been constructed to predict rifampin resistance based on sequence information alone. Such a model might provide the basis for quantifying rifampin resistance status based exclusively on DNA sequence data and thus eliminate the requirements for time consuming culturing and antibiotic testing of clinical isolates. RESULTS Sequence data for amino acid positions 511-533 of rpoB and associated MIC of rifampin for different isolates of M. tuberculosis were taken from studies examining rifampin resistance in clinical samples from New York City and throughout Japan. We used tree-based statistical methods and random forests to generate models of the relationships between rpoB amino acid sequence and rifampin resistance. The proportion of variance explained by a relatively simple tree-based cross-validated regression model involving two amino acid positions (526 and 531) is 0.679. The first partition in the data, based on position 531, results in groups that differ one hundredfold in mean MIC (1.596 micrograms/ml and 159.676 micrograms/ml). The subsequent partition based on position 526, the most variable in this region, results in a > 354-fold difference in MIC. When considered as a classification problem (susceptible or resistant), a cross-validated tree-based model correctly classified most (0.884) of the observations and was very similar to the regression model. 
Random forest analysis of the MIC data as a continuous variable, a regression problem, produced a model that explained 0.861 of the variance. The random forest analysis of the MIC data as discrete classes produced a model that correctly classified 0.942 of the observations with sensitivity of 0.958 and specificity of 0.885. CONCLUSIONS Highly accurate regression and classification models of rifampin resistance can be made based on this short sequence region. Models may be better with improved (and consistent) measurements of MIC and more sequence data.
Affiliation(s)
- Michael P Cummings
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742-3360, USA
| | - Mark R Segal
- Department of Epidemiology and Biostatistics, University of California, San Francisco, CA 94143-0560, USA
|