1
Backofen R, Gorodkin J, Hofacker IL, Stadler PF. Comparative RNA Genomics. Methods Mol Biol 2024; 2802:347-393. [PMID: 38819565 DOI: 10.1007/978-1-0716-3838-5_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2024]
Abstract
Over the last quarter of a century it has become clear that RNA is much more than just a boring intermediate in protein expression. Ancient RNAs still appear in the core information metabolism and comprise a surprisingly large component in bacterial gene regulation. A common theme with these types of mostly small RNAs is their reliance on conserved secondary structures. Large-scale sequencing projects, on the other hand, have profoundly changed our understanding of eukaryotic genomes. Pervasively transcribed, they give rise to a plethora of large and evolutionarily extremely flexible non-coding RNAs that exert a vastly diverse array of molecular functions. In this chapter we provide a necessarily incomplete overview of the current state of comparative analysis of non-coding RNAs, emphasizing computational approaches as a means to gain a global picture of the modern RNA world.
Affiliation(s)
- Rolf Backofen
  - Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
  - Center for Non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark
- Jan Gorodkin
  - Center for Non-coding RNA in Technology and Health, Department of Veterinary and Animal Sciences, University of Copenhagen, Frederiksberg, Denmark
- Ivo L Hofacker
  - Institute for Theoretical Chemistry, University of Vienna, Wien, Austria
  - Bioinformatics and Computational Biology research group, University of Vienna, Vienna, Austria
  - Center for Non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark
- Peter F Stadler
  - Bioinformatics Group, Department of Computer Science, University of Leipzig, Leipzig, Germany
  - Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany
  - Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
  - Universidad Nacional de Colombia, Bogotá, Colombia
  - Institute for Theoretical Chemistry, University of Vienna, Wien, Austria
  - Center for Non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark
  - Santa Fe Institute, Santa Fe, NM, USA
2
Proft S, Leiz J, Heinemann U, Seelow D, Schmidt-Ott KM, Rutkiewicz M. Discovery of a non-canonical GRHL1 binding site using deep convolutional and recurrent neural networks. BMC Genomics 2023; 24:736. [PMID: 38049725 PMCID: PMC10696883 DOI: 10.1186/s12864-023-09830-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Accepted: 11/22/2023] [Indexed: 12/06/2023] [Open access]
Abstract
BACKGROUND Transcription factors regulate gene expression by binding to transcription factor binding sites (TFBSs). Most models for predicting TFBSs are based on position weight matrices (PWMs), which require a specific motif to be present in the DNA sequence and do not consider interdependencies of nucleotides. Novel approaches such as Transcription Factor Flexible Models or recurrent neural networks consequently provide higher accuracies. However, it is unclear whether such approaches can uncover novel non-canonical, hitherto unexpected TFBSs relevant to human transcriptional regulation. RESULTS In this study, we trained a convolutional recurrent neural network with HT-SELEX data for GRHL1 binding and applied it to a set of GRHL1 binding sites obtained from ChIP-Seq experiments from human cells. We identified 46 non-canonical GRHL1 binding sites, which were not found by a conventional PWM approach. Unexpectedly, some of the newly predicted binding sequences lacked the CNNG core motif, so far considered obligatory for GRHL1 binding. Using isothermal titration calorimetry, we experimentally confirmed binding between the GRHL1-DNA binding domain and predicted GRHL1 binding sites, including a non-canonical GRHL1 binding site. Mutagenesis of individual nucleotides revealed a correlation between predicted binding strength and experimentally validated binding affinity across representative sequences. This correlation was neither observed with a PWM-based nor another deep learning approach. CONCLUSIONS Our results show that convolutional recurrent neural networks may uncover unanticipated binding sites and facilitate quantitative transcription factor binding predictions.
Affiliation(s)
- Sebastian Proft
  - Exploratory Diagnostic Sciences, Berlin Institute of Health, Charité - Universitätsmedizin Berlin, 10117, Berlin, Germany
  - Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, 13353, Berlin, Germany
- Janna Leiz
  - Department of Nephrology and Hypertension, Hannover Medical School, 30625, Hannover, Germany
  - Department of Nephrology and Intensive Care Medicine, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, 12203, Berlin, Germany
  - Molecular and Translational Kidney Research, Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, 13125, Berlin, Germany
- Udo Heinemann
  - Macromolecular Structure and Interaction, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, 13125, Berlin, Germany
- Dominik Seelow
  - Exploratory Diagnostic Sciences, Berlin Institute of Health, Charité - Universitätsmedizin Berlin, 10117, Berlin, Germany
  - Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, 13353, Berlin, Germany
- Kai M Schmidt-Ott
  - Department of Nephrology and Hypertension, Hannover Medical School, 30625, Hannover, Germany
  - Department of Nephrology and Intensive Care Medicine, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, 12203, Berlin, Germany
  - Molecular and Translational Kidney Research, Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, 13125, Berlin, Germany
- Maria Rutkiewicz
  - Macromolecular Structure and Interaction, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, 13125, Berlin, Germany
  - Department of Structural Biology of Eukaryotes, Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznań, 61-704, Poland
3
Fremin BJ, Bhatt AS, Kyrpides NC. Identification of over ten thousand candidate structured RNAs in viruses and phages. Comput Struct Biotechnol J 2023; 21:5630-5639. [PMID: 38047235 PMCID: PMC10690425 DOI: 10.1016/j.csbj.2023.11.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Revised: 11/03/2023] [Accepted: 11/03/2023] [Indexed: 12/05/2023] [Open access]
Abstract
Structured RNAs play crucial roles in viruses, exerting influence over both viral and host gene expression. However, the extensive diversity of structured RNAs and their ability to act in cis or trans positions pose challenges for predicting and assigning their functions. While comparative genomics approaches have successfully predicted candidate structured RNAs in microbes on a large scale, similar efforts for viruses have been lacking. In this study, we screened over 5 million DNA and RNA viral sequences, resulting in the prediction of 10,006 novel candidate structured RNAs. These predictions are widely distributed across taxonomy and ecosystem. We found transcriptional evidence for 206 of these candidate structured RNAs in the human fecal microbiome. These candidate RNAs exhibited evidence of nucleotide covariation, indicative of selective pressure maintaining the predicted secondary structures. Our analysis revealed a diverse repertoire of candidate structured RNAs, encompassing a substantial number of putative tRNAs or tRNA-like structures, Rho-independent transcription terminators, and potentially cis-regulatory structures consistently positioned upstream of genes. In summary, our findings shed light on the extensive diversity of structured RNAs in viruses, offering a valuable resource for further investigations into their functional roles and implications in viral gene expression and pave the way for a deeper understanding of the intricate interplay between viruses and their hosts at the molecular level.
Affiliation(s)
- Brayon J. Fremin
  - Department of Energy, Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Ami S. Bhatt
  - Department of Medicine (Hematology, Blood and Marrow Transplantation) and Genetics, Stanford University, Stanford, CA, USA
  - Department of Genetics, Stanford University, Stanford, CA, USA
- Nikos C. Kyrpides
  - Department of Energy, Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
  - Lead contact
4
Klapproth C, Zötzsche S, Kühnl F, Fallmann J, Stadler P, Findeiß S. Tailored machine learning models for functional RNA detection in genome-wide screens. NAR Genom Bioinform 2023; 5:lqad072. [PMID: 37608800 PMCID: PMC10440787 DOI: 10.1093/nargab/lqad072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2022] [Revised: 06/28/2023] [Accepted: 07/30/2023] [Indexed: 08/24/2023] [Open access]
Abstract
The in silico prediction of non-coding and protein-coding genetic loci has received considerable attention in comparative genomics aiming in particular at the identification of properties of nucleotide sequences that are informative of their biological role in the cell. We present here a software framework for the alignment-based training, evaluation and application of machine learning models with user-defined parameters. Instead of focusing on the one-size-fits-all approach of pervasive in silico annotation pipelines, we offer a framework for the structured generation and evaluation of models based on arbitrary features and input data, focusing on stable and explainable results. Furthermore, we showcase the usage of our software package in a full-genome screen of Drosophila melanogaster and evaluate our results against the well-known but much less flexible program RNAz.
Affiliation(s)
- Christopher Klapproth
  - Leipzig University, Department of Computer Science and Interdisciplinary Center of Bioinformatics, Bioinformatics Group, Härtelstrasse 16-18, D-04107 Leipzig, Germany
  - ScaDS.AI Leipzig (Center for Scalable Data Analytics and Artificial Intelligence), Humboldtstraße 25, D-04105 Leipzig, Germany
- Siegfried Zötzsche
  - Leipzig University, Department of Computer Science and Interdisciplinary Center of Bioinformatics, Bioinformatics Group, Härtelstrasse 16-18, D-04107 Leipzig, Germany
- Felix Kühnl
  - Leipzig University, Department of Computer Science and Interdisciplinary Center of Bioinformatics, Bioinformatics Group, Härtelstrasse 16-18, D-04107 Leipzig, Germany
- Jörg Fallmann
  - Leipzig University, Department of Computer Science and Interdisciplinary Center of Bioinformatics, Bioinformatics Group, Härtelstrasse 16-18, D-04107 Leipzig, Germany
- Peter F Stadler
  - Leipzig University, Department of Computer Science and Interdisciplinary Center of Bioinformatics, Bioinformatics Group, Härtelstrasse 16-18, D-04107 Leipzig, Germany
  - Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany
  - University of Vienna, Institute for Theoretical Chemistry, Währingerstraße 17, A-1090 Vienna, Austria
  - Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA
  - Universidad Nacional de Colombia, Facultad de Ciencias, Bogotá, D.C., Colombia
- Sven Findeiß
  - Leipzig University, Department of Computer Science and Interdisciplinary Center of Bioinformatics, Bioinformatics Group, Härtelstrasse 16-18, D-04107 Leipzig, Germany
5
Walter Costa MB. Evolutionary Conservation of RNA Secondary Structure. Methods Mol Biol 2023; 2586:121-146. [PMID: 36705902 DOI: 10.1007/978-1-0716-2768-6_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2023]
Abstract
Noncoding RNAs, ncRNAs, naturally fold into structures, which allow them to perform their functions in the cell. Evolutionarily close species share structures and functions. This occurs because of shared selective pressures, resulting in conserved groups. Previous efforts in finding functional RNAs have been made in detecting conserved structures in genomes or alignments. It may occur that, within a conserved group, species-specific structures arise after species split due to positive selection. Detecting positive selection in ncRNAs is a hard problem in biology as well as bioinformatics. To detect positive selection, one should find species-specific structures within a conserved set. This chapter provides protocols to detect and analyze positive selection in ncRNA structures with the SSS-test and other free software.
Affiliation(s)
- Maria Beatriz Walter Costa
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany
  - Institute of Laboratory Medicine, Clinical Chemistry and Molecular Diagnostics, University of Leipzig Medical Center, Leipzig, Germany
6
Andrews RJ, Rouse WB, O’Leary CA, Booher NJ, Moss WN. ScanFold 2.0: a rapid approach for identifying potential structured RNA targets in genomes and transcriptomes. PeerJ 2022; 10:e14361. [PMID: 36389431 PMCID: PMC9651051 DOI: 10.7717/peerj.14361] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 07/27/2022] [Accepted: 10/18/2022] [Indexed: 11/11/2022] [Open access]
Abstract
A major limiting factor in target discovery for both basic research and therapeutic intervention is the identification of structural and/or functional RNA elements in genomes and transcriptomes. This was the impetus for the original ScanFold algorithm, which provides maps of local RNA structural stability, evidence of sequence-ordered (potentially evolved) structure, and unique model structures comprised of recurring base pairs with the greatest structural bias. A key step in quantifying this propensity for ordered structure is the prediction of secondary structural stability for randomized sequences which, in the original implementation of ScanFold, is explicitly evaluated. This slow process has limited the rapid identification of ordered structures in large genomes/transcriptomes, which we seek to overcome in this current work introducing ScanFold 2.0. In this revised version of ScanFold, we no longer explicitly evaluate randomized sequence folding energy, but rather estimate it using a machine learning approach. For high randomization numbers, this can increase prediction speeds over 100-fold compared to ScanFold 1.0, allowing for the analysis of large sequences, as well as the use of additional folding algorithms that may be computationally expensive. In the testing of ScanFold 2.0, we re-evaluate the Zika, HIV, and SARS-CoV-2 genomes and compare both the consistency of results and the time of each run to ScanFold 1.0. We also re-evaluate the SARS-CoV-2 genome to assess the quality of ScanFold 2.0 predictions vs several biochemical structure probing datasets and compare the results to those of the original ScanFold program.
Affiliation(s)
- Ryan J. Andrews
  - Department of Biochemistry, University of Utah, Salt Lake City, UT, United States
- Warren B. Rouse
  - The Roy J Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa, United States
- Collin A. O’Leary
  - The Roy J Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa, United States
- Nicholas J. Booher
  - Infrastructure and Research IT Services, Iowa State University, Ames, IA, United States
- Walter N. Moss
  - The Roy J Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa, United States
7
Fremin BJ, Bhatt AS. Comparative genomics identifies thousands of candidate structured RNAs in human microbiomes. Genome Biol 2021; 22:100. [PMID: 33845850 PMCID: PMC8040213 DOI: 10.1186/s13059-021-02319-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/27/2020] [Accepted: 03/19/2021] [Indexed: 02/02/2023] [Open access]
Abstract
BACKGROUND Structured RNAs play varied bioregulatory roles within microbes. To date, hundreds of candidate structured RNAs have been predicted using informatic approaches that search for motif structures in genomic sequence data. The human microbiome contains thousands of species and strains of microbes. Yet, much of the metagenomic data from the human microbiome remains unmined for structured RNA motifs primarily due to computational limitations. RESULTS We sought to apply a large-scale, comparative genomics approach to these organisms to identify candidate structured RNAs. With a carefully constructed, though computationally intensive automated analysis, we identify 3161 conserved candidate structured RNAs in intergenic regions, as well as 2022 additional candidate structured RNAs that may overlap coding regions. We validate the RNA expression of 177 of these candidate structures by analyzing small fragment RNA-seq data from four human fecal samples. CONCLUSIONS This approach identifies a wide variety of candidate structured RNAs, including tmRNAs, antitoxins, and likely ribosome protein leaders, from a wide variety of taxa. Overall, our pipeline enables conservative predictions of thousands of novel candidate structured RNAs from human microbiomes.
Affiliation(s)
- Brayon J Fremin
  - Department of Genetics, Stanford University, Stanford, CA, 94305, USA
- Ami S Bhatt
  - Department of Genetics, Stanford University, Stanford, CA, 94305, USA
  - Department of Medicine (Hematology), Stanford University, Stanford, CA, 94305, USA
8
Krützfeldt LM, Schubach M, Kircher M. The impact of different negative training data on regulatory sequence predictions. PLoS One 2020; 15:e0237412. [PMID: 33259518 PMCID: PMC7707526 DOI: 10.1371/journal.pone.0237412] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/24/2020] [Accepted: 11/12/2020] [Indexed: 01/08/2023] [Open access]
Abstract
Regulatory regions, like promoters and enhancers, cover an estimated 5–15% of the human genome. Changes to these sequences are thought to underlie much of human phenotypic variation and a substantial proportion of genetic causes of disease. However, our understanding of their functional encoding in DNA is still very limited. Applying machine or deep learning methods can shed light on this encoding and gapped k-mer support vector machines (gkm-SVMs) or convolutional neural networks (CNNs) are commonly trained on putative regulatory sequences. Here, we investigate the impact of negative sequence selection on model performance. By training gkm-SVM and CNN models on open chromatin data and corresponding negative training dataset, both learners and two approaches for negative training data are compared. Negative sets use either genomic background sequences or sequence shuffles of the positive sequences. Model performance was evaluated on three different tasks: predicting elements active in a cell-type, predicting cell-type specific elements, and predicting elements' relative activity as measured from independent experimental data. Our results indicate strong effects of the negative training data, with genomic backgrounds showing overall best results. Specifically, models trained on highly shuffled sequences perform worse on the complex tasks of tissue-specific activity and quantitative activity prediction, and seem to learn features of artificial sequences rather than regulatory activity. Further, we observe that insufficient matching of genomic background sequences results in model biases. While CNNs achieved and exceeded the performance of gkm-SVMs for larger training datasets, gkm-SVMs gave robust and best results for typical training dataset sizes without the need of hyperparameter optimization.
Affiliation(s)
- Louisa-Marie Krützfeldt
  - Charité–Universitätsmedizin Berlin, Berlin, Germany
  - Berlin Institute of Health (BIH), Berlin, Germany
- Max Schubach
  - Charité–Universitätsmedizin Berlin, Berlin, Germany
  - Berlin Institute of Health (BIH), Berlin, Germany
- Martin Kircher
  - Charité–Universitätsmedizin Berlin, Berlin, Germany
  - Berlin Institute of Health (BIH), Berlin, Germany
9
Nowick K, Walter Costa MB, Höner Zu Siederdissen C, Stadler PF. Selection Pressures on RNA Sequences and Structures. Evol Bioinform Online 2019; 15:1176934319871919. [PMID: 31496634 PMCID: PMC6716170 DOI: 10.1177/1176934319871919] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 07/21/2019] [Accepted: 07/29/2019] [Indexed: 12/31/2022] [Open access]
Abstract
With the discovery of increasingly more functional noncoding RNAs (ncRNAs), it becomes imperative to consider them more seriously as important players in species evolution. Although tests for negative selection on ncRNAs have existed since the beginning of this century, the SSS-test is the first to also investigate positive selection. When analyzing selection in ncRNAs, it should be taken into account that selection pressures can act independently on sequence and structure. We applied the SSS-test to explore the evolution of ncRNAs in primates and identified more than 100 long noncoding RNAs (lncRNAs) that might evolve under positive selection in humans. With this test, it is now possible to include ncRNAs more thoroughly in evolutionary studies.
Affiliation(s)
- Katja Nowick
  - Human Biology Group, Institute for Biology, Department of Biology, Chemistry, Pharmacy, Freie Universität Berlin, Berlin, Germany
- Christian Höner zu Siederdissen
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Leipzig, Germany
- Peter F Stadler
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Leipzig, Germany
  - Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
  - Department of Theoretical Chemistry, Universität Wien, Wien, Austria
  - Facultad de Ciencias, Universidad Nacional de Colombia, Bogotá, Colombia
  - Santa Fe Institute, Santa Fe, NM, USA
10
Walter Costa MB, Höner zu Siederdissen C, Dunjić M, Stadler PF, Nowick K. SSS-test: a novel test for detecting positive selection on RNA secondary structure. BMC Bioinformatics 2019; 20:151. [PMID: 30898084 PMCID: PMC6429701 DOI: 10.1186/s12859-019-2711-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 11/19/2018] [Accepted: 03/03/2019] [Indexed: 12/23/2022] [Open access]
Abstract
BACKGROUND Long non-coding RNAs (lncRNAs) play an important role in regulating gene expression and are thus important for determining phenotypes. Most attempts to measure selection in lncRNAs have focused on the primary sequence. The majority of small RNAs and at least some parts of lncRNAs must fold into specific structures to perform their biological function. Comprehensive assessments of selection acting on RNAs therefore must also encompass structure. Selection pressures acting on the structure of non-coding genes can be detected within multiple sequence alignments. Approaches of this type, however, have so far focused on negative selection. Thus, a computational method for identifying ncRNAs under positive selection is needed. RESULTS We introduce the SSS-test (test for Selection on Secondary Structure) to identify positive selection and thus adaptive evolution. Benchmarks with biological as well as synthetic controls yield coherent signals for both negative and positive selection, demonstrating the functionality of the test. A survey of a lncRNA collection comprising 15,443 families resulted in 110 candidates that appear to be under positive selection in human. In 26 lncRNAs that have been associated with psychiatric disorders we identified local structures that have signs of positive selection in the human lineage. CONCLUSIONS It is feasible to assay positive selection acting on RNA secondary structures on a genome-wide scale. The detection of human-specific positive selection in lncRNAs associated with cognitive disorder provides a set of candidate genes for further experimental testing and may provide insights into the evolution of cognitive abilities in humans. AVAILABILITY The SSS-test and related software is available at: https://github.com/waltercostamb/SSS-test . The databases used in this work are available at: http://www.bioinf.uni-leipzig.de/Software/SSS-test/ .
Affiliation(s)
- Maria Beatriz Walter Costa
  - Embrapa Agroenergia, Parque Estação Biológica (PqEB), Asa Norte, Brasília, DF, 70770-901 Brazil
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16–18, Leipzig, 04107 Germany
- Christian Höner zu Siederdissen
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16–18, Leipzig, 04107 Germany
- Marko Dunjić
  - Human Biology Group, Institute for Biology, Department of Biology, Chemistry, Pharmacy, Freie Universitaet Berlin, Königin-Luise-Straße 1-3, Berlin, 14195 Germany
  - Center for Human Molecular Genetics, Faculty of Biology, University of Belgrade, Studentski trg 16, PO box 43, Belgrade, 11000 Serbia
- Peter F. Stadler
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16–18, Leipzig, 04107 Germany
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig & Competence Center for Scalable Data Services and Solutions Dresden-Leipzig & Leipzig Research Center for Civilization Diseases, University Leipzig, Leipzig, 04107 Germany
  - Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, Leipzig, 04103 Germany
  - Department of Theoretical Chemistry, University of Vienna, Währinger Straße 17, Vienna, A-1090 Austria
  - Center for non-coding RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
  - Facultad de Ciencias, Universidad Nacional de Colombia, Sede Bogotá, Ciudad Universitaria, Bogotá, D.C., COL-111321 Colombia
  - Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501 USA
- Katja Nowick
  - Human Biology Group, Institute for Biology, Department of Biology, Chemistry, Pharmacy, Freie Universitaet Berlin, Königin-Luise-Straße 1-3, Berlin, 14195 Germany
  - TFome Research Group, Bioinformatics Group, Interdisciplinary Center of Bioinformatics, Department of Computer Science, University of Leipzig, Härtelstraße 16-18, Leipzig, 04107 Germany
  - Paul-Flechsig-Institute for Brain Research, University of Leipzig, Liebigstraße 19. Haus C, Leipzig, 04103 Germany
  - Bioinformatics, Faculty of Agricultural Sciences, Institute of Animal Science, University of Hohenheim, Garbenstraße 13, Stuttgart, 70593 Germany
11
Turner AW, Wong D, Khan MD, Dreisbach CN, Palmore M, Miller CL. Multi-Omics Approaches to Study Long Non-coding RNA Function in Atherosclerosis. Front Cardiovasc Med 2019; 6:9. [PMID: 30838214 PMCID: PMC6389617 DOI: 10.3389/fcvm.2019.00009] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 11/03/2018] [Accepted: 01/30/2019] [Indexed: 12/15/2022] [Open access]
Abstract
Atherosclerosis is a complex inflammatory disease of the vessel wall involving the interplay of multiple cell types including vascular smooth muscle cells, endothelial cells, and macrophages. Large-scale genome-wide association studies (GWAS) and the advancement of next generation sequencing technologies have rapidly expanded the number of long non-coding RNA (lncRNA) transcripts predicted to play critical roles in the pathogenesis of the disease. In this review, we highlight several lncRNAs whose functional role in atherosclerosis is well-documented through traditional biochemical approaches as well as those identified through RNA-sequencing and other high-throughput assays. We describe novel genomics approaches to study both evolutionarily conserved and divergent lncRNA functions and interactions with DNA, RNA, and proteins. We also highlight assays to resolve the complex spatial and temporal regulation of lncRNAs. Finally, we summarize the latest suite of computational tools designed to improve genomic and functional annotation of these transcripts in the human genome. Deep characterization of lncRNAs is fundamental to unravel coronary atherosclerosis and other cardiovascular diseases, as these regulatory molecules represent a new class of potential therapeutic targets and/or diagnostic markers to mitigate both genetic and environmental risk factors.
Affiliation(s)
- Adam W. Turner
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
| | - Doris Wong
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
- Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, United States
| | - Mohammad Daud Khan
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
| | - Caitlin N. Dreisbach
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
- School of Nursing, University of Virginia, Charlottesville, VA, United States
- Data Science Institute, University of Virginia, Charlottesville, VA, United States
| | - Meredith Palmore
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
| | - Clint L. Miller
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
- Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, United States
- Data Science Institute, University of Virginia, Charlottesville, VA, United States
- Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, United States
- Department of Public Health Sciences, University of Virginia, Charlottesville, VA, United States
|
12
|
Andrews RJ, Roche J, Moss WN. ScanFold: an approach for genome-wide discovery of local RNA structural elements-applications to Zika virus and HIV. PeerJ 2018; 6:e6136. [PMID: 30627482 PMCID: PMC6317755 DOI: 10.7717/peerj.6136] [Citation(s) in RCA: 51] [Impact Index Per Article: 8.5] [Received: 08/08/2018] [Accepted: 11/15/2018] [Indexed: 12/24/2022] Open
Abstract
In addition to encoding RNA primary structures, genomes also encode RNA secondary and tertiary structures that play roles in gene regulation and, in the case of RNA viruses, genome replication. Methods for the identification of functional RNA structures in genomes typically rely on scanning analysis windows, where multiple partially-overlapping windows are used to predict RNA structures and folding metrics to deduce regions likely to form functional structure. Separate structural models are produced for each window, where the step size can greatly affect the returned model. This makes deducing unique local structures challenging, as the same nucleotides in each window can be alternatively base paired. We are presenting here a new approach where all base pairs from analysis windows are considered and weighted by favorable folding. This results in unique base pairing throughout the genome and the generation of local regions/structures that can be ranked by their propensity to form unusually thermodynamically stable folds. We applied this approach to the Zika virus (ZIKV) and HIV-1 genomes. ZIKV is linked to a variety of neurological ailments including microcephaly and Guillain-Barré syndrome and its (+)-sense RNA genome encodes two, previously described, functionally essential structured RNA regions. HIV, the cause of AIDS, contains multiple functional RNA motifs in its genome, which have been extensively studied. Our approach is able to successfully identify and model the structures of known functional motifs in both viruses, while also finding additional regions likely to form functional structures. All data have been archived at the RNAStructuromeDB (www.structurome.bb.iastate.edu), a repository of RNA folding data for humans and their pathogens.
Affiliation(s)
- Ryan J. Andrews
- Roy J. Carver Department of Biophysics, Biochemistry and Molecular Biology, Iowa State University, Ames, IA, USA
| | - Julien Roche
- Roy J. Carver Department of Biophysics, Biochemistry and Molecular Biology, Iowa State University, Ames, IA, USA
| | - Walter N. Moss
- Roy J. Carver Department of Biophysics, Biochemistry and Molecular Biology, Iowa State University, Ames, IA, USA
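The ScanFold abstract above describes pooling base pairs from all overlapping analysis windows and weighting them by favorable folding, so that each nucleotide ends up in at most one pair. The sketch below illustrates that general idea only. It assumes, hypothetically, that every window carries a single numeric score where lower means more favorable, and it resolves conflicts greedily by mean score. The function name, the input layout, and the greedy rule are illustrative assumptions, not the published ScanFold procedure.

```python
from collections import defaultdict


def consensus_pairs(windows):
    """Pick a non-conflicting set of base pairs from overlapping windows.

    `windows` is a list of (score, pairs) tuples, where `pairs` is a list of
    (i, j) genome positions paired in that window and `score` is a
    hypothetical per-window folding metric (lower = more favorable).
    """
    # Collect every score under which each base pair was predicted.
    scores = defaultdict(list)
    for score, pairs in windows:
        for i, j in pairs:
            scores[(min(i, j), max(i, j))].append(score)

    # Weight each pair by the mean score of the windows that contain it.
    mean_score = {pair: sum(s) / len(s) for pair, s in scores.items()}

    # Greedy resolution: accept pairs from most to least favorable and skip
    # any pair that reuses an already paired nucleotide.
    chosen, used = [], set()
    for (i, j), _ in sorted(mean_score.items(), key=lambda kv: (kv[1], kv[0])):
        if i not in used and j not in used:
            chosen.append((i, j))
            used.update((i, j))
    return sorted(chosen)


if __name__ == "__main__":
    toy_windows = [
        (-2.0, [(1, 10), (2, 9)]),
        (-0.5, [(1, 8), (2, 9)]),
        (-3.0, [(3, 8)]),
    ]
    print(consensus_pairs(toy_windows))
```

In the toy input, nucleotide 1 is paired with 10 in one window and with 8 in another; the greedy rule keeps only the pair from the more favorable window, so no position is paired twice.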
|
13
|
Kirsch R, Seemann SE, Ruzzo WL, Cohen SM, Stadler PF, Gorodkin J. Identification and characterization of novel conserved RNA structures in Drosophila. BMC Genomics 2018; 19:899. [PMID: 30537930 PMCID: PMC6288889 DOI: 10.1186/s12864-018-5234-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 03/22/2018] [Accepted: 11/08/2018] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Comparative genomics approaches have facilitated the discovery of many novel non-coding and structured RNAs (ncRNAs). The increasing availability of related genomes now makes it possible to systematically search for compensatory base changes - and thus for conserved secondary structures - even in genomic regions that are poorly alignable in the primary sequence. The wealth of available transcriptome data can add valuable insight into expression and possible function for new ncRNA candidates. Earlier work identifying ncRNAs in Drosophila melanogaster made use of sequence-based alignments and employed a sliding window approach, inevitably biasing identification toward RNAs encoded in the more conserved parts of the genome. RESULTS To search for conserved RNA structures (CRSs) that may not be highly conserved in sequence and to assess the expression of CRSs, we conducted a genome-wide structural alignment screen of 27 insect genomes including D. melanogaster and integrated this with an extensive set of tiling array data. The structural alignment screen revealed ∼30,000 novel candidate CRSs at an estimated false discovery rate of less than 10%. With more than one quarter of all individual CRS motifs showing sequence identities below 60%, the predicted CRSs largely complement the findings of sliding window approaches applied previously. While a sixth of the CRSs were ubiquitously expressed, we found that most were expressed in specific developmental stages or cell lines. Notably, the most statistically significant enrichment of CRSs was observed in pupae, mainly in exons of untranslated regions, promoters, enhancers, and long ncRNAs. Interestingly, cell lines were found to express a different set of CRSs than were found in vivo. Only a small fraction of intergenic CRSs were co-expressed with the adjacent protein coding genes, which suggests that most intergenic CRSs are independent genetic units.
CONCLUSIONS This study provides a more comprehensive view of the ncRNA transcriptome in fly as well as evidence for differential expression of CRSs during development and in cell lines.
Affiliation(s)
- Rebecca Kirsch
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
- Department of Veterinary and Animal Science, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16–18, Leipzig, D-04107 Germany
| | - Stefan E. Seemann
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
- Department of Veterinary and Animal Science, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
| | - Walter L. Ruzzo
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
- School of Computer Science and Engineering, University of Washington, Box 352350, Seattle, 98195-2350 WA USA
- Department of Genome Sciences, University of Washington, Box 355065, Seattle, 98195-5065 WA USA
- Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Seattle, 98109-1024 WA USA
| | - Stephen M. Cohen
- Department of Cellular and Molecular Medicine, University of Copenhagen, Blegdamsvej 3, Copenhagen N, DK-2200 Denmark
| | - Peter F. Stadler
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16–18, Leipzig, D-04107 Germany
- Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, Leipzig, D-04103 Germany
- Faculdad de Ciencias, Universidad Nacional de Colombia, Sede Bogotá, Ciudad Universitaria, Bogotá, COL-111321 D.C. Colombia
- Department of Theoretical Chemistry, University of Vienna, Währinger Straße 17, Vienna, A-1090 Austria
- Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM87501 USA
| | - Jan Gorodkin
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
- Department of Veterinary and Animal Science, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
|
14
|
Abstract
Over the last two decades it has become clear that RNA is much more than just a boring intermediate in protein expression. Ancient RNAs still appear in the core information metabolism and comprise a surprisingly large component in bacterial gene regulation. A common theme with these types of mostly small RNAs is their reliance on conserved secondary structures. Large-scale sequencing projects, on the other hand, have profoundly changed our understanding of eukaryotic genomes. Pervasively transcribed, they give rise to a plethora of large and evolutionarily extremely flexible noncoding RNAs that exert a vastly diverse array of molecular functions. In this chapter we provide a (necessarily incomplete) overview of the current state of comparative analysis of noncoding RNAs, emphasizing computational approaches as a means to gain a global picture of the modern RNA world.
Affiliation(s)
- Rolf Backofen
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 106, D-79110 Freiburg, Germany.,Center for non-coding RNA in Technology and Health, Department of Veterinary and Animal Sciences, University of Copenhagen, Grønnegårdsvej 3, DK-1870 Frederiksberg C, Denmark
| | - Jan Gorodkin
- Center for non-coding RNA in Technology and Health, Department of Veterinary and Animal Sciences, University of Copenhagen, Grønnegårdsvej 3, DK-1870 Frederiksberg C, Denmark
| | - Ivo L Hofacker
- Center for non-coding RNA in Technology and Health, Department of Veterinary and Animal Sciences, University of Copenhagen, Grønnegårdsvej 3, DK-1870 Frederiksberg C, Denmark.,Institute for Theoretical Chemistry, University of Vienna, Währingerstraße 17, A-1090 Wien, Austria.,Bioinformatics and Computational Biology Research Group, University of Vienna, Währingerstraße 17, A-1090 Vienna, Austria
| | - Peter F Stadler
- Center for non-coding RNA in Technology and Health, Department of Veterinary and Animal Sciences, University of Copenhagen, Grønnegårdsvej 3, DK-1870 Frederiksberg C, Denmark.,Institute for Theoretical Chemistry, University of Vienna, Währingerstraße 17, A-1090 Wien, Austria.,Bioinformatics Group, Department of Computer Science, Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstraße 16-18, D-04107 Leipzig, Germany.,Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany.,Fraunhofer Institute for Cell Therapy and Immunology, Perlickstraße 1, D-04103 Leipzig, Germany.,Santa Fe Institute, 1399 Hyde Park Rd, Santa Fe, NM 87501, USA.
|
15
|
Fallmann J, Will S, Engelhardt J, Grüning B, Backofen R, Stadler PF. Recent advances in RNA folding. J Biotechnol 2017; 261:97-104. [DOI: 10.1016/j.jbiotec.2017.07.007] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Received: 02/16/2017] [Revised: 07/02/2017] [Accepted: 07/04/2017] [Indexed: 12/23/2022]
|
16
|
Seemann SE, Mirza AH, Hansen C, Bang-Berthelsen CH, Garde C, Christensen-Dalsgaard M, Torarinsson E, Yao Z, Workman CT, Pociot F, Nielsen H, Tommerup N, Ruzzo WL, Gorodkin J. The identification and functional annotation of RNA structures conserved in vertebrates. Genome Res 2017; 27:1371-1383. [PMID: 28487280 PMCID: PMC5538553 DOI: 10.1101/gr.208652.116] [Citation(s) in RCA: 55] [Impact Index Per Article: 7.9] [Received: 04/19/2016] [Accepted: 05/04/2017] [Indexed: 01/15/2023]
Abstract
Structured elements of RNA molecules are essential in, e.g., RNA stabilization, localization, and protein interaction, and their conservation across species suggests a common functional role. We computationally screened vertebrate genomes for conserved RNA structures (CRSs), leveraging structure-based, rather than sequence-based, alignments. After careful correction for sequence identity and GC content, we predict ∼516,000 human genomic regions containing CRSs. We find that a substantial fraction of human–mouse CRS regions (1) colocalize consistently with binding sites of the same RNA binding proteins (RBPs) or (2) are transcribed in corresponding tissues. Additionally, a CaptureSeq experiment revealed expression of many of our CRS regions in human fetal brain, including 662 novel ones. For selected human and mouse candidate pairs, qRT-PCR and in vitro RNA structure probing supported both shared expression and shared structure despite low abundance and low sequence identity. About 30,000 CRS regions are located near coding or long noncoding RNA genes or within enhancers. Structured (CRS overlapping) enhancer RNAs and extended 3′ ends have significantly increased expression levels over their nonstructured counterparts. Our findings of transcribed uncharacterized regulatory regions that contain CRSs support their RNA-mediated functionality.
Affiliation(s)
- Stefan E Seemann
- Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark.,Department of Veterinary and Animal Sciences, Faculty of Health and Medical Sciences, University of Copenhagen, DK-1870 Frederiksberg, Denmark
| | - Aashiq H Mirza
- Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark.,Copenhagen Diabetes Research Center (CPH-DIRECT), Herlev University Hospital, DK-2730 Herlev, Denmark
| | - Claus Hansen
- Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark.,Department of Cellular and Molecular Medicine (ICMM), Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark
| | - Claus H Bang-Berthelsen
- Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark.,Department of Obesity Biology and Department of Molecular Genetics, Novo Nordisk A/S, DK-2880 Bagsværd, Denmark
| | - Christian Garde
- Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark.,Department of Biotechnology and Biomedicine, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark
| | - Mikkel Christensen-Dalsgaard
- Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark.,Department of Cellular and Molecular Medicine (ICMM), Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark
| | - Elfar Torarinsson
- Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark
| | - Zizhen Yao
- Allen Institute for Brain Science, Seattle, Washington 98109, USA
| | - Christopher T Workman
- Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark.,Department of Biotechnology and Biomedicine, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark
| | - Flemming Pociot
- Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark.,Copenhagen Diabetes Research Center (CPH-DIRECT), Herlev University Hospital, DK-2730 Herlev, Denmark
| | - Henrik Nielsen
- Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark.,Department of Cellular and Molecular Medicine (ICMM), Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark
| | - Niels Tommerup
- Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark.,Department of Cellular and Molecular Medicine (ICMM), Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark
| | - Walter L Ruzzo
- Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark.,School of Computer Science and Engineering and Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA.,Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, USA
| | - Jan Gorodkin
- Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, DK-1870 Frederiksberg, Denmark.,Department of Veterinary and Animal Sciences, Faculty of Health and Medical Sciences, University of Copenhagen, DK-1870 Frederiksberg, Denmark
|
17
|
Nitsche A, Stadler PF. Evolutionary clues in lncRNAs. WILEY INTERDISCIPLINARY REVIEWS-RNA 2016; 8. [PMID: 27436689 DOI: 10.1002/wrna.1376] [Citation(s) in RCA: 43] [Impact Index Per Article: 5.4] [Received: 02/24/2016] [Revised: 06/06/2016] [Accepted: 06/09/2016] [Indexed: 12/13/2022]
Abstract
The diversity of long non-coding RNAs (lncRNAs) in the human transcriptome is in stark contrast to the sparse exploration of their functions concomitant with their conservation and evolution. The pervasive transcription of the largely non-coding human genome makes the evolutionary age and conservation patterns of lncRNAs a topic of interest. Yet it is a fairly unexplored field, and these properties are not as easy to determine as for protein-coding genes. Although there are a few experimentally studied cases, which are conserved at the sequence level, most lncRNAs exhibit weak or untraceable primary sequence conservation. Recent studies shed light on the interspecies conservation of secondary structures among lncRNA homologs by using diverse computational methods. This highlights the importance of structure on functionality of lncRNAs as opposed to the poor impact of primary sequence changes. Further clues in the evolution of lncRNAs are given by selective constraints on non-coding gene structures (e.g., promoters or splice sites) as well as the conservation of prevalent spatio-temporal expression patterns. However, a rapid evolutionary turnover is observable throughout the heterogeneous group of lncRNAs. This still gives rise to questions about its functional meaning. WIREs RNA 2017, 8:e1376. doi: 10.1002/wrna.1376
Affiliation(s)
- Anne Nitsche
- Bioinformatics Group, Department of Computer Science, University Leipzig, Leipzig, Germany.,Institute de Biologie Moléculaire et Cellulaire, Université de Strasbourg, Cedex, France
| | - Peter F Stadler
- Bioinformatics Group, Department of Computer Science, University Leipzig, Leipzig, Germany.,Interdisciplinary Center for Bioinformatics, University Leipzig, Leipzig, Germany.,Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany.,Department of Diagnostics, Fraunhofer Institute for Cell Therapy and Immunology - IZI, Leipzig, Germany.,Center for Non-Coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark.,Department of Theoretical Chemistry, University of Vienna, Wien, Austria.,Santa Fe Institute, Santa Fe, NM, USA
|
18
|
RNA 3D Modules in Genome-Wide Predictions of RNA 2D Structure. PLoS One 2015; 10:e0139900. [PMID: 26509713 PMCID: PMC4624896 DOI: 10.1371/journal.pone.0139900] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 07/30/2015] [Accepted: 08/17/2015] [Indexed: 01/09/2023] Open
Abstract
Recent experimental and computational progress has revealed a large potential for RNA structure in the genome. This has been driven by computational strategies that exploit multiple genomes of related organisms to identify common sequences and secondary structures. However, these computational approaches have two main challenges: they are computationally expensive and they have a relatively high false discovery rate (FDR). Simultaneously, RNA 3D structure analysis has revealed modules composed of non-canonical base pairs which occur in non-homologous positions, apparently by independent evolution. These modules can, for example, occur inside structural elements which in RNA 2D predictions appear as internal loops. Hence one question is if the use of such RNA 3D information can improve the prediction accuracy of RNA secondary structure at a genome-wide level. Here, we use RNAz in combination with 3D module prediction tools and apply them on a 13-way vertebrate sequence-based alignment. We find that RNA 3D modules predicted by metaRNAmodules and JAR3D are significantly enriched in the screened windows compared to their shuffled counterparts. The initially estimated FDR of 47.0% is lowered to below 25% when certain 3D module predictions are present in the window of the 2D prediction. We discuss the implications and prospects for further development of computational strategies for detection of RNA 2D structure in genomic sequence.
|
19
|
Pei S, Anthony JS, Meyer MM. Sampled ensemble neutrality as a feature to classify potential structured RNAs. BMC Genomics 2015; 16:35. [PMID: 25649229 PMCID: PMC4333902 DOI: 10.1186/s12864-014-1203-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 05/12/2014] [Accepted: 12/22/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Structured RNAs have many biological functions ranging from catalysis of chemical reactions to gene regulation. Yet, many homologous structured RNAs display most of their conservation at the secondary or tertiary structure level. As a result, strategies for structured RNA discovery rely heavily on identification of sequences sharing a common stable secondary structure. However, correctly distinguishing structured RNAs from surrounding genomic sequence remains challenging, especially during de novo discovery. RNA also has a long history as a computational model for evolution due to the direct link between genotype (sequence) and phenotype (structure). From these studies it is clear that evolved RNA structures, like protein structures, can be considered robust to point mutations. In this context, an RNA sequence is considered robust if its neutrality (extent to which single mutant neighbors maintain the same secondary structure) is greater than that expected for an artificial sequence with the same minimum free energy structure. RESULTS In this work, we bring concepts from evolutionary biology to bear on the structured RNA de novo discovery process. We hypothesize that alignments corresponding to structured RNAs should consist of neutral sequences. We evaluate several measures of neutrality for their ability to distinguish between alignments of structured RNA sequences drawn from Rfam and various decoy alignments. We also introduce a new measure of RNA structural neutrality, the structure ensemble neutrality (SEN). SEN seeks to increase the biological relevance of existing neutrality measures in two ways. First, it uses information from an alignment of homologous sequences to identify a conserved biologically relevant structure for comparison. Second, it only counts base-pairs of the original structure that are absent in the comparison structure and does not penalize the formation of additional base-pairs. 
CONCLUSION We find that several measures of neutrality are effective at separating structured RNAs from decoy sequences, including both shuffled alignments and flanking genomic sequence. Furthermore, as an independent feature classifier to identify structured RNAs, SEN yields comparable performance to current approaches that consider a variety of features including stability and sequence identity. Finally, SEN outperforms other measures of neutrality at detecting mutational robustness in bacterial regulatory RNA structures.
Affiliation(s)
- Shermin Pei
- Boston College, 140 Commonwealth Ave., Chestnut Hill, 02467, MA, USA.
| | - Jon S Anthony
- Boston College, 140 Commonwealth Ave., Chestnut Hill, 02467, MA, USA.
| | - Michelle M Meyer
- Boston College, 140 Commonwealth Ave., Chestnut Hill, 02467, MA, USA.
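The abstract above states one concrete counting rule: when comparing structures, only base pairs of the original structure that are absent from the comparison structure are counted, and additional base pairs are not penalized. The sketch below shows that one-sided count on dot-bracket strings. It is an illustration of the counting rule only, not the full SEN measure; the function names and the dot-bracket input format are assumptions made for this example.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for idx, char in enumerate(dot_bracket):
        if char == "(":
            stack.append(idx)
        elif char == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.add((stack.pop(), idx))
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return pairs


def missing_pairs(reference, comparison):
    """Count reference base pairs that are absent from the comparison.

    Pairs present only in the comparison structure are ignored, so gaining
    base pairs is not penalized (a one-sided distance).
    """
    return len(base_pairs(reference) - base_pairs(comparison))


if __name__ == "__main__":
    reference = "((..))"
    print(missing_pairs(reference, "(....)"))  # one reference pair is lost
    print(missing_pairs(reference, "((()))"))  # only an extra pair is gained
```

Because the count is one-sided, swapping the two arguments generally gives a different value, which distinguishes it from a symmetric base-pair distance.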
|
20
|
Long non-coding RNAs differentially expressed between normal versus primary breast tumor tissues disclose converse changes to breast cancer-related protein-coding genes. PLoS One 2014; 9:e106076. [PMID: 25264628 PMCID: PMC4180073 DOI: 10.1371/journal.pone.0106076] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Received: 10/11/2013] [Accepted: 07/29/2014] [Indexed: 12/04/2022] Open
Abstract
Breast cancer, the second leading cause of cancer death in women, is a highly heterogeneous disease, characterized by distinct genomic and transcriptomic profiles. Transcriptome analyses prevalently assessed protein-coding genes; however, the majority of the mammalian genome is expressed in numerous non-coding transcripts. Emerging evidence supports that many of these non-coding RNAs are specifically expressed during development, tumorigenesis, and metastasis. The focus of this study was to investigate the expression features and molecular characteristics of long non-coding RNAs (lncRNAs) in breast cancer. We investigated 26 breast tumor and 5 normal tissue samples utilizing a custom expression microarray enclosing probes for mRNAs as well as novel and previously identified lncRNAs. We identified more than 19,000 unique regions significantly differentially expressed between normal versus breast tumor tissue, half of these regions were non-coding without any evidence for functional open reading frames or sequence similarity to known proteins. The identified non-coding regions were primarily located in introns (53%) or in the intergenic space (33%), frequently orientated in antisense-direction of protein-coding genes (14%), and commonly distributed at promoter-, transcription factor binding-, or enhancer-sites. Analyzing the most diverse mRNA breast cancer subtypes Basal-like versus Luminal A and B resulted in 3,025 significantly differentially expressed unique loci, including 682 (23%) for non-coding transcripts. A notable number of differentially expressed protein-coding genes displayed non-synonymous expression changes compared to their nearest differentially expressed lncRNA, including an antisense lncRNA strongly anticorrelated to the mRNA coding for histone deacetylase 3 (HDAC3), which was investigated in more detail. 
Previously identified chromatin-associated lncRNAs (CARs) were predominantly downregulated in breast tumor samples, including CARs located in the protein-coding genes for CALD1, FTX, and HNRNPH1. In conclusion, a number of differentially expressed lncRNAs have been identified with relation to cancer-related protein-coding genes.
|
21
|
Phylogeny and evolution of RNA structure. Methods Mol Biol 2014. [PMID: 24639167 DOI: 10.1007/978-1-62703-709-9_16] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]
Abstract
Darwin's conviction that all living beings on Earth are related and the graph of relatedness is tree-shaped has been essentially confirmed by phylogenetic reconstruction first from morphology and later from data obtained by molecular sequencing. Limitations of the phylogenetic tree concept were recognized as more and more sequence information became available. The other path-breaking idea of Darwin, natural selection of fitter variants in populations, is cast into simple mathematical form and extended to mutation-selection dynamics. In this form the theory is directly applicable to RNA evolution in vitro and to virus evolution. Phylogeny and population dynamics of RNA provide complementary insights into evolution and the interplay between the two concepts will be pursued throughout this chapter. The two strategies for understanding evolution are ultimately related through the central paradigm of structural biology: sequence ⇒ structure ⇒ function. We elaborate on the state of the art in modeling both phylogeny and evolution of RNA driven by reproduction and mutation. Thereby the focus will be laid on models for phylogenetic sequence evolution as well as evolution and design of RNA structures with selected examples and notes on simulation methods. In the perspectives an attempt is made to combine molecular structure, population dynamics, and phylogeny in modeling evolution.
|
22
|
Hackermüller J, Reiche K, Otto C, Hösler N, Blumert C, Brocke-Heidrich K, Böhlig L, Nitsche A, Kasack K, Ahnert P, Krupp W, Engeland K, Stadler PF, Horn F. Cell cycle, oncogenic and tumor suppressor pathways regulate numerous long and macro non-protein-coding RNAs. Genome Biol 2014; 15:R48. [PMID: 24594072 PMCID: PMC4054595 DOI: 10.1186/gb-2014-15-3-r48] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Received: 08/16/2013] [Accepted: 03/04/2014] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND The genome is pervasively transcribed but most transcripts do not code for proteins, constituting non-protein-coding RNAs. Despite increasing numbers of functional reports of individual long non-coding RNAs (lncRNAs), assessing the extent of functionality among the non-coding transcriptional output of mammalian cells remains intricate. In the protein-coding world, transcripts differentially expressed in the context of processes essential for the survival of multicellular organisms have been instrumental in the discovery of functionally relevant proteins and their deregulation is frequently associated with diseases. We therefore systematically identified lncRNAs expressed differentially in response to oncologically relevant processes and cell-cycle, p53 and STAT3 pathways, using tiling arrays. RESULTS We found that up to 80% of the pathway-triggered transcriptional responses are non-coding. Among these we identified very large macroRNAs with pathway-specific expression patterns and demonstrated that these are likely continuous transcripts. MacroRNAs contain elements conserved in mammals and sauropsids, which in part exhibit conserved RNA secondary structure. Comparing evolutionary rates of a macroRNA to adjacent protein-coding genes suggests a local action of the transcript. Finally, in different grades of astrocytoma, a tumor disease unrelated to the initially used cell lines, macroRNAs are differentially expressed.
CONCLUSIONS It has been shown previously that the majority of expressed non-ribosomal transcripts are non-coding. We now conclude that differential expression triggered by signaling pathways gives rise to a similar abundance of non-coding content. It is thus unlikely that the prevalence of non-coding transcripts in the cell is a trivial consequence of leaky or random transcription events.
|
23
|
Abstract
De novo discovery of "motifs" capturing the commonalities among related noncoding ncRNA structured RNAs is among the most difficult problems in computational biology. This chapter outlines the challenges presented by this problem, together with some approaches towards solving them, with an emphasis on an approach based on the CMfinder CMfinder program as a case study. Applications to genomic screens for novel de novo structured ncRNA ncRNA s, including structured RNA elements in untranslated portions of protein-coding genes, are presented.
Affiliation(s)
- Walter L Ruzzo
- Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
24
Abstract
Transcriptomics experiments and computational predictions both enable systematic discovery of new functional RNAs. However, many putative noncoding transcripts arise instead from artifacts and biological noise, and current computational prediction methods have high false positive rates. I discuss prospects for improving computational methods for analyzing and identifying functional RNAs, with a focus on detecting signatures of conserved RNA secondary structure. An interesting new front is the application of chemical and enzymatic experiments that probe RNA structure on a transcriptome-wide scale. I review several proposed approaches for incorporating structure probing data into the computational prediction of RNA secondary structure. Using probabilistic inference formalisms, I show how all these approaches can be unified in a well-principled framework, which in turn allows RNA probing data to be easily integrated into a wide range of analyses that depend on RNA secondary structure inference. Such analyses include homology search and genome-wide detection of new structural RNAs.
Affiliation(s)
- Sean R Eddy
- Howard Hughes Medical Institute Janelia Farm Research Campus, Ashburn, Virginia 20147;
25
Energy-based RNA consensus secondary structure prediction in multiple sequence alignments. Methods Mol Biol 2014; 1097:125-41. [PMID: 24639158 DOI: 10.1007/978-1-62703-709-9_7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 01/11/2023]
Abstract
Many biologically important RNA structures are conserved in evolution, leading to characteristic mutational patterns. RNAalifold is a widely used program to predict consensus secondary structures in multiple alignments by combining evolutionary information with traditional energy-based RNA folding algorithms. Here we describe the theory and applications of the RNAalifold algorithm. Consensus secondary structure prediction not only leads to significantly more accurate structure models, but also makes it possible to study the structural conservation of functional RNAs.
26
Sabarinathan R, Tafer H, Seemann SE, Hofacker IL, Stadler PF, Gorodkin J. RNAsnp: efficient detection of local RNA secondary structure changes induced by SNPs. Hum Mutat 2013; 34:546-56. [PMID: 23315997 PMCID: PMC3708107 DOI: 10.1002/humu.22273] [Citation(s) in RCA: 99] [Impact Index Per Article: 9.0] [Received: 08/17/2012] [Accepted: 12/18/2012] [Indexed: 02/05/2023]
Abstract
Structural characteristics are essential for the functioning of many noncoding RNAs and cis-regulatory elements of mRNAs. SNPs may disrupt these structures, interfere with their molecular function, and hence cause a phenotypic effect. RNA folding algorithms can provide detailed insights into structural effects of SNPs. The global measures employed so far suffer from limited accuracy of folding programs on large RNAs and are computationally too demanding for genome-wide applications. Here, we present a strategy that focuses on the local regions of maximal structural change between mutant and wild-type. These local regions are approximated in a “screening mode” that is intended for genome-wide applications. Furthermore, localized regions are identified as those with maximal discrepancy. The mutation effects are quantified in terms of empirical P values. To this end, the RNAsnp software uses extensive precomputed tables of the distribution of SNP effects as a function of length and GC content. RNAsnp thus achieves both a noise reduction and a speed-up of several orders of magnitude over shuffling-based approaches. On a data set comprising 501 SNPs associated with human-inherited diseases, we predict 54 to have a significant local structural effect in the untranslated region of mRNAs. RNAsnp is available at http://rth.dk/resources/rnasnp.
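The abstract above quantifies mutation effects as empirical P values against a background distribution of SNP effects. As a rough illustration of that general idea only, the sketch below computes an empirical P value from a plain list of background scores. The function name, the toy background values, the +1 pseudocount, and the convention that larger scores mean larger structural change are all assumptions of this sketch; it does not reproduce RNAsnp's precomputed tables, which the abstract says are indexed by length and GC content.

```python
def empirical_p_value(observed, background):
    """Share of background effects at least as large as the observed one.

    A +1 pseudocount in numerator and denominator keeps the estimate
    away from exactly zero (an assumption of this sketch).
    """
    if not background:
        raise ValueError("background distribution must not be empty")
    at_least_as_large = sum(1 for value in background if value >= observed)
    return (at_least_as_large + 1) / (len(background) + 1)


# Hypothetical background of structural-change scores for random SNPs.
background_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
p_value = empirical_p_value(0.85, background_scores)
print(p_value)
```

In a real screen the background would have to be matched to the sequence context of the SNP being tested; a single flat list is used here only to keep the example small.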
27
Smith MA, Gesell T, Stadler PF, Mattick JS. Widespread purifying selection on RNA structure in mammals. Nucleic Acids Res 2013; 41:8220-36. [PMID: 23847102 PMCID: PMC3783177 DOI: 10.1093/nar/gkt596] [Citation(s) in RCA: 130] [Impact Index Per Article: 11.8] [Received: 01/30/2013] [Revised: 05/29/2013] [Accepted: 06/16/2013] [Indexed: 12/14/2022]
Abstract
Evolutionarily conserved RNA secondary structures are a robust indicator of purifying selection and, consequently, molecular function. Evaluating their genome-wide occurrence through comparative genomics has consistently been plagued by high false-positive rates and divergent predictions. We present a novel benchmarking pipeline aimed at calibrating the precision of genome-wide scans for consensus RNA structure prediction. The benchmarking data obtained from two refined structure prediction algorithms, RNAz and SISSIz, were then analyzed to fine-tune the parameters of an optimized workflow for genomic sliding window screens. When applied to consistency-based multiple genome alignments of 35 mammals, our approach confidently identifies >4 million evolutionarily constrained RNA structures using a conservative sensitivity threshold that entails historically low false discovery rates for such analyses (5-22%). These predictions comprise 13.6% of the human genome, 88% of which fall outside any known sequence-constrained element, suggesting that a large proportion of the mammalian genome is functional. As an example, our findings identify both known and novel conserved RNA structure motifs in the long noncoding RNA MALAT1. This study provides an extensive set of functional transcriptomic annotations that will assist researchers in uncovering the precise mechanisms underlying the developmental ontologies of higher eukaryotes.
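The abstract above refers to genomic sliding window screens without giving window parameters. The sketch below shows one simple way to cut an alignment of a given length into overlapping windows that could each be passed to a structure predictor. The function name and the window and step sizes are hypothetical values chosen for illustration, not the settings used in the study.

```python
def sliding_windows(length, window, step):
    """Return (start, end) half-open intervals covering [0, length).

    A final window is appended when the regular stride would leave
    the end of the alignment uncovered.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if length <= 0:
        return []
    if length <= window:
        return [(0, length)]
    starts = list(range(0, length - window + 1, step))
    if starts[-1] + window < length:
        starts.append(length - window)
    return [(start, start + window) for start in starts]


# Hypothetical settings: 200-column windows sliding by 100 columns.
windows = sliding_windows(500, 200, 100)
print(windows)
```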
Affiliation(s)
- Martin A. Smith
- RNA Biology and Plasticity Laboratory, Garvan Institute of Medical Research, 384 Victoria Street, Darlinghurst, Sydney, NSW 2010 Australia, Genomics and Computational Biology Division, Institute for Molecular Bioscience, 306 Carmody Rd, University of Queensland, Brisbane, 4067 Australia, Department of Structural and Computational Biology; and Center for Integrative Bioinformatics Vienna (CIBIV), Max F. Perutz Laboratories (MFPL), University of Vienna, Medical University of Vienna, Dr. Bohr-Gasse 9, A-1030 Vienna, Austria, Bioinformatics Group, Department of Computer Science; and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstrasse 16–18, D-04107 Leipzig, Germany, Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany, Center for Non-coding RNA in Technology and Health, Department of Basic Veterinary and Animal Sciences, Faculty of Life Sciences University of Copenhagen, Grønnegårdsvej 3, 1870 Frederiksberg C Denmark, Santa Fe Institute, 1399 Hyde Park Rd, Santa Fe, NM 87501, USA and St Vincent’s Clinical School, University of New South Wales, Level 5, de Lacy, Victoria St, St Vincent's Hospital, Sydney, NSW 2010 Australia
- Tanja Gesell
- Peter F. Stadler
- John S. Mattick
28
Evolutionary evidence for alternative structure in RNA sequence co-variation. PLoS Comput Biol 2013; 9:e1003152. [PMID: 23935473 PMCID: PMC3723493 DOI: 10.1371/journal.pcbi.1003152] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.1] [Received: 03/12/2013] [Accepted: 06/05/2013] [Indexed: 02/06/2023]
Abstract
Sequence conservation and co-variation of base pairs are hallmarks of structured RNAs. For certain RNAs (e.g. riboswitches), a single sequence must adopt at least two alternative secondary structures to effectively regulate the message. If alternative secondary structures are important to the function of an RNA, we expect to observe evolutionary co-variation supporting multiple conformations. We set out to characterize the evolutionary co-variation supporting alternative conformations in riboswitches to determine the extent to which alternative secondary structures are conserved. We found strong co-variation support for the terminator, P1, and anti-terminator stems in the purine riboswitch by extending alignments to include terminator sequences. When we performed Boltzmann suboptimal sampling on purine riboswitch sequences with terminators we found that these sequences appear to have evolved to favor specific alternative conformations. We extended our analysis of co-variation to classic alignments of group I/II introns, tRNA, and other classes of riboswitches. In a majority of these RNAs, we found evolutionary evidence for alternative conformations that are compatible with the Boltzmann suboptimal ensemble. Our analyses suggest that alternative conformations are selected for and thus likely play functional roles in even the most structured of RNAs. RNA (Ribonucleic Acid) is a messenger of genetic information, master regulator, and catalyst in the cell. To carry out its function, RNA can fold into complex three-dimensional structures. Certain classes of RNAs, called riboswitches, adopt at least two alternative structures to act as a switch. We set out to detect the evolutionary signal for alternative structures in riboswitches as we hypothesize that these RNA sequences must have evolved to allow both conformations. We find that indeed such signals exist when we compare the sequences of riboswitches from multiple species. 
When we extend this analysis to other RNA regulators in the cell that are not thought of as switches, we detect equivalent evolutionary support for alternative structures. Viewed through the lens of evolutionary structure conservation RNA sequences appear to have adapted to adopt multiple conformations.
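The abstract above relies on co-variation of base pairs as evidence for structure but does not say which statistic was used. One generic way to score co-variation between two alignment columns is their mutual information, sketched below. The function name, the gap handling, and the toy columns are assumptions of this sketch and are not taken from the paper.

```python
from collections import Counter
from math import log2


def column_mutual_information(column_i, column_j):
    """Mutual information (in bits) between two alignment columns.

    Rows with a gap ('-') in either column are skipped, which is a
    simplification made for this sketch.
    """
    if len(column_i) != len(column_j):
        raise ValueError("columns must have the same number of rows")
    pairs = [(a, b) for a, b in zip(column_i, column_j) if a != "-" and b != "-"]
    total = len(pairs)
    if total == 0:
        return 0.0
    freq_i = Counter(a for a, _ in pairs)
    freq_j = Counter(b for _, b in pairs)
    freq_ij = Counter(pairs)
    return sum(
        (count / total)
        * log2((count / total) / ((freq_i[a] / total) * (freq_j[b] / total)))
        for (a, b), count in freq_ij.items()
    )


# Toy columns: compensatory changes keep the pair complementary in every row.
covarying = column_mutual_information("GGCCAAUU", "CCGGUUAA")
# Perfectly conserved columns carry no co-variation signal.
conserved = column_mutual_information("GGGGGGGG", "CCCCCCCC")
print(covarying, conserved)
```

Mutual information rewards any statistical dependence between columns, not only complementary pairs, so a practical analysis would usually combine it with a check that the observed pairs can actually base-pair.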
29
Basu S, Müller F, Sanges R. Examples of sequence conservation analyses capture a subset of mouse long non-coding RNAs sharing homology with fish conserved genomic elements. BMC Bioinformatics 2013; 14 Suppl 7:S14. [PMID: 23815359 PMCID: PMC3633045 DOI: 10.1186/1471-2105-14-s7-s14] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 01/17/2023]
Abstract
Background Long non-coding RNAs (lncRNA) are a major class of non-coding RNAs. They are involved in diverse intra-cellular mechanisms like molecular scaffolding, splicing and DNA methylation. Through these mechanisms they are reported to play a role in cellular differentiation and development. They show an enriched expression in the brain where they are implicated in maintaining cellular identity, homeostasis, stress responses and plasticity. Low sequence conservation and lack of functional annotations make it difficult to identify homologs of mammalian lncRNAs in other vertebrates. A computational evaluation of the lncRNAs through systematic conservation analyses of both sequences as well as their genomic architecture is required. Results Our results show that a subset of mouse candidate lncRNAs could be distinguished from random sequences based on their alignment with zebrafish phastCons elements. Using ROC analyses we were able to define a measure to select significantly conserved lncRNAs. Indeed, starting from ~2,800 mouse lncRNAs we could predict that between 4 and 11% present conserved sequence fragments in fish genomes. Gene ontology (GO) enrichment analyses of protein coding genes, proximal to the region of conservation, in both organisms highlighted similar GO classes like regulation of transcription and central nervous system development. The proximal coding genes in both the species show enrichment of their expression in brain. In summary, we show that interesting genomic regions in zebrafish could be marked based on their sequence homology to a mouse lncRNA, overlap with ESTs and proximity to genes involved in nervous system development. Conclusions Conservation at the sequence level can identify a subset of putative lncRNA orthologs. The similar protein-coding neighborhood and transcriptional information about the conserved candidates provide support to the hypothesis that they share functional homology. 
The pipeline presented here is a proof of principle showing that between 4 and 11% of lncRNAs retain regions of conservation between mammals and fishes. We believe this study will prove useful as a reference for analyzing the conservation of lncRNAs in newly sequenced genomes and transcriptomes.
Affiliation(s)
- Swaraj Basu
- Laboratory of Animal Physiology and Evolution, Stazione Zoologica Anton Dohrn, Villa Comunale, 80121, Naples, Italy
30
Will S, Siebauer MF, Heyne S, Engelhardt J, Stadler PF, Reiche K, Backofen R. LocARNAscan: Incorporating thermodynamic stability in sequence and structure-based RNA homology search. Algorithms Mol Biol 2013; 8:14. [PMID: 23601347 PMCID: PMC3716875 DOI: 10.1186/1748-7188-8-14] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 03/21/2013] [Accepted: 03/28/2013] [Indexed: 12/15/2022]
Abstract
Background The search for distant homologs has become an important issue in genome annotation. A particular difficulty is posed by divergent homologs that have lost recognizable sequence similarity. This same problem also arises in the recognition of novel members of large classes of RNAs such as snoRNAs or microRNAs that consist of families unrelated by common descent. Current homology search tools for structured RNAs are either based entirely on sequence similarity (such as blast or hmmer) or combine sequence and secondary structure. The most prominent example of the latter class of tools is Infernal. Alternatives are descriptor-based methods. In most practical applications published to date, however, the information contained in covariance models or manually prescribed search patterns is dominated by sequence information. Here we ask two related questions: (1) Is secondary structure alone informative for homology search and the detection of novel members of RNA classes? (2) To what extent is the thermodynamic propensity of the target sequence to fold into the correct secondary structure helpful for this task? Results Sequence-structure alignment can be used as an alternative search strategy. In this scenario, the query consists of a base pairing probability matrix, which can be derived either from a single sequence or from a multiple alignment representing a set of known representatives. Sequence information can be optionally added to the query. The target sequence is pre-processed to obtain local base pairing probabilities. As a search engine we devised a semi-global scanning variant of LocARNA’s algorithm for sequence-structure alignment. The LocARNAscan tool is optimized for speed and low memory consumption. In benchmarking experiments on artificial data we observe that the inclusion of thermodynamic stability is helpful, albeit only in a regime of extremely low sequence information in the query. 
We observe, furthermore, that the sensitivity is bounded in particular by the limited accuracy of the predicted local structures of the target sequence. Conclusions Although we demonstrate that a purely structure-based homology search is feasible in principle, it is unlikely to outperform tools such as Infernal in most application scenarios, where a substantial amount of sequence information is typically available. The LocARNAscan approach will profit, however, from high throughput methods to determine RNA secondary structure. In transcriptome-wide applications, such methods will provide accurate structure annotations on the target side. Availability Source code of the free software LocARNAscan 1.0 and supplementary data are available at
http://www.bioinf.uni-leipzig.de/Software/LocARNAscan.
31
Abstract
Recent genome-wide computational screens that search for conservation of RNA secondary structure in whole-genome alignments (WGAs) have predicted thousands of structural noncoding RNAs (ncRNAs). The sensitivity of such approaches, however, is limited, due to their reliance on sequence-based whole-genome aligners, which regularly misalign structural ncRNAs. This suggests that many more structural ncRNAs may remain undetected. Structure-based alignment, which could increase the sensitivity, has been prohibitive for genome-wide screens due to its extreme computational costs. Breaking this barrier, we present the pipeline REAPR (RE-Alignment for Prediction of structural ncRNA), which efficiently realigns whole genomes based on RNA sequence and structure, thus allowing us to boost the performance of de novo ncRNA predictors, such as RNAz. Key to the pipeline's efficiency is the development of a novel banding technique for multiple RNA alignment. REAPR significantly outperforms the widely used predictors RNAz and EvoFold in genome-wide screens; in direct comparison to the most recent RNAz screen on D. melanogaster, REAPR predicts twice as many high-confidence ncRNA candidates. Moreover, modENCODE RNA-seq experiments confirm a substantial number of its predictions as transcripts. REAPR's advancement of de novo structural characterization of ncRNAs complements the identification of transcripts from rapidly accumulating RNA-seq data.
Affiliation(s)
- Sebastian Will
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
32
Shore AN, Kabotyanski EB, Roarty K, Smith MA, Zhang Y, Creighton CJ, Dinger ME, Rosen JM. Pregnancy-induced noncoding RNA (PINC) associates with polycomb repressive complex 2 and regulates mammary epithelial differentiation. PLoS Genet 2012; 8:e1002840. [PMID: 22911650 PMCID: PMC3406180 DOI: 10.1371/journal.pgen.1002840] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.6] [Received: 01/25/2012] [Accepted: 06/01/2012] [Indexed: 02/07/2023]
Abstract
Pregnancy-induced noncoding RNA (PINC) and retinoblastoma-associated protein 46 (RbAp46) are upregulated in alveolar cells of the mammary gland during pregnancy and persist in alveolar cells that remain in the regressed lobules following involution. The cells that survive involution are thought to function as alveolar progenitor cells that rapidly differentiate into milk-producing cells in subsequent pregnancies, but it is unknown whether PINC and RbAp46 are involved in maintaining this progenitor population. Here, we show that, in the post-pubertal mouse mammary gland, mPINC is enriched in luminal and alveolar progenitors. mPINC levels increase throughout pregnancy and then decline in early lactation, when alveolar cells undergo terminal differentiation. Accordingly, mPINC expression is significantly decreased when HC11 mammary epithelial cells are induced to differentiate and produce milk proteins. This reduction in mPINC levels may be necessary for lactation, as overexpression of mPINC in HC11 cells blocks lactogenic differentiation, while knockdown of mPINC enhances differentiation. Finally, we demonstrate that mPINC interacts with RbAp46, as well as other members of the polycomb repressive complex 2 (PRC2), and identify potential targets of mPINC that are differentially expressed following modulation of mPINC expression levels. Taken together, our data suggest that mPINC inhibits terminal differentiation of alveolar cells during pregnancy to prevent abundant milk production and secretion until parturition. Additionally, a PRC2 complex that includes mPINC and RbAp46 may confer epigenetic modifications that maintain a population of mammary epithelial cells committed to the alveolar fate in the involuted gland. During pregnancy, epithelial cells of the mammary gland begin to undergo differentiation into functional alveolar cells that, during lactation, will produce and secrete milk proteins, thereby providing nourishment to offspring. 
Following lactation, the majority of alveolar cells die and the mammary gland remodels to a pre-pregnancy-like state in a process called involution. However, some alveolar cells survive involution, and these cells are thought to serve as alveolar progenitors that are able to rapidly proliferate and differentiate into milk-producing cells in subsequent pregnancies. Keeping alveolar cells from undergoing terminal differentiation during pregnancy and involution is vital for the preservation of an alveolar progenitor population. Here, we show that the long noncoding RNA, PINC, is downregulated in the mammary gland between late pregnancy and early lactation, when alveolar cells begin to terminally differentiate. This reduction of PINC levels may be necessary for lactation, as overexpression of PINC inhibits differentiation, while knockdown of PINC enhances differentiation of mammary epithelial cells. Finally, we find that PINC interacts with the chromatin-modifying complex PRC2, suggesting epigenetic regulation may be involved in maintaining alveolar progenitors in the pregnant and involuting mammary gland. These results emphasize the potential importance of lncRNA-PRC2 involvement in regulating cell fate during development.
Affiliation(s)
- Amy N. Shore
- Program in Developmental Biology, Baylor College of Medicine, Houston, Texas, United States of America
- Elena B. Kabotyanski
- Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, Texas, United States of America
- Kevin Roarty
- Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, Texas, United States of America
- Martin A. Smith
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
- Yiqun Zhang
- Dan L. Duncan Cancer Center, Baylor College of Medicine, Houston, Texas, United States of America
- Chad J. Creighton
- Dan L. Duncan Cancer Center, Baylor College of Medicine, Houston, Texas, United States of America
- Marcel E. Dinger
- Diamantina Institute, The University of Queensland, Princess Alexandra Hospital, Brisbane, Australia
- Jeffrey M. Rosen
- Program in Developmental Biology, Baylor College of Medicine, Houston, Texas, United States of America
- Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, Texas, United States of America
33
Okada Y, Saito Y, Sato K, Sakakibara Y. Improved measurements of RNA structure conservation with generalized centroid estimators. Front Genet 2012; 2:54. [PMID: 22303350 PMCID: PMC3268607 DOI: 10.3389/fgene.2011.00054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2011] [Accepted: 08/08/2011] [Indexed: 11/13/2022]
Abstract
Identification of non-protein-coding RNAs (ncRNAs) in genomes is a crucial task not only for molecular cell biology but also for bioinformatics. Secondary structures of ncRNAs are employed as a key feature of ncRNA analysis since the biological functions of ncRNAs are closely related to their secondary structures. Although the minimum free energy (MFE) structure of an RNA sequence is regarded as the most stable structure, MFE alone is not an appropriate measure for identifying ncRNAs since the free energy is heavily biased by the nucleotide composition. Therefore, instead of MFE itself, several alternative measures for identifying ncRNAs have been proposed, such as the structure conservation index (SCI) and the base pair distance (BPD), both of which employ MFE structures. However, these measurements are not suitable for identifying ncRNAs in some cases, including genome-wide search, and incur a high false discovery rate. In this study, we propose improved measurements based on SCI and BPD, applying generalized centroid estimators to add robustness against low-quality multiple alignments. Our experiments show that our proposed methods achieve higher accuracy than the original SCI and BPD not only for human-curated structural alignments but also for low-quality alignments produced by CLUSTAL W. Furthermore, the centroid-based SCI on CLUSTAL W alignments is more accurate than or comparable with the original SCI on structural alignments generated with RAF, a high-quality structural aligner that requires twice the computation time on average. We conclude that our methods are more suitable than the original SCI and BPD for genome-wide alignments, which are of low quality from the point of view of secondary structure.
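The abstract above names the structure conservation index (SCI) without defining it. As commonly defined in the literature, the SCI is the ratio of the consensus folding energy of an alignment to the mean MFE of its individual sequences. The sketch below computes that ratio from energies supplied by the caller. The function name and the example energies are hypothetical, returning 0.0 when the mean single-sequence energy is zero is a choice made for this sketch, and the centroid-based variant proposed in the paper is not reproduced here.

```python
from statistics import mean


def structure_conservation_index(consensus_energy, single_energies):
    """SCI = consensus energy / mean single-sequence MFE.

    Energies are in kcal/mol (more negative means more stable). Values
    near 1 suggest the sequences fold into a shared structure; values
    near 0 suggest no common fold.
    """
    if not single_energies:
        raise ValueError("need at least one single-sequence energy")
    average_mfe = mean(single_energies)
    if average_mfe == 0:
        # No sequence folds on its own; report 0.0 (sketch convention).
        return 0.0
    return consensus_energy / average_mfe


# Hypothetical energies: consensus fold at -18.0, three sequences alone.
sci = structure_conservation_index(-18.0, [-20.0, -22.0, -18.0])
print(sci)
```

In practice the two kinds of energies would come from a consensus folding program and a single-sequence folding program, respectively; they are passed in as plain numbers here so that the example stays self-contained.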
Affiliation(s)
- Yohei Okada
- Department of Biosciences and Informatics, Keio University Yokohama, Japan
34
Pervouchine DD, Khrameeva EE, Pichugina MY, Nikolaienko OV, Gelfand MS, Rubtsov PM, Mironov AA. Evidence for widespread association of mammalian splicing and conserved long-range RNA structures. RNA (NEW YORK, N.Y.) 2012; 18:1-15. [PMID: 22128342 PMCID: PMC3261731 DOI: 10.1261/rna.029249.111] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.2] [Indexed: 05/07/2023]
Abstract
Pre-mRNA structure impacts many cellular processes, including splicing in genes associated with disease. The contemporary paradigm of RNA structure prediction is biased toward secondary structures that occur within short ranges of pre-mRNA, although long-range base-pairings are known to be at least as important. Recently, we developed an efficient method for detecting conserved RNA structures on the genome-wide scale, one that does not require multiple sequence alignments and works equally well for the detection of local and long-range base-pairings. Using an enhanced method that detects base-pairings at all possible combinations of splice sites within each gene, we now report RNA structures that could be involved in the regulation of splicing in mammals. Statistically, we demonstrate strong association between the occurrence of conserved RNA structures and alternative splicing, where local RNA structures are generally more frequent at alternative donor splice sites, while long-range structures are more associated with weak alternative acceptor splice sites. As an example, we validated the RNA structure in the human SF1 gene using minigenes in the HEK293 cell line. Point mutations that disrupted the base-pairing of two complementary boxes between exons 9 and 10 of this gene altered the splicing pattern, while the compensatory mutations that reestablished the base-pairing reverted splicing to that of the wild-type. There is statistical evidence for a Dscam-like class of mammalian genes, in which mutually exclusive RNA structures control mutually exclusive alternative splicing. In sum, we propose that long-range base-pairings carry an important, yet unconsidered part of the splicing code, and that, even by modest estimates, there must be thousands of such potentially regulatory structures conserved throughout the evolutionary history of mammals.
Affiliation(s)
- Dmitri D Pervouchine
- Department of Bioengineering and Bioinformatics, Moscow State University, Moscow, 119992, GSP-2 Russia.
35
Mercer TR, Neph S, Dinger ME, Crawford J, Smith MA, Shearwood AMJ, Haugen E, Bracken CP, Rackham O, Stamatoyannopoulos JA, Filipovska A, Mattick JS. The human mitochondrial transcriptome. Cell 2011; 146:645-58. [PMID: 21854988 DOI: 10.1016/j.cell.2011.06.051] [Citation(s) in RCA: 590] [Impact Index Per Article: 45.4] [Received: 11/30/2010] [Revised: 06/15/2011] [Accepted: 06/27/2011] [Indexed: 11/27/2022]
Abstract
The human mitochondrial genome comprises a distinct genetic system transcribed as precursor polycistronic transcripts that are subsequently cleaved to generate individual mRNAs, tRNAs, and rRNAs. Here, we provide a comprehensive analysis of the human mitochondrial transcriptome across multiple cell lines and tissues. Using directional deep sequencing and parallel analysis of RNA ends, we demonstrate wide variation in mitochondrial transcript abundance and precisely resolve transcript processing and maturation events. We identify previously undescribed transcripts, including small RNAs, and observe the enrichment of several nuclear RNAs in mitochondria. Using high-throughput in vivo DNaseI footprinting, we establish the global profile of DNA-binding protein occupancy across the mitochondrial genome at single-nucleotide resolution, revealing regulatory features at mitochondrial transcription initiation sites and functional insights into disease-associated variants. This integrated analysis of the mitochondrial transcriptome reveals unexpected complexity in the regulation, expression, and processing of mitochondrial RNA and provides a resource for future studies of mitochondrial function (accessed at http://mitochondria.matticklab.com).
Affiliation(s)
- Tim R Mercer
- Institute for Molecular Bioscience, The University of Queensland, Brisbane QLD 4072, Australia
|
36
|
Abstract
Non-coding RNAs (ncRNAs) are receiving more and more attention not only as an abundant class of genes, but also as regulatory structural elements (some located in mRNAs). A key feature of RNA function is its structure. Computational methods were developed early for folding and prediction of RNA structure with the aim of assisting in functional analysis. With the discovery of more and more ncRNAs, it has become clear that a large fraction of these are highly structured. Interestingly, a large part of the structure is comprised of regular Watson-Crick and GU wobble base pairs. This and the increased amount of available genomes have made it possible to employ structure-based methods for genomic screens. The field has moved from folding prediction of single sequences to computational screens for ncRNAs in genomic sequence using the RNA structure as the main characteristic feature. Whereas early methods focused on energy-directed folding of single sequences, comparative analysis based on structure preserving changes of base pairs has been efficient in improving accuracy, and today this constitutes a key component in genomic screens. Here, we cover the basic principles of RNA folding and touch upon some of the concepts in current methods that have been applied in genomic screens for de novo RNA structures in searches for novel ncRNA genes and regulatory RNA structure on mRNAs. We discuss the strengths and weaknesses of the different strategies and how they can complement each other.
|
37
|
Findeiss S, Engelhardt J, Prohaska SJ, Stadler PF. Protein-coding structured RNAs: A computational survey of conserved RNA secondary structures overlapping coding regions in drosophilids. Biochimie 2011; 93:2019-23. [PMID: 21835221 DOI: 10.1016/j.biochi.2011.07.023] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2011] [Accepted: 07/19/2011] [Indexed: 11/15/2022]
Abstract
Functional RNA elements can also be embedded within exonic sequences coding for functional proteins. While not uncommon in viruses, only a few examples of this type have been described in some detail for eukaryotic genomes. Here we use RNAz and RNAcode, two comparative genomics methods that measure signatures of stabilizing selection acting on RNA secondary structure and peptide sequence, respectively, to survey the fruit fly genomes. We estimate that there might be on the order of 1000 loci that are subject to dual selection pressure. The genome-wide screens used also expose the limitations of the currently available methods.
Affiliation(s)
- Sven Findeiss
- Department of Theoretical Chemistry, University of Vienna, Währingerstraße 17, A-1090 Wien, Austria.
|
38
|
Khaitan D, Dinger ME, Mazar J, Crawford J, Smith MA, Mattick JS, Perera RJ. The melanoma-upregulated long noncoding RNA SPRY4-IT1 modulates apoptosis and invasion. Cancer Res 2011; 71:3852-62. [PMID: 21558391 DOI: 10.1158/0008-5472.can-10-4460] [Citation(s) in RCA: 375] [Impact Index Per Article: 28.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022]
Abstract
The identification of cancer-associated long noncoding RNAs (lncRNAs) and the investigation of their molecular and biological functions are important to understand the molecular biology of cancer and its progression. Although the functions of lncRNAs and the mechanisms regulating their expression are largely unknown, recent studies are beginning to unravel their importance in human health and disease. Here, we report that a number of lncRNAs are differentially expressed in melanoma cell lines in comparison to melanocytes and keratinocyte controls. One of these lncRNAs, SPRY4-IT1 (GenBank accession ID AK024556), is derived from an intron of the SPRY4 gene and is predicted to contain several long hairpins in its secondary structure. RNA-FISH analysis showed that SPRY4-IT1 is predominantly localized in the cytoplasm of melanoma cells, and SPRY4-IT1 RNAi knockdown results in defects in cell growth, differentiation, and higher rates of apoptosis in melanoma cell lines. Differential expression of both SPRY4 and SPRY4-IT1 was also detected in vivo, in 30 distinct patient samples, classified as primary in situ, regional metastatic, distant metastatic, and nodal metastatic melanoma. The elevated expression of SPRY4-IT1 in melanoma cells compared to melanocytes, its accumulation in cell cytoplasm, and effects on cell dynamics, including increased rate of wound closure on SPRY4-IT1 overexpression, suggest that the higher expression of SPRY4-IT1 may have an important role in the molecular etiology of human melanoma.
Affiliation(s)
- Divya Khaitan
- Sanford Burnham Medical Research Institute, Orlando, Florida 32827, USA
|
39
|
Washietl S, Findeiss S, Müller SA, Kalkhof S, von Bergen M, Hofacker IL, Stadler PF, Goldman N. RNAcode: robust discrimination of coding and noncoding regions in comparative sequence data. RNA (NEW YORK, N.Y.) 2011; 17:578-94. [PMID: 21357752 PMCID: PMC3062170 DOI: 10.1261/rna.2536111] [Citation(s) in RCA: 146] [Impact Index Per Article: 11.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/03/2023]
Abstract
With the availability of genome-wide transcription data and massive comparative sequencing, the discrimination of coding from noncoding RNAs and the assessment of coding potential in evolutionarily conserved regions arose as a core analysis task. Here we present RNAcode, a program to detect coding regions in multiple sequence alignments that is optimized for emerging applications not covered by current protein gene-finding software. Our algorithm combines information from nucleotide substitution and gap patterns in a unified framework and also deals with real-life issues such as alignment and sequencing errors. It uses an explicit statistical model with no machine learning component and can therefore be applied "out of the box," without any training, to data from all domains of life. We describe the RNAcode method and apply it in combination with mass spectrometry experiments to predict and confirm seven novel short peptides in Escherichia coli and to analyze the coding potential of RNAs previously annotated as "noncoding." RNAcode is open source software and available for all major platforms at http://wash.github.com/rnacode.
Affiliation(s)
- Stefan Washietl
- EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB101SD, United Kingdom.
|
40
|
Abstract
Rapid improvements in high-throughput experimental technologies now make it possible to study in detail the expression, as well as changes in expression, of whole transcriptomes under different environmental conditions. We describe current approaches to identify genome-wide functional RNA transcripts (experimentally as well as computationally), and focus on computational methods that may be utilized to disclose their function. While genome databases offer a wealth of information about known and putative functions for protein-coding genes, functional information for novel non-coding RNA genes is almost nonexistent. This is mainly explained by the lack of established software tools to efficiently reveal the function and evolutionary origin of non-coding RNA genes. Here, we describe in detail computational approaches one may follow to annotate and classify an RNA transcript.
Affiliation(s)
- Kristin Reiche
- Fraunhofer Institute for Cell Therapy and Immunology, Leipzig, Germany
|
41
|
Saito Y, Sato K, Sakakibara Y. Robust and accurate prediction of noncoding RNAs from aligned sequences. BMC Bioinformatics 2010; 11 Suppl 7:S3. [PMID: 21106125 PMCID: PMC2957686 DOI: 10.1186/1471-2105-11-s7-s3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND: Computational prediction of noncoding RNAs (ncRNAs) is an important task in the post-genomic era. One common approach is to utilize the profile information contained in alignment data rather than single sequences. However, this strategy involves the possibility that the quality of input alignments can influence the performance of prediction methods. Therefore, the evaluation of the robustness against alignment errors is necessary as well as the development of accurate prediction methods. RESULTS: We describe a new method, called Profile BPLA kernel, which predicts ncRNAs from alignment data in combination with support vector machines (SVMs). Profile BPLA kernel is an extension of base-pairing profile local alignment (BPLA) kernel which we previously developed for the prediction from single sequences. By utilizing the profile information of alignment data, the proposed kernel can achieve better accuracy than the original BPLA kernel. We show that Profile BPLA kernel outperforms the existing prediction methods which also utilize the profile information using the high-quality structural alignment dataset. In addition to these standard benchmark tests, we extensively evaluate the robustness of Profile BPLA kernel against errors in input alignments. We consider two different types of error: first, that all sequences in an alignment are actually ncRNAs but are aligned ignoring their secondary structures; second, that an alignment contains unrelated sequences which are not ncRNAs but still aligned. In both cases, the effects on the performance of Profile BPLA kernel are surprisingly small. Especially for the latter case, we demonstrate that Profile BPLA kernel is more robust compared to the existing prediction methods. CONCLUSIONS: Profile BPLA kernel provides a promising way for identifying ncRNAs from alignment data. It is more accurate than the existing prediction methods, and can keep its performance under the practical situations in which the quality of input alignments is not necessarily high.
Affiliation(s)
- Yutaka Saito
- Department of Biosciences and Informatics, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa 223-8522, Japan
|
42
|
Nygaard S, Braunstein A, Malsen G, Van Dongen S, Gardner PP, Krogh A, Otto TD, Pain A, Berriman M, McAuliffe J, Dermitzakis ET, Jeffares DC. Long- and short-term selective forces on malaria parasite genomes. PLoS Genet 2010; 6:e1001099. [PMID: 20838588 PMCID: PMC2936524 DOI: 10.1371/journal.pgen.1001099] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2010] [Accepted: 07/28/2010] [Indexed: 11/18/2022] Open
Abstract
Plasmodium parasites, the causal agents of malaria, result in more than 1 million deaths annually. Plasmodium are unicellular eukaryotes with small ∼23 Mb genomes encoding ∼5200 protein-coding genes. The protein-coding genes comprise about half of these genomes. Although evolutionary processes have a significant impact on malaria control, the selective pressures within Plasmodium genomes are poorly understood, particularly in the non-protein-coding portion of the genome. We use evolutionary methods to describe selective processes in both the coding and non-coding regions of these genomes. Based on genome alignments of seven Plasmodium species, we show that protein-coding, intergenic and intronic regions are all subject to purifying selection and we identify 670 conserved non-genic elements. We then use genome-wide polymorphism data from P. falciparum to describe short-term selective processes in this species and identify some candidate genes for balancing (diversifying) selection. Our analyses suggest that there are many functional elements in the non-genic regions of these genomes and that adaptive evolution has occurred more frequently in the protein-coding regions of the genome.
Affiliation(s)
- Sanne Nygaard
- Bioinformatics Centre, University of Copenhagen, Copenhagen, Denmark
- Biotech Research and Innovation Centre, University of Copenhagen, Copenhagen, Denmark
- Center for Social Evolution, University of Copenhagen, Copenhagen, Denmark
- Alexander Braunstein
- Statistics Department, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Google, Inc., Mountain View, California, United States of America
- Gareth Malsen
- Wellcome Trust Sanger Institute, Cambridge, United Kingdom
- Stijn Van Dongen
- RNA Genomics, European Bioinformatics Institute, Cambridge, United Kingdom
- Anders Krogh
- Bioinformatics Centre, University of Copenhagen, Copenhagen, Denmark
- Biotech Research and Innovation Centre, University of Copenhagen, Copenhagen, Denmark
- Thomas D. Otto
- Wellcome Trust Sanger Institute, Cambridge, United Kingdom
- Arnab Pain
- Wellcome Trust Sanger Institute, Cambridge, United Kingdom
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Jeddah, Saudi Arabia
- Jon McAuliffe
- Statistics Department, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Emmanouil T. Dermitzakis
- Wellcome Trust Sanger Institute, Cambridge, United Kingdom
- Department of Genetic Medicine and Development, University of Geneva, Geneva, Switzerland
- * E-mail: (DJ); (ED)
- Daniel C. Jeffares
- Wellcome Trust Sanger Institute, Cambridge, United Kingdom
- Department of Genetics, Evolution and Environment, University College London, United Kingdom
- * E-mail: (DJ); (ED)
|
43
|
Monitoring genomic sequences during SELEX using high-throughput sequencing: neutral SELEX. PLoS One 2010; 5:e9169. [PMID: 20161784 PMCID: PMC2820082 DOI: 10.1371/journal.pone.0009169] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2009] [Accepted: 01/20/2010] [Indexed: 02/07/2023] Open
Abstract
Background: SELEX is a well-established in vitro selection tool to analyze the structure of ligand-binding nucleic acid sequences called aptamers. Genomic SELEX transforms SELEX into a tool to discover novel, genomically encoded RNA or DNA sequences binding a ligand of interest, called genomic aptamers. Concerns have been raised regarding requirements imposed on RNA sequences undergoing SELEX selection. Methodology/Principal Findings: To evaluate SELEX and assess the extent of these effects, we designed and performed a Neutral SELEX experiment omitting the selection step, such that the sequences are under the sole selective pressure of SELEX's amplification steps. Using high-throughput sequencing, we obtained thousands of full-length sequences from the initial genomic library and the pools after each of the 10 rounds of Neutral SELEX. We compared these to sequences obtained from a Genomic SELEX experiment deriving from the same initial library, but screening for RNAs binding with high affinity to the E. coli regulator protein Hfq. With each round of Neutral SELEX, sequences became less stable and changed in nucleotide content, but no sequences were enriched. In contrast, we detected substantial enrichment in the Hfq-selected set with enriched sequences having structural stability similar to the neutral sequences but with significantly different nucleotide selection. Conclusions/Significance: Our data indicate that positive selection in SELEX acts independently of the neutral selective requirements imposed on the sequences. We conclude that Genomic SELEX, when combined with high-throughput sequencing of positively and neutrally selected pools, as well as the genomic library, is a powerful method to identify genomic aptamers.
|
44
|
Gorodkin J, Hofacker IL, Torarinsson E, Yao Z, Havgaard JH, Ruzzo WL. De novo prediction of structured RNAs from genomic sequences. Trends Biotechnol 2009; 28:9-19. [PMID: 19942311 DOI: 10.1016/j.tibtech.2009.09.006] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2009] [Revised: 08/31/2009] [Accepted: 09/22/2009] [Indexed: 12/29/2022]
Abstract
Growing recognition of the numerous, diverse and important roles played by non-coding RNA in all organisms motivates better elucidation of these cellular components. Comparative genomics is a powerful tool for this task and is arguably preferable to any high-throughput experimental technology currently available, because evolutionary conservation highlights functionally important regions. Conserved secondary structure, rather than primary sequence, is the hallmark of many functionally important RNAs, because compensatory substitutions in base-paired regions preserve structure. Unfortunately, such substitutions also obscure sequence identity and confound alignment algorithms, which complicates analysis greatly. This paper surveys recent computational advances in this difficult arena, which have enabled genome-scale prediction of cross-species conserved RNA elements. These predictions suggest that a wealth of these elements indeed exist.
Affiliation(s)
- Jan Gorodkin
- Section for Genetics and Bioinformatics, IBHV and Center for Applied Bioinformatics, University of Copenhagen, Grønnegårdsvej 3, DK-1870 Frederiksberg C, Denmark.
|
45
|
Bernhart SH, Hofacker IL. From consensus structure prediction to RNA gene finding. BRIEFINGS IN FUNCTIONAL GENOMICS AND PROTEOMICS 2009; 8:461-71. [PMID: 19833701 DOI: 10.1093/bfgp/elp043] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
Reliable structure prediction is a prerequisite for most types of bioinformatical analysis of RNA. Since the accuracy of structure prediction from single sequences is limited, one often resorts to computing the consensus structure for a set of related RNA sequences. Since functionally important RNA structures are expected to evolve much more slowly than the underlying sequences, the pattern of sequence (co-)variation can be exploited to dramatically improve structure prediction. Since a conserved common structure is only expected when the RNA structure is under selective pressure, consensus structure prediction also provides an ideal starting point for the de novo detection of structured non-coding RNAs. Here, we review different strategies for the prediction of consensus secondary structures, and show how these approaches can be used to predict non-coding RNA genes.
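The idea that sequence (co-)variation supports a conserved base pair can be illustrated with a minimal sketch. This is not the method of any tool reviewed here; the function names, the toy alignment, and the "at least two distinct pair types" criterion are assumptions chosen purely for illustration.

```python
# Illustrative sketch only: a crude covariation check for two alignment columns.
# Names and the threshold below are hypothetical, not taken from any cited tool.

# Canonical Watson-Crick pairs plus GU wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def column_pair_support(alignment, i, j):
    """Return (number of sequences that can pair at columns i and j,
    number of distinct pair types observed among them)."""
    observed = [(seq[i], seq[j]) for seq in alignment]
    pairing = [p for p in observed if p in PAIRS]
    return len(pairing), len(set(pairing))


def is_covarying(alignment, i, j):
    """True if every sequence can pair at (i, j) and at least two different
    pair types occur, i.e. the sequence changed but the pair was preserved."""
    n_pairing, n_types = column_pair_support(alignment, i, j)
    return n_pairing == len(alignment) and n_types >= 2


# Toy alignment (assumed data): columns 0 and 7 covary, columns 1 and 6 are
# conserved G-C pairs, columns 2 and 5 cannot pair.
toy_alignment = ["GGAAAACC", "AGAAAACU", "CGAAAACG"]
print(is_covarying(toy_alignment, 0, 7))
print(is_covarying(toy_alignment, 1, 6))
```

Column pairs that keep pairing while the nucleotides change are the signal of interest; a conserved pair with no variation is compatible with a structure but carries no covariation evidence on its own.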
Affiliation(s)
- Stephan H Bernhart
- Department of Theoretical Chemistry, University of Vienna, Währingerstrasse 17, A-1090 Wien, Austria.
|
46
|
Bradley RK, Uzilov AV, Skinner ME, Bendaña YR, Barquist L, Holmes I. Evolutionary modeling and prediction of non-coding RNAs in Drosophila. PLoS One 2009; 4:e6478. [PMID: 19668382 PMCID: PMC2721679 DOI: 10.1371/journal.pone.0006478] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2009] [Accepted: 06/30/2009] [Indexed: 12/19/2022] Open
Abstract
We performed benchmarks of phylogenetic grammar-based ncRNA gene prediction, experimenting with eight different models of structural evolution and two different programs for genome alignment. We evaluated our models using alignments of twelve Drosophila genomes. We find that ncRNA prediction performance can vary greatly between different gene predictors and subfamilies of ncRNA gene. Our estimates for false positive rates are based on simulations which preserve local islands of conservation; using these simulations, we predict a higher rate of false positives than previous computational ncRNA screens have reported. Using one of the tested prediction grammars, we provide an updated set of ncRNA predictions for D. melanogaster and compare them to previously-published predictions and experimental data. Many of our predictions show correlations with protein-coding genes. We found significant depletion of intergenic predictions near the 3' end of coding regions and furthermore depletion of predictions in the first intron of protein-coding genes. Some of our predictions are colocated with larger putative unannotated genes: for example, 17 of our predictions showing homology to the RFAM family snoR28 appear in a tandem array on the X chromosome; the 4.5 Kbp spanned by the predicted tandem array is contained within a FlyBase-annotated cDNA.
Affiliation(s)
- Robert K. Bradley
- Biophysics Graduate Group, University of California, Berkeley, California, United States of America
- Andrew V. Uzilov
- Department of Bioengineering, University of California, Berkeley, California, United States of America
- Mitchell E. Skinner
- Department of Bioengineering, University of California, Berkeley, California, United States of America
- Yuri R. Bendaña
- Department of Bioengineering, University of California, Berkeley, California, United States of America
- Lars Barquist
- Department of Bioengineering, University of California, Berkeley, California, United States of America
- Ian Holmes
- Biophysics Graduate Group, University of California, Berkeley, California, United States of America
- Department of Bioengineering, University of California, Berkeley, California, United States of America
|
47
|
Rose D, Jöris J, Hackermüller J, Reiche K, Li Q, Stadler PF. Duplicated RNA genes in teleost fish genomes. J Bioinform Comput Biol 2009; 6:1157-75. [PMID: 19090022 DOI: 10.1142/s0219720008003886] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/01/2007] [Revised: 06/17/2008] [Accepted: 06/18/2008] [Indexed: 12/29/2022]
Abstract
Teleost fishes share a duplication of their entire genomes. We report here on a computational survey of structured non-coding RNAs (ncRNAs) in teleost genomes, focusing on the fate of fish-specific duplicates. As in other metazoan groups, we find evidence of a large number (11,543) of structured RNAs, most of which (~86%) are clade-specific or evolve so fast that their tetrapod homologs cannot be detected. In surprising contrast to protein-coding genes, the fish-specific genome duplication did not lead to a large number of paralogous ncRNAs: only 188 candidates, mostly microRNAs, appear in a larger copy number in teleosts than in tetrapods, suggesting that large-scale gene duplications do not play a major role in the expansion of the vertebrate ncRNA inventory.
Affiliation(s)
- Dominic Rose
- Bioinformatics Group, Department of Computer Science, Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany.
|
48
|
Bernhart SH, Hofacker IL, Will S, Gruber AR, Stadler PF. RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics 2008; 9:474. [PMID: 19014431 PMCID: PMC2621365 DOI: 10.1186/1471-2105-9-474] [Citation(s) in RCA: 412] [Impact Index Per Article: 25.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2008] [Accepted: 11/11/2008] [Indexed: 11/17/2022] Open
Abstract
Background: The prediction of a consensus structure for a set of related RNAs is an important first step for subsequent analyses. RNAalifold, which computes the minimum energy structure that is simultaneously formed by a set of aligned sequences, is one of the oldest and most widely used tools for this task. In recent years, several alternative approaches have been advocated, pointing to several shortcomings of the original RNAalifold approach. Results: We show that the accuracy of RNAalifold predictions can be improved substantially by introducing a different, more rational handling of alignment gaps, and by replacing the rather simplistic model of covariance scoring with more sophisticated RIBOSUM-like scoring matrices. These improvements are achieved without compromising the computational efficiency of the algorithm. We show here that the new version of RNAalifold not only outperforms the old one, but also several other tools recently developed, on different datasets. Conclusion: The new version of RNAalifold not only can replace the old one for almost any application but it is also competitive with other approaches including those based on SCFGs, maximum expected accuracy, or hierarchical nearest neighbor classifiers.
Affiliation(s)
- Stephan H Bernhart
- Department of Computer Science, University of Leipzig, Leipzig, Germany.
|