1
Zhang Y, Lang M, Jiang J, Gao Z, Xu F, Litfin T, Chen K, Singh J, Huang X, Song G, Tian Y, Zhan J, Chen J, Zhou Y. Multiple sequence alignment-based RNA language model and its application to structural inference. Nucleic Acids Res 2024; 52:e3. [PMID: 37941140 PMCID: PMC10783488 DOI: 10.1093/nar/gkad1031] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/10/2023] [Accepted: 10/21/2023] [Indexed: 11/10/2023]
Abstract
Compared with proteins, DNA and RNA are more difficult languages to interpret because four-letter coded DNA/RNA sequences have less information content than 20-letter coded protein sequences. While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing the evolutionary information from homologous sequences because unlike proteins, RNA sequences are less conserved. Here, we have developed an unsupervised multiple sequence alignment-based RNA language model (RNA-MSM) by utilizing homologous sequences from an automatic pipeline, RNAcmap, as it can provide significantly more homologous sequences than manually annotated Rfam. We demonstrate that the resulting unsupervised, two-dimensional attention maps and one-dimensional embeddings from RNA-MSM contain structural information. In fact, they can be directly mapped with high accuracy to 2D base pairing probabilities and 1D solvent accessibilities, respectively. Further fine-tuning led to significantly improved performance on these two downstream tasks compared with existing state-of-the-art techniques including SPOT-RNA2 and RNAsnap2. By comparison, RNA-FM, a BERT-based RNA language model, performs worse than one-hot encoding with its embedding in base pair and solvent-accessible surface area prediction. We anticipate that the pre-trained RNA-MSM model can be fine-tuned on many other tasks related to RNA structure and function.
Affiliation(s)
- Yikun Zhang
  - School of Electronic and Computer Engineering, Peking University, Shenzhen 518055, China
  - AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, Shenzhen 518055, China
- Mei Lang
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518107, China
- Jiuhong Jiang
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518107, China
- Zhiqiang Gao
  - Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
  - Peng Cheng Laboratory, Shenzhen 518066, China
- Fan Xu
  - Peng Cheng Laboratory, Shenzhen 518066, China
- Thomas Litfin
  - Institute for Glycomics, Griffith University, Parklands Dr, Southport, QLD 4215, Australia
- Ke Chen
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518107, China
- Jaswinder Singh
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518107, China
- Guoli Song
  - Peng Cheng Laboratory, Shenzhen 518066, China
- Jian Zhan
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518107, China
- Jie Chen
  - School of Electronic and Computer Engineering, Peking University, Shenzhen 518055, China
  - Peng Cheng Laboratory, Shenzhen 518066, China
- Yaoqi Zhou
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518107, China
  - Institute for Glycomics, Griffith University, Parklands Dr, Southport, QLD 4215, Australia
2
Backofen R, Gorodkin J, Hofacker IL, Stadler PF. Comparative RNA Genomics. Methods Mol Biol 2024; 2802:347-393. [PMID: 38819565 DOI: 10.1007/978-1-0716-3838-5_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2024]
Abstract
Over the last quarter of a century it has become clear that RNA is much more than just a boring intermediate in protein expression. Ancient RNAs still appear in the core information metabolism and comprise a surprisingly large component in bacterial gene regulation. A common theme with these types of mostly small RNAs is their reliance on conserved secondary structures. Large-scale sequencing projects, on the other hand, have profoundly changed our understanding of eukaryotic genomes. Pervasively transcribed, they give rise to a plethora of large and evolutionarily extremely flexible non-coding RNAs that exert a vastly diverse array of molecular functions. In this chapter we provide a (necessarily incomplete) overview of the current state of comparative analysis of non-coding RNAs, emphasizing computational approaches as a means to gain a global picture of the modern RNA world.
Affiliation(s)
- Rolf Backofen
  - Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
  - Center for Non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark
- Jan Gorodkin
  - Center for Non-coding RNA in Technology and Health, Department of Veterinary and Animal Sciences, University of Copenhagen, Frederiksberg, Denmark
- Ivo L Hofacker
  - Institute for Theoretical Chemistry, University of Vienna, Wien, Austria
  - Bioinformatics and Computational Biology research group, University of Vienna, Vienna, Austria
  - Center for Non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark
- Peter F Stadler
  - Bioinformatics Group, Department of Computer Science, University of Leipzig, Leipzig, Germany
  - Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany
  - Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
  - Universidad Nacional de Colombia, Bogotá, Colombia
  - Institute for Theoretical Chemistry, University of Vienna, Wien, Austria
  - Center for Non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark
  - Santa Fe Institute, Santa Fe, NM, USA
3
Kilar AM, Fajkus P, Fajkus J. GERONIMO: A tool for systematic retrieval of structural RNAs in a broad evolutionary context. Gigascience 2022; 12:giad080. [PMID: 37848616 PMCID: PMC10580375 DOI: 10.1093/gigascience/giad080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2023] [Revised: 08/04/2023] [Accepted: 09/11/2023] [Indexed: 10/19/2023]
Abstract
BACKGROUND: While web-based tools such as BLAST have made identifying conserved gene homologs appear easy, genes with variable sequences pose significant challenges. Functionally important noncoding RNAs (ncRNAs) often show low sequence conservation due to genetic variations, including insertions and deletions. Rather than conserved sequences, these RNAs possess highly conserved structural features across a broad phylogenetic range. Such features can be identified using the covariance model approach, which combines sequence alignment with a consensus RNA secondary structure. However, running the standard implementation of that approach (Infernal) requires advanced bioinformatics knowledge, unlike user-friendly web services such as BLAST. The issue is partially addressed by RNAcentral, which can be used to search for homologs across a broad range of ncRNA sequence collections from diverse organisms but not across genome assemblies. RESULTS: Here, we present GERONIMO, which conducts evolutionary searches across hundreds of genomes in a fully automated way. It provides results extended with taxonomic context, as summary tables and visualizations, to facilitate analysis. Additionally, GERONIMO supplements homologous sequences with genomic regions to analyze promoter motifs or gene collinearity, enhancing the validation of results. CONCLUSION: GERONIMO, built using Snakemake, has undergone extensive testing on hundreds of genomes, establishing itself as a valuable tool in the identification of ncRNA homologs across diverse taxonomic groups. Consequently, GERONIMO facilitates the investigation of the evolutionary patterns of functionally significant ncRNA players, whose understanding has previously been limited to individual organisms and close relatives.
Affiliation(s)
- Agata M Kilar
  - Mendel Centre for Plant Genomics and Proteomics, CEITEC Masaryk University, Brno CZ-62500, Czech Republic
  - Laboratory of Functional Genomics and Proteomics, NCBR, Faculty of Science, Masaryk University, Brno CZ-61137, Czech Republic
- Petr Fajkus
  - Mendel Centre for Plant Genomics and Proteomics, CEITEC Masaryk University, Brno CZ-62500, Czech Republic
  - Department of Cell Biology and Radiobiology, Institute of Biophysics of the Czech Academy of Sciences, Brno CZ-61265, Czech Republic
- Jiří Fajkus
  - Mendel Centre for Plant Genomics and Proteomics, CEITEC Masaryk University, Brno CZ-62500, Czech Republic
  - Laboratory of Functional Genomics and Proteomics, NCBR, Faculty of Science, Masaryk University, Brno CZ-61137, Czech Republic
  - Department of Cell Biology and Radiobiology, Institute of Biophysics of the Czech Academy of Sciences, Brno CZ-61265, Czech Republic
4
Seemann SE, Mirza AH, Bang-Berthelsen CH, Garde C, Christensen-Dalsgaard M, Workman CT, Pociot F, Tommerup N, Gorodkin J, Ruzzo WL. Does rapid sequence divergence preclude RNA structure conservation in vertebrates? Nucleic Acids Res 2022; 50:2452-2463. [PMID: 35188540 PMCID: PMC8934657 DOI: 10.1093/nar/gkac067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2021] [Revised: 01/07/2022] [Accepted: 01/25/2022] [Indexed: 12/01/2022]
Abstract
Accelerated evolution of any portion of the genome is of significant interest, potentially signaling positive selection of phenotypic traits and adaptation. Accelerated evolution remains understudied for structured RNAs, despite the fact that an RNA’s structure is often key to its function. RNA structures are typically characterized by compensatory (structure-preserving) basepair changes that are unexpected given the underlying sequence variation, i.e., they have evolved through negative selection on structure. We address the question of how fast the primary sequence of an RNA can change through evolution while conserving its structure. Specifically, we consider predicted and known structures in vertebrate genomes. After careful control of false discovery rates, we obtain 13 de novo structures (and three known Rfam structures) that we predict to have rapidly evolving sequences—defined as structures where the primary sequences of human and mouse have diverged at least twice as fast (1.5 times for Rfam) as nearby neutrally evolving sequences. Two of the three known structures function in translation inhibition related to infection and immune response. We conclude that rapid sequence divergence does not preclude RNA structure conservation in vertebrates, although these events are relatively rare.
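The notion of a compensatory (structure-preserving) base-pair change mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' method: the function names, the toy sequences, and the restriction to pseudoknot-free dot-bracket structures are all assumptions made for illustration.

```python
# Canonical Watson-Crick pairs plus the G-U wobble pair.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def parse_pairs(dot_bracket):
    """Return sorted (i, j) index pairs from a pseudoknot-free dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return sorted(pairs)


def count_compensatory(seq_a, seq_b, dot_bracket):
    """Count base pairs where both bases differ between two aligned sequences
    while each sequence still forms a canonical pair at that position."""
    count = 0
    for i, j in parse_pairs(dot_bracket):
        pair_a = (seq_a[i], seq_a[j])
        pair_b = (seq_b[i], seq_b[j])
        both_changed = pair_a[0] != pair_b[0] and pair_a[1] != pair_b[1]
        if both_changed and pair_a in CANONICAL and pair_b in CANONICAL:
            count += 1
    return count


# Hypothetical toy alignment: the G-C pair at positions (1, 8) becomes A-U.
human = "GGCAAAUGCC"
mouse = "GACAAAUGUC"
structure = "((((..))))"
print(count_compensatory(human, mouse, structure))
```

In this toy case the sequence changes at two positions while the helix remains intact, which is the pattern the abstract describes as unexpected under plain sequence variation.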
Affiliation(s)
- Aashiq H Mirza
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, Denmark
  - Steno Diabetes Center Copenhagen, Gentofte, Denmark
- Claus H Bang-Berthelsen
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, Denmark
  - National Food Institute, Technical University of Denmark, Kgs. Lyngby, Denmark
- Christian Garde
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, Denmark
- Christopher T Workman
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, Denmark
  - Center for Biological Sequence Analysis, Technical University of Denmark, Denmark
- Flemming Pociot
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, Denmark
  - Steno Diabetes Center Copenhagen, Gentofte, Denmark
- Niels Tommerup
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, Denmark
  - Department of Cellular and Molecular Medicine (ICMM), University of Copenhagen, Denmark
- Jan Gorodkin
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, Denmark
  - Department of Veterinary and Animal Sciences, University of Copenhagen, Denmark
- Walter L Ruzzo
  - Center for non-coding RNA in Technology and Health (RTH), University of Copenhagen, Denmark
  - Computer Science and Engineering and Genome Sciences, University of Washington, USA
  - Fred Hutchinson Cancer Research Center, Seattle, USA
5
Parra-Rincón E, Velandia-Huerto CA, Gittenberger A, Fallmann J, Gatter T, Brown FD, Stadler PF, Bermúdez-Santana CI. The Genome of the "Sea Vomit" Didemnum vexillum. Life (Basel) 2021; 11:1377. [PMID: 34947908 PMCID: PMC8704543 DOI: 10.3390/life11121377] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/03/2021] [Revised: 12/02/2021] [Accepted: 12/03/2021] [Indexed: 11/25/2022]
Abstract
Tunicates are the sister group of vertebrates and thus occupy a key position for investigations into vertebrate innovations as well as into the consequences of the vertebrate-specific genome duplications. Nevertheless, tunicate genomes have not been studied extensively in the past, and comparative studies of tunicate genomes have remained scarce. The carpet sea squirt Didemnum vexillum, commonly known as "sea vomit", is a colonial tunicate considered an invasive species with substantial ecological and economic risk. We report the assembly of the D. vexillum genome using a hybrid approach that combines 28.5 Gb of Illumina and 12.35 Gb of PacBio data. The new hybrid scaffolded assembly has a total size of 517.55 Mb and increases contig length about eightfold compared to the previous, Illumina-only assembly. As a consequence of an unusually high genetic diversity of the colonies and the moderate length of the PacBio reads, presumably caused by the unusually acidic milieu of the tunic, the assembly is highly fragmented (L50 = 25,284, N50 = 6539). It is sufficient, however, for comprehensive annotations of both protein-coding genes and non-coding RNAs. Despite its shortcomings, the draft assembly of the "sea vomit" genome provides a valuable resource for comparative tunicate genomics and for the study of the specific properties of colonial ascidians.
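For readers unfamiliar with the fragmentation statistics quoted in the abstract above, the following minimal sketch shows the conventional definitions: N50 is the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches half the assembly size, and L50 is the number of contigs needed to get there. The function name and the toy contig lengths are hypothetical and are not taken from the paper.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # 'length' is the N50; 'count' contigs were needed, which is the L50.
            return length, count
    raise ValueError("lengths must be non-empty")


# Hypothetical toy assembly with five contigs (total length 300).
print(n50_l50([100, 80, 60, 40, 20]))
```

A large L50 combined with a small N50, as reported for this assembly, means that many short contigs are needed to cover half of the genome.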
Affiliation(s)
- Ernesto Parra-Rincón
  - Biology Department, Universidad Nacional de Colombia, Carrera 45 # 26-85, Edif. Uriel Gutiérrez, Bogotá D.C 111321, Colombia
- Cristian A. Velandia-Huerto
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Leipzig University, 04107 Leipzig, Germany
- Adriaan Gittenberger
  - GiMaRIS, Rijksstraatweg 75, 2171 AK Sassenheim, The Netherlands
  - Institute of Biology, Leiden University, P.O. Box 9505, 2300 RA Leiden, The Netherlands
  - Naturalis Biodiversity Center, Darwinweg 2, 2333 CR Leiden, The Netherlands
- Jörg Fallmann
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Leipzig University, 04107 Leipzig, Germany
- Thomas Gatter
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Leipzig University, 04107 Leipzig, Germany
- Federico D. Brown
  - Departamento de Zoologia, Instituto Biociências, Universidade de São Paulo, Rua do Matão, Tr. 14 no. 101, São Paulo 05508-090, Brazil
  - Centro de Biologia Marinha, Universidade de São Paulo, Rod. Manuel Hypólito do Rego km. 131.5, São Sebastião 11612-109, Brazil
- Peter F. Stadler
  - Biology Department, Universidad Nacional de Colombia, Carrera 45 # 26-85, Edif. Uriel Gutiérrez, Bogotá D.C 111321, Colombia
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Leipzig University, 04107 Leipzig, Germany
  - Max Planck Institute for Mathematics in the Sciences, 04103 Leipzig, Germany
  - Institute for Theoretical Chemistry, University of Vienna, 1090 Vienna, Austria
  - Santa Fe Institute, Santa Fe, NM 87506, USA
- Clara I. Bermúdez-Santana
  - Biology Department, Universidad Nacional de Colombia, Carrera 45 # 26-85, Edif. Uriel Gutiérrez, Bogotá D.C 111321, Colombia
6
Zammit A, Helwerda L, Olsthoorn RCL, Verbeek FJ, Gultyaev AP. A database of flavivirus RNA structures with a search algorithm for pseudoknots and triple base interactions. Bioinformatics 2021; 37:956-962. [PMID: 32866223 PMCID: PMC8128465 DOI: 10.1093/bioinformatics/btaa759] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/29/2020] [Revised: 08/19/2020] [Accepted: 08/24/2020] [Indexed: 12/13/2022]
Abstract
Motivation: The Flavivirus genus includes several important pathogens, such as Zika, dengue and yellow fever virus. Flavivirus RNA genomes contain a number of functionally important structures in their 3′ untranslated regions (3′UTRs). Due to the diversity of sequences and topologies of these structures, their identification is often difficult. Nevertheless, predictions of such structures are important for understanding flavivirus replication cycles and developing antiviral strategies. Results: We have developed an algorithm for structured pattern search in RNA sequences, including secondary structures, pseudoknots and triple base interactions. Using the data on known conserved flavivirus 3′UTR structures, we constructed structural descriptors which covered the diversity of patterns in these motifs. The descriptors and the search algorithm were used for the construction of a database of flavivirus 3′UTR structures. Validating this approach, we identified a number of domains matching a general pattern of exoribonuclease Xrn1-resistant RNAs in the growing group of insect-specific flaviviruses. Availability and implementation: The Leiden Flavivirus RNA Structure Database is available at https://rna.liacs.nl. The search algorithm is available at https://github.com/LeidenRNA/SRHS. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Alan Zammit
  - Group Imaging & Bioinformatics, Leiden Institute of Advanced Computer Science (LIACS), Leiden University, 2300 RA Leiden, The Netherlands
- Leon Helwerda
  - Group Imaging & Bioinformatics, Leiden Institute of Advanced Computer Science (LIACS), Leiden University, 2300 RA Leiden, The Netherlands
- René C L Olsthoorn
  - Group Supramolecular & Biomaterials Chemistry, Leiden Institute of Chemistry, Leiden University, 2300 RA Leiden, The Netherlands
- Fons J Verbeek
  - Group Imaging & Bioinformatics, Leiden Institute of Advanced Computer Science (LIACS), Leiden University, 2300 RA Leiden, The Netherlands
- Alexander P Gultyaev
  - Group Imaging & Bioinformatics, Leiden Institute of Advanced Computer Science (LIACS), Leiden University, 2300 RA Leiden, The Netherlands
  - Department of Viroscience, Erasmus Medical Center, Rotterdam, 3000 CA, The Netherlands
7
Zhang T, Singh J, Litfin T, Zhan J, Paliwal K, Zhou Y. RNAcmap: A Fully Automatic Pipeline for Predicting Contact Maps of RNAs by Evolutionary Coupling Analysis. Bioinformatics 2021; 37:3494-3500. [PMID: 34021744 DOI: 10.1093/bioinformatics/btab391] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 02/17/2020] [Revised: 03/27/2021] [Accepted: 05/18/2021] [Indexed: 11/13/2022]
Abstract
MOTIVATION: The accuracy of RNA secondary and tertiary structure prediction can be significantly improved by using structural restraints derived from evolutionary coupling or direct coupling analysis. Currently, these coupling analyses rely on manually curated multiple sequence alignments collected in the Rfam database, which contains 3016 families. By comparison, millions of non-coding RNA sequences are known. Here, we established RNAcmap, a fully automatic pipeline that enables evolutionary coupling analysis for any RNA sequence. The homology search was based on the covariance model built by INFERNAL according to two secondary structure predictors: a folding-based algorithm, RNAfold, and the latest deep-learning method, SPOT-RNA. RESULTS: We showed that the performance of RNAcmap depends less on the specific evolutionary coupling tool than on the accuracy of the secondary structure predictor, with the best performance given by RNAcmap (SPOT-RNA). The performance of RNAcmap (SPOT-RNA) is comparable to that based on Rfam-supplied alignments and is consistent for sequences that are not in Rfam collections. Further improvement can be made with a simple meta predictor, RNAcmap (SPOT-RNA/RNAfold), depending on which secondary structure predictor can find more homologous sequences. Reliable base-pairing information generated from RNAcmap, in particular for RNAs with many effective homologous sequences, will be useful for aiding RNA structure prediction. AVAILABILITY: RNAcmap is available as a web server at https://sparks-lab.org/server/rnacmap/ and as a standalone application along with the datasets at https://github.com/sparks-lab-org/RNAcmap_standalone. A platform-independent and fully configured docker image of RNAcmap is also provided at https://hub.docker.com/r/jaswindersingh2/rnacmap.
Affiliation(s)
- Tongchuan Zhang
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
- Jaswinder Singh
  - Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD 4111, Australia
- Thomas Litfin
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
- Jian Zhan
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
- Kuldip Paliwal
  - Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD 4111, Australia
- Yaoqi Zhou
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
  - Institute for Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
8
Velandia-Huerto CA, Fallmann J, Stadler PF. miRNAture-Computational Detection of microRNA Candidates. Genes (Basel) 2021; 12:348. [PMID: 33673400 PMCID: PMC7996739 DOI: 10.3390/genes12030348] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/01/2021] [Revised: 02/19/2021] [Accepted: 02/20/2021] [Indexed: 12/16/2022]
Abstract
Homology-based annotation of short RNAs, including microRNAs, is a difficult problem because their inherently small size limits the available information. Highly sensitive methods, including parameter-optimized blast, nhmmer, or cmsearch runs designed to increase sensitivity, inevitably lead to large numbers of false positives, which can be detected only by detailed analysis of specific features typical for an RNA family and/or the analysis of conservation patterns in structure-annotated multiple sequence alignments. The miRNAture pipeline implements a workflow specific to animal microRNAs that automates homology search and validation steps. The miRNAture pipeline yields very good results for a large number of "typical" miRBase families. However, it also highlights difficulties with atypical cases, in particular microRNAs deriving from repetitive elements and microRNAs with unusual, branched precursor structures and atypical locations of the mature product, which require specific curation by domain experts.
Affiliation(s)
- Cristian A. Velandia-Huerto
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Leipzig University, D-04107 Leipzig, Germany
- Jörg Fallmann
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Leipzig University, D-04107 Leipzig, Germany
- Peter F. Stadler
  - Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Leipzig University, D-04107 Leipzig, Germany
  - Max Planck Institute for Mathematics in the Sciences, D-04103 Leipzig, Germany
  - Institute for Theoretical Chemistry, University of Vienna, A-1090 Wien, Austria
  - Facultad de Ciencias, Universidad Nacional de Colombia, CO-111321 Bogotá, Colombia
  - Santa Fe Institute, Santa Fe, NM 87501, USA
9
Müller T, Miladi M, Hutter F, Hofacker I, Will S, Backofen R. The locality dilemma of Sankoff-like RNA alignments. Bioinformatics 2020; 36:i242-i250. [PMID: 32657398 PMCID: PMC7355259 DOI: 10.1093/bioinformatics/btaa431] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Abstract
Motivation: Elucidating the functions of non-coding RNAs by homology has been strongly limited due to fundamental computational and modeling issues. While existing simultaneous alignment and folding (SA&F) algorithms successfully align homologous RNAs with precisely known boundaries (global SA&F), the more pressing problem of identifying new classes of homologous RNAs in the genome (local SA&F) is intrinsically more difficult and much less understood. Typically, the length of local alignments is strongly overestimated and alignment boundaries are dramatically mispredicted. We hypothesize that local SA&F approaches are compromised this way due to a score bias, which is caused by the contribution of RNA structure similarity to their overall alignment score. Results: In the light of this hypothesis, we study pairwise local SA&F systematically for the first time, based on a novel local RNA alignment benchmark set and quality measure. First, we vary the relative influence of structure similarity compared to sequence similarity. Putting more emphasis on the structure component leads to overestimating the length of local alignments. This clearly shows the bias of current scores and strongly hints at the structure component as its origin. Second, we study the interplay of several important scoring parameters by learning parameters for local and global SA&F. The divergence of these optimized parameter sets underlines the fundamental obstacles for local SA&F. Third, by introducing a position-wise correction term in local SA&F, we constructively solve its principal issues. Availability and implementation: The benchmark data, detailed results and scripts are available at https://github.com/BackofenLab/local_alignment. The RNA alignment tool LocARNA, including the modifications proposed in this work, is available at https://github.com/s-will/LocARNA/releases/tag/v2.0.0RC6. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Teresa Müller
  - Bioinformatics Group, University of Freiburg, Freiburg 79110, Germany
- Milad Miladi
  - Bioinformatics Group, University of Freiburg, Freiburg 79110, Germany
- Frank Hutter
  - Machine Learning Lab, Department of Computer Science, University of Freiburg, Freiburg 79110, Germany
- Ivo Hofacker
  - Theoretical Biochemistry Group (TBI), Institute for Theoretical Chemistry, University of Vienna, Vienna, Wien 1090, Austria
- Sebastian Will
  - Theoretical Biochemistry Group (TBI), Institute for Theoretical Chemistry, University of Vienna, Vienna, Wien 1090, Austria
  - Bioinformatics Group AMIBio, LIX-Laboratoire d'Informatique d'École Polytechnique, IPP, Palaiseau 91120, France
- Rolf Backofen
  - Bioinformatics Group, University of Freiburg, Freiburg 79110, Germany
  - Signalling Research Centres BIOSS and CIBSS, University of Freiburg, Freiburg 79104, Germany
10
Weinberg CE, Weinberg Z, Hammann C. Novel ribozymes: discovery, catalytic mechanisms, and the quest to understand biological function. Nucleic Acids Res 2019; 47:9480-9494. [PMID: 31504786 PMCID: PMC6765202 DOI: 10.1093/nar/gkz737] [Citation(s) in RCA: 46] [Impact Index Per Article: 9.2] [Received: 02/23/2019] [Revised: 08/08/2019] [Accepted: 08/21/2019] [Indexed: 12/21/2022]
Abstract
Small endonucleolytic ribozymes promote the self-cleavage of their own phosphodiester backbone at a specific linkage. The structures of and the reactions catalysed by members of individual families have been studied in great detail in the past decades. In recent years, bioinformatics studies have uncovered a considerable number of new examples of known catalytic RNA motifs. Importantly, entirely novel ribozyme classes were also discovered, for most of which both structural and biochemical information became rapidly available. However, for the majority of the new ribozymes, which are found in the genomes of a variety of species, a biological function remains elusive. Here, we concentrate on the different approaches to find catalytic RNA motifs in sequence databases. We summarize the emerging principles of RNA catalysis as observed for small endonucleolytic ribozymes. Finally, we address the biological functions of those ribozymes, where relevant information is available and common themes on their cellular activities are emerging. We conclude by speculating on the possibility that the identification and characterization of proteins that we hypothesize to be endogenously associated with catalytic RNA might help in answering the ever-present question of the biological function of the growing number of genomically encoded, small endonucleolytic ribozymes.
Affiliation(s)
- Christina E Weinberg
  - Institute for Biochemistry, Leipzig University, Brüderstraße 34, 04103 Leipzig, Germany
- Zasha Weinberg
  - Bioinformatics Group, Department of Computer Science and Interdisciplinary Centre for Bioinformatics, Leipzig University, Härtelstraße 16–18, 04107 Leipzig, Germany
- Christian Hammann
  - Ribogenetics & Biochemistry, Department of Life Sciences and Chemistry, Jacobs University Bremen gGmbH, Campus Ring 1, 28759 Bremen, Germany
11
Abstract
Computational methods can often facilitate the functional characterization of individual sRNAs and furthermore allow high-throughput analysis on large numbers of sRNA candidates. This chapter outlines a potential workflow for computational sRNA analyses and describes in detail methods for homolog detection, target prediction, and functional characterization based on enrichment analysis. The cyanobacterial sRNA IsaR1 is used as a specific example. All methods are available as webservers and easily accessible for nonexpert users.
12
Samson J, Cronin S, Dean K. BC200 (BCYRN1) - The shortest, long, non-coding RNA associated with cancer. Noncoding RNA Res 2018; 3:131-143. [PMID: 30175286 PMCID: PMC6114260 DOI: 10.1016/j.ncrna.2018.05.003] [Citation(s) in RCA: 34] [Impact Index Per Article: 5.7] [Received: 12/08/2017] [Revised: 05/14/2018] [Accepted: 05/17/2018] [Indexed: 12/22/2022]
Abstract
With the discovery that the level of RNA synthesis in human cells far exceeds what is required to express protein-coding genes, there has been a concerted scientific effort to identify, catalogue and uncover the biological functions of the non-coding transcriptome. Long, non-coding RNAs (lncRNAs) are a diverse group of RNAs with equally wide-ranging biological roles in the cell. An increasing number of studies have reported alterations in the expression of lncRNAs in various cancers, although unravelling how they contribute specifically to the disease is a bigger challenge. Originally described as a brain-specific, non-coding RNA, BC200 (BCYRN1) is a 200-nucleotide, predominantly cytoplasmic lncRNA that has been linked to neurodegenerative disease and several types of cancer. Here we summarise what is known about BC200, primarily from studies in neuronal systems, before turning to a review of recent work that aims to understand how this lncRNA contributes to cancer initiation, progression and metastasis, along with its possible clinical utility as a biomarker or therapeutic target.
Affiliation(s)
- K. Dean
- School of Biochemistry and Cell Biology, Western Gateway Building, University College Cork, Cork, Ireland
|
13
|
Lott SC, Schäfer RA, Mann M, Backofen R, Hess WR, Voß B, Georg J. GLASSgo - Automated and Reliable Detection of sRNA Homologs From a Single Input Sequence. Front Genet 2018; 9:124. [PMID: 29719549 PMCID: PMC5913331 DOI: 10.3389/fgene.2018.00124] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 01/16/2018] [Accepted: 03/26/2018] [Indexed: 11/24/2022]
Abstract
Bacterial small RNAs (sRNAs) are important post-transcriptional regulators of gene expression. The functional and evolutionary characterization of sRNAs requires the identification of homologs, which is frequently challenging due to their heterogeneity, short length and, in part, low sequence conservation. We developed the GLobal Automatic Small RNA Search go (GLASSgo) algorithm to identify sRNA homologs in complex genomic databases starting from a single sequence. GLASSgo combines an iterative BLAST strategy with pairwise identity filtering and a graph-based clustering method that utilizes RNA secondary structure information. We tested the specificity, sensitivity and runtime of GLASSgo, BLAST and the combination RNAlien/cmsearch in a typical use case scenario on 40 bacterial sRNA families. The sensitivity of the tested methods was similar, while the specificity of GLASSgo and RNAlien/cmsearch was significantly higher than that of BLAST. GLASSgo was on average ∼87 times faster than RNAlien/cmsearch, and only ∼7.5 times slower than BLAST, which shows that GLASSgo optimizes the trade-off between speed and accuracy in the task of finding sRNA homologs. GLASSgo is fully automated, whereas BLAST often recovers only parts of homologs and RNAlien/cmsearch requires extensive additional bioinformatic work to get a comprehensive set of homologs. GLASSgo is available as an easy-to-use web server to find homologous sRNAs in large databases.
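The pairwise identity filtering step named in the abstract above can be illustrated with a minimal sketch. It assumes the candidate hits are already aligned to the query (gaps written as `-`); the function names and the 60-100% identity window are hypothetical choices for illustration, not GLASSgo's actual code or defaults.

```python
def pairwise_identity(a, b):
    """Percent identity between two aligned sequences of equal length.

    Columns where both sequences have a gap are ignored; a gap opposite
    a residue counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def filter_by_identity(query, hits, low=60.0, high=100.0):
    """Keep hits whose identity to the query lies in [low, high].

    The thresholds are hypothetical, chosen only for this example.
    """
    return [h for h in hits if low <= pairwise_identity(query, h) <= high]


if __name__ == "__main__":
    candidates = ["ACGU", "ACGA", "UUUU"]
    print(filter_by_identity("ACGU", candidates))
```

A filter of this kind discards hits that are too divergent to be trusted before any more expensive structure-aware step is run.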
Affiliation(s)
- Steffen C Lott
- Genetics and Experimental Bioinformatics, Faculty of Biology, University of Freiburg, Freiburg, Germany
- Richard A Schäfer
- Institute of Biochemical Engineering, University of Stuttgart, Stuttgart, Germany
- Martin Mann
- Bioinformatics Group, Faculty of Computer Science, University of Freiburg, Freiburg, Germany; Forest Growth and Dendroecology, Institute of Forest Sciences, University of Freiburg, Freiburg, Germany
- Rolf Backofen
- Bioinformatics Group, Faculty of Computer Science, University of Freiburg, Freiburg, Germany; ZBSA Center for Biological Systems Analysis, University of Freiburg, Freiburg, Germany; BIOSS Centre for Biological Signalling Studies, Cluster of Excellence, University of Freiburg, Freiburg, Germany; Center for Non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark
- Wolfgang R Hess
- Genetics and Experimental Bioinformatics, Faculty of Biology, University of Freiburg, Freiburg, Germany; Freiburg Institute for Advanced Studies, University of Freiburg, Freiburg, Germany
- Björn Voß
- Institute of Biochemical Engineering, University of Stuttgart, Stuttgart, Germany
- Jens Georg
- Genetics and Experimental Bioinformatics, Faculty of Biology, University of Freiburg, Freiburg, Germany
|
14
|
Abstract
Over the last two decades it has become clear that RNA is much more than just a boring intermediate in protein expression. Ancient RNAs still appear in the core information metabolism and comprise a surprisingly large component in bacterial gene regulation. A common theme with these types of mostly small RNAs is their reliance on conserved secondary structures. Large scale sequencing projects, on the other hand, have profoundly changed our understanding of eukaryotic genomes. Pervasively transcribed, they give rise to a plethora of large and evolutionarily extremely flexible noncoding RNAs that exert a vastly diverse array of molecular functions. In this chapter we provide a necessarily incomplete overview of the current state of comparative analysis of noncoding RNAs, emphasizing computational approaches as a means to gain a global picture of the modern RNA world.
Affiliation(s)
- Rolf Backofen
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 106, D-79110 Freiburg, Germany; Center for non-coding RNA in Technology and Health, Department of Veterinary and Animal Sciences, University of Copenhagen, Grønnegårdsvej 3, DK-1870 Frederiksberg C, Denmark
- Jan Gorodkin
- Center for non-coding RNA in Technology and Health, Department of Veterinary and Animal Sciences, University of Copenhagen, Grønnegårdsvej 3, DK-1870 Frederiksberg C, Denmark
- Ivo L Hofacker
- Center for non-coding RNA in Technology and Health, Department of Veterinary and Animal Sciences, University of Copenhagen, Grønnegårdsvej 3, DK-1870 Frederiksberg C, Denmark; Institute for Theoretical Chemistry, University of Vienna, Währingerstraße 17, A-1090 Wien, Austria; Bioinformatics and Computational Biology Research Group, University of Vienna, Währingerstraße 17, A-1090 Vienna, Austria
- Peter F Stadler
- Center for non-coding RNA in Technology and Health, Department of Veterinary and Animal Sciences, University of Copenhagen, Grønnegårdsvej 3, DK-1870 Frederiksberg C, Denmark; Institute for Theoretical Chemistry, University of Vienna, Währingerstraße 17, A-1090 Wien, Austria; Bioinformatics Group, Department of Computer Science, Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstraße 16-18, D-04107 Leipzig, Germany; Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany; Fraunhofer Institute for Cell Therapy and Immunology, Perlickstraße 1, D-04103 Leipzig, Germany; Santa Fe Institute, 1399 Hyde Park Rd, Santa Fe, NM 87501, USA
|
15
|
Abstract
The 7SK RNA is a small nuclear RNA that is involved in the regulation of Pol-II transcription. It is very well conserved in vertebrates, but shows extensive variation in both sequence and structure across invertebrates. A systematic homology search extended the collection of 7SK genes in both Arthropods and Lophotrochozoa, making use of the large number of recently published invertebrate genomes. The extended data set made it possible to infer complete consensus structures for invertebrate 7SK RNAs. These show that not only the well-conserved 5'- and 3'-domains but also the interior Stem A domain is universally conserved. In contrast, the Stem B region exhibits substantial structural variation and does not adhere to a common structural model beyond the phylum level.
Affiliation(s)
- Ali M Yazbeck
- Bioinformatics Group, Department of Computer Science, Leipzig University, Härtelstraße 16-18, Leipzig, Germany; Lebanese University, Doctoral School for Science and Technology, Rafic Hariri University Campus, Hadath, Lebanon
- Kifah R Tout
- Lebanese University, Doctoral School for Science and Technology, Rafic Hariri University Campus, Hadath, Lebanon
- Peter F Stadler
- Bioinformatics Group, Department of Computer Science, Leipzig University, Härtelstraße 16-18, Leipzig, Germany; Interdisciplinary Center for Bioinformatics, German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Competence Center for Scalable Data Services and Solutions, and Leipzig Research Center for Civilization Diseases, Leipzig University; Department of Diagnostics, Fraunhofer Institute for Cell Therapy and Immunology - IZI, Perlickstraße 1, D-04103 Leipzig, Germany; Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany; Department of Theoretical Chemistry, University of Vienna, Währingerstraße 17, A-1090 Wien, Austria; Center for non-coding RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, DK-1870 Frederiksberg C, Denmark; Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA
|
16
|
Fallmann J, Will S, Engelhardt J, Grüning B, Backofen R, Stadler PF. Recent advances in RNA folding. J Biotechnol 2017; 261:97-104. [DOI: 10.1016/j.jbiotec.2017.07.007] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Received: 02/16/2017] [Revised: 07/02/2017] [Accepted: 07/04/2017] [Indexed: 12/23/2022]
|
17
|
Eggenhofer F, Hofacker IL, Höner Zu Siederdissen C. RNAlien - Unsupervised RNA family model construction. Nucleic Acids Res 2016; 44:8433-41. [PMID: 27330139 PMCID: PMC5041467 DOI: 10.1093/nar/gkw558] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 10/31/2015] [Revised: 06/06/2016] [Accepted: 06/08/2016] [Indexed: 02/06/2023]
Abstract
Determining the function of a non-coding RNA requires costly and time-consuming wet-lab experiments. For this reason, computational methods which ascertain the homology of a sequence and thereby deduce functionality and family membership are often exploited. In this fashion, newly sequenced genomes can be annotated in a completely computational way. Covariance models are commonly used to assign novel RNA sequences to a known RNA family. However, to construct such models several examples of the family have to be already known. Moreover, model building is the work of experts who manually edit the necessary RNA alignment and consensus structure. Our method, RNAlien, starting from a single input sequence collects potential family member sequences by multiple iterations of homology search. RNA family models are fully automatically constructed for the found sequences. We have tested our method on a subset of the Rfam RNA family database. RNAlien models are a starting point to construct models of comparable sensitivity and specificity to manually curated ones from the Rfam database. RNAlien Tool and web server are available at http://rna.tbi.univie.ac.at/rnalien/.
Affiliation(s)
- Florian Eggenhofer
- Institute for Theoretical Chemistry, University of Vienna, Währingerstrasse 17, A-1090 Vienna, Austria; Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Köhler-Allee, 79110 Freiburg, Germany
- Ivo L Hofacker
- Institute for Theoretical Chemistry, University of Vienna, Währingerstrasse 17, A-1090 Vienna, Austria; Research Group Bioinformatics and Computational Biology, Faculty of Computer Science, University of Vienna, A-1090 Vienna, Austria
- Christian Höner Zu Siederdissen
- Institute for Theoretical Chemistry, University of Vienna, Währingerstrasse 17, A-1090 Vienna, Austria; Bioinformatics Group, Department of Computer Science, University of Leipzig, D-04107 Leipzig, Germany; Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstraße 16-18, D-04107 Leipzig, Germany
|
18
|
Barquist L, Burge SW, Gardner PP. Studying RNA Homology and Conservation with Infernal: From Single Sequences to RNA Families. Curr Protoc Bioinformatics 2016; 54:12.13.1-12.13.25. [PMID: 27322404 PMCID: PMC5010141 DOI: 10.1002/cpbi.4] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Indexed: 11/09/2022]
Abstract
Emerging high-throughput technologies have led to a deluge of putative non-coding RNA (ncRNA) sequences identified in a wide variety of organisms. Systematic characterization of these transcripts will be a tremendous challenge. Homology detection is critical to making maximal use of functional information gathered about ncRNAs: identifying homologous sequence allows us to transfer information gathered in one organism to another quickly and with a high degree of confidence. ncRNA presents a challenge for homology detection, as the primary sequence is often poorly conserved and de novo secondary structure prediction and search remain difficult. This unit introduces methods developed by the Rfam database for identifying "families" of homologous ncRNAs starting from single "seed" sequences, using manually curated sequence alignments to build powerful statistical models of sequence and structure conservation known as covariance models (CMs), implemented in the Infernal software package. We provide a step-by-step iterative protocol for identifying ncRNA homologs and then constructing an alignment and corresponding CM. We also work through an example for the bacterial small RNA MicA, discovering a previously unreported family of divergent MicA homologs in genus Xenorhabdus in the process. © 2016 by John Wiley & Sons, Inc.
Affiliation(s)
- Lars Barquist
- Institute for Molecular Infection Biology, University of Würzburg, Würzburg, D-97080 Germany
- Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA United Kingdom; Fax: +44 (0)1223 494919
- Sarah W. Burge
- Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA United Kingdom; Fax: +44 (0)1223 494919
- Paul P. Gardner
- School of Biological Sciences, University of Canterbury, Private Bag 4800, Christchurch, New Zealand
- Biomolecular Interaction Centre, University of Canterbury, Private Bag 4800, Christchurch, New Zealand
|
19
|
Abstract
Genomic studies have greatly expanded our knowledge of structural non-coding RNAs (ncRNAs). These RNAs fold into characteristic secondary structures and perform specific structure-dependent biological functions. Hence, RNA secondary structure prediction is one of the most well-studied problems in computational RNA biology. Comparative sequence analysis is one of the more reliable RNA structure prediction approaches as it exploits information from multiple related sequences to infer the consensus secondary structure. This class of methods essentially learns a global secondary structure from the input sequences. In this paper, we consider the more general problem of unearthing common local secondary structure-based patterns from a set of related sequences. The input sequences could, for example, correspond to 3' or 5' untranslated regions of a set of orthologous genes, and the unearthed local patterns could correspond to regulatory motifs found in these regions. These sequences could also correspond to in vitro selected RNA, genomic segments housing ncRNA genes from the same family and so on. Here, we give a detailed review of the various computational techniques proposed in the literature attempting to solve this general motif discovery problem. We also give empirical comparisons of some of the current state-of-the-art methods and point out future directions of research.
Affiliation(s)
- Avinash Achar
- Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway
- Pål Sætrom
- Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway.
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway.
|
20
|
Gardner PP, Fasold M, Burge SW, Ninova M, Hertel J, Kehr S, Steeves TE, Griffiths-Jones S, Stadler PF. Conservation and losses of non-coding RNAs in avian genomes. PLoS One 2015; 10:e0121797. [PMID: 25822729 PMCID: PMC4378963 DOI: 10.1371/journal.pone.0121797] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 11/03/2014] [Accepted: 02/03/2015] [Indexed: 11/21/2022]
Abstract
Here we present the results of a large-scale bioinformatics annotation of non-coding RNA loci in 48 avian genomes. Our approach uses probabilistic models of hand-curated families from the Rfam database to infer conserved RNA families within each avian genome. We supplement these annotations with predictions from the tRNA annotation tool, tRNAscan-SE and microRNAs from miRBase. We identify 34 lncRNA-associated loci that are conserved between birds and mammals and validate 12 of these in chicken. We report several intriguing cases where a reported mammalian lncRNA, but not its function, is conserved. We also demonstrate extensive conservation of classical ncRNAs (e.g., tRNAs) and more recently discovered ncRNAs (e.g., snoRNAs and miRNAs) in birds. Furthermore, we describe numerous “losses” of several RNA families, and attribute these to either genuine loss, divergence or missing data. In particular, we show that many of these losses are due to the challenges associated with assembling avian microchromosomes. These combined results illustrate the utility of applying homology-based methods for annotating novel vertebrate genomes.
Affiliation(s)
- Paul P. Gardner
- School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
- Biomolecular Interaction Centre, University of Canterbury, Christchurch, New Zealand
- Mario Fasold
- Bioinformatics Group, Department of Computer Science; and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstrasse 16-18, D-04107 Leipzig, Germany
- ecSeq Bioinformatics, Brandvorwerkstr.43, D-04275 Leipzig, Germany
- Sarah W. Burge
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK
- Maria Ninova
- Faculty of Life Sciences, University of Manchester, Manchester, United Kingdom
- Jana Hertel
- Bioinformatics Group, Department of Computer Science; and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstrasse 16-18, D-04107 Leipzig, Germany
- Stephanie Kehr
- Bioinformatics Group, Department of Computer Science; and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstrasse 16-18, D-04107 Leipzig, Germany
- Tammy E. Steeves
- School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
- Sam Griffiths-Jones
- Faculty of Life Sciences, University of Manchester, Manchester, United Kingdom
- Peter F. Stadler
- Bioinformatics Group, Department of Computer Science; and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstrasse 16-18, D-04107 Leipzig, Germany
- Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany
- Fraunhofer Institute for Cell Therapy and Immunology, Perlickstrasse 1, D-04103 Leipzig, Germany
- Department of Theoretical Chemistry of the University of Vienna, Währingerstrasse 17, A-1090 Vienna, Austria
- Center for RNA in Technology and Health, Univ. Copenhagen, Grønnegårdsvej 3, Frederiksberg C, Denmark
- Santa Fe Institute, 1399 Hyde Park Road, Santa Fe NM 87501, USA
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Germany
|
21
|
Hertel J, Stadler PF. The Expansion of Animal MicroRNA Families Revisited. Life (Basel) 2015; 5:905-20. [PMID: 25780960 PMCID: PMC4390885 DOI: 10.3390/life5010905] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.1] [Received: 11/28/2014] [Revised: 02/09/2015] [Accepted: 02/11/2015] [Indexed: 12/14/2022]
Abstract
MicroRNAs are important regulatory small RNAs in many eukaryotes. Due to their small size and simple structure, they are readily innovated de novo. Throughout the evolution of animals, the emergence of novel microRNA families traces key morphological innovations. Here, we use a computational approach based on homology search and parsimony-based presence/absence analysis to draw a comprehensive picture of microRNA evolution in 159 animal species. We confirm previous observations regarding bursts of innovations accompanying the three rounds of genome duplications in vertebrate evolution and in the early evolution of placental mammals. With a much better resolution for the invertebrate lineage compared to large-scale studies, we observe additional bursts of innovation, e.g., in Rhabditoidea. More importantly, we see clear evidence that loss of microRNA families is not an uncommon phenomenon. The Enoplea may serve as a second dramatic example beyond the tunicates. The large-scale analysis presented here also highlights several generic technical issues in the analysis of very large gene families that will require further research.
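The abstract above does not say which parsimony variant was used for the presence/absence analysis. As an illustration only, the sketch below applies the classic Fitch small-parsimony pass to one binary character (family present = 1, absent = 0) on a toy tree; the tree, the species labels and the function names are hypothetical and are not taken from the paper.

```python
def fitch(tree, presence):
    """Bottom-up Fitch pass for one binary presence/absence character.

    `tree` is a nested tuple of two subtrees, with leaf names as strings.
    `presence` maps each leaf name to 0 (absent) or 1 (present).
    Returns (candidate_states_at_this_node, minimum_number_of_changes).
    """
    if isinstance(tree, str):                  # leaf: state is observed
        return {presence[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch(left, presence)
    right_states, right_changes = fitch(right, presence)
    shared = left_states & right_states
    if shared:                                 # children agree: no new change
        return shared, left_changes + right_changes
    # children disagree: one gain or loss is needed on a branch below
    return left_states | right_states, left_changes + right_changes + 1


def fitch_changes(tree, presence):
    """Minimum number of gain/loss events explaining the leaf pattern."""
    return fitch(tree, presence)[1]


if __name__ == "__main__":
    toy_tree = ((("human", "mouse"), "chicken"), ("fly", "worm"))
    family = {"human": 1, "mouse": 0, "chicken": 1, "fly": 0, "worm": 0}
    print(fitch_changes(toy_tree, family))
```

Fitch parsimony treats gains and losses symmetrically; an analysis that forbids repeated gains of the same family (Dollo-style) would need a different cost model.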
Affiliation(s)
- Jana Hertel
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University Leipzig, Härtelstrasse 16-18, D-04107 Leipzig, Germany.
- Peter F Stadler
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University Leipzig, Härtelstrasse 16-18, D-04107 Leipzig, Germany.
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Deutscher Platz 5E, 04103 Leipzig, Germany.
- Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany.
- Fraunhofer Institute for Cell Therapy and Immunology, Perlickstrasse 1, D-04103 Leipzig, Germany.
- Department of Theoretical Chemistry of the University of Vienna, Währingerstrasse 17, A-1090 Vienna, Austria.
- Center for RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, Denmark.
- Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA.
|
22
|
Pei S, Anthony JS, Meyer MM. Sampled ensemble neutrality as a feature to classify potential structured RNAs. BMC Genomics 2015; 16:35. [PMID: 25649229 PMCID: PMC4333902 DOI: 10.1186/s12864-014-1203-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 05/12/2014] [Accepted: 12/22/2014] [Indexed: 11/10/2022]
Abstract
BACKGROUND Structured RNAs have many biological functions ranging from catalysis of chemical reactions to gene regulation. Yet, many homologous structured RNAs display most of their conservation at the secondary or tertiary structure level. As a result, strategies for structured RNA discovery rely heavily on identification of sequences sharing a common stable secondary structure. However, correctly distinguishing structured RNAs from surrounding genomic sequence remains challenging, especially during de novo discovery. RNA also has a long history as a computational model for evolution due to the direct link between genotype (sequence) and phenotype (structure). From these studies it is clear that evolved RNA structures, like protein structures, can be considered robust to point mutations. In this context, an RNA sequence is considered robust if its neutrality (extent to which single mutant neighbors maintain the same secondary structure) is greater than that expected for an artificial sequence with the same minimum free energy structure. RESULTS In this work, we bring concepts from evolutionary biology to bear on the structured RNA de novo discovery process. We hypothesize that alignments corresponding to structured RNAs should consist of neutral sequences. We evaluate several measures of neutrality for their ability to distinguish between alignments of structured RNA sequences drawn from Rfam and various decoy alignments. We also introduce a new measure of RNA structural neutrality, the structure ensemble neutrality (SEN). SEN seeks to increase the biological relevance of existing neutrality measures in two ways. First, it uses information from an alignment of homologous sequences to identify a conserved biologically relevant structure for comparison. Second, it only counts base-pairs of the original structure that are absent in the comparison structure and does not penalize the formation of additional base-pairs. 
CONCLUSION We find that several measures of neutrality are effective at separating structured RNAs from decoy sequences, including both shuffled alignments and flanking genomic sequence. Furthermore, as an independent feature classifier to identify structured RNAs, SEN yields comparable performance to current approaches that consider a variety of features including stability and sequence identity. Finally, SEN outperforms other measures of neutrality at detecting mutational robustness in bacterial regulatory RNA structures.
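The second design choice described for SEN above (count only base-pairs of the original structure that are missing from the comparison structure, and do not penalize extra pairs) can be sketched on dot-bracket strings. This is a simplified illustration of that one-sided comparison, not the SEN implementation; the function names are hypothetical and pseudoknots are not handled.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) base-pair positions in a dot-bracket string."""
    stack = []
    pairs = set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return pairs


def missing_pairs(reference, comparison):
    """Count reference base-pairs absent from the comparison structure.

    Extra pairs that appear only in the comparison are deliberately ignored.
    """
    return len(base_pairs(reference) - base_pairs(comparison))


if __name__ == "__main__":
    # The mutant loses the inner pair of the reference hairpin.
    print(missing_pairs("((..))", "(....)"))
    # The mutant gains a pair; the one-sided count does not penalize it.
    print(missing_pairs("(....)", "((..))"))
```

A symmetric base-pair distance would score both cases as one difference; the one-sided count only registers the loss.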
Affiliation(s)
- Shermin Pei
- Boston College, 140 Commonwealth Ave., Chestnut Hill, 02467, MA, USA.
- Jon S Anthony
- Boston College, 140 Commonwealth Ave., Chestnut Hill, 02467, MA, USA.
- Michelle M Meyer
- Boston College, 140 Commonwealth Ave., Chestnut Hill, 02467, MA, USA.
|
23
|
Lertampaiporn S, Thammarongtham C, Nukoolkit C, Kaewkamnerdpong B, Ruengjitchatchawalya M. Identification of non-coding RNAs with a new composite feature in the Hybrid Random Forest Ensemble algorithm. Nucleic Acids Res 2014; 42:e93. [PMID: 24771344 PMCID: PMC4066759 DOI: 10.1093/nar/gku325] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Received: 02/10/2014] [Revised: 04/02/2014] [Accepted: 04/07/2014] [Indexed: 12/13/2022]
Abstract
To identify non-coding RNA (ncRNA) signals within genomic regions, a classification tool was developed based on a hybrid random forest (RF) with a logistic regression model to efficiently discriminate short ncRNA sequences as well as long complex ncRNA sequences. This RF-based classifier was trained on a well-balanced dataset with a discriminative set of features and achieved an accuracy, sensitivity and specificity of 92.11%, 90.7% and 93.5%, respectively. The selected feature set includes a new proposed feature, SCORE. This feature is generated based on a logistic regression function that combines five significant features (structure, sequence, modularity, structural robustness and coding potential) to enable improved characterization of long ncRNA (lncRNA) elements. The use of SCORE improved the performance of the RF-based classifier in the identification of Rfam lncRNA families. A genome-wide ncRNA classification framework was applied to a wide variety of organisms, with an emphasis on those of economic, social, public health, environmental and agricultural significance, such as various bacterial genomes, the Arthrospira (Spirulina) genome, and rice and human genomic regions. Our framework was able to identify known ncRNAs with sensitivities of greater than 90% and 77.7% for prokaryotic and eukaryotic sequences, respectively. Our classifier is available at http://ncrna-pred.com/HLRF.htm.
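The three performance figures quoted in the abstract above are related: on a perfectly balanced test set, accuracy is the mean of sensitivity and specificity, and (90.7% + 93.5%) / 2 = 92.1%, in line with the reported 92.11%. The sketch below shows the standard definitions. The confusion-matrix counts are hypothetical, chosen only to reproduce the reported sensitivity and specificity on 1000 positives and 1000 negatives; they are not data from the paper.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # fraction of true ncRNAs recovered
    specificity = tn / (tn + fp)          # fraction of negatives rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical balanced test set: 1000 positives, 1000 negatives.
    acc, sens, spec = classification_metrics(tp=907, fn=93, tn=935, fp=65)
    print(f"accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
    print(f"mean of sensitivity and specificity={(sens + spec) / 2:.3f}")
```

On an unbalanced dataset the identity no longer holds, which is one reason sensitivity and specificity are reported alongside accuracy.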
Affiliation(s)
- Supatcha Lertampaiporn
- Biological Engineering Program, Faculty of Engineering, King Mongkut's University of Technology Thonburi, 126 Pracha Uthit Rd, Bangmod, Thung Khru, Bangkok 10140, Thailand
- Chinae Thammarongtham
- Biochemical Engineering and Pilot Plant Research and Development Unit, National Center for Genetic Engineering and Biotechnology at King Mongkut's University of Technology Thonburi (Bang Khun Thian Campus), 49 Soi Thian Thale 25, Bang Khun Thian Chai Thale Rd, Tha Kham, Bangkok 10150, Thailand
- Chakarida Nukoolkit
- School of Information Technology, King Mongkut's University of Technology Thonburi, 126 Pracha Uthit Rd, Bangmod, Thung Khru, Bangkok 10140, Thailand
- Boonserm Kaewkamnerdpong
- Biological Engineering Program, Faculty of Engineering, King Mongkut's University of Technology Thonburi, 126 Pracha Uthit Rd, Bangmod, Thung Khru, Bangkok 10140, Thailand
- Marasri Ruengjitchatchawalya
- Biotechnology Program, School of Bioresources and Technology, King Mongkut's University of Technology Thonburi (Bang Khun Thian Campus), 49 Soi Thian Thale 25, Bang Khun Thian Chai Thale Rd, Tha Kham, Bangkok 10150, Thailand; Bioinformatics and Systems Biology Program, King Mongkut's University of Technology Thonburi (Bang Khun Thian Campus), 49 Soi Thian Thale 25, Bang Khun Thian Chai Thale Rd, Tha Kham, Bangkok 10150, Thailand
|
24
|
Gruber AR. RNA Polymerase III promoter screen uncovers a novel noncoding RNA family conserved in Caenorhabditis and other clade V nematodes. Gene 2014; 544:236-40. [PMID: 24792899 DOI: 10.1016/j.gene.2014.04.068] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 03/14/2014] [Revised: 04/25/2014] [Accepted: 04/28/2014] [Indexed: 10/25/2022]
Abstract
RNA Polymerase III is a highly specialized enzyme complex responsible for the transcription of a very distinct set of housekeeping noncoding RNAs including tRNAs, 7SK snRNA, Y RNAs, U6 snRNA, and the RNA components of RNaseP and RNaseMRP. In this work we have utilized the conserved promoter structure of known RNA Polymerase III transcripts consisting of characteristic sequence elements termed proximal sequence elements (PSE) A and B and a TATA-box to uncover a novel RNA Polymerase III-transcribed, noncoding RNA family found to be conserved in Caenorhabditis as well as other clade V nematode species. Homology search in combination with detailed sequence and secondary structure analysis revealed that members of this novel ncRNA family evolve rapidly, and only maintain a potentially functional small stem structure that links the 5' end to the very 3' end of the transcript and a small hairpin structure at the 3' end. This is most likely required for efficient transcription termination. In addition, our study revealed evidence that canonical C/D box snoRNAs are also transcribed from a PSE A-PSE B-TATA-box promoter in Caenorhabditis elegans.
Affiliation(s)
- Andreas R Gruber
- Computational and Systems Biology, Biozentrum, University of Basel, Klingelbergstrasse 50-70, 4056 Basel, Switzerland; Swiss Institute of Bioinformatics, University of Basel, Klingelbergstrasse 50-70, 4056 Basel, Switzerland.
|
25
|
Abstract
Many RNA families, i.e., groups of homologous RNA genes, belong to RNA classes, such as tRNAs, snoRNAs, or microRNAs, that are characterized by common sequence motifs and/or common secondary structure features. The detection of new members of RNA classes, as well as the comprehensive annotation of genomes with members of RNA classes is a challenging task that goes beyond simple homology search. Computational methods addressing this problem typically use a three-tiered approach: In the first step an efficient and sensitive filter is employed. In the second step the candidate set is narrowed down using computationally expensive methods geared towards specificity. In the final step the hits are annotated with class-specific features and scored. Here we review the tools that are currently available for a diverse set of RNA classes.
|
26
|
Sabarinathan R, Tafer H, Seemann SE, Hofacker IL, Stadler PF, Gorodkin J. RNAsnp: efficient detection of local RNA secondary structure changes induced by SNPs. Hum Mutat 2013; 34:546-56. [PMID: 23315997 PMCID: PMC3708107 DOI: 10.1002/humu.22273] [Citation(s) in RCA: 99] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2012] [Accepted: 12/18/2012] [Indexed: 02/05/2023]
Abstract
Structural characteristics are essential for the functioning of many noncoding RNAs and cis-regulatory elements of mRNAs. SNPs may disrupt these structures, interfere with their molecular function, and hence cause a phenotypic effect. RNA folding algorithms can provide detailed insights into structural effects of SNPs. The global measures employed so far suffer from limited accuracy of folding programs on large RNAs and are computationally too demanding for genome-wide applications. Here, we present a strategy that focuses on the local regions of maximal structural change between mutant and wild-type. These local regions are approximated in a “screening mode” that is intended for genome-wide applications. Furthermore, localized regions are identified as those with maximal discrepancy. The mutation effects are quantified in terms of empirical P values. To this end, the RNAsnp software uses extensive precomputed tables of the distribution of SNP effects as function of length and GC content. RNAsnp thus achieves both a noise reduction and speed-up of several orders of magnitude over shuffling-based approaches. On a data set comprising 501 SNPs associated with human-inherited diseases, we predict 54 to have significant local structural effect in the untranslated region of mRNAs. RNAsnp is available at http://rth.dk/resources/rnasnp.
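The empirical P value idea mentioned above can be illustrated with a minimal sketch. It assumes a precomputed, sorted table of background structural-change scores for one length and GC-content bin, where a larger score means a larger change. The function name, the scores, and the add-one correction are illustrative assumptions, not RNAsnp's actual implementation.

```python
from bisect import bisect_left

def empirical_p_value(observed, background_sorted):
    """Fraction of background scores at least as extreme as `observed`.

    `background_sorted` is a hypothetical precomputed table (ascending order)
    of SNP-effect scores for one length/GC-content bin. The add-one
    correction keeps the estimate away from zero.
    """
    n = len(background_sorted)
    # Index of the first background score >= observed.
    first_ge = bisect_left(background_sorted, observed)
    at_least_as_extreme = n - first_ge
    return (at_least_as_extreme + 1) / (n + 1)

# Made-up background distribution, for illustration only.
background = sorted([0.05, 0.10, 0.12, 0.20, 0.25, 0.31, 0.40, 0.55, 0.70])
p = empirical_p_value(0.55, background)
```

Looking up a value in a sorted table is a binary search, which is one way a precomputed distribution can replace repeated shuffling of each sequence.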
|
27
|
Leung YY, Ryvkin P, Ungar LH, Gregory BD, Wang LS. CoRAL: predicting non-coding RNAs from small RNA-sequencing data. Nucleic Acids Res 2013; 41:e137. [PMID: 23700308 PMCID: PMC3737537 DOI: 10.1093/nar/gkt426] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022] Open
Abstract
The surprising observation that virtually the entire human genome is transcribed means we know little about the function of many emerging classes of RNAs, except their astounding diversities. Traditional RNA function prediction methods rely on sequence or alignment information, which are limited in their abilities to classify the various collections of non-coding RNAs (ncRNAs). To address this, we developed Classification of RNAs by Analysis of Length (CoRAL), a machine learning-based approach for classification of RNA molecules. CoRAL uses biologically interpretable features including fragment length and cleavage specificity to distinguish between different ncRNA populations. We evaluated CoRAL using genome-wide small RNA sequencing data sets from four human tissue types and were able to classify six different types of RNAs with ∼80% cross-validation accuracy. Analysis by CoRAL revealed that microRNAs, small nucleolar and transposon-derived RNAs are highly discernible and consistent across all human tissue types assessed, whereas long intergenic ncRNAs, small cytoplasmic RNAs and small nuclear RNAs show less consistent patterns. The ability to reliably annotate loci across tissue types demonstrates the potential of CoRAL to characterize ncRNAs using small RNA sequencing data in less well-characterized organisms.
Affiliation(s)
- Yuk Yee Leung
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
|
28
|
Will S, Siebauer MF, Heyne S, Engelhardt J, Stadler PF, Reiche K, Backofen R. LocARNAscan: Incorporating thermodynamic stability in sequence and structure-based RNA homology search. Algorithms Mol Biol 2013; 8:14. [PMID: 23601347 PMCID: PMC3716875 DOI: 10.1186/1748-7188-8-14] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2013] [Accepted: 03/28/2013] [Indexed: 12/15/2022] Open
Abstract
Background: The search for distant homologs has become an important issue in genome annotation. A particular difficulty is posed by divergent homologs that have lost recognizable sequence similarity. This same problem also arises in the recognition of novel members of large classes of RNAs such as snoRNAs or microRNAs that consist of families unrelated by common descent. Current homology search tools for structured RNAs are either based entirely on sequence similarity (such as blast or hmmer) or combine sequence and secondary structure. The most prominent example of the latter class of tools is Infernal. Alternatives are descriptor-based methods. In most practical applications published to date, however, the information contained in covariance models or manually prescribed search patterns is dominated by sequence information. Here we ask two related questions: (1) Is secondary structure alone informative for homology search and the detection of novel members of RNA classes? (2) To what extent is the thermodynamic propensity of the target sequence to fold into the correct secondary structure helpful for this task? Results: Sequence-structure alignment can be used as an alternative search strategy. In this scenario, the query consists of a base pairing probability matrix, which can be derived either from a single sequence or from a multiple alignment representing a set of known representatives. Sequence information can be optionally added to the query. The target sequence is pre-processed to obtain local base pairing probabilities. As a search engine we devised a semi-global scanning variant of LocARNA’s algorithm for sequence-structure alignment. The LocARNAscan tool is optimized for speed and low memory consumption. In benchmarking experiments on artificial data we observe that the inclusion of thermodynamic stability is helpful, albeit only in a regime of extremely low sequence information in the query. We observe, furthermore, that the sensitivity is bounded in particular by the limited accuracy of the predicted local structures of the target sequence. Conclusions: Although we demonstrate that a purely structure-based homology search is feasible in principle, it is unlikely to outperform tools such as Infernal in most application scenarios, where a substantial amount of sequence information is typically available. The LocARNAscan approach will profit, however, from high throughput methods to determine RNA secondary structure. In transcriptome-wide applications, such methods will provide accurate structure annotations on the target side. Availability: Source code of the free software LocARNAscan 1.0 and supplementary data are available at http://www.bioinf.uni-leipzig.de/Software/LocARNAscan.
|
29
|
Washietl S, Will S, Hendrix DA, Goff LA, Rinn JL, Berger B, Kellis M. Computational analysis of noncoding RNAs. WILEY INTERDISCIPLINARY REVIEWS-RNA 2012; 3:759-78. [PMID: 22991327 DOI: 10.1002/wrna.1134] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
Abstract
Noncoding RNAs have emerged as important key players in the cell. Understanding their surprisingly diverse range of functions is challenging for experimental and computational biology. Here, we review computational methods to analyze noncoding RNAs. The topics covered include basic and advanced techniques to predict RNA structures, annotation of noncoding RNAs in genomic data, mining RNA-seq data for novel transcripts and prediction of transcript structures, computational aspects of microRNAs, and database resources.
Affiliation(s)
- Stefan Washietl
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA.
|
30
|
Conservation of a triple-helix-forming RNA stability element in noncoding and genomic RNAs of diverse viruses. Cell Rep 2012; 2:26-32. [PMID: 22840393 DOI: 10.1016/j.celrep.2012.05.020] [Citation(s) in RCA: 76] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2012] [Revised: 04/19/2012] [Accepted: 05/23/2012] [Indexed: 01/17/2023] Open
Abstract
Abundant expression of the long noncoding (lnc) PAN (polyadenylated nuclear) RNA by the human oncogenic gammaherpesvirus Kaposi's sarcoma-associated herpesvirus (KSHV) depends on a cis-element called the expression and nuclear retention element (ENE). The ENE upregulates PAN RNA by inhibiting its rapid nuclear decay through triple-helix formation with the poly(A) tail. Using structure-based bioinformatics, we identified six ENE-like elements in evolutionarily diverse viral genomes. Five are in double-stranded DNA viruses, including mammalian herpesviruses, insect polydnaviruses, and a protist mimivirus. One is in an insect picorna-like positive-strand RNA virus, suggesting that the ENE can counteract cytoplasmic as well as nuclear RNA decay pathways. Functionality of four of the ENEs was demonstrated by increased accumulation of an intronless polyadenylated reporter transcript in human cells. Identification of these ENEs enabled the discovery of PAN RNA homologs in two additional gammaherpesviruses, RRV and EHV2. Our findings demonstrate that searching for structural elements can lead to rapid identification of lncRNAs.
|
31
|
BRASERO: A Resource for Benchmarking RNA Secondary Structure Comparison Algorithms. Adv Bioinformatics 2012; 2012:893048. [PMID: 22675348 PMCID: PMC3366197 DOI: 10.1155/2012/893048] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2011] [Accepted: 02/22/2012] [Indexed: 11/23/2022] Open
Abstract
The pairwise comparison of RNA secondary structures is a fundamental problem, with direct application in mining databases for annotating putative noncoding RNA candidates in newly sequenced genomes. An increasing number of software tools are available for comparing RNA secondary structures, based on different models (such as ordered trees or forests, arc annotated sequences, and multilevel trees) and computational principles (edit distance, alignment). We describe here the website BRASERO that offers tools for evaluating such software tools on real and synthetic datasets.
|
32
|
Conservation and Occurrence of Trans-Encoded sRNAs in the Rhizobiales. Genes (Basel) 2011; 2:925-56. [PMID: 24710299 PMCID: PMC3927594 DOI: 10.3390/genes2040925] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2011] [Revised: 10/24/2011] [Accepted: 10/26/2011] [Indexed: 12/13/2022] Open
Abstract
Post-transcriptional regulation by trans-encoded sRNAs, for example via base-pairing with target mRNAs, is a common feature in bacteria and influences various cell processes, e.g., response to stress factors. Several studies based on computational and RNA-seq approaches identified approximately 180 trans-encoded sRNAs in Sinorhizobium meliloti. The initial point of this report is a set of 52 trans-encoded sRNAs derived from the former studies. Sequence homology combined with structural conservation analyses were applied to elucidate the occurrence and distribution of conserved trans-encoded sRNAs in the order of Rhizobiales. This approach resulted in 39 RNA family models (RFMs) which showed various taxonomic distribution patterns. Whereas the majority of RFMs was restricted to Sinorhizobium species or the Rhizobiaceae, members of a few RFMs were more widely distributed in the Rhizobiales. Access to this data is provided via the RhizoGATE portal [1,2].
|
33
|
Cros MJ, de Monte A, Mariette J, Bardou P, Grenier-Boley B, Gautheret D, Touzet H, Gaspin C. RNAspace.org: An integrated environment for the prediction, annotation, and analysis of ncRNA. RNA (NEW YORK, N.Y.) 2011; 17:1947-56. [PMID: 21947200 PMCID: PMC3198588 DOI: 10.1261/rna.2844911] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/30/2011] [Accepted: 08/07/2011] [Indexed: 05/22/2023]
Abstract
The annotation of noncoding RNA genes remains a major bottleneck in genome sequencing projects. Most genome sequences released today still come with sets of tRNAs and rRNAs as the only annotated RNA elements, ignoring hundreds of other RNA families. We have developed a web environment that is dedicated to noncoding RNA (ncRNA) prediction, annotation, and analysis and allows users to run a variety of tools in an integrated and flexible manner. This environment offers complementary ncRNA gene finders and a set of tools for the comparison, visualization, editing, and export of ncRNA candidates. Predictions can be filtered according to a large set of characteristics. Based on this environment, we created a public website located at http://RNAspace.org. It accepts genomic sequences up to 5 Mb, which permits for an online annotation of a complete bacterial genome or a small eukaryotic chromosome. The project is hosted as a Source Forge project (http://rnaspace.sourceforge.net/).
Affiliation(s)
- Antoine de Monte
- LIFL, UMR CNRS 8022 Université Lille 1 and INRIA Lille Nord Europe, 59655 Villeneuve d'Ascq cedex, France
- Jérôme Mariette
- INRA, Plateforme Bioinformatique, F-31320, UR 875, Castanet-Tolosan, France
- Benjamin Grenier-Boley
- LIFL, UMR CNRS 8022 Université Lille 1 and INRIA Lille Nord Europe, 59655 Villeneuve d'Ascq cedex, France
- Hélène Touzet
- LIFL, UMR CNRS 8022 Université Lille 1 and INRIA Lille Nord Europe, 59655 Villeneuve d'Ascq cedex, France
- Christine Gaspin
- INRA, UBIA, UR 875, F-31320 Castanet-Tolosan, France
- INRA, Plateforme Bioinformatique, F-31320, UR 875, Castanet-Tolosan, France
|
34
|
Abstract
Non-coding RNAs (ncRNAs) are receiving more and more attention not only as an abundant class of genes, but also as regulatory structural elements (some located in mRNAs). A key feature of RNA function is its structure. Computational methods were developed early for folding and prediction of RNA structure with the aim of assisting in functional analysis. With the discovery of more and more ncRNAs, it has become clear that a large fraction of these are highly structured. Interestingly, a large part of the structure is comprised of regular Watson-Crick and GU wobble base pairs. This and the increased amount of available genomes have made it possible to employ structure-based methods for genomic screens. The field has moved from folding prediction of single sequences to computational screens for ncRNAs in genomic sequence using the RNA structure as the main characteristic feature. Whereas early methods focused on energy-directed folding of single sequences, comparative analysis based on structure preserving changes of base pairs has been efficient in improving accuracy, and today this constitutes a key component in genomic screens. Here, we cover the basic principles of RNA folding and touch upon some of the concepts in current methods that have been applied in genomic screens for de novo RNA structures in searches for novel ncRNA genes and regulatory RNA structure on mRNAs. We discuss the strengths and weaknesses of the different strategies and how they can complement each other.
|
35
|
Cruz JA, Westhof E. Identification and annotation of noncoding RNAs in Saccharomycotina. C R Biol 2011; 334:671-8. [PMID: 21819949 DOI: 10.1016/j.crvi.2011.05.016] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2010] [Accepted: 03/23/2011] [Indexed: 11/16/2022]
Abstract
The importance of ncRNAs in biological processes makes their annotation an essential component of any genome-sequencing project. The identification of ncRNAs in genomes requires specific expertise and tools that are distinct from the traditional protein gene annotation tools. Here, we describe the assembly of two automatic annotation pipelines, integrating publicly available tools, for homology and de novo ncRNA search in genomes. We applied both pipelines to 10 Saccharomycotina genomes and were able to find and annotate 693 ncRNA genes, corresponding to 81% of the ncRNAs expected for those genomes assuming the number of ncRNAs in Saccharomyces cerevisiae (86) as a reference. Several new ncRNAs, not yet known in the Saccharomycotina clade, were also detected. The results show the feasibility of automatic search for ncRNAs in full genomes and the utility of such approaches in large multi-genome sequencing and annotation projects.
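The 81% figure follows from the numbers quoted in the abstract, assuming the expected total is simply the S. cerevisiae count multiplied by the number of genomes. The variable names below are illustrative.

```python
genomes = 10
ncrnas_per_genome_reference = 86   # S. cerevisiae count quoted in the abstract
annotated = 693                    # ncRNA genes found by the two pipelines

# Expected total if every genome carried the reference number of ncRNAs.
expected = genomes * ncrnas_per_genome_reference
recovered_fraction = annotated / expected

print(f"{expected} expected, {recovered_fraction:.0%} recovered")
# prints "860 expected, 81% recovered"
```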
Affiliation(s)
- José Almeida Cruz
- Architecture et Réactivité de l'ARN, Institut de Biologie Moléculaire et Cellulaire du CNRS, Université de Strasbourg, 15 rue René-Descartes, 67084 Strasbourg cedex, France.
|
36
|
Bussotti G, Raineri E, Erb I, Zytnicki M, Wilm A, Beaudoing E, Bucher P, Notredame C. BlastR--fast and accurate database searches for non-coding RNAs. Nucleic Acids Res 2011; 39:6886-95. [PMID: 21624887 PMCID: PMC3167602 DOI: 10.1093/nar/gkr335] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2022] Open
Abstract
We present and validate BlastR, a method for efficiently and accurately searching non-coding RNAs. Our approach relies on the comparison of di-nucleotides using BlosumR, a new log-odd substitution matrix. In order to use BlosumR for comparison, we recoded RNA sequences into protein-like sequences. We then showed that BlosumR can be used along with the BlastP algorithm in order to search non-coding RNA sequences. Using Rfam as a gold standard, we benchmarked this approach and show BlastR to be more sensitive than BlastN. We also show that BlastR is both faster and more sensitive than BlastP used with a single nucleotide log-odd substitution matrix. BlastR, when used in combination with WU-BlastP, is about 5% more accurate than WU-BlastN and about 50 times slower. The approach shown here is equally effective when combined with the NCBI-Blast package. The software is an open source freeware available from www.tcoffee.org/blastr.html.
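The recoding step can be sketched as follows. The abstract does not give the encoding, so this sketch assumes overlapping di-nucleotides and an arbitrary assignment of 16 amino-acid letters; both are hypothetical choices, and the actual BlastR encoding and the BlosumR matrix may differ.

```python
from itertools import product

NUCLEOTIDES = "ACGU"
# 16 arbitrary amino-acid letters, one per di-nucleotide (hypothetical mapping).
LETTERS = "ACDEFGHIKLMNPQRS"
DINUC_TO_LETTER = {
    a + b: letter
    for (a, b), letter in zip(product(NUCLEOTIDES, repeat=2), LETTERS)
}

def recode(rna):
    """Recode an RNA string into a protein-like string, one letter per
    overlapping di-nucleotide (assumed windowing)."""
    rna = rna.upper().replace("T", "U")
    # Characters outside A, C, G, U raise KeyError in this simple sketch.
    return "".join(DINUC_TO_LETTER[rna[i:i + 2]] for i in range(len(rna) - 1))

encoded = recode("GGAUCC")
```

Once sequences are in a 16-letter protein-like alphabet, a protein search tool can score them with a di-nucleotide substitution matrix in place of an amino-acid matrix, which is the role the abstract gives to BlosumR and BlastP.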
Affiliation(s)
- Giovanni Bussotti
- Bioinformatics and Genomics program, Center for Genomic Regulation (CRG) and UPF, Barcelona, C/ D. Aiguader, 88, 08003 Barcelona, Spain
|
37
|
Seemann SE, Richter AS, Gesell T, Backofen R, Gorodkin J. PETcofold: predicting conserved interactions and structures of two multiple alignments of RNA sequences. Bioinformatics 2010; 27:211-9. [PMID: 21088024 PMCID: PMC3018821 DOI: 10.1093/bioinformatics/btq634] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Abstract
Motivation: Predicting RNA–RNA interactions is essential for determining the function of putative non-coding RNAs. Existing methods for the prediction of interactions are all based on single sequences. Since comparative methods have already been useful in RNA structure determination, we assume that conserved RNA–RNA interactions also imply conserved function. Of these, we further assume that a non-negligible amount of the existing RNA–RNA interactions have also acquired compensating base changes throughout evolution. We implement a method, PETcofold, that can take covariance information in intra-molecular and inter-molecular base pairs into account to predict interactions and secondary structures of two multiple alignments of RNA sequences. Results: PETcofold's ability to predict RNA–RNA interactions was evaluated on a carefully curated dataset of 32 bacterial small RNAs and their targets, which was manually extracted from the literature. For evaluation of both RNA–RNA interaction and structure prediction, we were able to extract only a few high-quality examples: one vertebrate small nucleolar RNA and four bacterial small RNAs. For these we show that the prediction can be improved by our comparative approach. Furthermore, PETcofold was evaluated on controlled data with phylogenetically simulated sequences enriched for covariance patterns at the interaction sites. We observed increased performance with increased amounts of covariance. Availability: The program PETcofold is available as source code and can be downloaded from http://rth.dk/resources/petcofold. Contact: gorodkin@rth.dk; backofen@informatik.uni-freiburg.de Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Stefan E Seemann
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg C, Denmark
|
38
|
Chikkagoudar S, Livesay DR, Roshan U. PLAST-ncRNA: Partition function Local Alignment Search Tool for non-coding RNA sequences. Nucleic Acids Res 2010; 38:W59-63. [PMID: 20522510 PMCID: PMC2896107 DOI: 10.1093/nar/gkq487] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022] Open
Abstract
Alignment-based programs are valuable tools for finding potential homologs in genome sequences. Previously, it has been shown that partition function posterior probabilities attuned to local alignment achieve a high accuracy in identifying distantly similar non-coding RNA sequences that are hidden in a large genome. Here, we present an online implementation of that alignment algorithm based on such probabilities. Our server takes as input a query RNA sequence and a large genome sequence, and outputs a list of hits that are above a mean posterior probability threshold. The output is presented in a format suited to local alignment. It can also be viewed within the PLAST alignment viewer applet that provides a list of all hits found and highlights regions of high posterior probability within each local alignment. The server is freely available at http://plastrna.njit.edu.
Affiliation(s)
- Satish Chikkagoudar
- Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA
|
39
|
Schlüter JP, Reinkensmeier J, Daschkey S, Evguenieva-Hackenberg E, Janssen S, Jänicke S, Becker JD, Giegerich R, Becker A. A genome-wide survey of sRNAs in the symbiotic nitrogen-fixing alpha-proteobacterium Sinorhizobium meliloti. BMC Genomics 2010; 11:245. [PMID: 20398411 PMCID: PMC2873474 DOI: 10.1186/1471-2164-11-245] [Citation(s) in RCA: 87] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2010] [Accepted: 04/17/2010] [Indexed: 12/03/2022] Open
Abstract
Background Small untranslated RNAs (sRNAs) are widespread regulators of gene expression in bacteria. This study reports on a comprehensive screen for sRNAs in the symbiotic nitrogen-fixing alpha-proteobacterium Sinorhizobium meliloti applying deep sequencing of cDNAs and microarray hybridizations. Results A total of 1,125 sRNA candidates that were classified as trans-encoded sRNAs (173), cis-encoded antisense sRNAs (117), mRNA leader transcripts (379), and sense sRNAs overlapping coding regions (456) were identified in a size range of 50 to 348 nucleotides. Among these were transcripts corresponding to 82 previously reported sRNA candidates. Enrichment for RNAs with primary 5'-ends prior to sequencing of cDNAs suggested transcriptional start sites corresponding to 466 predicted sRNA regions. The consensus σ70 promoter motif CTTGAC-N17-CTATAT was found upstream of 101 sRNA candidates. Expression patterns derived from microarray hybridizations provided further information on conditions of expression of a number of sRNA candidates. Furthermore, GenBank, EMBL, DDBJ, PDB, and Rfam databases were searched for homologs of the sRNA candidates identified in this study. Searching Rfam family models with over 1,000 sRNA candidates, re-discovered only those sequences from S. meliloti already known and stored in Rfam, whereas BLAST searches suggested a number of homologs in related alpha-proteobacteria. Conclusions The screening data suggests that in S. meliloti about 3% of the genes encode trans-encoded sRNAs and about 2% antisense transcripts. Thus, this first comprehensive screen for sRNAs applying deep sequencing in an alpha-proteobacterium shows that sRNAs also occur in high number in this group of bacteria.
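The consensus motif CTTGAC-N17-CTATAT can be read as two fixed hexamers separated by 17 arbitrary bases. A minimal sketch of an exact-match scan follows; the function name and the test sequence are made up for illustration, and real promoter searches usually tolerate mismatches and variable spacer lengths, which this sketch does not.

```python
import re

# CTTGAC, any 17 bases, then CTATAT (exact-match reading of the consensus).
# The lookahead lets overlapping occurrences be reported as well.
SIGMA70_MOTIF = re.compile(r"(?=CTTGAC[ACGT]{17}CTATAT)")

def find_promoter_motifs(sequence):
    """Return 0-based start positions of exact consensus matches."""
    return [m.start() for m in SIGMA70_MOTIF.finditer(sequence.upper())]

# Synthetic upstream region containing one motif starting at position 3.
upstream = "GGA" + "CTTGAC" + "A" * 17 + "CTATAT" + "GCGC"
hits = find_promoter_motifs(upstream)
```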
Affiliation(s)
- Jan-Philip Schlüter
- Institute of Biology III, Faculty of Biology, University of Freiburg, Freiburg, Germany
|
40
|
Abstract
The discovery of several new structured non-coding RNAs in bacterial and archaeal genomes and metagenomes raises burning questions about their biological and biochemical functions. See related research article by Weinberg et al.: http://genomebiology.com/2010/11/3/R31
Affiliation(s)
- Eric Westhof
- Architecture et Réactivité de l'ARN, Université de Strasbourg, Institut de Biologie Moléculaire et Cellulaire du CNRS, 15 rue René Descartes, Strasbourg, France.
|
41
|
Gorodkin J, Hofacker IL, Torarinsson E, Yao Z, Havgaard JH, Ruzzo WL. De novo prediction of structured RNAs from genomic sequences. Trends Biotechnol 2009; 28:9-19. [PMID: 19942311 DOI: 10.1016/j.tibtech.2009.09.006] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2009] [Revised: 08/31/2009] [Accepted: 09/22/2009] [Indexed: 12/29/2022]
Abstract
Growing recognition of the numerous, diverse and important roles played by non-coding RNA in all organisms motivates better elucidation of these cellular components. Comparative genomics is a powerful tool for this task and is arguably preferable to any high-throughput experimental technology currently available, because evolutionary conservation highlights functionally important regions. Conserved secondary structure, rather than primary sequence, is the hallmark of many functionally important RNAs, because compensatory substitutions in base-paired regions preserve structure. Unfortunately, such substitutions also obscure sequence identity and confound alignment algorithms, which complicates analysis greatly. This paper surveys recent computational advances in this difficult arena, which have enabled genome-scale prediction of cross-species conserved RNA elements. These predictions suggest that a wealth of these elements indeed exist.
Affiliation(s)
- Jan Gorodkin
- Section for Genetics and Bioinformatics, IBHV and Center for Applied Bioinformatics, University of Copenhagen, Grønnegårdsvej 3, DK-1870 Frederiksberg C, Denmark.
|