1
|
Backofen R, Gorodkin J, Hofacker IL, Stadler PF. Comparative RNA Genomics. Methods Mol Biol 2024; 2802:347-393. [PMID: 38819565 DOI: 10.1007/978-1-0716-3838-5_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/01/2024]
Abstract
Over the last quarter of a century it has become clear that RNA is much more than just a boring intermediate in protein expression. Ancient RNAs still appear in the core information metabolism and comprise a surprisingly large component in bacterial gene regulation. A common theme with these types of mostly small RNAs is their reliance of conserved secondary structures. Large-scale sequencing projects, on the other hand, have profoundly changed our understanding of eukaryotic genomes. Pervasively transcribed, they give rise to a plethora of large and evolutionarily extremely flexible non-coding RNAs that exert a vastly diverse array of molecule functions. In this chapter we provide a-necessarily incomplete-overview of the current state of comparative analysis of non-coding RNAs, emphasizing computational approaches as a means to gain a global picture of the modern RNA world.
Collapse
Affiliation(s)
- Rolf Backofen
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
- Center for Non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark
| | - Jan Gorodkin
- Center for Non-coding RNA in Technology and Health, Department of Veterinary and Animal Sciences, University of Copenhagen, Frederiksberg, Denmark
| | - Ivo L Hofacker
- Institute for Theoretical Chemistry, University of Vienna, Wien, Austria
- Bioinformatics and Computational Biology research group, University of Vienna, Vienna, Austria
- Center for Non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark
| | - Peter F Stadler
- Bioinformatics Group, Department of Computer Science, University of Leipzig, Leipzig, Germany.
- Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany.
- Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany.
- Universidad National de Colombia, Bogotá, Colombia.
- Institute for Theoretical Chemistry, University of Vienna, Wien, Austria.
- Center for Non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark.
- Santa Fe Institute, Santa Fe, NM, USA.
| |
Collapse
|
2
|
Kirsch R, Seemann SE, Ruzzo WL, Cohen SM, Stadler PF, Gorodkin J. Identification and characterization of novel conserved RNA structures in Drosophila. BMC Genomics 2018; 19:899. [PMID: 30537930 PMCID: PMC6288889 DOI: 10.1186/s12864-018-5234-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2018] [Accepted: 11/08/2018] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Comparative genomics approaches have facilitated the discovery of many novel non-coding and structured RNAs (ncRNAs). The increasing availability of related genomes now makes it possible to systematically search for compensatory base changes - and thus for conserved secondary structures - even in genomic regions that are poorly alignable in the primary sequence. The wealth of available transcriptome data can add valuable insight into expression and possible function for new ncRNA candidates. Earlier work identifying ncRNAs in Drosophila melanogaster made use of sequence-based alignments and employed a sliding window approach, inevitably biasing identification toward RNAs encoded in the more conserved parts of the genome. RESULTS To search for conserved RNA structures (CRSs) that may not be highly conserved in sequence and to assess the expression of CRSs, we conducted a genome-wide structural alignment screen of 27 insect genomes including D. melanogaster and integrated this with an extensive set of tiling array data. The structural alignment screen revealed ∼30,000 novel candidate CRSs at an estimated false discovery rate of less than 10%. With more than one quarter of all individual CRS motifs showing sequence identities below 60%, the predicted CRSs largely complement the findings of sliding window approaches applied previously. While a sixth of the CRSs were ubiquitously expressed, we found that most were expressed in specific developmental stages or cell lines. Notably, most statistically significant enrichment of CRSs were observed in pupae, mainly in exons of untranslated regions, promotors, enhancers, and long ncRNAs. Interestingly, cell lines were found to express a different set of CRSs than were found in vivo. Only a small fraction of intergenic CRSs were co-expressed with the adjacent protein coding genes, which suggests that most intergenic CRSs are independent genetic units. CONCLUSIONS This study provides a more comprehensive view of the ncRNA transcriptome in fly as well as evidence for differential expression of CRSs during development and in cell lines.
Collapse
Affiliation(s)
- Rebecca Kirsch
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
- Department of Veterinary and Animal Science, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16–18, Leipzig, D-04107 Germany
| | - Stefan E. Seemann
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
- Department of Veterinary and Animal Science, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
| | - Walter L. Ruzzo
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
- School of Computer Science and Engineering, University of Washington, Box 352350, Seattle, 98195-2350 WA USA
- Department of Genome Sciences, University of Washington, Box 355065, Seattle, 98195-5065 WA USA
- Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Seattle, 98109-1024 WA USA
| | - Stephen M. Cohen
- Department of Cellular and Molecular Medicine, University of Copenhagen, Blegdamsvej 3, Copenhagen N, DK-2200 Denmark
| | - Peter F. Stadler
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
- Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16–18, Leipzig, D-04107 Germany
- Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, Leipzig, D-04103 Germany
- Faculdad de Ciencias, Universidad Nacional de Colombia, Sede Bogotá, Ciudad Universitaria, Bogotá, COL-111321 D.C. Colombia
- Department of Theoretical Chemistry, University of Vienna, Währinger Straße 17, Vienna, A-1090 Austria
- Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM87501 USA
| | - Jan Gorodkin
- Center for non-coding RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
- Department of Veterinary and Animal Science, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, DK-1870 Denmark
| |
Collapse
|
3
|
Abstract
Over the last two decades it has become clear that RNA is much more than just a boring intermediate in protein expression. Ancient RNAs still appear in the core information metabolism and comprise a surprisingly large component in bacterial gene regulation. A common theme with these types of mostly small RNAs is their reliance of conserved secondary structures. Large scale sequencing projects, on the other hand, have profoundly changed our understanding of eukaryotic genomes. Pervasively transcribed, they give rise to a plethora of large and evolutionarily extremely flexible noncoding RNAs that exert a vastly diverse array of molecule functions. In this chapter we provide a-necessarily incomplete-overview of the current state of comparative analysis of noncoding RNAs, emphasizing computational approaches as a means to gain a global picture of the modern RNA world.
Collapse
Affiliation(s)
- Rolf Backofen
- Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 106, D-79110 Freiburg, Germany.,Center for non-coding RNA in Technology and Health, Department of Veterinary and Animal Sciences, University of Copenhagen, Grønnegårdsvej 3, DK-1870 Frederiksberg C, Denmark
| | - Jan Gorodkin
- Center for non-coding RNA in Technology and Health, Department of Veterinary and Animal Sciences, University of Copenhagen, Grønnegårdsvej 3, DK-1870 Frederiksberg C, Denmark
| | - Ivo L Hofacker
- Center for non-coding RNA in Technology and Health, Department of Veterinary and Animal Sciences, University of Copenhagen, Grønnegårdsvej 3, DK-1870 Frederiksberg C, Denmark.,Institute for Theoretical Chemistry, University of Vienna, Währingerstraße 17, A-1090 Wien, Austria.,Bioinformatics and Computational Biology Research Group, University of Vienna, Währingerstraße 17, A-1090 Vienna, Austria
| | - Peter F Stadler
- Center for non-coding RNA in Technology and Health, Department of Veterinary and Animal Sciences, University of Copenhagen, Grønnegårdsvej 3, DK-1870 Frederiksberg C, Denmark. .,Institute for Theoretical Chemistry, University of Vienna, Währingerstraße 17, A-1090 Wien, Austria. .,Bioinformatics Group, Department of Computer Science, Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstraße 16-18, D-04107 Leipzig, Germany. .,Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany. .,Fraunhofer Institute for Cell Therapy and Immunology, Perlickstraße 1, D-04103 Leipzig, Germany. .,Santa Fe Institute, 1399 Hyde Park Rd, Santa Fe, NM 87501, USA.
| |
Collapse
|
4
|
Schneider HW, Raiol T, Brigido MM, Walter MEMT, Stadler PF. A Support Vector Machine based method to distinguish long non-coding RNAs from protein coding transcripts. BMC Genomics 2017; 18:804. [PMID: 29047334 PMCID: PMC5648457 DOI: 10.1186/s12864-017-4178-4] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2017] [Accepted: 10/05/2017] [Indexed: 12/31/2022] Open
Abstract
Background In recent years, a rapidly increasing number of RNA transcripts has been generated by thousands of sequencing projects around the world, creating enormous volumes of transcript data to be analyzed. An important problem to be addressed when analyzing this data is distinguishing between long non-coding RNAs (lncRNAs) and protein coding transcripts (PCTs). Thus, we present a Support Vector Machine (SVM) based method to distinguish lncRNAs from PCTs, using features based on frequencies of nucleotide patterns and ORF lengths, in transcripts. Methods The proposed method is based on SVM and uses the first ORF relative length and frequencies of nucleotide patterns selected by PCA as features. FASTA files were used as input to calculate all possible features. These features were divided in two sets: (i) 336 frequencies of nucleotide patterns; and (ii) 4 features derived from ORFs. PCA were applied to the first set to identify 6 groups of frequencies that could most contribute to the distinction. Twenty-four experiments using the 6 groups from the first set and the features from the second set where built to create the best model to distinguish lncRNAs from PCTs. Results This method was trained and tested with human (Homo sapiens), mouse (Mus musculus) and zebrafish (Danio rerio) data, achieving 98.21%, 98.03% and 96.09%, accuracy, respectively. Our method was compared to other tools available in the literature (CPAT, CPC, iSeeRNA, lncRNApred, lncRScan-SVM and FEELnc), and showed an improvement in accuracy by ≈3.00%. In addition, to validate our model, the mouse data was classified with the human model, and vice-versa, achieving ≈97.80% accuracy in both cases, showing that the model is not overfit. The SVM models were validated with data from rat (Rattus norvegicus), pig (Sus scrofa) and fruit fly (Drosophila melanogaster), and obtained more than 84.00% accuracy in all these organisms. Our results also showed that 81.2% of human pseudogenes and 91.7% of mouse pseudogenes were classified as non-coding. Moreover, our method was capable of re-annotating two uncharacterized sequences of Swiss-Prot database with high probability of being lncRNAs. Finally, in order to use the method to annotate transcripts derived from RNA-seq, previously identified lncRNAs of human, gorilla (Gorilla gorilla) and rhesus macaque (Macaca mulatta) were analyzed, having successfully classified 98.62%, 80.8% and 91.9%, respectively. Conclusions The SVM method proposed in this work presents high performance to distinguish lncRNAs from PCTs, as shown in the results. To build the model, besides using features known in the literature regarding ORFs, we used PCA to identify features among nucleotide pattern frequencies that contribute the most in distinguishing lncRNAs from PCTs, in reference data sets. Interestingly, models created with two evolutionary distant species could distinguish lncRNAs of even more distant species. Electronic supplementary material The online version of this article (doi:10.1186/s12864-017-4178-4) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Hugo W Schneider
- Department of Computer Science, University of Brasilia, ICC Central, Instituto de Ciências Exatas, Campus Universitario Darcy Ribeiro, Asa Norte, CEP: 70910-900, Brasilia, Brazil.
| | - Taina Raiol
- Gerência Regional de Brasilia (GEREB), Oswaldo Cruz Foundation (Fiocruz), Av. L3 Norte, Campus Universitário Darcy Ribeiro, Gleba A, Asa Norte, CEP: 70910-900, Brasília, Brazil
| | - Marcelo M Brigido
- Laboratory of Molecular Biology, University of Brasilia, Instituto de Ciencias Biologicas, Campus Universitario Darcy Ribeiro, Asa Norte, CEP: 70910-900, Brasilia, Brazil
| | - Maria Emilia M T Walter
- Department of Computer Science, University of Brasilia, ICC Central, Instituto de Ciências Exatas, Campus Universitario Darcy Ribeiro, Asa Norte, CEP: 70910-900, Brasilia, Brazil
| | - Peter F Stadler
- Bioinformatics Group, Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Hartelstrasse 16-18, Leipzig, D-04107, Germany
| |
Collapse
|
5
|
Rare Splice Variants in Long Non-Coding RNAs. Noncoding RNA 2017; 3:ncrna3030023. [PMID: 29657294 PMCID: PMC5831916 DOI: 10.3390/ncrna3030023] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2017] [Revised: 06/24/2017] [Accepted: 06/29/2017] [Indexed: 12/12/2022] Open
Abstract
Long non-coding RNAs (lncRNAs) form a substantial component of the transcriptome and are involved in a wide variety of regulatory mechanisms. Compared to protein-coding genes, they are often expressed at low levels and are restricted to a narrow range of cell types or developmental stages. As a consequence, the diversity of their isoforms is still far from being recorded and catalogued in its entirety, and the debate is ongoing about what fraction of non-coding RNAs truly conveys biological function rather than being "junk". Here, using a collection of more than 100 transcriptomes from related B cell lymphoma, we show that lncRNA loci produce a very defined set of splice variants. While some of them are so rare that they become recognizable only in the superposition of dozens or hundreds of transcriptome datasets and not infrequently include introns or exons that have not been included in available genome annotation data, there is still a very limited number of processing products for any given locus. The combined depth of our sequencing data is large enough to effectively exhaust the isoform diversity: the overwhelming majority of splice junctions that are observed at all are represented by multiple junction-spanning reads. We conclude that the human transcriptome produces virtually no background of RNAs that are processed at effectively random positions, but is-under normal circumstances-confined to a well defined set of splice variants.
Collapse
|
6
|
HIPSTR and thousands of lncRNAs are heterogeneously expressed in human embryos, primordial germ cells and stable cell lines. Sci Rep 2016; 6:32753. [PMID: 27605307 PMCID: PMC5015059 DOI: 10.1038/srep32753] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2016] [Accepted: 08/11/2016] [Indexed: 01/02/2023] Open
Abstract
Eukaryotic genomes are transcribed into numerous regulatory long non-coding RNAs (lncRNAs). Compared to mRNAs, lncRNAs display higher developmental stage-, tissue-, and cell-subtype-specificity of expression, and are generally less abundant in a population of cells. Despite the progress in single-cell-focused research, the origins of low population-level expression of lncRNAs in homogeneous populations of cells are poorly understood. Here, we identify HIPSTR (Heterogeneously expressed from the Intronic Plus Strand of the TFAP2A-locus RNA), a novel lncRNA gene in the developmentally regulated TFAP2A locus. HIPSTR has evolutionarily conserved expression patterns, its promoter is most active in undifferentiated cells, and depletion of HIPSTR in HEK293 and in pluripotent H1BP cells predominantly affects the genes involved in early organismal development and cell differentiation. Most importantly, we find that HIPSTR is specifically induced and heterogeneously expressed in the 8-cell-stage human embryos during the major wave of embryonic genome activation. We systematically explore the phenomenon of cell-to-cell variation of gene expression and link it to low population-level expression of lncRNAs, showing that, similar to HIPSTR, the expression of thousands of lncRNAs is more highly heterogeneous than the expression of mRNAs in the individual, otherwise indistinguishable cells of totipotent human embryos, primordial germ cells, and stable cell lines.
Collapse
|
7
|
Abstract
All living organisms sense and respond to harmful changes in their intracellular and extracellular environment through complex signaling pathways that lead to changes in gene expression and cellular function in order to maintain homeostasis. Long non-coding RNAs (lncRNAs), a large and heterogeneous group of functional RNAs, play important roles in cellular response to stressful conditions. lncRNAs constitute a significant fraction of the genes differentially expressed in response to diverse stressful stimuli and, once induced, contribute to the regulation of downstream cellular processes, including feedback regulation of key stress response proteins. While many lncRNAs seem to be induced in response to a specific stress, there is significant overlap between lncRNAs induced in response to different stressful stimuli. In addition to stress-induced RNAs, several constitutively expressed lncRNAs also exert a strong regulatory impact on the stress response. Although our understanding of the contribution of lncRNAs to the cellular stress response is still highly rudimentary, the existing data point to the presence of a complex network of lncRNAs, miRNAs, and proteins in regulation of the cellular response to stress.
Collapse
Affiliation(s)
- Saba Valadkhan
- Department of Molecular Biology and Microbiology, Case Western Reserve University School of Medicine, Cleveland, OH, 44106, USA.
| | - Alberto Valencia-Hipólito
- Department of Molecular Biology and Microbiology, Case Western Reserve University School of Medicine, Cleveland, OH, 44106, USA
| |
Collapse
|