1
Lin Z, Qin Y, Chen H, Shi D, Zhong M, An T, Chen L, Wang Y, Lin F, Li G, Ji ZL. TransIntegrator: capture nearly full protein-coding transcript variants via integrating Illumina and PacBio transcriptomes. Brief Bioinform 2023; 24:bbad334. [PMID: 37779246 DOI: 10.1093/bib/bbad334] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2023] [Revised: 08/23/2023] [Accepted: 08/30/2023] [Indexed: 10/03/2023] Open
Abstract
Genes have the ability to produce transcript variants that perform specific cellular functions. However, accurately detecting all transcript variants remains a long-standing challenge, especially when working with poorly annotated genomes or without a known genome. To address this issue, we have developed a new computational method, TransIntegrator, which enables transcriptome-wide detection of novel transcript variants. For this, we determined 10 Illumina sequencing transcriptomes and a PacBio full-length transcriptome for consecutive embryo development stages of amphioxus, a species of great evolutionary importance. Based on the transcriptomes, we employed TransIntegrator to create a comprehensive transcript variant library, namely iTranscriptome. The resulting iTranscriptome contained 91 915 distinct transcript variants, with an average of 2.4 variants per gene. This substantially improved current amphioxus genome annotation by expanding the number of genes from 21 954 to 38 777. Further analysis showed that the gene expansion was largely ascribed to integration of multiple Illumina datasets instead of involving the PacBio data. Moreover, we demonstrated an example application of TransIntegrator, via generating iTranscriptome, in aiding accurate transcriptome assembly, which significantly outperformed other hybrid methods such as IDP-denovo and Trinity. For user convenience, we have deposited the source code of TransIntegrator on GitHub as well as a conda package in Anaconda. In summary, this study proposes an affordable but efficient method for reliable transcriptomic research in most species.
Affiliation(s)
- Zhe Lin
  - State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Faculty of Medicine and Life Sciences, Xiamen University, 361102, Xiamen, China
  - National Institute for Data Science in Health and Medicine, Xiamen University, 361102, Xiamen, China
- Yangmei Qin
  - State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Faculty of Medicine and Life Sciences, Xiamen University, 361102, Xiamen, China
- Hao Chen
  - State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Faculty of Medicine and Life Sciences, Xiamen University, 361102, Xiamen, China
- Dan Shi
  - State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Faculty of Medicine and Life Sciences, Xiamen University, 361102, Xiamen, China
- Mindong Zhong
  - State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Faculty of Medicine and Life Sciences, Xiamen University, 361102, Xiamen, China
- Te An
  - School of Informatics, Xiamen University, 361005, Xiamen, China
- Linshan Chen
  - State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Faculty of Medicine and Life Sciences, Xiamen University, 361102, Xiamen, China
- Yiquan Wang
  - State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Faculty of Medicine and Life Sciences, Xiamen University, 361102, Xiamen, China
- Fan Lin
  - National Institute for Data Science in Health and Medicine, Xiamen University, 361102, Xiamen, China
  - School of Informatics, Xiamen University, 361005, Xiamen, China
- Guang Li
  - State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Faculty of Medicine and Life Sciences, Xiamen University, 361102, Xiamen, China
- Zhi-Liang Ji
  - State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Faculty of Medicine and Life Sciences, Xiamen University, 361102, Xiamen, China
  - National Institute for Data Science in Health and Medicine, Xiamen University, 361102, Xiamen, China
2
Engelhard CA, Khani S, Derdak S, Bilban M, Kornfeld JW. Nanopore sequencing unveils the complexity of the cold-activated murine brown adipose tissue transcriptome. iScience 2023; 26:107190. [PMID: 37564700 PMCID: PMC10410515 DOI: 10.1016/j.isci.2023.107190] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2023] [Revised: 04/28/2023] [Accepted: 06/16/2023] [Indexed: 08/12/2023] Open
Abstract
Alternative transcription increases transcriptome complexity by expression of multiple transcripts per gene. Annotation and quantification of transcripts using short-read sequencing is non-trivial. Long-read sequencing aims at overcoming these problems by sequencing full-length transcripts. Activation of brown adipose tissue (BAT) thermogenesis involves major transcriptomic remodeling and positively affects metabolism via increased energy expenditure. We benchmark Oxford Nanopore Technology (ONT) long-read sequencing protocols to Illumina short-read sequencing assessing alignment characteristics, gene and transcript detection and quantification, differential gene and transcript expression, transcriptome reannotation, and differential transcript usage (DTU). We find ONT sequencing is superior to Illumina for transcriptome reassembly, reducing the risk of false-positive events by unambiguously mapping reads to transcripts. We identified novel isoforms of genes undergoing DTU in cold-activated BAT including Cars2, Adtrp, Acsl5, Scp2, Aldoa, and Pde4d, validated by real-time PCR. The reannotated murine BAT transcriptome established here provides a framework for future investigations into the regulation of BAT.
Affiliation(s)
- Christoph Andreas Engelhard
  - Department for Biochemistry and Molecular Biology (BMB), University of Southern Denmark, Campusvej 55, 5230 Odense M, Denmark
- Sajjad Khani
  - Max Planck Institute for Metabolism Research, Gleueler Strasse 50, 50931 Cologne, Germany
  - Cologne Excellence Cluster on Cellular Stress Responses in Ageing-Associated Diseases (CECAD), University of Cologne, Cologne, Germany
- Sophia Derdak
  - Core Facilities, Medical University of Vienna, Lazarettgasse 14, 1090 Vienna, Austria
- Martin Bilban
  - Department of Laboratory Medicine & Core Facilities, Medical University of Vienna, Waehringer Guertel 18-20, 1090 Vienna, Austria
- Jan-Wilhelm Kornfeld
  - Department for Biochemistry and Molecular Biology (BMB), University of Southern Denmark, Campusvej 55, 5230 Odense M, Denmark
3
Núñez-Moreno G, Tamayo A, Ruiz-Sánchez C, Cortón M, Mínguez P. VIsoQLR: an interactive tool for the detection, quantification and fine-tuning of isoforms in selected genes using long-read sequencing. Hum Genet 2023; 142:495-506. [PMID: 36881176 DOI: 10.1007/s00439-023-02539-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2022] [Accepted: 02/23/2023] [Indexed: 03/08/2023]
Abstract
DNA variants altering the pre-mRNA splicing process represent an underestimated cause of human genetic diseases. Their association with disease traits should be confirmed using functional assays from patient cell lines or alternative models to detect aberrant mRNAs. Long-read sequencing is a suitable technique to identify and quantify mRNA isoforms. Available isoform detection and/or quantification tools are generally designed for whole-transcriptome analysis. However, experiments focusing on genes of interest need more precise data fine-tuning and visualization tools. Here we describe VIsoQLR, an interactive analyzer, viewer and editor for the semi-automated identification and quantification of known and novel isoforms using long-read sequencing data. VIsoQLR is tailored to thoroughly analyze mRNA expression in splicing assays of selected genes. Our tool takes sequences aligned to a reference, and for each gene, it defines consensus splice sites and quantifies isoforms. VIsoQLR introduces features to edit the splice sites through dynamic and interactive graphics and tables, allowing accurate manual curation. Known isoforms detected by other methods can also be imported as references for comparison. A benchmark against two other popular transcriptome-based tools shows VIsoQLR's accurate performance on both detection and quantification of isoforms. Here, we present VIsoQLR principles and features and its applicability in a case study example using nanopore-based long-read sequencing. VIsoQLR is available at https://github.com/TBLabFJD/VIsoQLR.
Affiliation(s)
- Gonzalo Núñez-Moreno
  - Department of Genetics and Genomics, Health Research Institute-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Madrid, Spain
  - Bioinformatics Unit, Health Research Institute-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Madrid, Spain
  - Center for Biomedical Network Research On Rare Diseases (CIBERER), Instituto de Salud Carlos III, Madrid, Spain
- Alejandra Tamayo
  - Department of Genetics and Genomics, Health Research Institute-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Madrid, Spain
  - Center for Biomedical Network Research On Rare Diseases (CIBERER), Instituto de Salud Carlos III, Madrid, Spain
  - Department of Surgery, Medical and Social Sciences, Faculty of Medicine and Health Sciences, Science and Technology Campus, University of Alcalá, 28871, Alcalá de Henares, Spain
- Carolina Ruiz-Sánchez
  - Department of Genetics and Genomics, Health Research Institute-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Madrid, Spain
- Marta Cortón
  - Department of Genetics and Genomics, Health Research Institute-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Madrid, Spain
  - Center for Biomedical Network Research On Rare Diseases (CIBERER), Instituto de Salud Carlos III, Madrid, Spain
- Pablo Mínguez
  - Department of Genetics and Genomics, Health Research Institute-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Madrid, Spain
  - Bioinformatics Unit, Health Research Institute-Fundación Jiménez Díaz University Hospital, Universidad Autónoma de Madrid (IIS-FJD, UAM), Madrid, Spain
  - Center for Biomedical Network Research On Rare Diseases (CIBERER), Instituto de Salud Carlos III, Madrid, Spain
4
Amy Lyu MJ, Tang Q, Wang Y, Essemine J, Chen F, Ni X, Chen G, Zhu XG. Evolution of gene regulatory network of C4 photosynthesis in the genus Flaveria reveals the evolutionary status of C3-C4 intermediate species. Plant Communications 2023; 4:100426. [PMID: 35986514 PMCID: PMC9860191 DOI: 10.1016/j.xplc.2022.100426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2022] [Revised: 06/16/2022] [Accepted: 08/11/2022] [Indexed: 06/15/2023]
Abstract
C4 photosynthesis evolved from ancestral C3 photosynthesis by recruiting pre-existing genes to fulfill new functions. The enzymes and transporters required for the C4 metabolic pathway have been intensively studied and well documented; however, the transcription factors (TFs) that regulate these C4 metabolic genes are not yet well understood. In particular, how the TF regulatory network of C4 metabolic genes was rewired during the evolutionary process is unclear. Here, we constructed gene regulatory networks (GRNs) for four closely evolutionarily related species from the genus Flaveria, which represent four different evolutionary stages of C4 photosynthesis: C3 (F. robusta), type I C3-C4 (F. sonorensis), type II C3-C4 (F. ramosissima), and C4 (F. trinervia). Our results show that more than half of the co-regulatory relationships between TFs and core C4 metabolic genes are species specific. The counterparts of the C4 genes in C3 species were already co-regulated with photosynthesis-related genes, whereas the required TFs for C4 photosynthesis were recruited later. The TFs involved in C4 photosynthesis were widely recruited in the type I C3-C4 species; nevertheless, type II C3-C4 species showed a divergent GRN from C4 species. In line with these findings, a 13CO2 pulse-labeling experiment showed that the CO2 initially fixed into C4 acid was not directly released to the Calvin-Benson-Bassham cycle in the type II C3-C4 species. Therefore, our study uncovered dynamic changes in C4 genes and TF co-regulation during the evolutionary process; furthermore, we showed that the metabolic pathway of the type II C3-C4 species F. ramosissima represents an alternative evolutionary solution to the ammonia imbalance in C3-C4 intermediate species.
Affiliation(s)
- Ming-Ju Amy Lyu
  - National Key Laboratory of Plant Molecular Genetics, CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai, China
- Qiming Tang
  - National Key Laboratory of Plant Molecular Genetics, CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai, China
  - University of Chinese Academy of Sciences
- Yanjie Wang
  - National Key Laboratory of Plant Molecular Genetics, CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai, China
  - University of Chinese Academy of Sciences
- Jemaa Essemine
  - National Key Laboratory of Plant Molecular Genetics, CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai, China
- Faming Chen
  - National Key Laboratory of Plant Molecular Genetics, CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai, China
- Xiaoxiang Ni
  - National Key Laboratory of Plant Molecular Genetics, CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai, China
  - University of Chinese Academy of Sciences
- Genyun Chen
  - National Key Laboratory of Plant Molecular Genetics, CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai, China
- Xin-Guang Zhu
  - National Key Laboratory of Plant Molecular Genetics, CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai, China
5
Farkas C, Recabal A, Mella A, Candia-Herrera D, Olivero MG, Haigh JJ, Tarifeño-Saldivia E, Caprile T. annotate_my_genomes: an easy-to-use pipeline to improve genome annotation and uncover neglected genes by hybrid RNA sequencing. Gigascience 2022; 11:6874526. [PMID: 36472574 PMCID: PMC9724561 DOI: 10.1093/gigascience/giac099] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/22/2022] [Revised: 07/22/2022] [Accepted: 09/28/2022] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND The advancement of hybrid sequencing technologies is increasingly expanding genome assemblies that are often annotated using hybrid sequencing transcriptomics, leading to improved genome characterization and the identification of novel genes and isoforms in a wide variety of organisms. RESULTS We developed an easy-to-use genome-guided transcriptome annotation pipeline that uses assembled transcripts from hybrid sequencing data as input and distinguishes between coding and long non-coding RNAs by integration of several bioinformatic approaches, including gene reconciliation with previous annotations in GTF format. We demonstrated the efficiency of this approach by correctly assembling and annotating all exons from the chicken SCO-spondin gene (containing more than 105 exons), including the identification of missing genes in the chicken reference annotations by homology assignments. CONCLUSIONS Our method helps to improve the current transcriptome annotation of the chicken brain. Our pipeline, implemented on Anaconda/Nextflow and Docker is an easy-to-use package that can be applied to a broad range of species, tissues, and research areas helping to improve and reconcile current annotations. The code and datasets are publicly available at https://github.com/cfarkas/annotate_my_genomes.
Affiliation(s)
- Antonia Recabal
  - Departamento de Biología Celular, Facultad de Ciencias Biológicas, Universidad de Concepción, Chile
- Andy Mella
  - Instituto de Ciencias Naturales, Universidad de las Américas, Chile
  - Centro Integrativo de Biología y Química Aplicada (CIBQA), Universidad Bernardo O'Higgins, Santiago 8370854, Chile
- Daniel Candia-Herrera
  - Departamento de Bioquímica y Biología Molecular, Facultad de Ciencias Biológicas, Universidad de Concepción, Chile
- Maryori González Olivero
  - Departamento de Biología Celular, Facultad de Ciencias Biológicas, Universidad de Concepción, Chile
- Jody Jonathan Haigh
  - CancerCare Manitoba Research Institute, Winnipeg, MB, Canada
  - Department of Pharmacology and Therapeutics, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, MB, Canada
6
de la Rubia I, Srivastava A, Xue W, Indi JA, Carbonell-Sala S, Lagarde J, Albà MM, Eyras E. RATTLE: reference-free reconstruction and quantification of transcriptomes from Nanopore sequencing. Genome Biol 2022; 23:153. [PMID: 35804393 PMCID: PMC9264490 DOI: 10.1186/s13059-022-02715-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2021] [Accepted: 06/20/2022] [Indexed: 11/04/2022] Open
Abstract
Nanopore sequencing enables the efficient and unbiased measurement of transcriptomes. Current methods for transcript identification and quantification rely on mapping reads to a reference genome, which precludes the study of species with a partial or missing reference or the identification of disease-specific transcripts not readily identifiable from a reference. We present RATTLE, a tool to perform reference-free reconstruction and quantification of transcripts using only Nanopore reads. Using simulated data and experimental data from isoform spike-ins, human tissues, and cell lines, we show that RATTLE accurately determines transcript sequences and their abundances, and shows good scalability with the number of transcripts.
Affiliation(s)
- Ivan de la Rubia
  - EMBL Australia Partner Laboratory Network at the Australian National University, Acton, Canberra, ACT, 2601, Australia
  - Pompeu Fabra University (UPF), E08003, Barcelona, Spain
- Akanksha Srivastava
  - EMBL Australia Partner Laboratory Network at the Australian National University, Acton, Canberra, ACT, 2601, Australia
  - Australian National University, Acton, Canberra, ACT, 2601, Australia
- Wenjing Xue
  - EMBL Australia Partner Laboratory Network at the Australian National University, Acton, Canberra, ACT, 2601, Australia
  - Australian National University, Acton, Canberra, ACT, 2601, Australia
- Joel A Indi
  - EMBL Australia Partner Laboratory Network at the Australian National University, Acton, Canberra, ACT, 2601, Australia
  - Universidade de Lisboa, Lisboa, Portugal
- Silvia Carbonell-Sala
  - Pompeu Fabra University (UPF), E08003, Barcelona, Spain
  - Centre for Regulatory Genomics (CRG), E08001, Barcelona, Spain
- Julien Lagarde
  - Pompeu Fabra University (UPF), E08003, Barcelona, Spain
  - Centre for Regulatory Genomics (CRG), E08001, Barcelona, Spain
- M Mar Albà
  - Pompeu Fabra University (UPF), E08003, Barcelona, Spain
  - Catalan Institution for Research and Advanced Studies (ICREA), E08010, Barcelona, Spain
  - Hospital del Mar Medical Research Institute (IMIM), E08001, Barcelona, Spain
- Eduardo Eyras
  - EMBL Australia Partner Laboratory Network at the Australian National University, Acton, Canberra, ACT, 2601, Australia
  - Australian National University, Acton, Canberra, ACT, 2601, Australia
  - Catalan Institution for Research and Advanced Studies (ICREA), E08010, Barcelona, Spain
  - Hospital del Mar Medical Research Institute (IMIM), E08001, Barcelona, Spain
7
Shumate A, Wong B, Pertea G, Pertea M. Improved transcriptome assembly using a hybrid of long and short reads with StringTie. PLoS Comput Biol 2022; 18:e1009730. [PMID: 35648784 PMCID: PMC9191730 DOI: 10.1371/journal.pcbi.1009730] [Citation(s) in RCA: 84] [Impact Index Per Article: 42.0] [Received: 12/08/2021] [Revised: 06/13/2022] [Accepted: 05/11/2022] [Indexed: 01/01/2023] Open
Abstract
Short-read RNA sequencing and long-read RNA sequencing each have their strengths and weaknesses for transcriptome assembly. While short reads are highly accurate, they are rarely able to span multiple exons. Long-read technology can capture full-length transcripts, but its relatively high error rate often leads to mis-identified splice sites. Here we present a new release of StringTie that performs hybrid-read assembly. By taking advantage of the strengths of both long and short reads, hybrid-read assembly with StringTie is more accurate than long-read only or short-read only assembly, and on some datasets it can more than double the number of correctly assembled transcripts, while obtaining substantially higher precision than the long-read data assembly alone. Here we demonstrate the improved accuracy on simulated data and real data from Arabidopsis thaliana, Mus musculus, and human. We also show that hybrid-read assembly is more accurate than correcting long reads prior to assembly while also being substantially faster. StringTie is freely available as open source software at https://github.com/gpertea/stringtie.
Affiliation(s)
- Alaina Shumate
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America
  - Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, United States of America
- Brandon Wong
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America
  - Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, United States of America
  - Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, United States of America
  - Department of Applied Math and Statistics, Johns Hopkins University, Baltimore, Maryland, United States of America
- Geo Pertea
  - The Lieber Institute for Brain Development, Baltimore, Maryland, United States of America
- Mihaela Pertea
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America
  - Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, United States of America
  - Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, United States of America
8
Shmakov NA. Improving the quality of barley transcriptome de novo assembling by using a hybrid approach for lines with varying spike and stem coloration. Vavilovskii Zhurnal Genet Selektsii 2021; 25:30-38. [PMID: 34901701 PMCID: PMC8627909 DOI: 10.18699/vj21.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/24/2020] [Revised: 01/15/2021] [Accepted: 01/15/2021] [Indexed: 11/19/2022] Open
Abstract
De novo transcriptome assembly is an important stage of RNA-seq data computational analysis. It allows researchers to obtain the sequences of transcripts present in the biological sample of interest. The availability of an accurate and complete transcriptome sequence of the organism of interest is, in turn, an indispensable condition for further analysis of RNA-seq data. Through years of transcriptomic research, the bioinformatics community has developed a number of assembler programs for transcriptome reconstruction from short reads of RNA-seq libraries. Different assemblers make it possible to conduct a de novo transcriptome reconstruction and a genome-guided reconstruction. The majority of the assemblers working with RNA-seq data are based on the De Bruijn graph method of sequence reconstruction. However, specifics of their procedures can vary drastically, as do their results. A number of authors recommend a hybrid approach to transcriptome reconstruction based on combining the results of several assemblers in order to achieve a better transcriptome assembly. The advantage of this approach has been demonstrated in a number of studies, with RNA-seq experiments conducted on the Illumina platform. In this paper, we propose a hybrid approach for creating a transcriptome assembly of the barley Hordeum vulgare isogenic line Bowman and two nearly isogenic lines contrasting in spike pigmentation, based on the results of sequencing on the IonTorrent platform. This approach implements several de novo assemblers: Trinity, Trans-ABySS and rnaSPAdes. Several assembly metrics were examined: the percentage of reference transcripts observed in the assemblies, the percentage of RNA-seq reads involved, and BUSCO scores. It was shown that, based on the summation of these metrics, transcriptome meta-assembly surpasses the individual transcriptome assemblies it consists of.
Affiliation(s)
- N A Shmakov
  - Institute of Cytology and Genetics of Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
  - Kurchatov Genomics Center, Institute of Cytology and Genetics of Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
9
Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol 2021; 39:1348-1365. [PMID: 34750572 PMCID: PMC8988251 DOI: 10.1038/s41587-021-01108-x] [Citation(s) in RCA: 437] [Impact Index Per Article: 145.7] [Received: 12/09/2019] [Accepted: 09/22/2021] [Indexed: 12/13/2022]
Abstract
Rapid advances in nanopore technologies for sequencing single long DNA and RNA molecules have led to substantial improvements in accuracy, read length and throughput. These breakthroughs have required extensive development of experimental and bioinformatics methods to fully exploit nanopore long reads for investigations of genomes, transcriptomes, epigenomes and epitranscriptomes. Nanopore sequencing is being applied in genome assembly, full-length transcript detection and base modification detection and in more specialized areas, such as rapid clinical diagnoses and outbreak surveillance. Many opportunities remain for improving data quality and analytical approaches through the development of new nanopores, base-calling methods and experimental protocols tailored to particular applications.
10
Lima L, Marchet C, Caboche S, Da Silva C, Istace B, Aury JM, Touzet H, Chikhi R. Comparative assessment of long-read error correction software applied to Nanopore RNA-sequencing data. Brief Bioinform 2021; 21:1164-1181. [PMID: 31232449 DOI: 10.1093/bib/bbz058] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 12/28/2018] [Revised: 04/05/2019] [Accepted: 04/22/2019] [Indexed: 12/13/2022] Open
Abstract
MOTIVATION Nanopore long-read sequencing technology offers promising alternatives to high-throughput short read sequencing, especially in the context of RNA-sequencing. However this technology is currently hindered by high error rates in the output data that affect analyses such as the identification of isoforms, exon boundaries, open reading frames and creation of gene catalogues. Due to the novelty of such data, computational methods are still actively being developed and options for the error correction of Nanopore RNA-sequencing long reads remain limited. RESULTS In this article, we evaluate the extent to which existing long-read DNA error correction methods are capable of correcting cDNA Nanopore reads. We provide an automatic and extensive benchmark tool that not only reports classical error correction metrics but also the effect of correction on gene families, isoform diversity, bias toward the major isoform and splice site detection. We find that long read error correction tools that were originally developed for DNA are also suitable for the correction of Nanopore RNA-sequencing data, especially in terms of increasing base pair accuracy. Yet investigators should be warned that the correction process perturbs gene family sizes and isoform diversity. This work provides guidelines on which (or whether) error correction tools should be used, depending on the application type. BENCHMARKING SOFTWARE https://gitlab.com/leoisl/LR_EC_analyser.
Affiliation(s)
- Leandro Lima
  - Univ Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Evolutive UMR Villeurbanne, France
  - EPI ERABLE - Inria Grenoble, Rhône-Alpes, France
  - Università di Roma 'Tor Vergata', Roma, Italy
- Ségolène Caboche
  - Université de Lille, CNRS, Inserm, CHU Lille, Institut Pasteur de Lille, UMR, Center for Infection and Immunity of Lille, Lille, France
- Corinne Da Silva
  - Genoscope, Institut de biologie Francois-Jacob, Commissariat à l'Energie Atomique (CEA), Université Paris-Saclay, Evry, France
- Benjamin Istace
  - Genoscope, Institut de biologie Francois-Jacob, Commissariat à l'Energie Atomique (CEA), Université Paris-Saclay, Evry, France
- Jean-Marc Aury
  - Genoscope, Institut de biologie Francois-Jacob, Commissariat à l'Energie Atomique (CEA), Université Paris-Saclay, Evry, France
- Hélène Touzet
  - CNRS, Université de Lille, CRIStAL UMR, Lille, France
- Rayan Chikhi
  - CNRS, Université de Lille, CRIStAL UMR, Lille, France
  - Institut Pasteur, C3BI - USR 3756, 25-28 rue du Docteur Roux, Paris, France
11
Wang Y, Hu Z, Ye N, Yin H. IsoSplitter: identification and characterization of alternative splicing sites without a reference genome. RNA (New York, N.Y.) 2021; 27:rna.077834.120. [PMID: 34021065 PMCID: PMC8284324 DOI: 10.1261/rna.077834.120] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 09/20/2020] [Accepted: 05/17/2021] [Indexed: 06/12/2023]
Abstract
Long-read transcriptome sequencing is designed to sequence full-length RNA molecules and is advantageous for identifying alternative splice isoforms; however, in the absence of a reference genome, it is difficult to accurately locate splice sites because of the diversity of patterns of alternative splicing (AS). Based on long-read transcriptome data, we developed a versatile tool, IsoSplitter, to reverse-trace and validate AS gene "split-sites" with the following features: (1) IsoSplitter initially invokes a modified SIM4 program to find transcript split-sites; (2) each split-site is then quantified, to reveal transcript diversity, and putative isoforms are grouped into gene clusters; (3) an optional step for aligning short reads is provided, to validate split-sites by identifying unique junction reads, and to reveal and quantify tissue-specific alternative splice isoforms. We tested IsoSplitter AS prediction using datasets from multiple model and non-model plant species, and showed that the IsoSplitter pipeline handles different transcriptomes efficiently and with high accuracy. Furthermore, we compared the IsoSplitter pipeline with the splice junction identification tools Program to Assemble Spliced Alignments (PASA; this software needs a reference genome for AS identification) and AStrap, using data from the model plant Arabidopsis thaliana. We found that IsoSplitter determined more than twice as many AS events as the AStrap analysis, and that 94.13% of the IsoSplitter-predicted AS events were also identified by the PASA analysis. Starting from a simple sequence file, IsoSplitter is an assembly-free tool for identification and characterization of AS. IsoSplitter is developed and implemented in Python 3.5 using the Linux platform and is freely available at https://github.com/Hengfu-Yin/IsoSplitter.
Affiliation(s)
- Yupeng Wang
- College of Information Science and Technology, Nanjing Forestry University, Nanjing, China
| | - Zhikang Hu
- State Key Laboratory of Tree Genetics and Breeding, Research Institute of Subtropical Forestry, Chinese Academy of Forestry, Hangzhou, Zhejiang 311400, China
| | - Ning Ye
- College of Information Science and Technology, Nanjing Forestry University, Nanjing, China
| | - Hengfu Yin
- Research Institute of Subtropical Forestry, Chinese Academy of Forestry
|
12
|
Broseus L, Thomas A, Oldfield AJ, Severac D, Dubois E, Ritchie W. TALC: Transcript-level Aware Long-read Correction. Bioinformatics 2021; 36:5000-5006. [PMID: 32910174 DOI: 10.1093/bioinformatics/btaa634] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/03/2020] [Revised: 05/08/2020] [Accepted: 07/09/2020] [Indexed: 02/06/2023] Open
Abstract
MOTIVATION Long-read sequencing technologies are invaluable for determining complex RNA transcript architectures but are error-prone. Numerous 'hybrid correction' algorithms have been developed for genomic data that correct long reads by exploiting the accuracy and depth of short reads sequenced from the same sample. These algorithms are not suited for correcting more complex transcriptome sequencing data. RESULTS We have created a novel reference-free algorithm called Transcript-level Aware Long-Read Correction (TALC) which models changes in RNA expression and isoform representation in a weighted De Bruijn graph to correct long reads from transcriptome studies. We show that transcript-level aware correction by TALC improves the accuracy of the whole spectrum of downstream RNA-seq applications and is thus necessary for transcriptome analyses that use long read technology. AVAILABILITY AND IMPLEMENTATION TALC is implemented in C++ and available at https://github.com/lbroseus/TALC. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
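For readers unfamiliar with the data structure named in this abstract, the following is a minimal sketch of a weighted De Bruijn graph built from short reads, where each edge weight is the number of times its k-mer was observed. It is not TALC's implementation (TALC is written in C++ and its weighting scheme is not described here); the function names `weighted_de_bruijn` and `path_support`, the toy reads, and the choice of k are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def weighted_de_bruijn(reads, k):
    """Build a De Bruijn graph as {(prefix, suffix): k-mer count}.

    Nodes are (k-1)-mers; each k-mer contributes one edge from its
    prefix to its suffix, and the edge weight is its abundance.
    """
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


def path_support(edges, seq, k):
    """Return the weakest edge weight along the path spelled by seq.

    A value of 0 means at least one k-mer of seq is absent from the
    short reads, which is a hint that seq may contain an error.
    """
    weights = [
        edges.get((seq[i:i + k - 1], seq[i + 1:i + k]), 0)
        for i in range(len(seq) - k + 1)
    ]
    return min(weights) if weights else 0


# Toy short reads (hypothetical data, not from the paper).
short_reads = ["ACGTAC", "ACGTAC", "CGTACG"]
graph = weighted_de_bruijn(short_reads, k=4)

print(path_support(graph, "ACGTAC", k=4))  # 2: every k-mer is supported
print(path_support(graph, "ACGAAC", k=4))  # 0: contains unsupported k-mers
```

In a hybrid-correction setting, a long-read segment with zero or very low support would be a candidate for replacement by a better-supported path through the graph; how such paths are chosen in the presence of varying expression levels is the part this sketch leaves out.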
Affiliation(s)
- Lucile Broseus
- Department of Genome Dynamics, Institut de Génétique Humaine, Centre National de la Recherche Scientifique (CNRS), Université de Montpellier, Montpellier 34396, France
| | - Aubin Thomas
- Department of Genome Dynamics, Institut de Génétique Humaine, Centre National de la Recherche Scientifique (CNRS), Université de Montpellier, Montpellier 34396, France
| | - Andrew J Oldfield
- Department of Genome Dynamics, Institut de Génétique Humaine, Centre National de la Recherche Scientifique (CNRS), Université de Montpellier, Montpellier 34396, France
| | - Dany Severac
- MGX-Montpellier GenomiX, c/o Institut de Génomique Fonctionnelle, Montpellier Cedex 5 34094, France
| | - Emeric Dubois
- MGX-Montpellier GenomiX, c/o Institut de Génomique Fonctionnelle, Montpellier Cedex 5 34094, France
| | - William Ritchie
- Department of Genome Dynamics, Institut de Génétique Humaine, Centre National de la Recherche Scientifique (CNRS), Université de Montpellier, Montpellier 34396, France
|
13
|
Sahlin K, Medvedev P. Error correction enables use of Oxford Nanopore technology for reference-free transcriptome analysis. Nat Commun 2021; 12:2. [PMID: 33397972 PMCID: PMC7782715 DOI: 10.1038/s41467-020-20340-8] [Citation(s) in RCA: 67] [Impact Index Per Article: 22.3] [Received: 05/27/2020] [Accepted: 11/25/2020] [Indexed: 01/24/2023] Open
Abstract
Oxford Nanopore (ONT) is a leading long-read technology which has been revolutionizing transcriptome analysis through its capacity to sequence the majority of transcripts from end-to-end. This has greatly increased our ability to study the diversity of transcription mechanisms such as transcription initiation, termination, and alternative splicing. However, ONT still suffers from high error rates which have thus far limited its scope to reference-based analyses. When a reference is not available or is not a viable option due to reference-bias, error correction is a crucial step towards the reconstruction of the sequenced transcripts and downstream sequence analysis of transcripts. In this paper, we present a novel computational method to error correct ONT cDNA sequencing data, called isONcorrect. IsONcorrect is able to jointly use all isoforms from a gene during error correction, thereby allowing it to correct reads at low sequencing depths. We are able to obtain a median accuracy of 98.9-99.6%, demonstrating the feasibility of applying cost-effective cDNA full transcript length sequencing for reference-free transcriptome analysis.
Affiliation(s)
- Kristoffer Sahlin
- Department of Mathematics, Science for Life Laboratory, Stockholm University, 106 91, Stockholm, Sweden
| | - Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA.
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA.
- Center for Computational Biology and Bioinformatics, The Pennsylvania State University, University Park, PA, USA.
|
14
|
Abstract
RNA-Seq is nowadays an indispensable approach for comparative transcriptome profiling in model and nonmodel organisms. Analyzing RNA-Seq data from nonmodel organisms poses unique challenges, due to unavailability of a high-quality genome reference and to relative sparsity of tools for downstream functional analyses. In this chapter, we provide an overview of the analysis steps in RNA-Seq projects of nonmodel organisms, while elaborating on aspects that are unique to this analysis. These will include (1) strategic decisions that have to be made in advance, regarding sequencing technology and reference to use; (2) how to search for available draft genomes, and, if necessary, how to improve their gene prediction and annotation; (3) how to clean raw reads before de novo assembly; (4) how to separate the reads in RNA-Seq projects of symbiont organisms; (5) how to design and carry out a de novo transcriptome assembly that will be comprehensive and reliable; (6) how to assess transcriptome quality; (7) when and how to reduce redundancy in the transcriptome; (8) techniques and considerations in transcriptome functional annotation; (9) quantitating transcript abundance in the face of high transcriptome redundancy; and, most importantly, (10) how to achieve functional enrichment testing using available tools which either support a large range of species or enable a universal, non-species-specific analysis. Throughout the chapter, we will refer to a variety of useful software tools. For the initial analysis steps involving high-volume data, these will include Linux-based programs. For the later steps, we will describe both Linux and R packages for advanced users, as well as many user-friendly tools for nonprogrammers. Finally, we will present a full workflow for RNA-Seq analysis of nonmodel organisms using the NeatSeq-Flow platform, which can be used locally through a user-friendly interface.
Affiliation(s)
- Vered Chalifa-Caspi
- Bioinformatics Core Facility, Ben-Gurion University of the Negev, Beer-Sheva, Israel.
|
15
|
Puglia GD, Prjibelski AD, Vitale D, Bushmanova E, Schmid KJ, Raccuia SA. Hybrid transcriptome sequencing approach improved assembly and gene annotation in Cynara cardunculus (L.). BMC Genomics 2020; 21:317. [PMID: 32819282 PMCID: PMC7441626 DOI: 10.1186/s12864-020-6670-5] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 05/31/2019] [Accepted: 03/13/2020] [Indexed: 12/11/2022] Open
Abstract
Background The investigation of transcriptome profiles using short reads in non-model organisms, which lack well-annotated genomes, is limited by partial gene reconstruction and isoform detection. In contrast, long-read sequencing techniques have shown their potential to generate complete transcript assemblies even when a reference genome is lacking. Cynara cardunculus var. altilis (DC) (cultivated cardoon) is a perennial hardy crop adapted to dry environments with many industrial and nutraceutical applications due to the richness of secondary metabolites mostly produced in flower heads. The investigation of this species benefited from the recent release of a draft genome, but the transcriptome profile during capitulum formation remains unexplored. In the present study we present a transcriptome analysis of vegetative and inflorescence organs of cultivated cardoon through a novel hybrid RNA-seq assembly approach utilizing both long and short RNA-seq reads. Results The inclusion of a single Nanopore flow-cell output in a hybrid sequencing approach yielded an increase of 15% in completely assembled genes and 18% in transcript isoforms with respect to short reads alone. Among 25,463 assembled unigenes, we identified 578 new genes and updated 13,039 gene models, 11,169 of which were alternatively spliced isoforms. During capitulum development, 3424 genes were differentially expressed and approximately two-thirds were identified as transcription factors including bHLH, MYB, NAC, C2H2 and MADS-box, which were highly expressed especially after capitulum opening. We also show the expression dynamics of key genes involved in the production of valuable secondary metabolites in which the capitulum is rich, such as phenylpropanoids, flavonoids and sesquiterpene lactones. Most of their biosynthetic genes were strongly transcribed in the flower heads, with alternative isoforms exhibiting differential expression levels across the tissues.
Conclusions This novel hybrid sequencing approach made it possible to improve the transcriptome assembly, to update more than half of the annotated genes and to identify many novel genes and different alternatively spliced isoforms. This study provides new insights into the flowering cycle in an Asteraceae plant, a valuable resource for plant biology and breeding in Cynara, and an effective method for improving gene annotation.
Affiliation(s)
- Giuseppe D Puglia
- Institute for Plant Breeding, Seed Science and Population Genetics, University of Hohenheim, Fruwirthstrasse 21, 70599, Stuttgart, Germany
- Consiglio Nazionale delle Ricerche, Istituto per i Sistemi Agricoli e Forestali del Mediterraneo (CNR-ISAFOM) U.O.S. Catania, Via Empedocle, 58, 95128, Catania, Italy
| | - Andrey D Prjibelski
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
| | - Domenico Vitale
- Consiglio Nazionale delle Ricerche, Istituto per i Sistemi Agricoli e Forestali del Mediterraneo (CNR-ISAFOM) U.O.S. Catania, Via Empedocle, 58, 95128, Catania, Italy
| | - Elena Bushmanova
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
| | - Karl J Schmid
- Institute for Plant Breeding, Seed Science and Population Genetics, University of Hohenheim, Fruwirthstrasse 21, 70599, Stuttgart, Germany.
| | - Salvatore A Raccuia
- Consiglio Nazionale delle Ricerche, Istituto per i Sistemi Agricoli e Forestali del Mediterraneo (CNR-ISAFOM) U.O.S. Catania, Via Empedocle, 58, 95128, Catania, Italy
|
16
|
Prjibelski AD, Puglia GD, Antipov D, Bushmanova E, Giordano D, Mikheenko A, Vitale D, Lapidus A. Extending rnaSPAdes functionality for hybrid transcriptome assembly. BMC Bioinformatics 2020; 21:302. [PMID: 32703149 PMCID: PMC7379828 DOI: 10.1186/s12859-020-03614-2] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 06/14/2020] [Accepted: 06/18/2020] [Indexed: 11/29/2022] Open
Abstract
BACKGROUND De novo RNA-Seq assembly is a powerful method for analysing transcriptomes when the reference genome is not available or poorly annotated. However, due to the short length of Illumina reads it is usually impossible to reconstruct complete sequences of complex genes and alternative isoforms. The recently emerged possibility of generating long RNA reads, for example with PacBio and Oxford Nanopore sequencing, may dramatically improve the assembly quality, and thus the subsequent analysis. While reference-based tools for analysing long RNA reads were recently developed, there is no established pipeline for de novo assembly of such data. RESULTS In this work we present a novel method that performs high-quality de novo transcriptome assemblies by combining the accuracy and reliability of short reads with exon structure information carried by long error-prone reads. The algorithm is designed by incorporating the existing hybridSPAdes approach into the rnaSPAdes pipeline and adapting it for transcriptomic data. CONCLUSION To evaluate the benefit of using long RNA reads we selected several datasets containing both Illumina and Iso-seq or Oxford Nanopore Technologies (ONT) reads. Using an existing quality assessment software, we show that hybrid assemblies performed with rnaSPAdes contain more full-length genes and alternative isoforms compared with assemblies that use only short-read data.
Affiliation(s)
- Andrey D Prjibelski
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia.
| | - Giuseppe D Puglia
- Consiglio Nazionale delle Ricerche, Istituto per i Sistemi Agricoli e Forestali del Mediterraneo, Catania, Italy
| | - Dmitry Antipov
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
| | - Elena Bushmanova
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
| | - Daniela Giordano
- Department of Electrical, Electronics and Computer Engineering, University of Catania, Catania, Italy
| | - Alla Mikheenko
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
| | - Domenico Vitale
- Consiglio Nazionale delle Ricerche, Istituto per i Sistemi Agricoli e Forestali del Mediterraneo, Catania, Italy
| | - Alla Lapidus
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
|
17
|
Oikonomopoulos S, Bayega A, Fahiminiya S, Djambazian H, Berube P, Ragoussis J. Methodologies for Transcript Profiling Using Long-Read Technologies. Front Genet 2020; 11:606. [PMID: 32733532 PMCID: PMC7358353 DOI: 10.3389/fgene.2020.00606] [Citation(s) in RCA: 53] [Impact Index Per Article: 13.3] [Received: 01/08/2020] [Accepted: 05/19/2020] [Indexed: 12/28/2022] Open
Abstract
RNA sequencing using next-generation sequencing technologies (NGS) is currently the standard approach for gene expression profiling, particularly for large-scale high-throughput studies. NGS technologies comprise high-throughput, cost-efficient short-read RNA-Seq, while emerging single-molecule, long-read RNA-Seq technologies have enabled new approaches to study the transcriptome and its function. The emerging single-molecule, long-read technologies are currently commercially available from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), while new methodologies based on short-read sequencing approaches are also being developed in order to provide long-range single-molecule-level information, for example those represented by the 10x Genomics linked-read methodology. The shift toward long-read sequencing technologies for transcriptome characterization is based on current increases in throughput and decreases in cost, making these attractive for de novo transcriptome assembly, isoform expression quantification, and in-depth RNA species analysis. These types of analyses were challenging with standard short-read sequencing approaches, due to the complex nature of the transcriptome, which consists of variable lengths of transcripts and multiple alternatively spliced isoforms for most genes, as well as the high sequence similarity of highly abundant species of RNA, such as rRNAs. Here we aim to focus on single-molecule-level sequencing technologies and single-cell technologies that, combined with perturbation tools, allow the analysis of complete RNA species, whether short or long, at high resolution. In parallel, these tools have opened new ways of understanding gene functions at the tissue, network, and pathway levels, as well as their detailed functional characterization.
Analysis of the epi-transcriptome, including RNA methylation and modification and the effects of such modifications on biological systems is now enabled through direct RNA sequencing instead of classical indirect approaches. However, many difficulties and challenges remain, such as methodologies to generate full-length RNA or cDNA libraries from all different species of RNAs, not only poly-A containing transcripts, and the identification of allele-specific transcripts due to current error rates of single molecule technologies, while the bioinformatics analysis on long-read data for accurate identification of 5' and 3' UTRs is still in development.
Affiliation(s)
- Spyros Oikonomopoulos
- McGill Genome Centre, Department of Human Genetics, McGill University, Montréal, QC, Canada
| | - Anthony Bayega
- McGill Genome Centre, Department of Human Genetics, McGill University, Montréal, QC, Canada
| | - Somayyeh Fahiminiya
- McGill Genome Centre, Department of Human Genetics, McGill University, Montréal, QC, Canada
| | - Haig Djambazian
- McGill Genome Centre, Department of Human Genetics, McGill University, Montréal, QC, Canada
| | - Pierre Berube
- McGill Genome Centre, Department of Human Genetics, McGill University, Montréal, QC, Canada
| | - Jiannis Ragoussis
- McGill Genome Centre, Department of Human Genetics, McGill University, Montréal, QC, Canada
- Department of Bioengineering, McGill University, Montréal, QC, Canada
|
18
|
Hu Z, Lyu T, Yan C, Wang Y, Ye N, Fan Z, Li X, Li J, Yin H. Identification of alternatively spliced gene isoforms and novel noncoding RNAs by single-molecule long-read sequencing in Camellia. RNA Biol 2020; 17:966-976. [PMID: 32160106 PMCID: PMC7549672 DOI: 10.1080/15476286.2020.1738703] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 10/07/2019] [Revised: 12/23/2019] [Accepted: 02/13/2020] [Indexed: 02/09/2023] Open
Abstract
Direct single-molecule sequencing of full-length transcripts allows efficient identification of gene isoforms, which makes it well suited to alternative splicing (AS), polyadenylation, and long non-coding RNA analyses. However, the identification of gene isoforms and long non-coding RNAs with novel regulatory functions remains challenging, especially for species without a reference genome. Here, we present a comprehensive analysis of combined long-read and short-read transcriptome sequencing in Camellia japonica. Through a novel bioinformatic pipeline that reverse-traces the split-sites, we have uncovered 257,692 AS sites from 61,838 transcripts, and 13,068 AS isoforms have been validated by aligning the short reads. We have identified the tissue-specific AS isoforms along with 6,373 AS events that were found in all tissues. Furthermore, we have analysed the polyadenylation (polyA) patterns of transcripts, and found that the preference for polyA signals was different between the AS and non-AS transcripts. Moreover, we have predicted the phased small interfering RNA (phasiRNA) loci through integrative analyses of transcriptome and small RNA sequencing. We have shown that a newly evolved phasiRNA locus from lipoxygenases generated 12 consecutive 21 bp secondary RNAs, which were responsive to cold and heat stress in Camellia. Our studies of the isoform transcriptome provide insights into gene splicing and functions that may facilitate the mechanistic understanding of plants.
Affiliation(s)
- Zhikang Hu
- State Key Laboratory of Tree Genetics and Breeding, Research Institute of Subtropical Forestry, Chinese Academy of Forestry, Hangzhou, Zhejiang, China
- College of Information Science and Technology, Nanjing Forestry University, Nanjing, China
- Key Laboratory of Forest Genetics and Breeding, Research Institute of Subtropical Forestry, Chinese Academy of Forestry, Hangzhou, Zhejiang, China
| | - Tao Lyu
- State Key Laboratory of Tree Genetics and Breeding, Research Institute of Subtropical Forestry, Chinese Academy of Forestry, Hangzhou, Zhejiang, China
- Key Laboratory of Forest Genetics and Breeding, Research Institute of Subtropical Forestry, Chinese Academy of Forestry, Hangzhou, Zhejiang, China
| | - Chao Yan
- State Key Laboratory of Tree Genetics and Breeding, Research Institute of Subtropical Forestry, Chinese Academy of Forestry, Hangzhou, Zhejiang, China
- Experimental Center for Subtropical Forestry, Chinese Academy of Forestry, Fenyi, Jiangxi, China
| | - Yupeng Wang
- State Key Laboratory of Tree Genetics and Breeding, Research Institute of Subtropical Forestry, Chinese Academy of Forestry, Hangzhou, Zhejiang, China
- College of Information Science and Technology, Nanjing Forestry University, Nanjing, China
| | - Ning Ye
- College of Information Science and Technology, Nanjing Forestry University, Nanjing, China
| | - Zhengqi Fan
- Key Laboratory of Forest Genetics and Breeding, Research Institute of Subtropical Forestry, Chinese Academy of Forestry, Hangzhou, Zhejiang, China
| | - Xinlei Li
- Key Laboratory of Forest Genetics and Breeding, Research Institute of Subtropical Forestry, Chinese Academy of Forestry, Hangzhou, Zhejiang, China
| | - Jiyuan Li
- Key Laboratory of Forest Genetics and Breeding, Research Institute of Subtropical Forestry, Chinese Academy of Forestry, Hangzhou, Zhejiang, China
| | - Hengfu Yin
- State Key Laboratory of Tree Genetics and Breeding, Research Institute of Subtropical Forestry, Chinese Academy of Forestry, Hangzhou, Zhejiang, China
- Key Laboratory of Forest Genetics and Breeding, Research Institute of Subtropical Forestry, Chinese Academy of Forestry, Hangzhou, Zhejiang, China
|
19
|
Luo Y, Liao X, Wu FX, Wang J. Computational Approaches for Transcriptome Assembly Based on Sequencing Technologies. Curr Bioinform 2020. [DOI: 10.2174/1574893614666190410155603] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/22/2022]
Abstract
Transcriptome assembly plays a critical role in studying biological properties and examining the expression levels of genomes in specific cells. It is also the basis of many downstream analyses. With the increase of speed and the decrease in cost, massive sequencing data continues to accumulate. A large number of assembly strategies based on different computational methods and experiments have been developed. How to efficiently perform transcriptome assembly with high sensitivity and accuracy becomes a key issue. In this work, the issues with transcriptome assembly are explored based on different sequencing technologies. Specifically, transcriptome assemblies with next-generation sequencing reads are divided into reference-based assemblies and de novo assemblies. The examples of different species are used to illustrate that long reads produced by the third-generation sequencing technologies can cover full-length transcripts without assemblies. In addition, different transcriptome assemblies using the Hybrid-seq methods and other tools are also summarized. Finally, we discuss the future directions of transcriptome assemblies.
Affiliation(s)
- Yuwen Luo
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Xingyu Liao
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Fang-Xiang Wu
- Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan, Canada
| | - Jianxin Wang
- School of Computer Science and Engineering, Central South University, Changsha, China
|
20
|
Tung LH, Shao M, Kingsford C. Quantifying the benefit offered by transcript assembly with Scallop-LR on single-molecule long reads. Genome Biol 2019; 20:287. [PMID: 31849338 PMCID: PMC6918626 DOI: 10.1186/s13059-019-1883-0] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 05/31/2019] [Accepted: 11/06/2019] [Indexed: 12/19/2022] Open
Abstract
Single-molecule long-read sequencing has been used to improve mRNA isoform identification. However, not all single-molecule long reads represent full transcripts due to incomplete cDNA synthesis and sequencing length limits. This drives a need for long-read transcript assembly. By adding long-read-specific optimizations to Scallop, we developed Scallop-LR, a reference-based long-read transcript assembler. Analyzing 26 PacBio samples, we quantified the benefit of performing transcript assembly on long reads. We demonstrate Scallop-LR identifies more known transcripts and potentially novel isoforms for the human transcriptome than Iso-Seq Analysis and StringTie, indicating that long-read transcript assembly by Scallop-LR can reveal a more complete human transcriptome.
Affiliation(s)
- Laura H Tung
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, 15213, PA, USA
- Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology, Pittsburgh, 15213, PA, USA
| | - Mingfu Shao
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, 16802, PA, USA
| | - Carl Kingsford
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, 15213, PA, USA.
|
21
|
Ruiz-Reche A, Srivastava A, Indi JA, de la Rubia I, Eyras E. ReorientExpress: reference-free orientation of nanopore cDNA reads with deep learning. Genome Biol 2019; 20:260. [PMID: 31783882 PMCID: PMC6883653 DOI: 10.1186/s13059-019-1884-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 04/12/2019] [Accepted: 11/07/2019] [Indexed: 12/18/2022] Open
Abstract
We describe ReorientExpress, a method to perform reference-free orientation of transcriptomic long sequencing reads. ReorientExpress uses deep learning to correctly predict the orientation of the majority of reads, in particular when trained on a closely related species or in combination with read clustering. ReorientExpress enables long-read transcriptomics in non-model organisms and in samples without a genome reference, without using additional technologies, and is available at https://github.com/comprna/reorientexpress.
Affiliation(s)
| | - Akanksha Srivastava
- The John Curtin School of Medical Research, Australian National University, Acton ACT, Canberra, 2601, Australia
- EMBL Australia Partner Laboratory Network and the Australian National University, Acton ACT, Canberra, 2601, Australia
| | - Joel A Indi
- Pompeu Fabra University, E08003, Barcelona, Spain
- Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, Lisbon, Portugal
| | | | - Eduardo Eyras
- The John Curtin School of Medical Research, Australian National University, Acton ACT, Canberra, 2601, Australia.
- EMBL Australia Partner Laboratory Network and the Australian National University, Acton ACT, Canberra, 2601, Australia.
- IMIM - Hospital del Mar Medical Research Institute, E08003, Barcelona, Spain.
|
22
|
Vilperte V, Lucaciu CR, Halbwirth H, Boehm R, Rattei T, Debener T. Hybrid de novo transcriptome assembly of poinsettia (Euphorbia pulcherrima Willd. Ex Klotsch) bracts. BMC Genomics 2019; 20:900. [PMID: 31775622 PMCID: PMC6882326 DOI: 10.1186/s12864-019-6247-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 07/01/2019] [Accepted: 10/30/2019] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND Poinsettia is a popular and important ornamental crop, mostly during the Christmas season. Its bract coloration ranges from pink/red to creamy/white shades. Despite its ornamental value, there is a lack of knowledge about the genetics and molecular biology of poinsettia, especially on the mechanisms of color formation. We performed an RNA-Seq analysis in order to shed light on the transcriptome of poinsettia bracts. Moreover, we analyzed the transcriptome differences of red- and white-bracted poinsettia varieties during bract development and coloration. For the assembly of a bract transcriptome, two paired-end cDNA libraries from a red and white poinsettia pair were sequenced with the Illumina technology, and one library from a red-bracted variety was used for PacBio sequencing. Both short and long reads were assembled using a hybrid de novo strategy. Samples of red- and white-bracted poinsettias were sequenced and comparatively analyzed in three color developmental stages in order to understand the mechanisms of color formation and accumulation in the species. RESULTS The final transcriptome contains 288,524 contigs, with 33% showing confident protein annotation against the TAIR10 database. The BUSCO pipeline, which is based on near-universal orthologous gene groups, was applied to assess the transcriptome completeness. From a total of 1440 BUSCO groups searched, 77% were categorized as complete (41% as single-copy and 36% as duplicated), 10% as fragmented and 13% as missing BUSCOs. The gene expression comparison between red and white varieties of poinsettia showed a differential regulation of the flavonoid biosynthesis pathway only at particular stages of bract development. An initial impairment of the flavonoid pathway early in the color accumulation process for the white poinsettia variety was observed, but these differences were no longer present in the subsequent stages of bract development. 
Nonetheless, GSTF11 and UGT79B10 showed a lower expression in the last stage of bract development for the white variety and, therefore, are potential candidates for further studies on poinsettia coloration. CONCLUSIONS In summary, this transcriptome analysis provides a valuable foundation for further studies on poinsettia, such as plant breeding and genetics, and highlights crucial information on the molecular mechanism of color formation.
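As a reading aid for the BUSCO figures quoted in this abstract, the short sketch below checks that the reported categories are internally consistent: the "complete" share should equal single-copy plus duplicated, and complete, fragmented and missing should together account for all groups searched. Only the percentages stated in the abstract are used; the variable names are illustrative, and no per-category counts are derived because the abstract reports rounded percentages only.

```python
# Percentages as reported in the abstract (1440 BUSCO groups searched).
busco_pct = {
    "single_copy": 41,
    "duplicated": 36,
    "fragmented": 10,
    "missing": 13,
}

# "Complete" BUSCOs are the sum of single-copy and duplicated ones.
complete_pct = busco_pct["single_copy"] + busco_pct["duplicated"]

# Complete + fragmented + missing should cover every group searched.
total_pct = complete_pct + busco_pct["fragmented"] + busco_pct["missing"]

print(f"complete: {complete_pct}%")      # complete: 77%
print(f"all categories: {total_pct}%")   # all categories: 100%
```

The relatively high duplicated share (36%) is worth noting when reading the contig count: in a de novo transcriptome, duplicated BUSCOs often reflect multiple isoforms or redundant contigs for the same gene rather than true gene duplication.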
Affiliation(s)
- Vinicius Vilperte
- Institute of Plant Genetics, Leibniz Universität Hannover, 30419, Hannover, Germany
- Klemm + Sohn GmbH & Co., 70379, Stuttgart, KG, Germany
| | - Calin Rares Lucaciu
- Department of Microbiology and Ecosystem Science, University of Vienna, 1090, Vienna, Austria
| | - Heidi Halbwirth
- Institute of Chemical, Environmental and Bioscience Engineering, Technische Universität Wien, 1060, Vienna, Austria
| | - Robert Boehm
- Klemm + Sohn GmbH & Co., 70379, Stuttgart, KG, Germany
| | - Thomas Rattei
- Department of Microbiology and Ecosystem Science, University of Vienna, 1090, Vienna, Austria.
| | - Thomas Debener
- Institute of Plant Genetics, Leibniz Universität Hannover, 30419, Hannover, Germany.
23
Li WV, Li S, Tong X, Deng L, Shi H, Li JJ. AIDE: annotation-assisted isoform discovery with high precision. Genome Res 2019; 29:2056-2072. [PMID: 31694868 PMCID: PMC6886511 DOI: 10.1101/gr.251108.119] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 04/01/2019] [Accepted: 09/27/2019] [Indexed: 02/06/2023]
Abstract
Genome-wide accurate identification and quantification of full-length mRNA isoforms are crucial for investigating transcriptional and posttranscriptional regulatory mechanisms of biological phenomena. Despite continuing efforts in developing effective computational tools to identify or assemble full-length mRNA isoforms from second-generation RNA-seq data, it remains a challenge to accurately identify mRNA isoforms from short sequence reads owing to the substantial information loss in RNA-seq experiments. Here, we introduce a novel statistical method, annotation-assisted isoform discovery (AIDE), the first approach that directly controls false isoform discoveries by implementing the testing-based model selection principle. Solving the isoform discovery problem in a stepwise and conservative manner, AIDE prioritizes the annotated isoforms and precisely identifies novel isoforms whose addition significantly improves the explanation of observed RNA-seq reads. We evaluate the performance of AIDE based on multiple simulated and real RNA-seq data sets followed by PCR-Sanger sequencing validation. Our results show that AIDE effectively leverages the annotation information to compensate for the information loss owing to short read lengths. AIDE achieves the highest precision in isoform discovery and the lowest error rates in isoform abundance estimation, compared with three state-of-the-art methods: Cufflinks, SLIDE, and StringTie. As a robust bioinformatics tool for transcriptome analysis, AIDE enables researchers to discover novel transcripts with high confidence.
Affiliation(s)
- Wei Vivian Li
  - Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Rutgers, The State University of New Jersey, Piscataway, New Jersey 08854, USA
  - Department of Statistics, University of California, Los Angeles, California 90095, USA
- Shan Li
  - Laboratory of Tumor Targeted and Immune Therapy, Clinical Research Center for Breast, State Key Laboratory of Biotherapy, West China Hospital, Sichuan University and Collaborative Innovation Center, Chengdu 610041, China
- Xin Tong
  - Department of Data Sciences and Operations, Marshall School of Business, University of Southern California, Los Angeles, California 90089, USA
- Ling Deng
  - Laboratory of Molecular Diagnosis of Cancer, Clinical Research Center for Breast, West China Hospital, Sichuan University, Chengdu 610041, China
- Hubing Shi
  - Laboratory of Tumor Targeted and Immune Therapy, Clinical Research Center for Breast, State Key Laboratory of Biotherapy, West China Hospital, Sichuan University and Collaborative Innovation Center, Chengdu 610041, China
- Jingyi Jessica Li
  - Department of Statistics, University of California, Los Angeles, California 90095, USA
  - Department of Human Genetics, University of California, Los Angeles, California 90095, USA
24
Utilization of Tissue Ploidy Level Variation in de Novo Transcriptome Assembly of Pinus sylvestris. G3-GENES GENOMES GENETICS 2019; 9:3409-3421. [PMID: 31427456 PMCID: PMC6778806 DOI: 10.1534/g3.119.400357] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 12/28/2022]
Abstract
Compared to angiosperms, gymnosperms lag behind in the availability of assembled and annotated genomes. Most genomic analyses in gymnosperms, especially conifer tree species, rely on the use of de novo assembled transcriptomes. However, the level of allelic redundancy and transcript fragmentation in these assembled transcriptomes, and their effect on downstream applications, have not been fully investigated. Here, we assessed three assembly strategies for short-read data, including the utility of haploid megagametophyte tissue during de novo assembly as single-allele guides, for six individuals and five different tissues in Pinus sylvestris. We then contrasted haploid and diploid tissue genotype calls obtained from the assembled transcriptomes to evaluate the extent of paralog mapping. The use of the haploid tissue during assembly increased its completeness without reducing the number of assembled transcripts. Our results suggest that current strategies that rely on available genomic resources as guidance to minimize allelic redundancy are less effective than the application of strategies that cluster redundant assembled transcripts. The strategy yielding the lowest levels of allelic redundancy among the assembled transcriptomes assessed here was the generation of SuperTranscripts with Lace followed by CD-HIT clustering. However, we still observed some levels of heterozygosity (multiple gene fragments per transcript reflecting allelic redundancy) in this assembled transcriptome on the haploid tissue, indicating that further filtering is required before using these assemblies for downstream applications. We discuss the influence of allelic redundancy when these reference transcriptomes are used to select regions for probe design of exome capture baits and for estimation of population genetic diversity.
25
Turner AW, Wong D, Khan MD, Dreisbach CN, Palmore M, Miller CL. Multi-Omics Approaches to Study Long Non-coding RNA Function in Atherosclerosis. Front Cardiovasc Med 2019; 6:9. [PMID: 30838214 PMCID: PMC6389617 DOI: 10.3389/fcvm.2019.00009] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 11/03/2018] [Accepted: 01/30/2019] [Indexed: 12/15/2022] Open
Abstract
Atherosclerosis is a complex inflammatory disease of the vessel wall involving the interplay of multiple cell types including vascular smooth muscle cells, endothelial cells, and macrophages. Large-scale genome-wide association studies (GWAS) and the advancement of next-generation sequencing technologies have rapidly expanded the number of long non-coding RNA (lncRNA) transcripts predicted to play critical roles in the pathogenesis of the disease. In this review, we highlight several lncRNAs whose functional role in atherosclerosis is well-documented through traditional biochemical approaches as well as those identified through RNA-sequencing and other high-throughput assays. We describe novel genomics approaches to study both evolutionarily conserved and divergent lncRNA functions and interactions with DNA, RNA, and proteins. We also highlight assays to resolve the complex spatial and temporal regulation of lncRNAs. Finally, we summarize the latest suite of computational tools designed to improve genomic and functional annotation of these transcripts in the human genome. Deep characterization of lncRNAs is fundamental to unraveling coronary atherosclerosis and other cardiovascular diseases, as these regulatory molecules represent a new class of potential therapeutic targets and/or diagnostic markers to mitigate both genetic and environmental risk factors.
Affiliation(s)
- Adam W. Turner
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
- Doris Wong
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, United States
- Mohammad Daud Khan
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
- Caitlin N. Dreisbach
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
  - School of Nursing, University of Virginia, Charlottesville, VA, United States
  - Data Science Institute, University of Virginia, Charlottesville, VA, United States
- Meredith Palmore
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
- Clint L. Miller
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, United States
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, United States
  - Data Science Institute, University of Virginia, Charlottesville, VA, United States
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, United States
  - Department of Public Health Sciences, University of Virginia, Charlottesville, VA, United States