1
Luo Y, Liao X, Wu FX, Wang J. Computational Approaches for Transcriptome Assembly Based on Sequencing Technologies. Curr Bioinform 2020. [DOI: 10.2174/1574893614666190410155603]
Abstract
Transcriptome assembly plays a critical role in studying biological properties and
examining the expression levels of genomes in specific cells. It is also the basis of many
downstream analyses. With the increase of speed and the decrease in cost, massive sequencing
data continues to accumulate. A large number of assembly strategies based on different
computational methods and experiments have been developed. How to efficiently perform
transcriptome assembly with high sensitivity and accuracy becomes a key issue. In this work, the
issues with transcriptome assembly are explored based on different sequencing technologies.
Specifically, transcriptome assemblies with next-generation sequencing reads are divided into
reference-based assemblies and de novo assemblies. The examples of different species are used to
illustrate that long reads produced by the third-generation sequencing technologies can cover full-length
transcripts without assemblies. In addition, different transcriptome assemblies using the
Hybrid-seq methods and other tools are also summarized. Finally, we discuss the future directions
of transcriptome assemblies.
Affiliation(s)
- Yuwen Luo
  - School of Computer Science and Engineering, Central South University, Changsha, China
- Xingyu Liao
  - School of Computer Science and Engineering, Central South University, Changsha, China
- Fang-Xiang Wu
  - Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan, Canada
- Jianxin Wang
  - School of Computer Science and Engineering, Central South University, Changsha, China
2
Tasnim M, Ma S, Yang EW, Jiang T, Li W. Accurate inference of isoforms from multiple sample RNA-Seq data. BMC Genomics 2015; 16 Suppl 2:S15. [PMID: 25708199 PMCID: PMC4331715 DOI: 10.1186/1471-2164-16-s2-s15]
Abstract
BACKGROUND: RNA-Seq based transcriptome assembly has become a fundamental technique for studying expressed mRNAs (i.e., transcripts or isoforms) in a cell using high-throughput sequencing technologies, and is serving as a basis to analyze the structural and quantitative differences of expressed isoforms between samples. However, the current transcriptome assembly algorithms are not specifically designed to handle large amounts of errors that are inherent in real RNA-Seq datasets, especially those involving multiple samples, making downstream differential analysis applications difficult. On the other hand, multiple sample RNA-Seq datasets may provide more information than single sample datasets that can be utilized to improve the performance of transcriptome assembly and abundance estimation, but such information remains overlooked by the existing assembly tools.
RESULTS: We formulate a computational framework of transcriptome assembly that is capable of handling noisy RNA-Seq reads and multiple sample RNA-Seq datasets efficiently. We show that finding an optimal solution under this framework is an NP-hard problem. Instead, we develop an efficient heuristic algorithm, called Iterative Shortest Path (ISP), based on linear programming (LP) and integer linear programming (ILP). Our preliminary experimental results on both simulated and real datasets and comparison with the existing assembly tools demonstrate that (i) the ISP algorithm is able to assemble transcriptomes with a greatly increased precision while keeping the same level of sensitivity, especially when many samples are involved, and (ii) its assembly results help improve downstream differential analysis. The source code of ISP is freely available at http://alumni.cs.ucr.edu/~liw/isp.html.
Affiliation(s)
- Masruba Tasnim
  - Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, 92507, USA
- Shining Ma
  - Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, 92507, USA
  - MOE Key Lab of Bioinformatics and Bioinformatics Division, TNLIST / Department of Automation, Tsinghua University, Beijing, 100084, China
- Ei-Wen Yang
  - Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, 92507, USA
- Tao Jiang
  - Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, 92507, USA
  - MOE Key Lab of Bioinformatics and Bioinformatics Division, TNLIST / Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
- Wei Li
  - Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, 92507, USA
  - Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute and Harvard School of Public Health, Boston, MA, 02215, USA
3
Mangul S, Caciula A, Al Seesi S, Brinza D, Măndoiu I, Zelikovsky A. Transcriptome assembly and quantification from Ion Torrent RNA-Seq data. BMC Genomics 2014; 15 Suppl 5:S7. [PMID: 25082147 PMCID: PMC4120146 DOI: 10.1186/1471-2164-15-s5-s7]
Abstract
BACKGROUND: High throughput RNA sequencing (RNA-Seq) can generate whole transcriptome information at the single transcript level providing a powerful tool with multiple interrelated applications including transcriptome reconstruction and quantification. The sequences of novel transcripts can be reconstructed from deep RNA-Seq data, but this is computationally challenging due to sequencing errors, uneven coverage of expressed transcripts, and the need to distinguish between highly similar transcripts produced by alternative splicing. Another challenge in transcriptomic analysis comes from the ambiguities in mapping reads to transcripts.
RESULTS: We present MaLTA, a method for simultaneous transcriptome assembly and quantification from Ion Torrent RNA-Seq data. Our approach explores transcriptome structure and incorporates a maximum likelihood model into the assembly and quantification procedure. A new version of the IsoEM algorithm suitable for Ion Torrent RNA-Seq reads is used to accurately estimate transcript expression levels. The MaLTA-IsoEM tool is publicly available at: http://alan.cs.gsu.edu/NGS/?q=malta
CONCLUSIONS: Experimental results on both synthetic and real datasets show that Ion Torrent RNA-Seq data can be successfully used for transcriptome analyses. Experimental results suggest increased transcriptome assembly and quantification accuracy of MaLTA-IsoEM solution compared to existing state-of-the-art approaches.
Affiliation(s)
- Serghei Mangul
  - Department of Computer Science, University of California, Los Angeles, CA, USA
  - Department of Computer Science, Georgia State University, Atlanta, GA, USA
- Adrian Caciula
  - Department of Computer Science, Georgia State University, Atlanta, GA, USA
- Sahar Al Seesi
  - Department of Computer Science & Engineering, University of Connecticut, Storrs, CT, USA
- Dumitru Brinza
  - Ion Bioinformatics, Life Technologies Corporation, Foster City, CA, USA
- Ion Măndoiu
  - Department of Computer Science & Engineering, University of Connecticut, Storrs, CT, USA
- Alex Zelikovsky
  - Department of Computer Science, Georgia State University, Atlanta, GA, USA
4
Behr J, Kahles A, Zhong Y, Sreedharan VT, Drewe P, Rätsch G. MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples. Bioinformatics 2013; 29:2529-38. [PMID: 23980025 PMCID: PMC3789545 DOI: 10.1093/bioinformatics/btt442] [Received: 10/14/2012] [Revised: 07/19/2013] [Accepted: 07/29/2013]
Abstract
MOTIVATION: High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in the detection of expressed genes and reconstruction of RNA transcripts. However, the extensive dynamic range of gene expression, technical limitations and biases, as well as the observed complexity of the transcriptional landscape, pose profound computational challenges for transcriptome reconstruction.
RESULTS: We present the novel framework MITIE (Mixed Integer Transcript IdEntification) for simultaneous transcript reconstruction and quantification. We define a likelihood function based on the negative binomial distribution, use a regularization approach to select a few transcripts collectively explaining the observed read data and show how to find the optimal solution using Mixed Integer Programming. MITIE can (i) take advantage of known transcripts, (ii) reconstruct and quantify transcripts simultaneously in multiple samples, and (iii) resolve the location of multi-mapping reads. It is designed for genome- and assembly-based transcriptome reconstruction. We present an extensive study based on realistic simulated RNA-Seq data. When compared with state-of-the-art approaches, MITIE proves to be significantly more sensitive and overall more accurate. Moreover, MITIE yields substantial performance gains when used with multiple samples. We applied our system to 38 Drosophila melanogaster modENCODE RNA-Seq libraries and estimated the sensitivity of reconstructing omitted transcript annotations and the specificity with respect to annotated transcripts. Our results corroborate that a well-motivated objective paired with appropriate optimization techniques lead to significant improvements over the state-of-the-art in transcriptome reconstruction.
AVAILABILITY: MITIE is implemented in C++ and is available from http://bioweb.me/mitie under the GPL license.
Affiliation(s)
- Jonas Behr
  - Computational Biology Center, Sloan-Kettering Institute, 1275 York Avenue, New York, NY 10065, USA and Friedrich Miescher Laboratory, Max Planck Society, Spemannstr. 39, 72076 Tübingen, Germany
5
LeGault LH, Dewey CN. Inference of alternative splicing from RNA-Seq data with probabilistic splice graphs. Bioinformatics 2013; 29:2300-10. [PMID: 23846746 PMCID: PMC3753571 DOI: 10.1093/bioinformatics/btt396]
Abstract
Motivation: Alternative splicing and other processes that allow for different transcripts to be derived from the same gene are significant forces in the eukaryotic cell. RNA-Seq is a promising technology for analyzing alternative transcripts, as it does not require prior knowledge of transcript structures or genome sequences. However, analysis of RNA-Seq data in the presence of genes with large numbers of alternative transcripts is currently challenging due to efficiency, identifiability and representation issues.
Results: We present RNA-Seq models and associated inference algorithms based on the concept of probabilistic splice graphs, which alleviate these issues. We prove that our models are often identifiable and demonstrate that our inference methods for quantification and differential processing detection are efficient and accurate.
Availability: Software implementing our methods is available at http://deweylab.biostat.wisc.edu/psginfer.
Contact: cdewey@biostat.wisc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Laura H LeGault
  - Department of Computer Sciences, University of Wisconsin, Madison, WI 53706, USA
6
Abstract
As RNA-seq is replacing gene expression microarrays to assess genome-wide transcription abundance, gene expression Quantitative Trait Locus (eQTL) studies using RNA-seq have emerged. RNA-seq delivers two novel features that are important for eQTL studies. First, it provides information on allele-specific expression (ASE), which is not available from gene expression microarrays. Second, it generates unprecedentedly rich data to study RNA-isoform expression. In this paper, we review current methods for eQTL mapping using ASE and discuss some future directions. We also review existing works that use RNA-seq data to study RNA-isoform expression and we discuss the gaps between these works and isoform-specific eQTL mapping.
Affiliation(s)
- Wei Sun
  - Department of Biostatistics, Department of Genetics, Carolina Center of Genome Science, UNC Chapel Hill, Chapel Hill, NC 27599, USA
- Yijuan Hu
  - Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, USA
7
Bao E, Jiang T, Girke T. BRANCH: boosting RNA-Seq assemblies with partial or related genomic sequences. Bioinformatics 2013; 29:1250-9. [PMID: 23493323 DOI: 10.1093/bioinformatics/btt127]
Abstract
MOTIVATION: De novo transcriptome assemblies of RNA-Seq data are important for genomics applications of unsequenced organisms. Owing to the complexity and often incomplete representation of transcripts in sequencing libraries, the assembly of high-quality transcriptomes can be challenging. However, with the rapidly growing number of sequenced genomes, it is now feasible to improve RNA-Seq assemblies by guiding them with genomic sequences.
RESULTS: This study introduces BRANCH, an algorithm designed for improving de novo transcriptome assemblies by using genomic information that can be partial or complete genome sequences from the same or a related organism. Its input includes assembled RNA reads (transfrags), genomic sequences (e.g. contigs) and the RNA reads themselves. It uses a customized version of BLAT to align the transfrags and RNA reads to the genomic sequences. After identifying exons from the alignments, it defines a directed acyclic graph and maps the transfrags to paths on the graph. It then joins and extends the transfrags by applying an algorithm that solves a combinatorial optimization problem, called the Minimum weight Minimum Path Cover with given Paths. In performance tests on real data from Caenorhabditis elegans and Saccharomyces cerevisiae, assisted by genomic contigs from the same species, BRANCH improved the sensitivity and precision of transfrags generated by Velvet/Oases or Trinity by 5.1-56.7% and 0.3-10.5%, respectively. These improvements added 3.8-74.1% complete transcripts and 8.3-3.8% proteins to the initial assembly. Similar improvements were achieved when guiding the BRANCH processing of a transcriptome assembly from a more complex organism (mouse) with genomic sequences from a related species (rat).
AVAILABILITY: The BRANCH software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/branch.
CONTACT: thomas.girke@ucr.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ergude Bao
  - Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
8
Mezlini AM, Smith EJM, Fiume M, Buske O, Savich GL, Shah S, Aparicio S, Chiang DY, Goldenberg A, Brudno M. iReckon: simultaneous isoform discovery and abundance estimation from RNA-seq data. Genome Res 2012. [PMID: 23204306 PMCID: PMC3589540 DOI: 10.1101/gr.142232.112]
Abstract
High-throughput RNA sequencing (RNA-seq) promises to revolutionize our understanding of genes and their role in human disease by characterizing the RNA content of tissues and cells. The realization of this promise, however, is conditional on the development of effective computational methods for the identification and quantification of transcripts from incomplete and noisy data. In this article, we introduce iReckon, a method for simultaneous determination of the isoforms and estimation of their abundances. Our probabilistic approach incorporates multiple biological and technical phenomena, including novel isoforms, intron retention, unspliced pre-mRNA, PCR amplification biases, and multimapped reads. iReckon utilizes regularized expectation-maximization to accurately estimate the abundances of known and novel isoforms. Our results on simulated and real data demonstrate a superior ability to discover novel isoforms with a significantly reduced number of false-positive predictions, and our abundance accuracy prediction outmatches that of other state-of-the-art tools. Furthermore, we have applied iReckon to two cancer transcriptome data sets, a triple-negative breast cancer patient sample and the MCF7 breast cancer cell line, and show that iReckon is able to reconstruct the complex splicing changes that were not previously identified. QT-PCR validations of the isoforms detected in the MCF7 cell line confirmed all of iReckon's predictions and also showed strong agreement (r² = 0.94) with the predicted abundances.
Affiliation(s)
- Aziz M Mezlini
  - Department of Computer Science, University of Toronto, Ontario M5S 2E4, Canada
9
Lin YY, Dao P, Hach F, Bakhshi M, Mo F, Lapuk A, Collins C, Sahinalp SC. CLIIQ: Accurate Comparative Detection and Quantification of Expressed Isoforms in a Population. Lecture Notes in Computer Science 2012. [DOI: 10.1007/978-3-642-33122-0_14]