1
Shi Q, Zhang Q, Shao M. Accurate assembly of multiple RNA-seq samples with Aletsch. Bioinformatics 2024; 40:i307-i317. [PMID: 38940157 PMCID: PMC11211816 DOI: 10.1093/bioinformatics/btae215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024]
Abstract
MOTIVATION High-throughput RNA sequencing has become indispensable for decoding gene activities, yet the challenge of reconstructing full-length transcripts persists. Traditional single-sample assemblers frequently produce fragmented transcripts, especially in single-cell RNA-seq data. While algorithms designed for assembling multiple samples exist, they encounter various limitations. RESULTS We present Aletsch, a new assembler for multiple bulk or single-cell RNA-seq samples. Aletsch incorporates several algorithmic innovations, including a "bridging" system that can effectively integrate multiple samples to restore missed junctions in individual samples, and a new graph-decomposition algorithm that leverages "supporting" information across multiple samples to guide the decomposition of complex vertices. A standout feature of Aletsch is its application of a random forest model with 50 well-designed features for scoring transcripts. We demonstrate its robust adaptability across different chromosomes, datasets, and species. Our experiments, conducted on RNA-seq data from several protocols, firmly demonstrate Aletsch's significant outperformance over existing meta-assemblers. As an example, when measured with the partial area under the precision-recall curve (pAUC, constrained by precision), Aletsch surpasses the leading assemblers TransMeta by 22.9%-62.1% and PsiCLASS by 23.0%-175.5% on human datasets. AVAILABILITY AND IMPLEMENTATION Aletsch is freely available at https://github.com/Shao-Group/aletsch. Scripts that reproduce the experimental results of this manuscript are available at https://github.com/Shao-Group/aletsch-test.
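The abstract above reports a partial area under the precision-recall curve (pAUC) constrained by precision but does not spell out its definition. The sketch below shows one plausible reading, assumed here for illustration only: keep the curve points whose precision is at least a chosen threshold and integrate precision over recall with the trapezoidal rule. The function name `partial_pr_auc`, the threshold, and the sample points are hypothetical and are not taken from the paper.

```python
def partial_pr_auc(points, min_precision):
    """Trapezoidal area under (recall, precision) points with precision >= min_precision.

    Assumption: the kept points form one contiguous stretch of the curve;
    no interpolation is done at the precision cutoff.
    """
    kept = sorted((r, p) for r, p in points if p >= min_precision)
    area = 0.0
    for (r0, p0), (r1, p1) in zip(kept, kept[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


# Made-up (recall, precision) pairs, e.g. from sweeping a transcript-score cutoff.
curve = [(0.10, 0.90), (0.20, 0.80), (0.30, 0.60), (0.40, 0.40)]
print(partial_pr_auc(curve, min_precision=0.5))
```

Restricting the area to the high-precision part of the curve compares assemblers only in the operating range where their output is reliable enough to use.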
Affiliation(s)
- Qian Shi
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, United States
- Qimin Zhang
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, United States
- Mingfu Shao
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, United States
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, United States
2
Varoquaux N, Noble WS, Vert JP. Inference of 3D genome architecture by modeling overdispersion of Hi-C data. Bioinformatics 2023; 39:btac838. [PMID: 36594573 PMCID: PMC9857972 DOI: 10.1093/bioinformatics/btac838] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 04/25/2022] [Revised: 11/16/2022] [Indexed: 01/04/2023]
Abstract
MOTIVATION We address the challenge of inferring a consensus 3D model of genome architecture from Hi-C data. Existing approaches most often rely on a two-step algorithm: first, convert the contact counts into distances, then optimize an objective function akin to multidimensional scaling (MDS) to infer a 3D model. Other approaches use a maximum likelihood approach, modeling the contact counts between two loci as a Poisson random variable whose intensity is a decreasing function of the distance between them. However, a Poisson model of contact counts implies that the variance of the data is equal to the mean, a relationship that is often too restrictive to properly model count data. RESULTS We first confirm the presence of overdispersion in several real Hi-C datasets, and we show that the overdispersion arises even in simulated datasets. We then propose a new model, called Pastis-NB, where we replace the Poisson model of contact counts by a negative binomial one, which is parametrized by a mean and a separate dispersion parameter. The dispersion parameter allows the variance to be adjusted independently from the mean, thus better modeling overdispersed data. We compare the results of Pastis-NB to those of several previously published algorithms, both MDS-based and statistical methods. We show that the negative binomial inference yields more accurate structures on simulated data, and more robust structures than other models across real Hi-C replicates and across different resolutions. AVAILABILITY AND IMPLEMENTATION A Python implementation of Pastis-NB is available at https://github.com/hiclib/pastis under the BSD license. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
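The contrast between the two count models in the abstract above can be made concrete with a short sketch. It assumes one common negative binomial parametrization, in which the variance is mu + alpha * mu^2 for mean mu and dispersion alpha; the abstract does not state which parametrization Pastis-NB uses, and the function names and counts below are hypothetical.

```python
from statistics import mean, pvariance


def poisson_variance(mu):
    """Under a Poisson model the variance is tied to the mean."""
    return mu


def negative_binomial_variance(mu, alpha):
    """Assumed parametrization: variance = mu + alpha * mu**2, with alpha >= 0."""
    return mu + alpha * mu ** 2


def dispersion_index(counts):
    """Variance-to-mean ratio; values well above 1 suggest overdispersion."""
    return pvariance(counts) / mean(counts)


# Made-up contact counts for pairs of loci at a similar genomic distance.
counts = [0, 1, 1, 2, 3, 5, 8, 20]
print(dispersion_index(counts))
print(poisson_variance(5.0), negative_binomial_variance(5.0, alpha=0.5))
```

With alpha = 0 the assumed negative binomial variance reduces to the Poisson variance, which is why the extra parameter can only help when the data are overdispersed.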
Affiliation(s)
- Nelle Varoquaux
- TIMC, Université Grenoble Alpes, CNRS, Grenoble INP, Grenoble 38000, France
- William S Noble
- Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA
- Jean-Philippe Vert
- Brain Team, Google Research, Paris 75009, France
- Centre for Computational Biology, MINES ParisTech, PSL University, Paris 75006, France
3
Yu T, Zhao X, Li G. TransMeta simultaneously assembles multisample RNA-seq reads. Genome Res 2022; 32:1398-1407. [PMID: 35858749 PMCID: PMC9341511 DOI: 10.1101/gr.276434.121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2021] [Accepted: 06/03/2022] [Indexed: 11/25/2022]
Abstract
Assembling RNA-seq reads into full-length transcripts is crucial in transcriptomic studies and poses computational challenges. Here we present TransMeta, a simple and robust algorithm that simultaneously assembles RNA-seq reads from multiple samples. TransMeta is designed based on the newly introduced vector-weighted splicing graph model, which enables accurate reconstruction of the consensus transcriptome via incorporating a cosine similarity-based combing strategy and a newly designed label-setting path-searching strategy. Tests on both simulated and real data sets show that TransMeta consistently outperforms PsiCLASS, StringTie2 plus its merge mode, and Scallop plus TACO, the most popular tools, in terms of precision and recall under a wide range of coverage thresholds at the meta-assembly level. Additionally, TransMeta consistently shows superior performance at the individual sample level.
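The abstract above mentions a vector-weighted splicing graph and a cosine similarity-based strategy. As a minimal sketch of the underlying measure only, the code below computes the cosine similarity between two per-sample weight vectors. How TransMeta actually uses this score is not described in the abstract, and the vectors and names here are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical coverage of three edges across three samples.
edge_a = [10.0, 20.0, 0.0]
edge_b = [5.0, 10.0, 0.0]   # same expression pattern as edge_a, half the depth
edge_c = [0.0, 0.0, 7.0]    # expressed only in the third sample

print(cosine_similarity(edge_a, edge_b))
print(cosine_similarity(edge_a, edge_c))
```

Because the measure ignores vector length, two edges with the same across-sample pattern score as similar even when their overall coverage differs.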
Affiliation(s)
- Ting Yu
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
- Xiaoyu Zhao
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
- School of Mathematics, Shandong University, Jinan, Shandong 250100, China
- Guojun Li
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
- School of Mathematical Science, Liaocheng University, Liaocheng 252000, China
4
Sashittal P, Zhang C, Peng J, El-Kebir M. Jumper enables discontinuous transcript assembly in coronaviruses. Nat Commun 2021; 12:6728. [PMID: 34795232 PMCID: PMC8602663 DOI: 10.1038/s41467-021-26944-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/07/2021] [Accepted: 10/20/2021] [Indexed: 11/17/2022]
Abstract
Genes in SARS-CoV-2 and other viruses in the order Nidovirales are expressed by a process of discontinuous transcription, which is distinct from alternative splicing in eukaryotes and is mediated by the viral RNA-dependent RNA polymerase. Here, we introduce the DISCONTINUOUS TRANSCRIPT ASSEMBLY problem of finding transcripts and their abundances given an alignment of paired-end short reads under a maximum likelihood model that accounts for varying transcript lengths. We show, using simulations, that our method, JUMPER, outperforms existing methods for classical transcript assembly. On short-read data of SARS-CoV-1, SARS-CoV-2 and MERS-CoV samples, we find that JUMPER not only identifies canonical transcripts that are part of the reference transcriptome, but also predicts expression of non-canonical transcripts that are supported by subsequent orthogonal analyses. Moreover, application of JUMPER on samples with and without treatment reveals viral drug response at the transcript level. As such, JUMPER enables detailed analyses of Nidovirales transcriptomes under varying conditions.
Affiliation(s)
- Palash Sashittal
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Chuanyi Zhang
- Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Jian Peng
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- College of Medicine, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Mohammed El-Kebir
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA.
5
Gatter T, Stadler PF. Ryūtō: Improved multi-sample transcript assembly for differential transcript expression analysis and more. Bioinformatics 2021; 37:4307-4313. [PMID: 34255826 DOI: 10.1093/bioinformatics/btab494] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/09/2021] [Revised: 06/21/2021] [Accepted: 07/01/2021] [Indexed: 01/12/2023]
Abstract
MOTIVATION Accurate assembly of RNA-seq is a crucial step in many analytic tasks such as gene annotation or expression studies. Despite ongoing research, progress on traditional single sample assembly has brought no major breakthrough. Multi-sample RNA-Seq experiments provide more information than single sample datasets and thus constitute a promising area of research. Yet, this advantage is challenging to utilize due to the large amount of accumulating errors. RESULTS We present an extension to Ryūtō enabling the reconstruction of consensus transcriptomes from multiple RNA-seq data sets, incorporating consensus calling at low level features. We report stable improvements already at 3 replicates. Ryūtō outperforms competing approaches, providing a better and user-adjustable sensitivity-precision trade-off. Ryūtō's unique ability to utilize a (incomplete) reference for multi sample assemblies greatly increases precision. We demonstrate benefits for differential expression analysis. CONCLUSION Ryūtō consistently improves assembly on replicates of the same tissue independent of filter settings, even when mixing conditions or time series. Consensus voting in Ryūtō is especially effective at high precision assembly, while Ryūtō's conventional mode can reach higher recall. AVAILABILITY Ryūtō is available at https://github.com/studla/RYUTO. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Thomas Gatter
- Bioinformatics Group, Department of Computer Science & Interdisciplinary Center for Bioinformatics, Universität Leipzig, D-04107 Leipzig, Germany
- Peter F Stadler
- Bioinformatics Group, Department of Computer Science & Interdisciplinary Center for Bioinformatics, Universität Leipzig, D-04107 Leipzig, Germany
- Discrete Biomath Group, Max Planck Institute for Mathematics in the Sciences, D-04103 Leipzig, Germany
- Institute for Theoretical Chemistry, University of Vienna, A-1090 Wien, Austria
- Santa Fe Institute, Santa Fe, NM 87501, USA
6
Yu T, Han R, Fang Z, Mu Z, Zheng H, Liu J. TransRef enables accurate transcriptome assembly by redefining accurate neo-splicing graphs. Brief Bioinform 2021; 22:6319943. [PMID: 34254977 DOI: 10.1093/bib/bbab261] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/17/2021] [Revised: 06/09/2021] [Accepted: 01/22/2020] [Indexed: 11/14/2022]
Abstract
RNA-seq technology is widely employed in various research areas related to transcriptome analyses, and the identification of all the expressed transcripts from short sequencing reads presents a considerable computational challenge. In this study, we introduce TransRef, a new computational algorithm for accurate transcriptome assembly by redefining a novel graph model, the neo-splicing graph, and then iteratively applying a constrained dynamic programming to reconstruct all the expressed transcripts for each graph. When TransRef is utilized to analyze both real and simulated datasets, its performance is notably better than those of several state-of-the-art assemblers, including StringTie2, Cufflinks and Scallop. In particular, the performance of TransRef is notably strong in identifying novel transcripts and transcripts with low-expression levels, while the other assemblers are less effective.
Affiliation(s)
- Ting Yu
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
- Renmin Han
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
- Zhaoyuan Fang
- Key Laboratory of Systems Biology, CAS Center for Excellence in Molecular Cell Science, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai, China
- Zengchao Mu
- School of Mathematics from Shandong University, China
- Hongyu Zheng
- Department of Radiation Oncology, Qilu Hospital, Cheeloo College of Medicine, Shandong University, Jinan, China
- Juntao Liu
- School of Mathematics and Statistics at Shandong University, Weihai, China
7
Computational analysis of alternative polyadenylation from standard RNA-seq and single-cell RNA-seq data. Methods Enzymol 2021; 655:225-243. [PMID: 34183123 DOI: 10.1016/bs.mie.2021.03.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
Abstract
Alternative polyadenylation (APA) is a major mechanism of post-transcriptional regulation in various cellular processes including cell proliferation and differentiation. Since conventional APA profiling methods have not been widely adopted, global APA studies are very limited. In this chapter, we summarize current computational methods for analyzing APA in standard RNA-seq and scRNA-seq data and describe two state-of-the-art bioinformatic algorithms, DaPars and scDaPars, in detail. The bioinformatic pipelines for both DaPars and scDaPars are presented and the applications of both algorithms are highlighted.
8
Yu T, Liu J, Gao X, Li G. iPAC: a genome-guided assembler of isoforms via phasing and combing paths. Bioinformatics 2020; 36:2712-2717. [PMID: 31985799 DOI: 10.1093/bioinformatics/btaa052] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/10/2019] [Revised: 12/14/2019] [Accepted: 01/20/2020] [Indexed: 01/09/2023]
Abstract
MOTIVATION Full-length transcript reconstruction is very important and quite challenging for the widely used RNA-seq data analysis. Currently available RNA-seq assemblers generally suffer from serious limitations in practical applications, such as low assembly accuracy and incompatibility with the latest alignment tools. RESULTS We introduce iPAC, a new genome-guided assembler for reconstruction of isoforms, which revolutionizes the usage of paired-end and sequencing depth information via phasing and combing paths over a newly designed phasing graph. Tested on both simulated and real datasets, it is to some extent superior to all the salient assemblers of the same kind. In particular, iPAC is significantly more powerful in recovering lowly expressed transcripts than the other assemblers. AVAILABILITY AND IMPLEMENTATION iPAC is freely available at http://sourceforge.net/projects/transassembly/files. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ting Yu
- School of Mathematics, Shandong University, Jinan 250100, China
- Juntao Liu
- School of Mathematics, Shandong University, Jinan 250100, China
- Xin Gao
- Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
- Guojun Li
- School of Mathematics, Shandong University, Jinan 250100, China.,Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
9
Yu T, Mu Z, Fang Z, Liu X, Gao X, Liu J. TransBorrow: genome-guided transcriptome assembly by borrowing assemblies from different assemblers. Genome Res 2020; 30:1181-1190. [PMID: 32817072 PMCID: PMC7462071 DOI: 10.1101/gr.257766.119] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 09/30/2019] [Accepted: 06/18/2020] [Indexed: 12/12/2022]
Abstract
RNA-seq technology is widely used in various transcriptomic studies and provides great opportunities to reveal the complex structures of transcriptomes. To effectively analyze RNA-seq data, we introduce a novel transcriptome assembler, TransBorrow, which borrows the assemblies from different assemblers to search for reliable subsequences by building a colored graph from those borrowed assemblies. Then, by seeding reliable subsequences, a newly designed path extension strategy accurately searches for a transcript-representing path cover over each splicing graph. TransBorrow was tested on both simulated and real data sets and showed great superiority over all the compared leading assemblers.
Affiliation(s)
- Ting Yu
- School of Mathematics and Statistics, Shandong University (Weihai), Weihai 264209, China
- Zengchao Mu
- School of Mathematics and Statistics, Shandong University (Weihai), Weihai 264209, China
- Zhaoyuan Fang
- Key Laboratory of Systems Biology, CAS Center for Excellence in Molecular Cell Science, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200031, China
- Xiaoping Liu
- School of Mathematics and Statistics, Shandong University (Weihai), Weihai 264209, China
- Xin Gao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
- Juntao Liu
- School of Mathematics and Statistics, Shandong University (Weihai), Weihai 264209, China
10
Li WV, Li S, Tong X, Deng L, Shi H, Li JJ. AIDE: annotation-assisted isoform discovery with high precision. Genome Res 2019; 29:2056-2072. [PMID: 31694868 PMCID: PMC6886511 DOI: 10.1101/gr.251108.119] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 04/01/2019] [Accepted: 09/27/2019] [Indexed: 02/06/2023]
Abstract
Genome-wide accurate identification and quantification of full-length mRNA isoforms is crucial for investigating transcriptional and posttranscriptional regulatory mechanisms of biological phenomena. Despite continuing efforts in developing effective computational tools to identify or assemble full-length mRNA isoforms from second-generation RNA-seq data, it remains a challenge to accurately identify mRNA isoforms from short sequence reads owing to the substantial information loss in RNA-seq experiments. Here, we introduce a novel statistical method, annotation-assisted isoform discovery (AIDE), the first approach that directly controls false isoform discoveries by implementing the testing-based model selection principle. Solving the isoform discovery problem in a stepwise and conservative manner, AIDE prioritizes the annotated isoforms and precisely identifies novel isoforms whose addition significantly improves the explanation of observed RNA-seq reads. We evaluate the performance of AIDE based on multiple simulated and real RNA-seq data sets followed by PCR-Sanger sequencing validation. Our results show that AIDE effectively leverages the annotation information to compensate the information loss owing to short read lengths. AIDE achieves the highest precision in isoform discovery and the lowest error rates in isoform abundance estimation, compared with three state-of-the-art methods Cufflinks, SLIDE, and StringTie. As a robust bioinformatics tool for transcriptome analysis, AIDE enables researchers to discover novel transcripts with high confidence.
Affiliation(s)
- Wei Vivian Li
- Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Rutgers, The State University of New Jersey, Piscataway, New Jersey 08854, USA.,Department of Statistics, University of California, Los Angeles, California 90095, USA
- Shan Li
- Laboratory of Tumor Targeted and Immune Therapy, Clinical Research Center for Breast, State Key Laboratory of Biotherapy, West China Hospital, Sichuan University and Collaborative Innovation Center, Chengdu 610041, China
- Xin Tong
- Department of Data Sciences and Operations, Marshall School of Business, University of Southern California, Los Angeles, California 90089, USA
- Ling Deng
- Laboratory of Molecular Diagnosis of Cancer, Clinical Research Center for Breast, West China Hospital, Sichuan University, Chengdu 610041, China
- Hubing Shi
- Laboratory of Tumor Targeted and Immune Therapy, Clinical Research Center for Breast, State Key Laboratory of Biotherapy, West China Hospital, Sichuan University and Collaborative Innovation Center, Chengdu 610041, China
- Jingyi Jessica Li
- Department of Statistics, University of California, Los Angeles, California 90095, USA.,Department of Human Genetics, University of California, Los Angeles, California 90095, USA
11
Song L, Sabunciyan S, Yang G, Florea L. A multi-sample approach increases the accuracy of transcript assembly. Nat Commun 2019; 10:5000. [PMID: 31676772 PMCID: PMC6825223 DOI: 10.1038/s41467-019-12990-0] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 02/16/2019] [Accepted: 10/11/2019] [Indexed: 01/21/2023]
Abstract
Transcript assembly from RNA-seq reads is a critical step in gene expression and subsequent functional analyses. Here we present PsiCLASS, an accurate and efficient transcript assembler based on an approach that simultaneously analyzes multiple RNA-seq samples. PsiCLASS combines mixture statistical models for exonic feature selection across multiple samples with splice graph based dynamic programming algorithms and a weighted voting scheme for transcript selection. PsiCLASS achieves significantly better sensitivity-precision tradeoff, and renders precision up to 2-3 fold higher than the StringTie system and Scallop plus TACO, the two best current approaches. PsiCLASS is efficient and scalable, assembling 667 GEUVADIS samples in 9 h, and has robust accuracy with large numbers of samples. Transcript assembly is an important step in analysis of RNA-seq data whose accuracy influences downstream quantification, detection and characterization of alternative splice variants. Here, the authors develop PsiCLASS, a transcript assembler leveraging simultaneous analysis of multiple RNA-seq samples.
Affiliation(s)
- Li Song
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA.,Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.,Department of Data Sciences, Dana Farber Cancer Institute, Boston, MA, USA
- Sarven Sabunciyan
- Department of Pediatrics, Johns Hopkins School of Medicine, Baltimore, MD, USA
- Guangyu Yang
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA.,Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Liliana Florea
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA.,Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.,Department of Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA.
12
Chen M, Ji G, Fu H, Lin Q, Ye C, Ye W, Su Y, Wu X. A survey on identification and quantification of alternative polyadenylation sites from RNA-seq data. Brief Bioinform 2019; 21:1261-1276. [PMID: 31267126 DOI: 10.1093/bib/bbz068] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 03/10/2019] [Revised: 05/03/2019] [Accepted: 05/14/2019] [Indexed: 12/13/2022]
Abstract
Alternative polyadenylation (APA) has been implicated to play an important role in post-transcriptional regulation by regulating mRNA abundance, stability, localization and translation, which contributes considerably to transcriptome diversity and gene expression regulation. RNA-seq has become a routine approach for transcriptome profiling, generating unprecedented data that could be used to identify and quantify APA site usage. A number of computational approaches for identifying APA sites and/or dynamic APA events from RNA-seq data have emerged in the literature, which provide valuable yet preliminary results that should be refined to yield credible guidelines for the scientific community. In this review, we provided a comprehensive overview of the status of currently available computational approaches. We also conducted objective benchmarking analysis using RNA-seq data sets from different species (human, mouse and Arabidopsis) and simulated data sets to present a systematic evaluation of 11 representative methods. Our benchmarking study showed that the overall performance of all tools investigated is moderate, reflecting that there is still a lot of scope to improve the prediction of APA sites or dynamic APA events from RNA-seq data. Particularly, prediction results from individual tools differ considerably, and only a limited number of predicted APA sites or genes are common among different tools. Accordingly, we attempted to give some advice on how to assess the reliability of the obtained results. We also proposed practical recommendations on the appropriate method applicable to diverse scenarios and discussed implications and future directions relevant to profiling APA from RNA-seq data.
Affiliation(s)
- Moliang Chen
- Department of Automation, Xiamen University, Xiamen 361005, China.,Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen 361005, China
- Guoli Ji
- Department of Automation, Xiamen University, Xiamen 361005, China.,Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen 361005, China
- Hongjuan Fu
- Department of Automation, Xiamen University, Xiamen 361005, China.,Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen 361005, China
- Qianmin Lin
- Xiang'an Hospital of Xiamen University, Xiamen 361005, China
- Congting Ye
- Key Laboratory of the Ministry of Education for Coastal and Wetland Ecosystems, College of the Environment and Ecology, Xiamen University, Xiamen, Fujian 361102, China
- Wenbin Ye
- Department of Automation, Xiamen University, Xiamen 361005, China.,Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen 361005, China
- Yaru Su
- College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350116, China
- Xiaohui Wu
- Department of Automation, Xiamen University, Xiamen 361005, China.,Xiamen Research Institute of National Center of Healthcare Big Data, Xiamen 361005, China
13
Shao M, Kingsford C. Theory and A Heuristic for the Minimum Path Flow Decomposition Problem. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2019; 16:658-670. [PMID: 29990201 DOI: 10.1109/tcbb.2017.2779509] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 06/08/2023]
Abstract
Motivated by multiple genome assembly problems and other applications, we study the following minimum path flow decomposition problem: Given a directed acyclic graph $G=(V,E)$ with source $s$ and sink $t$ and a flow $f$, compute a set of $s$-$t$ paths $P$ and assign weight $w(p)$ for $p\in P$ such that $f(e) = \sum_{p\in P:\, e\in p} w(p)$, $\forall e\in E$, and $|P|$ is minimized. We develop some fundamental theory for this problem, upon which we design an efficient heuristic. Specifically, we prove that the gap between the optimal number of paths and a known upper bound is determined by the nontrivial equations within the flow values. This result gives rise to the framework of our heuristic: to iteratively reduce the gap through identifying such equations. We also define an operation on certain independent substructures of the graph, and prove that this operation does not affect the optimality but can transform the graph into one with a desired property that facilitates reducing the gap. We apply and test our algorithm on both simulated random instances and perfect splice graph instances, and also compare it with the existing state-of-the-art algorithm for flow decomposition. The results illustrate that our algorithm can achieve very high accuracy on these instances, and also that our algorithm significantly improves on the previous algorithms. An implementation of our algorithm is freely available at https://github.com/Kingsford-Group/catfish.
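To make the problem statement above concrete, here is a minimal sketch of the classic greedy-width baseline for flow decomposition: repeatedly extract the s-t path with the largest bottleneck and subtract it from the flow. This is not the heuristic proposed in the paper and it does not guarantee a minimum number of paths. The function names and the example graph are hypothetical, and the input is assumed to be a valid integer flow on a directed acyclic graph.

```python
from collections import defaultdict


def topological_order(edges):
    """Kahn's algorithm over the vertices that appear in `edges`."""
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
        nodes.update((u, v))
    ready = [n for n in nodes if indegree[n] == 0]
    order = []
    while ready:
        u = ready.pop()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return order, successors


def greedy_width(flow, s, t):
    """Decompose `flow` ({(u, v): value}) into weighted s-t paths, widest first."""
    residual = dict(flow)
    order, successors = topological_order(flow)
    paths = []
    while True:
        # width[v] = largest bottleneck of any s-v path in the residual flow
        width = {s: float("inf")}
        pred = {}
        for u in order:
            if u not in width:
                continue
            for v in successors[u]:
                w = min(width[u], residual[(u, v)])
                if w > width.get(v, 0):
                    width[v] = w
                    pred[v] = u
        if t not in width:
            return paths
        path = [t]
        while path[-1] != s:
            path.append(pred[path[-1]])
        path.reverse()
        for edge in zip(path, path[1:]):
            residual[edge] -= width[t]
        paths.append((path, width[t]))


example_flow = {("s", "a"): 5, ("a", "b"): 3, ("a", "c"): 2,
                ("b", "t"): 3, ("c", "t"): 2}
for path, weight in greedy_width(example_flow, "s", "t"):
    print(weight, "-".join(path))
```

Each round saturates at least one edge, so the loop ends after at most |E| paths. That bound is much looser than the optimum the abstract targets, which is the gap a better heuristic tries to close.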
14
Abstract
Newly sequenced genomes are being added to the tree of life at an unprecedented fast pace. Increasingly, such new genomes are phylogenetically close to previously sequenced and annotated genomes. In other cases, whole clades of closely related species or strains ought to be annotated simultaneously. Often, in subsequent studies differences between the closely related species or strains are in the focus of research when the shared gene structures prevail. We here review methods for comparative structural genome annotation. The reviewed methods include classical approaches such as the alignment of protein sequences or protein profiles against the genome and comparative gene prediction methods that exploit a genome alignment to annotate a target genome. Newer approaches such as the simultaneous annotation of multiple genomes are also reviewed. We discuss how the methods depend on the phylogenetic placement of genomes, give advice on the choice of methods, and examine the consistency between gene structure annotations in an example. Further, we provide practical advice on genome annotation in general.
Affiliation(s)
- Stefanie König
- Institut für Mathematik und Informatik, Ernst Moritz Arndt Universität Greifswald, Greifswald, Germany
| | - Lars Romoth
- Institut für Mathematik und Informatik, Ernst Moritz Arndt Universität Greifswald, Greifswald, Germany
| | - Mario Stanke
- Institut für Mathematik und Informatik, Ernst Moritz Arndt Universität Greifswald, Greifswald, Germany.
| |
|
15
|
Li WV, Li JJ. Modeling and analysis of RNA-seq data: a review from a statistical perspective. QUANTITATIVE BIOLOGY 2018; 6:195-209. [PMID: 31456901 PMCID: PMC6711375 DOI: 10.1007/s40484-018-0144-7] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 12/05/2017] [Revised: 02/23/2018] [Accepted: 03/29/2018] [Indexed: 12/21/2022]
Abstract
BACKGROUND Since the invention of next-generation RNA sequencing (RNA-seq) technologies, they have become a powerful tool to study the presence and quantity of RNA molecules in biological samples and have revolutionized transcriptomic studies. The analysis of RNA-seq data at four different levels (samples, genes, transcripts, and exons) involves multiple statistical and computational questions, some of which remain challenging to date. RESULTS We review RNA-seq analysis tools at the sample, gene, transcript, and exon levels from a statistical perspective. We also highlight the biological and statistical questions of greatest practical importance. CONCLUSIONS The development of statistical and computational methods for analyzing RNA-seq data has made significant advances in the past decade. However, methods developed to answer the same biological question often rely on diverse statistical models and exhibit different performance under different scenarios. This review discusses and compares multiple commonly used statistical models regarding their assumptions, in the hope of helping users select appropriate methods as needed, as well as assisting developers in future method development.
Affiliation(s)
- Wei Vivian Li
- Department of Statistics, University of California, Los Angeles, Los Angeles, CA 90095-1554, USA
| | - Jingyi Jessica Li
- Department of Statistics, University of California, Los Angeles, Los Angeles, CA 90095-1554, USA
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095-088, USA
| |
|
16
|
Shao M, Ma J, Wang S. DeepBound: accurate identification of transcript boundaries via deep convolutional neural fields. Bioinformatics 2018; 33:i267-i273. [PMID: 28881999 PMCID: PMC5870651 DOI: 10.1093/bioinformatics/btx267] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 11/30/2022] Open
Abstract
Motivation Reconstructing the full-length expressed transcripts (a.k.a. the transcript assembly problem) from the short sequencing reads produced by RNA-seq protocol plays a central role in identifying novel genes and transcripts as well as in studying gene expressions and gene functions. A crucial step in transcript assembly is to accurately determine the splicing junctions and boundaries of the expressed transcripts from the reads alignment. In contrast to the splicing junctions that can be efficiently detected from spliced reads, the problem of identifying boundaries remains open and challenging, due to the fact that the signal related to boundaries is noisy and weak. Results We present DeepBound, an effective approach to identify boundaries of expressed transcripts from RNA-seq reads alignment. At its core, DeepBound employs deep convolutional neural fields to learn the hidden distributions and patterns of boundaries. To accurately model the transition probabilities and to solve the label-imbalance problem, we incorporate the AUC (area under the curve) score into the optimizing objective function, which is a novel step. To address the issue that deep probabilistic graphical models require a large number of labeled training samples, we propose to use simulated RNA-seq datasets to train our model. Through extensive experimental studies on both simulation datasets of two species and biological datasets, we show that DeepBound consistently and significantly outperforms the two existing methods. Availability and implementation DeepBound is freely available at https://github.com/realbigws/DeepBound.
Affiliation(s)
- Mingfu Shao
- Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- To whom correspondence should be addressed.
| | - Jianzhu Ma
- School of Medicine, University of California San Diego, La Jolla, CA, USA
| | - Sheng Wang
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- To whom correspondence should be addressed.
| |
|
17
|
Aguiar D, Cheng LF, Dumitrascu B, Mordelet F, Pai AA, Engelhardt BE. Bayesian nonparametric discovery of isoforms and individual specific quantification. Nat Commun 2018; 9:1681. [PMID: 29703885 PMCID: PMC5923247 DOI: 10.1038/s41467-018-03402-w] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 04/10/2017] [Accepted: 02/11/2018] [Indexed: 12/18/2022] Open
Abstract
Most human protein-coding genes can be transcribed into multiple distinct mRNA isoforms. These alternative splicing patterns encourage molecular diversity, and dysregulation of isoform expression plays an important role in disease etiology. However, isoforms are difficult to characterize from short-read RNA-seq data because they share identical subsequences and occur in different frequencies across tissues and samples. Here, we develop biisq, a Bayesian nonparametric model for isoform discovery and individual specific quantification from short-read RNA-seq data. biisq does not require isoform reference sequences but instead estimates an isoform catalog shared across samples. We use stochastic variational inference for efficient posterior estimates and demonstrate superior precision and recall for simulations compared to state-of-the-art isoform reconstruction methods. biisq shows the most gains for low abundance isoforms, with 36% more isoforms correctly inferred at low coverage versus a multi-sample method and 170% more versus single-sample methods. We estimate isoforms in the GEUVADIS RNA-seq data and validate inferred isoforms by associating genetic variants with isoform ratios. Alternative splicing leads to transcript isoform diversity. Here, Aguiar et al. develop biisq, a Bayesian nonparametric approach to discover and quantify isoforms from RNA-seq data.
Affiliation(s)
- Derek Aguiar
- Department of Computer Science, Princeton University, Princeton, NJ, 08540, USA.
| | - Li-Fang Cheng
- Department of Electrical Engineering, Princeton University, Princeton, NJ, 08540, USA
| | - Bianca Dumitrascu
- Lewis-Sigler Institute, Princeton University, Princeton, NJ, 08544, USA
| | - Fantine Mordelet
- Institute for Genome Sciences and Policy, Duke University, Durham, NC, 27708, USA
| | - Athma A Pai
- Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA; RNA Therapeutics Institute, University of Massachusetts Medical School, Worcester, MA, 01605, USA
| | - Barbara E Engelhardt
- Department of Computer Science, Princeton University, Princeton, NJ, 08540, USA; Center for Statistics and Machine Learning, Princeton University, Princeton, NJ, 08540, USA.
| |
|
18
|
Li WV, Zhao A, Zhang S, Li JJ. MSIQ: JOINT MODELING OF MULTIPLE RNA-SEQ SAMPLES FOR ACCURATE ISOFORM QUANTIFICATION. Ann Appl Stat 2018; 12:510-539. [PMID: 29731954 PMCID: PMC5935499 DOI: 10.1214/17-aoas1100] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 01/21/2023]
Abstract
Next-generation RNA sequencing (RNA-seq) technology has been widely used to assess full-length RNA isoform abundance in a high-throughput manner. RNA-seq data offer insight into gene expression levels and transcriptome structures, enabling us to better understand the regulation of gene expression and fundamental biological processes. Accurate isoform quantification from RNA-seq data is challenging due to the information loss in sequencing experiments. A recent accumulation of multiple RNA-seq data sets from the same tissue or cell type provides new opportunities to improve the accuracy of isoform quantification. However, existing statistical or computational methods for multiple RNA-seq samples either pool the samples into one sample or assign equal weights to the samples when estimating isoform abundance. These methods ignore the possible heterogeneity in the quality of different samples and could result in biased and non-robust estimates. In this article, we develop a method, which we call "joint modeling of multiple RNA-seq samples for accurate isoform quantification" (MSIQ), for more accurate and robust isoform quantification by integrating multiple RNA-seq samples under a Bayesian framework. Our method aims to (1) identify a consistent group of samples with homogeneous quality and (2) improve isoform quantification accuracy by jointly modeling multiple RNA-seq samples by allowing for higher weights on the consistent group. We show that MSIQ provides a consistent estimator of isoform abundance, and we demonstrate the accuracy and effectiveness of MSIQ compared with alternative methods through simulation studies on D. melanogaster genes. We justify MSIQ's advantages over existing approaches via application studies on real RNA-seq data from human embryonic stem cells, brain tissues, and the HepG2 immortalized cell line. We also perform a comprehensive analysis of how the isoform quantification accuracy would be affected by RNA-seq sample heterogeneity and different experimental protocols.
|
19
|
Tilgner H, Jahanbani F, Gupta I, Collier P, Wei E, Rasmussen M, Snyder M. Microfluidic isoform sequencing shows widespread splicing coordination in the human transcriptome. Genome Res 2017; 28:231-242. [PMID: 29196558 PMCID: PMC5793787 DOI: 10.1101/gr.230516.117] [Citation(s) in RCA: 49] [Impact Index Per Article: 7.0] [Received: 09/25/2017] [Accepted: 11/30/2017] [Indexed: 12/21/2022]
Abstract
Understanding transcriptome complexity is crucial for understanding human biology and disease. Technologies such as Synthetic long-read RNA sequencing (SLR-RNA-seq) delivered 5 million isoforms and allowed assessing splicing coordination. Pacific Biosciences and Oxford Nanopore increase throughput also but require high input amounts or amplification. Our new droplet-based method, sparse isoform sequencing (spISO-seq), sequences 100k–200k partitions of 10–200 molecules at a time, enabling analysis of 10–100 million RNA molecules. SpISO-seq requires less than 1 ng of input cDNA, limiting or removing the need for prior amplification with its associated biases. Adjusting the number of reads devoted to each molecule reduces sequencing lanes and cost, with little loss in detection power. The increased number of molecules expands our understanding of isoform complexity. In addition to confirming our previously published cases of splicing coordination (e.g., BIN1), the greater depth reveals many new cases, such as MAPT. Coordination of internal exons is found to be extensive among protein coding genes: 23.5%–59.3% (95% confidence interval) of highly expressed genes with distant alternative exons exhibit coordination, showcasing the need for long-read transcriptomics. However, coordination is less frequent for noncoding sequences, suggesting a larger role of splicing coordination in shaping proteins. Groups of genes with coordination are involved in protein–protein interactions with each other, raising the possibility that coordination facilitates complex formation and/or function. We also find new splicing coordination types, involving initial and terminal exons. Our results provide a more comprehensive understanding of the human transcriptome and a general, cost-effective method to analyze it.
Affiliation(s)
- Hagen Tilgner
- Brain and Mind Research Institute, Weill Cornell Medicine, New York, New York 10021, USA
| | - Fereshteh Jahanbani
- Department of Genetics, Stanford University, Stanford, California 94304, USA
| | - Ishaan Gupta
- Brain and Mind Research Institute, Weill Cornell Medicine, New York, New York 10021, USA
| | - Paul Collier
- Brain and Mind Research Institute, Weill Cornell Medicine, New York, New York 10021, USA
| | - Eric Wei
- Department of Genetics, Stanford University, Stanford, California 94304, USA
| | | | - Michael Snyder
- Department of Genetics, Stanford University, Stanford, California 94304, USA
| |
|
20
|
Li L, Wang X, Xiao G, Gazdar A. Integrative gene set enrichment analysis utilizing isoform-specific expression. Genet Epidemiol 2017; 41:498-510. [PMID: 28580727 DOI: 10.1002/gepi.22052] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 08/25/2016] [Revised: 02/12/2017] [Accepted: 03/14/2017] [Indexed: 01/01/2023]
Abstract
Gene set enrichment analysis (GSEA) aims at identifying essential pathways, or more generally, sets of biologically related genes that are involved in complex human diseases. In the past, many studies have shown that GSEA is a very useful bioinformatics tool that plays critical roles in the innovation of disease prevention and intervention strategies. Despite its tremendous success, it is striking that conclusions of GSEA drawn from isolated studies are often sparse, and different studies may lead to inconsistent and sometimes contradictory results. Further, in the wake of next generation sequencing technologies, it has been made possible to measure genome-wide isoform-specific expression levels, calling for innovations that can utilize the unprecedented resolution. Currently, enormous amounts of data have been created from various RNA-seq experiments. All these give rise to a pressing need for developing integrative methods that allow for explicit utilization of isoform-specific expression, to combine multiple enrichment studies, in order to enhance the power, reproducibility, and interpretability of the analysis. We develop and evaluate integrative GSEA methods, based on two-stage procedures, which, for the first time, allow statistically efficient use of isoform-specific expression from multiple RNA-seq experiments. Through simulation and real data analysis, we show that our methods can greatly improve the performance in identifying essential gene sets compared to existing methods that can only use gene-level expression.
Affiliation(s)
- Lie Li
- Department of Statistical Science, Southern Methodist University, Dallas, Texas, United States of America
| | - Xinlei Wang
- Department of Statistical Science, Southern Methodist University, Dallas, Texas, United States of America
| | - Guanghua Xiao
- Department of Clinical Sciences, The University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - Adi Gazdar
- Department of Pathology, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| |
|
21
|
Kahles A, Ong CS, Zhong Y, Rätsch G. SplAdder: identification, quantification and testing of alternative splicing events from RNA-Seq data. Bioinformatics 2016; 32:1840-7. [PMID: 26873928 PMCID: PMC4908322 DOI: 10.1093/bioinformatics/btw076] [Citation(s) in RCA: 93] [Impact Index Per Article: 11.6] [Received: 03/25/2015] [Revised: 12/18/2015] [Accepted: 02/04/2016] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Understanding the occurrence and regulation of alternative splicing (AS) is a key task towards explaining the regulatory processes that shape the complex transcriptomes of higher eukaryotes. With the advent of high-throughput sequencing of RNA (RNA-Seq), the diversity of AS transcripts could be measured at an unprecedented depth. Although the catalog of known AS events has grown ever since, novel transcripts are commonly observed when working with less well annotated organisms, in the context of disease, or within large populations. Whereas an identification of complete transcripts is technically challenging and computationally expensive, focusing on single splicing events as a proxy for transcriptome characteristics is fruitful and sufficient for a wide range of analyses. RESULTS We present SplAdder, an alternative splicing toolbox, that takes RNA-Seq alignments and an annotation file as input to (i) augment the annotation based on RNA-Seq evidence, (ii) identify alternative splicing events present in the augmented annotation graph, (iii) quantify and confirm these events based on the RNA-Seq data and (iv) test for significant quantitative differences between samples. Thereby, our main focus lies on performance, accuracy and usability. AVAILABILITY Source code and documentation are available for download at http://github.com/ratschlab/spladder. Example data, introductory information and a small tutorial are accessible via http://bioweb.me/spladder. CONTACTS andre.kahles@ratschlab.org or gunnar.ratsch@ratschlab.org. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- André Kahles
- Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
| | - Cheng Soon Ong
- Canberra Research Laboratory, NICTA, Canberra, ACT 2601, Australia
| | - Yi Zhong
- Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
| | - Gunnar Rätsch
- Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
| |
|
22
|
Song L, Sabunciyan S, Florea L. CLASS2: accurate and efficient splice variant annotation from RNA-seq reads. Nucleic Acids Res 2016; 44:e98. [PMID: 26975657 PMCID: PMC4889935 DOI: 10.1093/nar/gkw158] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.8] [Received: 11/12/2014] [Accepted: 02/28/2016] [Indexed: 11/29/2022] Open
Abstract
Next generation sequencing of cellular RNA is making it possible to characterize genes and alternative splicing in unprecedented detail. However, designing bioinformatics tools to accurately capture splicing variation has proven difficult. Current programs can find major isoforms of a gene but miss lower abundance variants, or are sensitive but imprecise. CLASS2 is a novel open source tool for accurate genome-guided transcriptome assembly from RNA-seq reads based on the model of splice graph. An extension of our program CLASS, CLASS2 jointly optimizes read patterns and the number of supporting reads to score and prioritize transcripts, implemented in a novel, scalable and efficient dynamic programming algorithm. When compared against reference programs, CLASS2 had the best overall accuracy and could detect up to twice as many splicing events with precision similar to the best reference program. Notably, it was the only tool to produce consistently reliable transcript models for a wide range of applications and sequencing strategies, including ribosomal RNA-depleted samples. Lightweight and multi-threaded, CLASS2 requires <3GB RAM and can analyze a 350 million read set within hours, and can be widely applied to transcriptomics studies ranging from clinical RNA sequencing, to alternative splicing analyses, and to the annotation of new genomes.
Affiliation(s)
- Li Song
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA; Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Sarven Sabunciyan
- Department of Pediatrics, Johns Hopkins School of Medicine, Baltimore, MD 21287, USA
| | - Liliana Florea
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA; Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA; Department of Medicine, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
| |
|
23
|
Canzar S, Andreotti S, Weese D, Reinert K, Klau GW. CIDANE: comprehensive isoform discovery and abundance estimation. Genome Biol 2016; 17:16. [PMID: 26831908 PMCID: PMC4734886 DOI: 10.1186/s13059-015-0865-0] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Received: 10/28/2015] [Accepted: 12/29/2015] [Indexed: 12/19/2022] Open
Abstract
We present CIDANE, a novel framework for genome-based transcript reconstruction and quantification from RNA-seq reads. CIDANE assembles transcripts efficiently with significantly higher sensitivity and precision than existing tools. Its algorithmic core not only reconstructs transcripts ab initio, but also allows the use of the growing annotation of known splice sites, transcription start and end sites, or full-length transcripts, which are available for most model organisms. CIDANE supports the integrated analysis of RNA-seq and additional gene-boundary data and recovers splice junctions that are invisible to other methods. CIDANE is available at http://ccb.jhu.edu/software/cidane/.
Affiliation(s)
- Stefan Canzar
- Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA; Toyota Technological Institute at Chicago, 6045 S. Kennwood Avenue, Chicago, IL 60637, USA
| | - Sandro Andreotti
- Department of Mathematics and Computer Science, Institute of Computer Science, Freie Universität Berlin, Arnimallee 14, Berlin, 14195, Germany
| | - David Weese
- Department of Mathematics and Computer Science, Institute of Computer Science, Freie Universität Berlin, Arnimallee 14, Berlin, 14195, Germany.
| | - Knut Reinert
- Department of Mathematics and Computer Science, Institute of Computer Science, Freie Universität Berlin, Arnimallee 14, Berlin, 14195, Germany.
| | - Gunnar W Klau
- Life Sciences, Centrum Wiskunde & Informatica (CWI), Science Park 123, Amsterdam, 1098 XG, The Netherlands.
| |
|
24
|
Abstract
RNA sequencing allows for simultaneous transcript discovery and quantification, but reconstructing complete transcripts from such data remains difficult. Here, we introduce Bayesembler, a novel probabilistic method for transcriptome assembly built on a Bayesian model of the RNA sequencing process. Under this model, samples from the posterior distribution over transcripts and their abundance values are obtained using Gibbs sampling. By using the frequency at which transcripts are observed during sampling to select the final assembly, we demonstrate marked improvements in sensitivity and precision over state-of-the-art assemblers on both simulated and real data. Bayesembler is available at https://github.com/bioinformatics-centre/bayesembler.
Affiliation(s)
- Lasse Maretty
- The Bioinformatics Centre, Department of Biology and Biotech Research and Innovation Centre (BRIC), University of Copenhagen, Ole Maaløes Vej 5, 2200, Copenhagen, Denmark
| | | | | |
|
25
|
Hayer KE, Pizarro A, Lahens NF, Hogenesch JB, Grant GR. Benchmark analysis of algorithms for determining and quantifying full-length mRNA splice forms from RNA-seq data. Bioinformatics 2015; 31:3938-45. [PMID: 26338770 PMCID: PMC4673975 DOI: 10.1093/bioinformatics/btv488] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.8] [Received: 08/06/2014] [Accepted: 08/17/2015] [Indexed: 01/26/2023] Open
Abstract
MOTIVATION Because of the advantages of RNA sequencing (RNA-Seq) over microarrays, it is gaining widespread popularity for highly parallel gene expression analysis. For example, RNA-Seq is expected to be able to provide accurate identification and quantification of full-length splice forms. A number of informatics packages have been developed for this purpose, but short reads make it a difficult problem in principle. Sequencing error and polymorphisms add further complications. It has become necessary to perform studies to determine which algorithms perform best and which if any algorithms perform adequately. However, there is a dearth of independent and unbiased benchmarking studies. Here we take an approach using both simulated and experimental benchmark data to evaluate their accuracy. RESULTS We conclude that most methods are inaccurate even using idealized data, and that no method is highly accurate once multiple splice forms, polymorphisms, intron signal, sequencing errors, alignment errors, annotation errors and other complicating factors are present. These results point to the pressing need for further algorithm development. AVAILABILITY AND IMPLEMENTATION Simulated datasets and other supporting information can be found at http://bioinf.itmat.upenn.edu/BEERS/bp2. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Katharina E Hayer
- University of Pennsylvania, Institute for Translational Medicine and Therapeutics, Philadelphia, PA 19104
| | - Angel Pizarro
- Scientific Computing at Amazon Web Services, Seattle, WA 98108
| | | | | | - Gregory R Grant
- University of Pennsylvania, Institute for Translational Medicine and Therapeutics, Philadelphia, PA 19104, Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA
| |
|
26
|
Alamancos GP, Pagès A, Trincado JL, Bellora N, Eyras E. Leveraging transcript quantification for fast computation of alternative splicing profiles. RNA (NEW YORK, N.Y.) 2015; 21:1521-31. [PMID: 26179515 PMCID: PMC4536314 DOI: 10.1261/rna.051557.115] [Citation(s) in RCA: 153] [Impact Index Per Article: 17.0] [Received: 02/26/2015] [Accepted: 05/29/2015] [Indexed: 05/02/2023]
Abstract
Alternative splicing plays an essential role in many cellular processes and bears major relevance in the understanding of multiple diseases, including cancer. High-throughput RNA sequencing allows genome-wide analyses of splicing across multiple conditions. However, the increasing number of available data sets represents a major challenge in terms of computation time and storage requirements. We describe SUPPA, a computational tool to calculate relative inclusion values of alternative splicing events, exploiting fast transcript quantification. SUPPA accuracy is comparable and sometimes superior to standard methods using simulated as well as real RNA-sequencing data compared with experimentally validated events. We assess the variability in terms of the choice of annotation and provide evidence that using complete transcripts rather than more transcripts per gene provides better estimates. Moreover, SUPPA coupled with de novo transcript reconstruction methods does not achieve accuracies as high as using quantification of known transcripts, but remains comparable to existing methods. Finally, we show that SUPPA is more than 1000 times faster than standard methods. Coupled with fast transcript quantification, SUPPA provides inclusion values at a much higher speed than existing methods without compromising accuracy, thereby facilitating the systematic splicing analysis of large data sets with limited computational resources. The software is implemented in Python 2.7 and is available under the MIT license at https://bitbucket.org/regulatorygenomicsupf/suppa.
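As a rough illustration of how an inclusion value can be derived from transcript-level quantification, the sketch below assumes the common definition in which the value is the summed abundance of transcripts supporting the inclusion form divided by the summed abundance of all transcripts that define the event. The function name, transcript identifiers, and abundance values are hypothetical, and this is not SUPPA's actual interface.

```python
def inclusion_value(abundance, inclusion_ids, event_ids):
    """Relative inclusion of a splicing event from transcript abundances.

    abundance:     dict mapping transcript id -> abundance (e.g. TPM)
    inclusion_ids: transcripts that contain the inclusion form of the event
    event_ids:     all transcripts that define the event (both forms);
                   inclusion_ids is assumed to be a subset of event_ids
    Returns a value in [0, 1], or None when no defining transcript is expressed.
    """
    total = sum(abundance.get(t, 0.0) for t in event_ids)
    if total == 0:
        return None  # undefined: the event is not expressed at all
    included = sum(abundance.get(t, 0.0) for t in inclusion_ids)
    return included / total


# Hypothetical abundances: tx1 includes the exon, tx2 skips it, and tx3
# does not overlap the event, so it is ignored.
tpm = {"tx1": 30.0, "tx2": 10.0, "tx3": 5.0}
psi = inclusion_value(tpm, inclusion_ids={"tx1"}, event_ids={"tx1", "tx2"})
print(psi)
```

Because the calculation only reads precomputed abundances, its cost is a few additions per event, which is consistent with the speed advantage the abstract attributes to reusing fast transcript quantification.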
Affiliation(s)
| | - Amadís Pagès
- Universitat Pompeu Fabra, E08003 Barcelona, Spain; Centre for Genomic Regulation, E08003 Barcelona, Spain
| | | | - Nicolás Bellora
- INIBIOMA, CONICET-UNComahue, Bariloche, 8400 Río Negro, Argentina
| | - Eduardo Eyras
- Universitat Pompeu Fabra, E08003 Barcelona, Spain; Catalan Institution for Research and Advanced Studies, E08010 Barcelona, Spain
| |
|
27
|
Alamancos GP, Pagès A, Trincado JL, Bellora N, Eyras E. Leveraging transcript quantification for fast computation of alternative splicing profiles. RNA (NEW YORK, N.Y.) 2015; 21:1521-1531. [PMID: 26179515 DOI: 10.1101/008763] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 02/26/2015] [Accepted: 05/29/2015] [Indexed: 05/18/2023]
|
28
|
Bernard E, Jacob L, Mairal J, Viara E, Vert JP. A convex formulation for joint RNA isoform detection and quantification from multiple RNA-seq samples. BMC Bioinformatics 2015; 16:262. [PMID: 26286719 PMCID: PMC4543468 DOI: 10.1186/s12859-015-0695-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 03/26/2015] [Accepted: 08/05/2015] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND Detecting and quantifying isoforms from RNA-seq data is an important but challenging task. The problem is often ill-posed, particularly at low coverage. One promising direction is to exploit several samples simultaneously. RESULTS We propose a new method for solving the isoform deconvolution problem jointly across several samples. We formulate a convex optimization problem that allows information to be shared between samples and that we solve efficiently. We demonstrate the benefits of combining several samples on simulated and real data, and show that our approach outperforms pooling strategies and methods based on integer programming. CONCLUSION Our convex formulation to jointly detect and quantify isoforms from RNA-seq data of multiple related samples is a computationally efficient approach to leverage the hypothesis that some isoforms are likely to be present in several samples. The software and source code are available at http://cbio.ensmp.fr/flipflop.
Affiliation(s)
- Elsa Bernard
- MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, Fontainebleau, 77300, France; Institut Curie, Paris, 75005, France; INSERM U900, Paris, 75005, France.
- Laurent Jacob
- Laboratoire Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, INRA, UMR5558, Villeurbanne, France.
- Julien Mairal
- Inria, LEAR Team, Laboratoire Jean Kuntzmann, CNRS, Université Grenoble Alpes, 655, Avenue de l'Europe, Montbonnot, 38330, France.
- Jean-Philippe Vert
- MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, Fontainebleau, 77300, France; Institut Curie, Paris, 75005, France; INSERM U900, Paris, 75005, France.
|
30
|
Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 2015; 33:290-5. [PMID: 25690850 PMCID: PMC4643835 DOI: 10.1038/nbt.3122] [Citation(s) in RCA: 6933] [Impact Index Per Article: 770.3] [Received: 04/15/2014] [Accepted: 12/09/2014] [Indexed: 12/21/2022]
Abstract
Methods used to sequence the transcriptome often produce more than 200 million short sequences. We introduce StringTie, a computational method that applies a network flow algorithm originally developed in optimization theory, together with optional de novo assembly, to assemble these complex data sets into transcripts. When used to analyze both simulated and real data sets, StringTie produces more complete and accurate reconstructions of genes and better estimates of expression levels, compared with other leading transcript assembly programs including Cufflinks, IsoLasso, Scripture and Traph. For example, on 90 million reads from human blood, StringTie correctly assembled 10,990 transcripts, whereas the next best assembly was of 7,187 transcripts by Cufflinks, which is a 53% increase in transcripts assembled. On a simulated data set, StringTie correctly assembled 7,559 transcripts, which is 20% more than the 6,310 assembled by Cufflinks. As well as producing a more complete transcriptome assembly, StringTie runs faster on all data sets tested to date compared with other assembly software, including Cufflinks.
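The two percentages quoted in this abstract can be checked directly from the transcript counts it reports. The short sketch below is only an arithmetic check; the helper name `percent_increase` is illustrative.

```python
def percent_increase(new: int, baseline: int) -> float:
    """Relative increase of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Counts of correctly assembled transcripts quoted in the abstract.
real_data = percent_increase(10990, 7187)   # StringTie vs Cufflinks, human blood
simulated = percent_increase(7559, 6310)    # StringTie vs Cufflinks, simulated

print(round(real_data), round(simulated))   # 53 20
```

Both figures use the Cufflinks count as the baseline, so they describe how many more transcripts StringTie assembled correctly relative to the next best tool.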
Affiliation(s)
- Mihaela Pertea
- [1] Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, USA. [2] McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland, USA
- Geo M Pertea
- [1] Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, USA. [2] McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland, USA
- Corina M Antonescu
- [1] Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, USA. [2] McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland, USA
- Tsung-Cheng Chang
- [1] Department of Molecular Biology, The University of Texas Southwestern Medical Center, Dallas, Texas, USA. [2] Center for Regenerative Science and Medicine, The University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Joshua T Mendell
- [1] Department of Molecular Biology, The University of Texas Southwestern Medical Center, Dallas, Texas, USA. [2] Center for Regenerative Science and Medicine, The University of Texas Southwestern Medical Center, Dallas, Texas, USA. [3] Simmons Cancer Center, The University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Steven L Salzberg
- [1] Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, USA. [2] McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland, USA. [3] Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, USA. [4] Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
|
31
|
Hoff KJ, Stanke M. Current methods for automated annotation of protein-coding genes. Current Opinion in Insect Science 2015; 7:8-14. [PMID: 32846689 DOI: 10.1016/j.cois.2015.02.008] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 11/01/2014] [Revised: 12/08/2014] [Accepted: 02/18/2015] [Indexed: 06/11/2023]
Abstract
We review software tools for gene prediction - the identification of protein-coding genes and their structure in genome sequences. The discussed approaches include methods based on RNA-Seq and current methods based on homology - comparative gene prediction and protein spliced alignments. Many methods require that their parameters are adjusted to the target species or its broader clade. These include ab initio gene finders, integrated approaches with ab initio components and some aligners. We also review current automatic methods for training for the common case that a bona fide training set of gene structures is not available before annotation.
Affiliation(s)
- K J Hoff
- Institut für Mathematik und Informatik, Universität Greifswald, Walther-Rathenau-Str. 47, 17487 Greifswald, Germany
- M Stanke
- Institut für Mathematik und Informatik, Universität Greifswald, Walther-Rathenau-Str. 47, 17487 Greifswald, Germany
|
32
|
Tasnim M, Ma S, Yang EW, Jiang T, Li W. Accurate inference of isoforms from multiple sample RNA-Seq data. BMC Genomics 2015; 16 Suppl 2:S15. [PMID: 25708199 PMCID: PMC4331715 DOI: 10.1186/1471-2164-16-s2-s15] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND RNA-Seq based transcriptome assembly has become a fundamental technique for studying expressed mRNAs (i.e., transcripts or isoforms) in a cell using high-throughput sequencing technologies, and is serving as a basis to analyze the structural and quantitative differences of expressed isoforms between samples. However, the current transcriptome assembly algorithms are not specifically designed to handle large amounts of errors that are inherent in real RNA-Seq datasets, especially those involving multiple samples, making downstream differential analysis applications difficult. On the other hand, multiple sample RNA-Seq datasets may provide more information than single sample datasets that can be utilized to improve the performance of transcriptome assembly and abundance estimation, but such information remains overlooked by the existing assembly tools. RESULTS We formulate a computational framework of transcriptome assembly that is capable of handling noisy RNA-Seq reads and multiple sample RNA-Seq datasets efficiently. We show that finding an optimal solution under this framework is an NP-hard problem. Instead, we develop an efficient heuristic algorithm, called Iterative Shortest Path (ISP), based on linear programming (LP) and integer linear programming (ILP). Our preliminary experimental results on both simulated and real datasets and comparison with the existing assembly tools demonstrate that (i) the ISP algorithm is able to assemble transcriptomes with a greatly increased precision while keeping the same level of sensitivity, especially when many samples are involved, and (ii) its assembly results help improve downstream differential analysis. The source code of ISP is freely available at http://alumni.cs.ucr.edu/~liw/isp.html.
Affiliation(s)
- Masruba Tasnim
- Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, 92507, USA
- Shining Ma
- Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, 92507, USA
- MOE Key Lab of Bioinformatics and Bioinformatics Division, TNLIST / Department of Automation, Tsinghua University, Beijing, 100084, China
- Ei-Wen Yang
- Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, 92507, USA
- Tao Jiang
- Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, 92507, USA
- MOE Key Lab of Bioinformatics and Bioinformatics Division, TNLIST / Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
- Wei Li
- Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, 92507, USA
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute and Harvard School of Public Health, Boston, MA, 02215, USA
|
33
|
Shenker S, Miura P, Sanfilippo P, Lai EC. IsoSCM: improved and alternative 3' UTR annotation using multiple change-point inference. RNA (New York, N.Y.) 2015; 21:14-27. [PMID: 25406361 PMCID: PMC4274634 DOI: 10.1261/rna.046037.114] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Received: 04/23/2014] [Accepted: 10/15/2014] [Indexed: 05/23/2023]
Abstract
Major applications of RNA-seq data include studies of how the transcriptome is modulated at the levels of gene expression and RNA processing, and how these events are related to cellular identity, environmental condition, and/or disease status. While many excellent tools have been developed to analyze RNA-seq data, these generally have limited efficacy for annotating 3' UTRs. Existing assembly strategies often fragment long 3' UTRs, and importantly, none of the algorithms in popular use can apportion data into tandem 3' UTR isoforms, which are frequently generated by alternative cleavage and polyadenylation (APA). Consequently, it is often not possible to identify patterns of differential APA using existing assembly tools. To address these limitations, we present a new method for transcript assembly, Isoform Structural Change Model (IsoSCM) that incorporates change-point analysis to improve the 3' UTR annotation process. Through evaluation on simulated and genuine data sets, we demonstrate that IsoSCM annotates 3' termini with higher sensitivity and specificity than can be achieved with existing methods. We highlight the utility of IsoSCM by demonstrating its ability to recover known patterns of tissue-regulated APA. IsoSCM will facilitate future efforts for 3' UTR annotation and genome-wide studies of the breadth, regulation, and roles of APA leveraging RNA-seq data. The IsoSCM software and source code are available from our website https://github.com/shenkers/isoscm.
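To make the idea of change-point analysis on read coverage concrete, the sketch below finds a single change point by least squares: it tries every split of a coverage vector and keeps the one where two constant segments fit best. This is a deliberately simplified stand-in and not the multiple change-point inference that IsoSCM implements; the function name `best_change_point` and the coverage values are hypothetical.

```python
from typing import List, Tuple


def best_change_point(coverage: List[float]) -> Tuple[int, float]:
    """Return (index, cost) of the best single split of `coverage`.

    The split divides the vector into coverage[:index] and coverage[index:],
    and cost is the summed squared error around each segment's mean.
    """
    n = len(coverage)
    if n < 2:
        raise ValueError("need at least two positions")

    # Prefix sums of values and squared values give each segment's
    # squared error in constant time.
    sums, squares = [0.0], [0.0]
    for x in coverage:
        sums.append(sums[-1] + x)
        squares.append(squares[-1] + x * x)

    def segment_error(start: int, end: int) -> float:
        total = sums[end] - sums[start]
        return (squares[end] - squares[start]) - total * total / (end - start)

    best_index, best_cost = 1, float("inf")
    for k in range(1, n):
        cost = segment_error(0, k) + segment_error(k, n)
        if cost < best_cost:
            best_index, best_cost = k, cost
    return best_index, best_cost


# Hypothetical per-position coverage along a 3' UTR: a drop after position 5
# mimics the end of a shorter tandem isoform.
coverage = [50, 52, 49, 51, 50, 20, 19, 21, 20, 22]
index, cost = best_change_point(coverage)
print(index)  # 5
```

A drop in coverage like this one is the kind of signal that would suggest a proximal cleavage site; detecting several such boundaries at once requires a multiple change-point model, as the abstract describes.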
Affiliation(s)
- Sol Shenker
- Department of Developmental Biology, Sloan-Kettering Institute, New York, New York 10065, USA; Tri-Institutional Program in Computational Biology and Medicine, Weill Cornell Medical College, New York, New York 10065, USA
- Pedro Miura
- Department of Developmental Biology, Sloan-Kettering Institute, New York, New York 10065, USA
- Piero Sanfilippo
- Department of Developmental Biology, Sloan-Kettering Institute, New York, New York 10065, USA; Tri-Institutional Program in Computational Biology and Medicine, Weill Cornell Medical College, New York, New York 10065, USA
- Eric C Lai
- Department of Developmental Biology, Sloan-Kettering Institute, New York, New York 10065, USA
|
34
|
Huang Y, Hu Y, Liu J. Piecing the puzzle together: a revisit to transcript reconstruction problem in RNA-seq. BMC Bioinformatics 2014; 15 Suppl 9:S3. [PMID: 25252653 PMCID: PMC4168703 DOI: 10.1186/1471-2105-15-s9-s3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022] Open
Abstract
The advancement of RNA sequencing (RNA-seq) has provided an unprecedented opportunity to assess both the diversity and quantity of transcript isoforms in an mRNA transcriptome. In this paper, we revisit the computational problem of transcript reconstruction and quantification. Unlike existing methods which focus on how to explain the exons and splice variants detected by the reads with a set of isoforms, we aim at reconstructing transcripts by piecing the reads into individual effective transcript copies. Simultaneously, the quantity of each isoform is explicitly measured by the number of assembled effective copies, instead of estimated solely based on the collective read count. We have developed a novel method named Astroid that solves the problem of effective copy reconstruction on the basis of a flow network. The RNA-seq reads are represented as vertices in the flow network and are connected by weighted edges that evaluate the likelihood of two reads originating from the same effective copy. A maximum likelihood set of transcript copies is then reconstructed by solving a minimum-cost flow problem on the flow network. Simulation studies on the human transcriptome have demonstrated the superior sensitivity and specificity of Astroid in transcript reconstruction as well as improved accuracy in transcript quantification over several existing approaches. The application of Astroid on two real RNA-seq datasets has further demonstrated its accuracy through high correlation between the estimated isoform abundance and the qRT-PCR validations.
Affiliation(s)
- Yan Huang
- Department of Computer Science, University of Kentucky, Lexington, KY, USA
- Yin Hu
- Department of Computer Science, University of Kentucky, Lexington, KY, USA
- Jinze Liu
- Department of Computer Science, University of Kentucky, Lexington, KY, USA
|
35
|
Bernard E, Jacob L, Mairal J, Vert JP. Efficient RNA isoform identification and quantification from RNA-Seq data with network flows. Bioinformatics 2014; 30:2447-55. [PMID: 24813214 PMCID: PMC4147886 DOI: 10.1093/bioinformatics/btu317] [Citation(s) in RCA: 54] [Impact Index Per Article: 5.4] [Indexed: 11/22/2022] Open
Abstract
Motivation: Several state-of-the-art methods for isoform identification and quantification are based on ℓ1-regularized regression, such as the Lasso. However, explicitly listing the possibly exponentially large set of candidate transcripts is intractable for genes with many exons. For this reason, existing approaches using the ℓ1-penalty are either restricted to genes with few exons or only run the regression algorithm on a small set of preselected isoforms. Results: We introduce a new technique called FlipFlop, which can efficiently tackle the sparse estimation problem on the full set of candidate isoforms by using network flow optimization. Our technique removes the need for a preselection step, leading to better isoform identification while keeping a low computational cost. Experiments with synthetic and real RNA-Seq data confirm that our approach is more accurate than alternative methods and one of the fastest available. Availability and implementation: Source code is freely available as an R package from the Bioconductor Web site (http://www.bioconductor.org/), and more information is available at http://cbio.ensmp.fr/flipflop. Contact: Jean-Philippe.Vert@mines.org Supplementary information: Supplementary data are available at Bioinformatics online.
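The sparsity that an ℓ1 penalty induces can be seen in its proximal operator, soft-thresholding, which shrinks every coefficient toward zero and sets small ones exactly to zero. The sketch below shows only that operator on a hypothetical vector of candidate isoform abundances; it is not FlipFlop's network-flow algorithm, and the name `soft_threshold` and the value of `lam` are illustrative assumptions.

```python
from typing import List


def soft_threshold(values: List[float], lam: float) -> List[float]:
    """Apply the proximal operator of lam * |x| to each entry of `values`."""
    shrunk = []
    for v in values:
        if v > lam:
            shrunk.append(v - lam)   # large positive: shrink by lam
        elif v < -lam:
            shrunk.append(v + lam)   # large negative: shrink by lam
        else:
            shrunk.append(0.0)       # small magnitude: set exactly to zero
    return shrunk


# Hypothetical unpenalized abundance estimates for four candidate isoforms.
abundances = [5.0, 0.3, 1.2, 0.1]
sparse = soft_threshold(abundances, lam=0.5)
kept = [i for i, a in enumerate(sparse) if a != 0.0]
print(kept)  # [0, 2]
```

Only the candidates whose estimate exceeds the threshold survive, which is why ℓ1-regularized regression doubles as isoform selection. The difficulty the abstract points to is that the candidate list itself can be exponentially large.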
Affiliation(s)
- Elsa Bernard
- Mines ParisTech, Centre for Computational Biology, 77300 Fontainebleau, Institut Curie, 26 rue d'Ulm, 75248 Paris Cedex 05, INSERM U900, Paris F-75248, France, Laboratoire Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, INRA, UMR5558, Villeurbanne, France and LEAR Project-Team, INRIA Grenoble Rhône Alpes, 38330 Montbonnot, France
- Laurent Jacob
- Mines ParisTech, Centre for Computational Biology, 77300 Fontainebleau, Institut Curie, 26 rue d'Ulm, 75248 Paris Cedex 05, INSERM U900, Paris F-75248, France, Laboratoire Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, INRA, UMR5558, Villeurbanne, France and LEAR Project-Team, INRIA Grenoble Rhône Alpes, 38330 Montbonnot, France
- Julien Mairal
- Mines ParisTech, Centre for Computational Biology, 77300 Fontainebleau, Institut Curie, 26 rue d'Ulm, 75248 Paris Cedex 05, INSERM U900, Paris F-75248, France, Laboratoire Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, INRA, UMR5558, Villeurbanne, France and LEAR Project-Team, INRIA Grenoble Rhône Alpes, 38330 Montbonnot, France
- Jean-Philippe Vert
- Mines ParisTech, Centre for Computational Biology, 77300 Fontainebleau, Institut Curie, 26 rue d'Ulm, 75248 Paris Cedex 05, INSERM U900, Paris F-75248, France, Laboratoire Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, INRA, UMR5558, Villeurbanne, France and LEAR Project-Team, INRIA Grenoble Rhône Alpes, 38330 Montbonnot, France
|
36
|
Angelini C, De Canditiis D, De Feis I. Computational approaches for isoform detection and estimation: good and bad news. BMC Bioinformatics 2014; 15:135. [PMID: 24885830 PMCID: PMC4098781 DOI: 10.1186/1471-2105-15-135] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 12/23/2013] [Accepted: 04/24/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The main goal of whole-transcriptome analysis is to correctly identify all expressed transcripts within a specific cell or tissue, at a particular stage and condition, to determine their structures, and to measure their abundances. RNA-seq data promise to allow identification and quantification of the transcriptome at an unprecedented level of resolution and accuracy and at low cost. Several computational methods have been proposed for this purpose. However, it is still not clear which promises have already been met and which challenges remain open and require further methodological development. RESULTS We carried out a simulation study to assess the performance of five widely used tools: CEM, Cufflinks, iReckon, RSEM, and SLIDE. All of them were run with default parameters. In particular, we considered three scenarios: the availability of complete annotation, incomplete annotation, and no annotation at all. Moreover, comparisons were carried out using the methods in three different modes of action. In the first mode, the methods were forced to deal only with isoforms present in the annotation; in the second mode, they were allowed to detect novel isoforms using the annotation as a guide; in the third mode, they operated in a fully data-driven way (although with the support of the alignment to the reference genome). In the third mode, precision and recall are quite poor. By contrast, results are better with the support of the annotation, even when it is incomplete. Finally, abundance estimation error often shows a very skewed distribution. Performance depends strongly on the true abundance of the isoforms. Lowly (and sometimes also moderately) expressed isoforms are poorly detected and estimated. In particular, lowly expressed isoforms are identified mainly if they are provided in the original annotation as potential isoforms. CONCLUSIONS Both detection and quantification of all isoforms from RNA-seq data are still hard problems, and they are affected by many factors. Overall, performance changes significantly with the mode of action and the type of annotation available. Approaches using complete or partial annotation are able to detect most of the expressed isoforms, even though the number of false positives is often high. Fully data-driven approaches require more attention, at least for complex eukaryotic genomes. Improvements are desirable especially for isoform quantification and for detection of isoforms with low abundance.
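The precision and recall figures discussed in this abstract can be computed from two sets of isoform identifiers: those a tool reports and those truly expressed in the simulation. The sketch below is a minimal illustration with hypothetical identifiers; it assumes a predicted isoform counts as correct only when its identifier matches a true one exactly, which may be stricter or looser than the matching rule used in the study.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], truth: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of predicted isoforms against the true set."""
    true_positives = len(predicted & truth)
    # Guard against empty sets so the ratios stay defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical simulation: four expressed isoforms, three reported by a tool.
truth = {"iso1", "iso2", "iso3", "iso4"}
predicted = {"iso1", "iso2", "iso5"}
precision, recall = precision_recall(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case one reported isoform is a false positive and two expressed isoforms are missed, so precision and recall fall for different reasons. This mirrors the abstract's observation that annotation-guided modes can recover most isoforms while still producing many false positives.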
|
37
|
Schulz MH. Letting the data speak for themselves: a fully Bayesian approach to transcriptome assembly. Genome Biol 2014; 15:498. [PMID: 25830215 PMCID: PMC4318165 DOI: 10.1186/s13059-014-0498-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/29/2022] Open
Abstract
A novel method for transcriptome assembly, Bayesembler, provides greater accuracy without sacrifice of computational speed, and particular advantages for alternative transcripts expressed at low levels.
Affiliation(s)
- Marcel H Schulz
- Excellence Cluster for Multimodal Computing and Interaction, Saarland 66123, Germany.
|
38
|
Drechsel G, Kahles A, Kesarwani AK, Stauffer E, Behr J, Drewe P, Rätsch G, Wachter A. Nonsense-mediated decay of alternative precursor mRNA splicing variants is a major determinant of the Arabidopsis steady state transcriptome. The Plant Cell 2013; 25:3726-42. [PMID: 24163313 PMCID: PMC3877825 DOI: 10.1105/tpc.113.115485] [Citation(s) in RCA: 153] [Impact Index Per Article: 13.9] [Received: 06/27/2013] [Revised: 09/17/2013] [Accepted: 10/07/2013] [Indexed: 05/18/2023]
Abstract
The nonsense-mediated decay (NMD) surveillance pathway can recognize erroneous transcripts and physiological mRNAs, such as precursor mRNA alternative splicing (AS) variants. Currently, information on the global extent of coupled AS and NMD remains scarce and even absent for any plant species. To address this, we conducted transcriptome-wide splicing studies using Arabidopsis thaliana mutants in the NMD factor homologs UP FRAMESHIFT1 (UPF1) and UPF3 as well as wild-type samples treated with the translation inhibitor cycloheximide. Our analyses revealed that at least 17.4% of all multi-exon, protein-coding genes produce splicing variants that are targeted by NMD. Moreover, we provide evidence that UPF1 and UPF3 act in a translation-independent mRNA decay pathway. Importantly, 92.3% of the NMD-responsive mRNAs exhibit classical NMD-eliciting features, supporting their authenticity as direct targets. Genes generating NMD-sensitive AS variants function in diverse biological processes, including signaling and protein modification, for which NaCl stress-modulated AS-NMD was found. Besides mRNAs, numerous noncoding RNAs and transcripts derived from intergenic regions were shown to be NMD responsive. In summary, we provide evidence for a major function of AS-coupled NMD in shaping the Arabidopsis transcriptome, having fundamental implications in gene regulation and quality control of transcript processing.
Affiliation(s)
- Gabriele Drechsel
- Center for Plant Molecular Biology, University of Tübingen, 72076 Tuebingen, Germany
- André Kahles
- Computational Biology Center, Sloan-Kettering Institute, New York, New York 10065
- Anil K. Kesarwani
- Center for Plant Molecular Biology, University of Tübingen, 72076 Tuebingen, Germany
- Eva Stauffer
- Center for Plant Molecular Biology, University of Tübingen, 72076 Tuebingen, Germany
- Jonas Behr
- Computational Biology Center, Sloan-Kettering Institute, New York, New York 10065
- Philipp Drewe
- Computational Biology Center, Sloan-Kettering Institute, New York, New York 10065
- Gunnar Rätsch
- Computational Biology Center, Sloan-Kettering Institute, New York, New York 10065
- Andreas Wachter
- Center for Plant Molecular Biology, University of Tübingen, 72076 Tuebingen, Germany
|