1
|
Orabi B, Xie N, McConeghy B, Dong X, Chauve C, Hach F. Freddie: annotation-independent detection and discovery of transcriptomic alternative splicing isoforms using long-read sequencing. Nucleic Acids Res 2022; 51:e11. [PMID: 36478271 PMCID: PMC9881145 DOI: 10.1093/nar/gkac1112] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2022] [Revised: 10/26/2022] [Accepted: 11/08/2022] [Indexed: 12/13/2022] Open
Abstract
Alternative splicing (AS) is an important mechanism in the development of many cancers, as novel or aberrant AS patterns play an important role as an independent onco-driver. In addition, cancer-specific AS is potentially an effective target of personalized cancer therapeutics. However, detecting AS events remains a challenging task, especially if these AS events are novel. This is exacerbated by the fact that existing transcriptome annotation databases are far from being comprehensive, especially with regard to cancer-specific AS. Additionally, traditional sequencing technologies are severely limited by the short length of the generated reads, which rarely spans more than a single splice junction site. Given these challenges, transcriptomic long-read (LR) sequencing presents a promising potential for the detection and discovery of AS. We present Freddie, a computational annotation-independent isoform discovery and detection tool. Freddie takes as input transcriptomic LR sequencing of a sample alongside its genomic split alignment and computes a set of isoforms for the given sample. It then partitions the input reads into sets that can be processed independently and in parallel. For each partition, Freddie segments the genomic alignment of the reads into canonical exon segments. The goal of this segmentation is to be able to represent any potential isoform as a subset of these canonical exons. This segmentation is formulated as an optimization problem and is solved with a dynamic programming algorithm. Then, Freddie reconstructs the isoforms by jointly clustering and error-correcting the reads using the canonical segmentation as a succinct representation. The clustering and error-correcting step is formulated as an optimization problem-the Minimum Error Clustering into Isoforms (MErCi) problem-and is solved using integer linear programming (ILP). We compare the performance of Freddie on simulated datasets with other isoform detection tools with varying dependence on annotation databases. We show that Freddie outperforms the other tools in its accuracy, including those given the complete ground truth annotation. We also run Freddie on a transcriptomic LR dataset generated in-house from a prostate cancer cell line with a matched short-read RNA-seq dataset. Freddie results in isoforms with a higher short-read cross-validation rate than the other tested tools. Freddie is open source and available at https://github.com/vpc-ccg/freddie/.
Collapse
Affiliation(s)
- Baraa Orabi
- Department of Computer Science, the University of British Columbia, Vancouver, BC, Canada
| | - Ning Xie
- Vancouver Prostate Centre, Vancouver, BC, Canada
| | | | - Xuesen Dong
- Vancouver Prostate Centre, Vancouver, BC, Canada,Department of Urologic Sciences, the University of British Columbia, Vancouver, BC, Canada
| | - Cedric Chauve
- Department of Mathematics, Simon Fraser University, Burnaby, BC, Canada
| | - Faraz Hach
- To whom correspondence should be addressed.
| |
Collapse
|
2
|
Caceres M, Mumey B, Husic E, Rizzi R, Cairo M, Sahlin K, Tomescu AI. Safety in Multi-Assembly via Paths Appearing in All Path Covers of a DAG. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3673-3684. [PMID: 34847041 DOI: 10.1109/tcbb.2021.3131203] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
A multi-assembly problem asks to reconstruct multiple genomic sequences from mixed reads sequenced from all of them. Standard formulations of such problems model a solution as a path cover in a directed acyclic graph, namely a set of paths that together cover all vertices of the graph. Since multi-assembly problems admit multiple solutions in practice, we consider an approach commonly used in standard genome assembly: output only partial solutions (contigs, or safe paths), that appear in all path cover solutions. We study constrained path covers, a restriction on the path cover solution that incorporate practical constraints arising in multi-assembly problems. We give efficient algorithms finding all maximal safe paths for constrained path covers. We compute the safe paths of splicing graphs constructed from transcript annotations of different species. Our algorithms run in less than 15 seconds per species and report RNA contigs that are over 99% precise and are up to 8 times longer than unitigs. Moreover, RNA contigs cover over 70% of the transcripts and their coding sequences in most cases. With their increased length to unitigs, high precision, and fast construction time, maximal safe paths can provide a better base set of sequences for transcript assembly programs.
Collapse
|
3
|
Zhu K, Schäffer AA, Robinson W, Xu J, Ruppin E, Ergun AF, Ye Y, Sahinalp SC. Strain level microbial detection and quantification with applications to single cell metagenomics. Nat Commun 2022; 13:6430. [PMID: 36307411 PMCID: PMC9616933 DOI: 10.1038/s41467-022-33869-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2021] [Accepted: 10/04/2022] [Indexed: 12/25/2022] Open
Abstract
Computational identification and quantification of distinct microbes from high throughput sequencing data is crucial for our understanding of human health. Existing methods either use accurate but computationally expensive alignment-based approaches or less accurate but computationally fast alignment-free approaches, which often fail to correctly assign reads to genomes. Here we introduce CAMMiQ, a combinatorial optimization framework to identify and quantify distinct genomes (specified by a database) in a metagenomic dataset. As a key methodological innovation, CAMMiQ uses substrings of variable length and those that appear in two genomes in the database, as opposed to the commonly used fixed-length, unique substrings. These substrings allow to accurately decouple mixtures of highly similar genomes resulting in higher accuracy than the leading alternatives, without requiring additional computational resources, as demonstrated on commonly used benchmarking datasets. Importantly, we show that CAMMiQ can distinguish closely related bacterial strains in simulated metagenomic and real single-cell metatranscriptomic data.
Collapse
Affiliation(s)
- Kaiyuan Zhu
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Department of Computer Science & Engineering, UC San Diego, La Jolla, CA, USA
- Department of Computer Science, Indiana University, Bloomington, IN, USA
| | - Alejandro A Schäffer
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
| | - Welles Robinson
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Surgery Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
| | - Junyan Xu
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
| | - Eytan Ruppin
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
| | - A Funda Ergun
- Department of Computer Science, Indiana University, Bloomington, IN, USA
| | - Yuzhen Ye
- Department of Computer Science, Indiana University, Bloomington, IN, USA
| | - S Cenk Sahinalp
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA.
- Department of Computer Science, Indiana University, Bloomington, IN, USA.
| |
Collapse
|
4
|
Zhao J, Feng H, Zhu D, Lin Y. MultiTrans: An Algorithm for Path Extraction Through Mixed Integer Linear Programming for Transcriptome Assembly. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:48-56. [PMID: 34033544 DOI: 10.1109/tcbb.2021.3083277] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Recent advances in RNA-seq technology have made identification of expressed genes affordable, and thus boosting repaid development of transcriptomic studies. Transcriptome assembly, reconstructing all expressed transcripts from RNA-seq reads, is an essential step to understand genes, proteins, and cell functions. Transcriptome assembly remains a challenging problem due to complications in splicing variants, expression levels, uneven coverage and sequencing errors. Here, we formulate the transcriptome assembly problem as path extraction on splicing graphs (or assembly graphs), and propose a novel algorithm MultiTrans for path extraction using mixed integer linear programming. MultiTrans is able to take into consideration coverage constraints on vertices and edges, the number of paths and the paired-end information simultaneously. We benchmarked MultiTrans against two state-of-the-art transcriptome assemblers, TransLiG and rnaSPAdes. Experimental results show that MultiTrans generates more accurate transcripts compared to TransLiG (using the same splicing graphs) and rnaSPAdes (using the same assembly graphs). MultiTrans is freely available at https://github.com/jzbio/MultiTrans.
Collapse
|
5
|
Jones DC, Ruzzo WL. Polee: RNA-Seq analysis using approximate likelihood. NAR Genom Bioinform 2021; 3:lqab046. [PMID: 34056596 PMCID: PMC8152449 DOI: 10.1093/nargab/lqab046] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2021] [Revised: 04/11/2021] [Accepted: 05/11/2021] [Indexed: 12/20/2022] Open
Abstract
The analysis of mRNA transcript abundance with RNA-Seq is a central tool in molecular biology research, but often analyses fail to account for the uncertainty in these estimates, which can be significant, especially when trying to disentangle isoforms or duplicated genes. Preserving uncertainty necessitates a full probabilistic model of the all the sequencing reads, which quickly becomes intractable, as experiments can consist of billions of reads. To overcome these limitations, we propose a new method of approximating the likelihood function of a sparse mixture model, using a technique we call the Pólya tree transformation. We demonstrate that substituting this approximation for the real thing achieves most of the benefits with a fraction of the computational costs, leading to more accurate detection of differential transcript expression and transcript coexpression.
Collapse
Affiliation(s)
- Daniel C Jones
- Paul G. Allen School of Computer Science & Engineering, University of Washington, Box 352350, Seattle, WA 98195-2350, USA
| | - Walter L Ruzzo
- Paul G. Allen School of Computer Science & Engineering, University of Washington, Box 352350, Seattle, WA 98195-2350, USA
- Department of Genome Sciences, University of Washington, Box 355065, Seattle, WA 98195-5065, USA
- Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., P.O. Box 19024, Seattle, WA 98109, USA
| |
Collapse
|
6
|
Luo Y, Liao X, Wu FX, Wang J. Computational Approaches for Transcriptome Assembly Based on Sequencing Technologies. Curr Bioinform 2020. [DOI: 10.2174/1574893614666190410155603] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Transcriptome assembly plays a critical role in studying biological properties and
examining the expression levels of genomes in specific cells. It is also the basis of many
downstream analyses. With the increase of speed and the decrease in cost, massive sequencing
data continues to accumulate. A large number of assembly strategies based on different
computational methods and experiments have been developed. How to efficiently perform
transcriptome assembly with high sensitivity and accuracy becomes a key issue. In this work, the
issues with transcriptome assembly are explored based on different sequencing technologies.
Specifically, transcriptome assemblies with next-generation sequencing reads are divided into
reference-based assemblies and de novo assemblies. The examples of different species are used to
illustrate that long reads produced by the third-generation sequencing technologies can cover fulllength
transcripts without assemblies. In addition, different transcriptome assemblies using the
Hybrid-seq methods and other tools are also summarized. Finally, we discuss the future directions
of transcriptome assemblies.
Collapse
Affiliation(s)
- Yuwen Luo
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Xingyu Liao
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Fang-Xiang Wu
- Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan, Canada
| | - Jianxin Wang
- School of Computer Science and Engineering, Central South University, Changsha, China
| |
Collapse
|
7
|
Li WV, Li S, Tong X, Deng L, Shi H, Li JJ. AIDE: annotation-assisted isoform discovery with high precision. Genome Res 2019; 29:2056-2072. [PMID: 31694868 PMCID: PMC6886511 DOI: 10.1101/gr.251108.119] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2019] [Accepted: 09/27/2019] [Indexed: 02/06/2023]
Abstract
Genome-wide accurate identification and quantification of full-length mRNA isoforms is crucial for investigating transcriptional and posttranscriptional regulatory mechanisms of biological phenomena. Despite continuing efforts in developing effective computational tools to identify or assemble full-length mRNA isoforms from second-generation RNA-seq data, it remains a challenge to accurately identify mRNA isoforms from short sequence reads owing to the substantial information loss in RNA-seq experiments. Here, we introduce a novel statistical method, annotation-assisted isoform discovery (AIDE), the first approach that directly controls false isoform discoveries by implementing the testing-based model selection principle. Solving the isoform discovery problem in a stepwise and conservative manner, AIDE prioritizes the annotated isoforms and precisely identifies novel isoforms whose addition significantly improves the explanation of observed RNA-seq reads. We evaluate the performance of AIDE based on multiple simulated and real RNA-seq data sets followed by PCR-Sanger sequencing validation. Our results show that AIDE effectively leverages the annotation information to compensate the information loss owing to short read lengths. AIDE achieves the highest precision in isoform discovery and the lowest error rates in isoform abundance estimation, compared with three state-of-the-art methods Cufflinks, SLIDE, and StringTie. As a robust bioinformatics tool for transcriptome analysis, AIDE enables researchers to discover novel transcripts with high confidence.
Collapse
Affiliation(s)
- Wei Vivian Li
- Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Rutgers, The State University of New Jersey, Piscataway, New Jersey 08854, USA.,Department of Statistics, University of California, Los Angeles, California 90095, USA
| | - Shan Li
- Laboratory of Tumor Targeted and Immune Therapy, Clinical Research Center for Breast, State Key Laboratory of Biotherapy, West China Hospital, Sichuan University and Collaborative Innovation Center, Chengdu 610041, China
| | - Xin Tong
- Department of Data Sciences and Operations, Marshall School of Business, University of Southern California, Los Angeles, California 90089, USA
| | - Ling Deng
- Laboratory of Molecular Diagnosis of Cancer, Clinical Research Center for Breast, West China Hospital, Sichuan University, Chengdu 610041, China
| | - Hubing Shi
- Laboratory of Tumor Targeted and Immune Therapy, Clinical Research Center for Breast, State Key Laboratory of Biotherapy, West China Hospital, Sichuan University and Collaborative Innovation Center, Chengdu 610041, China
| | - Jingyi Jessica Li
- Department of Statistics, University of California, Los Angeles, California 90095, USA.,Department of Human Genetics, University of California, Los Angeles, California 90095, USA
| |
Collapse
|
8
|
Song L, Sabunciyan S, Yang G, Florea L. A multi-sample approach increases the accuracy of transcript assembly. Nat Commun 2019; 10:5000. [PMID: 31676772 PMCID: PMC6825223 DOI: 10.1038/s41467-019-12990-0] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2019] [Accepted: 10/11/2019] [Indexed: 01/21/2023] Open
Abstract
Transcript assembly from RNA-seq reads is a critical step in gene expression and subsequent functional analyses. Here we present PsiCLASS, an accurate and efficient transcript assembler based on an approach that simultaneously analyzes multiple RNA-seq samples. PsiCLASS combines mixture statistical models for exonic feature selection across multiple samples with splice graph based dynamic programming algorithms and a weighted voting scheme for transcript selection. PsiCLASS achieves significantly better sensitivity-precision tradeoff, and renders precision up to 2-3 fold higher than the StringTie system and Scallop plus TACO, the two best current approaches. PsiCLASS is efficient and scalable, assembling 667 GEUVADIS samples in 9 h, and has robust accuracy with large numbers of samples. Transcript assembly is an important step in analysis of RNA-seq data whose accuracy influences downstream quantification, detection and characterization of alternative splice variants. Here, the authors develop PsiCLASS, a transcript assembler leveraging simultaneous analysis of multiple RNA-seq samples.
Collapse
Affiliation(s)
- Li Song
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA.,Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.,Department of Data Sciences, Dana Farber Cancer Institute, Boston, MA, USA
| | - Sarven Sabunciyan
- Department of Pediatrics, Johns Hopkins School of Medicine, Baltimore, MD, USA
| | - Guangyu Yang
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA.,Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Liliana Florea
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA. .,Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA. .,Department of Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA.
| |
Collapse
|
9
|
Bonizzoni P, Ciccolella S, Vedova GD, Soto M. Does Relaxing the Infinite Sites Assumption Give Better Tumor Phylogenies? An ILP-Based Comparative Approach. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1410-1423. [PMID: 31603766 DOI: 10.1109/tcbb.2018.2865729] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Most of the evolutionary history reconstruction approaches are based on the infinite sites assumption, which states that mutations appear once in the evolutionary history. The Perfect Phylogeny model is the result of the infinite sites assumption and has been widely used to infer cancer evolution. Nonetheless, recent results show that recurrent and back mutations are present in the evolutionary history of tumors, hence the Perfect Phylogeny model might be too restrictive. We propose an approach that allows losing previously acquired mutations and multiple acquisitions of a character. Moreover, we provide an ILP formulation for the evolutionary tree reconstruction problem. Our formulation allows us to tackle both the Incomplete Directed Phylogeny problem and the Clonal Reconstruction problem when general evolutionary models are considered. The latter problem is fundamental in cancer genomics, the goal is to study the evolutionary history of a tumor considering as input data the fraction of cells having a certain mutation in a set of cancer samples. For the Clonal Reconstruction problem, an experimental analysis shows the advantage of allowing mutation losses. Namely, by analyzing real and simulated datasets, our ILP approach provides a better interpretation of the evolutionary history than a Perfect Phylogeny. The software is at https://github.com/AlgoLab/gppf.
Collapse
|
10
|
Shao M, Kingsford C. Theory and A Heuristic for the Minimum Path Flow Decomposition Problem. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:658-670. [PMID: 29990201 DOI: 10.1109/tcbb.2017.2779509] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Motivated by multiple genome assembly problems and other applications, we study the following minimum path flow decomposition problem: Given a directed acyclic graph $G=(V,E)$G=(V,E) with source $s$s and sink $t$t and a flow $f$f, compute a set of $s$s-$t$t paths $P$P and assign weight $w(p)$w(p) for $p\in P$p∈P such that $f(e) = \sum _{p\in P: e\in p} w(p)$f(e)=∑p∈P:e∈pw(p), $\forall e\in E$∀e∈E, and $|P|$|P| is minimized. We develop some fundamental theory for this problem, upon which we design an efficient heuristic. Specifically, we prove that the gap between the optimal number of paths and a known upper bound is determined by the nontrivial equations within the flow values. This result gives rise to the framework of our heuristic: to iteratively reduce the gap through identifying such equations. We also define an operation on certain independent substructures of the graph, and prove that this operation does not affect the optimality but can transform the graph into one with desired property that facilitates reducing the gap. We apply and test our algorithm on both simulated random instances and perfect splice graph instances, and also compare it with the existing state-of-art algorithm for flow decomposition. The results illustrate that our algorithm can achieve very high accuracy on these instances, and also that our algorithm significantly improves on the previous algorithms. An implementation of our algorithm is freely available at https://github.com/Kingsford-Group/catfish.
Collapse
|
11
|
Li WV, Li JJ. Modeling and analysis of RNA-seq data: a review from a statistical perspective. QUANTITATIVE BIOLOGY 2018; 6:195-209. [PMID: 31456901 PMCID: PMC6711375 DOI: 10.1007/s40484-018-0144-7] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2017] [Revised: 02/23/2018] [Accepted: 03/29/2018] [Indexed: 12/21/2022]
Abstract
BACKGROUND Since the invention of next-generation RNA sequencing (RNA-seq) technologies, they have become a powerful tool to study the presence and quantity of RNA molecules in biological samples and have revolutionized transcriptomic studies. The analysis of RNA-seq data at four different levels (samples, genes, transcripts, and exons) involve multiple statistical and computational questions, some of which remain challenging up to date. RESULTS We review RNA-seq analysis tools at the sample, gene, transcript, and exon levels from a statistical perspective. We also highlight the biological and statistical questions of most practical considerations. CONCLUSIONS The development of statistical and computational methods for analyzing RNA-seq data has made significant advances in the past decade. However, methods developed to answer the same biological question often rely on diverse statistical models and exhibit different performance under different scenarios. This review discusses and compares multiple commonly used statistical models regarding their assumptions, in the hope of helping users select appropriate methods as needed, as well as assisting developers for future method development.
Collapse
Affiliation(s)
- Wei Vivian Li
- Department of Statistics, University of California, Los Angeles, Los Angeles, CA 90095-1554, USA
| | - Jingyi Jessica Li
- Department of Statistics, University of California, Los Angeles, Los Angeles, CA 90095-1554, USA
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095-088, USA
| |
Collapse
|
12
|
Li WV, Zhao A, Zhang S, Li JJ. MSIQ: JOINT MODELING OF MULTIPLE RNA-SEQ SAMPLES FOR ACCURATE ISOFORM QUANTIFICATION. Ann Appl Stat 2018; 12:510-539. [PMID: 29731954 PMCID: PMC5935499 DOI: 10.1214/17-aoas1100] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/21/2023]
Abstract
Next-generation RNA sequencing (RNA-seq) technology has been widely used to assess full-length RNA isoform abundance in a high-throughput manner. RNA-seq data offer insight into gene expression levels and transcriptome structures, enabling us to better understand the regulation of gene expression and fundamental biological processes. Accurate isoform quantification from RNA-seq data is challenging due to the information loss in sequencing experiments. A recent accumulation of multiple RNA-seq data sets from the same tissue or cell type provides new opportunities to improve the accuracy of isoform quantification. However, existing statistical or computational methods for multiple RNA-seq samples either pool the samples into one sample or assign equal weights to the samples when estimating isoform abundance. These methods ignore the possible heterogeneity in the quality of different samples and could result in biased and unrobust estimates. In this article, we develop a method, which we call "joint modeling of multiple RNA-seq samples for accurate isoform quantification" (MSIQ), for more accurate and robust isoform quantification by integrating multiple RNA-seq samples under a Bayesian framework. Our method aims to (1) identify a consistent group of samples with homogeneous quality and (2) improve isoform quantification accuracy by jointly modeling multiple RNA-seq samples by allowing for higher weights on the consistent group. We show that MSIQ provides a consistent estimator of isoform abundance, and we demonstrate the accuracy and effectiveness of MSIQ compared with alternative methods through simulation studies on D. melanogaster genes. We justify MSIQ's advantages over existing approaches via application studies on real RNA-seq data from human embryonic stem cells, brain tissues, and the HepG2 immortalized cell line. We also perform a comprehensive analysis of how the isoform quantification accuracy would be affected by RNA-seq sample heterogeneity and different experimental protocols.
Collapse
|
13
|
Tomescu AI, Gagie T, Popa A, Rizzi R, Kuosmanen A, Mäkinen V. Explaining a Weighted DAG with Few Paths for Solving Genome-Guided Multi-Assembly. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:1345-1354. [PMID: 26671806 DOI: 10.1109/tcbb.2015.2418753] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
RNA-Seq technology offers new high-throughput ways for transcript identification and quantification based on short reads, and has recently attracted great interest. This is achieved by constructing a weighted DAG whose vertices stand for exons, and whose arcs stand for split alignments of the RNA-Seq reads to the exons. The task consists of finding a number of paths, together with their expression levels, which optimally explain the weights of the graph under various fitting functions, such as least sum of squared residuals. In (Tomescu et al. BMC Bioinformatics, 2013) we studied this genome-guided multi-assembly problem when the number of allowed solution paths was linear in the number of arcs. In this paper, we further refine this problem by asking for a bounded number k of solution paths, which is the setting of most practical interest. We formulate this problem in very broad terms, and show that for many choices of the fitting function it becomes NP-hard. Nevertheless, we identify a natural graph parameter of a DAG G, which we call arc-width and denote ⟨G⟩, and give a dynamic programming algorithm running in time O(W(k)⟨G⟩(k)(⟨G⟩+ k)n) , where n is the number of vertices and W is the maximum weight of G. This implies that the problem is fixed-parameter tractable (FPT) in the parameters W, ⟨G⟩, and k. We also show that the arc-width of DAGs constructed from simulated and real RNA-Seq reads is small in practice. Finally, we study the approximability of this problem, and, in particular, give a fully polynomial-time approximation scheme (FPTAS) for the case when the fitting function penalizes the maximum ratio between the weights of the arcs and their predicted coverage.
Collapse
|
14
|
Rizzi R, Tomescu AI, Mäkinen V. On the complexity of Minimum Path Cover with Subpath Constraints for multi-assembly. BMC Bioinformatics 2014; 15 Suppl 9:S5. [PMID: 25252805 PMCID: PMC4168716 DOI: 10.1186/1471-2105-15-s9-s5] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND Multi-assembly problems have gathered much attention in the last years, as Next-Generation Sequencing technologies have started being applied to mixed settings, such as reads from the transcriptome (RNA-Seq), or from viral quasi-species. One classical model that has resurfaced in many multi-assembly methods (e.g. in Cufflinks, ShoRAH, BRANCH, CLASS) is the Minimum Path Cover (MPC) Problem, which asks for the minimum number of directed paths that cover all the nodes of a directed acyclic graph. The MPC Problem is highly popular because the acyclicity of the graph ensures its polynomial-time solvability. RESULTS In this paper, we consider two generalizations of it dealing with integrating constraints arising from long reads or paired-end reads; these extensions have also been considered by two recent methods, but not fully solved. More specifically, we study the two problems where also a set of subpaths, or pairs of subpaths, of the graph have to be entirely covered by some path in the MPC. We show that in the case of long reads (subpaths), the generalized problem can be solved in polynomial-time by a reduction to the classical MPC Problem. We also consider the weighted case, and show that it can be solved in polynomial-time by a reduction to a min-cost circulation problem. As a side result, we also improve the time complexity of the classical minimum weight MPC Problem. In the case of paired-end reads (pairs of subpaths), the generalized problem becomes NP-hard, but we show that it is fixed-parameter tractable (FPT) in the total number of constraints. This computational dichotomy between long reads and paired-end reads is also a general insight into multi-assembly problems.
Collapse
Affiliation(s)
- Romeo Rizzi
- Department of Computer Science, University of Verona, Italy
| | - Alexandru I Tomescu
- Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
| | - Veli Mäkinen
- Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
| |
Collapse
|
15
|
Safikhani Z, Sadeghi M, Pezeshk H, Eslahchi C. SSP: an interval integer linear programming for de novo transcriptome assembly and isoform discovery of RNA-seq reads. Genomics 2013; 102:507-14. [PMID: 24161398 DOI: 10.1016/j.ygeno.2013.10.003] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2013] [Revised: 09/28/2013] [Accepted: 10/16/2013] [Indexed: 11/19/2022]
Abstract
Recent advances in the sequencing technologies have provided a handful of RNA-seq datasets for transcriptome analysis. However, reconstruction of full-length isoforms and estimation of the expression level of transcripts with a low cost are challenging tasks. We propose a novel de novo method named SSP that incorporates interval integer linear programming to resolve alternatively spliced isoforms and reconstruct the whole transcriptome from short reads. Experimental results show that SSP is fast and precise in determining different alternatively spliced isoforms along with the estimation of reconstructed transcript abundances. The SSP software package is available at http://www.bioinf.cs.ipm.ir/software/ssp.
Collapse
Affiliation(s)
- Zhaleh Safikhani
- Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
| | - Mehdi Sadeghi
- National Institute of Genetic Engineering and Biotechnology (NIGEB), Tehran, Iran.
| | - Hamid Pezeshk
- School of Mathematics, Statistics and Computer Sciences, Center of Excellence in Biomathematics, College of Science, University of Tehran, Tehran, Iran
| | - Changiz Eslahchi
- Department of Computer Science, Shahid Beheshti University, GC., Tehran, Iran; School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
| |
Collapse
|
16
|
Dao P, Numanagić I, Lin YY, Hach F, Karakoc E, Donmez N, Collins C, Eichler EE, Sahinalp SC. ORMAN: optimal resolution of ambiguous RNA-Seq multimappings in the presence of novel isoforms. ACTA ACUST UNITED AC 2013; 30:644-51. [PMID: 24130305 DOI: 10.1093/bioinformatics/btt591] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
MOTIVATION RNA-Seq technology is promising to uncover many novel alternative splicing events, gene fusions and other variations in RNA transcripts. For an accurate detection and quantification of transcripts, it is important to resolve the mapping ambiguity for those RNA-Seq reads that can be mapped to multiple loci: >17% of the reads from mouse RNA-Seq data and 50% of the reads from some plant RNA-Seq data have multiple mapping loci. In this study, we show how to resolve the mapping ambiguity in the presence of novel transcriptomic events such as exon skipping and novel indels towards accurate downstream analysis. We introduce ORMAN ( O ptimal R esolution of M ultimapping A mbiguity of R N A-Seq Reads), which aims to compute the minimum number of potential transcript products for each gene and to assign each multimapping read to one of these transcripts based on the estimated distribution of the region covering the read. ORMAN achieves this objective through a combinatorial optimization formulation, which is solved through well-known approximation algorithms, integer linear programs and heuristics. RESULTS On a simulated RNA-Seq dataset including a random subset of transcripts from the UCSC database, the performance of several state-of-the-art methods for identifying and quantifying novel transcripts, such as Cufflinks, IsoLasso and CLIIQ, is significantly improved through the use of ORMAN. Furthermore, in an experiment using real RNA-Seq reads, we show that ORMAN is able to resolve multimapping to produce coverage values that are similar to the original distribution, even in genes with highly non-uniform coverage. AVAILABILITY ORMAN is available at http://orman.sf.net
Collapse
Affiliation(s)
- Phuong Dao
- School of Computing Science, Simon Fraser University, Burnaby, BC, Canada, Department of Genome Sciences, University of Washington, Seattle, WA, USA, Vancouver Prostate Centre & Department of Urologic Sciences, University of British Columbia, Vancouver, BC, Canada and Division of Computer Science, School of Informatics and Computing, Indiana University, Bloomington, IN, USA
| | | | | | | | | | | | | | | | | |
Collapse
|
17
|
Behr J, Kahles A, Zhong Y, Sreedharan VT, Drewe P, Rätsch G. MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples. Bioinformatics 2013; 29:2529-38. [PMID: 23980025 PMCID: PMC3789545 DOI: 10.1093/bioinformatics/btt442] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2012] [Revised: 07/19/2013] [Accepted: 07/29/2013] [Indexed: 02/07/2023] Open
Abstract
MOTIVATION High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in the detection of expressed genes and reconstruction of RNA transcripts. However, the extensive dynamic range of gene expression, technical limitations and biases, as well as the observed complexity of the transcriptional landscape, pose profound computational challenges for transcriptome reconstruction. RESULTS We present the novel framework MITIE (Mixed Integer Transcript IdEntification) for simultaneous transcript reconstruction and quantification. We define a likelihood function based on the negative binomial distribution, use a regularization approach to select a few transcripts collectively explaining the observed read data and show how to find the optimal solution using Mixed Integer Programming. MITIE can (i) take advantage of known transcripts, (ii) reconstruct and quantify transcripts simultaneously in multiple samples, and (iii) resolve the location of multi-mapping reads. It is designed for genome- and assembly-based transcriptome reconstruction. We present an extensive study based on realistic simulated RNA-Seq data. When compared with state-of-the-art approaches, MITIE proves to be significantly more sensitive and overall more accurate. Moreover, MITIE yields substantial performance gains when used with multiple samples. We applied our system to 38 Drosophila melanogaster modENCODE RNA-Seq libraries and estimated the sensitivity of reconstructing omitted transcript annotations and the specificity with respect to annotated transcripts. Our results corroborate that a well-motivated objective paired with appropriate optimization techniques lead to significant improvements over the state-of-the-art in transcriptome reconstruction. AVAILABILITY MITIE is implemented in C++ and is available from http://bioweb.me/mitie under the GPL license.
Collapse
Affiliation(s)
- Jonas Behr
- Computational Biology Center, Sloan-Kettering Institute, 1275 York Avenue, New York, NY 10065, USA and Friedrich Miescher Laboratory, Max Planck Society, Spemannstr. 39, 72076 Tübingen, Germany
| | | | | | | | | | | |
Collapse
|