101
Nicolae M, Mangul S, Măndoiu II, Zelikovsky A. Estimation of alternative splicing isoform frequencies from RNA-Seq data. Algorithms Mol Biol 2011; 6:9. [PMID: 21504602 PMCID: PMC3107792 DOI: 10.1186/1748-7188-6-9] [Citation(s) in RCA: 131] [Impact Index Per Article: 10.1] [Received: 10/18/2010] [Accepted: 04/19/2011] [Indexed: 11/21/2022] Open
Abstract
BACKGROUND: Massively parallel whole transcriptome sequencing, commonly referred to as RNA-Seq, is quickly becoming the technology of choice for gene expression profiling. However, due to the short read length delivered by current sequencing technologies, estimation of expression levels for alternative splicing gene isoforms remains challenging. RESULTS: In this paper we present a novel expectation-maximization algorithm for inference of isoform- and gene-specific expression levels from RNA-Seq data. Our algorithm, referred to as IsoEM, is based on disambiguating information provided by the distribution of insert sizes generated during sequencing library preparation, and takes advantage of base quality scores, strand and read pairing information when available. The open source Java implementation of IsoEM is freely available at http://dna.engr.uconn.edu/software/IsoEM/. CONCLUSIONS: Empirical experiments on both synthetic and real RNA-Seq datasets show that IsoEM has scalable running time and outperforms existing methods of isoform and gene expression level estimation. Simulation experiments confirm previous findings that, for a fixed sequencing cost, using reads longer than 25-36 bases does not necessarily lead to better accuracy for estimating expression levels of annotated isoforms and genes.
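The general idea of expectation-maximization for reads that map to several isoforms can be illustrated with a minimal sketch. This is not the IsoEM implementation: the function name `em_isoform_read_fractions` and the per-read weight dictionaries are hypothetical, and the weights stand in for whatever compatibility score a real tool would derive from insert sizes, base qualities, strand, or pairing. The sketch estimates only the fraction of reads attributed to each isoform; converting that to transcript-level frequencies would additionally require length normalization, which is omitted here.

```python
def em_isoform_read_fractions(reads, n_isoforms, n_iter=200, tol=1e-9):
    """Estimate the fraction of reads originating from each isoform.

    reads: list of dicts, one per read, mapping isoform index -> weight.
           The weight is a hypothetical compatibility score (assumption).
    """
    # Start from a uniform guess.
    freq = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each read fractionally among its compatible isoforms,
        # in proportion to current frequency times compatibility weight.
        for read in reads:
            total = sum(freq[j] * w for j, w in read.items())
            if total == 0.0:
                continue  # read is incompatible with every isoform
            for j, w in read.items():
                counts[j] += freq[j] * w / total
        assigned = sum(counts)
        if assigned == 0.0:
            return freq  # nothing to learn from
        # M-step: new frequencies are the normalized fractional counts.
        new_freq = [c / assigned for c in counts]
        delta = max(abs(a - b) for a, b in zip(new_freq, freq))
        freq = new_freq
        if delta < tol:
            break
    return freq


# Three reads unique to isoform 0, one unique to isoform 1,
# and four reads equally compatible with both.
example_reads = [{0: 1.0}] * 3 + [{1: 1.0}] + [{0: 1.0, 1: 1.0}] * 4
estimated = em_isoform_read_fractions(example_reads, n_isoforms=2)
```

In this toy input the ambiguous reads end up divided in the same 3:1 ratio as the uniquely mapped reads, which is the behavior one expects from this kind of iteration.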
Affiliation(s)
- Marius Nicolae
  - Department of Computer Science & Engineering, University of Connecticut, 371 Fairfield Rd., Unit 2155, Storrs, CT 06269-2155, USA
- Serghei Mangul
  - Computer Science Department, Georgia State University, University Plaza, Atlanta, Georgia 30303, USA
- Ion I Măndoiu
  - Department of Computer Science & Engineering, University of Connecticut, 371 Fairfield Rd., Unit 2155, Storrs, CT 06269-2155, USA
- Alex Zelikovsky
  - Computer Science Department, Georgia State University, University Plaza, Atlanta, Georgia 30303, USA
102
Deng N, Puetter A, Zhang K, Johnson K, Zhao Z, Taylor C, Flemington EK, Zhu D. Isoform-level microRNA-155 target prediction using RNA-seq. Nucleic Acids Res 2011; 39:e61. [PMID: 21317189 PMCID: PMC3089486 DOI: 10.1093/nar/gkr042] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.9] [Indexed: 11/14/2022] Open
Abstract
Computational prediction of microRNA targets remains a challenging problem. The existing rule-based, data-driven and expression profiling approaches to target prediction mostly operate at the gene level. The increasing availability of RNA-seq data provides a new perspective for microRNA target prediction at the isoform level. We hypothesize that the splicing isoform is the ultimate effector in microRNA targeting and that the proposed isoform-level approach is capable of predicting non-dominant isoform targets as well as their targeting regions that are otherwise invisible to many existing approaches. To test the hypothesis, we used an iterative expectation maximization (EM) algorithm to quantify transcriptomes at the isoform level. The performance of the EM algorithm in transcriptome quantification was examined in simulation studies using FluxSimulator. We used joint evidence from isoform-level down-regulation and seed enrichment to predict microRNA-155 targets. We validated our computational approach using results from 149 in vitro 3′-UTR assays performed in-house. We also augmented the splicing database using exon–exon junction evidence, and applied the EM algorithm to predict and quantify 1572 cell-line-specific novel isoforms. Combined with seed enrichment analysis, we predicted 51 novel microRNA-155 isoform targets. Our work is among the first computational studies advocating isoform-level microRNA target prediction.
Affiliation(s)
- Nan Deng
  - Department of Computer Science, University of New Orleans, 2000 Lakeshore Drive, New Orleans, LA 70148, USA
103
Turro E, Su SY, Gonçalves Â, Coin LJM, Richardson S, Lewin A. Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads. Genome Biol 2011; 12:R13. [PMID: 21310039 PMCID: PMC3188795 DOI: 10.1186/gb-2011-12-2-r13] [Citation(s) in RCA: 172] [Impact Index Per Article: 13.2] [Received: 09/19/2010] [Revised: 11/17/2010] [Accepted: 02/10/2011] [Indexed: 11/11/2022] Open
Abstract
We present a novel pipeline and methodology for simultaneously estimating isoform expression and allelic imbalance in diploid organisms using RNA-seq data. We achieve this by modeling the expression of haplotype-specific isoforms. If unknown, the two parental isoform sequences can be individually reconstructed. A new statistical method, MMSEQ, deconvolves the mapping of reads to multiple transcripts (isoforms or haplotype-specific isoforms). Our software can take into account non-uniform read generation and works with paired-end reads.
Affiliation(s)
- Ernest Turro
  - Department of Epidemiology and Biostatistics, Imperial College London, Norfolk Place, London, W2 1PG, UK.
104
Comprehensive identification and quantification of microbial transcriptomes by genome-wide unbiased methods. Curr Opin Biotechnol 2011; 22:32-41. [DOI: 10.1016/j.copbio.2010.10.003] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.0] [Received: 08/26/2010] [Revised: 10/06/2010] [Accepted: 10/06/2010] [Indexed: 11/19/2022]
105
Abstract
In the few years since its initial application, massively parallel cDNA sequencing, or RNA-seq, has allowed many advances in the characterization and quantification of transcriptomes. Recently, several developments in RNA-seq methods have provided an even more complete characterization of RNA transcripts. These developments include improvements in transcription start site mapping, strand-specific measurements, gene fusion detection, small RNA characterization and detection of alternative splicing events. Ongoing developments promise further advances in the application of RNA-seq, particularly direct RNA sequencing and approaches that allow RNA quantification from very small amounts of cellular materials.
Affiliation(s)
- Fatih Ozsolak
  - Helicos BioSciences Corporation, One Kendall Square, Cambridge, Massachusetts 02139, USA.
106
Twine NA, Janitz K, Wilkins MR, Janitz M. Whole transcriptome sequencing reveals gene expression and splicing differences in brain regions affected by Alzheimer's disease. PLoS One 2011; 6:e16266. [PMID: 21283692 PMCID: PMC3025006 DOI: 10.1371/journal.pone.0016266] [Citation(s) in RCA: 219] [Impact Index Per Article: 16.8] [Received: 10/29/2010] [Accepted: 12/08/2010] [Indexed: 11/18/2022] Open
Abstract
Recent studies strongly indicate that aberrations in the control of gene expression might contribute to the initiation and progression of Alzheimer's disease (AD). In particular, alternative splicing has been suggested to play a role in spontaneous cases of AD. Previous transcriptome profiling of AD models and patient samples using microarrays delivered conflicting results. This study provides, for the first time, transcriptomic analysis for distinct regions of the AD brain using RNA-Seq next-generation sequencing technology. Illumina RNA-Seq analysis was used to survey transcriptome profiles from total brain, frontal and temporal lobe of healthy and AD post-mortem tissue. We quantified gene expression levels, splicing isoforms and alternative transcript start sites. Gene Ontology term enrichment analysis revealed an overrepresentation of genes associated with a neuron's cytological structure and synapse function in AD brain samples. Analysis of the temporal lobe with the Cufflinks tool revealed that transcriptional isoforms of the apolipoprotein E gene, APOE-001, -002 and -005, are under the control of different promoters in normal and AD brain tissue. We also observed differing expression levels of APOE-001 and -002 splice variants in the AD temporal lobe. Our results indicate that alternative splicing and promoter usage of the APOE gene in AD brain tissue might reflect the progression of neurodegeneration.
Affiliation(s)
- Natalie A. Twine
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales, Australia
  - New South Wales Systems Biology Initiative, University of New South Wales, Sydney, New South Wales, Australia
- Karolina Janitz
  - Ramaciotti Centre for Gene Function Analysis, University of New South Wales, Sydney, New South Wales, Australia
- Marc R. Wilkins
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales, Australia
  - New South Wales Systems Biology Initiative, University of New South Wales, Sydney, New South Wales, Australia
  - Ramaciotti Centre for Gene Function Analysis, University of New South Wales, Sydney, New South Wales, Australia
- Michal Janitz
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales, Australia
107
Sutherland GT, Janitz M, Kril JJ. Understanding the pathogenesis of Alzheimer's disease: will RNA-Seq realize the promise of transcriptomics? J Neurochem 2011; 116:937-46. [PMID: 21175619 DOI: 10.1111/j.1471-4159.2010.07157.x] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Indexed: 12/31/2022]
Abstract
The prevalence of Alzheimer's disease (AD) is increasing rapidly in the western world and is poised to have a significant economic and societal impact. Current treatments do not alter the underlying disease processes, meaning new treatments are required if this imminent epidemic is to be averted. The clinical manifestations of AD are secondary to a substantial loss of cortical neurons. To be effective, neuroprotective strategies will need to be implemented prior to this cell loss. However, this requires the discovery of both pre-clinical markers to identify susceptible patients and the early pathogenic mechanisms to serve as therapeutic targets. Although the biomarkers and pathogenic mechanisms may overlap, it is likely that new approaches are required to identify novel elements of the disease. Transcriptomic analyses, which assume no a priori etiological hypotheses, promise much in elucidating the pathogenesis of complex diseases like AD. Microarrays are the most popular platform for transcriptomic analysis and have been applied across AD models, patient samples and postmortem brain tissue. The results of these studies have been largely discordant, which could, to some extent, reflect the limitations of this probe-hybridization-based methodology. In comparison, whole transcriptome sequencing (RNA-Seq) utilizes a highly efficient, next-generation DNA sequencing method with improved dynamic range and scope of transcript detection. RNA-Seq is not only highly suited to investigations of the genomically complex human brain tissue, but it can also potentially overcome technical issues inherent to case-control comparisons of postmortem brain tissue in neurodegenerative diseases. The volume of data generated by this platform looms as the major logistical hurdle, and a systematic experimental approach will be required to maximise the detection of pathogenically relevant signals. Nevertheless, RNA-Seq looks set to deliver a quantum leap forward in our understanding of AD pathogenesis.
Affiliation(s)
- Greg T Sutherland
  - Discipline of Pathology, Sydney Medical School, University of Sydney, Sydney, NSW, Australia.
110
Lee S, Seo CH, Lim B, Yang JO, Oh J, Kim M, Lee S, Lee B, Kang C, Lee S. Accurate quantification of transcriptome from RNA-Seq data by effective length normalization. Nucleic Acids Res 2010; 39:e9. [PMID: 21059678 PMCID: PMC3025570 DOI: 10.1093/nar/gkq1015] [Citation(s) in RCA: 81] [Impact Index Per Article: 5.8] [Indexed: 11/12/2022] Open
Abstract
We propose a novel, efficient and intuitive approach to estimating mRNA abundances from whole transcriptome shotgun sequencing (RNA-Seq) data. Our method, NEUMA (Normalization by Expected Uniquely Mappable Area), is based on effective length normalization using uniquely mappable areas of gene and mRNA isoform models. Using a known transcriptome sequence model such as RefSeq, NEUMA pre-computes the numbers of all possible gene-wise and isoform-wise informative reads: the former being sequences mapped to all mRNA isoforms of a single gene exclusively and the latter uniquely mapped to a single mRNA isoform. The results are used to estimate the effective length of genes and transcripts, taking experimental distributions of fragment size into consideration. Quantitative RT-PCR based on 27 randomly selected genes in two human cell lines and computer simulation experiments demonstrated superior accuracy of NEUMA over other recently developed methods. NEUMA covers a large proportion of genes and mRNA isoforms and offers a measure of consistency ('consistency coefficient') for each gene between an independently measured gene-wise level and the sum of the isoform levels. NEUMA is applicable to both paired-end and single-end RNA-Seq data. We propose that NEUMA could become a standard method for quantifying gene transcript levels from RNA-Seq data.
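The notion of an effective length based on uniquely mappable positions can be illustrated with a simplified sketch. It is not the NEUMA implementation: the function names `uniquely_mappable_positions` and `normalized_level` are hypothetical, only exact matches of single-end reads of a fixed length are considered, and the fragment size distribution mentioned in the abstract is ignored. The scaling to reads per effective kilobase per million reads is likewise an assumption chosen for illustration.

```python
def uniquely_mappable_positions(transcripts, read_len):
    """For each transcript, count the read start positions whose
    read_len-mer does not occur in any other transcript."""
    owners = {}  # k-mer -> set of transcript names containing it
    for name, seq in transcripts.items():
        for i in range(len(seq) - read_len + 1):
            owners.setdefault(seq[i:i + read_len], set()).add(name)
    effective = {name: 0 for name in transcripts}
    for name, seq in transcripts.items():
        for i in range(len(seq) - read_len + 1):
            if owners[seq[i:i + read_len]] == {name}:
                effective[name] += 1
    return effective


def normalized_level(unique_reads, effective_length, total_reads):
    """Reads per effective kilobase per million reads (illustrative scaling).
    Returns None when the transcript has no uniquely mappable position."""
    if effective_length == 0 or total_reads == 0:
        return None
    return unique_reads * 1e3 * 1e6 / (effective_length * total_reads)


# Two toy isoforms sharing their first seven bases.
toy = {"A": "ACGTACGGTT", "B": "ACGTACGCCA"}
eff = uniquely_mappable_positions(toy, read_len=4)
level_a = normalized_level(unique_reads=30, effective_length=eff["A"],
                           total_reads=1_000_000)
```

Dividing by the count of uniquely mappable positions rather than the full transcript length avoids penalizing isoforms whose sequence is largely shared with other isoforms, which is the motivation the abstract gives for this kind of normalization.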
Affiliation(s)
- Soohyun Lee
  - Korean Bioinformation Center (KOBIC), Korea Research Institute of Bioscience and Biotechnology (KRIBB), Yuseong-gu, Daejeon, Korea
111
Dimon MT, Sorber K, DeRisi JL. HMMSplicer: a tool for efficient and sensitive discovery of known and novel splice junctions in RNA-Seq data. PLoS One 2010; 5:e13875. [PMID: 21079731 PMCID: PMC2975632 DOI: 10.1371/journal.pone.0013875] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.4] [Received: 06/10/2010] [Accepted: 09/16/2010] [Indexed: 02/01/2023] Open
Abstract
Background: High-throughput sequencing of an organism's transcriptome, or RNA-Seq, is a valuable and versatile new strategy for capturing snapshots of gene expression. However, transcriptome sequencing creates a new class of alignment problem: mapping short reads that span exon-exon junctions back to the reference genome, especially in the case where a splice junction is previously unknown. Methodology/Principal Findings: Here we introduce HMMSplicer, an accurate and efficient algorithm for discovering canonical and non-canonical splice junctions in short read datasets. HMMSplicer identifies more splice junctions than currently available algorithms when tested on publicly available A. thaliana, P. falciparum, and H. sapiens datasets without a reduction in specificity. Conclusions/Significance: HMMSplicer was found to perform especially well in compact genomes and on genes with low expression levels, alternative splice isoforms, or non-canonical splice junctions. Because HMMSplicer does not rely on pre-built gene models, the products of inexact splicing are also detected. For H. sapiens, we find 3.6% of 3′ splice sites and 1.4% of 5′ splice sites are inexact, typically differing by 3 bases in either direction. In addition, HMMSplicer provides a score for every predicted junction, allowing the user to set a threshold to tune false positive rates depending on the needs of the experiment. HMMSplicer is implemented in Python. Code and documentation are freely available at http://derisilab.ucsf.edu/software/hmmsplicer.
Affiliation(s)
- Michelle T. Dimon
  - Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, California, United States of America
  - Biological and Medical Informatics Program, University of California San Francisco, San Francisco, California, United States of America
- Katherine Sorber
  - Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, California, United States of America
- Joseph L. DeRisi
  - Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, California, United States of America
  - Howard Hughes Medical Institute, Bethesda, Maryland, United States of America
112
Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, He X, Mieczkowski P, Grimm SA, Perou CM, MacLeod JN, Chiang DY, Prins JF, Liu J. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res 2010; 38:e178. [PMID: 20802226 PMCID: PMC2952873 DOI: 10.1093/nar/gkq622] [Citation(s) in RCA: 755] [Impact Index Per Article: 53.9] [Indexed: 02/06/2023] Open
Abstract
The accurate mapping of reads that span splice junctions is a critical component of all analytic techniques that work with RNA-seq data. We introduce a second generation splice detection algorithm, MapSplice, whose focus is high sensitivity and specificity in the detection of splices as well as CPU and memory efficiency. MapSplice can be applied to both short (<75 bp) and long reads (≥75 bp). MapSplice is not dependent on splice site features or intron length; consequently, it can detect novel canonical as well as non-canonical splices. MapSplice leverages the quality and diversity of read alignments of a given splice to increase accuracy. We demonstrate that MapSplice achieves higher sensitivity and specificity than TopHat and SpliceMap on a set of simulated RNA-seq data. Experimental studies also support the accuracy of the algorithm. Splice junctions derived from eight breast cancer RNA-seq datasets recapitulated the extensiveness of alternative splicing on a global level as well as the differences between molecular subtypes of breast cancer. These combined results indicate that MapSplice is a highly accurate algorithm for the alignment of RNA-seq reads to splice junctions. Software download URL: http://www.netlab.uky.edu/p/bioinfo/MapSplice.
Affiliation(s)
- Kai Wang
  - Department of Computer Science, University of Kentucky, Lexington, KY 40506, USA
113
Abstract
The past few decades are characterized by an explosive evolution of genetics and molecular cell biology. Advances in chemistry and engineering have enabled increased data throughput, permitting the study of complete sets of molecules with increasing speed and accuracy using techniques such as genomics, transcriptomics, proteomics, and metabolomics. Prediction of long-term outcomes in transplantation is hampered by the absence of sufficiently robust biomarkers and a lack of adequate insight into the mechanisms of acute and chronic alloimmune injury and the adaptive mechanisms of immunological quiescence that may support transplantation tolerance. Here, we discuss some of the great opportunities that molecular diagnostic tools have to offer both basic scientists and translational researchers for bench-to-bedside clinical application in transplantation medicine, with special focus on genomics and genome-wide association studies, epigenetics (DNA methylation and histone modifications), gene expression studies and transcriptomics (including microRNA and small interfering RNA studies), proteomics and peptidomics, antibodyomics, metabolomics, chemical genomics and functional imaging with nanoparticles. We address the challenges and opportunities associated with the newer high-throughput sequencing technologies, especially in the field of bioinformatics and biostatistics, and demonstrate the importance of integrative approaches. Although this Review focuses on transplantation research and clinical transplantation, the concepts addressed are valid for all translational research.
114
Uncovering the complexity of transcriptomes with RNA-Seq. J Biomed Biotechnol 2010; 2010:853916. [PMID: 20625424 PMCID: PMC2896904 DOI: 10.1155/2010/853916] [Citation(s) in RCA: 249] [Impact Index Per Article: 17.8] [Received: 02/22/2010] [Accepted: 04/07/2010] [Indexed: 11/19/2022] Open
Abstract
In recent years, the introduction of massively parallel sequencing platforms for Next Generation Sequencing (NGS) protocols, able to simultaneously sequence hundreds of thousands of DNA fragments, has dramatically changed the landscape of genetic studies. RNA-Seq for transcriptome studies, ChIP-Seq for DNA-protein interactions, and CNV-Seq for large genome nucleotide variations are only some of the intriguing new applications supported by these innovative platforms. Among them, RNA-Seq is perhaps the most complex NGS application. Expression levels of specific genes, differential splicing, and allele-specific expression of transcripts can be accurately determined by RNA-Seq experiments to address many biological issues. These attributes are not readily achievable with the previously widespread hybridization-based or tag sequence-based approaches. However, the unprecedented level of sensitivity and the large amount of available data produced by NGS platforms provide clear advantages as well as new challenges and issues. While this technology brings great power to make new biological observations and discoveries, it also requires a considerable effort in the development of new bioinformatics tools to deal with these massive data files. The paper aims to give a survey of the RNA-Seq methodology, particularly focusing on the challenges that this application presents from both a biological and a bioinformatics point of view.
115
Abstract
We provide a novel web service, called rQuant.web, allowing convenient access to tools for quantitative analysis of RNA sequencing data. The underlying quantitation technique rQuant is based on quadratic programming and estimates different biases induced by library preparation, sequencing and read mapping. It can tackle multiple transcripts per gene locus and is therefore particularly well suited to quantify alternative transcripts. rQuant.web is available as a tool in a Galaxy installation at http://galaxy.fml.mpg.de. Using rQuant.web is free of charge, it is open to all users, and there is no login requirement.
Affiliation(s)
- Regina Bohnert
  - Friedrich Miescher Laboratory of the Max Planck Society, Tübingen, Germany.
116
Estimation of Alternative Splicing Isoform Frequencies from RNA-Seq Data. Lecture Notes in Computer Science 2010. [DOI: 10.1007/978-3-642-15294-8_17] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Indexed: 12/13/2022]