1
Sharma H, Pani T, Dasgupta U, Batra J, Sharma RD. Prediction of transcript structure and concentration using RNA-Seq data. Brief Bioinform 2023; 24:6995379. [PMID: 36682028 DOI: 10.1093/bib/bbad022] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 10/03/2022] [Revised: 11/25/2022] [Accepted: 01/06/2023] [Indexed: 01/23/2023] Open
Abstract
Alternative splicing (AS) is a key post-transcriptional modification that helps in increasing protein diversity. Almost 90% of the protein-coding genes in humans are known to undergo AS and code for different transcripts. Some transcripts are associated with diseases such as breast cancer, lung cancer and glioblastoma. Hence, these transcripts can serve as novel therapeutic and prognostic targets for drug discovery. Herein, we have developed a pipeline, Finding Alternative Splicing Events (FASE), as an R package that includes modules to determine the structure and concentration of transcripts using differential AS. To predict the correct structure of expressed transcripts in given conditions, FASE combines the AS events with the information of exons, introns and junctions using graph theory. The estimated concentration of predicted transcripts is reported as the relative expression in terms of log2CPM. Using FASE, we were able to identify several unique transcripts of EMILIN1 and SLK genes in the TCGA-BRCA data, which were validated using RT-PCR. The experimental study demonstrated consistent results, which signify the high accuracy and precision of the developed methods. In conclusion, the developed pipeline, FASE, can efficiently predict novel transcripts that are missed in general transcript-level differential expression analysis. It can be applied selectively from a single gene to simple or complex genomes, even in multiple experimental conditions, for the identification of differential AS-based biomarkers, prognostic targets and novel therapeutics.
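The abstract reports transcript concentration as log2CPM (log2 counts per million). As an illustration of that scale only, here is a minimal Python sketch. It is not FASE's R code, and the abstract does not say how FASE normalizes counts. The function name `log2_cpm`, the pseudocount of 1 and the example counts are all assumptions made for this sketch.

```python
import math

def log2_cpm(counts, pseudocount=1.0):
    """Return log2 counts-per-million for one sample's raw read counts.

    The pseudocount (an assumption of this sketch) keeps zero counts finite.
    """
    library_size = sum(counts)
    if library_size <= 0:
        raise ValueError("library size must be positive")
    # Scale each count to a library of one million reads, then log-transform.
    return [math.log2(c / library_size * 1e6 + pseudocount) for c in counts]

# Hypothetical read counts for three transcripts in a single sample.
sample_counts = [250, 750, 0]
print(log2_cpm(sample_counts))
```

Because every count is divided by the sample's own library size, values are comparable across samples sequenced to different depths, which is what makes the scale "relative".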
Affiliation(s)
- Harsh Sharma
  - Amity Institute of Integrative Sciences and Health, Amity University Haryana, Gurugram 122413, India
- Trishna Pani
  - Amity Institute of Integrative Sciences and Health, Amity University Haryana, Gurugram 122413, India
- Ujjaini Dasgupta
  - Amity Institute of Integrative Sciences and Health, Amity University Haryana, Gurugram 122413, India
- Jyotsna Batra
  - School of Biomedical Sciences, Institute of Health and Biomedical Innovation (IHBI), Translational Research Institute, Queensland University of Technology (QUT), Brisbane, QLD, Australia
- Ravi Datta Sharma
  - Amity Institute of Integrative Sciences and Health, Amity University Haryana, Gurugram 122413, India
2
Matsumoto H, Hayashi T, Ozaki H, Tsuyuzaki K, Umeda M, Iida T, Nakamura M, Okano H, Nikaido I. An NMF-based approach to discover overlooked differentially expressed gene regions from single-cell RNA-seq data. NAR Genom Bioinform 2019; 2:lqz020. [PMID: 34632380 PMCID: PMC8499053 DOI: 10.1093/nargab/lqz020] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 06/24/2019] [Revised: 11/05/2019] [Accepted: 11/29/2019] [Indexed: 12/31/2022] Open
Abstract
Single-cell RNA sequencing has enabled researchers to quantify the transcriptomes of individual cells, infer cell types and investigate differential expression among cell types, which will lead to a better understanding of the regulatory mechanisms of cell states. Transcript diversity caused by phenomena such as aberrant splicing events has been revealed, and differential expression of previously unannotated transcripts might be overlooked by annotation-based analyses. Accordingly, we have developed an approach to discover overlooked differentially expressed (DE) gene regions that complements annotation-based methods. Our algorithm decomposes the mapped count data matrix for a gene region using non-negative matrix factorization, quantifies the differential expression level based on the decomposed matrix, and compares it with the differential expression level based on an annotation-based approach to discover previously unannotated DE transcripts. We performed single-cell RNA sequencing for human neural stem cells and applied our algorithm to the dataset. We also applied our algorithm to two public single-cell RNA sequencing datasets corresponding to mouse ES and primitive endoderm cells, and human preimplantation embryos. As a result, we discovered several intriguing DE transcripts, including a transcript related to the modulation of neural stem/progenitor cell differentiation.
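The decomposition step described in the abstract can be illustrated with a generic non-negative matrix factorization. The sketch below uses the standard Lee-Seung multiplicative updates for the Frobenius objective in plain Python. It is not the authors' implementation, and the abstract does not state which NMF variant or rank they use. The toy matrix (rows as cells, columns as positions in a gene region), the rank of 2 and all function names are assumptions of this sketch.

```python
import random

def nmf(V, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (list of rows) as W x H.

    Uses Lee-Seung multiplicative updates; W is n x rank, H is rank x m.
    """
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    # Strictly positive random start so no entry is stuck at zero.
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]

    def matmul(A, B):
        return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

    def transpose(A):
        return [list(col) for col in zip(*A)]

    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)] for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)] for i in range(n)]
    return W, H

def frobenius_error(V, W, H):
    """Frobenius norm of V - W x H."""
    WH = [[sum(w * h for w, h in zip(row, col)) for col in zip(*H)] for row in W]
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx)) ** 0.5

# Hypothetical counts: two cell groups covering different halves of a region.
V = [[10, 10, 0, 0], [20, 20, 0, 0], [0, 0, 5, 5], [0, 0, 15, 15]]
W, H = nmf(V, rank=2)
print("reconstruction error:", frobenius_error(V, W, H))
```

In this toy setting each row of `H` can be read as a coverage pattern along the region and each row of `W` as how strongly a cell uses each pattern. The paper's differential expression scoring on top of the factorization is not reproduced here.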
Affiliation(s)
- Hirotaka Matsumoto
  - Medical Image Analysis Team, RIKEN Center for Advanced Intelligence Project, Nihonbashi 1-chome Mitsui Building 15F, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
  - Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan
- Tetsutaro Hayashi
  - Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan
- Haruka Ozaki
  - Center for Artificial Intelligence Research, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki 305-8577, Japan
  - Bioinformatics Laboratory, Faculty of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki 305-8577, Japan
- Koki Tsuyuzaki
  - Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan
- Mana Umeda
  - Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan
- Tsuyoshi Iida
  - Department of Orthopaedic Surgery, Keio University School of Medicine, 35 Sinanomachi, Shinjuku-ku, Tokyo 160-8582, Japan
- Masaya Nakamura
  - Department of Orthopaedic Surgery, Keio University School of Medicine, 35 Sinanomachi, Shinjuku-ku, Tokyo 160-8582, Japan
- Hideyuki Okano
  - Department of Physiology, Keio University School of Medicine, 35 Sinanomachi, Shinjuku-ku, Tokyo 160-8582, Japan
- Itoshi Nikaido
  - Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan
  - Bioinformatics Course, Master's/Doctoral Program in Life Science Innovation (T-LSI), School of Integrative and Global Majors (SIGMA), University of Tsukuba, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan
3
Romero JP, Muniategui A, De Miguel FJ, Aramburu A, Montuenga L, Pio R, Rubio A. EventPointer: an effective identification of alternative splicing events using junction arrays. BMC Genomics 2016; 17:467. [PMID: 27315794 PMCID: PMC4912780 DOI: 10.1186/s12864-016-2816-x] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 02/27/2016] [Accepted: 06/07/2016] [Indexed: 12/22/2022] Open
Abstract
Background Alternative splicing (AS) is a major source of variability in the transcriptome of eukaryotes. There is an increasing interest in its role in different pathologies. Before sequencing technology appeared, AS was measured with specific arrays. However, these arrays did not perform well in the detection of AS events and provided very large false discovery rates (FDR). Recently the Human Transcriptome Array 2.0 (HTA 2.0) has been deployed. It includes junction probes. However, the interpretation software provided by its vendor (TAC 3.0) does not fully exploit its potential (does not study jointly the exons and junctions involved in a splicing event) and can only be applied to case–control studies. New statistical algorithms and software must be developed in order to exploit the HTA 2.0 array for event detection. Results We have developed EventPointer, an R package (built under the aroma.affymetrix framework) to search and analyze Alternative Splicing events using HTA 2.0 arrays. This software uses a linear model that broadens its application from plain case–control studies to complex experimental designs. Given the CEL files and the design and contrast matrices, the software retrieves a list of all the detected events indicating: 1) the type of event (exon cassette, alternative 3′, etc.), 2) its fold change and its statistical significance, and 3) the potential protein domains affected by the AS events and the statistical significance of the possible enrichment. Our tests have shown that EventPointer has an extremely low FDR value (only 1 false positive within the tested top-200 events). This software is publicly available and it has been uploaded to GitHub. Conclusions This software empowers the HTA 2.0 arrays for AS event detection as an alternative to RNA-seq: simplifying considerably the required analysis, speeding it up and reducing the required computational power. 
Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-2816-x) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Juan P Romero
  - CEIT, Parque Tecnológico de San Sebastián, Paseo Mikeletegi 48, 20009, San Sebastián, Gipuzkoa, Spain
  - Tecnun, University of Navarra, P° de Manuel Lardizabal 13, 20018, Donostia-San Sebastián, Gipuzkoa, Spain
- Ander Muniategui
  - CEIT, Parque Tecnológico de San Sebastián, Paseo Mikeletegi 48, 20009, San Sebastián, Gipuzkoa, Spain
  - Tecnun, University of Navarra, P° de Manuel Lardizabal 13, 20018, Donostia-San Sebastián, Gipuzkoa, Spain
- Fernando J De Miguel
  - Program in Solid Tumors and Biomarkers, CIMA, University of Navarra, Avda. Pío XII, 55, E-31008, Pamplona, Navarra, Spain
- Ander Aramburu
  - CEIT, Parque Tecnológico de San Sebastián, Paseo Mikeletegi 48, 20009, San Sebastián, Gipuzkoa, Spain
  - Tecnun, University of Navarra, P° de Manuel Lardizabal 13, 20018, Donostia-San Sebastián, Gipuzkoa, Spain
- Luis Montuenga
  - Program in Solid Tumors and Biomarkers, CIMA, University of Navarra, Avda. Pío XII, 55, E-31008, Pamplona, Navarra, Spain
  - Department of Histology and Pathology, University of Navarra, Pamplona, Spain
  - IdiSNA, Navarra Institute for Health Research, Recinto de Complejo Hospitalario de Navarra, C/Irunlarrea 3, 31008, Pamplona, Navarra, Spain
- Ruben Pio
  - Program in Solid Tumors and Biomarkers, CIMA, University of Navarra, Avda. Pío XII, 55, E-31008, Pamplona, Navarra, Spain
  - IdiSNA, Navarra Institute for Health Research, Recinto de Complejo Hospitalario de Navarra, C/Irunlarrea 3, 31008, Pamplona, Navarra, Spain
  - Department of Biochemistry and Genetics, University of Navarra, Pamplona, Spain
- Angel Rubio
  - CEIT, Parque Tecnológico de San Sebastián, Paseo Mikeletegi 48, 20009, San Sebastián, Gipuzkoa, Spain
  - Tecnun, University of Navarra, P° de Manuel Lardizabal 13, 20018, Donostia-San Sebastián, Gipuzkoa, Spain
4
Ye Y, Li JJ. NMFP: a non-negative matrix factorization based preselection method to increase accuracy of identifying mRNA isoforms from RNA-seq data. BMC Genomics 2016; 17 Suppl 1:11. [PMID: 26818007 PMCID: PMC4895266 DOI: 10.1186/s12864-015-2304-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 01/05/2023] Open
Abstract
Background The advent of next-generation RNA sequencing (RNA-seq) has greatly advanced transcriptomic studies, including system-wide identification and quantification of mRNA isoforms under various biological conditions. A number of computational methods have been developed to systematically identify mRNA isoforms in a high-throughput manner from RNA-seq data. However, a common drawback of these methods is that their identified mRNA isoforms contain a high percentage of false positives, especially for genes with complex splicing structures, e.g., many exons and exon junctions. Results We have developed a preselection method called “Non-negative Matrix Factorization Preselection” (NMFP) which is designed to improve the accuracy of computational methods in identifying mRNA isoforms from RNA-seq data. We demonstrated through simulation and real data studies that NMFP can effectively shrink the search space of isoform candidates and increase the accuracy of two mainstream computational methods, Cufflinks and SLIDE, in their identification of mRNA isoforms. Conclusion NMFP is a useful tool to preselect mRNA isoform candidates for downstream isoform discovery methods. It can greatly reduce the number of isoform candidates while maintaining a good coverage of unknown true isoforms. Adding NMFP as an upstream step, computational methods are expected to achieve better accuracy in identifying mRNA isoforms from RNA-seq data. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-2304-8) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Yuting Ye
  - Division of Biostatistics, University of California, Berkeley, 94720, Berkeley, CA, USA.
- Jingyi Jessica Li
  - Department of Statistics, 8125 Math Sciences Bldg., University of California, Los Angeles, Los Angeles, 90095-1554, CA, USA.
  - Department of Human Genetics, 695 Charles E. Young Drive South, University of California, Los Angeles, Los Angeles, 90095-7088, CA, USA.
5
Korir PK, Geeleher P, Seoighe C. Seq-ing improved gene expression estimates from microarrays using machine learning. BMC Bioinformatics 2015; 16:286. [PMID: 26338512 PMCID: PMC4559919 DOI: 10.1186/s12859-015-0712-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 05/05/2015] [Accepted: 08/19/2015] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Quantifying gene expression by RNA-Seq has several advantages over microarrays, including greater dynamic range and gene expression estimates on an absolute, rather than a relative scale. Nevertheless, microarrays remain in widespread use, demonstrated by the ever-growing numbers of samples deposited in public repositories. RESULTS We propose a novel approach to microarray analysis that attains many of the advantages of RNA-Seq. This method, called Machine Learning of Transcript Expression (MaLTE), leverages samples for which both microarray and RNA-Seq data are available, using a Random Forest to learn the relationship between the fluorescence intensity of sets of microarray probes and RNA-Seq transcript expression estimates. We trained MaLTE on data from the Genotype-Tissue Expression (GTEx) project, consisting of Affymetrix gene arrays and RNA-Seq from over 700 samples across a broad range of human tissues. CONCLUSION This approach can be used to accurately estimate absolute expression levels from microarray data, at both gene and transcript level, which has not previously been possible. This methodology will facilitate re-analysis of archived microarray data and broaden the utility of the vast quantities of data still being generated.
Affiliation(s)
- Paul K Korir
  - School of Biochemistry and Cell Biology, University College Cork, Western Road, Cork, Ireland.
- Paul Geeleher
  - Section of Hematology/Oncology, Department of Medicine, University of Chicago, Chicago, IL-60637, USA.
- Cathal Seoighe
  - School of Mathematics, Statistics and Applied Mathematics, University Road, Galway, Ireland.
  - Institute of Infectious Disease and Molecular Medicine, Anzio Road, Cape Town, 7925, South Africa.
6
de Miguel FJ, Sharma RD, Pajares MJ, Montuenga LM, Rubio A, Pio R. Identification of alternative splicing events regulated by the oncogenic factor SRSF1 in lung cancer. Cancer Res 2013; 74:1105-15. [PMID: 24371231 DOI: 10.1158/0008-5472.can-13-1481] [Citation(s) in RCA: 63] [Impact Index Per Article: 5.7] [Indexed: 11/16/2022]
Abstract
Abnormal alternative splicing has been associated with cancer. Genome-wide microarrays can be used to detect differential splicing events. In this study, we have developed ExonPointer, an algorithm that uses data from exon and junction probes to identify annotated cassette exons. We used the algorithm to profile differential splicing events in lung adenocarcinoma A549 cells after downregulation of the oncogenic serine/arginine-rich splicing factor 1 (SRSF1). Data were generated using two different microarray platforms. The PCR-based validation rate of the top 20 ranked genes was 60% and 100%. Functional enrichment analyses found a substantial number of splicing events in genes related to RNA metabolism. These analyses also identified genes associated with cancer and developmental and hereditary disorders, as well as biologic processes such as cell division, apoptosis, and proliferation. Most of the top 20 ranked genes were validated in other adenocarcinoma and squamous cell lung cancer cells, with validation rates of 80% to 95% and 70% to 75%, respectively. Moreover, the analysis allowed us to identify four genes, ATP11C, IQCB1, TUBD1, and proline-rich coiled-coil 2C (PRRC2C), with a significantly different pattern of alternative splicing in primary non-small cell lung tumors compared with normal lung tissue. In the case of PRRC2C, SRSF1 downregulation led to the skipping of an exon overexpressed in primary lung tumors. Specific siRNA downregulation of the exon-containing variant significantly reduced cell growth. In conclusion, using a novel analytical tool, we have identified new splicing events regulated by the oncogenic splicing factor SRSF1 in lung cancer.
Affiliation(s)
- Fernando J de Miguel
  - Authors' Affiliations: Division of Oncology, Center for Applied Medical Research (CIMA); Departments of Histology and Pathology and Biochemistry and Genetics, Schools of Science and Medicine, University of Navarra, Pamplona; and CEIT and TECNUN, University of Navarra, San Sebastian, Spain
7
Abstract
RNA sequencing is a recent technology which has seen an explosion of methods addressing all levels of analysis, from read mapping to transcript assembly to differential expression modeling. In particular the discovery of isoforms at the transcript assembly stage is a complex problem and current approaches suffer from various limitations. For instance, many approaches use graphs to construct a minimal set of isoforms which covers the observed reads, then perform a separate algorithm to quantify the isoforms, which can result in a loss of power. Current methods also use ad-hoc solutions to deal with the vast number of possible isoforms which can be constructed from a given set of reads. Finally, while the need of taking into account features such as read pairing and sampling rate of reads has been acknowledged, most existing methods do not seamlessly integrate these features as part of the model. We present Montebello, an integrated statistical approach which performs simultaneous isoform discovery and quantification by using a Monte Carlo simulation to find the most likely isoform composition leading to a set of observed reads. We compare Montebello to Cufflinks, a popular isoform discovery approach, on a simulated data set and on 46.3 million brain reads from an Illumina tissue panel. On this data set Montebello appears to offer a modest improvement over Cufflinks when considering discovery and parsimony metrics. In addition Montebello mitigates specific difficulties inherent in the Cufflinks approach. Finally, Montebello can be fine-tuned depending on the type of solution desired.
Affiliation(s)
- David Hiller
  - Center for Epigenetics, Johns Hopkins School of Medicine, 855 N. Wolfe St., Rangos 570, Baltimore, MD 21205
- Wing Hung Wong
  - Department of Statistics, Sequoia Hall, 390 Serra Mall, Stanford, CA, 94305
8
Chen P, Lepikhova T, Hu Y, Monni O, Hautaniemi S. Comprehensive exon array data processing method for quantitative analysis of alternative spliced variants. Nucleic Acids Res 2011; 39:e123. [PMID: 21745820 PMCID: PMC3185423 DOI: 10.1093/nar/gkr513] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Indexed: 12/14/2022] Open
Abstract
Alternative splicing of pre-mRNA generates protein diversity. Dysfunction of splicing machinery and expression of specific transcripts has been linked to cancer progression and drug response. Exon microarray technology enables genome-wide quantification of expression levels of the majority of exons and facilitates the discovery of alternative splicing events. Analysis of exon array data is more challenging than the analysis of gene expression data and there is a need for reliable quantification of exons and alternatively spliced variants. We introduce a novel, computationally efficient methodology, Multiple Exon Array Preprocessing (MEAP), for exon array data pre-processing, analysis and visualization. We compared MEAP with existing pre-processing methods, and validation of six exons and two alternatively spliced variants with qPCR corroborated MEAP expression estimates. Analysis of exon array data from head and neck squamous cell carcinoma (HNSCC) cell lines revealed several transcripts associated with 11q13 amplification, which is related with decreased survival and metastasis in HNSCC patients. Our results demonstrate that MEAP produces reliable expression values at exon, alternatively spliced variant and gene levels, which allows generating novel experimentally testable predictions.
Affiliation(s)
- Ping Chen
  - Research Programs Unit, Genome-Scale Biology and Institute of Biomedicine, Biochemistry and Developmental Biology, 00014 University of Helsinki, Finland
9
NSMAP: a method for spliced isoforms identification and quantification from RNA-Seq. BMC Bioinformatics 2011; 12:162. [PMID: 21575225 PMCID: PMC3113944 DOI: 10.1186/1471-2105-12-162] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Received: 10/12/2010] [Accepted: 05/16/2011] [Indexed: 11/26/2022] Open
Abstract
Background The development of techniques for sequencing messenger RNA (RNA-Seq) makes it possible to study biological mechanisms such as alternative splicing and gene expression regulation more deeply and accurately. Most existing methods employ RNA-Seq to quantify the expression levels of already annotated isoforms from the reference genome. However, the current reference genome is very incomplete due to the complexity of the transcriptome, which hinders the comprehensive investigation of the transcriptome using RNA-Seq. A novel approach to isoform inference and estimation purely from RNA-Seq, without annotation information, is desirable. Results A Nonnegativity and Sparsity constrained Maximum APosteriori (NSMAP) model has been proposed to estimate the expression levels of isoforms from RNA-Seq data without the annotation information. In contrast to previous methods, NSMAP performs identification of the structures of expressed isoforms and estimation of the expression levels of those expressed isoforms simultaneously, which enables better identification of isoforms. In the simulations parameterized by two real RNA-Seq data sets, more than 77% of expressed isoforms are correctly identified and quantified. Then, we apply NSMAP to two RNA-Seq data sets of myelodysplastic syndromes (MDS) samples and one normal sample in order to identify differentially expressed known and novel isoforms in MDS. Conclusions NSMAP provides a good strategy to identify and quantify novel isoforms without the knowledge of an annotated reference genome, which can further realize the potential of the RNA-Seq technique in transcriptome analysis. The NSMAP package is freely available at https://sites.google.com/site/nsmapforrnaseq.
10
Nicolae M, Mangul S, Măndoiu II, Zelikovsky A. Estimation of alternative splicing isoform frequencies from RNA-Seq data. Algorithms Mol Biol 2011; 6:9. [PMID: 21504602 PMCID: PMC3107792 DOI: 10.1186/1748-7188-6-9] [Citation(s) in RCA: 131] [Impact Index Per Article: 10.1] [Received: 10/18/2010] [Accepted: 04/19/2011] [Indexed: 11/21/2022] Open
Abstract
BACKGROUND Massively parallel whole transcriptome sequencing, commonly referred to as RNA-Seq, is quickly becoming the technology of choice for gene expression profiling. However, due to the short read length delivered by current sequencing technologies, estimation of expression levels for alternative splicing gene isoforms remains challenging. RESULTS In this paper we present a novel expectation-maximization algorithm for inference of isoform- and gene-specific expression levels from RNA-Seq data. Our algorithm, referred to as IsoEM, is based on disambiguating information provided by the distribution of insert sizes generated during sequencing library preparation, and takes advantage of base quality scores, strand and read pairing information when available. The open source Java implementation of IsoEM is freely available at http://dna.engr.uconn.edu/software/IsoEM/. CONCLUSIONS Empirical experiments on both synthetic and real RNA-Seq datasets show that IsoEM has scalable running time and outperforms existing methods of isoform and gene expression level estimation. Simulation experiments confirm previous findings that, for a fixed sequencing cost, using reads longer than 25-36 bases does not necessarily lead to better accuracy for estimating expression levels of annotated isoforms and genes.
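The general expectation-maximization idea behind this kind of isoform frequency estimation can be shown with a deliberately stripped-down Python sketch. This is not IsoEM: it ignores insert sizes, base quality scores, strand, read pairing and isoform lengths, all of which the abstract says the real algorithm exploits. The function name, the read-compatibility representation and the toy reads are assumptions made for illustration.

```python
def em_isoform_frequencies(read_compat, isoforms, iterations=100):
    """Estimate isoform read fractions when reads may match several isoforms.

    read_compat: one set of compatible isoform names per read.
    Returns a dict mapping isoform name to estimated fraction of reads.
    """
    theta = {iso: 1.0 / len(isoforms) for iso in isoforms}
    for _ in range(iterations):
        expected = {iso: 0.0 for iso in isoforms}
        # E-step: split each read across its compatible isoforms
        # in proportion to the current frequency estimates.
        for compat in read_compat:
            total = sum(theta[iso] for iso in compat)
            if total == 0:
                continue
            for iso in compat:
                expected[iso] += theta[iso] / total
        # M-step: renormalise expected read counts into frequencies.
        assigned = sum(expected.values())
        if assigned == 0:
            break
        theta = {iso: expected[iso] / assigned for iso in isoforms}
    return theta

# Hypothetical reads: two unique to A, one unique to B, one ambiguous.
reads = [{"A"}, {"A"}, {"B"}, {"A", "B"}]
freqs = em_isoform_frequencies(reads, ["A", "B"])
print(freqs)
```

In this toy case the ambiguous read is shared in proportion to the evidence from uniquely assigned reads, so the estimates settle at about two thirds for A and one third for B. A length-aware model would additionally divide by effective isoform length before normalizing.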
Affiliation(s)
- Marius Nicolae
  - Department of Computer Science & Engineering, University of Connecticut, 371 Fairfield Rd., Unit 2155, Storrs, CT 06269-2155, USA
- Serghei Mangul
  - Computer Science Department, Georgia State University, University Plaza, Atlanta, Georgia 30303, USA
- Ion I Măndoiu
  - Department of Computer Science & Engineering, University of Connecticut, 371 Fairfield Rd., Unit 2155, Storrs, CT 06269-2155, USA
- Alex Zelikovsky
  - Computer Science Department, Georgia State University, University Plaza, Atlanta, Georgia 30303, USA
11
Support vector machines-based identification of alternative splicing in Arabidopsis thaliana from whole-genome tiling arrays. BMC Bioinformatics 2011; 12:55. [PMID: 21324185 PMCID: PMC3051901 DOI: 10.1186/1471-2105-12-55] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Received: 06/10/2010] [Accepted: 02/16/2011] [Indexed: 11/15/2022] Open
Abstract
Background Alternative splicing (AS) is a process which generates several distinct mRNA isoforms from the same gene by splicing different portions out of the precursor transcript. Due to the (patho-)physiological importance of AS, a complete inventory of AS is of great interest. While this is in reach for human and mammalian model organisms, our knowledge of AS in plants has remained more incomplete. Experimental approaches for monitoring AS are either based on transcript sequencing or rely on hybridization to DNA microarrays. Among the microarray platforms facilitating the discovery of AS events, tiling arrays are well-suited for identifying intron retention, the most prevalent type of AS in plants. However, analyzing tiling array data is challenging, because of high noise levels and limited probe coverage. Results In this work, we present a novel method to detect intron retentions (IR) and exon skips (ES) from tiling arrays. While statistical tests have typically been proposed for this purpose, our method instead utilizes support vector machines (SVMs) which are appreciated for their accuracy and robustness to noise. Existing EST and cDNA sequences served for supervised training and evaluation. Analyzing a large collection of publicly available microarray and sequence data for the model plant A. thaliana, we demonstrated that our method is more accurate than existing approaches. The method was applied in a genome-wide screen which resulted in the discovery of 1,355 IR events. A comparison of these IR events to the TAIR annotation and a large set of short-read RNA-seq data showed that 830 of the predicted IR events are novel and that 525 events (39%) overlap with either the TAIR annotation or the IR events inferred from the RNA-seq data. Conclusions The method developed in this work expands the scarce repertoire of analysis tools for the identification of alternative mRNA splicing from whole-genome tiling arrays. 
Our predictions are highly enriched with known AS events and complement the A. thaliana genome annotation with respect to AS. Since all predicted AS events can be precisely attributed to experimental conditions, our work provides a basis for follow-up studies focused on the elucidation of the regulatory mechanisms underlying tissue-specific and stress-dependent AS in plants.
12
Anton MA, Aramburu A, Rubio A. Improvements to previous algorithms to predict gene structure and isoform concentrations using Affymetrix Exon arrays. BMC Bioinformatics 2010; 11:578. [PMID: 21110835 PMCID: PMC3012675 DOI: 10.1186/1471-2105-11-578] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 04/22/2010] [Accepted: 11/26/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Exon arrays provide a way to measure the expression of different isoforms of genes in an organism. Most of the procedures to deal with these arrays are focused on gene expression or on exon expression. Although the only biological analytes that can be properly assigned a concentration are transcripts, there are very few algorithms that focus on them. The reason is that previously developed summarization methods do not work well if applied to transcripts. In addition, gene structure prediction, i.e., the correspondence between probes and novel isoforms, is a field which is still unexplored. RESULTS We have modified and adapted a previous algorithm to take advantage of the special characteristics of the Affymetrix exon arrays. The structure and concentration of transcripts (some of them possibly unknown) in microarray experiments were predicted using this algorithm. Simulations showed that the suggested modifications improved both specificity (SP) and sensitivity (ST) of the predictions. The algorithm was also applied to different real datasets showing its effectiveness and the concordance with PCR validated results. CONCLUSIONS The proposed algorithm shows a substantial improvement in the performance over the previous version. This improvement is mainly due to the exploitation of the redundancy of the Affymetrix exon arrays. An R package of SPACE with the updated algorithms has been developed and is freely available.
Affiliation(s)
- Miguel A Anton
  - CEIT and TECNUN, University of Navarra, San Sebastián, Spain
13
Hsiao TH, Lin CH, Lee TT, Cheng JY, Wei PK, Chuang EY, Peck K. Verifying expressed transcript variants by detecting and assembling stretches of consecutive exons. Nucleic Acids Res 2010; 38:e187. [PMID: 20798177 PMCID: PMC2978383 DOI: 10.1093/nar/gkq754] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/22/2022] Open
Abstract
We herein describe an integrated system for the high-throughput analysis of splicing events and the identification of transcript variants. The system resolves individual splicing events and elucidates transcript variants via a pipeline that combines aspects such as bioinformatic analysis, high-throughput transcript variant amplification, and high-resolution capillary electrophoresis. For the 14 369 human genes known to have transcript variants, minimal primer sets were designed to amplify all transcript variants and examine all splicing events; these have been archived in the ASprimerDB database, which is newly described herein. A high-throughput thermocycler, dubbed GenTank, was developed to simultaneously perform thousands of PCR amplifications. Following the resolution of the various amplicons by capillary gel electrophoresis, two new computer programs, AmpliconViewer and VariantAssembler, may be used to analyze the splicing events, assemble the consecutive exons embodied by the PCR amplicons, and distinguish expressed versus putative transcript variants. This novel system not only facilitates the validation of putative transcript variants and the detection of novel transcript variants, it also semi-quantitatively measures the transcript variant expression levels of each gene. To demonstrate the system’s capability, we used it to resolve transcript variants yielded by single and multiple splicing events, and to decipher the exon connectivity of long transcripts.
Affiliation(s)
- Tzu-Hung Hsiao
- Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan 106, ROC
|
14
|
Pio R, Blanco D, Pajares MJ, Aibar E, Durany O, Ezponda T, Agorreta J, Gomez-Roman J, Anton MA, Rubio A, Lozano MD, López-Picazo JM, Subirada F, Maes T, Montuenga LM. Development of a novel splice array platform and its application in the identification of alternative splice variants in lung cancer. BMC Genomics 2010; 11:352. [PMID: 20525254 PMCID: PMC2889901 DOI: 10.1186/1471-2164-11-352] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2009] [Accepted: 06/03/2010] [Indexed: 12/22/2022] Open
Abstract
Background Microarray strategies, which allow for the characterization of thousands of alternative splice forms in a single test, can be applied to identify differential alternative splicing events. In this study, a novel splice array approach was developed, including the design of a high-density oligonucleotide array, a labeling procedure, and an algorithm to identify splice events. Results The array consisted of exon probes and thermodynamically balanced junction probes. Suboptimal probes were tagged and considered in the final analysis. An unbiased labeling protocol was developed using random primers. The algorithm used to distinguish changes in expression from changes in splicing was calibrated using internal non-spliced control sequences. The performance of this splice array was validated with artificial constructs for CDC6, VEGF, and PCBP4 isoforms. The platform was then applied to the analysis of differential splice forms in lung cancer samples compared to matched normal lung tissue. Overexpression of splice isoforms was identified for genes encoding CEACAM1, FHL-1, MLPH, and SUSD2. None of these splicing isoforms had been previously associated with lung cancer. Conclusions This methodology enables the detection of alternative splicing events in complex biological samples, providing a powerful tool to identify novel diagnostic and prognostic biomarkers for cancer and other pathologies.
Affiliation(s)
- Ruben Pio
- Division of Oncology, Center for Applied Medical Research, Pamplona, Spain.
|
15
|
Richard H, Schulz MH, Sultan M, Nürnberger A, Schrinner S, Balzereit D, Dagand E, Rasche A, Lehrach H, Vingron M, Haas SA, Yaspo ML. Prediction of alternative isoforms from exon expression levels in RNA-Seq experiments. Nucleic Acids Res 2010; 38:e112. [PMID: 20150413 PMCID: PMC2879520 DOI: 10.1093/nar/gkq041] [Citation(s) in RCA: 123] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022] Open
Abstract
Alternative splicing, polyadenylation of pre-messenger RNA molecules and differential promoter usage can produce a variety of transcript isoforms whose respective expression levels are regulated in time and space, thus contributing specific biological functions. However, the repertoire of mammalian alternative transcripts and their regulation are still poorly understood. Second-generation sequencing is now opening unprecedented routes to address the analysis of entire transcriptomes. Here, we developed methods that allow the prediction and quantification of alternative isoforms derived solely from exon expression levels in RNA-Seq data. These are based on an explicit statistical model and enable the prediction of alternative isoforms within or between conditions using any known gene annotation, as well as the relative quantification of known transcript structures. Applying these methods to a human RNA-Seq dataset, we validated a significant fraction of the predictions by RT-PCR. Data further showed that these predictions correlated well with information originating from junction reads. A direct comparison with exon arrays indicated improved performances of RNA-Seq over microarrays in the prediction of skipped exons. Altogether, the set of methods presented here comprehensively addresses multiple aspects of alternative isoform analysis. The software is available as an open-source R-package called Solas at http://cmb.molgen.mpg.de/2ndGenerationSequencing/Solas/.
Affiliation(s)
- Hugues Richard
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestr 73, 14195 Berlin, Germany.
|
16
|
Estimation of Alternative Splicing isoform Frequencies from RNA-Seq Data. Lecture Notes in Computer Science 2010. [DOI: 10.1007/978-3-642-15294-8_17] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
|
17
|
Hiller D, Jiang H, Xu W, Wong WH. Identifiability of isoform deconvolution from junction arrays and RNA-Seq. Bioinformatics 2009; 25:3056-9. [PMID: 19762346 DOI: 10.1093/bioinformatics/btp544] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
MOTIVATION Splice junction microarrays and RNA-seq are two popular ways of quantifying splice variants within a cell. Unfortunately, isoform expressions cannot always be determined from the expressions of individual exons and splice junctions. While this issue has been noted before, the extent of the problem on various platforms has not yet been explored, nor have potential remedies been presented. RESULTS We propose criteria that will guarantee identifiability of an isoform deconvolution model on exon and splice junction arrays and in RNA-Seq. We show that up to 97% of 2256 alternatively spliced human genes selected from the RefSeq database lead to identifiable gene models in RNA-seq, with similar results in mouse. However, in the Human Exon array only 26% of these genes lead to identifiable models, and even in the most comprehensive splice junction array only 69% lead to identifiable models. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
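To illustrate what identifiability means in this setting, here is a minimal sketch. It assumes the common linear formulation in which each exon or junction signal is the sum of the abundances of the isoforms that contain it; under that assumption, isoform abundances are uniquely determined exactly when the feature-by-isoform matrix has full column rank. The toy four-exon gene, its four isoforms, and the helper names `matrix_rank` and `is_identifiable` are hypothetical and are not taken from the paper, whose own criteria may be stated differently.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small integer matrix via exact Gauss-Jordan elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a non-zero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def is_identifiable(design):
    """Assumed criterion: full column rank of the feature-by-isoform matrix."""
    return matrix_rank(design) == len(design[0])


# Hypothetical gene with exons e1..e4 and four isoforms (columns):
#   I1 = e1-e2-e3-e4, I2 = e1-e4, I3 = e1-e2-e4, I4 = e1-e3-e4
EXON_ROWS = [
    [1, 1, 1, 1],  # e1 is in every isoform
    [1, 0, 1, 0],  # e2
    [1, 0, 0, 1],  # e3
    [1, 1, 1, 1],  # e4 is in every isoform
]
JUNCTION_ROWS = [
    [1, 0, 1, 0],  # e1-e2
    [1, 0, 0, 0],  # e2-e3
    [1, 0, 0, 1],  # e3-e4
    [0, 1, 0, 0],  # e1-e4
    [0, 0, 1, 0],  # e2-e4
    [0, 0, 0, 1],  # e1-e3
]

print(is_identifiable(EXON_ROWS))                  # exon signals only
print(is_identifiable(EXON_ROWS + JUNCTION_ROWS))  # exons plus junctions
```

In this toy case the exon rows alone have rank 3 for four isoforms, so the exon-only model is not identifiable, while adding junction rows restores full column rank. This is only a schematic analogue of the platform differences the abstract reports.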
Affiliation(s)
- David Hiller
- Department of Statistics, Stanford University, Stanford, CA 94305, USA
|
18
|
She Y, Hubbell E, Wang H. Resolving deconvolution ambiguity in gene alternative splicing. BMC Bioinformatics 2009; 10:237. [PMID: 19653895 PMCID: PMC2739860 DOI: 10.1186/1471-2105-10-237] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2009] [Accepted: 08/04/2009] [Indexed: 11/16/2022] Open
Abstract
Background For many gene structures it is impossible to resolve intensity data uniquely to establish abundances of splice variants. This was empirically noted by Wang et al., who called it a "degeneracy problem". The ambiguity results from an ill-posed problem where additional information is needed in order to obtain a unique answer in splice variant deconvolution. Results In this paper, we analyze the situations under which the problem occurs and perform a rigorous mathematical study which gives necessary and sufficient conditions on how many and what type of constraints are needed to resolve all ambiguity. This analysis is generally applicable to matrix models of splice variants. We explore the proposal that probe sequence information may provide sufficient additional constraints to resolve real-world instances. However, probe behavior cannot be predicted with sufficient accuracy by any existing probe sequence model, and so we present a Bayesian framework for estimating variant abundances by incorporating the prediction uncertainty from the micro-model of probe responsiveness into the macro-model of probe intensities. Conclusion The matrix analysis of constraints provides a tool for detecting real-world instances in which additional constraints may be necessary to resolve splice variants. While purely mathematical constraints can be stated without error, real-world constraints may themselves be poorly resolved. Our Bayesian framework provides a generic solution to the problem of uniquely estimating transcript abundances given additional constraints that themselves may be uncertain, such as regression fit to probe sequence models. We demonstrate its efficacy through extensive simulations as well as various biological data.
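The degeneracy described above can be made concrete with a small worked example. The sketch below assumes a simple additive matrix model in which each exon intensity is the sum of the abundances of the isoforms containing that exon; the gene structure, the two abundance vectors, and the function name `expected_intensities` are hypothetical illustrations and do not come from the paper.

```python
# Rows are exons e1..e4; columns are four hypothetical isoforms:
#   I1 = e1-e2-e3-e4, I2 = e1-e4, I3 = e1-e2-e4, I4 = e1-e3-e4
EXON_BY_ISOFORM = [
    [1, 1, 1, 1],  # e1
    [1, 0, 1, 0],  # e2
    [1, 0, 0, 1],  # e3
    [1, 1, 1, 1],  # e4
]


def expected_intensities(design, abundances):
    """Noise-free exon intensities under the assumed additive model."""
    return [sum(a * x for a, x in zip(row, abundances)) for row in design]


# Two different isoform mixtures (arbitrary abundance units).
mix_a = [2, 2, 1, 1]
mix_b = [1, 1, 2, 2]

print(expected_intensities(EXON_BY_ISOFORM, mix_a))  # [6, 3, 3, 6]
print(expected_intensities(EXON_BY_ISOFORM, mix_b))  # [6, 3, 3, 6]
```

Both mixtures yield the same exon intensities, so exon data alone cannot distinguish them; some additional constraint is needed to pick one solution, which is the kind of ambiguity the abstract analyzes.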
Affiliation(s)
- Yiyuan She
- Affymetrix Inc, Santa Clara, CA 95051, USA.
|
19
|
Abstract
Alterations in alternative splicing affect essential biologic processes and are the basis for a number of pathologic conditions, including cancer. In this review we will summarize the evidence supporting the relevance of alternative splicing in lung cancer. An example that illustrates this relevance is the altered balance between Bcl-xL and Bcl-xS, two splice variants of the apoptosis regulator Bcl-x. Splice modifications in cancer-related genes can be associated with modifications either in cis-acting splicing regulatory sequences or in trans-acting splicing factors. In fact, lung tumors show abnormal expression of splicing regulators such as ASF/SF2 or some members of the heterogeneous nuclear ribonucleoprotein family. The potential significance of alternative splicing as a target for lung cancer diagnosis or treatment will also be discussed.
|
20
|
Zheng S, Chen L. A hierarchical Bayesian model for comparing transcriptomes at the individual transcript isoform level. Nucleic Acids Res 2009; 37:e75. [PMID: 19417075 PMCID: PMC2691848 DOI: 10.1093/nar/gkp282] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2008] [Revised: 04/13/2009] [Accepted: 04/14/2009] [Indexed: 11/19/2022] Open
Abstract
The complexity of mammalian transcriptomes is compounded by alternative splicing which allows one gene to produce multiple transcript isoforms. However, transcriptome comparison has been limited to differential analysis at the gene level instead of the individual transcript isoform level. High-throughput sequencing technologies and high-resolution tiling arrays provide an unprecedented opportunity to compare transcriptomes at the level of individual splice variants. However, sequence read coverage or probe intensity at each position may represent a family of splice variants instead of one single isoform. Here we propose a hierarchical Bayesian model, BASIS (Bayesian Analysis of Splicing IsoformS), to infer the differential expression level of each transcript isoform in response to two conditions. A latent variable was introduced to perform direct statistical selection of differentially expressed isoforms. Model parameters were inferred based on an ergodic Markov chain generated by our Gibbs sampler. BASIS has the ability to borrow information across different probes (or positions) from the same genes and different genes. BASIS can handle the heteroskedasticity of probe intensity or sequence read coverage. We applied BASIS to a human tiling-array data set and a mouse RNA-seq data set. Some of the predictions were validated by quantitative real-time RT-PCR experiments.
Affiliation(s)
- Sika Zheng
- Howard Hughes Medical Institute, University of California, Los Angeles, Los Angeles, CA 90095 and Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Liang Chen
- Howard Hughes Medical Institute, University of California, Los Angeles, Los Angeles, CA 90095 and Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
|