1
|
Alser M, Rotman J, Deshpande D, Taraszka K, Shi H, Baykal PI, Yang HT, Xue V, Knyazev S, Singer BD, Balliu B, Koslicki D, Skums P, Zelikovsky A, Alkan C, Mutlu O, Mangul S. Technology dictates algorithms: recent developments in read alignment. Genome Biol 2021; 22:249. [PMID: 34446078 PMCID: PMC8390189 DOI: 10.1186/s13059-021-02443-7] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2020] [Accepted: 07/28/2021] [Indexed: 01/08/2023] Open
Abstract
Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today's diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 107 alignment methods, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read alignment. We discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology.
Collapse
Affiliation(s)
- Mohammed Alser
- Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
- Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
- Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
| | - Jeremy Rotman
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Dhrithi Deshpande
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA
| | - Kodi Taraszka
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Huwenbo Shi
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
| | - Pelin Icer Baykal
- Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
| | - Harry Taegyun Yang
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Bioinformatics Interdepartmental Ph.D. Program, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Victor Xue
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Sergey Knyazev
- Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
| | - Benjamin D Singer
- Division of Pulmonary and Critical Care Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Department of Biochemistry & Molecular Genetics, Northwestern University Feinberg School of Medicine, Chicago, USA
- Simpson Querrey Institute for Epigenetics, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
| | - Brunilda Balliu
- Department of Computational Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - David Koslicki
- Computer Science and Engineering, Pennsylvania State University, University Park, PA, 16801, USA
- Biology Department, Pennsylvania State University, University Park, PA, 16801, USA
- The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, 16801, USA
| | - Pavel Skums
- Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
| | - Alex Zelikovsky
- Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, 119991, Russia
| | - Can Alkan
- Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
- Bilkent-Hacettepe Health Sciences and Technologies Program, Ankara, Turkey
| | - Onur Mutlu
- Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
- Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
- Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
| | - Serghei Mangul
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA.
| |
Collapse
|
2
|
Lau YY, How KY, Yin WF, Chan KG. Functional characterization of quorum sensing LuxR-type transcriptional regulator, EasR in Enterobacter asburiae strain L1. PeerJ 2020; 8:e10068. [PMID: 33150063 PMCID: PMC7585371 DOI: 10.7717/peerj.10068] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/26/2019] [Accepted: 09/08/2020] [Indexed: 01/17/2023] Open
Abstract
Over the past decades, Enterobacter spp. have been identified as challenging and important pathogens. The emergence of multidrug-resistant Enterobacteria especially those that produce Klebsiella pneumoniae carbapenemase has been a very worrying health crisis. Although efforts have been made to unravel the complex mechanisms that contribute to the pathogenicity of different Enterobacter spp., there is very little information associated with AHL-type QS mechanism in Enterobacter spp. Signaling via N-acyl homoserine lactone (AHL) is the most common quorum sensing (QS) mechanism utilized by Proteobacteria. A typical AHL-based QS system involves two key players: a luxI gene homolog to synthesize AHLs and a luxR gene homolog, an AHL-dependent transcriptional regulator. These signaling molecules enable inter-species and intra-species interaction in response to external stimuli according to population density. In our recent study, we reported the genome of AHL-producing bacterium, Enterobacter asburiae strain L1. Whole genome sequencing and in silico analysis revealed the presence of a pair of luxI/R genes responsible for AHL-type QS, designated as easI/R, in strain L1. In a QS system, a LuxR transcriptional protein detects and responds to the concentration of a specific AHL controlling gene expression. In E. asburiae strain L1, EasR protein binds to its cognate AHLs, N-butanoyl homoserine lactone (C4-HSL) and N–hexanoyl homoserine lactone (C6-HSL), modulating the expression of targeted genes. In this current work, we have cloned the 693 bp luxR homolog of strain L1 for further characterization. The functionality and specificity of EasR protein in response to different AHL signaling molecules to activate gene transcription were tested and validated with β-galactosidase assays. Higher β-galactosidase activities were detected for cells harboring EasR, indicating EasR is a functional transcriptional regulator. This is the first report documenting the cloning and characterization of transcriptional regulator, luxR homolog of E. asburiae.
Collapse
Affiliation(s)
- Yin Yin Lau
- International Genome Centre, Jiangsu University, Zhenjiang, China.,Division of Genetics and Molecular Biology, Institute of Biological Sciences, Faculty of Science, University of Malaya, Malaysia
| | - Kah Yan How
- Division of Genetics and Molecular Biology, Institute of Biological Sciences, Faculty of Science, University of Malaya, Malaysia
| | - Wai-Fong Yin
- Division of Genetics and Molecular Biology, Institute of Biological Sciences, Faculty of Science, University of Malaya, Malaysia
| | - Kok-Gan Chan
- International Genome Centre, Jiangsu University, Zhenjiang, China.,Division of Genetics and Molecular Biology, Institute of Biological Sciences, Faculty of Science, University of Malaya, Malaysia
| |
Collapse
|
3
|
Lin HN, Hsu WL. DART: a fast and accurate RNA-seq mapper with a partitioning strategy. Bioinformatics 2018; 34:190-197. [PMID: 28968831 PMCID: PMC5860201 DOI: 10.1093/bioinformatics/btx558] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2017] [Revised: 08/29/2017] [Accepted: 09/03/2017] [Indexed: 01/13/2023] Open
Abstract
MOTIVATION In recent years, the massively parallel cDNA sequencing (RNA-Seq) technologies have become a powerful tool to provide high resolution measurement of expression and high sensitivity in detecting low abundance transcripts. However, RNA-seq data requires a huge amount of computational efforts. The very fundamental and critical step is to align each sequence fragment against the reference genome. Various de novo spliced RNA aligners have been developed in recent years. Though these aligners can handle spliced alignment and detect splice junctions, some challenges still remain to be solved. With the advances in sequencing technologies and the ongoing collection of sequencing data in the ENCODE project, more efficient alignment algorithms are highly demanded. Most read mappers follow the conventional seed-and-extend strategy to deal with inexact matches for sequence alignment. However, the extension is much more time consuming than the seeding step. RESULTS We proposed a novel RNA-seq de novo mapping algorithm, call DART, which adopts a partitioning strategy to avoid the extension step. The experiment results on synthetic datasets and real NGS datasets showed that DART is a highly efficient aligner that yields the highest or comparable sensitivity and accuracy compared to most state-of-the-art aligners, and more importantly, it spends the least amount of time among the selected aligners. AVAILABILITY AND IMPLEMENTATION https://github.com/hsinnan75/DART. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Hsin-Nan Lin
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| | - Wen-Lian Hsu
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| |
Collapse
|
4
|
Mao R, Liang C, Zhang Y, Hao X, Li J. 50/50 Expressional Odds of Retention Signifies the Distinction between Retained Introns and Constitutively Spliced Introns in Arabidopsis thaliana. FRONTIERS IN PLANT SCIENCE 2017; 8:1728. [PMID: 29062321 PMCID: PMC5640774 DOI: 10.3389/fpls.2017.01728] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/28/2017] [Accepted: 09/21/2017] [Indexed: 05/23/2023]
Abstract
Intron retention, one of the most prevalent alternative splicing events in plants, can lead to introns retained in mature mRNAs. However, in comparison with constitutively spliced introns (CSIs), the relevantly distinguishable features for retained introns (RIs) are still poorly understood. This work proposes a computational pipeline to discover novel RIs from multiple next-generation RNA sequencing (RNA-Seq) datasets of Arabidopsis thaliana. Using this pipeline, we detected 3,472 novel RIs from 18 RNA-Seq datasets and re-confirmed 1,384 RIs which are currently annotated in the TAIR10 database. We also use the expression of intron-containing isoforms as a new feature in addition to the conventional features. Based on these features, RIs are highly distinguishable from CSIs by machine learning methods, especially when the expressional odds of retention (i.e., the expression ratio of the RI-containing isoforms relative to the isoforms without RIs for the same gene) reaches to or larger than 50/50. In this case, the RIs and CSIs can be clearly separated by the Random Forest with an outstanding performance of 0.95 on AUC (the area under a receiver operating characteristics curve). The closely related characteristics to the RIs include the low strength of splice sites, high similarity with the flanking exon sequences, low occurrence percentage of YTRAY near the acceptor site, existence of putative intronic splicing silencers (ISSs, i.e., AG/GA-rich motifs) and intronic splicing enhancers (ISEs, i.e., TTTT-containing motifs), and enrichment of Serine/Arginine-Rich (SR) proteins and heterogeneous nuclear ribonucleoparticle proteins (hnRNPs).
Collapse
Affiliation(s)
- Rui Mao
- College of Information Engineering, Northwest A&F University, Yangling, China
| | - Chun Liang
- Department of Biology, Miami University, Oxford, OH, United States
- Department of Computer Sciences and Software Engineering, Miami University, Oxford, OH, United States
| | - Yang Zhang
- College of Information Engineering, Northwest A&F University, Yangling, China
| | - Xingan Hao
- State Key Laboratory of Crop Stress Biology for Arid Areas, College of Plant Protection, Northwest A&F University, Yangling, China
| | - Jinyan Li
- Advanced Analytics Institute, University of Technology Sydney, Sydney, NSW, Australia
| |
Collapse
|
5
|
Abebrese EL, Ali SH, Arnold ZR, Andrews VM, Armstrong K, Burns L, Crowder HR, Day RT, Hsu DG, Jarrell K, Lee G, Luo Y, Mugayo D, Raza Z, Friend K. Identification of human short introns. PLoS One 2017; 12:e0175393. [PMID: 28520720 PMCID: PMC5435141 DOI: 10.1371/journal.pone.0175393] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2017] [Accepted: 03/26/2017] [Indexed: 01/08/2023] Open
Abstract
Canonical pre-mRNA splicing requires snRNPs and associated splicing factors to excise conserved intronic sequences, with a minimum intron length required for efficient splicing. Non-canonical splicing-intron excision without the spliceosome-has been documented; most notably, some tRNAs and the XBP1 mRNA contain short introns that are not removed by the spliceosome. There have been some efforts to identify additional short introns, but little is known about how many short introns are processed from mRNAs. Here, we report an approach to identify RNA short introns from RNA-Seq data, discriminating against small genomic deletions. We identify hundreds of short introns conserved among multiple human cell lines. These short introns are often alternatively spliced and are found in a variety of RNAs-both mRNAs and lncRNAs. Short intron splicing efficiency is increased by secondary structure, and we detect both canonical and non-canonical short introns. In many cases, splicing of these short introns from mRNAs is predicted to alter the reading frame and change protein output. Our findings imply that standard gene prediction models which often assume a lower limit for intron size fail to predict short introns effectively. We conclude that short introns are abundant in the human transcriptome, and short intron splicing represents an added layer to mRNA regulation.
Collapse
Affiliation(s)
- Emmanuel L. Abebrese
- Department of Chemistry and Biochemistry, Washington and Lee University, Lexington, Virginia, United States of America
| | - Syed H. Ali
- Department of Chemistry and Biochemistry, Washington and Lee University, Lexington, Virginia, United States of America
| | - Zachary R. Arnold
- Department of Chemistry and Biochemistry, Washington and Lee University, Lexington, Virginia, United States of America
| | - Victoria M. Andrews
- Department of Chemistry and Biochemistry, Washington and Lee University, Lexington, Virginia, United States of America
| | - Katharine Armstrong
- Department of Chemistry and Biochemistry, Washington and Lee University, Lexington, Virginia, United States of America
| | - Lindsay Burns
- Department of Chemistry and Biochemistry, Washington and Lee University, Lexington, Virginia, United States of America
| | - Hannah R. Crowder
- Department of Chemistry and Biochemistry, Washington and Lee University, Lexington, Virginia, United States of America
| | - R. Thomas Day
- Department of Chemistry and Biochemistry, Washington and Lee University, Lexington, Virginia, United States of America
| | - Daniel G. Hsu
- Department of Chemistry and Biochemistry, Washington and Lee University, Lexington, Virginia, United States of America
| | - Katherine Jarrell
- Department of Chemistry and Biochemistry, Washington and Lee University, Lexington, Virginia, United States of America
| | - Grace Lee
- Department of Chemistry and Biochemistry, Washington and Lee University, Lexington, Virginia, United States of America
| | - Yi Luo
- Department of Chemistry and Biochemistry, Washington and Lee University, Lexington, Virginia, United States of America
| | - Daphine Mugayo
- Department of Chemistry and Biochemistry, Washington and Lee University, Lexington, Virginia, United States of America
| | - Zain Raza
- Department of Chemistry and Biochemistry, Washington and Lee University, Lexington, Virginia, United States of America
| | - Kyle Friend
- Department of Chemistry and Biochemistry, Washington and Lee University, Lexington, Virginia, United States of America
| |
Collapse
|
6
|
Li W, Richter RA, Jung Y, Zhu Q, Li RW. Web-based bioinformatics workflows for end-to-end RNA-seq data computation and analysis in agricultural animal species. BMC Genomics 2016; 17:761. [PMID: 27678198 PMCID: PMC5039875 DOI: 10.1186/s12864-016-3118-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2016] [Accepted: 09/23/2016] [Indexed: 11/18/2022] Open
Abstract
Background Remarkable advances in Next Generation Sequencing (NGS) technologies, bioinformatics algorithms and computational technologies have significantly accelerated genomic research. However, complicated NGS data analysis still remains as a major bottleneck. RNA-seq, as one of the major area in the NGS field, also confronts great challenges in data analysis. Results To address the challenges in RNA-seq data analysis, we developed a web portal that offers three integrated workflows that can perform end-to-end compute and analysis, including sequence quality control, read-mapping, transcriptome assembly, reconstruction and quantification, and differential analysis. The first workflow utilizes Tuxedo (Tophat, Cufflink, Cuffmerge and Cuffdiff suite of tools). The second workflow deploys Trinity for de novo assembly and uses RSEM for transcript quantification and EdgeR for differential analysis. The third combines STAR, RSEM, and EdgeR for data analysis. All these workflows support multiple samples and multiple groups of samples and perform differential analysis between groups in a single workflow job submission. The calculated results are available for download and post-analysis. The supported animal species include chicken, cow, duck, goat, pig, horse, rabbit, sheep, turkey, as well as several other model organisms including yeast, C. elegans, Drosophila, and human, with genomic sequences and annotations obtained from ENSEMBL. The RNA-seq portal is freely available from http://weizhongli-lab.org/RNA-seq. Conclusions The web portal offers not only bioinformatics software, workflows, computation and reference data, but also an integrated environment for complex RNA-seq data analysis for agricultural animal species. In this project, our aim is not to develop new RNA-seq tools, but to build web workflows for using popular existing RNA-seq methods and make these tools more accessible to the communities.
Collapse
Affiliation(s)
- Weizhong Li
- J. Craig Venter Institute, La Jolla, CA, 92037, USA.
| | | | - Yunsup Jung
- J. Craig Venter Institute, La Jolla, CA, 92037, USA
| | - Qiyun Zhu
- J. Craig Venter Institute, La Jolla, CA, 92037, USA
| | - Robert W Li
- United States Department of Agriculture, Agriculture Research Service (USDA-ARS), Animal Genomics and Improvement Laboratory, Beltsville, MD, 20705, USA
| |
Collapse
|
7
|
Meher PK, Sahu TK, Rao AR, Wahi SD. A computational approach for prediction of donor splice sites with improved accuracy. J Theor Biol 2016; 404:285-294. [PMID: 27302911 DOI: 10.1016/j.jtbi.2016.06.013] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2015] [Revised: 04/18/2016] [Accepted: 06/09/2016] [Indexed: 11/24/2022]
Abstract
Identification of splice sites is important due to their key role in predicting the exon-intron structure of protein coding genes. Though several approaches have been developed for the prediction of splice sites, further improvement in the prediction accuracy will help predict gene structure more accurately. This paper presents a computational approach for prediction of donor splice sites with higher accuracy. In this approach, true and false splice sites were first encoded into numeric vectors and then used as input in artificial neural network (ANN), support vector machine (SVM) and random forest (RF) for prediction. ANN and SVM were found to perform equally and better than RF, while tested on HS3D and NN269 datasets. Further, the performance of ANN, SVM and RF were analyzed by using an independent test set of 50 genes and found that the prediction accuracy of ANN was higher than that of SVM and RF. All the predictors achieved higher accuracy while compared with the existing methods like NNsplice, MEM, MDD, WMM, MM1, FSPLICE, GeneID and ASSP, using the independent test set. We have also developed an online prediction server (PreDOSS) available at http://cabgrid.res.in:8080/predoss, for prediction of donor splice sites using the proposed approach.
Collapse
Affiliation(s)
- Prabina Kumar Meher
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
| | - Tanmaya Kumar Sahu
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
| | - A R Rao
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
| | - S D Wahi
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
| |
Collapse
|
8
|
Meher PK, Sahu TK, Rao AR, Wahi SD. Identification of donor splice sites using support vector machine: a computational approach based on positional, compositional and dependency features. Algorithms Mol Biol 2016; 11:16. [PMID: 27252772 PMCID: PMC4888255 DOI: 10.1186/s13015-016-0078-4] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2015] [Accepted: 05/17/2016] [Indexed: 11/16/2022] Open
Abstract
Background Identification of splice sites is essential for annotation of genes. Though existing approaches have achieved an acceptable level of accuracy, still there is a need for further improvement. Besides, most of the approaches are species-specific and hence it is required to develop approaches compatible across species. Results Each splice site sequence was transformed into a numeric vector of length 49, out of which four were positional, four were dependency and 41 were compositional features. Using the transformed vectors as input, prediction was made through support vector machine. Using balanced training set, the proposed approach achieved area under ROC curve (AUC-ROC) of 96.05, 96.96, 96.95, 96.24 % and area under PR curve (AUC-PR) of 97.64, 97.89, 97.91, 97.90 %, while tested on human, cattle, fish and worm datasets respectively. On the other hand, AUC-ROC of 97.21, 97.45, 97.41, 98.06 % and AUC-PR of 93.24, 93.34, 93.38, 92.29 % were obtained, while imbalanced training datasets were used. The proposed approach was found comparable with state-of-art splice site prediction approaches, while compared using the bench mark NN269 dataset and other datasets. Conclusions The proposed approach achieved consistent accuracy across different species as well as found comparable with the existing approaches. Thus, we believe that the proposed approach can be used as a complementary method to the existing methods for the prediction of splice sites. A web server named as ‘HSplice’ has also been developed based on the proposed approach for easy prediction of 5′ splice sites by the users and is freely available at http://cabgrid.res.in:8080/HSplice.
Collapse
|
9
|
Aguiar-Pulido V, Huang W, Suarez-Ulloa V, Cickovski T, Mathee K, Narasimhan G. Metagenomics, Metatranscriptomics, and Metabolomics Approaches for Microbiome Analysis. Evol Bioinform Online 2016; 12:5-16. [PMID: 27199545 PMCID: PMC4869604 DOI: 10.4137/ebo.s36436] [Citation(s) in RCA: 158] [Impact Index Per Article: 19.8] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2015] [Revised: 01/26/2016] [Accepted: 01/31/2016] [Indexed: 01/21/2023] Open
Abstract
Microbiomes are ubiquitous and are found in the ocean, the soil, and in/on other living organisms. Changes in the microbiome can impact the health of the environmental niche in which they reside. In order to learn more about these communities, different approaches based on data from multiple omics have been pursued. Metagenomics produces a taxonomical profile of the sample, metatranscriptomics helps us to obtain a functional profile, and metabolomics completes the picture by determining which byproducts are being released into the environment. Although each approach provides valuable information separately, we show that, when combined, they paint a more comprehensive picture. We conclude with a review of network-based approaches as applied to integrative studies, which we believe holds the key to in-depth understanding of microbiomes.
Collapse
Affiliation(s)
- Vanessa Aguiar-Pulido
- Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA
| | - Wenrui Huang
- Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA
| | - Victoria Suarez-Ulloa
- Chromatin Structure and Evolution Group (Chromevol), Department of Biological Sciences, Florida International University, Miami, FL, USA
| | - Trevor Cickovski
- Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA.; Department of Computer Science, Eckerd College, St. Petersburg, FL, USA
| | - Kalai Mathee
- Biomolecular Sciences Institute, Florida International University, Miami, FL, USA.; Herbert Wertheim College of Medicine, Florida International University, Miami, FL, USA.; Global Health Consortium, Florida International University, Miami, FL, USA
| | - Giri Narasimhan
- Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA.; Biomolecular Sciences Institute, Florida International University, Miami, FL, USA
| |
Collapse
|
10
|
Ye H, Meehan J, Tong W, Hong H. Alignment of Short Reads: A Crucial Step for Application of Next-Generation Sequencing Data in Precision Medicine. Pharmaceutics 2015; 7:523-41. [PMID: 26610555 PMCID: PMC4695832 DOI: 10.3390/pharmaceutics7040523] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2015] [Revised: 11/14/2015] [Accepted: 11/17/2015] [Indexed: 02/06/2023] Open
Abstract
Precision medicine or personalized medicine has been proposed as a modernized and promising medical strategy. Genetic variants of patients are the key information for implementation of precision medicine. Next-generation sequencing (NGS) is an emerging technology for deciphering genetic variants. Alignment of raw reads to a reference genome is one of the key steps in NGS data analysis. Many algorithms have been developed for alignment of short read sequences since 2008. Users have to make a decision on which alignment algorithm to use in their studies. Selection of the right alignment algorithm determines not only the alignment algorithm but also the set of suitable parameters to be used by the algorithm. Understanding these algorithms helps in selecting the appropriate alignment algorithm for different applications in precision medicine. Here, we review current available algorithms and their major strategies such as seed-and-extend and q-gram filter. We also discuss the challenges in current alignment algorithms, including alignment in multiple repeated regions, long reads alignment and alignment facilitated with known genetic variants.
Collapse
Affiliation(s)
- Hao Ye
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA.
| | - Joe Meehan
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA.
| | - Weida Tong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA.
| | - Huixiao Hong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA.
| |
Collapse
|
11
|
Thangam M, Gopal RK. CRCDA--Comprehensive resources for cancer NGS data analysis. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2015; 2015:bav092. [PMID: 26450948 PMCID: PMC4597977 DOI: 10.1093/database/bav092] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/27/2015] [Accepted: 08/31/2015] [Indexed: 12/24/2022]
Abstract
Next generation sequencing (NGS) innovations put a compelling landmark in life science and changed the direction of research in clinical oncology with its productivity to diagnose and treat cancer. The aim of our portal comprehensive resources for cancer NGS data analysis (CRCDA) is to provide a collection of different NGS tools and pipelines under diverse classes with cancer pathways and databases and furthermore, literature information from PubMed. The literature data was constrained to 18 most common cancer types such as breast cancer, colon cancer and other cancers that exhibit in worldwide population. NGS-cancer tools for the convenience have been categorized into cancer genomics, cancer transcriptomics, cancer epigenomics, quality control and visualization. Pipelines for variant detection, quality control and data analysis were listed to provide out-of-the box solution for NGS data analysis, which may help researchers to overcome challenges in selecting and configuring individual tools for analysing exome, whole genome and transcriptome data. An extensive search page was developed that can be queried by using (i) type of data [literature, gene data and sequence read archive (SRA) data] and (ii) type of cancer (selected based on global incidence and accessibility of data). For each category of analysis, variety of tools are available and the biggest challenge is in searching and using the right tool for the right application. The objective of the work is collecting tools in each category available at various places and arranging the tools and other data in a simple and user-friendly manner for biologists and oncologists to find information easier. To the best of our knowledge, we have collected and presented a comprehensive package of most of the resources available in cancer for NGS data analysis. Given these factors, we believe that this website will be an useful resource to the NGS research community working on cancer. Database URL: http://bioinfo.au-kbc.org.in/ngs/ngshome.html.
Collapse
Affiliation(s)
- Manonanthini Thangam
- AU-KBC Research Centre, MIT Campus of Anna University, Chromepet, Chennai, India
| | - Ramesh Kumar Gopal
- AU-KBC Research Centre, MIT Campus of Anna University, Chromepet, Chennai, India
| |
Collapse
|
12
|
Survey of Programs Used to Detect Alternative Splicing Isoforms from Deep Sequencing Data In Silico. BIOMED RESEARCH INTERNATIONAL 2015; 2015:831352. [PMID: 26421304 PMCID: PMC4573434 DOI: 10.1155/2015/831352] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/26/2014] [Revised: 02/17/2015] [Accepted: 03/02/2015] [Indexed: 11/29/2022]
Abstract
Next-generation sequencing techniques have been rapidly emerging. However, the massive sequencing reads hide a great deal of unknown important information. Advances have enabled researchers to discover alternative splicing (AS) sites and isoforms using computational approaches instead of molecular experiments. Given the importance of AS for gene expression and protein diversity in eukaryotes, detecting alternative splicing and isoforms represents a hot topic in systems biology and epigenetics research. The computational methods applied to AS prediction have improved since the emergence of next-generation sequencing. In this study, we introduce state-of-the-art research on AS and then compare the research methods and software tools available for AS based on next-generation sequencing reads. Finally, we discuss the prospects of computational methods related to AS.
Collapse
|
13
|
Meher PK, Sahu TK, Rao AR, Wahi SD. A statistical approach for 5' splice site prediction using short sequence motifs and without encoding sequence data. BMC Bioinformatics 2014; 15:362. [PMID: 25420551 PMCID: PMC4702320 DOI: 10.1186/s12859-014-0362-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2014] [Accepted: 10/24/2014] [Indexed: 11/17/2022] Open
Abstract
Background Most of the approaches for splice site prediction are based on machine learning techniques. Though, these approaches provide high prediction accuracy, the window lengths used are longer in size. Hence, these approaches may not be suitable to predict the novel splice variants using the short sequence reads generated from next generation sequencing technologies. Further, machine learning techniques require numerically encoded data and produce different accuracy with different encoding procedures. Therefore, splice site prediction with short sequence motifs and without encoding sequence data became a motivation for the present study. Results An approach for finding association among nucleotide bases in the splice site motifs is developed and used further to determine the appropriate window size. Besides, an approach for prediction of donor splice sites using sum of absolute error criterion has also been proposed. The proposed approach has been compared with commonly used approaches i.e., Maximum Entropy Modeling (MEM), Maximal Dependency Decomposition (MDD), Weighted Matrix Method (WMM) and Markov Model of first order (MM1) and was found to perform equally with MEM and MDD and better than WMM and MM1 in terms of prediction accuracy. Conclusions The proposed prediction approach can be used in the prediction of donor splice sites with higher accuracy using short sequence motifs and hence can be used as a complementary method to the existing approaches. Based on the proposed methodology, a web server was also developed for easy prediction of donor splice sites by users and is available at http://cabgrid.res.in:8080/sspred. Electronic supplementary material The online version of this article (doi:10.1186/s12859-014-0362-6) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Prabina Kumar Meher
- Division of Statistical Genetics, Indian Agricultural Statistics Research Institute, New Delhi, 110012, India.
| | - Tanmaya Kumar Sahu
- Centre for Agricultural Bioinformatics, Indian Agricultural Statistics Research Institute, New Delhi, 110012, India.
| | - Atmakuri Ramakrishna Rao
- Centre for Agricultural Bioinformatics, Indian Agricultural Statistics Research Institute, New Delhi, 110012, India.
| | - Sant Dass Wahi
- Division of Statistical Genetics, Indian Agricultural Statistics Research Institute, New Delhi, 110012, India.
| |
Collapse
|
14
|
ARYANA: Aligning Reads by Yet Another Approach. BMC Bioinformatics 2014; 15 Suppl 9:S12. [PMID: 25252881 PMCID: PMC4168712 DOI: 10.1186/1471-2105-15-s9-s12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022] Open
Abstract
Motivation Although there are many different algorithms and software tools for aligning sequencing reads, fast gapped sequence search is far from solved. Strong interest in fast alignment is best reflected in the $106 prize for the Innocentive competition on aligning a collection of reads to a given database of reference genomes. In addition, de novo assembly of next-generation sequencing long reads requires fast overlap-layout-concensus algorithms which depend on fast and accurate alignment. Contribution We introduce ARYANA, a fast gapped read aligner, developed on the base of BWA indexing infrastructure with a completely new alignment engine that makes it significantly faster than three other aligners: Bowtie2, BWA and SeqAlto, with comparable generality and accuracy. Instead of the time-consuming backtracking procedures for handling mismatches, ARYANA comes with the seed-and-extend algorithmic framework and a significantly improved efficiency by integrating novel algorithmic techniques including dynamic seed selection, bidirectional seed extension, reset-free hash tables, and gap-filling dynamic programming. As the read length increases ARYANA's superiority in terms of speed and alignment rate becomes more evident. This is in perfect harmony with the read length trend as the sequencing technologies evolve. The algorithmic platform of ARYANA makes it easy to develop mission-specific aligners for other applications using ARYANA engine. Availability ARYANA with complete source code can be obtained from http://github.com/aryana-aligner
Collapse
|
15
|
Lau YY, Yin WF, Chan KG. Enterobacter asburiae strain L1: complete genome and whole genome optical mapping analysis of a quorum sensing bacterium. SENSORS 2014; 14:13913-24. [PMID: 25196111 PMCID: PMC4178997 DOI: 10.3390/s140813913] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/26/2014] [Revised: 07/07/2014] [Accepted: 07/23/2014] [Indexed: 01/01/2023]
Abstract
Enterobacter asburiae L1 is a quorum sensing bacterium isolated from lettuce leaves. In this study, for the first time, the complete genome of E. asburiae L1 was sequenced using the single molecule real time sequencer (PacBio RSII) and the whole genome sequence was verified by using optical genome mapping (OpGen) technology. In our previous study, E. asburiae L1 has been reported to produce AHLs, suggesting the possibility of virulence factor regulation which is quorum sensing dependent. This evoked our interest to study the genome of this bacterium and here we present the complete genome of E. asburiae L1, which carries the virulence factor gene virK, the N-acyl homoserine lactone-based QS transcriptional regulator gene luxR and the N-acyl homoserine lactone synthase gene which we firstly named easI. The availability of the whole genome sequence of E. asburiae L1 will pave the way for the study of the QS-mediated gene expression in this bacterium. Hence, the importance and functions of these signaling molecules can be further studied in the hope of elucidating the mechanisms of QS-regulation in E. asburiae. To the best of our knowledge, this is the first documentation of both a complete genome sequence and the establishment of the molecular basis of QS properties of E. asburiae.
Collapse
Affiliation(s)
- Yin Yin Lau
- Division of Genetics and Molecular Biology, Institute of Biological Sciences, Faculty of Science, University of Malaya, Kuala Lumpur 50603, Malaysia.
| | - Wai-Fong Yin
- Division of Genetics and Molecular Biology, Institute of Biological Sciences, Faculty of Science, University of Malaya, Kuala Lumpur 50603, Malaysia.
| | - Kok-Gan Chan
- Division of Genetics and Molecular Biology, Institute of Biological Sciences, Faculty of Science, University of Malaya, Kuala Lumpur 50603, Malaysia.
| |
Collapse
|
16
|
Rahn R, Weese D, Reinert K. Journaled string tree-a scalable data structure for analyzing thousands of similar genomes on your laptop. Bioinformatics 2014; 30:3499-505. [PMID: 25028723 DOI: 10.1093/bioinformatics/btu438] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/18/2023] Open
Abstract
MOTIVATION Next-generation sequencing (NGS) has revolutionized biomedical research in the past decade and led to a continuous stream of developments in bioinformatics, addressing the need for fast and space-efficient solutions for analyzing NGS data. Often researchers need to analyze a set of genomic sequences that stem from closely related species or are indeed individuals of the same species. Hence, the analyzed sequences are similar. For analyses where local changes in the examined sequence induce only local changes in the results, it is obviously desirable to examine identical or similar regions not repeatedly. RESULTS In this work, we provide a datatype that exploits data parallelism inherent in a set of similar sequences by analyzing shared regions only once. In real-world experiments, we show that algorithms that otherwise would scan each reference sequentially can be speeded up by a factor of 115.
Collapse
Affiliation(s)
- René Rahn
- Department of Mathematics and Computer Science, Freie Universität Berlin, Takustr. 9, 14195 Berlin, Germany
| | - David Weese
- Department of Mathematics and Computer Science, Freie Universität Berlin, Takustr. 9, 14195 Berlin, Germany
| | - Knut Reinert
- Department of Mathematics and Computer Science, Freie Universität Berlin, Takustr. 9, 14195 Berlin, Germany
| |
Collapse
|
17
|
Benjamin AM, Nichols M, Burke TW, Ginsburg GS, Lucas JE. Comparing reference-based RNA-Seq mapping methods for non-human primate data. BMC Genomics 2014; 15:570. [PMID: 25001289 PMCID: PMC4112205 DOI: 10.1186/1471-2164-15-570] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2013] [Accepted: 06/25/2014] [Indexed: 12/03/2022] Open
Abstract
Background The application of next-generation sequencing technology to gene expression quantification analysis, namely, RNA-Sequencing, has transformed the way in which gene expression studies are conducted and analyzed. These advances are of particular interest to researchers studying organisms with missing or incomplete genomes, as the need for knowledge of sequence information is overcome. De novo assembly methods have gained widespread acceptance in the RNA-Seq community for organisms with no true reference genome or transcriptome. While such methods have tremendous utility, computational cost is still a significant challenge for organisms with large and complex genomes. Results In this manuscript, we present a comparison of four reference-based mapping methods for non-human primate data. We utilize TopHat2 and GSNAP for mapping to the human genome, and Bowtie2 and Stampy for mapping to the human genome and transcriptome for a total of six mapping approaches. For each of these methods, we explore mapping rates and locations, number of detected genes, correlations between computed expression values, and the utility of the resulting data for differential expression analysis. Conclusions We show that reference-based mapping methods indeed have utility in RNA-Seq analysis of mammalian data with no true reference, and the details of mapping methods should be carefully considered when doing so. Critical algorithm features include short seed sequences, the allowance of mismatches, and the allowance of gapped alignments in addition to splice junction gaps. Such features facilitate sensitive alignment of non-human primate RNA-Seq data to a human reference. Electronic supplementary material The online version of this article (doi:10.1186/1471-2164-15-570) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Ashlee M Benjamin
- Center for Applied Genomics, Department of Medicine, Duke University, Durham, North Carolina, USA.
| | | | | | | | | |
Collapse
|
18
|
Kloetgen A, Münch PC, Borkhardt A, Hoell JI, McHardy AC. Biochemical and bioinformatic methods for elucidating the role of RNA-protein interactions in posttranscriptional regulation. Brief Funct Genomics 2014; 14:102-14. [PMID: 24951655 PMCID: PMC4471435 DOI: 10.1093/bfgp/elu020] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2023] Open
Abstract
Our understanding of transcriptional gene regulation has dramatically increased over the past decades, and many regulators of gene expression, such as transcription factors, have been analyzed extensively. Additionally, in recent years, deeper insights into the physiological roles of RNA have been obtained. More precisely, splicing, polyadenylation, various modifications, localization and the translation of messenger RNAs (mRNAs) are regulated by their interaction with RNA-binding proteins (RBPs). New technologies now enable the analysis of this regulation at different levels. A technique known as ultraviolet (UV) cross-linking and immunoprecipitation (CLIP) allows us to determine physical protein–RNA interactions on a genome-wide scale. UV cross-linking introduces covalent bonds between interacting RBPs and RNAs. In combination with immunoprecipitation and deep sequencing techniques, tens of millions of short reads (representing bound RNAs by an RBP of interest) are generated and are used to characterize the regulatory network mediated by an RBP. Other methods, such as mass spectrometry, can also be used for characterization of cross-linked RBPs and RNAs instead of CLIP methods. In this review, we discuss experimental and computational methods for the generation and analysis of CLIP data. The computational methods include short-read alignment, annotation and RNA-binding motif discovery. We describe the challenges of analyzing CLIP data and indicate areas where improvements are needed.
Collapse
Affiliation(s)
| | | | | | | | - Alice C McHardy
- Corresponding author. Alice C. McHardy, Heinrich-Heine University, Department of Algorithmic Bioinformatics, Universitaetsstrasse 1, 40225 Duesseldorf, Germany. Tel.: +49-211-8110427; Fax: +49-211-8113464; E-mail:
| |
Collapse
|
19
|
Maji RK, Sarkar A, Khatua S, Dasgupta S, Ghosh Z. PVT: an efficient computational procedure to speed up next-generation sequence analysis. BMC Bioinformatics 2014; 15:167. [PMID: 24894600 PMCID: PMC4063226 DOI: 10.1186/1471-2105-15-167] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/02/2014] [Accepted: 05/07/2014] [Indexed: 12/05/2022] Open
Abstract
Background High-throughput Next-Generation Sequencing (NGS) techniques are advancing genomics and molecular biology research. This technology generates substantially large data which puts up a major challenge to the scientists for an efficient, cost and time effective solution to analyse such data. Further, for the different types of NGS data, there are certain common challenging steps involved in analysing those data. Spliced alignment is one such fundamental step in NGS data analysis which is extremely computational intensive as well as time consuming. There exists serious problem even with the most widely used spliced alignment tools. TopHat is one such widely used spliced alignment tools which although supports multithreading, does not efficiently utilize computational resources in terms of CPU utilization and memory. Here we have introduced PVT (Pipelined Version of TopHat) where we take up a modular approach by breaking TopHat’s serial execution into a pipeline of multiple stages, thereby increasing the degree of parallelization and computational resource utilization. Thus we address the discrepancies in TopHat so as to analyze large NGS data efficiently. Results We analysed the SRA dataset (SRX026839 and SRX026838) consisting of single end reads and SRA data SRR1027730 consisting of paired-end reads. We used TopHat v2.0.8 to analyse these datasets and noted the CPU usage, memory footprint and execution time during spliced alignment. With this basic information, we designed PVT, a pipelined version of TopHat that removes the redundant computational steps during ‘spliced alignment’ and breaks the job into a pipeline of multiple stages (each comprising of different step(s)) to improve its resource utilization, thus reducing the execution time. Conclusions PVT provides an improvement over TopHat for spliced alignment of NGS data analysis. PVT thus resulted in the reduction of the execution time to ~23% for the single end read dataset. Further, PVT designed for paired end reads showed an improved performance of ~41% over TopHat (for the chosen data) with respect to execution time. Moreover we propose PVT-Cloud which implements PVT pipeline in cloud computing system.
Collapse
Affiliation(s)
| | | | | | | | - Zhumur Ghosh
- Bioinformatics Centre, Bose Institute, Kolkata 700054, India.
| |
Collapse
|
20
|
|
21
|
Zheng CL, Kawane S, Bottomly D, Wilmot B. Analysis considerations for utilizing RNA-Seq to characterize the brain transcriptome. INTERNATIONAL REVIEW OF NEUROBIOLOGY 2014; 116:21-54. [PMID: 25172470 DOI: 10.1016/b978-0-12-801105-8.00002-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/01/2023]
Abstract
RNA-Seq allows one to examine only gene expression as well as expression of noncoding RNAs, alternative splicing, and allele-specific expression. With this increased sensitivity and dynamic range, there are computational and statistical considerations that need to be contemplated, which are highly dependent on the biological question being asked. We highlight these to provide an overview of their importance and the impact they can have on downstream interpretation of the brain transcriptome.
Collapse
Affiliation(s)
- Christina L Zheng
- Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Science University, Portland, Oregon, USA; Knight Cancer Institute, Oregon Health, Oregon Health and Science University, Portland, Oregon, USA.
| | - Sunita Kawane
- Clinical & Translational Research Institute, Oregon Health and Science University, Portland, Oregon, USA
| | - Daniel Bottomly
- Clinical & Translational Research Institute, Oregon Health and Science University, Portland, Oregon, USA
| | - Beth Wilmot
- Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Science University, Portland, Oregon, USA; Clinical & Translational Research Institute, Oregon Health and Science University, Portland, Oregon, USA
| |
Collapse
|
22
|
Alamancos GP, Agirre E, Eyras E. Methods to study splicing from high-throughput RNA sequencing data. Methods Mol Biol 2014; 1126:357-97. [PMID: 24549677 DOI: 10.1007/978-1-62703-980-2_26] [Citation(s) in RCA: 54] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2022]
Abstract
The development of novel high-throughput sequencing (HTS) methods for RNA (RNA-Seq) has provided a very powerful mean to study splicing under multiple conditions at unprecedented depth. However, the complexity of the information to be analyzed has turned this into a challenging task. In the last few years, a plethora of tools have been developed, allowing researchers to process RNA-Seq data to study the expression of isoforms and splicing events, and their relative changes under different conditions. We provide an overview of the methods available to study splicing from short RNA-Seq data, which could serve as an entry point for users who need to decide on a suitable tool for a specific analysis. We also attempt to propose a classification of the tools according to the operations they do, to facilitate the comparison and choice of methods.
Collapse
Affiliation(s)
- Gael P Alamancos
- Computational Genomics, Universitat Pompeu Fabra, Barcelona, Spain
| | | | | |
Collapse
|
23
|
Abstract
Background RNA-Seq methodology is a revolutionary transcriptomics sequencing technology, which is the representative of Next generation Sequencing (NGS). With the high throughput sequencing of RNA-Seq, we can acquire much more information like differential expression and novel splice variants from deep sequence analysis and data mining. But the short read length brings a great challenge to alignment, especially when the reads span two or more exons. Methods A two steps heuristic splice alignment tool is generated in this investigation. First, map raw reads to reference with unspliced aligner - BWA; second, split initial unmapped reads into three equal short reads (seeds), align each seed to the reference, filter hits, search possible split position of read and extend hits to a complete match. Results Compare with other splice alignment tools like SOAPsplice and Tophat2, HSA has a better performance in call rate and efficiency, but its results do not as accurate as the other software to some extent. Conclusions HSA is an effective spliced aligner of RNA-Seq reads mapping, which is available at https://github.com/vlcc/HSA.
Collapse
|
24
|
Behr J, Kahles A, Zhong Y, Sreedharan VT, Drewe P, Rätsch G. MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples. Bioinformatics 2013; 29:2529-38. [PMID: 23980025 PMCID: PMC3789545 DOI: 10.1093/bioinformatics/btt442] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2012] [Revised: 07/19/2013] [Accepted: 07/29/2013] [Indexed: 02/07/2023] Open
Abstract
MOTIVATION High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in the detection of expressed genes and reconstruction of RNA transcripts. However, the extensive dynamic range of gene expression, technical limitations and biases, as well as the observed complexity of the transcriptional landscape, pose profound computational challenges for transcriptome reconstruction. RESULTS We present the novel framework MITIE (Mixed Integer Transcript IdEntification) for simultaneous transcript reconstruction and quantification. We define a likelihood function based on the negative binomial distribution, use a regularization approach to select a few transcripts collectively explaining the observed read data and show how to find the optimal solution using Mixed Integer Programming. MITIE can (i) take advantage of known transcripts, (ii) reconstruct and quantify transcripts simultaneously in multiple samples, and (iii) resolve the location of multi-mapping reads. It is designed for genome- and assembly-based transcriptome reconstruction. We present an extensive study based on realistic simulated RNA-Seq data. When compared with state-of-the-art approaches, MITIE proves to be significantly more sensitive and overall more accurate. Moreover, MITIE yields substantial performance gains when used with multiple samples. We applied our system to 38 Drosophila melanogaster modENCODE RNA-Seq libraries and estimated the sensitivity of reconstructing omitted transcript annotations and the specificity with respect to annotated transcripts. Our results corroborate that a well-motivated objective paired with appropriate optimization techniques lead to significant improvements over the state-of-the-art in transcriptome reconstruction. AVAILABILITY MITIE is implemented in C++ and is available from http://bioweb.me/mitie under the GPL license.
Collapse
Affiliation(s)
- Jonas Behr
- Computational Biology Center, Sloan-Kettering Institute, 1275 York Avenue, New York, NY 10065, USA and Friedrich Miescher Laboratory, Max Planck Society, Spemannstr. 39, 72076 Tübingen, Germany
| | | | | | | | | | | |
Collapse
|
25
|
Morey M, Fernández-Marmiesse A, Castiñeiras D, Fraga JM, Couce ML, Cocho JA. A glimpse into past, present, and future DNA sequencing. Mol Genet Metab 2013; 110:3-24. [PMID: 23742747 DOI: 10.1016/j.ymgme.2013.04.024] [Citation(s) in RCA: 80] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/27/2013] [Revised: 04/30/2013] [Accepted: 04/30/2013] [Indexed: 12/21/2022]
Abstract
Current advances in DNA sequencing technologies are dropping down sequencing cost while increasing throughput at a pace never shown before. Past-decade great milestones, as the establishment of a reference human genome (amongst others) and large-scale human genetic variation study in the 1000 Genome project are, in conjunction with the use of these techniques, triggering advances in many areas of basic and applied science. These tools, stored in and combined with the vast amount of information present in biological online databases are, with the use of automated interpretation and analysis tools, allowing the fulfillment of increasingly ambitious studies in many areas and also are democratizing the access to information, interpretation and technologies, being the first opportunity for researchers to assess the influence of genetics in complex events as multifactorial diseases, evolutionary studies, metagenomics, transcriptomics, etc. In this review, we present the current state of the art of these technologies, focusing on second generation sequencing, from sample and library preparation to sequencing chemistries and bioinformatic software available for final data analysis and visualisation, with its possible applications. We also make an overview of first and third generation, due to its historical importance and for being the upcoming future tools for genetic analysis, respectively.
Collapse
Affiliation(s)
- Marcos Morey
- Unidad de Diagnóstico y Tratamiento de Enfermedades Metabólicas Congénitas, Hospital Clínico Universitario de Santiago, A Choupana s/n, 15706 Santiago de Compostela, Spain.
| | | | | | | | | | | |
Collapse
|
26
|
Patrick E, Buckley M, Yang YH. Estimation of data-specific constitutive exons with RNA-Seq data. BMC Bioinformatics 2013; 14:31. [PMID: 23360225 PMCID: PMC3656776 DOI: 10.1186/1471-2105-14-31] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2012] [Accepted: 01/13/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND RNA-Seq has the potential to answer many diverse and interesting questions about the inner workings of cells. Estimating changes in the overall transcription of a gene is not straightforward. Changes in overall gene transcription can easily be confounded with changes in exon usage which alter the lengths of transcripts produced by a gene. Measuring the expression of constitutive exons--xons which are consistently conserved after splicing--ffers an unbiased estimation of the overall transcription of a gene. RESULTS We propose a clustering-based method, exClust, for estimating the exons that are consistently conserved after splicing in a given data set. These are considered as the exons which are "constitutive" in this data. The method utilises information from both annotation and the dataset of interest. The method is implemented in an openly available R function package, sydSeq. CONCLUSION When used on two real datasets exClust includes more than three times as many reads as the standard UI method, and improves concordance with qRT-PCR data. When compared to other methods, our method is shown to produce robust estimates of overall gene transcription.
Collapse
Affiliation(s)
- Ellis Patrick
- School of Mathematics and Statistics, University of Sydney, Sydney NSW 2006, Australia
| | | | | |
Collapse
|
27
|
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. ACTA ACUST UNITED AC 2012; 29:15-21. [PMID: 23104886 DOI: 10.1093/bioinformatics/bts635] [Citation(s) in RCA: 28509] [Impact Index Per Article: 2375.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023]
Abstract
MOTIVATION Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases. RESULTS To align our large (>80 billon reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80-90% success rate, corroborating the high precision of the STAR mapping strategy. AVAILABILITY AND IMPLEMENTATION STAR is implemented as a standalone C++ code. STAR is free open source software distributed under GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.
Collapse
|
28
|
Fonseca NA, Rung J, Brazma A, Marioni JC. Tools for mapping high-throughput sequencing data. Bioinformatics 2012; 28:3169-77. [DOI: 10.1093/bioinformatics/bts605] [Citation(s) in RCA: 207] [Impact Index Per Article: 17.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/18/2022] Open
|
29
|
Duitama J, Srivastava PK, Măndoiu II. Towards accurate detection and genotyping of expressed variants from whole transcriptome sequencing data. BMC Genomics 2012; 13 Suppl 2:S6. [PMID: 22537301 PMCID: PMC3394419 DOI: 10.1186/1471-2164-13-s2-s6] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Massively parallel transcriptome sequencing (RNA-Seq) is becoming the method of choice for studying functional effects of genetic variability and establishing causal relationships between genetic variants and disease. However, RNA-Seq poses new technical and computational challenges compared to genome sequencing. In particular, mapping transcriptome reads onto the genome is more challenging than mapping genomic reads due to splicing. Furthermore, detection and genotyping of single nucleotide variants (SNVs) requires statistical models that are robust to variability in read coverage due to unequal transcript expression levels. RESULTS In this paper we present a strategy to more reliably map transcriptome reads by taking advantage of the availability of both the genome reference sequence and transcript databases such as CCDS. We also present a novel Bayesian model for SNV discovery and genotyping based on quality scores. CONCLUSIONS Experimental results on RNA-Seq data generated from blood cell tissue of three Hapmap individuals show that our methods yield increased accuracy compared to several widely used methods. The open source code implementing our methods, released under the GNU General Public License, is available at http://dna.engr.uconn.edu/software/NGSTools/.
Collapse
Affiliation(s)
- Jorge Duitama
- Department of Computer Science & Engineering, University of Connecticut, 371 Fairfield Rd, Unit 2155, Storrs, CT, 06269-2155, USA.
| | | | | |
Collapse
|
30
|
Pirola Y, Rizzi R, Picardi E, Pesole G, Della Vedova G, Bonizzoni P. PIntron: a fast method for detecting the gene structure due to alternative splicing via maximal pairings of a pattern and a text. BMC Bioinformatics 2012; 13 Suppl 5:S2. [PMID: 22537006 PMCID: PMC3358663 DOI: 10.1186/1471-2105-13-s5-s2] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022] Open
Abstract
BACKGROUND A challenging issue in designing computational methods for predicting the gene structure into exons and introns from a cluster of transcript (EST, mRNA) sequences, is guaranteeing accuracy as well as efficiency in time and space, when large clusters of more than 20,000 ESTs and genes longer than 1 Mb are processed. Traditionally, the problem has been faced by combining different tools, not specifically designed for this task. RESULTS We propose a fast method based on ad hoc procedures for solving the problem. Our method combines two ideas: a novel algorithm of proved small time complexity for computing spliced alignments of a transcript against a genome, and an efficient algorithm that exploits the inherent redundancy of information in a cluster of transcripts to select, among all possible factorizations of EST sequences, those allowing to infer splice site junctions that are largely confirmed by the input data. The EST alignment procedure is based on the construction of maximal embeddings, that are sequences obtained from paths of a graph structure, called embedding graph, whose vertices are the maximal pairings of a genomic sequence T and an EST P. The procedure runs in time linear in the length of P and T and in the size of the output.The method was implemented into the PIntron package. PIntron requires as input a genomic sequence or region and a set of EST and/or mRNA sequences. Besides the prediction of the full-length transcript isoforms potentially expressed by the gene, the PIntron package includes a module for the CDS annotation of the predicted transcripts. CONCLUSIONS PIntron, the software tool implementing our methodology, is available at http://www.algolab.eu/PIntron under GNU AGPL. PIntron has been shown to outperform state-of-the-art methods, and to quickly process some critical genes. At the same time, PIntron exhibits high accuracy (sensitivity and specificity) when benchmarked with ENCODE annotations.
Collapse
Affiliation(s)
- Yuri Pirola
- Dipartimento di Informatica Sistemistica e Comunicazione, Univ, degli Studi di Milano-Bicocca, Milano, 20126, Italy
| | | | | | | | | | | |
Collapse
|
31
|
RNA-Seq mapping and detection of gene fusions with a suffix array algorithm. PLoS Comput Biol 2012; 8:e1002464. [PMID: 22496636 PMCID: PMC3320572 DOI: 10.1371/journal.pcbi.1002464] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2011] [Accepted: 02/21/2012] [Indexed: 12/20/2022] Open
Abstract
High-throughput RNA sequencing enables quantification of transcripts (both known and novel), exon/exon junctions and fusions of exons from different genes. Discovery of gene fusions–particularly those expressed with low abundance– is a challenge with short- and medium-length sequencing reads. To address this challenge, we implemented an RNA-Seq mapping pipeline within the LifeScope software. We introduced new features including filter and junction mapping, annotation-aided pairing rescue and accurate mapping quality values. We combined this pipeline with a Suffix Array Spliced Read (SASR) aligner to detect chimeric transcripts. Performing paired-end RNA-Seq of the breast cancer cell line MCF-7 using the SOLiD system, we called 40 gene fusions among over 120,000 splicing junctions. We validated 36 of these 40 fusions with TaqMan assays, of which 25 were expressed in MCF-7 but not the Human Brain Reference. An intra-chromosomal gene fusion involving the estrogen receptor alpha gene ESR1, and another involving the RPS6KB1 (Ribosomal protein S6 kinase beta-1) were recurrently expressed in a number of breast tumor cell lines and a clinical tumor sample. Advances in sequencing technology are enabling detailed characterization of RNA transcripts from biological samples. The fundamental challenge of accurately mapping the reads on transcripts and gleaning biological meaning from the data remains. One class of transcripts, gene fusions, is particularly important in cancer. Some gene fusions are prominent markers in leukemia, prostate, and other cancers and putatively causative in certain tumor types. We present a set of new RNA-Seq analysis techniques to map reads, and count expression of genes, exons and splicing junctions, especially those that give evidence of gene fusions. These tools are available in a software package with a straightforward graphical user interface. Using this software, we called and validated several gene fusions in a breast cancer cell line. By testing the presence of these fusions in a larger population of tumor cell lines and clinical samples, we found that two of them were expressed recurrently.
Collapse
|
32
|
Rogers MF, Thomas J, Reddy AS, Ben-Hur A. SpliceGrapher: detecting patterns of alternative splicing from RNA-Seq data in the context of gene models and EST data. Genome Biol 2012; 13:R4. [PMID: 22293517 PMCID: PMC3334585 DOI: 10.1186/gb-2012-13-1-r4] [Citation(s) in RCA: 95] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2011] [Revised: 10/25/2011] [Accepted: 01/31/2012] [Indexed: 12/25/2022] Open
Abstract
We propose a method for predicting splice graphs that enhances curated gene models using evidence from RNA-Seq and EST alignments. Results obtained using RNA-Seq experiments in Arabidopsis thaliana show that predictions made by our SpliceGrapher method are more consistent with current gene models than predictions made by TAU and Cufflinks. Furthermore, analysis of plant and human data indicates that the machine learning approach used by SpliceGrapher is useful for discriminating between real and spurious splice sites, and can improve the reliability of detection of alternative splicing. SpliceGrapher is available for download at http://SpliceGrapher.sf.net.
Collapse
Affiliation(s)
- Mark F Rogers
- Department of Computer Science, 1873 Campus Delivery, Colorado State University, Fort Collins, CO 80523-1873, USA
| | | | | | | |
Collapse
|
33
|
Liu Q, Chen C, Shen E, Zhao F, Sun Z, Wu J. Detection, annotation and visualization of alternative splicing from RNA-Seq data with SplicingViewer. Genomics 2011; 99:178-82. [PMID: 22226708 DOI: 10.1016/j.ygeno.2011.12.003] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2011] [Revised: 11/20/2011] [Accepted: 12/16/2011] [Indexed: 11/26/2022]
Abstract
Alternative splicing is a crucial mechanism by which diverse gene products can be generated from a limited number of genes, and is thought to be involved in complex orchestration of eukaryotic gene expression. Next-generation sequencing technologies, with reduced time and cost, provide unprecedented opportunities for deep interrogation of alternative splicing at the genome-wide scale. In this study, an integrated software SplicingViewer has been developed for unambiguous detection, annotation and visualization of splice junctions and alternative splicing events from RNA-Seq data. Specifically, it allows easy identification and characterization of splice junctions, and holds a versatile computational pipeline for in-depth annotation and classification of alternative splicing with different patterns. Moreover, it provides a user-friendly environment in which an alternative splicing landscape can be displayed in a straightforward and flexible manner. In conclusion, SplicingViewer can be widely used for studying alternative splicing easily and efficiently. SplicingViewer can be freely accessed at http://bioinformatics.zj.cn/splicingviewer.
Collapse
Affiliation(s)
- Qi Liu
- Institute of Genomic Medicine, Wenzhou Medical College, Wenzhou, China
| | | | | | | | | | | |
Collapse
|
34
|
Chen LY, Wei KC, Huang ACY, Wang K, Huang CY, Yi D, Tang CY, Galas DJ, Hood LE. RNASEQR--a streamlined and accurate RNA-seq sequence analysis program. Nucleic Acids Res 2011; 40:e42. [PMID: 22199257 PMCID: PMC3315322 DOI: 10.1093/nar/gkr1248] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Next-generation sequencing (NGS) technologies-based transcriptomic profiling method often called RNA-seq has been widely used to study global gene expression, alternative exon usage, new exon discovery, novel transcriptional isoforms and genomic sequence variations. However, this technique also poses many biological and informatics challenges to extracting meaningful biological information. The RNA-seq data analysis is built on the foundation of high quality initial genome localization and alignment information for RNA-seq sequences. Toward this goal, we have developed RNASEQR to accurately and effectively map millions of RNA-seq sequences. We have systematically compared RNASEQR with four of the most widely used tools using a simulated data set created from the Consensus CDS project and two experimental RNA-seq data sets generated from a human glioblastoma patient. Our results showed that RNASEQR yields more accurate estimates for gene expression, complete gene structures and new transcript isoforms, as well as more accurate detection of single nucleotide variants (SNVs). RNASEQR analyzes raw data from RNA-seq experiments effectively and outputs results in a manner that is compatible with a wide variety of specialized downstream analyses on desktop computers.
Collapse
Affiliation(s)
- Leslie Y Chen
- Institute for Systems Biology, Seattle, WA 98109, USA.
| | | | | | | | | | | | | | | | | |
Collapse
|
35
|
Abstract
Abstract
Background
RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments.
Results
We present RSEM, an user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene.
Conclusions
RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.
Collapse
|
36
|
Huang Q, Lin B, Liu H, Ma X, Mo F, Yu W, Li L, Li H, Tian T, Wu D, Shen F, Xing J, Chen ZN. RNA-Seq analyses generate comprehensive transcriptomic landscape and reveal complex transcript patterns in hepatocellular carcinoma. PLoS One 2011; 6:e26168. [PMID: 22043308 PMCID: PMC3197143 DOI: 10.1371/journal.pone.0026168] [Citation(s) in RCA: 79] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2011] [Accepted: 09/21/2011] [Indexed: 02/07/2023] Open
Abstract
RNA-seq is a powerful tool for comprehensive characterization of whole transcriptome at both gene and exon levels and with a unique ability of identifying novel splicing variants. To date, RNA-seq analysis of HBV-related hepatocellular carcinoma (HCC) has not been reported. In this study, we performed transcriptome analyses for 10 matched pairs of cancer and non-cancerous tissues from HCC patients on Solexa/Illumina GAII platform. On average, about 21.6 million sequencing reads and 10.6 million aligned reads were obtained for samples sequenced on each lane, which was able to identify >50% of all the annotated genes for each sample. Furthermore, we identified 1,378 significantly differently expressed genes (DEGs) and 24, 338 differentially expressed exons (DEEs). Comprehensive function analyses indicated that cell growth-related, metabolism-related and immune-related pathways were most significantly enriched by DEGs, pointing to a complex mechanism for HCC carcinogenesis. Positional gene enrichment analysis showed that DEGs were most significantly enriched at chromosome 8q21.3–24.3. The most interesting findings were from the analysis at exon levels where we characterized three major patterns of expression changes between gene and exon levels, implying a much complex landscape of transcript-specific differential expressions in HCC. Finally, we identified a novel highly up-regulated exon-exon junction in ATAD2 gene in HCC tissues. Overall, to our best knowledge, our study represents the most comprehensive characterization of HBV-related HCC transcriptome including exon level expression changes and novel splicing variants, which illustrated the power of RNA-seq and provided important clues for understanding the molecular mechanisms of HCC pathogenesis at system-wide levels.
Collapse
Affiliation(s)
- Qichao Huang
- State Key Laboratory of Cancer Biology, Cell Engineering Research Center and Department of Cell Biology, Fourth Military Medical University, Xi'an, China
| | - Biaoyang Lin
- Systems Biology Division, Zhejiang–California International Nanosystems Institute (ZCNI), Zhejiang University, Hangzhou, China
- Department of Urology, University of Washington, Seattle, Washington, United States of America
| | - Hanqiang Liu
- State Key Laboratory of Cancer Biology, Cell Engineering Research Center and Department of Cell Biology, Fourth Military Medical University, Xi'an, China
| | - Xi Ma
- State Key Laboratory of Cancer Biology, Cell Engineering Research Center and Department of Cell Biology, Fourth Military Medical University, Xi'an, China
| | - Fan Mo
- Systems Biology Division, Zhejiang–California International Nanosystems Institute (ZCNI), Zhejiang University, Hangzhou, China
| | - Wei Yu
- Systems Biology Division, Zhejiang–California International Nanosystems Institute (ZCNI), Zhejiang University, Hangzhou, China
| | - Lisha Li
- Systems Biology Division, Zhejiang–California International Nanosystems Institute (ZCNI), Zhejiang University, Hangzhou, China
| | - Hongwei Li
- State Key Laboratory of Cancer Biology, Cell Engineering Research Center and Department of Cell Biology, Fourth Military Medical University, Xi'an, China
| | - Tian Tian
- Institute of Life Science and Biotechnology, Beijing Jiaotong University, Beijing, People's Republic of China
| | - Dong Wu
- Department of Comprehensive Treatment, Eastern Hepatobiliary Surgery Hospital, Second Military Medical University, Shanghai, People's Republic of China
| | - Feng Shen
- Department of Comprehensive Treatment, Eastern Hepatobiliary Surgery Hospital, Second Military Medical University, Shanghai, People's Republic of China
| | - Jinliang Xing
- State Key Laboratory of Cancer Biology, Cell Engineering Research Center and Department of Cell Biology, Fourth Military Medical University, Xi'an, China
- * E-mail: (JX); (Z-NC)
| | - Zhi-Nan Chen
- State Key Laboratory of Cancer Biology, Cell Engineering Research Center and Department of Cell Biology, Fourth Military Medical University, Xi'an, China
- * E-mail: (JX); (Z-NC)
| |
Collapse
|
37
|
Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 2011; 12:323. [PMID: 21816040 DOI: 10.1007/978-1-4939-0512-63] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2011] [Accepted: 08/04/2011] [Indexed: 05/28/2023] Open
Abstract
BACKGROUND RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments. RESULTS We present RSEM, an user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene. CONCLUSIONS RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.
Collapse
Affiliation(s)
- Bo Li
- Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA
| | | |
Collapse
|
38
|
Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 2011; 12:323. [PMID: 21816040 PMCID: PMC3163565 DOI: 10.1186/1471-2105-12-323] [Citation(s) in RCA: 12997] [Impact Index Per Article: 999.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2011] [Accepted: 08/04/2011] [Indexed: 02/07/2023] Open
Abstract
Background RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments. Results We present RSEM, an user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene. Conclusions RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.
Collapse
Affiliation(s)
- Bo Li
- Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA
| | | |
Collapse
|
39
|
Abstract
BACKGROUND RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments. RESULTS We present RSEM, an user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene. CONCLUSIONS RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.
Collapse
Affiliation(s)
- Bo Li
- Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA
| | | |
Collapse
|
40
|
Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 2011. [PMID: 21816040 DOI: 10.1186/1471‐2105‐12‐323] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments. RESULTS We present RSEM, an user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene. CONCLUSIONS RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.
Collapse
Affiliation(s)
- Bo Li
- Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA
| | | |
Collapse
|
41
|
Huang W, Umbach DM, Vincent Jordan N, Abell AN, Johnson GL, Li L. Efficiently identifying genome-wide changes with next-generation sequencing data. Nucleic Acids Res 2011; 39:e130. [PMID: 21803788 PMCID: PMC3201882 DOI: 10.1093/nar/gkr592] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open
Abstract
We propose a new and effective statistical framework for identifying genome-wide differential changes in epigenetic marks with ChIP-seq data or gene expression with mRNA-seq data, and we develop a new software tool EpiCenter that can efficiently perform data analysis. The key features of our framework are: (i) providing multiple normalization methods to achieve appropriate normalization under different scenarios, (ii) using a sequence of three statistical tests to eliminate background regions and to account for different sources of variation and (iii) allowing adjustment for multiple testing to control false discovery rate (FDR) or family-wise type I error. Our software EpiCenter can perform multiple analytic tasks including: (i) identifying genome-wide epigenetic changes or differentially expressed genes, (ii) finding transcription factor binding sites and (iii) converting multiple-sample sequencing data into a single read-count data matrix. By simulation, we show that our framework achieves a low FDR consistently over a broad range of read coverage and biological variation. Through two real examples, we demonstrate the effectiveness of our framework and the usages of our tool. In particular, we show that our novel and robust 'parsimony' normalization method is superior to the widely-used 'tagRatio' method. Our software EpiCenter is freely available to the public.
Collapse
Affiliation(s)
- Weichun Huang
- Biostatistics Branch, National Institute of Environmental Health Sciences, RTP, NC 27709, USA.
| | | | | | | | | | | |
Collapse
|
42
|
Huang S, Zhang J, Li R, Zhang W, He Z, Lam TW, Peng Z, Yiu SM. SOAPsplice: Genome-Wide ab initio Detection of Splice Junctions from RNA-Seq Data. Front Genet 2011; 2:46. [PMID: 22303342 PMCID: PMC3268599 DOI: 10.3389/fgene.2011.00046] [Citation(s) in RCA: 79] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2011] [Accepted: 06/26/2011] [Indexed: 11/23/2022] Open
Abstract
RNA-Seq, a method using next generation sequencing technologies to sequence the transcriptome, facilitates genome-wide analysis of splice junction sites. In this paper, we introduce SOAPsplice, a robust tool to detect splice junctions using RNA-Seq data without using any information of known splice junctions. SOAPsplice uses a novel two-step approach consisting of first identifying as many reasonable splice junction candidates as possible, and then, filtering the false positives with two effective filtering strategies. In both simulated and real datasets, SOAPsplice is able to detect many reliable splice junctions with low false positive rate. The improvement gained by SOAPsplice, when compared to other existing tools, becomes more obvious when the depth of sequencing is low. SOAPsplice is freely available at http://soap.genomics.org.cn/soapsplice.html.
Collapse
Affiliation(s)
- Songbo Huang
- Bioinformatics Center, Beijing Genomics Institute at Shenzhen Shenzhen, China
| | | | | | | | | | | | | | | |
Collapse
|
43
|
Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods 2011; 8:469-77. [PMID: 21623353 DOI: 10.1038/nmeth.1613] [Citation(s) in RCA: 639] [Impact Index Per Article: 49.2] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/18/2023]
Abstract
High-throughput RNA sequencing (RNA-seq) promises a comprehensive picture of the transcriptome, allowing for the complete annotation and quantification of all genes and their isoforms across samples. Realizing this promise requires increasingly complex computational methods. These computational challenges fall into three main categories: (i) read mapping, (ii) transcriptome reconstruction and (iii) expression quantification. Here we explain the major conceptual and practical challenges, and the general classes of solutions for each category. Finally, we highlight the interdependence between these categories and discuss the benefits for different biological applications.
Collapse
|
44
|
Jean G, Kahles A, Sreedharan VT, De Bona F, Rätsch G. RNA-Seq read alignments with PALMapper. ACTA ACUST UNITED AC 2011; Chapter 11:Unit 11.6. [PMID: 21154708 DOI: 10.1002/0471250953.bi1106s32] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
Next-generation sequencing technologies have revolutionized genome and transcriptome sequencing. RNA-Seq experiments are able to generate huge amounts of transcriptome sequence reads at a fraction of the cost of Sanger sequencing. Reads produced by these technologies are relatively short and error prone. To utilize such reads for transcriptome reconstruction and gene-structure identification, one needs to be able to accurately align the sequence reads over intron boundaries. In this unit, we describe PALMapper, a fast and easy-to-use tool that is designed to accurately compute both unspliced and spliced alignments for millions of RNA-Seq reads. It combines the efficient read mapper GenomeMapper with the spliced aligner QPALMA, which exploits read-quality information and predictions of splice sites to improve the alignment accuracy. The PALMapper package is available as a command-line tool running on Unix or Mac OS X systems or through a Web interface based on Galaxy tools.
Collapse
Affiliation(s)
- Géraldine Jean
- Friedrich Miescher Laboratory, Max Planck Society, Tübingen, Germany
| | | | | | | | | |
Collapse
|
45
|
Abstract
The development of next-generation sequencing technologies has enabled the transcriptome to be measured and characterized at a level which was previously unattainable. Shot gun sequencing of RNAs, or RNA-Seq as it is known, is providing the means to simultaneously survey locus activity, transcript-specific expression, sequence content of transcripts and transcriptome discovery. This article discusses the current state of RNA-Seq, its potential for redefining transcriptomics and some of the challenges associated with this revolutionary technology.
Collapse
Affiliation(s)
- Karin S Kassahn
- Queensland Centre for Medical Genomics, Institute for Molecular Bioscience, Brisbane, Australia
| | | | | |
Collapse
|
46
|
Nacu S, Yuan W, Kan Z, Bhatt D, Rivers CS, Stinson J, Peters BA, Modrusan Z, Jung K, Seshagiri S, Wu TD. Deep RNA sequencing analysis of readthrough gene fusions in human prostate adenocarcinoma and reference samples. BMC Med Genomics 2011; 4:11. [PMID: 21261984 PMCID: PMC3041646 DOI: 10.1186/1755-8794-4-11] [Citation(s) in RCA: 123] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2010] [Accepted: 01/24/2011] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Readthrough fusions across adjacent genes in the genome, or transcription-induced chimeras (TICs), have been estimated using expressed sequence tag (EST) libraries to involve 4-6% of all genes. Deep transcriptional sequencing (RNA-Seq) now makes it possible to study the occurrence and expression levels of TICs in individual samples across the genome. METHODS We performed single-end RNA-Seq on three human prostate adenocarcinoma samples and their corresponding normal tissues, as well as brain and universal reference samples. We developed two bioinformatics methods to specifically identify TIC events: a targeted alignment method using artificial exon-exon junctions within 200,000 bp from adjacent genes, and genomic alignment allowing splicing within individual reads. We performed further experimental verification and characterization of selected TIC and fusion events using quantitative RT-PCR and comparative genomic hybridization microarrays. RESULTS Targeted alignment against artificial exon-exon junctions yielded 339 distinct TIC events, including 32 gene pairs with multiple isoforms. The false discovery rate was estimated to be 1.5%. Spliced alignment to the genome was less sensitive, finding only 18% of those found by targeted alignment in 33-nt reads and 59% of those in 50-nt reads. However, spliced alignment revealed 30 cases of TICs with intervening exons, in addition to distant inversions, scrambled genes, and translocations. Our findings increase the catalog of observed TIC gene pairs by 66%.We verified 6 of 6 predicted TICs in all prostate samples, and 2 of 5 predicted novel distant gene fusions, both private events among 54 prostate tumor samples tested. Expression of TICs correlates with that of the upstream gene, which can explain the prostate-specific pattern of some TIC events and the restriction of the SLC45A3-ELK4 e4-e2 TIC to ERG-negative prostate samples, as confirmed in 20 matched prostate tumor and normal samples and 9 lung cancer cell lines. CONCLUSIONS Deep transcriptional sequencing and analysis with targeted and spliced alignment methods can effectively identify TIC events across the genome in individual tissues. Prostate and reference samples exhibit a wide range of TIC events, involving more genes than estimated previously using ESTs. Tissue specificity of TIC events is correlated with expression patterns of the upstream gene. Some TIC events, such as MSMB-NCOA4, may play functional roles in cancer.
Collapse
Affiliation(s)
- Serban Nacu
- Department of Bioinformatics and Molecular Biology, Genentech, Inc, South San Francisco, California 94080, USA
| | | | | | | | | | | | | | | | | | | | | |
Collapse
|
47
|
Xiao X, Lee JH. Systems analysis of alternative splicing and its regulation. WILEY INTERDISCIPLINARY REVIEWS-SYSTEMS BIOLOGY AND MEDICINE 2011; 2:550-565. [PMID: 20836047 DOI: 10.1002/wsbm.84] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Abstract
Alternative splicing (AS) has emerged as a key mechanism that accounts for gene expression diversity in metazoan organisms. Splicing is tightly regulated by a repertoire of RNA and protein factors and RNA sequence elements that function in a cooperative manner. Systems-level experimental and computational approaches have been instrumental in establishing comprehensive profiles of transcript variants generated by AS. In addition, systems biology approaches are starting to define how combinatorial splicing regulation shapes the complex splicing phenotypes observed in different tissue types and developmental stages and under different conditions. Here, we review recent progress in these areas.
Collapse
Affiliation(s)
- Xinshu Xiao
- Department of Physiological Science, University of California, Los Angeles, CA 90095, USA.,Molecular Biology Institute, University of California, Los Angeles, CA 90095, USA
| | - Jae-Hyung Lee
- Department of Physiological Science, University of California, Los Angeles, CA 90095, USA.,Molecular Biology Institute, University of California, Los Angeles, CA 90095, USA
| |
Collapse
|
48
|
Lovci MT, Li HR, Fu XD, Yeo GW. RNA-seq analysis of gene expression and alternative splicing by double-random priming strategy. Methods Mol Biol 2011; 729:247-55. [PMID: 21365495 PMCID: PMC3918474 DOI: 10.1007/978-1-61779-065-2_16] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/08/2023]
Abstract
Transcriptome analysis by deep sequencing, more commonly known as RNA-seq is, becoming the method of choice for gene discovery and quantitative splicing detection. We published a double-random priming RNA-seq approach capable of generating strand-specific information [Li et al., Proc Natl Acad Sci USA 105:20179-20184, 2008]. Poly(A)+ RNA from a treated and an untreated sample were utilized to generate RNA-seq libraries that were sequenced on the Illumina GA1 analyzer. Statistical analysis of approximately ten million sequence reads generated from both control and treated cells suggests that this tag density is sufficient for quantitative analysis of gene expression. We were also able to detect a large fraction of reads corresponding to annotated alternative exons, with a subset of the reads matching known and detecting new splice junctions. In this chapter, we provide a detailed, bench-ready protocol for the double-random priming method and provide user-friendly templates for the curve-fitting model described in the paper to estimate the tag density needed for optimal detection of regulated gene expression and alternative splicing.
Collapse
Affiliation(s)
- Michael T Lovci
- Department of Cellular and Molecular Medicine, Stem Cell Program, University of California, San Diego, La Jolla, CA, USA
| | | | | | | |
Collapse
|
49
|
Abstract
Many methods and tools are available for preprocessing high-throughput RNA sequencing data and detecting differential expression.
Collapse
Affiliation(s)
- Alicia Oshlack
- Bioinformatics Division, Walter and Eliza Hall Institute, 1G Royal Parade, Parkville 3052, Australia.
| | | | | |
Collapse
|
50
|
Dimon MT, Sorber K, DeRisi JL. HMMSplicer: a tool for efficient and sensitive discovery of known and novel splice junctions in RNA-Seq data. PLoS One 2010; 5:e13875. [PMID: 21079731 PMCID: PMC2975632 DOI: 10.1371/journal.pone.0013875] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/10/2010] [Accepted: 09/16/2010] [Indexed: 02/01/2023] Open
Abstract
Background High-throughput sequencing of an organism's transcriptome, or RNA-Seq, is a valuable and versatile new strategy for capturing snapshots of gene expression. However, transcriptome sequencing creates a new class of alignment problem: mapping short reads that span exon-exon junctions back to the reference genome, especially in the case where a splice junction is previously unknown. Methodology/Principal Findings Here we introduce HMMSplicer, an accurate and efficient algorithm for discovering canonical and non-canonical splice junctions in short read datasets. HMMSplicer identifies more splice junctions than currently available algorithms when tested on publicly available A. thaliana, P. falciparum, and H. sapiens datasets without a reduction in specificity. Conclusions/Significance HMMSplicer was found to perform especially well in compact genomes and on genes with low expression levels, alternative splice isoforms, or non-canonical splice junctions. Because HHMSplicer does not rely on pre-built gene models, the products of inexact splicing are also detected. For H. sapiens, we find 3.6% of 3′ splice sites and 1.4% of 5′ splice sites are inexact, typically differing by 3 bases in either direction. In addition, HMMSplicer provides a score for every predicted junction allowing the user to set a threshold to tune false positive rates depending on the needs of the experiment. HMMSplicer is implemented in Python. Code and documentation are freely available at http://derisilab.ucsf.edu/software/hmmsplicer.
Collapse
Affiliation(s)
- Michelle T. Dimon
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, California, United States of America
- Biological and Medical Informatics Program, University of California San Francisco, San Francisco, California, United States of America
| | - Katherine Sorber
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, California, United States of America
| | - Joseph L. DeRisi
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, California, United States of America
- Howard Hughes Medical Institute, Bethesda, Maryland, United States of America
- * E-mail:
| |
Collapse
|