1
Foltz SM, Greene CS, Taroni JN. Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously. Commun Biol 2023; 6:222. [PMID: 36841852 PMCID: PMC9968332 DOI: 10.1038/s42003-023-04588-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 10/02/2017] [Accepted: 02/13/2023] [Indexed: 02/27/2023]
Abstract
Large compendia of gene expression data have proven valuable for the discovery of novel biological relationships. Historically, most available RNA assays were run on microarray, while RNA-seq is now the platform of choice for many new experiments. The data structure and distributions between the platforms differ, making it challenging to combine them directly. Here we perform supervised and unsupervised machine learning evaluations to assess which existing normalization methods are best suited for combining microarray and RNA-seq data. We find that quantile and Training Distribution Matching normalization allow for supervised and unsupervised model training on microarray and RNA-seq data simultaneously. Nonparanormal normalization and z-scores are also appropriate for some applications, including pathway analysis with Pathway-Level Information Extractor (PLIER). We demonstrate that it is possible to perform effective cross-platform normalization using existing methods to combine microarray and RNA-seq data for machine learning applications.
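As an illustration of the quantile normalization named in this abstract, the following is a minimal sketch in plain Python. The function name `quantile_normalize` and the toy expression values are assumptions made for this example; the abstract does not describe the authors' exact procedure, and this is not their code. Each value is replaced by the mean, across samples, of the values sharing its rank, so all samples end up with the same distribution.

```python
from statistics import mean


def quantile_normalize(samples):
    """Quantile-normalize equally long samples (lists of floats).

    Each value is replaced by the mean, across samples, of the values that
    share its rank. Ties are broken by position, which is a simplification.
    """
    n_genes = len(samples[0])
    if any(len(s) != n_genes for s in samples):
        raise ValueError("all samples must cover the same genes")

    # Reference distribution: mean of the k-th smallest value across samples.
    sorted_samples = [sorted(s) for s in samples]
    reference = [mean(column) for column in zip(*sorted_samples)]

    normalized = []
    for s in samples:
        # Gene indices ordered from lowest to highest expression.
        order = sorted(range(n_genes), key=lambda i: s[i])
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical toy values: one microarray-like and one RNA-seq-like sample
# measured over the same four genes.
microarray = [7.0, 9.0, 8.0, 11.0]
rnaseq = [2.0, 50.0, 10.0, 400.0]
array_qn, seq_qn = quantile_normalize([microarray, rnaseq])
```

After normalization the two samples share one set of values, and only the gene ranks within each sample determine which value a gene receives.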
Affiliation(s)
- Steven M Foltz
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
  - Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Wynnewood, PA, USA
- Casey S Greene
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
  - Center for Health AI, University of Colorado School of Medicine, Aurora, CO, USA
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA
- Jaclyn N Taroni
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
  - Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Wynnewood, PA, USA
2
Zanella L, Facco P, Bezzo F, Cimetta E. Feature Selection and Molecular Classification of Cancer Phenotypes: A Comparative Study. Int J Mol Sci 2022; 23:9087. [PMID: 36012350 PMCID: PMC9408964 DOI: 10.3390/ijms23169087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2022] [Revised: 08/09/2022] [Accepted: 08/11/2022] [Indexed: 11/16/2022]
Abstract
The classification of high dimensional gene expression data is key to the development of effective diagnostic and prognostic tools. Feature selection involves finding the best subset with the highest power in predicting class labels. Here, we conducted a comparative study focused on different combinations of feature selectors (Chi-Squared, mRMR, Relief-F, and Genetic Algorithms) and classification learning algorithms (Random Forests, PLS-DA, SVM, Regularized Logistic/Multinomial Regression, and kNN) to identify those with the best predictive capacity. The performance of each combination is evaluated through an empirical study on three benchmark cancer-related microarray datasets. Our results first suggest that the quality of the data relevant to the target classes is key for the successful classification of cancer phenotypes. We also proved that, for a given classification learning algorithm and dataset, all filters have a similar performance. Interestingly, filters achieve comparable or even better results with respect to the GA-based wrappers, while also being easier and faster to implement. Taken together, our findings suggest that simple, well-established feature selectors in combination with optimized classifiers guarantee good performances, with no need for complicated and computationally demanding methodologies.
Affiliation(s)
- Luca Zanella
  - Department of Industrial Engineering (DII), University of Padova, 35131 Padova, Italy
- Pierantonio Facco
  - Department of Industrial Engineering (DII), University of Padova, 35131 Padova, Italy
- Fabrizio Bezzo
  - Department of Industrial Engineering (DII), University of Padova, 35131 Padova, Italy
- Elisa Cimetta
  - Department of Industrial Engineering (DII), University of Padova, 35131 Padova, Italy
  - Fondazione Istituto di Ricerca Pediatrica Città della Speranza (IRP), 35127 Padova, Italy
3
Peters TJ, French HJ, Bradford ST, Pidsley R, Stirzaker C, Varinli H, Nair S, Qu W, Song J, Giles KA, Statham AL, Speirs H, Speed TP, Clark SJ. Evaluation of cross-platform and interlaboratory concordance via consensus modelling of genomic measurements. Bioinformatics 2019; 35:560-570. [PMID: 30084929 PMCID: PMC6378945 DOI: 10.1093/bioinformatics/bty675] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 05/02/2018] [Revised: 07/10/2018] [Accepted: 07/31/2018] [Indexed: 01/23/2023]
Abstract
Motivation: A synoptic view of the human genome benefits chiefly from the application of nucleic acid sequencing and microarray technologies. These platforms allow interrogation of patterns such as gene expression and DNA methylation at the vast majority of canonical loci, allowing granular insights and opportunities for validation of original findings. However, problems arise when validating against a “gold standard” measurement, since this immediately biases all subsequent measurements towards that particular technology or protocol. Since all genomic measurements are estimates, in the absence of a “gold standard” we instead empirically assess the measurement precision and sensitivity of a large suite of genomic technologies via a consensus modelling method called the row-linear model. This method is an application of the American Society for Testing and Materials Standard E691 for assessing interlaboratory precision and sources of variability across multiple testing sites. Both cross-platform and cross-locus comparisons can be made across all common loci, allowing identification of technology- and locus-specific tendencies. Results: We assess technologies including the Infinium MethylationEPIC BeadChip, whole genome bisulfite sequencing (WGBS), two different RNA-Seq protocols (PolyA+ and Ribo-Zero) and five different gene expression array platforms. Each technology thus is characterised herein, relative to the consensus. We showcase a number of applications of the row-linear model, including correlation with known interfering traits. We demonstrate a clear effect of cross-hybridisation on the sensitivity of Infinium methylation arrays. Additionally, we perform a true interlaboratory test on a set of samples interrogated on the same platform across twenty-one separate testing laboratories. Availability and implementation: A full implementation of the row-linear model, plus extra functions for visualisation, are found in the R package consensus at https://github.com/timpeters82/consensus. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Timothy J Peters
  - Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
- Hugh J French
  - Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
  - South Western Sydney Clinical School, Faculty of Medicine, University of New South Wales, Liverpool, NSW, Australia
- Stephen T Bradford
  - Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
  - CSIRO Health and Biosecurity, North Ryde, NSW, Australia
- Ruth Pidsley
  - Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
- Clare Stirzaker
  - Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
  - St Vincent's Clinical School, Faculty of Medicine, UNSW, Darlinghurst, NSW, Australia
- Hilal Varinli
  - Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
  - CSIRO Health and Biosecurity, North Ryde, NSW, Australia
  - Department of Biological Sciences, Macquarie University, North Ryde, NSW, Australia
  - NSW Ministry of Health, LMB 961, North Sydney, NSW, Australia
- Shalima Nair
  - Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
- Wenjia Qu
  - Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
- Jenny Song
  - Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
- Katherine A Giles
  - Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
- Aaron L Statham
  - Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
- Helen Speirs
  - Ramaciotti Centre for Genomics, University of New South Wales, Randwick, NSW, Australia
- Terence P Speed
  - Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia
  - Department of Mathematics & Statistics, University of Melbourne, Melbourne, VIC, Australia
- Susan J Clark
  - Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
  - St Vincent's Clinical School, Faculty of Medicine, UNSW, Darlinghurst, NSW, Australia
4
Lim SB, Tan SJ, Lim WT, Lim CT. Compendiums of cancer transcriptomes for machine learning applications. Sci Data 2019; 6:194. [PMID: 31594947 PMCID: PMC6783425 DOI: 10.1038/s41597-019-0207-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 02/15/2019] [Accepted: 07/25/2019] [Indexed: 12/18/2022]
Abstract
There are massive transcriptome profiles in the form of microarray. The challenge is that they are processed using diverse platforms and preprocessing tools, requiring considerable time and informatics expertise for cross-dataset analyses. If there exists a single, integrated data source, data-reuse can be facilitated for discovery, analysis, and validation of biomarker-based clinical strategy. Here, we present merged microarray-acquired datasets (MMDs) across 11 major cancer types, curating 8,386 patient-derived tumor and tumor-free samples from 95 GEO datasets. Using machine learning algorithms, we show that diagnostic models trained from MMDs can be directly applied to RNA-seq-acquired TCGA data with high classification accuracy. Machine learning optimized MMD further aids to reveal immune landscape across various carcinomas critically needed in disease management and clinical interventions. This unified data source may serve as an excellent training or test set to apply, develop, and refine machine learning algorithms that can be tapped to better define genomic landscape of human cancers.
Affiliation(s)
- Su Bin Lim
  - NUS Graduate School for Integrative Sciences & Engineering, National University of Singapore, Singapore, Singapore
  - Department of Biomedical Engineering, National University of Singapore, Singapore, Singapore
- Swee Jin Tan
  - Regional Scientific Affairs, Sysmex Asia Pacific, Singapore, Singapore
- Wan-Teck Lim
  - Division of Medical Oncology, National Cancer Centre Singapore, Singapore, Singapore
  - Office of Academic and Clinical Development, Duke-NUS Medical School, Singapore, Singapore
  - IMCB NCC MPI Singapore Oncogenome Laboratory, Institute of Molecular and Cell Biology (IMCB), A*STAR, Singapore, Singapore
- Chwee Teck Lim
  - NUS Graduate School for Integrative Sciences & Engineering, National University of Singapore, Singapore, Singapore
  - Department of Biomedical Engineering, National University of Singapore, Singapore, Singapore
  - Mechanobiology Institute, National University of Singapore, Singapore, Singapore
  - Institute for Health Innovation and Technology (iHealthtech), National University of Singapore, Singapore, Singapore
5
Franks JM, Cai G, Whitfield ML. Feature specific quantile normalization enables cross-platform classification of molecular subtypes using gene expression data. Bioinformatics 2018; 34:1868-1874. [PMID: 29360996 DOI: 10.1093/bioinformatics/bty026] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Received: 07/18/2017] [Accepted: 01/16/2018] [Indexed: 12/22/2022]
Abstract
Motivation: Molecular subtypes of cancers and autoimmune disease, defined by transcriptomic profiling, have provided insight into disease pathogenesis, molecular heterogeneity and therapeutic responses. However, technical biases inherent to different gene expression profiling platforms present a unique problem when analyzing data generated from different studies. Currently, there is a lack of effective methods designed to eliminate platform-based bias. We present a method to normalize and classify RNA-seq data using machine learning classifiers trained on DNA microarray data and molecular subtypes in two datasets: breast invasive carcinoma (BRCA) and colorectal cancer (CRC). Results: Multiple analyses show that feature specific quantile normalization (FSQN) successfully removes platform-based bias from RNA-seq data, regardless of feature scaling or machine learning algorithm. We achieve up to 98% accuracy for BRCA data and 97% accuracy for CRC data in assigning molecular subtypes to RNA-seq data normalized using FSQN and a support vector machine trained exclusively on DNA microarray data. We find that maximum accuracy was achieved when normalizing RNA-seq datasets that contain at least 25 samples. FSQN allows comparison of RNA-seq data to existing DNA microarray datasets. Using these techniques, we can successfully leverage information from existing gene expression data in new analyses despite different platforms used for gene expression profiling. Availability and implementation: FSQN has been submitted as an R package to CRAN. All code used for this study is available on Github (https://github.com/jenniferfranks/FSQN). Contact: michael.l.whitfield@dartmouth.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
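The idea of normalizing each feature separately can be sketched as follows. This is a hedged illustration in plain Python, not the FSQN R package: the names `fsqn_sketch` and `target_quantile`, the linear interpolation used when sample counts differ, and the toy matrices are all assumptions for this example. For every gene, the RNA-seq values are ranked across samples and replaced by the matching quantile of that gene's microarray values.

```python
def target_quantile(sorted_target, q):
    """Linearly interpolated quantile q (0..1) of an ascending list."""
    pos = q * (len(sorted_target) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_target) - 1)
    frac = pos - lo
    return sorted_target[lo] * (1 - frac) + sorted_target[hi] * frac


def fsqn_sketch(test, target):
    """Map each gene (column) of `test` onto that gene's distribution in `target`.

    Both arguments are lists of samples; each sample is a list of gene values
    in the same gene order. The number of samples may differ.
    """
    n_test = len(test)
    n_genes = len(test[0])
    result = [[0.0] * n_genes for _ in range(n_test)]
    for g in range(n_genes):
        sorted_target = sorted(sample[g] for sample in target)
        # Rank the test samples by this gene's value (ties broken by position).
        order = sorted(range(n_test), key=lambda i: test[i][g])
        for rank, sample_index in enumerate(order):
            q = rank / (n_test - 1) if n_test > 1 else 0.5
            result[sample_index][g] = target_quantile(sorted_target, q)
    return result


# Hypothetical toy data: rows are samples, columns are two genes.
microarray_train = [[5.0, 1.0], [7.0, 2.0], [9.0, 3.0]]
rnaseq_test = [[100.0, 30.0], [10.0, 20.0], [1000.0, 10.0]]
rnaseq_fsqn = fsqn_sketch(rnaseq_test, microarray_train)
```

In this toy case each RNA-seq gene keeps its sample ordering but takes on the value range of the corresponding microarray gene, which is what would let a classifier trained on the microarray scale be applied to it.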
Affiliation(s)
- Guoshuai Cai
  - Department of Environmental Health Sciences, Arnold School of Public Health, University of South Carolina, Columbia, SC, 29208, USA
- Michael L Whitfield
  - Department of Molecular and Systems Biology
  - Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Lebanon, NH, 03756, USA
6
Pedersen CB, Nielsen FC, Rossing M, Olsen LR. Using microarray-based subtyping methods for breast cancer in the era of high-throughput RNA sequencing. Mol Oncol 2018; 12:2136-2146. [PMID: 30289602 PMCID: PMC6275246 DOI: 10.1002/1878-0261.12389] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 05/09/2018] [Revised: 09/19/2018] [Accepted: 09/25/2018] [Indexed: 11/30/2022]
Abstract
Breast cancer is a highly heterogeneous disease that can be classified into multiple subtypes based on the tumor transcriptome. Most of the subtyping schemes used in clinics today are derived from analyses of microarray data from thousands of different tumors together with clinical data for the patients from which the tumors were isolated. However, RNA sequencing (RNA‐Seq) is gradually replacing microarrays as the preferred transcriptomics platform, and although transcript abundances measured by the two different technologies are largely compatible, subtyping methods developed for probe‐based microarray data are incompatible with RNA‐Seq as input data. Here, we present an RNA‐Seq data processing pipeline, which relies on the mapping of sequencing reads to the probe set target sequences instead of the human reference genome, thereby enabling probe‐based subtyping of breast cancer tumor tissue using sequencing‐based transcriptomics. By analyzing 66 breast cancer tumors for which gene expression was measured using both microarrays and RNA‐Seq, we show that RNA‐Seq data can be directly compared to microarray data using our pipeline. Additionally, we demonstrate that the established subtyping method CITBCMST (Guedj et al., 2012), which relies on a 375 probe set‐signature to classify samples into the six subtypes basL, lumA, lumB, lumC, mApo, and normL, can be applied without further modifications. This pipeline enables a seamless transition to sequencing‐based transcriptomics for future clinical purposes.
Affiliation(s)
- Christina Bligaard Pedersen
  - Department of Bio and Health Informatics, Technical University of Denmark, Kemitorvet, Kongens Lyngby, Denmark
  - Center for Genomic Medicine, Rigshospitalet - Copenhagen University Hospital, Denmark
- Finn Cilius Nielsen
  - Center for Genomic Medicine, Rigshospitalet - Copenhagen University Hospital, Denmark
- Maria Rossing
  - Center for Genomic Medicine, Rigshospitalet - Copenhagen University Hospital, Denmark
- Lars Rønn Olsen
  - Department of Bio and Health Informatics, Technical University of Denmark, Kemitorvet, Kongens Lyngby, Denmark
  - Center for Genomic Medicine, Rigshospitalet - Copenhagen University Hospital, Denmark
7
Dapas M, Kandpal M, Bi Y, Davuluri RV. Comparative evaluation of isoform-level gene expression estimation algorithms for RNA-seq and exon-array platforms. Brief Bioinform 2017; 18:260-269. [PMID: 26944083 PMCID: PMC5444266 DOI: 10.1093/bib/bbw016] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 10/29/2015] [Indexed: 01/04/2023]
Abstract
Given that the majority of multi-exon genes generate diverse functional products, it is important to evaluate expression at the isoform level. Previous studies have demonstrated strong gene-level correlations between RNA sequencing (RNA-seq) and microarray platforms, but have not studied their concordance at the isoform level. We performed transcript abundance estimation on raw RNA-seq and exon-array expression profiles available for common glioblastoma multiforme samples from The Cancer Genome Atlas using different analysis pipelines, and compared both the isoform- and gene-level expression estimates between programs and platforms. The results showed better concordance between RNA-seq/exon-array and reverse transcription-quantitative polymerase chain reaction (RT-qPCR) platforms for fold change estimates than for raw abundance estimates, suggesting that fold change normalization against a control is an important step for integrating expression data across platforms. Based on RT-qPCR validations, eXpress and Multi-Mapping Bayesian Gene eXpression (MMBGX) programs achieved the best performance for RNA-seq and exon-array platforms, respectively, for deriving the isoform-level fold change values. While eXpress achieved the highest correlation with the RT-qPCR and exon-array (MMBGX) results overall, RSEM was more highly correlated with MMBGX for the subset of transcripts that are highly variable across the samples. eXpress appears to be most successful in discriminating lowly expressed transcripts, but IsoformEx and RSEM correlate more strongly with MMBGX for highly expressed transcripts. The results also reinforce how potentially important isoform-level expression changes can be masked by gene-level estimates, and demonstrate that exon arrays yield comparable results to RNA-seq for evaluating isoform-level expression changes.
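One way to see why fold changes can agree across platforms better than raw abundances is the following minimal sketch. It assumes, as a simplification not stated in the abstract, that the two platforms differ only by a multiplicative scale; the function name `log2_fold_change` and all numbers are hypothetical. Under that assumption the scale cancels in the case-to-control ratio.

```python
import math


def log2_fold_change(case, control):
    """log2 ratio of case abundance to control abundance."""
    if case <= 0 or control <= 0:
        raise ValueError("abundances must be positive")
    return math.log2(case / control)


# Hypothetical measurements of one transcript on two platforms whose raw
# scales are not directly comparable.
rnaseq_control, rnaseq_case = 20.0, 80.0     # sequencing-based abundance
array_control, array_case = 150.0, 600.0     # array intensity

lfc_seq = log2_fold_change(rnaseq_case, rnaseq_control)
lfc_array = log2_fold_change(array_case, array_control)
```

The raw values differ several-fold between the platforms, while both fold changes equal 2.0 (a fourfold increase). Real platform differences are not purely multiplicative, so agreement in practice is only approximate.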
Affiliation(s)
- Manoj Kandpal
  - Department of Veterinary Surgery & Radiology, College of Veterinary & Animal Sciences, GBPUAT, Pantnagar - 263 145, Uttarakhand, India
- Yingtao Bi
  - Center for Systems and Computational Biology, Molecular and Cellular Oncogenesis Program, The Wistar Institute, 19104 Philadelphia, PA, USA
- Ramana V Davuluri
  - Center for Systems and Computational Biology, Molecular and Cellular Oncogenesis Program, The Wistar Institute, 19104 Philadelphia, PA, USA
8
Kan M, Shumyatcher M, Himes BE. Using omics approaches to understand pulmonary diseases. Respir Res 2017; 18:149. [PMID: 28774304 PMCID: PMC5543452 DOI: 10.1186/s12931-017-0631-9] [Citation(s) in RCA: 75] [Impact Index Per Article: 10.7] [Received: 05/01/2017] [Accepted: 07/26/2017] [Indexed: 12/24/2022]
Abstract
Omics approaches are high-throughput unbiased technologies that provide snapshots of various aspects of biological systems and include: 1) genomics, the measure of DNA variation; 2) transcriptomics, the measure of RNA expression; 3) epigenomics, the measure of DNA alterations not involving sequence variation that influence RNA expression; 4) proteomics, the measure of protein expression or its chemical modifications; and 5) metabolomics, the measure of metabolite levels. Our understanding of pulmonary diseases has increased as a result of applying these omics approaches to characterize patients, uncover mechanisms underlying drug responsiveness, and identify effects of environmental exposures and interventions. As more tissue- and cell-specific omics data is analyzed and integrated for diverse patients under various conditions, there will be increased identification of key mechanisms that underlie pulmonary biological processes, disease endotypes, and novel therapeutics that are efficacious in select individuals. We provide a synopsis of how omics approaches have advanced our understanding of asthma, chronic obstructive pulmonary disease (COPD), acute respiratory distress syndrome (ARDS), idiopathic pulmonary fibrosis (IPF), and pulmonary arterial hypertension (PAH), and we highlight ongoing work that will facilitate pulmonary disease precision medicine.
Affiliation(s)
- Mengyuan Kan
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, 402 Blockley Hall 423 Guardian Drive, Philadelphia, PA 19104 USA
- Maya Shumyatcher
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, 402 Blockley Hall 423 Guardian Drive, Philadelphia, PA 19104 USA
- Blanca E. Himes
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, 402 Blockley Hall 423 Guardian Drive, Philadelphia, PA 19104 USA
9
Thompson JA, Tan J, Greene CS. Cross-platform normalization of microarray and RNA-seq data for machine learning applications. PeerJ 2016; 4:e1621. [PMID: 26844019 PMCID: PMC4736986 DOI: 10.7717/peerj.1621] [Citation(s) in RCA: 57] [Impact Index Per Article: 7.1] [Received: 10/30/2015] [Accepted: 01/02/2016] [Indexed: 01/08/2023]
Abstract
Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.
Affiliation(s)
- Jeffrey A. Thompson
  - Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, United States of America
  - Quantitative Biomedical Sciences Program, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, United States of America
- Jie Tan
  - Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, United States of America
  - Molecular and Cellular Biology, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, United States of America
- Casey S. Greene
  - Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, United States of America
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
10
Abstract
Motivation: RNA-Seq technique has been demonstrated as a revolutionary means for exploring transcriptome because it provides deep coverage and base pair-level resolution. RNA-Seq quantification is proven to be an efficient alternative to Microarray technique in gene expression study, and it is a critical component in RNA-Seq differential expression analysis. Most existing RNA-Seq quantification tools require the alignments of fragments to either a genome or a transcriptome, entailing a time-consuming and intricate alignment step. To improve the performance of RNA-Seq quantification, an alignment-free method, Sailfish, has been recently proposed to quantify transcript abundances using all k-mers in the transcriptome, demonstrating the feasibility of designing an efficient alignment-free method for transcriptome quantification. Even though Sailfish is substantially faster than alternative alignment-dependent methods such as Cufflinks, using all k-mers in the transcriptome quantification impedes the scalability of the method. Results: We propose a novel RNA-Seq quantification method, RNA-Skim, which partitions the transcriptome into disjoint transcript clusters based on sequence similarity, and introduces the notion of sig-mers, which are a special type of k-mers uniquely associated with each cluster. We demonstrate that the sig-mer counts within a cluster are sufficient for estimating transcript abundances with accuracy comparable with any state-of-the-art method. This enables RNA-Skim to perform transcript quantification on each cluster independently, reducing a complex optimization problem into smaller optimization tasks that can be run in parallel. As a result, RNA-Skim uses <4% of the k-mers and <10% of the CPU time required by Sailfish. It is able to finish transcriptome quantification in <10 min per sample by using just a single thread on a commodity computer, which represents >100× speedup over the state-of-the-art alignment-based methods, while delivering comparable or higher accuracy. Availability and implementation: The software is available at http://www.csbio.unc.edu/rs. Contact: weiwang@cs.ucla.edu Supplementary information: Supplementary data are available at Bioinformatics online.
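The notion of k-mers that are uniquely associated with one cluster can be sketched in a few lines of plain Python. This is only an illustration of the idea as stated in the abstract, not RNA-Skim's implementation: the function names, the toy sequences, and the pre-made cluster assignment are assumptions, and the clustering by sequence similarity and any further sig-mer selection are not reproduced here.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def cluster_specific_kmers(clusters, k):
    """Return, per cluster, the k-mers that occur in no other cluster.

    `clusters` maps a cluster name to a list of transcript sequences.
    """
    owners = defaultdict(set)  # k-mer -> names of clusters that contain it
    for name, transcripts in clusters.items():
        for seq in transcripts:
            for kmer in kmers(seq, k):
                owners[kmer].add(name)

    specific = {name: set() for name in clusters}
    for kmer, names in owners.items():
        if len(names) == 1:
            specific[next(iter(names))].add(kmer)
    return specific


# Hypothetical toy transcripts already grouped into two clusters.
clusters = {
    "cluster_a": ["ACGTAC", "CGTACG"],
    "cluster_b": ["TTGTAC"],
}
sig = cluster_specific_kmers(clusters, k=3)
```

Here `GTA` and `TAC` occur in both clusters and are discarded, so counting only the remaining cluster-specific 3-mers in reads would attribute each hit to exactly one cluster.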
Affiliation(s)
- Zhaojun Zhang
  - Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Department of Computer Science, University of California, Los Angeles, CA, USA
- Wei Wang
  - Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
  - Department of Computer Science, University of California, Los Angeles, CA, USA