1
Lu M, Yin R, Chen XS. Ensemble methods of rank-based trees for single sample classification with gene expression profiles. J Transl Med 2024; 22:140. [PMID: 38321494 PMCID: PMC10848444 DOI: 10.1186/s12967-024-04940-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2023] [Accepted: 01/27/2024] [Indexed: 02/08/2024] Open
Abstract
Building Single Sample Predictors (SSPs) from gene expression profiles presents challenges, notably due to the lack of calibration across diverse gene expression measurement technologies. However, recent research indicates the viability of classifying phenotypes based on the order of expression of multiple genes. Existing SSP methods often rely on Top Scoring Pairs (TSP), which are platform-independent and easy to interpret through the concept of "relative expression reversals". Nevertheless, TSP methods face limitations in classifying complex patterns involving comparisons of more than two gene expressions. To overcome these constraints, we introduce a novel approach that extends TSP rules by constructing rank-based trees capable of encompassing extensive gene-gene comparisons. This method is bolstered by incorporating two ensemble strategies, boosting and random forest, to mitigate the risk of overfitting. Our implementation of ensemble rank-based trees employs boosting with LogitBoost cost and random forests, addressing both binary and multi-class classification problems. In a comparative analysis across 12 cancer gene expression datasets, our proposed methods demonstrate superior performance over both the k-TSP classifier and nearest template prediction methods. We have further refined our approach to facilitate variable selection and the generation of clear, precise decision rules from rank-based trees, enhancing interpretability. The cumulative evidence from our research underscores the significant potential of ensemble rank-based trees in advancing disease classification via gene expression data, offering a robust, interpretable, and scalable solution. Our software is available at https://CRAN.R-project.org/package=ranktreeEnsemble .
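The "relative expression reversal" idea behind a single top scoring pair can be illustrated with a minimal sketch. This is not the ranktreeEnsemble implementation; the function names, the toy data, and the tie-breaking choices are assumptions made only for illustration. A pair of genes is scored by how differently the ordering `x[i] < x[j]` occurs in the two classes, and a new sample is classified from that one within-sample comparison.

```python
from itertools import combinations


def ordering_freq(samples, labels, i, j):
    """Per-class frequency of the ordering x[i] < x[j] (classes are 0 and 1)."""
    freq = {}
    for cls in (0, 1):
        rows = [x for x, y in zip(samples, labels) if y == cls]
        freq[cls] = sum(x[i] < x[j] for x in rows) / len(rows)
    return freq


def top_scoring_pair(samples, labels):
    """Return the gene index pair whose ordering frequency differs most between classes."""
    def score(pair):
        f = ordering_freq(samples, labels, *pair)
        return abs(f[0] - f[1])
    return max(combinations(range(len(samples[0])), 2), key=score)


def predict(sample, pair, freq):
    """Assign the class in which the observed ordering was more frequent in training."""
    i, j = pair
    observed = sample[i] < sample[j]
    p0 = freq[0] if observed else 1 - freq[0]
    p1 = freq[1] if observed else 1 - freq[1]
    return 0 if p0 >= p1 else 1


# Toy data (hypothetical): three genes, six samples, two classes.
samples = [[1, 5, 3], [2, 6, 1], [1, 4, 9], [7, 2, 3], [8, 3, 1], [9, 1, 5]]
labels = [0, 0, 0, 1, 1, 1]

pair = top_scoring_pair(samples, labels)
freq = ordering_freq(samples, labels, *pair)
```

Because the rule only compares two values inside one sample, a sample measured on a different scale (for example `[100, 500, 300]`) is classified without any cross-sample normalization.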
Affiliation(s)
- Min Lu
  - Division of Biostatistics, Department of Public Health Sciences, Miller School of Medicine, University of Miami, 1120 NW 14th Street, Miami, FL, 33136, USA
- Ruijie Yin
  - Division of Biostatistics, Department of Public Health Sciences, Miller School of Medicine, University of Miami, 1120 NW 14th Street, Miami, FL, 33136, USA
- X Steven Chen
  - Division of Biostatistics, Department of Public Health Sciences, Miller School of Medicine, University of Miami, 1120 NW 14th Street, Miami, FL, 33136, USA
  - Sylvester Comprehensive Cancer Center, Miller School of Medicine, University of Miami, 1475 NW 12th Ave, Miami, FL, 33136, USA
2
Kwan B, Fuhrer T, Montemayor D, Fink JC, He J, Hsu CY, Messer K, Nelson RG, Pu M, Ricardo AC, Rincon-Choles H, Shah VO, Ye H, Zhang J, Sharma K, Natarajan L. A generalized covariate-adjusted top-scoring pair algorithm with applications to diabetic kidney disease stage classification in the Chronic Renal Insufficiency Cohort (CRIC) Study. BMC Bioinformatics 2023; 24:57. [PMID: 36803209 PMCID: PMC9942303 DOI: 10.1186/s12859-023-05171-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2022] [Accepted: 02/02/2023] [Indexed: 02/22/2023] Open
Abstract
BACKGROUND The growing amount of high dimensional biomolecular data has spawned new statistical and computational models for risk prediction and disease classification. Yet, many of these methods do not yield biologically interpretable models, despite offering high classification accuracy. An exception, the top-scoring pair (TSP) algorithm derives parameter-free, biologically interpretable single pair decision rules that are accurate and robust in disease classification. However, standard TSP methods do not accommodate covariates that could heavily influence feature selection for the top-scoring pair. Herein, we propose a covariate-adjusted TSP method, which uses residuals from a regression of features on the covariates for identifying top scoring pairs. We conduct simulations and a data application to investigate our method, and compare it to existing classifiers, LASSO and random forests. RESULTS Our simulations found that features that were highly correlated with clinical variables had high likelihood of being selected as top scoring pairs in the standard TSP setting. However, through residualization, our covariate-adjusted TSP was able to identify new top scoring pairs, that were largely uncorrelated with clinical variables. In the data application, using patients with diabetes (n = 977) selected for metabolomic profiling in the Chronic Renal Insufficiency Cohort (CRIC) study, the standard TSP algorithm identified (valine-betaine, dimethyl-arg) as the top-scoring metabolite pair for classifying diabetic kidney disease (DKD) severity, whereas the covariate-adjusted TSP method identified the pair (pipazethate, octaethylene glycol) as top-scoring. Valine-betaine and dimethyl-arg had, respectively, ≥ 0.4 absolute correlation with urine albumin and serum creatinine, known prognosticators of DKD. 
Thus without covariate-adjustment the top-scoring pair largely reflected known markers of disease severity, whereas covariate-adjusted TSP uncovered features liberated from confounding, and identified independent prognostic markers of DKD severity. Furthermore, TSP-based methods achieved competitive classification accuracy in DKD to LASSO and random forests, while providing more parsimonious models. CONCLUSIONS We extended TSP-based methods to account for covariates, via a simple, easy to implement residualizing process. Our covariate-adjusted TSP method identified metabolite features, uncorrelated from clinical covariates, that discriminate DKD severity stage based on the relative ordering between two features, and thus provide insights into future studies on the order reversals in early vs advanced disease states.
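The residualizing step described in this abstract can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes a single covariate and ordinary least squares in closed form, whereas the paper regresses features on several clinical covariates. The function name and the toy values are hypothetical. Pair scoring would then be run on the residuals instead of the raw feature values.

```python
def residualize(feature, covariate):
    """Residuals from a simple least-squares regression of feature on one covariate."""
    n = len(feature)
    mean_x = sum(covariate) / n
    mean_y = sum(feature) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, feature))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(covariate, feature)]


# Toy data (hypothetical): one clinical covariate and two features.
covariate = [1.0, 2.0, 3.0, 4.0]
feature_a = [2.0, 4.0, 6.0, 8.0]   # fully explained by the covariate
feature_b = [1.0, 3.0, 2.0, 5.0]   # only partly explained

res_a = residualize(feature_a, covariate)
res_b = residualize(feature_b, covariate)

# Within-sample ordering on the residual scale, the quantity a TSP rule would use.
a_below_b = [ra < rb for ra, rb in zip(res_a, res_b)]
```

A feature that is a pure proxy for the covariate (`feature_a` here) is left with residuals of zero, so it can no longer dominate pair selection simply by tracking a known clinical marker.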
Grants
- R01 DK110541 NIDDK NIH HHS
- U24 DK060990 NIDDK NIH HHS
- R01DK118736, 1R01DK110541-01A1, U01DK060990, U01DK060984, U01DK061022, U01DK061021, U01DK061028, U01DK060980, U01DK060963, U01DK060902, U24DK060990 NIDDK NIH HHS
- National Science Foundation Graduate Research Fellowship Program
- Intramural Research Program of the National Institute of Diabetes and Digestive and Kidney Diseases
- National Institute of Diabetes and Digestive and Kidney Diseases
Affiliation(s)
- Brian Kwan
  - Division of Biostatistics and Bioinformatics, Herbert Wertheim School of Public Health, University of California, San Diego, La Jolla, CA, USA
  - Moores Cancer Center, University of California, San Diego, La Jolla, CA, USA
- Tobias Fuhrer
  - Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
- Daniel Montemayor
  - Division of Nephrology, Department of Medicine, University of Texas Health San Antonio, San Antonio, TX, USA
  - Center for Renal Precision Medicine, University of Texas Health San Antonio, San Antonio, TX, USA
- Jeffery C Fink
  - Department of Medicine, University of Maryland, Baltimore School of Medicine, Baltimore, MD, USA
- Jiang He
  - Department of Epidemiology, Tulane University School of Public Health and Tropical Medicine and Tulane University Translational Science Institute, New Orleans, LA, USA
- Chi-Yuan Hsu
  - Division of Nephrology, University of California, San Francisco School of Medicine, San Francisco, CA, USA
- Karen Messer
  - Division of Biostatistics and Bioinformatics, Herbert Wertheim School of Public Health, University of California, San Diego, La Jolla, CA, USA
  - Moores Cancer Center, University of California, San Diego, La Jolla, CA, USA
- Robert G Nelson
  - Chronic Kidney Disease Section, National Institute of Diabetes and Digestive and Kidney Diseases, Phoenix, AZ, USA
- Minya Pu
  - Moores Cancer Center, University of California, San Diego, La Jolla, CA, USA
- Ana C Ricardo
  - Department of Medicine, University of Illinois, Chicago, IL, USA
- Hernan Rincon-Choles
  - Department of Nephrology, Glickman Urological and Kidney Institute, Cleveland Clinic Foundation, Cleveland, OH, USA
- Vallabh O Shah
  - University of New Mexico Health Sciences Center, Albuquerque, NM, USA
- Hongping Ye
  - Division of Nephrology, Department of Medicine, University of Texas Health San Antonio, San Antonio, TX, USA
  - Center for Renal Precision Medicine, University of Texas Health San Antonio, San Antonio, TX, USA
- Jing Zhang
  - Moores Cancer Center, University of California, San Diego, La Jolla, CA, USA
- Kumar Sharma
  - Division of Nephrology, Department of Medicine, University of Texas Health San Antonio, San Antonio, TX, USA
  - Center for Renal Precision Medicine, University of Texas Health San Antonio, San Antonio, TX, USA
- Loki Natarajan
  - Division of Biostatistics and Bioinformatics, Herbert Wertheim School of Public Health, University of California, San Diego, La Jolla, CA, USA
  - Moores Cancer Center, University of California, San Diego, La Jolla, CA, USA
3
Kim DM, Feilotter HE, Davey SK. BRCA1 Variant Assessment Using a Simple Analytic Assay. J Appl Lab Med 2022; 7:674-688. [PMID: 35021209 DOI: 10.1093/jalm/jfab163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2021] [Accepted: 10/04/2021] [Indexed: 11/14/2022]
Abstract
BACKGROUND We previously developed a biological assay to accurately predict BRCA1 (BRCA1 DNA repair associated) mutation status, based on gene expression profiles of Epstein-Barr virus-transformed lymphoblastoid cell lines. The original work was done using whole genome expression microarrays, and nearest shrunken centroids analysis. While these approaches are appropriate for model building, they are difficult to implement clinically, where more targeted testing and analysis are required for time and cost savings. METHODS Here, we describe adaptation of the original predictor to use the NanoString nCounter platform for testing, with analysis based on the k-top scoring pairs (k-TSP) method. RESULTS Assessing gene expression using the nCounter platform on a set of lymphoblastoid cell lines yielded 93.8% agreement with the microarray-derived data, and 87.5% overall correct classification of BRCA1 carriers and controls. Using the original gene expression microarray data used to develop our predictor with nearest shrunken centroids, we rebuilt a classifier based on the k-TSP method. This classifier relies on the relative expression of 10 pairs of genes, compared to the original 43 identified by nearest shrunken centroids (NSC), and was 96.2% concordant with the original training set prediction, with a 94.3% overall correct classification of BRCA1 carriers and controls. CONCLUSIONS The k-TSP classifier was shown to accurately predict BRCA1 status using data generated on the nCounter platform and is feasible for initiating a clinical validation.
Affiliation(s)
- Daniel M Kim
  - Department of Pathology and Molecular Medicine, Queen's University Cancer Research Institute, Queen's University, Kingston, ON, Canada
  - Division of Cancer Biology and Genetics, Queen's University Cancer Research Institute, Queen's University, Kingston, ON, Canada
- Harriet E Feilotter
  - Department of Pathology and Molecular Medicine, Queen's University Cancer Research Institute, Queen's University, Kingston, ON, Canada
  - Division of Cancer Biology and Genetics, Queen's University Cancer Research Institute, Queen's University, Kingston, ON, Canada
- Scott K Davey
  - Department of Pathology and Molecular Medicine, Queen's University Cancer Research Institute, Queen's University, Kingston, ON, Canada
  - Division of Cancer Biology and Genetics, Queen's University Cancer Research Institute, Queen's University, Kingston, ON, Canada
  - Departments of Oncology and Biomedical and Molecular Sciences, Queen's University Cancer Research Institute, Queen's University, Kingston, ON, Canada
4
Eriksson P, Marzouka NAD, Sjödahl G, Bernardo C, Liedberg F, Höglund M. A comparison of rule-based and centroid single-sample multiclass predictors for transcriptomic classification. Bioinformatics 2021; 38:1022-1029. [PMID: 34788787 PMCID: PMC8796360 DOI: 10.1093/bioinformatics/btab763] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 07/21/2021] [Revised: 10/24/2021] [Accepted: 11/02/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Gene expression-based multiclass prediction, such as tumor subtyping, is a non-trivial bioinformatic problem. Most classifier methods operate by comparing expression levels relative to other samples. Methods that base predictions on the expression pattern within a sample have been proposed as an alternative. As these methods are invariant to the cohort composition and can be applied to a sample in isolation, they can collectively be termed single sample predictors (SSP). Such predictors could potentially be used for preprocessing-free classification of new samples and be built to function across different expression platforms where proper batch and dataset normalization is challenging. Here, we evaluate the behavior of several multiclass SSPs based on binary gene-pair rules (k-Top Scoring Pairs, Absolute Intrinsic Molecular Subtyping and a new Random Forest approach) and compare them to centroids built with centered or raw expression values, with the criteria that an optimal predictor should have high accuracy, overcome differences in tumor purity, be robust across expression platforms and provide an informative prediction output score. RESULTS We found that gene-pair-based SSPs showed excellent performance on many expression-based classification tasks. The three methods differed in prediction score output, handling of tied scores and behavior in low purity samples. The k-Top Scoring Pairs and Random Forest approach both achieved high classification accuracy while providing an informative prediction score. Although gene-pair-based SSPs have been touted as being cross-platform compatible (through training on mixed platform data), out-of-the-box compatibility with a new dataset remains a potential issue that warrants cohort-to-cohort verification. 
AVAILABILITY AND IMPLEMENTATION Our R package 'multiclassPairs' (https://cran.r-project.org/package=multiclassPairs) (https://doi.org/10.1093/bioinformatics/btab088) is freely available and enables easy training, prediction, and visualization using the gene-pair rule-based Random Forest SSP method and provides additional multiclass functionalities to the switchBox k-Top-Scoring Pairs package. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Nour-al-dain Marzouka
  - Division of Oncology, Department of Clinical Sciences, Lund University, Lund, Sweden
- Gottfrid Sjödahl
  - Urology - urothelial cancer, Department of Translational Medicine, Lund University, Skåne University Hospital, Malmö, Sweden
- Carina Bernardo
  - Division of Oncology, Department of Clinical Sciences, Lund University, Lund, Sweden
- Fredrik Liedberg
  - Urology - urothelial cancer, Department of Translational Medicine, Lund University, Skåne University Hospital, Malmö, Sweden
- Mattias Höglund
  - Division of Oncology, Department of Clinical Sciences, Lund University, Lund, Sweden
5
Chen A, Laeyendecker O, Eshleman SH, Monaco DR, Kammers K, Larman HB, Ruczinski I. A top scoring pairs classifier for recent HIV infections. Stat Med 2021; 40:2604-2612. [PMID: 33660319 DOI: 10.1002/sim.8920] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/13/2020] [Revised: 01/07/2021] [Accepted: 02/03/2021] [Indexed: 11/11/2022]
Abstract
Accurate incidence estimation of HIV infection from cross-sectional biomarker data is crucial for monitoring the epidemic and determining the impact of HIV prevention interventions. A key feature of cross-sectional incidence testing methods is the mean window period, defined as the average duration that infected individuals are classified as recently infected. Two assays available for cross-sectional incidence estimation, the BED capture immunoassay, and the Limiting Antigen (LAg) Avidity assay, measure a general characteristic of antibody response; performance of these assays can be affected and biased by factors such as viral suppression, resulting in sample misclassification and overestimation of HIV incidence. As availability and use of antiretroviral treatment increase worldwide, algorithms that do not include HIV viral load and are not impacted by viral suppression are needed for cross-sectional HIV incidence estimation. Using a phage display system to quantify antibody binding to over 3300 HIV peptides, we present a classifier based on top scoring peptide pairs that identifies recent infections using HIV antibody responses alone. Based on plasma samples from individuals with known dates of seroconversion, we estimated the mean window period for our classifier to be 217 days (95% confidence interval 183 to 257 days), compared to the estimated mean window period for the LAg-Avidity protocol of 106 days (76 to 146 days). Moreover, each of the four peptide pairs correctly classified more of the recent samples than the LAg-Avidity assay alone at the same classification accuracy for non-recent samples.
Affiliation(s)
- Athena Chen
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
- Oliver Laeyendecker
  - Laboratory of Immunoregulation, Division of Intramural Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Baltimore, Maryland, USA
  - Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Susan H Eshleman
  - Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Daniel R Monaco
  - Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Kai Kammers
  - Division of Biostatistics and Bioinformatics, Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Harry Benjamin Larman
  - Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Ingo Ruczinski
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
6
Marzouka NAD, Eriksson P. multiclassPairs: an R package to train multiclass pair-based classifier. Bioinformatics 2021; 37:3043-3044. [PMID: 33543757 PMCID: PMC8479681 DOI: 10.1093/bioinformatics/btab088] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/17/2020] [Revised: 01/27/2021] [Accepted: 02/02/2021] [Indexed: 02/02/2023] Open
Abstract
MOTIVATION k-Top Scoring Pairs (kTSP) algorithms utilize in-sample gene expression feature pair rules for class prediction, and have demonstrated excellent performance and robustness. The available packages and tools primarily focus on binary prediction (i.e. two classes). However, many real-world classification problems e.g. tumor subtype prediction, are multiclass tasks. RESULTS Here, we present multiclassPairs, an R package to train pair-based single sample classifiers for multiclass problems. multiclassPairs offers two main methods to build multiclass prediction models, either using a one-versus-rest kTSP scheme or through a novel pair-based Random Forest approach. The package also provides options for dealing with class imbalances, multiplatform training, missing features in test data and visualization of training and test results. AVAILABILITY AND IMPLEMENTATION 'multiclassPairs' package is available on CRAN servers and GitHub: https://github.com/NourMarzouka/multiclassPairs. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Nour-Al-Dain Marzouka
  - Department of Clinical Sciences, Division of Oncology, Lund University, 22381 Lund, Sweden. To whom correspondence should be addressed.
- Pontus Eriksson
  - Department of Clinical Sciences, Division of Oncology, Lund University, 22381 Lund, Sweden
7
Li X, Huang H, Zhang J, Jiang F, Guo Y, Shi Y, Guo Z, Ao L. A qualitative transcriptional signature for predicting the biochemical recurrence risk of prostate cancer patients after radical prostatectomy. Prostate 2020; 80:376-387. [PMID: 31961962 PMCID: PMC7065139 DOI: 10.1002/pros.23952] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 06/20/2019] [Accepted: 01/02/2020] [Indexed: 12/27/2022]
Abstract
BACKGROUND The qualitative transcriptional characteristics, the within-sample relative expression orderings (REOs) of genes, are highly robust against batch effects and sample quality variations. Hence, we develop a qualitative transcriptional signature based on REOs to predict the biochemical recurrence risk of prostate cancer (PCa) patients after radical prostatectomy. METHODS Gene pairs with REOs significantly correlated with the biochemical recurrence-free survival (BFS) were identified from 131 PCa samples in the training data set. From these gene pairs, we selected a qualitative transcriptional signature based on the within-sample REOs of gene pairs which could predict the recurrence risk of PCa patients after radical prostatectomy. RESULTS A signature consisting of 74 gene pairs, named 74-GPS, was developed for predicting the recurrence risk of PCa patients after radical prostatectomy based on the majority voting rule that a sample was assigned as high risk when at least 37 gene pairs of the 74-GPS voted for high risk; otherwise, low risk. The signature was validated in six independent datasets produced by different platforms. In each of the validation datasets, the Kaplan-Meier survival analysis showed that the average BFS of the low-risk group was significantly better than that of the high-risk group. Analyses of multiomics data of PCa samples from TCGA suggested that both epigenomic and genomic alterations could cause the reproducible transcriptional differences between the two prognostic groups. CONCLUSIONS The proposed qualitative transcriptional signature can robustly stratify PCa patients after radical prostatectomy into two groups with different recurrence risk and distinct multiomics characteristics. Hence, 74-GPS may serve as a helpful tool for guiding the management of PCa patients after radical prostatectomy at the individual level.
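The majority voting rule stated in this abstract (high risk when at least 37 of the 74 gene pairs vote for high risk) can be sketched in a few lines. The gene identifiers, the toy expression values, and the convention that `expression[a] > expression[b]` counts as a high-risk vote are assumptions for illustration only; the actual 74-GPS pairs and their vote directions are not given here.

```python
from math import ceil


def classify_by_majority(expression, gene_pairs):
    """Label a sample 'high' when at least half of the gene pairs vote high risk.

    A pair (a, b) votes high risk when expression[a] > expression[b]
    (assumed direction, chosen for this sketch).
    """
    threshold = ceil(len(gene_pairs) / 2)   # 37 for a 74-pair signature
    votes = sum(expression[a] > expression[b] for a, b in gene_pairs)
    return "high" if votes >= threshold else "low"


# Hypothetical four-pair signature and two single-sample profiles.
pairs = [("G1", "G2"), ("G3", "G4"), ("G4", "G1"), ("G2", "G3")]
sample_x = {"G1": 5, "G2": 3, "G3": 1, "G4": 8}      # 3 of 4 pairs vote high
sample_y = {"G1": 1, "G2": 3, "G3": 9, "G4": 0.5}    # 1 of 4 pairs votes high

risk_x = classify_by_majority(sample_x, pairs)
risk_y = classify_by_majority(sample_y, pairs)
```

Only within-sample comparisons are used, which is why such a rule can be applied to one patient at a time and across measurement platforms.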
Affiliation(s)
- Xiang Li
  - Department of Bioinformatics, Key Laboratory of Ministry of Education for Gastrointestinal Cancer, The School of Basic Medical Sciences, Fujian Medical University, Fuzhou, China
  - Key Laboratory of Medical Bioinformatics, Fujian Medical University, Fuzhou, China
  - Fujian Key Laboratory of Tumor Microbiology, Fujian Medical University, Fuzhou, China
- Haiyan Huang
  - Department of Bioinformatics, Key Laboratory of Ministry of Education for Gastrointestinal Cancer, The School of Basic Medical Sciences, Fujian Medical University, Fuzhou, China
- Jiahui Zhang
  - Department of Bioinformatics, Key Laboratory of Ministry of Education for Gastrointestinal Cancer, The School of Basic Medical Sciences, Fujian Medical University, Fuzhou, China
- Fengle Jiang
  - Department of Bioinformatics, Key Laboratory of Ministry of Education for Gastrointestinal Cancer, The School of Basic Medical Sciences, Fujian Medical University, Fuzhou, China
- Yating Guo
  - Department of Bioinformatics, Key Laboratory of Ministry of Education for Gastrointestinal Cancer, The School of Basic Medical Sciences, Fujian Medical University, Fuzhou, China
- Yidan Shi
  - Department of Bioinformatics, Key Laboratory of Ministry of Education for Gastrointestinal Cancer, The School of Basic Medical Sciences, Fujian Medical University, Fuzhou, China
- Zheng Guo
  - Department of Bioinformatics, Key Laboratory of Ministry of Education for Gastrointestinal Cancer, The School of Basic Medical Sciences, Fujian Medical University, Fuzhou, China
  - Key Laboratory of Medical Bioinformatics, Fujian Medical University, Fuzhou, China
  - Fujian Key Laboratory of Tumor Microbiology, Fujian Medical University, Fuzhou, China
- Lu Ao
  - Department of Bioinformatics, Key Laboratory of Ministry of Education for Gastrointestinal Cancer, The School of Basic Medical Sciences, Fujian Medical University, Fuzhou, China
  - Key Laboratory of Medical Bioinformatics, Fujian Medical University, Fuzhou, China
  - Fujian Key Laboratory of Tumor Microbiology, Fujian Medical University, Fuzhou, China
8
Krzhizhanovskaya VV, Závodszky G, Lees MH, Dongarra JJ, Sloot PMA, Brissos S, Teixeira J. Tree Based Advanced Relative Expression Analysis. LECTURE NOTES IN COMPUTER SCIENCE 2020. [PMCID: PMC7304016 DOI: 10.1007/978-3-030-50420-5_37] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Abstract
This paper presents a new concept for biomarker discovery and gene expression data classification that arises from Relative Expression Analysis (RXA). The basic idea of RXA is to focus on simple ordering relationships between the expression of small sets of genes rather than their raw values. We propose a paradigm shift as we extend the RXA concept to tree-based Advanced Relative Expression Analysis (ARXA). The main contribution is a decision tree with splitting nodes that consider relative fraction comparisons between multiple gene pairs. In addition, to cope with the enormous computational complexity of RXA, the most time-consuming part, which is scoring all possible gene pairs in each splitting node, is parallelized using a GPU. This way the algorithm allows searching for more tailored interactions between sub-groups of genes in a reasonable time. Experiments carried out on 8 cancer-related datasets show not only a significant improvement in accuracy and speed of our approach in comparison to various RXA solutions but also new interesting patterns between subgroups of genes.
9
Rashid NU, Peng XL, Jin C, Moffitt RA, Volmar KE, Belt BA, Panni RZ, Nywening TM, Herrera SG, Moore KJ, Hennessey SG, Morrison AB, Kawalerski R, Nayyar A, Chang AE, Schmidt B, Kim HJ, Linehan DC, Yeh JJ. Purity Independent Subtyping of Tumors (PurIST), A Clinically Robust, Single-sample Classifier for Tumor Subtyping in Pancreatic Cancer. Clin Cancer Res 2019; 26:82-92. [PMID: 31754050 DOI: 10.1158/1078-0432.ccr-19-1467] [Citation(s) in RCA: 109] [Impact Index Per Article: 21.8] [Received: 05/06/2019] [Revised: 07/10/2019] [Accepted: 10/01/2019] [Indexed: 12/20/2022]
Abstract
PURPOSE Molecular subtyping for pancreatic cancer has made substantial progress in recent years, facilitating the optimization of existing therapeutic approaches to improve clinical outcomes in pancreatic cancer. With advances in treatment combinations and choices, it is becoming increasingly important to determine ways to place patients on the best therapies upfront. Although various molecular subtyping systems for pancreatic cancer have been proposed, consensus regarding proposed subtypes, as well as their relative clinical utility, remains largely unknown and presents a natural barrier to wider clinical adoption. EXPERIMENTAL DESIGN We assess three major subtype classification schemas in the context of results from two clinical trials and by meta-analysis of publicly available expression data to assess statistical criteria of subtype robustness and overall clinical relevance. We then developed a single-sample classifier (SSC) using penalized logistic regression based on the most robust and replicable schema. RESULTS We demonstrate that a tumor-intrinsic two-subtype schema is most robust, replicable, and clinically relevant. We developed Purity Independent Subtyping of Tumors (PurIST), an SSC with robust and highly replicable performance on a wide range of platforms and sample types. We show that PurIST subtypes have meaningful associations with patient prognosis and have significant implications for treatment response to FOLFIRINOX. CONCLUSIONS The flexibility and utility of PurIST on low-input samples such as tumor biopsies allows it to be used at the time of diagnosis to facilitate the choice of effective therapies for patients with pancreatic ductal adenocarcinoma and should be considered in the context of future clinical trials.
Affiliation(s)
- Naim U Rashid
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina. .,Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
| | - Xianlu L Peng
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
| | - Chong Jin
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina.,Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
| | - Richard A Moffitt
- Department of Biomedical Informatics and Pathology, Stony Brook University, Stony Brook, New York.,Department of Pharmacological Sciences, Stony Brook Cancer Center, Stony Brook University, Stony Brook, New York
| | - Keith E Volmar
- University of North Carolina-Rex Healthcare, Raleigh, North Carolina
| | - Brian A Belt
- Department of Surgery, University of Rochester, Rochester, New York
| | - Roheena Z Panni
- Department of Surgery, Washington University, Saint Louis, St. Louis, Missouri
| | - Timothy M Nywening
- Department of Surgery, Washington University, Saint Louis, St. Louis, Missouri
| | - Silvia G Herrera
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
| | - Kristin J Moore
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
| | - Sarah G Hennessey
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
| | - Ashley B Morrison
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
| | - Ryan Kawalerski
- Department of Biomedical Informatics and Pathology, Stony Brook University, Stony Brook, New York
| | - Apoorve Nayyar
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
| | - Audrey E Chang
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
| | - Benjamin Schmidt
- Department of Surgery, Washington University, Saint Louis, St. Louis, Missouri
| | - Hong Jin Kim
- Department of Surgery, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
| | - David C Linehan
- Department of Surgery, University of Rochester, Rochester, New York
| | - Jen Jen Yeh
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina.,Department of Surgery, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina.,Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
| |
|
10
|
Rashid NU, Li Q, Yeh JJ, Ibrahim JG. Modeling Between-Study Heterogeneity for Improved Replicability in Gene Signature Selection and Clinical Prediction. J Am Stat Assoc 2019; 115:1125-1138. [PMID: 33012902 DOI: 10.1080/01621459.2019.1671197] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 02/08/2023]
Abstract
In the genomic era, the identification of gene signatures associated with disease is of significant interest. Such signatures are often used to predict clinical outcomes in new patients and aid clinical decision-making. However, recent studies have shown that gene signatures are often not replicable. This occurrence has practical implications regarding the generalizability and clinical applicability of such signatures. To improve replicability, we introduce a novel approach to select gene signatures from multiple datasets whose effects are consistently non-zero and account for between-study heterogeneity. We build our model upon some rank-based quantities, facilitating integration over different genomic datasets. A high dimensional penalized Generalized Linear Mixed Model (pGLMM) is used to select gene signatures and address data heterogeneity. We compare our method to some commonly used strategies that select gene signatures ignoring between-study heterogeneity. We provide asymptotic results justifying the performance of our method and demonstrate its advantage in the presence of heterogeneity through thorough simulation studies. Lastly, we motivate our method through a case study subtyping pancreatic cancer patients from four gene expression studies.
Affiliation(s)
- Naim U Rashid
- Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A.,Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A
| | - Quefeng Li
- Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A
| | - Jen Jen Yeh
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A.,Department of Surgery, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A.,Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A
| | - Joseph G Ibrahim
- Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A
| |
|
11
|
Afsari B, Guo T, Considine M, Florea L, Kagohara LT, Stein-O'Brien GL, Kelley D, Flam E, Zambo KD, Ha PK, Geman D, Ochs MF, Califano JA, Gaykalova DA, Favorov AV, Fertig EJ. Splice Expression Variation Analysis (SEVA) for inter-tumor heterogeneity of gene isoform usage in cancer. Bioinformatics 2019; 34:1859-1867. [PMID: 29342249 DOI: 10.1093/bioinformatics/bty004] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 06/09/2017] [Accepted: 01/10/2018] [Indexed: 12/22/2022] Open
Abstract
Motivation Current bioinformatics methods to detect changes in gene isoform usage in distinct phenotypes compare the relative expected isoform usage in phenotypes. These statistics model differences in isoform usage in normal tissues, which have stable regulation of gene splicing. Pathological conditions, such as cancer, can have broken regulation of splicing that increases the heterogeneity of the expression of splice variants. Inferring events with such differential heterogeneity in gene isoform usage requires new statistical approaches. Results We introduce Splice Expression Variability Analysis (SEVA) to model increased heterogeneity of splice variant usage between conditions (e.g. tumor and normal samples). SEVA uses a rank-based multivariate statistic that compares the variability of junction expression profiles within one condition to the variability within another. Simulated data show that SEVA is unique in modeling heterogeneity of gene isoform usage, and benchmark SEVA's performance against EBSeq, DiffSplice and rMATS that model differential isoform usage instead of heterogeneity. We confirm the accuracy of SEVA in identifying known splice variants in head and neck cancer and perform cross-study validation of novel splice variants. A novel comparison of splice variant heterogeneity between subtypes of head and neck cancer demonstrated unanticipated similarity between the heterogeneity of gene isoform usage in HPV-positive and HPV-negative subtypes and anticipated increased heterogeneity among HPV-negative samples with mutations in genes that regulate the splice variant machinery. These results show that SEVA accurately models differential heterogeneity of gene isoform usage from RNA-seq data. Availability and implementation SEVA is implemented in the R/Bioconductor package GSReg. Contact bahman@jhu.edu or favorov@sensi.org or ejfertig@jhmi.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Bahman Afsari
- Division of Biostatistics and Bioinformatics, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center
| | - Theresa Guo
- Department of Otolaryngology-Head and Neck Surgery
| | - Michael Considine
- Division of Biostatistics and Bioinformatics, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center
| | - Liliana Florea
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD 21205, USA
| | - Luciane T Kagohara
- Division of Biostatistics and Bioinformatics, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center
| | - Genevieve L Stein-O'Brien
- Division of Biostatistics and Bioinformatics, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center
| | - Dylan Kelley
- Department of Otolaryngology-Head and Neck Surgery
| | - Emily Flam
- Department of Otolaryngology-Head and Neck Surgery
| | | | - Patrick K Ha
- Department of Otolaryngology-Head and Neck Surgery, University of California, San Francisco, CA 94158, USA
| | - Donald Geman
- Department of Applied Mathematics & Statistics, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Michael F Ochs
- Department of Mathematics & Statistics, The College of New Jersey, Ewing, NJ 08628, USA
| | - Joseph A Califano
- Division of Otolaryngology, Department of Surgery, University of California, San Diego, CA 92093, USA
| | | | - Alexander V Favorov
- Division of Biostatistics and Bioinformatics, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center.,Laboratory of Systems Biology and Computational Genetics, Vavilov Institute of General Genetics, RAS, Moscow 119333, Russia
| | - Elana J Fertig
- Division of Biostatistics and Bioinformatics, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center
| |
|
12
|
Johnson KW, Glicksberg BS, Shameer K, Vengrenyuk Y, Krittanawong C, Russak AJ, Sharma SK, Narula JN, Dudley JT, Kini AS. A transcriptomic model to predict increase in fibrous cap thickness in response to high-dose statin treatment: Validation by serial intracoronary OCT imaging. EBioMedicine 2019; 44:41-49. [PMID: 31126891 PMCID: PMC6607084 DOI: 10.1016/j.ebiom.2019.05.007] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 02/15/2019] [Revised: 04/15/2019] [Accepted: 05/03/2019] [Indexed: 02/04/2023] Open
Abstract
Background Fibrous cap thickness (FCT), best measured by intravascular optical coherence tomography (OCT), is the most important determinant of plaque rupture in the coronary arteries. Statin treatment increases FCT and thus reduces the likelihood of acute coronary events. However, substantial statin-related FCT increase occurs in only a subset of patients. Currently, there are no methods to predict which patients will benefit. We use transcriptomic data from a clinical trial of rosuvastatin to predict if a patient's FCT will increase in response to statin therapy. Methods FCT was measured using OCT in 69 patients at (1) baseline and (2) after 8–10 weeks of 40 mg rosuvastatin. Peripheral blood mononuclear cells were assayed via microarray. We constructed machine learning models with baseline gene expression data to predict change in FCT. Finally, we ascertained the biological functions of the most predictive transcriptomic markers. Findings Machine learning models were able to predict FCT responders using baseline gene expression with high fidelity (Classification AUC = 0.969 and 0.972). The first model (elastic net) using 73 genes had an accuracy of 92.8%, sensitivity of 94.1%, and specificity of 91.4%. The second model (KTSP) using 18 genes has an accuracy of 95.7%, sensitivity of 94.3%, and specificity of 97.1%. We found 58 enriched gene ontology terms, including many involved with immune cell function and cholesterol biometabolism. Interpretation In this pilot study, transcriptomic models could predict if FCT increased following 8–10 weeks of rosuvastatin. These findings may have significance for therapy selection and could supplement invasive imaging modalities.
Affiliation(s)
- Kipp W Johnson
- Institute for Next Generation Healthcare, Mount Sinai Health System, New York, NY, United States of America; Department of Genetics and Genomic Sciences, Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, United States of America
| | - Benjamin S Glicksberg
- Bakar Computational Health Sciences Institute, The University of California, San Francisco, San Francisco, CA, United States of America
| | - Khader Shameer
- Advanced Analytics Center, AstraZeneca, Gaithersburg, MD, United States of America
| | - Yuliya Vengrenyuk
- Mount Sinai Heart, Mount Sinai Health System, New York, NY, United States of America
| | - Chayakrit Krittanawong
- Department of Internal Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, United States of America
| | - Adam J Russak
- Institute for Next Generation Healthcare, Mount Sinai Health System, New York, NY, United States of America; Department of Internal Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, United States of America
| | - Samin K Sharma
- Mount Sinai Heart, Mount Sinai Health System, New York, NY, United States of America
| | - Jagat N Narula
- Mount Sinai Heart, Mount Sinai Health System, New York, NY, United States of America
| | - Joel T Dudley
- Institute for Next Generation Healthcare, Mount Sinai Health System, New York, NY, United States of America; Department of Genetics and Genomic Sciences, Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, United States of America
| | - Annapoorna S Kini
- Mount Sinai Heart, Mount Sinai Health System, New York, NY, United States of America.
| |
|
13
|
A new data analysis method based on feature linear combination. J Biomed Inform 2019; 94:103173. [PMID: 30965135 DOI: 10.1016/j.jbi.2019.103173] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 09/13/2018] [Revised: 04/02/2019] [Accepted: 04/06/2019] [Indexed: 01/15/2023]
Abstract
In biological data, feature relationships are complex and diverse, they could reflect physiological and pathological changes. Defining simple and efficient classification rules based on feature relationships is helpful for discriminating different conditions and studying disease mechanism. The popular data analysis method, k top scoring pairs (k-TSP), explores the feature relationship by focusing on the difference of the relative level of two features in different groups and classifies samples based on the exploration. To define more efficient classification rules, we propose a new data analysis method based on the linear combination of k > 0 top scoring pairs (LC-k-TSP). LC-k-TSP applies support vector machine (SVM) to define the best linear relationship of each feature pair, scores feature pairs by the discriminative abilities of the corresponding linear combinations and selects k disjoint top scoring pairs to construct an ensemble classifier. Experiments on twelve public datasets showed the superiority of LC-k-TSP over k-TSP which evaluates the relationship of every two features in the same way. The experiment also illustrated that LC-k-TSP performed similarly to SVM and random forest (RF) in accuracy rate. LC-k-TSP studies the own unique linear combination for each feature pair and defines simple classification rules, it is easy to explore the biomedical explanation. Finally, we applied LC-k-TSP to analyze the hepatocellular carcinoma (HCC) metabolomics data and define the simple classification rules for discrimination of different liver diseases. It obtained accuracy rates of 89.76% and 89.13% in distinguishing between small HCC and hepatic cirrhosis (CIR) groups as well as between HCC and CIR groups, superior to 87.99% and 80.35% by k-TSP. Hence, defining classification rules based on feature relationships is an effective way to analyze biological data. 
LC-k-TSP which checks different feature pairs by their corresponding unique best linear relationship has the superiority over k-TSP which checks each pair by the same linear relationship. Availability and implementation: http://www.402.dicp.ac.cn/download_ok_4.htm.
|
14
|
Sjöström M, Staaf J, Edén P, Wärnberg F, Bergh J, Malmström P, Fernö M, Niméus E, Fredriksson I. Identification and validation of single-sample breast cancer radiosensitivity gene expression predictors. Breast Cancer Res 2018; 20:64. [PMID: 29973242 PMCID: PMC6033283 DOI: 10.1186/s13058-018-0978-y] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 01/15/2018] [Accepted: 05/08/2018] [Indexed: 02/12/2023] Open
Abstract
BACKGROUND Adjuvant radiotherapy is the standard of care after breast-conserving surgery for primary breast cancer, despite a majority of patients being over- or under-treated. In contrast to adjuvant endocrine therapy and chemotherapy, no diagnostic tests are in clinical use that can stratify patients for adjuvant radiotherapy. This study presents the development and validation of a targeted gene expression assay to predict the risk of ipsilateral breast tumor recurrence and response to adjuvant radiotherapy after breast-conserving surgery in primary breast cancer. METHODS Fresh-frozen primary tumors from 336 patients radically (clear margins) operated on with breast-conserving surgery with or without radiotherapy were collected. Patients were split into a discovery cohort (N = 172) and a validation cohort (N = 164). Genes predicting ipsilateral breast tumor recurrence in an Illumina HT12 v4 whole transcriptome analysis were combined with genes identified in the literature (248 genes in total) to develop a targeted radiosensitivity assay on the Nanostring nCounter platform. Single-sample predictors for ipsilateral breast tumor recurrence based on a k-top scoring pairs algorithm were trained, stratified for estrogen receptor (ER) status and radiotherapy. Two previously published profiles, the radiosensitivity signature of Speers et al., and the 10-gene signature of Eschrich et al., were also included in the targeted panel. RESULTS Derived single-sample predictors were prognostic for ipsilateral breast tumor recurrence in radiotherapy-treated ER+ patients (AUC 0.67, p = 0.01), ER+ patients without radiotherapy (AUC = 0.89, p = 0.02), and radiotherapy-treated ER- patients (AUC = 0.78, p < 0.001). Among ER+ patients, radiotherapy had an excellent effect on tumors classified as radiosensitive (p < 0.001), while radiotherapy had no effect on tumors classified as radioresistant (p = 0.36) and there was a high risk of ipsilateral breast tumor recurrence (55% at 10 years). 
Our single-sample predictors developed in ER+ tumors and the radiosensitivity signature correlated with proliferation, while single-sample predictors developed in ER- tumors correlated with immune response. The 10-gene signature negatively correlated with both proliferation and immune response. CONCLUSIONS Our targeted single-sample predictors were prognostic for ipsilateral breast tumor recurrence and have the potential to stratify patients for adjuvant radiotherapy. The correlation of models with biology may explain the different performance in subgroups of breast cancer.
Affiliation(s)
- Martin Sjöström
- Faculty of Medicine, Department of Clinical Sciences Lund, Oncology and Pathology, Lund University, Lund, Sweden.,Department of Haematology, Oncology and Radiation Physics, Skåne University Hospital, Lund, Sweden
| | - Johan Staaf
- Faculty of Medicine, Department of Clinical Sciences Lund, Oncology and Pathology, Lund University, Lund, Sweden
| | - Patrik Edén
- Department of Theoretical Physics and Computational Biology, Lund University, Lund, Sweden
| | - Fredrik Wärnberg
- Department of Surgical Sciences, Uppsala University, Uppsala, Sweden
| | - Jonas Bergh
- Department of Oncology and Pathology, Cancer Center Karolinska, Karolinska Institutet, Stockholm, Sweden.,Department of Oncology, Karolinska University Hospital, Radiumhemmet, Stockholm, Sweden
| | - Per Malmström
- Faculty of Medicine, Department of Clinical Sciences Lund, Oncology and Pathology, Lund University, Lund, Sweden.,Department of Haematology, Oncology and Radiation Physics, Skåne University Hospital, Lund, Sweden
| | - Mårten Fernö
- Faculty of Medicine, Department of Clinical Sciences Lund, Oncology and Pathology, Lund University, Lund, Sweden
| | - Emma Niméus
- Faculty of Medicine, Department of Clinical Sciences Lund, Oncology and Pathology, Lund University, Lund, Sweden.,Faculty of Medicine, Department of Clinical Sciences Lund, Surgery, Lund University, Lund, Sweden.,Department of Surgery, Skåne University Hospital, Lund, Sweden
| | - Irma Fredriksson
- Department of Molecular Medicine and Surgery, Karolinska Institutet, Stockholm, Sweden.,Department of Breast- and Endocrine Surgery, Karolinska University Hospital, Stockholm, Sweden
| |
|
15
|
Kim S, Lin CW, Tseng GC. MetaKTSP: a meta-analytic top scoring pair method for robust cross-study validation of omics prediction analysis. Bioinformatics 2016; 32:1966-73. [PMID: 27153719 DOI: 10.1093/bioinformatics/btw115] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 02/18/2015] [Accepted: 02/19/2016] [Indexed: 01/08/2023] Open
Abstract
MOTIVATION Supervised machine learning is widely applied to transcriptomic data to predict disease diagnosis, prognosis or survival. Robust and interpretable classifiers with high accuracy are usually favored for their clinical and translational potential. The top scoring pair (TSP) algorithm is an example that applies a simple rank-based algorithm to identify rank-altered gene pairs for classifier construction. Although many classification methods perform well in cross-validation of single expression profile, the performance usually greatly reduces in cross-study validation (i.e. the prediction model is established in the training study and applied to an independent test study) for all machine learning methods, including TSP. The failure of cross-study validation has largely diminished the potential translational and clinical values of the models. The purpose of this article is to develop a meta-analytic top scoring pair (MetaKTSP) framework that combines multiple transcriptomic studies and generates a robust prediction model applicable to independent test studies. RESULTS We proposed two frameworks, by averaging TSP scores or by combining P-values from individual studies, to select the top gene pairs for model construction. We applied the proposed methods in simulated data sets and three large-scale real applications in breast cancer, idiopathic pulmonary fibrosis and pan-cancer methylation. The result showed superior performance of cross-study validation accuracy and biomarker selection for the new meta-analytic framework. In conclusion, combining multiple omics data sets in the public domain increases robustness and accuracy of the classification model that will ultimately improve disease understanding and clinical treatment decisions to benefit patients. AVAILABILITY AND IMPLEMENTATION An R package MetaKTSP is available online. (http://tsenglab.biostat.pitt.edu/software.htm). 
CONTACT ctseng@pitt.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
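Illustrative note (not part of the abstract): the score-averaging framework described above can be sketched as follows. This is a minimal sketch, assuming the standard TSP score |P(X_i < X_j | class 0) - P(X_i < X_j | class 1)| and an unweighted mean across studies; the function names `tsp_score`, `mean_tsp_scores`, and `top_pairs` are hypothetical and are not the MetaKTSP package API, and any study weighting or P-value combination used by the authors is not reproduced here.

```python
from itertools import combinations


def tsp_score(X, y, i, j):
    """Standard TSP score for genes i and j in one study.

    X is a list of samples (each a list of expression values),
    y is a list of 0/1 class labels aligned with X.
    """
    def freq(label):
        rows = [x for x, lab in zip(X, y) if lab == label]
        return sum(x[i] < x[j] for x in rows) / len(rows)

    return abs(freq(0) - freq(1))


def mean_tsp_scores(studies):
    """Average each pair's TSP score over studies (unweighted; an assumption)."""
    n_genes = len(studies[0][0][0])
    return {
        (i, j): sum(tsp_score(X, y, i, j) for X, y in studies) / len(studies)
        for i, j in combinations(range(n_genes), 2)
    }


def top_pairs(studies, k):
    """Return the k gene pairs with the highest mean TSP score."""
    scores = mean_tsp_scores(studies)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Two toy studies measured on different scales; only within-sample order matters.
study_a = ([[1, 2, 3], [1, 3, 2], [2, 1, 3], [3, 1, 2]], [0, 0, 1, 1])
study_b = ([[10, 50, 20], [5, 40, 60], [90, 30, 70], [80, 20, 10]], [0, 0, 1, 1])
selected = top_pairs([study_a, study_b], k=2)
```

Because each score depends only on the ordering of two values within a sample, the two toy studies can be combined without any cross-study normalization.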
Affiliation(s)
- SungHwan Kim
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA.,Department of Statistics, Korea University, Seoul, South Korea
| | - Chien-Wei Lin
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA
| | - George C Tseng
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA.,Department of Computational and Systems Biology.,Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA, USA
| |
|
16
|
Afsari B, Geman D, Fertig EJ. Learning dysregulated pathways in cancers from differential variability analysis. Cancer Inform 2014; 13:61-7. [PMID: 25392694 PMCID: PMC4218688 DOI: 10.4137/cin.s14066] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 06/19/2014] [Revised: 08/13/2014] [Accepted: 08/14/2014] [Indexed: 12/16/2022] Open
Abstract
Analysis of gene sets can implicate activity in signaling pathways that is responsible for cancer initiation and progression, but is not discernible from the analysis of individual genes. Multiple methods and software packages have been developed to infer pathway activity from expression measurements for set of genes targeted by that pathway. Broadly, three major methodologies have been proposed: over-representation, enrichment, and differential variability. Both over-representation and enrichment analyses are effective techniques to infer differentially regulated pathways from gene sets with relatively consistent differentially expressed (DE) genes. Specifically, these algorithms aggregate statistics from each gene in the pathway. However, they overlook multivariate patterns related to gene interactions and variations in expression. Therefore, the analysis of differential variability of multigene expression patterns can be essential to pathway inference in cancers. The corresponding methodologies and software packages for such multivariate variability analysis of pathways are reviewed here. We also introduce a new, computationally efficient algorithm, expression variation analysis (EVA), which has been implemented along with a previously proposed algorithm, Differential Rank Conservation (DIRAC), in an open source R package, gene set regulation (GSReg). EVA inferred similar pathways as DIRAC at reduced computational costs. Moreover, EVA also inferred different dysregulated pathways than those identified by enrichment analysis.
Affiliation(s)
- Bahman Afsari
- Postdoctoral Fellow, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
| | - Donald Geman
- Professor, Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, USA
| | - Elana J Fertig
- Assistant Professor, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
| |
|
17
|
Afsari B, Fertig EJ, Geman D, Marchionni L. switchBox: an R package for k-Top Scoring Pairs classifier development. Bioinformatics 2014; 31:273-4. [PMID: 25262153 DOI: 10.1093/bioinformatics/btu622] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Indexed: 11/13/2022]
Abstract
k-Top Scoring Pairs (kTSP) is a classification method for prediction from high-throughput data based on a set of the paired measurements. Each of the two possible orderings of a pair of measurements (e.g. a reversal in the expression of two genes) is associated with one of two classes. The kTSP prediction rule is the aggregation of voting among such individual two-feature decision rules based on order switching. kTSP, like its predecessor, Top Scoring Pair (TSP), is a parameter-free classifier relying only on ranking of a small subset of features, rendering it robust to noise and potentially easy to interpret in biological terms. In contrast to TSP, kTSP has comparable accuracy to standard genomics classification techniques, including Support Vector Machines and Prediction Analysis for Microarrays. Here, we describe 'switchBox', an R package for kTSP-based prediction. AVAILABILITY The 'switchBox' package is freely available from Bioconductor: http://www.bioconductor.org. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
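Illustrative note (not part of the abstract): the voting rule described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function name `ktsp_predict`, the gene names, and the convention that a pair (a, b) votes for class 1 when a is expressed below b are all hypothetical, and the code does not reproduce the switchBox R interface.

```python
def ktsp_predict(sample, pairs):
    """Classify one sample by majority vote over k gene pairs.

    sample maps gene name -> expression value.
    Each pair (a, b) votes for class 1 if sample[a] < sample[b], else class 0.
    Ties in the vote go to class 0 (an assumption; k is usually chosen odd).
    """
    votes_for_1 = sum(1 for a, b in pairs if sample[a] < sample[b])
    return 1 if votes_for_1 > len(pairs) / 2 else 0


# Hypothetical pairs and a single expression profile.
pairs = [("g1", "g2"), ("g3", "g4"), ("g4", "g1")]
sample = {"g1": 1.0, "g2": 2.0, "g3": 5.0, "g4": 0.5}
label = ktsp_predict(sample, pairs)

# Any strictly increasing rescaling keeps every ordering, so the vote is unchanged.
rescaled = {gene: 1000 * value + 7 for gene, value in sample.items()}
label_rescaled = ktsp_predict(rescaled, pairs)
```

Only within-sample orderings enter the rule, which is why a prediction for a single sample does not depend on how other samples were normalized.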
Affiliation(s)
- Bahman Afsari
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, School of Medicine, Johns Hopkins University, Baltimore, MD 21205 and Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Elana J Fertig
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, School of Medicine, Johns Hopkins University, Baltimore, MD 21205 and Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Donald Geman
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, School of Medicine, Johns Hopkins University, Baltimore, MD 21205 and Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Luigi Marchionni
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, School of Medicine, Johns Hopkins University, Baltimore, MD 21205 and Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218, USA
| |
|