1
Wang B, Luan Y. Evaluation of normalization methods for predicting quantitative phenotypes in metagenomic data analysis. Front Genet 2024; 15:1369628. [PMID: 38903761 PMCID: PMC11188486 DOI: 10.3389/fgene.2024.1369628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2024] [Accepted: 05/13/2024] [Indexed: 06/22/2024] Open
Abstract
Genotype-to-phenotype mapping is an essential problem in the current genomic era. While qualitative case-control predictions have received significant attention, less emphasis has been placed on predicting quantitative phenotypes. This emerging field holds great promise in revealing intricate connections between microbial communities and host health. However, heterogeneity in microbiome datasets poses a substantial challenge to the accuracy of predictions and undermines the reproducibility of models. To tackle this challenge, we investigated 22 normalization methods that aim to remove heterogeneity across multiple datasets, conducted a comprehensive review of them, and evaluated their effectiveness in predicting quantitative phenotypes in three simulation scenarios and 31 real datasets. The results indicate that none of these methods demonstrates significant superiority in predicting quantitative phenotypes or attains a noteworthy reduction in the Root Mean Squared Error (RMSE) of the predictions. Given the frequent occurrence of batch effects and the satisfactory performance of batch correction methods in predicting datasets affected by these effects, we strongly recommend utilizing batch correction methods as the initial step in predicting quantitative phenotypes. In summary, the performance of normalization methods in predicting metagenomic data remains a dynamic and ongoing research area. Our study contributes to this field by undertaking a comprehensive evaluation of diverse methods and offering valuable insights into their effectiveness in predicting quantitative phenotypes.
Affiliation(s)
- Beibei Wang
  - Frontier Science Center for Nonlinear Expectations, Ministry of Education, Qingdao, China
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
  - School of Mathematics, Shandong University, Jinan, China
- Yihui Luan
  - Frontier Science Center for Nonlinear Expectations, Ministry of Education, Qingdao, China
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
  - School of Mathematics, Shandong University, Jinan, China
2
Loewinger G, Nunez RA, Mazumder R, Parmigiani G. Optimal ensemble construction for multistudy prediction with applications to mortality estimation. Stat Med 2024; 43:1774-1789. [PMID: 38396313 DOI: 10.1002/sim.10006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2023] [Revised: 10/12/2023] [Accepted: 12/22/2023] [Indexed: 02/25/2024]
Abstract
It is increasingly common to encounter prediction tasks in the biomedical sciences for which multiple datasets are available for model training. Common approaches such as pooling datasets before model fitting can produce poor out-of-study prediction performance when datasets are heterogeneous. Theoretical and applied work has shown multistudy ensembling to be a viable alternative that leverages the variability across datasets in a manner that promotes model generalizability. Multistudy ensembling uses a two-stage stacking strategy which fits study-specific models and estimates ensemble weights separately. This approach ignores, however, the ensemble properties at the model-fitting stage, potentially resulting in performance losses. Motivated by challenges in the estimation of COVID-attributable mortality, we propose optimal ensemble construction, an approach to multistudy stacking whereby we jointly estimate ensemble weights and parameters associated with study-specific models. We prove that limiting cases of our approach yield existing methods such as multistudy stacking and pooling datasets before model fitting. We propose an efficient block coordinate descent algorithm to optimize the loss function. We use our method to perform multicountry COVID-19 baseline mortality prediction. We show that when little data is available for a country before the onset of the pandemic, leveraging data from other countries can substantially improve prediction accuracy. We further compare and characterize the method's performance in data-driven simulations and other numerical experiments. Our method remains competitive with or outperforms multistudy stacking and other earlier methods in the COVID-19 data application and in a range of simulation settings.
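The two-stage stacking strategy described in this abstract (fit study-specific models first, then estimate ensemble weights separately) can be illustrated with a minimal sketch. This is not the authors' method or code: the helper names (`fit_ols`, `predict`, `two_stage_stack`), the simulated data, the use of ordinary least squares for each study, and the unconstrained least-squares weights are all assumptions made for illustration. The joint estimation proposed in the paper is not implemented here.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients for one study, with an intercept column."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def predict(beta, X):
    """Predictions of a fitted study-specific model on new rows."""
    return np.column_stack([np.ones(len(X)), X]) @ beta


def two_stage_stack(studies):
    """Stage 1: one model per study. Stage 2: weights for combining them."""
    betas = [fit_ols(X, y) for X, y in studies]
    X_all = np.vstack([X for X, _ in studies])
    y_all = np.concatenate([y for _, y in studies])
    # Each column of P holds one study-specific model's predictions on all rows.
    P = np.column_stack([predict(b, X_all) for b in betas])
    # Assumption of this sketch: unconstrained least-squares stacking weights.
    w, *_ = np.linalg.lstsq(P, y_all, rcond=None)
    return betas, w, P, y_all


# Hypothetical simulated studies whose slopes differ slightly (heterogeneity).
rng = np.random.default_rng(0)
studies = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    slope = np.array([1.0, -0.5]) + 0.2 * rng.normal(size=2)
    studies.append((X, X @ slope + 0.1 * rng.normal(size=50)))

betas, w, P, y_all = two_stage_stack(studies)
X_new = rng.normal(size=(5, 2))
y_hat = np.column_stack([predict(b, X_new) for b in betas]) @ w
```

Because the weights in this sketch are fitted after the study-specific models are fixed, the first stage never sees how its models will be combined, which is the limitation the abstract points to.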
Affiliation(s)
- Gabriel Loewinger
  - Machine Learning Team, National Institute of Mental Health, Bethesda, Maryland, USA
- Rolando Acosta Nunez
  - Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, USA
  - Regeneron Pharmaceuticals Inc., Tarrytown, New York, USA
- Rahul Mazumder
  - Operations Research Center and MIT Center for Statistics, MIT Sloan School of Management, Cambridge, Massachusetts, USA
- Giovanni Parmigiani
  - Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, USA
  - Department of Data Science, Dana Farber Cancer Institute, Boston, Massachusetts, USA
3
Shan Y, Huang C, Li Y, Zhu H. Merging or ensembling: integrative analysis in multiple neuroimaging studies. Biometrics 2024; 80:ujae003. [PMID: 38465984 PMCID: PMC10926268 DOI: 10.1093/biomtc/ujae003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2022] [Revised: 11/27/2023] [Accepted: 01/10/2024] [Indexed: 03/12/2024]
Abstract
The aim of this paper is to systematically investigate merging and ensembling methods for spatially varying coefficient mixed effects models (SVCMEM) in order to carry out integrative learning of neuroimaging data obtained from multiple biomedical studies. The "merged" approach involves training a single learning model using a comprehensive dataset that encompasses information from all the studies. Conversely, the "ensemble" approach involves creating a weighted average of distinct learning models, each developed from an individual study. We systematically investigate the prediction accuracy of the merged and ensemble learners under the presence of different degrees of interstudy heterogeneity. Additionally, we establish asymptotic guidelines for making strategic decisions about when to employ either of these models in different scenarios, along with deriving optimal weights for the ensemble learner. To validate our theoretical results, we perform extensive simulation studies. The proposed methodology is also applied to 3 large-scale neuroimaging studies.
Affiliation(s)
- Yue Shan
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States
- Chao Huang
  - Department of Statistics, Florida State University, Tallahassee, FL 32306, United States
- Yun Li
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States
- Hongtu Zhu
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States
  - Department of Statistics, Florida State University, Tallahassee, FL 32306, United States
  - Department of Statistics & Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States
  - Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States
4
Chekroud AM, Hawrilenko M, Loho H, Bondar J, Gueorguieva R, Hasan A, Kambeitz J, Corlett PR, Koutsouleris N, Krumholz HM, Krystal JH, Paulus M. Illusory generalizability of clinical prediction models. Science 2024; 383:164-167. [PMID: 38207039 DOI: 10.1126/science.adg8538] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Received: 01/27/2023] [Accepted: 11/10/2023] [Indexed: 01/13/2024]
Abstract
It is widely hoped that statistical models can improve decision-making related to medical treatments. Because of the cost and scarcity of medical outcomes data, this hope is typically based on investigators observing a model's success in one or two datasets or clinical contexts. We scrutinized this optimism by examining how well a machine learning model performed across several independent clinical trials of antipsychotic medication for schizophrenia. Models predicted patient outcomes with high accuracy within the trial in which the model was developed but performed no better than chance when applied out-of-sample. Pooling data across trials to predict outcomes in the trial left out did not improve predictions. These results suggest that models predicting treatment outcomes in schizophrenia are highly context-dependent and may have limited generalizability.
Affiliation(s)
- Adam M Chekroud
  - Spring Health, New York City, NY 10010, USA
  - Department of Psychiatry, Yale University School of Medicine, New Haven, CT 06520, USA
- Hieronimus Loho
  - Department of Psychiatry, Yale University School of Medicine, New Haven, CT 06520, USA
- Alkomiet Hasan
  - Department of Psychiatry, Psychotherapy and Psychosomatics, University Augsburg, 86159 Augsburg, Germany
- Joseph Kambeitz
  - Department of Psychiatry and Psychotherapy, University of Cologne, Faculty of Medicine and University Hospital of Cologne, Cologne, Germany
- Philip R Corlett
  - Department of Psychiatry, Yale University School of Medicine, New Haven, CT 06520, USA
- Nikolaos Koutsouleris
  - Department of Psychiatry and Psychotherapy, Ludwig-Maximilians-University, Munich, Germany
- Harlan M Krumholz
  - Center for Outcomes Research and Evaluation, Yale New Haven Hospital, New Haven, CT 06520, USA
- John H Krystal
  - Department of Psychiatry, Yale University School of Medicine, New Haven, CT 06520, USA
- Martin Paulus
  - Laureate Institute for Brain Research, Tulsa, OK 74136, USA
5
Heiling HM, Rashid NU, Li Q, Ibrahim JG. glmmPen: High Dimensional Penalized Generalized Linear Mixed Models. The R Journal 2023; 15:106-128. [PMID: 38818017 PMCID: PMC11138212 DOI: 10.32614/rj-2023-086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2024]
Abstract
Generalized linear mixed models (GLMMs) are widely used in research for their ability to model correlated outcomes with non-Gaussian conditional distributions. The proper selection of fixed and random effects is a critical part of the modeling process, where model misspecification may lead to significant bias. However, the joint selection of fixed and random effects has historically been limited to lower dimensional GLMMs, largely due to the use of criterion-based model selection strategies. Here we present the R package glmmPen, one of the first to select fixed and random effects in higher dimension using a penalized GLMM modeling framework. Model parameters are estimated using a Monte Carlo expectation conditional minimization (MCECM) algorithm, which leverages Stan and RcppArmadillo for increased computational efficiency. Our package supports the Binomial, Gaussian, and Poisson families and multiple penalty functions. In this manuscript we discuss the modeling procedure, estimation scheme, and software implementation through application to a pancreatic cancer subtyping study. Simulation results show our method has good performance in selecting both the fixed and random effects in high dimensional GLMMs.
Affiliation(s)
- Quefeng Li
  - University of North Carolina Chapel Hill
6
Gao Y, Sun F. Batch normalization followed by merging is powerful for phenotype prediction integrating multiple heterogeneous studies. PLoS Comput Biol 2023; 19:e1010608. [PMID: 37844077 PMCID: PMC10602384 DOI: 10.1371/journal.pcbi.1010608] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2022] [Revised: 10/26/2023] [Accepted: 09/30/2023] [Indexed: 10/18/2023] Open
Abstract
Heterogeneity in different genomic studies compromises the performance of machine learning models in cross-study phenotype predictions. Overcoming heterogeneity when incorporating different studies in terms of phenotype prediction is a challenging and critical step for developing machine learning algorithms with reproducible prediction performance on independent datasets. We investigated the best approaches to integrate different studies of the same type of omics data under a variety of different heterogeneities. We developed a comprehensive workflow to simulate a variety of different types of heterogeneity and evaluate the performances of different integration methods together with batch normalization by using ComBat. We also demonstrated the results through realistic applications on six colorectal cancer (CRC) metagenomic studies and six tuberculosis (TB) gene expression studies, respectively. We showed that heterogeneity in different genomic studies can markedly reduce the machine learning classifier's reproducibility. ComBat normalization improved the prediction performance of the machine learning classifier when heterogeneous populations are present, and could successfully remove batch effects within the same population. We also showed that the machine learning classifier's prediction accuracy decreased markedly as the underlying disease models in the training and test populations became more different. Comparing different merging and integration methods, we found that merging and integration methods can outperform each other in different scenarios. In the realistic applications, we observed that the prediction accuracy improved when applying ComBat normalization with merging or integration methods in both CRC and TB studies. We illustrated that batch normalization is essential for mitigating both population differences of different studies and batch effects. We also showed that both the merging strategy and integration methods can achieve good performance when combined with batch normalization. In addition, we explored the potential of boosting phenotype prediction performance by rank aggregation methods and showed that rank aggregation methods performed similarly to other ensemble learning approaches.
Affiliation(s)
- Yilin Gao
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California, United States of America
- Fengzhu Sun
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California, United States of America
7
Zhu H, Li T, Zhao B. Statistical Learning Methods for Neuroimaging Data Analysis with Applications. Annu Rev Biomed Data Sci 2023; 6:73-104. [PMID: 37127052 DOI: 10.1146/annurev-biodatasci-020722-100353] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/03/2023]
Abstract
The aim of this review is to provide a comprehensive survey of statistical challenges in neuroimaging data analysis, from neuroimaging techniques to large-scale neuroimaging studies and statistical learning methods. We briefly review eight popular neuroimaging techniques and their potential applications in neuroscience research and clinical translation. We delineate four themes of neuroimaging data and review major image processing analysis methods for processing neuroimaging data at the individual level. We briefly review four large-scale neuroimaging-related studies and a consortium on imaging genomics and discuss four themes of neuroimaging data analysis at the population level. We review nine major population-based statistical analysis methods and their associated statistical challenges and present recent progress in statistical methodology to address these challenges.
Affiliation(s)
- Hongtu Zhu
  - Department of Biostatistics, Department of Statistics, Department of Genetics, and Department of Computer Science, University of North Carolina, Chapel Hill, North Carolina, USA
  - Biomedical Research Imaging Center, University of North Carolina, Chapel Hill, North Carolina, USA
- Tengfei Li
  - Biomedical Research Imaging Center, University of North Carolina, Chapel Hill, North Carolina, USA
  - Department of Radiology, University of North Carolina, Chapel Hill, North Carolina, USA
- Bingxin Zhao
  - Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, Pennsylvania, USA
8
Laajala TD, Sreekanth V, Soupir AC, Creed JH, Halkola AS, Calboli FCF, Singaravelu K, Orman MV, Colin-Leitzinger C, Gerke T, Fridley BL, Tyekucheva S, Costello JC. A harmonized resource of integrated prostate cancer clinical, -omic, and signature features. Sci Data 2023; 10:430. [PMID: 37407670 PMCID: PMC10322899 DOI: 10.1038/s41597-023-02335-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2023] [Accepted: 06/27/2023] [Indexed: 07/07/2023] Open
Abstract
Genomic and transcriptomic data have been generated across a wide range of prostate cancer (PCa) study cohorts. These data can be used to better characterize the molecular features associated with clinical outcomes and to test hypotheses across multiple, independent patient cohorts. In addition, derived features, such as estimates of cell composition, risk scores, and androgen receptor (AR) scores, can be used to develop novel hypotheses leveraging existing multi-omic datasets. The full potential of such data is yet to be realized as independent datasets exist in different repositories, have been processed using different pipelines, and derived and clinical features are often not provided or not standardized. Here, we present the curatedPCaData R package, a harmonized data resource representing >2900 primary tumor, >200 normal tissue, and >500 metastatic PCa samples across 19 datasets processed using standardized pipelines with updated gene annotations. We show that meta-analysis across harmonized studies has great potential for robust and clinically meaningful insights. curatedPCaData is an open and accessible community resource with code made available for reproducibility.
Affiliation(s)
- Teemu D Laajala
  - Department of Mathematics and Statistics, University of Turku, Turku, Finland
  - Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Varsha Sreekanth
  - Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Alex C Soupir
  - Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA
- Jordan H Creed
  - Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA
- Anni S Halkola
  - Department of Mathematics and Statistics, University of Turku, Turku, Finland
- Federico C F Calboli
  - Department of Mathematics and Statistics, University of Turku, Turku, Finland
  - Natural Resources Institute Finland (Luke), F-31600, Jokioinen, Finland
- Michael V Orman
  - Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Travis Gerke
  - Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, USA
- Brooke L Fridley
  - Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA
- Svitlana Tyekucheva
  - Department of Data Science, Dana-Farber Cancer Institute; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- James C Costello
  - Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  - University of Colorado Cancer Center, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
9
Colicino E, Fiorito G. DNA methylation-based biomarkers for cardiometabolic-related traits and their importance for risk stratification. Current Opinion in Epidemiology and Public Health 2023; 2:25-31. [PMID: 38601732 PMCID: PMC11003758 DOI: 10.1097/pxh.0000000000000020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/12/2024]
Abstract
Recent findings: The prevalence of cardiometabolic syndrome in adults is increasing worldwide, highlighting the importance of biomarkers for individuals' classification based on their health status. Although cardiometabolic risk scores and diagnostic criteria have been developed aggregating adverse health effects of individual conditions on the overall syndrome, none of them has gained unanimous acceptance. Therefore, novel molecular biomarkers have been developed to better understand the risk, onset and progression of both individual conditions and the overall cardiometabolic syndrome. Summary: Consistent associations between whole blood DNA methylation (DNAm) levels at several single genomic (i.e. CpG) sites and both individual and aggregated cardiometabolic conditions supported the creation of second-generation DNAm-based cardiometabolic-related biomarkers. These biomarkers linearly combine individual DNAm levels from key CpG sites, selected by a two-step machine learning procedure. They can be used, even retrospectively, in populations with extant whole blood DNAm levels and without observed cardiometabolic phenotypes. Purpose of review: Here we offer an overview of the second-generation DNAm-based cardiometabolic biomarkers, discussing methodological advancements and implications for the interpretation and generalizability of the findings. We finally emphasize the contribution of DNAm-based biomarkers for risk stratification beyond traditional factors and discuss limitations and future directions of the field.
Affiliation(s)
- Elena Colicino
  - Icahn School of Medicine at Mount Sinai, New York, NY, USA
10
Laajala TD, Sreekanth V, Soupir A, Creed J, Calboli FCF, Singaravelu K, Orman M, Colin-Leitzinger C, Gerke T, Fridley BL, Tyekucheva S, Costello JC. curatedPCaData: Integration of clinical, genomic, and signature features in a curated and harmonized prostate cancer data resource. bioRxiv 2023:2023.01.17.524403. [PMID: 36711769 PMCID: PMC9882125 DOI: 10.1101/2023.01.17.524403] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/21/2023]
Abstract
Genomic and transcriptomic data have been generated across a wide range of prostate cancer (PCa) study cohorts. These data can be used to better characterize the molecular features associated with clinical outcomes and to test hypotheses across multiple, independent patient cohorts. In addition, derived features, such as estimates of cell composition, risk scores, and androgen receptor (AR) scores, can be used to develop novel hypotheses leveraging existing multi-omic datasets. The full potential of such data is yet to be realized as independent datasets exist in different repositories, have been processed using different pipelines, and derived and clinical features are often not provided or unstandardized. Here, we present the curatedPCaData R package, a harmonized data resource representing >2900 primary tumor, >200 normal tissue, and >500 metastatic PCa samples across 19 datasets processed using standardized pipelines with updated gene annotations. We show that meta-analysis across harmonized studies has great potential for robust and clinically meaningful insights. curatedPCaData is an open and accessible community resource with code made available for reproducibility.
Affiliation(s)
- Teemu D Laajala
  - Department of Mathematics and Statistics, University of Turku, Turku, Finland
  - Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Varsha Sreekanth
  - Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Alex Soupir
  - Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA
- Jordan Creed
  - Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA
- Federico CF Calboli
  - Department of Mathematics and Statistics, University of Turku, Turku, Finland
  - Natural Resources Institute Finland (Luke), F-31600, Jokioinen, Finland
- Michael Orman
  - Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Travis Gerke
  - Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, USA
- Brooke L. Fridley
  - Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA
- Svitlana Tyekucheva
  - Department of Data Science, Dana-Farber Cancer Institute; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- James C Costello
  - Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  - University of Colorado Cancer Center, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
11
Wu Y, Ren B, Patil P. A pairwise strategy for imputing predictive features when combining multiple datasets. Bioinformatics 2022; 39:6964381. [PMID: 36576001 PMCID: PMC9835467 DOI: 10.1093/bioinformatics/btac839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2022] [Revised: 11/30/2022] [Accepted: 12/27/2022] [Indexed: 12/29/2022] Open
Abstract
MOTIVATION In the training of predictive models using high-dimensional genomic data, multiple studies' worth of data are often combined to increase sample size and improve generalizability. A drawback of this approach is that there may be different sets of features measured in each study due to variations in expression measurement platform or technology. It is often common practice to work only with the intersection of features measured in common across all studies, which results in the blind discarding of potentially useful feature information that is measured in individual or subsets of studies. RESULTS We characterize the loss in predictive performance incurred by using only the intersection of feature information available across all studies when training predictors using gene expression data from microarray and sequencing datasets. We study the properties of linear and polynomial regression for imputing discarded features and demonstrate improvements in the external performance of prediction functions through simulation and in gene expression data collected on breast cancer patients. To improve this process, we propose a pairwise strategy that applies any imputation algorithm to two studies at a time and averages imputed features across pairs. We demonstrate that the pairwise strategy is preferable to first merging all datasets together and imputing any resulting missing features. Finally, we provide insights on which subsets of intersected and study-specific features should be used so that missing-feature imputation best promotes cross-study replicability. AVAILABILITY AND IMPLEMENTATION The code is available at https://github.com/YujieWuu/Pairwise_imputation. SUPPLEMENTARY INFORMATION Supplementary information is available at Bioinformatics online.
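The pairwise idea in this abstract (apply an imputation algorithm to two studies at a time, then average the imputed feature across pairs) can be sketched as follows. This is an illustrative assumption-laden example, not the code from the linked repository: the function names `impute_from_donor` and `pairwise_impute` are hypothetical, linear regression stands in for "any imputation algorithm", and every donor study is assumed to share the same two features with the target study.

```python
import numpy as np


def impute_from_donor(X_target, X_donor, f_donor):
    """Learn the missing feature from shared features in one donor study,
    then predict it for the target study (one target-donor pair)."""
    A = np.column_stack([np.ones(len(X_donor)), X_donor])
    coef, *_ = np.linalg.lstsq(A, f_donor, rcond=None)
    return np.column_stack([np.ones(len(X_target)), X_target]) @ coef


def pairwise_impute(X_target, donors):
    """Average the imputations obtained from each target-donor pair."""
    per_pair = [impute_from_donor(X_target, Xd, fd) for Xd, fd in donors]
    return np.mean(per_pair, axis=0)


# Hypothetical data: two donor studies measure a feature the target lacks.
rng = np.random.default_rng(1)
true_coef = np.array([0.8, -0.3])
donors = []
for _ in range(2):
    Xd = rng.normal(size=(40, 2))
    donors.append((Xd, Xd @ true_coef + 0.05 * rng.normal(size=40)))

X_target = rng.normal(size=(10, 2))
f_imputed = pairwise_impute(X_target, donors)
```

In a realistic setting the set of shared features would differ from pair to pair, which is the situation the pairwise strategy is meant to exploit; the sketch keeps it fixed only for brevity.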
Affiliation(s)
- Yujie Wu
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Boyu Ren
  - Laboratory for Psychiatric Biostatistics, McLean Hospital, Belmont, MA 02478, USA
  - Department of Psychiatry, Harvard Medical School, Boston, MA 02115, USA
12
Loewinger G, Patil P, Kishida KT, Parmigiani G. Hierarchical resampling for bagging in multistudy prediction with applications to human neurochemical sensing. Ann Appl Stat 2022; 16:2145-2165. [PMID: 36274786 PMCID: PMC9586160 DOI: 10.1214/21-aoas1574] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Abstract
We propose the "study strap ensemble", which combines advantages of two common approaches to fitting prediction models when multiple training datasets ("studies") are available: pooling studies and fitting one model versus averaging predictions from multiple models each fit to individual studies. The study strap ensemble fits models to bootstrapped datasets, or "pseudo-studies." These are generated by resampling from multiple studies with a hierarchical resampling scheme that generalizes the randomized cluster bootstrap. The study strap is controlled by a tuning parameter that determines the proportion of observations to draw from each study. When the parameter is set to its lowest value, each pseudo-study is resampled from only a single study. When it is high, the study strap ignores the multi-study structure and generates pseudo-studies by merging the datasets and drawing observations like a standard bootstrap. We empirically show the optimal tuning value often lies in between, and prove that special cases of the study strap draw the merged dataset and the set of original studies as pseudo-studies. We extend the study strap approach with an ensemble weighting scheme that utilizes information in the distribution of the covariates of the test dataset. Our work is motivated by neuroscience experiments using real-time neurochemical sensing during awake behavior in humans. Current techniques to perform this kind of research require measurements from an electrode placed in the brain during awake neurosurgery and rely on prediction models to estimate neurotransmitter concentrations from the electrical measurements recorded by the electrode. These models are trained by combining multiple datasets that are collected in vitro under heterogeneous conditions in order to promote accuracy of the models when applied to data collected in the brain. 
A prevailing challenge is deciding how to combine studies or ensemble models trained on different studies to enhance model generalizability. Our methods produce marked improvements in simulations and in this application. All methods are available in the studyStrap CRAN package.
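The hierarchical resampling behind a pseudo-study can be sketched in a few lines. This is a simplified illustration and not the studyStrap package's implementation: the function `draw_pseudo_study` and the parameter name `bag_size` are assumptions of this sketch. Study labels are drawn with replacement first, and each study's share of the bag sets how many observations are then bootstrapped from it.

```python
import random
from collections import Counter


def draw_pseudo_study(studies, bag_size, n_obs, rng):
    """Resample one pseudo-study from a list of studies (lists of observations)."""
    # Level 1: draw study labels with replacement to form the bag.
    bag = Counter(rng.randrange(len(studies)) for _ in range(bag_size))
    pseudo = []
    for k, count in sorted(bag.items()):
        # Level 2: study k contributes rows in proportion to its share of the bag.
        n_k = round(n_obs * count / bag_size)
        pseudo.extend((k, rng.choice(studies[k])) for _ in range(n_k))
    return pseudo


rng = random.Random(0)
# Hypothetical toy data: three studies with 20 scalar observations each.
studies = [[k + 0.01 * i for i in range(20)] for k in range(3)]

single = draw_pseudo_study(studies, bag_size=1, n_obs=30, rng=rng)
mixed = draw_pseudo_study(studies, bag_size=60, n_obs=30, rng=rng)
```

With `bag_size=1` every observation comes from one study, matching the low end of the tuning range described in the abstract. With a large bag the shares approach equal proportions, so the pseudo-study resembles a bootstrap of the merged data when the studies are of equal size. Rounding means the pseudo-study size can differ slightly from `n_obs`.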
13
Niu X, Gou J, Chang H, Lowe M, Zhang F(Z). Classification model with weighted regularization to improve the reproducibility of neuroimaging signature selection. Stat Med 2022; 41:5046-5060. [DOI: 10.1002/sim.9553] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/03/2021] [Revised: 06/16/2022] [Accepted: 07/26/2022] [Indexed: 11/10/2022]
Affiliation(s)
- Xin Niu
  - Department of Psychological and Brain Sciences, Drexel University, Philadelphia, Pennsylvania, USA
- Jiangtao Gou
  - Department of Mathematics and Statistics, Villanova University, Villanova, Pennsylvania, USA
- Hansoo Chang
  - Department of Psychological and Brain Sciences, Drexel University, Philadelphia, Pennsylvania, USA
- Michael Lowe
  - Department of Psychological and Brain Sciences, Drexel University, Philadelphia, Pennsylvania, USA
- Fengqing (Zoe) Zhang
  - Department of Psychological and Brain Sciences, Drexel University, Philadelphia, Pennsylvania, USA
|
14
|
Krepel J, Kircher M, Kohls M, Jung K. Comparison of merging strategies for building machine learning models on multiple independent gene expression data sets. Stat Anal Data Min 2022. [DOI: 10.1002/sam.11549] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
Affiliation(s)
- Jessica Krepel
  - Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Hannover, Germany
- Magdalena Kircher
  - Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Hannover, Germany
- Moritz Kohls
  - Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Hannover, Germany
- Klaus Jung
  - Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Hannover, Germany
|
15
|
Kim MP, Kern C, Goldwasser S, Kreuter F, Reingold O. Universal adaptability: Target-independent inference that competes with propensity scoring. Proc Natl Acad Sci U S A 2022; 119:e2108097119. [PMID: 35046023 PMCID: PMC8794832 DOI: 10.1073/pnas.2108097119] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/29/2021] [Accepted: 12/02/2021] [Indexed: 11/20/2022]
Abstract
The gold-standard approaches for gleaning statistically valid conclusions from data involve random sampling from the population. Collecting properly randomized data, however, can be challenging, so modern statistical methods, including propensity score reweighting, aim to enable valid inferences when random sampling is not feasible. We put forth an approach for making inferences based on available data from a source population that may differ in composition in unknown ways from an eventual target population. Whereas propensity scoring requires a separate estimation procedure for each different target population, we show how to build a single estimator, based on source data alone, that allows for efficient and accurate estimates on any downstream target data. We demonstrate, theoretically and empirically, that our target-independent approach to inference, which we dub "universal adaptability," is competitive with target-specific approaches that rely on propensity scoring. Our approach builds on a surprising connection between the problem of inferences in unspecified target populations and the multicalibration problem, studied in the burgeoning field of algorithmic fairness. We show how the multicalibration framework can be employed to yield valid inferences from a single source population across a diverse set of target populations.
Affiliation(s)
- Michael P Kim
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720
  - Miller Institute for Basic Research in Science, Berkeley, CA 94720
- Christoph Kern
  - School of Social Sciences, University of Mannheim, 68159 Mannheim, Germany
- Shafi Goldwasser
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720
  - Simons Institute for the Theory of Computation, Berkeley, CA 94720
- Frauke Kreuter
  - Joint Program in Survey Methodology, University of Maryland, College Park, MD 20742
  - Department of Statistics, Ludwig-Maximilians-Universität München, 80539 München, Germany
- Omer Reingold
  - Department of Computer Science, Stanford University, Stanford, CA 94305
|
16
|
Tarumi S, Takeuchi W, Qi R, Ning X, Ruppert L, Ban H, Robertson DH, Schleyer TK, Kawamoto K. Predicting pharmacotherapeutic outcomes for type 2 diabetes: An evaluation of three approaches to leveraging electronic health record data from multiple sources. J Biomed Inform 2022; 129:104001. [DOI: 10.1016/j.jbi.2022.104001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2021] [Revised: 12/30/2021] [Accepted: 01/17/2022] [Indexed: 10/19/2022]
|
17
|
Nwosu IO, Piccolo SR. A systematic review of datasets that can help elucidate relationships among gene expression, race, and immunohistochemistry-defined subtypes in breast cancer. Cancer Biol Ther 2021; 22:417-429. [PMID: 34412551 DOI: 10.1080/15384047.2021.1953902] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/17/2022]
Abstract
Scholarly requirements have led to a massive increase of transcriptomic data in the public domain, with millions of samples available for secondary research. We identified gene-expression datasets representing 10,214 breast-cancer patients in public databases. We focused on datasets that included patient metadata on race and/or immunohistochemistry (IHC) profiling of the ER, PR, and HER-2 proteins. This review provides a summary of these datasets and describes findings from 32 research articles associated with the datasets. These studies have helped to elucidate relationships between IHC, race, and/or treatment options, as well as relationships between IHC status and the breast-cancer intrinsic subtypes. We have also identified broad themes across the analysis methodologies used in these studies, including breast cancer subtyping, deriving predictive biomarkers, identifying differentially expressed genes, and optimizing data processing. Finally, we discuss limitations of prior work and recommend future directions for reusing these datasets in secondary analyses.
Affiliation(s)
- Stephen R Piccolo
  - Department of Biology, Brigham Young University, Provo, Utah, United States
|
18
|
Zhang Y, Patil P, Johnson WE, Parmigiani G. Robustifying genomic classifiers to batch effects via ensemble learning. Bioinformatics 2021; 37:1521-1527. [PMID: 33245114 DOI: 10.1093/bioinformatics/btaa986] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/15/2020] [Revised: 10/20/2020] [Accepted: 11/13/2020] [Indexed: 01/08/2023]
Abstract
MOTIVATION: Genomic data are often produced in batches due to practical restrictions, which may lead to unwanted variation in data caused by discrepancies across batches. Such 'batch effects' often have a negative impact on downstream biological analysis and need careful consideration. In practice, batch effects are usually addressed by specifically designed software, which merges the data from different batches, then estimates batch effects and removes them from the data. Here, we focus on classification and prediction problems, and propose a different strategy based on ensemble learning. We first develop prediction models within each batch, then integrate them through ensemble weighting methods. RESULTS: We provide a systematic comparison between these two strategies using studies targeting diverse populations infected with tuberculosis. In one study, we simulated increasing levels of heterogeneity across random subsets of the study, which we treat as simulated batches. We then use the two methods to develop a genomic classifier for the binary indicator of disease status. We evaluate the accuracy of prediction in another independent study targeting a different population cohort. We observed that in independent validation, while merging followed by batch adjustment provides better discrimination at low levels of heterogeneity, our ensemble learning strategy achieves more robust performance, especially at high severity of batch effects. These observations provide practical guidelines for handling batch effects in the development and evaluation of genomic classifiers. AVAILABILITY AND IMPLEMENTATION: The data underlying this article are available in the article and in its online supplementary material. Processed data are available in the GitHub repository with implementation code, at https://github.com/zhangyuqing/bea_ensemble. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
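The strategy of fitting one model per batch and combining them by weighting can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the one-feature threshold classifier, the function names, and the choice to weight each model by its mean accuracy on the other batches. It is not the authors' code, which is in the repository cited above.

```python
def fit_threshold(batch):
    """Fit a 1-D classifier on (x, y) pairs: the threshold lies halfway
    between the class means (assumes class 1 has the larger mean)."""
    pos = [x for x, y in batch if y == 1]
    neg = [x for x, y in batch if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(threshold, batch):
    """Fraction of samples in `batch` classified correctly."""
    return sum((x > threshold) == bool(y) for x, y in batch) / len(batch)


def ensemble_predict(batches, x_new):
    """Weighted vote in [0, 1] across the per-batch models.

    Assumed weighting: each model's weight is its mean accuracy on the
    batches it was not trained on.
    """
    thresholds = [fit_threshold(b) for b in batches]
    weights = []
    for i, t in enumerate(thresholds):
        others = [b for j, b in enumerate(batches) if j != i]
        weights.append(sum(accuracy(t, b) for b in others) / len(others))
    total = sum(weights)
    return sum(w * (x_new > t) for w, t in zip(weights, thresholds)) / total
```

The sketch needs at least two batches and at least one model with non-zero cross-batch accuracy; a merge-then-adjust pipeline would instead pool the batches before fitting a single model.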
Affiliation(s)
- Yuqing Zhang
  - Clinical Bioinformatics, Gilead Sciences, Inc., Foster City, CA 94404, USA
- Prasad Patil
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
- W Evan Johnson
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
  - Division of Computational Biomedicine, Boston University School of Medicine, Boston, MA 02118, USA
- Giovanni Parmigiani
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
|
19
|
Park JA, Sung MD, Kim HH, Park YR. Weight-Based Framework for Predictive Modeling of Multiple Databases With Noniterative Communication Without Data Sharing: Privacy-Protecting Analytic Method for Multi-Institutional Studies. JMIR Med Inform 2021; 9:e21043. [PMID: 33818396 PMCID: PMC8056295 DOI: 10.2196/21043] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/07/2020] [Revised: 11/16/2020] [Accepted: 03/03/2021] [Indexed: 01/22/2023]
Abstract
Background: Securing the representativeness of study populations is crucial in biomedical research to ensure high generalizability. In this regard, using multi-institutional data has advantages in medicine. However, combining data physically is difficult as the confidential nature of biomedical data causes privacy issues. Therefore, a methodological approach is necessary when using multi-institution medical data for research to develop a model without sharing data between institutions. Objective: This study aims to develop a weight-based integrated predictive model of multi-institutional data, which does not require iterative communication between institutions, to improve average predictive performance by increasing the generalizability of the model under privacy-preserving conditions without sharing patient-level data. Methods: The weight-based integrated model generates a weight for each institutional model and builds an integrated model for multi-institutional data based on these weights. We performed 3 simulations to show the weight characteristics and to determine the number of repetitions of the weight required to obtain stable values. We also conducted an experiment using real multi-institutional data to verify the developed weight-based integrated model. We selected 10 hospitals (2845 intensive care unit [ICU] stays in total) from the electronic intensive care unit Collaborative Research Database to predict ICU mortality with 11 features. To evaluate the validity of our model, compared with a centralized model, which was developed by combining all the data of 10 hospitals, we used proportional overlap (ie, 0.5 or less indicates a significant difference at a level of .05; and 2 indicates 2 CIs overlapping completely). Standard and Firth logistic regression models were applied for the 2 simulations and the experiment.
Results: The results of these simulations indicate that the weight of each institution is determined by 2 factors (ie, the data size of each institution and how well each institutional model fits into the overall institutional data) and that repeatedly generating 200 weights is necessary per institution. In the experiment, the estimated area under the receiver operating characteristic curve (AUC) and 95% CIs were 81.36% (79.37%-83.36%) and 81.95% (80.03%-83.87%) in the centralized model and weight-based integrated model, respectively. The proportional overlap of the CIs for AUC in both the weight-based integrated model and the centralized model was approximately 1.70, and that of overlap of the 11 estimated odds ratios was over 1, except for 1 case. Conclusions: In the experiment where real multi-institutional data were used, our model showed similar results to the centralized model without iterative communication between institutions. In addition, our weight-based integrated model provided a weighted average model by integrating 10 models that were overfitted or underfitted relative to the centralized model. The proposed weight-based integrated model is expected to provide an efficient distributed research approach as it increases the generalizability of the model and does not require iterative communication.
Affiliation(s)
- Ji Ae Park
  - Department of Biomedical System Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea
- Min Dong Sung
  - Department of Biomedical System Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea
- Ho Heon Kim
  - Department of Biomedical System Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea
- Yu Rang Park
  - Department of Biomedical System Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea
|
20
|
Zhang Y, Bernau C, Parmigiani G, Waldron L. The impact of different sources of heterogeneity on loss of accuracy from genomic prediction models. Biostatistics 2020; 21:253-268. [PMID: 30202918 DOI: 10.1093/biostatistics/kxy044] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 02/08/2018] [Revised: 07/22/2018] [Accepted: 08/04/2018] [Indexed: 11/13/2022]
Abstract
Cross-study validation (CSV) of prediction models is an alternative to traditional cross-validation (CV) in domains where multiple comparable datasets are available. Although many studies have noted potential sources of heterogeneity in genomic studies, to our knowledge none have systematically investigated their intertwined impacts on prediction accuracy across studies. We employ a hybrid parametric/non-parametric bootstrap method to realistically simulate publicly available compendia of microarray, RNA-seq, and whole metagenome shotgun microbiome studies of health outcomes. Three types of heterogeneity between studies are manipulated and studied: (i) imbalances in the prevalence of clinical and pathological covariates, (ii) differences in gene covariance that could be caused by batch, platform, or tumor purity effects, and (iii) differences in the "true" model that associates gene expression and clinical factors to outcome. We assess model accuracy while altering these factors. Lower accuracy is seen in CSV than in CV. Surprisingly, heterogeneity in known clinical covariates and differences in gene covariance structure contribute very little to the loss of accuracy when validating in new studies. However, forcing identical generative models greatly reduces the within/across study difference. These results, observed consistently for multiple disease outcomes and omics platforms, suggest that the most easily identifiable sources of study heterogeneity are not necessarily the primary ones that undermine the ability to replicate the accuracy of omics prediction models in new studies. Unidentified heterogeneity, such as could arise from unmeasured confounding, may be more important.
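Cross-study validation, as described in this abstract, amounts to training on one study and scoring on each of the others. The sketch below is a toy illustration under stated assumptions: a one-predictor least-squares line stands in for a genomic prediction model, RMSE is the chosen error measure, and the function names are hypothetical.

```python
def fit_line(study):
    """Ordinary least squares for y = a + b * x on a list of (x, y) pairs."""
    n = len(study)
    mean_x = sum(x for x, _ in study) / n
    mean_y = sum(y for _, y in study) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in study)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in study)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(model, study):
    """Root mean squared error of `model` = (intercept, slope) on `study`."""
    a, b = model
    return (sum((y - (a + b * x)) ** 2 for x, y in study) / len(study)) ** 0.5


def cross_study_matrix(studies):
    """Entry [i][j] is the error of the model trained on study i when it
    is evaluated on study j."""
    models = [fit_line(s) for s in studies]
    return [[rmse(m, s) for s in studies] for m in models]
```

In this sketch the diagonal holds in-sample error; a within-study comparison of the kind the abstract calls CV would replace the diagonal with cross-validated error before contrasting it with the off-diagonal entries.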
Affiliation(s)
- Yuqing Zhang
  - Graduate Program in Bioinformatics, Boston University, 24 Cummington Mall, Boston, MA, USA
- Christoph Bernau
  - Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, Germany
- Giovanni Parmigiani
  - Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 3 Blackfan Cir, Boston, MA, USA
  - Department of Biostatistics, Harvard TH Chan School of Public Health, 677 Huntington Ave, Boston, MA, USA
- Levi Waldron
  - Graduate School of Public Health and Health Policy, Institute for Implementation Science in Population Health, City University of New York, 55 W 125th St, New York, NY, USA
|
21
|
Wang M, Luo W, Jones K, Bian X, Williams R, Higson H, Wu D, Hicks B, Yeager M, Zhu B. SomaticCombiner: improving the performance of somatic variant calling based on evaluation tests and a consensus approach. Sci Rep 2020; 10:12898. [PMID: 32732891 PMCID: PMC7393490 DOI: 10.1038/s41598-020-69772-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 03/08/2020] [Accepted: 07/16/2020] [Indexed: 02/06/2023]
Abstract
It is challenging to identify somatic variants from high-throughput sequence reads due to tumor heterogeneity, sub-clonality, and sequencing artifacts. In this study, we evaluated the performance of eight primary somatic variant callers and multiple ensemble methods using both real and synthetic whole-genome sequencing, whole-exome sequencing, and deep targeted sequencing datasets with the NA12878 cell line. The test results showed that a simple consensus approach can significantly improve performance even with a limited number of callers and is more robust and stable than machine learning based ensemble approaches. To fully exploit the multi-callers, we also developed a software package, SomaticCombiner, that can combine multiple callers and integrates a new variant allelic frequency (VAF) adaptive majority voting approach, which can maintain sensitive detection for variants with low VAFs.
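The "simple consensus approach" mentioned in this abstract can be illustrated with a plain vote count across callers. This sketch is an assumption-laden illustration, not SomaticCombiner's code: the function name, the `min_votes` parameter, and the representation of a variant as a `(chrom, pos, ref, alt)` tuple are all hypothetical, and it does not implement the VAF-adaptive voting the abstract describes.

```python
from collections import Counter


def consensus_calls(call_sets, min_votes):
    """Return the variants reported by at least `min_votes` callers.

    `call_sets` holds one collection of variants per caller; each variant
    is assumed to be a hashable (chrom, pos, ref, alt) tuple.
    """
    # set(calls) ensures a caller cannot vote twice for the same variant.
    votes = Counter(v for calls in call_sets for v in set(calls))
    return {v for v, n in votes.items() if n >= min_votes}
```

For a strict majority among `k` callers, `min_votes` would be `k // 2 + 1`; a VAF-adaptive scheme would presumably vary that threshold with allele frequency, which is left out here.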
Affiliation(s)
- Mingyi Wang
  - Cancer Genomics Research Laboratory, Division of Cancer Epidemiology and Genetics, Frederick National Laboratory for Cancer Research, Frederick, MD, 20877, USA
- Wen Luo
  - Cancer Genomics Research Laboratory, Division of Cancer Epidemiology and Genetics, Frederick National Laboratory for Cancer Research, Frederick, MD, 20877, USA
- Kristine Jones
  - Cancer Genomics Research Laboratory, Division of Cancer Epidemiology and Genetics, Frederick National Laboratory for Cancer Research, Frederick, MD, 20877, USA
- Xiaopeng Bian
  - Center for Biomedical Informatics and Information Technology, National Cancer Institute, Rockville, MD, 20850, USA
- Russell Williams
  - Cancer Genomics Research Laboratory, Division of Cancer Epidemiology and Genetics, Frederick National Laboratory for Cancer Research, Frederick, MD, 20877, USA
- Herbert Higson
  - Cancer Genomics Research Laboratory, Division of Cancer Epidemiology and Genetics, Frederick National Laboratory for Cancer Research, Frederick, MD, 20877, USA
- Dongjing Wu
  - Cancer Genomics Research Laboratory, Division of Cancer Epidemiology and Genetics, Frederick National Laboratory for Cancer Research, Frederick, MD, 20877, USA
- Belynda Hicks
  - Cancer Genomics Research Laboratory, Division of Cancer Epidemiology and Genetics, Frederick National Laboratory for Cancer Research, Frederick, MD, 20877, USA
- Meredith Yeager
  - Cancer Genomics Research Laboratory, Division of Cancer Epidemiology and Genetics, Frederick National Laboratory for Cancer Research, Frederick, MD, 20877, USA
- Bin Zhu
  - Cancer Genomics Research Laboratory, Division of Cancer Epidemiology and Genetics, Frederick National Laboratory for Cancer Research, Frederick, MD, 20877, USA
|
22
|
Westerman K, Fernández‐Sanlés A, Patil P, Sebastiani P, Jacques P, Starr JM, Deary IJ, Liu Q, Liu S, Elosua R, DeMeo DL, Ordovás JM. Epigenomic Assessment of Cardiovascular Disease Risk and Interactions With Traditional Risk Metrics. J Am Heart Assoc 2020; 9:e015299. [PMID: 32308120 PMCID: PMC7428544 DOI: 10.1161/jaha.119.015299] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 11/13/2019] [Accepted: 03/10/2020] [Indexed: 12/16/2022]
Abstract
Background: Epigenome-wide association studies for cardiometabolic risk factors have discovered multiple loci associated with incident cardiovascular disease (CVD). However, few studies have sought to directly optimize a predictor of CVD risk. Furthermore, it is challenging to train multivariate models across multiple studies in the presence of study or batch effects. Methods and Results: Here, we analyzed existing DNA methylation data collected using the Illumina HumanMethylation450 microarray to create a predictor of CVD risk across 3 cohorts: Women's Health Initiative, Framingham Heart Study Offspring Cohort, and Lothian Birth Cohorts. We trained Cox proportional hazards-based elastic net regressions for incident CVD separately in each cohort and used a recently introduced cross-study learning approach to integrate these individual scores into an ensemble predictor. The methylation-based risk score was associated with CVD time-to-event in a held-out fraction of the Framingham data set (hazard ratio per SD=1.28, 95% CI, 1.10-1.50) and predicted myocardial infarction status in the independent REGICOR (Girona Heart Registry) data set (odds ratio per SD=2.14, 95% CI, 1.58-2.89). These associations remained after adjustment for traditional cardiovascular risk factors and were similar to those from elastic net models trained on a directly merged data set. Additionally, we investigated interactions between the methylation-based risk score and both genetic and biochemical CVD risk, showing preliminary evidence of an enhanced performance in those with less traditional risk factor elevation. Conclusions: This investigation provides proof-of-concept for a genome-wide, CVD-specific epigenomic risk score and suggests that DNA methylation data may enable the discovery of high-risk individuals who would be missed by alternative risk metrics.
Affiliation(s)
- Kenneth Westerman
  - JM‐USDA Human Nutrition Research Center on Aging at Tufts University, Boston, MA
- Alba Fernández‐Sanlés
  - Cardiovascular Epidemiology and Genetics Research Group, REGICOR Study Group, IMIM (Hospital del Mar Medical Research Institute), Barcelona, Catalonia, Spain
  - Pompeu Fabra University (UPF), Barcelona, Catalonia, Spain
- Prasad Patil
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA
- Paola Sebastiani
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA
- Paul Jacques
  - JM‐USDA Human Nutrition Research Center on Aging at Tufts University, Boston, MA
- John M. Starr
  - Department of Psychology, University of Edinburgh, United Kingdom
  - Centre for Cognitive Ageing and Cognitive Epidemiology, University of Edinburgh, United Kingdom
- Ian J. Deary
  - Department of Psychology, University of Edinburgh, United Kingdom
  - Centre for Cognitive Ageing and Cognitive Epidemiology, University of Edinburgh, United Kingdom
- Qing Liu
  - Department of Epidemiology, Brown University School of Public Health, Providence, RI
- Simin Liu
  - Department of Epidemiology, Brown University School of Public Health, Providence, RI
- Roberto Elosua
  - Cardiovascular Epidemiology and Genetics Research Group, REGICOR Study Group, IMIM (Hospital del Mar Medical Research Institute), Barcelona, Catalonia, Spain
  - CIBER Cardiovascular Diseases (CIBERCV), Madrid, Spain
  - Medicine Department, Medical School, University of Vic‐Central University of Catalonia (UVic‐UCC), Vic, Catalonia, Spain
- Dawn L. DeMeo
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, Boston, MA
- José M. Ordovás
  - JM‐USDA Human Nutrition Research Center on Aging at Tufts University, Boston, MA
  - IMDEA Alimentación, CEIUAM, Madrid, Spain
  - Centro Nacional de Investigaciones Cardiovasculares (CNIC), Madrid, Spain
|
23
|
Ramchandran M, Patil P, Parmigiani G. Tree-Weighting for Multi-Study Ensemble Learners. Pacific Symposium on Biocomputing 2020; 25:451-462. [PMID: 31797618 PMCID: PMC6980320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
Multi-study learning uses multiple training studies, separately trains classifiers on each, and forms an ensemble with weights rewarding members with better cross-study prediction ability. This article considers novel weighting approaches for constructing tree-based ensemble learners in this setting. Using Random Forests as a single-study learner, we compare weighting each forest to form the ensemble, to extracting the individual trees trained by each Random Forest and weighting them directly. We find that incorporating multiple layers of ensembling in the training process by weighting trees increases the robustness of the resulting predictor. Furthermore, we explore how ensembling weights correspond to tree structure, to shed light on the features that determine whether weighting trees directly is advantageous. Finally, we apply our approach to genomic datasets and show that weighting trees improves upon the basic multi-study learning paradigm. Code and supplementary material are available at https://github.com/m-ramchandran/tree-weighting.
Affiliation(s)
- Maya Ramchandran
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health
- Prasad Patil
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02115, USA
- Giovanni Parmigiani
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02115, USA
|
24
|
Zhang X, Hu Y, Aouizerat BE, Peng G, Marconi VC, Corley MJ, Hulgan T, Bryant KJ, Zhao H, Krystal JH, Justice AC, Xu K. Machine learning selected smoking-associated DNA methylation signatures that predict HIV prognosis and mortality. Clin Epigenetics 2018; 10:155. [PMID: 30545403 PMCID: PMC6293604 DOI: 10.1186/s13148-018-0591-z] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 09/19/2018] [Accepted: 11/26/2018] [Indexed: 12/20/2022]
Abstract
Background: The effects of tobacco smoking on epigenome-wide methylation signatures in white blood cells (WBCs) collected from persons living with HIV may have important implications for their immune-related outcomes, including frailty and mortality. The application of a machine learning approach to the analysis of CpG methylation in the epigenome enables the selection of phenotypically relevant features from high-dimensional data. Using this approach, we now report that a set of smoking-associated DNA-methylated CpGs predicts HIV prognosis and mortality in an HIV-positive veteran population. Results: We first identified 137 epigenome-wide significant CpGs for smoking in WBCs from 1137 HIV-positive individuals (p < 1.70E−07). To examine whether smoking-associated CpGs were predictive of HIV frailty and mortality, we applied ensemble-based machine learning to build a model in a training sample employing 408,583 CpGs. A set of 698 CpGs was selected and was predictive of high HIV frailty in a testing sample (area under the curve [AUC] = 0.73, 95% CI 0.63-0.83), and this was replicated in an independent sample (AUC = 0.78, 95% CI 0.73-0.83). We further found that a DNA methylation index constructed from the 698 CpGs was associated with the 5-year survival rate (HR = 1.46; 95% CI 1.06-2.02, p = 0.02). Interestingly, the 698 CpGs, located on 445 genes, were enriched in the integrin signaling pathway (p = 9.55E−05, false discovery rate = 0.036), which is responsible for the regulation of the cell cycle, differentiation, and adhesion. Conclusion: We demonstrated that smoking-associated DNA methylation features in white blood cells predict HIV infection-related clinical outcomes in a population living with HIV. Electronic supplementary material: The online version of this article (10.1186/s13148-018-0591-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Xinyu Zhang
  - Department of Psychiatry, Yale School of Medicine, 300 George Street, 950 Campbell Ave, West Haven, New Haven, CT, 06511, USA
  - VA Connecticut Healthcare System, 950 Campbell Ave, West Haven, CT, 06516, USA
- Ying Hu
  - Center for Biomedical Bioinformatics, National Cancer Institute, Rockville, MD, 20852, USA
- Bradley E Aouizerat
  - Bluestone Center for Clinical Research, New York University, New York, NY, 10010, USA
- Gang Peng
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, 065116, USA
- Vincent C Marconi
  - Division of Infectious Diseases, Emory University School of Medicine, Atlanta, GA, 30303, USA
- Michael J Corley
  - Department of Native Hawaiian Health, John A. Burns School of Medicine, University of Hawaii, Suite 1016B, Honolulu, 96813, USA
- Todd Hulgan
  - School of Medicine, Vanderbilt University, Nashville, TN, 37232, USA
- Kendall J Bryant
  - National Institute on Alcohol Abuse and Alcoholism, Bethesda, MD, 20852, USA
- Hongyu Zhao
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, 065116, USA
- John H Krystal
  - Department of Psychiatry, Yale School of Medicine, 300 George Street, 950 Campbell Ave, West Haven, New Haven, CT, 06511, USA
  - VA Connecticut Healthcare System, 950 Campbell Ave, West Haven, CT, 06516, USA
- Amy C Justice
  - VA Connecticut Healthcare System, 950 Campbell Ave, West Haven, CT, 06516, USA
  - Yale University School of Medicine, New Haven, CT, 06516, USA
- Ke Xu
  - Department of Psychiatry, Yale School of Medicine, 300 George Street, 950 Campbell Ave, West Haven, New Haven, CT, 06511, USA
  - VA Connecticut Healthcare System, 950 Campbell Ave, West Haven, CT, 06516, USA
|
25
|
|