Reference Citation Analysis

For: Patil P, Parmigiani G. Training replicable predictors in multiple studies. Proc Natl Acad Sci U S A 2018;115:2578-2583. [PMID: 29531060 PMCID: PMC5856504 DOI: 10.1073/pnas.1708283115] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Indexed: 01/24/2023] Open

Cited by Other Article(s)

Wang B, Luan Y. Evaluation of normalization methods for predicting quantitative phenotypes in metagenomic data analysis. Front Genet 2024;15:1369628. [PMID: 38903761 PMCID: PMC11188486 DOI: 10.3389/fgene.2024.1369628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2024] [Accepted: 05/13/2024] [Indexed: 06/22/2024] Open

Loewinger G, Nunez RA, Mazumder R, Parmigiani G. Optimal ensemble construction for multistudy prediction with applications to mortality estimation. Stat Med 2024;43:1774-1789. [PMID: 38396313 DOI: 10.1002/sim.10006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2023] [Revised: 10/12/2023] [Accepted: 12/22/2023] [Indexed: 02/25/2024]

Shan Y, Huang C, Li Y, Zhu H. Merging or ensembling: integrative analysis in multiple neuroimaging studies. Biometrics 2024;80:ujae003. [PMID: 38465984 PMCID: PMC10926268 DOI: 10.1093/biomtc/ujae003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2022] [Revised: 11/27/2023] [Accepted: 01/10/2024] [Indexed: 03/12/2024]

Chekroud AM, Hawrilenko M, Loho H, Bondar J, Gueorguieva R, Hasan A, Kambeitz J, Corlett PR, Koutsouleris N, Krumholz HM, Krystal JH, Paulus M. Illusory generalizability of clinical prediction models. Science 2024;383:164-167. [PMID: 38207039 DOI: 10.1126/science.adg8538] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Received: 01/27/2023] [Accepted: 11/10/2023] [Indexed: 01/13/2024]

Heiling HM, Rashid NU, Li Q, Ibrahim JG. glmmPen: High Dimensional Penalized Generalized Linear Mixed Models. The R Journal 2023;15:106-128. [PMID: 38818017 PMCID: PMC11138212 DOI: 10.32614/rj-2023-086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2024]

Gao Y, Sun F. Batch normalization followed by merging is powerful for phenotype prediction integrating multiple heterogeneous studies. PLoS Comput Biol 2023;19:e1010608. [PMID: 37844077 PMCID: PMC10602384 DOI: 10.1371/journal.pcbi.1010608] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2022] [Revised: 10/26/2023] [Accepted: 09/30/2023] [Indexed: 10/18/2023] Open

Abstract

Heterogeneity in different genomic studies compromises the performance of machine learning models in cross-study phenotype predictions. Overcoming heterogeneity when incorporating different studies in terms of phenotype prediction is a challenging and critical step for developing machine learning algorithms with reproducible prediction performance on independent datasets. We investigated the best approaches to integrate different studies of the same type of omics data under a variety of different heterogeneities. We developed a comprehensive workflow to simulate a variety of different types of heterogeneity and evaluate the performances of different integration methods together with batch normalization by using ComBat. We also demonstrated the results through realistic applications on six colorectal cancer (CRC) metagenomic studies and six tuberculosis (TB) gene expression studies, respectively. We showed that heterogeneity in different genomic studies can markedly negatively impact the machine learning classifier's reproducibility. ComBat normalization improved the prediction performance of machine learning classifier when heterogeneous populations are present, and could successfully remove batch effects within the same population. We also showed that the machine learning classifier's prediction accuracy can be markedly decreased as the underlying disease model became more different in training and test populations. Comparing different merging and integration methods, we found that merging and integration methods can outperform each other in different scenarios. In the realistic applications, we observed that the prediction accuracy improved when applying ComBat normalization with merging or integration methods in both CRC and TB studies. We illustrated that batch normalization is essential for mitigating both population differences of different studies and batch effects. 
We also showed that both the merging strategy and integration methods can achieve good performance when combined with batch normalization. In addition, we explored the potential of boosting phenotype prediction performance by rank aggregation methods and showed that rank aggregation methods performed similarly to other ensemble learning approaches.

Zhu H, Li T, Zhao B. Statistical Learning Methods for Neuroimaging Data Analysis with Applications. Annu Rev Biomed Data Sci 2023;6:73-104. [PMID: 37127052 DOI: 10.1146/annurev-biodatasci-020722-100353] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/03/2023]

Laajala TD, Sreekanth V, Soupir AC, Creed JH, Halkola AS, Calboli FCF, Singaravelu K, Orman MV, Colin-Leitzinger C, Gerke T, Fridley BL, Tyekucheva S, Costello JC. A harmonized resource of integrated prostate cancer clinical, -omic, and signature features. Sci Data 2023;10:430. [PMID: 37407670 PMCID: PMC10322899 DOI: 10.1038/s41597-023-02335-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2023] [Accepted: 06/27/2023] [Indexed: 07/07/2023] Open

Colicino E, Fiorito G. DNA methylation-based biomarkers for cardiometabolic-related traits and their importance for risk stratification. Current Opinion in Epidemiology and Public Health 2023;2:25-31. [PMID: 38601732 PMCID: PMC11003758 DOI: 10.1097/pxh.0000000000000020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/12/2024]

Laajala TD, Sreekanth V, Soupir A, Creed J, Calboli FCF, Singaravelu K, Orman M, Colin-Leitzinger C, Gerke T, Fridley BL, Tyekucheva S, Costello JC. curatedPCaData: Integration of clinical, genomic, and signature features in a curated and harmonized prostate cancer data resource. bioRxiv 2023:2023.01.17.524403. [PMID: 36711769 PMCID: PMC9882125 DOI: 10.1101/2023.01.17.524403] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/21/2023]

Wu Y, Ren B, Patil P. A pairwise strategy for imputing predictive features when combining multiple datasets. Bioinformatics 2022;39:6964381. [PMID: 36576001 PMCID: PMC9835467 DOI: 10.1093/bioinformatics/btac839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2022] [Revised: 11/30/2022] [Accepted: 12/27/2022] [Indexed: 12/29/2022] Open

Loewinger G, Patil P, Kishida KT, Parmigiani G. Hierarchical resampling for bagging in multistudy prediction with applications to human neurochemical sensing. Ann Appl Stat 2022;16:2145-2165. [PMID: 36274786 PMCID: PMC9586160 DOI: 10.1214/21-aoas1574] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]

Abstract

We propose the "study strap ensemble", which combines advantages of two common approaches to fitting prediction models when multiple training datasets ("studies") are available: pooling studies and fitting one model versus averaging predictions from multiple models each fit to individual studies. The study strap ensemble fits models to bootstrapped datasets, or "pseudo-studies." These are generated by resampling from multiple studies with a hierarchical resampling scheme that generalizes the randomized cluster bootstrap. The study strap is controlled by a tuning parameter that determines the proportion of observations to draw from each study. When the parameter is set to its lowest value, each pseudo-study is resampled from only a single study. When it is high, the study strap ignores the multi-study structure and generates pseudo-studies by merging the datasets and drawing observations like a standard bootstrap. We empirically show the optimal tuning value often lies in between, and prove that special cases of the study strap draw the merged dataset and the set of original studies as pseudo-studies. We extend the study strap approach with an ensemble weighting scheme that utilizes information in the distribution of the covariates of the test dataset. Our work is motivated by neuroscience experiments using real-time neurochemical sensing during awake behavior in humans. Current techniques to perform this kind of research require measurements from an electrode placed in the brain during awake neurosurgery and rely on prediction models to estimate neurotransmitter concentrations from the electrical measurements recorded by the electrode. These models are trained by combining multiple datasets that are collected in vitro under heterogeneous conditions in order to promote accuracy of the models when applied to data collected in the brain. 
A prevailing challenge is deciding how to combine studies or ensemble models trained on different studies to enhance model generalizability. Our methods produce marked improvements in simulations and in this application. All methods are available in the studyStrap CRAN package.
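The resampling idea in this abstract can be illustrated with a minimal sketch. It is not the studyStrap package's implementation: the function name `draw_pseudo_study`, the `bag_size` parameter standing in for the tuning parameter, and the choice of the average study size as the pseudo-study size are all assumptions made for illustration. A bag of study labels is drawn with replacement, and each study then contributes observations in proportion to how often it appears in the bag.

```python
import random
from collections import Counter


def draw_pseudo_study(studies, bag_size, rng):
    """Draw one pseudo-study from a list of studies (hypothetical helper).

    studies  : list of lists, one list of observations per study
    bag_size : assumed stand-in for the tuning parameter; 1 means each
               pseudo-study comes from a single study, large values spread
               the draw across all studies
    rng      : a random.Random instance, passed in for reproducibility
    """
    k = len(studies)
    # Level 1: resample study labels with replacement to form the bag.
    bag = Counter(rng.randrange(k) for _ in range(bag_size))
    # Assumption: a pseudo-study is about as large as an average study.
    target = round(sum(len(s) for s in studies) / k)
    pseudo = []
    for idx, count in bag.items():
        # Level 2: resample observations within each selected study,
        # in proportion to that study's share of the bag.
        n_draw = round(target * count / bag_size)
        pseudo.extend(rng.choices(studies[idx], k=n_draw))
    return pseudo
```

With `bag_size=1` every pseudo-study is a bootstrap sample of one study. As `bag_size` grows, the shares approach equal proportions, which for equally sized studies resembles bootstrapping the merged data. An ensemble would fit one model per pseudo-study and average the predictions.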

Niu X, Gou J, Chang H, Lowe M, Zhang F(Z). Classification model with weighted regularization to improve the reproducibility of neuroimaging signature selection. Stat Med 2022;41:5046-5060. [DOI: 10.1002/sim.9553] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/03/2021] [Revised: 06/16/2022] [Accepted: 07/26/2022] [Indexed: 11/10/2022]

Krepel J, Kircher M, Kohls M, Jung K. Comparison of merging strategies for building machine learning models on multiple independent gene expression data sets. Stat Anal Data Min 2022. [DOI: 10.1002/sam.11549] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]

Kim MP, Kern C, Goldwasser S, Kreuter F, Reingold O. Universal adaptability: Target-independent inference that competes with propensity scoring. Proc Natl Acad Sci U S A 2022;119:e2108097119. [PMID: 35046023 PMCID: PMC8794832 DOI: 10.1073/pnas.2108097119] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/29/2021] [Accepted: 12/02/2021] [Indexed: 11/20/2022] Open

Tarumi S, Takeuchi W, Qi R, Ning X, Ruppert L, Ban H, Robertson DH, Schleyer TK, Kawamoto K. Predicting pharmacotherapeutic outcomes for type 2 diabetes: An evaluation of three approaches to leveraging electronic health record data from multiple sources. J Biomed Inform 2022;129:104001. [DOI: 10.1016/j.jbi.2022.104001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2021] [Revised: 12/30/2021] [Accepted: 01/17/2022] [Indexed: 10/19/2022]

Nwosu IO, Piccolo SR. A systematic review of datasets that can help elucidate relationships among gene expression, race, and immunohistochemistry-defined subtypes in breast cancer. Cancer Biol Ther 2021;22:417-429. [PMID: 34412551 DOI: 10.1080/15384047.2021.1953902] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/17/2022] Open

Zhang Y, Patil P, Johnson WE, Parmigiani G. Robustifying genomic classifiers to batch effects via ensemble learning. Bioinformatics 2021;37:1521-1527. [PMID: 33245114 DOI: 10.1093/bioinformatics/btaa986] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/15/2020] [Revised: 10/20/2020] [Accepted: 11/13/2020] [Indexed: 01/08/2023] Open

Abstract

MOTIVATION

Genomic data are often produced in batches due to practical restrictions, which may lead to unwanted variation in data caused by discrepancies across batches. Such 'batch effects' often have negative impact on downstream biological analysis and need careful consideration. In practice, batch effects are usually addressed by specifically designed software, which merge the data from different batches, then estimate batch effects and remove them from the data. Here, we focus on classification and prediction problems, and propose a different strategy based on ensemble learning. We first develop prediction models within each batch, then integrate them through ensemble weighting methods.

RESULTS

We provide a systematic comparison between these two strategies using studies targeting diverse populations infected with tuberculosis. In one study, we simulated increasing levels of heterogeneity across random subsets of the study, which we treat as simulated batches. We then use the two methods to develop a genomic classifier for the binary indicator of disease status. We evaluate the accuracy of prediction in another independent study targeting a different population cohort. We observed that in independent validation, while merging followed by batch adjustment provides better discrimination at low level of heterogeneity, our ensemble learning strategy achieves more robust performance, especially at high severity of batch effects. These observations provide practical guidelines for handling batch effects in the development and evaluation of genomic classifiers.

AVAILABILITY AND IMPLEMENTATION

The data underlying this article are available in the article and in its online supplementary material. Processed data are available in the GitHub repository with implementation code, at https://github.com/zhangyuqing/bea_ensemble.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
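The strategy described under MOTIVATION (fit one model per batch, then integrate the models by ensemble weighting) can be sketched in a few lines. This is a toy illustration only, not the code in the repository cited above: the one-dimensional threshold classifier, the helper names, and the choice to weight each model by its accuracy on the other batches are assumptions, since the abstract does not specify the weighting scheme.

```python
from statistics import mean


def fit_threshold(batch):
    """Toy 1-D classifier: midpoint between the two class means.

    batch is a list of (x, y) pairs with y in {0, 1}; the positive class
    is assumed to have larger x values.
    """
    pos = [x for x, y in batch if y == 1]
    neg = [x for x, y in batch if y == 0]
    return (mean(pos) + mean(neg)) / 2


def predict(threshold, x):
    return 1 if x > threshold else 0


def accuracy(threshold, data):
    return mean(1 if predict(threshold, x) == y else 0 for x, y in data)


def ensemble_weights(batches, models):
    """Assumed scheme: weight each model by its accuracy on the batches
    it was not trained on, normalized to sum to 1."""
    raw = []
    for i, model in enumerate(models):
        others = [obs for j, b in enumerate(batches) if j != i for obs in b]
        raw.append(accuracy(model, others))
    total = sum(raw)
    return [r / total for r in raw]


def ensemble_predict(models, weights, x):
    """Weighted vote of the per-batch models."""
    score = sum(w * predict(m, x) for m, w in zip(models, weights))
    return 1 if score >= 0.5 else 0


# Three simulated batches whose measurements are shifted relative to each other.
batches = [
    [(0.0, 0), (1.0, 0), (3.0, 1), (4.0, 1)],
    [(0.5, 0), (1.5, 0), (3.5, 1), (4.5, 1)],
    [(1.0, 0), (2.0, 0), (4.0, 1), (5.0, 1)],
]
models = [fit_threshold(b) for b in batches]
weights = ensemble_weights(batches, models)
```

Because no batch is ever merged with another, a shift in one batch only affects that batch's model, and a model that transfers poorly to the other batches receives a smaller weight.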

Park JA, Sung MD, Kim HH, Park YR. Weight-Based Framework for Predictive Modeling of Multiple Databases With Noniterative Communication Without Data Sharing: Privacy-Protecting Analytic Method for Multi-Institutional Studies. JMIR Med Inform 2021;9:e21043. [PMID: 33818396 PMCID: PMC8056295 DOI: 10.2196/21043] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/07/2020] [Revised: 11/16/2020] [Accepted: 03/03/2021] [Indexed: 01/22/2023] Open

Abstract

Background

Securing the representativeness of study populations is crucial in biomedical research to ensure high generalizability. In this regard, using multi-institutional data has advantages in medicine. However, combining data physically is difficult as the confidential nature of biomedical data causes privacy issues. Therefore, a methodological approach is necessary when using multi-institution medical data for research to develop a model without sharing data between institutions.

Objective

This study aims to develop a weight-based integrated predictive model of multi-institutional data, which does not require iterative communication between institutions, to improve average predictive performance by increasing the generalizability of the model under privacy-preserving conditions without sharing patient-level data.

Methods

The weight-based integrated model generates a weight for each institutional model and builds an integrated model for multi-institutional data based on these weights. We performed 3 simulations to show the weight characteristics and to determine the number of repetitions of the weight required to obtain stable values. We also conducted an experiment using real multi-institutional data to verify the developed weight-based integrated model. We selected 10 hospitals (2845 intensive care unit [ICU] stays in total) from the electronic intensive care unit Collaborative Research Database to predict ICU mortality with 11 features. To evaluate the validity of our model, compared with a centralized model, which was developed by combining all the data of 10 hospitals, we used proportional overlap (ie, 0.5 or less indicates a significant difference at a level of .05; and 2 indicates 2 CIs overlapping completely). Standard and Firth logistic regression models were applied for the 2 simulations and the experiment.

Results

The results of these simulations indicate that the weight of each institution is determined by 2 factors (ie, the data size of each institution and how well each institutional model fits into the overall institutional data) and that repeatedly generating 200 weights is necessary per institution. In the experiment, the estimated area under the receiver operating characteristic curve (AUC) and 95% CIs were 81.36% (79.37%-83.36%) and 81.95% (80.03%-83.87%) in the centralized model and weight-based integrated model, respectively. The proportional overlap of the CIs for AUC in both the weight-based integrated model and the centralized model was approximately 1.70, and that of overlap of the 11 estimated odds ratios was over 1, except for 1 case.

Conclusions

In the experiment where real multi-institutional data were used, our model showed similar results to the centralized model without iterative communication between institutions. In addition, our weight-based integrated model provided a weighted average model by integrating 10 models overfitted or underfitted, compared with the centralized model. The proposed weight-based integrated model is expected to provide an efficient distributed research approach as it increases the generalizability of the model and does not require iterative communication.

Zhang Y, Bernau C, Parmigiani G, Waldron L. The impact of different sources of heterogeneity on loss of accuracy from genomic prediction models. Biostatistics 2020;21:253-268. [PMID: 30202918 DOI: 10.1093/biostatistics/kxy044] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 02/08/2018] [Revised: 07/22/2018] [Accepted: 08/04/2018] [Indexed: 11/13/2022] Open

Wang M, Luo W, Jones K, Bian X, Williams R, Higson H, Wu D, Hicks B, Yeager M, Zhu B. SomaticCombiner: improving the performance of somatic variant calling based on evaluation tests and a consensus approach. Sci Rep 2020;10:12898. [PMID: 32732891 PMCID: PMC7393490 DOI: 10.1038/s41598-020-69772-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 03/08/2020] [Accepted: 07/16/2020] [Indexed: 02/06/2023] Open

Westerman K, Fernández‐Sanlés A, Patil P, Sebastiani P, Jacques P, Starr JM, Deary IJ, Liu Q, Liu S, Elosua R, DeMeo DL, Ordovás JM. Epigenomic Assessment of Cardiovascular Disease Risk and Interactions With Traditional Risk Metrics. J Am Heart Assoc 2020;9:e015299. [PMID: 32308120 PMCID: PMC7428544 DOI: 10.1161/jaha.119.015299] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 11/13/2019] [Accepted: 03/10/2020] [Indexed: 12/16/2022]

Abstract

Background Epigenome-wide association studies for cardiometabolic risk factors have discovered multiple loci associated with incident cardiovascular disease (CVD). However, few studies have sought to directly optimize a predictor of CVD risk. Furthermore, it is challenging to train multivariate models across multiple studies in the presence of study- or batch effects. Methods and Results Here, we analyzed existing DNA methylation data collected using the Illumina HumanMethylation450 microarray to create a predictor of CVD risk across 3 cohorts: Women's Health Initiative, Framingham Heart Study Offspring Cohort, and Lothian Birth Cohorts. We trained Cox proportional hazards-based elastic net regressions for incident CVD separately in each cohort and used a recently introduced cross-study learning approach to integrate these individual scores into an ensemble predictor. The methylation-based risk score was associated with CVD time-to-event in a held-out fraction of the Framingham data set (hazard ratio per SD=1.28, 95% CI, 1.10-1.50) and predicted myocardial infarction status in the independent REGICOR (Girona Heart Registry) data set (odds ratio per SD=2.14, 95% CI, 1.58-2.89). These associations remained after adjustment for traditional cardiovascular risk factors and were similar to those from elastic net models trained on a directly merged data set. Additionally, we investigated interactions between the methylation-based risk score and both genetic and biochemical CVD risk, showing preliminary evidence of an enhanced performance in those with less traditional risk factor elevation. Conclusions This investigation provides proof-of-concept for a genome-wide, CVD-specific epigenomic risk score and suggests that DNA methylation data may enable the discovery of high-risk individuals who would be missed by alternative risk metrics.

Ramchandran M, Patil P, Parmigiani G. Tree-Weighting for Multi-Study Ensemble Learners. Pacific Symposium on Biocomputing 2020;25:451-462. [PMID: 31797618 PMCID: PMC6980320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]

Zhang X, Hu Y, Aouizerat BE, Peng G, Marconi VC, Corley MJ, Hulgan T, Bryant KJ, Zhao H, Krystal JH, Justice AC, Xu K. Machine learning selected smoking-associated DNA methylation signatures that predict HIV prognosis and mortality. Clin Epigenetics 2018;10:155. [PMID: 30545403 PMCID: PMC6293604 DOI: 10.1186/s13148-018-0591-z] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 09/19/2018] [Accepted: 11/26/2018] [Indexed: 12/20/2022] Open

Abstract

Background

The effects of tobacco smoking on epigenome-wide methylation signatures in white blood cells (WBCs) collected from persons living with HIV may have important implications for their immune-related outcomes, including frailty and mortality. The application of a machine learning approach to the analysis of CpG methylation in the epigenome enables the selection of phenotypically relevant features from high-dimensional data. Using this approach, we now report that a set of smoking-associated DNA-methylated CpGs predicts HIV prognosis and mortality in an HIV-positive veteran population.

Results

We first identified 137 epigenome-wide significant CpGs for smoking in WBCs from 1137 HIV-positive individuals (p < 1.70E−07). To examine whether smoking-associated CpGs were predictive of HIV frailty and mortality, we applied ensemble-based machine learning to build a model in a training sample employing 408,583 CpGs. A set of 698 CpGs was selected and predictive of high HIV frailty in a testing sample [(area under curve (AUC) = 0.73, 95%CI 0.63~0.83)] and was replicated in an independent sample [(AUC = 0.78, 95%CI 0.73~0.83)]. We further found that a DNA methylation index constructed from the 698 CpGs was associated with a 5-year survival rate [HR = 1.46; 95%CI 1.06~2.02, p = 0.02]. Interestingly, the 698 CpGs located on 445 genes were enriched on the integrin signaling pathway (p = 9.55E−05, false discovery rate = 0.036), which is responsible for the regulation of the cell cycle, differentiation, and adhesion.

Conclusion

We demonstrated that smoking-associated DNA methylation features in white blood cells predict HIV infection-related clinical outcomes in a population living with HIV.

Electronic supplementary material

The online version of this article (10.1186/s13148-018-0591-z) contains supplementary material, which is available to authorized users.

Reproducibility of research: Issues and proposed remedies. Proc Natl Acad Sci U S A 2018. [PMID: 29531033 DOI: 10.1073/pnas.1802324115] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Indexed: 01/12/2023] Open