1
Loo RTJ, Tsurkalenko O, Klucken J, Mangone G, Khoury F, Vidailhet M, Corvol JC, Krüger R, Glaab E. Levodopa-induced dyskinesia in Parkinson's disease: Insights from cross-cohort prognostic analysis using machine learning. Parkinsonism Relat Disord 2024; 126:107054. [PMID: 38991633 DOI: 10.1016/j.parkreldis.2024.107054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2024] [Revised: 06/29/2024] [Accepted: 07/02/2024] [Indexed: 07/13/2024]
Abstract
BACKGROUND Prolonged levodopa treatment in Parkinson's disease (PD) often leads to motor complications, including levodopa-induced dyskinesia (LID). Despite continuous levodopa treatment, some patients do not develop LID symptoms, even in later stages of the disease. OBJECTIVE This study explores machine learning (ML) methods using baseline clinical characteristics to predict the development of LID in PD patients over four years, across multiple cohorts. METHODS Using interpretable ML approaches, we analyzed clinical data from three independent longitudinal PD cohorts (LuxPARK, n = 356; PPMI, n = 484; ICEBERG, n = 113) to develop cross-cohort prognostic models and identify potential predictors for the development of LID. We examined cohort-specific and shared predictive factors, assessing model performance and stability through cross-validation analyses. RESULTS Consistent cross-validation results for single and multiple cohort analyses highlighted the effectiveness of the ML models and identified baseline clinical characteristics with significant predictive value for the LID prognosis in PD. Predictors positively correlated with LID include axial symptoms, freezing of gait, and rigidity in the lower extremities. Conversely, the risk of developing LID was inversely associated with the occurrence of resting tremors, higher body weight, later onset of PD, and visuospatial abilities. CONCLUSIONS This study presents interpretable ML models for dyskinesia prognosis with significant predictive power in cross-cohort analyses. The models may pave the way for proactive interventions against dyskinesia in PD by optimizing levodopa dosing regimens and adjunct treatments with dopamine agonists or MAO-B inhibitors, and by employing non-pharmacological interventions such as dietary adjustments affecting levodopa absorption for high-risk LID patients.
Affiliation(s)
- Rebecca Ting Jiin Loo
  - Biomedical Data Science, Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Olena Tsurkalenko
  - Translational Neuroscience, Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg; Transversal Translational Medicine, Luxembourg Institute of Health (LIH), Strassen, Luxembourg; Digital Medicine Group, Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg; Digital Medicine Group, Department of Precision Health, Luxembourg Institute of Health (LIH), Strassen, Luxembourg; Digital Medicine Group, Centre Hospitalier de Luxembourg (CHL), Luxembourg
- Jochen Klucken
  - Digital Medicine Group, Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg; Digital Medicine Group, Department of Precision Health, Luxembourg Institute of Health (LIH), Strassen, Luxembourg; Digital Medicine Group, Centre Hospitalier de Luxembourg (CHL), Luxembourg
- Graziella Mangone
  - Sorbonne Université, Paris Brain Institute - ICM, Inserm, CNRS, Assistance Publique Hôpitaux de Paris, Pitié-Salpêtrière Hospital, Department of Neurology, Paris, 75013, France
- Fouad Khoury
  - Sorbonne Université, Paris Brain Institute - ICM, Inserm, CNRS, Assistance Publique Hôpitaux de Paris, Pitié-Salpêtrière Hospital, Department of Neurology, Paris, 75013, France
- Marie Vidailhet
  - Sorbonne Université, Paris Brain Institute - ICM, Inserm, CNRS, Assistance Publique Hôpitaux de Paris, Pitié-Salpêtrière Hospital, Department of Neurology, Paris, 75013, France
- Jean-Christophe Corvol
  - Sorbonne Université, Paris Brain Institute - ICM, Inserm, CNRS, Assistance Publique Hôpitaux de Paris, Pitié-Salpêtrière Hospital, Department of Neurology, Paris, 75013, France
- Rejko Krüger
  - Translational Neuroscience, Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg; Transversal Translational Medicine, Luxembourg Institute of Health (LIH), Strassen, Luxembourg; Department of Neurology, Centre Hospitalier de Luxembourg (CHL), Luxembourg
- Enrico Glaab
  - Biomedical Data Science, Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
2
Chhoa H, Chabriat H, Chevret S, Biard L. Comparison of models for stroke-free survival prediction in patients with CADASIL. Sci Rep 2023; 13:22443. [PMID: 38105268 PMCID: PMC10725863 DOI: 10.1038/s41598-023-49552-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2023] [Accepted: 12/09/2023] [Indexed: 12/19/2023] Open
Abstract
Cerebral autosomal dominant arteriopathy with subcortical infarcts and leukoencephalopathy, which is caused by mutations of the NOTCH3 gene, has a highly heterogeneous progression, presenting with declines in various clinical scores and occurrences of various clinical events. To help assess disease progression, this work focused on predicting the composite endpoint of stroke-free survival time by comparing the performance of Cox proportional hazards regression to that of machine learning models using one of four feature selection approaches applied to demographic, clinical and magnetic resonance imaging observational data collected from a study cohort of 482 patients. The quality of the modeling process and the predictive performance were evaluated in a nested cross-validation procedure using the time-dependent Brier score and AUC at 5 years from baseline, the former measuring the overall performance including calibration and the latter highlighting the discrimination ability, with both metrics taking into account the presence of right-censoring. The best models were the componentwise gradient boosting model for the Brier score (mean 0.165) and the random survival forest model for the AUC (mean 0.773), both combined with the LASSO feature selection method.
Affiliation(s)
- Henri Chhoa
  - ECSTRRA Team, Université Paris Cité, UMR1153, INSERM, Paris, France
- Hugues Chabriat
  - Centre NeuroVasculaire Translationnel - Centre de Référence CERVCO, DMU NeuroSciences, Hôpital Lariboisière, GHU APHP-Nord, Université Paris Cité, Paris, France
  - INSERM NeuroDiderot UMR 1141, GenMedStroke Team, Paris, France
- Sylvie Chevret
  - ECSTRRA Team, Université Paris Cité, UMR1153, INSERM, Paris, France
- Lucie Biard
  - ECSTRRA Team, Université Paris Cité, UMR1153, INSERM, Paris, France
3
Ding M, Li R, Qin J, Ning J. A double-robust test for high-dimensional gene coexpression networks conditioning on clinical information. Biometrics 2023; 79:3227-3238. [PMID: 37312587 PMCID: PMC10838184 DOI: 10.1111/biom.13890] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2022] [Accepted: 05/18/2023] [Indexed: 06/15/2023]
Abstract
It has been increasingly appealing to evaluate whether expression levels of two genes in a gene coexpression network are still dependent given samples' clinical information, in which the conditional independence test plays an essential role. For enhanced robustness regarding model assumptions, we propose a class of double-robust tests for evaluating the dependence of bivariate outcomes after controlling for known clinical information. Although the proposed test relies on the marginal density functions of bivariate outcomes given clinical information, the test remains valid as long as one of the density functions is correctly specified. Because of the closed-form variance formula, the proposed test procedure enjoys computational efficiency without requiring a resampling procedure or tuning parameters. We acknowledge the need to infer the conditional independence network with high-dimensional gene expressions, and further develop a procedure for multiple testing by controlling the false discovery rate. Numerical results show that our method accurately controls both the type-I error and false discovery rate, and it provides certain levels of robustness regarding model misspecification. We apply the method to a gastric cancer study with gene expression data to understand the associations between genes belonging to the transforming growth factor β signaling pathway given cancer-stage information.
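The abstract does not say which procedure the authors use to control the false discovery rate. As a generic illustration only, the sketch below shows the widely used Benjamini-Hochberg step-up procedure applied to a list of p-values, such as those that might come from testing many gene pairs. The function name and the example p-values are hypothetical and are not taken from the paper.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans (in input order): True where the
    hypothesis is rejected at FDR level alpha (Benjamini-Hochberg)."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * alpha / m.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            max_k = rank
    # Reject the hypotheses with the max_k smallest p-values.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for four gene pairs (illustrative only).
example = [0.041, 0.001, 0.205, 0.008]
print(benjamini_hochberg(example, alpha=0.05))
```

With these made-up inputs, only the second and fourth tests are rejected, because 0.041 exceeds its step-up threshold of 3 × 0.05 / 4 = 0.0375.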
Affiliation(s)
- Maomao Ding
  - Meta Platforms, Inc., Menlo Park, California, USA
- Ruosha Li
  - Department of Biostatistics and Data Science, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Jin Qin
  - Biostatistics Research Branch, National Institute of Allergy and Infectious Diseases, Bethesda, Maryland, USA
- Jing Ning
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
4
Fanizzi A, Pomarico D, Rizzo A, Bove S, Comes MC, Didonna V, Giotta F, La Forgia D, Latorre A, Pastena MI, Petruzzellis N, Rinaldi L, Tamborra P, Zito A, Lorusso V, Massafra R. Machine learning survival models trained on clinical data to identify high risk patients with hormone responsive HER2 negative breast cancer. Sci Rep 2023; 13:8575. [PMID: 37237020 PMCID: PMC10220052 DOI: 10.1038/s41598-023-35344-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2022] [Accepted: 05/16/2023] [Indexed: 05/28/2023] Open
Abstract
For patients with early-stage, endocrine-positive, HER2-negative breast cancer, the benefit of adding chemotherapy to adjuvant endocrine therapy is still not confirmed. Several genomic tests are available on the market but are very expensive. Therefore, there is an urgent need to explore novel, reliable, and less expensive prognostic tools in this setting. In this paper, we present a machine learning survival model to estimate Invasive Disease-Free Events, trained on clinical and histological data commonly collected in clinical practice. We collected clinical and cytohistological outcomes of 145 patients referred to Istituto Tumori "Giovanni Paolo II". Three machine learning survival models are compared with the Cox proportional hazards regression according to time-dependent performance metrics evaluated in cross-validation. The c-index at 10 years obtained by random survival forest, gradient boosting, and component-wise gradient boosting is stable with or without feature selection at approximately 0.68 on average, compared with 0.57 obtained by the Cox model. Moreover, the machine learning survival models accurately discriminated low- and high-risk patients, thereby identifying a large group that can be spared chemotherapy in addition to hormone therapy. The preliminary results obtained by including only clinical determinants are encouraging. The integrated use of data already collected in clinical practice for routine diagnostic investigations, if properly analyzed, can reduce the time and costs of genomic tests.
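For readers unfamiliar with the c-index reported above, the following is a minimal sketch of Harrell's concordance index for right-censored data. It is a simplified, non-time-dependent version and is not the authors' evaluation code; the function name and the toy data are hypothetical. Pairs with tied times are skipped here for simplicity, which is an assumption of this sketch.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs in which the subject
    with the shorter observed time has the higher predicted risk.
    events[i] is 1 if the event was observed, 0 if censored."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: ignore tied times
        # a is the subject with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored: pair not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (illustrative only): four patients, the third is censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.1, 0.5, 0.7]
print(concordance_index(times, events, risk))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the reported values of roughly 0.68 and 0.57.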
Affiliation(s)
- Annarita Fanizzi
  - Struttura Semplice Dipartimentale di Fisica Sanitaria, I.R.C.C.S. Istituto Tumori "Giovanni Paolo II", Viale Orazio Flacco 65, 70124, Bari, Italy
- Domenico Pomarico
  - Struttura Semplice Dipartimentale di Fisica Sanitaria, I.R.C.C.S. Istituto Tumori "Giovanni Paolo II", Viale Orazio Flacco 65, 70124, Bari, Italy
- Alessandro Rizzo
  - Struttura Semplice Dipartimentale di Oncologia Per la Presa in Carico Globale del Paziente Oncologico "Don Tonino Bello", I.R.C.C.S. Istituto Tumori "Giovanni Paolo II", Viale Orazio Flacco 65, 70124, Bari, Italy
- Samantha Bove
  - Struttura Semplice Dipartimentale di Fisica Sanitaria, I.R.C.C.S. Istituto Tumori "Giovanni Paolo II", Viale Orazio Flacco 65, 70124, Bari, Italy
- Maria Colomba Comes
  - Struttura Semplice Dipartimentale di Fisica Sanitaria, I.R.C.C.S. Istituto Tumori "Giovanni Paolo II", Viale Orazio Flacco 65, 70124, Bari, Italy
- Vittorio Didonna
  - Struttura Semplice Dipartimentale di Fisica Sanitaria, I.R.C.C.S. Istituto Tumori "Giovanni Paolo II", Viale Orazio Flacco 65, 70124, Bari, Italy
- Francesco Giotta
  - Unità Operativa Complessa di Oncologia Medica, I.R.C.C.S. Istituto Tumori "Giovanni Paolo II", Viale Orazio Flacco 65, 70124, Bari, Italy
- Daniele La Forgia
  - Struttura Semplice Dipartimentale di Radiologia Senologica, I.R.C.C.S. Istituto Tumori "Giovanni Paolo II", Viale Orazio Flacco 65, 70124, Bari, Italy
- Agnese Latorre
  - Unità Operativa Complessa di Oncologia Medica, I.R.C.C.S. Istituto Tumori "Giovanni Paolo II", Viale Orazio Flacco 65, 70124, Bari, Italy
- Maria Irene Pastena
  - Unità Operativa Complessa di Anatomia Patologica, I.R.C.C.S. Istituto Tumori "Giovanni Paolo II", Viale Orazio Flacco 65, 70124, Bari, Italy
- Nicole Petruzzellis
  - Struttura Semplice Dipartimentale di Fisica Sanitaria, I.R.C.C.S. Istituto Tumori "Giovanni Paolo II", Viale Orazio Flacco 65, 70124, Bari, Italy
- Lucia Rinaldi
  - Struttura Semplice Dipartimentale di Oncologia Per la Presa in Carico Globale del Paziente Oncologico "Don Tonino Bello", I.R.C.C.S. Istituto Tumori "Giovanni Paolo II", Viale Orazio Flacco 65, 70124, Bari, Italy
- Pasquale Tamborra
  - Struttura Semplice Dipartimentale di Fisica Sanitaria, I.R.C.C.S. Istituto Tumori "Giovanni Paolo II", Viale Orazio Flacco 65, 70124, Bari, Italy
- Alfredo Zito
  - Unità Operativa Complessa di Anatomia Patologica, I.R.C.C.S. Istituto Tumori "Giovanni Paolo II", Viale Orazio Flacco 65, 70124, Bari, Italy
- Vito Lorusso
  - Unità Operativa Complessa di Oncologia Medica, I.R.C.C.S. Istituto Tumori "Giovanni Paolo II", Viale Orazio Flacco 65, 70124, Bari, Italy
- Raffaella Massafra
  - Struttura Semplice Dipartimentale di Fisica Sanitaria, I.R.C.C.S. Istituto Tumori "Giovanni Paolo II", Viale Orazio Flacco 65, 70124, Bari, Italy
5
Gündoğdu S. Efficient prediction of early-stage diabetes using XGBoost classifier with random forest feature selection technique. Multimedia Tools and Applications 2023; 82:1-19. [PMID: 37362660 PMCID: PMC10043839 DOI: 10.1007/s11042-023-15165-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2021] [Revised: 04/30/2022] [Accepted: 03/17/2023] [Indexed: 06/28/2023]
Abstract
Diabetes is one of the most common and serious diseases affecting human health. Early diagnosis and treatment are vital to prevent or delay complications related to diabetes. An automated diabetes detection system assists physicians in the early diagnosis of the disease and reduces complications by providing fast and precise results. This study aims to introduce a technique based on a combination of multiple linear regression (MLR), random forest (RF), and XGBoost (XG) to diagnose diabetes from questionnaire data. The MLR-RF algorithm is used for feature selection, and XG is used for classification in the proposed system. The dataset is diabetic hospital data from Sylhet, Bangladesh. It contains 520 instances, including 320 diabetics and 200 control instances. The performance of the classifiers is measured in terms of accuracy (ACC), precision (PPV), recall (SEN, sensitivity), F1 score (F1), and the area under the receiver-operating-characteristic curve (AUC). The results show that the proposed system achieves an accuracy of 99.2%, an AUC of 99.3%, and a prediction time of 0.04825 seconds. The feature selection method improves the prediction time, although it does not affect the accuracy of the four compared classifiers. The results compare favorably with those of other studies. The proposed method can be used as an auxiliary tool in diagnosing diabetes.
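As a reminder of how the threshold-based metrics named in this abstract relate to one another, here is a minimal sketch that derives accuracy, precision, recall, and F1 from confusion-matrix counts. The function name and the counts are hypothetical and are not results from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy (ACC), precision (PPV), recall (SEN), and F1
    from true/false positive and negative counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted
    # or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The AUC is not included because it depends on the ranking of predicted scores rather than on a single set of confusion-matrix counts.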
Affiliation(s)
- Serdar Gündoğdu
  - Department of Computer Technologies, Dokuz Eylul University, Bergama Vocational School, Izmir, Turkey
6
Song X, Liu M, Waitman LR, Patel A, Simpson SQ. Clinical factors associated with rapid treatment of sepsis. PLoS One 2021; 16:e0250923. [PMID: 33956846 PMCID: PMC8101717 DOI: 10.1371/journal.pone.0250923] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/14/2020] [Accepted: 04/17/2021] [Indexed: 12/29/2022] Open
Abstract
PURPOSE To understand which clinical presenting features of sepsis patients are historically associated with rapid treatment involving antibiotics and fluids, as appropriate. DESIGN This was a retrospective, observational cohort study using a machine-learning model with an embedded feature selection mechanism (gradient boosting machine). METHODS For adult patients (age ≥ 18 years) who were admitted through the Emergency Department (ED) and met clinical criteria for severe sepsis from 11/2007 to 05/2018 at an urban tertiary academic medical center, we developed gradient boosting models (GBMs) using a total of 760 original and derived variables, including demographic variables, laboratory values, vital signs, infection diagnosis present on admission, and historical comorbidities. We identified the most impactful factors strongly associated with rapid treatment, and further applied Shapley Additive exPlanation (SHAP) values to examine the marginal effects of each factor. RESULTS For the subgroups with or without the fluid bolus treatment component, the models achieved an area under the receiver operating characteristic curve of 0.91 [95% CI, 0.86-0.95] and 0.84 [95% CI, 0.81-0.86], and a sensitivity of 0.81 [95% CI, 0.72-0.87] and 0.91 [95% CI, 0.81-0.97], respectively. We identified the 20 most impactful factors associated with rapid treatment for each subgroup. In the non-hypotensive subgroup, initial physiological values were the most impactful to the model, while in the fluid bolus subgroup, value minima and maxima tended to be the most impactful. CONCLUSION These machine learning methods identified factors associated with rapid treatment of severe sepsis patients from a large volume of high-dimensional clinical data. The results provide insight into differences in the rapid provision of treatment among patients with sepsis.
Affiliation(s)
- Xing Song
  - Health Management and Informatics, School of Medicine, University of Missouri, Columbia, MO, United States of America
- Mei Liu
  - Division of Medical Informatics, Department of Internal Medicine, University of Kansas Medical Center, Kansas City, KS, United States of America
- Lemuel R. Waitman
  - Health Management and Informatics, School of Medicine, University of Missouri, Columbia, MO, United States of America
- Anurag Patel
  - Anurag4Health, Kansas City, KS, United States of America
- Steven Q. Simpson
  - Pulmonary and Critical Care Division, Department of Internal Medicine, University of Kansas Medical Center, Kansas City, KS, United States of America
7
Lin YC, Keenan K, Gong J, Panjwani N, Avolio J, Lin F, Adam D, Barrett P, Bégin S, Berthiaume Y, Bilodeau L, Bjornson C, Brusky J, Burgess C, Chilvers M, Consunji-Araneta R, Côté-Maurais G, Dale A, Donnelly C, Fairservice L, Griffin K, Henderson N, Hillaby A, Hughes D, Iqbal S, Itterman J, Jackson M, Karlsen E, Kosteniuk L, Lazosky L, Leung W, Levesque V, Maille É, Mateos-Corral D, McMahon V, Merjaneh M, Morrison N, Parkins M, Pike J, Price A, Quon BS, Reisman J, Smith C, Smith MJ, Vadeboncoeur N, Veniott D, Viczko T, Wilcox P, van Wylick R, Cutting G, Tullis E, Ratjen F, Rommens JM, Sun L, Solomon M, Stephenson AL, Brochiero E, Blackman S, Corvol H, Strug LJ. Cystic fibrosis-related diabetes onset can be predicted using biomarkers measured at birth. Genet Med 2021; 23:927-933. [PMID: 33500570 PMCID: PMC8105168 DOI: 10.1038/s41436-020-01073-x] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 07/22/2020] [Revised: 12/09/2020] [Accepted: 12/15/2020] [Indexed: 12/16/2022] Open
Abstract
Purpose Cystic fibrosis (CF), caused by pathogenic variants in the CF transmembrane conductance regulator (CFTR), affects multiple organs including the exocrine pancreas, which is a causal contributor to cystic fibrosis–related diabetes (CFRD). Untreated CFRD causes increased CF-related mortality whereas early detection can improve outcomes. Methods Using genetic and easily accessible clinical measures available at birth, we constructed a CFRD prediction model using the Canadian CF Gene Modifier Study (CGS; n = 1,958) and validated it in the French CF Gene Modifier Study (FGMS; n = 1,003). We investigated genetic variants shown to associate with CF disease severity across multiple organs in genome-wide association studies. Results The strongest predictors included sex, CFTR severity score, and several genetic variants including one annotated to PRSS1, which encodes cationic trypsinogen. The final model defined in the CGS shows excellent agreement when validated on the FGMS, and the risk classifier shows slightly better performance at predicting CFRD risk later in life in both studies. Conclusion We demonstrated clinical utility by comparing CFRD prevalence rates between the top 10% of individuals with the highest risk and the bottom 10% with the lowest risk. A web-based application was developed to provide practitioners with patient-specific CFRD risk to guide CFRD monitoring and treatment.
Affiliation(s)
- Yu-Chung Lin
  - Department of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
- Katherine Keenan
  - Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Jiafen Gong
  - Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Naim Panjwani
  - Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Julie Avolio
  - Program in Translational Medicine, The Hospital for Sick Children, Toronto, ON, Canada
- Fan Lin
  - Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Damien Adam
  - Department of Medicine, Faculty of Medicine, Université de Montréal, Montréal, QC, Canada
  - CRCHUM, Montréal, QC, Canada
- Yves Berthiaume
  - Department of Medicine, Faculty of Medicine, Université de Montréal, Montréal, QC, Canada
- Lara Bilodeau
  - Centre de recherche de l'Institut universitaire de cardiologie et de pneumologie de Québec-Université Laval, Québec City, QC, Canada
- Janna Brusky
  - Jim Pattison Children's Hospital, Saskatoon, SK, Canada
- Mark Chilvers
  - British Columbia Children's Hospital, Vancouver, BC, Canada
- Andrea Dale
  - Queen Elizabeth II Health Sciences Centre, Halifax, NS, Canada
- Shaikh Iqbal
  - The Children's Hospital of Winnipeg, Winnipeg, MB, Canada
- Mary Jackson
  - Royal University Hospital, Saskatoon, SK, Canada
- Winnie Leung
  - University of Alberta Hospital, Edmonton, AB, Canada
- Nancy Morrison
  - Queen Elizabeth II Health Sciences Centre, Halifax, NS, Canada
- April Price
  - The Children's Hospital of Western Ontario, London, ON, Canada
- Joe Reisman
  - The Children's Hospital of Eastern Ontario, Ottawa, ON, Canada
- Clare Smith
  - Foothills Medical Centre, Calgary, AB, Canada
- Mary Jane Smith
  - Janeway Children's Health & Rehabilitation Centre, St. John's, NL, Canada
- Nathalie Vadeboncoeur
  - Centre de recherche de l'Institut universitaire de cardiologie et de pneumologie de Québec-Université Laval, Québec City, QC, Canada
- Terry Viczko
  - British Columbia Children's Hospital, Vancouver, BC, Canada
- Garry Cutting
  - McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Felix Ratjen
  - Program in Translational Medicine, The Hospital for Sick Children, Toronto, ON, Canada
  - Division of Respiratory Medicine, Hospital for Sick Children, Toronto, ON, Canada
- Johanna M Rommens
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Lei Sun
  - Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
- Melinda Solomon
  - Division of Respiratory Medicine, Hospital for Sick Children, Toronto, ON, Canada
- Emmanuelle Brochiero
  - Department of Medicine, Faculty of Medicine, Université de Montréal, Montréal, QC, Canada
  - CRCHUM, Montréal, QC, Canada
- Scott Blackman
  - McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Harriet Corvol
  - Assistance Publique-Hôpitaux de Paris, Hôpital Trousseau, Pediatric Pulmonary Department, Paris, France
  - Sorbonne Université, Institut National de la Santé et de la Recherche Médicale, Centre de Recherche Saint Antoine, Paris, France
- Lisa J Strug
  - Department of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
  - Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
  - Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
  - The Center for Applied Genomics, The Hospital for Sick Children, Toronto, ON, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
8
Morris E, He K, Li Y, Li Y, Kang J. SurvBoost: An R Package for High-Dimensional Variable Selection in the Stratified Proportional Hazards Model via Gradient Boosting. The R Journal 2020; 12:105-117. [PMID: 34094592 PMCID: PMC8174798 DOI: 10.32614/rj-2020-018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
High-dimensional variable selection in the proportional hazards (PH) model has many successful applications in different areas. In practice, data may involve confounding variables that do not satisfy the PH assumption, in which case the stratified proportional hazards (SPH) model can be adopted to control the confounding effects by stratification without directly modeling the confounding effects. However, there is a lack of computationally efficient statistical software for high-dimensional variable selection in the SPH model. In this work an R package, SurvBoost, is developed to implement the gradient boosting algorithm for fitting the SPH model with high-dimensional covariate variables. Simulation studies demonstrate that in many scenarios SurvBoost can achieve better selection accuracy and reduce computational time substantially compared to the existing R package that implements boosting algorithms without stratification. The proposed R package is also illustrated by an analysis of gene expression data with survival outcome in The Cancer Genome Atlas study. In addition, a detailed hands-on tutorial for SurvBoost is provided.
Affiliation(s)
- Emily Morris
  - Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109
- Kevin He
  - Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109
- Yanming Li
  - Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109
- Yi Li
  - Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109
- Jian Kang
  - Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109
9
Song X, Waitman LR, Yu AS, Robbins DC, Hu Y, Liu M. Longitudinal Risk Prediction of Chronic Kidney Disease in Diabetic Patients Using a Temporal-Enhanced Gradient Boosting Machine: Retrospective Cohort Study. JMIR Med Inform 2020; 8:e15510. [PMID: 32012067 PMCID: PMC7055762 DOI: 10.2196/15510] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 07/16/2019] [Revised: 10/31/2019] [Accepted: 10/31/2019] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND Artificial intelligence-enabled electronic health record (EHR) analysis can revolutionize medical practice from the diagnosis and prediction of complex diseases to making recommendations in patient care, especially for chronic conditions such as chronic kidney disease (CKD), which is one of the most frequent complications in patients with diabetes and is associated with substantial morbidity and mortality. OBJECTIVE The longitudinal prediction of health outcomes requires effective representation of temporal data in the EHR. In this study, we proposed a novel temporal-enhanced gradient boosting machine (GBM) model that dynamically updates and ensembles learners based on new events in patient timelines to improve the prediction accuracy of CKD among patients with diabetes. METHODS Using a broad spectrum of deidentified EHR data on a retrospective cohort of 14,039 adult patients with type 2 diabetes and GBM as the base learner, we validated our proposed Landmark-Boosting model against three state-of-the-art temporal models for rolling predictions of 1-year CKD risk. RESULTS The proposed model uniformly outperformed other models, achieving an area under receiver operating curve of 0.83 (95% CI 0.76-0.85), 0.78 (95% CI 0.75-0.82), and 0.82 (95% CI 0.78-0.86) in predicting CKD risk with automatic accumulation of new data in later years (years 2, 3, and 4 since diabetes mellitus onset, respectively). The Landmark-Boosting model also maintained the best calibration across moderate- and high-risk groups and over time. The experimental results demonstrated that the proposed temporal model can not only accurately predict 1-year CKD risk but also improve performance over time with additionally accumulated data, which is essential for clinical use to improve renal management of patients with diabetes. CONCLUSIONS Incorporation of temporal information in EHR data can significantly improve predictive model performance and will particularly benefit patients who follow-up with their physicians as recommended.
Affiliation(s)
- Xing Song
- University of Kansas Medical Center, Department of Internal Medicine, Division of Medical Informatics, Kansas City, KS, United States
| | - Lemuel R Waitman
- University of Kansas Medical Center, Department of Internal Medicine, Division of Medical Informatics, Kansas City, KS, United States
| | - Alan Sl Yu
- University of Kansas Medical Center, Division of Nephrology and Hypertension and the Kidney Institute, Kansas City, KS, United States
| | - David C Robbins
- University of Kansas Medical Center, Diabetes Institute, Kansas City, KS, United States
| | - Yong Hu
- Jinan University, Big Data Decision Institute, Guangzhou, China
| | - Mei Liu
- University of Kansas Medical Center, Department of Internal Medicine, Division of Medical Informatics, Kansas City, KS, United States
| |
|
10
|
Song X, Waitman LR, Hu Y, Yu ASL, Robbins D, Liu M. An exploration of ontology-based EMR data abstraction for diabetic kidney disease prediction. AMIA Jt Summits Transl Sci Proc 2019; 2019:704-713. [PMID: 31259027 PMCID: PMC6568123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Diabetic kidney disease (DKD) is a critical and morbid complication of diabetes and the leading cause of chronic kidney disease in the developed world. Electronic medical records (EMRs) hold promise for supporting clinical decision-making, given their nationwide adoption and the rich information they contain about patients' health care experience. However, few retrospective studies have fully utilized EMR data to model DKD risk. This study examines the effectiveness of an unbiased, data-driven approach that uses a broad spectrum of EMR data to identify potential DKD patients 6 months prior to onset. We also evaluate how different levels of data granularity in Medications and Diagnoses observations affect prediction performance and knowledge discovery. The experimental results suggest that data granularity does not necessarily influence prediction accuracy, but it can dramatically change the internal structure of the predictive models.
Affiliation(s)
- Xing Song
- University of Kansas Medical Center, Department of Internal Medicine, Division of Medical Informatics, Kansas City, KS, USA
| | - Lemuel R Waitman
- University of Kansas Medical Center, Department of Internal Medicine, Division of Medical Informatics, Kansas City, KS, USA
| | - Yong Hu
- Jinan University, Big Data Decision Institute, Guangzhou, PRC
| | - Alan S L Yu
- University of Kansas Medical Center, Division of Nephrology and Hypertension and the Kidney Institute, Kansas City, KS, USA
| | - David Robbins
- University of Kansas Medical Center, Diabetes Institute, Kansas City, KS, USA
| | - Mei Liu
- University of Kansas Medical Center, Department of Internal Medicine, Division of Medical Informatics, Kansas City, KS, USA
| |
|
11
|
He K, Kang J, Hong HG, Zhu J, Li Y, Lin H, Xu H, Li Y. Covariance-Insured Screening. Comput Stat Data Anal 2019; 132:100-114. [PMID: 30880853 DOI: 10.1016/j.csda.2018.09.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/28/2022]
Abstract
Modern bio-technologies have produced a vast amount of high-throughput data with the number of predictors far greater than the sample size. In order to identify more novel biomarkers and understand biological mechanisms, it is vital to detect signals weakly associated with outcomes among ultrahigh-dimensional predictors. However, existing screening methods, which typically ignore correlation information, are likely to miss weak signals. By incorporating the inter-feature dependence, a covariance-insured screening approach is proposed to identify predictors that are jointly informative but marginally weakly associated with outcomes. The validity of the method is examined via extensive simulations and a real data study for selecting potential genetic factors related to the onset of multiple myeloma.
Affiliation(s)
- Kevin He
- Department of Biostatistics, School of Public Health, University of Michigan
| | - Jian Kang
- Department of Biostatistics, School of Public Health, University of Michigan
| | - Hyokyoung G Hong
- Department of Statistics and Probability, Michigan State University
| | - Ji Zhu
- Department of Statistics, University of Michigan
| | - Yanming Li
- Department of Biostatistics, School of Public Health, University of Michigan
| | - Huazhen Lin
- School of Statistics, Southwestern University of Finance and Economics
| | - Han Xu
- Department of Statistics, University of Michigan
| | - Yi Li
- Department of Biostatistics, School of Public Health, University of Michigan
| |
|
12
|
Song X, Waitman LR, Hu Y, Yu ASL, Robins D, Liu M. Robust clinical marker identification for diabetic kidney disease with ensemble feature selection. J Am Med Inform Assoc 2019; 26:242-253. [PMID: 30602020 PMCID: PMC7792755 DOI: 10.1093/jamia/ocy165] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 06/17/2018] [Revised: 11/05/2018] [Accepted: 11/21/2018] [Indexed: 11/15/2022]
Abstract
Objective Diabetic kidney disease (DKD) is one of the most frequent complications in diabetes associated with substantial morbidity and mortality. To accelerate DKD risk factor discovery, we present an ensemble feature selection approach to identify a robust set of discriminant factors using electronic medical records (EMRs). Material and Methods We identified a retrospective cohort of 15 645 adult patients with type 2 diabetes, excluding those with pre-existing kidney disease, and utilized all available clinical data types in modeling. We compared 3 machine-learning-based embedded feature selection methods in conjunction with 6 feature ensemble techniques for selecting top-ranked features in terms of robustness to data perturbations and predictability for DKD onset. Results The gradient boosting machine (GBM) with weighted mean rank feature ensemble technique achieved the best performance with an AUC of 0.82 [95%-CI, 0.81-0.83] on internal validation and 0.71 [95%-CI, 0.68-0.73] on external temporal validation. The ensemble model identified a set of 440 features from 84 872 unique clinical features that are both predictive of DKD onset and robust against data perturbations, including 191 labs, 51 visit details (mainly vital signs), 39 medications, 34 orders, 30 diagnoses, and 95 other clinical features. Discussion Many of the top-ranked features have not been included in the state-of-the-art DKD prediction models, but their relationships with kidney function have been suggested in existing literature. Conclusion Our ensemble feature selection framework provides an option for identifying a robust and parsimonious feature set from EMR data in an unbiased manner, which effectively aids in knowledge discovery for DKD risk factors.
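The weighted mean rank ensemble named in the Results can be illustrated with a small sketch. This is not the authors' implementation: it assumes each resampled model produces a feature ranking (1 = most important) and is weighted by some per-model quality score, here a hypothetical validation AUC. The feature names, the weights, and the rule of assigning unranked features the worst rank plus one are all assumptions made for illustration.

```python
def weighted_mean_rank(rankings, weights):
    """Aggregate per-model feature rankings into one consensus ordering.

    rankings: list of dicts mapping feature name -> rank (1 = best).
    weights:  one non-negative weight per ranking (assumed: model quality).
    Features missing from a ranking get rank (longest ranking + 1).
    """
    total = sum(weights)
    features = set().union(*rankings)
    worst = max(len(r) for r in rankings) + 1
    scores = {
        f: sum(w * r.get(f, worst) for r, w in zip(rankings, weights)) / total
        for f in features
    }
    # Lower weighted mean rank is better; ties are broken alphabetically.
    return sorted(scores, key=lambda f: (scores[f], f)), scores


# Hypothetical rankings from three models fit on perturbed training data.
toy_rankings = [
    {"egfr": 1, "hba1c": 2, "sbp": 3, "age": 4},
    {"hba1c": 1, "egfr": 2, "age": 3, "sbp": 4},
    {"egfr": 1, "sbp": 2, "hba1c": 3, "age": 4},
]
toy_weights = [0.82, 0.78, 0.80]  # assumed per-model validation AUCs
consensus, mean_ranks = weighted_mean_rank(toy_rankings, toy_weights)
```

With these toy inputs, `egfr` has the lowest weighted mean rank (about 1.33) and heads the consensus list; a top-k cut of that list would then define the selected feature set.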
Affiliation(s)
- Xing Song
- Department of Internal Medicine, Division of Medical Informatics, University of Kansas Medical Center, Kansas City, Kansas, USA
| | - Lemuel R Waitman
- Department of Internal Medicine, Division of Medical Informatics, University of Kansas Medical Center, Kansas City, Kansas, USA
| | - Yong Hu
- Big Data Decision Institute, Jinan University, Guangzhou, PRC
| | - Alan S L Yu
- Division of Nephrology and Hypertension and the Kidney Institute, University of Kansas Medical Center, Kansas City, Kansas, USA
| | - David Robins
- Diabetes Institute, University of Kansas Medical Center, Kansas City, Kansas, USA
| | - Mei Liu
- Department of Internal Medicine, Division of Medical Informatics, University of Kansas Medical Center, Kansas City, Kansas, USA
| |
|
13
|
He K, Zhou X, Jiang H, Wen X, Li Y. False discovery control for penalized variable selections with high-dimensional covariates. Stat Appl Genet Mol Biol 2018; 17:/j/sagmb.2018.17.issue-6/sagmb-2018-0038/sagmb-2018-0038.xml. [PMID: 30864387 PMCID: PMC6450074 DOI: 10.1515/sagmb-2018-0038] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/15/2022]
Abstract
Modern bio-technologies have produced a vast amount of high-throughput data in which the number of predictors far exceeds the sample size. Penalized variable selection has emerged as a powerful and efficient dimension reduction tool. However, control of false discoveries (i.e. inclusion of irrelevant variables) for penalized high-dimensional variable selection presents serious challenges. To effectively control the fraction of false discoveries for penalized variable selections, we propose a false discovery controlling procedure. The proposed method is general and flexible, and can work with a broad class of variable selection algorithms, not only for linear regressions, but also for generalized linear models and survival analysis.
Affiliation(s)
- Kevin He
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Xiang Zhou
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Hui Jiang
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- University of Michigan, Center for Computational Medicine and Bioinformatics, Ann Arbor, MI, USA
| | - Xiaoquan Wen
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- University of Michigan, Center for Computational Medicine and Bioinformatics, Ann Arbor, MI, USA
| | - Yi Li
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- University of Michigan, Center for Computational Medicine and Bioinformatics, Ann Arbor, MI, USA
| |
|
14
|
Ensembling Variable Selectors by Stability Selection for the Cox Model. Comput Intell Neurosci 2017; 2017:2747431. [PMID: 29270195 PMCID: PMC5706076 DOI: 10.1155/2017/2747431] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 05/13/2017] [Revised: 08/18/2017] [Accepted: 10/29/2017] [Indexed: 11/17/2022]
Abstract
As a pivotal tool to build interpretive models, variable selection plays an increasingly important role in high-dimensional data analysis. In recent years, variable selection ensembles (VSEs) have gained much interest due to their many advantages. Stability selection (Meinshausen and Bühlmann, 2010), a VSE technique based on subsampling in combination with a base algorithm like lasso, is an effective method to control the false discovery rate (FDR) and to improve selection accuracy in linear regression models. By adopting lasso as a base learner, we attempt to extend stability selection to handle variable selection problems in a Cox model. In our experience, it is crucial to set the regularization region Λ in lasso and the parameter λmin properly so that stability selection can work well. To the best of our knowledge, however, there is no literature addressing this problem in an explicit way. Therefore, we first provide a detailed procedure to specify Λ and λmin. Then, some simulated and real-world data with various censoring rates are used to examine how well stability selection performs. It is also compared with several other variable selection approaches. Experimental results demonstrate that it achieves better or competitive performance in comparison with several other popular techniques.
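The subsampling idea behind stability selection can be sketched in a few lines. The example below is a simplified illustration, not the paper's procedure: it uses a lasso for ordinary linear regression rather than a Cox model, a hand-picked two-value regularization region in place of the paper's rule for Λ and λmin, and an assumed selection threshold of 0.6. The helper names, the toy data, and all parameter values are assumptions; features and response are taken to be roughly centered, so no intercept is fitted.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    return z - t if z > t else z + t if z < -t else 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta, resid = [0.0] * p, list(y)
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def stability_selection(X, y, lambdas, n_subsamples=30, threshold=0.6, seed=0):
    """Keep features whose selection frequency reaches `threshold` for some lambda."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = {lam: [0] * p for lam in lambdas}
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)  # subsample half the data, no replacement
        Xs, ys = [X[i] for i in idx], [y[i] for i in idx]
        for lam in lambdas:
            beta = lasso_cd(Xs, ys, lam)
            for j in range(p):
                counts[lam][j] += beta[j] != 0.0
    freq = [max(counts[lam][j] for lam in lambdas) / n_subsamples for j in range(p)]
    return [j for j in range(p) if freq[j] >= threshold], freq


# Toy data: 6 features, only features 0 and 1 carry signal.
data_rng = random.Random(1)
X_toy = [[data_rng.gauss(0, 1) for _ in range(6)] for _ in range(60)]
y_toy = [3 * r[0] - 2 * r[1] + data_rng.gauss(0, 0.5) for r in X_toy]
selected, freq = stability_selection(X_toy, y_toy, lambdas=[0.8, 0.5])
```

Taking the maximum selection frequency over the regularization region is what makes the choice of that region matter: a region that extends to very small penalties lets noise features reach high frequencies, which is the sensitivity to Λ and λmin that the abstract highlights.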
|
15
|
Wu M, Zang Y, Zhang S, Huang J, Ma S. Accommodating missingness in environmental measurements in gene-environment interaction analysis. Genet Epidemiol 2017; 41:523-554. [PMID: 28657194 DOI: 10.1002/gepi.22055] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 12/01/2016] [Revised: 02/26/2017] [Accepted: 04/10/2017] [Indexed: 11/05/2022]
Abstract
For the prognosis of complex diseases, beyond the main effects of genetic (G) and environmental (E) factors, gene-environment (G-E) interactions also play an important role. Many approaches have been developed for detecting important G-E interactions, most of which assume that measurements are complete. In practical data analysis, missingness in E measurements is not uncommon, and failing to properly accommodate such missingness leads to biased estimation and false marker identification. In this study, we conduct G-E interaction analysis with prognosis data under an accelerated failure time (AFT) model. To accommodate missingness in E measurements, we adopt a nonparametric kernel-based data augmentation approach. With a well-designed weighting scheme, a nice "byproduct" is that the proposed approach enjoys a certain robustness property. A penalization approach, which respects the "main effects, interactions" hierarchy, is adopted for selection (of important interactions and main effects) and regularized estimation. The proposed approach has sound interpretations and a solid statistical basis. It outperforms multiple alternatives in simulation. The analysis of TCGA data on lung cancer and melanoma leads to interesting findings and models with superior prediction.
Affiliation(s)
- Mengyun Wu
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, People's Republic of China; Department of Biostatistics, Yale University, New Haven, Connecticut, United States of America
| | - Yangguang Zang
- Department of Biostatistics, Yale University, New Haven, Connecticut, United States of America; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, People's Republic of China
| | - Sanguo Zhang
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, People's Republic of China
| | - Jian Huang
- Department of Statistics and Actuarial Science, University of Iowa, Iowa City, Iowa, United States of America
| | - Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, Connecticut, United States of America
| |
|