1
Slieker RC, Münch M, Donnelly LA, Bouland GA, Dragan I, Kuznetsov D, Elders PJM, Rutter GA, Ibberson M, Pearson ER, 't Hart LM, van de Wiel MA, Beulens JWJ. An omics-based machine learning approach to predict diabetes progression: a RHAPSODY study. Diabetologia 2024; 67:885-894. [PMID: 38374450 PMCID: PMC10954972 DOI: 10.1007/s00125-024-06105-8] [Received: 07/03/2023] [Accepted: 01/05/2024] [Indexed: 02/21/2024]
Abstract
AIMS/HYPOTHESIS People with type 2 diabetes are heterogeneous in their disease trajectory, with some progressing more quickly to insulin initiation than others. Although classical biomarkers such as age, HbA1c and diabetes duration are associated with glycaemic progression, it is unclear how well such variables predict insulin initiation or requirement and whether newly identified markers have added predictive value.
METHODS In two prospective cohort studies as part of IMI-RHAPSODY, we investigated whether clinical variables and three types of molecular markers (metabolites, lipids, proteins) can predict time to insulin requirement using different machine learning approaches (lasso, ridge, GRridge, random forest). Clinical variables included age, sex, HbA1c, HDL-cholesterol and C-peptide. Models were run with unpenalised clinical variables (i.e. always included in the model without weights) or penalised clinical variables, or without clinical variables. Model development was performed in one cohort and the model was applied in a second cohort. Model performance was evaluated using Harrell's C statistic.
RESULTS Of the 585 individuals from the Hoorn Diabetes Care System (DCS) cohort, 69 required insulin during follow-up (1.0-11.4 years); of the 571 individuals in the Genetics of Diabetes Audit and Research in Tayside Scotland (GoDARTS) cohort, 175 required insulin during follow-up (0.3-11.8 years). Overall, the clinical variables and proteins were selected in the different models most often, followed by the metabolites. The most frequently selected clinical variables were HbA1c (18 of the 36 models, 50%), age (15 models, 41.2%) and C-peptide (15 models, 41.2%). Base models (age, sex, BMI, HbA1c) including only clinical variables performed moderately in both the DCS discovery cohort (C statistic 0.71 [95% CI 0.64, 0.79]) and the GoDARTS replication cohort (C 0.71 [95% CI 0.69, 0.75]). A more extensive model including HDL-cholesterol and C-peptide performed better in both cohorts (DCS, C 0.74 [95% CI 0.67, 0.81]; GoDARTS, C 0.73 [95% CI 0.69, 0.77]). Two proteins, lactadherin and proto-oncogene tyrosine-protein kinase receptor, were most consistently selected and slightly improved model performance.
CONCLUSIONS/INTERPRETATION Using machine learning approaches, we show that insulin requirement risk can be modestly well predicted by predominantly clinical variables. Inclusion of molecular markers improves the prognostic performance beyond that of clinical variables by up to 5%. Such prognostic models could be useful for identifying people with diabetes at high risk of progressing quickly to treatment intensification.
DATA AVAILABILITY Summary statistics of lipidomic, proteomic and metabolomic data are available from a Shiny dashboard at https://rhapdata-app.vital-it.ch.
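The abstract above reports model performance as Harrell's C statistic. As a rough illustration of what that number measures, the following is a minimal pure-Python sketch for right-censored data. The function name `harrell_c` and the toy inputs are hypothetical and not taken from the study; pairs with tied event times are simply skipped, which is a simplification.

```python
def harrell_c(times, events, risks):
    """Fraction of comparable pairs in which the subject with the earlier
    observed event also has the higher predicted risk (risk ties count 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:  # subject i had the event before time j
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (made up): follow-up in years, event indicator, predicted risk score.
c_index = harrell_c([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.7, 0.8, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why values around 0.71 to 0.74 are described as moderate discrimination.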
Affiliation(s)
- Roderick C Slieker
  - Department of Epidemiology and Data Science, Amsterdam UMC, Vrije Universiteit, Amsterdam, the Netherlands
  - Amsterdam Public Health, Amsterdam, the Netherlands
  - Amsterdam Cardiovascular Sciences, Amsterdam, the Netherlands
  - Department of Cell and Chemical Biology, Leiden University Medical Center, Leiden, the Netherlands
- Magnus Münch
  - Department of Epidemiology and Data Science, Amsterdam UMC, Vrije Universiteit, Amsterdam, the Netherlands
- Louise A Donnelly
  - Population Health & Genomics, School of Medicine, University of Dundee, Dundee, UK
- Gerard A Bouland
  - Department of Cell and Chemical Biology, Leiden University Medical Center, Leiden, the Netherlands
  - Delft Bioinformatics Lab, Delft University of Technology, Delft, the Netherlands
- Iulian Dragan
  - Vital-IT Group, SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Dmitry Kuznetsov
  - Vital-IT Group, SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Petra J M Elders
  - Amsterdam Public Health, Amsterdam, the Netherlands
  - Amsterdam Cardiovascular Sciences, Amsterdam, the Netherlands
  - Department of General Practice, Amsterdam UMC, Vrije Universiteit, Amsterdam, the Netherlands
- Guy A Rutter
  - CRCHUM, Faculty of Medicine, Université de Montréal, Montréal, QC, Canada
  - Department of Metabolism, Digestion and Reproduction, Faculty of Medicine, Imperial College London, London, UK
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Republic of Singapore
- Mark Ibberson
  - Vital-IT Group, SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Ewan R Pearson
  - Population Health & Genomics, School of Medicine, University of Dundee, Dundee, UK
- Leen M 't Hart
  - Department of Epidemiology and Data Science, Amsterdam UMC, Vrije Universiteit, Amsterdam, the Netherlands
  - Department of Cell and Chemical Biology, Leiden University Medical Center, Leiden, the Netherlands
  - Department of Biomedical Data Sciences, Section of Molecular Epidemiology, Leiden University Medical Center, Leiden, the Netherlands
- Mark A van de Wiel
  - Department of Epidemiology and Data Science, Amsterdam UMC, Vrije Universiteit, Amsterdam, the Netherlands
  - Amsterdam Public Health, Amsterdam, the Netherlands
- Joline W J Beulens
  - Department of Epidemiology and Data Science, Amsterdam UMC, Vrije Universiteit, Amsterdam, the Netherlands
  - Amsterdam Public Health, Amsterdam, the Netherlands
  - Amsterdam Cardiovascular Sciences, Amsterdam, the Netherlands
  - Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, the Netherlands
2
Senar N, van de Wiel M, Zwinderman AH, Hof MH. TOSCCA: a framework for interpretation and testing of sparse canonical correlations. Bioinformatics Advances 2024; 4:vbae021. [PMID: 38456127 PMCID: PMC10919946 DOI: 10.1093/bioadv/vbae021] [Received: 11/03/2023] [Revised: 01/24/2024] [Accepted: 02/14/2024] [Indexed: 03/09/2024]
Abstract
Summary In clinical and biomedical research, multiple high-dimensional datasets are nowadays routinely collected from omics and imaging devices. Multivariate methods, such as Canonical Correlation Analysis (CCA), integrate two (or more) datasets to discover and understand underlying biological mechanisms. For an explorative method like CCA, interpretation is key. We present a sparse CCA method based on soft-thresholding that produces near-orthogonal components, allows for browsing over various sparsity levels, and supports permutation-based hypothesis testing. Our soft-thresholding approach avoids tuning of a penalty parameter. Such tuning is computationally burdensome and may yield unintelligible results. In addition, unlike alternative approaches, our method is less dependent on the initialization. We examined the performance of our approach with simulations and illustrated its use on real cancer genomics data from drug sensitivity screens. Moreover, we compared its performance to Penalized Matrix Analysis (PMA), which is a popular alternative for sparse CCA with a focus on yielding interpretable results. Compared to PMA, our method offers improved interpretability of the results, while not compromising, or even improving, signal discovery.
Availability and implementation The software and simulation framework are available at https://github.com/nuria-sv/toscca.
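To give a feel for how soft-thresholding can produce a chosen sparsity level without a separately tuned penalty parameter, here is a minimal pure-Python sketch. It is not the TOSCCA implementation: the function name `soft_threshold_top_k` and the rule of setting the threshold from a target number of nonzero weights are assumptions made for illustration.

```python
import math


def soft_threshold_top_k(weights, k):
    """Soft-threshold a weight vector so that at most k entries stay nonzero,
    then rescale the result to unit Euclidean length."""
    if k <= 0:
        return [0.0] * len(weights)
    magnitudes = sorted((abs(w) for w in weights), reverse=True)
    # Assumed rule: use the (k+1)-th largest magnitude as the threshold,
    # so only the k largest entries survive the shrinkage.
    threshold = magnitudes[k] if k < len(weights) else 0.0
    shrunk = [math.copysign(max(abs(w) - threshold, 0.0), w) for w in weights]
    norm = math.sqrt(sum(s * s for s in shrunk))
    return [s / norm for s in shrunk] if norm > 0 else shrunk


# Hypothetical canonical weights; keep the two strongest variables.
sparse_w = soft_threshold_top_k([3.0, -1.0, 0.5, -4.0], k=2)
```

In an iterative sparse CCA scheme, a step like this would typically be applied to each canonical weight vector after every update, and rerunning with different values of `k` corresponds to browsing over sparsity levels.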
Affiliation(s)
- Nuria Senar
  - Department of Epidemiology & Data Science, Amsterdam School of Public Health, Amsterdam UMC, 1105 AZ Nord-Holland, The Netherlands
- Mark van de Wiel
  - Department of Epidemiology & Data Science, Amsterdam School of Public Health, Amsterdam UMC, 1105 AZ Nord-Holland, The Netherlands
- Aeilko H Zwinderman
  - Department of Epidemiology & Data Science, Amsterdam School of Public Health, Amsterdam UMC, 1105 AZ Nord-Holland, The Netherlands
- Michel H Hof
  - Department of Epidemiology & Data Science, Amsterdam School of Public Health, Amsterdam UMC, 1105 AZ Nord-Holland, The Netherlands
3
Hoogland J, Debray TPA, Crowther MJ, Riley RD, IntHout J, Reitsma JB, Zwinderman AH. Regularized parametric survival modeling to improve risk prediction models. Biom J 2024; 66:e2200319. [PMID: 37775946 DOI: 10.1002/bimj.202200319] [Received: 11/29/2022] [Revised: 04/30/2023] [Accepted: 09/17/2023] [Indexed: 10/01/2023]
Abstract
We propose to combine the benefits of flexible parametric survival modeling and regularization to improve risk prediction modeling in the context of time-to-event data. Thereto, we introduce ridge, lasso, elastic net, and group lasso penalties for both log hazard and log cumulative hazard models. The log (cumulative) hazard in these models is represented by a flexible function of time that may depend on the covariates (i.e., covariate effects may be time-varying). We show that the optimization problem for the proposed models can be formulated as a convex optimization problem and provide a user-friendly R implementation for model fitting and penalty parameter selection based on cross-validation. Simulation study results show the advantage of regularization in terms of increased out-of-sample prediction accuracy and improved calibration and discrimination of predicted survival probabilities, especially when sample size was relatively small with respect to model complexity. An applied example illustrates the proposed methods. In summary, our work provides both a foundation for and an easily accessible implementation of regularized parametric survival modeling and suggests that it improves out-of-sample prediction performance.
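As a rough illustration of what a penalized parametric survival objective looks like, the sketch below evaluates an elastic net penalized negative log-likelihood for a log hazard model. It assumes a constant baseline hazard (an exponential model), which is a strong simplification of the flexible time functions described above, and the function names are hypothetical rather than taken from the authors' R implementation.

```python
import math


def elastic_net_penalty(beta, lam, alpha):
    """lam * (alpha * L1 + (1 - alpha) / 2 * squared L2);
    alpha = 1 gives the lasso penalty, alpha = 0 the ridge penalty."""
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)


def penalized_neg_loglik(intercept, beta, rows, times, events, lam, alpha):
    """Negative log-likelihood of an exponential survival model with
    log hazard = intercept + x . beta, plus an elastic net penalty on beta.
    The intercept is left unpenalized (an assumption of this sketch)."""
    nll = 0.0
    for x, t, d in zip(rows, times, events):
        eta = intercept + sum(b * xi for b, xi in zip(beta, x))
        # Log-likelihood contribution: d * log(hazard) - cumulative hazard.
        nll -= d * eta - t * math.exp(eta)
    return nll + elastic_net_penalty(beta, lam, alpha)


# Toy evaluation with made-up data: two subjects, one covariate.
value = penalized_neg_loglik(0.0, [0.0], [[1.0], [2.0]], [3.0, 5.0], [1, 0],
                             lam=1.0, alpha=0.5)
```

Minimizing this objective over the intercept and `beta`, with `lam` chosen by cross-validation, mirrors the workflow the abstract describes, although the actual models let the log (cumulative) hazard vary flexibly with time.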
Affiliation(s)
- J Hoogland
  - Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
  - Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands
- T P A Debray
  - Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
  - Cochrane Netherlands, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- M J Crowther
  - Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
- R D Riley
  - School for Medicine, Keele University, Keele, Staffordshire, UK
- J IntHout
  - Radboud Institute for Health Sciences (RIHS), Radboud University Medical Center, Nijmegen, The Netherlands
- J B Reitsma
  - Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
  - Cochrane Netherlands, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- A H Zwinderman
  - Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands
4
van Nee MM, Wessels LFA, van de Wiel MA. ecpc: an R-package for generic co-data models for high-dimensional prediction. BMC Bioinformatics 2023; 24:172. [PMID: 37101151 PMCID: PMC10134536 DOI: 10.1186/s12859-023-05289-x] [Received: 11/13/2022] [Accepted: 04/12/2023] [Indexed: 04/28/2023]
Abstract
BACKGROUND High-dimensional prediction considers data with more variables than samples. Generic research goals are to find the best predictor or to select variables. Results may be improved by exploiting prior information in the form of co-data, providing complementary data not on the samples, but on the variables. We consider adaptive ridge penalised generalised linear and Cox models, in which the variable-specific ridge penalties are adapted to the co-data to give a priori more weight to more important variables. The R-package ecpc originally accommodated various and possibly multiple co-data sources, including categorical co-data, i.e. groups of variables, and continuous co-data. Continuous co-data, however, were handled by adaptive discretisation, potentially inefficiently modelling and losing information. As continuous co-data such as external p values or correlations often arise in practice, more generic co-data models are needed.
RESULTS Here, we present an extension to the method and software for generic co-data models, particularly for continuous co-data. At the basis lies a classical linear regression model, regressing prior variance weights on the co-data. Co-data variables are then estimated with empirical Bayes moment estimation. After placing the estimation procedure in the classical regression framework, extension to generalised additive and shape constrained co-data models is straightforward. Besides, we show how ridge penalties may be transformed to elastic net penalties. In simulation studies we first compare various co-data models for continuous co-data from the extension to the original method. Secondly, we compare variable selection performance to other variable selection methods. The extension is faster than the original method and shows improved prediction and variable selection performance for non-linear co-data relations. Moreover, we demonstrate use of the package in several genomics examples throughout the paper.
CONCLUSIONS The R-package ecpc accommodates linear, generalised additive and shape constrained additive co-data models for the purpose of improved high-dimensional prediction and variable selection. The extended version of the package as presented here (version number 3.1.1 and higher) is available on CRAN (https://cran.r-project.org/web/packages/ecpc/).
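The idea of variable-specific ridge penalties driven by co-data can be sketched in a few lines of pure Python. This is not the ecpc algorithm: the linear co-data model with fixed `intercept` and `slope`, the mapping penalty = global penalty / prior variance weight, and all function names are assumptions for illustration, and the empirical Bayes estimation of the co-data model is omitted entirely.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def penalties_from_codata(codata, global_penalty, intercept, slope, floor=1e-6):
    """Hypothetical linear co-data model: prior variance weight is linear in
    the co-data value; a larger weight means a smaller ridge penalty."""
    weights = [max(intercept + slope * z, floor) for z in codata]
    return [global_penalty / w for w in weights]


def adaptive_ridge(x_rows, y, penalties):
    """Ridge estimate with one penalty per variable:
    beta = (X'X + diag(penalties))^{-1} X'y."""
    p = len(penalties)
    xtx = [[sum(row[j] * row[k] for row in x_rows) for k in range(p)]
           for j in range(p)]
    for j in range(p):
        xtx[j][j] += penalties[j]
    xty = [sum(row[j] * yi for row, yi in zip(x_rows, y)) for j in range(p)]
    return solve(xtx, xty)


# Toy example: the first variable has more favourable co-data, so it is
# penalized less and shrunk less than the second.
lams = penalties_from_codata([1.0, 0.0], global_penalty=1.0,
                             intercept=0.5, slope=0.5)
beta = adaptive_ridge([[1.0, 0.0], [0.0, 1.0]], [2.0, 2.0], lams)
```

The point of the sketch is only the mechanism: information about the variables, rather than about the samples, changes how strongly each coefficient is shrunk.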
Affiliation(s)
- Mirrelijn M van Nee
  - Epidemiology & Data Science, Amsterdam Public Health research institute, Amsterdam University Medical Centers, Amsterdam, The Netherlands
- Lodewyk F A Wessels
  - Molecular Carcinogenesis, Netherlands Cancer Institute, Amsterdam, The Netherlands
  - Computational Cancer Biology, Oncode Institute, Amsterdam, The Netherlands
  - Intelligent Systems, Delft University Medical Centers, Delft, The Netherlands
- Mark A van de Wiel
  - Epidemiology & Data Science, Amsterdam Public Health research institute, Amsterdam University Medical Centers, Amsterdam, The Netherlands