1
Ferrario PG, Gedrich K. Machine learning and personalized nutrition: a promising liaison? Eur J Clin Nutr 2024; 78:74-76. [PMID: 37833568 PMCID: PMC10774117 DOI: 10.1038/s41430-023-01350-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Revised: 09/12/2023] [Accepted: 09/20/2023] [Indexed: 10/15/2023]
Affiliation(s)
- Paola G Ferrario
- Department of Physiology and Biochemistry of Nutrition, Max Rubner-Institut, Karlsruhe, Germany.
- Kurt Gedrich
- Technical University of Munich, ZIEL - Institute for Food & Health, Research Group Public Health Nutrition, Freising, Germany
2
Rahnenführer J, De Bin R, Benner A, Ambrogi F, Lusa L, Boulesteix AL, Migliavacca E, Binder H, Michiels S, Sauerbrei W, McShane L. Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges. BMC Med 2023; 21:182. [PMID: 37189125 DOI: 10.1186/s12916-023-02858-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2022] [Accepted: 04/03/2023] [Indexed: 05/17/2023] Open
Abstract
BACKGROUND In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions. METHODS Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 "High-dimensional data" of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD. RESULTS The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided. 
CONCLUSIONS This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.
Affiliation(s)
- Axel Benner
- Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Federico Ambrogi
- Department of Clinical Sciences and Community Health, University of Milan, Milan, Italy
- Scientific Directorate, IRCCS Policlinico San Donato, San Donato Milanese, Italy
- Lara Lusa
- Department of Mathematics, Faculty of Mathematics, Natural Sciences and Information Technology, University of Primorska, Koper, Slovenia
- Institute of Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia
- Anne-Laure Boulesteix
- Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig Maximilian University of Munich, Munich, Germany
- Harald Binder
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Stefan Michiels
- Service de Biostatistique et d'Épidémiologie, Gustave Roussy, Université Paris-Saclay, Villejuif, France
- Oncostat U1018, Inserm, Université Paris-Saclay, Labeled Ligue Contre le Cancer, Villejuif, France
- Willi Sauerbrei
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Lisa McShane
- Biometric Research Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, Bethesda, MD, USA.
3
Ezugwu AE, Oyelade ON, Ikotun AM, Agushaka JO, Ho YS. Machine Learning Research Trends in Africa: A 30 Years Overview with Bibliometric Analysis Review. Archives of Computational Methods in Engineering: State of the Art Reviews 2023; 30:1-31. [PMID: 37359741 PMCID: PMC10148585 DOI: 10.1007/s11831-023-09930-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/18/2023] [Accepted: 04/19/2023] [Indexed: 06/28/2023]
Abstract
The machine learning (ML) paradigm has gained much popularity today. Its algorithmic models are employed in every field, such as natural language processing, pattern recognition, object detection, image recognition, earth observation and many other research areas. In fact, machine learning technologies and their inevitable impact suffice in many technological transformation agendas currently being propagated by many nations, for which the already yielded benefits are outstanding. From a regional perspective, several studies have shown that machine learning technology can help address some of Africa's most pervasive problems, such as poverty alleviation, improving education, delivering quality healthcare services, and addressing sustainability challenges like food security and climate change. In this state-of-the-art paper, a critical bibliometric analysis study is conducted, coupled with an extensive literature survey on recent developments and associated applications in machine learning research with a perspective on Africa. The presented bibliometric analysis study consists of 2761 machine learning-related documents, of which 89% were articles with at least 482 citations published in 903 journals during the past three decades. Furthermore, the collated documents were retrieved from the Science Citation Index EXPANDED, comprising research publications from 54 African countries between 1993 and 2021. The bibliometric study shows the visualization of the current landscape and future trends in machine learning research and its application to facilitate future collaborative research and knowledge exchange among authors from different research institutions scattered across the African continent.
Affiliation(s)
- Absalom E. Ezugwu
- Unit for Data Science and Computing, North-West University, 11 Hoffman Street, Potchefstroom, 2520 South Africa
- Olaide N. Oyelade
- Department of Computer Science, Faculty of Physical Sciences, Ahmadu Bello University, Zaria, Nigeria
- Abiodun M. Ikotun
- Unit for Data Science and Computing, North-West University, 11 Hoffman Street, Potchefstroom, 2520 South Africa
- Jeffery O. Agushaka
- Unit for Data Science and Computing, North-West University, 11 Hoffman Street, Potchefstroom, 2520 South Africa
- Yuh-Shan Ho
- Trend Research Centre, Asia University, No. 500, Lioufeng Road, Wufeng, Taichung, 41354 Taiwan
4
Molnar C, König G, Bischl B, Casalicchio G. Model-agnostic feature importance and effects with dependent features: a conditional subgroup approach. Data Min Knowl Discov 2023. [DOI: 10.1007/s10618-022-00901-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/12/2023]
Abstract
The interpretation of feature importance in machine learning models is challenging when features are dependent. Permutation feature importance (PFI) ignores such dependencies, which can cause misleading interpretations due to extrapolation. A possible remedy is more advanced conditional PFI approaches that enable the assessment of feature importance conditional on all other features. Due to this shift in perspective and in order to enable correct interpretations, it is beneficial if the conditioning is transparent and comprehensible. In this paper, we propose a new sampling mechanism for the conditional distribution based on permutations in conditional subgroups. As these subgroups are constructed using tree-based methods such as transformation trees, the conditioning becomes inherently interpretable. This not only provides a simple and effective estimator of conditional PFI, but also local PFI estimates within the subgroups. In addition, we apply the conditional subgroups approach to partial dependence plots, a popular method for describing feature effects that can also suffer from extrapolation when features are dependent and interactions are present in the model. In simulations and a real-world application, we demonstrate the advantages of the conditional subgroup approach over existing methods: It allows computing conditional PFI that is more true to the data than existing proposals and enables a fine-grained interpretation of feature effects and importance within the conditional subgroups.
5
Lam M, Chen CY, Hill WD, Xia C, Tian R, Levey DF, Gelernter J, Stein MB, Hatoum AS, Huang H, Malhotra AK, Runz H, Ge T, Lencz T. Collective genomic segments with differential pleiotropic patterns between cognitive dimensions and psychopathology. Nat Commun 2022; 13:6868. [PMID: 36369282 PMCID: PMC9652380 DOI: 10.1038/s41467-022-34418-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 08/02/2021] [Accepted: 10/24/2022] [Indexed: 11/13/2022] Open
Abstract
Cognitive deficits are known to be related to most forms of psychopathology. Here, we perform local genetic correlation analysis as a means of identifying independent segments of the genome that show biologically interpretable pleiotropic associations between cognitive dimensions and psychopathology. We identify collective segments of the genome, which we call "meta-loci", showing differential pleiotropic patterns for psychopathology relative to either cognitive task performance (CTP) or performance on a non-cognitive factor (NCF) derived from educational attainment. We observe that neurodevelopmental gene sets expressed during the prenatal-early childhood period predominate in CTP-relevant meta-loci, while post-natal gene sets are more involved in NCF-relevant meta-loci. Further, we demonstrate that neurodevelopmental gene sets are dissociable across CTP meta-loci with respect to their spatial distribution across the brain. Additionally, we find that GABA-ergic, cholinergic, and glutamatergic genes drive pleiotropic relationships within dissociable meta-loci.
Affiliation(s)
- Max Lam
- Division of Psychiatry Research, The Zucker Hillside Hospital, Northwell, Glen Oaks, NY, USA
- Institute of Behavioral Science, Feinstein Institutes for Medical Research, Manhasset, NY, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Analytical and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
- Institute of Mental Health, Singapore, Singapore
- Chia-Yen Chen
- Translational Biology, Research and Development, Biogen Inc, Cambridge, MA, USA
- W David Hill
- Lothian Birth Cohorts group, Department of Psychology, University of Edinburgh, Edinburgh, UK
- Charley Xia
- Lothian Birth Cohorts group, Department of Psychology, University of Edinburgh, Edinburgh, UK
- Ruoyu Tian
- Computational Biology and Human Genetics, Dewpoint Therapeutics, Boston, MA, USA
- Daniel F Levey
- Department of Psychiatry, Yale University School of Medicine, New Haven, CT, USA
- VA Connecticut Healthcare System, West Haven, CT, USA
- Joel Gelernter
- Department of Psychiatry, Yale University School of Medicine, New Haven, CT, USA
- VA Connecticut Healthcare System, West Haven, CT, USA
- Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
- Department of Neuroscience, Yale University School of Medicine, New Haven, CT, USA
- Murray B Stein
- VA San Diego Healthcare System, San Diego, CA, USA
- Department of Psychiatry, University of California, San Diego, La Jolla, CA, USA
- Herbert Wertheim School of Public Health and Human Longevity Science, University of California San Diego, La Jolla, CA, USA
- Alexander S Hatoum
- Department of Psychiatry, Washington University in St. Louis Medical School, St. Louis, MO, USA
- Hailiang Huang
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Analytical and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
- Anil K Malhotra
- Division of Psychiatry Research, The Zucker Hillside Hospital, Northwell, Glen Oaks, NY, USA
- Institute of Behavioral Science, Feinstein Institutes for Medical Research, Manhasset, NY, USA
- Department of Psychiatry, Zucker School of Medicine at Hofstra/Northwell, Hempstead, NY, USA
- Department of Molecular Medicine, Zucker School of Medicine at Hofstra/Northwell, Hempstead, NY, USA
- Heiko Runz
- Translational Biology, Research and Development, Biogen Inc, Cambridge, MA, USA
- Tian Ge
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Department of Psychiatry, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Todd Lencz
- Division of Psychiatry Research, The Zucker Hillside Hospital, Northwell, Glen Oaks, NY, USA.
- Institute of Behavioral Science, Feinstein Institutes for Medical Research, Manhasset, NY, USA.
- Department of Psychiatry, Zucker School of Medicine at Hofstra/Northwell, Hempstead, NY, USA.
- Department of Molecular Medicine, Zucker School of Medicine at Hofstra/Northwell, Hempstead, NY, USA.
6
Machine learning-based genetic diagnosis models for hereditary hearing loss by the GJB2, SLC26A4 and MT-RNR1 variants. EBioMedicine 2021; 69:103322. [PMID: 34161886 PMCID: PMC8237285 DOI: 10.1016/j.ebiom.2021.103322] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 12/10/2020] [Revised: 03/18/2021] [Accepted: 03/18/2021] [Indexed: 12/16/2022] Open
Abstract
Background: Hereditary hearing loss (HHL) is the most common sensory deficit, which highly afflicts humans. With gene sequencing technology development, more variants will be identified and support genetic diagnoses, which is difficult for human experts to diagnose. This study aims to develop a machine learning-based genetic diagnosis model of HHL-related variants of GJB2, SLC26A4 and MT-RNR1. Methods: This case-control study included 1898 subjects, among which 1354 were HHL patients and 544 were carriers. Risk assessment models were established based on variants at 144 sites in three genes related to HHL by building six machine learning (ML) models. We compared the ML models with the genetic risk score (GRS) and expert interpretation (EI) to verify the clinical performance. Findings: Among the six ML models, the support vector machine (SVM) showed the best performance. For the prediction of HHL-related gene sites in subjects with variants, the area under the receiver operating characteristic curve (AUC) of the SVM model was 0.803 (0.680–0.814) in the 10-fold stratified cross-validation and 0.751 (0.635–0.779) in external validation. The predicted results were better than both EI and GRS. Furthermore, 11 sites were identified as the smallest feature set that can be accurately predicted. Interpretation: The developed SVM model has great potential to be an efficient and effective tool for HHL prediction when high throughput sequencing data are available.
7
Bauer A, Zierer A, Gieger C, Büyüközkan M, Müller-Nurasyid M, Grallert H, Meisinger C, Strauch K, Prokisch H, Roden M, Peters A, Krumsiek J, Herder C, Koenig W, Thorand B, Huth C. Comparison of genetic risk prediction models to improve prediction of coronary heart disease in two large cohorts of the MONICA/KORA study. Genet Epidemiol 2021; 45:633-650. [PMID: 34082474 DOI: 10.1002/gepi.22389] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/08/2021] [Revised: 04/20/2021] [Accepted: 05/04/2021] [Indexed: 12/19/2022]
Abstract
It is still unclear how genetic information, provided as single-nucleotide polymorphisms (SNPs), can be most effectively integrated into risk prediction models for coronary heart disease (CHD) to add significant predictive value beyond clinical risk models. For the present study, a population-based case-cohort was used as a training set (451 incident cases, 1488 noncases) and an independent cohort as a test set (160 incident cases, 2749 noncases). The following strategies to quantify genetic information were compared: a weighted genetic risk score including Metabochip SNPs associated with CHD in the literature (GRSMetabo); selection of the most predictive SNPs among these literature-confirmed variants using priority-Lasso (PLMetabo); validation of two comprehensive polygenic risk scores: GRSGola based on Metabochip data, and GRSKhera (available in the test set only) based on cross-validated genome-wide genotyping data. We used Cox regression to assess associations with incident CHD. C-index, category-free net reclassification index (cfNRI) and relative integrated discrimination improvement (IDIrel) were used to quantify the predictive performance of genetic information beyond Framingham risk score variables. In contrast to GRSMetabo and PLMetabo, GRSGola significantly improved the prediction (delta C-index [95% confidence interval]: 0.0087 [0.0044, 0.0130]; IDIrel: 0.0509 [0.0131, 0.0894]; cfNRI improved only in cases: 0.1761 [0.0253, 0.3219]). GRSKhera yielded slightly worse prediction results than GRSGola.
Affiliation(s)
- Alina Bauer
- Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Astrid Zierer
- Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Christian Gieger
- Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- German Center for Diabetes Research (DZD), Partner München-Neuherberg, München-Neuherberg, Germany
- Research Unit of Molecular Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Mustafa Büyüközkan
- Institute of Computational Biology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Institute for Computational Biomedicine, Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, New York, USA
- Martina Müller-Nurasyid
- Institute of Genetic Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Chair of Genetic Epidemiology, IBE, Faculty of Medicine, LMU, Munich, Germany
- Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center, Johannes Gutenberg University, Mainz, Germany
- Department of Internal Medicine I (Cardiology), Hospital of the Ludwig-Maximilians-University (LMU) Munich, Munich, Germany
- Harald Grallert
- Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- German Center for Diabetes Research (DZD), Partner München-Neuherberg, München-Neuherberg, Germany
- Research Unit of Molecular Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Christa Meisinger
- German Center for Diabetes Research (DZD), Partner München-Neuherberg, München-Neuherberg, Germany
- Chair of Epidemiology, LMU Munich, UNIKA-T Augsburg, Augsburg, Germany
- Independent Research Group Clinical Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Konstantin Strauch
- Institute of Genetic Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Chair of Genetic Epidemiology, IBE, Faculty of Medicine, LMU, Munich, Germany
- Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center, Johannes Gutenberg University, Mainz, Germany
- Holger Prokisch
- Institute of Human Genetics, School of Medicine, Technische Universität München, München, Germany
- Institute of Neurogenomics, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Michael Roden
- Department of Endocrinology and Diabetology, Medical Faculty and University Hospital Düsseldorf, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany
- Institute for Clinical Diabetology, German Diabetes Center, Leibniz Center for Diabetes Research at Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- German Center for Diabetes Research (DZD), Partner Düsseldorf, München-Neuherberg, Germany
- Annette Peters
- Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- German Center for Diabetes Research (DZD), Partner München-Neuherberg, München-Neuherberg, Germany
- Institute of Epidemiology and Medical Biometry, University of Ulm, Ulm, Germany
- Jan Krumsiek
- Institute of Computational Biology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Institute for Computational Biomedicine, Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, New York, USA
- Christian Herder
- Department of Endocrinology and Diabetology, Medical Faculty and University Hospital Düsseldorf, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany
- Institute for Clinical Diabetology, German Diabetes Center, Leibniz Center for Diabetes Research at Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- German Center for Diabetes Research (DZD), Partner Düsseldorf, München-Neuherberg, Germany
- Wolfgang Koenig
- Institute of Epidemiology and Medical Biometry, University of Ulm, Ulm, Germany
- Deutsches Herzzentrum München, Technische Universität München, Munich, Germany
- German Centre for Cardiovascular Research (DZHK), partner site Munich Heart Alliance, Munich, Germany
- Barbara Thorand
- Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- German Center for Diabetes Research (DZD), Partner München-Neuherberg, München-Neuherberg, Germany
- Cornelia Huth
- Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- German Center for Diabetes Research (DZD), Partner München-Neuherberg, München-Neuherberg, Germany
8
Bracher-Smith M, Crawford K, Escott-Price V. Machine learning for genetic prediction of psychiatric disorders: a systematic review. Mol Psychiatry 2021; 26:70-79. [PMID: 32591634 PMCID: PMC7610853 DOI: 10.1038/s41380-020-0825-2] [Citation(s) in RCA: 50] [Impact Index Per Article: 16.7] [Received: 05/15/2020] [Revised: 06/09/2020] [Accepted: 06/16/2020] [Indexed: 12/25/2022]
Abstract
Machine learning methods have been employed to make predictions in psychiatry from genotypes, with the potential to bring improved prediction of outcomes in psychiatric genetics; however, their current performance is unclear. We aim to systematically review machine learning methods for predicting psychiatric disorders from genetics alone and evaluate their discrimination, bias and implementation. Medline, PsycInfo, Web of Science and Scopus were searched for terms relating to genetics, psychiatric disorders and machine learning, including neural networks, random forests, support vector machines and boosting, on 10 September 2019. Following PRISMA guidelines, articles were screened for inclusion independently by two authors, extracted, and assessed for risk of bias. Overall, 63 full texts were assessed from a pool of 652 abstracts. Data were extracted for 77 models of schizophrenia, bipolar, autism or anorexia across 13 studies. Performance of machine learning methods was highly varied (0.48-0.95 AUC) and differed between schizophrenia (0.54-0.95 AUC), bipolar (0.48-0.65 AUC), autism (0.52-0.81 AUC) and anorexia (0.62-0.69 AUC). This is likely due to the high risk of bias identified in the study designs and analysis for reported results. Choices for predictor selection, hyperparameter search and validation methodology, and viewing of the test set during training were common causes of high risk of bias in analysis. Key steps in model development and validation were frequently not performed or unreported. Comparison of discrimination across studies was constrained by heterogeneity of predictors, outcome and measurement, in addition to sample overlap within and across studies. Given widespread high risk of bias and the small number of studies identified, it is important to ensure established analysis methods are adopted. We emphasise best practices in methodology and reporting for improving future studies.
Affiliation(s)
- Matthew Bracher-Smith
- MRC Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, School of Medicine, Cardiff University, Cardiff, UK
- Karen Crawford
- MRC Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, School of Medicine, Cardiff University, Cardiff, UK
- Dementia Research Institute, School of Medicine, Cardiff University, Cardiff, UK
- Valentina Escott-Price
- MRC Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, School of Medicine, Cardiff University, Cardiff, UK.
- Dementia Research Institute, School of Medicine, Cardiff University, Cardiff, UK.
9
Boulesteix AL, Groenwold RH, Abrahamowicz M, Binder H, Briel M, Hornung R, Morris TP, Rahnenführer J, Sauerbrei W. Introduction to statistical simulations in health research. BMJ Open 2020; 10:e039921. [PMID: 33318113 PMCID: PMC7737058 DOI: 10.1136/bmjopen-2020-039921] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Indexed: 12/23/2022] Open
Abstract
In health research, statistical methods are frequently used to address a wide variety of research questions. For almost every analytical challenge, different methods are available. But how do we choose between different methods and how do we judge whether the chosen method is appropriate for our specific study? Like in any science, in statistics, experiments can be run to find out which methods should be used under which circumstances. The main objective of this paper is to demonstrate that simulation studies, that is, experiments investigating synthetic data with known properties, are an invaluable tool for addressing these questions. We aim to provide a first introduction to simulation studies for data analysts or, more generally, for researchers involved at different levels in the analyses of health data, who (1) may rely on simulation studies published in statistical literature to choose their statistical methods and who, thus, need to understand the criteria of assessing the validity and relevance of simulation results and their interpretation; and/or (2) need to understand the basic principles of designing statistical simulations in order to efficiently collaborate with more experienced colleagues or start learning to conduct their own simulations. We illustrate the implementation of a simulation study and the interpretation of its results through a simple example inspired by recent literature, which is completely reproducible using the R-script available from online supplemental file 1.
Affiliation(s)
- Anne-Laure Boulesteix
- Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig Maximilian University of Munich, Munich, Germany
- Rolf HH Groenwold
- Department of Clinical Epidemiology, Leiden University Medical Centre, Leiden, The Netherlands
- Department of Biomedical Data Science, Leiden University Medical Centre, Leiden, The Netherlands
- Michal Abrahamowicz
- Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, Quebec, Canada
- Harald Binder
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg im Breisgau, Germany
- Matthias Briel
- Department of Clinical Research, Institute for Clinical Epidemiology and Biostatistics, University Hospital Basel and University of Basel, Basel, Switzerland
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
- Roman Hornung
- Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig Maximilian University of Munich, Munich, Germany
- Jörg Rahnenführer
- Department of Statistics, TU Dortmund University, Dortmund, Nordrhein-Westfalen, Germany
- Willi Sauerbrei
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg im Breisgau, Germany
10
Huang J, Huth C, Covic M, Troll M, Adam J, Zukunft S, Prehn C, Wang L, Nano J, Scheerer MF, Neschen S, Kastenmüller G, Suhre K, Laxy M, Schliess F, Gieger C, Adamski J, Hrabe de Angelis M, Peters A, Wang-Sattler R. Machine Learning Approaches Reveal Metabolic Signatures of Incident Chronic Kidney Disease in Individuals With Prediabetes and Type 2 Diabetes. Diabetes 2020; 69:2756-2765. [PMID: 33024004 DOI: 10.2337/db20-0586] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 06/03/2020] [Accepted: 09/29/2020] [Indexed: 11/13/2022]
Abstract
Early and precise identification of individuals with prediabetes and type 2 diabetes (T2D) at risk for progressing to chronic kidney disease (CKD) is essential to prevent complications of diabetes. Here, we identify and evaluate prospective metabolite biomarkers and the best set of predictors of CKD in the longitudinal, population-based Cooperative Health Research in the Region of Augsburg (KORA) cohort by targeted metabolomics and machine learning approaches. Out of 125 targeted metabolites, sphingomyelin C18:1 and phosphatidylcholine diacyl C38:0 were identified as candidate metabolite biomarkers of incident CKD specifically in hyperglycemic individuals followed during 6.5 years. Sets of predictors for incident CKD developed from 125 metabolites and 14 clinical variables showed highly stable performances in all three machine learning approaches and outperformed the currently established clinical algorithm for CKD. The two metabolites in combination with five clinical variables were identified as the best set of predictors, and their predictive performance yielded a mean area value under the receiver operating characteristic curve of 0.857. The inclusion of metabolite variables in the clinical prediction of future CKD may thus improve the risk prediction in people with prediabetes and T2D. The metabolite link with hyperglycemia-related early kidney dysfunction warrants further investigation.
Affiliation(s)
- Jialing Huang
  - Research Unit of Molecular Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
  - Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
  - German Center for Diabetes Research (DZD), München-Neuherberg, Germany
- Cornelia Huth
  - Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
  - German Center for Diabetes Research (DZD), München-Neuherberg, Germany
- Marcela Covic
  - Research Unit of Molecular Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
  - Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
  - German Center for Diabetes Research (DZD), München-Neuherberg, Germany
- Martina Troll
  - Research Unit of Molecular Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
  - Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Jonathan Adam
  - Research Unit of Molecular Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
  - Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Sven Zukunft
  - Research Unit of Molecular Endocrinology and Metabolism, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Cornelia Prehn
  - Research Unit of Molecular Endocrinology and Metabolism, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Li Wang
  - Research Unit of Molecular Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
  - Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
  - Department of Scientific Research and Shandong University Postdoctoral Work Station, Liaocheng People's Hospital, Shandong, P. R. China
- Jana Nano
  - Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
  - German Center for Diabetes Research (DZD), München-Neuherberg, Germany
- Markus F Scheerer
  - Institute of Experimental Genetics, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Susanne Neschen
  - Institute of Experimental Genetics, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Gabi Kastenmüller
  - Institute of Bioinformatics and Systems Biology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Karsten Suhre
  - Department of Physiology and Biophysics, Weill Cornell Medicine - Qatar, Doha, Qatar
- Michael Laxy
  - Institute of Health Economics and Health Care Management, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Christian Gieger
  - Research Unit of Molecular Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
  - Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
  - German Center for Diabetes Research (DZD), München-Neuherberg, Germany
- Jerzy Adamski
  - Research Unit of Molecular Endocrinology and Metabolism, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
  - Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
  - Chair of Experimental Genetics, Center of Life and Food Sciences Weihenstephan, Technische Universität München, Freising, Germany
- Martin Hrabe de Angelis
  - German Center for Diabetes Research (DZD), München-Neuherberg, Germany
  - Institute of Experimental Genetics, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
  - Chair of Experimental Genetics, Center of Life and Food Sciences Weihenstephan, Technische Universität München, Freising, Germany
- Annette Peters
  - Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
  - German Center for Diabetes Research (DZD), München-Neuherberg, Germany
- Rui Wang-Sattler
  - Research Unit of Molecular Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
  - Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
  - German Center for Diabetes Research (DZD), München-Neuherberg, Germany
11
Machado RA, de Oliveira Silva C, Martelli-Junior H, das Neves LT, Coletta RD. Machine learning in prediction of genetic risk of nonsyndromic oral clefts in the Brazilian population. Clin Oral Investig 2020; 25:1273-1280. [PMID: 32617779 DOI: 10.1007/s00784-020-03433-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 09/04/2019] [Accepted: 06/24/2020] [Indexed: 01/07/2023]
Abstract
OBJECTIVES Genetic variants in multiple genes and loci have been associated with the risk of nonsyndromic cleft lip with or without cleft palate (NSCL ± P). However, the estimation of risk remains a challenge, because most of these variants are population-specific, rendering the identification of the underlying genetic risk difficult. Herein we examined the use of machine learning on previously reported single nucleotide polymorphisms (SNPs) to predict risk of NSCL ± P in the Brazilian population.
MATERIALS AND METHODS Random forest and neural network methods were applied to 72 SNPs in a case-control sample composed of 722 NSCL ± P cases and 866 controls for discrimination of NSCL ± P risk. SNP-SNP interactions and functional annotation of biological processes associated with the identified NSCL ± P risk genes were verified.
RESULTS Supervised random forest decision trees revealed high importance scores for the SNPs rs11717284 and rs1875735 in FGF12, rs41268753 in GRHL3, rs2236225 in MTHFD1, rs2274976 in MTHFR, rs2235371 and rs642961 in IRF6, rs17085106 in RHPN2, rs28372960 in TCOF1, rs7078160 in VAX1, rs10762573 and rs2131960 in VCL, and rs227731 in 17q22, with an accuracy of 99% and an error rate of approximately 3% to predict the risk of NSCL ± P. Those same 13 SNPs were considered the most important for the neural network to effectively predict NSCL ± P risk, with an overall accuracy of 94%. A multivariate regression model revealed significant interactions among all SNPs, with the exception of those in FGF12 and MTHFD1. The most significant biological processes for the selected genes were those involved in tissue and epithelium development; neural tube closure; and metabolism of methionine, folate, and homocysteine.
CONCLUSIONS Our results provide novel clues for studies of the genetic mechanisms of NSCL ± P and point to a machine learning model composed of 13 SNPs that is capable of predicting NSCL ± P risk.
CLINICAL RELEVANCE Although validation is necessary, this genetic panel can be useful in the near future to assist in NSCL ± P genetic counseling.
Affiliation(s)
- Renato Assis Machado
  - Department of Oral Diagnosis, School of Dentistry, University of Campinas, Piracicaba, São Paulo, CEP 13414-018, Brazil
  - Post-Graduation Program in Rehabilitation Sciences, Hospital for Rehabilitation of Craniofacial Anomalies, University of São Paulo, Bauru, São Paulo, Brazil
- Carolina de Oliveira Silva
  - Department of Oral Diagnosis, School of Dentistry, University of Campinas, Piracicaba, São Paulo, CEP 13414-018, Brazil
- Hercílio Martelli-Junior
  - Stomatology Clinic, Dental School, State University of Montes Claros, Montes Claros, Minas Gerais, Brazil
  - Center for Rehabilitation of Craniofacial Anomalies, Dental School, University of José Rosario Vellano, Alfenas, Minas Gerais, Brazil
- Lucimara Teixeira das Neves
  - Post-Graduation Program in Rehabilitation Sciences, Hospital for Rehabilitation of Craniofacial Anomalies, University of São Paulo, Bauru, São Paulo, Brazil
  - Department of Biological Sciences, Bauru School of Dentistry, University of São Paulo, Bauru, São Paulo, Brazil
- Ricardo D Coletta
  - Department of Oral Diagnosis, School of Dentistry, University of Campinas, Piracicaba, São Paulo, CEP 13414-018, Brazil
12
Caliebe A, Nothnagel M. Special issue on 'Genetic epidemiology of complex diseases: impact of population history and modelling assumptions'. Hum Genet 2020; 139:1-3. [PMID: 31664516 DOI: 10.1007/s00439-019-02074-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/17/2022]
Affiliation(s)
- Amke Caliebe
  - Institute of Medical Informatics and Statistics, Kiel University, Kiel, Germany
  - University Medical Centre Schleswig-Holstein, Kiel, Germany
- Michael Nothnagel
  - Cologne Center for Genomics, University of Cologne, Cologne, Germany
  - University Hospital Cologne, Cologne, Germany