1
|
Novielli P, Romano D, Pavan S, Losciale P, Stellacci AM, Diacono D, Bellotti R, Tangaro S. Explainable artificial intelligence for genotype-to-phenotype prediction in plant breeding: a case study with a dataset from an almond germplasm collection. FRONTIERS IN PLANT SCIENCE 2024; 15:1434229. [PMID: 39319003 PMCID: PMC11420924 DOI: 10.3389/fpls.2024.1434229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2024] [Accepted: 08/13/2024] [Indexed: 09/26/2024]
Abstract
Background Advances in DNA sequencing have revolutionized plant genomics and contributed significantly to the study of genetic diversity. Despite this progress, accurately predicting phenotypes from high-dimensional genomic data remains a challenge in plant breeding, particularly when it comes to identifying the key genetic factors influencing these predictions. This study aims to bridge this gap by integrating explainable artificial intelligence (XAI) techniques with advanced machine learning (ML) models. This approach is intended to enhance both the predictive accuracy and interpretability of genotype-to-phenotype models, thereby improving their reliability and supporting more informed breeding decisions. Results This study compares several ML methods for genotype-to-phenotype prediction, using data available from an almond germplasm collection. After preprocessing and feature selection, regression models are employed to predict almond shelling fraction. The best predictions were obtained with the Random Forest method (correlation = 0.727 ± 0.020, R² = 0.511 ± 0.025, and RMSE = 7.746 ± 0.199). Notably, applying the SHAP (SHapley Additive exPlanations) algorithm to explain the results highlighted several genomic regions associated with the trait, including the one with the highest feature importance, located in a gene potentially involved in seed development. Conclusions Employing explainable artificial intelligence algorithms enhances model interpretability, identifying genetic polymorphisms associated with the shelling percentage. These findings underscore XAI's efficacy in predicting phenotypic traits from genomic data, highlighting its significance in optimizing crop production for sustainable agriculture.
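As a rough illustration of the Shapley-value idea that SHAP builds on, the sketch below computes exact Shapley values for a tiny model by enumerating feature coalitions. It is not the procedure used in the paper and not the SHAP library's API: the function `shapley_values`, the toy trait model, the genotype and the all-zero reference are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x` relative to `baseline`.

    Features outside a coalition are set to their baseline value.
    The cost grows as 2**n, so this only suits a handful of features.
    """
    n = len(x)

    def value(coalition):
        # Evaluate the model with coalition features taken from x, the rest from baseline.
        mixed = [x[j] if j in coalition else baseline[j] for j in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Shapley weight shared by all coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                contribution += weight * (value(coalition | {i}) - value(coalition))
        phi.append(contribution)
    return phi


# Hypothetical trait model over three SNP dosages (0, 1 or 2 copies of an allele),
# with an interaction between the first and third markers.
def toy_trait(g):
    return 50.0 + 4.0 * g[0] - 2.0 * g[1] + 1.5 * g[0] * g[2]


genotype = [2, 1, 1]    # assumed individual
reference = [0, 0, 0]   # assumed baseline genotype
phi = shapley_values(toy_trait, genotype, reference)
```

The attributions sum to the difference between the prediction for `genotype` and the prediction for `reference`, and the interaction term is split evenly between the two markers involved. Practical SHAP implementations rely on approximations or model-specific algorithms, because exact enumeration is infeasible for thousands of markers.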
Affiliation(s)
- Pierfrancesco Novielli
- Dipartimento di Scienze del Suolo, della Pianta e degli Alimenti, Università degli Studi di Bari Aldo Moro, Bari, Italy
- Istituto Nazionale di Fisica Nucleare, Sezione di Bari, Bari, Italy
| | - Donato Romano
- Dipartimento di Scienze del Suolo, della Pianta e degli Alimenti, Università degli Studi di Bari Aldo Moro, Bari, Italy
- Istituto Nazionale di Fisica Nucleare, Sezione di Bari, Bari, Italy
| | - Stefano Pavan
- Dipartimento di Scienze del Suolo, della Pianta e degli Alimenti, Università degli Studi di Bari Aldo Moro, Bari, Italy
| | - Pasquale Losciale
- Dipartimento di Scienze del Suolo, della Pianta e degli Alimenti, Università degli Studi di Bari Aldo Moro, Bari, Italy
| | - Anna Maria Stellacci
- Dipartimento di Scienze del Suolo, della Pianta e degli Alimenti, Università degli Studi di Bari Aldo Moro, Bari, Italy
| | - Domenico Diacono
- Istituto Nazionale di Fisica Nucleare, Sezione di Bari, Bari, Italy
| | - Roberto Bellotti
- Istituto Nazionale di Fisica Nucleare, Sezione di Bari, Bari, Italy
- Dipartimento Interateneo di Fisica “M. Merlin”, Università degli Studi di Bari Aldo Moro, Bari, Italy
| | - Sabina Tangaro
- Dipartimento di Scienze del Suolo, della Pianta e degli Alimenti, Università degli Studi di Bari Aldo Moro, Bari, Italy
- Istituto Nazionale di Fisica Nucleare, Sezione di Bari, Bari, Italy
|
2
|
Mohtasham F, Pourhoseingholi M, Hashemi Nazari SS, Kavousi K, Zali MR. Comparative analysis of feature selection techniques for COVID-19 dataset. Sci Rep 2024; 14:18627. [PMID: 39128991 PMCID: PMC11317481 DOI: 10.1038/s41598-024-69209-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Accepted: 08/01/2024] [Indexed: 08/13/2024]
Abstract
In the context of early disease detection, machine learning (ML) has emerged as a vital tool. Feature selection (FS) algorithms play a crucial role in ensuring the accuracy of predictive models by identifying the most influential variables. This study, focusing on a retrospective cohort of 4778 COVID-19 patients from Iran, explores the performance of various FS methods, including filter, embedded, and hybrid approaches, in predicting mortality outcomes. The researchers leveraged 115 routine clinical, laboratory, and demographic features and employed 13 ML models to assess the effectiveness of these FS methods based on classification accuracy, predictive accuracy, and statistical tests. The results indicate that a Hybrid Boruta-VI model combined with the Random Forest algorithm demonstrated superior performance, achieving an accuracy of 0.89, an F1 score of 0.76, and an AUC value of 0.95 on test data. Key variables identified as important predictors of adverse outcomes include age, oxygen saturation levels, albumin levels, neutrophil counts, platelet levels, and markers of kidney function. These findings highlight the potential of advanced FS techniques and ML models in enhancing early disease detection and informing clinical decision-making.
Affiliation(s)
- Farideh Mohtasham
- Gastroenterology and Liver Diseases Research Center, Research Institute for Gastroenterology and Liver Diseases, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
| | - MohamadAmin Pourhoseingholi
- Hearing Sciences, Mental Health and Clinical Neurosciences, School of Medicine, National Institute for Health and Care Research (NIHR) Nottingham Biomedical Research Center, University of Nottingham, Nottingham, UK
| | - Seyed Saeed Hashemi Nazari
- Department of Epidemiology, School of Public Health & Safety, Shahid Beheshti University of Medical Sciences (SBMU), Tehran, Iran
| | - Kaveh Kavousi
- Laboratory of Complex Biological Systems and Bioinformatics (CBB), Department of Bioinformatics, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran.
| | - Mohammad Reza Zali
- Gastroenterology and Liver Diseases Research Center, Research Institute for Gastroenterology and Liver Diseases, Shahid Beheshti University of Medical Sciences, Tehran, Iran
|
3
|
Sztepanacz JL, Houle D. Regularized regression can improve estimates of multivariate selection in the face of multicollinearity and limited data. Evol Lett 2024; 8:361-373. [PMID: 39211358 PMCID: PMC11358252 DOI: 10.1093/evlett/qrad064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2023] [Revised: 11/19/2023] [Accepted: 12/06/2023] [Indexed: 09/04/2024]
Abstract
The breeder's equation, Δz̄ = Gβ, allows us to understand how genetics (the genetic covariance matrix, G) and the vector of linear selection gradients β interact to generate evolutionary trajectories. Estimation of β using multiple regression of trait values on relative fitness revolutionized the way we study selection in laboratory and wild populations. However, multicollinearity, or correlation of predictors, can lead to very high variances of and covariances between elements of β, posing a challenge for the interpretation of the parameter estimates. This is particularly relevant in the era of big data, where the number of predictors may approach or exceed the number of observations. A common approach to multicollinear predictors is to discard some of them, thereby losing any information that might be gained from those traits. Using simulations, we show how, on the one hand, multicollinearity can result in inaccurate estimates of selection, and, on the other, how the removal of correlated phenotypes from the analyses can provide a misguided view of the targets of selection. We show that regularized regression, which places data-validated constraints on the magnitudes of individual elements of β, can produce more accurate estimates of the total strength and direction of multivariate selection in the presence of multicollinearity and limited data, and often has little cost when multicollinearity is low. We also compare standard and regularized regression estimates of selection in a reanalysis of three published case studies, showing that regularized regression can improve fitness predictions in independent data. Our results suggest that regularized regression is a valuable tool that can be used as an important complement to traditional least-squares estimates of selection. In some cases, its use can lead to improved predictions of individual fitness, and improved estimates of the total strength and direction of multivariate selection.
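To make the breeder's equation concrete, the following minimal sketch evaluates the matrix-vector product Gβ for a two-trait case. The helper `predicted_response` and the numbers in `G` and `beta` are hypothetical and are not taken from the paper.

```python
def predicted_response(G, beta):
    """Return the predicted response vector G @ beta (the breeder's equation)."""
    if any(len(row) != len(beta) for row in G):
        raise ValueError("each row of G needs one entry per trait in beta")
    return [sum(g_ij * b_j for g_ij, b_j in zip(row, beta)) for row in G]


# Hypothetical two-trait example with positive genetic covariance between traits.
G = [[1.0, 0.5],
     [0.5, 1.0]]
beta = [0.2, -0.1]   # direct selection favors trait 1 and opposes trait 2
delta_z = predicted_response(G, beta)
```

In this toy case the second trait is under direct negative selection, yet its predicted response is zero because the correlated response through the first trait cancels it. This is the kind of interplay between G and β that makes accurate estimates of every element of β matter.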
Affiliation(s)
| | - David Houle
- Department of Biology, Florida State University, Tallahassee, FL, United States
|
4
|
Alfayyadh MM, Maksemous N, Sutherland HG, Lea RA, Griffiths LR. Unravelling the Genetic Landscape of Hemiplegic Migraine: Exploring Innovative Strategies and Emerging Approaches. Genes (Basel) 2024; 15:443. [PMID: 38674378 PMCID: PMC11049430 DOI: 10.3390/genes15040443] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/12/2024] [Accepted: 03/25/2024] [Indexed: 04/28/2024]
Abstract
Migraine is a severe, debilitating neurovascular disorder. Hemiplegic migraine (HM) is a rare and debilitating neurological condition with a strong genetic basis. Sequencing technologies have improved the diagnosis and our understanding of the molecular pathophysiology of HM. Linkage analysis and sequencing studies in HM families have identified pathogenic variants in ion channels and related genes, including CACNA1A, ATP1A2, and SCN1A, that cause HM. However, approximately 75% of HM patients are negative for these mutations, indicating there are other genes involved in disease causation. In this review, we explored our current understanding of the genetics of HM. The evidence presented herein summarises the current knowledge of the genetics of HM, which can be expanded further to explain the remaining heritability of this debilitating condition. Innovative bioinformatics and computational strategies to cover the entire genetic spectrum of HM are also discussed in this review.
Affiliation(s)
| | | | | | | | - Lyn R. Griffiths
- Centre for Genomics and Personalised Health, Genomics Research Centre, School of Biomedical Sciences, Queensland University of Technology (QUT), Brisbane, QLD 4059, Australia; (M.M.A.); (N.M.); (H.G.S.); (R.A.L.)
|
5
|
Gu LL, Yang RQ, Wang ZY, Jiang D, Fang M. Ensemble learning for integrative prediction of genetic values with genomic variants. BMC Bioinformatics 2024; 25:120. [PMID: 38515026 PMCID: PMC10956256 DOI: 10.1186/s12859-024-05720-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2022] [Accepted: 02/26/2024] [Indexed: 03/23/2024]
Abstract
BACKGROUND Whole genome variants offer sufficient information for genetic prediction of human disease risk, and prediction of animal and plant breeding values. Many sophisticated statistical methods have been developed to enhance predictive ability. However, each method has its own advantages and disadvantages, and so far no single method outperforms all the others. RESULTS We herein propose an Ensemble Learning method for Prediction of Genetic Values (ELPGV), which assembles predictions from several basic methods such as GBLUP, BayesA, BayesB and BayesCπ, to produce more accurate predictions. We validated ELPGV with a variety of well-known datasets and a series of simulated datasets. All analyses showed that ELPGV achieved significantly higher predictive ability than any of the basic methods; for instance, the p-values comparing ELPGV with the basic methods ranged from 4.853E-118 to 9.640E-20 for the WTCCC dataset. CONCLUSIONS ELPGV is able to integrate the merits of each method to produce significantly higher predictive ability than any of the basic methods. It is simple to implement and fast to run, and it does not use genotype data. ELPGV is promising for wide application in genetic predictions.
Affiliation(s)
- Lin-Lin Gu
- Key Laboratory of Healthy Mariculture for the East China Sea, Ministry of Agriculture and Rural Affairs and Fisheries College, Jimei University, Xiamen, People's Republic of China
| | - Run-Qing Yang
- Research Center for Aquatic Biotechnology, Chinese Academy of Fishery Sciences, Beijing, People's Republic of China
| | - Zhi-Yong Wang
- Key Laboratory of Healthy Mariculture for the East China Sea, Ministry of Agriculture and Rural Affairs and Fisheries College, Jimei University, Xiamen, People's Republic of China.
| | - Dan Jiang
- Key Laboratory of Healthy Mariculture for the East China Sea, Ministry of Agriculture and Rural Affairs and Fisheries College, Jimei University, Xiamen, People's Republic of China.
| | - Ming Fang
- Key Laboratory of Healthy Mariculture for the East China Sea, Ministry of Agriculture and Rural Affairs and Fisheries College, Jimei University, Xiamen, People's Republic of China.
- Life Science College, Heilongjiang Bayi Agricultural University, Daqing, People's Republic of China.
|
6
|
Zhou W, Yan Z, Zhang L. A comparative study of 11 non-linear regression models highlighting autoencoder, DBN, and SVR, enhanced by SHAP importance analysis in soybean branching prediction. Sci Rep 2024; 14:5905. [PMID: 38467662 PMCID: PMC10928191 DOI: 10.1038/s41598-024-55243-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Accepted: 02/21/2024] [Indexed: 03/13/2024]
Abstract
To explore a robust tool for advancing digital breeding practices through an artificial intelligence-driven phenotype prediction expert system, we undertook a thorough analysis of 11 non-linear regression models. Our investigation specifically emphasized the significance of Support Vector Regression (SVR) and SHapley Additive exPlanations (SHAP) in predicting soybean branching. Using branching data (phenotype) from 1918 soybean accessions and 42 k SNP (single nucleotide polymorphism) data (genotype), this study systematically compared 11 non-linear regression AI models, including four deep learning models (DBN (deep belief network) regression, ANN (artificial neural network) regression, Autoencoder regression, and MLP (multilayer perceptron) regression) and seven machine learning models (SVR (support vector regression), XGBoost (eXtreme Gradient Boosting) regression, Random Forest regression, LightGBM regression, GPs (Gaussian processes) regression, Decision Tree regression, and Polynomial regression). Evaluation with four metrics, R2 (R-squared), MAE (Mean Absolute Error), MSE (Mean Squared Error), and MAPE (Mean Absolute Percentage Error), showed that SVR, Polynomial Regression, DBN, and Autoencoder outperformed the other models and achieved better accuracy in phenotype prediction. In the assessment of deep learning approaches, we used the SVR model as an example, conducting analyses of feature importance and gene ontology (GO) enrichment to provide comprehensive support. A comprehensive comparison of four feature importance algorithms, namely Variable Ranking, Permutation, SHAP, and Correlation Matrix, showed no notable distinction in their feature importance ranking scores, but the SHAP values could provide rich information on genes with negative contributions, so SHAP importance was chosen for feature selection.
The results of this study offer valuable insights into AI-mediated plant breeding, addressing challenges faced by traditional breeding programs. The method developed has broad applicability in phenotype prediction, minor QTL (quantitative trait loci) mining, and plant smart-breeding systems, contributing significantly to the advancement of AI-based breeding practices and transitioning from experience-based to data-based breeding.
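For reference, the four evaluation metrics named in this abstract can be computed from paired observed and predicted values as in the sketch below. It uses the standard textbook definitions, which are assumed here and not confirmed by the abstract; the function name `regression_metrics` and the toy numbers are hypothetical.

```python
from statistics import fmean


def regression_metrics(y_true, y_pred):
    """Return R2, MAE, MSE and MAPE (in percent) for paired observations.

    Assumes no observed value is zero (needed for MAPE) and that the
    observed values are not all identical (needed for R2).
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = fmean([abs(e) for e in errors])
    mse = fmean([e * e for e in errors])
    mape = 100.0 * fmean([abs(e / t) for e, t in zip(errors, y_true)])
    mean_true = fmean(y_true)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"R2": r2, "MAE": mae, "MSE": mse, "MAPE": mape}


# Toy values, for illustration only.
metrics = regression_metrics([2.0, 4.0], [1.0, 5.0])
```

Note that R2 can be zero or negative even when MAE and MSE look small, as in the toy values above, because it compares the model against simply predicting the mean of the observed values.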
Affiliation(s)
- Wei Zhou
- Florida Agricultural and Mechanical University, Tallahassee, FL, 32307, USA.
| | - Zhengxiao Yan
- Florida State University, Tallahassee, FL, 32306, USA
| | - Liting Zhang
- Florida State University, Tallahassee, FL, 32306, USA
|
7
|
Jeng XJ, Hu Y, Venkat V, Lu TP, Tzeng JY. Transfer learning with false negative control improves polygenic risk prediction. PLoS Genet 2023; 19:e1010597. [PMID: 38011285 PMCID: PMC10723713 DOI: 10.1371/journal.pgen.1010597] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2023] [Revised: 12/15/2023] [Accepted: 11/09/2023] [Indexed: 11/29/2023]
Abstract
Polygenic risk score (PRS) is a quantity that aggregates the effects of variants across the genome and estimates an individual's genetic predisposition for a given trait. PRS analysis typically involves two input data sets: base data for effect size estimation and target data for individual-level prediction. Given the availability of large-scale base data, it is increasingly common that the ancestral backgrounds of the base and target data do not perfectly match. In this paper, we treat the GWAS summary information obtained from the base data as knowledge learned from a pre-trained model, and adopt a transfer learning framework to leverage this knowledge, whether or not the base data share a similar ancestral background with the target samples, to build prediction models for target individuals. Our proposed transfer learning framework consists of two main steps: (1) conducting false negative control (FNC) marginal screening to extract useful knowledge from the base data; and (2) performing joint model training to integrate the knowledge extracted from the base data with the target training data for accurate trans-data prediction. This new approach can significantly enhance the computational and statistical efficiency of joint-model training, alleviate over-fitting, and facilitate more accurate trans-data prediction whether the level of heterogeneity between the target and base data sets is low or high.
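The base/target split described in this abstract can be illustrated with the usual weighted-sum form of a polygenic score: effect sizes come from the base data, allele dosages from a target individual. This is a minimal sketch of that common definition only, not of the paper's FNC screening or joint training steps; the function `polygenic_score` and all numbers are hypothetical.

```python
def polygenic_score(dosages, effect_sizes):
    """Weighted sum of allele dosages (0, 1 or 2) using per-variant effect sizes."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("need exactly one effect size per variant")
    return sum(d * b for d, b in zip(dosages, effect_sizes))


# Hypothetical effect sizes estimated in the base data (e.g. GWAS summary statistics).
effect_sizes = [0.10, -0.20, 0.30]
# Hypothetical allele dosages for one individual in the target data.
individual = [0, 1, 2]
score = polygenic_score(individual, effect_sizes)
```

When the base and target samples differ in ancestry, the base effect sizes may transfer poorly, which is the mismatch the transfer learning framework above is designed to handle.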
Affiliation(s)
- Xinge Jessie Jeng
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
| | - Yifei Hu
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
| | - Vaishnavi Venkat
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
| | - Tzu-Pin Lu
- Institute of Health Data Analytics and Statistics, National Taiwan University, Taipei, Taiwan
- Department of Public Health, National Taiwan University, Taipei, Taiwan
| | - Jung-Ying Tzeng
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
- Institute of Health Data Analytics and Statistics, National Taiwan University, Taipei, Taiwan
|
8
|
Lehmann B, Mackintosh M, McVean G, Holmes C. Optimal strategies for learning multi-ancestry polygenic scores vary across traits. Nat Commun 2023; 14:4023. [PMID: 37419925 PMCID: PMC10328935 DOI: 10.1038/s41467-023-38930-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2022] [Accepted: 05/22/2023] [Indexed: 07/09/2023]
Abstract
Polygenic scores (PGSs) are individual-level measures that aggregate the genome-wide genetic predisposition to a given trait. As PGS have predominantly been developed using European-ancestry samples, trait prediction using such European ancestry-derived PGS is less accurate in non-European ancestry individuals. Although there has been recent progress in combining multiple PGS trained on distinct populations, the problem of how to maximize performance given a multiple-ancestry cohort is largely unexplored. Here, we investigate the effect of sample size and ancestry composition on PGS performance for fifteen traits in UK Biobank. For some traits, PGS estimated using a relatively small African-ancestry training set outperformed, on an African-ancestry test set, PGS estimated using a much larger European-ancestry only training set. We observe similar, but not identical, results when considering other minority-ancestry groups within UK Biobank. Our results emphasise the importance of targeted data collection from underrepresented groups in order to address existing disparities in PGS performance.
Affiliation(s)
- Brieuc Lehmann
- Department of Statistical Science, University College London, London, UK.
| | | | - Gil McVean
- Big Data Institute, University of Oxford, Oxford, UK
| | - Chris Holmes
- The Alan Turing Institute, London, UK
- Big Data Institute, University of Oxford, Oxford, UK
- Department of Statistics, University of Oxford, Oxford, UK
|
9
|
Ko C, Brody JP. Evaluation of a genetic risk score computed using human chromosomal-scale length variation to predict breast cancer. Hum Genomics 2023; 17:53. [PMID: 37328908 DOI: 10.1186/s40246-023-00482-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2022] [Accepted: 03/30/2023] [Indexed: 06/18/2023]
Abstract
INTRODUCTION The ability to accurately predict whether a woman will develop breast cancer later in her life should reduce the number of breast cancer deaths. Different predictive models exist for breast cancer based on family history, BRCA status, and SNP analysis. The best of these models has an accuracy (area under the receiver operating characteristic curve, AUC) of about 0.65. We have developed computational methods to characterize a genome by a small set of numbers that represent the length of segments of the chromosomes, called chromosomal-scale length variation (CSLV). METHODS We built machine learning models to differentiate between women who had breast cancer and women who did not based on their CSLV characterization. We applied this procedure to two different datasets: the UK Biobank (1534 women with breast cancer and 4391 women without) and The Cancer Genome Atlas (TCGA; 874 women with breast cancer and 3381 without). RESULTS We found a machine learning model that could predict breast cancer with an AUC of 0.836 (95% CI 0.830, 0.843) in the UK Biobank data. Using a similar approach with the TCGA data, we obtained a model with an AUC of 0.704 (95% CI 0.702, 0.706). Variable importance analysis indicated that no single chromosomal region was responsible for a significant fraction of the model results. CONCLUSION In this retrospective study, chromosomal-scale length variation could effectively predict whether or not a woman enrolled in the UK Biobank study developed breast cancer.
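Since the abstract reports performance as AUC, a short sketch of what that number measures may help: the AUC equals the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The function `roc_auc` and the toy scores are hypothetical and are unrelated to the study's data or code.

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison of cases (label 1) and controls (label 0)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0      # case ranked above control
            elif c == k:
                wins += 0.5      # tie counts as half
    return wins / (len(cases) * len(controls))


# Toy scores: one of the four case-control pairs is ranked the wrong way round.
auc = roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the quoted values of about 0.65, 0.704 and 0.836 on a common scale. The pairwise loop is quadratic in sample size, so larger datasets would normally use a rank-based formula instead.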
Affiliation(s)
- Charmeine Ko
- Department of Biomedical Engineering, University of California, Irvine, USA
| | - James P Brody
- Department of Biomedical Engineering, University of California, Irvine, USA.
|
10
|
Banerjee J, Taroni JN, Allaway RJ, Prasad DV, Guinney J, Greene C. Machine learning in rare disease. Nat Methods 2023:10.1038/s41592-023-01886-z. [PMID: 37248386 DOI: 10.1038/s41592-023-01886-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 03/16/2021] [Accepted: 04/22/2023] [Indexed: 05/31/2023]
Abstract
High-throughput profiling methods (such as genomics or imaging) have accelerated basic research and made deep molecular characterization of patient samples routine. These approaches provide a rich portrait of genes, molecular pathways and cell types involved in disease phenotypes. Machine learning (ML) can be a useful tool for extracting disease-relevant patterns from high-dimensional datasets. However, depending upon the complexity of the biological question, machine learning often requires many samples to identify recurrent and biologically meaningful patterns. Rare diseases are inherently limited in clinical cases, leading to few samples to study. In this Perspective, we outline the challenges and emerging solutions for using ML for small sample sets, specifically in rare diseases. Advances in ML methods for rare diseases are likely to be informative for applications beyond rare diseases for which few samples exist with high-dimensional data. We propose that the method community prioritize the development of ML techniques for rare disease research.
Affiliation(s)
| | - Jaclyn N Taroni
- Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Philadelphia, PA, USA
| | | | | | | | - Casey Greene
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA.
|
11
|
Xu Y, Ritchie SC, Liang Y, Timmers PRHJ, Pietzner M, Lannelongue L, Lambert SA, Tahir UA, May-Wilson S, Foguet C, Johansson Å, Surendran P, Nath AP, Persyn E, Peters JE, Oliver-Williams C, Deng S, Prins B, Luan J, Bomba L, Soranzo N, Di Angelantonio E, Pirastu N, Tai ES, van Dam RM, Parkinson H, Davenport EE, Paul DS, Yau C, Gerszten RE, Mälarstig A, Danesh J, Sim X, Langenberg C, Wilson JF, Butterworth AS, Inouye M. An atlas of genetic scores to predict multi-omic traits. Nature 2023; 616:123-131. [PMID: 36991119 PMCID: PMC10323211 DOI: 10.1038/s41586-023-05844-9] [Citation(s) in RCA: 28] [Impact Index Per Article: 28.0] [Received: 04/29/2022] [Accepted: 02/15/2023] [Indexed: 03/30/2023]
Abstract
The use of omic modalities to dissect the molecular underpinnings of common diseases and traits is becoming increasingly common. But multi-omic traits can be genetically predicted, which enables highly cost-effective and powerful analyses for studies that do not have multi-omics. Here we examine a large cohort (the INTERVAL study; n = 50,000 participants) with extensive multi-omic data for plasma proteomics (SomaScan, n = 3,175; Olink, n = 4,822), plasma metabolomics (Metabolon HD4, n = 8,153), serum metabolomics (Nightingale, n = 37,359) and whole-blood Illumina RNA sequencing (n = 4,136), and use machine learning to train genetic scores for 17,227 molecular traits, including 10,521 that reach Bonferroni-adjusted significance. We evaluate the performance of genetic scores through external validation across cohorts of individuals of European, Asian and African American ancestries. In addition, we show the utility of these multi-omic genetic scores by quantifying the genetic control of biological pathways and by generating a synthetic multi-omic dataset of the UK Biobank to identify disease associations using a phenome-wide scan. We highlight a series of biological insights with regard to genetic mechanisms in metabolism and canonical pathway associations with disease; for example, JAK-STAT signalling and coronary atherosclerosis. Finally, we develop a portal (https://www.omicspred.org/) to facilitate public access to all genetic scores and validation results, as well as to serve as a platform for future extensions and enhancements of multi-omic genetic scores.
Affiliation(s)
- Yu Xu
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK.
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK.
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK.
| | - Scott C Ritchie
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
- British Heart Foundation Centre of Research Excellence, School of Clinical Medicine, University of Cambridge, Cambridge, UK
| | - Yujian Liang
- Saw Swee Hock School of Public Health, National University of Singapore and National University Health System, Singapore, Singapore
| | - Paul R H J Timmers
- Centre for Global Health Research, Usher Institute, University of Edinburgh, Edinburgh, UK
| | - Maik Pietzner
- MRC Epidemiology Unit, Institute of Metabolic Science, University of Cambridge School of Clinical Medicine, Cambridge, UK
- Computational Medicine, Berlin Institute of Health (BIH) at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Precision Healthcare University Research Institute, Queen Mary University of London, London, UK
| | - Loïc Lannelongue
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
| | - Samuel A Lambert
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
- British Heart Foundation Centre of Research Excellence, School of Clinical Medicine, University of Cambridge, Cambridge, UK
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
| | - Usman A Tahir
- Division of Cardiovascular Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA
| | - Sebastian May-Wilson
- Centre for Global Health Research, Usher Institute, University of Edinburgh, Edinburgh, UK
| | - Carles Foguet
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
| | - Åsa Johansson
- Department of Immunology, Genetics and Pathology, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
| | - Praveen Surendran
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
| | - Artika P Nath
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
| | - Elodie Persyn
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
| | - James E Peters
- Department of Immunology and Inflammation, Faculty of Medicine, Imperial College London, London, UK
| | - Clare Oliver-Williams
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
| | - Shuliang Deng
- Division of Cardiovascular Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA
| | - Bram Prins
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
| | - Jian'an Luan
- MRC Epidemiology Unit, Institute of Metabolic Science, University of Cambridge School of Clinical Medicine, Cambridge, UK
| | - Lorenzo Bomba
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- BioMarin Pharmaceutical, Novato, CA, USA
| | - Nicole Soranzo
- British Heart Foundation Centre of Research Excellence, School of Clinical Medicine, University of Cambridge, Cambridge, UK
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Department of Haematology, University of Cambridge, Cambridge, UK
- NIHR Blood and Transplant Research Unit in Donor Health and Behaviour, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Genomics Research Centre, Human Technopole, Milan, Italy
| | - Emanuele Di Angelantonio
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
- British Heart Foundation Centre of Research Excellence, School of Clinical Medicine, University of Cambridge, Cambridge, UK
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
- NIHR Blood and Transplant Research Unit in Donor Health and Behaviour, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Health Data Science Research Centre, Human Technopole, Milan, Italy
| | - Nicola Pirastu
- Centre for Global Health Research, Usher Institute, University of Edinburgh, Edinburgh, UK
- Genomics Research Centre, Human Technopole, Milan, Italy
| | - E Shyong Tai
- Saw Swee Hock School of Public Health, National University of Singapore and National University Health System, Singapore, Singapore
- Department of Medicine, National University of Singapore and National University Health System, Singapore, Singapore
| | - Rob M van Dam
- Saw Swee Hock School of Public Health, National University of Singapore and National University Health System, Singapore, Singapore
- Departments of Exercise and Nutrition Sciences and Epidemiology, Milken Institute School of Public Health, The George Washington University, Washington, DC, USA
| | - Helen Parkinson
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
| | | | - Dirk S Paul
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- British Heart Foundation Centre of Research Excellence, School of Clinical Medicine, University of Cambridge, Cambridge, UK
| | - Christopher Yau
- Nuffield Department of Women's and Reproductive Health, University of Oxford, Oxford, UK
- Division of Informatics, Imaging and Data Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, UK
- Health Data Research UK, London, UK
| | - Robert E Gerszten
- Division of Cardiovascular Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA
- Broad Institute of Harvard University and Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Anders Mälarstig
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
- Pfizer Worldwide Research, Development and Medical, Stockholm, Sweden
| | - John Danesh
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
- British Heart Foundation Centre of Research Excellence, School of Clinical Medicine, University of Cambridge, Cambridge, UK
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- NIHR Blood and Transplant Research Unit in Donor Health and Behaviour, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
| | - Xueling Sim
- Saw Swee Hock School of Public Health, National University of Singapore and National University Health System, Singapore, Singapore
| | - Claudia Langenberg
- MRC Epidemiology Unit, Institute of Metabolic Science, University of Cambridge School of Clinical Medicine, Cambridge, UK
- Computational Medicine, Berlin Institute of Health (BIH) at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Precision Healthcare University Research Institute, Queen Mary University of London, London, UK
| | - James F Wilson
- Centre for Global Health Research, Usher Institute, University of Edinburgh, Edinburgh, UK
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
| | - Adam S Butterworth
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
- British Heart Foundation Centre of Research Excellence, School of Clinical Medicine, University of Cambridge, Cambridge, UK
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
- NIHR Blood and Transplant Research Unit in Donor Health and Behaviour, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
| | - Michael Inouye
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
- British Heart Foundation Centre of Research Excellence, School of Clinical Medicine, University of Cambridge, Cambridge, UK
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
- The Alan Turing Institute, London, UK
|
12
|
Clemens B, Lefort-Besnard J, Ritter C, Smith E, Votinov M, Derntl B, Habel U, Bzdok D. Accurate machine learning prediction of sexual orientation based on brain morphology and intrinsic functional connectivity. Cereb Cortex 2023; 33:4013-4025. [PMID: 36104854 PMCID: PMC10068286 DOI: 10.1093/cercor/bhac323] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2022] [Revised: 07/20/2022] [Accepted: 07/21/2022] [Indexed: 11/13/2022]
Abstract
BACKGROUND Sexual orientation in humans represents a multilevel construct that is grounded in both neurobiological and environmental factors. OBJECTIVE Here, we bring to bear a machine learning approach to predict sexual orientation from gray matter volumes (GMVs) or resting-state functional connectivity (RSFC) in a cohort of 45 heterosexual and 41 homosexual participants. METHODS In both brain assessments, we used penalized logistic regression models and nonparametric permutation. RESULTS We found an average accuracy of 62% (±6.72) for predicting sexual orientation based on GMV and an average predictive accuracy of 92% (±9.89) using RSFC. Regions in the precentral gyrus, precuneus and the prefrontal cortex were significantly informative for distinguishing heterosexual from homosexual participants in both the GMV and RSFC settings. CONCLUSIONS These results indicate that, aside from self-reports, RSFC offers neurobiological information valuable for highly accurate prediction of sexual orientation. We demonstrate for the first time that sexual orientation is reflected in specific patterns of RSFC, which enable personalized, brain-based predictions of this highly complex human trait. While these results are preliminary, our neurobiologically based prediction framework illustrates the great value and potential of RSFC for revealing biologically meaningful and generalizable predictive patterns in the human brain.
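The abstract mentions nonparametric permutation without saying what was permuted. As a minimal sketch of the general idea only, assuming the quantity of interest is classification accuracy and that labels are shuffled against fixed predictions (a simplification; a full analysis would typically refit the model on each permutation), a permutation p-value can be estimated as follows. The function names and toy data are hypothetical and are not taken from the paper.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_p_value(y_true, y_pred, n_permutations=1000, seed=0):
    """Estimate a one-sided p-value for the observed accuracy.

    Labels are shuffled to build a null distribution in which predictions
    and labels are unrelated (assumption: predictions stay fixed).
    """
    rng = random.Random(seed)
    observed = accuracy(y_true, y_pred)
    shuffled = list(y_true)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if accuracy(shuffled, y_pred) >= observed:
            hits += 1
    # The +1 correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_permutations + 1)


# Hypothetical toy data: 20 participants, 18 classified correctly.
y_true = [0] * 10 + [1] * 10
y_pred = [0] * 9 + [1] + [1] * 9 + [0]
observed_acc, p_value = permutation_p_value(y_true, y_pred)
print(observed_acc, p_value)
```

With small cohorts, the resolution of such a p-value is limited by the number of permutations, so the estimate can never fall below 1 / (n_permutations + 1).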
Affiliation(s)
- Benjamin Clemens
- Department of Psychiatry, Psychotherapy and Psychosomatics, Faculty of Medicine, RWTH Aachen, Pauwelsstr. 30, 52074 Aachen, Germany
- Research Center Jülich, Institute of Neuroscience and Medicine: JARA-Institute Brain Structure Function Relationship (INM 10), Wilhelm-Johnen-Straße, 52428 Jülich, Germany
| | | | - Christoph Ritter
- Interdisciplinary Center for Clinical Research (IZKF), RWTH Aachen University, Pauwelsstr. 30, 52074 Aachen, Germany
| | - Elke Smith
- Biological Psychology, Department of Psychology, University of Cologne, Bernhard-Feilchenfeld-Str. 11, 50969 Cologne, Germany
| | - Mikhail Votinov
- Department of Psychiatry, Psychotherapy and Psychosomatics, Faculty of Medicine, RWTH Aachen, Pauwelsstr. 30, 52074 Aachen, Germany
- Research Center Jülich, Institute of Neuroscience and Medicine: JARA-Institute Brain Structure Function Relationship (INM 10), Wilhelm-Johnen-Straße, 52428 Jülich, Germany
| | - Birgit Derntl
- Department of Psychiatry and Psychotherapy, University of Tübingen, Calwerstr. 14, 72076 Tübingen, Germany
- Werner Reichardt Center for Integrative Neuroscience (CIN), University of Tübingen, Otfried-Müller-Str. 25, 72076 Tübingen, Germany
| | - Ute Habel
- Department of Psychiatry, Psychotherapy and Psychosomatics, Faculty of Medicine, RWTH Aachen, Pauwelsstr. 30, 52074 Aachen, Germany
- Research Center Jülich, Institute of Neuroscience and Medicine: JARA-Institute Brain Structure Function Relationship (INM 10), Wilhelm-Johnen-Straße, 52428 Jülich, Germany
| | - Danilo Bzdok
- McConnell Brain Imaging Centre, McGill University, 3801 University Rue, Montreal Quebec H3A 2B4, Canada
- Department of Biomedical Engineering, McGill University, 3775 University Rue, Montreal Quebec H3A 2B4, Canada
- Faculty of Medicine, Montreal Neurological Institute (MNI) and Hospital, McGill University, 3801 University Rue, Montreal Quebec H3A 2B4, Canada
- Mila–Quebec Artificial Intelligence Institute, 6666 Rue St-Urbain #200, Montreal Quebec H2S 3H1, Canada
|
13
|
Kang G, Baek SH, Kim YH, Kim DH, Park JW. Genetic Risk Assessment of Nonsyndromic Cleft Lip with or without Cleft Palate by Linking Genetic Networks and Deep Learning Models. Int J Mol Sci 2023; 24:ijms24054557. [PMID: 36901988 PMCID: PMC10003462 DOI: 10.3390/ijms24054557] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/28/2023] [Revised: 02/13/2023] [Accepted: 02/20/2023] [Indexed: 03/02/2023]
Abstract
Recent deep learning algorithms have further improved risk classification capabilities. However, an appropriate feature selection method is required to overcome dimensionality issues in population-based genetic studies. In this Korean case-control study of nonsyndromic cleft lip with or without cleft palate (NSCL/P), we compared the predictive performance of models that were developed by using the genetic-algorithm-optimized neural networks ensemble (GANNE) technique with those models that were generated by eight conventional risk classification methods, including polygenic risk score (PRS), random forest (RF), support vector machine (SVM), extreme gradient boosting (XGBoost), and deep-learning-based artificial neural network (ANN). GANNE, which is capable of automatic input SNP selection, exhibited the highest predictive power, especially in the 10-SNP model (AUC of 88.2%), thus improving the AUC by 23% and 17% compared to PRS and ANN, respectively. Genes mapped with input SNPs that were selected by using a genetic algorithm (GA) were functionally validated for risks of developing NSCL/P in gene ontology and protein-protein interaction (PPI) network analyses. The IRF6 gene, which is most frequently selected via GA, was also a major hub gene in the PPI network. Genes such as RUNX2, MTHFR, PVRL1, TGFB3, and TBX22 significantly contributed to predicting NSCL/P risk. GANNE is an efficient disease risk classification method using a minimum optimal set of SNPs; however, further validation studies are needed to ensure the clinical utility of the model for predicting NSCL/P risk.
Affiliation(s)
- Geon Kang
- Department of Medical Genetics, College of Medicine, Hallym University, Chuncheon 24252, Republic of Korea
| | - Seung-Hak Baek
- Department of Orthodontics, School of Dentistry, Seoul National University, Seoul 03080, Republic of Korea
| | - Young Ho Kim
- Department of Orthodontics, The Institute of Oral Health Science, Samsung Medical Center, School of Medicine, Sungkyunkwan University, Seoul 06351, Republic of Korea
| | - Dong-Hyun Kim
- Department of Social and Preventive Medicine, College of Medicine, Hallym University, Chuncheon 24252, Republic of Korea
| | - Ji Wan Park
- Department of Medical Genetics, College of Medicine, Hallym University, Chuncheon 24252, Republic of Korea
|
14
|
Learning high-order interactions for polygenic risk prediction. PLoS One 2023; 18:e0281618. [PMID: 36763605 PMCID: PMC9916647 DOI: 10.1371/journal.pone.0281618] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/29/2022] [Accepted: 01/27/2023] [Indexed: 02/11/2023]
Abstract
Within the framework of precision medicine, the stratification of individual genetic susceptibility based on inherited DNA variation has paramount relevance. However, one of the most relevant pitfalls of traditional Polygenic Risk Score (PRS) approaches is their inability to model complex high-order non-linear SNP-SNP interactions and their effect on the phenotype (e.g. epistasis). Indeed, they incur a computational challenge as the number of possible interactions grows exponentially with the number of SNPs considered, affecting the statistical reliability of the model parameters as well. In this work, we address this issue by proposing a novel PRS approach, called High-order Interactions-aware Polygenic Risk Score (hiPRS), that incorporates high-order interactions in modeling polygenic risk. The latter combines an interaction search routine based on frequent itemsets mining and a novel interaction selection algorithm based on Mutual Information, to construct a simple and interpretable weighted model of user-specified dimensionality that can predict a given binary phenotype. Compared to traditional PRS methods, hiPRS does not rely on GWAS summary statistics nor any external information. Moreover, hiPRS differs from Machine Learning-based approaches that can include complex interactions in that it provides a readable and interpretable model and it is able to control overfitting, even on small samples. In the present work we demonstrate through a comprehensive simulation study the superior performance of hiPRS w.r.t. state-of-the-art methods, both in terms of scoring performance and interpretability of the resulting model. We also test hiPRS against small sample size, class imbalance and the presence of noise, showcasing its robustness to extreme experimental settings. Finally, we apply hiPRS to a case study on real data from the DACHS cohort, defining an interaction-aware scoring model to predict mortality of stage II-III Colon-Rectal Cancer patients treated with oxaliplatin.
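To make the idea of scoring an interaction by Mutual Information concrete, here is a minimal sketch that computes the empirical mutual information between a candidate SNP-SNP interaction and a binary phenotype. It is not the hiPRS selection algorithm; the interaction encoding (a logical AND of two binary SNP indicators), the function name, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (count_x[a] * count_y[b]))
    return mi


# Hypothetical toy data: the phenotype is present only when both SNPs carry
# the risk allele, i.e. a purely epistatic effect.
snp1 = [1, 1, 0, 0, 1, 1, 0, 0]
snp2 = [1, 0, 1, 0, 1, 0, 1, 0]
phenotype = [1, 0, 0, 0, 1, 0, 0, 0]
interaction = [a & b for a, b in zip(snp1, snp2)]

mi_single = mutual_information(snp1, phenotype)
mi_interaction = mutual_information(interaction, phenotype)
print(mi_single, mi_interaction)
```

In this toy case the interaction term carries more information about the phenotype than either SNP alone, which is the kind of signal an additive score over single SNPs cannot represent.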
|
15
|
Spanbauer C, Pan W. Sparse prediction informed by genetic annotations using the logit normal prior for Bayesian regression tree ensembles. Genet Epidemiol 2023; 47:26-44. [PMID: 36349692 PMCID: PMC9892284 DOI: 10.1002/gepi.22505] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Revised: 09/08/2022] [Accepted: 09/21/2022] [Indexed: 11/11/2022]
Abstract
Using high-dimensional genetic variants such as single nucleotide polymorphisms (SNP) to predict complex diseases and traits has important applications in basic research and other clinical settings. For example, predicting gene expression is a necessary first step to identify (putative) causal genes in transcriptome-wide association studies. Due to weak signals, high-dimensionality, and linkage disequilibrium (correlation) among SNPs, building such a prediction model is challenging. However, functional annotations at the SNP level (e.g., as epigenomic data across multiple cell- or tissue-types) are available and could be used to inform predictor importance and aid in outcome prediction. Existing approaches to incorporate annotations have been based mainly on (generalized) linear models. Bayesian additive regression trees (BART), in contrast, is a reliable method to obtain high-quality nonlinear out of sample predictions without overfitting. Unfortunately, the default prior from BART may be too inflexible to handle sparse situations where the number of predictors approaches or surpasses the number of observations. Motivated by our real data application, this article proposes an alternative prior based on the logit normal distribution because it provides a framework that is adaptive to sparsity and can model informative functional annotations. It also provides a framework to incorporate prior information about the between SNP correlations. Computational details for carrying out inference are presented along with the results from a simulation study and a genome-wide prediction analysis of the Alzheimer's Disease Neuroimaging Initiative data.
Affiliation(s)
- Charles Spanbauer
- Division of Biostatistics, University of Minnesota, MN, USA (corresponding author)
| | - Wei Pan
- Division of Biostatistics, University of Minnesota, MN, USA
| | - The Alzheimer’s Disease Neuroimaging Initiative
- Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf
|
16
|
Gerussi A, Scaravaglio M, Cristoferi L, Verda D, Milani C, De Bernardi E, Ippolito D, Asselta R, Invernizzi P, Kather JN, Carbone M. Artificial intelligence for precision medicine in autoimmune liver disease. Front Immunol 2022; 13:966329. [PMID: 36439097 PMCID: PMC9691668 DOI: 10.3389/fimmu.2022.966329] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/10/2022] [Accepted: 10/13/2022] [Indexed: 09/10/2023]
Abstract
Autoimmune liver diseases (AiLDs) are rare autoimmune conditions of the liver and the biliary tree with unknown etiology and limited treatment options. AiLDs are inherently characterized by a high degree of complexity, which poses great challenges in understanding their etiopathogenesis, developing novel biomarkers and risk-stratification tools, and, eventually, generating new drugs. Artificial intelligence (AI) is considered one of the best candidates to support researchers and clinicians in making sense of biological complexity. In this review, we offer a primer on AI and machine learning for clinicians, and discuss recent available literature on its applications in medicine and more specifically how it can help to tackle major unmet needs in AiLDs.
Affiliation(s)
- Alessio Gerussi
- Division of Gastroenterology, Center for Autoimmune Liver Diseases, Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy
- European Reference Network on Hepatological Diseases (ERN RARE-LIVER), San Gerardo Hospital, Monza, Italy
| | - Miki Scaravaglio
- Division of Gastroenterology, Center for Autoimmune Liver Diseases, Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy
- European Reference Network on Hepatological Diseases (ERN RARE-LIVER), San Gerardo Hospital, Monza, Italy
| | - Laura Cristoferi
- Division of Gastroenterology, Center for Autoimmune Liver Diseases, Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy
- European Reference Network on Hepatological Diseases (ERN RARE-LIVER), San Gerardo Hospital, Monza, Italy
- Bicocca Bioinformatics Biostatistics and Bioimaging Centre - B4, School of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy
| | | | - Chiara Milani
- Division of Gastroenterology, Center for Autoimmune Liver Diseases, Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy
- European Reference Network on Hepatological Diseases (ERN RARE-LIVER), San Gerardo Hospital, Monza, Italy
| | - Elisabetta De Bernardi
- Department of Medicine and Surgery and Tecnomed Foundation, University of Milano - Bicocca, Monza, Italy
| | | | - Rosanna Asselta
- Humanitas Clinical and Research Center, Rozzano, Milan, Italy
- Department of Biomedical Sciences, Humanitas University, Pieve Emanuele, Milan, Italy
| | - Pietro Invernizzi
- Division of Gastroenterology, Center for Autoimmune Liver Diseases, Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy
- European Reference Network on Hepatological Diseases (ERN RARE-LIVER), San Gerardo Hospital, Monza, Italy
| | - Jakob Nikolas Kather
- Department of Medicine III, University Hospital RWTH Aachen, Aachen, Germany
- Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, Technical University Dresden, Dresden, Germany
| | - Marco Carbone
- Division of Gastroenterology, Center for Autoimmune Liver Diseases, Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy
- European Reference Network on Hepatological Diseases (ERN RARE-LIVER), San Gerardo Hospital, Monza, Italy
|
17
|
Ayat M, Domaratzki M. Sparse bayesian learning for genomic selection in yeast. FRONTIERS IN BIOINFORMATICS 2022; 2:960889. [PMID: 36304259 PMCID: PMC9580947 DOI: 10.3389/fbinf.2022.960889] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/03/2022] [Accepted: 08/02/2022] [Indexed: 11/13/2022]
Abstract
Genomic selection, which predicts phenotypes such as yield and drought resistance in crops from high-density markers positioned throughout the genome of the varieties, is moving towards machine learning techniques to make predictions on complex traits that are controlled by several genes. In this paper, we consider sparse Bayesian learning and ensemble learning as a technique for genomic selection and ranking markers based on their relevance to a trait. We define and explore two different forms of the sparse Bayesian learning for predicting phenotypes and identifying the most influential markers of a trait, respectively. We apply our methods on a Saccharomyces cerevisiae dataset, and analyse our results with respect to existing related works, trait heritability, as well as the accuracies obtained from linear and Gaussian kernel functions. We find that sparse Bayesian methods are not only competitive with other machine learning methods in predicting yeast growth in different environments, but are also capable of identifying the most important markers, including both positive and negative effects on the growth, from which biologists can get insight. This attribute can make our proposed ensemble of sparse Bayesian learners favourable in ranking markers based on their relevance to a trait.
Affiliation(s)
- Maryam Ayat
- Lactanet, Sainte-Anne-de-Bellevue, QC, Canada
| | - Mike Domaratzki
- Department of Computer Science, University of Western Ontario, London, ON, Canada
- *Correspondence: Mike Domaratzki
|
18
|
Pudjihartono N, Fadason T, Kempa-Liehr AW, O'Sullivan JM. A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction. FRONTIERS IN BIOINFORMATICS 2022; 2:927312. [PMID: 36304293 PMCID: PMC9580915 DOI: 10.3389/fbinf.2022.927312] [Citation(s) in RCA: 75] [Impact Index Per Article: 37.5] [Received: 04/24/2022] [Accepted: 06/03/2022] [Indexed: 01/14/2023]
Abstract
Machine learning has shown utility in detecting patterns within large, unstructured, and complex datasets. One of the promising applications of machine learning is in precision medicine, where disease risk is predicted using patient genetic data. However, creating an accurate prediction model based on genotype data remains challenging due to the so-called “curse of dimensionality” (i.e., extensively larger number of features compared to the number of samples). Therefore, the generalizability of machine learning models benefits from feature selection, which aims to extract only the most “informative” features and remove noisy “non-informative,” irrelevant and redundant features. In this article, we provide a general overview of the different feature selection methods, their advantages, disadvantages, and use cases, focusing on the detection of relevant features (i.e., SNPs) for disease risk prediction.
Affiliation(s)
| | - Tayaza Fadason
- Liggins Institute, University of Auckland, Auckland, New Zealand
- Maurice Wilkins Centre for Molecular Biodiscovery, Auckland, New Zealand
| | - Andreas W. Kempa-Liehr
- Department of Engineering Science, The University of Auckland, Auckland, New Zealand
- *Correspondence: Andreas W. Kempa-Liehr; Justin M. O'Sullivan
| | - Justin M. O'Sullivan
- Liggins Institute, University of Auckland, Auckland, New Zealand
- Maurice Wilkins Centre for Molecular Biodiscovery, Auckland, New Zealand
- MRC Lifecourse Epidemiology Unit, University of Southampton, Southampton, United Kingdom
- Singapore Institute for Clinical Sciences, Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Australian Parkinson’s Mission, Garvan Institute of Medical Research, Sydney, NSW, Australia
- *Correspondence: Andreas W. Kempa-Liehr; Justin M. O'Sullivan
|
19
|
Ruigrok M, Xue B, Catanach A, Zhang M, Jesson L, Davy M, Wellenreuther M. The Relative Power of Structural Genomic Variation versus SNPs in Explaining the Quantitative Trait Growth in the Marine Teleost Chrysophrys auratus. Genes (Basel) 2022; 13:genes13071129. [PMID: 35885912 PMCID: PMC9320665 DOI: 10.3390/genes13071129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2022] [Revised: 06/08/2022] [Accepted: 06/20/2022] [Indexed: 02/04/2023]
Abstract
Background: Genetic diversity provides the basic substrate for evolution. Genetic variation consists of changes ranging from single base pairs (single-nucleotide polymorphisms, or SNPs) to larger-scale structural variants, such as inversions, deletions, and duplications. SNPs have long been used as the general currency for investigations into how genetic diversity fuels evolution. However, structural variants can affect more base pairs in the genome than SNPs and can be responsible for adaptive phenotypes due to their impact on linkage and recombination. In this study, we investigate the first steps needed to explore the genetic basis of an economically important growth trait in the marine teleost finfish Chrysophrys auratus using both SNP and structural variant data. Specifically, we use feature selection methods in machine learning to explore the relative predictive power of both types of genetic variants in explaining growth and discuss the feature selection results of the evaluated methods. Methods: SNP and structural variant callers were used to generate catalogues of variant data from 32 individual fish at ages 1 and 3 years. Three feature selection algorithms (ReliefF, Chi-square, and a mutual-information-based method) were used to reduce the dataset by selecting the most informative features. Following this selection process, the subset of variants was used as features to classify fish into small, medium, or large size categories using KNN, naïve Bayes, random forest, and logistic regression. The top-scoring features in each feature selection method were subsequently mapped to annotated genomic regions in the zebrafish genome, and a permutation test was conducted to see if the number of mapped regions was greater than when random sampling was applied. Results: Without feature selection, the prediction accuracies ranged from 0 to 0.5 for both structural variants and SNPs. Following feature selection, the prediction accuracy increased only slightly to between 0 and 0.65 for structural variants and between 0 and 0.75 for SNPs. The highest prediction accuracy for the logistic regression was achieved for age 3 fish using SNPs, although generally predictions for age 1 and 3 fish were very similar (ranging from 0–0.65 for both SNPs and structural variants). The Chi-square feature selection of SNP data was the only method that had a significantly higher number of matches to annotated genomic regions of zebrafish than would be explained by chance alone. Conclusions: Predicting a complex polygenic trait such as growth using data collected from a low number of individuals remains challenging. While we demonstrate that both SNPs and structural variants provide important information to help understand the genetic basis of phenotypic traits such as fish growth, the full complexities that exist within a genome cannot be easily captured by classical machine learning techniques. When using high-dimensional data, feature selection shows some increase in the prediction accuracy of classification models and provides the potential to identify unknown genomic correlates with growth. Our results show that both SNPs and structural variants significantly impact growth, and we therefore recommend that researchers interested in the genotype–phenotype map should strive to go beyond SNPs and incorporate structural variants in their studies as well. We discuss how our machine learning models can be further expanded to serve as a test bed to inform evolutionary studies and the applied management of species.
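The Chi-square filter step described in the Methods can be illustrated with a short sketch. It scores each variant by the Pearson chi-square statistic of its genotype-by-size-category contingency table and keeps the top-ranked ones. This is an assumption about the general technique rather than the authors' implementation, and the variant names, genotype coding, and toy data are hypothetical.

```python
from collections import Counter


def chi_square_score(feature, labels):
    """Pearson chi-square statistic for the feature-by-label contingency table."""
    n = len(labels)
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(feature, labels))
    stat = 0.0
    for f_value, f_count in feature_counts.items():
        for l_value, l_count in label_counts.items():
            expected = f_count * l_count / n
            observed = joint_counts.get((f_value, l_value), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def top_k_features(features, labels, k):
    """Return the names of the k features with the highest chi-square score."""
    scores = {name: chi_square_score(col, labels) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: 8 fish, genotypes coded as 0 or 2 copies of an allele.
labels = ["small"] * 4 + ["large"] * 4
features = {
    "snp_a": [0, 0, 0, 0, 2, 2, 2, 2],  # perfectly associated with size
    "snp_b": [0, 2, 0, 2, 0, 2, 0, 2],  # unrelated to size
}
print(top_k_features(features, labels, 1))
```

A filter like this scores each variant independently of the others, so it is cheap on high-dimensional data but cannot detect variants that matter only in combination.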
Affiliation(s)
- Mike Ruigrok
  - The New Zealand Institute for Plant & Food Research Ltd., Nelson 7010, New Zealand; (M.R.); (A.C.); (L.J.); (M.D.)
  - Wellington Faculty of Engineering, Victoria University of Wellington, Wellington 6012, New Zealand; (B.X.); (M.Z.)
- Bing Xue
  - Wellington Faculty of Engineering, Victoria University of Wellington, Wellington 6012, New Zealand; (B.X.); (M.Z.)
- Andrew Catanach
  - The New Zealand Institute for Plant & Food Research Ltd., Nelson 7010, New Zealand; (M.R.); (A.C.); (L.J.); (M.D.)
- Mengjie Zhang
  - Wellington Faculty of Engineering, Victoria University of Wellington, Wellington 6012, New Zealand; (B.X.); (M.Z.)
- Linley Jesson
  - The New Zealand Institute for Plant & Food Research Ltd., Nelson 7010, New Zealand; (M.R.); (A.C.); (L.J.); (M.D.)
- Marcus Davy
  - The New Zealand Institute for Plant & Food Research Ltd., Nelson 7010, New Zealand; (M.R.); (A.C.); (L.J.); (M.D.)
- Maren Wellenreuther
  - The New Zealand Institute for Plant & Food Research Ltd., Nelson 7010, New Zealand; (M.R.); (A.C.); (L.J.); (M.D.)
  - School of Biological Sciences, University of Auckland, Auckland 1010, New Zealand
|
20
|
Isik YE, Gormez Y, Aydin Z, Bakir-Gungor B. The Determination of Distinctive Single Nucleotide Polymorphism Sets for the Diagnosis of Behçet's Disease. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1909-1918. [PMID: 33476272 DOI: 10.1109/tcbb.2021.3053429] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/12/2023]
Abstract
Behçet's Disease (BD) is a multi-system inflammatory disorder in which the etiology remains unclear. The most probable hypothesis is that genetic tendency and environmental factors play roles in the development of BD. In order to find the essential reasons, genetic changes on thousands of genes should be analyzed. Besides, there is a need for extra analysis to find out which genetic factor affects the disease. Machine learning approaches have high potential for extracting the knowledge from genomics and selecting the representative Single Nucleotide Polymorphisms (SNPs) as the most effective features for the clinical diagnosis process. In this study, we have attempted to identify representative SNPs using feature selection methods, incorporating biological information and aimed to develop a machine-learning model for diagnosing Behçet's disease. By combining biological information and machine learning classifiers, up to 99.64 percent accuracy of disease prediction is achieved using only 13,611 out of 311,459 SNPs. In addition, we revealed the SNPs that are most distinctive by performing repeated feature selection in cross-validation experiments.
|
21
|
Yoo HY, Lee KC, Woo JE, Park SH, Lee S, Joo J, Bae JS, Kwon HJ, Park BJ. A Genome-Wide Association Study and Machine-Learning Algorithm Analysis on the Prediction of Facial Phenotypes by Genotypes in Korean Women. Clin Cosmet Investig Dermatol 2022; 15:433-445. [PMID: 35313536 PMCID: PMC8933694 DOI: 10.2147/ccid.s339547] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2021] [Accepted: 01/12/2022] [Indexed: 12/03/2022]
Abstract
Purpose: Changes in facial appearance are affected by various intrinsic and extrinsic factors, which vary from person to person. Therefore, each person needs to determine their skin condition accurately to care for their skin accordingly. Recently, genetic identification by skin-related phenotypes has become possible using genome-wide association studies (GWAS) and machine-learning algorithms. However, because most GWAS have focused on populations with American or European skin pigmentation, large-scale GWAS are needed for Asian populations. This study aimed to evaluate the correlation of facial phenotypes with candidate single-nucleotide polymorphisms (SNPs) to predict phenotype from genotype using machine learning. Materials and Methods: A total of 749 Korean women aged 30-50 years were enrolled in this study and evaluated for five facial phenotypes (melanin, gloss, hydration, wrinkle, and elasticity). To find highly related SNPs with each phenotype, GWAS analysis was used. In addition, phenotype prediction was performed using three machine-learning algorithms (linear, ridge, and linear support vector regressions) using five-fold cross-validation. Results: Using GWAS analysis, we found 46 novel highly associated SNPs (p < 1 × 10⁻⁵): 3, 20, 12, 6, and 5 SNPs for melanin, gloss, hydration, wrinkle, and elasticity, respectively. On comparing the performance of each model based on phenotypes using five-fold cross-validation, the ridge regression model showed the highest accuracy (r² = 0.6422-0.7266) in all skin traits. Therefore, the optimal solution for personal skin diagnosis using GWAS was with the ridge regression model. Conclusion: The proposed facial phenotype prediction model in this study provided the optimal solution for accurately predicting the skin condition of an individual by identifying genotype information of target characteristics and machine-learning methods. This model has potential utility for the development of customized cosmetics.
Affiliation(s)
- Hye-Young Yoo
  - Skin & Natural Products Lab, Kolmar Korea Co., Ltd., Seoul, 06800, Republic of Korea
- Ki-Chan Lee
  - R&D Department, Eone Diagnomics Genome Center Co., Ltd, Songdo Incheon, 22014, Republic of Korea
- Ji-Eun Woo
  - Skin & Natural Products Lab, Kolmar Korea Co., Ltd., Seoul, 06800, Republic of Korea
- Sung-Ha Park
  - Skin & Natural Products Lab, Kolmar Korea Co., Ltd., Seoul, 06800, Republic of Korea
- Sunghoon Lee
  - R&D Department, Eone Diagnomics Genome Center Co., Ltd, Songdo Incheon, 22014, Republic of Korea
- Joungsu Joo
  - R&D Department, Eone Diagnomics Genome Center Co., Ltd, Songdo Incheon, 22014, Republic of Korea
- Jin-Sik Bae
  - R&D Department, Eone Diagnomics Genome Center Co., Ltd, Songdo Incheon, 22014, Republic of Korea
- Hyuk-Jung Kwon
  - R&D Department, Eone Diagnomics Genome Center Co., Ltd, Songdo Incheon, 22014, Republic of Korea
- Byoung-Jun Park
  - Skin & Natural Products Lab, Kolmar Korea Co., Ltd., Seoul, 06800, Republic of Korea
|
22
|
Collin CB, Gebhardt T, Golebiewski M, Karaderi T, Hillemanns M, Khan FM, Salehzadeh-Yazdi A, Kirschner M, Krobitsch S, Kuepfer L. Computational Models for Clinical Applications in Personalized Medicine—Guidelines and Recommendations for Data Integration and Model Validation. J Pers Med 2022; 12:jpm12020166. [PMID: 35207655 PMCID: PMC8879572 DOI: 10.3390/jpm12020166] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 11/19/2021] [Revised: 01/14/2022] [Accepted: 01/20/2022] [Indexed: 12/12/2022]
Abstract
The future development of personalized medicine depends on a vast exchange of data from different sources, as well as harmonized integrative analysis of large-scale clinical health and sample data. Computational-modelling approaches play a key role in the analysis of the underlying molecular processes and pathways that characterize human biology, but they also lead to a more profound understanding of the mechanisms and factors that drive diseases; hence, they allow personalized treatment strategies that are guided by central clinical questions. However, despite the growing popularity of computational-modelling approaches in different stakeholder communities, there are still many hurdles to overcome before they can be implemented in clinical routine. In particular, the integration of heterogeneous data from multiple sources and of multiple types is a challenging task that requires clear guidelines, which must also comply with high ethical and legal standards. Here, we discuss in detail the most relevant computational models for personalized medicine, which can be considered best-practice guidelines for application in clinical care. We define specific challenges and provide applicable guidelines and recommendations for study design, data acquisition, and operation as well as for model validation and clinical translation and other research areas.
Affiliation(s)
- Catherine Bjerre Collin
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, 2200 N Copenhagen, Denmark; (C.B.C.); (T.K.)
- Tom Gebhardt
  - Department of Systems Biology and Bioinformatics, University of Rostock, 18057 Rostock, Germany; (T.G.); (M.H.); (F.M.K.)
- Martin Golebiewski
  - Heidelberg Institute for Theoretical Studies gGmbH, 69118 Heidelberg, Germany
- Tugce Karaderi
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, 2200 N Copenhagen, Denmark; (C.B.C.); (T.K.)
  - Center for Health Data Science, Faculty of Health and Medical Sciences, University of Copenhagen, 2200 N Copenhagen, Denmark
- Maximilian Hillemanns
  - Department of Systems Biology and Bioinformatics, University of Rostock, 18057 Rostock, Germany; (T.G.); (M.H.); (F.M.K.)
- Faiz Muhammad Khan
  - Department of Systems Biology and Bioinformatics, University of Rostock, 18057 Rostock, Germany; (T.G.); (M.H.); (F.M.K.)
- Marc Kirschner
  - Forschungszentrum Jülich GmbH, Project Management Jülich, 52425 Jülich, Germany; (M.K.); (S.K.)
- Sylvia Krobitsch
  - Forschungszentrum Jülich GmbH, Project Management Jülich, 52425 Jülich, Germany; (M.K.); (S.K.)
- Lars Kuepfer
  - Institute for Systems Medicine with Focus on Organ Interaction, University Hospital RWTH Aachen, 52074 Aachen, Germany
  - Correspondence: Tel.: +49-241-8085900
|
23
|
Xu Y, Vuckovic D, Ritchie SC, Akbari P, Jiang T, Grealey J, Butterworth AS, Ouwehand WH, Roberts DJ, Di Angelantonio E, Danesh J, Soranzo N, Inouye M. Machine learning optimized polygenic scores for blood cell traits identify sex-specific trajectories and genetic correlations with disease. CELL GENOMICS 2022; 2:None. [PMID: 35072137 PMCID: PMC8758502 DOI: 10.1016/j.xgen.2021.100086] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 10/20/2020] [Revised: 08/24/2021] [Accepted: 12/13/2021] [Indexed: 12/13/2022]
Abstract
Genetic association studies for blood cell traits, which are key indicators of health and immune function, have identified several hundred associations and defined a complex polygenic architecture. Polygenic scores (PGSs) for blood cell traits have potential clinical utility in disease risk prediction and prevention, but designing PGS remains challenging and the optimal methods are unclear. To address this, we evaluated the relative performance of 6 methods to develop PGS for 26 blood cell traits, including a standard method of pruning and thresholding (P + T) and 5 learning methods: LDpred2, elastic net (EN), Bayesian ridge (BR), multilayer perceptron (MLP) and convolutional neural network (CNN). We evaluated these optimized PGSs on blood cell trait data from UK Biobank and INTERVAL. We find that PGSs designed using common machine learning methods EN and BR show improved prediction of blood cell traits and consistently outperform other methods. Our analyses suggest EN/BR as the top choices for PGS construction, showing improved performance for 25 blood cell traits in the external validation, with correlations with the directly measured traits increasing by 10%-23%. Ten PGSs showed significant statistical interaction with sex, and sex-specific PGS stratification showed that all of them had substantial variation in the trajectories of blood cell traits with age. Genetic correlations between the PGSs for blood cell traits and common human diseases identified well-known as well as new associations. We develop machine learning-optimized PGS for blood cell traits, demonstrate their relationships with sex, age, and disease, and make these publicly available as a resource.
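Illustrative sketch (not taken from the paper): the baseline named in the abstract, pruning and thresholding (P + T), keeps variants whose association p-value passes a threshold, drops variants in high linkage disequilibrium with a more significant kept variant, and sums effect size times allele dosage over what remains. The minimal version below assumes that per-variant effect sizes, p-values, and pairwise r² values are already available. All names, thresholds, and numbers are hypothetical.

```python
def prune_and_threshold(variants, ld_r2, p_max=5e-8, r2_max=0.1):
    """Select variants for a polygenic score.

    variants: dict mapping variant id -> (beta, p_value)
    ld_r2:    dict mapping frozenset({id_a, id_b}) -> pairwise r^2
              (pairs that are absent are treated as unlinked)
    """
    # Thresholding: keep variants passing the p-value cut-off,
    # ordered from most to least significant.
    passing = sorted(
        (v for v in variants if variants[v][1] <= p_max),
        key=lambda v: variants[v][1],
    )
    # Pruning: greedily keep a variant only if it is not in high LD
    # with any variant that has already been kept.
    kept = []
    for v in passing:
        if all(ld_r2.get(frozenset((v, k)), 0.0) <= r2_max for k in kept):
            kept.append(v)
    return kept


def polygenic_score(dosages, variants, kept):
    """Sum of effect size times allele dosage (0, 1, or 2) over kept variants."""
    return sum(variants[v][0] * dosages.get(v, 0) for v in kept)


# Hypothetical summary statistics and LD information.
variants = {
    "snp1": (0.30, 1e-10),
    "snp2": (0.25, 1e-9),   # in high LD with snp1, so it is pruned
    "snp3": (-0.10, 1e-8),
    "snp4": (0.50, 0.2),    # fails the p-value threshold
}
ld_r2 = {frozenset(("snp1", "snp2")): 0.8}

kept = prune_and_threshold(variants, ld_r2)
score = polygenic_score({"snp1": 2, "snp2": 1, "snp3": 1, "snp4": 2}, variants, kept)
print(kept, score)
```

The learning methods compared in the study (elastic net, Bayesian ridge, and the neural networks) instead estimate the weights jointly rather than taking them one variant at a time from association statistics.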
Affiliation(s)
- Yu Xu
  - Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
  - Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC 3004, Australia
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
- Dragana Vuckovic
  - Department of Human Genetics, Wellcome Sanger Institute, Hinxton CB10 1SA, UK
  - National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK
- Scott C. Ritchie
  - Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
  - Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC 3004, Australia
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
  - British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK
- Parsa Akbari
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
  - National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK
- Tao Jiang
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
- Jason Grealey
  - Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC 3004, Australia
  - Department of Mathematics and Statistics, La Trobe University, Bundoora, VIC 3086, Australia
- Adam S. Butterworth
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
  - National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK
  - British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK
  - Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge CB10 1SA, UK
- Willem H. Ouwehand
  - Department of Human Genetics, Wellcome Sanger Institute, Hinxton CB10 1SA, UK
  - British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK
  - National Health Service (NHS) Blood and Transplant, Cambridge Biomedical Campus, Cambridge CB2 0PT, UK
  - Department of Haematology, University of Cambridge, Cambridge CB2 0PT, UK
- David J. Roberts
  - National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK
  - National Health Service (NHS) Blood and Transplant, Cambridge Biomedical Campus, Cambridge CB2 0PT, UK
  - National Institute for Health Research Oxford Biomedical Research Centre, University of Oxford and John Radcliffe Hospital, Oxford OX3 9DU, UK
- Emanuele Di Angelantonio
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
  - National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK
  - British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK
  - Health Data Science Research Centre, Human Technopole, Milan 20157, Italy
  - Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge CB10 1SA, UK
- John Danesh
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
  - Department of Human Genetics, Wellcome Sanger Institute, Hinxton CB10 1SA, UK
  - National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK
  - British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK
  - Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge CB10 1SA, UK
- Nicole Soranzo
  - Department of Human Genetics, Wellcome Sanger Institute, Hinxton CB10 1SA, UK
  - National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK
  - British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK
- Michael Inouye
  - Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
  - Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC 3004, Australia
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
  - British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK
  - Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge CB10 1SA, UK
  - The Alan Turing Institute, London NW1 2DB, UK
|
24
|
Raben TG, Lello L, Widen E, Hsu SDH. From Genotype to Phenotype: Polygenic Prediction of Complex Human Traits. Methods Mol Biol 2022; 2467:421-446. [PMID: 35451785 DOI: 10.1007/978-1-0716-2205-6_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Decoding the genome confers the capability to predict characteristics of the organism (phenotype) from DNA (genotype). We describe the present status and future prospects of genomic prediction of complex traits in humans. Some highly heritable complex phenotypes such as height and other quantitative traits can already be predicted with reasonable accuracy from DNA alone. For many diseases, including important common conditions such as coronary artery disease, breast cancer, type I and II diabetes, individuals with outlier polygenic scores (e.g., top few percent) have been shown to have 5 or even 10 times higher risk than average. Several psychiatric conditions such as schizophrenia and autism also fall into this category. We discuss related topics such as the genetic architecture of complex traits, sibling validation of polygenic scores, and applications to adult health, in vitro fertilization (embryo selection), and genetic engineering.
Affiliation(s)
- Louis Lello
  - Michigan State University, East Lansing, MI, USA
  - Genomic Prediction, North Brunswick, NJ, USA
- Erik Widen
  - Michigan State University, East Lansing, MI, USA
- Stephen D H Hsu
  - Michigan State University, East Lansing, MI, USA
  - Genomic Prediction, North Brunswick, NJ, USA
|
25
|
Passamonti MM, Somenzi E, Barbato M, Chillemi G, Colli L, Joost S, Milanesi M, Negrini R, Santini M, Vajana E, Williams JL, Ajmone-Marsan P. The Quest for Genes Involved in Adaptation to Climate Change in Ruminant Livestock. Animals (Basel) 2021; 11:2833. [PMID: 34679854 PMCID: PMC8532622 DOI: 10.3390/ani11102833] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 08/19/2021] [Revised: 09/21/2021] [Accepted: 09/23/2021] [Indexed: 12/14/2022]
Abstract
Livestock radiated out from domestication centres to most regions of the world, gradually adapting to diverse environments, from very hot to sub-zero temperatures and from wet and humid conditions to deserts. The climate is changing; generally global temperature is increasing, although there are also more extreme cold periods, storms, and higher solar radiation. These changes impact livestock welfare and productivity. This review describes advances in the methodology for studying livestock genomes and the impact of the environment on animal production, giving examples of discoveries made. Sequencing livestock genomes has facilitated genome-wide association studies to localize genes controlling many traits, and population genetics has identified genomic regions under selection or introgressed from one breed into another to improve production or facilitate adaptation. Landscape genomics, which combines global positioning and genomics, has identified genomic features that enable animals to adapt to local environments. Combining the advances in genomics and methods for predicting changes in climate is generating an explosion of data which calls for innovations in the way big data sets are treated. Artificial intelligence and machine learning are now being used to study the interactions between the genome and the environment to identify historic effects on the genome and to model future scenarios.
Affiliation(s)
- Matilde Maria Passamonti
  - Department of Animal Science, Food and Nutrition—DIANA, Università Cattolica del Sacro Cuore, Via Emilia Parmense, 84, 29122 Piacenza, Italy; (M.M.P.); (E.S.); (M.B.); (L.C.); (R.N.); (J.L.W.)
- Elisa Somenzi
  - Department of Animal Science, Food and Nutrition—DIANA, Università Cattolica del Sacro Cuore, Via Emilia Parmense, 84, 29122 Piacenza, Italy; (M.M.P.); (E.S.); (M.B.); (L.C.); (R.N.); (J.L.W.)
- Mario Barbato
  - Department of Animal Science, Food and Nutrition—DIANA, Università Cattolica del Sacro Cuore, Via Emilia Parmense, 84, 29122 Piacenza, Italy; (M.M.P.); (E.S.); (M.B.); (L.C.); (R.N.); (J.L.W.)
- Giovanni Chillemi
  - Department for Innovation in Biological, Agro-Food and Forest Systems–DIBAF, Università Della Tuscia, Via S. Camillo de Lellis snc, 01100 Viterbo, Italy; (G.C.); (M.M.)
- Licia Colli
  - Department of Animal Science, Food and Nutrition—DIANA, Università Cattolica del Sacro Cuore, Via Emilia Parmense, 84, 29122 Piacenza, Italy; (M.M.P.); (E.S.); (M.B.); (L.C.); (R.N.); (J.L.W.)
  - Research Center on Biodiversity and Ancient DNA—BioDNA, Università Cattolica del Sacro Cuore, Via Emilia Parmense, 84, 29122 Piacenza, Italy
- Stéphane Joost
  - Laboratory of Geographic Information Systems (LASIG), School of Architecture, Civil and Environmental Engineering (ENAC), Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland; (S.J.); (E.V.)
- Marco Milanesi
  - Department for Innovation in Biological, Agro-Food and Forest Systems–DIBAF, Università Della Tuscia, Via S. Camillo de Lellis snc, 01100 Viterbo, Italy; (G.C.); (M.M.)
- Riccardo Negrini
  - Department of Animal Science, Food and Nutrition—DIANA, Università Cattolica del Sacro Cuore, Via Emilia Parmense, 84, 29122 Piacenza, Italy; (M.M.P.); (E.S.); (M.B.); (L.C.); (R.N.); (J.L.W.)
- Monia Santini
  - Impacts on Agriculture, Forests and Ecosystem Services (IAFES) Division, Fondazione Centro Euro-Mediterraneo Sui Cambiamenti Climatici (CMCC), Viale Trieste 127, 01100 Viterbo, Italy
- Elia Vajana
  - Laboratory of Geographic Information Systems (LASIG), School of Architecture, Civil and Environmental Engineering (ENAC), Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland; (S.J.); (E.V.)
- John Lewis Williams
  - Department of Animal Science, Food and Nutrition—DIANA, Università Cattolica del Sacro Cuore, Via Emilia Parmense, 84, 29122 Piacenza, Italy; (M.M.P.); (E.S.); (M.B.); (L.C.); (R.N.); (J.L.W.)
- Paolo Ajmone-Marsan
  - Department of Animal Science, Food and Nutrition—DIANA, Università Cattolica del Sacro Cuore, Via Emilia Parmense, 84, 29122 Piacenza, Italy; (M.M.P.); (E.S.); (M.B.); (L.C.); (R.N.); (J.L.W.)
  - Nutrigenomics and Proteomics Research Center—PRONUTRIGEN, Università Cattolica del Sacro Cuore, Via Emilia Parmense, 84, 29122 Piacenza, Italy
|
26
|
Katsaouni N, Tashkandi A, Wiese L, Schulz MH. Machine learning based disease prediction from genotype data. Biol Chem 2021; 402:871-885. [PMID: 34218544 DOI: 10.1515/hsz-2021-0109] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/10/2021] [Accepted: 06/15/2021] [Indexed: 12/16/2022]
Abstract
Using results from genome-wide association studies for understanding complex traits is a current challenge. Here we review how genotype data can be used with different machine learning (ML) methods to predict phenotype occurrence and severity from genotype data. We discuss common feature encoding schemes and how studies handle the often small number of samples compared to the huge number of variants. We compare which ML methods are being applied, including recent results using deep neural networks. Further, we review the application of methods for feature explanation and interpretation.
Affiliation(s)
- Nikoletta Katsaouni
  - Institute for Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany
- Araek Tashkandi
  - Institute of Computer Sciences and Engineering, University of Jeddah, 21959 Jeddah, Saudi Arabia
- Lena Wiese
  - Institute of Computer Science, Goethe University, 60629 Frankfurt am Main, Germany
- Marcel H Schulz
  - Institute for Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany
  - German Center for Cardiovascular Research (DZHK), Partner Site RheinMain, 60590 Frankfurt am Main, Germany
  - Cardio-Pulmonary Institute, Goethe University Hospital, Frankfurt am Main, Germany
|
27
|
Mieth B, Rozier A, Rodriguez JA, Höhne MMC, Görnitz N, Müller KR. DeepCOMBI: explainable artificial intelligence for the analysis and discovery in genome-wide association studies. NAR Genom Bioinform 2021; 3:lqab065. [PMID: 34296082 PMCID: PMC8291080 DOI: 10.1093/nargab/lqab065] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 11/24/2020] [Revised: 05/27/2021] [Accepted: 07/08/2021] [Indexed: 02/06/2023]
Abstract
Deep learning has revolutionized data science in many fields by greatly improving prediction performances in comparison to conventional approaches. Recently, explainable artificial intelligence has emerged as an area of research that goes beyond pure prediction improvement by extracting knowledge from deep learning methodologies through the interpretation of their results. We investigate such explanations to explore the genetic architectures of phenotypes in genome-wide association studies. Instead of testing each position in the genome individually, the novel three-step algorithm, called DeepCOMBI, first trains a neural network for the classification of subjects into their respective phenotypes. Second, it explains the classifiers’ decisions by applying layer-wise relevance propagation as one example from the pool of explanation techniques. The resulting importance scores are eventually used to determine a subset of the most relevant locations for multiple hypothesis testing in the third step. The performance of DeepCOMBI in terms of power and precision is investigated on generated datasets and a 2007 study. Verification of the latter is achieved by validating all findings with independent studies published up until 2020. DeepCOMBI is shown to outperform ordinary raw P-value thresholding and other baseline methods. Two novel disease associations (rs10889923 for hypertension, rs4769283 for type 1 diabetes) were identified.
Affiliation(s)
- Bettina Mieth
  - Machine Learning Group, Technische Universität Berlin, Berlin 10587, Germany
- Alexandre Rozier
  - Machine Learning Group, Technische Universität Berlin, Berlin 10587, Germany
- Juan Antonio Rodriguez
  - CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Barcelona 08003, Spain
- Marina M C Höhne
  - Machine Learning Group, Technische Universität Berlin, Berlin 10587, Germany
- Klaus-Robert Müller
  - Machine Learning Group, Technische Universität Berlin, Berlin 10587, Germany
|
28
|
Westerman EL, Bowman SEJ, Davidson B, Davis MC, Larson ER, Sanford CPJ. Deploying Big Data to Crack the Genotype to Phenotype Code. Integr Comp Biol 2021; 60:385-396. [PMID: 32492136 DOI: 10.1093/icb/icaa055] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/25/2022]
Abstract
Mechanistically connecting genotypes to phenotypes is a longstanding and central mission of biology. Deciphering these connections will unite questions and datasets across all scales from molecules to ecosystems. Although high-throughput sequencing has provided a rich platform on which to launch this effort, tools for deciphering mechanisms further along the genome to phenome pipeline remain limited. Machine learning approaches and other emerging computational tools hold the promise of augmenting human efforts to overcome these obstacles. This vision paper is the result of a Reintegrating Biology Workshop, bringing together the perspectives of integrative and comparative biologists to survey challenges and opportunities in cracking the genotype to phenotype code and thereby generating predictive frameworks across biological scales. Key recommendations include promoting the development of minimum "best practices" for the experimental design and collection of data; fostering sustained and long-term data repositories; promoting programs that recruit, train, and retain a diversity of talent; and providing funding to effectively support these highly cross-disciplinary efforts. We follow this discussion by highlighting a few specific transformative research opportunities that will be advanced by these efforts.
Affiliation(s)
- Erica L Westerman
  - Department of Biological Sciences, University of Arkansas, Fayetteville, AR 72701, USA
- Sarah E J Bowman
  - High-Throughput Crystallization Screening Center, Hauptman-Woodward Medical Research Institute, Buffalo, NY 14203, USA
  - Department of Biochemistry, Jacobs School of Medicine & Biomedical Sciences at the University at Buffalo, Buffalo, NY 14203, USA
- Bradley Davidson
  - Department of Biology, Swarthmore College, Swarthmore, PA 19081, USA
- Marcus C Davis
  - Department of Biology, James Madison University, Harrisonburg, VA 22807, USA
- Eric R Larson
  - Department of Natural Resources and Environmental Sciences, University of Illinois, Urbana, IL 61801, USA
- Christopher P J Sanford
  - Department of Ecology, Evolution and Organismal Biology, Kennesaw State University, Kennesaw, GA 30144, USA
|
29
|
Bauer A, Zierer A, Gieger C, Büyüközkan M, Müller-Nurasyid M, Grallert H, Meisinger C, Strauch K, Prokisch H, Roden M, Peters A, Krumsiek J, Herder C, Koenig W, Thorand B, Huth C. Comparison of genetic risk prediction models to improve prediction of coronary heart disease in two large cohorts of the MONICA/KORA study. Genet Epidemiol 2021; 45:633-650. [PMID: 34082474 DOI: 10.1002/gepi.22389] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/08/2021] [Revised: 04/20/2021] [Accepted: 05/04/2021] [Indexed: 12/19/2022]
Abstract
It is still unclear how genetic information, provided as single-nucleotide polymorphisms (SNPs), can be most effectively integrated into risk prediction models for coronary heart disease (CHD) to add significant predictive value beyond clinical risk models. For the present study, a population-based case-cohort was used as a training set (451 incident cases, 1488 noncases) and an independent cohort as a test set (160 incident cases, 2749 noncases). The following strategies to quantify genetic information were compared: a weighted genetic risk score including Metabochip SNPs associated with CHD in the literature (GRSMetabo); selection of the most predictive SNPs among these literature-confirmed variants using priority-Lasso (PLMetabo); and validation of two comprehensive polygenic risk scores: GRSGola, based on Metabochip data, and GRSKhera (available in the test set only), based on cross-validated genome-wide genotyping data. We used Cox regression to assess associations with incident CHD. C-index, category-free net reclassification index (cfNRI) and relative integrated discrimination improvement (IDIrel) were used to quantify the predictive performance of genetic information beyond Framingham risk score variables. In contrast to GRSMetabo and PLMetabo, GRSGola significantly improved the prediction (delta C-index [95% confidence interval]: 0.0087 [0.0044, 0.0130]; IDIrel: 0.0509 [0.0131, 0.0894]; cfNRI improved only in cases: 0.1761 [0.0253, 0.3219]). GRSKhera yielded slightly worse prediction results than GRSGola.
Affiliation(s)
- Alina Bauer
- Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Astrid Zierer
- Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Christian Gieger
- Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany.,German Center for Diabetes Research (DZD), Partner München-Neuherberg, München-Neuherberg, Germany.,Research Unit of Molecular Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Mustafa Büyüközkan
- Institute of Computational Biology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany.,Institute for Computational Biomedicine, Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, New York, USA
- Martina Müller-Nurasyid
- Institute of Genetic Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany.,Chair of Genetic Epidemiology, IBE, Faculty of Medicine, LMU, Munich, Germany.,Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center, Johannes Gutenberg University, Mainz, Germany.,Department of Internal Medicine I (Cardiology), Hospital of the Ludwig-Maximilians-University (LMU) Munich, Munich, Germany
- Harald Grallert
- Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany.,German Center for Diabetes Research (DZD), Partner München-Neuherberg, München-Neuherberg, Germany.,Research Unit of Molecular Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Christa Meisinger
- German Center for Diabetes Research (DZD), Partner München-Neuherberg, München-Neuherberg, Germany.,Chair of Epidemiology, LMU Munich, UNIKA-T Augsburg, Augsburg, Germany.,Independent Research Group Clinical Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Konstantin Strauch
- Institute of Genetic Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany.,Chair of Genetic Epidemiology, IBE, Faculty of Medicine, LMU, Munich, Germany.,Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center, Johannes Gutenberg University, Mainz, Germany
- Holger Prokisch
- Institute of Human Genetics, School of Medicine, Technische Universität München, München, Germany.,Institute of Neurogenomics, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany
- Michael Roden
- Department of Endocrinology and Diabetology, Medical Faculty and University Hospital Düsseldorf, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany.,Institute for Clinical Diabetology, German Diabetes Center, Leibniz Center for Diabetes Research at Heinrich Heine University Düsseldorf, Düsseldorf, Germany.,German Center for Diabetes Research (DZD), Partner Düsseldorf, München-Neuherberg, Germany
- Annette Peters
- Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany.,German Center for Diabetes Research (DZD), Partner München-Neuherberg, München-Neuherberg, Germany.,Institute of Epidemiology and Medical Biometry, University of Ulm, Ulm, Germany
- Jan Krumsiek
- Institute of Computational Biology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany.,Institute for Computational Biomedicine, Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, New York, USA
- Christian Herder
- Department of Endocrinology and Diabetology, Medical Faculty and University Hospital Düsseldorf, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany.,Institute for Clinical Diabetology, German Diabetes Center, Leibniz Center for Diabetes Research at Heinrich Heine University Düsseldorf, Düsseldorf, Germany.,German Center for Diabetes Research (DZD), Partner Düsseldorf, München-Neuherberg, Germany
- Wolfgang Koenig
- Institute of Epidemiology and Medical Biometry, University of Ulm, Ulm, Germany.,Deutsches Herzzentrum München, Technische Universität München, Munich, Germany.,German Centre for Cardiovascular Research (DZHK), partner site Munich Heart Alliance, Munich, Germany
- Barbara Thorand
- Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany.,German Center for Diabetes Research (DZD), Partner München-Neuherberg, München-Neuherberg, Germany
- Cornelia Huth
- Institute of Epidemiology, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany.,German Center for Diabetes Research (DZD), Partner München-Neuherberg, München-Neuherberg, Germany
|
30
|
Phenotypical predictors of pregnancy-related restless legs syndrome and their association with basal ganglia and the limbic circuits. Sci Rep 2021; 11:9996. [PMID: 33976261 PMCID: PMC8113250 DOI: 10.1038/s41598-021-89360-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/22/2020] [Accepted: 04/23/2021] [Indexed: 11/21/2022]
Abstract
Restless legs syndrome (RLS) in pregnancy is a common disorder with a multifactorial etiology. A neurological and obstetrical cohort of 308 postpartum women was screened for RLS within 1 to 6 days of childbirth and 12 weeks postpartum. Of the 308 young mothers, 57 (prevalence rate 19%) were identified as having been affected by RLS symptoms in the recently completed pregnancy. Structural and functional MRI was obtained from 25 of these 57 participants. A multivariate two-window algorithm was employed to systematically chart the relationship between brain structures and phenotypical predictors of RLS. A decreased volume of the parietal, orbitofrontal and frontal areas shortly after delivery was found to be linked to persistent RLS symptoms up to 12 weeks postpartum, the symptoms' severity and intensity in the most recent pregnancy, and a history of RLS in previous pregnancies. The same negative relationship was observed between brain volume and not being married, not receiving any iron supplement and higher numbers of stressful life events. High cortisol levels, being married and receiving iron supplements, on the other hand, were found to be associated with increased volumes in the bilateral striatum. Investigating RLS symptoms in pregnancy within a brain-phenotype framework may help shed light on the heterogeneity of the condition.
|
31
|
Varma M, Paskov KM, Chrisman BS, Sun MW, Jung JY, Stockham NT, Washington PY, Wall DP. A maximum flow-based network approach for identification of stable noncoding biomarkers associated with the multigenic neurological condition, autism. BioData Min 2021; 14:28. [PMID: 33941233 PMCID: PMC8091705 DOI: 10.1186/s13040-021-00262-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2020] [Accepted: 04/20/2021] [Indexed: 12/05/2022]
Abstract
Background Machine learning approaches for predicting disease risk from high-dimensional whole genome sequence (WGS) data often result in unstable models that can be difficult to interpret, limiting the identification of putative sets of biomarkers. Here, we design and validate a graph-based methodology based on maximum flow, which leverages the presence of linkage disequilibrium (LD) to identify stable sets of variants associated with complex multigenic disorders. Results We apply our method to a previously published logistic regression model trained to identify variants in simple repeat sequences associated with autism spectrum disorder (ASD); this L1-regularized model exhibits high predictive accuracy yet demonstrates great variability in the features selected from over 230,000 possible variants. In order to improve model stability, we extract the variants assigned non-zero weights in each of 5 cross-validation folds and then assemble the five sets of features into a flow network subject to LD constraints. The maximum flow formulation allowed us to identify 55 variants, which we show to be more stable than the features identified by the original classifier. Conclusion Our method allows for the creation of machine learning models that can identify predictive variants. Our results help pave the way towards biomarker-based diagnosis methods for complex genetic disorders. Supplementary Information The online version contains supplementary material available at 10.1186/s13040-021-00262-x.
Affiliation(s)
- Maya Varma
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Kelley M Paskov
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
- Min Woo Sun
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
- Jae-Yoon Jung
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.,Department of Pediatrics, Stanford University, Stanford, CA, USA
- Nate T Stockham
- Department of Neuroscience, Stanford University, Stanford, CA, USA
- Dennis P Wall
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.,Department of Pediatrics, Stanford University, Stanford, CA, USA
|
32
|
Prediction of atherosclerosis diseases using biosensor-assisted deep learning artificial neuron model. Neural Comput Appl 2021. [DOI: 10.1007/s00521-020-05317-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/25/2023]
|
33
|
Tozzo V, Azencott CA, Fiorini S, Fava E, Trucco A, Barla A. Where Do We Stand in Regularization for Life Science Studies? J Comput Biol 2021; 29:213-232. [PMID: 33926217 PMCID: PMC8968832 DOI: 10.1089/cmb.2019.0371] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
Abstract
More and more biologists and bioinformaticians turn to machine learning to analyze large amounts of data. In this context, it is crucial to understand which is the most suitable data analysis pipeline for achieving reliable results. This process may be challenging, due to a variety of factors, the most crucial ones being the data type and the general goal of the analysis (e.g., explorative or predictive). Life science data sets require further consideration as they often contain measures with a low signal-to-noise ratio, high-dimensional observations, and relatively few samples. In this complex setting, regularization, which can be defined as the introduction of additional information to solve an ill-posed problem, is the tool of choice to obtain robust models. Different regularization practices may be used depending both on characteristics of the data and of the question asked, and different choices may lead to different results. In this article, we provide a comprehensive description of the impact and importance of regularization techniques in life science studies. In particular, we provide an intuition of what regularization is and of the different ways it can be implemented and exploited. We propose four general life sciences problems in which regularization is fundamental and should be exploited for robustness. For each of these large families of problems, we enumerate different techniques as well as examples and case studies. Lastly, we provide a unified view of how to approach each data type with various regularization techniques.
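As a rough illustration of the definition of regularization given in this abstract (the notation is assumed here and is not taken from the article), a regularized linear model is commonly written as a penalized least-squares problem, where X is the sample-by-feature matrix, y the response, β the coefficient vector and λ ≥ 0 the regularization strength:

```latex
\hat{\beta} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \, P(\beta),
\qquad
P(\beta) = \lVert \beta \rVert_1 \ \text{(lasso)}
\quad \text{or} \quad
P(\beta) = \lVert \beta \rVert_2^2 \ \text{(ridge)}
```

The penalty P plays the role of the "additional information": a larger λ shrinks the coefficients more strongly, and the ℓ1 penalty can set some of them exactly to zero, which is why it is often used when features greatly outnumber samples.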
Affiliation(s)
- Veronica Tozzo
- Department of Informatics, Bioengineering, Robotics and System Engineering-DIBRIS, University of Genoa, Genoa, Italy
- Chloé-Agathe Azencott
- Centre for Computational Biology-CBIO, MINES ParisTech, PSL Research University, Paris, France.,Institut Curie, PSL Research University, Paris, France.,INSERM, U900, Paris, France
- Emanuele Fava
- Department of Electrical, Electronic, Telecommunications Engineering, and Naval Architecture (DITEN), University of Genoa, Genoa, Italy
- Andrea Trucco
- Department of Electrical, Electronic, Telecommunications Engineering, and Naval Architecture (DITEN), University of Genoa, Genoa, Italy
- Annalisa Barla
- Department of Informatics, Bioengineering, Robotics and System Engineering-DIBRIS, University of Genoa, Genoa, Italy
|
34
|
Manavalan R, Priya S. Genetic interactions effects for cancer disease identification using computational models: a review. Med Biol Eng Comput 2021; 59:733-758. [PMID: 33839998 DOI: 10.1007/s11517-021-02343-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/20/2020] [Accepted: 03/10/2021] [Indexed: 11/29/2022]
Abstract
Genome-wide association studies (GWAS) provide clear insight into the genetic variations and environmental influences responsible for various human diseases. Cancer identification through genetic interactions (epistasis) is one of the significant ongoing research areas in GWAS. The growth of cancer cells arises from multi-locus as well as complex genetic interactions. It is impractical for a physician to detect cancer via manual examination of SNP interactions. Because of its importance, several computational approaches have been developed to infer epistatic effects. This article provides a comprehensive and multifaceted review of all relevant genetic studies published between 2001 and 2020. Various computational methods, namely multifactor dimensionality reduction-based approaches, statistical strategies, machine learning, and optimization-based techniques, are carefully reviewed and presented with their evaluation results. Moreover, the strengths and limitations of these computational approaches are described. The issues behind the computational methods for identifying cancer through genetic interactions and the various evaluation parameters used by researchers are analyzed. This review should help researchers and medical professionals learn techniques for discovering epistasis and design novel automatic epistasis detection systems, with strong robustness and maximum efficiency, that address different research problems and find practical solutions effectively.
Affiliation(s)
- R Manavalan
- Department of Computer Science, Arignar Anna Government Arts College, Villupuram, Tamil Nadu, 605602, India.
- S Priya
- Computer Science, Arignar Anna Government Arts College, Villupuram, Tamil Nadu, India
|
35
|
Bracher-Smith M, Crawford K, Escott-Price V. Machine learning for genetic prediction of psychiatric disorders: a systematic review. Mol Psychiatry 2021; 26:70-79. [PMID: 32591634 PMCID: PMC7610853 DOI: 10.1038/s41380-020-0825-2] [Citation(s) in RCA: 50] [Impact Index Per Article: 16.7] [Received: 05/15/2020] [Revised: 06/09/2020] [Accepted: 06/16/2020] [Indexed: 12/25/2022]
Abstract
Machine learning methods have been employed to make predictions in psychiatry from genotypes, with the potential to bring improved prediction of outcomes in psychiatric genetics; however, their current performance is unclear. We aim to systematically review machine learning methods for predicting psychiatric disorders from genetics alone and evaluate their discrimination, bias and implementation. Medline, PsycInfo, Web of Science and Scopus were searched for terms relating to genetics, psychiatric disorders and machine learning, including neural networks, random forests, support vector machines and boosting, on 10 September 2019. Following PRISMA guidelines, articles were screened for inclusion independently by two authors, extracted, and assessed for risk of bias. Overall, 63 full texts were assessed from a pool of 652 abstracts. Data were extracted for 77 models of schizophrenia, bipolar, autism or anorexia across 13 studies. Performance of machine learning methods was highly varied (0.48-0.95 AUC) and differed between schizophrenia (0.54-0.95 AUC), bipolar (0.48-0.65 AUC), autism (0.52-0.81 AUC) and anorexia (0.62-0.69 AUC). This is likely due to the high risk of bias identified in the study designs and analysis for reported results. Choices for predictor selection, hyperparameter search and validation methodology, and viewing of the test set during training were common causes of high risk of bias in analysis. Key steps in model development and validation were frequently not performed or unreported. Comparison of discrimination across studies was constrained by heterogeneity of predictors, outcome and measurement, in addition to sample overlap within and across studies. Given widespread high risk of bias and the small number of studies identified, it is important to ensure established analysis methods are adopted. We emphasise best practices in methodology and reporting for improving future studies.
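The AUC values quoted in this abstract can be read as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control, so 0.5 corresponds to chance. A minimal sketch of that pairwise definition follows; the function name, scores and labels are hypothetical and are not drawn from any of the reviewed studies.

```python
def roc_auc(scores, labels):
    """Pairwise AUC: fraction of (case, control) pairs ranked correctly.

    labels use 1 for cases and 0 for controls; tied scores count as half.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    correct = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                correct += 1.0
            elif c == k:
                correct += 0.5
    return correct / (len(cases) * len(controls))


# Hypothetical predicted risks for two cases and two controls.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(roc_auc(scores, labels))  # 3 of 4 case-control pairs are ordered correctly: 0.75
```

The double loop is quadratic in the number of samples, which is fine for illustration; rank-based formulas give the same value more efficiently on large cohorts.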
Affiliation(s)
- Matthew Bracher-Smith
- MRC Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, School of Medicine, Cardiff University, Cardiff, UK
- Karen Crawford
- MRC Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, School of Medicine, Cardiff University, Cardiff, UK
- Dementia Research Institute, School of Medicine, Cardiff University, Cardiff, UK
- Valentina Escott-Price
- MRC Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, School of Medicine, Cardiff University, Cardiff, UK.
- Dementia Research Institute, School of Medicine, Cardiff University, Cardiff, UK.
|
36
|
Zou K, Kim KS, Kim K, Kang D, Park YH, Sun H, Ha BK, Ha J, Jun TH. Genetic Diversity and Genome-Wide Association Study of Seed Aspect Ratio Using a High-Density SNP Array in Peanut (Arachis hypogaea L.). Genes (Basel) 2020; 12:E2. [PMID: 33375051 PMCID: PMC7822046 DOI: 10.3390/genes12010002] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 08/22/2020] [Revised: 12/09/2020] [Accepted: 12/17/2020] [Indexed: 12/12/2022]
Abstract
Peanut (Arachis hypogaea L.) is one of the important oil crops of the world. In this study, we aimed to evaluate the genetic diversity of 384 peanut germplasms including 100 Korean germplasms and 284 core collections from the United States Department of Agriculture (USDA) using an Axiom_Arachis array with 58K single-nucleotide polymorphisms (SNPs). We evaluated the evolutionary relationships among 384 peanut germplasms using a genome-wide association study (GWAS) of seed aspect ratio data processed by ImageJ software. In total, 14,030 filtered polymorphic SNPs were identified from the peanut 58K SNP array. We identified five SNPs with significant associations to seed aspect ratio on chromosomes Aradu.A09, Aradu.A10, Araip.B08, and Araip.B09. AX-177640219 on chromosome Araip.B08 was the most significantly associated marker in GAPIT and Regularization method. Phosphoenolpyruvate carboxylase (PEPC) was found among the eleven genes within a linkage disequilibrium (LD) of the significant SNPs on Araip.B08 and could have a strong causal effect in determining seed aspect ratio. The results of the present study provide information and methods that are useful for further genetic and genomic studies as well as molecular breeding programs in peanuts.
Affiliation(s)
- Kunyan Zou
- Department of Plant Bioscience, Pusan National University, Miryang 50463, Korea; (K.Z.); (D.K.); (Y.-H.P.)
- Kipoong Kim
- Department of Statistics, Pusan National University, Busan 46241, Korea; (K.K.); (H.S.)
- Dongwoo Kang
- Department of Plant Bioscience, Pusan National University, Miryang 50463, Korea; (K.Z.); (D.K.); (Y.-H.P.)
- Yu-Hyeon Park
- Department of Plant Bioscience, Pusan National University, Miryang 50463, Korea; (K.Z.); (D.K.); (Y.-H.P.)
- Hokeun Sun
- Department of Statistics, Pusan National University, Busan 46241, Korea; (K.K.); (H.S.)
- Bo-Keun Ha
- Department of Applied Plant Science, Chonnam National University, Gwangju 61186, Korea;
- Jungmin Ha
- Department of Plant Science, Gangneung-Wonju National University, Gangneung 25457, Korea;
- Tae-Hwan Jun
- Department of Plant Bioscience, Pusan National University, Miryang 50463, Korea; (K.Z.); (D.K.); (Y.-H.P.)
- Life and Industry Convergence Research Institute, Pusan National University, Miryang 50463, Korea
|
37
|
|
38
|
Kang J, Coates JT, Strawderman RL, Rosenstein BS, Kerns SL. Genomics models in radiotherapy: From mechanistic to machine learning. Med Phys 2020; 47:e203-e217. [PMID: 32418335 PMCID: PMC8725063 DOI: 10.1002/mp.13751] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/17/2019] [Revised: 06/28/2019] [Accepted: 07/17/2019] [Indexed: 12/28/2022]
Abstract
Machine learning (ML) provides a broad framework for addressing high-dimensional prediction problems in classification and regression. While ML is often applied for imaging problems in medical physics, there are many efforts to apply these principles to biological data toward questions of radiation biology. Here, we provide a review of radiogenomics modeling frameworks and efforts toward genomically guided radiotherapy. We first discuss medical oncology efforts to develop precision biomarkers. We next discuss similar efforts to create clinical assays for normal tissue or tumor radiosensitivity. We then discuss modeling frameworks for radiosensitivity and the evolution of ML to create predictive models for radiogenomics.
Affiliation(s)
- John Kang
- Department of Radiation Oncology, University of Rochester Medical Center, Rochester, NY 14642, USA
- James T. Coates
- CRUK/MRC Oxford Institute for Radiation Oncology, University of Oxford, Oxford OX3 7DQ, UK
- Robert L. Strawderman
- Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY 14642, USA
- Barry S. Rosenstein
- Department of Radiation Oncology and the Department of Genetics and Genomic Sciences, Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Sarah L. Kerns
- Department of Radiation Oncology, University of Rochester Medical Center, Rochester, NY 14642, USA
|
39
|
Padilla-Martínez F, Collin F, Kwasniewski M, Kretowski A. Systematic Review of Polygenic Risk Scores for Type 1 and Type 2 Diabetes. Int J Mol Sci 2020; 21:E1703. [PMID: 32131491 PMCID: PMC7084489 DOI: 10.3390/ijms21051703] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 02/03/2020] [Revised: 02/28/2020] [Accepted: 02/28/2020] [Indexed: 02/07/2023]
Abstract
Recent studies have led to considerable advances in the identification of genetic variants associated with type 1 and type 2 diabetes. One approach for converting genetic data into a predictive measure of disease susceptibility is to add the risk effects of loci into a polygenic risk score. In order to summarize the recent findings, we conducted a systematic review of studies comparing the accuracy of polygenic risk scores developed during the last two decades. We selected 15 risk scores from three databases (Scopus, Web of Science and PubMed) for inclusion in this systematic review. We identified three polygenic risk scores that discriminate between type 1 diabetes patients and healthy people, one that discriminates between type 1 and type 2 diabetes, two that discriminate between type 1 and monogenic diabetes and nine polygenic risk scores that discriminate between type 2 diabetes patients and healthy people. Prediction accuracy of polygenic risk scores was assessed by comparing the area under the curve. The actual benefits, potential obstacles and possible solutions for the implementation of polygenic risk scores in clinical practice were also discussed. Developing strategies to establish the clinical validity of polygenic risk scores, by creating a framework for the interpretation of findings and their translation into actual evidence, is the way to demonstrate their utility in medical practice.
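The additive construction of a polygenic risk score mentioned in this abstract can be sketched as a weighted sum over loci. This is only an illustration under stated assumptions: genotypes are coded as risk-allele counts (0, 1 or 2), the weights stand in for per-locus effect sizes such as log odds ratios, and the function name and all numbers are hypothetical rather than taken from any reviewed score.

```python
def polygenic_risk_score(allele_counts, weights):
    """Return the weighted sum of risk-allele counts across loci."""
    if len(allele_counts) != len(weights):
        raise ValueError("need exactly one weight per locus")
    if any(count not in (0, 1, 2) for count in allele_counts):
        raise ValueError("allele counts must be 0, 1 or 2")
    return sum(count * weight for count, weight in zip(allele_counts, weights))


# Hypothetical individual genotyped at three loci.
counts = [0, 1, 2]
weights = [0.10, 0.25, 0.05]  # illustrative effect sizes, not real estimates
print(round(polygenic_risk_score(counts, weights), 2))  # 0*0.10 + 1*0.25 + 2*0.05 = 0.35
```

Setting every weight to 1 gives an unweighted score that simply counts risk alleles; how the weights are estimated and which loci are included is what distinguishes the published scores from one another.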
Affiliation(s)
- Felipe Padilla-Martínez
- Centre for Bioinformatics and Data Analysis, Medical University of Bialystok, 15-276 Bialystok, Poland; (F.C.); (M.K.)
- Clinical Research Centre, Medical University of Bialystok, 15-276 Bialystok, Poland;
- Francois Collin
- Centre for Bioinformatics and Data Analysis, Medical University of Bialystok, 15-276 Bialystok, Poland; (F.C.); (M.K.)
- Miroslaw Kwasniewski
- Centre for Bioinformatics and Data Analysis, Medical University of Bialystok, 15-276 Bialystok, Poland; (F.C.); (M.K.)
- Adam Kretowski
- Clinical Research Centre, Medical University of Bialystok, 15-276 Bialystok, Poland;
- Department of Endocrinology, Diabetology and Internal Medicine, Medical University of Bialystok, 15-276 Bialystok, Poland
|
40
|
Waldmann P, Pfeiffer C, Mészáros G. Sparse Convolutional Neural Networks for Genome-Wide Prediction. Front Genet 2020; 11:25. [PMID: 32117441 PMCID: PMC7029737 DOI: 10.3389/fgene.2020.00025] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 09/26/2019] [Accepted: 01/08/2020] [Indexed: 12/03/2022]
Abstract
Genome-wide prediction (GWP) has become the state-of-the-art method in artificial selection. Data sets often comprise numbers of genomic markers and individuals ranging from a few thousand to millions. Hence, computational efficiency is important and various machine learning methods have successfully been used in GWP. Neural networks (NN) and deep learning (DL) are very flexible methods that usually show outstanding prediction properties on complex structured data, but their use in GWP is nevertheless rare and debated. This study describes a powerful NN method for genomic marker data that can easily be extended. It is shown that a one-dimensional convolutional neural network (CNN) can be used to incorporate the ordinal information between markers and, together with pooling and ℓ1-norm regularization, provides a sparse and computationally efficient approach for GWP. The method, denoted CNNGWP, is implemented in the deep learning software Keras, and hyper-parameters of the NN are tuned with Bayesian optimization. Model averaged ensemble predictions further reduce prediction error. Evaluations show that CNNGWP improves prediction error by more than 25% on simulated data and around 3% on real pig data compared with results obtained with GBLUP and the LASSO. In conclusion, the CNNGWP provides a promising approach for GWP, but the magnitude of improvement depends on the genetic architecture and the heritability.
Affiliation(s)
- Patrik Waldmann
- Department of Animal Breeding and Genetics, The Swedish University of Agricultural Sciences, Uppsala, Sweden
- Christina Pfeiffer
- Division of Livestock Science, University of Natural Resources and Life Sciences Vienna (BOKU), Vienna, Austria
- Gábor Mészáros
- Division of Livestock Science, University of Natural Resources and Life Sciences Vienna (BOKU), Vienna, Austria
|
41
|
Lello L, Raben TG, Yong SY, Tellier LCAM, Hsu SDH. Genomic Prediction of 16 Complex Disease Risks Including Heart Attack, Diabetes, Breast and Prostate Cancer. Sci Rep 2019; 9:15286. [PMID: 31653892 PMCID: PMC6814833 DOI: 10.1038/s41598-019-51258-x] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 04/23/2019] [Accepted: 09/26/2019] [Indexed: 01/09/2023]
Abstract
We construct risk predictors using polygenic scores (PGS) computed from common Single Nucleotide Polymorphisms (SNPs) for a number of complex disease conditions, using L1-penalized regression (also known as LASSO) on case-control data from UK Biobank. Among the disease conditions studied are Hypothyroidism, (Resistant) Hypertension, Type 1 and 2 Diabetes, Breast Cancer, Prostate Cancer, Testicular Cancer, Gallstones, Glaucoma, Gout, Atrial Fibrillation, High Cholesterol, Asthma, Basal Cell Carcinoma, Malignant Melanoma, and Heart Attack. We obtain values for the area under the receiver operating characteristic curves (AUC) in the range ~0.58-0.71 using SNP data alone. Substantially higher predictor AUCs are obtained when incorporating additional variables such as age and sex. Some SNP predictors alone are sufficient to identify outliers (e.g., in the 99th percentile of polygenic score, or PGS) with 3-8 times higher risk than typical individuals. We validate predictors out-of-sample using the eMERGE dataset, and also with different ancestry subgroups within the UK Biobank population. Our results indicate that substantial improvements in predictive power are attainable using training sets with larger case populations. We anticipate rapid improvement in genomic prediction as more case-control data become available for analysis.
Affiliation(s)
- Louis Lello
- Department of Physics and Astronomy, Michigan State University, East Lansing, Michigan, USA.
- Timothy G Raben
- Department of Physics and Astronomy, Michigan State University, East Lansing, Michigan, USA.
- Soke Yuen Yong
- Department of Physics and Astronomy, Michigan State University, East Lansing, Michigan, USA.
- Laurent C A M Tellier
- Genomic Prediction, North Brunswick, NJ, USA.
- Cognitive Genomics Laboratory, Shenzhen Key Laboratory of Neurogenomics, China National GeneBank, BGI-Shenzhen, Shenzhen, China.
- Stephen D H Hsu
- Department of Physics and Astronomy, Michigan State University, East Lansing, Michigan, USA.
- Genomic Prediction, North Brunswick, NJ, USA.
- Cognitive Genomics Laboratory, Shenzhen Key Laboratory of Neurogenomics, China National GeneBank, BGI-Shenzhen, Shenzhen, China.
42
Grinberg NF, Orhobor OI, King RD. An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat. Mach Learn 2019; 109:251-277. [PMID: 32174648 PMCID: PMC7048706 DOI: 10.1007/s10994-019-05848-5] [Citation(s) in RCA: 47] [Impact Index Per Article: 9.4] [Received: 10/28/2015] [Revised: 09/17/2019] [Accepted: 09/19/2019] [Indexed: 11/01/2022]
Abstract
In phenotype prediction the physical characteristics of an organism are predicted from knowledge of its genotype and environment. Such studies, often called genome-wide association studies, are of the highest societal importance, as they are of central importance to medicine, crop-breeding, etc. We investigated three phenotype prediction problems: one simple and clean (yeast), and the other two complex and real-world (rice and wheat). We compared standard machine learning methods; elastic net, ridge regression, lasso regression, random forest, gradient boosting machines (GBM), and support vector machines (SVM), with two state-of-the-art classical statistical genetics methods; genomic BLUP and a two-step sequential method based on linear regression. Additionally, using the clean yeast data, we investigated how performance varied with the complexity of the biological mechanism, the amount of observational noise, the number of examples, the amount of missing data, and the use of different data representations. We found that for almost all the phenotypes considered, standard machine learning methods outperformed the methods from classical statistical genetics. On the yeast problem, the most successful method was GBM, followed by lasso regression, and the two statistical genetics methods; with greater mechanistic complexity GBM was best, while in simpler cases lasso was superior. In the wheat and rice studies the best two methods were SVM and BLUP. The most robust method in the presence of noise, missing data, etc. was random forests. The classical statistical genetics method of genomic BLUP was found to perform well on problems where there was population structure. This suggests that standard machine learning methods need to be refined to include population structure information when this is present. 
We conclude that the application of machine learning methods to phenotype prediction problems holds great promise, but that determining which method is likely to perform well on any given problem is elusive and non-trivial.
Affiliation(s)
- Nastasiya F. Grinberg
- School of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PL UK
- Present Address: Department of Medicine, Cambridge Institute of Therapeutic Immunology & Infectious Disease, Jeffrey Cheah Biomedical Centre, Cambridge Biomedical Campus, University of Cambridge, Cambridge, CB2 0AW UK
- Ross D. King
- Department of Biology and Biological Engineering, Division of Systems and Synthetic Biology, Chalmers University of Technology, Kemivägen 10, SE-412 96 Gothenburg, Sweden
43
Abd El Hamid MM, Mabrouk MS, Omar YMK. DEVELOPING AN EARLY PREDICTIVE SYSTEM FOR IDENTIFYING GENETIC BIOMARKERS ASSOCIATED TO ALZHEIMER’S DISEASE USING MACHINE LEARNING TECHNIQUES. BIOMEDICAL ENGINEERING: APPLICATIONS, BASIS AND COMMUNICATIONS 2019. [DOI: 10.4015/s1016237219500406] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 11/07/2022]
Abstract
Alzheimer’s disease (AD) is an irreversible, progressive disorder that attacks the nerve cells of the brain. It is the most common form of dementia among older adults. Apolipoprotein E (APOE) is one of the most common genetic risk factors for AD, and its significant association with AD has been observed in various genome-wide association studies (GWAS). Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation among individuals and are related to many common diseases, including AD. SNPs are recognized as significant biomarkers for this disease; they help in understanding and detecting the disease in its early stages. Detecting SNP biomarkers associated with the disease with high classification accuracy leads to early prediction and diagnosis. Machine learning techniques are utilized to discover new biomarkers of the disease. The sequential minimal optimization (SMO) algorithm with different kernels, Naive Bayes (NB), tree augmented Naive Bayes (TAN) and the K2 learning algorithm were applied to all genetic data of the Alzheimer’s Disease Neuroimaging Initiative phase 1 (ADNI-1) and whole genome sequencing (WGS) datasets. The highest classification accuracy was achieved using 500 SNPs based on the [Formula: see text]-value threshold ([Formula: see text]-value [Formula: see text]). In the whole-genome approach on ADNI-1, the NB and K2 learning algorithms scored an overall accuracy of 98% and 98.40%, respectively. In the whole-genome approach on WGS, the NB and K2 learning algorithms scored an overall accuracy of 99.63% and 99.75%, respectively.
Affiliation(s)
- Mai S. Mabrouk
- Biomedical Engineering Department, Misr University for Science and Technology (MUST), Egypt
- Yasser M. K. Omar
- College of Computing and Information Technology AASTMT, Cairo Branch, Egypt
44
Analysis of polygenic risk score usage and performance in diverse human populations. Nat Commun 2019; 10:3328. [PMID: 31346163 PMCID: PMC6658471 DOI: 10.1038/s41467-019-11112-0] [Citation(s) in RCA: 543] [Impact Index Per Article: 108.6] [Received: 10/30/2018] [Accepted: 06/18/2019] [Indexed: 12/11/2022]
Abstract
A historical tendency to use European ancestry samples hinders medical genetics research, including the use of polygenic scores, which are individual-level metrics of genetic risk. We analyze the first decade of polygenic scoring studies (2008–2017, inclusive), and find that 67% of studies included exclusively European ancestry participants and another 19% included only East Asian ancestry participants. Only 3.8% of studies were among cohorts of African, Hispanic, or Indigenous peoples. We find that predictive performance of European ancestry-derived polygenic scores is lower in non-European ancestry samples (e.g. African ancestry samples: t = −5.97, df = 24, p = 3.7 × 10⁻⁶), and we demonstrate the effects of methodological choices in polygenic score distributions for worldwide populations. These findings highlight the need for improved treatment of linkage disequilibrium and variant frequencies when applying polygenic scoring to cohorts of non-European ancestry, and bolster the rationale for large-scale GWAS in diverse human populations. Predominant participation of European-ancestry individuals in genetic studies has hindered the better understanding of genetic risk in non-European ancestry individuals. Here, Duncan et al. quantify polygenic risk score use and performance in worldwide populations.
45
Romagnoni A, Jégou S, Van Steen K, Wainrib G, Hugot JP. Comparative performances of machine learning methods for classifying Crohn Disease patients using genome-wide genotyping data. Sci Rep 2019; 9:10351. [PMID: 31316157 PMCID: PMC6637191 DOI: 10.1038/s41598-019-46649-z] [Citation(s) in RCA: 57] [Impact Index Per Article: 11.4] [Received: 03/11/2019] [Accepted: 07/03/2019] [Indexed: 02/08/2023]
Abstract
Crohn Disease (CD) is a complex genetic disorder for which more than 140 genes have been identified using genome wide association studies (GWAS). However, the genetic architecture of the trait remains largely unknown. The recent development of machine learning (ML) approaches prompted us to apply them to classify healthy and diseased people according to their genomic information. The Immunochip dataset containing 18,227 CD patients and 34,050 healthy controls enrolled and genotyped by the international Inflammatory Bowel Disease genetic consortium (IIBDGC) has been re-analyzed using a set of ML methods: penalized logistic regression (LR), gradient boosted trees (GBT) and artificial neural networks (NN). The main score used to compare the methods was the Area Under the ROC Curve (AUC) statistic. The impact of quality control (QC), imputation and coding methods on LR results showed that QC methods and imputation of missing genotypes may artificially increase the scores. In contrast, neither the patient/control ratio nor marker preselection or coding strategies significantly affected the results. LR methods, including Lasso, Ridge and ElasticNet, provided similar results with a maximum AUC of 0.80. GBT methods like XGBoost, LightGBM and CatBoost, together with dense NN with one or more hidden layers, provided similar AUC values, suggesting limited epistatic effects in the genetic architecture of the trait. ML methods detected nearly all the genetic variants previously identified by GWAS among the best predictors, plus additional predictors with lower effects. The robustness and complementarity of the different methods are also studied. Compared to LR, non-linear models such as GBT or NN may provide robust complementary approaches to identify and classify genetic markers.
Affiliation(s)
- Alberto Romagnoni
- Centre de recherche sur l'inflammation UMR 1149, Inserm - Université Paris Diderot, 75018, Paris, France; Data Team, Département d'informatique de l'ENS, École normale supérieure, CNRS, PSL Research University, 75005, Paris, France
- Kristel Van Steen
- WELBIO, GIGA-R Medical Genomics - BIO3, University of Liège, Liège, Belgium; Department of Human Genetics, University of Leuven, Leuven, Belgium
- Gilles Wainrib
- Data Team, Département d'informatique de l'ENS, École normale supérieure, CNRS, PSL Research University, 75005, Paris, France; Owkin, 75011, Paris, France
- Jean-Pierre Hugot
- Centre de recherche sur l'inflammation UMR 1149, Inserm - Université Paris Diderot, 75018, Paris, France; Hôpital Robert Debré, Assistance Publique-Hôpitaux de Paris, 75019, Paris, France
46
Stephenson M, Darlington GA, Schenkel FS, Squires EJ, Ali RA. DSRIG: Incorporating graphical structure in the regularized modeling of SNP data. J Bioinform Comput Biol 2019; 17:1950017. [PMID: 31288640 DOI: 10.1142/s0219720019500173] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/18/2022]
Abstract
Genetic selection of farm animals plays an important role in genetic improvement programs. Regularized regression methods on single nucleotide polymorphism (SNP) data from a set of candidate genes can help to identify genes that are associated with the trait of interest. This complex task must also consider the relative effect sizes on the desired trait and account for the relationships among the candidate SNPs so that selection of a SNP does not promote other undesirable traits through breeding. We present the Doubly Sparse Regression Incorporating Graphical structure (DSRIG), a novel regularized method for genetic selection that exploits the relationships among candidate SNPs to improve prediction. DSRIG was applied in the prediction of skatole and androstenone levels, two compounds known to be associated with boar taint. DSRIG was shown to provide a predictive benefit when compared to ordinary least squares (OLS) and the least absolute shrinkage and selection operator (LASSO) in a cross-validation procedure. The relative sizes of the coefficient estimates over the cross-validation procedure were compared to determine which SNPs may have the greatest impact on expression of the boar taint compounds and a consensus graph was used to infer the relationships among SNPs.
Affiliation(s)
- Matthew Stephenson
- Department of Mathematics & Statistics, University of Guelph, 50 Stone Road East, Guelph, Ontario N1G 2W1, Canada
- Gerarda A Darlington
- Department of Mathematics & Statistics, University of Guelph, 50 Stone Road East, Guelph, Ontario N1G 2W1, Canada
- Flavio S Schenkel
- Department of Animal Biosciences, University of Guelph, 50 Stone Road East, Guelph, Ontario N1G 2W1, Canada
- E James Squires
- Department of Animal Biosciences, University of Guelph, 50 Stone Road East, Guelph, Ontario N1G 2W1, Canada
- R Ayesha Ali
- Department of Mathematics & Statistics, University of Guelph, 50 Stone Road East, Guelph, Ontario N1G 2W1, Canada
47
Mwanga EP, Mapua SA, Siria DJ, Ngowo HS, Nangacha F, Mgando J, Baldini F, González Jiménez M, Ferguson HM, Wynne K, Selvaraj P, Babayan SA, Okumu FO. Using mid-infrared spectroscopy and supervised machine-learning to identify vertebrate blood meals in the malaria vector, Anopheles arabiensis. Malar J 2019; 18:187. [PMID: 31146762 PMCID: PMC6543689 DOI: 10.1186/s12936-019-2822-y] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 03/27/2019] [Accepted: 05/25/2019] [Indexed: 02/03/2023]
Abstract
BACKGROUND The propensity of different Anopheles mosquitoes to bite humans instead of other vertebrates influences their capacity to transmit pathogens to humans. Unfortunately, determining proportions of mosquitoes that have fed on humans, i.e. Human Blood Index (HBI), currently requires expensive and time-consuming laboratory procedures involving enzyme-linked immunosorbent assays (ELISA) or polymerase chain reactions (PCR). Here, mid-infrared (MIR) spectroscopy and supervised machine learning are used to accurately distinguish between vertebrate blood meals in guts of malaria mosquitoes, without any molecular techniques. METHODS Laboratory-reared Anopheles arabiensis females were fed on humans, chickens, goats or bovines, then held for 6 to 8 h, after which they were killed and preserved in silica. The sample size was 2000 mosquitoes (500 per host species). Five individuals of each host species were enrolled to ensure genotype variability, and 100 mosquitoes fed on each. Dried mosquito abdomens were individually scanned using an attenuated total reflection-Fourier transform infrared (ATR-FTIR) spectrometer to obtain high-resolution MIR spectra (4000 cm⁻¹ to 400 cm⁻¹). The spectral data were cleaned to compensate for atmospheric water and CO2 interference bands using Bruker-OPUS software, then transferred to Python for supervised machine learning to predict host species. Seven classification algorithms were trained using 90% of the spectra through several combinations of 75-25% data splits. The best performing model was used to predict identities of the remaining 10% validation spectra, which had not been used for model training or testing. RESULTS The logistic regression (LR) model achieved the highest accuracy, correctly predicting true vertebrate blood meal sources with overall accuracy of 98.4%. The model correctly identified 96% of goat blood meals, 97% of bovine blood meals, 100% of chicken blood meals and 100% of human blood meals. Three percent of bovine blood meals were misclassified as goat, and 2% of goat blood meals were misclassified as human. CONCLUSION Mid-infrared spectroscopy coupled with supervised machine learning can accurately identify multiple vertebrate blood meals in malaria vectors, thus potentially enabling rapid assessment of mosquito blood-feeding histories and vectorial capacities. The technique is cost-effective, fast, simple, and requires no reagents other than desiccants. However, scaling it up will require field validation of the findings and boosting relevant technical capacity in affected countries.
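The data-splitting scheme in the METHODS part can be sketched as follows: set aside 10% of the samples as a final validation set, then draw repeated 75-25% train/test splits from the remaining 90%. This is only an illustration under stated assumptions; the function name, the number of repeats and the seed are hypothetical, and the abstract does not say whether the splits were stratified by host species.

```python
import random


def holdout_then_repeated_splits(n_samples, holdout_frac=0.10, test_frac=0.25,
                                 n_repeats=5, seed=0):
    """Return (holdout_indices, [(train_indices, test_indices), ...])."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    # Final validation set: never seen during model training or testing.
    n_holdout = round(n_samples * holdout_frac)
    holdout, development = indices[:n_holdout], indices[n_holdout:]

    # Repeated random train/test splits of the remaining samples.
    splits = []
    for _ in range(n_repeats):
        shuffled = development[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_frac)
        splits.append((shuffled[n_test:], shuffled[:n_test]))
    return holdout, splits


# 2000 spectra, as in the abstract: 200 held out, 1350/450 per split.
holdout, splits = holdout_then_repeated_splits(2000)
for train, test in splits:
    print(len(train), len(test), len(holdout))
```

Keeping the 10% set outside every split is what lets its accuracy be read as an estimate on unseen spectra rather than as a model-selection score.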
Affiliation(s)
- Emmanuel P Mwanga
- Environmental Health and Ecological Science Thematic Group, Ifakara Health Institute, Morogoro, Tanzania.
- Salum A Mapua
- Environmental Health and Ecological Science Thematic Group, Ifakara Health Institute, Morogoro, Tanzania
- Doreen J Siria
- Environmental Health and Ecological Science Thematic Group, Ifakara Health Institute, Morogoro, Tanzania
- Halfan S Ngowo
- Environmental Health and Ecological Science Thematic Group, Ifakara Health Institute, Morogoro, Tanzania
- Institute of Biodiversity, Animal Health and Comparative Medicine, University of Glasgow, Glasgow, G12 8QQ, UK
- Francis Nangacha
- Environmental Health and Ecological Science Thematic Group, Ifakara Health Institute, Morogoro, Tanzania
- Joseph Mgando
- Environmental Health and Ecological Science Thematic Group, Ifakara Health Institute, Morogoro, Tanzania
- Francesco Baldini
- Institute of Biodiversity, Animal Health and Comparative Medicine, University of Glasgow, Glasgow, G12 8QQ, UK
- Heather M Ferguson
- Institute of Biodiversity, Animal Health and Comparative Medicine, University of Glasgow, Glasgow, G12 8QQ, UK
- Klaas Wynne
- School of Chemistry, University of Glasgow, Glasgow, G12 8QQ, UK
- Simon A Babayan
- Institute of Biodiversity, Animal Health and Comparative Medicine, University of Glasgow, Glasgow, G12 8QQ, UK
- Fredros O Okumu
- Environmental Health and Ecological Science Thematic Group, Ifakara Health Institute, Morogoro, Tanzania
- Institute of Biodiversity, Animal Health and Comparative Medicine, University of Glasgow, Glasgow, G12 8QQ, UK
- School of Public Health, University of Witwatersrand, Johannesburg, South Africa
48
Jackknife Model Averaging Prediction Methods for Complex Phenotypes with Gene Expression Levels by Integrating External Pathway Information. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2019; 2019:2807470. [PMID: 31089389 PMCID: PMC6476151 DOI: 10.1155/2019/2807470] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 01/13/2019] [Accepted: 03/20/2019] [Indexed: 01/03/2023]
Abstract
Motivation In the past few years many prediction approaches have been proposed and widely employed in high dimensional genetic data for disease risk evaluation. However, those approaches typically ignore in model fitting the important group structures that naturally exist in genetic data. Methods In the present study, we applied a novel model-averaging approach, called jackknife model averaging prediction (JMAP), for high dimensional genetic risk prediction while incorporating pathway information into the model specification. JMAP selects the optimal weights across candidate models by minimizing a cross validation criterion in a jackknife manner. Compared with previous approaches, one of the primary features of JMAP is to allow model weights to vary from 0 to 1 without the constraint that the weights sum to one. We evaluated the performance of JMAP using extensive simulation studies and compared it with existing methods. We finally applied JMAP to four real cancer datasets that are publicly available from TCGA. Results The simulations showed that, compared with other existing approaches (e.g., gsslasso), JMAP performed best or was among the best methods across a range of scenarios. For example, in 14 out of 16 simulation settings with PVE = 0.3, JMAP has on average 0.075 higher prediction accuracy than gsslasso. We further found that in the simulation, the model weights for the true candidate models are much less likely to be zero than those for the null candidate models and are substantially greater in magnitude. In the real data application, JMAP also performs comparably to or better than the other methods for continuous phenotypes. For example, for the COAD, CRC, and PAAD datasets, the average gains in predictive accuracy of JMAP are 0.019, 0.064, and 0.052 compared with gsslasso. Conclusion The proposed method JMAP is a novel model-averaging approach for high dimensional genetic risk prediction while incorporating external useful group structures into the model specification.
49
Waldmann P, Ferenčaković M, Mészáros G, Khayatzadeh N, Curik I, Sölkner J. AUTALASSO: an automatic adaptive LASSO for genome-wide prediction. BMC Bioinformatics 2019; 20:167. [PMID: 30940067 PMCID: PMC6444607 DOI: 10.1186/s12859-019-2743-3] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 08/01/2018] [Accepted: 03/18/2019] [Indexed: 01/30/2023]
Abstract
Background Genome-wide prediction (GWP) has become the method of choice in animal and plant breeding. Prediction of breeding values and phenotypes is routinely performed using large genomic data sets with the number of markers on the order of several thousands to millions. The number of evaluated individuals is usually smaller, which results in problems where model sparsity is of major concern. The LASSO technique has proven to be very well-suited for sparse problems, often providing excellent prediction accuracy. Several computationally efficient LASSO algorithms have been developed, but optimization of hyper-parameters can be demanding. Results We have developed a novel automatic adaptive LASSO (AUTALASSO) based on the alternating direction method of multipliers (ADMM) optimization algorithm. The two major hyper-parameters of ADMM are the learning rate and the regularization factor. The learning rate is automatically tuned with line search and the regularization factor is optimized using golden section search. Results show that AUTALASSO provides superior prediction accuracy when evaluated on simulated and real bull data compared to the adaptive LASSO, LASSO and ridge regression implemented in the popular glmnet software. Conclusions The AUTALASSO provides a very flexible and computationally efficient approach to GWP, especially when it is important to obtain high prediction accuracy and genetic gain. The AUTALASSO also has the capability to perform GWAS of both additive and dominance effects with smaller prediction error than the ordinary LASSO.
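Golden section search, which this abstract names as the tuner for the regularization factor, can be sketched in a few lines. The code below is a generic textbook version, not the AUTALASSO implementation; `toy_validation_error` is a hypothetical stand-in for the prediction error that would in practice come from refitting the model at each candidate value, and the search assumes that error is unimodal over the interval.

```python
import math


def golden_section_minimize(f, lo, hi, tol=1e-6):
    """Minimize a unimodal function f on [lo, hi] by golden section search."""
    inv_phi = (math.sqrt(5) - 1) / 2  # about 0.618
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            # Minimum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # Minimum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2


def toy_validation_error(log_lambda):
    """Hypothetical validation error as a function of log(lambda)."""
    return (log_lambda - 1.5) ** 2 + 0.2


best_log_lambda = golden_section_minimize(toy_validation_error, -3.0, 3.0)
print(best_log_lambda)
```

Each iteration reuses one of the two interior evaluations, so only one new model fit is needed per step, which matters when every evaluation is a full LASSO fit.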
Affiliation(s)
- Patrik Waldmann
- Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Box 7023, Uppsala, 750 07, Sweden.
- Maja Ferenčaković
- Department of Animal Science, Faculty of Agriculture, University of Zagreb, Svetosimunska 25, Zagreb, 10000, Croatia
- Gábor Mészáros
- Division of Livestock Sciences, Department of Sustainable Agricultural Systems, University of Natural Resources and Life Sciences Vienna, Gregor Mendel Str. 33, Vienna, A-1180, Austria
- Negar Khayatzadeh
- Division of Livestock Sciences, Department of Sustainable Agricultural Systems, University of Natural Resources and Life Sciences Vienna, Gregor Mendel Str. 33, Vienna, A-1180, Austria
- Ino Curik
- Department of Animal Science, Faculty of Agriculture, University of Zagreb, Svetosimunska 25, Zagreb, 10000, Croatia
- Johann Sölkner
- Division of Livestock Sciences, Department of Sustainable Agricultural Systems, University of Natural Resources and Life Sciences Vienna, Gregor Mendel Str. 33, Vienna, A-1180, Austria
50
Park YJ, Bae JH, Shin MH, Hyun SH, Cho YS, Choe YS, Choi JY, Lee KH, Kim BT, Moon SH. Development of Predictive Models in Patients with Epiphora Using Lacrimal Scintigraphy and Machine Learning. Nucl Med Mol Imaging 2019; 53:125-135. [PMID: 31057684 PMCID: PMC6473022 DOI: 10.1007/s13139-019-00574-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 06/12/2018] [Revised: 09/19/2018] [Accepted: 01/07/2019] [Indexed: 12/22/2022]
Abstract
PURPOSE We developed predictive models using different programming languages and different computing platforms for machine learning (ML) and deep learning (DL) that classify clinical diagnoses in patients with epiphora. We evaluated the diagnostic performance of these models. METHODS Between January 2016 and September 2017, 250 patients with epiphora who underwent dacryocystography (DCG) and lacrimal scintigraphy (LS) were included in the study. We developed five different predictive models using ML tools, Python-based TensorFlow, R, and Microsoft Azure Machine Learning Studio (MAMLS). A total of 27 clinical characteristics and parameters including variables related to epiphora (VE) and variables related to dacryocystography (VDCG) were used as input data. Apart from this, we developed two predictive convolutional neural network (CNN) models for diagnosing LS images. We conducted this study using supervised learning. RESULTS Among 500 eyes of 250 patients, 59 eyes had anatomical obstruction, 338 eyes had functional obstruction, and the remaining 103 eyes were normal. For the data set that excluded VE and VDCG, the test accuracies in Python-based TensorFlow, R, multiclass logistic regression in MAMLS, multiclass neural network in MAMLS, and nuclear medicine physician were 81.70%, 80.60%, 81.70%, 73.10%, and 80.60%, respectively. The test accuracies of CNN models in three-class classification diagnosis and binary classification diagnosis were 72.00% and 77.42%, respectively. CONCLUSIONS ML-based predictive models using different programming languages and different computing platforms were useful for classifying clinical diagnoses in patients with epiphora and were similar to a clinician's diagnostic ability.
Affiliation(s)
- Yong-Jin Park
- Departments of Nuclear Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine, 50, Irwon-dong, Gangnam-gu, Seoul, 135-710 South Korea
- Ji Hoon Bae
- Departments of Nuclear Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine, 50, Irwon-dong, Gangnam-gu, Seoul, 135-710 South Korea
- Mu Heon Shin
- Departments of Nuclear Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine, 50, Irwon-dong, Gangnam-gu, Seoul, 135-710 South Korea
- Seung Hyup Hyun
- Departments of Nuclear Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine, 50, Irwon-dong, Gangnam-gu, Seoul, 135-710 South Korea
- Young Seok Cho
- Departments of Nuclear Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine, 50, Irwon-dong, Gangnam-gu, Seoul, 135-710 South Korea
- Yearn Seong Choe
- Departments of Nuclear Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine, 50, Irwon-dong, Gangnam-gu, Seoul, 135-710 South Korea
- Joon Young Choi
- Departments of Nuclear Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine, 50, Irwon-dong, Gangnam-gu, Seoul, 135-710 South Korea
- Kyung-Han Lee
- Departments of Nuclear Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine, 50, Irwon-dong, Gangnam-gu, Seoul, 135-710 South Korea
- Byung-Tae Kim
- Departments of Nuclear Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine, 50, Irwon-dong, Gangnam-gu, Seoul, 135-710 South Korea
- Seung Hwan Moon
- Departments of Nuclear Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine, 50, Irwon-dong, Gangnam-gu, Seoul, 135-710 South Korea