1
Barnett EJ, Onete DG, Salekin A, Faraone SV. Genomic Machine Learning Meta-regression: Insights on Associations of Study Features With Reported Model Performance. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:169-177. [PMID: 38109236 DOI: 10.1109/tcbb.2023.3343808] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/20/2023]
Abstract
Many studies have been conducted with the goal of correctly predicting diagnostic status of a disorder using the combination of genomic data and machine learning. It is often hard to judge which components of a study led to better results and whether better reported results represent a true improvement or an uncorrected bias inflating performance. We extracted information about the methods used and other differentiating features in genomic machine learning models. We used these features in linear regressions predicting model performance. We tested for univariate and multivariate associations as well as interactions between features. Of the models reviewed, 46% used feature selection methods that can lead to data leakage. Across our models, the number of hyperparameter optimizations reported, data leakage due to feature selection, model type, and modeling an autoimmune disorder were significantly associated with an increase in reported model performance. We found a significant, negative interaction between data leakage and training size. Our results suggest that methods susceptible to data leakage are prevalent among genomic machine learning research, resulting in inflated reported performance. Best practice guidelines that promote the avoidance and recognition of data leakage may help the field avoid biased results.
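The leakage mechanism named in this abstract can be illustrated with a small sketch. This is not the authors' analysis: the data are synthetic pure noise, and the helper names (`make_noise_data`, `select_top_k`, `cv_accuracy`), the nearest-centroid classifier, and all sizes are assumptions chosen for illustration. Selecting features on the full dataset before cross-validation lets held-out labels influence the model, which tends to inflate the reported accuracy even when no real signal exists.

```python
import random

def make_noise_data(n_samples, n_features, seed):
    """Pure-noise features with balanced labels, so no real signal exists."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y

def class_means(X, y, rows, j):
    """Mean of feature j within each class, restricted to the given rows."""
    sums, counts = [0.0, 0.0], [0, 0]
    for i in rows:
        sums[y[i]] += X[i][j]
        counts[y[i]] += 1
    return sums[0] / counts[0], sums[1] / counts[1]

def select_top_k(X, y, rows, k):
    """Rank features by absolute class-mean difference on the given rows."""
    def score(j):
        m0, m1 = class_means(X, y, rows, j)
        return abs(m1 - m0)
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def cv_accuracy(X, y, k_features, n_folds, leaky):
    """Cross-validated nearest-centroid accuracy.

    leaky=True selects features once on ALL rows (test rows included).
    leaky=False selects features inside each fold, on training rows only.
    """
    all_rows = list(range(len(X)))
    correct = 0
    for fold in range(n_folds):
        test_rows = [i for i in all_rows if i % n_folds == fold]
        train_rows = [i for i in all_rows if i % n_folds != fold]
        selection_rows = all_rows if leaky else train_rows
        features = select_top_k(X, y, selection_rows, k_features)
        # Centroids are always fitted on training rows only.
        centroids = [class_means(X, y, train_rows, j) for j in features]
        for i in test_rows:
            d0 = sum((X[i][j] - m0) ** 2 for j, (m0, _) in zip(features, centroids))
            d1 = sum((X[i][j] - m1) ** 2 for j, (_, m1) in zip(features, centroids))
            correct += int((0 if d0 <= d1 else 1) == y[i])
    return correct / len(X)

X, y = make_noise_data(n_samples=60, n_features=1000, seed=0)
leaky_acc = cv_accuracy(X, y, k_features=20, n_folds=5, leaky=True)
clean_acc = cv_accuracy(X, y, k_features=20, n_folds=5, leaky=False)
print(f"leaky selection: {leaky_acc:.2f}  in-fold selection: {clean_acc:.2f}")
```

Because the features carry no information about the labels, the in-fold estimate should hover near chance, while the leaky estimate is typically well above it. The only difference between the two runs is where feature selection happens.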
2
Hermes S, Cady J, Armentrout S, O’Connor J, Holdaway SC, Cruchaga C, Wingo T, Greytak EM. Epistatic Features and Machine Learning Improve Alzheimer's Disease Risk Prediction Over Polygenic Risk Scores. J Alzheimers Dis 2024; 99:1425-1440. [PMID: 38788065 PMCID: PMC11284654 DOI: 10.3233/jad-230236] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/26/2024]
Abstract
Background: Polygenic risk scores (PRS) are linear combinations of genetic markers weighted by effect size that are commonly used to predict disease risk. For complex heritable diseases such as late-onset Alzheimer's disease (LOAD), PRS models fail to capture much of the heritability. Additionally, PRS models are highly dependent on the population structure of the data on which effect sizes are assessed and have poor generalizability to new data.
Objective: The goal of this study is to construct a paragenic risk score that, in addition to single genetic marker data used in PRS, incorporates epistatic interaction features and machine learning methods to predict risk for LOAD.
Methods: We construct a new state-of-the-art genetic model for risk of Alzheimer's disease. Our approach innovates over PRS models in two ways: first, by directly incorporating epistatic interactions between SNP loci using an evolutionary algorithm guided by shared pathway information; and second, by estimating risk via an ensemble of non-linear machine learning models rather than a single linear model. We compare the paragenic model to several PRS models from the literature trained on the same dataset.
Results: The paragenic model is significantly more accurate than the PRS models under 10-fold cross-validation, obtaining an AUC of 83% and near-clinically significant matched sensitivity/specificity of 75%. It remains significantly more accurate when evaluated on an independent holdout dataset and maintains accuracy within APOE genotype strata.
Conclusions: Paragenic models show potential for improving disease risk prediction for complex heritable diseases such as LOAD over PRS models.
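The PRS definition in this abstract, a linear combination of genetic markers weighted by effect size, can be written as a short sketch. The function name `polygenic_risk_score`, the marker names, the allele dosages, and the effect sizes below are all hypothetical values chosen for illustration. This shows only the baseline PRS form, not the paragenic model the study proposes.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Weighted sum of allele dosages (0, 1, or 2 copies of the risk allele).

    dosages and effect_sizes are dicts keyed by marker name. Only markers
    present in both are used, which is an assumption of this sketch.
    """
    shared = dosages.keys() & effect_sizes.keys()
    return sum(effect_sizes[m] * dosages[m] for m in shared)

# Hypothetical per-marker effect sizes (e.g. log odds ratios from a GWAS).
effect_sizes = {"snp_a": 0.10, "snp_b": -0.05, "snp_c": 0.20}

# Hypothetical genotype of one individual, as risk-allele counts.
dosages = {"snp_a": 0, "snp_b": 1, "snp_c": 2}

score = polygenic_risk_score(dosages, effect_sizes)
print(f"PRS = {score:.2f}")
```

Each marker contributes independently to the sum, so a score of this form cannot represent the interaction (epistatic) effects between loci that the abstract describes adding.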
Affiliation(s)
- Carlos Cruchaga
  - Department of Psychiatry, Washington University, St. Louis, MO, USA
  - Hope Center Program on Protein Aggregation and Neurodegeneration, Washington University, St. Louis, MO, USA
- Thomas Wingo
  - Goizueta Alzheimer’s Disease Center, Emory University School of Medicine, Atlanta, GA, USA
  - Department of Neurology, Emory University School of Medicine, Atlanta, GA, USA
  - Department of Human Genetics, Emory University School of Medicine, Atlanta, GA, USA
3
Ribeiro-dos-Santos A, de Brito LM, de Araújo GS. The fusiform gyrus exhibits differential gene-gene co-expression in Alzheimer's disease. Front Aging Neurosci 2023; 15:1138336. [PMID: 37255536 PMCID: PMC10225579 DOI: 10.3389/fnagi.2023.1138336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2023] [Accepted: 04/21/2023] [Indexed: 06/01/2023] Open
Abstract
Alzheimer's Disease (AD) is an irreversible neurodegenerative disease clinically characterized by the presence of β-amyloid plaques and tau deposits in various regions of the brain. However, the underlying factors that contribute to the development of AD remain unclear. Recently, the fusiform gyrus has been identified as a critical brain region associated with mild cognitive impairment, which may increase the risk of AD development. In our study, we performed gene co-expression and differential co-expression network analyses, as well as gene-expression-based prediction, using RNA-seq transcriptome data from post-mortem fusiform gyrus tissue samples collected from both cognitively healthy individuals and those with AD. We accessed differential co-expression networks in large cohorts such as ROSMAP, MSBB, and Mayo, and conducted over-representation analyses of gene pathways and gene ontology. Our results comprise four exclusive gene hubs in co-expression modules of Alzheimer's Disease, including FNDC3A, MED23, NRIP1, and PKN2. Further, we identified three genes with differential co-expressed links, namely FAM153B, CYP2C8, and CKMT1B. The differential co-expressed network showed moderate predictive performance for AD, with an area under the curve ranging from 0.71 to 0.76 (+/- 0.07). The over-representation analysis identified enrichment for Toll-Like Receptors Cascades and signaling pathways, such as G protein events, PIP2 hydrolysis and EPH-Ephrin mechanism, in the fusiform gyrus. In conclusion, our findings shed new light on the molecular pathophysiology of AD by identifying new genes and biological pathways involved, emphasizing the crucial role of gene regulatory networks in the fusiform gyrus.
Affiliation(s)
- Arthur Ribeiro-dos-Santos
  - Programa de Pós-graduação em Genética e Biologia Molecular, Laboratório de Genética Humana e Médica, Instituto de Ciências Biológicas, Universidade Federal do Pará, Belém, Brazil
- Leonardo Miranda de Brito
  - Programa de Pós-graduação em Genética e Biologia Molecular, Laboratório de Genética Humana e Médica, Instituto de Ciências Biológicas, Universidade Federal do Pará, Belém, Brazil
  - Centro de Informática, Universidade Federal de Pernambuco, Recife, Brazil
- Gilderlanio Santana de Araújo
  - Programa de Pós-graduação em Genética e Biologia Molecular, Laboratório de Genética Humana e Médica, Instituto de Ciências Biológicas, Universidade Federal do Pará, Belém, Brazil
4
Alatrany AS, Khan W, Hussain A, Al-Jumeily D. Wide and deep learning based approaches for classification of Alzheimer's disease using genome-wide association studies. PLoS One 2023; 18:e0283712. [PMID: 37126509 PMCID: PMC10150974 DOI: 10.1371/journal.pone.0283712] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 12/04/2022] [Accepted: 03/15/2023] [Indexed: 05/02/2023] Open
Abstract
The increasing incidence of Alzheimer's disease (AD) has led to significant growth in socioeconomic challenges. Reliable prediction of AD might help mitigate or at least slow its progression, for which identifying the factors that affect AD and diagnosing it accurately are vital. In this study, we use a Genome-Wide Association Studies (GWAS) dataset comprising significant genetic markers of complex diseases. The original dataset contains a large number of attributes (620,901), for which we propose a hybrid feature selection approach based on an association test, principal component analysis, and the Boruta algorithm to identify the most promising predictors of AD. The selected features are then passed to wide and deep neural network models to classify AD cases and healthy controls. The experimental outcomes indicate that our approach outperformed existing methods when evaluated on a standard dataset, producing an accuracy and F1-score of 99%. The outcomes of this study are impactful, particularly the identified features comprising AD-associated genes and a reliable classification model that might be useful for other chronic diseases.
Affiliation(s)
- Abbas Saad Alatrany
  - School of Computer Science and Mathematics, Liverpool John Moores University, Liverpool, United Kingdom
  - University of Information Technology and Communications, Baghdad, Iraq
  - Imam Ja’afar Al-Sadiq University, Baghdad, Iraq
- Wasiq Khan
  - School of Computer Science and Mathematics, Liverpool John Moores University, Liverpool, United Kingdom
- Abir Hussain
  - School of Computer Science and Mathematics, Liverpool John Moores University, Liverpool, United Kingdom
  - Department of Electrical Engineering, University of Sharjah, Sharjah, UAE
- Dhiya Al-Jumeily
  - School of Computer Science and Mathematics, Liverpool John Moores University, Liverpool, United Kingdom
5
Hermes S, Cady J, Armentrout S, O’Connor J, Carlson S, Cruchaga C, Wingo T, Greytak EM. Epistatic Features and Machine Learning Improve Alzheimer's Risk Prediction Over Polygenic Risk Scores. medRxiv: The Preprint Server for Health Sciences 2023:2023.02.10.23285766. [PMID: 36798198 PMCID: PMC9934790 DOI: 10.1101/2023.02.10.23285766] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 06/18/2023]
Abstract
Background: Polygenic risk scores (PRS) are linear combinations of genetic markers weighted by effect size that are commonly used to predict disease risk. For complex heritable diseases such as late onset Alzheimer's disease (LOAD), PRS models fail to capture much of the heritability. Additionally, PRS models are highly dependent on the population structure of data on which effect sizes are assessed, and have poor generalizability to new data.
Objective: The goal of this study is to construct a paragenic risk score that, in addition to single genetic marker data used in PRS, incorporates epistatic interaction features and machine learning methods to predict lifetime risk for LOAD.
Methods: We construct a new state-of-the-art genetic model for lifetime risk of Alzheimer's disease. Our approach innovates over PRS models in two ways: first, by directly incorporating epistatic interactions between SNP loci using an evolutionary algorithm guided by shared pathway information; and second, by estimating risk via an ensemble of machine learning models (gradient boosting machines and deep learning) instead of simple logistic regression. We compare the paragenic model to a PRS model from the literature trained on the same dataset.
Results: The paragenic model is significantly more accurate than the PRS model under 10-fold cross-validation, obtaining an AUC of 83% and near-clinically significant matched sensitivity/specificity of 75%, and remains significantly more accurate when evaluated on an independent holdout dataset. Additionally, the paragenic model maintains accuracy within APOE genotypes.
Conclusion: Paragenic models show potential for improving lifetime disease risk prediction for complex heritable diseases such as LOAD over PRS models.
Affiliation(s)
- Janet Cady
  - Parabon NanoLabs, Inc., Reston, Virginia, USA
- Carlos Cruchaga
  - Department of Psychiatry, Washington University, St. Louis, MO, USA
  - Hope Center Program on Protein Aggregation and Neurodegeneration, Washington University, St. Louis, MO, USA
- Thomas Wingo
  - Goizueta Alzheimer’s Disease Center, Emory University School of Medicine, Atlanta, GA, USA
  - Department of Neurology, Emory University School of Medicine, Atlanta, GA, USA
  - Department of Human Genetics, Emory University School of Medicine, Atlanta, GA, USA
6
Page ML, Vance EL, Cloward ME, Ringger E, Dayton L, Ebbert MTW, Miller JB, Kauwe JSK. The Polygenic Risk Score Knowledge Base offers a centralized online repository for calculating and contextualizing polygenic risk scores. Commun Biol 2022; 5:899. [PMID: 36056235 PMCID: PMC9438378 DOI: 10.1038/s42003-022-03795-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 08/16/2021] [Accepted: 08/03/2022] [Indexed: 11/20/2022] Open
Abstract
The process of identifying suitable genome-wide association (GWA) studies and formatting the data to calculate multiple polygenic risk scores on a single genome can be laborious. Here, we present a centralized polygenic risk score calculator currently containing over 250,000 genetic variant associations from the NHGRI-EBI GWAS Catalog for users to easily calculate sample-specific polygenic risk scores with comparable results to other available tools. Polygenic risk scores are calculated either online through the Polygenic Risk Score Knowledge Base (PRSKB; https://prs.byu.edu) or via a command-line interface. We report study-specific polygenic risk scores across the UK Biobank, 1000 Genomes, and the Alzheimer's Disease Neuroimaging Initiative (ADNI), contextualize computed scores, and identify potentially confounding genetic risk factors in ADNI. We introduce a streamlined analysis tool and web interface to calculate and contextualize polygenic risk scores across various studies, which we anticipate will facilitate a wider adaptation of polygenic risk scores in future disease research.
Affiliation(s)
- Madeline L Page
  - Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY, USA
- Elizabeth L Vance
  - Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY, USA
- Ed Ringger
  - Department of Biology, Brigham Young University, Provo, UT, USA
- Louisa Dayton
  - Department of Biology, Brigham Young University, Provo, UT, USA
- Mark T W Ebbert
  - Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY, USA
  - Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, Lexington, KY, USA
  - Department of Neuroscience, University of Kentucky, Lexington, KY, USA
- Justin B Miller
  - Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY, USA
  - Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, Lexington, KY, USA
  - Department of Pathology and Laboratory Medicine, University of Kentucky, Lexington, KY, USA
- John S K Kauwe
  - Department of Biology, Brigham Young University, Provo, UT, USA