1. Zhou X, Cai F, Li S, Li G, Zhang C, Xie J, Yang Y. Machine learning techniques for prediction in pregnancy complicated by autoimmune rheumatic diseases: Applications and challenges. Int Immunopharmacol 2024; 134:112238. [PMID: 38735259; DOI: 10.1016/j.intimp.2024.112238] [Received: 03/05/2024] [Revised: 05/01/2024] [Accepted: 05/08/2024] [Indexed: 05/14/2024]
Abstract
Autoimmune rheumatic diseases are chronic conditions affecting multiple systems and often occurring in young women of childbearing age. The diseases and the physiological characteristics of pregnancy significantly impact maternal-fetal health and pregnancy outcomes. Currently, the integration of big data with healthcare has led to the increasing popularity of using machine learning (ML) to mine clinical data for studying pregnancy complications. In this review, we introduce the basics of ML and the recent advances and trends of ML in different prediction applications for common pregnancy complications by autoimmune rheumatic diseases. Finally, we discuss the challenges and future directions for enhancing the accuracy, reliability, and clinical applicability of ML in prediction. This review will provide insights into the utilization of ML in identifying and assisting clinical decision-making for pregnancy complications, while also establishing a foundation for exploring comprehensive management strategies for pregnancy and enhancing maternal and child health.
Affiliation(s)
- Xiaoshi Zhou
  - Department of Pharmacy, Sichuan Academy of Medical Sciences & Sichuan Provincial People's Hospital, School of Medicine, University of Electronic Science and Technology of China, Chengdu, China
- Feifei Cai
  - Department of Pharmacy, Sichuan Academy of Medical Sciences & Sichuan Provincial People's Hospital, School of Medicine, University of Electronic Science and Technology of China, Chengdu, China
- Shiran Li
  - Department of Pharmacy, Sichuan Academy of Medical Sciences & Sichuan Provincial People's Hospital, School of Medicine, University of Electronic Science and Technology of China, Chengdu, China
- Guolin Li
  - Department of Pharmacy, Sichuan Academy of Medical Sciences & Sichuan Provincial People's Hospital, School of Medicine, University of Electronic Science and Technology of China, Chengdu, China; School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing, China
- Changji Zhang
  - Department of Pharmacy, Sichuan Academy of Medical Sciences & Sichuan Provincial People's Hospital, School of Medicine, University of Electronic Science and Technology of China, Chengdu, China; School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing, China
- Jingxian Xie
  - Department of Pharmacy, Sichuan Academy of Medical Sciences & Sichuan Provincial People's Hospital, School of Medicine, University of Electronic Science and Technology of China, Chengdu, China; College of Pharmacy, Southwest Medical University, Luzhou, China
- Yong Yang
  - Department of Pharmacy, Sichuan Academy of Medical Sciences & Sichuan Provincial People's Hospital, School of Medicine, University of Electronic Science and Technology of China, Chengdu, China

2. Gao Y, Cui Y. Optimizing clinico-genomic disease prediction across ancestries: a machine learning strategy with Pareto improvement. Genome Med 2024; 16:76. [PMID: 38835075; DOI: 10.1186/s13073-024-01345-0] [Received: 01/08/2024] [Accepted: 05/17/2024] [Indexed: 06/06/2024] [Open Access]
Abstract
BACKGROUND Accurate prediction of an individual's predisposition to diseases is vital for preventive medicine and early intervention. Various statistical and machine learning models have been developed for disease prediction using clinico-genomic data. However, the accuracy of clinico-genomic prediction of diseases may vary significantly across ancestry groups due to their unequal representation in clinical genomic datasets. METHODS We introduced a deep transfer learning approach to improve the performance of clinico-genomic prediction models for data-disadvantaged ancestry groups. We conducted machine learning experiments on multi-ancestral genomic datasets of lung cancer, prostate cancer, and Alzheimer's disease, as well as on synthetic datasets with built-in data inequality and distribution shifts across ancestry groups. RESULTS Deep transfer learning significantly improved disease prediction accuracy for data-disadvantaged populations in our multi-ancestral machine learning experiments. In contrast, transfer learning based on linear frameworks did not achieve comparable improvements for these data-disadvantaged populations. CONCLUSIONS This study shows that deep transfer learning can enhance fairness in multi-ancestral machine learning by improving prediction accuracy for data-disadvantaged populations without compromising prediction accuracy for other populations, thus providing a Pareto improvement towards equitable clinico-genomic prediction of diseases.
Affiliation(s)
- Yan Gao
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, 38163, USA
  - Center for Integrative and Translational Genomics, University of Tennessee Health Science Center, Memphis, TN, 38163, USA
- Yan Cui
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, 38163, USA
  - Center for Integrative and Translational Genomics, University of Tennessee Health Science Center, Memphis, TN, 38163, USA
  - Center for Cancer Research, University of Tennessee Health Science Center, Memphis, TN, 38163, USA

3. Hrytsenko Y, Shea B, Elgart M, Kurniansyah N, Lyons G, Morrison AC, Carson AP, Haring B, Mitchell BD, Psaty BM, Jaeger BC, Gu CC, Kooperberg C, Levy D, Lloyd-Jones D, Choi E, Brody JA, Smith JA, Rotter JI, Moll M, Fornage M, Simon N, Castaldi P, Casanova R, Chung RH, Kaplan R, Loos RJF, Kardia SLR, Rich SS, Redline S, Kelly T, O'Connor T, Zhao W, Kim W, Guo X, Ida Chen YD, Sofer T. Machine learning models for predicting blood pressure phenotypes by combining multiple polygenic risk scores. Sci Rep 2024; 14:12436. [PMID: 38816422; PMCID: PMC11139858; DOI: 10.1038/s41598-024-62945-9] [Received: 01/22/2024] [Accepted: 05/22/2024] [Indexed: 06/01/2024] [Open Access]
Abstract
We construct non-linear machine learning (ML) prediction models for systolic and diastolic blood pressure (SBP, DBP) using demographic and clinical variables and polygenic risk scores (PRSs). We developed a two-model ensemble, consisting of a baseline model, where prediction is based on demographic and clinical variables only, and a genetic model, where we also include PRSs. We evaluate the use of a linear versus a non-linear model at both the baseline and the genetic model levels and assess the improvement in performance when incorporating multiple PRSs. We report the ensemble model's performance as percentage variance explained (PVE) on a held-out test dataset. A non-linear baseline model improved the PVEs from 28.1% to 30.1% (SBP) and from 14.3% to 17.4% (DBP) compared with a linear baseline model. Including seven PRSs in the genetic model, computed based on the largest available GWAS of SBP/DBP, improved the genetic model PVE from 4.8% to 5.1% (SBP) and from 4.7% to 5% (DBP) compared to using a single PRS. Adding 14 additional PRSs computed based on two independent GWASs further increased the genetic model PVE to 6.3% (SBP) and 5.7% (DBP). PVE differed across self-reported race/ethnicity groups, with primarily all non-White groups benefitting from the inclusion of additional PRSs. In summary, non-linear ML models improve BP prediction in models incorporating diverse populations.
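The abstract reports performance as percentage variance explained (PVE) but does not state the formula. A minimal sketch, assuming PVE is computed on the held-out set as 100 × (1 − residual sum of squares / total sum of squares); the function name and the toy blood pressure values are hypothetical and are not taken from the paper.

```python
def percent_variance_explained(observed, predicted):
    """Return 100 * (1 - SSE / SST) for paired observed and predicted values.

    Assumption: this is one common definition of PVE; the cited paper
    may compute it differently.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    mean_observed = sum(observed) / len(observed)
    # Residual sum of squares: error left after prediction.
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: variation around the observed mean.
    sst = sum((o - mean_observed) ** 2 for o in observed)
    if sst == 0:
        raise ValueError("observed values have zero variance")
    return 100.0 * (1.0 - sse / sst)


# Hypothetical held-out SBP values (mmHg) and model predictions.
observed_sbp = [120, 130, 140, 150]
predicted_sbp = [125, 130, 135, 150]
pve = percent_variance_explained(observed_sbp, predicted_sbp)
print(f"PVE: {pve:.1f}%")
```

Under this definition a model that always predicts the observed mean scores 0%, so the reported gains of a few percentage points can be read as additional variance captured beyond the comparison model.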
Affiliation(s)
- Yana Hrytsenko
  - Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
  - CardioVascular Institute (CVI), Beth Israel Deaconess Medical Center, Boston, MA, USA
- Benjamin Shea
  - CardioVascular Institute (CVI), Beth Israel Deaconess Medical Center, Boston, MA, USA
- Michael Elgart
  - Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
- Genevieve Lyons
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Alanna C Morrison
  - Department of Epidemiology, School of Public Health, Human Genetics Center, The University of Texas Health Science Center at Houston, Houston, TX, USA
- April P Carson
  - Department of Medicine, University of Mississippi Medical Center, Jackson, MS, USA
- Bernhard Haring
  - Department of Epidemiology & Population Health, Albert Einstein College of Medicine, Bronx, NY, USA
  - Department of Medicine III, Saarland University, Homburg, Saarland, Germany
- Braxton D Mitchell
  - Department of Medicine, University of Maryland School of Medicine, Baltimore, MD, USA
- Bruce M Psaty
  - Department of Medicine, University of Washington, Seattle, WA, USA
  - Department of Epidemiology, University of Washington, Seattle, WA, USA
  - Cardiovascular Health Research Unit, University of Washington, Seattle, WA, USA
  - Health Systems and Population Health, University of Washington, Seattle, WA, USA
- Byron C Jaeger
  - Department of Biostatistics and Data Science, Wake Forest University School of Medicine, Winston-Salem, NC, USA
- C Charles Gu
  - The Center for Biostatistics and Data Science, Washington University, St. Louis, USA
- Charles Kooperberg
  - Division of Public Health Sciences, Fred Hutchinson Cancer Center, Seattle, WA, USA
- Daniel Levy
  - The Population Sciences Branch of the National Heart, Lung and Blood Institute, Bethesda, MD, USA
  - The Framingham Heart Study, Framingham, MA, USA
- Donald Lloyd-Jones
  - Department of Preventive Medicine, Northwestern University, Chicago, IL, USA
- Eunhee Choi
  - Columbia Hypertension Laboratory, Department of Medicine, Columbia University Irving Medical Center, New York, NY, USA
- Jennifer A Brody
  - Department of Medicine, University of Washington, Seattle, WA, USA
  - Cardiovascular Health Research Unit, University of Washington, Seattle, WA, USA
- Jennifer A Smith
  - Department of Epidemiology, School of Public Health, University of Michigan, Ann Arbor, MI, USA
  - Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, MI, USA
- Jerome I Rotter
  - Department of Pediatrics, The Institute for Translational Genomics and Population Sciences, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
- Matthew Moll
  - Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
  - VA Boston Healthcare System, West Roxbury, MA, USA
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Boston, USA
- Myriam Fornage
  - Department of Epidemiology, School of Public Health, Human Genetics Center, The University of Texas Health Science Center at Houston, Houston, TX, USA
  - Brown Foundation Institute of Molecular Medicine, McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA
- Noah Simon
  - Department of Biostatistics, School of Public Health, University of Washington, Seattle, WA, USA
- Peter Castaldi
  - Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
- Ramon Casanova
  - Department of Biostatistics and Data Science, Wake Forest University School of Medicine, Winston-Salem, NC, USA
- Ren-Hua Chung
  - Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Taipei City, Taiwan
- Robert Kaplan
  - Department of Epidemiology & Population Health, Albert Einstein College of Medicine, Bronx, NY, USA
  - Division of Public Health Sciences, Fred Hutchinson Cancer Center, Seattle, WA, USA
- Ruth J F Loos
  - The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty for Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Sharon L R Kardia
  - Department of Epidemiology, School of Public Health, University of Michigan, Ann Arbor, MI, USA
- Stephen S Rich
  - Center for Public Health Genomics, University of Virginia School of Medicine, Charlottesville, VA, USA
- Susan Redline
  - Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
  - Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA
- Tanika Kelly
  - Department of Epidemiology, Tulane University School of Public Health and Tropical Medicine, New Orleans, LA, USA
- Timothy O'Connor
  - Department of Medicine, University of Maryland School of Medicine, Baltimore, MD, USA
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA
  - Program in Health Equity and Population Health, University of Maryland School of Medicine, Baltimore, MD, USA
- Wei Zhao
  - Department of Epidemiology, School of Public Health, University of Michigan, Ann Arbor, MI, USA
  - Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, MI, USA
- Wonji Kim
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Boston, USA
- Xiuqing Guo
  - Department of Pediatrics, The Institute for Translational Genomics and Population Sciences, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
- Yii-Der Ida Chen
  - Department of Pediatrics, The Institute for Translational Genomics and Population Sciences, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
- Tamar Sofer
  - Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
  - CardioVascular Institute (CVI), Beth Israel Deaconess Medical Center, Boston, MA, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Center for Life Sciences CLS-934, 3 Blackfan St., Boston, MA, 02115, USA

4. Alireza Z, Maleeha M, Kaikkonen M, Fortino V. Enhancing prediction accuracy of coronary artery disease through machine learning-driven genomic variant selection. J Transl Med 2024; 22:356. [PMID: 38627847; PMCID: PMC11020205; DOI: 10.1186/s12967-024-05090-1] [Received: 09/12/2023] [Accepted: 03/14/2024] [Indexed: 04/19/2024] [Open Access]
Abstract
Machine learning (ML) methods are increasingly becoming crucial in genome-wide association studies for identifying key genetic variants or SNPs that statistical methods might overlook. Statistical methods predominantly identify SNPs with notable effect sizes by conducting association tests on individual genetic variants, one at a time, to determine their relationship with the target phenotype. These genetic variants are then used to create polygenic risk scores (PRSs), estimating an individual's genetic risk for complex diseases like cancer or cardiovascular disorders. Unlike traditional methods, ML algorithms can identify groups of low-risk genetic variants that improve prediction accuracy when combined in a mathematical model. However, the application of ML strategies requires addressing the feature selection challenge to prevent overfitting. Moreover, ensuring the ML model depends on a concise set of genomic variants enhances its clinical applicability, where testing is feasible for only a limited number of SNPs. In this study, we introduce a robust pipeline that applies ML algorithms in combination with feature selection (ML-FS algorithms), aimed at identifying the most significant genomic variants associated with the coronary artery disease (CAD) phenotype. The proposed computational approach was tested on individuals from the UK Biobank, differentiating between CAD and non-CAD individuals within this extensive cohort, and benchmarked against standard PRS-based methodologies like LDpred2 and Lassosum. Our strategy incorporates cross-validation to ensure a more robust evaluation of genomic variant-based prediction models. This method is commonly applied in machine learning strategies but has often been neglected in previous studies assessing the predictive performance of polygenic risk scores. 
Our results demonstrate that the ML-FS algorithm can identify panels with as few as 50 genetic markers that can achieve approximately 80% accuracy when used in combination with known risk factors. The modest increase in accuracy over PRS performances is noteworthy, especially considering that PRS models incorporate a substantially larger number of genetic variants. This extensive variant selection can pose practical challenges in clinical settings. Additionally, the proposed approach revealed novel CAD-genetic variant associations.
Affiliation(s)
- Z Alireza
  - Institute of Biomedicine, University of Eastern Finland, 70210, Kuopio, Finland
- M Maleeha
  - Institute of Biomedicine, University of Eastern Finland, 70210, Kuopio, Finland
- M Kaikkonen
  - A.I.Virtanen Institute, University of Eastern Finland, 70210, Kuopio, Finland
- V Fortino
  - Institute of Biomedicine, University of Eastern Finland, 70210, Kuopio, Finland

5. Don J, Schork AJ, Glusman G, Rappaport N, Cummings SR, Duggan D, Raju A, Hellberg KLG, Gunn S, Monti S, Perls T, Lapidus J, Goetz LH, Sebastiani P, Schork NJ. The relationship between 11 different polygenic longevity scores, parental lifespan, and disease diagnosis in the UK Biobank. GeroScience 2024:10.1007/s11357-024-01107-1. [PMID: 38451433; DOI: 10.1007/s11357-024-01107-1] [Received: 10/30/2023] [Accepted: 02/21/2024] [Indexed: 03/08/2024] [Open Access]
Abstract
Large-scale genome-wide association studies (GWAS) strongly suggest that most traits and diseases have a polygenic component. This observation has motivated the development of disease-specific "polygenic scores (PGS)" that are weighted sums of the effects of disease-associated variants identified from GWAS that correlate with an individual's likelihood of expressing a specific phenotype. Although most GWAS have been pursued on disease traits, leading to the creation of refined "Polygenic Risk Scores" (PRS) that quantify risk to diseases, many GWAS have also been pursued on extreme human longevity, general fitness, health span, and other health-positive traits. These GWAS have discovered many genetic variants seemingly protective from disease and are often different from disease-associated variants (i.e., they are not just alternative alleles at disease-associated loci) and suggest that many health-positive traits also have a polygenic basis. This observation has led to an interest in "polygenic longevity scores (PLS)" that quantify the "risk" or genetic predisposition of an individual towards health. We derived 11 different PLS from 4 different available GWAS on lifespan and then investigated the properties of these PLS using data from the UK Biobank (UKB). Tests of association between the PLS and population structure, parental lifespan, and several cancerous and non-cancerous diseases, including death from COVID-19, were performed. Based on the results of our analyses, we argue that PLS are made up of variants not only robustly associated with parental lifespan, but that also contribute to the genetic architecture of disease susceptibility, morbidity, and mortality.
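The abstract describes polygenic scores as weighted sums of variant effects. A minimal sketch of that calculation, assuming per-variant effect sizes from a GWAS and allele dosages of 0, 1, or 2; the variant identifiers, weights, and the choice to treat missing genotypes as dosage 0 are hypothetical illustrations, not details from the paper.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages using per-variant effect sizes.

    dosages: dict mapping variant id -> effect-allele count (0, 1, or 2)
    weights: dict mapping variant id -> effect size from a GWAS
    Assumption: variants absent from `dosages` contribute 0.
    """
    return sum(weight * dosages.get(variant, 0) for variant, weight in weights.items())


# Hypothetical effect sizes and one individual's genotypes.
weights = {"rs_a": 0.5, "rs_b": -0.25, "rs_c": 0.1}
dosages = {"rs_a": 2, "rs_b": 1, "rs_c": 0}
print(polygenic_score(dosages, weights))  # 0.5*2 + (-0.25)*1 + 0.1*0 = 0.75
```

Published pipelines typically add steps this sketch omits, such as variant selection, linkage disequilibrium adjustment, and standardization of the score within a reference population.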
Affiliation(s)
- Janith Don
  - Translational Genomics Research Institute (TGen), Phoenix, AZ, USA
- Andrew J Schork
  - The Institute of Biological Psychiatry, Copenhagen University Hospital, Copenhagen, Denmark
  - GLOBE Institute, Copenhagen University, Copenhagen, Denmark
- Steve R Cummings
  - San Francisco Coordinating Center, California Pacific Medical Center Research Institute, San Francisco, CA, USA
- David Duggan
  - Translational Genomics Research Institute (TGen), Phoenix, AZ, USA
- Anish Raju
  - Translational Genomics Research Institute (TGen), Phoenix, AZ, USA
- Kajsa-Lotta Georgii Hellberg
  - The Institute of Biological Psychiatry, Copenhagen University Hospital, Copenhagen, Denmark
  - GLOBE Institute, Copenhagen University, Copenhagen, Denmark
- Sophia Gunn
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
- Stefano Monti
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
- Thomas Perls
  - Department of Medicine, Section of Geriatrics, Boston University, Boston, MA, USA
- Jodi Lapidus
  - Department of Biostatistics, Oregon Health & Science University, Portland, OR, USA
- Laura H Goetz
  - Translational Genomics Research Institute (TGen), Phoenix, AZ, USA
  - Veterans Affairs Loma Linda Health Care, Loma Linda, CA, USA
- Paola Sebastiani
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
  - Institute for Clinical Research and Health Policy Studies, Tufts Medical Center, Boston, MA, USA
  - Tufts University School of Medicine and Data Intensive Study Center, Boston, MA, USA
- Nicholas J Schork
  - Translational Genomics Research Institute (TGen), Phoenix, AZ, USA
  - The City of Hope National Medical Center, Duarte, CA, USA

6. Chafai N, Bonizzi L, Botti S, Badaoui B. Emerging applications of machine learning in genomic medicine and healthcare. Crit Rev Clin Lab Sci 2024; 61:140-163. [PMID: 37815417; DOI: 10.1080/10408363.2023.2259466] [Received: 06/19/2023] [Accepted: 09/12/2023] [Indexed: 10/11/2023]
Abstract
The integration of artificial intelligence technologies has propelled the progress of clinical and genomic medicine in recent years. The significant increase in computing power has facilitated the ability of artificial intelligence models to analyze and extract features from extensive medical data and images, thereby contributing to the advancement of intelligent diagnostic tools. Artificial intelligence (AI) models have been utilized in the field of personalized medicine to integrate clinical data and genomic information of patients. This integration allows for the identification of customized treatment recommendations, ultimately leading to enhanced patient outcomes. Notwithstanding the notable advancements, the application of artificial intelligence (AI) in the field of medicine is impeded by various obstacles such as the limited availability of clinical and genomic data, the diversity of datasets, ethical implications, and the inconclusive interpretation of AI models' results. In this review, a comprehensive evaluation of multiple machine learning algorithms utilized in the fields of clinical and genomic medicine is conducted. Furthermore, we present an overview of the implementation of artificial intelligence (AI) in the fields of clinical medicine, drug discovery, and genomic medicine. Finally, a number of constraints pertaining to the implementation of artificial intelligence within the healthcare industry are examined.
Affiliation(s)
- Narjice Chafai
  - Laboratory of Biodiversity, Ecology, and Genome, Faculty of Sciences, Department of Biology, Mohammed V University in Rabat, Rabat, Morocco
- Luigi Bonizzi
  - Department of Biomedical, Surgical and Dental Science, University of Milan, Milan, Italy
- Sara Botti
  - PTP Science Park, Via Einstein - Loc. Cascina Codazza, Lodi, Italy
- Bouabid Badaoui
  - Laboratory of Biodiversity, Ecology, and Genome, Faculty of Sciences, Department of Biology, Mohammed V University in Rabat, Rabat, Morocco
  - African Sustainable Agriculture Research Institute (ASARI), Mohammed VI Polytechnic University (UM6P), Laâyoune, Morocco

7. Fong WJ, Tan HM, Garg R, Teh AL, Pan H, Gupta V, Krishna B, Chen ZH, Purwanto NY, Yap F, Tan KH, Chan KYJ, Chan SY, Goh N, Rane N, Tan ESE, Jiang Y, Han M, Meaney M, Wang D, Keppo J, Tan GCY. Comparing feature selection and machine learning approaches for predicting CYP2D6 methylation from genetic variation. Front Neuroinform 2024; 17:1244336. [PMID: 38449836; PMCID: PMC10915285; DOI: 10.3389/fninf.2023.1244336] [Received: 06/22/2023] [Accepted: 10/18/2023] [Indexed: 03/08/2024] [Open Access]
Abstract
Introduction Pharmacogenetics currently supports clinical decision-making on the basis of a limited number of variants in a few genes and may benefit paediatric prescribing where there is a need for more precise dosing. Integrating genomic information such as methylation into pharmacogenetic models holds the potential to improve their accuracy and consequently prescribing decisions. Cytochrome P450 2D6 (CYP2D6) is a highly polymorphic gene conventionally associated with the metabolism of commonly used drugs and endogenous substrates. We thus sought to predict epigenetic loci from single nucleotide polymorphisms (SNPs) related to CYP2D6 in children from the GUSTO cohort. Methods Buffy coat DNA methylation was quantified using the Illumina Infinium Methylation EPIC beadchip. CpG sites associated with CYP2D6 were used as outcome variables in Linear Regression, Elastic Net and XGBoost models. We compared feature selection of SNPs from GWAS mQTLs, GTEx eQTLs and SNPs within 2 MB of the CYP2D6 gene and the impact of adding demographic data. The samples were split into training (75%) and test (25%) sets for validation. For the Elastic Net and XGBoost models, the hyperparameter search was done using 10-fold cross-validation. Root Mean Square Error and R-squared values were obtained to investigate each model's performance. When GWAS was performed to determine SNPs associated with CpG sites, a total of 15 SNPs were identified, several of which appeared to influence multiple CpG sites. Results Overall, Elastic Net models of genetic features appeared to perform marginally better than heritability estimates and substantially better than Linear Regression and XGBoost models. The addition of nongenetic features appeared to improve performance for some but not all feature sets and probes. The best feature set and Machine Learning (ML) approach differed substantially between CpG sites and a number of top variables were identified for each model.
Discussion The development of SNP-based prediction models for CYP2D6 CpG methylation in Singaporean children of varying ethnicities in this study has clinical application. With further validation, they may add to the set of tools available to improve precision medicine and pharmacogenetics-based dosing.
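The Methods describe a 75%/25% train/test split followed by 10-fold cross-validation on the training portion for hyperparameter search. A minimal standard-library sketch of that partitioning only; the function names, the fixed seed, and the sample count are hypothetical, and the abstract does not say how the authors shuffled or stratified their samples.

```python
import random


def split_indices(n_samples, test_fraction=0.25, seed=0):
    """Shuffle sample indices and return (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(train_indices, k=10):
    """Yield (training, validation) index lists for each of k folds."""
    folds = [train_indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Hypothetical cohort of 100 samples.
train, test = split_indices(100)
for fold_number, (training, validation) in enumerate(k_fold(train), start=1):
    # A model would be fitted on `training` and scored on `validation` here
    # for each candidate hyperparameter setting.
    print(fold_number, len(training), len(validation))
```

Keeping the test indices out of the fold loop is the point of the design: hyperparameters are chosen using only the training portion, and the 25% test set is reserved for the final RMSE and R-squared estimates.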
Affiliation(s)
- Wei Jing Fong
  - Computational Biology, National University of Singapore, Singapore, Singapore
- Hong Ming Tan
  - Computational Biology, National University of Singapore, Singapore, Singapore
- Rishabh Garg
  - Computational Biology, National University of Singapore, Singapore, Singapore
- Ai Ling Teh
  - Singapore Institute for Clinical Sciences (SICS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
  - Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Hong Pan
  - Singapore Institute for Clinical Sciences (SICS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Varsha Gupta
  - Singapore Institute for Clinical Sciences (SICS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
  - Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Bernadus Krishna
  - Computational Biology, National University of Singapore, Singapore, Singapore
- Zou Hui Chen
  - Computational Biology, National University of Singapore, Singapore, Singapore
- Fabian Yap
  - KK Women's and Children's Hospital, Singapore, Singapore
- Kok Hian Tan
  - KK Women's and Children's Hospital, Singapore, Singapore
  - Duke NUS Medical School, Singapore, Singapore
- Kok Yen Jerry Chan
  - KK Women's and Children's Hospital, Singapore, Singapore
  - Duke NUS Medical School, Singapore, Singapore
- Shiao-Yng Chan
  - Singapore Institute for Clinical Sciences (SICS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
  - National University Hospital, Singapore, Singapore
- Nikita Rane
  - Institute of Mental Health, Singapore, Singapore
- Mei Han
  - Computational Biology, National University of Singapore, Singapore, Singapore
- Michael Meaney
  - Singapore Institute for Clinical Sciences (SICS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Dennis Wang
  - Singapore Institute for Clinical Sciences (SICS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
  - National Heart and Lung Institute, Imperial College London, London, United Kingdom
- Jussi Keppo
  - Computational Biology, National University of Singapore, Singapore, Singapore
- Geoffrey Chern-Yee Tan
  - Computational Biology, National University of Singapore, Singapore, Singapore
  - Institute of Mental Health, Singapore, Singapore

8. Choi Y, Cha J, Choi S. Evaluation of penalized and machine learning methods for asthma disease prediction in the Korean Genome and Epidemiology Study (KoGES). BMC Bioinformatics 2024; 25:56. [PMID: 38308205; PMCID: PMC10837879; DOI: 10.1186/s12859-024-05677-x] [Received: 06/09/2022] [Accepted: 01/26/2024] [Indexed: 02/04/2024] [Open Access]
Abstract
BACKGROUND Genome-wide association studies have successfully identified genetic variants associated with human disease. Various statistical approaches based on penalized and machine learning methods have recently been proposed for disease prediction. In this study, we evaluated the performance of several such methods for predicting asthma using the Korean Chip (KORV1.1) from the Korean Genome and Epidemiology Study (KoGES). RESULTS First, single-nucleotide polymorphisms were selected via single-variant tests using logistic regression with the adjustment of several epidemiological factors. Next, we evaluated the following methods for disease prediction: ridge, least absolute shrinkage and selection operator, elastic net, smoothly clipped absolute deviation, support vector machine, random forest, boosting, bagging, naïve Bayes, and k-nearest neighbor. Finally, we compared their predictive performance based on the area under the curve of the receiver operating characteristic curves, precision, recall, F1-score, Cohen's Kappa, balanced accuracy, error rate, Matthews correlation coefficient, and area under the precision-recall curve. Additionally, three oversampling algorithms were used to deal with imbalance problems. CONCLUSIONS Our results show that penalized methods exhibit better predictive performance for asthma than that achieved via machine learning methods. On the other hand, in the oversampling study, random forest and boosting methods overall showed better prediction performance than penalized methods.
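Several of the metrics listed in the abstract can be derived from a single binary confusion matrix. A minimal sketch using their standard definitions; the function name and the toy labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here rather than something the paper specifies.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute precision, recall, F1, balanced accuracy, and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Assumption: undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    balanced_accuracy = (recall + specificity) / 2
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


# Hypothetical imbalanced sample: 3 cases (1) and 5 controls (0).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
for name, value in binary_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.3f}")
```

Balanced accuracy and the Matthews correlation coefficient account for both classes, which is one reason they are often reported alongside precision and recall when cases are much rarer than controls.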
Affiliation(s)
- Yongjun Choi
- Department of Applied Artificial Intelligence, College of Computing, Hanyang University, 55 Hanyang-daehak-ro, Sangnok-gu, Ansan, 15588, South Korea
| | - Junho Cha
- Department of Applied Artificial Intelligence, College of Computing, Hanyang University, 55 Hanyang-daehak-ro, Sangnok-gu, Ansan, 15588, South Korea
| | - Sungkyoung Choi
- Department of Applied Artificial Intelligence, College of Computing, Hanyang University, 55 Hanyang-daehak-ro, Sangnok-gu, Ansan, 15588, South Korea.
- Department of Mathematical Data Science, College of Science and Convergence Technology, Hanyang University, 55 Hanyang-daehak-ro, Sangnok-gu, Ansan, 15588, South Korea.
9
Chung CW, Chou SC, Hsiao TH, Zhang GJ, Chung YF, Chen YM. Machine learning approaches to identify systemic lupus erythematosus in anti-nuclear antibody-positive patients using genomic data and electronic health records. BioData Min 2024; 17:1. [PMID: 38183082 PMCID: PMC10770905 DOI: 10.1186/s13040-023-00352-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 12/19/2023] [Indexed: 01/07/2024]
Abstract
BACKGROUND Although the 2019 EULAR/ACR classification criteria for systemic lupus erythematosus (SLE) require at least a positive anti-nuclear antibody (ANA) titer (≥ 1:80), it remains challenging for clinicians to identify patients with SLE. This study aimed to develop a machine learning (ML) approach to assist in the detection of SLE patients using genomic data and electronic health records. METHODS Participants with a positive ANA (≥ 1:80) were enrolled from the Taiwan Precision Medicine Initiative cohort. The Taiwan Biobank version 2 array was used to detect single nucleotide polymorphism (SNP) data. Six ML models, Logistic Regression, Random Forest (RF), Support Vector Machine, Light Gradient Boosting Machine, Gradient Tree Boosting, and Extreme Gradient Boosting (XGB), were used to identify SLE patients. The importance of the clinical and genetic features was determined by Shapley Additive Explanation (SHAP) values. A logistic regression model was applied to identify genetic variations associated with SLE in the subset of patients with an ANA titer equal to or exceeding 1:640. RESULTS A total of 946 SLE patients and 1,892 non-SLE controls were included in this analysis. Among the six ML models, RF and XGB demonstrated superior performance in differentiating SLE from non-SLE. The leading features in the SHAP diagram were anti-double-stranded DNA antibodies, ANA titers, AC4 ANA pattern, polygenic risk scores, complement levels, and SNPs. Additionally, in the subgroup with a high ANA titer (≥ 1:640), six SNPs positively associated with SLE and five SNPs negatively associated with SLE were discovered. CONCLUSIONS ML approaches offer the potential to assist in diagnosing SLE and uncovering novel SNPs in a group of patients with autoimmunity.
Affiliation(s)
- Chih-Wei Chung
- Department of Information Management, National Taiwan University, Taipei, Taiwan
| | - Seng-Cho Chou
- Department of Information Management, National Taiwan University, Taipei, Taiwan
| | - Tzu-Hung Hsiao
- Department of Medical Research, Taichung Veterans General Hospital, Taichung, Taiwan
- Department of Public Health, Fu Jen Catholic University, New Taipei City, Taiwan
- Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan
| | - Grace Joyce Zhang
- Department of Cellular and Physiological Sciences, The University of British Columbia, Vancouver, BC, Canada
| | - Yu-Fang Chung
- Department of Electrical Engineering, Tunghai University, Taichung, Taiwan
| | - Yi-Ming Chen
- Department of Medical Research, Taichung Veterans General Hospital, Taichung, Taiwan.
- Division of Allergy, Immunology and Rheumatology, Department of Internal Medicine, Taichung Veterans General Hospital, 1650, Section 4, Taiwan Boulevard, Xitun Dist., Taichung City, 407, Taiwan.
- Department of Post-Baccalaureate Medicine, College of Medicine, National Chung Hsing University, Taichung, Taiwan.
- School of Medicine, College of Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan.
- Rong Hsing Research Center for Translational Medicine & Ph.D. Program in Translational Medicine, National Chung Hsing University, Taichung, Taiwan.
- Precision Medicine Research Center, College of Medicine, National Chung Hsing University, Taichung, Taiwan.
10
Hrytsenko Y, Shea B, Elgart M, Kurniansyah N, Lyons G, Morrison AC, Carson AP, Haring B, Mitchel BD, Psaty BM, Jaeger BC, Gu CC, Kooperberg C, Levy D, Lloyd-Jones D, Choi E, Brody JA, Smith JA, Rotter JI, Moll M, Fornage M, Simon N, Castaldi P, Casanova R, Chung RH, Kaplan R, Loos RJ, Kardia SLR, Rich SS, Redline S, Kelly T, O’Connor T, Zhao W, Kim W, Guo X, Der Ida Chen Y, Sofer T. Machine learning models for blood pressure phenotypes combining multiple polygenic risk scores. medRxiv 2023:2023.12.13.23299909. [PMID: 38168328 PMCID: PMC10760279 DOI: 10.1101/2023.12.13.23299909] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/05/2024]
Abstract
We construct non-linear machine learning (ML) prediction models for systolic and diastolic blood pressure (SBP, DBP) using demographic and clinical variables and polygenic risk scores (PRSs). We developed a two-model ensemble, consisting of a baseline model, where prediction is based on demographic and clinical variables only, and a genetic model, where we also include PRSs. We evaluate the use of a linear versus a non-linear model at both the baseline and the genetic model levels and assess the improvement in performance when incorporating multiple PRSs. We report the ensemble model's performance as percentage variance explained (PVE) on a held-out test dataset. A non-linear baseline model improved the PVEs from 28.1% to 30.1% (SBP) and 14.3% to 17.4% (DBP) compared with a linear baseline model. Including seven PRSs in the genetic model computed based on the largest available GWAS of SBP/DBP improved the genetic model PVE from 4.8% to 5.1% (SBP) and 4.7% to 5% (DBP) compared to using a single PRS. Adding 14 additional PRSs computed based on two independent GWASs further increased the genetic model PVE to 6.3% (SBP) and 5.7% (DBP). PVE differed across self-reported race/ethnicity groups, with primarily all non-White groups benefitting from the inclusion of additional PRSs.
Affiliation(s)
- Yana Hrytsenko
- Department of Medicine, Brigham and Women’s Hospital, Boston, MA
- Department of Medicine, Harvard Medical School, Boston, MA
- CardioVascular Institute (CVI), Beth Israel Deaconess Medical Center, Boston, MA
| | - Benjamin Shea
- CardioVascular Institute (CVI), Beth Israel Deaconess Medical Center, Boston, MA
| | - Michael Elgart
- Department of Medicine, Brigham and Women’s Hospital, Boston, MA
- Department of Medicine, Harvard Medical School, Boston, MA
| | | | - Genevieve Lyons
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
| | - Alanna C. Morrison
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | - April P. Carson
- Department of Medicine, University of Mississippi Medical Center, Jackson, MS, USA
| | - Bernhard Haring
- Department of Epidemiology & Population Health, Albert Einstein College of Medicine, Bronx, NY, USA
- Department of Medicine III, Saarland University, Homburg, Saarland, Germany
| | - Braxton D. Mitchel
- Department of Medicine, University of Maryland School of Medicine, Baltimore, MD, USA
| | - Bruce M. Psaty
- Department of Medicine, University of Washington, Seattle, WA, USA
- Department of Epidemiology, University of Washington, Seattle, WA, USA
- Cardiovascular Health Research Unit, University of Washington, Seattle, WA, USA
- Health Systems and Population Health, University of Washington, Seattle, WA, USA
| | - Byron C. Jaeger
- Department of Biostatistics and Data Science, Wake Forest University School of Medicine, Winston-Salem, NC, USA
| | - C Charles Gu
- The Center for Biostatistics and Data Science, Washington University, St. Louis, USA
| | - Charles Kooperberg
- Division of Public Health Sciences, Fred Hutchinson Cancer Center, Seattle, WA, USA
| | - Daniel Levy
- The Population Sciences Branch of the National Heart, Lung and Blood Institute, Bethesda, MD, USA
- The Framingham Heart Study, Framingham, MA, USA
| | - Donald Lloyd-Jones
- Department of Preventive Medicine, Northwestern University, Chicago, IL, USA
| | - Eunhee Choi
- Columbia Hypertension Laboratory, Department of Medicine, Columbia University Irving Medical Center, New York, NY, USA
| | - Jennifer A Brody
- Department of Medicine, University of Washington, Seattle, WA, USA
- Cardiovascular Health Research Unit, University of Washington, Seattle, WA, USA
| | - Jennifer A Smith
- Department of Epidemiology, School of Public Health, University of Michigan, Ann Arbor, MI, USA
- Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, MI, USA
| | - Jerome I. Rotter
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
| | - Matthew Moll
- Department of Medicine, Brigham and Women’s Hospital, Boston, MA
- Department of Medicine, Harvard Medical School, Boston, MA
- VA Boston Healthcare System, West Roxbury, MA, USA
| | - Myriam Fornage
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Brown Foundation Institute of Molecular Medicine, McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Noah Simon
- Department of Biostatistics, School of Public Health, University of Washington, Seattle, WA
| | - Peter Castaldi
- Department of Medicine, Brigham and Women’s Hospital, Boston, MA
- Department of Medicine, Harvard Medical School, Boston, MA
| | - Ramon Casanova
- Health Systems and Population Health, University of Washington, Seattle, WA, USA
| | - Ren-Hua Chung
- Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Taipei City, Taiwan
| | - Robert Kaplan
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
- Department of Epidemiology & Population Health, Albert Einstein College of Medicine, Bronx, NY, USA
| | - Ruth J.F. Loos
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty for Health and Medical Sciences, University of Copenhagen, Denmark, DK
| | - Sharon L. R. Kardia
- Department of Epidemiology, School of Public Health, University of Michigan, Ann Arbor, MI, USA
| | - Stephen S. Rich
- Center for Public Health Genomics, University of Virginia School of Medicine, Charlottesville, VA, USA
| | - Susan Redline
- Department of Medicine, Harvard Medical School, Boston, MA
- Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Boston, MA, USA
| | - Tanika Kelly
- Department of Epidemiology, Tulane University School of Public Health and Tropical Medicine, New Orleans, LA, USA
| | - Timothy O’Connor
- Department of Medicine III, Saarland University, Homburg, Saarland, Germany
| | - Wei Zhao
- Department of Epidemiology, School of Public Health, University of Michigan, Ann Arbor, MI, USA
- Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, MI, USA
| | - Wonji Kim
- Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital
| | - Xiuqing Guo
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
| | - Yii Der Ida Chen
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
| | | | - Tamar Sofer
- Department of Medicine, Brigham and Women’s Hospital, Boston, MA
- Department of Medicine, Harvard Medical School, Boston, MA
- CardioVascular Institute (CVI), Beth Israel Deaconess Medical Center, Boston, MA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
11
Bettencourt C, Skene N, Bandres-Ciga S, Anderson E, Winchester LM, Foote IF, Schwartzentruber J, Botia JA, Nalls M, Singleton A, Schilder BM, Humphrey J, Marzi SJ, Toomey CE, Kleifat AA, Harshfield EL, Garfield V, Sandor C, Keat S, Tamburin S, Frigerio CS, Lourida I, Ranson JM, Llewellyn DJ. Artificial intelligence for dementia genetics and omics. Alzheimers Dement 2023; 19:5905-5921. [PMID: 37606627 PMCID: PMC10841325 DOI: 10.1002/alz.13427] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 04/05/2023] [Revised: 07/14/2023] [Accepted: 07/18/2023] [Indexed: 08/23/2023]
Abstract
Genetics and omics studies of Alzheimer's disease and other dementia subtypes enhance our understanding of underlying mechanisms and pathways that can be targeted. We identified key remaining challenges: First, can we enhance genetic studies to address missing heritability? Can we identify reproducible omics signatures that differentiate between dementia subtypes? Can high-dimensional omics data identify improved biomarkers? How can genetics inform our understanding of causal status of dementia risk factors? And which biological processes are altered by dementia-related genetic variation? Artificial intelligence (AI) and machine learning approaches give us powerful new tools in helping us to tackle these challenges, and we review possible solutions and examples of best practice. However, their limitations also need to be considered, as well as the need for coordinated multidisciplinary research and diverse deeply phenotyped cohorts. Ultimately AI approaches improve our ability to interrogate genetics and omics data for precision dementia medicine. HIGHLIGHTS: We have identified five key challenges in dementia genetics and omics studies. AI can enable detection of undiscovered patterns in dementia genetics and omics data. Enhanced and more diverse genetics and omics datasets are still needed. Multidisciplinary collaborative efforts using AI can boost dementia research.
Affiliation(s)
- Conceicao Bettencourt
- Department of Neurodegenerative Disease, UCL Queen Square Institute of Neurology, London, UK
- Queen Square Brain Bank for Neurological Disorders, UCL Queen Square Institute of Neurology, London, UK
| | - Nathan Skene
- UK Dementia Research Institute, Imperial College London, London, UK
- Department of Brain Sciences, Imperial College London, London, UK
| | - Sara Bandres-Ciga
- Center for Alzheimer's and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, Maryland, USA
| | - Emma Anderson
- Department of Mental Health of Older People, Division of Psychiatry, University College London, London, UK
| | | | - Isabelle F Foote
- Institute for Behavioral Genetics, University of Colorado Boulder, Boulder, Colorado, USA
| | - Jeremy Schwartzentruber
- Open Targets, Cambridge, UK
- Wellcome Sanger Institute, Cambridge, UK
- Illumina Artificial Intelligence Laboratory, Illumina Inc, Foster City, California, USA
| | - Juan A Botia
- Departamento de Ingeniería de la Información y las Comunicaciones, Universidad de Murcia, Murcia, Spain
| | - Mike Nalls
- Center for Alzheimer's and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, Maryland, USA
- Data Tecnica International LLC, Washington, DC, USA
| | - Andrew Singleton
- Center for Alzheimer's and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, Maryland, USA
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, Maryland, USA
| | - Brian M Schilder
- UK Dementia Research Institute, Imperial College London, London, UK
- Department of Brain Sciences, Imperial College London, London, UK
| | - Jack Humphrey
- Nash Family Department of Neuroscience and Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, USA
| | - Sarah J Marzi
- UK Dementia Research Institute, Imperial College London, London, UK
- Department of Brain Sciences, Imperial College London, London, UK
| | - Christina E Toomey
- Queen Square Brain Bank for Neurological Disorders, UCL Queen Square Institute of Neurology, London, UK
- Department of Clinical and Movement Neuroscience, UCL Queen Square Institute of Neurology, London, UK
- The Francis Crick Institute, London, UK
| | - Ahmad Al Kleifat
- Department of Basic and Clinical Neuroscience, Maurice Wohl Clinical Neuroscience Institute, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
| | - Eric L Harshfield
- Stroke Research Group, Department of Clinical Neurosciences, University of Cambridge, Cambridge, UK
| | - Victoria Garfield
- MRC Unit for Lifelong Health and Ageing, Institute of Cardiovascular Science, University College London, London, UK
| | - Cynthia Sandor
- UK Dementia Research Institute. School of Medicine, Cardiff University, Cardiff, UK
| | - Samuel Keat
- UK Dementia Research Institute. School of Medicine, Cardiff University, Cardiff, UK
| | - Stefano Tamburin
- Department of Neurosciences, Biomedicine and Movement Sciences, Neurology Section, University of Verona, Verona, Italy
| | - Carlo Sala Frigerio
- UK Dementia Research Institute, Queen Square Institute of Neurology, University College London, London, UK
| | | | | | - David J Llewellyn
- University of Exeter Medical School, Exeter, UK
- The Alan Turing Institute, London, UK
12
Khanna NN, Singh M, Maindarkar M, Kumar A, Johri AM, Mentella L, Laird JR, Paraskevas KI, Ruzsa Z, Singh N, Kalra MK, Fernandes JFE, Chaturvedi S, Nicolaides A, Rathore V, Singh I, Teji JS, Al-Maini M, Isenovic ER, Viswanathan V, Khanna P, Fouda MM, Saba L, Suri JS. Polygenic Risk Score for Cardiovascular Diseases in Artificial Intelligence Paradigm: A Review. J Korean Med Sci 2023; 38:e395. [PMID: 38013648 PMCID: PMC10681845 DOI: 10.3346/jkms.2023.38.e395] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2023] [Accepted: 10/15/2023] [Indexed: 11/29/2023]
Abstract
Cardiovascular disease (CVD) related mortality and morbidity heavily strain society. The relationship between external risk factors and our genetics has not been well established. It is widely acknowledged that environmental influence and individual behaviours play a significant role in CVD vulnerability, leading to the development of polygenic risk scores (PRS). We employed the PRISMA search method to locate pertinent research and literature to extensively review artificial intelligence (AI)-based PRS models for CVD risk prediction. Furthermore, we analyzed and compared conventional vs. AI-based solutions for PRS. We summarized the recent advances in our understanding of the use of AI-based PRS for risk prediction of CVD. Our study proposes three hypotheses: i) Multiple genetic variations and risk factors can be incorporated into AI-based PRS to improve the accuracy of CVD risk prediction. ii) AI-based PRS for CVD circumvents the drawbacks of conventional PRS calculators by incorporating a larger variety of genetic and non-genetic components, allowing for more precise and individualised risk estimations. iii) Using AI approaches, it is possible to significantly reduce the dimensionality of huge genomic datasets, resulting in more accurate and effective disease risk prediction models. Our study highlighted that the AI-PRS model outperformed traditional PRS calculators in predicting CVD risk. Furthermore, using AI-based methods to calculate PRS may increase the precision of risk predictions for CVD and have significant ramifications for individualized prevention and treatment plans.
Affiliation(s)
- Narendra N Khanna
- Department of Cardiology, Indraprastha APOLLO Hospitals, New Delhi, India
- Asia Pacific Vascular Society, New Delhi, India
| | - Manasvi Singh
- Stroke Monitoring and Diagnostic Division, AtheroPoint™, Roseville, CA, USA
- Bennett University, Greater Noida, India
| | - Mahesh Maindarkar
- Asia Pacific Vascular Society, New Delhi, India
- Stroke Monitoring and Diagnostic Division, AtheroPoint™, Roseville, CA, USA
- School of Bioengineering Sciences and Research, Maharashtra Institute of Technology's Art, Design and Technology University, Pune, India
| | | | - Amer M Johri
- Department of Medicine, Division of Cardiology, Queen's University, Kingston, Canada
| | - Laura Mentella
- Department of Medicine, Division of Cardiology, University of Toronto, Toronto, Canada
| | - John R Laird
- Heart and Vascular Institute, Adventist Health St. Helena, St. Helena, CA, USA
| | | | - Zoltan Ruzsa
- Invasive Cardiology Division, University of Szeged, Szeged, Hungary
| | - Narpinder Singh
- Department of Food Science and Technology, Graphic Era Deemed to be University, Dehradun, Uttarakhand, India
| | | | | | - Seemant Chaturvedi
- Department of Neurology & Stroke Program, University of Maryland, Baltimore, MD, USA
| | - Andrew Nicolaides
- Vascular Screening and Diagnostic Centre and University of Nicosia Medical School, Cyprus
| | - Vijay Rathore
- Nephrology Department, Kaiser Permanente, Sacramento, CA, USA
| | - Inder Singh
- Stroke Monitoring and Diagnostic Division, AtheroPoint™, Roseville, CA, USA
| | - Jagjit S Teji
- Ann and Robert H. Lurie Children's Hospital of Chicago, Chicago, IL, USA
| | - Mostafa Al-Maini
- Allergy, Clinical Immunology and Rheumatology Institute, Toronto, ON, Canada
| | - Esma R Isenovic
- Department of Radiobiology and Molecular Genetics, National Institute of The Republic of Serbia, University of Belgrade, Beograd, Serbia
| | | | - Puneet Khanna
- Department of Anaesthesiology, AIIMS, New Delhi, India
| | - Mostafa M Fouda
- Department of Electrical and Computer Engineering, Idaho State University, Pocatello, ID, USA
| | - Luca Saba
- Department of Radiology, Azienda Ospedaliero Universitaria, Cagliari, Italy
| | - Jasjit S Suri
- Asia Pacific Vascular Society, New Delhi, India
- Stroke Monitoring and Diagnostic Division, AtheroPoint™, Roseville, CA, USA
- Department of Computer Engineering, Graphic Era Deemed to be University, Dehradun, India.
13
Toussaint PA, Leiser F, Thiebes S, Schlesner M, Brors B, Sunyaev A. Explainable artificial intelligence for omics data: a systematic mapping study. Brief Bioinform 2023; 25:bbad453. [PMID: 38113073 PMCID: PMC10729786 DOI: 10.1093/bib/bbad453] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2022] [Revised: 07/28/2023] [Accepted: 11/08/2023] [Indexed: 12/21/2023]
Abstract
Researchers increasingly turn to explainable artificial intelligence (XAI) to analyze omics data and gain insights into the underlying biological processes. Yet, given the interdisciplinary nature of the field, many findings have only been shared in their respective research community. An overview of XAI for omics data is needed to highlight promising approaches and help detect common issues. Toward this end, we conducted a systematic mapping study. To identify relevant literature, we queried Scopus, PubMed, Web of Science, BioRxiv, MedRxiv and arXiv. Based on keywording, we developed a coding scheme with 10 facets regarding the studies' AI methods, explainability methods and omics data. Our mapping study resulted in 405 included papers published between 2010 and 2023. The inspected papers analyze DNA-based (mostly genomic), transcriptomic, proteomic or metabolomic data by means of neural networks, tree-based methods, statistical methods and further AI methods. The preferred post-hoc explainability methods are feature relevance (n = 166) and visual explanation (n = 52), while papers using interpretable approaches often resort to the use of transparent models (n = 83) or architecture modifications (n = 72). With many research gaps still apparent for XAI for omics data, we deduced eight research directions and discuss their potential for the field. We also provide exemplary research questions for each direction. Many problems with the adoption of XAI for omics data in clinical practice are yet to be resolved. This systematic mapping study outlines extant research on the topic and provides research directions for researchers and practitioners.
Affiliation(s)
- Philipp A Toussaint
- Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, Germany
- HIDSS4Health – Helmholtz Information and Data Science School for Health, Karlsruhe, Heidelberg, Germany
| | - Florian Leiser
- Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - Scott Thiebes
- Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - Matthias Schlesner
- Biomedical Informatics, Data Mining and Data Analytics, Faculty of Applied Computer Science and Medical Faculty, University of Augsburg, Augsburg, Germany
| | - Benedikt Brors
- Division of Applied Bioinformatics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Translational Oncology, National Center for Tumor Diseases, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Ali Sunyaev
- Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, Germany
14
Sadeqi MB, Ballvora A, Dadshani S, Léon J. Genetic Parameter and Hyper-Parameter Estimation Underlie Nitrogen Use Efficiency in Bread Wheat. Int J Mol Sci 2023; 24:14275. [PMID: 37762585 PMCID: PMC10531695 DOI: 10.3390/ijms241814275] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Revised: 09/07/2023] [Accepted: 09/14/2023] [Indexed: 09/29/2023]
Abstract
Estimation and prediction play a key role in breeding programs. Currently, phenotyping of complex traits such as nitrogen use efficiency (NUE) in wheat is still expensive, requires high-throughput technologies and is very time consuming compared to genotyping. Therefore, researchers are trying to predict phenotypes based on marker information. Genetic parameters such as population structure, genomic relationship matrix, marker density and sample size are major factors that increase the performance and accuracy of a model. However, they play an important role in adjusting the statistically significant false discovery rate (FDR) threshold in estimation. In parallel, there are many genetic hyper-parameters that are hidden and not represented in the given genomic selection (GS) model but have significant effects on the results, such as panel size, number of markers, minor allele frequency, number of call rates for each marker, number of cross validations and batch size in the training set of the genomic file. The main challenge is to ensure the reliability and accuracy of predicted breeding values (BVs) as results. Our study has confirmed the results of bias-variance tradeoff and adaptive prediction error for the ensemble-learning-based model STACK, which has the highest performance when estimating genetic parameters and hyper-parameters in a given GS model compared to other models.
Affiliation(s)
- Mohammad Bahman Sadeqi
- INRES-Plant Breeding, Rheinische Friedrich-Wilhelms-Universität Bonn, 53113 Bonn, Germany; (M.B.S.); (J.L.)
| | - Agim Ballvora
- INRES-Plant Breeding, Rheinische Friedrich-Wilhelms-Universität Bonn, 53113 Bonn, Germany; (M.B.S.); (J.L.)
| | - Said Dadshani
- INRES-Plant Nutrition, Rheinische Friedrich-Wilhelms-Universität Bonn, 53113 Bonn, Germany;
| | - Jens Léon
- INRES-Plant Breeding, Rheinische Friedrich-Wilhelms-Universität Bonn, 53113 Bonn, Germany; (M.B.S.); (J.L.)
15
Astrologo NCN, Gaudillo JD, Albia JR, Roxas-Villanueva RML. Genetic risk assessment based on association and prediction studies. Sci Rep 2023; 13:15230. [PMID: 37709797 PMCID: PMC10502006 DOI: 10.1038/s41598-023-41862-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2023] [Accepted: 09/01/2023] [Indexed: 09/16/2023]
Abstract
The genetic basis of phenotypic emergence provides valuable information for assessing individual risk. While association studies have been pivotal in identifying genetic risk factors within a population, complementing them with insights derived from prediction studies that assess individual-level risk offers a more comprehensive approach to understanding phenotypic expression. In this study, we established personalized risk assessment models using single-nucleotide polymorphism (SNP) data from 200 Korean patients, of whom 100 experienced hepatitis B surface antigen (HBsAg) seroclearance and 100 demonstrated high levels of HBsAg. The risk assessment models determined the predictive power of the following: (1) genome-wide association study (GWAS)-identified candidate biomarkers considered significant in a reference study and (2) machine learning (ML)-identified candidate biomarkers with the highest feature importance scores obtained by using random forest (RF). While utilizing all features yielded 64% model accuracy, using relevant biomarkers achieved higher model accuracies: 82% for 52 GWAS-identified candidate biomarkers, 71% for three GWAS-identified biomarkers, and 80% for 150 ML-identified candidate biomarkers. Findings highlight that the joint contributions of relevant biomarkers significantly influence phenotypic emergence. On the other hand, combining ML-identified candidate biomarkers into the pool of GWAS-identified candidate biomarkers resulted in an improved predictive accuracy of 90%, demonstrating the capability of ML as an auxiliary analysis to GWAS. Furthermore, some of the ML-identified candidate biomarkers were found to be linked with hepatocellular carcinoma (HCC), reinforcing previous claims that HCC can still occur despite the absence of HBsAg.
Affiliation(s)
- Nicole Cathlene N Astrologo
  - Data Analytics Research Laboratory (DARELab), Institute of Mathematical Sciences and Physics, University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
  - Computational Interdisciplinary Research Laboratory (CINTERLabs), University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
- Joverlyn D Gaudillo
  - Data Analytics Research Laboratory (DARELab), Institute of Mathematical Sciences and Physics, University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
  - Computational Interdisciplinary Research Laboratory (CINTERLabs), University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
  - Domingo AI Research Center (DARC Labs), 1606, Pasig, Philippines
- Jason R Albia
  - Domingo AI Research Center (DARC Labs), 1606, Pasig, Philippines
  - Venn Biosciences Corporation Dba InterVenn Biosciences, Metro Manila, Pasig, Philippines
  - Graduate School, University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
- Ranzivelle Marianne L Roxas-Villanueva
  - Data Analytics Research Laboratory (DARELab), Institute of Mathematical Sciences and Physics, University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
  - Computational Interdisciplinary Research Laboratory (CINTERLabs), University of the Philippines Los Baños, 4031, Los Baños, Laguna, Philippines
|
16
|
Lakiotaki K, Papadovasilakis Z, Lagani V, Fafalios S, Charonyktakis P, Tsagris M, Tsamardinos I. Automated machine learning for genome wide association studies. Bioinformatics 2023; 39:btad545. [PMID: 37672022 PMCID: PMC10562960 DOI: 10.1093/bioinformatics/btad545] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2023] [Revised: 06/29/2023] [Accepted: 09/05/2023] [Indexed: 09/07/2023] Open
Abstract
MOTIVATION Genome-wide association studies (GWAS) present several computational and statistical challenges for their data analysis, including knowledge discovery, interpretability, and translation to clinical practice. RESULTS We develop, apply, and comparatively evaluate an automated machine learning (AutoML) approach, customized for genomic data that delivers reliable predictive and diagnostic models, the set of genetic variants that are important for predictions (called a biosignature), and an estimate of the out-of-sample predictive power. This AutoML approach discovers variants with higher predictive performance compared to standard GWAS methods, computes an individual risk prediction score, generalizes to new, unseen data, is shown to better differentiate causal variants from other highly correlated variants, and enhances knowledge discovery and interpretability by reporting multiple equivalent biosignatures. AVAILABILITY AND IMPLEMENTATION Code for this study is available at: https://github.com/mensxmachina/autoML-GWAS. JADBio offers a free version at: https://jadbio.com/sign-up/. SNP data can be downloaded from the EGA repository (https://ega-archive.org/). PRS data are found at: https://www.aicrowd.com/challenges/opensnp-height-prediction. Simulation data to study population structure can be found at: https://easygwas.ethz.ch/data/public/dataset/view/1/.
Affiliation(s)
- Zaharias Papadovasilakis
  - Department of Computer Science, University of Crete, Heraklion, Greece
  - JADBio Gnosis DA S.A., Science and Technology Park of Crete, GR-70013 Heraklion, Greece
  - Laboratory of Immune Regulation and Tolerance, School of Medicine, University of Crete, Heraklion, Greece
- Vincenzo Lagani
  - Biological and Environmental Sciences and Engineering Division (BESE), King Abdullah University of Science and Technology KAUST, Thuwal 23952, Saudi Arabia
  - SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence, Thuwal 23952, Saudi Arabia
  - Institute of Chemical Biology, Ilia State University, Tbilisi, Georgia
- Stefanos Fafalios
  - Department of Computer Science, University of Crete, Heraklion, Greece
  - JADBio Gnosis DA S.A., Science and Technology Park of Crete, GR-70013 Heraklion, Greece
- Paulos Charonyktakis
  - JADBio Gnosis DA S.A., Science and Technology Park of Crete, GR-70013 Heraklion, Greece
- Michail Tsagris
  - Department of Computer Science, University of Crete, Heraklion, Greece
  - Department of Economics, University of Crete, Heraklion, Greece
- Ioannis Tsamardinos
  - Department of Computer Science, University of Crete, Heraklion, Greece
  - JADBio Gnosis DA S.A., Science and Technology Park of Crete, GR-70013 Heraklion, Greece
|
17
|
Alatrany AS, Khan W, Hussain AJ, Mustafina J, Al-Jumeily D. Transfer Learning for Classification of Alzheimer's Disease Based on Genome Wide Data. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:2700-2711. [PMID: 37018274 DOI: 10.1109/tcbb.2022.3233869] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 06/19/2023]
Abstract
Alzheimer's disease (AD) is a brain disorder regarded as degenerative because its symptoms worsen over time. Single nucleotide polymorphisms (SNPs) have been identified as relevant biomarkers for this condition. This study aims to identify SNP biomarkers associated with AD in order to perform a reliable classification of AD. In contrast to existing related works, we utilize deep transfer learning with varying experimental analysis for reliable classification of AD. For this purpose, convolutional neural networks (CNN) are first trained on the genome-wide association studies (GWAS) dataset requested from the AD Neuroimaging Initiative. We then employ deep transfer learning to further train our CNN (as the base model) on a different AD GWAS dataset, to extract the final set of features. The extracted features are then fed into a support vector machine for classification of AD. Detailed experiments are performed using multiple datasets and varying experimental configurations. The statistical outcomes indicate an accuracy of 89%, which is a significant improvement when benchmarked against existing related works.
|
18
|
Xiu Y, Jiang C, Zhang S, Yu X, Qiao K, Huang Y. Prediction of nonsentinel lymph node metastasis in breast cancer patients based on machine learning. World J Surg Oncol 2023; 21:244. [PMID: 37563717 PMCID: PMC10416453 DOI: 10.1186/s12957-023-03109-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/11/2023] [Accepted: 07/12/2023] [Indexed: 08/12/2023] Open
Abstract
BACKGROUND To develop the best machine learning (ML) model for predicting nonsentinel lymph node metastasis (NSLNM) in breast cancer patients. METHODS From June 2016 to August 2022, 1005 breast cancer patients were included in this retrospective study. Univariate and multivariate analyses were performed using logistic regression. Six ML models were introduced, and their performance was compared. RESULTS NSLNM occurred in 338 (33.6%) of 1005 patients. The best ML model was XGBoost, whose average area under the curve (AUC) based on 10-fold cross-validation was 0.722. It performed better than the nomogram, which was based on logistic regression (AUC: 0.764 vs. 0.706). CONCLUSIONS The ML model XGBoost can predict NSLNM in breast cancer patients well.
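The AUC figures in this abstract summarize how well a model ranks patients with NSLNM above patients without it. The following is a minimal sketch of that metric and of averaging it over cross-validation folds, using only the Python standard library. The function names and toy scores are illustrative assumptions and are not taken from the study, which trained XGBoost on its own cohort.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_cv_auc(fold_results):
    """Average the AUC over folds; each fold is (held_out_labels, predicted_scores)."""
    aucs = [roc_auc(labels, scores) for labels, scores in fold_results]
    return sum(aucs) / len(aucs)


# Hypothetical held-out predictions from two folds (illustrative values only).
folds = [
    ([1, 1, 0, 0], [0.8, 0.3, 0.5, 0.1]),
    ([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2]),
]
average_auc = mean_cv_auc(folds)
```

With 10-fold cross-validation the list would hold ten such pairs, one per held-out fold.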
Affiliation(s)
- Yuting Xiu
  - Department of Breast Surgery, Harbin Medical University Cancer Hospital, Harbin, 150086, China
- Cong Jiang
  - Department of Breast Surgery, Harbin Medical University Cancer Hospital, Harbin, 150086, China
- Shiyuan Zhang
  - Department of Breast Surgery, Harbin Medical University Cancer Hospital, Harbin, 150086, China
- Xiao Yu
  - Department of Breast Surgery, Harbin Medical University Cancer Hospital, Harbin, 150086, China
- Kun Qiao
  - Department of Breast Surgery, Harbin Medical University Cancer Hospital, Harbin, 150086, China
- Yuanxi Huang
  - Department of Breast Surgery, Harbin Medical University Cancer Hospital, Harbin, 150086, China
|
19
|
Gao Y, Sharma T, Cui Y. Addressing the Challenge of Biomedical Data Inequality: An Artificial Intelligence Perspective. Annu Rev Biomed Data Sci 2023; 6:153-171. [PMID: 37104653 PMCID: PMC10529864 DOI: 10.1146/annurev-biodatasci-020722-020704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/29/2023]
Abstract
Artificial intelligence (AI) and other data-driven technologies hold great promise to transform healthcare and confer the predictive power essential to precision medicine. However, the existing biomedical data, which are a vital resource and foundation for developing medical AI models, do not reflect the diversity of the human population. The low representation in biomedical data has become a significant health risk for non-European populations, and the growing application of AI opens a new pathway for this health risk to manifest and amplify. Here we review the current status of biomedical data inequality and present a conceptual framework for understanding its impacts on machine learning. We also discuss the recent advances in algorithmic interventions for mitigating health disparities arising from biomedical data inequality. Finally, we briefly discuss the newly identified disparity in data quality among ethnic groups and its potential impacts on machine learning.
Affiliation(s)
- Yan Gao
  - Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, Tennessee, USA
- Teena Sharma
  - Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, Tennessee, USA
- Yan Cui
  - Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, Tennessee, USA
|
20
|
Choudhary A, Anand A, Singh A, Roy P, Singh N, Kumar V, Sharma S, Baranwal M. Machine learning-based ensemble approach in prediction of lung cancer predisposition using XRCC1 gene polymorphism. J Biomol Struct Dyn 2023:1-10. [PMID: 37545160 DOI: 10.1080/07391102.2023.2242492] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/03/2022] [Accepted: 07/23/2023] [Indexed: 08/08/2023]
Abstract
The employment of machine learning approaches has shown promising results in predicting cancer. In the current study, polymorphism data for five single nucleotide polymorphisms (SNPs) of the DNA repair gene XRCC1 (XRCC1 399, XRCC1 194, XRCC1 206, XRCC1 632, XRCC1 280) from the north Indian population, along with four smoking status data, are considered as input to the proposed ensemble model to predict individual susceptibility to lung cancer. The prediction accuracy of the proposed ensemble model for cancer predisposition was found to be 85%. The model performance is also evaluated using sensitivity, specificity, precision and the Gini index, which is found in the range of 0.83-0.87. The proposed model also outperformed the individual models (LM, SVM, RF, KNN and baseline neural net) on all evaluation parameters. Collectively, the current results suggest the potential of the proposed ensemble model in predicting the risk of cancer based on XRCC1 SNP data. Communicated by Ramaswamy H. Sarma.
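The evaluation measures named in this abstract (accuracy, sensitivity, specificity, precision) all derive from a confusion matrix. Below is a minimal sketch of combining several binary classifiers and scoring the result. The abstract does not state how the ensemble combines its base models, so the majority-vote rule here is an assumption, and all names and toy predictions are hypothetical.

```python
def majority_vote(predictions_per_model):
    """Combine binary predictions from several models by strict majority.

    predictions_per_model holds one list of 0/1 predictions per model.
    A tie (possible with an even number of models) is resolved to 0.
    """
    n_models = len(predictions_per_model)
    return [int(2 * sum(votes) > n_models) for votes in zip(*predictions_per_model)]


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and precision for binary labels.

    Assumes both classes occur in y_true and at least one positive is predicted.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Hypothetical outputs of three base models on four individuals.
base_predictions = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
]
ensemble = majority_vote(base_predictions)
scores = classification_metrics([1, 0, 1, 1], ensemble)
```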
Affiliation(s)
- Abhishek Choudhary
  - Department of Computer Science, Thapar Institute of Engineering & Technology, India
- Adarsh Anand
  - Department of Electronics & Communication Engineering, Thapar Institute of Engineering & Technology, India
- Amrita Singh
  - Department of Biotechnology, Thapar Institute of Engineering & Technology, Patiala, Punjab, India
- Pratima Roy
  - Department of Biotechnology, Thapar Institute of Engineering & Technology, Patiala, Punjab, India
- Navneet Singh
  - Department of Pulmonary Medicine, Post Graduate Institute of Education and Medical Research (PGIMER), Chandigarh, India
- Vinay Kumar
  - Department of Electronics & Communication Engineering, Thapar Institute of Engineering & Technology, India
- Siddharth Sharma
  - Department of Biotechnology, Thapar Institute of Engineering & Technology, Patiala, Punjab, India
- Manoj Baranwal
  - Department of Biotechnology, Thapar Institute of Engineering & Technology, Patiala, Punjab, India
|
21
|
Sitinjak BDP, Murdaya N, Rachman TA, Zakiyah N, Barliana MI. The Potential of Single Nucleotide Polymorphisms (SNPs) as Biomarkers and Their Association with the Increased Risk of Coronary Heart Disease: A Systematic Review. Vasc Health Risk Manag 2023; 19:289-301. [PMID: 37179817 PMCID: PMC10167955 DOI: 10.2147/vhrm.s405039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2023] [Accepted: 04/30/2023] [Indexed: 05/15/2023] Open
Abstract
Human genetic analyses and epidemiological studies showed a potential association between several types of gene polymorphism and the development of coronary heart disease (CHD). Many studies on this pertinent topic need to be investigated further to reach an evidence-based conclusion. Therefore, in this current review, we describe several types of gene polymorphisms that are potentially linked to CHD. A systematic search of the EBSCO, PubMed, and ScienceDirect databases was conducted up to October 2022 to find relevant studies on gene polymorphisms and risk factors for CHD, especially factors associated with single nucleotide polymorphisms (SNPs). Risk of bias and study quality were assessed using Joanna Briggs Institute (JBI) guidelines. From keyword search results, a total of 6243 articles were identified, which were subsequently narrowed to 14 articles using prespecified inclusion criteria. The results suggested that there were 33 SNPs that can potentially increase the risk factors and clinical symptoms of CHD. This study also indicated that gene polymorphisms had a potential role in increasing CHD risk factors that were causally associated with atherosclerosis, increased homocysteine, immune/inflammatory response, Low-Density Lipoprotein (LDL), arterial lesions, and reduction of therapeutic effectiveness. In conclusion, the findings of this study indicate that SNPs may increase risk factors for CHD and that SNPs show different effects between individuals. This demonstrates that knowledge of SNPs on CHD risk factors can be used to develop biomarkers for diagnostics and therapeutic response prediction to decide successful therapy and become the basis for defining personalized medicine in the future.
Affiliation(s)
- Bernap Dwi Putra Sitinjak
  - Department of Pharmacology and Clinical Pharmacy, Faculty of Pharmacy, Universitas Padjadjaran, Bandung, West Java, Indonesia
- Niky Murdaya
  - Department of Pharmacology and Clinical Pharmacy, Faculty of Pharmacy, Universitas Padjadjaran, Bandung, West Java, Indonesia
- Tiara Anisya Rachman
  - Department of Pharmacology and Clinical Pharmacy, Faculty of Pharmacy, Universitas Padjadjaran, Bandung, West Java, Indonesia
- Neily Zakiyah
  - Department of Pharmacology and Clinical Pharmacy, Faculty of Pharmacy, Universitas Padjadjaran, Bandung, West Java, Indonesia
  - Center of Excellence for Pharmaceutical Care Innovation, Universitas Padjadjaran, Bandung, West Java, Indonesia
- Melisa Intan Barliana
  - Center of Excellence for Pharmaceutical Care Innovation, Universitas Padjadjaran, Bandung, West Java, Indonesia
  - Department of Biological Pharmacy, Biotechnology Pharmacy Laboratory, Faculty of Pharmacy, Universitas Padjadjaran, Bandung, West Java, Indonesia
|
22
|
Alzoubi H, Alzubi R, Ramzan N. Deep Learning Framework for Complex Disease Risk Prediction Using Genomic Variations. Sensors (Basel) 2023; 23:s23094439. [PMID: 37177642 PMCID: PMC10181706 DOI: 10.3390/s23094439] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2023] [Revised: 04/05/2023] [Accepted: 04/26/2023] [Indexed: 05/15/2023]
Abstract
Genome-wide association studies have proven their ability to improve human health outcomes by identifying genotypes associated with phenotypes. Various works have attempted to predict the risk of diseases for individuals based on genotype data. This prediction can either be considered as an analysis model that can lead to a better understanding of gene functions that underlie human disease or as a black box in order to be used in decision support systems and in early disease detection. Deep learning techniques have gained more popularity recently. In this work, we propose a deep-learning framework for disease risk prediction. The proposed framework employs a multilayer perceptron (MLP) in order to predict individuals' disease status. The proposed framework was applied to the Wellcome Trust Case-Control Consortium (WTCCC), the UK National Blood Service (NBS) Control Group, and the 1958 British Birth Cohort (58C) datasets. The performance comparison of the proposed framework showed that the proposed approach outperformed the other methods in predicting disease risk, achieving an area under the curve (AUC) up to 0.94.
Affiliation(s)
- Hadeel Alzoubi
  - Department of Computer Science, College of Computer Science and Information Technology, King Faisal University, Al-Ahsa 31982, Saudi Arabia
- Raid Alzubi
  - Department of Computer Science, College of Computer Science and Information Technology, King Faisal University, Al-Ahsa 31982, Saudi Arabia
- Naeem Ramzan
  - School of Computing, Engineering and Physical Sciences, University of the West of Scotland, High Street, Paisley PA1 2BE, UK
|
23
|
Liu Y, Wang L, Wang Z, He S. Association study of selenium-related gene polymorphisms with geriatric depression in China. Medicine (Baltimore) 2023; 102:e33594. [PMID: 37115082 PMCID: PMC10145890 DOI: 10.1097/md.0000000000033594] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2022] [Revised: 03/28/2023] [Accepted: 03/31/2023] [Indexed: 04/29/2023] Open
Abstract
Depression is a common mental health problem in older adults, but its cause remains unclear. Selenium is an essential micronutrient and a powerful antioxidant in the brain and nervous system. Several recent studies have reported a relationship between selenium levels and depression. This study aimed to investigate the relationship between 4 genes co-associated with selenium and geriatric depression. 1486 participants were included in this study from 5 communities in Ningxia Hui Autonomous Region during 2013 to 2016 in a health examination program for urban and rural residents. Polymorphisms of 4 selenium-related genes were analyzed in 1266 healthy volunteers and 220 patients with depression. The genotyping of rs2830072, rs2030324, rs6265, rs11136000, rs7982, rs10510412, rs1801282, rs1151999, rs17793951, rs709149, rs709154, and rs4135263 were performed by Matrix-assisted laser desorption/ionization-time of flight mass spectrometry (MALDI-TOF-MS) technology. The analysis of selenium-related genes showed that there were significant differences between depression and controls for allele and genotype frequencies of peroxisome proliferator activated receptor gamma (PPARG) rs10510412, rs709149, and rs709154 (all P < .05). In this study, when adjusting for age, sex, marital status, education, and alcohol consumption, results showed that rs709149 and rs709154 were still significantly correlated with geriatric depression in the codominant, dominant, overdominant, and log-additive models. Logistic regression analysis showed that rs709149 AG or GG gene carriers were 1.630 and 1.746 times more susceptible to depression than AA gene carriers (95% CI = 1.042-2.549; 1.207-2.526). The results of this study suggest that the rs709149 polymorphism of the selenium-related gene PPARG is a genetic risk factor for depression in older adults.
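The odds ratios and 95% confidence intervals quoted in this abstract come from adjusted logistic regression. As a simpler illustration of how such an interval is formed, the sketch below computes an unadjusted odds ratio with a Wald confidence interval from a 2x2 table. The counts are hypothetical and unrelated to the study's data, and the function name is an assumption made for this example.

```python
from math import exp, log, sqrt


def odds_ratio_ci(carrier_cases, carrier_controls,
                  noncarrier_cases, noncarrier_controls, z=1.96):
    """Unadjusted odds ratio and Wald confidence interval from a 2x2 table.

    z=1.96 gives an approximate 95% interval. All four cells must be positive.
    """
    a, b, c, d = carrier_cases, carrier_controls, noncarrier_cases, noncarrier_controls
    if min(a, b, c, d) <= 0:
        raise ValueError("all cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: genotype carriers vs non-carriers among cases and controls.
or_value, ci_low, ci_high = odds_ratio_ci(20, 10, 10, 20)
```

An interval whose lower bound stays above 1, as in the abstract's reported intervals, indicates that the association is unlikely to be explained by sampling variation alone under this approximation.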
Affiliation(s)
- Yu Liu
  - Department of Epidemiology and Health Statistics, School of Public Health and Management, Ningxia Medical University, Yinchuan, China
- Liqun Wang
  - Department of Epidemiology and Health Statistics, School of Public Health and Management, Ningxia Medical University, Yinchuan, China
  - Key Laboratory of Environmental Factors and Chronic Disease Control, Yinchuan, China
- Zhizhong Wang
  - Department of Epidemiology and Health Statistics, School of Public Health at Guangdong Medical University, Dongguan, China
- Shulan He
  - Department of Epidemiology and Health Statistics, School of Public Health and Management, Ningxia Medical University, Yinchuan, China
  - Key Laboratory of Environmental Factors and Chronic Disease Control, Yinchuan, China
|
24
|
Wang J, Lange K, Sung V, Morgan A, Saffery R, Wake M. Association of Polygenic Risk Scores for Hearing Difficulty in Older Adults With Hearing Loss in Mid-Childhood and Midlife: A Population-Based Cross-sectional Study Within the Longitudinal Study of Australian Children. JAMA Otolaryngol Head Neck Surg 2023; 149:204-211. [PMID: 36701147 PMCID: PMC9880866 DOI: 10.1001/jamaoto.2022.4466] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 07/04/2022] [Accepted: 11/11/2022] [Indexed: 01/27/2023]
Abstract
Importance Although more than 200 genes have been associated with monogenic congenital hearing loss, the polygenic contribution to hearing decline across the life course remains largely unknown. Objective To examine the association of polygenic risk scores (PRSs) for self-reported hearing difficulty among adults (40-69 years) with measured hearing and speech reception abilities in mid-childhood and early midlife. Design, Setting, and Participants This was a population-based cross-sectional study nested within the Longitudinal Study of Australian Children that included 1608 children and 1642 adults. Pure tone audiometry, speech reception threshold against noise, and genetic data were evaluated. Linear and logistic regressions of PRSs were conducted for hearing outcomes. Study analysis was performed from March 1 to 31, 2022. Main Outcomes and Measures Genotypes were generated from saliva or blood using global single-nucleotide polymorphisms array and PRSs derived from published genome-wide association studies of self-reported hearing difficulty (PRS1) and hearing aid use (PRS2). Hearing outcomes were continuous using the high Fletcher index (mean hearing threshold, 1, 2, and 4 kHz) and speech reception threshold (SRT); and dichotomized for bilateral hearing loss of more than 15 dB HL and abnormal SRT. Results Included in the study were 1608 children (mean [SD] age, 11.5 [0.5] years; 812 [50.5%] male children; 1365 [84.9%] European and 243[15.1%] non-European) and 1642 adults (mean [SD] age, 43.7 [5.1] years; 1442 [87.8%] female adults; 1430 [87.1%] European and 212 [12.9%] non-European individuals). In adults, both PRS1 and PRS2 were associated with hearing thresholds. For each SD increment in PRS1 and PRS2, hearing thresholds were 0.4 (95% CI, 0-0.8) decibel hearing level (dB HL) and 0.9 (95% CI, 0.5-1.2) dB HL higher on the high Fletcher index, respectively. 
Each SD increment in PRS increased the odds of adult hearing loss of more than 15 dB HL by 10% to 30% (OR for PRS1, 1.1; 95% CI, 1.0-1.3; OR for PRS2, 1.3; 95% CI, 1.1-1.5). Similar but attenuated patterns were noted in children (OR for PRS1, 1.1; 95% CI, 0.8-1.2; OR for PRS2, 1.2; 95% CI, 1.0-1.5). Both PRSs showed minimal evidence of associations with speech reception thresholds or abnormal SRT in children or adults. Conclusions and Relevance This population-based cross-sectional study of PRSs for self-reported hearing difficulty among adults found an association with hearing ability in mid-childhood. This adds to the evidence that age-related hearing loss begins as early as the first decade of life and that polygenic inheritance may play a role together with other environmental risk factors.
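The abstract reports effects "for each SD increment" in a polygenic risk score. A minimal sketch of what that involves: a PRS is commonly computed as a weighted sum of risk-allele dosages, then standardized within the sample so that one unit equals one SD. The SNP names, weights, and genotypes below are hypothetical, and the odds scaling assumes a log-linear logistic model, which is an assumption for illustration and not a description of the study's pipeline.

```python
from statistics import mean, stdev


def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1 or 2) over the weighted SNPs.

    SNPs missing from `dosages` are treated as dosage 0 in this sketch.
    """
    return sum(weight * dosages.get(snp, 0) for snp, weight in weights.items())


def standardize(scores):
    """Express raw scores in SD units (z-scores) using the sample mean and SD."""
    mu, sd = mean(scores), stdev(scores)
    return [(s - mu) / sd for s in scores]


def odds_multiplier(or_per_sd, z):
    """Relative odds at z SDs from the mean, assuming a log-linear logistic model."""
    return or_per_sd ** z


# Hypothetical per-SNP effect sizes and genotypes (illustrative only).
weights = {"snp_a": 0.5, "snp_b": 0.25, "snp_c": -0.25}
cohort = [
    {"snp_a": 0, "snp_b": 1, "snp_c": 2},
    {"snp_a": 1, "snp_b": 1, "snp_c": 1},
    {"snp_a": 2, "snp_b": 2, "snp_c": 0},
]
raw_scores = [polygenic_risk_score(person, weights) for person in cohort]
z_scores = standardize(raw_scores)
```

Under that model, an odds ratio of 1.3 per SD (the value reported for PRS2 in adults) would correspond to `odds_multiplier(1.3, 2)`, roughly 1.69 times the odds, for someone two SDs above the mean.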
Affiliation(s)
- Jing Wang
  - Murdoch Children’s Research Institute, Royal Children’s Hospital, Parkville, Victoria, Australia
  - Department of Pediatrics, The University of Melbourne, Parkville, Victoria, Australia
- Katherine Lange
  - Murdoch Children’s Research Institute, Royal Children’s Hospital, Parkville, Victoria, Australia
  - Department of Pediatrics, The University of Melbourne, Parkville, Victoria, Australia
- Valerie Sung
  - Murdoch Children’s Research Institute, Royal Children’s Hospital, Parkville, Victoria, Australia
  - Department of Pediatrics, The University of Melbourne, Parkville, Victoria, Australia
  - Center for Community Child Health, Royal Children’s Hospital, Parkville, Victoria, Australia
- Angela Morgan
  - Murdoch Children’s Research Institute, Royal Children’s Hospital, Parkville, Victoria, Australia
  - Department of Audiology and Speech Pathology, The University of Melbourne, Parkville, Victoria, Australia
  - Speech Pathology Department, Royal Children’s Hospital, Parkville, Victoria, Australia
- Richard Saffery
  - Murdoch Children’s Research Institute, Royal Children’s Hospital, Parkville, Victoria, Australia
  - Department of Pediatrics, The University of Melbourne, Parkville, Victoria, Australia
- Melissa Wake
  - Murdoch Children’s Research Institute, Royal Children’s Hospital, Parkville, Victoria, Australia
  - Department of Pediatrics, The University of Melbourne, Parkville, Victoria, Australia
  - Department of Pediatrics and The Liggins Institute, The University of Auckland, Grafton, Auckland, New Zealand
|
25
|
Kang G, Baek SH, Kim YH, Kim DH, Park JW. Genetic Risk Assessment of Nonsyndromic Cleft Lip with or without Cleft Palate by Linking Genetic Networks and Deep Learning Models. Int J Mol Sci 2023; 24:ijms24054557. [PMID: 36901988 PMCID: PMC10003462 DOI: 10.3390/ijms24054557] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/28/2023] [Revised: 02/13/2023] [Accepted: 02/20/2023] [Indexed: 03/02/2023] Open
Abstract
Recent deep learning algorithms have further improved risk classification capabilities. However, an appropriate feature selection method is required to overcome dimensionality issues in population-based genetic studies. In this Korean case-control study of nonsyndromic cleft lip with or without cleft palate (NSCL/P), we compared the predictive performance of models that were developed by using the genetic-algorithm-optimized neural networks ensemble (GANNE) technique with those models that were generated by eight conventional risk classification methods, including polygenic risk score (PRS), random forest (RF), support vector machine (SVM), extreme gradient boosting (XGBoost), and deep-learning-based artificial neural network (ANN). GANNE, which is capable of automatic input SNP selection, exhibited the highest predictive power, especially in the 10-SNP model (AUC of 88.2%), thus improving the AUC by 23% and 17% compared to PRS and ANN, respectively. Genes mapped with input SNPs that were selected by using a genetic algorithm (GA) were functionally validated for risks of developing NSCL/P in gene ontology and protein-protein interaction (PPI) network analyses. The IRF6 gene, which is most frequently selected via GA, was also a major hub gene in the PPI network. Genes such as RUNX2, MTHFR, PVRL1, TGFB3, and TBX22 significantly contributed to predicting NSCL/P risk. GANNE is an efficient disease risk classification method using a minimum optimal set of SNPs; however, further validation studies are needed to ensure the clinical utility of the model for predicting NSCL/P risk.
Affiliation(s)
- Geon Kang
  - Department of Medical Genetics, College of Medicine, Hallym University, Chuncheon 24252, Republic of Korea
- Seung-Hak Baek
  - Department of Orthodontics, School of Dentistry, Seoul National University, Seoul 03080, Republic of Korea
- Young Ho Kim
  - Department of Orthodontics, The Institute of Oral Health Science, Samsung Medical Center, School of Medicine, Sungkyunkwan University, Seoul 06351, Republic of Korea
- Dong-Hyun Kim
  - Department of Social and Preventive Medicine, College of Medicine, Hallym University, Chuncheon 24252, Republic of Korea
- Ji Wan Park
  - Department of Medical Genetics, College of Medicine, Hallym University, Chuncheon 24252, Republic of Korea
  - Correspondence:
|
26
|
Learning high-order interactions for polygenic risk prediction. PLoS One 2023; 18:e0281618. [PMID: 36763605 PMCID: PMC9916647 DOI: 10.1371/journal.pone.0281618] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/29/2022] [Accepted: 01/27/2023] [Indexed: 02/11/2023] Open
Abstract
Within the framework of precision medicine, the stratification of individual genetic susceptibility based on inherited DNA variation has paramount relevance. However, one of the most relevant pitfalls of traditional Polygenic Risk Scores (PRS) approaches is their inability to model complex high-order non-linear SNP-SNP interactions and their effect on the phenotype (e.g. epistasis). Indeed, they incur in a computational challenge as the number of possible interactions grows exponentially with the number of SNPs considered, affecting the statistical reliability of the model parameters as well. In this work, we address this issue by proposing a novel PRS approach, called High-order Interactions-aware Polygenic Risk Score (hiPRS), that incorporates high-order interactions in modeling polygenic risk. The latter combines an interaction search routine based on frequent itemsets mining and a novel interaction selection algorithm based on Mutual Information, to construct a simple and interpretable weighted model of user-specified dimensionality that can predict a given binary phenotype. Compared to traditional PRSs methods, hiPRS does not rely on GWAS summary statistics nor any external information. Moreover, hiPRS differs from Machine Learning-based approaches that can include complex interactions in that it provides a readable and interpretable model and it is able to control overfitting, even on small samples. In the present work we demonstrate through a comprehensive simulation study the superior performance of hiPRS w.r.t. state of the art methods, both in terms of scoring performance and interpretability of the resulting model. We also test hiPRS against small sample size, class imbalance and the presence of noise, showcasing its robustness to extreme experimental settings. 
Finally, we apply hiPRS to a case study on real data from DACHS cohort, defining an interaction-aware scoring model to predict mortality of stage II-III Colon-Rectal Cancer patients treated with oxaliplatin.
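The abstract describes selecting SNP-SNP interactions by mutual information with a binary phenotype. The sketch below shows only that building block: estimating mutual information between a binary interaction indicator and the phenotype from counts. It is not the hiPRS algorithm, which additionally uses frequent itemset mining and its own selection routine, and all function names, SNP names, and data here are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


def interaction_indicator(genotypes, pattern):
    """1 for each sample whose genotype matches every (snp, value) in pattern, else 0."""
    return [int(all(g[snp] == value for snp, value in pattern.items()))
            for g in genotypes]


# Hypothetical carrier data: the phenotype occurs only when both variants are carried.
genotypes = [{"snp_a": a, "snp_b": b}
             for a, b in [(1, 1), (1, 0), (0, 1), (0, 0)] * 2]
phenotype = [g["snp_a"] & g["snp_b"] for g in genotypes]

mi_single = mutual_information([g["snp_a"] for g in genotypes], phenotype)
mi_pair = mutual_information(
    interaction_indicator(genotypes, {"snp_a": 1, "snp_b": 1}), phenotype)
```

In this toy setting the pairwise indicator carries more information about the phenotype than either SNP alone, which is the kind of signal an interaction-aware score is meant to capture.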
|
27
|
Robust SNP-based prediction of rheumatoid arthritis through machine-learning-optimized polygenic risk score. J Transl Med 2023; 21:92. [PMID: 36750873 PMCID: PMC9903430 DOI: 10.1186/s12967-023-03939-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Accepted: 01/28/2023] [Indexed: 02/09/2023] Open
Abstract
BACKGROUND The popular statistics-based Genome-wide association studies (GWAS) have provided deep insights into the field of complex disorder genetics. However, their clinical applicability to predict disease/trait outcomes remains unclear as statistical models are not designed to make predictions. This study employs statistics-free machine-learning (ML)-optimized polygenic risk score (PRS) to complement existing GWAS and bring the prediction of disease/trait outcomes closer to clinical application. Rheumatoid Arthritis (RA) was selected as a model disease to demonstrate the robustness of ML in disease prediction as RA is a prevalent chronic inflammatory joint disease with high mortality rates, affecting adults in their economic prime. Early identification of at-risk individuals may facilitate measures to mitigate the effects of the disease. METHODS This study employs a robust ML feature selection algorithm to identify single nucleotide polymorphisms (SNPs) that can predict RA from a set of training data comprising RA patients and population control samples. Thereafter, selected SNPs were evaluated for their predictive performance across 3 independent, unseen test datasets. The selected SNPs were subsequently used to generate a PRS, which was also evaluated for its predictive capacity as a sole feature. RESULTS Through robust ML feature selection, 9 SNPs were found to be the minimum number of features for excellent predictive performance (AUC > 0.9) in 3 independent, unseen test datasets. The PRS based on these 9 SNPs was significantly associated with (P < 1 × 10^-16) and predictive (AUC > 0.9) of RA in the 3 unseen datasets. An RA ML-PRS calculator of these 9 SNPs was developed ( https://xistance.shinyapps.io/prs-ra/ ) to facilitate individualized clinical applicability. The majority of the predictive SNPs are protective, reside in non-coding regions, and are either predicted to be potentially functional SNPs (pfSNPs) or in high linkage disequilibrium (r² > 0.8) with un-interrogated pfSNPs. CONCLUSIONS These findings highlight the promise of this ML strategy to identify useful genetic features that can robustly predict disease and are amenable to translation for clinical application.
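A PRS used as a sole feature, and its evaluation by AUC, can be sketched as follows. This is an illustration under stated assumptions rather than the study's calculator: the three weights, the genotypes and the labels are hypothetical, the score is the conventional weighted sum of allele counts, and the AUC is the rank-based (Mann-Whitney) estimate.

```python
def polygenic_risk_score(allele_counts, weights):
    """Weighted sum of per-SNP allele counts (0, 1 or 2 copies of the effect allele)."""
    return sum(w * g for g, w in zip(allele_counts, weights))


def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Hypothetical per-SNP weights; negative values stand for protective alleles.
weights = [-0.4, 0.7, -0.2]

# Hypothetical genotypes (one tuple per individual) and case/control labels.
samples = [(0, 2, 0), (1, 1, 0), (2, 0, 1), (2, 0, 2), (0, 1, 1)]
labels = [1, 1, 0, 0, 1]

scores = [polygenic_risk_score(g, weights) for g in samples]
print("AUC of the PRS as a sole feature:", auc(scores, labels))
```

An AUC of 0.5 corresponds to a score that ranks cases and controls no better than chance, which is why the threshold AUC > 0.9 quoted above is a demanding one.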
|
28
|
Sha Z, Chen Y, Hu T. NSPA: characterizing the disease association of multiple genetic interactions at single-subject resolution. BIOINFORMATICS ADVANCES 2023; 3:vbad010. [PMID: 36818729 PMCID: PMC9927570 DOI: 10.1093/bioadv/vbad010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2022] [Revised: 01/02/2023] [Accepted: 02/02/2023] [Indexed: 02/10/2023]
Abstract
Motivation The interaction between genetic variables is one of the major barriers to characterizing the genetic architecture of complex traits. To account for epistasis, network science approaches are increasingly being used in research to elucidate the genetic architecture of complex diseases. Network science approaches associate genetic variables' disease susceptibility with their topological importance in the network. However, this network only represents genetic interactions and does not describe how these interactions contribute to disease association at the subject scale. We propose the Network-based Subject Portrait Approach (NSPA) and an accompanying feature transformation method to determine the collective risk impact of multiple genetic interactions for each subject. Results The feature transformation method converts genetic variants of subjects into new values that capture how genetic variables interact with others to contribute to a subject's disease association. We apply this approach to synthetic and genetic datasets and learn that (1) the disease association can be captured using multiple disjoint sets of genetic interactions and (2) the feature transformation method based on NSPA improves predictive performance compared with using the original genetic variables. Our findings confirm the role of genetic interaction in complex disease and provide a novel approach for gene-disease association studies to identify genetic architecture in the context of epistasis. Availability and implementation The code of NSPA is available at: https://github.com/MIB-Lab/Network-based-Subject-Portrait-Approach. Contact ting.hu@queensu.ca. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Zhendong Sha
- School of Computing, Queen’s University, Kingston, Ontario, Canada K7L 2N8
| | - Yuanzhu Chen
- School of Computing, Queen’s University, Kingston, Ontario, Canada K7L 2N8
| | - Ting Hu
- To whom correspondence should be addressed.
| |
|
29
|
Fritzsche MC, Akyüz K, Cano Abadía M, McLennan S, Marttinen P, Mayrhofer MT, Buyx AM. Ethical layering in AI-driven polygenic risk scores-New complexities, new challenges. Front Genet 2023; 14:1098439. [PMID: 36816027 PMCID: PMC9933509 DOI: 10.3389/fgene.2023.1098439] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Accepted: 01/04/2023] [Indexed: 01/27/2023]
Abstract
Researchers aim to develop polygenic risk scores as a tool to prevent and more effectively treat serious diseases, disorders and conditions such as breast cancer, type 2 diabetes mellitus and coronary heart disease. Recently, machine learning techniques, in particular deep neural networks, have been increasingly developed to create polygenic risk scores using electronic health records as well as genomic and other health data. While the use of artificial intelligence for polygenic risk scores may enable greater accuracy, performance and prediction, it also presents a range of increasingly complex ethical challenges. The ethical and social issues of many polygenic risk score applications in medicine have been widely discussed. However, in the literature and in practice, the ethical implications of their confluence with the use of artificial intelligence have not yet been sufficiently considered. Based on a comprehensive review of the existing literature, we argue that this stands in need of urgent consideration for research and subsequent translation into the clinical setting. Considering the many ethical layers involved, we will first give a brief overview of the development of artificial intelligence-driven polygenic risk scores, associated ethical and social implications, challenges in artificial intelligence ethics, and finally, explore potential complexities of polygenic risk scores driven by artificial intelligence. We point out emerging complexity regarding fairness, challenges in building trust, explaining and understanding artificial intelligence and polygenic risk scores as well as regulatory uncertainties and further challenges. We strongly advocate taking a proactive approach to embedding ethics in research and implementation processes for polygenic risk scores driven by artificial intelligence.
Affiliation(s)
- Marie-Christine Fritzsche
- Institute of History and Ethics in Medicine, TUM School of Medicine, Technical University of Munich, Munich, Germany; Department of Science, Technology and Society (STS), School of Social Sciences and Technology, Technical University of Munich, Munich, Germany; *Correspondence: Marie-Christine Fritzsche
| | - Kaya Akyüz
- Biobanking and Biomolecular Resources Research Infrastructure Consortium - European Research Infrastructure Consortium (BBMRI-ERIC), Graz, Austria; Department of Science and Technology Studies, University of Vienna, Vienna, Austria
| | - Mónica Cano Abadía
- Biobanking and Biomolecular Resources Research Infrastructure Consortium - European Research Infrastructure Consortium (BBMRI-ERIC), Graz, Austria
| | - Stuart McLennan
- Institute of History and Ethics in Medicine, TUM School of Medicine, Technical University of Munich, Munich, Germany; Department of Science, Technology and Society (STS), School of Social Sciences and Technology, Technical University of Munich, Munich, Germany
| | - Pekka Marttinen
- Helsinki Institute for Information Technology HIIT, Aalto University, Helsinki, Finland
| | - Michaela Th. Mayrhofer
- Biobanking and Biomolecular Resources Research Infrastructure Consortium - European Research Infrastructure Consortium (BBMRI-ERIC), Graz, Austria
| | - Alena M. Buyx
- Institute of History and Ethics in Medicine, TUM School of Medicine, Technical University of Munich, Munich, Germany; Department of Science, Technology and Society (STS), School of Social Sciences and Technology, Technical University of Munich, Munich, Germany
| |
|
30
|
Ferrè L, Clarelli F, Pignolet B, Mascia E, Frasca M, Santoro S, Sorosina M, Bucciarelli F, Moiola L, Martinelli V, Comi G, Liblau R, Filippi M, Valentini G, Esposito F. Combining Clinical and Genetic Data to Predict Response to Fingolimod Treatment in Relapsing Remitting Multiple Sclerosis Patients: A Precision Medicine Approach. J Pers Med 2023; 13:jpm13010122. [PMID: 36675783 PMCID: PMC9861774 DOI: 10.3390/jpm13010122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Revised: 12/30/2022] [Accepted: 12/30/2022] [Indexed: 01/11/2023]
Abstract
A personalized approach is strongly advocated for treatment selection in Multiple Sclerosis patients due to the high number of available drugs. Machine learning methods have proved to be valuable tools in the context of precision medicine. In the present work, we applied machine learning methods to identify a combined clinical and genetic signature of response to fingolimod that could support the prediction of drug response. Two cohorts of fingolimod-treated patients from Italy and France were enrolled and divided into training, validation, and test sets. Random forest training and robust feature selection were performed in the first two sets respectively, and the independent test set was used to evaluate model performance. A genetic-only model and a combined clinical-genetic model were obtained. Overall, 381 patients were classified according to the NEDA-3 criterion at 2 years; we identified a genetic model, including 123 SNPs, that was able to predict fingolimod response with an AUROC = 0.65 in the independent test set. When clinical data were added, the AUROC increased to 0.71. Integrating clinical and genetic data by means of machine learning methods can help in the prediction of response to fingolimod, although further studies are required before this approach can be extended to clinical applications.
Affiliation(s)
- Laura Ferrè
- Neurology and Neurorehabilitation Unit, IRCCS San Raffaele Hospital, 20132 Milan, Italy
- Laboratory of Human Genetics of Neurological Disorders, IRCCS San Raffaele Hospital, 20132 Milan, Italy
- Vita-Salute San Raffaele University, 20132 Milan, Italy
| | - Ferdinando Clarelli
- Laboratory of Human Genetics of Neurological Disorders, IRCCS San Raffaele Hospital, 20132 Milan, Italy
| | - Beatrice Pignolet
- Centre Hospitalier Universitaire de Toulouse, CEDEX 9, 31059 Toulouse, France
- Institut Toulousain des Maladies Infectieuses et Inflammatoires (Infinity), INSERM UMR1291–CNRS UMR5051—Université Toulouse III, CEDEX 3, 31024 Toulouse, France
| | - Elisabetta Mascia
- Laboratory of Human Genetics of Neurological Disorders, IRCCS San Raffaele Hospital, 20132 Milan, Italy
| | - Marco Frasca
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, 20133 Milan, Italy
- Data Science Research Center, Università degli Studi di Milano, 20133 Milan, Italy
- Infolife National Lab, CINI, 00185 Rome, Italy
| | - Silvia Santoro
- Laboratory of Human Genetics of Neurological Disorders, IRCCS San Raffaele Hospital, 20132 Milan, Italy
| | - Melissa Sorosina
- Laboratory of Human Genetics of Neurological Disorders, IRCCS San Raffaele Hospital, 20132 Milan, Italy
| | - Florence Bucciarelli
- Centre Hospitalier Universitaire de Toulouse, CEDEX 9, 31059 Toulouse, France
- Institut Toulousain des Maladies Infectieuses et Inflammatoires (Infinity), INSERM UMR1291–CNRS UMR5051—Université Toulouse III, CEDEX 3, 31024 Toulouse, France
| | - Lucia Moiola
- Neurology and Neurorehabilitation Unit, IRCCS San Raffaele Hospital, 20132 Milan, Italy
| | - Vittorio Martinelli
- Neurology and Neurorehabilitation Unit, IRCCS San Raffaele Hospital, 20132 Milan, Italy
| | | | - Roland Liblau
- Institut Toulousain des Maladies Infectieuses et Inflammatoires (Infinity), INSERM UMR1291–CNRS UMR5051—Université Toulouse III, CEDEX 3, 31024 Toulouse, France
- Department of Immunology, Toulouse University Hospitals, CEDEX 3, 31024 Toulouse, France
| | - Massimo Filippi
- Neurology and Neurorehabilitation Unit, IRCCS San Raffaele Hospital, 20132 Milan, Italy
- Vita-Salute San Raffaele University, 20132 Milan, Italy
- Neuroimaging Research Unit, IRCCS San Raffaele Hospital, 20132 Milan, Italy
- Neurophysiology Unit, IRCCS San Raffaele Hospital, 20132 Milan, Italy
| | - Giorgio Valentini
- AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, 20133 Milan, Italy
- Data Science Research Center, Università degli Studi di Milano, 20133 Milan, Italy
- Infolife National Lab, CINI, 00185 Rome, Italy
| | - Federica Esposito
- Neurology and Neurorehabilitation Unit, IRCCS San Raffaele Hospital, 20132 Milan, Italy
- Laboratory of Human Genetics of Neurological Disorders, IRCCS San Raffaele Hospital, 20132 Milan, Italy
- Correspondence:
| |
|
31
|
Odintsova VV, Hagenbeek FA, van der Laan CM, van de Weijer S, Boomsma DI. Genetics and epigenetics of human aggression. HANDBOOK OF CLINICAL NEUROLOGY 2023; 197:13-44. [PMID: 37633706 DOI: 10.1016/b978-0-12-821375-9.00005-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/28/2023]
Abstract
There is substantial variation between humans in aggressive behavior, with its biological etiology and molecular genetic basis mostly unknown. This review chapter offers an overview of genomic and omics studies revealing the genetic contribution to aggression and first insights into associations with epigenetic and other omics (e.g., metabolomics) profiles. We allowed for a broad phenotype definition including studies on "aggression," "aggressive behavior," or "aggression-related traits," "antisocial behavior," "conduct disorder," and "oppositional defiant disorder." Heritability estimates based on family and twin studies in children and adults of this broadly defined phenotype of aggression are around 50%, with relatively small fluctuations around this estimate. Next, we review the genome-wide association studies (GWAS) which search for associations with alleles and also allow for gene-based tests and epigenome-wide association studies (EWAS) which seek to identify associations with differently methylated regions across the genome. Both GWAS and EWAS allow for construction of Polygenic and DNA methylation scores at an individual level. Currently, these predict a small percentage of variance in aggression. We expect that increases in sample size will lead to additional discoveries in GWAS and EWAS, and that multiomics approaches will lead to a more comprehensive understanding of the molecular underpinnings of aggression.
Affiliation(s)
- Veronika V Odintsova
- Department of Biological Psychology, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands; Amsterdam Reproduction and Development (AR&D) Research Institute, Amsterdam, The Netherlands; Mental Health Division, Amsterdam Public Health (APH) Research Institute, Amsterdam, The Netherlands
| | - Fiona A Hagenbeek
- Department of Biological Psychology, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands; Mental Health Division, Amsterdam Public Health (APH) Research Institute, Amsterdam, The Netherlands
| | - Camiel M van der Laan
- Department of Biological Psychology, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands; Netherlands Institute for the Study of Crime and Law Enforcement (NSCR), Amsterdam, The Netherlands
| | - Steve van de Weijer
- Netherlands Institute for the Study of Crime and Law Enforcement (NSCR), Amsterdam, The Netherlands
| | - Dorret I Boomsma
- Department of Biological Psychology, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands; Amsterdam Reproduction and Development (AR&D) Research Institute, Amsterdam, The Netherlands.
| |
|
32
|
Salgado Á, de Melo-Minardi RC, Giovanetti M, Veloso A, Morais-Rodrigues F, Adelino T, de Jesus R, Tosta S, Azevedo V, Lourenco J, Alcantara LCJ. Machine learning models exploring characteristic single-nucleotide signatures in yellow fever virus. PLoS One 2022; 17:e0278982. [PMID: 36508435 PMCID: PMC9744328 DOI: 10.1371/journal.pone.0278982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2021] [Accepted: 11/29/2022] [Indexed: 12/14/2022]
Abstract
Yellow fever virus (YFV) is the agent of the most severe mosquito-borne disease in the tropics. Recently, Brazil suffered major YFV outbreaks with a high fatality rate, affecting areas where the virus had not been reported for decades, including urban areas where a large number of unvaccinated people live. We developed a machine learning framework combining three different algorithms (XGBoost, random forest and regularized logistic regression) to analyze YFV genomic sequences. This method was applied to 56 YFV sequences from human infections and 27 from non-human primate (NHP) infections to investigate the presence of genetic signatures possibly related to disease severity (in human-related sequences) and differences in PCR cycle threshold (Ct) values (in NHP-related sequences). Our analyses reveal four non-synonymous single nucleotide variations (SNVs) in sequences from human infections, in proteins NS3 (E614D), NS4a (I69V), NS5 (R727G, V643A), and six non-synonymous SNVs in NHP sequences, in proteins E (L385F), NS1 (A171V), NS3 (I184V) and NS5 (N11S, I374V, E641D). We performed comparative protein structural analysis on these SNVs, describing possible impacts on protein function. Although the dataset is limited in size and this study does not consider virus-host interactions, our work highlights the use of machine learning as a versatile and fast initial approach to genomic data exploration.
Affiliation(s)
- Álvaro Salgado
- Laboratório de Genética Celular e Molecular, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- * E-mail: (AS); (LCJA); (JL)
| | - Raquel C. de Melo-Minardi
- Departamento de Ciência da Computação, Instituto de Ciências Exatas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Marta Giovanetti
- Laboratório de Genética Celular e Molecular, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Laboratório de Flavivírus, Instituto Oswaldo Cruz, Fundação Oswaldo Cruz, Rio de Janeiro, Brazil
| | - Adriano Veloso
- Departamento de Ciência da Computação, Instituto de Ciências Exatas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Francielly Morais-Rodrigues
- Laboratório de Genética Celular e Molecular, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Talita Adelino
- Laboratório Central de Saúde Pública, Fundação Ezequiel Dias, Belo Horizonte, Minas Gerais, Brazil
| | - Ronaldo de Jesus
- Coordenação Geral dos Laboratórios de Saúde Pública, Secretaria de Vigilância em Saúde, Ministério da Saúde, Brasília, DF, Brazil
| | - Stephane Tosta
- Laboratório de Genética Celular e Molecular, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Vasco Azevedo
- Laboratório de Genética Celular e Molecular, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - José Lourenco
- Department of Zoology, University of Oxford, Oxford, United Kingdom
- * E-mail: (AS); (LCJA); (JL)
| | - Luiz Carlos J. Alcantara
- Laboratório de Genética Celular e Molecular, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Laboratório de Flavivírus, Instituto Oswaldo Cruz, Fundação Oswaldo Cruz, Rio de Janeiro, Brazil
- * E-mail: (AS); (LCJA); (JL)
| |
|
33
|
Abd El Hamid MM, Omar YM, Shaheen M, Mabrouk MS. Discovering epistasis interactions in Alzheimer's disease using deep learning model. GENE REPORTS 2022. [DOI: 10.1016/j.genrep.2022.101673] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/15/2022]
|
34
|
Abstract
Predicting outcomes in open-heart surgery can be challenging. Unexpected readmissions, long hospital stays, and mortality have economic implications. In this study, we investigated machine learning (ML) performance in data visualization and predicting patient outcomes associated with open-heart surgery. We evaluated 8,947 patients who underwent cardiac surgery from April 2006 to January 2018. Data visualization and classification were performed at cohort level and patient level using clustering, a correlation matrix, and seven different predictive models for predicting three outcomes ("Discharged," "Died," and "Readmitted") at the binary level. Cross-validation was used to train and test each dataset with the application of hyperparameter optimization and data imputation techniques. Machine learning showed promising performance for predicting mortality (AUC 0.83 ± 0.03) and readmission (AUC 0.75 ± 0.035). The cohort-level analysis revealed that ML performance is comparable to the Society of Thoracic Surgeons (STS) risk model even with a limited number of samples (e.g., fewer than 3,000 samples for ML versus more than 100,000 samples for the STS risk models). With all cases (8,947 samples, referred to as patient-level analysis), ML showed comparable performance to what has been reported for the STS models. However, we acknowledge that it remains unknown at this stage how the model might perform outside the institution, and this study does not in any way constitute a comparison of the performance of the internal model with the STS model. Our study demonstrates a systematic application of ML in analyzing and predicting outcomes after open-heart surgery. The predictive utility of ML in cardiac surgery and clinical implications of the results are highlighted.
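Figures such as "AUC 0.83 ± 0.03" are typically a mean and spread over cross-validation folds. The sketch below shows one plausible way such a summary arises; the number of folds, the contiguous (unshuffled) split, and the per-fold AUC values are all invented for illustration and are not taken from the study.

```python
from statistics import mean, stdev


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Each sample lands in exactly one test fold and is trained on in the others.
folds = list(k_fold_indices(10, 3))

# Hypothetical per-fold AUCs, standing in for the output of a fitted model.
fold_aucs = [0.78, 0.84, 0.81]
summary = f"{mean(fold_aucs):.2f} ± {stdev(fold_aucs):.2f}"
print(summary)
```

In practice the samples would usually be shuffled, and often stratified by outcome, before being split into folds.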
|
35
|
Gerussi A, Scaravaglio M, Cristoferi L, Verda D, Milani C, De Bernardi E, Ippolito D, Asselta R, Invernizzi P, Kather JN, Carbone M. Artificial intelligence for precision medicine in autoimmune liver disease. Front Immunol 2022; 13:966329. [PMID: 36439097 PMCID: PMC9691668 DOI: 10.3389/fimmu.2022.966329] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/10/2022] [Accepted: 10/13/2022] [Indexed: 09/10/2023]
Abstract
Autoimmune liver diseases (AiLDs) are rare autoimmune conditions of the liver and the biliary tree with unknown etiology and limited treatment options. AiLDs are inherently characterized by a high degree of complexity, which poses great challenges in understanding their etiopathogenesis, developing novel biomarkers and risk-stratification tools, and, eventually, generating new drugs. Artificial intelligence (AI) is considered one of the best candidates to support researchers and clinicians in making sense of biological complexity. In this review, we offer a primer on AI and machine learning for clinicians, and discuss recent available literature on its applications in medicine and more specifically how it can help to tackle major unmet needs in AiLDs.
Affiliation(s)
- Alessio Gerussi
- Division of Gastroenterology, Center for Autoimmune Liver Diseases, Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy
- European Reference Network on Hepatological Diseases (ERN RARE-LIVER), San Gerardo Hospital, Monza, Italy
| | - Miki Scaravaglio
- Division of Gastroenterology, Center for Autoimmune Liver Diseases, Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy
- European Reference Network on Hepatological Diseases (ERN RARE-LIVER), San Gerardo Hospital, Monza, Italy
| | - Laura Cristoferi
- Division of Gastroenterology, Center for Autoimmune Liver Diseases, Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy
- European Reference Network on Hepatological Diseases (ERN RARE-LIVER), San Gerardo Hospital, Monza, Italy
- Bicocca Bioinformatics Biostatistics and Bioimaging Centre - B4, School of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy
| | | | - Chiara Milani
- Division of Gastroenterology, Center for Autoimmune Liver Diseases, Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy
- European Reference Network on Hepatological Diseases (ERN RARE-LIVER), San Gerardo Hospital, Monza, Italy
| | - Elisabetta De Bernardi
- Department of Medicine and Surgery and Tecnomed Foundation, University of Milano - Bicocca, Monza, Italy
| | | | - Rosanna Asselta
- Humanitas Clinical and Research Center, Rozzano, Milan, Italy
- Department of Biomedical Sciences, Humanitas University, Pieve Emanuele, Milan, Italy
| | - Pietro Invernizzi
- Division of Gastroenterology, Center for Autoimmune Liver Diseases, Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy
- European Reference Network on Hepatological Diseases (ERN RARE-LIVER), San Gerardo Hospital, Monza, Italy
| | - Jakob Nikolas Kather
- Department of Medicine III, University Hospital RWTH Aachen, Aachen, Germany
- Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, Technical University Dresden, Dresden, Germany
| | - Marco Carbone
- Division of Gastroenterology, Center for Autoimmune Liver Diseases, Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy
- European Reference Network on Hepatological Diseases (ERN RARE-LIVER), San Gerardo Hospital, Monza, Italy
| |
|
36
|
A machine-learning approach for nonalcoholic steatohepatitis susceptibility estimation. Indian J Gastroenterol 2022; 41:475-482. [PMID: 36367682 DOI: 10.1007/s12664-022-01263-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/11/2021] [Accepted: 05/02/2022] [Indexed: 11/13/2022]
Abstract
BACKGROUND Nonalcoholic steatohepatitis (NASH), a severe form of nonalcoholic fatty liver disease, can lead to advanced liver damage and has become an increasingly prominent health problem worldwide. Predictive models for early identification of high-risk individuals could help identify preventive and interventional measures. Traditional epidemiological models, which are based on statistical analysis, have limited predictive power. In the current study, a novel machine-learning approach was developed for individual NASH susceptibility prediction using candidate single nucleotide polymorphisms (SNPs). METHODS A total of 245 NASH patients and 120 healthy individuals were included in the study. Single nucleotide polymorphism genotypes of candidate genes, including two SNPs in the cytochrome P450 family 2 subfamily E member 1 (CYP2E1) gene (rs6413432, rs3813867), two SNPs in the glucokinase regulator (GCKR) gene (rs780094, rs1260326), and the rs738409 SNP in patatin-like phospholipase domain-containing 3 (PNPLA3), together with gender, were used to develop models for identifying at-risk individuals. To predict the individual's susceptibility to NASH, nine different machine-learning models were constructed. These models involved two feature selection methods, Chi-square and support vector machine recursive feature elimination (SVM-RFE), and three classification algorithms: k-nearest neighbor (KNN), multi-layer perceptron (MLP), and random forest (RF). All nine machine-learning models were trained using 80% of both the NASH patients and the healthy controls data. The nine machine-learning models were then tested on the remaining 20% of both groups. The models' performance was compared in terms of accuracy, precision, sensitivity, and F measure. RESULTS Among all nine machine-learning models, the KNN classifier with all features as input showed the highest performance with 86% F measure and 79% accuracy. CONCLUSIONS Machine learning based on genomic variation may be applicable for estimating an individual's susceptibility to developing NASH among high-risk groups with a high degree of accuracy, precision, and sensitivity.
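The count of nine models is consistent with crossing three feature inputs (all features, Chi-square selected, SVM-RFE selected) with the three classifiers; that reading is an inference from the abstract, not something it states outright. The sketch below enumerates that grid and computes the four reported metrics from a confusion matrix. The counts are hypothetical, chosen only so that the metrics land near the reported values on a 20% test split; they are not the study's actual confusion matrix.

```python
from itertools import product

# Assumed reading of the design: three feature inputs crossed with three classifiers.
feature_inputs = ["all features", "Chi-square", "SVM-RFE"]
classifiers = ["KNN", "MLP", "RF"]
configurations = list(product(feature_inputs, classifiers))


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, sensitivity and F measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "sensitivity": sensitivity,
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical test-set counts: 49 patients and 24 controls (20% of 245 and 120).
metrics = classification_metrics(tp=45, fp=11, fn=4, tn=13)

print(len(configurations), "model configurations")
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With roughly two patients per control, the F measure can exceed accuracy because it ignores true negatives, which is worth keeping in mind when comparing the two reported figures.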
|
37
|
Goodman MO, Cade BE, Shah NA, Huang T, Dashti HS, Saxena R, Rutter MK, Libby P, Sofer T, Redline S. Pathway-Specific Polygenic Risk Scores Identify Obstructive Sleep Apnea-Related Pathways Differentially Moderating Genetic Susceptibility to Coronary Artery Disease. CIRCULATION. GENOMIC AND PRECISION MEDICINE 2022; 15:e003535. [PMID: 36170352 PMCID: PMC9588629 DOI: 10.1161/circgen.121.003535] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 05/03/2021] [Accepted: 06/02/2022] [Indexed: 01/04/2023]
Abstract
BACKGROUND Obstructive sleep apnea (OSA) and its features, such as chronic intermittent hypoxia, may differentially affect specific molecular pathways and processes in the pathogenesis of coronary artery disease (CAD) and influence the subsequent risk and severity of CAD events. In particular, competing adverse (eg, inflammatory) and protective (eg, increased coronary collateral blood flow) mechanisms may operate, but remain poorly understood. We hypothesize that common genetic variation in selected molecular pathways influences the likelihood of CAD events differently in individuals with and without OSA, in a pathway-dependent manner.
METHODS We selected a cross-sectional sample of 471 877 participants from the UK Biobank, with 4974 ascertained to have OSA, 25 988 to have CAD, and 711 to have both. We calculated pathway-specific polygenic risk scores for CAD, based on 6.6 million common variants evaluated in the CARDIoGRAMplusC4D genome-wide association study (Coronary ARtery DIsease Genome wide Replication and Meta-analysis [CARDIoGRAM] plus The Coronary Artery Disease [C4D] Genetics), annotated to specific genes and pathways using functional genomics databases. Based on prior evidence of involvement with intermittent hypoxia and CAD, we tested pathway-specific polygenic risk scores for the HIF1 (hypoxia-inducible factor 1), VEGF (vascular endothelial growth factor), NFκB (nuclear factor kappa-light-chain-enhancer of activated B cells) and TNF (tumor necrosis factor) signaling pathways.
RESULTS In a multivariable-adjusted logistic generalized additive model, elevated pathway-specific polygenic risk scores for the Kyoto Encyclopedia of Genes and Genomes VEGF pathway (39 genes) were associated with protection against CAD in OSA (interaction odds ratio 0.86, P=6×10⁻⁴). By contrast, the genome-wide CAD PRS did not show evidence of statistical interaction with OSA.
CONCLUSIONS We find evidence that pathway-specific genetic risk of CAD differs between individuals with and without OSA in a qualitatively pathway-dependent manner. These results provide evidence that gene-by-environment interaction influences CAD risk in certain pathways among people with OSA, an effect that is not well-captured by the genome-wide PRS. This invites further study of how OSA interacts with genetic risk at the molecular level and suggests eventual personalization of OSA treatment to reduce CAD risk according to individual pathway-specific genetic risk profiles.
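As a rough illustration of the scoring idea in this abstract, the sketch below computes a genome-wide score and a pathway-restricted score as weighted sums of effect-allele dosages, then converts a logistic interaction coefficient to an odds ratio. The variant IDs, gene names, effect sizes, and dosages are all hypothetical, and the abstract does not state the scale on which the score enters the model; only the odds ratio of 0.86 is taken from the text.

```python
import math

# Hypothetical per-variant effect sizes (log odds ratios) from a GWAS.
gwas_betas = {"rs_a": 0.12, "rs_b": -0.05, "rs_c": 0.30, "rs_d": 0.08}
# Hypothetical variant-to-gene annotation and a hypothetical pathway gene set.
variant_gene = {"rs_a": "GENE1", "rs_b": "GENE2", "rs_c": "GENE3", "rs_d": "GENE4"}
pathway_genes = {"GENE1", "GENE3"}


def polygenic_score(dosages, betas, variants=None):
    """Weighted sum of effect-allele dosages (0, 1, or 2) over the chosen variants."""
    chosen = betas.keys() if variants is None else variants
    return sum(betas[v] * dosages[v] for v in chosen)


# One hypothetical individual's effect-allele counts.
dosages = {"rs_a": 2, "rs_b": 1, "rs_c": 0, "rs_d": 1}

# A pathway-specific score keeps only variants annotated to pathway genes.
pathway_variants = [v for v, g in variant_gene.items() if g in pathway_genes]
genome_wide = polygenic_score(dosages, gwas_betas)
pathway_prs = polygenic_score(dosages, gwas_betas, pathway_variants)

# In a logistic model with a PRS-by-OSA product term, the interaction odds
# ratio is exp(coefficient of the product term). An odds ratio of 0.86
# therefore corresponds to a negative coefficient.
interaction_beta = math.log(0.86)

print(round(genome_wide, 2), round(pathway_prs, 2), round(interaction_beta, 3))
```

An interaction odds ratio below 1 means the association between the score and CAD is weaker (or reversed) in the OSA group relative to the non-OSA group, which is how the abstract's "protection" finding should be read.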
Affiliation(s)
- Matthew O Goodman
- Division of Sleep & Circadian Disorders (M.O.G., B.E.C., R.S., T.S., S.R.), Brigham and Women's Hospital & Harvard Medical School
- Division of Sleep Medicine, Harvard Medical School, Boston (M.O.G., B.E.C., T.H., R.S., T.S., S.R.)
- Program in Medical & Population Genetics, Broad Institute, Cambridge, MA (M.O.G., B.E.C., H.S.D., R.S.)
- Brian E Cade
- Division of Sleep & Circadian Disorders (M.O.G., B.E.C., R.S., T.S., S.R.), Brigham and Women's Hospital & Harvard Medical School
- Division of Sleep Medicine, Harvard Medical School, Boston (M.O.G., B.E.C., T.H., R.S., T.S., S.R.)
- Program in Medical & Population Genetics, Broad Institute, Cambridge, MA (M.O.G., B.E.C., H.S.D., R.S.)
- Neomi A Shah
- Icahn School of Medicine at Mount Sinai, New York, NY (N.A.S.)
- Tianyi Huang
- Channing Division of Network Medicine (T.H.), Brigham and Women's Hospital & Harvard Medical School
- Division of Sleep Medicine, Harvard Medical School, Boston (M.O.G., B.E.C., T.H., R.S., T.S., S.R.)
- Hassan S Dashti
- Program in Medical & Population Genetics, Broad Institute, Cambridge, MA (M.O.G., B.E.C., H.S.D., R.S.)
- Center for Genomic Medicine, Massachusetts General Hospital (H.S.D., R.S.)
- Department of Anesthesia, Critical Care & Pain Medicine, Massachusetts General Hospital & Harvard Medical School, Boston (H.S.D., R.S.)
- Richa Saxena
- Division of Sleep & Circadian Disorders (M.O.G., B.E.C., R.S., T.S., S.R.), Brigham and Women's Hospital & Harvard Medical School
- Division of Sleep Medicine, Harvard Medical School, Boston (M.O.G., B.E.C., T.H., R.S., T.S., S.R.)
- Program in Medical & Population Genetics, Broad Institute, Cambridge, MA (M.O.G., B.E.C., H.S.D., R.S.)
- Center for Genomic Medicine, Massachusetts General Hospital (H.S.D., R.S.)
- Department of Anesthesia, Critical Care & Pain Medicine, Massachusetts General Hospital & Harvard Medical School, Boston (H.S.D., R.S.)
- Martin K Rutter
- Division of Diabetes, Endocrinology & Gastroenterology, School of Medical Sciences, Faculty of Biology, Medicine and Health, University of Manchester (M.K.R.)
- Diabetes, Endocrinology & Metabolism Centre, Manchester Univ NHS Foundation Trust, Manchester Academic Health Science Centre, Manchester, United Kingdom (M.K.R.)
- Peter Libby
- Division of Cardiovascular Medicine, Department of Medicine (P.L.), Brigham and Women's Hospital & Harvard Medical School
- Tamar Sofer
- Division of Sleep & Circadian Disorders (M.O.G., B.E.C., R.S., T.S., S.R.), Brigham and Women's Hospital & Harvard Medical School
- Division of Sleep Medicine, Harvard Medical School, Boston (M.O.G., B.E.C., T.H., R.S., T.S., S.R.)
- Susan Redline
- Division of Sleep & Circadian Disorders (M.O.G., B.E.C., R.S., T.S., S.R.), Brigham and Women's Hospital & Harvard Medical School
|
38
|
Nascimben M, Rimondini L, Corà D, Venturin M. Polygenic risk modeling of tumor stage and survival in bladder cancer. BioData Min 2022; 15:23. [PMID: 36175974 PMCID: PMC9523990 DOI: 10.1186/s13040-022-00306-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2021] [Accepted: 09/18/2022] [Indexed: 11/26/2022] [Open Access]
Abstract
Introduction Bladder cancer assessment with non-invasive gene expression signatures facilitates the detection of patients at risk and surveillance of their status, avoiding the discomfort of cystoscopy. To achieve accurate cancer estimation, analysis pipelines for gene expression data (GED) may integrate a sequence of several machine learning and bio-statistical techniques to model complex characteristics of pathological patterns.
Methods Numerical experiments tested the combination of GED preprocessing by discretization with tree ensemble embeddings and nonlinear dimensionality reductions to categorize oncological patients comprehensively. Modeling aimed to identify tumor stage and distinguish survival outcomes in two situations: complete and partial data embedding. The latter experimental condition simulates the addition of new patients to an existing model for rapid monitoring of disease progression. Machine learning procedures were employed to identify the most relevant genes involved in patient prognosis and test the performance of preprocessed GED compared to untransformed data in predicting patient conditions.
Results Data embedding paired with dimensionality reduction produced prognostic maps with well-defined clusters of patients, suitable for medical decision support. A second experiment simulated the addition of new patients to an existing model (partial data embedding): Uniform Manifold Approximation and Projection (UMAP) methodology with uniform data discretization led to better outcomes than other analyzed pipelines. Further exploration of parameter space for UMAP and t-distributed stochastic neighbor embedding (t-SNE) underlined the importance of tuning a higher number of parameters for UMAP rather than t-SNE. Moreover, two different machine learning experiments identified a group of genes valuable for partitioning patients (gene relevance analysis) and showed the higher precision obtained by preprocessed data in predicting tumor outcomes for cancer stage and survival rate (six classes prediction).
Conclusions The present investigation proposed new analysis pipelines for disease outcome modeling from bladder cancer-related biomarkers. Complete and partial data embedding experiments suggested that pipelines employing UMAP had a more accurate predictive ability, supporting the recent literature trends on this methodology. However, several UMAP parameters were also found to influence experimental results, so researchers are advised to pay attention to this aspect of the UMAP technique. Machine learning procedures further demonstrated the effectiveness of the proposed preprocessing in predicting patients’ conditions and determined a sub-group of biomarkers significant for forecasting bladder cancer prognosis.
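The "uniform data discretization" step mentioned in this abstract can be pictured as equal-width binning of each gene's expression values. The sketch below is one plausible reading, not the authors' code: the function name, the bin count, and the expression values are assumptions made for illustration.

```python
def uniform_discretize(values, n_bins):
    """Map each value to an equal-width bin index in 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread, so everything lands in bin 0.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum sits on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical expression values for one gene across six patients.
expression = [0.0, 1.5, 3.0, 4.5, 6.0, 9.0]
bins = uniform_discretize(expression, 3)
print(bins)
```

Binning of this kind replaces raw expression magnitudes with coarse levels before the embedding step, which can reduce the influence of outliers on distance-based methods.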
Affiliation(s)
- Mauro Nascimben
- Department of Health Sciences, Università del Piemonte Orientale, Via Solaroli 17, 28100, Novara, Italy; Enginsoft SpA, Via Giambellino 7, 35129, Padova, Italy
- Lia Rimondini
- Department of Health Sciences, Università del Piemonte Orientale, Via Solaroli 17, 28100, Novara, Italy
- Davide Corà
- Department of Health Sciences, Università del Piemonte Orientale, Via Solaroli 17, 28100, Novara, Italy; Department of Translational Medicine, Università del Piemonte Orientale, Via Solaroli 17, 28100, Novara, Italy
|
39
|
Zhou X, Li X, Zhang Z, Han Q, Deng H, Jiang Y, Tang C, Yang L. Support vector machine deep mining of electronic medical records to predict the prognosis of severe acute myocardial infarction. Front Physiol 2022; 13:991990. [PMID: 36246101 PMCID: PMC9558165 DOI: 10.3389/fphys.2022.991990] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Accepted: 08/17/2022] [Indexed: 11/13/2022] [Open Access]
Abstract
Cardiovascular disease is currently one of the most important diseases causing death in China and the world, and acute myocardial infarction is a major cause of cardiovascular disease. This study provides an analytical technique for predicting the prognosis of patients with severe acute myocardial infarction using a support vector machine (SVM) technique based on information gleaned from electronic medical records in the Medical Information Mart for Intensive Care (MIMIC)-III database. The MIMIC-III database provided 4785 electronic medical records for inclusion in the model development after screening 7070 electronic medical records of patients admitted to the intensive care unit for treatment of acute myocardial infarction. Adopting the APS-III score as the criterion for identifying anticipated risk, the dimensions of data information incorporated into the mathematical model design were found using correlation coefficient matrix heatmaps and ordered logistic analysis. An automated prognostic risk-prediction model was developed using SVM, and the fit was evaluated by 5-fold cross-validation. We used a grid search method to further optimize the parameters and improve the model fit. The excellent generalization ability of SVM was fully verified by calculating the 95% confidence interval of the area under the receiver operating characteristic curve (AUC) for six algorithms (linear discriminant, tree, Kernel Naive Bayes, RUSBoost, KNN, and SVM). Compared to the remaining five models, its confidence interval was the narrowest with higher fitting accuracy and better performance. The patient prognostic risk prediction model constructed using SVM had a relatively impressive accuracy (92.2%) and AUC value (0.98). In this study, a model was designed for fitting that can maximize the potential information to be gleaned in the electronic medical records data. It was demonstrated that SVM models based on electronic medical records data can offer an effective solution for clinical disease prognostic risk assessment and improved clinical outcomes and have great potential for application in the clinical treatment of myocardial infarction.
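The combination of an SVM, cross-validation, and grid search described in this abstract might look roughly like the following scikit-learn sketch. It runs on synthetic data rather than MIMIC-III records, and the feature count, kernel, scaling step, fold count of five, and parameter grid are all assumptions chosen for illustration; the abstract does not report the authors' actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for tabular clinical features; this is NOT MIMIC-III data.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Scale features, then fit an RBF-kernel SVM (kernel choice is an assumption).
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Hypothetical search grid over the regularisation and kernel-width parameters.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]}

# Every grid point is scored by mean AUC over five stratified folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

best_params = search.best_params_
best_auc = search.best_score_
print(best_params, round(best_auc, 3))
```

Placing the scaler inside the pipeline keeps the scaling statistics from being computed on held-out folds, which would otherwise leak information into the cross-validated score.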
Affiliation(s)
- Xingyu Zhou
- Zhuhai Campus of Zunyi Medical University, Zhuhai, China
- Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (CAS), Shenzhen, China
- Xianying Li
- Zhuhai Campus of Zunyi Medical University, Zhuhai, China
- Zijun Zhang
- Zhuhai Campus of Zunyi Medical University, Zhuhai, China
- Qinrong Han
- Zhuhai Campus of Zunyi Medical University, Zhuhai, China
- Huijiao Deng
- Zhuhai Campus of Zunyi Medical University, Zhuhai, China
- Yi Jiang
- Zhuhai Campus of Zunyi Medical University, Zhuhai, China
- Chunxiao Tang
- Zhuhai Campus of Zunyi Medical University, Zhuhai, China
- Lin Yang
- Zhuhai Campus of Zunyi Medical University, Zhuhai, China
- Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (CAS), Shenzhen, China
- *Correspondence: Lin Yang
|
40
|
Sahu M, Gupta R, Ambasta RK, Kumar P. Artificial intelligence and machine learning in precision medicine: A paradigm shift in big data analysis. Progress in Molecular Biology and Translational Science 2022; 190:57-100. [PMID: 36008002 DOI: 10.1016/bs.pmbts.2022.03.002] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 10/18/2022]
Abstract
The integration of artificial intelligence in precision medicine has revolutionized healthcare delivery. Precision medicine identifies the phenotype of particular patients with less-common responses to treatment. Recent studies have demonstrated that translational research exploring the convergence between artificial intelligence and precision medicine will help solve the most difficult challenges facing precision medicine. Here, we discuss different aspects of artificial intelligence in precision medicine that improve healthcare delivery. First, we discuss how artificial intelligence changes the landscape of precision medicine and the evolution of artificial intelligence in precision medicine. Second, we highlight the synergies between artificial intelligence and precision medicine and promises of artificial intelligence and precision medicine in healthcare delivery. Third, we briefly explain the promise of big data analytics and the integration of nanomaterials in precision medicine. Last, we highlight the challenges and opportunities of artificial intelligence in precision medicine.
Affiliation(s)
- Mehar Sahu
- Molecular Neuroscience and Functional Genomics Laboratory, Delhi Technological University (Formerly Delhi College of Engineering), Shahbad Daulatpur, Delhi, India
- Rohan Gupta
- Molecular Neuroscience and Functional Genomics Laboratory, Delhi Technological University (Formerly Delhi College of Engineering), Shahbad Daulatpur, Delhi, India
- Rashmi K Ambasta
- Molecular Neuroscience and Functional Genomics Laboratory, Delhi Technological University (Formerly Delhi College of Engineering), Shahbad Daulatpur, Delhi, India
- Pravir Kumar
- Molecular Neuroscience and Functional Genomics Laboratory, Delhi Technological University (Formerly Delhi College of Engineering), Shahbad Daulatpur, Delhi, India.
|
41
|
Elgart M, Lyons G, Romero-Brufau S, Kurniansyah N, Brody JA, Guo X, Lin HJ, Raffield L, Gao Y, Chen H, de Vries P, Lloyd-Jones DM, Lange LA, Peloso GM, Fornage M, Rotter JI, Rich SS, Morrison AC, Psaty BM, Levy D, Redline S, Sofer T. Non-linear machine learning models incorporating SNPs and PRS improve polygenic prediction in diverse human populations. Commun Biol 2022; 5:856. [PMID: 35995843 PMCID: PMC9395509 DOI: 10.1038/s42003-022-03812-z] [Citation(s) in RCA: 27] [Impact Index Per Article: 13.5] [Received: 07/19/2021] [Accepted: 08/05/2022] [Indexed: 01/03/2023] [Open Access]
Abstract
Polygenic risk scores (PRS) are commonly used to quantify the inherited susceptibility for a trait, yet they fail to account for non-linear and interaction effects between single nucleotide polymorphisms (SNPs). We address this via a machine learning approach, validated in nine complex phenotypes in a multi-ancestry population. We use an ensemble method of SNP selection followed by gradient boosted trees (XGBoost) to allow for non-linearities and interaction effects. We compare our results to the standard, linear PRS model developed using PRSice, LDpred2, and lassosum2. Combining a PRS as a feature in an XGBoost model results in a relative increase in the percentage variance explained compared to the standard linear PRS model by 22% for height, 27% for HDL cholesterol, 43% for body mass index, 50% for sleep duration, 58% for systolic blood pressure, 64% for total cholesterol, 66% for triglycerides, 77% for LDL cholesterol, and 100% for diastolic blood pressure. Multi-ancestry trained models perform similarly to specific racial/ethnic group trained models and are consistently superior to the standard linear PRS models. This work demonstrates an effective method to account for non-linearities and interaction effects in genetics-based prediction models.
Affiliation(s)
- Michael Elgart
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA.
- Department of Medicine, Harvard Medical School, Boston, MA, USA.
- Genevieve Lyons
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Santiago Romero-Brufau
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Medicine, Mayo Clinic, Rochester, MN, USA
- Nuzulul Kurniansyah
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA
- Jennifer A Brody
- Cardiovascular Health Research Unit, Department of Medicine, University of Washington, Seattle, WA, USA
- Xiuqing Guo
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
- Henry J Lin
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
- Laura Raffield
- Department of Genetics, University of North Carolina, Chapel Hill, NC, USA
- Yan Gao
- The Jackson Heart Study, University of Mississippi Medical Center, Jackson, MS, USA
- Han Chen
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Paul de Vries
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Leslie A Lange
- Department of Medicine, University of Colorado Denver, Anschutz Medical Campus, Aurora, CO, USA
- Gina M Peloso
- Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
- Myriam Fornage
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Brown Foundation Institute of Molecular Medicine, McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA
- Jerome I Rotter
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
- Stephen S Rich
- Center for Public Health Genomics, University of Virginia School of Medicine, Charlottesville, VA, USA
- Alanna C Morrison
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Bruce M Psaty
- Cardiovascular Health Research Unit, Departments of Medicine, Epidemiology, and Health Services, University of Washington, Seattle, WA, USA
- Daniel Levy
- The Population Sciences Branch of the National Heart, Lung and Blood Institute, Bethesda, MD, USA
- The Framingham Heart Study, Framingham, MA, USA
- Susan Redline
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA
- Department of Medicine, Harvard Medical School, Boston, MA, USA
- Tamar Sofer
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA.
- Department of Medicine, Harvard Medical School, Boston, MA, USA.
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
|
42
|
Ma W, Lau YL, Yang W, Wang YF. Random forests algorithm boosts genetic risk prediction of systemic lupus erythematosus. Front Genet 2022; 13:902793. [PMID: 36046232 PMCID: PMC9421562 DOI: 10.3389/fgene.2022.902793] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 03/23/2022] [Accepted: 07/19/2022] [Indexed: 11/13/2022] [Open Access]
Abstract
Patients with systemic lupus erythematosus (SLE) present varied clinical manifestations, posing a diagnostic challenge for physicians. Genetic factors substantially contribute to SLE development. A polygenic risk scoring (PRS) model has been used to estimate the genetic risk of SLE in individuals. However, this approach assumes independent and additive contribution of genetic variants to disease development. We aimed to improve the accuracy of SLE prediction using machine-learning algorithms. We applied random forest (RF), support vector machine (SVM), and artificial neural network (ANN) to classify SLE cases and controls using the data from our previous genome-wide association studies (GWAS) conducted in either Chinese or European populations, including a total of 19,208 participants. The overall performances of these predictors were assessed by the value of area under the receiver-operator curve (AUC). The analyses in the Chinese GWAS showed that the RF model significantly outperformed other predictors, achieving a mean AUC value of 0.84, a 13% improvement upon the PRS model (AUC = 0.74). At the optimal cut-off, the RF predictor reached a sensitivity of 84% with a specificity of 68% in SLE classification. To validate these results, similar analyses were repeated in the European GWAS, and the RF model consistently outperformed other algorithms. Our study suggests that the RF model could be an additional and powerful predictor for SLE early diagnosis.
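The metrics quoted in this abstract can be made concrete with a small sketch. It computes AUC as the probability that a randomly chosen case scores higher than a randomly chosen control, reads sensitivity and specificity off a cut-off, and checks that the reported "13% improvement" matches a relative gain from 0.74 to 0.84. The scores and labels are toy values invented for illustration; only the two AUC figures come from the abstract.

```python
def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count half)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def sensitivity_specificity(scores, labels, cutoff):
    """Classify score >= cutoff as a case, then report (sensitivity, specificity)."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    return tp / labels.count(1), tn / labels.count(0)


# Toy predicted risks for three cases (label 1) and three controls (label 0).
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]

toy_auc = auc(scores, labels)
sens, spec = sensitivity_specificity(scores, labels, cutoff=0.5)

# Relative gain of the reported RF AUC (0.84) over the reported PRS AUC (0.74).
relative_gain = (0.84 - 0.74) / 0.74

print(round(toy_auc, 3), round(sens, 3), round(spec, 3), round(relative_gain, 3))
```

The relative gain works out to about 0.135, so the abstract's 13% is a relative improvement rather than an absolute difference of 13 AUC points.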
Affiliation(s)
- Wen Ma
- Department of Paediatrics and Adolescent Medicine, The University of Hong Kong, Hong Kong, China
- Yu-Lung Lau
- Department of Paediatrics and Adolescent Medicine, The University of Hong Kong, Hong Kong, China
- Wanling Yang
- Department of Paediatrics and Adolescent Medicine, The University of Hong Kong, Hong Kong, China
- *Correspondence: Wanling Yang; Yong-Fei Wang
- Yong-Fei Wang
- Department of Paediatrics and Adolescent Medicine, The University of Hong Kong, Hong Kong, China
- Shenzhen Futian Hospital for Rheumatic Diseases, Shenzhen, China
- *Correspondence: Wanling Yang; Yong-Fei Wang
|
43
|
Pudjihartono N, Fadason T, Kempa-Liehr AW, O'Sullivan JM. A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction. Frontiers in Bioinformatics 2022; 2:927312. [PMID: 36304293 PMCID: PMC9580915 DOI: 10.3389/fbinf.2022.927312] [Citation(s) in RCA: 72] [Impact Index Per Article: 36.0] [Received: 04/24/2022] [Accepted: 06/03/2022] [Indexed: 01/14/2023] [Open Access]
Abstract
Machine learning has shown utility in detecting patterns within large, unstructured, and complex datasets. One of the promising applications of machine learning is in precision medicine, where disease risk is predicted using patient genetic data. However, creating an accurate prediction model based on genotype data remains challenging due to the so-called “curse of dimensionality” (i.e., extensively larger number of features compared to the number of samples). Therefore, the generalizability of machine learning models benefits from feature selection, which aims to extract only the most “informative” features and remove noisy “non-informative,” irrelevant and redundant features. In this article, we provide a general overview of the different feature selection methods, their advantages, disadvantages, and use cases, focusing on the detection of relevant features (i.e., SNPs) for disease risk prediction.
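One of the simplest families of feature selection is the filter approach, which scores each feature against the outcome independently of any model and keeps the top-ranked ones. The sketch below ranks hypothetical SNP dosage columns by absolute Pearson correlation with a binary label. The SNP names, genotypes, and the choice of correlation as the score are assumptions for illustration and are not taken from the review.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def top_k_features(columns, labels, k):
    """Filter method: keep the k features with the largest |correlation| to the label."""
    scores = {name: abs(pearson(col, labels)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical effect-allele dosages for four SNPs in six individuals.
genotypes = {
    "snp_1": [0, 0, 1, 1, 2, 2],  # rises with the label
    "snp_2": [1, 1, 1, 1, 1, 1],  # constant, hence uninformative
    "snp_3": [2, 0, 1, 1, 0, 2],  # unrelated to the label
    "snp_4": [2, 2, 1, 1, 0, 0],  # falls as the label rises
}
labels = [0, 0, 0, 1, 1, 1]

selected = top_k_features(genotypes, labels, k=2)
print(selected)
```

A univariate filter like this is cheap, but it scores each SNP in isolation, so it can discard features that only matter in combination and can retain redundant ones.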
Affiliation(s)
- Tayaza Fadason
- Liggins Institute, University of Auckland, Auckland, New Zealand
- Maurice Wilkins Centre for Molecular Biodiscovery, Auckland, New Zealand
- Andreas W. Kempa-Liehr
- Department of Engineering Science, The University of Auckland, Auckland, New Zealand
- *Correspondence: Andreas W. Kempa-Liehr; Justin M. O'Sullivan
- Justin M. O'Sullivan
- Liggins Institute, University of Auckland, Auckland, New Zealand
- Maurice Wilkins Centre for Molecular Biodiscovery, Auckland, New Zealand
- MRC Lifecourse Epidemiology Unit, University of Southampton, Southampton, United Kingdom
- Singapore Institute for Clinical Sciences, Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Australian Parkinson’s Mission, Garvan Institute of Medical Research, Sydney, NSW, Australia
- *Correspondence: Andreas W. Kempa-Liehr; Justin M. O'Sullivan
|
44
|
Abstract
The tremendous amount of biological sequence data available, combined with the recent methodological breakthrough in deep learning in domains such as computer vision or natural language processing, is leading today to the transformation of bioinformatics through the emergence of deep genomics, the application of deep learning to genomic sequences. We review here the new applications that the use of deep learning enables in the field, focusing on three aspects: the functional annotation of genomes, the sequence determinants of the genome functions and the possibility to write synthetic genomic sequences.
|
45
|
Quazi S. Artificial intelligence and machine learning in precision and genomic medicine. Med Oncol 2022; 39:120. [PMID: 35704152 PMCID: PMC9198206 DOI: 10.1007/s12032-022-01711-1] [Citation(s) in RCA: 34] [Impact Index Per Article: 17.0] [Received: 03/06/2022] [Accepted: 03/14/2022] [Indexed: 10/28/2022]
Abstract
The advancement of precision medicine in medical care has left behind the conventional symptom-driven treatment process by allowing early risk prediction of disease through improved diagnostics and customization of more effective treatments. It is necessary to scrutinize overall patient data alongside broad factors to observe and differentiate between ill and relatively healthy people to take the most appropriate path toward precision medicine, resulting in an improved vision of biological indicators that can signal health changes. Precision and genomic medicine combined with artificial intelligence have the potential to improve patient healthcare. Patients with less common therapeutic responses or unique healthcare demands are using genomic medicine technologies. AI provides insights through advanced computation and inference, enabling the system to reason and learn while enhancing physician decision making. Many cell characteristics, including gene up-regulation, proteins binding to nucleic acids, and splicing, can be measured at high throughput and used as training objectives for predictive models. Researchers can create a new era of effective genomic medicine with the improved availability of a broad range of datasets and modern computer techniques such as machine learning. This review article has elucidated the contributions of ML algorithms in precision and genome medicine.
Affiliation(s)
- Sameer Quazi
- GenLab Biosolutions Private Limited, Bangalore, Karnataka, 560043, India.
- Department of Biomedical Sciences, School of Life Sciences, Anglia Ruskin University, Cambridge, UK.
|
47
|
Zhuang YJ, Mangwiro Y, Wake M, Saffery R, Greaves RF. Multi-omics analysis from archival neonatal dried blood spots: limitations and opportunities. Clin Chem Lab Med 2022; 60:1318-1341. [PMID: 35670573 DOI: 10.1515/cclm-2022-0311] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 03/30/2022] [Accepted: 05/25/2022] [Indexed: 02/07/2023]
Abstract
Newborn screening (NBS) programs operate in many countries, processing millions of dried bloodspot (DBS) samples annually. In addition to early identification of various adverse health outcomes, these samples have considerable potential as a resource for population-based research that could address key questions related to child health. The feasibility of archival DBS samples for emerging targeted and untargeted multi-omics analysis has not been previously explored in the literature. This review aims to critically evaluate the latest advances to identify opportunities and challenges of applying omics analyses to NBS cards in a research setting. Medline, Embase and PubMed databases were searched to identify studies utilizing DBS for genomic, proteomic and metabolomic assays. A total of 800 records were identified after removing duplicates, of which 23 records were included in this review. These papers consisted of one combined genomic/metabolomic, four genomic, three epigenomic, four proteomic and 11 metabolomic studies. Together they demonstrate that the increasing sensitivity of multi-omic analytical techniques makes the broad use of NBS samples achievable for large cohort studies. Maintaining the pre-analytical integrity of the DBS sample through storage at temperatures below -20 °C will enable this important resource to be fully realized in a research capacity.
Affiliation(s)
- Yuan-Jessica Zhuang
- Department of Paediatrics, The University of Melbourne, Melbourne, VIC, Australia
- Yeukai Mangwiro
- Department of Paediatrics, The University of Melbourne, Melbourne, VIC, Australia
- Murdoch Children's Research Institute, Melbourne, VIC, Australia
- Melissa Wake
- Department of Paediatrics, The University of Melbourne, Melbourne, VIC, Australia
- Murdoch Children's Research Institute, Melbourne, VIC, Australia
- Richard Saffery
- Department of Paediatrics, The University of Melbourne, Melbourne, VIC, Australia
- Murdoch Children's Research Institute, Melbourne, VIC, Australia
- Ronda F Greaves
- Department of Paediatrics, The University of Melbourne, Melbourne, VIC, Australia
- Victorian Clinical Genetics Services, Murdoch Children's Research Institute, Melbourne, VIC, Australia
|
48
|
Helenius M, Vaitkeviciene G, Abrahamsson J, Jonsson ÓG, Lund B, Harila-Saari A, Vettenranta K, Mikkel S, Stanulla M, Lopez-Lopez E, Waanders E, Madsen HO, Marquart HV, Modvig S, Gupta R, Schmiegelow K, Nielsen RL. Characteristics of white blood cell count in acute lymphoblastic leukemia: A COST LEGEND phenotype-genotype study. Pediatr Blood Cancer 2022; 69:e29582. [PMID: 35316565 DOI: 10.1002/pbc.29582] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2021] [Revised: 12/20/2021] [Accepted: 12/31/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND: White blood cell count (WBC) as a measure of extramedullary leukemic cell survival is a well-known prognostic factor in acute lymphoblastic leukemia (ALL), but its biology, including impact of host genome variants, is poorly understood. METHODS: We included patients treated with the Nordic Society of Paediatric Haematology and Oncology (NOPHO) ALL-2008 protocol (N = 2347, 72% were genotyped by Illumina Omni2.5exome-8-Bead chip) aged 1-45 years, diagnosed with B-cell precursor (BCP-) or T-cell ALL (T-ALL) to investigate the variation in WBC. Spline functions of WBC were fitted correcting for association with age across ALL subgroups of immunophenotypes and karyotypes. The residuals between spline WBC and actual WBC were used to identify WBC-associated germline genetic variants in a genome-wide association study (GWAS) while adjusting for age and ALL subtype associations. RESULTS: We observed an overall inverse correlation between age and WBC, which was stronger for the selected patient subgroups of immunophenotype and karyotypes (ρBCP-ALL = -.17, ρT-ALL = -.19; p < 3 × 10⁻⁴). Spline functions fitted to age, immunophenotype, and karyotype explained WBC variation better than age alone (ρ = .43, p ≪ 2 × 10⁻⁶). However, when the spline-adjusted WBC residuals were used as phenotype, no GWAS significant associations were found. Based on available annotation, the top 50 genetic variants suggested effects on signal transduction, translation initiation, cell development, and proliferation. CONCLUSION: These results indicate that host genome variants do not strongly influence WBC across ALL subsets, and future studies of why some patients are more prone to hyperleukocytosis should be performed within specific ALL subsets that apply more complex analyses to capture potential germline variant interactions and impact on WBC.
Affiliation(s)
- Marianne Helenius
- Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Copenhagen, Denmark; Department of Pediatrics and Adolescent Medicine, University Hospital Rigshospitalet, Copenhagen, Denmark
| | - Goda Vaitkeviciene
- Vilnius University Hospital Santaros Klinikos Center for Pediatric Oncology and Hematology and Vilnius University, Vilnius, Lithuania
| | - Jonas Abrahamsson
- Department of Paediatrics, Institution for Clinical Sciences, Sahlgrenska University Hospital, Gothenburg, Sweden
| | - Bendik Lund
- Department of Pediatrics, St. Olavs Hospital, Trondheim, Norway
| | - Arja Harila-Saari
- Department of Women's and Children's Health, Uppsala University, Uppsala, Sweden
| | - Kim Vettenranta
- University of Helsinki and Children's Hospital, University of Helsinki, Helsinki, Finland
| | - Sirje Mikkel
- Department of Hematology and Oncology, University of Tartu, Tartu, Estonia
| | - Martin Stanulla
- Department of Pediatric Hematology and Oncology, Hannover Medical School, Hannover, Germany
| | - Elixabet Lopez-Lopez
- Department of Genetics, Physical Anthropology and Animal Physiology, Faculty of Science and Technology, University of the Basque Country (UPV/EHU), Leioa, Spain; Pediatric Oncology Group, BioCruces Bizkaia Health Research Institute, Barakaldo, Spain
| | - Esmé Waanders
- Department of Genetics, University Medical Center Utrecht, Utrecht, The Netherlands; Princess Máxima Center for Pediatric Oncology, Utrecht, The Netherlands
| | - Hans O Madsen
- Department of Clinical Immunology, University Hospital Rigshospitalet, Copenhagen, Denmark
| | - Hanne Vibeke Marquart
- Department of Clinical Immunology, University Hospital Rigshospitalet, Copenhagen, Denmark
| | - Signe Modvig
- Department of Clinical Immunology, University Hospital Rigshospitalet, Copenhagen, Denmark
| | - Ramneek Gupta
- Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Copenhagen, Denmark; Novo Nordisk Research Centre Oxford, Oxford, UK
| | - Kjeld Schmiegelow
- Department of Pediatrics and Adolescent Medicine, University Hospital Rigshospitalet, Copenhagen, Denmark; Institute of Clinical Medicine, Faculty of Medicine, University of Copenhagen, Copenhagen, Denmark
| | - Rikke Linnemann Nielsen
- Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Copenhagen, Denmark; Department of Pediatrics and Adolescent Medicine, University Hospital Rigshospitalet, Copenhagen, Denmark; Novo Nordisk Research Centre Oxford, Oxford, UK
|
49
|
Isik YE, Gormez Y, Aydin Z, Bakir-Gungor B. The Determination of Distinctive Single Nucleotide Polymorphism Sets for the Diagnosis of Behçet's Disease. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1909-1918. [PMID: 33476272 DOI: 10.1109/tcbb.2021.3053429] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/12/2023]
Abstract
Behçet's Disease (BD) is a multi-system inflammatory disorder in which the etiology remains unclear. The most probable hypothesis is that genetic tendency and environmental factors play roles in the development of BD. In order to find the essential reasons, genetic changes on thousands of genes should be analyzed. Besides, there is a need for extra analysis to find out which genetic factor affects the disease. Machine learning approaches have high potential for extracting the knowledge from genomics and selecting the representative Single Nucleotide Polymorphisms (SNPs) as the most effective features for the clinical diagnosis process. In this study, we have attempted to identify representative SNPs using feature selection methods, incorporating biological information and aimed to develop a machine-learning model for diagnosing Behçet's disease. By combining biological information and machine learning classifiers, up to 99.64 percent accuracy of disease prediction is achieved using only 13,611 out of 311,459 SNPs. In addition, we revealed the SNPs that are most distinctive by performing repeated feature selection in cross-validation experiments.
|
50
|
Petkov S, Chiodi F. Impaired CD4+ T cell differentiation in HIV-1 infected patients receiving early anti-retroviral therapy. Genomics 2022; 114:110367. [PMID: 35429609 DOI: 10.1016/j.ygeno.2022.110367] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/09/2022] [Revised: 04/01/2022] [Accepted: 04/09/2022] [Indexed: 01/14/2023]
Abstract
Differentiation of CD4+ T naïve (TN) into central memory (TCM) cells involves extensive molecular processes. We compared the transcriptomes of CD4+ TN and TCM cells from HIV-1 infected patients receiving early anti-retroviral therapy (ART; EA; n = 13) and controls (n = 15). Comparison of protein coding genes between TCM and TN revealed 533 and 82 differentially expressed genes (DEGs) in controls and EA, respectively. A high degree of transcriptional complexity was detected during transition of CD4+ TN to TCM cells in controls involving 70 transcription factors (TFs), 20 master regulators of T cell differentiation (TBX21, GATA3, RARA, FOXP3, RORC); in EA only 7 TFs were modulated, with expression of several master regulators remaining unchanged during differentiation. Analysis of interactions between modulated TFs and target genes revealed important regulatory interactions missing in the EA group. We conclude that T cell differentiation in EA patients is impaired due to reduced modulation of genes involved in transition from CD4+ TN to TCM cells.
Affiliation(s)
- Stefan Petkov
- Department of Microbiology, Tumor and Cell Biology, Biomedicum, Karolinska Institutet, Solna, Sweden
| | - Francesca Chiodi
- Department of Microbiology, Tumor and Cell Biology, Biomedicum, Karolinska Institutet, Solna, Sweden
|