1
Andrews N, Unrath N, Wall P, Buckley JF, Fanning S. Prediction of Listeria monocytogenes Clonal Complexes from Multilocus Variable Number Tandem Repeat Analysis Patterns Using a Machine Learning Approach. Foodborne Pathog Dis 2024; 21:593-599. [PMID: 38963774 DOI: 10.1089/fpd.2023.0163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/06/2024]
Abstract
Multilocus variable number tandem repeat analysis (MLVA) is a molecular subtyping technique that remains useful for those without the resources to access whole genome sequencing for the tracking and tracing of bacterial contaminants. Unlike techniques such as multilocus sequence typing (MLST) and pulsed-field gel electrophoresis, MLVA did not emerge as a standardized subtyping method for Listeria monocytogenes, and as a result, there is no reference database of virulent or food-associated MLVA subtypes as there is for MLST-based clonal complexes (CCs). Having previously shown the close congruence of a 5-locus MLVA scheme with MLST, a predictive model was created using the XGBoost machine learning (ML) technique, which enabled the prediction of CCs from MLVA patterns with ∼85% (±4%) accuracy. As well as validating the model on existing data, a straightforward update protocol was simulated for when previously unseen subtypes arise. This article illustrates how ML techniques can be applied with elementary coding skills to add value to previous-generation molecular subtyping data in built food processing environments.
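The idea of mapping an MLVA pattern to a clonal complex, and of updating the reference set when a new subtype appears, can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a nearest-neighbour lookup by Hamming distance as a stand-in, not the XGBoost model described in the abstract, and the patterns, labels (`CC_A`, `CC_B`, ...) and function names are invented for the example.

```python
from collections import Counter

# Toy training data: 5-locus MLVA patterns (repeat counts) -> clonal complex.
# All values and labels are invented for illustration only.
TRAIN = [
    ((3, 7, 2, 5, 4), "CC_A"),
    ((3, 7, 2, 5, 5), "CC_A"),
    ((6, 2, 9, 1, 3), "CC_B"),
    ((6, 2, 8, 1, 3), "CC_B"),
    ((4, 4, 4, 2, 7), "CC_C"),
]


def hamming(a, b):
    """Number of loci at which two MLVA patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def predict_cc(pattern, train=TRAIN):
    """Return the majority label among the closest training patterns."""
    best = min(hamming(pattern, p) for p, _ in train)
    votes = Counter(cc for p, cc in train if hamming(pattern, p) == best)
    return votes.most_common(1)[0][0]


def accuracy(test, train=TRAIN):
    """Fraction of test patterns whose predicted label matches the true one."""
    hits = sum(predict_cc(p, train) == cc for p, cc in test)
    return hits / len(test)


def update(train, pattern, cc):
    """Update step: return a new reference set that includes a newly typed subtype."""
    return train + [(pattern, cc)]
```

A previously unseen subtype would be handled by calling `update(TRAIN, new_pattern, new_label)` once its clonal complex has been confirmed by other means, and then predicting against the returned list.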
Affiliation(s)
- Nicholas Andrews
- UCD-Centre for Food Safety, School of Public Health, Physiotherapy and Sports Science, and School of Agriculture and Food Science, University College Dublin, Dublin, Ireland
- Natalia Unrath
- UCD-Centre for Food Safety, School of Public Health, Physiotherapy and Sports Science, and School of Agriculture and Food Science, University College Dublin, Dublin, Ireland
- Patrick Wall
- UCD-Centre for Food Safety, School of Public Health, Physiotherapy and Sports Science, and School of Agriculture and Food Science, University College Dublin, Dublin, Ireland
- James F Buckley
- UCD-Centre for Food Safety, School of Public Health, Physiotherapy and Sports Science, and School of Agriculture and Food Science, University College Dublin, Dublin, Ireland
- Séamus Fanning
- UCD-Centre for Food Safety, School of Public Health, Physiotherapy and Sports Science, and School of Agriculture and Food Science, University College Dublin, Dublin, Ireland
- Institute for Global Food Security, Queen's University Belfast, Belfast, United Kingdom
2
Alfayyadh MM, Maksemous N, Sutherland HG, Lea RA, Griffiths LR. Unravelling the Genetic Landscape of Hemiplegic Migraine: Exploring Innovative Strategies and Emerging Approaches. Genes (Basel) 2024; 15:443. [PMID: 38674378 PMCID: PMC11049430 DOI: 10.3390/genes15040443] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/12/2024] [Accepted: 03/25/2024] [Indexed: 04/28/2024]
Abstract
Migraine is a severe, debilitating neurovascular disorder. Hemiplegic migraine (HM) is a rare and debilitating neurological condition with a strong genetic basis. Sequencing technologies have improved the diagnosis and our understanding of the molecular pathophysiology of HM. Linkage analysis and sequencing studies in HM families have identified pathogenic variants in ion channels and related genes, including CACNA1A, ATP1A2, and SCN1A, that cause HM. However, approximately 75% of HM patients are negative for these mutations, indicating there are other genes involved in disease causation. In this review, we explored our current understanding of the genetics of HM. The evidence presented herein summarises the current knowledge of the genetics of HM, which can be expanded further to explain the remaining heritability of this debilitating condition. Innovative bioinformatics and computational strategies to cover the entire genetic spectrum of HM are also discussed in this review.
Affiliation(s)
- Lyn R. Griffiths
- Centre for Genomics and Personalised Health, Genomics Research Centre, School of Biomedical Sciences, Queensland University of Technology (QUT), Brisbane, QLD 4059, Australia; (M.M.A.); (N.M.); (H.G.S.); (R.A.L.)
3
Chung CW, Chou SC, Hsiao TH, Zhang GJ, Chung YF, Chen YM. Machine learning approaches to identify systemic lupus erythematosus in anti-nuclear antibody-positive patients using genomic data and electronic health records. BioData Min 2024; 17:1. [PMID: 38183082 PMCID: PMC10770905 DOI: 10.1186/s13040-023-00352-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 12/19/2023] [Indexed: 01/07/2024]
Abstract
BACKGROUND: Although the 2019 EULAR/ACR classification criteria for systemic lupus erythematosus (SLE) require at least a positive anti-nuclear antibody (ANA) titer (≥ 1:80), it remains challenging for clinicians to identify patients with SLE. This study aimed to develop a machine learning (ML) approach to assist in the detection of SLE patients using genomic data and electronic health records.
METHODS: Participants with a positive ANA (≥ 1:80) were enrolled from the Taiwan Precision Medicine Initiative cohort. The Taiwan Biobank version 2 array was used to detect single nucleotide polymorphism (SNP) data. Six ML models, Logistic Regression, Random Forest (RF), Support Vector Machine, Light Gradient Boosting Machine, Gradient Tree Boosting, and Extreme Gradient Boosting (XGB), were used to identify SLE patients. The importance of the clinical and genetic features was determined by Shapley Additive Explanation (SHAP) values. A logistic regression model was applied to identify genetic variations associated with SLE in the subset of patients with an ANA equal to or exceeding 1:640.
RESULTS: A total of 946 SLE and 1,892 non-SLE controls were included in this analysis. Among the six ML models, RF and XGB demonstrated superior performance in the differentiation of SLE from non-SLE. The leading features in the SHAP diagram were anti-double strand DNA antibodies, ANA titers, AC4 ANA pattern, polygenic risk scores, complement levels, and SNPs. Additionally, in the subgroup with a high ANA titer (≥ 1:640), six SNPs positively associated with SLE and five SNPs negatively correlated with SLE were discovered.
CONCLUSIONS: ML approaches offer the potential to assist in diagnosing SLE and uncovering novel SNPs in a group of patients with autoimmunity.
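SHAP values are built on Shapley values, which attribute a model's output for one sample to its individual features. The sketch below computes exact Shapley values by enumerating feature orderings, which is only feasible for a handful of features; it is not the SHAP library and not the study's pipeline. The scoring function `toy_model`, its coefficients, and the all-zero baseline are assumptions made up for the example.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample by enumerating all feature orderings."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)      # start from the reference input
        prev = predict(current)
        for i in order:
            current[i] = x[i]         # switch feature i to its observed value
            new = predict(current)
            phi[i] += new - prev      # marginal contribution in this ordering
            prev = new
    return [p / len(orders) for p in phi]


def toy_model(f):
    """Hypothetical risk score with an interaction between features 0 and 1."""
    return 2.0 * f[0] + 1.0 * f[1] + 0.5 * f[0] * f[1] + 0.0 * f[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The attributions sum to the difference between the prediction for `x` and the prediction for `baseline`, and the interaction term is split evenly between the two features involved.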
Affiliation(s)
- Chih-Wei Chung
- Department of Information Management, National Taiwan University, Taipei, Taiwan
- Seng-Cho Chou
- Department of Information Management, National Taiwan University, Taipei, Taiwan
- Tzu-Hung Hsiao
- Department of Medical Research, Taichung Veterans General Hospital, Taichung, Taiwan
- Department of Public Health, Fu Jen Catholic University, New Taipei City, Taiwan
- Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan
- Grace Joyce Zhang
- Department of Cellular and Physiological Sciences, The University of British Columbia, Vancouver, BC, Canada
- Yu-Fang Chung
- Department of Electrical Engineering, Tunghai University, Taichung, Taiwan
- Yi-Ming Chen
- Department of Medical Research, Taichung Veterans General Hospital, Taichung, Taiwan.
- Division of Allergy, Immunology and Rheumatology, Department of Internal Medicine, Taichung Veterans General Hospital, 1650, Section 4, Taiwan Boulevard, Xitun Dist., Taichung City, 407, Taiwan.
- Department of Post-Baccalaureate Medicine, College of Medicine, National Chung Hsing University, Taichung, Taiwan.
- School of Medicine, College of Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan.
- Rong Hsing Research Center for Translational Medicine & Ph.D. Program in Translational Medicine, National Chung Hsing University, Taichung, Taiwan.
- Precision Medicine Research Center, College of Medicine, National Chung Hsing University, Taichung, Taiwan.
4
Ho M, Levy TJ, Koulas I, Founta K, Coppa K, Hirsch JS, Davidson KW, Spyropoulos AC, Zanos TP. Longitudinal dynamic clinical phenotypes of in-hospital COVID-19 patients across three dominant virus variants in New York. Int J Med Inform 2024; 181:105286. [PMID: 37956643 PMCID: PMC10843635 DOI: 10.1016/j.ijmedinf.2023.105286] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/26/2023] [Revised: 10/20/2023] [Accepted: 11/03/2023] [Indexed: 11/15/2023]
Abstract
BACKGROUND: COVID-19 is a challenging disease to characterize given its wide-ranging heterogeneous symptomatology. Several studies have attempted to extract clinical phenotypes but often relied on data from small patient cohorts, usually limited to only one viral variant and utilizing a static snapshot of patient data.
OBJECTIVE: This study aimed to identify clinical phenotypes of hospitalized COVID-19 patients and investigate their longitudinal dynamics throughout the pandemic, with the goal of relating these phenotypes to clinical outcomes and treatment strategies.
METHODS: We utilized routinely collected demographic and clinical data throughout the hospitalization of 38,077 patients admitted between 3/2020 and 5/2022 in 12 New York hospitals. Uniform Manifold Approximation and Projection and agglomerative hierarchical clustering were used to derive the clusters, followed by exploratory data analysis to compare the prevalence of comorbidities and treatments per cluster.
RESULTS: Four distinct clinical phenotypes remained robust in multi-site validation and were associated with different mortality rates. The temporal progression of these phenotypes throughout the COVID-19 pandemic demonstrated increased variability across the waves of the three dominant viral variants (alpha, delta, omicron). Longitudinal analysis evaluating changes in clinical phenotypes of each patient throughout the course of a 4-week hospital stay exemplified the dynamic nature of the disease progression. Factors such as sex, race/ethnicity and specific treatment modalities revealed significant and clinically relevant differences between the observed phenotypes.
CONCLUSIONS: Our proposed methodology has the potential to enable clinicians and policy makers to draw evidence-based conclusions for guiding treatment modalities in a dynamic fashion.
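The clustering step named in the methods can be illustrated with a small pure-Python version of agglomerative hierarchical clustering. This is a sketch only: it assumes average linkage and Euclidean distance (the abstract does not state which linkage was used), it skips the UMAP projection entirely, and the 2-D points stand in for whatever low-dimensional embedding a projection step might produce.

```python
from math import dist


def agglomerative(points, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Mean pairwise Euclidean distance between two clusters.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge the second into the first.
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Invented 2-D embedding of six patients, for illustration only.
embedding = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0), (9.8, 0.3)]
groups = agglomerative(embedding, 3)
```

Each merge joins the two clusters with the smallest average pairwise distance, so stopping at a chosen number of clusters corresponds to cutting the dendrogram at that level.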
Affiliation(s)
- Matthew Ho
- Institute of Health Systems Science, Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY 11030; Institute of Bioelectronic Medicine, Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY 11030; Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Northwell Health, Hempstead, NY 11549
- Todd J Levy
- Institute of Health Systems Science, Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY 11030; Institute of Bioelectronic Medicine, Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY 11030
- Ioannis Koulas
- Institute of Health Systems Science, Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY 11030
- Kyriaki Founta
- Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Northwell Health, Hempstead, NY 11549
- Kevin Coppa
- Department of Clinical Digital Solutions, Northwell Health, New Hyde Park, NY 11042
- Jamie S Hirsch
- Institute of Health Systems Science, Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY 11030; Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Northwell Health, Hempstead, NY 11549; Department of Clinical Digital Solutions, Northwell Health, New Hyde Park, NY 11042
- Karina W Davidson
- Institute of Health Systems Science, Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY 11030; Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Northwell Health, Hempstead, NY 11549
- Alex C Spyropoulos
- Institute of Health Systems Science, Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY 11030; Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Northwell Health, Hempstead, NY 11549
- Theodoros P Zanos
- Institute of Health Systems Science, Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY 11030; Institute of Bioelectronic Medicine, Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY 11030; Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Northwell Health, Hempstead, NY 11549.
5
Bettencourt C, Skene N, Bandres-Ciga S, Anderson E, Winchester LM, Foote IF, Schwartzentruber J, Botia JA, Nalls M, Singleton A, Schilder BM, Humphrey J, Marzi SJ, Toomey CE, Kleifat AA, Harshfield EL, Garfield V, Sandor C, Keat S, Tamburin S, Frigerio CS, Lourida I, Ranson JM, Llewellyn DJ. Artificial intelligence for dementia genetics and omics. Alzheimers Dement 2023; 19:5905-5921. [PMID: 37606627 PMCID: PMC10841325 DOI: 10.1002/alz.13427] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 04/05/2023] [Revised: 07/14/2023] [Accepted: 07/18/2023] [Indexed: 08/23/2023]
Abstract
Genetics and omics studies of Alzheimer's disease and other dementia subtypes enhance our understanding of underlying mechanisms and pathways that can be targeted. We identified key remaining challenges: First, can we enhance genetic studies to address missing heritability? Can we identify reproducible omics signatures that differentiate between dementia subtypes? Can high-dimensional omics data identify improved biomarkers? How can genetics inform our understanding of causal status of dementia risk factors? And which biological processes are altered by dementia-related genetic variation? Artificial intelligence (AI) and machine learning approaches give us powerful new tools in helping us to tackle these challenges, and we review possible solutions and examples of best practice. However, their limitations also need to be considered, as well as the need for coordinated multidisciplinary research and diverse deeply phenotyped cohorts. Ultimately AI approaches improve our ability to interrogate genetics and omics data for precision dementia medicine. HIGHLIGHTS: We have identified five key challenges in dementia genetics and omics studies. AI can enable detection of undiscovered patterns in dementia genetics and omics data. Enhanced and more diverse genetics and omics datasets are still needed. Multidisciplinary collaborative efforts using AI can boost dementia research.
Affiliation(s)
- Conceicao Bettencourt
- Department of Neurodegenerative Disease, UCL Queen Square Institute of Neurology, London, UK
- Queen Square Brain Bank for Neurological Disorders, UCL Queen Square Institute of Neurology, London, UK
- Nathan Skene
- UK Dementia Research Institute, Imperial College London, London, UK
- Department of Brain Sciences, Imperial College London, London, UK
- Sara Bandres-Ciga
- Center for Alzheimer's and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, Maryland, USA
- Emma Anderson
- Department of Mental Health of Older People, Division of Psychiatry, University College London, London, UK
- Isabelle F Foote
- Institute for Behavioral Genetics, University of Colorado Boulder, Boulder, Colorado, USA
- Jeremy Schwartzentruber
- Open Targets, Cambridge, UK
- Wellcome Sanger Institute, Cambridge, UK
- Illumina Artificial Intelligence Laboratory, Illumina Inc, Foster City, California, USA
- Juan A Botia
- Departamento de Ingeniería de la Información y las Comunicaciones, Universidad de Murcia, Murcia, Spain
- Mike Nalls
- Center for Alzheimer's and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, Maryland, USA
- Data Tecnica International LLC, Washington, DC, USA
- Andrew Singleton
- Center for Alzheimer's and Related Dementias (CARD), National Institute on Aging and National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, Maryland, USA
- Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, Maryland, USA
- Brian M Schilder
- UK Dementia Research Institute, Imperial College London, London, UK
- Department of Brain Sciences, Imperial College London, London, UK
- Jack Humphrey
- Nash Family Department of Neuroscience and Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, USA
- Sarah J Marzi
- UK Dementia Research Institute, Imperial College London, London, UK
- Department of Brain Sciences, Imperial College London, London, UK
- Christina E Toomey
- Queen Square Brain Bank for Neurological Disorders, UCL Queen Square Institute of Neurology, London, UK
- Department of Clinical and Movement Neuroscience, UCL Queen Square Institute of Neurology, London, UK
- The Francis Crick Institute, London, UK
- Ahmad Al Kleifat
- Department of Basic and Clinical Neuroscience, Maurice Wohl Clinical Neuroscience Institute, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
- Eric L Harshfield
- Stroke Research Group, Department of Clinical Neurosciences, University of Cambridge, Cambridge, UK
- Victoria Garfield
- MRC Unit for Lifelong Health and Ageing, Institute of Cardiovascular Science, University College London, London, UK
- Cynthia Sandor
- UK Dementia Research Institute, School of Medicine, Cardiff University, Cardiff, UK
- Samuel Keat
- UK Dementia Research Institute, School of Medicine, Cardiff University, Cardiff, UK
- Stefano Tamburin
- Department of Neurosciences, Biomedicine and Movement Sciences, Neurology Section, University of Verona, Verona, Italy
- Carlo Sala Frigerio
- UK Dementia Research Institute, Queen Square Institute of Neurology, University College London, London, UK
- David J Llewellyn
- University of Exeter Medical School, Exeter, UK
- The Alan Turing Institute, London, UK
6
Polano M, Bedon L, Dal Bo M, Sorio R, Bartoletti M, De Mattia E, Cecchin E, Pisano C, Lorusso D, Lissoni AA, De Censi A, Cecere SC, Scollo P, Marchini S, Arenare L, De Giorgi U, Califano D, Biagioli E, Chiodini P, Perrone F, Pignata S, Toffoli G. Machine Learning Application Identifies Germline Markers of Hypertension in Patients With Ovarian Cancer Treated With Carboplatin, Taxane, and Bevacizumab. Clin Pharmacol Ther 2023; 114:652-663. [PMID: 37243926 DOI: 10.1002/cpt.2960] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 03/07/2023] [Accepted: 05/22/2023] [Indexed: 05/29/2023]
Abstract
Pharmacogenomics studies how genes influence a person's response to treatment. When complex phenotypes are influenced by multiple genetic variations with little effect, a single piece of genetic information is often insufficient to explain this variability. The application of machine learning (ML) in pharmacogenomics holds great potential - namely, it can be used to unravel complicated genetic relationships that could explain response to therapy. In this study, ML techniques were used to investigate the relationship between genetic variations affecting more than 60 candidate genes and carboplatin-induced, taxane-induced, and bevacizumab-induced toxicities in 171 patients with ovarian cancer enrolled in the MITO-16A/MaNGO-OV2A trial. Single-nucleotide variation (SNV, formerly SNP) profiles were examined using ML to find and prioritize those associated with drug-induced toxicities, specifically hypertension, hematological toxicity, nonhematological toxicity, and proteinuria. The Boruta algorithm was used in cross-validation to determine the significance of SNVs in predicting toxicities. Important SNVs were then used to train eXtreme gradient boosting models. During cross-validation, the models achieved reliable performance with a Matthews correlation coefficient ranging from 0.375 to 0.410. A total of 43 SNVs critical for predicting toxicity were identified. For each toxicity, key SNVs were used to create a polygenic toxicity risk score that effectively divided individuals into high-risk and low-risk categories. In particular, compared with low-risk individuals, high-risk patients were 28-fold more likely to develop hypertension. The proposed method provided insightful data to improve precision medicine for patients with ovarian cancer, which may be useful for reducing toxicities and improving toxicity management.
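Two quantities in this abstract have simple closed forms: the Matthews correlation coefficient used to report model performance, and a risk score that splits patients into high-risk and low-risk groups. The sketch below shows both. The score is assumed here to be a weighted sum of risk-allele counts with a fixed cut-off, which is one common construction and not necessarily the one used in the study; the SNV identifiers, weights and threshold are invented.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventionally reported as 0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def risk_score(genotype, weights):
    """Weighted sum of risk-allele counts (0, 1 or 2 copies per SNV)."""
    return sum(weights[snv] * count for snv, count in genotype.items())


def risk_group(score, threshold):
    """Assign a patient to the high-risk or low-risk category."""
    return "high" if score >= threshold else "low"


# Hypothetical SNV identifiers and weights, for illustration only.
weights = {"snv_1": 0.8, "snv_2": 0.3, "snv_3": -0.2}
patient = {"snv_1": 2, "snv_2": 1, "snv_3": 0}
score = risk_score(patient, weights)
group = risk_group(score, threshold=1.0)
```

An MCC of 1 means perfect agreement, 0 means no better than chance, and -1 means total disagreement, which puts the reported range of 0.375 to 0.410 in context as moderate positive correlation.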
Affiliation(s)
- Maurizio Polano
- Experimental and Clinical Pharmacology Unit, Centro di Riferimento Oncologico di Aviano, Istituto di Ricovero e Cura a Carattere Scientifico, Aviano, Italy
- Luca Bedon
- Experimental and Clinical Pharmacology Unit, Centro di Riferimento Oncologico di Aviano, Istituto di Ricovero e Cura a Carattere Scientifico, Aviano, Italy
- Michele Dal Bo
- Experimental and Clinical Pharmacology Unit, Centro di Riferimento Oncologico di Aviano, Istituto di Ricovero e Cura a Carattere Scientifico, Aviano, Italy
- Roberto Sorio
- Dipartimento di Oncologia Medica, Centro di Riferimento Oncologico di Aviano, Istituto di Ricovero e Cura a Carattere Scientifico, Aviano, Italy
- Michele Bartoletti
- Dipartimento di Oncologia Medica, Centro di Riferimento Oncologico di Aviano, Istituto di Ricovero e Cura a Carattere Scientifico, Aviano, Italy
- Elena De Mattia
- Experimental and Clinical Pharmacology Unit, Centro di Riferimento Oncologico di Aviano, Istituto di Ricovero e Cura a Carattere Scientifico, Aviano, Italy
- Erika Cecchin
- Experimental and Clinical Pharmacology Unit, Centro di Riferimento Oncologico di Aviano, Istituto di Ricovero e Cura a Carattere Scientifico, Aviano, Italy
- Carmela Pisano
- Uro-Gynecologic Oncology Unit, Istituto Nazionale Tumori Istituto di Ricovero e Cura a Carattere Scientifico Fondazione G. Pascale, Naples, Italy
- Domenica Lorusso
- Department of Women and Child Health, Division of Gynecologic Oncology, Fondazione Policlinico Universitario A. Gemelli Istituto di Ricovero e Cura a Carattere Scientifico, Rome, Italy
- Department of Life Science and Public Health, Catholic University of Sacred Heart Largo Agostino Gemelli, Rome, Italy
- Andrea Alberto Lissoni
- Clinica Ostetrica e Ginecologica, Istituto di Ricovero e Cura a Carattere Scientifico S. Gerardo Monza, Università di Milano Bicocca, Milano, Italy
- Sabrina Chiara Cecere
- Uro-Gynecologic Oncology Unit, Istituto Nazionale Tumori Istituto di Ricovero e Cura a Carattere Scientifico Fondazione G. Pascale, Naples, Italy
- Paolo Scollo
- Unità Operativa Ostetricia e Ginecologia, Dipartimento Materno-Infantile, Ospedale Cannizzaro, Catania, Italy
- Sergio Marchini
- Molecular Pharmacology laboratory, Group of Cancer Pharmacology Istituto di Ricovero e Cura a Carattere Scientifico Humanitas Research Hospital, Rozzano, Italy
- Laura Arenare
- Clinical Trial Unit, Istituto Nazionale Tumori, Istituto di Ricovero e Cura a Carattere Scientifico, Fondazione G. Pascale, Naples, Italy
- Ugo De Giorgi
- Istituto di Ricovero e Cura a Carattere Scientifico Istituto Romagnolo per lo Studio dei Tumori Dino Amadori, Meldola, Italy
- Daniela Califano
- Microenvironment Molecular Targets Unit, Istituto Nazionale Tumori IRCCS, Fondazione G. Pascale, Naples, Italy
- Elena Biagioli
- Department of Oncology, Istituto di Ricerche Farmacologiche Mario Negri IRCCS Milano, Milano, Italy
- Paolo Chiodini
- Department of Mental Health and Public Medicine, Section of Statistics, Università degli Studi della Campania Luigi Vanvitelli, Naples, Italy
- Francesco Perrone
- Clinical Trial Unit, Istituto Nazionale Tumori, Istituto di Ricovero e Cura a Carattere Scientifico, Fondazione G. Pascale, Naples, Italy
- Sandro Pignata
- Uro-Gynecologic Oncology Unit, Istituto Nazionale Tumori Istituto di Ricovero e Cura a Carattere Scientifico Fondazione G. Pascale, Naples, Italy
- Giuseppe Toffoli
- Experimental and Clinical Pharmacology Unit, Centro di Riferimento Oncologico di Aviano, Istituto di Ricovero e Cura a Carattere Scientifico, Aviano, Italy
7
Susmitha P, Kumar P, Yadav P, Sahoo S, Kaur G, Pandey MK, Singh V, Tseng TM, Gangurde SS. Genome-wide association study as a powerful tool for dissecting competitive traits in legumes. Front Plant Sci 2023; 14:1123631. [PMID: 37645459 PMCID: PMC10461012 DOI: 10.3389/fpls.2023.1123631] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 12/14/2022] [Accepted: 06/08/2023] [Indexed: 08/31/2023]
Abstract
Legumes are extremely valuable because of their high protein content and several other nutritional components. The major challenge lies in maintaining the quantity and quality of protein and other nutritional compounds in view of climate change conditions. The global need for plant-based proteins has increased the demand for seeds with a high protein content that includes essential amino acids. Genome-wide association studies (GWAS) have evolved as a standard approach in agricultural genetics for examining such intricate characters. Recent development in machine learning methods shows promising applications for dimensionality reduction, which is a major challenge in GWAS. With the advancement in biotechnology, sequencing, and bioinformatics tools, estimation of linkage disequilibrium (LD) based associations between a genome-wide collection of single-nucleotide polymorphisms (SNPs) and desired phenotypic traits has become accessible. The markers from GWAS could be utilized for genomic selection (GS) to predict superior lines by calculating genomic estimated breeding values (GEBVs). For prediction accuracy, an assortment of statistical models could be utilized, such as ridge regression best linear unbiased prediction (rrBLUP), genomic best linear unbiased predictor (gBLUP), Bayesian, and random forest (RF). Both naturally diverse germplasm panels and family-based breeding populations can be used for association mapping based on the nature of the breeding system (inbred or outbred) in the plant species. MAGIC, MCILs, RIAILs, NAM, and ROAM are being used for association mapping in several crops. Several modifications of NAM, such as doubled haploid NAM (DH-NAM), backcross NAM (BC-NAM), and advanced backcross NAM (AB-NAM), have also been used in crops like rice, wheat, maize, barley, mustard, etc. For reliable marker-trait associations (MTAs), phenotyping accuracy is as important as genotyping. High-throughput genotyping, phenomics, and computational techniques have advanced during the past few years, making it possible to explore such enormous datasets. Each population has unique virtues and flaws at the genomics and phenomics levels, which will be covered in more detail in this review study. The current investigation includes utilizing elite breeding lines as association mapping population, optimizing the choice of GWAS selection, population size, and hurdles in phenotyping, and statistical methods which will analyze competitive traits in legume breeding.
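Linkage disequilibrium between two biallelic markers is commonly summarized by r², computed from allele and haplotype frequencies. The sketch below assumes phased haplotypes coded as pairs of 0/1 alleles; the function name and the toy haplotype lists are illustrative and are not taken from the review.

```python
def ld_r2(haplotypes):
    """r^2 between two biallelic loci from phased haplotypes coded as (a, b) in {0, 1}."""
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n    # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n    # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                       # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r^2 is undefined when either locus is monomorphic; report 0.0 in that case.
    return 0.0 if denom == 0 else d * d / denom


# Invented examples: alleles always inherited together, and fully independent.
complete_ld = [(1, 1), (1, 1), (0, 0), (0, 0)]
no_ld = [(1, 1), (1, 0), (0, 1), (0, 0)]
```

Values near 1 indicate that a genotyped marker is a good proxy for a nearby causal variant, which is the property association mapping relies on.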
Affiliation(s)
- Pusarla Susmitha
- Regional Agricultural Research Station, Acharya N.G. Ranga Agricultural University, Andhra Pradesh, India
- Pawan Kumar
- Department of Genetics and Plant Breeding, College of Agriculture, Chaudhary Charan Singh (CCS) Haryana Agricultural University, Hisar, India
- Pankaj Yadav
- Department of Bioscience and Bioengineering, Indian Institute of Technology, Rajasthan, India
- Smrutishree Sahoo
- Department of Genetics and Plant Breeding, School of Agriculture, Gandhi Institute of Engineering and Technology (GIET) University, Odisha, India
- Gurleen Kaur
- Horticultural Sciences Department, University of Florida, Gainesville, FL, United States
- Manish K. Pandey
- Department of Genomics, Prebreeding and Bioinformatics, International Crops Research Institute for the Semi-Arid Tropics, Hyderabad, India
- Varsha Singh
- Department of Plant and Soil Sciences, Mississippi State University, Starkville, MS, United States
- Te Ming Tseng
- Department of Plant and Soil Sciences, Mississippi State University, Starkville, MS, United States
- Sunil S. Gangurde
- Department of Plant Pathology, University of Georgia, Tifton, GA, United States
8
Choudhary A, Anand A, Singh A, Roy P, Singh N, Kumar V, Sharma S, Baranwal M. Machine learning-based ensemble approach in prediction of lung cancer predisposition using XRCC1 gene polymorphism. J Biomol Struct Dyn 2023:1-10. [PMID: 37545160 DOI: 10.1080/07391102.2023.2242492] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/03/2022] [Accepted: 07/23/2023] [Indexed: 08/08/2023]
Abstract
The employment of machine learning approaches has shown promising results in predicting cancer. In the current study, polymorphism data for five single nucleotide polymorphisms (SNPs) of the DNA repair gene XRCC1 (XRCC1 399, XRCC1 194, XRCC1 206, XRCC1 632, XRCC1 280) in the north Indian population, along with four smoking status data, are considered as input to the proposed ensemble model to predict individual susceptibility to lung cancer. The prediction accuracy of the proposed ensemble model for cancer predisposition was found to be 85%. The model performance is also evaluated using sensitivity, specificity, precision and the Gini index, which fall in the range of 0.83-0.87. The proposed model also outperformed the individual models (LM, SVM, RF, KNN and baseline neural net) on all evaluation parameters. Collectively, current results suggest the potential of the proposed ensemble model in predicting the risk of cancer based on XRCC1 SNP data. Communicated by Ramaswamy H. Sarma.
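The abstract does not say how the base models are combined, so the sketch below assumes the simplest option, a majority vote over binary predictions, and then computes the accuracy, sensitivity, specificity and precision mentioned above. The three model output lists and the true labels are invented for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists (one list per model), sample by sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def confusion(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and precision from the confusion counts."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Invented outputs of three hypothetical base models on six samples.
model_outputs = [
    [1, 0, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 1],
]
y_true = [1, 0, 1, 1, 0, 1]
ensemble = majority_vote(model_outputs)
scores = metrics(y_true, ensemble)
```

Using an odd number of base models avoids ties for binary labels; weighted voting or stacking are alternatives when the base models differ widely in quality.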
Affiliation(s)
- Abhishek Choudhary
- Department of Computer Science, Thapar Institute of Engineering & Technology, India
- Adarsh Anand
- Department of Electronics & Communication Engineering, Thapar Institute of Engineering & Technology, India
- Amrita Singh
- Department of Biotechnology, Thapar Institute of Engineering & Technology, Patiala, Punjab, India
- Pratima Roy
- Department of Biotechnology, Thapar Institute of Engineering & Technology, Patiala, Punjab, India
- Navneet Singh
- Department of Pulmonary Medicine, Post Graduate Institute of Education and Medical Research (PGIMER), Chandigarh, India
- Vinay Kumar
- Department of Electronics & Communication Engineering, Thapar Institute of Engineering & Technology, India
- Siddharth Sharma
- Department of Biotechnology, Thapar Institute of Engineering & Technology, Patiala, Punjab, India
- Manoj Baranwal
- Department of Biotechnology, Thapar Institute of Engineering & Technology, Patiala, Punjab, India
| |
|
9
|
Alzoubi H, Alzubi R, Ramzan N. Deep Learning Framework for Complex Disease Risk Prediction Using Genomic Variations. Sensors (Basel, Switzerland) 2023; 23:s23094439. [PMID: 37177642 PMCID: PMC10181706 DOI: 10.3390/s23094439] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2023] [Revised: 04/05/2023] [Accepted: 04/26/2023] [Indexed: 05/15/2023]
Abstract
Genome-wide association studies have proven their ability to improve human health outcomes by identifying genotypes associated with phenotypes. Various works have attempted to predict the risk of diseases for individuals based on genotype data. This prediction can either be considered as an analysis model that can lead to a better understanding of gene functions that underlie human disease or as a black box in order to be used in decision support systems and in early disease detection. Deep learning techniques have gained more popularity recently. In this work, we propose a deep-learning framework for disease risk prediction. The proposed framework employs a multilayer perceptron (MLP) in order to predict individuals' disease status. The proposed framework was applied to the Wellcome Trust Case-Control Consortium (WTCCC), the UK National Blood Service (NBS) Control Group, and the 1958 British Birth Cohort (58C) datasets. The performance comparison of the proposed framework showed that the proposed approach outperformed the other methods in predicting disease risk, achieving an area under the curve (AUC) up to 0.94.
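The area under the ROC curve (AUC) reported here can be read as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control. The sketch below computes it by direct pairwise comparison; the function name and the scores are hypothetical, and this is not the evaluation code used in the study.

```python
def pairwise_auc(case_scores, control_scores):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical predicted risks for three cases and three controls.
auc = pairwise_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

The quadratic loop is fine for small illustrations; rank-based formulas give the same value more efficiently on large cohorts.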
Affiliation(s)
- Hadeel Alzoubi
- Department of Computer Science, College of Computer Science and Information Technology, King Faisal University, Al-Ahsa 31982, Saudi Arabia
| | - Raid Alzubi
- Department of Computer Science, College of Computer Science and Information Technology, King Faisal University, Al-Ahsa 31982, Saudi Arabia
| | - Naeem Ramzan
- School of Computing, Engineering and Physical Sciences, University of the West of Scotland, High Street, Paisley PA1 2BE, UK
| |
|
10
|
Learning high-order interactions for polygenic risk prediction. PLoS One 2023; 18:e0281618. [PMID: 36763605 PMCID: PMC9916647 DOI: 10.1371/journal.pone.0281618] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/29/2022] [Accepted: 01/27/2023] [Indexed: 02/11/2023] Open
Abstract
Within the framework of precision medicine, the stratification of individual genetic susceptibility based on inherited DNA variation has paramount relevance. However, one of the most relevant pitfalls of traditional Polygenic Risk Score (PRS) approaches is their inability to model complex high-order non-linear SNP-SNP interactions and their effect on the phenotype (e.g. epistasis). Indeed, they face a computational challenge, as the number of possible interactions grows exponentially with the number of SNPs considered, which also affects the statistical reliability of the model parameters. In this work, we address this issue by proposing a novel PRS approach, called High-order Interactions-aware Polygenic Risk Score (hiPRS), that incorporates high-order interactions in modeling polygenic risk. The latter combines an interaction search routine based on frequent itemsets mining and a novel interaction selection algorithm based on Mutual Information, to construct a simple and interpretable weighted model of user-specified dimensionality that can predict a given binary phenotype. Compared to traditional PRS methods, hiPRS does not rely on GWAS summary statistics or any external information. Moreover, hiPRS differs from Machine Learning-based approaches that can include complex interactions in that it provides a readable and interpretable model and is able to control overfitting, even on small samples. In the present work we demonstrate through a comprehensive simulation study the superior performance of hiPRS with respect to state-of-the-art methods, both in terms of scoring performance and interpretability of the resulting model. We also test hiPRS against small sample size, class imbalance and the presence of noise, showcasing its robustness to extreme experimental settings.
Finally, we apply hiPRS to a case study on real data from the DACHS cohort, defining an interaction-aware scoring model to predict mortality of stage II-III Colon-Rectal Cancer patients treated with oxaliplatin.
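To illustrate why mutual information is a natural criterion for ranking candidate interactions against a binary phenotype, here is a minimal sketch of the standard discrete mutual information estimate. It is not the hiPRS selection algorithm itself; the function name, the way the interaction feature is built, and the toy genotype data are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    count_x = Counter(x)
    count_y = Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        p_a = count_x[a] / n
        p_b = count_y[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical binary SNP indicators and phenotype (not data from the study).
snp_a = [1, 1, 0, 0, 1, 0, 1, 0]
snp_b = [1, 0, 1, 0, 1, 1, 1, 0]
phenotype = [1, 0, 0, 0, 1, 0, 1, 0]

# An interaction term that is "on" only when both SNP indicators are on.
interaction = [int(a == 1 and b == 1) for a, b in zip(snp_a, snp_b)]

mi_single = mutual_information(snp_a, phenotype)
mi_interaction = mutual_information(interaction, phenotype)
```

In this toy data the phenotype is fully determined by the interaction term, so the interaction carries more information about the phenotype than either SNP alone.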
|
11
|
Zizaan A, Idri A. Machine learning based Breast Cancer screening: trends, challenges, and opportunities. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization 2023. [DOI: 10.1080/21681163.2023.2172615] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/10/2023]
Affiliation(s)
- Asma Zizaan
- Mohammed VI Polytechnic University, Benguerir, Morocco
| | - Ali Idri
- Mohammed VI Polytechnic University, Benguerir, Morocco
- Software Project Management Research Team, ENSIAS, Mohammed V University, Rabat, Morocco
| |
|
12
|
Gonzalez-Gomez R, Ibañez A, Moguilner S. Multiclass characterization of frontotemporal dementia variants via multimodal brain network computational inference. Netw Neurosci 2023; 7:322-350. [PMID: 37333999 PMCID: PMC10270711 DOI: 10.1162/netn_a_00285] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 04/14/2022] [Accepted: 10/03/2022] [Indexed: 04/03/2024] Open
Abstract
Characterizing a particular neurodegenerative condition against other possible diseases remains a challenge along clinical, biomarker, and neuroscientific levels. This is the particular case of frontotemporal dementia (FTD) variants, where their specific characterization requires high levels of expertise and multidisciplinary teams to subtly distinguish among similar physiopathological processes. Here, we used a computational approach of multimodal brain networks to address simultaneous multiclass classification of 298 subjects (one group against all others), including five FTD variants: behavioral variant FTD, corticobasal syndrome, nonfluent variant primary progressive aphasia, progressive supranuclear palsy, and semantic variant primary progressive aphasia, with healthy controls. Fourteen machine learning classifiers were trained with functional and structural connectivity metrics calculated through different methods. Due to the large number of variables, dimensionality was reduced, employing statistical comparisons and progressive elimination to assess feature stability under nested cross-validation. The machine learning performance was measured through the area under the receiver operating characteristic curves, reaching 0.81 on average, with a standard deviation of 0.09. Furthermore, the contributions of demographic and cognitive data were also assessed via multifeatured classifiers. An accurate simultaneous multiclass classification of each FTD variant against other variants and controls was obtained based on the selection of an optimum set of features. The classifiers incorporating the brain's network and cognitive assessment increased performance metrics. Multimodal classifiers evidenced specific variants' compromise, across modalities and methods through feature importance analysis. If replicated and validated, this approach may help to support clinical decision tools aimed at detecting specific affectations in the context of overlapping diseases.
Affiliation(s)
- Raul Gonzalez-Gomez
- Latin American Brain Health Institute (BrainLat), Universidad Adolfo Ibañez, Santiago de Chile, Chile
- Center for Social and Cognitive Neuroscience, School of Psychology, Universidad Adolfo Ibañez, Santiago de Chile, Chile
| | - Agustín Ibañez
- Latin American Brain Health Institute (BrainLat), Universidad Adolfo Ibañez, Santiago de Chile, Chile
- Cognitive Neuroscience Center, Universidad de San Andres, Buenos Aires, Argentina
- Global Brain Health Institute, University of California San Francisco, San Francisco, CA, USA
- Trinity College Dublin, Dublin, Ireland
| | - Sebastian Moguilner
- Center for Social and Cognitive Neuroscience, School of Psychology, Universidad Adolfo Ibañez, Santiago de Chile, Chile
- Cognitive Neuroscience Center, Universidad de San Andres, Buenos Aires, Argentina
- Global Brain Health Institute, University of California San Francisco, San Francisco, CA, USA
- Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
| |
|
13
|
Salgado Á, de Melo-Minardi RC, Giovanetti M, Veloso A, Morais-Rodrigues F, Adelino T, de Jesus R, Tosta S, Azevedo V, Lourenco J, Alcantara LCJ. Machine learning models exploring characteristic single-nucleotide signatures in yellow fever virus. PLoS One 2022; 17:e0278982. [PMID: 36508435 PMCID: PMC9744328 DOI: 10.1371/journal.pone.0278982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2021] [Accepted: 11/29/2022] [Indexed: 12/14/2022] Open
Abstract
Yellow fever virus (YFV) is the agent of the most severe mosquito-borne disease in the tropics. Recently, Brazil suffered major YFV outbreaks with a high fatality rate affecting areas where the virus has not been reported for decades, consisting of urban areas where a large number of unvaccinated people live. We developed a machine learning framework combining three different algorithms (XGBoost, random forest and regularized logistic regression) to analyze YFV genomic sequences. This method was applied to 56 YFV sequences from human infections and 27 from non-human primate (NHPs) infections to investigate the presence of genetic signatures possibly related to disease severity (in human related sequences) and differences in PCR cycle threshold (Ct) values (in NHP related sequences). Our analyses reveal four non-synonymous single nucleotide variations (SNVs) on sequences from human infections, in proteins NS3 (E614D), NS4a (I69V), NS5 (R727G, V643A) and six non-synonymous SNVs on NHP sequences, in proteins E (L385F), NS1 (A171V), NS3 (I184V) and NS5 (N11S, I374V, E641D). We performed comparative protein structural analysis on these SNVs, describing possible impacts on protein function. Despite the fact that the dataset is limited in size and that this study does not consider virus-host interactions, our work highlights the use of machine learning as a versatile and fast initial approach to genomic data exploration.
Affiliation(s)
- Álvaro Salgado
- Laboratório de Genética Celular e Molecular, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Raquel C. de Melo-Minardi
- Departamento de Ciência da Computação, Instituto de Ciências Exatas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Marta Giovanetti
- Laboratório de Genética Celular e Molecular, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Laboratório de Flavivírus, Instituto Oswaldo Cruz, Fundação Oswaldo Cruz, Rio de Janeiro, Brazil
| | - Adriano Veloso
- Departamento de Ciência da Computação, Instituto de Ciências Exatas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Francielly Morais-Rodrigues
- Laboratório de Genética Celular e Molecular, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Talita Adelino
- Laboratório Central de Saúde Pública, Fundação Ezequiel Dias, Belo Horizonte, Minas Gerais, Brazil
| | - Ronaldo de Jesus
- Coordenação Geral dos Laboratórios de Saúde Pública, Secretaria de Vigilância em Saúde, Ministério da Saúde, Brasília, DF, Brazil
| | - Stephane Tosta
- Laboratório de Genética Celular e Molecular, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Vasco Azevedo
- Laboratório de Genética Celular e Molecular, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - José Lourenco
- Department of Zoology, University of Oxford, Oxford, United Kingdom
| | - Luiz Carlos J. Alcantara
- Laboratório de Genética Celular e Molecular, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Laboratório de Flavivírus, Instituto Oswaldo Cruz, Fundação Oswaldo Cruz, Rio de Janeiro, Brazil
| |
|
14
|
Moguilner S, Birba A, Fittipaldi S, Gonzalez-Campo C, Tagliazucchi E, Reyes P, Matallana D, Parra MA, Slachevsky A, Farías G, Cruzat J, García A, Eyre HA, Joie RL, Rabinovici G, Whelan R, Ibáñez A. Multi-feature computational framework for combined signatures of dementia in underrepresented settings. J Neural Eng 2022; 19:10.1088/1741-2552/ac87d0. [PMID: 35940105 PMCID: PMC11177279 DOI: 10.1088/1741-2552/ac87d0] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 04/06/2022] [Accepted: 08/08/2022] [Indexed: 11/11/2022]
Abstract
Objective. The differential diagnosis of behavioral variant frontotemporal dementia (bvFTD) and Alzheimer's disease (AD) remains challenging in underrepresented, underdiagnosed groups, including Latinos, as advanced biomarkers are rarely available. Recent guidelines for the study of dementia highlight the critical role of biomarkers. Thus, novel cost-effective complementary approaches are required in clinical settings. Approach. We developed a novel framework based on a gradient boosting machine learning classifier, tuned by Bayesian optimization, on a multi-feature multimodal approach (combining demographic, neuropsychological, magnetic resonance imaging (MRI), and electroencephalography/functional MRI connectivity data) to characterize neurodegeneration using site harmonization and sequential feature selection. We assessed 54 bvFTD and 76 AD patients and 152 healthy controls (HCs) from a Latin American consortium (ReDLat). Main results. The multimodal model yielded high area under the curve classification values (bvFTD patients vs HCs: 0.93 (±0.01); AD patients vs HCs: 0.95 (±0.01); bvFTD vs AD patients: 0.92 (±0.01)). The feature selection approach successfully filtered non-informative multimodal markers (from thousands to dozens). Results proved robust against multimodal heterogeneity, sociodemographic variability, and missing data. Significance. The model accurately identified dementia subtypes using measures readily available in underrepresented settings, with performance similar to that of advanced biomarkers. This approach, if confirmed and replicated, may potentially complement clinical assessments in developing countries.
Affiliation(s)
- Sebastian Moguilner
- Global Brain Health Institute (GBHI), University of California San Francisco (UCSF), CA, United States of America
- Cognitive Neuroscience Center (CNC), Universidad de San Andrés, Buenos Aires, Argentina
- Latin American Brain Health (BrainLat), Universidad Adolfo Ibáñez, Santiago, Chile
- Trinity College Dublin, Dublin, Ireland
| | - Agustina Birba
- Cognitive Neuroscience Center (CNC), Universidad de San Andrés, Buenos Aires, Argentina
- Latin American Brain Health (BrainLat), Universidad Adolfo Ibáñez, Santiago, Chile
- National Scientific and Technical Research Council (CONICET), Buenos Aires, Argentina
| | - Sol Fittipaldi
- Cognitive Neuroscience Center (CNC), Universidad de San Andrés, Buenos Aires, Argentina
- National Scientific and Technical Research Council (CONICET), Buenos Aires, Argentina
| | | | - Enzo Tagliazucchi
- Latin American Brain Health (BrainLat), Universidad Adolfo Ibáñez, Santiago, Chile
- National Scientific and Technical Research Council (CONICET), Buenos Aires, Argentina
- Department of Physics, University of Buenos Aires, Buenos Aires, Argentina
| | - Pablo Reyes
- Medical School, Aging Institute, Psychiatry and Mental Health, Pontificia Universidad Javeriana, Bogota, Colombia
| | - Diana Matallana
- Medical School, Aging Institute, Psychiatry and Mental Health, Pontificia Universidad Javeriana, Bogota, Colombia
| | - Mario A Parra
- MAP: School of Psychological Sciences and Health, University of Strathclyde, Glasgow, United Kingdom
| | - Andrea Slachevsky
- Gerosciences Center for Brain Health and Metabolism, Santiago, Chile
- Faculty of Medicine, University of Chile, Santiago, Chile
- Memory and Neuropsychiatric Clinic (CMYN) Neurology Department, Hospital del Salvador and University of Chile, Santiago, Chile
- Servicio de Neurología, Departamento de Medicina, Clínica Alemana-Universidad del Desarrollo, Santiago de Chile, Chile
| | - Gonzalo Farías
- Faculty of Medicine, University of Chile, Santiago, Chile
| | - Josefina Cruzat
- Latin American Brain Health (BrainLat), Universidad Adolfo Ibáñez, Santiago, Chile
| | - Adolfo García
- Global Brain Health Institute (GBHI), University of California San Francisco (UCSF), CA, United States of America
- Cognitive Neuroscience Center (CNC), Universidad de San Andrés, Buenos Aires, Argentina
- National Scientific and Technical Research Council (CONICET), Buenos Aires, Argentina
- Departamento de Lingüística y Literatura, Facultad de Humanidades, Universidad de Santiago de Chile, Santiago, Chile
- Trinity College Dublin, Dublin, Ireland
| | - Harris A Eyre
- Global Brain Health Institute (GBHI), University of California San Francisco (UCSF), CA, United States of America
- Neuroscience-Inspired Policy Initiative, Organisation for Economic Co-operation and Development and PRODEO Institute, Paris, France
- IMPACT, The Institute for Mental and Physical Health and Clinical Translation, Deakin University, Geelong, Victoria, Australia
- Department of Psychiatry and Behavioral Sciences, Baylor College of Medicine, Houston, TX, United States of America
- Trinity College Dublin, Dublin, Ireland
| | - Renaud La Joie
- Memory and Aging Center, Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA, United States of America
| | - Gil Rabinovici
- Global Brain Health Institute (GBHI), University of California San Francisco (UCSF), CA, United States of America
- Memory and Aging Center, Department of Neurology, Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA, United States of America
- Trinity College Dublin, Dublin, Ireland
| | - Robert Whelan
- Global Brain Health Institute (GBHI), University of California San Francisco (UCSF), CA, United States of America
- Trinity College Dublin, Dublin, Ireland
| | - Agustín Ibáñez
- Global Brain Health Institute (GBHI), University of California San Francisco (UCSF), CA, United States of America
- Cognitive Neuroscience Center (CNC), Universidad de San Andrés, Buenos Aires, Argentina
- Latin American Brain Health (BrainLat), Universidad Adolfo Ibáñez, Santiago, Chile
- National Scientific and Technical Research Council (CONICET), Buenos Aires, Argentina
- Trinity College Dublin, Dublin, Ireland
| |
|
15
|
Elgart M, Lyons G, Romero-Brufau S, Kurniansyah N, Brody JA, Guo X, Lin HJ, Raffield L, Gao Y, Chen H, de Vries P, Lloyd-Jones DM, Lange LA, Peloso GM, Fornage M, Rotter JI, Rich SS, Morrison AC, Psaty BM, Levy D, Redline S, Sofer T. Non-linear machine learning models incorporating SNPs and PRS improve polygenic prediction in diverse human populations. Commun Biol 2022; 5:856. [PMID: 35995843 PMCID: PMC9395509 DOI: 10.1038/s42003-022-03812-z] [Citation(s) in RCA: 27] [Impact Index Per Article: 13.5] [Received: 07/19/2021] [Accepted: 08/05/2022] [Indexed: 01/03/2023] Open
Abstract
Polygenic risk scores (PRS) are commonly used to quantify the inherited susceptibility for a trait, yet they fail to account for non-linear and interaction effects between single nucleotide polymorphisms (SNPs). We address this via a machine learning approach, validated in nine complex phenotypes in a multi-ancestry population. We use an ensemble method of SNP selection followed by gradient boosted trees (XGBoost) to allow for non-linearities and interaction effects. We compare our results to the standard, linear PRS model developed using PRSice, LDpred2, and lassosum2. Combining a PRS as a feature in an XGBoost model results in a relative increase in the percentage variance explained compared to the standard linear PRS model by 22% for height, 27% for HDL cholesterol, 43% for body mass index, 50% for sleep duration, 58% for systolic blood pressure, 64% for total cholesterol, 66% for triglycerides, 77% for LDL cholesterol, and 100% for diastolic blood pressure. Multi-ancestry trained models perform similarly to specific racial/ethnic group trained models and are consistently superior to the standard linear PRS models. This work demonstrates an effective method to account for non-linearities and interaction effects in genetics-based prediction models.
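The percentages quoted in this abstract are relative increases in the percentage of variance explained (PVE), so a 100% increase means the PVE doubled rather than reaching 100%. The arithmetic can be sketched as follows; the function name and the PVE values are hypothetical and are not figures from the study.

```python
def relative_increase_percent(baseline_pve, new_pve):
    """Relative increase of new_pve over baseline_pve, expressed in percent."""
    return 100.0 * (new_pve - baseline_pve) / baseline_pve


# Hypothetical example: a linear PRS explains 5% of the variance,
# and a model that adds the PRS as a feature explains 10%.
increase = relative_increase_percent(0.05, 0.10)
```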
Affiliation(s)
- Michael Elgart
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA.
- Department of Medicine, Harvard Medical School, Boston, MA, USA.
| | - Genevieve Lyons
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Santiago Romero-Brufau
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Medicine, Mayo Clinic, Rochester, MN, USA
| | - Nuzulul Kurniansyah
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA
| | - Jennifer A Brody
- Cardiovascular Health Research Unit, Department of Medicine, University of Washington, Seattle, WA, USA
| | - Xiuqing Guo
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
| | - Henry J Lin
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
| | - Laura Raffield
- Department of Genetics, University of North Carolina, Chapel Hill, NC, USA
| | - Yan Gao
- The Jackson Heart Study, University of Mississippi Medical Center, Jackson, MS, USA
| | - Han Chen
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Paul de Vries
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | | | - Leslie A Lange
- Department of Medicine, University of Colorado Denver, Anschutz Medical Campus, Aurora, CO, USA
| | - Gina M Peloso
- Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA
| | - Myriam Fornage
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Brown Foundation Institute of Molecular Medicine, McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Jerome I Rotter
- The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
| | - Stephen S Rich
- Center for Public Health Genomics, University of Virginia School of Medicine, Charlottesville, VA, USA
| | - Alanna C Morrison
- Human Genetics Center, Department of Epidemiology, Human Genetics, and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Bruce M Psaty
- Cardiovascular Health Research Unit, Departments of Medicine, Epidemiology, and Health Services, University of Washington, Seattle, WA, USA
| | - Daniel Levy
- The Population Sciences Branch of the National Heart, Lung and Blood Institute, Bethesda, MD, USA
- The Framingham Heart Study, Framingham, MA, USA
| | - Susan Redline
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA
- Department of Medicine, Harvard Medical School, Boston, MA, USA
| | - Tamar Sofer
- Division of Sleep and Circadian Disorders, Brigham and Women's Hospital, Boston, MA, USA.
- Department of Medicine, Harvard Medical School, Boston, MA, USA.
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
| |
|
16
|
Risk score prediction model based on single nucleotide polymorphism for predicting malaria: a machine learning approach. BMC Bioinformatics 2022; 23:325. [PMID: 35934714 PMCID: PMC9358850 DOI: 10.1186/s12859-022-04870-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2022] [Accepted: 08/01/2022] [Indexed: 11/25/2022] Open
Abstract
Background: Malaria risk prediction is currently limited to using advanced statistical methods, such as time series and cluster analysis on epidemiological data. Nevertheless, machine learning models have been explored to study the complexity of malaria through blood smear images and environmental data. However, to the best of our knowledge, no study analyses the contribution of Single Nucleotide Polymorphisms (SNPs) to malaria using a machine learning model. More specifically, this study aims to quantify an individual's susceptibility to the development of malaria by using risk scores obtained from the cumulative effects of SNPs, known as weighted genetic risk scores (wGRS).
Results: We proposed an SNP-based feature extraction algorithm that incorporates the susceptibility information of an individual to malaria to generate the feature set. However, it can become computationally expensive for a machine learning model to learn from many SNPs. Therefore, we reduced the feature set by employing the Logistic Regression and Recursive Feature Elimination (LR-RFE) method to select SNPs that improve the efficacy of our model. Next, we calculated the wGRS of the selected feature set, which is used as the model's target variable. Moreover, to compare against the performance of the wGRS-only model, we calculated and evaluated the combination of wGRS with genotype frequency (wGRS + GF). Finally, Light Gradient Boosting Machine (LightGBM), eXtreme Gradient Boosting (XGBoost), and Ridge regression algorithms are utilized to establish the machine learning models for malaria risk prediction.
Conclusions: Our proposed approach identified SNP rs334 as the most contributing feature, with an importance score of 6.224, compared with a baseline importance score of 1.1314. This is an important result, as prior studies have shown that rs334 is a major genetic risk factor for malaria. The analysis and comparison of the three machine learning models demonstrated that LightGBM achieves the highest model performance, with a Mean Absolute Error (MAE) score of 0.0373. Furthermore, based on wGRS + GF, all models performed significantly better than with wGRS alone, with LightGBM obtaining the best performance (0.0033 MAE score).
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04870-0.
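A weighted genetic risk score is commonly computed as a weighted sum of risk-allele counts, and the mean absolute error measures how far predicted scores fall from the target scores. The sketch below assumes log odds ratios as weights, which is a common convention and not necessarily the weighting used in this study; all names and values are hypothetical.

```python
from math import log


def weighted_genetic_risk_score(risk_allele_counts, odds_ratios):
    """Sum of risk-allele counts (0, 1 or 2 per SNP) weighted by log odds ratios."""
    return sum(count * log(odds) for count, odds in zip(risk_allele_counts, odds_ratios))


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between target and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical individual genotyped at three SNPs, with assumed per-SNP odds ratios.
score = weighted_genetic_risk_score([2, 1, 0], [1.5, 1.2, 2.0])

# Hypothetical target scores versus model predictions.
mae = mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
```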
|
17
|
Hou C, Xu B, Hao Y, Yang D, Song H, Li J. Development and validation of polygenic risk scores for prediction of breast cancer and breast cancer subtypes in Chinese women. BMC Cancer 2022; 22:374. [PMID: 35395775 PMCID: PMC8991589 DOI: 10.1186/s12885-022-09425-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/24/2021] [Accepted: 03/15/2022] [Indexed: 02/08/2023] Open
Abstract
Background: Studies investigating breast cancer polygenic risk score (PRS) in Chinese women are scarce. The objectives of this study were to develop and validate PRSs that could be used to stratify risk for overall and subtype-specific breast cancer in Chinese women, and to evaluate the performance of a newly proposed Artificial Neural Network (ANN) based approach for PRS construction.
Methods: The PRSs were constructed using the dataset from a genome-wide association study (GWAS) and validated in an independent case-control study. Three approaches, including repeated logistic regression (RLR), logistic ridge regression (LRR) and an ANN based approach, were used to build the PRSs for overall and subtype-specific breast cancer based on 24 selected single nucleotide polymorphisms (SNPs). Predictive performance and calibration of the PRSs were evaluated unadjusted and adjusted for Gail-2 model 5-year risk or classical breast cancer risk factors.
Results: The primary PRS_ANN and PRS_LRR both showed modest predictive ability for overall breast cancer (odds ratio per interquartile range increase of the PRS in controls [IQ-OR] 1.76 vs 1.58; area under the receiver operator characteristic curve [AUC] 0.601 vs 0.598) and remained predictive after adjustment. Although estrogen receptor negative (ER−) breast cancer was poorly predicted by the primary PRSs, the ER− PRSs trained solely on ER− breast cancer cases saw a substantial improvement in predictions of ER− breast cancer.
Conclusions: The 24 SNP-based PRSs can provide additional risk information to help breast cancer risk stratification in the general population of China. The newly proposed ANN approach for PRS construction has the potential to replace the traditional approaches, but more studies are needed to validate and investigate its performance.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12885-022-09425-3.
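The IQ-OR reported here rescales a logistic regression effect so that it refers to a move from the first to the third quartile of the PRS among controls. A minimal sketch of that conversion follows, assuming a per-unit log odds coefficient is already available; the function name, coefficient and control scores are hypothetical and not taken from the study.

```python
from math import exp
from statistics import quantiles


def interquartile_odds_ratio(log_odds_per_unit, control_scores):
    """Odds ratio for an increase of one interquartile range of the score in controls."""
    q1, _median, q3 = quantiles(control_scores, n=4)  # quartile cut points
    return exp(log_odds_per_unit * (q3 - q1))


# Hypothetical PRS values in controls and a hypothetical logistic coefficient.
control_scores = [1, 2, 3, 4, 5, 6, 7]
iq_or = interquartile_odds_ratio(0.1, control_scores)
```

Expressing the effect per interquartile range makes odds ratios comparable across scores that are on different scales.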
Affiliation(s)
- Can Hou
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, No. 37 Guo Xue Xiang, Chengdu, 610047, Sichuan, China
- Department of Epidemiology and Biostatistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, No.16 Ren Min Nan Lu, Chengdu, 610041, Sichuan, China
- Med-X Center for Informatics, Sichuan University, Chengdu, China
| | - Bin Xu
- Department of Epidemiology and Biostatistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, No.16 Ren Min Nan Lu, Chengdu, 610041, Sichuan, China
| | - Yu Hao
- Department of Epidemiology and Biostatistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, No.16 Ren Min Nan Lu, Chengdu, 610041, Sichuan, China
| | - Daowen Yang
- Robot Perception and Control Joint Lab, Sichuan University & Aisono, Chengdu, China
| | - Huan Song
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, No. 37 Guo Xue Xiang, Chengdu, 610047, Sichuan, China; Med-X Center for Informatics, Sichuan University, Chengdu, China.
| | - Jiayuan Li
- Department of Epidemiology and Biostatistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, No.16 Ren Min Nan Lu, Chengdu, 610041, Sichuan, China.
| |
|
18
|
Govender P, Fashoto SG, Maharaj L, Adeleke MA, Mbunge E, Olamijuwon J, Akinnuwesi B, Okpeku M. The application of machine learning to predict genetic relatedness using human mtDNA hypervariable region I sequences. PLoS One 2022; 17:e0263790. [PMID: 35180257 PMCID: PMC8856515 DOI: 10.1371/journal.pone.0263790] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 07/27/2021] [Accepted: 01/26/2022] [Indexed: 11/21/2022] Open
Abstract
Human identification of unknown samples following disaster and mass casualty events is essential, especially to bring closure to family and friends of the deceased. Unfortunately, victim identification is often challenging for forensic investigators, as analysis becomes complicated when biological samples are degraded or of poor quality as a result of exposure to harsh environmental factors. Mitochondrial DNA becomes the ideal option for analysis, particularly for determining the origin of the samples. In such events, the estimation of genetic parameters plays an important role in modelling and predicting genetic relatedness and is useful in assigning unknown individuals to an ethnic group. Various techniques exist for the estimation of genetic relatedness, but the use of machine learning (ML) algorithms is novel and presently the least used in forensic genetic studies. In this study, we investigated the ability of ML algorithms to predict genetic relatedness using hypervariable region I sequences that were retrieved from the GenBank database for three race groups, namely African, Asian and Caucasian. Four ML classification algorithms, Support vector machines (SVM), Linear discriminant analysis (LDA), Quadratic discriminant analysis (QDA) and Random Forest (RF), were hybridised with one-hot encoding, Principal component analysis (PCA) and Bags of Words (BoW), and were compared for inferring genetic relatedness. The findings from this study on WEKA showed that genetic inferences based on PCA-SVM achieved an overall accuracy of 80–90% and consistently outperformed PCA-LDA, PCA-RF and PCA-QDA, while in Python BoW-PCA-RF achieved 94.4% accuracy, which outperformed BoW-PCA-SVM, BoW-PCA-LDA and BoW-PCA-QDA. ML results from the use of WEKA and Python software tools displayed higher accuracies as compared to the Analysis of molecular variance results. Given the results, SVM and RF algorithms are likely to also be useful in other sequence classification applications, making them promising tools in genetics and forensic science. The study provides evidence that ML can be utilized as a supplementary tool for forensic genetics casework analysis.
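The two sequence encodings mentioned in this abstract can be sketched in a few lines. This is an illustrative sketch only: the abstract does not say how the bag of words was tokenised, so treating overlapping k-mers as "words" (and the default `k=3`) is an assumption, as are the function names and the choice to encode unknown bases as all zeros.

```python
from collections import Counter

ALPHABET = "ACGT"


def one_hot_encode(sequence):
    """Flatten a DNA sequence into a 0/1 vector, four positions per base.

    Bases outside A/C/G/T (for example N) become four zeros; that
    convention is an assumption of this sketch.
    """
    vector = []
    for base in sequence.upper():
        vector.extend(1 if base == letter else 0 for letter in ALPHABET)
    return vector


def kmer_bag_of_words(sequence, k=3):
    """Count overlapping k-mers, treating each k-mer as a 'word'."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


encoded = one_hot_encode("ACGT")           # 16 values, one 1 per base
counts = kmer_bag_of_words("ACGAC", k=2)   # AC occurs twice
```

Either representation yields a fixed-width numeric table once sequences are aligned or the k-mer vocabulary is fixed, which is what a downstream PCA or classifier step would consume.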
Affiliation(s)
- Priyanka Govender
- Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, Westville, South Africa
| | - Stephen Gbenga Fashoto
- Faculty of Science and Engineering, Department of Computer Science, Computational Intelligence and Health Informatics Research Group, University of Eswatini, Kwaluseni, Kingdom of Eswatini
| | - Leah Maharaj
- Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, Westville, South Africa
| | - Matthew A. Adeleke
- Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, Westville, South Africa
| | - Elliot Mbunge
- Faculty of Science and Engineering, Department of Computer Science, Computational Intelligence and Health Informatics Research Group, University of Eswatini, Kwaluseni, Kingdom of Eswatini
| | - Jeremiah Olamijuwon
- Faculty of Science and Engineering, Department of Computer Science, Computational Intelligence and Health Informatics Research Group, University of Eswatini, Kwaluseni, Kingdom of Eswatini
| | - Boluwaji Akinnuwesi
- Faculty of Science and Engineering, Department of Computer Science, Computational Intelligence and Health Informatics Research Group, University of Eswatini, Kwaluseni, Kingdom of Eswatini
| | - Moses Okpeku
- Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, Westville, South Africa
| |
|
19
|
Karim MR, Cochez M, Zappa A, Sahay R, Rebholz-Schuhmann D, Beyan O, Decker S. Convolutional Embedded Networks for Population Scale Clustering and Bio-Ancestry Inferencing. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:369-382. [PMID: 32750845 DOI: 10.1109/tcbb.2020.2994649] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 06/11/2023]
Abstract
The study of genetic variants (GVs) can help to find correlating population groups, to identify cohorts that are predisposed to common diseases, and to explain differences in disease susceptibility and how patients react to drugs. Machine learning techniques are increasingly being applied to identify interacting GVs to understand their complex phenotypic traits. Since the performance of a learning algorithm not only depends on the size and nature of the data but also on the quality of underlying representation, deep neural networks (DNNs) can learn non-linear mappings that allow transforming GVs data into more clustering and classification friendly representations than manual feature selection. In this paper, we propose convolutional embedded networks (CEN) in which we combine two DNN architectures called convolutional embedded clustering (CEC) and convolutional autoencoder (CAE) classifier for clustering individuals and predicting geographic ethnicity based on GVs, respectively. We applied CAE-based representation learning to 95 million GVs from the '1000 genomes' (covering 2,504 individuals from 26 ethnic origins) and 'Simons genome diversity' (covering 279 individuals from 130 ethnic origins) projects. Quantitative and qualitative analyses with a focus on accuracy and scalability show that our approach outperforms state-of-the-art approaches such as VariantSpark and ADMIXTURE. In particular, CEC can cluster targeted population groups in 22 hours with an adjusted Rand index (ARI) of 0.915, a normalized mutual information (NMI) of 0.92, and a clustering accuracy (ACC) of 89 percent. The CAE classifier, in turn, can predict the geographic ethnicity of unknown samples with an F1 and Matthews correlation coefficient (MCC) score of 0.9004 and 0.8245, respectively. Further, to provide interpretations of the predictions, we identify significant biomarkers using gradient boosted trees (GBT) and SHapley Additive exPlanations (SHAP). Overall, our approach is transparent and faster than the baseline methods, and scalable for 5 to 100 percent of the full human genome.
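For readers unfamiliar with the two classification metrics quoted above, the following sketch computes F1 and the Matthews correlation coefficient from confusion-matrix counts. It covers only the binary case as an illustration; the study's task is multi-class, and how its scores were averaged across classes is not stated in the abstract. Function names and the zero-denominator convention are assumptions of this sketch.

```python
import math


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from binary counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero, a common convention
    that is assumed here.
    """
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Hypothetical counts: 8 true positives, 8 true negatives,
# 2 false positives, 2 false negatives.
f1 = f1_score(tp=8, fp=2, fn=2)
mcc = matthews_corrcoef(tp=8, tn=8, fp=2, fn=2)
```

MCC uses all four cells of the confusion matrix, so it tends to sit below F1 on the same predictions, consistent with the ordering of the two figures reported in the abstract.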
|
20
|
Shao D, Dai Y, Li N, Cao X, Zhao W, Cheng L, Rong Z, Huang L, Wang Y, Zhao J. Artificial intelligence in clinical research of cancers. Brief Bioinform 2021; 23:6470966. [PMID: 34929741 PMCID: PMC8769909 DOI: 10.1093/bib/bbab523] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 08/11/2021] [Revised: 11/06/2021] [Accepted: 11/13/2021] [Indexed: 12/16/2022] Open
Abstract
Several factors, including advances in computational algorithms, the availability of high-performance computing hardware, and the assembly of large community-based databases, have led to the extensive application of Artificial Intelligence (AI) in the biomedical domain for nearly 20 years. AI algorithms have attained expert-level performance in cancer research. However, only a few AI-based applications have been approved for use in the real world. Whether AI will eventually be capable of replacing medical experts has been a hot topic. In this article, we first summarize the status of cancer research using AI in the past two decades, including the consensus on the procedure of AI based on an ideal paradigm and current efforts of the expertise and domain knowledge. Next, the available data of AI process in the biomedical domain are surveyed. Then, we review the methods and applications of AI in cancer clinical research categorized by data type, including radiographic imaging, cancer genome, medical records, drug information and biomedical literature. Finally, we discuss challenges in moving AI from theoretical research to real-world cancer research applications and the perspectives toward the future realization of AI participating in cancer treatment.
Affiliation(s)
- Dan Shao
- College of Computer Science and Technology, Key Laboratory of Human Health Status Identification and Function Enhancement of Jilin Province, Changchun University, Changchun 130022, China
| | - Yinfei Dai
- College of Computer Science and Technology, Key Laboratory of Human Health Status Identification and Function Enhancement of Jilin Province, Changchun University, Changchun 130022, China
| | - Nianfeng Li
- College of Computer Science and Technology, Key Laboratory of Human Health Status Identification and Function Enhancement of Jilin Province, Changchun University, Changchun 130022, China
| | - Xuqing Cao
- Department of Neurology, People's Hospital of Ningxia Hui Autonomous Region (The Affiliated people's Hospital of Ningxia Medical University and The First Affiliated Hospital of Northwest Minzu University), Yinchuan 750002, China
| | - Wei Zhao
- Department of Biochemistry and Molecular Biology, Ningxia Medical University, Yinchuan 750002, China
| | - Li Cheng
- Department of Electrical Diagnosis, Affiliated Hospital of Changchun University of Traditional Chinese Medicine, Changchun, 130021, China
| | - Zhuqing Rong
- School of Science, Key Laboratory of Human Health Status Identification and Function Enhancement of Jilin Province, Changchun University, Changchun 130022, China
| | - Lan Huang
- Key laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
| | - Yan Wang
- Key laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
| | - Jing Zhao
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, 43210, USA
| |
|
21
|
Westhues CC, Mahone GS, da Silva S, Thorwarth P, Schmidt M, Richter JC, Simianer H, Beissinger TM. Prediction of Maize Phenotypic Traits With Genomic and Environmental Predictors Using Gradient Boosting Frameworks. Frontiers in Plant Science 2021; 12:699589. [PMID: 34880880 PMCID: PMC8647909 DOI: 10.3389/fpls.2021.699589] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 04/23/2021] [Accepted: 10/15/2021] [Indexed: 05/26/2023]
Abstract
The development of crop varieties with stable performance in future environmental conditions represents a critical challenge in the context of climate change. Environmental data collected at the field level, such as soil and climatic information, can be relevant to improve predictive ability in genomic prediction models by describing more precisely genotype-by-environment interactions, which represent a key component of the phenotypic response for complex crop agronomic traits. Modern predictive modeling approaches can efficiently handle various data types and are able to capture complex nonlinear relationships in large datasets. In particular, machine learning techniques have gained substantial interest in recent years. Here we examined the predictive ability of machine learning-based models for two phenotypic traits in maize using data collected by the Maize Genomes to Fields (G2F) Initiative. The data we analyzed consisted of multi-environment trials (METs) dispersed across the United States and Canada from 2014 to 2017. An assortment of soil- and weather-related variables was derived and used in prediction models alongside genotypic data. Linear random effects models were compared to a linear regularized regression method (elastic net) and to two nonlinear gradient boosting methods based on decision tree algorithms (XGBoost, LightGBM). These models were evaluated under four prediction problems: (1) tested and new genotypes in a new year; (2) only unobserved genotypes in a new year; (3) tested and new genotypes in a new site; (4) only unobserved genotypes in a new site. Accuracy in forecasting grain yield performance of new genotypes in a new year was improved by up to 20% over the baseline model by including environmental predictors with gradient boosting methods. For plant height, an enhancement of predictive ability could neither be observed by using machine learning-based methods nor by using detailed environmental information. 
An investigation of key environmental factors using gradient boosting frameworks also revealed that temperature at flowering stage, frequency and amount of water received during the vegetative and grain filling stage, and soil organic matter content appeared as important predictors for grain yield in our panel of environments.
Affiliation(s)
- Cathy C. Westhues
- Division of Plant Breeding Methodology, Department of Crop Sciences, University of Goettingen, Goettingen, Germany
- Center for Integrated Breeding Research, University of Goettingen, Goettingen, Germany
| | | | - Sofia da Silva
- Kleinwanzlebener Saatzucht (KWS) SAAT SE, Einbeck, Germany
| | | | - Malthe Schmidt
- Kleinwanzlebener Saatzucht (KWS) SAAT SE, Einbeck, Germany
| | | | - Henner Simianer
- Center for Integrated Breeding Research, University of Goettingen, Goettingen, Germany
- Animal Breeding and Genetics Group, Department of Animal Sciences, University of Goettingen, Goettingen, Germany
| | - Timothy M. Beissinger
- Division of Plant Breeding Methodology, Department of Crop Sciences, University of Goettingen, Goettingen, Germany
- Center for Integrated Breeding Research, University of Goettingen, Goettingen, Germany
| |
|
22
|
Structural and functional motor-network disruptions predict selective action-concept deficits: Evidence from frontal lobe epilepsy. Cortex 2021; 144:43-55. [PMID: 34637999 DOI: 10.1016/j.cortex.2021.08.003] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 12/01/2020] [Revised: 07/12/2021] [Accepted: 08/05/2021] [Indexed: 12/22/2022]
Abstract
Built on neurodegenerative lesions models, the disrupted motor grounding hypothesis (DMGH) posits that motor-system alterations selectively impair action comprehension. However, major doubts remain concerning the dissociability, neural signatures, and etiological generalizability of such deficits. Few studies have compared action-concept outcomes between disorders affecting and sparing motor circuitry, and none has examined their multimodal network predictors via data-driven approaches. Here, we first assessed action- and object-concept processing in patients with frontal lobe epilepsy (FLE), patients with posterior cortex epilepsy (PCE), and healthy controls. Then, we examined structural and functional network signatures via diffusion tensor imaging and resting-state connectivity measures. Finally, we used these measures to predict behavioral performance with an XGBoost machine learning regression algorithm. Relative to controls, FLE (but not PCE) patients exhibited selective action-concept deficits together with structural and functional abnormalities along motor networks. The XGBoost model reached a significantly large effect size only for action-concept outcomes in FLE, mainly predicted by structural (cortico-spinal tract, anterior thalamic radiation, uncinate fasciculus) and functional (M1-parietal/supramarginal connectivity) motor networks. These results extend the DMGH, suggesting that action-concept deficits are dissociable markers of frontal/motor (relative to posterior) disruptions, directly related to the structural and functional integrity of motor networks, and traceable beyond canonical movement disorders.
|
23
|
Moguilner S, Birba A, Fino D, Isoardi R, Huetagoyena C, Otoya R, Tirapu V, Cremaschi F, Sedeño L, Ibáñez A, García AM. Multimodal neurocognitive markers of frontal lobe epilepsy: Insights from ecological text processing. Neuroimage 2021; 235:117998. [PMID: 33789131 PMCID: PMC8272524 DOI: 10.1016/j.neuroimage.2021.117998] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 01/25/2021] [Revised: 03/15/2021] [Accepted: 03/24/2021] [Indexed: 01/07/2023] Open
Abstract
The pressing call to detect sensitive cognitive markers of frontal lobe epilepsy (FLE) remains poorly addressed. Standard frameworks prove nosologically unspecific (as they reveal deficits that also emerge across other epilepsy subtypes), possess low ecological validity, and are rarely supported by multimodal neuroimaging assessments. To bridge these gaps, we examined naturalistic action and non-action text comprehension, combined with structural and functional connectivity measures, in 19 FLE patients, 19 healthy controls, and 20 posterior cortex epilepsy (PCE) patients. Our analyses integrated inferential statistics and data-driven machine-learning classifiers. FLE patients were selectively and specifically impaired in action comprehension, irrespective of their neuropsychological profile. These deficits selectively and specifically correlated with (a) reduced integrity of the anterior thalamic radiation, a subcortical structure underlying motoric and action-language processing as well as epileptic seizure spread in this subtype; and (b) hypoconnectivity between the primary motor cortex and the left-parietal/supramarginal regions, two putative substrates of action-language comprehension. Moreover, machine-learning classifiers based on the above neurocognitive measures yielded 75% accuracy rates in discriminating individual FLE patients from both controls and PCE patients. Briefly, action-text assessments, combined with structural and functional connectivity measures, seem to capture ecological cognitive deficits that are specific to FLE, opening new avenues for discriminatory characterizations among epilepsy types.
Affiliation(s)
- Sebastian Moguilner
- Global Brain Health Institute, UCSF, California, US, & Trinity College Dublin, Dublin, Ireland; Nuclear Medicine School Foundation (FUESMEN), National Commission of Atomic Energy (CNEA), Mendoza, Argentina
| | - Agustina Birba
- University of San Andres, Buenos Aires, Argentina; National Scientific and Technical Research Council (CONICET), Buenos Aires, Argentina
| | - Daniel Fino
- Nuclear Medicine School Foundation (FUESMEN), National Commission of Atomic Energy (CNEA), Mendoza, Argentina; Fundación Argentina para el Desarrollo en Salud, Mendoza, Argentina
| | - Roberto Isoardi
- Nuclear Medicine School Foundation (FUESMEN), National Commission of Atomic Energy (CNEA), Mendoza, Argentina
| | - Celeste Huetagoyena
- Neuromed, Clinical Neuroscience, Mendoza, Argentina; Universidad Católica Argentina
| | - Raúl Otoya
- Neuromed, Clinical Neuroscience, Mendoza, Argentina
| | - Viviana Tirapu
- Nuclear Medicine School Foundation (FUESMEN), National Commission of Atomic Energy (CNEA), Mendoza, Argentina; Neuromed, Clinical Neuroscience, Mendoza, Argentina
| | - Fabián Cremaschi
- Nuclear Medicine School Foundation (FUESMEN), National Commission of Atomic Energy (CNEA), Mendoza, Argentina; Neuroscience Department of the School of Medicine, National University of Cuyo, Mendoza, Argentina; Santa Isabel de Hungría Hospital, Mendoza, Argentina
| | - Lucas Sedeño
- National Scientific and Technical Research Council (CONICET), Buenos Aires, Argentina
| | - Agustín Ibáñez
- Global Brain Health Institute, UCSF, California, US, & Trinity College Dublin, Dublin, Ireland; University of San Andres, Buenos Aires, Argentina; National Scientific and Technical Research Council (CONICET), Buenos Aires, Argentina; Center for Social and Cognitive Neuroscience (CSCN), School of Psychology, Universidad Adolfo Ibáñez, Santiago, Chile
| | - Adolfo M García
- Global Brain Health Institute, UCSF, California, US, & Trinity College Dublin, Dublin, Ireland; University of San Andres, Buenos Aires, Argentina; National Scientific and Technical Research Council (CONICET), Buenos Aires, Argentina; Faculty of Education, National University of Cuyo (UNCuyo), Mendoza, Argentina; Departamento de Lingüística y Literatura, Facultad de Humanidades, Universidad de Santiago de Chile, Santiago, Chile.
| |
|
24
|
Banegas-Luna AJ, Peña-García J, Iftene A, Guadagni F, Ferroni P, Scarpato N, Zanzotto FM, Bueno-Crespo A, Pérez-Sánchez H. Towards the Interpretability of Machine Learning Predictions for Medical Applications Targeting Personalised Therapies: A Cancer Case Survey. Int J Mol Sci 2021; 22:4394. [PMID: 33922356 PMCID: PMC8122817 DOI: 10.3390/ijms22094394] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 03/30/2021] [Revised: 04/16/2021] [Accepted: 04/20/2021] [Indexed: 12/18/2022] Open
Abstract
Artificial Intelligence is providing astonishing results, with medicine being one of its favourite playgrounds. Machine Learning and, in particular, Deep Neural Networks are behind this revolution. Among the most challenging targets of interest in medicine are cancer diagnosis and therapies but, to start this revolution, software tools need to be adapted to cover the new requirements. In this sense, learning tools are becoming a commodity but, to be able to assist doctors on a daily basis, it is essential to fully understand how models can be interpreted. In this survey, we analyse current machine learning models and other in-silico tools as applied to medicine (specifically, to cancer research), and we discuss their interpretability, performance and the input data they are fed with. Artificial neural networks (ANN), logistic regression (LR) and support vector machines (SVM) have been observed to be the preferred models. In addition, convolutional neural networks (CNNs), supported by the rapid development of graphic processing units (GPUs) and high-performance computing (HPC) infrastructures, are gaining importance when image processing is feasible. However, the interpretability of machine learning predictions, so that doctors can understand them, trust them and gain useful insights for clinical practice, is still rarely considered; this needs to improve in order to enhance doctors' predictive capacity and achieve individualised therapies in the near future.
Affiliation(s)
- Antonio Jesús Banegas-Luna
- Structural Bioinformatics and High-Performance Computing Research Group (BIO-HPC), Universidad Católica de Murcia (UCAM), 30107 Murcia, Spain
| | - Jorge Peña-García
- Structural Bioinformatics and High-Performance Computing Research Group (BIO-HPC), Universidad Católica de Murcia (UCAM), 30107 Murcia, Spain
| | - Adrian Iftene
- Faculty of Computer Science, Universitatea Alexandru Ioan Cuza (UAIC), 700505 Jashi, Romania
| | - Fiorella Guadagni
- Interinstitutional Multidisciplinary Biobank (BioBIM), IRCCS San Raffaele Roma, 00166 Rome, Italy
- Department of Human Sciences and Promotion of the Quality of Life, San Raffaele Roma Open University, 00166 Rome, Italy
| | - Patrizia Ferroni
- Interinstitutional Multidisciplinary Biobank (BioBIM), IRCCS San Raffaele Roma, 00166 Rome, Italy
- Department of Human Sciences and Promotion of the Quality of Life, San Raffaele Roma Open University, 00166 Rome, Italy
| | - Noemi Scarpato
- Department of Human Sciences and Promotion of the Quality of Life, San Raffaele Roma Open University, 00166 Rome, Italy
| | - Fabio Massimo Zanzotto
- Dipartimento di Ingegneria dell’Impresa “Mario Lucertini”, University of Rome Tor Vergata, 00133 Rome, Italy
| | - Andrés Bueno-Crespo
- Structural Bioinformatics and High-Performance Computing Research Group (BIO-HPC), Universidad Católica de Murcia (UCAM), 30107 Murcia, Spain
| | - Horacio Pérez-Sánchez
- Structural Bioinformatics and High-Performance Computing Research Group (BIO-HPC), Universidad Católica de Murcia (UCAM), 30107 Murcia, Spain
| |
|
25
|
Muneeb M, Henschel A. Eye-color and Type-2 diabetes phenotype prediction from genotype data using deep learning methods. BMC Bioinformatics 2021; 22:198. [PMID: 33874881 PMCID: PMC8056510 DOI: 10.1186/s12859-021-04077-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 12/09/2020] [Accepted: 03/03/2021] [Indexed: 01/08/2023] Open
Abstract
Background: Genotype–phenotype predictions are of great importance in genetics. These predictions can help to find genetic mutations causing variations in human beings. There are many approaches for finding the association, which can be broadly categorized into two classes: statistical techniques and machine learning. Statistical techniques are good for finding the actual SNPs causing variation, whereas machine learning techniques are good where we just want to classify people into different categories. In this article, we examined the Eye-color and Type-2 diabetes phenotypes. The proposed technique is a hybrid approach consisting of some parts from statistical techniques and the remainder from machine learning. Results: The main dataset for the Eye-color phenotype consists of 806 people: 404 people have Blue-Green eyes and 402 people have Brown eyes. After preprocessing, we generated 8 different datasets, containing different numbers of SNPs, using the mutation difference and thresholding at individual SNPs. We calculated three types of mutation at each SNP: no mutation, partial mutation, and full mutation. After that, the data are transformed for machine learning algorithms. We used about 9 classifiers (RandomForest, Extreme Gradient boosting, ANN, LSTM, GRU, BILSTM, 1DCNN, ensembles of ANN, and ensembles of LSTM), which gave best accuracies of 0.91, 0.9286, 0.945, 0.94, 0.94, 0.92, 0.95, and 0.96, respectively. Stacked ensembles of LSTM outperformed other algorithms for 1560 SNPs with an overall accuracy of 0.96, AUC = 0.98 for brown eyes, and AUC = 0.97 for Blue-Green eyes. The main dataset for Type-2 diabetes consists of 107 people where 30 people are classified as cases and 74 people as controls. We used different linear thresholds to find the optimal number of SNPs for classification. The final model gave an accuracy of 0.97. Conclusion: Genotype–phenotype predictions are very useful, especially in forensics. These predictions can help to identify SNP variant associations with traits and diseases. Given more datasets, machine learning model predictions can be improved. Moreover, the non-linearity in the machine learning model and the combination of SNP mutations while training the model improve prediction. We considered binary classification problems, but the proposed approach can be extended to multi-class classification.
Affiliation(s)
- Muhammad Muneeb
- Department of Electrical Engineering and Computer Science, Center for Biotechnology Khalifa University, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates
| | - Andreas Henschel
- Department of Electrical Engineering and Computer Science, Center for Biotechnology Khalifa University, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates.
| |
|
26
|
Lebrett MB, Crosbie EJ, Smith MJ, Woodward ER, Evans DG, Crosbie PAJ. Targeting lung cancer screening to individuals at greatest risk: the role of genetic factors. J Med Genet 2021; 58:217-226. [PMID: 33514608 PMCID: PMC8005792 DOI: 10.1136/jmedgenet-2020-107399] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 08/12/2020] [Revised: 12/06/2020] [Accepted: 12/08/2020] [Indexed: 12/24/2022]
Abstract
Lung cancer (LC) is the most common global cancer. An individual’s risk of developing LC is mediated by an array of factors, including family history of the disease. Considerable research into genetic risk factors for LC has taken place in recent years, with both low-penetrance and high-penetrance variants implicated in increasing or decreasing a person’s risk of the disease. LC is the leading cause of cancer death worldwide; poor survival is driven by late onset of non-specific symptoms, resulting in late-stage diagnoses. Evidence for the efficacy of screening in detecting cancer earlier, thereby reducing lung-cancer specific mortality, is now well established. To ensure the cost-effectiveness of a screening programme and to limit the potential harms to participants, a risk threshold for screening eligibility is required. Risk prediction models (RPMs), which provide an individual’s personal risk of LC over a particular period based on a large number of risk factors, may improve the selection of high-risk individuals for LC screening when compared with generalised eligibility criteria that only consider smoking history and age. No currently used RPM integrates genetic risk factors into its calculation of risk. This review provides an overview of the evidence for LC screening, screening related harms and the use of RPMs in screening cohort selection. It gives a synopsis of the known genetic risk factors for lung cancer and discusses the evidence for including them in RPMs, focusing in particular on the use of polygenic risk scores to increase the accuracy of targeted lung cancer screening.
Affiliation(s)
- Mikey B Lebrett
- Division of Infection, Immunity and Respiratory Medicine, The University of Manchester Faculty of Biology Medicine and Health, Manchester, UK; Prevention and Early Detection Theme, NIHR Manchester Biomedical Research Centre, Manchester, UK
| | - Emma J Crosbie
- Prevention and Early Detection Theme, NIHR Manchester Biomedical Research Centre, Manchester, UK; Division of Cancer Sciences, The University of Manchester Faculty of Biology Medicine and Health, Manchester, UK
| | - Miriam J Smith
- Prevention and Early Detection Theme, NIHR Manchester Biomedical Research Centre, Manchester, UK; Manchester Centre for Genomic Medicine, St Mary's Hospital, Division of Evolution and Genomic Sciences, School of Biological Sciences, University of Manchester, Manchester, UK
| | - Emma R Woodward
- Prevention and Early Detection Theme, NIHR Manchester Biomedical Research Centre, Manchester, UK; Manchester Centre for Genomic Medicine, St Mary's Hospital, Division of Evolution and Genomic Sciences, School of Biological Sciences, University of Manchester, Manchester, UK
| | - D Gareth Evans
- Prevention and Early Detection Theme, NIHR Manchester Biomedical Research Centre, Manchester, UK; Manchester Centre for Genomic Medicine, St Mary's Hospital, Division of Evolution and Genomic Sciences, School of Biological Sciences, University of Manchester, Manchester, UK
| | - Philip A J Crosbie
- Division of Infection, Immunity and Respiratory Medicine, The University of Manchester Faculty of Biology Medicine and Health, Manchester, UK; Prevention and Early Detection Theme, NIHR Manchester Biomedical Research Centre, Manchester, UK; Manchester Thoracic Oncology Centre, Wythenshawe Hospital, Manchester University NHS Foundation Trust, Manchester, UK
|
27
|
Warner E, Wang N, Lee J, Rao A. Meaningful incorporation of artificial intelligence for personalized patient management during cancer: Quantitative imaging, risk assessment, and therapeutic outcomes. Artif Intell Med 2021. [DOI: 10.1016/b978-0-12-821259-2.00017-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
|
28
|
Abstract
Many modern problems in medicine and public health leverage machine-learning methods to predict outcomes based on observable covariates. In a wide array of settings, predicted outcomes are used in subsequent statistical analysis, often without accounting for the distinction between observed and predicted outcomes. We call inference with predicted outcomes postprediction inference. In this paper, we develop methods for correcting statistical inference using outcomes predicted with arbitrarily complicated machine-learning models including random forests and deep neural nets. Rather than trying to derive the correction from first principles for each machine-learning algorithm, we observe that there is typically a low-dimensional and easily modeled representation of the relationship between the observed and predicted outcomes. We build an approach for postprediction inference that naturally fits into the standard machine-learning framework where the data are divided into training, testing, and validation sets. We train the prediction model in the training set, estimate the relationship between the observed and predicted outcomes in the testing set, and use that relationship to correct subsequent inference in the validation set. We show our postprediction inference (postpi) approach can correct bias and improve variance estimation and subsequent statistical inference with predicted outcomes. To show the broad range of applicability of our approach, we show postpi can improve inference in two distinct fields: modeling predicted phenotypes in repurposed gene expression data and modeling predicted causes of death in verbal autopsy data. Our method is available through an open-source R package: https://github.com/leekgroup/postpi.
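The three-stage workflow described in this abstract (train a predictor, estimate the relationship between observed and predicted outcomes in the testing set, then correct inference in the validation set) might be sketched as follows. This is a simplified illustration and not the postpi package's API: the names `fit_line` and `postprediction_slope`, the linear relationship model, and the bootstrap settings are all assumptions made for the example, and the predictions are taken as given from any model fitted on the training set.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sd = np.sqrt(resid @ resid / (len(y) - 2))
    return coef[0], coef[1], sd

def postprediction_slope(pred_test, obs_test, pred_val, x_val, n_boot=200, seed=0):
    """Hypothetical helper: slope of the outcome on covariate x in the
    validation set, where only predicted outcomes are available."""
    rng = np.random.default_rng(seed)
    # Testing set: model the observed outcome as a function of the prediction.
    a, b, sd = fit_line(pred_test, obs_test)
    n = len(pred_val)
    slopes = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # bootstrap resample of the validation set
        # Simulate plausible observed outcomes from the predictions.
        y_sim = a + b * pred_val[idx] + rng.normal(0.0, sd, n)
        _, slope, _ = fit_line(x_val[idx], y_sim)
        slopes.append(slope)
    slopes = np.array(slopes)
    # Bootstrap mean as the corrected estimate, spread as its standard error.
    return slopes.mean(), slopes.std(ddof=1)
```

Regressing the raw predictions directly on `x_val` would inherit any systematic distortion in the predictor; routing them through the relationship estimated on the testing set is what lets the downstream estimate and its uncertainty reflect the observed outcome scale.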
|
29
|
Seo H, Cho DH. Feature selection algorithm based on dual correlation filters for cancer-associated somatic variants. BMC Bioinformatics 2020; 21:486. [PMID: 33121438 PMCID: PMC7596964 DOI: 10.1186/s12859-020-03767-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/23/2019] [Accepted: 09/18/2020] [Indexed: 12/30/2022]
Abstract
Background: Since the development of sequencing technology, an enormous amount of genetic information has been generated, and human cancer analysis using this information is drawing attention. As the effects of variants on human cancer become known, it is important to find cancer-associated variants among countless variants. Results: We propose a new filter-based feature selection method for extracting cancer-associated somatic variants that takes correlations in the data into account. Variants associated with both the activation and the deactivation of cancer characteristics are analyzed using dual correlation filters. Multiobjective optimization is used to consider the two types of variants simultaneously without redundancy. To overcome the problem of high computational complexity, we calculate a correlation-based weight to select significant variants instead of directly searching for the optimal subset of variants. The proposed algorithm is applied to the identification of melanoma metastasis and breast cancer stage, and its classification results are compared with those of the conventional single correlation filter-based method. Conclusions: We verified that the proposed dual correlation filter-based method can extract cancer-associated variants related to the characteristics of human cancer.
Affiliation(s)
- Hyein Seo
- School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, 34141, Daejeon, Republic of Korea
| | - Dong-Ho Cho
- School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, 34141, Daejeon, Republic of Korea.
|
30
|
Finkbeiner S. Functional genomics, genetic risk profiling and cell phenotypes in neurodegenerative disease. Neurobiol Dis 2020; 146:105088. [PMID: 32977020 PMCID: PMC7686089 DOI: 10.1016/j.nbd.2020.105088] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 03/31/2020] [Revised: 09/11/2020] [Accepted: 09/18/2020] [Indexed: 12/03/2022]
Abstract
Human genetics provides unbiased insights into the causes of human disease, which can be used to create a foundation for effective ways to more accurately diagnose patients, stratify patients for more successful clinical trials, discover and develop new therapies, and ultimately help patients choose the safest and most promising therapeutic option based on their risk profile. But the process for translating basic observations from human genetics studies into pathogenic disease mechanisms and treatments is laborious and complex, and this challenge has particularly slowed the development of interventions for neurodegenerative disease. In this review, we discuss the many steps in the process, the important considerations at each stage, and some of the latest tools and technologies that are available to help investigators translate insights from human genetics into diagnostic and therapeutic strategies that will lead to the sort of advances in clinical care that make a difference for patients.
Affiliation(s)
- Steven Finkbeiner
- Center for Systems and Therapeutics, USA; Taube/Koret Center for Neurodegenerative Disease Research, Gladstone Institutes, San Francisco, CA 94158, USA; Departments of Neurology and Physiology, University of California, San Francisco, CA 94158, USA.
|
31
|
Bakhtiari S, Sulaimany S, Talebi M, Kalhor K. Computational Prediction of Probable Single Nucleotide Polymorphism-Cancer Relationships. Cancer Inform 2020; 19:1176935120942216. [PMID: 32728337 PMCID: PMC7364831 DOI: 10.1177/1176935120942216] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/15/2020] [Accepted: 06/22/2020] [Indexed: 12/18/2022]
Abstract
Genetic variations such as single nucleotide polymorphisms (SNPs) can cause susceptibility to cancer. Although thousands of genetic variants have been identified as associated with different cancers, the molecular mechanisms of cancer remain unknown. No dedicated dataset of the relationships between cancers and SNPs, structured as a bipartite network, exists for computational analysis and prediction. Link prediction, a computational graph analysis method, can help us gain new insight into such a network. In this article, after creating a network between cancers and SNPs using the SNPedia and Cancer Research UK databases, we evaluated computational link prediction methods to foresee new SNP-cancer relationships. Results show that, among the popular scoring methods based on network topology, the preferential attachment (PA) algorithm is the most robust method for relation prediction according to computational and experimental evidence, and some of its computational predictions are corroborated in recent publications. According to the PA predictions, rs1801394-non-small cell lung cancer, rs4880-non-small cell lung cancer, and rs1805794-colorectal cancer are among the most probable SNP-cancer associations that have not yet been mentioned in any published article, making them strong candidates for additional laboratory and validation studies. It is also feasible to improve the prediction algorithms to produce new predictions in the future.
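To make the preferential attachment idea in this abstract concrete, the sketch below scores every unobserved SNP-cancer pair in a bipartite network by the product of the two nodes' degrees, which is the standard PA score. The function name `preferential_attachment` and the node labels used with it are hypothetical placeholders, not identifiers or data from the article.

```python
from collections import defaultdict
from itertools import product

def preferential_attachment(edges):
    """Rank unobserved (snp, cancer) pairs by degree(snp) * degree(cancer).

    `edges` is an iterable of (snp, cancer) tuples describing known links.
    Returns a list of ((snp, cancer), score), highest score first.
    """
    known = set(edges)  # de-duplicate repeated links
    snp_deg = defaultdict(int)
    cancer_deg = defaultdict(int)
    for snp, cancer in known:
        snp_deg[snp] += 1
        cancer_deg[cancer] += 1
    scores = {
        (snp, cancer): snp_deg[snp] * cancer_deg[cancer]
        for snp, cancer in product(snp_deg, cancer_deg)
        if (snp, cancer) not in known  # only score links not yet observed
    }
    # Sort by descending score, then by label so that ties are deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Because the score depends only on node degrees, it favours pairs in which both the SNP and the cancer already have many known associations; it uses no other feature of either node.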
Affiliation(s)
- Shahab Bakhtiari
- Department of Biological Sciences, University of Kurdistan, Sanandaj, Iran
| | - Sadegh Sulaimany
- Department of Computer Engineering, University of Kurdistan, Sanandaj, Iran
| | - Mehrdad Talebi
- Department of Medical Genetics, Shahid Sadoughi University of Medical Sciences, Yazd, Iran
| | - Kabmiz Kalhor
- Department of Biological Sciences, University of Kurdistan, Sanandaj, Iran
|
32
|
Behravan H, Hartikainen JM, Tengström M, Kosma VM, Mannermaa A. Predicting breast cancer risk using interacting genetic and demographic factors and machine learning. Sci Rep 2020; 10:11044. [PMID: 32632202 PMCID: PMC7338351 DOI: 10.1038/s41598-020-66907-9] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 01/19/2020] [Accepted: 06/01/2020] [Indexed: 12/21/2022]
Abstract
Breast cancer (BC) is a multifactorial disease and the most common cancer in women worldwide. We describe a machine learning approach to identify a combination of interacting genetic variants (SNPs) and demographic risk factors for BC, especially factors related to both familial history (Group 1) and oestrogen metabolism (Group 2), for predicting BC risk. This approach identifies the best combinations of interacting genetic and demographic risk factors that yield the highest BC risk prediction accuracy. In tests on the Kuopio Breast Cancer Project (KBCP) dataset, our approach achieves a mean average precision (mAP) of 77.78 in predicting BC risk by using interacting genetic and Group 1 features, which is better than the mAPs of 74.19 and 73.65 achieved using only Group 1 features and interacting SNPs, respectively. Similarly, using interacting genetic and Group 2 features yields a mAP of 78.00, which outperforms the system based on only Group 2 features, which has a mAP of 72.57. Furthermore, the gene interaction maps built from genes associated with SNPs that interact with demographic risk factors indicate important BC-related biological entities, such as angiogenesis, apoptosis and oestrogen-related networks. The results also show that demographic risk factors are individually more important than genetic variants in predicting BC risk.
Affiliation(s)
- Hamid Behravan
- Institute of Clinical Medicine, Pathology and Forensic Medicine, and Translational Cancer Research Area, University of Eastern Finland, P.O. Box 1627, FI-70211, Kuopio, Finland.
| | - Jaana M Hartikainen
- Institute of Clinical Medicine, Pathology and Forensic Medicine, and Translational Cancer Research Area, University of Eastern Finland, P.O. Box 1627, FI-70211, Kuopio, Finland
| | - Maria Tengström
- Institute of Clinical Medicine, Oncology, University of Eastern Finland, P.O. Box 1627, FI-70211, Kuopio, Finland
- Cancer Center, Kuopio University Hospital, Kuopio, P.O. Box 100, FI-70029, Kuopio, Finland
| | - Veli-Matti Kosma
- Institute of Clinical Medicine, Pathology and Forensic Medicine, and Translational Cancer Research Area, University of Eastern Finland, P.O. Box 1627, FI-70211, Kuopio, Finland
- Biobank of Eastern Finland, Kuopio University Hospital, Kuopio, Finland
| | - Arto Mannermaa
- Institute of Clinical Medicine, Pathology and Forensic Medicine, and Translational Cancer Research Area, University of Eastern Finland, P.O. Box 1627, FI-70211, Kuopio, Finland
- Biobank of Eastern Finland, Kuopio University Hospital, Kuopio, Finland
|
33
|
Machine Learning Supports Long Noncoding RNAs as Expression Markers for Endometrial Carcinoma. Biomed Res Int 2020; 2020:3968279. [PMID: 32420338 PMCID: PMC7199595 DOI: 10.1155/2020/3968279] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/21/2019] [Accepted: 12/17/2019] [Indexed: 12/19/2022]
Abstract
Uterine corpus endometrial carcinoma (UCEC) is the second most common type of gynecological tumor. Several research studies have recently shown the potential of different ncRNAs as biomarkers for prognosis and diagnosis in different types of cancers, including UCEC. Thus, we hypothesized that long noncoding RNAs (lncRNAs) could serve as efficient factors to discriminate solid primary (TP) and normal adjacent (NT) tissues in UCEC with high accuracy. We performed an in silico differential expression analysis comparing TP and NT from a set of samples downloaded from The Cancer Genome Atlas (TCGA) database, targeting highly differentially expressed lncRNAs that could potentially serve as gene expression markers. All analyses were performed in R software. The receiver operating characteristic (ROC) analyses and both supervised and unsupervised machine learning indicated a set of 14 lncRNAs that may serve as biomarkers for UCEC. Functions and putative pathways were assessed through a coexpression network and target enrichment analysis.
|
34
|
Bandoy DJDR, Weimer BC. Biological Machine Learning Combined with Campylobacter Population Genomics Reveals Virulence Gene Allelic Variants Cause Disease. Microorganisms 2020; 8:E549. [PMID: 32290186 PMCID: PMC7232492 DOI: 10.3390/microorganisms8040549] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 03/03/2020] [Revised: 04/07/2020] [Accepted: 04/08/2020] [Indexed: 01/17/2023]
Abstract
Highly dimensional data generated from bacterial whole-genome sequencing is providing an unprecedented scale of information that requires an appropriate statistical analysis framework to infer biological function from populations of genomes. The application of genome-wide association study (GWAS) methods is an appropriate framework for bacterial population genome analysis that yields a list of candidate genes associated with a phenotype, but it provides an unranked measure of importance. Here, we validated a novel framework to define infection mechanism using the combination of GWAS, machine learning, and bacterial population genomics that ranked allelic variants that accurately identified disease. This approach parsed a dataset of 1.2 million single nucleotide polymorphisms (SNPs) and indels that resulted in an importance ranked list of associated alleles of porA in Campylobacter jejuni using spatiotemporal analysis over 30 years. We validated this approach using previously proven laboratory experimental alleles from an in vivo guinea pig abortion model. This framework, termed µPathML, defined intestinal and extraintestinal groups that have differential allelic porA variants that cause abortion. Divergent variants containing indels that defeated automated annotation were rescued using biological context and knowledge that resulted in defining rare, divergent variants that were maintained in the population over two continents and 30 years. This study defines the capability of machine learning coupled with GWAS and population genomics to simultaneously identify and rank alleles to define their role in infectious disease mechanisms.
Affiliation(s)
- DJ Darwin R. Bandoy
- 100 K Pathogen Genome Project, Department of Population Health and Reproduction, School of Veterinary Medicine, University of California Davis, Davis, CA 95616, USA
- Department of Veterinary Paraclinical Sciences, College of Veterinary Medicine, University of the Philippines Los Baños, Los Baños 4031, Philippines
| | - Bart C. Weimer
- 100 K Pathogen Genome Project, Department of Population Health and Reproduction, School of Veterinary Medicine, University of California Davis, Davis, CA 95616, USA
|
35
|
Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med 2019; 25:44-56. [PMID: 30617339 DOI: 10.1038/s41591-018-0300-7] [Citation(s) in RCA: 2155] [Impact Index Per Article: 431.0] [Received: 08/16/2018] [Accepted: 11/12/2018] [Indexed: 11/08/2022]
Abstract
The use of artificial intelligence, and the deep-learning subtype in particular, has been enabled by the use of labeled big data, along with markedly enhanced computing power and cloud storage, across all sectors. In medicine, this is beginning to have an impact at three levels: for clinicians, predominantly via rapid, accurate image interpretation; for health systems, by improving workflow and the potential for reducing medical errors; and for patients, by enabling them to process their own data to promote health. The current limitations, including bias, privacy and security, and lack of transparency, along with the future directions of these applications will be discussed in this article. Over time, marked improvements in accuracy, productivity, and workflow will likely be actualized, but whether that will be used to improve the patient-doctor relationship or facilitate its erosion remains to be seen.
Affiliation(s)
- Eric J Topol
- Department of Molecular Medicine, Scripps Research, La Jolla, CA, USA.
|