1
|
Arango NK, Morgante F. Comparing statistical learning methods for complex trait prediction from gene expression. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.06.01.596951. [PMID: 38895364 PMCID: PMC11185554 DOI: 10.1101/2024.06.01.596951] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/21/2024]
Abstract
Accurate prediction of complex traits is an important task in quantitative genetics that has become increasingly relevant for personalized medicine. Genotypes have traditionally been used for trait prediction using a variety of methods such as mixed models, Bayesian methods, penalized regressions, dimension reductions, and machine learning methods. Recent studies have shown that gene expression levels can produce higher prediction accuracy than genotypes. However, only a few prediction methods were used in these studies. Thus, a comprehensive assessment of methods is needed to fully evaluate the potential of gene expression as a predictor of complex trait phenotypes. Here, we used data from the Drosophila Genetic Reference Panel (DGRP) to compare the ability of several existing statistical learning methods to predict starvation resistance from gene expression in the two sexes separately. The methods considered differ in assumptions about the distribution of gene effect sizes - ranging from models that assume that every gene affects the trait to more sparse models - and their ability to capture gene-gene interactions. We also used functional annotation (i.e., Gene Ontology (GO)) as an external source of biological information to inform prediction models. The results show that differences in prediction accuracy between methods exist, although they are generally not large. Methods performing variable selection gave higher accuracy in females while methods assuming a more polygenic architecture performed better in males. Incorporating GO annotations further improved prediction accuracy for a few GO terms of biological significance. Biological significance extended to the genes underlying highly predictive GO terms with different genes emerging between sexes. Notably, the Insulin-like Receptor (InR) was prevalent across methods and sexes. 
Our results confirmed the potential of transcriptomic prediction and highlighted the importance of selecting appropriate methods and strategies in order to achieve accurate predictions.
Affiliation(s)
- Noah Klimkowski Arango
  - Center for Human Genetics, Clemson University, Greenwood, SC, USA
  - Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA
- Fabio Morgante
  - Center for Human Genetics, Clemson University, Greenwood, SC, USA
  - Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA
|
2
|
Daoud S, Taha M. Protein characteristics substantially influence the propensity of activity cliffs among kinase inhibitors. Sci Rep 2024; 14:9058. [PMID: 38643174 PMCID: PMC11032345 DOI: 10.1038/s41598-024-59501-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2023] [Accepted: 04/11/2024] [Indexed: 04/22/2024]
Abstract
Activity cliffs (ACs) are pairs of structurally similar molecules with significantly different affinities for a biotarget, posing a challenge in computer-assisted drug discovery. This study focuses on protein kinases, significant therapeutic targets, with some exhibiting ACs while others do not despite numerous inhibitors. The hypothesis that the presence of ACs is dependent on the target protein and its complete structural context is explored. Machine learning models were developed to link protein properties to ACs, revealing specific tripeptide sequences and overall protein properties as critical factors in ACs occurrence. The study highlights the importance of considering the entire protein matrix rather than just the binding site in understanding ACs. This research provides valuable insights for drug discovery and design, paving the way for addressing ACs-related challenges in modern computational approaches.
Affiliation(s)
- Safa Daoud
  - Department of Pharmaceutical Chemistry and Pharmacognosy, Faculty of Pharmacy, Applied Sciences Private University, Amman, Jordan
- Mutasem Taha
  - Department of Pharmaceutical Sciences, Faculty of Pharmacy, University of Jordan, Amman, Jordan
|
3
|
Hassan AM, Biaggi-Ondina A, Asaad M, Morris N, Liu J, Selber JC, Butler CE. Artificial Intelligence Modeling to Predict Periprosthetic Infection and Explantation following Implant-Based Reconstruction. Plast Reconstr Surg 2023; 152:929-938. [PMID: 36862958 DOI: 10.1097/prs.0000000000010345] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/04/2023]
Abstract
BACKGROUND: Despite improvements in prosthesis design and surgical techniques, periprosthetic infection and explantation rates following implant-based reconstruction (IBR) remain relatively high. Artificial intelligence is an extremely powerful predictive tool that involves machine learning (ML) algorithms. We sought to develop, validate, and evaluate the use of ML algorithms to predict complications of IBR. METHODS: A comprehensive review of patients who underwent IBR from January of 2018 to December of 2019 was conducted. Nine supervised ML algorithms were developed to predict periprosthetic infection and explantation. Patient data were randomly divided into training (80%) and testing (20%) sets. RESULTS: The authors identified 481 patients (694 reconstructions) with a mean ± SD age of 50.0 ± 11.5 years, mean ± SD body mass index of 26.7 ± 4.8 kg/m², and median follow-up time of 16.1 months (range, 11.9 to 3.2 months). Periprosthetic infection developed in 113 of the reconstructions (16.3%), and explantation was required with 82 (11.8%) of them. ML demonstrated good discriminatory performance in predicting periprosthetic infection and explantation (area under the receiver operating characteristic curve, 0.73 and 0.78, respectively), and identified nine and 12 significant predictors of periprosthetic infection and explantation, respectively. CONCLUSIONS: ML algorithms trained using readily available perioperative clinical data accurately predict periprosthetic infection and explantation following IBR. The authors' findings support incorporating ML models into perioperative assessment of patients undergoing IBR to provide data-driven, patient-specific risk assessment to aid individualized patient counseling, shared decision-making, and presurgical optimization.
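The evaluation protocol this abstract describes (a random 80%/20% training/testing split, scored by the area under the receiver operating characteristic curve) can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' code: the function names `train_test_split` and `roc_auc`, the fixed seed, and the pairwise (Mann-Whitney style) way of computing the AUC are all choices made here for clarity.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` and return (train, test) lists.

    The 20% default mirrors the split reported in the abstract;
    the seed is an arbitrary assumption for reproducibility.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A pair counts 1 when the positive case has the higher score and 0.5
    when the scores tie. This is quadratic in sample size, which is fine
    for a sketch but slow for large data sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In use, a model would be fitted on the training rows only, and `roc_auc` would be called on the held-out labels together with the model's predicted risks. An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking.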
Affiliation(s)
- Abbas M Hassan
  - Department of Plastic Surgery, The University of Texas MD Anderson Cancer Center
- Andrea Biaggi-Ondina
  - Department of Plastic Surgery, The University of Texas MD Anderson Cancer Center
- Malke Asaad
  - Department of Plastic Surgery, The University of Texas MD Anderson Cancer Center
- Natalie Morris
  - Department of Plastic Surgery, The University of Texas MD Anderson Cancer Center
- Jun Liu
  - Department of Plastic Surgery, The University of Texas MD Anderson Cancer Center
- Jesse C Selber
  - Department of Plastic Surgery, The University of Texas MD Anderson Cancer Center
- Charles E Butler
  - Department of Plastic Surgery, The University of Texas MD Anderson Cancer Center
|
4
|
Esmaeili F, Narimani Z, Vasighi M. Discovering SNP-disease relationships in genome-wide SNP data using an improved harmony search based on SNP locus and genetic inheritance patterns. PLoS One 2023; 18:e0292266. [PMID: 37831690 PMCID: PMC10575495 DOI: 10.1371/journal.pone.0292266] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2023] [Accepted: 09/15/2023] [Indexed: 10/15/2023]
Abstract
Advances in high-throughput sequencing technologies have made it possible to access millions of measurements from thousands of people. Single nucleotide polymorphisms (SNPs), the most common type of mutation in the human genome, have been shown to play a significant role in the development of complex and multifactorial diseases. However, studying the synergistic interactions between different SNPs in explaining multifactorial diseases is challenging due to the high dimensionality of the data and methodological complexities. Existing solutions often use a multi-objective approach based on metaheuristic optimization algorithms such as harmony search. However, previous studies have shown that using a multi-objective approach is not sufficient to address complex disease models with no or low marginal effect. In this research, we introduce a locus-driven harmony search (LDHS), an improved harmony search algorithm that focuses on using SNP locus information and genetic inheritance patterns to initialize harmony memories. The proposed method integrates biological knowledge to improve harmony memory initialization by adding SNP combinations that are likely candidates for interaction and disease causation. Using a SNP grouping process, LDHS generates harmonies that include SNPs with a higher potential for interaction, resulting in greater power in detecting disease-causing SNP combinations. The performance of the proposed algorithm was evaluated on 200 synthesized datasets for disease models with and without marginal effect. The results show significant improvement in the power of the algorithm to find disease-related SNP sets while decreasing computational cost compared to state-of-the-art algorithms. The proposed algorithm also demonstrated notable performance on real breast cancer data, showing that integrating prior knowledge can significantly improve the process of detecting disease-related SNPs in both real and synthesized data.
Affiliation(s)
- Fariba Esmaeili
  - Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran
- Zahra Narimani
  - Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran
- Mahdi Vasighi
  - Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran
|
5
|
Freda PJ, Ghosh A, Zhang E, Luo T, Chitre AS, Polesskaya O, St Pierre CL, Gao J, Martin CD, Chen H, Garcia-Martinez AG, Wang T, Han W, Ishiwari K, Meyer P, Lamparelli A, King CP, Palmer AA, Li R, Moore JH. Automated quantitative trait locus analysis (AutoQTL). BioData Min 2023; 16:14. [PMID: 37038201 PMCID: PMC10088184 DOI: 10.1186/s13040-023-00331-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/07/2022] [Accepted: 03/31/2023] [Indexed: 04/12/2023]
Abstract
BACKGROUND: Quantitative Trait Locus (QTL) analysis and Genome-Wide Association Studies (GWAS) have the power to identify variants that capture significant levels of phenotypic variance in complex traits. However, effort and time are required to select the best methods and optimize parameters and pre-processing steps. Although machine learning approaches have been shown to greatly assist in optimization and data processing, applying them to QTL analysis and GWAS is challenging due to the complexity of large, heterogeneous datasets. Here, we describe proof-of-concept for an automated machine learning approach, AutoQTL, with the ability to automate many complicated decisions related to analysis of complex traits and generate solutions to describe relationships that exist in genetic data. RESULTS: Using a publicly available dataset of 18 putative QTL from a large-scale GWAS of body mass index in the laboratory rat, Rattus norvegicus, AutoQTL captures the phenotypic variance explained under a standard additive model. AutoQTL also detects evidence of non-additive effects including deviations from additivity and 2-way epistatic interactions in simulated data via multiple optimal solutions. Additionally, feature importance metrics provide different insights into the inheritance models and predictive power of multiple GWAS-derived putative QTL. CONCLUSIONS: This proof-of-concept illustrates that automated machine learning techniques can complement standard approaches and have the potential to detect both additive and non-additive effects via various optimal solutions and feature importance metrics. In the future, we aim to expand AutoQTL to accommodate omics-level datasets with intelligent feature selection and feature engineering strategies.
Affiliation(s)
- Philip J Freda
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Blvd., Pacific Design Center, Suite G540, West Hollywood, CA, 90069, USA
- Attri Ghosh
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Blvd., Pacific Design Center, Suite G540, West Hollywood, CA, 90069, USA
- Elizabeth Zhang
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Blvd., Pacific Design Center, Suite G540, West Hollywood, CA, 90069, USA
- Tianhao Luo
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Blvd., Pacific Design Center, Suite G540, West Hollywood, CA, 90069, USA
- Apurva S Chitre
  - Department of Psychiatry, University of California San Diego, 9500 Gilman Dr., Mail Code: 0667, La Jolla, CA, 92093-0667, USA
- Oksana Polesskaya
  - Department of Psychiatry, University of California San Diego, 9500 Gilman Dr., Mail Code: 0667, La Jolla, CA, 92093-0667, USA
- Celine L St Pierre
  - Department of Psychiatry, University of California San Diego, 9500 Gilman Dr., Mail Code: 0667, La Jolla, CA, 92093-0667, USA
- Jianjun Gao
  - Department of Psychiatry, University of California San Diego, 9500 Gilman Dr., Mail Code: 0667, La Jolla, CA, 92093-0667, USA
- Connor D Martin
  - Department of Pharmacology & Toxicology, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, 955 Main Street, Suite 3102, Buffalo, NY, 14203, USA
- Hao Chen
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Translational Research Building, 71 South Manassas, Memphis, TN, 38163, USA
- Angel G Garcia-Martinez
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Translational Research Building, 71 South Manassas, Memphis, TN, 38163, USA
- Tengfei Wang
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Translational Research Building, 71 South Manassas, Memphis, TN, 38163, USA
- Wenyan Han
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Translational Research Building, 71 South Manassas, Memphis, TN, 38163, USA
- Keita Ishiwari
  - Department of Pharmacology & Toxicology, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, 955 Main Street, Suite 3102, Buffalo, NY, 14203, USA
  - Clinical and Research Institute on Addictions, University at Buffalo, 1021 Main Street, Buffalo, NY, 14203-1016, USA
- Paul Meyer
  - Department of Psychology, University at Buffalo, 204 Park Hall, North Campus, Buffalo, NY, 14260-4110, USA
- Alexander Lamparelli
  - Department of Psychology, University at Buffalo, 204 Park Hall, North Campus, Buffalo, NY, 14260-4110, USA
- Christopher P King
  - Department of Psychology, University at Buffalo, 204 Park Hall, North Campus, Buffalo, NY, 14260-4110, USA
- Abraham A Palmer
  - Department of Psychiatry, University of California San Diego, 9500 Gilman Dr., Mail Code: 0667, La Jolla, CA, 92093-0667, USA
  - Institute for Genomic Medicine, University of California San Diego, 9500 Gilman Dr., Mail Code: 0667, La Jolla, CA, 92093-0667, USA
- Ruowang Li
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Blvd., Pacific Design Center, Suite G540, West Hollywood, CA, 90069, USA
- Jason H Moore
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Blvd., Pacific Design Center, Suite G540, West Hollywood, CA, 90069, USA
|
6
|
Frade MCM, Beltrame T, Gois MDO, Pinto A, Tonello SCGDM, Torres RDS, Catai AM. Toward characterizing cardiovascular fitness using machine learning based on unobtrusive data. PLoS One 2023; 18:e0282398. [PMID: 36862737 PMCID: PMC9980797 DOI: 10.1371/journal.pone.0282398] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/11/2022] [Accepted: 02/14/2023] [Indexed: 03/03/2023]
Abstract
Cardiopulmonary exercise testing (CPET) is a non-invasive approach to measure the maximum oxygen uptake (V̇O₂max), which is an index to assess cardiovascular fitness (CF). However, CPET is not available to all populations and cannot be obtained continuously. Thus, wearable sensors have been combined with machine learning (ML) algorithms to investigate CF. This study therefore aimed to predict CF by applying ML algorithms to data obtained by wearable technologies. For this purpose, 43 volunteers with different levels of aerobic power, who wore a wearable device to collect unobtrusive data for 7 days, were evaluated by CPET. Eleven inputs (sex, age, weight, height, body mass index, breathing rate, minute ventilation, total hip acceleration, walking cadence, heart rate, and tidal volume) were used to predict V̇O₂max by support vector regression (SVR). Afterward, the SHapley Additive exPlanations (SHAP) method was used to explain the results. SVR was able to predict the CF, and the SHAP method showed that the inputs related to hemodynamic and anthropometric domains were the most important ones to predict the CF. Therefore, we conclude that cardiovascular fitness can be predicted by wearable technologies associated with machine learning during unsupervised activities of daily living.
Affiliation(s)
- Thomas Beltrame
  - Department of Physical Therapy, Federal University of São Carlos, São Carlos, São Paulo, Brazil
  - Samsung R&D Institute Brazil–SRBR, Campinas, São Paulo, Brazil
- Allan Pinto
  - Brazilian Synchrotron Light Laboratory (LNLS), Brazilian Center for Research in Energy and Materials (CNPEM), Campinas, São Paulo, Brazil
- Ricardo da Silva Torres
  - Department of ICT and Natural Sciences, Faculty of Information Technology and Electrical Engineering, NTNU - Norwegian University of Science and Technology, Ålesund, Norway
- Aparecida Maria Catai
  - Department of Physical Therapy, Federal University of São Carlos, São Carlos, São Paulo, Brazil
|
7
|
Automated quantitative trait locus analysis (AutoQTL). BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.01.12.523835. [PMID: 36711526 PMCID: PMC9882220 DOI: 10.1101/2023.01.12.523835] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/16/2023]
Abstract
Background: Quantitative Trait Locus (QTL) analysis and Genome-Wide Association Studies (GWAS) have the power to identify variants that capture significant levels of phenotypic variance in complex traits. However, effort and time are required to select the best methods and optimize parameters and pre-processing steps. Although machine learning approaches have been shown to greatly assist in optimization and data processing, applying them to QTL analysis and GWAS is challenging due to the complexity of large, heterogeneous datasets. Here, we describe proof-of-concept for an automated machine learning approach, AutoQTL, with the ability to automate many complex decisions related to analysis of complex traits and generate diverse solutions to describe relationships that exist in genetic data. Results: Using a dataset of 18 putative QTL from a large-scale GWAS of body mass index in the laboratory rat, Rattus norvegicus, AutoQTL captures the phenotypic variance explained under a standard additive model while also providing evidence of non-additive effects including deviations from additivity and 2-way epistatic interactions from simulated data via multiple optimal solutions. Additionally, feature importance metrics provide different insights into the inheritance models and predictive power of multiple GWAS-derived putative QTL. Conclusions: This proof-of-concept illustrates that automated machine learning techniques can be applied to genetic data and have the potential to detect both additive and non-additive effects via various optimal solutions and feature importance metrics. In the future, we aim to expand AutoQTL to accommodate omics-level datasets with intelligent feature selection strategies.
|
8
|
Hassan AM, Rajesh A, Asaad M, Jonas NA, Coert JH, Mehrara BJ, Butler CE. A Surgeon's Guide to Artificial Intelligence-Driven Predictive Models. Am Surg 2023; 89:11-19. [PMID: 35588764 PMCID: PMC9674797 DOI: 10.1177/00031348221103648] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 12/24/2022]
Abstract
Artificial intelligence (AI) focuses on processing and interpreting complex information as well as identifying relationships and patterns among complex data. Artificial intelligence- and machine learning (ML)-driven predictions have shown promising potential in influencing real-time decisions and improving surgical outcomes by facilitating screening, diagnosis, risk assessment, preoperative planning, and shared decision-making. Fundamental understanding of the algorithms, as well as their development and interpretation, is essential for the evolution of AI in surgery. In this article, we provide surgeons with a fundamental understanding of AI-driven predictive models through an overview of common ML and deep learning algorithms, model development, performance metrics and interpretation. This would serve as a basis for understanding ML-based research, while fostering new ideas and innovations for furthering the reach of this emerging discipline.
Affiliation(s)
- Abbas M. Hassan
  - Department of Plastic & Reconstructive Surgery, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Aashish Rajesh
  - Department of Surgery, University of Texas Health Science Center, San Antonio, TX, USA
- Malke Asaad
  - Department of Plastic Surgery, University of Pittsburgh Medical Center, Pittsburgh, PA, USA
- Nelson A. Jonas
  - Department of Plastic & Reconstructive Surgery, Memorial Sloan Kettering Cancer Center, New York, NY
- J. Henk Coert
  - Department of Plastic and Reconstructive Surgery, University Medical Center Utrecht, Utrecht, Netherlands
- Babak J. Mehrara
  - Department of Plastic & Reconstructive Surgery, Memorial Sloan Kettering Cancer Center, New York, NY
- Charles E. Butler
  - Department of Plastic & Reconstructive Surgery, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
|
9
|
Manduchi E, Romano JD, Moore JH. The promise of automated machine learning for the genetic analysis of complex traits. Hum Genet 2022; 141:1529-1544. [PMID: 34713318 PMCID: PMC9360157 DOI: 10.1007/s00439-021-02393-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/21/2021] [Accepted: 10/22/2021] [Indexed: 12/24/2022]
Abstract
The genetic analysis of complex traits has been dominated by parametric statistical methods due to their theoretical properties, ease of use, computational efficiency, and intuitive interpretation. However, there are likely to be patterns arising from complex genetic architectures which are more easily detected and modeled using machine learning methods. Unfortunately, selecting the right machine learning algorithm and tuning its hyperparameters can be daunting for experts and non-experts alike. The goal of automated machine learning (AutoML) is to let a computer algorithm identify the right algorithms and hyperparameters thus taking the guesswork out of the optimization process. We review the promises and challenges of AutoML for the genetic analysis of complex traits and give an overview of several approaches and some example applications to omics data. It is our hope that this review will motivate studies to develop and evaluate novel AutoML methods and software in the genetics and genomics space. The promise of AutoML is to enable anyone, regardless of training or expertise, to apply machine learning as part of their genetic analysis strategy.
Affiliation(s)
- Elisabetta Manduchi
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, 19104, USA
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Joseph D Romano
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Jason H Moore
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, 19104, USA
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, 19104, USA
|
10
|
Novel Machine Learning Approach for the Prediction of Hernia Recurrence, Surgical Complication, and 30-Day Readmission after Abdominal Wall Reconstruction. J Am Coll Surg 2022; 234:918-927. [DOI: 10.1097/xcs.0000000000000141] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/23/2022]
|
11
|
Musolf AM, Holzinger ER, Malley JD, Bailey-Wilson JE. What makes a good prediction? Feature importance and beginning to open the black box of machine learning in genetics. Hum Genet 2021; 141:1515-1528. [PMID: 34862561 PMCID: PMC9360120 DOI: 10.1007/s00439-021-02402-z] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 05/13/2021] [Accepted: 11/08/2021] [Indexed: 01/26/2023]
Abstract
Genetic data have become increasingly complex within the past decade, leading researchers to pursue increasingly complex questions, such as those involving epistatic interactions and protein prediction. Traditional methods are ill-suited to answer these questions, but machine learning (ML) techniques offer an alternative solution. ML algorithms are commonly used in genetics to predict or classify subjects, but some methods evaluate which features (variables) are responsible for creating a good prediction; this is called feature importance. This is critical in genetics, as researchers are often interested in which features (e.g., SNP genotype or environmental exposure) are responsible for a good prediction. This allows for deeper analysis beyond simple prediction, including the determination of risk factors associated with a given phenotype. Feature importance further permits the researcher to peer inside the black box of many ML algorithms to see how they work and which features are critical in informing a good prediction. This review focuses on ML methods that provide feature importance metrics for the analysis of genetic data. Five major categories of ML algorithms are described: k nearest neighbors, artificial neural networks, deep learning, support vector machines, and random forests. The review ends with a discussion of how to choose the best machine for a data set. This review will be particularly useful for genetic researchers looking to use ML methods to answer questions beyond basic prediction and classification.
Affiliation(s)
- Anthony M Musolf
  - Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive Suite 1200, Baltimore, MD, 21224, USA
- Emily R Holzinger
  - Target Sciences, Informatics and Predictive Sciences, Bristol Myers Squibb, Cambridge, MA, USA
- James D Malley
  - Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive Suite 1200, Baltimore, MD, 21224, USA
- Joan E Bailey-Wilson
  - Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive Suite 1200, Baltimore, MD, 21224, USA
|
12
|
Katsaouni N, Tashkandi A, Wiese L, Schulz MH. Machine learning based disease prediction from genotype data. Biol Chem 2021; 402:871-885. [PMID: 34218544 DOI: 10.1515/hsz-2021-0109] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/10/2021] [Accepted: 06/15/2021] [Indexed: 12/16/2022]
Abstract
Using results from genome-wide association studies for understanding complex traits is a current challenge. Here we review how genotype data can be used with different machine learning (ML) methods to predict phenotype occurrence and severity. We discuss common feature encoding schemes and how studies handle the often small number of samples compared to the huge number of variants. We compare which ML methods are being applied, including recent results using deep neural networks. Further, we review the application of methods for feature explanation and interpretation.
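The abstract mentions feature encoding schemes without naming them. As a hedged illustration, the sketch below shows two encodings widely used for SNP genotypes, an additive allele count (0/1/2) and a one-hot indicator vector. Whether these are the schemes the review covers is an assumption, and the function names (`additive_encode`, `one_hot_encode`, `encode_sample`) and the two-letter genotype strings are hypothetical choices made for this example.

```python
def additive_encode(genotype, alt_allele):
    """Count copies of the alternate allele (0, 1, or 2) in a genotype such as 'AG'."""
    if len(genotype) != 2:
        raise ValueError("expected a diploid genotype with two alleles")
    return sum(1 for allele in genotype if allele == alt_allele)


def one_hot_encode(dosage):
    """Map an additive dosage of 0, 1, or 2 to a three-element indicator vector."""
    if dosage not in (0, 1, 2):
        raise ValueError("dosage must be 0, 1, or 2")
    return [1 if dosage == k else 0 for k in range(3)]


def encode_sample(genotypes, alt_alleles, one_hot=False):
    """Encode one sample's genotypes, given the alternate allele at each SNP.

    Returns one feature per SNP for the additive scheme, or three
    features per SNP (flattened) when `one_hot` is True.
    """
    if len(genotypes) != len(alt_alleles):
        raise ValueError("need exactly one alternate allele per genotype")
    dosages = [additive_encode(g, a) for g, a in zip(genotypes, alt_alleles)]
    if not one_hot:
        return dosages
    return [bit for d in dosages for bit in one_hot_encode(d)]
```

The additive scheme keeps one feature per variant and assumes each extra allele copy shifts the phenotype by the same amount. The one-hot scheme triples the feature count but lets a model fit a separate effect for each genotype class, which matters when the number of samples is small relative to the number of variants.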
Affiliation(s)
- Nikoletta Katsaouni
  - Institute for Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany
- Araek Tashkandi
  - Institute of Computer Sciences and Engineering, University of Jeddah, 21959 Jeddah, Saudi Arabia
- Lena Wiese
  - Institute of Computer Science, Goethe University, 60629 Frankfurt am Main, Germany
- Marcel H Schulz
  - Institute for Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany
  - German Center for Cardiovascular Research (DZHK), Partner Site RheinMain, 60590 Frankfurt am Main, Germany
  - Cardio-Pulmonary Institute, Goethe University Hospital, Frankfurt am Main, Germany
|