1
Identification of Clinically Relevant HIV Vif Protein Motif Mutations through Machine Learning and Undersampling. Cells 2023; 12:cells12050772. [PMID: 36899908 PMCID: PMC10001277 DOI: 10.3390/cells12050772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2022] [Revised: 02/08/2023] [Accepted: 02/21/2023] [Indexed: 03/06/2023] Open
Abstract
Human Immunodeficiency Virus (HIV) and its clinical entity, the Acquired Immunodeficiency Syndrome (AIDS), continue to represent an important health burden worldwide. Although great advances have been made towards determining the way viral genetic diversity affects clinical outcome, genetic association studies have been hindered by the complexity of their interactions with the human host. This study provides an innovative approach for the identification and analysis of epidemiological associations between HIV Viral Infectivity Factor (Vif) protein mutations and four clinical endpoints (viral load and CD4 T cell numbers, both at the time of clinical debut and on historical follow-up of patients). Furthermore, this study highlights an alternative approach to the analysis of imbalanced datasets, where patients without specific mutations outnumber those with mutations. Imbalanced datasets are still a challenge hindering the development of classification algorithms through machine learning. This research deals with Decision Trees, Naïve Bayes (NB), Support Vector Machines (SVMs), and Artificial Neural Networks (ANNs). This paper proposes a new methodology considering an undersampling approach to deal with imbalanced datasets and introduces two novel and differing approaches (MAREV-1 and MAREV-2). As these approaches do not involve human pre-determined and hypothesis-driven combinations of motifs having functional or clinical relevance, they provide a unique opportunity to discover novel complex motif combinations of interest. Moreover, the motif combinations found can be analyzed through traditional statistical approaches, avoiding statistical corrections for multiple tests.
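As a rough illustration of what undersampling an imbalanced dataset means in general, the sketch below randomly discards majority-class samples until all classes are the same size. It is not the MAREV-1 or MAREV-2 procedure, which the abstract does not describe; the function name `random_undersample` and the toy data are assumptions made only for this example.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class is as small as the rarest class.

    Generic random undersampling, shown only as an illustration; this is
    NOT the MAREV-1/MAREV-2 method named in the abstract.
    """
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    # Every class is reduced to the size of the smallest class.
    target = min(len(group) for group in by_class.values())

    balanced = []
    for label, group in by_class.items():
        for sample in rng.sample(group, target):
            balanced.append((sample, label))

    rng.shuffle(balanced)
    xs = [sample for sample, _ in balanced]
    ys = [label for _, label in balanced]
    return xs, ys


# Hypothetical toy data: label 1 = patient carries a mutation, 0 = does not.
samples = list(range(100))
labels = [1] * 10 + [0] * 90
xs, ys = random_undersample(samples, labels)
print(Counter(ys))  # both classes now have 10 samples
```

The trade-off of this simple scheme is that it throws away majority-class data, which is presumably one reason more elaborate undersampling strategies are studied.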
2
Sinoquet C. A method combining a random forest-based technique with the modeling of linkage disequilibrium through latent variables, to run multilocus genome-wide association studies. BMC Bioinformatics 2018; 19:106. [PMID: 29587628 PMCID: PMC5870262 DOI: 10.1186/s12859-018-2054-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 12/10/2016] [Accepted: 02/09/2018] [Indexed: 01/26/2023] Open
Abstract
BACKGROUND Genome-wide association studies (GWASs) have been widely used to discover the genetic basis of complex phenotypes. However, standard single-SNP GWASs suffer from lack of power. In particular, they do not directly account for linkage disequilibrium, that is, the dependences between SNPs (Single Nucleotide Polymorphisms). RESULTS We present a comparative study of two multilocus GWAS strategies in the random forest-based framework. The first method, T-Trees, was designed by Botta and collaborators (Botta et al., PLoS ONE 9(4):e93379, 2014). We designed the other method, an innovative hybrid method combining T-Trees with the modeling of linkage disequilibrium. Linkage disequilibrium is modeled through a collection of tree-shaped Bayesian networks with latent variables, following our former works (Mourad et al., BMC Bioinformatics 12(1):16, 2011). We compared the two methods on both simulated and real data. For dominant and additive genetic models, in every condition simulated, the hybrid approach always performs slightly better than T-Trees. We assessed predictive power through the standard ROC technique on 14 real datasets. For 10 of the 14 datasets analyzed, the already high predictive power observed for T-Trees (0.910-0.946) can still be increased by up to 0.030. We also assessed whether the distributions of SNPs' scores obtained from T-Trees and the hybrid approach differed. Finally, we thoroughly analyzed the intersections of the top 100 SNPs output by any two or all three methods amongst T-Trees, the hybrid approach, and the single-SNP method. CONCLUSIONS The sophistication of T-Trees through finer linkage disequilibrium modeling is shown to be beneficial. The distributions of SNPs' scores generated by T-Trees and the hybrid approach are shown to be statistically different, which suggests complementarity of the methods. In particular, for 12 of the 14 real datasets, the distribution tail of the highest SNPs' scores shows larger values for the hybrid approach. The hybrid approach thus pinpoints more interesting SNPs than T-Trees, which can be provided as a short list of prioritized SNPs for further analysis by biologists. Finally, among the 211 top-100 SNPs jointly detected by the single-SNP method, T-Trees and the hybrid approach over the 14 datasets, we identified 72 and 38 SNPs present in the top 25 and top 10, respectively, for each method.
Affiliation(s)
- Christine Sinoquet
- LS2N, UMR CNRS 6004, Université de Nantes, 2 rue de la Houssinière, BP 92208, Nantes Cedex, 44322, France.
3
Yoo TK, Kim DW, Choi SB, Oh E, Park JS. Simple Scoring System and Artificial Neural Network for Knee Osteoarthritis Risk Prediction: A Cross-Sectional Study. PLoS One 2016; 11:e0148724. [PMID: 26859664 PMCID: PMC4747508 DOI: 10.1371/journal.pone.0148724] [Citation(s) in RCA: 43] [Impact Index Per Article: 5.4] [Received: 09/24/2015] [Accepted: 01/22/2016] [Indexed: 01/27/2023] Open
Abstract
BACKGROUND Knee osteoarthritis (OA) is the most common joint disease of adults worldwide. Since the treatments for advanced radiographic knee OA are limited, clinicians face a significant challenge of identifying patients who are at high risk of OA in a timely and appropriate way. Therefore, we developed a simple self-assessment scoring system and an improved artificial neural network (ANN) model for knee OA. METHODS The Fifth Korea National Health and Nutrition Examination Surveys (KNHANES V-1) data were used to develop a scoring system and ANN for radiographic knee OA. A logistic regression analysis was used to determine the predictors of the scoring system. The ANN was constructed using 1777 participants and validated internally on 888 participants in the KNHANES V-1. The predictors of the scoring system were selected as the inputs of the ANN. External validation was performed using 4731 participants in the Osteoarthritis Initiative (OAI). Area under the curve (AUC) of the receiver operating characteristic was calculated to compare the prediction models. RESULTS The scoring system and ANN were built using the independent predictors including sex, age, body mass index, educational status, hypertension, moderate physical activity, and knee pain. In the internal validation, both scoring system and ANN predicted radiographic knee OA (AUC 0.73 versus 0.81, p<0.001) and symptomatic knee OA (AUC 0.88 versus 0.94, p<0.001) with good discriminative ability. In the external validation, both scoring system and ANN showed lower discriminative ability in predicting radiographic knee OA (AUC 0.62 versus 0.67, p<0.001) and symptomatic knee OA (AUC 0.70 versus 0.76, p<0.001). CONCLUSIONS The self-assessment scoring system may be useful for identifying the adults at high risk for knee OA. The performance of the scoring system is improved significantly by the ANN. We provided an ANN calculator to simply predict the knee OA risk.
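For readers unfamiliar with the metric, the area under the ROC curve used to compare such prediction models can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it by direct pairwise comparison; the function name `roc_auc` and the toy scores are assumptions for illustration, not the authors' code or data.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the positive
    case has the higher score, counting ties as half. Labels are 1 for
    positive cases and 0 for negative cases.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical model scores for two positive and two negative cases.
print(roc_auc([0.8, 0.4, 0.6, 0.2], [1, 1, 0, 0]))  # 0.75
```

The pairwise form is quadratic in the number of cases, which is fine for a sketch; rank-based formulations give the same value more efficiently on large cohorts.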
Affiliation(s)
- Tae Keun Yoo
- Department of Ophthalmology, Yonsei University College of Medicine, Seoul, Republic of Korea
- Deok Won Kim
- Department of Medical Engineering, Yonsei University College of Medicine, Seoul, Republic of Korea
- Graduate Program in Biomedical Engineering, Yonsei University, Seoul, Republic of Korea
- Soo Beom Choi
- Department of Medical Engineering, Yonsei University College of Medicine, Seoul, Republic of Korea
- Graduate Program in Biomedical Engineering, Yonsei University, Seoul, Republic of Korea
- Ein Oh
- Department of Anaesthesiology and Pain Medicine, Yonsei University College of Medicine, Seoul, Republic of Korea
- Jee Soo Park
- Department of Medical Engineering, Yonsei University College of Medicine, Seoul, Republic of Korea
- Department of Medicine, Yonsei University College of Medicine, Seoul, Republic of Korea
4
Beam AL, Motsinger-Reif AA, Doyle J. An investigation of gene-gene interactions in dose-response studies with Bayesian nonparametrics. BioData Min 2015; 8:6. [PMID: 25691918 PMCID: PMC4330980 DOI: 10.1186/s13040-015-0039-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2014] [Accepted: 01/18/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Best practice for statistical methodology in cell-based dose-response studies has yet to be established. We examine the ability of MANOVA to detect trait-associated genetic loci in the presence of gene-gene interactions. We present a novel Bayesian nonparametric method designed to detect such interactions. RESULTS MANOVA and the Bayesian nonparametric approach show good ability to detect trait-associated genetic variants under various possible genetic models. It is shown through several sets of analyses that this may be due to marginal effects being present, even if the underlying genetic model does not explicitly contain them. CONCLUSIONS Understanding how genetic interactions affect drug response continues to be a critical goal. MANOVA and the novel Bayesian framework present a trade-off between computational complexity and model flexibility.
Affiliation(s)
- Andrew L Beam
- Center for Biomedical Informatics, Boston, Massachusetts
- Alison A Motsinger-Reif
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina; Department of Statistics, North Carolina State University, Raleigh, North Carolina
- Jon Doyle
- Department of Computer Science, North Carolina State University, Raleigh, North Carolina
5
Beam AL, Motsinger-Reif A, Doyle J. Bayesian neural networks for detecting epistasis in genetic association studies. BMC Bioinformatics 2014; 15:368. [PMID: 25413600 PMCID: PMC4256933 DOI: 10.1186/s12859-014-0368-0] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 06/25/2014] [Accepted: 10/30/2014] [Indexed: 12/02/2022] Open
Abstract
Background Discovering causal genetic variants from large genetic association studies poses many difficult challenges. Assessing which genetic markers are involved in determining trait status is a computationally demanding task, especially in the presence of gene-gene interactions. Results A non-parametric Bayesian approach in the form of a Bayesian neural network is proposed for use in analyzing genetic association studies. Demonstrations on synthetic and real data reveal they are able to efficiently and accurately determine which variants are involved in determining case-control status. By using graphics processing units (GPUs) the time needed to build these models is decreased by several orders of magnitude. In comparison with commonly used approaches for detecting interactions, Bayesian neural networks perform very well across a broad spectrum of possible genetic relationships. Conclusions The proposed framework is shown to be a powerful method for detecting causal SNPs while being computationally efficient enough to handle large datasets.
Affiliation(s)
- Andrew L Beam
- Center for Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
- Alison Motsinger-Reif
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA; Department of Statistics, North Carolina State University, Raleigh, NC, USA.
- Jon Doyle
- Department of Computer Science, North Carolina State University, Raleigh, NC, USA.
6
Screening for prediabetes using machine learning models. Computational and Mathematical Methods in Medicine 2014; 2014:618976. [PMID: 25165484 PMCID: PMC4140121 DOI: 10.1155/2014/618976] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.0] [Received: 05/28/2014] [Accepted: 07/08/2014] [Indexed: 12/30/2022]
Abstract
The global prevalence of diabetes is rapidly increasing. Studies support the necessity of screening and interventions for prediabetes, which could result in serious complications and diabetes. This study aimed at developing an intelligence-based screening model for prediabetes. Data from the Korean National Health and Nutrition Examination Survey (KNHANES) were used, excluding subjects with diabetes. The KNHANES 2010 data (n = 4685) were used for training and internal validation, while data from KNHANES 2011 (n = 4566) were used for external validation. We developed two models to screen for prediabetes using an artificial neural network (ANN) and a support vector machine (SVM) and performed a systematic evaluation of the models using internal and external validation. We compared the performance of our models with that of a previously developed screening score model for prediabetes based on logistic regression analysis. The SVM model showed an area under the curve of 0.731 in the external dataset, which is higher than those of the ANN model (0.729) and the screening score model (0.712). The prescreening methods developed in this study performed better than the previously developed screening score model and may be a more effective method for prediabetes screening.
7
A review for detecting gene-gene interactions using machine learning methods in genetic epidemiology. BioMed Research International 2013; 2013:432375. [PMID: 24228248 PMCID: PMC3818807 DOI: 10.1155/2013/432375] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.0] [Received: 07/17/2013] [Revised: 08/26/2013] [Accepted: 08/27/2013] [Indexed: 01/04/2023]
Abstract
A major statistical and computational challenge in genetic epidemiology is to identify and characterize the genes that interact with other genes and with environmental factors to influence complex multifactorial diseases. These gene-gene interactions are also referred to as epistasis, a phenomenon that traditional statistical methods cannot resolve because of the high dimensionality of the data and the occurrence of multiple polymorphisms. Several machine learning methods have therefore been applied to identify susceptibility genes in common, multifactorial diseases, namely neural networks (NNs), support vector machines (SVMs), and random forests (RFs). This paper gives an overview of these machine learning methods, describing the methodology of each and its application in detecting gene-gene and gene-environment interactions. Lastly, the paper discusses each method and presents its strengths and weaknesses in detecting gene-gene interactions in complex human disease.
8
Holzinger ER, Dudek SM, Frase AT, Krauss RM, Medina MW, Ritchie MD. ATHENA: a tool for meta-dimensional analysis applied to genotypes and gene expression data to predict HDL cholesterol levels. Pacific Symposium on Biocomputing 2013:385-396. [PMID: 23424143 PMCID: PMC3587764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Abstract
Technology is driving the field of human genetics research with advances in techniques to generate high-throughput data that interrogate various levels of biological regulation. With this massive amount of data comes the important task of using powerful bioinformatics techniques to sift through the noise to find true signals that predict various human traits. A popular analytical method thus far has been the genome-wide association study (GWAS), which assesses the association of single nucleotide polymorphisms (SNPs) with the trait of interest. Unfortunately, GWAS has not been able to explain a substantial proportion of the estimated heritability for most complex traits. Due to the inherently complex nature of biology, this phenomenon could be a factor of the simplistic study design. A more powerful analysis may be a systems biology approach that integrates different types of data, or a meta-dimensional analysis. For this study we used the Analysis Tool for Heritable and Environmental Network Associations (ATHENA) to integrate high-throughput SNPs and gene expression variables (EVs) to predict high-density lipoprotein cholesterol (HDL-C) levels. We generated multivariable models that consisted of SNPs only, EVs only, and SNPs + EVs with testing r-squared values of 0.16, 0.11, and 0.18, respectively. Additionally, using just the SNPs and EVs from the best models, we generated a model with a testing r-squared of 0.32. A linear regression model with the same variables resulted in an adjusted r-squared of 0.23. With this systems biology approach, we were able to integrate different types of high-throughput data to generate meta-dimensional models that are predictive for the HDL-C in our data set. Additionally, our modeling method was able to capture more of the HDL-C variation than a linear regression model that included the same variables.
Affiliation(s)
- Scott M. Dudek
- Center for Systems Genomics, Pennsylvania State University, University Park, PA 16803, USA
- Alex T. Frase
- Center for Systems Genomics, Pennsylvania State University, University Park, PA 16803, USA
- Ronald M. Krauss
- Children’s Hospital Oakland Research Institute, Oakland, CA 94609, USA
- Marisa W. Medina
- Children’s Hospital Oakland Research Institute, Oakland, CA 94609, USA
- Marylyn D. Ritchie
- Center for Systems Genomics, Pennsylvania State University, University Park, PA 16803, USA
9
Bridges M, Heron EA, O'Dushlaine C, Segurado R, Morris D, Corvin A, Gill M, Pinto C. Genetic classification of populations using supervised learning. PLoS One 2011; 6:e14802. [PMID: 21589856 PMCID: PMC3093382 DOI: 10.1371/journal.pone.0014802] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Received: 04/13/2010] [Accepted: 12/01/2010] [Indexed: 11/18/2022] Open
Abstract
There are many instances in genetics in which we wish to determine whether two candidate populations are distinguishable on the basis of their genetic structure. Examples include populations which are geographically separated, case–control studies and quality control (when participants in a study have been genotyped at different laboratories). This latter application is of particular importance in the era of large scale genome wide association studies, when collections of individuals genotyped at different locations are being merged to provide increased power. The traditional method for detecting structure within a population is some form of exploratory technique such as principal components analysis. Such methods, which do not utilise our prior knowledge of the membership of the candidate populations, are termed unsupervised. Supervised methods, on the other hand, are able to utilise this prior knowledge when it is available. In this paper we demonstrate that in such cases modern supervised approaches are a more appropriate tool for detecting genetic differences between populations. We apply two such methods (neural networks and support vector machines) to the classification of three populations (two from Scotland and one from Bulgaria). The sensitivity exhibited by both these methods is considerably higher than that attained by principal components analysis and in fact comfortably exceeds a recently conjectured theoretical limit on the sensitivity of unsupervised methods. In particular, our methods can distinguish between the two Scottish populations, where principal components analysis cannot. We suggest, on the basis of our results, that a supervised learning approach should be the method of choice when classifying individuals into pre-defined populations, particularly in quality control for large scale genome wide association studies.
Affiliation(s)
- Michael Bridges
- Astrophysics Group, Cavendish Laboratory, Cambridge, United Kingdom
- Elizabeth A. Heron
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
- Colm O'Dushlaine
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
- Ricardo Segurado
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
- Derek Morris
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
- Aiden Corvin
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
- Michael Gill
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
- Carlos Pinto
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
10
Motsinger-Reif AA, Deodhar S, Winham SJ, Hardison NE. Grammatical evolution decision trees for detecting gene-gene interactions. BioData Min 2010; 3:8. [PMID: 21087514 PMCID: PMC3000379 DOI: 10.1186/1756-0381-3-8] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 06/02/2010] [Accepted: 11/18/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND A fundamental goal of human genetics is the discovery of polymorphisms that predict common, complex diseases. It is hypothesized that complex diseases are due to a myriad of factors including environmental exposures and complex genetic risk models, including gene-gene interactions. Such epistatic models present an important analytical challenge, requiring that methods perform not only statistical modeling, but also variable selection to generate testable genetic model hypotheses. This challenge is amplified by recent advances in genotyping technology, as the number of potential predictor variables is rapidly increasing. METHODS Decision trees are a highly successful, easily interpretable data-mining method that are typically optimized with a hierarchical model building approach, which limits their potential to identify interacting effects. To overcome this limitation, we utilize evolutionary computation, specifically grammatical evolution, to build decision trees to detect and model gene-gene interactions. In the current study, we introduce the Grammatical Evolution Decision Trees (GEDT) method and software and evaluate this approach on simulated data representing gene-gene interaction models of a range of effect sizes. We compare the performance of the method to a traditional decision tree algorithm and a random search approach and demonstrate the improved performance of the method to detect purely epistatic interactions. RESULTS The results of our simulations demonstrate that GEDT has high power to detect even very moderate genetic risk models. GEDT has high power to detect interactions with and without main effects. CONCLUSIONS GEDT, while still in its initial stages of development, is a promising new approach for identifying gene-gene interactions in genetic association studies.
11
Beam AL, Motsinger-Reif AA. Optimization of nonlinear dose- and concentration-response models utilizing evolutionary computation. Dose Response 2010; 9:387-409. [PMID: 22013401 DOI: 10.2203/dose-response.09-030.beam] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 02/05/2023] Open
Abstract
An essential part of toxicity and chemical screening is assessing the concentration-related effects of a test article. Most often this concentration-response is nonlinear, necessitating sophisticated regression methodologies. The parameters derived from curve fitting are essential in determining a test article's potency (EC(50)) and efficacy (E(max)), and variations in model fit may lead to different conclusions about an article's performance and safety. Previous approaches have leveraged advanced statistical and mathematical techniques to implement nonlinear least squares (NLS) for obtaining the parameters defining such a curve. These approaches, while mathematically rigorous, suffer from initial value sensitivity and computational intensity, and rely on complex and intricate computational and numerical techniques. However, if there is a known mathematical model that can reliably predict the data, then nonlinear regression may be equally viewed as parameter optimization. In this context, one may utilize proven techniques from machine learning, such as evolutionary algorithms, which are robust, powerful, and require far less computational framework to optimize the defining parameters. In the current study we present a new method that uses such techniques, Evolutionary Algorithm Dose Response Modeling (EADRM), and demonstrate its effectiveness compared to more conventional methods on both real and simulated data.
Affiliation(s)
- Andrew L Beam
- Department of Statistics, North Carolina State University, Raleigh, North Carolina; CellzDirect/Invitrogen Corporation (a part of Life Technologies), Durham, North Carolina
12
Holzinger ER, Buchanan CC, Dudek SM, Torstenson EC, Turner SD, Ritchie MD. Initialization Parameter Sweep in ATHENA: Optimizing Neural Networks for Detecting Gene-Gene Interactions in the Presence of Small Main Effects. Genetic and Evolutionary Computation Conference: [Proceedings] 2010; 12:203-210. [PMID: 21152364 DOI: 10.1145/1830483.1830519] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 01/17/2023]
Abstract
Recent advances in genotyping technology have led to the generation of an enormous quantity of genetic data. Traditional methods of statistical analysis have proved insufficient in extracting all of the information about the genetic components of common, complex human diseases. A contributing factor to the problem of analysis is that amongst the small main effects of each single gene on disease susceptibility, there are non-linear, gene-gene interactions that can be difficult for traditional, parametric analyses to detect. In addition, exhaustively searching all multi-locus combinations has proved computationally impractical. Novel strategies for analysis have been developed to address these issues. The Analysis Tool for Heritable and Environmental Network Associations (ATHENA) is an analytical tool that incorporates grammatical evolution neural networks (GENN) to detect interactions among genetic factors. Initial parameters define how the evolutionary process will be implemented. This research addresses how different parameter settings affect detection of disease models involving interactions. In the current study, we iterate over multiple parameter values to determine which combinations appear optimal for detecting interactions in simulated data for multiple genetic models. Our results indicate that the factors that have the greatest influence on detection are: input variable encoding, population size, and parallel computation.
Affiliation(s)
- Emily R Holzinger
- Ctr. for Human Genetics Research Dept. of Molecular Physiology & Biophysics; Vanderbilt University Nashville, TN 37232
13
Günther F, Wawro N, Bammann K. Neural networks for modeling gene-gene interactions in association studies. BMC Genet 2009; 10:87. [PMID: 20030838 PMCID: PMC2817696 DOI: 10.1186/1471-2156-10-87] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.3] [Received: 03/04/2009] [Accepted: 12/23/2009] [Indexed: 01/17/2023] Open
Abstract
Background Our aim is to investigate the ability of neural networks to model different two-locus disease models. We conduct a simulation study to compare neural networks with two standard methods, namely logistic regression models and multifactor dimensionality reduction. One hundred data sets are generated for each of six two-locus disease models, which are considered in a low and in a high risk scenario. Two models represent independence, one is a multiplicative model, and three models are epistatic. For each data set, six neural networks (with up to five hidden neurons) and five logistic regression models (the null model, three main effect models, and the full model) with two different codings for the genotype information are fitted. Additionally, the multifactor dimensionality reduction approach is applied. Results The results show that neural networks are more successful in modeling the structure of the underlying disease model than logistic regression models in most of the investigated situations. In our simulation study, neither logistic regression nor multifactor dimensionality reduction are able to correctly identify biological interaction. Conclusions Neural networks are a promising tool to handle complex data situations. However, further research is necessary concerning the interpretation of their parameters.
Affiliation(s)
- Frauke Günther
- University of Bremen, Bremen Institute for Prevention Research and Social Medicine, Linzer Strasse 10, 28359 Bremen, Germany.
14
Günther F, Wawro N, Bammann K. Neural networks for modeling gene-gene interactions in association studies. BMC Genet 2009. [PMID: 20030838 DOI: 10.1186/1471-2156-10-87] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND Our aim is to investigate the ability of neural networks to model different two-locus disease models. We conduct a simulation study to compare neural networks with two standard methods, namely logistic regression models and multifactor dimensionality reduction. One hundred data sets are generated for each of six two-locus disease models, which are considered in a low and in a high risk scenario. Two models represent independence, one is a multiplicative model, and three models are epistatic. For each data set, six neural networks (with up to five hidden neurons) and five logistic regression models (the null model, three main effect models, and the full model) with two different codings for the genotype information are fitted. Additionally, the multifactor dimensionality reduction approach is applied. RESULTS The results show that neural networks are more successful in modeling the structure of the underlying disease model than logistic regression models in most of the investigated situations. In our simulation study, neither logistic regression nor multifactor dimensionality reduction are able to correctly identify biological interaction. CONCLUSIONS Neural networks are a promising tool to handle complex data situations. However, further research is necessary concerning the interpretation of their parameters.
Affiliation(s)
- Frauke Günther
- University of Bremen, Bremen Institute for Prevention Research and Social Medicine, Linzer Strasse 10, 28359 Bremen, Germany.