1
|
Evaluating the detection ability of a range of epistasis detection methods on simulated data for pure and impure epistatic models. PLoS One 2022; 17:e0263390. [PMID: 35180244 PMCID: PMC8856572 DOI: 10.1371/journal.pone.0263390] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2021] [Accepted: 01/18/2022] [Indexed: 11/19/2022] Open
Abstract
Background Numerous approaches have been proposed for the detection of epistatic interactions within GWAS datasets in order to better understand the drivers of disease and genetics. Methods A selection of state-of-the-art approaches were assessed. These included the statistical tests, fast-epistasis, BOOST, logistic regression and wtest; swarm intelligence methods, namely AntEpiSeeker, epiACO and CINOEDV; and data mining approaches, including MDR, GSS, SNPRuler and MPI3SNP. Data were simulated to provide randomly generated models with no individual main effects at different heritabilities (pure epistasis) as well as models based on penetrance tables with some main effects (impure epistasis). Detection of both two and three locus interactions were assessed across a total of 1,560 simulated datasets. The different methods were also applied to a section of the UK biobank cohort for Atrial Fibrillation. Results For pure, two locus interactions, PLINK’s implementation of BOOST recovered the highest number of correct interactions, with 53.9% and significantly better performing than the other methods (p = 4.52e − 36). For impure two locus interactions, MDR exhibited the best performance, recovering 62.2% of the most significant impure epistatic interactions (p = 6.31e − 90 for all but one test). The assessment of three locus interaction prediction revealed that wtest recovered the highest number (17.2%) of pure epistatic interactions(p = 8.49e − 14). wtest also recovered the highest number of three locus impure epistatic interactions (p = 6.76e − 48) while AntEpiSeeker ranked as the most significant the highest number of such interactions (40.5%). Finally, when applied to a real dataset for Atrial Fibrillation, most notably finding an interaction between SYNE2 and DTNB.
Collapse
|
2
|
Malten J, König IR. Modified entropy-based procedure detects gene-gene-interactions in unconventional genetic models. BMC Med Genomics 2020; 13:65. [PMID: 32326960 PMCID: PMC7181579 DOI: 10.1186/s12920-020-0703-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/19/2019] [Accepted: 03/13/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Since it is assumed that genetic interactions play an important role in understanding the mechanisms of complex diseases, different statistical approaches have been suggested in recent years for this task. One interesting approach is the entropy-based IGENT method by Kwon et al. that promises an efficient detection of main effects and interaction effects simultaneously. However, a modification is required if the aim is to only detect interaction effects. METHODS Based on the IGENT method, we present a modification that leads to a conditional mutual information based approach under the condition of linkage equilibrium. The modified estimator is investigated in a comprehensive simulation based on five genetic interaction models and applied to real data from the genome-wide association study by the North American Rheumatoid Arthritis Consortium (NARAC). RESULTS The presented modification of IGENT controls the type I error in all simulated constellations. Furthermore, it provides high power for detecting pure interactions specifically on unconventional genetic models both in simulation and real data. CONCLUSIONS The proposed method uses the IGENT software, which is free available, simple and fast, and detects pure interactions on unconventional genetic models. Our results demonstrate that this modification is an attractive complement to established analysis methods.
Collapse
Affiliation(s)
- Jörg Malten
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany
| | - Inke R König
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany.
| |
Collapse
|
3
|
Vance E, Gonzalez Murcia JD, Miller JB, Staley L, Crane PK, Mukherjee S, Kauwe JSK. Failure to detect synergy between variants in transferrin and hemochromatosis and Alzheimer's disease in large cohort. Neurobiol Aging 2020; 89:142.e9-142.e12. [PMID: 32143980 DOI: 10.1016/j.neurobiolaging.2020.01.013] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2019] [Revised: 11/13/2019] [Accepted: 01/28/2020] [Indexed: 10/25/2022]
Abstract
Alzheimer's disease (AD) is the most common cause of dementia and, despite decades of effort, there is no effective treatment. In the last decade, many association studies have identified genetic markers that are associated with AD status. Two of these studies suggest that an epistatic interaction between variants rs1049296 in the transferrin (TF) gene and rs1800562 in the homeostatic iron regulator (HFE) gene, commonly known as hemochromatosis, is in genetic association with AD. TF and HFE are involved in the transport and regulation of iron in the brain, and disrupting these processes exacerbates AD pathology through increased neurodegeneration and oxidative stress. However, by using a significantly larger data set from the Alzheimer's Disease Genetics Consortium, we fail to detect an association between TF rs1049296 or HFE rs1800562 with AD risk (TF rs1049296 p = 0.38 and HFE rs1800562 p = 0.40). In addition, logistic regression with an interaction term and a synergy factor analysis both failed to detect epistasis between TF rs1049296 and HFE rs1800562 (SF = 0.94; p = 0.48) in AD cases. Each of these analyses had sufficient statistical power (power > 0.99), suggesting that previously reported associations may be the result of more complex epistatic interactions, genetic heterogeneity, or false-positive associations because of limited sample sizes.
Collapse
Affiliation(s)
- Elizabeth Vance
- Department of Biology, Brigham Young University, Provo, UT, USA
| | | | - Justin B Miller
- Department of Biology, Brigham Young University, Provo, UT, USA
| | | | - Lyndsay Staley
- Department of Biology, Brigham Young University, Provo, UT, USA
| | - Paul K Crane
- Department of Medicine, University of Washington, Seattle, WA, USA
| | | | - John S K Kauwe
- Department of Biology, Brigham Young University, Provo, UT, USA.
| |
Collapse
|
4
|
Moore JH, Olson RS, Schmitt P, Chen Y, Manduchi E. How Computational Experiments Can Improve Our Understanding of the Genetic Architecture of Common Human Diseases. ARTIFICIAL LIFE 2020; 26:23-37. [PMID: 32027528 DOI: 10.1162/artl_a_00308] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Susceptibility to common human diseases such as cancer is influenced by many genetic and environmental factors that work together in a complex manner. The state of the art is to perform a genome-wide association study (GWAS) that measures millions of single-nucleotide polymorphisms (SNPs) throughout the genome followed by a one-SNP-at-a-time statistical analysis to detect univariate associations. This approach has identified thousands of genetic risk factors for hundreds of diseases. However, the genetic risk factors detected have very small effect sizes and collectively explain very little of the overall heritability of the disease. Nonetheless, it is assumed that the genetic component of risk is due to many independent risk factors that contribute additively. The fact that many genetic risk factors with small effects can be detected is taken as evidence to support this notion. It is our working hypothesis that the genetic architecture of common diseases is partly driven by non-additive interactions. To test this hypothesis, we developed a heuristic simulation-based method for conducting experiments about the complexity of genetic architecture. We show that a genetic architecture driven by complex interactions is highly consistent with the magnitude and distribution of univariate effects seen in real data. We compare our results with measures of univariate and interaction effects from two large-scale GWASs of sporadic breast cancer and find evidence to support our hypothesis that is consistent with the results of our computational experiment.
Collapse
Affiliation(s)
- Jason H Moore
- University of Pennsylvania, Institute for Biomedical Informatics, Perelman School of Medicine.
| | - Randal S Olson
- University of Pennsylvania, Institute for Biomedical Informatics, Perelman School of Medicine
| | - Peter Schmitt
- University of Pennsylvania, Institute for Biomedical Informatics, Perelman School of Medicine
| | - Yong Chen
- University of Pennsylvania, Institute for Biomedical Informatics, Perelman School of Medicine
| | - Elisabetta Manduchi
- University of Pennsylvania, Institute for Biomedical Informatics, Perelman School of Medicine
| |
Collapse
|
5
|
Van Steen K, Moore JH. How to increase our belief in discovered statistical interactions via large-scale association studies? Hum Genet 2019; 138:293-305. [PMID: 30840129 PMCID: PMC6483943 DOI: 10.1007/s00439-019-01987-w] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2018] [Accepted: 02/20/2019] [Indexed: 12/31/2022]
Abstract
The understanding that differences in biological epistasis may impact disease risk, diagnosis, or disease management stands in wide contrast to the unavailability of widely accepted large-scale epistasis analysis protocols. Several choices in the analysis workflow will impact false-positive and false-negative rates. One of these choices relates to the exploitation of particular modelling or testing strategies. The strengths and limitations of these need to be well understood, as well as the contexts in which these hold. This will contribute to determining the potentially complementary value of epistasis detection workflows and is expected to increase replication success with biological relevance. In this contribution, we take a recently introduced regression-based epistasis detection tool as a leading example to review the key elements that need to be considered to fully appreciate the value of analytical epistasis detection performance assessments. We point out unresolved hurdles and give our perspectives towards overcoming these.
Collapse
Affiliation(s)
- K Van Steen
- WELBIO, GIGA-R Medical Genomics-BIO3, University of Liège, Liege, Belgium.
- Department of Human Genetics, University of Leuven, Leuven, Belgium.
| | - J H Moore
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, USA
| |
Collapse
|
6
|
Uppu S, Krishna A. A deep hybrid model to detect multi-locus interacting SNPs in the presence of noise. Int J Med Inform 2018; 119:134-151. [DOI: 10.1016/j.ijmedinf.2018.09.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2017] [Revised: 04/13/2018] [Accepted: 09/03/2018] [Indexed: 01/17/2023]
|
7
|
Urbanowicz RJ, Olson RS, Schmitt P, Meeker M, Moore JH. Benchmarking relief-based feature selection methods for bioinformatics data mining. J Biomed Inform 2018; 85:168-188. [PMID: 30030120 PMCID: PMC6299838 DOI: 10.1016/j.jbi.2018.07.015] [Citation(s) in RCA: 80] [Impact Index Per Article: 13.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2018] [Revised: 06/30/2018] [Accepted: 07/14/2018] [Indexed: 11/23/2022]
Abstract
Modern biomedical data mining requires feature selection methods that can (1) be applied to large scale feature spaces (e.g. 'omics' data), (2) function in noisy problems, (3) detect complex patterns of association (e.g. gene-gene interactions), (4) be flexibly adapted to various problem domains and data types (e.g. genetic variants, gene expression, and clinical data) and (5) are computationally tractable. To that end, this work examines a set of filter-style feature selection algorithms inspired by the 'Relief' algorithm, i.e. Relief-Based algorithms (RBAs). We implement and expand these RBAs in an open source framework called ReBATE (Relief-Based Algorithm Training Environment). We apply a comprehensive genetic simulation study comparing existing RBAs, a proposed RBA called MultiSURF, and other established feature selection methods, over a variety of problems. The results of this study (1) support the assertion that RBAs are particularly flexible, efficient, and powerful feature selection methods that differentiate relevant features having univariate, multivariate, epistatic, or heterogeneous associations, (2) confirm the efficacy of expansions for classification vs. regression, discrete vs. continuous features, missing data, multiple classes, or class imbalance, (3) identify previously unknown limitations of specific RBAs, and (4) suggest that while MultiSURF∗ performs best for explicitly identifying pure 2-way interactions, MultiSURF yields the most reliable feature selection performance across a wide range of problem types.
Collapse
Affiliation(s)
- Ryan J Urbanowicz
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA.
| | - Randal S Olson
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA.
| | - Peter Schmitt
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA.
| | | | - Jason H Moore
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA.
| |
Collapse
|
8
|
Verma SS, Lucas A, Zhang X, Veturi Y, Dudek S, Li B, Li R, Urbanowicz R, Moore JH, Kim D, Ritchie MD. Collective feature selection to identify crucial epistatic variants. BioData Min 2018; 11:5. [PMID: 29713383 PMCID: PMC5907720 DOI: 10.1186/s13040-018-0168-6] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2017] [Accepted: 04/04/2018] [Indexed: 01/17/2023] Open
Abstract
Background Machine learning methods have gained popularity and practicality in identifying linear and non-linear effects of variants associated with complex disease/traits. Detection of epistatic interactions still remains a challenge due to the large number of features and relatively small sample size as input, thus leading to the so-called "short fat data" problem. The efficiency of machine learning methods can be increased by limiting the number of input features. Thus, it is very important to perform variable selection before searching for epistasis. Many methods have been evaluated and proposed to perform feature selection, but no single method works best in all scenarios. We demonstrate this by conducting two separate simulation analyses to evaluate the proposed collective feature selection approach. Results Through our simulation study we propose a collective feature selection approach to select features that are in the "union" of the best performing methods. We explored various parametric, non-parametric, and data mining approaches to perform feature selection. We choose our top performing methods to select the union of the resulting variables based on a user-defined percentage of variants selected from each method to take to downstream analysis. Our simulation analysis shows that non-parametric data mining approaches, such as MDR, may work best under one simulation criteria for the high effect size (penetrance) datasets, while non-parametric methods designed for feature selection, such as Ranger and Gradient boosting, work best under other simulation criteria. Thus, using a collective approach proves to be more beneficial for selecting variables with epistatic effects also in low effect size datasets and different genetic architectures. Following this, we applied our proposed collective feature selection approach to select the top 1% of variables to identify potential interacting variables associated with Body Mass Index (BMI) in ~ 44,000 samples obtained from Geisinger's MyCode Community Health Initiative (on behalf of DiscovEHR collaboration). Conclusions In this study, we were able to show that selecting variables using a collective feature selection approach could help in selecting true positive epistatic variables more frequently than applying any single method for feature selection via simulation studies. We were able to demonstrate the effectiveness of collective feature selection along with a comparison of many methods in our simulation analysis. We also applied our method to identify non-linear networks associated with obesity.
Collapse
Affiliation(s)
- Shefali S Verma
- 1Biomedical and Translational Bioinformatics Institute, Geisinger Health System, 100 N Academy Avenue, Danville, PA 17822 USA.,2Huck Institute of Life Sciences, The Pennsylvania State University, University Park, PA USA.,3Institute for Biomedical Informatics, University of Pennsylvania, Perelman School of Medicine, Richards Building, 3700 Hamilton Walk, Philadelphia, PA 19104 USA
| | - Anastasia Lucas
- 1Biomedical and Translational Bioinformatics Institute, Geisinger Health System, 100 N Academy Avenue, Danville, PA 17822 USA.,3Institute for Biomedical Informatics, University of Pennsylvania, Perelman School of Medicine, Richards Building, 3700 Hamilton Walk, Philadelphia, PA 19104 USA
| | - Xinyuan Zhang
- 2Huck Institute of Life Sciences, The Pennsylvania State University, University Park, PA USA.,3Institute for Biomedical Informatics, University of Pennsylvania, Perelman School of Medicine, Richards Building, 3700 Hamilton Walk, Philadelphia, PA 19104 USA
| | - Yogasudha Veturi
- 1Biomedical and Translational Bioinformatics Institute, Geisinger Health System, 100 N Academy Avenue, Danville, PA 17822 USA.,3Institute for Biomedical Informatics, University of Pennsylvania, Perelman School of Medicine, Richards Building, 3700 Hamilton Walk, Philadelphia, PA 19104 USA
| | - Scott Dudek
- 1Biomedical and Translational Bioinformatics Institute, Geisinger Health System, 100 N Academy Avenue, Danville, PA 17822 USA.,3Institute for Biomedical Informatics, University of Pennsylvania, Perelman School of Medicine, Richards Building, 3700 Hamilton Walk, Philadelphia, PA 19104 USA
| | - Binglan Li
- 2Huck Institute of Life Sciences, The Pennsylvania State University, University Park, PA USA.,3Institute for Biomedical Informatics, University of Pennsylvania, Perelman School of Medicine, Richards Building, 3700 Hamilton Walk, Philadelphia, PA 19104 USA
| | - Ruowang Li
- 3Institute for Biomedical Informatics, University of Pennsylvania, Perelman School of Medicine, Richards Building, 3700 Hamilton Walk, Philadelphia, PA 19104 USA
| | - Ryan Urbanowicz
- 3Institute for Biomedical Informatics, University of Pennsylvania, Perelman School of Medicine, Richards Building, 3700 Hamilton Walk, Philadelphia, PA 19104 USA
| | - Jason H Moore
- 3Institute for Biomedical Informatics, University of Pennsylvania, Perelman School of Medicine, Richards Building, 3700 Hamilton Walk, Philadelphia, PA 19104 USA
| | - Dokyoon Kim
- 1Biomedical and Translational Bioinformatics Institute, Geisinger Health System, 100 N Academy Avenue, Danville, PA 17822 USA
| | - Marylyn D Ritchie
- 1Biomedical and Translational Bioinformatics Institute, Geisinger Health System, 100 N Academy Avenue, Danville, PA 17822 USA.,2Huck Institute of Life Sciences, The Pennsylvania State University, University Park, PA USA.,3Institute for Biomedical Informatics, University of Pennsylvania, Perelman School of Medicine, Richards Building, 3700 Hamilton Walk, Philadelphia, PA 19104 USA
| |
Collapse
|
9
|
Olson RS, La Cava W, Orzechowski P, Urbanowicz RJ, Moore JH. PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Min 2017; 10:36. [PMID: 29238404 PMCID: PMC5725843 DOI: 10.1186/s13040-017-0154-4] [Citation(s) in RCA: 67] [Impact Index Per Article: 9.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2017] [Accepted: 11/07/2017] [Indexed: 11/10/2022] Open
Abstract
Background The selection, development, or comparison of machine learning methods in data mining can be a difficult task based on the target problem and goals of a particular study. Numerous publicly available real-world and simulated benchmark datasets have emerged from different sources, but their organization and adoption as standards have been inconsistent. As such, selecting and curating specific benchmarks remains an unnecessary burden on machine learning practitioners and data scientists. Results The present study introduces an accessible, curated, and developing public benchmark resource to facilitate identification of the strengths and weaknesses of different machine learning methodologies. We compare meta-features among the current set of benchmark datasets in this resource to characterize the diversity of available data. Finally, we apply a number of established machine learning methods to the entire benchmark suite and analyze how datasets and algorithms cluster in terms of performance. From this study, we find that existing benchmarks lack the diversity to properly benchmark machine learning algorithms, and there are several gaps in benchmarking problems that still need to be considered. Conclusions This work represents another important step towards understanding the limitations of popular benchmarking suites and developing a resource that connects existing benchmarking standards to more diverse and efficient standards in the future.
Collapse
Affiliation(s)
- Randal S Olson
- Institute for Biomedical Informatics, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, 19104 PA USA
| | - William La Cava
- Institute for Biomedical Informatics, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, 19104 PA USA
| | - Patryk Orzechowski
- Institute for Biomedical Informatics, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, 19104 PA USA.,Department of Automatics and Biomedical Engineering, AGH University of Science and Technology, Kraków, Poland
| | - Ryan J Urbanowicz
- Institute for Biomedical Informatics, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, 19104 PA USA
| | - Jason H Moore
- Institute for Biomedical Informatics, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, 19104 PA USA
| |
Collapse
|
10
|
Moore JH, Andrews PC, Olson RS, Carlson SE, Larock CR, Bulhoes MJ, O'Connor JP, Greytak EM, Armentrout SL. Grid-based stochastic search for hierarchical gene-gene interactions in population-based genetic studies of common human diseases. BioData Min 2017; 10:19. [PMID: 28572842 PMCID: PMC5450417 DOI: 10.1186/s13040-017-0139-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2016] [Accepted: 05/18/2017] [Indexed: 11/18/2022] Open
Abstract
Background Large-scale genetic studies of common human diseases have focused almost exclusively on the independent main effects of single-nucleotide polymorphisms (SNPs) on disease susceptibility. These studies have had some success, but much of the genetic architecture of common disease remains unexplained. Attention is now turning to detecting SNPs that impact disease susceptibility in the context of other genetic factors and environmental exposures. These context-dependent genetic effects can manifest themselves as non-additive interactions, which are more challenging to model using parametric statistical approaches. The dimensionality that results from a multitude of genotype combinations, which results from considering many SNPs simultaneously, renders these approaches underpowered. We previously developed the multifactor dimensionality reduction (MDR) approach as a nonparametric and genetic model-free machine learning alternative. Approaches such as MDR can improve the power to detect gene-gene interactions but are limited in their ability to exhaustively consider SNP combinations in genome-wide association studies (GWAS), due to the combinatorial explosion of the search space. We introduce here a stochastic search algorithm called Crush for the application of MDR to modeling high-order gene-gene interactions in genome-wide data. The Crush-MDR approach uses expert knowledge to guide probabilistic searches within a framework that capitalizes on the use of biological knowledge to filter gene sets prior to analysis. Here we evaluated the ability of Crush-MDR to detect hierarchical sets of interacting SNPs using a biology-based simulation strategy that assumes non-additive interactions within genes and additivity in genetic effects between sets of genes within a biochemical pathway. Results We show that Crush-MDR is able to identify genetic effects at the gene or pathway level significantly better than a baseline random search with the same number of model evaluations. We then applied the same methodology to a GWAS for Alzheimer’s disease and showed base level validation that Crush-MDR was able to identify a set of interacting genes with biological ties to Alzheimer’s disease. Conclusions We discuss the role of stochastic search and cloud computing for detecting complex genetic effects in genome-wide data.
Collapse
Affiliation(s)
- Jason H Moore
- Department of Biostatistics and Epidemiology, Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, 19104 PA USA
| | - Peter C Andrews
- Department of Biostatistics and Epidemiology, Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, 19104 PA USA
| | - Randal S Olson
- Department of Biostatistics and Epidemiology, Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, 19104 PA USA
| | | | | | | | | | | | | |
Collapse
|
11
|
Li J, Malley JD, Andrew AS, Karagas MR, Moore JH. Detecting gene-gene interactions using a permutation-based random forest method. BioData Min 2016; 9:14. [PMID: 27053949 PMCID: PMC4822295 DOI: 10.1186/s13040-016-0093-5] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2015] [Accepted: 03/30/2016] [Indexed: 12/05/2022] Open
Abstract
Background Identifying gene-gene interactions is essential to understand disease susceptibility and to detect genetic architectures underlying complex diseases. Here, we aimed at developing a permutation-based methodology relying on a machine learning method, random forest (RF), to detect gene-gene interactions. Our approach called permuted random forest (pRF) which identified the top interacting single nucleotide polymorphism (SNP) pairs by estimating how much the power of a random forest classification model is influenced by removing pairwise interactions. Results We systematically tested our approach on a simulation study with datasets possessing various genetic constraints including heritability, number of SNPs, sample size, etc. Our methodology showed high success rates for detecting the interaction SNP pair. We also applied our approach to two bladder cancer datasets, which showed consistent results with well-studied methodologies, such as multifactor dimensionality reduction (MDR) and statistical epistasis network (SEN). Furthermore, we built permuted random forest networks (PRFN), in which we used nodes to represent SNPs and edges to indicate interactions. Conclusions We successfully developed a scale-invariant methodology to detect pure gene-gene interactions based on permutation strategies and the machine learning method random forest. This methodology showed great potential to be used for detecting gene-gene interactions to study underlying genetic architectures in a scale-free way, which could be benefit to uncover the complex disease mechanisms.
Collapse
Affiliation(s)
- Jing Li
- Department of Genetics, Geisel School of Medicine, Dartmouth College, Hanover, NH USA
| | - James D Malley
- Division of Computational Bioscience, Center for Information Technology, National Institutes of Health, Bethesda, MD USA
| | - Angeline S Andrew
- Department of Epidemiology, Geisel School of Medicine, Dartmouth College, Hanover, NH USA
| | - Margaret R Karagas
- Department of Epidemiology, Geisel School of Medicine, Dartmouth College, Hanover, NH USA
| | - Jason H Moore
- Institute for Biomedical Informatics, University of Pennsylvania, Pennsylvania, PA USA ; Department of Biostatistics and Epidemiology, The Perelman School of Medicine, University of Pennsylvania, Pennsylvania, PA USA
| |
Collapse
|
12
|
Urbanowicz RJ, Moore JH. ExSTraCS 2.0: Description and Evaluation of a Scalable Learning Classifier System. EVOLUTIONARY INTELLIGENCE 2015; 8:89-116. [PMID: 26417393 PMCID: PMC4583133 DOI: 10.1007/s12065-015-0128-8] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
Abstract
Algorithmic scalability is a major concern for any machine learning strategy in this age of 'big data'. A large number of potentially predictive attributes is emblematic of problems in bioinformatics, genetic epidemiology, and many other fields. Previously, ExS-TraCS was introduced as an extended Michigan-style supervised learning classifier system that combined a set of powerful heuristics to successfully tackle the challenges of classification, prediction, and knowledge discovery in complex, noisy, and heterogeneous problem domains. While Michigan-style learning classifier systems are powerful and flexible learners, they are not considered to be particularly scalable. For the first time, this paper presents a complete description of the ExS-TraCS algorithm and introduces an effective strategy to dramatically improve learning classifier system scalability. ExSTraCS 2.0 addresses scalability with (1) a rule specificity limit, (2) new approaches to expert knowledge guided covering and mutation mechanisms, and (3) the implementation and utilization of the TuRF algorithm for improving the quality of expert knowledge discovery in larger datasets. Performance over a complex spectrum of simulated genetic datasets demonstrated that these new mechanisms dramatically improve nearly every performance metric on datasets with 20 attributes and made it possible for ExSTraCS to reliably scale up to perform on related 200 and 2000-attribute datasets. ExSTraCS 2.0 was also able to reliably solve the 6, 11, 20, 37, 70, and 135 multiplexer problems, and did so in similar or fewer learning iterations than previously reported, with smaller finite training sets, and without using building blocks discovered from simpler multiplexer problems. Furthermore, ExS-TraCS usability was made simpler through the elimination of previously critical run parameters.
Collapse
Affiliation(s)
- Ryan J. Urbanowicz
- Geisel School of Medicine, 1 Medical Center Dr., Lebanon NH, 03756, USA, Tel.: +603-653-6017, Fax: +603-653-9952
| | - Jason H. Moore
- Geisel School of Medicine, 1 Medical Center Dr., Lebanon NH, 03756, USA, Tel.: +603-653-6017, Fax: +603-653-9952
| |
Collapse
|
13
|
Abstract
Here we introduce the ReliefF machine learning algorithm and some of its extensions for detecting and characterizing epistasis in genetic association studies. We provide a general overview of the method and then highlight some of the modifications that have greatly improved its power for genetic analysis. We end with a few examples of published studies of complex human diseases that have used ReliefF.
Collapse
|
14
|
A classification and characterization of two-locus, pure, strict, epistatic models for simulation and detection. BioData Min 2014; 7:8. [PMID: 25057293 PMCID: PMC4094921 DOI: 10.1186/1756-0381-7-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2013] [Accepted: 05/23/2014] [Indexed: 11/13/2022] Open
Abstract
Background The statistical genetics phenomenon of epistasis is widely acknowledged to confound disease etiology. In order to evaluate strategies for detecting these complex multi-locus disease associations, simulation studies are required. The development of the GAMETES software for the generation of complex genetic models, has provided the means to randomly generate an architecturally diverse population of epistatic models that are both pure and strict, i.e. all n loci, but no fewer, are predictive of phenotype. Previous theoretical work characterizing complex genetic models has yet to examine pure, strict, epistasis which should be the most challenging to detect. This study addresses three goals: (1) Classify and characterize pure, strict, two-locus epistatic models, (2) Investigate the effect of model ‘architecture’ on detection difficulty, and (3) Explore how adjusting GAMETES constraints influences diversity in the generated models. Results In this study we utilized a geometric approach to classify pure, strict, two-locus epistatic models by “shape”. In total, 33 unique shape symmetry classes were identified. Using a detection difficulty metric, we found that model shape was consistently a significant predictor of model detection difficulty. Additionally, after categorizing shape classes by the number of edges in their shape projections, we found that this edge number was also significantly predictive of detection difficulty. Analysis of constraints within GAMETES indicated that increasing model population size can expand model class coverage but does little to change the range of observed difficulty metric scores. A variable population prevalence significantly increased the range of observed difficulty metric scores and, for certain constraints, also improved model class coverage. Conclusions These analyses further our theoretical understanding of epistatic relationships and uncover guidelines for the effective generation of complex models using GAMETES. Specifically, (1) we have characterized 33 shape classes by edge number, detection difficulty, and observed frequency (2) our results support the claim that model architecture directly influences detection difficulty, and (3) we found that GAMETES will generate a maximally diverse set of models with a variable population prevalence and a larger model population size. However, a model population size as small as 1,000 is likely to be sufficient.
Collapse
|
15
|
Rudd J, Moore JH, Urbanowicz RJ. A Multi-Core Parallelization Strategy for Statistical Significance Testing in Learning Classifier Systems. EVOLUTIONARY INTELLIGENCE 2013; 6. [PMID: 24358057 DOI: 10.1007/s12065-013-0092-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
Abstract
Permutation-based statistics for evaluating the significance of class prediction, predictive attributes, and patterns of association have only appeared within the learning classifier system (LCS) literature since 2012. While still not widely utilized by the LCS research community, formal evaluations of test statistic confidence are imperative to large and complex real world applications such as genetic epidemiology where it is standard practice to quantify the likelihood that a seemingly meaningful statistic could have been obtained purely by chance. LCS algorithms are relatively computationally expensive on their own. The compounding requirements for generating permutation-based statistics may be a limiting factor for some researchers interested in applying LCS algorithms to real world problems. Technology has made LCS parallelization strategies more accessible and thus more popular in recent years. In the present study we examine the benefits of externally parallelizing a series of independent LCS runs such that permutation testing with cross validation becomes more feasible to complete on a single multi-core workstation. We test our python implementation of this strategy in the context of a simulated complex genetic epidemiological data mining problem. Our evaluations indicate that as long as the number of concurrent processes does not exceed the number of CPU cores, the speedup achieved is approximately linear.
Collapse
Affiliation(s)
- James Rudd
- Dartmouth College, 1 Medical Center Dr., Lebanon, NH 03755,USA,
| | - Jason H Moore
- Dartmouth College, 1 Medical Center Dr., Lebanon, NH 03755,USA,
| | | |
Collapse
|
16
|
Hu T, Chen Y, Kiralis JW, Collins RL, Wejse C, Sirugo G, Williams SM, Moore JH. An information-gain approach to detecting three-way epistatic interactions in genetic association studies. J Am Med Inform Assoc 2013; 20:630-6. [PMID: 23396514 PMCID: PMC3721169 DOI: 10.1136/amiajnl-2012-001525] [Citation(s) in RCA: 51] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023] Open
Abstract
Background Epistasis has been historically used to describe the phenomenon that the effect of a given gene on a phenotype can be dependent on one or more other genes, and is an essential element for understanding the association between genetic and phenotypic variations. Quantifying epistasis of orders higher than two is very challenging due to both the computational complexity of enumerating all possible combinations in genome-wide data and the lack of efficient and effective methodologies. Objectives In this study, we propose a fast, non-parametric, and model-free measure for three-way epistasis. Methods Such a measure is based on information gain, and is able to separate all lower order effects from pure three-way epistasis. Results Our method was verified on synthetic data and applied to real data from a candidate-gene study of tuberculosis in a West African population. In the tuberculosis data, we found a statistically significant pure three-way epistatic interaction effect that was stronger than any lower-order associations. Conclusion Our study provides a methodological basis for detecting and characterizing high-order gene-gene interactions in genetic association studies.
Collapse
Affiliation(s)
- Ting Hu
- Computational Genetics Laboratory, Geisel School of Medicine, Dartmouth College, Hanover, New Hampshire, USA
| | | | | | | | | | | | | | | |
Collapse
|