1
Yaldız B, Erdoğan O, Rafatov S, Iyigün C, Aydın Son Y. Revealing third-order interactions through the integration of machine learning and entropy methods in genomic studies. BioData Min 2024; 17:3. [PMID: 38291454 PMCID: PMC10826120 DOI: 10.1186/s13040-024-00355-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Accepted: 01/16/2024] [Indexed: 02/01/2024] Open
Abstract
BACKGROUND Non-linear relationships at the genotype level are essential to understanding the genetic interactions underlying complex disease traits. Genome-Wide Association Studies (GWAS) have revealed statistical associations between SNPs and many complex diseases. Because GWAS results alone cannot fully explain the genetic background of these disorders, Genome-Wide Interaction Studies have started to gain importance. In recent years, various statistical approaches, such as entropy-based methods, have been suggested for revealing these non-additive interactions between variants. This study presents a novel prioritization workflow that integrates two-step Random Forest (RF) modeling and entropy analysis after PLINK filtering. The PLINK-RF-RF workflow is followed by an entropy-based 3-way interaction information (3WII) method to capture the hidden patterns resulting from non-linear relationships between genotypes in Late-Onset Alzheimer Disease (LOAD), with the aim of discovering early and differential diagnosis markers. RESULTS Three models from different datasets are developed by integrating PLINK-RF-RF analysis with the entropy-based 3WII calculation method, which enables the detection of third-order interactions that are not usually considered in epistatic interaction studies. For all three datasets, a reduced SNP set is selected by 3WII analysis after PLINK filtering and SNP prioritization with RF-RF modeling, which makes the approach promising for model minimization. Among the SNPs revealed by 3WII, 4 of 19 from GenADA, 1 of 27 from ADNI, and 4 of 106 from NCRAD map to genes directly associated with Alzheimer Disease. Additionally, several SNPs are associated with other neurological disorders. The genes to which the variants map in all datasets are significantly enriched in the calcium ion binding, extracellular matrix, external encapsulating structure, and "RUNX1 regulates estrogen receptor-mediated transcription" pathways. These functional pathways are therefore proposed for further examination for a possible LOAD association, and all 3WII variants are proposed as candidate biomarkers for genotyping-based LOAD diagnosis. CONCLUSION The entropy approach used in this study reveals complex genetic interactions that contribute significantly to LOAD risk. We used the entropy-based 3WII as a model minimization step and determined the significant 3-way interactions between the SNPs prioritized by PLINK-RF-RF. This framework is a promising approach for disease association studies and can be modified by integrating other machine learning and entropy-based interaction methods.
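To make the entropy step in the abstract above more concrete, the following is a minimal sketch of a three-way interaction information calculation from joint Shannon entropies. It is not the authors' implementation: the function names are hypothetical, and the sign convention (I(A;B;C) = I(A;B|C) - I(A;B), so that positive values indicate synergy) is an assumption, since the abstract does not state which convention the paper uses.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Shannon entropy (in bits) of the joint distribution of the given columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum((c / n) * log2(c / n) for c in Counter(joint).values())


def three_way_interaction_information(a, b, c):
    """Interaction information I(A;B;C) = I(A;B|C) - I(A;B).

    Assumed convention: positive values mean the three variables carry
    synergistic information; negative values mean redundancy.
    """
    return (
        -entropy(a) - entropy(b) - entropy(c)
        + entropy(a, b) + entropy(a, c) + entropy(b, c)
        - entropy(a, b, c)
    )


# Toy data (hypothetical): c is the XOR of a and b, a purely synergistic pattern.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
c = [0, 1, 1, 0]
xor_score = three_way_interaction_information(a, b, c)

# Fully redundant case: all three variables are identical.
redundant_score = three_way_interaction_information(a, a, a)
```

In the XOR toy case no single variable or pair is informative on its own, which is the kind of non-additive pattern an interaction-information score is meant to surface.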
Affiliation(s)
- Burcu Yaldız
  - Department of Health Informatics, Graduate School of Informatics, METU, Ankara, Turkey
- Onur Erdoğan
  - Department of Health Informatics, Graduate School of Informatics, METU, Ankara, Turkey
- Sevda Rafatov
  - Department of Health Informatics, Graduate School of Informatics, METU, Ankara, Turkey
- Cem Iyigün
  - Department of Industrial Engineering, METU, Ankara, Turkey
- Yeşim Aydın Son
  - Department of Health Informatics, Graduate School of Informatics, METU, Ankara, Turkey
  - Graduate School of Informatics, ODTU-NOROM, METU, Ankara, Turkey
2
Cheng L, Zhu M. Compositional epistasis detection using a few prototype disease models. PLoS One 2019; 14:e0213236. [PMID: 30917131 PMCID: PMC6436689 DOI: 10.1371/journal.pone.0213236] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/02/2017] [Accepted: 02/19/2019] [Indexed: 12/31/2022] Open
Abstract
We study computational approaches for detecting SNP-SNP interactions that are characterized by a set of "two-locus, two-allele, two-phenotype and complete-penetrance" disease models. We argue that existing methods, which use data to determine a best-fitting disease model for each pair of SNPs prior to screening, may be too greedy. We present a less greedy strategy which, for each given pair of SNPs, limits the number of candidate disease models to a set of prototypes determined a priori.
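As a rough illustration of why fitting a best model per SNP pair can be greedy, the sketch below enumerates the full space of "two-locus, two-allele, two-phenotype and complete-penetrance" models. It assumes the usual reading of these terms: each biallelic locus has three genotypes, giving a 3 x 3 genotype table, and complete penetrance means each cell is deterministically labeled 0 (unaffected) or 1 (affected). The genotype labels and function name are hypothetical, and the authors' actual prototype set is not reproduced here.

```python
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")  # three genotypes at a biallelic locus (assumed labels)


def all_complete_penetrance_models():
    """Yield every 0/1 labeling of the 3 x 3 two-locus genotype table."""
    cells = list(product(GENOTYPES, repeat=2))  # 9 two-locus genotype combinations
    for labels in product((0, 1), repeat=len(cells)):
        yield dict(zip(cells, labels))


models = list(all_complete_penetrance_models())
n_models = len(models)  # 2 ** 9 candidate models per SNP pair
```

Choosing the best of all these candidates separately for every SNP pair gives the data many chances to fit noise, which is the motivation for restricting the search to a small set of prototypes fixed in advance.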
Affiliation(s)
- Lu Cheng
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
- Mu Zhu
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
3
Verma SS, Lucas A, Zhang X, Veturi Y, Dudek S, Li B, Li R, Urbanowicz R, Moore JH, Kim D, Ritchie MD. Collective feature selection to identify crucial epistatic variants. BioData Min 2018; 11:5. [PMID: 29713383 PMCID: PMC5907720 DOI: 10.1186/s13040-018-0168-6] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 11/29/2017] [Accepted: 04/04/2018] [Indexed: 01/17/2023] Open
Abstract
Background Machine learning methods have gained popularity and practicality in identifying linear and non-linear effects of variants associated with complex diseases and traits. Detecting epistatic interactions remains a challenge because the input has a large number of features and a relatively small sample size, leading to the so-called "short fat data" problem. The efficiency of machine learning methods can be increased by limiting the number of input features, so it is very important to perform variable selection before searching for epistasis. Many methods have been evaluated and proposed for feature selection, but no single method works best in all scenarios. We demonstrate this by conducting two separate simulation analyses to evaluate the proposed collective feature selection approach. Results Through our simulation study we propose a collective feature selection approach that selects features in the "union" of the best-performing methods. We explored various parametric, non-parametric, and data mining approaches to feature selection. We chose our top-performing methods and took the union of their selected variables, based on a user-defined percentage of variants from each method, forward to downstream analysis. Our simulation analysis shows that non-parametric data mining approaches, such as MDR, may work best under one simulation criterion for the high effect size (penetrance) datasets, while non-parametric methods designed for feature selection, such as Ranger and Gradient boosting, work best under other simulation criteria. Thus, a collective approach proves more beneficial for selecting variables with epistatic effects, including in low effect size datasets and across different genetic architectures. Following this, we applied our proposed collective feature selection approach to select the top 1% of variables and identify potential interacting variables associated with Body Mass Index (BMI) in ~44,000 samples obtained from Geisinger's MyCode Community Health Initiative (on behalf of the DiscovEHR collaboration). Conclusions In this study, we showed via simulation studies that selecting variables with a collective feature selection approach could help select true positive epistatic variables more frequently than applying any single feature selection method. We demonstrated the effectiveness of collective feature selection, along with a comparison of many methods, in our simulation analysis. We also applied our method to identify non-linear networks associated with obesity.
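The "union of the top percentage from each method" idea described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: each method is assumed to return one importance score per variant (higher is better), and the method names, variant names, and scores below are invented for the example.

```python
from math import ceil


def top_fraction(scores, fraction):
    """Return the names of the top `fraction` of variants by score (at least one)."""
    k = max(1, ceil(fraction * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def collective_selection(scores_by_method, fraction):
    """Union of the top-`fraction` variants chosen independently by each method."""
    selected = set()
    for scores in scores_by_method.values():
        selected |= top_fraction(scores, fraction)
    return selected


# Hypothetical importance scores from two feature-selection methods.
scores_by_method = {
    "method_a": {"snp1": 0.9, "snp2": 0.1, "snp3": 0.5, "snp4": 0.2},
    "method_b": {"snp1": 0.2, "snp2": 0.8, "snp3": 0.3, "snp4": 0.1},
}
selected = collective_selection(scores_by_method, fraction=0.25)
```

Taking the union rather than the intersection means a variant only has to rank highly under one method to survive, which fits the abstract's observation that different methods do best under different simulation settings.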
Affiliation(s)
- Shefali S Verma (1, 2, 3)
- Anastasia Lucas (1, 3)
- Xinyuan Zhang (2, 3)
- Yogasudha Veturi (1, 3)
- Scott Dudek (1, 3)
- Binglan Li (2, 3)
- Ruowang Li (3)
- Ryan Urbanowicz (3)
- Jason H Moore (3)
- Dokyoon Kim (1)
- Marylyn D Ritchie (1, 2, 3)

1. Biomedical and Translational Bioinformatics Institute, Geisinger Health System, 100 N Academy Avenue, Danville, PA 17822, USA
2. Huck Institute of Life Sciences, The Pennsylvania State University, University Park, PA, USA
3. Institute for Biomedical Informatics, University of Pennsylvania, Perelman School of Medicine, Richards Building, 3700 Hamilton Walk, Philadelphia, PA 19104, USA
4
Mei C, Hou M, Guo S, Hua F, Zheng D, Xu F, Jiang Y, Li L, Qiao Y, Fan Y, Zhou Q. Polymorphisms in DNA repair genes of XRCC1, XPA, XPC, XPD and associations with lung cancer risk in Chinese people. Thorac Cancer 2014; 5:232-42. [PMID: 26767006 DOI: 10.1111/1759-7714.12073] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 08/28/2013] [Accepted: 09/01/2013] [Indexed: 02/05/2023] Open
Abstract
BACKGROUND The carcinogenic chemicals and reactive oxygen species in tobacco can cause DNA damage. DNA repair genes play an important role in maintaining genome integrity. Genetic polymorphisms of DNA repair genes and smoking may contribute to susceptibility to lung cancer. METHODS In this hospital-based case-control study, we investigated the relationship between 13 tagging single nucleotide polymorphisms (SNPs) in base excision repair and nucleotide excision repair pathway genes, smoking, and lung cancer susceptibility. The 13 tag SNPs were genotyped in 265 lung cancer patients and 301 healthy controls. Logistic regression and the multifactor dimensionality reduction method were applied to explore associations and high-order gene-gene and gene-smoking interactions. RESULTS In single tag SNP analysis, XPA rs2808668, XPC rs2733533, and XPD rs1799787 were significantly associated with lung cancer susceptibility. Joint effects analysis of XPA rs2808668, XPC rs2733533, and XPD rs1799787 showed that the risk of lung cancer increased with the number of risk alleles. Haplotype analysis showed that the XRCC1 (rs25487, rs1799782, rs3213334) GCC haplotype had a positive association with lung cancer. Analysis of gene-gene and gene-smoking interactions by multifactor dimensionality reduction showed a positive interaction between the four genes and smoking. The two-factor model, including XPC rs2733533 and smoking, had the best prediction ability for lung cancer. Compared with non-smokers carrying the C/C genotype of XPC rs2733533, heavy smokers (≥30 pack-years) carrying the A allele of XPC rs2733533 had a 13.32-fold risk of lung cancer. CONCLUSION Our results suggest that multiple genetic variants in multiple DNA repair genes may jointly contribute to lung cancer risk through gene-gene and gene-smoking interactions.
Affiliation(s)
- Chaorong Mei
  - Tianjin Key Laboratory of Lung Cancer Metastasis and Tumor Microenvironment, Tianjin Lung Cancer Institute, Tianjin Medical University General Hospital, Tianjin, China
  - Tibet Chengdu branch of West China Hospital, Sichuan University, Changdu, China
- Mei Hou
  - Cancer Center, West China Hospital, Sichuan University, Chengdu, China
- Shanxian Guo
  - Tianjin Key Laboratory of Lung Cancer Metastasis and Tumor Microenvironment, Tianjin Lung Cancer Institute, Tianjin Medical University General Hospital, Tianjin, China
- Feng Hua
  - Tianjin Key Laboratory of Lung Cancer Metastasis and Tumor Microenvironment, Tianjin Lung Cancer Institute, Tianjin Medical University General Hospital, Tianjin, China
- Dejie Zheng
  - Tianjin Key Laboratory of Lung Cancer Metastasis and Tumor Microenvironment, Tianjin Lung Cancer Institute, Tianjin Medical University General Hospital, Tianjin, China
- Feng Xu
  - Cancer Center, West China Hospital, Sichuan University, Chengdu, China
- Yong Jiang
  - Department of Cancer Epidemiology, Cancer Institute, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Lu Li
  - Cancer Center, West China Hospital, Sichuan University, Chengdu, China
- Youlin Qiao
  - Department of Cancer Epidemiology, Cancer Institute, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Yaguang Fan
  - Tianjin Key Laboratory of Lung Cancer Metastasis and Tumor Microenvironment, Tianjin Lung Cancer Institute, Tianjin Medical University General Hospital, Tianjin, China
- Qinghua Zhou
  - Tianjin Key Laboratory of Lung Cancer Metastasis and Tumor Microenvironment, Tianjin Lung Cancer Institute, Tianjin Medical University General Hospital, Tianjin, China
5
Chen GK, Guo Y. Discovering epistasis in large scale genetic association studies by exploiting graphics cards. Front Genet 2013; 4:266. [PMID: 24348518 PMCID: PMC3848199 DOI: 10.3389/fgene.2013.00266] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 09/27/2013] [Accepted: 11/16/2013] [Indexed: 11/13/2022] Open
Abstract
Despite the enormous investments made in collecting DNA samples and generating germline variation data across thousands of individuals in modern genome-wide association studies (GWAS), progress has been frustratingly slow in explaining much of the heritability in common disease. Today's paradigm of testing independent hypotheses on each single nucleotide polymorphism (SNP) marker is unlikely to adequately reflect the complex biological processes in disease risk. Alternatively, modeling risk as an ensemble of SNPs that act in concert in a pathway, and/or interact non-additively on log risk for example, may be a more sensible way to approach gene mapping in modern studies. Implementing such analyses genome-wide can quickly become intractable because even modest-size SNP panels on modern genotype arrays (500k markers) pose a combinatorial nightmare, requiring tens of billions of models to be tested for evidence of interaction. In this article, we provide an in-depth analysis of programs that have been developed to explicitly overcome these enormous computational barriers through the use of processors on graphics cards known as Graphics Processing Units (GPUs). We include tutorials on GPU technology, which convey why GPUs are growing in appeal with today's numerical scientists. One obvious advantage is the impressive density of microprocessor cores available on a single GPU. Whereas high-end servers feature up to 24 Intel or AMD CPU cores, the latest GPU offerings from nVidia feature over 2600 cores. Each compute node may be outfitted with up to 4 GPU devices. Success on GPUs varies across problems. However, epistasis screens fare well due to the high degree of parallelism exposed in these problems. Papers that we review routinely report GPU speedups of over two orders of magnitude (>100x) over standard CPU implementations.
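The combinatorial burden mentioned in the abstract can be checked with a short calculation. The sketch below counts unordered SNP combinations for an exhaustive, unfiltered screen; the function name is hypothetical, and the count assumes one model per unordered SNP pair, which may differ from how the authors tally models.

```python
from math import comb


def n_interaction_models(n_snps, order=2):
    """Number of unordered SNP combinations of the given order."""
    return comb(n_snps, order)


pairs = n_interaction_models(500_000)
print(f"{pairs:,}")  # 124,999,750,000 unordered pairs for a 500k-marker panel
```

Under that assumption an all-pairs screen of a 500k panel is on the order of 10^11 tests, each independent of the others, which is why the problem maps well onto the many cores of a GPU.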
Affiliation(s)
- Gary K Chen
  - Division of Biostatistics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA, USA
- Yunfei Guo
  - Division of Biostatistics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA, USA
  - Zilkha Neurogenetic Institute, University of Southern California, Los Angeles, CA, USA
6
Dai H, Charnigo RJ, Becker ML, Leeder JS, Motsinger-Reif AA. Risk score modeling of multiple gene to gene interactions using aggregated-multifactor dimensionality reduction. BioData Min 2013; 6:1. [PMID: 23294634 PMCID: PMC3560267 DOI: 10.1186/1756-0381-6-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 09/04/2012] [Accepted: 12/21/2012] [Indexed: 01/27/2023] Open
Abstract
BACKGROUND Multifactor Dimensionality Reduction (MDR) has been widely applied to detect gene-gene (GxG) interactions associated with complex diseases. Existing MDR methods summarize disease risk by a dichotomous predisposing model (high-risk/low-risk) from one optimal GxG interaction, which does not take the accumulated effects from multiple GxG interactions into account. RESULTS We propose an Aggregated-Multifactor Dimensionality Reduction (A-MDR) method that exhaustively searches for and detects significant GxG interactions to generate an epistasis enriched gene network. An aggregated epistasis enriched risk score, which takes into account multiple GxG interactions simultaneously, replaces the dichotomous predisposing risk variable and provides higher resolution in the quantification of disease susceptibility. We evaluate this new A-MDR approach in a broad range of simulations. We also present the results of applying the A-MDR method to a data set derived from Juvenile Idiopathic Arthritis patients treated with methotrexate (MTX), which revealed several GxG interactions in the folate pathway that were associated with treatment response. The epistasis enriched risk score that pooled information from 82 significant GxG interactions distinguished MTX responders from non-responders with 82% accuracy. CONCLUSIONS The proposed A-MDR is an innovation within the MDR framework for investigating aggregated effects among GxG interactions. New measures (pOR, pRR and pChi) are proposed to detect multiple GxG interactions.
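For readers unfamiliar with the dichotomous high-risk/low-risk model that the abstract contrasts with A-MDR, the following is a minimal sketch of the classic MDR cell-labeling step for one SNP pair. It is not the A-MDR method itself, and it does not reproduce the pOR, pRR, or pChi measures. The function name, the 0/1/2 genotype coding, the choice of the overall case:control ratio as threshold, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def mdr_risk_labels(snp_a, snp_b, status):
    """Label each observed two-locus genotype cell as 'high' or 'low' risk.

    A cell is 'high' risk when its case:control ratio is at least the
    overall case:control ratio (assumed threshold). `status` is 1 for
    cases and 0 for controls; both groups are assumed to be non-empty.
    """
    cases, controls = Counter(), Counter()
    for ga, gb, y in zip(snp_a, snp_b, status):
        (cases if y == 1 else controls)[(ga, gb)] += 1
    threshold = sum(cases.values()) / sum(controls.values())
    labels = {}
    for cell in set(cases) | set(controls):
        n_case, n_ctrl = cases[cell], controls[cell]
        # A cell with cases but no controls is treated as high risk.
        high = n_ctrl == 0 or n_case / n_ctrl >= threshold
        labels[cell] = "high" if high else "low"
    return labels


# Hypothetical genotypes (0/1/2 minor-allele counts) and case status.
snp_a = [0, 0, 1, 1, 2, 2]
snp_b = [0, 0, 1, 1, 0, 0]
status = [1, 1, 0, 0, 1, 0]
labels = mdr_risk_labels(snp_a, snp_b, status)
```

Collapsing a multi-locus genotype table into a single binary variable like this is what the abstract means by a dichotomous predisposing model; the aggregated score it proposes pools many such interactions instead of relying on one.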
Affiliation(s)
- Hongying Dai
  - Research Development and Clinical Investigation, Children's Mercy Hospital, Kansas City, MO, 64108, USA
7
Applications of multifactor dimensionality reduction to genome-wide data using the R package 'MDR'. Methods Mol Biol 2013; 1019:479-98. [PMID: 23756907 DOI: 10.1007/978-1-62703-447-0_23] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 02/06/2023]
Abstract
This chapter describes how to use the R package 'MDR' to search and identify gene-gene interactions in high-dimensional data and illustrates applications for exploratory analysis of multi-locus models by providing specific examples.
8
Systems genetics in "-omics" era: current and future development. Theory Biosci 2012; 132:1-16. [PMID: 23138757 DOI: 10.1007/s12064-012-0168-x] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 03/26/2012] [Accepted: 10/25/2012] [Indexed: 02/06/2023]
Abstract
Systems genetics is an emerging discipline that integrates high-throughput expression profiling technology and systems biology approaches to reveal the molecular mechanisms of complex traits, and it will improve our understanding of gene functions in biochemical pathways and of genetic interactions between biological molecules. With the rapid advances in microarray analysis technologies, bioinformatics is used extensively in studies of gene functions, SNP-SNP genetic interactions, LD block-block interactions, miRNA-mRNA interactions, DNA-protein interactions, protein-protein interactions, and functional mapping for LD blocks. Building on bioinformatics panels, which can integrate "-omics" datasets to extract systems knowledge and useful information for explaining the molecular mechanisms of complex traits, systems genetics aims to enhance our understanding of biological processes. Systems biology has provided systems-level recognition of various biological phenomena and has laid the scientific groundwork for the development of systems genetics. In addition, next-generation sequencing technology and post-genome-wide association studies empower the discovery of new genes and rare variants. The integration of different strategies will help to propose novel hypotheses and refine the theoretical framework of systems genetics, which will contribute to its future development and open up a whole new area of genetics.
9
Dai H, Bhandary M, Becker M, Leeder JS, Gaedigk R, Motsinger-Reif AA. Global tests of P-values for multifactor dimensionality reduction models in selection of optimal number of target genes. BioData Min 2012; 5:3. [PMID: 22616673 PMCID: PMC3508622 DOI: 10.1186/1756-0381-5-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 01/17/2012] [Accepted: 04/19/2012] [Indexed: 11/12/2022] Open
Abstract
Background Multifactor Dimensionality Reduction (MDR) is a popular and successful data mining method developed to characterize and detect nonlinear complex gene-gene interactions (epistasis) that are associated with disease susceptibility. Because MDR uses a combinatorial search strategy to detect interaction, several filtration techniques have been developed to remove genes (SNPs) that have no interactive effects prior to analysis. However, the cutoff values implemented for these filtration methods are arbitrary, so different choices of cutoff values will lead to different selections of genes (SNPs). Methods We suggest incorporating a global test of p-values into filtration procedures to identify the optimal number of genes/SNPs for further MDR analysis, and we demonstrate this approach using a ReliefF filter technique. We compare the performance of different global testing procedures in this context, including the Kolmogorov-Smirnov test, the inverse chi-square test, the inverse normal test, the logit test, the Wilcoxon test and Tippett's test. Additionally, we demonstrate the approach on a real data application with a candidate gene study of drug response in Juvenile Idiopathic Arthritis. Results Extensive simulations of correlated p-values show that the inverse chi-square test is the most appropriate approach to combine with the screening approach to determine the optimal number of SNPs for the final MDR analysis. The Kolmogorov-Smirnov test has high inflation of Type I errors when p-values are highly correlated or when p-values peak near the center of the histogram. Tippett's test has very low power when the effect size of GxG interactions is small. Conclusions The proposed global tests can serve as a screening approach prior to individual tests to prevent false discovery. Strong power in small sample sizes and well-controlled Type I error in the absence of GxG interactions make global tests highly recommended in epistasis studies.
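Two of the global tests named in the abstract have simple closed forms, sketched below. The "inverse chi-square test" is taken here to be Fisher's combined probability method, which is the usual meaning of the term but is an assumption about the paper's exact procedure. Both functions assume independent p-values strictly greater than 0, whereas the paper studies correlated p-values, so this illustrates only the test statistics and not the authors' simulation setting. Function names are hypothetical.

```python
from math import exp, log


def chi2_sf_even_df(x, df):
    """Survival function P(X > x) for a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!.
    """
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return exp(-half) * total


def inverse_chi_square_test(p_values):
    """Fisher's method: -2 * sum(ln p_i) is chi-square with 2k df under H0."""
    stat = -2.0 * sum(log(p) for p in p_values)
    return chi2_sf_even_df(stat, 2 * len(p_values))


def tippett_test(p_values):
    """Tippett's method: global p-value based on the smallest p-value."""
    return 1.0 - (1.0 - min(p_values)) ** len(p_values)


# Hypothetical p-values from three filtered SNPs.
p_values = [0.04, 0.20, 0.35]
fisher_p = inverse_chi_square_test(p_values)
tippett_p = tippett_test(p_values)
```

Fisher's statistic accumulates evidence across all p-values, while Tippett's depends only on the minimum, which is consistent with the abstract's remark that Tippett's test loses power when individual effects are small.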
Affiliation(s)
- Hongying Dai
  - Department of Medical Research, Children's Mercy Hospital, 2401 Gillham Road, Kansas City, MO, 64108, USA