1
Lin WY. Searching for gene-gene interactions through variance quantitative trait loci of 29 continuous Taiwan Biobank phenotypes. Front Genet 2024; 15:1357238. [PMID: 38516378; PMCID: PMC10956579; DOI: 10.3389/fgene.2024.1357238] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/17/2023] [Accepted: 02/27/2024] [Indexed: 03/23/2024]
Abstract
Introduction: After the era of genome-wide association studies (GWAS), thousands of genetic variants have been identified to exhibit main effects on human phenotypes. The next critical issue would be to explore the interplay between genes, the so-called "gene-gene interactions" (GxG) or epistasis. An exhaustive search for all single-nucleotide polymorphism (SNP) pairs is not recommended because this will induce a harsh penalty of multiple testing. Limiting the search of epistasis on SNPs reported by previous GWAS may miss essential interactions between SNPs without significant marginal effects. Moreover, most methods are computationally intensive and can be challenging to implement genome-wide. Methods: I here searched for GxG through variance quantitative trait loci (vQTLs) of 29 continuous Taiwan Biobank (TWB) phenotypes. A discovery cohort of 86,536 and a replication cohort of 25,460 TWB individuals were analyzed, respectively. Results: A total of 18 nearly independent vQTLs with linkage disequilibrium measure r² < 0.01 were identified and replicated from nine phenotypes. 15 significant GxG were found with p-values <1.1E-5 (in the discovery cohort) and false discovery rates <2% (in the replication cohort). Among these 15 GxG, 11 were detected for blood traits including red blood cells, hemoglobin, and hematocrit; 2 for total bilirubin; 1 for fasting glucose; and 1 for total cholesterol (TCHO). All GxG were observed for gene pairs on the same chromosome, except for the APOA5 (chromosome 11)-TOMM40 (chromosome 19) interaction for TCHO. Discussion: This study provided a computationally feasible way to search for GxG genome-wide and applied this approach to 29 phenotypes.
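The multiple-testing penalty mentioned in the abstract can be made concrete with a little arithmetic. The sketch below counts the unordered SNP pairs in an exhaustive two-locus scan and the per-test threshold a Bonferroni correction would then require. The panel size of 500,000 SNPs and the 0.05 family-wise level are assumptions chosen only for illustration; they are not taken from the study.

```python
from math import comb


def pairwise_tests(n_snps: int) -> int:
    """Number of unordered SNP pairs in an exhaustive two-locus scan."""
    return comb(n_snps, 2)


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Hypothetical panel size (assumption, not a figure from the abstract).
n_snps = 500_000
n_pairs = pairwise_tests(n_snps)
print(f"SNP pairs to test: {n_pairs:,}")
print(f"Bonferroni per-test threshold: {bonferroni_threshold(n_pairs):.2e}")
```

Because the number of pairs grows quadratically with the number of SNPs, restricting the search to a small set of candidate loci relaxes the required threshold by many orders of magnitude.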
Affiliation(s)
- Wan-Yu Lin
- Institute of Health Data Analytics and Statistics, College of Public Health, National Taiwan University, Taipei, Taiwan
- Master of Public Health Degree Program, College of Public Health, National Taiwan University, Taipei, Taiwan
2
Rehman A, Mujahid M, Saba T, Jeon G. Optimised stacked machine learning algorithms for genomics and genetics disorder detection in the healthcare industry. Funct Integr Genomics 2024; 24:23. [PMID: 38305949; DOI: 10.1007/s10142-024-01289-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2023] [Revised: 12/22/2023] [Accepted: 01/02/2024] [Indexed: 02/03/2024]
Abstract
With recent advances in precision medicine and healthcare computing, there is an enormous demand for developing machine learning algorithms in genomics to enhance the rapid analysis of disease disorders. Technological advancement in genomics and imaging provides clinicians with enormous amounts of data, but prediction is still mostly subjective, resulting in problematic medical treatment. Machine learning is being employed in several domains of the healthcare sector, encompassing clinical research, early disease identification, and medicinal innovation with a historical perspective. The main objective of this study is to detect patients who, based on several medical standards, are more susceptible to having a genetic disorder. A genetic disease prediction algorithm was employed, leveraging the patient's health history to evaluate the probability of diagnosing a genetic disorder. We developed a computationally efficient machine learning approach to predict the overall lifespan of patients with a genomics disorder and to classify and predict patients with a genetic disease. The SVM, RF, and ETC are stacked using two-layer meta-estimators to develop the proposed model. The first layer comprises all the baseline models employed to predict the outcomes based on the dataset. The second layer comprises a component known as a meta-classifier. Results from the experiment indicate that the model achieved an accuracy of 90.45% and a recall score of 90.19%. The area under the curve (AUC) for mitochondrial diseases is 98.1%; for multifactorial diseases, it is 97.5%; and for single-gene inheritance, it is 98.8%. The proposed approach presents a novel method for predicting patient prognosis in a manner that is unbiased, accurate, and comprehensive. The proposed approach outperforms human professionals using the current clinical standard for genetic disease classification in terms of identification accuracy. The implementation of the stacked model will significantly improve the field of biomedical research by improving the anticipation of genetic diseases.
Affiliation(s)
- Amjad Rehman
- Artificial Intelligence & Data Analytics Lab, CCIS, Prince Sultan University, Riyadh, 11586, Saudi Arabia
- Muhammad Mujahid
- Artificial Intelligence & Data Analytics Lab, CCIS, Prince Sultan University, Riyadh, 11586, Saudi Arabia
- Tanzila Saba
- Artificial Intelligence & Data Analytics Lab, CCIS, Prince Sultan University, Riyadh, 11586, Saudi Arabia
- Gwanggil Jeon
- Artificial Intelligence & Data Analytics Lab, CCIS, Prince Sultan University, Riyadh, 11586, Saudi Arabia.
- Department of Embedded Systems Engineering, Incheon National University, Incheon, 610101, Korea.
3
Ren F, Li S, Wen Z, Liu Y, Tang D. The Spherical Evolutionary Multi-Objective (SEMO) Algorithm for Identifying Disease Multi-Locus SNP Interactions. Genes (Basel) 2023; 15:11. [PMID: 38275593; PMCID: PMC10815643; DOI: 10.3390/genes15010011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2023] [Revised: 11/21/2023] [Accepted: 12/18/2023] [Indexed: 01/27/2024]
Abstract
Single-nucleotide polymorphisms (SNPs), as disease-related biogenetic markers, are crucial in elucidating complex disease susceptibility and pathogenesis. Due to computational inefficiency, it is difficult to identify high-dimensional SNP interactions efficiently using combinatorial search methods, so the spherical evolutionary multi-objective (SEMO) algorithm for detecting multi-locus SNP interactions was proposed. The algorithm uses a spherical search factor and a feedback mechanism of excellent individual history memory to enhance the balance between search and acquisition. Moreover, a multi-objective fitness function based on the decomposition idea was used to evaluate the associations by combining two functions, K2-Score and LR-Score, as an objective function for the algorithm's evolutionary iterations. The performance evaluation of SEMO was compared with six state-of-the-art algorithms on a simulated dataset. The results showed that SEMO outperforms the comparative methods by detecting SNP interactions quickly and accurately with a shorter average run time. The SEMO algorithm was applied to the Wellcome Trust Case Control Consortium (WTCCC) breast cancer dataset and detected two- and three-point SNP interactions that were significantly associated with breast cancer, confirming the effectiveness of the algorithm. New combinations of SNPs associated with breast cancer were also identified, which will provide a new way to detect SNP interactions quickly and accurately.
Affiliation(s)
- Fuxiang Ren
- College of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou 510006, China
- Shiyin Li
- College of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou 510006, China
- Zihao Wen
- College of Mathematics and Informatics, College of Software Engineering, South China Agricultural University, Guangzhou 510642, China
- Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- Yidi Liu
- College of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou 510006, China
- Deyu Tang
- College of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou 510006, China
- College of Mathematics and Informatics, College of Software Engineering, South China Agricultural University, Guangzhou 510642, China
4
Peng YZ, Lin Y, Huang Y, Li Y, Luo G, Liao J. GEP-EpiSeeker: a gene expression programming-based method for epistatic interaction detection in genome-wide association studies. BMC Genomics 2021; 22:910. [PMID: 34930147; PMCID: PMC8686218; DOI: 10.1186/s12864-021-08207-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/17/2021] [Accepted: 11/24/2021] [Indexed: 11/10/2022]
Abstract
Background: Identification of epistatic interactions provides a systematic way for exploring associations among different single nucleotide polymorphisms (SNPs) and complex diseases. Although considerable progress has been made in epistasis detection, efficiently and accurately identifying epistatic interactions remains a challenge due to the rapid growth in the number of SNP combinations to be measured. Results: In this work, we formulate the detection of epistatic interactions as a combinatorial optimization problem, and propose a novel evolutionary-based framework, called GEP-EpiSeeker, to detect epistatic interactions using Gene Expression Programming. In GEP-EpiSeeker, we propose several tailor-made chromosome rules to describe SNP combinations, incorporate Bayesian network-based fitness evaluation into the evolution of tailor-made chromosomes to find suspected SNP combinations, and adopt the Chi-square test to identify optimal solutions from suspected SNP combinations. Moreover, to improve the convergence and accuracy of the algorithm, we design two genetic operators with multiple and adjacent mutations and an adaptive genetic manipulation method with fuzzy control to efficiently manipulate the evolution of tailor-made chromosomes. We compared GEP-EpiSeeker with state-of-the-art methods including BEAM, BOOST, AntEpiSeeker, MACOED, and EACO in terms of power, recall, precision and F1-score on the GWAS datasets of 12 DME disease models and 10 DNME disease models. Our experimental results show that GEP-EpiSeeker outperforms comparative methods. Conclusions: Here we presented a novel method named GEP-EpiSeeker, based on the Gene Expression Programming algorithm, to identify epistatic interactions in Genome-wide Association Studies. The results indicate that GEP-EpiSeeker could be a promising alternative to the existing methods in epistasis detection and will provide a new way for accurately identifying epistasis.
Affiliation(s)
- Yu Zhong Peng
- School of Computer & Information Engineering, Nanning Normal University, Nanning, 530001, China; School of Computer science, Fudan University, Shanghai, 200433, China
- Yanmei Lin
- School of Computer & Information Engineering, Nanning Normal University, Nanning, 530001, China
- Yiran Huang
- School of Computer and Electronics and Information, Guangxi Key Laboratory of Multimedia Communications and Network Technology, Guangxi University, Nanning, 530004, China.
- Ying Li
- School of Computer & Information Engineering, Nanning Normal University, Nanning, 530001, China
- Guangsheng Luo
- School of Computer science, Fudan University, Shanghai, 200433, China
- Jianping Liao
- School of Computer & Information Engineering, Nanning Normal University, Nanning, 530001, China.
5
Musolf AM, Holzinger ER, Malley JD, Bailey-Wilson JE. What makes a good prediction? Feature importance and beginning to open the black box of machine learning in genetics. Hum Genet 2021; 141:1515-1528. [PMID: 34862561; PMCID: PMC9360120; DOI: 10.1007/s00439-021-02402-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 05/13/2021] [Accepted: 11/08/2021] [Indexed: 01/26/2023]
Abstract
Genetic data have become increasingly complex within the past decade, leading researchers to pursue increasingly complex questions, such as those involving epistatic interactions and protein prediction. Traditional methods are ill-suited to answer these questions, but machine learning (ML) techniques offer an alternative solution. ML algorithms are commonly used in genetics to predict or classify subjects, but some methods evaluate which features (variables) are responsible for creating a good prediction; this is called feature importance. This is critical in genetics, as researchers are often interested in which features (e.g., SNP genotype or environmental exposure) are responsible for a good prediction. This allows for the deeper analysis beyond simple prediction, including the determination of risk factors associated with a given phenotype. Feature importance further permits the researcher to peer inside the black box of many ML algorithms to see how they work and which features are critical in informing a good prediction. This review focuses on ML methods that provide feature importance metrics for the analysis of genetic data. Five major categories of ML algorithms: k nearest neighbors, artificial neural networks, deep learning, support vector machines, and random forests are described. The review ends with a discussion of how to choose the best machine for a data set. This review will be particularly useful for genetic researchers looking to use ML methods to answer questions beyond basic prediction and classification.
Affiliation(s)
- Anthony M Musolf
- Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive Suite 1200, Baltimore, MD, 21224, USA
- Emily R Holzinger
- Target Sciences, Informatics and Predictive Sciences, Bristol Myers Squibb, Cambridge, MA, USA
- James D Malley
- Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive Suite 1200, Baltimore, MD, 21224, USA
- Joan E Bailey-Wilson
- Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive Suite 1200, Baltimore, MD, 21224, USA.
6
Varzari A, Deyneko IV, Tudor E, Grallert H, Illig T. Synergistic effect of genetic polymorphisms in TLR6 and TLR10 genes on the risk of pulmonary tuberculosis in a Moldavian population. Innate Immun 2021; 27:365-376. [PMID: 34275341; PMCID: PMC8419295; DOI: 10.1177/17534259211029996] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 12/25/2022]
Abstract
Polymorphisms in genes that control immune function and regulation may influence susceptibility to pulmonary tuberculosis (TB). In this study, 14 polymorphisms in 12 key genes involved in the immune response (VDR, MR1, TLR1, TLR2, TLR10, SLC11A1, IL1B, IL10, IFNG, TNF, IRAK1, and FOXP3) were tested for their association with pulmonary TB in 271 patients with TB and 251 community-matched controls from the Republic of Moldova. In addition, gene-gene interactions involved in TB susceptibility were analyzed for a total of 43 genetic loci. Single nucleotide polymorphism (SNP) analysis revealed a nominal association between TNF rs1800629 and pulmonary TB (Fisher exact test P = 0.01843). In the pairwise interaction analysis, the combination of the genotypes TLR6 rs5743810 GA and TLR10 rs11096957 GT was significantly associated with an increased genetic risk of pulmonary TB (OR = 2.48, 95% CI = 1.62-3.85; Fisher exact test P value = 1.5 × 10⁻⁵, significant after Bonferroni correction). In conclusion, the TLR6 rs5743810 and TLR10 rs11096957 two-locus interaction confers a significantly higher risk for pulmonary TB; due to its high frequency in the population, this SNP combination may serve as a novel biomarker for predicting TB susceptibility.
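The kind of test reported in this abstract, a Fisher exact test and odds ratio for carriers versus non-carriers of a genotype combination among cases and controls, can be sketched as follows. The abstract does not give the underlying 2x2 counts, so the table used here is a small hypothetical example and does not reproduce the reported OR or P value. The two-sided rule (summing all tables no more probable than the observed one) is a common convention and an assumption about how the published P value was computed.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Sample odds ratio (a*d)/(b*c); undefined when b or c is zero."""
    return (a * d) / (b * c)


# Hypothetical counts (NOT from the study):
# rows = cases / controls, columns = carriers / non-carriers.
table = (8, 2, 1, 5)
print("odds ratio:", odds_ratio(*table))
print("two-sided p:", round(fisher_exact_two_sided(*table), 5))
```

With many locus pairs tested, the resulting P value would still need a multiple-testing correction, such as the Bonferroni correction the abstract mentions.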
Affiliation(s)
- Alexander Varzari
- Laboratory of Human Genetics, Chiril Draganiuc Institute of Phthisiopneumology, Republic of Moldova; Hannover Unified Biobank, Hannover Medical School, Germany
- Igor V Deyneko
- Laboratory of Functional Genomics, Timiryazev Institute of Plant Physiology Russian Academy of Sciences, Russia
- Elena Tudor
- Laboratory of Human Genetics, Chiril Draganiuc Institute of Phthisiopneumology, Republic of Moldova
- Harald Grallert
- Research Unit of Molecular Epidemiology, Institute of Epidemiology, Helmholtz Zentrum München Research Center for Environmental Health, Germany
- Thomas Illig
- Hannover Unified Biobank, Hannover Medical School, Germany; Department of Human Genetics, Hannover Medical School, Germany
7
Abo Alchamlat S, Farnir F. Aggregation of experts: an application in the field of "interactomics" (detection of interactions on the basis of genomic data). BMC Bioinformatics 2018; 19:445. [PMID: 30497383; PMCID: PMC6267805; DOI: 10.1186/s12859-018-2447-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/11/2017] [Accepted: 10/25/2018] [Indexed: 12/03/2022]
Abstract
Background: Despite the successful mapping of genes involved in the determinism of numerous traits, a large part of the genetic variation remains unexplained. A possible explanation is that the simple models used in many studies might not properly fit the actual underlying situations. Consequently, various methods have attempted to deal with the simultaneous mapping of genomic regions, assuming that these regions might interact, leading to a complex determinism for various traits. Despite some successes, no gold standard methodology has emerged. Actually, combining several interaction mapping methods might be a better strategy, leading to positive results over a larger set of situations. Our work is a step in that direction. Results: We have first demonstrated why aggregating results from several distinct methods might increase the statistical power while controlling the type I error. We have illustrated the approach using 6 existing methods (namely: MDR, Boost, BHIT, KNN-MDR, MegaSNPHunter and AntEpiSeeker) on simulated and real data sets. We have used a very simple aggregation strategy: a majority vote across the best loci combinations identified by the individual methods. In order to assess the performances of our aggregation approach in problems where most individual methods tend to fail, we have simulated difficult situations where no marginal effects of individual genes exist and where genetic heterogeneity is present. We have also demonstrated the use of the strategy on real data, using a WTCCC dataset on rheumatoid arthritis. Since we have been using simplistic assumptions to infer the expected power of the aggregation method, the actual power we estimated from our simulations has turned out to be a bit smaller than theoretically expected. Results nevertheless have shown that grouping the results of several methods is advantageous in terms of power, accuracy and type I error control. Furthermore, as more methods should become available in the future, using a grouping strategy will become more advantageous since adding more methods seems to improve the performances of the aggregated method. Conclusions: The aggregation of methods as a tool to detect genetic interactions is a potentially useful addition to the arsenal used in complex traits analyses.
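The aggregation strategy described in the abstract, a majority vote across the best loci combinations reported by individual methods, can be sketched in a few lines. This is a minimal illustration and not the authors' code: the method names come from the abstract, while the SNP identifiers, the one-combination-per-method input format, and the tie handling (returning all tied winners) are assumptions.

```python
from collections import Counter


def majority_vote(top_hits: dict) -> tuple:
    """Return the loci combination(s) named by the most methods, plus the vote count.

    `top_hits` maps a method name to the single best combination it reported.
    Combinations are compared as sets, so locus order does not matter.
    """
    votes = Counter(frozenset(combo) for combo in top_hits.values())
    best = max(votes.values())
    winners = sorted(tuple(sorted(combo)) for combo, n in votes.items() if n == best)
    return winners, best


# Hypothetical top hits; the SNP names are placeholders, not real results.
top_hits = {
    "MDR": ("snp1", "snp2"),
    "Boost": ("snp2", "snp1"),
    "BHIT": ("snp1", "snp2"),
    "KNN-MDR": ("snp3", "snp4"),
    "MegaSNPHunter": ("snp1", "snp2"),
    "AntEpiSeeker": ("snp5", "snp6"),
}
winners, n_votes = majority_vote(top_hits)
print(winners, n_votes)
```

Treating each combination as an unordered set avoids splitting votes between methods that report the same loci in a different order.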
Affiliation(s)
- Sinan Abo Alchamlat
- Department of Biostatistics, Faculty of Veterinary Medicine, University of Liège, Sart Tilman B43, 4000, Liege, Belgium
- Frédéric Farnir
- Department of Biostatistics, Faculty of Veterinary Medicine, University of Liège, Sart Tilman B43, 4000, Liege, Belgium.
8
Li X. A fast and exhaustive method for heterogeneity and epistasis analysis based on multi-objective optimization. Bioinformatics 2018; 33:2829-2836. [PMID: 28541468; DOI: 10.1093/bioinformatics/btx339] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Received: 04/02/2017] [Accepted: 05/20/2017] [Indexed: 12/29/2022]
Abstract
Motivation: The existing epistasis analysis approaches have been criticized mainly for their: (i) ignoring heterogeneity during epistasis analysis; (ii) high computational costs; and (iii) volatility of performances and results. Therefore, they will not perform well in general, leading to lack of reproducibility and low power in complex disease association studies. In this work, a fast scheme is proposed to accelerate exhaustive searching based on multi-objective optimization named ESMO for concurrently analyzing heterogeneity and epistasis phenomena. In ESMO, mutual entropy and Bayesian network approaches are combined for evaluating epistatic SNP combinations. In order to be compatible with heterogeneity of complex diseases, we designed an adaptive framework based on non-dominant sort and top k selection algorithm with improved time complexity O(k*M*N). Moreover, ESMO is accelerated by strategies such as trading space for time, calculation sharing and parallel computing. Finally, ESMO is nonparametric and model-free. Results: We compared ESMO with other recent or classic methods using different evaluating measures. The experimental results show that our method not only can quickly handle epistasis, but also can effectively detect heterogeneity of complex population structures. Availability and implementation: https://github.com/XiongLi2016/ESMO/tree/master/ESMO-common-master Contact: lx_hncs@163.com.
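The idea behind the abstract's "non-dominant sort" (usually called non-dominated sorting) is to keep the SNP combinations that no other combination beats on every objective at once. The sketch below extracts that first Pareto front with a naive pairwise comparison. It is not ESMO's implementation and does not achieve the O(k*M*N) complexity cited above; the candidate names, the score values, and the assumption that both objectives are maximized are all hypothetical.

```python
def dominates(p, q) -> bool:
    """True if p is at least as good as q on every objective and strictly
    better on at least one (all objectives assumed to be maximized)."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def pareto_front(scores: dict) -> list:
    """Return the keys whose score vectors are not dominated by any other."""
    return [
        key
        for key, vec in scores.items()
        if not any(dominates(other, vec) for k, other in scores.items() if k != key)
    ]


# Hypothetical (objective 1, objective 2) scores for four SNP combinations.
scores = {
    "combo_A": (0.9, 0.2),
    "combo_B": (0.5, 0.5),
    "combo_C": (0.4, 0.4),  # worse than combo_B on both objectives
    "combo_D": (0.1, 0.9),
}
print(pareto_front(scores))
```

Keeping a front rather than a single best score lets combinations that excel on different objectives survive together, which is one way to accommodate heterogeneous disease subtypes.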
Affiliation(s)
- Xiong Li
- School of Software, East China Jiaotong University, Nanchang 330013, China
9
Heterogeneity Analysis and Diagnosis of Complex Diseases Based on Deep Learning Method. Sci Rep 2018; 8:6155. [PMID: 29670206; PMCID: PMC5906634; DOI: 10.1038/s41598-018-24588-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 02/14/2018] [Accepted: 04/05/2018] [Indexed: 12/26/2022]
Abstract
Understanding genetic mechanism of complex diseases is a serious challenge. Existing methods often neglect the heterogeneity phenomenon of complex diseases, resulting in lack of power or low reproducibility. Addressing heterogeneity when detecting epistatic single nucleotide polymorphisms (SNPs) can enhance the power of association studies and improve prediction performance of complex diseases diagnosis. In this study, we propose a three-stage framework including epistasis detection, clustering and prediction to address both epistasis and heterogeneity of complex diseases based on deep learning method. The epistasis detection stage applies a multi-objective optimization method to find several candidate sets of epistatic SNPs which contribute to different subtypes of complex diseases. Then, a K-means clustering algorithm is used to define subtypes of the case group. Finally, a deep learning model has been trained for disease prediction based on graphics processing unit (GPU). Experimental results on pure and heterogeneous datasets show that our method has potential practicality and can serve as a possible alternative to other methods. Therefore, when epistasis and heterogeneity exist at the same time, our method is especially suitable for diagnosis of complex diseases.
10
Niel C, Sinoquet C, Dina C, Rocheleau G. SMMB: a stochastic Markov blanket framework strategy for epistasis detection in GWAS. Bioinformatics 2018; 34:2773-2780. [DOI: 10.1093/bioinformatics/bty154] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 03/12/2017] [Accepted: 03/09/2018] [Indexed: 12/22/2022]
Affiliation(s)
- Clément Niel
- Laboratoire des Sciences du Numérique de Nantes (LS2N), Centre National de la recherche Scientifique UMR6004, University of Nantes, Nantes, France
- Christine Sinoquet
- Laboratoire des Sciences du Numérique de Nantes (LS2N), Centre National de la recherche Scientifique UMR6004, University of Nantes, Nantes, France
- Christian Dina
- Institut du Thorax, Institut National de la Santé et de la Recherche Médicale UMR 1087, Centre National de la Recherche Scientifique UMR 6291, University of Nantes, Nantes, France
- Ghislain Rocheleau
- European Genomic Institute for Diabetes FR3508, Centre National de la Recherche Scientifique UMR 8199, Lille 2 University, Lille, France