1
Hajiaghabozorgi M, Fischbach M, Albrecht M, Wang W, Myers CL. BridGE: a pathway-based analysis tool for detecting genetic interactions from GWAS. Nat Protoc 2024; 19:1400-1435. [PMID: 38514837 PMCID: PMC11311251 DOI: 10.1038/s41596-024-00954-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2022] [Accepted: 11/22/2023] [Indexed: 03/23/2024]
Abstract
Genetic interactions have the potential to modulate phenotypes, including human disease. In principle, genome-wide association studies (GWAS) provide a platform for detecting genetic interactions; however, traditional methods for identifying them, which tend to focus on testing individual variant pairs, lack statistical power. In this protocol, we describe a novel computational approach, called Bridging Gene sets with Epistasis (BridGE), for discovering genetic interactions between biological pathways from GWAS data. We present a Python-based implementation of BridGE along with instructions for its application to a typical human GWAS cohort. The major stages include initial data processing and quality control, construction of a variant-level genetic interaction network, measurement of pathway-level genetic interactions, evaluation of statistical significance using sample permutations and generation of results in a standardized output format. The BridGE software pipeline includes options for running the analysis on multiple cores and multiple nodes for users who have access to computing clusters or a cloud computing environment. In a cluster computing environment with 10 nodes and 100 GB of memory per node, the method can be run in less than 24 h for typical human GWAS cohorts. Using BridGE requires knowledge of running Python programs and basic shell script programming experience.
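The permutation stage mentioned in the abstract can be illustrated with a generic label-shuffling test. This is a minimal sketch and not BridGE's own code or interface: the function name `permutation_p_value`, the per-sample `scores`, and the mean-difference statistic are all assumptions chosen for illustration.

```python
import random


def permutation_p_value(scores, labels, n_perm=999, seed=0):
    """One-sided permutation p-value for mean(case scores) - mean(control scores).

    Hypothetical helper for illustration only; labels are 1 (case) or 0 (control).
    """
    rng = random.Random(seed)

    def mean_diff(lbls):
        cases = [s for s, l in zip(scores, lbls) if l == 1]
        controls = [s for s, l in zip(scores, lbls) if l == 0]
        return sum(cases) / len(cases) - sum(controls) / len(controls)

    observed = mean_diff(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the labels breaks any real link between score and phenotype.
        rng.shuffle(shuffled)
        if mean_diff(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data (assumed): cases score 1.0, controls score 0.0.
scores = [1.0] * 10 + [0.0] * 10
labels = [1] * 10 + [0] * 10
p = permutation_p_value(scores, labels)
```

The add-one correction keeps the smallest reportable p-value at 1 / (n_perm + 1), so the number of permutations sets the resolution of the test.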
Affiliation(s)
- Mehrad Hajiaghabozorgi
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA
- Mathew Fischbach
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA
  - Graduate Program in Bioinformatics and Computational Biology (BICB), University of Minnesota, Minneapolis, MN, USA
- Michael Albrecht
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA
- Wen Wang
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA
- Chad L Myers
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA
  - Graduate Program in Bioinformatics and Computational Biology (BICB), University of Minnesota, Minneapolis, MN, USA
2
Carré C, Carluer JB, Chaux C, Estoup-Streiff C, Roche N, Hosy E, Mas A, Krouk G. Next-Gen GWAS: full 2D epistatic interaction maps retrieve part of missing heritability and improve phenotypic prediction. Genome Biol 2024; 25:76. [PMID: 38523316 PMCID: PMC10962106 DOI: 10.1186/s13059-024-03202-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2023] [Accepted: 02/19/2024] [Indexed: 03/26/2024] Open
Abstract
The problem of missing heritability requires the consideration of genetic interactions among different loci, called epistasis. Current GWAS statistical models require years to assess the entire combinatorial epistatic space for a single phenotype. We propose Next-Gen GWAS (NGG) that evaluates over 60 billion single nucleotide polymorphism combinatorial first-order interactions within hours. We apply NGG to Arabidopsis thaliana providing two-dimensional epistatic maps at gene resolution. We demonstrate on several phenotypes that a large proportion of the missing heritability can be retrieved, that it indeed lies in epistatic interactions, and that it can be used to improve phenotype prediction.
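The scale of the pairwise search space follows from simple counting: n SNPs yield n(n-1)/2 unordered pairs. The sketch below only illustrates that arithmetic; the function name `n_pairs` and the example SNP counts are assumptions and are not taken from the paper.

```python
from math import comb


def n_pairs(n_snps):
    """Number of unordered SNP pairs (first-order interactions) among n_snps."""
    return comb(n_snps, 2)


# Illustrative SNP counts (assumed, not the cohort sizes used in the study).
for n in (1_000, 100_000, 350_000):
    print(f"{n:>9,} SNPs -> {n_pairs(n):,} pairs")
```

Because the count grows quadratically, a tenfold increase in SNPs means roughly a hundredfold increase in tests, which is why exhaustive scans depend on very fast per-pair evaluation.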
Affiliation(s)
- Clément Carré
  - BionomeeX, Montpellier, France
  - IMAG, Univ. Montpellier, CNRS, Montpellier, France
  - IPSiM, Univ. Montpellier, CNRS, INRAE, Montpellier, France
- Jean Baptiste Carluer
  - IMAG, Univ. Montpellier, CNRS, Montpellier, France
  - IPSiM, Univ. Montpellier, CNRS, INRAE, Montpellier, France
- Eric Hosy
  - Interdisciplinary Institute for Neuroscience, University of Bordeaux, CNRS, Bordeaux, France
- André Mas
  - BionomeeX, Montpellier, France
  - IMAG, Univ. Montpellier, CNRS, Montpellier, France
- Gabriel Krouk
  - BionomeeX, Montpellier, France
  - IPSiM, Univ. Montpellier, CNRS, INRAE, Montpellier, France
3
Ponte-Fernandez C, Gonzalez-Dominguez J, Carvajal-Rodriguez A, Martin MJ. Evaluation of Existing Methods for High-Order Epistasis Detection. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:912-926. [PMID: 33055017 DOI: 10.1109/tcbb.2020.3030312] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 06/11/2023]
Abstract
Finding epistatic interactions among loci when expressing a phenotype is a widely employed strategy to understand the genetic architecture of complex traits in GWAS. The abundance of methods dedicated to the same purpose, however, makes it increasingly difficult for scientists to decide which method is more suitable for their studies. This work compares the different epistasis detection methods published during the last decade in terms of runtime, detection power and type I error rate, with a special emphasis on high-order interactions. Results show that in terms of detection power, the only methods that perform well across all experiments are the exhaustive methods, although their computational cost may be prohibitive in large-scale studies. Regarding non-exhaustive methods, not one could consistently find epistasis interactions when marginal effects are absent. If marginal effects are present, there are methods that perform well for high-order interactions, such as BADTrees, FDHE-IW, SingleMI or SNPHarvester. As for false-positive control, only SNPHarvester, FDHE-IW and DCHE show good results. The study concludes that there is no single epistasis detection method to recommend in all scenarios. Authors should prioritize exhaustive methods when sufficient computational resources are available considering the data set size, and resort to non-exhaustive methods when the analysis time is prohibitive.
4
Evaluating the detection ability of a range of epistasis detection methods on simulated data for pure and impure epistatic models. PLoS One 2022; 17:e0263390. [PMID: 35180244 PMCID: PMC8856572 DOI: 10.1371/journal.pone.0263390] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2021] [Accepted: 01/18/2022] [Indexed: 11/19/2022] Open
Abstract
Background: Numerous approaches have been proposed for the detection of epistatic interactions within GWAS datasets in order to better understand the drivers of disease and genetics. Methods: A selection of state-of-the-art approaches was assessed. These included the statistical tests fast-epistasis, BOOST, logistic regression and wtest; swarm intelligence methods, namely AntEpiSeeker, epiACO and CINOEDV; and data mining approaches, including MDR, GSS, SNPRuler and MPI3SNP. Data were simulated to provide randomly generated models with no individual main effects at different heritabilities (pure epistasis) as well as models based on penetrance tables with some main effects (impure epistasis). Detection of both two- and three-locus interactions was assessed across a total of 1,560 simulated datasets. The different methods were also applied to a section of the UK Biobank cohort for atrial fibrillation. Results: For pure two-locus interactions, PLINK's implementation of BOOST recovered the highest number of correct interactions (53.9%), performing significantly better than the other methods (p = 4.52e−36). For impure two-locus interactions, MDR exhibited the best performance, recovering 62.2% of the most significant impure epistatic interactions (p = 6.31e−90 for all but one test). The assessment of three-locus interaction prediction revealed that wtest recovered the highest number (17.2%) of pure epistatic interactions (p = 8.49e−14). wtest also recovered the highest number of three-locus impure epistatic interactions (p = 6.76e−48), while AntEpiSeeker ranked as the most significant the highest number of such interactions (40.5%). Finally, the methods were applied to a real dataset for atrial fibrillation, most notably finding an interaction between SYNE2 and DTNB.
5
Kondratyev NV, Alfimova MV, Golov AK, Golimbet VE. Bench Research Informed by GWAS Results. Cells 2021; 10:3184. [PMID: 34831407 PMCID: PMC8623533 DOI: 10.3390/cells10113184] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/23/2021] [Revised: 11/11/2021] [Accepted: 11/11/2021] [Indexed: 12/15/2022] Open
Abstract
Scientifically interesting as well as practically important phenotypes often belong to the realm of complex traits. To the extent that these traits are hereditary, they are usually 'highly polygenic'. The study of such traits presents a challenge for researchers, as the complex genetic architecture of such traits makes it nearly impossible to utilise many of the usual methods of reverse genetics, which often focus on specific genes. In recent years, thousands of genome-wide association studies (GWAS) were undertaken to explore the relationships between complex traits and a large number of genetic factors, most of which are characterised by tiny effects. In this review, we aim to familiarise 'wet biologists' with approaches for the interpretation of GWAS results, to clarify some issues that may seem counterintuitive and to assess the possibility of using GWAS results in experiments on various complex traits.
Affiliation(s)
- Arkadiy K. Golov
  - Mental Health Research Center, 115522 Moscow, Russia; (M.V.A.); (A.K.G.); (V.E.G.)
  - Institute of Gene Biology, Russian Academy of Sciences, 119334 Moscow, Russia
- Vera E. Golimbet
  - Mental Health Research Center, 115522 Moscow, Russia; (M.V.A.); (A.K.G.); (V.E.G.)
6
MIDESP: Mutual Information-Based Detection of Epistatic SNP Pairs for Qualitative and Quantitative Phenotypes. Biology 2021; 10:biology10090921. [PMID: 34571798 PMCID: PMC8469369 DOI: 10.3390/biology10090921] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/02/2021] [Revised: 09/09/2021] [Accepted: 09/13/2021] [Indexed: 11/17/2022]
Abstract
Simple Summary: The interactions between SNPs, which are known as epistasis, can strongly influence the phenotype. Their detection is still a challenge, which is made even more difficult by the existence of background associations that can hide correct epistatic interactions. To address the limitations of existing methods, we present in this study our novel method MIDESP for the detection of epistatic SNP pairs. It is the first mutual information-based method that can be applied to both qualitative and quantitative phenotypes and which explicitly accounts for background associations in the dataset.
Abstract: The interactions between SNPs result in a complex interplay with the phenotype, known as epistasis. The knowledge of epistasis is a crucial part of understanding genetic causes of complex traits. However, due to the enormous number of SNP pairs and their complex relationship to the phenotype, identification remains a challenging problem. Many approaches for the detection of epistasis have been developed using mutual information (MI) as an association measure. However, these methods have mainly been restricted to case–control phenotypes and are therefore of limited applicability for quantitative traits. To overcome this limitation of MI-based methods, here, we present an MI-based novel algorithm, MIDESP, to detect epistasis between SNPs for qualitative as well as quantitative phenotypes. Moreover, by incorporating a dataset-dependent correction technique, we deal with the effect of background associations in a genotypic dataset to separate correct epistatic interaction signals from those of false positive interactions resulting from the effect of single SNP×phenotype associations. To demonstrate the effectiveness of MIDESP, we apply it to two real datasets with qualitative and quantitative phenotypes, respectively. Our results suggest that by eliminating the background associations, MIDESP can identify important genes that play essential roles in bovine tuberculosis or the egg weight of chickens.
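The idea of using mutual information as an association measure for a SNP pair can be sketched on toy data. This is not MIDESP's implementation and omits its background-association correction; the function `mutual_information` and the XOR-style toy genotypes are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum p(x, y) * log2(p(x, y) / (p(x) * p(y))) over the observed cells.
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


# Toy case-control data (assumed): the phenotype is the XOR of two binary
# genotypes, so neither SNP is informative on its own.
snp_a = [0, 0, 1, 1] * 5
snp_b = [0, 1, 0, 1] * 5
pheno = [a ^ b for a, b in zip(snp_a, snp_b)]

mi_a = mutual_information(snp_a, pheno)
mi_b = mutual_information(snp_b, pheno)
mi_pair = mutual_information(list(zip(snp_a, snp_b)), pheno)
```

In this toy example each single-SNP MI is zero while the joint genotype carries one full bit about the phenotype, which is the kind of signal a pairwise MI scan is meant to expose.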
7
Blumenthal DB, Baumbach J, Hoffmann M, Kacprowski T, List M. A framework for modeling epistatic interaction. Bioinformatics 2021; 37:1708-1716. [PMID: 33252645 DOI: 10.1093/bioinformatics/btaa990] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/07/2020] [Revised: 10/21/2020] [Accepted: 11/16/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Recently, various tools for detecting single nucleotide polymorphisms (SNPs) involved in epistasis have been developed. However, no studies evaluate the employed statistical epistasis models such as the χ2-test or quadratic regression independently of the tools that use them. Such an independent evaluation is crucial for developing improved epistasis detection tools, because it makes it possible to decide whether a tool's performance should be attributed to the epistasis model or to the optimization strategy run on top of it. RESULTS: We present a protocol for evaluating epistasis models independently of the tools they are used in and generalize existing models designed for dichotomous phenotypes to the categorical and quantitative case. In addition, we propose a new model which scores candidate SNP sets by computing maximum likelihood distributions for the observed phenotypes in the cells of their penetrance tables. Extensive experiments show that the proposed maximum likelihood model outperforms three widely used epistasis models in most cases. The experiments also provide valuable insights into the properties of existing models, for instance, that quadratic regression performs particularly well on instances with quantitative phenotypes. AVAILABILITY AND IMPLEMENTATION: The evaluation protocol and all compared models are implemented in C++ and are supported under Linux and macOS. They are available at https://github.com/baumbachlab/genepiseeker/, along with test datasets and scripts to reproduce the experiments. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
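As an illustration of one of the models named above, a χ2-test scores a candidate SNP set by comparing observed genotype-by-phenotype counts with the counts expected under independence. The sketch below is a generic Pearson statistic written in Python rather than the repository's C++ code; the function name `chi2_statistic` and the toy data are assumptions.

```python
from collections import Counter


def chi2_statistic(genotype_cells, phenotypes):
    """Pearson chi-square statistic and degrees of freedom for a
    (joint genotype) x (phenotype) contingency table built from the inputs."""
    n = len(phenotypes)
    row = Counter(genotype_cells)                   # counts per joint genotype
    col = Counter(phenotypes)                       # counts per phenotype class
    obs = Counter(zip(genotype_cells, phenotypes))  # observed cell counts
    stat = 0.0
    for g in row:
        for p in col:
            expected = row[g] * col[p] / n
            stat += (obs[(g, p)] - expected) ** 2 / expected
    dof = (len(row) - 1) * (len(col) - 1)
    return stat, dof


# Toy data (assumed): the phenotype is the XOR of two binary genotypes.
snp_a = [0, 0, 1, 1] * 5
snp_b = [0, 1, 0, 1] * 5
pheno = [a ^ b for a, b in zip(snp_a, snp_b)]
stat, dof = chi2_statistic(list(zip(snp_a, snp_b)), pheno)
```

Only the statistic and degrees of freedom are computed here; turning them into a p-value would need a χ2 distribution function, for example from a statistics library.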
Affiliation(s)
- David B Blumenthal
  - Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, 85354 Freising, Germany
- Jan Baumbach
  - Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, 85354 Freising, Germany
- Markus Hoffmann
  - Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, 85354 Freising, Germany
- Tim Kacprowski
  - Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, 85354 Freising, Germany
- Markus List
  - Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, 85354 Freising, Germany
8
Kjaersgaard Andersen R, Clemmensen SB, Larsen LA, Hjelmborg JVB, Ødum N, Jemec GBE, Christensen K. Evidence of gene-gene interaction in Hidradenitis suppurativa - A nationwide register study of Danish twins. Br J Dermatol 2021; 186:78-85. [PMID: 34289077 DOI: 10.1111/bjd.20654] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Accepted: 07/15/2021] [Indexed: 11/28/2022]
Abstract
BACKGROUND: Hidradenitis suppurativa (HS) is a recurrent inflammatory skin disease that, apart from rare causative loss-of-function mutations, has a largely unknown genetic aetiology. Our objective was to estimate the relative importance of genetic and environmental factors underlying HS susceptibility. METHODS: Through the Danish Twin Registry and the Danish National Patient Registry we joined information on zygosity with that of HS status. HS cases were identified by International Classification of Diseases (ICD) versions 8 (705.91) and 10 (L73.2). Heritability was assessed by the classic biometric model and the possibility of gene-gene interaction through the multi-locus modeling approach. RESULTS: Amongst 100,044 registered twins, we found 170 twins (from 163 pairs) diagnosed with HS. The seven concordant pairs were all monozygotic, and monozygotic twins had a casewise concordance rate of 28% (95% CI: 7%; 49%), corresponding to a familial risk of 73 (95% CI 13; 133) times that of the background population. The biometrical modelling suggested a heritability of 0.80 (95% CI 0.67; 0.93), and the multilocus index estimate was 230 (95% CI: 60; 400). This is highly indicative of gene-gene interactions, with the possibility of up to six interacting loci. CONCLUSION: This twin study is substantially larger, and employs a more valid phenotype than prior studies. Genetics account for the majority of the HS susceptibility, and HS is most likely caused by gene-gene interactions rather than monogenetic mutations or solely additive genetic factors. New approaches aimed at assessing potential interactions at a SNP-SNP level should be implemented in future HS genome-wide association studies.
Affiliation(s)
- S B Clemmensen
  - Danish Twin Registry, Department of Epidemiology, Biostatistics and Biodemography, University of Southern Denmark, Odense, Denmark
- L A Larsen
  - Danish Twin Registry, Department of Epidemiology, Biostatistics and Biodemography, University of Southern Denmark, Odense, Denmark
- J V B Hjelmborg
  - Danish Twin Registry, Department of Epidemiology, Biostatistics and Biodemography, University of Southern Denmark, Odense, Denmark
- N Ødum
  - LEO Foundation Skin Immunology Research Center, Department of Immunology and Microbiology, University of Copenhagen, Copenhagen, Denmark
- G B E Jemec
  - Department of Dermatology, Zealand University Hospital, Roskilde
- K Christensen
  - Danish Twin Registry, Department of Epidemiology, Biostatistics and Biodemography, University of Southern Denmark, Odense, Denmark
9
Johnsen PV, Riemer-Sørensen S, DeWan AT, Cahill ME, Langaas M. A new method for exploring gene-gene and gene-environment interactions in GWAS with tree ensemble methods and SHAP values. BMC Bioinformatics 2021; 22:230. [PMID: 33947323 PMCID: PMC8097909 DOI: 10.1186/s12859-021-04041-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 06/03/2020] [Accepted: 02/22/2021] [Indexed: 01/08/2023] Open
Abstract
Background The identification of gene–gene and gene–environment interactions in genome-wide association studies is challenging due to the unknown nature of the interactions and the overwhelmingly large number of possible combinations. Parametric regression models are suitable to look for prespecified interactions. Nonparametric models such as tree ensemble models, with the ability to detect any unspecified interaction, have previously been difficult to interpret. However, with the development of methods for model explainability, it is now possible to interpret tree ensemble models efficiently and with a strong theoretical basis. Results We propose a tree ensemble- and SHAP-based method for identifying as well as interpreting potential gene–gene and gene–environment interactions on large-scale biobank data. A set of independent cross-validation runs are used to implicitly investigate the whole genome. We apply and evaluate the method using data from the UK Biobank with obesity as the phenotype. The results are in line with previous research on obesity as we identify top SNPs previously associated with obesity. We further demonstrate how to interpret and visualize interaction candidates. Conclusions The new method identifies interaction candidates otherwise not detected with parametric regression models. However, further research is needed to evaluate the uncertainties of these candidates. The method can be applied to large-scale biobanks with high-dimensional data. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04041-7.
Affiliation(s)
- Pål V Johnsen
  - SINTEF DIGITAL, Forskningsveien 1, 0373, Oslo, Norway
  - Department of Mathematical Sciences, Norwegian University of Science and Technology, A. Getz vei 1, 7491, Trondheim, Norway
- Andrew Thomas DeWan
  - Department of Chronic Disease Epidemiology and Center for Perinatal, Pediatric and Environmental Epidemiology, Yale School of Public Health, 1 Church Street, New Haven, CT, 06510, USA
  - Gemini Center for Sepsis Research, Department of Circulation and Medical Imaging, NTNU, Norwegian University of Science and Technology, Prinsesse Kristinas gate 3, 7030, Trondheim, Norway
- Megan E Cahill
  - Department of Chronic Disease Epidemiology and Center for Perinatal, Pediatric and Environmental Epidemiology, Yale School of Public Health, 1 Church Street, New Haven, CT, 06510, USA
- Mette Langaas
  - Department of Mathematical Sciences, Norwegian University of Science and Technology, A. Getz vei 1, 7491, Trondheim, Norway
10
Rotroff DM. A Bioinformatics Crash Course for Interpreting Genomics Data. Chest 2020; 158:S113-S123. [PMID: 32658646 PMCID: PMC8176646 DOI: 10.1016/j.chest.2020.03.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2019] [Revised: 11/11/2019] [Accepted: 03/09/2020] [Indexed: 10/23/2022] Open
Abstract
Reductions in genotyping costs and improvements in computational power have made conducting genome-wide association studies (GWAS) standard practice for many complex diseases. GWAS is the assessment of genetic variants across the genome of many individuals to determine which, if any, genetic variants are associated with a specific trait. As with any analysis, there are evolving best practices that should be followed to ensure scientific rigor and reliability in the conclusions. This article presents a brief summary for many of the key bioinformatics considerations when either planning or evaluating GWAS. This review is meant to serve as a guide to those without deep expertise in bioinformatics and GWAS and give them tools to critically evaluate this popular approach to investigating complex diseases. In addition, a checklist is provided that can be used by investigators to evaluate whether a GWAS has appropriately accounted for the many potential sources of bias and generally followed current best practices.
Affiliation(s)
- Daniel M Rotroff
  - Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, OH
11
Blumenthal DB, Viola L, List M, Baumbach J, Tieri P, Kacprowski T. EpiGEN: an epistasis simulation pipeline. Bioinformatics 2020; 36:4957-4959. [DOI: 10.1093/bioinformatics/btaa245] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/14/2019] [Revised: 04/03/2020] [Accepted: 04/08/2020] [Indexed: 02/06/2023] Open
Abstract
Summary
Simulated data are crucial for evaluating epistasis detection tools in genome-wide association studies. Existing simulators are limited, as they do not account for linkage disequilibrium (LD), support limited interaction models of single nucleotide polymorphisms (SNPs) and only dichotomous phenotypes or depend on proprietary software. In contrast, EpiGEN supports SNP interactions of arbitrary order, produces realistic LD patterns and generates both categorical and quantitative phenotypes.
Availability and implementation
EpiGEN is implemented in Python 3 and is freely available at https://github.com/baumbachlab/epigen.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- David B Blumenthal
  - Technical University of Munich, School of Life Sciences Weihenstephan, Chair of Experimental Bioinformatics, 85354 Freising, Germany
- Lorenzo Viola
  - Technical University of Munich, School of Life Sciences Weihenstephan, Chair of Experimental Bioinformatics, 85354 Freising, Germany
- Markus List
  - Technical University of Munich, School of Life Sciences Weihenstephan, Chair of Experimental Bioinformatics, 85354 Freising, Germany
- Jan Baumbach
  - Technical University of Munich, School of Life Sciences Weihenstephan, Chair of Experimental Bioinformatics, 85354 Freising, Germany
- Paolo Tieri
  - CNR National Research Council, IAC Institute for Applied Computing, 00185 Rome, Italy
- Tim Kacprowski
  - Technical University of Munich, School of Life Sciences Weihenstephan, Chair of Experimental Bioinformatics, 85354 Freising, Germany