1
Luyapan J, Ji X, Li S, Xiao X, Zhu D, Duell EJ, Christiani DC, Schabath MB, Arnold SM, Zienolddiny S, Brunnström H, Melander O, Thornquist MD, MacKenzie TA, Amos CI, Gui J. A new efficient method to detect genetic interactions for lung cancer GWAS. BMC Med Genomics 2020; 13:162. [PMID: 33126877 PMCID: PMC7596958 DOI: 10.1186/s12920-020-00807-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/16/2019] [Accepted: 10/11/2020] [Indexed: 01/01/2023]
Abstract
BACKGROUND Genome-wide association studies (GWAS) have proven successful in predicting genetic risk of disease using single-locus models; however, identifying single nucleotide polymorphism (SNP) interactions at the genome-wide scale is limited due to computational and statistical challenges. We addressed the computational burden encountered when detecting SNP interactions for survival analysis, such as age of disease-onset. To confront this problem, we developed a novel algorithm, called the Efficient Survival Multifactor Dimensionality Reduction (ES-MDR) method, which used Martingale Residuals as the outcome parameter to estimate survival outcomes, and implemented the Quantitative Multifactor Dimensionality Reduction method to identify significant interactions associated with age of disease-onset. METHODS To demonstrate efficacy, we evaluated this method on two simulation data sets to estimate the type I error rate and power. Simulations showed that ES-MDR identified interactions using less computational workload and allowed for adjustment of covariates. We applied ES-MDR on the OncoArray-TRICL Consortium data with 14,935 cases and 12,787 controls for lung cancer (SNPs = 108,254) to search over all two-way interactions to identify genetic interactions associated with lung cancer age-of-onset. We tested the best model in an independent data set from the OncoArray-TRICL data. RESULTS Our experiment on the OncoArray-TRICL data identified many one-way and two-way models with a single-base deletion in the noncoding region of BRCA1 (HR 1.24, P = 3.15 × 10⁻¹⁵), as the top marker to predict age of lung cancer onset. CONCLUSIONS From the results of our extensive simulations and analysis of a large GWAS study, we demonstrated that our method is an efficient algorithm that identified genetic interactions to include in our models to predict survival outcomes.
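The first step named in this abstract, turning censored age-of-onset data into a single numeric outcome through martingale residuals, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes a null model with no covariates, so the cumulative hazard is simply the Nelson-Aalen estimate, whereas the abstract states that ES-MDR allows covariate adjustment. The function name `martingale_residuals` and the toy data are hypothetical.

```python
from collections import Counter


def martingale_residuals(times, events):
    """Null-model martingale residuals.

    Each residual is the event indicator minus the Nelson-Aalen
    cumulative hazard evaluated at that subject's observed time.
    `events` holds 1 for an observed onset and 0 for censoring.
    """
    onsets = Counter(t for t, d in zip(times, events) if d)
    exits = Counter(times)          # subjects leaving the risk set at each time
    at_risk = len(times)
    cumulative = {}
    hazard = 0.0
    for t in sorted(exits):
        hazard += onsets.get(t, 0) / at_risk   # increment d_j / n_j
        cumulative[t] = hazard
        at_risk -= exits[t]
    return [d - cumulative[t] for t, d in zip(times, events)]


# Toy example (assumed data): four subjects, one censored at time 4.
residuals = martingale_residuals([2, 4, 4, 7], [1, 1, 0, 1])
```

Subjects whose onset comes earlier than the cohort's hazard would predict get positive residuals, and late or censored subjects get negative ones. Under this null model the residuals sum to zero, which makes them usable as a continuous trait for a quantitative MDR search.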
Affiliation(s)
- Jennifer Luyapan
- Quantitative Biomedical Science Program, Geisel School of Medicine, Dartmouth College, Hanover, NH, 03755, USA
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, One Medical Center Dr., Lebanon, NH, 03756, USA
- Xuemei Ji
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, One Medical Center Dr., Lebanon, NH, 03756, USA
- Siting Li
- Quantitative Biomedical Science Program, Geisel School of Medicine, Dartmouth College, Hanover, NH, 03755, USA
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, One Medical Center Dr., Lebanon, NH, 03756, USA
- Xiangjun Xiao
- Institute for Clinical and Translational Research, Dan L. Duncan Comprehensive Cancer Center, Baylor College of Medicine, Houston, TX, 77030, USA
- Dakai Zhu
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, One Medical Center Dr., Lebanon, NH, 03756, USA
- Institute for Clinical and Translational Research, Dan L. Duncan Comprehensive Cancer Center, Baylor College of Medicine, Houston, TX, 77030, USA
- Eric J Duell
- Unit of Nutrition and Cancer, Catalan Institute of Oncology (ICO-IDIBELL), 08908, Barcelona, Spain
- David C Christiani
- Department of Environmental Health, Harvard School of Public Health, Boston, MA, 02115, USA
- Department of Medicine, Massachusetts General Hospital, Boston, MA, 02115, USA
- Matthew B Schabath
- Department of Cancer Epidemiology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, 33612, USA
- Susanne M Arnold
- Markey Cancer Center, University of Kentucky, First Floor, 800 Rose Street, Lexington, KY, 40508, USA
- Shanbeh Zienolddiny
- National Institute of Occupational Health, 0033 Gydas vei 8, 0033, Oslo, Norway
- Hans Brunnström
- Laboratory Medicine Region Skåne, Department of Clinical Sciences Lund, Pathology, Lund University, Lund, Sweden
- Olle Melander
- Department of Clinical Sciences, Lund University, Malmö, Sweden
- Mark D Thornquist
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
- Todd A MacKenzie
- Quantitative Biomedical Science Program, Geisel School of Medicine, Dartmouth College, Hanover, NH, 03755, USA
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, One Medical Center Dr., Lebanon, NH, 03756, USA
- Christopher I Amos
- Quantitative Biomedical Science Program, Geisel School of Medicine, Dartmouth College, Hanover, NH, 03755, USA
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, One Medical Center Dr., Lebanon, NH, 03756, USA
- Institute for Clinical and Translational Research, Dan L. Duncan Comprehensive Cancer Center, Baylor College of Medicine, Houston, TX, 77030, USA
- Jiang Gui
- Quantitative Biomedical Science Program, Geisel School of Medicine, Dartmouth College, Hanover, NH, 03755, USA
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, One Medical Center Dr., Lebanon, NH, 03756, USA
2
Tessier F, Fontaine-Bisson B, Lefebvre JF, El-Sohemy A, Roy-Gagnon MH. Investigating Gene-Gene and Gene-Environment Interactions in the Association Between Overnutrition and Obesity-Related Phenotypes. Front Genet 2019; 10:151. [PMID: 30886629 PMCID: PMC6409307 DOI: 10.3389/fgene.2019.00151] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 08/22/2018] [Accepted: 02/12/2019] [Indexed: 01/12/2023]
Abstract
Introduction: Animal studies suggested that NFKB1, IKBKB, and SOCS3 genes could be involved in the association between overnutrition and obesity. This study aims to investigate interactions involving these genes and macronutrient intakes affecting obesity-related phenotypes. Methods: We used a traditional statistical method, logistic regression, and compared it to alternative statistical methods, multifactor dimensionality reduction (MDR) and penalized logistic regression (PLR), to better detect genes/environment interactions in the Toronto Nutrigenomics and Health Study (n = 1639) using dichotomized body mass index (BMI) and waist circumference (WC) as obesity-related phenotypes. Exposure variables included genotype on 54 single nucleotide polymorphisms (NFKB1: 18, IKBKB: 9, SOCS3: 27), macronutrient (carbohydrates, protein, fat) and alcohol intakes and ethno-cultural background. Results: After correction for multiple testing, no interaction was found using logistic regression. MDR identified interactions between SOCS3 rs6501199 and rs4969172, and IKBKB rs3747811 affecting BMI in the Caucasian population; SOCS3 rs6501199 and NFKB1 rs1609798 affecting WC in the Caucasian population; and SOCS3 rs4436839 and IKBKB rs3747811 affecting WC in the South Asian population. PLR found a main effect of SOCS3 rs12944581 on BMI among the South Asian population. Conclusion: While MDR and PLR had discordant results, some models support results from previous studies. These results emphasize the need to use alternative statistical methods to investigate high-order interactions and suggest that variants in the nutrient-responsive hypothalamic IKKB/NF-kB signaling pathway may be involved in obesity pathogenesis.
Affiliation(s)
- François Tessier
- School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada
- Ahmed El-Sohemy
- Department of Nutritional Sciences, University of Toronto, Toronto, ON, Canada
3
Yang CH, Lin YD, Yen CY, Chuang LY, Chang HW. A systematic gene-gene and gene-environment interaction analysis of DNA repair genes XRCC1, XRCC2, XRCC3, XRCC4, and oral cancer risk. OMICS: A Journal of Integrative Biology 2016; 19:238-47. [PMID: 25831063 DOI: 10.1089/omi.2014.0121] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Indexed: 12/13/2022]
Abstract
Oral cancer is the sixth most common cancer worldwide with a high mortality rate. Biomarkers that anticipate susceptibility, prognosis, or response to treatments are much needed. Oral cancer is a polygenic disease involving complex interactions among genetic and environmental factors, which require multifaceted analyses. Here, we examined in a dataset of 103 oral cancer cases and 98 controls from Taiwan the association between oral cancer risk and the DNA repair genes X-ray repair cross-complementing group (XRCCs) 1-4, and the environmental factors of smoking, alcohol drinking, and betel quid (BQ) chewing. We employed logistic regression, multifactor dimensionality reduction (MDR), and hierarchical interaction graphs for analyzing gene-gene (G×G) and gene-environment (G×E) interactions. We identified a significantly elevated risk of the XRCC2 rs2040639 heterozygous variant among smokers [adjusted odds ratio (OR) 3.7, 95% confidence interval (CI)=1.1-12.1] and alcohol drinkers [adjusted OR=5.7, 95% CI=1.4-23.2]. The best two-factor based G×G interaction of oral cancer included the XRCC1 rs1799782 and XRCC2 rs2040639 [OR=3.13, 95% CI=1.66-6.13]. For the G×E interaction, the estimated OR of oral cancer for two (drinking-BQ chewing), three (XRCC1-XRCC2-BQ chewing), four (XRCC1-XRCC2-age-BQ chewing), and five factors (XRCC1-XRCC2-age-drinking-BQ chewing) were 32.9 [95% CI=14.1-76.9], 31.0 [95% CI=14.0-64.7], 49.8 [95% CI=21.0-117.7] and 82.9 [95% CI=31.0-221.5], respectively. Taken together, the genotypes of XRCC1 rs1799782 and XRCC2 rs2040639 DNA repair genes appear to be significantly associated with oral cancer. These were enhanced by exposure to certain environmental factors. The observations presented here warrant further research in larger study samples to examine their relevance for routine clinical care in oncology.
Affiliation(s)
- Cheng-Hong Yang
- Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan
4
Acikel C, Aydin Son Y, Celik C, Gul H. Evaluation of potential novel variations and their interactions related to bipolar disorders: analysis of genome-wide association study data. Neuropsychiatr Dis Treat 2016; 12:2997-3004. [PMID: 27920536 PMCID: PMC5127431 DOI: 10.2147/ndt.s112558] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 11/29/2022]
Abstract
BACKGROUND Multifactor dimensionality reduction (MDR) is a nonparametric approach that can be used to detect relevant interactions between single-nucleotide polymorphisms (SNPs). The aim of this study was to build the best genomic model based on SNP associations and to identify candidate polymorphisms that are the underlying molecular basis of the bipolar disorders. METHODS This study was performed on Whole-Genome Association Study of Bipolar Disorder (dbGaP [database of Genotypes and Phenotypes] study accession number: phs000017.v3.p1) data. After preprocessing of the genotyping data, three classification-based data mining methods (ie, random forest, naïve Bayes, and k-nearest neighbor) were performed. Additionally, as a nonparametric, model-free approach, the MDR method was used to evaluate the SNP profiles. The validity of these methods was evaluated using true classification rate, recall (sensitivity), precision (positive predictive value), and F-measure. RESULTS Random forests, naïve Bayes, and k-nearest neighbors identified 16, 13, and ten candidate SNPs, respectively. Surprisingly, the top six SNPs were reported by all three methods. Random forests and k-nearest neighbors were more successful than naïve Bayes, with recall values >0.95. On the other hand, MDR generated a model with comparable predictive performance based on five SNPs. Although different SNP profiles were identified in MDR compared to the classification-based models, all models mapped SNPs to the DOCK10 gene. CONCLUSION Three classification-based data mining approaches, random forests, naïve Bayes, and k-nearest neighbors, have prioritized similar SNP profiles as predictors of bipolar disorders, in contrast to MDR, which has found different SNPs through analysis of two-way and three-way interactions. 
The reduced number of associated SNPs discovered by MDR, without loss in the classification performance, would facilitate validation studies and decision support models, and would reduce the cost to develop predictive and diagnostic tests. Nevertheless, we need to emphasize that translation of genomic models to the clinical setting requires models with higher classification performance.
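The four evaluation measures named in this abstract (true classification rate, recall, precision, and F-measure) all derive from the same confusion-matrix counts. The sketch below shows the standard definitions; the function name `classification_metrics` and the toy labels are hypothetical, and the study's own software and data are not reproduced here.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, recall, precision and F-measure for binary labels.

    Labels are truthy for cases and falsy for controls. Ratios with an
    empty denominator are reported as 0.0 rather than raising.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0       # true classification rate
    recall = tp / (tp + fn) if tp + fn else 0.0           # sensitivity
    precision = tp / (tp + fp) if tp + fp else 0.0        # positive predictive value
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f_measure": f_measure}


# Toy example (assumed labels): three cases, two controls.
metrics = classification_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

A recall above 0.95, as reported for random forests and k-nearest neighbors, means fewer than 5% of true cases were missed; it says nothing by itself about how many controls were misclassified, which is why precision and the F-measure are reported alongside it.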
Affiliation(s)
- Yesim Aydin Son
- Department of Health Informatics, Graduate School of Informatics, Middle East Technical University
- Husamettin Gul
- Department of Medical Informatics, Gulhane Military Medical Academy, Ankara, Turkey
5
Huh I, Kwon MS, Park T. An Efficient Stepwise Statistical Test to Identify Multiple Linked Human Genetic Variants Associated with Specific Phenotypic Traits. PLoS One 2015; 10:e0138700. [PMID: 26406920 PMCID: PMC4583484 DOI: 10.1371/journal.pone.0138700] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 05/04/2015] [Accepted: 09/02/2015] [Indexed: 11/19/2022]
Abstract
Recent advances in genotyping methodologies have allowed genome-wide association studies (GWAS) to accurately identify genetic variants that associate with common or pathological complex traits. Although most GWAS have focused on associations with single genetic variants, joint identification of multiple genetic variants, and how they interact, is essential for understanding the genetic architecture of complex phenotypic traits. Here, we propose an efficient stepwise method based on the Cochran-Mantel-Haenszel test (for stratified categorical data) to identify causal joint multiple genetic variants in GWAS. This method combines the CMH statistic with a stepwise procedure to detect multiple genetic variants associated with specific categorical traits, using a series of associated I × J contingency tables and a null hypothesis of no phenotype association. Through a new stratification scheme based on the sum of minor allele count criteria, we make the method more feasible for GWAS data having sample sizes of several thousands. We also examine the properties of the proposed stepwise method via simulation studies, and show that the stepwise CMH test performs better than other existing methods (e.g., logistic regression and detection of associations by Markov blanket) for identifying multiple genetic variants. Finally, we apply the proposed approach to two genomic sequencing datasets to detect linked genetic variants associated with bipolar disorder and obesity, respectively.
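The Cochran-Mantel-Haenszel statistic at the core of this method can be illustrated for the simplest case. The sketch below covers only stratified 2 × 2 tables, whereas the abstract describes general I × J tables, and it omits both the stepwise selection and the authors' minor-allele-count stratification scheme. No continuity correction is applied. The function name `cmh_statistic` and the example counts are hypothetical.

```python
def cmh_statistic(tables):
    """Cochran-Mantel-Haenszel chi-square for a series of 2 x 2 tables.

    Each stratum is a tuple (a, b, c, d):
        a = carriers who are cases,     b = carriers who are controls,
        c = non-carriers who are cases, d = non-carriers who are controls.
    Under the null of no association within any stratum, the result is
    approximately chi-square distributed with 1 degree of freedom.
    """
    observed = expected = variance = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue                     # stratum carries no information
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        observed += a
        expected += row1 * col1 / n
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if variance == 0:
        return 0.0
    return (observed - expected) ** 2 / variance


# Toy example (assumed counts): two strata with the same direction of effect.
stat = cmh_statistic([(10, 5, 5, 10), (8, 4, 6, 9)])
```

Pooling the deviations across strata before squaring is what lets the test accumulate a consistent signal from many small tables, which matters when stratifying on previously selected variants leaves few subjects per stratum.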
Affiliation(s)
- Iksoo Huh
- Department of Statistics, Seoul National University, Gwanak-gu, Seoul, Korea
- Min-Seok Kwon
- Interdisciplinary Program in Bioinformatics, Seoul National University, Gwanak-gu, Seoul, Korea
- Taesung Park
- Department of Statistics, Seoul National University, Gwanak-gu, Seoul, Korea
- Interdisciplinary Program in Bioinformatics, Seoul National University, Gwanak-gu, Seoul, Korea
6
Gola D, Mahachie John JM, van Steen K, König IR. A roadmap to multifactor dimensionality reduction methods. Brief Bioinform 2015; 17:293-308. [PMID: 26108231 PMCID: PMC4793893 DOI: 10.1093/bib/bbv038] [Citation(s) in RCA: 56] [Impact Index Per Article: 6.2] [Received: 03/12/2015] [Indexed: 02/02/2023]
Abstract
Complex diseases are defined to be determined by multiple genetic and environmental factors alone as well as in interactions. To analyze interactions in genetic data, many statistical methods have been suggested, with most of them relying on statistical regression models. Given the known limitations of classical methods, approaches from the machine-learning community have also become attractive. From this latter family, a fast-growing collection of methods emerged that are based on the Multifactor Dimensionality Reduction (MDR) approach. Since its first introduction, MDR has enjoyed great popularity in applications and has been extended and modified multiple times. Based on a literature search, we here provide a systematic and comprehensive overview of these suggested methods. The methods are described in detail, and the availability of implementations is listed. Most recent approaches offer to deal with large-scale data sets and rare variants, which is why we expect these methods to even gain in popularity.
7
Li P, Guo M, Wang C, Liu X, Zou Q. An overview of SNP interactions in genome-wide association studies. Brief Funct Genomics 2014; 14:143-55. [PMID: 25241224 DOI: 10.1093/bfgp/elu036] [Citation(s) in RCA: 80] [Impact Index Per Article: 8.0] [Indexed: 12/28/2022]
Abstract
With the recent explosion in high-throughput genotyping technology, the amount and quality of single-nucleotide polymorphism (SNP) data has increased exponentially. Therefore, the identification of SNP interactions that are associated with common diseases is playing an increasing and important role in interpreting the genetic basis of disease susceptibility and in devising new diagnostic tests and treatments. However, because these data sets are large, although they typically have small sample sizes and low signal-to-noise ratios, there has been no major breakthrough despite many efforts, making this a major focus in the field of bioinformatics. In this article, we review the two main aspects of SNP interaction studies in recent years, namely the simulation and identification of SNP interactions, and then discuss the principles, efficiency and differences between these methods.
8
Abstract
MicroRNA profiling is an important task to investigate miRNA functions and recent technologies such as microarray, single nucleotide polymorphism (SNP), quantitative real-time PCR (qPCR), and next-generation sequencing (NGS) have played a major role for miRNA analysis. In this chapter, we give an overview on statistical approaches for gene expressions, SNP, qPCR, and NGS data including preliminary analyses (pre-processing, differential expression, classification, clustering, exploration of interactions, and the use of ontologies). Our goal is to outline the key approaches with a brief discussion of problems and avenues for their solutions and to give some examples for real-world use. Readers will be able to understand the different data formats (expression levels, sequences etc.) and they will be able to choose appropriate methods for their own research and application. In addition, we give brief notes on the most popular tools/packages for statistical genetic analysis. This chapter aims to serve as a brief introduction to different kinds of statistical methods and also provides an extensive source of references.
9
Gui J, Moore JH, Williams SM, Andrews P, Hillege HL, van der Harst P, Navis G, Van Gilst WH, Asselbergs FW, Gilbert-Diamond D. A Simple and Computationally Efficient Approach to Multifactor Dimensionality Reduction Analysis of Gene-Gene Interactions for Quantitative Traits. PLoS One 2013; 8:e66545. [PMID: 23805232 PMCID: PMC3689797 DOI: 10.1371/journal.pone.0066545] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.4] [Received: 10/26/2012] [Accepted: 05/07/2013] [Indexed: 12/03/2022]
Abstract
We present an extension of the two-class multifactor dimensionality reduction (MDR) algorithm that enables detection and characterization of epistatic SNP-SNP interactions in the context of a quantitative trait. The proposed Quantitative MDR (QMDR) method handles continuous data by modifying MDR’s constructive induction algorithm to use a T-test. QMDR replaces the balanced accuracy metric with a T-test statistic as the score to determine the best interaction model. We used a simulation to identify the empirical distribution of QMDR’s testing score. We then applied QMDR to genetic data from the ongoing prospective Prevention of Renal and Vascular End-Stage Disease (PREVEND) study.
Affiliation(s)
- Jiang Gui
- Institute for Quantitative Biomedical Sciences, Geisel School of Medicine, Lebanon, New Hampshire, United States of America
- Section of Biostatistics and Epidemiology, Departments of Community and Family Medicine, Geisel School of Medicine, Lebanon, New Hampshire, United States of America
- Jason H. Moore
- Institute for Quantitative Biomedical Sciences, Geisel School of Medicine, Lebanon, New Hampshire, United States of America
- Section of Biostatistics and Epidemiology, Departments of Community and Family Medicine, Geisel School of Medicine, Lebanon, New Hampshire, United States of America
- Department of Genetics, Geisel School of Medicine, Lebanon, New Hampshire, United States of America
- Scott M. Williams
- Institute for Quantitative Biomedical Sciences, Geisel School of Medicine, Lebanon, New Hampshire, United States of America
- Department of Genetics, Geisel School of Medicine, Lebanon, New Hampshire, United States of America
- Peter Andrews
- Institute for Quantitative Biomedical Sciences, Geisel School of Medicine, Lebanon, New Hampshire, United States of America
- Hans L. Hillege
- Department of Cardiology, University Medical Center Groningen, Groningen, The Netherlands
- Pim van der Harst
- Department of Cardiology, University Medical Center Groningen, Groningen, The Netherlands
- Gerjan Navis
- Department of Nephrology, University Medical Center Groningen, Groningen, The Netherlands
- Wiek H. Van Gilst
- Department of Cardiology, University Medical Center Groningen, Groningen, The Netherlands
- Folkert W. Asselbergs
- Department of Cardiology, Division of Heart and Lungs, University Medical Center Utrecht, Utrecht, The Netherlands
- Diane Gilbert-Diamond
- Institute for Quantitative Biomedical Sciences, Geisel School of Medicine, Lebanon, New Hampshire, United States of America
- Section of Biostatistics and Epidemiology, Departments of Community and Family Medicine, Geisel School of Medicine, Lebanon, New Hampshire, United States of America
10
Fang YH, Chiu YF. SVM-based generalized multifactor dimensionality reduction approaches for detecting gene-gene interactions in family studies. Genet Epidemiol 2013; 36:88-98. [PMID: 22851472 DOI: 10.1002/gepi.21602] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Indexed: 12/30/2022]
Abstract
Gene-gene interaction plays an important role in the etiology of complex diseases, which may exist without a genetic main effect. Most current statistical approaches, however, focus on assessing an interaction effect in the presence of the gene's main effects. It would be very helpful to develop methods that can detect not only the gene's main effects but also gene-gene interaction effects regardless of the existence of the gene's main effects while adjusting for confounding factors. In addition, when a disease variant is rare or when the sample size is quite limited, the statistical asymptotic properties are not applicable; therefore, approaches based on a reasonable and applicable computational framework would be practical and frequently applied. In this study, we have developed an extended support vector machine (SVM) method and an SVM-based pedigree-based generalized multifactor dimensionality reduction (PGMDR) method to study interactions in the presence or absence of main effects of genes with an adjustment for covariates using limited samples of families. A new test statistic is proposed for classifying the affected and the unaffected in the SVM-based PGMDR approach to improve performance in detecting gene-gene interactions. Simulation studies under various scenarios have been performed to compare the performances of the proposed and the original methods. The proposed and original approaches have been applied to a real data example for illustration and comparison. Both the simulation and real data studies show that the proposed SVM and SVM-based PGMDR methods have great prediction accuracies, consistencies, and power in detecting gene-gene interactions.
Affiliation(s)
- Yao-Hwei Fang
- Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Miaoli, Taiwan, ROC
11
High-order SNP combinations associated with complex diseases: efficient discovery, statistical power and functional interactions. PLoS One 2012; 7:e33531. [PMID: 22536319 PMCID: PMC3334940 DOI: 10.1371/journal.pone.0033531] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.5] [Received: 12/07/2011] [Accepted: 02/10/2012] [Indexed: 11/19/2022]
Abstract
There has been increased interest in discovering combinations of single-nucleotide polymorphisms (SNPs) that are strongly associated with a phenotype even if each SNP has little individual effect. Efficient approaches have been proposed for searching two-locus combinations from genome-wide datasets. However, for high-order combinations, existing methods either adopt a brute-force search which only handles a small number of SNPs (up to a few hundred), or use heuristic search that may miss informative combinations. In addition, existing approaches lack statistical power because of the use of statistics with high degrees-of-freedom and the huge number of hypotheses tested during combinatorial search. Due to these challenges, functional interactions in high-order combinations have not been systematically explored. We leverage discriminative-pattern-mining algorithms from the data-mining community to search for high-order combinations in case-control datasets. The substantially improved efficiency and scalability demonstrated on synthetic and real datasets with several thousands of SNPs allows the study of several important mathematical and statistical properties of SNP combinations with order as high as eleven. We further explore functional interactions in high-order combinations and reveal a general connection between the increase in discriminative power of a combination over its subsets and the functional coherence among the genes comprising the combination, supported by multiple datasets. Finally, we study several significant high-order combinations discovered from a lung-cancer dataset and a kidney-transplant-rejection dataset in detail to provide novel insights on the complex diseases. Interestingly, many of these associations involve combinations of common variations that occur in small fractions of the population. Thus, our approach is an alternative methodology for exploring the genetics of rare diseases for which the current focus is on individually rare variations.
12
Shang J, Zhang J, Sun Y, Liu D, Ye D, Yin Y. Performance analysis of novel methods for detecting epistasis. BMC Bioinformatics 2011; 12:475. [PMID: 22172045 PMCID: PMC3259123 DOI: 10.1186/1471-2105-12-475] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.4] [Received: 04/28/2011] [Accepted: 12/15/2011] [Indexed: 02/03/2023]
Abstract
Background Epistasis is recognized as fundamentally important for understanding the mechanism of disease-causing genetic variation. Though many novel methods for detecting epistasis have been proposed, few studies focus on their comparison. Undertaking a comprehensive comparison study is an urgent task and a pathway of the methods to real applications. Results This paper aims at a comparison study of epistasis detection methods through applying related software packages on datasets. For this purpose, we categorize methods according to their search strategies, and select five representative methods (TEAM, BOOST, SNPRuler, AntEpiSeeker and epiMODE) originating from different underlying techniques for comparison. The methods are tested on simulated datasets of different sizes, with various epistasis models, and with/without noise. The types of noise include missing data, genotyping error and phenocopy. Performance is evaluated by detection power (three forms are introduced), robustness, sensitivity and computational complexity. Conclusions None of the selected methods is perfect in all scenarios and each has its own merits and limitations. In terms of detection power, AntEpiSeeker performs best on detecting epistasis displaying marginal effects (eME) and BOOST performs best on identifying epistasis displaying no marginal effects (eNME). In terms of robustness, AntEpiSeeker is robust to all types of noise on eME models, BOOST is robust to genotyping error and phenocopy on eNME models, and SNPRuler is robust to phenocopy on eME models and missing data on eNME models. In terms of sensitivity, AntEpiSeeker is the winner on eME models and both SNPRuler and BOOST perform well on eNME models. In terms of computational complexity, BOOST is the fastest among the methods. In terms of overall performance, AntEpiSeeker and BOOST are recommended as the efficient and effective methods.
This comparison study may provide guidelines for applying the methods and further clues for epistasis detection.
Affiliation(s)
- Junliang Shang
- School of Computer Science & Technology, Xidian University, Xi'an 710071, China.
13
Oki NO, Motsinger-Reif AA. Multifactor dimensionality reduction as a filter-based approach for genome wide association studies. Front Genet 2011; 2:80. [PMID: 22303374 PMCID: PMC3268633 DOI: 10.3389/fgene.2011.00080] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 06/26/2011] [Accepted: 10/26/2011] [Indexed: 11/13/2022]
Abstract
Advances in genotyping technology and the multitude of genetic data available now provide a vast amount of data that is proving to be useful in the quest for a better understanding of human genetic diseases through the study of genetic variation. This has led to the development of approaches such as genome wide association studies (GWAS) designed specifically for interrogating variants across the genome for association with disease, typically by testing single locus, univariate associations. More recently it has been accepted that epistatic (interaction) effects may also be great contributors to these genetic effects, and GWAS methods are now being applied to find epistatic effects. The challenge for these methods still remains in the prioritization and interpretation of results, as it has also become standard for initial findings to be independently investigated in replication cohorts or functional studies. This is motivating the development and implementation of filter-based approaches to prioritize variants found to be significant in a discovery stage for follow-up for replication. Such filters must be able to detect both univariate and interactive effects. In the current study we present and evaluate the use of multifactor dimensionality reduction (MDR) as such a filter, with simulated data and a wide range of effect sizes. Additionally, we compare the performance of the MDR filter to a similar filter approach using logistic regression (LR), the more traditional approach used in GWAS analysis, as well as evaporative cooling (EC), another prominent machine learning filtering method. The results of our simulation study show that MDR is an effective method for such prioritization, and that it can detect main effects, and interactions with or without marginal effects. Importantly, it performed as well as EC and LR for main effect models.
It also significantly outperforms LR for various two-locus epistatic models, while it has equivalent results as EC for the epistatic models. The results of this study demonstrate the potential of MDR as a filter to detect gene-gene interactions in GWAS studies.
Affiliation(s)
- Noffisat O. Oki
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA
- Alison A. Motsinger-Reif
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA
- Department of Statistics, North Carolina State University, Raleigh, NC, USA
|
14
|
Li FG, Wang ZP, Hu G, Li H. Current status of SNPs interaction in genome-wide association study. Yi Chuan = Hereditas 2011; 33:901-10. [DOI: 10.3724/sp.j.1005.2011.00901] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/25/2022]
|
15
|
Pereira TV, Mingroni-Netto RC, Yamada Y. ADRB2 and LEPR gene polymorphisms: synergistic effects on the risk of obesity in Japanese. Obesity (Silver Spring) 2011; 19:1523-7. [PMID: 21233812 DOI: 10.1038/oby.2010.322] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 11/08/2022]
Abstract
The objective of the present study was to validate a recently reported synergistic effect between variants located in the leptin receptor (LEPR) gene and in the β-2 adrenergic receptor (ADRB2) gene on the risk of overweight/obesity. We studied a middle-aged/elderly sample of 4,193 nondiabetic Japanese subjects stratified according to gender (1,911 women and 2,282 men). The LEPR Gln223Arg (rs1137101) variant as well as both ADRB2 Arg16Gly (rs1042713) and Gln27Glu (rs1042714) polymorphisms were analyzed. The primary outcome was the risk of overweight/obesity defined as BMI ≥25 kg/m², whereas secondary outcomes included the risk of a BMI ≥27 kg/m² and BMI as a continuous variable. None of the studied polymorphisms showed statistically significant individual effects, regardless of the group or phenotype studied. Haplotype analysis also did not disclose any associations of ADRB2 polymorphisms with BMI. However, dimensionality reduction-based models confirmed significant interactions among the investigated variants for BMI as a continuous variable as well as for the risk of obesity defined as BMI ≥27 kg/m². All disclosed interactions were found in men only. Our results provide external validation for a male-specific ADRB2-LEPR interaction effect on the risk of overweight/obesity, but indicate that effect sizes associated with these interactions may be smaller in the population studied.
Affiliation(s)
- Tiago V Pereira
- Departamento de Genética e Biologia Evolutiva, Centro de Estudos do Genoma Humano, Instituto de Biociências, Universidade de São Paulo, São Paulo, Brazil
|
16
|
A comparison of multifactor dimensionality reduction and L1-penalized regression to identify gene-gene interactions in genetic association studies. Stat Appl Genet Mol Biol 2011; 10:Article 4. [PMID: 21291414 DOI: 10.2202/1544-6115.1613] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/18/2022]
Abstract
Recently, the amount of high-dimensional data has exploded, creating new analytical challenges for human genetics. Furthermore, much evidence suggests that common complex diseases may be due to complex etiologies such as gene-gene interactions, which are difficult to identify in high-dimensional data using traditional statistical approaches. Data-mining approaches are gaining popularity for variable selection in association studies, and one of the most commonly used methods to evaluate potential gene-gene interactions is Multifactor Dimensionality Reduction (MDR). Additionally, a number of penalized regression techniques, such as Lasso, are gaining popularity within the statistical community and are now being applied to association studies, including extensions for interactions. In this study, we compare the performance of MDR, the traditional lasso with L1 penalty (TL1), and the group lasso for categorical data with group-wise L1 penalty (GL1) to detect gene-gene interactions through a broad range of simulations. We find that each method has both advantages and disadvantages, and relative performance is context dependent. TL1 frequently over-fits, identifying false positive as well as true positive loci. MDR has higher power for epistatic models that exhibit independent main effects; for both Lasso methods, main effects tend to dominate. For purely epistatic models, GL1 has the best performance for lower minor allele frequencies, but MDR performs best for higher frequencies. These results provide guidance on when each approach might be best suited for detecting and characterizing interactions with different mechanisms.
|
17
|
Gui J, Andrew AS, Andrews P, Nelson HM, Kelsey KT, Karagas MR, Moore JH. A robust multifactor dimensionality reduction method for detecting gene-gene interactions with application to the genetic analysis of bladder cancer susceptibility. Ann Hum Genet 2010; 75:20-8. [PMID: 21091664 DOI: 10.1111/j.1469-1809.2010.00624.x] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.9] [Indexed: 01/27/2023]
Abstract
A central goal of human genetics is to identify susceptibility genes for common human diseases. An important challenge is modelling gene-gene interaction or epistasis that can result in nonadditivity of genetic effects. The multifactor dimensionality reduction (MDR) method was developed as a machine learning alternative to parametric logistic regression for detecting interactions in the absence of significant marginal effects. The goal of MDR is to reduce the dimensionality inherent in modelling combinations of polymorphisms using a computational approach called constructive induction. Here, we propose a Robust Multifactor Dimensionality Reduction (RMDR) method that performs constructive induction using a Fisher's Exact Test rather than a predetermined threshold. The advantage of this approach is that only statistically significant genotype combinations are considered in the MDR analysis. We use simulation studies to demonstrate that this approach will increase the success rate of MDR when there are only a few genotype combinations that are significantly associated with case-control status. We show that there is no loss of success rate when this is not the case. We then apply the RMDR method to the detection of gene-gene interactions in genotype data from a population-based study of bladder cancer in New Hampshire.
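The constructive induction step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `fisher_exact_two_sided` and `label_cells`, the two-sided form of the test, the 0.05 cutoff, the way the 2x2 table is built (cell versus all other cells), and the toy data are all assumptions made for the example.

```python
from collections import Counter
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def label_cells(genotypes, status, alpha=0.05):
    """Label each genotype combination as 'high', 'low', or 'unknown' risk.

    genotypes: list of tuples, one multi-locus genotype per subject.
    status: list of 0/1 values (1 = case, 0 = control).
    A cell gets a risk label only if it differs significantly from the rest
    (assumed rule); non-significant cells are left as 'unknown'.
    """
    cases = Counter(g for g, s in zip(genotypes, status) if s == 1)
    controls = Counter(g for g, s in zip(genotypes, status) if s == 0)
    n_cases, n_controls = sum(cases.values()), sum(controls.values())
    labels = {}
    for cell in set(cases) | set(controls):
        a, b = cases[cell], controls[cell]
        p = fisher_exact_two_sided(a, b, n_cases - a, n_controls - b)
        if p >= alpha:
            labels[cell] = "unknown"
        elif a * n_controls > b * n_cases:  # case:control ratio above overall ratio
            labels[cell] = "high"
        else:
            labels[cell] = "low"
    return labels


# Toy data (hypothetical): three two-locus genotype cells.
genotypes = [(0, 0)] * 22 + [(1, 1)] * 22 + [(0, 1)] * 20
status = [1] * 20 + [0] * 2 + [1] * 2 + [0] * 20 + [1] * 10 + [0] * 10
labels = label_cells(genotypes, status)
```

In this toy example the balanced cell `(0, 1)` carries no significant signal, so it is left out of the high/low pooling instead of being forced to one side by a fixed ratio threshold.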
Affiliation(s)
- Jiang Gui
- Dartmouth Medical School, Lebanon, NH 03756, USA
|
18
|
Gui J, Moore JH, Kelsey KT, Marsit CJ, Karagas MR, Andrew AS. A novel survival multifactor dimensionality reduction method for detecting gene-gene interactions with application to bladder cancer prognosis. Hum Genet 2010; 129:101-10. [PMID: 20981448 DOI: 10.1007/s00439-010-0905-5] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.4] [Received: 07/13/2010] [Accepted: 10/17/2010] [Indexed: 11/30/2022]
Abstract
The widespread use of high-throughput methods of single nucleotide polymorphism (SNP) genotyping has created a number of computational and statistical challenges. The problem of identifying SNP-SNP interactions in case-control studies has been studied extensively, and a number of new techniques have been developed. Little progress has been made, however, in the analysis of SNP-SNP interactions in relation to time-to-event data, such as patient survival time or time to cancer relapse. We present an extension of the two-class multifactor dimensionality reduction (MDR) algorithm that enables detection and characterization of epistatic SNP-SNP interactions in the context of survival analysis. The proposed Survival MDR (Surv-MDR) method handles survival data by modifying MDR's constructive induction algorithm to use the log-rank test. Surv-MDR replaces balanced accuracy with log-rank test statistics as the score to determine the best models. We simulated datasets with a survival outcome related to two loci in the absence of any marginal effects. We compared Surv-MDR with Cox regression for their ability to identify the true predictive loci in these simulated data. We also used this simulation to construct the empirical distribution of Surv-MDR's testing score. We then applied Surv-MDR to genetic data from a population-based epidemiologic study to find prognostic markers of survival time following a bladder cancer diagnosis. We identified several two-loci SNP combinations that have strong associations with patients' survival outcome. Surv-MDR is capable of detecting interaction models with weak main effects. These epistatic models tend to be dropped by traditional Cox regression approaches to evaluating interactions. With improved efficiency to handle genome wide datasets, Surv-MDR will play an important role in a research strategy that embraces the complexity of the genotype-phenotype mapping relationship since epistatic interactions are an important component of the genetic basis of disease.
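The idea of using the log-rank test both for constructive induction and as the model score can be sketched as below. This is a simplified illustration under stated assumptions, not the published Surv-MDR code: the helper names `logrank_z` and `surv_mdr_score`, the rule that a cell is high-risk when its signed statistic is positive, and the toy data are hypothetical, and cross-validation and permutation testing are omitted.

```python
from math import sqrt


def logrank_z(times, events, in_group):
    """Signed two-group log-rank statistic.

    Returns (observed - expected events in the group) / standard deviation,
    so a positive value means the group has more events than expected.
    """
    obs_minus_exp, var = 0.0, 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        at_risk = [g for ti, g in zip(times, in_group) if ti >= t]
        n, n1 = len(at_risk), sum(at_risk)
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d1 = sum(1 for ti, e, g in zip(times, events, in_group)
                 if ti == t and e and g)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of the group's event count at time t.
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp / sqrt(var) if var > 0 else 0.0


def surv_mdr_score(genotypes, times, events):
    """Pool genotype cells into high/low risk, then score the pooled split."""
    high = {cell for cell in set(genotypes)
            if logrank_z(times, events, [g == cell for g in genotypes]) > 0}
    pooled = [g in high for g in genotypes]
    return high, logrank_z(times, events, pooled)


# Toy data (hypothetical): cell (0, 0) fails early, cell (1, 1) survives longer.
genotypes = [(0, 0)] * 4 + [(1, 1)] * 4
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 1, 1, 1, 0, 0]  # 0 = censored
high_cells, score = surv_mdr_score(genotypes, times, events)
```

In a full search, a score like this would be computed for every candidate SNP combination and the combination with the largest statistic kept as the best model.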
Affiliation(s)
- Jiang Gui
- Department of Community and Family Medicine, Norris-Cotton Cancer Center, Dartmouth Medical School, One Medical Center Drive, Lebanon, NH 03756, USA
|