1
Majumdar S, Basu S, McGue M, Chatterjee S. Simultaneous selection of multiple important single nucleotide polymorphisms in familial genome wide association studies data. Sci Rep 2023; 13:8476. [PMID: 37231056 PMCID: PMC10213008 DOI: 10.1038/s41598-023-35379-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2022] [Accepted: 05/17/2023] [Indexed: 05/27/2023] Open
Abstract
We propose a fast, resampling-based variable selection technique for detecting relevant single nucleotide polymorphisms (SNPs) in a multi-marker mixed effect model. Due to computational complexity, current practice primarily involves testing the effect of one SNP at a time, commonly termed 'single SNP association analysis'. Joint modeling of genetic variants within a gene or pathway may have better power to detect associated genetic variants, especially those with weak effects. In this paper, we propose a computationally efficient model selection approach, based on the e-values framework, for single SNP detection in families while utilizing information on multiple SNPs simultaneously. To overcome the computational bottleneck of traditional model selection methods, our method trains a single model and utilizes a fast and scalable bootstrap procedure. We illustrate through numerical studies that our proposed method is more effective in detecting SNPs associated with a trait than either single-marker analysis using family data or model selection methods that ignore the familial dependency structure. Further, we perform gene-level analysis on the Minnesota Center for Twin and Family Research (MCTFR) dataset using our method and detect several SNPs that have been implicated as associated with alcohol consumption.
Affiliation(s)
- Subhabrata Majumdar
  - University of Minnesota Twin Cities, Minneapolis, USA
  - AI Risk and Vulnerability Alliance, Seattle, USA
- Saonli Basu
  - University of Minnesota Twin Cities, Minneapolis, USA
- Matt McGue
  - University of Minnesota Twin Cities, Minneapolis, USA
2
Lin HY, Huang PY, Tseng TS, Park JY. SNPxE: SNP-environment interaction pattern identifier. BMC Bioinformatics 2021; 22:425. [PMID: 34493206 PMCID: PMC8425112 DOI: 10.1186/s12859-021-04326-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2021] [Accepted: 08/11/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Interactions of single nucleotide polymorphisms (SNPs) and environmental factors play an important role in understanding the pathogenesis of complex diseases. A growing number of SNP-environment studies have been conducted in the past decade; however, the statistical methods for evaluating SNP-environment interactions are still underdeveloped. The conventional statistical approach, a full interaction model with an additive SNP mode, tests one specific interaction type, so it tends to lead to false-negative findings. To increase detection accuracy, a statistical tool that effectively detects various SNP-environment interaction patterns is necessary.
RESULTS: SNPxE, a SNP-environment interaction pattern identifier, tests multiple interaction patterns associated with a phenotype for each SNP-environment pair. SNPxE evaluates 27 interaction patterns for an ordinal environment factor and 18 patterns for a categorical environment factor. For detecting SNP-environment interactions, SNPxE considers three major components: (1) model structure, (2) SNP's inheritance mode, and (3) risk direction. Among the multiple testing patterns, the best interaction pattern will be identified based on the Bayesian information criterion or the smallest p-value of the interaction. Furthermore, the risk sub-groups based on the SNPs and environmental factors can be identified. SNPxE can be applied to both numeric and binary phenotypes. For better interpretation of results, a heat-table of the outcome proportions can be generated for the sub-groups of a SNP-environment pair.
CONCLUSIONS: SNPxE is a valuable tool for intensively evaluating SNP-environment interactions, and the SNPxE findings can provide insights for solving the missing heritability issue. The R function of SNPxE is freely available for download at GitHub ( https://github.com/LinHuiyi/SIPI ).
Affiliation(s)
- Hui-Yi Lin
  - Biostatistics Program, School of Public Health, Louisiana State University Health Sciences Center, New Orleans, LA, 70112, USA
- Po-Yu Huang
  - Computational Intelligence Technology Center, Industrial Technology Research Institute, Hsinchu City, Taiwan
- Tung-Sung Tseng
  - Behavioral and Community Health Sciences Program, School of Public Health, Louisiana State University Health Sciences Center, New Orleans, LA, 70112, USA
- Jong Y Park
  - Department of Cancer Epidemiology, Moffitt Cancer Center and Research Institute, Tampa, FL, 33612, USA
3
Yang T, Chen H, Tang H, Li D, Wei P. A powerful and data-adaptive test for rare-variant-based gene-environment interaction analysis. Stat Med 2019; 38:1230-1244. [PMID: 30460711 PMCID: PMC6399020 DOI: 10.1002/sim.8037] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 02/26/2018] [Revised: 10/17/2018] [Accepted: 10/22/2018] [Indexed: 12/20/2022]
Abstract
As whole-exome/genome sequencing data become increasingly available in genetic epidemiology research consortia, there is emerging interest in testing the interactions between rare genetic variants and environmental exposures that modify the risk of complex diseases. However, testing rare-variant-based gene-by-environment interactions (GxE) is more challenging than testing the genetic main effects due to the difficulty in correctly estimating the latter under the null hypothesis of no GxE effects and the presence of neutral variants. In response, we have developed a family of powerful and data-adaptive GxE tests, called "aGE" tests, in the framework of the adaptive powered score test, originally proposed for testing the genetic main effects. Using extensive simulations, we show that aGE tests can control the type I error rate in the presence of a large number of neutral variants or a nonlinear environmental main effect, and the power is more resilient to the inclusion of neutral variants than that of existing methods. We demonstrate the performance of the proposed aGE tests using Pancreatic Cancer Case-Control Consortium Exome Chip data. An R package "aGE" is available at http://github.com/ytzhong/projects/.
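As a rough illustration of the adaptive powered score idea that this abstract builds on, the Python sketch below computes sum of powered score (SPU) statistics for genetic main effects and combines them into an adaptive (aSPU-style) p-value by permutation. It is not the aGE test and not the authors' R package: the function names, the set of powers, the intercept-only null model, and the simulated genotypes are all assumptions made for illustration.

```python
import numpy as np

def spu_stats(U, gammas):
    """SPU(gamma) = sum_j U_j**gamma; gamma = inf is taken as max_j |U_j|."""
    return np.array([np.max(np.abs(U)) if np.isinf(g) else np.sum(U ** g)
                     for g in gammas])

def aspu_permutation(X, y, gammas=(1, 2, 3, 4, np.inf), n_perm=500, seed=0):
    """Permutation p-values for each SPU(gamma) and for their adaptive minimum.

    Hypothetical helper: assumes a quantitative trait and no covariates, so the
    null model is an intercept only and the score vector is X' (y - mean(y)).
    """
    rng = np.random.default_rng(seed)
    r = y - y.mean()                              # residuals under the null
    t_obs = np.abs(spu_stats(X.T @ r, gammas))
    t_null = np.empty((n_perm, len(gammas)))
    for b in range(n_perm):
        t_null[b] = np.abs(spu_stats(X.T @ rng.permutation(r), gammas))
    # p-value of each observed SPU(gamma)
    p_spu = (1 + (t_null >= t_obs).sum(axis=0)) / (n_perm + 1)
    # p-value of each permuted statistic relative to all permutations,
    # used to build the null distribution of the minimum p-value
    p_null = np.empty_like(t_null)
    for k in range(len(gammas)):
        srt = np.sort(t_null[:, k])
        n_ge = n_perm - np.searchsorted(srt, t_null[:, k], side="left")
        p_null[:, k] = n_ge / n_perm
    p_aspu = (1 + np.sum(p_null.min(axis=1) <= p_spu.min())) / (n_perm + 1)
    return p_spu, p_aspu

# Simulated example (assumed data): 10 variants, only the first is causal.
rng = np.random.default_rng(1)
X = rng.binomial(2, 0.3, size=(200, 10)).astype(float)
y = 1.0 * X[:, 0] + rng.normal(size=200)
p_spu, p_aspu = aspu_permutation(X, y)
print("SPU p-values:", p_spu, "aSPU p-value:", p_aspu)
```

Reusing the same permutations for the individual SPU p-values and for the null distribution of their minimum avoids a second, nested layer of resampling, which is the usual motivation for this construction.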
Affiliation(s)
- Tianzhong Yang
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
  - Department of Biostatistics and Data Science, School of Public Health, The University of Texas Health Science Center at Houston, TX 77030, USA
- Han Chen
  - Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, TX 77030, USA
  - Center for Precision Health, School of Public Health and School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Hongwei Tang
  - Departments of Gastrointestinal Medical Oncology and Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Donghui Li
  - Departments of Gastrointestinal Medical Oncology and Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Peng Wei
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
4
Leffers HCB, Lange T, Collins C, Ulff-Møller CJ, Jacobsen S. The study of interactions between genome and exposome in the development of systemic lupus erythematosus. Autoimmun Rev 2019; 18:382-392. [PMID: 30772495 DOI: 10.1016/j.autrev.2018.11.005] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 11/12/2018] [Accepted: 11/18/2018] [Indexed: 12/31/2022]
Abstract
Systemic lupus erythematosus (SLE) is a systemic inflammatory autoimmune disease characterized by a broad spectrum of clinical and serological manifestations. This may reflect a complex and multifactorial etiology involving several identified genetic and environmental factors, which nevertheless do not explain the full risk of SLE. Established SLE risk genotypes are either very rare or have modest effect sizes, and twin studies indicate that factors other than genetics must be operative in SLE etiology. The exposome comprises the cumulative environmental influences on an individual and the associated biological responses through the lifespan. It has been demonstrated that exposure to silica, smoking and exogenous hormones are candidate environmental risk factors in SLE, while alcohol consumption seems to be protective. Very few studies have investigated potential gene-environment interactions to determine whether some of the unexplained SLE risk is attributable to them. Even fewer have focused on interactions between specific risk genotypes and environmental exposures relevant to SLE pathogenesis. Cohort and case-control studies may provide data to suggest such biological interactions, and various statistical measures of interaction can indicate their magnitude. However, such studies often have very large sample-size requirements, and we suggest that the rarity of SLE can to some extent be compensated for by increasing the ratio of controls. This review summarizes the current body of knowledge on gene-environment interactions in SLE. We argue for the prioritization of studies that incorporate the increasingly detailed information available on the genome and exposome relevant to SLE, as they have the potential to disclose new aspects of SLE pathogenesis, including phenotype heterogeneity.
Affiliation(s)
- Henrik Christian Bidstrup Leffers
  - Copenhagen Lupus and Vasculitis Clinic, Center for Rheumatology and Spine Diseases, Copenhagen University Hospital, Rigshospitalet, Copenhagen, Denmark
- Theis Lange
  - Department of Public Health, Section of Biostatistics, University of Copenhagen, Denmark; Center for Statistical Science, Peking University, Beijing, China
- Christopher Collins
  - Department of Rheumatology, MedStar Washington Hospital Center, Washington, DC, USA
- Constance Jensina Ulff-Møller
  - Copenhagen Lupus and Vasculitis Clinic, Center for Rheumatology and Spine Diseases, Copenhagen University Hospital, Rigshospitalet, Copenhagen, Denmark
- Søren Jacobsen
  - Copenhagen Lupus and Vasculitis Clinic, Center for Rheumatology and Spine Diseases, Copenhagen University Hospital, Rigshospitalet, Copenhagen, Denmark; Department of Clinical Medicine, Faculty of Health Science, University of Copenhagen, Denmark
5
Coombes BJ, Basu S, McGue M. A linear mixed model framework for gene-based gene-environment interaction tests in twin studies. Genet Epidemiol 2018; 42:648-663. [PMID: 30203856 DOI: 10.1002/gepi.22150] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 12/13/2017] [Revised: 04/25/2018] [Accepted: 04/30/2018] [Indexed: 02/03/2023]
Abstract
Interaction between genes and environments (G×E) can be well investigated in families due to the shared genes and environment among family members. However, the majority of the current tests of G×E interaction between a set of variants and an environment are only suitable for studies with unrelated subjects. In this paper, we extend several G×E interaction tests to a linear mixed model framework to study interaction between a set of correlated environments and a candidate gene in families. The correlated environments can either be modeled separately or jointly in one model. We demonstrate theoretically that the tests developed by modeling correlated environments separately are valid and present a computationally fast alternative to detect G×E interaction in families. For either strategy, we propose treating the genetic main effects as a random effect to reduce the number of main-effect parameters and thus improve the power to detect interactions. Additionally, we propose a generalization of a test of interaction that adaptively sums the interactions using a sequential algorithm. This generalized set of tests, referred to as the sequential algorithm for the sum of powered score (Seq-SPU) family of tests, can be expressed as a weighted version of the SPU. We find that the adaptive version of our test, Seq-aSPU, can outperform aSPU in cases where the interaction effects are in opposite directions. We applied these methods to the Minnesota Center for Twin and Family Research data set and found one significant gene in interaction with four psychosocial environmental factors affecting alcohol consumption among the twins.
Affiliation(s)
- Brandon J Coombes
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota
- Saonli Basu
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota
- Matt McGue
  - Department of Psychology, School of Public Health, University of Minnesota, Minneapolis, Minnesota
6
Application of the parametric bootstrap for gene-set analysis of gene-environment interactions. Eur J Hum Genet 2018; 26:1679-1686. [PMID: 30089830 DOI: 10.1038/s41431-018-0236-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 12/28/2017] [Revised: 07/02/2018] [Accepted: 07/19/2018] [Indexed: 02/07/2023] Open
Abstract
Testing for gene-environment (GE) interactions in a gene-set defined by a biological pathway can help us understand the interplay between genes and environments and provide insight into disease etiology. A self-contained gene-set analysis can be performed by combining gene-level p-values using approaches such as the Gamma Method. In a gene-set analysis of genetic main effects, permutation approaches are commonly used to avoid an inflated type I error rate caused by correlation of genes within the same pathway. However, when testing interaction effects, it is typically not possible to construct an exact permutation test. We therefore propose using a parametric bootstrap. For testing an interaction term, this approach requires fitting the null model, which only contains main effects; however, for a gene-set GE interaction model, the number of main effects can be large and therefore they may not be estimable. To estimate the main effects of SNPs in a gene-set, we propose modeling them as random effects. We then repeatedly simulate null data from this model and analyze it to generate the null distribution of gene-set GE p-values, allowing for an empirical assessment of significance of the global GE effect in the gene-set of interest. Through simulation, we demonstrate that this approach maintains correct type I error, and is well powered to detect GE interactions. We apply our method to test whether the association of obesity with bipolar disorder (BD) is modified by genetic variation in the Wnt signaling pathway.
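The parametric bootstrap described in this abstract can be sketched in a much simplified setting: a single SNP, a quantitative trait, and ordinary least squares in place of the random-effects null model and the gene-set level combination of p-values used in the paper. The function names, the Wald-type statistic, and the simulated data below are assumptions for illustration only.

```python
import numpy as np

def interaction_stat(y, g, e):
    """Squared Wald-type statistic for the g*e coefficient in an OLS fit."""
    X = np.column_stack([np.ones_like(y), g, e, g * e])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[3] ** 2 / cov[3, 3]

def parametric_bootstrap_pvalue(y, g, e, n_boot=500, seed=0):
    """Empirical p-value for the interaction via a parametric bootstrap.

    Hypothetical helper: the null model holds main effects only and is fitted
    by OLS; null data are simulated from its fitted values and residual SD.
    """
    rng = np.random.default_rng(seed)
    X0 = np.column_stack([np.ones_like(y), g, e])      # null model: no g*e term
    beta0, *_ = np.linalg.lstsq(X0, y, rcond=None)
    fitted0 = X0 @ beta0
    resid0 = y - fitted0
    sigma0 = np.sqrt(resid0 @ resid0 / (len(y) - X0.shape[1]))
    t_obs = interaction_stat(y, g, e)
    t_null = np.array([
        interaction_stat(fitted0 + rng.normal(0.0, sigma0, len(y)), g, e)
        for _ in range(n_boot)
    ])
    return (1 + np.sum(t_null >= t_obs)) / (n_boot + 1)

# Simulated example (assumed data) with a true interaction effect.
rng = np.random.default_rng(2)
n = 300
g = rng.binomial(2, 0.3, size=n).astype(float)
e = rng.normal(size=n)
y = 0.3 * g + 0.5 * e + 0.8 * g * e + rng.normal(size=n)
p_boot = parametric_bootstrap_pvalue(y, g, e)
print("parametric bootstrap p-value:", p_boot)
```

Because the simulated null data keep the fitted main effects of both the SNP and the environment, the null distribution reflects "no interaction" rather than "no association at all", which a naive permutation of the phenotype would not guarantee.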
7
Mazo Lopera MA, Coombes BJ, de Andrade M. An Efficient Test for Gene-Environment Interaction in Generalized Linear Mixed Models with Family Data. Int J Environ Res Public Health 2017; 14:1134. [PMID: 28953253 PMCID: PMC5664635 DOI: 10.3390/ijerph14101134] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/19/2017] [Revised: 09/20/2017] [Accepted: 09/25/2017] [Indexed: 02/07/2023]
Abstract
Gene-environment (GE) interaction has important implications in the etiology of complex diseases that are caused by a combination of genetic factors and environment variables. Several authors have developed GE analysis in the context of independent subjects or longitudinal data using a gene-set. In this paper, we propose to analyze GE interaction for discrete and continuous phenotypes in family studies by incorporating the relatedness among the relatives for each family into a generalized linear mixed model (GLMM) and by using a gene-based variance component test. In addition, we deal with collinearity problems arising from linkage disequilibrium among single nucleotide polymorphisms (SNPs) by considering their coefficients as random effects under the null model estimation. We show that the best linear unbiased predictor (BLUP) of such random effects in the GLMM is equivalent to the ridge regression estimator. This equivalence provides a simple method to estimate the ridge penalty parameter in comparison to other computationally-demanding estimation approaches based on cross-validation schemes. We evaluated the proposed test using simulation studies and applied it to real data from the Baependi Heart Study consisting of 76 families. Using our approach, we identified an interaction between BMI and the Peroxisome Proliferator Activated Receptor Gamma (PPARG) gene associated with diabetes.
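The equivalence between the BLUP of random SNP effects and the ridge estimator can be checked numerically in the linear Gaussian special case, where the ridge penalty equals the ratio of the residual variance to the random-effect variance. The sketch below is not the authors' GLMM implementation: it assumes a centered phenotype with no fixed effects and known variance components, and the function names and simulated genotypes are hypothetical.

```python
import numpy as np

def blup_random_effects(Z, y, sigma_u2, sigma_e2):
    """BLUP of u in y = Z u + e, with u ~ N(0, sigma_u2 I), e ~ N(0, sigma_e2 I)."""
    n = Z.shape[0]
    V = sigma_u2 * Z @ Z.T + sigma_e2 * np.eye(n)    # marginal covariance of y
    return sigma_u2 * Z.T @ np.linalg.solve(V, y)

def ridge_estimate(Z, y, lam):
    """Ridge regression estimate (Z'Z + lam I)^{-1} Z'y."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)

# Simulated example (assumed data): 8 SNPs coded 0/1/2, then centered.
rng = np.random.default_rng(0)
n, p = 50, 8
Z = rng.integers(0, 3, size=(n, p)).astype(float)
Z -= Z.mean(axis=0)
sigma_u2, sigma_e2 = 0.5, 2.0
y = Z @ rng.normal(0.0, np.sqrt(sigma_u2), p) + rng.normal(0.0, np.sqrt(sigma_e2), n)

u_blup = blup_random_effects(Z, y, sigma_u2, sigma_e2)
u_ridge = ridge_estimate(Z, y, lam=sigma_e2 / sigma_u2)
print(np.allclose(u_blup, u_ridge))
```

In practice the variance components are estimated (for example by REML) rather than known, which is what lets the mixed model supply the ridge penalty without a cross-validation search.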
Affiliation(s)
- Mauricio A Mazo Lopera
  - School of Statistics, National University of Colombia, Medellín, Antioquia 050022, Colombia
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55905, USA
- Brandon J Coombes
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55905, USA
- Mariza de Andrade
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55905, USA