1. Chi JT, Ipsen ICF, Hsiao TH, Lin CH, Wang LS, Lee WP, Lu TP, Tzeng JY. SEAGLE: A Scalable Exact Algorithm for Large-Scale Set-Based Gene-Environment Interaction Tests in Biobank Data. Front Genet 2021; 12:710055. [PMID: 34795690 PMCID: PMC8593472 DOI: 10.3389/fgene.2021.710055] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/15/2021] [Accepted: 09/13/2021] [Indexed: 11/13/2022] Open
Abstract
The explosion of biobank data offers unprecedented opportunities for gene-environment interaction (G×E) studies of complex diseases because of the large sample sizes and the rich collection of genetic and non-genetic information. However, the extremely large sample size also introduces new computational challenges in G×E assessment, especially for set-based G×E variance component (VC) tests, which are a widely used strategy to boost overall G×E signals and to evaluate the joint G×E effect of multiple variants from a biologically meaningful unit (e.g., gene). In this work, we focus on continuous traits and present SEAGLE, a Scalable Exact AlGorithm for Large-scale set-based G×E tests, to permit G×E VC tests for biobank-scale data. SEAGLE employs modern matrix computations to calculate the test statistic and p-value of the G×E VC test in a computationally efficient fashion, without imposing additional assumptions or relying on approximations. SEAGLE can easily accommodate sample sizes on the order of 10^5, is implementable on standard laptops, and does not require specialized computing equipment. We demonstrate the performance of SEAGLE using extensive simulations. We illustrate its utility by conducting genome-wide gene-based G×E analysis on the Taiwan Biobank data to explore the interaction of gene and physical activity status on body mass index.
Affiliation(s)
- Jocelyn T. Chi
  - Department of Statistics, North Carolina State University, Raleigh, NC, United States
- Ilse C. F. Ipsen
  - Department of Mathematics, North Carolina State University, Raleigh, NC, United States
- Tzu-Hung Hsiao
  - Department of Medical Research, Taichung Veterans General Hospital, Taichung, Taiwan
- Ching-Heng Lin
  - Department of Medical Research, Taichung Veterans General Hospital, Taichung, Taiwan
- Li-San Wang
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Wan-Ping Lee
  - Penn Neurodegeneration Genomics Center, Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Tzu-Pin Lu
  - Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
- Jung-Ying Tzeng
  - Department of Statistics, North Carolina State University, Raleigh, NC, United States
  - Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
  - Department of Statistics, National Cheng-Kung University, Tainan, Taiwan
2. Zhang H, Zhao N, Mehrotra DV, Shen J. Composite Kernel Association Test (CKAT) for SNP-set joint assessment of genotype and genotype-by-treatment interaction in Pharmacogenetics studies. Bioinformatics 2020; 36:3162-3168. [PMID: 32101275 DOI: 10.1093/bioinformatics/btaa125] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/07/2019] [Revised: 02/14/2020] [Accepted: 02/19/2020] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: It is of substantial interest to discover novel genetic markers that influence drug response in order to develop personalized treatment strategies that maximize therapeutic efficacy and safety. To help enable such discoveries, we focus on testing the association between the cumulative effect of multiple single nucleotide polymorphisms (SNPs) in a particular genomic region and a drug response of interest. However, existing methods are either computationally inefficient or unable to control type I error and provide adequate power for whole exome or genome analysis in Pharmacogenetics (PGx) studies with small sample sizes. RESULTS: In this article, we propose the Composite Kernel Association Test (CKAT), a flexible and robust kernel machine-based approach to jointly test the genetic main effect and SNP-treatment interaction effect for SNP-sets in PGx assessments embedded within randomized clinical trials. An analytic procedure is developed to accurately calculate the P-value so that computationally intensive procedures (e.g. permutation or perturbation) can be avoided. We evaluate CKAT through extensive simulation studies and application to the gene-level association test of the reduction in Clostridium difficile infection recurrence in patients treated with bezlotoxumab. The results demonstrate that the proposed CKAT controls type I error well for PGx studies, is efficient for whole exome/genome association analysis and provides better power performance than existing methods across multiple scenarios. AVAILABILITY AND IMPLEMENTATION: The R package CKAT is publicly available on CRAN https://cran.r-project.org/web/packages/CKAT/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hong Zhang
  - Biostatistics and Research Decision Sciences, Merck Research Laboratories, Merck & Co., Inc., Rahway, NJ 07065, USA
- Ni Zhao
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
- Devan V Mehrotra
  - Biostatistics and Research Decision Sciences, Merck Research Laboratories, Merck & Co., Inc., North Wales, PA 19454, USA
- Judong Shen
  - Biostatistics and Research Decision Sciences, Merck Research Laboratories, Merck & Co., Inc., Rahway, NJ 07065, USA
3. Zhao N, Zhang H, Clark JJ, Maity A, Wu MC. Composite kernel machine regression based on likelihood ratio test for joint testing of genetic and gene–environment interaction effect. Biometrics 2019; 75:625-637. [DOI: 10.1111/biom.13003] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 02/08/2018] [Accepted: 10/09/2018] [Indexed: 12/17/2022]
Affiliation(s)
- Ni Zhao
  - Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland
- Haoyu Zhang
  - Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland
- Jennifer J. Clark
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
- Arnab Maity
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina
- Michael C. Wu
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington
4. Chen J, Chen W, Zhao N, Wu MC, Schaid DJ. Small Sample Kernel Association Tests for Human Genetic and Microbiome Association Studies. Genet Epidemiol 2015; 40:5-19. [PMID: 26643881 DOI: 10.1002/gepi.21934] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Received: 06/14/2015] [Revised: 07/30/2015] [Accepted: 09/02/2015] [Indexed: 12/14/2022]
Abstract
Kernel machine based association tests (KAT) have been increasingly used in testing the association between an outcome and a set of biological measurements because of their power to combine multiple weak signals with complex relationships to the outcome through the specification of a relevant kernel. Human genetic and microbiome association studies are two important applications of KAT. However, the classic KAT framework relies on large sample theory, and conservativeness has been observed for small sample studies, especially for microbiome association studies. The common approach for addressing the small sample problem relies on computationally intensive resampling methods. Here, we derive an exact test for KAT with continuous traits, which resolves the small sample conservatism of KAT without the need for resampling. The exact test has significantly improved power to detect association for microbiome studies. For binary traits, we propose a similar approximate test, and we show that the approximate test is very powerful for a wide range of kernels including common variant- and microbiome-based kernels, and that it controls the type I error well for these kernels. In contrast, the sequence kernel association tests have slightly inflated genomic inflation factors after small sample adjustment. Extensive simulations and application to a real microbiome association study are used to demonstrate the utility of our method.
Affiliation(s)
- Jun Chen
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
- Wenan Chen
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
- Ni Zhao
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Michael C Wu
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Daniel J Schaid
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
5. Marceau R, Lu W, Holloway S, Sale MM, Worrall BB, Williams SR, Hsu FC, Tzeng JY. A Fast Multiple-Kernel Method With Applications to Detect Gene-Environment Interaction. Genet Epidemiol 2015; 39:456-68. [PMID: 26139508 DOI: 10.1002/gepi.21909] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 01/30/2015] [Revised: 05/10/2015] [Accepted: 05/20/2015] [Indexed: 01/27/2023]
Abstract
Kernel machine (KM) models are a powerful tool for exploring associations between sets of genetic variants and complex traits. Although most KM methods use a single kernel function to assess the marginal effect of a variable set, KM analyses involving multiple kernels have become increasingly popular. Multikernel analysis allows researchers to study more complex problems, such as assessing gene-gene or gene-environment interactions, incorporating variance-component based methods for population substructure into rare-variant association testing, and assessing the conditional effects of a variable set adjusting for other variable sets. The KM framework is robust, powerful, and provides efficient dimension reduction for multifactor analyses, but requires the estimation of high dimensional nuisance parameters. Traditional estimation techniques, including regularization and the expectation-maximization (EM) algorithm, have a large computational cost and are not scalable to the large sample sizes needed for rare variant analysis. Therefore, in the context of gene-environment interaction, we propose a computationally efficient and statistically rigorous "fastKM" algorithm for multikernel analysis that is based on a low-rank approximation to the nuisance effect kernel matrices. Our algorithm is applicable to various trait types (e.g., continuous, binary, and survival traits) and can be implemented using any existing single-kernel analysis software. Through extensive simulation studies, we show that our algorithm has similar performance to an EM-based KM approach for quantitative traits while running much faster. We also apply our method to the Vitamin Intervention for Stroke Prevention (VISP) clinical trial, examining gene-by-vitamin effects on recurrent stroke risk and gene-by-age effects on change in homocysteine level.
Affiliation(s)
- Rachel Marceau
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Wenbin Lu
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Shannon Holloway
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
- Michèle M Sale
  - Center for Public Health Genomics, University of Virginia, Charlottesville, Virginia, United States of America
  - Department of Medicine, University of Virginia, Charlottesville, Virginia, United States of America
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, Virginia, United States of America
- Bradford B Worrall
  - Center for Public Health Genomics, University of Virginia, Charlottesville, Virginia, United States of America
  - Department of Neurology, University of Virginia, Charlottesville, Virginia, United States of America
- Stephen R Williams
  - Center for Public Health Genomics, University of Virginia, Charlottesville, Virginia, United States of America
  - Cardiovascular Research Center, University of Virginia, Charlottesville, Virginia, United States of America
- Fang-Chi Hsu
  - Department of Biostatistical Sciences, Wake Forest School of Medicine, Winston-Salem, North Carolina, United States of America
- Jung-Ying Tzeng
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
  - Department of Statistics, National Cheng-Kung University, Tainan, Taiwan
6. Wang Z, Maity A, Hsiao CK, Voora D, Kaddurah-Daouk R, Tzeng JY. Module-based association analysis for omics data with network structure. PLoS One 2015; 10:e0122309. [PMID: 25822417 PMCID: PMC4378989 DOI: 10.1371/journal.pone.0122309] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 04/07/2014] [Accepted: 02/20/2015] [Indexed: 02/06/2023] Open
Abstract
Module-based analysis (MBA) aims to evaluate the effect of a group of biological elements sharing common features, such as SNPs in the same gene or metabolites in the same pathways, and has become an attractive alternative to traditional single bio-element approaches. Because bio-elements regulate and interact with each other as part of a network, incorporating network structure information can more precisely model the biological effects, enhance the ability to detect true associations, and facilitate our understanding of the underlying biological mechanisms. However, most MBA methods ignore the network structure information, which depicts the interaction and regulation relationships among basic functional units in a biological system. We construct the connectivity kernel and the topology kernel to capture the relationship among bio-elements in a module, and use a kernel machine framework to evaluate the joint effect of bio-elements. Our proposed kernel machine approach directly incorporates network structure so as to enhance study efficiency; it can assess interactions among modules, account for covariates, and is computationally efficient. Through simulation studies and real data application, we demonstrate that the proposed network-based methods can have markedly better power than the approaches ignoring network information under a range of scenarios.
Affiliation(s)
- Zhi Wang
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, 27695, United States of America
- Arnab Maity
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina, 27695, United States of America
- Chuhsing Kate Hsiao
  - Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taipei, Taiwan
- Deepak Voora
  - Institute for Genome Sciences and Policy, Duke University, Durham, North Carolina, United States of America
- Rima Kaddurah-Daouk
  - Department of Psychiatry and Behavioral Sciences, Duke University, Durham, North Carolina, United States of America
- Jung-Ying Tzeng
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, 27695, United States of America
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina, 27695, United States of America
  - Department of Statistics, National Cheng-Kung University, Taiwan, R.O.C