1. Papachristou C, Biswas S. Comparison of haplotype-based tests for detecting gene-environment interactions with rare variants. Brief Bioinform 2019; 21:851-862. [PMID: 31329820 DOI: 10.1093/bib/bbz031] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 09/16/2018] [Revised: 02/06/2019] [Accepted: 02/28/2019] [Indexed: 11/13/2022]
Abstract
Dissecting the genetic mechanism underlying a complex disease hinges on discovering gene-environment interactions (GXE). However, detecting GXE is a challenging problem especially when the genetic variants under study are rare. Haplotype-based tests have several advantages over the so-called collapsing tests for detecting rare variants as highlighted in recent literature. Thus, it is of practical interest to compare haplotype-based tests for detecting GXE including the recent ones developed specifically for rare haplotypes. We compare the following methods: haplo.glm, hapassoc, HapReg, Bayesian hierarchical generalized linear model (BhGLM) and logistic Bayesian LASSO (LBL). We simulate data under different types of association scenarios and levels of gene-environment dependence. We find that when the type I error rates are controlled to be the same for all methods, LBL is the most powerful method for detecting GXE. We applied the methods to a lung cancer data set, in particular, in region 15q25.1 as it has been suggested in the literature that it interacts with smoking to affect the lung cancer susceptibility and that it is associated with smoking behavior. LBL and BhGLM were able to detect a rare haplotype-smoking interaction in this region. We also analyzed the sequence data from the Dallas Heart Study, a population-based multi-ethnic study. Specifically, we considered haplotype blocks in the gene ANGPTL4 for association with trait serum triglyceride and used ethnicity as a covariate. Only LBL found interactions of haplotypes with race (Hispanic). Thus, in general, LBL seems to be the best method for detecting GXE among the ones we studied here. Nonetheless, it requires the most computation time.
Affiliation(s)
- Swati Biswas, Department of Mathematical Sciences, University of Texas at Dallas, Richardson, TX, USA
2. Liang L, Ma Y, Carroll RJ. A semiparametric efficient estimator in case-control studies for gene-environment independent models. J MULTIVARIATE ANAL 2019; 173:38-50. [PMID: 31680705 DOI: 10.1016/j.jmva.2019.01.006] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/30/2022]
Abstract
Case-control studies are popular epidemiological designs for detecting gene-environment interactions in the etiology of complex diseases, where the genetic susceptibility and environmental exposures may often be reasonably assumed independent in the source population. Various papers have presented analytical methods exploiting gene-environment independence to achieve better efficiency, all of which require either a rare disease assumption or a distributional assumption on the genetic variables. We relax both assumptions. We construct a semiparametric estimator in case-control studies exploiting gene-environment independence, while the distributions of genetic susceptibility and environmental exposures are both unspecified and the disease rate is assumed unknown and is not required to be close to zero. The resulting estimator is semiparametric efficient and its superiority over prospective logistic regression, the usual analysis in case-control studies, is demonstrated in various numerical illustrations.
Affiliation(s)
- Liang Liang, Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA
- Yanyuan Ma, Department of Statistics, Penn State University, University Park, PA 16802, USA
- Raymond J Carroll, Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX 77843, USA; School of Mathematical and Physical Sciences, University of Technology Sydney, PO Box 123, Broadway NSW 2007, Australia
3. Liang L, Carroll R, Ma Y. Dimension reduction and estimation in the secondary analysis of case-control studies. Electron J Stat 2018; 12:1782-1821. [PMID: 30100949 PMCID: PMC6086603 DOI: 10.1214/18-ejs1446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Abstract
Studying the relationship between covariates based on retrospective data is the main purpose of secondary analysis, an area of increasing interest. We examine the secondary analysis problem when multiple covariates are available, while only a regression mean model is specified. Despite the completely parametric modeling of the regression mean function, the case-control nature of the data requires special treatment and semi-parametric efficient estimation generates various nonparametric estimation problems with multivariate covariates. We devise a dimension reduction approach that fits with the specified primary and secondary models in the original problem setting, and use reweighting to adjust for the case-control nature of the data, even when the disease rate in the source population is unknown. The resulting estimator is both locally efficient and robust against the misspecification of the regression error distribution, which can be heteroscedastic as well as non-Gaussian. We demonstrate the advantage of our method over several existing methods, both analytically and numerically.
Affiliation(s)
- Liang Liang, Department of Biostatistics, Harvard University, Boston, MA 02115, USA
- Raymond Carroll, Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX 77843, USA; School of Mathematical and Physical Sciences, University of Technology Sydney, PO Box 123, Broadway NSW 2007, Australia
- Yanyuan Ma, Department of Statistics, Penn State University, University Park, PA 16802, USA
4. Stalder O, Asher A, Liang L, Carroll RJ, Ma Y, Chatterjee N. Semiparametric analysis of complex polygenic gene-environment interactions in case-control studies. Biometrika 2017; 104:801-812. [PMID: 29430038 PMCID: PMC5793684 DOI: 10.1093/biomet/asx045] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/25/2016] [Indexed: 01/20/2023]
Abstract
Many methods have recently been proposed for efficient analysis of case-control studies of gene-environment interactions using a retrospective likelihood framework that exploits the natural assumption of gene-environment independence in the underlying population. However, for polygenic modelling of gene-environment interactions, which is a topic of increasing scientific interest, applications of retrospective methods have been limited due to a requirement in the literature for parametric modelling of the distribution of the genetic factors. We propose a general, computationally simple, semiparametric method for analysis of case-control studies that allows exploitation of the assumption of gene-environment independence without any further parametric modelling assumptions about the marginal distributions of any of the two sets of factors. The method relies on the key observation that an underlying efficient profile likelihood depends on the distribution of genetic factors only through certain expectation terms that can be evaluated empirically. We develop asymptotic inferential theory for the estimator and evaluate its numerical performance via simulation studies. An application of the method is presented.
Affiliation(s)
- Odile Stalder, Institute of Social and Preventive Medicine, University of Bern, Finkenhubelweg 11, 3012 Bern
- Alex Asher, Department of Statistics, Texas A&M University, College Station, Texas 77843, U.S.A
- Liang Liang, Department of Statistics, Texas A&M University, College Station, Texas 77843, U.S.A
- Raymond J Carroll, Department of Statistics, Texas A&M University, College Station, Texas 77843, U.S.A
- Yanyuan Ma, Department of Statistics, Penn State University, University Park, Pennsylvania 16802
- Nilanjan Chatterjee, Department of Biostatistics, Johns Hopkins University, 615 N. Wolfe Street, Baltimore, Maryland 21205
5. Zhang Y, Lin S, Biswas S. Detecting rare and common haplotype-environment interaction under uncertainty of gene-environment independence assumption. Biometrics 2016; 73:344-355. [PMID: 27478935 DOI: 10.1111/biom.12567] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 07/01/2015] [Revised: 05/01/2016] [Accepted: 06/01/2016] [Indexed: 11/28/2022]
Abstract
Finding rare variants and gene-environment interactions (GXE) is critical in dissecting complex diseases. We consider the problem of detecting GXE where G is a rare haplotype and E is a nongenetic factor. Such methods typically assume G-E independence, which may not hold in many applications. A pertinent example is lung cancer: there is evidence that variants on Chromosome 15q25.1 interact with smoking to affect the risk. However, these variants are associated with smoking behavior, rendering the assumption of G-E independence inappropriate. With the motivation of detecting GXE under G-E dependence, we extend an existing approach, logistic Bayesian LASSO, which assumes G-E independence (LBL-GXE-I), by modeling G-E dependence through a multinomial logistic regression (referred to as LBL-GXE-D). Unlike LBL-GXE-I, LBL-GXE-D controls type I error rates in all situations; however, it has reduced power when G-E independence holds. To control type I error without sacrificing power, we further propose a unified approach, LBL-GXE, to incorporate uncertainty in the G-E independence assumption by employing a reversible jump Markov chain Monte Carlo method. Our simulations show that LBL-GXE has power similar to that of LBL-GXE-I when G-E independence holds, yet has well-controlled type I errors in all situations. To illustrate the utility of LBL-GXE, we analyzed a lung cancer dataset and found several significant interactions in the 15q25.1 region, including one between a specific rare haplotype and smoking.
Affiliation(s)
- Yuan Zhang, Department of Mathematical Sciences, University of Texas at Dallas, Richardson, Texas 75080, U.S.A
- Shili Lin, Department of Statistics, The Ohio State University, Columbus, Ohio 43210, U.S.A
- Swati Biswas, Department of Mathematical Sciences, University of Texas at Dallas, Richardson, Texas 75080, U.S.A
6. Ma Y, Carroll RJ. Semiparametric Estimation in the Secondary Analysis of Case-Control Studies. J R Stat Soc Series B Stat Methodol 2016; 78:127-151. [PMID: 26834506 PMCID: PMC4731052 DOI: 10.1111/rssb.12107] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 01/16/2023]
Abstract
We study the regression relationship among covariates in case-control data, an area known as the secondary analysis of case-control studies. The context is such that only the form of the regression mean is specified, so that we allow an arbitrary regression error distribution, which can depend on the covariates and thus can be heteroscedastic. Under mild regularity conditions we establish the theoretical identifiability of such models. Previous work in this context has either (a) specified a fully parametric distribution for the regression errors, (b) specified a homoscedastic distribution for the regression errors, (c) specified the rate of disease in the population (we refer to this as the true population), or (d) made a rare disease approximation. We construct a class of semiparametric estimation procedures that rely on none of these. The estimators differ from the usual semiparametric ones in that they draw conclusions about the true population, while technically operating in a hypothetical superpopulation. We also construct estimators with a unique feature, in that they are robust against the misspecification of the regression error distribution in terms of variance structure, while all other nonparametric effects are estimated despite the biased samples. We establish the asymptotic properties of the estimators and illustrate their finite sample performance through simulation studies, as well as through an empirical example on the relation between red meat consumption and heterocyclic amines (HCA). Our analysis verified the positive relationship between red meat consumption and two forms of HCA, indicating that increased red meat consumption leads to increased levels of MeIQx and PhIP, both being risk factors for colorectal cancer. Computer software as well as data to illustrate the methodology are available at http://wileyonlinelibrary.com/journal/rss-datasets.
Affiliation(s)
- Yanyuan Ma, Department of Statistics, University of South Carolina, Columbia, SC 29208; Department of Statistics, Texas A&M University, College Station, TX 77843
- Raymond J. Carroll, Department of Statistics, University of South Carolina, Columbia, SC 29208; Department of Statistics, Texas A&M University, College Station, TX 77843
7. Rahman S. A Tilted Kernel Estimator for Nonparametric Regression in the Secondary Analysis of Case–Control Studies. STATISTICS IN BIOSCIENCES 2015. [DOI: 10.1007/s12561-014-9120-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
8. Chen HY, Rader DE, Li M. Likelihood Inferences on Semiparametric Odds Ratio Model. J Am Stat Assoc 2015. [DOI: 10.1080/01621459.2014.948544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
9. Yu Z, Demetriou M, Gillen DL. Genome-Wide Analysis of Gene-Gene and Gene-Environment Interactions Using Closed-Form Wald Tests. Genet Epidemiol 2015; 39:446-55. [PMID: 26095143 DOI: 10.1002/gepi.21907] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 11/20/2014] [Revised: 02/25/2015] [Accepted: 05/06/2015] [Indexed: 01/31/2023]
Abstract
Despite the successful discovery of hundreds of variants for complex human traits using genome-wide association studies, the degree to which genes and environmental risk factors jointly affect disease risk is largely unknown. One obstacle toward this goal is that the computational effort required for testing gene-gene and gene-environment interactions is enormous. As a result, numerous computationally efficient tests were recently proposed. However, the validity of these methods often relies on unrealistic assumptions such as additive main effects, main effects at only one variable, no linkage disequilibrium between the two single-nucleotide polymorphisms (SNPs) in a pair, or gene-environment independence. Here, we derive closed-form and consistent estimates for interaction parameters and propose to use Wald tests for testing interactions. The Wald tests are asymptotically equivalent to the likelihood ratio tests (LRTs), largely considered to be the gold standard tests but generally too computationally demanding for genome-wide interaction analysis. Simulation studies show that the proposed Wald tests perform very similarly to the LRTs but are much more computationally efficient. Applying the proposed tests to a genome-wide study of multiple sclerosis, we identify interactions within the major histocompatibility complex region. In this application, we find that (1) focusing on pairs where both SNPs are marginally significant leads to more significant interactions when compared to focusing on pairs where at least one SNP is marginally significant; and (2) parsimonious parameterization of interaction effects might decrease, rather than increase, statistical power.
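As a rough illustration of what a closed-form Wald test for interaction can look like, the sketch below handles the simplest setting only: binary G, binary E, and a saturated logistic model, where the interaction log odds ratio and its variance can be read directly off the 2 x 2 x 2 table of counts. This is a textbook special case, not the estimator derived in the paper above; the function name `wald_interaction_test` and the `cases[g][e]` / `controls[g][e]` input layout are assumptions made for the example.

```python
import math


def wald_interaction_test(cases, controls):
    """Wald test for multiplicative G x E interaction in a 2 x 2 x 2 table.

    cases[g][e] and controls[g][e] hold counts for binary G and E
    (a hypothetical input layout chosen for this sketch).
    Returns (estimate, standard_error, z, two_sided_p).
    """
    cells = [t[g][e] for t in (cases, controls) for g in (0, 1) for e in (0, 1)]
    if min(cells) <= 0:
        raise ValueError("all eight cell counts must be positive")

    def log_ge_odds_ratio(t):
        # Log odds ratio between G and E within one disease stratum.
        return math.log(t[1][1] * t[0][0] / (t[1][0] * t[0][1]))

    # Interaction on the log-odds scale: difference of the G-E log odds
    # ratios in cases and in controls.
    estimate = log_ge_odds_ratio(cases) - log_ge_odds_ratio(controls)
    # Large-sample variance: sum of reciprocal counts over all eight cells.
    standard_error = math.sqrt(sum(1.0 / n for n in cells))
    z = estimate / standard_error
    # Two-sided normal tail probability.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return estimate, standard_error, z, p_value
```

Because nothing is fitted iteratively, the cost per pair is a handful of arithmetic operations, which is the property that makes closed-form tests attractive at genome-wide scale.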
Affiliation(s)
- Zhaoxia Yu, Department of Statistics, University of California, Irvine, California, United States of America
- Michael Demetriou, Department of Neurology, University of California, Irvine, California, United States of America; Department of Microbiology & Molecular Genetics, University of California, Irvine, California, United States of America
- Daniel L Gillen, Department of Statistics, University of California, Irvine, California, United States of America
10. Biswas S, Xia S, Lin S. Detecting rare haplotype-environment interaction with logistic Bayesian LASSO. Genet Epidemiol 2013; 38:31-41. [PMID: 24272913 DOI: 10.1002/gepi.21773] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 07/03/2013] [Revised: 09/13/2013] [Accepted: 10/15/2013] [Indexed: 11/09/2022]
Abstract
Two important contributors to missing heritability are believed to be rare variants and gene-environment interaction (GXE). Thus, detecting GXE where G is a rare haplotype variant (rHTV) is a pressing problem. Haplotype analysis is usually the natural second step to follow up on a genomic region that is implicated to be associated through single nucleotide variants (SNV) analysis. Further, rHTV can tag associated rare SNV and provide greater power to detect them than popular collapsing methods. Recently we proposed Logistic Bayesian LASSO (LBL) for detecting rHTV association with case-control data. LBL shrinks the unassociated (especially common) haplotypes toward zero so that an associated rHTV can be identified with greater power. Here, we incorporate environmental factors and their interactions with haplotypes in LBL. As LBL is based on retrospective likelihood, this extension is not trivial. We model the joint distribution of haplotypes and covariates given the case-control status. We apply the approach (LBL-GXE) to the Michigan, Mayo, AREDS, Pennsylvania Cohort Study on Age-related Macular Degeneration (AMD). LBL-GXE detects interaction of a specific rHTV in CFH gene with smoking. To the best of our knowledge, this is the first time in the AMD literature that an interaction of smoking with a specific (rather than pooled) rHTV has been implicated. We also carry out simulations and find that LBL-GXE has reasonably good powers for detecting interactions with rHTV while keeping the type I error rates well controlled. Thus, we conclude that LBL-GXE is a useful tool for uncovering missing heritability.
Affiliation(s)
- Swati Biswas, Department of Mathematical Sciences, University of Texas at Dallas, Richardson, Texas, United States of America
11. A Note on Penalized Regression Spline Estimation in the Secondary Analysis of Case-Control Data. STATISTICS IN BIOSCIENCES 2013; 5:250-260. [DOI: 10.1007/s12561-013-9094-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/26/2022]
12. Chen HY, Reilly MP, Li M. Semiparametric odds ratio model for case-control and matched case-control designs. Stat Med 2013; 32:3126-42. [PMID: 23307592 DOI: 10.1002/sim.5742] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 07/12/2011] [Accepted: 12/19/2012] [Indexed: 11/09/2022]
Abstract
We propose a semiparametric odds ratio model that extends Umbach and Weinberg's approach of exploiting a gene-environment association model for efficiency gains in case-control designs to both discrete and continuous data. We directly model the gene-environment association in the control population to avoid estimating the intercept in the disease risk model, which is inherently difficult because of the scarcity of information on the parameter with the sampling designs. We propose a novel permutation-based approach to eliminate the high-dimensional nuisance parameters in the matched case-control design. The proposed approach reduces to the conditional logistic regression when the model for the gene-environment association is unrestricted. Simulation studies demonstrate good performance of the proposed approach. We apply the proposed approach to a study of gene-environment interaction on coronary artery disease.
Affiliation(s)
- Hua Yun Chen, Division of Epidemiology and Biostatistics, School of Public Health, University of Illinois at Chicago, 1603 West Taylor Street, Chicago, IL 60612, USA
13. Wei J, Carroll RJ, Müller UU, Van Keilegom I, Chatterjee N. Robust estimation for homoscedastic regression in the secondary analysis of case-control data. J R Stat Soc Series B Stat Methodol 2012; 75:185-206. [PMID: 23637568 DOI: 10.1111/j.1467-9868.2012.01052.x] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Indexed: 11/28/2022]
Abstract
Primary analysis of case-control studies focuses on the relationship between disease D and a set of covariates of interest (Y, X). A secondary application of the case-control study, which is often invoked in modern genetic epidemiologic association studies, is to investigate the interrelationship between the covariates themselves. The task is complicated owing to the case-control sampling, where the regression of Y on X is different from what it is in the population. Previous work has assumed a parametric distribution for Y given X and derived semiparametric efficient estimation and inference without any distributional assumptions about X. We take up the issue of estimation of a regression function when Y given X follows a homoscedastic regression model, but otherwise the distribution of Y is unspecified. The semiparametric efficient approaches can be used to construct semiparametric efficient estimates, but they suffer from a lack of robustness to the assumed model for Y given X. We take an entirely different approach. We show how to estimate the regression parameters consistently even if the assumed model for Y given X is incorrect, and thus the estimates are model robust. For this we make the assumption that the disease rate is known or well estimated. The assumption can be dropped when the disease is rare, which is typically so for most case-control studies, and the estimation algorithm simplifies. Simulations and empirical examples are used to illustrate the approach.
Affiliation(s)
- Jiawei Wei, Texas A&M University, College Station, USA
14. Aschard H, Lutz S, Maus B, Duell EJ, Fingerlin TE, Chatterjee N, Kraft P, Van Steen K. Challenges and opportunities in genome-wide environmental interaction (GWEI) studies. Hum Genet 2012; 131:1591-613. [PMID: 22760307 DOI: 10.1007/s00439-012-1192-0] [Citation(s) in RCA: 110] [Impact Index Per Article: 9.2] [Received: 03/01/2012] [Accepted: 06/11/2012] [Indexed: 02/03/2023]
Abstract
Interest in performing gene-environment interaction studies has grown considerably with the rise of advanced molecular genetics techniques. Practically, it became possible to investigate the role of environmental factors in disease risk and hence to investigate their role as genetic effect modifiers. The understanding that genetics is important in the uptake and metabolism of toxic substances is an example of how genetic profiles can modify important environmental risk factors for disease. Several rationales exist to set up gene-environment interaction studies, and the technical challenges related to these studies, when the number of environmental or genetic risk factors is relatively small, have been described before. In the post-genomic era, it is now possible to study thousands of genes and their interaction with the environment. This brings along a whole range of new challenges and opportunities. Despite a continuing effort in developing efficient methods and optimal bioinformatics infrastructures to deal with the available wealth of data, the challenge remains how to best present and analyze genome-wide environmental interaction (GWEI) studies involving multiple genetic and environmental factors. Since GWEIs are performed at the intersection of statistical genetics, bioinformatics and epidemiology, usually similar problems need to be dealt with as for genome-wide association gene-gene interaction studies. However, additional complexities need to be considered which are typical for large-scale epidemiological studies, but are also related to "joining" two heterogeneous types of data in explaining complex disease trait variation or for prediction purposes.
Affiliation(s)
- Hugues Aschard, Department of Epidemiology, Harvard School of Public Health, Boston, MA, USA
15. Chen HY, Chen J. On information coded in gene-environment independence in case-control studies. Am J Epidemiol 2011; 174:736-43. [PMID: 21828372 DOI: 10.1093/aje/kwr153] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022]
Abstract
For analysis of case-control genetic association studies, it has recently been shown that gene-environment independence in the population can be leveraged to increase efficiency for estimating gene-environment interaction effects in comparison with the standard prospective analysis. However, for the special case in which data on the binary phenotype and genetic and environmental risk factors can be summarized in a 2 × 2 × 2 table, the authors show here that there is no efficiency gain for estimating interaction effects, nor is there an efficiency gain for estimating the genetic and environmental main effects. This contrasts with the well-known result that assuming rare phenotype prevalence and gene-environment independence in the control population for the same data can lead to efficiency gain. This discrepancy is counterintuitive, since the 2 likelihoods are also approximately equal when the phenotype is rare. An explanation for the paradox based on a theoretical analysis is provided. Implications of these results for data analyses are also examined, and practical guidance on analyzing such case-control studies is offered.
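To make the "well-known result" mentioned in this abstract concrete, the sketch below contrasts two standard estimators of the interaction log odds ratio from a 2 × 2 × 2 table: the usual case-control estimator, and the case-only estimator, which is justified only when G and E are independent among controls (for example, a rare phenotype plus population-level independence). It does not reproduce the authors' retrospective-likelihood analysis; the function names and the `table[g][e]` layout are assumptions made for illustration.

```python
import math


def log_odds_ratio(table):
    """Return (log odds ratio, large-sample variance) for a 2 x 2 table of
    counts laid out as table[g][e] (hypothetical layout for this sketch)."""
    (n00, n01), (n10, n11) = table
    estimate = math.log(n11 * n00 / (n10 * n01))
    variance = 1.0 / n00 + 1.0 / n01 + 1.0 / n10 + 1.0 / n11
    return estimate, variance


def case_control_interaction(cases, controls):
    """Standard estimator: G-E log odds ratio in cases minus that in controls."""
    est_cases, var_cases = log_odds_ratio(cases)
    est_controls, var_controls = log_odds_ratio(controls)
    return est_cases - est_controls, var_cases + var_controls


def case_only_interaction(cases):
    """Case-only estimator: valid only if the G-E odds ratio among controls
    is 1, an assumption this sketch takes as given rather than checks."""
    return log_odds_ratio(cases)
```

The case-only variance drops the four control-cell terms, so it is never larger than the case-control variance; the price is bias whenever G and E are in fact associated among controls.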
Affiliation(s)
- Hua Yun Chen, Division of Epidemiology and Biostatistics, School of Public Health, University of Illinois, Chicago, 1603 West Taylor Street, Chicago, IL 60612, USA
16. Li J, Zhang K, Yi N. A Bayesian hierarchical model for detecting haplotype-haplotype and haplotype-environment interactions in genetic association studies. Hum Hered 2011; 71:148-60. [PMID: 21778734 DOI: 10.1159/000324841] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 09/27/2010] [Accepted: 02/03/2011] [Indexed: 12/27/2022]
Abstract
OBJECTIVE: Genetic association studies based on haplotypes are powerful in the discovery and characterization of the genetic basis of complex human diseases. However, statistical methods for detecting haplotype-haplotype and haplotype-environment interactions have not yet been fully developed owing to the difficulties encountered: large numbers of potential haplotypes and unknown haplotype pairs. Furthermore, methods for detecting the association between rare haplotypes and disease have not kept pace with their counterpart of common haplotypes. METHODS/RESULTS: We herein propose an efficient and robust method to tackle these problems based on a Bayesian hierarchical generalized linear model. Our model simultaneously fits environmental effects, main effects of numerous common and rare haplotypes, and haplotype-haplotype and haplotype-environment interactions. The key to the approach is the use of a continuous prior distribution on coefficients that favors sparseness in the fitted model and facilitates computation. We develop a fast expectation-maximization algorithm to fit models by estimating posterior modes of coefficients. We incorporate our algorithm into the iteratively weighted least squares for classical generalized linear models as implemented in the R package glm. We evaluate the proposed method and compare its performance to existing methods on extensive simulated data. CONCLUSION: The results show that the proposed method performs well under all situations and is more powerful than existing approaches.
Affiliation(s)
- Jun Li, Department of Biostatistics, Section on Statistical Genetics, University of Alabama at Birmingham, Birmingham, AL 35294-0022, USA
17. Fridley BL, Jenkins GD, Biernacka JM. Self-contained gene-set analysis of expression data: an evaluation of existing and novel methods. PLoS One 2010; 5. [PMID: 20862301 PMCID: PMC2941449 DOI: 10.1371/journal.pone.0012693] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.4] [Received: 06/01/2010] [Accepted: 08/22/2010] [Indexed: 11/26/2022]
Abstract
Gene set methods aim to assess the overall evidence of association of a set of genes with a phenotype, such as disease or a quantitative trait. Multiple approaches for gene set analysis of expression data have been proposed. They can be divided into two types: competitive and self-contained. Benefits of self-contained methods include that they can be used for genome-wide, candidate gene, or pathway studies, and have been reported to be more powerful than competitive methods. We therefore investigated ten self-contained methods that can be used for continuous, discrete and time-to-event phenotypes. To assess the power and type I error rate for the various previously proposed and novel approaches, an extensive simulation study was completed in which the scenarios varied according to: number of genes in a gene set, number of genes associated with the phenotype, effect sizes, correlation between expression of genes within a gene set, and the sample size. In addition to the simulated data, the various methods were applied to a pharmacogenomic study of the drug gemcitabine. Simulation results demonstrated that overall Fisher's method and the global model with random effects have the highest power for a wide range of scenarios, while the analysis based on the first principal component and Kolmogorov-Smirnov test tended to have lowest power. The methods investigated here are likely to play an important role in identifying pathways that contribute to complex traits.
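For readers unfamiliar with Fisher's method, which this abstract reports as one of the most powerful approaches, a minimal sketch follows. It combines per-gene p-values into the statistic -2 Σ log p and refers it to a chi-square distribution with 2k degrees of freedom, which is exact only when the k p-values are independent. Expression levels within a gene set are usually correlated, so the study above may well have calibrated the statistic differently (for example, by permutation); that detail is not stated here and this sketch does not attempt it. The function name `fisher_combined_pvalue` is an assumption.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Fisher's combined probability test, assuming independent p-values.

    Returns (statistic, combined_p), where statistic = -2 * sum(log p) and
    combined_p is the upper tail of a chi-square with 2k degrees of freedom.
    """
    k = len(pvalues)
    if k == 0 or any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("need at least one p-value, each in (0, 1]")
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    half = statistic / 2.0
    # For even degrees of freedom 2k the chi-square upper tail is
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!, so no special functions
    # are needed.
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total
```

With a single p-value the method returns that p-value unchanged, which is a quick sanity check on the tail formula.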
Affiliation(s)
- Brooke L Fridley, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
18. Hu YJ, Lin DY, Zeng D. A general framework for studying genetic effects and gene-environment interactions with missing data. Biostatistics 2010; 11:583-98. [PMID: 20348396 DOI: 10.1093/biostatistics/kxq015] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 11/12/2022]
Abstract
Missing data arise in genetic association studies when genotypes are unknown or when haplotypes are of direct interest. We provide a general likelihood-based framework for making inference on genetic effects and gene-environment interactions with such missing data. We allow genetic and environmental variables to be correlated while leaving the distribution of environmental variables completely unspecified. We consider 3 major study designs (cross-sectional, case-control, and cohort designs) and construct appropriate likelihood functions for all common phenotypes (e.g. case-control status, quantitative traits, and potentially censored ages at onset of disease). The likelihood functions involve both finite- and infinite-dimensional parameters. The maximum likelihood estimators are shown to be consistent, asymptotically normal, and asymptotically efficient. Expectation-Maximization (EM) algorithms are developed to implement the corresponding inference procedures. Extensive simulation studies demonstrate that the proposed inferential and numerical methods perform well in practical settings. Illustration with a genome-wide association study of lung cancer is provided.
Affiliation(s)
- Y J Hu, Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599-7420, USA