1
Wang Z, Shi W, Carroll RJ, Chatterjee N. Joint Modeling of Gene-Environment Correlations and Interactions using Polygenic Risk Scores in Case-Control Studies. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.02.14.528572. [PMID: 36824704 PMCID: PMC9948994 DOI: 10.1101/2023.02.14.528572] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023]
Abstract
Polygenic risk scores (PRS) are rapidly emerging as aggregated measures of disease-risk associated with many genetic variants. Understanding the interplay of PRS with environmental factors is critical for interpreting and applying PRS in a wide variety of settings. We develop an efficient method for simultaneously modeling gene-environment correlations and interactions using PRS in case-control studies. We use a logistic-normal regression modeling framework to specify the disease risk and PRS distribution in the underlying population and propose joint inference across the two models using the retrospective likelihood of the case-control data. Extensive simulation studies demonstrate the flexibility of the method in trading-off bias and efficiency for the estimation of various model parameters compared to the standard logistic regression or a case-only analysis for gene-environment interactions, or a control-only analysis for gene-environment correlations. Finally, using simulated case-control datasets within the UK Biobank study, we demonstrate the power of the proposed method for its ability to recover results from the full prospective cohort for the detection of an interaction between long-term oral contraceptive use and PRS on the risk of breast cancer. This method is computationally efficient and implemented in a user-friendly R package.
Affiliation(s)
- Ziqiao Wang
  - Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA
- Wen Shi
  - McKusick-Nathans Institute, Department of Genetic Medicine, School of Medicine, Johns Hopkins University, Baltimore, MD, USA
- Raymond J. Carroll
  - Department of Statistics, Texas A&M University, College Station, TX, USA
- Nilanjan Chatterjee
  - Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA
  - Department of Oncology, School of Medicine, Johns Hopkins University, Baltimore, MD, USA
2
Chen S, Zhang H. Analysis of parent‐of‐origin effects for secondary phenotypes using case–control mother–child pair data. Genet Epidemiol 2022; 46:430-445. [DOI: 10.1002/gepi.22463] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2022] [Revised: 03/28/2022] [Accepted: 04/20/2022] [Indexed: 11/10/2022]
Affiliation(s)
- Shuyue Chen
  - School of Data Science, University of Science and Technology of China, Hefei, Anhui, P.R. China
- Hong Zhang
  - Department of Statistics and Finance, School of Management, University of Science and Technology of China, Hefei, Anhui, P.R. China
3
Modeling Secondary Phenotypes Conditional on Genotypes in Case–Control Studies. STATS 2022. [DOI: 10.3390/stats5010014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Traditional case–control genetic association studies examine relationships between case–control status and one or more covariates. It is becoming increasingly common to study secondary phenotypes and their association with the original covariates. The Orofacial Pain: Prospective Evaluation and Risk Assessment (OPPERA) project, a study of temporomandibular disorders (TMD), motivates this work. Numerous measures of interest are collected at enrollment, such as the number of comorbid pain conditions from which a participant suffers. Examining the potential genetic basis of these measures is of secondary interest. Assessing these associations is statistically challenging, as participants do not form a random sample from the population of interest. Standard methods may be biased and lack coverage and power. We propose a general method for the analysis of arbitrary phenotypes utilizing inverse probability weighting and bootstrapping for standard error estimation. The method may be applied to the complicated association tests used in next-generation sequencing studies, such as analyses of haplotypes with ambiguous phase. Simulation studies show that our method performs as well as competing methods when they are applicable and yields promising results for outcome types, such as time-to-event, to which other methods may not apply. The method is applied to the OPPERA baseline case–control genetic study.
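The inverse-probability-weighting idea named in this abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the data-generating model, the effect sizes, and the helper names `simulate_case_control`, `weighted_slope`, and `bootstrap_se` are all hypothetical, and the sampling weights are taken as known from the simulated population. It compares an unweighted regression of a secondary phenotype on genotype with a weighted one, and uses a plain nonparametric bootstrap for the standard error.

```python
import numpy as np


def simulate_case_control(n_pop=400_000, n_cases=5_000, n_controls=5_000, seed=0):
    """Simulate a population, then draw a case-control sample with IPW weights.

    All parameter values here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    g = rng.binomial(2, 0.3, size=n_pop).astype(float)      # genotype coded 0/1/2
    y = 0.5 * g + rng.normal(size=n_pop)                     # secondary phenotype, true slope 0.5
    risk = 1.0 / (1.0 + np.exp(-(-5.0 + 1.0 * y + 1.0 * g)))  # disease depends on y and g
    d = rng.random(n_pop) < risk
    cases = rng.choice(np.flatnonzero(d), size=n_cases, replace=False)
    controls = rng.choice(np.flatnonzero(~d), size=n_controls, replace=False)
    idx = np.concatenate([cases, controls])
    # Weight = inverse of the sampling fraction within each disease stratum.
    w = np.where(d[idx], d.sum() / n_cases, (~d).sum() / n_controls)
    return g[idx], y[idx], w


def weighted_slope(g, y, w):
    """Slope of y on g from weighted least squares with an intercept."""
    x = np.column_stack([np.ones_like(g), g])
    xtwx = x.T @ (w[:, None] * x)
    xtwy = x.T @ (w * y)
    return np.linalg.solve(xtwx, xtwy)[1]


def bootstrap_se(g, y, w, n_boot=200, seed=1):
    """Bootstrap standard error of the weighted slope (resampling individuals)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    slopes = []
    for _ in range(n_boot):
        b = rng.integers(0, n, size=n)
        slopes.append(weighted_slope(g[b], y[b], w[b]))
    return float(np.std(slopes, ddof=1))


g, y, w = simulate_case_control()
naive = weighted_slope(g, y, np.ones_like(y))  # ignores the case-control sampling
ipw = weighted_slope(g, y, w)                  # reweights to the source population
se = bootstrap_se(g, y, w)
print(f"naive slope: {naive:.3f}, IPW slope: {ipw:.3f} (bootstrap SE {se:.3f})")
```

In this setup the unweighted slope drifts away from the simulated value of 0.5 because cases are oversampled, while the weighted slope stays close to it at the cost of a larger variance.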
4
A review of analysis methods for secondary outcomes in case-control studies. COMMUNICATIONS FOR STATISTICAL APPLICATIONS AND METHODS 2019. [DOI: 10.29220/csam.2019.26.2.103] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 11/17/2022]
5
Wang J, Ning J, Shete S. Mediation analysis in a case-control study when the mediator is a censored variable. Stat Med 2019; 38:1213-1229. [PMID: 30421436 DOI: 10.1002/sim.8028] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 12/08/2017] [Revised: 09/11/2018] [Accepted: 10/15/2018] [Indexed: 11/10/2022]
Abstract
Mediation analysis is an approach for assessing the direct and indirect effects of an initial variable on an outcome through a mediator. In practice, mediation models can involve a censored mediator (eg, a woman's age at menopause). The current research for mediation analysis with a censored mediator focuses on scenarios where outcomes are continuous. However, the outcomes can be binary (eg, type 2 diabetes). Another challenge when analyzing such a mediation model is to use data from a case-control study, which results in biased estimations for the initial variable-mediator association if a standard approach is directly applied. In this study, we propose an approach (denoted as MAC-CC) to analyze the mediation model with a censored mediator given data from a case-control study, based on the semiparametric accelerated failure time model along with a pseudo-likelihood function. We adapted the measures for assessing the indirect and direct effects using counterfactual definitions. We conducted simulation studies to investigate the performance of MAC-CC and compared it to those of the naïve approach and the complete-case approach. MAC-CC accurately estimates the coefficients of different paths, the indirect effects, and the proportions of the total effects mediated. We applied the proposed and existing approaches to the mediation study of genetic variants, a woman's age at menopause, and type 2 diabetes based on a case-control study of type 2 diabetes. Our results indicate that there is no mediating effect from the age at menopause on the association between the genetic variants and type 2 diabetes.
Affiliation(s)
- Jian Wang
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas
- Jing Ning
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas
- Sanjay Shete
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas
  - Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, Texas
6
Aschard H, Laville V, Tchetgen ET, Knights D, Imhann F, Seksik P, Zaitlen N, Silverberg MS, Cosnes J, Weersma RK, Xavier R, Beaugerie L, Skurnik D, Sokol H. Genetic effects on the commensal microbiota in inflammatory bowel disease patients. PLoS Genet 2019; 15:e1008018. [PMID: 30849075 PMCID: PMC6426259 DOI: 10.1371/journal.pgen.1008018] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 09/12/2018] [Revised: 03/20/2019] [Accepted: 02/13/2019] [Indexed: 12/16/2022]
Abstract
Several bacteria in the gut microbiota have been shown to be associated with inflammatory bowel disease (IBD), and dozens of IBD genetic variants have been identified in genome-wide association studies. However, the role of the microbiota in the etiology of IBD in terms of host genetic susceptibility remains unclear. Here, we studied the association between four major genetic variants associated with an increased risk of IBD and bacterial taxa in up to 633 IBD cases. We performed systematic screening for associations, identifying and replicating associations between NOD2 variants and two taxa: the Roseburia genus and the Faecalibacterium prausnitzii species. By exploring the overall association patterns between genes and bacteria, we found that IBD risk alleles were significantly enriched for associations concordant with bacteria-IBD associations. To understand the significance of this pattern in terms of the study design and known effects from the literature, we used counterfactual principles to assess the fitness of a few parsimonious gene-bacteria-IBD causal models. Our analyses showed evidence that the disease risk of these genetic variants was likely to be partially mediated by the microbiome. We confirmed these results in extensive simulation studies and sensitivity analyses using the association between NOD2 and F. prausnitzii as a case study.
Affiliation(s)
- Hugues Aschard
  - Centre de Bioinformatique, Biostatistique et Biologie Intégrative (C3BI), Institut Pasteur, Paris, France
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America
  - * E-mail: (HA); (DS); (HS)
- Vincent Laville
  - Centre de Bioinformatique, Biostatistique et Biologie Intégrative (C3BI), Institut Pasteur, Paris, France
- Eric Tchetgen Tchetgen
  - Department of Statistics, The Wharton School at the University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Dan Knights
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, Minnesota, United States of America
  - Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America
  - Center for Computational and Integrative Biology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, United States of America
  - Biotechnology Institute, University of Minnesota, St. Paul, Minnesota, United States of America
- Floris Imhann
  - Department of Gastroenterology and Hepatology, University of Groningen and University Medical Center Groningen, Groningen, the Netherlands
- Philippe Seksik
  - Department of Gastroenterology, Saint Antoine Hospital, Paris, France
- Noah Zaitlen
  - Department of Medicine, University of California, San Francisco, California, United States of America
- Mark S. Silverberg
  - Zane Cohen Centre for Digestive Diseases, Mount Sinai Hospital, Toronto, Ontario, Canada
- Jacques Cosnes
  - Department of Gastroenterology, Saint Antoine Hospital, Paris, France
  - Sorbonne Université, Paris, France
- Rinse K. Weersma
  - Department of Gastroenterology and Hepatology, University of Groningen and University Medical Center Groningen, Groningen, the Netherlands
- Ramnik Xavier
  - Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America
  - Center for Computational and Integrative Biology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, United States of America
  - Division of Gastroenterology, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, United States of America
- Laurent Beaugerie
  - Department of Gastroenterology, Saint Antoine Hospital, Paris, France
  - Sorbonne Université, Paris, France
- David Skurnik
  - Division of Infectious Diseases, Harvard Medical School, Boston, Massachusetts, United States of America
  - Massachusetts Technology and Analytics, Brookline, Massachusetts, United States of America
  - Department of Microbiology, Necker Hospital and University Paris Descartes, Paris, France
  - INSERM U1151-Equipe 11, Institut Necker-Enfants Malades, Paris, France
  - * E-mail: (HA); (DS); (HS)
- Harry Sokol
  - Department of Gastroenterology, Saint Antoine Hospital, Paris, France
  - Sorbonne Université, Paris, France
  - Micalis Institute, AgroParisTech, Jouy-en-Josas, France
  - INSERM CRSA UMRS U938, Paris, France
  - * E-mail: (HA); (DS); (HS)
7
Liang L, Ma Y, Wei Y, Carroll RJ. Semiparametrically efficient estimation in quantile regression of secondary analysis. J R Stat Soc Series B Stat Methodol 2018; 80:625-648. [PMID: 30337833 PMCID: PMC6191046 DOI: 10.1111/rssb.12272] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/29/2022]
Abstract
Analysing secondary outcomes is a common practice for case-control studies. Traditional secondary analysis employs either completely parametric models or conditional mean regression models to link the secondary outcome to covariates. In many situations, quantile regression models complement mean-based analyses and provide alternative new insights on the associations of interest. For example, biomedical outcomes are often highly asymmetric, and median regression is more useful in describing the 'central' behaviour than mean regressions. There are also cases where the research interest is to study the high or low quantiles of a population, as they are more likely to be at risk. We approach the secondary quantile regression problem from a semiparametric perspective, allowing the covariate distribution to be completely unspecified. We derive a class of consistent semiparametric estimators and identify the efficient member. The asymptotic properties of the resulting estimators are established. Simulation results and a real data analysis are provided to demonstrate the superior performance of our approach with a comparison with the only existing approach so far in the literature.
Affiliation(s)
- Yanyuan Ma
  - Penn State University, University Park, USA
- Ying Wei
  - Columbia University, New York, USA
- Raymond J Carroll
  - Texas A&M University, College Station, USA, and University of Technology, Sydney, Australia
8
Liang L, Carroll R, Ma Y. Dimension reduction and estimation in the secondary analysis of case-control studies. Electron J Stat 2018; 12:1782-1821. [PMID: 30100949 PMCID: PMC6086603 DOI: 10.1214/18-ejs1446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Abstract
Studying the relationship between covariates based on retrospective data is the main purpose of secondary analysis, an area of increasing interest. We examine the secondary analysis problem when multiple covariates are available, while only a regression mean model is specified. Despite the completely parametric modeling of the regression mean function, the case-control nature of the data requires special treatment and semi-parametric efficient estimation generates various nonparametric estimation problems with multivariate covariates. We devise a dimension reduction approach that fits with the specified primary and secondary models in the original problem setting, and use reweighting to adjust for the case-control nature of the data, even when the disease rate in the source population is unknown. The resulting estimator is both locally efficient and robust against the misspecification of the regression error distribution, which can be heteroscedastic as well as non-Gaussian. We demonstrate the advantage of our method over several existing methods, both analytically and numerically.
Affiliation(s)
- Liang Liang
  - Department of Biostatistics, Harvard University, Boston, MA 02115, USA
- Raymond Carroll
  - Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX 77843, USA, and School of Mathematical and Physical Sciences, University of Technology Sydney, PO Box 123 Broadway NSW 2007, Australia
- Yanyuan Ma
  - Department of Statistics, Penn State University, University Park, PA 16802, USA
9
Sofer T, Schifano ED, Christiani DC, Lin X. Weighted pseudolikelihood for SNP set analysis with multiple secondary outcomes in case-control genetic association studies. Biometrics 2017; 73:1210-1220. [PMID: 28346824 DOI: 10.1111/biom.12680] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/01/2015] [Revised: 01/01/2017] [Accepted: 02/01/2017] [Indexed: 11/29/2022]
Abstract
We propose a weighted pseudolikelihood method for analyzing the association of a SNP set (for example, SNPs in a gene or a genetic pathway or network) with multiple secondary phenotypes in case-control genetic association studies. To boost analysis power, we assume that the SNP-specific effects are shared across all secondary phenotypes using a scaled mean model. We estimate regression parameters using Inverse Probability Weighted (IPW) estimating equations obtained from the weighted pseudolikelihood, which accounts for case-control sampling to prevent potential ascertainment bias. To test the effect of a SNP set, we propose a weighted variance component pseudo-score test. We also propose a penalized IPW pseudolikelihood method for selecting a subset of SNPs that are associated with the multiple secondary phenotypes. We show that the proposed variable selection procedure has the oracle properties and is robust to misspecification of the correlation structure among secondary phenotypes. We select the tuning parameter using a weighted Bayesian Information-like Criterion (wBIC). We evaluate the finite sample performance of the proposed methods via simulations, and illustrate the methods by the analysis of the multiple secondary smoking behavior outcomes in a lung cancer case-control genetic association study.
Affiliation(s)
- Tamar Sofer
  - Department of Biostatistics, University of Washington, Seattle, Washington 98105, U.S.A
- Elizabeth D Schifano
  - Department of Statistics, University of Connecticut, Storrs, Connecticut 06269, U.S.A
- David C Christiani
  - Department of Environmental Health, Harvard School of Public Health, Boston, Massachusetts 02115, U.S.A
- Xihong Lin
  - Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts 02115, U.S.A
10
Yung G, Lin X. Validity of using ad hoc methods to analyze secondary traits in case-control association studies. Genet Epidemiol 2016; 40:732-743. [PMID: 27670932 DOI: 10.1002/gepi.21994] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 12/11/2015] [Revised: 06/23/2016] [Accepted: 06/26/2016] [Indexed: 11/10/2022]
Abstract
Case-control association studies often collect from their subjects information on secondary phenotypes. Reusing the data and studying the association between genes and secondary phenotypes provide an attractive and cost-effective approach that can lead to discovery of new genetic associations. A number of approaches have been proposed, including simple and computationally efficient ad hoc methods that ignore ascertainment or stratify on case-control status. Justification for these approaches relies on the assumption of no covariates and the correct specification of the primary disease model as a logistic model. Both might not be true in practice, for example, in the presence of population stratification or the primary disease model following a probit model. In this paper, we investigate the validity of ad hoc methods in the presence of covariates and possible disease model misspecification. We show that in taking an ad hoc approach, it may be desirable to include covariates that affect the primary disease in the secondary phenotype model, even though these covariates are not necessarily associated with the secondary phenotype. We also show that when the disease is rare, ad hoc methods can lead to severely biased estimation and inference if the true disease model follows a probit model instead of a logistic model. Our results are justified theoretically and via simulations. Applied to real data analysis of genetic associations with cigarette smoking, ad hoc methods collectively identified as highly significant (P < 10⁻⁵) single nucleotide polymorphisms from over 10 genes, genes that were identified in previous studies of smoking cessation.
Affiliation(s)
- Godwin Yung
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America
- Xihong Lin
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America
11
Ma Y, Carroll RJ. Semiparametric Estimation in the Secondary Analysis of Case-Control Studies. J R Stat Soc Series B Stat Methodol 2016; 78:127-151. [PMID: 26834506 PMCID: PMC4731052 DOI: 10.1111/rssb.12107] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 01/16/2023]
Abstract
We study the regression relationship among covariates in case-control data, an area known as the secondary analysis of case-control studies. The context is such that only the form of the regression mean is specified, so that we allow an arbitrary regression error distribution, which can depend on the covariates and thus can be heteroscedastic. Under mild regularity conditions we establish the theoretical identifiability of such models. Previous work in this context has either (a) specified a fully parametric distribution for the regression errors, (b) specified a homoscedastic distribution for the regression errors, (c) specified the rate of disease in the population (we refer to this as the true population), or (d) made a rare disease approximation. We construct a class of semiparametric estimation procedures that rely on none of these. The estimators differ from the usual semiparametric ones in that they draw conclusions about the true population, while technically operating in a hypothetical superpopulation. We also construct estimators with a unique feature, in that they are robust against the misspecification of the regression error distribution in terms of variance structure, while all other nonparametric effects are estimated despite the biased samples. We establish the asymptotic properties of the estimators and illustrate their finite sample performance through simulation studies, as well as through an empirical example on the relation between red meat consumption and heterocyclic amines. Our analysis verified the positive relationship between red meat consumption and two forms of HCA, indicating that increased red meat consumption leads to increased levels of MeIQA and PhiP, both being risk factors for colorectal cancer. Computer software as well as data to illustrate the methodology are available at http://wileyonlinelibrary.com/journal/rss-datasets.
Affiliation(s)
- Yanyuan Ma
  - Department of Statistics, University of South Carolina, Columbia, SC 29208; Department of Statistics, Texas A&M University, College Station, TX 77843
- Raymond J. Carroll
  - Department of Statistics, University of South Carolina, Columbia, SC 29208; Department of Statistics, Texas A&M University, College Station, TX 77843
12
Rahman S. A Tilted Kernel Estimator for Nonparametric Regression in the Secondary Analysis of Case–Control Studies. STATISTICS IN BIOSCIENCES 2015. [DOI: 10.1007/s12561-014-9120-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
13
Lutz SM, Hokanson JE, Lange C. An alternative hypothesis testing strategy for secondary phenotype data in case-control genetic association studies. Front Genet 2014; 5:188. [PMID: 25071819 PMCID: PMC4076613 DOI: 10.3389/fgene.2014.00188] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 04/01/2014] [Accepted: 06/04/2014] [Indexed: 11/13/2022]
Abstract
Motivated by the challenges associated with accounting for the ascertainment when analyzing secondary phenotypes that are correlated with case-control status, Lin and Zeng have proposed a method that properly reflects the case-control sampling (Lin and Zeng, 2009). The Lin and Zeng method has the advantage of accurately estimating effect sizes for secondary phenotypes that are normally distributed or dichotomous. This method can be computationally intensive in practice under the null hypothesis when the likelihood surface that needs to be maximized can be relatively flat. We propose an extension of the Lin and Zeng method for hypothesis testing that uses proportional odds logistic regression to circumvent these computational issues. Through simulation studies, we compare the power and type-1 error rate of our method to standard approaches and Lin and Zeng's approach.
Affiliation(s)
- Sharon M Lutz
  - Department of Biostatistics, University of Colorado, Aurora, CO, USA
- John E Hokanson
  - Department of Epidemiology, University of Colorado, Aurora, CO, USA
- Christoph Lange
  - Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
  - Channing Laboratory, Harvard Medical School, Boston, MA, USA
  - Institute for Genomic Mathematics, University of Bonn, Bonn, Germany
  - German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany
14
A Note on Penalized Regression Spline Estimation in the Secondary Analysis of Case-Control Data. STATISTICS IN BIOSCIENCES 2013; 5:250-260. [DOI: 10.1007/s12561-013-9094-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/26/2022]
15
Tchetgen Tchetgen EJ. A general regression framework for a secondary outcome in case-control studies. Biostatistics 2013; 15:117-28. [PMID: 24152770 DOI: 10.1093/biostatistics/kxt041] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.5] [Indexed: 11/14/2022]
Abstract
Modern case-control studies typically involve the collection of data on a large number of outcomes, often at considerable logistical and monetary expense. These data are of potentially great value to subsequent researchers, who, although not necessarily concerned with the disease that defined the case series in the original study, may want to use the available information for a regression analysis involving a secondary outcome. Because cases and controls are selected with unequal probability, regression analysis involving a secondary outcome generally must acknowledge the sampling design. In this paper, the author presents a new framework for the analysis of secondary outcomes in case-control studies. The approach is based on a careful re-parameterization of the conditional model for the secondary outcome given the case-control outcome and regression covariates, in terms of (a) the population regression of interest of the secondary outcome given covariates and (b) the population regression of the case-control outcome on covariates. The error distribution for the secondary outcome given covariates and case-control status is otherwise unrestricted. For a continuous outcome, the approach sometimes reduces to extending model (a) by including a residual of (b) as a covariate. However, the framework is general in the sense that models (a) and (b) can take any functional form, and the methodology allows for an identity, log or logit link function for model (a).
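The remark that, for a continuous outcome, the approach sometimes reduces to adding a residual of model (b) as a covariate in model (a) can be illustrated on simulated data. This is a minimal sketch under strong assumptions, not the paper's procedure: the data-generating model, the constant `gamma`, and the helper names `simulate` and `ols` are hypothetical, and the population disease probability P(D=1 | G) is treated as known rather than estimated from the case-control sample.

```python
import numpy as np


def expit(z):
    return 1.0 / (1.0 + np.exp(-z))


def simulate(n_pop=200_000, n_cases=5_000, n_controls=5_000, gamma=2.0, seed=0):
    """Simulate a population and draw a case-control sample (illustrative values)."""
    rng = np.random.default_rng(seed)
    g = rng.binomial(2, 0.3, size=n_pop).astype(float)
    p = expit(-3.0 + 1.0 * g)                  # model (b): population P(D=1 | G)
    d = (rng.random(n_pop) < p).astype(float)
    # Model (a): E[Y | G] = 0.5 * G, because (d - p) has mean zero given G.
    y = 0.5 * g + gamma * (d - p) + rng.normal(size=n_pop)
    cases = rng.choice(np.flatnonzero(d == 1), size=n_cases, replace=False)
    controls = rng.choice(np.flatnonzero(d == 0), size=n_controls, replace=False)
    idx = np.concatenate([cases, controls])
    return g[idx], d[idx], y[idx], p[idx]


def ols(columns, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    x = np.column_stack([np.ones_like(y)] + list(columns))
    coef, *_ = np.linalg.lstsq(x, y, rcond=None)
    return coef


g, d, y, p = simulate()
naive_slope = ols([g], y)[1]            # ignores case-control status
adjusted_slope = ols([g, d - p], y)[1]  # adds the residual D - P(D=1 | G) as a covariate
print(f"naive: {naive_slope:.3f}, residual-adjusted: {adjusted_slope:.3f}")
```

In practice the population disease model would itself have to be estimated, for example with an external prevalence estimate, so the known `p` used here is the main simplification.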
Affiliation(s)
- Eric J Tchetgen Tchetgen
  - Department of Biostatistics, Harvard School of Public Health, 677 Huntington Avenue, Boston, MA 02115, USA
16
Ghosh A, Wright FA, Zou F. Unified Analysis of Secondary Traits in Case-Control Association Studies. J Am Stat Assoc 2013; 108. [PMID: 24409003 DOI: 10.1080/01621459.2013.793121] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Indexed: 10/27/2022]
Abstract
It has been repeatedly shown that in case-control association studies, analysis of a secondary trait which ignores the original sampling scheme can produce highly biased risk estimates. Although a number of approaches have been proposed to properly analyze secondary traits, most approaches fail to reproduce the marginal logistic model assumed for the original case-control trait and/or do not allow for interaction between secondary trait and genotype marker on primary disease risk. In addition, the flexible handling of covariates remains challenging. We present a general retrospective likelihood framework to perform association testing for both binary and continuous secondary traits which respects marginal models and incorporates the interaction term. We provide a computational algorithm, based on a reparameterized approximate profile likelihood, for obtaining the maximum likelihood (ML) estimate and its standard error for the genetic effect on secondary trait, in presence of covariates. For completeness we also present an alternative pseudo-likelihood method for handling covariates. We describe extensive simulations to evaluate the performance of the ML estimator in comparison with the pseudo-likelihood and other competing methods.
Affiliation(s)
- Arpita Ghosh
  - Public Health Foundation of India, New Delhi, India
- Fred A Wright
  - Department of Biostatistics, University of North Carolina at Chapel Hill, North Carolina, USA
- Fei Zou
  - Department of Biostatistics, University of North Carolina at Chapel Hill, North Carolina, USA
17
Lutz S, Yip WK, Hokanson J, Laird N, Lange C. A general semi-parametric approach to the analysis of genetic association studies in population-based designs. BMC Genet 2013; 14:13. [PMID: 23448186 PMCID: PMC3648382 DOI: 10.1186/1471-2156-14-13] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 10/15/2012] [Accepted: 02/01/2013] [Indexed: 12/03/2022]
Abstract
Background: For genetic association studies in designs of unrelated individuals, current statistical methodology typically models the phenotype of interest as a function of the genotype and assumes a known statistical model for the phenotype. In the analysis of complex phenotypes, especially in the presence of ascertainment conditions, the specification of such model assumptions is not straight-forward and is error-prone, potentially causing misleading results. Results: In this paper, we propose an alternative approach that treats the genotype as the random variable and conditions upon the phenotype. Thereby, the validity of the approach does not depend on the correctness of assumptions about the phenotypic model. Misspecification of the phenotypic model may lead to reduced statistical power. Theoretical derivations and simulation studies demonstrate both the validity and the advantages of the approach over existing methodology. In the COPDGene study (a GWAS for Chronic Obstructive Pulmonary Disease (COPD)), we apply the approach to a secondary, quantitative phenotype, the Fagerstrom nicotine dependence score, that is correlated with COPD affection status. The software package that implements this method is available. Conclusions: The flexibility of this approach enables the straight-forward application to quantitative phenotypes and binary traits in ascertained and unascertained samples. In addition to its robustness features, our method provides the platform for the construction of complex statistical models for longitudinal data, multivariate data, multi-marker tests, rare-variant analysis, and others.
Affiliation(s)
- Sharon Lutz
- Department of Biostatistics, University of Colorado Anschutz Medical Campus, Aurora, USA.
|
18
|
Wei J, Carroll RJ, Müller UU, Van Keilegom I, Chatterjee N. Robust estimation for homoscedastic regression in the secondary analysis of case-control data. J R Stat Soc Series B Stat Methodol 2012; 75:185-206. [PMID: 23637568 DOI: 10.1111/j.1467-9868.2012.01052.x] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Indexed: 11/28/2022]
Abstract
Primary analysis of case-control studies focuses on the relationship between disease D and a set of covariates of interest (Y, X). A secondary application of the case-control study, which is often invoked in modern genetic epidemiologic association studies, is to investigate the interrelationship between the covariates themselves. The task is complicated owing to the case-control sampling, where the regression of Y on X is different from what it is in the population. Previous work has assumed a parametric distribution for Y given X and derived semiparametric efficient estimation and inference without any distributional assumptions about X. We take up the issue of estimation of a regression function when Y given X follows a homoscedastic regression model, but otherwise the distribution of Y is unspecified. The semiparametric efficient approaches can be used to construct semiparametric efficient estimates, but they suffer from a lack of robustness to the assumed model for Y given X. We take an entirely different approach. We show how to estimate the regression parameters consistently even if the assumed model for Y given X is incorrect, and thus the estimates are model robust. For this we make the assumption that the disease rate is known or well estimated. The assumption can be dropped when the disease is rare, which is typically so for most case-control studies, and the estimation algorithm simplifies. Simulations and empirical examples are used to illustrate the approach.
Affiliation(s)
- Jiawei Wei
- Texas A&M University, College Station, USA
|
19
|
Wang J, Spitz MR, Amos CI, Wu X, Wetter DW, Cinciripini PM, Shete S. Method for evaluating multiple mediators: mediating effects of smoking and COPD on the association between the CHRNA5-A3 variant and lung cancer risk. PLoS One 2012; 7:e47705. [PMID: 23077662 PMCID: PMC3471886 DOI: 10.1371/journal.pone.0047705] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Received: 07/02/2012] [Accepted: 09/14/2012] [Indexed: 01/18/2023] Open
Abstract
A mediation model explores the direct and indirect effects between an independent variable and a dependent variable by including other variables (or mediators). Mediation analysis has recently been used to dissect the direct and indirect effects of genetic variants on complex diseases using case-control studies. However, bias could arise in the estimations of the genetic variant-mediator association because the presence or absence of the mediator in the study samples is not sampled following the principles of case-control study design. In this case, the mediation analysis using data from case-control studies might lead to biased estimates of coefficients and indirect effects. In this article, we investigated a multiple-mediation model involving a three-path mediating effect through two mediators using case-control study data. We propose an approach to correct bias in coefficients and provide accurate estimates of the specific indirect effects. Our approach can also be used when the original case-control study is frequency matched on one of the mediators. We employed bootstrapping to assess the significance of indirect effects. We conducted simulation studies to investigate the performance of the proposed approach, and showed that it provides more accurate estimates of the indirect effects as well as the percent mediated than standard regressions. We then applied this approach to study the mediating effects of both smoking and chronic obstructive pulmonary disease (COPD) on the association between the CHRNA5-A3 gene locus and lung cancer risk using data from a lung cancer case-control study. The results showed that the genetic variant influences lung cancer risk indirectly through all three different pathways. The percent of genetic association mediated was 18.3% through smoking alone, 30.2% through COPD alone, and 20.6% through the path including both smoking and COPD, and the total genetic variant-lung cancer association explained by the two mediators was 69.1%.
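As a quick arithmetic check on the percentages quoted in the abstract above, the three path-specific shares should add up to the stated total of 69.1%. The sketch below is illustrative only; the dictionary keys are shorthand labels chosen here, not the authors' notation.

```python
# Percent of the genetic association mediated along each path, as quoted in the abstract.
# The key names are illustrative labels, not terms taken from the paper.
percent_mediated = {
    "smoking_only": 18.3,
    "copd_only": 30.2,
    "smoking_and_copd": 20.6,
}

# Round to one decimal place so floating-point noise does not affect the comparison.
total_mediated = round(sum(percent_mediated.values()), 1)
print(total_mediated)
```

The sum matches the total reported for the two mediators combined, so the three pathways account for the whole mediated share.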
Affiliation(s)
- Jian Wang
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
- Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
- Margaret R. Spitz
- Department of Molecular and Cellular Biology, Dan L. Duncan Cancer Center, Baylor College of Medicine, Houston, Texas, United States of America
- Christopher I. Amos
- Department of Genetics, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
- Xifeng Wu
- Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
- David W. Wetter
- Department of Health Disparities Research, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
- Paul M. Cinciripini
- Department of Behavioral Science, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
- Sanjay Shete
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
- Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
|
20
|
Chen HY, Kittles R, Zhang W. Bias correction to secondary trait analysis with case-control design. Stat Med 2012; 32:1494-508. [PMID: 22987618 DOI: 10.1002/sim.5613] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Received: 09/13/2011] [Accepted: 08/21/2012] [Indexed: 11/08/2022]
Abstract
In genetic association studies with densely typed genetic markers, it is often of substantial interest to examine not only the primary phenotype but also the secondary traits for their association with the genetic markers. For more efficient sample ascertainment of the primary phenotype, a case-control design or its variants, such as the extreme-value sampling design for a quantitative trait, are often adopted. The secondary trait analysis without correcting for the sample ascertainment may yield a biased association estimator. We propose a new method aiming at correcting the potential bias due to the inadequate adjustment of the sample ascertainment. The method yields explicit correction formulas that can be used to both screen the genetic markers and rapidly evaluate the sensitivity of the results to the assumed baseline case-prevalence rate in the population. Simulation studies demonstrate good performance of the proposed approach in comparison with the more computationally intensive approaches, such as the compensator approaches and the maximum prospective likelihood approach. We illustrate the application of the approach by analysis of the genetic association of prostate specific antigen in a case-control study of prostate cancer in the African American population.
Affiliation(s)
- Hua Yun Chen
- Division of Epidemiology and Biostatistics, School of Public Health, University of Illinois at Chicago, Chicago, IL 60612 USA.
|
21
|
Wang J, Shete S. Analysis of secondary phenotype involving the interactive effect of the secondary phenotype and genetic variants on the primary disease. Ann Hum Genet 2012; 76:484-99. [PMID: 22881407 DOI: 10.1111/j.1469-1809.2012.00725.x] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 12/21/2022]
Abstract
A genome-wide association (GWA) study is usually designed as a case-control study, where the presence and absence of the primary disease define the cases and controls, respectively. Using the existing data from GWA studies, investigators are also trying to identify the association between genetic variants and secondary phenotypes, which are defined as traits associated with the primary disease. However, recent studies have shown that bias arises in the estimation of marker-secondary phenotype association using originally collected data. We recently proposed a bias correction approach to accurately estimate the odds ratio (OR) for marker-secondary phenotype association. In this communication, we further investigated whether our bias correction approach is robust for a scenario involving the interactive effect of the secondary phenotype and genetic variants on the primary disease. We found that in such a scenario, our bias correction approach also provides an accurate estimation of OR for marker-secondary phenotype association. We investigated accuracy of our approach using simulation studies and showed that the approach better controlled for type I errors than the existing approaches. We also applied our bias correction approach to the real data analysis of association between an N-acetyltransferase gene, NAT2, and smoking on the basis of colorectal adenoma data.
Affiliation(s)
- Jian Wang
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
|
22
|
Li H, Gail MH. Efficient adaptively weighted analysis of secondary phenotypes in case-control genome-wide association studies. Hum Hered 2012; 73:159-73. [PMID: 22710642 DOI: 10.1159/000338943] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Received: 10/25/2011] [Accepted: 04/20/2012] [Indexed: 11/19/2022] Open
Abstract
We propose and compare methods of analysis for detecting associations between genotypes of a single nucleotide polymorphism (SNP) and a dichotomous secondary phenotype (X), when the data arise from a case-control study of a primary dichotomous phenotype (D), which is not rare. We considered both a dichotomous genotype (G) as in recessive or dominant models and an additive genetic model based on the number of minor alleles present. To estimate the log odds ratio β₁ relating X to G in the general population, one needs to understand the conditional distribution [D ∣ X, G] in the general population. For the most general model, [D ∣ X, G], one needs external data on P(D = 1) to estimate β₁. We show that for this 'full model', the maximum likelihood (FM) corresponds to a previously proposed weighted logistic regression (WL) approach if G is dichotomous. For the additive model, WL yields results numerically close, but not identical, to those of the maximum likelihood FM. Efficiency can be gained by assuming that [D ∣ X, G] is a logistic model with no interaction between X and G (the 'reduced model'). However, the resulting maximum likelihood (RM) can be misleading in the presence of interactions. We therefore propose an adaptively weighted approach (AW) that captures the efficiency of RM but is robust to the occasional SNP that might interact with the secondary phenotype to affect the risk of the primary disease. We study the robustness of FM, WL, RM and AW to misspecification of P(D = 1). In principle, one should be able to estimate β₁ without external information on P(D = 1) under the reduced model. However, our simulations show that the resulting inference is unreliable. Therefore, in practice one needs to introduce external information on P(D = 1), even in the absence of interactions between X and G.
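The role of external information on P(D = 1) in the abstract above can be illustrated with a small sketch. It assumes that the weighting in question is inverse-probability-of-sampling weighting, which the abstract does not spell out, and it is not the authors' FM, WL, RM or AW estimator. For dichotomous X and G, cases and controls are reweighted to their population shares and the log odds ratio is read off the reweighted 2x2 table. The function name, argument names and counts are all hypothetical.

```python
import math

def ipw_log_odds_ratio(case_counts, control_counts, prevalence):
    """Log odds ratio between X and G after reweighting a case-control sample.

    case_counts / control_counts: dicts mapping (x, g) in {0, 1}^2 to counts.
    prevalence: externally supplied P(D = 1) in the general population (assumed known).
    """
    n_cases = sum(case_counts.values())
    n_controls = sum(control_counts.values())
    # Each subject is weighted by (population share of its disease stratum)
    # divided by (number sampled from that stratum).
    w_case = prevalence / n_cases
    w_control = (1.0 - prevalence) / n_controls
    # Reweighted estimate of the population joint distribution of (X, G).
    cell = {
        key: w_case * case_counts[key] + w_control * control_counts[key]
        for key in case_counts
    }
    return math.log(
        (cell[(1, 1)] * cell[(0, 0)]) / (cell[(1, 0)] * cell[(0, 1)])
    )

# Hypothetical counts, for illustration only.
example_cases = {(0, 0): 10, (0, 1): 20, (1, 0): 30, (1, 1): 40}
example_controls = {(0, 0): 40, (0, 1): 10, (1, 0): 20, (1, 1): 30}
print(ipw_log_odds_ratio(example_cases, example_controls, prevalence=0.1))
```

When the assumed prevalence tends to 0 the estimate reduces to the control-only log odds ratio, and when it tends to 1 it reduces to the case-only value. This shows how a misspecified P(D = 1) can move the estimate.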
Affiliation(s)
- Huilin Li
- Division of Biostatistics, Department of Population Health, School of Medicine, New York University, New York, NY 10016, USA.
|
23
|
Wang J, Shete S. Estimation of odds ratios of genetic variants for the secondary phenotypes associated with primary diseases. Genet Epidemiol 2011; 35:190-200. [PMID: 21308766 DOI: 10.1002/gepi.20568] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Received: 11/04/2010] [Revised: 12/16/2010] [Accepted: 01/04/2011] [Indexed: 11/07/2022]
Abstract
Genetic association studies for binary diseases are designed as case-control studies: the cases are those affected with the primary disease and the controls are free of the disease. At the time of case-control collection, information about secondary phenotypes is also collected. Association studies of secondary phenotype and genetic variants have received a great deal of interest recently. To study the secondary phenotypes, investigators use standard regression approaches, where individuals with secondary phenotypes are coded as cases and those without secondary phenotypes are coded as controls. However, using the secondary phenotype as an outcome variable in a case-control study might lead to a biased estimate of odds ratios (ORs) for genetic variants. The secondary phenotype is associated with the primary disease; therefore, individuals with and without the secondary phenotype are not sampled following the principles of a case-control study. In this article, we demonstrate that such analyses will lead to a biased estimate of OR and propose new approaches to provide more accurate OR estimates of genetic variants associated with the secondary phenotype for both unmatched and frequency-matched (with respect to the secondary phenotype) case-control studies. We also propose a bootstrapping method to estimate the empirical confidence intervals for the corrected ORs. Using simulation studies and analysis of lung cancer data for single-nucleotide polymorphism associated with smoking quantity, we compared our new approaches to standard logistic regression and to an extended version of the inverse-probability-of-sampling-weighted regression. The proposed approaches provide more accurate estimation of the true OR.
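The abstract above mentions a bootstrapping method for empirical confidence intervals of the corrected ORs. The sketch below shows only the generic percentile-bootstrap idea, applied to a plain 2x2 odds ratio as a stand-in; the authors' bias-corrected estimator is not reproduced. The function names, the 0.5 continuity correction and the toy data are assumptions made for illustration. A real case-control analysis would likely resample cases and controls separately.

```python
import random

def odds_ratio(pairs):
    """Odds ratio from (genotype, phenotype) 0/1 pairs.

    Adds 0.5 to every cell (an assumption of this sketch) so that a
    bootstrap resample with an empty cell does not divide by zero.
    """
    a = b = c = d = 0.5
    for genotype, phenotype in pairs:
        if genotype and phenotype:
            a += 1
        elif genotype:
            b += 1
        elif phenotype:
            c += 1
        else:
            d += 1
    return (a * d) / (b * c)

def bootstrap_percentile_ci(pairs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the odds ratio."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(pairs)
    stats = sorted(
        odds_ratio([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower_index = int((alpha / 2) * n_boot)
    upper_index = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return stats[lower_index], stats[upper_index]

# Toy data, for illustration only: 100 subjects with a strong association.
toy = [(1, 1)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 10 + [(0, 0)] * 40
print(odds_ratio(toy), bootstrap_percentile_ci(toy, n_boot=500))
```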
Affiliation(s)
- Jian Wang
- Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, USA
|