1
Schildcrout JS, Haneuse S, Tao R, Zelnick LR, Schisterman EF, Garbett SP, Mercaldo ND, Rathouz PJ, Heagerty PJ. Two-Phase, Generalized Case-Control Designs for the Study of Quantitative Longitudinal Outcomes. Am J Epidemiol 2020; 189:81-90. [PMID: 31165875 DOI: 10.1093/aje/kwz127] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 11/01/2018] [Revised: 05/06/2019] [Accepted: 05/14/2019] [Indexed: 01/30/2023]
Abstract
We propose a general class of 2-phase epidemiologic study designs for quantitative, longitudinal data that are useful when phase 1 longitudinal outcome and covariate data are available but data on the exposure (e.g., a biomarker) can only be collected on a subset of subjects during phase 2. To conduct a study using a design in the class, one first summarizes the longitudinal outcomes by fitting a simple linear regression of the response on a time-varying covariate for each subject. Sampling strata are defined by splitting the estimated regression intercept or slope distributions into distinct (low, medium, and high) regions. Stratified sampling is then conducted from strata defined by the intercepts, by the slopes, or from a mixture. In general, samples selected with extreme intercept values will yield low variances for associations of time-fixed exposures with the outcome and samples enriched with extreme slope values will yield low variances for associations of time-varying exposures with the outcome (including interactions with time-varying exposures). We describe ascertainment-corrected maximum likelihood and multiple-imputation estimation procedures that permit valid and efficient inferences. We embed all methodological developments within the framework of conducting a substudy that seeks to examine genetic associations with lung function among continuous smokers in the Lung Health Study (United States and Canada, 1986-1994).
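The design steps described in this abstract (per-subject regression, strata from the intercept or slope distribution, stratified phase 2 sampling) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' code: the function names, the simulated data, the 10th and 90th percentile cutoffs, and the stratum sampling probabilities are all hypothetical. The sketch covers only the sampling step; the ascertainment-corrected likelihood and multiple-imputation analyses are not shown.

```python
import numpy as np

def subject_ols(times, y):
    """Return (intercept, slope) from a simple linear regression of y on time."""
    slope, intercept = np.polyfit(times, y, deg=1)
    return intercept, slope

def assign_strata(summary, low_q=0.10, high_q=0.90):
    """Label each value 0 (low), 1 (medium), or 2 (high) using quantile cutoffs."""
    lo, hi = np.quantile(summary, [low_q, high_q])
    return np.where(summary < lo, 0, np.where(summary > hi, 2, 1))

def phase2_sample(strata, probs, rng):
    """Bernoulli-sample subjects with a stratum-specific probability."""
    p = np.asarray(probs, dtype=float)[strata]
    return rng.random(len(strata)) < p

# Hypothetical phase 1 data: random intercepts and slopes plus measurement noise.
rng = np.random.default_rng(0)
n_subjects, n_visits = 500, 5
times = np.arange(n_visits, dtype=float)
b0 = rng.normal(0.0, 1.0, n_subjects)
b1 = rng.normal(-0.1, 0.2, n_subjects)
y = b0[:, None] + b1[:, None] * times + rng.normal(0.0, 0.3, (n_subjects, n_visits))

# Summarize each subject's trajectory, then stratify on the estimated slopes.
fits = np.array([subject_ols(times, y_i) for y_i in y])
slope_strata = assign_strata(fits[:, 1])

# Assumed sampling probabilities: keep every extreme-slope subject, 10% of the rest.
selected = phase2_sample(slope_strata, probs=[1.0, 0.1, 1.0], rng=rng)
```

Passing `fits[:, 0]` to `assign_strata` instead would give the intercept-based variant of the design.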
Affiliation(s)
- Sebastien Haneuse
- Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts
- Ran Tao
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee
- Leila R Zelnick
- Division of Nephrology, Department of Medicine, University of Washington, Seattle, Washington
- Enrique F Schisterman
- Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, Bethesda, Maryland
- Shawn P Garbett
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee
- Paul J Rathouz
- Department of Population Health, Dell Medical School, University of Texas, Austin, Texas
- Patrick J Heagerty
- Department of Biostatistics, School of Public Health, University of Washington, Seattle, Washington
2
Pan Y, Cai J, Longnecker MP, Zhou H. Secondary outcome analysis for data from an outcome-dependent sampling design. Stat Med 2018; 37:2321-2337. [PMID: 29682775 DOI: 10.1002/sim.7672] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 11/07/2016] [Revised: 01/19/2018] [Accepted: 03/08/2018] [Indexed: 11/11/2022]
Abstract
An outcome-dependent sampling (ODS) scheme is a cost-effective way to conduct a study. For a study with a continuous primary outcome, an ODS scheme can be implemented in which the expensive exposure is measured only on a simple random sample and on supplemental samples selected from the 2 tails of the primary outcome variable. Given the tremendous cost invested in collecting the primary exposure information, investigators often would like to use the available data to study the relationship between a secondary outcome and the obtained exposure variable. This is referred to as secondary analysis. Secondary analysis in ODS designs can be tricky, as the ODS sample is not a random sample from the general population. In this article, we use the inverse probability weighted and augmented inverse probability weighted estimating equations to analyze the secondary outcome for data obtained from the ODS design. We do not make any parametric assumptions on the primary and secondary outcomes and only specify the form of the regression mean models, thus allowing an arbitrary error distribution. Our approach is robust to second- and higher-order moment misspecification. It also leads to more precise estimates of the parameters by effectively using all the available participants. Through simulation studies, we show that the proposed estimator is consistent and asymptotically normal. Data from the Collaborative Perinatal Project are analyzed to illustrate our method.
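The inverse probability weighting idea in this abstract can be sketched for a linear mean model, where the weighted estimating equation has a closed-form solution. This is a rough illustration under stated assumptions and is not the authors' estimator: `ipw_linear_fit`, the simulated cohort, and the selection probabilities are all hypothetical. The ODS scheme is also simplified to Bernoulli sampling with known probabilities, and the augmented estimator is not shown.

```python
import numpy as np

def ipw_linear_fit(X, y, pi):
    """Solve sum_i (1 / pi_i) * x_i * (y_i - x_i' beta) = 0 for beta.

    X, y, and pi hold only the subjects selected into the ODS sample;
    pi is each subject's (assumed known) selection probability.
    """
    w = 1.0 / np.asarray(pi, dtype=float)
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

# Hypothetical cohort: the exposure x affects both outcomes, and the
# outcome errors are correlated, so sampling on y1 is informative for y2.
rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(size=n)
e1 = rng.normal(size=n)
e2 = 0.6 * e1 + 0.8 * rng.normal(size=n)
y1 = 1.0 * x + e1            # primary outcome (drives sampling)
y2 = 0.5 + 1.0 * x + e2      # secondary outcome (analysis target)

# Simplified ODS: higher selection probability in the two tails of y1.
lo, hi = np.quantile(y1, [0.10, 0.90])
pi = np.where((y1 < lo) | (y1 > hi), 0.6, 0.1)
sampled = rng.random(n) < pi

X = np.column_stack([np.ones(sampled.sum()), x[sampled]])
beta_ipw = ipw_linear_fit(X, y2[sampled], pi[sampled])
# Unweighted fit on the same sample, for comparison with the weighted one.
beta_naive = np.linalg.lstsq(X, y2[sampled], rcond=None)[0]
```

Comparing `beta_ipw` with `beta_naive` against the simulated values (intercept 0.5, slope 1.0) gives a feel for how ignoring the sampling scheme can distort a secondary analysis.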
Affiliation(s)
- Yinghao Pan
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Jianwen Cai
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Matthew P Longnecker
- Epidemiology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC, USA
- Haibo Zhou
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
3
Espin-Garcia O, Craiu RV, Bull SB. Two-phase designs for joint quantitative-trait-dependent and genotype-dependent sampling in post-GWAS regional sequencing. Genet Epidemiol 2017; 42:104-116. [PMID: 29239496 PMCID: PMC5814750 DOI: 10.1002/gepi.22099] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 09/07/2017] [Revised: 10/23/2017] [Accepted: 10/23/2017] [Indexed: 11/09/2022]
Abstract
We evaluate two‐phase designs to follow up findings from a genome‐wide association study (GWAS) when the cost of regional sequencing in the entire cohort is prohibitive. We develop novel expectation‐maximization‐based inference under a semiparametric maximum likelihood formulation tailored for post‐GWAS inference. A GWAS‐SNP (where SNP is single nucleotide polymorphism) serves as a surrogate covariate in inferring association between a sequence variant and a normally distributed quantitative trait (QT). We assess test validity and quantify efficiency and power of joint QT‐SNP‐dependent sampling and analysis under alternative sample allocations by simulations. Joint allocation balanced on SNP genotype and extreme‐QT strata yields significant power improvements compared to marginal QT‐ or SNP‐based allocations. We illustrate the proposed method and evaluate the sensitivity of sample allocation to sampling variation using data from a sequencing study of systolic blood pressure.
Affiliation(s)
- Osvaldo Espin-Garcia
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada; Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, ON, Canada
- Radu V Craiu
- Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
- Shelley B Bull
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada; Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, ON, Canada
4
Ding J, Lu TS, Cai J, Zhou H. Recent progresses in outcome-dependent sampling with failure time data. Lifetime Data Anal 2017; 23:57-82. [PMID: 26759313 PMCID: PMC4942414 DOI: 10.1007/s10985-015-9355-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 07/30/2015] [Accepted: 12/22/2015] [Indexed: 06/05/2023]
Abstract
An outcome-dependent sampling (ODS) design is a retrospective sampling scheme where one observes the primary exposure variables with a probability that depends on the observed value of the outcome variable. When the outcome of interest is failure time, the observed data are often censored. By allowing the selection of the supplemental samples to depend on whether the event of interest happens or not and by oversampling subjects from the most informative regions, an ODS design for time-to-event data can reduce the cost of the study and improve its efficiency. We review recent progress and advances in research on ODS designs with failure time data. This includes research on ODS-related designs such as the case-cohort design, generalized case-cohort design, stratified case-cohort design, general failure-time ODS design, length-biased sampling design, and interval sampling design.
Affiliation(s)
- Jieli Ding
- School of Mathematics and Statistics, Wuhan University, Wuhan, Hubei, 430072, China
- Tsui-Shan Lu
- Department of Mathematics, National Taiwan Normal University, Taipei, 116, Taiwan
- Jianwen Cai
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
- Haibo Zhou
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
5
Schildcrout JS, Rathouz PJ, Zelnick LR, Garbett SP, Heagerty PJ. Biased sampling designs to improve research efficiency: factors influencing pulmonary function over time in children with asthma. Ann Appl Stat 2015; 9:731-753. [PMID: 26322147 PMCID: PMC4551501 DOI: 10.1214/15-aoas826] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 11/19/2022]
Abstract
Substudies of the Childhood Asthma Management Program (CAMP Research Group, 1999, 2000) seek to identify patient characteristics associated with asthma symptoms and lung function. To determine if genetic measures are associated with trajectories of lung function as measured by forced vital capacity (FVC), children in the primary cohort study retrospectively had candidate loci evaluated. Given participant burden and constraints on financial resources, it is often desirable to target a sub-sample for ascertainment of costly measures. Methods that can leverage the longitudinal outcome on the full cohort to selectively measure informative individuals have been promising, but have been restricted in their use to analysis of the targeted sub-sample. In this paper we detail two multiple imputation analysis strategies that exploit outcome and partially observed covariate data on the non-sampled subjects, and we characterize alternative design and analysis combinations that could be used for future studies of pulmonary function and other outcomes. Candidate predictor (e.g. IL10 cytokine polymorphisms) associations obtained from targeted sampling designs can be estimated with very high efficiency compared to standard designs. Further, even though multiple imputation can dramatically improve estimation efficiency for covariates available on all subjects (e.g., gender and baseline age), only modest efficiency gains were observed in parameters associated with predictors that are exclusive to the targeted sample. Our results suggest that future studies of longitudinal trajectories can be efficiently conducted by use of outcome-dependent designs and associated full cohort analysis.
Affiliation(s)
- Paul J Rathouz
- Department of Biostatistics and Medical Informatics, University of Wisconsin School of Medicine and Public Health
- Leila R Zelnick
- Department of Biostatistics, University of Washington School of Public Health
- Shawn P Garbett
- Division of Cancer Biology, Vanderbilt University School of Medicine
- Patrick J Heagerty
- Department of Biostatistics, University of Washington School of Public Health
6
Ding J, Zhou H, Liu Y, Cai J, Longnecker MP. Estimating effect of environmental contaminants on women's subfecundity for the MoBa study data with an outcome-dependent sampling scheme. Biostatistics 2014; 15:636-50. [PMID: 24812419 DOI: 10.1093/biostatistics/kxu016] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Indexed: 11/13/2022]
Abstract
Motivated by a need arising in our ongoing environmental study in the Norwegian Mother and Child Cohort (MoBa) study, we consider an outcome-dependent sampling (ODS) scheme for failure-time data with censoring. Like the case-cohort design, the ODS design enriches the observed sample by selectively including certain failure subjects. We present an estimated maximum semiparametric empirical likelihood estimation (EMSELE) under the proportional hazards model framework. The asymptotic properties of the proposed estimator were derived. Simulation studies were conducted to evaluate the small-sample performance of our proposed method. Our analyses show that the proposed estimator and design are more efficient than the current default approach and other competing approaches. Applying the proposed approach to the data set from the MoBa study, we found a significant effect of an environmental contaminant on fecundability.
Affiliation(s)
- Jieli Ding
- School of Mathematics and Statistics, Wuhan University, Wuhan, Hubei 430072, China
- Haibo Zhou
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Yanyan Liu
- School of Mathematics and Statistics, Wuhan University, Wuhan, Hubei 430072, China
- Jianwen Cai
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Matthew P Longnecker
- National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC 27709, USA
7
Zhou H, Xu W, Zeng D, Cai J. Semiparametric Inference for Data with a Continuous Outcome from a Two-Phase Probability Dependent Sampling Scheme. J R Stat Soc Series B Stat Methodol 2013; 76:197-215. [PMID: 24737947 DOI: 10.1111/rssb.12029] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/28/2022]
Abstract
Multi-phased designs and biased sampling designs are two of the well recognized approaches to enhance study efficiency. In this paper, we propose a new and cost-effective sampling design, the two-phase probability dependent sampling design (PDS), for studies with a continuous outcome. This design will enable investigators to make efficient use of resources by targeting more informative subjects for sampling. We develop a new semiparametric empirical likelihood inference method to take advantage of data obtained through a PDS design. Simulation study results indicate that the proposed sampling scheme, coupled with the proposed estimator, is more efficient and more powerful than the existing outcome dependent sampling design and the simple random sampling design with the same sample size. We illustrate the proposed method with a real data set from an environmental epidemiologic study.
Affiliation(s)
- Haibo Zhou
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, U.S.A
- Wangli Xu
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, U.S.A; Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, 100872, China
- Donglin Zeng
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, U.S.A
- Jianwen Cai
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, U.S.A
8
Schildcrout JS, Garbett SP, Heagerty PJ. Outcome vector dependent sampling with longitudinal continuous response data: stratified sampling based on summary statistics. Biometrics 2013; 69:405-16. [PMID: 23409789 PMCID: PMC3880022 DOI: 10.1111/biom.12013] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Received: 03/01/2012] [Revised: 11/01/2012] [Accepted: 11/01/2012] [Indexed: 11/30/2022]
Abstract
The analysis of longitudinal trajectories usually focuses on evaluation of explanatory factors that are either associated with rates of change, or with overall mean levels of a continuous outcome variable. In this article, we introduce valid design and analysis methods that permit outcome dependent sampling of longitudinal data for scenarios where all outcome data currently exist, but a targeted substudy is being planned in order to collect additional key exposure information on a limited number of subjects. We propose a stratified sampling based on specific summaries of individual longitudinal trajectories, and we detail an ascertainment corrected maximum likelihood approach for estimation using the resulting biased sample of subjects. In addition, we demonstrate that the efficiency of an outcome-based sampling design relative to use of a simple random sample depends highly on the choice of outcome summary statistic used to direct sampling, and we show a natural link between the goals of the longitudinal regression model and corresponding desirable designs. Using data from the Childhood Asthma Management Program, where genetic information required retrospective ascertainment, we study a range of designs that examine lung function profiles over 4 years of follow-up for children classified according to their genotype for the IL 13 cytokine.
Affiliation(s)
- Jonathan S Schildcrout
- Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN, USA.
9
Xu W, Zhou H. Mixed effect regression analysis for a cluster-based two-stage outcome-auxiliary-dependent sampling design with a continuous outcome. Biostatistics 2012; 13:650-64. [PMID: 22723503 PMCID: PMC3440236 DOI: 10.1093/biostatistics/kxs013] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 05/26/2011] [Revised: 04/20/2012] [Accepted: 04/23/2012] [Indexed: 11/13/2022]
Abstract
The two-stage design is a well-known, cost-effective way of conducting biomedical studies when the exposure variable is expensive or difficult to measure. Recent research developments have further allowed one or both stages of the two-stage design to be outcome dependent on a continuous outcome variable. This outcome-dependent sampling feature enables further efficiency gain in parameter estimation and overall cost reduction of the study (e.g. Wang, X. and Zhou, H., 2010. Design and inference for cancer biomarker study with an outcome and auxiliary-dependent subsampling. Biometrics 66, 502-511; Zhou, H., Song, R., Wu, Y. and Qin, J., 2011. Statistical inference for a two-stage outcome-dependent sampling design with a continuous outcome. Biometrics 67, 194-202). In this paper, we develop a semiparametric mixed effect regression model for data from a two-stage design where the second-stage data are sampled with an outcome-auxiliary-dependent sample (OADS) scheme. Our method allows the cluster- or center-effects of the study subjects to be accounted for. We propose an estimated likelihood function to estimate the regression parameters. A simulation study indicates that greater study efficiency gains can be achieved under the proposed two-stage OADS design with center-effects when compared with other alternative sampling schemes. We illustrate the proposed method by analyzing a dataset from the Collaborative Perinatal Project.
Affiliation(s)
- Wangli Xu
- Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing 100872, China and Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
10
Ding J, Liu Y, Peden DB, Kleeberger SR, Zhou H. Regression analysis for a summed missing data problem under an outcome-dependent sampling scheme. Can J Stat 2012. [DOI: 10.1002/cjs.11131] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/09/2022]