1
Kundu P, Chatterjee N. Logistic regression analysis of two-phase studies using generalized method of moments. Biometrics 2023; 79:241-252. [PMID: 34677824 DOI: 10.1111/biom.13584] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2020] [Revised: 08/10/2021] [Accepted: 09/23/2021] [Indexed: 11/29/2022]
Abstract
Two-phase designs can reduce the cost of epidemiological studies by limiting the ascertainment of expensive covariates and/or exposures to an efficiently selected subset (phase-II) of a larger (phase-I) study. Efficient analysis of the resulting data set combining disparate information from phase-I and phase-II, however, can be complex. Most of the existing methods, including the semiparametric maximum-likelihood estimator, require the information in phase-I to be summarized into a fixed number of strata. In this paper, we describe a novel method for the analysis of two-phase studies where information from phase-I is summarized by parameters associated with a reduced logistic regression model of the disease outcome on available covariates. We then set up estimating equations for parameters associated with the desired extended logistic regression model, based on information on the reduced model parameters from phase-I and complete data available at phase-II after accounting for the nonrandom sampling design. We use the generalized method of moments to solve the over-identified estimating equations and develop the resulting asymptotic theory for the proposed estimator. Simulation studies show that the use of reduced parametric models, as opposed to summarizing data into strata, can lead to more efficient utilization of phase-I data. An application of the proposed method is illustrated using the data from the U.S. National Wilms Tumor Study.
Affiliation(s)
- Prosenjit Kundu
  - Department of Biostatistics, Bloomberg School of Public Health, The Johns Hopkins University, Baltimore, Maryland, USA
- Nilanjan Chatterjee
  - Department of Biostatistics, Bloomberg School of Public Health; Department of Oncology, School of Medicine, The Johns Hopkins University, Baltimore, Maryland, USA
2
Zhang H, Ding J. Hypothesis testing in outcome-dependent sampling design under generalized linear models. Commun Stat-Simul C 2022. [DOI: 10.1080/03610918.2019.1682155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
Affiliation(s)
- Haodong Zhang
  - School of Mathematics and Statistics, Wuhan University, Wuhan, Hubei, China
- Jieli Ding
  - School of Mathematics and Statistics, Wuhan University, Wuhan, Hubei, China
3
Amorim G, Tao R, Lotspeich S, Shaw PA, Lumley T, Shepherd BE. Two-Phase Sampling Designs for Data Validation in Settings with Covariate Measurement Error and Continuous Outcome. Journal of the Royal Statistical Society. Series A (Statistics in Society) 2021; 184:1368-1389. [PMID: 34975235 PMCID: PMC8715909 DOI: 10.1111/rssa.12689] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 06/14/2023]
Abstract
Measurement errors are present in many data collection procedures and can harm analyses by biasing estimates. To correct for measurement error, researchers often validate a subsample of records and then incorporate the information learned from this validation sample into estimation. In practice, the validation sample is often selected using simple random sampling (SRS). However, SRS leads to inefficient estimates because it ignores information on the error-prone variables, which can be highly correlated with the unknown truth. Applying and extending ideas from the two-phase sampling literature, we propose optimal and nearly optimal designs for selecting the validation sample in the classical measurement-error framework. We target designs to improve the efficiency of model-based and design-based estimators, and show how the resulting designs compare to each other. Our results suggest that sampling schemes that extract more information from the error-prone data are substantially more efficient than SRS, for both design- and model-based estimators. The optimal procedure, however, depends on the analysis method, and can differ substantially. This is supported by theory and simulations. We illustrate the various designs using data from an HIV cohort study.
Affiliation(s)
- Gustavo Amorim
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
- Ran Tao
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
  - Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA
- Sarah Lotspeich
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
- Pamela A. Shaw
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, PA, USA
- Thomas Lumley
  - Department of Statistics, University of Auckland, Auckland, New Zealand
- Bryan E. Shepherd
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
4
Che M, Lawless JF, Han P. Empirical and conditional likelihoods for two-phase studies. Can J Stat 2020. [DOI: 10.1002/cjs.11566] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Affiliation(s)
- Menglu Che
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
- Jerald F. Lawless
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
- Peisong Han
  - Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, MI, USA
5
Zhou Q, Cai J, Zhou H. Semiparametric inference for a two-stage outcome-dependent sampling design with interval-censored failure time data. Lifetime Data Analysis 2020; 26:85-108. [PMID: 30617753 PMCID: PMC6612481 DOI: 10.1007/s10985-019-09461-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/27/2017] [Accepted: 01/02/2019] [Indexed: 06/09/2023]
Abstract
We propose a two-stage outcome-dependent sampling design and inference procedure for studies that concern interval-censored failure time outcomes. This design enhances the study efficiency by allowing the selection probabilities of the second-stage sample, for which the expensive exposure variable is ascertained, to depend on the first-stage observed interval-censored failure time outcomes. In particular, the second-stage sample is enriched by selectively including subjects who are known or observed to experience the failure at an early or late time. We develop a sieve semiparametric maximum pseudo likelihood procedure that makes use of all available data from the proposed two-stage design. The resulting regression parameter estimator is shown to be consistent and asymptotically normal, and a consistent estimator for its asymptotic variance is derived. Simulation results demonstrate that the proposed design and inference procedure perform well in practical situations and are more efficient than the existing designs and methods. An application to a phase 3 HIV vaccine trial is provided.
Affiliation(s)
- Qingning Zhou
  - Department of Mathematics and Statistics, University of North Carolina at Charlotte, Fretwell 335L, 9201 University City Blvd., Charlotte, NC, 28223, USA
- Jianwen Cai
  - Department of Biostatistics, University of North Carolina at Chapel Hill, 3101D McGavran-Greenberg Hall, Chapel Hill, NC, 27599, USA
- Haibo Zhou
  - Department of Biostatistics, University of North Carolina at Chapel Hill, 3104C McGavran-Greenberg Hall, Chapel Hill, NC, 27599, USA
6
Zhou Q, Cai J, Zhou H. Outcome-dependent sampling with interval-censored failure time data. Biometrics 2017; 74:58-67. [PMID: 28771664 DOI: 10.1111/biom.12744] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 09/01/2016] [Revised: 06/01/2017] [Accepted: 06/01/2017] [Indexed: 11/30/2022]
Abstract
Epidemiologic studies and disease prevention trials often seek to relate an exposure variable to a failure time that suffers from interval-censoring. When the failure rate is low and the time intervals are wide, a large cohort is often required so as to yield reliable precision on the exposure-failure-time relationship. However, large cohort studies with simple random sampling could be prohibitive for investigators with a limited budget, especially when the exposure variables are expensive to obtain. Alternative cost-effective sampling designs and inference procedures are therefore desirable. We propose an outcome-dependent sampling (ODS) design with interval-censored failure time data, where we enrich the observed sample by selectively including certain more informative failure subjects. We develop a novel sieve semiparametric maximum empirical likelihood approach for fitting the proportional hazards model to data from the proposed interval-censoring ODS design. This approach employs the empirical likelihood and sieve methods to deal with the infinite-dimensional nuisance parameters, which greatly reduces the dimensionality of the estimation problem and eases the computational difficulty. The consistency and asymptotic normality of the resulting regression parameter estimator are established. The results from our extensive simulation study show that the proposed design and method work well for practical situations and are more efficient than the alternative designs and competing approaches. An example from the Atherosclerosis Risk in Communities (ARIC) study is provided for illustration.
Affiliation(s)
- Qingning Zhou
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, USA
- Jianwen Cai
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, USA
- Haibo Zhou
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, USA
7
Ding J, Lu TS, Cai J, Zhou H. Recent progresses in outcome-dependent sampling with failure time data. Lifetime Data Analysis 2017; 23:57-82. [PMID: 26759313 PMCID: PMC4942414 DOI: 10.1007/s10985-015-9355-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 07/30/2015] [Accepted: 12/22/2015] [Indexed: 06/05/2023]
Abstract
An outcome-dependent sampling (ODS) design is a retrospective sampling scheme where one observes the primary exposure variables with a probability that depends on the observed value of the outcome variable. When the outcome of interest is failure time, the observed data are often censored. By allowing the selection of the supplemental samples to depend on whether the event of interest happens or not and by oversampling subjects from the most informative regions, an ODS design for time-to-event data can reduce the cost of the study and improve efficiency. We review recent progress and advances in research on ODS designs with failure time data. This includes research on ODS-related designs such as the case-cohort design, generalized case-cohort design, stratified case-cohort design, general failure-time ODS design, length-biased sampling design and interval sampling design.
Affiliation(s)
- Jieli Ding
  - School of Mathematics and Statistics, Wuhan University, Wuhan, Hubei, 430072, China
- Tsui-Shan Lu
  - Department of Mathematics, National Taiwan Normal University, Taipei, 116, Taiwan
- Jianwen Cai
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
- Haibo Zhou
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
8
9
Ding J, Zhou H, Liu Y, Cai J, Longnecker MP. Estimating effect of environmental contaminants on women's subfecundity for the MoBa study data with an outcome-dependent sampling scheme. Biostatistics 2014; 15:636-50. [PMID: 24812419 DOI: 10.1093/biostatistics/kxu016] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Indexed: 11/13/2022]
Abstract
Motivated by the needs of our ongoing environmental study within the Norwegian Mother and Child Cohort (MoBa) study, we consider an outcome-dependent sampling (ODS) scheme for failure-time data with censoring. Like the case-cohort design, the ODS design enriches the observed sample by selectively including certain failure subjects. We present an estimated maximum semiparametric empirical likelihood estimation (EMSELE) under the proportional hazards model framework. The asymptotic properties of the proposed estimator were derived. Simulation studies were conducted to evaluate the small-sample performance of our proposed method. Our analyses show that the proposed estimator and design are more efficient than the current default approach and other competing approaches. Applying the proposed approach to the data set from the MoBa study, we found a significant effect of an environmental contaminant on fecundability.
Affiliation(s)
- Jieli Ding
  - School of Mathematics and Statistics, Wuhan University, Wuhan, Hubei 430072, China
- Haibo Zhou
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Yanyan Liu
  - School of Mathematics and Statistics, Wuhan University, Wuhan, Hubei 430072, China
- Jianwen Cai
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Matthew P Longnecker
  - National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC 27709, USA
10
Zhou H, Xu W, Zeng D, Cai J. Semiparametric Inference for Data with a Continuous Outcome from a Two-Phase Probability Dependent Sampling Scheme. J R Stat Soc Series B Stat Methodol 2013; 76:197-215. [PMID: 24737947 DOI: 10.1111/rssb.12029] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/28/2022]
Abstract
Multi-phased designs and biased sampling designs are two of the well recognized approaches to enhance study efficiency. In this paper, we propose a new and cost-effective sampling design, the two-phase probability dependent sampling design (PDS), for studies with a continuous outcome. This design will enable investigators to make efficient use of resources by targeting more informative subjects for sampling. We develop a new semiparametric empirical likelihood inference method to take advantage of data obtained through a PDS design. Simulation study results indicate that the proposed sampling scheme, coupled with the proposed estimator, is more efficient and more powerful than the existing outcome dependent sampling design and the simple random sampling design with the same sample size. We illustrate the proposed method with a real data set from an environmental epidemiologic study.
Affiliation(s)
- Haibo Zhou
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, USA
- Wangli Xu
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, USA
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, 100872, China
- Donglin Zeng
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, USA
- Jianwen Cai
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, USA
11
Xu W, Zhou H. Mixed effect regression analysis for a cluster-based two-stage outcome-auxiliary-dependent sampling design with a continuous outcome. Biostatistics 2012; 13:650-64. [PMID: 22723503 PMCID: PMC3440236 DOI: 10.1093/biostatistics/kxs013] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 05/26/2011] [Revised: 04/20/2012] [Accepted: 04/23/2012] [Indexed: 11/13/2022]
Abstract
The two-stage design is a well-known, cost-effective way of conducting biomedical studies when the exposure variable is expensive or difficult to measure. Recent research developments further allow one or both stages of the two-stage design to depend on a continuous outcome variable. This outcome-dependent sampling feature enables further efficiency gain in parameter estimation and overall cost reduction of the study (e.g. Wang, X. and Zhou, H., 2010. Design and inference for cancer biomarker study with an outcome and auxiliary-dependent subsampling. Biometrics 66, 502-511; Zhou, H., Song, R., Wu, Y. and Qin, J., 2011. Statistical inference for a two-stage outcome-dependent sampling design with a continuous outcome. Biometrics 67, 194-202). In this paper, we develop a semiparametric mixed effect regression model for data from a two-stage design where the second-stage data are sampled with an outcome-auxiliary-dependent sample (OADS) scheme. Our method allows the cluster- or center-effects of the study subjects to be accounted for. We propose an estimated likelihood function to estimate the regression parameters. A simulation study indicates that greater study efficiency gains can be achieved under the proposed two-stage OADS design with center-effects when compared with alternative sampling schemes. We illustrate the proposed method by analyzing a dataset from the Collaborative Perinatal Project.
Affiliation(s)
- Wangli Xu
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing 100872, China
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
12
Ding J, Liu Y, Peden DB, Kleeberger SR, Zhou H. Regression analysis for a summed missing data problem under an outcome-dependent sampling scheme. Can J Stat 2012. [DOI: 10.1002/cjs.11131] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/09/2022]
13
Li D, Lewinger JP, Gauderman WJ, Murcray CE, Conti D. Using extreme phenotype sampling to identify the rare causal variants of quantitative traits in association studies. Genet Epidemiol 2011; 35:790-9. [PMID: 21922541 DOI: 10.1002/gepi.20628] [Citation(s) in RCA: 105] [Impact Index Per Article: 8.1] [Received: 01/03/2011] [Revised: 07/17/2011] [Accepted: 07/22/2011] [Indexed: 12/11/2022]
Abstract
Variants identified in recent genome-wide association studies based on the common-disease common-variant hypothesis are far from fully explaining the heritability of complex traits. Rare variants may, in part, explain some of the missing heritability. Here, we explored the advantage of extreme phenotype sampling in rare-variant analysis and refined this design framework for future large-scale association studies on quantitative traits. We first proposed a power calculation approach for a likelihood-based analysis method. We then used this approach to demonstrate the potential advantages of extreme phenotype sampling for rare variants. Next, we discussed how this design can influence future sequencing-based association studies from a cost-efficiency (with the phenotyping cost included) perspective. Moreover, we discussed the potential of a two-stage design with the extreme sample as the first stage and the remaining nonextreme subjects as the second stage. We demonstrated that this two-stage design is a cost-efficient alternative to the one-stage cross-sectional design or traditional two-stage design. We then discussed the analysis strategies for this extreme two-stage design and proposed a corresponding design optimization procedure. To address many practical concerns, for example measurement error or phenotypic heterogeneity at the very extremes, we examined an approach in which individuals with very extreme phenotypes are discarded. We demonstrated that even with a substantial proportion of these extreme individuals discarded, an extreme-based sampling can still be more efficient. Finally, we expanded the current analysis and design framework to accommodate the CMC approach where multiple rare variants in the same gene region are analyzed jointly.
Affiliation(s)
- Dalin Li
  - Medical Genetics Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA