1. Cao Y, Haneuse S, Zheng Y, Chen J. Two-phase stratified sampling and analysis for predicting binary outcomes. Biostatistics 2021:6470040. [PMID: 34923588 DOI: 10.1093/biostatistics/kxab044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2021] [Revised: 11/03/2021] [Accepted: 11/22/2021] [Indexed: 11/13/2022]
Abstract
The two-phase study design is a cost-efficient sampling strategy when certain data elements are expensive and, thus, can only be collected on a sub-sample of subjects. To date, guidance on how best to allocate resources within the design has assumed that primary interest lies in estimating association parameters. When primary interest lies in the development and evaluation of a risk prediction tool, however, such guidance may, in fact, be detrimental. To resolve this, we propose a novel strategy for resource allocation based on oversampling cases and subjects who have more extreme risk estimates according to a preliminary model developed using fully observed predictors. Key to the proposed strategy is that it focuses on enhancing efficiency regarding estimation of measures of predictive accuracy, rather than on efficiency regarding association parameters, which is the standard paradigm. Towards valid estimation and inference for accuracy measures using the resultant data, we extend an existing semiparametric maximum likelihood method for estimating odds ratio association parameters to accommodate the biased sampling scheme and data incompleteness. Motivated by our sampling design, we additionally propose a general post-stratification scheme for analyzing general two-phase data for estimating predictive accuracy measures. Through theoretical calculations and simulation studies, we show that the proposed sampling strategy and post-stratification scheme achieve the promised efficiency improvement. Finally, we apply the proposed methods to develop and evaluate a preliminary model for predicting the risk of hospital readmission after cardiac surgery using data from the Pennsylvania Health Care Cost Containment Council.
Affiliation(s)
- Yaqi Cao
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Perelman School of Medicine, Philadelphia, PA 19104, USA and Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
- Sebastien Haneuse
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115, USA
- Yingye Zheng
  - Department of Biostatistics, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Seattle, WA 98109, USA
- Jinbo Chen
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Perelman School of Medicine, Philadelphia, PA 19104, USA
2. Tao R, Mercaldo ND, Haneuse S, Maronge JM, Rathouz PJ, Heagerty PJ, Schildcrout JS. Two-wave two-phase outcome-dependent sampling designs, with applications to longitudinal binary data. Stat Med 2021; 40:1863-1876. [PMID: 33442883 DOI: 10.1002/sim.8876] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/29/2020] [Revised: 12/07/2020] [Accepted: 12/25/2020] [Indexed: 12/26/2022]
Abstract
Two-phase outcome-dependent sampling (ODS) designs are useful when resource constraints prohibit expensive exposure ascertainment on all study subjects. One class of ODS designs for longitudinal binary data stratifies subjects into three strata according to those who experience the event at none, some, or all follow-up times. For time-varying covariate effects, exclusively selecting subjects with response variation can yield highly efficient estimates. However, if interest lies in the association of a time-invariant covariate, or the joint associations of time-varying and time-invariant covariates with the outcome, then the optimal design is unknown. Therefore, we propose a class of two-wave two-phase ODS designs for longitudinal binary data. We split the second-phase sample selection into two waves, between which an interim design evaluation analysis is conducted. The interim design evaluation analysis uses first-wave data to conduct a simulation-based search for the optimal second-wave design that will improve the likelihood of study success. Although we focus on longitudinal binary response data, the proposed design is general and can be applied to other response distributions. We believe that the proposed designs can be useful in settings where (1) the expected second-phase sample size is fixed and one must tailor stratum-specific sampling probabilities to maximize estimation efficiency, or (2) relative sampling probabilities are fixed across sampling strata and one must tailor sample size to achieve a desired precision. We describe the class of designs, examine finite sampling operating characteristics, and apply the designs to an exemplar longitudinal cohort study, the Lung Health Study.
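The three-strata classification mentioned in this abstract (event at none, some, or all follow-up times) can be illustrated with a minimal Python sketch. The function names, the toy subjects, and the sampling probabilities below are hypothetical and chosen only for illustration; this is not the authors' two-wave procedure, which also involves an interim simulation-based design search.

```python
import random

def response_stratum(ys):
    """Classify a binary response series as 'none', 'some', or 'all'."""
    total = sum(ys)
    if total == 0:
        return "none"
    if total == len(ys):
        return "all"
    return "some"

def sample_by_stratum(subjects, probs, seed=0):
    """Select subject ids with a stratum-specific probability (hypothetical helper)."""
    rng = random.Random(seed)
    return [sid for sid, ys in subjects.items()
            if rng.random() < probs[response_stratum(ys)]]

# Toy longitudinal binary outcomes at four follow-up times (made-up data).
subjects = {
    "a": [0, 0, 0, 0],
    "b": [0, 1, 0, 1],
    "c": [1, 1, 1, 1],
    "d": [1, 0, 0, 0],
}

# Assumed design: sample only subjects whose response varies over time.
probs = {"none": 0.0, "some": 1.0, "all": 0.0}
picked = sample_by_stratum(subjects, probs)
```

With these assumed probabilities only subjects with response variation are selected, which corresponds to the design the abstract describes as efficient for time-varying covariate effects.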
Affiliation(s)
- Ran Tao
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
  - Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Nathaniel D Mercaldo
  - Departments of Radiology and Neurology, Massachusetts General Hospital and Harvard University, Boston, Massachusetts, USA
- Sebastien Haneuse
  - Department of Biostatistics, Harvard University, Boston, Massachusetts, USA
- Jacob M Maronge
  - Department of Statistics, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Paul J Rathouz
  - Department of Population Health, University of Texas, Austin, Texas, USA
- Patrick J Heagerty
  - Department of Biostatistics, University of Washington, Seattle, Washington, USA
- Jonathan S Schildcrout
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
3. Yu J, Zhou H, Cai J. Accelerated failure time model for data from outcome-dependent sampling. Lifetime Data Analysis 2021; 27:15-37. [PMID: 33044612 PMCID: PMC7856009 DOI: 10.1007/s10985-020-09508-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/26/2019] [Accepted: 09/29/2020] [Indexed: 05/26/2023]
Abstract
Outcome-dependent sampling designs such as the case-control or case-cohort design are widely used in epidemiological studies for their outstanding cost-effectiveness. In this article, we propose and develop a smoothed weighted Gehan estimating equation approach for inference in an accelerated failure time model under a general failure time outcome-dependent sampling scheme. The proposed estimating equation is continuously differentiable and can be solved by standard numerical methods. In addition to developing asymptotic properties of the proposed estimator, we also propose and investigate a new power-based criterion for optimally allocating subsamples in the proposed design, obtained by maximizing the power function of a significance test. Simulation results show that the proposed estimator is more efficient than existing competing estimators and that the optimal power-based subsample allocation provides an ODS design that yields improved power for the test of exposure effect. We illustrate the proposed method with a data set from the Norwegian Mother and Child Cohort Study to evaluate the relationship between exposure to perfluoroalkyl substances and women's subfecundity.
Affiliation(s)
- Jichang Yu
  - School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, 430073, Hubei, China
- Haibo Zhou
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
- Jianwen Cai
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
4. Cao Y, Shi Y, Yu J. Statistical inference for the accelerated failure time model under two-stage generalized case–cohort design. COMMUN STAT-THEOR M 2019. [DOI: 10.1080/03610926.2018.1528363] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
Affiliation(s)
- Yongxiu Cao
  - School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, China
- Yueyong Shi
  - School of Economics and Management, China University of Geosciences, Wuhan, China
  - Center for Resources and Environmental Economic Research, China University of Geosciences, Wuhan, China
- Jichang Yu
  - School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, China
5.
Abstract
The two-phase design is a cost-effective sampling strategy to evaluate the effects of covariates on an outcome when certain covariates are too expensive to be measured on all study subjects. Under such a design, the outcome and inexpensive covariates are measured on all subjects in the first phase and the first-phase information is used to select subjects for measurements of expensive covariates in the second phase. Previous research on two-phase studies has focused largely on the inference procedures rather than the design aspects. We investigate the design efficiency of the two-phase study, as measured by the semiparametric efficiency bound for estimating the regression coefficients of expensive covariates. We consider general two-phase studies, where the outcome variable can be continuous, discrete, or censored, and the second-phase sampling can depend on the first-phase data in any manner. We develop optimal or approximately optimal two-phase designs, which can be substantially more efficient than the existing designs. We demonstrate the improvements of the new designs over the existing ones through extensive simulation studies and two large medical studies.
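The two-phase mechanism described in this abstract (outcome and inexpensive covariates on everyone in phase one, then phase-two selection driven by phase-one data) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the helper `two_phase_sample`, the toy cohort, and the "all cases, 10% of controls" rule are hypothetical, and the rule is not one of the optimal designs developed in the paper.

```python
import random

def two_phase_sample(phase1, select_prob, seed=0):
    """Pick phase-2 subjects using probabilities that depend only on phase-1 data.

    phase1: list of dicts with the outcome and inexpensive covariates.
    select_prob: hypothetical function mapping a record to a selection probability.
    Returns (index, inverse-probability weight) pairs for selected subjects.
    """
    rng = random.Random(seed)
    selected = []
    for i, rec in enumerate(phase1):
        p = select_prob(rec)
        if rng.random() < p:
            selected.append((i, 1.0 / p))
    return selected

# Toy phase-1 cohort: binary outcome y and one inexpensive covariate z.
gen = random.Random(1)
cohort = [{"y": int(gen.random() < 0.1), "z": gen.gauss(0, 1)} for _ in range(5000)]

# Assumed selection rule: keep every case and roughly 10% of controls.
def rule(rec):
    return 1.0 if rec["y"] == 1 else 0.1

phase2 = two_phase_sample(cohort, rule)

# Inverse-probability-weighted prevalence, using phase-2 subjects only.
w_total = sum(w for _, w in phase2)
prev_hat = sum(w * cohort[i]["y"] for i, w in phase2) / w_total
true_prev = sum(rec["y"] for rec in cohort) / len(cohort)
```

The inverse-probability weights undo the deliberate oversampling of cases, so the weighted prevalence computed from the phase-two subjects should land close to the full-cohort value. The expensive covariates would be measured only on the indices in `phase2`.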
Affiliation(s)
- Ran Tao
  - Department of Biostatistics and Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN 37232
  - Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599
- Donglin Zeng
  - Department of Biostatistics and Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN 37232
  - Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599
- Dan-Yu Lin
  - Department of Biostatistics and Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN 37232
  - Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599
6. Zelnick LR, Schildcrout JS, Heagerty PJ. Likelihood-based analysis of outcome-dependent sampling designs with longitudinal data. Stat Med 2018. [PMID: 29542170 DOI: 10.1002/sim.7633] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Abstract
The use of outcome-dependent sampling with longitudinal data analysis has previously been shown to improve efficiency in the estimation of regression parameters. The motivating scenario is when outcome data exist for all cohort members but key exposure variables will be gathered only on a subset. Inference with outcome-dependent sampling designs that also incorporates incomplete information from those individuals who did not have their exposure ascertained has been investigated for univariate but not longitudinal outcomes. Therefore, with a continuous longitudinal outcome, we explore the relative contributions of various sources of information toward the estimation of key regression parameters using a likelihood framework. We evaluate the efficiency gains that alternative estimators might offer over random sampling, and we offer insight into their relative merits in select practical scenarios. Finally, we illustrate the potential impact of design and analysis choices using data from the Cystic Fibrosis Foundation Patient Registry.
Affiliation(s)
- Leila R Zelnick
  - Department of Medicine, University of Washington, Seattle, WA 98195, USA
- Patrick J Heagerty
  - Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
7. Lu TS, Longnecker MP, Zhou H. Statistical inferences for data from studies conducted with an aggregated multivariate outcome-dependent sample design. Stat Med 2016; 36:985-997. [PMID: 27966260 DOI: 10.1002/sim.7195] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/01/2016] [Revised: 09/20/2016] [Accepted: 11/18/2016] [Indexed: 11/11/2022]
Abstract
The outcome-dependent sampling (ODS) scheme is a cost-effective sampling scheme in which one observes the exposure with a probability that depends on the outcome. Well-known examples are the case-control design for a binary response, the case-cohort design for failure time data, and the general ODS design for a continuous response. While substantial work has been carried out for the univariate response case, statistical inference and design for ODS with multivariate responses remain under-developed. Motivated by the need in biological studies to take advantage of the available responses for subjects in a cluster, we propose a multivariate outcome-dependent sampling (multivariate-ODS) design that is based on a general selection of the continuous responses within a cluster. The proposed inference procedure for the multivariate-ODS design is semiparametric, with all the underlying distributions of covariates modeled nonparametrically using empirical likelihood methods. We show that the proposed estimator is consistent and establish its asymptotic normality. Simulation studies show that the proposed estimator is more efficient than the estimator obtained using only the simple-random-sample portion of the multivariate-ODS or the estimator from a simple random sample with the same sample size. The multivariate-ODS design together with the proposed estimator provides an approach to further improve study efficiency for a given fixed study budget. We illustrate the proposed design and estimator with an analysis of the association of polychlorinated biphenyl exposure with hearing loss in children from the Collaborative Perinatal Study. Copyright © 2016 John Wiley & Sons, Ltd.
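As a rough illustration of an ODS sample with a "simple-random-sample portion", the Python sketch below draws a simple random sample and then supplements it from the two tails of a continuous outcome. This is an assumption about one common univariate form of the design, not the multivariate-ODS scheme proposed here; the function name, cut points, and sample sizes are hypothetical.

```python
import random

def ods_continuous(y, n_srs, n_tail, lower, upper, seed=0):
    """Return indices for an SRS plus supplemental draws from both outcome tails."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    srs = rng.sample(idx, n_srs)               # simple-random-sample portion
    chosen = set(srs)
    # Candidates for supplemental sampling: tail subjects not already in the SRS.
    low = [i for i in idx if y[i] < lower and i not in chosen]
    high = [i for i in idx if y[i] > upper and i not in chosen]
    supp_low = rng.sample(low, min(n_tail, len(low)))
    supp_high = rng.sample(high, min(n_tail, len(high)))
    return srs, supp_low, supp_high

# Toy continuous outcomes (made-up data); cut points at -1 and 1 are assumed.
gen = random.Random(2)
y = [gen.gauss(0, 1) for _ in range(1000)]
srs, supp_low, supp_high = ods_continuous(y, n_srs=50, n_tail=25, lower=-1.0, upper=1.0)
```

Exposure would then be measured only on the union of the three index sets. An analysis has to account for the tail oversampling, for example through the semiparametric likelihood approach the abstract describes.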
Affiliation(s)
- Tsui-Shan Lu
  - Department of Mathematics, National Taiwan Normal University, Taipei, Taiwan
- Matthew P Longnecker
  - Epidemiology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, 27709, NC, U.S.A
- Haibo Zhou
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, 27599, NC, U.S.A
8. Tan Z, Qin G, Zhou H. Estimation of a partially linear additive model for data from an outcome-dependent sampling design with a continuous outcome. Biostatistics 2016; 17:663-76. [PMID: 27006375 PMCID: PMC5031945 DOI: 10.1093/biostatistics/kxw015] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 08/15/2015] [Revised: 02/15/2016] [Accepted: 02/17/2016] [Indexed: 11/13/2022]
Abstract
Outcome-dependent sampling (ODS) designs have been well recognized as a cost-effective way to enhance study efficiency in both the statistical literature and biomedical and epidemiologic studies. A partially linear additive model (PLAM) is widely applied in real problems because it allows for a flexible specification of the dependence of the response on some covariates in a linear fashion and on other covariates in a nonlinear non-parametric fashion. Motivated by an epidemiological study investigating the effect of prenatal polychlorinated biphenyls exposure on children's intelligence quotient (IQ) at age 7 years, we propose a PLAM in this article to allow more flexible non-parametric inference on the relationships between the response and covariates under the ODS scheme. We propose an estimation method and establish the asymptotic properties of the proposed estimator. Simulation studies are conducted to show the improved efficiency of the proposed ODS estimator for PLAM compared with that from a traditional simple random sampling design with the same sample size. Data from the above-mentioned study are analyzed to illustrate the proposed method.
Affiliation(s)
- Ziwen Tan
  - Department of Biostatistics, School of Public Health and Key Laboratory of Public Health Safety, Fudan University, Shanghai 200032, China
- Guoyou Qin
  - Department of Biostatistics, School of Public Health and Key Laboratory of Public Health Safety, Fudan University, Shanghai 200032, China
- Haibo Zhou
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA