1
Maronge JM, Tao R, Schildcrout JS, Rathouz PJ. Generalized case-control sampling under generalized linear models. Biometrics 2023; 79:332-343. [PMID: 34586638 PMCID: PMC9358725 DOI: 10.1111/biom.13571] [Received: 07/10/2020] [Revised: 08/17/2021] [Accepted: 09/14/2021] [Indexed: 12/01/2022]
Abstract
A generalized case-control (GCC) study, like the standard case-control study, leverages outcome-dependent sampling (ODS) to extend to nonbinary responses. We develop a novel, unifying approach for analyzing GCC study data using the recently developed semiparametric extension of the generalized linear model (GLM), which is substantially more robust to model misspecification than existing approaches based on parametric GLMs. For valid estimation and inference, we use a conditional likelihood to account for the biased sampling design. We describe analysis procedures for estimation and inference for the semiparametric GLM under a conditional likelihood, and we discuss problems with estimation and inference under a conditional likelihood when the response distribution is misspecified. We demonstrate the flexibility of our approach over existing ones through extensive simulation studies, and we apply the methodology to an analysis of the Asset and Health Dynamics Among the Oldest Old study, which motivates our research. The proposed approach yields a simple yet versatile solution for handling ODS in a wide variety of possible response distributions and sampling schemes encountered in practice.
Affiliation(s)
- Jacob M. Maronge
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
- Ran Tao
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
  - Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Jonathan S. Schildcrout
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Paul J. Rathouz
  - Department of Population Health, Dell Medical School at the University of Texas at Austin, Austin, Texas, USA
2
Di Gravio C, Tao R, Schildcrout JS. Design and analysis of two-phase studies with multivariate longitudinal data. Biometrics 2022. [PMID: 35014029 DOI: 10.1111/biom.13616] [Received: 06/03/2021] [Revised: 11/03/2021] [Accepted: 12/10/2021] [Indexed: 11/27/2022]
Abstract
Two-phase studies are crucial when outcome and covariate data are available in a first phase sample (e.g., a cohort study), but costs associated with retrospective ascertainment of a novel exposure limit the size of the second phase sample, in whom the exposure is collected. For longitudinal outcomes, one class of two-phase studies stratifies subjects based on an outcome vector summary (e.g., an average or a slope over time) and oversamples subjects in the extreme value strata while undersampling subjects in the medium value stratum. Based on the choice of the summary, two-phase studies for longitudinal data can increase efficiency of time-varying and/or time-fixed exposure parameter estimates. In this manuscript, we extend efficient, two-phase study designs to multivariate longitudinal continuous outcomes, and we detail two analysis approaches. The first approach is a multiple imputation analysis that combines complete data from subjects selected for phase two with the incomplete data from those not selected. The second approach is a conditional maximum likelihood analysis that is intended for applications where only data from subjects selected for phase two are available. Importantly, we show that both approaches can be applied to secondary analyses of previously conducted two-phase studies. We examine finite sample operating characteristics of the two approaches and use the Lung Health Study (Connett et al., 1993) to examine genetic associations with lung function decline over time.
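The stratify-then-oversample idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' design or code: it assumes a single (univariate) outcome, uses each subject's least-squares slope as the summary, and picks hypothetical cutoffs (10th and 90th percentiles) and hypothetical stratum sample sizes. The function names `ols_slope` and `phase_two_sample` and all simulated values are assumptions made for illustration only.

```python
import random
from statistics import mean


def ols_slope(times, values):
    """Least-squares slope of one subject's outcome values on time."""
    t_bar, y_bar = mean(times), mean(values)
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, values))
    return sxy / sxx


def phase_two_sample(slopes, low_cut, high_cut, n_per_stratum, rng):
    """Stratify subjects by slope, then draw a fixed number from each stratum."""
    strata = {"low": [], "middle": [], "high": []}
    for subject_id, slope in slopes.items():
        if slope < low_cut:
            strata["low"].append(subject_id)
        elif slope > high_cut:
            strata["high"].append(subject_id)
        else:
            strata["middle"].append(subject_id)
    selected = {}
    for name, members in strata.items():
        k = min(n_per_stratum[name], len(members))
        selected[name] = rng.sample(members, k)  # simple random sample within stratum
    return strata, selected


# Hypothetical phase-one cohort: 200 subjects, 5 visits each (simulated values).
rng = random.Random(2022)
times = [0, 1, 2, 3, 4]
slopes = {}
for subject_id in range(200):
    true_slope = rng.gauss(-0.5, 0.3)
    values = [10 + true_slope * t + rng.gauss(0, 0.2) for t in times]
    slopes[subject_id] = ols_slope(times, values)

# Hypothetical cutoffs: lowest and highest 10% of slopes form the extreme strata.
ordered = sorted(slopes.values())
low_cut, high_cut = ordered[20], ordered[179]

# Oversample the extremes (15 of about 20) and undersample the middle (10 of about 160).
strata, selected = phase_two_sample(
    slopes, low_cut, high_cut, {"low": 15, "middle": 10, "high": 15}, rng
)
for name in strata:
    print(name, len(strata[name]), len(selected[name]))
```

Because the phase-two sample is drawn with unequal probabilities across strata, a naive analysis of the selected subjects would be biased; that is why the abstract pairs the design with multiple imputation or conditional maximum likelihood, neither of which is implemented in this sketch.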
Affiliation(s)
- Chiara Di Gravio
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, 37232, U.S.A.
- Ran Tao
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, 37232, U.S.A.
  - Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, Tennessee, 37232, U.S.A.
- Jonathan S Schildcrout
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, 37232, U.S.A.
3
Li CX, Matthay EC, Rowe C, Bradshaw PT, Ahern J. Conducting density-sampled case-control studies using survey data with complex sampling designs: A simulation study. Ann Epidemiol 2022; 65:109-115. [PMID: 34216780 PMCID: PMC8962511 DOI: 10.1016/j.annepidem.2021.06.019] [Received: 02/04/2021] [Revised: 05/29/2021] [Accepted: 06/24/2021] [Indexed: 01/03/2023]
Abstract
PURPOSE: Population-based surveys are possible sources from which to draw representative control data for case-control studies. However, these surveys involve complex sampling that could lead to biased estimates of measures of association if not properly accounted for in analyses. Approaches to incorporating complex-sampled controls in density-sampled case-control designs have not been examined. METHODS: We used a simulation study to evaluate the performance of different approaches to estimating incidence density ratios (IDR) from case-control studies with controls drawn from complex survey data using risk-set sampling. In simulated population data, we applied four survey sampling approaches, with varying survey sizes, and assessed the performance of four analysis methods for incorporating survey-based controls. RESULTS: Estimates of the IDR were unbiased for methods that conducted risk-set sampling with probability of selection proportional to survey weights. Estimates of the IDR were biased when sampling weights were not incorporated, or only included in regression modeling. The unbiased analysis methods performed comparably and produced estimates with variance comparable to biased methods. Variance increased and confidence interval coverage decreased as survey size decreased. CONCLUSIONS: Unbiased estimates are obtainable in risk-set sampled case-control studies using controls drawn from complex survey data when weights are properly incorporated.
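As a rough illustration of risk-set sampling with selection probability proportional to survey weights, the sketch below draws controls for each case from the survey respondents still at risk at the case's event time. It is an assumption-laden toy, not the authors' simulation code: the data layout (`id`, `weight`, `exit_time`), the function names, and the choice of sequential weighted draws without replacement are all hypothetical.

```python
import random


def weighted_sample_without_replacement(items, weights, k, rng):
    """Draw up to k distinct items, each draw proportional to remaining weights."""
    pool = list(zip(items, weights))
    chosen = []
    for _ in range(min(k, len(pool))):
        idx = rng.choices(range(len(pool)), weights=[w for _, w in pool], k=1)[0]
        chosen.append(pool.pop(idx)[0])
    return chosen


def risk_set_controls(cases, survey, controls_per_case, rng):
    """For each (case_id, event_time), sample controls from respondents at risk."""
    matched = {}
    for case_id, event_time in cases:
        # At risk: still under follow-up at the case's event time, and not the case.
        at_risk = [p for p in survey
                   if p["exit_time"] >= event_time and p["id"] != case_id]
        matched[case_id] = weighted_sample_without_replacement(
            [p["id"] for p in at_risk],
            [p["weight"] for p in at_risk],
            controls_per_case,
            rng,
        )
    return matched


# Hypothetical survey respondents with sampling weights and follow-up end times.
survey = [
    {"id": "s1", "weight": 120.0, "exit_time": 5.0},
    {"id": "s2", "weight": 40.0, "exit_time": 2.0},
    {"id": "s3", "weight": 300.0, "exit_time": 8.0},
    {"id": "s4", "weight": 75.0, "exit_time": 6.5},
    {"id": "s5", "weight": 210.0, "exit_time": 3.5},
    {"id": "s6", "weight": 95.0, "exit_time": 9.0},
]
# Hypothetical cases with their event times.
cases = [("c1", 3.0), ("c2", 6.0)]

rng = random.Random(65)
matched = risk_set_controls(cases, survey, controls_per_case=2, rng=rng)
for case_id, controls in matched.items():
    print(case_id, controls)
```

A respondent can serve as a control for more than one case as long as they remain at risk, which is what distinguishes density (risk-set) sampling from sampling controls once at the end of follow-up.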
Affiliation(s)
- Catherine X. Li
  - Division of Epidemiology & Biostatistics, School of Public Health, University of California, Berkeley, Berkeley, CA
  - Department of Epidemiology, University of North Carolina at Chapel Hill, Chapel Hill, NC
- Ellicott C. Matthay
  - Center for Health and Community, University of California, San Francisco, San Francisco, CA
- Christopher Rowe
  - Division of Epidemiology & Biostatistics, School of Public Health, University of California, Berkeley, Berkeley, CA
- Patrick T. Bradshaw
  - Division of Epidemiology & Biostatistics, School of Public Health, University of California, Berkeley, Berkeley, CA
- Jennifer Ahern
  - Division of Epidemiology & Biostatistics, School of Public Health, University of California, Berkeley, Berkeley, CA
4
Cao Y, Haneuse S, Zheng Y, Chen J. Two-phase stratified sampling and analysis for predicting binary outcomes. Biostatistics 2021:6470040. [PMID: 34923588 DOI: 10.1093/biostatistics/kxab044] [Received: 05/04/2021] [Revised: 11/03/2021] [Accepted: 11/22/2021] [Indexed: 11/13/2022]
Abstract
The two-phase study design is a cost-efficient sampling strategy when certain data elements are expensive and, thus, can only be collected on a sub-sample of subjects. To date, guidance on how best to allocate resources within the design has assumed that primary interest lies in estimating association parameters. When primary interest lies in the development and evaluation of a risk prediction tool, however, such guidance may, in fact, be detrimental. To resolve this, we propose a novel strategy for resource allocation based on oversampling cases and subjects who have more extreme risk estimates according to a preliminary model developed using fully observed predictors. Key to the proposed strategy is that it focuses on enhancing efficiency regarding estimation of measures of predictive accuracy, rather than on efficiency regarding association parameters, which is the standard paradigm. Towards valid estimation and inference for accuracy measures using the resultant data, we extend an existing semiparametric maximum likelihood method for estimating odds ratio association parameters to accommodate the biased sampling scheme and data incompleteness. Motivated by our sampling design, we additionally propose a general post-stratification scheme for analyzing general two-phase data for estimating predictive accuracy measures. Through theoretical calculations and simulation studies, we show that the proposed sampling strategy and post-stratification scheme achieve the promised efficiency improvement. Finally, we apply the proposed methods to develop and evaluate a preliminary model for predicting the risk of hospital readmission after cardiac surgery using data from the Pennsylvania Health Care Cost Containment Council.
Affiliation(s)
- Yaqi Cao
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Perelman School of Medicine, Philadelphia, PA 19104, USA
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
- Sebastien Haneuse
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115, USA
- Yingye Zheng
  - Department of Biostatistics, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Seattle, WA 98109, USA
- Jinbo Chen
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Perelman School of Medicine, Philadelphia, PA 19104, USA
5
Sauer S, Hedt-Gauthier B, Haneuse S. Optimal allocation in stratified cluster-based outcome-dependent sampling designs. Stat Med 2021; 40:4090-4107. [PMID: 34076912 DOI: 10.1002/sim.9016] [Received: 11/15/2020] [Revised: 03/31/2021] [Accepted: 04/12/2021] [Indexed: 11/08/2022]
Abstract
In public health research, finite resources often require that decisions be made at the study design stage regarding which individuals to sample for detailed data collection. At the same time, when study units are naturally clustered, as patients are in clinics, it may be preferable to sample clusters rather than the study units, especially when the costs associated with travel between clusters are high. In this setting, aggregated data on the outcome and select covariates are sometimes routinely available through, for example, a country's Health Management Information System. If used wisely, this information can be used to guide decisions regarding which clusters to sample, and potentially obtain gains in efficiency over simple random sampling. In this article, we derive a series of formulas for optimal allocation of resources when a single-stage stratified cluster-based outcome-dependent sampling design is to be used and a marginal mean model is specified to answer the question of interest. Specifically, we consider two settings: (i) when a particular parameter in the mean model is of primary interest; and, (ii) when multiple parameters are of interest. We investigate the finite population performance of the optimal allocation framework through a comprehensive simulation study. Our results show that there are trade-offs that must be considered at the design stage: optimizing for one parameter yields efficiency gains over balanced and simple random sampling, while resulting in losses for the other parameters in the model. Optimizing for all parameters simultaneously yields smaller gains in efficiency, but mitigates the losses for the other parameters in the model.
Affiliation(s)
- Sara Sauer
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
- Bethany Hedt-Gauthier
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
  - Department of Global Health and Social Medicine, Harvard Medical School, Boston, Massachusetts, USA
- Sebastien Haneuse
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA