1
Yoon SH, Vandal A, Rivera-Rodriguez C. Weight calibration in the joint modelling of medical cost and mortality. Stat Methods Med Res 2024; 33:728-742. PMID: 38444359. PMCID: PMC11145918. DOI: 10.1177/09622802241236935.
Abstract
Joint modelling of longitudinal and time-to-event data recognizes the dependency between the two data types and combines the two outcomes into a single model, which leads to more precise estimates. These models are applicable when individuals are followed over a period of time, generally to monitor the progression of a disease or a medical condition, and when longitudinal covariates are available. Medical cost data are often also collected in longitudinal settings, but such datasets usually arise from a complex sampling design rather than simple random sampling, and this design must be accounted for in the statistical analysis: ignoring the sampling mechanism can lead to misleading conclusions. This article proposes a novel approach to the joint modelling of complex survey data by combining survey calibration with standard joint modelling. This is achieved by incorporating a new set of equations that calibrate the sampling weights for the survival model in a joint model setting. The proposed method is applied to data on anti-dementia medication costs and mortality in people with diagnosed dementia in New Zealand.
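The weight-calibration idea running through this entry (and several below) can be illustrated with a small sketch: design weights are adjusted by iterative proportional fitting (raking) so that weighted sample margins match known population totals for auxiliary variables. This is a generic raking sketch, not the paper's joint-model-specific calibration equations; the data and margin totals are invented for illustration.

```python
import numpy as np

def rake_weights(w, groups_a, groups_b, totals_a, totals_b, n_iter=50):
    """Adjust design weights w so that weighted counts match known margins.

    groups_a, groups_b: integer labels of two auxiliary variables per unit.
    totals_a, totals_b: dicts of known population totals for each label.
    """
    w = w.astype(float).copy()
    for _ in range(n_iter):
        for g, t in totals_a.items():      # scale weights to match margin A
            m = groups_a == g
            w[m] *= t / w[m].sum()
        for g, t in totals_b.items():      # then to match margin B
            m = groups_b == g
            w[m] *= t / w[m].sum()
    return w

# toy sample of 6 units with equal base weights
w0 = np.array([10., 10., 10., 10., 10., 10.])
sex = np.array([0, 0, 0, 1, 1, 1])
age = np.array([0, 1, 1, 0, 0, 1])
w = rake_weights(w0, sex, age, {0: 40, 1: 30}, {0: 35, 1: 35})
```

After raking, the weighted margins reproduce the assumed population totals, while individual weights move as little as the constraints allow.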
Affiliation(s)
- Seong Hoon Yoon
- Department of Statistics, The University of Auckland, Auckland, New Zealand
- Alain Vandal
- Department of Statistics, The University of Auckland, Auckland, New Zealand
2
Takeuchi Y, Hagiwara Y, Komukai S, Matsuyama Y. Estimation of the causal effects of time-varying treatments in nested case-control studies using marginal structural Cox models. Biometrics 2024; 80:ujae005. PMID: 38465985. DOI: 10.1093/biomtc/ujae005.
Abstract
When estimating the causal effects of time-varying treatments on survival in nested case-control (NCC) studies, marginal structural Cox models (Cox-MSMs) with inverse probability weights (IPWs) are a natural approach. However, calculating IPWs from the cases and controls is difficult because they are not random samples from the full cohort, and the number of subjects may be insufficient for the calculation. To overcome these difficulties, we propose a method for calculating IPWs when fitting Cox-MSMs to NCC sampling data. We estimate the IPWs using a pseudo-likelihood estimation method with an inverse probability of sampling weight applied to the NCC samples; additional samples of subjects who experience treatment changes and of subjects whose follow-up is censored are required to calculate the weights. Our method only requires covariate histories for these samples. Confidence intervals are calculated from a robust variance estimator for the NCC sampling data. We also derive the asymptotic properties of the Cox-MSM estimator under NCC sampling. The proposed method allows researchers to apply several case-control matching methods to improve statistical efficiency. A simulation study was conducted to evaluate the finite-sample performance of the proposed method, and we applied it to a motivating pharmacoepidemiological study examining the effect of statins on the incidence of coronary heart disease. The proposed method may be useful for estimating the causal effects of time-varying treatments in NCC studies.
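The inverse probability weights at the heart of a Cox-MSM can be sketched for a single time point: given estimated treatment probabilities (propensity scores), each subject is weighted by the inverse probability of the treatment actually received, optionally stabilized by the marginal treatment probability. A minimal NumPy illustration with made-up propensity scores (the paper's pseudo-likelihood estimation of these probabilities under NCC sampling is not reproduced here):

```python
import numpy as np

def stabilized_ipw(treated, propensity):
    """Stabilized inverse-probability-of-treatment weights.

    treated: 0/1 array of the treatment actually received.
    propensity: estimated P(treated = 1 | covariates) per subject.
    """
    p_marg = treated.mean()                                   # marginal P(treated)
    p_received = np.where(treated == 1, propensity, 1 - propensity)
    numer = np.where(treated == 1, p_marg, 1 - p_marg)        # stabilizing numerator
    return numer / p_received

treated = np.array([1, 0, 1, 0, 1])
ps = np.array([0.8, 0.3, 0.5, 0.6, 0.9])   # invented propensity scores
w = stabilized_ipw(treated, ps)
```

Stabilization keeps the mean weight near one, which typically reduces the variance of the weighted Cox fit relative to unstabilized weights.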
Affiliation(s)
- Yoshinori Takeuchi
- Department of Biostatistics, School of Public Health, Graduate School of Medicine, The University of Tokyo, Bunkyo-ku, Tokyo 113-0033, Japan
- Division of Medical Statistics, Department of Social Medicine, Faculty of Medicine, Toho University, Ota-ku, Tokyo 143-8540, Japan
- Yasuhiro Hagiwara
- Department of Biostatistics, School of Public Health, Graduate School of Medicine, The University of Tokyo, Bunkyo-ku, Tokyo 113-0033, Japan
- Sho Komukai
- Department of Biomedical Statistics, Graduate School of Medicine, Osaka University, Suita-shi, Osaka 565-0871, Japan
- Yutaka Matsuyama
- Department of Biostatistics, School of Public Health, Graduate School of Medicine, The University of Tokyo, Bunkyo-ku, Tokyo 113-0033, Japan
3
Chen T, Lumley T. Optimal sampling for design-based estimators of regression models. Stat Med 2022; 41:1482-1497. PMID: 34989429. PMCID: PMC8918008. DOI: 10.1002/sim.9300.
Abstract
Two-phase designs measure expensive variables of interest on a subcohort, while the outcome and other covariates are readily available or cheap to collect on all individuals in the cohort. Given limited resources, it is of interest to find an optimal design that includes the more informative individuals in the final sample. We explore optimal designs and efficiencies for analyses by design-based estimators. Generalized raking estimators form an efficient class of design-based estimators that improve on the inverse-probability weighted (IPW) estimator by adjusting the weights using auxiliary information. We derive a closed-form solution of the optimal design for estimating regression coefficients with generalized raking estimators, and compare it with the optimal design for analysis via the IPW estimator and with other two-phase designs in measurement-error settings. We consider general two-phase designs in which the outcome and the variables of interest can be continuous or discrete. Our results show that the optimal designs for the two classes of design-based estimators can be very different. The optimal design for the IPW estimator is optimal for IPW estimation and typically gives near-optimal efficiency for generalized raking estimation, though we show there is potential for improvement in some settings.
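Closed-form optimal designs of this kind generalize the classical Neyman allocation, which the following sketch illustrates: the phase-II sample is allocated across strata in proportion to stratum size times the stratum standard deviation of the relevant influence function. This is the textbook allocation rule, not the paper's estimator-specific solution, and the stratum inputs are invented.

```python
import numpy as np

def neyman_allocation(N_h, S_h, n):
    """Neyman allocation of a phase-II sample of size n across strata:
    n_h proportional to N_h * S_h, where S_h is the stratum standard
    deviation (here, of the influence function of the target parameter)."""
    share = N_h * S_h / np.sum(N_h * S_h)
    return np.round(n * share).astype(int)

N_h = np.array([1000, 500, 100])   # phase-I stratum sizes
S_h = np.array([1.0, 2.0, 5.0])    # influence-function SDs (assumed known)
alloc = neyman_allocation(N_h, S_h, n=250)
```

A small, highly variable stratum can receive a sampling fraction far above the others, which is exactly the "include more informative individuals" intuition in the abstract.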
Affiliation(s)
- Tong Chen
- Department of Statistics, University of Auckland, Auckland, New Zealand
- Thomas Lumley
- Department of Statistics, University of Auckland, Auckland, New Zealand
4
Han K, Shaw PA, Lumley T. Combining multiple imputation with raking of weights: An efficient and robust approach in the setting of nearly true models. Stat Med 2021; 40:6777-6791. PMID: 34585424. PMCID: PMC8963275. DOI: 10.1002/sim.9210.
Abstract
Multiple imputation (MI) provides efficient estimators in model-based methods for handling missing data when the model is true. Design-based estimators, by contrast, are robust methods that do not require accurately modelling the missing data, but they can be inefficient. In any applied setting, it is difficult to know whether a missing-data model is good enough to win the bias-efficiency trade-off. Raking of weights is one approach that constructs an auxiliary variable from data observed on the full cohort, which is then used to adjust the weights of the usual Horvitz-Thompson estimator. Computing the optimally efficient raking estimator requires evaluating the expectation of the efficient score given the full cohort data, which is generally infeasible. We demonstrate that MI is a practical method for computing this optimal raking estimator. We compare the resulting estimator with common parametric and semi-parametric estimators, including standard MI. While estimators such as semi-parametric maximum likelihood and MI attain optimal performance under the true model, the proposed raking estimator using MI maintains a better robustness-efficiency trade-off even under mild model misspecification. We also show that the standard raking estimator, without MI, is often competitive with the optimal raking estimator. We demonstrate these properties through several numerical examples and provide a theoretical discussion of conditions for asymptotically superior relative efficiency of the proposed raking estimator.
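The MI side of this combination rests on Rubin's rules for pooling estimates across imputed datasets: the pooled point estimate is the mean of the per-imputation estimates, and the total variance adds the within- and between-imputation components. A minimal sketch with invented per-imputation results:

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Combine M multiply-imputed estimates via Rubin's rules.

    Returns the pooled estimate and its total variance
    (within-imputation plus inflated between-imputation variance)."""
    estimates = np.asarray(estimates, float)
    variances = np.asarray(variances, float)
    M = len(estimates)
    qbar = estimates.mean()              # pooled point estimate
    ubar = variances.mean()              # within-imputation variance
    b = estimates.var(ddof=1)            # between-imputation variance
    total = ubar + (1 + 1 / M) * b
    return qbar, total

est, var = rubin_combine([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
```

The (1 + 1/M) factor accounts for using a finite number of imputations; as M grows, the total variance shrinks toward ubar + b.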
Affiliation(s)
- Kyunghee Han
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Pamela A. Shaw
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Thomas Lumley
- Department of Statistics, University of Auckland, Auckland, New Zealand
5
Graziano F, Valsecchi MG, Rebora P. Sampling strategies to evaluate the prognostic value of a new biomarker on a time-to-event end-point. BMC Med Res Methodol 2021; 21:93. PMID: 33941092. PMCID: PMC8091513. DOI: 10.1186/s12874-021-01283-0. Open access.
Abstract
BACKGROUND: The availability of large epidemiological or clinical databases storing biological samples makes it possible to study the prognostic value of novel biomarkers, but efficient designs are needed to select the subsample on which to measure them, for reasons of parsimony and cost. Two-phase stratified sampling is a flexible approach to such sub-sampling, but literature on the choice of stratification variables and on power evaluation is lacking, especially for survival data.
METHODS: We compared the performance of different sampling designs for assessing the prognostic value of a new biomarker on a time-to-event endpoint, applying a Cox model weighted by the inverse of the empirical inclusion probability.
RESULTS: Our simulation results suggest that case-control sampling stratified (or post-stratified) by a surrogate variable of the marker can yield higher performance than simple random, probability-proportional-to-size, and unstratified case-control sampling. In the presence of a high censoring rate, results showed an advantage of nested case-control and counter-matching designs in terms of design effect, although the use of a fixed ratio between cases and controls might be disadvantageous. On real data on childhood acute lymphoblastic leukemia, we found optimal sampling using pilot data to be highly efficient.
CONCLUSIONS: Our study suggests that case-control sampling stratified by a surrogate and nested case-control sampling yield estimates and power comparable to those obtained in the full cohort while strongly decreasing the number of patients required. We recommend planning the sample size and using such sampling designs when exploring novel biomarkers in clinical cohort data.
Affiliation(s)
- Francesca Graziano
- Bicocca Bioinformatics Biostatistics and Bioimaging Centre B4, School of Medicine and Surgery, University of Milano-Bicocca, Via Cadore 48, 20900 Monza, Italy
- Maria Grazia Valsecchi
- Bicocca Bioinformatics Biostatistics and Bioimaging Centre B4, School of Medicine and Surgery, University of Milano-Bicocca, Via Cadore 48, 20900 Monza, Italy
- Paola Rebora
- Bicocca Bioinformatics Biostatistics and Bioimaging Centre B4, School of Medicine and Surgery, University of Milano-Bicocca, Via Cadore 48, 20900 Monza, Italy
6
Rivera-Rodriguez C, Cheung G, Cullum S. Using Big Data to Estimate Dementia Prevalence in New Zealand: Protocol for an Observational Study. JMIR Res Protoc 2021; 10:e20225. PMID: 33404510. PMCID: PMC7817360. DOI: 10.2196/20225. Open access.
Abstract
BACKGROUND: Dementia describes a cluster of symptoms that includes memory loss; difficulties with thinking, problem solving, or language; and functional impairment. Dementia can be caused by a number of neurodegenerative diseases, such as Alzheimer disease and cerebrovascular disease. Currently in New Zealand, most of the systematically collected, detailed information on dementia is obtained through a suite of International Residential Assessment Instrument (interRAI) assessments, including the home care, contact assessment, and long-term care facility versions; these are standardized comprehensive geriatric assessments. Patients are referred for an interRAI assessment by the Needs Assessment and Service Coordination (NASC) services after a series of screening processes. Previous estimates of the prevalence and costs of dementia in New Zealand have been based on international studies of different populations and health and social care systems. New local knowledge will have implications for estimating the demographic distribution and socioeconomic impact of dementia in New Zealand.
OBJECTIVE: This study investigates the prevalence of dementia, risk factors for dementia, and drivers of the informal cost of dementia among people registered in the NASC database in New Zealand.
METHODS: This study will analyze secondary data routinely collected in the NASC and interRAI (home care and contact assessment versions) databases between July 1, 2014, and July 1, 2019, in New Zealand. The databases will be linked to produce an integrated dataset, which will be used to (1) investigate the sociodemographic and clinical risk factors associated with dementia and other neurological conditions, (2) estimate the prevalence of dementia using weighting methods for complex samples, and (3) identify the cost of informal care per client (in hours of care provided by unpaid carers) and the drivers of such costs. We will use design-based survey methods for the estimation of prevalence, and generalized estimating equations for regression models and correlated longitudinal data.
RESULTS: The results will provide much-needed statistics on dementia prevalence, risk factors, and the cost of informal care for people living with dementia in New Zealand. Potential health inequities for different ethnic groups will be highlighted, which can then be used by decision makers to inform the development of policy and practice.
CONCLUSIONS: As of November 2020, there were no dementia prevalence studies or studies of the informal care costs of dementia using national data from New Zealand; all existing studies have used data from other populations with substantially different demographic distributions. This study will give insight into the actual prevalence, risk factors, and informal care costs of dementia for the population with support needs in New Zealand, providing valuable information to improve health outcomes and better inform policy and planning.
INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): DERR1-10.2196/20225
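Design-based prevalence estimation with complex-sample weights reduces, in its simplest form, to a Hajek-type ratio: each assessed person is weighted by the inverse of their inclusion probability, and the prevalence is the weighted mean of a 0/1 indicator. A toy sketch (the indicator values and inclusion probabilities are invented):

```python
import numpy as np

def weighted_prevalence(y, pi):
    """Hajek prevalence estimate: inverse-inclusion-probability
    weighted mean of a binary indicator."""
    w = 1.0 / pi                       # design weights
    return np.sum(w * y) / np.sum(w)

y = np.array([1, 0, 1, 0, 0])                 # condition indicator
pi = np.array([0.5, 0.5, 0.25, 1.0, 1.0])     # inclusion probabilities
prev = weighted_prevalence(y, pi)
```

Each weight can be read as the number of population members a sampled person represents, so under-sampled groups are scaled up rather than under-counted.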
Affiliation(s)
- Gary Cheung
- Department of Psychological Medicine, University of Auckland, Auckland, New Zealand
- Sarah Cullum
- Department of Psychological Medicine, University of Auckland, Auckland, New Zealand
7
Chen T, Lumley T. Optimal multiwave sampling for regression modeling in two-phase designs. Stat Med 2020; 39:4912-4921. PMID: 33016376. PMCID: PMC7902311. DOI: 10.1002/sim.8760.
Abstract
Two-phase designs involve measuring extra variables on a subset of the cohort where some variables are already measured. The goal of two-phase designs is to choose a subsample of individuals from the cohort and analyse that subsample efficiently. It is of interest to obtain an optimal design that gives the most efficient estimates of regression parameters. In this article, we propose a multiwave sampling design to approximate the optimal design for design-based estimators. Influence functions are used to compute the optimal sampling allocations. We propose to use informative priors on regression parameters to derive the wave-1 sampling probabilities because any prespecified sampling probabilities may be far from optimal and decrease the design efficiency. The posterior distributions of the regression parameters derived from the current wave will then be used as priors for the next wave. Generalized raking is used in the final statistical analysis. We show that a two-wave sampling with reasonable informative priors will end up with a highly efficient estimation for the parameter of interest and be close to the underlying optimal design.
Affiliation(s)
- Tong Chen
- Department of Statistics, University of Auckland, Auckland, New Zealand
- Thomas Lumley
- Department of Statistics, University of Auckland, Auckland, New Zealand
8
Jazić I, Lee S, Haneuse S. Estimation and inference for semi-competing risks based on data from a nested case-control study. Stat Methods Med Res 2020; 29:3326-3339. PMID: 32552435. DOI: 10.1177/0962280220926219.
Abstract
In semi-competing risks, the occurrence of some non-terminal event is subject to a terminal event, usually death. While existing methods for semi-competing risks data analysis assume complete information on all relevant covariates, data on at least one covariate are often not readily available in practice. In this setting, for standard univariate time-to-event analyses, researchers may choose from several strategies for sub-sampling patients on whom to collect complete data, including the nested case-control study design. Here, we consider a semi-competing risks analysis through the reuse of data from an existing nested case-control study for which risk sets were formed based on either the non-terminal or the terminal event. Additionally, we introduce the supplemented nested case-control design in which detailed data are collected on additional events of the other type. We propose estimation with respect to a frailty illness-death model through maximum weighted likelihood, specifying the baseline hazard functions either parametrically or semi-parametrically via B-splines. Two standard error estimators are proposed: (i) a computationally simple sandwich estimator and (ii) an estimator based on a perturbation resampling procedure. We derive the asymptotic properties of the proposed methods and evaluate their small-sample properties via simulation. The designs/methods are illustrated with an investigation of risk factors for acute graft-versus-host disease among N = 8838 patients undergoing hematopoietic stem cell transplantation, for which death is a significant competing risk.
Affiliation(s)
- Ina Jazić
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Stephanie Lee
- Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
- Sebastien Haneuse
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
9
Feifel J, Dobler D. Dynamic inference in general nested case-control designs. Biometrics 2020; 77:175-185. PMID: 32145031. DOI: 10.1111/biom.13259.
Abstract
Nested case-control designs are attractive in studies with a time-to-event endpoint if the outcome is rare or if interest lies in evaluating expensive covariates. The appeal is that these designs restrict to small subsets of all patients at risk just prior to the observed event times. Only these small subsets need to be evaluated. Typically, the controls are selected at random and methods for time-simultaneous inference have been proposed in the literature. However, the martingale structure behind nested case-control designs allows for more powerful and flexible non-standard sampling designs. We exploit that structure to find simultaneous confidence bands based on wild bootstrap resampling procedures within this general class of designs. We show in a simulation study that the intended coverage probability is obtained for confidence bands for cumulative baseline hazard functions. We apply our methods to observational data about hospital-acquired infections.
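The wild bootstrap idea behind such simultaneous bands can be sketched in a simplified full-cohort setting (no nested case-control sampling): the increments of a Nelson-Aalen-type estimator are perturbed by i.i.d. standard-normal multipliers, and the quantile of the resulting sup-statistic gives a constant-width band. A minimal NumPy sketch with invented survival data:

```python
import numpy as np

rng = np.random.default_rng(0)

def nelson_aalen(times, events):
    """Nelson-Aalen cumulative hazard, returned at the observed event times."""
    order = np.argsort(times)
    times, events = times[order], events[order]
    at_risk = len(times) - np.arange(len(times))   # risk-set sizes
    inc = events / at_risk                         # hazard increments
    return times[events == 1], np.cumsum(inc)[events == 1]

def wild_bootstrap_band(times, events, n_boot=2000, alpha=0.05):
    """Half-width of a simultaneous band via normal-multiplier wild
    bootstrap: perturb each increment, track the sup over time."""
    order = np.argsort(times)
    times, events = times[order], events[order]
    at_risk = len(times) - np.arange(len(times))
    inc = events / at_risk
    G = rng.standard_normal((n_boot, len(times)))      # multipliers
    sup = np.abs(np.cumsum(G * inc, axis=1)).max(axis=1)
    return np.quantile(sup, 1 - alpha)

t = np.array([2., 3., 5., 7., 8., 11.])
d = np.array([1, 1, 0, 1, 0, 1])        # 1 = event, 0 = censored
event_times, H = nelson_aalen(t, d)
half_width = wild_bootstrap_band(t, d)
```

Because the whole perturbed path is tracked, the resulting band covers the cumulative hazard simultaneously over time, unlike pointwise intervals.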
Affiliation(s)
- J Feifel
- Institute of Statistics, Ulm University, Ulm, Germany
- D Dobler
- Department of Mathematics, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
10
Gail MH, Altman DG, Cadarette SM, Collins G, Evans SJW, Sekula P, Williamson E, Woodward M. Design choices for observational studies of the effect of exposure on disease incidence. BMJ Open 2019; 9:e031031. PMID: 31822541. PMCID: PMC6924819. DOI: 10.1136/bmjopen-2019-031031. Open access.
Abstract
The purpose of this paper is to help readers choose an appropriate observational study design for measuring an association between an exposure and disease incidence. We discuss cohort studies, sub-samples from cohorts (case-cohort and nested case-control designs), and population-based or hospital-based case-control studies. Appropriate study design is the foundation of a scientifically valid observational study. Mistakes in design are often irremediable. Key steps are understanding the scientific aims of the study and what is required to achieve them. Some designs will not yield the information required to realise the aims. The choice of design also depends on the availability of source populations and resources. Choosing an appropriate design requires balancing the pros and cons of various designs in view of study aims and practical constraints. We compare various cohort and case-control designs to estimate the effect of an exposure on disease incidence and mention how certain design features can reduce threats to study validity.
Affiliation(s)
- Mitchell H Gail
- Biostatistics Branch, National Cancer Institute, Rockville, Maryland, USA
- Douglas G Altman
- Nuffield Department of Orthopaedics, Centre for Statistics in Medicine, Oxford, UK
- Suzanne M Cadarette
- Faculty of Pharmacy and School of Public Health, University of Toronto, Toronto, Ontario, Canada
- Gary Collins
- Centre for Statistics in Medicine, University of Oxford, Oxford, UK
- Stephen JW Evans
- Medical Statistics Unit, London School of Hygiene and Tropical Medicine, London, UK
- Peggy Sekula
- Institute of Genetic Epidemiology and Faculty of Medicine, Medical Center, University of Freiburg, Freiburg, Germany
- Elizabeth Williamson
- Department of Medical Statistics, London School of Hygiene and Tropical Medicine, London, UK
- Mark Woodward
- The George Institute for Global Health, Oxford University, UK, and University of New South Wales, Sydney, New South Wales, Australia
11
Rivera-Rodriguez C, Spiegelman D, Haneuse S. On the analysis of two-phase designs in cluster-correlated data settings. Stat Med 2019; 38:4611-4624. PMID: 31359448. PMCID: PMC6736737. DOI: 10.1002/sim.8321.
Abstract
In public health research, readily available information may be insufficient to address the primary question(s) of interest. One cost-efficient way forward, especially in resource-limited settings, is to conduct a two-phase study in which the population is initially stratified, at phase I, by the outcome and/or some categorical risk factor(s); at phase II, detailed covariate data are ascertained on a subsample within each phase I stratum. While analysis methods for two-phase designs are well established, they have focused exclusively on settings in which participants are assumed to be independent. When participants are naturally clustered (eg, patients within clinics), these methods may therefore yield invalid inference. To address this, we develop a novel analysis approach based on inverse-probability weighting that permits researchers to specify a working covariance structure, appropriately accounts for the sampling design, and ensures valid inference via a robust sandwich estimator for which a closed-form expression is provided. To enhance statistical efficiency, we propose a calibrated inverse-probability weighting estimator that makes use of information available at phase I but not used in the design. In addition to describing the technique, we provide practical guidance for the cluster-correlated data settings we consider. A comprehensive simulation study evaluates small-sample operating characteristics, including the impact of naïve methods that ignore correlation due to clustering, and investigates design considerations. Finally, the methods are illustrated using data from a one-time survey of the national antiretroviral treatment program in Malawi.
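The cluster-robust sandwich idea can be illustrated for the simplest weighted estimator, an IPW mean: per-unit score contributions are summed within clusters before being squared, so that within-cluster correlation inflates the variance estimate appropriately. This is a toy sketch for a mean, not the paper's regression estimator, and the data are invented.

```python
import numpy as np

def clustered_sandwich_mean(y, w, cluster):
    """IPW mean with a cluster-robust sandwich variance:
    influence terms are summed within clusters before the outer product."""
    mu = np.sum(w * y) / np.sum(w)
    scores = w * (y - mu) / np.sum(w)            # per-unit influence terms
    ids = np.unique(cluster)
    cl = np.array([scores[cluster == c].sum() for c in ids])
    k = len(ids)
    var = np.sum(cl ** 2) * k / (k - 1)          # small-sample correction
    return mu, var

y = np.array([1., 2., 2., 3., 4., 5.])
w = np.array([1., 1., 2., 2., 1., 1.])           # inverse-probability weights
cluster = np.array([0, 0, 1, 1, 2, 2])           # e.g. clinic IDs
mu, var = clustered_sandwich_mean(y, w, cluster)
```

Treating the six units as independent would use six squared scores instead of three squared cluster sums, typically understating the variance when outcomes within a clinic are correlated.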
Affiliation(s)
- D. Spiegelman
- Center on Methods for Implementation and Dissemination Science, Department of Biostatistics, Yale University School of Public Health, CT, USA
- Department of Epidemiology, Harvard School of Public Health, MA, USA
- Department of Biostatistics, Harvard School of Public Health, MA, USA
- S. Haneuse
- Department of Biostatistics, Harvard School of Public Health, MA, USA
12
Rivera-Rodriguez C, Haneuse S, Wang M, Spiegelman D. Augmented pseudo-likelihood estimation for two-phase studies. Stat Methods Med Res 2019; 29:344-358. PMID: 30834815. DOI: 10.1177/0962280219833415.
Abstract
In many public health and medical research settings, information on key covariates may not be readily available, or may be too expensive to gather for all individuals in the study. In such settings, the two-phase design provides a way forward by first stratifying an initial (large) phase I sample on the basis of readily available covariates (including, possibly, the outcome), and then sub-sampling participants at phase II to collect the expensive measure(s). When the outcome of interest is binary, several methods have been proposed for estimation and inference for the parameters of a logistic regression model, including weighted likelihood, pseudo-likelihood, and maximum likelihood. Although these methods yield consistent estimation and valid inference, they do so solely on the basis of the phase I stratification and the detailed covariate information obtained at phase II; they ignore any additional information that is readily available at phase I but was not used in the stratified sampling design. Motivated by the potential for efficiency gains, especially for parameters corresponding to the additional phase I covariates, we propose a novel augmented pseudo-likelihood estimator for two-phase studies that makes use of all available information. In contrast to recently proposed weighted-likelihood methods that calibrate to the influence function of the model of interest, the proposed methods do not require the development of additional models and therefore enjoy a degree of robustness. In addition, we expand the broader framework for pseudo-likelihood-based estimation and inference to permit link functions for binary regression other than the logit link. Comprehensive simulations, based on a one-time cross-sectional survey of 82,887 patients undergoing anti-retroviral therapy in Malawi between 2005 and 2007, illustrate the finite-sample properties of the proposed methods and compare their performance with competing approaches; the proposed method yields the lowest standard errors when the model is correctly specified. Finally, the methods are applied to a large implementation-science project examining the effect of an enhanced community health worker program on adherence to the WHO guideline of at least four antenatal visits, in Dar es Salaam, Tanzania.
Affiliation(s)
- Sebastien Haneuse
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Molin Wang
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
- Donna Spiegelman
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Biostatistics, Center on Methods for Implementation and Dissemination Science, Yale University School of Public Health, CT, USA
14
Rivera CL, Lumley T. Using the entire history in the analysis of nested case cohort samples. Stat Med 2016; 35:3213-3228. DOI: 10.1002/sim.6917.
Affiliation(s)
- C. L. Rivera
- Department of Biostatistics, Harvard School of Public Health, 677 Huntington Avenue, Kresge 803B, Boston, MA 02115, USA
- T. Lumley
- Department of Biostatistics, Harvard School of Public Health, 677 Huntington Avenue, Kresge 803B, Boston, MA 02115, USA