1
Di Gravio C, Schildcrout JS, Tao R. Efficient designs and analysis of two-phase studies with longitudinal binary data. Biometrics 2024; 80:ujad010. [PMID: 38364804 PMCID: PMC10871867 DOI: 10.1093/biomtc/ujad010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2023] [Revised: 08/23/2023] [Accepted: 11/09/2023] [Indexed: 02/18/2024]
Abstract
Researchers interested in understanding the relationship between a readily available longitudinal binary outcome and a novel biomarker exposure can be confronted with ascertainment costs that limit sample size. In such settings, two-phase studies can be cost-effective solutions that allow researchers to target informative individuals for exposure ascertainment and increase estimation precision for time-varying and/or time-fixed exposure coefficients. In this paper, we introduce a novel class of residual-dependent sampling (RDS) designs that select informative individuals using data available on the longitudinal outcome and inexpensive covariates. Together with the RDS designs, we propose a semiparametric analysis approach that efficiently uses all data to estimate the parameters. We describe a numerically stable and computationally efficient EM algorithm to maximize the semiparametric likelihood. We examine the finite sample operating characteristics of the proposed approaches through extensive simulation studies, and compare the efficiency of our designs and analysis approach with existing ones. We illustrate the usefulness of the proposed RDS designs and analysis method in practice by studying the association between a genetic marker and poor lung function among patients enrolled in the Lung Health Study (Connett et al., 1993).
Affiliation(s)
- Chiara Di Gravio
- Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, London, SW7 2AZ, United Kingdom
- Jonathan S Schildcrout
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37232, United States
- Ran Tao
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37232, United States
- Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN 37232, United States
2
Di Gravio C, Tao R, Schildcrout JS. Design and analysis of two-phase studies with multivariate longitudinal data. Biometrics 2022. [PMID: 35014029 DOI: 10.1111/biom.13616] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/03/2021] [Revised: 11/03/2021] [Accepted: 12/10/2021] [Indexed: 11/27/2022]
Abstract
Two-phase studies are crucial when outcome and covariate data are available in a first phase sample (e.g., a cohort study), but costs associated with retrospective ascertainment of a novel exposure limit the size of the second phase sample, in whom the exposure is collected. For longitudinal outcomes, one class of two-phase studies stratifies subjects based on an outcome vector summary (e.g., an average or a slope over time) and oversamples subjects in the extreme value strata while undersampling subjects in the medium value stratum. Based on the choice of the summary, two-phase studies for longitudinal data can increase efficiency of time-varying and/or time-fixed exposure parameter estimates. In this manuscript, we extend efficient, two-phase study designs to multivariate longitudinal continuous outcomes, and we detail two analysis approaches. The first approach is a multiple imputation analysis that combines complete data from subjects selected for phase two with the incomplete data from those not selected. The second approach is a conditional maximum likelihood analysis that is intended for applications where only data from subjects selected for phase two are available. Importantly, we show that both approaches can be applied to secondary analyses of previously conducted two-phase studies. We examine finite sample operating characteristics of the two approaches and use the Lung Health Study (Connett et al., 1993) to examine genetic associations with lung function decline over time.
Affiliation(s)
- Chiara Di Gravio
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, 37232, U.S.A
- Ran Tao
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, 37232, U.S.A
- Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, Tennessee, 37232, U.S.A
- Jonathan S Schildcrout
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, 37232, U.S.A
3
Tao R, Mercaldo ND, Haneuse S, Maronge JM, Rathouz PJ, Heagerty PJ, Schildcrout JS. Two-wave two-phase outcome-dependent sampling designs, with applications to longitudinal binary data. Stat Med 2021; 40:1863-1876. [PMID: 33442883 DOI: 10.1002/sim.8876] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/29/2020] [Revised: 12/07/2020] [Accepted: 12/25/2020] [Indexed: 12/26/2022]
Abstract
Two-phase outcome-dependent sampling (ODS) designs are useful when resource constraints prohibit expensive exposure ascertainment on all study subjects. One class of ODS designs for longitudinal binary data stratifies subjects into three strata according to those who experience the event at none, some, or all follow-up times. For time-varying covariate effects, exclusively selecting subjects with response variation can yield highly efficient estimates. However, if interest lies in the association of a time-invariant covariate, or the joint associations of time-varying and time-invariant covariates with the outcome, then the optimal design is unknown. Therefore, we propose a class of two-wave two-phase ODS designs for longitudinal binary data. We split the second-phase sample selection into two waves, between which an interim design evaluation analysis is conducted. The interim design evaluation analysis uses first-wave data to conduct a simulation-based search for the optimal second-wave design that will improve the likelihood of study success. Although we focus on longitudinal binary response data, the proposed design is general and can be applied to other response distributions. We believe that the proposed designs can be useful in settings where (1) the expected second-phase sample size is fixed and one must tailor stratum-specific sampling probabilities to maximize estimation efficiency, or (2) relative sampling probabilities are fixed across sampling strata and one must tailor sample size to achieve a desired precision. We describe the class of designs, examine finite sampling operating characteristics, and apply the designs to an exemplar longitudinal cohort study, the Lung Health Study.
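The none/some/all stratification described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `ods_strata` and `sample_phase_two`, the simulated outcome matrix, and the sampling probabilities are all hypothetical, and every subject is assumed to have the same number of fully observed visits.

```python
import numpy as np

def ods_strata(y):
    """Label each subject: 0 = event at no visit, 1 = at some visits, 2 = at all visits.

    y is a (subjects x visits) array of 0/1 responses (assumed fully observed).
    """
    totals = y.sum(axis=1)
    n_visits = y.shape[1]
    return np.where(totals == 0, 0, np.where(totals == n_visits, 2, 1))

def sample_phase_two(strata, probs, rng):
    """Independent Bernoulli sampling with stratum-specific probabilities probs[s]."""
    p = np.asarray(probs)[strata]
    return rng.random(strata.shape[0]) < p

# Hypothetical phase-one cohort: 500 subjects, 5 visits, simulated outcomes.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=(500, 5))
strata = ods_strata(y)

# Assumed design: select only subjects with response variation ("some" stratum).
selected = sample_phase_two(strata, probs=[0.0, 1.0, 0.0], rng=rng)
```

Changing `probs` to nonzero values in the "none" and "all" strata gives a design that also carries information on time-invariant covariates, which is the trade-off the two-wave search is meant to tune.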
Affiliation(s)
- Ran Tao
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Nathaniel D Mercaldo
- Departments of Radiology and Neurology, Massachusetts General Hospital and Harvard University, Boston, Massachusetts, USA
- Sebastien Haneuse
- Department of Biostatistics, Harvard University, Boston, Massachusetts, USA
- Jacob M Maronge
- Department of Statistics, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Paul J Rathouz
- Department of Population Health, University of Texas, Austin, Texas, USA
- Patrick J Heagerty
- Department of Biostatistics, University of Washington, Seattle, Washington, USA
- Jonathan S Schildcrout
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
4
Beesley LJ, Salvatore M, Fritsche LG, Pandit A, Rao A, Brummett C, Willer CJ, Lisabeth LD, Mukherjee B. The emerging landscape of health research based on biobanks linked to electronic health records: Existing resources, statistical challenges, and potential opportunities. Stat Med 2020; 39:773-800. [PMID: 31859414 PMCID: PMC7983809 DOI: 10.1002/sim.8445] [Citation(s) in RCA: 52] [Impact Index Per Article: 13.0] [Received: 10/10/2018] [Revised: 09/10/2019] [Accepted: 11/16/2019] [Indexed: 01/03/2023]
Abstract
Biobanks linked to electronic health records provide rich resources for health-related research. With improvements in administrative and informatics infrastructure, the availability and utility of data from biobanks have dramatically increased. In this paper, we first aim to characterize the current landscape of available biobanks and to describe specific biobanks, including their place of origin, size, and data types. The development and accessibility of large-scale biorepositories provide the opportunity to accelerate agnostic searches, expedite discoveries, and conduct hypothesis-generating studies of disease-treatment, disease-exposure, and disease-gene associations. Rather than designing and implementing a single study focused on a few targeted hypotheses, researchers can potentially use biobanks' existing resources to answer an expanded selection of exploratory questions as quickly as they can analyze them. However, there are many obvious and subtle challenges with the design and analysis of biobank-based studies. Our second aim is to discuss statistical issues related to biobank research such as study design, sampling strategy, phenotype identification, and missing data. We focus our discussion on biobanks that are linked to electronic health records. Some of the analytic issues are illustrated using data from the Michigan Genomics Initiative and UK Biobank, two biobanks with two different recruitment mechanisms. We summarize the current body of literature for addressing these challenges and discuss some standing open problems. This work complements and extends recent reviews about biobank-based research and serves as a resource catalog with analytical and practical guidance for statisticians, epidemiologists, and other medical researchers pursuing research using biobanks.
Affiliation(s)
- Anita Pandit
- University of Michigan, Department of Biostatistics
- Arvind Rao
- University of Michigan, Department of Computational Medicine and Bioinformatics
- Chad Brummett
- University of Michigan, Department of Anesthesiology
- Cristen J. Willer
- University of Michigan, Department of Computational Medicine and Bioinformatics
5
Schildcrout JS, Haneuse S, Tao R, Zelnick LR, Schisterman EF, Garbett SP, Mercaldo ND, Rathouz PJ, Heagerty PJ. Two-Phase, Generalized Case-Control Designs for the Study of Quantitative Longitudinal Outcomes. Am J Epidemiol 2020; 189:81-90. [PMID: 31165875 DOI: 10.1093/aje/kwz127] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 11/01/2018] [Revised: 05/06/2019] [Accepted: 05/14/2019] [Indexed: 01/30/2023]
Abstract
We propose a general class of 2-phase epidemiologic study designs for quantitative, longitudinal data that are useful when phase 1 longitudinal outcome and covariate data are available but data on the exposure (e.g., a biomarker) can only be collected on a subset of subjects during phase 2. To conduct a study using a design in the class, one first summarizes the longitudinal outcomes by fitting a simple linear regression of the response on a time-varying covariate for each subject. Sampling strata are defined by splitting the estimated regression intercept or slope distributions into distinct (low, medium, and high) regions. Stratified sampling is then conducted from strata defined by the intercepts, by the slopes, or from a mixture. In general, samples selected with extreme intercept values will yield low variances for associations of time-fixed exposures with the outcome and samples enriched with extreme slope values will yield low variances for associations of time-varying exposures with the outcome (including interactions with time-varying exposures). We describe ascertainment-corrected maximum likelihood and multiple-imputation estimation procedures that permit valid and efficient inferences. We embed all methodological developments within the framework of conducting a substudy that seeks to examine genetic associations with lung function among continuous smokers in the Lung Health Study (United States and Canada, 1986-1994).
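The design steps in this abstract (summarize each subject by a simple linear regression, split the intercept or slope distribution into low, medium, and high regions, then sample by stratum) can be sketched as follows. This is a rough illustration rather than the published procedure: the helper names, the 10th/90th percentile cutoffs, the sampling probabilities, and the simulated data are assumptions, and time stands in for the time-varying covariate.

```python
import numpy as np

def subject_summaries(times, y):
    """Per-subject least-squares intercept and slope of y on time.

    times and y are (subjects x visits) arrays with no missing values.
    """
    t_c = times - times.mean(axis=1, keepdims=True)
    y_c = y - y.mean(axis=1, keepdims=True)
    slope = (t_c * y_c).sum(axis=1) / (t_c ** 2).sum(axis=1)
    intercept = y.mean(axis=1) - slope * times.mean(axis=1)
    return intercept, slope

def three_strata(summary, low_q=0.1, high_q=0.9):
    """Label subjects 0 (low), 1 (medium), 2 (high) using assumed quantile cutoffs."""
    lo, hi = np.quantile(summary, [low_q, high_q])
    return np.where(summary < lo, 0, np.where(summary > hi, 2, 1))

# Hypothetical phase-one data: 1000 subjects, 5 equally spaced visits.
rng = np.random.default_rng(1)
n, m = 1000, 5
times = np.tile(np.arange(m, dtype=float), (n, 1))
y = (rng.normal(size=(n, 1))                       # subject-specific level
     + rng.normal(scale=0.5, size=(n, 1)) * times  # subject-specific trend
     + rng.normal(scale=0.3, size=(n, m)))         # visit-level noise

intercept, slope = subject_summaries(times, y)
strata = three_strata(slope)  # slope-based design; pass intercept for a level-based one

# Assumed probabilities: take every extreme-slope subject, 10% of the middle stratum.
probs = np.array([1.0, 0.1, 1.0])
selected = rng.random(n) < probs[strata]
```

Any analysis of the selected subjects would still need to account for the sampling scheme, for example with the ascertainment-corrected likelihood or multiple-imputation procedures the abstract mentions; the sketch covers the design step only.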
Affiliation(s)
- Sebastien Haneuse
- Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts
- Ran Tao
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee
- Leila R Zelnick
- Division of Nephrology, Department of Medicine, University of Washington, Seattle, Washington
- Enrique F Schisterman
- Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, Bethesda, Maryland
- Shawn P Garbett
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee
- Paul J Rathouz
- Department of Population Health, Dell Medical School, University of Texas, Austin, Texas
- Patrick J Heagerty
- Department of Biostatistics, School of Public Health, University of Washington, Seattle, Washington
6
Ni A, Satagopan JM. Estimating Additive Interaction Effect in Stratified Two-Phase Case-Control Design. Hum Hered 2019; 84:90-108. [PMID: 31634888 PMCID: PMC6925975 DOI: 10.1159/000502738] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 07/09/2018] [Accepted: 08/15/2019] [Indexed: 11/19/2022]
Abstract
BACKGROUND AND AIMS: There is considerable interest in epidemiology in estimating an additive interaction effect between two risk factors in case-control studies. An additive interaction is defined as the differential reduction in absolute risk associated with one factor between different levels of the other factor. A stratified two-phase case-control design is commonly used in epidemiology to reduce the cost of assembling covariates. It is crucial to obtain valid estimates of the model parameters by accounting for the underlying stratification scheme to obtain accurate and precise estimates of additive interaction effects. The aim of this paper is to examine the properties of different methods for estimating model parameters and additive interaction effects under a stratified two-phase case-control design.
METHODS: Using simulations, we investigate the properties of three existing methods, namely stratum-specific offset, inverse-probability weighting, and multiple imputation, for estimating model parameters and additive interaction effects. We also illustrate these properties using data from two published epidemiology studies.
RESULTS: Simulation studies show that the multiple imputation method performs well when both the true and analysis models are additive (i.e., do not include multiplicative interaction terms) but does not provide a discernible advantage over the offset method when the analysis models are non-additive (i.e., include multiplicative interaction terms). The offset method exhibits the best overall properties when the analysis model contains multiplicative interaction effects.
CONCLUSION: When estimating additive interaction between risk factors in stratified two-phase case-control studies, we recommend estimating model parameters using multiple imputation when the analysis model is additive, and we recommend the offset method when the analysis model is non-additive.
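The definition of additive interaction given in this abstract amounts to a difference of risk differences, which a short worked example can make concrete. The sketch below is illustrative only: the function name `interaction_contrast`, the risk values, and the logistic coefficients are hypothetical, and it says nothing about how the paper estimates risks from case-control data.

```python
import math

def expit(x):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def interaction_contrast(r00, r10, r01, r11):
    """Additive interaction as a difference of risk differences.

    r_ab is the absolute risk with factor A at level a and factor B at level b.
    Returns (effect of A when B = 1) minus (effect of A when B = 0).
    """
    return (r11 - r01) - (r10 - r00)

# Risks that are exactly additive: A adds 0.1 at both levels of B.
ic_additive = interaction_contrast(0.1, 0.2, 0.3, 0.4)

# Hypothetical logistic model with main effects only (no product term).
b0, b_a, b_b = -2.0, 1.0, 1.0
ic_logistic = interaction_contrast(
    expit(b0),
    expit(b0 + b_a),
    expit(b0 + b_b),
    expit(b0 + b_a + b_b),
)
```

Here `ic_additive` is zero up to floating-point error, while `ic_logistic` is positive even though the logistic model contains no multiplicative interaction term. This is why additivity on the risk scale and absence of a product term on the logit scale have to be treated as separate questions.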
Affiliation(s)
- Ai Ni
- Division of Biostatistics, The Ohio State University, Columbus, Ohio, USA
- Jaya M Satagopan
- Department of Biostatistics and Epidemiology, School of Public Health, Rutgers University, Piscataway, New Jersey, USA