1
Kang K, Seidlitz J, Bethlehem RA, Xiong J, Jones MT, Mehta K, Keller AS, Tao R, Randolph A, Larsen B, Tervo-Clemmens B, Feczko E, Miranda Dominguez O, Nelson S, Schildcrout J, Fair D, Satterthwaite TD, Alexander-Bloch A, Vandekar S. Study design features increase replicability in cross-sectional and longitudinal brain-wide association studies. bioRxiv 2024:2023.05.29.542742. [PMID: 37398345 PMCID: PMC10312450 DOI: 10.1101/2023.05.29.542742] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 07/04/2023]
Abstract
Brain-wide association studies (BWAS) are a fundamental tool in discovering brain-behavior associations. Several recent studies showed that thousands of study participants are required for good replicability of BWAS because the standardized effect sizes (ESs) are much smaller than the reported standardized ESs in smaller studies. Here, we perform analyses and meta-analyses of a robust effect size index using 63 longitudinal and cross-sectional magnetic resonance imaging studies from the Lifespan Brain Chart Consortium (77,695 total scans) to demonstrate that optimizing study design is critical for increasing standardized ESs and replicability in BWAS. A meta-analysis of brain volume associations with age indicates that BWAS with larger variability in covariate have larger reported standardized ES. In addition, the longitudinal studies we examined reported systematically larger standardized ES than cross-sectional studies. Analyzing age effects on global and regional brain measures from the United Kingdom Biobank and the Alzheimer's Disease Neuroimaging Initiative, we show that modifying longitudinal study design through sampling schemes improves the standardized ESs and replicability. Sampling schemes that improve standardized ESs and replicability include increasing between-subject age variability in the sample and adding a single additional longitudinal measurement per subject. To ensure that our results are generalizable, we further evaluate these longitudinal sampling schemes on cognitive, psychopathology, and demographic associations with structural and functional brain outcome measures in the Adolescent Brain and Cognitive Development dataset. We demonstrate that commonly used longitudinal models can, counterintuitively, reduce standardized ESs and replicability. The benefit of conducting longitudinal studies depends on the strengths of the between- versus within-subject associations of the brain and non-brain measures. 
Explicitly modeling between- versus within-subject effects avoids averaging the effects and allows optimizing the standardized ESs for each separately. Together, these results provide guidance for study designs that improve the replicability of BWAS.
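As a rough illustration of what separating between- and within-subject effects can look like in practice, the sketch below splits a repeatedly measured covariate into a subject mean (between-subject part) and a deviation from that mean (within-subject part). The function name, the record layout, and the toy values are assumptions made for illustration; the abstract does not specify the authors' implementation.

```python
from statistics import mean


def split_between_within(records):
    """Split a covariate into between- and within-subject components.

    `records` is a list of (subject_id, x) pairs, one per measurement.
    Returns a list of (subject_id, x_between, x_within) in the same order,
    where x_between is the subject's mean and x_within = x - x_between.
    """
    by_subject = {}
    for subject_id, x in records:
        by_subject.setdefault(subject_id, []).append(x)
    subject_mean = {sid: mean(xs) for sid, xs in by_subject.items()}
    return [(sid, subject_mean[sid], x - subject_mean[sid]) for sid, x in records]


# Hypothetical toy data: two subjects, two visits each (e.g. age at scan).
records = [("a", 10.0), ("a", 12.0), ("b", 20.0), ("b", 24.0)]
parts = split_between_within(records)
```

Entering the two components as separate regressors lets a model estimate a between-subject and a within-subject coefficient rather than a single averaged one.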
Affiliation(s)
- Kaidi Kang
- Department of Biostatistics, Vanderbilt University Medical Center
- Jakob Seidlitz
- Department of Child and Adolescent Psychiatry and Behavioral Sciences, The Children’s Hospital of Philadelphia
- Department of Psychiatry, University of Pennsylvania
- Lifespan Brain Institute of The Children’s Hospital of Philadelphia and Penn Medicine
- Jiangmei Xiong
- Department of Biostatistics, Vanderbilt University Medical Center
- Megan T. Jones
- Department of Biostatistics, Vanderbilt University Medical Center
- Kahini Mehta
- Department of Psychiatry, University of Pennsylvania
- Lifespan Brain Institute of The Children’s Hospital of Philadelphia and Penn Medicine
- Penn Lifespan Informatics and Neuroimaging Center (PennLINC), Perelman School of Medicine, University of Pennsylvania
- Arielle S. Keller
- Department of Psychiatry, University of Pennsylvania
- Lifespan Brain Institute of The Children’s Hospital of Philadelphia and Penn Medicine
- Penn Lifespan Informatics and Neuroimaging Center (PennLINC), Perelman School of Medicine, University of Pennsylvania
- Ran Tao
- Department of Biostatistics, Vanderbilt University Medical Center
- Anita Randolph
- Department of Pediatrics, University of Minnesota Medical School
- Bart Larsen
- Department of Pediatrics, University of Minnesota Medical School
- Brenden Tervo-Clemmens
- Department of Psychiatry & Behavioral Sciences, University of Minnesota Medical School
- Eric Feczko
- Department of Pediatrics, University of Minnesota Medical School
- Steve Nelson
- Department of Pediatrics, University of Minnesota Medical School
- Damien Fair
- Department of Pediatrics, University of Minnesota Medical School
- Theodore D. Satterthwaite
- Department of Psychiatry, University of Pennsylvania
- Lifespan Brain Institute of The Children’s Hospital of Philadelphia and Penn Medicine
- Penn Lifespan Informatics and Neuroimaging Center (PennLINC), Perelman School of Medicine, University of Pennsylvania
- Aaron Alexander-Bloch
- Department of Child and Adolescent Psychiatry and Behavioral Sciences, The Children’s Hospital of Philadelphia
- Department of Psychiatry, University of Pennsylvania
- Lifespan Brain Institute of The Children’s Hospital of Philadelphia and Penn Medicine
- Simon Vandekar
- Department of Biostatistics, Vanderbilt University Medical Center
2
Di Gravio C, Tao R, Schildcrout JS. Design and analysis of two-phase studies with multivariate longitudinal data. Biometrics 2022. [PMID: 35014029 DOI: 10.1111/biom.13616] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/03/2021] [Revised: 11/03/2021] [Accepted: 12/10/2021] [Indexed: 11/27/2022]
Abstract
Two-phase studies are crucial when outcome and covariate data are available in a first phase sample (e.g., a cohort study), but costs associated with retrospective ascertainment of a novel exposure limit the size of the second phase sample, in whom the exposure is collected. For longitudinal outcomes, one class of two-phase studies stratifies subjects based on an outcome vector summary (e.g., an average or a slope over time) and oversamples subjects in the extreme value strata while undersampling subjects in the medium value stratum. Based on the choice of the summary, two-phase studies for longitudinal data can increase efficiency of time-varying and/or time-fixed exposure parameter estimates. In this manuscript, we extend efficient, two-phase study designs to multivariate longitudinal continuous outcomes, and we detail two analysis approaches. The first approach is a multiple imputation analysis that combines complete data from subjects selected for phase two with the incomplete data from those not selected. The second approach is a conditional maximum likelihood analysis that is intended for applications where only data from subjects selected for phase two are available. Importantly, we show that both approaches can be applied to secondary analyses of previously conducted two-phase studies. We examine finite sample operating characteristics of the two approaches and use the Lung Health Study (Connett et al., 1993) to examine genetic associations with lung function decline over time.
Affiliation(s)
- Chiara Di Gravio
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, 37232, U.S.A.
- Ran Tao
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, 37232, U.S.A.; Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, Tennessee, 37232, U.S.A.
- Jonathan S Schildcrout
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, 37232, U.S.A.
3
Sauer S, Hedt-Gauthier B, Haneuse S. Optimal allocation in stratified cluster-based outcome-dependent sampling designs. Stat Med 2021; 40:4090-4107. [PMID: 34076912 DOI: 10.1002/sim.9016] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/15/2020] [Revised: 03/31/2021] [Accepted: 04/12/2021] [Indexed: 11/08/2022]
Abstract
In public health research, finite resources often require that decisions be made at the study design stage regarding which individuals to sample for detailed data collection. At the same time, when study units are naturally clustered, as patients are in clinics, it may be preferable to sample clusters rather than the study units, especially when the costs associated with travel between clusters are high. In this setting, aggregated data on the outcome and select covariates are sometimes routinely available through, for example, a country's Health Management Information System. If used wisely, this information can be used to guide decisions regarding which clusters to sample, and potentially obtain gains in efficiency over simple random sampling. In this article, we derive a series of formulas for optimal allocation of resources when a single-stage stratified cluster-based outcome-dependent sampling design is to be used and a marginal mean model is specified to answer the question of interest. Specifically, we consider two settings: (i) when a particular parameter in the mean model is of primary interest; and, (ii) when multiple parameters are of interest. We investigate the finite population performance of the optimal allocation framework through a comprehensive simulation study. Our results show that there are trade-offs that must be considered at the design stage: optimizing for one parameter yields efficiency gains over balanced and simple random sampling, while resulting in losses for the other parameters in the model. Optimizing for all parameters simultaneously yields smaller gains in efficiency, but mitigates the losses for the other parameters in the model.
Affiliation(s)
- Sara Sauer
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
- Bethany Hedt-Gauthier
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA; Department of Global Health and Social Medicine, Harvard Medical School, Boston, Massachusetts, USA
- Sebastien Haneuse
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
4
Sauer S, Hedt-Gauthier B, Rivera-Rodriguez C, Haneuse S. Small-sample inference for cluster-based outcome-dependent sampling schemes in resource-limited settings: Investigating low birthweight in Rwanda. Biometrics 2021; 78:701-715. [PMID: 33444459 DOI: 10.1111/biom.13423] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/27/2020] [Accepted: 12/31/2020] [Indexed: 11/27/2022]
Abstract
The neonatal mortality rate in Rwanda remains above the United Nations Sustainable Development Goal 3 target of 12 deaths per 1000 live births. As part of a larger effort to reduce preventable neonatal deaths in the country, we conducted a study to examine risk factors for low birthweight. The data were collected via a cost-efficient cluster-based outcome-dependent sampling (ODS) scheme wherein clusters of individuals (health centers) were selected on the basis of, in part, the outcome rate of the individuals. For a given data set collected via a cluster-based ODS scheme, estimation for a marginal model may proceed via inverse-probability-weighted generalized estimating equations, where the cluster-specific weights are the inverse probability of the health center's inclusion in the sample. In this paper, we provide a detailed treatment of the asymptotic properties of this estimator, together with an explicit expression for the asymptotic variance and a corresponding estimator. Furthermore, motivated by the study we conducted in Rwanda, we propose a number of small-sample bias corrections to both the point estimates and the standard error estimates. Through simulation, we show that applying these corrections when the number of clusters is small generally reduces the bias in the point estimates, and results in closer to nominal coverage. The proposed methods are applied to data from 18 health centers and 1 district hospital in Rwanda.
Affiliation(s)
- Sara Sauer
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
- Bethany Hedt-Gauthier
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA; Department of Global Health and Social Medicine, Harvard Medical School, Boston, Massachusetts, USA
- Sebastien Haneuse
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
5
Tao R, Mercaldo ND, Haneuse S, Maronge JM, Rathouz PJ, Heagerty PJ, Schildcrout JS. Two-wave two-phase outcome-dependent sampling designs, with applications to longitudinal binary data. Stat Med 2021; 40:1863-1876. [PMID: 33442883 DOI: 10.1002/sim.8876] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/29/2020] [Revised: 12/07/2020] [Accepted: 12/25/2020] [Indexed: 12/26/2022]
Abstract
Two-phase outcome-dependent sampling (ODS) designs are useful when resource constraints prohibit expensive exposure ascertainment on all study subjects. One class of ODS designs for longitudinal binary data stratifies subjects into three strata according to those who experience the event at none, some, or all follow-up times. For time-varying covariate effects, exclusively selecting subjects with response variation can yield highly efficient estimates. However, if interest lies in the association of a time-invariant covariate, or the joint associations of time-varying and time-invariant covariates with the outcome, then the optimal design is unknown. Therefore, we propose a class of two-wave two-phase ODS designs for longitudinal binary data. We split the second-phase sample selection into two waves, between which an interim design evaluation analysis is conducted. The interim design evaluation analysis uses first-wave data to conduct a simulation-based search for the optimal second-wave design that will improve the likelihood of study success. Although we focus on longitudinal binary response data, the proposed design is general and can be applied to other response distributions. We believe that the proposed designs can be useful in settings where (1) the expected second-phase sample size is fixed and one must tailor stratum-specific sampling probabilities to maximize estimation efficiency, or (2) relative sampling probabilities are fixed across sampling strata and one must tailor sample size to achieve a desired precision. We describe the class of designs, examine finite sampling operating characteristics, and apply the designs to an exemplar longitudinal cohort study, the Lung Health Study.
Affiliation(s)
- Ran Tao
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA; Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Nathaniel D Mercaldo
- Departments of Radiology and Neurology, Massachusetts General Hospital and Harvard University, Boston, Massachusetts, USA
- Sebastien Haneuse
- Department of Biostatistics, Harvard University, Boston, Massachusetts, USA
- Jacob M Maronge
- Department of Statistics, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Paul J Rathouz
- Department of Population Health, University of Texas, Austin, Texas, USA
- Patrick J Heagerty
- Department of Biostatistics, University of Washington, Seattle, Washington, USA
- Jonathan S Schildcrout
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
6
Yu J, Zhou H, Cai J. Accelerated failure time model for data from outcome-dependent sampling. Lifetime Data Analysis 2021; 27:15-37. [PMID: 33044612 PMCID: PMC7856009 DOI: 10.1007/s10985-020-09508-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/26/2019] [Accepted: 09/29/2020] [Indexed: 05/26/2023]
Abstract
Outcome-dependent sampling designs such as the case-control or case-cohort design are widely used in epidemiological studies for their outstanding cost-effectiveness. In this article, we propose and develop a smoothed weighted Gehan estimating equation approach for inference in an accelerated failure time model under a general failure time outcome-dependent sampling scheme. The proposed estimating equation is continuously differentiable and can be solved by the standard numerical methods. In addition to developing asymptotic properties of the proposed estimator, we also propose and investigate a new optimal power-based subsample allocation criterion in the proposed design by maximizing the power function of a significance test. Simulation results show that the proposed estimator is more efficient than other existing competing estimators and that the optimal power-based subsample allocation will provide an ODS design that yields improved power for the test of exposure effect. We illustrate the proposed method with a data set from the Norwegian Mother and Child Cohort Study to evaluate the relationship between exposure to perfluoroalkyl substances and women's subfecundity.
Affiliation(s)
- Jichang Yu
- School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, 430073, Hubei, China
- Haibo Zhou
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
- Jianwen Cai
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
7
McGee G, Kioumourtzoglou M, Weisskopf MG, Haneuse S, Coull BA. On the interplay between exposure misclassification and informative cluster size. J R Stat Soc Ser C Appl Stat 2020. [DOI: 10.1111/rssc.12430] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
Affiliation(s)
- Glen McGee
- Harvard T.H. Chan School of Public Health, Boston, USA
8
Sagel SD, Wagner BD, Ziady A, Kelley T, Clancy JP, Narvaez-Rivas M, Pilewski J, Joseloff E, Sha W, Zelnick L, Setchell KDR, Heltshe SL, Muhlebach MS. Utilizing centralized biorepository samples for biomarkers of cystic fibrosis lung disease severity. J Cyst Fibros 2020; 19:632-640. [PMID: 31870630 PMCID: PMC7305052 DOI: 10.1016/j.jcf.2019.12.007] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 08/20/2019] [Revised: 10/30/2019] [Accepted: 12/08/2019] [Indexed: 12/15/2022]
Abstract
BACKGROUND Circulating biomarkers reflective of lung disease activity and severity have the potential to improve patient care and accelerate drug development in CF. The objective of this study was to leverage banked specimens to test the hypothesis that blood-based biomarkers discriminate CF children segregated by lung disease severity. METHODS Banked serum samples were selected from children who were categorized into two extremes of phenotype associated with lung function ('mild' or 'severe') based on CF-specific data and were matched on age, gender, CFTR genotype, and P. aeruginosa infection status. Targeted inflammatory proteins, lipids, and discovery metabolite profiles were measured in these serum samples. RESULTS The severe cohort, characterized by a lower CF-specific FEV1 percentile, had significantly higher circulating concentrations of high sensitivity C-reactive protein, serum amyloid A, granulocyte colony stimulating factor, and calprotectin compared to the mild cohort. The mild cohort tended to have higher serum linoleic acid concentrations. The metabolite arabitol was lower in the severe cohort while other CF relevant metabolic pathways showed non-significant differences after adjusting for multiple comparisons. A sensitivity analysis to correct for biased estimates that may result from selecting subjects using an extremes of phenotype approach confirmed the protein biomarker findings. CONCLUSIONS Circulating inflammatory proteins differ in CF children segregated by lung function. These findings serve to demonstrate the value of maintaining centralized, high quality patient derived samples for future research, with linkage to clinical information to answer testable hypotheses in biomarker development.
Affiliation(s)
- Scott D Sagel
- Department of Pediatrics, Children's Hospital Colorado, University of Colorado School of Medicine, Aurora, CO, USA
- Brandie D Wagner
- Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver, Aurora, Colorado, USA
- Assem Ziady
- Division of Pulmonary Medicine, Department of Pediatrics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
- Tom Kelley
- Division of Pulmonology, Department of Pediatrics, Case Western Reserve University, Cleveland, OH
- John P Clancy
- Division of Pulmonary Medicine, Department of Pediatrics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
- Joseph Pilewski
- Division of Pulmonary, Allergy, and Critical Care Medicine, Department of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Wei Sha
- Bioinformatics Services Division, Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, 150 Research Campus Dr., Kannapolis, NC, USA
- Leila Zelnick
- Division of Nephrology, University of Washington School of Medicine, Seattle, WA, USA
- Sonya L Heltshe
- Cystic Fibrosis Foundation Therapeutics Development Network Coordinating Center, Seattle Children's Research Institute, Seattle, WA, USA; Department of Pediatrics, University of Washington, School of Medicine, Seattle, WA, USA
- Marianne S Muhlebach
- Division of Pulmonology, Department of Pediatrics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
9
Schildcrout JS, Haneuse S, Tao R, Zelnick LR, Schisterman EF, Garbett SP, Mercaldo ND, Rathouz PJ, Heagerty PJ. Two-Phase, Generalized Case-Control Designs for the Study of Quantitative Longitudinal Outcomes. Am J Epidemiol 2020; 189:81-90. [PMID: 31165875 DOI: 10.1093/aje/kwz127] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 11/01/2018] [Revised: 05/06/2019] [Accepted: 05/14/2019] [Indexed: 01/30/2023]
Abstract
We propose a general class of 2-phase epidemiologic study designs for quantitative, longitudinal data that are useful when phase 1 longitudinal outcome and covariate data are available but data on the exposure (e.g., a biomarker) can only be collected on a subset of subjects during phase 2. To conduct a study using a design in the class, one first summarizes the longitudinal outcomes by fitting a simple linear regression of the response on a time-varying covariate for each subject. Sampling strata are defined by splitting the estimated regression intercept or slope distributions into distinct (low, medium, and high) regions. Stratified sampling is then conducted from strata defined by the intercepts, by the slopes, or from a mixture. In general, samples selected with extreme intercept values will yield low variances for associations of time-fixed exposures with the outcome and samples enriched with extreme slope values will yield low variances for associations of time-varying exposures with the outcome (including interactions with time-varying exposures). We describe ascertainment-corrected maximum likelihood and multiple-imputation estimation procedures that permit valid and efficient inferences. We embed all methodological developments within the framework of conducting a substudy that seeks to examine genetic associations with lung function among continuous smokers in the Lung Health Study (United States and Canada, 1986-1994).
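The design steps described in this abstract (summarize each subject's trajectory with a simple linear regression, split the slope distribution into low, medium, and high regions, then sample by stratum) can be sketched roughly as follows. All function names, cut points, stratum sample sizes, and the synthetic data are hypothetical choices for illustration, and the sketch covers only phase 2 selection, not the ascertainment-corrected likelihood or multiple-imputation analyses the authors describe.

```python
import random


def ols_intercept_slope(times, values):
    """Least-squares intercept and slope of one subject's values on times."""
    n = len(times)
    t_bar = sum(times) / n
    y_bar = sum(values) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, values))
    slope = sxy / sxx
    return y_bar - slope * t_bar, slope


def stratify(summaries, low_cut, high_cut):
    """Assign each subject to a low, medium, or high stratum by its summary."""
    strata = {"low": [], "medium": [], "high": []}
    for subject, value in summaries.items():
        if value < low_cut:
            strata["low"].append(subject)
        elif value > high_cut:
            strata["high"].append(subject)
        else:
            strata["medium"].append(subject)
    return strata


def sample_phase_two(strata, n_per_stratum, seed=0):
    """Draw a fixed number of subjects per stratum (capped at stratum size)."""
    rng = random.Random(seed)
    return {
        name: rng.sample(sorted(ids), min(n_per_stratum[name], len(ids)))
        for name, ids in strata.items()
    }


# Synthetic phase 1 data: ten subjects with exactly linear outcomes whose
# slopes run from -5 to 4 (hypothetical values, chosen for easy checking).
times = [0.0, 1.0, 2.0, 3.0]
phase_one = {f"s{i}": [100.0 + (i - 5) * t for t in times] for i in range(10)}
slopes = {sid: ols_intercept_slope(times, ys)[1] for sid, ys in phase_one.items()}

# Oversample the extremes: take every low and high subject, few medium ones.
strata = stratify(slopes, low_cut=-3.0, high_cut=2.0)
selected = sample_phase_two(strata, {"low": 2, "medium": 2, "high": 2})
```

Passing the fitted intercepts instead of the slopes to `stratify` would give the intercept-based variant of the same selection step.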
Affiliation(s)
- Sebastien Haneuse
- Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts
- Ran Tao
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee
- Leila R Zelnick
- Division of Nephrology, Department of Medicine, University of Washington, Seattle, Washington
- Enrique F Schisterman
- Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, Bethesda, Maryland
- Shawn P Garbett
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee
- Paul J Rathouz
- Department of Population Health, Dell Medical School, University of Texas, Austin, Texas
- Patrick J Heagerty
- Department of Biostatistics, School of Public Health, University of Washington, Seattle, Washington
10
Flanders WD. Invited Commentary: Two-Phase, Generalized Case-Control Designs for Quantitative Longitudinal Outcomes and Evolution of the Case-Control Study. Am J Epidemiol 2020; 189:91-94. [PMID: 31566676 DOI: 10.1093/aje/kwz200] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2019] [Revised: 08/23/2019] [Accepted: 08/27/2019] [Indexed: 11/12/2022]
Abstract
The case-control study design has evolved substantially over the past half century. The design has long been recognized as a way to increase efficiency by studying fewer subjects than would be required for a full cohort study. Historically, it was thought that case-control studies required a rare disease assumption for valid risk ratio estimation, but it was later realized that rare disease was not necessary. Over time, the design and analysis methods were further modified to allow estimation of rate ratios or to allow each person to serve as his/her own control (as we see with case-cohort and case-crossover studies, for example). We now understand that efficiency can be increased through the use of outcome-dependent sampling not only for dichotomous outcomes but also for continuous outcomes in longitudinal studies with repeated outcome measurement during follow-up. In their accompanying paper, Schildcrout et al. (Am J Epidemiol. 2019;000(00):000-000) contribute to our understanding, clearly summarizing many recent advances in study design and analyses that allow more general and efficient use of case-control studies. Their simulations demonstrate that improved efficiency is achieved with these methods when the goal is to estimate associations of exposure with trajectories and patterns of change over time. Here we comment on application of some of these generalized case-control methods to causal inference.
Affiliation(s)
- W Dana Flanders
- Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta, GA 30322; Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322
11
McGee G, Schildcrout J, Normand SL, Haneuse S. Outcome-dependent sampling in cluster-correlated data settings with application to hospital profiling. Journal of the Royal Statistical Society. Series A (Statistics in Society) 2020; 183:379-402. [PMID: 35991674 PMCID: PMC9390011 DOI: 10.1111/rssa.12503] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 06/15/2023]
Abstract
Hospital readmission is a key marker of quality of healthcare and an important policy measure, used by the Centers for Medicare and Medicaid Services to determine, in part, reimbursement rates. Currently, analyses of readmissions are based on a logistic-normal generalized linear mixed model that permits estimation of hospital-specific measures while adjusting for case mix differences. Recent moves to identify and address healthcare disparities call for expanding case mix adjustment to include measures of socio-economic status while minimizing additional burden to hospitals associated with collecting data on such measures. Towards resolving this dilemma, we propose that detailed socio-economic data be collected on a subsample of patients via an outcome-dependent sampling scheme, specifically the cluster-stratified case-control design. Estimation and inference, for both the fixed and the random-effects components, are performed via pseudo-maximum-likelihood wherein inverse probability weights are incorporated in the usual integrated likelihood to account for the design. In comprehensive simulations, cluster-stratified case-control sampling proves to be an efficient design whenever interest lies in fixed or random effects of a generalized linear mixed model and covariates are unobserved or expensive to collect. The methods are motivated by and illustrated with an analysis of N = 889661 Medicare beneficiaries hospitalized between 2011 and 2013 with congestive heart failure at one of K = 3116 hospitals. Results highlight that the framework proposed provides a means of mitigating disparities in terms of which hospitals are indicated as being poor performers, relative to a naive analysis that fails to adjust for missing case mix variables.
Affiliation(s)
- Glen McGee
- Harvard T.H. Chan School of Public Health, Boston, USA
- Sharon-Lise Normand
- Harvard Medical School and Harvard T.H. Chan School of Public Health, Boston, USA
12
Abstract
The two-phase design is a cost-effective sampling strategy to evaluate the effects of covariates on an outcome when certain covariates are too expensive to be measured on all study subjects. Under such a design, the outcome and inexpensive covariates are measured on all subjects in the first phase and the first-phase information is used to select subjects for measurements of expensive covariates in the second phase. Previous research on two-phase studies has focused largely on the inference procedures rather than the design aspects. We investigate the design efficiency of the two-phase study, as measured by the semiparametric efficiency bound for estimating the regression coefficients of expensive covariates. We consider general two-phase studies, where the outcome variable can be continuous, discrete, or censored, and the second-phase sampling can depend on the first-phase data in any manner. We develop optimal or approximately optimal two-phase designs, which can be substantially more efficient than the existing designs. We demonstrate the improvements of the new designs over the existing ones through extensive simulation studies and two large medical studies.
Affiliation(s)
- Ran Tao
- Department of Biostatistics and Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN 37232; Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599
- Donglin Zeng
- Department of Biostatistics and Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN 37232; Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599
- Dan-Yu Lin
- Department of Biostatistics and Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN 37232; Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599
13
|
Rivera-Rodriguez C, Spiegelman D, Haneuse S. On the analysis of two-phase designs in cluster-correlated data settings. Stat Med 2019; 38:4611-4624. [PMID: 31359448 PMCID: PMC6736737 DOI: 10.1002/sim.8321] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 02/21/2018] [Revised: 06/04/2019] [Accepted: 06/21/2019] [Indexed: 11/06/2022]
Abstract
In public health research, information that is readily available may be insufficient to address the primary question(s) of interest. One cost-efficient way forward, especially in resource-limited settings, is to conduct a two-phase study in which the population is initially stratified, at phase I, by the outcome and/or some categorical risk factor(s). At phase II, detailed covariate data are ascertained on a subsample within each phase I stratum. While analysis methods for two-phase designs are well established, they have focused exclusively on settings in which participants are assumed to be independent. As such, when participants are naturally clustered (e.g., patients within clinics), these methods may yield invalid inference. To address this, we develop a novel analysis approach based on inverse-probability weighting that permits researchers to specify a working covariance structure, appropriately accounts for the sampling design, and ensures valid inference via a robust sandwich estimator, for which a closed-form expression is provided. To enhance statistical efficiency, we propose a calibrated inverse-probability weighting estimator that makes use of information available at phase I but not used in the design. In addition to describing the technique, we provide practical guidance for the cluster-correlated data settings that we consider. A comprehensive simulation study is conducted to evaluate small-sample operating characteristics, including the impact of using naïve methods that ignore correlation due to clustering, as well as to investigate design considerations. Finally, the methods are illustrated using data from a one-time survey of the national antiretroviral treatment program in Malawi.
Affiliation(s)
- D. Spiegelman
- Center on Methods for Implementation and Dissemination Science, Department of Biostatistics, Yale University School of Public Health, CT, USA
- Department of Epidemiology, Harvard School of Public Health, MA, USA
- Department of Biostatistics, Harvard School of Public Health, MA, USA
- S. Haneuse
- Department of Biostatistics, Harvard School of Public Health, MA, USA
|
14
|
Zelnick LR, Schildcrout JS, Heagerty PJ. Likelihood-based analysis of outcome-dependent sampling designs with longitudinal data. Stat Med 2018. [PMID: 29542170 DOI: 10.1002/sim.7633] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Abstract
The use of outcome-dependent sampling with longitudinal data analysis has previously been shown to improve efficiency in the estimation of regression parameters. The motivating scenario is when outcome data exist for all cohort members but key exposure variables will be gathered only on a subset. Inference with outcome-dependent sampling designs that also incorporates incomplete information from those individuals who did not have their exposure ascertained has been investigated for univariate but not longitudinal outcomes. Therefore, with a continuous longitudinal outcome, we explore the relative contributions of various sources of information toward the estimation of key regression parameters using a likelihood framework. We evaluate the efficiency gains that alternative estimators might offer over random sampling, and we offer insight into their relative merits in select practical scenarios. Finally, we illustrate the potential impact of design and analysis choices using data from the Cystic Fibrosis Foundation Patient Registry.
Affiliation(s)
- Leila R Zelnick
- Department of Medicine, University of Washington, Seattle, WA 98195, USA
- Patrick J Heagerty
- Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
|
15
|
Abstract
In resource-limited settings, long-term evaluation of national antiretroviral treatment (ART) programs often relies on aggregated data, the analysis of which may be subject to ecological bias. As researchers and policy makers consider evaluating individual-level outcomes such as treatment adherence or mortality, the well-known case-control design is appealing in that it provides efficiency gains over random sampling. In the context that motivates this article, valid estimation and inference require acknowledging any clustering, although, to our knowledge, no statistical methods have been published for the analysis of case-control data for which the underlying population exhibits clustering. Furthermore, in the specific context of an ongoing collaboration in Malawi, rather than performing case-control sampling across all clinics, case-control sampling within clinics has been suggested as a more practical strategy. To our knowledge, although similar outcome-dependent sampling schemes have been described in the literature, a case-control design specific to correlated data settings is new. In this article, we describe this design, discuss balanced versus unbalanced sampling techniques, and provide a general approach to analyzing case-control studies in cluster-correlated settings based on inverse probability-weighted generalized estimating equations. Inference is based on a robust sandwich estimator with correlation parameters estimated to ensure appropriate accounting of the outcome-dependent sampling scheme. We conduct comprehensive simulations, based in part on real data on a sample of N = 78,155 program registrants in Malawi between 2005 and 2007, to evaluate small-sample operating characteristics and potential trade-offs associated with standard case-control sampling or when case-control sampling is performed within clusters.
Affiliation(s)
- Claudia Rivera
- Harvard T.H. Chan School of Public Health, Boston, Massachusetts
|
16
|
Sun Z, Mukherjee B, Estes JP, Vokonas PS, Park SK. Exposure enriched outcome dependent designs for longitudinal studies of gene-environment interaction. Stat Med 2017; 36:2947-2960. [PMID: 28497531 DOI: 10.1002/sim.7332] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 05/06/2016] [Revised: 03/08/2017] [Accepted: 04/20/2017] [Indexed: 12/15/2022]
Abstract
Joint effects of genetic and environmental factors have been increasingly recognized in the development of many complex human diseases. Despite the popularity of case-control and case-only designs, longitudinal cohort studies that can capture time-varying outcome and exposure information have long been recommended for gene-environment (G × E) interactions. To date, literature on sampling designs for longitudinal studies of G × E interaction is quite limited. We therefore consider designs that can prioritize a subsample of the existing cohort for retrospective genotyping on the basis of currently available outcome, exposure, and covariate data. In this work, we propose stratified sampling based on summaries of individual exposures and outcome trajectories and develop a full conditional likelihood approach for estimation that adjusts for the biased sample. We compare the performance of our proposed design and analysis with combinations of different sampling designs and estimation approaches via simulation. We observe that the full conditional likelihood provides improved estimates for the G × E interaction and joint exposure effects over uncorrected complete-case analysis, and the exposure enriched outcome trajectory dependent design outperforms other designs in terms of estimation efficiency and power for detection of the G × E interaction. We also illustrate our design and analysis using data from the Normative Aging Study, an ongoing longitudinal cohort study initiated by the Veterans Administration in 1963. Copyright © 2017 John Wiley & Sons, Ltd.
Affiliation(s)
- Zhichao Sun
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, U.S.A.
- Bhramar Mukherjee
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, U.S.A.; Department of Epidemiology, University of Michigan, Ann Arbor, MI 48109, U.S.A.
- Jason P Estes
- Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, U.S.A.
- Pantel S Vokonas
- Veterans Affairs Normative Aging Study, VA Boston Healthcare System, Department of Medicine, Boston University School of Medicine, Boston, MA 02118, U.S.A.
- Sung Kyun Park
- Department of Epidemiology, University of Michigan, Ann Arbor, MI 48109, U.S.A.; Department of Environmental Health Sciences, University of Michigan, Ann Arbor, MI 48109, U.S.A.
|
17
|
Schildcrout JS, Rathouz PJ, Zelnick LR, Garbett SP, Heagerty PJ. Biased sampling designs to improve research efficiency: factors influencing pulmonary function over time in children with asthma. Ann Appl Stat 2015; 9:731-753. [PMID: 26322147 PMCID: PMC4551501 DOI: 10.1214/15-aoas826] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Indexed: 11/19/2022]
Abstract
Substudies of the Childhood Asthma Management Program (CAMP Research Group, 1999, 2000) seek to identify patient characteristics associated with asthma symptoms and lung function. To determine if genetic measures are associated with trajectories of lung function as measured by forced vital capacity (FVC), children in the primary cohort study retrospectively had candidate loci evaluated. Given participant burden and constraints on financial resources, it is often desirable to target a sub-sample for ascertainment of costly measures. Methods that can leverage the longitudinal outcome on the full cohort to selectively measure informative individuals have been promising, but have been restricted in their use to analysis of the targeted sub-sample. In this paper we detail two multiple imputation analysis strategies that exploit outcome and partially observed covariate data on the non-sampled subjects, and we characterize alternative design and analysis combinations that could be used for future studies of pulmonary function and other outcomes. Candidate predictor (e.g. IL10 cytokine polymorphisms) associations obtained from targeted sampling designs can be estimated with very high efficiency compared to standard designs. Further, even though multiple imputation can dramatically improve estimation efficiency for covariates available on all subjects (e.g., gender and baseline age), only modest efficiency gains were observed in parameters associated with predictors that are exclusive to the targeted sample. Our results suggest that future studies of longitudinal trajectories can be efficiently conducted by use of outcome-dependent designs and associated full cohort analysis.
Affiliation(s)
- Paul J Rathouz
- Department of Biostatistics and Medical Informatics, University of Wisconsin School of Medicine and Public Health
- Leila R Zelnick
- Department of Biostatistics, University of Washington School of Public Health
- Shawn P Garbett
- Division of Cancer Biology, Vanderbilt University School of Medicine
- Patrick J Heagerty
- Department of Biostatistics, University of Washington School of Public Health
|