1
Kang K, Seidlitz J, Bethlehem RAI, Xiong J, Jones MT, Mehta K, Keller AS, Tao R, Randolph A, Larsen B, Tervo-Clemmens B, Feczko E, Dominguez OM, Nelson SM, Schildcrout J, Fair DA, Satterthwaite TD, Alexander-Bloch A, Vandekar S. Study design features increase replicability in brain-wide association studies. Nature 2024; 636:719-727. [PMID: 39604734 PMCID: PMC11655360 DOI: 10.1038/s41586-024-08260-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2023] [Accepted: 10/21/2024] [Indexed: 11/29/2024]
Abstract
Brain-wide association studies (BWAS) are a fundamental tool in discovering brain-behaviour associations [1,2]. Several recent studies have shown that thousands of study participants are required for good replicability of BWAS [1-3]. Here we performed analyses and meta-analyses of a robust effect size index using 63 longitudinal and cross-sectional MRI studies from the Lifespan Brain Chart Consortium [4] (77,695 total scans) to demonstrate that optimizing study design is critical for increasing standardized effect sizes and replicability in BWAS. A meta-analysis of brain volume associations with age indicates that BWAS with larger variability of the covariate and longitudinal studies have larger reported standardized effect size. Analysing age effects on global and regional brain measures from the UK Biobank and the Alzheimer's Disease Neuroimaging Initiative, we showed that modifying study design through sampling schemes improves standardized effect sizes and replicability. To ensure that our results are generalizable, we further evaluated the longitudinal sampling schemes on cognitive, psychopathology and demographic associations with structural and functional brain outcome measures in the Adolescent Brain and Cognitive Development dataset. We demonstrated that commonly used longitudinal models, which assume equal between-subject and within-subject changes, can, counterintuitively, reduce standardized effect sizes and replicability. Explicitly modelling the between-subject and within-subject effects avoids conflating them and enables optimizing the standardized effect sizes for each separately. Together, these results provide guidance for study designs that improve the replicability of BWAS.
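As a reader's note on the abstract's last point, the idea of modelling between-subject and within-subject effects separately can be sketched as follows. This is a minimal illustration, not the authors' analysis: the covariate is split into each subject's mean (between part) and the deviation from that mean (within part), and both enter an ordinary least-squares fit. The function names, the two-visit design, the simulated slopes (-1.0 and -3.0) and the noise levels are all assumptions made up for the example.

```python
import random
from statistics import mean


def split_between_within(subject_ids, x):
    """Split x into a subject-mean (between) part and a deviation (within) part."""
    by_subject = {}
    for sid, value in zip(subject_ids, x):
        by_subject.setdefault(sid, []).append(value)
    subject_mean = {sid: mean(vals) for sid, vals in by_subject.items()}
    between = [subject_mean[sid] for sid in subject_ids]
    within = [value - b for value, b in zip(x, between)]
    return between, within


def fit_two_slopes(y, x1, x2):
    """Ordinary least-squares slopes for y ~ 1 + x1 + x2 (centered normal equations)."""
    def centered(v):
        m = mean(v)
        return [u - m for u in v]

    y_c, a, b = centered(y), centered(x1), centered(x2)
    s_aa = sum(u * u for u in a)
    s_bb = sum(u * u for u in b)
    s_ab = sum(u * v for u, v in zip(a, b))
    s_ay = sum(u * v for u, v in zip(a, y_c))
    s_by = sum(u * v for u, v in zip(b, y_c))
    det = s_aa * s_bb - s_ab ** 2
    return (s_bb * s_ay - s_ab * s_by) / det, (s_aa * s_by - s_ab * s_ay) / det


def simulate_and_fit(seed=0, n_subjects=200, beta_between=-1.0, beta_within=-3.0):
    """Simulate two visits per subject with different between/within age slopes."""
    rng = random.Random(seed)
    ids, age, y = [], [], []
    for sid in range(n_subjects):
        baseline = rng.uniform(50, 80)       # hypothetical baseline age
        intercept = rng.gauss(0, 2)          # subject-specific random intercept
        visits = [baseline, baseline + 2.0]  # follow-up two years later
        centre = mean(visits)
        for a in visits:
            ids.append(sid)
            age.append(a)
            y.append(intercept + beta_between * centre
                     + beta_within * (a - centre) + rng.gauss(0, 1))
    between, within = split_between_within(ids, age)
    return fit_two_slopes(y, between, within)
```

Calling `simulate_and_fit()` returns the estimated between-subject and within-subject slopes as a pair. A model with a single age term would instead estimate one blend of the two, which is the conflation the abstract warns about.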
Affiliation(s)
- Kaidi Kang
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
- Jakob Seidlitz
  - Department of Child and Adolescent Psychiatry and Behavioral Sciences, The Children's Hospital of Philadelphia, Philadelphia, PA, USA
  - Department of Psychiatry, University of Pennsylvania, Philadelphia, PA, USA
  - Lifespan Brain Institute of The Children's Hospital of Philadelphia and Penn Medicine, Philadelphia, PA, USA
- Jiangmei Xiong
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
- Megan T Jones
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
- Kahini Mehta
  - Department of Psychiatry, University of Pennsylvania, Philadelphia, PA, USA
  - Lifespan Brain Institute of The Children's Hospital of Philadelphia and Penn Medicine, Philadelphia, PA, USA
  - Penn Lifespan Informatics and Neuroimaging Center (PennLINC), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Arielle S Keller
  - Department of Psychological Sciences, University of Connecticut, Mansfield, CT, USA
  - Institute for the Brain and Cognitive Sciences, University of Connecticut, Mansfield, CT, USA
- Ran Tao
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
  - Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA
- Anita Randolph
  - Department of Pediatrics, University of Minnesota Medical School, Minneapolis, MN, USA
  - Masonic Institute for the Developing Brain, University of Minnesota, Minneapolis, MN, USA
- Bart Larsen
  - Department of Pediatrics, University of Minnesota Medical School, Minneapolis, MN, USA
  - Masonic Institute for the Developing Brain, University of Minnesota, Minneapolis, MN, USA
- Brenden Tervo-Clemmens
  - Masonic Institute for the Developing Brain, University of Minnesota, Minneapolis, MN, USA
  - Department of Psychiatry and Behavioral Sciences, University of Minnesota Medical School, Minneapolis, MN, USA
- Eric Feczko
  - Department of Pediatrics, University of Minnesota Medical School, Minneapolis, MN, USA
  - Masonic Institute for the Developing Brain, University of Minnesota, Minneapolis, MN, USA
- Oscar Miranda Dominguez
  - Department of Pediatrics, University of Minnesota Medical School, Minneapolis, MN, USA
  - Masonic Institute for the Developing Brain, University of Minnesota, Minneapolis, MN, USA
- Steven M Nelson
  - Department of Pediatrics, University of Minnesota Medical School, Minneapolis, MN, USA
  - Masonic Institute for the Developing Brain, University of Minnesota, Minneapolis, MN, USA
- Jonathan Schildcrout
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
- Damien A Fair
  - Department of Pediatrics, University of Minnesota Medical School, Minneapolis, MN, USA
  - Masonic Institute for the Developing Brain, University of Minnesota, Minneapolis, MN, USA
  - Institute of Child Development, University of Minnesota, Minneapolis, MN, USA
- Theodore D Satterthwaite
  - Department of Psychiatry, University of Pennsylvania, Philadelphia, PA, USA
  - Lifespan Brain Institute of The Children's Hospital of Philadelphia and Penn Medicine, Philadelphia, PA, USA
  - Penn Lifespan Informatics and Neuroimaging Center (PennLINC), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Aaron Alexander-Bloch
  - Department of Child and Adolescent Psychiatry and Behavioral Sciences, The Children's Hospital of Philadelphia, Philadelphia, PA, USA
  - Department of Psychiatry, University of Pennsylvania, Philadelphia, PA, USA
  - Lifespan Brain Institute of The Children's Hospital of Philadelphia and Penn Medicine, Philadelphia, PA, USA
- Simon Vandekar
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
2
Hasler J, Ma Y, Wei Y, Parikh R, Chen J. A semiparametric method for risk prediction using integrated electronic health record data. Ann Appl Stat 2024; 18:3318-3337. [PMID: 40134753 PMCID: PMC11934126 DOI: 10.1214/24-aoas1938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/27/2025]
Abstract
When using electronic health records (EHRs) for clinical and translational research, additional data is often available from external sources to enrich the information extracted from EHRs. For example, academic biobanks have more granular data available, and patient reported data is often collected through small-scale surveys. It is common that the external data is available only for a small subset of patients who have EHR information. We propose efficient and robust methods for building and evaluating models for predicting the risk of binary outcomes using such integrated EHR data. Our method is built upon an idea derived from the two-phase design literature that modeling the availability of a patient's external data as a function of an EHR-based preliminary predictive score leads to effective utilization of the EHR data. Through both theoretical and simulation studies, we show that our method has high efficiency for estimating log-odds ratio parameters, the area under the ROC curve, as well as other measures for quantifying predictive accuracy. We apply our method to develop a model for predicting the short-term mortality risk of oncology patients, where the data was extracted from the University of Pennsylvania hospital system EHR and combined with survey-based patient reported outcome data.
Affiliation(s)
- Yanyuan Ma
  - Department of Statistics, Pennsylvania State University
- Yizheng Wei
  - Department of Statistics, University of South Carolina
- Ravi Parikh
  - Departments of Medicine and Health Policy and Medicine, University of Pennsylvania
- Jinbo Chen
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania
3
Kang K, Seidlitz J, Bethlehem RA, Xiong J, Jones MT, Mehta K, Keller AS, Tao R, Randolph A, Larsen B, Tervo-Clemmens B, Feczko E, Miranda Dominguez O, Nelson S, Lifespan Brain Chart Consortium, 3R-BRAIN, AIBL, Alzheimer’s Disease Neuroimaging Initiative, Alzheimer’s Disease Repository Without Borders Investigators, CALM Team, CCNP, COBRE, cVEDA, Harvard Aging Brain Study, IMAGEN, POND, The PREVENT-AD Research Group, Schildcrout J, Fair D, Satterthwaite TD, Alexander-Bloch A, Vandekar S. Study design features increase replicability in cross-sectional and longitudinal brain-wide association studies. bioRxiv 2024:2023.05.29.542742. [PMID: 37398345 PMCID: PMC10312450 DOI: 10.1101/2023.05.29.542742] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 07/04/2023]
Abstract
Brain-wide association studies (BWAS) are a fundamental tool in discovering brain-behavior associations. Several recent studies showed that thousands of study participants are required for good replicability of BWAS because the standardized effect sizes (ESs) are much smaller than the reported standardized ESs in smaller studies. Here, we perform analyses and meta-analyses of a robust effect size index using 63 longitudinal and cross-sectional magnetic resonance imaging studies from the Lifespan Brain Chart Consortium (77,695 total scans) to demonstrate that optimizing study design is critical for increasing standardized ESs and replicability in BWAS. A meta-analysis of brain volume associations with age indicates that BWAS with larger variability in covariate have larger reported standardized ES. In addition, the longitudinal studies we examined reported systematically larger standardized ES than cross-sectional studies. Analyzing age effects on global and regional brain measures from the United Kingdom Biobank and the Alzheimer's Disease Neuroimaging Initiative, we show that modifying longitudinal study design through sampling schemes improves the standardized ESs and replicability. Sampling schemes that improve standardized ESs and replicability include increasing between-subject age variability in the sample and adding a single additional longitudinal measurement per subject. To ensure that our results are generalizable, we further evaluate these longitudinal sampling schemes on cognitive, psychopathology, and demographic associations with structural and functional brain outcome measures in the Adolescent Brain and Cognitive Development dataset. We demonstrate that commonly used longitudinal models can, counterintuitively, reduce standardized ESs and replicability. The benefit of conducting longitudinal studies depends on the strengths of the between- versus within-subject associations of the brain and non-brain measures. Explicitly modeling between- versus within-subject effects avoids averaging the effects and allows optimizing the standardized ESs for each separately. Together, these results provide guidance for study designs that improve the replicability of BWAS.
Affiliation(s)
- Kaidi Kang
  - Department of Biostatistics, Vanderbilt University Medical Center
- Jakob Seidlitz
  - Department of Child and Adolescent Psychiatry and Behavioral Sciences, The Children’s Hospital of Philadelphia
  - Department of Psychiatry, University of Pennsylvania
  - Lifespan Brain Institute of The Children’s Hospital of Philadelphia and Penn Medicine
- Jiangmei Xiong
  - Department of Biostatistics, Vanderbilt University Medical Center
- Megan T. Jones
  - Department of Biostatistics, Vanderbilt University Medical Center
- Kahini Mehta
  - Department of Psychiatry, University of Pennsylvania
  - Lifespan Brain Institute of The Children’s Hospital of Philadelphia and Penn Medicine
  - Penn Lifespan Informatics and Neuroimaging Center (PennLINC), Perelman School of Medicine, University of Pennsylvania
- Arielle S. Keller
  - Department of Psychiatry, University of Pennsylvania
  - Lifespan Brain Institute of The Children’s Hospital of Philadelphia and Penn Medicine
  - Penn Lifespan Informatics and Neuroimaging Center (PennLINC), Perelman School of Medicine, University of Pennsylvania
- Ran Tao
  - Department of Biostatistics, Vanderbilt University Medical Center
- Anita Randolph
  - Department of Pediatrics, University of Minnesota Medical School
- Bart Larsen
  - Department of Pediatrics, University of Minnesota Medical School
- Brenden Tervo-Clemmens
  - Department of Psychiatry & Behavioral Sciences, University of Minnesota Medical School
- Eric Feczko
  - Department of Pediatrics, University of Minnesota Medical School
- Steve Nelson
  - Department of Pediatrics, University of Minnesota Medical School
- Damien Fair
  - Department of Pediatrics, University of Minnesota Medical School
- Theodore D. Satterthwaite
  - Department of Psychiatry, University of Pennsylvania
  - Lifespan Brain Institute of The Children’s Hospital of Philadelphia and Penn Medicine
  - Penn Lifespan Informatics and Neuroimaging Center (PennLINC), Perelman School of Medicine, University of Pennsylvania
- Aaron Alexander-Bloch
  - Department of Child and Adolescent Psychiatry and Behavioral Sciences, The Children’s Hospital of Philadelphia
  - Department of Psychiatry, University of Pennsylvania
  - Lifespan Brain Institute of The Children’s Hospital of Philadelphia and Penn Medicine
- Simon Vandekar
  - Department of Biostatistics, Vanderbilt University Medical Center
4
Di Gravio C, Schildcrout JS, Tao R. Efficient designs and analysis of two-phase studies with longitudinal binary data. Biometrics 2024; 80:ujad010. [PMID: 38364804 PMCID: PMC10871867 DOI: 10.1093/biomtc/ujad010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2023] [Revised: 08/23/2023] [Accepted: 11/09/2023] [Indexed: 02/18/2024]
Abstract
Researchers interested in understanding the relationship between a readily available longitudinal binary outcome and a novel biomarker exposure can be confronted with ascertainment costs that limit sample size. In such settings, two-phase studies can be cost-effective solutions that allow researchers to target informative individuals for exposure ascertainment and increase estimation precision for time-varying and/or time-fixed exposure coefficients. In this paper, we introduce a novel class of residual-dependent sampling (RDS) designs that select informative individuals using data available on the longitudinal outcome and inexpensive covariates. Together with the RDS designs, we propose a semiparametric analysis approach that efficiently uses all data to estimate the parameters. We describe a numerically stable and computationally efficient EM algorithm to maximize the semiparametric likelihood. We examine the finite sample operating characteristics of the proposed approaches through extensive simulation studies, and compare the efficiency of our designs and analysis approach with existing ones. We illustrate the usefulness of the proposed RDS designs and analysis method in practice by studying the association between a genetic marker and poor lung function among patients enrolled in the Lung Health Study (Connett et al, 1993).
Affiliation(s)
- Chiara Di Gravio
  - Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, London, SW7 2AZ, United Kingdom
- Jonathan S Schildcrout
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Ran Tao
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37232, USA
  - Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN 37232, USA
5
Lee M, Chen J, Zeleniuch-Jacquotte A, Liu M. Goodness-of-fit two-phase sampling designs for time-to-event outcomes: a simulation study based on New York University Women's Health Study for breast cancer. BMC Med Res Methodol 2023; 23:119. [PMID: 37208600 DOI: 10.1186/s12874-023-01950-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2022] [Accepted: 05/11/2023] [Indexed: 05/21/2023]
Abstract
BACKGROUND: Sub-cohort sampling designs such as a case-cohort study play a key role in studying biomarker-disease associations due to their cost effectiveness. Time-to-event outcome is often the focus in cohort studies, and the research goal is to assess the association between the event risk and risk factors. In this paper, we propose a novel goodness-of-fit two-phase sampling design for time-to-event outcomes when some covariates (e.g., biomarkers) can only be measured on a subgroup of study subjects.
METHODS: Assuming that an external model, which can be the well-established risk models such as the Gail model for breast cancer, Gleason score for prostate cancer, and Framingham risk models for heart diseases, or built from preliminary data, is available to relate the outcome and complete covariates, we propose to oversample subjects with worse goodness-of-fit (GOF) based on an external survival model and time-to-event. With the cases and controls sampled using the GOF two-phase design, the inverse sampling probability weighting method is used to estimate the log hazard ratio of both incomplete and complete covariates. We conducted extensive simulations to evaluate the efficiency gain of our proposed GOF two-phase sampling designs over case-cohort study designs.
RESULTS: Through extensive simulations based on a dataset from the New York University Women's Health Study, we showed that the proposed GOF two-phase sampling designs were unbiased and generally had higher efficiency compared to the standard case-cohort study designs.
CONCLUSION: In cohort studies with rare outcomes, an important design question is how to select informative subjects to reduce sampling costs while maintaining statistical efficiency. Our proposed goodness-of-fit two-phase design provides efficient alternatives to standard case-cohort designs for assessing the association between time-to-event outcome and risk factors. This method is conveniently implemented in standard software.
Affiliation(s)
- Myeonggyun Lee
  - Department of Population Health, New York University Grossman School of Medicine, New York, NY, 10016, USA
- Jinbo Chen
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Anne Zeleniuch-Jacquotte
  - Department of Population Health, New York University Grossman School of Medicine, New York, NY, 10016, USA
  - Department of Environmental Medicine, New York University Grossman School of Medicine, New York, NY, 10016, USA
- Mengling Liu
  - Department of Population Health, New York University Grossman School of Medicine, New York, NY, 10016, USA
  - Department of Environmental Medicine, New York University Grossman School of Medicine, New York, NY, 10016, USA
6
Maronge JM, Schildcrout JS, Rathouz PJ. Model misspecification and robust analysis for outcome-dependent sampling designs under generalized linear models. Stat Med 2023; 42:1338-1352. [PMID: 36757145 PMCID: PMC10883476 DOI: 10.1002/sim.9673] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2022] [Revised: 12/19/2022] [Accepted: 01/13/2023] [Indexed: 02/10/2023]
Abstract
Outcome-dependent sampling (ODS) is a commonly used class of sampling designs to increase estimation efficiency in settings where response information (and possibly adjuster covariates) is available, but the exposure is expensive and/or cumbersome to collect. We focus on ODS within the context of a two-phase study, where in Phase One the response and adjuster covariate information is collected on a large cohort that is representative of the target population, but the expensive exposure variable is not yet measured. In Phase Two, using response information from Phase One, we selectively oversample a subset of informative subjects in whom we collect expensive exposure information. Importantly, the Phase Two sample is no longer representative, and we must use ascertainment-correcting analysis procedures for valid inferences. In this paper, we focus on likelihood-based analysis procedures, particularly a conditional-likelihood approach and a full-likelihood approach. Whereas the full-likelihood retains incomplete Phase One data for subjects not selected into Phase Two, the conditional-likelihood explicitly conditions on Phase Two sample selection (ie, it is a "complete case" analysis procedure). These designs and analysis procedures are typically implemented assuming a known, parametric model for the response distribution. However, in this paper, we approach analyses implementing a novel semi-parametric extension to generalized linear models (SPGLM) to develop likelihood-based procedures with improved robustness to misspecification of distributional assumptions. We specifically focus on the common setting where standard GLM distributional assumptions are not satisfied (eg, misspecified mean/variance relationship). We aim to provide practical design guidance and flexible tools for practitioners in these settings.
Affiliation(s)
- Jacob M. Maronge
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, TX, USA
- Paul J. Rathouz
  - Department of Population Health, Dell Medical School at the University of Texas at Austin, TX, USA
7
Ryan B, Nirmalkanna A, Cigsar C, Yilmaz YE. Evaluation of Designs and Estimation Methods Under Response-Dependent Two-Phase Sampling for Genetic Association Studies. Statistics in Biosciences 2023. [DOI: 10.1007/s12561-023-09369-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/05/2023]
8
Maronge JM, Tao R, Schildcrout JS, Rathouz PJ. Generalized case-control sampling under generalized linear models. Biometrics 2023; 79:332-343. [PMID: 34586638 PMCID: PMC9358725 DOI: 10.1111/biom.13571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2020] [Revised: 08/17/2021] [Accepted: 09/14/2021] [Indexed: 12/01/2022]
Abstract
A generalized case-control (GCC) study, like the standard case-control study, leverages outcome-dependent sampling (ODS) to extend to nonbinary responses. We develop a novel, unifying approach for analyzing GCC study data using the recently developed semiparametric extension of the generalized linear model (GLM), which is substantially more robust to model misspecification than existing approaches based on parametric GLMs. For valid estimation and inference, we use a conditional likelihood to account for the biased sampling design. We describe analysis procedures for estimation and inference for the semiparametric GLM under a conditional likelihood, and we discuss problems with estimation and inference under a conditional likelihood when the response distribution is misspecified. We demonstrate the flexibility of our approach over existing ones through extensive simulation studies, and we apply the methodology to an analysis of the Asset and Health Dynamics Among the Oldest Old study, which motivates our research. The proposed approach yields a simple yet versatile solution for handling ODS in a wide variety of possible response distributions and sampling schemes encountered in practice.
Affiliation(s)
- Jacob M. Maronge
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
- Ran Tao
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
  - Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Jonathan S. Schildcrout
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Paul J. Rathouz
  - Department of Population Health, Dell Medical School at the University of Texas at Austin, Austin, Texas, USA
9
Che M, Han P, Lawless JF. Improving estimation efficiency for two-phase, outcome-dependent sampling studies. Electron J Stat 2023. [DOI: 10.1214/23-ejs2124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/08/2023]
Affiliation(s)
- Menglu Che
  - Department of Biostatistics, School of Public Health, Yale University
- Peisong Han
  - Department of Biostatistics, School of Public Health, University of Michigan
- Jerald F. Lawless
  - Department of Statistics and Actuarial Science, University of Waterloo
10
Lotspeich SC, Shepherd BE, Amorim GGC, Shaw PA, Tao R. Efficient odds ratio estimation under two-phase sampling using error-prone data from a multi-national HIV research cohort. Biometrics 2022; 78:1674-1685. [PMID: 34213008 PMCID: PMC8720323 DOI: 10.1111/biom.13512] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/08/2020] [Revised: 05/19/2021] [Accepted: 06/17/2021] [Indexed: 12/30/2022]
Abstract
Persons living with HIV engage in routine clinical care, generating large amounts of data in observational HIV cohorts. These data are often error-prone, and directly using them in biomedical research could bias estimation and give misleading results. A cost-effective solution is the two-phase design, under which the error-prone variables are observed for all patients during Phase I, and that information is used to select patients for data auditing during Phase II. For example, the Caribbean, Central, and South America network for HIV epidemiology (CCASAnet) selected a random sample from each site for data auditing. Herein, we consider efficient odds ratio estimation with partially audited, error-prone data. We propose a semiparametric approach that uses all information from both phases and accommodates a number of error mechanisms. We allow both the outcome and covariates to be error-prone and these errors to be correlated, and selection of the Phase II sample can depend on Phase I data in an arbitrary manner. We devise a computationally efficient, numerically stable EM algorithm to obtain estimators that are consistent, asymptotically normal, and asymptotically efficient. We demonstrate the advantages of the proposed methods over existing ones through extensive simulations. Finally, we provide applications to the CCASAnet cohort.
Affiliation(s)
- Sarah C. Lotspeich
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, U.S.A
- Bryan E. Shepherd
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, U.S.A
- Gustavo G. C. Amorim
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, U.S.A
- Pamela A. Shaw
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, U.S.A
- Ran Tao
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, U.S.A
  - Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, U.S.A
11
Chen T, Lumley T. Optimal sampling for design-based estimators of regression models. Stat Med 2022; 41:1482-1497. [PMID: 34989429 PMCID: PMC8918008 DOI: 10.1002/sim.9300] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/15/2021] [Revised: 12/02/2021] [Accepted: 12/10/2021] [Indexed: 11/05/2022]
Abstract
Two-phase designs measure variables of interest on a subcohort where the outcome and covariates are readily available or cheap to collect on all individuals in the cohort. Given limited resource availability, it is of interest to find an optimal design that includes more informative individuals in the final sample. We explore the optimal designs and efficiencies for analyses by design-based estimators. Generalized raking is an efficient class of design-based estimators, and they improve on the inverse-probability weighted (IPW) estimator by adjusting weights based on the auxiliary information. We derive a closed-form solution of the optimal design for estimating regression coefficients from generalized raking estimators. We compare it with the optimal design for analysis via the IPW estimator and other two-phase designs in measurement-error settings. We consider general two-phase designs where the outcome variable and variables of interest can be continuous or discrete. Our results show that the optimal designs for analyses by the two classes of design-based estimators can be very different. The optimal design for analysis via the IPW estimator is optimal for IPW estimation and typically gives near-optimal efficiency for generalized raking estimation, though we show there is potential improvement in some settings.
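For readers unfamiliar with the inverse-probability weighted (IPW) estimator that this abstract uses as its baseline, a minimal sketch is given here. It shows only plain IPW (each phase-two subject weighted by the reciprocal of their sampling probability), not generalized raking and not the optimal design derived in the paper. The function names and the single-covariate setting are assumptions chosen for illustration.

```python
def ipw_mean(values, probs):
    """Weighted mean using weights 1 / sampling probability (Hajek form)."""
    weights = [1.0 / p for p in probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def ipw_slope(x, y, probs):
    """Weighted least-squares slope of y on x with inverse-probability weights."""
    weights = [1.0 / p for p in probs]
    total = sum(weights)
    x_bar = sum(w * u for w, u in zip(weights, x)) / total
    y_bar = sum(w * v for w, v in zip(weights, y)) / total
    num = sum(w * (u - x_bar) * (v - y_bar) for w, u, v in zip(weights, x, y))
    den = sum(w * (u - x_bar) ** 2 for w, u in zip(weights, x))
    return num / den
```

Here `x`, `y` and `probs` hold the phase-two subjects only; subjects sampled with low probability receive large weights so that the weighted sample again represents the phase-one cohort.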
Affiliation(s)
- Tong Chen
  - Department of Statistics, University of Auckland, Auckland, New Zealand
- Thomas Lumley
  - Department of Statistics, University of Auckland, Auckland, New Zealand
12
Di Gravio C, Tao R, Schildcrout JS. Design and analysis of two-phase studies with multivariate longitudinal data. Biometrics 2022. [PMID: 35014029 DOI: 10.1111/biom.13616] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/03/2021] [Revised: 11/03/2021] [Accepted: 12/10/2021] [Indexed: 11/27/2022]
Abstract
Two-phase studies are crucial when outcome and covariate data are available in a first phase sample (e.g., a cohort study), but costs associated with retrospective ascertainment of a novel exposure limit the size of the second phase sample, in whom the exposure is collected. For longitudinal outcomes, one class of two-phase studies stratifies subjects based on an outcome vector summary (e.g., an average or a slope over time) and oversamples subjects in the extreme value strata while undersampling subjects in the medium value stratum. Based on the choice of the summary, two-phase studies for longitudinal data can increase efficiency of time-varying and/or time-fixed exposure parameter estimates. In this manuscript, we extend efficient, two-phase study designs to multivariate longitudinal continuous outcomes, and we detail two analysis approaches. The first approach is a multiple imputation analysis that combines complete data from subjects selected for phase two with the incomplete data from those not selected. The second approach is a conditional maximum likelihood analysis that is intended for applications where only data from subjects selected for phase two are available. Importantly, we show that both approaches can be applied to secondary analyses of previously conducted two-phase studies. We examine finite sample operating characteristics of the two approaches and use the Lung Health Study (Connett et al., 1993) to examine genetic associations with lung function decline over time.
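The sampling scheme described in this abstract (summarise each subject's outcome trajectory, stratify on the summary, then oversample the extreme strata) can be sketched roughly as below. This is an illustrative univariate sketch under stated assumptions, not the authors' multivariate design: the per-subject summary is an ordinary least-squares slope, the cutoffs and per-stratum sample sizes are supplied by the caller, and all function names are hypothetical.

```python
import random


def ols_slope(times, values):
    """Least-squares slope of one subject's outcome over time."""
    t_bar = sum(times) / len(times)
    v_bar = sum(values) / len(values)
    num = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    den = sum((t - t_bar) ** 2 for t in times)
    return num / den


def sample_phase_two(summaries, cutoffs, n_per_stratum, rng):
    """Pick phase-two subjects by stratum; return {subject: sampling probability}.

    summaries: {subject_id: outcome summary, e.g. a slope}
    cutoffs: (low, high) boundaries defining low / mid / high strata
    n_per_stratum: (n_low, n_mid, n_high) target sample sizes
    """
    low, high = cutoffs
    strata = {"low": [], "mid": [], "high": []}
    for sid, s in summaries.items():
        key = "low" if s < low else "high" if s > high else "mid"
        strata[key].append(sid)
    selected = {}
    for key, n in zip(("low", "mid", "high"), n_per_stratum):
        members = strata[key]
        k = min(n, len(members))
        prob = k / len(members) if members else 0.0
        for sid in rng.sample(members, k):
            selected[sid] = prob  # kept so a later analysis can correct for the design
    return selected
```

A call such as `sample_phase_two(slopes, (low_cut, high_cut), (50, 20, 50), random.Random(1))` would take most of its sample from the two tails. The returned probabilities matter because the phase-two sample is no longer representative, so any analysis has to account for how subjects were selected.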
Affiliation(s)
- Chiara Di Gravio
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, 37232, U.S.A
| | - Ran Tao
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, 37232, U.S.A
- Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, Tennessee, 37232, U.S.A
| | - Jonathan S Schildcrout
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, 37232, U.S.A
| |
|
13
|
Cao Y, Haneuse S, Zheng Y, Chen J. Two-phase stratified sampling and analysis for predicting binary outcomes. Biostatistics 2021:6470040. [PMID: 34923588 DOI: 10.1093/biostatistics/kxab044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2021] [Revised: 11/03/2021] [Accepted: 11/22/2021] [Indexed: 11/13/2022] Open
Abstract
The two-phase study design is a cost-efficient sampling strategy when certain data elements are expensive and, thus, can only be collected on a sub-sample of subjects. To date, guidance on how best to allocate resources within the design has assumed that primary interest lies in estimating association parameters. When primary interest lies in the development and evaluation of a risk prediction tool, however, such guidance may, in fact, be detrimental. To resolve this, we propose a novel strategy for resource allocation based on oversampling cases and subjects who have more extreme risk estimates according to a preliminary model developed using fully observed predictors. Key to the proposed strategy is that it focuses on enhancing efficiency regarding estimation of measures of predictive accuracy, rather than on efficiency regarding association parameters, which is the standard paradigm. Towards valid estimation and inference for accuracy measures using the resultant data, we extend an existing semiparametric maximum likelihood method for estimating odds ratio association parameters to accommodate the biased sampling scheme and data incompleteness. Motivated by our sampling design, we additionally propose a general post-stratification scheme for analyzing general two-phase data for estimating predictive accuracy measures. Through theoretical calculations and simulation studies, we show that the proposed sampling strategy and post-stratification scheme achieve the promised efficiency improvement. Finally, we apply the proposed methods to develop and evaluate a preliminary model for predicting the risk of hospital readmission after cardiac surgery using data from the Pennsylvania Health Care Cost Containment Council.
Affiliation(s)
- Yaqi Cao
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Perelman School of Medicine, Philadelphia, PA 19104, USA and Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
| | - Sebastien Haneuse
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115, USA
| | - Yingye Zheng
- Department of Biostatistics, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Seattle, WA 98109, USA
| | - Jinbo Chen
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Perelman School of Medicine, Philadelphia, PA 19104, USA
| |
|
14
|
Amorim G, Tao R, Lotspeich S, Shaw PA, Lumley T, Shepherd BE. Two-Phase Sampling Designs for Data Validation in Settings with Covariate Measurement Error and Continuous Outcome. JOURNAL OF THE ROYAL STATISTICAL SOCIETY. SERIES A, (STATISTICS IN SOCIETY) 2021; 184:1368-1389. [PMID: 34975235 PMCID: PMC8715909 DOI: 10.1111/rssa.12689] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]
Abstract
Measurement errors are present in many data collection procedures and can harm analyses by biasing estimates. To correct for measurement error, researchers often validate a subsample of records and then incorporate the information learned from this validation sample into estimation. In practice, the validation sample is often selected using simple random sampling (SRS). However, SRS leads to inefficient estimates because it ignores information on the error-prone variables, which can be highly correlated to the unknown truth. Applying and extending ideas from the two-phase sampling literature, we propose optimal and nearly-optimal designs for selecting the validation sample in the classical measurement-error framework. We target designs to improve the efficiency of model-based and design-based estimators, and show how the resulting designs compare to each other. Our results suggest that sampling schemes that extract more information from the error-prone data are substantially more efficient than SRS, for both design- and model-based estimators. The optimal procedure, however, depends on the analysis method, and can differ substantially. This is supported by theory and simulations. We illustrate the various designs using data from an HIV cohort study.
Affiliation(s)
- Gustavo Amorim
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Ran Tao
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
- Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Sarah Lotspeich
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Pamela A. Shaw
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, PA, USA
| | - Thomas Lumley
- Department of Statistics, University of Auckland, Auckland, New Zealand
| | - Bryan E. Shepherd
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
| |
|
15
|
Le Guen Y, Belloy ME, Napolioni V, Eger SJ, Kennedy G, Tao R, He Z, Greicius MD. A novel age-informed approach for genetic association analysis in Alzheimer's disease. Alzheimers Res Ther 2021; 13:72. [PMID: 33794991 PMCID: PMC8017764 DOI: 10.1186/s13195-021-00808-5] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2021] [Accepted: 03/11/2021] [Indexed: 01/17/2023]
Abstract
BACKGROUND Many Alzheimer's disease (AD) genetic association studies disregard age or incorrectly account for it, hampering variant discovery. METHODS Using simulated data, we compared the statistical power of several models: logistic regression on AD diagnosis adjusted and not adjusted for age; linear regression on a score integrating case-control status and age; and multivariate Cox regression on age-at-onset. We applied these models to real exome-wide data of 11,127 sequenced individuals (54% cases) and replicated suggestive associations in 21,631 genotype-imputed individuals (51% cases). RESULTS Modeling variable AD risk across age results in 5-10% statistical power gain compared to logistic regression without age adjustment, while incorrect age adjustment leads to critical power loss. Applying our novel AD-age score and/or Cox regression, we discovered and replicated novel variants associated with AD on KIF21B, USH2A, RAB10, RIN3, and TAOK2 genes. CONCLUSION Our AD-age score provides a simple means for statistical power gain and is recommended for future AD studies.
Affiliation(s)
- Yann Le Guen
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94304, USA
| | - Michael E Belloy
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94304, USA
| | - Valerio Napolioni
- School of Biosciences and Veterinary Medicine, University of Camerino, 62032, Camerino, Italy
| | - Sarah J Eger
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94304, USA
| | - Gabriel Kennedy
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94304, USA
| | - Ran Tao
- Department of Biostatistics and Vanderbilt Genetic Institute, Vanderbilt University, Nashville, TN, 37203, USA
| | - Zihuai He
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94304, USA
- Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA, 94304, USA
| | - Michael D Greicius
- Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94304, USA
| |
|
16
|
Tao R, Lotspeich SC, Amorim G, Shaw PA, Shepherd BE. Efficient semiparametric inference for two-phase studies with outcome and covariate measurement errors. Stat Med 2021; 40:725-738. [PMID: 33145800 PMCID: PMC8214478 DOI: 10.1002/sim.8799] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2020] [Revised: 09/07/2020] [Accepted: 10/20/2020] [Indexed: 11/07/2022]
Abstract
In modern observational studies using electronic health records or other routinely collected data, both the outcome and covariates of interest can be error-prone and their errors often correlated. A cost-effective solution is the two-phase design, under which the error-prone outcome and covariates are observed for all subjects during the first phase and that information is used to select a validation subsample for accurate measurements of these variables in the second phase. Previous research on two-phase measurement error problems largely focused on scenarios where there are errors in covariates only or the validation sample is a simple random sample of study subjects. Herein, we propose a semiparametric approach to general two-phase measurement error problems with a quantitative outcome, allowing for correlated errors in the outcome and covariates and arbitrary second-phase selection. We devise a computationally efficient and numerically stable expectation-maximization algorithm to maximize the nonparametric likelihood function. The resulting estimators possess desired statistical properties. We demonstrate the superiority of the proposed methods over existing approaches through extensive simulation studies, and we illustrate their use in an observational HIV study.
Affiliation(s)
- Ran Tao
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee
- Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, Tennessee
| | - Sarah C. Lotspeich
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee
| | - Gustavo Amorim
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee
| | - Pamela A. Shaw
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania
| | - Bryan E. Shepherd
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee
| |
|
17
|
Tao R, Mercaldo ND, Haneuse S, Maronge JM, Rathouz PJ, Heagerty PJ, Schildcrout JS. Two-wave two-phase outcome-dependent sampling designs, with applications to longitudinal binary data. Stat Med 2021; 40:1863-1876. [PMID: 33442883 DOI: 10.1002/sim.8876] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2020] [Revised: 12/07/2020] [Accepted: 12/25/2020] [Indexed: 12/26/2022]
Abstract
Two-phase outcome-dependent sampling (ODS) designs are useful when resource constraints prohibit expensive exposure ascertainment on all study subjects. One class of ODS designs for longitudinal binary data stratifies subjects into three strata according to those who experience the event at none, some, or all follow-up times. For time-varying covariate effects, exclusively selecting subjects with response variation can yield highly efficient estimates. However, if interest lies in the association of a time-invariant covariate, or the joint associations of time-varying and time-invariant covariates with the outcome, then the optimal design is unknown. Therefore, we propose a class of two-wave two-phase ODS designs for longitudinal binary data. We split the second-phase sample selection into two waves, between which an interim design evaluation analysis is conducted. The interim design evaluation analysis uses first-wave data to conduct a simulation-based search for the optimal second-wave design that will improve the likelihood of study success. Although we focus on longitudinal binary response data, the proposed design is general and can be applied to other response distributions. We believe that the proposed designs can be useful in settings where (1) the expected second-phase sample size is fixed and one must tailor stratum-specific sampling probabilities to maximize estimation efficiency, or (2) relative sampling probabilities are fixed across sampling strata and one must tailor sample size to achieve a desired precision. We describe the class of designs, examine finite sampling operating characteristics, and apply the designs to an exemplar longitudinal cohort study, the Lung Health Study.
Affiliation(s)
- Ran Tao
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Nathaniel D Mercaldo
- Departments of Radiology and Neurology, Massachusetts General Hospital and Harvard University, Boston, Massachusetts, USA
| | - Sebastien Haneuse
- Department of Biostatistics, Harvard University, Boston, Massachusetts, USA
| | - Jacob M Maronge
- Department of Statistics, University of Wisconsin-Madison, Madison, Wisconsin, USA
| | - Paul J Rathouz
- Department of Population Health, University of Texas, Austin, Texas, USA
| | - Patrick J Heagerty
- Department of Biostatistics, University of Washington, Seattle, Washington, USA
| | - Jonathan S Schildcrout
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| |
|
18
|
Chen T, Lumley T. Optimal multiwave sampling for regression modeling in two-phase designs. Stat Med 2020; 39:4912-4921. [PMID: 33016376 PMCID: PMC7902311 DOI: 10.1002/sim.8760] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2020] [Revised: 08/27/2020] [Accepted: 09/08/2020] [Indexed: 11/09/2022]
Abstract
Two-phase designs involve measuring extra variables on a subset of the cohort where some variables are already measured. The goal of two-phase designs is to choose a subsample of individuals from the cohort and analyse that subsample efficiently. It is of interest to obtain an optimal design that gives the most efficient estimates of regression parameters. In this article, we propose a multiwave sampling design to approximate the optimal design for design-based estimators. Influence functions are used to compute the optimal sampling allocations. We propose to use informative priors on regression parameters to derive the wave-1 sampling probabilities because any prespecified sampling probabilities may be far from optimal and decrease the design efficiency. The posterior distributions of the regression parameters derived from the current wave will then be used as priors for the next wave. Generalized raking is used in the final statistical analysis. We show that a two-wave sampling with reasonable informative priors will end up with a highly efficient estimation for the parameter of interest and be close to the underlying optimal design.
Affiliation(s)
- Tong Chen
- Department of Statistics, University of Auckland, Auckland, New Zealand
| | - Thomas Lumley
- Department of Statistics, University of Auckland, Auckland, New Zealand
| |
|
19
|
Han K, Lumley T, Shepherd BE, Shaw PA. Two-phase analysis and study design for survival models with error-prone exposures. Stat Methods Med Res 2020; 30:962280220978500. [PMID: 33327876 PMCID: PMC8715910 DOI: 10.1177/0962280220978500] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/10/2023]
Abstract
Increasingly, medical research is dependent on data collected for non-research purposes, such as electronic health records data. Health records data and other large databases can be prone to measurement error in key exposures, and unadjusted analyses of error-prone data can bias study results. Validating a subset of records is a cost-effective way of gaining information on the error structure, which in turn can be used to adjust analyses for this error and improve inference. We extend the mean score method for the two-phase analysis of discrete-time survival models, which uses the unvalidated covariates as auxiliary variables that act as surrogates for the unobserved true exposures. This method relies on a two-phase sampling design and an estimation approach that preserves the consistency of complete case regression parameter estimates in the validated subset, with increased precision leveraged from the auxiliary data. Furthermore, we develop optimal sampling strategies which minimize the variance of the mean score estimator for a target exposure under a fixed cost constraint. We consider the setting where an internal pilot is necessary for the optimal design so that the phase two sample is split into a pilot and an adaptive optimal sample. Through simulations and data example, we evaluate efficiency gains of the mean score estimator using the derived optimal validation design compared to balanced and simple random sampling for the phase two sample. We also empirically explore efficiency gains that the proposed discrete optimal design can provide for the Cox proportional hazards model in the setting of a continuous-time survival outcome.
Affiliation(s)
- Kyunghee Han
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, USA
| | - Thomas Lumley
- Department of Statistics, University of Auckland, Auckland, New Zealand
| | - Bryan E Shepherd
- Department of Biostatistics, Vanderbilt University, Nashville, TN, USA
| | - Pamela A Shaw
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, USA
| |
|
20
|
Shepherd BE, Shaw PA. Errors in multiple variables in human immunodeficiency virus (HIV) cohort and electronic health record data: statistical challenges and opportunities. STATISTICAL COMMUNICATIONS IN INFECTIOUS DISEASES 2020; 12:20190015. [PMID: 35880997 PMCID: PMC9204761 DOI: 10.1515/scid-2019-0015] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/15/2019] [Accepted: 08/21/2020] [Indexed: 06/15/2023]
Abstract
Objectives: Observational data derived from patient electronic health records (EHR) are increasingly used for human immunodeficiency virus/acquired immunodeficiency syndrome (HIV/AIDS) research. There are challenges to using these data, in particular with regard to data quality; some are recognized, some unrecognized, and some recognized but ignored. There are great opportunities for the statistical community to improve inference by incorporating validation subsampling into analyses of EHR data. Methods: Methods to address measurement error, misclassification, and missing data are relevant, as are sampling designs such as two-phase sampling. However, many of the existing statistical methods for measurement error, for example, only address relatively simple settings, whereas the errors seen in these datasets span multiple variables (both predictors and outcomes), are correlated, and even affect who is included in the study. Results/Conclusion: We will discuss some preliminary methods in this area with a particular focus on time-to-event outcomes and outline areas of future research.
Affiliation(s)
- Bryan E. Shepherd
- Biostatistics, Vanderbilt University, 2525 West End, Suite 11000, Nashville, Tennessee 37203, USA
| | - Pamela A. Shaw
- Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| |
|
21
|
Che M, Lawless JF, Han P. Empirical and conditional likelihoods for two‐phase studies. CAN J STAT 2020. [DOI: 10.1002/cjs.11566] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Affiliation(s)
- Menglu Che
- Department of Statistics and Actuarial Science University of Waterloo Waterloo Ontario Canada
| | - Jerald F. Lawless
- Department of Statistics and Actuarial Science University of Waterloo Waterloo Ontario Canada
| | - Peisong Han
- Department of Biostatistics, School of Public Health University of Michigan Ann Arbor MI U.S.A
| |
|
22
|
Wang L, Williams ML, Chen Y, Chen J. Novel two-phase sampling designs for studying binary outcomes. Biometrics 2020; 76:210-223. [PMID: 31449330 PMCID: PMC7042058 DOI: 10.1111/biom.13140] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2017] [Accepted: 08/06/2019] [Indexed: 11/26/2022]
Abstract
In biomedical cohort studies for assessing the association between an outcome variable and a set of covariates, usually, some covariates can only be measured on a subgroup of study subjects. An important design question is which subjects to select into the subgroup to increase statistical efficiency. When the outcome is binary, one may adopt a case-control sampling design or a balanced case-control design where cases and controls are further matched on a small number of complete discrete covariates. While the latter achieves success in estimating odds ratio (OR) parameters for the matching covariates, similar two-phase design options have not been explored for the remaining covariates, especially the incompletely collected ones. This is of great importance in studies where the covariates of interest cannot be completely collected. To this end, assuming that an external model is available to relate the outcome and complete covariates, we propose a novel sampling scheme that oversamples cases and controls with worse goodness-of-fit based on the external model and further matches them on complete covariates similarly to the balanced design. We develop a pseudolikelihood method for estimating OR parameters. Through simulation studies and explorations in a real-cohort study, we find that our design generally leads to reduced asymptotic variances of the OR estimates and the reduction for the matching covariates is comparable to that of the balanced design.
Affiliation(s)
- Le Wang
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Department of Mathematics and Statistics, Villanova University, Villanova, PA 19085, USA
| | - Matthew L Williams
- Division of Cardiovascular Surgery, Department of Surgery, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Yong Chen
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Jinbo Chen
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
| |
|
23
|
Schildcrout JS, Haneuse S, Tao R, Zelnick LR, Schisterman EF, Garbett SP, Mercaldo ND, Rathouz PJ, Heagerty PJ. Two-Phase, Generalized Case-Control Designs for the Study of Quantitative Longitudinal Outcomes. Am J Epidemiol 2020; 189:81-90. [PMID: 31165875 DOI: 10.1093/aje/kwz127] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/01/2018] [Revised: 05/06/2019] [Accepted: 05/14/2019] [Indexed: 01/30/2023] Open
Abstract
We propose a general class of 2-phase epidemiologic study designs for quantitative, longitudinal data that are useful when phase 1 longitudinal outcome and covariate data are available but data on the exposure (e.g., a biomarker) can only be collected on a subset of subjects during phase 2. To conduct a study using a design in the class, one first summarizes the longitudinal outcomes by fitting a simple linear regression of the response on a time-varying covariate for each subject. Sampling strata are defined by splitting the estimated regression intercept or slope distributions into distinct (low, medium, and high) regions. Stratified sampling is then conducted from strata defined by the intercepts, by the slopes, or from a mixture. In general, samples selected with extreme intercept values will yield low variances for associations of time-fixed exposures with the outcome and samples enriched with extreme slope values will yield low variances for associations of time-varying exposures with the outcome (including interactions with time-varying exposures). We describe ascertainment-corrected maximum likelihood and multiple-imputation estimation procedures that permit valid and efficient inferences. We embed all methodological developments within the framework of conducting a substudy that seeks to examine genetic associations with lung function among continuous smokers in the Lung Health Study (United States and Canada, 1986-1994).
Affiliation(s)
| | - Sebastien Haneuse
- Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts
| | - Ran Tao
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee
| | - Leila R Zelnick
- Division of Nephrology, Department of Medicine, University of Washington, Seattle, Washington
| | - Enrique F Schisterman
- Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, Bethesda, Maryland
| | - Shawn P Garbett
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee
| | | | - Paul J Rathouz
- Department of Population Health, Dell Medical School, University of Texas, Austin, Texas
| | - Patrick J Heagerty
- Department of Biostatistics, School of Public Health, University of Washington, Seattle, Washington
| |
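The sampling scheme summarized in the Schildcrout et al. abstract above (per-subject regression, then strata from the intercept or slope distribution, then stratified sampling) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the 10%/90% quantile cut-points, the simulated data, and the per-stratum sample sizes are all hypothetical choices made for the example.

```python
import numpy as np

def subject_summaries(times, outcomes):
    """Fit a simple linear regression per subject; return rows of (intercept, slope)."""
    rows = []
    for t, y in zip(times, outcomes):
        slope, intercept = np.polyfit(t, y, deg=1)  # highest degree first
        rows.append((intercept, slope))
    return np.array(rows)

def stratify(values, low_q=0.1, high_q=0.9):
    """Label each value 0 (low), 1 (medium) or 2 (high). Cut-points are assumptions."""
    lo, hi = np.quantile(values, [low_q, high_q])
    return np.where(values < lo, 0, np.where(values > hi, 2, 1))

def ods_sample(strata, n_per_stratum, rng):
    """Draw a stratified sample without replacement; returns sorted subject indices."""
    selected = []
    for stratum, n in n_per_stratum.items():
        members = np.flatnonzero(strata == stratum)
        take = min(n, members.size)
        selected.extend(rng.choice(members, size=take, replace=False))
    return np.sort(np.array(selected))

# Simulated phase 1 data (hypothetical): 200 subjects, 5 visits each.
rng = np.random.default_rng(0)
n_subjects = 200
times = [np.arange(5.0) for _ in range(n_subjects)]
true_slopes = rng.normal(0.0, 1.0, n_subjects)
outcomes = [2.0 + b * t + rng.normal(0.0, 0.5, t.size)
            for b, t in zip(true_slopes, times)]

summaries = subject_summaries(times, outcomes)
slope_strata = stratify(summaries[:, 1])  # stratify on estimated slopes
# Oversample the extreme-slope strata and undersample the middle stratum.
phase2 = ods_sample(slope_strata, {0: 20, 1: 10, 2: 20}, rng)
```

Stratifying on `summaries[:, 0]` instead would give the intercept-based variant. Any analysis of the resulting phase 2 sample still has to correct for the outcome-dependent selection, for example with the ascertainment-corrected likelihood or multiple-imputation procedures the abstract mentions; the sketch covers the design step only.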
|
24
|
Flanders WD. Invited Commentary: Two-Phase, Generalized Case-Control Designs for Quantitative Longitudinal Outcomes and Evolution of the Case-Control Study. Am J Epidemiol 2020; 189:91-94. [PMID: 31566676 DOI: 10.1093/aje/kwz200] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2019] [Revised: 08/23/2019] [Accepted: 08/27/2019] [Indexed: 11/12/2022] Open
Abstract
The case-control study design has evolved substantially over the past half century. The design has long been recognized as a way to increase efficiency by studying fewer subjects than would be required for a full cohort study. Historically, it was thought that case-control studies required a rare disease assumption for valid risk ratio estimation, but it was later realized that rare disease was not necessary. Over time, the design and analysis methods were further modified to allow estimation of rate ratios or to allow each person to serve as his/her own control (as we see with case-cohort and case-crossover studies, for example). We now understand that efficiency can be increased through the use of outcome-dependent sampling not only for dichotomous outcomes but also for continuous outcomes in longitudinal studies with repeated outcome measurement during follow-up. In their accompanying paper, Schildcrout et al. (Am J Epidemiol. 2019;000(00):000-000) contribute to our understanding, clearly summarizing many recent advances in study design and analyses that allow more general and efficient use of case-control studies. Their simulations demonstrate that improved efficiency is achieved with these methods when the goal is to estimate associations of exposure with trajectories and patterns of change over time. Here we comment on application of some of these generalized case-control methods to causal inference.
Affiliation(s)
- W Dana Flanders
- Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta, GA 30322
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322
| |
|
25
|
Abstract
The two-phase design is a cost-effective sampling strategy to evaluate the effects of covariates on an outcome when certain covariates are too expensive to be measured on all study subjects. Under such a design, the outcome and inexpensive covariates are measured on all subjects in the first phase and the first-phase information is used to select subjects for measurements of expensive covariates in the second phase. Previous research on two-phase studies has focused largely on the inference procedures rather than the design aspects. We investigate the design efficiency of the two-phase study, as measured by the semiparametric efficiency bound for estimating the regression coefficients of expensive covariates. We consider general two-phase studies, where the outcome variable can be continuous, discrete, or censored, and the second-phase sampling can depend on the first-phase data in any manner. We develop optimal or approximately optimal two-phase designs, which can be substantially more efficient than the existing designs. We demonstrate the improvements of the new designs over the existing ones through extensive simulation studies and two large medical studies.
Affiliation(s)
- Ran Tao
- Department of Biostatistics and Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN 37232.,Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599
| | - Donglin Zeng
- Department of Biostatistics and Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN 37232.,Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599
| | - Dan-Yu Lin
- Department of Biostatistics and Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN 37232.,Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599
| |
|
26
|
Ni A, Satagopan JM. Estimating Additive Interaction Effect in Stratified Two-Phase Case-Control Design. Hum Hered 2019; 84:90-108. [PMID: 31634888 PMCID: PMC6925975 DOI: 10.1159/000502738] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2018] [Accepted: 08/15/2019] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND AND AIMS There is considerable interest in epidemiology to estimate an additive interaction effect between two risk factors in case-control studies. An additive interaction is defined as the differential reduction in absolute risk associated with one factor between different levels of the other factor. A stratified two-phase case-control design is commonly used in epidemiology to reduce the cost of assembling covariates. It is crucial to obtain valid estimates of the model parameters by accounting for the underlying stratification scheme to obtain accurate and precise estimates of additive interaction effects. The aim of this paper is to examine the properties of different methods for estimating model parameters and additive interaction effects under a stratified two-phase case-control design. METHODS Using simulations, we investigate the properties of three existing methods, namely stratum-specific offset, inverse-probability weighting, and multiple imputation for estimating model parameters and additive interaction effects. We also illustrate these properties using data from two published epidemiology studies. RESULTS Simulation studies show that the multiple imputation method performs well when both the true and analysis models are additive (i.e., does not include multiplicative interaction terms) but does not provide a discernible advantage over the offset method when the analysis models are non-additive (i.e., includes multiplicative interaction terms). The offset method exhibits the best overall properties when the analysis model contains multiplicative interaction effects. CONCLUSION When estimating additive interaction between risk factors in stratified two-phase case-control studies, we recommend estimating model parameters using multiple imputation when the analysis model is additive, and we recommend the offset method when the analysis model is non-additive.
Affiliation(s)
- Ai Ni
- Division of Biostatistics, The Ohio State University, Columbus, Ohio, USA
- Jaya M Satagopan
- Department of Biostatistics and Epidemiology, School of Public Health, Rutgers University, Piscataway, New Jersey, USA
|
27
|
Bjørnland T, Bye A, Ryeng E, Wisløff U, Langaas M. Powerful extreme phenotype sampling designs and score tests for genetic association studies. Stat Med 2018; 37:4234-4251. [PMID: 30088284 DOI: 10.1002/sim.7914] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Received: 01/09/2017] [Revised: 06/20/2018] [Accepted: 06/25/2018] [Indexed: 12/15/2022]
Abstract
We consider cross-sectional genetic association studies (common and rare variants) where non-genetic information is available or feasible to obtain for N individuals, but where it is infeasible to genotype all N individuals. We consider continuously measurable Gaussian traits (phenotypes). Genotyping n < N extreme phenotype individuals can yield better power to detect phenotype-genotype associations, as compared to randomly selecting n individuals. We define a person as having an extreme phenotype if the observed phenotype is above a specified threshold or below a specified threshold. We consider a model where these thresholds can be tailored to each individual. The classical extreme sampling design is to set equal thresholds for all individuals. We introduce a design (z-extreme sampling) where personalized thresholds are defined based on the residuals of a regression model including only non-genetic (fully available) information. We derive score tests for the situation where only n extremes are analyzed (complete case analysis) and for the situation where the non-genetic information on N - n non-extremes is included in the analysis (all case analysis). For the classical design, all case analysis is generally more powerful than complete case analysis. For the z-extreme sample, we show that all case and complete case tests are equally powerful. Simulations and data analysis also show that z-extreme sampling is at least as powerful as the classical extreme sampling design and the classical design is shown to be at times less powerful than random sampling. The method of dichotomizing extreme phenotypes is also discussed.
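The two selection rules described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function names `classical_extremes` and `z_extremes` are hypothetical, and picking an equal number of individuals from each tail is an assumed stand-in for the thresholds mentioned in the abstract. The score tests themselves are not reproduced here.

```python
import numpy as np


def classical_extremes(y, n):
    """Return indices of the n most extreme values of y.

    Assumption for illustration: roughly n/2 individuals are taken from
    each tail, which corresponds to one lower and one upper threshold
    shared by all individuals.
    """
    y = np.asarray(y, dtype=float)
    if not 2 <= n <= len(y):
        raise ValueError("n must satisfy 2 <= n <= len(y)")
    order = np.argsort(y)          # indices sorted from lowest to highest y
    k = n // 2                     # size of the lower tail
    return np.concatenate([order[:k], order[-(n - k):]])


def z_extremes(y, X, n):
    """Return indices of the n individuals with the most extreme residuals.

    The phenotype y is regressed (ordinary least squares) on the
    non-genetic covariates X only; extremes are then chosen on the
    residuals, so the implied phenotype threshold differs per individual.
    """
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])   # add an intercept
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    return classical_extremes(residuals, n)
```

Under the classical rule, an individual whose phenotype is unremarkable overall but unusual given their covariates would not be genotyped, whereas the residual-based rule would select them.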
Affiliation(s)
- Thea Bjørnland
- Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim, Norway
- Anja Bye
- Department of Circulation and Medical Imaging, Norwegian University of Science and Technology, Trondheim, Norway
- Einar Ryeng
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway
- Ulrik Wisløff
- Department of Circulation and Medical Imaging, Norwegian University of Science and Technology, Trondheim, Norway
- Mette Langaas
- Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim, Norway
|