1
Zhang Y, Kissin DM, Liao KJ, DeSantis CE, Yartel AK, Gutman R. Multiple Imputation of Missing Race/Ethnicity Information in the National Assisted Reproductive Technology Surveillance System. J Womens Health (Larchmt) 2024; 33:328-338. [PMID: 38112534 PMCID: PMC10998289 DOI: 10.1089/jwh.2023.0267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/21/2023]
Abstract
Background: Missing race/ethnicity data are common in many surveillance systems and registries, which may limit complete and accurate assessments of racial and ethnic disparities. The Centers for Disease Control and Prevention's National Assisted Reproductive Technology (ART) Surveillance System (NASS) has a congressional mandate to collect data on all ART cycles performed by fertility clinics in the United States and provides valuable information on ART utilization and treatment outcomes. However, race/ethnicity data are missing for many ART cycles in NASS. Materials and Methods: We multiply imputed missing race/ethnicity data using variables from NASS and additional zip code-level race/ethnicity information in U.S. Census data. To evaluate imputed data quality, we generated training data by imposing missing values on known race/ethnicity under a missing at random assumption, imputed, and examined the relationship between race/ethnicity and the rate of stillbirth per pregnancy. Results: The distribution of imputed race/ethnicity was comparable to the reported distribution, with the largest difference being 0.53% for non-Hispanic Asian. Our imputation procedure was well calibrated and correctly identified 89.91% (standard error = 0.18) of known race/ethnicity values on average in the training data. Compared with complete-case analysis, using multiply imputed data reduced bias of parameter estimates (the range of bias for stillbirth per pregnancy across race/ethnicity groups was 0.02%-0.18% for imputed data analysis, versus 0.04%-0.66% for complete-case analysis) and yielded narrower confidence intervals. Conclusions: Our results underscore the importance of collecting complete race/ethnicity information for ART surveillance. However, when missingness exists, multiply imputed race/ethnicity can improve the accuracy and precision of health outcomes estimated across racial/ethnic groups.
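The abstract above does not say how the estimates from the separate imputed data sets were combined. A common choice is Rubin's rules, sketched here under that assumption. The function name `pool_rubin` and the input numbers are hypothetical illustrations, not values from the paper.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine m completed-data estimates with Rubin's rules.

    estimates: point estimate from each imputed data set
    variances: squared standard error from each imputed data set
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates and matching variances")
    pooled = mean(estimates)        # average of the m point estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # sample variance across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical inputs: three imputations of the same quantity.
pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total)
```

The between-imputation term is what carries the extra uncertainty due to the missing values; a complete-case analysis has no such term but pays for it with a smaller sample.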
Affiliation(s)
- Yujia Zhang
  - Division of Reproductive Health, Centers for Disease Control and Prevention, Atlanta, Georgia, USA
- Dmitry M. Kissin
  - Division of Reproductive Health, Centers for Disease Control and Prevention, Atlanta, Georgia, USA
- Kuo Jen Liao
  - Division of Reproductive Health, Centers for Disease Control and Prevention, Atlanta, Georgia, USA
  - CDC Foundation, Atlanta, Georgia, USA
- Carol E. DeSantis
  - Division of Reproductive Health, Centers for Disease Control and Prevention, Atlanta, Georgia, USA
  - CDC Foundation, Atlanta, Georgia, USA
- Anthony K. Yartel
  - Division of Reproductive Health, Centers for Disease Control and Prevention, Atlanta, Georgia, USA
  - CDC Foundation, Atlanta, Georgia, USA
- Roee Gutman
  - Department of Biostatistics, Brown University, Providence, Rhode Island, USA
2
Guo F, Langworthy B, Ogino S, Wang M. Comparison between inverse-probability weighting and multiple imputation in Cox model with missing failure subtype. Stat Methods Med Res 2024; 33:344-356. [PMID: 38262434 DOI: 10.1177/09622802231226328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/25/2024]
Abstract
Identifying and distinguishing risk factors for heterogeneous disease subtypes has been of great interest. However, missingness in disease subtypes is a common problem in those data analyses. Several methods have been proposed to deal with the missing data, including complete-case analysis, inverse-probability weighting, and multiple imputation. Although extant literature has compared these methods in missing-data problems, none has focused on the competing risk setting. In this paper, we discuss the assumptions required when complete-case analysis, inverse-probability weighting, and multiple imputation are used to deal with the missing failure subtype problem, focusing on how to implement these methods under various realistic scenarios in competing risk settings. We also compare these three methods regarding their biases, efficiency, and robustness to model misspecifications using simulation studies. Our results show that complete-case analysis can be seriously biased when the missing completely at random assumption does not hold. Inverse-probability weighting and multiple imputation estimators are valid when we correctly specify the corresponding models for missingness and for imputation, and multiple imputation typically shows higher efficiency than inverse-probability weighting. However, in real-world studies, building imputation models for the missing subtypes can be more challenging than building missingness models. In that case, inverse-probability weighting could be preferred for its ease of use. We also propose two automated model selection procedures and demonstrate their usage in a study of the association between smoking and colorectal cancer subtypes in the Nurses' Health Study and Health Professionals Follow-up Study.
Affiliation(s)
- Fuyu Guo
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Shuji Ogino
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Cancer Immunology and Cancer Epidemiology Programs, Dana-Farber Harvard Cancer Center, Boston, MA, USA
  - Program in MPE Molecular Pathological Epidemiology, Department of Pathology, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Molin Wang
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
  - Harvard Medical School, Boston, MA, USA
3
Matthews JNS, Bazakou S, Henderson R, Sharples LD. Contrasting principal stratum and hypothetical strategy estimands in multi-period crossover trials with incomplete data. Biometrics 2023; 79:1896-1907. [PMID: 36308035 DOI: 10.1111/biom.13777] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2021] [Accepted: 10/05/2022] [Indexed: 11/30/2022]
Abstract
Complete case analyses of complete crossover designs provide an opportunity to make comparisons based on patients who can tolerate all treatments. It is argued that this provides a means of estimating a principal stratum strategy estimand, something which is difficult to do in parallel group trials. While some trial users will consider this a relevant aim, others may be interested in hypothetical strategy estimands, that is, the effect that would be found if all patients completed the trial. Whether these estimands differ importantly is a question of interest to the different users of the trial results. This paper derives the difference between principal stratum strategy and hypothetical strategy estimands, where the former is estimated by a complete-case analysis of the crossover design, and a model for the dropout process is assumed. Complete crossover designs, that is, those where all treatments appear in all sequences, and which compare t treatments over p periods with respect to a continuous outcome are considered. Numerical results are presented for Williams designs with four and six periods. Results from a trial of obstructive sleep apnoea-hypopnoea (TOMADO) are also used for illustration. The results demonstrate that the percentage difference between the estimands is modest, exceeding 5% only when the trial has been severely affected by dropouts or if the within-subject correlation is low.
Affiliation(s)
- John N S Matthews
  - School of Mathematics, Statistics & Physics, Newcastle University, Newcastle upon Tyne, UK
  - Public Health Sciences Institute, Newcastle University, Newcastle upon Tyne, UK
- Sofia Bazakou
  - School of Mathematics, Statistics & Physics, Newcastle University, Newcastle upon Tyne, UK
- Robin Henderson
  - School of Mathematics, Statistics & Physics, Newcastle University, Newcastle upon Tyne, UK
- Linda D Sharples
  - Department of Medical Statistics, London School of Hygiene and Tropical Medicine, London, UK
4
Lee J, Beretvas SN. Comparing methods for handling missing covariates in meta-regression. Res Synth Methods 2023; 14:117-136. [PMID: 35796095 DOI: 10.1002/jrsm.1585] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/12/2021] [Revised: 04/18/2022] [Accepted: 05/31/2022] [Indexed: 01/18/2023]
Abstract
Meta-analysts often encounter missing covariate values when estimating meta-regression models. In practice, ad hoc approaches involving data deletion have been widely used. The current study investigates the performance of different methods for handling missing covariates in meta-regression, including complete-case analysis (CCA), shifting-case analysis (SCA), multiple imputation (MI), and full information maximum likelihood (FIML), assuming a missing at random mechanism. According to the simulation results, we advocate the use of MI and FIML over the CCA and SCA approaches in practice. In addition, we cautiously note the challenges and potential advantages of using MI in the meta-analysis context.
Affiliation(s)
- Jihyun Lee
  - Quantitative Methods, Educational Psychology Department, The University of Texas at Austin, Austin, Texas, USA
- S Natasha Beretvas
  - Quantitative Methods, Educational Psychology Department, The University of Texas at Austin, Austin, Texas, USA
5
Ross RK, Breskin A, Westreich D. When Is a Complete-Case Approach to Missing Data Valid? The Importance of Effect-Measure Modification. Am J Epidemiol 2020; 189:1583-1589. [PMID: 32601706 DOI: 10.1093/aje/kwaa124] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 09/11/2019] [Revised: 06/22/2020] [Accepted: 06/23/2020] [Indexed: 12/19/2022]
Abstract
When estimating causal effects, careful handling of missing data is needed to avoid bias. Complete-case analysis is commonly used in epidemiologic analyses. Previous work has shown that covariate-stratified effect estimates from complete-case analysis are unbiased when missingness is independent of the outcome conditional on the exposure and covariates. Here, we assess the bias of complete-case analysis for adjusted marginal effects when confounding is present under various causal structures of missing data. We show that estimation of the marginal risk difference requires an unbiased estimate of the unconditional joint distribution of confounders and any other covariates required for conditional independence of missingness and outcome. The dependence of missing data on these covariates must be considered to obtain a valid estimate of the covariate distribution. If none of these covariates are effect-measure modifiers on the absolute scale, however, the marginal risk difference will equal the stratified risk differences and the complete-case analysis will be unbiased when the stratified effect estimates are unbiased. Estimation of unbiased marginal effects in complete-case analysis therefore requires close consideration of causal structure and effect-measure modification.
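The role of effect-measure modification described above can be illustrated with a small standardization calculation. This is a sketch with hypothetical numbers, assuming the stratum-specific risk differences are themselves estimated without bias and that the complete cases over-represent one covariate stratum. The names `marginal_rd`, `full_dist`, and `cc_dist` are illustrative and do not come from the paper.

```python
def marginal_rd(stratum_rd, stratum_prob):
    """Average stratum-specific risk differences over a covariate distribution."""
    if abs(sum(stratum_prob.values()) - 1.0) > 1e-9:
        raise ValueError("stratum probabilities must sum to 1")
    return sum(stratum_rd[z] * stratum_prob[z] for z in stratum_rd)


# Hypothetical covariate distributions for a binary covariate Z.
full_dist = {"z0": 0.5, "z1": 0.5}   # target population
cc_dist = {"z0": 0.8, "z1": 0.2}     # complete cases, where z1 is more often missing

# Hypothetical stratum-specific risk differences.
heterogeneous = {"z0": 0.10, "z1": 0.30}  # Z modifies the effect on the absolute scale
homogeneous = {"z0": 0.10, "z1": 0.10}    # no effect-measure modification

# With modification, the marginal value depends on which distribution is used
# (about 0.20 versus about 0.14 here); without it, both equal the common value.
rd_full_het = marginal_rd(heterogeneous, full_dist)
rd_cc_het = marginal_rd(heterogeneous, cc_dist)
rd_full_hom = marginal_rd(homogeneous, full_dist)
rd_cc_hom = marginal_rd(homogeneous, cc_dist)
print(rd_full_het, rd_cc_het, rd_full_hom, rd_cc_hom)
```

When the stratum-specific risk differences are equal, the weights used in the average no longer matter, which is why a distorted covariate distribution among the complete cases does no harm in that case.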
6
Che M, Han P, Lawless JF. Improving estimation efficiency for regression with MNAR covariates. Biometrics 2019; 76:270-280. [PMID: 31393001 DOI: 10.1111/biom.13131] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/27/2018] [Accepted: 07/25/2019] [Indexed: 11/29/2022]
Abstract
For regression with covariates missing not at random where the missingness depends on the missing covariate values, complete-case (CC) analysis leads to consistent estimation when the missingness is independent of the response given all covariates, but it may not have the desired level of efficiency. We propose a general empirical likelihood framework to improve estimation efficiency over the CC analysis. We expand on methods in Bartlett et al. (2014, Biostatistics 15, 719-730) and Xie and Zhang (2017, Int J Biostat 13, 1-20) that improve efficiency by modeling the missingness probability conditional on the response and fully observed covariates by allowing the possibility of modeling other data distribution-related quantities. We also give guidelines on what quantities to model and demonstrate that our proposal has the potential to yield smaller biases than existing methods when the missingness probability model is incorrect. Simulation studies are presented, as well as an application to data collected from the US National Health and Nutrition Examination Survey.
Affiliation(s)
- Menglu Che
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
- Peisong Han
  - Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan
- Jerald F Lawless
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
7
Atem FD, Matsouaka RA, Zimmern VE. Cox regression model with randomly censored covariates. Biom J 2019; 61:1020-1032. [PMID: 30908720 DOI: 10.1002/bimj.201800275] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 09/03/2018] [Revised: 02/07/2019] [Accepted: 02/07/2019] [Indexed: 11/11/2022]
Abstract
This paper deals with a Cox proportional hazards regression model, where some covariates of interest are randomly right-censored. While methods for censored outcomes have become ubiquitous in the literature, methods for censored covariates have thus far received little attention and, for the most part, dealt with the issue of limit-of-detection. For randomly censored covariates, an often-used method is the inefficient complete-case analysis (CCA), which consists of deleting censored observations in the data analysis. When censoring is not completely independent, the CCA leads to biased and spurious results. Methods for missing covariate data, including type I and type II covariate censoring as well as limit-of-detection, do not readily apply due to the fundamentally different nature of randomly censored covariates. We develop a novel method for censored covariates using a conditional mean imputation based on either Kaplan-Meier estimates or a Cox proportional hazards model to estimate the effects of these covariates on a time-to-event outcome. We evaluate the performance of the proposed method through simulation studies and show that it provides good bias reduction and statistical efficiency. Finally, we illustrate the method using data from the Framingham Heart Study to assess the relationship between offspring and parental age of onset of cardiovascular events.
Affiliation(s)
- Folefac D Atem
  - Department of Biostatistics and Data Science, University of Texas Health Science Center at Houston, Houston, TX, USA
- Roland A Matsouaka
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA
  - Program for Comparative Effectiveness Methodology, Duke Clinical Research Institute, Duke University, Durham, NC, USA
- Vincent E Zimmern
  - Department of Pediatrics, University of Texas Southwestern Medical School, Dallas, TX, USA
  - Department of Pediatrics, Children Hospital Dallas, Dallas, TX, USA
8
Perkins NJ, Cole SR, Harel O, Tchetgen Tchetgen EJ, Sun B, Mitchell EM, Schisterman EF. Principled Approaches to Missing Data in Epidemiologic Studies. Am J Epidemiol 2018; 187:568-575. [PMID: 29165572 PMCID: PMC5860376 DOI: 10.1093/aje/kwx348] [Citation(s) in RCA: 145] [Impact Index Per Article: 24.2] [Received: 05/19/2016] [Revised: 09/08/2017] [Accepted: 09/12/2017] [Indexed: 11/12/2022]
Abstract
Principled methods with which to appropriately analyze missing data have long existed; however, broad implementation of these methods remains challenging. In this and 2 companion papers (Am J Epidemiol. 2018;187(3):576-584 and Am J Epidemiol. 2018;187(3):585-591), we discuss issues pertaining to missing data in the epidemiologic literature. We provide details regarding missing-data mechanisms and nomenclature and encourage the conduct of principled analyses through a detailed comparison of multiple imputation and inverse probability weighting. Data from the Collaborative Perinatal Project, a multisite US study conducted from 1959 to 1974, are used to create a masked data-analytical challenge with missing data induced by known mechanisms. We illustrate the deleterious effects of missing data with naive methods and show how principled methods can sometimes mitigate such effects. For example, when data were missing at random, naive methods showed a spurious protective effect of smoking on the risk of spontaneous abortion (odds ratio (OR) = 0.43, 95% confidence interval (CI): 0.19, 0.93), while implementation of principled methods multiple imputation (OR = 1.30, 95% CI: 0.95, 1.77) or augmented inverse probability weighting (OR = 1.40, 95% CI: 1.00, 1.97) provided estimates closer to the "true" full-data effect (OR = 1.31, 95% CI: 1.05, 1.64). We call for greater acknowledgement of and attention to missing data and for the broad use of principled missing-data methods in epidemiologic research.
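The contrast between a naive complete-case estimate and a weighted one can be seen in a toy calculation. The sketch below is a simplification and not the paper's analysis: it estimates a mean rather than an odds ratio, uses plain (not augmented) inverse probability weighting, and estimates the observation probability as the observed fraction within each stratum of a fully observed covariate. All data and function names are hypothetical.

```python
from collections import defaultdict


def complete_case_mean(records):
    """Mean of y over records where y was observed."""
    ys = [y for _, y, observed in records if observed]
    return sum(ys) / len(ys)


def ipw_mean(records):
    """Weighted mean of y, weighting each observed record by
    1 / (observed fraction in its stratum)."""
    n_total = defaultdict(int)
    n_observed = defaultdict(int)
    for stratum, _, observed in records:
        n_total[stratum] += 1
        n_observed[stratum] += int(observed)
    numerator = denominator = 0.0
    for stratum, y, observed in records:
        if observed:
            weight = n_total[stratum] / n_observed[stratum]
            numerator += weight * y
            denominator += weight
    return numerator / denominator


# Hypothetical records: (stratum, y, observed). Missingness depends only on the
# stratum (missing at random given stratum); the y values of unobserved records
# are listed only so the full-data mean (6.0) can be checked by hand.
records = [
    (0, 1.0, True), (0, 1.0, True), (0, 3.0, True), (0, 3.0, True),
    (1, 9.0, True), (1, 11.0, True), (1, 10.0, False), (1, 10.0, False),
]

cc = complete_case_mean(records)   # pulled toward stratum 0, which is fully observed
ipw = ipw_mean(records)            # reweights stratum 1 to its full-data share
print(cc, ipw)
```

In this toy data set the weighted mean recovers the full-data mean because the observed records in each stratum happen to average to that stratum's full-data mean; with real data the agreement is only approximate and depends on the missingness model being right.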
Affiliation(s)
- Neil J Perkins
  - Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, Rockville, Maryland
- Stephen R Cole
  - Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
- Ofer Harel
  - Department of Statistics, College of Liberal Arts and Sciences, University of Connecticut, Storrs, Connecticut
- BaoLuo Sun
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts
- Enrique F Schisterman
  - Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, Rockville, Maryland
9
Ng M, Gakidou E, Murray CJL, Lim SS. A comparison of missing data procedures for addressing selection bias in HIV sentinel surveillance data. Popul Health Metr 2013; 11:12. [PMID: 23883362 PMCID: PMC3724705 DOI: 10.1186/1478-7954-11-12] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 01/31/2012] [Accepted: 07/15/2013] [Indexed: 11/10/2022]
Abstract
Background: Selection bias is common in clinic-based HIV surveillance. Clinics located in HIV hotspots are often the first to be chosen and monitored, while clinics in lower-prevalence areas are added to the surveillance system later on. Consequently, the estimated HIV prevalence based on clinic data is substantially distorted, with markedly higher HIV prevalence in the earlier periods and trends that reveal much more dramatic declines than actually occur. Methods: Using simulations, we compare and contrast the performance of the various approaches and models for handling selection bias in clinic-based HIV surveillance. In particular, we compare the application of complete-case analysis and multiple imputation (MI). Several models are considered for each of the approaches. We demonstrate the application of the methods through sentinel surveillance data collected between 2002 and 2008 from India. Results: Simulations suggested that selection bias, if not handled properly, can lead to biased estimates of HIV prevalence trends and inaccurate evaluation of program impact. Complete-case analysis and MI differed considerably in their ability to handle selection bias. In scenarios where HIV prevalence remained constant over time (i.e., β = 0), the estimated β̂1 derived from MI tended to be biased downward. Depending on the imputation model used, the estimated bias ranged from -1.883 to -0.048 in logit prevalence. Furthermore, as the level of selection bias intensified, the extent of bias also increased. In contrast, the estimates yielded by complete-case analysis were relatively unbiased and stable across the various scenarios. The estimated bias ranged from -0.002 to 0.002 in logit prevalence. Conclusions: Given that selection bias is common in clinic-based HIV surveillance, appropriate adjustment methods need to be applied when analyzing data from such sources. The results in this paper suggest that indiscriminate application of imputation models can lead to biased results.
Affiliation(s)
- Marie Ng
  - Institute for Health Metrics and Evaluation, University of Washington, Seattle, USA
- Emmanuela Gakidou
  - Institute for Health Metrics and Evaluation, University of Washington, Seattle, USA
- Stephen S Lim
  - Institute for Health Metrics and Evaluation, University of Washington, Seattle, USA
10
Abstract
Missing data is the norm rather than the exception in complex epidemiological studies. Complete-case analyses, which discard all subjects with some data values missing, are known to be valid under the very restrictive assumption that the response mechanism is missing completely at random (MCAR). While conditions weaker than MCAR are known under which estimators of regression coefficients are unbiased, one often comes across the view in the literature that MCAR is necessary for the complete cases to form a simple random subsample of the target sample. In this paper, we explain why this is not the case, and we distill an assumption weaker than MCAR under which the simple random subsample condition holds, which we call available at random (AAR). Moreover, we show that, unlike MCAR, AAR response mechanisms can be missing not at random (MNAR). We also suggest how approximate AAR mechanisms might arise in practice through cancellation of selection and drop-out effects, and we conclude that before pooling partially complete and complete cases into an analysis, the investigator should consider how selection might impact on the representativeness of the cases included in the pooled analysis (compared to those comprising the complete cases only).
Affiliation(s)
- John C Galati
  - Clinical Epidemiology and Biostatistics Unit, Murdoch Childrens Research Institute, Royal Children's Hospital, Australia
  - Department of Mathematics and Statistics, La Trobe University, Australia
- Katherine A Seaton
  - Department of Mathematics and Statistics, La Trobe University, Australia