1. Mainzer RM, Moreno-Betancur M, Nguyen CD, Simpson JA, Carlin JB, Lee KJ. Gaps in the usage and reporting of multiple imputation for incomplete data: findings from a scoping review of observational studies addressing causal questions. BMC Med Res Methodol 2024; 24:193. PMID: 39232661; PMCID: PMC11373423; DOI: 10.1186/s12874-024-02302-6.
Abstract
BACKGROUND Missing data are common in observational studies and often occur in several of the variables required when estimating a causal effect, i.e. the exposure, outcome and/or variables used to control for confounding. Analyses involving multiple incomplete variables are not as straightforward as analyses with a single incomplete variable. For example, in the context of multivariable missingness, the standard missing data assumptions ("missing completely at random", "missing at random" [MAR], "missing not at random") are difficult to interpret and assess. It is not clear how the complexities that arise due to multivariable missingness are being addressed in practice. The aim of this study was to review how missing data are managed and reported in observational studies that use multiple imputation (MI) for causal effect estimation, with a particular focus on missing data summaries, missing data assumptions, primary and sensitivity analyses, and MI implementation. METHODS We searched five top general epidemiology journals for observational studies that aimed to answer a causal research question and used MI, published between January 2019 and December 2021. Article screening and data extraction were performed systematically. RESULTS Of the 130 studies included in this review, 108 (83%) derived an analysis sample by excluding individuals with missing data in specific variables (e.g., outcome) and 114 (88%) had multivariable missingness within the analysis sample. Forty-four (34%) studies provided a statement about missing data assumptions, 35 of which stated the MAR assumption, but only 11/44 (25%) studies provided a justification for these assumptions. The number of imputations, MI method and MI software were generally well-reported (71%, 75% and 88% of studies, respectively), while aspects of the imputation model specification were not clear for more than half of the studies. 
A secondary analysis that used a different approach to handle the missing data was conducted in 69/130 (53%) studies. Of these 69 studies, 68 (99%) lacked a clear justification for the secondary analysis. CONCLUSION Effort is needed to clarify the rationale for and improve the reporting of MI for estimation of causal effects from observational data. We encourage greater transparency in making and reporting analytical decisions related to missing data.
Affiliation(s)
- Rheanna M Mainzer: Clinical Epidemiology and Biostatistics Unit, Murdoch Children's Research Institute, Parkville, Victoria, 3052, Australia; Department of Paediatrics, The University of Melbourne, Parkville, Victoria, 3052, Australia
- Margarita Moreno-Betancur: Clinical Epidemiology and Biostatistics Unit, Murdoch Children's Research Institute, Parkville, Victoria, 3052, Australia; Department of Paediatrics, The University of Melbourne, Parkville, Victoria, 3052, Australia
- Cattram D Nguyen: Clinical Epidemiology and Biostatistics Unit, Murdoch Children's Research Institute, Parkville, Victoria, 3052, Australia; Department of Paediatrics, The University of Melbourne, Parkville, Victoria, 3052, Australia
- Julie A Simpson: Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, The University of Melbourne, Parkville, Victoria, 3052, Australia; Nuffield Department of Medicine, University of Oxford, Oxford, UK
- John B Carlin: Clinical Epidemiology and Biostatistics Unit, Murdoch Children's Research Institute, Parkville, Victoria, 3052, Australia; Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, The University of Melbourne, Parkville, Victoria, 3052, Australia
- Katherine J Lee: Clinical Epidemiology and Biostatistics Unit, Murdoch Children's Research Institute, Parkville, Victoria, 3052, Australia; Department of Paediatrics, The University of Melbourne, Parkville, Victoria, 3052, Australia
2. Oberman HI, Vink G. Toward a standardized evaluation of imputation methodology. Biom J 2024; 66:e2200107. PMID: 36932050; DOI: 10.1002/bimj.202200107.
Abstract
Developing new imputation methodology has become a very active field. Unfortunately, there is no consensus on how to perform simulation studies to evaluate the properties of imputation methods. In part, this may be due to different aims between fields and studies. For example, when evaluating imputation techniques aimed at prediction, different aims may be formulated than when statistical inference is of interest. The lack of consensus may also stem from different personal preferences or scientific backgrounds. All in all, the lack of common ground in evaluating imputation methodology may lead to suboptimal use in practice. In this paper, we propose a move toward a standardized evaluation of imputation methodology. To demonstrate the need for standardization, we highlight a set of possible pitfalls that bring forth a chain of potential problems in the objective assessment of the performance of imputation routines. Additionally, we suggest a course of action for simulating and evaluating missing data problems. Our suggested course of action is by no means meant to serve as a complete cookbook, but rather meant to incite critical thinking and a move to objective and fair evaluations of imputation methodology. We invite the readers of this paper to contribute to the suggested course of action.
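A toy illustration of the simulation-based evaluation the authors advocate (the data-generating model, missingness rate, and deliberately naive imputation routine are assumptions of this sketch, not the paper's protocol): generate complete data from a known model, impose MCAR missingness, impute, and check whether nominal confidence-interval coverage holds.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sim, true_mean = 200, 500, 1.0

cover = 0
for _ in range(n_sim):
    y = rng.normal(true_mean, 1.0, n)   # complete data from a known model
    miss = rng.random(n) < 0.3          # impose 30% MCAR missingness
    z = y.copy()
    z[miss] = y[~miss].mean()           # naive single mean imputation
    est = z.mean()
    se = z.std(ddof=1) / np.sqrt(n)     # treats imputed values as real data
    cover += abs(est - true_mean) < 1.96 * se

# Expect under-coverage: mean imputation deflates the estimated variance.
print(f"coverage of nominal 95% CI after mean imputation: {cover / n_sim:.2f}")
```

The point estimate is unbiased under MCAR here, yet coverage falls below the nominal level; an evaluation that reported only bias would miss this, which is exactly the kind of pitfall a standardized assessment should surface.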
Affiliation(s)
- Hanne I Oberman: Department of Methodology & Statistics, Utrecht, The Netherlands
- Gerko Vink: Department of Methodology & Statistics, Utrecht, The Netherlands
3. Shahid F, Mehmood A, Khan R, AL Smadi A, Yaqub M, Alsmadi MK, Zheng Z. 1D Convolutional LSTM-based wind power prediction integrated with PkNN data imputation technique. Journal of King Saud University - Computer and Information Sciences 2023; 35:101816. DOI: 10.1016/j.jksuci.2023.101816.
4. Mitra R, McGough SF, Chakraborti T, Holmes C, Copping R, Hagenbuch N, Biedermann S, Noonan J, Lehmann B, Shenvi A, Doan XV, Leslie D, Bianconi G, Sanchez-Garcia R, Davies A, Mackintosh M, Andrinopoulou ER, Basiri A, Harbron C, MacArthur BD. Learning from data with structured missingness. Nat Mach Intell 2023. DOI: 10.1038/s42256-022-00596-z.
5. Witte J, Foraita R, Didelez V. Multiple imputation and test-wise deletion for causal discovery with incomplete cohort data. Stat Med 2022; 41:4716-4743. PMID: 35908775; DOI: 10.1002/sim.9535.
Abstract
Causal discovery algorithms estimate causal graphs from observational data. This can provide a valuable complement to analyses focusing on the causal relation between individual treatment-outcome pairs. Constraint-based causal discovery algorithms rely on conditional independence testing when building the graph. Until recently, these algorithms have been unable to handle missing values. In this article, we investigate two alternative solutions: test-wise deletion and multiple imputation. We establish necessary and sufficient conditions for the recoverability of causal structures under test-wise deletion, and argue that multiple imputation is more challenging in the context of causal discovery than for estimation. We conduct an extensive comparison by simulating from benchmark causal graphs: as one might expect, we find that test-wise deletion and multiple imputation both clearly outperform list-wise deletion and single imputation. Crucially, our results further suggest that multiple imputation is especially useful in settings with a small number of either Gaussian or discrete variables, but when the dataset contains a mix of both neither method is uniformly best. The methods we compare include random forest imputation and a hybrid procedure combining test-wise deletion and multiple imputation. An application to data from the IDEFICS cohort study on diet- and lifestyle-related diseases in European children serves as an illustrating example.
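Test-wise deletion, as described above, uses for each conditional independence test only the records that are complete in the variables entering that particular test, rather than restricting to globally complete cases. A minimal numpy sketch of a Gaussian partial-correlation test under test-wise deletion (helper names and toy data are assumptions of this sketch, not the authors' implementation):

```python
import numpy as np

def partial_corr_testwise(data, i, j, given):
    """Partial correlation of columns i and j given the columns in `given`,
    using only the rows complete in exactly these variables (test-wise deletion)."""
    cols = [i, j] + list(given)
    keep = ~np.isnan(data[:, cols]).any(axis=1)      # complete for THIS test only
    sub = data[keep][:, cols]
    prec = np.linalg.inv(np.cov(sub, rowvar=False))  # precision matrix
    return -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1]), int(keep.sum())

# Toy chain x -> y -> z, with 20% of z missing completely at random:
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y = x + rng.normal(size=n)
z = y + rng.normal(size=n)
z[rng.random(n) < 0.2] = np.nan
data = np.column_stack([x, y, z])

r, n_used = partial_corr_testwise(data, 0, 2, [1])
print(f"partial corr(x, z | y) = {r:.3f} from {n_used} of {n} rows")  # near 0
```

Because x is independent of z given y in this chain, the test-wise estimate is close to zero while still using all rows complete in (x, y, z), instead of discarding records for missingness in variables irrelevant to the test.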
Affiliation(s)
- Janine Witte: Leibniz Institute for Prevention Research and Epidemiology - BIPS, Bremen, Germany; Faculty of Mathematics and Computer Science, University of Bremen, Bremen, Germany
- Ronja Foraita: Leibniz Institute for Prevention Research and Epidemiology - BIPS, Bremen, Germany
- Vanessa Didelez: Leibniz Institute for Prevention Research and Epidemiology - BIPS, Bremen, Germany; Faculty of Mathematics and Computer Science, University of Bremen, Bremen, Germany
6. Extended missing data imputation via GANs for ranking applications. Data Min Knowl Discov 2022. DOI: 10.1007/s10618-022-00837-0.
7.

Abstract
We offer a natural and extensible measure-theoretic treatment of missingness at random. Within the standard missing-data framework, we give a novel characterization of the observed data as a stopping-set sigma algebra. We demonstrate that the usual missingness-at-random conditions are equivalent to requiring particular stochastic processes to be adapted to a set-indexed filtration. These measurability conditions ensure the usual factorization of likelihood ratios. We illustrate how the theory can be extended easily to incorporate explanatory variables, to describe longitudinal data in continuous time, and to admit more general coarsening of observations.
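The "usual factorization of likelihood ratios" referred to here is, in standard (non-measure-theoretic) missing-data notation, the following. Write $Y = (Y_{\mathrm{obs}}, Y_{\mathrm{mis}})$ for the complete data, $R$ for the missingness indicators, and $\theta$, $\psi$ for the parameters of the data and missingness models:

```latex
% Observed-data likelihood: integrate the complete-data model over y_mis.
f(y_{\mathrm{obs}}, r \mid \theta, \psi)
  = \int f(y_{\mathrm{obs}}, y_{\mathrm{mis}} \mid \theta)\,
         f(r \mid y_{\mathrm{obs}}, y_{\mathrm{mis}}, \psi)\,
         \mathrm{d}y_{\mathrm{mis}}
% Under MAR, f(r | y_obs, y_mis, psi) = f(r | y_obs, psi), so it factors out:
  = f(r \mid y_{\mathrm{obs}}, \psi)\, f(y_{\mathrm{obs}} \mid \theta)
```

Provided $\theta$ and $\psi$ are distinct, likelihood ratios in $\theta$ are then free of the missingness model (ignorability); the paper's contribution is to derive this factorization from measurability conditions on set-indexed filtrations rather than from the MAR condition stated pointwise.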
Affiliation(s)
- D M Farewell: Division of Population Medicine, School of Medicine, College of Biomedical and Life Sciences, Cardiff University, Cardiff CF14 4YS, U.K.
- R M Daniel: Division of Population Medicine, School of Medicine, College of Biomedical and Life Sciences, Cardiff University, Cardiff CF14 4YS, U.K.
- S R Seaman: MRC Biostatistics Unit, University of Cambridge, Robinson Way, Cambridge CB2 0SR, U.K.
8.

Affiliation(s)
- Karthika Mohan: Department of Computer Science, University of California Berkeley, Berkeley, CA
- Judea Pearl: Department of Computer Science, University of California Los Angeles, Los Angeles, CA
9. Noghrehchi F, Stoklosa J, Penev S, Warton DI. Selecting the model for multiple imputation of missing data: Just use an IC! Stat Med 2021; 40:2467-2497. PMID: 33629367; PMCID: PMC8248419; DOI: 10.1002/sim.8915.
Abstract
Multiple imputation and maximum likelihood estimation (via the expectation-maximization algorithm) are two well-known methods readily used for analyzing data with missing values. While these two methods are often considered distinct from one another, multiple imputation (when using improper imputation) is actually equivalent to a stochastic expectation-maximization approximation to the likelihood. In this article, we exploit this key result to show that familiar likelihood-based approaches to model selection, such as Akaike's information criterion (AIC) and the Bayesian information criterion (BIC), can be used to choose the imputation model that best fits the observed data. Poor choice of imputation model is known to bias inference, and while sensitivity analysis has often been used to explore the implications of different imputation models, we show that the data can be used to choose an appropriate imputation model via conventional model selection tools. We show that BIC can be consistent for selecting the correct imputation model in the presence of missing data. We verify these results empirically through simulation studies, and demonstrate their practicality on two classical missing data examples. An interesting result from our simulations is that parameter estimates can be biased not only by misspecifying the imputation model but also by overfitting it. This emphasizes the importance of using model selection not just to choose the appropriate type of imputation model, but also to decide on the appropriate level of imputation model complexity.
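A toy numpy illustration of the central idea: compare candidate imputation models of different complexity by an information criterion computed from the observed records. The data-generating model, MCAR mechanism, and variable names are assumptions of this sketch, not taken from the paper; the paper's full treatment works with the observed-data likelihood via stochastic EM rather than this simplified complete-record fit.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(size=n)  # truth: quadratic in x
y[rng.random(n) < 0.3] = np.nan                      # y missing completely at random
obs = ~np.isnan(y)
xo, yo = x[obs], y[obs]

def bic(design, resp):
    """BIC of a normal linear model fit by least squares (ML for the coefficients)."""
    beta, *_ = np.linalg.lstsq(design, resp, rcond=None)
    resid = resp - design @ beta
    m = len(resp)
    sigma2 = resid @ resid / m                       # ML variance estimate
    loglik = -0.5 * m * (np.log(2 * np.pi * sigma2) + 1)
    k = design.shape[1] + 1                          # coefficients + variance
    return k * np.log(m) - 2 * loglik

bic_lin = bic(np.column_stack([np.ones(obs.sum()), xo]), yo)
bic_quad = bic(np.column_stack([np.ones(obs.sum()), xo, xo**2]), yo)
print(f"BIC linear = {bic_lin:.1f}, BIC quadratic = {bic_quad:.1f}")
# The quadratic candidate should score lower here, since the truth is quadratic.
```

BIC's complexity penalty is what guards against the overfitting effect the abstract highlights: a richer imputation model only wins if its likelihood gain exceeds the penalty.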
Affiliation(s)
- Firouzeh Noghrehchi: Discipline of Biomedical Informatics and Digital Health, The University of Sydney, Sydney, New South Wales, Australia
- Jakub Stoklosa: School of Mathematics and Statistics, The University of New South Wales, Sydney, New South Wales, Australia; Evolution and Ecology Research Centre, The University of New South Wales, Sydney, New South Wales, Australia
- Spiridon Penev: School of Mathematics and Statistics, The University of New South Wales, Sydney, New South Wales, Australia
- David I Warton: School of Mathematics and Statistics, The University of New South Wales, Sydney, New South Wales, Australia; Evolution and Ecology Research Centre, The University of New South Wales, Sydney, New South Wales, Australia
10. Ali M, Kauermann G. A split questionnaire survey design in the context of statistical matching. Stat Methods Appl 2021. DOI: 10.1007/s10260-020-00554-2.
11. Moreno-Betancur M, Lee KJ, Leacy FP, White IR, Simpson JA, Carlin JB. Canonical Causal Diagrams to Guide the Treatment of Missing Data in Epidemiologic Studies. Am J Epidemiol 2018; 187:2705-2715. PMID: 30124749; PMCID: PMC6269242; DOI: 10.1093/aje/kwy173.
Abstract
With incomplete data, the “missing at random” (MAR) assumption is widely understood to enable unbiased estimation with appropriate methods. While the need to assess the plausibility of MAR and to perform sensitivity analyses considering “missing not at random” (MNAR) scenarios has been emphasized, the practical difficulty of these tasks is rarely acknowledged. With multivariable missingness, what MAR means is difficult to grasp, and in many MNAR scenarios unbiased estimation is possible using methods commonly associated with MAR. Directed acyclic graphs (DAGs) have been proposed as an alternative framework for specifying practically accessible assumptions beyond the MAR-MNAR dichotomy. However, there is currently no general algorithm for deciding how to handle the missing data given a specific DAG. Here we construct “canonical” DAGs capturing typical missingness mechanisms in epidemiologic studies with incomplete data on exposure, outcome, and confounding factors. For each DAG, we determine whether common target parameters are “recoverable,” meaning that they can be expressed as functions of the available data distribution and thus estimated consistently, or whether sensitivity analyses are necessary. We investigate the performance of available-case and multiple-imputation procedures. Using data from waves 1–3 of the Longitudinal Study of Australian Children (2004–2008), we illustrate how our findings can guide the treatment of missing data in point-exposure studies.
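A numerical illustration of recoverability (a constructed example, not one of the paper's canonical DAGs or its data): when the only causes of missingness in the outcome are the exposure and a fully observed confounder, the exposure effect is recoverable from the available cases by confounder-adjusted regression, even though the mechanism is not MCAR.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
c = rng.normal(size=n)                                    # confounder, fully observed
x = (rng.random(n) < 1 / (1 + np.exp(-c))).astype(float)  # exposure, depends on c
y = 1.0 + 2.0 * x + 1.5 * c + rng.normal(size=n)          # outcome; true effect = 2

# Missingness in y depends only on x and c, not on y itself:
p_miss = 1 / (1 + np.exp(-(x + 0.5 * c - 1.0)))
obs = rng.random(n) > p_miss

# Available-case regression of y on x and c recovers the effect in this scenario:
design = np.column_stack([np.ones(obs.sum()), x[obs], c[obs]])
beta, *_ = np.linalg.lstsq(design, y[obs], rcond=None)
print(f"estimated exposure effect from available cases: {beta[1]:.2f} (truth 2.00)")
```

If missingness in the outcome additionally depended on the outcome itself, this available-case estimate would generally be biased and the target would not be recoverable without further assumptions, which is where the sensitivity analyses discussed in the abstract become necessary.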
Affiliation(s)
- Margarita Moreno-Betancur: Clinical Epidemiology and Biostatistics Unit, Murdoch Children's Research Institute, Melbourne, Victoria, Australia; Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, University of Melbourne, Melbourne, Victoria, Australia
- Katherine J Lee: Clinical Epidemiology and Biostatistics Unit, Murdoch Children's Research Institute, Melbourne, Victoria, Australia; Department of Paediatrics, Melbourne Medical School, University of Melbourne, Melbourne, Victoria, Australia
- Finbarr P Leacy: Data Science Centre, Royal College of Surgeons in Ireland, Dublin, Ireland
- Ian R White: MRC Clinical Trials Unit, London, United Kingdom
- Julie A Simpson: Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, University of Melbourne, Melbourne, Victoria, Australia
- John B Carlin: Clinical Epidemiology and Biostatistics Unit, Murdoch Children's Research Institute, Melbourne, Victoria, Australia; Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, University of Melbourne, Melbourne, Victoria, Australia