1
Afkanpour M, Hosseinzadeh E, Tabesh H. Identify the most appropriate imputation method for handling missing values in clinical structured datasets: a systematic review. BMC Med Res Methodol 2024; 24:188. [PMID: 39198744 PMCID: PMC11351057 DOI: 10.1186/s12874-024-02310-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2024] [Accepted: 08/19/2024] [Indexed: 09/01/2024] Open
Abstract
BACKGROUND AND OBJECTIVES Comprehending the research dataset is crucial for obtaining reliable and valid outcomes. Health analysts must have a deep comprehension of the data being analyzed. This comprehension allows them to suggest practical solutions for handling missing data in a clinical data source. Accurate handling of missing values is critical for producing precise estimates and making informed decisions, especially in crucial areas like clinical research. With data's increasing diversity and complexity, numerous scholars have developed a range of imputation techniques. To address this, we conducted a systematic review to introduce various imputation techniques based on tabular dataset characteristics, including the mechanism, pattern, and ratio of missingness, to identify the most appropriate imputation methods in the healthcare field. MATERIALS AND METHODS We searched four information databases, namely PubMed, Web of Science, Scopus, and IEEE Xplore, for articles published up to September 20, 2023, that discussed imputation methods for addressing missing values in a structured clinical dataset. Our investigation of selected articles focused on four key aspects: the mechanism, pattern, and ratio of missingness, and the imputation strategies used. By synthesizing insights from these perspectives, we constructed an evidence map to recommend suitable imputation methods for handling missing values in a tabular dataset. RESULTS Out of 2955 articles, 58 were included in the analysis. The evidence map, built from the structure of the missing values and the types of imputation methods reported in these studies, revealed that 45% of the studies employed conventional statistical methods, 31% utilized machine learning and deep learning methods, and 24% applied hybrid imputation techniques for handling missing values.
CONCLUSION Considering the structure and characteristics of missing values in a clinical dataset is essential for choosing the most appropriate data imputation technique, especially within conventional statistical methods. Accurately estimating missing values to reflect reality enhances the likelihood of obtaining high-quality and reusable data, contributing significantly to precise medical decision-making processes. Performing this review study creates a guideline for choosing the most appropriate imputation methods in data preprocessing stages to perform analytical processes on structured clinical datasets.
Affiliation(s)
- Marziyeh Afkanpour
  - Department of Medical Informatics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- Elham Hosseinzadeh
  - Department of Medical Informatics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- Hamed Tabesh
  - Department of Medical Informatics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
2
Li Y, Zhou Q, Fan Y, Pan G, Dai Z, Lei B. A novel machine learning-based imputation strategy for missing data in step-stress accelerated degradation test. Heliyon 2024; 10:e26429. [PMID: 38434061 PMCID: PMC10906311 DOI: 10.1016/j.heliyon.2024.e26429] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2023] [Revised: 11/25/2023] [Accepted: 02/13/2024] [Indexed: 03/05/2024] Open
Abstract
The presence of missing data is a significant data quality issue that negatively impacts the accuracy and reliability of data analysis. This issue is especially relevant in the context of accelerated tests, particularly for step-stress accelerated degradation tests. While missing data can occur due to objective factors or human error, high missing rate is an inevitable pattern of missing data that will occur during the conversion process of accelerated test data. This type of missing data manifests as a degradation dataset with unequal measuring intervals. Therefore, developing a more appropriate imputation method for accelerated test data is essential. In this study, we propose a novel hybrid imputation method that combines the LSSVM and RBF models to address missing data problems. A comparison is conducted between the proposed model and various traditional and machine learning imputation methods using simulation data, to justify the advantages of the proposed model over the existing methods. Finally, the proposed model is implemented on real degradation datasets of the super-luminescent diode (SLD) to validate its performance and effectiveness in dealing with missing data in step-stress accelerated degradation test. Additionally, due to the generalizability of the proposed method, it is expected to be applicable in other scenarios with high missing data rates.
Affiliation(s)
- Yaqiu Li
  - China Electronic Product Reliability and Environmental Testing Research Institute, No. 76, West Zhucun Avenue, Guangzhou, China
  - Key Laboratory of Active Medical Devices Quality & Reliability Management and Assessment, No. 76, West Zhucun Avenue, Guangzhou, China
- Qijie Zhou
  - China Electronic Product Reliability and Environmental Testing Research Institute, No. 76, West Zhucun Avenue, Guangzhou, China
  - Key Laboratory of Active Medical Devices Quality & Reliability Management and Assessment, No. 76, West Zhucun Avenue, Guangzhou, China
- Ye Fan
  - Beijing Institute of Structure and Environment Engineer, No. 1, South Dahongmen Avenue, Beijing, China
- Guangze Pan
  - China Electronic Product Reliability and Environmental Testing Research Institute, No. 76, West Zhucun Avenue, Guangzhou, China
  - Guangdong Provincial Key Laboratory of Electronic Information Products Reliability Technology, No. 76, West Zhucun Avenue, Guangzhou, China
- Zongbei Dai
  - China Electronic Product Reliability and Environmental Testing Research Institute, No. 76, West Zhucun Avenue, Guangzhou, China
- Baimao Lei
  - China Electronic Product Reliability and Environmental Testing Research Institute, No. 76, West Zhucun Avenue, Guangzhou, China
3
Riley RD, Ensor J, Hattle M, Papadimitropoulou K, Morris TP. Two-stage or not two-stage? That is the question for IPD meta-analysis projects. Res Synth Methods 2023; 14:903-910. [PMID: 37606180 PMCID: PMC7615283 DOI: 10.1002/jrsm.1661] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Revised: 06/27/2023] [Accepted: 07/22/2023] [Indexed: 08/23/2023]
Abstract
Individual participant data meta-analysis (IPDMA) projects obtain, check, harmonise and synthesise raw data from multiple studies. When undertaking the meta-analysis, researchers must decide between a two-stage or a one-stage approach. In a two-stage approach, the IPD are first analysed separately within each study to obtain aggregate data (e.g., treatment effect estimates and standard errors); then, in the second stage, these aggregate data are combined in a standard meta-analysis model (e.g., common-effect or random-effects). In a one-stage approach, the IPD from all studies are analysed in a single step using an appropriate model that accounts for clustering of participants within studies and, potentially, between-study heterogeneity (e.g., a general or generalised linear mixed model). The best approach to take is debated in the literature, and so here we provide clearer guidance for a broad audience. Both approaches are important tools for IPDMA researchers and neither are a panacea. If most studies in the IPDMA are small (few participants or events), a one-stage approach is recommended due to using a more exact likelihood. However, in other situations, researchers can choose either approach, carefully following best practice. Some previous claims recommending to always use a one-stage approach are misleading, and the two-stage approach will often suffice for most researchers. When differences do arise between the two approaches, often it is caused by researchers using different modelling assumptions or estimation methods, rather than using one or two stages per se.
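As a rough illustration of the second stage described in the abstract above, the following sketch pools per-study estimates with inverse-variance weights, once under a common-effect model and once under a random-effects model. The DerSimonian-Laird moment estimator for the between-study variance is chosen here for brevity and is an assumption, not something the abstract prescribes. The function names and the example numbers are hypothetical, and the first stage (analysing the IPD within each study) is assumed to have already produced the estimates and standard errors.

```python
import math


def pool_common_effect(estimates, std_errors):
    """Inverse-variance common-effect pooling of per-study estimates."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)


def pool_random_effects(estimates, std_errors):
    """Random-effects pooling with a DerSimonian-Laird tau^2 estimate."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled_ce, _ = pool_common_effect(estimates, std_errors)
    # Cochran's Q measures the spread of study estimates around the pooled value.
    q = sum(w * (y - pooled_ce) ** 2 for w, y in zip(weights, estimates))
    k = len(estimates)
    c = total - sum(w ** 2 for w in weights) / total
    # Moment estimate of between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    re_weights = [1.0 / (se ** 2 + tau2) for se in std_errors]
    re_total = sum(re_weights)
    pooled = sum(w * y for w, y in zip(re_weights, estimates)) / re_total
    return pooled, math.sqrt(1.0 / re_total), tau2


if __name__ == "__main__":
    # Hypothetical stage-one output: treatment effect and SE from each study.
    effects = [0.20, 0.40, 0.35]
    ses = [0.10, 0.10, 0.15]
    print(pool_common_effect(effects, ses))
    print(pool_random_effects(effects, ses))
```

A one-stage analysis would instead fit a single mixed model to all participants, which needs a modelling library and is not sketched here.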
Affiliation(s)
- Richard D. Riley
  - Institute of Applied Health Research, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
- Joie Ensor
  - Institute of Applied Health Research, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
- Miriam Hattle
  - Institute of Applied Health Research, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
  - School of Medicine, Keele University, Keele, Staffordshire, UK
- Tim P. Morris
  - MRC Clinical Trials Unit at UCL, Institute of Clinical Trials and Methodology, UCL, London, UK
4
Markle-Reid M, Fisher K, Walker KM, Beauchamp M, Cameron JI, Dayler D, Fleck R, Gafni A, Ganann R, Hajas K, Koetsier B, Mahony R, Pollard C, Prescott J, Rooke T, Whitmore C. The stroke transitional care intervention for older adults with stroke and multimorbidity: a multisite pragmatic randomized controlled trial. BMC Geriatr 2023; 23:687. [PMID: 37872479 PMCID: PMC10594728 DOI: 10.1186/s12877-023-04403-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2023] [Accepted: 10/12/2023] [Indexed: 10/25/2023] Open
Abstract
BACKGROUND This study aimed to test, in real-world clinical practice, the effectiveness of a Transitional Care Stroke Intervention (TCSI) compared to usual care on health outcomes, self-management, patient experience, and health and social service use costs in older adults (≥ 55 years) with stroke and multimorbidity (≥ 2 chronic conditions). METHODS This pragmatic randomized controlled trial (RCT) included older adults discharged from hospital to community with stroke and multimorbidity using outpatient stroke rehabilitation services in two communities in Ontario, Canada. Participants were randomized 1:1 to usual care (control group) or usual care plus the 6-month TCSI (intervention group). The TCSI was delivered virtually by an interprofessional (IP) team, and included care coordination/system navigation support, phone/video visits, monthly IP team conferences, and an online resource to support system navigation. The primary outcome was risk of hospital readmission (all cause) after six months. Secondary outcomes included physical and mental functioning, stroke self-management, patient experience, and health and social service use costs. The intention-to-treat principle was used to conduct the primary and secondary analyses. RESULTS Ninety participants were enrolled (44 intervention, 46 control); 11 (12%) participants were lost to follow-up, leaving 79 (39 intervention, 40 control). No significant between-group differences were seen for baseline to six-month risk of hospital readmission. Differences favouring the intervention group were seen in the following secondary outcomes: physical functioning (SF-12 PCS mean difference: 5.10; 95% CI: 1.58-8.62, p = 0.005), stroke self-management (Southampton Stroke Self-Management Questionnaire mean difference: 6.00; 95% CI: 0.51-11.50, p = 0.03), and patient experience (Person-Centred Coordinated Care Experiences Questionnaire mean difference: 2.64; 95% CI: 0.81-4.47, p = 0.005).
No between-group differences were found in total healthcare costs or other secondary outcomes. CONCLUSIONS Although participation in the TCSI did not impact hospital readmissions, there were improvements in physical functioning, stroke self-management and patient experience in older adults with stroke and multimorbidity without increasing total healthcare costs. Challenges associated with the COVID-19 pandemic, including the shift from in-person to virtual delivery, and re-deployment of interventionists could have influenced the results. A larger pragmatic RCT is needed to determine intervention effectiveness in diverse geographic settings and ethno-cultural populations and examine intervention scalability. TRIAL REGISTRATION ClinicalTrials.gov Identifier: NCT04278794. Registered May 2, 2020.
Affiliation(s)
- Maureen Markle-Reid
  - School of Nursing, Faculty of Health Sciences, McMaster University, 1280 Main Street West, Room HSc3N25, Hamilton, ON, L8S 4K1, Canada
  - Health Research Methods, Department of Health, Evidence and Impact, Faculty of Health Sciences, and the Centre of Health Economics and Policy Analysis, McMaster University, 1280 Main Street West, HSC 2C, Hamilton, ON, L8S 4K1, Canada
  - Aging, Community and Health Research Unit, School of Nursing, Faculty of Health Sciences, McMaster University, 1280 Main Street West, Room HSc3N25, Hamilton, ON, L8S 4K1, Canada
  - McMaster Institute for Research On Aging, McMaster University, Hamilton, ON, Canada
- Kathryn Fisher
  - School of Nursing, Faculty of Health Sciences, McMaster University, 1280 Main Street West, Room HSc3N25, Hamilton, ON, L8S 4K1, Canada
  - Aging, Community and Health Research Unit, School of Nursing, Faculty of Health Sciences, McMaster University, 1280 Main Street West, Room HSc3N25, Hamilton, ON, L8S 4K1, Canada
  - McMaster Institute for Research On Aging, McMaster University, Hamilton, ON, Canada
- Kimberly M Walker
  - Aging, Community and Health Research Unit, School of Nursing, Faculty of Health Sciences, McMaster University, 1280 Main Street West, Room HSc3N25, Hamilton, ON, L8S 4K1, Canada
  - Upstream Lab, MAP Centre for Urban Health Solutions, St. Michael's Hospital, 209 Victoria Street, Toronto, ON, M5B 1T8, Canada
- Marla Beauchamp
  - McMaster Institute for Research On Aging, McMaster University, Hamilton, ON, Canada
- Jill I Cameron
  - Department of Occupational Science and Occupational Therapy, Rehabilitation Sciences Institute, Temerty Faculty of Medicine, University of Toronto, 160-500 University Ave, Toronto, ON, M5V 1V7, Canada
- David Dayler
  - Aging, Community and Health Research Unit, School of Nursing, Faculty of Health Sciences, McMaster University, 1280 Main Street West, Room HSc3N25, Hamilton, ON, L8S 4K1, Canada
- Rebecca Fleck
  - Rehabilitation Program, Parkwood Institute, St. Joseph's Health Care London, 268 Grosvenor Street, London, ON, N6A 4V2, Canada
- Amiram Gafni
  - Health Research Methods, Department of Health, Evidence and Impact, Faculty of Health Sciences, and the Centre of Health Economics and Policy Analysis, McMaster University, 1280 Main Street West, HSC 2C, Hamilton, ON, L8S 4K1, Canada
  - Aging, Community and Health Research Unit, School of Nursing, Faculty of Health Sciences, McMaster University, 1280 Main Street West, Room HSc3N25, Hamilton, ON, L8S 4K1, Canada
- Rebecca Ganann
  - School of Nursing, Faculty of Health Sciences, McMaster University, 1280 Main Street West, Room HSc3N25, Hamilton, ON, L8S 4K1, Canada
  - Aging, Community and Health Research Unit, School of Nursing, Faculty of Health Sciences, McMaster University, 1280 Main Street West, Room HSc3N25, Hamilton, ON, L8S 4K1, Canada
  - McMaster Institute for Research On Aging, McMaster University, Hamilton, ON, Canada
- Ken Hajas
  - Aging, Community and Health Research Unit, School of Nursing, Faculty of Health Sciences, McMaster University, 1280 Main Street West, Room HSc3N25, Hamilton, ON, L8S 4K1, Canada
- Barbara Koetsier
  - Aging, Community and Health Research Unit, School of Nursing, Faculty of Health Sciences, McMaster University, 1280 Main Street West, Room HSc3N25, Hamilton, ON, L8S 4K1, Canada
- Robert Mahony
  - Aging, Community and Health Research Unit, School of Nursing, Faculty of Health Sciences, McMaster University, 1280 Main Street West, Room HSc3N25, Hamilton, ON, L8S 4K1, Canada
- Chris Pollard
  - Hotel Dieu Shaver Health and Rehabilitation Centre, 541 Glenridge Ave, St. Catharines, ON, L2T 4C2, Canada
- Jim Prescott
  - Aging, Community and Health Research Unit, School of Nursing, Faculty of Health Sciences, McMaster University, 1280 Main Street West, Room HSc3N25, Hamilton, ON, L8S 4K1, Canada
- Tammy Rooke
  - CarePartners, 139 Washburn Drive, Kitchener, ON, N2R 1S1, Canada
- Carly Whitmore
  - School of Nursing, Faculty of Health Sciences, McMaster University, 1280 Main Street West, Room HSc3N25, Hamilton, ON, L8S 4K1, Canada
  - Aging, Community and Health Research Unit, School of Nursing, Faculty of Health Sciences, McMaster University, 1280 Main Street West, Room HSc3N25, Hamilton, ON, L8S 4K1, Canada
  - McMaster Institute for Research On Aging, McMaster University, Hamilton, ON, Canada
5
Dong M, Mitani A. Multiple imputation methods for missing multilevel ordinal outcomes. BMC Med Res Methodol 2023; 23:112. [PMID: 37161419 PMCID: PMC10169455 DOI: 10.1186/s12874-023-01909-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2022] [Accepted: 03/31/2023] [Indexed: 05/11/2023] Open
Abstract
BACKGROUND Multiple imputation (MI) is an established technique for handling missing data in observational studies. Joint modelling (JM) and fully conditional specification (FCS) are commonly used methods for imputing multilevel data. However, MI methods for multilevel ordinal outcome variables have not been well studied, especially when cluster size is informative on the outcome. The purpose of this study is to describe and compare different MI strategies for dealing with multilevel ordinal outcomes when informative cluster size (ICS) exists. METHODS We conducted comprehensive Monte Carlo simulation studies to compare the performance of five strategies: complete case analysis (CCA), FCS, FCS+CS (including cluster size (CS) in the imputation model), JM, and JM+CS under various scenarios. We evaluated their performance using a proportional odds logistic regression model estimated with cluster weighted generalized estimating equations (CWGEE). RESULTS The simulation results showed that including CS in the imputation model can significantly improve estimation accuracy when ICS exists. FCS provided more accurate and robust estimation than JM, followed by CCA for multilevel ordinal outcomes. We further applied these strategies to a real dental study to assess the association between metabolic syndrome and clinical attachment loss scores. The results based on FCS + CS indicated that the power of the analysis would increase after carrying out the appropriate MI strategy. CONCLUSIONS MI is an effective tool to increase the accuracy and power of the downstream statistical analysis for missing ordinal outcomes. FCS slightly outperforms JM when imputing multilevel ordinal outcomes. When there is plausible ICS, we recommend including CS in the imputation phase.
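For context on how any MI strategy feeds the downstream analysis, the sketch below shows the conventional pooling step, Rubin's rules, which combines the estimates obtained from each imputed dataset. It is a generic illustration and not the authors' code: the function name and the example numbers are hypothetical, and it assumes the analysis model has already been fitted to each of the m imputed datasets.

```python
import math


def rubin_pool(estimates, variances):
    """Combine m per-imputation estimates and their variances (Rubin's rules).

    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    if m < 2:
        raise ValueError("Rubin's rules need at least two imputed datasets")
    q_bar = sum(estimates) / m                       # pooled point estimate
    within = sum(variances) / m                      # average within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1.0 + 1.0 / m) * between       # total variance
    return q_bar, total


if __name__ == "__main__":
    # Hypothetical log odds ratios and squared SEs from five imputed datasets.
    est = [0.42, 0.39, 0.47, 0.44, 0.40]
    var = [0.010, 0.011, 0.009, 0.010, 0.012]
    pooled, total_var = rubin_pool(est, var)
    print(pooled, math.sqrt(total_var))
```

The between-imputation term is what carries the extra uncertainty due to the missing values; dropping it would understate the standard error.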
Affiliation(s)
- Mei Dong
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
- Aya Mitani
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Canada
6
Gunn HJ, Rezvan PH, Fernández MI, Comulada WS. How to apply variable selection machine learning algorithms with multiply imputed data: A missing discussion. Psychol Methods 2023; 28:452-471. [PMID: 35113633 PMCID: PMC10117422 DOI: 10.1037/met0000478] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 11/08/2022]
Abstract
Psychological researchers often use standard linear regression to identify relevant predictors of an outcome of interest, but challenges emerge with incomplete data and growing numbers of candidate predictors. Regularization methods like the LASSO can reduce the risk of overfitting, increase model interpretability, and improve prediction in future samples; however, handling missing data when using regularization-based variable selection methods is complicated. Using listwise deletion or an ad hoc imputation strategy to deal with missing data when using regularization methods can lead to loss of precision, substantial bias, and a reduction in predictive ability. In this tutorial, we describe three approaches for fitting a LASSO when using multiple imputation to handle missing data and illustrate how to implement these approaches in practice with an applied example. We discuss implications of each approach and describe additional research that would help solidify recommendations for best practices. (PsycInfo Database Record (c) 2023 APA, all rights reserved).
Affiliation(s)
- Heather J. Gunn
  - Department of Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota, United States
- Panteha Hayati Rezvan
  - Department of Psychiatry and Biobehavioral Sciences, University of California, Los Angeles
- W. Scott Comulada
  - Department of Psychiatry and Biobehavioral Sciences, University of California, Los Angeles
7
Quartagno M, Carpenter JR. Substantive model compatible multilevel multiple imputation: A joint modeling approach. Stat Med 2022; 41:5000-5015. [PMID: 35959539 DOI: 10.1002/sim.9549] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2022] [Revised: 05/03/2022] [Accepted: 07/25/2022] [Indexed: 01/07/2023]
Abstract
BACKGROUND Substantive model compatible multiple imputation (SMC-MI) is a relatively novel imputation method that is particularly useful when the analyst's model includes interactions, non-linearities, and/or partially observed random slope variables. METHODS Here we thoroughly investigate a SMC-MI strategy based on joint modeling of the covariates of the analysis model. We provide code to apply the proposed strategy and we perform an extensive simulation work to test it in various circumstances. We explore the impact on the results of various factors, including whether the missing data are at the individual or cluster level, whether there are non-linearities and whether the imputation model is correctly specified. Finally, we apply the imputation methods to the motivating example data. RESULTS SMC-JM appears to be superior to standard JM imputation, particularly in presence of large variation in random slopes, non-linearities, and interactions. Results seem to be robust to slight mis-specification of the imputation model for the covariates. When imputing level 2 data, enough clusters have to be observed in order to obtain unbiased estimates of the level 2 parameters. CONCLUSIONS SMC-JM is preferable to standard JM imputation in presence of complexities in the analysis model of interest, such as non-linearities or random slopes.
Affiliation(s)
- Matteo Quartagno
  - Institute for Clinical Trials and Methodology, University College London, London, UK
- James R Carpenter
  - Institute for Clinical Trials and Methodology, University College London, London, UK
  - Department of Medical Statistics, London School of Hygiene and Tropical Medicine, London, UK
8
Rezvan PH, Comulada WS, Fernández MI, Belin TR. Assessing Alternative Imputation Strategies for Infrequently Missing Items on Multi-item Scales. Communications in Statistics: Case Studies, Data Analysis and Applications 2022; 8:682-713. [PMID: 36467970 PMCID: PMC9718541 DOI: 10.1080/23737484.2022.2115430] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/17/2023]
Abstract
Health-science researchers often measure psychological constructs using multi-item scales and encounter missing items on some participants. Multiple imputation (MI) has emerged as an alternative to ad-hoc methods (e.g., mean substitution) for handling incomplete data on multi-item scales, appealingly reflecting available information while accounting for uncertainty due to missing values in a unified inferential framework. However, MI can be implemented in a variety of ways. When the number of variables to impute gets large, some strategies yield unstable estimates of quantities of interest while others are not technically feasible to implement. These considerations raise pragmatic questions about the extent to which ad-hoc procedures would yield statistical properties that are competitive with theoretically motivated methods. Drawing on an HIV study where depression and anxiety symptoms are measured with multi-item scales, this empirical investigation contrasts ad-hoc methods for handling missing items with various MI implementations that differ as to whether imputation is at the item-level or scale-level and how auxiliary variables are incorporated. While the findings are consistent with previous reports favoring item-level imputation when feasible to implement, we found only subtle differences in statistical properties across procedures, suggesting that weaknesses of ad-hoc procedures may be muted when missing data percentages are modest.
Affiliation(s)
- Panteha Hayati Rezvan
  - Department of Psychiatry and Biobehavioral Sciences, University of California, Los Angeles, U.S.A
- W. Scott Comulada
  - Department of Psychiatry and Biobehavioral Sciences, University of California, Los Angeles, U.S.A
  - Department of Health Policy and Management, UCLA Fielding School of Public Health, Los Angeles, California, U.S.A
- M. Isabel Fernández
  - College of Osteopathic Medicine, Nova Southeastern University, Miami, Florida, U.S.A
- Thomas R. Belin
  - Department of Psychiatry and Biobehavioral Sciences, University of California, Los Angeles, U.S.A
  - Department of Biostatistics, UCLA Fielding School of Public Health, Los Angeles, California, U.S.A
9
Nguyen CD, Moreno-Betancur M, Rodwell L, Romaniuk H, Carlin JB, Lee KJ. Multiple imputation of semi-continuous exposure variables that are categorized for analysis. Stat Med 2021; 40:6093-6106. [PMID: 34423450 DOI: 10.1002/sim.9172] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2020] [Revised: 08/05/2021] [Accepted: 08/06/2021] [Indexed: 11/11/2022]
Abstract
Semi-continuous variables are characterized by a point mass at one value and a continuous range of values for remaining observations. An example is alcohol consumption quantity, with a spike of zeros representing non-drinkers and positive values for drinkers. If multiple imputation is used to handle missing values for semi-continuous variables, it is unclear how this should be implemented within the standard approaches of fully conditional specification (FCS) and multivariate normal imputation (MVNI). This question is brought into focus by the use of categorized versions of semi-continuous exposure variables in analyses (eg, no drinking, drinking below binge level, binge drinking, heavy binge drinking), raising the question of how best to achieve congeniality between imputation and analysis models. We performed a simulation study comparing nine approaches for imputing semi-continuous exposures requiring categorization for analysis. Three methods imputed the categories directly: ordinal logistic regression, and imputation of binary indicator variables representing the categories using MVNI (with two variants). Six methods (predictive mean matching, zero-inflated binomial imputation, and two-part imputation methods with variants in FCS and MVNI) imputed the semi-continuous variable, with categories derived after imputation. The ordinal and zero-inflated binomial methods had good performance across most scenarios, while MVNI methods requiring rounding after imputation did not perform well. There were mixed results for predictive mean matching and the two-part methods, depending on whether the estimands were proportions or regression coefficients. The results highlight the need to consider the parameter of interest when selecting an imputation procedure.
Affiliation(s)
- Cattram D Nguyen
  - Clinical Epidemiology and Biostatistics Unit, Murdoch Children's Research Institute, Parkville, Victoria, Australia
  - Department of Paediatrics, Faculty of Medicine, Dentistry and Health Sciences, The University of Melbourne, Melbourne, Victoria, Australia
- Margarita Moreno-Betancur
  - Clinical Epidemiology and Biostatistics Unit, Murdoch Children's Research Institute, Parkville, Victoria, Australia
  - Department of Paediatrics, Faculty of Medicine, Dentistry and Health Sciences, The University of Melbourne, Melbourne, Victoria, Australia
- Laura Rodwell
  - Clinical Epidemiology and Biostatistics Unit, Murdoch Children's Research Institute, Parkville, Victoria, Australia
  - Department of Paediatrics, Faculty of Medicine, Dentistry and Health Sciences, The University of Melbourne, Melbourne, Victoria, Australia
  - Department for Health Evidence, Radboud Institute for Health Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Helena Romaniuk
  - Biostatistics Unit, Faculty of Health, Deakin University, Geelong, Victoria, Australia
- John B Carlin
  - Clinical Epidemiology and Biostatistics Unit, Murdoch Children's Research Institute, Parkville, Victoria, Australia
  - Department of Paediatrics, Faculty of Medicine, Dentistry and Health Sciences, The University of Melbourne, Melbourne, Victoria, Australia
- Katherine J Lee
  - Clinical Epidemiology and Biostatistics Unit, Murdoch Children's Research Institute, Parkville, Victoria, Australia
  - Department of Paediatrics, Faculty of Medicine, Dentistry and Health Sciences, The University of Melbourne, Melbourne, Victoria, Australia
10
Smith MJ, Belot A, Quartagno M, Luque Fernandez MA, Bonaventure A, Gachau S, Benitez Majano S, Rachet B, Njagi EN. Excess Mortality by Multimorbidity, Socioeconomic, and Healthcare Factors, amongst Patients Diagnosed with Diffuse Large B-Cell or Follicular Lymphoma in England. Cancers (Basel) 2021; 13:5805. [PMID: 34830964 PMCID: PMC8616469 DOI: 10.3390/cancers13225805] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/01/2021] [Revised: 11/10/2021] [Accepted: 11/16/2021] [Indexed: 12/22/2022] Open
Abstract
(1) Background: Socioeconomic inequalities of survival in patients with lymphoma persist, which may be explained by patients' comorbidities. We aimed to assess the association between comorbidities and the survival of patients diagnosed with diffuse large B-cell (DLBCL) or follicular lymphoma (FL) in England accounting for other socio-demographic characteristics. (2) Methods: Population-based cancer registry data were linked to Hospital Episode Statistics. We used a flexible multilevel excess hazard model to estimate excess mortality and net survival by patient's comorbidity status, adjusted for sociodemographic, economic, and healthcare factors, and accounting for the patient's area of residence. We used the latent normal joint modelling multiple imputation approach for missing data. (3) Results: Overall, 15,516 and 29,898 patients were diagnosed with FL and DLBCL in England between 2005 and 2013, respectively. Amongst DLBCL and FL patients, respectively, those in the most deprived areas showed 1.22 (95% confidence interval (CI): 1.18-1.27) and 1.45 (95% CI: 1.30-1.62) times higher excess mortality hazard compared to those in the least deprived areas, adjusted for comorbidity status, age at diagnosis, sex, ethnicity, and route to diagnosis. (4) Conclusions: Deprivation is consistently associated with poorer survival among patients diagnosed with DLBCL or FL, after adjusting for co/multimorbidities. Comorbidities and multimorbidities need to be considered when planning public health interventions targeting haematological malignancies in England.
Affiliation(s)
- Matthew James Smith
- Inequalities in Cancer Outcomes Network, Department of Non-Communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London WC1E 7HT, UK; (A.B.); (M.A.L.F.); (S.B.M.); (B.R.); (E.N.N.)
| | - Aurélien Belot
- Inequalities in Cancer Outcomes Network, Department of Non-Communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London WC1E 7HT, UK; (A.B.); (M.A.L.F.); (S.B.M.); (B.R.); (E.N.N.)
| | - Matteo Quartagno
- MRC Clinical Trials Unit, Institute of Clinical Trials and Methodology, University College London, London WC1V 6LJ, UK;
| | - Miguel Angel Luque Fernandez
- Inequalities in Cancer Outcomes Network, Department of Non-Communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London WC1E 7HT, UK; (A.B.); (M.A.L.F.); (S.B.M.); (B.R.); (E.N.N.)
- Noncommunicable Disease and Cancer Epidemiology Group, Instituto de Investigación Biosanitaria de Granada, Ibs.GRANADA, Andalusian School of Public Health, 18012 Granada, Spain
- Centro de Investigación Biomédica en Red de Epidemiología y Salud Pública (CIBER of Epidemiology and Public Health, CIBERESP), 28029 Madrid, Spain
| | - Audrey Bonaventure
- Epidemiology of Childhood and Adolescent Cancers Team, Research Centre in Epidemiology and Biostatistics (CRESS), Inserm UMR 1153, Université de Paris, 94801 Villejuif, France;
| | - Susan Gachau
- School of Mathematics, University of Nairobi, Nairobi 30197-00100, Kenya;
| | - Sara Benitez Majano
- Inequalities in Cancer Outcomes Network, Department of Non-Communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London WC1E 7HT, UK; (A.B.); (M.A.L.F.); (S.B.M.); (B.R.); (E.N.N.)
| | - Bernard Rachet
- Inequalities in Cancer Outcomes Network, Department of Non-Communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London WC1E 7HT, UK; (A.B.); (M.A.L.F.); (S.B.M.); (B.R.); (E.N.N.)
| | - Edmund Njeru Njagi
- Inequalities in Cancer Outcomes Network, Department of Non-Communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London WC1E 7HT, UK; (A.B.); (M.A.L.F.); (S.B.M.); (B.R.); (E.N.N.)
| |
|
11
|
Smith MJ, Fernandez MAL, Belot A, Quartagno M, Bonaventure A, Majano SB, Rachet B, Njagi EN. Investigating the inequalities in route to diagnosis amongst patients with diffuse large B-cell or follicular lymphoma in England. Br J Cancer 2021; 125:1299-1307. [PMID: 34389805 PMCID: PMC8548410 DOI: 10.1038/s41416-021-01523-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/23/2020] [Revised: 06/23/2021] [Accepted: 08/03/2021] [Indexed: 12/22/2022] Open
Abstract
INTRODUCTION Diagnostic delay is associated with lower chances of cancer survival. Underlying comorbidities are known to affect the timely diagnosis of cancer. Diffuse large B-cell (DLBCL) and follicular lymphomas (FL) are primarily diagnosed amongst older patients, who are more likely to have comorbidities. Characteristics of clinical commissioning groups (CCG) are also known to impact diagnostic delay. We assess the association between comorbidities and diagnostic delay amongst patients with DLBCL or FL in England during 2005-2013. METHODS Multivariable generalised linear mixed-effect models were used to assess the main association. Empirical Bayes estimates of the random effects were used to explore between-cluster variation. The latent normal joint modelling multiple imputation approach was used to account for partially observed variables. RESULTS We included 30,078 and 15,551 patients diagnosed with DLBCL or FL, respectively. Amongst patients from the same CCG, having multimorbidity was strongly associated with the emergency route to diagnosis (DLBCL: odds ratio 1.56, CI 1.40-1.73; FL: odds ratio 1.80, CI 1.45-2.23). Amongst DLBCL patients, the diagnostic delay was possibly correlated with CCGs that had higher population densities. CONCLUSIONS Underlying comorbidity is associated with diagnostic delay amongst patients with DLBCL or FL. Results suggest a possible correlation between CCGs with higher population densities and diagnostic delay of aggressive lymphomas.
Affiliation(s)
- Matthew J Smith
- Inequalities in Cancer Outcomes Network, Department of Non-Communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK.
| | - Miguel Angel Luque Fernandez
- Inequalities in Cancer Outcomes Network, Department of Non-Communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK
- Noncommunicable Disease and Cancer Epidemiology Group, Instituto de Investigación Biosanitaria de Granada, Ibs.GRANADA, Andalusian School of Public Health, Granada, Spain
| | - Aurélien Belot
- Inequalities in Cancer Outcomes Network, Department of Non-Communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK
| | - Matteo Quartagno
- MRC Clinical Trials Unit, Institute of Clinical Trials and Methodology, University College London, London, UK
| | - Audrey Bonaventure
- CRESS, Université de Paris, INSERM, UMR 1153, Epidemiology of Childhood and Adolescent Cancers Team, Villejuif, France
| | - Sara Benitez Majano
- Inequalities in Cancer Outcomes Network, Department of Non-Communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK
| | - Bernard Rachet
- Inequalities in Cancer Outcomes Network, Department of Non-Communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK
| | - Edmund Njeru Njagi
- Inequalities in Cancer Outcomes Network, Department of Non-Communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK
| |
|
12
|
Carpenter JR, Smuk M. Missing data: A statistical framework for practice. Biom J 2021; 63:915-947. [PMID: 33624862 PMCID: PMC7615108 DOI: 10.1002/bimj.202000196] [Citation(s) in RCA: 61] [Impact Index Per Article: 20.3] [Received: 06/23/2020] [Revised: 11/19/2020] [Accepted: 11/24/2020] [Indexed: 12/19/2022]
Abstract
Missing data are ubiquitous in medical research, yet there is still uncertainty over when restricting to the complete records is likely to be acceptable, when more complex methods (e.g. maximum likelihood, multiple imputation and Bayesian methods) should be used, how they relate to each other and the role of sensitivity analysis. This article seeks to address both applied practitioners and researchers interested in a more formal explanation of some of the results. For practitioners, the framework, illustrative examples and code should equip them with a practical approach to address the issues raised by missing data (particularly using multiple imputation), alongside an overview of how the various approaches in the literature relate. In particular, we describe how multiple imputation can be readily used for sensitivity analyses, which are still infrequently performed. For those interested in more formal derivations, we give outline arguments for key results, use simple examples to show how methods relate, and references for full details. The ideas are illustrated with a cohort study, a multi-centre case control study and a randomised clinical trial.
Affiliation(s)
- James R. Carpenter
- Department of Medical Statistics, London School of Hygiene & Tropical Medicine, London, UK
- MRC Clinical Trials Unit at UCL, London, UK
| | - Melanie Smuk
- Department of Medical Statistics, London School of Hygiene & Tropical Medicine, London, UK
| |
|
13
|
Multiple imputation of missing data in multilevel models with the R package mdmb: a flexible sequential modeling approach. Behav Res Methods 2021; 53:2631-2649. [PMID: 34027594 PMCID: PMC8613130 DOI: 10.3758/s13428-020-01530-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Accepted: 12/14/2020] [Indexed: 11/08/2022]
Abstract
Multilevel models often include nonlinear effects, such as random slopes or interaction effects. The estimation of these models can be difficult when the underlying variables contain missing data. Although several methods for handling missing data such as multiple imputation (MI) can be used with multilevel data, conventional methods for multilevel MI often do not properly take the nonlinear associations between the variables into account. In the present paper, we propose a sequential modeling approach based on Bayesian estimation techniques that can be used to handle missing data in a variety of multilevel models that involve nonlinear effects. The main idea of this approach is to decompose the joint distribution of the data into several parts that correspond to the outcome and explanatory variables in the intended analysis, thus generating imputations in a manner that is compatible with the substantive analysis model. In three simulation studies, we evaluate the sequential modeling approach and compare it with conventional as well as other substantive-model-compatible approaches to multilevel MI. We implemented the sequential modeling approach in the R package mdmb and provide a worked example to illustrate its application.
|
14
|
Gachau S, Njagi EN, Owuor N, Mwaniki P, Quartagno M, Sarguta R, English M, Ayieko P. Handling missing data in a composite outcome with partially observed components: simulation study based on clustered paediatric routine data. J Appl Stat 2021; 49:2389-2402. [PMID: 35755090 PMCID: PMC9225614 DOI: 10.1080/02664763.2021.1895087] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/10/2020] [Accepted: 02/21/2021] [Indexed: 10/21/2022]
Abstract
Composite scores are useful in providing insights and trends about complex and multidimensional quality of care processes. However, missing data in subcomponents may hinder the overall reliability of a composite measure. In this study, strategies for handling missing data in the Paediatric Admission Quality of Care (PAQC) score, an ordinal composite outcome, were explored through a simulation study. Specifically, the implications of the conventional method employed in addressing missing PAQC score subcomponents, consisting of scoring missing PAQC score components with a zero, and a multiple imputation (MI)-based strategy, were assessed. The latent normal joint modelling MI approach was used for the latter. Across simulation scenarios, MI of missing PAQC score elements at item level produced minimally biased estimates compared to the conventional method. Moreover, regression coefficients were more prone to bias than standard errors. The magnitude of bias was dependent on the proportion of missingness and the missing data generating mechanism. Therefore, incomplete composite outcome subcomponents should be handled carefully to reduce the potential for biased estimates and misleading inferences. Further research is needed on other strategies of imputing at the component and composite outcome level, and on imputing compatibly with the substantive model in this setting.
Affiliation(s)
- Susan Gachau
- Health Services Unit, Kenya Medical Research Institute-Wellcome Trust Research Programme, Nairobi, Kenya
- School of Mathematics, University of Nairobi, Nairobi, Kenya
| | - Edmund Njeru Njagi
- Department of Non-Communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK
| | - Nelson Owuor
- School of Mathematics, University of Nairobi, Nairobi, Kenya
| | - Paul Mwaniki
- Health Services Unit, Kenya Medical Research Institute-Wellcome Trust Research Programme, Nairobi, Kenya
- School of Mathematics, University of Nairobi, Nairobi, Kenya
| | - Matteo Quartagno
- Institute of Clinical Trials and Methodology, University College London, London, UK
| | - Rachel Sarguta
- School of Mathematics, University of Nairobi, Nairobi, Kenya
| | - Mike English
- Health Services Unit, Kenya Medical Research Institute-Wellcome Trust Research Programme, Nairobi, Kenya
- Nuffield Department of Medicine, University of Oxford, Oxford, UK
| | - Philip Ayieko
- Department of Infectious Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK
- Mwanza Intervention Trials Unit, Mwanza, Tanzania
| |
|
15
|
Bazo-Alvarez JC, Morris TP, Pham TM, Carpenter JR, Petersen I. Handling Missing Values in Interrupted Time Series Analysis of Longitudinal Individual-Level Data. Clin Epidemiol 2020; 12:1045-1057. [PMID: 33116899 PMCID: PMC7549500 DOI: 10.2147/clep.s266428] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/06/2020] [Accepted: 08/16/2020] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND In the interrupted time series (ITS) approach, it is common to average the outcome of interest at each time point and then perform a segmented regression (SR) analysis. In this study, we illustrate that such 'aggregate-level' analysis is biased when data are missing at random (MAR) and provide alternative analysis methods. METHODS Using electronic health records from the UK, we evaluated weight change over time induced by the initiation of antipsychotic treatment. We contrasted estimates from aggregate-level SR analysis against estimates from mixed models with and without multiple imputation of missing covariates, using individual-level data. Then, we conducted a simulation study for insight about the different results in a controlled environment. RESULTS Aggregate-level SR analysis suggested a substantial weight gain after initiation of treatment (average short-term weight change: 0.799kg/week) compared to mixed models (0.412kg/week). Simulation studies confirmed that aggregate-level SR analysis was biased when data were MAR. In simulations, mixed models gave less biased estimates than SR analysis and, in combination with multilevel multiple imputation, provided unbiased estimates. Mixed models with multiple imputation can be used with other types of ITS outcomes (eg, proportions). Other standard methods applied in ITS do not help to correct this bias problem. CONCLUSION Aggregate-level SR analysis can bias the ITS estimates when individual-level data are MAR, because taking averages of individual-level data before SR means that data at the cluster level are missing not at random. Avoiding the averaging-step and using mixed models with or without multilevel multiple imputation of covariates is recommended.
Affiliation(s)
- Juan Carlos Bazo-Alvarez
- Research Department of Primary Care and Population Health, University College London (UCL), London, UK
- Instituto de Investigación, Universidad Católica Los Ángeles de Chimbote, Chimbote, Peru
| | | | | | - James R Carpenter
- MRC Clinical Trials Unit at UCL, London, UK
- Department of Medical Statistics, London School of Hygiene and Tropical Medicine, London, UK
| | - Irene Petersen
- Research Department of Primary Care and Population Health, University College London (UCL), London, UK
- Department of Clinical Epidemiology, Aarhus University, Aarhus, Denmark
| |
|
16
|
Huque MH, Moreno-Betancur M, Quartagno M, Simpson JA, Carlin JB, Lee KJ. Multiple imputation methods for handling incomplete longitudinal and clustered data where the target analysis is a linear mixed effects model. Biom J 2020; 62:444-466. [PMID: 31919921 PMCID: PMC7614826 DOI: 10.1002/bimj.201900051] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 01/21/2019] [Revised: 08/18/2019] [Accepted: 09/30/2019] [Indexed: 11/06/2022]
Abstract
Multiple imputation (MI) is increasingly popular for handling multivariate missing data. Two general approaches are available in standard computer packages: MI based on the posterior distribution of incomplete variables under a multivariate (joint) model, and fully conditional specification (FCS), which imputes missing values using univariate conditional distributions for each incomplete variable given all the others, cycling iteratively through the univariate imputation models. In the context of longitudinal or clustered data, it is not clear whether these approaches result in consistent estimates of regression coefficients and variance components when the analysis model of interest is a linear mixed effects model (LMM) that includes both random intercepts and slopes, and either the covariates alone or both the covariates and the outcome contain missing information. In the current paper, we compared the performance of seven different MI methods for handling missing values in longitudinal and clustered data in the context of fitting LMMs with both random intercepts and slopes. We study the theoretical compatibility between specific imputation models fitted under each of these approaches and the LMM, and also conduct simulation studies in both the longitudinal and clustered data settings. Simulations were motivated by analyses of the association between body mass index (BMI) and quality of life (QoL) in the Longitudinal Study of Australian Children (LSAC). Our findings showed that the relative performance of MI methods varies according to whether the incomplete covariate has fixed or random effects and whether there is missingness in the outcome variable. We showed via simulation that compatible imputation and analysis models resulted in consistent estimation of both regression parameters and variance components. We illustrate our findings with the analysis of LSAC data.
Affiliation(s)
- Md Hamidul Huque
- Murdoch Children’s Research Institute, 50 Flemington Road, Parkville, VIC, 3052, Australia
- Department of Paediatrics, University of Melbourne, Parkville, VIC, 3052, Australia
- University of New South Wales, Kensington, Kensington, NSW 2052, Australia
| | | | - Matteo Quartagno
- Institute for Clinical Trials and Methodology, University College London, 90 High Holborn, WC1V 6LJ
| | - Julie A. Simpson
- Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, The University of Melbourne, Parkville, VIC, 3052, Australia
| | - John B. Carlin
- Murdoch Children’s Research Institute, 50 Flemington Road, Parkville, VIC, 3052, Australia
- Department of Paediatrics, University of Melbourne, Parkville, VIC, 3052, Australia
- Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, The University of Melbourne, Parkville, VIC, 3052, Australia
| | - Katherine J. Lee
- Murdoch Children’s Research Institute, 50 Flemington Road, Parkville, VIC, 3052, Australia
- Department of Paediatrics, University of Melbourne, Parkville, VIC, 3052, Australia
| |
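The cycling step of fully conditional specification described in the abstract above can be sketched in a few lines. This is only an illustration under strong assumptions, not the method evaluated by Huque et al.: it handles two continuous variables with no clustering, uses a simple linear regression as each univariate model, and draws imputations from fitted point estimates plus residual noise rather than from a posterior distribution. The names `fit_line` and `fcs_impute` and the toy data are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual SD for y ~ x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sd = (rss / max(n - 2, 1)) ** 0.5
    return intercept, slope, sd


def fcs_impute(rows, n_cycles=10, seed=0):
    """Return one completed copy of two-column rows (None marks missing)."""
    rng = random.Random(seed)
    data = [list(r) for r in rows]
    missing = [[v is None for v in r] for r in rows]

    # Start by filling each gap with the observed mean of its column.
    for j in range(2):
        observed = [r[j] for r in rows if r[j] is not None]
        mean_j = sum(observed) / len(observed)
        for r in data:
            if r[j] is None:
                r[j] = mean_j

    # Cycle through the variables, re-imputing each one given the other.
    for _ in range(n_cycles):
        for j in range(2):
            k = 1 - j  # index of the predictor variable
            # Fit the univariate model on rows where the target was observed.
            xs = [r[k] for r, m in zip(data, missing) if not m[j]]
            ys = [r[j] for r, m in zip(data, missing) if not m[j]]
            a, b, sd = fit_line(xs, ys)
            # Replace only the originally missing entries of the target.
            for r, m in zip(data, missing):
                if m[j]:
                    r[j] = a + b * r[k] + rng.gauss(0, sd)
    return data


# Toy data (made up for illustration): column 0 is x, column 1 is y.
rows = [[1.0, 2.1], [2.0, None], [3.0, 6.2],
        [None, 8.1], [5.0, 9.9], [6.0, 12.2]]
completed = fcs_impute(rows, n_cycles=10, seed=1)
for row in completed:
    print(row)
```

For multiple imputation, one would call `fcs_impute` several times with different seeds, fit the analysis model to each completed dataset, and pool the results. A faithful implementation would also draw the regression parameters from their posterior before imputing, which this sketch omits.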
|
17
|
Quartagno M, Carpenter JR. Multiple imputation for discrete data: Evaluation of the joint latent normal model. Biom J 2019; 61:1003-1019. [PMID: 30868652 PMCID: PMC6618333 DOI: 10.1002/bimj.201800222] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 08/01/2018] [Revised: 11/28/2018] [Accepted: 01/29/2019] [Indexed: 11/11/2022]
Abstract
Missing data are ubiquitous in clinical and social research, and multiple imputation (MI) is increasingly the methodology of choice for practitioners. Two principal strategies for imputation have been proposed in the literature: joint modelling multiple imputation (JM‐MI) and full conditional specification multiple imputation (FCS‐MI). While JM‐MI is arguably a preferable approach, because it involves specification of an explicit imputation model, FCS‐MI is pragmatically appealing, because of its flexibility in handling different types of variables. JM‐MI has developed from the multivariate normal model, and latent normal variables have been proposed as a natural way to extend this model to handle categorical variables. In this article, we evaluate the latent normal model through an extensive simulation study and an application on data from the German Breast Cancer Study Group, comparing the results with FCS‐MI. We divide our investigation into four sections, focusing on (i) binary, (ii) categorical, (iii) ordinal, and (iv) count data. Using data simulated from both the latent normal model and the general location model, we find that in all but one extreme general location model setting JM‐MI works very well, and sometimes outperforms FCS‐MI. We conclude that the latent normal model, implemented in the R package jomo, can be used with confidence by researchers, both for single and multilevel multiple imputation.
Affiliation(s)
- Matteo Quartagno
- Department of Medical Statistics, London School of Hygiene and Tropical Medicine, London, UK
| | - James R Carpenter
- Department of Medical Statistics, London School of Hygiene and Tropical Medicine, London, UK
- MRC Clinical Trials Unit at UCL, 90 High Holborn, London, UK
| |
|