Casiraghi E, Wong R, Hall M, Coleman B, Notaro M, Evans MD, Tronieri JS, Blau H, Laraway B, Callahan TJ, Chan LE, Bramante CT, Buse JB, Moffitt RA, Stürmer T, Johnson SG, Raymond Shao Y, Reese J, Robinson PN, Paccanaro A, Valentini G, Huling JD, Wilkins KJ. A method for comparing multiple imputation techniques: A case study on the U.S. national COVID cohort collaborative. J Biomed Inform 2023; 139:104295. [PMID: 36716983 PMCID: PMC10683778 DOI: 10.1016/j.jbi.2023.104295] [Received: 06/02/2022] [Revised: 01/16/2023] [Accepted: 01/21/2023] [Indexed: 02/01/2023]
Abstract
Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful for assessing associations between patients' predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases, whose removal may introduce severe bias. Several multiple imputation algorithms have been proposed to attempt to recover the missing information under an assumed missingness mechanism. Each algorithm presents strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, the selection of each algorithm's parameters and data-related modeling choices is also both crucial and challenging. In this paper we propose a novel framework to numerically evaluate strategies for handling missing data in the context of statistical analysis, with a particular focus on multiple imputation techniques. We demonstrate the feasibility of our approach on a large cohort of type-2 diabetes patients provided by the National COVID Cohort Collaborative (N3C) Enclave, where we explored the influence of various patient characteristics on outcomes related to COVID-19. Our analysis included classic multiple imputation techniques as well as simple complete-case Inverse Probability Weighted models. Extensive experiments show that our approach can effectively highlight the most promising and performant missing-data handling strategy for our case study. Moreover, our methodology allowed a better understanding of the behavior of the different models and of how it changed as we modified their parameters. Our method is general and can be applied to different research fields and on datasets containing heterogeneous types.
Affiliation(s)
- Elena Casiraghi
  - AnacletoLab, Department of Computer Science "Giovanni degli Antoni", Università degli Studi di Milano, Milan, Italy; CINI, Infolife National Laboratory, Roma, Italy; Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Rachel Wong
  - Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA
- Margaret Hall
  - Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA
- Ben Coleman
  - The Jackson Laboratory for Genomic Medicine, Farmington, USA; Institute for Systems Genomics, University of Connecticut, Farmington, CT, USA
- Marco Notaro
  - AnacletoLab, Department of Computer Science "Giovanni degli Antoni", Università degli Studi di Milano, Milan, Italy; CINI, Infolife National Laboratory, Roma, Italy
- Michael D Evans
  - Biostatistical Design and Analysis Center, Clinical and Translational Science Institute, University of Minnesota, Minneapolis, MN, USA
- Jena S Tronieri
  - Department of Psychiatry, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, USA
- Hannah Blau
  - The Jackson Laboratory for Genomic Medicine, Farmington, USA
- Bryan Laraway
  - University of Colorado, Anschutz Medical Campus, Aurora, CO, USA
- Lauren E Chan
  - College of Public Health and Human Sciences, Oregon State University, Corvallis, USA
- Carolyn T Bramante
  - Division of General Internal Medicine, University of Minnesota, Minneapolis, MN, USA
- John B Buse
  - NC Translational and Clinical Sciences Institute, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; Division of Endocrinology, Department of Medicine, University of North Carolina School of Medicine, USA
- Richard A Moffitt
  - Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA
- Til Stürmer
  - Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Steven G Johnson
  - Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA
- Yu Raymond Shao
  - Harvard-MIT Division of Health Sciences and Technology (HST), 260 Longwood Ave, Boston, USA; Department of Radiation Oncology, UT Southwestern Medical Center, Dallas, USA
- Justin Reese
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Peter N Robinson
  - The Jackson Laboratory for Genomic Medicine, Farmington, USA; Institute for Systems Genomics, University of Connecticut, Farmington, CT, USA
- Alberto Paccanaro
  - School of Applied Mathematics (EMAp), Fundação Getúlio Vargas, Rio de Janeiro, Brazil; Department of Computer Science, Royal Holloway, University of London, Egham, UK
- Giorgio Valentini
  - AnacletoLab, Department of Computer Science "Giovanni degli Antoni", Università degli Studi di Milano, Milan, Italy; CINI, Infolife National Laboratory, Roma, Italy
- Jared D Huling
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
- Kenneth J Wilkins
  - Biostatistics Program, Office of the Director, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, USA