1
Casiraghi E, Wong R, Hall M, Coleman B, Notaro M, Evans MD, Tronieri JS, Blau H, Laraway B, Callahan TJ, Chan LE, Bramante CT, Buse JB, Moffitt RA, Stürmer T, Johnson SG, Raymond Shao Y, Reese J, Robinson PN, Paccanaro A, Valentini G, Huling JD, Wilkins KJ. A method for comparing multiple imputation techniques: A case study on the U.S. national COVID cohort collaborative. J Biomed Inform 2023; 139:104295. [PMID: 36716983 PMCID: PMC10683778 DOI: 10.1016/j.jbi.2023.104295] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2022] [Revised: 01/16/2023] [Accepted: 01/21/2023] [Indexed: 02/01/2023]
Abstract
Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful for assessing associations between patients' predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases, whose removal may introduce severe bias. Several multiple imputation algorithms have been proposed to attempt to recover the missing information under an assumed missingness mechanism. Each algorithm presents strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, the selection of each algorithm's parameters and data-related modeling choices are also both crucial and challenging. In this paper we propose a novel framework to numerically evaluate strategies for handling missing data in the context of statistical analysis, with a particular focus on multiple imputation techniques. We demonstrate the feasibility of our approach on a large cohort of type-2 diabetes patients provided by the National COVID Cohort Collaborative (N3C) Enclave, where we explored the influence of various patient characteristics on outcomes related to COVID-19. Our analysis included classic multiple imputation techniques as well as simple complete-case Inverse Probability Weighted models. Extensive experiments show that our approach can effectively highlight the most promising and performant missing-data handling strategy for our case study. Moreover, our methodology allowed a better understanding of the behavior of the different models and of how it changed as we modified their parameters. Our method is general and can be applied to different research fields and on datasets containing heterogeneous types.
Affiliation(s)
- Elena Casiraghi
  - AnacletoLab, Department of Computer Science "Giovanni degli Antoni", Università degli Studi di Milano, Milan, Italy; CINI, Infolife National Laboratory, Roma, Italy; Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Rachel Wong
  - Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA
- Margaret Hall
  - Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA
- Ben Coleman
  - The Jackson Laboratory for Genomic Medicine, Farmington, USA; Institute for Systems Genomics, University of Connecticut, Farmington, CT, USA
- Marco Notaro
  - AnacletoLab, Department of Computer Science "Giovanni degli Antoni", Università degli Studi di Milano, Milan, Italy; CINI, Infolife National Laboratory, Roma, Italy
- Michael D Evans
  - Biostatistical Design and Analysis Center, Clinical and Translational Science Institute, University of Minnesota, Minneapolis, MN, USA
- Jena S Tronieri
  - Department of Psychiatry, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, USA
- Hannah Blau
  - The Jackson Laboratory for Genomic Medicine, Farmington, USA
- Bryan Laraway
  - University of Colorado, Anschutz Medical Campus, Aurora, CO, USA
- Lauren E Chan
  - College of Public Health and Human Sciences, Oregon State University, Corvallis, USA
- Carolyn T Bramante
  - Division of General Internal Medicine, University of Minnesota, Minneapolis, MN, USA
- John B Buse
  - NC Translational and Clinical Sciences Institute, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; Division of Endocrinology, Department of Medicine, University of North Carolina School of Medicine, USA
- Richard A Moffitt
  - Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA
- Til Stürmer
  - Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Steven G Johnson
  - Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA
- Yu Raymond Shao
  - Harvard-MIT Division of Health Sciences and Technology (HST), 260 Longwood Ave, Boston, USA; Department of Radiation Oncology, UT Southwestern Medical Center, Dallas, USA
- Justin Reese
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Peter N Robinson
  - The Jackson Laboratory for Genomic Medicine, Farmington, USA; Institute for Systems Genomics, University of Connecticut, Farmington, CT, USA
- Alberto Paccanaro
  - School of Applied Mathematics (EMAp), Fundação Getúlio Vargas, Rio de Janeiro, Brazil; Department of Computer Science, Royal Holloway, University of London, Egham, UK
- Giorgio Valentini
  - AnacletoLab, Department of Computer Science "Giovanni degli Antoni", Università degli Studi di Milano, Milan, Italy; CINI, Infolife National Laboratory, Roma, Italy
- Jared D Huling
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
- Kenneth J Wilkins
  - Biostatistics Program, Office of the Director, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, USA
2
Cappelletti L, Petrini A, Gliozzo J, Casiraghi E, Schubach M, Kircher M, Valentini G. Boosting tissue-specific prediction of active cis-regulatory regions through deep learning and Bayesian optimization techniques. BMC Bioinformatics 2022; 23:154. [PMID: 36510125 PMCID: PMC9743524 DOI: 10.1186/s12859-022-04582-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2022] [Accepted: 01/20/2022] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND Cis-regulatory regions (CRRs) are non-coding regions of the DNA that finely control the spatio-temporal pattern of transcription; they are involved in a wide range of pivotal processes such as the development of specific cell-lines/tissues and the dynamic cell response to physiological stimuli. Recent studies showed that genetic variants occurring in CRRs are strongly correlated with pathogenicity or deleteriousness. Considering the central role of CRRs in the regulation of physiological and pathological conditions, the correct identification of CRRs and of their tissue-specific activity status through Machine Learning methods plays a major role in dissecting the impact of genetic variants on human diseases. Unfortunately, the problem is still open, though some promising results have already been reported by (deep) machine-learning based methods that predict active promoters and enhancers in specific tissues or cell lines by encoding epigenetic or spectral features directly extracted from DNA sequences.
RESULTS We present the experiments we performed to compare two Deep Neural Networks, a Feed-Forward Neural Network model working on epigenomic features, and a Convolutional Neural Network model working only on genomic sequence, targeted to the identification of enhancer- and promoter-activity in specific cell lines. While performing experiments to understand how the experimental setup influences the prediction performance of the methods, we particularly focused on (1) automatic model selection performed by Bayesian optimization and (2) exploring different data rebalancing setups for reducing negative unbalancing effects.
CONCLUSIONS Results show that (1) automatic model selection by Bayesian optimization improves the quality of the learner; (2) data rebalancing considerably impacts the prediction performance of the models; test set rebalancing may provide over-optimistic results, and should therefore be cautiously applied; (3) despite working on sequence data, convolutional models obtain performance close to that of feed-forward models working on epigenomic information, which suggests that sequence data also carries informative content for CRR-activity prediction. We therefore suggest combining both models/data types in future works.
Affiliation(s)
- Luca Cappelletti
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
- Alessandro Petrini
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
- Jessica Gliozzo
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
- Elena Casiraghi
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy
- Max Schubach
  - Berlin Institute of Health at Charité, Universitätsmedizin Berlin, Berlin, Germany
- Martin Kircher
  - Berlin Institute of Health at Charité, Universitätsmedizin Berlin, Berlin, Germany
- Giorgio Valentini
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy; European Laboratory for Learning and Intelligent Systems (ELLIS), Berlin, Germany; CINI National Laboratory of Artificial Intelligence and Intelligent Systems (AIIS), Rome, Italy; Data Science Research Center, Università degli Studi di Milano, Milan, Italy
3
Esposito A, Casiraghi E, Chiaraviglio F, Scarabelli A, Stellato E, Plensich G, Lastella G, Di Meglio L, Fusco S, Avola E, Jachetti A, Giannitto C, Malchiodi D, Frasca M, Beheshti A, Robinson PN, Valentini G, Forzenigo L, Carrafiello G. Artificial Intelligence in Predicting Clinical Outcome in COVID-19 Patients from Clinical, Biochemical and a Qualitative Chest X-Ray Scoring System. REPORTS IN MEDICAL IMAGING 2021. [DOI: 10.2147/rmi.s292314] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 01/08/2023] Open
4
Casiraghi E, Malchiodi D, Trucco G, Frasca M, Cappelletti L, Fontana T, Esposito AA, Avola E, Jachetti A, Reese J, Rizzi A, Robinson PN, Valentini G. Explainable Machine Learning for Early Assessment of COVID-19 Risk Prediction in Emergency Departments. IEEE ACCESS : PRACTICAL INNOVATIONS, OPEN SOLUTIONS 2020; 8:196299-196325. [PMID: 34812365 PMCID: PMC8545262 DOI: 10.1109/access.2020.3034032] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 10/16/2020] [Accepted: 10/19/2020] [Indexed: 05/06/2023]
Abstract
Between January and October of 2020, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus has infected more than 34 million persons in a worldwide pandemic leading to over one million deaths worldwide (data from the Johns Hopkins University). Since the virus began to spread, emergency departments were busy with COVID-19 patients for whom a quick decision regarding in- or outpatient care was required. The virus can cause characteristic abnormalities in chest radiographs (CXR), but, due to the low sensitivity of CXR, additional variables and criteria are needed to accurately predict risk. Here, we describe a computerized system primarily aimed at extracting the most relevant radiological, clinical, and laboratory variables for improving patient risk prediction, and secondarily at presenting an explainable machine learning system, which may provide simple decision criteria to be used by clinicians as a support for assessing patient risk. To achieve robust and reliable variable selection, Boruta and Random Forest (RF) are combined in a 10-fold cross-validation scheme to produce a variable importance estimate not biased by the presence of surrogates. The most important variables are then selected to train an RF classifier, whose rules may be extracted, simplified, and pruned to finally build an associative tree, particularly appealing for its simplicity. Results show that the radiological score automatically computed through a neural network is highly correlated with the score computed by radiologists, and that laboratory variables, together with the number of comorbidities, aid risk prediction. The prediction performance of our approach was compared to that of generalized linear models and shown to be effective and robust. The proposed machine learning-based computational system can be easily deployed and used in emergency departments for rapid and accurate risk prediction in COVID-19 patients.
Affiliation(s)
- Elena Casiraghi
  - Department of Computer Science “Giovanni degli Antoni,” Università degli Studi di Milano, 20133 Milan, Italy
  - CINI National Laboratory of Artificial Intelligence and Intelligent Systems (AIIS), Università di Roma, 00185 Roma, Italy
- Dario Malchiodi
  - Department of Computer Science “Giovanni degli Antoni,” Università degli Studi di Milano, 20133 Milan, Italy
  - CINI National Laboratory of Artificial Intelligence and Intelligent Systems (AIIS), Università di Roma, 00185 Roma, Italy
  - Data Science Research Center, Università degli Studi di Milano, 20133 Milan, Italy
- Gabriella Trucco
  - Department of Computer Science “Giovanni degli Antoni,” Università degli Studi di Milano, 20133 Milan, Italy
- Marco Frasca
  - Department of Computer Science “Giovanni degli Antoni,” Università degli Studi di Milano, 20133 Milan, Italy
- Luca Cappelletti
  - Department of Computer Science “Giovanni degli Antoni,” Università degli Studi di Milano, 20133 Milan, Italy
- Tommaso Fontana
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, 20133 Milan, Italy
- Emanuele Avola
  - Postgraduate School in Radiodiagnostics, Università degli Studi di Milano, 20122 Milan, Italy
- Alessandro Jachetti
  - Accident and Emergency Department, Fondazione IRCCS Ca Granda Ospedale Maggiore Policlinico, 20122 Milan, Italy
- Justin Reese
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Alessandro Rizzi
  - Department of Computer Science “Giovanni degli Antoni,” Università degli Studi di Milano, 20133 Milan, Italy
- Giorgio Valentini
  - Department of Computer Science “Giovanni degli Antoni,” Università degli Studi di Milano, 20133 Milan, Italy
  - CINI National Laboratory of Artificial Intelligence and Intelligent Systems (AIIS), Università di Roma, 00185 Roma, Italy
  - Data Science Research Center, Università degli Studi di Milano, 20133 Milan, Italy