1
Shadbahr T, Roberts M, Stanczuk J, Gilbey J, Teare P, Dittmer S, Thorpe M, Torné RV, Sala E, Lió P, Patel M, Preller J, Rudd JHF, Mirtti T, Rannikko AS, Aston JAD, Tang J, Schönlieb CB. The impact of imputation quality on machine learning classifiers for datasets with missing values. COMMUNICATIONS MEDICINE 2023; 3:139. [PMID: 37803172 PMCID: PMC10558448 DOI: 10.1038/s43856-023-00356-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 07/18/2022] [Accepted: 09/13/2023] [Indexed: 10/08/2023]
Abstract
BACKGROUND Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets, and these missing values are typically imputed using established methods, followed by classification of the now complete samples. The focus of the machine learning researcher is to optimise the classifier's performance.
METHODS We utilise three simulated and three real-world clinical datasets with different feature types and missingness patterns. Initially, we evaluate how the downstream classifier performance depends on the choice of classifier and imputation methods. We employ ANOVA to quantitatively evaluate how the choice of missingness rate, imputation method, and classifier method influences the performance. Additionally, we compare commonly used methods for assessing imputation quality and introduce a class of discrepancy scores based on the sliced Wasserstein distance. We also assess the stability of the imputations and the interpretability of models built on the imputed data.
RESULTS The performance of the classifier is most affected by the percentage of missingness in the test data, with a considerable performance decline observed as the test missingness rate increases. We also show that the commonly used measures for assessing imputation quality tend to lead to imputed data which poorly matches the underlying data distribution, whereas our new class of discrepancy scores performs much better on this measure. Furthermore, we show that the interpretability of classifier models trained using poorly imputed data is compromised.
CONCLUSIONS It is imperative to consider the quality of the imputation when performing downstream classification, as the effects on the classifier can be considerable.
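As an illustration of the kind of discrepancy score the abstract mentions, the following is a minimal sketch of a Monte Carlo estimate of the sliced Wasserstein distance between a reference sample and an imputed sample. It is not the authors' implementation: the function name `sliced_wasserstein`, the number of projections, the quantile grid, and the toy mean-imputation example are all assumptions made for illustration.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=200, n_quantiles=100, seed=0):
    """Estimate the sliced 1-Wasserstein distance between two samples.

    x, y: arrays of shape (n_samples, n_features) with the same n_features.
    Hypothetical helper for illustration; parameter defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Draw random directions on the unit sphere.
    dirs = rng.normal(size=(n_projections, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

    # Project both samples onto every direction: shape (n_samples, n_projections).
    px = x @ dirs.T
    py = y @ dirs.T

    # In one dimension the Wasserstein distance compares quantile functions,
    # approximated here on a fixed grid of mid-point quantile levels.
    levels = (np.arange(n_quantiles) + 0.5) / n_quantiles
    qx = np.quantile(px, levels, axis=0)
    qy = np.quantile(py, levels, axis=0)

    # Average over quantile levels and over projection directions.
    return float(np.mean(np.abs(qx - qy)))


# Toy example (synthetic data, an assumption): mean-impute 30% of one feature.
rng = np.random.default_rng(0)
reference = rng.normal(size=(1000, 3))
mask = rng.random(1000) < 0.3
imputed = reference.copy()
imputed[mask, 0] = reference[~mask, 0].mean()

print("distance to mean-imputed data:", sliced_wasserstein(reference, imputed))
```

A score of this kind compares whole distributions rather than individual imputed values, which is the property the abstract contrasts with commonly used quality measures.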
Affiliation(s)
- Tolou Shadbahr
  - Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki, Finland
- Michael Roberts
  - Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK
  - Data Science & Artificial Intelligence, AstraZeneca, Cambridge, UK
- Jan Stanczuk
  - Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK
- Julian Gilbey
  - Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK
- Philip Teare
  - Data Science & Artificial Intelligence, AstraZeneca, Cambridge, UK
- Sören Dittmer
  - Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK
  - ZeTeM, University of Bremen, Bremen, Germany
- Matthew Thorpe
  - Department of Mathematics, University of Manchester, Manchester, UK
- Ramon Viñas Torné
  - Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
- Evis Sala
  - Department of Radiology, University of Cambridge, Cambridge, UK
- Pietro Lió
  - Department of Mathematics, University of Manchester, Manchester, UK
- Mishal Patel
  - Data Science & Artificial Intelligence, AstraZeneca, Cambridge, UK
  - Clinical Pharmacology & Safety Sciences, AstraZeneca, Cambridge, UK
- Jacobus Preller
  - Addenbrooke's Hospital, Cambridge University Hospitals NHS Trust, Cambridge, UK
- James H F Rudd
  - Department of Medicine, University of Cambridge, Cambridge, UK
- Tuomas Mirtti
  - Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki, Finland
  - Department of Pathology, University of Helsinki and Helsinki University Hospital, Helsinki, Finland
  - iCAN-Digital Precision Cancer Medicine Flagship, Helsinki, Finland
- Antti Sakari Rannikko
  - Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki, Finland
  - iCAN-Digital Precision Cancer Medicine Flagship, Helsinki, Finland
  - Department of Urology, University of Helsinki and Helsinki University Hospital, Helsinki, Finland
- John A D Aston
  - Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, Cambridge, UK
- Jing Tang
  - Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki, Finland
- Carola-Bibiane Schönlieb
  - Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, UK
2
Motamedi F, Pérez-Sánchez H, Mehridehnavi A, Fassihi A, Ghasemi F. Accelerating Big Data Quantitative Structure-Activity Prediction through LASSO-Random Forest Algorithm. Bioinformatics 2021; 38:469-475. [PMID: 34601564 DOI: 10.1093/bioinformatics/btab659] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 11/16/2020] [Revised: 08/14/2021] [Accepted: 09/28/2021] [Indexed: 11/13/2022]
Abstract
MOTIVATION The aim of quantitative structure-activity prediction (QSAR) studies is to identify novel drug-like molecules that can be suggested as lead compounds. This involves two tasks, which are discussed in this article: first, identifying appropriate molecular descriptors using a feature-selection algorithm; and second, predicting the biological activities of designed compounds. Recent studies have shown increased interest in predicting the activities of huge numbers of molecules, known as Big Data, using deep learning models. However, despite all these efforts to solve critical challenges in QSAR models, such as over-fitting, the massive processing required remains a major shortcoming of deep learning models. Hence, finding the most effective molecular descriptors in the shortest possible time is an ongoing task. One of the successful methods to speed up the extraction of the best features from big datasets is the least absolute shrinkage and selection operator (LASSO). This algorithm is a regression model that selects a subset of molecular descriptors, with the aim of enhancing prediction accuracy and interpretability by removing inappropriate and irrelevant features.
RESULTS To implement and test our proposed model, a random forest was built to predict the molecular activities of Kaggle competition compounds. Finally, the prediction results and computation time of the suggested model were compared with those of other well-known algorithms, i.e. Boruta-random forest, deep random forest, and deep belief network models. The results revealed that improving output correlation through LASSO-random forest leads to appreciably reduced implementation time and model complexity, while maintaining the accuracy of the predictions.
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
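The two-stage idea described in the abstract (LASSO for descriptor selection, then a random forest for activity prediction) can be sketched with scikit-learn as follows. This is an illustrative sketch only, not the authors' code: the synthetic data from `make_regression` stands in for a molecular descriptor matrix, and the `alpha`, `n_estimators`, and train/test split values are arbitrary assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a descriptor matrix (assumption): 200 features,
# of which only 10 actually influence the target.
X, y = make_regression(
    n_samples=400, n_features=200, n_informative=10, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Stage 1: LASSO shrinks the coefficients of irrelevant features to zero and
# SelectFromModel keeps only the features with non-zero coefficients.
# Stage 2: a random forest is trained on the reduced feature set.
model = make_pipeline(
    StandardScaler(),
    SelectFromModel(Lasso(alpha=1.0, max_iter=10000)),
    RandomForestRegressor(n_estimators=100, random_state=0),
)
model.fit(X_train, y_train)

n_selected = int(model.named_steps["selectfrommodel"].get_support().sum())
r2 = model.score(X_test, y_test)
print(f"features kept by LASSO: {n_selected} of {X.shape[1]}")
print(f"held-out R^2 of the random forest: {r2:.3f}")
```

Because the forest only sees the columns that survive the LASSO step, training cost scales with the reduced feature count, which is the speed-up mechanism the abstract points to.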
Affiliation(s)
- Fahimeh Motamedi
  - Department of Bioinformatics and Systems Biology, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
- Horacio Pérez-Sánchez
  - Structural Bioinformatics and High Performance Computing Research Group (BIO-HPC), Computer Engineering Department, Universidad Católica de Murcia (UCAM), Murcia, E30107, Spain
- Alireza Mehridehnavi
  - Department of Bioinformatics and Systems Biology, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
  - Medical Image and Signal Processing Research Center, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
- Afshin Fassihi
  - School of Pharmacology and Pharmaceutical Sciences, Isfahan University of Medical Sciences, Isfahan, Iran
  - Bioinformatics Research Center, School of Pharmacology and Pharmaceutical Sciences, Isfahan University of Medical Sciences, Isfahan, Iran
- Fahimeh Ghasemi
  - Department of Bioinformatics and Systems Biology, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
  - Biosensor Research Centre, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran