1
Al-Mekhlafi A, Klawonn F. HiPerMAb: a tool for judging the potential of small sample size biomarker pilot studies. Int J Biostat 2024; 20:157-167. [PMID: 36867668 DOI: 10.1515/ijb-2022-0063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2022] [Accepted: 02/01/2023] [Indexed: 03/04/2023]
Abstract
Common statistical approaches are not designed to deal with so-called "short fat data" in biomarker pilot studies, where the number of biomarker candidates exceeds the sample size by orders of magnitude. High-throughput technologies for omics data enable the measurement of tens of thousands or more biomarker candidates for specific diseases or states of a disease. Due to the limited availability of study participants, ethical reasons, and high costs for sample processing and analysis, researchers often prefer to start with a small sample size pilot study in order to judge the potential of finding biomarkers that enable, usually in combination, a sufficiently reliable classification of the disease state under consideration. We developed a user-friendly tool called HiPerMAb that evaluates pilot studies based on performance measures such as multiclass AUC, entropy, area above the cost curve, hypervolume under manifold, and misclassification rate, using Monte Carlo simulations to compute the p-values and confidence intervals. The number of "good" biomarker candidates is compared to the expected number of "good" biomarker candidates in a data set with no association to the considered disease states. This allows judging the potential of the pilot study even if statistical tests with correction for multiple testing fail to provide any hint of significance.
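The comparison described in this abstract (observed count of "good" candidates versus the count expected without any association) can be sketched as follows. This is a minimal illustration, not HiPerMAb's implementation: it assumes two classes, uses the plain two-class AUC as the performance measure, and draws the null data as independent Gaussian noise. The function names `auc_two_class`, `count_good`, `null_counts`, and `monte_carlo_p_value` are hypothetical.

```python
import numpy as np

def auc_two_class(values, labels):
    """Two-class AUC of one candidate via the rank-sum statistic (assumes no ties)."""
    ranks = np.argsort(np.argsort(values)) + 1          # 1-based ranks
    n_pos = int(np.sum(labels == 1))
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    auc = u / (n_pos * n_neg)
    return max(auc, 1.0 - auc)                          # direction of effect is not fixed

def count_good(data, labels, threshold):
    """Number of columns (candidates) whose AUC reaches the threshold."""
    return int(sum(auc_two_class(col, labels) >= threshold for col in data.T))

def null_counts(n_samples, n_markers, labels, threshold, n_sim=200, seed=0):
    """Counts of 'good' candidates in simulated data with no class association."""
    rng = np.random.default_rng(seed)
    return np.array([
        count_good(rng.normal(size=(n_samples, n_markers)), labels, threshold)
        for _ in range(n_sim)
    ])

def monte_carlo_p_value(observed_count, simulated_counts):
    """Share of null simulations with at least as many 'good' candidates."""
    hits = np.sum(simulated_counts >= observed_count)
    return (1 + hits) / (len(simulated_counts) + 1)
```

With a real data matrix `data` (samples by candidates), one would compare `count_good(data, labels, threshold)` against `null_counts(...)`; the mean of the simulated counts estimates the expected number of "good" candidates under no association.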
Affiliation(s)
- Amani Al-Mekhlafi
  - Department of Biostatistics, Helmholtz Centre for Infection Research, Braunschweig, Germany
  - PhD Programme "Epidemiology", Hannover Medical School (MHH), Hannover, Germany
- Frank Klawonn
  - Department of Computer Science, Ostfalia University of Applied Sciences, Wolfenbuettel, Germany

2
Lin LS, Kao CH, Li YJ, Chen HH, Chen HY. Improved support vector machine classification for imbalanced medical datasets by novel hybrid sampling combining modified mega-trend-diffusion and bagging extreme learning machine model. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:17672-17701. [PMID: 38052532 DOI: 10.3934/mbe.2023786] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/07/2023]
Abstract
To handle imbalanced datasets in machine learning or deep learning models, some studies suggest sampling techniques that generate virtual examples of minority classes to improve the models' prediction accuracy. However, for kernel-based support vector machines (SVM), some sampling methods generate synthetic examples in the original data space rather than in the high-dimensional feature space. This may be ineffective in improving SVM classification for imbalanced datasets. To address this problem, we propose a novel hybrid sampling technique termed modified mega-trend-diffusion-extreme learning machine (MMTD-ELM) to effectively move the SVM decision boundary toward a region of the majority class. This movement improves the SVM's prediction for minority class examples. The proposed method combines the α-cut fuzzy number method for screening representative examples of the majority class with the MMTD method for creating new examples of the minority class. Furthermore, we construct a bagging ELM model to monitor the similarity between new examples and original data. In this paper, four datasets are used to test the efficiency of the proposed MMTD-ELM method in imbalanced data prediction. Additionally, we deployed two SVM models to compare the prediction performance of the proposed MMTD-ELM method with three state-of-the-art sampling techniques in terms of geometric mean (G-mean), F-measure (F1), index of balanced accuracy (IBA), and area under curve (AUC) metrics. Furthermore, a paired t-test is used to determine whether the suggested method differs statistically significantly from the other sampling techniques in terms of the four evaluation metrics. The experimental results demonstrated that the proposed method achieves the best average values in terms of G-mean, F1, IBA, and AUC. Overall, the suggested MMTD-ELM method outperforms these sampling methods for imbalanced datasets.
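Two of the metrics named in this abstract, G-mean and F1, follow directly from a binary confusion matrix. The sketch below uses their standard definitions; the function name `gmean_and_f1` and the convention of treating the minority class as positive are assumptions made for illustration.

```python
import math

def gmean_and_f1(tp, fn, fp, tn):
    """G-mean and F1 from confusion-matrix counts (minority class taken as positive).

    Assumes tp + fn > 0, tn + fp > 0 and tp > 0, so that no denominator is zero.
    """
    sensitivity = tp / (tp + fn)          # recall on the positive (minority) class
    specificity = tn / (tn + fp)          # recall on the negative (majority) class
    precision = tp / (tp + fp)
    g_mean = math.sqrt(sensitivity * specificity)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return g_mean, f1
```

Because G-mean multiplies the per-class recalls, a classifier that ignores the minority class scores near zero even when its overall accuracy is high.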
Affiliation(s)
- Liang-Sian Lin
  - Department of Information Management, National Taipei University of Nursing and Health Sciences, Taipei 112303, Taiwan
- Chen-Huan Kao
  - Department of Information Management, National Taipei University of Nursing and Health Sciences, Taipei 112303, Taiwan
- Yi-Jie Li
  - Department of Information Management, National Taipei University of Nursing and Health Sciences, Taipei 112303, Taiwan
- Hao-Hsuan Chen
  - Department of Information Management, National Taipei University of Nursing and Health Sciences, Taipei 112303, Taiwan
- Hung-Yu Chen
  - Department of Information Management, National Chin-Yi University of Technology, Taichung 411030, Taiwan

3
Movahedi F, Padman R, Antaki JF. Limitations of receiver operating characteristic curve on imbalanced data: Assist device mortality risk scores. J Thorac Cardiovasc Surg 2023; 165:1433-1442.e2. [PMID: 34446286 PMCID: PMC8800945 DOI: 10.1016/j.jtcvs.2021.07.041] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Received: 02/02/2021] [Revised: 07/20/2021] [Accepted: 07/23/2021] [Indexed: 02/01/2023]
Abstract
OBJECTIVE In the left ventricular assist device domain, the receiver operating characteristic is a commonly applied metric of performance of classifiers. However, the receiver operating characteristic can provide a distorted view of classifiers' ability to predict short-term mortality due to the overwhelmingly greater proportion of patients who survive, that is, imbalanced data. This study illustrates the ambiguity of the receiver operating characteristic in evaluating 2 classifiers of 90-day left ventricular assist device mortality and introduces the precision recall curve as a supplemental metric that is more representative of left ventricular assist device classifiers in predicting the minority class. METHODS This study compared the receiver operating characteristic and precision recall curve for 2 classifiers for 90-day left ventricular assist device mortality, HeartMate Risk Score and Random Forest, for 800 patients (test group) recorded in the Interagency Registry for Mechanically Assisted Circulatory Support who received a continuous-flow left ventricular assist device between 2006 and 2016 (mean age, 59 years; 146 female vs 654 male patients), in whom the 90-day mortality rate is only 8%. RESULTS The receiver operating characteristic indicates similar performance of the Random Forest and HeartMate Risk Score classifiers, with areas under the curve of 0.77 and 0.63, respectively. This is in contrast to their precision recall curves, with areas under the curve of 0.43 versus 0.16 for Random Forest and HeartMate Risk Score, respectively. The precision recall curve for HeartMate Risk Score showed that precision rapidly decreased to only 10% with slightly increasing sensitivity. CONCLUSIONS The receiver operating characteristic can portray an overly optimistic performance of a classifier or risk score when applied to imbalanced data. The precision recall curve provides better insight about the performance of a classifier by focusing on the minority class.
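The gap between the two curves follows from arithmetic: the receiver operating characteristic uses only sensitivity and specificity, while precision also depends on prevalence. The sketch below applies Bayes' rule at the 8% mortality rate reported above; the operating point (sensitivity 0.7, specificity 0.7) is an illustrative assumption, not a value from the study, and `precision_at` is a hypothetical helper.

```python
def precision_at(sensitivity, specificity, prevalence):
    """Precision (positive predictive value) at one operating point."""
    true_pos = sensitivity * prevalence                 # fraction of all cases flagged correctly
    false_pos = (1 - specificity) * (1 - prevalence)    # fraction of all cases flagged wrongly
    return true_pos / (true_pos + false_pos)

# Same operating point, two prevalences: balanced data vs. 8% mortality.
balanced = precision_at(0.7, 0.7, 0.5)
imbalanced = precision_at(0.7, 0.7, 0.08)
```

At this operating point the precision is 0.70 on balanced data but about 0.17 at 8% prevalence, even though the point's position on the ROC curve is unchanged.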
Affiliation(s)
- Faezeh Movahedi
  - Swanson School of Engineering, University of Pittsburgh, Pittsburgh, Pa
- Rema Padman
  - Heinz College, Carnegie Mellon University, Pittsburgh, Pa
- James F Antaki
  - Meinig School of Biomedical Engineering, Cornell University, Ithaca, NY

4
A statistical learning framework for predicting left ventricular ejection fraction based on glutathione peroxidase-3 level in ischemic heart disease. Comput Biol Med 2022; 149:105929. [DOI: 10.1016/j.compbiomed.2022.105929] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2022] [Revised: 07/10/2022] [Accepted: 07/30/2022] [Indexed: 11/18/2022]

5
Chicken Swarm-Based Feature Subset Selection with Optimal Machine Learning Enabled Data Mining Approach. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12136787] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/17/2022]
Abstract
Data mining (DM) involves the process of identifying patterns, correlations, and anomalies in massive datasets. DM is applied in several areas such as education, healthcare, business, and finance. Educational Data Mining (EDM) is an interdisciplinary domain that focuses on applying DM, machine learning (ML), and statistical approaches to pattern recognition in massive quantities of educational data. This type of data suffers from the curse of dimensionality. Thus, feature selection (FS) approaches become essential. This study designs a Feature Subset Selection with an optimal machine learning model for Educational Data Mining (FSSML-EDM). The proposed method involves three major processes. At the initial stage, the presented FSSML-EDM model uses the Chicken Swarm Optimization-based Feature Selection (CSO-FS) technique for selecting feature subsets. Next, an extreme learning machine (ELM) classifier is employed for the classification of educational data. Finally, the Artificial Hummingbird (AHB) algorithm is utilized for adjusting the parameters involved in the ELM model. The performance study revealed that the FSSML-EDM model achieves better results compared with other models under several dimensions.
6
Aarthi R, Vinayagasundaram B. Effective management of class imbalance problem in climate data analysis using a hybrid of deep learning and data level sampling. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2022. [DOI: 10.3233/jifs-210666] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Climate change and its consequences for human life have emerged as the world’s most pressing challenge. Due to the complexity, veracity, and velocity of climate data, a traditional, simple, single machine learning model is not sufficient to perform effective and timely analysis. With the proposed hybrid model, climate data can be analyzed effectively and climate models can be developed. The deep learning AutoEncoder (AE) is used for feature extraction and for the removal of redundant and noisy data. The Synthetic Minority class Oversampling (SMOTE) technique is used to generate samples in the minority class to mitigate the imbalance in the sample distribution. An Extreme Learning Machine (ELM) is used for further feature classification. The proposed method exploits big data strategies and the results interpretation process to extract accurate insight from climate data. ELM handles the class imbalance problem to improve the performance of the Early Warning System (EWS) model and fine-tune it. The hybrid method drastically reduces the computation cost and improves the accuracy to 93%, 86%, 95%, and 98% on four different datasets against other machine learning models. The experimental results of the AE_SMOTE_ELM model, compared with other state-of-the-art deep learning methods, show an accuracy and an efficiency of 90.4% and 91.76%, respectively, for two climate datasets.
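The oversampling step mentioned in this abstract can be illustrated with the core idea of SMOTE: each synthetic sample lies on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified sketch under stated assumptions (brute-force distances, `k` smaller than the number of minority samples), not the authors' pipeline; `smote_like` is a hypothetical name.

```python
import numpy as np

def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]    # indices of the k nearest neighbours
    base = rng.integers(0, len(minority), size=n_new)
    partner = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                    # interpolation factor in [0, 1)
    return minority[base] + gap * (minority[partner] - minority[base])
```

The synthetic rows would then be appended to the training set with the minority label before the classifier is fitted.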
Affiliation(s)
- R.J. Aarthi
  - Computer Centre, Madras Institute of Technology, Anna University, Chrompet, Chennai, India
- B. Vinayagasundaram
  - Computer Centre, Madras Institute of Technology, Anna University, Chrompet, Chennai, India

7
Ghorbani M, Kazi A, Soleymani Baghshah M, Rabiee HR, Navab N. RA-GCN: Graph convolutional network for disease prediction problems with imbalanced data. Med Image Anal 2021; 75:102272. [PMID: 34731774 DOI: 10.1016/j.media.2021.102272] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 11/26/2020] [Revised: 10/03/2021] [Accepted: 10/15/2021] [Indexed: 10/20/2022]
Abstract
Disease prediction is a well-known classification problem in medical applications. Graph Convolutional Networks (GCNs) provide a powerful tool for analyzing the patients' features relative to each other. This can be achieved by modeling the problem as a graph node classification task, where each node is a patient. Due to the nature of such medical datasets, class imbalance is a prevalent issue in the field of disease prediction, where the distribution of classes is skewed. When the class imbalance is present in the data, the existing graph-based classifiers tend to be biased towards the major class(es) and neglect the samples in the minor class(es). On the other hand, the correct diagnosis of the rare positive cases (true-positives) among all the patients is vital in a healthcare system. In conventional methods, such imbalance is tackled by assigning appropriate weights to classes in the loss function which is still dependent on the relative values of weights, sensitive to outliers, and in some cases biased towards the minor class(es). In this paper, we propose a Re-weighted Adversarial Graph Convolutional Network (RA-GCN) to prevent the graph-based classifier from emphasizing the samples of any particular class. This is accomplished by associating a graph-based neural network to each class, which is responsible for weighting the class samples and changing the importance of each sample for the classifier. Therefore, the classifier adjusts itself and determines the boundary between classes with more attention to the important samples. The parameters of the classifier and weighting networks are trained by an adversarial approach. We show experiments on synthetic and three publicly available medical datasets. Our results demonstrate the superiority of RA-GCN compared to recent methods in identifying the patient's status on all three datasets. The detailed analysis of our method is provided as quantitative and qualitative experiments on synthetic datasets.
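The conventional baseline this abstract contrasts itself with, assigning class weights in the loss function, can be sketched as a class-weighted cross-entropy. The inverse-frequency weighting below is one common choice and is an assumption here; it is not the RA-GCN method, which learns sample weights adversarially. The function names are hypothetical.

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumes every class occurs at least once in labels.
    """
    counts = np.bincount(labels, minlength=n_classes)
    return len(labels) / (n_classes * counts)

def weighted_cross_entropy(probs, labels, class_weights):
    """Weighted mean of -log p(true class); probs has shape (n_samples, n_classes)."""
    picked = probs[np.arange(len(labels)), labels]   # predicted probability of the true class
    w = class_weights[labels]                        # per-sample weight from its class
    return float(np.sum(-w * np.log(picked)) / np.sum(w))
```

The loss depends directly on the chosen weight values, which is the sensitivity to relative weights that the abstract points out.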
Affiliation(s)
- Mahsa Ghorbani
  - Department of Computer Engineering, Sharif University of Technology, Tehran, Iran; Computer Aided Medical Procedures, Department of Informatics, Technical University of Munich, Germany
- Anees Kazi
  - Computer Aided Medical Procedures, Department of Informatics, Technical University of Munich, Germany
- Hamid R Rabiee
  - Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
- Nassir Navab
  - Computer Aided Medical Procedures, Department of Informatics, Technical University of Munich, Germany; Whiting School of Engineering, Johns Hopkins University, Baltimore, USA

8
Prediction of Aquatic Ecosystem Health Indices through Machine Learning Models Using the WGAN-Based Data Augmentation Method. SUSTAINABILITY 2021. [DOI: 10.3390/su131810435] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 11/16/2022]
Abstract
Changes in hydrological characteristics and increases in various pollutant loadings due to rapid climate change and urbanization have a significant impact on the deterioration of aquatic ecosystem health (AEH). Therefore, it is important to effectively evaluate the AEH in advance and establish appropriate strategic plans. Recently, machine learning (ML) models have been widely used to solve hydrological and environmental problems in various fields. However, in general, collecting sufficient data for ML training is time-consuming and labor-intensive. Especially in classification problems, data imbalance can lead to erroneous prediction results of ML models. In this study, we proposed a method to solve the data imbalance problem through data augmentation based on Wasserstein Generative Adversarial Network (WGAN) and to efficiently predict the grades (from A to E grades) of AEH indices (i.e., Benthic Macroinvertebrate Index (BMI), Trophic Diatom Index (TDI), Fish Assessment Index (FAI)) through the ML models. Raw datasets for the AEH indices composed of various physicochemical factors (i.e., WT, DO, BOD5, SS, TN, TP, and Flow) and AEH grades were built and augmented through the WGAN. The performance of each ML model was evaluated through a 10-fold cross-validation (CV), and the performances of the ML models trained on the raw and WGAN-based training sets were compared and analyzed through AEH grade prediction on the test sets. The results showed that the ML models trained on the WGAN-based training set had an average F1-score for grades of each AEH index of 0.9 or greater for the test set, which was superior to the models trained on the raw training set (fewer data compared to other datasets) only. Through the above results, it was confirmed that by using the dataset augmented through WGAN, the ML model can yield better AEH grade predictive performance compared to the model trained on limited datasets; this approach reduces the effort needed for actual data collection from rivers, which requires enormous time and cost. In the future, the results of this study can be used as basic data to construct big data of aquatic ecosystems, needed to efficiently evaluate and predict AEH in rivers based on the ML models.
9
Mzoughi H, Njeh I, Wali A, Slima MB, BenHamida A, Mhiri C, Mahfoudhe KB. Deep Multi-Scale 3D Convolutional Neural Network (CNN) for MRI Gliomas Brain Tumor Classification. J Digit Imaging 2021; 33:903-915. [PMID: 32440926 DOI: 10.1007/s10278-020-00347-9] [Citation(s) in RCA: 83] [Impact Index Per Article: 27.7] [Indexed: 02/07/2023]
Abstract
Accurate and fully automatic brain tumor grading from volumetric 3D magnetic resonance imaging (MRI) is an essential procedure in the field of medical imaging analysis for full assistance of neuroradiology during clinical diagnosis. We propose, in this paper, an efficient and fully automatic deep multi-scale three-dimensional convolutional neural network (3D CNN) architecture for glioma brain tumor classification into low-grade gliomas (LGG) and high-grade gliomas (HGG) using the whole volumetric T1-Gado MRI sequence. Based on a 3D convolutional layer and a deep network, via small kernels, the proposed architecture has the potential to merge both the local and global contextual information with reduced weights. To overcome the data heterogeneity, we proposed a preprocessing technique based on intensity normalization and adaptive contrast enhancement of MRI data. Furthermore, for effective training of such a deep 3D network, we used a data augmentation technique. The paper studied the impact of the proposed preprocessing and data augmentation on classification accuracy. Quantitative evaluations on the well-known benchmark (Brats-2018) attest that the proposed architecture generates the most discriminative feature map to distinguish between LGG and HGG compared with a 2D CNN variant. The proposed approach offers promising results, outperforming recent supervised and unsupervised state-of-the-art approaches by achieving an overall accuracy of 96.49% on the validation dataset. The obtained experimental results confirm that adequate MRI preprocessing and data augmentation can lead to accurate classification when exploiting CNN-based approaches.
Affiliation(s)
- Hiba Mzoughi
  - Advanced Technologies for Medecine and Signal (ATMS), Sfax university, ENIS, Route de la Soukra km 4, 3038, Sfax, Tunisia
  - National Engineering School of Gabes, Gabes university, Avenue Omar Ibn El Khattab, Zrig Gabes, 6029, Gabes, Tunisia
- Ines Njeh
  - Advanced Technologies for Medecine and Signal (ATMS), Sfax university, ENIS, Route de la Soukra km 4, 3038, Sfax, Tunisia
  - Higher Institute of Computer Science and Multimedia of Gabes, Gabes university, Gabes, Tunisia
- Ali Wali
  - National Engineering School of Sfax, Regim-Lab, Sfax university, Sfax, Tunisia
- Mohamed Ben Slima
  - Advanced Technologies for Medecine and Signal (ATMS), Sfax university, ENIS, Route de la Soukra km 4, 3038, Sfax, Tunisia
  - National School of Electronics and Telecommunications of Sfax, Sfax university, Sfax, Tunisia
- Ahmed BenHamida
  - Advanced Technologies for Medecine and Signal (ATMS), Sfax university, ENIS, Route de la Soukra km 4, 3038, Sfax, Tunisia
  - National Engineering School of Sfax, Regim-Lab, Sfax university, Sfax, Tunisia
- Chokri Mhiri
  - Department of Neurology, Habib Bourguiba University Hospital, Sfax, Tunisia

10
Alhassan Z, Watson M, Budgen D, Alshammari R, Alessa A, Al Moubayed N. Improving Current Glycated Hemoglobin Prediction in Adults: Use of Machine Learning Algorithms With Electronic Health Records. JMIR Med Inform 2021; 9:e25237. [PMID: 34028357 PMCID: PMC8185616 DOI: 10.2196/25237] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/23/2020] [Revised: 01/05/2021] [Accepted: 04/22/2021] [Indexed: 01/30/2023]
Abstract
Background Predicting the risk of glycated hemoglobin (HbA1c) elevation can help identify patients with the potential for developing serious chronic health problems, such as diabetes. Early preventive interventions based upon advanced predictive models using electronic health records data for identifying such patients can ultimately help provide better health outcomes. Objective Our study investigated the performance of predictive models to forecast HbA1c elevation levels by employing several machine learning models. We also examined the use of patient electronic health record longitudinal data in the performance of the predictive models. Explainable methods were employed to interpret the decisions made by the black box models. Methods This study employed multiple logistic regression, random forest, support vector machine, and logistic regression models, as well as a deep learning model (multilayer perceptron) to classify patients with normal (<5.7%) and elevated (≥5.7%) levels of HbA1c. We also integrated current visit data with historical (longitudinal) data from previous visits. Explainable machine learning methods were used to interrogate the models and provide an understanding of the reasons behind the decisions made by the models. All models were trained and tested using a large data set from Saudi Arabia with 18,844 unique patient records. Results The machine learning models achieved promising results for predicting current HbA1c elevation risk. When coupled with longitudinal data, the machine learning models outperformed the multiple logistic regression model used in the comparative study. The multilayer perceptron model achieved an accuracy of 83.22% for the area under the receiver operating characteristic curve when used with historical data. All models showed a close level of agreement on the contribution of random blood sugar and age variables with and without longitudinal data. Conclusions This study shows that machine learning models can provide promising results for the task of predicting current HbA1c levels (≥5.7% or <5.7%). Using patients’ longitudinal data improved the performance and affected the relative importance of the predictors used. The models showed results that are consistent with comparable studies.
Affiliation(s)
- Zakhriya Alhassan
  - Department of Computer Science, Durham University, Durham, United Kingdom
  - College of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia
- Matthew Watson
  - Department of Computer Science, Durham University, Durham, United Kingdom
- David Budgen
  - Department of Computer Science, Durham University, Durham, United Kingdom
- Riyad Alshammari
  - National Center for Artificial Intelligence, Saudi Data and Artificial Intelligence Authority, Riyadh, Saudi Arabia
- Ali Alessa
  - Department of Information Technology Programs, Institute of Public Administration, Riyadh, Saudi Arabia
- Noura Al Moubayed
  - Department of Computer Science, Durham University, Durham, United Kingdom

11
Rangasamy DP, Rajappan S, Natarajan A, Ramasamy R, Vijayakumar D. Variable population‐sized particle swarm optimization for highly imbalanced dataset classification. Comput Intell 2021. [DOI: 10.1111/coin.12436] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/29/2022]
Affiliation(s)
- Sivaraj Rajappan
  - Department of Computer Science and Engineering, Nandha Engineering College, Erode, India
- Anitha Natarajan
  - Department of Information Technology, Kongu Engineering College, Erode, India
- Rajadevi Ramasamy
  - Department of Information Technology, Kongu Engineering College, Erode, India
- Devisurya Vijayakumar
  - Department of Information Technology, Kongu Engineering College, Erode, India
  - Department of Computer Science and Engineering, Nandha Engineering College, Erode, India

12
Quiroz JC, Feng YZ, Cheng ZY, Rezazadegan D, Chen PK, Lin QT, Qian L, Liu XF, Berkovsky S, Coiera E, Song L, Qiu X, Liu S, Cai XR. Development and Validation of a Machine Learning Approach for Automated Severity Assessment of COVID-19 Based on Clinical and Imaging Data: Retrospective Study. JMIR Med Inform 2021; 9:e24572. [PMID: 33534723 PMCID: PMC7879715 DOI: 10.2196/24572] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 09/25/2020] [Revised: 01/24/2021] [Accepted: 01/27/2021] [Indexed: 01/15/2023]
Abstract
BACKGROUND COVID-19 has overwhelmed health systems worldwide. It is important to identify severe cases as early as possible, such that resources can be mobilized and treatment can be escalated. OBJECTIVE This study aims to develop a machine learning approach for automated severity assessment of COVID-19 based on clinical and imaging data. METHODS Clinical data (including demographics, signs, symptoms, comorbidities, and blood test results) and chest computed tomography scans of 346 patients from 2 hospitals in the Hubei Province, China, were used to develop machine learning models for automated severity assessment in diagnosed COVID-19 cases. We compared the predictive power of the clinical and imaging data from multiple machine learning models and further explored the use of four oversampling methods to address the imbalanced classification issue. Features with the highest predictive power were identified using the Shapley Additive Explanations framework. RESULTS Imaging features had the strongest impact on the model output, while a combination of clinical and imaging features yielded the best performance overall. The identified predictive features were consistent with those reported previously. Although oversampling yielded mixed results, it achieved the best model performance in our study. Logistic regression models differentiating between mild and severe cases achieved the best performance for clinical features (area under the curve [AUC] 0.848; sensitivity 0.455; specificity 0.906), imaging features (AUC 0.926; sensitivity 0.818; specificity 0.901), and a combination of clinical and imaging features (AUC 0.950; sensitivity 0.764; specificity 0.919). The synthetic minority oversampling method further improved the performance of the model using combined features (AUC 0.960; sensitivity 0.845; specificity 0.929). CONCLUSIONS Clinical and imaging features can be used for automated severity assessment of COVID-19 and can potentially help triage patients with COVID-19 and prioritize care delivery to those at a higher risk of severe disease.
Affiliation(s)
- Juan Carlos Quiroz
  - Centre for Health Informatics, Australian Institute of Health Innovation, Faculty of Medicine, Health and Human Sciences, Macquarie University, Macquarie Park, Australia
  - Centre for Big Data Research in Health, University of New South Wales, Sydney, Australia
- You-Zhen Feng
  - Medical Imaging Centre, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Zhong-Yuan Cheng
  - Medical Imaging Centre, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Dana Rezazadegan
  - Centre for Health Informatics, Australian Institute of Health Innovation, Faculty of Medicine, Health and Human Sciences, Macquarie University, Macquarie Park, Australia
  - Department of Computer Science and Software Engineering, Swinburne University of Technology, Melbourne, Australia
- Ping-Kang Chen
  - Medical Imaging Centre, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Qi-Ting Lin
  - Medical Imaging Centre, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Long Qian
  - Department of Biomedical Engineering, Peking University, Beijing, China
- Xiao-Fang Liu
  - Institute of Robotics and Automatic Information System, College of Artificial Intelligence, Nankai University, Tianjin, China
  - School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
- Shlomo Berkovsky
  - Centre for Health Informatics, Australian Institute of Health Innovation, Faculty of Medicine, Health and Human Sciences, Macquarie University, Macquarie Park, Australia
- Enrico Coiera
  - Centre for Health Informatics, Australian Institute of Health Innovation, Faculty of Medicine, Health and Human Sciences, Macquarie University, Macquarie Park, Australia
- Lei Song
  - Department of Radiology, Xiangyang Central Hospital, Affiliated Hospital of Hubei University of Arts and Science, Xiangyang, China
- Xiaoming Qiu
  - Department of Radiology, Huangshi Central Hospital, Affiliated Hospital of Hubei Polytechnic University, Edong Healthcare Group, Huangshi, China
- Sidong Liu
  - Centre for Health Informatics, Australian Institute of Health Innovation, Faculty of Medicine, Health and Human Sciences, Macquarie University, Macquarie Park, Australia
- Xiang-Ran Cai
  - Medical Imaging Centre, The First Affiliated Hospital of Jinan University, Guangzhou, China

13
Maglogiannis I, Iliadis L, Pimenidis E. Overlap-Based Undersampling Method for Classification of Imbalanced Medical Datasets. IFIP ADVANCES IN INFORMATION AND COMMUNICATION TECHNOLOGY 2020. [PMCID: PMC7256568 DOI: 10.1007/978-3-030-49186-4_30] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 11/28/2022]
Abstract
Early diagnosis of some life-threatening diseases such as cancers and heart disease is crucial for effective treatment. Supervised machine learning has proved to be a very useful tool for this purpose. Historical patient data, including clinical and demographic information, is used to train learning algorithms. This builds predictive models that provide initial diagnoses. However, in the medical domain, it is common for the positive class to be under-represented in a dataset. In such a scenario, a typical learning algorithm tends to be biased towards the negative class, which is the majority class, and to misclassify positive cases. This is known as the class imbalance problem. In this paper, a framework for predictive diagnostics of diseases with imbalanced records is presented. To reduce the classification bias, we propose the use of an overlap-based undersampling method to improve the visibility of minority class samples in the region where the two classes overlap. This is achieved by detecting and removing negative class instances from the overlapping region, which improves class separability in the data space. Experimental results show high accuracy on the positive class, which is highly preferable in the medical domain, while good trade-offs between sensitivity and specificity were obtained. Results also show that the method often outperformed other state-of-the-art and well-established techniques.
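The idea of removing negative instances from the overlapping region can be sketched with a simple neighbourhood rule. The abstract does not state how the authors detect overlap, so the rule used here (drop a negative sample when any of its `k` nearest neighbours is positive) is a stand-in assumption, and `remove_overlapping_negatives` is a hypothetical name.

```python
import numpy as np

def remove_overlapping_negatives(X, y, k=3):
    """Drop majority (label 0) samples that have a minority (label 1) sample
    among their k nearest neighbours; minority samples are always kept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Brute-force pairwise Euclidean distances; adequate for small datasets.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # exclude each point from its own neighbours
    nearest = np.argsort(dist, axis=1)[:, :k]
    near_positive = (y[nearest] == 1).any(axis=1)
    keep = ~((y == 0) & near_positive)
    return X[keep], y[keep]
```

Only the majority class is thinned, so no synthetic data is introduced and every positive case remains in the training set.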
Affiliation(s)
- Lazaros Iliadis
  - Department of Civil Engineering, Lab of Mathematics and Informatics (ISCE), Democritus University of Thrace, Xanthi, Greece
- Elias Pimenidis
  - Department of Computer Science and Creative Technologies, University of the West of England, Bristol, UK