1
Lin LS, Kao CH, Li YJ, Chen HH, Chen HY. Improved support vector machine classification for imbalanced medical datasets by novel hybrid sampling combining modified mega-trend-diffusion and bagging extreme learning machine model. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:17672-17701. [PMID: 38052532 DOI: 10.3934/mbe.2023786] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/07/2023]
Abstract
To handle imbalanced datasets in machine learning or deep learning models, some studies suggest sampling techniques that generate virtual examples of minority classes to improve the models' prediction accuracy. However, for kernel-based support vector machines (SVM), some sampling methods generate synthetic examples in the original data space rather than in the high-dimensional feature space. This may be ineffective in improving SVM classification for imbalanced datasets. To address this problem, we propose a novel hybrid sampling technique termed modified mega-trend-diffusion-extreme learning machine (MMTD-ELM) to effectively move the SVM decision boundary toward the region of the majority class. This movement can improve the SVM's prediction of minority class examples. The proposed method combines the α-cut fuzzy number method for screening representative examples of the majority class with the MMTD method for creating new examples of the minority class. Furthermore, we construct a bagging ELM model to monitor the similarity between new examples and the original data. In this paper, four datasets are used to test the efficiency of the proposed MMTD-ELM method in imbalanced data prediction. Additionally, we deployed two SVM models to compare the prediction performance of the proposed MMTD-ELM method with three state-of-the-art sampling techniques in terms of the geometric mean (G-mean), F-measure (F1), index of balanced accuracy (IBA) and area under curve (AUC) metrics. Furthermore, a paired t-test is used to determine whether the suggested method differs significantly from the other sampling techniques on the four evaluation metrics. The experimental results demonstrated that the proposed method achieves the best average values in terms of G-mean, F1, IBA and AUC. Overall, the suggested MMTD-ELM method outperforms these sampling methods for imbalanced datasets.
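The evaluation metrics named in this abstract can be illustrated with a minimal Python sketch. It assumes the standard definitions of G-mean and F1, and the commonly used form of the index of balanced accuracy, IBA = (1 + α·(TPR − TNR))·TPR·TNR with α = 0.1; the abstract does not state which α the authors used, and the confusion-matrix counts below are made up for illustration.

```python
import math


def imbalance_metrics(tp, fn, fp, tn, alpha=0.1):
    """Return G-mean, F1 and IBA for a binary confusion matrix.

    The minority class is treated as the positive class.
    alpha is the IBA weighting factor (0.1 is an assumed default).
    """
    tpr = tp / (tp + fn)          # sensitivity (recall on the minority class)
    tnr = tn / (tn + fp)          # specificity (recall on the majority class)
    precision = tp / (tp + fp)
    g_mean = math.sqrt(tpr * tnr)
    f1 = 2 * precision * tpr / (precision + tpr)
    # IBA scales the squared G-mean by the dominance term (TPR - TNR).
    iba = (1 + alpha * (tpr - tnr)) * tpr * tnr
    return {"g_mean": g_mean, "f1": f1, "iba": iba}


# Hypothetical counts: 50 minority and 150 majority examples.
metrics = imbalance_metrics(tp=40, fn=10, fp=20, tn=130)
print(metrics)
```

Unlike plain accuracy, all three values drop sharply when the classifier ignores the minority class, which is why they are preferred for imbalanced data.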
Affiliation(s)
- Liang-Sian Lin
  - Department of Information Management, National Taipei University of Nursing and Health Sciences, Taipei 112303, Taiwan
- Chen-Huan Kao
  - Department of Information Management, National Taipei University of Nursing and Health Sciences, Taipei 112303, Taiwan
- Yi-Jie Li
  - Department of Information Management, National Taipei University of Nursing and Health Sciences, Taipei 112303, Taiwan
- Hao-Hsuan Chen
  - Department of Information Management, National Taipei University of Nursing and Health Sciences, Taipei 112303, Taiwan
- Hung-Yu Chen
  - Department of Information Management, National Chin-Yi University of Technology, Taichung 411030, Taiwan
2
Nhu NT, Kang JH, Yeh TS, Wu CC, Tsai CY, Piravej K, Lam C. Prediction of posttraumatic functional recovery in middle-aged and older patients through dynamic ensemble selection modeling. Front Public Health 2023; 11:1164820. [PMID: 37408743 PMCID: PMC10319009 DOI: 10.3389/fpubh.2023.1164820] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2023] [Accepted: 05/17/2023] [Indexed: 07/07/2023] Open
Abstract
Introduction: Age-specific risk factors may delay posttraumatic functional recovery; complex interactions exist between these factors. In this study, we investigated the predictive ability of machine learning models for posttraumatic (6 months) functional recovery in middle-aged and older patients on the basis of their preexisting health conditions.
Methods: Data obtained from injured patients aged ≥45 years were divided into training-validation (n = 368) and test (n = 159) data sets. The input features were the sociodemographic characteristics and baseline health conditions of the patients. The output feature was functional status 6 months after injury; this was assessed using the Barthel Index (BI). On the basis of their BI scores, the patients were categorized into functionally independent (BI >60) and functionally dependent (BI ≤60) groups. The permutation feature importance method was used for feature selection. Six algorithms were validated through cross-validation with hyperparameter optimization. The algorithms exhibiting satisfactory performance were subjected to bagging to construct stacking, voting, and dynamic ensemble selection models. The best model was evaluated on the test data set. Partial dependence (PD) and individual conditional expectation (ICE) plots were created.
Results: In total, nineteen of twenty-seven features were selected. Logistic regression, linear discriminant analysis, and Gaussian Naive Bayes algorithms exhibited satisfactory performance and were, therefore, used to construct ensemble models. The k-Nearest Oracle Elimination model outperformed the other models when evaluated on the training-validation data set (sensitivity: 0.732, 95% CI: 0.702-0.761; specificity: 0.813, 95% CI: 0.805-0.822); it exhibited comparable performance on the test data set (sensitivity: 0.779, 95% CI: 0.559-0.950; specificity: 0.859, 95% CI: 0.799-0.912). The PD and ICE plots showed patterns consistent with practical tendencies.
Conclusion: Preexisting health conditions can predict long-term functional outcomes in injured middle-aged and older patients, thus predicting prognosis and facilitating clinical decision-making.
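The outcome labelling and the two reported metrics can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the abstract does not say which group was treated as the positive class (the sketch assumes "dependent"), and the scores and predictions are made up.

```python
def bi_group(bi_score):
    """Label a patient by Barthel Index: >60 independent, <=60 dependent."""
    return "independent" if bi_score > 60 else "dependent"


def sensitivity_specificity(y_true, y_pred, positive="dependent"):
    """Compute sensitivity and specificity for the assumed positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up 6-month BI scores and model predictions for six patients.
bi_scores = [95, 60, 61, 30, 80, 45]
y_true = [bi_group(s) for s in bi_scores]
y_pred = ["independent", "dependent", "dependent",
          "dependent", "independent", "independent"]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(y_true, sens, spec)
```

Note that a score of exactly 60 falls in the dependent group, following the ≤60 cutoff stated in the Methods.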
Affiliation(s)
- Nguyen Thanh Nhu
  - International Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
  - Faculty of Medicine, Can Tho University of Medicine and Pharmacy, Can Tho, Vietnam
- Jiunn-Horng Kang
  - International Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
  - Department of Physical Medicine and Rehabilitation, School of Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
  - Department of Physical Medicine and Rehabilitation, Taipei Medical University Hospital, Taipei, Taiwan
  - Graduate Institute of Nanomedicine and Medical Engineering, College of Biomedical Engineering, Taipei Medical University, Taipei, Taiwan
  - Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
- Tian-Shin Yeh
  - Department of Physical Medicine and Rehabilitation, School of Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
  - Department of Physical Medicine and Rehabilitation, Wan Fang Hospital, Taipei Medical University, Taipei, Taiwan
  - Department of Epidemiology and Nutrition, Harvard T. H. Chan School of Public Health, Harvard University, Boston, MA, United States
  - Nuffield Department of Population Health, University of Oxford, Oxford, United Kingdom
- Chia-Chieh Wu
  - Emergency Department, Wan Fang Hospital, Taipei Medical University, Taipei, Taiwan
  - Department of Emergency, School of Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
- Cheng-Yu Tsai
  - Centre for Transport Studies, Department of Civil and Environmental Engineering, Imperial College London, London, United Kingdom
- Krisna Piravej
  - Department of Rehabilitation Medicine, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand
  - Department of Chula Neuroscience Center, King Chulalongkorn Memorial Hospital, Bangkok, Thailand
- Carlos Lam
  - Emergency Department, Wan Fang Hospital, Taipei Medical University, Taipei, Taiwan
  - Department of Emergency, School of Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
3
S J S, S C PK, Assegie TA. A cost-sensitive logistic regression model for breast cancer detection. THE IMAGING SCIENCE JOURNAL 2023. [DOI: 10.1080/13682199.2022.2161697] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/22/2023]
Affiliation(s)
- Sushma S J
  - Department of Electronics and Communication Engineering, GSSSIETW, Mysuru, India
- Prasanna Kumar S C
  - Department of Electronics and Instrumentation Engineering, RVCE, Bangalore, India
- Tsehay Admassu Assegie
  - Department of Computer Science, College of Computational and Natural Science, Injibara University, Injibara, Ethiopia
4
Chatterjee S, Maity S, Bhattacharjee M, Banerjee S, Das AK, Ding W. Variational Autoencoder Based Imbalanced COVID-19 Detection Using Chest X-Ray Images. NEW GENERATION COMPUTING 2022; 41:25-60. [PMID: 36439303 PMCID: PMC9676807 DOI: 10.1007/s00354-022-00194-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/03/2021] [Accepted: 10/16/2022] [Indexed: 06/12/2023]
Abstract
Early and fast detection of disease is essential for the fight against the COVID-19 pandemic. Researchers have focused on developing robust and cost-effective detection methods using deep learning based chest X-Ray image processing. However, such prediction models are often not well suited to address the challenge of highly imbalanced datasets. The current work is an attempt to address the issue by utilizing unsupervised Variational Auto Encoders (VAEs). Firstly, chest X-Ray images are converted to a latent space by learning the most important features using VAEs. Secondly, a wide range of well established data resampling techniques are used to balance the preexisting imbalanced classes in the latent vector form of the dataset. Finally, the modified dataset in the new feature space is used to train well known classification models to classify chest X-Ray images into three different classes, viz., "COVID-19", "Pneumonia", and "Normal". In order to capture the quality of the resampling methods, 10-fold cross-validation is applied to the dataset. Extensive experimental analyses have been carried out, and the results obtained indicate significant improvement in COVID-19 detection using the proposed VAE based method. Furthermore, the genuineness of the results has been established by performing the Wilcoxon rank test at the 95% confidence level.
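The second step, balancing classes in the latent space, can be sketched as follows. The abstract mentions "a wide range" of resampling techniques without naming them, so this sketch substitutes plain random oversampling as a stand-in; the 2-D latent vectors are toy values, not VAE outputs, and the function name is hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_oversample(latents, labels, seed=0):
    """Duplicate randomly chosen latent vectors until all classes match
    the size of the largest class (random oversampling, an assumed
    stand-in for the resampling methods used in the paper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for z, y in zip(latents, labels):
        by_class[y].append(z)
    target = max(len(zs) for zs in by_class.values())
    out_z, out_y = [], []
    for y, zs in by_class.items():
        extra = [rng.choice(zs) for _ in range(target - len(zs))]
        for z in zs + extra:
            out_z.append(z)
            out_y.append(y)
    return out_z, out_y


# Toy latent vectors: 1 "COVID-19", 3 "Pneumonia", 5 "Normal".
labels = ["COVID-19"] + ["Pneumonia"] * 3 + ["Normal"] * 5
latents = [(0.1 * i, -0.2 * i) for i in range(len(labels))]

balanced_z, balanced_y = random_oversample(latents, labels)
print(Counter(balanced_y))
```

In a cross-validation setting, resampling would normally be applied to the training folds only, so that duplicated minority examples do not leak into the held-out fold.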
Affiliation(s)
- Sankhadeep Chatterjee
  - Department of Computer Science and Technology, Indian Institute of Engineering Science and Technology, Shibpur, West Bengal India
- Soumyajit Maity
  - Department of Computer Science and Engineering, University of Engineering & Management, Kolkata, West Bengal India
- Mayukh Bhattacharjee
  - Department of Computer Science and Engineering, University of Engineering & Management, Kolkata, West Bengal India
- Soumen Banerjee
  - Department of Electronics and Communication Engineering, Budge Budge Institute of Technology, Budge Budge, Kolkata, West Bengal 700137 India
- Asit Kumar Das
  - Department of Computer Science and Technology, Indian Institute of Engineering Science and Technology, Shibpur, West Bengal India
- Weiping Ding
  - School of Information Science and Technology, Nantong University, 66479, Nantong, 226019 Jiangsu China
5
Song B, Li S, Sunny S, Gurushanth K, Mendonca P, Mukhia N, Patrick S, Gurudath S, Raghavan S, Tsusennaro I, Leivon ST, Kolur T, Shetty V, Bushan V, Ramesh R, Peterson T, Pillai V, Wilder-Smith P, Sigamani A, Suresh A, Kuriakose MA, Birur P, Liang R. Classification of imbalanced oral cancer image data from high-risk population. JOURNAL OF BIOMEDICAL OPTICS 2021; 26:JBO-210246R. [PMID: 34689442 PMCID: PMC8536945 DOI: 10.1117/1.jbo.26.10.105001] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 08/03/2021] [Accepted: 09/28/2021] [Indexed: 06/13/2023]
Abstract
SIGNIFICANCE: Early detection of oral cancer is vital for high-risk patients, and machine learning-based automatic classification is ideal for disease screening. However, current datasets collected from high-risk populations are unbalanced and often have detrimental effects on the performance of classification.
AIM: To reduce the class bias caused by data imbalance.
APPROACH: We collected 3851 polarized white light cheek mucosa images using our customized oral cancer screening device. We use weight balancing, data augmentation, undersampling, focal loss, and ensemble methods to improve the neural network performance of oral cancer image classification with the imbalanced multi-class datasets captured from high-risk populations during oral cancer screening in low-resource settings.
RESULTS: By applying both data-level and algorithm-level approaches to the deep learning training process, the performance of the minority classes, which were difficult to distinguish at the beginning, has been improved. The accuracy of the "premalignancy" class is also increased, which is ideal for screening applications.
CONCLUSIONS: Experimental results show that the class bias induced by imbalanced oral cancer image datasets could be reduced using both data- and algorithm-level methods. Our study may provide an important basis for understanding the influence of unbalanced datasets on oral cancer deep learning classifiers and how to mitigate it.
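Two of the algorithm-level techniques listed in the approach, focal loss and weight balancing, can be sketched in plain Python. The focal loss below follows the standard form FL(p_t) = −α·(1 − p_t)^γ·log(p_t); the weighting rule N / (K·n_c) is one common inverse-frequency heuristic and is an assumption, as the abstract does not give the authors' settings. The class counts are invented, and only the "premalignancy" class name comes from the abstract.

```python
import math


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one example, given the predicted probability of
    its true class. gamma=0 reduces it to ordinary cross-entropy."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (num_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {c: total / (k * n) for c, n in counts.items()}


# A well-classified example contributes far less than a hard one.
easy = focal_loss(0.9)
hard = focal_loss(0.3)

# Hypothetical class counts for an imbalanced three-class image set.
weights = balanced_class_weights(
    {"normal": 600, "premalignancy": 300, "malignancy": 100}
)
print(easy, hard, weights)
```

The down-weighting of easy examples is what lets the rare classes dominate the gradient late in training, while the class weights correct the imbalance from the first epoch.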
Affiliation(s)
- Bofan Song
  - The University of Arizona, Wyant College of Optical Sciences, Tucson, Arizona, United States
- Shaobai Li
  - The University of Arizona, Wyant College of Optical Sciences, Tucson, Arizona, United States
- Nirza Mukhia
  - KLE Society Institute of Dental Sciences, Bangalore, India
- Trupti Kolur
  - Mazumdar Shaw Medical Foundation, Bangalore, India
- Vivek Shetty
  - Mazumdar Shaw Medical Foundation, Bangalore, India
- Vidya Bushan
  - Mazumdar Shaw Medical Foundation, Bangalore, India
- Rohan Ramesh
  - Christian Institute of Health Sciences and Research, Dimapur, India
- Tyler Peterson
  - The University of Arizona, Wyant College of Optical Sciences, Tucson, Arizona, United States
- Vijay Pillai
  - Mazumdar Shaw Medical Foundation, Bangalore, India
- Petra Wilder-Smith
  - University of California Beckman Laser Institute and Medical Clinic, Irvine, California, United States
- Amritha Suresh
  - Mazumdar Shaw Medical Centre, Bangalore, India
  - Mazumdar Shaw Medical Foundation, Bangalore, India
- Praveen Birur
  - KLE Society Institute of Dental Sciences, Bangalore, India
  - Biocon Foundation, Bangalore, India
- Rongguang Liang
  - The University of Arizona, Wyant College of Optical Sciences, Tucson, Arizona, United States
6
A Hybrid Supervised Machine Learning Classifier System for Breast Cancer Prognosis Using Feature Selection and Data Imbalance Handling Approaches. ELECTRONICS 2021. [DOI: 10.3390/electronics10060699] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 12/17/2022]
Abstract
Nowadays, breast cancer is the most frequent cancer among women. Early detection is a critical issue that can be effectively addressed by machine learning (ML) techniques. This article therefore investigates methods to improve the accuracy of ML classification models for the prognosis of breast cancer. A wrapper-based feature selection approach, along with nature-inspired algorithms such as Particle Swarm Optimization, Genetic Search, and Greedy Stepwise, was used to identify the important features. The popular machine learning classifiers Support Vector Machine, J48 (C4.5 Decision Tree Algorithm), and Multilayer Perceptron (a feed-forward ANN) were then applied to the selected features. The methodology of the proposed system is structured into five stages: (1) data pre-processing; (2) data imbalance handling; (3) feature selection; (4) machine learning classifiers; (5) classifier performance evaluation. The dataset used in the experiments is the Breast Cancer Wisconsin (Diagnostic) Data Set from the UCI Machine Learning Repository. This article indicated that the J48 decision tree classifier is the appropriate machine learning-based classifier for optimum breast cancer prognosis. Support Vector Machine with the Particle Swarm Optimization algorithm for feature selection achieves an accuracy of 98.24%, MCC = 0.961, Sensitivity = 99.11%, Specificity = 96.54%, and a Kappa statistic of 0.9606. It is also observed that the J48 Decision Tree classifier with the Genetic Search algorithm for feature selection achieves an accuracy of 98.83%, MCC = 0.974, Sensitivity = 98.95%, Specificity = 98.58%, and a Kappa statistic of 0.9735. Furthermore, the Multilayer Perceptron ANN classifier with the Genetic Search algorithm for feature selection achieves an accuracy of 98.59%, MCC = 0.968, Sensitivity = 98.6%, Specificity = 98.57%, and a Kappa statistic of 0.9682.
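The Matthews correlation coefficient (MCC) and the Kappa statistic reported above are both derived from the binary confusion matrix. The following minimal Python sketch uses the standard formulas; the counts are invented for illustration and do not reproduce the paper's results.

```python
import math


def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return numerator / denominator


def cohen_kappa(tp, fn, fp, tn):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Chance agreement from the row and column marginals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (observed - expected) / (1 - expected)


# Invented counts: 100 positive and 100 negative cases.
cm = dict(tp=90, fn=10, fp=5, tn=95)
mcc_value = mcc(**cm)
kappa_value = cohen_kappa(**cm)
print(mcc_value, kappa_value)
```

With balanced marginals, as here, the two statistics are close to each other, which matches the pattern in the values the abstract reports.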