1
Prinzi F, Orlando A, Gaglio S, Vitabile S. Interpretable Radiomic Signature for Breast Microcalcification Detection and Classification. JOURNAL OF IMAGING INFORMATICS IN MEDICINE 2024; 37:1038-1053. [PMID: 38351223 PMCID: PMC11169144 DOI: 10.1007/s10278-024-01012-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2023] [Revised: 11/20/2023] [Accepted: 12/05/2023] [Indexed: 06/13/2024]
Abstract
Breast microcalcifications are observed in 80% of mammograms, and a notable proportion can lead to invasive tumors. However, diagnosing microcalcifications is a highly complicated and error-prone process due to their diverse sizes, shapes, and subtle variations. In this study, we propose a radiomic signature that effectively differentiates between healthy tissue, benign microcalcifications, and malignant microcalcifications. Radiomic features were extracted from a proprietary dataset, composed of 380 healthy tissue, 136 benign, and 242 malignant microcalcifications ROIs. Subsequently, two distinct signatures were selected to differentiate between healthy tissue and microcalcifications (detection task) and between benign and malignant microcalcifications (classification task). Machine learning models, namely Support Vector Machine, Random Forest, and XGBoost, were employed as classifiers. The shared signature selected for both tasks was then used to train a multi-class model capable of simultaneously classifying healthy, benign, and malignant ROIs. A significant overlap was discovered between the detection and classification signatures. The performance of the models was highly promising, with XGBoost exhibiting an AUC-ROC of 0.830, 0.856, and 0.876 for healthy, benign, and malignant microcalcifications classification, respectively. The intrinsic interpretability of radiomic features, and the use of the Mean Score Decrease method for model introspection, enabled models' clinical validation. In fact, the most important features, namely GLCM Contrast, FO Minimum and FO Entropy, were compared and found important in other studies on breast cancer.
Affiliation(s)
- Francesco Prinzi
- Department of Biomedicine, Neuroscience and Advanced Diagnostics (BiND), University of Palermo, Palermo, Italy.
- Department of Computer Science and Technology, University of Cambridge, CB2 1TN, Cambridge, United Kingdom.
- Alessia Orlando
- Section of Radiology - Department of Biomedicine, Neuroscience and Advanced Diagnostics (BiND), University Hospital "Paolo Giaccone", Palermo, Italy
- Salvatore Gaglio
- Department of Engineering, University of Palermo, Palermo, Italy
- Institute for High-Performance Computing and Networking, National Research Council (ICAR-CNR), Palermo, Italy
- Salvatore Vitabile
- Department of Biomedicine, Neuroscience and Advanced Diagnostics (BiND), University of Palermo, Palermo, Italy
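The per-class AUC-ROC values reported in the abstract above (healthy, benign, malignant) are typically obtained by scoring each class against the other two. The following is a minimal sketch of that one-vs-rest calculation, not the authors' pipeline: the function names and the six probability rows are made up for illustration, and only the three class names come from the abstract.

```python
def auc_binary(scores, labels):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(prob_rows, y_true, classes):
    """Compute one AUC per class by treating that class as positive."""
    result = {}
    for k, cls in enumerate(classes):
        scores = [row[k] for row in prob_rows]
        labels = [1 if y == cls else 0 for y in y_true]
        result[cls] = auc_binary(scores, labels)
    return result


CLASSES = ["healthy", "benign", "malignant"]

# Hypothetical predicted probabilities (one row per ROI), not study data.
probs = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
    [0.1, 0.2, 0.7],
    [0.2, 0.4, 0.4],
]
truth = ["healthy", "healthy", "benign", "benign", "malignant", "malignant"]

aucs = one_vs_rest_auc(probs, truth, CLASSES)
for name in CLASSES:
    print(name, round(aucs[name], 4))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; a rank-based formulation would be the usual choice for larger datasets.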

2
Demircioğlu A. Applying oversampling before cross-validation will lead to high bias in radiomics. Sci Rep 2024; 14:11563. [PMID: 38773233 PMCID: PMC11109211 DOI: 10.1038/s41598-024-62585-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2024] [Accepted: 05/20/2024] [Indexed: 05/23/2024]
Abstract
Class imbalance is often unavoidable for radiomic data collected from clinical routine. It can create problems during classifier training since the majority class could dominate the minority class. Consequently, resampling methods like oversampling or undersampling are applied to the data to class-balance the data. However, the resampling must not be applied upfront to all data because it would lead to data leakage and, therefore, to erroneous results. This study aims to measure the extent of this bias. Five-fold cross-validation with 30 repeats was performed using a set of 15 radiomic datasets to train predictive models. The training involved two scenarios: first, the models were trained correctly by applying the resampling methods during the cross-validation. Second, the models were trained incorrectly by performing the resampling on all the data before cross-validation. The bias was defined empirically as the difference between the best-performing models in both scenarios in terms of area under the receiver operating characteristic curve (AUC), sensitivity, specificity, balanced accuracy, and the Brier score. In addition, a simulation study was performed on a randomly generated dataset for verification. The results demonstrated that incorrectly applying the oversampling methods to all data resulted in a large positive bias (up to 0.34 in AUC, 0.33 in sensitivity, 0.31 in specificity, and 0.37 in balanced accuracy). The bias depended on the data balance, and approximately an increase of 0.10 in the AUC was observed for each increase in imbalance. The models also showed a bias in calibration measured using the Brier score, which differed by up to -0.18 between the correctly and incorrectly trained models. The undersampling methods were not affected significantly by bias. These results emphasize that any resampling method should be applied correctly only to the training data to avoid data leakage and, subsequently, biased model performance and calibration.
Affiliation(s)
- Aydin Demircioğlu
- Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, 45147, Essen, Germany.
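The leakage mechanism described in the abstract above can be made concrete by counting how many test-fold samples also appear in the matching training fold. The sketch below is an illustration under stated assumptions, not the study's code: it uses plain random oversampling as a stand-in for the oversamplers that were evaluated, an unstratified five-fold split, a made-up dataset of 6 minority and 54 majority samples, and hypothetical function names. No model is trained; only the overlap is counted.

```python
import random


def random_oversample(sample_ids, labels, rng):
    """Return sample ids with smaller classes duplicated up to the largest class."""
    by_class = {}
    for i in sample_ids:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())
    rows = []
    for members in by_class.values():
        rows.extend(members)
        rows.extend(rng.choice(members) for _ in range(target - len(members)))
    return rows


def split_into_folds(items, k, rng):
    """Shuffle items and deal them into k folds (unstratified, for brevity)."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[f::k] for f in range(k)]


def leaks_resample_before_cv(labels, k, rng):
    """Incorrect order: oversample all data, then split into folds."""
    rows = random_oversample(range(len(labels)), labels, rng)
    folds = split_into_folds(rows, k, rng)  # folds hold ids, with duplicates
    leaks = 0
    for f in range(k):
        test = set(folds[f])
        train = {i for g in range(k) if g != f for i in folds[g]}
        leaks += len(test & train)  # same original sample on both sides
    return leaks


def leaks_resample_inside_cv(labels, k, rng):
    """Correct order: split first, then oversample the training part only."""
    folds = split_into_folds(range(len(labels)), k, rng)
    leaks = 0
    for f in range(k):
        test = set(folds[f])
        train_ids = [i for g in range(k) if g != f for i in folds[g]]
        train = set(random_oversample(train_ids, labels, rng))
        leaks += len(test & train)
    return leaks


# Hypothetical imbalanced dataset: 6 minority (1) and 54 majority (0) samples.
labels = [1] * 6 + [0] * 54
leaks_before = leaks_resample_before_cv(labels, 5, random.Random(0))
leaks_inside = leaks_resample_inside_cv(labels, 5, random.Random(0))
print("shared samples, resampling before CV:", leaks_before)
print("shared samples, resampling inside CV:", leaks_inside)
```

When resampling happens inside each fold, the overlap is zero by construction, because duplicates are drawn only from the training ids. When it happens first, copies of the same minority sample are scattered across folds, so the test fold is no longer unseen.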

3
Suttie M, Kable J, Mahnke AH, Bandoli G. Machine learning approaches to the identification of children affected by prenatal alcohol exposure: A narrative review. ALCOHOL, CLINICAL & EXPERIMENTAL RESEARCH 2024; 48:585-595. [PMID: 38302824 PMCID: PMC11015982 DOI: 10.1111/acer.15271] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2023] [Revised: 12/05/2023] [Accepted: 01/14/2024] [Indexed: 02/03/2024]
Abstract
Fetal alcohol spectrum disorders (FASDs) affect at least 0.8% of the population globally. The diagnosis of FASD is uniquely complex, with a heterogeneous physical and neurobehavioral presentation that requires multidisciplinary expertise for diagnosis. Many researchers have begun to incorporate machine learning approaches into FASD research to identify children who are affected by prenatal alcohol exposure, including those with FASD. This narrative review highlights these efforts. Following an introduction to machine learning, we summarize examples from the literature of neurobehavioral screening tools and physiologic markers of exposure. We discuss individual efforts, including models that classify FASD based on parent-reported neurocognitive or behavioral questionnaires, 3D facial imaging, brain imaging, DNA methylation patterns, microRNA profiles, cardiac orienting response, and dysmorphic facial features. We highlight model performance and discuss the limitations of these approaches. We conclude by considering the scalability of these approaches and how these machine learning models, largely developed from clinical samples or highly exposed birth cohorts, may perform in the general population.
Affiliation(s)
- Michael Suttie
- Nuffield Department of Women’s & Reproductive Health, University of Oxford, UK
- Big Data Institute, University of Oxford, UK
- Julie Kable
- Departments of Psychiatry and Behavioral Science and Pediatrics, Emory University School of Medicine, 201 Dowman Drive, Atlanta, GA, 30322, USA
- Amanda H. Mahnke
- Department of Neuroscience and Experimental Therapeutics, Texas A&M University School of Medicine, 8447 Riverside Parkway, Bryan, TX 77807, USA
- Gretchen Bandoli
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA

4
Louro PL, Redinho H, Malheiro R, Paiva RP, Panda R. A Comparison Study of Deep Learning Methodologies for Music Emotion Recognition. SENSORS (BASEL, SWITZERLAND) 2024; 24:2201. [PMID: 38610412 PMCID: PMC11014202 DOI: 10.3390/s24072201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2024] [Revised: 03/20/2024] [Accepted: 03/26/2024] [Indexed: 04/14/2024]
Abstract
Classical machine learning techniques have dominated Music Emotion Recognition. However, improvements have slowed down due to the complex and time-consuming task of handcrafting new emotionally relevant audio features. Deep learning methods have recently gained popularity in the field because of their ability to automatically learn relevant features from spectral representations of songs, eliminating such necessity. Nonetheless, there are limitations, such as the need for large amounts of quality labeled data, a common problem in MER research. To understand the effectiveness of these techniques, a comparison study using various classical machine learning and deep learning methods was conducted. The results showed that using an ensemble of a Dense Neural Network and a Convolutional Neural Network architecture resulted in a state-of-the-art 80.20% F1 score, an improvement of around 5% considering the best baseline results, concluding that future research should take advantage of both paradigms, that is, combining handcrafted features with feature learning.
Affiliation(s)
- Pedro Lima Louro
- CISUC, LASI, DEI, FCTUC, University of Coimbra, 3030-790 Coimbra, Portugal
- Hugo Redinho
- CISUC, LASI, DEI, FCTUC, University of Coimbra, 3030-790 Coimbra, Portugal
- Ricardo Malheiro
- CISUC, LASI, DEI, FCTUC, University of Coimbra, 3030-790 Coimbra, Portugal
- School of Technology and Management, Polytechnic Institute of Leiria, 2411-901 Leiria, Portugal
- Rui Pedro Paiva
- CISUC, LASI, DEI, FCTUC, University of Coimbra, 3030-790 Coimbra, Portugal
- Renato Panda
- CISUC, LASI, DEI, FCTUC, University of Coimbra, 3030-790 Coimbra, Portugal
- Ci2—Smart Cities Research Center, Polytechnic Institute of Tomar, 2300-313 Tomar, Portugal

5
Arizmendi CJ, Bernacki ML, Raković M, Plumley RD, Urban CJ, Panter AT, Greene JA, Gates KM. Predicting student outcomes using digital logs of learning behaviors: Review, current standards, and suggestions for future work. Behav Res Methods 2023; 55:3026-3054. [PMID: 36018483 PMCID: PMC10556130 DOI: 10.3758/s13428-022-01939-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 07/20/2022] [Indexed: 11/08/2022]
Abstract
Using traces of behaviors to predict outcomes is useful in varied contexts ranging from buyer behaviors to behaviors collected from smart-home devices. Increasingly, higher education systems have been using Learning Management System (LMS) digital data to capture and understand students' learning and well-being. Researchers in the social sciences are increasingly interested in the potential of using digital log data to predict outcomes and design interventions. Using LMS data for predicting the likelihood of students' success in for-credit college courses provides a useful example of how social scientists can use these techniques on a variety of data types. Here, we provide a primer on how LMS data can be feature-mapped and analyzed to accomplish these goals. We begin with a literature review summarizing current approaches to analyzing LMS data, then discuss ethical issues of privacy when using demographic data and equitable model building. In the second part of the paper, we provide an overview of popular machine learning algorithms and review analytic considerations such as feature generation, assessment of model performance, and sampling techniques. Finally, we conclude with an empirical example demonstrating the ability of LMS data to predict student success, summarizing important features and assessing model performance across different model specifications.
Affiliation(s)
- Mladen Raković
- Centre for Learning Analytics, Monash University, Melbourne, Australia
- Robert D Plumley
- The University of North Carolina Chapel Hill, Chapel Hill, NC, USA
- A T Panter
- The University of North Carolina Chapel Hill, Chapel Hill, NC, USA
- Jeffrey A Greene
- The University of North Carolina Chapel Hill, Chapel Hill, NC, USA
- Kathleen M Gates
- The University of North Carolina Chapel Hill, Chapel Hill, NC, USA

6
Nouri Z, Choi SW, Choi IJ, Ryu KW, Woo SM, Park SJ, Lee WJ, Choi W, Jung YS, Myung SK, Lee JH, Park JY, Praveen Z, Woo YJ, Park JH, Kim MK. Exploring Connections between Oral Microbiota, Short-Chain Fatty Acids, and Specific Cancer Types: A Study of Oral Cancer, Head and Neck Cancer, Pancreatic Cancer, and Gastric Cancer. Cancers (Basel) 2023; 15:2898. [PMID: 37296861 DOI: 10.3390/cancers15112898] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Received: 03/28/2023] [Revised: 04/25/2023] [Accepted: 05/22/2023] [Indexed: 06/12/2023]
Abstract
The association between oral microbiota and cancer development has been a topic of intense research in recent years, with compelling evidence suggesting that the oral microbiome may play a significant role in cancer initiation and progression. However, the causal connections between the two remain a subject of debate, and the underlying mechanisms are not fully understood. In this case-control study, we aimed to identify common oral microbiota associated with several cancer types and investigate the potential mechanisms that may trigger immune responses and initiate cancer upon cytokine secretion. Saliva and blood samples were collected from 309 adult cancer patients and 745 healthy controls to analyze the oral microbiome and the mechanisms involved in cancer initiation. Machine learning techniques revealed that six bacterial genera were associated with cancer. The abundance of Leuconostoc, Streptococcus, Abiotrophia, and Prevotella was reduced in the cancer group, while the abundance of Haemophilus and Neisseria was increased. G protein-coupled receptor kinase, H+-transporting ATPase, and futalosine hydrolase were found to be significantly enriched in the cancer group. Total short-chain fatty acid (SCFA) concentrations and free fatty acid receptor 2 (FFAR2) expression levels were greater in the control group than in the cancer group, while serum tumor necrosis factor alpha induced protein 8 (TNFAIP8), interleukin-6 (IL6), and signal transducer and activator of transcription 3 (STAT3) levels were higher in the cancer group than in the control group. These results suggested that alterations in the composition of oral microbiota can contribute to a reduction in SCFAs and FFAR2 expression that may initiate an inflammatory response through the upregulation of TNFAIP8 and the IL-6/STAT3 pathway, which could ultimately increase the risk of cancer onset.
Affiliation(s)
- Zahra Nouri
- Cancer Epidemiology Branch, Division of Cancer Epidemiology and Prevention, National Cancer Center, 323 Ilsandong-gu, Goyang-si 10408, Gyeonggi-do, Republic of Korea
- Sung Weon Choi
- Oral Oncology Clinic, Research Institute and Hospital, National Cancer Center, 323 Ilsandong-gu, Goyang-si 10408, Gyeonggi-do, Republic of Korea
- Il Ju Choi
- Center for Gastric Cancer, National Cancer Center, 323 Ilsandong-gu, Goyang-si 10408, Gyeonggi-do, Republic of Korea
- Keun Won Ryu
- Center for Gastric Cancer, National Cancer Center, 323 Ilsandong-gu, Goyang-si 10408, Gyeonggi-do, Republic of Korea
- Sang Myung Woo
- Center for Liver and Pancreatobiliary Cancer, National Cancer Center, 323 Ilsandong-gu, Goyang-si 10408, Gyeonggi-do, Republic of Korea
- Sang-Jae Park
- Center for Liver and Pancreatobiliary Cancer, National Cancer Center, 323 Ilsandong-gu, Goyang-si 10408, Gyeonggi-do, Republic of Korea
- Woo Jin Lee
- Center for Liver and Pancreatobiliary Cancer, National Cancer Center, 323 Ilsandong-gu, Goyang-si 10408, Gyeonggi-do, Republic of Korea
- Wonyoung Choi
- Center for Rare Cancers, National Cancer Center, 323 Ilsandong-gu, Goyang-si 10408, Gyeonggi-do, Republic of Korea
- Yuh-Seog Jung
- Department of Otorhinolaryngology, National Cancer Center, 323 Ilsandong-gu, Goyang-si 10408, Gyeonggi-do, Republic of Korea
- Seung-Kwon Myung
- Department of Cancer AI & Digital Health, National Cancer Center Graduate School of Cancer Science and Policy, 323 Ilsandong-gu, Goyang-si 10408, Gyeonggi-do, Republic of Korea
- Jong-Ho Lee
- Oral Oncology Clinic, Research Institute and Hospital, National Cancer Center, 323 Ilsandong-gu, Goyang-si 10408, Gyeonggi-do, Republic of Korea
- Joo-Yong Park
- Oral Oncology Clinic, Research Institute and Hospital, National Cancer Center, 323 Ilsandong-gu, Goyang-si 10408, Gyeonggi-do, Republic of Korea
- Zeba Praveen
- Cancer Epidemiology Branch, Division of Cancer Epidemiology and Prevention, National Cancer Center, 323 Ilsandong-gu, Goyang-si 10408, Gyeonggi-do, Republic of Korea
- Yun Jung Woo
- Cancer Epidemiology Branch, Division of Cancer Epidemiology and Prevention, National Cancer Center, 323 Ilsandong-gu, Goyang-si 10408, Gyeonggi-do, Republic of Korea
- Jin Hee Park
- Cancer Epidemiology Branch, Division of Cancer Epidemiology and Prevention, National Cancer Center, 323 Ilsandong-gu, Goyang-si 10408, Gyeonggi-do, Republic of Korea
- Mi Kyung Kim
- Cancer Epidemiology Branch, Division of Cancer Epidemiology and Prevention, National Cancer Center, 323 Ilsandong-gu, Goyang-si 10408, Gyeonggi-do, Republic of Korea

7
Hassanzadeh R, Farhadian M, Rafieemehr H. Hospital mortality prediction in traumatic injuries patients: comparing different SMOTE-based machine learning algorithms. BMC Med Res Methodol 2023; 23:101. [PMID: 37087425 PMCID: PMC10122327 DOI: 10.1186/s12874-023-01920-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2022] [Accepted: 04/13/2023] [Indexed: 04/24/2023]
Abstract
BACKGROUND Trauma is one of the most critical public health issues worldwide, leading to death and disability and affecting all age groups. Therefore, there is great interest in models for predicting mortality in trauma patients admitted to the ICU. The main objective of the present study is to develop and evaluate SMOTE-based machine-learning tools for predicting hospital mortality in trauma patients with imbalanced data. METHODS This retrospective cohort study was conducted on 126 trauma patients admitted to an intensive care unit at Besat hospital in Hamadan Province, western Iran, from March 2020 to March 2021. Data were extracted from the patients' medical records. Because the data were imbalanced, SMOTE techniques, namely SMOTE, Borderline-SMOTE1, Borderline-SMOTE2, SMOTE-NC, and SVM-SMOTE, were used for primary preprocessing. Then, the Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), Artificial Neural Network (ANN), Support Vector Machine (SVM), and Extreme Gradient Boosting (XGBoost) methods were used to predict hospital mortality in patients with traumatic injuries. The performance of the methods was evaluated by sensitivity, specificity, Positive Predictive Value (PPV), Negative Predictive Value (NPV), accuracy, Area Under the Curve (AUC), Geometric Mean (G-means), F1 score, and the P-value of McNemar's test. RESULTS Of the 126 patients admitted to an ICU, 117 (92.9%) survived and 9 (7.1%) died. The mean follow-up time from the date of trauma to the date of outcome was 3.98 ± 4.65 days. ML algorithms performed poorly on the imbalanced data, whereas SMOTE-based ML algorithms performed significantly better. The mean area under the ROC curve (AUC) of all SMOTE-based models was more than 91%. F1-score and G-means before balancing the dataset were below 70% for all ML models except ANN. In contrast, F1-score and G-means for the balanced datasets reached more than 90% for all SMOTE-based models. Among all SMOTE-based ML methods, RF and ANN based on SMOTE and XGBoost based on SMOTE-NC achieved the highest values for all evaluation criteria. CONCLUSIONS This study has shown that SMOTE-based ML algorithms predict outcomes in traumatic injuries better than ML algorithms without SMOTE. They have the potential to assist ICU physicians in making clinical decisions.
Affiliation(s)
- Roghayyeh Hassanzadeh
- Department of Biostatistics, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran
- Maryam Farhadian
- Research Center for Health Sciences, Department of Biostatistics, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran.
- Hassan Rafieemehr
- Department of Medical Laboratory Sciences, School of Paramedicine, Hamadan University of Medical Sciences, Hamadan, Iran.
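The abstract above evaluates models with sensitivity, specificity, PPV, NPV, accuracy, F1 score, and G-means on a cohort of 117 survivors and 9 deaths. A small worked example shows why accuracy alone is uninformative at that class ratio. This is an illustrative sketch: treating death as the positive class, the "always predict survival" classifier, the function name, and the convention of reporting an undefined ratio as 0.0 are all assumptions made here, not details from the study.

```python
import math


def confusion_metrics(tp, fn, tn, fp):
    """Derive common binary metrics from confusion-matrix counts.

    Assumption for this sketch: a ratio with a zero denominator is
    reported as 0.0 instead of being left undefined.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)          # recall on the positive class
    specificity = ratio(tn, tn + fp)          # recall on the negative class
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + fn + tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "g_mean": math.sqrt(sensitivity * specificity),
    }


# Hypothetical classifier that predicts "survived" for all 126 patients.
# Positive class = died (9 patients), negative class = survived (117).
always_survive = confusion_metrics(tp=0, fn=9, tn=117, fp=0)
for name, value in always_survive.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the trivial classifier reaches an accuracy of 117/126 (about 0.929) while its sensitivity, F1 score, and G-mean are all zero, which is why the imbalance-aware metrics listed in the abstract are reported alongside accuracy.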

8
Dials J, Demirel D, Sanchez-Arias R, Halic T, Kruger U, De S, Gromski MA. Skill-level classification and performance evaluation for endoscopic sleeve gastroplasty. Surg Endosc 2023:10.1007/s00464-023-09955-2. [PMID: 36897405 PMCID: PMC10000349 DOI: 10.1007/s00464-023-09955-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/30/2022] [Accepted: 02/12/2023] [Indexed: 03/11/2023]
Abstract
BACKGROUND We previously developed grading metrics for quantitative performance measurement for simulated endoscopic sleeve gastroplasty (ESG) to create a scalar reference to classify subjects into experts and novices. In this work, we used synthetic data generation and expanded our skill level analysis using machine learning techniques. METHODS We used the synthetic data generation algorithm SMOTE to expand and balance our dataset of seven simulated ESG procedures with synthetic samples. We performed optimization to seek optimum metrics to classify experts and novices by identifying the most critical and distinctive sub-tasks. We used support vector machine (SVM), AdaBoost, K-nearest neighbors (KNN), kernel Fisher discriminant analysis (KFDA), random forest, and decision tree classifiers to classify surgeons as experts or novices after grading. Furthermore, we used an optimization model to create weights for each task and separate the clusters by maximizing the distance between the expert and novice scores. RESULTS We split our dataset into a training set of 15 samples and a testing set of five samples. We put this dataset through six classifiers, SVM, KFDA, AdaBoost, KNN, random forest, and decision tree, yielding training accuracies of 0.94, 0.94, 1.00, 1.00, 1.00, and 1.00, respectively, and a testing accuracy of 1.00 for SVM and AdaBoost. Our optimization model increased the distance between the expert and novice groups from 2 to 53.72. CONCLUSION This paper shows that feature reduction, in combination with classification algorithms such as SVM and KNN, can be used to classify endoscopists as experts or novices based on their results recorded using our grading metrics. Furthermore, this work introduces a non-linear constraint optimization to separate the two clusters and find the most important tasks using weights.
Affiliation(s)
- James Dials
- Department of Computer Science, Florida Polytechnic University, Lakeland, FL, USA
- Doga Demirel
- Department of Computer Science, Florida Polytechnic University, Lakeland, FL, USA.
- Reinaldo Sanchez-Arias
- Department of Data Science and Business Analytics, Florida Polytechnic University, Lakeland, FL, USA
- Uwe Kruger
- Department of Biomedical Engineering, Rensselaer Polytechnic Institute, Troy, NY, USA
- Suvranu De
- College of Engineering, Florida A&M University - Florida State University, Tallahassee, FL, USA
- Mark A Gromski
- Division of Gastroenterology and Hepatology, Indiana University School of Medicine, Indianapolis, IN, USA

9
Li D, Zheng C, Zhao J, Liu Y. Diagnosis of heart failure from imbalance datasets using multi-level classification. Biomed Signal Process Control 2023. [DOI: 10.1016/j.bspc.2022.104538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]

10
Szeghalmy S, Fazekas A. A Comparative Study of the Use of Stratified Cross-Validation and Distribution-Balanced Stratified Cross-Validation in Imbalanced Learning. SENSORS (BASEL, SWITZERLAND) 2023; 23:2333. [PMID: 36850931 PMCID: PMC9967638 DOI: 10.3390/s23042333] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 12/22/2022] [Revised: 02/06/2023] [Accepted: 02/15/2023] [Indexed: 06/18/2023]
Abstract
Nowadays, the solution to many practical problems relies on machine learning tools. However, compiling an appropriate training data set for real-world classification problems is challenging because collecting the right amount of data for each class is often difficult or even impossible. In such cases, we can easily face the problem of imbalanced learning. The literature offers many methods for solving the imbalanced learning problem, so how to compare their performance has become an important question. Inadequate validation techniques can provide misleading results (e.g., due to data shift), which has led to the development of validation methods designed for imbalanced data sets, such as stratified cross-validation (SCV) and distribution optimally balanced SCV (DOB-SCV). Previous studies have shown that higher classification performance scores (AUC) can be achieved on imbalanced data sets using DOB-SCV instead of SCV. We investigated the effect of oversamplers on this difference. The study was conducted on 420 data sets, involving several sampling methods and the DTree, kNN, SVM, and MLP classifiers. We point out that DOB-SCV often provides slightly higher F1 and AUC values for classification combined with sampling. However, the results also show that the selection of the sampler-classifier pair is more important for classification performance than the choice between the DOB-SCV and SCV techniques.
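Stratified cross-validation, as named in the abstract above, keeps the class proportions of every fold close to those of the full data set. The following is a minimal sketch of plain SCV only; it does not implement DOB-SCV, which additionally uses the spatial distribution of samples within each class. The function name, the seed parameter, and the 10-versus-90 example labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds so each class is spread evenly.

    Indices of each class are shuffled and dealt round-robin, so the
    per-class counts of any two folds differ by at most one.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


# Hypothetical imbalanced data set: 10 minority and 90 majority samples.
labels = ["minority"] * 10 + ["majority"] * 90
folds = stratified_folds(labels, k=5)

for number, fold in enumerate(folds):
    print(number, dict(Counter(labels[i] for i in fold)))
```

With these example labels every fold receives 2 minority and 18 majority samples, so each test fold preserves the 1:9 ratio of the full set.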

11
A Comparison of Undersampling, Oversampling, and SMOTE Methods for Dealing with Imbalanced Classification in Educational Data Mining. INFORMATION 2023. [DOI: 10.3390/info14010054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/18/2023]
Abstract
Educational data mining is capable of producing useful data-driven applications (e.g., early warning systems in schools or the prediction of students’ academic achievement) based on predictive models. However, the class imbalance problem in educational datasets could hamper the accuracy of predictive models as many of these models are designed on the assumption that the predicted class is balanced. Although previous studies proposed several methods to deal with the imbalanced class problem, most of them focused on the technical details of how to improve each technique, while only a few focused on the application aspect, especially for the application of data with different imbalance ratios. In this study, we compared several sampling techniques to handle the different ratios of the class imbalance problem (i.e., moderately or extremely imbalanced classifications) using the High School Longitudinal Study of 2009 dataset. For our comparison, we used random oversampling (ROS), random undersampling (RUS), and the combination of the synthetic minority oversampling technique for nominal and continuous (SMOTE-NC) and RUS as a hybrid resampling technique. We used the Random Forest as our classification algorithm to evaluate the results of each sampling technique. Our results show that random oversampling for moderately imbalanced data and hybrid resampling for extremely imbalanced data seem to work best. The implications for educational data mining applications and suggestions for future research are discussed.
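The three resampling strategies compared in the abstract above differ mainly in the class size they aim for. The sketch below expresses random oversampling (ROS), random undersampling (RUS), and a hybrid as one function with a target size per class. It is a simplification under stated assumptions: the hybrid here duplicates minority samples at random, whereas the study's hybrid used SMOTE-NC for that step, and the function name, class names, and counts are made up for illustration.

```python
import random
from collections import Counter


def resample_to_target(labels, target, rng):
    """Return sample indices with every class resized to `target` samples.

    Classes larger than the target are undersampled without replacement;
    smaller classes keep all samples and add random duplicates.
    """
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    chosen = []
    for members in by_class.values():
        if len(members) >= target:
            chosen.extend(rng.sample(members, target))
        else:
            chosen.extend(members)
            chosen.extend(rng.choices(members, k=target - len(members)))
    return chosen


# Hypothetical labels: 20 minority and 180 majority students.
labels = ["dropout"] * 20 + ["persist"] * 180
rng = random.Random(0)

ros = resample_to_target(labels, target=180, rng=rng)    # grow minority
rus = resample_to_target(labels, target=20, rng=rng)     # shrink majority
hybrid = resample_to_target(labels, target=60, rng=rng)  # meet in between

for name, indices in [("ROS", ros), ("RUS", rus), ("hybrid", hybrid)]:
    print(name, dict(Counter(labels[i] for i in indices)))
```

ROS keeps every original sample at the cost of many duplicates, RUS discards most of the majority class, and the hybrid trades off both, which is one way to read why the abstract reports different winners for moderate and extreme imbalance.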

12
Instance hardness and multivariate Gaussian distribution-based oversampling technique for imbalance classification. Pattern Anal Appl 2023. [DOI: 10.1007/s10044-022-01129-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2023]

13
Sowjanya AM, Mrudula O. Effective treatment of imbalanced datasets in health care using modified SMOTE coupled with stacked deep learning algorithms. APPLIED NANOSCIENCE 2023; 13:1829-1840. [PMID: 35132368 PMCID: PMC8811587 DOI: 10.1007/s13204-021-02063-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 07/21/2021] [Accepted: 08/28/2021] [Indexed: 12/03/2022]
Abstract
Health care is one of the prominent applications of predictive analytics, where proper analysis of cumulative datasets supports more accurate predictions. The datasets are often highly imbalanced, and sampling techniques like the Synthetic Minority Oversampling Technique (SMOTE) give only moderate accuracy in such cases. To overcome this problem, a two-step approach has been proposed. In the first step, SMOTE is modified to reduce the class imbalance, yielding Distance-based SMOTE (D-SMOTE) and Bi-phasic SMOTE (BP-SMOTE), which were then coupled with selected classifiers for prediction. An increase in accuracy is noted for both BP-SMOTE and D-SMOTE compared to basic SMOTE. In the second step, machine learning, deep learning, and ensemble algorithms were used to develop a Stacking Ensemble Framework, which showed a significant increase in accuracy compared to individual machine learning algorithms like Decision Tree, Naïve Bayes, and Neural Networks and to ensemble techniques like Voting, Bagging, and Boosting. Two different methods have been developed by combining deep learning with the stacking approach, namely Stacked CNN and Stacked RNN, which yielded a significantly higher accuracy of 96-97% compared to individual algorithms. The Framingham dataset is used for data sampling, the Wisconsin Hospital data of the Breast Cancer study is used for Stacked CNN, and the Novel Coronavirus 2019 dataset relating to forecasting COVID-19 cases is used for Stacked RNN.
Affiliation(s)
- A. Mary Sowjanya
- Department of CS & SE, Andhra University College of Engineering (A), Visakhapatnam, Andhra Pradesh, India
- Owk Mrudula
- Department of CS & SE, Andhra University College of Engineering (A), Visakhapatnam, Andhra Pradesh, India

14
Ghaderi Zefrehi H, Altınçay H. MaMiPot: a paradigm shift for the classification of imbalanced data. J Intell Inf Syst 2022. [DOI: 10.1007/s10844-022-00763-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/12/2022]

15
An imbalanced binary classification method via space mapping using normalizing flows with class discrepancy constraints. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.12.029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]

16
Class-imbalanced positive instances augmentation via three-line hybrid. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109902] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]

17
Hashimoto-Roth E, Surendra A, Lavallée-Adam M, Bennett SAL, Čuperlović-Culf M. METAbolomics data Balancing with Over-sampling Algorithms (META-BOA): an online resource for addressing class imbalance. Bioinformatics 2022; 38:5326-5327. [PMID: 36222566 DOI: 10.1093/bioinformatics/btac649] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2022] [Revised: 09/04/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION Class imbalance, or unequal sample sizes between classes, is an increasing concern in machine learning for metabolomic and lipidomic data mining, which can result in overfitting for the over-represented class. Numerous methods have been developed for handling class imbalance, but they are not readily accessible to users with limited computational experience. Moreover, there is no resource that enables users to easily evaluate the effect of different over-sampling algorithms. RESULTS METAbolomics data Balancing with Over-sampling Algorithms (META-BOA) is a web-based application that enables users to select between four different methods for class balancing, followed by data visualization and classification of the sample to observe the augmentation effects. META-BOA outputs a newly balanced dataset, generating additional samples in the minority class, according to the user's choice of Synthetic Minority Over-sampling Technique (SMOTE), Borderline-SMOTE (BSMOTE), Adaptive Synthetic (ADASYN) or Random Over-Sampling Examples (ROSE). To present the effect of over-sampling on the data META-BOA further displays both principal component analysis and t-distributed stochastic neighbor embedding visualization of data pre- and post-over-sampling. Random forest classification is utilized to compare sample classification in both the original and balanced datasets, enabling users to select the most appropriate method for their further analyses. AVAILABILITY AND IMPLEMENTATION META-BOA is available at https://complimet.ca/meta-boa. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Emily Hashimoto-Roth
- Department of Biochemistry, Microbiology and Immunology, Ottawa Institute of Systems Biology, Ottawa, ON, Canada; Neural Regeneration Laboratory and India Taylor Lipidomic Research Platform, Ottawa, ON K1H 8M5, Canada
| | - Anuradha Surendra
- Digital Technologies Research Centre, National Research Council of Canada, Ottawa, ON K1A 0R6, Canada
| | - Mathieu Lavallée-Adam
- Department of Biochemistry, Microbiology and Immunology, Ottawa Institute of Systems Biology, Ottawa, ON, Canada
| | - Steffany A L Bennett
- Department of Biochemistry, Microbiology and Immunology, Ottawa Institute of Systems Biology, Ottawa, ON, Canada; Neural Regeneration Laboratory and India Taylor Lipidomic Research Platform, Ottawa, ON K1H 8M5, Canada; Department of Chemistry and Biomolecular Sciences, Centre for Catalysis Research and Innovation, University of Ottawa, Ottawa, ON K1N 6N5, Canada
| | - Miroslava Čuperlović-Culf
- Department of Biochemistry, Microbiology and Immunology, Ottawa Institute of Systems Biology, Ottawa, ON, Canada; Neural Regeneration Laboratory and India Taylor Lipidomic Research Platform, Ottawa, ON K1H 8M5, Canada; Digital Technologies Research Centre, National Research Council of Canada, Ottawa, ON K1A 0R6, Canada
| |
|
18
|
Perturbation-based oversampling technique for imbalanced classification problems. INT J MACH LEARN CYB 2022. [DOI: 10.1007/s13042-022-01662-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
|
19
|
Distance-based arranging oversampling technique for imbalanced data. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07828-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/14/2022]
|
20
|
Kumar A. A new fitness function in genetic programming for classification of imbalanced data. J EXP THEOR ARTIF IN 2022. [DOI: 10.1080/0952813x.2022.2120087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/14/2022]
Affiliation(s)
- Arvind Kumar
- Location Intelligence R&D, Precisely Software and Data India Private Limited, Noida, India
| |
|
21
|
Li M, Zhou H, Liu Q, Wang G. SW: A weighted space division framework for imbalanced problems with label noise. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]
|
22
|
Wang G, Yin Z, Zhao M, Tian Y, Sun Z. Identification of human mental workload levels in a language comprehension task with imbalance neurophysiological data. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2022; 224:107011. [PMID: 35863122 DOI: 10.1016/j.cmpb.2022.107011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/09/2021] [Revised: 05/23/2022] [Accepted: 07/06/2022] [Indexed: 06/15/2023]
Abstract
BACKGROUND AND OBJECTIVE An operator's capability to accurately comprehend verbal commands is critically important for maintaining the performance of human-machine interaction. It can be evaluated through human mental workload measured with electroencephalography (EEG). However, the duration of different workload conditions within a task session is unequal because psychophysiological processes vary across individuals. This leads to imbalance in the EEG data used to train workload classifiers. METHODS In this study, we propose an EEG feature oversampling technique, Gaussian-SMOTE based feature ensemble (GSMOTE-FE), for workload recognition with imbalanced classes. First, artificial EEG instances are drawn from a Gaussian distribution in the margin between the minority and majority workload classes. Tomek links are detected as clues to remove redundant feature vectors. Then, we embed a feature selection module based on the GINI importance, while an ensemble classifier committee with bootstrap aggregating is used to further enhance classification performance. RESULTS We validate the GSMOTE-FE framework in an experiment in which operators must understand the correct meaning of instructions in the Chinese language. Participants' EEG signals and reaction time data were both recorded to validate the proposed workload classifier. Workload classification accuracy and Macro-F1 values are 0.6553 and 0.5862, respectively. The corresponding G-mean and AUC are 0.5757 and 0.5958, respectively. CONCLUSIONS The performance of GSMOTE-FE is shown to be comparable with that of advanced oversampling techniques. The workload classifier is able to indicate low and high levels of task demand in the Chinese language understanding task.
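The Tomek-link step mentioned in this abstract can be illustrated with a short sketch. A Tomek link is a pair of samples from different classes that are each other's nearest neighbour. The code below only detects such pairs; which member of a link GSMOTE-FE removes is not stated in the abstract, so no removal rule is assumed. The function names and the toy data are hypothetical.

```python
def nearest(i, points):
    """Index of the point closest to points[i] (squared Euclidean distance)."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(points[i], points[j])),
    )

def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and labels[i] != labels[j] and nearest(j, points) == i:
            links.append((i, j))
    return links

# Hypothetical two-feature data: samples 2 and 3 sit on the class boundary.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0)]
labels = [0, 0, 0, 1, 1]
links = tomek_links(points, labels)  # [(2, 3)]
```

Samples 0 and 1 are mutual nearest neighbours but share a label, so only the boundary pair (2, 3) is reported.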
Affiliation(s)
- Guangying Wang
- School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, 200093, PR China
| | - Zhong Yin
- Engineering Research Center of Optical Instrument and System, Ministry of Education, Shanghai Key Lab of Modern Optical System, University of Shanghai for Science and Technology, Shanghai, 200093, PR China; School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, 200093, PR China.
| | - Mengyuan Zhao
- College of Foreign Languages, University of Shanghai for Science and Technology, Shanghai, 200093, PR China
| | - Ying Tian
- School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, 200093, PR China
| | - Zhanquan Sun
- School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, 200093, PR China
| |
|
23
|
Kumar V, Lalotra GS, Kumar RK. Improving performance of classifiers for diagnosis of critical diseases to prevent COVID risk. COMPUTERS & ELECTRICAL ENGINEERING : AN INTERNATIONAL JOURNAL 2022; 102:108236. [PMID: 35915590 PMCID: PMC9329734 DOI: 10.1016/j.compeleceng.2022.108236] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/28/2021] [Revised: 07/03/2022] [Accepted: 07/13/2022] [Indexed: 06/15/2023]
Abstract
The risk of developing COVID-19 and its variants may be higher in those with pre-existing health conditions such as thyroid disease, Hepatitis C Virus (HCV), breast tissue disease, chronic dermatitis, and other severe infections. Early and precise identification of these disorders is critical. A huge number of patients in nations like India require early and rapid testing as a preventative measure. The imbalance problem arises from the skewed nature of the data: instances from the majority class are classified correctly, while the minority class is misclassified by many classifiers. When human life is at stake, this kind of misclassification is unacceptable. To address the misclassification issue and improve accuracy on such datasets, we applied a variety of data balancing techniques to several machine learning algorithms. The outcomes are encouraging, with a considerable increase in accuracy. With proper diagnoses, plans can be made and the required actions taken to stop patients from acquiring serious health issues or viral infections.
Affiliation(s)
- Vinod Kumar
- Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram, India
| | | | - Ravi Kant Kumar
- Computer Science and Engineering, SRM University, Andhra Pradesh, India
| |
|
24
|
Zhang A, Yu H, Zhou S, Huan Z, Yang X. Instance weighted SMOTE by indirectly exploring the data distribution. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108919] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
25
|
Teji JS, Jain S, Gupta SK, Suri JS. NeoAI 1.0: Machine learning-based paradigm for prediction of neonatal and infant risk of death. Comput Biol Med 2022; 147:105639. [DOI: 10.1016/j.compbiomed.2022.105639] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2021] [Revised: 05/01/2022] [Accepted: 05/01/2022] [Indexed: 11/29/2022]
|
26
|
Kumar V, Lalotra GS, Sasikala P, Rajput DS, Kaluri R, Lakshmanna K, Shorfuzzaman M, Alsufyani A, Uddin M. Addressing Binary Classification over Class Imbalanced Clinical Datasets Using Computationally Intelligent Techniques. Healthcare (Basel) 2022; 10:1293. [PMID: 35885819 PMCID: PMC9322725 DOI: 10.3390/healthcare10071293] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2022] [Revised: 07/03/2022] [Accepted: 07/07/2022] [Indexed: 11/16/2022] Open
Abstract
Nowadays, healthcare is the prime need of every human being in the world, and clinical datasets play an important role in developing an intelligent healthcare system for monitoring the health of people. Most real-world datasets are inherently class imbalanced, and clinical datasets suffer from this problem as well; imbalanced class distributions pose several issues in the training of classifiers. Consequently, classifiers suffer from low accuracy, precision, and recall, and a high degree of misclassification. We performed a brief literature review on the class imbalanced learning scenario. This study presents an empirical performance evaluation of six classifiers, namely Decision Tree, k-Nearest Neighbor, Logistic Regression, Artificial Neural Network, Support Vector Machine, and Gaussian Naïve Bayes, over five imbalanced clinical datasets, Breast Cancer Disease, Coronary Heart Disease, Indian Liver Patient, Pima Indians Diabetes Database, and Coronary Kidney Disease, with respect to seven different class balancing techniques, namely Undersampling, Random oversampling, SMOTE, ADASYN, SVM-SMOTE, SMOTEEN, and SMOTETOMEK. In addition, explanations for the superiority of particular classifiers and data-balancing techniques are explored. Furthermore, we discuss recommendations on how to handle class imbalanced datasets when training different supervised machine learning methods. Result analysis demonstrates that the SMOTEEN balancing method often performed better than the other six data-balancing techniques with all six classifiers and for all five clinical datasets. The other six balancing techniques performed almost equally to one another, and moderately worse than SMOTEEN.
Affiliation(s)
- Vinod Kumar
- Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram 522302, India;
| | - Gotam Singh Lalotra
- Government Degree College Basohli, University of Jammu, Basohli 184201, India;
| | - Ponnusamy Sasikala
- New Media Technology, Makhanlal Chaturvedi National University of Journalism and Communication, Bhopal 462011, India;
| | - Dharmendra Singh Rajput
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore 632014, India; (R.K.); (K.L.)
- Correspondence: (D.S.R.); (M.U.)
| | - Rajesh Kaluri
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore 632014, India; (R.K.); (K.L.)
| | - Kuruva Lakshmanna
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore 632014, India; (R.K.); (K.L.)
| | - Mohammad Shorfuzzaman
- Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia; (M.S.); (A.A.)
| | - Abdulmajeed Alsufyani
- Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia; (M.S.); (A.A.)
| | - Mueen Uddin
- College of Computing and IT University of Doha for Science and Technology, Doha P.O. Box 24449, Qatar
- Correspondence: (D.S.R.); (M.U.)
| |
|
27
|
Ortiz-Toro C, García-Pedrero A, Lillo-Saavedra M, Gonzalo-Martín C. Automatic detection of pneumonia in chest X-ray images using textural features. Comput Biol Med 2022; 145:105466. [PMID: 35585732 PMCID: PMC8966154 DOI: 10.1016/j.compbiomed.2022.105466] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2021] [Revised: 03/25/2022] [Accepted: 03/26/2022] [Indexed: 12/16/2022]
Abstract
Fast and accurate diagnosis is critical for the triage and management of pneumonia, particularly in the current scenario of a COVID-19 pandemic, where this pathology is a major symptom of the infection. With the objective of providing tools for that purpose, this study assesses the potential of three textural image characterisation methods: radiomics, fractal dimension and the recently developed superpixel-based histon, as biomarkers to be used for training Artificial Intelligence (AI) models in order to detect pneumonia in chest X-ray images. Models generated from three different AI algorithms have been studied: K-Nearest Neighbors, Support Vector Machine and Random Forest. Two open-access image datasets were used in this study. In the first one, a dataset composed of paediatric chest X-ray, the best performing generated models achieved an 83.3% accuracy with 89% sensitivity for radiomics, 89.9% accuracy with 93.6% sensitivity for fractal dimension and 91.3% accuracy with 90.5% sensitivity for superpixels based histon. Second, a dataset derived from an image repository developed primarily as a tool for studying COVID-19 was used. For this dataset, the best performing generated models resulted in a 95.3% accuracy with 99.2% sensitivity for radiomics, 99% accuracy with 100% sensitivity for fractal dimension and 99% accuracy with 98.6% sensitivity for superpixel-based histons. The results confirm the validity of the tested methods as reliable and easy-to-implement automatic diagnostic tools for pneumonia.
Affiliation(s)
- César Ortiz-Toro
- Department of Computer Architecture and Technology, Universidad Politécnica de Madrid, 28660, Boadilla del Monte, Spain
| | - Angel García-Pedrero
- Department of Computer Architecture and Technology, Universidad Politécnica de Madrid, 28660, Boadilla del Monte, Spain; Center for Biomedical Technology, Campus de Montegancedo, Universidad Politécnica de Madrid, 28233, Pozuelo de Alarcón, Spain
| | - Mario Lillo-Saavedra
- Facultad de Ingeniería Agrícola, Universidad de Concepción, Chillán, 3812120, Chile
| | - Consuelo Gonzalo-Martín
- Department of Computer Architecture and Technology, Universidad Politécnica de Madrid, 28660, Boadilla del Monte, Spain; Center for Biomedical Technology, Campus de Montegancedo, Universidad Politécnica de Madrid, 28233, Pozuelo de Alarcón, Spain; Corresponding author. Department of Computer Architecture and Technology, Universidad Politécnica de Madrid, 28660, Boadilla del Monte, Spain
| |
|
28
|
Clustering-based adaptive data augmentation for class-imbalance in machine learning (CADA): additive manufacturing use case. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07347-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
Abstract
A large amount of data is generated from in-situ monitoring of additive manufacturing (AM) processes and is later used in prediction modelling for defect classification to speed up quality inspection of products. A high volume of this process data is defect-free (majority class) and a lower volume has defects (minority class), which results in the class-imbalance issue. Using imbalanced datasets, classifiers often provide sub-optimal classification results, i.e. better performance on the majority class than the minority class. However, it is important for process engineers that models classify defects more accurately than the class with no defects since this is crucial for quality inspection. Hence, we address the class-imbalance issue in manufacturing process data to support in-situ quality control of additive manufactured components. For this, we propose cluster-based adaptive data augmentation (CADA) for oversampling to address the class-imbalance problem. Quantitative experiments are conducted to evaluate the performance of the proposed method and to compare it with other selected oversampling methods using AM datasets from an aerospace industry and a publicly available casting manufacturing dataset. The results show that CADA outperformed random oversampling and the SMOTE method and is similar to random data augmentation and cluster-based oversampling. Furthermore, the results of the statistical significance test show that there is a significant difference between the studied methods. As such, the CADA method can be considered as an alternative method for oversampling to improve the performance of models on the minority class.
|
29
|
A Highly Adaptive Oversampling Approach to Address the Issue of Data Imbalance. COMPUTERS 2022. [DOI: 10.3390/computers11050073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/04/2022]
Abstract
Data imbalance is a serious problem in machine learning that can be alleviated at the data level by balancing the class distribution with sampling. In the last decade, several sampling methods have been published to address the shortcomings of the initial ones, such as noise sensitivity and incorrect neighbor selection. Based on the review of the literature, it has become clear to us that the algorithms achieve varying performance on different data sets. In this paper, we present a new oversampler that has been developed based on the key steps and sampling strategies identified by analyzing dozens of existing methods and that can be fitted to various data sets through an optimization process. Experiments were performed on a number of data sets, which show that the proposed method had a similar or better effect on the performance of SVM, DTree, kNN and MLP classifiers compared with other well-known samplers found in the literature. The results were also confirmed by statistical tests.
|
30
|
Ren J, Wang Y, Mao M, Cheung YM. Equalization ensemble for large scale highly imbalanced data classification. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108295] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
|
31
|
Santos MS, Abreu PH, Japkowicz N, Fernández A, Soares C, Wilk S, Santos J. On the joint-effect of class imbalance and overlap: a critical review. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10150-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
32
|
An imbalanced learning method by combining SMOTE with Center Offset Factor. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.108618] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
33
|
|
34
|
RDPVR: Random Data Partitioning with Voting Rule for Machine Learning from Class-Imbalanced Datasets. ELECTRONICS 2022. [DOI: 10.3390/electronics11020228] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]
Abstract
Since most classifiers are biased toward the dominant class, class imbalance is a challenging problem in machine learning. The most popular approaches to solving this problem include oversampling minority examples and undersampling majority examples. Oversampling may increase the probability of overfitting, whereas undersampling eliminates examples that may be crucial to the learning process. We present a linear time resampling method based on random data partitioning and a majority voting rule to address both concerns, where an imbalanced dataset is partitioned into a number of small subdatasets, each of which must be class balanced. After that, a specific classifier is trained for each subdataset, and the final classification result is established by applying the majority voting rule to the results of all of the trained models. We compared the performance of the proposed method to some of the most well-known oversampling and undersampling methods, employing a range of classifiers, on 33 benchmark machine learning class-imbalanced datasets. The classification results produced by the classifiers employed on the generated data by the proposed method were comparable to most of the resampling methods tested, with the exception of SMOTEFUNA, which is an oversampling method that increases the probability of overfitting. The proposed method produced results that were comparable to the Easy Ensemble (EE) undersampling method. As a result, for solving the challenge of machine learning from class-imbalanced datasets, we advocate using either EE or our method.
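The partition-then-vote idea described in this abstract can be sketched roughly as below. The abstract does not give the exact partitioning rule or the base classifier, so two assumptions are made and labelled in the code: each sub-dataset pairs a disjoint random chunk of the majority class with all minority samples, and a nearest-centroid rule stands in for the trained classifier. All names are hypothetical.

```python
import random
from collections import Counter

def balanced_partitions(majority, minority, seed=0):
    """Assumption: shuffle the majority class and cut it into disjoint chunks
    of minority size; leftover majority samples are dropped in this sketch."""
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    size = len(minority)
    return [(shuffled[i:i + size], minority)
            for i in range(0, len(shuffled) - size + 1, size)]

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def train(maj_chunk, min_chunk):
    """Stand-in classifier (assumption): one centroid per class."""
    return centroid(maj_chunk), centroid(min_chunk)

def predict_one(model, x):
    c_maj, c_min = model
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(x, c))
    return 1 if dist(c_min) < dist(c_maj) else 0  # 1 = minority class

def vote(models, x):
    """Majority voting rule over the per-partition models."""
    return Counter(predict_one(m, x) for m in models).most_common(1)[0][0]

# Hypothetical toy data: 9 majority points on a grid, 3 minority points far away.
majority = [(float(i % 3), float(i // 3)) for i in range(9)]
minority = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
models = [train(maj, mino) for maj, mino in balanced_partitions(majority, minority)]
prediction_far = vote(models, (5.5, 5.5))
prediction_near = vote(models, (1.0, 1.0))
```

With nine majority and three minority samples, three balanced sub-datasets are formed, so the vote cannot tie.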
|
35
|
Serra A, Cattelani L, Fratello M, Fortino V, Kinaret PAS, Greco D. Supervised Methods for Biomarker Detection from Microarray Experiments. Methods Mol Biol 2022; 2401:101-120. [PMID: 34902125 DOI: 10.1007/978-1-0716-1839-4_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]
Abstract
Biomarkers are valuable indicators of the state of a biological system. Microarray technology has been extensively used to identify biomarkers and build computational predictive models for disease prognosis, drug sensitivity and toxicity evaluations. Activation biomarkers can be used to understand the underlying signaling cascades, mechanisms of action and biological cross talk. Biomarker detection from microarray data requires several considerations both from the biological and computational points of view. In this chapter, we describe the main methodology used in biomarkers discovery and predictive modeling and we address some of the related challenges. Moreover, we discuss biomarker validation and give some insights into multiomics strategies for biomarker detection.
Affiliation(s)
- Angela Serra
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), University of Tampere, Tampere, Finland
| | - Luca Cattelani
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), University of Tampere, Tampere, Finland
| | - Michele Fratello
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), University of Tampere, Tampere, Finland
| | - Vittorio Fortino
- Institute of Biomedicine, University of Eastern Finland, Kuopio, Finland
| | - Pia Anneli Sofia Kinaret
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), University of Tampere, Tampere, Finland
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland
| | - Dario Greco
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland.
- BioMediTech Institute, Tampere University, Tampere, Finland.
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), University of Tampere, Tampere, Finland.
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland.
| |
|
36
|
Tedesco S, Andrulli M, Larsson MÅ, Kelly D, Alamäki A, Timmons S, Barton J, Condell J, O’Flynn B, Nordström A. Comparison of Machine Learning Techniques for Mortality Prediction in a Prospective Cohort of Older Adults. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2021; 18:12806. [PMID: 34886532 PMCID: PMC8657506 DOI: 10.3390/ijerph182312806] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/28/2021] [Revised: 12/01/2021] [Accepted: 12/02/2021] [Indexed: 12/16/2022]
Abstract
As global demographics change, ageing is a global phenomenon which is increasingly of interest in our modern and rapidly changing society. Thus, the application of proper prognostic indices in clinical decisions regarding mortality prediction has assumed a significant importance for personalized risk management (i.e., identifying patients who are at high or low risk of death) and to help ensure effective healthcare services to patients. Consequently, prognostic modelling expressed as all-cause mortality prediction is an important step for effective patient management. Machine learning has the potential to transform prognostic modelling. In this paper, results on the development of machine learning models for all-cause mortality prediction in a cohort of healthy older adults are reported. The models are based on features covering anthropometric variables, physical and lab examinations, questionnaires, and lifestyles, as well as wearable data collected in free-living settings, obtained for the "Healthy Ageing Initiative" study conducted on 2291 recruited participants. Several machine learning techniques including feature engineering, feature selection, data augmentation and resampling were investigated for this purpose. A detailed empirical comparison of the impact of the different techniques is presented and discussed. The achieved performances were also compared with a standard epidemiological model. This investigation showed that, for the dataset under consideration, the best results were achieved with Random UnderSampling in conjunction with Random Forest (either with or without probability calibration). However, while including probability calibration slightly reduced the average performance, it increased the model robustness, as indicated by the lower 95% confidence intervals. 
The analysis showed that machine learning models could provide comparable results to standard epidemiological models while being completely data-driven and disease-agnostic, thus demonstrating the opportunity for building machine learning models on health records data for research and clinical practice. However, further testing is required to significantly improve the model performance and its robustness.
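Random undersampling, the resampling step this abstract reports as working best, can be sketched in a few lines. The sketch covers only the resampling; the Random Forest and probability calibration used in the study are not reproduced. The function name and toy data are assumptions for illustration.

```python
import random
from collections import defaultdict

def random_undersample(samples, labels, seed=0):
    """Randomly discard samples from larger classes until every class
    has as many samples as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        for x in rng.sample(xs, target):  # sampling without replacement
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

# Hypothetical cohort: 4 positive cases (label 1) among 20 records.
samples = list(range(20))
labels = [1 if i < 4 else 0 for i in samples]
bal_x, bal_y = random_undersample(samples, labels)
```

All minority records are kept and the majority class is cut down to the same size, which trades discarded majority information for a balanced training set.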
Affiliation(s)
- Salvatore Tedesco
- Tyndall National Institute, University College Cork, Lee Maltings Complex, Dyke Parade, T12R5CP Cork, Ireland
| | - Martina Andrulli
- Tyndall National Institute, University College Cork, Lee Maltings Complex, Dyke Parade, T12R5CP Cork, Ireland
| | - Markus Åkerlund Larsson
- Department of Public Health and Clinical Medicine, Section of Sustainable Health, Umeå University, SE-901 87 Umeå, Sweden
| | - Daniel Kelly
- School of Computing, Engineering and Intelligent Systems, Ulster University, Londonderry BT48 7JL, UK
| | - Antti Alamäki
- Department of Physiotherapy, Karelia University of Applied Sciences, Tikkarinne 9, FI-80200 Joensuu, Finland
| | - Suzanne Timmons
- Centre for Gerontology and Rehabilitation, University College Cork, T12XH60 Cork, Ireland
| | - John Barton
- Tyndall National Institute, University College Cork, Lee Maltings Complex, Dyke Parade, T12R5CP Cork, Ireland
| | - Joan Condell
- School of Computing, Engineering and Intelligent Systems, Ulster University, Londonderry BT48 7JL, UK
| | - Brendan O’Flynn
- Tyndall National Institute, University College Cork, Lee Maltings Complex, Dyke Parade, T12R5CP Cork, Ireland
| | - Anna Nordström
- Department of Public Health and Clinical Medicine, Section of Sustainable Health, Umeå University, SE-901 87 Umeå, Sweden
- School of Sport Sciences, UiT the Arctic University of Norway, 9037 Tromsø, Norway
| |
|
37
|
|
38
|
Dudjak M, Martinović G. An empirical study of data intrinsic characteristics that make learning from imbalanced data difficult. EXPERT SYSTEMS WITH APPLICATIONS 2021; 182:115297. [DOI: 10.1016/j.eswa.2021.115297] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 09/02/2023]
|
39
|
Zoumpekas T, Puig A, Salamó M, García‐Sellés D, Blanco Nuñez L, Guinau M. An intelligent framework for end‐to‐end rockfall detection. INT J INTELL SYST 2021. [DOI: 10.1002/int.22557] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/09/2023]
Affiliation(s)
- Thanasis Zoumpekas
- Department of Mathematics and Computer Science, WAI Research Group, IMUB and UBICS Institutes University of Barcelona Barcelona Spain
| | - Anna Puig
- Department of Mathematics and Computer Science, WAI Research Group, IMUB and UBICS Institutes University of Barcelona Barcelona Spain
| | - Maria Salamó
- Department of Mathematics and Computer Science, WAI Research Group, IMUB and UBICS Institutes University of Barcelona Barcelona Spain
| | - David García‐Sellés
- Department of Earth and Ocean Dynamics, RISKNAT Research Group, Geomodels Institute University of Barcelona Barcelona Spain
| | - Laura Blanco Nuñez
- Department of Earth and Ocean Dynamics, GGAC Research Group, Geomodels Institute University of Barcelona Barcelona Spain
- Anufra—Soil and Water Consulting Barcelona Spain
| | - Marta Guinau
- Department of Earth and Ocean Dynamics, RISKNAT Research Group, Geomodels Institute University of Barcelona Barcelona Spain
| |
|
40
|
Li Y, Li M, Yuan J, Lu J, Abdel-Aty M. Analysis and prediction of intersection traffic violations using automated enforcement system data. ACCIDENT; ANALYSIS AND PREVENTION 2021; 162:106422. [PMID: 34607246 DOI: 10.1016/j.aap.2021.106422] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/27/2021] [Revised: 09/01/2021] [Accepted: 09/22/2021] [Indexed: 06/13/2023]
Abstract
The automated enforcement system (AES) is an effective way of supplementing traditional traffic enforcement, and the traffic violation data from AES can also be effectively used for safety research. In this study, traffic violation data were used to analyze the influencing factors associated with traffic violations and to predict the probability of violations at intersections. The potential factors influencing violations include 24 independent factors related to time, space, traffic and weather. Results from a logistic model showed that the midday period, weekends, residential districts, collector roads, congested traffic conditions, high traffic flow, lower wind speed and low temperature would increase the probability of traffic violations. The probability of violations was predicted by the random forest algorithm, which proved to be the best traffic violation prediction model when compared with logistic regression, Gaussian naive Bayes, and support vector machine. Moreover, the proximity weighted synthetic oversampling technique (ProWSyn) method was applied to reduce the impact of the imbalance ratio (IR) and improve the model's prediction performance. The receiver operating characteristics (ROC) curves and Precision-Recall (PR) curves illustrated that the random forest algorithm had better prediction performance with oversampled data than with undersampled data. The area under curve (AUC) and out-of-bag (OOB) error with IR = 1 reached 0.914 and 0.0787, respectively, which showed the better performance of the random forest algorithm using ProWSyn in dealing with imbalanced traffic violation data.
Affiliation(s)
- Yunxuan Li
- Department of Civil Engineering, Tsinghua University, Beijing 100084, PR China
| | - Meng Li
- Department of Civil Engineering, Tsinghua University, Beijing 100084, PR China
| | - Jinghui Yuan
- National Transportation Research Center, Oak Ridge National Laboratory, Knoxville, TN 37918 United States
| | - Jian Lu
- School of Transportation, Southeast University, Nanjing, Jiangsu 211189, PR China
| | - Mohamed Abdel-Aty
- Department of Civil, Environmental and Construction Engineering, University of Central Florida, Orlando, FL 32816-2450, United States
| |
|
41
|
|
42
|
Liu J, Wong ZSY, So HY, Tsui KL. Evaluating resampling methods and structured features to improve fall incident report identification by the severity level. J Am Med Inform Assoc 2021; 28:1756-1764. [PMID: 34010385 DOI: 10.1093/jamia/ocab048] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 11/05/2020] [Revised: 02/24/2021] [Accepted: 04/27/2021] [Indexed: 11/14/2022] Open
Abstract
OBJECTIVE This study aims to improve the classification of the fall incident severity level by considering data imbalance issues and structured features through machine learning. MATERIALS AND METHODS We present an incident report classification (IRC) framework to classify the in-hospital fall incident severity level by addressing the imbalanced class problem and incorporating structured attributes. After text preprocessing, bag-of-words features, structured text features, and structured clinical features were extracted from the reports. Next, resampling techniques were incorporated into the training process. Machine learning algorithms were used to build classification models. IRC systems were trained, validated, and tested using a repeated and randomly stratified shuffle-split cross-validation method. Finally, we evaluated the system performance using the F1-measure, precision, and recall over 15 stratified test sets. RESULTS The experimental results demonstrated that the classification system setting considering both data imbalance issues and structured features outperformed the other system settings (with a mean macro-averaged F1-measure of 0.733). Considering the structured features and resampling techniques, this classification system setting significantly improved the mean F1-measure for the rare class by 30.88% (P value < .001) and the mean macro-averaged F1-measure by 8.26% from the baseline system setting (P value < .001). In general, the classification system employing the random forest algorithm and random oversampling method outperformed the others. CONCLUSIONS Structured features provide essential information for categorizing the fall incident severity level. Resampling methods help rebalance the class distribution of the original incident report data, which improves the performance of machine learning models. The IRC framework presented in this study effectively automates the identification of fall incident reports by the severity level.
Affiliation(s)
- Jiaxing Liu
- School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, China
- School of Data Science, City University of Hong Kong, Kowloon, Hong Kong SAR, China
| | - Zoie S Y Wong
- Graduate School of Public Health, St. Luke's International University, Tokyo, Japan
| | - H Y So
- Alice Ho Miu Ling Nethersole Hospital, New Territories, Hong Kong SAR, China
| | - Kwok Leung Tsui
- School of Data Science, City University of Hong Kong, Kowloon, Hong Kong SAR, China
| |
|
43
|
Soui M, Mansouri N, Alhamad R, Kessentini M, Ghedira K. NSGA-II as feature selection technique and AdaBoost classifier for COVID-19 prediction using patient's symptoms. NONLINEAR DYNAMICS 2021; 106:1453-1475. [PMID: 34025034 PMCID: PMC8129611 DOI: 10.1007/s11071-021-06504-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 01/02/2021] [Accepted: 04/28/2021] [Indexed: 05/20/2023]
Abstract
Nowadays, humanity is facing one of the most dangerous pandemics, known as COVID-19. Due to its high inter-person contagiousness, COVID-19 is rapidly spreading across the world. Positive patients often suffer from different symptoms that can vary from mild to severe, including cough, fever, sore throat, and body aches. In more dire cases, infected patients can experience severe symptoms that cause breathing difficulties, which can lead to severe organ failure and death. Medical staff all over the world are overloaded because of the exponentially growing number of infections. Therefore, screening for the disease is strained by the limited testing tools available. Additionally, test results may take a long time to acquire, increasing the potential for patients to spread the virus to other individuals. To reduce the chances of infection, we suggest a prediction model that identifies infected COVID-19 cases based on clinical symptoms and features. This model can help citizens detect their infection without the need to visit the hospital. It also helps the medical staff in triaging patients in case of a shortage of medical resources. In this paper, we use the non-dominated sorting genetic algorithm (NSGA-II) to select the interesting features by finding the best trade-offs between two conflicting objectives: minimizing the number of features and maximizing the weights of selected features. Then, a classification phase is conducted using an AdaBoost classifier. The proposed model is evaluated using two different datasets. To maximize results, we performed a natural selection of hyper-parameters of the classifier using the genetic algorithm. The obtained results prove the efficiency of NSGA-II as a feature selection algorithm combined with the AdaBoost classifier. It exhibits higher classification results that outperform the existing methods.
Affiliation(s)
- Makram Soui
- College of Computing and Informatics, Saudi Electronic University, Riyadh, Saudi Arabia
| | | | - Raed Alhamad
- College of Computing and Informatics, Saudi Electronic University, Riyadh, Saudi Arabia
| | | | - Khaled Ghedira
- Private Higher School of Engineering and Technology, Ariana, Tunisia
| |
|
44
|
Anyaso-Samuel S, Sachdeva A, Guha S, Datta S. Metagenomic Geolocation Prediction Using an Adaptive Ensemble Classifier. Front Genet 2021; 12:642282. [PMID: 33959149 PMCID: PMC8093763 DOI: 10.3389/fgene.2021.642282] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 12/15/2020] [Accepted: 03/18/2021] [Indexed: 11/13/2022] Open
Abstract
Microbiome samples harvested from urban environments can be informative in predicting the geographic location of unknown samples. The idea that different cities may have geographically disparate microbial signatures can be utilized to predict the geographical location based on city-specific microbiome samples. We implemented this idea by first utilizing standard bioinformatics procedures to pre-process the raw metagenomics samples provided by the CAMDA organizers. We trained several component classifiers and a robust ensemble classifier with data generated from taxonomy-dependent and taxonomy-free approaches. Also, we implemented class weighting and an optimal oversampling technique to overcome the class imbalance in the primary data. In each instance, we observed that the component classifiers performed differently, whereas the ensemble classifier consistently yielded optimal performance. Finally, we predicted the source cities of mystery samples provided by the organizers. Our results highlight the unreliability of restricting the classification of metagenomic samples to source origins to a single classification algorithm. By combining several component classifiers via the ensemble approach, we obtained classification results that were as good as the best-performing component classifier.
Affiliation(s)
- Samuel Anyaso-Samuel
- Department of Biostatistics, University of Florida, Gainesville, FL, United States
| | - Archie Sachdeva
- Department of Biostatistics, University of Florida, Gainesville, FL, United States
| | - Subharup Guha
- Department of Biostatistics, University of Florida, Gainesville, FL, United States
| | - Somnath Datta
- Department of Biostatistics, University of Florida, Gainesville, FL, United States
| |
|
45
|
Imbalanced data classification based on diverse sample generation and classifier fusion. INT J MACH LEARN CYB 2021. [DOI: 10.1007/s13042-021-01321-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/26/2022]
|
46
|
Abstract
Data imbalance is a thorny issue in machine learning. SMOTE is a famous oversampling method of imbalanced learning. However, it has some disadvantages such as sample overlapping, noise interference, and blindness of neighbor selection. In order to address these problems, we present a new oversampling method, OS-CCD, based on a new concept, the classification contribution degree. The classification contribution degree determines the number of synthetic samples generated by SMOTE for each positive sample. OS-CCD follows the spatial distribution characteristics of original samples on the class boundary, as well as avoids oversampling from noisy points. Experiments on twelve benchmark datasets demonstrate that OS-CCD outperforms six classical oversampling methods in terms of accuracy, F1-score, AUC, and ROC.
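The interpolation step that SMOTE-style methods share can be sketched as follows. This is a minimal illustration of plain SMOTE interpolation with a per-sample budget, not the OS-CCD method itself: the abstract does not define how the classification contribution degree is computed, so the `counts` list below is a hypothetical stand-in supplied by hand, and the function name `smote_like` is likewise an assumption.

```python
import math
import random

def smote_like(minority, counts, k=3, seed=0):
    """Create counts[i] synthetic points from minority[i] by interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for i, (x, n_new) in enumerate(zip(minority, counts)):
        # k nearest minority neighbours of x (excluding x itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(n_new):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position along the segment x -> nb
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
counts = [2, 0, 1, 0]  # hypothetical per-sample budgets, set by hand
new_points = smote_like(minority, counts)
```

Samples with a budget of zero generate nothing, which is one way a method could avoid oversampling from points it considers noisy.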
|
47
|
Silva WA, Villela SM. Improving the one-against-all binary approach for multiclass classification using balancing techniques. APPL INTELL 2021. [DOI: 10.1007/s10489-020-01805-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
|
48
|
Teh K, Armitage P, Tesfaye S, Selvarajah D, Wilkinson ID. Imbalanced learning: Improving classification of diabetic neuropathy from magnetic resonance imaging. PLoS One 2020; 15:e0243907. [PMID: 33320890 PMCID: PMC7737960 DOI: 10.1371/journal.pone.0243907] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 03/26/2020] [Accepted: 12/01/2020] [Indexed: 11/21/2022] Open
Abstract
One of the fundamental challenges when dealing with medical imaging datasets is class imbalance. Class imbalance occurs when the number of instances in the class of interest is relatively low compared to the rest of the data. This study aims to apply oversampling strategies in an attempt to balance the classes and improve classification performance. We evaluated four different classifiers, namely k-nearest neighbors (k-NN), support vector machine (SVM), multilayer perceptron (MLP) and decision trees (DT), with 73 oversampling strategies. In this work, we used imbalanced learning oversampling techniques to improve classification in datasets that are distinctively sparser and clustered. This work reports the best oversampling and classifier combinations and concludes that the usage of oversampling methods always outperforms using no oversampling, hence improving the classification results.
Affiliation(s)
- Kevin Teh
- Academic Unit of Radiology, Department of Infection, Immunity and Cardiovascular Disease, University of Sheffield, Sheffield, United Kingdom
| | - Paul Armitage
- Academic Unit of Radiology, Department of Infection, Immunity and Cardiovascular Disease, University of Sheffield, Sheffield, United Kingdom
| | - Solomon Tesfaye
- Diabetes Research Department, Sheffield Teaching Hospitals NHS Foundation Trust, Sheffield, United Kingdom
| | - Dinesh Selvarajah
- Diabetes Research Department, Sheffield Teaching Hospitals NHS Foundation Trust, Sheffield, United Kingdom
- Department of Oncology and Metabolism, University of Sheffield, Sheffield, United Kingdom
| | - Iain D. Wilkinson
- Academic Unit of Radiology, Department of Infection, Immunity and Cardiovascular Disease, University of Sheffield, Sheffield, United Kingdom
| |
|
49
|
Zhu Y, Yan Y, Zhang Y, Zhang Y. EHSO: Evolutionary Hybrid Sampling in overlapping scenarios for imbalanced learning. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.08.060] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 10/23/2022]
|
50
|
Vandewiele G, Dehaene I, Kovács G, Sterckx L, Janssens O, Ongenae F, De Backere F, De Turck F, Roelens K, Decruyenaere J, Van Hoecke S, Demeester T. Overly optimistic prediction results on imbalanced data: a case study of flaws and benefits when applying over-sampling. Artif Intell Med 2020; 111:101987. [PMID: 33461687 DOI: 10.1016/j.artmed.2020.101987] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 01/15/2020] [Revised: 09/09/2020] [Accepted: 11/12/2020] [Indexed: 01/10/2023]
Abstract
Information extracted from electrohysterography recordings could potentially prove to be an interesting additional source of information to estimate the risk of preterm birth. Recently, a large number of studies have reported near-perfect results to distinguish between recordings of patients that will deliver term or preterm using a public resource, called the Term/Preterm Electrohysterogram database. However, we argue that these results are overly optimistic due to a methodological flaw. In this work, we focus on one specific type of methodological flaw: applying over-sampling before partitioning the data into mutually exclusive training and testing sets. We show how this causes the results to be biased using two artificial datasets and reproduce results of studies in which this flaw was identified. Moreover, we evaluate the actual impact of over-sampling on predictive performance, when applied prior to data partitioning, using the same methodologies of related studies, to provide a realistic view of these methodologies' generalization capabilities. We make our research reproducible by providing all the code under an open license.
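The ordering issue described in this abstract can be illustrated with a small sketch. It assumes simple random over-sampling by duplication on toy labels; the function name `split_then_oversample` and the 9:1 toy data are hypothetical and are not taken from the authors' released code. The point is only the order of operations: the data are partitioned first, and duplicates are created from training samples alone, so no copy of a test sample can end up in the training set.

```python
import random
from collections import Counter

def split_then_oversample(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx); only the training indices are oversampled."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)

    # 1. Partition first, so every test sample is set aside untouched.
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # 2. Oversample inside the training partition only, by duplicating
    #    randomly chosen samples of each under-represented class.
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced.extend(members + extra)
    return balanced, test_idx

labels = [0] * 90 + [1] * 10  # toy data with a 9:1 class imbalance
train_idx, test_idx = split_then_oversample(labels)
train_counts = Counter(labels[i] for i in train_idx)
```

Reversing the two steps (duplicating first, then splitting) would allow copies of the same original sample to land on both sides of the split, which is the source of the optimistic bias the abstract describes.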
Affiliation(s)
- Gilles Vandewiele
- IDLab, Ghent University - imec, Technologiepark-Zwijnaarde 126, Ghent, Belgium
| | - Isabelle Dehaene
- Department of Gynaecology and Obstetrics, Ghent University Hospital, Corneel Heymanslaan 10, Ghent, Belgium
| | - György Kovács
- Analytical Minds Ltd Arpad street 5, Beregsurany, Hungary
| | - Lucas Sterckx
- IDLab, Ghent University - imec, Technologiepark-Zwijnaarde 126, Ghent, Belgium
| | - Olivier Janssens
- IDLab, Ghent University - imec, Technologiepark-Zwijnaarde 126, Ghent, Belgium
| | - Femke Ongenae
- IDLab, Ghent University - imec, Technologiepark-Zwijnaarde 126, Ghent, Belgium
| | - Femke De Backere
- IDLab, Ghent University - imec, Technologiepark-Zwijnaarde 126, Ghent, Belgium
| | - Filip De Turck
- IDLab, Ghent University - imec, Technologiepark-Zwijnaarde 126, Ghent, Belgium
| | - Kristien Roelens
- Department of Gynaecology and Obstetrics, Ghent University Hospital, Corneel Heymanslaan 10, Ghent, Belgium
| | - Johan Decruyenaere
- Department of Intensive Care Medicine, Ghent University Hospital, Corneel Heymanslaan 10, Ghent, Belgium
| | - Sofie Van Hoecke
- IDLab, Ghent University - imec, Technologiepark-Zwijnaarde 126, Ghent, Belgium
| | - Thomas Demeester
- IDLab, Ghent University - imec, Technologiepark-Zwijnaarde 126, Ghent, Belgium
| |
|