1. Chao AF, Wang CS, Li BY, Chen HY. From hate to harmony: Leveraging large language models for safer speech in times of COVID-19 crisis. Heliyon 2024; 10:e35468. [PMID: 39220951 PMCID: PMC11365350 DOI: 10.1016/j.heliyon.2024.e35468] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2024] [Revised: 07/15/2024] [Accepted: 07/29/2024] [Indexed: 09/04/2024]
Abstract
This study investigates the rampant spread of offensive and derogatory language during the COVID-19 pandemic and aims to mitigate it through machine learning. Employing advanced Large Language Models (LLMs), the research develops a sophisticated framework adept at detecting and transforming abusive and hateful speech. The project begins by meticulously compiling a dataset, focusing specifically on Chinese language abuse and hate speech. It incorporates an extensive list of 30 pandemic-related terms, significantly enriching the resources available for this type of research. A two-tier detection model is then introduced, achieving a remarkable accuracy of 94.42 % in its first phase and an impressive 81.48 % in the second. Furthermore, the study enhances paraphrasing efficiency by integrating generative AI techniques, primarily Large Language Models, with a Latent Dirichlet Allocation (LDA) topic model. This combination allows for a thorough analysis of language before and after modification. The results highlight the transformative power of these methods. They show that the rephrased statements not only reduce the initial hostility but also preserve the essential themes and meanings. This breakthrough offers users effective rephrasing suggestions to prevent the spread of hate speech, contributing to more positive and constructive public discourse.
Affiliation(s)
- August F.Y. Chao
  - Department of Computer Science and Information Engineering, National Penghu University of Science and Technology, Taiwan
- Chen-Shu Wang
  - Department of Information and Finance Management, National Taipei University of Technology, Taiwan
- Bo-Yi Li
  - Department of Management Information Systems, National Chengchi University, Taiwan
- Hong-Yan Chen
  - Department of Information and Finance Management, National Taipei University of Technology, Taiwan
2. Yang Y, Khorshidi HA, Aickelin U. A review on over-sampling techniques in classification of multi-class imbalanced datasets: insights for medical problems. Front Digit Health 2024; 6:1430245. [PMID: 39131184 PMCID: PMC11310152 DOI: 10.3389/fdgth.2024.1430245] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2024] [Accepted: 07/12/2024] [Indexed: 08/13/2024]
Abstract
There has been growing attention to multi-class classification problems, particularly those involving imbalanced class distributions. To address these challenges, various strategies, including data-level re-sampling and ensemble methods, have been introduced to bolster the performance of predictive models and Artificial Intelligence (AI) algorithms in scenarios where an excessive level of imbalance is present. While most research and algorithm development has focused on binary classification problems, there is increasing interest in health informatics in addressing multi-class classification in imbalanced datasets. Multi-class imbalance problems bring forth more complex challenges, as a delicate approach is required to generate synthetic data and simultaneously maintain the relationship between the multiple classes. The aim of this review paper is to examine over-sampling methods tailored for medical and other datasets with multi-class imbalance. Out of 2,076 peer-reviewed papers identified through searches, 197 eligible papers were chosen and thoroughly reviewed for inclusion, narrowing to 37 studies selected for in-depth analysis. These studies are grouped into four categories: metric, adaptive, structure-based, and hybrid approaches. The most significant finding is the emerging trend toward hybrid resampling methods that combine the strengths of various techniques to effectively address the problem of imbalanced data. This paper provides an extensive analysis of each selected study, discusses their findings, and outlines directions for future research.
Affiliation(s)
- Yuxuan Yang
  - School of Computing and Information Systems, The University of Melbourne, Parkville, VIC, Australia
- Hadi Akbarzadeh Khorshidi
  - School of Computing and Information Systems, The University of Melbourne, Parkville, VIC, Australia
  - Cancer Health Services Research, Melbourne School of Population and Global Health, The University of Melbourne, Parkville, VIC, Australia
- Uwe Aickelin
  - School of Computing and Information Systems, The University of Melbourne, Parkville, VIC, Australia
3. Shaha TR, Begum M, Uddin J, Torres VY, Iturriaga JA, Ashraf I, Samad MA. Feature group partitioning: an approach for depression severity prediction with class balancing using machine learning algorithms. BMC Med Res Methodol 2024; 24:123. [PMID: 38831346 PMCID: PMC11145774 DOI: 10.1186/s12874-024-02249-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2023] [Accepted: 05/20/2024] [Indexed: 06/05/2024]
Abstract
In contemporary society, depression has emerged as a prominent mental disorder that exhibits exponential growth and exerts a substantial influence on premature mortality. Although numerous studies have applied machine learning methods to forecast signs of depression, only a limited number have taken into account the severity level as a multiclass variable. Besides, an equal distribution of data among all classes rarely occurs in practice, so the inevitable class imbalance across multiple classes is considered a substantial challenge in this domain. This research therefore emphasizes the significance of addressing class imbalance in the multiclass setting. We introduce a new approach, feature group partitioning (FGP), in the data preprocessing phase, which effectively reduces the dimensionality of features to a minimum. This study utilized synthetic oversampling techniques, specifically Synthetic Minority Over-sampling Technique (SMOTE) and Adaptive Synthetic (ADASYN), for class balancing. The dataset used in this research was collected from university students by administering the Burn Depression Checklist (BDC). For methodological modifications, we implemented heterogeneous ensemble learning stacking, homogeneous ensemble bagging, and five distinct supervised machine learning algorithms. The issue of overfitting was mitigated by evaluating the accuracy of the training, validation, and testing datasets. To justify the effectiveness of the prediction models, balanced accuracy, sensitivity, specificity, precision, and F1-score indices are used. Overall, comprehensive analysis demonstrates the difference between the Conventional Depression Screening (CDS) and FGP approaches. In summary, the results show that the stacking classifier for FGP with SMOTE yields the highest balanced accuracy, with a rate of 92.81%.
The empirical evidence has demonstrated that the FGP approach, when combined with SMOTE, is able to produce better performance in predicting the severity of depression. Most importantly, the optimization of the training time of the FGP approach for all of the classifiers is a significant achievement of this research.
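The SMOTE balancing step named in this abstract can be illustrated with a minimal sketch. It assumes only the standard SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours); the function name `smote_sample`, its parameters, and the toy data are hypothetical and are not taken from the paper.

```python
import math
import random


def smote_sample(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> partner
        synthetic.append(
            tuple(b + gap * (q - b) for b, q in zip(base, partner))
        )
    return synthetic


# Toy two-feature minority class (illustrative values only)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sample(minority)
```

Because each synthetic point is a convex combination of two existing minority points, it always lies on the segment between them, which is why SMOTE fills in the minority region rather than duplicating samples.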
Affiliation(s)
- Tumpa Rani Shaha
  - Department of Computer Science and Engineering, Dhaka University of Engineering & Technology, Gazipur, 1707, Bangladesh
  - Department of Computer Science and Engineering, Bangabandhu Sheikh Mujibur Rahman Science & Technology University, Gopalganj, 8100, Bangladesh
- Momotaz Begum
  - Department of Computer Science and Engineering, Dhaka University of Engineering & Technology, Gazipur, 1707, Bangladesh
- Jia Uddin
  - AI and Big Data Department, Woosong University, Daejeon, 34606, South Korea
- Vanessa Yélamos Torres
  - Universidad Europea del Atlántico, Santander, 39011, Spain
  - Universidad Internacional Iberoamericana Campeche, Campeche, 24560, México
  - Universidad de La Romana, La Romana, República Dominicana
- Josep Alemany Iturriaga
  - Universidad Europea del Atlántico, Santander, 39011, Spain
  - Universidad Internacional Iberoamericana Arecibo, Puerto Rico, 00613, USA
  - Universidade Internacional do Cuanza, Cuito, Bié, Angola
- Imran Ashraf
  - Department of Information and Communication Engineering, Yeungnam University, Gyeongsan, 38541, South Korea
- Md Abdus Samad
  - Department of Information and Communication Engineering, Yeungnam University, Gyeongsan, 38541, South Korea
4. Abnoosian K, Farnoosh R, Behzadi MH. Prediction of diabetes disease using an ensemble of machine learning multi-classifier models. BMC Bioinformatics 2023; 24:337. [PMID: 37697283 PMCID: PMC10496262 DOI: 10.1186/s12859-023-05465-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 03/30/2023] [Accepted: 09/04/2023] [Indexed: 09/13/2023]
Abstract
BACKGROUND AND OBJECTIVE Diabetes is a life-threatening chronic disease with a growing global prevalence, necessitating early diagnosis and treatment to prevent severe complications. Machine learning has emerged as a promising approach for diabetes diagnosis, but challenges such as limited labeled data, frequent missing values, and dataset imbalance hinder the development of accurate prediction models. Therefore, a novel framework is required to address these challenges and improve performance. METHODS In this study, we propose an innovative pipeline-based multi-classification framework to predict diabetes in three classes: diabetic, non-diabetic, and prediabetes, using the imbalanced Iraqi Patient Dataset of Diabetes. Our framework incorporates various pre-processing techniques, including duplicate sample removal, attribute conversion, missing value imputation, data normalization and standardization, feature selection, and k-fold cross-validation. Furthermore, we implement multiple machine learning models, such as k-NN, SVM, DT, RF, AdaBoost, and GNB, and introduce a weighted ensemble approach based on the Area Under the Receiver Operating Characteristic Curve (AUC) to address dataset imbalance. Performance optimization is achieved through grid search and Bayesian optimization for hyper-parameter tuning. RESULTS Our proposed model outperforms other machine learning models, including k-NN, SVM, DT, RF, AdaBoost, and GNB, in predicting diabetes. The model achieves high average accuracy, precision, recall, F1-score, and AUC values of 0.9887, 0.9861, 0.9792, 0.9851, and 0.999, respectively. CONCLUSION Our pipeline-based multi-classification framework demonstrates promising results in accurately predicting diabetes using an imbalanced dataset of Iraqi diabetic patients. The proposed framework addresses the challenges associated with limited labeled data, missing values, and dataset imbalance, leading to improved prediction performance. 
This study highlights the potential of machine learning techniques in diabetes diagnosis and management, and the proposed framework can serve as a valuable tool for accurate prediction and improved patient care. Further research can build upon our work to refine and optimize the framework and explore its applicability in diverse datasets and populations.
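The AUC-based weighted ensemble mentioned in this abstract can be sketched as follows. The abstract does not state the exact weighting formula, so this sketch assumes the simplest choice: each model's class probabilities are weighted by its validation AUC, normalised to sum to one. The function name `auc_weighted_vote`, the probabilities, and the AUC values are all hypothetical.

```python
def auc_weighted_vote(probas, aucs):
    """Combine per-model class-probability vectors using weights
    proportional to each model's AUC (assumed weighting scheme)."""
    total = sum(aucs)
    weights = [a / total for a in aucs]
    n_classes = len(probas[0])
    combined = [
        sum(w * p[c] for w, p in zip(weights, probas))
        for c in range(n_classes)
    ]
    # Index of the class with the highest combined probability
    best = max(range(n_classes), key=combined.__getitem__)
    return best, combined


classes = ["non-diabetic", "prediabetes", "diabetic"]
probas = [
    [0.6, 0.3, 0.1],  # hypothetical model A
    [0.2, 0.5, 0.3],  # hypothetical model B
    [0.1, 0.6, 0.3],  # hypothetical model C
]
aucs = [0.70, 0.95, 0.90]  # hypothetical validation AUCs
label, combined = auc_weighted_vote(probas, aucs)
predicted_class = classes[label]
```

In this toy case model A alone would vote for the first class, but the two models with higher AUC outweigh it, so the combined prediction is the second class.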
Affiliation(s)
- Karlo Abnoosian
  - Department of Statistics, Science and Research Branch, Islamic Azad University, Tehran, Iran
- Rahman Farnoosh
  - School of Mathematics, Iran University of Science and Technology, Tehran, Iran
- Mohammad Hassan Behzadi
  - Department of Statistics, Science and Research Branch, Islamic Azad University, Tehran, Iran
5. Effect of inconsistency rate of granulated datasets on classification performance: An experimental approach. Inf Sci (N Y) 2023. [DOI: 10.1016/j.ins.2022.11.135] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/12/2022]
6. Han M, Guo H, Li J, Wang W. Global-local information based oversampling for multi-class imbalanced data. INT J MACH LEARN CYB 2022. [DOI: 10.1007/s13042-022-01746-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]
7. ESMOTE: an overproduce-and-choose synthetic examples generation strategy based on evolutionary computation. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-08004-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/07/2022]
8. Zhao Y, Chen K, Peng J, Wang J, Song N. Diverse needs and cooperative deeds: Comprehending users’ identities in online health communities. Inf Process Manag 2022. [DOI: 10.1016/j.ipm.2022.103060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
9
|
|
10. Choi HS, Jung D, Kim S, Yoon S. Imbalanced Data Classification via Cooperative Interaction Between Classifier and Generator. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022; 33:3343-3356. [PMID: 33531305 DOI: 10.1109/tnnls.2021.3052243] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 06/12/2023]
Abstract
Classifiers learned from imbalanced data can be strongly biased toward the majority class. To address this issue, several methods have been proposed using generative adversarial networks (GANs). Existing GAN-based methods, however, do not effectively utilize the relationship between a classifier and a generator. This article proposes a novel three-player structure consisting of a discriminator, a generator, and a classifier, along with decision boundary regularization. Our method is distinctive in that the generator is trained in cooperation with the classifier to provide minority samples that gradually expand the minority decision region, improving performance for imbalanced data classification. The proposed method outperforms the existing methods on real data sets as well as synthetic imbalanced data sets.
11. PF-SMOTE: A novel parameter-free SMOTE for imbalanced datasets. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.05.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
12. Classification of Electrocardiography Hybrid Convolutional Neural Network-Long Short Term Memory with Fully Connected Layer. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:6348424. [PMID: 35860642 PMCID: PMC9293511 DOI: 10.1155/2022/6348424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2022] [Accepted: 05/23/2022] [Indexed: 11/26/2022]
Abstract
Electrocardiography (ECG) is a technique for observing and recording the electrical activity of the human heart. The usage of an ECG signal is common among clinical professionals in the collection of time data for the examination of any rhythmic conditions associated with a subject. The investigation was carried out in order to computerize the assignment by exhibiting the issue using encoder-decoder techniques, creating the information that was simply typical of it, and utilising misfortune appropriation to anticipate standard or anomalous information. On a broad variety of applications such as voice recognition and prediction, the long short-term memory (LSTM) fully connected layer (FCL) and the two convolutional neural networks (CNNs) have shown superior performance over deep learning networks (DLNs). DNNs are suitable for making high points for a more divisible region and CNNs are suitable for reducing recurrence types, LSTMs are appropriate for temporary displays, in the same way as CNNs are appropriate for reducing recurrence types. The CNN, LSTM, and DNN algorithms are acceptable for viewing. The complementarity of DNNs, CNNs, and LSTMs was investigated in this research by bringing them all together under the single architectural company. The researchers got the ECG data from the MIT-BIH arrhythmia database as a result of the investigation. Our results demonstrate that the approach proposed may expressively describe ECG series and identify abnormalities via scores that outperform existing supervised and unsupervised methods in both the short term and long term. The LSTM network and FCL additionally demonstrated that the unbalanced datasets associated with the ECG beat detection problem could be consistently resolved and that they were not susceptible to the accuracy of ECG signals. It is recommended that cardiologists employ the unique technique to aid them in performing reliable and impartial interpretation of ECG data in telemedicine settings.
13. Xu Y, Yu Z, Chen CLP. Classifier Ensemble Based on Multiview Optimization for High-Dimensional Imbalanced Data Classification. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022; PP:870-883. [PMID: 35657843 DOI: 10.1109/tnnls.2022.3177695] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
High-dimensional class imbalanced data have plagued the performance of classification algorithms seriously. Because of a large number of redundant/invalid features and the class imbalanced issue, it is difficult to construct an optimal classifier for high-dimensional imbalanced data. Classifier ensemble has attracted intensive attention since it can achieve better performance than an individual classifier. In this work, we propose a multiview optimization (MVO) to learn more effective and robust features from high-dimensional imbalanced data, based on which an accurate and robust ensemble system is designed. Specifically, an optimized subview generation (OSG) in MVO is first proposed to generate multiple optimized subviews from different scenarios, which can strengthen the classification ability of features and increase the diversity of ensemble members simultaneously. Second, a new evaluation criterion that considers the distribution of data in each optimized subview is developed based on which a selective ensemble of optimized subviews (SEOS) is designed to perform the subview selective ensemble. Finally, an oversampling approach is executed on the optimized view to obtain a new class rebalanced subset for the classifier. Experimental results on 25 high-dimensional class imbalanced datasets indicate that the proposed method outperforms other mainstream classifier ensemble methods.
14. Dai W, Ning C, Nan J, Wang D. Stochastic configuration networks for imbalanced data classification. INT J MACH LEARN CYB 2022. [DOI: 10.1007/s13042-022-01565-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
15. Huang Y, Liu DR, Lee SJ, Hsu CH, Liu YG. A boosting resampling method for regression based on a conditional variational autoencoder. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2021.12.100] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/05/2022]
16. Özkan Y, Demirarslan M, Suner A. Effect of data preprocessing on ensemble learning for classification in disease diagnosis. COMMUN STAT-SIMUL C 2022. [DOI: 10.1080/03610918.2022.2053717] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 01/09/2023]
Affiliation(s)
- Yüksel Özkan
  - Department of Biostatistics and Medical Informatics, Faculty of Medicine, Ege University, Izmir, Turkey
- Mert Demirarslan
  - Department of Biostatistics and Medical Informatics, Faculty of Medicine, Ege University, Izmir, Turkey
- Aslı Suner
  - Department of Biostatistics and Medical Informatics, Faculty of Medicine, Ege University, Izmir, Turkey
17. Guo W, Wang Z, Ma M, Chen L, Yang H, Li D, Du W. Semi‐supervised multiple empirical kernel learning with pseudo empirical loss and similarity regularization. INT J INTELL SYST 2022. [DOI: 10.1002/int.22690] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 11/10/2022]
Affiliation(s)
- Wei Guo
  - Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai, People's Republic of China
  - Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai, People's Republic of China
- Zhe Wang
  - Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai, People's Republic of China
  - Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai, People's Republic of China
- Menghao Ma
  - Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai, People's Republic of China
  - Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai, People's Republic of China
- Lilong Chen
  - Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai, People's Republic of China
- Hai Yang
  - Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai, People's Republic of China
- Dongdong Li
  - Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai, People's Republic of China
- Wenli Du
  - Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai, People's Republic of China
18. Dong Y, Xiao H, Dong Y. SA-CGAN: An oversampling method based on single attribute guided conditional GAN for multi-class imbalanced learning. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2021.04.135] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/30/2022]
19. Yi X, Xu Y, Hu Q, Krishnamoorthy S, Li W, Tang Z. ASN-SMOTE: a synthetic minority oversampling method with adaptive qualified synthesizer selection. COMPLEX INTELL SYST 2022. [DOI: 10.1007/s40747-021-00638-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 10/19/2022]
Abstract
Oversampling is a promising preprocessing technique for imbalanced datasets which generates new minority instances to balance the dataset. However, improperly generated minority instances, i.e., noise instances, may interfere with the learning of the classifier and impact it negatively. Given this, in this paper, we propose a simple and effective oversampling approach known as ASN-SMOTE based on the k-nearest neighbors and the synthetic minority oversampling technique (SMOTE). ASN-SMOTE first filters noise in the minority class by determining whether the nearest neighbor of each minority instance belongs to the minority or majority class. After that, ASN-SMOTE uses the nearest majority instance of each minority instance to effectively perceive the decision boundary, inside which the qualified minority instances are selected adaptively for each minority instance by the proposed adaptive neighbor selection scheme to synthesize new minority instances. To substantiate the effectiveness, ASN-SMOTE has been applied to three different classifiers and comprehensive experiments have been conducted on 24 imbalanced benchmark datasets. ASN-SMOTE is also extensively compared with nine notable oversampling algorithms. The results show that ASN-SMOTE achieves the best results in the majority of datasets. The ASN-SMOTE implementation is available at: https://www.github.com/yixinkai123/ASN-SMOTE/.
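The three steps described in this abstract (noise filtering, adaptive neighbour selection bounded by the nearest majority instance, and synthesis) can be sketched roughly as below. This is a reading of the abstract only, not the authors' implementation (which is at the URL above): the function name `asn_smote_sketch`, the tie-breaking rule, the choice to draw neighbours only from the filtered minority set, and the toy data are all assumptions.

```python
import math
import random


def asn_smote_sketch(minority, majority, k=3, n_new=5, seed=0):
    rng = random.Random(seed)

    # Step 1 (noise filter): keep a minority point only if its nearest
    # neighbour is another minority point (ties kept; an assumption).
    kept = []
    for m in minority:
        d_min = min(
            (math.dist(m, o) for o in minority if o is not m),
            default=math.inf,
        )
        d_maj = min(math.dist(m, o) for o in majority)
        if d_min <= d_maj:
            kept.append((m, d_maj))

    # Step 2 (adaptive neighbour selection): among the k nearest kept
    # minority neighbours, qualify those closer than the nearest
    # majority point, i.e. inside the locally perceived boundary.
    candidates = []
    for m, d_maj in kept:
        near = sorted(
            (o for o, _ in kept if o is not m),
            key=lambda o: math.dist(m, o),
        )[:k]
        qualified = [o for o in near if math.dist(m, o) < d_maj]
        if qualified:
            candidates.append((m, qualified))

    # Step 3 (synthesis): SMOTE-style interpolation toward a qualified
    # neighbour.
    synthetic = []
    for _ in range(n_new if candidates else 0):
        base, qualified = rng.choice(candidates)
        partner = rng.choice(qualified)
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (q - b) for b, q in zip(base, partner))
        )
    return [m for m, _ in kept], synthetic


# Toy data: (3.0, 3.0) is a minority point sitting inside the majority.
minority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (3.0, 3.0)]
majority = [(3.1, 3.0), (2.9, 3.2), (3.0, 2.8), (1.5, 1.5)]
kept, synthetic = asn_smote_sketch(minority, majority)
```

In the toy data the isolated minority point is dropped in step 1, so no synthetic sample is generated across the majority region.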
20. Odenthal L, Allmer J, Yousef M. Ensemble Classifiers for Multiclass MicroRNA Classification. Methods Mol Biol 2022; 2257:235-254. [PMID: 34432282 DOI: 10.1007/978-1-0716-1170-8_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2023]
Abstract
Gene regulation is of utmost importance to cell homeostasis; thus, any dysregulation in it often leads to disease. MicroRNAs (miRNAs) are involved in posttranscriptional gene regulation and consequently, their dysregulation has been associated with many diseases. MiRBase version 21 contains microRNAs from about 200 species organized into about 70 clades. It has been shown that not all miRNAs collected in the database are likely to be real and, therefore, novel routes to distinguish between correct and false miRNAs should be explored. We introduce a novel approach based on k-mer frequencies and machine learning that assigns an unknown/unlabeled miRNA to its most likely clade/species of origin. A simple way to filter new data would be to ensure that the novel miRNA categorizes closely to the species it is said to originate from. For that, an ensemble classifier of multiple two-class random forest classifiers was designed, where each random forest was trained on one species-clade pair. The approach was tested with different sampling methods on a dataset that was taken from miRBase version 21 and it was evaluated using a hierarchical F-measure. The approach predicted 81% to 94% of the test data correctly, depending on the sampling method. This is the first classifier that can classify miRNAs to their species of origin. This method will aid in the evaluation of miRNA database integrity and analysis of noisy miRNA samples.
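The k-mer frequency features mentioned in this abstract can be illustrated with a short sketch. The abstract does not give the value of k or the normalisation used, so k = 2 and relative frequencies are assumptions here, and the function name `kmer_frequencies` and the example sequence are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGU"):
    """Return the relative frequency of every possible k-mer over the
    alphabet, as a fixed-length feature dictionary."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input too
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


# Arbitrary example RNA string, used only to exercise the function
freqs = kmer_frequencies("UGAGGUAGUAGGUUGUAUAGUU")
```

The resulting dictionary has one entry per possible k-mer (16 for k = 2), so every sequence maps to a vector of the same length regardless of how long the sequence is, which is what makes it usable as classifier input.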
Affiliation(s)
- Luise Odenthal
  - Bioinformatics/Medical Informatics, University of Bielefeld, Bielefeld, Germany
- Jens Allmer
  - Medical Informatics and Bioinformatics, Institute for Measurement Engineering and Sensor Technology, Hochschule Ruhr West, University of Applied Sciences, Mülheim adR, Germany
- Malik Yousef
  - Department of Information System, Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel
21. Upadhyay K, Kaur P, Verma DK. Evaluating the Performance of Data Level Methods Using KEEL Tool to Address Class Imbalance Problem. ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING 2021. [DOI: 10.1007/s13369-021-06377-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
22. Yao L, Lin TB. Evolutionary Mahalanobis Distance-Based Oversampling for Multi-Class Imbalanced Data Classification. SENSORS (BASEL, SWITZERLAND) 2021; 21:6616. [PMID: 34640936 PMCID: PMC8512012 DOI: 10.3390/s21196616] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/23/2021] [Revised: 09/14/2021] [Accepted: 09/29/2021] [Indexed: 11/18/2022]
Abstract
Sensing data are often imbalanced across data classes, for which oversampling of the minority class is an effective remedy. In this paper, an effective oversampling method called evolutionary Mahalanobis distance oversampling (EMDO) is proposed for multi-class imbalanced data classification. EMDO utilizes a set of ellipsoids to approximate the decision regions of the minority class. Furthermore, multi-objective particle swarm optimization (MOPSO) is integrated with the Gustafson-Kessel algorithm in EMDO to learn the size, center, and orientation of every ellipsoid. Synthetic minority samples are generated based on Mahalanobis distance within every ellipsoid. The number of synthetic minority samples generated by EMDO in every ellipsoid is determined based on the density of minority samples in that ellipsoid. The results of computer simulations conducted herein indicate that EMDO outperforms most of the widely used oversampling schemes.
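To make the link between Mahalanobis distance and ellipsoids concrete, the sketch below computes the Mahalanobis distance in two dimensions. It is not the EMDO algorithm (no MOPSO or Gustafson-Kessel step); the function name `mahalanobis_2d`, the centre, and the covariance matrix are hypothetical values chosen for illustration.

```python
def mahalanobis_2d(x, center, cov):
    """Mahalanobis distance of a 2-D point x from center, given a
    2x2 covariance matrix cov (assumed invertible)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - center[0], x[1] - center[1]
    q = (dx * (inv[0][0] * dx + inv[0][1] * dy)
         + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return q ** 0.5


center = (0.0, 0.0)
cov = ((4.0, 0.0), (0.0, 1.0))  # wider along x than along y
d_x = mahalanobis_2d((2.0, 0.0), center, cov)
d_y = mahalanobis_2d((0.0, 1.0), center, cov)
```

Both points are at Mahalanobis distance 1 even though their Euclidean distances differ, so the set of points at a fixed Mahalanobis distance forms an ellipsoid whose size and orientation follow the covariance matrix.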
Affiliation(s)
- Leehter Yao
  - Department of Electrical Engineering, National Taipei University of Technology, Taipei 10618, Taiwan
23. Pozi MSM, Azhar NA, Raziff ARA, Ajrina LH. SVGPM: evolving SVM decision function by using genetic programming to solve imbalanced classification problem. PROGRESS IN ARTIFICIAL INTELLIGENCE 2021. [DOI: 10.1007/s13748-021-00260-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
24. Solorio-Ramírez JL, Saldana-Perez M, Lytras MD, Moreno-Ibarra MA, Yáñez-Márquez C. Brain Hemorrhage Classification in CT Scan Images Using Minimalist Machine Learning. Diagnostics (Basel) 2021; 11:1449. [PMID: 34441383 PMCID: PMC8392442 DOI: 10.3390/diagnostics11081449] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/31/2021] [Revised: 08/04/2021] [Accepted: 08/07/2021] [Indexed: 01/22/2023]
Abstract
Over time, a myriad of applications have been generated for pattern classification algorithms. Several case studies include parametric classifiers such as the Multi-Layer Perceptron (MLP) classifier, which is one of the most widely used today. Others use non-parametric classifiers, Support Vector Machine (SVM), K-Nearest Neighbors (K-NN), Naïve Bayes (NB), Adaboost, and Random Forest (RF). However, there is still little work directed toward a new trend in Artificial Intelligence (AI), which is known as eXplainable Artificial Intelligence (X-AI). This new trend seeks to make Machine Learning (ML) algorithms increasingly simple and easy to understand for users. Therefore, following this new wave of knowledge, in this work, the authors develop a new pattern classification methodology, based on the implementation of the novel Minimalist Machine Learning (MML) paradigm and a higher relevance attribute selection algorithm, which we call dMeans. We examine and compare the performance of this methodology with MLP, NB, KNN, SVM, Adaboost, and RF classifiers to perform the task of classification of Computed Tomography (CT) brain images. These grayscale images have an area of 128 × 128 pixels, and there are two classes available in the dataset: CT without Hemorrhage and CT with Intra-Ventricular Hemorrhage (IVH), which were classified using the Leave-One-Out Cross-Validation method. Most of the models tested by Leave-One-Out Cross-Validation achieved between 50% and 75% accuracy, while sensitivity and specificity ranged between 58% and 86%. The experiments performed using our methodology matched the best classifier observed with 86.50% accuracy, and they outperformed all state-of-the-art algorithms in specificity with 91.60%. This performance is achieved with simple and practical methods, in keeping with the trend toward easily explainable algorithms.
Affiliation(s)
- Miltiadis D. Lytras
- Effat College of Engineering, Effat University, P.O. Box 34689, Jeddah 21478, Saudi Arabia
- Cornelio Yáñez-Márquez
- Centro de Investigación en Computación, Instituto Politécnico Nacional, CDMX 07700, Mexico
|
25
|
A novel kernel-free least squares twin support vector machine for fast and accurate multi-class classification. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.107123] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Indexed: 11/23/2022]
|
26
|
Classification of Diseases Using Machine Learning Algorithms: A Comparative Study. MATHEMATICS 2021. [DOI: 10.3390/math9151817] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 12/18/2022]
Abstract
Machine learning has become a very important requirement in the medical area. Healthcare professionals need useful tools to diagnose medical illnesses, and classifiers are important in providing such tools. However, questions arise: which classifier should be used? What metrics are appropriate to measure the performance of the classifier? How can a good distribution of the data be determined so that the classifier does not bias the medical patterns toward a particular class? Then the most important question: does a classifier perform well for a particular disease? This paper presents some answers to these questions, making use of classification algorithms widely used in machine learning research with datasets relating to medical illnesses under the supervised learning scheme. In addition to state-of-the-art algorithms in pattern classification, we introduce a novelty: the use of meta-learning to determine, a priori, which classifier would be ideal for a specific dataset. The results obtained show numerically and statistically that there are reliable classifiers to suggest medical diagnoses. In addition, we provide some insights about the expected performance of classifiers for such a task.
|
27
|
Improving Imbalanced Land Cover Classification with K-Means SMOTE: Detecting and Oversampling Distinctive Minority Spectral Signatures. INFORMATION 2021. [DOI: 10.3390/info12070266] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
Abstract
Land cover maps are a critical tool to support informed policy development, planning, and resource management decisions. With significant upsides, the automatic production of Land Use/Land Cover (LULC) maps has been a topic of interest for the remote sensing community for several years, but it is still fraught with technical challenges. One such challenge is the imbalanced nature of most remotely sensed data. The asymmetric class distribution negatively impacts the performance of classifiers and adds a new source of error to the production of these maps. In this paper, we address the imbalanced learning problem by using K-means and the Synthetic Minority Oversampling Technique (SMOTE) as an improved oversampling algorithm. K-means SMOTE improves the quality of newly created artificial data by addressing not only the between-class imbalance, as traditional oversamplers do, but also the within-class imbalance, avoiding the generation of noisy data while effectively overcoming data imbalance. The performance of K-means SMOTE is compared to three popular oversampling methods (Random Oversampling, SMOTE and Borderline-SMOTE) using seven remote sensing benchmark datasets, three classifiers (Logistic Regression, K-Nearest Neighbors and Random Forest Classifier) and three evaluation metrics using a five-fold cross-validation approach with three different initialization seeds. The statistical analysis of the results shows that the proposed method consistently outperforms the remaining oversamplers, producing higher quality land cover classifications. These results suggest that LULC data can benefit significantly from the use of more sophisticated oversamplers, as spectral signatures for the same class can vary according to geographical distribution.
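The interpolation step shared by SMOTE-style oversamplers can be sketched as follows. This is a simplified illustration under stated assumptions: it shows only plain SMOTE-like interpolation between minority neighbors, omits the K-means clustering stage that the abstract credits for handling within-class imbalance, and uses hypothetical names (`smote_like`, `minority`) and toy points.

```python
import random
from math import dist

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding the point itself.
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: dist(p, base))[:k]
        other = rng.choice(neighbors)
        t = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical minority-class samples in a 2-D feature space.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
```

Because each synthetic point is a convex combination of two existing minority points, it always falls inside the region already spanned by the minority class.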
|
28
|
Benítez-Buenache A, Álvarez-Pérez L, Figueiras-Vidal AR. On the design of Bayesian principled algorithms for imbalanced classification. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.106969] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
|
29
|
Handling imbalance in hierarchical classification problems using local classifiers approaches. Data Min Knowl Discov 2021. [DOI: 10.1007/s10618-021-00762-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
|
30
|
Addressing the multi-label imbalance for neural networks: An approach based on stratified mini-batches. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.12.122] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/20/2022]
|
31
|
Mehmood Z, Asghar S. Customizing SVM as a base learner with AdaBoost ensemble to learn from multi-class problems: A hybrid approach AdaBoost-MSVM. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.106845] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 10/22/2022]
|
32
|
A novel progressively undersampling method based on the density peaks sequence for imbalanced data. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2020.106689] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Indexed: 11/19/2022]
|
33
|
SMOTE-Based Weighted Deep Rotation Forest for the Imbalanced Hyperspectral Data Classification. REMOTE SENSING 2021. [DOI: 10.3390/rs13030464] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 12/25/2022]
Abstract
Conventional classification algorithms have shown great success in balanced hyperspectral data classification. However, imbalanced class distribution is a fundamental problem of hyperspectral data, and it is regarded as one of the great challenges in classification tasks. To solve this problem, a non-ANN-based deep learning method, namely SMOTE-Based Weighted Deep Rotation Forest (SMOTE-WDRoF), is proposed in this paper. First, the neighboring pixels of instances are introduced as spatial information, and balanced datasets are created by using the SMOTE algorithm. Second, these datasets are fed into the WDRoF model, which consists of the rotation forest and the multi-level cascaded random forests. Specifically, the rotation forest is used to generate rotation feature vectors, which are input into the subsequent cascade forest. Furthermore, the output probability of each level and the original data are stacked as the dataset of the next level, and the sample weights are automatically adjusted according to a dynamic weight function constructed from the classification results of each level. Compared with traditional deep learning approaches, the proposed method consumes much less training time. The experimental results on four public hyperspectral datasets demonstrate that the proposed method achieves better performance than support vector machine, random forest, rotation forest, SMOTE combined rotation forest, convolutional neural network, and rotation-based deep forest in multiclass imbalance learning.
|
34
|
Jing XY, Zhang X, Zhu X, Wu F, You X, Gao Y, Shan S, Yang JY. Multiset Feature Learning for Highly Imbalanced Data Classification. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2021; 43:139-156. [PMID: 31331881 DOI: 10.1109/tpami.2019.2929166] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Indexed: 06/10/2023]
Abstract
With the expansion of data, increasingly imbalanced data have emerged. When the imbalance ratio (IR) of data is high, most existing imbalanced learning methods decline seriously in classification performance. In this paper, we systematically investigate the highly imbalanced data classification problem and propose an uncorrelated cost-sensitive multiset learning (UCML) approach for it. Specifically, UCML first constructs multiple balanced subsets through random partition, and then employs multiset feature learning (MFL) to learn discriminant features from the constructed multiset. To enhance the usability of each subset and deal with the non-linearity issue existing in each subset, we further propose a deep metric based UCML (DM-UCML) approach. DM-UCML introduces the generative adversarial network technique into the multiset constructing process, such that each subset has a distribution similar to that of the original dataset. To cope with the non-linearity issue, DM-UCML integrates deep metric learning with MFL, such that more favorable performance can be achieved. In addition, DM-UCML designs a new discriminant term to enhance the discriminability of learned metrics. Experiments on eight traditional highly class-imbalanced datasets and two large-scale datasets indicate that the proposed approaches outperform state-of-the-art highly imbalanced learning methods and are more robust to high IR.
|
35
|
Zhao Y, Da J, Yan J. Detecting health misinformation in online health communities: Incorporating behavioral features into machine learning based approaches. Inf Process Manag 2021. [DOI: 10.1016/j.ipm.2020.102390] [Citation(s) in RCA: 52] [Impact Index Per Article: 17.3] [Indexed: 12/19/2022]
|
36
|
Bedi S, Samal A, Ray C, Snow D. Comparative evaluation of machine learning models for groundwater quality assessment. ENVIRONMENTAL MONITORING AND ASSESSMENT 2020; 192:776. [PMID: 33219864 DOI: 10.1007/s10661-020-08695-3] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 02/27/2020] [Accepted: 10/20/2020] [Indexed: 06/11/2023]
Abstract
Contamination from pesticides and nitrate in groundwater is a significant threat to water quality in general and agriculturally intensive regions in particular. Three widely used machine learning models, namely, artificial neural networks (ANN), support vector machines (SVM), and extreme gradient boosting (XGB), were evaluated for their efficacy in predicting contamination levels using sparse data with non-linear relationships. The predictive ability of the models was assessed using a dataset consisting of 303 wells across 12 Midwestern states in the USA. Multiple hydrogeologic, water quality, and land use features were chosen as the independent variables, and classes were based on measured concentration ranges of nitrate and pesticide. This study evaluates the classification performance of the models for two, three, and four class scenarios and compares them with the corresponding regression models. The study also examines the issue of class imbalance and tests the efficacy of three class imbalance mitigation techniques: oversampling, weighting, and oversampling and weighting, for all the scenarios. The models' performance is reported using multiple metrics, both insensitive to class imbalance (accuracy) and sensitive to class imbalance (F1 score and MCC). Finally, the study assesses the importance of features using game-theoretic Shapley values to rank features consistently and offer model interpretability.
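The contrast the abstract draws between accuracy, F1 score and MCC can be seen in a small worked sketch. The counts below are hypothetical and are not taken from the study; returning 0 when a denominator is zero is a common convention and is an assumption here.

```python
from math import sqrt

def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, F1 score and Matthews correlation coefficient from binary counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return accuracy, f1, mcc

# Hypothetical imbalanced case: 95 clean wells, 5 contaminated wells,
# and a trivial model that labels every well as clean.
acc, f1, mcc = confusion_metrics(tp=0, fp=0, fn=5, tn=95)
```

Here accuracy is 0.95 even though no contaminated well is detected, while F1 and MCC are both 0, which is why the latter two are reported when classes are imbalanced.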
Affiliation(s)
- Shine Bedi
- Computer Science and Engineering, University of Nebraska, Lincoln, NE, USA
- Ashok Samal
- Computer Science and Engineering, University of Nebraska, Lincoln, NE, USA
- Daniel Snow
- Water Sciences Laboratory, University of Nebraska, Lincoln, NE, USA
|
37
|
Pereira RM, Bertolini D, Teixeira LO, Silla CN, Costa YMG. COVID-19 identification in chest X-ray images on flat and hierarchical classification scenarios. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2020; 194:105532. [PMID: 32446037 PMCID: PMC7207172 DOI: 10.1016/j.cmpb.2020.105532] [Citation(s) in RCA: 217] [Impact Index Per Article: 54.3] [Received: 04/15/2020] [Revised: 05/05/2020] [Accepted: 05/06/2020] [Indexed: 05/02/2023]
Abstract
BACKGROUND AND OBJECTIVE COVID-19 can cause severe pneumonia and is estimated to have a high impact on the healthcare system. Early diagnosis is crucial for correct treatment in order to possibly reduce the stress on the healthcare system. The standard image diagnosis tests for pneumonia are chest X-ray (CXR) and computed tomography (CT) scan. Although CT scan is the gold standard, CXR is still useful because it is cheaper, faster and more widespread. This study aims to distinguish pneumonia caused by COVID-19 from other types and from healthy lungs using only CXR images. METHODS In order to achieve the objectives, we have proposed a classification schema considering the following perspectives: i) multi-class classification; ii) hierarchical classification, since pneumonia can be structured as a hierarchy. Given the natural data imbalance in this domain, we also proposed the use of resampling algorithms in the schema in order to re-balance the class distribution. Since texture is one of the main visual attributes of CXR images, our classification schema extracts features using some well-known texture descriptors and also using a pre-trained CNN model. We also explored early and late fusion techniques in the schema in order to leverage the strength of multiple texture descriptors and base classifiers at once. To evaluate the approach, we composed a database, named RYDLS-20, containing CXR images of pneumonia caused by different pathogens as well as CXR images of healthy lungs. The class distribution follows a real-world scenario in which some pathogens are more common than others. RESULTS The proposed approach tested on RYDLS-20 achieved a macro-avg F1-Score of 0.65 using a multi-class approach and an F1-Score of 0.89 for COVID-19 identification in the hierarchical classification scenario.
CONCLUSIONS As far as we know, the top identification rate obtained in this paper is the best nominal rate obtained for COVID-19 identification in an unbalanced environment with more than three classes. We must also highlight the novel hierarchical classification approach proposed for this task, which considers the types of pneumonia caused by the different pathogens and led us to the best COVID-19 recognition rate obtained here.
Affiliation(s)
- Rodolfo M Pereira
- Instituto Federal de Educação, Ciência e Tecnologia do Paraná (IFPR), Pinhais, PR, Brazil; Pontifícia Universidade Católica do Paraná (PUCPR), Curitiba, PR, Brazil
- Diego Bertolini
- Universidade Tecnológica Federal do Paraná (UTFPR), Campo Mourão, PR, Brazil; Universidade Estadual de Maringá (UEM), Maringá, PR, Brazil
- Carlos N Silla
- Pontifícia Universidade Católica do Paraná (PUCPR), Curitiba, PR, Brazil
|
38
|
Babukarthik RG, Adiga VAK, Sambasivam G, Chandramohan D, Amudhavel J. Prediction of COVID-19 Using Genetic Deep Learning Convolutional Neural Network (GDCNN). IEEE ACCESS : PRACTICAL INNOVATIONS, OPEN SOLUTIONS 2020; 8:177647-177666. [PMID: 34786292 PMCID: PMC8545287 DOI: 10.1109/access.2020.3025164] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 08/27/2020] [Accepted: 09/07/2020] [Indexed: 05/14/2023]
Abstract
The rapid spread of coronavirus disease (COVID-19) leads to severe pneumonia and is estimated to have a high impact on the healthcare system. Early diagnosis is urgently needed for precise treatment, which in turn reduces the pressure on the healthcare system. The standard imaging tests available include Computed Tomography (CT) scans and Chest X-Ray (CXR). Even though a CT scan is considered the gold standard in diagnosis, CXR is the most widely used because it is more widespread, faster, and cheaper. This study aims to provide a solution for distinguishing pneumonia due to COVID-19 from healthy lungs (normal person) using CXR images. Deep learning is one of the remarkable methods for extracting high-dimensional features from medical images. In this research, the state-of-the-art technique used is the Genetic Deep Learning Convolutional Neural Network (GDCNN). It is trained from scratch to extract features for classifying images as COVID-19 or normal. A dataset consisting of more than 5000 CXR image samples is used for classifying pneumonia, normal and other pneumonia diseases. Training a GDCNN from scratch shows that the proposed method performs better than transfer learning techniques. A classification accuracy of 98.84%, precision of 93%, sensitivity of 100%, and specificity of 97.0% in COVID-19 prediction are achieved. The top classification accuracy obtained in this research is the best nominal rate for COVID-19 identification in an unbalanced environment. The proposed model proves to be better than existing models such as ResNet18, ResNet50, SqueezeNet, DenseNet-121, and Visual Geometry Group (VGG16).
Affiliation(s)
- R. G. Babukarthik
- Department of Computer Science and Engineering, Dayananda Sagar University, Bengaluru 560078, India
- V. Ananth Krishna Adiga
- Department of Computer Science and Engineering, Dayananda Sagar University, Bengaluru 560078, India
- G. Sambasivam
- Faculty of Information and Communication Technology, ISBAT University, Kampala, Uganda
- D. Chandramohan
- Department of Computer Science and Engineering, Madanapalle Institute of Technology and Science, Madanapalle 517325, India
- J. Amudhavel
- School of Computer Science and Engineering, VIT Bhopal University, Bhopal 466114, India
|
39
|
Koziarski M, Woźniak M, Krawczyk B. Combined Cleaning and Resampling algorithm for multi-class imbalanced data with label noise. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.106223] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Indexed: 11/30/2022]
|
40
|
|
41
|
Krawczyk B, Koziarski M, Wozniak M. Radial-Based Oversampling for Multiclass Imbalanced Data Classification. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2020; 31:2818-2831. [PMID: 31247563 DOI: 10.1109/tnnls.2019.2913673] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Indexed: 06/09/2023]
Abstract
Learning from imbalanced data is among the most popular topics in contemporary machine learning. However, the vast majority of attention in this field is given to binary problems, while their much more difficult multiclass counterparts are relatively unexplored. Handling data sets with multiple skewed classes poses various challenges and calls for a better understanding of the relationship among classes. In this paper, we propose multiclass radial-based oversampling (MC-RBO), a novel data-sampling algorithm dedicated to multiclass problems. The main novelty of our method lies in using potential functions for generating artificial instances. We take into account information coming from all of the classes, contrary to existing multiclass oversampling approaches that use only minority class characteristics. The process of artificial instance generation is guided by exploring areas where the value of the mutual class distribution is very small. This way, we ensure a smart oversampling procedure that can cope with difficult data distributions and alleviate the shortcomings of existing methods. The usefulness of the MC-RBO algorithm is evaluated on the basis of an extensive experimental study and backed up with a thorough statistical analysis. The results show that by taking into account information coming from all of the classes and conducting smart oversampling, we can significantly improve the process of learning from multiclass imbalanced data.
|
42
|
Brzezinski D, Stefanowski J, Susmaga R, Szczech I. On the Dynamics of Classification Measures for Imbalanced and Streaming Data. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2020; 31:2868-2878. [PMID: 30892237 DOI: 10.1109/tnnls.2019.2899061] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Indexed: 06/09/2023]
Abstract
As each imbalanced classification problem comes with its own set of challenges, the measure used to evaluate classifiers must be individually selected. To help researchers make this decision in an informed manner, experimental and theoretical investigations compare general properties of measures. However, existing studies do not analyze changes in measure behavior imposed by different imbalance ratios. Moreover, several characteristics of imbalanced data streams, such as the effect of dynamically changing class proportions, have not been thoroughly investigated from the perspective of different metrics. In this paper, we study measure dynamics by analyzing changes of measure values, distributions, and gradients with diverging class proportions. For this purpose, we visualize measure probability mass functions and gradients. In addition, we put forward a histogram-based normalization method that provides a unified, probabilistic interpretation of any measure over data sets with different class distributions. The results of analyzing eight popular classification measures show that the effect class proportions have on each measure is different and should be taken into account when evaluating classifiers. Apart from highlighting imbalance-related properties of each measure, our study shows a direct connection between class ratio changes and certain types of concept drift, which could be influential in designing new types of classifiers and drift detectors for imbalanced data streams.
|
43
|
Abstract
National monitoring of forestlands and the processes causing canopy cover loss, be they abrupt or gradual, partial or stand clearing, temporary (disturbance) or persisting (deforestation), is necessary at fine scales to inform management, science and policy. This study utilizes the Landsat archive and an ensemble of disturbance algorithms to produce maps attributing event type and timing to >258 million ha of contiguous United States forested ecosystems (1986–2010). Nationally, 75.95 million forest ha (759,531 km²) experienced change, with 80.6% attributed to removals, 12.4% to wildfire, 4.7% to stress and 2.2% to conversion. Between regions, the relative amounts and rates of removals, wildfire, stress and conversion varied substantially. The removal class had 82.3% (0.01 S.E.) user’s and 72.2% (0.02 S.E.) producer’s accuracy. A survey of available national attribution datasets, from the data user’s perspective, of scale, relevant processes and ecological depth suggests knowledge gaps remain.
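User's and producer's accuracy are both read off a confusion matrix, as the following sketch shows. It assumes raw sample counts with rows as mapped classes and columns as reference classes; the counts are hypothetical, and any sampling-design weighting behind the reported standard errors is ignored.

```python
def users_producers_accuracy(matrix, class_index):
    """matrix[i][j] = samples mapped as class i whose reference label is class j."""
    correct = matrix[class_index][class_index]
    mapped_total = sum(matrix[class_index])                     # row sum
    reference_total = sum(row[class_index] for row in matrix)   # column sum
    users = correct / mapped_total         # share of mapped pixels that are right
    producers = correct / reference_total  # share of reference pixels that were found
    return users, producers

# Hypothetical counts; class order: removal, wildfire, other.
matrix = [
    [80, 5, 15],
    [20, 40, 6],
    [25, 5, 129],
]
users, producers = users_producers_accuracy(matrix, class_index=0)
```

In this toy matrix the removal class has a user's accuracy of 0.80 and a producer's accuracy of 0.64, so it is mapped reliably where it is mapped but misses part of the reference removals.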
|
44
|
Imbalanced credit risk evaluation based on multiple sampling, multiple kernel fuzzy self-organizing map and local accuracy ensemble. Appl Soft Comput 2020. [DOI: 10.1016/j.asoc.2020.106262] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 11/23/2022]
|
45
|
Coastal Wetland Mapping Using Ensemble Learning Algorithms: A Comparative Study of Bagging, Boosting and Stacking Techniques. REMOTE SENSING 2020. [DOI: 10.3390/rs12101683] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Indexed: 11/16/2022]
Abstract
Coastal wetlands are a critical component of the coastal landscape that is increasingly threatened by sea level rise and other human disturbance. Periodically mapping wetland distribution is crucial to coastal ecosystem management. Ensemble learning (EL) algorithms, such as random forest (RF) and gradient boosting machine (GBM), are now commonly applied in the field of remote sensing. However, the performance and potential of other EL methods, such as extreme gradient boosting (XGBoost) and bagged trees, are rarely compared and tested for coastal wetland mapping. In this study, we applied the three most widely used EL techniques (i.e., bagging, boosting and stacking) to map wetland distribution in a highly modified coastal catchment, the Manning River Estuary, Australia. Our results demonstrated the advantages of using ensemble classifiers to accurately map wetland types in a coastal landscape. Enhanced bagging decision trees, i.e., classifiers with additional methods to increase ensemble diversity such as RF and weighted subspace random forest, had comparably high predictive power. For the stacking method evaluated in this study, our results are inconclusive, and further comprehensive quantitative study is encouraged. Our findings also suggested that the ensemble methods were less effective at discriminating minority classes in comparison with more common classes. Finally, the variable importance results indicated that hydro-geomorphic factors, such as tidal depth and distance to water edge, were among the most influential variables across the top classifiers. However, vegetation indices derived from longer time series of remote sensing data that capture the full features of land phenology are likely to improve wetland type separation in coastal areas.
|
46
|
A Novel and Simple Mathematical Transform Improves the Perfomance of Lernmatrix in Pattern Classification. MATHEMATICS 2020. [DOI: 10.3390/math8050732] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]
Abstract
The Lernmatrix is a classic associative memory model. The Lernmatrix is capable of executing the pattern classification task, but its performance is not competitive when compared to state-of-the-art classifiers. The main contribution of this paper consists of the proposal of a simple mathematical transform, whose application eliminates the subtractive alterations between patterns. As a consequence, the Lernmatrix performance is significantly improved. To perform the experiments, we selected 20 datasets that are challenging for any classifier, as they exhibit class imbalance. The effectiveness of our proposal was compared against seven supervised classifiers of the most important approaches (Bayes, nearest neighbors, decision trees, logistic function, support vector machines, and neural networks). By choosing balanced accuracy as a performance measure, our proposal obtained the best results in 10 datasets. The elimination of subtractive alterations makes the new model competitive against the best classifiers, and sometimes beats them. After applying the Friedman test and the Holm post hoc test, we can conclude that within a 95% confidence, our proposal competes successfully with the most effective classifiers of the state of the art.
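Balanced accuracy, the performance measure chosen in this abstract, is the mean of the per-class recalls. A minimal sketch with hypothetical labels shows how it differs from plain accuracy on an imbalanced set.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally regardless of size."""
    hits, totals = defaultdict(int), defaultdict(int)
    for truth, predicted in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == predicted)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Hypothetical imbalanced labels: eight samples of class 0, two of class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]  # one of the two minority samples is missed

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
```

Plain accuracy is 0.9 here, while balanced accuracy is 0.75 because the missed minority sample halves the recall of class 1.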
|
47
|
|
48
|
Wang Z, Li Y, Li D, Zhu Z, Du W. Entropy and gravitation based dynamic radius nearest neighbor classification for imbalanced problem. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.105474] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 10/25/2022]
|
49
|
|
50
|
Classification of Guillain–Barré Syndrome Subtypes Using Sampling Techniques with Binary Approach. Symmetry (Basel) 2020. [DOI: 10.3390/sym12030482] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
Abstract
Guillain–Barré Syndrome (GBS) is an unusual disorder where the body’s immune system affects the peripheral nervous system. GBS has four main subtypes, whose treatments vary among them. Severe cases of GBS can be fatal. This work aimed to investigate whether balancing an original GBS dataset improves the predictive models created in a previous study. Balancing a dataset means pursuing symmetry in the number of instances of each of the classes. The dataset includes 129 records of Mexican patients diagnosed with some subtype of GBS. We created 10 binary datasets from the original dataset. Then, we balanced these datasets using four different methods to undersample the majority class and one method to oversample the minority class. Finally, we used three classifiers with different approaches to create predictive models. The results show that balancing the original dataset improves the previous predictive models. The goal of the predictive models is to identify the GBS subtypes by applying Machine Learning algorithms. It is expected that specialists may use the model to obtain a complementary diagnosis using a reduced set of relevant features. Early identification of the subtype will allow starting with the appropriate treatment for patient recovery. This is a contribution to exploring the performance of balancing techniques with real data.
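Balancing by undersampling the majority class can be sketched in a few lines. The abstract does not name its four undersampling methods, so random undersampling is used here purely as an assumed example; the record IDs and the labels "A" and "B" are hypothetical placeholders, not GBS data.

```python
import random

def random_undersample(records, labels, seed=0):
    """Randomly drop records from larger classes until all classes match the smallest."""
    rng = random.Random(seed)
    by_class = {}
    for record, label in zip(records, labels):
        by_class.setdefault(label, []).append(record)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        # Classes already at the target size are kept whole.
        kept = group if len(group) == target else rng.sample(group, target)
        balanced.extend((record, label) for record in kept)
    return balanced

# Hypothetical binary dataset: nine records of class "A", three of class "B".
records = list(range(12))
labels = ["A"] * 9 + ["B"] * 3
balanced = random_undersample(records, labels)
```

After the call, both classes contribute three records each, which is the symmetry in class counts that the abstract describes.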
|