1
Fotouhi M, Samadi Khoshe Mehr F, Delazar S, Shahidi R, Setayeshpour B, Toosi MN, Arian A. Assessment of LI-RADS efficacy in classification of hepatocellular carcinoma and benign liver nodules using DCE-MRI features and machine learning. Eur J Radiol Open 2023; 11:100535. [PMID: 37964787 PMCID: PMC10641154 DOI: 10.1016/j.ejro.2023.100535] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2023] [Revised: 10/12/2023] [Accepted: 10/23/2023] [Indexed: 11/16/2023]
Abstract
Purpose: This study evaluated how well visual features from dynamic contrast-enhanced (DCE) MRI classify benign liver nodules and hepatocellular carcinoma (HCC) using a machine learning model.
Methods: A total of 392 nodules from 245 patients were included (115 LI-RADS 3, 137 LI-RADS 4, and 140 LI-RADS 5). LR-3 nodules were evaluated by follow-up imaging, and LR-4 and LR-5 nodules by pathology results. Data were collected retrospectively from 3 T and 1.5 T MRI scanners. The lesions were categorized as 124 benign and 268 HCC. Visual features included tumor size, arterial-phase hyperenhancement (APHE), washout, lesion segment, mass/mass-like, and capsule presence. The Gini importance method was used to select the most important features and limit over-fitting. The final dataset was split into training (70%), validation (10%), and test (20%) sets. A support vector machine (SVM) was trained as the classifier. Five-fold cross-validation was used for model validation, and the test set was used to assess the final accuracy. Receiver operating characteristic curves and the area under the curve were used to assess classifier performance.
Results: On the test set, the accuracy, sensitivity, and specificity for classifying benign and HCC lesions were 82%, 84%, and 81%, respectively. APHE, washout, tumor size, and mass/mass-like features significantly differentiated benign from HCC lesions (p < .001).
Conclusions: The classification model built on DCE-MRI features showed that visual features perform well in classifying benign and HCC lesions. The study also highlighted the significance of mass and mass-like features in addition to LI-RADS categorization. For future work, the study suggests developing a deep-learning algorithm for automatic lesion segmentation and feature assessment to reduce lesion categorization errors.
Affiliation(s)
- Maryam Fotouhi
  - Advanced Diagnostic and Interventional Radiology (ADIR), Radiology department, Imam Khomeini Hospital Complex, Tehran University of Medical Science, Iran
- Fardin Samadi Khoshe Mehr
  - Research Centre for Molecular and Cellular Imaging (RCMCI), Advanced Medical Technologies and Equipment Institute (AMTEI), Tehran University of Medical Sciences, Tehran, Iran
- Sina Delazar
  - Advanced Diagnostic and Interventional Radiology (ADIR), Radiology department, Imam Khomeini Hospital Complex, Tehran University of Medical Science, Iran
- Ramin Shahidi
  - School of Medicine, Bushehr University of Medical Sciences, Bushehr, Iran
- Mohssen Nassiri Toosi
  - Imam Khomeini Hospital Complex, Liver Transplantation Research Centre, Tehran University of Medical Sciences, Tehran, Iran
- Arvin Arian
  - Advanced Diagnostic and Interventional Radiology (ADIR), Radiology department, Imam Khomeini Hospital Complex, Tehran University of Medical Science, Iran
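The accuracy, sensitivity, and specificity reported in the abstract of entry 1 are all derived from a binary confusion matrix. The following is a minimal sketch of that arithmetic; the function name `binary_metrics` and the toy labels are illustrative assumptions, not the study's code or data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity, specificity) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Sensitivity: fraction of true positives (e.g. HCC) that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of true negatives (e.g. benign) that were detected.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity


# Toy labels (1 = HCC, 0 = benign); made up for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
acc, sens, spec = binary_metrics(y_true, y_pred)
print(acc, sens, spec)
```

With imbalanced classes such as 124 benign versus 268 HCC lesions, reporting sensitivity and specificity alongside accuracy shows how each class is handled separately.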
2
Balanced neighbor exploration for semi-supervised node classification on imbalanced graph data. Inf Sci (N Y) 2023. [DOI: 10.1016/j.ins.2023.02.064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2023]
3
Li T, Wang Y, Liu L, Chen L, Chen CP. Subspace-based minority oversampling for imbalance classification. Inf Sci (N Y) 2023. [DOI: 10.1016/j.ins.2022.11.108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/05/2022]
4
Yuan X, Chen S, Zhou H, Sun C, Yuwen L. CHSMOTE: Convex hull-based synthetic minority oversampling technique for alleviating the class imbalance problem. Inf Sci (N Y) 2023. [DOI: 10.1016/j.ins.2022.12.056] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]
5
GhoshRoy D, Alvi PA, Santosh KC. Unboxing Industry-Standard AI Models for Male Fertility Prediction with SHAP. Healthcare (Basel) 2023; 11:929. [PMID: 37046855 PMCID: PMC10094449 DOI: 10.3390/healthcare11070929] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2023] [Revised: 03/21/2023] [Accepted: 03/21/2023] [Indexed: 04/14/2023]
Abstract
Infertility carries a social stigma for individuals, and male factors cause approximately 30% of infertility. Despite this, male infertility is underrecognized and underrepresented as a disease. According to the World Health Organization (WHO), changes in lifestyle and environmental factors are the prime reasons for the declining rate of male fertility. Artificial intelligence (AI)/machine learning (ML) models have become an effective solution for early fertility detection. Seven industry-standard ML models are used to detect male fertility: support vector machine, random forest (RF), decision tree, logistic regression, naïve Bayes, AdaBoost, and multi-layer perceptron. Shapley additive explanations (SHAP) are vital tools that examine each feature's impact on a model's decision making. With these, we perform a comprehensive comparative study to identify good and poor classification models. Among all of the models above, the RF model achieves the best accuracy and area under the curve (AUC), 90.47% and 99.98%, respectively, under five-fold cross-validation (CV) with the balanced dataset. Furthermore, we provide SHAP explanations of the existing models that attain good and poor performance. The findings of this study show that decision making based on ML models with SHAP provides thorough explanations for detecting male fertility, as well as a reference for clinicians for further treatment planning.
Affiliation(s)
- Debasmita GhoshRoy
  - School of Automation, Banasthali Vidyapith, Tonk 304022, Rajasthan, India
  - Applied AI Research Lab, Vermillion, SD 57069, USA
- Parvez Ahmad Alvi
  - Department of Physics, Banasthali Vidyapith, Tonk 304022, Rajasthan, India
- KC Santosh
  - Applied AI Research Lab, Vermillion, SD 57069, USA
  - Department of Computer Science, University of South Dakota, Vermillion, SD 57069, USA
6
Improving Classification Performance in Credit Card Fraud Detection by Using New Data Augmentation. AI 2023. [DOI: 10.3390/ai4010008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2023]
Abstract
In many industrialized and developing nations, credit cards are one of the most widely used methods of payment for online transactions. The invention of the credit card has streamlined, facilitated, and enhanced internet transactions. It has, however, also given criminals more opportunities to commit fraud, which has raised the rate of fraud. Credit card fraud has a concerning global impact; many businesses and ordinary users have lost millions of US dollars as a result. Because the number of transactions is large, many businesses and organizations rely heavily on machine learning techniques to automatically classify or identify fraudulent transactions. As the performance of machine learning techniques greatly depends on the quality of the training data, the imbalance in the data is not a trivial issue. In general, only a small percentage of the transactions in the data are fraudulent, which greatly affects the performance of machine learning classifiers. To deal with the rarity of fraudulent occurrences, this paper investigates a variety of data augmentation techniques to address the imbalanced data problem and introduces a new data augmentation model, K-CGAN, for credit card fraud detection. A number of the main classification techniques are then used to evaluate the performance of the augmentation techniques. The results show that B-SMOTE, K-CGAN, and SMOTE have the highest Precision and Recall compared with other augmentation methods. Among those, K-CGAN has the highest F1 Score and Accuracy.
7
Yan M, Hui SC, Li N. DML-PL: Deep Metric Learning Based Pseudo-Labeling Framework for Class Imbalanced Semi-Supervised Learning. Inf Sci (N Y) 2023. [DOI: 10.1016/j.ins.2023.01.074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/15/2023]
8
Han M, Guo H, Li J, Wang W. Global-local information based oversampling for multi-class imbalanced data. Int J Mach Learn Cyb 2022. [DOI: 10.1007/s13042-022-01746-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]
9
Balakrishnan V, Govindan V, Govaichelvan KN. Tamil Offensive Language Detection: Supervised versus Unsupervised Learning Approaches. ACM T Asian Low-Reso 2022. [DOI: 10.1145/3575860] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
Abstract
Studies on Natural Language Processing are mainly conducted in English, with very few exploring under-resourced languages, including the Dravidian languages. We present a novel work on detecting offensive language using a corpus of Tamil comments collected from YouTube. The study specifically aims to compare two machine learning approaches, supervised and unsupervised, for detecting offensive patterns in textual communications. In the first setup, offensive language detection models were developed using traditional machine learning algorithms such as Random Forest, Logistic Regression, Support Vector Machine and AdaBoost, and assessed based on human labeling. In the second setup, we used K-means (K = 2) to cluster the unlabeled data before training the same set of machine learning algorithms to detect offensive communications. Performance scores indicate that unsupervised clustering is more effective than human labeling, with ensemble classifiers achieving an accuracy of 99.70% and 99.87% for balanced and imbalanced datasets, respectively, showing that the unsupervised approach can be used effectively to detect offensive language in low-resourced languages.
10
HS-Gen: a hypersphere-constrained generation mechanism to improve synthetic minority oversampling for imbalanced classification. Complex Intell Syst 2022. [DOI: 10.1007/s40747-022-00938-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
Abstract
Mitigating the impact of class-imbalanced data on classifiers is a challenging task in machine learning. SMOTE is a well-known method that tackles this task by modifying the class distribution and generating synthetic instances. However, most SMOTE-based methods focus on the data selection phase, while few consider the data generation phase. This paper proposes a hypersphere-constrained generation mechanism (HS-Gen) to improve synthetic minority oversampling. Unlike the linear interpolation commonly used in SMOTE-based methods, HS-Gen generates a minority instance in a hypersphere rather than on a straight line. This mechanism expands the distribution range of minority instances with significant randomness and diversity. Furthermore, HS-Gen includes a noise prevention strategy that adaptively shrinks the hypersphere by determining whether new instances fall into the majority class region. HS-Gen can be regarded as an oversampling optimization mechanism and can be flexibly embedded into SMOTE-based methods. We conduct comparative experiments by embedding HS-Gen into the original SMOTE, Borderline-SMOTE, ADASYN, k-means SMOTE, and RSMOTE. Experimental results show that the embedded versions generate higher quality synthetic instances than the original ones. Moreover, on these oversampled datasets, the conventional classifiers (C4.5 and AdaBoost) obtain significant performance improvement in terms of F1 measure and G-mean.
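The contrast drawn in the abstract of entry 10, between generating a synthetic point on a straight line and generating it inside a hypersphere, can be sketched as follows. This is only an illustration under stated assumptions: the ball is centered at the base sample with radius equal to the distance to the chosen neighbor, and the noise prevention (adaptive shrinking) step is omitted. The abstract does not specify these details, and the function names are hypothetical.

```python
import math
import random


def smote_point(x, neighbor, rng):
    """Classic SMOTE step: a random point on the segment from x to neighbor."""
    lam = rng.random()
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


def hypersphere_point(x, neighbor, rng):
    """Assumed variant: a uniform random point in the ball around x
    whose radius is the distance from x to neighbor."""
    dim = len(x)
    radius = math.dist(x, neighbor)
    # A Gaussian vector gives a uniformly distributed direction.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(d * d for d in direction)) or 1.0
    # u ** (1 / dim) makes the point uniform over the ball's volume.
    scale = radius * rng.random() ** (1.0 / dim) / norm
    return [xi + scale * d for xi, d in zip(x, direction)]


rng = random.Random(0)
base, nearest = [0.0, 0.0], [1.0, 0.0]
print(smote_point(base, nearest, rng))        # lies on the segment
print(hypersphere_point(base, nearest, rng))  # lies anywhere in the disc
```

The segment-based point can only vary along one direction, whereas the ball-based point varies in every dimension, which is the added diversity the abstract refers to.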
11
Imbalanced binary classification under distribution uncertainty. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.11.063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
12
Dong Q, Zhou Y, Lian J, Li L. Online adaptive humidity monitoring method for proton exchange membrane fuel cell based on fuzzy C-means clustering and online sequence extreme learning machine. Electrochim Acta 2022. [DOI: 10.1016/j.electacta.2022.141059] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
13
Venkataramana L, Prasad DVV, Saraswathi S, Mithumary CM, Karthikeyan R, Monika N. Classification of COVID-19 from tuberculosis and pneumonia using deep learning techniques. Med Biol Eng Comput 2022; 60:2681-2691. [PMID: 35834050 PMCID: PMC9281341 DOI: 10.1007/s11517-022-02632-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2022] [Accepted: 07/05/2022] [Indexed: 12/02/2022]
Abstract
Deep learning allows the healthcare industry to analyse data at exceptional speed without compromising accuracy, which makes these techniques applicable to accurate and timely prediction in healthcare. Convolutional neural networks are a class of deep learning methods that have become dominant in various computer vision tasks and are attracting interest across a variety of domains, including radiology. Lung diseases such as tuberculosis (TB), bacterial and viral pneumonias, and COVID-19 are not predicted accurately because very few samples are available for some of these diseases. These diseases can be readily diagnosed from X-ray or CT scan images, but the number of images available is not equal across diseases, which makes the input data imbalanced. Conventional supervised machine learning methods do not achieve high accuracy when trained on a small number of COVID-19 samples. Image data augmentation artificially expands the size of a training dataset by creating modified versions of its images, and it helped reduce overfitting when training a deep neural network. The SMOTE (Synthetic Minority Oversampling Technique) algorithm is used to balance the classes. The novelty of this research is to combine data augmentation and class balancing before classifying tuberculosis, pneumonia, and COVID-19. The proposed multi-level classification achieved an accuracy of 97.4% for TB and pneumonia and 88% for bacterial, viral, and COVID-19 classifications, an improvement of roughly 8 to 10% in classification accuracy over existing methods in this area of research. The results indicate that the proposed system scales to growing medical data and classifies lung diseases and their sub-types in less time and with higher accuracy.
Affiliation(s)
- Lokeswari Venkataramana
  - Department of CSE, Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam, Chennai, India
- D. Venkata Vara Prasad
  - Department of CSE, Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam, Chennai, India
- S. Saraswathi
  - Department of CSE, Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam, Chennai, India
- C. M. Mithumary
  - Department of CSE, Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam, Chennai, India
- R. Karthikeyan
  - Department of CSE, Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam, Chennai, India
- N. Monika
  - Department of CSE, Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam, Chennai, India
14
A Tailored Particle Swarm and Egyptian Vulture Optimization-Based Synthetic Minority-Oversampling Technique for Class Imbalance Problem. Information 2022. [DOI: 10.3390/info13080386] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Class imbalance is one of the significant challenges in classification problems. The uneven distribution of data samples in different classes may occur due to human error, improper/unguided collection of data samples, etc. The uneven distribution of class samples among classes may affect the classification accuracy of the developed model. The main motivation behind this study is the design and development of methodologies for handling class imbalance problems. In this study, a new variant of the synthetic minority oversampling technique (SMOTE) has been proposed with the hybridization of particle swarm optimization (PSO) and Egyptian vulture (EV). The proposed method has been termed SMOTE-PSOEV in this study. The proposed method generates an optimized set of synthetic samples from traditional SMOTE and augments the five datasets for verification and validation. The SMOTE-PSOEV is then compared with existing SMOTE variants, i.e., Tomek Link, Borderline SMOTE1, Borderline SMOTE2, Distance SMOTE, and ADASYN. After data augmentation to the minority classes, the performance of SMOTE-PSOEV has been evaluated using support vector machine (SVM), Naïve Bayes (NB), and k-nearest-neighbor (k-NN) classifiers. The results illustrate that the proposed models achieved higher accuracy than existing SMOTE variants.
15
|
|
16
PF-SMOTE: A novel parameter-free SMOTE for imbalanced datasets. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.05.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
17
Huang ZA, Sang Y, Sun Y, Lv J. A Neural Network Learning Algorithm for Highly Imbalanced Data Classification. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.08.074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
18
SASMOTE: A Self-Attention Oversampling Method for Imbalanced CSI Fingerprints in Indoor Positioning Systems. Sensors 2022; 22:s22155677. [PMID: 35957237 PMCID: PMC9371244 DOI: 10.3390/s22155677] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/07/2022] [Revised: 07/04/2022] [Accepted: 07/07/2022] [Indexed: 01/27/2023]
Abstract
WiFi localization based on channel state information (CSI) fingerprints has become the mainstream method for indoor positioning due to the widespread deployment of WiFi networks, in which fingerprint database building is critical. However, issues, such as insufficient samples or missing data in the collection fingerprint database, result in unbalanced training data for the localization system during the construction of the CSI fingerprint database. To address the above issue, we propose a deep learning-based oversampling method, called Self-Attention Synthetic Minority Oversampling Technique (SASMOTE), for complementing the fingerprint database to improve localization accuracy. Specifically, a novel self-attention encoder-decoder is firstly designed to compress the original data dimensionality and extract rich features. The synthetic minority oversampling technique (SMOTE) is adopted to oversample minority class data to achieve data balance. In addition, we also construct the corresponding CSI fingerprinting dataset to train the model. Finally, extensive experiments are performed on different data to verify the performance of the proposed method. The results show that our SASMOTE method can effectively solve the data imbalance problem. Meanwhile, the improved location model, 1D-MobileNet, is tested on the balanced fingerprint database to further verify the excellent performance of our proposed methods.
19
Prasetiyowati MI, Maulidevi NU, Surendro K. The accuracy of Random Forest performance can be improved by conducting a feature selection with a balancing strategy. PeerJ Comput Sci 2022; 8:e1041. [PMID: 35875646 PMCID: PMC9299283 DOI: 10.7717/peerj-cs.1041] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 12/29/2021] [Accepted: 06/22/2022] [Indexed: 06/12/2023]
Abstract
One significant goal when building a model is to increase its accuracy within a shorter timeframe through feature selection. Feature selection is carried out by determining the importance of the available features in a dataset using Information Gain (IG). IG measures the amount of information each feature carries, and features with high values are selected to speed up the algorithm. To select informative features, IG relies on a threshold (cut-off) value. This research therefore aims to determine the time and accuracy obtained when feature selection is improved by integrating IG, the Fast Fourier Transform (FFT), and the Synthetic Minority Oversampling Technique (SMOTE). The feature selection model is then applied to Random Forest, a tree-based machine learning algorithm with random feature selection. A total of eight datasets, three balanced and five imbalanced, were used in this research. SMOTE was applied to the imbalanced datasets to balance the data. The results showed that feature selection using Information Gain, FFT, and SMOTE improved the accuracy of Random Forest.
Affiliation(s)
- Maria Irmina Prasetiyowati
  - Doctoral Program of Electrical Engineering and Informatics, School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Bandung, Jawa Barat, Indonesia
- Nur Ulfa Maulidevi
  - Department of Electrical Engineering and Informatics, School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Bandung, Jawa Barat, Indonesia
- Kridanto Surendro
  - Department of Electrical Engineering and Informatics, School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Bandung, Jawa Barat, Indonesia
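The threshold-based Information Gain selection described in the abstract of entry 19 can be illustrated with a small sketch. It assumes discrete feature values and a user-chosen cut-off; the FFT and SMOTE steps of the paper are not reproduced, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting
    the samples by the (discrete) feature value."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def select_features(columns, labels, threshold):
    """Keep the features whose information gain reaches the cut-off."""
    gains = {name: information_gain(values, labels)
             for name, values in columns.items()}
    selected = [name for name, gain in gains.items() if gain >= threshold]
    return selected, gains


labels = [1, 1, 0, 0]
columns = {
    "a": ["x", "x", "y", "y"],  # separates the classes perfectly
    "b": ["p", "q", "p", "q"],  # carries no information about the class
}
selected, gains = select_features(columns, labels, threshold=0.5)
print(selected, gains)
```

In this toy example only feature `a` survives the cut-off, since splitting on `b` leaves the class distribution unchanged.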
20
|
|
21
A novel synthetic minority oversampling technique based on relative and absolute densities for imbalanced classification. Appl Intell 2022. [DOI: 10.1007/s10489-022-03512-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
22
|
|
23
Wang X, Gong J, Song Y, Hu J. Adaptively weighted three-way decision oversampling: A cluster imbalanced-ratio based approach. Appl Intell 2022. [DOI: 10.1007/s10489-022-03394-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/28/2022]
24
Liu S. SMOTE-LMKNN: A Synthetic Minority Oversampling Technique Based on Local Means-Based k-Nearest Neighbor. Int J Pattern Recogn 2022. [DOI: 10.1142/s0218001422500197] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Traditional classifiers struggle with the class-imbalance problem because they are biased toward the majority class. Oversampling methods can improve imbalanced classification by creating synthetic minority class samples. Noise generation has been a great challenge in oversampling methods. Filtering-based and direction-change methods have been proposed against noise generation. Yet, the noise filters adopted in filtering-based methods are biased toward the majority class. Besides, the k-nearest neighbor (KNN)-based interpolation in filtering-based and direction-change methods is susceptible to abnormal samples (e.g. outliers, noise or unsafe borderline samples). To overcome noise generation while solving the above shortcomings of filtering-based and direction-change methods, this work presents a new synthetic minority oversampling technique based on local means-based KNN (SMOTE-LMKNN). In SMOTE-LMKNN, the local mean-based KNN (LMKNN) is first introduced to describe the local characteristics of imbalanced data. Second, a new LMKNN-based noise filter is proposed to remove noise and unsafe borderline samples. Third, interpolation between a base sample and its LMKNN is proposed to create synthetic minority class samples. Empirical results of extensive experiments with 18 data sets show that SMOTE-LMKNN is competitive with seven popular oversampling methods in training a KNN classifier and a classification and regression tree (CART).
Affiliation(s)
- Shuang Liu
  - College of General Education, Chongqing Industry Polytechnic College, Chongqing 401120, P. R. China
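The abstract of entry 24 builds on the local mean-based KNN (LMKNN) idea. As commonly described, the LMKNN classification rule takes, for each class, the k training samples of that class nearest to the query, averages them into a local mean, and assigns the class whose local mean is closest. The sketch below shows only that rule under this assumption; how SMOTE-LMKNN uses it for noise filtering and interpolation is not detailed in the abstract and is not reproduced here. The function name and toy data are hypothetical.

```python
import math


def local_mean_knn_predict(train, query, k):
    """train is a list of (vector, label) pairs. Return the label whose
    local mean (mean of the k same-class samples nearest to the query)
    lies closest to the query."""
    by_class = {}
    for vector, label in train:
        by_class.setdefault(label, []).append(vector)
    best_label, best_dist = None, float("inf")
    for label, points in by_class.items():
        nearest = sorted(points, key=lambda p: math.dist(p, query))[:k]
        local_mean = [sum(coords) / len(nearest) for coords in zip(*nearest)]
        dist = math.dist(local_mean, query)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Class "b" contains one outlier at [1, 1] that sits inside class "a".
train = [
    ([0.0, 0.0], "a"), ([1.0, 0.0], "a"), ([0.0, 1.0], "a"),
    ([4.0, 4.0], "b"), ([5.0, 4.0], "b"), ([4.0, 5.0], "b"),
    ([1.0, 1.0], "b"),
]
query = [0.6, 0.6]
print(local_mean_knn_predict(train, query, k=1))  # behaves like 1-NN
print(local_mean_knn_predict(train, query, k=3))  # averages out the outlier
```

With k = 1 the rule reduces to a plain nearest-neighbor decision and is misled by the outlier; with k = 3 the local mean of class "b" moves far from the query, which illustrates why local means are less sensitive to abnormal samples.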
25
Gupta S, Goel L, Singh A, Prasad A, Ullah MA. Psychological Analysis for Depression Detection from Social Networking Sites. Computational Intelligence and Neuroscience 2022; 2022:4395358. [PMID: 35432513 PMCID: PMC9007657 DOI: 10.1155/2022/4395358] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2021] [Revised: 02/28/2022] [Accepted: 03/24/2022] [Indexed: 11/23/2022]
Abstract
Rapid technological advancements are altering people's communication styles. With the growth of the Internet, social networks (Twitter, Facebook, Telegram, and Instagram) have become popular forums for people to share their thoughts, psychological behavior, and emotions. Psychological analysis analyzes text and extracts facts, features, and important information from the opinions of users. Researchers working on psychological analysis rely on social networks for the detection of depression-related behavior and activity. Social networks provide innumerable data on mindsets of a person's onset of depression, such as low sociology and activities such as undergoing medical treatment, a primary emphasis on oneself, and a high rate of activity during the day and night. In this paper, we used five machine learning classifiers (decision trees, K-nearest neighbor, support vector machines, logistic regression, and LSTM) for depression detection in tweets. The dataset is collected in two forms, balanced and imbalanced, and the effect of oversampling techniques is studied. The results show that the LSTM classification model outperforms the other baseline models in depression detection for both balanced and imbalanced data.
Affiliation(s)
- Sonam Gupta
  - Department of Computer Science and Engineering, Ajay Kumar Garg Engineering College, Ghaziabad, India
- Lipika Goel
  - Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad, India
- Arjun Singh
  - School of Computing and Information Technology, Manipal University Jaipur, Jaipur, India
- Ajay Prasad
  - University of Petroleum and Energy Studies, Dehradun, India
- Mohammad Aman Ullah
  - Department of Computer Science and Engineering, International Islamic University Chittagong, Chittagong, Bangladesh
26
|
|
27
RDPVR: Random Data Partitioning with Voting Rule for Machine Learning from Class-Imbalanced Datasets. Electronics 2022. [DOI: 10.3390/electronics11020228] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 12/22/2022]
Abstract
Since most classifiers are biased toward the dominant class, class imbalance is a challenging problem in machine learning. The most popular approaches to solving this problem include oversampling minority examples and undersampling majority examples. Oversampling may increase the probability of overfitting, whereas undersampling eliminates examples that may be crucial to the learning process. We present a linear time resampling method based on random data partitioning and a majority voting rule to address both concerns, where an imbalanced dataset is partitioned into a number of small subdatasets, each of which must be class balanced. After that, a specific classifier is trained for each subdataset, and the final classification result is established by applying the majority voting rule to the results of all of the trained models. We compared the performance of the proposed method to some of the most well-known oversampling and undersampling methods, employing a range of classifiers, on 33 benchmark machine learning class-imbalanced datasets. The classification results produced by the classifiers employed on the generated data by the proposed method were comparable to most of the resampling methods tested, with the exception of SMOTEFUNA, which is an oversampling method that increases the probability of overfitting. The proposed method produced results that were comparable to the Easy Ensemble (EE) undersampling method. As a result, for solving the challenge of machine learning from class-imbalanced datasets, we advocate using either EE or our method.
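The partition-and-vote scheme outlined in the abstract of entry 27 can be sketched as follows. The abstract does not say how the balanced subdatasets are formed or which classifier is used, so this sketch makes explicit assumptions: a binary problem, the shuffled majority class dealt into chunks that are each paired with all minority samples, and a simple nearest-centroid classifier per subdataset. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def balanced_partitions(X, y, rng):
    """Assumed partitioning: deal the shuffled majority samples into chunks
    of roughly minority size and pair each chunk with all minority samples."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    minor = [(x, label) for x, label in zip(X, y) if label == minority]
    major = [(x, label) for x, label in zip(X, y) if label != minority]
    rng.shuffle(major)
    n_parts = max(1, len(major) // len(minor))
    return [minor + major[i::n_parts] for i in range(n_parts)]


def fit_centroids(samples):
    """Nearest-centroid 'model': the mean vector of each class."""
    sums, counts = {}, Counter()
    for x, label in samples:
        counts[label] += 1
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}


def vote(models, x):
    """Majority vote over the per-partition nearest-centroid predictions."""
    predictions = [min(m, key=lambda label: math.dist(m[label], x))
                   for m in models]
    return Counter(predictions).most_common(1)[0][0]


# Toy data: 12 majority samples (class 0) and 3 minority samples (class 1).
X = [[i * 0.1, (i % 3) * 0.1] for i in range(12)]
X += [[5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y = [0] * 12 + [1] * 3

parts = balanced_partitions(X, y, random.Random(0))
models = [fit_centroids(part) for part in parts]
print(len(parts), vote(models, [0.2, 0.1]), vote(models, [4.9, 5.0]))
```

Each model sees every minority sample but only a slice of the majority class, so no majority sample is discarded overall and no synthetic sample is created, which are the two concerns the abstract raises about undersampling and oversampling.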
28
An extension of Synthetic Minority Oversampling Technique based on Kalman filter for imbalanced datasets. Machine Learning with Applications 2022. [DOI: 10.1016/j.mlwa.2022.100267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
29
Weighted oversampling algorithms for imbalanced problems and application in prediction of streamflow. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.107306] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/18/2022]