1
Mogollón Gutiérrez Ó, Sancho Núñez JC, Ávila M, Caro A. A detailed study of resampling algorithms for cyberattack classification in engineering applications. PeerJ Comput Sci 2024; 10:e1975. [PMID: 38660195 PMCID: PMC11041950 DOI: 10.7717/peerj-cs.1975] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Accepted: 03/11/2024] [Indexed: 04/26/2024]
Abstract
The evolution of engineering applications is highly relevant in the context of protecting industrial systems. As industries become increasingly interconnected, the need for robust cybersecurity measures becomes paramount. Engineering informatics not only provides tools for knowledge representation and extraction but also supports the development of sophisticated cybersecurity solutions. However, safeguarding industrial systems poses a unique challenge due to the inherent heterogeneity of data within these environments. In addition, datasets that simulate real cyberattacks within these diverse environments are highly imbalanced, often skewed towards certain types of traffic. This study proposes a system for addressing class imbalance in cybersecurity. To do this, three oversampling (SMOTE, Borderline1-SMOTE, and ADASYN) and five undersampling (random undersampling, cluster centroids, NearMiss, repeated edited nearest neighbor, and Tomek Links) methods are tested. In particular, these balancing algorithms are used to generate one-vs-rest binary models and to develop a two-stage classification system. By doing so, this study aims to enhance the efficacy of cybersecurity measures, ensuring a more comprehensive understanding of and defense against the diverse range of threats encountered in industrial environments. Experimental results demonstrate the effectiveness of the proposed system for cyberattack detection and classification among nine widely known cyberattacks.
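As a rough illustration of two ideas named in this abstract, the sketch below builds a one-vs-rest binary labeling and then applies random undersampling to balance it. It is not the authors' system: the class names (`normal`, `dos`, `fuzzers`), the toy features, and both helper functions are hypothetical, and the remaining resampling methods listed in the abstract are not shown.

```python
import random
from collections import Counter

def one_vs_rest_labels(labels, positive_class):
    """Relabel a multiclass problem as binary: 1 for the chosen class, 0 for the rest."""
    return [1 if y == positive_class else 0 for y in labels]

def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class keeps as many as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        for x in rng.sample(xs, n_min):
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

# Hypothetical, heavily imbalanced traffic labels (assumed for illustration only).
traffic = ["normal"] * 90 + ["dos"] * 7 + ["fuzzers"] * 3
features = [[float(i)] for i in range(len(traffic))]

binary = one_vs_rest_labels(traffic, "dos")       # 7 positives vs 93 negatives
bal_x, bal_y = random_undersample(features, binary, seed=42)
balanced_counts = Counter(bal_y)                  # both classes reduced to 7 samples
```

Undersampling discards majority-class data, which is one reason a study like this one would compare it against oversampling alternatives.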
Affiliation(s)
- Mar Ávila
- Escuela Politecnica, University of Extremadura, Cáceres, Cáceres, Spain
- Andrés Caro
- Escuela Politecnica, University of Extremadura, Cáceres, Cáceres, Spain
2
Dablain D, Krawczyk B, Chawla NV. DeepSMOTE: Fusing Deep Learning and SMOTE for Imbalanced Data. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2023; 34:6390-6404. [PMID: 35085094 DOI: 10.1109/tnnls.2021.3136503] [Citation(s) in RCA: 25] [Impact Index Per Article: 25.0] [Indexed: 06/14/2023]
Abstract
Despite over two decades of progress, imbalanced data is still considered a significant challenge for contemporary machine learning models. Modern advances in deep learning have further magnified the importance of the imbalanced data problem, especially when learning from images. Therefore, there is a need for an oversampling method that is specifically tailored to deep learning models, can work on raw images while preserving their properties, and is capable of generating high-quality, artificial images that can enhance minority classes and balance the training set. We propose DeepSMOTE, a novel oversampling algorithm for deep learning models that leverages the properties of the successful synthetic minority oversampling technique (SMOTE) algorithm. It is simple, yet effective in its design. It consists of three major components: 1) an encoder/decoder framework; 2) SMOTE-based oversampling; and 3) a dedicated loss function that is enhanced with a penalty term. An important advantage of DeepSMOTE over generative adversarial network (GAN)-based oversampling is that DeepSMOTE does not require a discriminator, and it generates high-quality artificial images that are both information-rich and suitable for visual inspection. DeepSMOTE code is publicly available at https://github.com/dd1github/DeepSMOTE.
3
Zhang B, Hu S, Li M. Comparative study of multiple machine learning algorithms for risk level prediction in goaf. Heliyon 2023; 9:e19092. [PMID: 37636440 PMCID: PMC10448475 DOI: 10.1016/j.heliyon.2023.e19092] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/24/2023] [Revised: 08/09/2023] [Accepted: 08/10/2023] [Indexed: 08/29/2023]
Abstract
With the acceleration of the mining process, the goaf has become one of the main sources of danger in underground mines, seriously threatening the safe production of mines. To make an accurate prediction of the risk level of the goaf quickly, this paper optimizes the features of the goaf by correlation analysis and feature importance and constructs a combination of feature parameters for the risk level prediction of the goaf to solve the problem of redundancy of evaluation indexes. Multiple machine learning algorithms are applied to 121 sets of goaf data respectively, and the optimal algorithm and the best combination of feature parameters are obtained by evaluating the mining area with multiple indicators such as accuracy and kappa coefficient. The best combination of feature parameters is ground-water, goaf layout, volume of goaf, goaf volume, span-height ratio, and mining disturbance, and the optimal algorithm is Extra Tree (ET), which handles the goaf risk level prediction problem with an accuracy of 94%. This model can be used to solve the problem of how to quickly and accurately predict the risk level of the goaf.
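The abstract evaluates models by accuracy and the kappa coefficient. A minimal sketch of how both can be computed from a confusion matrix follows, assuming the kappa in question is Cohen's kappa; the function name and the 2x2 example matrix are made up for illustration and are not data from the paper.

```python
def accuracy_and_kappa(confusion):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    Assumes the chance-agreement term is below 1, so kappa is defined.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical two-class confusion matrix.
example = [[20, 5],
           [10, 15]]
acc, kappa = accuracy_and_kappa(example)  # accuracy 0.7, kappa 0.4
```

Kappa discounts the agreement a classifier would reach by chance, which is why it is often reported next to accuracy when risk levels are unevenly represented.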
Affiliation(s)
- Bin Zhang
- School of Safety Science and Emergency Management, Wuhan University of Technology, Wuhan, Hubei, 430070, China
- Shaohua Hu
- School of Safety Science and Emergency Management, Wuhan University of Technology, Wuhan, Hubei, 430070, China
- Moxiao Li
- School of Safety Science and Emergency Management, Wuhan University of Technology, Wuhan, Hubei, 430070, China
4
Class-biased sarcasm detection using BiLSTM variational autoencoder-based synthetic oversampling. Soft comput 2023. [DOI: 10.1007/s00500-023-07956-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/09/2023]
5
Chourasia P, Ali S, Ciccolella S, Vedova GD, Patterson M. Reads2Vec: Efficient Embedding of Raw High-Throughput Sequencing Reads Data. JOURNAL OF COMPUTATIONAL BIOLOGY : A JOURNAL OF COMPUTATIONAL MOLECULAR CELL BIOLOGY 2023; 30:469-491. [PMID: 36730750 DOI: 10.1089/cmb.2022.0424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2023]
Abstract
The massive amount of genomic data appearing for SARS-CoV-2 since the beginning of the COVID-19 pandemic has challenged traditional methods for studying its dynamics. As a result, new methods such as Pangolin, which can scale to the millions of samples of SARS-CoV-2 currently available, have appeared. Such a tool is tailored to take as input assembled, aligned, and curated full-length sequences, such as those found in the GISAID database. As high-throughput sequencing technologies continue to advance, such assembly, alignment, and curation may become a bottleneck, creating a need for methods that can process raw sequencing reads directly. In this article, we propose Reads2Vec, an alignment-free embedding approach that can generate a fixed-length feature vector representation directly from the raw sequencing reads without requiring assembly. Furthermore, since such an embedding is a numerical representation, it may be applied to highly optimized classification and clustering algorithms. Experiments on simulated data show that our proposed embedding obtains better classification results and better clustering properties than existing alignment-free baselines. In a study on real data, we show that alignment-free embeddings have better clustering properties than the Pangolin tool and that the spike region of the SARS-CoV-2 genome heavily informs the alignment-free clusterings, which is consistent with current biological knowledge of SARS-CoV-2.
Affiliation(s)
- Prakash Chourasia
- Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Sarwan Ali
- Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Simone Ciccolella
- Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, Milan, Italy
- Gianluca Della Vedova
- Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, Milan, Italy
- Murray Patterson
- Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
6
Arafa A, El-Fishawy N, Badawy M, Radad M. RN-Autoencoder: Reduced Noise Autoencoder for classifying imbalanced cancer genomic data. J Biol Eng 2023; 17:7. [PMID: 36717866 PMCID: PMC9887895 DOI: 10.1186/s13036-022-00319-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/13/2022] [Accepted: 12/12/2022] [Indexed: 01/31/2023]
Abstract
BACKGROUND In the current genomic era, gene expression datasets have become one of the main tools utilized in cancer classification. Both curse of dimensionality and class imbalance problems are inherent characteristics of these datasets. These characteristics have a negative impact on the performance of most classifiers when used to classify cancer using genomic datasets. RESULTS This paper introduces Reduced Noise-Autoencoder (RN-Autoencoder) for pre-processing imbalanced genomic datasets for precise cancer classification. Firstly, RN-Autoencoder solves the curse of dimensionality problem by utilizing the autoencoder for feature reduction and hence generating new extracted data with lower dimensionality. In the next stage, RN-Autoencoder introduces the extracted data to the well-known Reduced Noise-Synthetic Minority Over Sampling Technique (RN-SMOTE), which efficiently solves the problem of class imbalance in the extracted data. RN-Autoencoder has been evaluated using different classifiers and various imbalanced datasets with different imbalance ratios. The results proved that the performance of the classifiers has been improved with RN-Autoencoder and outperformed the performance with original data and extracted data with percentages based on the classifier, dataset and evaluation metric. Also, the performance of RN-Autoencoder has been compared to the performance of the current state of the art and resulted in an increase of up to 18.017, 19.183, 18.58 and 8.87% in terms of test accuracy using colon, leukemia, Diffuse Large B-Cell Lymphoma (DLBCL) and Wisconsin Diagnostic Breast Cancer (WDBC) datasets respectively. CONCLUSION RN-Autoencoder is a model for cancer classification using imbalanced gene expression datasets. It utilizes the autoencoder to reduce the high dimensionality of the gene expression datasets and then handles the class imbalance using RN-SMOTE. RN-Autoencoder has been evaluated using many different classifiers and many different imbalanced datasets. The performance of many classifiers has improved and some have succeeded in classifying cancer with 100% performance in terms of all used metrics. In addition, RN-Autoencoder outperformed many recent works using the same datasets.
Affiliation(s)
- Ahmed Arafa
- Faculty of Electronic Engineering, Menoufia University, El-Gish Street, Box No. 32951, Menouf, Menoufia, Egypt
- Nawal El-Fishawy
- Faculty of Electronic Engineering, Menoufia University, El-Gish Street, Box No. 32951, Menouf, Menoufia, Egypt
- Mohammed Badawy
- Faculty of Electronic Engineering, Menoufia University, El-Gish Street, Box No. 32951, Menouf, Menoufia, Egypt
- Marwa Radad
- Faculty of Electronic Engineering, Menoufia University, El-Gish Street, Box No. 32951, Menouf, Menoufia, Egypt
7
Fu S, Tian Y, Tang J, Liu X. Cost-sensitive learning with modified Stein loss function. Neurocomputing 2023. [DOI: 10.1016/j.neucom.2023.01.052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/13/2023]
8
HS-Gen: a hypersphere-constrained generation mechanism to improve synthetic minority oversampling for imbalanced classification. COMPLEX INTELL SYST 2022. [DOI: 10.1007/s40747-022-00938-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
Abstract
Mitigating the impact of class-imbalanced data on classifiers is a challenging task in machine learning. SMOTE is a well-known method to tackle this task by modifying class distribution and generating synthetic instances. However, most of the SMOTE-based methods focus on the phase of data selection, while few consider the phase of data generation. This paper proposes a hypersphere-constrained generation mechanism (HS-Gen) to improve synthetic minority oversampling. Unlike linear interpolation commonly used in SMOTE-based methods, HS-Gen generates a minority instance in a hypersphere rather than on a straight line. This mechanism expands the distribution range of minority instances with significant randomness and diversity. Furthermore, HS-Gen is attached with a noise prevention strategy that adaptively shrinks the hypersphere by determining whether new instances fall into the majority class region. HS-Gen can be regarded as an oversampling optimization mechanism and flexibly embedded into the SMOTE-based methods. We conduct comparative experiments by embedding HS-Gen into the original SMOTE, Borderline-SMOTE, ADASYN, k-means SMOTE, and RSMOTE. Experimental results show that the embedded versions can generate higher quality synthetic instances than the original ones. Moreover, on these oversampled datasets, the conventional classifiers (C4.5 and Adaboost) obtain significant performance improvement in terms of F1 measure and G-mean.
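To make the contrast between line-based and hypersphere-based generation concrete, the sketch below places one synthetic point on the segment between a minority sample and a neighbor (the usual SMOTE interpolation) and another uniformly inside a ball around the sample. The abstract does not give the hypersphere's center or radius, so centering it on the seed sample with radius equal to the neighbor distance is an assumption, as are the function names; the noise prevention step is not shown.

```python
import math
import random

def smote_interpolate(x, neighbor, rng):
    """Standard SMOTE step: a random point on the segment from x to its neighbor."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

def sample_in_hypersphere(center, radius, rng):
    """Draw a point uniformly from the ball of the given radius around center.

    Assumed geometry for illustration; not necessarily the construction used by HS-Gen.
    """
    d = len(center)
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]   # random direction
    norm = math.sqrt(sum(v * v for v in direction))
    r = radius * rng.random() ** (1.0 / d)                # radius scaled for uniform volume
    return [c + r * v / norm for c, v in zip(center, direction)]

rng = random.Random(0)
x = [1.0, 2.0, 3.0]      # hypothetical minority sample
nb = [2.0, 4.0, 3.0]     # hypothetical minority-class neighbor
line_pt = smote_interpolate(x, nb, rng)
radius = math.dist(x, nb)
ball_pt = sample_in_hypersphere(x, radius, rng)
```

The interpolated point can only vary along one direction, while the ball sample can move in every feature dimension, which is the added diversity the abstract refers to.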
9
Class-imbalanced positive instances augmentation via three-line hybrid. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109902] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
10
El Moutaouakil K, Roudani M, El Ouissari A. Optimal Entropy Genetic Fuzzy-C-Means SMOTE (OEGFCM-SMOTE). Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.110235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/29/2022]
11
Barukab O, Ahmad A, Khan T, Thayyil Kunhumuhammed MR. Analysis of Parkinson's Disease Using an Imbalanced-Speech Dataset by Employing Decision Tree Ensemble Methods. Diagnostics (Basel) 2022; 12:diagnostics12123000. [PMID: 36553007 PMCID: PMC9776735 DOI: 10.3390/diagnostics12123000] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2022] [Revised: 11/07/2022] [Accepted: 11/24/2022] [Indexed: 12/05/2022]
Abstract
Parkinson's disease (PD) currently affects approximately 10 million people worldwide. The detection of PD-positive subjects is vital in terms of disease prognostics, diagnostics, management and treatment. Different types of early symptoms, such as speech impairment and changes in writing, are associated with Parkinson's disease. To classify potential patients of PD, many researchers used machine learning algorithms in various datasets related to this disease. In our research, we study the dataset of the PD vocal impairment feature, which is an imbalanced dataset. We propose comparative performance evaluation using various decision tree ensemble methods, with or without oversampling techniques. In addition, we compare the performance of classifiers with different sizes of ensembles and various ratios of the minority class and the majority class with oversampling and undersampling. Finally, we combine feature selection with best-performing ensemble classifiers. The result shows that AdaBoost, random forest, and decision tree developed for the RUSBoost imbalanced dataset perform well in performance metrics such as precision, recall, F1-score, area under the receiver operating characteristic curve (AUROC) and the geometric mean. Further, feature selection methods, namely lasso and information gain, were used to screen the 10 best features using the best ensemble classifiers. AdaBoost with information gain feature selection method is the best performing ensemble method with an F1-score of 0.903.
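Several of the metrics listed in this abstract follow directly from the four confusion-matrix counts. The sketch below computes precision, recall, F1-score, and the geometric mean (taken here as the square root of sensitivity times specificity, the usual definition in imbalanced learning) and includes plain accuracy for contrast. The counts are invented for illustration and do not come from the study.

```python
import math

def imbalance_metrics(tp, fp, fn, tn):
    """Compute common binary metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity, true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "g_mean": g_mean,
        "accuracy": accuracy,
    }

# Hypothetical counts: 10 minority-class cases among 100 subjects.
m = imbalance_metrics(tp=8, fp=4, fn=2, tn=86)
```

With these counts accuracy is 0.94 while F1 is about 0.73, which shows why accuracy alone can flatter a classifier on imbalanced data.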
Affiliation(s)
- Omar Barukab
- Department of Information Technology, Faculty of Computing and Information Technology in Rabigh (FCITR), King Abdulaziz University, Jeddah 21589, Saudi Arabia
- Amir Ahmad
- College of Information Technology, United Arab Emirates University, Al Ain P.O. Box 15551, United Arab Emirates
- Tabrej Khan
- Department of Information Systems, Faculty of Computing and Information Technology in Rabigh (FCITR), King Abdulaziz University, Jeddah 21589, Saudi Arabia
- Mujeeb Rahiman Thayyil Kunhumuhammed
- Department of Computer Science, Faculty of Computing and Information Technology in Rabigh (FCITR), King Abdulaziz University, Jeddah 21589, Saudi Arabia
12
Zhao W, Su Y, Hu M, Zhao H. Hybrid ResNet based on joint basic and attention modules for long-tailed classification. Int J Approx Reason 2022. [DOI: 10.1016/j.ijar.2022.08.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
13
Distance-based arranging oversampling technique for imbalanced data. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07828-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/14/2022]
14
Sedighi-Maman Z, Heath JJ. An Interpretable Two-Phase Modeling Approach for Lung Cancer Survivability Prediction. SENSORS (BASEL, SWITZERLAND) 2022; 22:6783. [PMID: 36146145 PMCID: PMC9503480 DOI: 10.3390/s22186783] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/23/2022] [Revised: 08/28/2022] [Accepted: 09/05/2022] [Indexed: 06/16/2023]
Abstract
Although lung cancer survival status and survival length predictions have primarily been studied individually, a scheme that leverages both fields in an interpretable way for physicians remains elusive. We propose a two-phase data analytic framework that is capable of classifying survival status for 0.5-, 1-, 1.5-, 2-, 2.5-, and 3-year time-points (phase I) and predicting the number of survival months within 3 years (phase II) using recent Surveillance, Epidemiology, and End Results data from 2010 to 2017. In this study, we employ three analytical models (general linear model, extreme gradient boosting, and artificial neural networks), five data balancing techniques (synthetic minority oversampling technique (SMOTE), relocating safe level SMOTE, borderline SMOTE, adaptive synthetic sampling, and majority weighted minority oversampling technique), two feature selection methods (least absolute shrinkage and selection operator (LASSO) and random forest), and the one-hot encoding approach. By implementing a comprehensive data preparation phase, we demonstrate that a computationally efficient and interpretable method such as GLM performs comparably to more complex models. Moreover, we quantify the effects of individual features in phase I and II by exploiting GLM coefficients. To the best of our knowledge, this study is the first to (a) implement a comprehensive data processing approach to develop performant, computationally efficient, and interpretable methods in comparison to black-box models, (b) visualize top factors impacting survival odds by utilizing the change in odds ratio, and (c) comprehensively explore short-term lung cancer survival using a two-phase approach.
Affiliation(s)
- Zahra Sedighi-Maman
- Robert B. Willumstad School of Business, Adelphi University, Garden City, NY 11530, USA
- Jonathan J. Heath
- McDonough School of Business, Georgetown University, Washington, DC 20057, USA
15
Tang M, Meng C, Wu H, Zhu H, Yi J, Tang J, Wang Y. Fault Detection for Wind Turbine Blade Bolts Based on GSG Combined with CS-LightGBM. SENSORS (BASEL, SWITZERLAND) 2022; 22:s22186763. [PMID: 36146110 PMCID: PMC9505918 DOI: 10.3390/s22186763] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/10/2022] [Revised: 08/24/2022] [Accepted: 08/25/2022] [Indexed: 05/27/2023]
Abstract
Aiming at the problem of class imbalance in the wind turbine blade bolts operation-monitoring dataset, a fault detection method for wind turbine blade bolts based on Gaussian Mixture Model-Synthetic Minority Oversampling Technique-Gaussian Mixture Model (GSG) combined with Cost-Sensitive LightGBM (CS-LightGBM) was proposed. Since it is difficult to obtain the fault samples of blade bolts, the GSG oversampling method was constructed to increase the fault samples in the blade bolt dataset. The method obtains the optimal number of clusters through the BIC criterion, and uses the GMM based on the optimal number of clusters to optimally cluster the fault samples in the blade bolt dataset. According to the density distribution of fault samples in inter-clusters, we synthesized new fault samples using SMOTE in an intra-cluster. This retains the distribution characteristics of the original fault class samples. Then, we used the GMM with the same initial cluster center to cluster the fault class samples that were added to new samples, and removed the synthetic fault class samples that were not clustered into the corresponding clusters. Finally, the synthetic data training set was used to train the CS-LightGBM fault detection model. Additionally, the hyperparameters of CS-LightGBM were optimized by the Bayesian optimization algorithm to obtain the optimal CS-LightGBM fault detection model. The experimental results show that compared with six models including SMOTE-LightGBM, CS-LightGBM, K-means-SMOTE-LightGBM, etc., the proposed fault detection model is superior to the other comparison methods in the false alarm rate, missing alarm rate and F1-score index. The method can well realize the fault detection of large wind turbine blade bolts.
Affiliation(s)
- Mingzhu Tang
- School of Energy and Power Engineering, Changsha University of Science & Technology, Changsha 410114, China
- Caihua Meng
- School of Energy and Power Engineering, Changsha University of Science & Technology, Changsha 410114, China
- Huawei Wu
- Hubei Key Laboratory of Power System Design and Test for Electrical Vehicle, Hubei University of Arts and Science, Xiangyang 441053, China
- Hongqiu Zhu
- School of Automation, Central South University, Changsha 410083, China
- Jiabiao Yi
- School of Energy and Power Engineering, Changsha University of Science & Technology, Changsha 410114, China
- Jun Tang
- School of Energy and Power Engineering, Changsha University of Science & Technology, Changsha 410114, China
- Yifan Wang
- School of Energy and Power Engineering, Changsha University of Science & Technology, Changsha 410114, China
16
Deng J, Zhang X, Li M, Jiang H, Chen Q. Feasibility study on Raman spectra-based deep learning models for monitoring the contamination degree and level of aflatoxin B1 in edible oil. Microchem J 2022. [DOI: 10.1016/j.microc.2022.107613] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/28/2022]
17
A Tailored Particle Swarm and Egyptian Vulture Optimization-Based Synthetic Minority-Oversampling Technique for Class Imbalance Problem. INFORMATION 2022. [DOI: 10.3390/info13080386] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Class imbalance is one of the significant challenges in classification problems. The uneven distribution of data samples in different classes may occur due to human error, improper/unguided collection of data samples, etc. The uneven distribution of class samples among classes may affect the classification accuracy of the developed model. The main motivation behind this study is the design and development of methodologies for handling class imbalance problems. In this study, a new variant of the synthetic minority oversampling technique (SMOTE) has been proposed with the hybridization of particle swarm optimization (PSO) and Egyptian vulture (EV). The proposed method has been termed SMOTE-PSOEV in this study. The proposed method generates an optimized set of synthetic samples from traditional SMOTE and augments the five datasets for verification and validation. The SMOTE-PSOEV is then compared with existing SMOTE variants, i.e., Tomek Link, Borderline SMOTE1, Borderline SMOTE2, Distance SMOTE, and ADASYN. After data augmentation to the minority classes, the performance of SMOTE-PSOEV has been evaluated using support vector machine (SVM), Naïve Bayes (NB), and k-nearest-neighbor (k-NN) classifiers. The results illustrate that the proposed models achieved higher accuracy than existing SMOTE variants.
18
Li X, Kong K, Shen H, Wei Z, Liao X. Intrusion detection method based on imbalanced learning classification. J EXP THEOR ARTIF IN 2022. [DOI: 10.1080/0952813x.2022.2104384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/16/2022]
Affiliation(s)
- Xiangjun Li
- School of Software, Nanchang University, Nanchang, China
- Department of Computer Science and Technology, Nanchang University, Nanchang, China
- Ke Kong
- Department of Computer Science and Technology, Nanchang University, Nanchang, China
- Faculty of Computing, Harbin Institute of Technology, Harbin, China
- Hua Shen
- School of Management, Nanchang University, Nanchang, China
- Zhixiang Wei
- School of Software, Nanchang University, Nanchang, China
- Xiaofeng Liao
- School of Management, Nanchang University, Nanchang, China
19
An Oversampling Method of Unbalanced Data for Mechanical Fault Diagnosis Based on MeanRadius-SMOTE. SENSORS 2022; 22:s22145166. [PMID: 35890845 PMCID: PMC9324964 DOI: 10.3390/s22145166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2022] [Revised: 06/26/2022] [Accepted: 07/08/2022] [Indexed: 11/28/2022]
Abstract
With the development of machine learning, data-driven mechanical fault diagnosis methods have been widely used in the field of prognostics and health management (PHM). Because fault data are limited in quantity, handling unbalanced data sets is a difficult problem for fault diagnosis. Under unbalanced data sets, faults with little historical data are always difficult to diagnose and lead to economic losses. In order to improve the prediction accuracy under unbalanced data sets, this paper proposes MeanRadius-SMOTE based on the traditional SMOTE oversampling algorithm, which effectively avoids the generation of useless samples and noise samples. This paper validates the effectiveness of the algorithm on three linear unbalanced data sets and four step unbalanced data sets. Experimental results show that MeanRadius-SMOTE outperforms SMOTE and LR-SMOTE in various evaluation indicators and is more robust to different imbalance rates. In addition, MeanRadius-SMOTE can take into account the prediction accuracy of both the overall data and the minority class, which is of great significance for engineering applications.
20
Keskes N, Fakhfakh S, Kanoun O, Derbel N. Representativeness consideration in the selection of classification algorithms for the ECG signal quality assessment. Biomed Signal Process Control 2022. [DOI: 10.1016/j.bspc.2022.103686] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/02/2022]
21
Wei G, Mu W, Song Y, Dou J. An improved and random synthetic minority oversampling technique for imbalanced data. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108839] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/30/2022]
22
Han F, Zhu S, Ling Q, Han H, Li H, Guo X, Cao J. Gene-CWGAN: a data enhancement method for gene expression profile based on improved CWGAN-GP. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07417-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 10/18/2022]
23
Xu D, Zhang Z, Shi J. A New Multi-Sensor Stream Data Augmentation Method for Imbalanced Learning in Complex Manufacturing Process. SENSORS (BASEL, SWITZERLAND) 2022; 22:4042. [PMID: 35684662 PMCID: PMC9185280 DOI: 10.3390/s22114042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2022] [Revised: 05/23/2022] [Accepted: 05/24/2022] [Indexed: 06/15/2023]
Abstract
Multiple sensors are often mounted in a complex manufacturing process to detect failures. Due to the high reliability of modern manufacturing processes, failures only happen occasionally. Therefore, data collected in practical manufacturing processes are extremely imbalanced, which often biases supervised learning models. Data collected by the multiple sensors can be regarded as multivariate time series or multi-sensor stream data. The high dimension of multi-sensor stream data makes building models even more challenging. In this study, a new and easy-to-apply data augmentation approach, namely, imbalanced multi-sensor stream data augmentation (IMSDA), is proposed for imbalanced learning. IMSDA can generate high-quality failure data for all dimensions. The generated data preserve temporal properties similar to those of the original multivariate time series. Both raw data and generated data are used to train the failure detection models, but the models are tested by the same real dataset. The proposed method is applied to a real-world industry case. Results show that IMSDA can not only obtain good quality failure data to reduce the imbalance level but also significantly improve the performance of supervised failure detection models.
Affiliation(s)
- Dongting Xu
  - School of Mechanical Engineering, Southeast University, Nanjing 211189, China; (D.X.); (Z.Z.)
  - School of Mechanical Engineering, Nanjing Institute of Technology, Nanjing 211167, China
- Zhisheng Zhang
  - School of Mechanical Engineering, Southeast University, Nanjing 211189, China; (D.X.); (Z.Z.)
- Jinfei Shi
  - School of Mechanical Engineering, Southeast University, Nanjing 211189, China; (D.X.); (Z.Z.)
  - School of Mechanical Engineering, Nanjing Institute of Technology, Nanjing 211167, China
|
24
|
Sun Z, Jiang A, Wang G, Zhang M, Yan H. Feature Optimization Method of Material Identification for Loose Particles Inside Sealed Relays. SENSORS 2022; 22:s22093566. [PMID: 35591257 PMCID: PMC9102643 DOI: 10.3390/s22093566] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2022] [Revised: 05/04/2022] [Accepted: 05/05/2022] [Indexed: 12/10/2022]
Abstract
Existing material identification for loose particles inside sealed relays focuses on the selection and optimization of classification algorithms, which ignores the features in the material dataset. In this paper, we propose a feature optimization method of material identification for loose particles inside sealed relays. First, for the missing value problem, multiple methods were used to process the material dataset. By comparing the identification accuracy achieved by a Random-Forest-based classifier (RF classifier) on the different processed datasets, the optimal direct-discarding method was obtained. Second, for the uneven data distribution problem, multiple methods were used to process the material dataset. By comparing the achieved identification accuracy, the optimal min–max standardization method was obtained. Then, for the feature selection problem, an innovative multi-index–fusion feature selection method was designed, and its superiority was verified through several tests. Test results show that the identification accuracy achieved by the RF classifier on the dataset was improved from 59.63% to 63.60%. Test results of ten material verification datasets show that the identification accuracies achieved by the RF classifier were greatly improved, with an average improvement of 3.01%. This strongly promotes research progress in loose particle material identification and is an important supplement to existing loose particle detection research. This is also the highest loose particle material identification accuracy achieved to date in aerospace engineering, which has important practical value for improving the reliability of aerospace systems. Theoretically, it can be applied to feature optimization in machine learning.
Affiliation(s)
- Zhigang Sun
  - Electronic Engineering College, Heilongjiang University, Harbin 150080, China; (Z.S.); (A.J.); (M.Z.)
  - Reliability Institute for Electric Apparatus and Electronics, Harbin Institute of Technology, Harbin 150001, China;
- Aiping Jiang
  - Electronic Engineering College, Heilongjiang University, Harbin 150080, China; (Z.S.); (A.J.); (M.Z.)
- Guotao Wang
  - Electronic Engineering College, Heilongjiang University, Harbin 150080, China; (Z.S.); (A.J.); (M.Z.)
  - Reliability Institute for Electric Apparatus and Electronics, Harbin Institute of Technology, Harbin 150001, China;
  - Correspondence:
- Min Zhang
  - Electronic Engineering College, Heilongjiang University, Harbin 150080, China; (Z.S.); (A.J.); (M.Z.)
- Huizhen Yan
  - Reliability Institute for Electric Apparatus and Electronics, Harbin Institute of Technology, Harbin 150001, China;
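The min–max standardization that the abstract of entry 24 reports as the best-performing scaling step can be illustrated with a minimal sketch. The function name `min_max_scale`, the handling of constant columns, and the sample values are assumptions made for illustration; none of them come from the paper.

```python
def min_max_scale(column):
    """Linearly rescale a list of numbers so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant feature has no spread, so map every value to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical feature column, not data from the paper.
feature = [2.0, 4.0, 10.0]
scaled = min_max_scale(feature)
print(scaled)
```

Scaling each feature to a common range keeps features with large raw magnitudes from dominating distance-based or gradient-based learners; tree ensembles such as Random Forest are usually less sensitive to it, so any gain there is an empirical question.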
|
25
|
Huang K, Wang X. CCR-GSVM: A boundary data generation algorithm for support vector machine in imbalanced majority noise problem. APPL INTELL 2022. [DOI: 10.1007/s10489-022-03408-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
|
26
|
An Oversampling Method for Class Imbalance Problems on Large Datasets. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12073424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023]
Abstract
Several oversampling methods have been proposed for solving the class imbalance problem. However, most of them require searching the k-nearest neighbors to generate synthetic objects. This requirement makes them time-consuming and therefore unsuitable for large datasets. In this paper, an oversampling method for large class imbalance problems that does not require the k-nearest neighbors’ search is proposed. According to our experiments on large datasets with different sizes of imbalance, the proposed method is at least twice as fast as the fastest method reported in the literature while obtaining similar oversampling quality.
|
27
|
|
28
|
Xiao C, Guo Y, Zhao K, Liu S, He N, He Y, Guo S, Chen Z. Prognostic Value of Machine Learning in Patients with Acute Myocardial Infarction. J Cardiovasc Dev Dis 2022; 9:jcdd9020056. [PMID: 35200709 PMCID: PMC8880640 DOI: 10.3390/jcdd9020056] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 11/29/2021] [Revised: 01/27/2022] [Accepted: 02/05/2022] [Indexed: 01/09/2023] Open
Abstract
(1) Background: Patients with acute myocardial infarction (AMI) still experience many major adverse cardiovascular events (MACEs), including myocardial infarction, heart failure, kidney failure, coronary events, cerebrovascular events, and death. This retrospective study aims to assess the prognostic value of machine learning (ML) for the prediction of MACEs. (2) Methods: Five-hundred patients diagnosed with AMI and who had undergone successful percutaneous coronary intervention were included in the study. Logistic regression (LR) analysis was used to assess the relevance of MACEs and 24 selected clinical variables. Six ML models were developed with five-fold cross-validation in the training dataset and their ability to predict MACEs was compared to LR with the testing dataset. (3) Results: The MACE rate was calculated as 30.6% after a mean follow-up of 1.42 years. Killip classification (Killip IV vs. I class, odds ratio 4.386, 95% confidence interval 1.943–9.904), drug compliance (irregular vs. regular compliance, 3.06, 1.721–5.438), age (per year, 1.025, 1.006–1.044), and creatinine (1 µmol/L, 1.007, 1.002–1.012) and cholesterol levels (1 mmol/L, 0.708, 0.556–0.903) were independent predictors of MACEs. In the training dataset, the best performing model was the random forest (RDF) model with an area under the curve of (0.749, 0.644–0.853) and accuracy of (0.734, 0.647–0.820). In the testing dataset, the RDF showed the most significant survival difference (log-rank p = 0.017) in distinguishing patients with and without MACEs. (4) Conclusions: The RDF model has been identified as superior to other models for MACE prediction in this study. ML methods can be promising for improving optimal predictor selection and clinical outcomes in patients with AMI.
Affiliation(s)
- Changhu Xiao
  - Hunan Key Laboratory of Biomedical Nanomaterials and Devices, Hunan University of Technology, Zhuzhou 412007, China; (C.X.); (K.Z.); (S.L.); (N.H.)
- Yuan Guo
  - Hunan Key Laboratory of Biomedical Nanomaterials and Devices, Hunan University of Technology, Zhuzhou 412007, China; (C.X.); (K.Z.); (S.L.); (N.H.)
  - Department of Cardiovascular Medicine, Zhuzhou Hospital Affiliated to Xiangya School of Medicine, Central South University, Zhuzhou 412007, China; (Y.H.); (S.G.)
  - Department of Cardiovascular Medicine, Xiangya Hospital, Central South University, Changsha 410008, China
  - Correspondence: (Y.G.); (Z.C.)
- Kaixuan Zhao
  - Hunan Key Laboratory of Biomedical Nanomaterials and Devices, Hunan University of Technology, Zhuzhou 412007, China; (C.X.); (K.Z.); (S.L.); (N.H.)
- Sha Liu
  - Hunan Key Laboratory of Biomedical Nanomaterials and Devices, Hunan University of Technology, Zhuzhou 412007, China; (C.X.); (K.Z.); (S.L.); (N.H.)
- Nongyue He
  - Hunan Key Laboratory of Biomedical Nanomaterials and Devices, Hunan University of Technology, Zhuzhou 412007, China; (C.X.); (K.Z.); (S.L.); (N.H.)
- Yi He
  - Department of Cardiovascular Medicine, Zhuzhou Hospital Affiliated to Xiangya School of Medicine, Central South University, Zhuzhou 412007, China; (Y.H.); (S.G.)
- Shuhong Guo
  - Department of Cardiovascular Medicine, Zhuzhou Hospital Affiliated to Xiangya School of Medicine, Central South University, Zhuzhou 412007, China; (Y.H.); (S.G.)
- Zhu Chen
  - Hunan Key Laboratory of Biomedical Nanomaterials and Devices, Hunan University of Technology, Zhuzhou 412007, China; (C.X.); (K.Z.); (S.L.); (N.H.)
  - Correspondence: (Y.G.); (Z.C.)
|
29
|
|
30
|
Zhao Y, Liang J, Chen L, Wang Y, Gong J. Evaluation and prediction of free driving behavior type based on fuzzy comprehensive support vector machine. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2022. [DOI: 10.3233/jifs-201680] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/15/2022]
Abstract
Driving behavior type is a hot topic in the transportation field, but there have been few studies on free driving behavior type. Current driving behavior evaluation models rely on a single factor and adapt poorly to the environment, so driving behavior type is difficult to predict accurately. In addition, free driving behavior, as one of the important driving operation behaviors, lacks quantitative assessment methods and models. In view of these deficiencies, evaluation and prediction of free driving behavior based on Fuzzy Comprehensive Support Vector Machine (FC-SVM) is proposed. Firstly, a variety of individual decision-making behavior data obfuscating with environmental complexity are collected. These parameters were used as FC multi-factor evaluation parameters to quantitatively evaluate free driving behavior from multiple aspects, and to qualitatively derive the driver’s driving behavior type. Further, the SVM used the RBF kernel function to obtain the optimal parameters and train the SVM network, and the resulting SVM model was used for short-term prediction of driving behavior type. The results of simulations using different methods show that the SD value of FC-SVM evaluation results is the lowest, only 1.273. Compared with other common methods, its MacroP reaches 89.2%. It is interesting to find that aggressive driving can be more distinct from other behavior types. Moreover, the mixed traffic flow composed of aggressive drivers has a higher traffic efficiency in basic sections. This work is of great value for improving driving behavior, reducing road congestion and improving road traffic efficiency in mixed intelligent traffic.
Affiliation(s)
- Yucheng Zhao
  - Automotive Engineering Research Institute, Jiangsu University, Zhenjiang, Jiangsu, China
- Jun Liang
  - Automotive Engineering Research Institute, Jiangsu University, Zhenjiang, Jiangsu, China
- Long Chen
  - Automotive Engineering Research Institute, Jiangsu University, Zhenjiang, Jiangsu, China
- Yafei Wang
  - School of Mechanical and Power Engineering, Shanghai Jiaotong University, Shanghai, China
- Jinfeng Gong
  - China Automotive Technology Research Center Co., Ltd, Tianjin, China
|
31
|
Kokol P, Kokol M, Zagoranski S. Machine learning on small size samples: A synthetic knowledge synthesis. Sci Prog 2022; 105:368504211029777. [PMID: 35220816 PMCID: PMC10358596 DOI: 10.1177/00368504211029777] [Citation(s) in RCA: 58] [Impact Index Per Article: 29.0] [Indexed: 08/10/2023]
Abstract
Machine Learning is an increasingly important technology dealing with the growing complexity of the digitalised world. Despite the fact that we live in a 'Big data' world where almost 'everything' is digitally stored, there are many real-world situations where researchers are still faced with small data samples. The present bibliometric knowledge synthesis study aims to answer the research question 'What is the small data problem in machine learning and how is it solved?' The analysis showed a positive trend in the number of research publications and substantial growth of the research community, indicating that the research field is reaching maturity. The most productive countries are China, the United States and the United Kingdom. Despite notable international cooperation, a regional concentration of research literature production in economically more developed countries was observed. Thematic analysis identified four research themes. The themes are concerned with dimension reduction in complex big data analysis, data augmentation techniques in deep learning, data mining and statistical learning on small datasets.
Affiliation(s)
- Peter Kokol
  - Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia
|
32
|
An extension of Synthetic Minority Oversampling Technique based on Kalman filter for imbalanced datasets. MACHINE LEARNING WITH APPLICATIONS 2022. [DOI: 10.1016/j.mlwa.2022.100267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022] Open
|
33
|
Mayabadi S, Saadatfar H. Two density-based sampling approaches for imbalanced and overlapping data. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108217] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 01/21/2023]
|
34
|
Wang S, Dai Y, Shen J, Xuan J. Research on expansion and classification of imbalanced data based on SMOTE algorithm. Sci Rep 2021; 11:24039. [PMID: 34912009 PMCID: PMC8674253 DOI: 10.1038/s41598-021-03430-5] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 08/11/2021] [Accepted: 11/22/2021] [Indexed: 11/09/2022] Open
Abstract
With the development of artificial intelligence, big data classification technology provides valuable support for medical auxiliary diagnosis research. However, because samples are collected under different conditions, medical big data are often imbalanced. The class-imbalance problem has been reported as a serious obstacle to the classification performance of many standard learning algorithms. The SMOTE algorithm can be used to generate sample points randomly to improve the imbalance rate, but its application is affected by marginalized sample generation and blind parameter selection. To address this problem, an improved SMOTE algorithm based on the normal distribution is proposed in this paper, so that new sample points are distributed closer to the center of the minority samples with a higher probability, avoiding the marginalization of the expanded data. Experiments show that the classification performance is better when the proposed algorithm, rather than the original SMOTE algorithm, is used to expand the imbalanced Pima, WDBC, WPBC, Ionosphere and Breast-cancer-wisconsin datasets. In addition, the parameter selection of the proposed algorithm is analyzed, and it is found that, in the designed experiments, the classification performance is best when the parameters are chosen so that the distribution characteristics of the original data are best preserved.
Affiliation(s)
- Shujuan Wang
  - College of Mathematical Sciences, Harbin Engineering University, Harbin, 150001, China
- Yuntao Dai
  - College of Mathematical Sciences, Harbin Engineering University, Harbin, 150001, China
- Jihong Shen
  - College of Mathematical Sciences, Harbin Engineering University, Harbin, 150001, China
- Jingxue Xuan
  - College of Science, Qiqihar University, Qiqihar, 161006, China.
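For orientation, the original SMOTE step that entry 34 sets out to improve can be sketched as follows. This is the classic uniform-interpolation baseline only, not the normal-distribution variant proposed in the paper, whose details the abstract does not give. The function name `smote_sample`, its parameters, and the toy points are assumptions for illustration.

```python
import math
import random


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point on the segment between a minority sample and a near neighbour."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Rank the other minority points by Euclidean distance to the chosen base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(base, p))
    neighbour = rng.choice(others[:k])
    # Classic SMOTE draws the interpolation gap uniformly from [0, 1].
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


# Hypothetical minority-class points (at least two are required).
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
synthetic = smote_sample(minority_points, k=2, rng=random.Random(0))
print(synthetic)
```

Because the gap is uniform, synthetic points are as likely to fall near the far neighbour as near the base point; this is one plausible reading of the "marginalization" that the abstract says the improved variant is designed to avoid.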
|
35
|
Bi W, Zhang Q. Forecasting mergers and acquisitions failure based on partial-sigmoid neural network and feature selection. PLoS One 2021; 16:e0259575. [PMID: 34788332 PMCID: PMC8598039 DOI: 10.1371/journal.pone.0259575] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2021] [Accepted: 10/22/2021] [Indexed: 11/19/2022] Open
Abstract
Traditional forecasting methods in mergers and acquisitions (M&A) data have two limitations that significantly reduce forecasting accuracy: (1) the imbalance of data, that is, the failure cases of M&A are far fewer than the successful cases (82%/18% of our sample), and (2) both the bidder and the target of the merger have numerous descriptive features, making it difficult to choose which ones to use for forecasting. This study proposes a neural network using partial-sigmoid (i.e., partial-sigmoid neural network [PSNN]) as the activation function of the output layer and compares three feature selection methods, namely, chi-square (chi2) test, information gain and gradient boosting decision tree (GBDT). Experimental results show that our PSNN (improved up to 0.37 precision, 0.49 recall, 0.41 G-Mean and 0.23 F1-measure) and feature selection method (improved 1.83%-13.16% accuracy) can effectively mitigate the adverse effects of the two data limitations above on forecasting. Scholars who studied the forecast of merger failure have overlooked three important features: assets of the previous year, market value and capital expenditure. The chi2 test feature selection method is the best among the three feature selection methods.
Affiliation(s)
- Wenbin Bi
  - School of Economics and Management, Beijing Jiaotong University, Beijing, China
- Qiusheng Zhang
  - School of Economics and Management, Beijing Jiaotong University, Beijing, China
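The imbalance-aware metrics reported in entry 35 (precision, recall, G-Mean, F1-measure) can be computed from a binary confusion matrix as sketched below. The sketch uses the common definition of G-Mean as the geometric mean of sensitivity and specificity, which is an assumption since the abstract does not state the formula. The function name and the counts are hypothetical, and the code assumes no denominator is zero.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Compute metrics for the positive (minority) class from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity
    specificity = tn / (tn + fp)
    # Assumed definition: geometric mean of sensitivity and specificity.
    g_mean = math.sqrt(recall * specificity)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "g_mean": g_mean, "f1": f1}


# Hypothetical counts for 18 positive and 82 negative cases, not results from the paper.
metrics = imbalance_metrics(tp=9, fp=3, tn=79, fn=9)
print(metrics)
```

With these counts, accuracy would be 88% even though half of the minority cases are missed, which is why recall, G-Mean and F1 are usually reported alongside accuracy for imbalanced data.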
|
36
|
Wu J, Shen J, Xu M, Shao M. A novel combined dynamic ensemble selection model for imbalanced data to detect COVID-19 from complete blood count. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2021; 211:106444. [PMID: 34614451 PMCID: PMC8479386 DOI: 10.1016/j.cmpb.2021.106444] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/27/2021] [Accepted: 09/22/2021] [Indexed: 06/01/2023]
Abstract
BACKGROUND As blood testing is radiation-free, low-cost and simple to operate, some researchers use machine learning to detect COVID-19 from blood test data. However, few studies take into consideration the imbalanced data distribution, which can impair the performance of a classifier. METHOD A novel combined dynamic ensemble selection (DES) method is proposed for imbalanced data to detect COVID-19 from complete blood count. This method combines data preprocessing and improved DES. Firstly, we use the hybrid synthetic minority over-sampling technique and edited nearest neighbor (SMOTE-ENN) to balance data and remove noise. Secondly, in order to improve the performance of DES, a novel hybrid multiple clustering and bagging classifier generation (HMCBCG) method is proposed to reinforce the diversity and local regional competence of candidate classifiers. RESULTS The experimental results based on three popular DES methods show that HMCBCG performs better than bagging alone. HMCBCG+KNE obtains the best performance for COVID-19 screening with 99.81% accuracy, 99.86% F1, 99.78% G-mean and 99.81% AUC. CONCLUSION Compared to other advanced methods, our combined DES model can improve accuracy, G-mean, F1 and AUC of COVID-19 screening.
Affiliation(s)
- Jiachao Wu
  - College of Management and Economics, Tianjin University, Tianjin, 300072, China
- Jiang Shen
  - College of Management and Economics, Tianjin University, Tianjin, 300072, China
- Man Xu
  - Business School, Nankai University, Tianjin, 300071, China
- Minglai Shao
  - School of New Media and Communication, Tianjin University, Tianjin, 300072, China.
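The noise-removal half of the SMOTE-ENN step named in entry 36 can be sketched as a Wilson-style edited nearest neighbour filter: a sample is dropped when its label disagrees with the majority label of its k nearest neighbours. This is a minimal sketch applied to all classes; the paper's actual configuration is not given in the abstract, and the function name and toy data are assumptions.

```python
import math
from collections import Counter


def edited_nearest_neighbours(points, labels, k=3):
    """Return indices of samples whose label matches the majority label of their k nearest neighbours."""
    kept = []
    for i, (p, y) in enumerate(zip(points, labels)):
        # Indices of all other samples, ordered by Euclidean distance to p.
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        votes = Counter(labels[j] for j in order[:k])
        # With two classes and odd k there are no ties.
        if votes.most_common(1)[0][0] == y:
            kept.append(i)
    return kept


# Hypothetical 1-D data: the class-1 point at 0.15 sits inside the class-0 cluster.
points = [[0.0], [0.1], [0.2], [0.15], [1.0], [1.1], [1.2]]
labels = [0, 0, 0, 1, 1, 1, 1]
kept = edited_nearest_neighbours(points, labels, k=3)
print(kept)
```

In a SMOTE-ENN pipeline this filter would run after oversampling, so that synthetic or original points that land deep inside the other class are removed before training.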
|
37
|
Sun Z, Gao M, Wang G, Lv B, He C, Teng Y. Research on Evaluating the Filtering Method for Broiler Sound Signal from Multiple Perspectives. Animals (Basel) 2021; 11:ani11082238. [PMID: 34438695 PMCID: PMC8388365 DOI: 10.3390/ani11082238] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/18/2021] [Revised: 07/23/2021] [Accepted: 07/27/2021] [Indexed: 11/16/2022] Open
Abstract
Broiler sounds can provide feedback on their own body condition, to a certain extent. Aiming at the noise in the sound signals collected in broiler farms, research on evaluating the filtering methods for broiler sound signals from multiple perspectives is proposed, and the best performer can be obtained for broiler sound signal filtering. Multiple perspectives include the signal angle and the recognition angle, which are embodied in three indicators: signal-to-noise ratio (SNR), root mean square error (RMSE), and prediction accuracy. The signal filtering methods used in this study include Basic Spectral Subtraction, Improved Spectral Subtraction based on multi-taper spectrum estimation, Wiener filtering and Sparse Decomposition using both thirty atoms and fifty atoms. In analysis of the signal angle, Improved Spectral Subtraction based on multi-taper spectrum estimation achieved the highest average SNR of 5.5145 and achieved the smallest average RMSE of 0.0508. In analysis of the recognition angle, the kNN classifier and Random Forest classifier achieved the highest average prediction accuracy on the data set established from the sound signals filtered by Wiener filtering, which were 88.83% and 88.69%, respectively. These are significantly higher than those obtained by classifiers on data sets established from sound signals filtered by other methods. Further research shows that after removing the starting noise in the sound signal, Wiener filtering achieved the highest average SNR of 5.6108 and a new RMSE of 0.0551. Finally, in comprehensive analysis of both the signal angle and the recognition angle, this research determined that Wiener filtering is the best broiler sound signal filtering method. This research lays the foundation for follow-up research on extracting classification features from high-quality broiler sound signals to realize broiler health monitoring. 
At the same time, the research results can be popularized and applied to studies on the detection and processing of livestock and poultry sound signals, which has extremely important reference and practical value.
Affiliation(s)
- Zhigang Sun
  - Electronic Engineering College, Heilongjiang University, Harbin 150080, China; (Z.S.); (M.G.); (B.L.); (C.H.); (Y.T.)
- Mengmeng Gao
  - Electronic Engineering College, Heilongjiang University, Harbin 150080, China; (Z.S.); (M.G.); (B.L.); (C.H.); (Y.T.)
- Guotao Wang
  - Electronic Engineering College, Heilongjiang University, Harbin 150080, China; (Z.S.); (M.G.); (B.L.); (C.H.); (Y.T.)
  - School of Electrical Engineering and Automation, Harbin Institute of Technology, Harbin 150001, China
  - Correspondence:
- Bingze Lv
  - Electronic Engineering College, Heilongjiang University, Harbin 150080, China; (Z.S.); (M.G.); (B.L.); (C.H.); (Y.T.)
- Cailing He
  - Electronic Engineering College, Heilongjiang University, Harbin 150080, China; (Z.S.); (M.G.); (B.L.); (C.H.); (Y.T.)
- Yuru Teng
  - Electronic Engineering College, Heilongjiang University, Harbin 150080, China; (Z.S.); (M.G.); (B.L.); (C.H.); (Y.T.)
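Two of the three indicators used in entry 37, signal-to-noise ratio (SNR) and root mean square error (RMSE), can be sketched as below. The definitions are the common textbook ones and assume a reference signal is available for comparison; the paper may define or estimate them differently. The function names and sample values are hypothetical.

```python
import math


def snr_db(reference, filtered):
    """SNR in decibels, treating the residual (reference - filtered) as noise."""
    signal_power = sum(s * s for s in reference)
    noise_power = sum((s - f) ** 2 for s, f in zip(reference, filtered))
    return 10 * math.log10(signal_power / noise_power)


def rmse(reference, filtered):
    """Root mean square error between the reference and the filtered signal."""
    squared = sum((s - f) ** 2 for s, f in zip(reference, filtered))
    return math.sqrt(squared / len(reference))


# Hypothetical samples, not data from the paper: every residual has magnitude 0.1.
reference = [1.0, -1.0, 1.0, -1.0]
filtered = [0.9, -1.1, 1.1, -0.9]
print(snr_db(reference, filtered), rmse(reference, filtered))
```

A higher SNR and a lower RMSE both indicate that the filtered signal stays closer to the reference, which matches the direction of the comparisons reported in the abstract.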
|
38
|
Chao X, Zhang L. Few-shot imbalanced classification based on data augmentation. MULTIMEDIA SYSTEMS 2021. [DOI: 10.1007/s00530-021-00827-0] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/01/2021] [Accepted: 06/22/2021] [Indexed: 05/26/2023]
|
39
|
Kang Y, Jia N, Cui R, Deng J. A graph-based semi-supervised reject inference framework considering imbalanced data distribution for consumer credit scoring. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107259] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 11/30/2022]
|
40
|
Tian Y, Bian B, Tang X, Zhou J. A new non-kernel quadratic surface approach for imbalanced data classification in online credit scoring. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.02.026] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 11/30/2022]
|
41
|
Gnip P, Vokorokos L, Drotár P. Selective oversampling approach for strongly imbalanced data. PeerJ Comput Sci 2021; 7:e604. [PMID: 34239981 PMCID: PMC8237317 DOI: 10.7717/peerj-cs.604] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/05/2021] [Accepted: 05/31/2021] [Indexed: 06/03/2023]
Abstract
Challenges posed by imbalanced data are encountered in many real-world applications. One of the possible approaches to improve the classifier performance on imbalanced data is oversampling. In this paper, we propose the new selective oversampling approach (SOA) that first isolates the most representative samples from minority classes by using an outlier detection technique and then utilizes these samples for synthetic oversampling. We show that the proposed approach improves the performance of two state-of-the-art oversampling methods, namely, the synthetic minority oversampling technique and adaptive synthetic sampling. The prediction performance is evaluated on four synthetic datasets and four real-world datasets, and the proposed SOA methods always achieved the same or better performance than other considered existing oversampling methods.
Affiliation(s)
- Peter Gnip
  - Department of Computers and Informatics, Technical University of Košice, Slovak Republic
- Liberios Vokorokos
  - Department of Computers and Informatics, Technical University of Košice, Slovak Republic
- Peter Drotár
  - Department of Computers and Informatics, Technical University of Košice, Slovak Republic
|
42
|
|
43
|
Vuttipittayamongkol P, Elyan E, Petrovski A. On the class overlap problem in imbalanced data classification. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2020.106631] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Indexed: 12/26/2022]
|
44
|
Wie YM, Lee KG, Lee KH, Ko T, Lee KH. The Experimental Process Design of Artificial Lightweight Aggregates Using an Orthogonal Array Table and Analysis by Machine Learning. MATERIALS 2020; 13:ma13235570. [PMID: 33297369 PMCID: PMC7730768 DOI: 10.3390/ma13235570] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/03/2020] [Revised: 12/02/2020] [Accepted: 12/04/2020] [Indexed: 11/16/2022]
Abstract
The purpose of this study is to experimentally design the drying, calcination, and sintering processes of artificial lightweight aggregates through the orthogonal array, to expand the data using the results, and to model the manufacturing process of lightweight aggregates through machine-learning techniques. The experimental design of the process consisted of $L_{18}(3^6 6^1)$, which means that $3^6 \times 6^1$ data can be obtained in 18 experiments using an orthogonal array design. After the experiment, the data were expanded to 486 instances and trained by several machine-learning techniques such as linear regression, random forest, and support vector regression (SVR). We evaluated the predictive performance of machine-learning models by comparing predicted and actual values. As a result, the SVR showed the best performance for predicting measured values. This model also worked well for predictions of untested cases.
Affiliation(s)
- Young Min Wie
  - Department of Materials Engineering, Kyonggi University, Suwon 16227, Korea; (Y.M.W.); (K.G.L.)
- Ki Gang Lee
  - Department of Materials Engineering, Kyonggi University, Suwon 16227, Korea; (Y.M.W.); (K.G.L.)
- Kang Hyuck Lee
  - Center for Built Environment, Sungkyunkwan University, Suwon 16419, Korea;
- Taehoon Ko
  - Department of Medical Informatics, The Catholic University of Korea, Seoul 06591, Korea;
- Kang Hoon Lee
  - Department Civil & Environmental Engineering, Hanyang University, Seoul 04763, Korea
  - Correspondence: ; Tel.: +82-31-249-9774; Fax: +82-31-244-6300
|
45
|
GT2FS-SMOTE: An Intelligent Oversampling Approach Based Upon General Type-2 Fuzzy Sets to Detect Web Spam. ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING 2020. [DOI: 10.1007/s13369-020-04995-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/29/2022]
|