1
|
Dablain D, Krawczyk B, Chawla NV. DeepSMOTE: Fusing Deep Learning and SMOTE for Imbalanced Data. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2023; 34:6390-6404. [PMID: 35085094 DOI: 10.1109/tnnls.2021.3136503] [Citation(s) in RCA: 25] [Impact Index Per Article: 25.0] [Indexed: 06/14/2023]
Abstract
Despite over two decades of progress, imbalanced data is still considered a significant challenge for contemporary machine learning models. Modern advances in deep learning have further magnified the importance of the imbalanced data problem, especially when learning from images. Therefore, there is a need for an oversampling method that is specifically tailored to deep learning models, can work on raw images while preserving their properties, and is capable of generating high-quality, artificial images that can enhance minority classes and balance the training set. We propose DeepSMOTE, a novel oversampling algorithm for deep learning models that leverages the properties of the successful synthetic minority oversampling technique (SMOTE) algorithm. It is simple, yet effective in its design. It consists of three major components: 1) an encoder/decoder framework; 2) SMOTE-based oversampling; and 3) a dedicated loss function that is enhanced with a penalty term. An important advantage of DeepSMOTE over generative adversarial network (GAN)-based oversampling is that DeepSMOTE does not require a discriminator, and it generates high-quality artificial images that are both information-rich and suitable for visual inspection. DeepSMOTE code is publicly available at https://github.com/dd1github/DeepSMOTE.
|
2
|
Niyogisubizo J, Liao L, Zou F, Han G, Nziyumva E, Li B, Lin Y. Predicting traffic crash severity using hybrid of balanced bagging classification and light gradient boosting machine. INTELL DATA ANAL 2023. [DOI: 10.3233/ida-216398] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2023]
Abstract
Accident severity prediction is an active research topic aimed at ensuring road safety and supporting precautionary measures against anticipated road crashes. In the past decades, both classical statistical methods and machine learning algorithms have been used to predict traffic crash severity. However, most of these models suffer from several drawbacks, including low accuracy and a lack of interpretability. To address these issues, this paper proposes a hybrid of Balanced Bagging Classification (BBC) and Light Gradient Boosting Machine (LGBM) to improve the accuracy of crash severity prediction and reduce bias and variance. To the best of the authors' knowledge, this is one of the pioneering studies that explore the application of BBC-LGBM to predict traffic crash severity. On the accident dataset of Great Britain (UK) from 2013 to 2019, the proposed model demonstrated better performance than other models such as Gaussian Naïve Bayes (GNB), Support Vector Machines (SVM), and Random Forest (RF). More specifically, the proposed model achieved better performance on all metrics for the testing dataset (accuracy = 77.7%, precision = 75%, recall = 73%, F1-Score = 68%). Moreover, permutation importance is used to interpret the results and analyze the importance of each factor influencing crash severity. The accuracy-enhanced model is relevant to several stakeholders, including drivers (for early warning), government departments, insurance companies, and hospitals, all of which provide services concerned with human lives and property damage in road crashes.
Affiliation(s)
- Jovial Niyogisubizo
  - Fujian Key Lab for Automotive Electronics and Electric Drive, Fujian University of Technology, Fujian, China
  - Fujian Provincial Universities Engineering Research Centre for Intelligent Self-Driving Technology, Fujian University of Technology, Fuzhou, Fujian, China
- Lyuchao Liao
  - Fujian Key Lab for Automotive Electronics and Electric Drive, Fujian University of Technology, Fujian, China
  - Fujian Provincial Universities Engineering Research Centre for Intelligent Self-Driving Technology, Fujian University of Technology, Fuzhou, Fujian, China
- Fumin Zou
  - Fujian Key Lab for Automotive Electronics and Electric Drive, Fujian University of Technology, Fujian, China
  - Fujian Provincial Universities Engineering Research Centre for Intelligent Self-Driving Technology, Fujian University of Technology, Fuzhou, Fujian, China
- Guangjie Han
  - Fujian Key Lab for Automotive Electronics and Electric Drive, Fujian University of Technology, Fujian, China
  - College of Internet of Things Engineering, Hohai University, Nanjing, Jiangsu, China
- Eric Nziyumva
  - Fujian Key Lab for Automotive Electronics and Electric Drive, Fujian University of Technology, Fujian, China
  - Fujian Provincial Universities Engineering Research Centre for Intelligent Self-Driving Technology, Fujian University of Technology, Fuzhou, Fujian, China
- Ben Li
  - Fujian Key Lab for Automotive Electronics and Electric Drive, Fujian University of Technology, Fujian, China
  - Fujian Provincial Universities Engineering Research Centre for Intelligent Self-Driving Technology, Fujian University of Technology, Fuzhou, Fujian, China
- Yuyuan Lin
  - Fujian Key Lab for Automotive Electronics and Electric Drive, Fujian University of Technology, Fujian, China
  - Fujian Provincial Universities Engineering Research Centre for Intelligent Self-Driving Technology, Fujian University of Technology, Fuzhou, Fujian, China
|
3
|
HS-Gen: a hypersphere-constrained generation mechanism to improve synthetic minority oversampling for imbalanced classification. COMPLEX INTELL SYST 2022. [DOI: 10.1007/s40747-022-00938-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
Abstract
Mitigating the impact of class-imbalanced data on classifiers is a challenging task in machine learning. SMOTE is a well-known method to tackle this task by modifying the class distribution and generating synthetic instances. However, most SMOTE-based methods focus on the phase of data selection, while few consider the phase of data generation. This paper proposes a hypersphere-constrained generation mechanism (HS-Gen) to improve synthetic minority oversampling. Unlike the linear interpolation commonly used in SMOTE-based methods, HS-Gen generates a minority instance in a hypersphere rather than on a straight line. This mechanism expands the distribution range of minority instances with significant randomness and diversity. Furthermore, HS-Gen is paired with a noise prevention strategy that adaptively shrinks the hypersphere by determining whether new instances fall into the majority class region. HS-Gen can be regarded as an oversampling optimization mechanism and can be flexibly embedded into SMOTE-based methods. We conduct comparative experiments by embedding HS-Gen into the original SMOTE, Borderline-SMOTE, ADASYN, k-means SMOTE, and RSMOTE. Experimental results show that the embedded versions can generate higher quality synthetic instances than the original ones. Moreover, on these oversampled datasets, the conventional classifiers (C4.5 and Adaboost) obtain significant performance improvement in terms of F1 measure and G-mean.
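The contrast between line-based and hypersphere-based generation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of a ball centred on the seed instance with radius equal to the seed-to-neighbour distance, and the function names `smote_point` and `hypersphere_point`, are assumptions made for illustration, and the noise prevention step is omitted.

```python
import numpy as np

def smote_point(x, neighbor, rng):
    """Classic SMOTE step: a point on the segment between x and its neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return x + gap * (neighbor - x)

def hypersphere_point(x, neighbor, rng):
    """Assumed variant: a uniform point in the ball centred at x with radius
    ||neighbor - x||. The paper's exact construction may differ."""
    d = x.shape[0]
    radius = np.linalg.norm(neighbor - x)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)  # random unit vector
    # U ** (1 / d) makes the sample uniform over the ball's volume
    return x + radius * rng.random() ** (1.0 / d) * direction

rng = np.random.default_rng(0)
x = np.array([0.0, 0.0])         # hypothetical minority seed
neighbor = np.array([1.0, 0.0])  # hypothetical minority neighbor
line_pts = np.array([smote_point(x, neighbor, rng) for _ in range(200)])
ball_pts = np.array([hypersphere_point(x, neighbor, rng) for _ in range(200)])
```

In this toy setup every SMOTE point lies on the x-axis, while the hypersphere points spread in both coordinates, which is the added diversity the abstract refers to.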
|
4
|
Perturbation-based oversampling technique for imbalanced classification problems. INT J MACH LEARN CYB 2022. [DOI: 10.1007/s13042-022-01662-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
|
5
|
Santhiappan S, Chelladurai J, Ravindran B. TOMBoost: a topic modeling based boosting approach for learning with class imbalance. INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS 2022. [DOI: 10.1007/s41060-022-00363-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
|
6
|
|
7
|
Klikowski J, Woźniak M. Deterministic Sampling Classifier with weighted Bagging for drifted imbalanced data stream classification. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.108855] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
|
8
|
Pan X, Jia R, Huang J, Wang H. A resistance outlier sampling algorithm for imbalanced data prediction. INTELL DATA ANAL 2022. [DOI: 10.3233/ida-211519] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Classification of imbalanced data is an important challenge in current research. Sampling is an important way to address imbalanced data classification, but some traditional sampling algorithms are susceptible to outliers. Therefore, an iF-ADASYN sampling algorithm is proposed in this paper. First, based on the ADASYN algorithm, we introduce the isolation Forest algorithm to overcome its vulnerability to outliers. Then, a method for calculating an anomaly index that can accurately delete outliers from the minority data is presented. Experimental results on four public imbalanced UCI datasets show that the algorithm can effectively improve the accuracy of the minority class and increase stability. On a real thrombus dataset, the AUC value of the iF-ADASYN algorithm is more significant than that of the SMOTE and ADASYN algorithms, and the recognition rate of patients with thrombosis increased by 20%. The iF-ADASYN algorithm is more resistant to outliers than the original ADASYN algorithm. It also improves the accuracy of region division at the minority class decision boundary.
Affiliation(s)
- Xiaoying Pan
  - School of Computer Science and Technology, Xi’an University of Posts and Telecommunications, Xi’an, Shaanxi, China
  - The Key Laboratory of Network Data Analysis and Intelligent Processing, Xi’an, Shaanxi, China
- Rong Jia
  - School of Computer Science and Technology, Xi’an University of Posts and Telecommunications, Xi’an, Shaanxi, China
- Jiahao Huang
  - School of Computer Science and Technology, Xi’an University of Posts and Telecommunications, Xi’an, Shaanxi, China
- Hao Wang
  - School of Software, Northwestern Polytechnical University, Xi’an, Shaanxi, China
|
9
|
Switching: understanding the class-reversed sampling in tail sample memorization. Mach Learn 2022. [DOI: 10.1007/s10994-021-06087-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
|
10
|
Falcone R, Anderlucci L, Montanari A. Matrix sketching for supervised classification with imbalanced classes. Data Min Knowl Discov 2021. [DOI: 10.1007/s10618-021-00791-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Abstract
The presence of imbalanced classes is increasingly common in practical applications and is known to heavily compromise the learning process. In this paper we propose a new method aimed at addressing this issue in binary supervised classification. Re-balancing the class sizes has turned out to be a fruitful strategy to overcome this problem. Our proposal performs re-balancing through matrix sketching. Matrix sketching is a recently developed data compression technique that is characterized by the property of preserving most of the linear information that is present in the data. This property is guaranteed by the Johnson-Lindenstrauss lemma (1984) and allows an n-dimensional space to be embedded into a reduced one without distorting, within an $\epsilon$-size interval, the distances between any pair of points. We propose to use matrix sketching as an alternative to the standard re-balancing strategies that are based on randomly under-sampling the majority class or randomly over-sampling the minority one. We assess the properties of our method when combined with linear discriminant analysis (LDA), classification trees (C4.5) and Support Vector Machines (SVM) on simulated and real data. Results show that sketching can represent a sound alternative to the most widely used rebalancing methods.
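As a rough illustration of re-balancing by sketching, the following sketch compresses a majority-class data matrix to the size of the minority class with a random Gaussian map. This is only one common sketching choice and is assumed here; the names `gaussian_sketch`, `X_major`, and `X_minor` and the toy data are hypothetical, and the paper's own procedure may differ in its details.

```python
import numpy as np

def gaussian_sketch(X, k, rng):
    """Compress the n rows of X into k rows with a random Gaussian map S.
    Entries of S are N(0, 1/k), so E[(SX)^T (SX)] = X^T X."""
    n = X.shape[0]
    S = rng.normal(scale=1.0 / np.sqrt(k), size=(k, n))
    return S @ X

rng = np.random.default_rng(0)
X_major = rng.normal(size=(2000, 3))          # toy majority class
X_minor = rng.normal(loc=2.0, size=(200, 3))  # toy minority class

# Shrink the majority class to the minority class size.
X_major_sketched = gaussian_sketch(X_major, k=X_minor.shape[0], rng=rng)

# Second-moment (linear) information is approximately preserved.
G = X_major.T @ X_major
G_sketched = X_major_sketched.T @ X_major_sketched
rel_err = np.linalg.norm(G_sketched - G) / np.linalg.norm(G)
```

The sketched rows are random linear combinations of the original rows rather than a subset of them, which is what distinguishes this approach from random under-sampling.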
|
11
|
RB-CCR: Radial-Based Combined Cleaning and Resampling algorithm for imbalanced data classification. Mach Learn 2021. [DOI: 10.1007/s10994-021-06012-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022]
Abstract
Real-world classification domains, such as medicine, health and safety, and finance, often exhibit imbalanced class priors and have asynchronous misclassification costs. In such cases, the classification model must achieve a high recall without significantly impacting precision. Resampling the training data is the standard approach to improving classification performance on imbalanced binary data. However, the state-of-the-art methods ignore the local joint distribution of the data or correct it as a post-processing step. This can cause sub-optimal shifts in the training distribution, particularly when the target data distribution is complex. In this paper, we propose Radial-Based Combined Cleaning and Resampling (RB-CCR). RB-CCR utilizes the concept of class potential to refine the energy-based resampling approach of CCR. In particular, RB-CCR exploits the class potential to accurately locate sub-regions of the data-space for synthetic oversampling. The category sub-region for oversampling can be specified as an input parameter to meet domain-specific needs or be automatically selected via cross-validation. Our $5\times 2$ cross-validated results on 57 benchmark binary datasets with 9 classifiers show that RB-CCR achieves a better precision-recall trade-off than CCR and generally outperforms the state-of-the-art resampling methods in terms of AUC and G-mean.
|
12
|
Geometric Regularization of Local Activations for Knowledge Transfer in Convolutional Neural Networks. INFORMATION 2021. [DOI: 10.3390/info12080333] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
Abstract
In this work, we propose a mechanism for knowledge transfer between Convolutional Neural Networks via the geometric regularization of local features produced by the activations of convolutional layers. We formulate appropriate loss functions, driving a “student” model to adapt such that its local features exhibit similar geometrical characteristics to those of an “instructor” model, at corresponding layers. The investigated functions, inspired by manifold-to-manifold distance measures, are designed to compare the neighboring information inside the feature space of the involved activations without any restrictions in the features’ dimensionality, thus enabling knowledge transfer between different architectures. Experimental evidence demonstrates that the proposed technique is effective in different settings, including knowledge-transfer to smaller models, transfer between different deep architectures and harnessing knowledge from external data, producing models with increased accuracy compared to a typical training. Furthermore, results indicate that the presented method can work synergistically with methods such as knowledge distillation, further increasing the accuracy of the trained models. Finally, experiments on training with limited data show that a combined regularization scheme can achieve the same generalization as a non-regularized training with 50% of the data in the CIFAR-10 classification task.
|
13
|
Khorshidi HA, Aickelin U. Constructing classifiers for imbalanced data using diversity optimisation. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.02.069] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 12/20/2022]
|
14
|
Identification of Road Traffic Injury Risk Prone Area Using Environmental Factors by Machine Learning Classification in Nonthaburi, Thailand. SUSTAINABILITY 2021. [DOI: 10.3390/su13073907] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022]
Abstract
Road traffic injuries are a major cause of morbidity and mortality worldwide and currently rank ninth globally among the leading causes of disease burden regarding disability-adjusted life years lost. Nonthaburi and Pathum Thani are parts of the greater Bangkok metropolitan area, and the road traffic injury rate is very high in these areas. This study aimed to identify the environmental factors affecting road traffic injury risk prone areas and classify road traffic injuries from an environmental factor dataset using machine learning algorithms. Road traffic injury risk prone areas were set as the dependent variables for the analysis, with other factors that influence road traffic injury risk prone areas being set as independent variables. A total of 20 environmental factors were selected from the spatial datasets. Then, machine learning algorithms were applied using a grid search. The first experiment used 2017 data from Nonthaburi and Pathum Thani for training the model and 2018 data from Nonthaburi and Pathum Thani for validation. The second experiment used 2018 Nonthaburi data for training and 2018 Pathum Thani data for validation. The important factors were grocery stores, convenience stores, electronics stores, drugstores, schools, gas stations, restaurants, supermarkets, and road geometrics, with length being the most critical factor that influenced the road traffic injury risk prone model. In both the first and second experiments, the random forest model provided the best model of the environmental factors affecting road traffic injury risk prone areas, and machine learning can classify such road traffic injuries.
|
15
|
Abstract
One of the significant challenges in machine learning is the classification of imbalanced data. In many situations, standard classifiers cannot learn how to distinguish minority class examples from the others. Since many real problems are unbalanced, this problem has become very relevant and deeply studied today. This paper presents a new preprocessing method based on Delaunay tessellation and the preprocessing algorithm SMOTE (Synthetic Minority Over-sampling Technique), which we call DTO-SMOTE (Delaunay Tessellation Oversampling SMOTE). DTO-SMOTE constructs a mesh of simplices (in this paper, we use tetrahedrons) for creating synthetic examples. We compare results with five preprocessing algorithms (GEOMETRIC-SMOTE, SVM-SMOTE, SMOTE-BORDERLINE-1, SMOTE-BORDERLINE-2, and SMOTE), eight classification algorithms, and 61 binary-class data sets. For some classifiers, DTO-SMOTE has higher performance than others in terms of Area Under the ROC curve (AUC), Geometric Mean (GEO), and Generalized Index of Balanced Accuracy (IBA).
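To make the idea of creating a synthetic example inside a simplex concrete, here is a minimal sketch that draws one point inside a single tetrahedron using random barycentric weights. The abstract does not state how points are drawn within a simplex, so the Dirichlet weighting, the function name `sample_in_simplex`, and the toy vertices are assumptions; building the Delaunay mesh itself is not shown.

```python
import numpy as np

def sample_in_simplex(vertices, rng):
    """Return a random convex combination of the simplex vertices.
    Dirichlet(1, ..., 1) weights are non-negative and sum to 1 (assumed scheme)."""
    weights = rng.dirichlet(np.ones(len(vertices)))
    return weights @ vertices, weights

rng = np.random.default_rng(0)
# Hypothetical tetrahedron formed by four minority-class instances in 3-D.
tetra = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
synthetic, weights = sample_in_simplex(tetra, rng)
```

Because the weights form a convex combination, the synthetic point always lies inside the tetrahedron, whereas SMOTE's interpolation between two instances only reaches its edges.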
|
16
|
Bej S, Davtyan N, Wolfien M, Nassar M, Wolkenhauer O. LoRAS: an oversampling approach for imbalanced datasets. Mach Learn 2020. [DOI: 10.1007/s10994-020-05913-4] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Indexed: 01/18/2023]
Abstract
The Synthetic Minority Oversampling TEchnique (SMOTE) is widely used for the analysis of imbalanced datasets. It is known that SMOTE frequently over-generalizes the minority class, leading to misclassifications for the majority class and affecting the overall balance of the model. In this article, we present an approach that overcomes this limitation of SMOTE, employing Localized Random Affine Shadowsampling (LoRAS) to oversample from an approximated data manifold of the minority class. We benchmarked our algorithm on 14 publicly available imbalanced datasets using three different Machine Learning (ML) algorithms and compared the performance of LoRAS with that of SMOTE and several SMOTE extensions that, like LoRAS, use convex combinations of minority class data points for oversampling. We observed that LoRAS, on average, generates better ML models in terms of F1-Score and Balanced accuracy. Another key observation is that while most of the SMOTE extensions we tested improve the F1-Score with respect to SMOTE on average, they compromise the Balanced accuracy of a classification model. LoRAS, on the contrary, improves both the F1-Score and the Balanced accuracy and thus produces better classification models. Moreover, to explain the success of the algorithm, we have constructed a mathematical framework to prove that the LoRAS oversampling technique provides a better estimate for the mean of the underlying local data distribution of the minority class data space.
|
17
|
Koziarski M, Woźniak M, Krawczyk B. Combined Cleaning and Resampling algorithm for multi-class imbalanced data with label noise. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.106223] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Indexed: 11/30/2022]
|
18
|
Krawczyk B, Koziarski M, Wozniak M. Radial-Based Oversampling for Multiclass Imbalanced Data Classification. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2020; 31:2818-2831. [PMID: 31247563 DOI: 10.1109/tnnls.2019.2913673] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Indexed: 06/09/2023]
Abstract
Learning from imbalanced data is among the most popular topics in contemporary machine learning. However, the vast majority of attention in this field is given to binary problems, while their much more difficult multiclass counterparts are relatively unexplored. Handling data sets with multiple skewed classes poses various challenges and calls for a better understanding of the relationship among classes. In this paper, we propose multiclass radial-based oversampling (MC-RBO), a novel data-sampling algorithm dedicated to multiclass problems. The main novelty of our method lies in using potential functions for generating artificial instances. We take into account information coming from all of the classes, contrary to existing multiclass oversampling approaches that use only minority class characteristics. The process of artificial instance generation is guided by exploring areas where the value of the mutual class distribution is very small. This way, we ensure a smart oversampling procedure that can cope with difficult data distributions and alleviate the shortcomings of existing methods. The usefulness of the MC-RBO algorithm is evaluated on the basis of an extensive experimental study and backed up with a thorough statistical analysis. The results show that by taking into account information coming from all of the classes and conducting smart oversampling, we can significantly improve the process of learning from multiclass imbalanced data.
|
19
|
Liu ZT, Wu BH, Li DY, Xiao P, Mao JW. Speech Emotion Recognition Based on Selective Interpolation Synthetic Minority Over-Sampling Technique in Small Sample Environment. SENSORS 2020; 20:s20082297. [PMID: 32316473 PMCID: PMC7219047 DOI: 10.3390/s20082297] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 02/15/2020] [Revised: 04/10/2020] [Accepted: 04/14/2020] [Indexed: 11/16/2022]
Abstract
Speech emotion recognition often encounters the problems of data imbalance and redundant features in different application scenarios. Researchers usually design different recognition models for different sample conditions. In this study, a speech emotion recognition model for a small sample environment is proposed. A data imbalance processing method based on the selective interpolation synthetic minority over-sampling technique (SISMOTE) is proposed to reduce the impact of sample imbalance on emotion recognition results. In addition, a feature selection method based on variance analysis and gradient boosting decision tree (GBDT) is introduced, which can exclude redundant features that possess poor emotional representation. Results of speech emotion recognition experiments on three databases (i.e., CASIA, Emo-DB, SAVEE) show that our method obtains average recognition accuracies of 90.28% (CASIA), 75.00% (SAVEE) and 85.82% (Emo-DB) for speaker-dependent speech emotion recognition, which is superior to some state-of-the-art works.
Affiliation(s)
- Zhen-Tao Liu
  - School of Automation, China University of Geosciences, Wuhan 430074, China; (Z.-T.L.); (B.-H.W.); (P.X.); (J.-W.M.)
  - Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan 430074, China
- Bao-Han Wu
  - School of Automation, China University of Geosciences, Wuhan 430074, China; (Z.-T.L.); (B.-H.W.); (P.X.); (J.-W.M.)
  - Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan 430074, China
- Dan-Yun Li
  - School of Automation, China University of Geosciences, Wuhan 430074, China; (Z.-T.L.); (B.-H.W.); (P.X.); (J.-W.M.)
  - Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan 430074, China
  - Correspondence:
- Peng Xiao
  - School of Automation, China University of Geosciences, Wuhan 430074, China; (Z.-T.L.); (B.-H.W.); (P.X.); (J.-W.M.)
  - Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan 430074, China
- Jun-Wei Mao
  - School of Automation, China University of Geosciences, Wuhan 430074, China; (Z.-T.L.); (B.-H.W.); (P.X.); (J.-W.M.)
  - Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan 430074, China
|
20
|
Classical and Deep Learning Paradigms for Detection and Validation of Key Genes of Risky Outcomes of HCV. ALGORITHMS 2020. [DOI: 10.3390/a13030073] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/18/2022]
Abstract
Hepatitis C virus (HCV) is one of the most dangerous viruses worldwide. It is the foremost cause of hepatic cirrhosis and hepatocellular carcinoma (HCC). Detecting new key genes that play a role in the growth of HCC in HCV patients using machine learning techniques paves the way for producing accurate antivirals. This work has two phases: detecting the up/downregulated genes using classical univariate and multivariate feature selection methods, and validating the retrieved list of genes using in silico classifiers. However, classification algorithms in the medical domain frequently suffer from a shortage of training cases. Therefore, a deep neural network approach is proposed here to validate the significance of the retrieved genes in distinguishing the HCV-infected samples from the uninfected ones. The validation model is based on the artificial generation of new examples from the retrieved genes’ expressions using sparse autoencoders. Subsequently, the generated gene expression data are used to train conventional classifiers. Our results in the first phase yielded a better retrieval of significant genes using Principal Component Analysis (PCA), a multivariate approach. The list of genes retrieved using PCA contained more HCC biomarkers than the lists retrieved by the univariate methods. In the second phase, the classification accuracy can reveal the relevance of the extracted key genes in classifying the HCV-infected and uninfected samples.
|
21
|
Schlögl M. A multivariate analysis of environmental effects on road accident occurrence using a balanced bagging approach. ACCIDENT; ANALYSIS AND PREVENTION 2020; 136:105398. [PMID: 31855710 DOI: 10.1016/j.aap.2019.105398] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 06/03/2019] [Revised: 09/06/2019] [Accepted: 12/05/2019] [Indexed: 06/10/2023]
Abstract
Determining and understanding the environmental factors contributing to road traffic accident occurrence is of core importance in road safety research. In this study, a methodology to obtain robust and unbiased results when modeling imbalanced, high-resolution accident data is described. Based on a data set covering the whole highway network of Austria in a fine spatial (250 m) and temporal (1 h) scale, the effects of 48 covariates on accident occurrence are analyzed, with a special emphasis on real-time weather variables obtained through meteorological re-analysis. A balanced bagging approach is employed to cope with the issue of class imbalance. By fitting different tree-based classifiers to a large number of bootstrapped training samples, ensembles of binary classification models are established. The final prediction is achieved through majority vote across each ensemble, resulting in a robust prediction with reduced variance. Findings show the merits of the proposed approach in terms of model quality and robustness of the results, consistently displaying accuracies around 80% while exhibiting sensitivities of approximately 50%. In addition to certain features related to roadway geometrics, surface condition and traffic volume, a number of weather variables are found to be of importance for predicting accident occurrence. The proposed methodological take may not only pave the way for further analyses of high-resolution road safety data including real-time information, but can also be transferred to any other imbalanced classification problem.
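The balanced bagging procedure summarised in this abstract (class-balanced bootstrap samples, one classifier per sample, majority vote) can be sketched as follows. This is an illustrative toy, not the study's pipeline: the nearest-centroid base learner is a stand-in for the tree-based classifiers the abstract mentions, and the function names, the synthetic data, and the ensemble size are all assumptions.

```python
import numpy as np

def balanced_bootstrap(X, y, rng):
    """Draw a bootstrap sample with equal counts of class 0 and class 1."""
    idx0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    n = min(len(idx0), len(idx1))
    pick = np.concatenate([rng.choice(idx0, n, replace=True),
                           rng.choice(idx1, n, replace=True)])
    return X[pick], y[pick]

class NearestCentroid:
    """Stand-in base learner (the study itself uses tree-based classifiers)."""
    def fit(self, X, y):
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
        return self

    def predict(self, X):
        dists = np.linalg.norm(X[:, None, :] - self.centroids_[None], axis=2)
        return dists.argmin(axis=1)

def balanced_bagging_predict(X_train, y_train, X_test, n_models, rng):
    """Fit one model per balanced bootstrap sample and take a majority vote."""
    votes = np.zeros(len(X_test))
    for _ in range(n_models):
        Xb, yb = balanced_bootstrap(X_train, y_train, rng)
        votes += NearestCentroid().fit(Xb, yb).predict(X_test)
    return (votes > n_models / 2).astype(int)

rng = np.random.default_rng(0)
# Toy imbalanced data: 950 majority (class 0) and 50 minority (class 1) points.
X = np.vstack([rng.normal(0.0, 1.0, size=(950, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])
y = np.array([0] * 950 + [1] * 50)
pred = balanced_bagging_predict(X, y, X, n_models=25, rng=rng)
minority_recall = (pred[y == 1] == 1).mean()
```

An odd number of models avoids tied votes. Each base learner sees the two classes in equal proportion, so the minority class is not swamped during fitting, while averaging over many resamples reduces the variance that a single under-sampled training set would introduce.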
Affiliation(s)
- Matthias Schlögl
  - Institute of Statistics, University of Natural Resources and Life Sciences (BOKU), Vienna, Austria; Transportation Infrastructure Technologies, Austrian Institute of Technology (AIT), Vienna, Austria.
|
22
|
Zhu T, Lin Y, Liu Y. Improving interpolation-based oversampling for imbalanced data learning. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2019.06.034] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Indexed: 02/08/2023]
|
23
|
A Comprehensive Analysis of Synthetic Minority Oversampling Technique (SMOTE) for handling class imbalance. Inf Sci (N Y) 2019. [DOI: 10.1016/j.ins.2019.07.070] [Citation(s) in RCA: 100] [Impact Index Per Article: 20.0] [Indexed: 11/20/2022]
|
24
|
Bellinger C, Sharma S, Japkowicz N, Zaïane OR. Framework for extreme imbalance classification: SWIM—sampling with the majority class. Knowl Inf Syst 2019. [DOI: 10.1007/s10115-019-01380-z] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Indexed: 11/29/2022]
|