1
|
CBDT-Oglyc: Prediction of O-glycosylation sites using ChiMIC-based balanced decision table and feature selection. J Bioinform Comput Biol 2023; 21:2350024. [PMID: 37899352 DOI: 10.1142/s0219720023500245] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/31/2023]
Abstract
O-glycosylation (Oglyc) plays an important role in various biological processes. The key to understanding the mechanisms of Oglyc is identifying the corresponding glycosylation sites. Two critical steps, feature selection and classifier design, greatly affect the accuracy of computational methods for predicting Oglyc sites. Based on an efficient feature selection algorithm and a classifier capable of handling imbalanced datasets, a new computational method, ChiMIC-based balanced decision table O-glycosylation (CBDT-Oglyc), is proposed. ChiMIC-based balanced decision table for O-glycosylation (CBDT-Oglyc), is proposed to predict Oglyc sites in proteins. Sequence characterization is performed by combining amino acid composition (AAC), undirected composition of [Formula: see text]-spaced amino acid pairs (undirected-CKSAAP) and pseudo-position-specific scoring matrix (PsePSSM). Chi-MIC-share algorithm is used for feature selection, which simplifies the model and improves predictive accuracy. For imbalanced classification, a backtracking method based on local chi-square test is designed, and then cost-sensitive learning is incorporated to construct a novel classifier named ChiMIC-based balanced decision table (CBDT). Based on a 1:49 (positives:negatives) training set, the CBDT classifier achieves significantly better prediction performance than traditional classifiers. Moreover, the independent test results on separate human and mouse glycoproteins show that CBDT-Oglyc outperforms previous methods in global accuracy. CBDT-Oglyc shows great promise in predicting Oglyc sites and is expected to facilitate further experimental studies on protein glycosylation.
Collapse
|
2
|
Machine-Learning Algorithms for Process Condition Data-Based Inclusion Prediction in Continuous-Casting Process: A Case Study. SENSORS (BASEL, SWITZERLAND) 2023; 23:6719. [PMID: 37571504 PMCID: PMC10422374 DOI: 10.3390/s23156719] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/03/2023] [Revised: 07/25/2023] [Accepted: 07/26/2023] [Indexed: 08/13/2023]
Abstract
Quality-related prediction in the continuous-casting process is important for the quality and process control of casting slabs. As intelligent manufacturing technologies continue to evolve, numerous data-driven techniques have been available for industrial applications. This case study was aimed at developing a machine-learning algorithm, capable of predicting slag inclusion defects in continuous-casting slabs, based on process condition sensor data. A large dataset consisting of sensor data from nearly 7300 casting samples has been analyzed, with the empirical mode decomposition (EMD) algorithm utilized to process the multi-modal time series. The following machine-learning algorithms have been examined: K-Nearest neighbors, support vector classifier (linear and nonlinear kernels), decision trees, random forests, AdaBoost, and Artificial Neural Networks. Four over-sampling or under-sampling algorithms have been adopted to solve imbalanced data distribution. In the experiment, the optimized random forest outperformed other machine-learning algorithms in terms of recall and ROC AUC, which could provide valuable insights for quality control.
Collapse
|
3
|
DeepSTABp: A Deep Learning Approach for the Prediction of Thermal Protein Stability. Int J Mol Sci 2023; 24:ijms24087444. [PMID: 37108605 PMCID: PMC10138888 DOI: 10.3390/ijms24087444] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2023] [Revised: 04/04/2023] [Accepted: 04/13/2023] [Indexed: 04/29/2023] Open
Abstract
Proteins are essential macromolecules that carry out a plethora of biological functions. The thermal stability of proteins is an important property that affects their function and determines their suitability for various applications. However, current experimental approaches, primarily thermal proteome profiling, are expensive, labor-intensive, and have limited proteome and species coverage. To close the gap between available experimental data and sequence information, a novel protein thermal stability predictor called DeepSTABp has been developed. DeepSTABp uses a transformer-based protein language model for sequence embedding and state-of-the-art feature extraction in combination with other deep learning techniques for end-to-end protein melting temperature prediction. DeepSTABp can predict the thermal stability of a wide range of proteins, making it a powerful and efficient tool for large-scale prediction. The model captures the structural and biological properties that impact protein stability, and it allows for the identification of the structural features that contribute to protein stability. DeepSTABp is available to the public via a user-friendly web interface, making it accessible to researchers in various fields.
Collapse
|
4
|
Taxi drivers' traffic violations detection using random forest algorithm: A case study in China. TRAFFIC INJURY PREVENTION 2023; 24:362-370. [PMID: 36976788 DOI: 10.1080/15389588.2023.2191286] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/08/2022] [Revised: 01/29/2023] [Accepted: 03/12/2023] [Indexed: 05/23/2023]
Abstract
OBJECTIVE To effectively explore the impacts of several key factors on taxi drivers' traffic violations and provide traffic management departments with scientific decisions to reduce traffic fatalities and injuries. METHODS 43,458 electronic enforcement data about taxi drivers' traffic violations in Nanchang City, Jiangxi Province, China, from July 1, 2020, to June 30, 2021, were utilized to explore the characteristics of traffic violations. A random forest algorithm was used to predict the severity of taxi drivers' traffic violations and 11 factors affecting traffic violations, including time, road conditions, environment, and taxi companies were analyzed using the Shapley Additionality Explanation (SHAP) framework. RESULTS Firstly, the ensemble method Balanced Bagging Classifier (BBC) was applied to balance the dataset. The results showed that the imbalance ratio (IR) of the original imbalanced dataset reduced from 6.61% to 2.60%. Moreover, a prediction model for the severity of taxi drivers' traffic violations was established by using the Random Forest, and the results showed that accuracy, m_F1, m_G-mean, m_AUC, and m_AP obtained 0.877, 0.849, 0.599, 0.976, and 0.957, respectively. Compared with the algorithms of Decision Tree, XG Boost, Ada Boost, and Neural Network, the performance measures of the prediction model based on Random Forest were the best. Finally, the SHAP framework was used to improve the interpretability of the model and identify important factors affecting taxi drivers' traffic violations. The results showed that functional districts, location of the violation, and road grade were found to have a high impact on the probability of traffic violations; their mean SHAP values were 0.39, 0.36, and 0.26, respectively. CONCLUSIONS Findings of this paper may help to discover the relationship between the influencing factors and the severity of traffic violations, and provide a theoretical basis for reducing the traffic violations of taxi drivers and improving the road safety management.
Collapse
|
5
|
Ensemble Capsule Network with an Attention Mechanism for the Fault Diagnosis of Bearings from Imbalanced Data Samples. SENSORS 2022; 22:s22155543. [PMID: 35898042 PMCID: PMC9332463 DOI: 10.3390/s22155543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/30/2022] [Revised: 07/19/2022] [Accepted: 07/21/2022] [Indexed: 11/16/2022]
Abstract
In order to solve the problem of imbalanced and noisy data samples for the fault diagnosis of rolling bearings, a novel ensemble capsule network (Capsnet) with a convolutional block attention module (CBAM) that is based on a weighted majority voting method is proposed in this study. Firstly, the complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) method was used to decompose the raw vibration signal into different IMF signals, which are noise reduction signals. Secondly, the IMF signals were input into the Capsnet with CBAM in order to diagnose the fault category preliminarily. Finally, the weighted majority voting method was utilized so as to fuse all of the preliminary diagnosis results in order to obtain the final diagnostic decision. In order to verify the effectiveness of the proposed ensemble of Capsnet with CBAM, this method was applied to the fault diagnosis of rolling bearings with imbalanced and different SNR data samples. The diagnostic results show that the proposed diagnostic method can achieve higher levels of accuracy than other methods, such as single CNN, single Capsnet, ensemble CNN and an ensemble capsule network without CBAM and that it has stronger immunity to noise than an ensemble capsule network without CBAM.
Collapse
|
6
|
An MRI Scans-Based Alzheimer's Disease Detection via Convolutional Neural Network and Transfer Learning. Diagnostics (Basel) 2022; 12:diagnostics12071531. [PMID: 35885437 PMCID: PMC9318866 DOI: 10.3390/diagnostics12071531] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2022] [Revised: 06/20/2022] [Accepted: 06/21/2022] [Indexed: 02/04/2023] Open
Abstract
Alzheimer’s disease (AD) is the most common type (>60%) of dementia and can wreak havoc on the psychological and physiological development of sufferers and their carers, as well as the economic and social development. Attributed to the shortage of medical staff, automatic diagnosis of AD has become more important to relieve the workload of medical staff and increase the accuracy of medical diagnoses. Using the common MRI scans as inputs, an AD detection model has been designed using convolutional neural network (CNN). To enhance the fine-tuning of hyperparameters and, thus, the detection accuracy, transfer learning (TL) is introduced, which brings the domain knowledge from heterogeneous datasets. Generative adversarial network (GAN) is applied to generate additional training data in the minority classes of the benchmark datasets. Performance evaluation and analysis using three benchmark (OASIS-series) datasets revealed the effectiveness of the proposed method, which increases the accuracy of the detection model by 2.85−3.88%, 2.43−2.66%, and 1.8−40.1% in the ablation study of GAN and TL, as well as the comparison with existing works, respectively.
Collapse
|
7
|
Weighting Methods for Rare Event Identification From Imbalanced Datasets. Front Big Data 2022; 4:715320. [PMID: 35005617 PMCID: PMC8734962 DOI: 10.3389/fdata.2021.715320] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2021] [Accepted: 11/04/2021] [Indexed: 11/30/2022] Open
Abstract
In machine learning, we often face the situation where the event we are interested in has very few data points buried in a massive amount of data. This is typical in network monitoring, where data are streamed from sensing or measuring units continuously but most data are not for events. With imbalanced datasets, the classifiers tend to be biased in favor of the main class. Rare event detection has received much attention in machine learning, and yet it is still a challenging problem. In this paper, we propose a remedy for the standing problem. Weighting and sampling are two fundamental approaches to address the problem. We focus on the weighting method in this paper. We first propose a boosting-style algorithm to compute class weights, which is proved to have excellent theoretical property. Then we propose an adaptive algorithm, which is suitable for real-time applications. The adaptive nature of the two algorithms allows a controlled tradeoff between true positive rate and false positive rate and avoids excessive weight on the rare class, which leads to poor performance on the main class. Experiments on power grid data and some public datasets show that the proposed algorithms outperform the existing weighting and boosting methods, and that their superiority is more noticeable with noisy data.
Collapse
|
8
|
Apple Leaf Disease Identification with a Small and Imbalanced Dataset Based on Lightweight Convolutional Networks. SENSORS (BASEL, SWITZERLAND) 2021; 22:173. [PMID: 35009716 PMCID: PMC8749501 DOI: 10.3390/s22010173] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/07/2021] [Revised: 12/23/2021] [Accepted: 12/25/2021] [Indexed: 01/13/2023]
Abstract
The intelligent identification and classification of plant diseases is an important research objective in agriculture. In this study, in order to realize the rapid and accurate identification of apple leaf disease, a new lightweight convolutional neural network RegNet was proposed. A series of comparative experiments had been conducted based on 2141 images of 5 apple leaf diseases (rust, scab, ring rot, panonychus ulmi, and healthy leaves) in the field environment. To assess the effectiveness of the RegNet model, a series of comparison experiments were conducted with state-of-the-art convolutional neural networks (CNN) such as ShuffleNet, EfficientNet-B0, MobileNetV3, and Vision Transformer. The results show that RegNet-Adam with a learning rate of 0.0001 obtained an average accuracy of 99.8% on the validation set and an overall accuracy of 99.23% on the test set, outperforming all other pre-trained models. In other words, the proposed method based on transfer learning established in this research can realize the rapid and accurate identification of apple leaf disease.
Collapse
|
9
|
Predicting the Mortality and Readmission of In-Hospital Cardiac Arrest Patients With Electronic Health Records: A Machine Learning Approach. J Med Internet Res 2021; 23:e27798. [PMID: 34515639 PMCID: PMC8477292 DOI: 10.2196/27798] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2021] [Revised: 05/09/2021] [Accepted: 07/21/2021] [Indexed: 01/01/2023] Open
Abstract
Background In-hospital cardiac arrest (IHCA) is associated with high mortality and health care costs in the recovery phase. Predicting adverse outcome events, including readmission, improves the chance for appropriate interventions and reduces health care costs. However, studies related to the early prediction of adverse events of IHCA survivors are rare. Therefore, we used a deep learning model for prediction in this study. Objective This study aimed to demonstrate that with the proper data set and learning strategies, we can predict the 30-day mortality and readmission of IHCA survivors based on their historical claims. Methods National Health Insurance Research Database claims data, including 168,693 patients who had experienced IHCA at least once and 1,569,478 clinical records, were obtained to generate a data set for outcome prediction. We predicted the 30-day mortality/readmission after each current record (ALL-mortality/ALL-readmission) and 30-day mortality/readmission after IHCA (cardiac arrest [CA]-mortality/CA-readmission). We developed a hierarchical vectorizer (HVec) deep learning model to extract patients’ information and predict mortality and readmission. To embed the textual medical concepts of the clinical records into our deep learning model, we used Text2Node to compute the distributed representations of all medical concept codes as a 128-dimensional vector. Along with the patient’s demographic information, our novel HVec model generated embedding vectors to hierarchically describe the health status at the record-level and patient-level. Multitask learning involving two main tasks and auxiliary tasks was proposed. As CA-mortality and CA-readmission were rare, person upsampling of patients with CA and weighting of CA records were used to improve prediction performance. Results With the multitask learning setting in the model learning process, we achieved an area under the receiver operating characteristic of 0.752 for CA-mortality, 0.711 for ALL-mortality, 0.852 for CA-readmission, and 0.889 for ALL-readmission. The area under the receiver operating characteristic was improved to 0.808 for CA-mortality and 0.862 for CA-readmission after solving the extremely imbalanced issue for CA-mortality/CA-readmission by upsampling and weighting. Conclusions This study demonstrated the potential of predicting future outcomes for IHCA survivors by machine learning. The results showed that our proposed approach could effectively alleviate data imbalance problems and train a better model for outcome prediction.
Collapse
|
10
|
Analysis of Machine Learning Algorithms for Anomaly Detection on Edge Devices. SENSORS (BASEL, SWITZERLAND) 2021; 21:4946. [PMID: 34300686 PMCID: PMC8309800 DOI: 10.3390/s21144946] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/01/2021] [Revised: 07/11/2021] [Accepted: 07/16/2021] [Indexed: 11/16/2022]
Abstract
The Internet of Things (IoT) consists of small devices or a network of sensors, which permanently generate huge amounts of data. Usually, they have limited resources, either computing power or memory, which means that raw data are transferred to central systems or the cloud for analysis. Lately, the idea of moving intelligence to the IoT is becoming feasible, with machine learning (ML) moved to edge devices. The aim of this study is to provide an experimental analysis of processing a large imbalanced dataset (DS2OS), split into a training dataset (80%) and a test dataset (20%). The training dataset was reduced by randomly selecting a smaller number of samples to create new datasets Di (i = 1, 2, 5, 10, 15, 20, 40, 60, 80%). Afterwards, they were used with several machine learning algorithms to identify the size at which the performance metrics show saturation and classification results stop improving with an F1 score equal to 0.95 or higher, which happened at 20% of the training dataset. Further on, two solutions for the reduction of the number of samples to provide a balanced dataset are given. In the first, datasets DRi consist of all anomalous samples in seven classes and a reduced majority class ('NL') with i = 0.1, 0.2, 0.5, 1, 2, 5, 10, 15, 20 percent of randomly selected samples. In the second, datasets DCi are generated from the representative samples determined with clustering from the training dataset. All three dataset reduction methods showed comparable performance results. Further evaluation of training times and memory usage on Raspberry Pi 4 shows a possibility to run ML algorithms with limited sized datasets on edge devices.
Collapse
|
11
|
Convolutional Rebalancing Network for the Classification of Large Imbalanced Rice Pest and Disease Datasets in the Field. FRONTIERS IN PLANT SCIENCE 2021; 12:671134. [PMID: 34290724 PMCID: PMC8287420 DOI: 10.3389/fpls.2021.671134] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/23/2021] [Accepted: 05/19/2021] [Indexed: 05/21/2023]
Abstract
The accurate classification of crop pests and diseases is essential for their prevention and control. However, datasets of pest and disease images collected in the field usually exhibit long-tailed distributions with heavy category imbalance, posing great challenges for a deep recognition and classification model. This paper proposes a novel convolutional rebalancing network to classify rice pests and diseases from image datasets collected in the field. To improve the classification performance, the proposed network includes a convolutional rebalancing module, an image augmentation module, and a feature fusion module. In the convolutional rebalancing module, instance-balanced sampling is used to extract features of the images in the rice pest and disease dataset, while reversed sampling is used to improve feature extraction of the categories with fewer images in the dataset. Building on the convolutional rebalancing module, we design an image augmentation module to augment the training data effectively. To further enhance the classification performance, a feature fusion module fuses the image features learned by the convolutional rebalancing module and ensures that the feature extraction of the imbalanced dataset is more comprehensive. Extensive experiments in the large-scale imbalanced dataset of rice pests and diseases (18,391 images), publicly available plant image datasets (Flavia, Swedish Leaf, and UCI Leaf) and pest image datasets (SMALL and IP102) verify the robustness of the proposed network, and the results demonstrate its superior performance over state-of-the-art methods, with an accuracy of 97.58% on rice pest and disease image dataset. We conclude that the proposed network can provide an important tool for the intelligent control of rice pests and diseases in the field.
Collapse
|
12
|
NonClasGP-Pred: robust and efficient prediction of non-classically secreted proteins by integrating subset-specific optimal models of imbalanced data. Microb Genom 2020; 6:mgen000483. [PMID: 33245691 PMCID: PMC8116686 DOI: 10.1099/mgen.0.000483] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2020] [Accepted: 11/06/2020] [Indexed: 01/01/2023] Open
Abstract
Non-classically secreted proteins (NCSPs) are proteins that are located in the extracellular environment, although there is a lack of known signal peptides or secretion motifs. They usually perform different biological functions in intracellular and extracellular environments, and several of their biological functions are linked to bacterial virulence and cell defence. Accurate protein localization is essential for all living organisms, however, the performance of existing methods developed for NCSP identification has been unsatisfactory and in particular suffer from data deficiency and possible overfitting problems. Further improvement is desirable, especially to address the lack of informative features and mining subset-specific features in imbalanced datasets. In the present study, a new computational predictor was developed for NCSP prediction of gram-positive bacteria. First, to address the possible prediction bias caused by the data imbalance problem, ten balanced subdatasets were generated for ensemble model construction. Then, the F-score algorithm combined with sequential forward search was used to strengthen the feature representation ability for each of the training subdatasets. Third, the subset-specific optimal feature combination process was adopted to characterize the original data from different aspects, and all subdataset-based models were integrated into a unified model, NonClasGP-Pred, which achieved an excellent performance with an accuracy of 93.23 %, a sensitivity of 100 %, a specificity of 89.01 %, a Matthew's correlation coefficient of 87.68 % and an area under the curve value of 0.9975 for ten-fold cross-validation. Based on assessment on the independent test dataset, the proposed model outperformed state-of-the-art available toolkits. For availability and implementation, see: http://lab.malab.cn/~wangchao/softwares/NonClasGP/.
Collapse
|
13
|
Pearson Correlation-Based Feature Selection for Document Classification Using Balanced Training. SENSORS 2020; 20:s20236793. [PMID: 33261136 PMCID: PMC7730850 DOI: 10.3390/s20236793] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/27/2020] [Revised: 11/15/2020] [Accepted: 11/25/2020] [Indexed: 11/18/2022]
Abstract
Documents are stored in a digital form across several organizations. Printing this amount of data and placing it into folders instead of storing digitally is against the practical, economical, and ecological perspective. An efficient way of retrieving data from digitally stored documents is also required. This article presents a real-time supervised learning technique for document classification based on deep convolutional neural network (DCNN), which aims to reduce the impact of adverse document image issues such as signatures, marks, logo, and handwritten notes. The proposed technique’s major steps include data augmentation, feature extraction using pre-trained neural network models, feature fusion, and feature selection. We propose a novel data augmentation technique, which normalizes the imbalanced dataset using the secondary dataset RVL-CDIP. The DCNN features are extracted using the VGG19 and AlexNet networks. The extracted features are fused, and the fused feature vector is optimized by applying a Pearson correlation coefficient-based technique to select the optimized features while removing the redundant features. The proposed technique is tested on the Tobacco3482 dataset, which gives a classification accuracy of 93.1% using a cubic support vector machine classifier, proving the validity of the proposed technique.
Collapse
|
14
|
Predictive Modeling for Frailty Conditions in Elderly People: Machine Learning Approaches. JMIR Med Inform 2020; 8:e16678. [PMID: 32442149 PMCID: PMC7303829 DOI: 10.2196/16678] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2019] [Revised: 01/07/2020] [Accepted: 02/16/2020] [Indexed: 12/15/2022] Open
Abstract
Background Frailty is one of the most critical age-related conditions in older adults. It is often recognized as a syndrome of physiological decline in late life, characterized by a marked vulnerability to adverse health outcomes. A clear operational definition of frailty, however, has not been agreed so far. There is a wide range of studies on the detection of frailty and their association with mortality. Several of these studies have focused on the possible risk factors associated with frailty in the elderly population while predicting who will be at increased risk of frailty is still overlooked in clinical settings. Objective The objective of our study was to develop predictive models for frailty conditions in older people using different machine learning methods based on a database of clinical characteristics and socioeconomic factors. Methods An administrative health database containing 1,095,612 elderly people aged 65 or older with 58 input variables and 6 output variables was used. We first identify and define six problems/outputs as surrogates of frailty. We then resolve the imbalanced nature of the data through resampling process and a comparative study between the different machine learning (ML) algorithms – Artificial neural network (ANN), Genetic programming (GP), Support vector machines (SVM), Random Forest (RF), Logistic regression (LR) and Decision tree (DT) – was carried out. The performance of each model was evaluated using a separate unseen dataset. Results Predicting mortality outcome has shown higher performance with ANN (TPR 0.81, TNR 0.76, accuracy 0.78, F1-score 0.79) and SVM (TPR 0.77, TNR 0.80, accuracy 0.79, F1-score 0.78) than predicting the other outcomes. On average, over the six problems, the DT classifier has shown the lowest accuracy, while other models (GP, LR, RF, ANN, and SVM) performed better. All models have shown lower accuracy in predicting an event of an emergency admission with red code than predicting fracture and disability. In predicting urgent hospitalization, only SVM achieved better performance (TPR 0.75, TNR 0.77, accuracy 0.73, F1-score 0.76) with the 10-fold cross validation compared with other models in all evaluation metrics. Conclusions We developed machine learning models for predicting frailty conditions (mortality, urgent hospitalization, disability, fracture, and emergency admission). The results show that the prediction performance of machine learning models significantly varies from problem to problem in terms of different evaluation metrics. Through further improvement, the model that performs better can be used as a base for developing decision-support tools to improve early identification and prediction of frail older adults.
Collapse
|
15
|
Hepatotoxicity Modeling Using Counter-Propagation Artificial Neural Networks: Handling an Imbalanced Classification Problem. Molecules 2020; 25:molecules25030481. [PMID: 31979300 PMCID: PMC7037161 DOI: 10.3390/molecules25030481] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2019] [Revised: 01/20/2020] [Accepted: 01/21/2020] [Indexed: 12/13/2022] Open
Abstract
Drug-induced liver injury is a major concern in the drug development process. Expensive and time-consuming in vitro and in vivo studies do not reflect the complexity of the phenomenon. Complementary to wet lab methods are in silico approaches, which present a cost-efficient method for toxicity prediction. The aim of our study was to explore the capabilities of counter-propagation artificial neural networks (CPANNs) for the classification of an imbalanced dataset related to idiosyncratic drug-induced liver injury and to develop a model for prediction of the hepatotoxic potential of drugs. Genetic algorithm optimization of CPANN models was used to build models for the classification of drugs into hepatotoxic and non-hepatotoxic class using molecular descriptors. For the classification of an imbalanced dataset, we modified the classical CPANN training algorithm by integrating random subsampling into the training procedure of CPANN to improve the classification ability of CPANN. According to the number of models accepted by internal validation and according to the prediction statistics on the external set, we concluded that using an imbalanced set with balanced subsampling in each learning epoch is a better approach compared to using a fixed balanced set in the case of the counter-propagation artificial neural network learning methodology.
Collapse
|
16
|
Vehicle Classification Using an Imbalanced Dataset Based on a Single Magnetic Sensor. SENSORS 2018; 18:s18061690. [PMID: 29794974 PMCID: PMC6022199 DOI: 10.3390/s18061690] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/13/2018] [Revised: 05/18/2018] [Accepted: 05/23/2018] [Indexed: 11/16/2022]
Abstract
This paper aims to improve the accuracy of automatic vehicle classifiers for imbalanced datasets. Classification is made through utilizing a single anisotropic magnetoresistive sensor, with the models of vehicles involved being classified into hatchbacks, sedans, buses, and multi-purpose vehicles (MPVs). Using time domain and frequency domain features in combination with three common classification algorithms in pattern recognition, we develop a novel feature extraction method for vehicle classification. These three common classification algorithms are the k-nearest neighbor, the support vector machine, and the back-propagation neural network. Nevertheless, a problem remains with the original vehicle magnetic dataset collected being imbalanced, and may lead to inaccurate classification results. With this in mind, we propose an approach called SMOTE, which can further boost the performance of classifiers. Experimental results show that the k-nearest neighbor (KNN) classifier with the SMOTE algorithm can reach a classification accuracy of 95.46%, thus minimizing the effect of the imbalance.
Collapse
|