1
Robert Vincent ACS, Sengan S. Effective clinical decision support implementation using a multi filter and wrapper optimisation model for Internet of Things based healthcare data. Sci Rep 2024; 14:21820. [PMID: 39294200 PMCID: PMC11410983 DOI: 10.1038/s41598-024-71726-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2024] [Accepted: 08/30/2024] [Indexed: 09/20/2024] Open
Abstract
Feature Selection (FS) is essential in Internet of Things (IoT)-based Clinical Decision Support Systems (CDSS) to improve the accuracy and efficiency of the system. With the increasing number of sensors and devices used in healthcare, the volume of data generated is vast and complex. Selecting relevant features from this data is crucial in reducing computational overhead, improving the system's interpretability, and enhancing the Decision-Making System (DMS) quality. FS also aids in addressing the problems of data redundancy and noise, which can negatively impact the system's performance. FS is critical to developing practical and dependable CDSS in IoT-based healthcare sectors. This research proposes a two-phase FS model. Phase-I employs an ensemble of five Filter Methods (FM), followed by a Pearson Correlation Method (PCM). Phase-II uses the Binary Optimized Genetic Grey Wolf Optimization Algorithm (BOGGWOA) as a Wrapper Method (WM). The recommended model integrates the most valuable features of each filter. It then uses the Pearson Correlation Coefficient (PCC) to remove features that are not needed, a Support Vector Machine (SVM) to estimate the Classification Accuracy (CA) of the remaining features, and BOGGWOA as the WM to pick the most essential features with the best CA.
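The Pearson correlation step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0.9 threshold, and the toy sensor columns are assumptions made only for illustration, and the abstract does not state the cut-off actually used.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / sqrt(var_x * var_y)

def drop_redundant(columns, ranked_names, threshold=0.9):
    """Walk features in rank order; skip any feature whose absolute
    correlation with an already kept feature reaches the threshold."""
    kept = []
    for name in ranked_names:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data: "pulse" nearly duplicates "heart_rate".
columns = {
    "heart_rate": [60, 72, 85, 90, 110],
    "pulse": [61, 71, 86, 91, 109],
    "temp": [36.6, 38.1, 36.9, 37.8, 36.5],
}
selected = drop_redundant(columns, ["heart_rate", "pulse", "temp"])
print(selected)
```

Features are visited in the order produced by the filter ranking, so the higher-ranked member of a correlated pair is the one that survives.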
Affiliation(s)
- Sudhakar Sengan
- Department of Computer Science and Engineering, PSN College of Engineering and Technology, Tirunelveli, Tamil Nadu, 627451, India.
2
Daneshvar NHN, Masoudi-Sobhanzadeh Y, Omidi Y. A voting-based machine learning approach for classifying biological and clinical datasets. BMC Bioinformatics 2023; 24:140. [PMID: 37041456 PMCID: PMC10088226 DOI: 10.1186/s12859-023-05274-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/26/2022] [Accepted: 04/05/2023] [Indexed: 04/13/2023] Open
Abstract
BACKGROUND Different machine learning techniques have been proposed to classify a wide range of biological/clinical data. Given the practicability of these approaches, various software packages have also been designed and developed. However, the existing methods suffer from several limitations such as overfitting on a specific dataset, ignoring the feature selection concept in the preprocessing step, and losing their performance on large-size datasets. To tackle the mentioned restrictions, in this study, we introduced a machine learning framework consisting of two main steps. First, our previously suggested optimization algorithm (Trader) was extended to select a near-optimal subset of features/genes. Second, a voting-based framework was proposed to classify the biological/clinical data with high accuracy. To evaluate the efficiency of the proposed method, it was applied to 13 biological/clinical datasets, and the outcomes were comprehensively compared with the prior methods. RESULTS The results demonstrated that the Trader algorithm could select a near-optimal subset of features at a significance level of p < 0.01 relative to the compared algorithms. Additionally, on the large-size datasets, the proposed machine learning framework improved prior studies by ~10% in terms of the mean values associated with fivefold cross-validation of accuracy, precision, recall, specificity, and F-measure. CONCLUSION Based on the obtained results, it can be concluded that a proper configuration of efficient algorithms and methods can increase the prediction power of machine learning approaches and help researchers in designing practical diagnosis health care systems and offering effective treatment plans.
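The abstract does not say how the votes are combined, so the following is only a sketch of plain hard (majority) voting, one common way to build a voting-based classifier. The function name and the tie-breaking rule are assumptions.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one list of labels per classifier, all of equal length.
    Returns the most frequent label per sample; ties go to the label that
    appears first in classifier order (an arbitrary but deterministic rule)."""
    voted = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        voted.append(next(label for label in labels if counts[label] == top))
    return voted

# Three hypothetical classifiers predicting three samples.
predictions = [
    ["tumour", "normal", "tumour"],
    ["tumour", "tumour", "normal"],
    ["normal", "normal", "normal"],
]
print(majority_vote(predictions))
```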
Affiliation(s)
- Yosef Masoudi-Sobhanzadeh
- Research Center for Pharmaceutical Nanotechnology, Biomedicine Institute, Tabriz University of Medical Sciences, Tabriz, Iran.
- Faculty of Advanced Medical Sciences, Tabriz University of Medical Sciences, Tabriz, Iran.
- Yadollah Omidi
- Department of Pharmaceutical Sciences, College of Pharmacy, Nova Southeastern University, Florida, 33328, USA.
3
Navin K, Nehemiah HK, Nancy Jane Y, Veena Saroji H. A classification framework using filter–wrapper based feature selection approach for the diagnosis of congenital heart failure. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2023. [DOI: 10.3233/jifs-221348] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/22/2023]
Abstract
Premature mortality from cardiovascular disease can be reduced with early detection of heart failure by analysing the patients’ risk factors and assuring accurate diagnosis. This work proposes a clinical decision support system for the diagnosis of congenital heart failure by utilizing a data pre-processing approach for dealing with missing values and a filter-wrapper based method for selecting the most relevant features. Missing values are imputed using the missForest method in four out of eight heart disease datasets collected from the Machine Learning Repository maintained by the University of California, Irvine. The Fast Correlation Based Filter is used as the filter approach, while the union of the Atom Search Optimization Algorithm and the Henry Gas Solubility Optimization represents the wrapper-based algorithms, with the fitness function as the combination of accuracy, G-mean, and Matthews correlation coefficient measured by the Support Vector Machine. A total of four boosted classifiers, namely XGBoost, AdaBoost, CatBoost, and LightGBM, are trained using the selected features. The proposed work achieves an accuracy of 89%, 84%, 83%, 80% for Heart Failure Clinical Records, 81%, 80%, 83%, 82% for Single Photon Emission Computed Tomography, 90%, 82%, 93%, 80% for Single Photon Emission Computed Tomography F, 80%, 80%, 81%, 80% for Statlog Heart Disease, 80%, 85%, 83%, 86% for Cleveland Heart Disease, 82%, 85%, 85%, 82% for Hungarian Heart Disease, 80%, 81%, 79%, 82% for VA Long Beach, and 97%, 89%, 98%, 97% for Switzerland Heart Disease for the four classifiers respectively. The suggested technique outperformed the other classifiers when evaluated against Random Forest, Classification and Regression Trees, Support Vector Machine, and K-Nearest Neighbor.
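The three measures named in the fitness function can be computed from a binary confusion matrix, as in the sketch below. How the paper combines them is not stated in the abstract, so the unweighted mean used here is an assumption, as are the function name and the zero-division conventions.

```python
from math import sqrt

def fitness(tp, fp, tn, fn):
    """Combine accuracy, G-mean and Matthews correlation coefficient (MCC)
    from binary confusion-matrix counts. The unweighted mean is an
    illustrative assumption, not the paper's stated formula."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    g_mean = sqrt(sensitivity * specificity)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return (accuracy + g_mean + mcc) / 3

# Hypothetical counts: 40 true positives, 20 false positives,
# 30 true negatives, 10 false negatives.
print(round(fitness(tp=40, fp=20, tn=30, fn=10), 4))
```

A wrapper search would call such a function once per candidate feature subset, using the counts obtained from the Support Vector Machine's predictions on held-out data.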
Affiliation(s)
- K.S. Navin
- Ramanujan Computing Centre, Anna University, Chennai, India
- Y. Nancy Jane
- Department of Computer Technology, Madras Institute of Technology, Chennai, India
- H. Veena Saroji
- Assistant Director Planning, Directorate of Health Services, Kerala, India
4
Robust classification of heart valve sound based on adaptive EMD and feature fusion. PLoS One 2022; 17:e0276264. [PMID: 36480575 PMCID: PMC9731417 DOI: 10.1371/journal.pone.0276264] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2022] [Accepted: 10/03/2022] [Indexed: 12/13/2022] Open
Abstract
Cardiovascular disease (CVD) is considered one of the leading causes of death worldwide. In recent years, this research area has attracted researchers' attention to investigate heart sounds to diagnose the disease. To effectively distinguish heart valve defects from normal heart sounds, adaptive empirical mode decomposition (EMD) and feature fusion techniques were used to analyze the classification of heart sounds. Based on the correlation coefficient and Root Mean Square Error (RMSE) method, adaptive EMD was proposed under the condition of screening the intrinsic mode function (IMF) components. Adaptive thresholds based on Hausdorff Distance were used to choose the IMF components used for reconstruction. The multidimensional features extracted from the reconstructed signal were ranked and selected. The features of waveform transformation, energy and heart sound signal can indicate the state of heart activity corresponding to various heart sounds. Here, a set of ordinary features were extracted from the time, frequency and nonlinear domains. To extract more compelling features and achieve better classification results, another four cardiac reserve time features were fused. The fusion features were sorted using six different feature selection algorithms. Three classifiers, random forest, decision tree, and K-nearest neighbor, were trained on open source and our databases. Compared to the previous work, our extensive experimental evaluations show that the proposed method can achieve the best results and have the highest accuracy of 99.3% (1.9% improvement in classification accuracy). The excellent results verified the robustness and effectiveness of the fusion features and proposed method.
5
Nematzadeh H, García-Nieto J, Navas-Delgado I, Aldana-Montes JF. Automatic frequency-based feature selection using discrete weighted evolution strategy. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.109699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
6
Abdelwahed NM, El-Tawel GS, Makhlouf MA. Effective hybrid feature selection using different bootstrap enhances cancers classification performance. BioData Min 2022; 15:24. [PMID: 36175944 PMCID: PMC9523996 DOI: 10.1186/s13040-022-00304-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/11/2022] [Accepted: 08/31/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Machine learning can be used to predict the different onset of human cancers. Highly dimensional data have enormous, complicated problems. One of these is an excessive number of genes plus over-fitting, fitting time, and classification accuracy. Recursive Feature Elimination (RFE) is a wrapper method for selecting the subset of features that yields the best accuracy. Despite the high performance of RFE, computation time and over-fitting are two disadvantages of this algorithm. Random forest for selection (RFS) proves its effectiveness in selecting the effective features and improving the over-fitting problem. METHOD This paper proposed a method, namely, positions first bootstrap step (PFBS) random forest selection recursive feature elimination (RFS-RFE), abbreviated PFBS-RFS-RFE, to enhance cancer classification performance. It used a bootstrap at several positions: the outer first bootstrap step (OFBS), the inner first bootstrap step (IFBS), and the outer/inner first bootstrap step (O/IFBS). In the first position, OFBS is applied as a resampling method (bootstrap) with replacement before the selection step. The RFS is applied with bootstrap = false, i.e., the whole dataset is used to build each tree. The important features are hybridized with RFE to select the most relevant subset of features. In the second position, IFBS is applied as a resampling method (bootstrap) with replacement while RFS is applied. The important features are hybridized with RFE. In the third position, O/IFBS is applied as a hybrid of the first and second positions. RFE used logistic regression (LR) as an estimator. The proposed methods are incorporated with four classifiers to solve the feature selection problems and modify the performance of RFE, in which five datasets of different sizes are used to assess the performance of the PFBS-RFS-RFE.
RESULTS The results showed that the O/IFBS-RFS-RFE achieved the best performance compared with previous work and enhanced the accuracy, variance and ROC area for RNA gene and dermatology erythemato-squamous diseases datasets to become 99.994%, 0.0000004, 1.000 and 100.000%, 0.0 and 1.000, respectively. CONCLUSION High dimensional datasets and the RFE algorithm face many troubles in cancer classification performance. PFBS-RFS-RFE is proposed to fix these troubles with different positions. The important features extracted from RFS are used with RFE to obtain the effective features.
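Two building blocks mentioned in this abstract, resampling with replacement and recursive feature elimination, can be sketched generically as follows. This is not the PFBS-RFS-RFE procedure itself: the importance scores are a hypothetical stand-in for a fitted random forest or logistic regression, and all names are illustrative.

```python
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap resample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def recursive_elimination(features, importance_fn, n_keep):
    """Repeatedly re-score the remaining features and drop the weakest
    one until only n_keep features are left."""
    remaining = list(features)
    while len(remaining) > n_keep:
        scores = importance_fn(remaining)  # a real model would be refit here
        remaining.remove(min(remaining, key=scores.get))
    return remaining

# Hypothetical fixed importances standing in for a fitted estimator.
TOY_IMPORTANCE = {"g1": 0.40, "g2": 0.05, "g3": 0.30, "g4": 0.25}

def toy_importance(remaining):
    return {f: TOY_IMPORTANCE[f] for f in remaining}

rng = random.Random(0)
resample = bootstrap_indices(6, rng)
kept = recursive_elimination(["g1", "g2", "g3", "g4"], toy_importance, n_keep=2)
print(kept)
```

In an outer-bootstrap arrangement the resampled rows would be drawn once before selection; in an inner arrangement the resampling would happen inside the tree-building step of the importance estimator.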
Affiliation(s)
- Noura Mohammed Abdelwahed
- Department of Information Systems, Faculty of Computers and Informatics, Suez Canal University, Ismailia, Egypt.
- Gh S El-Tawel
- Department of Computer Science, Faculty of Computers and Informatics, Suez Canal University, Ismailia, Egypt
- M A Makhlouf
- Department of Information Systems, Faculty of Computers and Informatics, Suez Canal University, Ismailia, Egypt
7
Abasabadi S, Nematzadeh H, Motameni H, Akbari E. Hybrid feature selection based on SLI and genetic algorithm for microarray datasets. THE JOURNAL OF SUPERCOMPUTING 2022; 78:19725-19753. [PMID: 35789817 PMCID: PMC9244444 DOI: 10.1007/s11227-022-04650-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Accepted: 06/08/2022] [Indexed: 06/15/2023]
Abstract
One of the major problems in microarray datasets is the large number of features, which causes the issue of "the curse of dimensionality" when machine learning is applied to these datasets. Feature selection refers to the process of finding optimal feature set by removing irrelevant and redundant features. It has a significant role in pattern recognition, classification, and machine learning. In this study, a new and efficient hybrid feature selection method, called Garank&rand, is presented. The method combines a wrapper feature selection algorithm based on the genetic algorithm (GA) with a proposed filter feature selection method, SLI-γ. In Garank&rand, some initial solutions are built regarding the most relevant features based on SLI-γ, and the remaining ones are only the random features. Eleven high-dimensional and standard datasets were used for the accuracy evaluation of the proposed SLI-γ. Additionally, four high-dimensional well-known datasets of microarray experiments were used to carry out an extensive experimental study for the performance evaluation of Garank&rand. This experimental analysis showed the robustness of the method as well as its ability to obtain highly accurate solutions at the earlier stages of the GA evolutionary process. Finally, the performance of Garank&rand was also compared to the results of GA to highlight its competitiveness and its ability to successfully reduce the original feature set size and execution time.
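One possible reading of the seeding idea in this abstract is that part of the genetic algorithm's starting population is built from the top of a filter ranking while the rest is random. The sketch below follows that reading only; the function name, parameters, and the toy ranking are assumptions, and SLI-γ itself is not implemented.

```python
import random

def initial_population(ranked, n_features, pop_size, n_seeded, subset_size, seed=0):
    """Build binary chromosomes (1 = feature selected).
    The first n_seeded individuals select the top-ranked features;
    the remaining individuals select features uniformly at random."""
    rng = random.Random(seed)
    top = set(ranked[:subset_size])
    population = []
    for i in range(pop_size):
        if i < n_seeded:
            chosen = top
        else:
            chosen = set(rng.sample(range(n_features), subset_size))
        population.append([1 if j in chosen else 0 for j in range(n_features)])
    return population

# Hypothetical filter ranking of 8 features, best first.
ranked = [4, 1, 7, 0, 2, 3, 5, 6]
population = initial_population(ranked, n_features=8, pop_size=5,
                                n_seeded=2, subset_size=3)
for chromosome in population:
    print(chromosome)
```

Seeding in this way gives the search a few good starting points while the random individuals preserve diversity.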
Affiliation(s)
- Sedighe Abasabadi
- Department of Computer Engineering, Sari Branch, Islamic Azad University, Sari, Iran
- Hossein Nematzadeh
- Department of Computer Engineering, Sari Branch, Islamic Azad University, Sari, Iran
- Homayun Motameni
- Department of Computer Engineering, Sari Branch, Islamic Azad University, Sari, Iran
- Ebrahim Akbari
- Department of Computer Engineering, Sari Branch, Islamic Azad University, Sari, Iran
8
Anuratha K, Parvathy M. Twitter Sentiment Analysis Using Social-Spider Lex Feature-Based Syntactic-Senti Rule Recurrent Neural Network Classification. INT J UNCERTAIN FUZZ 2022. [DOI: 10.1142/s0218488522400037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Social platforms have become one of the major sources of unstructured text. Investigating the unstructured text and interpreting the meaning is a complex job. Sentiment Analysis is an emerging approach as the social platforms have a lot of opinionated data. It uses language processing, classification of texts and linguistics to retrieve the opinions from the text. Twitter is a micro blogging site which is popular amongst social users as it is a vast open data-platform and it witnesses a lot of sentiments. Twitter Sentiment Analysis is a process of automatic mining of user tweets for opinions, emotions, and attitudes to derive useful insights into community opinions and classify the opinions as well. Due to the enormous increase in the number of collaborative tweets, it has become complex to identify the terms that carry sentiments. Also, the unstructured tweets may have non-relevant terms that reduce the classification accuracy. To address these issues, we propose a Social-Spider Lex Feature Ensemble Model-Based Syntactic-Senti Rule prediction Recurrent Neural Network Classifier (S2LFEM-S2RRNN) to obtain better classification accuracy. Twitter is used as the source of data and we have extracted the tweets using the Twitter API. Initially, data pre-processing is done to remove unwanted data and symbols, and content terms are extracted to improve the dataset. Then, the significant lexical content terms are extracted employing the proposed Social Spider Lex Feature Ensemble Model (S2LFEM) based on Syntactic-Senti Rule Prediction. The semantics of the terms are analysed on the verbs and subjectivity of the tweet patterns to count the overall weightage of tweets. Based on tweet weightage, a Recurrent Neural Network is trained to classify the tweets into positive, negative and neutral. The experiment results show that the proposed classifier outperforms the existing models for sentiment classification in terms of accuracy, with a performance score of 94.1%.
Affiliation(s)
- K. Anuratha
- Department of Information Technology, Sri Sai Ram Institute of Technology, Chennai, Tamil Nadu, India
- Anna University, Chennai, Tamil Nadu, India
- M. Parvathy
- Department of Computer Science and Engineering, Sethu Institute of Technology, Pulloor, Tamil Nadu, India
9
Zhang H, Gong M, Nie F, Li X. Unified Dual-label Semi-supervised Learning with Top-k Feature Selection. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.05.090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
10
Wang Y, Gao X, Ru X, Sun P, Wang J. A hybrid feature selection algorithm and its application in bioinformatics. PeerJ Comput Sci 2022; 8:e933. [PMID: 35494789 PMCID: PMC9044222 DOI: 10.7717/peerj-cs.933] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 01/21/2022] [Accepted: 03/03/2022] [Indexed: 06/14/2023]
Abstract
Feature selection is an independent technology for high-dimensional datasets that has been widely applied in a variety of fields. With the vast expansion of information, such as bioinformatics data, there has been an urgent need to investigate more effective and accurate methods involving feature selection in recent decades. Here, we proposed the hybrid MMPSO method, by combining the feature ranking method and the heuristic search method, to obtain an optimal subset that can be used for higher classification accuracy. In this study, ten datasets obtained from the UCI Machine Learning Repository were analyzed to demonstrate the superiority of our method. The MMPSO algorithm outperformed other algorithms in terms of classification accuracy while utilizing the same number of features. Then we applied the method to a biological dataset containing gene expression information about liver hepatocellular carcinoma (LIHC) samples obtained from The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx). On the basis of the MMPSO algorithm, we identified an 18-gene signature that performed well in distinguishing normal samples from tumours. Nine of the 18 differentially expressed genes were significantly up-regulated in LIHC tumour samples, and the area under the curve (AUC) of the combination of seven genes (ADRA2B, ERAP2, NPC1L1, PLVAP, POMC, PYROXD2, TRIM29) in classifying tumours with normal samples was greater than 0.99. Six genes (ADRA2B, PYROXD2, CACHD1, FKBP1B, PRKD1 and RPL7AP6) were significantly correlated with survival time. The MMPSO algorithm can be used to effectively extract features from a high-dimensional dataset, which will provide new clues for identifying biomarkers or therapeutic targets from biological data and more perspectives in tumor research.
Affiliation(s)
- Yangyang Wang
- School of Electronics and Information, Northwestern Polytechnical University, Xi’an, Shaanxi, China
- Xiaoguang Gao
- School of Electronics and Information, Northwestern Polytechnical University, Xi’an, Shaanxi, China
- Xinxin Ru
- School of Electronics and Information, Northwestern Polytechnical University, Xi’an, Shaanxi, China
- Pengzhan Sun
- School of Electronics and Information, Northwestern Polytechnical University, Xi’an, Shaanxi, China
- Jihan Wang
- Institute of Medical Research, Northwestern Polytechnical University, Xi’an, Shaanxi, China
11
Alvarez-Gonzalez R, Mendez-Vazquez A. Deep Learning Architecture Reduction for fMRI Data. Brain Sci 2022; 12:brainsci12020235. [PMID: 35203997 PMCID: PMC8870362 DOI: 10.3390/brainsci12020235] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/16/2021] [Accepted: 01/12/2022] [Indexed: 11/16/2022] Open
Abstract
In recent years, deep learning models have demonstrated an inherently better ability to tackle non-linear classification tasks, due to advances in deep learning architectures. However, much remains to be achieved, especially in designing deep convolutional neural network (CNN) configurations. The number of hyper-parameters that need to be optimized to achieve accuracy in classification problems increases with every layer used, and the selection of kernels in each CNN layer has an impact on the overall CNN performance in the training stage, as well as in the classification process. When a popular classifier fails to perform acceptably in practical applications, it may be due to deficiencies in the algorithm and data processing. Thus, understanding the feature extraction process provides insights to help optimize pre-trained architectures, better generalize the models, and obtain the context of each layer’s features. In this work, we aim to improve feature extraction through the use of a texture amortization map (TAM). An algorithm was developed to obtain characteristics from the filters amortizing the filter’s effect depending on the texture of the neighboring pixels. From the initial algorithm, a novel geometric classification score (GCS) was developed, in order to obtain a measure that indicates the effect of one class on another in a classification problem, in terms of the complexity of the learnability in every layer of the deep learning architecture. For this, we assume that all the data transformations in the inner layers still belong to a Euclidean space. In this scenario, we can evaluate which layers provide the best transformations in a CNN, allowing us to reduce the weights of the deep learning architecture using the geometric hypothesis.
12
Lin S, Lin Y, Wu K, Wang Y, Feng Z, Duan M, Liu S, Fan Y, Huang L, Zhou F. FeCO3, constructing the network biomarkers using the inter-feature correlation coefficients and its application in detecting high-order breast cancer biomarkers. Curr Bioinform 2022. [DOI: 10.2174/1574893617666220124123303] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Aims:
This study aims to formulate the inter-feature correlation as the engineered features.
Background:
Modern biotechnologies tend to generate a huge number of characteristics of a sample, while an OMIC dataset usually has a few dozens or hundreds of samples due to the high costs of generating the OMIC data. So many bio-OMIC studies assumed the inter-feature independence and selected a feature with a high phenotype-association.
Objective:
However, many features are closely associated with each other due to their physical or functional interactions, which may be utilized as a new view of features.
Method:
This study proposed a feature engineering algorithm based on the correlation coefficients (FeCO3) by utilizing the correlations between a given sample and a few reference samples. A comprehensive evaluation was carried out for the proposed FeCO3 network features using 24 bio-OMIC datasets.
Result:
The experimental data suggested that the newly calculated FeCO3 network features tended to achieve better classification performances than the original features, using the same popular feature selection and classification algorithms. The FeCO3 network features were also consistently supported by the literature. FeCO3 was utilized to investigate the high-order engineered biomarkers of breast cancer, and detected the PBX2 gene (Pre-B-Cell Leukemia Transcription Factor 2) as one of the candidate breast cancer biomarkers. Although the two methylated residues cg14851325 (Pvalue=8.06e-2) and cg16602460 (Pvalue=1.19e-1) within PBX2 did not have statistically significant association with breast cancers, the high-order inter-feature correlations showed a significant association with breast cancers.
Conclusion:
The proposed FeCO3 network features calculated the high-order inter-feature correlations as novel features, and may facilitate the investigations of complex diseases from this new perspective. The source code is available in FigShare at 10.6084/m9.figshare.13550051 or the web site http://www.healthinformaticslab.org/supp/ .
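The Method paragraph describes engineering features from the correlations between a given sample and a few reference samples. The sketch below shows only that general idea under stated assumptions: each sample is a vector of feature values, Pearson correlation is used, and the reference samples are hypothetical. It is not the published FeCO3 code, which the abstract says is available from the authors.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: treated as uncorrelated (assumption)
    return cov / sqrt(var_x * var_y)

def correlation_features(sample, references):
    """Return one engineered feature per reference sample: the correlation
    between the sample's profile and that reference's profile."""
    return [pearson(sample, ref) for ref in references]

# Hypothetical profiles over four original features.
sample = [1.0, 2.0, 3.0, 4.0]
references = [[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
print(correlation_features(sample, references))
```

The engineered vector has one entry per reference sample regardless of how many original features there are, which is one way such correlations can serve as a compact new view of the data.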
Affiliation(s)
- Shenggeng Lin
- College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China
- State Key Laboratory of Microbial Metabolism, and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Yuqi Lin
- College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China
- Kexin Wu
- College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China
- Yueying Wang
- College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China
- Department of Epidemiology and Biostatistics, School of Public Health, Jilin University, Changchun, Jilin Province, China
- Zixuan Feng
- College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China
- Meiyu Duan
- College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China
- Shuai Liu
- College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China
- Yusi Fan
- College of Software, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China
- Lan Huang
- College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China
- Fengfeng Zhou
- College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China
13
An C, Zhou Q, Yang S. A reinforcement learning guided adaptive cost-sensitive feature acquisition method. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.108437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
14
Two-way threshold-based intelligent water drops feature selection algorithm for accurate detection of breast cancer. Soft comput 2021. [DOI: 10.1007/s00500-021-06498-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 12/29/2022]
15
Zhang J, Wen X, Cho A, Whang M. An Empathy Evaluation System Using Spectrogram Image Features of Audio. SENSORS 2021; 21:s21217111. [PMID: 34770419 PMCID: PMC8587789 DOI: 10.3390/s21217111] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2021] [Revised: 10/17/2021] [Accepted: 10/25/2021] [Indexed: 12/01/2022]
Abstract
Watching videos online has become part of a relaxed lifestyle. The music in videos has a sensitive influence on human emotions, perception, and imaginations, which can make people feel relaxed or sad, and so on. Therefore, it is particularly important for people who make advertising videos to understand the relationship between the physical elements of music and empathy characteristics. The purpose of this paper is to analyze the music features in an advertising video and extract the music features that make people empathize. This paper combines both methods of the power spectrum of MFCC and image RGB analysis to find the audio feature vector. In spectral analysis, the eigenvectors obtained in the analysis process range from blue (low range) to green (medium range) to red (high range). The machine learning random forest classifier is used to classify the data obtained by machine learning, and the trained model is used to monitor the development of an advertisement empathy system in real time. The result is that the optimal model is obtained with the training accuracy result of 99.173% and a test accuracy of 86.171%, which can be deemed as correct by comparing the three models of audio feature value analysis. The contribution of this study can be summarized as follows: (1) the low-frequency and high-amplitude audio in the video is more likely to resonate than the high-frequency and high-amplitude audio; (2) it is found that frequency and audio amplitude are important attributes for describing waveforms by observing the characteristics of the machine learning classifier; (3) a new audio extraction method is proposed to induce human empathy. That is, the feature value extracted by the method of spectrogram image features of audio has the most ability to arouse human empathy.
Collapse
Affiliation(s)
- Jing Zhang
- Department of Emotion Engineering, University of Sangmyung, Seoul 03016, Korea; (J.Z.); (X.W.); (A.C.)
| | - Xingyu Wen
- Department of Emotion Engineering, University of Sangmyung, Seoul 03016, Korea; (J.Z.); (X.W.); (A.C.)
| | - Ayoung Cho
- Department of Emotion Engineering, University of Sangmyung, Seoul 03016, Korea; (J.Z.); (X.W.); (A.C.)
| | - Mincheol Whang
- Department of Human Centered Artificial Intelligence, University of Sangmyung, Seoul 03016, Korea
- Correspondence: Tel.: +82-2-2287-5293
|
16
|
Hamid TMTA, Sallehuddin R, Yunos ZM, Ali A. Ensemble Based Filter Feature Selection with Harmonize Particle Swarm Optimization and Support Vector Machine for Optimal Cancer Classification. Machine Learning with Applications 2021. [DOI: 10.1016/j.mlwa.2021.100054] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 10/21/2022] Open
|
17
|
Mandal M, Singh PK, Ijaz MF, Shafi J, Sarkar R. A Tri-Stage Wrapper-Filter Feature Selection Framework for Disease Classification. Sensors (Basel, Switzerland) 2021; 21:5571. [PMID: 34451013 PMCID: PMC8402295 DOI: 10.3390/s21165571] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 07/01/2021] [Revised: 08/10/2021] [Accepted: 08/13/2021] [Indexed: 12/24/2022]
Abstract
In machine learning and data science, feature selection is considered a crucial step of data preprocessing. When raw data are applied directly for classification or clustering, the learning algorithms sometimes do not perform well. One possible reason is the presence of redundant, noisy, and non-informative features or attributes in the datasets. Hence, feature selection methods are used to identify the subset of relevant features that can maximize model performance. Moreover, because the feature dimension is reduced, both the training time and the storage required by the model can be reduced as well. In this paper, we present a tri-stage wrapper-filter-based feature selection framework for medical report-based disease detection. In the first stage, an ensemble was formed from four filter methods (Mutual Information, ReliefF, Chi Square, and Xvariance); each feature from the union set was then assessed by three classification algorithms (support vector machine, naïve Bayes, and k-nearest neighbors), and an average accuracy was calculated. The features with higher accuracy were selected to obtain a preliminary subset of optimal features. In the second stage, Pearson correlation was used to discard highly correlated features. In these two stages, the XGBoost classification algorithm was applied to obtain the most contributing features that, in turn, provide the best subset. Then, in the final stage, we fed the obtained feature subset to a meta-heuristic algorithm, the whale optimization algorithm, in order to further reduce the feature set and achieve higher accuracy. We evaluated the proposed feature selection framework on four publicly available disease datasets taken from the UCI machine learning repository, namely arrhythmia, leukemia, DLBCL, and prostate cancer. Our results confirm that the proposed method can perform better than many state-of-the-art methods and can detect important features as well. Fewer features mean fewer medical tests are needed for a correct diagnosis, saving both time and cost.
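The second stage described above, discarding highly correlated features with Pearson correlation, can be illustrated with a minimal sketch. This is not the authors' implementation: the 0.9 threshold, the keep-first-seen rule, the function names, and the toy data are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (sx * sy)

def drop_correlated(columns, threshold=0.9):
    """Walk the features in order and keep one only if its |r| with every
    already kept feature is at most `threshold` (assumed value)."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy data (hypothetical): f2 is almost a multiple of f1, f3 is not.
features = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.1, 4.0, 6.2, 8.1, 9.9],
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
selected = drop_correlated(features)
```

Here `f2` is dropped because it is nearly collinear with `f1`, while `f3` survives. Which member of a correlated pair to keep is a design choice; this sketch simply keeps whichever comes first.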
Affiliation(s)
- Moumita Mandal
- Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India; (M.M.); (R.S.)
| | - Pawan Kumar Singh
- Department of Information Technology, Jadavpur University, Kolkata 700106, India;
| | - Muhammad Fazal Ijaz
- Department of Intelligent Mechatronics Engineering, Sejong University, Seoul 05006, Korea
| | - Jana Shafi
- Department of Computer Science, College of Arts and Science, Prince Sattam bin Abdul Aziz University, Wadi Ad-Dwasir 11991, Saudi Arabia;
| | - Ram Sarkar
- Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India; (M.M.); (R.S.)
|
18
|
Prediction of Hypertension Outcomes Based on Gain Sequence Forward Tabu Search Feature Selection and XGBoost. Diagnostics (Basel) 2021; 11:diagnostics11050792. [PMID: 33925766 PMCID: PMC8146551 DOI: 10.3390/diagnostics11050792] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/22/2021] [Revised: 04/23/2021] [Accepted: 04/26/2021] [Indexed: 01/30/2023] Open
Abstract
For patients with hypertension, serious complications such as myocardial infarction, a common cause of heart failure, occur in the late stage of the disease. Hypertension outcomes can lead to complications, including death. Hypertension outcomes threaten patients’ lives and need to be predicted. In our research, we reviewed the hypertension medical data from a tertiary-grade A class hospital in Beijing and established a hypertension outcome prediction model using machine learning. We first proposed a gain sequence forward tabu search feature selection (GSFTS-FS) method, which can search for the optimal combination of medical variables that affect hypertension outcomes. Based on this, a prediction model was built with the XGBoost algorithm, chosen for its good stability. We verified the proposed method by comparing it with other commonly used models in similar works. The proposed GSFTS-FS improved the performance by about 10%. The proposed prediction method has the best performance, and its AUC value, accuracy, F1 value, and recall under 10-fold cross-validation were 0.96, 0.95, 0.88, and 0.82, respectively. It also performed well on test datasets, with 0.92, 0.94, 0.87, and 0.80 for AUC, accuracy, F1, and recall, respectively. Therefore, XGBoost with GSFTS-FS can accurately and effectively predict the occurrence of outcomes for patients with hypertension, and can provide guidance for doctors in clinical diagnoses and medical decision-making.
|
19
|
RIFS2D: A two-dimensional version of a randomly restarted incremental feature selection algorithm with an application for detecting low-ranked biomarkers. Comput Biol Med 2021; 133:104405. [PMID: 33930763 DOI: 10.1016/j.compbiomed.2021.104405] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 02/03/2021] [Revised: 04/13/2021] [Accepted: 04/13/2021] [Indexed: 12/20/2022]
Abstract
The era of big data introduces both opportunities and challenges for biomedical researchers. One of the inherent difficulties in the biomedical research field is recruiting large cohorts of samples, while high-throughput biotechnologies may produce thousands or even millions of features for each sample. Researchers tend to evaluate the individual correlation of each feature with the class label and use the incremental feature selection (IFS) strategy to select the top-ranked features with the best prediction performance. Recent experimental data showed that a subset of consecutively ranked features randomly restarted from a low-ranked feature (an RIFS block) may outperform the subset of top-ranked features. This study proposed a feature selection algorithm, RIFS2D, that integrates multiple RIFS blocks. A comprehensive comparative experiment was conducted with IFS, RIFS, and existing feature selection algorithms, and demonstrated that a subset of low-ranked features may also achieve promising prediction performance. This study suggested that a prediction model with promising performance may be trained on low-ranked features, even when top-ranked features do not achieve satisfactory prediction performance. Further comparative experiments were conducted between RIFS2D and t-tests for the detection of early-stage breast cancer. The data showed that the RIFS2D-recommended features achieved better prediction accuracy and were targeted by more drugs than the t-test top-ranked features.
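The difference between plain IFS (always starting at the top-ranked feature) and a restarted block can be shown with a small sketch. This is not RIFS2D itself: the stopping rule (`patience`), the explicit list of start positions used instead of random restarts, and the toy scoring function standing in for cross-validated accuracy are all assumptions.

```python
def grow_block(ranked, score, start, patience=2):
    """Grow a block of consecutively ranked features beginning at `start`.
    Returns (best_block, best_score); stops after `patience` additions
    in a row that do not improve the score (assumed stopping rule)."""
    best_block, best_score = [], float("-inf")
    block, stale = [], 0
    for feature in ranked[start:]:
        block.append(feature)
        current = score(block)
        if current > best_score:
            best_block, best_score, stale = list(block), current, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_block, best_score

def best_over_starts(ranked, score, starts):
    """Try several start ranks and keep the best-scoring block."""
    return max((grow_block(ranked, score, s) for s in starts),
               key=lambda result: result[1])

# Hypothetical ranking in which the informative features are low-ranked.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
informative = {"g4", "g5"}

def toy_score(block):
    # Stand-in for a classifier's accuracy: reward informative features,
    # penalize block size.
    return len(informative & set(block)) - 0.1 * len(block)

ifs_block, ifs_score = grow_block(ranked, toy_score, start=0)
rifs_block, rifs_score = best_over_starts(ranked, toy_score,
                                          starts=range(len(ranked)))
```

With this toy score, the top-down search stalls at `g1`, while a block restarted at rank 4 finds `g4` and `g5`. In practice the start positions would be drawn at random, for example with `random.sample`.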
|
20
|
Integration of multi-objective PSO based feature selection and node centrality for medical datasets. Genomics 2020; 112:4370-4384. [PMID: 32717320 DOI: 10.1016/j.ygeno.2020.07.027] [Citation(s) in RCA: 69] [Impact Index Per Article: 17.3] [Received: 03/06/2020] [Revised: 06/22/2020] [Accepted: 07/14/2020] [Indexed: 01/19/2023]
Abstract
In the past decades, the rapid growth of computer and database technologies has led to the rapid growth of large-scale medical datasets. On the other hand, medical applications with high-dimensional datasets that require high speed and accuracy are rapidly increasing. One dimensionality reduction approach is feature selection, which can increase the accuracy of disease diagnosis and reduce its computational complexity. In this paper, a novel PSO-based multi-objective feature selection method is proposed. The proposed method consists of three main phases. In the first phase, the original features are represented as a graph. In the next phase, feature centralities for all nodes in the graph are calculated, and finally, in the third phase, an improved PSO-based search process is used for final feature selection. The results on five medical datasets indicate that the proposed method improves on previous related methods in terms of efficiency and effectiveness.
|
21
|
Tarekegn A, Ricceri F, Costa G, Ferracin E, Giacobini M. Predictive Modeling for Frailty Conditions in Elderly People: Machine Learning Approaches. JMIR Med Inform 2020; 8:e16678. [PMID: 32442149 PMCID: PMC7303829 DOI: 10.2196/16678] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 10/15/2019] [Revised: 01/07/2020] [Accepted: 02/16/2020] [Indexed: 12/15/2022] Open
Abstract
Background: Frailty is one of the most critical age-related conditions in older adults. It is often recognized as a syndrome of physiological decline in late life, characterized by a marked vulnerability to adverse health outcomes. A clear operational definition of frailty, however, has not yet been agreed upon. There is a wide range of studies on the detection of frailty and its association with mortality. Several of these studies have focused on the possible risk factors associated with frailty in the elderly population, while predicting who will be at increased risk of frailty is still overlooked in clinical settings. Objective: The objective of our study was to develop predictive models for frailty conditions in older people using different machine learning methods based on a database of clinical characteristics and socioeconomic factors. Methods: An administrative health database containing 1,095,612 elderly people aged 65 or older with 58 input variables and 6 output variables was used. We first identified and defined six problems/outputs as surrogates of frailty. We then addressed the imbalanced nature of the data through a resampling process and carried out a comparative study of different machine learning (ML) algorithms: artificial neural network (ANN), genetic programming (GP), support vector machines (SVM), random forest (RF), logistic regression (LR), and decision tree (DT). The performance of each model was evaluated using a separate unseen dataset. Results: Predicting the mortality outcome showed higher performance with ANN (TPR 0.81, TNR 0.76, accuracy 0.78, F1-score 0.79) and SVM (TPR 0.77, TNR 0.80, accuracy 0.79, F1-score 0.78) than predicting the other outcomes. On average, over the six problems, the DT classifier showed the lowest accuracy, while the other models (GP, LR, RF, ANN, and SVM) performed better. All models showed lower accuracy in predicting an emergency admission with red code than in predicting fracture and disability. In predicting urgent hospitalization, only SVM achieved better performance (TPR 0.75, TNR 0.77, accuracy 0.73, F1-score 0.76) with 10-fold cross-validation compared with other models in all evaluation metrics. Conclusions: We developed machine learning models for predicting frailty conditions (mortality, urgent hospitalization, disability, fracture, and emergency admission). The results show that the prediction performance of machine learning models varies significantly from problem to problem in terms of different evaluation metrics. Through further improvement, the better-performing model can be used as a basis for developing decision-support tools to improve early identification and prediction of frail older adults.
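The four metrics reported above (TPR, TNR, accuracy, F1-score) all derive from the counts of a binary confusion matrix. The sketch below shows the standard definitions; the counts are made up for illustration and are not taken from the study.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts.
    Assumes every denominator is non-zero."""
    tpr = tp / (tp + fn)            # true positive rate (recall, sensitivity)
    tnr = tn / (tn + fp)            # true negative rate (specificity)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    return {"TPR": tpr, "TNR": tnr, "accuracy": accuracy, "F1": f1}

# Hypothetical counts: 100 positive and 100 negative cases.
metrics = binary_metrics(tp=80, fn=20, tn=70, fp=30)
```

Reporting TPR and TNR side by side matters for imbalanced data such as this, because accuracy alone can look high when a model simply favors the majority class.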
Affiliation(s)
- Adane Tarekegn
- Modeling and Data Science, Department of Mathematics, University of Turin, Turin, Italy
| | - Fulvio Ricceri
- Department of Clinical and Biological Sciences, University of Turin, Turin, Italy.,Unit of Epidemiology, Regional Health Service, Local Health Unit Torino 3, Turin, Italy
| | - Giuseppe Costa
- Department of Clinical and Biological Sciences, University of Turin, Turin, Italy.,Unit of Epidemiology, Regional Health Service, Local Health Unit Torino 3, Turin, Italy
| | - Elisa Ferracin
- Unit of Epidemiology, Regional Health Service, Local Health Unit Torino 3, Turin, Italy
| | - Mario Giacobini
- Data Analysis and Modeling Unit, Department of Veterinary Sciences, University of Turin, Turin, Italy
|
22
|
Yoosefzadeh-Najafabadi M, Earl HJ, Tulpan D, Sulik J, Eskandari M. Application of Machine Learning Algorithms in Plant Breeding: Predicting Yield From Hyperspectral Reflectance in Soybean. Frontiers in Plant Science 2020; 11:624273. [PMID: 33510761 PMCID: PMC7835636 DOI: 10.3389/fpls.2020.624273] [Citation(s) in RCA: 55] [Impact Index Per Article: 13.8] [Received: 10/31/2020] [Accepted: 12/10/2020] [Indexed: 05/20/2023]
Abstract
Recent substantial advances in high-throughput field phenotyping have provided plant breeders with affordable and efficient tools for evaluating a large number of genotypes for important agronomic traits at early growth stages. Nevertheless, using the large datasets generated by high-throughput phenotyping tools such as hyperspectral reflectance in cultivar development programs is still challenging because it requires intensive knowledge of computational and statistical analyses. In this study, the robustness of three common machine learning (ML) algorithms, multilayer perceptron (MLP), support vector machine (SVM), and random forest (RF), was evaluated for predicting soybean (Glycine max) seed yield using hyperspectral reflectance. To this end, hyperspectral reflectance data spanning the full spectrum from 395 to 1005 nm were collected at the R4 and R5 growth stages on 250 soybean genotypes grown in four environments. The recursive feature elimination (RFE) approach was performed to reduce the dimensionality of the hyperspectral reflectance data and select variables with the largest importance values. The results indicated that R5 is the more informative stage for measuring hyperspectral reflectance to predict seed yields. The 395 nm reflectance band was also identified as a high-ranked band in predicting soybean seed yield. Using either the full or the selected variables as input, the ML algorithms were evaluated individually and in combination using the ensemble-stacking (E-S) method to predict soybean yield. The RF algorithm had the highest performance, with 84% yield classification accuracy, among all the individually tested algorithms. Therefore, by selecting RF as the metaClassifier for the E-S method, the prediction accuracy increased to 0.93 using all variables and 0.87 using selected variables, showing the success of using E-S as an ensemble technique. This study demonstrated that soybean breeders could implement the E-S algorithm using either the full or the selected spectral reflectance to select high-yielding soybean genotypes, among a large number of genotypes, at early growth stages.
Affiliation(s)
| | - Hugh J. Earl
- Department of Plant Agriculture, University of Guelph, Guelph, ON, Canada
| | - Dan Tulpan
- Department of Animal Biosciences, University of Guelph, Guelph, ON, Canada
| | - John Sulik
- Department of Plant Agriculture, University of Guelph, Guelph, ON, Canada
| | - Milad Eskandari
- Department of Plant Agriculture, University of Guelph, Guelph, ON, Canada
- *Correspondence: Milad Eskandari
|
23
|
A Framework for Feature Selection to Exploit Feature Group Structures. Advances in Knowledge Discovery and Data Mining 2020. [PMCID: PMC7206161 DOI: 10.1007/978-3-030-47426-3_61] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Abstract
Filter feature selection methods play an important role in machine learning tasks when low computational cost, classifier independence, or simplicity is important. Existing filter methods predominantly focus only on the input data and do not take advantage of external sources of correlations within feature groups to improve classification accuracy. We propose a framework that enables supervised filter feature selection methods to exploit feature group information from external sources of knowledge, and use this framework to incorporate feature group information into the minimum Redundancy Maximum Relevance (mRMR) algorithm, resulting in the GroupMRMR algorithm. We show that GroupMRMR achieves high accuracy gains over mRMR (up to ~35%) and other popular filter methods (up to ~50%). GroupMRMR has the same computational complexity as mRMR and therefore does not incur additional computational costs. The proposed method has many real-world applications, particularly those that use genomic, text, and image data whose features demonstrate strong group structures.
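For orientation, the baseline mRMR idea that GroupMRMR builds on can be sketched as a greedy search that trades relevance to the class label against redundancy with features already chosen. The sketch below uses the difference form (relevance minus mean redundancy) on discrete features; it does not implement the group-aware extension, and the toy data and function names are assumptions.

```python
import math
from collections import Counter

def mutual_info(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log((c * n) / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def mrmr(features, target, k):
    """Greedy mRMR: at each step pick the feature maximizing
    relevance(f, target) minus mean redundancy(f, selected)."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        def gain(name):
            relevance = mutual_info(features[name], target)
            if not selected:
                return relevance
            redundancy = sum(mutual_info(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance - redundancy
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (hypothetical): a_dup duplicates a; b is weaker but non-redundant.
target = [0, 0, 0, 0, 1, 1, 1, 1]
toy = {
    "a":     [0, 0, 0, 1, 1, 1, 1, 1],
    "a_dup": [0, 0, 0, 1, 1, 1, 1, 1],
    "b":     [0, 1, 0, 0, 1, 1, 0, 1],
}
chosen = mrmr(toy, target, k=2)
```

Although `a_dup` is as relevant as `a`, it is fully redundant once `a` is selected, so the second pick is `b`. A ranking by relevance alone would have chosen the duplicate.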
|