1
Al-shalif SA, Senan N, Saeed F, Ghaban W, Ibrahim N, Aamir M, Sharif W. A systematic literature review on meta-heuristic based feature selection techniques for text classification. PeerJ Comput Sci 2024; 10:e2084. [PMID: 38983195 PMCID: PMC11232610 DOI: 10.7717/peerj-cs.2084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Accepted: 05/03/2024] [Indexed: 07/11/2024]
Abstract
Feature selection (FS) is a critical step in many data science-based applications, especially in text classification, as it involves selecting relevant and important features from an original feature set. This process can improve learning accuracy, shorten learning time, and simplify outcomes. In text classification, there are often many redundant and irrelevant features that degrade the performance of the applied classifiers, and various techniques have been suggested to tackle this problem, categorized as traditional techniques and meta-heuristic (MH) techniques. To discover the optimal subset of features, FS processes require a search strategy, and MH techniques use various strategies to strike a balance between exploration and exploitation. The goal of this research article is to systematically analyze the MH techniques used for FS between 2015 and 2022, focusing on 108 primary studies from three databases (Scopus, Science Direct, and Google Scholar), to identify the techniques used, as well as their strengths and weaknesses. The findings indicate that MH techniques are efficient and outperform traditional techniques, with the potential for further exploration of MH techniques such as Ringed Seal Search (RSS) to improve FS in several applications.
Affiliation(s)
- Sarah Abdulkarem Al-shalif
- Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia
- Norhalina Senan
- Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia
- Faisal Saeed
- DAAI Research Group, Department of Computing and Data Science, School of Computing and Digital Technology, University of Birmingham, Birmingham, United Kingdom
- Wad Ghaban
- Applied College, University of Tabuk, Tabuk, Saudi Arabia
- Noraini Ibrahim
- Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia
- Muhammad Aamir
- School of Electronics, Computing and Mathematics, University of Derby, Derby, United Kingdom
- Wareesa Sharif
- Faculty of Computing, The Islamia University of Bahawalpur, Bahawalpur, Pakistan
2
Feda AK, Adegboye M, Adegboye OR, Agyekum EB, Fendzi Mbasso W, Kamel S. S-shaped grey wolf optimizer-based FOX algorithm for feature selection. Heliyon 2024; 10:e24192. [PMID: 38293420 PMCID: PMC10825485 DOI: 10.1016/j.heliyon.2024.e24192] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Revised: 12/09/2023] [Accepted: 01/04/2024] [Indexed: 02/01/2024] Open
Abstract
The FOX algorithm is a recently developed metaheuristic approach inspired by the behavior of foxes in their natural habitat. While the FOX algorithm exhibits commendable performance, its basic version may, in complex problem scenarios, become trapped in local optima and fail to identify the optimal solution because of its weak exploitation capabilities. This research addresses a high-dimensional feature selection problem. In feature selection, the most informative features are retained while irrelevant ones are discarded. An enhanced version of the FOX algorithm is proposed, aiming to mitigate its drawbacks in feature selection. The improved approach, referred to as S-shaped Grey Wolf Optimizer-based FOX (FOX-GWO), focuses on augmenting the local search capabilities of the FOX algorithm via the integration of GWO. Additionally, the introduction of an S-shaped transfer function enables the population to explore both binary options throughout the search process. In a series of experiments on 18 datasets with varying dimensions, FOX-GWO outperforms the compared methods on 83.33% of the datasets for average accuracy, 61.11% for reduced feature dimensionality, and 72.22% for average fitness value, indicating that it explores high-dimensional spaces efficiently. These findings highlight its practical value and potential to advance feature selection in complex data analysis, enhancing model prediction accuracy.
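As a rough illustration of the S-shaped transfer function mentioned in the abstract, the sketch below maps a continuous position vector to a 0/1 feature mask. It assumes the common sigmoid form used in binary metaheuristics; the abstract does not give the exact function, and the names `s_shaped` and `binarize` are hypothetical.

```python
import math
import random

def s_shaped(x: float) -> float:
    """Sigmoid transfer function: maps a real value to a probability in (0, 1).

    Assumption: the standard logistic form; the paper may use a variant.
    """
    return 1.0 / (1.0 + math.exp(-x))

def binarize(position, rng):
    """Turn a continuous position vector into a 0/1 feature mask.

    Each dimension becomes 1 (feature kept) with probability s_shaped(x).
    """
    return [1 if rng.random() < s_shaped(x) else 0 for x in position]

rng = random.Random(0)                      # fixed seed for repeatability
position = [-6.0, -0.5, 0.0, 0.5, 6.0]      # toy continuous position
mask = binarize(position, rng)              # one bit per candidate feature
```

Strongly negative coordinates are almost always mapped to 0 and strongly positive ones to 1, while values near zero remain close to a coin flip, which is what lets the population keep exploring both binary options.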
Affiliation(s)
- Afi Kekeli Feda
- Management Information System Department, European University of Lefke, Mersin, 10, Turkey
- Ephraim Bonah Agyekum
- Department of Nuclear and Renewable Energy, Ural Federal University named after the first President of Russia Boris Yeltsin, 620002, 19 Mira Street, Ekaterinburg, Russia
- Wulfran Fendzi Mbasso
- Laboratory of Technology and Applied Sciences, University Institute of Technology, University of Douala, PO Box: 8698, Douala, Cameroon
- Salah Kamel
- Department of Electrical Engineering, Faculty of Engineering, Aswan University, Aswan, 81542, Egypt
3
Rahimi MR, Makarem D, Sarspy S, Mahdavi SA, Albaghdadi MF, Armaghan SM. Classification of cancer cells and gene selection based on microarray data using MOPSO algorithm. J Cancer Res Clin Oncol 2023; 149:15171-15184. [PMID: 37634207 DOI: 10.1007/s00432-023-05308-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2023] [Accepted: 08/16/2023] [Indexed: 08/29/2023]
Abstract
PURPOSE Microarray information is crucial for the identification and categorisation of malignant tissues. The very limited sample size of microarray data has always been a challenge for classifier design in cancer research. As a result, gene selection is applied as a pre-processing step, and genes that carry no useful information are removed from the microarray data prior to categorisation. In essence, an appropriate gene selection technique can significantly increase the accuracy of disease (cancer) classification. METHODS For the classification of high-dimensional microarray data, a novel approach based on a hybrid model of multi-objective particle swarm optimisation (MOPSO) is proposed in this research. First, a binary vector representing each particle's position is initialised at random. Each bit represents a gene: bit 0 denotes that the corresponding feature (gene) is not selected, while bit 1 denotes that it is selected. Therefore, the position of each particle represents a set of genes, and the linear Bayesian discriminant analysis classification algorithm calculates each particle's fitness to assess the quality of the gene set that the particle has chosen. The suggested methodology is applied to four different cancer datasets, and the results are contrasted with those of other approaches currently in use. RESULTS Across the four datasets, the proposed algorithm improves classification accuracy over the other methods by 25.84% on average: 18.63% on the blood cancer dataset, 24.25% on the lung cancer dataset, 27.73% on the breast cancer dataset, and 32.80% on the prostate cancer dataset. Therefore, the proposed algorithm is able to identify a small set of informative genes that increases classification accuracy. CONCLUSION The proposed solution is used for data classification and improves classification accuracy. This is possible because the MOPSO model reduces the number of redundant genes by considering how genes are correlated with each other.
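The binary particle encoding described in the METHODS part can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `decode` and `dominates` are hypothetical, and the choice of objectives (classification error and number of selected genes, both minimised) is an assumption, since the abstract does not list them.

```python
from typing import Sequence, Tuple

def decode(mask: Sequence[int], genes: Sequence[str]) -> list:
    """Return the genes whose bit is 1 in the particle's position."""
    return [g for bit, g in zip(mask, genes) if bit == 1]

def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """Pareto dominance for minimised objectives.

    True if a is no worse than b on every objective and strictly better on one.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

genes = ["g1", "g2", "g3", "g4"]            # toy gene names
subset = decode([1, 0, 0, 1], genes)        # bits 1 and 4 are set

# Assumed objective tuples: (classification error, number of selected genes)
better = dominates((0.10, 2), (0.15, 3))    # lower error with fewer genes
```

In a multi-objective swarm, a dominance check of this kind is what decides which particles enter the archive of non-dominated solutions; the error term would come from the classifier evaluated on the decoded gene subset.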
Affiliation(s)
- Dorna Makarem
- Escuela Tecnica Superior de Ingenieros de Telecomunicacion Politecnica de Madrid, Madrid, Spain
- Sliva Sarspy
- Department of Computer Science, College of Science, Cihan University-Erbil, Erbil, Iraq
4
Blourchi P, Ghasemzadeh A. Majority voting based on different feature ranking techniques from gene expression. Journal of Intelligent & Fuzzy Systems 2023. [DOI: 10.3233/jifs-224029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/03/2023]
Abstract
In bioinformatics studies, many modeling tasks are characterized by high dimensionality, leading to the widespread use of feature selection techniques to reduce dimensionality. There are a multitude of feature selection techniques that have been proposed in the literature, each relying on a single measurement method to select candidate features. This has an impact on the classification performance. To address this issue, we propose a majority voting method that uses five different feature ranking techniques: entropy score, Pearson’s correlation coefficient, Spearman correlation coefficient, Kendall correlation coefficient, and t-test. By using a majority voting approach, only the features that appear in all five ranking methods are selected. This selection process has three key advantages over traditional techniques. Firstly, it is independent of any particular feature ranking method. Secondly, the feature space dimension is significantly reduced compared to other ranking methods. Finally, the performance is improved as the most discriminatory and informative features are selected via the majority voting process. The performance of the proposed method was evaluated using an SVM, and the results were assessed using accuracy, sensitivity, specificity, and AUC on various biomedical datasets. The results demonstrate the superior effectiveness of the proposed method compared to state-of-the-art methods in the literature.
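The voting step can be sketched as below, assuming each ranking technique yields one score per feature and contributes its top-k features as votes. The scores, the cut-off `k`, and the function names are hypothetical; by default a feature must be chosen by every ranker, matching the abstract's statement that only features appearing in all rankings are kept.

```python
from collections import Counter

def top_k(scores, k):
    """Set of the k highest-scoring features for one ranking technique."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def vote_select(rankings, k, min_votes=None):
    """Keep features that at least min_votes rankers place in their top k.

    min_votes defaults to the number of rankers (unanimous agreement).
    """
    if min_votes is None:
        min_votes = len(rankings)
    votes = Counter(f for scores in rankings.values() for f in top_k(scores, k))
    return sorted(f for f, n in votes.items() if n >= min_votes)

# Toy scores for three of the five rankers named in the abstract
rankings = {
    "entropy": {"g1": 0.90, "g2": 0.80, "g3": 0.10, "g4": 0.20},
    "pearson": {"g1": 0.70, "g2": 0.20, "g3": 0.60, "g4": 0.10},
    "t_test":  {"g1": 0.95, "g2": 0.50, "g3": 0.40, "g4": 0.30},
}
unanimous = vote_select(rankings, k=2)               # ['g1']
majority = vote_select(rankings, k=2, min_votes=2)   # ['g1', 'g2']
```

Lowering `min_votes` relaxes the rule from unanimity to a simple majority, which trades a smaller feature set for more tolerance of disagreement between rankers.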
5
Yin K, Zhai J, Xie A, Zhu J. Feature selection using max dynamic relevancy and min redundancy. Pattern Anal Appl 2023. [DOI: 10.1007/s10044-023-01138-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/19/2023]
6
Hybrid Filter and Genetic Algorithm-Based Feature Selection for Improving Cancer Classification in High-Dimensional Microarray Data. Processes (Basel) 2023. [DOI: 10.3390/pr11020562] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/16/2023] Open
Abstract
The advancements in intelligent systems have contributed tremendously to the fields of bioinformatics, health, and medicine. Intelligent classification and prediction techniques have been used to study microarray datasets, which store gene expression information, and to assist in diagnosing chronic diseases such as cancer at an early stage, which is important and challenging. However, the high dimensionality and noisy nature of microarray data lead to slow performance and low cancer classification accuracy when machine learning techniques are used. In this paper, a hybrid filter-genetic feature selection approach is proposed to address the high dimensionality of microarray datasets and ultimately improve cancer classification performance. First, filter feature selection methods, including information gain, information gain ratio, and Chi-squared, are applied to select the most significant features of cancerous microarray datasets. Then, a genetic algorithm is employed to further optimize and enhance the selected features in order to improve the proposed method's capability for cancer classification. To test the proficiency of the proposed scheme, four cancerous microarray datasets were used in the study, namely breast, lung, central nervous system, and brain cancer datasets. The experimental results show that the proposed hybrid filter-genetic feature selection approach achieved better performance across several common machine learning methods in terms of Accuracy, Recall, Precision, and F-measure.
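The two-stage idea (filter first, then genetic search over the survivors) might look roughly like the sketch below. It is not the authors' implementation: the filter scores, the toy fitness function, and all parameter values are placeholder assumptions, and in practice the fitness would be a cross-validated classifier score.

```python
import random

def filter_top(scores, k):
    """Stage 1: keep the k features with the highest filter score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def genetic_search(candidates, fitness, rng, pop_size=20, generations=30, p_mut=0.1):
    """Stage 2: evolve 0/1 masks over the candidates and return the best subset.

    Requires at least two candidates (one-point crossover needs a cut point).
    """
    n = len(candidates)
    decode = lambda m: [c for bit, c in zip(m, candidates) if bit]
    score = lambda m: fitness(decode(m))
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        parents = pop[: pop_size // 2]           # truncation selection keeps the elite
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)            # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]  # bit-flip mutation
            children.append(child)
        pop = parents + children
    return decode(max(pop, key=score))

# Hypothetical filter scores (e.g. information gain) for six features
scores = {"f1": 0.9, "f2": 0.1, "f3": 0.8, "f4": 0.7, "f5": 0.05, "f6": 0.6}
candidates = filter_top(scores, k=4)

# Toy fitness: reward two "informative" features, penalise subset size
informative = {"f1", "f3"}
fitness = lambda subset: len(informative & set(subset)) - 0.1 * len(subset)

best = genetic_search(candidates, fitness, random.Random(42))
```

Running the filter first shrinks the search space the genetic algorithm has to cover, which is the main reason hybrid schemes of this kind are attractive for microarray data with thousands of genes.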
7
Ashraf MT, Hamid I, Nawaz Q, Ali H. Hybrid Approach using Extreme Gradient Boosting (XGBoost) and Evolutionary Algorithm for Cancer Classification. 2023 International Multi-Disciplinary Conference in Emerging Research Trends (IMCERT) 2023. [DOI: 10.1109/imcert57083.2023.10075236] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
Affiliation(s)
- Isma Hamid
- National Textile University, Department of Computer Science, Faisalabad, Pakistan
- Qamar Nawaz
- University of Agriculture, Department of Computer Science, Faisalabad, Pakistan
- Hamid Ali
- National Textile University, Department of Computer Science, Faisalabad, Pakistan
8
Sarkar A, Hossain SKS, Sarkar R. Human activity recognition from sensor data using spatial attention-aided CNN with genetic algorithm. Neural Comput Appl 2023; 35:5165-5191. [PMID: 36311167 PMCID: PMC9596348 DOI: 10.1007/s00521-022-07911-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 03/08/2022] [Accepted: 09/29/2022] [Indexed: 12/01/2022]
Abstract
Capturing time and frequency relationships of time series signals poses an inherent challenge for automatic human activity recognition (HAR) from wearable sensor data. Extracting spatiotemporal context from the feature space of the sensor reading sequence is difficult for current recurrent, convolutional, or hybrid activity recognition models. The overall classification accuracy is also affected by the large feature maps that these models generate. To this end, in this work, we put forth a hybrid architecture for wearable sensor data-based HAR. We initially use Continuous Wavelet Transform to encode the time series of sensor data as multi-channel images. Then, we utilize a Spatial Attention-aided Convolutional Neural Network (CNN) to extract higher-dimensional features. To find the most essential features for recognizing human activities, we develop a novel feature selection (FS) method. In order to identify the fitness of the features for the FS, we first employ three filter-based methods: Mutual Information (MI), Relief-F, and minimum redundancy maximum relevance (mRMR). The best set of features is then chosen by removing the lower-ranked features using a modified version of the Genetic Algorithm (GA). The K-Nearest Neighbors (KNN) classifier is then used to categorize human activities. We conduct comprehensive experiments on five well-known, publicly accessible HAR datasets, namely UCI-HAR, WISDM, MHEALTH, PAMAP2, and HHAR. Our model significantly outperforms the state-of-the-art models in terms of classification performance. We also observe an improvement in overall recognition accuracy when the GA-based FS technique is used with a lower number of features. The source code of the paper is publicly available at https://github.com/apusarkar2195/HAR_WaveletTransform_SpatialAttention_FeatureSelection.
Affiliation(s)
- Apu Sarkar
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
- S. K. Sabbir Hossain
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
- Ram Sarkar
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
9
Nassiri Z, Omranpour H. Learning the transfer function in binary metaheuristic algorithm for feature selection in classification problems. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07869-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2022]
10
Ahmed S, Sheikh KH, Mirjalili S, Sarkar R. Binary Simulated Normal Distribution Optimizer for feature selection: Theory and application in COVID-19 datasets. Expert Systems with Applications 2022; 200:116834. [PMID: 36034050 PMCID: PMC9396289 DOI: 10.1016/j.eswa.2022.116834] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 12/13/2020] [Revised: 02/25/2022] [Accepted: 03/03/2022] [Indexed: 05/04/2023]
Abstract
Classification accuracy achieved by a machine learning technique depends on the feature set used in the learning process. However, it is often found that not all the features extracted for a particular task contribute to the classification process. Feature selection (FS) is an imperative and challenging pre-processing technique that helps to discard the unnecessary and irrelevant features while reducing the computational time and space requirement and increasing the classification accuracy. Generalized Normal Distribution Optimizer (GNDO), a recently proposed meta-heuristic algorithm, can be used to solve any optimization problem. In this paper, a hybrid version of GNDO with Simulated Annealing (SA) called Binary Simulated Normal Distribution Optimizer (BSNDO) is proposed which uses SA as a local search to achieve higher classification accuracy. The proposed method is evaluated on 18 well-known UCI datasets and compared with its predecessor as well as some popular FS methods. Moreover, this method is tested on high dimensional microarray datasets to prove its worth in real-life datasets. On top of that, it is also applied to a COVID-19 dataset for classification purposes. The obtained results prove the usefulness of BSNDO as an FS method. The source code of this work is publicly available at https://github.com/ahmed-shameem/Feature_selection.
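To illustrate how simulated annealing can act as a local search over a binary feature mask, here is a generic sketch. It is not taken from the BSNDO source code: the cost function, cooling schedule, and parameter values are assumptions chosen for a toy example.

```python
import math
import random

def anneal(mask, cost, rng, t0=1.0, cooling=0.95, steps=200):
    """Simulated annealing over bit masks with single-bit-flip neighbours.

    Worse neighbours are accepted with probability exp(-delta / t), so the
    search can escape local optima early and becomes greedy as t shrinks.
    """
    current, current_cost = list(mask), cost(mask)
    best, best_cost = list(current), current_cost
    t = t0
    for _ in range(steps):
        neighbour = list(current)
        i = rng.randrange(len(neighbour))
        neighbour[i] = 1 - neighbour[i]          # flip one feature in or out
        c = cost(neighbour)
        if c <= current_cost or rng.random() < math.exp((current_cost - c) / t):
            current, current_cost = neighbour, c
            if c < best_cost:
                best, best_cost = list(neighbour), c
        t *= cooling                             # geometric cooling (assumed schedule)
    return best, best_cost

# Toy cost: Hamming distance to a hidden "ideal" mask (stand-in for classifier error)
target = [1, 0, 1, 1, 0, 0]
cost = lambda m: sum(a != b for a, b in zip(m, target))

start = [0, 0, 0, 0, 0, 0]
best, best_cost = anneal(start, cost, random.Random(1))
```

In a hybrid of the kind the abstract describes, a routine like this would be called on promising solutions produced by the global optimizer, with the cost computed from a classifier evaluated on the selected features.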
Affiliation(s)
- Shameem Ahmed
- Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
- Khalid Hassan Sheikh
- Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
- Seyedali Mirjalili
- King Abdulaziz University, Jeddah, Saudi Arabia
- Centre for Artificial Intelligence Research and Optimisation, Torrens University Australia, Fortitude Valley, Brisbane, 4006 QLD, Australia
- Yonsei Frontier Lab, Yonsei University, Seoul, Republic of Korea
- Ram Sarkar
- Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
11
Deng X, Li M, Wang L, Wan Q. RFCBF: Enhance the Performance and Stability of Fast Correlation-Based Filter. International Journal of Computational Intelligence and Applications 2022. [DOI: 10.1142/s1469026822500092] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
Feature selection is a preprocessing step that plays a crucial role in the domain of machine learning and data mining. Feature selection methods have been shown to be effective in removing redundant and irrelevant features, improving the learning algorithm’s prediction performance. Among the various methods of feature selection based on redundancy, the fast correlation-based filter (FCBF) is one of the most effective. In this paper, we developed a novel extension of FCBF, called resampling FCBF (RFCBF) that combines resampling technique to improve classification accuracy. We performed comprehensive experiments to compare the RFCBF with other state-of-the-art feature selection methods using three competitive classifiers (K-nearest neighbor, support vector machine, and logistic regression) on 12 publicly available datasets. The experimental results show that the RFCBF algorithm yields significantly better results than previous state-of-the-art methods in terms of classification accuracy and runtime.
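The standard FCBF algorithm ranks and prunes features using symmetrical uncertainty, SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)). The sketch below computes that measure for discrete variables only; it does not attempt to reproduce the resampling step that RFCBF adds, whose details are not given in the abstract. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), in the range [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0                           # both variables are constant
    hxy = entropy(list(zip(x, y)))           # joint entropy H(X, Y)
    mutual_info = hx + hy - hxy              # I(X; Y)
    return 2.0 * mutual_info / (hx + hy)

labels = [0, 0, 1, 1]
copy_of_labels = [0, 0, 1, 1]                # perfectly predictive feature
unrelated = [0, 1, 0, 1]                     # independent of the labels

su_high = symmetrical_uncertainty(copy_of_labels, labels)
su_low = symmetrical_uncertainty(unrelated, labels)
```

A feature identical to the class labels scores 1 and an independent feature scores 0; FCBF keeps features with high SU against the class and drops those that are more strongly correlated with an already selected feature than with the class.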
Affiliation(s)
- Xiongshi Deng
- School of Information Engineering, Nanchang Institute of Technology, No. 289 Tianxiang Road, Nanchang, Jiangxi, P. R. China
- Min Li
- School of Information Engineering, Nanchang Institute of Technology, No. 289 Tianxiang Road, Nanchang, Jiangxi, P. R. China
- Lei Wang
- School of Information Engineering, Nanchang Institute of Technology, No. 289 Tianxiang Road, Nanchang, Jiangxi, P. R. China
- Qikang Wan
- School of Information Engineering, Nanchang Institute of Technology, No. 289 Tianxiang Road, Nanchang, Jiangxi, P. R. China
12
Guney H, Oztoprak H. A robust ensemble feature selection technique for high‐dimensional datasets based on minimum weight threshold method. Comput Intell 2022. [DOI: 10.1111/coin.12524] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Affiliation(s)
- Huseyin Guney
- Computer Engineering Department, Bahçeşehir Cyprus University, Nicosia, North Cyprus, Turkey
- Huseyin Oztoprak
- Electrical and Electronics Engineering Department, Cyprus International University, Nicosia, North Cyprus, Turkey
13
Rotational effect and dosimetric impact: HDMLC vs 5-mm MLC leaf width in single isocenter multiple metastases radiosurgery with Brainlab Elements™. Journal of Radiotherapy in Practice 2022. [DOI: 10.1017/s1460396922000048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Abstract
Purpose:
To analyse the impact of multileaf collimator (MLC) leaf width in multiple metastases radiosurgery (SRS) considering the target distance to isocenter and rotational displacements.
Methods:
Ten plans were optimised. The plans were created with Elements Multiple Mets SRS v2·0 (Brainlab AG, Munchen, Germany). The mean number of metastases per plan was 5 ± 2 [min 3, max 9], and the mean gross tumour volume (GTV) was 1·1 ± 1·3 cc [min 0·02, max 5·1]. The planning target volume margin criterion was based on GTV-isocenter distance and target dimensions. Plans were performed using 6 MV with high-definition MLC (HDMLC) and reoptimised using 5-mm MLC (MLC-5). Plans were compared using Paddick conformity index (PCI), gradient index, monitor units, and volume receiving half of prescription isodose (PIV50); maximum dose to brainstem, optic chiasm and optic nerves, and V12Gy, V10Gy and V5Gy for healthy brain were also analysed. The maximum displacement due to rotational combinations was optimised by a genetic algorithm for both plans. Plans were reoptimised and compared using the optimised margin.
Results:
HDMLC plans had better conformity and higher dose falloff than MLC-5 plans. Dosimetric differences were statistically significant (p < 0·05). The smaller the lesion volume, the higher the dosimetric differences between both plans. The effect of rotational displacements produced for each target in SRS was not dependent on the MLC (p > 0·05).
Conclusions:
The finer HDMLC offers dosimetric advantages compared with the MLC-5 in terms of target conformity and dose to the surrounding organs at risk. However, only dose falloff differences due to rotations depend on MLC.
14
Liu Z, Wang R, Zhang W. Improving the generalization of unsupervised feature learning by using data from different sources on gene expression data for cancer diagnosis. Med Biol Eng Comput 2022; 60:1055-1073. [DOI: 10.1007/s11517-022-02522-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2021] [Accepted: 01/30/2022] [Indexed: 10/19/2022]
15
Kundu R, Chattopadhyay S, Cuevas E, Sarkar R. AltWOA: Altruistic Whale Optimization Algorithm for feature selection on microarray datasets. Comput Biol Med 2022; 144:105349. [PMID: 35303580 DOI: 10.1016/j.compbiomed.2022.105349] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 10/02/2021] [Revised: 02/22/2022] [Accepted: 02/22/2022] [Indexed: 12/15/2022]
Abstract
The data-driven modern era has enabled the collection of large amounts of biomedical and clinical data. DNA microarray gene expression datasets in particular have gained significant attention from the research community owing to their ability to identify diseases through "bio-markers", that is, specific alterations in the gene sequence that represent a particular disease (for example, different types of cancer). However, gene expression datasets are very high-dimensional, while only a few of the genes are "bio-markers". Meta-heuristic-based feature selection filters out only the relevant genes from a large set of attributes, reducing data storage and computation requirements. To this end, in this paper, we propose an Altruistic Whale Optimization Algorithm (AltWOA) for the feature selection problem in high-dimensional microarray data. AltWOA is an improvement on the basic Whale Optimization Algorithm. We embed the concept of altruism in the whale population to help efficient propagation of candidate solutions that can reach the global optima over the iterations. Evaluation of the proposed method on eight high-dimensional microarray datasets reveals the superiority of AltWOA compared to popular and classical techniques in the literature on the same datasets, both in terms of accuracy and the final number of features selected. The relevant codes for the proposed approach are available publicly at https://github.com/Rohit-Kundu/AltWOA.
Affiliation(s)
- Rohit Kundu
- Department of Electrical Engineering, Jadavpur University, Kolkata, 700032, India.
- Soham Chattopadhyay
- Department of Electrical Engineering, Jadavpur University, Kolkata, 700032, India.
- Erik Cuevas
- Departamento de Electrónica, Universidad de Guadalajara, CUCEI, Av. Revolución 1500, Guadalajara, Jal, Mexico.
- Ram Sarkar
- Department of Computer Science & Engineering, Jadavpur University, Kolkata, 700032, India.
16
Optimal Deep Learning Enabled Prostate Cancer Detection Using Microarray Gene Expression. Journal of Healthcare Engineering 2022; 2022:7364704. [PMID: 35310199 PMCID: PMC8930217 DOI: 10.1155/2022/7364704] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 12/15/2021] [Revised: 12/30/2021] [Accepted: 01/15/2022] [Indexed: 12/23/2022]
Abstract
Prostate cancer is a major cause of death across the globe. Early detection and classification of cancer is highly important to improve patient health. Previous studies utilized statistical and machine learning (ML) techniques for prostate cancer detection. However, several challenges exist in the investigation process, including high-dimensional data and a small number of training samples. Metaheuristic algorithms can be used to resolve the curse of dimensionality and improve the detection rate of artificial intelligence (AI) techniques. With this motivation, this article develops an artificial intelligence based feature selection with deep learning model for prostate cancer detection (AIFSDL-PCD) using microarray gene expression data. The AIFSDL-PCD technique involves preprocessing to enhance the input data quality. In addition, a chaotic invasive weed optimization (CIWO) based feature selection (FS) technique for choosing an optimal subset of features constitutes the novelty of the work. Moreover, a deep neural network (DNN) model is applied as a classification model to detect the existence of prostate cancer in the microarray gene expression data. Furthermore, the hyperparameters of the DNN model are adjusted using the RMSprop optimizer. The design of the CIWO-based FS technique helps reduce the computational complexity and improve the classification accuracy. The experimental results highlight the improvement of the AIFSDL-PCD approach over the other techniques with respect to distinct measures.
17
Adaptive feature selection framework for DNA methylation-based age prediction. Soft comput 2022. [DOI: 10.1007/s00500-022-06844-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
18
Deng X, Li M, Deng S, Wang L. Hybrid gene selection approach using XGBoost and multi-objective genetic algorithm for cancer classification. Med Biol Eng Comput 2022; 60:663-681. [PMID: 35028863 DOI: 10.1007/s11517-021-02476-x] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Received: 04/10/2021] [Accepted: 11/23/2021] [Indexed: 12/15/2022]
Abstract
Microarray gene expression data are often accompanied by a large number of genes and a small number of samples. However, only a few of these genes are relevant to cancer, resulting in significant gene selection challenges. Hence, we propose a two-stage gene selection approach by combining extreme gradient boosting (XGBoost) and a multi-objective optimization genetic algorithm (XGBoost-MOGA) for cancer classification in microarray datasets. In the first stage, the genes are ranked using an ensemble-based feature selection using XGBoost. This stage can effectively remove irrelevant genes and yield a group comprising the most relevant genes related to the class. In the second stage, XGBoost-MOGA searches for an optimal gene subset based on the most relevant genes' group using a multi-objective optimization genetic algorithm. We performed comprehensive experiments to compare XGBoost-MOGA with other state-of-the-art feature selection methods using two well-known learning classifiers on 14 publicly available microarray expression datasets. The experimental results show that XGBoost-MOGA yields significantly better results than previous state-of-the-art algorithms in terms of various evaluation criteria, such as accuracy, F-score, precision, and recall.
Affiliation(s)
- Xiongshi Deng
- School of Information Engineering, Nanchang Institute of Technology, Jiangxi, 330099, People's Republic of China.,Jiangxi Province Key Laboratory of Water Information Cooperative Sensing and Intelligent Processing, Jiangxi, 330099, People's Republic of China
| | - Min Li
- School of Information Engineering, Nanchang Institute of Technology, Jiangxi, 330099, People's Republic of China. .,Jiangxi Province Key Laboratory of Water Information Cooperative Sensing and Intelligent Processing, Jiangxi, 330099, People's Republic of China.
| | - Shaobo Deng
- School of Information Engineering, Nanchang Institute of Technology, Jiangxi, 330099, People's Republic of China.,Jiangxi Province Key Laboratory of Water Information Cooperative Sensing and Intelligent Processing, Jiangxi, 330099, People's Republic of China
| | - Lei Wang
- School of Information Engineering, Nanchang Institute of Technology, Jiangxi, 330099, People's Republic of China.,Jiangxi Province Key Laboratory of Water Information Cooperative Sensing and Intelligent Processing, Jiangxi, 330099, People's Republic of China
| |
|
19
|
Alhenawi E, Al-Sayyed R, Hudaib A, Mirjalili S. Feature selection methods on gene expression microarray data for cancer classification: A systematic review. Comput Biol Med 2022; 140:105051. [PMID: 34839186 DOI: 10.1016/j.compbiomed.2021.105051] [Citation(s) in RCA: 36] [Impact Index Per Article: 18.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2021] [Revised: 11/01/2021] [Accepted: 11/15/2021] [Indexed: 11/29/2022]
Abstract
This systematic review provides researchers interested in feature selection (FS) for processing microarray data with comprehensive information about the main research directions for gene expression classification conducted during the recent seven years. A set of 132 researches published by three different publishers is reviewed. The studied papers are categorized into nine directions based on their objectives. The FS directions that received various levels of attention were then summarized. The review revealed that 'propose hybrid FS methods' represented the most interesting research direction with a percentage of 34.9%, while the other directions have lower percentages that ranged from 13.6% down to 3%. This guides researchers to select the most competitive research direction. Papers in each category are thoroughly reviewed based on six perspectives, mainly: method(s), classifier(s), dataset(s), dataset dimension(s) range, performance metric(s), and result(s) achieved.
Affiliation(s)
- Esra'a Alhenawi
- King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan.
| | - Rizik Al-Sayyed
- King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan.
| | - Amjad Hudaib
- King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan.
| | - Seyedali Mirjalili
- Center for Artificial Intelligence Research and Optimization, Torrens University Australia, Fortitude Valley, Brisbane, 4006, QLD, Australia; Yonsei Frontier Lab, Yonsei University, Seoul, South Korea.
| |
|
20
|
Chakraborty A, Ghosh KK, De R, Cuevas E, Sarkar R. Learning automata based particle swarm optimization for solving class imbalance problem. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107959] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
|
21
|
Uzma, Halim Z. An ensemble filter-based heuristic approach for cancerous gene expression classification. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.107560] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
|
22
|
Monitoring Forest Health Using Hyperspectral Imagery: Does Feature Selection Improve the Performance of Machine-Learning Techniques? REMOTE SENSING 2021. [DOI: 10.3390/rs13234832] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/31/2022]
Abstract
This study analyzed highly correlated, feature-rich datasets from hyperspectral remote sensing data using multiple statistical and machine-learning methods. The effect of filter-based feature selection methods on predictive performance was compared. In addition, the effect of multiple expert-based and data-driven feature sets, derived from the reflectance data, was investigated. Defoliation of trees (%), derived from in situ measurements from fall 2016, was modeled as a function of reflectance. Variable importance was assessed using permutation-based feature importance. Overall, the support vector machine (SVM) outperformed other algorithms, such as random forest (RF), extreme gradient boosting (XGBoost), and lasso (L1) and ridge (L2) regressions by at least three percentage points. The combination of certain feature sets showed small increases in predictive performance, while no substantial differences between individual feature sets were observed. For some combinations of learners and feature sets, filter methods achieved better predictive performances than using no feature selection. Ensemble filters did not have a substantial impact on performance. The most important features were located around the red edge. Additional features in the near-infrared region (800–1000 nm) were also essential to achieve the overall best performances. Filter methods have the potential to be helpful in high-dimensional situations and are able to improve the interpretation of feature effects in fitted models, which is an essential constraint in environmental modeling studies. Nevertheless, more training data and replication in similar benchmarking studies are needed to be able to generalize the results.
|
23
|
A novel bio-inspired hybrid multi-filter wrapper gene selection method with ensemble classifier for microarray data. Neural Comput Appl 2021; 35:11531-11561. [PMID: 34539088 PMCID: PMC8435304 DOI: 10.1007/s00521-021-06459-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2020] [Accepted: 08/26/2021] [Indexed: 01/04/2023]
Abstract
Microarray technology is known as one of the most important tools for collecting DNA expression data. This technology allows researchers to investigate and examine types of diseases and their origins. However, microarray data are often associated with a small sample size, a significant number of genes, imbalanced data, etc., making classification models inefficient. Thus, a new hybrid solution based on a multi-filter and adaptive chaotic multi-objective forest optimization algorithm (AC-MOFOA) is presented to solve the gene selection problem and construct the Ensemble Classifier. In the proposed solution, a multi-filter model (i.e., ensemble filter) is proposed as preprocessing step to reduce the dataset's dimensions, using a combination of five filter methods to remove redundant and irrelevant genes. Accordingly, the results of the five filter methods are combined using a voting-based function. Additionally, the results of the proposed multi-filter indicate that it has good capability in reducing the gene subset size and selecting relevant genes. Then, an AC-MOFOA based on the concepts of non-dominated sorting, crowding distance, chaos theory, and adaptive operators is presented. AC-MOFOA as a wrapper method aimed at reducing dataset dimensions, optimizing KELM, and increasing the accuracy of the classification, simultaneously. Next, in this method, an ensemble classifier model is presented using AC-MOFOA results to classify microarray data. The performance of the proposed algorithm was evaluated on nine public microarray datasets, and its results were compared in terms of the number of selected genes, classification efficiency, execution time, time complexity, hypervolume indicator, and spacing metric with five hybrid multi-objective methods, and three hybrid single-objective methods. 
According to the results, the proposed hybrid method could increase the accuracy of the KELM in most datasets by reducing the dataset's dimensions and achieve similar or superior performance compared to other multi-objective methods. Furthermore, the proposed Ensemble Classifier model could provide better classification accuracy and generalizability in the seven of nine microarray datasets compared to conventional ensemble methods. Moreover, the comparison results of the Ensemble Classifier model with three state-of-the-art ensemble generation methods indicate its competitive performance in which the proposed ensemble model achieved better results in the five of nine datasets.
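The voting step of the multi-filter stage can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the five filters or the exact voting function, so the filter names, gene identifiers, and the `min_votes` majority threshold are all assumptions.

```python
from collections import Counter
from typing import Dict, List


def vote_select(filter_picks: Dict[str, List[str]], min_votes: int) -> List[str]:
    """Keep genes selected by at least `min_votes` of the filters."""
    # set(picks) ensures each filter contributes at most one vote per gene.
    votes = Counter(g for picks in filter_picks.values() for g in set(picks))
    return sorted(g for g, n in votes.items() if n >= min_votes)


# Hypothetical top-3 genes reported by five unnamed filter methods.
picks = {
    "filter_a": ["g1", "g2", "g3"],
    "filter_b": ["g2", "g3", "g4"],
    "filter_c": ["g3", "g5", "g1"],
    "filter_d": ["g3", "g2", "g6"],
    "filter_e": ["g7", "g3", "g2"],
}
selected = vote_select(picks, min_votes=3)
print(selected)
```

A simple majority is only one possible rule; a stricter or looser threshold trades subset size against the risk of discarding a relevant gene.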
|
24
|
Ahmed S, Ghosh KK, Mirjalili S, Sarkar R. AIEOU: Automata-based improved equilibrium optimizer with U-shaped transfer function for feature selection. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.107283] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
|
25
|
Mandal M, Singh PK, Ijaz MF, Shafi J, Sarkar R. A Tri-Stage Wrapper-Filter Feature Selection Framework for Disease Classification. SENSORS (BASEL, SWITZERLAND) 2021; 21:5571. [PMID: 34451013 PMCID: PMC8402295 DOI: 10.3390/s21165571] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/01/2021] [Revised: 08/10/2021] [Accepted: 08/13/2021] [Indexed: 12/24/2022]
Abstract
In machine learning and data science, feature selection is considered as a crucial step of data preprocessing. When we directly apply the raw data for classification or clustering purposes, sometimes we observe that the learning algorithms do not perform well. One possible reason for this is the presence of redundant, noisy, and non-informative features or attributes in the datasets. Hence, feature selection methods are used to identify the subset of relevant features that can maximize the model performance. Moreover, due to reduction in feature dimension, both training time and storage required by the model can be reduced as well. In this paper, we present a tri-stage wrapper-filter-based feature selection framework for the purpose of medical report-based disease detection. In the first stage, an ensemble was formed by four filter methods-Mutual Information, ReliefF, Chi Square, and Xvariance-and then each feature from the union set was assessed by three classification algorithms-support vector machine, naïve Bayes, and k-nearest neighbors-and an average accuracy was calculated. The features with higher accuracy were selected to obtain a preliminary subset of optimal features. In the second stage, Pearson correlation was used to discard highly correlated features. In these two stages, XGBoost classification algorithm was applied to obtain the most contributing features that, in turn, provide the best optimal subset. Then, in the final stage, we fed the obtained feature subset to a meta-heuristic algorithm, called whale optimization algorithm, in order to further reduce the feature set and to achieve higher accuracy. We evaluated the proposed feature selection framework on four publicly available disease datasets taken from the UCI machine learning repository, namely, arrhythmia, leukemia, DLBCL, and prostate cancer. Our obtained results confirm that the proposed method can perform better than many state-of-the-art methods and can detect important features as well. 
Fewer features mean fewer medical tests are needed for a correct diagnosis, saving both time and cost.
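The second stage, which discards highly correlated features using Pearson correlation, can be illustrated with a short sketch. The greedy keep-first order, the 0.9 threshold, and the feature names are assumptions made for this example; the abstract does not specify them.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear relationship
    return cov / (sx * sy)


def drop_correlated(features: Dict[str, List[float]], threshold: float = 0.9) -> List[str]:
    """Greedily keep a feature only if |r| < threshold against every kept feature."""
    kept: List[str] = []
    for name, col in features.items():
        if all(abs(pearson(col, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up data: f2 is almost a multiple of f1, f3 is only weakly related to f1.
data = {
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [2.0, 4.1, 6.0, 8.2],
    "f3": [4.0, 1.0, 3.0, 2.0],
}
print(drop_correlated(data))
```

Because the pass is greedy, the result depends on feature order; ranking features by an earlier-stage score before filtering is one way to make that order meaningful.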
Affiliation(s)
- Moumita Mandal
- Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
| | - Pawan Kumar Singh
- Department of Information Technology, Jadavpur University, Kolkata 700106, India
| | - Muhammad Fazal Ijaz
- Department of Intelligent Mechatronics Engineering, Sejong University, Seoul 05006, Korea
| | - Jana Shafi
- Department of Computer Science, College of Arts and Science, Prince Sattam bin Abdul Aziz University, Wadi Ad-Dwasir 11991, Saudi Arabia
| | - Ram Sarkar
- Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
| |
|
26
|
Ghosh M, Sen S, Sarkar R, Maulik U. Quantum squirrel inspired algorithm for gene selection in methylation and expression data of prostate cancer. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107221] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]
|
27
|
Gao H, Wu C, Huang D, Zha D, Zhou C. Prediction of fetal weight based on back propagation neural network optimized by genetic algorithm. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2021; 18:4402-4410. [PMID: 34198444 DOI: 10.3934/mbe.2021222] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Fetal weight is an important index for judging fetal development and ensuring the safety of pregnant women. However, fetal weight cannot be directly measured. This study proposed a fetal weight prediction model based on a back propagation neural network optimized by a genetic algorithm (GA-BP). Using the random number table method, 80 pregnant women seen in our hospital from September 2018 to March 2019 were divided into a control group and an observation group, with 40 cases in each group. The doctors in the control group predicted the fetal weight subjectively according to routine ultrasound and physical examination. In the observation group, a continuous weight change model of pregnant women was established using a regression model and historical physical examination data preprocessed by feature normalization, and then the genetic algorithm (GA) was used to optimize the initial weights and thresholds of the back propagation (BP) neural network to establish the fetal weight prediction model. The coincidence rate of fetal weight was compared between the two groups after birth. Results: The prediction error of GA-BPNN was controlled within 6%. The accuracy of GA-BPNN was 76.3%, which was 14.5% higher than that of traditional methods. According to the error curve, GA-BP is more effective in predicting the actual fetal weight. Conclusion: The GA-BPNN model can accurately and quickly predict fetal weight.
Affiliation(s)
- Hong Gao
- The Third People's Hospital of HeFei, Heifei 230000, China
| | - Cuiyun Wu
- The Third People's Hospital of HeFei, Heifei 230000, China
| | - Dunnian Huang
- The Third People's Hospital of HeFei, Heifei 230000, China
| | - Dahui Zha
- The Third People's Hospital of HeFei, Heifei 230000, China
| | - Cuiping Zhou
- The Third People's Hospital of HeFei, Heifei 230000, China
| |
|
28
|
A Hybrid Swarm and Gravitation-based feature selection algorithm for handwritten Indic script classification problem. COMPLEX INTELL SYST 2021. [DOI: 10.1007/s40747-020-00237-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
Abstract
In any multi-script environment, handwritten script classification is an unavoidable pre-requisite before the document images are fed to their respective Optical Character Recognition (OCR) engines. Over the years, this complex pattern classification problem has been solved by researchers proposing various feature vectors mostly having large dimensions, thereby increasing the computation complexity of the whole classification model. Feature Selection (FS) can serve as an intermediate step to reduce the size of the feature vectors by restricting them only to the essential and relevant features. In the present work, we have addressed this issue by introducing a new FS algorithm, called Hybrid Swarm and Gravitation-based FS (HSGFS). This algorithm has been applied over three feature vectors introduced in the literature recently: Distance-Hough Transform (DHT), Histogram of Oriented Gradients (HOG), and Modified log-Gabor (MLG) filter Transform. Three state-of-the-art classifiers, namely, Multi-Layer Perceptron (MLP), K-Nearest Neighbour (KNN), and Support Vector Machine (SVM), are used to evaluate the optimal subset of features generated by the proposed FS model. Handwritten datasets at block, text line, and word level, consisting of officially recognized 12 Indic scripts, are prepared for experimentation. An average improvement in the range of 2–5% is achieved in the classification accuracy by utilizing only about 75–80% of the original feature vectors on all three datasets. The proposed method also shows better performance when compared to some popularly used FS models. The codes used for implementing HSGFS can be found in the following Github link: https://github.com/Ritam-Guha/HSGFS.
|
29
|
S-shaped versus V-shaped transfer functions for binary Manta ray foraging optimization in feature selection problem. Neural Comput Appl 2021. [DOI: 10.1007/s00521-020-05560-9] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
30
|
Mahendran N, Durai Raj Vincent PM, Srinivasan K, Chang CY. Machine Learning Based Computational Gene Selection Models: A Survey, Performance Evaluation, Open Issues, and Future Research Directions. Front Genet 2020; 11:603808. [PMID: 33362861 PMCID: PMC7758324 DOI: 10.3389/fgene.2020.603808] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/08/2020] [Accepted: 10/29/2020] [Indexed: 12/20/2022] Open
Abstract
Gene Expression is the process of determining the physical characteristics of living beings by generating the necessary proteins. Gene Expression takes place in two steps, translation and transcription. It is the flow of information from DNA to RNA with enzymes' help, and the end product is proteins and other biochemical molecules. Many technologies can capture Gene Expression from the DNA or RNA. One such technique is Microarray DNA. Other than being expensive, the main issue with Microarray DNA is that it generates high-dimensional data with minimal sample size. The issue in handling such a heavyweight dataset is that the learning model will be over-fitted. This problem should be addressed by reducing the dimension of the data source to a considerable amount. In recent years, Machine Learning has gained popularity in the field of genomic studies. In the literature, many Machine Learning-based Gene Selection approaches have been discussed, which were proposed to improve dimensionality reduction precision. This paper does an extensive review of the various works done on Machine Learning-based gene selection in recent years, along with its performance analysis. The study categorizes various feature selection algorithms under Supervised, Unsupervised, and Semi-supervised learning. The works done in recent years to reduce the features for diagnosing tumors are discussed in detail. Furthermore, the performance of several discussed methods in the literature is analyzed. This study also lists out and briefly discusses the open issues in handling the high-dimension and less sample size data.
Affiliation(s)
- Nivedhitha Mahendran
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
| | - P. M. Durai Raj Vincent
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
| | - Kathiravan Srinivasan
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
| | - Chuan-Yu Chang
- Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology, Douliu, Taiwan
| |
|
31
|
Improved coral reefs optimization with adaptive β-hill climbing for feature selection. Neural Comput Appl 2020. [DOI: 10.1007/s00521-020-05409-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]
|
32
|
|
33
|
A weighted ensemble-based active learning model to label microarray data. Med Biol Eng Comput 2020; 58:2427-2441. [PMID: 32770460 DOI: 10.1007/s11517-020-02238-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2019] [Accepted: 07/26/2020] [Indexed: 10/23/2022]
Abstract
Classification of cancerous genes from microarray data is an important research area in bioinformatics. Large amount of microarray data are available, but it is very costly to label them. This paper proposes an active learning model, a semi-supervised classification approach, to label the microarray data using which predictions can be made even with lesser amount of labeled data. Initially, a pool of unlabeled instances is given from which some instances are randomly chosen for labeling. Successive selection of instances to be labeled from unlabeled pool is determined by selection algorithms. The proposed method is devised following an ensemble approach to combine the decisions of three classifiers in order to arrive at a consensus which provides a more accurate prediction of the class label to ensure that each individual classifier learns in an uncorrelated manner. Our method combines the heuristic techniques used by an active learning algorithm to choose training samples with the multiple learning paradigm attained by an ensemble to optimize the search space by choosing efficiently from an already sparse learning pool. On evaluating the proposed method on 10 microarray datasets, we achieve performance which is comparable with state-of-the-art methods. The code and datasets are given at https://github.com/anuran-Chakraborty/Active-learning. Flowchart of the proposed ensemble-based active learning framework.
|
34
|
Introducing clustering based population in Binary Gravitational Search Algorithm for Feature Selection. Appl Soft Comput 2020. [DOI: 10.1016/j.asoc.2020.106341] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022]
|
35
|
Guha R, Ghosh M, Mutsuddi S, Sarkar R, Mirjalili S. Embedded chaotic whale survival algorithm for filter–wrapper feature selection. Soft comput 2020. [DOI: 10.1007/s00500-020-05183-1] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/13/2023]
|
36
|
A survey on single and multi omics data mining methods in cancer data classification. J Biomed Inform 2020; 107:103466. [DOI: 10.1016/j.jbi.2020.103466] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2019] [Revised: 05/01/2020] [Accepted: 05/31/2020] [Indexed: 01/09/2023]
|
37
|
Uzma, Al-Obeidat F, Tubaishat A, Shah B, Halim Z. Gene encoder: a feature selection technique through unsupervised deep learning-based clustering for large gene expression data. Neural Comput Appl 2020. [DOI: 10.1007/s00521-020-05101-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/18/2022]
|
38
|
EvoPreprocess—Data Preprocessing Framework with Nature-Inspired Optimization Algorithms. MATHEMATICS 2020. [DOI: 10.3390/math8060900] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
The quality of machine learning models can suffer when inappropriate data is used, which is especially prevalent in high-dimensional and imbalanced data sets. Data preparation and preprocessing can mitigate some problems and can thus result in better models. The use of meta-heuristic and nature-inspired methods for data preprocessing has become common, but these approaches are still not readily available to practitioners with a simple and extendable application programming interface (API). In this paper the EvoPreprocess open-source Python framework, that preprocesses data with the use of evolutionary and nature-inspired optimization algorithms, is presented. The main problems addressed by the framework are data sampling (simultaneous over- and under-sampling data instances), feature selection and data weighting for supervised machine learning problems. EvoPreprocess framework provides a simple object-oriented and parallelized API of the preprocessing tasks and can be used with scikit-learn and imbalanced-learn Python machine learning libraries. The framework uses self-adaptive well-known nature-inspired meta-heuristic algorithms and can easily be extended with custom optimization and evaluation strategies. The paper presents the architecture of the framework, its use, experiment results and comparison to other common preprocessing approaches.
|
39
|
Ghosh KK, Ghosh S, Sen S, Sarkar R, Maulik U. A two-stage approach towards protein secondary structure classification. Med Biol Eng Comput 2020; 58:1723-1737. [PMID: 32472446 DOI: 10.1007/s11517-020-02194-w] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2019] [Accepted: 05/20/2020] [Indexed: 12/11/2022]
Abstract
Protein secondary structure (PSS) describes the local folded structures which get formed inside a polypeptide due to interactions among atoms of the backbone. Generally, globular proteins are divided into four classes, namely all-α, all-β, α + β, and α/β. As nearly 90% of proteins fall into the said four classes, these are mostly considered for the purpose of computational classification of proteins. Classification of PSS is important for different biological functions that include protein fold recognition, tertiary structure prediction, prediction of DNA-binding sites, and reduction of the conformation search space among others. In this paper, we have proposed a machine learning-based model for secondary structure classification of proteins into four classes: all-α, all-β, α + β, and α/β. In doing so, we have considered both sequence-based and structure-based features. At first, mutual information (MI), a filter-based feature selection method, is used to remove the redundant features, and then these selected features are used to train three different classifiers-random forest, K-nearest neighbor (KNN), and multi-layer perceptron (MLP). After that, some standard classifier combination approaches are applied to integrate the decision made by the said classifiers and it has been found that weighted product rule performs the best among all. The overall accuracies obtained using the proposed model on the four standard datasets, namely 640, 1189, 25pdb, and fc699 are 86.89%, 92.93%, 91.38%, and 94.87% respectively. The proposed model outperforms some state-of-the-art methods considered here for comparison. Significantly high classification accuracy produced by our proposed model on four datasets is attributed to the development of a comprehensive feature set (by eliminating redundant features through feature selection technique) which is then passed through an ensemble consists of three different classifiers. 
Assigning different weights to the outcome of different classifiers thus proved to be useful in designing the model for predicting the secondary structure of proteins based on its sequence-based and structure-based features. Graphical abstract.
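The weighted product rule mentioned above fuses per-class probabilities from several classifiers by raising each to a classifier-specific weight and multiplying. The sketch below shows the general rule only; the probability vectors and weights are invented for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def weighted_product(probs: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Fuse class probabilities: score_c = prod_k p_k[c] ** w_k, then normalize."""
    n_classes = len(probs[0])
    scores = []
    for c in range(n_classes):
        s = 1.0
        for p, w in zip(probs, weights):
            s *= p[c] ** w
        scores.append(s)
    total = sum(scores)  # assumed > 0, i.e. some class has non-zero support everywhere
    return [s / total for s in scores]


# Hypothetical outputs for the four classes (all-alpha, all-beta, alpha+beta, alpha/beta).
probs = [
    [0.6, 0.2, 0.1, 0.1],  # random forest
    [0.3, 0.5, 0.1, 0.1],  # KNN
    [0.5, 0.3, 0.1, 0.1],  # MLP
]
weights = [0.5, 0.2, 0.3]  # assumed weights, e.g. derived from validation accuracy

fused = weighted_product(probs, weights)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(predicted, fused)
```

Unlike a weighted sum, the product rule lets a single classifier veto a class by assigning it a probability near zero, which is why the weights matter.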
Affiliation(s)
- Kushal Kanti Ghosh
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, India.
| | - Soulib Ghosh
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
| | - Sagnik Sen
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
| | - Ram Sarkar
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
| | - Ujjwal Maulik
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
| |
|
40
|
Bommert A, Sun X, Bischl B, Rahnenführer J, Lang M. Benchmark for filter methods for feature selection in high-dimensional classification data. Comput Stat Data Anal 2020. [DOI: 10.1016/j.csda.2019.106839] [Citation(s) in RCA: 206] [Impact Index Per Article: 51.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023]
|
41
|
Hosseinpoor MJ, Parvin H, Nejatian S, Rezaie V. Gene Regulatory Elements Extraction in Breast Cancer by Hi-C Data Using a Meta-Heuristic Method. RUSS J GENET+ 2019. [DOI: 10.1134/s1022795419090072] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/09/2023]
|
42
|
Ghosh M, Guha R, Alam I, Lohariwal P, Jalan D, Sarkar R. Binary Genetic Swarm Optimization: A Combination of GA and PSO for Feature Selection. JOURNAL OF INTELLIGENT SYSTEMS 2019. [DOI: 10.1515/jisys-2019-0062] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open
Abstract
Feature selection (FS) is a technique which helps to find the most optimal feature subset to develop an efficient pattern recognition model under consideration. The use of genetic algorithm (GA) and particle swarm optimization (PSO) in the field of FS is profound. In this paper, we propose an insightful way to perform FS by amassing information from the candidate solutions produced by GA and PSO. Our aim is to combine the exploitation ability of GA with the exploration capacity of PSO. We name this new model as binary genetic swarm optimization (BGSO). The proposed method initially lets GA and PSO to run independently. To extract sufficient information from the feature subsets obtained by those, BGSO combines their results by an algorithm called average weighted combination method to produce an intermediate solution. Thereafter, a local search called sequential one-point flipping is applied to refine the intermediate solution further in order to generate the final solution. BGSO is applied on 20 popular UCI datasets. The results were obtained by two classifiers, namely, k nearest neighbors (KNN) and multi-layer perceptron (MLP). The overall results and comparisons show that the proposed method outperforms the constituent algorithms in 16 and 14 datasets using KNN and MLP, respectively, whereas among the constituent algorithms, GA is able to achieve the best classification accuracy for 2 and 7 datasets and PSO achieves best accuracy for 2 and 4 datasets, respectively, for the same set of classifiers. This proves the applicability and usefulness of the method in the domain of FS.
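The two post-processing steps named in the abstract, a weighted combination of candidate solutions followed by sequential one-point flipping, might look roughly like the sketch below. The abstract gives no formulas, so the fitness-weighted per-bit average, the 0.5 threshold, the toy fitness function, and every name here are assumptions rather than the published BGSO procedure.

```python
from typing import Callable, List

Mask = List[int]  # 1 = feature selected, 0 = feature dropped


def weighted_combine(masks: List[Mask], fitness: List[float], threshold: float = 0.5) -> Mask:
    """Per-bit fitness-weighted average of candidate masks, binarized at `threshold`."""
    total = sum(fitness)
    combined = []
    for i in range(len(masks[0])):
        score = sum(f * m[i] for m, f in zip(masks, fitness)) / total
        combined.append(1 if score >= threshold else 0)
    return combined


def one_point_flipping(mask: Mask, evaluate: Callable[[Mask], float]) -> Mask:
    """Flip each bit in turn and keep the flip only if the fitness strictly improves."""
    best = list(mask)
    best_score = evaluate(best)
    for i in range(len(best)):
        best[i] ^= 1
        score = evaluate(best)
        if score > best_score:
            best_score = score
        else:
            best[i] ^= 1  # revert the flip
    return best


USEFUL = {0, 2}  # hypothetical informative features for the toy fitness


def toy_fitness(mask: Mask) -> float:
    """Toy stand-in for classifier accuracy: reward useful features, penalize size."""
    return sum(1 for i in USEFUL if mask[i]) - 0.1 * sum(mask)


# Hypothetical candidate solutions from GA (first two) and PSO (last two).
masks = [[1, 1, 0, 0, 1], [1, 0, 1, 0, 0], [1, 0, 0, 1, 1], [0, 0, 1, 1, 0]]
scores = [0.80, 0.70, 0.60, 0.50]  # made-up fitness values

intermediate = weighted_combine(masks, scores)
final = one_point_flipping(intermediate, toy_fitness)
print(intermediate, final)
```

In practice `evaluate` would train and score a classifier such as KNN on the selected columns, which makes the local search the expensive part: one evaluation per feature.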
Affiliation(s)
- Manosij Ghosh
- Computer Science and Engineering Department, Jadavpur University, 188, Raja S.C. Mallick Road, Kolkata 700032, West Bengal, India
- Ritam Guha
- Computer Science and Engineering Department, Jadavpur University, 188, Raja S.C. Mallick Road, Kolkata 700032, West Bengal, India
- Imran Alam
- Computer Science and Engineering Department, Jadavpur University, 188, Raja S.C. Mallick Road, Kolkata 700032, West Bengal, India
- Priyank Lohariwal
- Computer Science and Engineering Department, Jadavpur University, 188, Raja S.C. Mallick Road, Kolkata 700032, West Bengal, India
- Devesh Jalan
- Computer Science and Engineering Department, Jadavpur University, 188, Raja S.C. Mallick Road, Kolkata 700032, West Bengal, India
- Ram Sarkar
- Computer Science and Engineering Department, Jadavpur University, 188, Raja S.C. Mallick Road, Kolkata 700032, West Bengal, India
|
43
|
|
44
|
Guha R, Ghosh M, Singh PK, Sarkar R, Nasipuri M. M-HMOGA: A New Multi-Objective Feature Selection Algorithm for Handwritten Numeral Classification. Journal of Intelligent Systems 2019. [DOI: 10.1515/jisys-2019-0064] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 11/15/2022]
Abstract
Feature selection is an important process in the field of pattern recognition: it selects the informative features so as to reduce the curse of dimensionality, thus improving the overall classification accuracy. In this paper, a new feature selection approach named Memory-Based Histogram-Oriented Multi-objective Genetic Algorithm (M-HMOGA) is introduced to identify the informative feature subset to be used for a pattern classification problem. The proposed M-HMOGA approach is applied to two recently used feature sets, namely Mojette transform and Regional Weighted Run Length features. The experiments are carried out on Bangla, Devanagari, and Roman numeral datasets, these being the three most popular scripts used in the Indian subcontinent. In-house Bangla and Devanagari script datasets and the Competition on Handwritten Digit Recognition (HDRC) 2013 Roman numeral dataset are used for evaluating our model. Moreover, as proof of robustness, we have applied an innovative approach of using different datasets for training and testing. We have used the in-house Bangla and Devanagari script datasets for training the model, and the trained model is then tested on Indian Statistical Institute numeral datasets. For Roman numerals, we have used the HDRC 2013 dataset for training and the Modified National Institute of Standards and Technology dataset for testing. Comparison of the results obtained by the proposed model with the existing HMOGA and MOGA techniques clearly indicates the superiority of M-HMOGA over both of its ancestors. Moreover, the use of K-nearest neighbor as well as multi-layer perceptron as classifiers supports the classifier-independent nature of M-HMOGA.
Compared with using all the features for classification, the proposed M-HMOGA model achieves an increase of around 1% in classification ability using only about 45–50% of the total feature set when the same datasets are partitioned for training and testing, and an increase of 2–3% using only 35–45% of the features when different datasets are used for training and testing.
Affiliation(s)
- Ritam Guha
- Department of Computer Science and Engineering, Jadavpur University, 188, Raja S.C. Mullick Road, Kolkata-700032, West Bengal, India
- Manosij Ghosh
- Department of Computer Science and Engineering, Jadavpur University, 188, Raja S.C. Mullick Road, Kolkata-700032, West Bengal, India
- Pawan Kumar Singh
- Department of Computer Science and Engineering, Jadavpur University, 188, Raja S.C. Mullick Road, Kolkata-700032, West Bengal, India
- Ram Sarkar
- Department of Computer Science and Engineering, Jadavpur University, 188, Raja S.C. Mullick Road, Kolkata-700032, West Bengal, India
- Mita Nasipuri
- Department of Computer Science and Engineering, Jadavpur University, 188, Raja S.C. Mullick Road, Kolkata-700032, West Bengal, India
|
45
|
Guha R, Ghosh M, Kapri S, Shaw S, Mutsuddi S, Bhateja V, Sarkar R. Deluge based Genetic Algorithm for feature selection. Evolutionary Intelligence 2019. [DOI: 10.1007/s12065-019-00218-5] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Indexed: 11/24/2022]
|
46
|
Malakar S, Ghosh M, Bhowmik S, Sarkar R, Nasipuri M. A GA based hierarchical feature selection approach for handwritten word recognition. Neural Comput Appl 2019. [DOI: 10.1007/s00521-018-3937-8] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Indexed: 11/30/2022]
|