1
A new ranking-based stability measure for feature selection algorithms. Soft Comput 2023. [DOI: 10.1007/s00500-022-07767-5]
2
Ensemble feature selection for multi-label text classification: An intelligent order statistics approach. Int J Intell Syst 2022. [DOI: 10.1002/int.23044]
3
Abasabadi S, Nematzadeh H, Motameni H, Akbari E. Hybrid feature selection based on SLI and genetic algorithm for microarray datasets. J Supercomput 2022; 78:19725-19753. [PMID: 35789817] [PMCID: PMC9244444] [DOI: 10.1007/s11227-022-04650-w]
Abstract
One of the major problems with microarray datasets is their large number of features, which raises the "curse of dimensionality" when machine learning is applied to them. Feature selection is the process of finding an optimal feature set by removing irrelevant and redundant features, and it plays a significant role in pattern recognition, classification, and machine learning. In this study, a new and efficient hybrid feature selection method, called Garank&rand, is presented. The method combines a wrapper feature selection algorithm based on the genetic algorithm (GA) with a proposed filter feature selection method, SLI-γ. In Garank&rand, some initial solutions are built from the most relevant features according to SLI-γ, while the remaining ones consist of randomly chosen features. Eleven high-dimensional and standard datasets were used to evaluate the accuracy of the proposed SLI-γ. Additionally, four well-known high-dimensional microarray datasets were used for an extensive experimental study of the performance of Garank&rand. This analysis showed the robustness of the method as well as its ability to obtain highly accurate solutions at the early stages of the GA evolutionary process. Finally, the performance of Garank&rand was compared to that of a plain GA to highlight its competitiveness and its ability to reduce the original feature set size and the execution time.
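The population-seeding idea described in this abstract can be sketched roughly as follows. SLI-γ itself is not reproduced here; an arbitrary precomputed filter ranking stands in for it, and all names (`seeded_population`, `seed_frac`, and the parameter values) are illustrative, not taken from the paper.

```python
import random

def seeded_population(ranking, n_features, pop_size, subset_size, seed_frac=0.5):
    """Build a GA initial population: a fraction of chromosomes is seeded
    with top filter-ranked features, the rest are purely random subsets."""
    top = ranking[:subset_size]  # most relevant features per the filter
    population = []
    for i in range(pop_size):
        if i < int(pop_size * seed_frac):
            # seeded chromosome: start from half of the top-ranked features,
            # then fill the remaining slots with random features
            chrom = set(random.sample(top, subset_size // 2))
            while len(chrom) < subset_size:
                chrom.add(random.randrange(n_features))
        else:
            # purely random chromosome
            chrom = set(random.sample(range(n_features), subset_size))
        population.append(sorted(chrom))
    return population

random.seed(0)
pop = seeded_population(ranking=list(range(100)), n_features=100,
                        pop_size=10, subset_size=8)
```

Seeding part of the population this way is what lets a GA reach good solutions in early generations, as the abstract reports, while the random chromosomes preserve exploration.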
Affiliation(s)
- Sedighe Abasabadi
- Department of Computer Engineering, Sari Branch, Islamic Azad University, Sari, Iran
- Hossein Nematzadeh
- Department of Computer Engineering, Sari Branch, Islamic Azad University, Sari, Iran
- Homayun Motameni
- Department of Computer Engineering, Sari Branch, Islamic Azad University, Sari, Iran
- Ebrahim Akbari
- Department of Computer Engineering, Sari Branch, Islamic Azad University, Sari, Iran
4
Alhenawi E, Al-Sayyed R, Hudaib A, Mirjalili S. Feature selection methods on gene expression microarray data for cancer classification: A systematic review. Comput Biol Med 2022; 140:105051. [PMID: 34839186] [DOI: 10.1016/j.compbiomed.2021.105051]
Abstract
This systematic review provides researchers interested in feature selection (FS) for processing microarray data with comprehensive information about the main research directions for gene expression classification pursued during the last seven years. A set of 132 studies published by three different publishers is reviewed. The papers are categorized into nine directions based on their objectives, and the attention each direction received is then summarized. The review revealed that 'propose hybrid FS methods' was the most active research direction, accounting for 34.9% of the studies, while the other directions ranged from 13.6% down to 3%. This guides researchers in selecting the most competitive research direction. Papers in each category are thoroughly reviewed along six perspectives: method(s), classifier(s), dataset(s), dataset dimension(s) range, performance metric(s), and result(s) achieved.
Affiliation(s)
- Esra'a Alhenawi
- King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan
- Rizik Al-Sayyed
- King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan
- Amjad Hudaib
- King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan
- Seyedali Mirjalili
- Center for Artificial Intelligence Research and Optimization, Torrens University Australia, Fortitude Valley, Brisbane, 4006, QLD, Australia; Yonsei Frontier Lab, Yonsei University, Seoul, South Korea
5
Abasabadi S, Nematzadeh H, Motameni H, Akbari E. Automatic ensemble feature selection using fast non-dominated sorting. Inf Syst 2021. [DOI: 10.1016/j.is.2021.101760]
|
6
Buyrukoğlu S. New hybrid data mining model for prediction of Salmonella presence in agricultural waters based on ensemble feature selection and machine learning algorithms. J Food Saf 2021. [DOI: 10.1111/jfs.12903]
Affiliation(s)
- Selim Buyrukoğlu
- Department of Computer Engineering, Faculty of Engineering, Çankırı Karatekin University, Çankırı, Turkey
7
Salman R, Alzaatreh A, Sulieman H, Faisal S. A Bootstrap Framework for Aggregating within and between Feature Selection Methods. Entropy 2021; 23:e23020200. [PMID: 33561948] [PMCID: PMC7914949] [DOI: 10.3390/e23020200]
Abstract
In the past decade, big data has become increasingly prevalent in a large number of applications. As a result, datasets suffering from noise and redundancy have made feature selection necessary across multiple domains. However, a common concern in feature selection is that different approaches can give very different results when applied to similar datasets. Aggregating the results of different selection methods helps to resolve this concern and to control the diversity of the selected feature subsets. In this work, we implemented a general framework for ensembles of multiple feature selection methods. Based on diversified datasets generated from the original set of observations, we aggregated the importance scores generated by multiple feature selection techniques in two ways: the Within Aggregation Method (WAM), which aggregates importance scores within a single feature selection method; and the Between Aggregation Method (BAM), which aggregates importance scores across multiple feature selection methods. We applied the proposed framework to 13 real datasets with diverse characteristics. The experimental evaluation showed that WAM provides an effective tool for determining the best feature selection method for a given dataset, and that WAM is more stable than BAM in identifying important features. The computational demands of the two methods were comparable. These results suggest that by applying both WAM and BAM, practitioners can gain a deeper understanding of the feature selection process.
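The WAM/BAM distinction can be sketched as below. The paper's exact aggregation functions are not reproduced here; this sketch uses plain averaging and min-max rescaling, which is one common choice, and the importance scores are made-up numbers.

```python
import statistics

def wam(scores_by_replicate):
    """Within Aggregation Method: average each feature's importance score
    across bootstrap replicates of a single feature-selection method."""
    n = len(scores_by_replicate[0])
    return [statistics.mean(rep[j] for rep in scores_by_replicate)
            for j in range(n)]

def bam(scores_by_method):
    """Between Aggregation Method: rescale each method's scores to [0, 1]
    so they are comparable, then average across methods."""
    def rescale(s):
        lo, hi = min(s), max(s)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in s]
    normed = [rescale(s) for s in scores_by_method]
    n = len(normed[0])
    return [statistics.mean(m[j] for m in normed) for j in range(n)]

# Two bootstrap replicates of one method (WAM) ...
print(wam([[1.0, 2.0, 6.0], [3.0, 4.0, 2.0]]))   # → [2.0, 3.0, 4.0]
# ... versus two different methods scored on the same data (BAM)
print(bam([[0.0, 5.0, 10.0], [10.0, 5.0, 0.0]])) # → [0.5, 0.5, 0.5]
```

The rescaling step in `bam` matters because different selectors emit scores on incompatible scales (p-values, Gini importances, weights); without it, one method would dominate the average.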
8
Ensembles of feature selectors for dealing with class-imbalanced datasets: A proposal and comparative study. Inf Sci 2020. [DOI: 10.1016/j.ins.2020.05.077]
9
Analysis of Ensemble Feature Selection for Correlated High-Dimensional RNA-Seq Cancer Data. Lecture Notes in Computer Science 2020. [PMCID: PMC7304026] [DOI: 10.1007/978-3-030-50420-5_39]
Abstract
The discovery of diagnostic and prognostic molecular markers is an important and actively pursued goal in cancer research. For complex diseases, this process is often performed using machine learning. The current study compares two approaches to the discovery of relevant variables: applying a single feature selection algorithm versus an ensemble of diverse algorithms. These approaches are used to identify variables relevant for discerning four cancer types using RNA-seq profiles from The Cancer Genome Atlas. The comparison is carried out along two dimensions: the predictive performance of the resulting models and the stability of the selected variables. The most informative features are identified using four feature selection algorithms, namely the U-test, ReliefF, and two variants of the MDFS algorithm. Normal and tumor tissues are discerned using the Random Forest algorithm. The highest stability of the feature set was obtained with the U-test. Models built on feature sets obtained from the ensemble of feature selection algorithms were no better than models developed on feature sets obtained from individual algorithms. On the other hand, the feature selectors leading to the best classification results varied between datasets.
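An ensemble of feature selectors of the kind this abstract compares is often built by voting over the individual rankings. The sketch below is one simple scheme (keep a feature if at least half of the selectors rank it in their top-k); the paper's exact combination rule may differ, and the gene names and rankings are invented for illustration.

```python
def ensemble_select(rankings, k):
    """Combine several feature rankings (best feature first) by voting:
    keep a feature if it appears in the top-k of at least half of the
    selectors -- one simple way to build a feature-selector ensemble."""
    votes = {}
    for ranking in rankings:
        for f in ranking[:k]:
            votes[f] = votes.get(f, 0) + 1
    quorum = len(rankings) / 2
    return sorted(f for f, v in votes.items() if v >= quorum)

# Hypothetical top-ranked genes from a U-test, ReliefF, and an MDFS run
u_test  = ["g7", "g2", "g9", "g4"]
relieff = ["g2", "g7", "g5", "g9"]
mdfs    = ["g9", "g2", "g8", "g7"]
print(ensemble_select([u_test, relieff, mdfs], k=3))  # → ['g2', 'g7', 'g9']
```

The consensus set is typically smaller and more stable than any single selector's top-k, which is exactly the trade-off (stability versus per-dataset accuracy) the study evaluates.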
10
Feature Selection Applied to Microarray Data. Methods Mol Biol 2019; 1986:123-152. [PMID: 31115887] [DOI: 10.1007/978-1-4939-9442-7_6]
Abstract
A typical characteristic of microarray data is its very high number of features (on the order of thousands), while the number of examples is usually below 100. In the context of microarray classification, this poses a challenge for machine learning methods, which can overfit and thus suffer degraded performance. A common solution is to apply a dimensionality reduction technique before classification to reduce the number of features. This chapter focuses on one of the best-known dimensionality reduction techniques, feature selection, and shows how it can help improve classification accuracy in several microarray data scenarios.
11
Pereira T, Ferreira FL, Cardoso S, Silva D, de Mendonça A, Guerreiro M, Madeira SC. Neuropsychological predictors of conversion from mild cognitive impairment to Alzheimer's disease: a feature selection ensemble combining stability and predictability. BMC Med Inform Decis Mak 2018; 18:137. [PMID: 30567554] [PMCID: PMC6299964] [DOI: 10.1186/s12911-018-0710-y]
Abstract
BACKGROUND Predicting progression from Mild Cognitive Impairment (MCI) to Alzheimer's Disease (AD) is a major open problem in AD-related research. Neuropsychological assessment has proven useful in identifying MCI patients who are likely to convert to dementia. However, the large battery of neuropsychological tests (NPTs) performed in clinical practice and the limited number of training examples pose a challenge to machine learning when learning prognostic models. In this context, it is paramount to pursue approaches that effectively seek reduced sets of relevant features. Subsets of NPTs from which prognostic models can be learnt should not only be good predictors but also stable, promoting generalizable and explainable models. METHODS We propose a feature selection (FS) ensemble combining stability and predictability to choose the most relevant NPTs for prognostic prediction in AD. First, we combine the outcomes of multiple (filter and embedded) FS methods. Then, we use a wrapper-based approach that optimizes both stability and predictability to compute the number of selected features. We use two large prospective studies (ADNI and the Portuguese Cognitive Complaints Cohort, CCC) to evaluate the approach and assess the predictive value of a large number of NPTs. RESULTS The best subsets include approximately 30 and 20 features (from the original 79 and 40), for the ADNI and CCC data respectively, yielding stability above 0.89 and 0.95, and AUC above 0.87 and 0.82. Most NPTs selected by the proposed FS ensemble have been identified in the literature as strong predictors of conversion from MCI to AD. CONCLUSIONS The FS ensemble approach was able to 1) identify subsets of stable and relevant predictors from a consensus of multiple FS methods using baseline NPTs and 2) learn reliable prognostic models of conversion from MCI to AD using these subsets of features. The machine learning models learnt from these features outperformed models trained without FS and achieved competitive results compared to commonly used FS algorithms. Furthermore, since the selected features derive from a consensus of methods, they are more robust, and users are released from having to choose the most appropriate FS method for their classification task.
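Stability scores like the 0.89 and 0.95 figures above quantify how consistently an FS procedure picks the same features across data resamples. The sketch below uses average pairwise Jaccard similarity as a simple stand-in; the paper may use a different stability index, and the test names ("mmse", "cdr", etc.) are hypothetical NPT labels.

```python
from itertools import combinations

def subset_stability(feature_sets):
    """Average pairwise Jaccard similarity of the feature subsets selected
    on different data resamples; values near 1 mean the selection is stable."""
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 1.0
    pairs = list(combinations(feature_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Identical subsets across runs are perfectly stable ...
print(subset_stability([["mmse", "cdr"], ["mmse", "cdr"]]))  # → 1.0
# ... while disjoint subsets are maximally unstable
print(subset_stability([["mmse", "cdr"], ["adas", "moca"]])) # → 0.0
```

A wrapper that jointly optimizes such a stability score together with cross-validated AUC, as described in METHODS, trades a little predictive power for feature sets that generalize and are easier to interpret clinically.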
Affiliation(s)
- Telma Pereira
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
- Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
- Sandra Cardoso
- Laboratório de Neurociências, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, Lisbon, Portugal
- Dina Silva
- Cognitive Neuroscience Research Group, Department of Psychology and Educational Sciences and Centre for Biomedical Research (CBMR), University of Algarve, Faro, Portugal
- Alexandre de Mendonça
- Laboratório de Neurociências, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, Lisbon, Portugal
- Manuela Guerreiro
- Laboratório de Neurociências, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, Lisbon, Portugal
- Sara C. Madeira
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
- for the Alzheimer’s Disease Neuroimaging Initiative
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
- Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
- Laboratório de Neurociências, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, Lisbon, Portugal
- Cognitive Neuroscience Research Group, Department of Psychology and Educational Sciences and Centre for Biomedical Research (CBMR), University of Algarve, Faro, Portugal
12
López-Cabrera JD, Lorenzo-Ginori JV. Feature selection for the classification of traced neurons. J Neurosci Methods 2018; 303:41-54. [DOI: 10.1016/j.jneumeth.2018.04.002]