1
Zhang Y, Nie B, Du J, Chen J, Du Y, Jin H, Zheng X, Chen X, Miao Z. Feature selection based on neighborhood rough sets and Gini index. PeerJ Comput Sci 2023; 9:e1711. [PMID: 38192483 PMCID: PMC10773927 DOI: 10.7717/peerj-cs.1711] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2023] [Accepted: 10/30/2023] [Indexed: 01/10/2024]
Abstract
The neighborhood rough set is considered an essential approach for dealing with incomplete data and inexact knowledge representation, and it has been widely applied in feature selection. The Gini index is an indicator used to evaluate the impurity of a dataset and is also commonly employed to measure the importance of features in feature selection. This article proposes a novel feature selection methodology based on these two concepts. In this methodology, we present the neighborhood Gini index and the neighborhood class Gini index and then extensively discuss their properties and relationships with attributes. Subsequently, two forward greedy feature selection algorithms are developed using these two metrics as a foundation. Finally, to comprehensively evaluate the performance of the algorithm proposed in this article, comparative experiments were conducted on 16 UCI datasets from various domains, including industry, food, medicine, and pharmacology, against four classical neighborhood rough set-based feature selection algorithms. The experimental results indicate that the proposed algorithm improves the average classification accuracy on the 16 datasets by over 6%, with improvements exceeding 10% on five of them. Furthermore, statistical tests reveal no significant differences between the proposed algorithm and the four classical neighborhood rough set-based feature selection algorithms. However, the proposed algorithm demonstrates high stability, eliminating most redundant or irrelevant features effectively while enhancing classification accuracy. In summary, the algorithm proposed in this article outperforms classical neighborhood rough set-based feature selection algorithms.
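To make the two ingredients of this abstract concrete, the sketch below pairs the standard Gini impurity with a forward greedy search. The abstract does not define the neighborhood Gini index, so `neighborhood_gini` is a hypothetical stand-in (the mean Gini impurity of the labels inside each sample's delta-neighborhood), and `forward_greedy`, `delta`, and `min_gain` are illustrative names rather than the authors' algorithm or parameters.

```python
from collections import Counter
from math import dist


def gini(labels):
    """Standard Gini impurity: 1 - sum_k p_k^2 over the class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def neighborhood_gini(X, y, features, delta):
    """ASSUMED score, not the paper's definition: mean Gini impurity of the
    labels in each sample's delta-neighborhood, using only `features`."""
    if not features:
        return gini(y)
    proj = [[row[f] for f in features] for row in X]
    total = 0.0
    for p in proj:
        # Euclidean delta-neighborhood; every sample is its own neighbor.
        neigh = [y[j] for j, q in enumerate(proj) if dist(p, q) <= delta]
        total += gini(neigh)
    return total / len(X)


def forward_greedy(X, y, delta=0.2, min_gain=1e-9):
    """Add, one at a time, the feature that lowers the score the most;
    stop when no remaining feature improves it by more than `min_gain`."""
    selected = []
    remaining = list(range(len(X[0])))
    best = neighborhood_gini(X, y, selected, delta)
    while remaining:
        score, f = min(
            (neighborhood_gini(X, y, selected + [f], delta), f)
            for f in remaining
        )
        if best - score <= min_gain:
            break
        selected.append(f)
        remaining.remove(f)
        best = score
    return selected, best


if __name__ == "__main__":
    # Toy data: feature 0 separates the classes, feature 1 does not.
    X = [[0.0, 0.5], [0.1, 0.9], [0.05, 0.1],
         [0.9, 0.5], [1.0, 0.1], [0.95, 0.9]]
    y = [0, 0, 0, 1, 1, 1]
    print(forward_greedy(X, y, delta=0.2))
```

The stopping rule is what removes redundant features in this sketch: once the neighborhoods are already pure, an extra feature cannot lower the score and the search ends.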
Affiliation(s)
- Yuchao Zhang
- School of Computer Science, Jiangxi University of Chinese Medicine, NanChang, JiangXi, China
- Bin Nie
- School of Computer Science, Jiangxi University of Chinese Medicine, NanChang, JiangXi, China
- Jianqiang Du
- School of Computer Science, Jiangxi University of Chinese Medicine, NanChang, JiangXi, China
- Jiandong Chen
- School of Computer Science, Jiangxi University of Chinese Medicine, NanChang, JiangXi, China
- Yuwen Du
- School of Computer Science, Jiangxi University of Chinese Medicine, NanChang, JiangXi, China
- Haike Jin
- School of Computer Science, Jiangxi University of Chinese Medicine, NanChang, JiangXi, China
- Xuepeng Zheng
- School of Computer Science, Jiangxi University of Chinese Medicine, NanChang, JiangXi, China
- Xingxin Chen
- School of Computer Science, Jiangxi University of Chinese Medicine, NanChang, JiangXi, China
- Zhen Miao
- School of Computer Science, Jiangxi University of Chinese Medicine, NanChang, JiangXi, China
2
Li Y, Cheng Y. Streaming Feature Selection for Multi-Label Data with Dynamic Sliding Windows and Feature Repulsion Loss. Entropy 2019. [PMCID: PMC7514496 DOI: 10.3390/e21121151] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/16/2022]
Abstract
In recent years, there has been a growing interest in the problem of multi-label streaming feature selection with no prior knowledge of the feature space. However, the algorithms proposed to handle this problem seldom consider the group structure of streaming features. Another shortcoming arises from the fact that few studies have addressed atomic feature models, and particularly, few have measured the attraction and repulsion between features. To remedy these shortcomings, we develop the streaming feature selection algorithm with dynamic sliding windows and feature repulsion loss (SF-DSW-FRL). This algorithm is essentially carried out in three consecutive steps. Firstly, within dynamic sliding windows, candidate streaming features that are strongly related to the labels in different feature groups are selected and stored in a fixed sliding window. Then, the interaction between features is measured by a loss function inspired by the mutual repulsion and attraction between atoms in physics. Specifically, one feature attraction term and two feature repulsion terms are constructed and combined to create the feature repulsion loss function. Finally, for the fixed sliding window, the best feature subset is selected according to this loss function. The effectiveness of the proposed algorithm is demonstrated through experiments on several multi-label datasets, statistical hypothesis testing, and stability analysis.
Affiliation(s)
- Yu Li
- School of Computer and Information, Anqing Normal University, Anqing 246003, China;
- Lab of Multimedia and Recommendation Systems, Hefei University of Technology, Hefei 230009, China
- Yusheng Cheng
- School of Computer and Information, Anqing Normal University, Anqing 246003, China;
- The University Key Laboratory of Intelligent Perception and Computing of Anhui Province, Anqing 246003, China
- Correspondence:
3
Robust Feature Selection from Microarray Data Based on Cooperative Game Theory and Qualitative Mutual Information. Adv Bioinformatics 2016; 2016:1058305. [PMID: 27127506 PMCID: PMC4818815 DOI: 10.1155/2016/1058305] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 11/28/2015] [Revised: 02/20/2016] [Accepted: 02/22/2016] [Indexed: 11/17/2022]
Abstract
High dimensionality of microarray data sets may lead to low efficiency and overfitting. In this paper, a multiphase cooperative game theoretic feature selection approach is proposed for microarray data classification. In the first phase, due to high dimension of microarray data sets, the features are reduced using one of the two filter-based feature selection methods, namely, mutual information and Fisher ratio. In the second phase, Shapley index is used to evaluate the power of each feature. The main innovation of the proposed approach is to employ Qualitative Mutual Information (QMI) for this purpose. The idea of Qualitative Mutual Information causes the selected features to have more stability and this stability helps to deal with the problem of data imbalance and scarcity. In the third phase, a forward selection scheme is applied which uses a scoring function to weight each feature. The performance of the proposed method is compared with other popular feature selection algorithms such as Fisher ratio, minimum redundancy maximum relevance, and previous works on cooperative game based feature selection. The average classification accuracy on eleven microarray data sets shows that the proposed method improves both average accuracy and average stability compared to other approaches.
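The second phase described above scores each feature by its Shapley value in a cooperative game. As a minimal sketch of that idea only, the code below computes exact Shapley values by averaging marginal contributions over all player orderings. The abstract says the coalition worth is based on Qualitative Mutual Information but gives no formula, so `toy_value` and the feature names `g1`, `g2`, `g3` are hypothetical placeholders, not the paper's scoring function.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution
    value(S + {p}) - value(S) over every ordering of the players."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}


def toy_value(coalition):
    """HYPOTHETICAL coalition worth (stand-in for a QMI-based score):
    g1 and g2 are redundant with each other, g3 adds independent information."""
    worth = 0.0
    if "g1" in coalition or "g2" in coalition:
        worth += 0.6
    if "g3" in coalition:
        worth += 0.3
    return worth


if __name__ == "__main__":
    print(shapley_values(["g1", "g2", "g3"], toy_value))
```

Enumerating all orderings costs factorial time, so it is only practical for a handful of features. That is consistent with the abstract's first phase, which reduces the feature set with a filter before the Shapley evaluation; larger sets would need sampling or another approximation.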