de Haro-Garcia A, Toledano JPP, Cerruela-Garcia G, Garcia-Pedrajas N. Grab'Em: A Novel Graph-Based Method for Combining Feature Subset Selectors.
IEEE TRANSACTIONS ON CYBERNETICS 2022;
52:2942-2954. [PMID:
33027013 DOI:
10.1109/tcyb.2020.3018815]
[Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Feature selection is one of the most frequent tasks in data mining applications. Its ability to remove useless and redundant features improves the classification performance and gains knowledge about a given problem makes feature selection a common first step in data mining. In many feature selection applications, we need to combine the results of different feature selection processes. The two most common scenarios are the ensembles of feature selectors and the scaling up of feature selection methods using a data division approach. The standard procedure is to store the number of times every feature has been selected as a vote for the feature and then evaluate different selection thresholds with a certain criterion to obtain the final subset of selected features. However, this method is suboptimal as the relationships of the features are not considered in the voting process. Two redundant features may be selected a similar number of times due to the different sets of instances used each time. Thus, a voting scheme would tend to select both of them. In this article, we present a new approach: instead of using only the number of times a feature has been selected, the approach considers how many times the features have been selected together by a feature selection algorithm. The proposal is based on constructing an undirected graph where the vertices are the features, and the edges count the number of times every pair of instances has been selected together. This graph is used to select the best subset of features, avoiding the redundancy introduced by the voting scheme. The proposal improves the results of the standard voting scheme in both ensembles of feature selectors and data division methods for scaling up feature selection.
Collapse