1. Labory J, Njomgue-Fotso E, Bottini S. Benchmarking feature selection and feature extraction methods to improve the performances of machine-learning algorithms for patient classification using metabolomics biomedical data. Comput Struct Biotechnol J 2024; 23:1274-1287. [PMID: 38560281] [PMCID: PMC10979063] [DOI: 10.1016/j.csbj.2024.03.016]
Abstract
Objective Classification tasks are an open challenge in biomedicine. While several machine-learning techniques exist to accomplish this objective, several peculiarities of biomedical data, especially omics measurements, prevent their use or limit their performance. Omics approaches aim to understand a complex biological system through systematic analysis of its content at the molecular level. However, omics data are heterogeneous, sparse and affected by the classical "curse of dimensionality" problem, i.e. having far fewer samples (n) than omics features (p). Furthermore, a major problem with multi-omics data is imbalance at either the class or the feature level. The objective of this work is to study whether feature extraction and/or feature selection techniques can improve the performance of classification machine-learning algorithms on omics measurements. Methods Among all omics, metabolomics has emerged as a powerful tool in cancer research, facilitating a deeper understanding of the complex metabolic landscape associated with tumorigenesis and tumor progression. We therefore selected three publicly available metabolomics datasets, applied several linear and non-linear feature extraction techniques, coupled or not with feature selection methods, and evaluated patient-classification performance in the different configurations for the three datasets. Results We provide a general workflow and guidelines on when to use those techniques depending on the characteristics of the available data. To further test the extension of our approach to other omics data, we also included a transcriptomics and a proteomics dataset. Overall, for all datasets, we showed that applying supervised feature selection improves the performance of feature extraction methods for classification purposes.
Scripts used to perform all analyses are available at: https://github.com/Plant-Net/Metabolomic_project/.
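The selection-then-extraction workflow this abstract evaluates can be illustrated with a minimal numpy sketch on synthetic "omics" data with n << p. The t-like scoring rule, the dimensions, and the nearest-centroid classifier below are illustrative assumptions, not the methods benchmarked in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "omics" matrix: n=40 samples, p=200 features (n << p), binary labels.
n, p = 40, 200
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1, :10] += 2.0                    # only the first 10 features are informative

# Step 1 -- supervised feature selection: rank features by a two-sample t-like score.
mu0, mu1 = X[y == 0].mean(0), X[y == 1].mean(0)
score = np.abs(mu0 - mu1) / (X.std(0) + 1e-8)
keep = np.argsort(score)[::-1][:20]      # keep the top-20 features
Xs = X[:, keep]

# Step 2 -- feature extraction: PCA via SVD on the selected features.
Xc = Xs - Xs.mean(0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:5].T                        # project onto the first 5 components

# Step 3 -- classify with a nearest-centroid rule on the extracted components.
c0, c1 = Z[y == 0].mean(0), Z[y == 1].mean(0)
pred = (np.linalg.norm(Z - c1, axis=1) < np.linalg.norm(Z - c0, axis=1)).astype(int)
accuracy = (pred == y).mean()
```

Swapping the order of steps 1 and 2 (extraction without prior supervised selection) is exactly the kind of configuration the benchmark compares.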
Affiliation(s)
- Justine Labory
- Université Côte d′Azur, Center of Modeling Simulation and Interactions, Nice, France
- INRAE, Université Côte d′Azur, CNRS, Institut Sophia Agrobiotech, Sophia-Antipolis, France
- Université Côte d′Azur, Inserm U1081, CNRS UMR 7284, Institute for Research on Cancer and Aging, Nice (IRCAN), Nice, France
- Silvia Bottini
- Université Côte d′Azur, Center of Modeling Simulation and Interactions, Nice, France
- INRAE, Université Côte d′Azur, CNRS, Institut Sophia Agrobiotech, Sophia-Antipolis, France
2. Song Z, Yang X, Xu Z, King I. Graph-Based Semi-Supervised Learning: A Comprehensive Review. IEEE Trans Neural Netw Learn Syst 2023; 34:8174-8194. [PMID: 35302941] [DOI: 10.1109/tnnls.2022.3155478]
Abstract
Semi-supervised learning (SSL) has tremendous value in practice due to its utilization of both labeled and unlabeled data. An essential class of SSL methods, referred to as graph-based semi-supervised learning (GSSL) methods in the literature, first represents each sample as a node in an affinity graph and then infers the label information of unlabeled samples from the structure of the constructed graph. GSSL methods have demonstrated their advantages in various domains due to the uniqueness of their structure, the universality of their applications, and their scalability to large-scale data. Focusing on GSSL methods only, this work aims to provide both researchers and practitioners with a solid and systematic understanding of relevant advances as well as the underlying connections among them. The concentration on one class of SSL makes this article distinct from recent surveys that cover a more general and broader picture of SSL methods yet often neglect the fundamental understanding of GSSL methods. In particular, a significant contribution of this article lies in a newly generalized taxonomy for GSSL under a unified framework, with the most up-to-date references and valuable resources such as codes, datasets, and applications. Furthermore, we present several potential research directions as future work with our insights into this rapidly growing field.
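The GSSL idea described above can be sketched in a few lines of numpy, assuming an RBF affinity graph and the classic iterative label-propagation scheme with clamping of the labeled nodes (one of many variants such a review covers); the data, bandwidth, and iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian clusters; only one labeled sample per class.
n = 60
X = np.vstack([rng.normal(0, 0.5, (n // 2, 2)),
               rng.normal(3, 0.5, (n // 2, 2))])
y_true = np.repeat([0, 1], n // 2)
labeled = np.array([0, n // 2])          # indices of the two labeled points

# Affinity graph: RBF weights, then a row-normalized transition matrix.
D2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
W = np.exp(-D2 / (2 * 0.5 ** 2))
np.fill_diagonal(W, 0.0)
P = W / W.sum(1, keepdims=True)

# Iterative label propagation: spread label mass along the graph,
# clamping the labeled nodes back to their known labels after every step.
F = np.zeros((n, 2))
F[labeled, y_true[labeled]] = 1.0
for _ in range(200):
    F = P @ F
    F[labeled] = 0.0
    F[labeled, y_true[labeled]] = 1.0

pred = F.argmax(1)
accuracy = (pred == y_true).mean()
```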
3. Li RY, Guo Y, Zhang B. Adaptive Kernel Graph Nonnegative Matrix Factorization. Information 2023. [DOI: 10.3390/info14040208]
Abstract
Nonnegative matrix factorization (NMF) is an efficient method for feature learning in the fields of machine learning and data mining. To investigate the nonlinear characteristics of datasets, kernel-method-based NMF (KNMF) and its graph-regularized extensions have received much attention from researchers due to their promising performance. However, the graph similarity matrix of the existing methods is often predefined in the original data space and kept unchanged during the matrix-factorization procedure, which leads to non-optimal graphs. To address these problems, we propose a kernel-graph-learning-based, nonlinear, nonnegative matrix-factorization method, termed adaptive kernel graph nonnegative matrix factorization (AKGNMF). In order to automatically capture the manifold structure of the data in the nonlinear feature space, AKGNMF learns an adaptive similarity graph. We formulate a unified objective function in which global similarity graph learning is optimized jointly with the matrix-decomposition process, and a local graph Laplacian is further imposed on the learned feature subspace representation. The proposed method relies on both a factorization that respects geometric structure and the mapped high-dimensional subspace feature representations. In addition, an efficient iterative solution is derived to update all variables in the resultant objective problem in turn. Experiments on a synthetic dataset visually demonstrate the ability of AKGNMF to separate nonlinear data with high clustering accuracy. Experiments on real-world datasets verify the effectiveness of AKGNMF in three aspects: clustering performance, parameter sensitivity and convergence. Comprehensive experimental findings indicate that, compared with various classic and state-of-the-art methods, the proposed AKGNMF algorithm is effective and superior.
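For contrast with the adaptive-graph approach above, the classical graph-regularized NMF baseline with a predefined, fixed affinity graph can be sketched as follows; the multiplicative update rules are the standard ones, while the dimensions, the RBF graph, and the regularization weight are illustrative assumptions. AKGNMF's point of departure is precisely that the graph `W` here stays fixed instead of being learned jointly.

```python
import numpy as np

rng = np.random.default_rng(2)

# Nonnegative data matrix X (m features x n samples).
m, n, k, lam = 30, 50, 4, 0.1
X = rng.random((m, n))

# Predefined RBF affinity on the columns of X (fixed during factorization;
# AKGNMF instead learns this graph jointly with the decomposition).
D2 = ((X.T[:, None] - X.T[None, :]) ** 2).sum(-1)
W = np.exp(-D2 / D2.mean())
np.fill_diagonal(W, 0.0)
Dg = np.diag(W.sum(1))                       # degree matrix of the graph

# Graph-regularized NMF, X ~ U @ V.T, via multiplicative updates.
U = rng.random((m, k)) + 0.1
V = rng.random((n, k)) + 0.1
eps = 1e-9
err0 = np.linalg.norm(X - U @ V.T)           # reconstruction error at init
for _ in range(200):
    U *= (X @ V) / (U @ V.T @ V + eps)
    V *= (X.T @ U + lam * W @ V) / (V @ (U.T @ U) + lam * Dg @ V + eps)
err = np.linalg.norm(X - U @ V.T)            # error after updating
```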
4. Huang D, Zhang Q, Li Z. Semi-supervised attribute reduction for partially labeled categorical data based on predicted label. Int J Approx Reason 2023. [DOI: 10.1016/j.ijar.2022.12.014]
5. Feature selection for distance-based regression: An umbrella review and a one-shot wrapper. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.11.023]
6. Lv S, Wei L, Zhang Q, Liu B, Xu Z. Improved Inference for Imputation-Based Semisupervised Learning Under Misspecified Setting. IEEE Trans Neural Netw Learn Syst 2022; 33:6346-6359. [PMID: 34029195] [DOI: 10.1109/tnnls.2021.3077312]
Abstract
Semisupervised learning (SSL) has been extensively studied in the literature. Despite its success, many existing learning algorithms for semisupervised problems require specific distributional assumptions, such as the "cluster assumption" and the "low-density assumption," which are often hard to verify in practice. We are interested in quantifying the effect of SSL based on kernel methods under a misspecified setting. The misspecified setting means that the target function is not contained in the hypothesis space under which a specific learning algorithm works. Practically, this assumption is mild and standard for various kernel-based approaches. Under this misspecified setting, this article attempts to provide a theoretical justification of when and how unlabeled data can be exploited to improve inference in a learning task. Our theoretical justification is given from the viewpoint of the asymptotic variance of our proposed two-step estimation. It is shown that the proposed pointwise nonparametric estimator has a smaller asymptotic variance than the supervised estimator using the labeled data alone. Several simulated experiments are implemented to support our theoretical results.
7. Binary dwarf mongoose optimizer for solving high-dimensional feature selection problems. PLoS One 2022; 17:e0274850. [PMID: 36201524] [PMCID: PMC9536540] [DOI: 10.1371/journal.pone.0274850]
Abstract
Selecting an appropriate feature subset is a vital task in machine learning. Its main goal is to remove noisy, irrelevant, and redundant features that could negatively impact the learning model's accuracy, and to improve classification performance without information loss. Therefore, increasingly advanced optimization methods have been employed to locate the optimal subset of features. This paper presents a binary version of the dwarf mongoose optimization algorithm, called BDMO, to solve the high-dimensional feature selection problem. The effectiveness of this approach was validated on 18 high-dimensional datasets from the Arizona State University feature selection repository, and the efficacy of BDMO was compared with other well-known feature selection techniques in the literature. The results show that BDMO outperforms the other methods, producing the lowest average fitness value in 14 of 18 datasets (77.77%). BDMO also demonstrated stability, returning the lowest standard deviation (SD) in 13 of 18 datasets (72.22%). Furthermore, it achieved higher validation accuracy than the other methods in 15 of the 18 datasets (83.33%), and yielded the highest attainable validation accuracy on the COIL20 and Leukemia datasets, which vividly portrays the superiority of BDMO.
8. Fan M, Zhang X, Hu J, Gu N, Tao D. Adaptive Data Structure Regularized Multiclass Discriminative Feature Selection. IEEE Trans Neural Netw Learn Syst 2022; 33:5859-5872. [PMID: 33882003] [DOI: 10.1109/tnnls.2021.3071603]
Abstract
Feature selection (FS), which aims to identify the most informative subset of input features, is an important approach to dimensionality reduction. In this article, a novel FS framework is proposed for both unsupervised and semisupervised scenarios. To make efficient use of the data distribution to evaluate features, the framework combines data structure learning (also referred to as data distribution modeling) and FS in a unified formulation, such that the data structure learning improves the results of FS and vice versa. Moreover, two types of data structures, namely the soft and hard data structures, are learned and used in the proposed FS framework. The soft data structure refers to the pairwise weights among data samples, and the hard data structure refers to the estimated labels obtained from clustering or semisupervised classification. Both of these data structures are naturally formulated as regularization terms in the proposed framework. In the optimization process, the soft and hard data structures are learned from the data represented by the selected features, and then the most informative features are reselected by referring to the data structures. In this way, the framework uses the interactions between data structure learning and FS to select the most discriminative and informative features. Following the proposed framework, a new semisupervised FS (SSFS) method is derived and studied in depth. Experiments on real-world datasets demonstrate the effectiveness of the proposed method.
9. Büyükkeçeci M, Okur MC. A Comprehensive Review of Feature Selection and Feature Selection Stability in Machine Learning. Gazi University Journal of Science 2022. [DOI: 10.35378/gujs.993763]
Abstract
Feature selection is a data preprocessing method used to reduce the number of features in a dataset. Feature selection techniques search the entire feature space to find an optimal feature set that is free of redundant and irrelevant features. Reducing the dimensionality of a dataset by removing redundant and irrelevant features plays a pivotal role in improving the performance, i.e., accuracy, of inductive learners and in building simple models. Thus, feature selection is an imperative task of machine learning. The apparent need for feature selection has raised considerable interest and become an important research topic in a wide range of fields, including bioinformatics, text classification, image recognition, and computer vision. As a result, a large pool of feature selection methods has been proposed, and a considerable amount of literature has been published on feature selection. The quality of feature selection algorithms is measured not only by the performance of the features they prefer but also by their stability. Therefore, this study focuses on two topics: feature selection and feature selection stability. In the pages that follow, general concepts and methods of feature selection are discussed, and then an overview of feature selection stability and stability measures is given.
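The notion of selection stability discussed above can be made concrete with a small sketch: run a selector on bootstrap resamples of the data and average the pairwise Jaccard overlap of the chosen feature subsets. The correlation-based selector, the toy data, and the number of resamples are illustrative assumptions, not a specific measure from the review.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: 100 samples, 50 features, only the first 5 drive the target.
n, p, k = 100, 50, 5
X = rng.normal(size=(n, p))
y = X[:, :5].sum(1) + 0.3 * rng.normal(size=n)

def select_top_k(Xb, yb, k):
    """Rank features by absolute Pearson correlation with the target."""
    Xc = Xb - Xb.mean(0)
    yc = yb - yb.mean()
    r = (Xc * yc[:, None]).sum(0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return set(np.argsort(np.abs(r))[::-1][:k])

# Run the selector on B bootstrap resamples of the data.
B = 10
subsets = [select_top_k(X[idx], y[idx], k)
           for idx in (rng.integers(0, n, n) for _ in range(B))]

# Stability = mean pairwise Jaccard overlap of the selected subsets (in [0, 1]).
pairs = [(i, j) for i in range(B) for j in range(i + 1, B)]
stability = np.mean([len(subsets[i] & subsets[j]) / len(subsets[i] | subsets[j])
                     for i, j in pairs])
```

A stability near 1 means the selector keeps choosing the same features as the training data is perturbed, which is the property the review's stability measures formalize.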
10. Akinola OO, Ezugwu AE, Agushaka JO, Zitar RA, Abualigah L. Multiclass feature selection with metaheuristic optimization algorithms: a review. Neural Comput Appl 2022; 34:19751-19790. [PMID: 36060097] [PMCID: PMC9424068] [DOI: 10.1007/s00521-022-07705-4]
Abstract
Selecting relevant feature subsets is vital in machine learning, and multiclass feature selection is harder to perform since most classification methods are designed for binary problems. The feature selection problem aims to reduce the dimension of the feature set while maintaining model accuracy. Datasets can be classified using various methods; nevertheless, metaheuristic algorithms attract substantial attention for solving different optimization problems. For this reason, this paper presents a systematic survey of the literature on metaheuristic algorithms for multiclass feature selection, which can help classifiers select optimal or near-optimal features faster and more accurately. Metaheuristic algorithms are presented in four primary behavior-based categories, i.e., evolutionary-based, swarm-intelligence-based, physics-based, and human-based, even though some works in the literature present more categories; lists of metaheuristic algorithms are introduced within these categories. Only articles on metaheuristic algorithms used for multiclass feature selection from 2000 to 2022 were reviewed, with their different categories and detailed descriptions. We also consider application areas for some of the metaheuristic algorithms applied to multiclass feature selection, together with their variations, and examine popular multiclass classifiers used for feature selection. Moreover, we present the challenges of metaheuristic algorithms for feature selection and identify gaps for further research.
Affiliation(s)
- Olatunji O. Akinola
- School of Mathematics, Statistics, and Computer Science, University of KwaZulu-Natal, King Edward Avenue, Pietermaritzburg Campus, Pietermaritzburg, 3201 KwaZulu-Natal South Africa
- Absalom E. Ezugwu
- School of Mathematics, Statistics, and Computer Science, University of KwaZulu-Natal, King Edward Avenue, Pietermaritzburg Campus, Pietermaritzburg, 3201 KwaZulu-Natal South Africa
- Jeffrey O. Agushaka
- School of Mathematics, Statistics, and Computer Science, University of KwaZulu-Natal, King Edward Avenue, Pietermaritzburg Campus, Pietermaritzburg, 3201 KwaZulu-Natal South Africa
- Raed Abu Zitar
- Sorbonne Center of Artificial Intelligence, Sorbonne University-Abu Dhabi, 38044 Abu Dhabi, United Arab Emirates
- Laith Abualigah
- Hourani Center for Applied Scientific Research, Al-Ahliyya Amman University, Amman, 19328 Jordan
- Faculty of Information Technology, Middle East University, Amman, 11831 Jordan
11. Wang C, Chen X, Yuan G, Nie F, Yang M. Semisupervised Feature Selection With Sparse Discriminative Least Squares Regression. IEEE Trans Cybern 2022; 52:8413-8424. [PMID: 33872166] [DOI: 10.1109/tcyb.2021.3060804]
Abstract
In the era of big data, selecting informative features has become an urgent need. However, due to the huge cost of obtaining enough labeled data for supervised tasks, researchers have turned their attention to semisupervised learning, which exploits both labeled and unlabeled data. In this article, we propose a sparse discriminative semisupervised feature selection (SDSSFS) method. In this method, the ε-dragging technique for supervised tasks is extended to the semisupervised task and used to enlarge the distance between classes in order to obtain a discriminative solution. The flexible ℓ2,p norm is implicitly used as regularization in the new model; therefore, we can obtain a sparser solution by setting a smaller p. An iterative method is proposed to simultaneously learn the regression coefficients and the ε-dragging matrix and to predict the unknown class labels. Experimental results on ten real-world datasets show the superiority of our proposed method.
12. Balasubramanian K, N.P. A. Correlation-based feature selection using bio-inspired algorithms and optimized KELM classifier for glaucoma diagnosis. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.109432]
13. Chen X, Chen R, Wu Q, Nie F, Yang M, Mao R. Semisupervised Feature Selection via Structured Manifold Learning. IEEE Trans Cybern 2022; 52:5756-5766. [PMID: 33635817] [DOI: 10.1109/tcyb.2021.3052847]
Abstract
Recently, semisupervised feature selection has gained more attention in many real applications due to the high cost of obtaining labeled data. However, existing methods cannot solve the "multimodality" problem, in which samples in some classes lie in several separate clusters. To solve the multimodality problem, this article proposes a new feature selection method for the semisupervised task, namely, semisupervised structured manifold learning (SSML). The new method learns a structured graph which consists of more clusters than the known classes. Meanwhile, we propose to exploit the submanifolds in both labeled and unlabeled data by using the nearest neighbors of each object among both labeled and unlabeled objects. An iterative optimization algorithm is proposed to solve the new model. A series of experiments was conducted on both synthetic and real-world datasets, and the experimental results verify the ability of the new method to solve the multimodality problem and its superior performance compared with state-of-the-art methods.
14. Robust dual-graph regularized and minimum redundancy based on self-representation for semi-supervised feature selection. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.03.004]
15. Yuan A, You M, He D, Li X. Convex Non-Negative Matrix Factorization With Adaptive Graph for Unsupervised Feature Selection. IEEE Trans Cybern 2022; 52:5522-5534. [PMID: 33237876] [DOI: 10.1109/tcyb.2020.3034462]
Abstract
Unsupervised feature selection (UFS) aims to remove redundant information and select the most representative feature subset from the original data, so it occupies a core position in high-dimensional data preprocessing. Many proposed approaches use self-expression to explore the correlation between data samples or use pseudolabel matrix learning to learn the mapping between the data and labels. Furthermore, existing methods have tried to add constraints to either of these two modules to reduce redundancy, but no prior literature embeds them into a joint model to select the most representative features by the computed top ranking scores. To address this issue, this article presents a novel UFS method via convex non-negative matrix factorization with an adaptive graph constraint (CNAFS). Through convex matrix factorization with the adaptive graph constraint, it can uncover the correlation between the data and keep the local manifold structure of the data. To our knowledge, this is the first work that integrates pseudolabel matrix learning into the self-expression module and optimizes them simultaneously for the UFS solution. Besides, two different manifold regularizations are constructed, for the pseudolabel matrix and for the encoding matrix, to keep the local geometrical structure. Eventually, extensive experiments on benchmark datasets are conducted to prove the effectiveness of our method. The source code is available at: https://github.com/misteru/CNAFS.
16. Zhang H, Gong M, Nie F, Li X. Unified Dual-label Semi-supervised Learning with Top-k Feature Selection. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.05.090]
17. Zhang R, Zhang H, Li X, Yang S. Unsupervised Feature Selection With Extended OLSDA via Embedding Nonnegative Manifold Structure. IEEE Trans Neural Netw Learn Syst 2022; 33:2274-2280. [PMID: 33382663] [DOI: 10.1109/tnnls.2020.3045053]
Abstract
In unsupervised learning, most discriminative information is encoded in the cluster labels. Unsupervised feature selection methods usually utilize spectral clustering to generate such pseudo labels. Nonetheless, two related disadvantages exist: 1) the performance of feature selection highly depends on the constructed Laplacian matrix, and 2) the pseudo labels are obtained with mixed signs, while the real ones should be nonnegative. To address this problem, a novel approach for unsupervised feature selection is proposed by extending orthogonal least square discriminant analysis (OLSDA) to the unsupervised case, such that nonnegative pseudo labels can be achieved. Additionally, an orthogonal constraint is imposed on the class indicator to hold the manifold structure. Furthermore, ℓ2,1 regularization, proved to be equivalent to ℓ2,0 regularization, is imposed to ensure that the projection matrix is row sparse for efficient feature selection. Finally, extensive experiments on nine benchmark datasets are conducted to demonstrate the effectiveness of the proposed approach.
18. Dai J, Liu Q. Semi-supervised attribute reduction for interval data based on misclassification cost. Int J Mach Learn Cybern 2021. [DOI: 10.1007/s13042-021-01483-6]
19. Pintas JT, Fernandes LAF, Garcia ACB. Feature selection methods for text classification: a systematic literature review. Artif Intell Rev 2021. [DOI: 10.1007/s10462-021-09970-6]
20. Huang S, Liu Z, Jin W, Mu Y. Broad learning system with manifold regularized sparse features for semi-supervised classification. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.08.052]
21. Wu X, Chen H, Li T, Wan J. Semi-supervised feature selection with minimal redundancy based on local adaptive. Appl Intell 2021. [DOI: 10.1007/s10489-021-02288-4]
23. Huang Y, Shen Z, Cai F, Li T, Lv F. Adaptive graph-based generalized regression model for unsupervised feature selection. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.107156]
24. Zhong W, Chen X, Nie F, Huang JZ. Adaptive discriminant analysis for semi-supervised feature selection. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.02.035]
25. Xue C, Zhang T, Xiao D. Output-Related and -Unrelated Fault Monitoring with an Improvement Prototype Knockoff Filter and Feature Selection Based on Laplacian Eigen Maps and Sparse Regression. ACS Omega 2021; 6:10828-10839. [PMID: 34056237] [PMCID: PMC8153765] [DOI: 10.1021/acsomega.1c00506]
Abstract
In the process industry, fault monitoring related to output is an important step to ensure product quality and improve economic benefits. In order to distinguish the influence of input variables on the output more accurately, this paper introduces a subalgorithm of fault-unrelated block partition into the prototype knockoff filter (PKF) algorithm to improve it. The improved PKF algorithm can divide the input data into three blocks: a fault-unrelated block, an output-related block, and an output-unrelated block. Removing the data of the fault-unrelated block can greatly reduce the difficulty of fault monitoring. This paper proposes a feature selection method based on Laplacian Eigenmaps and sparse regression for the output-unrelated block. The algorithm can detect faults caused by variables with a small contribution to variance, and its descent property is proved from a theoretical point of view. The output-related block is monitored by the Broyden-Fletcher-Goldfarb-Shanno method. Finally, the effectiveness of the proposed fault detection method is verified on the well-known Tennessee Eastman process data.
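The Laplacian Eigenmaps step underlying the proposed feature selection can be sketched with numpy as follows, using an unnormalized graph Laplacian on toy two-cluster data; this covers only the embedding, not the paper's sparse-regression or knockoff-filter components, and the RBF bandwidth is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two well-separated clusters; the eigenmap should embed them apart.
n = 40
X = np.vstack([rng.normal(0, 0.3, (n // 2, 5)),
               rng.normal(4, 0.3, (n // 2, 5))])

# RBF affinity, then the unnormalized graph Laplacian L = D - W.
D2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
W = np.exp(-D2 / D2.mean())
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(1)) - W

# Embedding = eigenvectors of L for the smallest nonzero eigenvalues
# (the very first eigenvector is the constant vector and is skipped).
vals, vecs = np.linalg.eigh(L)
embedding = vecs[:, 1:3]

# The first nontrivial coordinate (the Fiedler vector) puts the two
# clusters on opposite sides of zero.
coord = embedding[:, 0]
side_product = coord[: n // 2].mean() * coord[n // 2:].mean()
```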
Affiliation(s)
- Cuiping Xue
- College of Science, Northeastern University, Shenyang 110819, China
- Tie Zhang
- College of Science, Northeastern University, Shenyang 110819, China
- Dong Xiao
- College of Information Science and Engineering and Liaoning Key Laboratory of Intelligent Diagnosis and Safety for Metallurgical Industry, Northeastern University, Shenyang 110819, China
26. Joint local structure preservation and redundancy minimization for unsupervised feature selection. Appl Intell 2020. [DOI: 10.1007/s10489-020-01800-6]
27. Shang R, Xu K, Jiao L. Subspace learning for unsupervised feature selection via adaptive structure learning and rank approximation. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.06.111]
28. Zhou P, Chen J, Fan M, Du L, Shen YD, Li X. Unsupervised feature selection for balanced clustering. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2019.105417]
29. Liu Y, Ye D, Li W, Wang H, Gao Y. Robust neighborhood embedding for unsupervised feature selection. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2019.105462]
30. Huang S, Xu Z, Kang Z, Ren Y. Regularized nonnegative matrix factorization with adaptive local structure learning. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2019.11.070]
31. Shang R, Song J, Jiao L, Li Y. Double feature selection algorithm based on low-rank sparse non-negative matrix factorization. Int J Mach Learn Cybern 2020. [DOI: 10.1007/s13042-020-01079-6]
32. Ahmadizadeh C, Pousett B, Menon C. Investigation of Channel Selection for Gesture Classification for Prosthesis Control Using Force Myography: A Case Study. Front Bioeng Biotechnol 2019; 7:331. [PMID: 31921794] [PMCID: PMC6914858] [DOI: 10.3389/fbioe.2019.00331]
Abstract
Background: Various human machine interfaces (HMIs) are used to control prostheses, such as robotic hands. One promising HMI is force myography (FMG). Previous research has shown the potential of high-density FMG (HD-FMG) to improve the accuracy of prosthesis control. Motivation: The more sensors an FMG-controlled system uses, the more complicated and costly it becomes. This study proposes a design method that can produce powered prostheses with performance comparable to that of HD-FMG-controlled systems while using fewer sensors. An HD-FMG apparatus would be used to collect information from the user only in the design phase. Channel selection would then be applied to the collected data to determine the number and locations of the sensors that are vital to the performance of the device. This study assessed the use of multiple channel selection (CS) methods for this purpose. Methods: In this case study, three datasets were used. These datasets were collected from force-sensitive resistors embedded in the inner socket of a subject with transradial amputation. Sensor data were collected as the subject carried out five repetitions of six gestures. The collected data were then used to assess five CS methods: sequential forward selection (SFS) with two different stopping criteria, minimum redundancy-maximum relevance (mRMR), a genetic algorithm (GA), and Boruta. Results: Three of the five methods (mRMR, GA, and Boruta) decreased the number of channels significantly while maintaining classification accuracy in all datasets. None of the three outperformed the other two across all datasets; however, GA produced the smallest channel subset in all three datasets. The three selected methods were also compared in terms of stability, i.e., the consistency of the channel subset chosen by the method as new training data were introduced or some training data were removed (Chandrashekar and Sahin, 2014).
Boruta and mRMR were more stable than GA when applied to the datasets of this study. Conclusion: This study shows the feasibility of the proposed design method, which can produce prosthetic systems that are simpler than HD-FMG systems but offer comparable performance.
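The mRMR criterion used above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the authors' implementation: it greedily picks channels that score high on relevance to the gesture labels minus average redundancy with already-selected channels, and it substitutes Pearson correlation for the mutual-information measures mRMR normally uses. All names and the toy data are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def mrmr_select(channels, labels, k):
    """Greedily pick k channels: high label relevance, low redundancy."""
    remaining = list(range(len(channels)))
    selected = []
    while len(selected) < k and remaining:
        def score(i):
            relevance = abs(pearson(channels[i], labels))
            redundancy = (sum(abs(pearson(channels[i], channels[j]))
                              for j in selected) / len(selected)) if selected else 0.0
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example: channel 0 tracks the labels, channel 1 duplicates it,
# channel 2 is weakly related noise. mRMR picks 0 first, then prefers
# 2 over the redundant duplicate.
labels = [0, 0, 1, 1, 0, 1]
channels = [
    [0.1, 0.0, 1.0, 0.9, 0.2, 1.1],   # informative
    [0.1, 0.0, 1.0, 0.9, 0.2, 1.1],   # redundant duplicate
    [0.5, 0.4, 0.5, 0.6, 0.5, 0.4],   # weakly related noise
]
picked = mrmr_select(channels, labels, 2)  # [0, 2]
```

The redundancy term is what separates mRMR from a plain relevance ranking: a duplicated channel scores as well as the original on relevance alone but is rejected once one copy is in the subset.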
Affiliation(s)
- Chakaveh Ahmadizadeh
- Menrva Research Group, Schools of Mechatronic Systems Engineering and Engineering Science, Simon Fraser University, Metro Vancouver, BC, Canada
- Carlo Menon
- Menrva Research Group, Schools of Mechatronic Systems Engineering and Engineering Science, Simon Fraser University, Metro Vancouver, BC, Canada
33. Ma J, Wu J, Zhao J, Jiang J, Zhou H, Sheng QZ. Nonrigid Point Set Registration With Robust Transformation Learning Under Manifold Regularization. IEEE Trans Neural Netw Learn Syst 2019; 30:3584-3597. [PMID: 30371389] [DOI: 10.1109/tnnls.2018.2872528]
Abstract
This paper addresses nonrigid point set registration by designing a robust transformation-learning scheme. The principle is to iteratively establish point correspondences and learn the nonrigid transformation between two given sets of points. In particular, local feature descriptors are used to search for correspondences, which inevitably introduces some unknown outliers. To learn the underlying transformation precisely from noisy correspondences, we cast point set registration as a semisupervised learning problem, in which a set of indicator variables helps distinguish outliers in a mixture model. To exploit the intrinsic structure of a point set, we constrain the transformation with manifold regularization, which plays the role of prior knowledge. Moreover, the transformation is modeled in a reproducing kernel Hilbert space, and a sparsity-induced approximation is used to boost efficiency. We apply the proposed method to learning motion flows between image pairs of similar scenes for visual homing, a specific type of mobile robot navigation. Extensive experiments on several publicly available datasets reveal the superiority of the proposed method over state-of-the-art competitors, particularly on degenerate data.
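The alternation the abstract describes (fit a transformation, then down-weight correspondences that look like outliers, then re-fit) can be illustrated with a drastically simplified sketch. This is not the paper's method: it replaces the RKHS nonrigid model with a 1-D affine map and the mixture-model indicator variables with Cauchy-style soft weights, keeping only the iterative reweighting structure.

```python
def fit_affine(xs, ys, w):
    """Weighted least-squares fit of y ~ a*x + b."""
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    num = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    den = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    a = num / den
    return a, my - a * mx

def robust_affine(xs, ys, iters=20):
    """Alternate between outlier down-weighting and re-fitting."""
    w = [1.0] * len(xs)
    a = b = 0.0
    for _ in range(iters):
        a, b = fit_affine(xs, ys, w)
        # Cauchy-style weights stand in for the paper's mixture-model
        # outlier indicators: large residual -> weight near 0.
        w = [1.0 / (1.0 + (a * x + b - y) ** 2) for x, y in zip(xs, ys)]
    return a, b

# Inlier correspondences follow y = 2x + 1; the last pair is a gross
# outlier that an unweighted fit would be dragged toward.
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 3, 5, 7, 9, 40]
a, b = robust_affine(xs, ys)  # close to a = 2, b = 1
```

A single unweighted least-squares pass over this data gives a slope above 6; the reweighting loop recovers the inlier transformation because the outlier's residual keeps its weight near zero.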
34. Correlation-Based Ensemble Feature Selection Using Bioinspired Algorithms and Classification Using Backpropagation Neural Network. Comput Math Methods Med 2019; 2019:7398307. [PMID: 31662787] [PMCID: PMC6778924] [DOI: 10.1155/2019/7398307]
Abstract
A framework for clinical diagnosis that uses bioinspired algorithms for feature selection and a gradient-descent backpropagation neural network for classification has been designed and implemented. The clinical data are subjected to data preprocessing, feature selection, and classification. Hot-deck imputation is used to handle missing values, and min-max normalization is used for data transformation. A wrapper approach that employs bioinspired algorithms, namely Differential Evolution, Lion Optimization, and Glowworm Swarm Optimization, with the accuracy of an AdaBoostSVM classifier as the fitness function, is used for feature selection. Each bioinspired algorithm selects a subset of features, yielding three feature subsets. Correlation-based ensemble feature selection is then performed to select the optimal features from the three subsets. The optimal features are used to train a gradient-descent backpropagation neural network, and ten-fold cross-validation is used to train and test the classifier. The Hepatitis and Wisconsin Diagnostic Breast Cancer (WDBC) datasets from the University of California, Irvine (UCI) Machine Learning Repository are used to evaluate classification accuracy. An accuracy of 98.47% is obtained on the WDBC dataset and 95.51% on the Hepatitis dataset. The proposed framework can be tailored to develop clinical decision-making systems for any health disorder to assist physicians in clinical diagnosis.
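The ensemble step described above (pool the three searchers' subsets, then re-rank the pooled features by their correlation with the class) can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the candidate subsets, toy data, and function names are hypothetical, and absolute Pearson correlation stands in for whatever correlation measure the authors used.

```python
def abs_corr(x, y):
    """Absolute Pearson correlation between a feature column and labels."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return abs(cov / (vx * vy)) if vx and vy else 0.0

def ensemble_select(feature_cols, labels, candidate_subsets, k):
    """Pool candidate subsets, rank pooled features by class correlation."""
    pooled = sorted(set().union(*candidate_subsets))
    ranked = sorted(pooled, key=lambda i: abs_corr(feature_cols[i], labels),
                    reverse=True)
    return ranked[:k]

labels = [0, 0, 1, 1]
feature_cols = [
    [1.0, 1.1, 2.0, 2.1],   # strongly class-correlated
    [0.3, 0.4, 0.3, 0.4],   # noise
    [5.0, 5.2, 6.1, 6.0],   # strongly class-correlated
]
# Hypothetical subsets, as if returned by DE, Lion, and Glowworm runs.
subsets = [{0, 1}, {1, 2}, {0, 2}]
chosen = ensemble_select(feature_cols, labels, subsets, 2)  # [0, 2]
```

The ensemble keeps features that correlate with the class even if only one searcher found them, and drops a feature that every searcher happened to include if it carries no class signal.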
35. Fast unsupervised feature selection based on the improved binary ant system and mutation strategy. Neural Comput Appl 2019. [DOI: 10.1007/s00521-018-03991-z]
36.
37. Liu H, Hu QV, He L. Term-Based Personalization for Feature Selection in Clinical Handover Form Auto-Filling. IEEE/ACM Trans Comput Biol Bioinform 2019; 16:1219-1230. [PMID: 30296238] [DOI: 10.1109/tcbb.2018.2874237]
Abstract
Feature learning and selection have been widely applied in many research areas because of their good performance and low complexity. Traditional methods usually treat all terms with the same feature set, so performance can suffer when noisy information is introduced via wrong features for a given term. In this paper, we propose a term-based personalization approach to finding the best features for each term. First, features are given as the input, so that we can focus on selection strategies. Second, the importance of each feature subset to a given term is evaluated by a term-feature probabilistic relevance model. Because evaluating all possible feature subsets is computationally intensive, we present a feature-searching method to generate candidate feature subsets for each term. Finally, we obtain the personalized feature set for each term as a subset of all features. Experiments conducted on the NICTA Synthetic Nursing Handover dataset show that our approach is promising and effective.
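The per-term search the abstract describes (grow a candidate feature subset while a term-specific relevance score keeps improving, instead of exhausting all subsets) can be sketched greedily. Everything here is hypothetical: the utility table, the size penalty, and the term and feature names are invented for illustration and are not the paper's relevance model.

```python
# Hypothetical per-term feature utilities; a real system would derive
# these from a probabilistic relevance model over handover records.
utilities = {"blood_pressure": {"numeric": 0.9, "unit": 0.6, "negation": -0.2}}

def relevance(term, subset):
    """Toy term-feature relevance: summed utility minus a size penalty."""
    return sum(utilities[term][f] for f in subset) - 0.1 * len(subset)

def personalize(term, features, relevance, k):
    """Greedy per-term search: grow the subset while relevance improves."""
    chosen, best = [], float("-inf")
    remaining = list(features)
    while remaining:
        cand = max(remaining, key=lambda f: relevance(term, chosen + [f]))
        score = relevance(term, chosen + [cand])
        if score <= best or len(chosen) >= k:
            break
        chosen.append(cand)
        best = score
        remaining.remove(cand)
    return chosen

picked = personalize("blood_pressure", ["numeric", "unit", "negation"],
                     relevance, 3)   # ["numeric", "unit"]
```

The point of the greedy stop is that the "negation" feature, whose utility for this term is negative, is never admitted even though the budget k would allow a third feature.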
38. Ordozgoiti B, Mozo A, López de Lacalle JG. Regularized greedy column subset selection. Inf Sci (N Y) 2019. [DOI: 10.1016/j.ins.2019.02.039]
39.
40. Zhang Y, Zhou Y, Zhang D, Song W. A Stroke Risk Detection: Improving Hybrid Feature Selection Method. J Med Internet Res 2019; 21:e12437. [PMID: 30938684] [PMCID: PMC6466481] [DOI: 10.2196/12437]
Abstract
Background: Stroke is one of the most common causes of mortality. Detecting an individual's risk of stroke is critical yet challenging because of the large number of risk factors. Objective: This study aimed to address the ineffective feature selection in existing research on stroke risk detection. We propose a new feature selection method, weighting- and ranking-based hybrid feature selection (WRHFS), to select important risk factors for detecting ischemic stroke. Methods: WRHFS integrates the strengths of various filter algorithms by following the principle of a wrapper approach. We employed a variety of filter-based feature selection models as the candidate set, including standard deviation, Pearson correlation coefficient, Fisher score, information gain, the Relief algorithm, and the chi-square test, and used sensitivity, specificity, accuracy, and the Youden index as performance metrics to evaluate the proposed method. Results: This study chose 792 samples from the electronic records of 13,421 patients in a community hospital. Each sample included 28 features (24 blood-test features and 4 demographic features). The proposed method selected 9 important features out of the original 28, with a cumulative contribution of 0.51, and significantly outperformed baseline methods. Using only these top 9 features, WRHFS achieved a sensitivity of 82.7% (329/398), specificity of 80.4% (317/394), classification accuracy of 81.5% (645/792), and Youden index of 0.63. We also present a chart for visualizing the risk of ischemic stroke. Conclusions: This study proposes, develops, and evaluates a new feature selection method for identifying the most important features for building effective and parsimonious models for stroke risk detection. The findings provide several novel research contributions and practical implications.
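The weighting-and-ranking idea behind WRHFS can be sketched as rank aggregation across filters. This is a minimal illustration under assumptions: only two filters with equal, fixed weights, whereas the paper combines six filters and evaluates candidates wrapper-style; the scores below are invented toy values.

```python
def ranks(scores):
    """Convert filter scores to ranks (best score gets rank 1)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    out = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out

def wrhfs_select(filter_scores, weights, k):
    """Weighted rank aggregation across filters; lower total rank wins."""
    per_feature = zip(*(ranks(s) for s in filter_scores))
    total = [sum(w * r for w, r in zip(weights, fr)) for fr in per_feature]
    return sorted(range(len(total)), key=lambda i: total[i])[:k]

# Toy run: two filters score three features; feature 0 ranks first
# under both filters, feature 2 second, feature 1 last.
std_scores = [0.9, 0.1, 0.5]    # e.g. a standard-deviation filter
corr_scores = [0.8, 0.2, 0.6]   # e.g. a correlation filter
top = wrhfs_select([std_scores, corr_scores], [0.5, 0.5], 2)  # [0, 2]
```

Working on ranks rather than raw scores is what lets heterogeneous filters (standard deviation, chi-square, information gain, and so on) be combined without putting their incommensurable score scales on a common footing first.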
Affiliation(s)
- Yonglai Zhang
- Medical Big Data Institute, Software School, North University of China, Taiyuan, China
- Yaojian Zhou
- Medical Big Data Institute, Software School, North University of China, Taiyuan, China
- Dongsong Zhang
- Department of Business Information Systems and Operations Research, Belk School of Business, University of North Carolina, Charlotte, NC, United States
- Wenai Song
- Medical Big Data Institute, Software School, North University of China, Taiyuan, China
41.
42. Lin Q, Xue Y, Wen J, Zhong P. A sharing multi-view feature selection method via Alternating Direction Method of Multipliers. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2018.12.043]
43. Shi C, Duan C, Gu Z, Tian Q, An G, Zhao R. Semi-supervised feature selection analysis with structured multi-view sparse regularization. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2018.10.027]
44.
45.
46.
47.
48. Constructing effective personalized policies using counterfactual inference from biased data sets with many features. Mach Learn 2018. [DOI: 10.1007/s10994-018-5768-3]
49. Zhao M, Lin M, Chiu B, Zhang Z, Tang XS. Trace Ratio Criterion based Discriminative Feature Selection via l2,p-norm regularization for supervised learning. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2018.08.040]
50. Sheikhpour R, Sarram MA, Sheikhpour E. Semi-supervised sparse feature selection via graph Laplacian based scatter matrix for regression problems. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2018.08.035]