7
|
A Feature Selection Algorithm Integrating Maximum Classification Information and Minimum Interaction Feature Dependency Information. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2021:3569632. [PMID: 34992644 PMCID: PMC8727115 DOI: 10.1155/2021/3569632] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/28/2021] [Revised: 11/21/2021] [Accepted: 12/07/2021] [Indexed: 11/17/2022]
Abstract
Feature selection is a key step in the analysis of high-dimensional, small-sample data. The core of feature selection is to analyse and quantify the correlation between features and class labels and the redundancy between features. However, most existing feature selection algorithms consider only the classification contribution of individual features and ignore the influence of interfeature redundancy and correlation. Therefore, this paper proposes a feature selection algorithm based on nonlinear dynamic conditional relevance (NDCRFS), developed through the study and analysis of existing feature selection ideas and methods. Firstly, redundancy and relevance between features, and between features and class labels, are discriminated by mutual information, conditional mutual information, and interactive mutual information. Secondly, the selected features and candidate features are dynamically weighted using information gain factors. Finally, to evaluate its performance, NDCRFS was compared with 6 other feature selection algorithms on three classifiers, using 12 different data sets, in terms of variability and classification metrics. The experimental results show that the NDCRFS method can improve the quality of the feature subsets and obtain better classification results.
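As a rough illustration of the relevance-versus-redundancy trade-off described in this abstract, the sketch below greedily selects features by mutual information with the labels minus mean mutual information with the features already chosen. This is a generic mRMR-style criterion, not the NDCRFS algorithm: it omits the conditional and interactive terms and the dynamic weighting, and the function names and toy data are assumptions made for the example.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log2(p(x,y) / (p(x) p(y))), written with raw counts.
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def greedy_select(features, labels, k):
    """Pick k feature names by relevance minus mean redundancy (mRMR-style).

    `features` maps a feature name to its list of discrete values.
    Ties are broken by alphabetical order of the feature name.
    """
    selected = []
    candidates = set(features)

    def score(name):
        relevance = mutual_information(features[name], labels)
        if not selected:
            return relevance
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance - redundancy

    while candidates and len(selected) < k:
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical toy data: the label encodes both x1 and x2; x1_copy duplicates x1.
toy_features = {
    "x1": [0, 0, 0, 0, 1, 1, 1, 1],
    "x1_copy": [0, 0, 0, 0, 1, 1, 1, 1],
    "x2": [0, 0, 1, 1, 0, 0, 1, 1],
}
toy_labels = [0, 0, 1, 1, 2, 2, 3, 3]

chosen = greedy_select(toy_features, toy_labels, 2)
print(chosen)
```

On this toy data all three features are equally relevant on their own, so the redundancy penalty is what steers the second pick away from the duplicate and towards x2.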
|
9
|
Jalali-Najafabadi F, Stadler M, Dand N, Jadon D, Soomro M, Ho P, Marzo-Ortega H, Helliwell P, Korendowych E, Simpson MA, Packham J, Smith CH, Barker JN, McHugh N, Warren RB, Barton A, Bowes J. Application of information theoretic feature selection and machine learning methods for the development of genetic risk prediction models. Sci Rep 2021; 11:23335. [PMID: 34857774 PMCID: PMC8640070 DOI: 10.1038/s41598-021-00854-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/03/2021] [Accepted: 09/27/2021] [Indexed: 01/20/2023] Open
Abstract
In view of the growth of clinical risk prediction models using genetic data, there is an increasing need for studies that use appropriate methods to select the optimum number of features from a large number of genetic variants with a high degree of redundancy between features due to linkage disequilibrium (LD). Filter feature selection methods based on information theoretic criteria are well suited to this challenge and will identify a subset of the original variables that should result in more accurate prediction. However, data collected from cohort studies are often high-dimensional genetic data with potential confounders, presenting challenges to feature selection and risk prediction machine learning models. Patients with psoriasis are at high risk of developing a chronic arthritis known as psoriatic arthritis (PsA). The prevalence of PsA in this patient group can be up to 30%, and the identification of high-risk patients represents an important area of clinical research that would allow early intervention and a reduction of disability. This also provides us with an ideal scenario for the development of clinical risk prediction models and an opportunity to explore the application of information theoretic criteria methods. In this study, we developed feature selection and PsA risk prediction models that were applied to a cross-sectional genetic dataset of 1462 PsA cases and 1132 cutaneous-only psoriasis (PsC) cases using 2-digit HLA alleles imputed using the SNP2HLA algorithm. We also developed a stratification method to mitigate the impact of potential confounder features and illustrate that confounding features impact the feature selection. The mitigated dataset was used to train seven supervised machine learning methods: 80% of the data was randomly selected for training using stratified nested cross validation, and 20% was randomly selected as a holdout set for internal validation. The risk prediction models were then further validated in the UK Biobank dataset containing data on 1187 participants and a set of features overlapping with the training dataset. Performance of these methods was evaluated using the area under the curve (AUC), accuracy, precision, recall, F1 score and decision curve analysis (net benefit). The best model is selected based on three criteria: the 'lowest number of feature subset' with the 'maximal average AUC over the nested cross validation' and good generalisability to the UK Biobank dataset. In the original dataset, with over 100 different bootstraps and seven feature selection (FS) methods, HLA_C_*06 was selected as the most informative genetic variant. When the dataset is mitigated, the single most important genetic feature based on rank was identified as HLA_B_*27 by the seven different feature selection methods, consistent with previous analyses of this data using regression-based methods. However, the predictive accuracy of these single features post mitigation was found to be moderate (AUC = 0.54 (internal cross validation), AUC = 0.53 (internal holdout set), AUC = 0.55 (external dataset)). Sequentially adding additional HLA features based on rank improved the performance of the Random Forest classification model, where 20 2-digit features selected by Interaction Capping (ICAP) achieved AUC = 0.61 (internal cross validation), AUC = 0.57 (internal holdout set), and AUC = 0.58 (external dataset). The stratification method for mitigation of confounding features and filter information theoretic feature selection can be applied to a high-dimensional dataset with potential confounders.
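The evaluation protocol in this abstract (a random 80/20 split with a holdout set, scored by AUC) can be sketched in a few lines. The code below is only an illustrative outline under stated assumptions: the helper names, the seed, and the toy labels and scores are invented for the example, the split is stratified by class label only, and the nested cross validation and the authors' confounder stratification are not reproduced.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, holdout_fraction=0.2, seed=0):
    """Split sample indices into train/holdout, preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, holdout = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * holdout_fraction)
        holdout.extend(indices[:cut])
        train.extend(indices[cut:])
    return sorted(train), sorted(holdout)


def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    Labels are 1 (positive) or 0 (negative); tied scores count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 10 cases (1) and 10 controls (0).
toy_labels = [1] * 10 + [0] * 10
train_idx, holdout_idx = stratified_holdout(toy_labels)
print(len(train_idx), len(holdout_idx))

# Hypothetical model scores for four samples; positives rank above negatives.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

An AUC of 0.5 corresponds to random ranking, which gives a sense of scale for the reported values of 0.53 to 0.61.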
Affiliation(s)
- Farideh Jalali-Najafabadi
  - Centre for Genetics and Genomics Versus Arthritis, Centre for Musculoskeletal Research, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, The University of Manchester, Manchester, M13 9PT, UK
- Michael Stadler
  - Centre for Genetics and Genomics Versus Arthritis, Centre for Musculoskeletal Research, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, The University of Manchester, Manchester, M13 9PT, UK
- Nick Dand
  - Department of Medical and Molecular Genetics, Faculty of Life Sciences and Medicine, King's College London, London, UK
- Deepak Jadon
  - Department of Medicine, University of Cambridge, Cambridge, UK
- Mehreen Soomro
  - Centre for Genetics and Genomics Versus Arthritis, Centre for Musculoskeletal Research, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, The University of Manchester, Manchester, M13 9PT, UK
- Pauline Ho
  - Centre for Genetics and Genomics Versus Arthritis, Centre for Musculoskeletal Research, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, The University of Manchester, Manchester, M13 9PT, UK
  - NIHR Manchester Musculoskeletal Biomedical Research Unit, Central Manchester NHS Foundation Trust, Manchester Academic Health Science Centre, Manchester, UK
- Helen Marzo-Ortega
  - NIHR Leeds Biomedical Research Centre, Leeds Teaching Hospitals Trust and Leeds Institute of Rheumatic and Musculoskeletal Disease, University of Leeds, Manchester, UK
- Philip Helliwell
  - NIHR Leeds Biomedical Research Centre, Leeds Teaching Hospitals Trust and Leeds Institute of Rheumatic and Musculoskeletal Disease, University of Leeds, Manchester, UK
- Eleanor Korendowych
  - Royal National Hospital for Rheumatic Diseases and Dept Pharmacy and Pharmacology, University of Bath, Bath, UK
- Michael A Simpson
  - Department of Medical and Molecular Genetics, Faculty of Life Sciences and Medicine, King's College London, London, UK
- Jonathan Packham
  - Division of Epidemiology and Public Health, University of Nottingham, Nottingham, UK
- Catherine H Smith
  - St John's Institute of Dermatology, Guys and St Thomas' Foundation Trust, London, UK
- Jonathan N Barker
  - St John's Institute of Dermatology, Faculty of Life Sciences and Medicine, King's College London, London, UK
- Neil McHugh
  - Royal National Hospital for Rheumatic Diseases and Dept Pharmacy and Pharmacology, University of Bath, Bath, UK
- Richard B Warren
  - Dermatology Centre, Salford Royal NHS Foundation Trust, University of Manchester, Manchester, UK
- Anne Barton
  - Centre for Genetics and Genomics Versus Arthritis, Centre for Musculoskeletal Research, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, The University of Manchester, Manchester, M13 9PT, UK
  - NIHR Manchester Musculoskeletal Biomedical Research Unit, Central Manchester NHS Foundation Trust, Manchester Academic Health Science Centre, Manchester, UK
- John Bowes
  - Centre for Genetics and Genomics Versus Arthritis, Centre for Musculoskeletal Research, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, The University of Manchester, Manchester, M13 9PT, UK
  - NIHR Manchester Musculoskeletal Biomedical Research Unit, Central Manchester NHS Foundation Trust, Manchester Academic Health Science Centre, Manchester, UK
|
10
|
Multi-Label Feature Selection Combining Three Types of Conditional Relevance. ENTROPY 2021; 23:e23121617. [PMID: 34945923 PMCID: PMC8700541 DOI: 10.3390/e23121617] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2021] [Revised: 11/19/2021] [Accepted: 11/25/2021] [Indexed: 11/17/2022]
Abstract
With the rapid growth of the Internet, the curse of dimensionality caused by massive multi-label data has attracted extensive attention. Feature selection plays an indispensable role in dimensionality reduction. Many researchers have approached this subject using information theory. Here, to evaluate feature relevance, a novel feature relevance term (FR) is designed that employs three incremental information terms to comprehensively consider three key aspects (candidate features, selected features, and label correlations). A thorough examination of these three aspects is more favorable to capturing the optimal features. Moreover, we employ label-related feature redundancy as the label-related feature redundancy term (LR) to reduce unnecessary redundancy. A multi-label feature selection method that integrates FR with LR is therefore proposed, namely, Feature Selection combining three types of Conditional Relevance (TCRFS). Numerous experiments indicate that TCRFS outperforms 6 other state-of-the-art multi-label approaches on 13 multi-label benchmark data sets from 4 domains.
|