1
Paja W. Application of the Fuzzy Approach for Evaluating and Selecting Relevant Objects, Features, and Their Ranges. Entropy (Basel, Switzerland) 2023; 25:1223. [PMID: 37628253 PMCID: PMC10453594 DOI: 10.3390/e25081223] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2023] [Revised: 08/08/2023] [Accepted: 08/14/2023] [Indexed: 08/27/2023]
Abstract
Relevant attribute selection in machine learning is a key aspect aimed at simplifying the problem, reducing its dimensionality, and consequently accelerating computation. This paper proposes new algorithms for selecting relevant features and evaluating and selecting a subset of relevant objects in a dataset. Both algorithms are mainly based on the use of a fuzzy approach. The research presented here yielded preliminary results of a new approach to the problem of selecting relevant attributes and objects and selecting appropriate ranges of their values. Detailed results obtained on the Sonar dataset show the positive effects of this approach. Moreover, the observed results may suggest the effectiveness of the proposed method in terms of identifying a subset of truly relevant attributes from among those identified by traditional feature selection methods.
Affiliation(s)
- Wiesław Paja
- Institute of Computer Science, College of Natural Sciences, University of Rzeszów, Rejtana Str. 16C, 35-959 Rzeszów, Poland
2
Chen N, Chen S, Zhang Q, Wang SR, Tang LJ, Jiang JH, Yu RQ, Zhou YP. Robust classification and biomarker discovery of inherited metabolic diseases using GC-MS urinary metabolomics analysis combined with chemometrics. Microchem J 2023. [DOI: 10.1016/j.microc.2023.108600] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/08/2023]
3
Computational Analysis Identifies Novel Biomarkers for High-Risk Bladder Cancer Patients. Int J Mol Sci 2022; 23:ijms23137057. [PMID: 35806060 PMCID: PMC9266725 DOI: 10.3390/ijms23137057] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/26/2022] [Revised: 05/24/2022] [Accepted: 05/26/2022] [Indexed: 12/04/2022]
Abstract
In the case of bladder cancer, carcinoma in situ (CIS) is known to have poor diagnosis. However, there are not enough studies that examine the biomarkers relevant to CIS development. Omics experiments generate data with tens of thousands of descriptive variables, e.g., gene expression levels. Often, many of these descriptive variables are identified as somehow relevant, resulting in hundreds or thousands of relevant variables for building models or for further data analysis. We analyze one such dataset describing patients with bladder cancer, mostly non-muscle-invasive (NMIBC), and propose a novel approach to feature selection. This approach returns high-quality features for prediction and yet allows interpretability as well as a certain level of insight into the analyzed data. As a result, we obtain a small set of seven of the most useful biomarkers for diagnostics. They can also be used to build tests that avoid the costly and time-consuming existing methods. We summarize the current biological knowledge of the chosen biomarkers and contrast it with our findings.
4
Yin D, Chen D, Tang Y, Dong H, Li X. Adaptive feature selection with shapley and hypothetical testing: Case study of EEG feature engineering. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2021.11.063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
5
A Multi-Objective Multi-Label Feature Selection Algorithm Based on Shapley Value. Entropy 2021; 23:e23081094. [PMID: 34441234 PMCID: PMC8394764 DOI: 10.3390/e23081094] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/21/2021] [Revised: 08/15/2021] [Accepted: 08/18/2021] [Indexed: 01/16/2023]
Abstract
Multi-label learning is dedicated to learning functions so that each sample is labeled with a true label set. With the increase of data knowledge, the feature dimensionality is increasing. However, high-dimensional information may contain noisy data, making the process of multi-label learning difficult. Feature selection is a technical approach that can effectively reduce the data dimension. In the study of feature selection, the multi-objective optimization algorithm has shown an excellent global optimization performance. The Pareto relationship can handle contradictory objectives in the multi-objective problem well. Therefore, a Shapley value-fused feature selection algorithm for multi-label learning (SHAPFS-ML) is proposed. The method takes multi-label criteria as the optimization objectives and the proposed crossover and mutation operators based on Shapley value are conducive to identifying relevant, redundant and irrelevant features. The comparison of experimental results on real-world datasets reveals that SHAPFS-ML is an effective feature selection method for multi-label classification, which can reduce the classification algorithm’s computational complexity and improve the classification accuracy.
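To make the Shapley value mentioned in this abstract concrete, here is a minimal sketch of its textbook definition. It is not the SHAPFS-ML crossover or mutation operator, which the abstract does not specify. The score function `toy_score` and the feature names are hypothetical and chosen only so that one feature pair is redundant and one feature is irrelevant.

```python
from itertools import permutations


def shapley_values(features, value):
    """Exact Shapley value of each feature for a set-valued score function.

    Averages each feature's marginal contribution over all orderings,
    so it is only practical for a handful of features.
    """
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        chosen = frozenset()
        for f in order:
            with_f = chosen | {f}
            totals[f] += value(with_f) - value(chosen)
            chosen = with_f
    return {f: total / len(orderings) for f, total in totals.items()}


def toy_score(subset):
    """Hypothetical score: 'a' and 'b' are redundant, 'c' adds on its own, 'd' is irrelevant."""
    score = 0.0
    if "a" in subset or "b" in subset:
        score += 0.6
    if "c" in subset:
        score += 0.3
    return score


phi = shapley_values(["a", "b", "c", "d"], toy_score)
```

In this toy setting the redundant pair shares its credit equally, the independent feature keeps its full contribution, and the irrelevant feature receives zero, which is the kind of signal that could help separate relevant, redundant, and irrelevant features.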
6
Lesiński W, Mnich K, Rudnicki WR. Prediction of Alternative Drug-Induced Liver Injury Classifications Using Molecular Descriptors, Gene Expression Perturbation, and Toxicology Reports. Front Genet 2021; 12:661075. [PMID: 34276771 PMCID: PMC8282233 DOI: 10.3389/fgene.2021.661075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2021] [Accepted: 05/25/2021] [Indexed: 11/13/2022]
Abstract
Motivation: Drug-induced liver injury (DILI) is one of the primary problems in drug development. Early prediction of DILI, based on the chemical properties of substances and experiments performed on cell lines, would bring a significant reduction in the cost of clinical trials and faster development of drugs. The current study aims to build predictive models of risk of DILI for chemical compounds using multiple sources of information. Methods: Using several supervised machine learning algorithms, we built predictive models for several alternative splits of compounds between DILI and non-DILI classes. To this end, we used chemical properties of the given compounds, their effects on gene expression levels in six human cell lines treated with them, as well as their toxicological profiles. First, we identified the most informative variables in all data sets. Then, these variables were used to build machine learning models. Finally, composite models were built with the Super Learner approach. All modeling was performed using multiple repeats of cross-validation for unbiased and precise estimates of performance. Results: With one exception, gene expression profiles of human cell lines were non-informative and resulted in random models. Toxicological reports were not useful for prediction of DILI. The best results were obtained for models discerning between harmless compounds and those for which any level of DILI was observed (AUC = 0.75). These models were built with the Random Forest algorithm using molecular descriptors.
Affiliation(s)
- Wojciech Lesiński
- Institute of Computer Science, University of Bialystok, Białystok, Poland
- Krzysztof Mnich
- Computational Center, University of Bialystok, Białystok, Poland
- Witold R Rudnicki
- Institute of Computer Science, University of Bialystok, Białystok, Poland
- Computational Center, University of Bialystok, Białystok, Poland
7
8
MRMR-SSA: a hybrid approach for optimal feature selection. Evolutionary Intelligence 2021. [DOI: 10.1007/s12065-021-00608-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
9
Polewko-Klim A, Mnich K, Rudnicki WR. Robust Data Integration Method for Classification of Biomedical Data. J Med Syst 2021; 45:45. [PMID: 33624190 PMCID: PMC7902598 DOI: 10.1007/s10916-021-01718-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2020] [Accepted: 01/26/2021] [Indexed: 10/26/2022]
Abstract
We present a protocol for integrating two types of biological data - clinical and molecular - for more effective classification of patients with cancer. The proposed approach is a hybrid between early and late data integration strategies. In this hybrid protocol, the set of informative clinical features is extended by the classification results based on molecular data sets. The results are then treated as new synthetic variables. The hybrid protocol was applied to METABRIC breast cancer samples and TCGA urothelial bladder carcinoma samples. Various data types were used for clinical endpoint prediction: clinical data, gene expression, somatic copy number aberrations, RNA-Seq, methylation, and reverse phase protein array. The performance of the hybrid data integration was evaluated with a repeated cross-validation procedure and compared with other methods of data integration: early integration and late integration via super learning. The hybrid method gave similar results to those obtained by the best of the tested variants of super learning. What is more, the hybrid method allowed for further sensitivity analysis and recursive feature elimination, which led to compact predictive models for cancer clinical endpoints. For breast cancer, the final model consists of eight clinical variables and two synthetic features obtained from molecular data. For urothelial bladder carcinoma, only two clinical features and one synthetic variable were necessary to build the best predictive model. We have shown that the inclusion of the synthetic variables based on the RNA expression levels and copy number alterations can lead to improved quality of prognostic tests. Thus, it should be considered for inclusion in wider medical practice.
Affiliation(s)
- Aneta Polewko-Klim
- Institute of Computer Science, University of Bialystok, Bialystok, Poland
- Krzysztof Mnich
- Computational Center, University of Bialystok, Bialystok, Poland
- Witold R. Rudnicki
- Institute of Computer Science, University of Bialystok, Bialystok, Poland
- Computational Center, University of Bialystok, Bialystok, Poland
10
Lesiński W, Mnich K, Golińska AK, Rudnicki WR. Integration of human cell lines gene expression and chemical properties of drugs for Drug Induced Liver Injury prediction. Biol Direct 2021; 16:2. [PMID: 33422118 PMCID: PMC7796564 DOI: 10.1186/s13062-020-00286-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/26/2020] [Accepted: 12/01/2020] [Indexed: 11/10/2022]
Abstract
Motivation: Drug-induced liver injury (DILI) is one of the primary problems in drug development. Early prediction of DILI can bring a significant reduction in the cost of clinical trials. In this work we examined whether the occurrence of DILI can be predicted using gene expression profiles in cancer cell lines and chemical properties of drugs. Methods: We used gene expression profiles from 13 human cell lines, as well as molecular properties of drugs, to build Machine Learning models of DILI. To this end, we used a robust cross-validated protocol based on feature selection and the Random Forest algorithm. In this protocol we first identify the most informative variables and then use them to build predictive models. The models are first built using data from single cell lines and chemical properties. Then they are integrated using the Super Learner method with several underlying methods for integration. The entire modelling process is performed using nested cross-validation. Results: We obtained weakly predictive ML models when using either molecular descriptors or some individual cell lines (AUC ∈ (0.55, 0.61)). Models obtained with the Super Learner approach have a significantly improved accuracy (AUC = 0.73), which allows substances to be divided into two categories: low-risk and high-risk.
Affiliation(s)
- Wojciech Lesiński
- Institute of Computer Science, University of Białystok, Ciołkowskiego 1M, Białystok, Poland
- Krzysztof Mnich
- Computational Center, University of Białystok, Ciołkowskiego 1M, Białystok, Poland
- Witold R. Rudnicki
- Institute of Computer Science, University of Białystok, Ciołkowskiego 1M, Białystok, Poland
- Computational Center, University of Białystok, Ciołkowskiego 1M, Białystok, Poland
11
Polewko-Klim A, Lesiński W, Golińska AK, Mnich K, Siwek M, Rudnicki WR. Sensitivity analysis based on the random forest machine learning algorithm identifies candidate genes for regulation of innate and adaptive immune response of chicken. Poult Sci 2020; 99:6341-6354. [PMID: 33248550 PMCID: PMC7704721 DOI: 10.1016/j.psj.2020.08.059] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 10/24/2019] [Revised: 07/14/2020] [Accepted: 08/11/2020] [Indexed: 11/25/2022]
Abstract
Two categories of immune responses—innate and adaptive immunity—have both polygenic backgrounds and a significant environmental component. The goal of the reported study was to define candidate genes and mutations for the immune traits of interest in chickens using machine learning–based sensitivity analysis for single-nucleotide polymorphisms (SNPs) located in candidate genes defined in quantitative trait loci regions. Here the adaptive immunity is represented by the specific antibody response toward keyhole limpet hemocyanin (KLH), whereas the innate immunity is represented by natural antibodies toward lipopolysaccharide (LPS) and lipoteichoic acid (LTA). The analysis consisted of 3 basic steps: an identification of candidate SNPs via feature selection, an optimisation of the feature set using recursive feature elimination, and finally a gene-level sensitivity analysis for final selection of models. The predictive model based on 5 genes (MAPK8IP3, CRLF3, UNC13D, ILR9, and PRCKB) explains 14.9% of variance for KLH adaptive response. The models obtained for LTA and LPS use more genes and have lower predictive power, explaining 7.8% and 4.5% of total variance, respectively. In comparison, the linear models built on genes identified by a standard statistical analysis explain 1.5, 0.5, and 0.3% of variance for KLH, LTA, and LPS response, respectively. The present study shows that machine learning methods applied to systems with a complex interaction network can discover phenotype-genotype associations with much higher sensitivity than traditional statistical models. It adds to the evidence suggesting a role of MAPK8IP3 in the adaptive immune response. It also indicates that CRLF3 is involved in this process as well. Both findings need additional verification.
Affiliation(s)
- Aneta Polewko-Klim
- Institute of Computer Science, University of Bialystok, Białystok, Poland
- Wojciech Lesiński
- Institute of Computer Science, University of Bialystok, Białystok, Poland
- Krzysztof Mnich
- Computational Centre, University of Bialystok, Białystok, Poland
- Maria Siwek
- Animal Biotechnology and Genetics Department, University of Technology and Life Sciences, Bydgoszcz, Poland
- Witold R Rudnicki
- Institute of Computer Science, University of Bialystok, Białystok, Poland
- Computational Centre, University of Bialystok, Białystok, Poland
- Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Warsaw, Poland
12
Towards Prediction of Heart Arrhythmia Onset Using Machine Learning. Lecture Notes in Computer Science 2020. [PMCID: PMC7303682 DOI: 10.1007/978-3-030-50423-6_28] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
Abstract
The current study aims to predict the onset of malignant cardiac arrhythmia in patients with Implantable Cardioverter-Defibrillators (ICDs) using Machine Learning algorithms. The input data consisted of 184 signals of RR-intervals from 29 patients with ICD, recorded both during normal heartbeat and arrhythmia. For every signal we generated 47 descriptors with different signal analysis methods. Then, we performed feature selection using several methods and used the selected features to build predictive models with the Random Forest algorithm. The entire modelling procedure was performed within a 5-fold cross-validation procedure that was repeated 10 times. Results were stable and repeatable. The results obtained (AUC = 0.82, MCC = 0.45) are statistically significant and show that RR intervals carry information about arrhythmia onset. The sample size used in this study was too small to build useful medical predictive models; hence, larger data sets should be explored to construct models of sufficient quality to be of direct utility in medical practice.
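The resampling scheme described in this abstract (5-fold cross-validation repeated 10 times, with the whole modelling procedure inside each fold) can be sketched as follows. This is an illustrative outline only: the function name `repeated_kfold` and the seed are assumptions, and the abstract does not say whether folds were stratified or grouped by patient, so this sketch does neither.

```python
import random


def repeated_kfold(n_samples, n_folds=5, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal shuffled indices round-robin so fold sizes differ by at most one.
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for k, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train_idx, test_idx


# 184 signals, as in the abstract; 5 folds x 10 repeats gives 50 splits.
splits = list(repeated_kfold(n_samples=184))
for train_idx, test_idx in splits:
    # Feature selection and model fitting would use train_idx only;
    # test_idx is reserved for scoring (e.g. AUC, MCC).
    assert not set(train_idx) & set(test_idx)
```

Keeping feature selection inside the training part of every split is what prevents information from the held-out signals leaking into the performance estimate.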
13
Krzhizhanovskaya VV, Závodszky G, Lees MH, Dongarra JJ, Sloot PMA, Brissos S, Teixeira J. Analysis of Ensemble Feature Selection for Correlated High-Dimensional RNA-Seq Cancer Data. Lecture Notes in Computer Science 2020. [PMCID: PMC7304026 DOI: 10.1007/978-3-030-50420-5_39] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
Abstract
Discovery of diagnostic and prognostic molecular markers is an important and actively pursued field in cancer research. For complex diseases, this process is often performed using Machine Learning. The current study compares two approaches for the discovery of relevant variables: application of a single feature selection algorithm versus an ensemble of diverse algorithms. These approaches are used to identify variables that are relevant for discerning four cancer types using RNA-seq profiles from the Cancer Genome Atlas. The comparison is carried out in two directions: evaluating the predictive performance of models and monitoring the stability of selected variables. The most informative features are identified using four feature selection algorithms, namely U-test, ReliefF, and two variants of the MDFS algorithm. Discerning normal and tumor tissues is performed using the Random Forest algorithm. The highest stability of the feature set was obtained when U-test was used. Unfortunately, models built on feature sets obtained from the ensemble of feature selection algorithms were no better than models developed on feature sets obtained from individual algorithms. On the other hand, the feature selectors leading to the best classification results varied between data sets.
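One common way to quantify the stability of selected variables is the mean pairwise Jaccard similarity between the feature sets chosen in different resampling runs. The abstract does not state which stability measure was used, so the sketch below is an assumption for illustration, and the gene labels are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature sets (1.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_stability(feature_sets):
    """Average Jaccard similarity over all pairs of selected feature sets."""
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        raise ValueError("need at least two feature sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical selections from three resampling runs.
runs = [
    {"g1", "g2", "g3"},
    {"g1", "g2", "g4"},
    {"g1", "g2", "g3"},
]
stability = mean_pairwise_stability(runs)
```

A value of 1.0 means every run selected the same features, while values near 0 mean the selections barely overlap.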