1
Morabito F, Adornetto C, Monti P, Amaro A, Reggiani F, Colombo M, Rodriguez-Aldana Y, Tripepi G, D’Arrigo G, Vener C, Torricelli F, Rossi T, Neri A, Ferrarini M, Cutrona G, Gentile M, Greco G. Genes selection using deep learning and explainable artificial intelligence for chronic lymphocytic leukemia predicting the need and time to therapy. Front Oncol 2023; 13:1198992. [PMID: 37719021 PMCID: PMC10501728 DOI: 10.3389/fonc.2023.1198992] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/02/2023] [Accepted: 07/31/2023] [Indexed: 09/19/2023]
Abstract
Analyzing gene expression profiles (GEP) through artificial intelligence provides meaningful insight into cancer disease. This study introduces DeepSHAP Autoencoder Filter for Genes Selection (DSAF-GS), a novel deep learning and explainable artificial intelligence-based approach for feature selection in genomics-scale data. DSAF-GS exploits the autoencoder's reconstruction capabilities without changing the original feature space, enhancing the interpretation of the results. Explainable artificial intelligence is then used to select the informative genes for chronic lymphocytic leukemia prognosis of 217 cases from a GEP database comprising roughly 20,000 genes. The model for prognosis prediction achieved an accuracy of 86.4%, a sensitivity of 85.0%, and a specificity of 87.5%. According to the proposed approach, predictions were strongly influenced by CEACAM19 and PIGP, moderately influenced by MKL1 and GNE, and poorly influenced by other genes. The 10 most influential genes were selected for further analysis. Among them, FADD, FIBP, GNE, IGF1R, MKL1, PIGP, and SLC39A6 were identified in the Reactome pathway database as involved in signal transduction, transcription, protein metabolism, immune system, cell cycle, and apoptosis. Moreover, according to the network model of the 3D protein-protein interaction (PPI) explored using the NetworkAnalyst tool, FADD, FIBP, IGF1R, QTRT1, GNE, SLC39A6, and MKL1 appear coupled into a complex network. Finally, all 10 selected genes showed a predictive power on time to first treatment (TTFT) in univariate analyses on a basic prognostic model including IGHV mutational status, del(11q) and del(17p), NOTCH1 mutations, β2-microglobulin, Rai stage, and B-lymphocytosis known to predict TTFT in CLL.
However, only IGF1R [hazard ratio (HR) 1.41, 95% CI 1.08-1.84, P=0.013], COL28A1 (HR 0.32, 95% CI 0.10-0.97, P=0.045), and QTRT1 (HR 7.73, 95% CI 2.48-24.04, P<0.001) genes were significantly associated with TTFT in multivariable analyses when combined with the prognostic factors of the basic model, ultimately increasing the Harrell's c-index and the explained variation to 78.6% (versus 76.5% of the basic prognostic model) and 52.6% (versus 42.2% of the basic prognostic model), respectively. Also, the goodness of model fit was enhanced (χ2 = 20.1, P=0.002), indicating its improved performance over the basic prognostic model. In conclusion, DSAF-GS identified a group of significant genes for CLL prognosis, suggesting future directions for bio-molecular research.
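The abstract above reports Harrell's c-index as its measure of prognostic discrimination. For readers unfamiliar with the statistic, the following is a minimal sketch of how it can be computed for right-censored data. The function name, the toy follow-up times, event flags, and risk scores are all hypothetical and are not taken from the study; the sketch is not the authors' implementation.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs whose predicted risk ordering agrees
    with the observed ordering of event times (risk ties count as 0.5)."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the subject with the shorter time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the shorter
        # time is an observed event rather than a censored observation.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: time to first treatment, event flag (1 = treated,
# 0 = censored), and a model-predicted risk score per patient.
times = [5, 8, 12, 20, 30]
events = [1, 1, 0, 1, 0]
risk = [0.9, 0.7, 0.4, 0.8, 0.1]
print(harrell_c_index(times, events, risk))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the reported change from 76.5% to 78.6% should be read.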
Affiliation(s)
- Carlo Adornetto
- Department of Mathematics and Computer Science, University of Calabria, Cosenza, Italy
- Paola Monti
- Mutagenesis and Cancer Prevention Unit, Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) Ospedale Policlinico San Martino, Genoa, Italy
- Adriana Amaro
- Tumor Epigenetics Unit, Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) Ospedale Policlinico San Martino, Genoa, Italy
- Francesco Reggiani
- Tumor Epigenetics Unit, Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) Ospedale Policlinico San Martino, Genoa, Italy
- Monica Colombo
- Molecular Pathology Unit, Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) Ospedale Policlinico San Martino, Genoa, Italy
- Giovanni Tripepi
- Consiglio Nazionale delle Ricerche, Istituto di Fisiologia Clinica del Consiglio Nazionale delle Ricerche (CNR), Reggio Calabria, Italy
- Graziella D’Arrigo
- Consiglio Nazionale delle Ricerche, Istituto di Fisiologia Clinica del Consiglio Nazionale delle Ricerche (CNR), Reggio Calabria, Italy
- Claudia Vener
- Department of Oncology and Hemato-Oncology, University of Milan, Milan, Italy
- Federica Torricelli
- Laboratory of Translational Research, Azienda Unità Sanitaria Locale - Istituto di Ricovero e Cura a Carattere Scientifico (USL-IRCCS) of Reggio Emilia, Reggio Emilia, Italy
- Teresa Rossi
- Laboratory of Translational Research, Azienda Unità Sanitaria Locale - Istituto di Ricovero e Cura a Carattere Scientifico (USL-IRCCS) of Reggio Emilia, Reggio Emilia, Italy
- Antonino Neri
- Scientific Directorate, Azienda Unità Sanitaria Locale - Istituto di Ricovero e Cura a Carattere Scientifico (USL-IRCCS) of Reggio Emilia, Reggio Emilia, Italy
- Manlio Ferrarini
- Unità Operativa (UO) Molecular Pathology, Ospedale Policlinico San Martino Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS), Genoa, Italy
- Giovanna Cutrona
- Molecular Pathology Unit, Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS) Ospedale Policlinico San Martino, Genoa, Italy
- Massimo Gentile
- Hematology Unit, Department of Onco-Hematology, Azienda Ospedaliera (A.O.) of Cosenza, Cosenza, Italy
- Department of Pharmacy and Health and Nutritional Sciences, University of Calabria, Cosenza, Italy
- Gianluigi Greco
- Department of Mathematics and Computer Science, University of Calabria, Cosenza, Italy
2
Vahabzadeh V, Moattar MH. Robust microarray data feature selection using a correntropy based distance metric learning approach. Comput Biol Med 2023; 161:107056. [PMID: 37235945 DOI: 10.1016/j.compbiomed.2023.107056] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2022] [Revised: 04/18/2023] [Accepted: 05/20/2023] [Indexed: 05/28/2023]
Abstract
Classification of high-dimensional microarray data is a challenge in bioinformatics and genetic data processing. One of the challenging issues of feature selection is the presence of outliers. The Euclidean distance metric is sensitive to outliers. In this study, a distance metric learning based feature selection approach that uses the correntropy function as the discrimination metric is proposed. For this purpose, the metric learning problem is formulated as an optimization problem and solved using the Lagrange method. The output of the approach signifies the most important and robust features. After feature selection, different classification methods such as SVM, decision trees, and NN classifiers are used to investigate the classification accuracy of the proposed method as well as precision, recall, and F-measure. Experiments are carried out on 13 high-dimensional datasets and show that the proposed method outperforms the previous models in terms of accuracy and robustness.
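The abstract's motivation, that Euclidean distance is sensitive to outliers while a correntropy-based measure is not, can be illustrated with a short sketch. It assumes the commonly used sample correntropy with an unnormalized Gaussian kernel and the correntropy-induced metric derived from it; the kernel width `sigma`, the function names, and the toy vectors are hypothetical, and this is not the paper's metric-learning formulation.

```python
import math


def euclidean_distance(x, y):
    """Ordinary Euclidean distance; grows without bound with one bad entry."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def correntropy(x, y, sigma=1.0):
    """Sample correntropy with an unnormalized Gaussian kernel:
    the mean of exp(-(x_i - y_i)^2 / (2 * sigma^2))."""
    total = sum(math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for a, b in zip(x, y))
    return total / len(x)


def correntropy_induced_metric(x, y, sigma=1.0):
    """CIM(x, y) = sqrt(kappa(0) - V(x, y)). With this kernel kappa(0) = 1,
    so the value is bounded in [0, 1] regardless of outlier magnitude."""
    return math.sqrt(max(0.0, 1.0 - correntropy(x, y, sigma)))


# Hypothetical expression vectors: `noisy` differs slightly from `clean`,
# while `outlier` additionally has one grossly corrupted measurement.
clean = [1.0, 2.0, 3.0, 4.0]
noisy = [1.1, 2.1, 2.9, 4.0]
outlier = [1.1, 2.1, 2.9, 400.0]

for name, vec in (("noisy", noisy), ("outlier", outlier)):
    print(name, euclidean_distance(clean, vec), correntropy_induced_metric(clean, vec))
```

The single corrupted entry dominates the Euclidean distance, whereas its contribution to the correntropy-induced metric saturates, which is the robustness property the abstract relies on.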
Affiliation(s)
- Venus Vahabzadeh
- Department of Software Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran.
3
Najafiamiri F, Khalafi M, Golalipour M, Azimmohseni M. On clustering of periodically correlated processes based on Hilbert-Schmidt inner product of Fourier transforms. COMMUN STAT-SIMUL C 2023. [DOI: 10.1080/03610918.2023.2170409] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/25/2023]
Affiliation(s)
- Farzad Najafiamiri
- Department of Statistics, Faculty of Science, Golestan University, Gorgan, Iran
- Mahnaz Khalafi
- Department of Statistics, Faculty of Science, Golestan University, Gorgan, Iran
- Masoud Golalipour
- Medical Cellular and Molecular Research Center, Golestan University of Medical Sciences, Gorgan, Iran
- Majid Azimmohseni
- Department of Statistics, Faculty of Science, Golestan University, Gorgan, Iran
4
A two-phase gene selection method using anomaly detection and genetic algorithm for microarray data. Knowl Based Syst 2023. [DOI: 10.1016/j.knosys.2022.110249] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2023]
5
Singh V, Verma NK. Gene Expression Data Analysis Using Feature Weighted Robust Fuzzy c-Means Clustering. IEEE Trans Nanobioscience 2022; PP:99-105. [PMID: 35259111 DOI: 10.1109/tnb.2022.3157396] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/07/2022]
Abstract
Clustering of gene expression data has been proven to be very useful in various applications, such as identifying the natural structure inherent in gene expression, understanding gene functions, mining relevant information from noisy data, and understanding gene regulation. In all these applications, genes, i.e., features, play a crucial role in characterizing them into different groups. These features may be relevant, irrelevant, or redundant, but they have different contributions during the clustering process. This paper presents a novel approach by considering the effect of features during the clustering process. In the proposed method, the fuzzy c-means objective function is modified using a weighted Euclidean distance between the features with a monotonically decreasing function. The monotonically decreasing function helps control the features' contribution during the clustering process to partition the data into more relevant clusters. The proposed approach is validated, and its performance is reported using various clustering performance measures on several standard datasets. These clustering performance measures have also been compared with multiple state-of-the-art methods.
6
Rout S, Mallick PK, Mishra D. DRBF-DS: Double RBF Kernel-Based Deep Sampling with CNNs to Handle Complex Imbalanced Datasets. ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING 2022. [DOI: 10.1007/s13369-021-06480-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/02/2022]
7
Wang T, Sun B, Jiang C, Weng H, Chu X. Kernel alignment-based three-way clustering on attribute space and its application in stroke risk identification. INT J MACH LEARN CYB 2021. [DOI: 10.1007/s13042-021-01478-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
8
Gumaei A, Sammouda R, Al-Rakhami M, AlSalman H, El-Zaart A. Feature selection with ensemble learning for prostate cancer diagnosis from microarray gene expression. Health Informatics J 2021; 27:1460458221989402. [PMID: 33570011 DOI: 10.1177/1460458221989402] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 12/15/2022]
Abstract
Cancer diagnosis using machine learning algorithms is one of the main topics of research in computer-based medical science. Prostate cancer is considered one of the leading causes of death worldwide. Data analysis of gene expression from microarrays using machine learning and soft computing algorithms is a useful tool for detecting prostate cancer in medical diagnosis. Even though traditional machine learning methods have been successfully applied for detecting prostate cancer, the large number of attributes with a small sample size of microarray data is still a challenge that limits their ability for effective medical diagnosis. Selecting a subset of relevant features from all features and choosing an appropriate machine learning method can exploit the information of microarray data to improve the accuracy rate of detection. In this paper, we propose to use a correlation feature selection (CFS) method with random committee (RC) ensemble learning to detect prostate cancer from microarray data of gene expression. A set of experiments is conducted on a public benchmark dataset using the 10-fold cross-validation technique to evaluate the proposed approach. The experimental results revealed that the proposed approach attains 95.098% accuracy, which is higher than that of related methods on the same dataset.
Affiliation(s)
- Abdu Gumaei
- Research Chair of Pervasive and Mobile Computing, King Saud University, Saudi Arabia; Taiz University, Yemen
- Mabrook Al-Rakhami
- Research Chair of Pervasive and Mobile Computing, King Saud University, Saudi Arabia
9
Pashaei E, Pashaei E. Gene selection using hybrid dragonfly black hole algorithm: A case study on RNA-seq COVID-19 data. Anal Biochem 2021; 627:114242. [PMID: 33974890 DOI: 10.1016/j.ab.2021.114242] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/04/2020] [Revised: 04/12/2021] [Accepted: 05/02/2021] [Indexed: 11/18/2022]
Abstract
This paper introduces a new hybrid approach (DBH) for solving the gene selection problem that incorporates the strengths of two existing metaheuristics: binary dragonfly algorithm (BDF) and binary black hole algorithm (BBHA). This hybridization aims to identify a limited and stable set of discriminative genes without sacrificing classification accuracy, whereas most current methods have encountered challenges in extracting disease-related information from a vast amount of redundant genes. The proposed approach first applies the minimum redundancy maximum relevancy (MRMR) filter method to reduce the dimensionality of the feature space and then utilizes the suggested hybrid DBH algorithm to determine a smaller set of significant genes. The proposed approach was evaluated on eight benchmark gene expression datasets and then compared against the latest state-of-the-art techniques to demonstrate algorithm efficiency. The comparative study shows that the proposed approach achieves a significant improvement over existing methods in terms of classification accuracy and the number of selected genes. Moreover, the performance of the suggested method was examined on real RNA-Seq coronavirus-related gene expression data of asthmatic patients for selecting the most significant genes in order to improve the discriminative accuracy of angiotensin-converting enzyme 2 (ACE2). ACE2, as a coronavirus receptor, is a biomarker that helps to distinguish infected from uninfected patients in order to identify subgroups at risk for COVID-19. The results indicate that the suggested MRMR-DBH approach represents a very promising framework for finding a new combination of the most discriminative genes with high classification accuracy.
Affiliation(s)
- Elnaz Pashaei
- Department of Software Engineering, Istanbul Aydin University, Istanbul, Turkey.
- Elham Pashaei
- Department of Computer Engineering, Istanbul Gelisim University, Istanbul, Turkey.
10
Mahendran N, Durai Raj Vincent PM, Srinivasan K, Chang CY. Machine Learning Based Computational Gene Selection Models: A Survey, Performance Evaluation, Open Issues, and Future Research Directions. Front Genet 2020; 11:603808. [PMID: 33362861 PMCID: PMC7758324 DOI: 10.3389/fgene.2020.603808] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 09/08/2020] [Accepted: 10/29/2020] [Indexed: 12/20/2022]
Abstract
Gene Expression is the process of determining the physical characteristics of living beings by generating the necessary proteins. Gene Expression takes place in two steps, transcription and translation. It is the flow of information from DNA to RNA with enzymes' help, and the end product is proteins and other biochemical molecules. Many technologies can capture Gene Expression from the DNA or RNA. One such technique is the DNA microarray. Other than being expensive, the main issue with DNA microarrays is that they generate high-dimensional data with a minimal sample size. The issue in handling such a heavyweight dataset is that the learning model will be over-fitted. This problem should be addressed by reducing the dimension of the data source by a considerable amount. In recent years, Machine Learning has gained popularity in the field of genomic studies. In the literature, many Machine Learning-based Gene Selection approaches have been discussed, which were proposed to improve dimensionality reduction precision. This paper extensively reviews the various works done on Machine Learning-based gene selection in recent years, along with their performance analysis. The study categorizes various feature selection algorithms under Supervised, Unsupervised, and Semi-supervised learning. The works done in recent years to reduce the features for diagnosing tumors are discussed in detail. Furthermore, the performance of several methods discussed in the literature is analyzed. This study also lists and briefly discusses the open issues in handling high-dimensional, small-sample-size data.
Affiliation(s)
- Nivedhitha Mahendran
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- P. M. Durai Raj Vincent
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- Kathiravan Srinivasan
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- Chuan-Yu Chang
- Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology, Douliu, Taiwan
11
Xu D, Zhang J, Xu H, Zhang Y, Chen W, Gao R, Dehmer M. Multi-scale supervised clustering-based feature selection for tumor classification and identification of biomarkers and targets on genomic data. BMC Genomics 2020; 21:650. [PMID: 32962626 PMCID: PMC7510277 DOI: 10.1186/s12864-020-07038-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 03/18/2020] [Accepted: 08/30/2020] [Indexed: 12/19/2022]
Abstract
Background The small number of samples and the curse of dimensionality hamper the application of deep learning techniques for disease classification. Additionally, the performance of clustering-based feature selection algorithms is still far from satisfactory due to their limitation in using unsupervised learning methods. To enhance interpretability and overcome this problem, we developed a novel feature selection algorithm. In the meantime, complex genomic data brought great challenges for the identification of biomarkers and therapeutic targets. Some current feature selection methods suffer from low sensitivity and specificity in this field. Results In this article, we designed a multi-scale clustering-based feature selection algorithm named MCBFS which simultaneously performs feature selection and model learning for genomic data analysis. The experimental results demonstrated that MCBFS is robust and effective by comparing it with seven benchmark and six state-of-the-art supervised methods on eight data sets. The visualization results and the statistical test showed that MCBFS can capture the informative genes and improve the interpretability and visualization of tumor gene expression and single-cell sequencing data. Additionally, we developed a general framework named McbfsNW using gene expression data and protein interaction data to identify robust biomarkers and therapeutic targets for diagnosis and therapy of diseases. The framework incorporates the MCBFS algorithm, a network recognition ensemble algorithm, and a feature selection wrapper. McbfsNW has been applied to the lung adenocarcinoma (LUAD) data sets. The preliminary results demonstrated that higher prediction performance can be attained with the identified biomarkers on the independent LUAD data set, and we also constructed a drug-target network that may be useful for LUAD therapy.
Conclusions The proposed novel feature selection method is robust and effective for gene selection, classification, and visualization. The framework McbfsNW is practical and helpful for the identification of biomarkers and targets on genomic data. It is believed that the same methods and principles are extensible and applicable to other kinds of data sets.
Affiliation(s)
- Da Xu
- School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Jialin Zhang
- School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Hanxiao Xu
- School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Yusen Zhang
- School of Mathematics and Statistics, Shandong University, Weihai, 264209, China.
- Wei Chen
- School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Rui Gao
- School of Control Science and Engineering, Shandong University, Jinan, 250061, China
- Matthias Dehmer
- Institute for Intelligent Production, Faculty for Management, University of Applied Sciences Upper Austria, Steyr Campus, Steyr, Austria; College of Computer and Control Engineering, Nankai University, Tianjin, 300071, China; Department of Mechatronics and Biomedical Computer Science, UMIT, Hall in Tyrol, Austria
12
A survey on single and multi omics data mining methods in cancer data classification. J Biomed Inform 2020; 107:103466. [DOI: 10.1016/j.jbi.2020.103466] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 12/11/2019] [Revised: 05/01/2020] [Accepted: 05/31/2020] [Indexed: 01/09/2023]
13
Uzma, Al-Obeidat F, Tubaishat A, Shah B, Halim Z. Gene encoder: a feature selection technique through unsupervised deep learning-based clustering for large gene expression data. Neural Comput Appl 2020. [DOI: 10.1007/s00521-020-05101-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 12/18/2022]
14
Mabu AM, Prasad R, Yadav R. Mining gene expression data using data mining techniques: A critical review. JOURNAL OF INFORMATION & OPTIMIZATION SCIENCES 2019. [DOI: 10.1080/02522667.2018.1555311] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 01/12/2023]
Affiliation(s)
- Audu Musa Mabu
- Department of Computer Science & Information Technology, Sam Higginbottom University of Agriculture, Technology and Sciences, Naini, Allahabad 211007, Uttar Pradesh, India
- Rajesh Prasad
- School of Information Technology & Computing, American University of Nigeria, Yola 640101, Nigeria
- Raghav Yadav
- Department of Computer Science & Information Technology, Sam Higginbottom University of Agriculture, Technology and Sciences, Naini, Allahabad 211007, Uttar Pradesh, India
15
Sharma A, Rani R. C-HMOSHSSA: Gene selection for cancer classification using multi-objective meta-heuristic and machine learning methods. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2019; 178:219-235. [PMID: 31416551 DOI: 10.1016/j.cmpb.2019.06.029] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 02/10/2019] [Revised: 06/24/2019] [Accepted: 06/27/2019] [Indexed: 05/21/2023]
Abstract
BACKGROUND AND OBJECTIVE Over the last two decades, DNA microarray technology has emerged as a powerful tool for early cancer detection and prevention. It helps to provide a detailed overview of the complex disease microenvironment. Moreover, the online availability of thousands of gene expression assays has made microarray data classification an active research area. A common goal is to find a minimal subset of genes while maximizing the classification accuracy. METHODS In pursuit of a similar objective, we propose a framework (C-HMOSHSSA) for gene selection using the multi-objective spotted hyena optimizer (MOSHO) and the salp swarm algorithm (SSA). Real-life optimization problems with more than one objective usually face the challenge of maintaining convergence and diversity. SSA maintains diversity but suffers from the overhead of maintaining the necessary information. On the other hand, MOSHO requires little computational effort and hence is used for maintaining the necessary information. Therefore, the proposed algorithm is a hybrid algorithm that utilizes the features of both SSA and MOSHO to facilitate its exploration and exploitation capability. RESULTS Four different classifiers are trained on seven high-dimensional datasets using a subset of features (genes), which are obtained after applying the proposed hybrid gene selection algorithm. The results show that the proposed technique significantly outperforms existing state-of-the-art techniques. CONCLUSION It is also shown that new sets of informative and biologically relevant genes are successfully identified by the proposed technique. The proposed approach can also be applied to other problem domains of interest which involve feature selection.
Affiliation(s)
- Aman Sharma
- Computer Science and Engineering Department, Thapar Institute of Engineering & Technology, Patiala, Punjab, India.
- Rinkle Rani
- Computer Science and Engineering Department, Thapar Institute of Engineering & Technology, Patiala, Punjab, India.
16
Zhao Q, Zhang Y. Ensemble Method of Feature Selection and Reverse Construction of Gene Logical Network Based on Information Entropy. INT J PATTERN RECOGN 2019. [DOI: 10.1142/s0218001420590041] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/18/2022]
Abstract
In this paper, we propose a novel ensemble gene selection method to obtain a gene subset. Then we provide a reverse construction method of gene network derived from expression profile data of the gene subset. The uncertainty coefficient based on information entropy is used to define the existence of logical relations among these genes. If the uncertainty coefficient between some genes exceeds predefined thresholds, the gene nodes will be connected by directed edges. Thus, a gene network is generated, which we define as gene logical network. This method is applied to the breast cancer data including control group and experimental group, with comparisons of the 2nd-order logic type distribution, average degree as well as average path length of the networks. It is found that the structures of the different networks are quite distinct. By the comparison of the degree difference between control group and experimental group, the key genes are picked up. By defining the dynamics evolution rules of state transition based on the logical regulation among the key genes in the network, the dynamic behaviors for normal breast cells and cells with cancer of different stages are simulated numerically. Some of them are, according to the literature, highly related to the development of breast cancer. The study may provide useful insight into the biological mechanism in the formation and development of cancer.
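The entropy-based uncertainty coefficient mentioned in the abstract can be sketched for two discretized gene expression vectors as follows. The sketch assumes the standard definition U(X | Y) = (H(X) - H(X | Y)) / H(X); the function names, the toy binary expression states, and the convention of returning 0.0 for a constant X are hypothetical choices made here, not details from the paper.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete states."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X | Y) computed as H(X, Y) - H(Y)."""
    return entropy(list(zip(x, y))) - entropy(y)


def uncertainty_coefficient(x, y):
    """U(X | Y): the fraction of the uncertainty in X explained by Y.
    The measure is asymmetric, which is what allows directed edges."""
    hx = entropy(x)
    if hx == 0.0:
        # Assumed convention: a constant X has no uncertainty to explain.
        return 0.0
    return (hx - conditional_entropy(x, y)) / hx


# Hypothetical discretized expression states across four samples.
gene_a = [0, 1, 2, 3]
gene_b = [0, 0, 1, 1]
print(uncertainty_coefficient(gene_b, gene_a))  # how well gene_a determines gene_b
print(uncertainty_coefficient(gene_a, gene_b))  # how well gene_b determines gene_a
```

Because U(X | Y) and U(Y | X) generally differ, comparing each against a predefined threshold yields an edge direction, in the spirit of the directed gene logical network the abstract describes.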
Affiliation(s)
- Qingfeng Zhao
- College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, Shandong 266590, P. R. China
- Shandong Province Key Laboratory of Wisdom Mine Information Technology, Shandong University of Science and Technology, Qingdao 266590, P. R. China
- Yulin Zhang
- College of Mathematics and Systems Science, Shandong University of Science and Technology, Qingdao, Shandong 266590, P. R. China
17
Kang C, Huo Y, Xin L, Tian B, Yu B. Feature selection and tumor classification for microarray data using relaxed Lasso and generalized multi-class support vector machine. J Theor Biol 2019; 463:77-91. [DOI: 10.1016/j.jtbi.2018.12.010] [Citation(s) in RCA: 43] [Impact Index Per Article: 8.6] [Received: 06/02/2018] [Revised: 11/03/2018] [Accepted: 12/06/2018] [Indexed: 02/08/2023]
18
Feature selection of gene expression data for Cancer classification using double RBF-kernels. BMC Bioinformatics 2018; 19:396. [PMID: 30373514 PMCID: PMC6206917 DOI: 10.1186/s12859-018-2400-2] [Citation(s) in RCA: 38] [Impact Index Per Article: 6.3] [Received: 04/19/2018] [Accepted: 09/26/2018] [Indexed: 12/31/2022]
Abstract
BACKGROUND Using knowledge-based interpretation to analyze omics data can not only obtain essential information regarding various biological processes, but also reflect the current physiological status of cells and tissue. The major challenge to analyze gene expression data, with a large number of genes and small samples, is to extract disease-related information from a massive amount of redundant data and noise. Gene selection, eliminating redundant and irrelevant genes, has been a key step to address this problem. RESULTS The modified method was tested on four benchmark datasets with either two-class phenotypes or multiclass phenotypes, outperforming previous methods, with relatively higher accuracy, true positive rate, false positive rate and reduced runtime. CONCLUSIONS This paper proposes an effective feature selection method, combining double RBF-kernels with weighted analysis, to extract feature genes from gene expression data, by exploring its nonlinear mapping ability.
19
Shahbeig S, Rahideh A, Helfroush MS, Kazemi K. An efficient search algorithm for biomarker selection from RNA-seq prostate cancer data. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2018. [DOI: 10.3233/jifs-171297] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/15/2022]
Affiliation(s)
- Saleh Shahbeig
- Department of Electrical and Electronics Engineering, Shiraz University of Technology, Shiraz, Iran
- Akbar Rahideh
- Department of Electrical and Electronics Engineering, Shiraz University of Technology, Shiraz, Iran
- Kamran Kazemi
- Department of Electrical and Electronics Engineering, Shiraz University of Technology, Shiraz, Iran
20
Tang T, Chen S, Zhao M, Huang W, Luo J. Very large-scale data classification based on K-means clustering and multi-kernel SVM. Soft comput 2018. [DOI: 10.1007/s00500-018-3041-0] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Indexed: 10/18/2022]
21
Truong HQ, Ngo LT, Pedrycz W. Granular Fuzzy Possibilistic C-Means Clustering approach to DNA microarray problem. Knowl Based Syst 2017. [DOI: 10.1016/j.knosys.2017.06.019] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Indexed: 01/13/2023]
|
22
|
Clinical application of modified bag-of-features coupled with hybrid neural-based classifier in dengue fever classification using gene expression data. Med Biol Eng Comput 2017; 56:709-720. [PMID: 28891000 DOI: 10.1007/s11517-017-1722-y] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 06/14/2017] [Accepted: 08/28/2017] [Indexed: 12/27/2022]
Abstract
Dengue fever detection and classification play a vital role given the recent outbreaks of different kinds of dengue fever. Recent advances in microarray technology can be employed for such classification. Several studies have established that the gene selection phase plays a significant role in classifier performance. Accordingly, the current study focused on detecting two different variations, namely, dengue fever (DF) and dengue hemorrhagic fever (DHF). A modified bag-of-features method has been proposed to select the most promising genes for the classification process. Afterward, a modified cuckoo search optimization algorithm has been used to support the artificial neural network (ANN-MCS) in classifying unknown subjects into three different classes, namely, DF, DHF, and another class containing convalescent and normal cases. The proposed method has been compared with three other well-known classifiers, namely, multilayer perceptron feed-forward network (MLP-FFN), artificial neural network (ANN) trained with cuckoo search (ANN-CS), and ANN trained with PSO (ANN-PSO). Experiments have been carried out with different numbers of clusters for the initial bag-of-features-based feature selection phase. After obtaining the reduced dataset, the hybrid ANN-MCS model has been employed for the classification process. The results have been compared in terms of confusion-matrix-based performance metrics. The experimental results indicated a highly statistically significant improvement with the proposed classifier over the traditional ANN-CS model.
|
23
|
Dashtban M, Balafar M, Suravajhala P. Gene selection for tumor classification using a novel bio-inspired multi-objective approach. Genomics 2017; 110:10-17. [PMID: 28780377 DOI: 10.1016/j.ygeno.2017.07.010] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Received: 03/19/2017] [Revised: 07/12/2017] [Accepted: 07/30/2017] [Indexed: 12/21/2022]
Abstract
Identifying informative genes has always been a major step in microarray data analysis. The complexity of various cancer datasets keeps this issue challenging. In this paper, a novel bio-inspired multi-objective algorithm is proposed for gene selection in microarray data classification, specifically in the binary domain of feature selection. The presented method extends the traditional Bat Algorithm with refined formulations, effective multi-objective operators, and novel local search strategies that employ social learning concepts in designing random walks. A hybrid model using the Fisher criterion is then applied to three widely used microarray cancer datasets to explore significant biomarkers, revealing the effectiveness of the proposed method for genomic analysis. Experimental results unveil new combinations of informative biomarkers that have associations with other studies.
Affiliation(s)
- M Dashtban
- Department of Computer Engineering, Faculty of Electrical & Computer Engineering, University of Tabriz, Iran
- Mohammadali Balafar
- Department of Computer Engineering, Faculty of Electrical & Computer Engineering, University of Tabriz, Iran
- Prashanth Suravajhala
- Birla Institute of Scientific Research, Statue Circle, Jaipur 302001, Rajasthan, India; Bioclues.org, Kukatpally, Hyderabad 500072, Telangana, India
|
24
|
Yang G, Hu Z. Gene Feature Extraction Based on Nonnegative Dual Graph Regularized Latent Low-Rank Representation. Biomed Res Int 2017; 2017:1096028. [PMID: 28466003 PMCID: PMC5390636 DOI: 10.1155/2017/1096028] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/21/2017] [Accepted: 03/13/2017] [Indexed: 01/16/2023]
Abstract
To address the high redundancy and heavy noise of gene expression profiles, a new feature extraction model based on nonnegative dual graph regularized latent low-rank representation (NNDGLLRR) is presented, building on latent low-rank representation (Lat-LRR). By introducing a dual graph manifold regularization constraint, NNDGLLRR effectively preserves the internal spatial structure of the original data and improves the final clustering accuracy while segmenting the subspace. The nonnegative constraints introduce some sparsity into the computation, which enhances the robustness of the algorithm. Unlike Lat-LRR, a new solution model is adopted to reduce the computational complexity. The experimental results show that the proposed algorithm has good feature extraction performance on highly redundant and noisy gene expression profiles and achieves better clustering accuracy than LRR and Lat-LRR.
Affiliation(s)
- Guoliang Yang
- School of Electrical Engineering and Automation, Jiangxi University of Science and Technology, Ganzhou 341000, China
- Zhengwei Hu
- School of Electrical Engineering and Automation, Jiangxi University of Science and Technology, Ganzhou 341000, China
|
25
|
Gene selection for tumor classification using neighborhood rough sets and entropy measures. J Biomed Inform 2017; 67:59-68. [PMID: 28215562 DOI: 10.1016/j.jbi.2017.02.007] [Citation(s) in RCA: 69] [Impact Index Per Article: 9.9] [Received: 06/02/2016] [Revised: 01/25/2017] [Accepted: 02/09/2017] [Indexed: 01/04/2023]
Abstract
With the development of bioinformatics, tumor classification from gene expression data has become an important technology for cancer diagnosis. Since gene expression data often contain thousands of genes and a small number of samples, gene selection has become a key step for tumor classification. Attribute reduction of rough sets has been successfully applied to gene selection, as it is data-driven and requires no additional information. However, the traditional rough set method deals with discrete data only. Gene expression data containing real-valued or noisy values are usually discretized in preprocessing, which may result in poor classification accuracy. In this paper, we propose a novel gene selection method based on the neighborhood rough set model, which can handle real-valued data while maintaining the original gene classification information. Moreover, this paper introduces an entropy measure within the framework of neighborhood rough sets for tackling the uncertainty and noise of gene expression data. Using this measure can lead to the discovery of compact gene subsets. Finally, a gene selection algorithm is designed based on neighborhood granules and the entropy measure. Experiments on two gene expression datasets show that the proposed gene selection method is effective for improving the accuracy of tumor classification.
|
26
|
Nanni L, Salvatore C, Cerasa A, Castiglioni I. Combining multiple approaches for the early diagnosis of Alzheimer's Disease. Pattern Recognit Lett 2016. [DOI: 10.1016/j.patrec.2016.10.010] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Indexed: 11/28/2022]
|