1. Jeon J, Suk Y, Kim SC, Jo HY, Kim K, Jung I. DenoiseIt: denoising gene expression data using rank based isolation trees. BMC Bioinformatics 2024; 25:271. [PMID: 39169300 PMCID: PMC11340143 DOI: 10.1186/s12859-024-05899-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2024] [Accepted: 08/13/2024] [Indexed: 08/23/2024] Open
Abstract
BACKGROUND Selecting informative genes or eliminating uninformative ones before any downstream gene expression analysis is a standard task with great impact on the results. A carefully curated gene set significantly enhances the likelihood of identifying meaningful biomarkers. METHOD In contrast to conventional forward gene search methods, which focus on selecting highly informative genes, we propose a backward search method, DenoiseIt, that aims to remove potential outlier genes, yielding a robust gene set with reduced noise. The gene set constructed by DenoiseIt is expected to capture biologically significant genes while pruning irrelevant ones to the greatest extent possible. It therefore also enhances the quality of downstream comparative gene expression analysis. DenoiseIt uses non-negative matrix factorization in conjunction with isolation forests to identify outlier rank features and remove their associated genes. RESULTS DenoiseIt was applied to both bulk and single-cell RNA-seq data collected from TCGA and a COVID-19 cohort, showing that it proficiently identified and removed genes exhibiting expression anomalies confined to specific samples rather than a known group. DenoiseIt was also shown to reduce the level of technical noise while preserving a higher proportion of biologically relevant genes compared to existing methods. The DenoiseIt software is publicly available on GitHub at https://github.com/cobi-git/DenoiseIt.
Affiliation(s)
- Jaemin Jeon
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Gwanak-gu, Seoul, 08826, Republic of Korea
- Youjeong Suk
  - School of Computer Science and Engineering, Kyungpook National University, Buk-gu, Daegu, 41566, Republic of Korea
- Sang Cheol Kim
  - Division of Healthcare and Artificial Intelligence, Department of Precision Medicine, Korea National Institute of Health, Korea Disease Control and Prevention Agency, Osong, Cheongju, 28159, Republic of Korea
- Hye-Yeong Jo
  - Division of Healthcare and Artificial Intelligence, Department of Precision Medicine, Korea National Institute of Health, Korea Disease Control and Prevention Agency, Osong, Cheongju, 28159, Republic of Korea
- Kwangsoo Kim
  - Department of Transdisciplinary Medicine, Seoul National University Hospital, Jongno-gu, Seoul, 03080, Republic of Korea
  - Department of Medicine, Seoul National University, Jongno-gu, Seoul, 03080, Republic of Korea
- Inuk Jung
  - School of Computer Science and Engineering, Kyungpook National University, Buk-gu, Daegu, 41566, Republic of Korea
2. Al-Shalif SA, Senan N, Saeed F, Ghaban W, Ibrahim N, Aamir M, Sharif W. A systematic literature review on meta-heuristic based feature selection techniques for text classification. PeerJ Comput Sci 2024; 10:e2084. [PMID: 38983195 PMCID: PMC11232610 DOI: 10.7717/peerj-cs.2084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Accepted: 05/03/2024] [Indexed: 07/11/2024]
Abstract
Feature selection (FS) is a critical step in many data science-based applications, especially in text classification, as it involves selecting relevant and important features from an original feature set. This process can improve learning accuracy, shorten learning time, and simplify outcomes. In text classification, there are often many excessive and unrelated features that degrade the performance of the applied classifiers, and various techniques have been suggested to tackle this problem, categorized as traditional techniques and meta-heuristic (MH) techniques. To discover the optimal subset of features, FS processes require a search strategy, and MH techniques use various strategies to strike a balance between exploration and exploitation. The goal of this research article is to systematically analyze the MH techniques used for FS between 2015 and 2022, focusing on 108 primary studies from three databases (Scopus, Science Direct, and Google Scholar) to identify the techniques used, as well as their strengths and weaknesses. The findings indicate that MH techniques are efficient and outperform traditional techniques, with the potential for further exploration of MH techniques such as Ringed Seal Search (RSS) to improve FS in several applications.
Affiliation(s)
- Sarah Abdulkarem Al-Shalif
  - Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia
- Norhalina Senan
  - Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia
- Faisal Saeed
  - DAAI Research Group, Department of Computing and Data Science, School of Computing and Digital Technology, University of Birmingham, Birmingham, United Kingdom
- Wad Ghaban
  - Applied College, University of Tabuk, Tabuk, Saudi Arabia
- Noraini Ibrahim
  - Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia
- Muhammad Aamir
  - School of Electronics, Computing and Mathematics, University of Derby, Derby, United Kingdom
- Wareesa Sharif
  - Faculty of Computing, The Islamia University of Bahawalpur, Bahawalpur, Pakistan
3. Arafa A, El-Fishawy N, Badawy M, Radad M. RN-Autoencoder: Reduced Noise Autoencoder for classifying imbalanced cancer genomic data. J Biol Eng 2023; 17:7. [PMID: 36717866 PMCID: PMC9887895 DOI: 10.1186/s13036-022-00319-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/13/2022] [Accepted: 12/12/2022] [Indexed: 01/31/2023] Open
Abstract
BACKGROUND In the current genomic era, gene expression datasets have become one of the main tools used in cancer classification. Both the curse of dimensionality and class imbalance are inherent characteristics of these datasets. These characteristics have a negative impact on the performance of most classifiers when used to classify cancer using genomic datasets. RESULTS This paper introduces Reduced Noise-Autoencoder (RN-Autoencoder) for pre-processing imbalanced genomic datasets for precise cancer classification. First, RN-Autoencoder addresses the curse of dimensionality by using the autoencoder for feature reduction, generating new extracted data with lower dimensionality. In the next stage, RN-Autoencoder passes the extracted data to the Reduced Noise-Synthetic Minority Oversampling Technique (RN-SMOTE), which efficiently addresses the class imbalance in the extracted data. RN-Autoencoder has been evaluated using different classifiers and various imbalanced datasets with different imbalance ratios. The results showed that classifier performance improved with RN-Autoencoder and exceeded the performance obtained with the original data and the extracted data, by percentages that depend on the classifier, dataset, and evaluation metric. Also, the performance of RN-Autoencoder has been compared to the current state of the art, yielding an increase of up to 18.017, 19.183, 18.58 and 8.87% in test accuracy on the colon, leukemia, Diffuse Large B-Cell Lymphoma (DLBCL) and Wisconsin Diagnostic Breast Cancer (WDBC) datasets, respectively. CONCLUSION RN-Autoencoder is a model for cancer classification using imbalanced gene expression datasets. It uses the autoencoder to reduce the high dimensionality of the gene expression datasets and then handles the class imbalance using RN-SMOTE.
RN-Autoencoder has been evaluated using many different classifiers and many different imbalanced datasets. The performance of many classifiers has improved and some have succeeded in classifying cancer with 100% performance in terms of all used metrics. In addition, RN-Autoencoder outperformed many recent works using the same datasets.
Affiliation(s)
- Ahmed Arafa
  - Faculty of Electronic Engineering, Menoufia University, El-Gish Street, Box No. 32951, Menouf, Menoufia, Egypt
- Nawal El-Fishawy
  - Faculty of Electronic Engineering, Menoufia University, El-Gish Street, Box No. 32951, Menouf, Menoufia, Egypt
- Mohammed Badawy
  - Faculty of Electronic Engineering, Menoufia University, El-Gish Street, Box No. 32951, Menouf, Menoufia, Egypt
- Marwa Radad
  - Faculty of Electronic Engineering, Menoufia University, El-Gish Street, Box No. 32951, Menouf, Menoufia, Egypt
4. Shojaee Z, Shahzadeh Fazeli SA, Abbasi E, Adibnia F, Masuli F, Rovetta S. A Mutual Information Based on Ant Colony Optimization Method to Feature Selection for Categorical Data Clustering. Iranian Journal of Science and Technology, Transactions A: Science 2022. [DOI: 10.1007/s40995-022-01395-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/28/2022]
5. Chamlal H, Ouaderhman T, Aaboub F. A graph based preordonnances theoretic supervised feature selection in high dimensional data. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109899] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
6. Improved swarm-optimization-based filter-wrapper gene selection from microarray data for gene expression tumor classification. Pattern Anal Appl 2022. [DOI: 10.1007/s10044-022-01117-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
7. Azadifar S, Rostami M, Berahmand K, Moradi P, Oussalah M. Graph-based relevancy-redundancy gene selection method for cancer diagnosis. Comput Biol Med 2022; 147:105766. [PMID: 35779479 DOI: 10.1016/j.compbiomed.2022.105766] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 04/10/2022] [Revised: 06/12/2022] [Accepted: 06/18/2022] [Indexed: 11/26/2022]
Abstract
Microarray data processing is one of the most important applications in molecular biology for cancer diagnosis. A major task in microarray data processing is gene selection, which aims to find a subset of genes with the least inner similarity and the most relevance to the target class. Removing unnecessary, redundant, or noisy data reduces the data dimensionality. This research advocates a graph theoretic-based gene selection method for cancer diagnosis. Both unsupervised and supervised modes use well-known and successful social network approaches such as the maximum weighted clique criterion and edge centrality to rank genes. The suggested technique has two goals: (i) to maximize the relevancy of the chosen genes with the target class and (ii) to reduce their inner redundancy. A maximum weighted clique is chosen in each iteration of this procedure. The appropriate genes are then chosen from among the existing features in this maximum clique using edge centrality and gene relevance. In the experiment, several datasets with different properties (Colon, Leukemia, SRBCT, Prostate Tumor, and Lung Cancer) are used to demonstrate the efficacy of the developed model. The method is compared to renowned filter-based gene selection approaches for cancer diagnosis, and the results demonstrate its clear superiority.
Affiliation(s)
- Saeid Azadifar
  - Department of Computer Engineering, University of Khajeh Nasir Toosi, Tehran, Iran
- Mehrdad Rostami
  - Centre for Machine Vision and Signal Processing, University of Oulu, Oulu, Finland
- Kamal Berahmand
  - School of Computer Science, Faculty of Science, Queensland University of Technology (QUT), Brisbane, Australia
- Parham Moradi
  - Department of Computer Engineering, University of Kurdistan, Sanandaj, Iran
- Mourad Oussalah
  - Centre for Machine Vision and Signal Processing, University of Oulu, Oulu, Finland
  - Research Unit of Medical Imaging, Physics, and Technology, Faculty of Medicine, University of Oulu, Finland
8. Shobana M, Balasraswathi VR, Radhika R, Oleiwi AK, Chaudhury S, Ladkat AS, Naved M, Rahmani AW. Classification and Detection of Mesothelioma Cancer Using Feature Selection-Enabled Machine Learning Technique. BioMed Research International 2022; 2022:9900668. [PMID: 35937383 PMCID: PMC9348925 DOI: 10.1155/2022/9900668] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/02/2022] [Revised: 06/30/2022] [Accepted: 07/14/2022] [Indexed: 11/18/2022]
Abstract
Cancer of the mesothelium, referred to as malignant mesothelioma (MM), is an extremely uncommon cancer that is almost always fatal. Chemotherapy, surgery, radiation therapy, and immunotherapy are all potential treatments for MM; however, the majority of patients are diagnosed at an advanced stage, at which point the disease is resistant to these therapies. After a diagnosis of advanced MM, average survival is about one year. There is a substantial link between asbestos exposure and MM. Using an approach that combines feature selection and machine learning, this article proposes a classification and detection method for mesothelioma. The correlation-based feature selection (CFS) approach is first used in the feature selection process. It acts as a filter, selecting only the features that are relevant to the classification, which improves the accuracy of the classification model. Classification is then carried out with naive Bayes, fuzzy SVM, and the ID3 algorithm. Various metrics are used to measure the effectiveness of the machine learning strategies. The choice of features was found to have a substantial influence on classification accuracy.
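The abstract does not state how CFS scores a feature subset. As a minimal sketch, the commonly cited CFS merit heuristic rates a subset of k features as k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation. The function name `cfs_merit` and its list inputs are illustrative assumptions, not the authors' implementation.

```python
from math import sqrt


def cfs_merit(feature_class_corrs, feature_feature_corrs):
    """Merit of a feature subset under the usual CFS heuristic.

    feature_class_corrs: one correlation per selected feature (feature vs. class).
    feature_feature_corrs: one correlation per pair of selected features
                           (empty when only one feature is selected).
    Both inputs are hypothetical, precomputed values for illustration.
    """
    k = len(feature_class_corrs)
    if k == 0:
        raise ValueError("at least one feature is required")
    # Mean absolute relevance of the features to the class.
    r_cf = sum(abs(r) for r in feature_class_corrs) / k
    # Mean absolute redundancy among the features (0 if there are no pairs).
    if feature_feature_corrs:
        r_ff = sum(abs(r) for r in feature_feature_corrs) / len(feature_feature_corrs)
    else:
        r_ff = 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


if __name__ == "__main__":
    # Two equally relevant features: redundancy lowers the merit.
    print(cfs_merit([0.6, 0.6], [0.0]))  # uncorrelated pair
    print(cfs_merit([0.6, 0.6], [1.0]))  # fully redundant pair
```

Under this heuristic, a subset is favored when its features correlate with the class but not with each other, which matches the filtering role the abstract describes.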
Affiliation(s)
- M. Shobana
  - SRM Institute of Science and Technology, SRM Nagar, Kattankulathur, Kanchipuram, 603203, Chennai, India
- V. R. Balasraswathi
  - Department of Networking and Communications, School of Computing, SRM Institute of Science and Technology, Kattankulathur, India
- R. Radhika
  - Department of Networking and Communications, School of Computing, SRM Institute of Science and Technology, Kattankulathur, India
- Ahmed Kareem Oleiwi
  - Department of Computer Technical Engineering, The Islamic University, 54001 Najaf, Iraq
- Ajay S. Ladkat
  - Department of Instrumentation Engineering, Vishwakarma Institute of Technology, Pune, India
- Mohd Naved
  - Amity International Business School (AIBS), Amity University, Noida, India
9. Tabakhi S, Lu H. Multi-agent Feature Selection for Integrative Multi-omics Analysis. Annu Int Conf IEEE Eng Med Biol Soc 2022; 2022:1638-1642. [PMID: 36086594 DOI: 10.1109/embc48229.2022.9871758] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Multi-omics data integration is key for cancer prediction as it captures different aspects of molecular mechanisms. Nevertheless, the high dimensionality of multi-omics data with a relatively small number of patients presents a challenge for cancer prediction tasks. While feature selection techniques have been widely used to tackle the curse of dimensionality of multi-omics data, most existing methods have been applied to each type of omics data separately. In this paper, we propose a multi-agent architecture for feature selection, called MAgentOmics, to consider all omics data together. MAgentOmics extends the ant colony optimization algorithm to multi-omics data, iteratively building candidate solutions and evaluating them. Moreover, a new fitness function is introduced to assess the candidate feature subsets without using a prediction target such as patient survival time. It can therefore be considered an unsupervised method. We evaluate the performance of MAgentOmics on the TCGA ovarian cancer multi-omics data from 176 patients using 5-fold cross-validation. The results demonstrate that the integration power of MAgentOmics is relatively better than that of the state-of-the-art supervised multi-view method. The code is publicly available at https://github.com/SinaTabakhi/MAgentOmics. Clinical relevance: Discovering knowledge in existing multi-omics datasets through better feature selection enhances the clinical understanding of cancers and speeds up decision-making in the clinic.
10. Tahmouresi A, Rashedi E, Yaghoobi MM, Rezaei M. Gene selection using pyramid gravitational search algorithm. PLoS One 2022; 17:e0265351. [PMID: 35290401 PMCID: PMC8923457 DOI: 10.1371/journal.pone.0265351] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2021] [Accepted: 02/28/2022] [Indexed: 11/24/2022] Open
Abstract
Genetics play a prominent role in the development and progression of malignant neoplasms. Identification of the relevant genes is a high-dimensional data processing problem. The pyramid gravitational search algorithm (PGSA), a hybrid method in which the number of genes is cyclically reduced, is proposed to overcome the curse of dimensionality. PGSA consists of two elements, a filter and a wrapper method (inspired by the gravitational search algorithm), which iterate through cycles. The genes selected in each cycle are passed on to the subsequent cycles to further reduce the dimension. PGSA tries to maximize the classification accuracy using the most informative genes while reducing the number of genes. Results are reported on a multi-class microarray gene expression dataset for breast cancer. Several feature selection algorithms have been implemented for a fair comparison. PGSA ranked first in terms of accuracy (84.5%) with 73 genes. To check whether the selected genes are meaningful in terms of patient survival and response to therapy, protein-protein interaction network analysis was applied to the genes. An interesting pattern emerged when examining the genetic network. HSP90AA1, PTK2 and SRC genes were amongst the top-rated bottleneck genes, and DNA damage, cell adhesion and migration pathways were highly enriched in the network.
Affiliation(s)
- Esmat Rashedi
  - Department of Electrical and Computer Engineering, Graduate University of Advanced Technology, Kerman, Iran
- Mohammad Mehdi Yaghoobi
  - Department of Biotechnology, Institute of Science and High Technology and Environmental Sciences, Graduate University of Advanced Technology, Kerman, Iran
- Masoud Rezaei
  - Faculty of Medicine, Kerman University of Medical Sciences, Kerman, Iran
11. Fan L, Ma X. Maximum power point tracking of PEMFC based on hybrid artificial bee colony algorithm with fuzzy control. Sci Rep 2022; 12:4316. [PMID: 35279691 PMCID: PMC8918329 DOI: 10.1038/s41598-022-08327-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/14/2021] [Accepted: 03/07/2022] [Indexed: 11/28/2022] Open
Abstract
Maximum power point tracking (MPPT) is an effective method to improve the power generation efficiency and power supply quality of a proton exchange membrane fuel cell (PEMFC). Due to the inherent nonlinear characteristics of PEMFC, conventional MPPT methods often struggle to achieve a satisfactory control effect. Considering this, an artificial bee colony algorithm combined with fuzzy control (ABC-fuzzy) was proposed to construct an MPPT control scheme for PEMFC. The global optimization ability of the ABC algorithm was used to approach the maximum power point of the PEMFC and avoid becoming trapped in local optima, and fuzzy control was used to eliminate the large overshoot and slow convergence of the ABC algorithm. The test results show that, compared with the perturb & observe, incremental conductance, and ABC methods, the ABC-fuzzy method gives the PEMFC greater output power, faster regulation speed, smaller steady-state error, less oscillation, and stronger anti-interference ability. The MPPT scheme based on ABC-fuzzy can effectively realize the maximum power output of PEMFC, and plays an important role in improving the service life and power supply efficiency of PEMFC.
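For context, the perturb & observe baseline named in this abstract can be sketched in a few lines. This is not the ABC-fuzzy scheme the authors propose, and the quadratic `toy_power` curve is a hypothetical stand-in rather than a PEMFC model; the step size and iteration count are likewise assumed values.

```python
def perturb_and_observe(power, v0, step=0.1, iterations=200):
    """Basic perturb & observe hill climbing on an operating variable v.

    power: callable returning output power for a given operating point.
    The tracker keeps moving in the same direction while power rises
    and reverses direction whenever power falls.
    """
    v = v0
    p_prev = power(v)
    direction = 1.0
    for _ in range(iterations):
        v += direction * step          # perturb the operating point
        p = power(v)                   # observe the resulting power
        if p < p_prev:                 # power dropped: reverse direction
            direction = -direction
        p_prev = p
    return v


def toy_power(v):
    # Hypothetical single-peak curve with its maximum at v = 5.
    return 25.0 - (v - 5.0) ** 2


if __name__ == "__main__":
    print(perturb_and_observe(toy_power, v0=1.0))
```

With a fixed step, the operating point never settles: it keeps oscillating around the peak, which is the steady-state oscillation the abstract contrasts against.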
Affiliation(s)
- Liping Fan
  - College of Information Engineering, Shenyang University of Chemical Technology, Shenyang, 110142, China
  - Key Laboratory of Collaborative Control and Optimization Technology of Industrial Environment and Resource of Liaoning Province, Shenyang University of Chemical Technology, Shenyang, 110142, China
- Xianyang Ma
  - College of Information Engineering, Shenyang University of Chemical Technology, Shenyang, 110142, China
  - Key Laboratory of Collaborative Control and Optimization Technology of Industrial Environment and Resource of Liaoning Province, Shenyang University of Chemical Technology, Shenyang, 110142, China
12. An optimization approach with weighted SCiForest and weighted Hausdorff distance for noise data and redundant data. Appl Intell 2022. [DOI: 10.1007/s10489-021-02685-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
13. Cuffless Blood Pressure Measurement Using Linear and Nonlinear Optimized Feature Selection. Diagnostics (Basel) 2022; 12:408. [PMID: 35204499 PMCID: PMC8870879 DOI: 10.3390/diagnostics12020408] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/15/2022] [Revised: 01/30/2022] [Accepted: 01/30/2022] [Indexed: 02/04/2023] Open
Abstract
Cuffless blood pressure (BP) measurement allows for frequent measurement without the patient discomfort of cuff inflation. With the availability of large datasets containing physiological waveforms, it is now possible to use them with different learning algorithms to model their relationship with changes in BP. In this paper, a novel cuffless noninvasive blood pressure measurement technique is proposed using optimized features from electrocardiogram and photoplethysmography based on multivariate symmetric uncertainty (MSU). The technique improves on other contemporary methods by optimizing features according to both their linear and nonlinear relationships with changes in blood pressure. MSU is used as a selection criterion with algorithms such as the fast correlation and ReliefF algorithms, followed by a penalty-based regression technique, to ensure the features have maximum relevance as well as minimum redundancy. The results were compared with the performance of similar techniques using the MIMIC-II dataset. After training and testing, the root mean square error (RMSE) was 5.28 mmHg for systolic BP (SBP) and 5.98 mmHg for diastolic BP (DBP). In addition, in terms of mean absolute error, the result improved to 4.27 mmHg for SBP and 5.01 mmHg for DBP compared to recent cuffless BP measurement techniques that used substantially larger datasets and feature optimization. According to the British Hypertension Society (BHS) standard, the proposed technique achieved at least grade B in all cumulative criteria for cuffless BP measurement.
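The abstract does not define MSU. As a minimal sketch, the code below computes the standard two-variable symmetric uncertainty, SU(X, Y) = 2·I(X; Y) / (H(X) + H(Y)), which is assumed here to be the quantity that the multivariate form generalizes. It expects discrete values, so continuous waveform features would need to be binned first; the function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """Two-variable symmetric uncertainty, ranging from 0 to 1.

    0 means x and y share no information; 1 means each fully
    determines the other.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    # Mutual information from the joint entropy: I = H(X) + H(Y) - H(X, Y).
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)


if __name__ == "__main__":
    print(symmetric_uncertainty([0, 0, 1, 1], [0, 0, 1, 1]))  # identical variables
    print(symmetric_uncertainty([0, 0, 1, 1], [0, 1, 0, 1]))  # independent variables
```

Because the measure is entropy-based, it responds to nonlinear as well as linear dependence, which is the property the abstract relies on when ranking features for relevance and redundancy.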
14. Jaddi NS, Saniee Abadeh M. Cell separation algorithm with enhanced search behaviour in miRNA feature selection for cancer diagnosis. Inf Syst 2022. [DOI: 10.1016/j.is.2021.101906] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/14/2022]
15. Alhenawi E, Al-Sayyed R, Hudaib A, Mirjalili S. Feature selection methods on gene expression microarray data for cancer classification: A systematic review. Comput Biol Med 2022; 140:105051. [PMID: 34839186 DOI: 10.1016/j.compbiomed.2021.105051] [Citation(s) in RCA: 37] [Impact Index Per Article: 18.5] [Received: 07/09/2021] [Revised: 11/01/2021] [Accepted: 11/15/2021] [Indexed: 11/29/2022]
Abstract
This systematic review provides researchers interested in feature selection (FS) for processing microarray data with comprehensive information about the main research directions in gene expression classification over the past seven years. A set of 132 studies published by three different publishers is reviewed. The studied papers are categorized into nine directions based on their objectives, and the levels of attention that each FS direction received are summarized. The review revealed that 'propose hybrid FS methods' was the most active research direction, accounting for 34.9% of the papers, while the other directions ranged from 13.6% down to 3%. This guides researchers in selecting the most competitive research direction. Papers in each category are thoroughly reviewed from six perspectives: method(s), classifier(s), dataset(s), dataset dimension(s) range, performance metric(s), and result(s) achieved.
Affiliation(s)
- Esra'a Alhenawi
  - King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan
- Rizik Al-Sayyed
  - King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan
- Amjad Hudaib
  - King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan
- Seyedali Mirjalili
  - Center for Artificial Intelligence Research and Optimization, Torrens University Australia, Fortitude Valley, Brisbane, 4006, QLD, Australia
  - Yonsei Frontier Lab, Yonsei University, Seoul, South Korea
16. Cauteruccio F. Alignment of Microarray Data. Methods Mol Biol 2022; 2401:217-237. [PMID: 34902131 DOI: 10.1007/978-1-0716-1839-4_14] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
The aim of microarray data analysis is to discover patterns of gene expression and to identify similar genes. Simply comparing new gene sequences to known DNA sequences often does not reveal the function of a new gene; thus, more sophisticated techniques are in order. Data mining techniques, and in particular clustering, now play an important role in bioinformatics. Analyzing vast amounts of data can be difficult; thus, a way to cluster similar data is needed. This chapter illustrates the general data mining approach used in microarray data analysis, combining clustering, alignment and similarity, and highlights a novel similarity measure capable of capturing hidden correlations between data.
Affiliation(s)
- Francesco Cauteruccio
  - Department of Mathematics and Computer Science, University of Calabria, Rende, Italy
17. Azadifar S, Ahmadi A. A graph-based gene selection method for medical diagnosis problems using a many-objective PSO algorithm. BMC Med Inform Decis Mak 2021; 21:333. [PMID: 34838034 PMCID: PMC8627636 DOI: 10.1186/s12911-021-01696-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/19/2021] [Accepted: 11/16/2021] [Indexed: 11/16/2022] Open
Abstract
Background Gene expression data play an important role in bioinformatics applications. Although such data may contain a large number of features, they tend to contain only a few samples. This can negatively impact the performance of data mining and machine learning algorithms. One of the most effective approaches to alleviate this problem is to use gene selection methods. The aim of gene selection is to reduce the dimensions (features) of gene expression data by eliminating irrelevant and redundant genes. Methods This paper presents a hybrid gene selection method based on graph theory and a many-objective particle swarm optimization (PSO) algorithm. To this end, a filter method is first used to reduce the initial space of the genes. Then, the gene space is represented as a graph, and a graph clustering method is applied to group the genes into several clusters. The many-objective PSO algorithm is then used to search for an optimal subset of genes according to several criteria, which include classification error, node centrality, specificity, edge centrality, and the number of selected genes. A repair operator is proposed to cover the whole space of the genes and ensure that at least one gene is selected from each cluster. This increases the diversity of the selected genes. Results To evaluate the performance of the proposed method, extensive experiments are conducted on seven datasets with two evaluation measures. In addition, three classifiers, Decision Tree (DT), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN), are used to compare the effectiveness of the proposed gene selection method with other state-of-the-art methods. The results of these experiments demonstrate that the proposed method not only achieves more accurate classification, but also selects fewer genes than other methods.
Conclusion This study shows that the proposed multi-objective PSO algorithm simultaneously removes irrelevant and redundant features using several different criteria. Also, the use of the clustering algorithm and the repair operator has improved the performance of the proposed method by covering the whole space of the problem.
Affiliation(s)
- Saeid Azadifar
  - Faculty of Computer Engineering, K. N. Toosi University of Technology, Tehran, Iran
- Ali Ahmadi
  - Faculty of Computer Engineering, K. N. Toosi University of Technology, Tehran, Iran
18. Mahapatra S, Sahu SS. ANOVA-particle swarm optimization-based feature selection and gradient boosting machine classifier for improved protein-protein interaction prediction. Proteins 2021; 90:443-454. [PMID: 34528291 DOI: 10.1002/prot.26236] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/25/2020] [Revised: 08/09/2021] [Accepted: 09/03/2021] [Indexed: 01/22/2023]
Abstract
Feature fusion and selection strategies have been applied to improve accuracy in the prediction of protein-protein interaction (PPI). In this paper, an embedded feature selection framework, termed AVPSO, is developed by integrating a cost function based on analysis of variance (ANOVA) with particle swarm optimization (PSO). Initially, the features of the protein sequences extracted using pseudo-amino acid composition (PseAAC), conjoint triad composition, and local descriptor are fused. Then, AVPSO is employed to select the optimal set of features. The light gradient boosting machine (LGBM) classifier is used to predict the PPIs using the optimal feature subset. In five-fold cross-validation, the proposed model (AVPSO-LGBM) achieved average accuracies of 97.12% and 95.09%, respectively, on the intraspecies PPI datasets Saccharomyces cerevisiae and Helicobacter pylori. On the interspecies PPI datasets Human-Bacillus and Human-Yersinia, average accuracies of 95.20% and 93.44% are achieved. Results obtained on independent test datasets and network datasets show that the prediction accuracy of AVPSO-LGBM is better than that of existing methods, demonstrating its generalization ability. The improved prediction performance obtained by the proposed model makes it a reliable and effective PPI prediction model.
Affiliation(s)
- Satyajit Mahapatra
- Department of Electronics and Communication Engineering, Birla Institute of Technology, Ranchi, India
- Sitanshu Sekhar Sahu
- Department of Electronics and Communication Engineering, Birla Institute of Technology, Ranchi, India
|
19
|
Bose S, Das C, Banerjee A, Ghosh K, Chattopadhyay M, Chattopadhyay S, Barik A. An ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples. PeerJ Comput Sci 2021; 7:e671. [PMID: 34616883 PMCID: PMC8459790 DOI: 10.7717/peerj-cs.671] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/18/2020] [Accepted: 07/20/2021] [Indexed: 06/13/2023]
Abstract
BACKGROUND Machine learning is one kind of machine intelligence technique that learns from data and detects inherent patterns from large, complex datasets. Due to this capability, machine learning techniques are widely used in medical applications, especially where large-scale genomic and proteomic data are used. Cancer classification based on bio-molecular profiling data is a very important topic for medical applications since it improves the diagnostic accuracy of cancer and enables a successful culmination of cancer treatments. Hence, machine learning techniques are widely used in cancer detection and prognosis. METHODS In this article, a new ensemble machine learning classification model named Multiple Filtering and Supervised Attribute Clustering algorithm based Ensemble Classification model (MFSAC-EC) is proposed which can handle class imbalance problem and high dimensionality of microarray datasets. This model first generates a number of bootstrapped datasets from the original training data where the oversampling procedure is applied to handle the class imbalance problem. The proposed MFSAC method is then applied to each of these bootstrapped datasets to generate sub-datasets, each of which contains a subset of the most relevant/informative attributes of the original dataset. The MFSAC method is a feature selection technique combining multiple filters with a new supervised attribute clustering algorithm. Then for every sub-dataset, a base classifier is constructed separately, and finally, the predictive accuracy of these base classifiers is combined using the majority voting technique forming the MFSAC-based ensemble classifier. Also, a number of most informative attributes are selected as important features based on their frequency of occurrence in these sub-datasets. RESULTS To assess the performance of the proposed MFSAC-EC model, it is applied on different high-dimensional microarray gene expression datasets for cancer sample classification. 
The proposed model is compared with well-known existing models to establish its effectiveness. The experimental results show that the generalization performance (testing accuracy) of the proposed classifier is significantly better than that of the other well-known existing models. It has also been found that the proposed model can identify many important attributes/biomarker genes.
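The majority-voting step that combines the base classifiers in this abstract can be sketched as follows. This is only an illustration of plain majority voting over per-sample label predictions; the function name `majority_vote` and the toy labels are hypothetical, and the bootstrapping, oversampling, and MFSAC feature selection stages are not reproduced here.

```python
from collections import Counter

def majority_vote(predictions_per_classifier):
    """Combine label predictions from several base classifiers by majority vote.

    predictions_per_classifier : list of equal-length label lists,
                                 one list per base classifier.
    """
    n_samples = len(predictions_per_classifier[0])
    combined = []
    for i in range(n_samples):
        # Count the votes cast for sample i across all base classifiers.
        votes = Counter(preds[i] for preds in predictions_per_classifier)
        combined.append(votes.most_common(1)[0][0])
    return combined

# Toy predictions from three hypothetical base classifiers on three samples.
base_predictions = [
    ["tumor", "normal", "tumor"],
    ["tumor", "tumor", "normal"],
    ["normal", "normal", "tumor"],
]
ensemble = majority_vote(base_predictions)
```

Using an odd number of base classifiers, as in this toy case, avoids ties in two-class problems.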
Affiliation(s)
- Shilpi Bose
- Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India
- Chandra Das
- Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India
- Abhik Banerjee
- Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India
- Kuntal Ghosh
- Machine Intelligence Unit & Center for Soft Computing Research, Indian Statistical Institute, Kolkata, West Bengal, India
- Samiran Chattopadhyay
- Department of Information Technology, Jadavpur University, Kolkata, West Bengal, India
- Aishwarya Barik
- Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India
|
20
|
Jiang Y, Luo Q, Wei Y, Abualigah L, Zhou Y. An efficient binary Gradient-based optimizer for feature selection. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2021; 18:3813-3854. [PMID: 34198414 DOI: 10.3934/mbe.2021192] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Indexed: 06/13/2023]
Abstract
Feature selection (FS) is a classic and challenging optimization task in the field of machine learning and data mining. The gradient-based optimizer (GBO) is a recently developed population-based metaheuristic inspired by the gradient-based Newton's method. It uses two main operators, the gradient search rule (GSR) and the local escape operator (LEO), together with a set of vectors to explore the search space for solving continuous problems. This article presents a binary GBO (BGBO) algorithm for feature selection problems. Eight independent GBO variants are proposed, and eight transfer functions divided into two families, S-shaped and V-shaped, are evaluated to map the continuous search space to a discrete search space. To verify the performance of the proposed binary GBO algorithm, 18 well-known UCI datasets and 10 high-dimensional datasets are tested and compared with other advanced FS methods. The experimental results show that one of the proposed binary GBO variants has the best comprehensive performance and outperforms other well-known metaheuristic algorithms in terms of the performance measures.
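The role of S-shaped and V-shaped transfer functions in turning a continuous position into a binary feature mask can be sketched as follows. The abstract does not list its eight functions, so the logistic sigmoid and |tanh| used here are assumed representatives of the two families, and the update rules (sample the bit for S-shaped, flip the current bit for V-shaped) follow a convention common in binary metaheuristics rather than anything confirmed by the source. All function names are hypothetical.

```python
import math
import random

def s_shaped(x):
    """Logistic sigmoid, an assumed representative of the S-shaped family."""
    return 1.0 / (1.0 + math.exp(-x))

def v_shaped(x):
    """|tanh(x)|, an assumed representative of the V-shaped family."""
    return abs(math.tanh(x))

def binarize_s(position, rng):
    # S-shaped convention: the transfer value is the probability that the bit is 1.
    return [1 if rng.random() < s_shaped(x) else 0 for x in position]

def binarize_v(position, current_bits, rng):
    # V-shaped convention: the transfer value is the probability of flipping the bit.
    return [1 - b if rng.random() < v_shaped(x) else b
            for x, b in zip(position, current_bits)]

rng = random.Random(42)
position = [-2.0, 0.0, 3.5]          # continuous search-space coordinates
mask_s = binarize_s(position, rng)   # one bit per feature
mask_v = binarize_v([0.0, 0.0, 0.0], [1, 0, 1], rng)
```

With a zero position the V-shaped rule never flips a bit, which is why `mask_v` equals the current bits in this example.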
Affiliation(s)
- Yugui Jiang
- College of Artificial Intelligence, Guangxi University for Nationalities, Nanning 530006, China
- Guangxi Key Laboratories of Hybrid Computation and IC Design Analysis, Nanning 530006, China
- Qifang Luo
- College of Artificial Intelligence, Guangxi University for Nationalities, Nanning 530006, China
- Guangxi Key Laboratories of Hybrid Computation and IC Design Analysis, Nanning 530006, China
- Yuanfei Wei
- Xiangsihu College of Guangxi University for Nationalities, Nanning, Guangxi 532100, China
- Laith Abualigah
- Faculty of Computer Sciences and Informatics, Amman Arab University, Amman 11953, Jordan
- Yongquan Zhou
- College of Artificial Intelligence, Guangxi University for Nationalities, Nanning 530006, China
- Guangxi Key Laboratories of Hybrid Computation and IC Design Analysis, Nanning 530006, China
|
21
|
Baliarsingh SK, Muhammad K, Bakshi S. SARA: A memetic algorithm for high-dimensional biomedical data. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2020.107009] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Indexed: 12/18/2022]
|
22
|
Dagnew G, Shekar B. Ensemble learning-based classification of microarray cancer data on tree-based features. COGNITIVE COMPUTATION AND SYSTEMS 2021. [DOI: 10.1049/ccs2.12003] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 01/03/2023] Open
Affiliation(s)
- Guesh Dagnew
- Department of Studies and Research in Computer Science, Mangalore University, Mangalore, Karnataka, India
- B.H. Shekar
- Department of Studies and Research in Computer Science, Mangalore University, Mangalore, Karnataka, India
|
23
|
An Adaptive Harmony Search Approach for Gene Selection and Classification of High Dimensional Medical Data. JOURNAL OF KING SAUD UNIVERSITY - COMPUTER AND INFORMATION SCIENCES 2021. [DOI: 10.1016/j.jksuci.2018.02.013] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Indexed: 11/20/2022]
|
24
|
Mahendran N, Durai Raj Vincent PM, Srinivasan K, Chang CY. Machine Learning Based Computational Gene Selection Models: A Survey, Performance Evaluation, Open Issues, and Future Research Directions. Front Genet 2020; 11:603808. [PMID: 33362861 PMCID: PMC7758324 DOI: 10.3389/fgene.2020.603808] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 09/08/2020] [Accepted: 10/29/2020] [Indexed: 12/20/2022] Open
Abstract
Gene Expression is the process of determining the physical characteristics of living beings by generating the necessary proteins. Gene Expression takes place in two steps, transcription and translation. It is the flow of information from DNA to RNA with enzymes' help, and the end product is proteins and other biochemical molecules. Many technologies can capture Gene Expression from the DNA or RNA. One such technique is the DNA microarray. Other than being expensive, the main issue with DNA microarrays is that they generate high-dimensional data with a minimal sample size. The issue in handling such a heavyweight dataset is that the learning model will be over-fitted. This problem should be addressed by reducing the dimension of the data source by a considerable amount. In recent years, Machine Learning has gained popularity in the field of genomic studies. In the literature, many Machine Learning-based Gene Selection approaches have been discussed, which were proposed to improve dimensionality reduction precision. This paper presents an extensive review of recent work on Machine Learning-based gene selection, along with its performance analysis. The study categorizes various feature selection algorithms under Supervised, Unsupervised, and Semi-supervised learning. The works done in recent years to reduce the features for diagnosing tumors are discussed in detail. Furthermore, the performance of several methods discussed in the literature is analyzed. This study also lists and briefly discusses the open issues in handling high-dimensional, small-sample-size data.
Affiliation(s)
- Nivedhitha Mahendran
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- P. M. Durai Raj Vincent
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- Kathiravan Srinivasan
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- Chuan-Yu Chang
- Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology, Douliu, Taiwan
|
25
|
Guo J, Jin M, Chen Y, Liu J. An embedded gene selection method using knockoffs optimizing neural network. BMC Bioinformatics 2020; 21:414. [PMID: 32962627 PMCID: PMC7510330 DOI: 10.1186/s12859-020-03717-w] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/03/2020] [Accepted: 08/19/2020] [Indexed: 11/30/2022] Open
Abstract
Background Gene selection refers to finding a small subset of discriminant genes from gene expression profiles. How to effectively select genes that affect specific phenotypic traits is an important research problem in biology. A neural network has good fitting ability when dealing with nonlinear data, and it can capture features automatically and flexibly. In this work, we propose an embedded gene selection method using a neural network. The important genes can be obtained by calculating the weight coefficients after training is completed. To address the black-box problem of neural networks and make the training results more interpretable, we use the idea of knockoffs to construct knockoff feature genes of the original feature genes. This method not only makes the feature genes compete with each other, but also makes each feature gene compete with its knockoff feature gene. This approach can help to select the key genes that affect the decision-making of neural networks. Results We use maize carotenoid, tocopherol methyltransferase, raffinose family oligosaccharide, and human breast cancer datasets for verification and analysis. Conclusions The experimental results demonstrate that the knockoffs optimizing neural network method has a better detection effect than the other existing algorithms, especially for processing nonlinear gene expression and phenotype data.
Affiliation(s)
- Juncheng Guo
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
- Institute of Information Engineering, Chinese Academy of Sciences, Beijing, 10049, China
- School of Cyber Security, University of Chinese Academy of Sciences, Beijing, 10049, China
- Min Jin
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Yuanyuan Chen
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Jianxiao Liu
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
|
26
|
MotieGhader H, Masoudi-Sobhanzadeh Y, Ashtiani SH, Masoudi-Nejad A. mRNA and microRNA selection for breast cancer molecular subtype stratification using meta-heuristic based algorithms. Genomics 2020; 112:3207-3217. [DOI: 10.1016/j.ygeno.2020.06.014] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 04/02/2020] [Revised: 05/13/2020] [Accepted: 06/02/2020] [Indexed: 02/06/2023]
|
27
|
Integration of multi-objective PSO based feature selection and node centrality for medical datasets. Genomics 2020; 112:4370-4384. [PMID: 32717320 DOI: 10.1016/j.ygeno.2020.07.027] [Citation(s) in RCA: 69] [Impact Index Per Article: 17.3] [Received: 03/06/2020] [Revised: 06/22/2020] [Accepted: 07/14/2020] [Indexed: 01/19/2023]
Abstract
In the past decades, the rapid growth of computer and database technologies has led to the rapid growth of large-scale medical datasets. On the other hand, medical applications with high-dimensional datasets that require high speed and accuracy are rapidly increasing. One of the dimensionality reduction approaches is feature selection, which can increase the accuracy of disease diagnosis and reduce its computational complexity. In this paper, a novel PSO-based multi-objective feature selection method is proposed. The proposed method consists of three main phases. In the first phase, the original features are represented as a graph. In the next phase, feature centralities for all nodes in the graph are calculated, and finally, in the third phase, an improved PSO-based search process is utilized for the final feature selection. The results on five medical datasets indicate that the proposed method improves on previous related methods in terms of efficiency and effectiveness.
|
28
|
Gene selection of non-small cell lung cancer data for adjuvant chemotherapy decision using cell separation algorithm. APPL INTELL 2020. [DOI: 10.1007/s10489-020-01740-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/15/2022]
|
29
|
Meenachi L, Ramakrishnan S. Differential evolution and ACO based global optimal feature selection with fuzzy rough set for cancer data classification. Soft comput 2020. [DOI: 10.1007/s00500-020-05070-9] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 11/29/2022]
|
30
|
Baliarsingh SK, Vipsita S. Chaotic emperor penguin optimised extreme learning machine for microarray cancer classification. IET Syst Biol 2020; 14:85-95. [PMID: 32196467 DOI: 10.1049/iet-syb.2019.0028] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Indexed: 11/19/2022] Open
Abstract
Microarray technology plays a significant role in cancer classification, where a large number of genes and samples are simultaneously analysed. For the efficient analysis of the microarray data, there is a great demand for the development of intelligent techniques. In this article, the authors propose a novel hybrid technique employing Fisher criterion, ReliefF, and extreme learning machine (ELM) based on the principle of chaotic emperor penguin optimisation algorithm (CEPO). EPO is a recently developed metaheuristic method. In the proposed method, initially, Fisher score and ReliefF are independently used as filters for relevant gene selection. Further, a novel population-based metaheuristic, namely, CEPO was proposed to pre-train the ELM by selecting the optimal input weights and hidden biases. The authors have successfully conducted experiments on seven well-known data sets. To evaluate the effectiveness, the proposed method is compared with original EPO, genetic algorithm, and particle swarm optimisation-based ELM along with other state-of-the-art techniques. The experimental results show that the proposed framework achieves better accuracy as compared to the state-of-the-art schemes. The efficacy of the proposed method is demonstrated in terms of accuracy, sensitivity, specificity, and F-measure.
Affiliation(s)
- Santos Kumar Baliarsingh
- DST-FIST Bioinformatics Lab, Department of Computer Science and Engineering, International Institute of Information Technology, Bhubaneswar, India
- Swati Vipsita
- DST-FIST Bioinformatics Lab, Department of Computer Science and Engineering, International Institute of Information Technology, Bhubaneswar, India
|
31
|
Noorie Z, Afsari F. Sparse feature selection: Relevance, redundancy and locality structure preserving guided by pairwise constraints. Appl Soft Comput 2020. [DOI: 10.1016/j.asoc.2019.105956] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 11/27/2022]
|
32
|
A memetic algorithm using emperor penguin and social engineering optimization for medical data classification. Appl Soft Comput 2019. [DOI: 10.1016/j.asoc.2019.105773] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Indexed: 11/18/2022]
|
33
|
Sun L, Wang W, Xu J, Zhang S. Improved LLE and neighborhood rough sets-based gene selection using Lebesgue measure for cancer classification on gene expression data. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2019. [DOI: 10.3233/jifs-181904] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 01/29/2023]
Affiliation(s)
- Lin Sun
- Postdoctoral Mobile Station of Biology, College of Life Science, Henan Normal University, Xinxiang, China
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, China
- Wei Wang
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, China
- Jiucheng Xu
- Postdoctoral Mobile Station of Biology, College of Life Science, Henan Normal University, Xinxiang, China
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, China
- Shiguang Zhang
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, China
|
34
|
Bir-Jmel A, Douiri SM, Elbernoussi S. Gene Selection via a New Hybrid Ant Colony Optimization Algorithm for Cancer Classification in High-Dimensional Data. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2019; 2019:7828590. [PMID: 31737086 PMCID: PMC6815598 DOI: 10.1155/2019/7828590] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 02/04/2019] [Revised: 08/14/2019] [Accepted: 09/09/2019] [Indexed: 11/18/2022]
Abstract
The recent advance in the microarray data analysis makes it easy to simultaneously measure the expression levels of several thousand genes. These levels can be used to distinguish cancerous tissues from normal ones. In this work, we are interested in gene expression data dimension reduction for cancer classification, which is a common task in most microarray data analysis studies. This reduction has an essential role in enhancing the accuracy of the classification task and helping biologists accurately predict cancer in the body; this is carried out by selecting a small subset of relevant genes and eliminating the redundant or noisy genes. In this context, we propose a hybrid approach (MWIS-ACO-LS) for the gene selection problem, based on the combination of a new graph-based approach for gene selection (MWIS), in which we seek to minimize the redundancy between genes by considering the correlation between the latter and maximize gene-ranking (Fisher) scores, and a modified ACO coupled with a local search (LS) algorithm using the classifier 1NN for measuring the quality of the candidate subsets. In order to evaluate the proposed method, we tested MWIS-ACO-LS on ten well-replicated microarray datasets of high dimensions varying from 2308 to 12600 genes. The experimental results based on ten high-dimensional microarray classification problems demonstrated the effectiveness of our proposed method.
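The gene-ranking (Fisher) score mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give its exact formula, so the code assumes one common two-class form, the squared difference of class means divided by the sum of class variances. The function name `fisher_score`, the use of population variance, and the toy expression values are all hypothetical.

```python
from statistics import mean, pvariance

def fisher_score(class_a, class_b):
    """Two-class Fisher score (assumed form): (mean_a - mean_b)^2 / (var_a + var_b)."""
    numerator = (mean(class_a) - mean(class_b)) ** 2
    denominator = pvariance(class_a) + pvariance(class_b)
    if denominator == 0:
        # Degenerate case: no within-class spread at all.
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator

# Toy expression values per gene: (samples in class A, samples in class B).
expression = {
    "g1": ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),
    "g2": ([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]),
}
scores = {gene: fisher_score(a, b) for gene, (a, b) in expression.items()}
ranking = sorted(scores, key=scores.get, reverse=True)  # most discriminative first
```

A higher score means the two classes are better separated on that gene relative to the within-class spread, so `g1` ranks ahead of `g2` here.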
Affiliation(s)
- Ahmed Bir-Jmel
- Laboratory of Mathematics, Computer Science & Applications-Security of Information, Department of Mathematics, Faculty of Sciences, Mohammed V University, Rabat, Morocco
- Sidi Mohamed Douiri
- Laboratory of Mathematics, Computer Science & Applications-Security of Information, Department of Mathematics, Faculty of Sciences, Mohammed V University, Rabat, Morocco
- Souad Elbernoussi
- Laboratory of Mathematics, Computer Science & Applications-Security of Information, Department of Mathematics, Faculty of Sciences, Mohammed V University, Rabat, Morocco
|
35
|
Fast unsupervised feature selection based on the improved binary ant system and mutation strategy. Neural Comput Appl 2019. [DOI: 10.1007/s00521-018-03991-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
|
36
|
A new optimal gene selection approach for cancer classification using enhanced Jaya-based forest optimization algorithm. Neural Comput Appl 2019. [DOI: 10.1007/s00521-019-04355-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 12/14/2022]
|
37
|
Sun L, Kong X, Xu J, Xue Z, Zhai R, Zhang S. A Hybrid Gene Selection Method Based on ReliefF and Ant Colony Optimization Algorithm for Tumor Classification. Sci Rep 2019; 9:8978. [PMID: 31222027 PMCID: PMC6586811 DOI: 10.1038/s41598-019-45223-x] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 11/28/2018] [Accepted: 06/04/2019] [Indexed: 12/20/2022] Open
Abstract
For DNA microarray datasets, tumor classification based on gene expression profiles has drawn great attention, and gene selection plays a significant role in improving the classification performance of microarray data. In this study, an effective hybrid gene selection method based on ReliefF and the Ant colony optimization (ACO) algorithm for tumor classification is proposed. First, for the ReliefF algorithm, the average distance among k nearest or k non-nearest neighbor samples is introduced to estimate the difference among samples, based on which the distances between samples in the same class or in different classes are defined, so that the weight values of genes for samples can be evaluated more effectively. To obtain stable results in emergencies, a distance coefficient is developed to construct a new formula for updating the weight coefficients of genes to further reduce the instability during calculations. When the distance between samples of the same class is decreased and the distance between samples of different classes is increased, the weight division is more obvious. Thus, the ReliefF algorithm can be improved to reduce the initial dimensionality of gene expression datasets and obtain a candidate gene subset. Second, a new pruning rule is designed to reduce dimensionality and obtain a new candidate subset with a smaller number of genes. The probability formula of the next point in the path selected by the ants is presented to highlight the closeness of the correlation relationship between the reaction variables. To increase the pheromone concentration of important genes, a new phenotype updating formula of the ACO algorithm is adopted to prevent the pheromone left by the ants from being overwhelmed with time, and then the weight coefficients of the genes are applied here to eliminate the interference of difference data as much as possible.
It follows that the improved ACO algorithm has strong positive feedback and quickly converges to an optimal solution through the accumulation and updating of pheromone. Finally, by combining the improved ReliefF algorithm and the improved ACO method, a hybrid filter-wrapper-based gene selection algorithm called RFACO-GS is proposed. The experimental results on several public gene expression datasets demonstrate that the proposed method is very effective: it can significantly reduce the dimensionality of gene expression datasets and select the most relevant genes with high classification accuracy.
Affiliation(s)
- Lin Sun
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, 453007, China
- Post-doctoral Mobile Station of Biology, College of Life Science, Henan Normal University, Xinxiang, China
- Xianglin Kong
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, 453007, China
- Jiucheng Xu
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, 453007, China
- Post-doctoral Mobile Station of Biology, College of Life Science, Henan Normal University, Xinxiang, China
- Zhan'ao Xue
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, 453007, China
- Ruibing Zhai
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, 453007, China
- Shiguang Zhang
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, 453007, China
- School of Computer Science and Technology, Tianjin University, Tianjin, 300072, China
|
38
|
|
39
|
Baliarsingh SK, Vipsita S, Muhammad K, Dash B, Bakshi S. Analysis of high-dimensional genomic data employing a novel bio-inspired algorithm. Appl Soft Comput 2019. [DOI: 10.1016/j.asoc.2019.01.007] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Indexed: 12/01/2022]
|
40
|
|
41
|
Solorio-Fernández S, Carrasco-Ochoa JA, Martínez-Trinidad JF. A review of unsupervised feature selection methods. Artif Intell Rev 2019. [DOI: 10.1007/s10462-019-09682-y] [Citation(s) in RCA: 172] [Impact Index Per Article: 34.4] [Indexed: 12/01/2022]
|
42
|
Alanni R, Hou J, Azzawi H, Xiang Y. A novel gene selection algorithm for cancer classification using microarray datasets. BMC Med Genomics 2019; 12:10. [PMID: 30646919 PMCID: PMC6334429 DOI: 10.1186/s12920-018-0447-6] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Received: 09/27/2018] [Accepted: 12/07/2018] [Indexed: 12/18/2022] Open
Abstract
Background Microarray datasets are an important medical diagnostic tool as they represent the states of a cell at the molecular level. Available microarray datasets for classifying cancer types generally have a fairly small sample size compared to the large number of genes involved. This is known as the curse of dimensionality, which is a challenging problem. Gene selection is a promising approach that addresses this problem and plays an important role in the development of efficient cancer classification, because only a small number of genes are related to the classification problem. Gene selection addresses many problems in microarray datasets, such as reducing the number of irrelevant and noisy genes and selecting the most related genes to improve the classification results. Methods An innovative Gene Selection Programming (GSP) method is proposed to select relevant genes for effective and efficient cancer classification. GSP is based on the Gene Expression Programming (GEP) method with a newly defined population initialization algorithm, a new fitness function definition, and improved mutation and recombination operators. A Support Vector Machine (SVM) with a linear kernel serves as the classifier of the GSP. Results Experimental results on ten microarray cancer datasets demonstrate that Gene Selection Programming (GSP) is effective and efficient in eliminating irrelevant and redundant genes/features from microarray datasets. The comprehensive evaluations and comparisons with other methods show that GSP gives a better compromise in terms of all three evaluation criteria, i.e., classification accuracy, number of selected genes, and computational cost. The gene set selected by GSP has shown superior performance in cancer classification compared to those selected by up-to-date representative gene selection methods. Conclusion The gene subset selected by GSP can achieve a higher classification accuracy with less processing time.
Affiliation(s)
- Russul Alanni: School of Information Technology, Deakin University, Burwood, 3125, VIC, Australia
- Jingyu Hou: School of Information Technology, Deakin University, Burwood, 3125, VIC, Australia
- Hasseeb Azzawi: School of Information Technology, Deakin University, Burwood, 3125, VIC, Australia
- Yong Xiang: School of Information Technology, Deakin University, Burwood, 3125, VIC, Australia
|
43
|
Malakar S, Ghosh M, Bhowmik S, Sarkar R, Nasipuri M. A GA based hierarchical feature selection approach for handwritten word recognition. Neural Comput Appl 2019. [DOI: 10.1007/s00521-018-3937-8] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Indexed: 11/30/2022]
|
44
|
Abstract
The automatic classification of DNA microarray data is one of the hot topics in the field of bioinformatics, since it is an effective tool for the diagnosis of diseases in patients. The aim of this chapter is to present the most relevant aspects related to the classification of microarrays. We carried out an analysis of the strategies used for the classification of microarray data and a review of the main methods used in the literature. In addition, other related aspects are addressed, such as dimensionality reduction to eliminate redundant information in genes and the treatment of imbalanced and missing data. To conclude, we present an exhaustive review of the main scientific works in journals to show the most successful techniques applied in this discipline as well as the datasets most often used to verify their effectiveness.
|
45
|
Yuan M, Yang Z, Ji G. Partial maximum correlation information: A new feature selection method for microarray data classification. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2018.09.084] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 10/28/2022]
|
46
|
Prasad Y, Biswas K, Hanmandlu M. A recursive PSO scheme for gene selection in microarray data. Appl Soft Comput 2018. [DOI: 10.1016/j.asoc.2018.06.019] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Indexed: 10/28/2022]
|
47
|
A novel feature selection method to predict protein structural class. Comput Biol Chem 2018; 76:118-129. [DOI: 10.1016/j.compbiolchem.2018.06.007] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 01/05/2018] [Revised: 05/14/2018] [Accepted: 06/30/2018] [Indexed: 01/05/2023]
|
48
|
Rahmaninia M, Moradi P. OSFSMI: Online stream feature selection method based on mutual information. Appl Soft Comput 2018. [DOI: 10.1016/j.asoc.2017.08.034] [Citation(s) in RCA: 41] [Impact Index Per Article: 6.8] [Indexed: 11/27/2022]
|
49
|
Sun L, Zhang X, Xu J, Wang W, Liu R. A gene selection approach based on the Fisher linear discriminant and the neighborhood rough set. Bioengineered 2017; 9:144-151. [PMID: 29161975 PMCID: PMC5972918 DOI: 10.1080/21655979.2017.1403678] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Indexed: 12/04/2022] Open
Abstract
In recent years, tumor classification based on gene expression profiles has drawn great attention, and related research results have been widely applied to the clinical diagnosis of major gene diseases. These studies are of tremendous importance for accurate cancer diagnosis and subtype recognition. However, the microarray data of gene expression profiles have small samples, high dimensionality, large noise and data redundancy. To further improve the classification performance of microarray data, a gene selection approach based on the Fisher linear discriminant (FLD) and the neighborhood rough set (NRS) is proposed. First, the FLD method is employed to reduce the preliminary gene data and obtain features with a strong classification ability, which form a candidate gene subset. Then, neighborhood precision and neighborhood roughness are defined in a neighborhood decision system, and the calculation approaches for neighborhood dependency and the significance of an attribute are given. A reduction model of neighborhood decision systems is presented. Thus, a gene selection algorithm based on FLD and NRS is proposed. Finally, four public gene datasets are used in the simulation experiments. Experimental results under the SVM classifier demonstrate that the proposed algorithm is effective: it can select a smaller gene subset that classifies better and obtain better classification performance.
Affiliation(s)
- Lin Sun: (a) College of Computer & Information Engineering, Henan Normal University, Xinxiang, Henan, China; (b) Post-doctoral Mobile Station of Biology, College of Life Science, Henan Normal University, Xinxiang, Henan, China; (c) Engineering Technology Research Center for Computing Intelligence & Data Mining of Henan Province, Xinxiang, Henan, China
- Xiaoyu Zhang: (a) College of Computer & Information Engineering, Henan Normal University, Xinxiang, Henan, China
- Jiucheng Xu: (a) College of Computer & Information Engineering, Henan Normal University, Xinxiang, Henan, China
- Wei Wang: (a) College of Computer & Information Engineering, Henan Normal University, Xinxiang, Henan, China; (c) Engineering Technology Research Center for Computing Intelligence & Data Mining of Henan Province, Xinxiang, Henan, China
- Ruonan Liu: (a) College of Computer & Information Engineering, Henan Normal University, Xinxiang, Henan, China
|
50
|
Verification of Three-Phase Dependency Analysis Bayesian Network Learning Method for Maize Carotenoid Gene Mining. BioMed Research International 2017; 2017:1813494. [PMID: 28828382 PMCID: PMC5554554 DOI: 10.1155/2017/1813494] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2017] [Accepted: 06/27/2017] [Indexed: 11/17/2022]
Abstract
Background and Objective: Mining the genes related to maize carotenoid components is important for improving the carotenoid content and the quality of maize. Methods: Using the entropy estimation method with a Gaussian kernel probability density estimator, we apply the three-phase dependency analysis (TPDA) Bayesian network structure learning method to construct the network of maize genes and carotenoid component traits. Results: Using two discretization methods with different discretization values, we compare the learning effect and efficiency of 10 Bayesian network structure learning methods. The method is verified and analyzed on the maize dataset of a global germplasm collection with 527 elite inbred lines. Conclusions: The results confirm the effectiveness of the TPDA method, which significantly outperforms the other 9 Bayesian network learning methods. It is an efficient method of mining genes for maize carotenoid component traits. The parameters obtained by experiments will help carry out practical gene mining effectively in the future.
|