1
|
Nekouie N, Romoozi M, Esmaeili M. A New Evolutionary Ensemble Learning of Multimodal Feature Selection from Microarray Data. Neural Process Lett 2023. [DOI: 10.1007/s11063-023-11159-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/15/2023]
|
2
|
Robust dual-graph regularized and minimum redundancy based on self-representation for semi-supervised feature selection. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.03.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
|
3
|
|
4
|
Alhenawi E, Al-Sayyed R, Hudaib A, Mirjalili S. Feature selection methods on gene expression microarray data for cancer classification: A systematic review. Comput Biol Med 2022; 140:105051. [PMID: 34839186 DOI: 10.1016/j.compbiomed.2021.105051] [Citation(s) in RCA: 37] [Impact Index Per Article: 18.5] [Received: 07/09/2021] [Revised: 11/01/2021] [Accepted: 11/15/2021] [Indexed: 11/29/2022]
Abstract
This systematic review provides researchers interested in feature selection (FS) for microarray data with comprehensive information about the main research directions in gene expression classification over the past seven years. A set of 132 studies published by three different publishers is reviewed. The studied papers are categorized into nine directions based on their objectives, and the FS directions that received various levels of attention are then summarized. The review revealed that 'propose hybrid FS methods' attracted the most interest, accounting for 34.9% of the papers, while the other directions ranged from 13.6% down to 3%. This guides researchers in selecting the most competitive research direction. Papers in each category are thoroughly reviewed from six perspectives: method(s), classifier(s), dataset(s), dataset dimension(s) range, performance metric(s), and result(s) achieved.
Affiliation(s)
- Esra'a Alhenawi
  - King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan.
- Rizik Al-Sayyed
  - King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan.
- Amjad Hudaib
  - King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan.
- Seyedali Mirjalili
  - Center for Artificial Intelligence Research and Optimization, Torrens University Australia, Fortitude Valley, Brisbane, 4006, QLD, Australia; Yonsei Frontier Lab, Yonsei University, Seoul, South Korea.
|
5
|
|
6
|
Shah SH, Iqbal MJ, Ahmad I, Khan S, Rodrigues JJPC. Optimized gene selection and classification of cancer from microarray gene expression data using deep learning. Neural Comput Appl 2020. [DOI: 10.1007/s00521-020-05367-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 12/09/2022]
|
7
|
Wang W, Xie G, Ren Z, Xie T, Li J. Gene Selection for the Discrimination of Colorectal Cancer. Curr Mol Med 2019; 20:415-428. [PMID: 31746296 DOI: 10.2174/1566524019666191119105209] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 08/20/2019] [Revised: 10/29/2019] [Accepted: 10/31/2019] [Indexed: 12/15/2022]
Abstract
BACKGROUND: Colorectal cancer (CRC) is the third most common cancer worldwide. Cancer discrimination is a typical application of gene expression analysis using a microarray technique. However, microarray data suffer from the curse of dimensionality and a usually imbalanced class distribution between the majority (tumor samples) and minority (normal samples) classes. Feature gene selection is therefore necessary and important for cancer discrimination. OBJECTIVES: To select feature genes for the discrimination of CRC. METHODS: We improve DEFSw, a feature selection algorithm based on differential evolution, by using the RUSBoost classifier and weight accuracy instead of the common classifier and evaluation measure, so that feature genes can be selected from imbalanced data. We first extract differentially expressed genes (DEGs) from the CRC dataset of the TCGA and then select the feature genes from the DEGs using the improved DEFSw algorithm. Finally, we validate the selected feature gene sets using independent datasets and retrieve the cancer-related information for these genes by text mining through the Coremine Medical online database. RESULTS: We select 16 single-gene feature sets for colorectal cancer discrimination and 19 single-gene feature sets for colon cancer discrimination only. CONCLUSIONS: In summary, we find a series of high-potential candidate biomarkers or signatures that can discriminate colon cancer, rectal cancer, or both with high sensitivity and specificity.
Affiliation(s)
- Wenhui Wang
  - Network Information Center, The Sixth Affiliated Hospital of Sun Yat-Sen University, Guangzhou, China
  - National Engineering Research Center of Digital Life, Sun Yat-sen University, Guangzhou, China
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China
- Guanglei Xie
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China
- Zhonglu Ren
  - College of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, China
- Tingyan Xie
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China
- Jinming Li
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China
|
8
|
Liver Cancer Classification Model Using Hybrid Feature Selection Based on Class-Dependent Technique for the Central Region of Thailand. Information 2019. [DOI: 10.3390/info10060187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] [Open Access]
Abstract
Liver cancer data always consist of a large number of multidimensional datasets. A dataset that has a huge number of features and multiple classes may contain features that are irrelevant to pattern classification in machine learning. Hence, feature selection improves the performance of the classification model to achieve maximum classification accuracy. The aims of the present study were to find the best feature subset and to evaluate the classification performance of the predictive model. This paper proposed a hybrid feature selection approach that combines information gain and sequential forward selection based on the class-dependent technique (IGSFS-CD) for the liver cancer classification model. Two different classifiers (decision tree and naïve Bayes) were used to evaluate feature subsets. The liver cancer datasets were obtained from the Cancer Hospital Thailand database. Three ensemble methods (ensemble classifiers, bagging, and AdaBoost) were applied to improve classification performance. The IGSFS-CD method provided good accuracy of 78.36% (sensitivity 0.7841 and specificity 0.9159) on LC_dataset-1. In addition, LC_dataset II delivered the best performance with an accuracy of 84.82% (sensitivity 0.8481 and specificity 0.9437). The IGSFS-CD method achieved better classification performance than the class-independent method. Furthermore, selecting the best feature subset could help reduce the complexity of the predictive model.
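The information-gain component mentioned in this abstract can be illustrated with a minimal sketch. It assumes discrete (already binned) feature values and is not the authors' IGSFS-CD implementation: the class-dependent step and the sequential forward selection are not shown, and the function names `entropy` and `information_gain` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the labels within each feature value.
    conditional = sum(
        (len(group) / total) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - conditional


# Toy data (made up for illustration): one feature separates the classes
# perfectly, the other carries no information about them.
labels = ["tumor", "tumor", "normal", "normal"]
ig_informative = information_gain(["hi", "hi", "lo", "lo"], labels)
ig_uninformative = information_gain(["hi", "lo", "hi", "lo"], labels)
```

Features would then be ranked by this score before any wrapper-style search is applied to the top-ranked candidates.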
|
9
|
Meenachi L, Ramakrishnan S. Evolutionary sequential genetic search technique-based cancer classification using fuzzy rough nearest neighbour classifier. Healthc Technol Lett 2018; 5:130-135. [PMID: 30155265 PMCID: PMC6103784 DOI: 10.1049/htl.2018.5041] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 06/13/2018] [Accepted: 06/29/2018] [Indexed: 11/30/2022] [Open Access]
Abstract
Cancer is one of the deadliest diseases affecting human life. The patient is more likely to survive if the disease is diagnosed in its early stages. In this Letter, the authors propose a genetic search fuzzy rough (GSFR) feature selection algorithm, which hybridises the evolutionary sequential genetic search technique with the fuzzy rough set to select features. The genetic operators of selection, crossover and mutation are applied to generate subsets of features from the dataset. Each generated subset is evaluated with the modified dependency function of the fuzzy rough set using positive and boundary regions, which acts as the fitness function. The generation and evaluation of feature subsets continue until the best subset is found, which is then used to develop the classification model. The selected features are applied to different classifiers, among which the fuzzy-rough nearest neighbour (FRNN) classifier performs best in terms of classification accuracy and computation time. Hence, FRNN is used to compare existing feature selection algorithms against the proposed GSFR feature selection algorithm. The results generated by the proposed GSFR feature selection algorithm proved to be precise when compared with other feature selection algorithms.
Affiliation(s)
- Loganathan Meenachi
  - Department of Information Technology, Dr. Mahalingam College of Engineering and Technology, Pollachi, Tamil Nadu, India
- Srinivasan Ramakrishnan
  - Department of Information Technology, Dr. Mahalingam College of Engineering and Technology, Pollachi, Tamil Nadu, India
|
10
|
Sun Y, Lu C, Li X. The Cross-Entropy Based Multi-Filter Ensemble Method for Gene Selection. Genes (Basel) 2018; 9:E258. [PMID: 29772787 PMCID: PMC5977198 DOI: 10.3390/genes9050258] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 03/12/2018] [Revised: 04/20/2018] [Accepted: 05/02/2018] [Indexed: 11/30/2022] [Open Access]
Abstract
Gene expression profiles are characterized by high dimensionality, small sample sizes, and continuous values, which makes it a great challenge to use gene expression profile data for the classification of tumor samples. This paper proposes a cross-entropy based multi-filter ensemble (CEMFE) method for microarray data classification. First, multiple filters are applied to the microarray data to obtain several pre-selected feature subsets with different classification abilities. The top N highest-ranked genes of each subset are integrated to form a new data set. Second, the cross-entropy algorithm is used to remove redundant data from the data set. Finally, a wrapper method based on forward feature selection is used to select the best feature subset. The experimental results show that the proposed method is more efficient than other gene selection methods and that it can achieve a higher classification accuracy with fewer characteristic genes.
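The first step described in this abstract, pooling the top N genes from several filters, can be sketched as follows. This is a hedged illustration, not the CEMFE implementation: the function name `top_n_union`, the filter names, and the scores are all hypothetical, and the cross-entropy redundancy removal and the forward-selection wrapper are not shown.

```python
from typing import Dict, List


def top_n_union(filter_scores: Dict[str, Dict[str, float]], n: int) -> List[str]:
    """Pool the n highest-scoring genes of every filter, without duplicates.

    filter_scores maps a filter name to a {gene: score} dict, where a higher
    score means a more relevant gene. Genes are returned in first-seen order.
    """
    selected: List[str] = []
    seen = set()
    for scores in filter_scores.values():
        # Sort by descending score; break ties by gene name for determinism.
        ranked = sorted(scores, key=lambda gene: (-scores[gene], gene))
        for gene in ranked[:n]:
            if gene not in seen:
                seen.add(gene)
                selected.append(gene)
    return selected


# Made-up scores from two hypothetical filters.
scores = {
    "filter_a": {"g1": 0.9, "g2": 0.1, "g3": 0.5, "g4": 0.3},
    "filter_b": {"g1": 0.2, "g2": 0.8, "g3": 0.7, "g4": 0.1},
}
pooled = top_n_union(scores, n=2)
```

Because different filters often agree on some genes, the pooled set is usually smaller than N times the number of filters, which is why a later redundancy-removal step still has work to do.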
Affiliation(s)
- Yingqiang Sun
  - School of Information Science and Engineering, Ningbo University, Ningbo 315000, China.
  - College of Engineering, Lishui University, Lishui 323000, China.
- Chengbo Lu
  - College of Engineering, Lishui University, Lishui 323000, China.
- Xiaobo Li
  - School of Information Science and Engineering, Ningbo University, Ningbo 315000, China.
  - College of Engineering, Lishui University, Lishui 323000, China.
|
11
|
Mandal K, Sarmah R, Bhattacharyya DK. Biomarker Identification for Cancer Disease Using Biclustering Approach: An Empirical Study. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 16:490-509. [PMID: 29993834 DOI: 10.1109/tcbb.2018.2820695] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 06/08/2023]
Abstract
This paper presents an exhaustive empirical study to identify biomarkers using two approaches, frequency-based and network-based, over seventeen different biclustering algorithms and six different cancer expression datasets. To systematically analyze the biclustering algorithms, we perform enrichment analysis, subtype identification and biomarker identification. Biclustering algorithms such as C&C, SAMBA and Plaid are useful for detecting biomarkers by both approaches for all datasets except prostate cancer. We detect a total of 102 gene biomarkers using the frequency-based method, of which 19 are for blood cancer, 36 for lung cancer, 25 for colon cancer, 13 for multi-tissue cancer and 9 for prostate cancer. Using the network-based approach, we detect a total of 41 gene biomarkers, of which 15 are from the blood cancer, 12 from the lung cancer, 6 from the colon cancer, 7 from the multi-tissue cancer and 1 from the prostate cancer dataset. We further extend our network analysis over some biclusters and detect some gene biomarkers not detected earlier by either the frequency-based or the network-based approach. We expand our work to breast cancer miRNA expression data to evaluate the performance of the biclustering algorithms. We detect 19 breast cancer biomarkers by the frequency-based method and 5 by the network-based method for the miRNA dataset.
|
12
|
Bi-stage hierarchical selection of pathway genes for cancer progression using a swarm based computational approach. Appl Soft Comput 2018. [DOI: 10.1016/j.asoc.2017.10.024] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 02/05/2023]
|
13
|
Quantitative prediction of drug side effects based on drug-related features. Interdiscip Sci 2017; 9:434-444. [DOI: 10.1007/s12539-017-0236-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 10/26/2016] [Revised: 04/29/2017] [Accepted: 05/03/2017] [Indexed: 01/07/2023]
|
14
|
Tran B, Xue B, Zhang M. Bare-Bone Particle Swarm Optimisation for Simultaneously Discretising and Selecting Features for High-Dimensional Classification. Applications of Evolutionary Computation 2016. [DOI: 10.1007/978-3-319-31204-0_45] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Indexed: 12/12/2022]
|
15
|
Wang A, An N, Chen G, Li L, Alterovitz G. Improving PLS-RFE based gene selection for microarray data classification. Comput Biol Med 2015; 62:14-24. [PMID: 25912984 DOI: 10.1016/j.compbiomed.2015.04.011] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 11/29/2014] [Revised: 04/07/2015] [Accepted: 04/08/2015] [Indexed: 10/23/2022]
Abstract
Gene selection plays a crucial role in constructing efficient classifiers for microarray data classification, since microarray data are characterized by high dimensionality and small sample sizes and contain irrelevant and redundant genes. In practical use, partial least squares-based gene selection approaches can obtain gene subsets of good quality, but are considerably time-consuming. In this paper, we propose to integrate partial least squares based recursive feature elimination (PLS-RFE) with two feature elimination schemes, simulated annealing and square root, to speed up the feature selection process. Inspired by the annealing schedule strategy, the two proposed approaches eliminate a number of features, rather than only the single least informative feature, during each iteration, and the number of removed features decreases as the iterations proceed. To verify the effectiveness and efficiency of the proposed approaches, we perform extensive experiments on six publicly available microarray datasets with three typical classifiers (Naïve Bayes, K-Nearest-Neighbor and Support Vector Machine) and compare our approaches with the ReliefF, PLS and PLS-RFE feature selectors in terms of classification accuracy and running time. Experimental results demonstrate that the two proposed approaches substantially accelerate the feature selection process without degrading classification accuracy and obtain more compact feature subsets for both two-category and multi-category problems. Further experimental comparisons of feature subset consistency show that the proposed approach with the simulated annealing scheme not only has better time performance, but also obtains slightly better feature subset consistency than the one with the square root scheme.
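The idea of removing several features per iteration, with the batch size shrinking as features run out, can be sketched as a schedule. The abstract does not give the exact rule, so this sketch assumes that the square root scheme removes floor(sqrt(k)) of the k remaining features per iteration; the function name `sqrt_elimination_schedule` is hypothetical, and the PLS-based feature ranking itself is not shown.

```python
import math
from typing import List


def sqrt_elimination_schedule(n_features: int, n_keep: int) -> List[int]:
    """Feature counts after each iteration of a square-root elimination rule.

    Assumption (not taken from the paper): each iteration drops
    floor(sqrt(remaining)) features, capped so that the count never falls
    below n_keep.
    """
    sizes = [n_features]
    remaining = n_features
    while remaining > n_keep:
        step = max(1, math.isqrt(remaining))   # batch shrinks as remaining shrinks
        step = min(step, remaining - n_keep)   # do not overshoot the target size
        remaining -= step
        sizes.append(remaining)
    return sizes


# Going from 100 to 50 features takes 6 iterations instead of the 50
# needed when one feature is removed at a time.
schedule = sqrt_elimination_schedule(100, 50)
```

Each entry in the schedule corresponds to one refit of the ranking model, so a shorter schedule translates directly into fewer model fits.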
Affiliation(s)
- Aiguo Wang
  - School of Computer and Information, Hefei University of Technology, Hefei, China.
- Ning An
  - School of Computer and Information, Hefei University of Technology, Hefei, China.
- Guilin Chen
  - School of Computer and Information Engineering, Chuzhou University, Chuzhou, China.
- Lian Li
  - School of Computer and Information, Hefei University of Technology, Hefei, China.
- Gil Alterovitz
  - Center for Biomedical Informatics, Harvard Medical School, Boston, USA; Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, USA; Children's Hospital Informatics Program at the Harvard/MIT Division of Health Sciences and Technology, Boston, USA.
|
16
|
Wang Y, Fan X, Cai Y. A comparative study of improvements Pre-filter methods bring on feature selection using microarray data. Health Inf Sci Syst 2014; 2:7. [PMID: 25825671 PMCID: PMC4340279 DOI: 10.1186/2047-2501-2-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/18/2014] [Accepted: 10/03/2014] [Indexed: 12/13/2022] [Open Access]
Abstract
Background: Feature selection techniques have become an apparent need in biomarker discovery with the development of microarrays. However, the high-dimensional nature of microarray data makes feature selection time-consuming. To overcome such difficulties, filtering data according to background knowledge before applying feature selection techniques has become a hot topic in microarray analysis. Different methods may affect final results greatly, so it is important to evaluate these pre-filter methods in a systematic way. Methods: In this paper, we compared the performance of statistical-based and biological-based pre-filter methods, and the combination of them, on microRNA-mRNA parallel expression profiles using L1 logistic regression as the feature selection technique. Four types of data were built for both microRNA and mRNA expression profiles. Results: Pre-filter methods could greatly reduce the number of features for both mRNA and microRNA expression datasets. The features selected after pre-filter procedures were shown to be significant at biological levels such as biological process and microRNA functions. Analyses of classification performance based on precision showed that the pre-filter methods were necessary when the number of raw features was much larger than the number of samples. All computing times were greatly shortened after pre-filter procedures. Conclusions: With similar or better classification performance and fewer but biologically significant features, pre-filter-based feature selection should be considered when researchers need fast results for complex computing problems in bioinformatics.
Affiliation(s)
- Yingying Wang
  - Research Center for Biomedical Information, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China
- Xiaomao Fan
  - Research Center for Biomedical Information, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China
- Yunpeng Cai
  - Research Center for Biomedical Information, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China
|