1
|
Lee H, Kim J. A Gene Selection Method Considering Measurement Errors. J Comput Biol 2024; 31:71-82. [PMID: 38010511 DOI: 10.1089/cmb.2023.0041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023] Open
Abstract
The analysis of gene expression data has made significant contributions to understanding disease mechanisms and developing new drugs and therapies. In such analysis, gene selection is often required for identifying informative and relevant genes and removing redundant and irrelevant ones. However, this is not an easy task as gene expression data have inherent challenges such as ultra-high dimensionality, biological noise, and measurement errors. This study focuses on the measurement errors in gene selection problems. Typically, high-throughput experiments have their own intrinsic measurement errors, which can result in an increase of falsely discovered genes. To alleviate this problem, this study proposes a gene selection method that takes into account measurement errors using generalized linear measurement error models. The method iterates filtering and selection steps until convergence, leading to fewer false positives and providing stable results under measurement errors. The performance of the proposed method is demonstrated through simulation studies and an application to a lung cancer data set.
Affiliation(s)
- Hajoung Lee
  - Department of Statistics, Sungkyunkwan University, Seoul, South Korea
- Jaejik Kim
  - Department of Statistics, Sungkyunkwan University, Seoul, South Korea
|
2
|
Osama S, Ali M, Ali AA, Shaban H. Gene selection and tumor identification based on a hybrid of the multi-filter embedded recursive mountain gazelle algorithm. Comput Biol Med 2023; 167:107674. [PMID: 37976816 DOI: 10.1016/j.compbiomed.2023.107674] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Revised: 10/09/2023] [Accepted: 11/06/2023] [Indexed: 11/19/2023]
Abstract
Microarray gene expression data are useful for identifying gene expression patterns associated with cancer outcomes; however, their high dimensionality makes it difficult to extract meaningful information and accurately classify tumors. Hence, developing effective methods for reducing dimensionality while preserving relevant information is a crucial task. Hybrid-based gene selection methods are widely proposed in the gene expression analysis domain and can still be enhanced in terms of efficiency and reliability. This study proposes a new hybrid-based gene selection method, called multi-filter embedded mountain gazelle optimizer (MUL-MGO), which utilizes two filters and an embedded method to remove irrelevant genes, followed by selecting the most relevant genes using the recently developed MGO algorithm. To the best of our knowledge, this is the first work to exploit MGO as a gene or feature selection method. A new version of MGO, called recursive mountain gazelle optimizer (RMGO), is also developed; it applies the MGO algorithm recursively to avoid local optima, shrink the search space, and obtain a minimum gene count without decreasing the classifier's performance. The proposed RMGO is used to develop a new hybrid gene selection method employing the same kind of filters and embedded methods as MUL-MGO, but with the recursive version of the MGO algorithm. The resulting method is called multi-filter embedded recursive mountain gazelle optimizer (MUL-RMGO). Several classifiers are used for cancer classification, and experimental studies are performed on eight microarray gene expression datasets to demonstrate the capabilities of the MUL-MGO and MUL-RMGO methods. The experimental findings indicate the efficiency and productivity of the suggested MUL-MGO and MUL-RMGO methods for gene selection. The methods outperform cutting-edge methods in the literature, with MUL-RMGO exceeding MUL-MGO in terms of accuracy and selected gene count.
Affiliation(s)
- Sarah Osama
  - Computer Science Department, Faculty of Computers and Information, Minia University, Minia, Egypt.
- Moatez Ali
  - Department of Internal Medicine, St. Barnabas Hospital, NY, USA.
- Abdelmgeid A Ali
  - Computer Science Department, Faculty of Computers and Information, Minia University, Minia, Egypt.
- Hassan Shaban
  - Computer Science Department, Faculty of Computers and Information, Minia University, Minia, Egypt.
|
3
|
Yao N, Pan J, Chen X, Li P, Li Y, Wang Z, Yao T, Qian L, Yi D, Wu Y. Discovery of potential biomarkers for lung cancer classification based on human proteome microarrays using Stochastic Gradient Boosting approach. J Cancer Res Clin Oncol 2023; 149:6803-6812. [PMID: 36807761 DOI: 10.1007/s00432-023-04643-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2022] [Accepted: 02/08/2023] [Indexed: 02/21/2023]
Abstract
PURPOSE Early identification of lung cancer (LC) will considerably facilitate the intervention and prevention of LC. The human proteome micro-arrays approach can be used as a "liquid biopsy" to diagnose LC to complement conventional diagnosis, which needs advanced bioinformatics methods such as feature selection (FS) and refined machine learning models. METHODS A two-stage FS methodology by infusing Pearson's Correlation (PC) with a univariate filter (SBF) or recursive feature elimination (RFE) was used to reduce the redundancy of the original dataset. The Stochastic Gradient Boosting (SGB), Random Forest (RF), and Support Vector Machine (SVM) techniques were applied to build ensemble classifiers based on four subsets. The synthetic minority oversampling technique (SMOTE) was used in the preprocessing of imbalanced data. RESULTS FS approach with SBF and RFE extracted 25 and 55 features, respectively, with 14 overlapped ones. All three ensemble models demonstrate superior accuracy (ranging from 0.867 to 0.967) and sensitivity (0.917 to 1.00) in the test datasets with SGB of SBF subset outperforming others. The SMOTE technique has improved the model performance in the training process. Three of the top selected candidate biomarkers (LGR4, CDC34, and GHRHR) were highly suggested to play a role in lung tumorigenesis. CONCLUSION A novel hybrid FS method with classical ensemble machine learning algorithms was first used in the classification of protein microarray data. The parsimony model constructed by the SGB algorithm with the appropriate FS and SMOTE approach performs well in the classification task with higher sensitivity and specificity. Standardization and innovation of bioinformatics approach for protein microarray analysis need further exploration and validation.
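The abstract above mentions SMOTE for handling imbalanced data. As a rough illustration only, the core idea of SMOTE (interpolating between a minority sample and one of its nearest minority neighbours) can be sketched as follows. The function name `smote`, its parameters, and the toy points are assumptions made for this sketch; it is not the authors' preprocessing code and omits details of library implementations.

```python
import random
from math import dist


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to it.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: dist(base, minority[j]))
        # Choose one of the k nearest neighbours at random.
        mate = minority[rng.choice(others[:k])]
        # Place the new point at a random position on the segment base -> mate.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


# Hypothetical toy minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=5, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by the minority class.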
Affiliation(s)
- Ning Yao
  - Department of Health Statistics, College of Preventive Medicine, Army Medical University, No.30 Gaotanyan Street, Shapingba District, Chongqing, 400038, China
  - Chongqing Center for Disease Control and Prevention, No.8 Changjiang 2nd Street, Yuzhong District, Chongqing, 400042, China
- Jianbo Pan
  - Center for Novel Target and Therapeutic Intervention, Institute of Life Sciences, Chongqing Medical University, Chongqing, 400016, China
- Xicheng Chen
  - Department of Health Statistics, College of Preventive Medicine, Army Medical University, No.30 Gaotanyan Street, Shapingba District, Chongqing, 400038, China
- Pengpeng Li
  - Department of Health Statistics, College of Preventive Medicine, Army Medical University, No.30 Gaotanyan Street, Shapingba District, Chongqing, 400038, China
- Yang Li
  - Department of Health Statistics, College of Preventive Medicine, Army Medical University, No.30 Gaotanyan Street, Shapingba District, Chongqing, 400038, China
- Zhenyan Wang
  - Department of Health Statistics, College of Preventive Medicine, Army Medical University, No.30 Gaotanyan Street, Shapingba District, Chongqing, 400038, China
- Tianhua Yao
  - Department of Health Statistics, College of Preventive Medicine, Army Medical University, No.30 Gaotanyan Street, Shapingba District, Chongqing, 400038, China
- Li Qian
  - Department of Health Statistics, College of Preventive Medicine, Army Medical University, No.30 Gaotanyan Street, Shapingba District, Chongqing, 400038, China
- Dong Yi
  - Department of Health Statistics, College of Preventive Medicine, Army Medical University, No.30 Gaotanyan Street, Shapingba District, Chongqing, 400038, China.
- Yazhou Wu
  - Department of Health Statistics, College of Preventive Medicine, Army Medical University, No.30 Gaotanyan Street, Shapingba District, Chongqing, 400038, China.
|
4
|
Blourchi P, Ghasemzadeh A. Majority voting based on different feature ranking techniques from gene expression. Journal of Intelligent & Fuzzy Systems 2023. [DOI: 10.3233/jifs-224029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/03/2023]
Abstract
In bioinformatics studies, many modeling tasks are characterized by high dimensionality, leading to the widespread use of feature selection techniques to reduce dimensionality. There are a multitude of feature selection techniques that have been proposed in the literature, each relying on a single measurement method to select candidate features. This has an impact on the classification performance. To address this issue, we propose a majority voting method that uses five different feature ranking techniques: entropy score, Pearson’s correlation coefficient, Spearman correlation coefficient, Kendall correlation coefficient, and t-test. By using a majority voting approach, only the features that appear in all five ranking methods are selected. This selection process has three key advantages over traditional techniques. Firstly, it is independent of any particular feature ranking method. Secondly, the feature space dimension is significantly reduced compared to other ranking methods. Finally, the performance is improved as the most discriminatory and informative features are selected via the majority voting process. The performance of the proposed method was evaluated using an SVM, and the results were assessed using accuracy, sensitivity, specificity, and AUC on various biomedical datasets. The results demonstrate the superior effectiveness of the proposed method compared to state-of-the-art methods in the literature.
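The voting rule described in this abstract (keep only the features that every ranking method places among its top candidates) can be illustrated with a small sketch. To stay short, it uses just two of the five rankers named above (absolute Pearson correlation and a Welch t-statistic); the function names, the cut-off `k`, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import sqrt
from statistics import mean, variance


def pearson_score(x, y):
    """Absolute Pearson correlation between a feature column and 0/1 labels."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return abs(cov / (sx * sy)) if sx and sy else 0.0


def welch_t_score(x, y):
    """Absolute Welch t-statistic between the two classes of a feature column."""
    g0 = [a for a, b in zip(x, y) if b == 0]
    g1 = [a for a, b in zip(x, y) if b == 1]
    se = sqrt(variance(g0) / len(g0) + variance(g1) / len(g1))
    return abs(mean(g0) - mean(g1)) / se if se else 0.0


def top_k(columns, labels, scorer, k):
    """Indices of the k features ranked highest by one scorer."""
    scores = [scorer(col, labels) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return set(order[:k])


def vote(columns, labels, scorers, k, min_votes=None):
    """Keep features that appear in at least min_votes top-k lists (default: all)."""
    min_votes = len(scorers) if min_votes is None else min_votes
    votes = Counter()
    for scorer in scorers:
        votes.update(top_k(columns, labels, scorer, k))
    return sorted(j for j, v in votes.items() if v >= min_votes)


# Hypothetical toy data: feature 0 separates the classes, features 1 and 2 do not.
labels = [0, 0, 0, 1, 1, 1]
columns = [
    [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],
    [5, 1, 4, 2, 6, 3],
    [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
]
selected = vote(columns, labels, [pearson_score, welch_t_score], k=1)
```

With the default `min_votes`, the rule is an intersection of the top-k lists, which matches the "appear in all ranking methods" criterion in the abstract; a smaller `min_votes` would give a looser majority rule.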
|
5
|
Devi SS, Prithiviraj K. Breast Cancer Classification With Microarray Gene Expression Data Based on Improved Whale Optimization Algorithm. International Journal of Swarm Intelligence Research 2023. [DOI: 10.4018/ijsir.317091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023]
Abstract
Breast cancer is one of the most common and dangerous cancer types in women worldwide. Since it is generally a genetic disease, microarray technology-based cancer prediction is technically significant among the many diagnosis methods. Microarray gene expression data contain few samples and many redundant and noisy genes, which leads to inaccurate diagnosis and low prediction accuracy. To overcome these difficulties, this paper proposes an Improved Whale Optimization Algorithm (IWOA) for wrapper-based feature selection in gene expression data. The proposed IWOA incorporates modified crossover and mutation operations to enhance the exploration and exploitation of the classical WOA. The proposed IWOA adopts a multiobjective fitness function, which simultaneously balances minimization of the error rate and of the number of selected features. The experimental analysis demonstrated that the proposed IWOA with a Gradient Boost Classifier (GBC) achieves a high classification accuracy of 97.7% with a minimum subset of features and also converges quickly on the breast cancer dataset.
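A fitness function that trades off error rate against subset size is often written as a weighted sum in wrapper feature selection. The sketch below shows that common form only; the weight `alpha`, the function name, and the numbers are assumptions for illustration, and the abstract does not state the exact formula used in the paper.

```python
def wrapper_fitness(error_rate, n_selected, n_total, alpha=0.9):
    """Weighted sum of classification error and fraction of features kept.

    Lower is better. alpha is an assumed weight, not a value from the paper.
    """
    if n_selected == 0:
        return float("inf")  # an empty subset cannot be evaluated
    return alpha * error_rate + (1.0 - alpha) * n_selected / n_total


# Two hypothetical candidate subsets with the same error but different sizes.
small = wrapper_fitness(error_rate=0.05, n_selected=10, n_total=1000)
large = wrapper_fitness(error_rate=0.05, n_selected=500, n_total=1000)
```

At equal error rate the smaller subset receives the lower (better) fitness, which is how such a function pushes the search toward compact gene sets.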
Affiliation(s)
- S. Sathiya Devi
  - University College of Engineering, Birla Institute of Technology, Trichy, India
- Prithiviraj K.
  - University College of Engineering, Birla Institute of Technology, Trichy, India
|
6
|
Liu C, Wu S, Lai L, Liu J, Guo Z, Ye Z, Chen X. Comprehensive analysis of cuproptosis-related lncRNAs in immune infiltration and prognosis in hepatocellular carcinoma. BMC Bioinformatics 2023; 24:4. [PMID: 36597032 PMCID: PMC9811804 DOI: 10.1186/s12859-022-05091-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 08/31/2022] [Accepted: 12/01/2022] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND Hepatocellular carcinoma (HCC) is among the most common malignancies worldwide and is the third leading cause of cancer mortality. The regulation of cell death is the most crucial step in tumor progression and has become a crucial target for nearly all therapeutic options. Cuproptosis, a copper-induced cell death, was recently reported in Science. However, its primary function in carcinogenesis is still unclear. METHODS Cuproptosis-related lncRNAs significantly associated with overall survival (OS) were screened by stepwise univariate Cox regression. The signature of cuproptosis-related lncRNAs for HCC prognosis was constructed by the LASSO algorithm and multivariate Cox regression. Further Kaplan-Meier analysis, proportional hazards model, and ROC analysis were performed. Functional annotation was performed using gene set enrichment analysis (GSEA). The relationship between prognostic cuproptosis-related lncRNAs and HCC prognosis was further explored by the GEPIA (http://gepia.cancer-pku.cn/) online analysis tool. Finally, we used the ESTIMATE and XCELL algorithms to estimate stromal and immune cells in tumor tissue and cast each sample to infer the underlying mechanism of cuproptosis-related lncRNAs in the tumor immune microenvironment (TIME) of HCC patients. RESULTS Four cuproptosis-related lncRNAs were used to construct a prognostic lncRNA signature, which was an independent factor in predicting OS in HCC patients. Kaplan-Meier curves showed significant differences in survival rates between risk subgroups (p = 0.002). At the same time, we found that the expression levels of most immune checkpoint genes increased with increasing risk scores. Tumorigenesis and immunological-related pathways were primarily enhanced in the high-risk group, as determined by GSEA. The results of drug sensitivity analysis showed that compared with patients in the high-risk group, the IC50 values of erlotinib and lapatinib were lower in patients in the low-risk group, while the opposite was true for sunitinib, paclitaxel, gemcitabine, and imatinib. We also found that elevated AL133243.2 expression was significantly associated with worse OS and disease-free survival (DFS), more advanced T stage and higher tumor grade, and reduced immune cell infiltration, suggesting that HCC patients with low AL133243.2 expression in tumor tissues may have a better response to immunotherapy. CONCLUSION Collectively, the cuproptosis-associated lncRNA signature can serve as an independent predictor to guide individual treatment strategies. Furthermore, AL133243.2 is a promising marker for predicting immunotherapy response in HCC patients. These data may facilitate further exploration of more effective immunotherapy strategies for HCC.
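A prognostic signature built from a Cox model is typically applied as a linear risk score, after which patients are split into risk subgroups. The sketch below illustrates that general pattern under stated assumptions: the lncRNA names, coefficients, and expression values are made up, and the median cut-off is a common convention that the abstract does not confirm.

```python
from statistics import median


def risk_scores(expression, coefficients):
    """Linear predictor: sum of coefficient * expression over signature genes."""
    return [
        sum(coefficients[g] * sample[g] for g in coefficients)
        for sample in expression
    ]


def split_by_median(scores):
    """Label each sample 'high' if its score exceeds the cohort median, else 'low'."""
    cut = median(scores)
    return ["high" if s > cut else "low" for s in scores]


# Hypothetical coefficients and expression values, for illustration only.
coefficients = {"lncA": 0.5, "lncB": -0.2}
samples = [
    {"lncA": 2.0, "lncB": 1.0},
    {"lncA": 0.5, "lncB": 3.0},
    {"lncA": 4.0, "lncB": 0.5},
    {"lncA": 1.0, "lncB": 1.0},
]
scores = risk_scores(samples, coefficients)
groups = split_by_median(scores)
```

The resulting group labels are what a Kaplan-Meier comparison between risk subgroups would then be computed on.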
Affiliation(s)
- Chunhua Liu
  - Rehabilitation Center, The Second Affiliated Hospital of Wenzhou Medical University, 108 Xueyuan West Road, Wenzhou, Zhejiang, China
- Simin Wu
  - Rehabilitation Center, The Second Affiliated Hospital of Wenzhou Medical University, 108 Xueyuan West Road, Wenzhou, Zhejiang, China
- Liying Lai
  - Department of Cancer Rehabilitation, Lishui Hospital of Traditional Chinese Medicine Affiliated to the Zhejiang University of Chinese Medicine, Lishui, Zhejiang, China
- Jinyu Liu
  - Department of Cancer Rehabilitation, Lishui Hospital of Traditional Chinese Medicine Affiliated to the Zhejiang University of Chinese Medicine, Lishui, Zhejiang, China
- Zhaofu Guo
  - Department of Cancer Rehabilitation, Lishui Hospital of Traditional Chinese Medicine Affiliated to the Zhejiang University of Chinese Medicine, Lishui, Zhejiang, China
- Zegen Ye
  - Department of Cancer Rehabilitation, Lishui Hospital of Traditional Chinese Medicine Affiliated to the Zhejiang University of Chinese Medicine, Lishui, Zhejiang, China
- Xiang Chen
  - Rehabilitation Center, The Second Affiliated Hospital of Wenzhou Medical University, 108 Xueyuan West Road, Wenzhou, Zhejiang, China.
|
7
|
Pashaei E. Mutation-based Binary Aquila optimizer for gene selection in cancer classification. Comput Biol Chem 2022; 101:107767. [PMID: 36084602 DOI: 10.1016/j.compbiolchem.2022.107767] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2022] [Revised: 07/10/2022] [Accepted: 08/29/2022] [Indexed: 11/19/2022]
Abstract
Microarray data classification is one of the hottest issues in the field of bioinformatics due to its efficiency in diagnosing patients' ailments. The difficulty is that microarrays contain a huge number of genes, the majority of which are redundant or irrelevant, resulting in the deterioration of classification accuracy. To address this issue, a mutated binary Aquila Optimizer (MBAO) with a time-varying mirrored S-shaped (TVMS) transfer function is proposed as a new wrapper gene (or feature) selection method to find the optimal subset of informative genes. The suggested hybrid method utilizes Minimum Redundancy Maximum Relevance (mRMR) as a filtering approach to choose top-ranked genes in the first stage and then uses MBAO-TVMS as an efficient wrapper approach to identify the most discriminative genes in the second stage. TVMS is adopted to transform the continuous version of the Aquila Optimizer (AO) into a binary one, and a mutation mechanism is incorporated into the binary AO to help the algorithm escape local optima and improve its global search capabilities. The suggested method was tested on eleven well-known benchmark microarray datasets and compared to other current state-of-the-art methods. Based on the obtained results, mRMR-MBAO confirms its superiority over the mRMR-BAO algorithm and the other comparative GS approaches on the majority of the medical datasets in terms of classification accuracy and the number of selected genes. R codes of MBAO are available at https://github.com/el-pashaei/MBAO.
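The mRMR filter used in the first stage can be illustrated with a greedy sketch: at each step, pick the gene whose relevance to the class label, minus its average redundancy with the genes already chosen, is largest. mRMR is usually defined with mutual information; this sketch substitutes absolute Pearson correlation to stay dependency-free, and the function names and toy data are assumptions rather than anything taken from the paper or its R code.

```python
from math import sqrt


def abs_corr(x, y):
    """Absolute Pearson correlation, used here as a stand-in for mutual information."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return abs(cov / (sx * sy)) if sx and sy else 0.0


def mrmr(columns, labels, n_select):
    """Greedy selection maximising relevance minus mean redundancy."""
    relevance = [abs_corr(col, labels) for col in columns]
    selected = []
    remaining = set(range(len(columns)))
    while remaining and len(selected) < n_select:

        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(
                abs_corr(columns[j], columns[s]) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy

        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: column 1 is an exact rescaling of column 0 (fully redundant),
# column 2 is less relevant on its own but carries partly independent information.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 1, 0, 1, 3, 4, 3, 4],
    [0, 2, 0, 2, 6, 8, 6, 8],
    [2, 0, 2, 0, 4, 2, 4, 2],
]
chosen = mrmr(columns, labels, n_select=2)
```

In this toy case the second pick is column 2 rather than the redundant twin of the first pick, which is the behaviour the redundancy term is meant to produce.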
Affiliation(s)
- Elham Pashaei
  - Department of Computer Engineering, Istanbul Gelisim University, Istanbul, Turkey.
|
8
|
Azadifar S, Rostami M, Berahmand K, Moradi P, Oussalah M. Graph-based relevancy-redundancy gene selection method for cancer diagnosis. Comput Biol Med 2022; 147:105766. [PMID: 35779479 DOI: 10.1016/j.compbiomed.2022.105766] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 04/10/2022] [Revised: 06/12/2022] [Accepted: 06/18/2022] [Indexed: 11/26/2022]
Abstract
Nowadays, microarray data processing is one of the most important applications in molecular biology for cancer diagnosis. A major task in microarray data processing is gene selection, which aims to find a subset of genes with the least inner similarity and most relevant to the target class. Removing unnecessary, redundant, or noisy data reduces the data dimensionality. This research advocates a graph theoretic-based gene selection method for cancer diagnosis. Both unsupervised and supervised modes use well-known and successful social network approaches such as the maximum weighted clique criterion and edge centrality to rank genes. The suggested technique has two goals: (i) to maximize the relevancy of the chosen genes with the target class and (ii) to reduce their inner redundancy. A maximum weighted clique is chosen in a repetitive way in each iteration of this procedure. The appropriate genes are then chosen from among the existing features in this maximum clique using edge centrality and gene relevance. In the experiment, several datasets consisting of Colon, Leukemia, SRBCT, Prostate Tumor, and Lung Cancer, with different properties, are used to demonstrate the efficacy of the developed model. Our performance is compared to that of renowned filter-based gene selection approaches for cancer diagnosis whose results demonstrate a clear superiority.
Affiliation(s)
- Saeid Azadifar
  - Department of Computer Engineering, University of Khajeh Nasir Toosi, Tehran, Iran
- Mehrdad Rostami
  - Centre for Machine Vision and Signal Processing, University of Oulu, Oulu, Finland.
- Kamal Berahmand
  - School of Computer Science, Faculty of Science, Queensland University of Technology (QUT), Brisbane, Australia
- Parham Moradi
  - Department of Computer Engineering, University of Kurdistan, Sanandaj, Iran
- Mourad Oussalah
  - Centre for Machine Vision and Signal Processing, University of Oulu, Oulu, Finland; Research Unit of Medical Imaging, Physics, and Technology, Faculty of Medicine, University of Oulu, Finland
|
9
|
Rostami M, Forouzandeh S, Berahmand K, Soltani M, Shahsavari M, Oussalah M. Gene selection for microarray data classification via multi-objective graph theoretic-based method. Artif Intell Med 2022; 123:102228. [PMID: 34998517 DOI: 10.1016/j.artmed.2021.102228] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Received: 10/31/2020] [Revised: 11/23/2021] [Accepted: 11/27/2021] [Indexed: 12/20/2022]
Abstract
In recent decades, the improvement of computer technology has increased the growth of high-dimensional microarray data. Thus, data mining methods for DNA microarray data classification usually involve samples consisting of thousands of genes. One of the efficient strategies to solve this problem is gene selection, which improves the accuracy of microarray data classification and also decreases computational complexity. In this paper, a novel social network analysis-based gene selection approach is proposed. The proposed method has two main objectives of the relevance maximization and redundancy minimization of the selected genes. In this method, on each iteration, a maximum community is selected repetitively. Then among the existing genes in this community, the appropriate genes are selected by using the node centrality-based criterion. The reported results indicate that the developed gene selection algorithm while increasing the classification accuracy of microarray data, will also decrease the time complexity.
Affiliation(s)
- Mehrdad Rostami
  - Centre of Machine Vision and Signal Processing, Faculty of Information Technology, University of Oulu, Oulu, Finland.
- Saman Forouzandeh
  - Department of Computer Engineering, University of Applied Science and Technology, Center of Tehran Municipality ICT org., Tehran, Iran
- Kamal Berahmand
  - School of Computer Sciences, Science and Engineering Faculty, Queensland University of Technology (QUT), Brisbane, Australia.
- Mina Soltani
  - Department of Nutrition, Kashan University of Medical Sciences, Kashan, Iran
- Meisam Shahsavari
  - Department of engineering physics, Tsinghua University, Beijing, China
- Mourad Oussalah
  - Centre of Machine Vision and Signal Processing, Faculty of Information Technology, University of Oulu, Oulu, Finland; Research Unit of Medical Imaging, Physics, and Technology, Faculty of Medicine, University of Oulu, Finland.
|
10
|
Sheikhi G, Altınçay H. A novel dissimilarity metric based on feature‐to‐feature scatter frequencies for clustering‐based feature selection in biomedical data. Comput Intell 2021. [DOI: 10.1111/coin.12470] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Affiliation(s)
- Ghazaal Sheikhi
  - Department of Computer Engineering, Final International University, Kyrenia, North Cyprus, Turkey
- Hakan Altınçay
  - Department of Computer Engineering, Eastern Mediterranean University, Famagusta, North Cyprus, Turkey
|
11
|
Al-Rajab M, Lu J, Xu Q. A framework model using multifilter feature selection to enhance colon cancer classification. PLoS One 2021; 16:e0249094. [PMID: 33861766 PMCID: PMC8691854 DOI: 10.1371/journal.pone.0249094] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 10/16/2020] [Accepted: 03/11/2021] [Indexed: 11/18/2022] Open
Abstract
Gene expression profiles can be utilized in the diagnosis of critical diseases such as cancer. The selection of biomarker genes from these profiles is significant and crucial for cancer detection. This paper presents a framework proposing a two-stage multifilter hybrid model of feature selection for colon cancer classification. Colon cancer is now extremely common compared with other types of cancer. There is a need to find a fast and accurate method to detect the tissues and to enhance the diagnostic process and drug discovery. This paper reports on a study whose objective has been to improve the diagnosis of cancer of the colon through a two-stage, multifilter model of feature selection. The model described deals with feature selection using a combination of Information Gain and a Genetic Algorithm. The next stage is to filter and rank the genes identified through this method using the minimum Redundancy Maximum Relevance (mRMR) technique. The final phase is to further analyze the data using correlated machine learning algorithms. This two-stage approach, which involves the selection of genes before classification techniques are used, improves success rates for the identification of cancer cells. It is found that Decision Tree, K-Nearest Neighbor, and Naïve Bayes classifiers showed promisingly accurate results using the developed hybrid framework model. It is concluded that our proposed method achieved a higher accuracy than the existing methods reported in the literature. This study can be used as a clue to enhance treatment and drug discovery for colon cancer.
Affiliation(s)
- Murad Al-Rajab
  - School of Computing and Engineering, University of Huddersfield, Huddersfield, United Kingdom
- Joan Lu
  - School of Computing and Engineering, University of Huddersfield, Huddersfield, United Kingdom
- Qiang Xu
  - School of Computing and Engineering, University of Huddersfield, Huddersfield, United Kingdom
|
12
|
Mahendran N, Durai Raj Vincent PM, Srinivasan K, Chang CY. Machine Learning Based Computational Gene Selection Models: A Survey, Performance Evaluation, Open Issues, and Future Research Directions. Front Genet 2020; 11:603808. [PMID: 33362861 PMCID: PMC7758324 DOI: 10.3389/fgene.2020.603808] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 09/08/2020] [Accepted: 10/29/2020] [Indexed: 12/20/2022] Open
Abstract
Gene Expression is the process of determining the physical characteristics of living beings by generating the necessary proteins. Gene Expression takes place in two steps, transcription and translation. It is the flow of information from DNA to RNA with enzymes' help, and the end product is proteins and other biochemical molecules. Many technologies can capture Gene Expression from the DNA or RNA. One such technique is Microarray DNA. Other than being expensive, the main issue with Microarray DNA is that it generates high-dimensional data with minimal sample size. The issue in handling such a heavyweight dataset is that the learning model will be over-fitted. This problem should be addressed by reducing the dimension of the data source to a considerable amount. In recent years, Machine Learning has gained popularity in the field of genomic studies. In the literature, many Machine Learning-based Gene Selection approaches have been discussed, which were proposed to improve dimensionality reduction precision. This paper presents an extensive review of the various works on Machine Learning-based gene selection in recent years, along with an analysis of their performance. The study categorizes various feature selection algorithms under Supervised, Unsupervised, and Semi-supervised learning. The works done in recent years to reduce the features for diagnosing tumors are discussed in detail. Furthermore, the performance of several discussed methods in the literature is analyzed. This study also lists and briefly discusses the open issues in handling high-dimensional, small-sample-size data.
Affiliation(s)
- Nivedhitha Mahendran
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- P. M. Durai Raj Vincent
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- Kathiravan Srinivasan
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- Chuan-Yu Chang
  - Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology, Douliu, Taiwan
|
13
|
|
14
|
Zou Q, Ma Q. The application of machine learning to disease diagnosis and treatment. Math Biosci 2019; 320:108305. [PMID: 31857093 DOI: 10.1016/j.mbs.2019.108305] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 02/07/2023]
Affiliation(s)
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, China.
- Qin Ma
  - Department of Biomedical Informatics, The Ohio State University, United States.
|
15
|
Shukla AK. Identification of cancerous gene groups from microarray data by employing adaptive genetic and support vector machine technique. Comput Intell 2019. [DOI: 10.1111/coin.12245] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 12/14/2022]
Affiliation(s)
- Alok Kumar Shukla
  - Department of Computer Science & Engineering, G.L. Bajaj Institute of Technology & Management, Greater Noida, India
|
16
|
A study on metaheuristics approaches for gene selection in microarray data: algorithms, applications and open challenges. Evolutionary Intelligence 2019. [DOI: 10.1007/s12065-019-00306-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 10/25/2022]
|