1
|
Yaqoob A, Verma NK, Aziz RM. Optimizing Gene Selection and Cancer Classification with Hybrid Sine Cosine and Cuckoo Search Algorithm. J Med Syst 2024; 48:10. [PMID: 38193948 DOI: 10.1007/s10916-023-02031-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Accepted: 12/28/2023] [Indexed: 01/10/2024]
Abstract
Gene expression datasets offer a wide range of information about various biological processes. However, it is difficult to find the important genes among the high-dimensional biological data due to the existence of redundant and unimportant ones. Numerous Feature Selection (FS) techniques have been created to overcome this obstacle. Improving the efficacy and precision of FS methodologies is crucial in order to identify significant genes amongst complex biological data. In this work, we present a novel approach to gene selection called the Sine Cosine and Cuckoo Search Algorithm (SCACSA). This hybrid method is designed to work with a well-known machine learning classifier, the Support Vector Machine (SVM). Using a dataset on breast cancer, the hybrid gene selection algorithm's performance is carefully assessed and compared to other feature selection methods. To improve the quality of the feature set, we use minimum Redundancy Maximum Relevance (mRMR) as a filtering strategy in the first step. The hybrid SCACSA method is then used to enhance and optimize the gene selection procedure. Lastly, we classify the dataset according to the chosen genes by using the SVM classifier. Given the pivotal role gene selection plays in unraveling complex biological datasets, SCACSA stands out as an invaluable tool for the classification of cancer datasets. The findings help medical practitioners make well-informed decisions about cancer diagnosis and provide them with a valuable tool for navigating the complex world of gene expression data.
Affiliation(s)
- Abrar Yaqoob
  - School of Advanced Sciences and Languages, VIT Bhopal University, Kothrikalan, Sehore, 466114, India
- Navneet Kumar Verma
  - School of Advanced Sciences and Languages, VIT Bhopal University, Kothrikalan, Sehore, 466114, India
- Rabia Musheer Aziz
  - School of Advanced Sciences and Languages, VIT Bhopal University, Kothrikalan, Sehore, 466114, India
|
2
|
Alweshah M. Coronavirus herd immunity optimizer to solve classification problems. Soft comput 2023; 27:3509-3529. [PMID: 35309595 PMCID: PMC8922087 DOI: 10.1007/s00500-022-06917-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 02/13/2022] [Indexed: 11/28/2022]
Abstract
Classification is a technique in data mining that is used to predict the value of a categorical variable and to produce input data and datasets of varying values. The classification algorithm makes use of the training datasets to build a model which can be used for allocating unclassified records to a defined class. In this paper, the coronavirus herd immunity optimizer (CHIO) algorithm is used to boost the efficiency of the probabilistic neural network (PNN) when solving classification problems. First, the PNN produces a random initial solution and submits it to the CHIO, which then attempts to refine the PNN weights. This is accomplished by the management of random phases and the effective identification of a search space that can probably decide the optimal value. The proposed CHIO-PNN approach was applied to 11 benchmark datasets to assess its classification accuracy, and its results were compared with those of the PNN and three methods in the literature: the firefly algorithm, the African buffalo algorithm, and β-hill climbing. The results showed that the CHIO-PNN achieved an overall classification rate of 90.3% on all datasets with faster convergence, outperforming all the compared methods from the literature. Supplementary Information: The online version contains supplementary material available at 10.1007/s00500-022-06917-z.
Affiliation(s)
- Mohammed Alweshah
  - Prince Abdullah Bin Ghazi Faculty of Information and Communication Technology, Al-Balqa Applied University, Al-Salt, Jordan
|
3
|
Al-Shaikh A, Mahafzah BA, Alshraideh M. Hybrid harmony search algorithm for social network contact tracing of COVID-19. Soft comput 2023; 27:3343-3365. [PMID: 34220301 PMCID: PMC8237257 DOI: 10.1007/s00500-021-05948-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Accepted: 06/04/2021] [Indexed: 02/05/2023]
Abstract
The coronavirus disease 2019 (COVID-19) was first reported in December 2019 in Wuhan, China, and then moved to almost every country showing an unprecedented outbreak. The World Health Organization declared COVID-19 a pandemic. Since then, millions of people were infected, and millions have lost their lives all around the globe. By the end of 2020, effective vaccines that could prevent the fast spread of the disease started to loom on the horizon. Nevertheless, isolation, social distancing, face masks, and quarantine are the best-known measures, for the time being, to fight the pandemic. On the other hand, contact tracing is an effective procedure in tracking infections and saving others' lives. In this paper, we devise a new approach using a hybrid harmony search (HHS) algorithm that casts the problem of finding strongly connected components (SCCs) to contact tracing. This new approach is named the hybrid harmony search contact tracing (HHS-CT) algorithm. The hybridization is achieved by integrating the stochastic hill climbing into the operators' design of the harmony search algorithm. The HHS-CT algorithm is compared to other existing algorithms of finding SCCs in directed graphs, where it showed its superiority over these algorithms. The devised approach provides a 77.18% enhancement in terms of run time and an exceptional average error rate of 1.7% compared to the other existing algorithms of finding SCCs.
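For readers unfamiliar with the underlying graph problem, the sketch below shows what "finding strongly connected components in a directed graph" means in practice. It is a plain deterministic baseline (Kosaraju's two-pass algorithm), not the HHS-CT method from the paper, and the edge list and node labels are hypothetical.

```python
from collections import defaultdict


def strongly_connected_components(edges):
    """Return the SCCs of a directed graph given as (source, target) pairs."""
    graph = defaultdict(list)
    reverse = defaultdict(list)
    nodes = set()
    for u, v in edges:
        graph[u].append(v)
        reverse[v].append(u)
        nodes.update((u, v))

    # Pass 1: record nodes in order of DFS completion (iterative DFS).
    visited = set()
    order = []
    for start in sorted(nodes):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All neighbours explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: explore the reversed graph in decreasing finish order;
    # each exploration collects exactly one component.
    assigned = set()
    components = []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component = []
        stack = [start]
        while stack:
            node = stack.pop()
            component.append(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(sorted(component))
    return sorted(components)


# Hypothetical directed contact edges between six people.
contacts = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"),
            ("d", "e"), ("e", "d"), ("e", "f")]
groups = strongly_connected_components(contacts)
```

In this toy graph the mutually reachable groups are {a, b, c}, {d, e}, and {f}. An exact algorithm of this kind is the sort of baseline a metaheuristic SCC finder would be compared against.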
Affiliation(s)
- Ala’a Al-Shaikh
  - Learning and Teaching Technology Center, Al-Balqa Applied University, Al-Salt, 19117 Jordan
- Basel A. Mahafzah
  - Department of Computer Science, King Abdulla II School of Information Technology, The University of Jordan, Amman, 11942 Jordan
- Mohammad Alshraideh
  - Department of Computer Science, King Abdulla II School of Information Technology, The University of Jordan, Amman, 11942 Jordan
|
4
|
Akinola OA, Agushaka JO, Ezugwu AE. Binary dwarf mongoose optimizer for solving high-dimensional feature selection problems. PLoS One 2022; 17:e0274850. [PMID: 36201524 PMCID: PMC9536540 DOI: 10.1371/journal.pone.0274850] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/14/2022] [Accepted: 09/06/2022] [Indexed: 11/13/2022] Open
Abstract
Selecting appropriate feature subsets is a vital task in machine learning. Its main goal is to remove noisy, irrelevant, and redundant feature subsets that could negatively impact the learning model's accuracy and improve classification performance without information loss. Therefore, more advanced optimization methods have been employed to locate the optimal subset of features. This paper presents a binary version of the dwarf mongoose optimization called the BDMO algorithm to solve the high-dimensional feature selection problem. The effectiveness of this approach was validated using 18 high-dimensional datasets from the Arizona State University feature selection repository and compared the efficacy of the BDMO with other well-known feature selection techniques in the literature. The results show that the BDMO outperforms other methods producing the least average fitness value in 14 out of 18 datasets which means that it achieved 77.77% on the overall best fitness values. The result also shows BDMO demonstrating stability by returning the least standard deviation (SD) value in 13 of 18 datasets (72.22%). Furthermore, the study achieved higher validation accuracy in 15 of the 18 datasets (83.33%) over other methods. The proposed approach also yielded the highest validation accuracy attainable in the COIL20 and Leukemia datasets which vividly portray the superiority of the BDMO.
Affiliation(s)
- Olatunji A. Akinola
  - School of Mathematics, Statistics, and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, KwaZulu-Natal, South Africa
- Jeffrey O. Agushaka
  - School of Mathematics, Statistics, and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, KwaZulu-Natal, South Africa
  - Department of Computer Science, Federal University of Lafia, Lafia, Nasarawa State, Nigeria
- Absalom E. Ezugwu
  - School of Mathematics, Statistics, and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, KwaZulu-Natal, South Africa
|
5
|
Obadina OO, Thaha MA, Mohamed Z, Shaheed MH. Grey-box modelling and fuzzy logic control of a Leader-Follower robot manipulator system: A hybrid Grey Wolf-Whale Optimisation approach. ISA TRANSACTIONS 2022; 129:572-593. [PMID: 35277266 DOI: 10.1016/j.isatra.2022.02.023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2021] [Revised: 01/04/2022] [Accepted: 02/15/2022] [Indexed: 06/14/2023]
Abstract
This study presents the development of a grey-box modelling approach and fuzzy logic control for real time trajectory control of an experimental four degree-of-freedom Leader-Follower Robot (LFR) manipulator system using a hybrid optimisation algorithm, known as Grey Wolf Optimiser (GWO) - Whale Optimisation Algorithm (WOA). The approach has advantages in achieving an accurate model of the LFR manipulator system, and together with a better trajectory tracking performance. In the first instance, the white box model is formed by modelling the dynamics of the follower manipulator using the Euler-Lagrange formulation. This white-box model is then improved upon by re-tuning the model's parameters using GWO-WOA and experimental data from the real LFR manipulator system, thus forming the grey-box model. A minimum improvement of 73.9% is achieved by the grey-box model in comparison to the white-box model. In the latter part of this investigation, the developed grey-box model is used for the design, tuning and real-time implementation of a fuzzy PD+I controller on the experimental LFR manipulator system. A 78% improvement in the total mean squared error is realised after tuning the membership functions of the fuzzy logic controller using GWO-WOA. Experimental results show that the approach significantly improves the trajectory tracking performance of the LFR manipulator system in terms of mean squared error, steady state error and time delay.
Affiliation(s)
- Ololade O Obadina
  - School of Engineering and Materials Science, Queen Mary University of London, UK
- Mohamed A Thaha
  - Blizard Institute, Barts and The London School of Medicine & Dentistry, Queen Mary University of London, UK
  - Department of Colorectal Surgery, Royal London Hospital, Barts Health NHS Trust, Whitechapel, London, UK
- Zaharuddin Mohamed
  - School of Electrical Engineering, Universiti Teknologi Malaysia, Malaysia
- M Hasan Shaheed
  - School of Engineering and Materials Science, Queen Mary University of London, UK
|
6
|
Akinola OO, Ezugwu AE, Agushaka JO, Zitar RA, Abualigah L. Multiclass feature selection with metaheuristic optimization algorithms: a review. Neural Comput Appl 2022; 34:19751-19790. [PMID: 36060097 PMCID: PMC9424068 DOI: 10.1007/s00521-022-07705-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 01/19/2022] [Accepted: 08/02/2022] [Indexed: 11/24/2022]
Abstract
Selecting relevant feature subsets is vital in machine learning, and multiclass feature selection is harder to perform since most classifications are binary. The feature selection problem aims at reducing the feature set dimension while maintaining the model's accuracy. Datasets can be classified using various methods. Nevertheless, metaheuristic algorithms attract substantial attention for solving different problems in optimization. For this reason, this paper presents a systematic survey of literature for solving multiclass feature selection problems utilizing metaheuristic algorithms that can assist classifiers in selecting optimal or near-optimal features faster and more accurately. Metaheuristic algorithms have also been presented in four primary behavior-based categories, i.e., evolutionary-based, swarm-intelligence-based, physics-based, and human-based, even though some literature works presented more categorization. Further, lists of metaheuristic algorithms were introduced in the categories mentioned. In finding the solution to issues related to multiclass feature selection, only articles on metaheuristic algorithms used for multiclass feature selection problems from the year 2000 to 2022 were reviewed about their different categories and detailed descriptions. We considered some application areas for some of the metaheuristic algorithms applied for multiclass feature selection with their variations. Popular multiclass classifiers for feature selection were also examined. Moreover, we also presented the challenges of metaheuristic algorithms for feature selection, and we identified gaps for further research studies.
Affiliation(s)
- Olatunji O. Akinola
  - School of Mathematics, Statistics, and Computer Science, University of KwaZulu-Natal, King Edward Avenue, Pietermaritzburg Campus, Pietermaritzburg, 3201 KwaZulu-Natal South Africa
- Absalom E. Ezugwu
  - School of Mathematics, Statistics, and Computer Science, University of KwaZulu-Natal, King Edward Avenue, Pietermaritzburg Campus, Pietermaritzburg, 3201 KwaZulu-Natal South Africa
- Jeffrey O. Agushaka
  - School of Mathematics, Statistics, and Computer Science, University of KwaZulu-Natal, King Edward Avenue, Pietermaritzburg Campus, Pietermaritzburg, 3201 KwaZulu-Natal South Africa
- Raed Abu Zitar
  - Sorbonne Center of Artificial Intelligence, Sorbonne University-Abu Dhabi, 38044 Abu Dhabi, United Arab Emirates
- Laith Abualigah
  - Hourani Center for Applied Scientific Research, Al-Ahliyya Amman University, Amman, 19328 Jordan
  - Faculty of Information Technology, Middle East University, Amman, 11831 Jordan
|
7
|
A Highly Discriminative Hybrid Feature Selection Algorithm for Cancer Diagnosis. ScientificWorldJournal 2022; 2022:1056490. [PMID: 35983572 PMCID: PMC9381276 DOI: 10.1155/2022/1056490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2022] [Accepted: 07/20/2022] [Indexed: 11/17/2022] Open
Abstract
Cancer is a deadly disease that occurs due to rapid and uncontrolled cell growth. In this article, a machine learning (ML) algorithm is proposed to diagnose different cancer diseases from big data. The algorithm comprises a two-stage hybrid feature selection. In the first stage, an overall ranker is initiated to combine the results of three filter-based feature evaluation methods, namely, chi-squared, F-statistic, and mutual information (MI). The features are then ordered according to this combination. In the second stage, the modified wrapper-based sequential forward selection is utilized to discover the optimal feature subset, using ML models such as support vector machine (SVM), decision tree (DT), random forest (RF), and K-nearest neighbor (KNN) classifiers. To examine the proposed algorithm, many tests have been carried out on four cancerous microarray datasets, employing in the process 10-fold cross-validation and hyperparameter tuning. The performance of the algorithm is evaluated by calculating the diagnostic accuracy. The results indicate that for the leukemia dataset, both SVM and KNN models register the highest accuracy at 100% using only 5 features. For the ovarian cancer dataset, the SVM model achieves the highest accuracy at 100% using only 6 features. For the small round blue cell tumor (SRBCT) dataset, the SVM model also achieves the highest accuracy at 100% using only 8 features. For the lung cancer dataset, the SVM model also achieves the highest accuracy at 99.57% using 19 features. By comparing with other algorithms, the results obtained from the proposed algorithm are superior in terms of the number of selected features and diagnostic accuracy.
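The first stage described above, an overall ranker that combines three filter scores, can be illustrated with a small sketch. The abstract does not say how the three rankings are combined, so the rank-sum rule used here is an assumption, and the gene names and scores are hypothetical placeholders rather than values from the paper.

```python
def rank_positions(scores):
    """Map each feature to its rank (1 = highest score) under one filter."""
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return {name: position for position, name in enumerate(ordered, start=1)}


def combine_filter_rankings(filter_scores):
    """Order features by the sum of their ranks across filters (lower is better).

    Assumption: rank-sum aggregation; the paper's combination rule may differ.
    """
    totals = {}
    for scores in filter_scores.values():
        for name, position in rank_positions(scores).items():
            totals[name] = totals.get(name, 0) + position
    return sorted(totals, key=lambda name: (totals[name], name))


# Hypothetical filter scores for four genes; real values would come from
# chi-squared, F-statistic, and mutual-information computations on the data.
example_scores = {
    "chi2": {"g1": 9.1, "g2": 3.4, "g3": 7.7, "g4": 0.2},
    "f_stat": {"g1": 8.0, "g2": 15.5, "g3": 12.0, "g4": 1.1},
    "mutual_info": {"g1": 0.61, "g2": 0.20, "g3": 0.58, "g4": 0.05},
}
ranking = combine_filter_rankings(example_scores)
```

Here `g1` ranks first under two of the three filters and ends up at the top of the combined order. A second, wrapper-based stage would then walk this ordered list with sequential forward selection.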
|
8
|
Al-Obeidat F, Rocha Á, Akram M, Razzaq S, Maqbool F. (CDRGI)-Cancer detection through relevant genes identification. Neural Comput Appl 2022. [DOI: 10.1007/s00521-021-05739-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
|
9
|
Ai H. GSEA-SDBE: A gene selection method for breast cancer classification based on GSEA and analyzing differences in performance metrics. PLoS One 2022; 17:e0263171. [PMID: 35472078 PMCID: PMC9041804 DOI: 10.1371/journal.pone.0263171] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/23/2021] [Accepted: 01/13/2022] [Indexed: 12/20/2022] Open
Abstract
MOTIVATION Selecting the most relevant genes for sample classification is a common process in gene expression studies. Moreover, determining the smallest set of relevant genes that can achieve the required classification performance is particularly important in diagnosing cancer and improving treatment. RESULTS In this study, I propose a novel method to eliminate irrelevant and redundant genes, and thus determine the smallest set of relevant genes for breast cancer diagnosis. The method is based on random forest models, gene set enrichment analysis (GSEA), and my developed Sort Difference Backward Elimination (SDBE) algorithm; hence, the method is named GSEA-SDBE. Using this method, genes are filtered according to their importance following random forest training and GSEA is used to select genes by core enrichment of Kyoto Encyclopedia of Genes and Genomes pathways that are strongly related to breast cancer. Subsequently, the SDBE algorithm is applied to eliminate redundant genes and identify the most relevant genes for breast cancer diagnosis. In the SDBE algorithm, the differences in the Matthews correlation coefficients (MCCs) of performing random forest models are computed before and after the deletion of each gene to indicate the degree of redundancy of the corresponding deleted gene on the remaining genes during backward elimination. Next, the obtained MCC difference list is divided into two parts from a set position and each part is respectively sorted. By continuously iterating and changing the set position, the most relevant genes are stably assembled on the left side of the gene list, facilitating their identification, and the redundant genes are gathered on the right side of the gene list for easy elimination. 
A cross-comparison of the SDBE algorithm was performed by respectively computing differences between MCCs and ROC_AUC_score and then respectively using 10-fold classification models, e.g., random forest (RF), support vector machine (SVM), k-nearest neighbor (KNN), extreme gradient boosting (XGBoost), and extremely randomized trees (ExtraTrees). Finally, the classification performance of the proposed method was compared with that of three advanced algorithms for five cancer datasets. Results showed that analyzing MCC differences and using random forest models was the optimal solution for the SDBE algorithm. Accordingly, three consistently relevant genes (i.e., VEGFD, TSLP, and PKMYT1) were selected for the diagnosis of breast cancer. The performance metrics (MCC and ROC_AUC_score, respectively) of the random forest models based on 10-fold verification reached 95.28% and 98.75%. In addition, survival analysis showed that VEGFD and TSLP could be used to predict the prognosis of patients with breast cancer. Moreover, the proposed method significantly outperformed the other methods tested as it allowed selecting a smaller number of genes while maintaining the required classification accuracy.
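Because the SDBE algorithm is driven by differences in the Matthews correlation coefficient, a short sketch of that metric may help. It uses the standard confusion-matrix definition, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); the counts and the `mcc_difference` helper are hypothetical illustrations, not the paper's code.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


def mcc_difference(mcc_before, mcc_after):
    """Change in MCC caused by deleting one gene (hypothetical helper).

    A drop close to zero suggests the deleted gene was redundant
    given the genes that remain.
    """
    return mcc_before - mcc_after


# Hypothetical counts for a model before and after removing one gene.
before = matthews_corrcoef(tp=45, tn=40, fp=5, fn=10)
after = matthews_corrcoef(tp=44, tn=40, fp=5, fn=11)
drop = mcc_difference(before, after)
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), which makes it more informative than accuracy on imbalanced classes.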
Affiliation(s)
- Hu Ai
  - Department of Criminal Technology, Guizhou Police College, Guiyang, Guizhou, China
|
10
|
Sathya M, Jeyaselvi M, Joshi S, Pandey E, Pareek PK, Jamal SS, Kumar V, Atiglah HK. Cancer Categorization Using Genetic Algorithm to Identify Biomarker Genes. JOURNAL OF HEALTHCARE ENGINEERING 2022; 2022:5821938. [PMID: 35242297 PMCID: PMC8888099 DOI: 10.1155/2022/5821938] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/24/2021] [Accepted: 12/14/2021] [Indexed: 11/18/2022]
Abstract
In the microarray gene expression data, there are a large number of genes that are expressed at varying levels of expression. Given that there are only a few critically significant genes, it is challenging to analyze and categorize datasets that span the whole gene space. In order to aid in the diagnosis of cancer disease and, as a consequence, the suggestion of individualized treatment, the discovery of biomarker genes is essential. Starting with a large pool of candidates, the parallelized minimal redundancy and maximum relevance ensemble (mRMRe) is used to choose the top m informative genes from a huge pool of candidates. A Genetic Algorithm (GA) is used to heuristically compute the ideal set of genes by applying the Mahalanobis Distance (MD) as a distance metric. Once the genes have been identified, they are input into the GA. It is used as a classifier to four microarray datasets using the approved approach (mRMRe-GA), with the Support Vector Machine (SVM) serving as the classification basis. Leave-One-Out-Cross-Validation (LOOCV) is a cross-validation technique for assessing the performance of a classifier. It is now being investigated if the proposed mRMRe-GA strategy can be compared to other approaches. It has been shown that the proposed mRMRe-GA approach enhances classification accuracy while employing less genetic material than previous methods. Microarray, Gene Expression Data, GA, Feature Selection, SVM, and Cancer Classification are some of the terms used in this paper.
Affiliation(s)
- M. Sathya
  - Department of Information Science and Engineering, AMC Engineering College, Bengaluru, Karnataka 560083, India
- M. Jeyaselvi
  - Department of Computer Science and Engineering, SRM Institute of Science and Technology, Chennai, India
- Shubham Joshi
  - Department of Computer Engineering, SVKM'S NMIMS MPSTME Shirpur, Maharashtra 425405, India
- Ekta Pandey
  - Applied Science Department, Bundhelkhand Institute of Engineering and Technology, Jhansi, Uttar Pradesh, India
- Piyush Kumar Pareek
  - Department of Computer Science & Engineering & Head of IPR Cell, Nitte Meenakshi Institute of Technology, Bengaluru, India
- Sajjad Shaukat Jamal
  - Department of Mathematics, College of Science, King Khalid University, Abha, Saudi Arabia
- Vinay Kumar
  - Department of Computer Engineering and Application, GLA University, Mathura, India
- Henry Kwame Atiglah
  - Department of Electrical and Electronics Engineering, Tamale Technical University, Tamale, Ghana
|
11
|
Deng X, Li M, Deng S, Wang L. Hybrid gene selection approach using XGBoost and multi-objective genetic algorithm for cancer classification. Med Biol Eng Comput 2022; 60:663-681. [PMID: 35028863 DOI: 10.1007/s11517-021-02476-x] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 04/10/2021] [Accepted: 11/23/2021] [Indexed: 12/15/2022]
Abstract
Microarray gene expression data are often accompanied by a large number of genes and a small number of samples. However, only a few of these genes are relevant to cancer, resulting in significant gene selection challenges. Hence, we propose a two-stage gene selection approach by combining extreme gradient boosting (XGBoost) and a multi-objective optimization genetic algorithm (XGBoost-MOGA) for cancer classification in microarray datasets. In the first stage, the genes are ranked using an ensemble-based feature selection using XGBoost. This stage can effectively remove irrelevant genes and yield a group comprising the most relevant genes related to the class. In the second stage, XGBoost-MOGA searches for an optimal gene subset based on the most relevant genes' group using a multi-objective optimization genetic algorithm. We performed comprehensive experiments to compare XGBoost-MOGA with other state-of-the-art feature selection methods using two well-known learning classifiers on 14 publicly available microarray expression datasets. The experimental results show that XGBoost-MOGA yields significantly better results than previous state-of-the-art algorithms in terms of various evaluation criteria, such as accuracy, F-score, precision, and recall.
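The multi-objective search in the second stage can be pictured with a minimal Pareto-dominance sketch. The abstract does not name the objectives, so the pair used here (maximise accuracy, minimise number of selected genes) is an assumption, and the candidate subsets are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b on both objectives and better on at least one.

    Assumed objectives: higher accuracy is better, fewer genes is better.
    """
    no_worse = a["accuracy"] >= b["accuracy"] and a["n_genes"] <= b["n_genes"]
    better = a["accuracy"] > b["accuracy"] or a["n_genes"] < b["n_genes"]
    return no_worse and better


def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical gene subsets evaluated by some classifier.
candidates = [
    {"name": "A", "accuracy": 0.95, "n_genes": 20},
    {"name": "B", "accuracy": 0.93, "n_genes": 8},
    {"name": "C", "accuracy": 0.90, "n_genes": 12},  # dominated by B
    {"name": "D", "accuracy": 0.95, "n_genes": 25},  # dominated by A
    {"name": "E", "accuracy": 0.80, "n_genes": 3},
]
front = [c["name"] for c in pareto_front(candidates)]
```

A multi-objective genetic algorithm evolves a population toward such a non-dominated set instead of a single best solution, leaving the final accuracy-versus-size trade-off to the analyst.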
Affiliation(s)
- Xiongshi Deng
  - School of Information Engineering, Nanchang Institute of Technology, Jiangxi, 330099, People's Republic of China
  - Jiangxi Province Key Laboratory of Water Information Cooperative Sensing and Intelligent Processing, Jiangxi, 330099, People's Republic of China
- Min Li
  - School of Information Engineering, Nanchang Institute of Technology, Jiangxi, 330099, People's Republic of China
  - Jiangxi Province Key Laboratory of Water Information Cooperative Sensing and Intelligent Processing, Jiangxi, 330099, People's Republic of China
- Shaobo Deng
  - School of Information Engineering, Nanchang Institute of Technology, Jiangxi, 330099, People's Republic of China
  - Jiangxi Province Key Laboratory of Water Information Cooperative Sensing and Intelligent Processing, Jiangxi, 330099, People's Republic of China
- Lei Wang
  - School of Information Engineering, Nanchang Institute of Technology, Jiangxi, 330099, People's Republic of China
  - Jiangxi Province Key Laboratory of Water Information Cooperative Sensing and Intelligent Processing, Jiangxi, 330099, People's Republic of China
|
12
|
Bose S, Das C, Banerjee A, Ghosh K, Chattopadhyay M, Chattopadhyay S, Barik A. An ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples. PeerJ Comput Sci 2021; 7:e671. [PMID: 34616883 PMCID: PMC8459790 DOI: 10.7717/peerj-cs.671] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 12/18/2020] [Accepted: 07/20/2021] [Indexed: 06/13/2023]
Abstract
BACKGROUND Machine learning is one kind of machine intelligence technique that learns from data and detects inherent patterns from large, complex datasets. Due to this capability, machine learning techniques are widely used in medical applications, especially where large-scale genomic and proteomic data are used. Cancer classification based on bio-molecular profiling data is a very important topic for medical applications since it improves the diagnostic accuracy of cancer and enables a successful culmination of cancer treatments. Hence, machine learning techniques are widely used in cancer detection and prognosis. METHODS In this article, a new ensemble machine learning classification model named Multiple Filtering and Supervised Attribute Clustering algorithm based Ensemble Classification model (MFSAC-EC) is proposed which can handle class imbalance problem and high dimensionality of microarray datasets. This model first generates a number of bootstrapped datasets from the original training data where the oversampling procedure is applied to handle the class imbalance problem. The proposed MFSAC method is then applied to each of these bootstrapped datasets to generate sub-datasets, each of which contains a subset of the most relevant/informative attributes of the original dataset. The MFSAC method is a feature selection technique combining multiple filters with a new supervised attribute clustering algorithm. Then for every sub-dataset, a base classifier is constructed separately, and finally, the predictive accuracy of these base classifiers is combined using the majority voting technique forming the MFSAC-based ensemble classifier. Also, a number of most informative attributes are selected as important features based on their frequency of occurrence in these sub-datasets. RESULTS To assess the performance of the proposed MFSAC-EC model, it is applied on different high-dimensional microarray gene expression datasets for cancer sample classification. 
The proposed model is compared with well-known existing models to establish its effectiveness with respect to other models. From the experimental results, it has been found that the generalization performance/testing accuracy of the proposed classifier is significantly better compared to other well-known existing models. Apart from that, it has been also found that the proposed model can identify many important attributes/biomarker genes.
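Two generic building blocks of the ensemble described in the METHODS paragraph, bootstrapped datasets and majority voting, can be sketched as follows. This is only an illustration under simplifying assumptions: the oversampling step and the MFSAC feature selection are omitted, and the row identifiers and class labels are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the most frequent label; ties go to the alphabetically first label."""
    counts = Counter(predictions)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
training_rows = ["s1", "s2", "s3", "s4", "s5"]  # hypothetical sample ids
bootstrapped = [bootstrap_sample(training_rows, rng) for _ in range(3)]

# Hypothetical predictions from three base classifiers for one test sample.
final_label = majority_vote(["tumor", "normal", "tumor"])
```

Each bootstrapped dataset would train its own base classifier, and the ensemble's prediction for a sample is the label that most base classifiers agree on.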
Affiliation(s)
- Shilpi Bose
  - Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India
- Chandra Das
  - Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India
- Abhik Banerjee
  - Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India
- Kuntal Ghosh
  - Machine Intelligence Unit & Center for Soft Computing Research, Indian Statistical Institute, Kolkata, West Bengal, India
- Samiran Chattopadhyay
  - Department of Information Technology, Jadavpur University, Kolkata, West Bengal, India
- Aishwarya Barik
  - Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India
|
13
|
A novel bio-inspired hybrid multi-filter wrapper gene selection method with ensemble classifier for microarray data. Neural Comput Appl 2021; 35:11531-11561. [PMID: 34539088 PMCID: PMC8435304 DOI: 10.1007/s00521-021-06459-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/10/2020] [Accepted: 08/26/2021] [Indexed: 01/04/2023]
Abstract
Microarray technology is known as one of the most important tools for collecting DNA expression data. This technology allows researchers to investigate and examine types of diseases and their origins. However, microarray data are often associated with a small sample size, a significant number of genes, imbalanced data, etc., making classification models inefficient. Thus, a new hybrid solution based on a multi-filter and adaptive chaotic multi-objective forest optimization algorithm (AC-MOFOA) is presented to solve the gene selection problem and construct the Ensemble Classifier. In the proposed solution, a multi-filter model (i.e., ensemble filter) is proposed as a preprocessing step to reduce the dataset's dimensions, using a combination of five filter methods to remove redundant and irrelevant genes. The results of the five filter methods are combined using a voting-based function. The results of the proposed multi-filter indicate that it has good capability in reducing the gene subset size and selecting relevant genes. Then, an AC-MOFOA based on the concepts of non-dominated sorting, crowding distance, chaos theory, and adaptive operators is presented. AC-MOFOA, as a wrapper method, aims to reduce dataset dimensions, optimize KELM, and increase classification accuracy simultaneously. Next, an ensemble classifier model is built from the AC-MOFOA results to classify microarray data. The performance of the proposed algorithm was evaluated on nine public microarray datasets, and its results were compared in terms of the number of selected genes, classification efficiency, execution time, time complexity, hypervolume indicator, and spacing metric with five hybrid multi-objective methods and three hybrid single-objective methods. According to the results, the proposed hybrid method could increase the accuracy of the KELM in most datasets by reducing the dataset's dimensions and achieve similar or superior performance compared to other multi-objective methods. Furthermore, the proposed Ensemble Classifier model could provide better classification accuracy and generalizability in seven of the nine microarray datasets compared to conventional ensemble methods. Moreover, comparison of the Ensemble Classifier model with three state-of-the-art ensemble generation methods indicates its competitive performance: the proposed ensemble model achieved better results in five of the nine datasets.
|
14
|
Mishra P, Bhoi N. Cancer gene recognition from microarray data with manta ray based enhanced ANFIS technique. Biocybern Biomed Eng 2021. [DOI: 10.1016/j.bbe.2021.06.004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 01/13/2023]
|
15
|
Al-Shaikh A, Mahafzah BA, Alshraideh M. Hybrid harmony search algorithm for social network contact tracing of COVID-19. Soft comput 2021. [DOI: 10.1007/s00500-021-05948-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/15/2023]
|
16
|
Al-Shaikh A, Mahafzah BA, Alshraideh M. Hybrid harmony search algorithm for social network contact tracing of COVID-19. Soft comput 2021. [DOI: 10.1007/s00500-021-05948-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/17/2023]
|
17
|
Al-Rajab M, Lu J, Xu Q. A framework model using multifilter feature selection to enhance colon cancer classification. PLoS One 2021; 16:e0249094. [PMID: 33861766 PMCID: PMC8691854 DOI: 10.1371/journal.pone.0249094] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 10/16/2020] [Accepted: 03/11/2021] [Indexed: 11/18/2022] Open
Abstract
Gene expression profiles can be utilized in the diagnosis of critical diseases such as cancer. The selection of biomarker genes from these profiles is crucial for cancer detection. This paper presents a framework proposing a two-stage multifilter hybrid model of feature selection for colon cancer classification. Colon cancer is now extremely common relative to other types of cancer. There is a need for a fast and accurate method to detect the tissues and to enhance the diagnostic process and drug discovery. This paper reports on a study whose objective has been to improve the diagnosis of cancer of the colon through a two-stage, multifilter model of feature selection. The model described deals with feature selection using a combination of Information Gain and a Genetic Algorithm. The next stage is to filter and rank the genes identified through this method using the minimum Redundancy Maximum Relevance (mRMR) technique. The final phase is to further analyze the data using correlated machine learning algorithms. This two-stage approach, which involves the selection of genes before classification techniques are used, improves success rates for the identification of cancer cells. It is found that Decision Tree, K-Nearest Neighbor, and Naïve Bayes classifiers showed promisingly accurate results using the developed hybrid framework model. It is concluded that our proposed method achieved a higher accuracy than the existing methods reported in the literature. This study can serve as a starting point for improving treatment and drug discovery for colon cancer.
Affiliation(s)
- Murad Al-Rajab
- School of Computing and Engineering, University of
Huddersfield, Huddersfield, United Kingdom
| | - Joan Lu
- School of Computing and Engineering, University of
Huddersfield, Huddersfield, United Kingdom
| | - Qiang Xu
- School of Computing and Engineering, University of
Huddersfield, Huddersfield, United Kingdom
| |
|
18
|
Hameed SS, Hassan WH, Latiff LA, Muhammadsharif FF. A comparative study of nature-inspired metaheuristic algorithms using a three-phase hybrid approach for gene selection and classification in high-dimensional cancer datasets. Soft comput 2021. [DOI: 10.1007/s00500-021-05726-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/25/2022]
|
19
|
Feature Selection for Colon Cancer Detection Using K-Means Clustering and Modified Harmony Search Algorithm. MATHEMATICS 2021. [DOI: 10.3390/math9050570] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 12/17/2022]
Abstract
This paper proposes a feature selection method that is effective in distinguishing colorectal cancer patients from normal individuals using K-means clustering and the modified harmony search algorithm. As the genetic cause of colorectal cancer originates from mutations in genes, it is important to classify the presence or absence of colorectal cancer through gene information. The proposed methodology consists of four steps. First, the original data are Z-normalized by data preprocessing. Candidate genes are then selected using the Fisher score. Next, one representative gene is selected from each cluster after candidate genes are clustered using K-means clustering. Finally, feature selection is carried out using the modified harmony search algorithm. The gene combination created by feature selection is then applied to the classification model and verified using 5-fold cross-validation. The proposed model obtained a classification accuracy of up to 94.36%. Furthermore, on comparing the proposed method with other methods, we prove that the proposed method performs well in classifying colorectal cancer. Moreover, we believe that the proposed model can be applied not only to colorectal cancer but also to other gene-related diseases.
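The Fisher-score step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the common definition of the Fisher score (between-class scatter of a gene divided by its within-class scatter); the abstract does not state the exact formula the authors used, and the function names `fisher_score` and `rank_genes` are hypothetical.

```python
from statistics import fmean, pvariance


def fisher_score(values, labels):
    """Fisher score of one gene: between-class scatter / within-class scatter.

    Assumed (textbook) definition, not taken from the paper itself.
    """
    overall_mean = fmean(values)
    between = 0.0
    within = 0.0
    for cls in set(labels):
        group = [v for v, y in zip(values, labels) if y == cls]
        between += len(group) * (fmean(group) - overall_mean) ** 2
        within += len(group) * pvariance(group)
    # A gene with zero within-class variance is treated as perfectly separable.
    return between / within if within > 0 else float("inf")


def rank_genes(matrix, labels, top_k):
    """Return the column indices of the top_k genes by Fisher score.

    matrix is a list of samples, each a list of expression values.
    """
    n_genes = len(matrix[0])
    scores = [fisher_score([row[j] for row in matrix], labels)
              for j in range(n_genes)]
    return sorted(range(n_genes), key=lambda j: scores[j], reverse=True)[:top_k]


# Toy data (made up): gene 0 separates the classes, gene 1 does not.
toy_matrix = [[1.0, 5.0], [1.2, 1.0], [3.0, 5.2], [3.2, 0.8]]
toy_labels = [0, 0, 1, 1]
print(rank_genes(toy_matrix, toy_labels, top_k=1))
```

In the pipeline described above, the candidate genes kept by such a ranking would then be clustered and passed to the search stage; those later stages are not sketched here.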
|
20
|
Abiodun EO, Alabdulatif A, Abiodun OI, Alawida M, Alabdulatif A, Alkhawaldeh RS. A systematic review of emerging feature selection optimization methods for optimal text classification: the present state and prospective opportunities. Neural Comput Appl 2021; 33:15091-15118. [PMID: 34404964 PMCID: PMC8361413 DOI: 10.1007/s00521-021-06406-8] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 04/07/2021] [Accepted: 07/31/2021] [Indexed: 02/07/2023]
Abstract
Specialized data preparation techniques, ranging from data cleaning, outlier detection, missing value imputation, feature selection (FS), amongst others, are procedures required to get the most out of data and, consequently, get the optimal performance of predictive models for classification tasks. FS is a vital and indispensable technique that enables the model to perform faster, eliminate noisy data, remove redundancy, reduce overfitting, improve precision and increase generalization on testing data. While conventional FS techniques have been leveraged for classification tasks in the past few decades, they fail to optimally reduce the high dimensionality of the feature space of texts, thus breeding inefficient predictive models. Emerging technologies such as the metaheuristics and hyper-heuristics optimization methods provide a new paradigm for FS due to their efficiency in improving the accuracy of classification, computational demands, storage, as well as functioning seamlessly in solving complex optimization problems with less time. However, little details are known on best practices for case-to-case usage of emerging FS methods. The literature continues to be engulfed with clear and unclear findings in leveraging effective methods, which, if not performed accurately, alters precision, real-world-use feasibility, and the predictive model's overall performance. This paper reviews the present state of FS with respect to metaheuristics and hyper-heuristic methods. Through a systematic literature review of over 200 articles, we set out the most recent findings and trends to enlighten analysts, practitioners and researchers in the field of data analytics seeking clarity in understanding and implementing effective FS optimization methods for improved text classification tasks.
Affiliation(s)
- Esther Omolara Abiodun
- School of Computer Sciences, Universiti Sains Malaysia, George Town, Malaysia; Department of Computer Sciences, University of Abuja, Abuja, Nigeria
| | - Abdulatif Alabdulatif
- Department of Computer Science, College of Computer, Qassim University, Buraydah, Saudi Arabia
| | - Oludare Isaac Abiodun
- School of Computer Sciences, Universiti Sains Malaysia, George Town, Malaysia; Department of Computer Sciences, University of Abuja, Abuja, Nigeria
| | - Moatsum Alawida
- School of Computer Sciences, Universiti Sains Malaysia, George Town, Malaysia; Department of Computer Sciences, Abu Dhabi University, Abu Dhabi, UAE
| | - Abdullah Alabdulatif
- Computer Department, College of Sciences and Arts, Qassim University, P.O. Box 53, Al-Rass, Saudi Arabia
| | - Rami S. Alkhawaldeh
- Department of Computer Information Systems, The University of Jordan, Aqaba, 77110 Jordan
| |
|
21
|
MotieGhader H, Masoudi-Sobhanzadeh Y, Ashtiani SH, Masoudi-Nejad A. mRNA and microRNA selection for breast cancer molecular subtype stratification using meta-heuristic based algorithms. Genomics 2020; 112:3207-3217. [DOI: 10.1016/j.ygeno.2020.06.014] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 04/02/2020] [Revised: 05/13/2020] [Accepted: 06/02/2020] [Indexed: 02/06/2023]
|
22
|
Gholami J, Pourpanah F, Wang X. Feature selection based on improved binary global harmony search for data classification. Appl Soft Comput 2020. [DOI: 10.1016/j.asoc.2020.106402] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Indexed: 02/06/2023]
|
23
|
Integration of multi-objective PSO based feature selection and node centrality for medical datasets. Genomics 2020; 112:4370-4384. [PMID: 32717320 DOI: 10.1016/j.ygeno.2020.07.027] [Citation(s) in RCA: 69] [Impact Index Per Article: 13.8] [Received: 03/06/2020] [Revised: 06/22/2020] [Accepted: 07/14/2020] [Indexed: 01/19/2023]
Abstract
In the past decades, the rapid growth of computer and database technologies has led to the rapid growth of large-scale medical datasets. On the other hand, medical applications with high-dimensional datasets that require high speed and accuracy are rapidly increasing. One of the dimensionality reduction approaches is feature selection, which can increase the accuracy of disease diagnosis and reduce its computational complexity. In this paper, a novel PSO-based multi-objective feature selection method is proposed. The proposed method consists of three main phases. In the first phase, the original features are represented as a graph. In the next phase, feature centralities for all nodes in the graph are calculated, and finally, in the third phase, an improved PSO-based search process is utilized for the final feature selection. The results on five medical datasets indicate that the proposed method improves on previous related methods in terms of efficiency and effectiveness.
|
24
|
Boveiri HR. An enhanced cuckoo optimization algorithm for task graph scheduling in cluster-computing systems. Soft comput 2020. [DOI: 10.1007/s00500-019-04520-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/24/2022]
|
25
|
A survey on single and multi omics data mining methods in cancer data classification. J Biomed Inform 2020; 107:103466. [DOI: 10.1016/j.jbi.2020.103466] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 12/11/2019] [Revised: 05/01/2020] [Accepted: 05/31/2020] [Indexed: 01/09/2023]
|
26
|
Baliarsingh SK, Vipsita S. Chaotic emperor penguin optimised extreme learning machine for microarray cancer classification. IET Syst Biol 2020; 14:85-95. [PMID: 32196467 DOI: 10.1049/iet-syb.2019.0028] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Indexed: 11/19/2022] Open
Abstract
Microarray technology plays a significant role in cancer classification, where a large number of genes and samples are simultaneously analysed. For the efficient analysis of the microarray data, there is a great demand for the development of intelligent techniques. In this article, the authors propose a novel hybrid technique employing Fisher criterion, ReliefF, and extreme learning machine (ELM) based on the principle of chaotic emperor penguin optimisation algorithm (CEPO). EPO is a recently developed metaheuristic method. In the proposed method, initially, Fisher score and ReliefF are independently used as filters for relevant gene selection. Further, a novel population-based metaheuristic, namely, CEPO was proposed to pre-train the ELM by selecting the optimal input weights and hidden biases. The authors have successfully conducted experiments on seven well-known data sets. To evaluate the effectiveness, the proposed method is compared with original EPO, genetic algorithm, and particle swarm optimisation-based ELM along with other state-of-the-art techniques. The experimental results show that the proposed framework achieves better accuracy as compared to the state-of-the-art schemes. The efficacy of the proposed method is demonstrated in terms of accuracy, sensitivity, specificity, and F-measure.
Affiliation(s)
- Santos Kumar Baliarsingh
- DST-FIST Bioinformatics Lab, Department of Computer Science and Engineering, International Institute of Information Technology, Bhubaneswar, India.
| | - Swati Vipsita
- DST-FIST Bioinformatics Lab, Department of Computer Science and Engineering, International Institute of Information Technology, Bhubaneswar, India
| |
|
27
|
Breast and Colon Cancer Classification from Gene Expression Profiles Using Data Mining Techniques. Symmetry (Basel) 2020. [DOI: 10.3390/sym12030408] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Indexed: 12/29/2022] Open
Abstract
Early detection of cancer increases the probability of recovery. This paper presents an intelligent decision support system (IDSS) for the early diagnosis of cancer based on gene expression profiles collected using DNA microarrays. Such datasets pose a challenge because of the small number of samples (no more than a few hundred) relative to the large number of genes (in the order of thousands). Therefore, a method of reducing the number of features (genes) that are not relevant to the disease of interest is necessary to avoid overfitting. The proposed methodology uses the information gain (IG) to select the most important features from the input patterns. Then, the selected features (genes) are reduced by applying the grey wolf optimization (GWO) algorithm. Finally, the methodology employs a support vector machine (SVM) classifier for cancer type classification. The proposed methodology was applied to two datasets (Breast and Colon) and was evaluated based on its classification accuracy, which is the most important performance measure in disease diagnosis. The experimental results indicate that the proposed methodology is able to enhance the stability of the classification accuracy as well as the feature selection.
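The information gain (IG) filter named in this abstract can be sketched as follows. This is a minimal illustration that assumes each gene is discretized by a single threshold before IG is computed; the abstract does not say how the authors discretized expression values, and the names `entropy` and `information_gain` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """IG of splitting one gene's expression values at a threshold.

    The single-threshold split is an assumption made for illustration.
    """
    low = [y for v, y in zip(values, labels) if v <= threshold]
    high = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    # Weighted entropy of the two partitions (empty partitions are skipped).
    remainder = sum(len(part) / n * entropy(part)
                    for part in (low, high) if part)
    return entropy(labels) - remainder


# Toy data (made up): an informative gene and an uninformative one.
toy_labels = ["normal", "normal", "tumor", "tumor"]
informative = [0.1, 0.2, 0.8, 0.9]
uninformative = [0.1, 0.9, 0.2, 0.8]
print(information_gain(informative, toy_labels, 0.5))
print(information_gain(uninformative, toy_labels, 0.5))
```

Genes would be ranked by this score and only the highest-ranked ones passed on to the optimization and SVM stages, which are not sketched here.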
|
28
|
MapReduce-Based Parallel Genetic Algorithm for CpG-Site Selection in Age Prediction. Genes (Basel) 2019; 10:genes10120969. [PMID: 31775313 PMCID: PMC6947642 DOI: 10.3390/genes10120969] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/20/2019] [Revised: 11/12/2019] [Accepted: 11/15/2019] [Indexed: 11/23/2022] Open
Abstract
Genomic biomarkers such as DNA methylation (DNAm) are employed for age prediction. In recent years, several studies have suggested the association between changes in DNAm and its effect on human age. The high dimensional nature of this type of data significantly increases the execution time of modeling algorithms. To mitigate this problem, we propose a two-stage parallel algorithm for selection of age related CpG-sites. The algorithm first attempts to cluster the data into similar age ranges. In the next stage, a parallel genetic algorithm (PGA), based on the MapReduce paradigm (MR-based PGA), is used for selecting age-related features of each individual age range. In the proposed method, the execution of the algorithm for each age range (data parallel), the evaluation of chromosomes (task parallel) and the calculation of the fitness function (data parallel) are performed using a novel parallel framework. In this paper, we consider 16 different healthy DNAm datasets that are related to the human blood tissue and that contain the relevant age information. These datasets are combined into a single unioned set, which is in turn randomly divided into two sets of train and test data with a ratio of 7:3, respectively. We build a Gradient Boosting Regressor (GBR) model on the selected CpG-sites from the train set. To evaluate the model accuracy, we compared our results with state-of-the-art approaches that used these datasets, and observed that our method performs better on the unseen test dataset with a Mean Absolute Deviation (MAD) of 3.62 years, and a correlation (R2) of 95.96% between age and DNAm. In the train data, the MAD and R2 are 1.27 years and 99.27%, respectively. Finally, we evaluate our method in terms of the effect of parallelization in computation time. The algorithm without parallelization requires 4123 min to complete, whereas the parallelized execution on 3 computing machines having 32 processing cores each, only takes a total of 58 min. 
This shows that our proposed algorithm is both efficient and scalable.
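The reported runtimes can be turned into a speedup and a parallel-efficiency figure with a short calculation. The sketch below uses only the numbers quoted in the abstract (4123 min serial, 58 min parallel, 3 machines with 32 cores each) and assumes that all 96 cores were in use, which the abstract does not state explicitly.

```python
serial_minutes = 4123    # reported runtime without parallelization
parallel_minutes = 58    # reported runtime of the parallelized execution
cores = 3 * 32           # 3 machines x 32 cores (assumes every core was used)

speedup = serial_minutes / parallel_minutes
efficiency = speedup / cores

print(f"speedup ~ {speedup:.1f}x, parallel efficiency ~ {efficiency:.2f}")
```

Under that assumption the speedup is roughly 71x on 96 cores, which corresponds to a parallel efficiency of about 0.74.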
|
29
|
|
30
|
A new optimal gene selection approach for cancer classification using enhanced Jaya-based forest optimization algorithm. Neural Comput Appl 2019. [DOI: 10.1007/s00521-019-04355-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 12/14/2022]
|
31
|
Kawabata T, Emoto R, Nishino J, Takahashi K, Matsui S. Two-stage analysis for selecting fixed numbers of features in omics association studies. Stat Med 2019; 38:2956-2971. [PMID: 30931544 DOI: 10.1002/sim.8150] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2018] [Revised: 12/31/2018] [Accepted: 02/28/2019] [Indexed: 11/07/2022]
Abstract
One of main roles of omics-based association studies with high-throughput technologies is to screen out relevant molecular features, such as genetic variants, genes, and proteins, from a large pool of such candidate features based on their associations with the phenotype of interest. Typically, screened features are subject to validation studies using more established or conventional assays, where the number of evaluable features is relatively limited, so that there may exist a fixed number of features measurable by these assays. Such a limitation necessitates narrowing a feature set down to a fixed size, following an initial screening analysis via multiple testing where adjustment for multiplicity is made. We propose a two-stage screening approach to control the false discovery rate (FDR) for a feature set with fixed size that is subject to validation studies, rather than for a feature set from the initial screening analysis. Out of the feature set selected in the first stage with a relaxed FDR level, a fraction of features with most statistical significance is firstly selected. For the remaining feature set, features are selected based on biological consideration only, without regard to any statistical information, which allows evaluating the FDR level for the finally selected feature set with fixed size. Improvement of the power is discussed in the proposed two-stage screening approach. Simulation experiments based on parametric models and real microarray datasets demonstrated substantial increment in the number of screened features for biological consideration compared with the standard screening approach, allowing for more extensive and in-depth biological investigations in omics association studies.
Affiliation(s)
- Takanori Kawabata
- Clinical Research Promotion Unit, Clinical Research Center, Shizuoka Cancer Center, Shizuoka, Japan
| | - Ryo Emoto
- Department of Biostatistics, Nagoya University Graduate School of Medicine, Nagoya, Japan
| | - Jo Nishino
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan
| | - Kunihiko Takahashi
- Department of Biostatistics, Nagoya University Graduate School of Medicine, Nagoya, Japan
| | - Shigeyuki Matsui
- Department of Biostatistics, Nagoya University Graduate School of Medicine, Nagoya, Japan
| |
|
32
|
A new data analysis method based on feature linear combination. J Biomed Inform 2019; 94:103173. [PMID: 30965135 DOI: 10.1016/j.jbi.2019.103173] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/13/2018] [Revised: 04/02/2019] [Accepted: 04/06/2019] [Indexed: 01/15/2023]
Abstract
In biological data, feature relationships are complex and diverse, they could reflect physiological and pathological changes. Defining simple and efficient classification rules based on feature relationships is helpful for discriminating different conditions and studying disease mechanism. The popular data analysis method, k top scoring pairs (k-TSP), explores the feature relationship by focusing on the difference of the relative level of two features in different groups and classifies samples based on the exploration. To define more efficient classification rules, we propose a new data analysis method based on the linear combination of k > 0 top scoring pairs (LC-k-TSP). LC-k-TSP applies support vector machine (SVM) to define the best linear relationship of each feature pair, scores feature pairs by the discriminative abilities of the corresponding linear combinations and selects k disjoint top scoring pairs to construct an ensemble classifier. Experiments on twelve public datasets showed the superiority of LC-k-TSP over k-TSP which evaluates the relationship of every two features in the same way. The experiment also illustrated that LC-k-TSP performed similarly to SVM and random forest (RF) in accuracy rate. LC-k-TSP studies the own unique linear combination for each feature pair and defines simple classification rules, it is easy to explore the biomedical explanation. Finally, we applied LC-k-TSP to analyze the hepatocellular carcinoma (HCC) metabolomics data and define the simple classification rules for discrimination of different liver diseases. It obtained accuracy rates of 89.76% and 89.13% in distinguishing between small HCC and hepatic cirrhosis (CIR) groups as well as between HCC and CIR groups, superior to 87.99% and 80.35% by k-TSP. Hence, defining classification rules based on feature relationships is an effective way to analyze biological data. 
LC-k-TSP which checks different feature pairs by their corresponding unique best linear relationship has the superiority over k-TSP which checks each pair by the same linear relationship. Availability and implementation: http://www.402.dicp.ac.cn/download_ok_4.htm.
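For readers unfamiliar with the baseline that this abstract compares against, the pair score used by top scoring pairs can be sketched as below. The sketch assumes the usual two-class definition, in which a feature pair (i, j) is scored by how differently the event "feature i is lower than feature j" occurs in the two groups. It illustrates plain TSP only, not the SVM-based linear combination of LC-k-TSP, and the function names are hypothetical.

```python
from itertools import combinations


def pair_score(samples, labels, i, j):
    """|P(x_i < x_j | class A) - P(x_i < x_j | class B)| for two classes."""
    def frac(cls):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        return sum(r[i] < r[j] for r in rows) / len(rows)

    class_a, class_b = sorted(set(labels))  # assumes exactly two classes
    return abs(frac(class_a) - frac(class_b))


def top_scoring_pair(samples, labels):
    """Return the feature pair (i, j), i < j, with the highest pair score."""
    n_features = len(samples[0])
    return max(combinations(range(n_features), 2),
               key=lambda p: pair_score(samples, labels, *p))


# Toy data (made up): the ordering of features 0 and 1 flips between classes.
toy_samples = [[1, 2, 5], [1, 3, 4], [3, 1, 6], [4, 2, 5]]
toy_labels = [0, 0, 1, 1]
print(top_scoring_pair(toy_samples, toy_labels))
```

A new sample would then be classified by checking which ordering of the selected pair it shows; k-TSP repeats this over k disjoint pairs and takes a vote.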
|
33
|
Ventura-Molina E, Alarcón-Paredes A, Aldape-Pérez M, Yáñez-Márquez C, Adolfo Alonso G. Gene selection for enhanced classification on microarray data using a weighted k-NN based algorithm. INTELL DATA ANAL 2019. [DOI: 10.3233/ida-173720] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/15/2022]
Affiliation(s)
- Elías Ventura-Molina
- Centro de Investigación en Computación, Instituto Politécnico Nacional. Av. Juan de Dios Bátiz, Esq. Miguel Othón de Mendizábal. Col. Nueva Industrial Vallejo, Gustavo A. Madero, 07738, Ciudad de México, México
| | - Antonio Alarcón-Paredes
- Facultad de Ingeniería, Universidad Autónoma de Guerrero. Av. Lázaro Cárdenas s/n, Ciudad Universitaria Zona Sur, 39087. Chilpancingo Guerrero, México
| | - Mario Aldape-Pérez
- Centro de Innovación y Desarrollo Tecnológico en Cómputo, Instituto Politécnico Nacional, México. Av. Juan de Dios Bátiz, Col. Nueva Industrial Vallejo, 07700, Ciudad de México, México
| | - Cornelio Yáñez-Márquez
- Centro de Investigación en Computación, Instituto Politécnico Nacional. Av. Juan de Dios Bátiz, Esq. Miguel Othón de Mendizábal. Col. Nueva Industrial Vallejo, Gustavo A. Madero, 07738, Ciudad de México, México
| | - Gustavo Adolfo Alonso
- Facultad de Ingeniería, Universidad Autónoma de Guerrero. Av. Lázaro Cárdenas s/n, Ciudad Universitaria Zona Sur, 39087. Chilpancingo Guerrero, México
| |
|
34
|
|
35
|
A-COA: an adaptive cuckoo optimization algorithm for continuous and combinatorial optimization. Neural Comput Appl 2018. [DOI: 10.1007/s00521-018-3928-9] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 10/27/2022]
|
36
|
K T, N KV, S S, M P. Distributed ICSA Clustering Approach for Large Scale Protein Sequences and Cancer Diagnosis. Asian Pac J Cancer Prev 2018; 19:3105-3109. [PMID: 30486549 PMCID: PMC6318385 DOI: 10.31557/apjcp.2018.19.11.3105] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/25/2022] Open
Abstract
Objective: With the rapid growth of biological sequence databases, handling such volumes of data has increasingly become a problem. Clustering has become one of the principal research objectives in structural and functional genomics. However, exact clustering algorithms, such as partitioned and hierarchical clustering, scale relatively poorly in terms of run time and memory usage with large sets of sequences. Methods: To overcome these performance limits, a heuristic optimization, the Cuckoo Search Algorithm with genetic operators (ICSA), has been implemented in a distributed computing environment. The proposed ICSA is a globally optimized algorithm that can cluster large numbers of protein sequences by running on distributed computing hardware. Results: It allocates both memory and computing resources efficiently. Compared with the latest research results, our method requires only 15% of the execution time and obtains even higher-quality protein sequence information. Conclusion: From the experimental analysis, we noticed that clustering large protein sequence data sets using the ICSA technique, instead of alignment methods alone, greatly reduces the execution time and improves the efficiency of this important task in molecular biology. Moreover, the new era of proteomics is providing us with extensive knowledge of mutations and other alterations in cancer study.
Affiliation(s)
- Thenmozhi K
- Department of Computer Applications, Selvam College of Technology, Namakkal, India.
| | | | | | | |
|
37
|
Sapre S, Mini S. Opposition-based moth flame optimization with Cauchy mutation and evolutionary boundary constraint handling for global optimization. Soft comput 2018. [DOI: 10.1007/s00500-018-3586-y] [Citation(s) in RCA: 50] [Impact Index Per Article: 7.1] [Indexed: 11/24/2022]
|
38
|
Nikdelfaz O, Jalili S. Disease genes prediction by HMM based PU-learning using gene expression profiles. J Biomed Inform 2018; 81:102-111. [PMID: 29571901 DOI: 10.1016/j.jbi.2018.03.006] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 08/15/2017] [Revised: 11/22/2017] [Accepted: 03/12/2018] [Indexed: 12/24/2022]
Abstract
Predicting disease candidate genes from the human genome is a crucial part of present-day biomedical research. According to observations, diseases with the same phenotype have similar biological characteristics, and genes associated with these same diseases tend to share common functional properties. Therefore, by applying machine learning methods, new disease genes are predicted based on previous ones. In recent studies, some semi-supervised learning methods, called Positive-Unlabeled Learning (PU-Learning), are used for predicting disease candidate genes. In this study, a novel method is introduced to predict disease candidate genes through gene expression profiles by learning hidden Markov models. In order to evaluate the proposed method, it is applied on a mixed part of 398 disease genes from three disease types and 12001 unlabeled genes. Compared to the other methods in the literature, the experimental results indicate a significant improvement in favor of the proposed method.
Affiliation(s)
- Ozra Nikdelfaz
- Tarbiat Modares University, Computer Engineering Department, Islamic Republic of Iran.
| | - Saeed Jalili
- Tarbiat Modares University, Computer Engineering Department, Islamic Republic of Iran.
| |
|
39
|
Gao L, Ye M, Lu X, Huang D. Hybrid Method Based on Information Gain and Support Vector Machine for Gene Selection in Cancer Classification. GENOMICS PROTEOMICS & BIOINFORMATICS 2017; 15:389-395. [PMID: 29246519 PMCID: PMC5828665 DOI: 10.1016/j.gpb.2017.08.002] [Citation(s) in RCA: 42] [Impact Index Per Article: 5.3] [Received: 01/12/2017] [Revised: 07/25/2017] [Accepted: 08/08/2017] [Indexed: 12/30/2022]
Abstract
It remains a great challenge to achieve sufficient cancer classification accuracy with the entire set of genes, due to the high dimensions, small sample size, and big noise of gene expression data. We thus proposed a hybrid gene selection method, Information Gain-Support Vector Machine (IG-SVM) in this study. IG was initially employed to filter irrelevant and redundant genes. Then, further removal of redundant genes was performed using SVM to eliminate the noise in the datasets more effectively. Finally, the informative genes selected by IG-SVM served as the input for the LIBSVM classifier. Compared to other related algorithms, IG-SVM showed the highest classification accuracy and superior performance as evaluated using five cancer gene expression datasets based on a few selected genes. As an example, IG-SVM achieved a classification accuracy of 90.32% for colon cancer, which is difficult to be accurately classified, only based on three genes including CSRP1, MYL9, and GUCA2B.
Affiliation(s)
- Lingyun Gao
- School of Medical Information, Wannan Medical College, Wuhu 241002, China
| | - Mingquan Ye
- School of Medical Information, Wannan Medical College, Wuhu 241002, China.
| | - Xiaojie Lu
- School of Medical Information, Wannan Medical College, Wuhu 241002, China
| | - Daobin Huang
- School of Medical Information, Wannan Medical College, Wuhu 241002, China
| |
|