1
Li M, Cao R, Zhao Y, Li Y, Deng S. Population characteristic exploitation-based multi-orientation multi-objective gene selection for microarray data classification. Comput Biol Med 2024; 170:108089. [PMID: 38330824 DOI: 10.1016/j.compbiomed.2024.108089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/20/2023] [Revised: 01/23/2024] [Accepted: 01/27/2024] [Indexed: 02/10/2024]
Abstract
Gene selection is the process of selecting discriminative genes from microarray data to help diagnose and classify cancer samples effectively. Swarm intelligence evolution-based gene selection algorithms cannot fully circumvent the tendency of the population to become trapped in local optima during gene selection. To tackle this challenge, previous research has focused primarily on two aspects: mitigating premature convergence to local optima and escaping from local optima. In contrast to these strategies, this paper adopts reverse thinking, treating the issue of local optima as an opportunity rather than an obstacle. Building on this foundation, we propose MOMOGS-PCE, a novel gene selection approach that exploits the advantageous characteristics of populations trapped in local optima to uncover globally optimal solutions. Specifically, MOMOGS-PCE employs a novel population initialization strategy that initializes multiple populations exploring diverse orientations to foster distinct population characteristics. Next, an enhanced NSGA-II algorithm is used to amplify the advantageous characteristics exhibited by the population. Finally, a novel exchange strategy transfers characteristics between populations that have reached near maturity in evolution, thereby promoting further population evolution and enhancing the search for better gene subsets. The experimental results demonstrate that MOMOGS-PCE has significant advantages in comprehensive indicators compared with six competitive multi-objective gene selection algorithms. They confirm that the "reverse-thinking" approach not only avoids local optima but also leverages them to uncover superior gene subsets for cancer diagnosis.
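The abstract above builds on NSGA-II, whose core step is non-dominated sorting. The following is a minimal sketch of that step only, assuming two objectives to minimize (classification error and number of selected genes); the objective pair, the function names, and the sample values are illustrative assumptions, not details taken from the paper, and the simple front-peeling loop stands in for NSGA-II's faster bookkeeping.

```python
from typing import List, Sequence, Tuple

# Assumed objective pair, both minimized: (classification error, number of selected genes).
Objectives = Tuple[float, float]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b on every objective and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_sort(points: Sequence[Objectives]) -> List[List[int]]:
    """Peel off successive Pareto fronts; front 0 holds the non-dominated candidates."""
    remaining = set(range(len(points)))
    fronts: List[List[int]] = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


# Hypothetical gene subsets scored as (error, subset size).
candidates = [(0.10, 12), (0.08, 20), (0.10, 15), (0.05, 40), (0.09, 25)]
fronts = non_dominated_sort(candidates)
print(fronts)  # [[0, 1, 3], [2, 4]]
```

Candidates 2 and 4 fall into the second front because candidate 0 and candidate 1, respectively, are at least as good on both objectives.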
Affiliation(s)
- Min Li
  - School of Information Engineering, Nanchang Institute of Technology, No. 289 Tianxiang Road, Nanchang, Jiangxi, PR China
- Rutun Cao
  - School of Information Engineering, Nanchang Institute of Technology, No. 289 Tianxiang Road, Nanchang, Jiangxi, PR China
- Yangfan Zhao
  - School of Information Engineering, Nanchang Institute of Technology, No. 289 Tianxiang Road, Nanchang, Jiangxi, PR China
- Yulong Li
  - School of Information Engineering, Nanchang Institute of Technology, No. 289 Tianxiang Road, Nanchang, Jiangxi, PR China
- Shaobo Deng
  - School of Information Engineering, Nanchang Institute of Technology, No. 289 Tianxiang Road, Nanchang, Jiangxi, PR China
2
Nekouie N, Romoozi M, Esmaeili M. A New Evolutionary Ensemble Learning of Multimodal Feature Selection from Microarray Data. Neural Process Lett 2023. [DOI: 10.1007/s11063-023-11159-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/15/2023]
3
Elitist random swapped particle swarm optimization embedded with variable k-nearest neighbour classification: a new PSO variant applied to gene identification. Soft Comput 2022. [DOI: 10.1007/s00500-022-07515-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2022]
4
Vahmiyan M, Kheirabadi M, Akbari E. Feature selection methods in microarray gene expression data: a systematic mapping study. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07661-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/07/2022]
5
Quantitative Detection of Gastrointestinal Tumor Markers Using a Machine Learning Algorithm and Multicolor Quantum Dot Biosensor. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:9022821. [PMID: 36093502 PMCID: PMC9458379 DOI: 10.1155/2022/9022821] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/04/2022] [Revised: 07/27/2022] [Accepted: 08/02/2022] [Indexed: 11/17/2022]
Abstract
This work explored the application value of gastrointestinal tumor markers, based on a gene feature selection model built on the principal component analysis (PCA) algorithm and a multicolor quantum dot (QD) immunobiosensor, in the detection of gastrointestinal tumors. The PCA method was improved by introducing the neighborhood rough set algorithm, and the resulting tumor gene feature selection model (OPCA) was analyzed for its classification precision and accuracy. Four kinds of coupled biosensors were fabricated based on QDs, namely, 525 nm CdSe/ZnS QDs-carbohydrate antigen 125 (QDs525-CA125 McAb), 605 nm CdSe/ZnS QDs-cancer antigen 19-9 (QDs605-CA19-9 McAb), 645 nm CdSe/ZnS QDs-anti-carcinoembryonic antigen (QDs645-CEA McAb), and 565 nm CdSe/ZnS QDs-anti-alpha-fetoprotein (QDs565-AFP McAb). The quantum dot-antibody conjugates were identified and quantified by fluorescence spectroscopy and ultraviolet absorption spectroscopy. The results showed that the classification precision of the OPCA model on the colon tumor and gastric cancer datasets was 99.52% and 99.03%, respectively, and the classification accuracy was 94.86% and 94.2%, respectively, which were significantly higher than those of other algorithms. The fluorescence values of AFP McAb, CEA McAb, CA19-9 McAb, and CA125 McAb reached the maximum when the conjugation concentrations were 25 µg/mL, 20 µg/mL, 30 µg/mL, and 30 µg/mL, respectively. The highest recovery rate of AFP was 98.51%, and its fluorescence intensity was 35.78 ± 2.99, which was significantly higher than that of other antigens (P < 0.001). In summary, the OPCA model based on the PCA algorithm can obtain smaller feature gene sets and improve the accuracy of sample classification. Intelligent immunobiosensors based on machine learning algorithms and QDs have potential application value in gastrointestinal gene feature selection and tumor marker detection, which provides a new idea for the clinical diagnosis of gastrointestinal tumors.
6
Efficient Diagnosis of Autism with Optimized Machine Learning Models: An Experimental Analysis on Genetic and Personal Characteristic Datasets. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12083812] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/23/2022]
Abstract
Early diagnosis of autism is extremely beneficial for patients. Traditional diagnosis approaches have been unable to diagnose autism in a fast and accurate way; rather, there are multiple factors that can be related to identifying the autism disorder. The gene expression (GE) of individuals may be one of these factors, in addition to personal and behavioral characteristics (PBC). Machine learning (ML) based on PBC and GE data analytics emphasizes the need to develop accurate prediction models. The quality of prediction relies on the accuracy of the ML model. To improve the accuracy of prediction, optimized feature selection algorithms are applied to solve the high dimensionality problem of the datasets used. Comparing different optimized feature selection methods using bio-inspired algorithms over different types of data allows the most accurate model to be identified. Therefore, in this paper, we investigated enhancing the classification process of autism spectrum disorder using 16 proposed optimized ML models (GWO-NB, GWO-SVM, GWO-KNN, GWO-DT, FPA-NB, FPA-KNN, FPA-SVM, FPA-DT, BA-NB, BA-SVM, BA-KNN, BA-DT, ABC-NB, ABC-SVM, ABC-KNN, and ABC-DT). Four bio-inspired algorithms, namely, Gray Wolf Optimization (GWO), Flower Pollination Algorithm (FPA), Bat Algorithm (BA), and Artificial Bee Colony (ABC), were employed to optimize the wrapper feature selection method in order to select the most informative features and to increase the accuracy of the classification models. Five evaluation metrics were used to evaluate the performance of the proposed models: accuracy, F1 score, precision, recall, and area under the curve (AUC). The obtained results demonstrated that the proposed models achieved good performance, with accuracies of 99.66% and 99.34% obtained by the GWO-SVM model on the PBC and GE datasets, respectively.
7
Adaptive feature selection framework for DNA methylation-based age prediction. Soft Comput 2022. [DOI: 10.1007/s00500-022-06844-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
8
Yan C, Li M, Ma J, Liao Y, Luo H, Wang J, Luo J. A Novel Feature Selection Method Based on MRMR and Enhanced Flower Pollination Algorithm for High Dimensional Biomedical Data. Curr Bioinform 2022. [DOI: 10.2174/1574893616666210624130124] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/22/2022]
Abstract
Background:
The massive amount of biomedical data accumulated in the past decades can be utilized for diagnosing disease.
Objective:
However, the high dimensionality, small sample sizes, and irrelevant features of the data often have a negative influence on the accuracy and speed of disease prediction. Some existing machine learning models cannot capture the patterns in these datasets accurately without utilizing feature selection.
Methods:
Filter and wrapper are two prevailing feature selection methods. The filter method is fast but has low prediction accuracy, while the wrapper method can obtain high accuracy but has a formidable computation cost. Given the drawbacks of using filter or wrapper individually, a novel feature selection method, called MRMR-EFPATS, is proposed, which hybridizes the filter method Minimum Redundancy Maximum Relevance (MRMR) and a wrapper method based on an improved Flower Pollination Algorithm (FPA). First, MRMR is employed to rank and screen out some important features quickly. These features are further chosen for individual populations following the wrapper method for faster convergence and less computational time. Then, due to its efficiency and flexibility, FPA is adopted to further discover an optimal feature subset.
Result:
FPA still has some drawbacks, such as a slow convergence rate, inadequacy in searching for new solutions, and a tendency to become trapped in local optima. In our work, an elite strategy is adopted to improve the convergence speed of the FPA. Tabu search and Adaptive Gaussian Mutation are employed to improve the search capability of FPA and to escape from local optima. Here, the KNN classifier with 5-fold cross-validation is utilized to evaluate the classification accuracy.
Conclusion:
Extensive experimental results on six public high dimensional biomedical datasets show that the proposed MRMR-EFPATS has achieved superior performance compared to other state-of-the-art methods.
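The MRMR filter stage described above can be illustrated with a greedy selection loop. This is a minimal sketch under stated assumptions: it scores relevance and redundancy with absolute Pearson correlation (the original MRMR formulation uses mutual information), the function names `pearson` and `mrmr_select` are hypothetical, and the toy data is invented. It does not reproduce the MRMR-EFPATS pipeline.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mrmr_select(features: Sequence[Sequence[float]], labels: Sequence[float], k: int) -> List[int]:
    """Greedily pick k feature indices maximizing relevance minus mean redundancy."""
    relevance = [abs(pearson(f, labels)) for f in features]
    selected: List[int] = []
    candidates = list(range(len(features)))

    def score(i: int) -> float:
        if not selected:
            return relevance[i]
        redundancy = sum(abs(pearson(features[i], features[j])) for j in selected) / len(selected)
        return relevance[i] - redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: each inner list is one feature measured over eight samples.
labels = [0, 0, 0, 1, 0, 1, 1, 1]
features = [
    [1, 2, 3, 4, 5, 6, 7, 8],    # relevant
    [1, 2, 4, 3, 5, 6, 7, 8],    # near-duplicate of feature 0 (redundant)
    [0, 0, 0, 1, -1, 0, 0, 0],   # less relevant, but nearly uncorrelated with feature 0
]
chosen = mrmr_select(features, labels, k=2)
print(chosen)  # [0, 2]
```

Feature 1 is more relevant than feature 2 on its own, yet it is skipped in the second round because it is almost perfectly correlated with the already selected feature 0.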
Affiliation(s)
- Chaokun Yan
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Mengyuan Li
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Yi Liao
  - Academy of Arts & Design, Tsinghua University, Beijing, China
- Huimin Luo
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Jianlin Wang
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Junwei Luo
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
9
Multi-objective feature selection based on quasi-oppositional based Jaya algorithm for microarray data. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2021.107804] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 11/23/2022]
10
Elastic Correlation Adjusted Regression (ECAR) scores for high dimensional variable importance measuring. Sci Rep 2021; 11:23354. [PMID: 34857823 PMCID: PMC8640025 DOI: 10.1038/s41598-021-02706-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2021] [Accepted: 11/22/2021] [Indexed: 11/08/2022] Open
Abstract
Investigation of the genetic basis of traits or clinical outcomes heavily relies on identifying relevant variables in molecular data. However, characteristics such as high dimensionality and complex correlation structures of these data hinder the development of related methods, resulting in the inclusion of false positives and negatives. We developed a variable importance measure method, termed the ECAR scores, that evaluates the importance of variables in the dataset. Based on this score, ranking and selection of variables can be achieved simultaneously. Unlike most current approaches, the ECAR scores aim to rank the influential variables as high as possible while maintaining the grouping property, instead of selecting the ones that are merely predictive. The ECAR scores' performance is tested and compared to other methods on simulated, semi-synthetic, and real datasets. Results showed that the ECAR scores improve the CAR scores in terms of accuracy of variable selection and high-rank variables' predictive power. It also outperforms other classic methods such as lasso and stability selection when there is a high degree of correlation among influential variables. As an application, we used the ECAR scores to analyze genes associated with forced expiratory volume in the first second in patients with lung cancer and reported six associated genes.
11
Abstract
The reconstruction of gene regulatory networks (GRN) and the creation of effective disease diagnostic systems based on gene expression data are among the current directions of modern bioinformatics. In this manuscript, we present the results of research focused on evaluating the effectiveness of the most widely used metrics for estimating the proximity of gene expression profiles, which can be used to extract groups of informative gene expression profiles while taking into account the states of the investigated samples. Symmetry is very important in the field of gene and/or protein interaction, since it undergirds essentially all interactions between molecular components in the GRN. The extraction of gene expression profiles that allow us to identify the investigated biological objects (disease, state of patients, etc.) contributes to the further reconstruction of the GRN in terms of both the symmetry and the understanding of the mechanism of molecular element interaction in a biological organism. Within the framework of our research, we have investigated the following metrics: mutual information maximization (MIM) using various methods of Shannon entropy calculation, Pearson's χ2 test, and correlation distance. The classification accuracy of the investigated samples was used as the main quality criterion to evaluate the effectiveness of each metric. The random forest (RF) classifier was used during the simulation process. The results have shown that the various methods of Shannon entropy calculation within the framework of the MIM metric disagree with each other. As a result, we have proposed the modified mutual information maximization (MMIM) proximity metric based on the joint use of various methods of Shannon entropy calculation and the Harrington desirability function. The results of the simulation have also shown that the correlation proximity metric is less effective than both the MMIM metric and Pearson's χ2 test. Finally, we propose the hybrid proximity metric (HPM) that considers both the MMIM metric and Pearson's χ2 test. The proposed metric was investigated within the framework of one-cluster structure effectiveness evaluation. In our view, the main benefit of the proposed HPM is that it increases the objectivity of extracting mutually similar gene expression profiles through the joint use of several effective proximity metrics that can contradict each other when used alone.
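The mutual information metric discussed in this abstract can be computed from Shannon entropies as I(X;Y) = H(X) + H(Y) - H(X,Y). The sketch below assumes already discretized expression levels and uses the plug-in (frequency-based) entropy estimator, which is only one of the entropy calculation methods the abstract alludes to; the function names and the toy vectors are illustrative assumptions.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def shannon_entropy(values: Sequence[Hashable]) -> float:
    """Plug-in Shannon entropy in bits, estimated from observed frequencies."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two paired discrete sequences."""
    joint = list(zip(x, y))
    return shannon_entropy(x) + shannon_entropy(y) - shannon_entropy(joint)


# Hypothetical discretized expression levels of two genes and the sample classes.
classes = [0, 0, 1, 1]
informative_gene = ["low", "low", "high", "high"]
uninformative_gene = ["low", "high", "low", "high"]

mi_informative = mutual_information(informative_gene, classes)
mi_uninformative = mutual_information(uninformative_gene, classes)
print(mi_informative, mi_uninformative)
```

A gene whose discretized profile tracks the class label carries one full bit of information here, while a profile that is independent of the label carries none.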
12
A novel bio-inspired hybrid multi-filter wrapper gene selection method with ensemble classifier for microarray data. Neural Comput Appl 2021; 35:11531-11561. [PMID: 34539088 PMCID: PMC8435304 DOI: 10.1007/s00521-021-06459-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/10/2020] [Accepted: 08/26/2021] [Indexed: 01/04/2023]
Abstract
Microarray technology is known as one of the most important tools for collecting DNA expression data. This technology allows researchers to investigate and examine types of diseases and their origins. However, microarray data are often associated with a small sample size, a large number of genes, class imbalance, and similar issues, which make classification models inefficient. Thus, a new hybrid solution based on a multi-filter and an adaptive chaotic multi-objective forest optimization algorithm (AC-MOFOA) is presented to solve the gene selection problem and construct an ensemble classifier. In the proposed solution, a multi-filter model (i.e., an ensemble filter) is used as a preprocessing step to reduce the dataset's dimensions, combining five filter methods to remove redundant and irrelevant genes. The results of the five filter methods are combined using a voting-based function. The results of the proposed multi-filter indicate that it is capable of reducing the gene subset size and selecting relevant genes. Then, an AC-MOFOA based on the concepts of non-dominated sorting, crowding distance, chaos theory, and adaptive operators is presented. AC-MOFOA, as a wrapper method, aims to reduce dataset dimensions, optimize KELM, and increase classification accuracy simultaneously. Next, an ensemble classifier model is built from the AC-MOFOA results to classify microarray data. The performance of the proposed algorithm was evaluated on nine public microarray datasets, and its results were compared in terms of the number of selected genes, classification efficiency, execution time, time complexity, hypervolume indicator, and spacing metric with five hybrid multi-objective methods and three hybrid single-objective methods. According to the results, the proposed hybrid method could increase the accuracy of the KELM on most datasets by reducing the dataset's dimensions and achieve similar or superior performance compared to other multi-objective methods. Furthermore, the proposed ensemble classifier model could provide better classification accuracy and generalizability on seven of the nine microarray datasets compared to conventional ensemble methods. Moreover, the comparison of the ensemble classifier model with three state-of-the-art ensemble generation methods indicates its competitive performance, with the proposed ensemble model achieving better results on five of the nine datasets.
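The abstract states that the outputs of several filter methods are combined with a voting-based function but does not specify that function. One plausible reading, sketched here purely as an assumption, is a simple vote count: a gene is kept when it appears in the top `top_m` positions of at least `min_votes` filter rankings. The function name, parameters, and gene identifiers are hypothetical, and three rankings are used instead of five for brevity.

```python
from collections import Counter
from typing import List, Sequence


def vote_filter(rankings: Sequence[Sequence[str]], top_m: int, min_votes: int) -> List[str]:
    """Keep genes that occur in the top_m of at least min_votes filter rankings."""
    votes: Counter = Counter()
    for ranking in rankings:
        # Each filter casts one vote for every gene in its own top_m.
        votes.update(ranking[:top_m])
    return sorted(gene for gene, count in votes.items() if count >= min_votes)


# Hypothetical rankings (best gene first) produced by three different filter methods.
rankings = [
    ["g1", "g2", "g3", "g4"],
    ["g2", "g1", "g5", "g3"],
    ["g2", "g3", "g6", "g1"],
]
kept = vote_filter(rankings, top_m=3, min_votes=2)
print(kept)  # ['g1', 'g2', 'g3']
```

Raising `min_votes` makes the ensemble filter stricter; with `min_votes=3` only `g2` survives in this example.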
13
Gumaei A, Sammouda R, Al-Rakhami M, AlSalman H, El-Zaart A. Feature selection with ensemble learning for prostate cancer diagnosis from microarray gene expression. Health Informatics J 2021; 27:1460458221989402. [PMID: 33570011 DOI: 10.1177/1460458221989402] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 12/15/2022]
Abstract
Cancer diagnosis using machine learning algorithms is one of the main topics of research in computer-based medical science. Prostate cancer is considered one of the leading causes of death worldwide. Data analysis of gene expression from microarrays using machine learning and soft computing algorithms is a useful tool for detecting prostate cancer in medical diagnosis. Even though traditional machine learning methods have been successfully applied to detecting prostate cancer, the large number of attributes combined with the small sample size of microarray data is still a challenge that limits their ability to support effective medical diagnosis. Selecting a subset of relevant features from all features and choosing an appropriate machine learning method can exploit the information in microarray data to improve the detection accuracy. In this paper, we propose to use a correlation feature selection (CFS) method with random committee (RC) ensemble learning to detect prostate cancer from microarray gene expression data. A set of experiments is conducted on a public benchmark dataset using the 10-fold cross-validation technique to evaluate the proposed approach. The experimental results revealed that the proposed approach attains 95.098% accuracy, which is higher than that of related methods on the same dataset.
Affiliation(s)
- Abdu Gumaei
  - Research Chair of Pervasive and Mobile Computing, King Saud University, Saudi Arabia; Taiz University, Yemen
- Mabrook Al-Rakhami
  - Research Chair of Pervasive and Mobile Computing, King Saud University, Saudi Arabia
14
Kaneko H. Examining variable selection methods for the predictive performance of regression models and the proportion of selected variables and selected random variables. Heliyon 2021; 7:e07356. [PMID: 34195450 PMCID: PMC8237311 DOI: 10.1016/j.heliyon.2021.e07356] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/22/2021] [Revised: 05/02/2021] [Accepted: 06/16/2021] [Indexed: 11/24/2022] Open
Abstract
The selection of descriptors, X, is crucial for improving the interpretation and prediction accuracy of a regression model. In this study, variable (feature) selection methods were evaluated in two ways: by the prediction accuracy of models constructed using the selected X, and by the outcome of the selection itself, measured by the number of selected X and the number of selected variables that are unrelated to the objective variable y, such as an activity or property. The variable selection methods include least absolute shrinkage and selection operator, genetic algorithm-based partial least squares, genetic algorithm-based support vector regression, and Boruta. Several regression analysis methods were used to test the prediction accuracy of the model constructed using the selected X. The characteristics of each variable selection method were analyzed using eight datasets. The results showed that even when variables unrelated to y were selected by variable selection and the number of unrelated variables was the same as the number of the original variables, a regression model with good accuracy, which ignores the influence of such noise variables, can be constructed by applying various regression analysis methods. Additionally, the variables related to y must not be deleted. These findings provide a basis for improving variable selection methods.
Affiliation(s)
- Hiromasa Kaneko
  - Department of Applied Chemistry, School of Science and Technology, Meiji University, 1-1-1 Higashi-Mita, Tama-ku, Kawasaki, Kanagawa 214-8571, Japan
15
Dashtban M, Li W. Predicting non-attendance in hospital outpatient appointments using deep learning approach. Health Syst (Basingstoke) 2021; 11:189-210. [PMID: 36147556 PMCID: PMC9487947 DOI: 10.1080/20476965.2021.1924085] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/29/2022] Open
Abstract
Hospital outpatient non-attendance imposes a substantial financial burden on hospitals and is rooted in multiple diverse causes. This research aims to build an advanced model for predicting non-attendance that covers the whole spectrum of probable contributing factors that can be collated from heterogeneous sources, including electronic patient records and external non-hospital data. We propose a new non-attendance prediction model based on deep neural networks and machine learning models. The proposed approach uses sparse stacked denoising autoencoders (SDAEs) to learn the underlying manifold of the data, thereby compacting information and providing a better representation that can be utilised afterwards by other learning models as well. The proposed approach is evaluated on real hospital data and compared with several well-known and scalable machine learning models. The evaluation results show that the proposed approach with a softmax layer and logistic regression outperforms the other methods in practice.
Affiliation(s)
- M. Dashtban
  - Informatics Research Centre, Henley Business School, University of Reading, Reading, UK
- Weizi Li
  - Informatics Research Centre, Henley Business School, University of Reading, Reading, UK
16
Pashaei E, Pashaei E. Gene selection using hybrid dragonfly black hole algorithm: A case study on RNA-seq COVID-19 data. Anal Biochem 2021; 627:114242. [PMID: 33974890 DOI: 10.1016/j.ab.2021.114242] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/04/2020] [Revised: 04/12/2021] [Accepted: 05/02/2021] [Indexed: 11/18/2022]
Abstract
This paper introduces a new hybrid approach (DBH) for solving the gene selection problem that incorporates the strengths of two existing metaheuristics: the binary dragonfly algorithm (BDF) and the binary black hole algorithm (BBHA). This hybridization aims to identify a limited and stable set of discriminative genes without sacrificing classification accuracy, whereas most current methods have encountered challenges in extracting disease-related information from a vast number of redundant genes. The proposed approach first applies the minimum redundancy maximum relevancy (MRMR) filter method to reduce the dimensionality of the feature space and then utilizes the suggested hybrid DBH algorithm to determine a smaller set of significant genes. The proposed approach was evaluated on eight benchmark gene expression datasets and then compared against the latest state-of-the-art techniques to demonstrate its efficiency. The comparative study shows that the proposed approach achieves a significant improvement over existing methods in terms of classification accuracy and the number of selected genes. Moreover, the performance of the suggested method was examined on real RNA-Seq coronavirus-related gene expression data of asthmatic patients for selecting the most significant genes in order to improve the discriminative accuracy of angiotensin-converting enzyme 2 (ACE2). ACE2, as a coronavirus receptor, is a biomarker that helps to distinguish infected patients from uninfected ones in order to identify subgroups at risk for COVID-19. The results indicate that the suggested MRMR-DBH approach is a very promising framework for finding a new combination of the most discriminative genes with high classification accuracy.
Affiliation(s)
- Elnaz Pashaei
  - Department of Software Engineering, Istanbul Aydin University, Istanbul, Turkey
- Elham Pashaei
  - Department of Computer Engineering, Istanbul Gelisim University, Istanbul, Turkey
17
Mirsadeghi L, Haji Hosseini R, Banaei-Moghaddam AM, Kavousi K. EARN: an ensemble machine learning algorithm to predict driver genes in metastatic breast cancer. BMC Med Genomics 2021; 14:122. [PMID: 33962648 PMCID: PMC8105935 DOI: 10.1186/s12920-021-00974-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 12/17/2020] [Accepted: 04/27/2021] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND Today, there are many markers for the prognosis and diagnosis of complex diseases such as primary breast cancer. However, our understanding of the drivers that influence cancer aggression is limited. METHODS In this work, we study somatic mutation data consisting of 450 metastatic breast tumor samples from the cBio Cancer Genomics Portal. We use four software tools to extract features from these data. Then, an ensemble classifier (EC) learning algorithm called EARN (Ensemble of Artificial Neural Network, Random Forest, and non-linear Support Vector Machine) is proposed to evaluate plausible driver genes for metastatic breast cancer (MBCA). The decision-making strategy of the proposed ensemble aggregates the predicted scores obtained from the individual classifiers to prioritize Homo sapiens genes annotated as protein-coding in NCBI. RESULTS This study focuses on findings in several aspects of MBCA prognosis and diagnosis. First, drivers and passengers predicted by SVM, ANN, RF, and EARN are introduced. Second, biological inferences of the predictions are discussed based on gene set enrichment analysis. Third, statistical validation and comparison of all learning methods are performed using several evaluation metrics. Finally, the pathway enrichment analysis (PEA) using the ReactomeFIViz tool (FDR < 0.03) for the top 100 genes predicted by EARN leads us to propose a new gene set panel for MBCA. It includes HDAC3, ABAT, GRIN1, PLCB1, and KPNA2 as well as NCOR1, TBL1XR1, SIRT4, KRAS, CACNA1E, PRKCG, GPS2, SIN3A, ACTB, KDM6B, and PRMT1. Furthermore, we compare the results for MBCA with those for 983 primary tumor samples of breast invasive carcinoma (BRCA) obtained from The Cancer Genome Atlas (TCGA). The comparison shows that ROC-AUC reaches 99.24% using EARN for MBCA and 99.79% for BRCA. This result is better than that of the three individual classifiers in each case. CONCLUSIONS This research, using an integrative approach, assists precision oncologists in designing compact targeted panels that eliminate the need for whole-genome/exome sequencing. The schematic representation of the proposed model is presented as the graphical abstract.
Affiliation(s)
- Leila Mirsadeghi
  - Department of Biology, Faculty of Science, Payame Noor University, Tehran, Iran
- Reza Haji Hosseini
  - Department of Biology, Faculty of Science, Payame Noor University, Tehran, Iran
- Ali Mohammad Banaei-Moghaddam
  - Laboratory of Genomics and Epigenomics (LGE), Department of Biochemistry, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran
- Kaveh Kavousi
  - Laboratory of Complex Biological Systems and Bioinformatics (CBB), Department of Bioinformatics, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran
Collapse
|
18
|
Zhang G, Xue Z, Yan C, Wang J, Luo H. A Novel Biomarker Identification Approach for Gastric Cancer Using Gene Expression and DNA Methylation Dataset. Front Genet 2021; 12:644378. [PMID: 33868380 PMCID: PMC8044773 DOI: 10.3389/fgene.2021.644378] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 12/21/2020] [Accepted: 02/16/2021] [Indexed: 01/09/2023] Open
Abstract
As a complex disease, gastric cancer has a high mortality rate, and there are few effective treatments for patients at an advanced stage. With the development of biological technology, a large amount of multi-omics data on gastric cancer has been generated, which enables computational methods to discover potential biomarkers of gastric cancer. This will be very important for detecting gastric cancer at earlier stages and thus for providing timely treatment. However, most biological data are high-dimensional with a small sample size and are hard to process directly without feature selection. Besides, using only some omics data, such as gene expression data, provides limited evidence for investigating gastric cancer-associated biomarkers. In this research, gene expression data and DNA methylation data are integrated to analyze gastric cancer, and a feature selection approach is proposed to identify possible biomarkers of gastric cancer. After the original data are pre-processed, mutual information (MI) is applied to select the top-ranked genes. Then, fold change (FC) and the t-test are adopted to identify differentially expressed genes (DEGs). In particular, the false discovery rate (FDR) is introduced to adjust the p-values and further screen genes. For the chosen genes, a deep neural network (DNN) model is used as the classifier to measure the quality of classification. The experimental results show that the approach can achieve superior performance in terms of accuracy and other metrics. Biological analysis of the chosen genes further validates the effectiveness of the approach.
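The FDR step mentioned in this abstract can be sketched as follows. The abstract does not name the correction procedure, so the widely used Benjamini-Hochberg adjustment is assumed; the p-values and the 0.05 cutoff are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical t-test p-values for five genes.
p = [0.001, 0.008, 0.039, 0.041, 0.6]
q = benjamini_hochberg(p)
# Keep genes whose adjusted p-value falls below the assumed 0.05 cutoff.
selected = [i for i, qi in enumerate(q) if qi < 0.05]
```

Note that genes 2 and 3 have raw p-values below 0.05 but are dropped after adjustment, which is the screening effect the abstract refers to.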
Affiliation(s)
- Ge Zhang
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Zijing Xue
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Chaokun Yan
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Jianlin Wang
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Huimin Luo
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
|
19
|
Hameed SS, Hassan WH, Latiff LA, Muhammadsharif FF. A comparative study of nature-inspired metaheuristic algorithms using a three-phase hybrid approach for gene selection and classification in high-dimensional cancer datasets. Soft Comput 2021. [DOI: 10.1007/s00500-021-05726-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/25/2022]
|
20
|
Li H, Ding L, Hong X, Chen Y, Liao R, Wang T, Meng S, Jiang Z, Liu D. Integrative genomic expression analysis reveals stable differences between lung cancer and systemic sclerosis. BMC Cancer 2021; 21:259. [PMID: 33691643 PMCID: PMC7944918 DOI: 10.1186/s12885-021-07959-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2021] [Accepted: 02/23/2021] [Indexed: 12/09/2022] Open
Abstract
BACKGROUND The incidence and mortality of lung cancer are the highest among all cancers. Patients with systemic sclerosis show a four-fold greater risk of lung cancer than the general population. However, the underlying mechanism remains poorly understood. METHODS The expression profiles of 355 peripheral blood samples were analyzed in an integrated manner, including 70 cases of lung cancer, 61 cases of systemic sclerosis, and 224 healthy controls. After data normalization and cleaning, differentially expressed genes (DEGs) between disease and control were obtained and analyzed in depth by bioinformatics methods. Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed online with DAVID and KOBAS. The protein-protein interaction (PPI) networks were constructed from the STRING database. RESULTS From a total of 14,191 human genes, 299 and 1644 genes were identified as DEGs in systemic sclerosis and lung cancer, respectively. Among them, 64 DEGs were overlapping, including 36 co-upregulated, 10 co-downregulated, and 18 counter-regulated DEGs. Functional and enrichment analysis showed that the two diseases had common changes in immune-related genes. The expression of innate immune response and response to virus-related genes increased significantly, while the expression of negative regulation of cell cycle-related genes decreased notably. In contrast, the expression of mitophagy regulation, chromatin binding, and fatty acid metabolism-related genes showed distinct trends. CONCLUSIONS Stable differences and similarities between systemic sclerosis and lung cancer were revealed. In peripheral blood, enhanced innate immunity and weakened negative regulation of the cell cycle may be the common mechanisms of the two diseases, which may be associated with the high risk of lung cancer in systemic sclerosis patients. On the other hand, the counter-regulated DEGs can be used as novel biomarkers of pulmonary diseases. In addition, fat metabolism-related DEGs were considered to be associated with clinical blood lipid data.
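The grouping of overlapping DEGs into co-upregulated, co-downregulated, and counter-regulated sets can be sketched as below. This is only an illustration of the bookkeeping; the gene labels and directions are hypothetical, and the function name is not taken from the paper.

```python
def classify_shared_degs(disease_a, disease_b):
    """Group genes that are DEGs in both diseases by direction of change.

    Each argument maps a gene name to "up" or "down" relative to controls.
    """
    groups = {"co_up": [], "co_down": [], "counter": []}
    for gene in sorted(disease_a.keys() & disease_b.keys()):
        a, b = disease_a[gene], disease_b[gene]
        if a != b:
            groups["counter"].append(gene)
        elif a == "up":
            groups["co_up"].append(gene)
        else:
            groups["co_down"].append(gene)
    return groups

# Hypothetical DEG calls for the two diseases.
ssc = {"G1": "up", "G2": "down", "G3": "up", "G4": "up"}
lung = {"G1": "up", "G2": "down", "G3": "down", "G5": "up"}
groups = classify_shared_degs(ssc, lung)

# The three group sizes reported in the abstract add up to the 64 shared DEGs.
reported_total = 36 + 10 + 18
```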
Affiliation(s)
- Heng Li
  - Department of Rheumatology and Immunology, Shenzhen People's Hospital, The Second Clinical Medical College of Jinan University, Shenzhen, 518020, China
  - Integrated Chinese and Western Medicine Postdoctoral Research Station, Jinan University, Guangzhou, 510632, China
- Liping Ding
  - Department of Rheumatology and Immunology, Shenzhen People's Hospital, The Second Clinical Medical College of Jinan University, Shenzhen, 518020, China
- Xiaoping Hong
  - Department of Rheumatology and Immunology, Shenzhen People's Hospital, The Second Clinical Medical College of Jinan University, Shenzhen, 518020, China
- Yulan Chen
  - Department of Rheumatology and Immunology, Shenzhen People's Hospital, The Second Clinical Medical College of Jinan University, Shenzhen, 518020, China
- Rui Liao
  - Department of Rheumatology and Immunology, Shenzhen People's Hospital, The Second Clinical Medical College of Jinan University, Shenzhen, 518020, China
- Tingting Wang
  - Department of Rheumatology and Immunology, Shenzhen People's Hospital, The Second Clinical Medical College of Jinan University, Shenzhen, 518020, China
  - Integrated Chinese and Western Medicine Postdoctoral Research Station, Jinan University, Guangzhou, 510632, China
- Shuhui Meng
  - Department of Rheumatology and Immunology, Shenzhen People's Hospital, The Second Clinical Medical College of Jinan University, Shenzhen, 518020, China
- Zhenyou Jiang
  - Department of Microbiology and Immunology, College of Basic Medicine and Public Hygiene, Jinan University, Guangzhou, 510632, China
- Dongzhou Liu
  - Department of Rheumatology and Immunology, Shenzhen People's Hospital, The Second Clinical Medical College of Jinan University, Shenzhen, 518020, China
  - The First Affiliated Hospital (Shenzhen People's Hospital) Southern University of Science and Technology, Shenzhen, 518055, China
|
21
|
Lai CM, Huang HP. A gene selection algorithm using simplified swarm optimization with multi-filter ensemble technique. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2020.106994] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 01/17/2023]
|
22
|
Alharthi AM, Lee MH, Algamal ZY. Gene selection and classification of microarray gene expression data based on a new adaptive L1-norm elastic net penalty. Informatics in Medicine Unlocked 2021. [DOI: 10.1016/j.imu.2021.100622] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 11/29/2022] Open
|
23
|
Hamraz M, Gul N, Raza M, Khan DM, Khalil U, Zubair S, Khan Z. Robust proportional overlapping analysis for feature selection in binary classification within functional genomic experiments. PeerJ Comput Sci 2021; 7:e562. [PMID: 34141889 PMCID: PMC8176540 DOI: 10.7717/peerj-cs.562] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/10/2020] [Accepted: 05/04/2021] [Indexed: 05/10/2023]
Abstract
In this paper, a novel feature selection method called Robust Proportional Overlapping Score (RPOS), for microarray gene expression datasets has been proposed, by utilizing the robust measure of dispersion, i.e., Median Absolute Deviation (MAD). This method robustly identifies the most discriminative genes by considering the overlapping scores of the gene expression values for binary class problems. Genes with a high degree of overlap between classes are discarded and the ones that discriminate between the classes are selected. The results of the proposed method are compared with five state-of-the-art gene selection methods based on classification error, Brier score, and sensitivity, by considering eleven gene expression datasets. Classification of observations for different sets of selected genes by the proposed method is carried out by three different classifiers, i.e., random forest, k-nearest neighbors (k-NN), and support vector machine (SVM). Box-plots and stability scores of the results are also shown in this paper. The results reveal that in most of the cases the proposed method outperforms the other methods.
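To make the idea of a MAD-based overlap measure concrete, here is a simplified sketch. It is not the RPOS formula from the paper, which the abstract does not give; the interval width `k`, the overlap definition, and the expression values are all assumptions chosen for illustration.

```python
from statistics import median

def mad(values):
    """Median absolute deviation about the median (unscaled)."""
    med = median(values)
    return median(abs(v - med) for v in values)

def robust_interval(values, k=2.0):
    """Interval median +/- k * MAD; k is an assumed tuning constant."""
    med, spread = median(values), mad(values)
    return med - k * spread, med + k * spread

def overlap_fraction(class_a, class_b, k=2.0):
    """Shared length of the two robust intervals divided by their total span."""
    lo_a, hi_a = robust_interval(class_a, k)
    lo_b, hi_b = robust_interval(class_b, k)
    span = max(hi_a, hi_b) - min(lo_a, lo_b)
    if span == 0:
        return 1.0  # both classes collapse to the same single value
    shared = max(0.0, min(hi_a, hi_b) - max(lo_a, lo_b))
    return shared / span

# Hypothetical expression values of one gene in two classes.
overlapping = overlap_fraction([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
separated = overlap_fraction([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
```

Under this sketch, a gene with a low overlap fraction separates the two classes well and would be kept, while a gene with a high fraction would be discarded.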
Affiliation(s)
- Muhammad Hamraz
  - Department of Statistics, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Naz Gul
  - Department of Statistics, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Mushtaq Raza
  - Department of Computer Sciences, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Dost Muhammad Khan
  - Department of Statistics, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Umair Khalil
  - Department of Statistics, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Seema Zubair
  - Department of Mathematics, Statistics and Computer Science, University of Agriculture Peshawar, Peshawar, Pakistan
- Zardad Khan
  - Department of Statistics, Abdul Wali Khan University Mardan, Mardan, Pakistan
|
24
|
Mahendran N, Durai Raj Vincent PM, Srinivasan K, Chang CY. Machine Learning Based Computational Gene Selection Models: A Survey, Performance Evaluation, Open Issues, and Future Research Directions. Front Genet 2020; 11:603808. [PMID: 33362861 PMCID: PMC7758324 DOI: 10.3389/fgene.2020.603808] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 09/08/2020] [Accepted: 10/29/2020] [Indexed: 12/20/2022] Open
Abstract
Gene expression is the process of determining the physical characteristics of living beings by generating the necessary proteins. Gene expression takes place in two steps, transcription and translation. It is the flow of information from DNA to RNA with the help of enzymes, and the end products are proteins and other biochemical molecules. Many technologies can capture gene expression from DNA or RNA. One such technique is the DNA microarray. Other than being expensive, the main issue with DNA microarrays is that they generate high-dimensional data with a minimal sample size. The issue in handling such a heavyweight dataset is that the learning model will be over-fitted. This problem should be addressed by considerably reducing the dimension of the data source. In recent years, machine learning has gained popularity in the field of genomic studies. In the literature, many machine learning-based gene selection approaches have been discussed, which were proposed to improve dimensionality reduction precision. This paper presents an extensive review of recent work on machine learning-based gene selection, along with a performance analysis. The study categorizes various feature selection algorithms under supervised, unsupervised, and semi-supervised learning. The works done in recent years to reduce the features for diagnosing tumors are discussed in detail. Furthermore, the performance of several methods discussed in the literature is analyzed. This study also lists and briefly discusses the open issues in handling high-dimensional, small-sample-size data.
Affiliation(s)
- Nivedhitha Mahendran
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- P. M. Durai Raj Vincent
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- Kathiravan Srinivasan
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- Chuan-Yu Chang
  - Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology, Douliu, Taiwan
|
25
|
García-Mendoza CV, Gambino OJ, Villarreal-Cervantes MG, Calvo H. Evolutionary Optimization of Ensemble Learning to Determine Sentiment Polarity in an Unbalanced Multiclass Corpus. Entropy 2020; 22:e22091020. [PMID: 33286789 PMCID: PMC7597113 DOI: 10.3390/e22091020] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 08/07/2020] [Revised: 09/10/2020] [Accepted: 09/10/2020] [Indexed: 11/16/2022]
Abstract
Sentiment polarity classification in social media is a very important task, as it enables gathering trends on particular subjects given a set of opinions. Great advances have recently been made using deep learning techniques, such as word embeddings, recurrent neural networks, and encoders such as BERT. Unfortunately, these techniques require large amounts of data, which in some cases are not available. To model this situation, challenges such as the Spanish TASS, organized by the Spanish Society for Natural Language Processing (SEPLN), have been proposed, which pose particular difficulties. First, the training and test sets are unevenly sized, the latter being more than eight times the size of the former. Second, the class distribution is markedly unbalanced and also differs between the two sets. Finally, there are four different labels, which creates the need to adapt current classification methods for multiclass handling. Traditional machine learning methods, such as Naïve Bayes, Logistic Regression, and Support Vector Machines, achieve modest performance under these conditions, but used as an ensemble they can attain competitive performance. Several strategies to build classifier ensembles have been proposed; this paper proposes estimating an optimal weighting scheme using a Differential Evolution algorithm focused on the particular issues that multiclass classification and unbalanced corpora pose. The ensemble with the proposed optimized weighting scheme improves the classification results on the full test set of the TASS challenge (General corpus), achieving state-of-the-art performance when compared with other works on this task, which make no use of NLP techniques.
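The idea of tuning ensemble weights with differential evolution can be sketched on a toy problem. This is a generic DE/rand/1/bin loop written for illustration, not the authors' configuration: the population size, `f`, `cr`, the accuracy objective, the two-class setup, and the probability values are all assumptions.

```python
import random

def weighted_vote(prob_rows, weights):
    """Combine per-classifier class probabilities with weights; return the winning class."""
    n_classes = len(prob_rows[0])
    totals = [sum(w * row[c] for w, row in zip(weights, prob_rows)) for c in range(n_classes)]
    return max(range(n_classes), key=totals.__getitem__)

def accuracy(weights, samples, labels):
    hits = sum(weighted_vote(s, weights) == y for s, y in zip(samples, labels))
    return hits / len(labels)

def differential_evolution(fitness, dim, pop_size=12, generations=30, f=0.5, cr=0.9, seed=0):
    """Maximize `fitness` over weight vectors in [0, 1]^dim (DE/rand/1/bin)."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    pop[0] = [1.0] * dim  # include uniform weights so the result is never worse than them
    fit = [fitness(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated coordinate
            trial = []
            for d in range(dim):
                if rng.random() < cr or d == j_rand:
                    v = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    trial.append(min(1.0, max(0.0, v)))  # clip to the allowed range
                else:
                    trial.append(pop[i][d])
            trial_fit = fitness(trial)
            if trial_fit >= fit[i]:  # greedy replacement keeps the better vector
                pop[i], fit[i] = trial, trial_fit
    best = max(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]

# Hypothetical class probabilities from three classifiers for four samples.
samples = [
    [[0.6, 0.4], [0.3, 0.7], [0.4, 0.6]],
    [[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]],
    [[0.7, 0.3], [0.4, 0.6], [0.45, 0.55]],
    [[0.1, 0.9], [0.6, 0.4], [0.5, 0.5]],
]
labels = [0, 1, 0, 1]
baseline = accuracy([1.0, 1.0, 1.0], samples, labels)
best_weights, best_accuracy = differential_evolution(
    lambda w: accuracy(w, samples, labels), dim=3
)
```

With an unbalanced corpus, plain accuracy is a weak objective; a class-sensitive score such as macro-averaged F1 could be substituted for `accuracy` without changing the loop.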
Affiliation(s)
- Consuelo V. García-Mendoza
  - Escuela Superior de Cómputo, Instituto Politécnico Nacional, Mexico City 07738, Mexico; (C.V.G.-M.); (O.J.G.)
- Omar J. Gambino
  - Escuela Superior de Cómputo, Instituto Politécnico Nacional, Mexico City 07738, Mexico; (C.V.G.-M.); (O.J.G.)
- Hiram Calvo
  - Centro de Investigación en Computación, Instituto Politécnico Nacional, Mexico City 07738, Mexico
  - Correspondence: ; Tel.: +52-55-57296000 (ext. 56516)
|
26
|
A survey on single and multi omics data mining methods in cancer data classification. J Biomed Inform 2020; 107:103466. [DOI: 10.1016/j.jbi.2020.103466] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 12/11/2019] [Revised: 05/01/2020] [Accepted: 05/31/2020] [Indexed: 01/09/2023]
|
27
|
Akramifard H, Balafar M, Razavi S, Ramli AR. Emphasis Learning, Features Repetition in Width Instead of Length to Improve Classification Performance: Case Study-Alzheimer's Disease Diagnosis. Sensors (Basel, Switzerland) 2020; 20:E941. [PMID: 32050715 PMCID: PMC7039233 DOI: 10.3390/s20030941] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/26/2019] [Revised: 10/28/2019] [Accepted: 10/28/2019] [Indexed: 01/21/2023]
Abstract
In the past decade, many studies have been conducted to advance computer-aided systems for Alzheimer's disease (AD) diagnosis. Most recently developed systems concentrate on extracting and combining features from MRI, PET, and CSF. For the most part, they have obtained very high performance. However, improving the performance of a classification problem is complicated, specifically when the model's accuracy or other performance measurements are higher than 90%. In this study, a novel methodology is proposed to address this problem, specifically in Alzheimer's disease diagnosis classification. This methodology is the first of its kind in the literature, based on the notion of replication in the feature space instead of the traditional sample space. Briefly, the main steps of the proposed method include extracting, embedding, and exploring the best subset of features. For feature extraction, we adopt VBM-SPM; for embedding features, a concatenation strategy is used on the features to ultimately create one feature vector for each subject. Principal component analysis is applied to extract new features, forming a low-dimensional compact space. A novel process is applied by replicating selected components, assessing the classification model, and repeating the replication until performance divergence or convergence. The proposed method aims to explore the most significant features and the highest-performing model at the same time, to classify normal subjects from AD and mild cognitive impairment (MCI) patients. In each epoch, a small subset of candidate features is assessed by a support vector machine (SVM) classifier. This repeating procedure is continued until the highest performance is achieved. Experimental results reveal the highest performance reported in the literature for this specific classification problem. We obtained a model with accuracies of 98.81%, 81.61%, and 81.40% for AD vs. normal control (NC), MCI vs. NC, and AD vs. MCI classification, respectively.
Affiliation(s)
- Hamid Akramifard
  - Faculty of Electrical and Computer Engineering, University of Tabriz, East Azerbaijan, Tabriz 51666-16471, Iran; (H.A.); (S.R.)
- MohammadAli Balafar
  - Faculty of Electrical and Computer Engineering, University of Tabriz, East Azerbaijan, Tabriz 51666-16471, Iran; (H.A.); (S.R.)
- SeyedNaser Razavi
  - Faculty of Electrical and Computer Engineering, University of Tabriz, East Azerbaijan, Tabriz 51666-16471, Iran; (H.A.); (S.R.)
- Abd Rahman Ramli
  - Department of Computer and Communication Systems Engineering, University Putra Malaysia, UPM-Serdang 43400, Malaysia
|
28
|
Al-Betar MA, Alomari OA, Abu-Romman SM. A TRIZ-inspired bat algorithm for gene selection in cancer classification. Genomics 2020; 112:114-126. [DOI: 10.1016/j.ygeno.2019.09.015] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 06/17/2019] [Revised: 09/05/2019] [Accepted: 09/17/2019] [Indexed: 10/25/2022]
|
29
|
MapReduce-Based Parallel Genetic Algorithm for CpG-Site Selection in Age Prediction. Genes (Basel) 2019; 10:genes10120969. [PMID: 31775313 PMCID: PMC6947642 DOI: 10.3390/genes10120969] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/20/2019] [Revised: 11/12/2019] [Accepted: 11/15/2019] [Indexed: 11/23/2022] Open
Abstract
Genomic biomarkers such as DNA methylation (DNAm) are employed for age prediction. In recent years, several studies have suggested the association between changes in DNAm and its effect on human age. The high dimensional nature of this type of data significantly increases the execution time of modeling algorithms. To mitigate this problem, we propose a two-stage parallel algorithm for selection of age related CpG-sites. The algorithm first attempts to cluster the data into similar age ranges. In the next stage, a parallel genetic algorithm (PGA), based on the MapReduce paradigm (MR-based PGA), is used for selecting age-related features of each individual age range. In the proposed method, the execution of the algorithm for each age range (data parallel), the evaluation of chromosomes (task parallel) and the calculation of the fitness function (data parallel) are performed using a novel parallel framework. In this paper, we consider 16 different healthy DNAm datasets that are related to the human blood tissue and that contain the relevant age information. These datasets are combined into a single unioned set, which is in turn randomly divided into two sets of train and test data with a ratio of 7:3, respectively. We build a Gradient Boosting Regressor (GBR) model on the selected CpG-sites from the train set. To evaluate the model accuracy, we compared our results with state-of-the-art approaches that used these datasets, and observed that our method performs better on the unseen test dataset with a Mean Absolute Deviation (MAD) of 3.62 years, and a correlation (R²) of 95.96% between age and DNAm. In the train data, the MAD and R² are 1.27 years and 99.27%, respectively. Finally, we evaluate our method in terms of the effect of parallelization in computation time. The algorithm without parallelization requires 4123 min to complete, whereas the parallelized execution on 3 computing machines having 32 processing cores each, only takes a total of 58 min. This shows that our proposed algorithm is both efficient and scalable.
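The timing figures in this abstract translate into a speedup and a parallel efficiency as follows. The calculation assumes the non-parallel run used a single core, which the abstract does not state.

```python
# Figures quoted in the abstract.
serial_minutes = 4123
parallel_minutes = 58
machines, cores_per_machine = 3, 32

total_cores = machines * cores_per_machine   # 96 cores in total
speedup = serial_minutes / parallel_minutes  # how many times faster the parallel run is
efficiency = speedup / total_cores           # fraction of ideal linear speedup (assumes a 1-core baseline)

print(f"speedup = {speedup:.1f}x on {total_cores} cores, efficiency = {efficiency:.0%}")
```

Under that assumption the speedup is roughly 71x on 96 cores, or about 74% of ideal linear scaling.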
|
30
|
Bir-Jmel A, Douiri SM, Elbernoussi S. Gene Selection via a New Hybrid Ant Colony Optimization Algorithm for Cancer Classification in High-Dimensional Data. Computational and Mathematical Methods in Medicine 2019; 2019:7828590. [PMID: 31737086 PMCID: PMC6815598 DOI: 10.1155/2019/7828590] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 02/04/2019] [Revised: 08/14/2019] [Accepted: 09/09/2019] [Indexed: 11/18/2022]
Abstract
The recent advance in the microarray data analysis makes it easy to simultaneously measure the expression levels of several thousand genes. These levels can be used to distinguish cancerous tissues from normal ones. In this work, we are interested in gene expression data dimension reduction for cancer classification, which is a common task in most microarray data analysis studies. This reduction has an essential role in enhancing the accuracy of the classification task and helping biologists accurately predict cancer in the body; this is carried out by selecting a small subset of relevant genes and eliminating the redundant or noisy genes. In this context, we propose a hybrid approach (MWIS-ACO-LS) for the gene selection problem, based on the combination of a new graph-based approach for gene selection (MWIS), in which we seek to minimize the redundancy between genes by considering the correlation between the latter and maximize gene-ranking (Fisher) scores, and a modified ACO coupled with a local search (LS) algorithm using the classifier 1NN for measuring the quality of the candidate subsets. In order to evaluate the proposed method, we tested MWIS-ACO-LS on ten well-replicated microarray datasets of high dimensions varying from 2308 to 12600 genes. The experimental results based on ten high-dimensional microarray classification problems demonstrated the effectiveness of our proposed method.
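The Fisher gene-ranking score mentioned in this abstract can be sketched for the two-class case. Several variants of the score exist; the one below (squared difference of class means over the sum of within-class population variances) is an assumption, and the expression values are hypothetical.

```python
from statistics import mean, pvariance

def fisher_score(class_a, class_b):
    """Two-class Fisher score: squared mean gap over summed within-class variance."""
    gap = (mean(class_a) - mean(class_b)) ** 2
    spread = pvariance(class_a) + pvariance(class_b)
    if spread == 0:
        # No within-class variation: perfectly separating if the means differ.
        return float("inf") if gap > 0 else 0.0
    return gap / spread

# Hypothetical expression values per gene: (tumor samples, normal samples).
expression = {
    "GENE_A": ([1, 2, 3], [7, 8, 9]),
    "GENE_B": ([1, 5, 9], [2, 6, 10]),
}
ranked = sorted(expression, key=lambda g: fisher_score(*expression[g]), reverse=True)
```

A higher score indicates a gene whose expression separates the classes more cleanly, which is why such scores are maximized while redundancy between selected genes is minimized.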
Affiliation(s)
- Ahmed Bir-Jmel
  - Laboratory of Mathematics, Computer Science & Applications-Security of Information, Department of Mathematics, Faculty of Sciences, Mohammed V University, Rabat, Morocco
- Sidi Mohamed Douiri
  - Laboratory of Mathematics, Computer Science & Applications-Security of Information, Department of Mathematics, Faculty of Sciences, Mohammed V University, Rabat, Morocco
- Souad Elbernoussi
  - Laboratory of Mathematics, Computer Science & Applications-Security of Information, Department of Mathematics, Faculty of Sciences, Mohammed V University, Rabat, Morocco
|
31
|
Sharma A, Rani R. C-HMOSHSSA: Gene selection for cancer classification using multi-objective meta-heuristic and machine learning methods. Computer Methods and Programs in Biomedicine 2019; 178:219-235. [PMID: 31416551 DOI: 10.1016/j.cmpb.2019.06.029] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 02/10/2019] [Revised: 06/24/2019] [Accepted: 06/27/2019] [Indexed: 05/21/2023]
Abstract
BACKGROUND AND OBJECTIVE Over the last two decades, DNA microarray technology has emerged as a powerful tool for early cancer detection and prevention. It helps to provide a detailed overview of the complex disease microenvironment. Moreover, the online availability of thousands of gene expression assays has made microarray data classification an active research area. A common goal is to find a minimum subset of genes while maximizing the classification accuracy. METHODS In pursuit of a similar objective, we have proposed a framework (C-HMOSHSSA) for gene selection using the multi-objective spotted hyena optimizer (MOSHO) and the salp swarm algorithm (SSA). Real-life optimization problems with more than one objective usually face the challenge of maintaining convergence and diversity. The salp swarm algorithm (SSA) maintains diversity but suffers from the overhead of maintaining the necessary information. On the other hand, MOSHO requires low computational effort and is therefore used for maintaining the necessary information. The proposed algorithm is thus a hybrid that utilizes the features of both SSA and MOSHO to facilitate its exploration and exploitation capability. RESULTS Four different classifiers are trained on seven high-dimensional datasets using a subset of features (genes) obtained by applying the proposed hybrid gene selection algorithm. The results show that the proposed technique significantly outperforms existing state-of-the-art techniques. CONCLUSION It is also shown that new sets of informative and biologically relevant genes are successfully identified by the proposed technique. The proposed approach can also be applied to other problem domains of interest that involve feature selection.
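The two competing objectives named in this abstract, fewer genes and higher accuracy, can be illustrated with a small Pareto-dominance sketch. This shows only the non-dominated filtering common to multi-objective methods, not MOSHO or SSA themselves; the candidate subsets are hypothetical.

```python
def dominates(a, b):
    """True if solution a is at least as good as b on both objectives and better on one.

    Each solution is (n_genes, accuracy): fewer genes and higher accuracy are better.
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better

def pareto_front(solutions):
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]

# Hypothetical candidate gene subsets: (number of genes, classification accuracy).
candidates = [(5, 0.90), (10, 0.95), (8, 0.88), (20, 0.95), (3, 0.80)]
front = pareto_front(candidates)
```

Here `(8, 0.88)` is dominated by `(5, 0.90)` and `(20, 0.95)` by `(10, 0.95)`, so three trade-off solutions remain for a user to choose from.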
Affiliation(s)
- Aman Sharma
  - Computer Science and Engineering Department, Thapar Institute of Engineering & Technology, Patiala, Punjab, India
- Rinkle Rani
  - Computer Science and Engineering Department, Thapar Institute of Engineering & Technology, Patiala, Punjab, India
|