1
|
A joint optimization framework integrated with biological knowledge for clustering incomplete gene expression data. Soft Comput 2022. [DOI: 10.1007/s00500-022-07180-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
|
2
|
AbdelAziz AM, Soliman T, Ghany KKA, Sewisy A. A hybrid multi-objective whale optimization algorithm for analyzing microarray data based on Apache Spark. PeerJ Comput Sci 2021; 7:e416. [PMID: 33834101 PMCID: PMC8022636 DOI: 10.7717/peerj-cs.416] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2020] [Accepted: 02/05/2021] [Indexed: 06/12/2023]
Abstract
A microarray is a revolutionary tool that generates vast volumes of data describing the expression profiles of the genes under investigation; these data can be qualified as Big Data. Hadoop and Spark are efficient frameworks developed to store and analyze Big Data. Analyzing microarray data helps researchers to identify correlated genes. Clustering has been successfully applied to analyze microarray data by grouping genes with similar expression profiles into clusters. The complex nature of microarray data has obliged clustering methods to employ multiple evaluation functions to obtain high-quality solutions. This transformed the clustering problem into a Multi-Objective Problem (MOP). A new and efficient hybrid Multi-Objective Whale Optimization Algorithm with Tabu Search (MOWOATS) was proposed to solve MOPs. In this article, MOWOATS is proposed to analyze massive microarray datasets. Three evaluation functions have been developed to ensure an effective assessment of solutions. MOWOATS has been adapted to run in parallel using Spark over Hadoop computing clusters. The quality of the generated solutions was evaluated based on different indices, such as the Silhouette and Davies-Bouldin indices. The obtained clusters were very similar to the original classes. Regarding scalability, the running time was inversely proportional to the number of computing nodes.
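The abstract names the Silhouette and Davies-Bouldin indices as quality measures. The following is a minimal sketch of their textbook definitions with Euclidean distance, not the authors' Spark implementation; the function names `silhouette_index` and `davies_bouldin_index` are hypothetical, and the Davies-Bouldin sketch assumes no two clusters share a centroid.

```python
import math


def _group_by_label(labels):
    """Map each cluster label to the list of point indices it contains."""
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)
    return groups


def silhouette_index(points, labels):
    """Mean silhouette width over all points; values near 1 are better."""
    groups = _group_by_label(labels)
    if len(groups) < 2:
        raise ValueError("need at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = groups[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in groups.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def davies_bouldin_index(points, labels):
    """Mean worst-case ratio of cluster scatter to centroid separation; lower is better."""
    groups = _group_by_label(labels)
    if len(groups) < 2:
        raise ValueError("need at least two clusters")
    centroids, scatter = {}, {}
    for lab, members in groups.items():
        dim = len(points[members[0]])
        c = tuple(sum(points[j][d] for j in members) / len(members) for d in range(dim))
        centroids[lab] = c
        scatter[lab] = sum(math.dist(points[j], c) for j in members) / len(members)
    worst = [
        max(
            (scatter[lab] + scatter[other]) / math.dist(centroids[lab], centroids[other])
            for other in groups
            if other != lab
        )
        for lab in groups
    ]
    return sum(worst) / len(worst)
```

For two tight, well-separated groups such as `[(0, 0), (0, 1), (10, 0), (10, 1)]` labelled `[0, 0, 1, 1]`, the silhouette is close to 1 and the Davies-Bouldin value is 0.1; swapping the labels to `[0, 1, 0, 1]` makes the silhouette negative.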
Affiliation(s)
- Taysir Soliman
  - Faculty of Computers and Information, Assiut University, Egypt
- Kareem Kamal A. Ghany
  - Faculty of Computers and Artificial Intelligence, Beni-Suef University, Egypt
  - College of Computing and Informatics, Saudi Electronic University, Riyadh, KSA
- Adel Sewisy
  - Faculty of Computers and Information, Assiut University, Egypt
|
3
|
Simultaneous feature selection and clustering of micro-array and RNA-sequence gene expression data using multiobjective optimization. Int J Mach Learn Cybern 2020. [DOI: 10.1007/s13042-020-01139-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 10/24/2022]
|
4
|
Automatic clustering and feature selection using gravitational search algorithm and its application to microarray data analysis. Neural Comput Appl 2019. [DOI: 10.1007/s00521-017-3321-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 10/18/2022]
|
5
|
Mitra S, Saha S. A multiobjective multi-view cluster ensemble technique: Application in patient subclassification. PLoS One 2019; 14:e0216904. [PMID: 31120942 PMCID: PMC6533037 DOI: 10.1371/journal.pone.0216904] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 02/17/2019] [Accepted: 04/30/2019] [Indexed: 11/21/2022]
Abstract
Recent high throughput omics technology has been used to assemble large biomedical omics datasets. Clustering of single omics data has proven invaluable in biomedical research. For the task of patient sub-classification, all the available omics data should be utilized in combination rather than treated individually. Clustering of multi-omics datasets has the potential to reveal deep insights. Here, we propose a late integration based multiobjective multi-view clustering algorithm which uses a special perturbation operator. Initially, a large number of diverse clustering solutions (called base partitionings) are generated for each omic dataset using four clustering algorithms, viz., k-means, complete linkage, spectral and fast search clustering. These base partitionings of multi-omic datasets are suitably combined using a special perturbation operator. The perturbation operator uses an ensemble technique to generate new solutions from the base partitionings. The optimal combination of multiple partitioning solutions across different views is determined after optimizing the objective functions, namely conn-XB, for checking the quality of partitionings for different views, and agreement index, for checking agreement between the views. The search capability of a multiobjective simulated annealing approach, namely AMOSA, is used for this purpose. Lastly, the non-dominated solutions of the different views are combined based on similarity to generate a single set of non-dominated solutions. The proposed algorithm is evaluated on 13 multi-view cancer datasets. An elaborate comparative study with several baseline methods and five state-of-the-art models is performed to show the effectiveness of the algorithm.
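The abstract relies on the notion of non-dominated solutions. As a generic illustration of that concept only (not of AMOSA or the authors' perturbation operator), a Pareto filter for objective vectors that are all to be minimized could look like this; the names `dominates` and `non_dominated` are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one.

    Assumes every objective is minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated(solutions):
    """Return the solutions that no other solution dominates, in input order."""
    return [s for s in solutions if not any(dominates(t, s) for t in solutions)]
```

With objective vectors `[(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)]`, the filter keeps `(1, 5)`, `(2, 2)` and `(5, 1)`, because `(2, 2)` dominates both `(3, 3)` and `(4, 4)`.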
Affiliation(s)
- Sayantan Mitra
  - Department of Computer Science and Engineering, Indian Institute of Technology Patna, India
- Sriparna Saha
  - Department of Computer Science and Engineering, Indian Institute of Technology Patna, India
|
6
|
Saha S. A line symmetry based genetic clustering technique: encoding lines in chromosomes. Int J Mach Learn Cybern 2018. [DOI: 10.1007/s13042-017-0680-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/30/2022]
|
7
|
Aggregation of multi-objective fuzzy symmetry-based clustering techniques for improving gene and cancer classification. Soft Comput 2018. [DOI: 10.1007/s00500-017-2865-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 11/27/2022]
|
8
|
Parraga-Alava J, Dorn M, Inostroza-Ponta M. A multi-objective gene clustering algorithm guided by apriori biological knowledge with intensification and diversification strategies. BioData Min 2018; 11:16. [PMID: 30100924 PMCID: PMC6081857 DOI: 10.1186/s13040-018-0178-4] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 05/05/2017] [Accepted: 07/29/2018] [Indexed: 01/10/2023]
Abstract
BACKGROUND: Biologists aim to understand the genetic background of diseases, metabolic disorders or any other genetic condition. Microarrays are one of the main high-throughput technologies for collecting information about the behaviour of genetic information under different conditions. In order to analyse this data, clustering arises as one of the main techniques used, and it aims at finding groups of genes that have some criterion in common, such as a similar expression profile. However, the problem of finding groups is normally multi-dimensional, making it necessary to approach the clustering as a multi-objective problem where various cluster validity indexes are simultaneously optimised. They are usually based on criteria like compactness and separation, which may not be sufficient since they cannot guarantee the generation of clusters that have both similar expression patterns and biological coherence.
METHOD: We propose a Multi-Objective Clustering algorithm Guided by a-Priori Biological Knowledge (MOC-GaPBK) to find clusters of genes with high levels of co-expression, biological coherence, and also good compactness and separation. Cluster quality indexes are used to optimise simultaneously gene relationships at expression level and biological functionality. Our proposal also includes intensification and diversification strategies to improve the search process.
RESULTS: The effectiveness of the proposed algorithm is demonstrated on four publicly available datasets. Comparative studies of the use of different objective functions and other widely used microarray clustering techniques are reported. Statistical, visual and biological significance tests are carried out to show the superiority of the proposed algorithm.
CONCLUSIONS: Integrating a-priori biological knowledge into a multi-objective approach and using intensification and diversification strategies allow the proposed algorithm to find solutions with higher quality than other microarray clustering techniques available in the literature in terms of co-expression, biological coherence, compactness and separation.
Affiliation(s)
- Jorge Parraga-Alava
  - Centre for Biotechnology and Bioengineering (CeBiB), Departamento de Ingeniería Informática, Universidad de Santiago de Chile, Av. Ecuador 3659, Santiago, Chile
  - Carrera de Computación, Escuela Superior Politécnica Agropecuaria de Manabí Manuel Félix López, Campus Politécnico Sitio El Limón, Calceta, Ecuador
- Marcio Dorn
  - Instituto de Informatica, Universidade Federal do Rio Grande do Sul, Av. Bento Gonçalves 9500, Porto Alegre, 91501-970 Brasil
- Mario Inostroza-Ponta
  - Centre for Biotechnology and Bioengineering (CeBiB), Departamento de Ingeniería Informática, Universidad de Santiago de Chile, Av. Ecuador 3659, Santiago, Chile
|
9
|
Bi-stage hierarchical selection of pathway genes for cancer progression using a swarm based computational approach. Appl Soft Comput 2018. [DOI: 10.1016/j.asoc.2017.10.024] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 02/05/2023]
|
10
|
Saha S, Acharya S, K K, Miriyala S. Simultaneous Clustering and Feature Weighting Using Multiobjective Optimization for Identifying Functionally Similar miRNAs. IEEE J Biomed Health Inform 2017; 22:1684-1690. [PMID: 29990050 DOI: 10.1109/jbhi.2017.2784898] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 11/07/2022]
Abstract
MicroRNAs (miRNAs) are a type of RNA responsible for monitoring the gene expression values. Recent research asserts that miRNAs form some clustering on chromosomes. The miRNAs belonging to a particular cluster are highly similar in terms of their activity and they are termed "coregulated" miRNAs. The current paper presents an approach that simultaneously performs two tasks: (i) clustering of miRNAs into different categories based on some similarity measures; (ii) identification of proper weight values for the different time points at which expression values are available. In general, a large number of expression values are available for a given miRNA data set. Not all of these values may be equally suitable for measuring the similarity between two miRNAs. In the current study, the problem of properly selecting weight values for different time points, and then determining the proper partitioning of the given miRNA data set using the similarity computed with the new set of weight values, is formulated as an optimization problem in which several cluster validity indices are optimized as the goodness measures. To that end, a multiobjective differential evolution based optimization technique is utilized. The supremacy of the proposed technique is tested on three miRNA data sets in comparison to some recent approaches in terms of some popular performance measures like Silhouette index and DB-index. The observations are further supported by statistical and biological significance tests. Supplementary information is available at https://www.iitp.ac.in/~sriparna/journals.html.
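The abstract describes computing similarity with a weight per time point. One simple way to picture this, assuming a weighted Euclidean distance (the abstract does not state which distance the authors use, and `weighted_distance` is a hypothetical name), is:

```python
import math


def weighted_distance(profile_a, profile_b, weights):
    """Weighted Euclidean distance between two expression profiles.

    weights[t] scales the contribution of time point t; a weight of 0
    removes that time point from the comparison entirely.
    """
    if not (len(profile_a) == len(profile_b) == len(weights)):
        raise ValueError("profiles and weights must have the same length")
    return math.sqrt(
        sum(w * (a - b) ** 2 for a, b, w in zip(profile_a, profile_b, weights))
    )
```

Two profiles that differ only at a time point with weight 0 are at distance 0, which is how a weighting scheme of this kind can suppress uninformative time points.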
|
11
|
Nidheesh N, Abdul Nazeer KA, Ameer PM. An enhanced deterministic K-Means clustering algorithm for cancer subtype prediction from gene expression data. Comput Biol Med 2017; 91:213-221. [PMID: 29100115 DOI: 10.1016/j.compbiomed.2017.10.014] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Received: 06/12/2017] [Revised: 10/13/2017] [Accepted: 10/14/2017] [Indexed: 01/21/2023]
Abstract
BACKGROUND: Clustering algorithms with steps involving randomness usually give different results on different executions for the same dataset. This non-deterministic nature of algorithms such as the K-Means clustering algorithm limits their applicability in areas such as cancer subtype prediction using gene expression data. It is hard to sensibly compare the results of such algorithms with those of other algorithms. The non-deterministic nature of K-Means is due to its random selection of data points as initial centroids.
METHOD: We propose an improved, density based version of K-Means, which involves a novel and systematic method for selecting initial centroids. The key idea of the algorithm is to select data points which belong to dense regions and which are adequately separated in feature space as the initial centroids.
RESULTS: We compared the proposed algorithm to a set of eleven widely used single clustering algorithms and a prominent ensemble clustering algorithm which is being used for cancer data classification, based on the performances on a set of datasets comprising ten cancer gene expression datasets. The proposed algorithm has shown better overall performance than the others.
CONCLUSION: There is a pressing need in the Biomedical domain for simple, easy-to-use and more accurate Machine Learning tools for cancer subtype prediction. The proposed algorithm is simple, easy-to-use and gives stable results. Moreover, it provides comparatively better predictions of cancer subtypes from gene expression data.
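The key idea stated in the abstract, choosing initial centroids from dense regions that are adequately separated, can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' algorithm: density is taken to be the number of points within a hypothetical `radius`, separation is enforced with a hypothetical `min_sep` threshold, and ties are broken by index so the result is deterministic.

```python
import math


def dense_separated_seeds(points, k, radius, min_sep):
    """Pick k initial centroids deterministically.

    Points are visited from densest to sparsest; a point becomes a seed
    only if it lies at least min_sep away from every seed chosen so far.
    """
    # Density of a point: how many points (itself included) fall within radius.
    density = [
        sum(1 for q in points if math.dist(p, q) <= radius) for p in points
    ]
    # Sort by decreasing density, breaking ties by original index.
    order = sorted(range(len(points)), key=lambda i: (-density[i], i))
    seeds = []
    for i in order:
        if all(math.dist(points[i], points[s]) >= min_sep for s in seeds):
            seeds.append(i)
            if len(seeds) == k:
                return [points[s] for s in seeds]
    raise ValueError("could not find k sufficiently separated seeds")
```

Because no random draw is involved, repeated calls on the same data return the same seeds, which is the property the abstract emphasizes for comparing results across runs.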
Affiliation(s)
- N Nidheesh
  - Department of Electronics and Communication Engineering, National Institute of Technology Calicut, Kerala 673601, India
- K A Abdul Nazeer
  - Department of Computer Science and Engineering, National Institute of Technology Calicut, Kerala 673601, India
- P M Ameer
  - Department of Electronics and Communication Engineering, National Institute of Technology Calicut, Kerala 673601, India
|
12
|
|
13
|
Oyelade J, Isewon I, Oladipupo F, Aromolaran O, Uwoghiren E, Ameh F, Achas M, Adebiyi E. Clustering Algorithms: Their Application to Gene Expression Data. Bioinform Biol Insights 2016; 10:237-253. [PMID: 27932867 PMCID: PMC5135122 DOI: 10.4137/bbi.s38316] [Citation(s) in RCA: 69] [Impact Index Per Article: 8.6] [Received: 05/18/2016] [Revised: 09/05/2016] [Accepted: 09/09/2016] [Indexed: 12/17/2022]
Abstract
Gene expression data hide vital information required to understand the biological process that takes place in a particular organism in relation to its environment. Deciphering the hidden patterns in gene expression data offers a prodigious opportunity to strengthen the understanding of functional genomics. The complexity of biological networks and the volume of genes present increase the challenges of comprehending and interpreting the resulting mass of data, which consists of millions of measurements; these data also exhibit vagueness, imprecision, and noise. Therefore, the use of clustering techniques is a first step toward addressing these challenges, which is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. The clustering of gene expression data has been proven to be useful in making known the natural structure inherent in gene expression data, understanding gene functions, cellular processes, and subtypes of cells, mining useful information from noisy data, and understanding gene regulation. The other benefit of clustering gene expression data is the identification of homology, which is very important in vaccine design. This review examines the various clustering algorithms applicable to gene expression data in order to discover and provide useful knowledge of the appropriate clustering technique that will guarantee stability and a high degree of accuracy in its analysis procedure.
Affiliation(s)
- Jelili Oyelade
  - Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
  - Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
- Itunuoluwa Isewon
  - Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
  - Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
- Funke Oladipupo
  - Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
- Olufemi Aromolaran
  - Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
- Efosa Uwoghiren
  - Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
- Faridah Ameh
  - Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
- Moses Achas
  - Department of Computer Science and Information Technology, Bells University of Technology, Ota, Ogun State, Nigeria
- Ezekiel Adebiyi
  - Department of Computer and Information Sciences, Covenant University, Ota, Ogun State, Nigeria
  - Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
|
14
|
Acharya S, Saha S, Thadisina Y. Multiobjective Simulated Annealing-Based Clustering of Tissue Samples for Cancer Diagnosis. IEEE J Biomed Health Inform 2016; 20:691-8. [DOI: 10.1109/jbhi.2015.2404971] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Indexed: 11/09/2022]
|
15
|
Jothi R, Mohanty SK, Ojha A. Functional grouping of similar genes using eigenanalysis on minimum spanning tree based neighborhood graph. Comput Biol Med 2016; 71:135-48. [PMID: 26945461 DOI: 10.1016/j.compbiomed.2016.02.007] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 10/05/2015] [Revised: 01/16/2016] [Accepted: 02/12/2016] [Indexed: 10/22/2022]
Abstract
Gene expression data clustering is an important biological process in DNA microarray analysis. Although there have been many clustering algorithms for gene expression analysis, finding a suitable and effective clustering algorithm is always a challenging problem due to the heterogeneous nature of gene profiles. Minimum Spanning Tree (MST) based clustering algorithms have been successfully employed to detect clusters of varying shapes and sizes. This paper proposes a novel clustering algorithm using Eigenanalysis on Minimum Spanning Tree based neighborhood graph (E-MST). As MST of a set of points reflects the similarity of the points with their neighborhood, the proposed algorithm employs a similarity graph obtained from k′ rounds of MST (k′-MST neighborhood graph). By studying the spectral properties of the similarity matrix obtained from k′-MST graph, the proposed algorithm achieves improved clustering results. We demonstrate the efficacy of the proposed algorithm on 12 gene expression datasets. Experimental results show that the proposed algorithm performs better than the standard clustering algorithms.
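The abstract mentions a neighborhood graph built from several rounds of MST. One plausible reading, taken here as an assumption because the abstract does not spell out the construction, is to compute an MST, remove its edges from the complete distance graph, and repeat, keeping the union of all edges found. The sketch below uses Kruskal's algorithm with a union-find structure; `mst_edges` and `k_round_mst_graph` are hypothetical names, and the eigenanalysis step is not shown.

```python
import math
from itertools import combinations


def mst_edges(n, weighted_edges):
    """Kruskal's algorithm: return the edges of a minimum spanning forest."""
    parent = list(range(n))

    def find(x):
        # Find the component root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    for w, u, v in sorted(weighted_edges):
        ru, rv = find(u), find(v)
        if ru != rv:  # edge joins two components, so it cannot form a cycle
            parent[ru] = rv
            chosen.append((w, u, v))
    return chosen


def k_round_mst_graph(points, rounds):
    """Union of the edges from `rounds` successive, edge-disjoint MSTs."""
    n = len(points)
    remaining = {
        (u, v): math.dist(points[u], points[v])
        for u, v in combinations(range(n), 2)
    }
    graph = set()
    for _ in range(rounds):
        tree = mst_edges(n, [(w, u, v) for (u, v), w in remaining.items()])
        for _, u, v in tree:
            graph.add((u, v))
            del remaining[(u, v)]  # later rounds must use different edges
    return graph
```

Each extra round adds the next-closest connections that the earlier trees left out, so the resulting graph is denser than a single MST while still being much sparser than the complete graph.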
Affiliation(s)
- R Jothi
  - Indian Institute of Information Technology, Design and Manufacturing Jabalpur, Madhya Pradesh, India
- Sraban Kumar Mohanty
  - Indian Institute of Information Technology, Design and Manufacturing Jabalpur, Madhya Pradesh, India
- Aparajita Ojha
  - Indian Institute of Information Technology, Design and Manufacturing Jabalpur, Madhya Pradesh, India
|
16
|
Acharya S, Saha S. Importance of proximity measures in clustering of cancer and miRNA datasets: proposal of an automated framework. Mol Biosyst 2016; 12:3478-3501. [DOI: 10.1039/c6mb00609d] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 01/29/2023]
Abstract
Distance plays an important role in the clustering process for allocating data points to different clusters.
Affiliation(s)
- Sudipta Acharya
  - Department of Computer Science and Engineering, Indian Institute of Technology Patna, India
- Sriparna Saha
  - Department of Computer Science and Engineering, Indian Institute of Technology Patna, India
|
17
|
Saha S, Alok AK, Ekbal A. Use of Semisupervised Clustering and Feature-Selection Techniques for Identification of Co-expressed Genes. IEEE J Biomed Health Inform 2015. [PMID: 26208367 DOI: 10.1109/jbhi.2015.2451735] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 11/07/2022]
Abstract
Studying the patterns hidden in gene-expression data helps to understand the functionality of genes. In general, clustering techniques are widely used for the identification of natural partitionings from the gene expression data. In order to put constraints on dimensionality, feature selection is the key issue because not all features are important from a clustering point of view. Moreover, some limited amount of supervised information can help to fine-tune the obtained clustering solution. In this paper, the problem of simultaneous feature selection and semisupervised clustering is formulated as a multiobjective optimization (MOO) task. A modern simulated annealing-based MOO technique, namely AMOSA, is utilized as the background optimization methodology. Here, features and cluster centers are represented in the form of a string and the assignment of genes to different clusters is done using a point symmetry-based distance. Six optimization criteria based on several internal and external cluster validity indices are utilized. In order to generate the supervised information, a popular clustering technique, Fuzzy C-means, is utilized. The appropriate subset of features, the proper number of clusters and the proper partitioning are determined using the search capability of AMOSA. The effectiveness of this proposed semisupervised clustering technique, Semi-FeaClustMOO, is demonstrated on five publicly available benchmark gene-expression datasets. Comparison results with the existing techniques for gene-expression data clustering again reveal the superiority of the proposed technique. Statistical and biological significance tests have also been carried out.
|
18
|
Semi-supervised clustering for gene-expression data in multiobjective optimization framework. Int J Mach Learn Cybern 2015. [DOI: 10.1007/s13042-015-0335-8] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Indexed: 10/24/2022]
|
19
|
Lotfi E, Keshavarz A. Gene expression microarray classification using PCA–BEL. Comput Biol Med 2014; 54:180-7. [DOI: 10.1016/j.compbiomed.2014.09.008] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 02/21/2014] [Revised: 09/13/2014] [Accepted: 09/16/2014] [Indexed: 01/15/2023]
|
20
|
Doostparast Torshizi A, Fazel Zarandi MH. A new cluster validity measure based on general type-2 fuzzy sets: Application in gene expression data clustering. Knowl Based Syst 2014. [DOI: 10.1016/j.knosys.2014.03.023] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Indexed: 10/25/2022]
|