1
|
Uddin J, Ghazali R, Abawajy JH, Shah H, Husaini NA, Zeb A. Rough set based information theoretic approach for clustering uncertain categorical data. PLoS One 2022; 17:e0265190. [PMID: 35559954 PMCID: PMC9106167 DOI: 10.1371/journal.pone.0265190] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2021] [Accepted: 02/27/2022] [Indexed: 12/02/2022]
Abstract
Motivation: Many real-world applications, such as business and health care, generate large categorical datasets with uncertainty. A fundamental task is to efficiently discover hidden, non-trivial patterns in such datasets. Because the exact value of an attribute is often unknown in uncertain categorical datasets, conventional clustering algorithms are not well suited to handling categorical data, uncertainty, and stability together.
Problem statement: Rough Set Theory supports decision making in the presence of vague and uncertain data. Although recent categorical clustering techniques based on Rough Set Theory help, they suffer from low accuracy, high computational complexity, and poor generalizability, especially on data sets where they fail to select, or have difficulty selecting, their best clustering attribute.
Objectives: The main objective of this research is to propose a new information-theoretic Rough Purity Approach (RPA). A further objective is to address the problems of traditional Rough Set Theory based categorical clustering techniques. The ultimate goal is to cluster uncertain categorical datasets efficiently in terms of performance, generalizability, and computational complexity.
Methods: RPA takes into account the information-theoretic attribute purity of categorical-valued information systems. Extensive experiments evaluate the efficiency of RPA on a real Supplier Base Management (SBM) dataset and six benchmark UCI datasets. RPA is also compared with several recent categorical data clustering techniques.
Results: The experimental results show that RPA outperforms the baseline algorithms. The percentage improvements in time (66.70%), iterations (83.13%), purity (10.53%), entropy (14%), and accuracy (12.15%), as well as in the Rough Accuracy of clusters, show that RPA is suitable for practical use.
Conclusion: We conclude that, compared with other techniques, the attribute purity of categorical-valued information systems clusters the data better. RPA can therefore be recommended for large-scale clustering in multiple domains, and its performance can be enhanced in further research.
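The abstract above reports improvements in purity and entropy without defining them. As a rough illustration only, the sketch below uses the common textbook definitions of cluster purity and size-weighted cluster entropy; the paper's exact formulas may differ, and the function names and the toy data are assumptions made for this example.

```python
from collections import Counter
from math import log2


def purity(clusters, labels):
    """Fraction of objects that carry the majority class label of their cluster.

    clusters: list of non-empty lists of object indices (hypothetical input format).
    labels:   sequence mapping an object index to its true class label.
    """
    total = sum(len(c) for c in clusters)
    majority = sum(
        Counter(labels[i] for i in c).most_common(1)[0][1] for c in clusters
    )
    return majority / total


def entropy(clusters, labels):
    """Size-weighted average of the class-label entropy inside each cluster."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        counts = Counter(labels[i] for i in c)
        h = -sum((n / len(c)) * log2(n / len(c)) for n in counts.values())
        result += (len(c) / total) * h
    return result


# Toy data: four objects, two classes.
labels = ["a", "a", "b", "b"]
perfect = [[0, 1], [2, 3]]   # each cluster holds a single class
mixed = [[0, 2], [1, 3]]     # each cluster is split evenly between classes
```

Under these definitions a higher purity and a lower entropy both indicate clusters that agree more closely with the known classes.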
Affiliation(s)
- Jamal Uddin
- Qurtuba University of Science & IT, Peshawar, Pakistan
- Rozaida Ghazali
- Universiti Tun Hussein Onn Malaysia, Batu Pahat, Johor, Malaysia
- Asim Zeb
- Abbottabad University of Science & Technology, Abbottabad, Pakistan
|
2
|
Miya J, Ansari MA. Medical images performance analysis and observations with SPIHT and wavelet techniques. Journal of Information & Optimization Sciences 2020. [DOI: 10.1080/02522667.2020.1721616] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 10/25/2022]
Affiliation(s)
- Javed Miya
- Department of Computer Science, Uttarakhand Technical University, Dehradun 248007, Uttarakhand, India
- M. A. Ansari
- Department of Electrical Engineering, Gautam Buddha University, Greater Noida 201312, Uttar Pradesh, India
|
3
|
Golchin M, Liew AWC. Parallel biclustering detection using strength Pareto front evolutionary algorithm. Inf Sci (N Y) 2017. [DOI: 10.1016/j.ins.2017.06.031] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Indexed: 01/30/2023]
|
4
|
Abstract
A new method of MRI brain segmentation integrates fuzzy c-means (FCM) clustering and rough set theory. In this paper, we use a rough set algorithm to find a suitable number of clusters and to initialize the cluster centers for FCM. We then apply FCM to MRI brain segmentation; however, FCM tends to converge to a local minimum in medical image segmentation. To avoid being trapped in a local optimum, we use the particle swarm optimization algorithm to constrain the convergence of FCM, which also reduces computation. The final experimental results show that the improved algorithm not only retains the advantage of rapid convergence but also controls local convergence and improves global search ability. The proposed method achieves better clustering performance.
Affiliation(s)
- Yang Zhang
- School of Information and Engineering, Wenzhou Medical University, Wenzhou, Zhejiang 325000, P. R. China
- Shufan Ye
- Zhejiang Zhonglan Environment Technology Ltd., Wenzhou, Zhejiang 325000, P. R. China
- Weifeng Ding
- 118 Hospital of People’s Liberation Army, Wenzhou, Zhejiang 325000, P. R. China
|
5
|
An Empirical Analysis of Rough Set Categorical Clustering Techniques. PLoS One 2017; 12:e0164803. [PMID: 28068344 PMCID: PMC5222507 DOI: 10.1371/journal.pone.0164803] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 04/27/2016] [Accepted: 09/30/2016] [Indexed: 11/19/2022]
Abstract
Clustering a set of objects into homogeneous groups is a fundamental operation in data mining. Recently, much attention has been paid to categorical data clustering, where data objects are made up of non-numerical attributes. For categorical data clustering, rough set based approaches such as Maximum Dependency Attribute (MDA) and Maximum Significance Attribute (MSA) have outperformed their predecessors, such as Bi-Clustering (BC), Total Roughness (TR), and Min-Min Roughness (MMR). This paper presents the limitations and issues of the MDA and MSA techniques on a special type of data set where both techniques fail to select, or have difficulty selecting, their best clustering attribute. This analysis motivates the need for a better and more general rough set theory approach that can cope with the issues of MDA and MSA. Hence, an alternative technique named Maximum Indiscernible Attribute (MIA), which clusters categorical data using rough set indiscernibility relations, is proposed. The novelty of the proposed approach is that, unlike other rough set theory techniques, it uses the domain knowledge of the data set. It is based on the concept of the indiscernibility relation combined with the number of clusters. To show the significance of the proposed approach, the effect of the number of clusters on rough accuracy, purity, and entropy is described in the form of propositions. Moreover, ten different data sets from previously utilized research cases and the UCI repository are used for experiments. The results, presented in tabular and graphical form, show that the proposed MIA technique performs better in selecting the clustering attribute in terms of purity, entropy, iterations, time, accuracy, and rough accuracy.
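The indiscernibility relation and rough accuracy mentioned in this abstract are standard rough set notions, and a small sketch may help make them concrete. The code below is not the MIA technique itself; it only shows how equivalence classes, lower and upper approximations, and rough accuracy are usually computed. The table layout, attribute names, and function names are assumptions made for illustration.

```python
from collections import defaultdict


def indiscernibility_classes(table, attrs):
    """Group objects that share the same values on every attribute in attrs."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[tuple(row[a] for a in attrs)].add(obj)
    return list(groups.values())


def rough_accuracy(table, attrs, target):
    """Ratio |lower approximation| / |upper approximation| of a target set."""
    classes = indiscernibility_classes(table, attrs)
    # Lower approximation: classes entirely contained in the target set.
    lower = set().union(*(c for c in classes if c <= target))
    # Upper approximation: classes that overlap the target set at all.
    upper = set().union(*(c for c in classes if c & target))
    return len(lower) / len(upper) if upper else 1.0


# Hypothetical categorical information system with four objects.
table = {
    1: {"color": "red", "size": "s"},
    2: {"color": "red", "size": "s"},
    3: {"color": "blue", "size": "s"},
    4: {"color": "blue", "size": "l"},
}
target = {1, 2, 3}
```

With only `color`, objects 3 and 4 cannot be told apart, so the target set is described roughly; adding `size` separates them and the approximation becomes exact.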
|
6
|
Rough set approach for clustering categorical data using information-theoretic dependency measure. INFORM SYST 2015. [DOI: 10.1016/j.is.2014.06.008] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Indexed: 11/20/2022]
|
7
|
|
8
|
Fa R, Roberts DJ, Nandi AK. SMART: unique splitting-while-merging framework for gene clustering. PLoS One 2014; 9:e94141. [PMID: 24714159 PMCID: PMC3979766 DOI: 10.1371/journal.pone.0094141] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 11/06/2013] [Accepted: 03/14/2014] [Indexed: 11/18/2022]
Abstract
Successful clustering algorithms are highly dependent on parameter settings. The clustering performance degrades significantly unless parameters are properly set, and yet it is difficult to set these parameters a priori. To address this issue, we propose a unique splitting-while-merging clustering framework, named "splitting merging awareness tactics" (SMART), which does not require any a priori knowledge of either the number of clusters or even the possible range of this number. Unlike existing self-splitting algorithms, which over-cluster the dataset into a large number of clusters and then merge some similar clusters, our framework can split and merge clusters automatically during the process and produces the most reliable clustering results by intrinsically integrating many clustering techniques and tasks. The SMART framework is implemented with two distinct clustering paradigms in two algorithms: competitive learning and finite mixture model. Nevertheless, within the proposed SMART framework, many other algorithms can be derived for different clustering paradigms. The minimum message length algorithm is integrated into the framework as the clustering selection criterion. The usefulness of the SMART framework and its algorithms is tested on demonstration datasets and simulated gene expression datasets. Moreover, two real microarray gene expression datasets are studied using this approach. Across many performance metrics, all numerical results show that SMART is superior to the existing self-splitting algorithms and traditional algorithms it was compared with. Three main properties of the proposed SMART framework are summarized as: (1) needing no parameters dependent on the respective dataset or a priori knowledge about the datasets, (2) being extendible to many different applications, and (3) offering superior performance compared with counterpart algorithms.
Affiliation(s)
- Rui Fa
- Department of Electronic and Computer Engineering, Brunel University, Uxbridge, Middlesex, United Kingdom
- David J. Roberts
- National Health Service Blood and Transplant, Oxford, United Kingdom
- The University of Oxford, John Radcliffe Hospital, Oxford, United Kingdom
- Asoke K. Nandi
- Department of Electronic and Computer Engineering, Brunel University, Uxbridge, Middlesex, United Kingdom
- Department of Mathematical Information Technology, University of Jyväskylä, Jyväskylä, Finland
|
9
|
|
10
|
|
11
|
Zanaty E. Determining the number of clusters for kernelized fuzzy C-means algorithms for automatic medical image segmentation. Egyptian Informatics Journal 2012. [DOI: 10.1016/j.eij.2012.01.004] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Indexed: 11/16/2022]
|
12
|
Romdhane LB, Shili H, Ayeb B. P3M: Possibilistic multi-step maxmin and merging algorithm with application to gene expression data mining. Int J Artif Intell Tools 2011. [DOI: 10.1142/s0218213009000263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Gene expression data generated by DNA microarray experiments provide a vast resource for medical diagnosis and the understanding of disease. Unfortunately, the large amount of data makes it hard, sometimes impossible, to understand the correct behavior of genes. In this work, we develop a possibilistic approach for mining gene microarray data. Our model consists of two steps. In the first step, we use possibilistic clustering to partition the data into groups (or clusters). The optimal number of clusters is evaluated automatically from the data using the Partition Information Entropy as a validity measure. In the second step, we select from each computed cluster the most representative genes and model them as a graph called a proximity graph. This set of graphs (or hyper-graph) is used to predict the function of new and previously unknown genes. Benchmark results on real-world data sets reveal good performance of our model in computing optimal partitions, even in the presence of noise, and high prediction accuracy on unknown genes.
Affiliation(s)
- LOTFI BEN ROMDHANE
- PRINCE Research Group (PRINCE stands for Pole de Recherche en INformatique du CEntre. It is a multi-disciplinary research group working in the fields of data mining, distributed computing, and intelligent networks.), Department of Computer Science, Faculty of Sciences of Monastir, University of Monastir, Monastir 5019, Tunisia
- HECHMI SHILI
- PRINCE Research Group (PRINCE stands for Pole de Recherche en INformatique du CEntre. It is a multi-disciplinary research group working in the fields of data mining, distributed computing, and intelligent networks.), Department of Computer Science, Faculty of Sciences of Monastir, University of Monastir, Monastir 5019, Tunisia
- BECHIR AYEB
- PRINCE Research Group (PRINCE stands for Pole de Recherche en INformatique du CEntre. It is a multi-disciplinary research group working in the fields of data mining, distributed computing, and intelligent networks.), Department of Computer Science, Faculty of Sciences of Monastir, University of Monastir, Monastir 5019, Tunisia
|
13
|
|
14
|
Romdhane LB, Shili H, Ayeb B. Mining microarray gene expression data with unsupervised possibilistic clustering and proximity graphs. APPL INTELL 2009. [DOI: 10.1007/s10489-009-0161-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
|
15
|
Zhu L, Chung FL, Wang S. Generalized fuzzy C-means clustering algorithm with improved fuzzy partitions. IEEE Trans Syst Man Cybern B Cybern 2009; 39:578-91. [PMID: 19174354 DOI: 10.1109/tsmcb.2008.2004818] [Citation(s) in RCA: 140] [Impact Index Per Article: 9.3] [Indexed: 11/10/2022]
Abstract
The fuzziness index m has an important influence on the clustering result of fuzzy clustering algorithms, and it should not be fixed at the usual value m = 2. In view of its distinctive features in applications and its limitation to m = 2 only, a recent advance in fuzzy clustering called fuzzy c-means clustering with improved fuzzy partitions (IFP-FCM) is extended in this paper, and a generalized algorithm called GIFP-FCM is proposed for more effective clustering. By introducing a novel membership constraint function, a new objective function is constructed, from which GIFP-FCM clustering is derived. The robustness and convergence of the proposed algorithm are analyzed from the viewpoints of the L_p norm distance measure and competitive learning. Furthermore, the classical fuzzy c-means algorithm (FCM) and IFP-FCM can be taken as two special cases of the proposed algorithm. Several experimental results, including an application to noisy image texture segmentation, are presented to demonstrate its average advantage over FCM and IFP-FCM in both clustering and robustness.
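To make the role of the fuzziness index m concrete, the sketch below evaluates the membership update of the classical FCM algorithm, u_i = 1 / sum_j (d_i / d_j)^(2/(m-1)), for a single data point. It illustrates classical FCM only, not IFP-FCM or GIFP-FCM; the function name and the assumption of strictly positive distances are choices made for this example.

```python
def fcm_memberships(distances, m=2.0):
    """Classical FCM memberships of one point, given its distances to each center.

    Assumes every distance is strictly positive (a point sitting exactly on a
    center needs separate handling, which this sketch leaves out).
    """
    if m <= 1.0:
        raise ValueError("the fuzziness index m must be greater than 1")
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_j) ** exponent for d_j in distances)
        for d_i in distances
    ]


# One point at distance 1 from the first center and 2 from the second.
crisp_like = fcm_memberships([1.0, 2.0], m=2.0)
fuzzier = fcm_memberships([1.0, 2.0], m=5.0)
```

As m grows, the memberships drift toward an even split between clusters, which is one way to see why the choice of m affects the clustering result.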
Affiliation(s)
- Lin Zhu
- School of Information Technology, Southern Yangtze University, Wuxi 214036, China
|
16
|
|
17
|
Cheng KO, Law NF, Siu WC, Liew AWC. Identification of coherent patterns in gene expression data using an efficient biclustering algorithm and parallel coordinate visualization. BMC Bioinformatics 2008; 9:210. [PMID: 18433478 PMCID: PMC2396181 DOI: 10.1186/1471-2105-9-210] [Citation(s) in RCA: 52] [Impact Index Per Article: 3.3] [Received: 09/12/2007] [Accepted: 04/23/2008] [Indexed: 11/10/2022]
Abstract
BACKGROUND DNA microarray technology allows the measurement of expression levels of thousands of genes under tens or hundreds of different conditions. In microarray data, genes with similar functions usually co-express under certain conditions only. Thus, biclustering, which clusters genes and conditions simultaneously, is preferred over traditional clustering for discovering these coherent genes. Various biclustering algorithms have been developed using different bicluster formulations. Unfortunately, many useful formulations result in NP-complete problems. In this article, we investigate an efficient method for identifying a popular type of bicluster called the additive model. Furthermore, parallel coordinate (PC) plots are used for bicluster visualization and analysis.
RESULTS We develop a novel and efficient biclustering algorithm which can be regarded as a greedy version of an existing algorithm known as the pCluster algorithm. By relaxing the homogeneity constraint, the proposed algorithm has polynomial-time complexity in the worst case instead of the exponential-time complexity of the pCluster algorithm. Experiments on artificial datasets verify that our algorithm can identify both additive-related and multiplicative-related biclusters in the presence of overlap and noise. Biologically significant biclusters have been validated on the yeast cell-cycle expression dataset using Gene Ontology annotations. A comparative study shows that the proposed approach outperforms several existing biclustering algorithms. We also provide an interactive exploratory tool based on PC plot visualization for determining the parameters of our biclustering algorithm.
CONCLUSION We have proposed a novel biclustering algorithm which works with PC plots for an interactive exploratory analysis of gene expression data. Experiments show that the biclustering algorithm is efficient and is capable of detecting co-regulated genes. The interactive analysis enables optimal parameter determination in the biclustering algorithm so as to achieve the best result. In future, we will modify the proposed algorithm for other bicluster models such as the coherent evolution model.
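The additive bicluster model named in this abstract is commonly written as a_ij = mu + alpha_i + beta_j, which means that any two rows of the bicluster differ by a constant offset across its columns. The sketch below tests that property on a chosen submatrix. It is a simple check written for illustration, not the greedy algorithm the authors propose, and the function name, the `tol` parameter, and the toy matrix are assumptions.

```python
def is_additive_bicluster(matrix, rows, cols, tol=1e-9):
    """Return True if the selected rows differ by constant offsets over cols.

    matrix: list of lists of numbers (for example, genes by conditions).
    rows, cols: non-empty index lists selecting the candidate bicluster.
    tol: allowed spread of the row-to-row offsets; a larger value tolerates noise.
    """
    base = rows[0]
    for r in rows[1:]:
        offsets = [matrix[r][c] - matrix[base][c] for c in cols]
        if max(offsets) - min(offsets) > tol:
            return False
    return True


# Rows 0 and 1 differ by a constant 2; row 2 follows a different pattern.
expression = [
    [1.0, 2.0, 4.0],
    [3.0, 4.0, 6.0],
    [0.0, 5.0, 1.0],
]
```

Applying the same check to log-transformed values would test a multiplicative pattern instead, since logarithms turn constant ratios into constant offsets.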
Affiliation(s)
- Kin-On Cheng
- School of Information and Communication Technology, Griffith University, Gold Coast Campus, QLD 4222, Queensland, Australia.
|
18
|
Zhao H, Liew AWC, Xie X, Yan H. A new geometric biclustering algorithm based on the Hough transform for analysis of large-scale microarray data. J Theor Biol 2007; 251:264-74. [PMID: 18199458 DOI: 10.1016/j.jtbi.2007.11.030] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.4] [Received: 07/23/2007] [Revised: 10/17/2007] [Accepted: 11/29/2007] [Indexed: 11/30/2022]
Abstract
Biclustering is an important tool in microarray analysis when only a subset of genes co-regulates in a subset of conditions. Different from standard clustering analyses, biclustering performs simultaneous classification in both gene and condition directions in a microarray data matrix. However, the biclustering problem is inherently intractable and computationally complex. In this paper, we present a new biclustering algorithm based on the geometrical viewpoint of coherent gene expression profiles. In this method, we perform pattern identification based on the Hough transform in a column-pair space. The algorithm is especially suitable for the biclustering analysis of large-scale microarray data. Our studies show that the approach can discover significant biclusters with respect to the increased noise level and regulatory complexity. Furthermore, we also test the ability of our method to locate biologically verifiable biclusters within an annotated set of genes.
Affiliation(s)
- Hongya Zhao
- Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong.
|
19
|
|
20
|
Gan X, Liew AWC, Yan H. Microarray missing data imputation based on a set theoretic framework and biological knowledge. Nucleic Acids Res 2006; 34:1608-19. [PMID: 16549873 PMCID: PMC1409680 DOI: 10.1093/nar/gkl047] [Citation(s) in RCA: 73] [Impact Index Per Article: 4.1] [Indexed: 11/12/2022]
Abstract
Gene expressions measured using microarrays usually suffer from the missing value problem. However, many data analysis methods require a complete data matrix. Although existing missing value imputation algorithms have shown good performance in dealing with missing values, they also have their limitations. For example, some algorithms perform well only when strong local correlation exists in the data, while others provide the best estimate when the data is dominated by global structure. In addition, these algorithms do not take into account any biological constraint in their imputation. In this paper, we propose a set theoretic framework based on projection onto convex sets (POCS) for missing data imputation. POCS allows us to incorporate different types of a priori knowledge about missing values into the estimation process. The main idea of POCS is to formulate every piece of prior knowledge as a corresponding convex set and then use a convergence-guaranteed iterative procedure to obtain a solution in the intersection of all these sets. In this work, we design several convex sets, taking into consideration the biological characteristics of the data: the first set mainly exploits the local correlation structure among genes in microarray data, while the second set captures the global correlation structure among arrays. The third set (actually a series of sets) exploits the biological phenomenon of synchronization loss in microarray experiments. In cyclic systems, synchronization loss is a common phenomenon, and we construct a series of sets based on this phenomenon for our POCS imputation algorithm. Experiments show that our algorithm can achieve a significant reduction of error compared to the KNNimpute, SVDimpute and LSimpute methods.
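The POCS idea described in this abstract, repeatedly projecting an estimate onto each convex set until it lies in their intersection, can be sketched on a toy problem. The two sets below (a box constraint and a fixed-sum hyperplane) are stand-ins chosen for illustration and are not the biologically motivated sets designed in the paper; all function names and parameters are assumptions.

```python
def project_box(x, lo, hi):
    """Project x onto the box lo <= x_k <= hi by clipping each coordinate."""
    return [min(max(v, lo), hi) for v in x]


def project_sum(x, total):
    """Project x onto the hyperplane sum(x) == total (equal shift per coordinate)."""
    shift = (total - sum(x)) / len(x)
    return [v + shift for v in x]


def pocs(x, lo, hi, total, iterations=200):
    """Alternate the two projections; the sets are assumed to intersect."""
    for _ in range(iterations):
        x = project_box(project_sum(x, total), lo, hi)
    return x


# Start far outside both sets and look for a point in the unit box with sum 1.5.
estimate = pocs([5.0, -3.0], lo=0.0, hi=1.0, total=1.5)
```

In an imputation setting, each piece of prior knowledge about the missing entries would supply one such projection, and the loop would cycle through all of them.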
Affiliation(s)
- Xiangchao Gan
- Department of Electronic Engineering, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong
- Alan Wee-Chung Liew
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
- To whom correspondence should be addressed. Tel: 852 26098419; Fax: 852 26035024
- Hong Yan
- Department of Electronic Engineering, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong
- School of Electrical and Information Engineering, University of Sydney, NSW 2006, Australia
|
21
|
Clustering Analysis of Gene Expression Data based on Semi-supervised Visual Clustering Algorithm. Soft comput 2006. [DOI: 10.1007/s00500-005-0025-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/26/2022]
|
22
|
Abstract
Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, proximity measure, and cluster validation, are also discussed.
Affiliation(s)
- Rui Xu
- Department of Electrical and Computer Engineering, University of Missouri-Rolla, Rolla, MO 65409, USA.
|