1
Shan Y, Li S, Li F, Cui Y, Chen M. Dual-level clustering ensemble algorithm with three consensus strategies. Sci Rep 2023;13:22617. PMID: 38114636; PMCID: PMC10730624. DOI: 10.1038/s41598-023-49947-9.
Abstract
Clustering ensemble (CE), renowned for its robust and potent consensus capability, has garnered significant attention in recent years and has achieved numerous noteworthy breakthroughs. Nevertheless, three key issues persist: (1) most CE selection strategies rely on preset parameters or empirical knowledge, lacking adaptive selectivity; (2) the construction of the co-association matrix is overly one-sided; (3) CE methods lack a broader perspective from which to reconcile conflicts among different consensus results. To address these problems, a dual-level clustering ensemble algorithm with three consensus strategies is proposed. First, a backward clustering ensemble selection framework is devised whose built-in selection strategy can adaptively eliminate redundant members. Then, at the base-clustering consensus level, taking into account the interplay between actual spatial location information and co-occurrence frequency, two modified relation matrices are reconstructed, yielding two consensus methods with different modes. Additionally, at the CE consensus level, which takes a broader view, an adjustable Dempster-Shafer evidence theory is developed as the third consensus method to dynamically fuse multiple ensemble results. Experimental results demonstrate that, compared with seven other state-of-the-art and typical CE algorithms, the proposed algorithm exhibits exceptional consensus ability and robustness.
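The modified relation matrices described in this abstract build on the standard co-association matrix of clustering ensembles: the entry for a pair of samples is the fraction of base clusterings that group them together. A minimal sketch of that baseline construction follows (the paper's variants additionally incorporate spatial location information, which is omitted here; the example data are hypothetical).

```python
import numpy as np

def co_association(labelings):
    """Standard co-association matrix: entry (i, j) is the fraction of
    base clusterings that place samples i and j in the same cluster."""
    labelings = np.asarray(labelings)   # shape (n_clusterings, n_samples)
    m, n = labelings.shape
    C = np.zeros((n, n))
    for labels in labelings:
        # 1 where the two samples share a cluster label, 0 otherwise
        C += (labels[:, None] == labels[None, :]).astype(float)
    return C / m

# Three base clusterings of five samples (toy data).
base = [
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
C = co_association(base)
print(C[0, 1])   # samples 0 and 1 co-occur in all three clusterings -> 1.0
```

A consensus partition is then typically obtained by running a standard clustering algorithm (e.g. hierarchical linkage) on `1 - C` as a dissimilarity matrix.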
Affiliation(s)
- Yunxiao Shan: School of Science, Harbin University of Science and Technology, Harbin 150080, China.
- Shu Li: School of Science, Harbin University of Science and Technology, Harbin 150080, China; Key Laboratory of Engineering Dielectric and Applications (Ministry of Education), School of Electrical and Electronic Engineering, Harbin University of Science and Technology, Harbin 150080, China.
- Fuxiang Li: School of Science, Harbin University of Science and Technology, Harbin 150080, China.
- Yuxin Cui: School of Science, Harbin University of Science and Technology, Harbin 150080, China.
- Minghua Chen: Key Laboratory of Engineering Dielectric and Applications (Ministry of Education), School of Electrical and Electronic Engineering, Harbin University of Science and Technology, Harbin 150080, China.
2
Nie F, Dong X, Hu Z, Wang R, Li X. Discriminative Projected Clustering via Unsupervised LDA. IEEE Transactions on Neural Networks and Learning Systems 2023;34:9466-9480. PMID: 36121958. DOI: 10.1109/tnnls.2022.3202719.
Abstract
This work focuses on the projected clustering problem. Specifically, an efficient and parameter-free clustering model, named discriminative projected clustering (DPC), is proposed for simultaneous low-dimensional discriminative projection learning and clustering, from the perspective of least squares regression. The proposed DPC, a constrained regression model, aims at finding both a transformation matrix and a binary indicator matrix that minimize the sum-of-squares error. Theoretically, a significant conclusion is drawn and used to reveal the connection between DPC and linear discriminant analysis (LDA). Experiments are conducted on both toy and real-world data to validate the effectiveness and efficiency of DPC, and on hyperspectral images to further verify its practicability in real-world applications. Experimental results demonstrate that DPC achieves results comparable or superior to some state-of-the-art clustering methods.
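The least-squares view of clustering that DPC formalizes can be illustrated with a toy alternating scheme: fix the binary indicator matrix Y, solve the regression for the transformation W in closed form, then reassign labels from the projection. This is only an illustrative sketch under our own assumptions (an intercept column and a median-split initialization); the published DPC model constrains the transformation and treats the indicator jointly.

```python
import numpy as np

def dpc_sketch(X, k=2, n_iter=10):
    """Toy alternation for the least-squares clustering view:
    regression step for W (closed form), then discrete reassignment.
    Median-split initialization is our simplification, not the paper's."""
    n = len(X)
    Xb = np.hstack([X, np.ones((n, 1))])            # add an intercept column
    labels = (X[:, 0] > np.median(X[:, 0])).astype(int)
    for _ in range(n_iter):
        Y = np.eye(k)[labels]                       # binary indicator matrix
        W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)  # closed-form LS step
        labels = (Xb @ W).argmax(axis=1)            # reassign from projection
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),         # two well-separated blobs
               rng.normal(5, 0.3, (20, 2))])
labels = dpc_sketch(X)
```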
3
Wang R, Han S, Zhou J, Chen Y, Wang L, Du T, Ji K, Zhao YO, Zhang K. Transfer-Learning-Based Gaussian Mixture Model for Distributed Clustering. IEEE Transactions on Cybernetics 2023;53:7058-7070. PMID: 35687639. DOI: 10.1109/tcyb.2022.3177242.
Abstract
Distributed clustering based on the Gaussian mixture model (GMM) has exhibited excellent clustering capabilities in peer-to-peer (P2P) networks. However, existing distributed GMM clustering algorithms require many iterations and considerable communication overhead to reach consensus. In addition, the lack of a closed-form update for the GMM parameters in the distributed setting leads to imprecise clustering accuracy. To solve these issues, a general transfer distributed GMM clustering framework based on transfer learning is developed to improve clustering performance and accelerate convergence. In this work, each node is treated as both a source domain and a target domain, and the nodes learn from each other to complete the clustering task in distributed P2P networks. Based on this framework, a transfer distributed expectation-maximization algorithm with a fixed learning rate is first presented for data clustering. Then, an improved version is designed to obtain stable clustering accuracy, in which an adaptive transfer learning strategy adjusts the learning rate automatically instead of using a fixed value. To demonstrate the extensibility of the proposed framework, a representative GMM clustering method, the entropy-type classification maximum-likelihood algorithm, is further extended to a transfer distributed counterpart. Experimental results verify the effectiveness of the presented algorithms in comparison with existing GMM clustering approaches.
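The transfer and distributed variants described above extend plain centralized expectation-maximization (EM) for a GMM, whose M-step updates are closed form. A one-dimensional sketch of that baseline follows; initializing the two means at the data extremes is our own simplification, not part of the paper.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """Plain centralized EM for a two-component 1-D Gaussian mixture --
    the baseline that transfer/distributed variants build on."""
    mu = np.array([x.min(), x.max()])         # simple deterministic init
    var = np.full(2, x.var())
    pi = np.full(2, 0.5)
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = P(component j | x_i)
        diff = x[:, None] - mu[None, :]
        dens = pi * np.exp(-0.5 * diff**2 / var) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form updates for weights, means, and variances
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu[None, :])**2).sum(axis=0) / nk
    return pi, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4, 1, 300), rng.normal(4, 1, 300)])
pi, mu, var = em_gmm_1d(x)   # means converge near -4 and 4
```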
4
Kao WC, Xie HX, Lin CY, Cheng WH. Specific Expert Learning: Enriching Ensemble Diversity via Knowledge Distillation. IEEE Transactions on Cybernetics 2023;53:2494-2505. PMID: 34793316. DOI: 10.1109/tcyb.2021.3125320.
Abstract
In recent years, ensemble methods have shown sterling performance and gained popularity in visual tasks. However, the performance of an ensemble is limited by the paucity of diversity among its models. Thus, to enrich the diversity of the ensemble, we present a distillation approach, learning from experts (LFEs). It involves a novel knowledge distillation (KD) method, specific expert learning (SEL), which can reduce class selectivity and improve performance on specific weaker classes as well as overall accuracy. Through SEL, models can acquire different knowledge from distinct networks with various areas of expertise, and a highly diverse ensemble can subsequently be obtained. Our experimental results demonstrate that, on CIFAR-10, the accuracy of ResNet-32 increases by 0.91% with SEL, and the ensemble trained by SEL increases accuracy by 1.13%. By comparison, a state-of-the-art approach such as DML improves accuracy by only 0.3% and 1.02% on a single ResNet-32 and the ensemble, respectively. Furthermore, the proposed architecture can also be applied to ensemble distillation (ED), which applies KD to the ensemble model. In conclusion, our experimental results show that the proposed SEL not only improves the accuracy of a single classifier but also boosts the diversity of the ensemble model.
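SEL is a variant of knowledge distillation; the generic objective it builds on is the Hinton-style KL divergence between temperature-softened teacher and student class distributions. A minimal numpy sketch of that generic loss follows (SEL's specific class-selectivity weighting is not reproduced here; the logit values are made up for illustration).

```python
import numpy as np

def softmax(z, t=1.0):
    """Temperature-scaled softmax with the usual max-shift for stability."""
    z = z / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, t=4.0):
    """Generic KD objective: KL(teacher || student) on temperature-softened
    distributions, scaled by t**2 as in the original formulation."""
    p = softmax(teacher_logits, t)   # soft targets from the expert
    q = softmax(student_logits, t)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * t**2)

teacher = np.array([[5.0, 1.0, 0.5]])
student = np.array([[4.0, 1.5, 0.2]])
loss = distillation_loss(student, teacher)   # positive when they disagree
```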
5
Wu Y, Wu R, Liu J, Tang X. MetaWCE: Learning to Weight for Weighted Cluster Ensemble. Inf Sci (N Y) 2023. DOI: 10.1016/j.ins.2023.01.135.
6
He K, Massena DG. Examining unsupervised ensemble learning using spectroscopy data of organic compounds. J Comput Aided Mol Des 2023;37:17-37. PMID: 36404382. DOI: 10.1007/s10822-022-00488-9.
Abstract
One solution to the challenge of choosing an appropriate clustering algorithm is to combine different clusterings into a single consensus clustering result, known as a cluster ensemble (CE). This ensemble learning strategy can provide more robust and stable solutions across different domains and datasets. Unfortunately, not all clusterings in the ensemble contribute to the final data partition. Cluster ensemble selection (CES) aims at selecting a subset from a large library of clustering solutions to form a smaller cluster ensemble that performs as well as or better than the set of all available clustering solutions. In this paper, we investigate four CES methods for the categorization of structurally distinct organic compounds using high-dimensional IR and Raman spectroscopy data. The single quality selection (SQI) method forms a subset of the ensemble by using various quality indices to select the highest-quality ensemble members. The Bagging method, usually applied in supervised learning, ranks ensemble members by calculating the normalized mutual information (NMI) between ensemble members and consensus solutions generated from a randomly sampled subset of the full ensemble. The hierarchical cluster and select method (HCAS-SQI) uses the diversity matrix of ensemble members to select a diverse set of ensemble members with the highest quality. Furthermore, a combining strategy can be used to merge subsets selected using multiple quality indices (HCAS-MQI) for the refinement of clustering solutions in the ensemble. The IR + Raman hybrid ensemble library is created by merging two complementary "views" of the organic compounds; this inherently more diverse library gives the best full-ensemble consensus results. Overall, the Bagging method is recommended because it provides the most robust results, better than or comparable to the full-ensemble consensus solutions.
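The NMI-based ranking at the heart of the Bagging method can be sketched as follows. For simplicity the consensus labeling is given directly rather than generated from a random subsample of the ensemble, and NMI is implemented from scratch with arithmetic-mean normalization; the toy label vectors are made up.

```python
import numpy as np
from collections import Counter

def nmi(a, b):
    """Normalized mutual information between two label vectors
    (arithmetic-mean normalization of the two entropies)."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(c / n * np.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
             for (x, y), c in pab.items())
    ha = -sum(c / n * np.log(c / n) for c in pa.values())
    hb = -sum(c / n * np.log(c / n) for c in pb.values())
    return mi / ((ha + hb) / 2) if ha + hb else 1.0

def rank_members(members, consensus):
    """Score each ensemble member by its NMI against a consensus
    labeling and return member indices sorted best-first."""
    scores = [nmi(m, consensus) for m in members]
    order = sorted(range(len(members)), key=lambda i: -scores[i])
    return order, scores

members = [
    [0, 0, 1, 1],   # agrees with the consensus
    [0, 1, 0, 1],   # statistically unrelated to it
]
consensus = [0, 0, 1, 1]
order, scores = rank_members(members, consensus)
```

Keeping only the top-ranked members yields the reduced ensemble that CES passes on to the final consensus step.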
Affiliation(s)
- Kedan He: Department of Physical Sciences, School of Arts and Sciences, Eastern Connecticut State University, Willimantic, CT 06226, USA.
- Djenerly G Massena: Department of Physical Sciences, School of Arts and Sciences, Eastern Connecticut State University, Willimantic, CT 06226, USA.
7
Wang Y, Krishna Saraswat S, Elyasi Komari I. Big Data Analysis Using a Parallel Ensemble Clustering Architecture and an Unsupervised Feature Selection Approach. Journal of King Saud University - Computer and Information Sciences 2022. DOI: 10.1016/j.jksuci.2022.11.016.
8
Huang D, Wang CD, Lai JH, Kwoh CK. Toward Multidiversified Ensemble Clustering of High-Dimensional Data: From Subspaces to Metrics and Beyond. IEEE Transactions on Cybernetics 2022;52:12231-12244. PMID: 33961570. DOI: 10.1109/tcyb.2021.3049633.
Abstract
The rapid emergence of high-dimensional data in various areas has brought new challenges to ensemble clustering research. To deal with the curse of dimensionality, considerable recent effort in ensemble clustering has been devoted to various subspace-based techniques. However, besides the emphasis on subspaces, rather limited attention has been paid to the potential diversity in similarity/dissimilarity metrics. How to create and aggregate a large population of diversified metrics, and furthermore, how to jointly investigate the multilevel diversity in large populations of metrics, subspaces, and clusters in a unified framework, remains a surprisingly open problem in ensemble clustering. To tackle this problem, this article proposes a novel multidiversified ensemble clustering approach. In particular, we create a large number of diversified metrics by randomizing a scaled exponential similarity kernel, which are then coupled with random subspaces to form a large set of metric-subspace pairs. Based on the similarity matrices derived from these metric-subspace pairs, an ensemble of diversified base clusterings can thereby be constructed. Furthermore, an entropy-based criterion is utilized to explore the cluster-wise diversity in ensembles, based on which three specific ensemble clustering algorithms are presented by incorporating three types of consensus functions. Extensive experiments are conducted on 30 high-dimensional datasets, including 18 cancer gene expression datasets and 12 image/speech datasets, demonstrating the superiority of our algorithms over the state of the art. The source code is available at https://github.com/huangdonghere/MDEC.
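One common form of the scaled exponential similarity kernel adapts the bandwidth of each pair to the local scale, measured by the mean distance to the k nearest neighbours, so dense and sparse regions are treated comparably. The sketch below uses that form as an assumption; the paper's exact kernel and its randomization of the scale parameter may differ in detail.

```python
import numpy as np

def scaled_exp_similarity(X, k=3, sigma=0.5):
    """Scaled exponential similarity kernel sketch: pairwise bandwidth
    derived from each point's mean distance to its k nearest neighbours.
    Randomizing sigma (and k) would yield a family of diversified metrics."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # mean distance from each point to its k nearest neighbours (self excluded)
    knn_mean = np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1)
    eps = (knn_mean[:, None] + knn_mean[None, :] + D) / 3.0
    return np.exp(-D**2 / (2.0 * (sigma * eps)**2 + 1e-12))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
K = scaled_exp_similarity(X)   # symmetric, unit diagonal, values in [0, 1]
```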
9
Wang Y, Li X, Wong KC, Chang Y, Yang S. Evolutionary Multiobjective Clustering Algorithms With Ensemble for Patient Stratification. IEEE Transactions on Cybernetics 2022;52:11027-11040. PMID: 33961576. DOI: 10.1109/tcyb.2021.3069434.
Abstract
Patient stratification has been widely studied to tackle subtype diagnosis problems for effective treatment. Owing to the curse of dimensionality and the poor interpretability of data, constructing a stratification model with high diagnostic ability and good generalization remains a long-standing challenge. To address these problems, this article proposes two novel evolutionary multiobjective clustering algorithms with ensemble (NSGA-II-ECFE and MOEA/D-ECFE), with four cluster validity indices used as the objective functions. First, an effective ensemble construction method is developed to enrich ensemble diversity. After that, an ensemble clustering fitness evaluation (ECFE) method is proposed to evaluate the ensembles by measuring the consensus clustering under those four objective functions. To generate the consensus clustering, ECFE exploits the hybrid co-association matrix from the ensembles and then dynamically selects a suitable clustering algorithm on that matrix. Multiple experiments demonstrate the effectiveness of the proposed algorithms in comparison with seven clustering algorithms, twelve ensemble clustering approaches, and two multiobjective clustering algorithms on 55 synthetic datasets and 35 real patient stratification datasets. The experimental results demonstrate the competitive edge of the proposed algorithms over the compared methods. Furthermore, the proposed algorithms are applied to identify cancer subtypes from five cancer-related single-cell RNA-seq datasets.
10
Sheng W, Wang X, Wang Z, Li Q, Zheng Y, Chen S. A Differential Evolution Algorithm With Adaptive Niching and K-Means Operation for Data Clustering. IEEE Transactions on Cybernetics 2022;52:6181-6195. PMID: 33284774. DOI: 10.1109/tcyb.2020.3035887.
Abstract
Clustering, as an important part of data mining, is inherently a challenging problem. This article proposes a differential evolution algorithm with adaptive niching and a k-means operation (denoted DE_ANS_AKO) for partitional data clustering. Within the proposed algorithm, an adaptive niching scheme, which dynamically adjusts the size of each niche in the population, is devised and integrated to prevent premature convergence of the evolutionary search, thereby searching the space appropriately to identify the optimal or near-optimal solution. Furthermore, to improve search efficiency, an adaptive k-means operation is designed and employed at the niche level of the population. The performance of the proposed algorithm has been evaluated on synthetic as well as real datasets and compared with related methods. The experimental results reveal that the proposed algorithm reliably and efficiently delivers high-quality clustering solutions and generally outperforms the related methods implemented for comparison.
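The k-means operation used as local refinement inside an evolutionary search can be sketched as a single Lloyd iteration applied to an individual that encodes k cluster centres: assign each point to its nearest centre, then move each centre to the mean of its points. One Lloyd step never increases the sum of squared errors, which is what makes it a safe refinement operator. The adaptive details of the paper's scheme are omitted; the data below are toy blobs.

```python
import numpy as np

def kmeans_step(centers, X):
    """One Lloyd iteration, usable as a local-refinement operator on a
    candidate solution that encodes k cluster centres."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    labels = d.argmin(axis=1)
    new = centers.copy()
    for j in range(len(centers)):
        pts = X[labels == j]
        if len(pts):                 # leave empty clusters unchanged
            new[j] = pts.mean(axis=0)
    return new, labels

def sse(centers, X):
    """Sum of squared distances to the nearest centre (the k-means objective)."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    return float((d.min(axis=1) ** 2).sum())

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])
centers = rng.normal(2, 1, (2, 2))   # a candidate individual
refined, _ = kmeans_step(centers, X)  # SSE does not increase
```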
11
Zhou P, Du L, Liu X, Shen YD, Fan M, Li X. Self-Paced Clustering Ensemble. IEEE Transactions on Neural Networks and Learning Systems 2021;32:1497-1511. PMID: 32310800. DOI: 10.1109/tnnls.2020.2984814.
Abstract
The clustering ensemble has emerged as an important extension of the classical clustering problem. It provides an elegant framework for integrating multiple weak base clusterings to generate a strong consensus result. Most existing clustering ensemble methods exploit all data to learn a consensus clustering result, which does not sufficiently account for the adverse effects of difficult instances. To handle this problem, we propose a novel self-paced clustering ensemble (SPCE) method, which gradually brings instances, from easy to difficult, into the ensemble learning. Our method integrates the evaluation of instance difficulty and ensemble learning into a unified framework, which can automatically estimate the difficulty of instances and ensemble the base clusterings. To optimize the corresponding objective function, we propose a joint learning algorithm to obtain the final consensus clustering result. Experimental results on benchmark datasets demonstrate the effectiveness of our method.
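The easy-to-difficult scheduling in self-paced learning is typically realized with a hard regularizer: an instance enters the objective only once its current loss falls below an age parameter lambda, which grows over iterations. A minimal sketch of that weighting rule follows (SPCE's actual joint objective over base clusterings is richer; the loss values here are made up).

```python
import numpy as np

def self_paced_weights(losses, lam):
    """Hard self-paced regularizer: weight 1 for instances whose current
    loss is below the age parameter lambda (easy examples), else 0."""
    return (losses < lam).astype(float)

losses = np.array([0.1, 0.5, 0.9, 2.0])
for lam in (0.3, 1.0, 3.0):          # as lambda grows, harder instances join
    v = self_paced_weights(losses, lam)
```

In the full method these weights multiply the per-instance terms of the consensus objective, and the model and weights are optimized alternately as lambda is annealed upward.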
12
Dai D, Tang J, Yu Z, Wong HS, You J, Cao W, Hu Y, Chen CLP. An Inception Convolutional Autoencoder Model for Chinese Healthcare Question Clustering. IEEE Transactions on Cybernetics 2021;51:2019-2031. PMID: 31180903. DOI: 10.1109/tcyb.2019.2916580.
Abstract
A healthcare question answering (HQA) system plays a vital role in encouraging patients to seek professional consultation. However, learning and representing the question corpus of HQA datasets involves several challenging factors, such as high dimensionality, sparseness, noise, and nonprofessional expression. To address these issues, we propose an inception convolutional autoencoder model for Chinese healthcare question clustering (ICAHC). First, we select a set of kernels with different sizes using convolutional autoencoder networks to explore both diversity and quality in the clustering ensemble; these kernels are thus encouraged to capture diverse representations. Second, we design four ensemble operators to merge representations based on whether they are independent, and input them into the encoder using different skip connections. Third, the model maps features from the encoder into a lower-dimensional space, followed by clustering. We conduct comparative experiments against other clustering algorithms on a Chinese healthcare dataset. Experimental results show the effectiveness of ICAHC in discovering better clustering solutions. The results can be used in the prediction of patients' conditions and the development of an automatic HQA system.