1. He G, Jiang W, Peng R, Yin M, Han M. Soft Subspace Based Ensemble Clustering for Multivariate Time Series Data. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:7761-7774. [PMID: 35157594 DOI: 10.1109/tnnls.2022.3146136] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 06/14/2023]
Abstract
Recently, multivariate time series (MTS) clustering has attracted considerable attention. However, state-of-the-art algorithms suffer from two major issues. First, few existing studies consider correlations and redundancies between variables of MTS data. Second, since different clusters usually exist in different intrinsic variables, efficiently enhancing performance by mining the intrinsic variables of a cluster remains challenging. To deal with these issues, we first propose a variable-weighted K-medoids clustering algorithm (VWKM) based on the importance of a variable for a cluster. In VWKM, the proposed variable weighting scheme identifies the important variables for a cluster, which can also provide knowledge and experience to related experts. Then, a Reverse nearest neighborhood-based density Peaks approach (RP) is proposed to handle the initialization sensitivity of VWKM. Next, based on VWKM and the density peaks approach, an ensemble Clustering framework (SSEC) is proposed to further enhance the clustering performance. Experimental results on ten MTS datasets show that our method works well on MTS datasets and outperforms state-of-the-art clustering ensemble approaches.
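The idea of per-cluster variable weights in a K-medoids setting can be illustrated with a minimal sketch. It is not the VWKM algorithm from the paper: the abstract does not give the weighting formula, so the exponential, dispersion-based rule in `update_weights` (and its `gamma` parameter) is an assumption borrowed from common soft subspace clustering practice. Objects are also simplified to plain feature vectors rather than full multivariate time series, and all function names are hypothetical.

```python
import math

def weighted_distance(x, medoid, weights):
    """Weighted Euclidean distance; `weights` holds one weight per variable."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, medoid)))

def update_weights(members, medoid, gamma=1.0):
    """Assumed weighting rule (not taken from the paper): variables with small
    within-cluster dispersion around the medoid get larger weights.
    The returned weights are positive and sum to 1."""
    n_vars = len(medoid)
    dispersion = [sum((x[j] - medoid[j]) ** 2 for x in members) for j in range(n_vars)]
    scores = [math.exp(-d / gamma) for d in dispersion]
    total = sum(scores)
    return [s / total for s in scores]

def assign(points, medoids, weights_per_cluster):
    """Assign each point to the cluster whose medoid is nearest under that
    cluster's own variable weights."""
    labels = []
    for x in points:
        dists = [weighted_distance(x, m, w) for m, w in zip(medoids, weights_per_cluster)]
        labels.append(dists.index(min(dists)))
    return labels
```

In a full algorithm these two steps would alternate with a medoid update until the assignments stop changing. The resulting weight vectors are what would indicate which variables matter for each cluster.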
2. Wang Y, Krishna Saraswat S, Elyasi Komari I. Big Data Analysis Using a Parallel Ensemble Clustering Architecture and an Unsupervised Feature Selection Approach. Journal of King Saud University - Computer and Information Sciences 2022. [DOI: 10.1016/j.jksuci.2022.11.016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/08/2022]
3. Bench C, Nallala J, Wang CC, Sheridan H, Stone N. Unsupervised segmentation of biomedical hyperspectral image data: tackling high dimensionality with convolutional autoencoders. Biomedical Optics Express 2022; 13:6373-6388. [PMID: 36589581 PMCID: PMC9774878 DOI: 10.1364/boe.476233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2022] [Revised: 10/25/2022] [Accepted: 10/25/2022] [Indexed: 06/17/2023]
Abstract
Information about the structure and composition of biopsy specimens can assist in disease monitoring and diagnosis. In principle, this can be acquired from Raman and infrared (IR) hyperspectral images (HSIs) that encode information about how a sample's constituent molecules are arranged in space. Each tissue section/component is defined by a unique combination of spatial and spectral features, but given the high dimensionality of HSI datasets, extracting and utilising them to segment images is non-trivial. Here, we show how networks based on deep convolutional autoencoders (CAEs) can perform this task in an end-to-end fashion by first detecting and compressing relevant features from patches of the HSI into low-dimensional latent vectors, and then performing a clustering step that groups patches containing similar spatio-spectral features together. We showcase the advantages of using this end-to-end spatio-spectral segmentation approach compared to i) the same spatio-spectral technique not trained in an end-to-end manner, and ii) a method that only utilises spectral features (spectral k-means) using simulated HSIs of porcine tissue as test examples. Secondly, we describe the potential advantages/limitations of using three different CAE architectures: a generic 2D CAE, a generic 3D CAE, and a 2D convolutional encoder-decoder architecture inspired by the recently proposed UwU-net that is specialised for extracting features from HSI data. We assess their performance on IR HSIs of real colon samples. We find that all architectures are capable of producing segmentations that show good correspondence with HE stained adjacent tissue slices used as approximate ground truths, indicating the robustness of the CAE-driven spatio-spectral clustering approach for segmenting biomedical HSI data. Additionally, we stress the need for more accurate ground truth information to enable a precise comparison of the advantages offered by each architecture.
Affiliation(s)
- Ciaran Bench
  - School of Physics and Astronomy, University of Exeter, Exeter, Devon, EX4 4PY, United Kingdom
- Jayakrupakar Nallala
  - School of Physics and Astronomy, University of Exeter, Exeter, Devon, EX4 4PY, United Kingdom
- Chun-Chin Wang
  - School of Physics and Astronomy, University of Exeter, Exeter, Devon, EX4 4PY, United Kingdom
- Hannah Sheridan
  - School of Physics and Astronomy, University of Exeter, Exeter, Devon, EX4 4PY, United Kingdom
- Nicholas Stone
  - School of Physics and Astronomy, University of Exeter, Exeter, Devon, EX4 4PY, United Kingdom
4. Huang D, Wang CD, Lai JH, Kwoh CK. Toward Multidiversified Ensemble Clustering of High-Dimensional Data: From Subspaces to Metrics and Beyond. IEEE Transactions on Cybernetics 2022; 52:12231-12244. [PMID: 33961570 DOI: 10.1109/tcyb.2021.3049633] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Indexed: 06/12/2023]
Abstract
The rapid emergence of high-dimensional data in various areas has brought new challenges to current ensemble clustering research. To deal with the curse of dimensionality, considerable efforts in ensemble clustering have recently been made by means of different subspace-based techniques. However, besides the emphasis on subspaces, rather limited attention has been paid to the potential diversity in similarity/dissimilarity metrics. It remains a surprisingly open problem in ensemble clustering how to create and aggregate a large population of diversified metrics, and furthermore, how to jointly investigate the multilevel diversity in the large populations of metrics, subspaces, and clusters in a unified framework. To tackle this problem, this article proposes a novel multidiversified ensemble clustering approach. In particular, we create a large number of diversified metrics by randomizing a scaled exponential similarity kernel, which are then coupled with random subspaces to form a large set of metric-subspace pairs. Based on the similarity matrices derived from these metric-subspace pairs, an ensemble of diversified base clusterings can thereby be constructed. Furthermore, an entropy-based criterion is utilized to explore the cluster-wise diversity in ensembles, based on which three specific ensemble clustering algorithms are presented by incorporating three types of consensus functions. Extensive experiments are conducted on 30 high-dimensional datasets, including 18 cancer gene expression datasets and 12 image/speech datasets, which demonstrate the superiority of our algorithms over the state of the art. The source code is available at https://github.com/huangdonghere/MDEC.
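Two generic building blocks behind subspace-based ensemble clustering are drawing a random feature subspace and summarizing a set of base clusterings as a co-association matrix. The sketch below is not the MDEC implementation (that is in the repository linked above). It omits the randomized similarity kernel and the entropy-based criterion, and the function names and the `ratio` parameter are hypothetical.

```python
import random

def random_subspace(n_features, ratio, rng):
    """Pick a random subset of feature indices (at least one index)."""
    k = max(1, int(round(ratio * n_features)))
    return sorted(rng.sample(range(n_features), k))

def co_association(base_labelings):
    """For every pair of objects, return the fraction of base clusterings
    that place the two objects in the same cluster."""
    n = len(base_labelings[0])
    m = len(base_labelings)
    counts = [[0] * n for _ in range(n)]
    for labels in base_labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]

# Usage: three base clusterings of four objects (label values are arbitrary).
rng = random.Random(0)
subspace = random_subspace(10, 0.5, rng)
ca = co_association([[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]])
```

A consensus function can then treat `ca` as a similarity matrix, for example as input to hierarchical or graph-based clustering.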
5. Lu Y, Yu Z, Wang Y, Ma Z, Wong KC, Li X. GMHCC: High-throughput Analysis of Biomolecular Data using Graph-based Multiple Hierarchical Consensus Clustering. Bioinformatics 2022; 38:3020-3028. [PMID: 35451457 DOI: 10.1093/bioinformatics/btac290] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/03/2021] [Revised: 03/10/2022] [Accepted: 04/19/2022] [Indexed: 11/15/2022] Open
Abstract
MOTIVATION: Thanks to the development of high-throughput sequencing technologies, massive amounts of various biomolecular data have been accumulated, revolutionizing the study of genomics and molecular biology. One of the main challenges in analyzing these biomolecular data is to cluster their subtypes into subpopulations to facilitate subsequent downstream analysis. Recently, many clustering methods have been developed to address biomolecular data. However, these computational methods often suffer from limitations such as high dimensionality, data heterogeneity and noise.
RESULTS: In our study, we develop a novel Graph-based Multiple Hierarchical Consensus Clustering (GMHCC) method with an unsupervised graph-based feature ranking and a graph-based linking method to explore the multiple hierarchical information of the underlying partitions of the consensus clustering for multiple types of biomolecular data. We first propose a graph-based unsupervised feature ranking model that measures each feature by building a graph over pairwise features and then assigning each feature a rank. Subsequently, to maintain the diversity and robustness of basic partitions, we propose multiple diverse feature subsets to generate several basic partitions and then explore the hierarchical structures of the multiple basic partitions by refining the global consensus function. Finally, we develop a new graph-based linking method, which explicitly considers the relationships between clusters to generate the final partition. Experiments on multiple types of biomolecular data, including thirty-five cancer gene expression datasets and eight single-cell RNA-seq datasets, validate the effectiveness of our method over several state-of-the-art consensus clustering approaches. Furthermore, differential gene analysis, gene ontology enrichment analysis, and KEGG pathway analysis are conducted, providing novel insights into cell developmental lineages and characterization mechanisms.
AVAILABILITY: The source code is available at GitHub: https://github.com/yifuLu/GMHCC. The software and the supporting data can be downloaded from: https://figshare.com/articles/software/GMHCC/17111291.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yifu Lu
  - School of Artificial Intelligence, Jilin University, Changchun 130012, China
- Zhuohan Yu
  - School of Artificial Intelligence, Jilin University, Changchun 130012, China
- Yunhe Wang
  - School of Artificial Intelligence, Jilin University, Changchun 130012, China
- Zhiqiang Ma
  - School of Artificial Intelligence, Jilin University, Changchun 130012, China
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Hong Kong 999077, Hong Kong SAR
- Xiangtao Li
  - School of Artificial Intelligence, Jilin University, Changchun 130012, China
6. Sun JT, Zhang QY. Product typicality attribute mining method based on a topic clustering ensemble. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10163-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
7. Lung cancer prediction using multi-gene genetic programming by selecting automatic features from amino acid sequences. Comput Biol Chem 2022; 98:107638. [DOI: 10.1016/j.compbiolchem.2022.107638] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/03/2021] [Revised: 12/22/2021] [Accepted: 02/01/2022] [Indexed: 02/07/2023]
8. Zhou P, Du L, Liu X, Shen YD, Fan M, Li X. Self-Paced Clustering Ensemble. IEEE Transactions on Neural Networks and Learning Systems 2021; 32:1497-1511. [PMID: 32310800 DOI: 10.1109/tnnls.2020.2984814] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 05/27/2023]
Abstract
The clustering ensemble has emerged as an important extension of the classical clustering problem. It provides an elegant framework to integrate multiple weak base clusterings to generate a strong consensus result. Most existing clustering ensemble methods exploit all data to learn a consensus clustering result, which does not sufficiently consider the adverse effects caused by some difficult instances. To handle this problem, we propose a novel self-paced clustering ensemble (SPCE) method, which gradually incorporates instances into the ensemble learning, from easy to difficult. In our method, we integrate the evaluation of the difficulty of instances and ensemble learning into a unified framework, which can automatically estimate the difficulty of instances and ensemble the base clusterings. To optimize the corresponding objective function, we propose a joint learning algorithm to obtain the final consensus clustering result. Experimental results on benchmark data sets demonstrate the effectiveness of our method.
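The easy-to-difficult schedule can be illustrated with the standard hard-threshold form of self-paced learning, in which an instance takes part only once its loss falls below a growing threshold. This is a generic sketch rather than the SPCE objective: the abstract does not state the regularizer, so the threshold rule, the fixed per-instance losses, and the names `lam0` and `growth` are assumptions.

```python
def self_paced_weights(losses, lam):
    """Hard self-paced rule: weight 1 if the instance's loss is below lam, else 0."""
    return [1 if loss < lam else 0 for loss in losses]

def pace_schedule(losses, lam0, growth, n_rounds):
    """Return, for each round, the indices of the instances that are included
    as the threshold lam grows by a constant factor."""
    lam = lam0
    schedule = []
    for _ in range(n_rounds):
        weights = self_paced_weights(losses, lam)
        schedule.append([i for i, w in enumerate(weights) if w])
        lam *= growth
    return schedule

# Usage: instance 2 is the hardest and is admitted last.
rounds = pace_schedule([0.1, 0.5, 2.0, 0.9], lam0=0.3, growth=2.0, n_rounds=4)
```

In a joint learning procedure the losses would be re-estimated after every consensus update. They are held fixed here only to keep the schedule easy to follow.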
9. John CR, Watson D, Russ D, Goldmann K, Ehrenstein M, Pitzalis C, Lewis M, Barnes M. M3C: Monte Carlo reference-based consensus clustering. Sci Rep 2020; 10:1816. [PMID: 32020004 PMCID: PMC7000518 DOI: 10.1038/s41598-020-58766-1] [Citation(s) in RCA: 51] [Impact Index Per Article: 12.8] [Received: 08/09/2018] [Accepted: 01/10/2020] [Indexed: 11/24/2022] Open
Abstract
Genome-wide data is used to stratify patients into classes for precision medicine using clustering algorithms. A common problem in this area is selection of the number of clusters (K). The Monti consensus clustering algorithm is a widely used method which uses stability selection to estimate K. However, the method has bias towards higher values of K and yields high numbers of false positives. As a solution, we developed Monte Carlo reference-based consensus clustering (M3C), which is based on this algorithm. M3C simulates null distributions of stability scores for a range of K values thus enabling a comparison with real data to remove bias and statistically test for the presence of structure. M3C corrects the inherent bias of consensus clustering as demonstrated on simulated and real expression data from The Cancer Genome Atlas (TCGA). For testing M3C, we developed clusterlab, a new method for simulating multivariate Gaussian clusters.
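Comparing a real stability score against a simulated null distribution can be sketched as follows. This is not the M3C package code. The abstract does not name the stability score, so the PAC-style score here (the proportion of ambiguously clustered pairs, where lower means more stable) and the 0.1/0.9 cut-offs are assumptions, and the function names are hypothetical.

```python
def pac_score(consensus, lower=0.1, upper=0.9):
    """Proportion of object pairs whose consensus value lies strictly between
    `lower` and `upper`; lower values indicate a more stable clustering."""
    n = len(consensus)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    ambiguous = sum(1 for i, j in pairs if lower < consensus[i][j] < upper)
    return ambiguous / len(pairs)

def monte_carlo_p_value(real_score, null_scores):
    """Empirical p-value: the share of null simulations that are at least as
    stable as the real data (score <= real score), with the usual +1 correction."""
    as_extreme = sum(1 for s in null_scores if s <= real_score)
    return (1 + as_extreme) / (1 + len(null_scores))

# Usage: a 3-object consensus matrix and four simulated null scores.
consensus = [[1.0, 1.0, 0.5],
             [1.0, 1.0, 0.0],
             [0.5, 0.0, 1.0]]
real = pac_score(consensus)
p = monte_carlo_p_value(0.1, [0.5, 0.4, 0.05, 0.6])
```

Repeating this for each candidate K gives one p-value per K. A test of this kind is what allows the conclusion that the data contain no cluster structure at all.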
Affiliation(s)
- Christopher R John
  - Experimental Medicine and Rheumatology, William Harvey Research Institute, Bart's and The London School of Medicine and Dentistry, Queen Mary University of London, Charterhouse Square, London, EC1M 6BQ, United Kingdom
- David Watson
  - Oxford Internet Institute, University of Oxford, 1 St. Giles, OX1 3JS, Oxford, United Kingdom
- Dominic Russ
  - Experimental Medicine and Rheumatology, William Harvey Research Institute, Bart's and The London School of Medicine and Dentistry, Queen Mary University of London, Charterhouse Square, London, EC1M 6BQ, United Kingdom
- Katriona Goldmann
  - Experimental Medicine and Rheumatology, William Harvey Research Institute, Bart's and The London School of Medicine and Dentistry, Queen Mary University of London, Charterhouse Square, London, EC1M 6BQ, United Kingdom
- Michael Ehrenstein
  - Rayne Institute, University College London, 5 University Street, London, WC1E 6JF, United Kingdom
- Costantino Pitzalis
  - Experimental Medicine and Rheumatology, William Harvey Research Institute, Bart's and The London School of Medicine and Dentistry, Queen Mary University of London, Charterhouse Square, London, EC1M 6BQ, United Kingdom
- Myles Lewis
  - Experimental Medicine and Rheumatology, William Harvey Research Institute, Bart's and The London School of Medicine and Dentistry, Queen Mary University of London, Charterhouse Square, London, EC1M 6BQ, United Kingdom
- Michael Barnes
  - Experimental Medicine and Rheumatology, William Harvey Research Institute, Bart's and The London School of Medicine and Dentistry, Queen Mary University of London, Charterhouse Square, London, EC1M 6BQ, United Kingdom
10
|
|