1
|
Zhang Y, Cheung YM. Graph-Based Dissimilarity Measurement for Cluster Analysis of Any-Type-Attributed Data. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2023; 34:6530-6544. [PMID: 36094993 DOI: 10.1109/tnnls.2022.3202700] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Heterogeneous attribute data, composed of attributes with different types of values, are quite common in a variety of real-world applications. As data annotation is usually expensive, clustering provides a promising way of processing unlabeled data, where the adopted similarity measure plays a key role in determining the clustering accuracy. However, it is very challenging to appropriately define the similarity between data objects with heterogeneous attributes because the values of heterogeneous attributes generally have very different characteristics. Specifically, numerical attributes have quantitative values, while categorical attributes have qualitative values. Furthermore, categorical attributes can be divided into nominal and ordinal ones according to the order information of their values. To bridge the gap among the heterogeneous attributes, this article proposes a new dissimilarity metric for cluster analysis of such data. We first study the connections among the heterogeneous attributes and build graph representations for them. Then, a metric is proposed, which computes the dissimilarities between attribute values under the guidance of the graph structures. Finally, we develop a new k-means-type clustering algorithm associated with the proposed metric. The proposed method turns out to be capable of performing cluster analysis of datasets composed of an arbitrary combination of numerical, nominal, and ordinal attributes. Experimental results show its efficacy in comparison with its counterparts.
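To make the gap between attribute types concrete, the sketch below shows a simple Gower-style baseline, not the graph-based metric proposed in the article. It assumes numerical values are scaled by a known range, nominal values use simple matching, and ordinal values use a normalized rank difference. The function name `mixed_dissimilarity` and its parameters are hypothetical.

```python
from typing import Dict, List, Sequence

def mixed_dissimilarity(
    a: Sequence,
    b: Sequence,
    kinds: List[str],
    ranges: Dict[int, float],
    orders: Dict[int, List[str]],
) -> float:
    """Average per-attribute dissimilarity in [0, 1] (Gower-style baseline).

    kinds[j]  is "numerical", "nominal", or "ordinal" (assumed labels).
    ranges[j] is the value range of numerical attribute j.
    orders[j] lists the levels of ordinal attribute j from lowest to highest.
    """
    total = 0.0
    for j, kind in enumerate(kinds):
        if kind == "numerical":
            # Quantitative values: absolute difference scaled by the range.
            total += abs(a[j] - b[j]) / ranges[j]
        elif kind == "nominal":
            # Qualitative values without order: simple matching.
            total += 0.0 if a[j] == b[j] else 1.0
        elif kind == "ordinal":
            # Qualitative values with order: normalized rank difference.
            levels = orders[j]
            gap = abs(levels.index(a[j]) - levels.index(b[j]))
            total += gap / (len(levels) - 1)
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
    return total / len(kinds)

kinds = ["numerical", "nominal", "ordinal"]
ranges = {0: 10.0}
orders = {2: ["low", "medium", "high"]}
x = (2.0, "red", "low")
y = (7.0, "blue", "high")
d_xy = mixed_dissimilarity(x, y, kinds, ranges, orders)
```

Each attribute type is handled independently here, which is exactly the limitation the abstract points at: the baseline ignores any connection among the heterogeneous attributes.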
|
2
|
Naik D, Dharavath R, Qi L. Quantum-PSO based unsupervised clustering of users in social networks using attributes. CLUSTER COMPUTING 2023:1-19. [PMID: 37359059 PMCID: PMC10099026 DOI: 10.1007/s10586-023-03993-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2023] [Revised: 02/22/2023] [Accepted: 03/18/2023] [Indexed: 06/28/2023]
Abstract
Unsupervised cluster detection in social network analysis involves grouping social actors into distinct groups, each distinct from the others. Users in the clusters are semantically very similar to those in the same cluster and dissimilar to those in different clusters. Social network clustering reveals a wide range of useful information about users and has many applications in daily life. Various approaches are developed to find social network users' clusters, using only links or attributes and links. This work proposes a method for detecting social network users' clusters based solely on their attributes. In this case, users' attributes are considered categorical values. The most popular clustering algorithm used for categorical data is the K-mode algorithm. However, it may suffer from local optimum due to its random initialization of centroids. To overcome this issue, this manuscript proposes a methodology named the Quantum PSO approach based on user similarity maximization. In the proposed approach, firstly, dimensionality reduction is conducted by performing the relevant attribute set selection followed by redundant attribute removal. Secondly, the QPSO technique is used to maximize the similarity score between users to get clusters. Three different similarity measures are used separately to perform the dimensionality reduction and similarity maximization processes. Experiments are conducted on two popular social network datasets; ego-Twitter, and ego-Facebook. The results show that the proposed approach performs better clustering results in terms of three different performance metrics than K-Mode and K-Mean algorithms.
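For readers unfamiliar with the K-mode baseline mentioned in this abstract, here is a minimal sketch of the standard k-modes loop with simple matching dissimilarity. It is not the Quantum PSO method of the paper. The initial modes are passed in explicitly (an assumption made for reproducibility), which also makes it easy to see how the result depends on initialization. The names `matching_distance` and `k_modes` are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

def matching_distance(x: Sequence[str], y: Sequence[str]) -> int:
    """Number of attributes on which two categorical objects differ."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

def k_modes(
    data: List[Tuple[str, ...]],
    initial_modes: List[Tuple[str, ...]],
    max_iter: int = 100,
) -> Tuple[List[int], List[Tuple[str, ...]]]:
    """Alternate assignment and mode update until labels stop changing."""
    modes = [tuple(m) for m in initial_modes]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest mode.
        new_labels = [
            min(range(len(modes)), key=lambda c: matching_distance(row, modes[c]))
            for row in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: the new mode takes the most frequent value per attribute.
        for c in range(len(modes)):
            members = [row for row, lab in zip(data, labels) if lab == c]
            if members:  # keep the old mode if a cluster becomes empty
                modes[c] = tuple(
                    Counter(column).most_common(1)[0][0] for column in zip(*members)
                )
    return labels, modes

data = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z"), ("b", "z"), ("c", "z")]
labels, modes = k_modes(data, initial_modes=[("a", "x"), ("b", "z")])
```

Because each step only moves to a nearby better configuration, a poor choice of `initial_modes` can leave the loop in a local optimum, which is the weakness the abstract sets out to address.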
Affiliation(s)
- Lianyong Qi
  - China University of Petroleum (East China), Dongying, China
|
3
|
Jiang Z, Liu X, Zang W. A kernel-based intuitionistic weight fuzzy k-modes algorithm using coupled chained P system combines DNA genetic rules for categorical data. Neurocomputing 2023. [DOI: 10.1016/j.neucom.2023.01.020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2023]
|
4
|
Li Q, Ji S, Hu S, Yu Y, Chen S, Xiong Q, Zeng Z. A Multi-View Deep Metric Learning approach for Categorical Representation on mixed data. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.110161] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/05/2022]
|
5
|
Liu H, Liu H, Li J, Wang Y. Review of Recent Modern Analytical Technology Combined with Chemometrics Approach Researches on Mushroom Discrimination and Evaluation. Crit Rev Anal Chem 2022; 54:1560-1583. [PMID: 36154534 DOI: 10.1080/10408347.2022.2124839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/14/2022]
Abstract
Mushrooms are macrofungi with prized fruiting bodies; as a food, a tonic, and a medicine, they have been discovered and used by humans for thousands of years. Nowadays, the mushroom is also a "super food" recommended by the World Health Organization (WHO) and the Food and Agriculture Organization (FAO), and is favored by consumers. Discrimination of mushrooms, including species, geographic origin, storage time, etc., is an important prerequisite for ensuring their edible safety and commodity quality. Moreover, effective evaluation of their chemical composition can help us better understand the nutritional properties of mushrooms. Modern analytical technologies such as chromatography, spectroscopy, and mass spectrometry are widely used in mushroom discrimination and evaluation research, and chemometrics is an effective means of scientifically processing the multidimensional information hidden in the data these technologies produce. This review outlines the latest applications of modern analytical technology combined with chemometrics in the qualitative and quantitative analysis and quality control of mushrooms in recent years. The basic principles of these technologies are briefly described, and the analytical processes of common chemometric methods in mushroom research are summarized. Finally, the limitations and application prospects of chromatography, spectroscopy, and mass spectrometry in mushroom quality control and evaluation are discussed.
Affiliation(s)
- Hong Liu
  - College of Agronomy and Biotechnology, Yunnan Agricultural University, Kunming, China
  - Medicinal Plants Research Institute, Yunnan Academy of Agricultural Sciences, Kunming, China
- Honggao Liu
  - College of Agronomy and Biotechnology, Yunnan Agricultural University, Kunming, China
  - Zhaotong University, Zhaotong, China
- Jieqing Li
  - College of Agronomy and Biotechnology, Yunnan Agricultural University, Kunming, China
- Yuanzhong Wang
  - Medicinal Plants Research Institute, Yunnan Academy of Agricultural Sciences, Kunming, China
|
6
|
Wang J, Ma Z, Nie F, Li X. Fast Self-Supervised Clustering With Anchor Graph. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022; 33:4199-4212. [PMID: 33587715 DOI: 10.1109/tnnls.2021.3056080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Because it avoids the use of labeled samples, which are usually insufficient in the real world, unsupervised learning has been regarded as a speedy and powerful strategy for clustering tasks. However, clustering directly from primal data sets leads to high computational cost, which limits its application to large-scale and high-dimensional problems. Recently, anchor-based theories have been proposed to partly mitigate this problem and yield a naturally sparse affinity matrix, but it is still a challenge to obtain excellent performance along with high efficiency. To address this issue, we first present a fast semisupervised framework (FSSF) combined with a balanced K-means-based hierarchical K-means (BKHK) method and bipartite graph theory. Thereafter, we propose a fast self-supervised clustering method built on this semisupervised framework, in which all labels are inferred from a constructed bipartite graph with exactly k connected components. The proposed method remarkably accelerates general semisupervised learning through the anchors and consists of four parts: 1) obtaining the anchor set as an interim result through the BKHK algorithm; 2) constructing the bipartite graph; 3) solving the self-supervised problem to construct a typical probability model with FSSF; and 4) selecting the most representative points with respect to the anchors from BKHK as an interim result and conducting label propagation. The experimental results on toy examples and benchmark data sets demonstrate that the proposed method outperforms other approaches.
|
7
|
Zhang Y, Cheung YM. Learnable Weighting of Intra-Attribute Distances for Categorical Data Clustering with Nominal and Ordinal Attributes. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2022; 44:3560-3576. [PMID: 33534702 DOI: 10.1109/tpami.2021.3056510] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/12/2023]
Abstract
The success of categorical data clustering generally much relies on the distance metric that measures the dissimilarity degree between two objects. However, most of the existing clustering methods treat the two categorical subtypes, i.e., nominal and ordinal attributes, in the same way when calculating the dissimilarity without considering the relative order information of the ordinal values. Moreover, there would exist interdependence among the nominal and ordinal attributes, which is worth exploring for indicating the dissimilarity. This paper will therefore study the intrinsic difference and connection of nominal and ordinal attribute values from a perspective akin to the graph. Accordingly, we propose a novel distance metric to measure the intra-attribute distances of nominal and ordinal attributes in a unified way, meanwhile preserving the order relationship among ordinal values. Subsequently, we propose a new clustering algorithm to make the learning of intra-attribute distance weights and partitions of data objects into a single learning paradigm rather than two separate steps, whereby circumventing a suboptimal solution. Experiments show the efficacy of the proposed algorithm in comparison with the existing counterparts.
|
8
|
Li Y, Fan X, Gaussier E. Supervised Categorical Metric Learning With Schatten p-Norms. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:2059-2069. [PMID: 32697727 DOI: 10.1109/tcyb.2020.3004437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Metric learning has been successful in learning new metrics adapted to numerical datasets. However, its development for categorical data still needs further exploration. In this article, we propose a method, called CPML for categorical projected metric learning, which tries to efficiently (i.e., with less computational time and better prediction accuracy) address the problem of metric learning in categorical data. We make use of the value distance metric to represent our data and propose new distances based on this representation. We then show how to efficiently learn new metrics. We also generalize several previous regularizers through the Schatten p-norm and provide a generalization bound for it that complements the standard generalization bound for metric learning. The experimental results show that our method provides state-of-the-art results while being faster.
|
9
|
Semi-Lipschitz functions and machine learning for discrete dynamical systems on graphs. Mach Learn 2022. [DOI: 10.1007/s10994-022-06130-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Abstract
Consider a directed tree $\mathcal{U}$ and the space of all finite walks on it endowed with a quasi-pseudo-metric (the space of the strategies $\mathcal{S}$ on the graph), which represent the possible changes in the evolution of a dynamical system over time. Consider a reward function acting in a subset $\mathcal{S}_0 \subset \mathcal{S}$ which measures the success. Using well-known facts of the theory of semi-Lipschitz functions in quasi-pseudo-metric spaces, we extend the reward function to the whole space $\mathcal{S}$. We obtain in this way an oracle function, which gives a forecast of the reward function for the elements of $\mathcal{S}$, that is, an estimate of the degree of success for any given strategy. After explaining the fundamental properties of a specific quasi-pseudo-metric that we define for the (graph) trees (the bifurcation quasi-pseudo-metric), we focus our attention on analyzing how this structure can be used to represent dynamical systems on graphs. We begin the explanation of the method with a simple example, which is proposed as a reference point for which some variants and successive generalizations are consecutively shown. The main objective is to explain the role of the lack of symmetry of quasi-metrics in our proposal: the irreversibility of dynamical processes is reflected in the asymmetry of their definition.
|
10
|
Zhang Y, Cheung YM. A New Distance Metric Exploiting Heterogeneous Interattribute Relationship for Ordinal-and-Nominal-Attribute Data Clustering. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:758-771. [PMID: 32340972 DOI: 10.1109/tcyb.2020.2983073] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 06/11/2023]
Abstract
Ordinal attribute has all the common characteristics of a nominal one but it differs from the nominal one by having naturally ordered possible values (also called categories interchangeably). In clustering analysis tasks, categorical data composed of both ordinal and nominal attributes (also called mixed-categorical data interchangeably) are common. Under this circumstance, existing distance and similarity measures suffer from at least one of the following two drawbacks: 1) directly treat ordinal attributes as nominal ones, and thus ignore the order information from them and 2) suppose all the attributes are independent of each other, measure the distance between two categories from a target attribute without considering the valuable information provided by the other attributes that correlate with the target one. These two drawbacks may twist the natural distances of attributes and further lead to unsatisfactory clustering results. This article, therefore, presents an entropy-based distance metric that quantifies the distance between categories by exploiting the information provided by different attributes that correlate with the target one. It also preserves the order relationship among ordinal categories during the distance measurement. Since attributes are usually correlated in different degrees, we also define the interdependence between different types of attributes to weight their contributions in forming distances. The proposed metric overcomes the two above-mentioned drawbacks for mixed-categorical data clustering. More important, it conceptually unifies the distances of ordinal and nominal attributes to avoid information loss during clustering. Moreover, it is parameter free, and will not bring extra computational cost compared to the existing state-of-the-art counterparts. Extensive experiments show the superiority of the proposed distance metric.
|
11
|
Zhu C, Cao L, Yin J. Unsupervised Heterogeneous Coupling Learning for Categorical Representation. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2022; 44:533-549. [PMID: 32750827 DOI: 10.1109/tpami.2020.3010953] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 06/11/2023]
Abstract
Complex categorical data is often hierarchically coupled with heterogeneous relationships between attributes and attribute values and the couplings between objects. Such value-to-object couplings are heterogeneous with complementary and inconsistent interactions and distributions. Limited research exists on unlabeled categorical data representations, ignores the heterogeneous and hierarchical couplings, underestimates data characteristics and complexities, and overuses redundant information, etc. The deep representation learning of unlabeled categorical data is challenging, overseeing such value-to-object couplings, complementarity and inconsistency, and requiring large data, disentanglement, and high computational power. This work introduces a shallow but powerful UNsupervised heTerogeneous couplIng lEarning (UNTIE) approach for representing coupled categorical data by untying the interactions between couplings and revealing heterogeneous distributions embedded in each type of couplings. UNTIE is efficiently optimized w.r.t. a kernel k-means objective function for unsupervised representation learning of heterogeneous and hierarchical value-to-object couplings. Theoretical analysis shows that UNTIE can represent categorical data with maximal separability while effectively represent heterogeneous couplings and disclose their roles in categorical data. The UNTIE-learned representations make significant performance improvement against the state-of-the-art categorical representations and deep representation models on 25 categorical data sets with diversified characteristics.
|
12
|
Machine learning algorithm for feature space clustering of mixed data with missing information based on molecule similarity. J Biomed Inform 2021; 125:103954. [PMID: 34793972 DOI: 10.1016/j.jbi.2021.103954] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/26/2021] [Revised: 11/04/2021] [Accepted: 11/08/2021] [Indexed: 11/23/2022]
Abstract
Clustering algorithms have recently attracted significant attention in machine learning applications owing to their effectiveness. Nevertheless, existing algorithms still have several issues that need to be resolved. For example, most existing algorithms transform one type of feature into another, which disregards the distinct properties of the information. In addition, most of them consider all features, which may make computation difficult and result in sub-optimal performance. To address these difficulties, this paper proposes a novel technique for clustering categorical and numerical features based on feature space clustering of mixed data with missing information (FSCMMI). The procedure involves three stages. First, FSCMMI divides the given dataset according to the missing information in instances and the feature types. The second stage uses a decision-tree procedure to identify the association between instances. Finally, the third stage computes the closeness measure for numerical and categorical features. We also propose a new training algorithm to cluster mixed datasets. Extensive experimental results on benchmark datasets show that the proposed FSCMMI outperforms several state-of-the-art clustering methods in terms of accuracy and efficiency.
|
13
|
Mau TN, Huynh VN. An LSH-based k-representatives clustering method for large categorical data. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.08.050] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/29/2022]
|
14
|
Li W, Meng X, Huang Y. Fitness distance correlation and mixed search strategy for differential evolution. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2019.12.141] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Indexed: 11/25/2022]
|
15
|
Dorman KS, Maitra R. An efficient k-modes algorithm for clustering categorical datasets. Stat Anal Data Min 2021. [DOI: 10.1002/sam.11546] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/11/2022]
Affiliation(s)
- Karin S. Dorman
  - Department of Genetics Development and Cell Biology, Iowa State University, Ames, Iowa, USA
  - Department of Statistics, Iowa State University, Ames, Iowa, USA
- Ranjan Maitra
  - Department of Statistics, Iowa State University, Ames, Iowa, USA
|
16
|
Kuo R, Zheng Y, Nguyen TPQ. Metaheuristic-based possibilistic fuzzy k-modes algorithms for categorical data clustering. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2020.12.051] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 10/22/2022]
|
17
|
Lopez-Martin M, Carro B, Arribas JI, Sanchez-Esguevillas A. Network intrusion detection with a novel hierarchy of distances between embeddings of hash IP addresses. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.106887] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 11/16/2022]
|
18
|
Abstract
Comparing data objects is at the heart of machine learning. For continuous data, object dissimilarity is usually taken to be object distance; however, for categorical data, there is no universal agreement, for categories can be ordered in several different ways. Most existing category dissimilarity measures characterize the distance among the values an attribute may take using precisely the number of different values the attribute takes (the attribute space) and the frequency at which they occur. These kinds of measures overlook attribute interdependence, which may provide valuable information when capturing per-attribute object dissimilarity. In this paper, we introduce a novel object dissimilarity measure that we call Learning-Based Dissimilarity, for comparing categorical data. Our measure characterizes the distance between two categorical values of a given attribute in terms of how likely it is that such values are confused or not when all the dataset objects with the remaining attributes are used to predict them. To that end, we provide an algorithm that, given a target attribute, first learns a classification model in order to compute a confusion matrix for the attribute. Then, our method transforms the confusion matrix into a per-attribute dissimilarity measure. We have successfully tested our measure against 55 datasets gathered from the University of California, Irvine (UCI) Machine Learning Repository. Our results show that it surpasses, in terms of various performance indicators for data clustering, the most prominent distance relations put forward in the literature.
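The step from a confusion matrix to a per-attribute dissimilarity can be illustrated with a small sketch. The abstract does not state the exact transformation, so the rule used here is an assumption and not the authors' formula: row-normalize the matrix, then set the dissimilarity to one minus the symmetrized confusion rate. The function name `confusion_to_dissimilarity` is hypothetical, and the confusion matrix is supplied directly instead of being produced by a trained classifier.

```python
from typing import List

def confusion_to_dissimilarity(confusion: List[List[int]]) -> List[List[float]]:
    """Turn a confusion matrix over attribute values into a dissimilarity matrix.

    confusion[i][j] counts objects whose true value is i and predicted value is j.
    Assumed rule (not taken from the paper): values that are often confused
    with each other are treated as close.
    """
    n = len(confusion)
    # Row-normalize so that rates[i][j] is the share of value i predicted as j.
    rates = []
    for row in confusion:
        total = sum(row)
        rates.append([count / total if total else 0.0 for count in row])
    dissimilarity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Average both directions so the result is symmetric.
            confused = 0.5 * (rates[i][j] + rates[j][i])
            dissimilarity[i][j] = dissimilarity[j][i] = 1.0 - confused
    return dissimilarity

# Values 0 and 1 are confused 20% of the time; value 2 is never confused.
confusion = [[8, 2, 0], [2, 8, 0], [0, 0, 10]]
dissimilarity = confusion_to_dissimilarity(confusion)
```

In this toy matrix the first two values end up closer to each other than either is to the third, which matches the intuition in the abstract that easily confused values should be treated as similar.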
|
19
|
Alcaide D, Aerts J. A visual analytic approach for the identification of ICU patient subpopulations using ICD diagnostic codes. PeerJ Comput Sci 2021; 7:e430. [PMID: 33954230 PMCID: PMC8049127 DOI: 10.7717/peerj-cs.430] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/17/2020] [Accepted: 02/15/2021] [Indexed: 05/03/2023]
Abstract
A large number of clinical concepts are categorized under standardized formats that ease the manipulation, understanding, analysis, and exchange of information. One of the most extended codifications is the International Classification of Diseases (ICD) used for characterizing diagnoses and clinical procedures. With formatted ICD concepts, a patient profile can be described through a set of standardized and sorted attributes according to the relevance or chronology of events. This structured data is fundamental to quantify the similarity between patients and detect relevant clinical characteristics. Data visualization tools allow the representation and comprehension of data patterns, usually of a high dimensional nature, where only a partial picture can be projected. In this paper, we provide a visual analytics approach for the identification of homogeneous patient cohorts by combining custom distance metrics with a flexible dimensionality reduction technique. First we define a new metric to measure the similarity between diagnosis profiles through the concordance and relevance of events. Second we describe a variation of the Simplified Topological Abstraction of Data (STAD) dimensionality reduction technique to enhance the projection of signals preserving the global structure of data. The MIMIC-III clinical database is used for implementing the analysis into an interactive dashboard, providing a highly expressive environment for the exploration and comparison of patients groups with at least one identical diagnostic ICD code. The combination of the distance metric and STAD not only allows the identification of patterns but also provides a new layer of information to establish additional relationships between patient cohorts. The method and tool presented here add a valuable new approach for exploring heterogeneous patient populations. In addition, the distance metric described can be applied in other domains that employ ordered lists of categorical data.
Affiliation(s)
- Daniel Alcaide
  - Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven, Leuven, Belgium
- Jan Aerts
  - Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven, Leuven, Belgium
  - UHasselt, I-BioStat, Data Science Institute, Hasselt, Belgium
|
20
|
Lu Y, Cheung YM, Tang YY. Self-Adaptive Multiprototype-Based Competitive Learning Approach: A k-Means-Type Algorithm for Imbalanced Data Clustering. IEEE TRANSACTIONS ON CYBERNETICS 2021; 51:1598-1612. [PMID: 31150353 DOI: 10.1109/tcyb.2019.2916196] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 06/09/2023]
Abstract
The class imbalance problem has been extensively studied in recent years, but imbalanced data clustering in an unsupervised environment, that is, where the number of samples among clusters is imbalanced, has yet to be well studied. This paper, therefore, studies the imbalanced data clustering problem within the framework of k-means-type competitive learning. We introduce a new method called self-adaptive multiprototype-based competitive learning (SMCL) for imbalanced clusters. It uses multiple subclusters to represent each cluster with an automatic adjustment of the number of subclusters. Then, the subclusters are merged into the final clusters based on a novel separation measure. We also propose a new internal clustering validation measure to determine the number of final clusters during the merging process for imbalanced clusters. The advantages of SMCL are threefold: 1) it inherits the advantages of competitive learning and is also applicable to imbalanced data clustering; 2) the self-adaptive multiprototype mechanism uses a proper number of subclusters to represent each cluster of arbitrary shape; and 3) it automatically determines the number of clusters for imbalanced clusters. SMCL is compared with the existing counterparts for imbalanced clustering on synthetic and real datasets. The experimental results show the efficacy of SMCL for imbalanced clusters.
|
21
|
A Novel Consensus Fuzzy K-Modes Clustering Using Coupling DNA-Chain-Hypergraph P System for Categorical Data. Processes (Basel) 2020. [DOI: 10.3390/pr8101326] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022]
Abstract
In this paper, a data clustering method named consensus fuzzy k-modes clustering is proposed to improve clustering performance on categorical data. At the same time, the coupling DNA-chain-hypergraph P system is constructed to carry out the clustering process. This P system can prevent the clustering algorithm from falling into a local optimum and realizes the clustering process with implicit parallelism. The consensus fuzzy k-modes algorithm combines the advantages of the fuzzy k-modes algorithm, the weight fuzzy k-modes algorithm, and the genetic fuzzy k-modes algorithm. The fuzzy k-modes algorithm produces a soft partition, which is closer to reality, but treats all the variables equally. The weight fuzzy k-modes algorithm introduces a weight vector, which strengthens basic k-modes clustering by associating higher weights with features useful in the analysis. These two methods are only improvements to the k-modes algorithm itself. The genetic k-modes algorithm was therefore proposed, which uses genetic operations in the clustering process. In this paper, we examine these three kinds of k-modes algorithms and further introduce DNA genetic optimization operations in the final consensus process. Finally, we conduct experiments on seven UCI datasets and compare the clustering results with four other categorical clustering algorithms. The experimental and statistical test results show that our method obtains better clustering results than the compared clustering algorithms.
|
22
|
Cluster analysis application to identify groups of individuals with high health expenditures. HEALTH SERVICES AND OUTCOMES RESEARCH METHODOLOGY 2020. [DOI: 10.1007/s10742-020-00214-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]
|
23
|
Abstract
The prediction of RNA structures involves some NP-hard problems. Predicting the folding structure of an RNA nucleotide sequence remains an unsolved challenge. We investigate a computing algorithm for RNA folding structure prediction based on extended structures and the basin hopping graph, a computing model for RNA folding structure prediction that includes pseudoknots. This study presents a prediction algorithm based on extended structures and also proposes an improved computing algorithm based on the barrier tree and the basin hopping graph, which are attractive approaches in RNA folding structure prediction. Many experiments were carried out on the Rfam14.1 and PseudoBase databases; the experimental results show that our two algorithms are more efficient and accurate than other existing algorithms.
Collapse
Affiliation(s)
- Zhendong Liu
  - School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, P. R. China
  - Department of Biostatistics, University of California, Los Angeles, Los Angeles 90095, USA
  - Department of Statistics, Harvard University, Cambridge, MA 02138, USA
- Gang Li
  - Department of Biostatistics, University of California, Los Angeles, Los Angeles 90095, USA
- Jun S. Liu
  - Department of Statistics, Harvard University, Cambridge, MA 02138, USA
|
24
|
Luo S, Miao D, Zhang Z, Zhang Y, Hu S. A neighborhood rough set model with nominal metric embedding. Inf Sci (N Y) 2020. [DOI: 10.1016/j.ins.2020.02.015] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
|
25
|
Dong B, Jian S, Zuo K. CDE++: Learning Categorical Data Embedding by Enhancing Heterogeneous Feature Value Coupling Relationships. ENTROPY (BASEL, SWITZERLAND) 2020; 22:e22040391. [PMID: 33286165 PMCID: PMC7516865 DOI: 10.3390/e22040391] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 03/04/2020] [Revised: 03/21/2020] [Accepted: 03/27/2020] [Indexed: 06/12/2023]
Abstract
Categorical data are ubiquitous in machine learning tasks, and the representation of categorical data plays an important role in the learning performance. The heterogeneous coupling relationships between features and feature values reflect the characteristics of real-world categorical data, which need to be captured in the representations. The paper proposes an enhanced categorical data embedding method, i.e., CDE++, which captures the heterogeneous feature value coupling relationships in the representations. Based on information theory and the hierarchical couplings defined in our previous work CDE (Categorical Data Embedding by learning hierarchical value coupling), CDE++ adopts mutual information and margin entropy to capture feature couplings and designs a hybrid clustering strategy to capture multiple types of feature value clusters. Moreover, an autoencoder is used to learn non-linear couplings between features and value clusters. The categorical data embeddings generated by CDE++ are low-dimensional numerical vectors which are directly applied to clustering and classification and achieve the best performance compared with other categorical representation learning methods. Parameter sensitivity and scalability tests are also conducted to demonstrate the superiority of CDE++.
|
26
|
Cheng L, Wang Y, Ma X. An end-to-end distance measuring for mixed data based on deep relevance learning. INTELL DATA ANAL 2020. [DOI: 10.3233/ida-184399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Affiliation(s)
- Li Cheng
  - Science and Technology on Parallel and Distributed Laboratory College of Computer, National University of Defense Technology, Changsha, Hunan, China
- Yijie Wang
  - Science and Technology on Parallel and Distributed Laboratory College of Computer, National University of Defense Technology, Changsha, Hunan, China
- Xingkong Ma
  - College of Computer, National University of Defense Technology, Changsha, Hunan, China
|
27
|
Zhang Y, Cheung YM, Tan KC. A Unified Entropy-Based Distance Metric for Ordinal-and-Nominal-Attribute Data Clustering. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2020; 31:39-52. [PMID: 30908240 DOI: 10.1109/tnnls.2019.2899381] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Ordinal data are common in many data mining and machine learning tasks. Compared to nominal data, the possible values (also called categories interchangeably) of an ordinal attribute are naturally ordered. Nevertheless, since the data values are not quantitative, the distance between two categories of an ordinal attribute is generally not well defined, which surely has a serious impact on the result of the quantitative analysis if an inappropriate distance metric is utilized. From the practical perspective, ordinal-and-nominal-attribute categorical data, i.e., categorical data associated with a mixture of nominal and ordinal attributes, is common, but the distance metric for such data has yet to be well explored in the literature. In this paper, within the framework of clustering analysis, we therefore first propose an entropy-based distance metric for ordinal attributes, which exploits the underlying order information among categories of an ordinal attribute for the distance measurement. Then, we generalize this distance metric and propose a unified one accordingly, which is applicable to ordinal-and-nominal-attribute categorical data. Compared with the existing metrics proposed for categorical data, the proposed metric is simple to use and nonparametric. More importantly, it reasonably exploits the underlying order information of ordinal attributes and statistical information of nominal attributes for distance measurement. Extensive experiments show that the proposed metric outperforms the existing counterparts on both the real and benchmark data sets.
|
28
|
|
29
|
Pedronette DCG, Valem LP, Almeida J, da S Torres R. Multimedia Retrieval Through Unsupervised Hypergraph-Based Manifold Ranking. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2019; 28:5824-5838. [PMID: 31180856 DOI: 10.1109/tip.2019.2920526] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Accurately ranking images and multimedia objects is of paramount relevance in many retrieval and learning tasks. Manifold learning methods have been investigated for ranking mainly due to their capacity to take into account the intrinsic global manifold structure. In this paper, a novel manifold ranking algorithm based on hypergraphs is proposed for unsupervised multimedia retrieval tasks. Different from traditional graph-based approaches, which represent only pairwise relationships, hypergraphs are capable of modeling similarity relationships among a set of objects. The proposed approach uses the hyperedges for constructing a contextual representation of data samples and exploits the encoded information for deriving a more effective similarity function. An extensive experimental evaluation was conducted on nine public datasets including diverse retrieval scenarios and multimedia content. Experimental results demonstrate that high effectiveness gains can be obtained in comparison with the state-of-the-art methods.
|
30
|
Nguyen B, De Baets B. Kernel-Based Distance Metric Learning for Supervised k -Means Clustering. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2019; 30:3084-3095. [PMID: 30668483 DOI: 10.1109/tnnls.2018.2890021] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Finding an appropriate distance metric that accurately reflects the (dis)similarity between examples is a key to the success of k-means clustering. While it is not always an easy task to specify a good distance metric, we can try to learn one based on prior knowledge from some available clustered data sets, an approach that is referred to as supervised clustering. In this paper, a kernel-based distance metric learning method is developed to improve the practical use of k-means clustering. Given the corresponding optimization problem, we derive a meaningful Lagrange dual formulation and introduce an efficient algorithm in order to reduce the training complexity. Our formulation is simple to implement, allowing a large-scale distance metric learning problem to be solved in a computationally tractable way. Experimental results show that the proposed method yields more robust and better performances on synthetic as well as real-world data sets compared to other state-of-the-art distance metric learning methods.
|
31
|
Nguyen TPQ, Kuo R. Partition-and-merge based fuzzy genetic clustering algorithm for categorical data. Appl Soft Comput 2019. [DOI: 10.1016/j.asoc.2018.11.028] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
32
|
Kuo R, Nguyen TPQ. Genetic intuitionistic weighted fuzzy k-modes algorithm for categorical data. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2018.11.016] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
33
|
Huo J, Gao Y, Shi Y, Yin H. Cross-Modal Metric Learning for AUC Optimization. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2018; 29:4844-4856. [PMID: 29993954 DOI: 10.1109/tnnls.2017.2769128] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Cross-modal metric learning (CML) deals with learning distance functions for cross-modal data matching. The existing methods mostly focus on minimizing a loss defined on sample pairs. However, the numbers of intraclass and interclass sample pairs can be highly imbalanced in many applications, and this can lead to deteriorating or unsatisfactory performances. The area under the receiver operating characteristic curve (AUC) is a more meaningful performance measure for the imbalanced distribution problem. To tackle the problem as well as to make samples from different modalities directly comparable, a CML method is presented that directly maximizes AUC. The method can be further extended to focus on optimizing partial AUC (pAUC), which is the AUC between two specific false positive rates (FPRs). This is particularly useful in certain applications where only the performances assessed within predefined false positive ranges are critical. The proposed method is formulated as a log-determinant regularized semidefinite optimization problem. For efficient optimization, a minibatch proximal point algorithm is developed. The algorithm is experimentally verified to be stable with respect to the size of the sampled pairs that form a minibatch at each iteration. Several data sets have been used in evaluation, including three cross-modal data sets on face recognition under various scenarios and a single modal data set, the Labeled Faces in the Wild. Results demonstrate the effectiveness of the proposed methods and marked improvements over the existing methods. Specifically, pAUC-optimized CML proves to be more competitive for performance measures such as Rank-1 and verification rate at FPR = 0.1%.
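To make the role of AUC in this abstract concrete, the following is a minimal sketch of the empirical AUC as the fraction of correctly ordered (positive, negative) score pairs. It is not the optimization method of the paper; the function name `pairwise_auc` and the toy scores are assumptions made here for illustration. Because the value is normalized by the number of pairs, it does not depend on how imbalanced the two groups are.

```python
from typing import Sequence


def pairwise_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Empirical AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair, the usual convention.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy example: similarity scores of matching (positive) pairs and
# non-matching (negative) pairs under some learned metric.
value = pairwise_auc([0.9, 0.8], [0.7, 0.85])
```

In this toy example three of the four (positive, negative) comparisons are ordered correctly, so the AUC is 0.75.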
|
34
|
Huang JZ. An Algorithm for Clustering Categorical Data With Set-Valued Features. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2018; 29:4593-4606. [PMID: 29990068 DOI: 10.1109/tnnls.2017.2770167] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
In data mining, objects are often represented by a set of features, where each feature of an object has only one value. However, in reality, some features can take on multiple values, for instance, a person with several job titles, hobbies, and email addresses. These features can be referred to as set-valued features and are often treated with dummy features when using existing data mining algorithms to analyze data with set-valued features. In this paper, we propose an SV-k-modes algorithm that clusters categorical data with set-valued features. In this algorithm, a distance function is defined between two objects with set-valued features, and a set-valued mode representation of cluster centers is proposed. We develop a heuristic method to update cluster centers in the iterative clustering process and an initialization algorithm to select the initial cluster centers. The convergence and complexity of the SV-k-modes algorithm are analyzed. Experiments are conducted on both synthetic data and real data from five different applications. The experimental results have shown that the SV-k-modes algorithm performs better when clustering real data than do three other categorical clustering algorithms and that the algorithm is scalable to large data.
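The idea of a distance between objects with set-valued features can be sketched as follows. The abstract does not state the distance function, so this sketch assumes a Jaccard distance per feature, summed over features; the function names and the example records are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def set_valued_distance(x: Sequence[Set[str]], y: Sequence[Set[str]]) -> float:
    """Sum of per-feature Jaccard distances (an assumed aggregation)."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of features")
    return sum(jaccard_distance(a, b) for a, b in zip(x, y))


# Two people described by job titles and hobbies, each a set of values.
person_1 = [{"engineer", "manager"}, {"chess"}]
person_2 = [{"engineer"}, {"chess", "golf"}]
d = set_valued_distance(person_1, person_2)
```

Here each feature shares one value out of two in the union, so each contributes 0.5 and the total distance is 1.0. Unlike dummy-feature encoding, the comparison operates on the value sets directly.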
|
35
|
Jia H, Cheung YM. Subspace Clustering of Categorical and Numerical Data With an Unknown Number of Clusters. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2018; 29:3308-3325. [PMID: 28792907 DOI: 10.1109/tnnls.2017.2728138] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
In clustering analysis, data attributes may have different contributions to the detection of various clusters. To solve this problem, the subspace clustering technique has been developed, which aims at grouping the data objects into clusters based on subsets of attributes rather than the entire data space. However, most existing subspace clustering methods are only applicable to either numerical or categorical data, but not both. This paper, therefore, studies the soft subspace clustering of data with both numerical and categorical attributes (simply called mixed data for short). Specifically, an attribute-weighted clustering model based on the definition of object-cluster similarity is presented. Accordingly, a unified weighting scheme for the numerical and categorical attributes is proposed, which quantifies the attribute-to-cluster contribution by taking into account both intercluster difference and intracluster similarity. Moreover, a rival penalized competitive learning mechanism is further introduced into the proposed soft subspace clustering algorithm so that the subspace cluster structure as well as the most appropriate number of clusters can be learned simultaneously in a single learning paradigm. In addition, an initialization-oriented method is also presented, which can effectively improve the stability and accuracy of k-means-type clustering methods on numerical, categorical, and mixed data. The experimental results on different benchmark data sets show the efficacy of the proposed approach.
|
36
|
|
37
|
Guo G, Chen L, Ye Y, Jiang Q. Cluster Validation Method for Determining the Number of Clusters in Categorical Sequences. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2017; 28:2936-2948. [PMID: 28114078 DOI: 10.1109/tnnls.2016.2608354] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Cluster validation, which is the process of evaluating the quality of clustering results, plays an important role for practical machine learning systems. Categorical sequences, such as biological sequences in computational biology, have become common in real-world applications. Different from previous studies, which mainly focused on attribute-value data, in this paper, we work on the cluster validation problem for categorical sequences. The evaluation of sequences clustering is currently difficult due to the lack of an internal validation criterion defined with regard to the structural features hidden in sequences. To solve this problem, in this paper, a novel cluster validity index (CVI) is proposed as a function of clustering, with the intracluster structural compactness and intercluster structural separation linearly combined to measure the quality of sequence clusters. A partition-based algorithm for robust clustering of categorical sequences is also proposed, which provides the new measure with high-quality clustering results by the deterministic initialization and the elimination of noise clusters using an information theoretic method. The new clustering algorithm and the CVI are then assembled within the common model selection procedure to determine the number of clusters in categorical sequence sets. A case study on commonly used protein sequences and the experimental results on some real-world sequence sets from different domains are given to demonstrate the performance of the proposed method.
|
38
|
Alexandridis A, Chondrodima E, Giannopoulos N, Sarimveis H. A Fast and Efficient Method for Training Categorical Radial Basis Function Networks. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2017; 28:2831-2836. [PMID: 28113644 DOI: 10.1109/tnnls.2016.2598722] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
This brief presents a novel learning scheme for categorical data based on radial basis function (RBF) networks. The proposed approach replaces the numerical vectors known as RBF centers with categorical tuple centers, and employs specially designed measures for calculating the distance between the center and the input tuples. Furthermore, a fast noniterative categorical clustering algorithm is proposed to accomplish the first stage of RBF training involving categorical center selection, whereas the weights are calculated through linear regression. The method is applied on 22 categorical data sets and compared with several different learning schemes, including neural networks, support vector machines, naïve Bayes classifier, and decision trees. Results show that the proposed method is very competitive, outperforming its rivals in terms of predictive capabilities in the majority of the tested cases.
|
39
|
Lian C, Ruan S, Denoux T, Li H, Vera P. Spatial Evidential Clustering With Adaptive Distance Metric for Tumor Segmentation in FDG-PET Images. IEEE Trans Biomed Eng 2017; 65:21-30. [PMID: 28371772 DOI: 10.1109/tbme.2017.2688453] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
While the accurate delineation of tumor volumes in FDG-positron emission tomography (PET) is a vital task for diverse objectives in clinical oncology, noise and blur due to the imaging system make it a challenging task. In this paper, we propose to address the imprecision and noise inherent in PET using Dempster-Shafer theory, a powerful tool for modeling and reasoning with uncertain and/or imprecise information. Based on Dempster-Shafer theory, a novel evidential clustering algorithm is proposed and tailored for the tumor segmentation task in three dimensions. For accurate clustering of PET voxels, each voxel is described not only by the single intensity value but also complementarily by textural features extracted from a patch surrounding the voxel. Considering that there are a large number of textures without consensus regarding the most informative ones, and that some of the extracted features are even unreliable due to the low-quality PET images, a specific procedure is included in the proposed clustering algorithm to adapt the distance metric for properly representing the clustering distortions and the similarities between neighboring voxels. This integrated metric adaptation procedure realizes a low-dimensional transformation from the original space and limits the influence of unreliable inputs via feature selection. A Dempster-Shafer-theory-based spatial regularization is also proposed and included in the clustering algorithm, so as to effectively quantify the local homogeneity. The proposed method has been compared with other methods on real-patient FDG-PET images, showing good performance.
|