Chen YT, Witten DM. Selective inference for k-means clustering. Journal of Machine Learning Research (JMLR) 2023; 24:152. [PMID: 38264325; PMCID: PMC10805457]
[Indexed: 01/25/2024]
Abstract
We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. In recent work, Gao et al. (2022) considered a related problem in the context of hierarchical clustering. Unfortunately, their solution is highly tailored to the context of hierarchical clustering, and thus cannot be applied in the setting of k-means clustering. In this paper, we propose a p-value that conditions on all of the intermediate clustering assignments in the k-means algorithm. We show that the p-value controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering in finite samples, and can be efficiently computed. We apply our proposal to hand-written digits data and to single-cell RNA-sequencing data.
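The claim that classical tests inflate the Type I error after clustering can be illustrated with a small simulation. The sketch below is not the authors' method or code: it only demonstrates the problem that motivates the paper. It assumes one-dimensional data drawn from a single N(0, 1) population (so there are no true clusters), a hand-written Lloyd's algorithm with k = 2 initialised at the minimum and maximum, and a naive two-sided z-test with known variance. All function names are hypothetical.

```python
import math
import random


def two_means_1d(x, max_iter=100):
    """Lloyd's algorithm for k = 2 on 1-D data; returns a 0/1 label per point."""
    # Assumption for illustration: initialise the centres at the extremes.
    c0, c1 = min(x), max(x)
    labels = None
    for _ in range(max_iter):
        new_labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in x]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        g0 = [v for v, lab in zip(x, labels) if lab == 0]
        g1 = [v for v, lab in zip(x, labels) if lab == 1]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return labels


def naive_z_pvalue(x, labels, sigma=1.0):
    """Two-sided z-test for equal cluster means, ignoring that the
    clusters were estimated from the same data (the naive approach)."""
    g0 = [v for v, lab in zip(x, labels) if lab == 0]
    g1 = [v for v, lab in zip(x, labels) if lab == 1]
    diff = sum(g0) / len(g0) - sum(g1) / len(g1)
    se = sigma * math.sqrt(1.0 / len(g0) + 1.0 / len(g1))
    # 2 * (1 - Phi(|z|)) written via the complementary error function.
    return math.erfc(abs(diff / se) / math.sqrt(2.0))


def naive_rejection_rate(n=30, reps=500, alpha=0.05, seed=0):
    """Fraction of null data sets on which the naive test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # global null: one population
        labels = two_means_1d(x)
        if naive_z_pvalue(x, labels) < alpha:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    print("naive rejection rate under the null:", naive_rejection_rate())
```

A valid level-0.05 test would reject on roughly 5% of these null data sets. Because k-means splits the sample precisely so that the two cluster means differ, the naive test should reject far more often than that, which is the inflation the abstract describes.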