Jaberi M, Pensky M, Foroosh H. Sparse One-Grab Sampling with Probabilistic Guarantees. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2019;41:3057-3070. PMID: 30371353. DOI: 10.1109/TPAMI.2018.2871850.
Abstract
Sampling is an important and effective strategy for analyzing "big data," whereby a smaller subset of a dataset is used to estimate the characteristics of the entire population. The main goal of sampling is often to achieve a significant gain in computational time. However, a major obstacle toward this goal is assessing the smallest sample size needed to ensure, with high probability, a faithful representation of the entire dataset, especially when the dataset comprises a large number of diverse structures (e.g., clusters). To address this problem, we propose a method referred to as the Sparse Withdrawal of Inliers in a First Trial (SWIFT), which determines the smallest size of a subset sampled from a dataset in one grab, with the guarantee that the subset provides a sufficient number of samples from each of the underlying structures for their discovery and inference. This guarantee holds with high probability, and the resulting lower bound on the smallest sample size depends on the desired probabilistic guarantees. In addition, we derive an upper bound on the smallest sample size that allows detection of the structures, and we show that the two bounds are very close to each other in a variety of scenarios. We show that the problem can be modeled using either a hypergeometric or a multinomial probability mass function (pmf), and we derive accurate mathematical bounds that yield a tight approximation to the sample size, thus leading to a sparse sampling strategy. The key features of the proposed method are: (i) sparseness of the sampled subset used for analyzing the data, where the level of sparseness is independent of the population size; (ii) no prior knowledge of the distribution of the data or the number of underlying structures in the data; and (iii) robustness in the presence of an overwhelming number of outliers. We evaluate the method thoroughly in terms of accuracy, its behavior across different parameters, and its effectiveness in reducing computational cost in various computer vision applications, such as subspace clustering and structure from motion.
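To illustrate the hypergeometric modeling idea described in the abstract, the following minimal sketch numerically finds the smallest one-grab sample size n such that at least k_min points are drawn from the smallest structure with probability at least 1 - delta. This is not the paper's closed-form bound; the function name and parameters (N_total, m_smallest, k_min, delta) are illustrative assumptions for the sketch.

```python
# Minimal sketch, assuming a single "smallest structure" of known size.
# Finds the smallest sample size n drawn without replacement from a
# population of N_total points such that at least k_min points land in
# a structure of size m_smallest with probability >= 1 - delta.
from scipy.stats import hypergeom


def smallest_sample_size(N_total, m_smallest, k_min, delta):
    """Smallest n with P(hits from the smallest structure >= k_min) >= 1 - delta."""
    for n in range(k_min, N_total + 1):
        # P(X >= k_min) for X ~ Hypergeometric(N_total, m_smallest, n),
        # i.e., n draws without replacement from N_total points of which
        # m_smallest belong to the structure of interest.
        p_enough = hypergeom.sf(k_min - 1, N_total, m_smallest, n)
        if p_enough >= 1.0 - delta:
            return n
    return N_total  # fall back to sampling the entire dataset


# Example (hypothetical numbers): 100,000 points, smallest cluster holds
# 2,000 of them (2%), and we want at least 8 hits with 99% confidence.
print(smallest_sample_size(100_000, 2_000, 8, 0.01))
```

Because the hit probability is monotone in n, the linear scan could be replaced by a binary search; the loop is kept here only for clarity.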