1
Pinchas A, Ben-Gal I, Painsky A. A Comparative Analysis of Discrete Entropy Estimators for Large-Alphabet Problems. Entropy (Basel) 2024; 26:369. [PMID: 38785618] [PMCID: PMC11120205] [DOI: 10.3390/e26050369]
Abstract
This paper presents a comparative study of entropy estimation in a large-alphabet regime. A variety of entropy estimators have been proposed over the years, where each estimator is designed for a different setup with its own strengths and caveats. As a consequence, no estimator is known to be universally better than the others. This work addresses this gap by comparing twenty-one entropy estimators in the studied regime, starting with the simplest plug-in estimator and leading up to the most recent neural-network-based and polynomial-approximation estimators. Our findings show that the estimators' performance depends strongly on the underlying distribution. Specifically, we distinguish between three types of distributions, ranging from uniform to degenerate distributions. For each class of distribution, we recommend the most suitable estimator. Further, we propose a sample-dependent approach, which again considers three classes of distribution, and report the top-performing estimators in each class. This approach provides a data-dependent framework for choosing the desired estimator in practical setups.
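For context, the baseline that such comparisons start from is the plug-in estimator mentioned above. The sketch below (illustrative Python, not code from the paper; the uniform source and sample sizes are arbitrary choices) shows the estimator and the strong negative bias it incurs in the large-alphabet regime the study targets.

```python
import numpy as np

def plugin_entropy(counts):
    """Plug-in (maximum-likelihood) entropy estimate in nats."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

# Large-alphabet regime: alphabet size k comparable to (or larger than) the
# sample size n. The plug-in estimate is then biased far below the truth.
rng = np.random.default_rng(0)
k, n = 10_000, 1_000
sample = rng.integers(0, k, size=n)          # uniform source, true H = ln(k)
counts = np.bincount(sample, minlength=k)
print(plugin_entropy(counts), "vs true", np.log(k))
```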
Affiliation(s)
- Assaf Pinchas
- School of Electrical Engineering, The Iby and Aladar Fleischman Faculty of Engineering, Tel Aviv University, Tel Aviv 6997801, Israel
- Irad Ben-Gal
- Industrial Engineering Department, The Iby and Aladar Fleischman Faculty of Engineering, Tel Aviv University, Tel Aviv 6997801, Israel; (I.B.-G.); (A.P.)
- Amichai Painsky
- Industrial Engineering Department, The Iby and Aladar Fleischman Faculty of Engineering, Tel Aviv University, Tel Aviv 6997801, Israel; (I.B.-G.); (A.P.)
2
Camaglia F, Nemenman I, Mora T, Walczak AM. Bayesian estimation of the Kullback-Leibler divergence for categorical systems using mixtures of Dirichlet priors. Phys Rev E 2024; 109:024305. [PMID: 38491647] [DOI: 10.1103/physreve.109.024305]
Abstract
In many applications in biology, engineering, and economics, identifying similarities and differences between distributions of data from complex processes requires comparing finite categorical samples of discrete counts. Statistical divergences quantify the difference between two distributions. However, their estimation is very difficult and empirical methods often fail, especially when the samples are small. We develop a Bayesian estimator of the Kullback-Leibler divergence between two probability distributions that makes use of a mixture of Dirichlet priors on the distributions being compared. We study the properties of the estimator on two examples: probabilities drawn from Dirichlet distributions and random strings of letters drawn from Markov chains. We extend the approach to the squared Hellinger divergence. Both estimators outperform other estimation techniques, with better results for data with a large number of categories and for higher values of divergences.
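The mixture-of-Dirichlet-priors estimator itself is more involved; as a much simpler hedged illustration of the Bayesian smoothing idea (not the estimator developed in the paper), the sketch below places a single symmetric Dirichlet prior with pseudocount alpha on each distribution and plugs the posterior-mean probabilities into the KL formula.

```python
import numpy as np

def kl_dirichlet_plugin(counts_p, counts_q, alpha=0.5):
    """KL(P||Q) in nats from two count vectors over the same categories, using
    posterior-mean probabilities under symmetric Dirichlet(alpha) priors.
    Single-prior smoothing only: a simplified stand-in for the
    mixture-of-Dirichlet-priors estimator discussed in the paper."""
    cp = np.asarray(counts_p, dtype=float)
    cq = np.asarray(counts_q, dtype=float)
    p = (cp + alpha) / (cp.sum() + alpha * len(cp))
    q = (cq + alpha) / (cq.sum() + alpha * len(cq))
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Example: two small categorical samples over 50 categories.
rng = np.random.default_rng(1)
P, Q = rng.dirichlet(np.ones(50)), rng.dirichlet(np.ones(50))
x, y = rng.multinomial(200, P), rng.multinomial(200, Q)
print(kl_dirichlet_plugin(x, y))
```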
Affiliation(s)
- Francesco Camaglia
- Laboratoire de physique de l'École normale supérieure, CNRS, PSL University, Sorbonne Université and Université de Paris, 75005 Paris, France
- Ilya Nemenman
- Department of Physics, Department of Biology, and Initiative for Theory and Modeling of Living Systems, Emory University, Atlanta, Georgia 30322, USA
- Thierry Mora
- Laboratoire de physique de l'École normale supérieure, CNRS, PSL University, Sorbonne Université and Université de Paris, 75005 Paris, France
- Aleksandra M Walczak
- Laboratoire de physique de l'École normale supérieure, CNRS, PSL University, Sorbonne Université and Université de Paris, 75005 Paris, France
3
De Gregorio J, Sánchez D, Toral R. Entropy Estimators for Markovian Sequences: A Comparative Analysis. Entropy (Basel) 2024; 26:79. [PMID: 38248204] [PMCID: PMC11154276] [DOI: 10.3390/e26010079]
Abstract
Entropy estimation is a fundamental problem in information theory that has applications in various fields, including physics, biology, and computer science. Estimating the entropy of discrete sequences can be challenging due to limited data and the lack of unbiased estimators. Most existing entropy estimators are designed for sequences of independent events and their performances vary depending on the system being studied and the available data size. In this work, we compare different entropy estimators and their performance when applied to Markovian sequences. Specifically, we analyze both binary Markovian sequences and Markovian systems in the undersampled regime. We calculate the bias, standard deviation, and mean squared error for some of the most widely employed estimators. We discuss the limitations of entropy estimation as a function of the transition probabilities of the Markov processes and the sample size. Overall, this paper provides a comprehensive comparison of entropy estimators and their performance in estimating entropy for systems with memory, which can be useful for researchers and practitioners in various fields.
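As a concrete reference point for the Markovian setting, the sketch below (illustrative code, not from the paper; the transition probabilities and sequence length are arbitrary) computes the exact entropy rate of a two-state Markov chain and compares it with the naive plug-in estimate of the conditional entropy H(X_t | X_{t-1}) from a simulated sequence.

```python
import numpy as np

def entropy_rate(P):
    """Exact entropy rate (nats/symbol) of a stationary Markov chain with
    transition matrix P, computed from its stationary distribution."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()
    logP = np.where(P > 0, np.log(np.where(P > 0, P, 1.0)), 0.0)
    return float(-np.sum(pi[:, None] * P * logP))

def plugin_markov_entropy(seq, k):
    """Plug-in estimate of H(X_t | X_{t-1}) from observed transition counts."""
    counts = np.zeros((k, k))
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
    row = counts.sum(axis=1, keepdims=True)
    cond = np.divide(counts, row, out=np.zeros_like(counts), where=row > 0)
    joint = counts / counts.sum()
    logc = np.where(cond > 0, np.log(np.where(cond > 0, cond, 1.0)), 0.0)
    return float(-np.sum(joint * logc))

# Binary chain with transition probabilities p = P(0->1) and q = P(1->0).
p, q = 0.2, 0.4
P = np.array([[1 - p, p], [q, 1 - q]])
rng = np.random.default_rng(2)
seq = [0]
for _ in range(5000):
    seq.append(int(rng.choice(2, p=P[seq[-1]])))
print(plugin_markov_entropy(seq, 2), "vs exact", entropy_rate(P))
```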
Affiliation(s)
- David Sánchez
- Institute for Cross-Disciplinary Physics and Complex Systems IFISC (UIB-CSIC), Campus Universitat de les Illes Balears, E-07122 Palma de Mallorca, Spain; (J.D.G.); (R.T.)
4
Zhang Z. Several Basic Elements of Entropic Statistics. Entropy (Basel) 2023; 25:1060. [PMID: 37510007] [PMCID: PMC10377889] [DOI: 10.3390/e25071060]
Abstract
Inspired by the development in modern data science, a shift is increasingly visible in the foundation of statistical inference, away from a real space, where random variables reside, toward a nonmetrized and nonordinal alphabet, where more general random elements reside. While statistical inferences based on random variables are theoretically well supported in the rich literature of probability and statistics, inferences on alphabets, mostly by way of various entropies and their estimation, are less systematically supported in theory. Without the familiar notions of neighborhood, real or complex moments, tails, et cetera, associated with random variables, probability and statistics based on random elements on alphabets need more attention to foster a sound framework for rigorous development of entropy-based statistical exercises. In this article, several basic elements of entropic statistics are introduced and discussed, including notions of general entropies, entropic sample spaces, entropic distributions, entropic statistics, entropic multinomial distributions, entropic moments, and entropic basis, among other entropic objects. In particular, an entropic-moment-generating function is defined and it is shown to uniquely characterize the underlying distribution in entropic perspective, and, hence, all entropies. An entropic version of the Glivenko-Cantelli convergence theorem is also established.
Affiliation(s)
- Zhiyi Zhang
- Department of Mathematics and Statistics, UNC Charlotte, Charlotte, NC 28223, USA
5
Zhang J, Shi J. Asymptotic Normality for Plug-In Estimators of Generalized Shannon's Entropy. Entropy (Basel) 2022; 24:683. [PMID: 35626567] [PMCID: PMC9141039] [DOI: 10.3390/e24050683]
Abstract
Shannon’s entropy is one of the building blocks of information theory and an essential aspect of Machine Learning (ML) methods (e.g., Random Forests). Yet, it is only finitely defined for distributions with fast decaying tails on a countable alphabet. The unboundedness of Shannon’s entropy over the general class of all distributions on an alphabet prevents its potential utility from being fully realized. To fill the void in the foundation of information theory, Zhang (2020) proposed generalized Shannon’s entropy, which is finitely defined everywhere. The plug-in estimator, adopted in almost all entropy-based ML method packages, is one of the most popular approaches to estimating Shannon’s entropy. The asymptotic distribution for Shannon’s entropy’s plug-in estimator was well studied in the existing literature. This paper studies the asymptotic properties for the plug-in estimator of generalized Shannon’s entropy on countable alphabets. The developed asymptotic properties require no assumptions on the original distribution. The proposed asymptotic properties allow for interval estimation and statistical tests with generalized Shannon’s entropy.
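As a sketch of how asymptotic normality translates into interval estimation, the code below builds a normal-approximation confidence interval for the plug-in estimator of the ordinary Shannon entropy, using the classical result sqrt(n)(H_hat - H) -> N(0, Var[-ln p(X)]); the paper's generalized Shannon entropy is handled analogously but is not implemented here.

```python
import numpy as np
from scipy.stats import norm

def plugin_entropy_ci(counts, level=0.95):
    """Plug-in Shannon entropy (nats) with a normal-approximation confidence
    interval based on the classical asymptotic normality of the plug-in
    estimator for a fixed finite alphabet."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p = counts[counts > 0] / n
    h = -np.sum(p * np.log(p))
    var = np.sum(p * np.log(p) ** 2) - h ** 2   # plug-in estimate of Var[-ln p(X)]
    half = norm.ppf(0.5 + level / 2) * np.sqrt(var / n)
    return h, (h - half, h + half)

rng = np.random.default_rng(3)
probs = [0.3, 0.25, 0.2, 0.1, 0.1, 0.05]
sample = rng.choice(len(probs), size=2000, p=probs)
print(plugin_entropy_ci(np.bincount(sample, minlength=len(probs))))
```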
6
Zhu H, Lei L. A dependency-based machine learning approach to the identification of research topics: a case in COVID-19 studies. Library Hi Tech 2021. [DOI: 10.1108/lht-01-2021-0051]
Abstract
Purpose: Previous research concerning automatic extraction of research topics mostly used rule-based or topic modeling methods, which were challenged due to the limited rules, the interpretability issue and the heavy dependence on human judgment. This study aims to address these issues with the proposal of a new method that integrates machine learning models with linguistic features for the identification of research topics.
Design/methodology/approach: First, dependency relations were used to extract noun phrases from research article texts. Second, the extracted noun phrases were classified into topics and non-topics via machine learning models and linguistic and bibliometric features. Lastly, a trend analysis was performed to identify hot research topics, i.e. topics with increasing popularity.
Findings: The new method was evaluated on a large dataset of COVID-19 research articles and achieved satisfactory results in terms of f-measures, accuracy and AUC values. Hot topics of COVID-19 research were also detected based on the classification results.
Originality/value: This study demonstrates that information retrieval methods can help researchers gain a better understanding of the latest trends in both COVID-19 and other research areas. The findings are significant to both researchers and policymakers.
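A hedged sketch of the first pipeline step (noun-phrase candidates derived from a dependency parse); spaCy and its small English pipeline are assumptions made here for illustration, since the abstract does not name the toolchain used in the paper.

```python
import spacy
from collections import Counter

# Assumes the en_core_web_sm pipeline is installed (python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

text = ("The spike protein mediates viral entry. Convalescent plasma therapy "
        "and mRNA vaccines were evaluated in randomized controlled trials.")

doc = nlp(text)
# spaCy derives noun chunks from the dependency parse; counting their lemmas
# yields candidate topic phrases to be filtered by a downstream classifier.
candidates = Counter(chunk.lemma_.lower() for chunk in doc.noun_chunks)
print(candidates.most_common(10))
```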
7
Contreras Rodríguez L, Madarro-Capó EJ, Legón-Pérez CM, Rojas O, Sosa-Gómez G. Selecting an Effective Entropy Estimator for Short Sequences of Bits and Bytes with Maximum Entropy. Entropy (Basel) 2021; 23:561. [PMID: 33946438] [PMCID: PMC8147137] [DOI: 10.3390/e23050561]
Abstract
Entropy makes it possible to measure the uncertainty about an information source from the distribution of its output symbols. It is known that the maximum Shannon entropy of a discrete source is reached when its symbols follow a uniform distribution. In cryptography, such sources are of great practical importance, since they allow the highest security standards to be reached. In this work, the most effective estimator is selected for estimating the entropy of short samples of bytes and bits with maximum entropy. For this, 18 estimators were compared, and the comparisons between them published in the literature are discussed. The most suitable estimator is determined experimentally, based on its bias and mean squared error on short samples of bytes and bits.
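The kind of comparison performed in the paper can be reproduced in miniature. The sketch below (illustrative only, covering just two of the eighteen estimators: the plug-in and the Miller-Madow correction) measures bias and MSE on short samples from a maximum-entropy byte source.

```python
import numpy as np

def plugin(counts):
    """Plug-in entropy estimate in bits."""
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

def miller_madow(counts):
    """Plug-in estimate plus the first-order (k_observed - 1)/(2n) bias correction,
    converted to bits."""
    n = counts.sum()
    k_obs = np.count_nonzero(counts)
    return plugin(counts) + (k_obs - 1) / (2 * n * np.log(2))

# Short samples of bytes drawn from a maximum-entropy (uniform) source.
rng = np.random.default_rng(4)
k, n, reps = 256, 128, 2000
true_h = np.log2(k)                              # 8 bits
results = {"plugin": [], "miller_madow": []}
for _ in range(reps):
    counts = np.bincount(rng.integers(0, k, size=n), minlength=k)
    results["plugin"].append(plugin(counts))
    results["miller_madow"].append(miller_madow(counts))
for name, vals in results.items():
    vals = np.array(vals)
    print(name, "bias:", vals.mean() - true_h, "MSE:", np.mean((vals - true_h) ** 2))
```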
Affiliation(s)
- Lianet Contreras Rodríguez
- Facultad de Matemática y Computación, Instituto de Criptografía, Universidad de la Habana, Habana 10400, Cuba; (L.C.R.); (E.J.M.-C.); (C.M.L.-P.)
- Evaristo José Madarro-Capó
- Facultad de Matemática y Computación, Instituto de Criptografía, Universidad de la Habana, Habana 10400, Cuba; (L.C.R.); (E.J.M.-C.); (C.M.L.-P.)
- Carlos Miguel Legón-Pérez
- Facultad de Matemática y Computación, Instituto de Criptografía, Universidad de la Habana, Habana 10400, Cuba; (L.C.R.); (E.J.M.-C.); (C.M.L.-P.)
- Omar Rojas
- Facultad de Ciencias Económicas y Empresariales, Universidad Panamericana, Álvaro del Portillo 49, Zapopan, Jalisco 45010, Mexico
- Guillermo Sosa-Gómez
- Facultad de Ciencias Económicas y Empresariales, Universidad Panamericana, Álvaro del Portillo 49, Zapopan, Jalisco 45010, Mexico
8
Mölter J, Goodhill GJ. Limitations to Estimating Mutual Information in Large Neural Populations. Entropy (Basel) 2020; 22:490. [PMID: 33286264] [PMCID: PMC7516973] [DOI: 10.3390/e22040490]
Abstract
Information theory provides a powerful framework to analyse the representation of sensory stimuli in neural population activity. However, estimating the quantities involved, such as entropy and mutual information, from finite samples is notoriously hard, and any direct estimate is known to be heavily biased. This is especially true for large neural populations. We study a simple model of sensory processing and show through a combinatorial argument that, for large neural populations, the samples of neural activity recorded in response to a set of stimuli are, with high probability, mutually distinct for any finite sample size. As a consequence, the mutual information estimated directly from empirical histograms will equal the stimulus entropy. Importantly, this holds irrespective of the precise relation between stimulus and neural activity and corresponds to a maximal bias. The argument is general and applies to any application of information theory where the state space is large and one relies on empirical histograms. Overall, this work highlights the need for alternative approaches to information-theoretic analysis when dealing with large neural populations.
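The combinatorial argument is easy to reproduce numerically. In the sketch below (illustrative code, not from the paper), the "responses" are high-dimensional random words drawn independently of the stimulus, so the true mutual information is zero; because every sampled response is distinct, the plug-in estimate from the empirical histogram nevertheless comes out at the (empirical) stimulus entropy.

```python
import numpy as np
from collections import Counter

def plugin_mi(pairs):
    """Plug-in mutual information (bits) from the empirical joint histogram."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(s for s, _ in pairs)
    py = Counter(r for _, r in pairs)
    return sum((c / n) * np.log2((c / n) / ((px[s] / n) * (py[r] / n)))
               for (s, r), c in pxy.items())

rng = np.random.default_rng(5)
n_samples, n_stimuli = 500, 4
stimuli = rng.integers(0, n_stimuli, size=n_samples).tolist()
# "Population responses": 100-neuron binary words, independent of the stimulus,
# so the true mutual information is exactly zero. With high probability every
# sampled word is distinct.
responses = [tuple(rng.integers(0, 2, size=100).tolist()) for _ in range(n_samples)]
pairs = list(zip(stimuli, responses))
print("plug-in MI:", plugin_mi(pairs))                       # ~ 2 bits: maximal bias
print("empirical stimulus entropy:", plugin_mi(list(zip(stimuli, stimuli))))
```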
9
Özkan K. A New Proposed Estimator for Reducing Bias Due to Undetected Species. Gazi University Journal of Science 2020. [DOI: 10.35378/gujs.554644]
10
Holmes CM, Nemenman I. Estimation of mutual information for real-valued data with error bars and controlled bias. Phys Rev E 2019; 100:022404. [PMID: 31574710] [DOI: 10.1103/physreve.100.022404]
Abstract
Estimation of mutual information between (multidimensional) real-valued variables is used in analysis of complex systems, biological systems, and recently also quantum systems. This estimation is a hard problem, and universally good estimators provably do not exist. We focus on the estimator introduced by Kraskov et al. [Phys. Rev. E 69, 066138 (2004); DOI: 10.1103/PhysRevE.69.066138], based on the statistics of distances between neighboring data points, which empirically works for a wide class of underlying probability distributions. First, we illustrate pitfalls of naively applying bootstrapping to estimate the variance of the mutual information estimate. Then we improve this estimator by (1) expanding its range of applicability and by providing (2) a self-consistent way of verifying the absence of bias, (3) a method for estimation of its variance, and (4) guidelines for choosing the values of the free parameter of the estimator. We demonstrate the performance of our estimator on synthetic data sets, as well as on neurophysiological and systems biology data sets.
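For reference, here is a compact sketch of the underlying Kraskov-Stögbauer-Grassberger estimator (algorithm 1) that the paper builds on; it is a bare implementation without the paper's bias checks, variance estimation, or guidance on choosing the free parameter k.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mi(x, y, k=4):
    """KSG (algorithm 1) estimate of I(X;Y) in nats for samples x, y of shape
    (N,) or (N, d). Sketch only: ties and degenerate neighborhoods are not handled."""
    x = np.asarray(x, float).reshape(len(x), -1)
    y = np.asarray(y, float).reshape(len(y), -1)
    n = len(x)
    xy = np.hstack([x, y])
    # Distance to the k-th neighbour in the joint space (Chebyshev norm);
    # k+1 because the query point itself is returned at distance 0.
    eps = cKDTree(xy).query(xy, k=k + 1, p=np.inf)[0][:, -1]
    tree_x, tree_y = cKDTree(x), cKDTree(y)
    nx = np.array([len(tree_x.query_ball_point(x[i], eps[i] - 1e-12, p=np.inf)) - 1
                   for i in range(n)])
    ny = np.array([len(tree_y.query_ball_point(y[i], eps[i] - 1e-12, p=np.inf)) - 1
                   for i in range(n)])
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

# Correlated Gaussians: the exact MI is -0.5 * ln(1 - rho^2).
rng = np.random.default_rng(6)
rho, n = 0.6, 2000
z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
print(ksg_mi(z[:, 0], z[:, 1]), "vs exact", -0.5 * np.log(1 - rho ** 2))
```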
Affiliation(s)
- Caroline M Holmes
- Department of Physics, Princeton University, Princeton, New Jersey 08544, USA
- Ilya Nemenman
- Department of Physics, Department of Biology, Initiative in Theory and Modeling of Living Systems, Emory University, Atlanta, Georgia 30322, USA
14
Abstract
A nonparametric estimator of mutual information is proposed and is shown to have asymptotic normality and efficiency, and a bias decaying exponentially in sample size. The asymptotic normality and the rapidly decaying bias together offer a viable inferential tool for assessing mutual information between two random elements on finite alphabets, where the maximum likelihood estimator of mutual information greatly inflates the probability of type I error. The proposed estimator is illustrated by three examples in which the association between a pair of genes is assessed based on their expression levels. Several results of a simulation study are also provided.
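As a hedged illustration of the baseline this estimator is compared against (not the proposed estimator itself), the sketch below computes the maximum-likelihood (plug-in) mutual information between two discrete sequences and uses a permutation test to absorb its positive bias when testing independence; the "gene expression" data are synthetic placeholders.

```python
import numpy as np

def plugin_mi(x, y):
    """Plug-in (maximum-likelihood) mutual information in nats between two
    discrete sequences of equal length."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1)
    joint /= n
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """P-value for independence with plug-in MI as the statistic; permuting y
    gives a null distribution that already contains the estimator's bias."""
    rng = np.random.default_rng(seed)
    observed = plugin_mi(x, y)
    null = [plugin_mi(x, rng.permutation(y)) for _ in range(n_perm)]
    return (1 + sum(v >= observed for v in null)) / (n_perm + 1)

# Two independent synthetic "genes" over a 20-level expression alphabet.
rng = np.random.default_rng(7)
g1 = rng.integers(0, 20, size=100)
g2 = rng.integers(0, 20, size=100)
print("plug-in MI:", plugin_mi(g1, g2))          # clearly > 0 despite independence
print("permutation p-value:", permutation_pvalue(g1, g2))
```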
15
Abstract
We compare an entropy estimator [Formula: see text] recently discussed by Zhang ( 2012 ) with two estimators, [Formula: see text] and [Formula: see text], introduced by Grassberger ( 2003 ) and Schürmann ( 2004 ). We prove the identity [Formula: see text], which has not been taken into account by Zhang ( 2012 ). Then we prove that the systematic error (bias) of [Formula: see text] is less than or equal to the bias of the ordinary likelihood (or plug-in) estimator of entropy. Finally, by numerical simulation, we verify that for the most interesting regime of small sample estimation and large event spaces, the estimator [Formula: see text] has a significantly smaller statistical error than [Formula: see text].
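For concreteness, here is a sketch of Grassberger's (2003) estimator in its commonly quoted digamma closed form, one of the estimators compared in this note; the formula below is reproduced from memory of the published version, so treat the details as an assumption to verify against the original paper.

```python
import numpy as np
from scipy.special import digamma

def grassberger_entropy(counts):
    """Grassberger (2003) entropy estimator (nats), with the correction term
    G(n) = psi(n) + 0.5 * (-1)^n * (psi((n + 1) / 2) - psi(n / 2)):
        H_hat = ln(N) - (1 / N) * sum_i n_i * G(n_i)."""
    n_i = np.asarray(counts, dtype=float)
    n_i = n_i[n_i > 0]
    N = n_i.sum()
    sign = np.where(n_i % 2 == 0, 1.0, -1.0)
    G = digamma(n_i) + 0.5 * sign * (digamma((n_i + 1) / 2) - digamma(n_i / 2))
    return float(np.log(N) - np.sum(n_i * G) / N)

def plugin_entropy(counts):
    """Ordinary plug-in estimator, for comparison."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-np.sum(p * np.log(p)))

# Small sample, large event space: the regime the note focuses on.
rng = np.random.default_rng(8)
k, n = 1000, 200
probs = rng.dirichlet(np.ones(k))
counts = np.bincount(rng.choice(k, size=n, p=probs), minlength=k)
true_h = -np.sum(probs * np.log(probs))
print("plug-in:", plugin_entropy(counts),
      "Grassberger:", grassberger_entropy(counts), "true:", true_h)
```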
Affiliation(s)
- Thomas Schürmann
- Jülich Supercomputing Centre, Jülich Research Centre, 52425 Jülich, Germany
16
Jiao J, Venkat K, Han Y, Weissman T. Minimax Estimation of Functionals of Discrete Distributions. IEEE Transactions on Information Theory 2015; 61:2835-2885. [PMID: 29375152] [PMCID: PMC5786426] [DOI: 10.1109/tit.2015.2412945]
Abstract
We propose a general methodology for the construction and analysis of essentially minimax estimators for a wide class of functionals of finite dimensional parameters, and elaborate on the case of discrete distributions, where the support size S is unknown and may be comparable with or even much larger than the number of observations n. We treat the respective regions where the functional is nonsmooth and smooth separately. In the nonsmooth regime, we apply an unbiased estimator for the best polynomial approximation of the functional whereas, in the smooth regime, we apply a bias-corrected version of the maximum likelihood estimator (MLE). We illustrate the merit of this approach by thoroughly analyzing the performance of the resulting schemes for estimating two important information measures: 1) the entropy H(P) and 2) Fα(P), α > 0. We obtain the minimax L2 rates for estimating these functionals. In particular, we demonstrate that our estimator achieves the optimal sample complexity n ≍ S/ln S for entropy estimation. We also demonstrate that the sample complexity for estimating Fα(P), 0 < α < 1, is n ≍ S^(1/α)/ln S, which can be achieved by our estimator but not the MLE. For 1 < α < 3/2, we show the minimax L2 rate for estimating Fα(P) is (n ln n)^(-2(α-1)) for infinite support size, while the maximum L2 rate for the MLE is n^(-2(α-1)). For all the above cases, the behavior of the minimax rate-optimal estimators with n samples is essentially that of the MLE (plug-in rule) with n ln n samples, which we term "effective sample size enlargement." We highlight the practical advantages of our schemes for the estimation of entropy and mutual information. We compare our performance with various existing approaches, and demonstrate that our approach reduces running time and boosts the accuracy. Moreover, we show that the minimax rate-optimal mutual information estimator yielded by our framework leads to significant performance boosts over the Chow-Liu algorithm in learning graphical models. The wide use of information measure estimation suggests that the insights and estimators obtained in this paper could be broadly applicable.
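To make the notation explicit, the two functionals and, as recalled from this line of work (constants and exact regularity conditions omitted; verify against the paper), the minimax mean-squared-error rate for entropy estimation are:

```latex
\[
  H(P) \;=\; -\sum_{i=1}^{S} p_i \ln p_i,
  \qquad
  F_\alpha(P) \;=\; \sum_{i=1}^{S} p_i^{\alpha}, \quad \alpha > 0,
\]
\[
  \inf_{\hat H}\;\sup_{P}\; \mathbb{E}\bigl(\hat H - H(P)\bigr)^2
  \;\asymp\; \frac{S^2}{(n \ln n)^2} \;+\; \frac{\ln^2 S}{n},
  \qquad n \gtrsim \frac{S}{\ln S}.
\]
```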
Affiliation(s)
- Jiantao Jiao
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305 USA
- Kartik Venkat
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305 USA
- Yanjun Han
- Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
- Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305 USA
17
Abstract
In this letter, we introduce an estimator of the Kullback-Leibler divergence based on two independent samples. We show that, on any finite alphabet, this estimator has an exponentially decaying bias and that it is consistent and asymptotically normal. To explain the importance of this estimator, we provide a thorough analysis of the more standard plug-in estimator. We show that it is consistent and asymptotically normal, but with an infinite bias. Moreover, if we modify the plug-in estimator to remove the rare events that cause the bias to become infinite, the bias still decays at a rate no faster than O(1/n). Further, we extend our results to estimating the symmetrized Kullback-Leibler divergence. We conclude by providing simulation results, which show that the asymptotic properties of these estimators hold even for relatively small sample sizes.
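The sketch below illustrates the plug-in construction discussed above and one simple way to trim the rare events that drive it to infinity; the trimming rule here is an arbitrary choice for illustration, not necessarily the modification analyzed in the letter.

```python
import numpy as np

def plugin_kl(counts_p, counts_q):
    """Naive plug-in estimate of KL(P||Q) in nats. It is infinite whenever a
    category observed in the first sample is absent from the second."""
    p = np.asarray(counts_p, float) / np.sum(counts_p)
    q = np.asarray(counts_q, float) / np.sum(counts_q)
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

def trimmed_plugin_kl(counts_p, counts_q):
    """Plug-in estimate restricted to categories seen in the second sample,
    i.e. with the 'rare events' that make the naive estimator blow up removed."""
    cp = np.asarray(counts_p, float)
    cq = np.asarray(counts_q, float)
    keep = cq > 0
    p = cp[keep] / cp.sum()
    q = cq[keep] / cq.sum()
    mask = p > 0
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

rng = np.random.default_rng(9)
k = 100
P, Q = rng.dirichlet(np.ones(k)), rng.dirichlet(np.ones(k))
x = np.bincount(rng.choice(k, 300, p=P), minlength=k)
y = np.bincount(rng.choice(k, 300, p=Q), minlength=k)
true_kl = float(np.sum(P * (np.log(P) - np.log(Q))))
print(plugin_kl(x, y), trimmed_plugin_kl(x, y), "true:", true_kl)
```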
Affiliation(s)
- Zhiyi Zhang
- Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC 28223, U.S.A.
18
Chao A, Wang YT, Jost L. Entropy and the species accumulation curve: a novel entropy estimator via discovery rates of new species. Methods Ecol Evol 2013. [DOI: 10.1111/2041-210x.12108]
Affiliation(s)
- Anne Chao
- Institute of Statistics, National Tsing Hua University, Hsin-Chu 30043, Taiwan
- Y. T. Wang
- Institute of Statistics, National Tsing Hua University, Hsin-Chu 30043, Taiwan
- Lou Jost
- EcoMinga Foundation, Via a Runtun, Baños, Tungurahua, Ecuador