Entropy, information and divergence are words often used in statistics, physics and computer science. This note is a student’s attempt at organizing a self-study of the subject, as there is no math course on this subject at my university.
Setup
Let \((\Omega, \mathcal F, P )\) be a probability space.
Continuous and discrete versions
Much of the difficulty in understanding entropy comes from the fact that entropy is defined slightly differently in discrete and continuous spaces. This difference can be explained by the use of different reference measures in the definitions. Also, instead of calculating entropy using the \(\sigma\)-algebra \(\mathcal F\) and the Radon–Nikodym derivative, it is possible to use an almost partition.
Entropy
The entropy \(H\) of a probability space \((\Omega, \mathcal F, P)\) can be defined (Cover, 1991) as
\[H_{\mu}(P) := \mathbb {E}_P \left[ - \log\frac{dP}{d\mu}\right] \quad (\text{if }P \ll \mu),\]where we have taken the Radon–Nikodym derivative with respect to a reference measure \(\mu\) on \((\Omega, \mathcal F)\). If \(P\) is not absolutely continuous with respect to \(\mu\) (that is, if \(P \ll \mu\) fails), then we define \(H_\mu(P) := \infty\).
Note that entropy depends on the \(\sigma\)-algebra \(\mathcal F\) and the probability measure \(P\). When it is unclear which of these are meant, we will write entropy as \(H_{\mu}[\mathcal{F}](P)\). Here we can note that for two \(\sigma\)-algebras \( \mathcal F_1 \subseteq \mathcal F_2\), viewed as (almost) partitions with entropy computed over their atoms, we have \(H_\mu[\mathcal F_1](P) \leq H_\mu[\mathcal F_2](P)\). Thus the more events we can measure, the higher the entropy becomes.
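As a small illustration of this monotonicity (a sketch only, with a made-up distribution, computing each entropy over the atoms of the corresponding partition, counting measure as reference, in nats):

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A made-up distribution P on Omega = {1, 2, 3, 4}.
p = [0.1, 0.2, 0.3, 0.4]

# Entropy over the finest partition (the full sigma-algebra 2^Omega).
h_fine = entropy(p)

# Entropy over the coarser partition {{1, 2}, {3, 4}}: it only "sees" the merged events.
h_coarse = entropy([p[0] + p[1], p[2] + p[3]])

print(h_fine, h_coarse)    # approx. 1.280 and 0.611
assert h_coarse <= h_fine  # coarser partition, lower entropy
```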
Kullback–Leibler divergence
A quantity related to entropy is the Kullback–Leibler divergence, which is defined (Cover, 1991) as
\[D_\text{KL}(P \vert\vert Q) := \mathbb {E}_P \left[ \log\frac{dP}{dQ}\right],\]for two probability measures \(P\) and \(Q\) if \(P\ll Q\), otherwise we define \(D_\text{KL}(P \vert\vert Q) := \infty\).
KL-divergence is also sometimes called relative entropy, since it is essentially entropy, up to sign, with a probability measure \(Q\) as the reference measure. The first property of KL-divergence is that it is always nonnegative. A simple proof using Jensen’s inequality, for the case \(Q \ll P \), is
\[D_\text{KL}(P \vert\vert Q) = \mathbb {E}_P \left[-\log\frac{dQ}{dP}\right] \geq - \log \mathbb{E}_P\left[ \frac{dQ}{dP}\right] = 0.\]By the equality condition in Jensen’s inequality we can also note that
\[D_\text{KL}(P \vert\vert Q)=0\quad \Leftrightarrow \quad P= Q,\]that is, the densities agree \(P\)-almost everywhere. Furthermore, if \( P\ll Q \ll \mu\), then by the chain rule for Radon–Nikodym derivatives
\[D_\text{KL}(P \vert\vert Q) = H_\mu(P, Q) - H_\mu( P),\]where
\[H_\mu(P, Q) := -\mathbb {E}_P \left[ \log\frac{dQ}{d\mu}\right],\]is cross-entropy.
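These identities are easy to check numerically in the discrete case. The following is a minimal sketch (the two distributions are made up; natural logarithms and the counting measure are used):

```python
import math

def entropy(p):
    """H_mu(P) with mu = counting measure (nats)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H_mu(P, Q) = -E_P[log dQ/dmu] (nats); assumes P << Q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(P || Q) = E_P[log dP/dQ] (nats); assumes P << Q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two made-up distributions on a three-point space.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

assert kl(p, q) >= 0                                                # nonnegativity
assert abs(kl(p, q) - (cross_entropy(p, q) - entropy(p))) < 1e-9    # chain rule
assert abs(kl(p, p)) < 1e-9                                         # D_KL(P || P) = 0
```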
Entropy in discrete probability spaces
In a discrete probability space \((\Omega, 2^\Omega, P) \), the reference measure \(\mu\) used is the counting measure. In this case, we have
\[\frac{dP}{d\mu}(w) = \frac{P(\{w\}) }{\mu(\{w\})} = P(\{w\}) \in [0,1],\]for all \(w\in \Omega\). In this case we have an upper and a lower bound for entropy. First, since \(\frac{dP}{d\mu}(w) \in [0,1]\), entropy is always nonnegative. Next, let \(U\) be the uniform probability measure on \(\Omega\); then \(H_\mu(P,U)= \log(\vert\Omega\vert)\) and
\[0 \leq H_\mu(P) = \log(\vert\Omega\vert) - D_\text{KL}(P \vert\vert U) \leq \log(\vert\Omega\vert).\]
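This identity and the resulting bounds can be checked numerically as well; a minimal sketch with a made-up four-point distribution:

```python
import math

p = [0.7, 0.1, 0.1, 0.1]   # a made-up distribution, |Omega| = 4
n = len(p)

h = -sum(pi * math.log(pi) for pi in p)          # H_mu(P), counting measure, nats
kl_pu = sum(pi * math.log(pi * n) for pi in p)   # D_KL(P || U), U uniform

assert abs(h - (math.log(n) - kl_pu)) < 1e-9     # H(P) = log|Omega| - D_KL(P || U)
assert 0.0 <= h <= math.log(n)                   # the two bounds
```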
Encoding
In the discrete probability space setting, we can also give the quantities defined above a concrete operational meaning.
A problem in coding theory is the situation where we have a stream of samples (symbols) \(X_i\) drawn from \((\Omega, 2^\Omega, P)\). We also have the set \(\mathcal C\) of all finite sequences (words) over an alphabet \(\mathcal A\). Now we want to encode symbols into words by an encoding \(C: \Omega \rightarrow \mathcal C\) that is uniquely decodable. A uniquely decodable encoding is an encoding \(C\) such that the extension
\[C^* : \Omega^n \rightarrow \mathcal C, \quad (\omega_1, \dots, \omega_n )\mapsto C(\omega_1)\cdots C(\omega_n),\]is injective for every \(n\). In other words, we can uniquely decode an encoded sequence of symbols back into the original symbols.
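For concreteness, here is a small sketch of such an encoding: a made-up binary prefix code (no codeword is a prefix of another, which is one standard way to guarantee unique decodability), its extension \(C^*\), and a decoder recovering the original symbols:

```python
# A made-up prefix code C on Omega = {"a", "b", "c"} over the alphabet {0, 1}.
C = {"a": "0", "b": "10", "c": "11"}

def encode(symbols):
    """The extension C*: concatenate the codewords of the symbols."""
    return "".join(C[s] for s in symbols)

def decode(bits):
    """Greedy decoding; works because C is a prefix code."""
    inverse = {w: s for s, w in C.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return out

msg = ["a", "c", "b", "a"]
assert decode(encode(msg)) == msg   # C* is injective on this example
```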
Now let
\[L_C := \mathbb {E}_P[\vert C\vert],\]denote the expected length of an encoded symbol. Shannon’s source coding theorem (Gray, 2011), with the logarithm in \(H_\mu\) taken to base \(\vert\mathcal A\vert\), states that
- \(H_\mu(P) \leq L_C \) for every uniquely decodable encoding \(C\)
- There exists a uniquely decodable encoding \(C\) such that \( H_\mu(P) \leq L_C \leq H_\mu(P) + 1 \).
As a consequence, an optimal encoding will encode a symbol \(w\) with probability \(p =P(\{w\})\) using a word of approximate length \(-\log_{\vert\mathcal A\vert} p\) (more precisely \(\lceil-\log_{\vert\mathcal A\vert} p\rceil\)). Moreover, the cross-entropy \(H(P,Q)\) is (up to this rounding) the expected word length when using an encoding optimized for \(Q\) on symbols drawn from \(P\). Similarly, the KL-divergence is the difference in expected word length, for symbols drawn from \(P\), between an encoding optimized for \(Q\) and one optimized for \(P\).
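A minimal sketch of these word lengths, with a made-up dyadic source and a binary alphabet; the lengths \(\lceil -\log_2 p \rceil\) satisfy Kraft’s inequality, so a prefix code with exactly these lengths exists:

```python
import math

# A made-up source distribution over four symbols, binary alphabet (|A| = 2).
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Shannon code lengths: ceil(-log_2 p(w)) bits per symbol.
lengths = {w: math.ceil(-math.log2(pw)) for w, pw in p.items()}

h_bits = -sum(pw * math.log2(pw) for pw in p.values())          # entropy in bits
expected_length = sum(pw * lengths[w] for w, pw in p.items())   # L_C

# Shannon's source coding theorem: H <= L_C <= H + 1 for these lengths.
assert h_bits <= expected_length <= h_bits + 1
print(h_bits, expected_length)   # here both equal 1.75, since p is dyadic
```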
Uncertainty and information
Entropy is often said to be a measure of information and uncertainty. More precisely, as a measure of information content for a discrete probability space, entropy is the minimum expected number of bits required to store a sample drawn from that space. To understand entropy as uncertainty, one instead needs to make concrete assumptions about what an uncertainty measure should satisfy (Chinčin, 1957); entropy is then the function satisfying these conditions. Without such assumptions, entropy as a measure of uncertainty is not unique.
As a sanity check of both these notions, consider how the entropy of a discrete probability space endowed with the uniform distribution changes as the number of elements increases. In this case, the number of bits required to encode one symbol increases as the number of possible symbols increases. Similarly, one can argue that uncertainty increases as the number of equally likely outcomes increases. On the other hand, if the probability \(P\) is changed so that one symbol is much more probable than all the others, then the situation can be interpreted as less uncertain; correspondingly, the minimal expected length of encoded symbols decreases.
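A quick numerical check of this intuition (made-up numbers, entropy in bits):

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Uniform distributions: entropy grows with the number of equally likely symbols.
print([round(entropy_bits([1 / n] * n), 3) for n in (2, 4, 8, 16)])   # [1.0, 2.0, 3.0, 4.0]

# Skewing the distribution towards one symbol lowers the entropy,
# i.e. less uncertainty and shorter expected encodings.
print(round(entropy_bits([0.25] * 4), 3))                 # 2.0
print(round(entropy_bits([0.97, 0.01, 0.01, 0.01]), 3))   # approx. 0.242
```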