by W. B. Meitei, PhD
Kullback-Leibler divergence and self-information loss are fundamental concepts in information theory and Bayesian statistics that quantify differences between probability distributions and measure the amount of information contained in events or models.
Kullback-Leibler (KL) divergence is a measure of how one probability distribution Q diverges from a reference or true distribution P. It is often described as the expected excess "surprisal", or the information lost, when the approximation Q is used instead of P. Mathematically, for discrete variables, it is defined as:
DKL(P ∥ Q) = Σx P(x) log{P(x)/Q(x)}
KL divergence is non-negative and zero only when the two distributions are identical. It is asymmetric and thus not a true distance metric, but rather a "divergence." In Bayesian inference, KL divergence is widely used to compare models, assess model fit, and quantify the information gained when updating from priors to posteriors. It also plays a key role in variational inference as a loss function to approximate complex distributions.
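As a minimal sketch of these properties, the snippet below (assuming NumPy; kl_divergence is an illustrative helper, not a library routine) computes the discrete KL divergence for two coin-toss distributions and shows that the result is non-negative, zero only when the distributions coincide, and different when the arguments are swapped:

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D(P || Q) in nats.

    p and q are probability vectors over the same support; q must be
    positive wherever p is positive, and terms with p(x) = 0 contribute 0.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5]   # fair coin
q = [0.9, 0.1]   # heavily biased coin

print(kl_divergence(p, q))  # ≈ 0.511 nats
print(kl_divergence(q, p))  # ≈ 0.368 nats -- asymmetric, so not a distance metric
print(kl_divergence(p, p))  # 0.0 -- zero only for identical distributions
```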
Self-information loss (also known as information content or surprisal) measures the amount of information or surprise brought by observing a particular event x with probability p(x). It is defined as:
I(x) = -log{p(x)}
Rare or unexpected events carry more self-information since they are more surprising, while very likely events have near-zero self-information. This measure forms the basic unit of information in Shannon’s information theory, and relates directly to entropy as its expectation over a distribution.
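A short sketch (again assuming NumPy; self_information is our own illustrative function, not a library routine) shows the surprisal of a likely and an unlikely event, and verifies that entropy is simply the expected self-information over a distribution:

```python
import numpy as np

def self_information(p, base=2):
    """Self-information (surprisal) of an event with probability p, in bits by default."""
    return -np.log(p) / np.log(base)

print(self_information(0.5))    # 1 bit  -- a fair coin flip
print(self_information(0.125))  # 3 bits -- a rarer, more surprising event

# Entropy = expected self-information over the outcomes of a distribution.
probs = np.array([0.7, 0.3])
print(np.sum(probs * self_information(probs)))  # ≈ 0.881 bits
```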
Together, these concepts provide the theoretical foundation for interpreting evidence in Bayesian hypothesis testing and decision-making. For example, KL divergence can serve as a loss function for evaluating how closely a Bayesian model approximates the true data-generating process. It also underpins approaches to designing objective or loss-based priors that avoid paradoxical behaviour such as the Jeffreys-Lindley-Bartlett paradox.
In summary, these tools help researchers calibrate tests, choose priors, and interpret evidence more rigorously by connecting probabilistic inference with fundamental information theory.
Here are examples illustrating KL divergence and self-information loss in Bayesian statistics and information theory contexts:
Example 1: Kullback-Leibler Divergence
Suppose we have two discrete probability distributions over the outcomes of a biased coin toss: the true distribution P = (0.7, 0.3) and the approximation Q = (0.6, 0.4), over {heads, tails}.
The KL divergence DKL(P ∥ Q) measures how much information is lost when Q is used to approximate P:
DKL(P ∥ Q) = 0.7×log(0.7/0.6) + 0.3×log(0.3/0.4) ≈ 0.7×0.154 + 0.3×(−0.288) = 0.108 − 0.086 ≈ 0.022 nats (using natural logarithms)
This small positive value indicates that Q is close to P, but not identical, quantifying the informational “distance” between these distributions.
In Bayesian model comparison, KL divergence can be used to measure how well an approximate posterior matches the true posterior or evaluate which model generates less information loss.
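As a quick numerical check of this example (a sketch assuming SciPy is available), scipy.stats.entropy returns exactly this discrete KL divergence when given two distributions:

```python
from scipy.stats import entropy

p = [0.7, 0.3]   # true distribution P
q = [0.6, 0.4]   # approximating distribution Q

# With two arguments, scipy.stats.entropy computes D_KL(P || Q), in nats by default.
print(entropy(p, q))          # ≈ 0.0216 nats
print(entropy(p, q, base=2))  # ≈ 0.0312 bits
```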
Example 2: Self-Information Loss
For a single event with probability p(x):
- If p(x) = 0.9, the event is highly likely. The self-information is:
I(x) = −log2(0.9) ≈ 0.152 bits
- If p(x) = 0.01, the event is rare and surprising. The self-information is: I(x) = −log2(0.01) ≈ 6.64 bits
This illustrates how unlikely observations carry more "surprise" or informational content compared to frequent events.
In Bayesian inference, self-information loss helps quantify how surprising observed data are under a given model, motivating the selection of models that minimise overall surprisal.
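To make that idea concrete, here is a small sketch (hypothetical coin-toss data; total_surprisal is our own helper, with the probability of heads as the model parameter) that sums the self-information of a dataset under two candidate Bernoulli models. The model closer to the empirical frequency accumulates less total surprisal, which is equivalent to a higher log-likelihood:

```python
import numpy as np

# Hypothetical data: 1 = heads, 0 = tails (7 heads, 3 tails).
data = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])

def total_surprisal(data, p_heads):
    """Total self-information of the data in bits under a Bernoulli model
    with the given probability of heads (i.e. the negative log2-likelihood)."""
    probs = np.where(data == 1, p_heads, 1.0 - p_heads)
    return float(np.sum(-np.log2(probs)))

print(total_surprisal(data, 0.7))  # ≈ 8.81 bits -- less surprised by the data
print(total_surprisal(data, 0.5))  # = 10.0 bits -- more surprised by the data
```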
Suggested Readings:
- Tsuruyama, T. (2025). Kullback–Leibler divergence as a measure of irreversible information loss near black hole horizons. The European Physical Journal Plus, 140(6), 588.
- Mondal, S. (2024). Understanding KL Divergence, Entropy, and Related Concepts.
- Pérez-Cruz, F. (2008). Kullback-Leibler divergence estimation of continuous distributions. In IEEE International Symposium on Information Theory (pp. 1666-1670). IEEE.
- Kullback-Leibler Divergence. GeeksforGeeks.
- What is Kullback-Leibler (KL) Divergence? KL Divergence Explained. Soulpage.
- Kullback-Leibler Divergence loss function giving negative values. PyTorch.
Suggested Citation: Meitei, W. B. (2025). Kullback-Leibler divergence and Self-Information loss. WBM STATS.