
Sunday, 10 August 2025

Kullback-Leibler divergence and Self-Information loss

by W. B. Meitei, PhD


Kullback-Leibler divergence and self-information loss are fundamental concepts in information theory and Bayesian statistics that quantify differences between probability distributions and measure the amount of information contained in events or models.

Kullback-Leibler (KL) divergence is a measure of how one probability distribution Q diverges from a reference or true distribution P. It is often described as the expected excess "surprisal" or information lost when using the approximation Q instead of P. Mathematically, for discrete variables, it is defined as:

\[D_{KL}\left( P \,\|\, Q \right) = \sum_{x}{P(x)\log\frac{P(x)}{Q(x)}}\]

KL divergence is non-negative and zero only when the two distributions are identical. It is asymmetric and thus not a true distance metric, but rather a "divergence." In Bayesian inference, KL divergence is widely used to compare models, assess model fit, and quantify the information gained when updating from priors to posteriors. It also plays a key role in variational inference as a loss function to approximate complex distributions.

Self-information loss (also known as information content or surprisal) measures the amount of information or surprise brought by observing a particular event x with probability p(x). It is defined as:

\[I(x) = -\log{p(x)}\]

Rare or unexpected events carry more self-information since they are more surprising, while very likely events have near-zero self-information. This measure forms the basic unit of information in Shannon’s information theory, and relates directly to entropy as its expectation over a distribution.

Together, these concepts provide the theoretical foundation for interpreting evidence in Bayesian hypothesis testing and decision-making. For example, KL divergence can be used as a loss function to evaluate how closely a Bayesian model approximates the true data generating process. It also underpins approaches to designing objective or loss-based priors to avoid paradoxical behaviour like in the Jeffreys-Lindley-Bartlett paradox.

In summary:

KL divergence quantifies how much one distribution diverges from, or loses information relative to, another; self-information quantifies the surprise carried by observing a single event.

These tools help researchers calibrate tests, choose priors, and interpret evidence more rigorously by connecting probabilistic inference with fundamental information theory.

Here are examples illustrating KL divergence and self-information loss in Bayesian statistics and information theory contexts:

Example 1: Kullback-Leibler Divergence

Suppose we have two discrete probability distributions over the outcomes of a biased coin toss:

True distribution P: Probability of heads = 0.7, tails = 0.3

Approximating distribution Q: Probability of heads = 0.6, tails = 0.4

The KL divergence DKL(P ∥ Q) measures how much information is lost when Q is used to approximate P:

DKL(P ∥ Q) = 0.7×ln(0.7/0.6) + 0.3×ln(0.3/0.4) ≈ 0.7×0.154 + 0.3×(−0.288) = 0.108 − 0.086 ≈ 0.022 nats

This small positive value indicates that Q is close to P, but not identical, quantifying the informational “distance” between these distributions.

In Bayesian model comparison, KL divergence can be used to measure how well an approximate posterior matches the true posterior or evaluate which model generates less information loss.
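
The calculation is easy to reproduce in code. Below is a minimal Python sketch (NumPy assumed; the numbers are simply the ones from this example) that computes DKL(P ∥ Q) in nats and also evaluates the reverse divergence DKL(Q ∥ P) to make the asymmetry concrete:

```python
import numpy as np

# Coin-toss example from above: natural logarithm, so the result is in nats.
p = np.array([0.7, 0.3])  # true distribution P (heads, tails)
q = np.array([0.6, 0.4])  # approximating distribution Q (heads, tails)

kl_pq = np.sum(p * np.log(p / q))  # D_KL(P || Q)
kl_qp = np.sum(q * np.log(q / p))  # D_KL(Q || P) -- not the same, KL is asymmetric

print(f"D_KL(P || Q) = {kl_pq:.4f} nats")  # ~0.0216
print(f"D_KL(Q || P) = {kl_qp:.4f} nats")  # ~0.0226
```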

Example 2: Self-Information Loss

For a single event with a probability p(x):

• If p(x) = 0.9, the event is highly likely. The self-information is:

  I(x) = −log2(0.9) ≈ 0.152 bits

• If p(x) = 0.01, the event is rare and surprising. The self-information is:

  I(x) = −log2(0.01) ≈ 6.64 bits

This illustrates how unlikely observations carry more "surprise" or informational content compared to frequent events.

In Bayesian inference, self-information loss helps quantify how surprising observed data are under a given model, motivating the selection of models that minimise overall surprisal.
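
The same quantities in a short Python sketch, with an extra p(x) = 0.5 case added for reference (a fair-coin flip carries exactly one bit of information):

```python
import numpy as np

def self_information(p):
    """Self-information (surprisal) in bits: I(x) = -log2 p(x)."""
    return -np.log2(p)

for prob in (0.9, 0.5, 0.01):
    print(f"p(x) = {prob:<4}: I(x) = {self_information(prob):.3f} bits")
# p(x) = 0.9 -> ~0.152 bits; p(x) = 0.5 -> 1 bit; p(x) = 0.01 -> ~6.644 bits
```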



    Suggested Readings:

1. Tsuruyama, T. (2025). Kullback–Leibler divergence as a measure of irreversible information loss near black hole horizons. The European Physical Journal Plus, 140(6), 588.
2. Mondal, S. (2024). Understanding KL Divergence, Entropy, and Related Concepts.
3. Pérez-Cruz, F. (2008). Kullback-Leibler divergence estimation of continuous distributions. IEEE International Symposium on Information Theory (pp. 1666-1670). IEEE.
4. Kullback-Leibler Divergence. Geeks for Geeks.
5. What is Kullback-Leibler (KL) Divergence? KL Divergence Explained. Soulpage.
6. Kullback-Leibler Divergence loss function giving negative values. PyTorch.

    Suggested Citation: Meitei, W. B. (2025). Kullback-Leibler divergence and Self-Information loss. WBM STATS.

    Pathologies in statistics

    by W. B. Meitei, PhD


    In statistics, the term "pathological" or "pathologies" refers to examples, cases, or behaviours that are unusual, counterintuitive, or problematic within the typical theoretical framework. These pathologies highlight instances where standard assumptions or commonly accepted rules fail, leading to results that may appear strange, misleading, or mathematically "monstrous." They often expose limitations or gaps in existing theories and motivate the development of improved methods or new theoretical insights.

    Definition: Pathologies in statistics denote problematic or extreme examples where statistical methods, assumptions, or models break down or produce misleading results. Such cases may defy intuition or expected behaviours and often require special treatment.

    Purpose: These "pathological" cases serve as important counterexamples that sharpen understanding by demonstrating where existing approaches fail or need refinement. They can reveal hidden pitfalls in data analysis or hypothesis testing.

    Examples of Statistical Pathologies

• Bartlett’s paradox: When using overly diffuse (improper or very broad) priors in Bayesian hypothesis testing, the Bayes factor may perversely favour the null hypothesis regardless of the data. This counterintuitive outcome is a classic pathological behaviour (a short numerical sketch follows the list below).
    • Jeffreys-Lindley paradox: With large samples, Bayesian and frequentist methods yield conflicting conclusions due to how evidence scales with sample size and prior specification, a fundamental pathology in hypothesis testing frameworks.
    • Pathological convergence: Certain estimators may fail usual consistency or convergence properties under specific data or design conditions, leading to irregular or misleading inference.
    • Non-identifiability or ill-posed problems: Situations where parameters cannot be estimated uniquely, causing standard estimators to behave erratically.
    • Pathological functions and distributions: In a broader mathematical context, examples like nowhere-differentiable functions (e.g., Weierstrass function) demonstrate pathological behaviours that challenge classical assumptions, indirectly affecting statistical modelling.
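
As an illustration of Bartlett's paradox referenced in the first bullet, here is a small numerical sketch under simple conjugate-normal assumptions (chosen for illustration, not taken from the cited readings): testing H0: μ = 0 against H1: μ ~ N(0, τ²) for N(μ, 1) data, the Bayes factor in favour of the null grows without bound as the prior variance τ² is made more diffuse, even when the observed mean sits about two standard errors away from zero.

```python
import numpy as np
from scipy.stats import norm

# Bartlett's paradox sketch: H0: mu = 0 vs H1: mu ~ N(0, tau^2), data x_i ~ N(mu, 1).
# With known variance the sample mean is sufficient, so the Bayes factor is the ratio
# of its marginal densities under the two hypotheses.
def bayes_factor_01(xbar, n, tau):
    m0 = norm.pdf(xbar, loc=0.0, scale=np.sqrt(1.0 / n))           # marginal under H0
    m1 = norm.pdf(xbar, loc=0.0, scale=np.sqrt(tau**2 + 1.0 / n))  # marginal under H1
    return m0 / m1

xbar, n = 0.4, 25  # observed mean about two standard errors from zero
for tau in (1, 10, 100, 1000):
    print(f"tau = {tau:>4}: BF01 = {bayes_factor_01(xbar, n, tau):.2f}")
# BF01 keeps growing with tau: the more diffuse the prior, the more the Bayes factor
# appears to support H0, regardless of what the data say.
```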

    Why Are Pathologies Important?

    • They reveal limitations of standard statistical tools and assumptions.
    • Prompt the creation of robust methods and more careful model specifications.
    • Highlight the need for informed prior choices in Bayesian analysis and for caution in interpreting frequentist measures like p-values.
    • Stimulate research and theoretical development to address these challenges, such as cake priors or adaptive testing methods developed to overcome specific paradoxes.



    Suggested Readings:

    1. Sudre, C. H., Cardoso, M. J., Bouvy, W., Biessels, G. J., Barnes, J., & Ourselin, S. (2014). Bayesian model selection for pathological data. International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 323-330). Cham: Springer International Publishing.
2. Pathological. Wolfram MathWorld.

    Suggested Citation: Meitei, W. B. (2025). Pathologies in statistics. WBM STATS.

    Friday, 8 August 2025

    Stein's shrinkage prior

    by W. B. Meitei, PhD


    Stein's shrinkage prior originates from the work of Charles Stein in the 1950s and was later formalised by James and Stein in their famous James-Stein estimator. It is a type of prior used in Bayesian estimation and shrinkage methods that "shrinks" estimates toward a central value (often zero or another fixed point), improving overall estimation accuracy, especially in high-dimensional problems.

    Key features and context:

    • Stein discovered that the usual estimator of a multivariate normal mean (the sample mean) is inadmissible when the dimension is three or more, meaning there exist other estimators with uniformly lower mean squared error (MSE). This phenomenon is known as Stein's paradox.
    • The James-Stein estimator shrinks the sample means toward a central vector, trading a small amount of bias for a substantial reduction in variance, leading to lower overall MSE.
    • Stein’s shrinkage prior can be viewed as a Bayesian prior that encourages the shrinkage of estimates. In the Bayesian framework, priors inspired by Stein’s ideas often have super-harmonic or singular properties that lead to improved estimation and prediction.
    • This class of priors is "non-informative" in a special sense: instead of being flat, they encourage shrinkage and improve performance, particularly for multivariate normal models.
    • Stein’s shrinkage priors have been extended to matrix-valued parameters and employed in advanced Bayesian prediction and estimation problems, preserving robust and minimax properties.
    • The essential intuition is that by pooling information across parameters and shrinking estimates toward a common value, one can achieve better overall estimation performance than treating parameters independently.

    Overall, Stein's shrinkage prior fundamentally altered the understanding of estimation by showing that "shrinking" estimates allows better performance, and it provides a Bayesian framework for constructing such improved estimators. It underpins a broad class of shrinkage and regularisation methods widely used in modern statistics and machine learning.

    Stein’s shrinkage prior and its associated shrinkage estimators are fundamental tools in Bayesian statistics, particularly valuable in high-dimensional settings where estimating many parameters simultaneously can be challenging.

    Key formulae

    1. James-Stein Estimator (for multivariate normal means):

Assume observations X = (X_1, X_2, …, X_p) are independent N(θ_i, σ²) with unknown means θ_i. The usual estimator is the sample mean X itself. The James-Stein estimator shrinks these estimates toward zero (or another fixed point) as:

\[\hat{\theta}_{i}^{JS} = \left( 1 - \frac{(p - 2)\sigma^{2}}{\left\| \mathbf{X} \right\|^{2}} \right)X_{i}\]

    where,

    \[\left\| \mathbf{X} \right\|^{2} = \sum_{i = 1}^{p}X_{i}^{2}\]

2. Stein’s Shrinkage Prior (informal form): A prior that favours shrinkage can be thought of as having density proportional to

    \[\pi(\theta) \propto \left\| \theta \right\|^{- (p - 2)}\]

    which encourages shrinking θ toward zero, improving estimation accuracy.

     Examples

    • If we have multiple parameters (e.g., treatment effects for different groups), the James-Stein estimator will pull extreme observed estimates back towards the overall mean, decreasing the overall estimation error.
• This approach improves over treating parameters independently, especially when p ≥ 3; a small simulation sketch follows below.
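
The following is a minimal simulation sketch of that pooling effect (assumptions: p = 20 unknown means, one N(θ_i, 1) observation each, shrinkage toward zero, and the positive-part variant of the James-Stein factor to keep it non-negative):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

p = 20
theta = rng.normal(0.0, 1.0, size=p)  # true means
x = rng.normal(theta, 1.0)            # one N(theta_i, 1) observation per mean

# Positive-part James-Stein shrinkage toward zero (sigma^2 = 1 assumed known).
shrink = max(0.0, 1.0 - (p - 2) / np.sum(x**2))
theta_js = shrink * x

mse_raw = np.mean((x - theta) ** 2)        # usual estimator: X itself
mse_js = np.mean((theta_js - theta) ** 2)  # James-Stein estimate
print(f"MSE of X itself:     {mse_raw:.3f}")
print(f"MSE of James-Stein:  {mse_js:.3f}  (typically smaller when p >= 3)")
```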

    Connection to empirical Bayes

    1. Empirical Bayes methods estimate hyperparameters of a hierarchical model directly from the data, often using marginal likelihood.
    2. The James-Stein estimator can be interpreted as an empirical Bayes estimator where the prior variance is estimated from the data, shrinking individual estimates toward a common mean or zero.
    3. Empirical Bayes blends data-driven prior estimation with Bayesian updating, effectively adapting the amount of shrinkage based on observed variability.

    Connection to hierarchical modelling

    1. Hierarchical models explicitly specify multi-level structures where individual parameters are drawn from a common prior distribution with hyperparameters (e.g., group means with a shared variance).
    2. Stein's shrinkage naturally arises in these models because parameter estimates borrow strength from each other via the shared prior.
3. The hierarchical Bayesian framework generalises Stein’s shrinkage by placing priors on hyperparameters, allowing flexible and data-driven shrinkage through Markov Chain Monte Carlo or other Bayesian inference methods.



    Suggested Readings:

    1. Brown, L. D., & Zhao, L. H. (2012). A geometrical explanation of Stein shrinkage.
    2. DiTraglia, F. (2024). Not Quite the James-Stein Estimator. Econometrics Blog.
    3. Efron, B., & Hastie, T. (2021). Chapter 7: James Stein Estimation and Ridge Regression. Computer age statistical inference, student edition: algorithms, evidence, and data science (Vol. 6). Cambridge University Press.
4. Kasy, M. (2023). Shrinkage in the Normal means model. Foundations of machine learning.
5. Maruyama, Y. (2024). Special feature: Stein estimation and statistical shrinkage methods. Japanese Journal of Statistics and Data Science, 7(1), 253-256.
6. Nakada, R., Kubokawa, T., Ghosh, M., & Karmakar, S. (2021). Shrinkage estimation with singular priors and an application to small area estimation. Journal of Multivariate Analysis, 183, 104726.
    7. Tibshirani, R. (2015). Stein's Unbiased Risk Estimate. Statistical Machine Learning.

Suggested Citation: Meitei, W. B. (2025). Stein's shrinkage prior. WBM STATS.

    Almost sure hypothesis testing

    by W. B. Meitei, PhD


    Almost sure hypothesis testing is a statistical framework that guarantees, with probability one, the correct decision as the sample size becomes very large. Specifically, an almost sure hypothesis test is designed so that:

    • If the null hypothesis is true, the probability that the test incorrectly rejects it goes to zero as the sample size grows, eventually failing to reject the null with probability one for all large samples.
    • Conversely, if the alternative hypothesis is true, the test will reject the null hypothesis with probability one for sufficiently large samples.

    This concept leverages the idea of almost sure convergence, meaning the test's decision converges to the correct outcome with probability one in the long run.

    Almost sure hypothesis testing is notable because it differs from the classical fixed-level tests that maintain a constant significance level (e.g., 0.05) regardless of sample size. Such fixed-level tests can be problematic in large samples, leading to paradoxes like the Jeffreys-Lindley paradox, where frequentist and Bayesian conclusions diverge. In contrast, almost sure hypothesis testing allows the significance level to decrease as the sample size increases, ensuring the probabilities of both Type I (false positive) and Type II (false negative) errors tend to zero.
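
A simple way to see this in action is a simulation sketch (illustrative assumptions only, not the construction used in the cited papers): for N(μ, 1) data, reject H0: μ = 0 when |x̄| exceeds a critical value c_n = n^(−1/4), which shrinks with n but more slowly than the standard error 1/√n. Both error probabilities then vanish as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
reps = 20_000  # Monte Carlo replications per sample size

for n in (10, 100, 1_000, 10_000):
    c_n = n ** (-0.25)  # critical value shrinks, but slower than the standard error
    xbar_h0 = rng.normal(0.0, 1.0 / np.sqrt(n), size=reps)  # H0 true: mu = 0
    xbar_h1 = rng.normal(0.5, 1.0 / np.sqrt(n), size=reps)  # H1 true: mu = 0.5
    type_i = np.mean(np.abs(xbar_h0) > c_n)    # false rejections under H0
    type_ii = np.mean(np.abs(xbar_h1) <= c_n)  # missed rejections under H1
    print(f"n = {n:>6}: Type I ~ {type_i:.4f}, Type II ~ {type_ii:.4f}")
# Both error rates head to zero, whereas a fixed alpha = 0.05 test keeps its
# Type I error at 0.05 no matter how large the sample becomes.
```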

    Recent research highlights that these tests make only a finite number of erroneous decisions with probability one, providing a robust alternative to traditional significance testing. This methodology is applicable to a wide range of statistical settings, including independent and identically distributed data as well as dependent data with strong mixing properties.



    Suggested Readings:

1. Naaman, M. (2016). Almost sure hypothesis testing and a resolution of the Jeffreys-Lindley paradox. Electronic Journal of Statistics, 10, 1526–1550.
2. Dembo, A., & Peres, Y. (1994). A topological criterion for hypothesis testing. The Annals of Statistics, 106-117.

Suggested Citation: Meitei, W. B. (2025). Almost sure hypothesis testing. WBM STATS.

    Thursday, 7 August 2025

    Weakly informative prior

    by W. B. Meitei, PhD


A weakly informative prior in Bayesian statistics is a type of prior distribution that provides some guidance to the analysis by incorporating modest, reasonable information about the parameters but does not impose overly strong assumptions. Its purpose is to regularise the model, constraining estimates within a plausible range to prevent extreme or nonsensical values, while still allowing the data to influence the posterior inference substantially.

    Unlike noninformative priors that aim to contribute almost no information (and can lead to unstable or unrealistic estimates), weakly informative priors strike a balance by:

    • Steering estimates away from implausible extremes.
• Improving stability, especially when data are sparse or noisy.
    • Avoiding overfitting through gentle shrinkage towards reasonable values.

    For example, a weakly informative prior on a regression coefficient might be a normal distribution centred at zero with a moderately large standard deviation (e.g., Normal (0, 5²)), which allows the coefficient to vary widely but discourages implausibly large effects.
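
A minimal sketch of that regularising effect, using a conjugate normal approximation and hypothetical numbers: the Normal(0, 5²) prior leaves a precisely estimated coefficient essentially untouched, but pulls an implausibly large estimate from sparse, noisy data back toward zero.

```python
import numpy as np

def posterior_normal(beta_hat, se, prior_sd=5.0):
    """Combine a Normal(0, prior_sd^2) prior with a normal likelihood summary
    (estimate beta_hat, standard error se); returns posterior mean and sd."""
    prior_prec, like_prec = 1.0 / prior_sd**2, 1.0 / se**2
    post_var = 1.0 / (prior_prec + like_prec)
    post_mean = post_var * like_prec * beta_hat  # prior mean is zero
    return post_mean, np.sqrt(post_var)

# Well-estimated coefficient: the weakly informative prior is almost irrelevant.
print(posterior_normal(beta_hat=1.2, se=0.3))   # ~ (1.20, 0.30)
# Sparse, noisy data with an implausibly large estimate: gentle shrinkage kicks in.
print(posterior_normal(beta_hat=12.0, se=6.0))  # ~ (4.92, 3.84)
```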

    Using weakly informative priors helps make Bayesian analyses more robust and interpretable, reducing risks of misleading results due to overly vague or improper priors, while still respecting the data's signal.



    Suggested Readings:

    1. Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y. S. (2008). A weakly informative default prior distribution for logistic and other regression models.
    2. Evans, M., & Jang, G. H. (2011). Weak informativity and the information in one prior relative to another. Statistical Science. 26(3), 423-439.
3. Hamra, G. B., MacLehose, R. F., & Cole, S. R. (2013). Sensitivity analyses for sparse-data problems—using weakly informative Bayesian priors. Epidemiology, 24(2), 233-239.
4. Lemoine, N. P. (2019). Moving beyond noninformative priors: why and how to choose weakly informative priors in Bayesian analyses. Oikos, 128(7), 912-928.
    5. Weakly informative (uninformed) priors. EPIX Analytics.

Suggested Citation: Meitei, W. B. (2025). Weakly informative prior. WBM STATS.

    Cake prior

    by W. B. Meitei, PhD


A cake prior is a special type of prior distribution introduced for Bayesian hypothesis testing to address issues like the Bartlett-Lindley-Jeffreys paradox, which arise when using highly diffuse (non-informative) or improper priors in Bayes factor calculations. The term was coined in recent research to describe priors that allow analysts to use "diffuse" priors, preserving the practical advantages (having one's "cake"), while still producing theoretically sound inference (and "eating it too").

    Key characteristics:

    • Pseudo-normal form: Cake priors are constructed similarly to multivariate normal distributions, but with special features, such as a covariance structure that depends on the statistical model (e.g., through the Fisher information matrix), and additional scaling factors.
    • Resolution of paradoxes: When applied, cake priors allow Bayes factors to avoid the distortions and inconsistencies (e.g., overwhelming support for the null regardless of data) that commonly arise with conventional diffuse or improper priors. This makes hypothesis testing Chernoff-consistent: both Type I and Type II error probabilities approach zero as the sample size increases.
    • Penalisation: Bayesian test statistics under cake priors often resemble penalised likelihood ratio tests, aligning Bayesian and frequentist approaches in large samples.

    The name "cake prior" reflects the goal: to gain the convenience of diffuse priors ("have your cake") while retaining reliable inference ("eat it too").

    Cake priors were formally introduced by Ormerod, Stewart, Yu, and Romanes in 2017.



    Suggested Readings:

1. Ormerod, J. T., Stewart, M., Yu, W., & Romanes, S. E. (2024). Bayesian hypothesis tests with diffuse priors: Can we have our cake and eat it too? Australian & New Zealand Journal of Statistics, 66(2), 204-227.

    Suggested Citation: Meitei, W. B. (2025). Cake prior. WBM STATS.

    Chernoff-consistent

    by W. B. Meitei, PhD


    A statistical procedure is called Chernoff-consistent if, as the sample size grows, the probabilities of both Type I error (false positive) and Type II error (false negative) approach zero, meaning the test almost surely makes the correct decision for large sample sizes. In other words, the test's power (probability of correctly rejecting a false null) and its specificity (probability of correctly not rejecting a true null) both converge to one as data accrues.

    "Chernoff-consistency" is stronger than just requiring that one type of error vanish; it requires both errors to vanish asymptotically. The concept comes from Herman Chernoff's work in statistical hypothesis testing, which established the optimal rates at which errors can decrease with increasing sample size.

    In the context of recent statistical methodology, such as "cake priors," tests are described as Chernoff-consistent if they guarantee, with sufficient data, that incorrect decisions become exceedingly rare, addressing issues like the Bartlett-Lindley-Jeffreys paradox by ensuring both types of inference errors disappear in the limit.



    Suggested Readings:

1. Li, W., & Xie, W. (2024). Efficient estimation of the quantum Chernoff bound. Physical Review A, 110(2), 022415.
2. Tslil, O., Lehrer, N., & Carmi, A. (2020). Approaches to Chernoff fusion with applications to distributed estimation. Digital Signal Processing, 107, 102877.
    3. Kuszmaul, W. (2025). A Simple and Combinatorial Approach to Proving Chernoff Bounds and Their Generalizations. Symposium on Simplicity in Algorithms (SOSA) (pp. 77-93). Society for Industrial and Applied Mathematics.
    4. Rao, A. (2018). Lecture 21: The Chernoff Bound.
5. Williamson, D. P. (2016). Chernoff bounds. ORIE 6334 Spectral Graph Theory.

    Suggested Citation: Meitei, W. B. (2025). Chernoff-consistent. WBM STATS.