Friday, 8 August 2025

Stein's shrinkage prior

by W. B. Meitei, PhD


Stein's shrinkage prior originates in the work of Charles Stein in the 1950s and was later formalised by James and Stein in their famous 1961 estimator. It is a type of prior used in Bayesian estimation and shrinkage methods that "shrinks" estimates toward a central value (often zero or another fixed point), improving overall estimation accuracy, especially in high-dimensional problems.

Key features and context:

  • Stein discovered that the usual estimator of a multivariate normal mean (the sample mean) is inadmissible when the dimension is three or more, meaning there exist other estimators with uniformly lower mean squared error (MSE). This phenomenon is known as Stein's paradox.
  • The James-Stein estimator shrinks the sample means toward a central vector, trading a small amount of bias for a substantial reduction in variance and hence a lower overall MSE; the short simulation after this list illustrates the gain.
  • Stein’s shrinkage prior can be viewed as a Bayesian prior that induces this kind of shrinkage. In the Bayesian framework, priors inspired by Stein’s ideas are often superharmonic or singular at the origin, properties that lead to improved estimation and prediction.
  • These priors are "non-informative" in a special sense: rather than being flat, they encourage shrinkage and improve performance, particularly in multivariate normal models.
  • Stein’s shrinkage priors have been extended to matrix-valued parameters and employed in advanced Bayesian prediction and estimation problems, preserving robust and minimax properties.
  • The essential intuition is that by pooling information across parameters and shrinking estimates toward a common value, one can achieve better overall estimation performance than treating parameters independently.
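To make the paradox concrete, here is a small Monte Carlo sketch (plain Python with NumPy; the dimension, noise level, and true mean vector are arbitrary illustrative choices) comparing the risk of the raw observations with that of the James-Stein estimator defined in the Key formulae section below:

    import numpy as np

    rng = np.random.default_rng(0)
    p, sigma2, n_rep = 10, 1.0, 20_000        # dimension, noise variance, replications
    theta = rng.normal(0.0, 2.0, size=p)      # an arbitrary "true" mean vector

    # One observation vector per replication: X ~ N(theta, sigma2 * I)
    X = theta + np.sqrt(sigma2) * rng.normal(size=(n_rep, p))

    # James-Stein: shrink each observation vector toward zero
    shrink = 1.0 - (p - 2) * sigma2 / np.sum(X**2, axis=1, keepdims=True)
    js = shrink * X

    mse_raw = np.mean(np.sum((X - theta)**2, axis=1))
    mse_js = np.mean(np.sum((js - theta)**2, axis=1))
    print(f"risk of raw estimator: {mse_raw:.2f}, James-Stein: {mse_js:.2f}")

With these settings the raw estimator's risk is close to p = 10, while the James-Stein risk comes out lower, even though every coordinate of the shrunken estimator is biased.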

Overall, Stein's result fundamentally altered the understanding of estimation by showing that "shrinking" estimates can improve performance, and his shrinkage prior provides a Bayesian route to constructing such improved estimators. It underpins the broad class of shrinkage and regularisation methods widely used in modern statistics and machine learning.

Stein’s shrinkage prior and its associated shrinkage estimators are fundamental tools in Bayesian statistics, particularly valuable in high-dimensional settings where estimating many parameters simultaneously can be challenging.

Key formulae

  1. James-Stein Estimator (for multivariate normal means):

Assume we observe \(\mathbf{X} = (X_{1}, X_{2}, \ldots, X_{p})\) with independent components \(X_{i} \sim N(\theta_{i}, \sigma^{2})\), where the means \(\theta_{i}\) are unknown and \(\sigma^{2}\) is known. The usual (maximum likelihood) estimator of \(\theta\) is the observation vector \(\mathbf{X}\) itself. For \(p \geq 3\), the James-Stein estimator shrinks these coordinates toward zero (or another fixed point) as:

\[\hat{\theta}_{i}^{JS} = \left( 1 - \frac{(p - 2)\sigma^{2}}{\left\| \mathbf{X} \right\|^{2}} \right)X_{i}\]

where,

\[\left\| \mathbf{X} \right\|^{2} = \sum_{i = 1}^{p}X_{i}^{2}\]
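In code, the estimator just defined might look as follows (a minimal Python/NumPy sketch; the optional positive-part clipping is an extra refinement, the usual "JS+" variant, and is not part of the formula above):

    import numpy as np

    def james_stein(x, sigma2=1.0, positive_part=True):
        """James-Stein estimate of the mean vector, shrinking x toward zero.

        x            : 1-D array of observations, one per unknown mean theta_i
        sigma2       : known observation variance
        positive_part: clip the shrinkage factor at zero ('JS+' variant)
        """
        x = np.asarray(x, dtype=float)
        p = x.size
        if p < 3:
            return x.copy()                    # no uniform improvement exists for p < 3
        factor = 1.0 - (p - 2) * sigma2 / np.sum(x**2)
        if positive_part:
            factor = max(factor, 0.0)          # never shrink past the origin
        return factor * x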

  2. Stein’s Shrinkage Prior (informal form): A prior that favours shrinkage can be thought of as having density proportional to

\[\pi(\theta) \propto \left\| \theta \right\|^{- (p - 2)}\]

which encourages shrinking θ toward zero, improving estimation accuracy.
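One way to see why such a prior shrinks is through a standard identity (often called Tweedie's formula, or Brown's representation; the notation here, with \(m\) the marginal density of the data, is ours): for \(\mathbf{X} \sim N_{p}(\theta, \sigma^{2}I)\), the posterior mean under a prior \(\pi\) is

\[E\left\lbrack \theta \mid \mathbf{X} \right\rbrack = \mathbf{X} + \sigma^{2}\nabla\log m(\mathbf{X}), \qquad m(\mathbf{x}) = \int \varphi_{\sigma}(\mathbf{x} - \theta)\,\pi(\theta)\, d\theta\]

Under Stein's prior the marginal \(m\) decreases with \(\left\| \mathbf{X} \right\|\), so the gradient term points back toward the origin and the posterior mean is a shrunken version of \(\mathbf{X}\); Stein's 1981 result shows that superharmonicity of \(\sqrt{m}\) is enough for the resulting estimator to be minimax.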

Examples

  • If we have multiple parameters (e.g., treatment effects for different groups), the James-Stein estimator will pull extreme observed estimates back towards the overall mean, decreasing the overall estimation error; the sketch following this list shows one way to do this.
  • This approach improves over treating the parameters independently, especially when p ≥ 3.
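The version that shrinks toward the overall mean rather than toward zero is a small modification; a sketch (Python/NumPy; the seven "treatment effects" are made-up numbers, and the p − 3 in place of p − 2 reflects the degree of freedom spent estimating the common mean, as in Efron and Hastie's chapter listed below):

    import numpy as np

    def james_stein_to_mean(x, sigma2=1.0):
        """Shrink each observed group effect toward the grand mean."""
        x = np.asarray(x, dtype=float)
        p = x.size
        xbar = x.mean()
        s2 = np.sum((x - xbar)**2)
        factor = max(0.0, 1.0 - (p - 3) * sigma2 / s2)   # positive-part form
        return xbar + factor * (x - xbar)

    # Hypothetical observed treatment effects for seven groups
    effects = np.array([2.8, 0.4, -1.1, 0.9, 3.5, -0.2, 0.6])
    print(james_stein_to_mean(effects))   # extreme effects move toward the mean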

Connection to empirical Bayes

  1. Empirical Bayes methods estimate hyperparameters of a hierarchical model directly from the data, often using marginal likelihood.
  2. The James-Stein estimator can be interpreted as an empirical Bayes estimator where the prior variance is estimated from the data, shrinking individual estimates toward a common mean or zero.
  3. Empirical Bayes blends data-driven prior estimation with Bayesian updating, effectively adapting the amount of shrinkage to the observed variability; the sketch below makes the connection concrete.
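A minimal empirical Bayes sketch for the normal means model (Python/NumPy; the model \(X_i \sim N(\theta_i, \sigma^2)\) with \(\theta_i \sim N(\mu, \tau^2)\) and the moment-based estimate of \(\tau^2\) are standard but not the only choices):

    import numpy as np

    def eb_shrinkage(x, sigma2=1.0):
        """Empirical Bayes posterior means: X_i ~ N(theta_i, sigma2),
        theta_i ~ N(mu, tau2), with mu and tau2 estimated from the data."""
        x = np.asarray(x, dtype=float)
        mu_hat = x.mean()
        # Marginally Var(X_i) = sigma2 + tau2, so estimate tau2 by moments
        tau2_hat = max(0.0, x.var(ddof=1) - sigma2)
        weight = tau2_hat / (tau2_hat + sigma2)    # weight on the data; 1 - weight is the shrinkage
        return mu_hat + weight * (x - mu_hat)

When tau2_hat is zero every estimate collapses to the grand mean; when it is large relative to sigma2 the raw observations are barely touched, which is the data-adaptive shrinkage described in point 3.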

Connection to hierarchical modelling

  1. Hierarchical models explicitly specify multi-level structures where individual parameters are drawn from a common prior distribution with hyperparameters (e.g., group means with a shared variance).
  2. Stein's shrinkage naturally arises in these models because parameter estimates borrow strength from each other via the shared prior.
  3. The hierarchical Bayesian framework generalises Stein’s shrinkage by placing priors on the hyperparameters, allowing flexible, data-driven shrinkage via Markov chain Monte Carlo (MCMC) or other Bayesian inference methods; a minimal Gibbs-sampler sketch follows.
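As an illustration of that last point, here is a bare-bones Gibbs sampler for the hierarchical normal means model (a sketch, assuming \(X_i \sim N(\theta_i, \sigma^2)\) with known \(\sigma^2\), \(\theta_i \sim N(\mu, \tau^2)\), a flat prior on \(\mu\), and an inverse-gamma(1, 1) prior on \(\tau^2\); all of these prior choices are illustrative):

    import numpy as np

    def gibbs_normal_means(x, sigma2=1.0, n_iter=5000, seed=0):
        """Gibbs sampler for X_i ~ N(theta_i, sigma2), theta_i ~ N(mu, tau2),
        with a flat prior on mu and an inverse-gamma(1, 1) prior on tau2."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        p = x.size
        mu, tau2 = x.mean(), max(x.var(), 1e-6)
        draws = np.empty((n_iter, p))
        for t in range(n_iter):
            # theta_i | rest: precision-weighted compromise between x_i and mu
            prec = 1.0 / sigma2 + 1.0 / tau2
            theta = rng.normal((x / sigma2 + mu / tau2) / prec, np.sqrt(1.0 / prec))
            # mu | rest: normal around the average of the theta_i
            mu = rng.normal(theta.mean(), np.sqrt(tau2 / p))
            # tau2 | rest: conjugate inverse-gamma update
            shape = 1.0 + p / 2.0
            scale = 1.0 + 0.5 * np.sum((theta - mu)**2)
            tau2 = scale / rng.gamma(shape)
            draws[t] = theta
        return draws.mean(axis=0)   # posterior-mean estimates of the theta_i

The posterior means it returns show exactly the Stein-type pooling described above: each \(\theta_i\) is pulled toward \(\mu\), with the amount of pooling governed by the posterior of \(\tau^2\).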



Suggested Readings:

  1. Brown, L. D., & Zhao, L. H. (2012). A geometrical explanation of Stein shrinkage.
  2. DiTraglia, F. (2024). Not Quite the James-Stein Estimator. Econometrics Blog.
  3. Efron, B., & Hastie, T. (2021). Chapter 7: James-Stein estimation and ridge regression. Computer age statistical inference, student edition: Algorithms, evidence, and data science (Vol. 6). Cambridge University Press.
  4. Kasy, M. (2023). Shrinkage in the normal means model. Foundations of Machine Learning.
  5. Maruyama, Y. (2024). Special feature: Stein estimation and statistical shrinkage methods. Japanese Journal of Statistics and Data Science, 7(1), 253-256.
  6. Nakada, R., Kubokawa, T., Ghosh, M., & Karmakar, S. (2021). Shrinkage estimation with singular priors and an application to small area estimation. Journal of Multivariate Analysis, 183, 104726.
  7. Tibshirani, R. (2015). Stein's Unbiased Risk Estimate. Statistical Machine Learning.

Suggested Citation: Meitei, W. B. (2025). Stein's shrinkage prior. WBM STATS.
