
Sunday, 5 April 2026

Metropolis-Hastings Algorithm

by W. B. Meitei, PhD


The Metropolis-Hastings (MH) algorithm is one of the most popular Markov Chain Monte Carlo (MCMC) algorithms. Like other MCMC methods, the MH algorithm generates a serially correlated sequence of draws whose distribution converges to a given target distribution, making it a powerful tool for sampling from complex or high-dimensional distributions when direct simulation is intractable. This property is especially valuable in Bayesian inference, where the target distribution is often a posterior known only up to a normalizing constant. In this article, we will explore the intuition behind the MH algorithm, its basic steps, and its implementation in practice.

Before we proceed to the details of the MH algorithm, it is important to understand the fundamentals of Markov chains and MCMC.

In particular, it is important to remember that an MCMC algorithm produces a random sequence \(\left\{ X_{t} \right\}\) with the following features:

  • \(\left\{ X_{t} \right\}\) is a Markov chain, i.e., given the current value \(X_{t}\), all future values \(X_{t + 1},\ X_{t + 2},\ldots,\ X_{t + n}\) are conditionally independent of the past values \(X_{1},\ldots,\ X_{t - 1}\), for any positive integers \(t\) and \(n\);
  • while any two values \(X_{t}\) and \(X_{t + s}\) are generally not independent, they become increasingly close to independent as the gap \(s\) grows, for any positive integers \(t\) and \(s\);
  • the sequence gradually converges to the target distribution, so that the distribution of \(X_{t}\) becomes more and more similar to the target as \(t\) increases.

When we run the MCMC algorithm for \(T\) iterations, we obtain an MCMC sample consisting of the realizations of the first \(T\) terms of the chain:

\[X_{1},\ X_{2},\ldots,\ X_{T}\]

We then use the empirical distribution of this sample to approximate the target distribution.

The Metropolis-Hastings algorithm

Let \(f(x)\) be the probability density (or probability mass) function of the distribution from which we want to generate a sample. We refer to \(f(x)\) as the target distribution. Let \(q(x\ |\ x')\) be a family of conditional distributions that we choose ourselves, and from which it is easy to simulate draws. The vectors \(x,x'\), and the argument of \(f\) all have the same dimension.

The MH algorithm begins with an initial value \(x_{1}\) belonging to the support of the target distribution. This starting point can be specified by the user or sampled from some suitable distribution. Then, the algorithm proceeds recursively to generate the subsequent values of \(x_{t}\).

At each time step \(t\), the next value \(x_{t}\) is generated as follows:

  • Generate a proposal \(y_{t}\) from \(q(y_{t}|x_{t - 1})\)
  • Compute the acceptance probability

\[p_{t} = min\left( \frac{f\left( y_{t} \right)}{f\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right)\]

  • Draw a uniform random variable \(u_{t}\) (on [0, 1])
  • If \(u_{t} \leq p_{t}\), accept the proposal and set \(x_{t} = y_{t}\); otherwise, reject the proposal and set \(x_{t} = x_{t - 1}\)

Here, \(y_{t}\), \(q\left( y_{t}|x_{t - 1} \right)\), and \(p_{t}\) are called proposal, proposal distribution, and acceptance probability, respectively.

As \(u_{t}\) is uniformly distributed on [0, 1], the probability of accepting the proposal \(y_{t}\) as the new draw \(x_{t}\) is exactly equal to \(p_{t}\). This simple decision rule ensures that the resulting chain gradually explores the support of the target distribution and, in the long run, produces samples that approximate \(f(x)\).
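The steps above can be sketched in a few lines of Python. The target and proposal below are our own illustrative choices (an unnormalized standard normal target and a Gaussian random-walk proposal), not something prescribed by the algorithm itself:

```python
import math
import random

random.seed(0)  # for reproducibility

def metropolis_hastings(f, q_sample, q_pdf, x0, n_iter):
    """Generic MH sampler: f is the (possibly unnormalized) target density,
    q_sample(x) draws a proposal given the current state x, and q_pdf(y, x)
    evaluates the proposal density q(y | x) (up to a constant)."""
    x = x0
    chain = [x]
    for _ in range(n_iter):
        y = q_sample(x)
        # Acceptance probability p_t = min( f(y)/f(x) * q(x|y)/q(y|x), 1 )
        ratio = (f(y) / f(x)) * (q_pdf(x, y) / q_pdf(y, x))
        if random.random() <= min(ratio, 1.0):  # accept with probability p_t
            x = y                               # otherwise keep x_{t-1}
        chain.append(x)
    return chain

# Illustrative target: unnormalized N(0, 1); Gaussian random-walk proposal.
f = lambda x: math.exp(-0.5 * x * x)
q_sample = lambda x: random.gauss(x, 1.0)
q_pdf = lambda y, x: math.exp(-0.5 * (y - x) ** 2)  # symmetric in (y, x)

chain = metropolis_hastings(f, q_sample, q_pdf, x0=0.0, n_iter=50_000)
burned = chain[5_000:]  # discard burn-in
mean = sum(burned) / len(burned)
var = sum((v - mean) ** 2 for v in burned) / len(burned)
print(f"sample mean = {mean:.2f}, sample variance = {var:.2f}")
```

Because this particular proposal is symmetric, the \(q\)-ratio cancels here, but the function keeps it so the same code also works with asymmetric proposals.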

A special case of the MH algorithm is the Metropolis algorithm (developed by Nicholas C. Metropolis and co-authors), which assumes a symmetric proposal distribution, i.e., \(q\left( y_{t} \middle| x_{t - 1} \right) = q(x_{t - 1}|y_{t})\), whereas the MH algorithm can handle both symmetric and asymmetric proposal distributions. The acceptance probability of the Metropolis algorithm is,

\[p_{t} = min\left( \frac{f\left( y_{t} \right)}{f\left( x_{t - 1} \right)},1 \right)\]

Fig 1: Comparison of the Metropolis and the Metropolis-Hastings algorithms (Python Code)

Advantages of the Metropolis-Hastings algorithm over other algorithms

One of the main strengths of the MH algorithm is that it only requires knowledge of the target distribution \(f(x)\) up to a multiplicative constant. In practice, this means we do not need to compute the often‑intractable normalizing constant of the distribution. This feature is especially valuable in Bayesian inference, where the posterior is typically known only up to a constant, because the likelihood and prior are fully specified, but the marginal distribution (the normalizing integral) is difficult or impossible to evaluate in closed form.

Suppose,

\[f(x) = ch(x)\]

We assume that we know \(h(x)\) but not \(c\).

The acceptance probability according to the Metropolis-Hastings algorithm is

\[p_{t} = min\left( \frac{f\left( y_{t} \right)}{f\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right) = min\left( \frac{ch\left( y_{t} \right)}{ch\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right)\]

\[= min\left( \frac{h\left( y_{t} \right)}{h\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right)\]

Thus, the acceptance probability, which is the only quantity that depends on \(f\), can be computed without knowing the constant \(c\). This is the beauty of the MH algorithm: we can generate draws from a distribution even if we do not fully know its probability density (or mass) function.

An independent Metropolis-Hastings algorithm

It is a special case of the MH algorithm (named by Tierney, 1994) in which the proposal distribution does not depend on the current state, i.e., \(q(y_{t}|x_{t - 1})\) does not depend on \(x_{t - 1}\), or, \(q\left( y_{t} \middle| x_{t - 1} \right) = q(y_{t})\). It is also called the independence chain MH algorithm.

For example, a common choice is to draw proposals \(y_{t}\) from a fixed multivariate normal distribution (each draw is independent of the previous ones).

When \(y_{t}\) is drawn independently, the acceptance probability is,

\[p_{t} = min\left( \frac{f\left( y_{t} \right)}{f\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1} \right)}{q\left( y_{t} \right)},1 \right)\]
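A minimal sketch of the independence sampler, again with illustrative choices of our own (an unnormalized N(0, 1) target, with proposals drawn from N(0, 2²) regardless of the current state):

```python
import math
import random

random.seed(1)  # for reproducibility

f = lambda x: math.exp(-0.5 * x * x)               # unnormalized target: N(0, 1)
q_pdf = lambda y: math.exp(-0.5 * (y / 2.0) ** 2)  # proposal density N(0, 4), up to a constant

x = 0.0
chain = []
for _ in range(50_000):
    y = random.gauss(0.0, 2.0)  # the proposal ignores the current state
    # p_t = min( f(y)/f(x) * q(x)/q(y), 1 )
    ratio = (f(y) / f(x)) * (q_pdf(x) / q_pdf(y))
    if random.random() <= min(ratio, 1.0):
        x = y
    chain.append(x)

mean = sum(chain) / len(chain)
print(f"sample mean = {mean:.2f}")
```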

A random-walk Metropolis-Hastings algorithm

It is a special case of the MH algorithm (named by Tierney, 1994) in which the proposal has the form \(y_{t} = x_{t - 1} + \varepsilon_{t}\), where \(\varepsilon_{t}\) is stochastically independent of the current state \(x_{t - 1}\), so,

\[q\left( y_{t} \middle| x_{t - 1} \right) = q\left( y_{t} - x_{t - 1} \right) = q(\varepsilon_{t})\]

And

\[q\left( x_{t - 1} \middle| y_{t} \right) = q\left( x_{t - 1} - y_{t} \right) = q( - \varepsilon_{t})\]

Thus, the acceptance probability is,

\[p_{t} = min\left( \frac{f\left( y_{t} \right)}{f\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right) = min\left( \frac{f\left( x_{t - 1} + \varepsilon_{t} \right)}{f\left( x_{t - 1} \right)}\frac{q\left( - \varepsilon_{t} \right)}{q\left( \varepsilon_{t} \right)},1 \right)\]

Types of proposal distribution

There are three types of proposal distributions. They are:

  1. Perfect proposal distribution
  2. Bad proposal distribution
  3. Good proposal distribution

A perfect proposal distribution is one that is exactly equal to the target distribution; the acceptance probability is 1 for every proposed move (\(p_{t} = 1\) when \(f(x) = q(x)\) for every \(x\)). In this ideal case, every proposal is accepted, and because the proposals are drawn independently from the target, the resulting chain yields a sample of independent draws. This is as good as it gets: the MH algorithm effectively reduces to direct sampling from the target distribution, with no serial correlation and no wasted iterations.

In the case of a bad proposal distribution, things go the other way; it differs significantly from the target. The acceptance probabilities drop, and proposals are rejected much more often. As a result, the chain tends to get “stuck” at certain points for long stretches, revisiting the same values repeatedly. This produces a highly serially correlated MCMC sample, in which successive draws are strongly dependent on one another. As discussed in MCMC diagnostics, such chains can be misleading: they may appear to have converged quickly, but in reality, they explore the target distribution very slowly, forcing us to run the algorithm for much longer, often at prohibitive computational cost.

The idea behind a good proposal distribution is simple but powerful: it should be chosen to be as close as possible to the target distribution. When the proposal resembles the target, acceptance probabilities remain reasonably high and the chain moves freely across the parameter space, producing a sample with relatively low serial correlation. Striking this balance is key to efficient MH sampling and to obtaining reliable estimates in a reasonable amount of time.

Fig 2: Good proposal distribution Vs bad proposal distribution (Python Code)

Choosing the right proposal distribution

Choosing the right proposal distribution is the most important step in the MH algorithm: it is the single practical decision that most clearly separates efficient sampling from painfully slow convergence. Unfortunately, there's no magic formula or universal recipe for picking the perfect proposal distribution.

That said, there's a practical, iterative strategy that works well in practice:

  • Start simple: Begin with a basic proposal, such as a multivariate normal distribution.
  • Run a pilot chain: Generate an initial MCMC sample (it might be highly correlated, and that's okay).
  • Learn from it: Use this first sample to estimate key features of the target distribution, like its mean and covariance matrix.
  • Adapt and improve: Tune your proposal distribution based on these insights. For example, tune the proposal distribution’s mean and scale its covariance to match the target's spread.

This adaptive approach turns a potentially poor or bad proposal into one that is much more efficient, often dramatically improving acceptance rates and reducing serial correlation in subsequent chains.
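The four-step strategy above can be sketched as follows. The target (an unnormalized normal with mean 5 and standard deviation 3) and the 2.38 × sd scaling rule of thumb for one-dimensional targets are our own illustrative choices:

```python
import math
import random

random.seed(4)  # for reproducibility

# Illustrative target: unnormalized normal with mean 5 and sd 3,
# treated as unknown quantities that the pilot run must estimate.
f = lambda x: math.exp(-0.5 * ((x - 5.0) / 3.0) ** 2)

def rw_chain(x0, sigma, n_iter):
    """Random-walk MH with N(0, sigma^2) increments."""
    x, out = x0, []
    for _ in range(n_iter):
        y = x + random.gauss(0.0, sigma)
        if random.random() <= min(f(y) / f(x), 1.0):
            x = y
        out.append(x)
    return out

# Steps 1-2: start simple and run a pilot chain.
pilot = rw_chain(0.0, 1.0, 10_000)

# Step 3: learn the target's location and spread from the pilot sample.
m = sum(pilot) / len(pilot)
s = math.sqrt(sum((v - m) ** 2 for v in pilot) / len(pilot))

# Step 4: adapt -- rescale the proposal (2.38 * sd rule of thumb for 1-D targets).
tuned = rw_chain(m, 2.38 * s, 50_000)
mean = sum(tuned) / len(tuned)
print(f"pilot spread estimate = {s:.1f}; tuned chain mean = {mean:.1f}")
```

In higher dimensions the analogous rule of thumb scales the proposal covariance by \(2.38^{2}/d\) times the estimated target covariance.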

Note: A poor proposal leads to:

  • Low acceptance rates → mostly rejections, wasting computation
  • High serial correlation → chain moves like a drunkard, barely exploring
  • Slow mixing → millions of iterations needed for reliable inference

Properties of the Metropolis-Hastings algorithm

To understand the properties of the MH algorithm, let’s examine its simplest and most intuitive form, the random-walk MH algorithm. Suppose the random-walk increment \(\varepsilon_{t}\) is symmetric, i.e., \(q\left( \varepsilon_{t} \right) = q( - \varepsilon_{t})\). In this case, the acceptance probability is

\[p_{t} = min\left( \frac{f\left( x_{t - 1} + \varepsilon_{t} \right)}{f\left( x_{t - 1} \right)},1 \right)\]

Suppose also that the target density has the following two characteristics: points that are close to each other have similar densities; and there is a high-probability region (e.g., around the mode of the distribution), such that points far away from this region have a much lower density than points belonging to it.

Now, let us consider two extreme cases:

Case 1: When \(\varepsilon_{t}\) is very small (its variance is very small), then \(\frac{f\left( x_{t - 1} + \varepsilon_{t} \right)}{f\left( x_{t - 1} \right)}\) and hence \(p_{t}\) are always close to 1, because, by assumption, points that are close to each other have similar densities; the proposals are almost always accepted, but they are very close to each other, so the resulting MCMC sample is highly serially correlated.

Case 2: When \(\varepsilon_{t}\) is very large (its variance is very large), then \(\frac{f\left( x_{t - 1} + \varepsilon_{t} \right)}{f\left( x_{t - 1} \right)}\) is always close to 0 when \(x_{t - 1}\) lies in the high-probability region, because the proposals tend to be far from this region and have low density; the proposals are often rejected, and the chain remains stuck at high-density points for long periods of time, so we again obtain a highly serially correlated MCMC sample.

Fig 3: Variation in the distribution of the chain with the variation in \(\varepsilon_{t}\) (Python Code)

Avoiding these two extreme cases requires accurate tuning of the variance of \(\varepsilon_{t}\). The adaptive MH algorithm is the most commonly used method; it is very efficient and automatically tunes the variance as the MCMC chain runs.

We can also adopt a simpler approach based on trial and error. For example,

  1. We start by guessing a good value for variance and run an initial MCMC chain
  2. Inspect the trace plots:
    • Flat lines/stuck regions? → variance too large → decrease variance
    • Tiny wiggles, barely moving? → variance too small → increase variance
  3. Adjust and re-run until we see that sweet "fuzzy caterpillar" trace
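The two extremes, and the sweet spot between them, show up directly in the acceptance rate. A small sketch, assuming an unnormalized N(0, 1) target and three illustrative increment scales:

```python
import math
import random

random.seed(2)  # for reproducibility

f = lambda x: math.exp(-0.5 * x * x)  # unnormalized N(0, 1) target

def acceptance_rate(sigma, n_iter=20_000):
    """Random-walk MH with N(0, sigma^2) increments; returns the
    fraction of proposals that were accepted."""
    x, accepted = 0.0, 0
    for _ in range(n_iter):
        y = x + random.gauss(0.0, sigma)
        if random.random() <= min(f(y) / f(x), 1.0):
            x, accepted = y, accepted + 1
    return accepted / n_iter

rates = {sigma: acceptance_rate(sigma) for sigma in (0.05, 2.5, 50.0)}
for sigma, rate in rates.items():
    print(f"sigma = {sigma:5.2f} -> acceptance rate = {rate:.2f}")
# Tiny sigma: nearly everything is accepted but the chain barely moves;
# huge sigma: nearly everything is rejected; intermediate sigma mixes well.
```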

Convergence of the Metropolis-Hastings algorithm

The MH algorithm guarantees convergence to the target distribution under relatively mild conditions. However, the speed of convergence depends critically on the proposal distribution and its tuning. The fundamental result is: if the proposal distribution \(q(y_{t}|x_{t - 1})\) can propose any point from any starting point with positive probability, then the MH chain converges to the target \(f(x)\) regardless of the starting point \(x_{0}\).

To demonstrate convergence, it is also important to verify that the detailed balance condition holds, i.e., that the target distribution is the stationary distribution of the Markov chain generated by the MH algorithm.

The detailed balance condition states that,

\[f\left( y \middle| x \right)f(x) = f\left( x \middle| y \right)f(y)\]

Let \(f(x_{t}|x_{t - 1})\) denote the probability distribution of a transition (or transition kernel) from \(x_{t - 1}\) to \(x_{t}\).

Then, when \(x_{t} \neq x_{t - 1}\) (so that the proposal was accepted and \(x_{t} = y_{t}\)), we have

\[f\left( x_{t} \middle| x_{t - 1} \right)f\left( x_{t - 1} \right) = min\left\{ \frac{f\left( x_{t} \right)}{f\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1} \middle| x_{t} \right)}{q\left( x_{t} \middle| x_{t - 1} \right)},1 \right\} q\left( x_{t} \middle| x_{t - 1} \right)f\left( x_{t - 1} \right) = min\left\{ q\left( x_{t - 1} \middle| x_{t} \right)f\left( x_{t} \right),q\left( x_{t} \middle| x_{t - 1} \right)f\left( x_{t - 1} \right) \right\}\]

Similarly, when the transition is from \(x_{t}\) to \(x_{t - 1}\),

\[f\left( x_{t - 1} \middle| x_{t} \right)f\left( x_{t} \right) = min\left\{ q\left( x_{t - 1} \middle| x_{t} \right)f\left( x_{t} \right),q\left( x_{t} \middle| x_{t - 1} \right)f\left( x_{t - 1} \right) \right\}\]

Therefore, when \(x_{t} \neq x_{t - 1}\),

\[f\left( x_{t} \middle| x_{t - 1} \right)f\left( x_{t - 1} \right) = f\left( x_{t - 1} \middle| x_{t} \right)f\left( x_{t} \right)\]

i.e., the target distribution \(f(x)\) and the transition probability distribution satisfy the detailed balance condition.

Now, when \(x_{t} = x_{t - 1}\),

\[f\left( x_{t} \middle| x_{t - 1} \right)f\left( x_{t - 1} \right) = f\left( x_{t} \middle| x_{t} \right)f\left( x_{t} \right)\]

\[f\left( x_{t - 1} \middle| x_{t} \right)f\left( x_{t} \right) = f\left( x_{t} \middle| x_{t} \right)f\left( x_{t} \right)\]

Thus, in both cases, the detailed balance condition is satisfied. As a consequence, \(f(x)\) is the stationary distribution of the chain.
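The detailed balance argument can be checked numerically on a small discrete example. A sketch with a hypothetical three-state target and a uniform (symmetric) proposal over the states:

```python
# Hypothetical three-state target distribution and uniform proposal q(j | i) = 1/3.
f = [0.2, 0.5, 0.3]
n = len(f)
q = 1.0 / n

# Build the MH transition kernel P[i][j] = q(j | i) * min(f(j)/f(i), 1) for i != j,
# with the leftover mass (rejections) placed on the diagonal.
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            P[i][j] = q * min(f[j] / f[i], 1.0)
    P[i][i] = 1.0 - sum(P[i])

# Detailed balance: f(i) P(i -> j) = f(j) P(j -> i) for every pair (i, j).
balanced = all(
    abs(f[i] * P[i][j] - f[j] * P[j][i]) < 1e-12
    for i in range(n) for j in range(n)
)
print("detailed balance holds:", balanced)  # True
```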

Furthermore, the resulting chain must satisfy the conditions of irreducibility and aperiodicity to guarantee convergence from any initial state \(x_{0}\). Irreducibility means that the chain can reach any state from any other state with positive probability in a finite number of steps. Typically, this requires \(q\left( y_{t} \middle| x_{t - 1} \right) > 0\) whenever \(f(y_{t}) > 0\). Aperiodicity means the chain does not get “trapped” in cycles; this is ensured because there is always a positive probability of staying at the current state. A Markov chain is called ergodic if it is irreducible and aperiodic, and according to the Ergodic Theorem of Markov Chains, if the resulting chain is ergodic, its distribution converges to the target distribution \(f(x)\), with ergodic averages converging almost surely to the corresponding expectations.



Suggested Readings:

  1. Geyer, C. J. (2011). Introduction to Markov Chain Monte Carlo. Handbook of Markov Chain Monte Carlo. CRC Press.
  2. Robert, C. P., & Casella, G. (2010). Metropolis-Hastings Algorithms. Introducing Monte Carlo Methods with R. Springer.
  3. Robert, C. P., & Casella, G. (2013). Monte Carlo statistical methods. Vol. 2. Springer.
  4. Taboga, M. (2021). Metropolis-Hastings algorithm. Fundamentals of Statistics.

Suggested Citation: Meitei, W. B. (2026). Metropolis-Hastings Algorithm. WBM STATS.

Saturday, 4 April 2026

Ergodic Theorem of Markov Chain

by W. B. Meitei, PhD


If a Markov chain is irreducible, aperiodic, and has a stationary distribution \(\pi( \cdot )\), then for any integrable function \(f\), as \(N \rightarrow \infty\),

\[\frac{1}{N}\sum_{t = 1}^{N}{f\left( X_{t} \right)}\overset{a.s.}{\rightarrow}E_{\pi}\left\lbrack f(X) \right\rbrack\]

where \(f\) is any \(\pi\)-integrable function (e.g., the identity for a posterior mean, or a log-likelihood), and \(E_{\pi}\left\lbrack f(X) \right\rbrack\) is its expectation under the target distribution \(\pi\).

Consequently, the distribution of \(X_{t}\) converges to \(\pi( \cdot )\) as \(t \rightarrow \infty\), i.e.,

\[\lim_{t \rightarrow \infty}\left\| P\left( X_{t} \in \  \cdot \right) - \pi( \cdot ) \right\|_{TV} = 0\]

where, \(\left\| \ \cdot \ \right\|_{TV}\) denotes the total variation distance. 
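As a quick illustration of the theorem (a toy example of our own, not from the post): for a two-state chain with known stationary distribution, the ergodic average converges to the stationary expectation:

```python
import random

random.seed(3)  # for reproducibility

# Hypothetical two-state chain on {0, 1} with transition probabilities
# P(0 -> 1) = 0.3 and P(1 -> 0) = 0.1. It is irreducible and aperiodic,
# and its stationary distribution is pi = (0.25, 0.75), so E_pi[X] = 0.75.
p01, p10 = 0.3, 0.1

x, total, N = 0, 0, 200_000
for _ in range(N):
    if x == 0:
        x = 1 if random.random() < p01 else 0
    else:
        x = 0 if random.random() < p10 else 1
    total += x

print(f"ergodic average = {total / N:.3f}")  # close to 0.75 = E_pi[X]
```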



Suggested Readings:

  1. Robert, C. P., & Casella, G. (2004). Monte Carlo Statistical Methods. Springer.
  2. Geyer, C. J. (2011). Introduction to Markov Chain Monte Carlo. Handbook of Markov Chain Monte Carlo. CRC Press.
  3. Gundersen, G. (2019). Ergodic Markov Chains.

Suggested Citation: Meitei, W. B. (2026). Ergodic Theorem of Markov Chain. WBM STATS.

Sunday, 10 August 2025

Jeffreys-Lindley-Bartlett's Paradox: Unravelling a Foundational Split in Statistics

by W. B. Meitei, PhD


The Jeffreys-Lindley-Bartlett paradox stands as one of the most startling and debated phenomena in the theory of statistical inference, challenging assumptions about how evidence is assessed in the Bayesian versus Frequentist schools. This paradox reveals scenarios where the conclusion from a Frequentist hypothesis test ("reject the null hypothesis") directly conflicts with the output of Bayesian inference, which may decisively favor the null hypothesis, even when analyzing the same data set.

Background and origins

Harold Jeffreys introduced the mathematical and philosophical basis of what became known as the paradox in the 1930s and 1940s, showing that the evidence required to reject a null hypothesis should actually increase with sample size, a claim that clashes with common Frequentist practice.

Dennis Lindley famously formalized and publicized the problem as a paradox in a 1957 paper, giving it prominence and clarity in statistical discussions.

M. S. Bartlett extended and clarified the issue, demonstrating that choosing a "non-informative" or very diffuse prior in the Bayesian setup could, counterintuitively, make the evidence overwhelmingly favor the null hypothesis, regardless of how compelling the data against it appeared to be.

Explaining the paradox

Suppose we are testing a point null hypothesis (e.g., "the mean is equal to zero") using a large sample.

Frequentist inference: Keeping the significance level fixed (e.g., 5%), we will reject the null hypothesis for any deviation from zero if we collect enough data, since even minuscule differences become "statistically significant" as the sample grows.

Bayesian inference: When using a diffuse prior for the alternative, the Bayes factor, the core Bayesian evidence metric, may incline towards supporting the null hypothesis, despite the Frequentist's significant finding. This occurs because the marginal likelihood for the alternative hypothesis (which averages the likelihood over the entire, wide-ranging prior) gets "diluted," while all prior mass for the null is concentrated at the hypothesis value.

Mathematical roots

The crux of the paradox is how the Bayes factor is sensitive to:

  1. Sample size (n): As n → ∞, the likelihood under the null becomes sharply peaked. The Bayes factor will favor the null unless the prior for the alternative is highly concentrated around values supported by the data, which is seldom the case for broadly specified or "non-informative" priors.
  2. Variance of prior: If the prior for the alternative hypothesis is made increasingly broad (variance → ∞), Bayesian evidence in favor of the null approaches certainty regardless of the observed data, a phenomenon specifically described by Bartlett.

Imagine tossing a coin a million times and observing 501,300 heads, a small but "statistically significant" deviation from perfect fairness:

Frequentist result: the z-statistic is 2.6, so the two-sided p-value is about 0.009; the null ("fair coin") is rejected at conventional significance levels.

Bayesian result: If using a very flat (uninformative) prior for the coin bias, the Bayes factor turns out to strongly support the null; the prior for "biased coins" is spread so thin that the observed data are not enough to outweigh the null.
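The coin example can be computed exactly. A minimal sketch with illustrative numbers chosen so that the frequentist test clearly rejects (501,300 heads in 1,000,000 tosses, i.e., z = 2.6), testing H0: θ = 1/2 against H1: θ ~ Uniform(0, 1):

```python
import math

n, k = 1_000_000, 501_300  # illustrative data: z = (k - n/2) / sqrt(n/4) = 2.6

# Frequentist side: two-sided p-value under H0: theta = 1/2
# (normal approximation to the binomial).
z = (k - n / 2) / math.sqrt(n / 4)
p_value = math.erfc(z / math.sqrt(2))

# Bayesian side: Bayes factor BF01 for H0 vs H1: theta ~ Uniform(0, 1).
# The marginal likelihood of k under the uniform prior is 1 / (n + 1), so
# BF01 = C(n, k) * (1/2)^n * (n + 1), computed on the log scale for stability.
log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
log_bf01 = log_choose - n * math.log(2) + math.log(n + 1)
bf01 = math.exp(log_bf01)

print(f"z = {z:.1f}, two-sided p-value = {p_value:.4f}")  # about 0.009: reject H0
print(f"BF01 = {bf01:.0f}")  # about 27: the Bayes factor favors the null
```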

Philosophical and practical implications

  • Prior selection is crucial: The paradox demonstrates that using broad or improper priors in Bayesian hypothesis testing can yield practically absurd model selection, motivating the adoption of weakly-informative or problem-specific priors.
  • Rethinking evidence and significance: The conflict highlights foundational issues in how statistical evidence is conceptualized, warning against the mechanical use of p-values or "default" priors.
  • Frequentist significance thresholds: The paradox raises concerns about the routine practice of using fixed significance levels regardless of sample size, suggesting a need to adjust thresholds as data accumulate.

This paradox, far from being a mere technical curiosity, continues to influence both statistical theory and the broader philosophy of evidence, serving as a catalyst for methodological innovation and critical thinking in scientific research.

Recent proposed solutions to the Jeffreys-Lindley-Bartlett paradox

Recent proposed solutions to the Jeffreys-Lindley-Bartlett paradox focus on both the choice of priors in Bayesian testing and on re-examining foundational aspects of hypothesis testing frameworks. Here are key modern approaches:

  • Weakly informative or sample-dependent priors: Instead of using non-informative or overly diffuse priors (which can lead to the paradox), researchers now recommend "weakly informative" priors that become more informative as sample size increases, or that adapt to the data context. Sample-size-dependent prior strategies have been explicitly developed to bridge the Bayesian-Frequentist divide, thereby reducing the conflict as datasets become large.
  • Cake priors: A notable recent innovation, "cake priors," are a new class of priors designed to circumvent the paradox. These allow for diffuse priors while ensuring statistical inferences remain sensible. Cake priors lead to Bayesian tests that are Chernoff-consistent (zero type I and II errors asymptotically) and can often be interpreted as penalized likelihood-ratio tests. Their design ensures that the paradox occurs only with vanishing probability in large samples.
  • Changing significance thresholds with sample size: In the Frequentist sphere, it has been proposed that significance levels (α) should decrease as sample size grows, rather than remain fixed, as maintaining a constant α is a root cause of the paradox. This approach, sometimes called "almost sure hypothesis testing," results in a sequence of hypothesis tests that make only a finite number of errors as the sample size grows, thereby resolving the paradox in practice.
  • Re-evaluating the point-null hypothesis: Some theorists argue that the paradox is less about statistical machinery and more about the unrealistic nature of testing perfect point hypotheses. Instead, they suggest framing nulls as intervals or using shrinkage priors that concentrate prior mass more sensibly, addressing both Type I error and Bayesian evidence alignment.
  • Objective and loss-based prior selection: Methods based on Kullback-Leibler divergence or self-information loss try to assign prior weights objectively, even allowing the prior for the alternative hypothesis to be set in a way closely linked to classical significance levels, avoiding pathologies as variances go to infinity.
  • Handling improper priors: Recent analysis shows that Bartlett’s paradox does not necessarily apply to all improper priors. Some classes of improper priors (such as Stein's shrinkage prior) yield well-defined Bayes factors, provided their diffusion and measure rates are properly controlled. However, pathologies can still arise, so caution is emphasized, and regularization rules are suggested.
  • Hybrid and robust Bayesian methods: Ongoing research also investigates hybrid techniques that combine Bayesian insights with Frequentist calibration or robust Bayesian model averaging, to mitigate the effects of prior mis-specification and large-sample divergences.

In summary, the main thrust of recent solutions is smarter, context-aware prior specification (including cake and weakly informative priors), dynamic thresholding of significance in Frequentist testing, and new theoretical frameworks that blend benefits from both Frequentist and Bayesian paradigms, making the paradox less of a practical challenge and more of a caution against naive use of statistical defaults.

The paradox's impact on the practical interpretation of p-values and Bayes factors

The Jeffreys-Lindley-Bartlett paradox profoundly impacts how p-values and Bayes factors should be interpreted in practical data analysis, especially as sample sizes grow large:

a. Practical interpretation of p-values

  • Overstated evidence: The paradox demonstrates that small p-values may not always imply strong evidence against the null hypothesis, particularly in large datasets. As the sample size increases, even trivial differences from the null can yield extremely small p-values, potentially leading to exaggerated claims of discovery when the true effect is negligible.
  • Context sensitivity: Researchers are increasingly cautioned that p-values must be interpreted in light of context, effect size, sample size, and prior belief, not just as a mechanical threshold for significance.

b. Practical interpretation of Bayes factors

  • Prior dependence: The paradox shows that the Bayes factor can favor the null hypothesis, even when the observed data strongly deviate from the null, if a diffuse or poorly chosen prior is used for the alternative. Thus, Bayes factors must be interpreted with careful attention to the choice and justification of priors.
  • Calibration required: Interpretation of Bayes factors is not universal; they must be calibrated to the problem at hand, with sensitivity analyses to different priors reducing the risk of misleading conclusions.

In essence, the paradox serves as a cautionary tale: neither p-values nor Bayes factors alone provide absolute evidence; their interpretation is nuanced, context-dependent, and should be embedded in a broader inferential framework.



Suggested Readings:

  1. Jeffreys, H. (1948). The Theory of Probability. The Clarendon Press, Oxford.
  2. Bartlett, M. S. (1957). A comment on D. V. Lindley's statistical paradox. Biometrika, 44(3-4), 533-534.
  3. Lindley, D. V. (1957). A statistical paradox. Biometrika, 44(1/2), 187-192.
  4. Wagenmakers, E. J., & Ly, A. (2023). History and nature of the Jeffreys–Lindley paradox. Archive for History of Exact Sciences, 77(1), 25-72.
  5. Press, S. J. (2003). Bayesian Hypothesis Testing. Subjective and Objective Bayesian Statistics: Principles, Models, and Applications. John Wiley and Sons, Inc.
  6. Lavine, M., & Schervish, M. J. (1999). Bayes factors: What they are and what they are not. The American Statistician, 53(2), 119-122.
  7. Bartlett's paradox in Bayesian evidence. MATHEMATICS.

Suggested Citation: Meitei, W. B. (2025). Jeffreys-Lindley-Bartlett's Paradox: Unravelling a Foundational Split in Statistics. WBM STATS.