
Sunday, 10 August 2025

Jeffreys-Lindley-Bartlett's Paradox: Unravelling a Foundational Split in Statistics

by W. B. Meitei, PhD


The Jeffreys-Lindley-Bartlett paradox stands as one of the most startling and debated phenomena in the theory of statistical inference, challenging assumptions about how evidence is assessed in the Bayesian versus frequentist schools. This paradox reveals scenarios where the conclusion from a frequentist hypothesis test ("reject the null hypothesis") directly conflicts with the output of Bayesian inference, which may decisively favour the null hypothesis, even when analysing the same data set.

Background and origins

Harold Jeffreys introduced the mathematical and philosophical basis of what became the paradox in the 1930s and 1940s, showing that evidence required to reject a null hypothesis should actually increase with sample size, clashing with common frequentist practice.

Dennis Lindley famously formalised and publicised the problem as a paradox in a 1957 paper, giving it prominence and clarity in statistical discussions.

M. S. Bartlett extended and clarified the issue, demonstrating that choosing a "non-informative" or very diffuse prior in the Bayesian setup could, counterintuitively, make the evidence overwhelmingly favour the null hypothesis, regardless of how compelling the data against it appeared.

Explaining the paradox

Suppose we are testing a point null hypothesis (e.g., "the mean is equal to zero") using a large sample.

Frequentist inference: Keeping the significance level fixed (e.g., 5%), we will reject the null hypothesis for any deviation from zero if we collect enough data, since even minuscule differences become "statistically significant" as the sample grows.

Bayesian inference: When using a diffuse prior for the alternative, the Bayes factor, the core Bayesian evidence metric, may incline towards supporting the null hypothesis, despite the frequentist's significant finding. This occurs because the marginal likelihood for the alternative hypothesis (which averages the likelihood over the entire, wide-ranging prior) gets "diluted," while all prior mass for the null is concentrated at the hypothesis value.

Mathematical roots

The crux of the paradox is how the Bayes factor is sensitive to:

  1. Sample size (n): As n → ∞, the likelihood under the null becomes sharply peaked. The Bayes factor will favour the null unless the prior for the alternative is highly concentrated around values supported by the data, which seldom holds for broadly specified or "non-informative" priors.
  2. Variance of prior: If the prior for the alternative hypothesis is made increasingly broad (variance → ∞), Bayesian evidence in favour of the null approaches certainty regardless of the observed data, a phenomenon specifically described by Bartlett (a numerical sketch follows this list).
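A minimal numerical sketch of Bartlett's effect in R, assuming a normal-mean test with known variance (H0: μ = 0 versus H1: μ ~ N(0, τ²)); the function below is illustrative, not taken from any particular package:

    # BF01 for a normal-mean test: xbar | mu ~ N(mu, sigma2 / n)
    bf01 <- function(xbar, n, tau2, sigma2 = 1) {
      v0 <- sigma2 / n    # sampling variance of the mean under H0
      v1 <- v0 + tau2     # marginal variance of the mean under H1
      dnorm(xbar, 0, sqrt(v0)) / dnorm(xbar, 0, sqrt(v1))
    }

    # A "significant" observation: z = 2.5, i.e., xbar = 2.5 / sqrt(n)
    n <- 100; xbar <- 2.5 / sqrt(n)
    sapply(c(1, 100, 1e4, 1e6), function(t2) bf01(xbar, n, t2))
    # BF01 grows roughly like sqrt(tau2): ever-stronger "evidence" for H0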

Imagine tossing a coin a million times. We observe 501,000 heads, a small but statistically detectable deviation from perfect fairness:

Frequentist result: the two-sided p-value is about 0.046 (z ≈ 2.0), below the conventional 5% threshold; the null ("fair coin") is rejected.

Bayesian result: If a very flat (uninformative) prior is used for the coin bias, the Bayes factor turns out to overwhelmingly support the null; the prior for "biased coins" is spread so thin that the observed data are not enough to outweigh the null.
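A minimal sketch of this calculation in R, assuming a uniform Beta(1, 1) prior on the coin bias under the alternative (the binomial coefficient cancels in the Bayes factor, so only the kernel p^h (1 - p)^(n - h) matters):

    n <- 1e6        # number of tosses
    h <- 501000     # observed heads

    # Frequentist two-sided test (normal approximation)
    z <- (h - n * 0.5) / sqrt(n * 0.25)
    p_value <- 2 * pnorm(-abs(z))       # ~0.046: "significant" at the 5% level

    # Bayes factor BF01 = m0 / m1, computed on the log scale
    log_m0 <- n * log(0.5)              # marginal likelihood under the null
    log_m1 <- lbeta(h + 1, n - h + 1)   # integral of p^h (1 - p)^(n - h) dp
    bf01 <- exp(log_m0 - log_m1)        # ~100: strong evidence *for* the null

    c(p_value = p_value, BF01 = bf01)

The same data thus reject the null at the 5% level while the Bayes factor favours it by roughly two orders of magnitude.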

Philosophical and practical implications

  • Prior selection is crucial: The paradox demonstrates that using broad or improper priors in Bayesian hypothesis testing can yield practically absurd model selection, motivating the adoption of weakly-informative or problem-specific priors.
  • Rethinking evidence and significance: The conflict highlights foundational issues in how statistical evidence is conceptualised, warning against mechanical use of p-values or "default" priors.
  • Frequentist significance thresholds: The paradox raises concerns about the routine practice of using fixed significance levels regardless of sample size, suggesting a need for adjusting thresholds as data accumulates.

This paradox, far from being a mere technical curiosity, continues to influence both statistical theory and the broader philosophy of evidence, serving as a catalyst for methodological innovation and critical thinking in scientific research.

Recently proposed solutions to the Jeffreys-Lindley-Bartlett paradox

These solutions focus on both the choice of priors in Bayesian testing and on re-examining foundational aspects of hypothesis-testing frameworks. Key modern approaches include:

  • Weakly informative or sample-dependent priors: Instead of using non-informative or overly diffuse priors (which cause the paradox), researchers now recommend "weakly informative" priors that grow more informative as sample size increases, or that adapt to the data context. Sample size-dependent prior strategies have been explicitly developed to bridge the Bayesian and frequentist divide, reducing the conflict as datasets become large.
  • Cake priors: A notable recent innovation, "cake priors," are a new class of priors designed to circumvent the paradox. These allow for diffuse priors while ensuring statistical inferences remain sensible. Cake priors lead to Bayesian tests that are Chernoff-consistent (zero type I and II errors asymptotically) and can often be interpreted as penalised likelihood ratio tests. Their design ensures that the paradox occurs only with vanishing probability in large samples.
  • Changing significance thresholds with sample size: In the frequentist sphere, it has been proposed that significance levels (α) should decrease as sample size grows, rather than remain fixed, as maintaining a constant α is a root cause of the paradox. This approach, sometimes called "almost sure hypothesis testing," results in a sequence of hypothesis tests that make only a finite number of errors as the sample size grows, thereby resolving the paradox in practice.
  • Re-evaluating the point-null hypothesis: Some theorists argue that the paradox is less about statistical machinery and more about the unrealistic nature of testing perfect point hypotheses. Instead, they suggest framing nulls as intervals or using shrinkage priors that concentrate prior mass more sensibly, addressing both Type I error and Bayesian evidence alignment.
  • Objective and loss-based prior selection: Methods based on Kullback-Leibler divergence or self-information loss try to assign prior weights objectively, even allowing the prior for the alternative hypothesis to be set in a way closely linked to classical significance levels, avoiding pathologies as variances go to infinity.
  • Handling improper priors: Recent analysis shows that Bartlett’s paradox does not necessarily apply to all improper priors. Some classes of improper priors (like Stein's shrinkage prior) lead to well-defined Bayes factors, provided their diffusion and measure rate are properly controlled. However, pathologies can still arise, so caution is emphasised, and regularisation rules are suggested.
  • Hybrid and robust Bayesian methods: Ongoing research also investigates hybrid techniques, combining Bayesian insights with frequentist calibration or robust Bayesian model averaging, to mitigate the effects of prior mis-specification and large-sample divergences.

In summary, the main thrust of recent solutions is smarter, context-aware prior specification (including cake and weakly informative priors), dynamic thresholding of significance in frequentist testing, and new theoretical frameworks that blend benefits from both frequentist and Bayesian paradigms, making the paradox less of a practical challenge and more of a caution against naive use of statistical defaults.

The paradox's impact on the practical interpretation of p-values and Bayes factors

The Jeffreys-Lindley-Bartlett paradox profoundly impacts how p-values and Bayes factors should be interpreted in practical data analysis, especially as sample sizes grow large:

a. Practical interpretation of p-values

  • Overstated evidence: The paradox demonstrates that small p-values may not always imply strong evidence against the null hypothesis, particularly in large datasets. As the sample size increases, even trivial differences from the null can yield extremely small p-values, potentially leading to exaggerated claims of discovery when the true effect is negligible.
  • Context sensitivity: Researchers are increasingly cautioned that p-values must be interpreted in light of context, effect size, sample size, and prior belief, not just as a mechanical threshold for significance.

b. Practical interpretation of Bayes factors

  • Prior dependence: The paradox exposes how the Bayes factor can favour the null hypothesis even when the observed data strongly deviate from it, if a diffuse or poorly chosen prior is used for the alternative. Thus, Bayes factors must be interpreted with careful attention to the choice and justification of priors.
  • Calibration required: Interpretation of Bayes factors is not universal; they must be calibrated to the problem at hand, with sensitivity analyses to different priors reducing the risk of misleading conclusions.

In essence, the paradox serves as a cautionary tale: neither p-values nor Bayes factors alone provide absolute evidence; their interpretation is nuanced, context-dependent, and should be embedded in a broader inferential framework.



Suggested Readings:

  1. Jeffreys, H. (1948). The Theory of Probability. Clarendon Press, Oxford.
  2. Bartlett, M. S. (1957). A comment on D. V. Lindley's statistical paradox. Biometrika, 44(3-4), 533-534.
  3. Lindley, D. V. (1957). A statistical paradox. Biometrika, 44(1/2), 187-192.
  4. Wagenmakers, E. J., & Ly, A. (2023). History and nature of the Jeffreys-Lindley paradox. Archive for History of Exact Sciences, 77(1), 25-72.
  5. Press, S. J. (2003). Bayesian hypothesis testing. In Subjective and Objective Bayesian Statistics: Principles, Models, and Applications. John Wiley and Sons, Inc.
  6. Lavine, M., & Schervish, M. J. (1999). Bayes factors: What they are and what they are not. The American Statistician, 53(2), 119-122.
  7. Bartlett's paradox in Bayesian evidence. MATHEMATICS.

Suggested Citation: Meitei, W. B. (2025). Jeffreys-Lindley-Bartlett's Paradox: Unravelling a Foundational Split in Statistics. WBM STATS.

Sunday, 27 July 2025

Markov Chain Monte Carlo Diagnostics: Ensuring Reliable Simulations

by W. B. Meitei, PhD


Markov Chain Monte Carlo (MCMC) methods have become essential tools in modern statistical analysis, particularly in Bayesian inference, where drawing samples from complex, high-dimensional probability distributions is often necessary. However, generating samples is only part of the task; ensuring that those samples accurately represent the target distribution is equally important.

This is where MCMC diagnostics come into play. MCMC diagnostics are a set of tools and techniques used to assess the quality and reliability of an MCMC-generated sample. Specifically, MCMC diagnostics help us answer two critical questions:

  1. Has a substantial portion of the sample been drawn from distributions that deviate significantly from the target distribution? This is especially relevant early in the sampling process, where the chain may not yet have "settled" into the correct distribution. Diagnosing this allows us to determine how much of the early sample (often referred to as the burn-in period) should be discarded.
  2. Is the sample size large enough to approximate the target distribution with sufficient accuracy? In MCMC, draws are typically correlated, which means that the effective number of independent observations is smaller than the total number of samples. Diagnostics help us evaluate whether we have enough effective samples to trust the results.

In this article, we explore why MCMC diagnostics are necessary, introduce common techniques used to evaluate convergence and sampling efficiency, and discuss practical steps for interpreting the results to improve simulation quality. Whether you're new to MCMC or looking to refine your workflow, understanding MCMC diagnostics is key to producing valid and robust statistical inferences.

MCMC in brief

An MCMC algorithm generates a sequence {Xt} of random variables (or vectors) that exhibit the following characteristics:

  • The sequence forms a Markov chain.
  • As t increases, the distribution of Xt​ becomes increasingly similar to the target distribution. More formally, the chain converges to its stationary distribution, which is, by design, equal to the target distribution.
  • While individual terms in the chain, such as Xt​ and Xt+n, are generally dependent, the dependency weakens as the gap n between them grows larger. In other words, as n becomes large, Xt​ and Xt+n become approximately independent.

When we perform an MCMC algorithm for T periods, we get an MCMC sample made up of the first T realisations of the chain, x1, …, xT, and then use the empirical distribution of the sample to approximate the target distribution. See the article "Markov Chain Monte Carlo" below for more on MCMC methods.

Problems MCMC diagnostic tries to spot

1. A substantial portion of the sample has been drawn from distributions that deviate significantly from the target distribution

In most cases, the distribution of the initial value X1​ does not match the target distribution. As a result, the distributions of the subsequent terms X2, X3, …, Xt also differ from the target, although these differences gradually diminish as t increases, since the distribution of Xt​ is designed to converge to the target as t approaches infinity.

However, sometimes the chain converges slowly. This means that even after many iterations, the draws may still come from distributions that deviate substantially from the target. Such slow convergence often occurs when the initial value x1​ is located in a region where the target distribution assigns a very small probability.

When this happens, a significant portion of the MCMC sample may consist of observations that do not adequately represent the target distribution.

To address this issue, we can take corrective measures such as:

  • Discarding a substantial number of initial observations, known as the burn-in sample;
  • Increasing the total number of iterations (i.e., extending the sample size).

These strategies can help ensure that a greater share of the MCMC sample comes from distributions that closely resemble the target distribution.

2. The effective sample size is not large enough to approximate the target distribution with sufficient accuracy.

Keep in mind that, generally, two terms in the Markov chain, Xt​ and Xt+n​, are not independent. However, their dependence weakens as n increases, and they eventually become nearly independent.

Suppose we can identify the smallest integer n0 such that, for any t, Xt and Xt+n0 can be treated as practically independent, i.e., the remaining correlation is negligible. For simplicity, also assume that the total number of draws T is divisible by n0. In this case, we can construct a sub-sample,

\[x_{n_{0}},x_{2n_{0}},\ldots,x_{T - n_{0}},x_{T}\]

consisting of values that are approximately independent of one another. This sub-sample is referred to as the effective sample, and its size is T/n0.

In simple terms, the full MCMC sample behaves like an independent sample of size T/n0. The slower the dependence between observations fades, the larger n0 becomes and, consequently, the smaller the effective sample size.

If dependence decays very slowly, resulting in a large n0, the effective sample size may be too small, making the empirical distribution derived from the MCMC sample an unreliable or noisy estimate of the target distribution.

If such an issue is detected, it can often be mitigated by:

  • Modifying the MCMC algorithm to reduce dependence (i.e., reduce n0);
  • Increasing the number of iterations T.

Both approaches aim to increase the effective sample size and thereby improve the quality of the approximation.

The concept of effective sample size discussed here is an intuitive one and differs from the more formal definitions commonly found in MCMC literature. It is meant to clearly highlight the issues that can arise when the dependence between observations declines too slowly.
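To make this concrete, here is a small sketch using the coda package's formal effective sample size estimator on a strongly autocorrelated chain (an AR(1) series simulated with arima.sim stands in for real MCMC output):

    library(coda)   # provides effectiveSize() and mcmc()

    set.seed(1)
    # AR(1) series with coefficient 0.9, a stand-in for a slowly mixing chain
    chain <- as.numeric(arima.sim(list(ar = 0.9), n = 10000))
    effectiveSize(mcmc(chain))   # roughly 500, far below the nominal 10,000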

The fundamental principle behind MCMC diagnostics

Most MCMC diagnostics assess the following two hypotheses:

  • A substantial portion of the sample has been drawn from distributions that closely resemble the target distribution.
  • The effective sample size is large enough to approximate the target distribution with sufficient accuracy.

If these two hypotheses hold, then the implication is that the empirical distribution of any large sub-sample is a good approximation of the target distribution.

There are multiple approaches to conducting MCMC diagnostics, each aimed at assessing whether the generated sample accurately represents the target distribution. In what follows, we outline four commonly used diagnostic methods. The first two are relatively simple and intuitive, making them easy to implement and interpret. The latter two, while more technical, offer a more rigorous and statistically grounded evaluation of the quality of the MCMC sample.

a. Sample splits

One of the simplest and most straightforward ways to diagnose problems in an MCMC sample is to divide it into two or more parts and examine whether they yield consistent results.

For example, suppose we have an MCMC sample made up of T draws (with T even), i.e.,

\[x_{1},\ldots,x_{T}\]

where each draw xt is a K×1 random vector, then we can split the sample into two equal halves,

\(x_{1},\ldots,x_{\frac{T}{2}}\) and \(x_{\frac{T}{2} + 1},\ldots,x_{T}\)

Thus, we can compute the sample means of the two halves as

\[{\overline{x}}_{1} = \frac{2}{T}\sum_{t = 1}^{T/2}x_{t}\]

\[{\overline{x}}_{2} = \frac{2}{T}\sum_{t = \frac{T}{2} + 1}^{T}x_{t}\]

A significant difference between the two means (a formal statistical test can be used to assess this) suggests that the MCMC sample may not be reliable.

Although the example focuses on comparing sample means, similar checks can be performed using other moments, such as sample variances.

The logic behind this diagnostic is quite intuitive: if two large portions of the sample differ noticeably in their empirical distributions, then at least one of them likely fails to represent the target distribution well. This contradicts the large chunks principle, which assumes that sufficiently large parts of the MCMC sample should provide similar and accurate approximations of the target distribution. If that principle fails, the overall quality of the sample is questionable.
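As a rough sketch of this check in R (reusing the simulated chain from the snippet above; because the draws are correlated, a naive i.i.d. test is only an informal guide here):

    # Compare summary statistics of the two halves of the chain
    first  <- chain[1:5000]
    second <- chain[5001:10000]
    c(mean_first = mean(first), mean_second = mean(second),
      var_first = var(first), var_second = var(second))
    # A discrepancy that is large relative to the Monte Carlo error
    # suggests the sample may not be reliable.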

b. Multiple chains

Another simple diagnostic technique involves running the MCMC algorithm multiple times, each time starting from a different (and possibly very different) initial value x1​, to generate several independent MCMC samples. After discarding appropriate burn-in periods, we compare the resulting samples to see if they yield consistent outcomes. As with the sample split method, this can be done by testing whether the sample means or other statistical moments differ significantly across the samples. Consistency across runs provides reassurance about convergence and reliability, while significant differences may indicate problems with mixing or convergence.
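A widely used formal version of this idea is the Gelman-Rubin diagnostic, available in the coda package; a minimal sketch with simulated stand-in chains:

    library(coda)

    set.seed(2)
    # Three independent stand-in chains (real applications would use
    # separate MCMC runs started from dispersed initial values)
    chains <- mcmc.list(lapply(1:3, function(i)
      mcmc(as.numeric(arima.sim(list(ar = 0.9), n = 10000)))))
    gelman.diag(chains)   # potential scale reduction factor near 1 is reassuring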

c. Trace plots

A trace plot is a fundamental diagnostic tool used to evaluate the performance of an MCMC sample. It shows the values drawn for a specific parameter (or a component of a parameter vector) plotted against the iteration number. This visual representation helps determine whether the chain has converged, mixed properly, and is adequately exploring the target distribution. Any visible trends, shifts, or irregular patterns in the plot may suggest issues such as poor convergence, strong serial correlation, or the need for a longer burn-in period. As such, trace plots are often one of the first tools examined in MCMC diagnostics.

Suppose we have the following MCMC sample,

\[x_{1},\ldots,x_{T}\]

where each draw xt​ is a K×1 random vector, then a trace plot is a line graph that has time (t = 1, …, T) on the x-axis and the values taken by one of the coordinates of the draws (x1i, …, xTi) on the y-axis.
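For the chain simulated earlier, a minimal trace plot takes a single call in base R:

    plot(chain, type = "l", xlab = "Iteration (t)", ylab = "Draw",
         main = "Trace plot")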

Fig. 1: Trace plots of Chain 1, 2, and 3 (R code to generate the figure)

In the trace plot of Chain 1, the sample appears to have no visible anomalies. While there is some mild serial correlation between consecutive draws, the chain moves through the sample space multiple times, suggesting adequate mixing and no major issues.

In the trace plot of Chain 2, the initial segment (up to around t = 1,500) noticeably differs from the rest of the sample. This likely indicates that the starting distribution and early iterations of the chain were far from the target distribution. Over time, the chain gradually approached the target distribution (around t = 1,500). This reflects Problem 1, i.e., a significant portion of the sample was generated from distributions not representative of the target.

In the trace plot of Chain 3, there is strong serial correlation between successive draws. The chain moves slowly through the sample space and seems to make only a few complete passes. This suggests that the number of effectively independent observations is small, illustrating Problem 2, i.e., a low effective sample size.

Fig. 2: Trace plots of Chain 2 and 3 after refinement (R code to generate the figure)

The two trace plots in Fig. 2 demonstrate how we address Problems 1 and 2. In the first plot, we used the sample from Chain 2, but after discarding the burn-in sample (initial 1,500 draws). After discarding the burn-in sample, the chain no longer has visible anomalies. In the second plot, a new sample is generated using the same method as Chain 3, but the sample size is greatly expanded from 10,000 to 1,000,000 draws. This larger sample helps increase the effective sample size and allows the chain to explore the sample space more thoroughly. In both examples, the issues observed earlier appear to be resolved: the trace plots now show no visible problems or irregularities.

Although trace plots are relatively simple and informal diagnostic tools, they are among the most commonly used in practice. If you're submitting an MCMC-based study to a scientific journal, reviewers will almost certainly expect you to include trace plots as part of your analysis.

d. Autocorrelation function (ACF) plots

Trace plots can give us a visual indication of how much serial correlation exists between the draws in an MCMC sample. To evaluate this more precisely, we can use ACF (autocorrelation function) plots, also known as correlograms. These plots show how the sample autocorrelation between terms in the chain decreases as the lag increases.
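For the chain simulated earlier, the correlogram is again a single call in base R:

    acf(chain, lag.max = 100, main = "Sample autocorrelation")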

The sample ACF plots for the three MCMC chains previously examined in the trace plot in Fig. 1 are presented in Fig. 3.

Fig. 3: ACF plots of Chain 1, 2, and 3 (R code to generate the figure)

The ACF plot for Chain 1 shows that while autocorrelation (or serial correlation) is high at short lags, it quickly diminishes to near zero, consistent with the trace plot, which did not indicate any issues.

In contrast, the ACF plots for Chains 2 and 3 reveal that autocorrelation is not only strong at short lags but also declines very slowly. The ACFs of Chains 2 and 3 suggest different underlying issues (Problem 1 for Chain 2 and Problem 2 for Chain 3). While ACF plots are useful for detecting the presence of autocorrelation and quantifying its magnitude, they often do not clearly identify the nature of the problem. For this reason, it is recommended to use ACF plots in conjunction with trace plots for a more comprehensive diagnosis.


Finally, some facts to keep in mind when performing MCMC diagnostics:

  1. No single diagnostic is perfect.
  2. We can never be 100% certain about the adequacy of an MCMC sample. While diagnostics are useful for identifying potential issues, they cannot provide absolute assurance that no problems are present.
  3. Conducting multiple diagnostics is always a prudent strategy: run your chains for as many iterations as your resources allow, initiate multiple chains from diverse starting points, and make full use of available computational capacity. This approach significantly reduces the risk of convergence issues and enhances the overall reliability of your MCMC results.



Suggested Readings:

  1. Geyer, C. J. (1992). Practical Markov chain Monte Carlo. Statistical Science, 473-483.
  2. Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P. C. (2021). Rank-normalization, folding, and localization: An improved R̂ for assessing convergence of MCMC (with discussion). Bayesian Analysis, 16(2), 667-718.
  3. Vehtari, A. (2021). Comparison of MCMC effective sample size estimators.
  4. MFIDD (2023). MCMC diagnostics.
  5. Taboga, M. (2021). Markov Chain Monte Carlo methods (MCMC) diagnostics. Fundamentals of Statistics.

Suggested Citation: Meitei, W. B. (2025). Markov Chain Monte Carlo Diagnostics: Ensuring Reliable Simulations. WBM STATS.

Markov Chain Monte Carlo

by W. B. Meitei, PhD


Markov Chain Monte Carlo (MCMC) methods are a class of algorithms used to generate samples from complex probability distributions, particularly when direct sampling is challenging or analytically infeasible. They have become a mainstay of modern computational statistics, particularly within the framework of Bayesian inference.

The core idea behind MCMC is to construct a Markov chain whose equilibrium (or stationary) distribution is the distribution of interest. By simulating the chain over a large number of steps, one can obtain a sequence of dependent samples that approximate this target distribution. Unlike traditional Monte Carlo methods, which assume independent and identically distributed (i.i.d.) sampling, MCMC methods allow for dependence among samples, making them highly effective in high-dimensional or constrained parameter spaces.

Two of the most widely used MCMC algorithms are the Metropolis-Hastings algorithm and Gibbs sampling. These techniques provide flexible frameworks for sampling from various distributions, even when those distributions are not available in closed form.

MCMC methods have become essential tools in modern statistical modelling, machine learning, and computational science. They are used in diverse applications ranging from parameter estimation and model selection to probabilistic inference in hierarchical models. This article aims to introduce the foundational concepts of MCMC, outline key algorithms, and illustrate their practical relevance in applied statistical analysis.

Fig. 1: Standard Monte Carlo vs Markov Chain Monte Carlo

Before we proceed with MCMC, it is important to understand the fundamentals of the Monte Carlo approach and the Markov chain.

Some basics of Monte Carlo & Markov chain

The Monte Carlo method involves simulating a large number of independent samples from the distribution of the random variable X, denoted as X1, X2, …, XT​. These simulated values form an empirical distribution, which places equal probability 1/T on each sampled point.

For instance, we can use the Monte Carlo method to estimate certain features of the probability distribution of a random variable, such as its expected value, variance, or tail probabilities, particularly when these quantities are difficult or practically not feasible to compute analytically.

  • To approximate the expected value of X, we compute the sample mean of the simulated values.
  • If we want to estimate the probability that X falls below a certain threshold, we can use the proportion of simulated values that are less than that threshold.
  • The variance of X can be approximated using the sample variance of the simulated observations.

In general, any calculation we wish to carry out on the true distribution of a random variable X can be approximated by performing the same calculation on its empirical distribution.

This approach typically yields reliable results, as the approximation tends to improve with larger sample sizes. In fact, as the number of simulated observations T increases, the empirical estimates converge to the true values. This foundational concept is often referred to as the plug-in principle, which underlies much of the Monte Carlo method’s practical utility.
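As a minimal illustration of the plug-in principle (taking a standard normal target for concreteness):

    set.seed(3)
    x <- rnorm(1e5)             # independent draws from N(0, 1)
    c(mean = mean(x),           # approximates E[X] = 0
      var = var(x),             # approximates Var(X) = 1
      tail = mean(x < -1.96))   # approximates P(X < -1.96), about 0.025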

MCMC methods operate similarly to standard Monte Carlo methods, but with a key distinction: the simulated values X1, X2, …, XT​ are not independent draws. Instead, they are serially correlated.

In particular, they are the realisations of T random variables X1, X2, …, XT that form a Markov Chain. Each draw in the sequence depends on the previous one, which introduces correlation between samples. This structure allows MCMC methods to sample from complex probability distributions where independent sampling is difficult or impossible.

We do not need to be experts on Markov chains to understand MCMC. A basic understanding of the concepts outlined in the following sections is sufficient to understand and apply the methods effectively. However, studying the basics of Markov chains is recommended.

a. Markov property

A random sequence {Xt} is said to form a Markov chain if, at any point in time, the future values Xt+n depend only on the present value Xt​, and not on the sequence of past values Xt−k​, for any positive integers k and n.

\[P\left( X_{t + n} = x\ |\ X_{t},X_{t - 1},\ldots,X_{t - k} \right) = P\left( X_{t + n} = x\ |\ X_{t} \right)\]

This defining characteristic, known as the Markov property, reflects the memoryless nature of the process. In other words, the conditional probability distribution of future outcomes is determined solely by the current state and is independent of the historical trajectory that led to that state.

b. Conditional and unconditional distributions

The conditional probability (or density) that Xt+n = xt+n, given that Xt = xt, is denoted by

\[f\left( x_{t + n}\ |\ x_{t} \right)\]

The unconditional probability (or density) that Xt = xt is denoted by

\[f(x_{t})\]

c. Asymptotic independence

While this is not a general property of all Markov chains, those produced by MCMC methods exhibit a key property, i.e., although the variables Xt and Xt+n​ are dependent, their dependence weakens as n increases, gradually approaching independence.

This ensures that f(xt+n | xt) converges to f(xt+n) as n → ∞.

This has important implications, i.e., although the initial value of the chain x1​ is typically chosen arbitrarily, as t increases, the distribution of the chain’s values becomes progressively less influenced by this starting point. In other words, f(xt | x1) approaches the same distribution f(xt), regardless of how x1​ was selected.

d. Target distribution

In Markov chains generated by MCMC methods, not only does f(xt | x1) converge to f(xt), but as t increases, the distributions f(xt) become almost identical to each other. They eventually converge to what is known as the stationary distribution of the chain.

Importantly, this stationary distribution is the same as the target distribution, the distribution from which we ultimately want to draw samples.

In simpler terms, the larger the value of t, the more f(xt | x1) and f(xt) are similar to the target distribution.

A black-box approach

We can imagine an MCMC algorithm as a black box that takes two inputs:

  • An initial value x1,
  • A target distribution

Then the output is a sequence of values x1, x2, …, xT.

Even though the initial values of the sequence may come from distributions that differ significantly from the target distribution, these distributions gradually become more similar to the target as the sequence progresses. When t is sufficiently large, xt​ is effectively drawn from a distribution that closely approximates the target distribution.

Burn-in sample

Generally, the initial values of an MCMC sample are drawn from distributions that differ markedly from the target distribution. Due to this initial mismatch, it is common practice to discard the first portion of an MCMC sample. For instance, in the sequence of 10,000 generated values in Fig. 2, the first 1,500 may be discarded, retaining the remaining 8,500 for analysis.

This initial segment is known as the burn-in sample and is removed to exclude early draws that are less representative of the target distribution. By doing so, we reduce the potential bias in Monte Carlo estimates derived from the MCMC sample, as the retained values are more likely to come from distributions that closely approximate the target distribution.

We can perform MCMC diagnostics to select an appropriate burn-in length.

Fig. 2: Trace plot of an MCMC sample (R code to generate the figure)

Correlation and effective sample size

Once the burn-in sample has been discarded, we are left with a sequence of draws whose distributions closely resemble the target distribution. However, a key issue remains, i.e., the observations are not independent.

What does this dependence imply? To answer this, we must revisit the foundation of standard Monte Carlo methods. In those methods, the accuracy of an estimate improves with the number of independent draws, denoted by T. The larger the sample, the more reliable the approximation.

In contrast, MCMC methods produce dependent samples. As a result, we introduce the notion of effective sample size, a measure of how many independent observations the dependent MCMC sample is equivalent to. For example, 1,000 correlated draws might contain as much information as only 100 independent ones. In such a case, the effective sample size is said to be 100.

Generally, the greater the correlation between successive samples, the lower the effective sample size, and consequently, the less accurate the resulting MCMC approximation.

For this reason, much of the effort in designing and evaluating MCMC algorithms is focused on minimising autocorrelation (or serial correlation) in the chain, thereby increasing the effective sample size and improving estimation accuracy.
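A common estimator along these lines divides T by a truncated sum of autocorrelations; the sketch below is a simplified illustration of the idea, not a production rule:

    # ESS ~ T / (1 + 2 * sum of autocorrelations rho_k)
    ess_hat <- function(x, max_lag = 200) {
      rho <- acf(x, lag.max = max_lag, plot = FALSE)$acf[-1]
      rho <- rho[cumsum(rho < 0.05) == 0]   # drop lags once rho is negligible
      length(x) / (1 + 2 * sum(rho))
    }

    set.seed(6)
    y <- as.numeric(arima.sim(list(ar = 0.9), n = 10000))
    ess_hat(y)   # about 500: 10,000 correlated draws ~ 500 independent ones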

We can perform MCMC diagnostics to check whether the effective sample size is adequate.

Commonly used MCMC algorithms

As mentioned above, there are two commonly used MCMC algorithms. They are:

  1. Metropolis-Hastings algorithm and
  2. Gibbs sampling

1. Metropolis-Hastings algorithm

Of all the MCMC algorithms, the Metropolis-Hastings algorithm is the most commonly used.

Let f(x) denote the density (or mass) function of the target distribution (in Bayesian applications, typically the posterior density), and let q(x | x′) be a conditional proposal distribution from which computer-generated samples can be drawn (e.g., a multivariate normal distribution with mean x′), where x and x′ have the same dimension.

The algorithm

The Metropolis-Hastings algorithm starts from any value x1 belonging to the support of the target distribution and generates the subsequent values xt as follows:

  • Generate a proposal yt from q(yt | xt-1)
  • Compute the acceptance probability

\[p_{t} = min\left( \frac{f\left( y_{t} \right)}{f\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right)\]

  • Draw a uniform random variable ut (on [0, 1])
  • If ut ≤ pt, accept the proposal and set xt = yt; otherwise, reject the proposal and set xt = xt-1 (a minimal R sketch of these steps follows)
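The following is a minimal sketch in R, assuming a standard normal target known only up to a constant, h(x) = exp(-x²/2), and a symmetric random-walk proposal N(xt-1, 1), for which the q-ratio in the acceptance probability cancels:

    set.seed(4)
    log_h <- function(x) -x^2 / 2      # unnormalised log target
    n_iter <- 10000
    x <- numeric(n_iter)
    x[1] <- 5                          # deliberately poor starting value

    for (t in 2:n_iter) {
      y <- rnorm(1, mean = x[t - 1], sd = 1)        # propose y_t
      log_p <- min(log_h(y) - log_h(x[t - 1]), 0)   # log acceptance probability
      x[t] <- if (log(runif(1)) <= log_p) y else x[t - 1]
    }

    mean(x[-(1:1000)])   # post burn-in mean, close to the target mean of 0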

Advantages of the Metropolis-Hastings algorithm

Suppose,

\[f(x) = ch(x)\]

We assume that we know h(x) but not c.

The acceptance probability according to the Metropolis-Hastings algorithm is

\[p_{t} = min\left( \frac{f\left( y_{t} \right)}{f\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right) = min\left( \frac{ch\left( y_{t} \right)}{ch\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right)\]

\[= min\left( \frac{h\left( y_{t} \right)}{h\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right)\]

Thus, the acceptance probability, which is the only quantity that depends on f, can be computed without knowing the constant c.

This is the beauty of the Metropolis-Hastings algorithm, i.e., we can generate draws from a distribution even if we do not fully know the density of that distribution.

2. Gibbs sampling

Another commonly used MCMC algorithm is the Gibbs sampling algorithm.

The algorithm

The Gibbs sampling algorithm starts from any vector,

\[x_{1} = \left\lbrack x_{1,1},\ldots,x_{1,B} \right\rbrack\]

belonging to the support of the target distribution and generates the subsequent values xt as follows:

For b = 1, …, B, generate xt,b from the conditional distribution with density

\[f\left( x_{t,b}\ |\ x_{t,1},\ldots,x_{t,b - 1},x_{t - 1,b + 1},\ldots,x_{t - 1,B} \right)\]

In other words, at each iteration, the blocks are extracted one by one from their full conditional distributions, conditioning on the latest draws of all the other blocks.

Note that at iteration t, when you extract the bth block, the latest draws of blocks 1 to b-1 are those already extracted in iteration t, while the latest draws of blocks b+1 to B are those extracted in the previous iteration t-1. The sketch below illustrates this alternation for a simple two-block example.
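As a minimal sketch, assume a bivariate normal target with zero means, unit variances, and correlation rho; both full conditionals are then univariate normals, so each block is easy to draw:

    set.seed(5)
    rho <- 0.8; n_iter <- 10000
    x <- matrix(0, nrow = n_iter, ncol = 2)

    for (t in 2:n_iter) {
      # block 1 given the latest draw of block 2 (from iteration t - 1)
      x[t, 1] <- rnorm(1, mean = rho * x[t - 1, 2], sd = sqrt(1 - rho^2))
      # block 2 given the latest draw of block 1 (from iteration t)
      x[t, 2] <- rnorm(1, mean = rho * x[t, 1], sd = sqrt(1 - rho^2))
    }

    cor(x[-(1:1000), ])   # sample correlation close to rho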


What is a full conditional distribution?

Suppose our aim is to generate samples from a random vector xo whose joint density is given by

\[f\left( x_{o} \right) = f(x_{o,1},\ldots,x_{o,B})\]

where f(xo,1, …, xo,B) represents a partition of xo into B blocks (each possibly consisting of one or more components).

For any block b, let xo,-b denote the vector containing all components of xo except those in the bth block.

Now, suppose we have the ability to sample from the B conditional distributions f(xo,1 | xo,-1), …, f(xo,B | xo,-B). These are known in MCMC terminology as the full conditional distributions.



Suggested Readings:

  1. Brooks, S., Gelman, A., Jones, G., & Meng, X. L. (Eds.). (2011). Handbook of Markov Chain Monte Carlo. CRC Press.
  2. Taboga, M. (2021). Markov Chain Monte Carlo methods. Fundamentals of Statistics.
  3. Gilks, W. R., Richardson, S., & Spiegelhalter, D. (Eds.). (1995). Markov Chain Monte Carlo in Practice. CRC Press.

Suggested Citation: Meitei, W. B. (2025). Markov Chain Monte Carlo. WBM STATS.