by W. B. Meitei, PhD
Markov Chain Monte Carlo (MCMC) methods have become essential tools in modern statistical analysis, particularly in Bayesian inference, where drawing samples from complex, high-dimensional probability distributions is often necessary. However, generating samples is only part of the task; ensuring that those samples accurately represent the target distribution is equally important.
This is where MCMC diagnostics come into play. MCMC diagnostics are a set of tools and techniques used to assess the quality and reliability of an MCMC-generated sample. Specifically, MCMC diagnostics help us answer two critical questions:
- Has a substantial portion of the sample been drawn from distributions that deviate significantly from the target distribution? This is especially relevant early in the sampling process, where the chain may not yet have "settled" into the correct distribution. Diagnosing this allows us to determine how much of the early sample (often referred to as the burn-in period) should be discarded.
- Is the sample size large enough to approximate the target distribution with sufficient accuracy? In MCMC, draws are typically correlated, which means that the effective number of independent observations is smaller than the total number of samples. Diagnostics help us evaluate whether we have enough effective samples to trust the results.
In this article, we explore why MCMC diagnostics are necessary, introduce common techniques used to evaluate convergence and sampling efficiency, and discuss practical steps for interpreting the results to improve simulation quality. Whether you're new to MCMC or looking to refine your workflow, understanding MCMC diagnostics is key to producing valid and robust statistical inferences.
MCMC in brief
An MCMC algorithm generates a sequence {Xt} of random variables (or vectors) that exhibit the following characteristics:
- The sequence forms a Markov chain.
- As t increases, the distribution of Xt becomes increasingly similar to the target distribution. More formally, the chain converges to its stationary distribution, which is, by design, equal to the target distribution.
- While individual terms in the chain, such as Xt and Xt+n, are generally dependent, the dependency weakens as the gap n between them grows larger. In other words, as n becomes large, Xt and Xt+n become approximately independent.
When we run an MCMC algorithm for T iterations, we obtain an MCMC sample made up of the first T realisations of the chain, x1, …, xT, and we then use the empirical distribution of this sample to approximate the target distribution.
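To make this concrete, here is a minimal random-walk Metropolis sampler in Python — a sketch, not the article's code. The target (a standard normal) and all tuning choices are illustrative assumptions.

```python
import numpy as np

def metropolis_sample(log_target, x0, T, proposal_sd=1.0, seed=0):
    """Random-walk Metropolis: generates a Markov chain whose
    stationary distribution is the target (known only up to a
    normalising constant through log_target)."""
    rng = np.random.default_rng(seed)
    chain = np.empty(T)
    x = x0
    log_p = log_target(x)
    for t in range(T):
        # Propose a local move and accept with the Metropolis probability.
        y = x + proposal_sd * rng.normal()
        log_q = log_target(y)
        if np.log(rng.uniform()) < log_q - log_p:
            x, log_p = y, log_q
        chain[t] = x   # on rejection, the current value is recorded again
    return chain

# Illustrative target: standard normal (log-density up to a constant).
chain = metropolis_sample(lambda x: -0.5 * x**2, x0=0.0, T=50_000)
print(chain.mean(), chain.std())  # should be close to 0 and 1
```

The empirical distribution of `chain` then serves as the approximation to the target, which is exactly what the diagnostics below try to validate.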
Problems MCMC diagnostics try to spot
1. A substantial portion of the sample has been drawn from distributions that deviate significantly from the target distribution.
In most cases, the distribution of the initial value X1 does not match the target distribution. As a result, the distributions of the subsequent terms X2, X3, …, Xt also differ from the target, although these differences gradually diminish as t increases, since the distribution of Xt is designed to converge to the target as t approaches infinity.
However, sometimes the chain converges slowly. This means that even after many iterations, the draws may still come from distributions that deviate substantially from the target. Such slow convergence often occurs when the initial value x1 is located in a region where the target distribution assigns a very small probability.
When this happens, a significant portion of the MCMC sample may consist of observations that do not adequately represent the target distribution.
To address this issue, we can take corrective measures such as:
- Discarding a substantial number of initial observations, known as the burn-in sample;
- Increasing the total number of iterations (i.e., extending the sample size).
These strategies can help ensure that a greater share of the MCMC sample comes from distributions that closely resemble the target distribution.
2. The effective sample size is not large enough to approximate the target distribution with sufficient accuracy.
Keep in mind that, generally, two terms in the Markov chain, Xt and Xt+n, are not independent. However, their dependence weakens as n increases, and they eventually become nearly independent.
Suppose we can identify the smallest integer n₀ such that, for any t, Xt and Xt+n₀ can be treated as practically independent, i.e., the remaining correlation is negligible. For simplicity, also assume that the total number of draws T is divisible by n₀. In this case, we can construct a sub-sample,
xn₀, x2n₀, …, xT,
consisting of values that are approximately independent of one another. This sub-sample is referred to as the effective sample, and its size is T/n₀.
In simple terms, the full MCMC sample behaves like an independent sample of size T/n₀. The slower the dependence between observations fades, the larger n₀ becomes, and consequently, the smaller the effective sample size is.
If dependence decays very slowly, resulting in a large n₀, the effective sample size may be too small, making the empirical distribution derived from the MCMC sample an unreliable or noisy estimate of the target distribution.
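This intuitive n₀ can be estimated empirically by scanning the sample autocorrelation for the smallest lag at which it becomes negligible. The Python sketch below does this for a synthetic AR(1) chain; the 0.05 threshold and the chain itself are illustrative assumptions, not part of the article.

```python
import numpy as np

def first_negligible_lag(x, threshold=0.05, max_lag=1000):
    """Smallest lag n0 at which the sample autocorrelation of x
    falls below `threshold` (the intuitive n0 from the text)."""
    x = np.asarray(x) - np.mean(x)
    var = np.dot(x, x) / len(x)
    for lag in range(1, max_lag):
        acf = np.dot(x[:-lag], x[lag:]) / (len(x) * var)
        if abs(acf) < threshold:
            return lag
    return max_lag

rng = np.random.default_rng(1)
# AR(1) chain: corr(X_t, X_{t+n}) = 0.9**n, so dependence fades slowly.
T, rho = 100_000, 0.9
x = np.empty(T)
x[0] = 0.0
for t in range(1, T):
    x[t] = rho * x[t - 1] + rng.normal()

n0 = first_negligible_lag(x)
print("n0:", n0, "effective sample size ~", T // n0)
```

For this chain n₀ comes out around 25–35, so only roughly one draw in thirty carries fresh information, even though the full sample has 100,000 draws.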
If such an issue is detected, it can often be mitigated by:
- Modifying the MCMC algorithm to reduce dependence (i.e., reduce n₀);
- Increasing the number of iterations T.
Both approaches aim to increase the effective sample size and thereby improve the quality of the approximation.
The concept of effective sample size discussed here is an intuitive one and differs from the more formal definitions commonly found in MCMC literature. It is meant to clearly highlight the issues that can arise when dependence between observations declines too slowly.
The fundamental principle behind MCMC diagnostics
Most MCMC diagnostic tests assess the following two hypotheses:
- A substantial portion of the sample has been drawn from distributions that closely resemble the target distribution.
- The effective sample size is large enough to approximate the target distribution with sufficient accuracy.
If these two hypotheses hold, then the implication is that the empirical distribution of any large sub-sample is a good approximation of the target distribution.
There are multiple approaches to conducting MCMC diagnostics, each aimed at assessing whether the generated sample accurately represents the target distribution. In what follows, we outline four commonly used diagnostic methods. The first two are relatively simple and intuitive, making them easy to implement and interpret. The latter two, while more technical, offer a more rigorous and statistically grounded evaluation of the quality of the MCMC sample.
a. Sample splits
One of the simplest and most straightforward ways to diagnose problems in an MCMC sample is to divide it into two or more parts and examine whether they yield consistent results.
For example, suppose we have an MCMC sample made up of T draws (with T even), i.e.,
x1, …, xT,
where each draw xt is a K×1 random vector. We can then split the sample into two equal halves,
x1, …, xT/2
and
xT/2+1, …, xT.
The sample means of the two halves are then
x̄1 = (2/T)(x1 + … + xT/2) and x̄2 = (2/T)(xT/2+1 + … + xT).
A significant difference between the two means (a formal statistical test can be used to assess this) suggests that the MCMC sample may not be reliable.
Although the example focuses on comparing sample means, similar checks can be performed using other moments, such as sample variances.
The logic behind this diagnostic is quite intuitive: if two large portions of the sample differ noticeably in their empirical distributions, then at least one of them likely fails to represent the target distribution well. This contradicts the large chunks principle, which assumes that sufficiently large parts of the MCMC sample should provide similar and accurate approximations of the target distribution. If that principle fails, the overall quality of the sample is questionable.
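A minimal version of this check might look as follows. This is a sketch under simplifying assumptions: the naive z statistic below ignores serial correlation, so in practice the standard errors should be deflated by the effective sample size, and the chains used here are synthetic stand-ins.

```python
import numpy as np

def split_mean_check(chain, z_crit=4.0):
    """Compare the means of the two halves of an MCMC sample.
    The naive z statistic treats draws as independent, so a generous
    critical value is used; a formal test would adjust the standard
    errors for serial dependence."""
    half = len(chain) // 2
    a, b = chain[:half], chain[half:2 * half]
    se = np.sqrt(a.var(ddof=1) / half + b.var(ddof=1) / half)
    z = (a.mean() - b.mean()) / se
    return z, abs(z) < z_crit

rng = np.random.default_rng(2)
good = rng.normal(size=10_000)                        # stationary-looking sample
bad = np.concatenate([rng.normal(5.0, 1.0, 3_000),    # far-off early draws...
                      rng.normal(0.0, 1.0, 7_000)])   # ...then the target region

z_good, ok_good = split_mean_check(good)
z_bad, ok_bad = split_mean_check(bad)
print(ok_good, ok_bad)  # the "bad" chain fails the check
```

The same function could be applied coordinate by coordinate when each draw is a K×1 vector, or with sample variances in place of means.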
b. Multiple chains
Another simple diagnostic technique involves running the MCMC algorithm multiple times, each time starting from a different (and possibly very different) initial value x1, to generate several independent MCMC samples. After discarding appropriate burn-in periods, we compare the resulting samples to see if they yield consistent outcomes. As with the sample split method, this can be done by testing whether the sample means or other statistical moments differ significantly across the samples. Consistency across runs provides reassurance about convergence and reliability, while significant differences may indicate problems with mixing or convergence.
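One widely used formalisation of the multiple-chains idea is the Gelman-Rubin potential scale reduction factor, which compares between-chain and within-chain variability. The sketch below implements the basic version (without the rank-normalisation and folding refinements of modern R̂ variants); the chains are illustrative.

```python
import numpy as np

def gelman_rubin(chains):
    """Basic potential scale reduction factor (R-hat) for a list of
    equal-length chains; values near 1 suggest the chains agree."""
    chains = np.asarray(chains)                  # shape (m, n)
    m, n = chains.shape
    means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    B = n * means.var(ddof=1)                    # between-chain variance
    var_plus = (n - 1) / n * W + B / n           # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(3)
mixed = [rng.normal(size=5_000) for _ in range(4)]             # agreeing chains
stuck = [rng.normal(loc, 1.0, 5_000) for loc in (0, 0, 3, 3)]  # disagreeing chains
print(gelman_rubin(mixed), gelman_rubin(stuck))  # near 1 vs. well above 1
```

A common rule of thumb is to treat R̂ values above roughly 1.01–1.1 (depending on the convention) as a sign of convergence problems.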
c. Trace plots
A trace plot is a fundamental diagnostic tool used to evaluate the performance of an MCMC sample. It shows the values drawn for a specific parameter (or a component of a parameter vector) plotted against the iteration number. This visual representation helps determine whether the chain has converged, mixed properly, and is adequately exploring the target distribution. Any visible trends, shifts, or irregular patterns in the plot may suggest issues such as poor convergence, strong serial correlation, or the need for a longer burn-in period. As such, trace plots are often one of the first tools examined in MCMC diagnostics.
Suppose we have the following MCMC sample,
x1, …, xT,
where each draw xt is a K×1 random vector. A trace plot is then a line graph with time (t = 1, …, T) on the x-axis and the values taken by one coordinate of the draws (x1i, …, xTi) on the y-axis.
Fig. 1: Trace plots of Chain 1, 2, and 3 (R code to generate the figure)
In the trace plot of Chain 1, the sample appears to have no visible anomalies. While there is some mild serial correlation between consecutive draws, the chain moves through the sample space multiple times, suggesting adequate mixing and no major issues.
In the trace plot of Chain 2, the initial segment (up to around t = 1,500) noticeably differs from the rest of the sample. This likely indicates that the starting distribution and early iterations of the chain were far from the target distribution. Over time, the chain gradually approached the target distribution (around t = 1,500). This reflects Problem 1, i.e., a significant portion of the sample was generated from distributions not representative of the target.
In the trace plot of Chain 3, there is strong serial correlation between successive draws. The chain moves slowly through the sample space and seems to make only a few complete passes. This suggests that the number of effectively independent observations is small, illustrating Problem 2, i.e., a low effective sample size.
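The three behaviours just described can be imitated with synthetic chains. The sketch below (Python with matplotlib, as a stand-in for the article's R code) draws trace plots for a well-mixed chain, a chain with a far-off start, and a strongly autocorrelated chain; all three series are illustrative fabrications, not the article's actual chains.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # headless backend; no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
T = 10_000

# Synthetic stand-ins mimicking the behaviours described in the text.
well_mixed = rng.normal(size=T)                         # like Chain 1
slow_start = np.concatenate([                           # like Chain 2:
    np.linspace(8, 0, 1_500) + rng.normal(0, 0.3, 1_500),  # drifting start
    rng.normal(size=T - 1_500)])                           # then on target
x = np.empty(T)
x[0] = 0.0
for t in range(1, T):                                   # like Chain 3:
    x[t] = 0.995 * x[t - 1] + rng.normal(0, 0.1)        # strong serial correlation

fig, axes = plt.subplots(3, 1, figsize=(8, 6), sharex=True)
for ax, chain, title in zip(axes, (well_mixed, slow_start, x),
                            ("Chain 1", "Chain 2", "Chain 3")):
    ax.plot(chain, linewidth=0.5)   # draws against iteration number
    ax.set_title(title)
axes[-1].set_xlabel("iteration t")
fig.savefig("trace_plots.png")
```

The visual signatures match the diagnoses above: a dense "fuzzy caterpillar" for the healthy chain, an initial drift for the slow-start chain, and slow wandering for the autocorrelated one.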
Fig. 2: Trace plots of Chain 2 and 3 after refinement (R code to generate the figure)
The two trace plots in Fig. 2 demonstrate how we address Problems 1 and 2. In the first plot, we reuse the sample from Chain 2 after discarding the burn-in sample (the initial 1,500 draws); the chain then shows no visible anomalies. In the second plot, a new sample is generated using the same method as Chain 3, but the sample size is greatly expanded from 10,000 to 1,000,000 draws. The larger sample increases the effective sample size and allows the chain to explore the sample space more thoroughly. In both cases, the issues observed earlier appear to be resolved: the trace plots now show no visible problems or irregularities.
Although trace plots are relatively simple and informal diagnostic tools, they are among the most commonly used in practice. If you're submitting an MCMC-based study to a scientific journal, reviewers will almost certainly expect you to include trace plots as part of your analysis.
d. Autocorrelation function (ACF) plots
Trace plots can give us a visual indication of how much serial correlation exists between the draws in an MCMC sample. To evaluate this more precisely, we can use ACF (autocorrelation function) plots, also known as correlograms. These plots show how the sample autocorrelation between terms in the chain decreases as the lag increases.
The sample ACF plots for the three MCMC chains previously examined in the trace plot in Fig. 1 are presented in Fig. 3.
Fig. 3: ACF plots of Chain 1, 2, and 3 (R code to generate the figure)
The ACF plot for Chain 1 shows that while autocorrelation (or serial correlation) is high at short lags, it quickly diminishes to near zero, consistent with the trace plot, which did not indicate any issues.
In contrast, the ACF plots for Chains 2 and 3 reveal that autocorrelation is not only strong at short lags but also declines very slowly. The ACFs of Chains 2 and 3 suggest different underlying issues (Problem 1 for Chain 2 and Problem 2 for Chain 3). While ACF plots are useful for detecting the presence of autocorrelation and quantifying its magnitude, they often do not clearly identify the nature of the problem. For this reason, it is recommended to use ACF plots in conjunction with trace plots for a more comprehensive diagnosis.
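To illustrate the contrast, the sketch below computes sample autocorrelations for a nearly independent chain and a strongly autocorrelated one. Both series are synthetic stand-ins for the chains in the figures, and the lag choices are arbitrary.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of a series at lags 0..max_lag."""
    x = np.asarray(x) - np.mean(x)
    var = np.dot(x, x) / len(x)
    return np.array([1.0] + [np.dot(x[:-k], x[k:]) / (len(x) * var)
                             for k in range(1, max_lag + 1)])

rng = np.random.default_rng(5)
T = 50_000
iid = rng.normal(size=T)            # behaves like Chain 1: ACF drops fast
ar = np.empty(T)
ar[0] = 0.0
for t in range(1, T):               # slowly decaying dependence, like Chain 3
    ar[t] = 0.98 * ar[t - 1] + rng.normal()

acf_iid = sample_acf(iid, 50)
acf_ar = sample_acf(ar, 50)
print(acf_iid[10], acf_ar[10])      # near zero vs. still large at lag 10
```

Plotting these arrays against the lag (a bar chart, in the style of a correlogram) reproduces the qualitative difference between the healthy and problematic chains.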
Finally, a few points should be kept in mind when performing MCMC diagnostics:
- No single diagnostic is perfect.
- We can never be 100% certain about the adequacy of an MCMC sample. While diagnostics are useful for identifying potential issues, they cannot provide absolute assurance that no problems are present.
- Conducting multiple diagnostics is always a prudent strategy. In addition, run your chains for as many iterations as your resources allow, initiate multiple chains from diverse starting points, and make full use of available computational capacity. This approach significantly reduces the risk of convergence issues and enhances the overall reliability of your MCMC results.
Suggested Readings:
- Geyer, C. J. (1992). Practical Markov chain Monte Carlo. Statistical Science, 7(4), 473-483.
- Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P. C. (2021). Rank-normalization, folding, and localization: An improved R̂ for assessing convergence of MCMC (with discussion). Bayesian Analysis, 16(2), 667-718.
- Vehtari, A. (2021). Comparison of MCMC effective sample size estimators.
- MFIDD2023. MCMC diagnostics
- Taboga, M. (2021). Markov Chain Monte Carlo Methods (MCMC) diagnostics. Fundamentals of Statistics.
Suggested Citation: Meitei, W. B. (2025). Markov Chain Monte Carlo Diagnostics: Ensuring Reliable Simulations. WBM STATS.