Sunday, 27 July 2025

Markov Chain Monte Carlo

by W. B. Meitei, PhD


Markov Chain Monte Carlo (MCMC) methods are a class of algorithms designed to generate samples from complex probability distributions, particularly when direct sampling is challenging or analytically infeasible. They are widely used in modern computational statistics, particularly in the context of Bayesian inference.

The core idea behind MCMC is to construct a Markov chain whose stationary distribution is the distribution of interest. By simulating the chain over many steps, one can obtain a sequence of dependent samples that approximate the target distribution. Unlike traditional Monte Carlo methods, which assume independent and identically distributed (i.i.d.) sampling, MCMC methods allow for dependence among samples, making them highly effective in high-dimensional or constrained parameter spaces.

Among the many MCMC algorithms, two of the most widely used are the Metropolis-Hastings and the Gibbs sampling algorithms. Both offer flexible frameworks for sampling from complex distributions, even when those distributions are not available in closed form, making them indispensable in real-world Bayesian modeling and other advanced statistical applications.

MCMC algorithms have become essential tools in modern statistical modeling, machine learning, and computational science. They are used in diverse applications ranging from parameter estimation and model selection to probabilistic inference in hierarchical models. This article introduces the foundational concepts of MCMC, outlines key algorithms, and illustrates their practical relevance in applied statistical analysis.

Fig 1: Standard Monte Carlo Vs Markov Chain Monte Carlo

Before we proceed with MCMC, it is important to understand the fundamentals of the Monte Carlo approach and the Markov chain.

Some basics of Monte Carlo & Markov chain

The Monte Carlo method involves simulating a large number of independent samples from the distribution of the random variable X, denoted as X1, X2, …, XT. These simulated values form an empirical distribution that assigns equal probability 1/T to each sampled point.

For instance, we can use the Monte Carlo method to estimate certain features of the probability distribution of a random variable, such as its expected value, variance, or tail probabilities, particularly when these quantities are difficult or impractical to compute analytically.

  • To approximate the expected value of X, we compute the sample mean of the simulated values.
  • To estimate the probability that X falls below a given threshold, we can use the proportion of simulated values that are less than that threshold.
  • The variance of X can be approximated using the sample variance of the simulated observations.

In general, any calculation we wish to carry out on the true distribution of a random variable X can be approximated by performing the same calculation on its empirical distribution.

This approach typically yields reliable results, as the approximation tends to improve with larger sample sizes. In fact, as the number of simulated observations T increases, the empirical estimates converge to the true values. This foundational concept, often called the plug-in principle, underlies much of the Monte Carlo method’s practical utility.
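As a quick illustration of the plug-in principle, here is a minimal Python sketch that computes the three approximations listed above; the standard normal target, the seed, and the sample size are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100_000

# Draw T independent samples from the target distribution
# (here a standard normal, chosen only for illustration).
x = rng.standard_normal(T)

mean_hat = x.mean()            # approximates E[X] = 0
var_hat = x.var(ddof=1)        # approximates Var(X) = 1
tail_hat = (x < -1.96).mean()  # approximates P(X < -1.96), about 0.025

print(mean_hat, var_hat, tail_hat)
```

Increasing T shrinks the error of each estimate, which is exactly the convergence described above.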

MCMC methods operate similarly to standard Monte Carlo methods, but with a key distinction: the simulated values X1, X2, …, XT​ are not independent draws. Instead, they are serially correlated.

In particular, they are the realizations of T random variables X1, X2, …, XT that form a Markov Chain. Each draw in the sequence depends on the previous one, which introduces correlation between samples. This structure allows MCMC methods to sample from complex probability distributions where independent sampling is difficult or impossible.

We do not need to be experts in Markov chains to understand MCMC. A basic understanding of the concepts outlined in the following sections is sufficient to understand and apply the methods effectively, though further study of Markov chain fundamentals is recommended.

a. Markov property

A random sequence {Xt} is said to form a Markov chain if, at any point in time, the future values Xt+n depend only on the present value Xt​, and not on the sequence of past values Xt−k​, for any positive integers k and n.

\[P\left( X_{t + n} = x\ |\ X_{t},X_{t - 1},\ldots,X_{t - k} \right) = P\left( X_{t + n} = x\ |\ X_{t} \right)\]

This defining characteristic, known as the Markov property, reflects the process's memoryless nature. In other words, the conditional probability distribution of future outcomes is determined solely by the current state and is independent of the historical trajectory that led to that state.
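To make the Markov property concrete, here is a small Python sketch in which each new state is generated using only the current state; the two-state chain and its transition matrix are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Transition matrix of a two-state chain: P[i, j] = P(X_{t+1} = j | X_t = i).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

def simulate_chain(P, x1, T, rng):
    """Simulate a Markov chain: each step depends only on the current state."""
    x = np.empty(T, dtype=int)
    x[0] = x1
    for t in range(1, T):
        x[t] = rng.choice(2, p=P[x[t - 1]])  # uses x[t-1] only, not the past
    return x

chain = simulate_chain(P, x1=0, T=50_000, rng=rng)
# The long-run fraction of time spent in state 0 approaches the stationary
# probability, which for this matrix is 5/6.
print(np.mean(chain == 0))
```

Note that the update inside the loop never looks further back than one step; that is the memoryless property in code.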

b. Conditional and unconditional distributions

The conditional probability (or density) that Xt+n = xt+n, given that Xt = xt, is denoted as

\[f\left( x_{t + n}\ |\ x_{t} \right)\]

The unconditional probability (or density) that Xt = xt is denoted as

\[f(x_{t})\]

c. Asymptotic independence

While this is not a general property of all Markov chains, those produced by MCMC methods exhibit a key property: although the variables Xt and Xt+n​ are dependent, their dependence weakens as n increases, gradually approaching independence.

This ensures that f(xt+n | xt) converges to f(xt+n) as n → ∞.

This has an important implication: although the initial value of the chain x1 is typically chosen arbitrarily, as t increases, the distribution of the chain's values becomes progressively less influenced by this starting point. In other words, f(xt | x1) approaches the same distribution f(xt), regardless of how x1 was selected.

d. Target distribution

In Markov chains generated by MCMC methods, not only does f(xt | x1) converge to f(xt), but as t increases, the distributions f(xt) become almost identical to each other. They eventually converge to the chain's stationary distribution.

Importantly, this stationary distribution is the same as the target distribution, the distribution from which we ultimately want to draw samples.

In simpler terms, the larger the value of t, the closer both f(xt | x1) and f(xt) are to the target distribution.

A black-box approach

We can imagine an MCMC algorithm as a black box that takes two inputs:

  • An initial value x1,
  • A target distribution

Then the output is a sequence of values x1, x2, …, xT.

Even though the initial values of the sequence may come from distributions that differ significantly from the target distribution, these distributions gradually become more similar to the target as the sequence progresses. When t is sufficiently large, xt​ is effectively drawn from a distribution that closely approximates the target distribution.

Burn-in sample

Generally, the initial values of the MCMC sample differ substantially from the target distribution. Due to this initial mismatch with the target distribution, it is common practice to discard the first portion of an MCMC sample. For instance, in a sequence of 10,000 generated values in Fig 2, the first 1,500 may be discarded, retaining the remaining 8,500 for analysis.

Fig 2: Trace plot of an MCMC sample with long burn-in sample

This initial segment, known as the burn-in sample, is removed to exclude early draws that are less representative of the target distribution. By doing so, we reduce the potential bias in Monte Carlo estimates derived from the MCMC sample, as the retained values are more likely to come from distributions that closely approximate the target distribution.

We can perform MCMC diagnostics to select an appropriate burn-in length.
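A minimal Python sketch of discarding a burn-in sample, using an AR(1) chain started far from its stationary distribution as a stand-in for an MCMC chain; the chain, the bad starting value, and the burn-in length (which mirrors the Fig 2 example) are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy persistent chain: an AR(1) process whose stationary distribution
# is N(0, 1 / (1 - 0.95**2)).  It plays the role of an MCMC chain here.
T, burn_in = 10_000, 1_500
x = np.empty(T)
x[0] = 50.0                   # deliberately bad starting value
for t in range(1, T):
    x[t] = 0.95 * x[t - 1] + rng.standard_normal()

kept = x[burn_in:]            # discard the burn-in sample
print(len(kept), kept.mean())
```

The retained draws no longer carry the influence of the starting value 50, so estimates computed from `kept` are far less biased than those computed from the full sequence.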

Correlation and effective sample size

Once the burn-in sample has been discarded, we are left with a sequence of draws whose distributions closely resemble the target distribution. However, a key issue remains: the observations are not independent.

What does this dependence imply? To answer this, we must revisit the foundation of standard Monte Carlo methods. In those methods, the accuracy of an estimate improves with the number of independent draws, denoted by T. The larger the sample, the more reliable the approximation.

In contrast, MCMC methods produce dependent samples. As a result, we introduce the notion of effective sample size, a measure of how many independent observations the dependent MCMC sample is equivalent to. For example, 1,000 correlated draws might contain as much information as only 100 independent ones. In such a case, the effective sample size is 100.

Generally, the greater the correlation between successive samples, the lower the effective sample size, and consequently, the less accurate the resulting MCMC approximation.

For this reason, much of the effort in designing and evaluating MCMC algorithms is focused on minimizing autocorrelation (or serial correlation) in the chain, thereby increasing the effective sample size and improving estimation accuracy.

We can perform MCMC diagnostics to estimate the effective sample size.
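The following Python sketch shows one crude way to estimate the effective sample size from the chain's autocorrelations; the truncation rule and the AR(1) test chain are illustrative choices, not a standard diagnostic implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

def effective_sample_size(x, max_lag=200):
    """Crude ESS estimate: T / (1 + 2 * sum of positive autocorrelations)."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    xc = x - x.mean()
    var = np.dot(xc, xc) / T
    rho_sum = 0.0
    for k in range(1, max_lag + 1):
        rho = np.dot(xc[:-k], xc[k:]) / (T * var)  # lag-k autocorrelation
        if rho <= 0.0:          # truncate once the autocorrelation dies out
            break
        rho_sum += rho
    return T / (1.0 + 2.0 * rho_sum)

# A strongly autocorrelated AR(1) chain: its theoretical ESS is only a
# fraction (1 - 0.9) / (1 + 0.9) of the nominal sample size T.
T = 50_000
x = np.empty(T)
x[0] = 0.0
for t in range(1, T):
    x[t] = 0.9 * x[t - 1] + rng.standard_normal()

print(effective_sample_size(x))  # far fewer than T "independent" draws
```

For this chain the estimate comes out in the low thousands, illustrating how 50,000 correlated draws can carry the information of only a few thousand independent ones.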

Commonly used MCMC algorithms

As mentioned above, the two most commonly used MCMC algorithms are:

  1. Metropolis-Hastings algorithm and
  2. Gibbs sampling

1. Metropolis-Hastings algorithm

Of all the MCMC algorithms, the Metropolis-Hastings algorithm is the most commonly used.

Let f(x) denote the density (or mass) function of the target distribution, and let q(x | x՛) be a conditional distribution from which we can easily generate computer samples (e.g., a multivariate normal distribution with mean x՛), where x and x՛ have the same dimension.

The algorithm

The Metropolis-Hastings algorithm starts from any value x1 belonging to the support of the target distribution and generates the subsequent values xt as follows:

  • Generate a proposal yt from q(yt | xt-1)
  • Compute the acceptance probability

\[p_{t} = min\left( \frac{f\left( y_{t} \right)}{f\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right)\]

  • Draw a uniform random variable ut (on [0, 1])
  • If ut ≤ pt, accept the proposal and set xt = yt; otherwise, reject the proposal and set xt = xt-1
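The steps above can be sketched in Python. This toy version uses a symmetric random-walk proposal, so the factor q(xt-1 | yt) / q(yt | xt-1) cancels in the acceptance probability, and it targets a standard normal specified only up to a constant; both choices, and the step size, are for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

def h(x):
    """Unnormalized target density: a standard normal up to a constant."""
    return np.exp(-0.5 * x * x)

def metropolis_hastings(h, x1, T, step=1.0, rng=rng):
    """Random-walk Metropolis-Hastings with a symmetric normal proposal."""
    x = np.empty(T)
    x[0] = x1
    for t in range(1, T):
        y = x[t - 1] + step * rng.standard_normal()  # proposal y_t
        p = min(h(y) / h(x[t - 1]), 1.0)             # acceptance probability
        if rng.uniform() <= p:                       # u_t <= p_t: accept
            x[t] = y
        else:                                        # otherwise keep x_{t-1}
            x[t] = x[t - 1]
    return x

draws = metropolis_hastings(h, x1=0.0, T=20_000)
print(draws.mean(), draws.var())  # close to 0 and 1 for this target
```

Notice that the sampler only ever evaluates ratios of h, which is why the normalizing constant of the target never needs to be known.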

Advantages of the Metropolis-Hastings algorithm

Suppose,

\[f(x) = ch(x)\]

We assume that we can evaluate h(x) but that the normalizing constant c is unknown.

The acceptance probability according to the Metropolis-Hastings algorithm is

\[p_{t} = min\left( \frac{f\left( y_{t} \right)}{f\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right) = min\left( \frac{ch\left( y_{t} \right)}{ch\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right)\]

\[= min\left( \frac{h\left( y_{t} \right)}{h\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right)\]

Thus, the acceptance probability, which is the only quantity that depends on f, can be computed without knowing the constant c.

This is the beauty of the Metropolis-Hastings algorithm: we can generate draws from a distribution even if we do not fully know its probability density (or mass) function.

2. Gibbs sampling

Another commonly used MCMC algorithm is Gibbs sampling.

The algorithm

The Gibbs sampling algorithm starts from any vector,

\[x_{1} = \left\lbrack x_{1,1},\ldots,x_{1,B} \right\rbrack\]

belonging to the support of the target distribution and generates the subsequent values xt as follows:

For b = 1, …, B, generate xt,b from the conditional distribution with density

\[f\left( x_{t,b}\ |\ x_{t,1},\ldots,x_{t,b - 1},x_{t - 1,b + 1},\ldots,x_{t - 1,B} \right)\]

In other words, at each iteration, the blocks are extracted one by one from their full conditional distributions, conditioning on the latest draws of all the other blocks.

Note that at iteration t, when you extract the bth block, the latest draws of blocks 1 to b-1 are those already extracted in iteration t, while the latest draws of blocks b+1 to B are those extracted in the previous iteration t-1.


What is a full conditional distribution?

Suppose our aim is to generate samples from a random vector xo whose joint density is given by

\[f\left( x_{o} \right) = f(x_{o,1},\ldots,x_{o,B})\]

where xo,1, …, xo,B is a partition of xo into B blocks (each possibly consisting of one or more components).

For any block b, let xo,-b denote the vector containing all components of xo except those in the bth block.

Now, suppose we can sample from the B conditional distributions f(xo,1 | xo,-1), …, f(xo,B | xo,-B). These are known in MCMC terminology as the full conditional distributions.
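As a concrete sketch, here is a Gibbs sampler in Python for a bivariate normal target with correlation rho, a case where both full conditional distributions are available in closed form; the target, the correlation value, and the chain length are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)

# For a standard bivariate normal with correlation rho, the full
# conditionals are known: X1 | X2 = x2 ~ N(rho * x2, 1 - rho**2),
# and symmetrically for X2.
rho, T = 0.8, 20_000
s = np.sqrt(1.0 - rho * rho)
x = np.zeros((T, 2))          # x[0] = [0, 0] is the starting vector

for t in range(1, T):
    # Block b = 1: condition on the latest draw of block 2 (iteration t-1).
    x[t, 0] = rho * x[t - 1, 1] + s * rng.standard_normal()
    # Block b = 2: condition on the draw of block 1 just made in iteration t.
    x[t, 1] = rho * x[t, 0] + s * rng.standard_normal()

print(np.corrcoef(x[:, 0], x[:, 1])[0, 1])  # approaches rho
```

The two update lines follow the scheme described above exactly: when block b is drawn, blocks before it use iteration t values and blocks after it use iteration t-1 values.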



Suggested Readings:

  1. Brooks, S., Gelman, A., Jones, G., & Meng, X. L. (Eds.). (2011). Handbook of Markov Chain Monte Carlo. CRC Press.
  2. Taboga, M. (2021). Markov Chain Monte Carlo Methods. Fundamentals of Statistics.
  3. Gilks, W. R., Richardson, S., & Spiegelhalter, D. (Eds.). (1995). Markov chain Monte Carlo in practice. CRC Press.

Suggested Citation: Meitei, W. B. (2025). Markov Chain Monte Carlo. WBM STATS.
