Sunday, 27 July 2025

Markov Chain Monte Carlo

by W. B. Meitei, PhD


Markov Chain Monte Carlo (MCMC) methods are a class of algorithms designed to generate samples from complex probability distributions, particularly when direct sampling is challenging or analytically infeasible. They are widely used in modern computational statistics, particularly in the context of Bayesian inference.

The core idea behind MCMC is to construct a Markov chain whose stationary distribution is the distribution of interest. By simulating the chain over many steps, one can obtain a sequence of dependent samples that approximate the target distribution. Unlike traditional Monte Carlo methods, which assume independent and identically distributed (i.i.d.) sampling, MCMC methods allow for dependence among samples, making them highly effective in high-dimensional or constrained parameter spaces.

Among the many MCMC algorithms, two of the most widely used are the Metropolis-Hastings and the Gibbs sampling algorithms. Both offer flexible frameworks for sampling from complex distributions, even when those distributions are not available in closed form, making them indispensable in real-world Bayesian modeling and other advanced statistical applications.

MCMC algorithms have become essential tools in modern statistical modeling, machine learning, and computational science. They are used in diverse applications ranging from parameter estimation and model selection to probabilistic inference in hierarchical models. This article introduces the foundational concepts of MCMC, outlines key algorithms, and illustrates their practical relevance in applied statistical analysis.

Fig 1: Standard Monte Carlo Vs Markov Chain Monte Carlo

Before we proceed with MCMC, it is important to understand the fundamentals of the Monte Carlo approach and the Markov chain.

Some basics of Monte Carlo & Markov chain

The Monte Carlo method involves simulating a large number of independent samples from the distribution of the random variable X, denoted as X1, X2, …, XT. These simulated values form an empirical distribution that assigns equal probability 1/T to each sampled point.

For instance, we can use the Monte Carlo method to estimate certain features of the probability distribution of a random variable, such as its expected value, variance, or tail probabilities, particularly when these quantities are difficult or impractical to compute analytically.

  • To approximate the expected value of X, we compute the sample mean of the simulated values.
  • To estimate the probability that X falls below a given threshold, we can use the proportion of simulated values that are less than that threshold.
  • The variance of X can be approximated using the sample variance of the simulated observations.

In general, any calculation we wish to carry out on the true distribution of a random variable X can be approximated by performing the same calculation on its empirical distribution.

This approach typically yields reliable results, as the approximation tends to improve with larger sample sizes. In fact, as the number of simulated observations T increases, the empirical estimates converge to the true values. This foundational concept, often called the plug-in principle, underlies much of the Monte Carlo method’s practical utility.
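As a quick illustration of the plug-in principle, here is a minimal Python sketch that computes the three approximations listed above; the standard normal target, the seed, and the sample size are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100_000

# Draw T independent samples from the target distribution
# (here a standard normal, chosen only for illustration).
x = rng.standard_normal(T)

mean_hat = x.mean()            # approximates E[X] = 0
var_hat = x.var(ddof=1)        # approximates Var(X) = 1
tail_hat = (x < -1.96).mean()  # approximates P(X < -1.96), about 0.025

print(mean_hat, var_hat, tail_hat)
```

Increasing T shrinks the error of each estimate, which is exactly the convergence described above.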

MCMC methods operate similarly to standard Monte Carlo methods, but with a key distinction: the simulated values X1, X2, …, XT​ are not independent draws. Instead, they are serially correlated.

In particular, they are the realizations of T random variables X1, X2, …, XT that form a Markov Chain. Each draw in the sequence depends on the previous one, which introduces correlation between samples. This structure allows MCMC methods to sample from complex probability distributions where independent sampling is difficult or impossible.

We do not need to be experts in Markov chains to understand MCMC. A basic understanding of the concepts outlined in the following sections is sufficient to understand and apply the methods effectively, though further study of Markov chain fundamentals is recommended.

a. Markov property

A random sequence {Xt} is said to form a Markov chain if, at any point in time, the future values Xt+n depend only on the present value Xt​, and not on the sequence of past values Xt−k​, for any positive integers k and n.

\[P\left( X_{t + n} = x\ |\ X_{t},X_{t - 1},\ldots,X_{t - k} \right) = P\left( X_{t + n} = x\ |\ X_{t} \right)\]

This defining characteristic, known as the Markov property, reflects the process's memoryless nature. In other words, the conditional probability distribution of future outcomes is determined solely by the current state and is independent of the historical trajectory that led to that state.
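To make the Markov property concrete, here is a small Python sketch in which each new state is generated using only the current state; the two-state chain and its transition matrix are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Transition matrix of a two-state chain: P[i, j] = P(X_{t+1} = j | X_t = i).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

def simulate_chain(P, x1, T, rng):
    """Simulate a Markov chain: each step depends only on the current state."""
    x = np.empty(T, dtype=int)
    x[0] = x1
    for t in range(1, T):
        x[t] = rng.choice(2, p=P[x[t - 1]])  # uses x[t-1] only, not the past
    return x

chain = simulate_chain(P, x1=0, T=50_000, rng=rng)
# The long-run fraction of time spent in state 0 approaches the stationary
# probability, which for this matrix is 5/6.
print(np.mean(chain == 0))
```

Note that the update inside the loop never looks further back than one step; that is the memoryless property in code.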

b. Conditional and unconditional distributions

The conditional probability (or density) that Xt+n = xt+n, given that Xt = xt, is denoted as

\[f\left( x_{t + n}\ |\ x_{t} \right)\]

The unconditional probability (or density) that Xt = xt is denoted as

\[f(x_{t})\]

c. Asymptotic independence

While this is not a general property of all Markov chains, those produced by MCMC methods exhibit a key property: although the variables Xt and Xt+n​ are dependent, their dependence weakens as n increases, gradually approaching independence.

This ensures that f(xt+n | xt) converges to f(xt+n) as n → ∞.

This has an important implication: although the initial value of the chain x1 is typically chosen arbitrarily, as t increases, the distribution of the chain's values becomes progressively less influenced by this starting point. In other words, f(xt | x1) approaches the same distribution f(xt), regardless of how x1 was selected.

d. Target distribution

In Markov chains generated by MCMC methods, not only does f(xt | x1) converge to f(xt), but as t increases, the distributions f(xt) become almost identical to each other. They eventually converge to the chain's stationary distribution.

Importantly, this stationary distribution is the same as the target distribution, the distribution from which we ultimately want to draw samples.

In simpler terms, the larger the value of t, the closer both f(xt | x1) and f(xt) are to the target distribution.

A black-box approach

We can imagine an MCMC algorithm as a black box that takes two inputs:

  • An initial value x1,
  • A target distribution

Then the output is a sequence of values x1, x2, …, xT.

Even though the initial values of the sequence may come from distributions that differ significantly from the target distribution, these distributions gradually become more similar to the target as the sequence progresses. When t is sufficiently large, xt​ is effectively drawn from a distribution that closely approximates the target distribution.

Burn-in sample

Generally, the initial values of the MCMC sample differ substantially from the target distribution. Due to this initial mismatch with the target distribution, it is common practice to discard the first portion of an MCMC sample. For instance, in a sequence of 10,000 generated values in Fig 2, the first 1,500 may be discarded, retaining the remaining 8,500 for analysis.

Fig 2: Trace plot of an MCMC sample with long burn-in sample

This initial segment, known as the burn-in sample, is removed to exclude early draws that are less representative of the target distribution. By doing so, we reduce the potential bias in Monte Carlo estimates derived from the MCMC sample, as the retained values are more likely to come from distributions that closely approximate the target distribution.

We can perform MCMC diagnostics to select an appropriate burn-in length.
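A minimal Python sketch of discarding a burn-in sample, using an AR(1) chain started far from its stationary distribution as a stand-in for an MCMC chain; the chain, the bad starting value, and the burn-in length (which mirrors the Fig 2 example) are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy persistent chain: an AR(1) process whose stationary distribution
# is N(0, 1 / (1 - 0.95**2)).  It plays the role of an MCMC chain here.
T, burn_in = 10_000, 1_500
x = np.empty(T)
x[0] = 50.0                   # deliberately bad starting value
for t in range(1, T):
    x[t] = 0.95 * x[t - 1] + rng.standard_normal()

kept = x[burn_in:]            # discard the burn-in sample
print(len(kept), kept.mean())
```

The retained draws no longer carry the influence of the starting value 50, so estimates computed from `kept` are far less biased than those computed from the full sequence.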

Correlation and effective sample size

Once the burn-in sample has been discarded, we are left with a sequence of draws whose distributions closely resemble the target distribution. However, a key issue remains: the observations are not independent.

What does this dependence imply? To answer this, we must revisit the foundation of standard Monte Carlo methods. In those methods, the accuracy of an estimate improves with the number of independent draws, denoted by T. The larger the sample, the more reliable the approximation.

In contrast, MCMC methods produce dependent samples. As a result, we introduce the notion of effective sample size, a measure of how many independent observations the dependent MCMC sample is equivalent to. For example, 1,000 correlated draws might contain as much information as only 100 independent ones. In such a case, the effective sample size is 100.

Generally, the greater the correlation between successive samples, the lower the effective sample size, and consequently, the less accurate the resulting MCMC approximation.

For this reason, much of the effort in designing and evaluating MCMC algorithms is focused on minimizing autocorrelation (or serial correlation) in the chain, thereby increasing the effective sample size and improving estimation accuracy.

We can perform MCMC diagnostics to estimate the effective sample size.
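The following Python sketch shows one crude way to estimate the effective sample size from the chain's autocorrelations; the truncation rule and the AR(1) test chain are illustrative choices, not a standard diagnostic implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

def effective_sample_size(x, max_lag=200):
    """Crude ESS estimate: T / (1 + 2 * sum of positive autocorrelations)."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    xc = x - x.mean()
    var = np.dot(xc, xc) / T
    rho_sum = 0.0
    for k in range(1, max_lag + 1):
        rho = np.dot(xc[:-k], xc[k:]) / (T * var)  # lag-k autocorrelation
        if rho <= 0.0:          # truncate once the autocorrelation dies out
            break
        rho_sum += rho
    return T / (1.0 + 2.0 * rho_sum)

# A strongly autocorrelated AR(1) chain: its theoretical ESS is only a
# fraction (1 - 0.9) / (1 + 0.9) of the nominal sample size T.
T = 50_000
x = np.empty(T)
x[0] = 0.0
for t in range(1, T):
    x[t] = 0.9 * x[t - 1] + rng.standard_normal()

print(effective_sample_size(x))  # far fewer than T "independent" draws
```

For this chain the estimate comes out in the low thousands, illustrating how 50,000 correlated draws can carry the information of only a few thousand independent ones.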

Commonly used MCMC algorithms

As mentioned above, the two most commonly used MCMC algorithms are:

  1. Metropolis-Hastings algorithm and
  2. Gibbs sampling

1. Metropolis-Hastings algorithm

Of all the MCMC algorithms, the Metropolis-Hastings algorithm is the most commonly used.

Let f(x) denote the density (or mass) function of the target distribution, and let q(x | x՛) be a conditional distribution from which we can easily generate computer samples (e.g., a multivariate normal distribution with mean x՛), where x and x՛ have the same dimension.

The algorithm

The Metropolis-Hastings algorithm starts from any value x1 belonging to the support of the target distribution and generates the subsequent values xt as follows:

  • Generate a proposal yt from q(yt | xt-1)
  • Compute the acceptance probability

\[p_{t} = min\left( \frac{f\left( y_{t} \right)}{f\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right)\]

  • Draw a uniform random variable ut (on [0, 1])
  • If ut ≤ pt, accept the proposal and set xt = yt; otherwise, reject the proposal and set xt = xt-1
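The steps above can be sketched in Python. This toy version uses a symmetric random-walk proposal, so the factor q(xt-1 | yt) / q(yt | xt-1) cancels in the acceptance probability, and it targets a standard normal specified only up to a constant; both choices, and the step size, are for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

def h(x):
    """Unnormalized target density: a standard normal up to a constant."""
    return np.exp(-0.5 * x * x)

def metropolis_hastings(h, x1, T, step=1.0, rng=rng):
    """Random-walk Metropolis-Hastings with a symmetric normal proposal."""
    x = np.empty(T)
    x[0] = x1
    for t in range(1, T):
        y = x[t - 1] + step * rng.standard_normal()  # proposal y_t
        p = min(h(y) / h(x[t - 1]), 1.0)             # acceptance probability
        if rng.uniform() <= p:                       # u_t <= p_t: accept
            x[t] = y
        else:                                        # otherwise keep x_{t-1}
            x[t] = x[t - 1]
    return x

draws = metropolis_hastings(h, x1=0.0, T=20_000)
print(draws.mean(), draws.var())  # close to 0 and 1 for this target
```

Notice that the sampler only ever evaluates ratios of h, which is why the normalizing constant of the target never needs to be known.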

Advantages of the Metropolis-Hastings algorithm

Suppose,

\[f(x) = ch(x)\]

We assume that we can evaluate h(x) but that the normalizing constant c is unknown.

The acceptance probability according to the Metropolis-Hastings algorithm is

\[p_{t} = min\left( \frac{f\left( y_{t} \right)}{f\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right) = min\left( \frac{ch\left( y_{t} \right)}{ch\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right)\]

\[= min\left( \frac{h\left( y_{t} \right)}{h\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right)\]

Thus, the acceptance probability, which is the only quantity that depends on f, can be computed without knowing the constant c.

This is the beauty of the Metropolis-Hastings algorithm: we can generate draws from a distribution even if we do not fully know its probability density (or mass) function.

2. Gibbs sampling

Another commonly used MCMC algorithm is Gibbs sampling.

The algorithm

The Gibbs sampling algorithm starts from any vector,

\[x_{1} = \left\lbrack x_{1,1},\ldots,x_{1,B} \right\rbrack\]

belonging to the support of the target distribution and generates the subsequent values xt as follows:

For b = 1, …, B, generate xt,b from the conditional distribution with density

\[f\left( x_{t,b}\ |\ x_{t,1},\ldots,x_{t,b - 1},x_{t - 1,b + 1},\ldots,x_{t - 1,B} \right)\]

In other words, at each iteration, the blocks are extracted one by one from their full conditional distributions, conditioning on the latest draws of all the other blocks.

Note that at iteration t, when you extract the bth block, the latest draws of blocks 1 to b-1 are those already extracted in iteration t, while the latest draws of blocks b+1 to B are those extracted in the previous iteration t-1.


What is a full conditional distribution?

Suppose our aim is to generate samples from a random vector xo whose joint density is given by

\[f\left( x_{o} \right) = f(x_{o,1},\ldots,x_{o,B})\]

where xo,1, …, xo,B is a partition of xo into B blocks (each possibly consisting of one or more components).

For any block b, let xo,-b denote the vector containing all components of xo except those in the bth block.

Now, suppose we can sample from the B conditional distributions f(xo,1 | xo,-1), …, f(xo,B | xo,-B). These are known in MCMC terminology as the full conditional distributions.
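As a concrete sketch, here is a Gibbs sampler in Python for a bivariate normal target with correlation rho, a case where both full conditional distributions are available in closed form; the target, the correlation value, and the chain length are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)

# For a standard bivariate normal with correlation rho, the full
# conditionals are known: X1 | X2 = x2 ~ N(rho * x2, 1 - rho**2),
# and symmetrically for X2.
rho, T = 0.8, 20_000
s = np.sqrt(1.0 - rho * rho)
x = np.zeros((T, 2))          # x[0] = [0, 0] is the starting vector

for t in range(1, T):
    # Block b = 1: condition on the latest draw of block 2 (iteration t-1).
    x[t, 0] = rho * x[t - 1, 1] + s * rng.standard_normal()
    # Block b = 2: condition on the draw of block 1 just made in iteration t.
    x[t, 1] = rho * x[t, 0] + s * rng.standard_normal()

print(np.corrcoef(x[:, 0], x[:, 1])[0, 1])  # approaches rho
```

The two update lines follow the scheme described above exactly: when block b is drawn, blocks before it use iteration t values and blocks after it use iteration t-1 values.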



Suggested Readings:

  1. Brooks, S., Gelman, A., Jones, G., & Meng, X. L. (Eds.). (2011). Handbook of Markov Chain Monte Carlo. CRC Press.
  2. Taboga, M. (2021). Markov Chain Monte Carlo Methods. Fundamentals of Statistics.
  3. Gilks, W. R., Richardson, S., & Spiegelhalter, D. (Eds.). (1995). Markov chain Monte Carlo in practice. CRC Press.

Suggested Citation: Meitei, W. B. (2025). Markov Chain Monte Carlo. WBM STATS.
