by W. B. Meitei, PhD
Markov Chain Monte Carlo (MCMC) methods are a class of algorithms used to generate samples from complex probability distributions, particularly when direct sampling is challenging or analytically infeasible. They are powerful tools widely used in modern computational statistics, particularly within the framework of Bayesian inference.
The core idea behind MCMC is to construct a Markov chain whose equilibrium (or stationary) distribution is the distribution of interest. By simulating the chain over a large number of steps, one can obtain a sequence of dependent samples that approximate this target distribution. Unlike traditional Monte Carlo methods, which assume independent and identically distributed (i.i.d.) sampling, MCMC methods allow for dependence among samples, making them highly effective in high-dimensional or constrained parameter spaces.
Two of the most widely used MCMC algorithms are the Metropolis-Hastings algorithm and Gibbs sampling. These techniques provide flexible frameworks for sampling from various distributions, even when those distributions are not available in closed form.
MCMC methods have become essential tools in modern statistical modelling, machine learning, and computational science. They are used in diverse applications ranging from parameter estimation and model selection to probabilistic inference in hierarchical models. This article aims to introduce the foundational concepts of MCMC, outline key algorithms, and illustrate their practical relevance in applied statistical analysis.
Fig. 1: Standard Monte Carlo vs Markov Chain Monte Carlo
Before we proceed with MCMC, it is important to understand the fundamentals of the Monte Carlo approach and the Markov chain.
Some basics of Monte Carlo & Markov chain
The Monte Carlo method involves simulating a large number of independent samples from the distribution of the random variable X, denoted as X1, X2, …, XT. These simulated values form an empirical distribution, which places equal probability 1/T on each sampled point.
For instance, we can use the Monte Carlo method to estimate certain features of the probability distribution of a random variable, such as its expected value, variance, or tail probabilities, particularly when these quantities are difficult or practically infeasible to compute analytically.
- To approximate the expected value of X, we compute the sample mean of the simulated values.
- If we want to estimate the probability that X falls below a certain threshold, we can use the proportion of simulated values that are less than that threshold.
- The variance of X can be approximated using the sample variance of the simulated observations.
In general, any calculation we wish to carry out on the true distribution of a random variable X can be approximated by performing the same calculation on its empirical distribution.
This approach typically yields reliable results, as the approximation tends to improve with larger sample sizes. In fact, as the number of simulated observations T increases, the empirical estimates converge to the true values. This foundational concept is often referred to as the plug-in principle, which underlies much of the Monte Carlo method’s practical utility.
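As a concrete illustration, here is a minimal R sketch of the plug-in principle, using an exponential distribution as the example (the choice of distribution and the sample size T are arbitrary):

```r
# Plug-in principle: approximate features of X ~ Exp(rate = 1) from
# T independent simulated draws.
set.seed(1)
T_draws <- 100000
x <- rexp(T_draws, rate = 1)

mean(x)        # approximates E[X] = 1
var(x)         # approximates Var(X) = 1
mean(x < 0.5)  # approximates P(X < 0.5) = 1 - exp(-0.5), about 0.393
```

Each estimate is simply the corresponding calculation performed on the empirical distribution, and its accuracy improves as T grows.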
MCMC methods operate similarly to standard Monte Carlo methods, but with a key distinction: the simulated values X1, X2, …, XT are not independent draws. Instead, they are serially correlated.
In particular, they are the realisations of T random variables X1, X2, …, XT that form a Markov Chain. Each draw in the sequence depends on the previous one, which introduces correlation between samples. This structure allows MCMC methods to sample from complex probability distributions where independent sampling is difficult or impossible.
We do not need to be experts on Markov chains to understand MCMC. A basic understanding of the concepts outlined in the following sections is sufficient to understand and apply the methods effectively, though further study of the basics of Markov chains is recommended.
a. Markov property
A random sequence {Xt} is said to form a Markov chain if, at any point in time, the future values Xt+n depend only on the present value Xt, and not on the sequence of past values Xt−k, for any positive integers k and n.
This defining characteristic, known as the Markov property, reflects the memoryless nature of the process. In other words, the conditional probability distribution of future outcomes is determined solely by the current state and is independent of the historical trajectory that led to that state.
b. Conditional and unconditional distributions
The conditional probability (or density) that Xt+n = xt+n, given that Xt = xt, is denoted as f(xt+n | xt).
The unconditional probability (or density) that Xt = xt is denoted as f(xt).
c. Asymptotic independence
While this is not a general property of all Markov chains, those produced by MCMC methods exhibit a key property: although the variables Xt and Xt+n are dependent, their dependence weakens as n increases, gradually approaching independence.
This ensures that f(xt+n | xt) converges to f(xt+n) as n → ∞.
This has an important implication: although the initial value of the chain x1 is typically chosen arbitrarily, as t increases, the distribution of the chain's values becomes progressively less influenced by this starting point. In other words, f(xt | x1) approaches the same distribution f(xt), regardless of how x1 was selected.
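To see this forgetting of the initial value in action, consider a simple Markov chain, a stationary AR(1) process (a sketch for illustration only; it is a Markov chain, though not one produced by an MCMC algorithm):

```r
# A Markov chain that forgets its starting point: x_t = phi * x_{t-1} + e_t,
# with |phi| < 1 and e_t ~ N(0, 1).
simulate_ar1 <- function(n_iter, x1, phi = 0.9) {
  x <- numeric(n_iter)
  x[1] <- x1
  for (t in 2:n_iter) x[t] <- phi * x[t - 1] + rnorm(1)
  x
}

set.seed(1)
chain_a <- simulate_ar1(5000, x1 = -50)  # deliberately extreme starting values
chain_b <- simulate_ar1(5000, x1 = 50)

# After enough steps, both chains fluctuate around the same stationary
# distribution, N(0, 1 / (1 - phi^2)), regardless of x1.
c(mean(tail(chain_a, 1000)), mean(tail(chain_b, 1000)))
```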
d. Target distribution
In Markov chains generated by MCMC methods, not only does f(xt | x1) converge to f(xt), but as t increases, the distributions f(xt) become almost identical to each other. They eventually converge to what is known as the stationary distribution of the chain.
Importantly, this stationary distribution is the same as the target distribution, the distribution from which we ultimately want to draw samples.
In simpler terms, the larger the value of t, the more similar f(xt | x1) and f(xt) are to the target distribution.
A black-box approach
We can imagine an MCMC algorithm as a black box that takes two inputs:
- An initial value x1,
- A target distribution f(x)
Then the output is a sequence of values x1, x2, …, xT.
Even though the initial values of the sequence may come from distributions that differ significantly from the target distribution, these distributions gradually become more similar to the target as the sequence progresses. When t is sufficiently large, xt is effectively drawn from a distribution that closely approximates the target distribution.
Burn-in sample
Generally, the distributions of the initial values of an MCMC sample differ markedly from the target distribution. Due to this initial mismatch, it is common practice to discard the first portion of an MCMC sample. For instance, in the sequence of 10,000 generated values in Fig. 2, the first 1,500 may be discarded, retaining the remaining 8,500 for analysis.
This initial segment is known as the burn-in sample and is removed to exclude early draws that are less representative of the target distribution. By doing so, we reduce the potential bias in Monte Carlo estimates derived from the MCMC sample, as the retained values are more likely to come from distributions that closely approximate the target distribution.
We can perform MCMC diagnostics to select an appropriate burn-in length.
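As a sketch of this practice in R, assuming `chain` is a vector holding the 10,000 draws of Fig. 2 and that the coda diagnostics package is available:

```r
library(coda)  # standard MCMC diagnostics package

burn_in <- 1500
kept <- chain[-(1:burn_in)]  # discard the burn-in, retain the remaining 8,500

traceplot(mcmc(kept))  # visual check that the retained chain has settled
```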
Fig. 2: Trace plot of an MCMC sample (R code to generate the figure)
Correlation and effective sample size
Once the burn-in sample has been discarded, we are left with a sequence of draws whose distributions closely resemble the target distribution. However, a key issue remains: the observations are not independent.
What does this dependence imply? To answer this, we must revisit the foundation of standard Monte Carlo methods. In those methods, the accuracy of an estimate improves with the number of independent draws, denoted by T. The larger the sample, the more reliable the approximation.
In contrast, MCMC methods produce dependent samples. As a result, we introduce the notion of effective sample size, a measure of how many independent observations the dependent MCMC sample is equivalent to. For example, 1,000 correlated draws might contain as much information as only 100 independent ones. In such a case, the effective sample size is said to be 100.
Generally, the greater the correlation between successive samples, the lower the effective sample size, and consequently, the less accurate the resulting MCMC approximation.
For this reason, much of the effort in designing and evaluating MCMC algorithms is focused on minimising autocorrelation (or serial correlation) in the chain, thereby increasing the effective sample size and improving estimation accuracy.
We can perform MCMC diagnostics to estimate the effective sample size.
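Continuing the sketch above (with `kept` denoting the post-burn-in draws), the coda package reports an estimated effective sample size, and the autocorrelation function shows how quickly the dependence decays:

```r
library(coda)

effectiveSize(mcmc(kept))  # e.g., 1,000 correlated draws may be worth ~100
acf(kept)                  # slower decay of autocorrelation means lower ESS
```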
Commonly used MCMC algorithms
As mentioned above, there are two commonly used MCMC algorithms. They are:
- Metropolis-Hastings algorithm and
- Gibbs sampling
1. Metropolis-Hastings algorithm
Of all the MCMC algorithms, the Metropolis-Hastings algorithm is the most commonly used.
Let f(x) denote the density (or mass) function of the target distribution (in Bayesian applications, this is typically the posterior density), and let q(x | x′) denote a proposal distribution from which we can generate computer samples (e.g., a multivariate normal distribution with mean x′), such that x, x′, and the argument of f have the same dimension.
The algorithm
The Metropolis-Hastings algorithm starts from any value x1 belonging to the support of the target distribution and generates the subsequent values xt as follows:
- Generate a proposal yt from q(yt | xt-1)
- Compute the acceptance probability pt = min{1, [f(yt) q(xt-1 | yt)] / [f(xt-1) q(yt | xt-1)]}
- Draw a uniform random variable ut (on [0, 1])
- If ut ≤ pt, accept the proposal and set xt = yt; otherwise, reject the proposal and set xt = xt-1
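The following is a minimal R sketch of these four steps. It assumes a Gaussian random-walk proposal, which is symmetric, so the q terms cancel in the acceptance probability; the target enters only through its log density log_f, which (as discussed next) need only be known up to an additive constant:

```r
# Random-walk Metropolis-Hastings sampler (a sketch, not production code).
mh_sampler <- function(log_f, x1, n_iter, sd_prop = 1) {
  x <- numeric(n_iter)
  x[1] <- x1
  for (t in 2:n_iter) {
    y <- rnorm(1, mean = x[t - 1], sd = sd_prop)  # step 1: proposal y_t
    p <- min(1, exp(log_f(y) - log_f(x[t - 1])))  # step 2: acceptance prob.
    u <- runif(1)                                 # step 3: uniform draw u_t
    x[t] <- if (u <= p) y else x[t - 1]           # step 4: accept or reject
  }
  x
}

# Example: sample from a standard normal target (log density up to a constant).
chain <- mh_sampler(log_f = function(x) -0.5 * x^2, x1 = 0, n_iter = 10000)
```

The proposal standard deviation sd_prop tunes the trade-off between the acceptance rate and the step size, and is the main lever for reducing autocorrelation in the chain.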
Advantages of the Metropolis-Hastings algorithm
Suppose the target density can be written as f(x) = c h(x), where c is a normalising constant.
We assume that we know h(x) but not c.
The acceptance probability according to the Metropolis-Hastings algorithm is then pt = min{1, [h(yt) q(xt-1 | yt)] / [h(xt-1) q(yt | xt-1)]}, because the constant c cancels in the ratio f(yt) / f(xt-1).
Thus, the acceptance probability, which is the only quantity that depends on f, can be computed without knowing the constant c.
This is the beauty of the Metropolis-Hastings algorithm: we can generate draws from a distribution even if we do not fully know its density.
2. Gibbs sampling
Another commonly used MCMC algorithm is the Gibbs sampling algorithm.
The algorithm
The Gibbs sampling algorithm starts from any vector x1 = (x1,1, …, x1,B) belonging to the support of the target distribution and generates the subsequent values xt as follows:
For b = 1, …, B, generate xt,b from the full conditional distribution with density f(xt,b | xt,1, …, xt,b-1, xt-1,b+1, …, xt-1,B).
In other words, at each iteration, the blocks are extracted one by one from their full conditional distributions, conditioning on the latest draws of all the other blocks.
Note that at iteration t, when you extract the bth block, the latest draws of blocks 1 to b-1 are those already extracted in iteration t, while the latest draws of blocks b+1 to B are those extracted in the previous iteration t-1.
What is a full conditional distribution?
Suppose our aim is to generate samples from a random vector xo whose joint density is given by f(xo) = f(xo,1, …, xo,B), where (xo,1, …, xo,B) represents a partition of xo into B blocks (each possibly consisting of one or more components).
For each block b, let xo,-b denote the vector containing all components of xo except those in the bth block.
Now, suppose we have the ability to sample from the B conditional distributions f(xo,1 | xo,-1), …, f(xo,B | xo,-B). These are known in MCMC terminology as the full conditional distributions.
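As an illustration, here is a minimal R sketch of a Gibbs sampler for a case where the full conditionals are known exactly: a bivariate standard normal with correlation rho, for which X1 | X2 = x2 ~ N(rho x2, 1 - rho^2), and symmetrically for X2 (the target and rho = 0.8 are assumptions chosen for the example):

```r
# Gibbs sampler for a bivariate standard normal with correlation rho.
gibbs_sampler <- function(n_iter, rho = 0.8, start = c(0, 0)) {
  x <- matrix(NA_real_, nrow = n_iter, ncol = 2)
  x[1, ] <- start
  s <- sqrt(1 - rho^2)  # conditional standard deviation
  for (t in 2:n_iter) {
    # Block b = 1: condition on the latest draw of block 2 (iteration t - 1)
    x[t, 1] <- rnorm(1, mean = rho * x[t - 1, 2], sd = s)
    # Block b = 2: condition on the draw of block 1 made at iteration t
    x[t, 2] <- rnorm(1, mean = rho * x[t, 1], sd = s)
  }
  x
}

draws <- gibbs_sampler(n_iter = 10000)
cor(draws[-(1:1500), ])  # after burn-in, the sample correlation approaches rho
```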
Suggested Readings:
- Brooks, S., Gelman, A., Jones, G., & Meng, X. L. (Eds.). (2011). Handbook of Markov Chain Monte Carlo. CRC Press.
- Taboga, M. (2021). Markov Chain Monte Carlo Methods. Fundamentals of Statistics.
- Gilks, W. R., Richardson, S., & Spiegelhalter, D. (Eds.). (1995). Markov Chain Monte Carlo in Practice. CRC Press.
Suggested Citation: Meitei, W. B. (2025). Markov Chain Monte Carlo. WBM STATS.