by W. B. Meitei, PhD
The Metropolis-Hastings (MH) algorithm is one of the most popular Markov Chain Monte Carlo (MCMC) algorithms. Like other MCMC methods, the MH algorithm generates a sequence of serially correlated draws whose distribution converges to a given target distribution, making it a powerful tool for sampling from complex or high‑dimensional distributions when direct simulation is intractable. This property is especially valuable in Bayesian inference, where the target distribution is often a posterior known only up to a normalizing constant. In this article, we will explore the intuition behind the MH algorithm and its basic steps, and illustrate its implementation in practice.
Before we proceed to the details of the MH algorithm, it is important to understand the fundamentals of Markov chains and MCMC.
In particular, it is important to remember that an MCMC algorithm produces a random sequence \(\left\{ X_{t} \right\}\) with the following features:
- \(\left\{ X_{t} \right\}\) is a Markov chain, i.e., given the current value \(X_{t}\), all future values \(X_{t + 1},\ X_{t + 2},\ldots,\ X_{t + n}\) are conditionally independent of the past values \(X_{1},\ldots,\ X_{t - 1}\), for any positive integers \(t\) and \(n\);
- any two values \(X_{t}\) and \(X_{t + s}\) are generally not independent, but they become increasingly close to independent as the gap \(s\) grows, for any positive integers \(t\) and \(s\);
- the sequence gradually converges to the target distribution, so that the distribution of \(X_{t}\) becomes more and more similar to the target as \(t\) increases.
When we run the MCMC algorithm for \(T\) iterations, we obtain an MCMC sample consisting of the realizations of the first \(T\) terms of the chain:
\[X_{1},\ X_{2},\ldots,\ X_{T}\]
We then use the empirical distribution of this sample to approximate the target distribution.
The Metropolis-Hastings algorithm
Let \(f(x)\) be the probability density (or probability mass) function of the distribution from which we want to generate a sample. We refer to \(f(x)\) as the target distribution. Let \(q(x\ |\ x')\) be a family of conditional distributions that we choose ourselves, and from which it is easy to simulate draws. The vectors \(x,x'\), and the argument of \(f\) all have the same dimension.
The MH algorithm begins with an initial value \(x_{1}\) belonging to the support of the target distribution. This starting point can be specified by the user or sampled from some suitable distribution. Then, the algorithm proceeds recursively to generate the subsequent values of \(x_{t}\).
At each time step \(t\), the next value \(x_{t}\) is generated as follows:
- Generate a proposal \(y_{t}\) from \(q(y_{t}|x_{t - 1})\)
- Compute the acceptance probability
\[p_{t} = min\left( \frac{f\left( y_{t} \right)}{f\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right)\]
- Draw a uniform random variable \(u_{t}\) (on [0, 1])
- If \(u_{t} \leq p_{t}\), accept the proposal and set \(x_{t} = y_{t}\); otherwise, reject the proposal and set \(x_{t} = x_{t - 1}\)
Here, \(y_{t}\), \(q\left( y_{t}|x_{t - 1} \right)\), and \(p_{t}\) are called proposal, proposal distribution, and acceptance probability, respectively.
As \(u_{t}\) is uniformly distributed on [0, 1], the probability of accepting the proposal \(y_{t}\) as the new draw \(x_{t}\) is exactly equal to \(p_{t}\). This simple decision rule ensures that the resulting chain gradually explores the support of the target distribution and, in the long run, produces samples that approximate \(f(x)\).
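The steps above can be sketched as a small, self-contained Python function. This is a minimal illustration, not a production sampler; the function and helper names are illustrative, and the example proposal is a simple Normal random walk:

```python
import numpy as np

def metropolis_hastings(f, q_sample, q_density, x1, T, rng=None):
    """Generic MH sampler: f is the target density (up to a constant),
    q_sample(x, rng) draws a proposal y given x, and q_density(a, b)
    evaluates q(a | b), also possibly up to a constant."""
    if rng is None:
        rng = np.random.default_rng(0)
    x, chain = x1, [x1]
    for _ in range(T - 1):
        y = q_sample(x, rng)
        # acceptance probability p_t = min(f(y)/f(x) * q(x|y)/q(y|x), 1)
        p = min(1.0, f(y) / f(x) * q_density(x, y) / q_density(y, x))
        if rng.uniform() <= p:
            x = y          # accept; otherwise keep the current value
        chain.append(x)
    return np.array(chain)

# Example: standard normal target with a Normal(x, 1) random-walk proposal
f = lambda x: np.exp(-0.5 * x * x)                  # target, up to a constant
q_sample = lambda x, rng: x + rng.normal()
q_density = lambda a, b: np.exp(-0.5 * (a - b)**2)  # q(a | b), up to a constant
chain = metropolis_hastings(f, q_sample, q_density, x1=0.0, T=20000)
```

With this proposal, the chain's empirical mean and standard deviation should settle near 0 and 1, the moments of the standard normal target.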
A special case of the MH algorithm is the Metropolis algorithm (developed by Nicholas C. Metropolis and co-authors), which assumes a symmetric proposal distribution, i.e., \(q\left( y_{t} \middle| x_{t - 1} \right) = q(x_{t - 1}|y_{t})\), whereas the MH algorithm can handle both symmetric and asymmetric proposal distributions. The acceptance probability of the Metropolis algorithm is,
\[p_{t} = min\left( \frac{f\left( y_{t} \right)}{f\left( x_{t - 1} \right)},1 \right)\]
Fig 1: Comparison of the Metropolis and the Metropolis-Hastings algorithms
Advantages of the Metropolis-Hastings algorithm over other algorithms
One of the main strengths of the MH algorithm is that it only requires knowledge of the target distribution \(f(x)\) up to a multiplicative constant. In practice, this means we do not need to compute the often‑intractable normalizing constant of the distribution. This feature is especially valuable in Bayesian inference, where the posterior is typically known only up to a constant, because the likelihood and prior are fully specified, but the marginal distribution (the normalizing integral) is difficult or impossible to evaluate in closed form.
Suppose,
\[f(x) = ch(x)\]
We assume that we know \(h(x)\) but not \(c\).
The acceptance probability according to the Metropolis-Hastings algorithm is
\[p_{t} = min\left( \frac{f\left( y_{t} \right)}{f\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right) = min\left( \frac{ch\left( y_{t} \right)}{ch\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right)\]
\[= min\left( \frac{h\left( y_{t} \right)}{h\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right)\]
Thus, the acceptance probability, which is the only quantity that depends on \(f\), can be computed without knowing the constant \(c\). This is the beauty of the MH algorithm: we can generate draws from a distribution even if we do not fully know its probability density (or mass) function.
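As a quick illustration of this point, the sketch below samples from a Gamma(3, 1) distribution using only the unnormalized kernel \(h(x) = x^{2}e^{-x}\), never touching the constant \(c\). A symmetric random-walk proposal is assumed for simplicity, so the proposal ratio also drops out:

```python
import numpy as np

# Unnormalized target: h(x) = x^2 * exp(-x) for x > 0, which is a
# Gamma(3, 1) density up to the constant c = 1/2 (never used below).
def h(x):
    return x**2 * np.exp(-x) if x > 0 else 0.0

rng = np.random.default_rng(1)
x, chain = 3.0, []
for _ in range(50000):
    y = x + rng.normal(0.0, 2.0)                 # symmetric random-walk proposal
    if rng.uniform() <= min(1.0, h(y) / h(x)):   # c cancels in the ratio
        x = y
    chain.append(x)

print(np.mean(chain))   # the Gamma(3, 1) mean is 3
```

Proposals with \(y \leq 0\) get \(h(y) = 0\) and are automatically rejected, which keeps the chain inside the support of the target.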
An independent Metropolis-Hastings algorithm
It is a special case of the MH algorithm (named by Tierney, 1994) in which the proposal distribution does not depend on the current state, i.e., \(q(y_{t}|x_{t - 1})\) does not depend on \(x_{t - 1}\), or, \(q\left( y_{t} \middle| x_{t - 1} \right) = q(y_{t})\). It is also called the independence chain MH algorithm.
For example, a common choice is to draw proposals \(y_{t}\) from a multivariate normal distribution (each draw is independent of the previous ones).
When \(y_{t}\) is drawn independently, the acceptance probability is,
\[p_{t} = min\left( \frac{f\left( y_{t} \right)}{f\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1} \right)}{q\left( y_{t} \right)},1 \right)\]
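A minimal sketch of an independence sampler, assuming a Uniform(0, 1) proposal and a Beta(2, 5) target known only up to a constant (all names are illustrative). Because \(q\) is constant on (0, 1), the ratio \(q(x_{t-1})/q(y_{t})\) equals 1:

```python
import numpy as np

# Beta(2, 5) density up to a constant; the normalizing constant is unused.
def h(x):
    return x * (1 - x)**4 if 0 < x < 1 else 0.0

rng = np.random.default_rng(2)
x, chain = 0.5, []
for _ in range(50000):
    y = rng.uniform()                   # proposal ignores the current state
    if rng.uniform() <= min(1.0, h(y) / h(x)):   # q(x)/q(y) = 1 here
        x = y
    chain.append(x)

print(np.mean(chain))   # the Beta(2, 5) mean is 2/7, about 0.286
```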
A random-walk Metropolis-Hastings algorithm
It is a special case of the MH algorithm (named by Tierney, 1994) in which the proposal has the form \(y_{t} = x_{t - 1} + \varepsilon_{t}\), where \(\varepsilon_{t}\) is stochastically independent of the current state \(x_{t - 1}\), so,
\[q\left( y_{t} \middle| x_{t - 1} \right) = q\left( y_{t} - x_{t - 1} \right) = q(\varepsilon_{t})\]
And
\[q\left( x_{t - 1} \middle| y_{t} \right) = q\left( x_{t - 1} - y_{t} \right) = q( - \varepsilon_{t})\]
Thus, the acceptance probability is,
\[p_{t} = min\left( \frac{f\left( y_{t} \right)}{f\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1}|y_{t} \right)}{q\left( y_{t}|x_{t - 1} \right)},1 \right) = min\left( \frac{f\left( x_{t - 1} + \varepsilon_{t} \right)}{f\left( x_{t - 1} \right)}\frac{q\left( - \varepsilon_{t} \right)}{q\left( \varepsilon_{t} \right)},1 \right)\]
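The correction ratio \(q(-\varepsilon_{t})/q(\varepsilon_{t})\) matters whenever the increment density is asymmetric. The sketch below uses increments with mean 0.3 (an illustrative choice), so the ratio does not cancel, yet the chain still targets the standard normal:

```python
import numpy as np

# Asymmetric increment density: q(eps) = Normal(0.3, 1), so the
# correction ratio q(-eps)/q(eps) must appear in the acceptance step.
def q_eps(e):
    return np.exp(-0.5 * (e - 0.3)**2) / np.sqrt(2 * np.pi)

def f(x):                       # standard normal target, up to a constant
    return np.exp(-0.5 * x * x)

rng = np.random.default_rng(3)
x, chain = 0.0, []
for _ in range(50000):
    eps = 0.3 + rng.normal()    # increment with non-zero mean
    y = x + eps
    p = min(1.0, f(y) / f(x) * q_eps(-eps) / q_eps(eps))
    if rng.uniform() <= p:
        x = y
    chain.append(x)

print(np.mean(chain), np.std(chain))
```

Despite the drift in the proposals, the correction keeps the chain centered on the target: the printed mean and standard deviation should be near 0 and 1.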
Types of proposal distribution
There are three types of proposal distributions. They are:
- Perfect proposal distribution
- Bad proposal distribution
- Good proposal distribution
A perfect proposal distribution is one that is exactly equal to the target distribution; the acceptance probability is 1 for every proposed move (\(p_{t} = 1\) when \(f(x) = q(x)\) for every \(x\)). In this ideal case, every proposal is accepted, and because the proposals are drawn independently from the target, the resulting chain yields a sample of independent draws. This is as good as it gets: the MH algorithm effectively reduces to direct sampling from the target distribution, with no serial correlation and no wasted iterations.
In the case of a bad proposal distribution, things go the other way: it differs significantly from the target. The acceptance probabilities drop, and proposals are rejected much more often. As a result, the chain tends to get “stuck” at certain points for long stretches, revisiting the same values repeatedly. This produces a highly serially correlated MCMC sample, in which successive draws are strongly dependent on one another. As discussed in MCMC diagnostics, such chains can be misleading: they may appear to have converged quickly, but in reality, they explore the target distribution very slowly, forcing us to run the algorithm for much longer, often at prohibitive computational cost.
A good proposal distribution is simple but powerful; it should be chosen to be as close as possible to the target distribution. When the proposal resembles the target, acceptance probabilities remain reasonably high and the chain moves freely across the parameter space, producing a sample with relatively low serial correlation. Striking this balance is key to efficient MH sampling and to obtaining reliable estimates in a reasonable amount of time.
Fig 2: Good proposal distribution vs bad proposal distribution
Choosing the right proposal distribution
Choosing the right proposal distribution is the most important step in the MH algorithm. It is the single most important practical decision that separates efficient sampling from painfully slow convergence. Unfortunately, there's no magic formula or universal recipe for picking the perfect proposal distribution in the MH algorithm.
That said, there's a practical, iterative strategy that works well in practice:
- Start simple: Begin with a basic proposal like a multivariate normal distribution.
- Run a pilot chain: Generate an initial MCMC sample (it might be highly correlated, and that's okay).
- Learn from it: Use this first sample to estimate key features of the target distribution, like its mean and covariance matrix.
- Adapt and improve: Tune your proposal distribution based on these insights. For example, tune the proposal distribution’s mean and scale its covariance to match the target's spread.
This adaptive approach turns a potentially poor or bad proposal into one that is much more efficient, often dramatically improving acceptance rates and reducing serial correlation in subsequent chains.
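A minimal sketch of this pilot-then-tune strategy for a one-dimensional target. The \(2.38\hat{\sigma}\) scaling used here is the classic rule of thumb for random-walk proposals; all names, the pilot length, and the burn-in are illustrative choices:

```python
import numpy as np

def rw_chain(f, x0, step_sd, T, rng):
    """Random-walk Metropolis chain with Normal(0, step_sd) increments."""
    x, out = x0, []
    for _ in range(T):
        y = x + rng.normal(0.0, step_sd)
        if rng.uniform() <= min(1.0, f(y) / f(x)):
            x = y
        out.append(x)
    return np.array(out)

f = lambda x: np.exp(-0.5 * x * x / 4.0)   # Normal(0, sd=2), up to a constant
rng = np.random.default_rng(4)

pilot = rw_chain(f, 0.0, 0.5, 5000, rng)   # step 1: crude pilot run
sigma_hat = pilot[1000:].std()             # step 2: learn the target's spread
tuned_sd = 2.38 * sigma_hat                # step 3: classic scaling heuristic
final = rw_chain(f, 0.0, tuned_sd, 20000, rng)
print(final.std())                         # should be close to the true sd of 2
```

The pilot chain is highly correlated, but its spread is already informative enough to rescale the proposal into a much better-mixing second run.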
Note: A poor proposal leads to:
- Low acceptance rates → mostly rejections, wasting computation
- High serial correlation → chain moves like a drunkard, barely exploring
- Slow mixing → millions of iterations needed for reliable inference
Properties of the Metropolis-Hastings algorithm
To understand the properties of the MH algorithm, let’s examine its simplest and most intuitive form, the random-walk MH algorithm. Now, suppose the random walk increment \(\varepsilon_{t}\) is symmetric, i.e., \(q\left( \varepsilon_{t} \right) = q( - \varepsilon_{t})\). In this case, the acceptance probability is
\[p_{t} = min\left( \frac{f\left( x_{t - 1} + \varepsilon_{t} \right)}{f\left( x_{t - 1} \right)},1 \right)\]
Suppose also that the target density has the following two characteristics: points that are close to each other have similar densities, and there is a high-probability region (e.g., around the mode of the distribution) such that points far away from this region have a much lower density than points that belong to it.
Now, let us consider two extreme cases:
Case 1: When \(\varepsilon_{t}\) is very small (its variance is very small), then \(\frac{f\left( x_{t - 1} + \varepsilon_{t} \right)}{f\left( x_{t - 1} \right)}\) and \(p_{t}\) are always close to 1, because, by assumption, points that are close to each other have similar densities. The proposals are almost always accepted, but they are very close to each other, so the resulting MCMC sample is highly serially correlated.
Case 2: When \(\varepsilon_{t}\) is very large (its variance is very large), then \(\frac{f\left( x_{t - 1} + \varepsilon_{t} \right)}{f\left( x_{t - 1} \right)}\) is always close to 0 when \(x_{t - 1}\) lies in the high-probability region, because the proposals tend to fall far from this region and have low density. The proposals are often rejected, the chain remains stuck at high-density points for long periods of time, and we again obtain a highly serially correlated MCMC sample.
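These two extremes are easy to see numerically. The sketch below (illustrative names) estimates the acceptance rate of a random-walk chain targeting a standard normal, once with a tiny step and once with a huge one:

```python
import numpy as np

def acc_rate(step_sd, T=20000, seed=5):
    """Fraction of accepted proposals for a random-walk Metropolis chain
    with Normal(0, step_sd) increments and a standard normal target."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.exp(-0.5 * x * x)
    x, accepts = 0.0, 0
    for _ in range(T):
        y = x + rng.normal(0.0, step_sd)
        if rng.uniform() <= min(1.0, f(y) / f(x)):
            x, accepts = y, accepts + 1
    return accepts / T

print(acc_rate(0.01))   # tiny steps: acceptance near 1, chain barely moves
print(acc_rate(50.0))   # huge steps: acceptance near 0, chain gets stuck
```

Both extremes produce highly serially correlated samples, for opposite reasons: the first accepts almost everything but moves in tiny increments, while the second rejects almost everything and repeats the same value.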
Fig 3: Variation in the distribution of the chain with the variation in \(\varepsilon_{t}\)
Avoiding these two extreme cases requires accurate tuning of the variance of \(\varepsilon_{t}\). The adaptive MH algorithm is the most commonly used method; it is very efficient and automatically tunes the variance as the MCMC chain runs.
We can also adopt a simpler approach based on trial and error. For example,
- We start by guessing a good value for variance and run an initial MCMC chain
- Inspect the trace plots:
- Flat lines/stuck regions? → variance too large → decrease variance
- Tiny wiggles, barely moving? → variance too small → increase variance
- Adjust and re-run until we see that sweet "fuzzy caterpillar" trace
Convergence of the Metropolis-Hastings algorithm
The MH algorithm guarantees convergence to the target distribution under relatively mild conditions. However, the speed of convergence depends critically on our proposal distribution and tuning. The fundamental result is: if the proposal distribution \(q(y_{t}|x_{t - 1})\) can propose any point from any starting point with positive probability, then the MH algorithm converges to the target \(f(x)\) regardless of the starting point \(x_{0}\).
To demonstrate convergence, it is also important to verify that the detailed balance condition holds, i.e., that the target distribution is the stationary distribution of the Markov chain generated by the MH algorithm.
The detailed balance condition states that,
\[f\left( y \middle| x \right)f(x) = f\left( x \middle| y \right)f(y)\]
Let \(f(x_{t}|x_{t - 1})\) denote the transition kernel, i.e., the probability distribution of a transition from \(x_{t - 1}\) to \(x_{t}\).
Then, when \(x_{t} \neq x_{t - 1}\) (so that the proposal was accepted and \(x_{t} = y_{t}\)), we have
\[f\left( x_{t} \middle| x_{t - 1} \right)f\left( x_{t - 1} \right) = min\left\{ \frac{f\left( x_{t} \right)}{f\left( x_{t - 1} \right)}\frac{q\left( x_{t - 1} \middle| x_{t} \right)}{q\left( x_{t} \middle| x_{t - 1} \right)},1 \right\} q\left( x_{t} \middle| x_{t - 1} \right)f\left( x_{t - 1} \right) = min\left\{ q\left( x_{t - 1} \middle| x_{t} \right)f\left( x_{t} \right),q\left( x_{t} \middle| x_{t - 1} \right)f\left( x_{t - 1} \right) \right\}\]
Similarly, when the transition is from \(x_{t}\) to \(x_{t - 1}\),
\[f\left( x_{t - 1} \middle| x_{t} \right)f\left( x_{t} \right) = min\left\{ q\left( x_{t - 1} \middle| x_{t} \right)f\left( x_{t} \right),q\left( x_{t} \middle| x_{t - 1} \right)f\left( x_{t - 1} \right) \right\}\]
Therefore, when \(x_{t} \neq x_{t - 1}\),
\[f\left( x_{t} \middle| x_{t - 1} \right)f\left( x_{t - 1} \right) = f\left( x_{t - 1} \middle| x_{t} \right)f\left( x_{t} \right)\]
i.e., the target distribution \(f(x)\) and the transition probability distribution satisfy the detailed balance condition.
Now, when \(x_{t} = x_{t - 1}\),
\[f\left( x_{t} \middle| x_{t - 1} \right)f\left( x_{t - 1} \right) = f\left( x_{t} \middle| x_{t} \right)f\left( x_{t} \right)\]
\[f\left( x_{t - 1} \middle| x_{t} \right)f\left( x_{t} \right) = f\left( x_{t} \middle| x_{t} \right)f\left( x_{t} \right)\]
Thus, in both cases, the detailed balance condition is satisfied. As a consequence, \(f(x)\) is the stationary distribution of the chain.
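The detailed balance argument can also be verified numerically on a small finite state space, where the MH transition kernel can be written out in full. The three-state target and proposal below are illustrative:

```python
import numpy as np

f = np.array([0.2, 0.5, 0.3])     # target probabilities on three states
q = np.array([[0.1, 0.6, 0.3],    # proposal kernel q(y | x); each row sums to 1
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])

n = len(f)
P = np.zeros((n, n))              # MH transition kernel
for i in range(n):
    for j in range(n):
        if i != j:
            p = min(1.0, f[j] * q[j, i] / (f[i] * q[i, j]))
            P[i, j] = q[i, j] * p           # propose j, then accept with prob p
    P[i, i] = 1.0 - P[i].sum()              # rejected mass stays at state i

# Detailed balance: f(i) P(i, j) == f(j) P(j, i) for every pair of states
print(np.allclose(f[:, None] * P, (f[:, None] * P).T))   # True
```

Off-diagonal entries satisfy the condition by construction of the acceptance probability, and the diagonal entries satisfy it trivially, exactly mirroring the two cases in the derivation above.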
Furthermore, the resulting chain has to satisfy the conditions of irreducibility and aperiodicity to guarantee convergence from any initial state \(x_{0}\). Irreducibility refers to the fact that the chain can reach any state from any other state with positive probability in finitely many steps. Typically, this requires \(q\left( y_{t} \middle| x_{t - 1} \right) > 0\) whenever \(f(x) > 0\). Aperiodicity means the chain does not get “trapped” in cycles; this is ensured because there is always a positive probability of staying at the current state. A Markov chain is called ergodic if it is irreducible and aperiodic, and, by the ergodic theorem for Markov chains, an ergodic chain converges to the target distribution \(f(x)\).
Suggested Readings:
- Geyer, C. J. (2011). Introduction to Markov Chain Monte Carlo. Handbook of Markov Chain Monte Carlo. CRC Press.
- Robert, C. P., & Casella, G. (2010). Metropolis-Hastings Algorithms. Introducing Monte Carlo Methods with R. Springer.
- Robert, C. P., & Casella, G. (2013). Monte Carlo statistical methods. Vol. 2. Springer.
- Taboga, M. (2021). Metropolis-Hastings algorithm. Fundamentals of Statistics.
Suggested Citation: Meitei, W. B. (2026). Metropolis-Hastings Algorithm. WBM STATS.

