What If I Told You the Most Powerful Algorithm in Statistics Is Just… Guessing?
Author(s): Kamrun Nahar
Originally published on Towards AI.
I stared at the Wikipedia page for MCMC for forty-five minutes. The page had seventeen Greek letters in the first paragraph. I understood exactly zero of them.
Three years later, I can tell you this: MCMC is not hard. The explanations are hard. The math notation is hard. The concept itself? It’s something a ten-year-old could grasp if you stripped away the academic armor plating.
So let’s do that.
What Even IS Sampling, and Why Should You Care?
Before we touch MCMC, let’s talk about what “sampling” means. Because everyone skips this part and then wonders why the rest makes no sense.
Imagine you have a bag of 10,000 marbles. Some are red, some blue, some green. You want to know the ratio. The obvious approach: dump the bag, count every single marble. Done.
But what if the bag has 10 billion marbles? What if the bag is the size of a football stadium? You can’t count them all. So instead, you close your eyes, grab a handful, and estimate the ratios from that handful. That’s sampling.
Now here’s where it gets spicy. What if you don’t even have a bag? What if all you know is a weird math formula that describes how the marbles should be distributed, but you can’t just “reach in and grab” them? The formula is too complicated to solve directly.
That’s the problem MCMC solves.
You have a probability distribution. It’s complex, high-dimensional, and you can’t just solve it analytically. You need to somehow generate samples from it, so you can estimate its properties. MCMC gives you a recipe for doing exactly that.
Let me give you a concrete example. Say you’re a Bayesian statistician (don’t run away). You’ve observed some data, and you’ve written down a model that describes how you think the data was generated. Bayes’ theorem tells you the posterior distribution of your model’s parameters, given the data:
P(parameters | data) = P(data | parameters) * P(parameters) / P(data)
The numerator? Usually calculable. The denominator, P(data)? It's an integral over ALL possible parameter values. In one dimension, maybe you can compute it. In 50 dimensions? Forget it. That integral is intractable.
DIAGRAM 1: Bayes’ Theorem Breakdown

So instead of computing it exactly, you sample from the posterior distribution using MCMC. You generate thousands of parameter values that “look like” they came from the posterior, and then you use those samples to estimate whatever you need: means, variances, credible intervals, predictions. Everything.
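Once you have samples, those estimates really are one-liners. A minimal sketch, using synthetic stand-in numbers (not real MCMC output) and NumPy’s `percentile` for the interval:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for MCMC output: pretend these are 10,000 posterior
# samples of one parameter (synthetic, for illustration only)
posterior_samples = rng.normal(2.5, 0.3, size=10_000)

post_mean = posterior_samples.mean()                # point estimate
post_sd = posterior_samples.std()                   # posterior uncertainty
lo, hi = np.percentile(posterior_samples, [3, 97])  # 94% credible interval
print(f"mean={post_mean:.2f}, sd={post_sd:.2f}, 94% CI=({lo:.2f}, {hi:.2f})")
```

Everything downstream of MCMC is this easy; the hard part is generating samples that actually come from the posterior.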
The “This is a Toaster” Pivot
“Expert” Definition: “MCMC methods comprise a class of algorithms for sampling from a probability distribution. By constructing a Markov chain that has the desired distribution as its equilibrium distribution, one can obtain a sample of the desired distribution by recording states from the chain after it has converged to equilibrium.”
I’ve read that definition approximately 400 times in my life. Every single time, I understood it less than the time before.
Let me try something different.
The Blindfolded Hiker Metaphor
Picture this. You’re blindfolded. You’re standing on a hilly landscape. Each point on this landscape has a specific elevation, and the elevation at any point represents the probability density at that point. High ground = high probability. Low ground = low probability.
Your goal: explore the landscape and spend most of your time on the high ground, because that’s where the probability is concentrated.
Here’s your strategy:
- Take a random step in some direction.
- Feel the ground under your new foot.
- If the new spot is higher (more probable), definitely move there.
- If the new spot is lower (less probable), maybe move there. Flip a biased coin. The steeper the drop, the less likely you’ll accept.
- If you reject the step, stay where you are.
- Repeat forever.
That’s MCMC. That’s the whole thing.
After thousands of steps, if you track where you’ve been, you’ll have a collection of points. And those points will be clustered around the peaks of the landscape. They’ll be samples from the probability distribution that the landscape represents.
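The whole strategy fits in about a dozen lines of code. A throwaway sketch (the names `blindfolded_hike` and `height` are mine, and the landscape here is a single bell-shaped hill centered at 3):

```python
import math
import random

def blindfolded_hike(height, start=0.0, steps=20000, stride=1.0):
    """The hiker strategy: height(x) plays the role of the
    (unnormalized) probability density at position x."""
    x = start
    trail = []
    for _ in range(steps):
        proposal = x + random.gauss(0, stride)   # take a random step
        ratio = height(proposal) / height(x)     # higher or lower ground?
        if random.random() < ratio:              # uphill: always; downhill: maybe
            x = proposal
        trail.append(x)                          # record where we are (moved or not)
    return trail

# A one-hill landscape: a bell curve centered at 3
trail = blindfolded_hike(lambda x: math.exp(-0.5 * (x - 3) ** 2))
kept = trail[2000:]  # drop the early wandering
print(sum(kept) / len(kept))  # hovers near the hilltop at 3
```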
The “Markov Chain” part: your next position depends only on your current position, not on where you were five steps ago. No memory. Just “where am I now, and where should I step next?”
The “Monte Carlo” part: you’re using randomness (coin flips, random directions) to explore. Monte Carlo is just a fancy name for “using random numbers to solve problems.” It’s named after the casino. Because gambling. Because randomness. Scientists are occasionally fun.
Why Not Just Check Every Point?
“Why don’t we just evaluate the function at a grid of points?”
This is the smartest question. Here’s why.
In one dimension, sure. You want to sample a distribution over a single variable? Make a grid of 1,000 points. Evaluate the function at each one. Done. Maybe takes a millisecond.
In two dimensions? 1,000 x 1,000 = 1,000,000 points. Okay, still manageable.
In ten dimensions? 1,000¹⁰ = 10³⁰ points. That’s roughly a trillion times the number of seconds that have passed since the Big Bang.
In fifty dimensions? Your computer laughs at you. Then it catches fire.
This is called the curse of dimensionality. As the number of dimensions grows, the volume of the space explodes exponentially. A grid approach becomes impossible. But real-world Bayesian problems routinely have 50, 500, even 50,000 parameters.
MCMC dodges this curse because it doesn’t try to cover every point. Instead, it focuses its exploration on the regions that matter, the high-probability regions, and ignores the vast deserts of near-zero probability. It’s not brute force. It’s clever laziness.
# The curse of dimensionality in action
# Watch how grid points explode
dimensions = [1, 2, 5, 10, 20, 50]
points_per_dim = 100

for d in dimensions:
    total_points = points_per_dim ** d
    print(f"Dimensions: {d:>3} | Grid points: {total_points:.2e}")

# Output:
# Dimensions:   1 | Grid points: 1.00e+02
# Dimensions:   2 | Grid points: 1.00e+04
# Dimensions:   5 | Grid points: 1.00e+10
# Dimensions:  10 | Grid points: 1.00e+20
# Dimensions:  20 | Grid points: 1.00e+40 <-- more than stars in the observable universe
# Dimensions:  50 | Grid points: 1.00e+100 <-- more than atoms in the observable universe (~1e80)
DIAGRAM 2: The Curse of Dimensionality

Every time someone tells you “just use grid search,” ask them how many dimensions their problem has. If the answer is more than 5, they’re either lying or they enjoy watching computers suffer.
The Metropolis-Hastings Algorithm: Your First MCMC
Alright. Gloves off. Let’s build the most famous MCMC algorithm from scratch: the Metropolis-Hastings algorithm. The original Metropolis version was invented in 1953 at Los Alamos (yes, the nuclear bomb place); Hastings generalized it in 1970.
DIAGRAM 3: Metropolis-Hastings Algorithm Flow

Here’s the algorithm in plain English, then in code.
Step 1: Pick a Starting Point
Choose any value. Literally any value. It can be terrible. It will be terrible. That’s fine. The whole point of MCMC is that it doesn’t matter where you start, as long as you run long enough. My laptop charger was dangling off the table when I first learned this, and I was as unstable as my starting point.
Step 2: Propose a New Point
From your current position x, generate a candidate position x_new. The simplest way: add some random noise. x_new = x + noise, where noise is drawn from a normal distribution.
This is called the proposal distribution. It’s just a recipe for suggesting “where to go next.” The simplest one is a Gaussian centered at your current location.
Step 3: Compute the Acceptance Ratio
Here’s the math (don’t panic, it’s one fraction):
alpha = P(x_new) / P(x_current)
Where P() is the target distribution you're trying to sample from. You don't need to know the normalizing constant! If both numerator and denominator are missing the same constant, it cancels out. This is the key insight that makes MCMC practical.
If alpha >= 1, the new point is more probable. Accept it. Move there.
If alpha < 1, the new point is less probable. Accept it with probability alpha. In practice: generate a random number u between 0 and 1. If u < alpha, accept. Otherwise, reject and stay put.
Step 4: Record and Repeat
Record your current position (whether you moved or stayed). Go back to Step 2. Do this thousands of times. The recorded positions are your samples.
Step 5: Throw Away the Beginning
The first few hundred (or thousand) samples are garbage. They’re contaminated by your terrible starting point. This “warm-up” phase is called burn-in. Throw those samples away. Keep the rest.
Here’s the code:
import numpy as np
import matplotlib.pyplot as plt

# ============================================
# METROPOLIS-HASTINGS FROM SCRATCH
# Target: sample from a distribution we can't
# sample from directly (a mixture of Gaussians)
# ============================================

def target_distribution(x):
    """
    This is our "landscape" -- two hills.
    A mixture of two Gaussians.
    We only need this up to a proportionality constant
    (the normalizing constant cancels out in the ratio).
    """
    return 0.3 * np.exp(-0.5 * ((x - 2) / 0.8) ** 2) + \
           0.7 * np.exp(-0.5 * ((x + 1) / 1.2) ** 2)

def metropolis_hastings(target, n_samples=50000, proposal_width=1.0):
    """
    The actual algorithm. 12 lines of logic.
    I told you it wasn't that bad.
    """
    samples = []
    x_current = 0.0  # Starting point. Could be anything. I picked 0 because I'm lazy.
    accepted = 0

    for i in range(n_samples):
        # Step 2: Propose a new point (random step from current position)
        x_new = x_current + np.random.normal(0, proposal_width)

        # Step 3: Compute acceptance ratio
        # This line is the entire heart of the algorithm
        alpha = target(x_new) / target(x_current)

        # Step 3b: Accept or reject
        if np.random.random() < alpha:  # This handles both alpha >= 1 and alpha < 1
            x_current = x_new
            accepted += 1

        # Step 4: Record current position (even if we didn't move)
        samples.append(x_current)

    print(f"Acceptance rate: {accepted / n_samples:.2%}")
    return np.array(samples)

# Run it!
samples = metropolis_hastings(target_distribution)

# Step 5: Burn-in removal (throw away first 5000 samples)
samples_clean = samples[5000:]

# Plot the results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: the trace plot (the "random walk" over time)
axes[0].plot(samples[:2000], linewidth=0.5, alpha=0.7, color='#4a9eed')
axes[0].set_title('Trace Plot (First 2000 Steps)', fontsize=14)
axes[0].set_xlabel('Step')
axes[0].set_ylabel('Value')
axes[0].axhline(y=2, color='#ef4444', linestyle='--', alpha=0.5, label='Peak 1')
axes[0].axhline(y=-1, color='#22c55e', linestyle='--', alpha=0.5, label='Peak 2')
axes[0].legend()

# Right: histogram of samples vs true distribution
x_range = np.linspace(-5, 6, 1000)
true_density = target_distribution(x_range)
true_density = true_density / np.trapz(true_density, x_range)  # Normalize for comparison

axes[1].hist(samples_clean, bins=100, density=True, alpha=0.6,
             color='#8b5cf6', label='MCMC Samples')
axes[1].plot(x_range, true_density, 'r-', linewidth=2, label='True Distribution')
axes[1].set_title('MCMC Samples vs True Distribution', fontsize=14)
axes[1].set_xlabel('Value')
axes[1].set_ylabel('Density')
axes[1].legend()

plt.tight_layout()
plt.savefig('mcmc_basic.png', dpi=150)
plt.show()
Run this. Stare at the histogram. The purple bars (your random walk samples) will match the red line (the true distribution) almost perfectly. You just sampled from a distribution you couldn’t solve analytically. With a random walk. On your first try.
The acceptance rate tells you if your proposal width is good. Aim for 20–50%. Too high means you’re taking tiny steps and exploring slowly. Too low means you’re proposing wild jumps and rejecting everything. It’s like Goldilocks, but with random numbers.
The Three Knobs That Control Everything
MCMC has three settings you can tune, and understanding them is the difference between “my sampler works” and “my sampler has been running for 6 hours and produced garbage.” I learned this the hard way while my dog was barking at absolutely nothing outside the window at 11pm.
Knob 1: Proposal Width (Step Size)
This controls how far you try to jump at each step.
Too small: You accept almost everything (good), but you explore at the speed of a glacier (bad). You’ll need millions of samples to cover the distribution. This is called poor mixing.
Too large: You propose wild jumps to improbable regions (bad), reject almost everything (bad), and your chain just… sits there. Stuck. Refusing to move.
Just right: You get a ~25–50% acceptance rate. Your chain bounces around happily, exploring all the peaks and valleys.
# Let's SEE what happens with different proposal widths
fig, axes = plt.subplots(1, 3, figsize=(18, 4))
widths = [0.05, 1.0, 50.0]
titles = ['Too Small (sig=0.05)', 'Just Right (sig=1.0)', 'Too Large (sig=50.0)']
colors = ['#ef4444', '#22c55e', '#4a9eed']

for ax, width, title, color in zip(axes, widths, titles, colors):
    s = metropolis_hastings(target_distribution, n_samples=5000, proposal_width=width)
    ax.plot(s, linewidth=0.5, alpha=0.7, color=color)
    ax.set_title(title, fontsize=13)
    ax.set_xlabel('Step')
    ax.set_ylabel('Value')
    ax.set_ylim(-5, 6)

plt.tight_layout()
plt.savefig('proposal_widths.png', dpi=150)
plt.show()
Knob 2: Number of Samples
More is generally better. But there’s a point of diminishing returns. For most practical problems, 10,000 to 100,000 post-burn-in samples is plenty. If you need millions, something is probably wrong with your model, not your sampler.
Knob 3: Burn-in Length
How many initial samples to throw away. A conservative rule: run some diagnostic checks (we’ll get to those) and discard the first 10–25% of your total samples. If you’re paranoid, discard the first half. I’m usually paranoid.
Diagnosing Your Chain: Is This Thing Working?
This is the part nobody teaches well. You’ve run MCMC. You have 50,000 samples. But are they any good? How do you know the chain has “converged” (i.e., has actually found and explored the target distribution)?
DIAGRAM 4: MCMC Diagnostics Checklist

^ Four checks. That’s all. If all four pass, your results are solid. If any fail, fix it before you look at a single number.
Diagnostic 1: The Trace Plot
Plot your samples over time (step number on x-axis, sample value on y-axis). A healthy chain looks like a “fat, hairy caterpillar.” It should be jumping around rapidly, covering the same range consistently. No trends. No long flat periods. No drifting.
Bad trace plot signs: the chain is stuck in one place for long periods; the chain is slowly drifting in one direction; the chain hasn’t reached all the modes of the distribution.
Diagnostic 2: Multiple Chains
Run the same algorithm 3–4 times with different starting points. If all chains converge to the same distribution, you’re in good shape. If they disagree, you haven’t run long enough.
Diagnostic 3: The R-hat Statistic
Also called the “Gelman-Rubin diagnostic.” It compares the variance within each chain to the variance between chains. If R-hat is close to 1.0 (below 1.01 is the modern standard), your chains have converged. If it’s above 1.1, something’s wrong. Keep running.
def gelman_rubin(chains):
    """
    R-hat: the "are we there yet?" diagnostic.
    chains: list of arrays, each array is one chain's samples.
    Returns a number. Want it close to 1.0.
    If it's above 1.1, your chain needs more time (or therapy).
    """
    n = len(chains[0])  # samples per chain
    m = len(chains)     # number of chains

    # Mean of each chain
    chain_means = [np.mean(c) for c in chains]

    # Grand mean (mean of means)
    grand_mean = np.mean(chain_means)

    # Between-chain variance
    B = n / (m - 1) * sum((cm - grand_mean) ** 2 for cm in chain_means)

    # Within-chain variance
    W = np.mean([np.var(c, ddof=1) for c in chains])

    # Pooled variance estimate
    var_hat = (1 - 1 / n) * W + (1 / n) * B

    # R-hat
    r_hat = np.sqrt(var_hat / W)
    return r_hat

# Run 4 chains from different starting points
chains = []
for start in [-10, 0, 5, 15]:
    # I'm reusing our function but you'd modify it to accept a start point
    s = metropolis_hastings(target_distribution, n_samples=20000, proposal_width=1.0)
    chains.append(s[5000:])  # Remove burn-in

r_hat = gelman_rubin(chains)
print(f"R-hat: {r_hat:.4f}")  # Should be close to 1.0
# If it says 1.002, celebrate
# If it says 1.3, panic gently
Diagnostic 4: Effective Sample Size (ESS)
Because consecutive MCMC samples are correlated (remember, each step depends on the previous one), 50,000 samples don’t give you 50,000 independent pieces of information. ESS tells you the effective number of independent samples. If you have 50,000 samples but an ESS of 200, your chain is mixing terribly. You want ESS to be at least 400 per parameter for reliable estimates.
def effective_sample_size(samples):
    """
    ESS: "How many of my samples actually count?"
    Uses autocorrelation to figure out how much
    information is in your correlated chain.
    A low number here means your chain is basically
    repeating itself like a broken record.
    """
    n = len(samples)
    mean = np.mean(samples)
    var = np.var(samples)
    if var == 0:
        return 0  # No variance means no information. Sad.

    # Compute autocorrelation at different lags
    max_lag = min(n // 2, 1000)
    autocorr_sum = 0
    for lag in range(1, max_lag):
        c = np.mean((samples[:n - lag] - mean) * (samples[lag:] - mean)) / var
        if c < 0.05:  # Stop when autocorrelation drops below threshold
            break
        autocorr_sum += c

    ess = n / (1 + 2 * autocorr_sum)
    return int(ess)

ess = effective_sample_size(samples_clean)
print(f"Total samples: {len(samples_clean)}")
print(f"Effective sample size: {ess}")
print(f"Efficiency: {ess / len(samples_clean):.1%}")
Why this matters for your sanity: An ESS of 50,000 from 50,000 samples means perfect mixing. An ESS of 100 from 50,000 samples means you wasted 99.8% of your compute. This number single-handedly tells you if your sampler is doing its job.
Beyond Metropolis-Hastings: The Family of MCMC Methods
Metropolis-Hastings is the granddaddy. But it has children, grandchildren, and one weird nephew. Let’s meet the family.
Gibbs Sampling: The One-Dimension-at-a-Time Strategy
The problem with Metropolis-Hastings in high dimensions: proposing a new point for ALL variables at once makes it hard to find good proposals. It’s like trying to move a couch through a narrow hallway by pushing it in all directions simultaneously.
Gibbs sampling takes a different approach. Instead of proposing a new value for all variables at once, it updates one variable at a time, while keeping all others fixed.
Say you have two parameters, theta_1 and theta_2. Gibbs sampling does this:
- Fix theta_2. Sample a new theta_1 from P(theta_1 | theta_2, data).
- Fix theta_1 (the new one). Sample a new theta_2 from P(theta_2 | theta_1, data).
- Repeat.
The magic: each individual step samples from a conditional distribution, which is often much simpler than the full joint distribution. In many models (like Bayesian linear regression, topic models, mixture models), these conditional distributions have known, standard forms. You can sample from them directly. No accept/reject needed.
# ===================================
# GIBBS SAMPLING: Bivariate Normal
# Target: sample from a 2D Gaussian
# with correlation rho
# ===================================

def gibbs_bivariate_normal(n_samples=10000, rho=0.8, mu1=0, mu2=0, sigma1=1, sigma2=1):
    """
    Sample from a correlated bivariate normal distribution
    using Gibbs sampling.

    Each conditional distribution is univariate normal.
    That's the beauty of Gibbs: complex joint, simple conditionals.

    rho = 0.8 means the variables are strongly correlated.
    My allergies were also strongly correlated with pollen count
    the day I wrote this code.
    """
    samples = np.zeros((n_samples, 2))

    # Start anywhere
    x1, x2 = 0.0, 0.0

    for i in range(n_samples):
        # Step 1: Sample x1 | x2
        # The conditional distribution of x1 given x2 is:
        #   Normal(mu1 + rho * sigma1/sigma2 * (x2 - mu2),
        #          sigma1^2 * (1 - rho^2))
        cond_mean_1 = mu1 + rho * (sigma1 / sigma2) * (x2 - mu2)
        cond_std_1 = sigma1 * np.sqrt(1 - rho ** 2)
        x1 = np.random.normal(cond_mean_1, cond_std_1)

        # Step 2: Sample x2 | x1
        cond_mean_2 = mu2 + rho * (sigma2 / sigma1) * (x1 - mu1)
        cond_std_2 = sigma2 * np.sqrt(1 - rho ** 2)
        x2 = np.random.normal(cond_mean_2, cond_std_2)

        samples[i] = [x1, x2]

    return samples

samples_gibbs = gibbs_bivariate_normal(n_samples=10000, rho=0.8)

# Plot the results
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Scatter plot of samples
axes[0].scatter(samples_gibbs[500:, 0], samples_gibbs[500:, 1],
                alpha=0.1, s=2, color='#8b5cf6')
axes[0].set_title('Gibbs Samples (Bivariate Normal, rho=0.8)', fontsize=13)
axes[0].set_xlabel('theta_1')
axes[0].set_ylabel('theta_2')
axes[0].set_aspect('equal')

# Show the Gibbs "staircase" pattern for the first 50 steps
x_path = samples_gibbs[:50, 0]
y_path = samples_gibbs[:50, 1]
for i in range(len(x_path) - 1):
    # Horizontal step (updating x1)
    axes[1].plot([x_path[i], x_path[i + 1]], [y_path[i], y_path[i]],
                 'b-', alpha=0.5, linewidth=0.8)
    # Vertical step (updating x2)
    axes[1].plot([x_path[i + 1], x_path[i + 1]], [y_path[i], y_path[i + 1]],
                 'r-', alpha=0.5, linewidth=0.8)

axes[1].set_title('Gibbs "Staircase" (First 50 Steps)', fontsize=13)
axes[1].set_xlabel('theta_1')
axes[1].set_ylabel('theta_2')
axes[1].set_aspect('equal')

plt.tight_layout()
plt.savefig('gibbs_sampling.png', dpi=150)
plt.show()
Notice the “staircase” pattern in the path plot. Gibbs sampling always moves parallel to one axis at a time. Horizontal move: update theta_1. Vertical move: update theta_2. Back and forth. It looks weird, but it works brilliantly when the conditional distributions are easy to sample from.
When to use Gibbs: When your model has nice conditional distributions (conjugate priors in Bayesian stats). When you know the conditionals in closed form. Most topic models (like LDA) use Gibbs.
When NOT to use Gibbs: When your parameters are highly correlated and the conditionals don’t have nice forms. Gibbs can get stuck in narrow diagonal valleys, because it can only walk horizontally and vertically.
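You can watch that slowdown directly. A small experiment of my own (not from any library), using lag-1 autocorrelation of the theta_1 chain as a crude mixing gauge on a standard bivariate normal:

```python
import numpy as np

def gibbs_chain(rho, n=5000, seed=0):
    """Gibbs on a standard bivariate normal with correlation rho;
    returns the x1 trace. Conditionals: x1 | x2 ~ N(rho*x2, 1 - rho^2)."""
    rng = np.random.default_rng(seed)
    x1 = x2 = 0.0
    trace = np.empty(n)
    cond_std = np.sqrt(1 - rho ** 2)
    for i in range(n):
        x1 = rng.normal(rho * x2, cond_std)
        x2 = rng.normal(rho * x1, cond_std)
        trace[i] = x1
    return trace

def lag1_autocorr(x):
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

for rho in (0.1, 0.99):
    print(f"rho={rho}: lag-1 autocorrelation = {lag1_autocorr(gibbs_chain(rho)):.3f}")
# Weak correlation: near zero (each sample is nearly independent)
# rho=0.99: close to 1 (the staircase shuffles in tiny diagonal steps)
```

For this target the successive-sample correlation is rho squared, so at rho=0.99 each new sample carries almost no new information.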
Hamiltonian Monte Carlo (HMC): The Physics Engine
Here’s where things get seriously cool.
Regular Metropolis-Hastings has a problem. It’s a random walk. Random walks are inefficient. In high dimensions, a random walker tends to wander back and forth, revisiting the same areas, making slow progress. It’s like a tourist without Google Maps.
HMC steals an idea from physics. Imagine your probability distribution is a surface, and you place a ball on that surface. Give the ball a random kick. Thanks to physics (Hamilton’s equations of motion), the ball will roll along the surface, following the contours, naturally spending more time in the valleys (which, in HMC’s convention, are actually the HIGH probability regions because we flip the surface).
Wait. Let me be more precise. HMC imagines the negative log-probability as a “potential energy surface.” Low potential energy = high probability. The ball rolls toward low potential energy = high probability regions. Standard physics.
The algorithm adds “momentum” to each parameter. At each step:
- Give the ball a random kick (sample a random momentum).
- Simulate physics for some number of steps using a numerical integrator (called the “leapfrog integrator”).
- Accept or reject the final position using a Metropolis-like criterion (to correct for numerical integration errors).
The result? Instead of random stumbling, HMC takes long, purposeful trajectories along the probability surface. It explores distributions FAR more efficiently than Metropolis-Hastings, especially in high dimensions.
# =============================================
# HAMILTONIAN MONTE CARLO FROM SCRATCH
# This is the algorithm that powers Stan and PyMC
# =============================================

def hmc(target_log_prob, grad_log_prob, initial, n_samples=5000,
        step_size=0.1, n_leapfrog=20):
    """
    HMC: The sampler that went to physics class.

    target_log_prob: function returning log of target density
    grad_log_prob: function returning gradient of log target density
        (Yes, you need the gradient. That's the trade-off.)
    step_size: how big each leapfrog step is (epsilon)
    n_leapfrog: how many leapfrog steps per proposal (L)
    """
    dim = len(initial)
    samples = [initial.copy()]
    current_q = initial.copy()
    accepted = 0

    for i in range(n_samples):
        q = current_q.copy()

        # Step 1: Sample random momentum
        # (this is the "random kick" to the ball)
        p = np.random.normal(0, 1, size=dim)
        current_p = p.copy()

        # Step 2: Leapfrog integration
        # This simulates physics. It's beautiful.
        # It's just three lines in a loop.
        p += 0.5 * step_size * grad_log_prob(q)      # Half-step for momentum
        for step in range(n_leapfrog - 1):
            q += step_size * p                       # Full step for position
            p += step_size * grad_log_prob(q)        # Full step for momentum
        q += step_size * p                           # Final full step for position
        p += 0.5 * step_size * grad_log_prob(q)      # Final half-step for momentum

        # Negate momentum (for theoretical reversibility)
        p = -p

        # Step 3: Metropolis accept/reject
        # (corrects for imperfect numerical integration)
        current_H = -target_log_prob(current_q) + 0.5 * np.sum(current_p ** 2)
        proposed_H = -target_log_prob(q) + 0.5 * np.sum(p ** 2)

        # Accept if energy is conserved (or nearly so)
        if np.random.random() < np.exp(current_H - proposed_H):
            current_q = q
            accepted += 1

        samples.append(current_q.copy())

    print(f"HMC Acceptance rate: {accepted / n_samples:.2%}")
    return np.array(samples)
# Define a 2D target: correlated bivariate normal
# We need both the log probability AND its gradient

def target_log_prob_2d(q):
    """Log probability of bivariate normal with correlation 0.8"""
    rho = 0.8
    x, y = q[0], q[1]
    z = x**2 - 2*rho*x*y + y**2
    return -z / (2 * (1 - rho**2))

def grad_log_prob_2d(q):
    """Gradient of the log probability. Calculus pays off here."""
    rho = 0.8
    x, y = q[0], q[1]
    scale = 1 / (1 - rho**2)
    dx = -scale * (x - rho * y)
    dy = -scale * (y - rho * x)
    return np.array([dx, dy])

# Run HMC
hmc_samples = hmc(
    target_log_prob_2d,
    grad_log_prob_2d,
    initial=np.array([0.0, 0.0]),
    n_samples=5000,
    step_size=0.15,
    n_leapfrog=20
)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# HMC samples
axes[0].scatter(hmc_samples[500:, 0], hmc_samples[500:, 1],
                alpha=0.15, s=3, color='#4a9eed')
axes[0].set_title('HMC Samples', fontsize=14)
axes[0].set_xlabel('theta_1')
axes[0].set_ylabel('theta_2')
axes[0].set_aspect('equal')

# First 30 trajectories
for i in range(30):
    axes[1].plot(hmc_samples[i:i+2, 0], hmc_samples[i:i+2, 1],
                 '-', alpha=0.4, linewidth=1, color='#ef4444')
axes[1].scatter(hmc_samples[:30, 0], hmc_samples[:30, 1],
                s=15, color='#ef4444', zorder=5)
axes[1].set_title('HMC Trajectories (First 30 Steps)', fontsize=14)
axes[1].set_xlabel('theta_1')
axes[1].set_ylabel('theta_2')
axes[1].set_aspect('equal')

plt.tight_layout()
plt.savefig('hmc_sampling.png', dpi=150)
plt.show()
See the difference? Where Metropolis-Hastings takes random baby steps, HMC takes long, sweeping arcs through the distribution. Each sample is nearly independent of the last. The ESS per computation is dramatically higher.
The trade-off: HMC needs the gradient of the log-probability. For simple distributions, you compute it by hand. For complex models, you use automatic differentiation (which is what PyMC and Stan do under the hood). If you can’t compute the gradient, HMC is off the table.
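One practical habit worth stealing: before trusting a hand-derived gradient inside HMC, check it against finite differences. A quick sketch reusing the bivariate-normal formulas from above (`numerical_grad` is my own helper name, not a library function):

```python
import numpy as np

def target_log_prob_2d(q, rho=0.8):
    x, y = q[0], q[1]
    return -(x**2 - 2*rho*x*y + y**2) / (2 * (1 - rho**2))

def grad_log_prob_2d(q, rho=0.8):
    x, y = q[0], q[1]
    scale = 1 / (1 - rho**2)
    return np.array([-scale * (x - rho * y), -scale * (y - rho * x)])

def numerical_grad(f, q, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(q)
    for i in range(len(q)):
        step = np.zeros_like(q)
        step[i] = h
        g[i] = (f(q + step) - f(q - step)) / (2 * h)
    return g

q = np.array([0.7, -1.3])
print(np.allclose(grad_log_prob_2d(q), numerical_grad(target_log_prob_2d, q), atol=1e-4))
# If this prints False, fix your calculus before running HMC
```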
MCMC Comparison Table: Know Your Weapons
Let me save you the confusion I experienced. Here’s when to use what.
# When to use what (the cheat sheet)
#
# Metropolis-Hastings:
# - You only have the density (no gradients)
# - Low dimensions (1-10)
# - Quick and dirty prototype
# - Educational purposes
#
# Gibbs Sampling:
# - You know the conditional distributions
# - Conjugate Bayesian models
# - Medium dimensions (5-50)
#
# HMC / NUTS:
# - High dimensions (50+)
# - You have (or can compute) gradients
# - Production Bayesian inference
# - Tools: PyMC, Stan, NumPyro
DIAGRAM 5: MCMC Method Comparison

NUTS deserves a special mention. It’s HMC’s smarter sibling. HMC requires you to set the number of leapfrog steps manually (getting this wrong is painful). NUTS (the No-U-Turn Sampler) automatically determines the trajectory length by detecting when the trajectory starts to “double back” on itself (a U-turn). It’s what Stan and PyMC use by default. If you’re using a modern probabilistic programming framework, you’re probably already using NUTS without knowing it.
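The U-turn test itself is simple in spirit. A toy sketch of the core idea (heavily simplified; real NUTS applies this criterion across a recursively doubled tree of leapfrog steps, and `doubled_back` is my own illustrative name):

```python
import numpy as np

def doubled_back(q_start, q_end, p_end):
    """Simplified U-turn check: stop extending the trajectory once
    the momentum at the far end points back toward the start."""
    return float(np.dot(q_end - q_start, p_end)) < 0

start = np.array([0.0, 0.0])

# Still moving away from the start: keep simulating
print(doubled_back(start, np.array([2.0, 0.0]), np.array([1.0, 0.5])))   # False

# Momentum now aims back at the start: U-turn, stop here
print(doubled_back(start, np.array([2.0, 0.0]), np.array([-1.0, 0.5])))  # True
```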
The Real-World Toolbox: PyMC and Stan
You don’t need to write MCMC from scratch in production. That code above was educational. In practice, you use battle-tested libraries. The two biggest players are PyMC (Python) and Stan (its own language with Python/R interfaces).
Here’s what a real-world Bayesian analysis looks like with PyMC:
import pymc as pm
import arviz as az
import numpy as np
import matplotlib.pyplot as plt

# =============================================
# REAL BAYESIAN LINEAR REGRESSION WITH PyMC
# This is what actual data scientists run.
# =============================================

# Generate some fake data
# (In real life, this is your actual dataset)
np.random.seed(42)
n = 200
true_slope = 2.5
true_intercept = -1.0
true_noise = 1.5

X = np.random.uniform(-3, 3, size=n)
y = true_intercept + true_slope * X + np.random.normal(0, true_noise, size=n)

# Build the Bayesian model
with pm.Model() as linear_model:
    # Priors: what we believe about the parameters before seeing data
    # (these are intentionally vague / "uninformative")
    intercept = pm.Normal('intercept', mu=0, sigma=10)
    slope = pm.Normal('slope', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=5)  # Noise must be positive

    # Likelihood: how the data is generated given the parameters
    mu = intercept + slope * X
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)

    # THIS IS WHERE MCMC HAPPENS
    # PyMC automatically picks NUTS (the best HMC variant)
    # It auto-tunes the step size during warm-up
    # It runs 4 chains in parallel
    # It checks R-hat, ESS, and divergences for you
    trace = pm.sample(
        draws=2000,       # 2000 post-warmup samples per chain
        tune=1000,        # 1000 warmup samples (auto-tuning)
        chains=4,         # 4 independent chains
        cores=4,          # parallel computation
        random_seed=42    # for reproducibility
    )

# Diagnostics: ArviZ makes this painless
print(az.summary(trace, var_names=['intercept', 'slope', 'sigma']))

# This summary gives you:
# - mean: posterior mean of each parameter
# - sd: posterior standard deviation
# - hdi_3%: lower bound of 94% credible interval
# - hdi_97%: upper bound of 94% credible interval
# - r_hat: convergence diagnostic (want ~1.0)
# - ess_bulk: effective sample size (want > 400)
# - ess_tail: effective sample size for tail estimates

# Plot the posterior distributions
az.plot_trace(trace, var_names=['intercept', 'slope', 'sigma'])
plt.tight_layout()
plt.savefig('pymc_trace.png', dpi=150)
plt.show()

# Plot the posteriors against the true parameter values
az.plot_posterior(trace, var_names=['intercept', 'slope', 'sigma'],
                  ref_val=[true_intercept, true_slope, true_noise])
plt.tight_layout()
plt.savefig('pymc_posterior.png', dpi=150)
plt.show()
Seven lines of model code. That’s it. PyMC handles the MCMC algorithm selection, step size tuning, parallel chains, convergence diagnostics, and plotting. All the code we wrote from scratch earlier? It’s running behind the scenes, but optimized and tested by people much smarter than me.
Writing MCMC from scratch teaches you the intuition. Using PyMC/Stan in production keeps you sane. Both matter.
The acceptance ratio doesn’t need the normalizing constant.
Go back to the Metropolis-Hastings acceptance ratio: alpha = P(x_new) / P(x_current).
If your target distribution is P(x) = f(x) / Z, where Z is the impossible normalizing constant, then:
alpha = [f(x_new) / Z] / [f(x_current) / Z] = f(x_new) / f(x_current)
Z cancels. Gone. You never need it.
This is the entire reason MCMC was invented. Not because sampling is fun. Not because random walks are elegant. Because that Z, that terrifying integral that appears in Bayes' theorem, in partition functions, in every hard probability problem, just… vanishes.
Every MCMC textbook buries this revelation on page 47 between two proofs about ergodicity. It should be on the cover. In neon.
The Golden Rule: You never, ever need to compute the normalizing constant. That’s not a trick. That’s the entire point.
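You can watch the cancellation pay off in a few lines of NumPy. This is a minimal sketch of a Metropolis sampler, not the article's earlier code; the density f and every variable name here are illustrative. Note that f is deliberately missing its normalizing constant (1/sqrt(2*pi) for a standard normal), and the sampler never misses it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized density: f(x) = exp(-x^2 / 2) is a standard normal
# WITHOUT its 1/sqrt(2*pi) normalizing constant.
def f(x):
    return np.exp(-0.5 * x**2)

x = 0.0
samples = []
for _ in range(50_000):
    x_new = x + rng.normal(0, 1.0)       # symmetric random-walk proposal
    if rng.random() < f(x_new) / f(x):   # only the RATIO matters; Z cancels
        x = x_new
    samples.append(x)

samples = np.array(samples[5_000:])      # drop warmup

# The sample mean and std should land close to 0 and 1,
# even though we never computed the normalizing constant.
print(samples.mean(), samples.std())
```

Despite knowing the density only up to a constant, the chain recovers the distribution's mean and spread.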
Where MCMC Shows Up in Real Life
This isn’t some academic exercise that only exists in textbooks. MCMC runs inside:
Drug Discovery. Pharmaceutical companies use Bayesian models to estimate how new drugs will behave in human bodies. MCMC samples from the posterior distribution of pharmacokinetic parameters. Thousands of virtual patients. No one gets hurt.
Climate Science. Climate models have hundreds of parameters. MCMC helps estimate the uncertainty in temperature projections by sampling from posterior distributions over model parameters, given observed data.
Astronomy. Astrophysicists use MCMC to estimate the properties of exoplanets from tiny wobbles in starlight. The emcee Python package is one of the most cited MCMC tools in astronomy.
Sports Analytics. Bayesian hierarchical models with MCMC power player rating systems in basketball, football, and baseball. FiveThirtyEight’s models used MCMC.
Spam Filters. Some email spam filters use Bayesian methods. The probability that an email is spam, given its words, is estimated using posterior sampling.
Language Models. Some LLM training techniques and certain decoding strategies use MCMC-like sampling from token distributions.
When Things Go Wrong
Let me tell you about the three ways your MCMC can silently fail, producing results that look correct but are actually garbage. I’ve personally walked into each of these traps. Each time, my snack was getting cold on my desk while I debugged for hours.
Trap 1: The Multimodal Problem
If your distribution has multiple peaks (modes) separated by low-probability valleys, standard MCMC can get stuck on one peak and never find the others. It’s like getting stuck on one hill because the valley between hills is too deep to cross.
Solution: Run many chains from diverse starting points. Use tempering methods (which “heat up” the landscape to flatten the valleys). Or use specialized algorithms like parallel tempering.
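Here is a toy illustration of the trap and the multiple-chains remedy, using a hand-rolled Metropolis sampler on an unnormalized bimodal density. Everything here (the target, the function names, the step size) is made up for this sketch:

```python
import numpy as np

# Toy unnormalized bimodal target: equal mixture of N(-5, 1) and N(+5, 1).
def f(x):
    return np.exp(-0.5 * (x + 5) ** 2) + np.exp(-0.5 * (x - 5) ** 2)

def metropolis(x0, n_steps, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x, out = x0, []
    for _ in range(n_steps):
        x_new = x + rng.normal(0, step)
        if rng.random() < f(x_new) / f(x):
            x = x_new
        out.append(x)
    return np.array(out)

# A single chain started at -5 with a small step usually stays trapped near
# that mode: the valley at x=0 is ~e^-12.5 times less likely than the peaks,
# so crossings are astronomically rare.
stuck = metropolis(-5.0, 20_000)
print(stuck.mean())  # typically near -5, far from the true mean of 0

# Four chains from diverse starting points expose the problem immediately:
means = [metropolis(x0, 20_000, seed=i).mean()
         for i, x0 in enumerate([-5.0, -2.0, 2.0, 5.0])]
print(means)  # chains disagree, so R-hat would be far above 1.0
```

One chain alone looks perfectly healthy (stable trace, good acceptance rate) while silently missing half the distribution; only the disagreement between chains reveals it.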
Trap 2: The Funnel of Doom
Some posterior distributions have “funnel” shapes, where one parameter controls the scale of another. Neal’s funnel is the classic example. At the narrow end of the funnel, HMC needs tiny step sizes; at the wide end, it needs large ones. No single step size works everywhere.
Solution: Reparameterize your model. Instead of sampling the original parameters, sample a transformed version where the geometry is better. Stan’s manual has an entire section on this. It’s worth reading.
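To make the reparameterization idea concrete, here is a pure-NumPy sketch using one common form of Neal's funnel. The variable names and the specific parameterization are illustrative assumptions, but the trick (the "non-centered parameterization") is the standard one:

```python
import numpy as np

rng = np.random.default_rng(0)

# One common form of Neal's funnel: v ~ N(0, 3), x | v ~ N(0, exp(v/2)).
# In the original (v, x) coordinates the scale of x swings wildly with v,
# so no single MCMC step size fits both ends of the funnel.
v = rng.normal(0, 3, size=100_000)
x = rng.normal(0, np.exp(v / 2))

# Non-centered reparameterization: sample x_raw ~ N(0, 1) and define
# x = exp(v/2) * x_raw. In (v, x_raw) coordinates the target is just two
# independent Gaussians: uniform geometry, one step size works everywhere.
x_raw = x / np.exp(v / 2)

print(x.std())      # large and unstable: the scale depends on v
print(x_raw.std())  # ~1.0: well-behaved
```

The sampler then explores (v, x_raw), and you recover x deterministically afterward. Same model, same posterior, radically friendlier geometry.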
Trap 3: Divergent Transitions
When using HMC/NUTS, the leapfrog integrator can “diverge” (blow up to infinity) in regions where the posterior has sharp curvatures. PyMC and Stan will warn you about divergent transitions. Do NOT ignore these warnings. They mean the sampler is struggling with the geometry of your posterior.
Solution: Increase target_accept (in PyMC, set it to 0.95 or 0.99). Reparameterize your model. Increase the warmup period.
# When PyMC gives you divergence warnings, try this:
with linear_model:
    trace_safer = pm.sample(
        draws=2000,
        tune=2000,           # more warmup for better tuning
        chains=4,
        target_accept=0.95,  # more conservative step sizes
        random_seed=42
    )

# If divergences persist, the model probably needs reparameterization.
# Don't just crank target_accept to 0.999 and pray.
Scars-Over-Theory Tips
Things I wish someone told me before I wasted hundreds of hours:
- Start with a simple model. Get MCMC working on a toy problem before touching your real data. If it doesn’t work on fake data where you know the answer, it won’t work on real data.
- Always run multiple chains. One chain tells you nothing about convergence. Four chains tell you almost everything.
- Check R-hat and ESS before looking at results. If R-hat > 1.01 or ESS < 400, your results are unreliable. Full stop.
- Visualize the trace plot. If it doesn’t look like a hairy caterpillar, keep running.
- Use informative priors when you have real prior knowledge. Vague priors like Normal(0, 1000) can cause terrible geometry. Weakly informative priors like Normal(0, 10) usually help.
- Don’t write your own sampler in production. PyMC, Stan, NumPyro, and TensorFlow Probability exist. They’ve been tested by thousands of researchers. Use them.
- Reparameterize before you panic. Most MCMC failures are caused by bad geometry, not bad algorithms. Transform your parameters to make the posterior more “normal-shaped.”
- The acceptance rate is your check engine light. If it’s outside 20–90%, something is wrong with your settings.
- MCMC is not the only game in town. For some problems, variational inference is faster (but approximate). For others, nested sampling is better. MCMC is the gold standard for accuracy, not speed.
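The R-hat check from the list above is simple enough to compute by hand. Here is a rough NumPy sketch of split-R-hat, a simplified version of the diagnostic libraries like ArviZ implement (which also adds rank normalization); the function name and the fake chains are illustrative:

```python
import numpy as np

def split_rhat(chains):
    """Split-R-hat for a (n_chains, n_draws) array.

    Simplified version of the diagnostic from Gelman et al.: split each
    chain in half, then compare between-chain and within-chain variance.
    """
    n_chains, n_draws = chains.shape
    half = n_draws // 2
    splits = np.concatenate([chains[:, :half], chains[:, half:2 * half]], axis=0)
    m, n = splits.shape
    chain_means = splits.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = splits.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
good = rng.normal(0, 1, size=(4, 2000))            # 4 chains, same target
bad = good + np.array([[0.], [0.], [0.], [3.]])    # one chain stuck elsewhere

print(split_rhat(good))  # close to 1.0: chains agree
print(split_rhat(bad))   # well above 1.01: chains disagree
```

If the chains all explored the same distribution, between-chain variance matches within-chain variance and R-hat lands near 1.0; a single stuck chain inflates it immediately.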
Here’s what nobody tells you about learning MCMC.
Once you understand it, you’re free.
Free from “this integral can’t be solved analytically.” Free from “the posterior has no closed form.” Free from “this problem is too complex for exact computation.”
MCMC says: “Give me any distribution. Any shape. Any number of dimensions. I don’t need a formula. I don’t need an analytical solution. Just give me a function that’s proportional to the density, and I will map the entire landscape for you, one random step at a time.”
That’s not a limitation. That’s a superpower.
Every complex model in Bayesian statistics, every uncertainty estimate in modern machine learning, every probabilistic forecast you’ve ever seen on a news site, they all trace back to this one idea: you don’t need to solve the problem. You just need to wander through it intelligently.
MCMC is the art of finding answers by getting lost on purpose.
Published via Towards AI
Note: Article content contains the views of the contributing authors and not Towards AI.