Bayesian Inference in Python
Author(s): Shanmukh Dara
Discover the distinction between the Frequentist and Bayesian approaches.
"Uncertainty is the only certainty there is, and knowing how to live with insecurity is the only security." (John Allen Paulos)
Life is uncertain, and statistics can help us quantify certainty in this uncertain world by applying the concepts of probability and inference. However, there are two major approaches to inference:
- Frequentist (Classical)
- Bayesian
Let's look at both of these using the well-known and simple coin flip example.
What can we say about the fairness of a coin after tossing it 10 times and observing 1 head and 9 tails?
Although 10 is a small sample size for determining the fairness of a coin, I chose it because collecting a large number of samples in real life can be difficult, especially when risk and opportunity costs are high.
Frequentist Inference:
This is the one that is commonly taught in colleges and frequently used in industry, but it is not without criticism.
Basic idea: The frequentist method computes a p-value, which is the probability of observing an outcome at least as extreme as the one in our observed data, assuming the null hypothesis is true.
The null and alternative hypotheses, in this case, are:
H0: The coin is fair
H1: The coin is biased
Assuming the coin is fair, we calculate the probability of seeing something at least as extreme as 1 head in 10 tosses: that is 0 or 1 heads, and, on the other side, 9 or 10 heads. Adding all four results gives a probability of about 0.021.
from scipy.stats import binom
# Two-sided p-value: 0 or 1 heads, plus the mirror-image 9 or 10 heads
p_value = (binom.pmf(k=0, n=10, p=0.5) + binom.pmf(k=1, n=10, p=0.5)
           + binom.pmf(k=9, n=10, p=0.5) + binom.pmf(k=10, n=10, p=0.5))
So, since p = 0.021 < 0.05, we reject the null hypothesis at the 5% significance level and conclude that the coin is biased.
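As a cross-check, SciPy ships an exact binomial test (scipy.stats.binomtest, available in SciPy 1.7 and later) that reports the same two-sided p-value; a minimal sketch:

from scipy.stats import binomtest
# Exact two-sided test for 1 head in 10 tosses of a supposedly fair coin
result = binomtest(k=1, n=10, p=0.5, alternative='two-sided')
print(result.pvalue)  # ~0.021, matching the manual sum above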
Everything appears to be in order so far, and we've established that the coin is biased. So, what's the issue?
Assume the observer was dissatisfied with the answer and went on to observe ten more samples with the same coin. Will he draw the same inference about the coin? Not necessarily. This time, he could see 5 heads and 5 tails and conclude that the coin is not biased. As a result, we have trouble reproducing the same results under the same conditions.
In addition, we'll never learn how plausible our null hypothesis is, because we assumed it was true: you can make an absurd claim and then design an experiment to reject or fail to reject it.
So, do we have any other robust methods that can overcome these constraints?
Bayesian Inference:
Bayesian inference takes a very different viewpoint from the frequentist approach: instead of estimating a single value for a population parameter from the observed data, we characterize it with an entire probability distribution, which represents our knowledge and uncertainty about it.
We start with a prior probability, which represents what we know about the parameters before seeing any data; then we make observations and, based on them, update the prior into a posterior using Bayes' rule.
So let's start with Bayes' rule:

P(theta | Data) = P(Data | theta) × P(theta) / P(Data)

Theta in the equation is the probability of success in the experiment, and Data is the data observed from the experiment. Let's understand each term in the equation; a small numerical sketch follows the list.
- P(theta), also called the prior: the probability of success before seeing the data; this comes from our prior knowledge or beliefs.
- P(theta | Data), also called the posterior: the probability that the hypothesis is true after considering the observed data.
- P(Data | theta), also called the likelihood: the evidence about the hypothesis provided by the data.
- P(Data), also called the normalizing constant (or evidence): the total probability of the data, taking into account all possible hypotheses.
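To make these terms concrete, here is a small numerical sketch (my own illustration, evaluating each piece of Bayes' rule over a grid of theta values with a uniform prior, anticipating the grid approximation used later) for the 1-head-in-10-tosses data:

import numpy as np
from scipy.stats import binom

theta = np.linspace(0, 1, 1000)              # candidate values of theta
prior = np.ones_like(theta) / len(theta)     # P(theta): uniform prior
likelihood = binom.pmf(k=1, n=10, p=theta)   # P(Data | theta): 1 head in 10 tosses
evidence = np.sum(likelihood * prior)        # P(Data): total probability of the data
posterior = likelihood * prior / evidence    # P(theta | Data): Bayes' rule
print(theta[np.argmax(posterior)])           # most probable theta, about 0.1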
Inference about an event can be made in three ways:
- Entirely based on your prior experience, without any observed data, i.e. based on p(theta), a.k.a. the prior; this is the non-statistician's approach.
- The second way is the frequentist method, in which we ask how rare it is to observe this outcome if the hypothesis is true, i.e. p(Data | theta), a.k.a. the likelihood, which means we rely on the observed data and nothing else.
- Finally, we have Bayesian inference, which uses both our prior knowledge p(theta) and the observed data to construct a posterior distribution. So one key difference between frequentist and Bayesian inference is the use of prior knowledge, i.e. p(theta).
So, in Bayesian reasoning, we begin with a prior belief. Then we examine the observed data, allowing each data point to alter our previous understanding slightly. The more data we observe, the further we can stray from our initial beliefs. This procedure finally results in a posterior belief.
But there's a catch: if the prior reflects our previous understanding, doesn't this introduce experimenter's bias? What if we obtain different results because of different priors? These are fair questions; however, the prior is an essential part of Bayesian inference, and one should make sure the chosen prior is reasonable.
So, to compute the posterior precisely, we need a prior and a likelihood for the hypothesis, but in most experiments, the prior probabilities on hypotheses are unknown. In this scenario, we can assume that p(theta) is uniformly distributed between 0 and 1.
Let's return to our coin flip example:
p(theta) is the probability of success (heads). Before seeing any data, what is this probability? As we don't have any prior knowledge of the coin, we assume it's uniformly distributed between 0 and 1.
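As a side note (my own sanity check, relying on the standard Beta-Binomial conjugacy result rather than anything computed in this article): with a uniform, i.e. Beta(1, 1), prior and 1 head out of 10 tosses, the posterior has the exact closed form Beta(2, 10), which the grid approximation below should reproduce.

from scipy.stats import beta
# Conjugate update: Beta(1 + heads, 1 + tails) = Beta(1 + 1, 1 + 9)
exact_posterior = beta(2, 10)
print(exact_posterior.mean())          # ~0.167, posterior mean of P(heads)
print(exact_posterior.interval(0.95))  # central 95% credible interval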
Let's start with the Bayesian analysis. The posterior can be computed in several ways; in this case, I'm using grid approximation to estimate it.
Assume the observations are [T, T, T, T, T, T, T, T, T, H].
# Let's import the required packages
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

# Take a grid of 1000 points between 0 and 1
probability_grid = np.linspace(0, 1, num=1000)

# We assumed a uniform prior between 0 and 1 (i.e. equal probability across all 1000 grid points)
prior_probability = np.ones(1000)

# Now let's toss the coin once and observe a tail (0 heads in 1 toss)
probability_data = binom.pmf(k=0, n=1, p=probability_grid)  # likelihood of a tail at each grid point

# Bayes' rule on the grid: posterior is proportional to prior times likelihood
posterior = prior_probability * probability_data
posterior = (posterior / sum(posterior)) * 1000  # renormalize so the values form a density over the grid
This process of posterior updating has to be repeated for all ten trials, and the posterior after each trial is shown below:
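For completeness, a minimal sketch of that sequential update (reusing probability_grid, prior_probability, binom, and plt from the snippet above, with tails encoded as 0 and heads as 1):

# Update the posterior one toss at a time: 0 = tail, 1 = head
observations = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
posterior = prior_probability.copy()
for obs in observations:
    likelihood = binom.pmf(k=obs, n=1, p=probability_grid)  # likelihood of this toss
    posterior = posterior * likelihood
    posterior = (posterior / sum(posterior)) * 1000         # renormalize over the grid
plt.plot(probability_grid, posterior)
plt.xlabel('P(heads)')
plt.ylabel('posterior density')
plt.show()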
I iterated through all of the trials separately to demonstrate the procedure. We can use the PyMC3 package to do it in a single shot.
# Import the required packages
import pymc3 as pm
import arviz as az

# Observed tosses: 0 = tail, 1 = head
tosses = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

with pm.Model() as model:
    # Define the prior: Beta(1, 1) is the uniform prior we assumed above
    hypo = pm.Beta('hypo', 1, 1)
    # Define the likelihood of each toss given the probability of heads
    data = pm.Bernoulli('data', hypo, observed=tosses)
    # Sample from the posterior
    trace = pm.sample(return_inferencedata=True)

# Visualize the posterior with a 95% highest-density interval
az.plot_posterior(trace, hdi_prob=0.95)
So, what did we accomplish? Rather than simply stating whether or not the coin is biased, we obtained the entire posterior distribution of the probability of success, from which we can answer questions such as the most likely probability of heads or a credible interval for it.
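For example (assuming the trace produced by the PyMC3 model above), ArviZ can summarize the posterior numerically, and the raw samples can answer direct questions about the coin:

# Posterior mean, standard deviation, and 95% HDI for P(heads)
print(az.summary(trace, hdi_prob=0.95))

# Probability that the coin favours heads, straight from the posterior samples
samples = trace.posterior['hypo'].values.flatten()
print((samples > 0.5).mean())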
I'll conclude with a cartoon (xkcd's well-known "Frequentists vs. Bayesians" strip, in which a detector claims the sun has exploded) to help you remember the distinction between Bayesian and frequentist thinking.
Frequentist inference does not account for the "prior" probability of the sun exploding, and because it is based solely on observed data, a purely frequentist method can run into problems. The frequentist examines how likely the observation is under the null hypothesis, but does not consider whether the alternative (the sun exploding) is even more implausible a priori.
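To put rough numbers on that (the 1/36 lying probability comes from the cartoon's dice-rolling detector; the one-in-a-million prior is my own illustrative guess): even though the detector's "yes" would be significant at the 5% level, Bayes' rule says the sun almost certainly has not exploded.

# Detector lies with probability 1/36; illustrative prior that the sun exploded
p_lie = 1 / 36
prior_exploded = 1e-6  # hypothetical prior, chosen only for illustration

p_yes_given_exploded = 1 - p_lie   # detector tells the truth
p_yes_given_fine = p_lie           # detector lies
evidence = p_yes_given_exploded * prior_exploded + p_yes_given_fine * (1 - prior_exploded)
posterior_exploded = p_yes_given_exploded * prior_exploded / evidence
print(posterior_exploded)  # ~3.5e-5: the sun is almost certainly fine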