Bayesian analysis and decision theory: application to determine a decision point for classification problems
Author(s): Greg Postalian-Yrausquin
Originally published on Towards AI.
A dilemma often presented in classification problems whose output is a number is determining the cutoff point between the categories. For example, the output of a neural network might be a number between 0 and 1, say 0.7: does that correspond to the positive (1) category or to the negative (0) category? Common sense says to use 0.5 as the decision marker, but what if there is a higher risk in underestimating the positives? Or if the classes are unbalanced?
A correct estimation of the cut point in these cases warrants some review of probability and Bayesian theory. When talking about probabilities, three rules take center stage in the derivations that follow:
- Sum rule:

p(x) = Σy p(x, y)

where, considering x and y as two events, the probability of x is the sum of the probabilities of x occurring together with each possible value of y.
- Product rule:

p(x, y) = p(y|x) p(x)

This means that the probability of x and y occurring together is equal to the probability of y occurring given that x happened, times the probability of x occurring.
- Bayes' theorem:

p(y|x) = p(x|y) p(y) / p(x)

Bayes' theorem is a very powerful tool that provides a way to update the probability of an event (in this case, event y) after getting some new information, represented here by p(x|y). The new, updated probability is then p(y|x).
In detail, p(y) is named the prior, the probability of y before the new information is obtained; p(x|y) is the probability of the new event x happening provided that y holds, which carries the new data or information about the system; and p(x) is the marginal probability of the event x regardless of the value of y.
Bayes' theorem can be expressed in any of the following forms, all derived from the original equation and the two rules explained above:

p(y|x) = p(x|y) p(y) / p(x)

p(y|x) = p(x, y) / p(x)

p(y|x) = p(x|y) p(y) / Σy' p(x|y') p(y')
To illustrate the power of Bayes' theorem, I will use an example. Let's say that having the disease is event Y (not having it is Y0, and Y1 is the unfortunate event of being sick), and getting a positive blood test for the disease is the event X. The probability of having the disease over the whole population is a small number, p(y). About the test: someone who has the disease will test positive with probability p(x|y), and the fraction of the population that tests positive, regardless of whether they are sick or not, is p(x), which therefore includes both the true positives and the false positives.
Let's plug in some numbers for illustration:
p(y) = prob. of having the disease, or people sick over the whole population: 1 in 10,000 = 0.0001
p(x|y) = probability of getting a positive test if the disease is present (the effectiveness of the test itself): 0.9 / the test detects the disease 90% of the time
p(x) = probability of a positive test / the fraction of people who take the test and test positive, regardless of whether they are really sick or not: 1 in 1,000 = 0.001
With this, applying Bayes' theorem: p(y|x) = (0.9 × 0.0001) / 0.001 = 0.09 = 9%
This means that even after testing positive, the actual chance of having the disease is still low, and more tests are needed to produce a diagnosis. Still, applying Bayes' theorem has updated the probability of having the disease for this individual from 1 in 10,000 to almost 1 in 10.
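As a quick check, the same numbers can be plugged into a few lines of Python (a minimal sketch; the variable names are mine, purely for illustration):

#Bayes' theorem with the numbers from the example above
p_y = 0.0001       #prior: 1 in 10,000 has the disease
p_x_given_y = 0.9  #likelihood: the test detects the disease 90% of the time
p_x = 0.001        #marginal: 1 in 1,000 tests comes back positive

p_y_given_x = p_x_given_y * p_y / p_x
print(p_y_given_x)  #0.09, i.e., 9%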
In reality, these blood tests, just like the numerical outcomes of both regression and classification problems in neural networks, are not binary but continuous variables. In this situation, the question is where to "cut" the results and assign a positive or negative value to the outcome. Common sense dictates using the middle point (0.5 if the last layer is a softmax, for example), but that is not the only option, and it ignores issues like different risks or unbalanced training classes.
Considering the risks is very important in the example above, because a false positive (testing positive without really being sick) only carries the small cost of being annoyed by further testing, while a false negative (being sick and getting a negative test) means further spread of the disease and failure to receive care for it.
The next chart shows what the distributions look like: the blue one is the distribution of the healthy individuals and the red one that of the sick ones. The X axis is the test result (for example, the value of protein xxx in the blood), and the Y axis is a value representing quantity. As these are probability distributions, each is normalized so that the area under it totals one.
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize

#define mean and standard deviation
mu, sg = 10, 1
#series of 100,000 points
s = np.random.normal(mu, sg, 100000)
#plot the histogram and create bins
count, bins, ignored = plt.hist(s, 500, density=True)

#normal (Gaussian) density formula
def standardDistribution(mu, sg, x):
    y = (1/np.sqrt(2*np.pi*sg**2))*np.exp(-((x-mu)**2)/(2*sg**2))
    return y
#probability distributions of the test values (x) for each class
#for negative test (healthy)
mu0, sg0 = 50, 15
x = np.arange(0.0, 150.0, 0.01)
probY0_X = standardDistribution(mu0, sg0, x)
#for positive test (sick)
mu1, sg1 = 100, 20
probY1_X = standardDistribution(mu1, sg1, x)

fig, (ax1, ax2) = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(15,5))
ax1.plot(x, probY0_X, linewidth=2, color='b')
ax1.plot(x, probY1_X, linewidth=2, color='r')
ax1.set_title('The joint distributions of Y0 and Y1 with X')
#their sum is proportional to the marginal p(x), assuming equal priors
ax2.plot(x, probY1_X + probY0_X, linewidth=2, color='g')
ax2.set_title('Probability of X')
plt.show()
If we don't know anything about the individuals, whether they are sick or not, we only see the green chart, which is the probability distribution of the results of the test. We can see by intuition that it has two modes, which correspond to the means of the sick and healthy cases.
Note that in this process I am going to assume that both distributions are normal or close to normal, which will be the case if each result behaves like the average of a significant number of random samples (central limit theorem).
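As a quick illustration of that assumption (a toy sketch of my own, not part of the original analysis), averaging many uniform random draws already produces a near-normal histogram:

#central limit theorem, toy illustration:
#averages of 50 uniform draws are approximately normal
avgs = np.random.uniform(0, 1, size=(100000, 50)).mean(axis=1)
plt.hist(avgs, bins=200, density=True)
plt.title('Averages of 50 uniform samples look Gaussian')
plt.show()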
Let's review the first chart in detail; we see four regions that are of interest in our case:
- True positive: TP -> Good! Accurate identification of the class
- True negative: TN -> Good! Accurate identification of the class
- False negative: FN -> Bad! The result is attributed to class 0 (no disease in our example) when it really belongs to class 1
- False positive: FP -> Bad! The result is attributed to class 1 when it really belongs to class 0
The areas of the last two regions (FN and FP) measure how wrong the results are, so their sum is a good error function to minimize in order to get the best results from the model:

E(M) = ∫_{−∞}^{M} p(x, Y1) dx + ∫_{M}^{+∞} p(x, Y0) dx

where M is the candidate cut point. Writing this out explicitly just requires remembering that these joint probabilities are Gaussian.

For more than two outcomes, the error area generalizes to:

E = Σk ∫_{outside Rk} p(x, Yk) dx

where Rk is the region of x assigned to class k.
At this point it is easy to introduce a bias in the error to account for risk. In our example, of the two "bad" results, we want to penalize the false negative more heavily. We introduce factors Rfn and Rfp into the error calculation to account for the respective penalties:

E(M) = Rfn ∫_{−∞}^{M} p(x, Y1) dx + Rfp ∫_{M}^{+∞} p(x, Y0) dx
We now have an optimization problem: find the minimum of the error-area function. The derivatives of the integrals with respect to M are just the Gaussian integrands evaluated at M, so setting dE/dM = 0 gives:

Rfn p(M, Y1) − Rfp p(M, Y0) = 0

M is the cut point that minimizes the error as we have defined it, given the risk assigned to each error type.
The next step is to solve this last equation, which I am going to do in Python:
#equation to solve, derived above
#for negative test
mu0, sg0 = 50, 15
#for positive test
mu1, sg1 = 100, 20

def func(w):
    #derivative of the weighted error area at w, dropping the common 1/sqrt(2*pi) factor
    r = (rFN/sg1)*(np.exp(-((w-mu1)**2)/(2*sg1**2))) - (rFP/sg0)*(np.exp(-((w-mu0)**2)/(2*sg0**2)))
    return r

#solution with no penalty
rFN, rFP = 1, 1
sol0 = scipy.optimize.fsolve(func, x0=60)
#solution with penalty 5:1
rFN, rFP = 5, 1
sol1 = scipy.optimize.fsolve(func, x0=60)
#solution with penalty 10:1
rFN, rFP = 10, 1
sol2 = scipy.optimize.fsolve(func, x0=60)
#plot with the solutions
plt.figure(figsize=(12, 10))
plt.plot(x, probY0_X, linewidth=1, color='b', label='Y0 -> healthy')
plt.plot(x, probY1_X, linewidth=1, color='r', label='Y1 -> disease')
plt.axvline(x=sol0, color='black', ls='--', label='Cut, no penalty')
plt.axvline(x=sol1, color='gray', ls='--', label='Cut, penalty 5:1')
plt.axvline(x=sol2, color='brown', ls='--', label='Cut, penalty 10:1')
plt.legend(bbox_to_anchor=(1.0, 1), loc='upper left')
plt.show()
The vertical lines represent the solutions for the best point M under different weights or penalties, illustrating the impact of the manually introduced asymmetry between the categories.
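As a sanity check (my own addition, not part of the original code), we can also minimize the weighted error area directly by numerical integration and confirm that it lands on the same cut point:

#sanity check: minimize the weighted error area E(M) directly
def errorArea(M, rFN=5, rFP=1):
    #FN area: sick cases below the cut; FP area: healthy cases above it
    fn = np.trapz(probY1_X[x < M], x[x < M])
    fp = np.trapz(probY0_X[x >= M], x[x >= M])
    return rFN*fn + rFP*fp

res = scipy.optimize.minimize_scalar(errorArea, bounds=(0, 150), method='bounded')
print(res.x)  #should be close to sol1 (penalty 5:1)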
Applying Bayes' theorem, these are the same results plotted over the posterior functions p(Y|X):
#plot of p(Y|x) for Y0 and Y1
#with equal priors, the posterior is each density divided by the sum of both
plt.figure(figsize=(12, 10))
plt.plot(x, probY0_X/(probY1_X + probY0_X), linewidth=1, color='b', label='Y0 -> healthy')
plt.plot(x, probY1_X/(probY1_X + probY0_X), linewidth=1, color='r', label='Y1 -> disease')
plt.axvline(x=sol0, color='black', ls='--', label='Cut, no penalty')
plt.axvline(x=sol1, color='gray', ls='--', label='Cut, penalty 5:1')
plt.axvline(x=sol2, color='brown', ls='--', label='Cut, penalty 10:1')
plt.legend(bbox_to_anchor=(1.0, 1), loc='upper left')
plt.show()
In a real-life machine learning scenario, we can attack this same kind of optimization problem in three different ways:
- Use the joint probability p(y, x) of y and x occurring, as I just did above (the two distributions of having a blood value x while having or not having the disease), computed on the training set, and then determine the best cut point.
- Use the posterior p(Y|X), the probability of having the disease given a test result. The cut point is again determined as an optimization problem (see the sketch after this list).
- Train a direct classification model with binary output, making sure the labels in the training set account for the different risks, or resampling in the case of unbalanced classes. This method can be quicker, but it has several drawbacks: it does not give much information about possible factors (real-life problems are generally multivariable), it removes the possibility of manually accounting for risk, and it offers no option to reject low-confidence results (those close to the decision point).
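For the second approach, here is a minimal sketch of what that optimization could look like, reusing the arrays from the code above (my own illustration, not part of the original walkthrough). The condition Rfn p(Y1|x) = Rfp p(Y0|x) is equivalent to the one we solved on the joint distributions, since both sides are divided by the same p(x):

#cut point from the posteriors: find x where rFN*p(Y1|x) equals rFP*p(Y0|x)
postY0 = probY0_X/(probY1_X + probY0_X)
postY1 = probY1_X/(probY1_X + probY0_X)
rFN, rFP = 5, 1
#index where the weighted posteriors cross
idx = np.argmin(np.abs(rFN*postY1 - rFP*postY0))
print(x[idx])  #should be close to sol1 from fsolve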