Bayesian analysis and decision theory: application to determine a decision point for classification problems
Author(s): Greg Postalian-Yrausquin
Originally published on Towards AI.
A dilemma often presented in classification problems whose output is a number is determining the cutoff point between the categories. For example, the output of a neural network might be a number between 0 and 1, say 0.7: does that correspond to the positive (1) category or to the negative (0) category? Common sense says to use 0.5 as the decision marker, but what if there is a higher risk in underestimating the positives? Or if the classes are unbalanced?
A correct estimation of the cut point in these cases warrants some review of probability and Bayesian theory. When talking about probabilities, three rules take center stage in the derivations that follow:
- Sum rule:

p(x) = Σy p(x, y)

where, considering x and y as two events, the probability of x is the sum of the probabilities of x occurring together with each possible value of y.
- Product rule:

p(x, y) = p(y|x) p(x)

This means that the probability of x and y occurring together is equal to the probability of y occurring given that x happened, times the probability of x occurring.
- Bayes' theorem:

p(y|x) = p(x|y) p(y) / p(x)

Bayes' theorem is a very powerful tool that provides a way to update the probability of an event (in this case, event y) after getting some new information, represented here by p(x|y). The new, updated probability is then p(y|x).
In detail, p(y) is named the prior, the probability of y before the new information is obtained; p(x|y) is the probability of the new event x happening provided that y holds, which carries the new data or information about the system; and p(x) is the marginal probability of the event x regardless of the value of y.
Bayes' theorem can be expressed in any of the following forms, all derived from the original equation and the two rules explained above:

p(y|x) = p(x|y) p(y) / p(x)

p(y|x) = p(x, y) / p(x)

p(y|x) = p(x|y) p(y) / Σy' p(x|y') p(y')
To illustrate the power of Bayes' theorem, I will use an example. Let's say that having the disease is event Y (not having it is Y0, and Y1 is the unfortunate event of being sick), and getting a positive blood test for the disease is the event X. The probability of having the disease over the whole population is a small number, p(y). About the test: someone who has the disease will test positive with probability p(x|y), and the fraction of the population that tests positive, regardless of whether they are sick or not, is p(x), which therefore includes both the true positives and the false positives.
Let's plug in some numbers for illustration:
p(y) = prob. of having the disease, or people sick over the whole population: 1 in 10,000 = 0.0001
p(x|y) = probability of getting a positive test if the disease is present (the effectiveness of the test itself): 0.9 / the test detects the disease 90% of the time
p(x) = probability of a positive test / the fraction of people who take the test and test positive, regardless of whether they are really sick or not: 1 in 1,000 = 0.001
With this, applying Bayes' theorem: p(y|x) = (0.9 × 0.0001) / 0.001 = 0.09 = 9%
This means that even after testing positive, the actual chance of having the disease is still low, and more tests are needed to produce a diagnosis. Still, applying Bayes' theorem has updated the probability of having the disease for this individual from 1 in 10,000 to almost 1 in 10.
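As a quick check, the same numbers can be plugged into a few lines of Python (a minimal sketch; the variable names are mine, purely for illustration):

#Bayes' theorem with the numbers from the example above
p_y = 0.0001       #prior: 1 in 10,000 has the disease
p_x_given_y = 0.9  #likelihood: the test detects the disease 90% of the time
p_x = 0.001        #marginal: 1 in 1,000 tests comes back positive

p_y_given_x = p_x_given_y * p_y / p_x
print(p_y_given_x)  #0.09, i.e., 9%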
In reality, these blood tests, just like the numerical outcomes of both regression and classification problems in neural networks, are not binary but continuous variables. In this situation, the question is where to "cut" the results and assign a positive or negative value to the outcome. Common sense dictates using the middle point (0.5 if the last layer is a softmax, for example), but that is not the only option, and it ignores issues like different risks or unbalanced training classes.
Considering the risks is very important in the example above, because a false positive (testing positive without really being sick) only carries the small cost of being annoyed by further testing, while a false negative (being sick and getting a negative test) means further spread of the disease and failure to receive care for it.
The next chart shows what the distributions look like: the blue one is the distribution of the healthy individuals and the red one that of the sick ones. The X axis is the test result (for example, the value of protein xxx in the blood), and the Y axis is a value representing quantity. As these are probability distributions, each is normalized so that the area under it totals one.
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize

#define mean and standard deviation
mu, sg = 10, 1
#series of 100,000 points
s = np.random.normal(mu, sg, 100000)
#plot the histogram and create bins
count, bins, ignored = plt.hist(s, 500, density=True)

#normal (Gaussian) density formula
def standardDistribution(mu, sg, x):
    y = (1/np.sqrt(2*np.pi*sg**2))*np.exp(-((x-mu)**2)/(2*sg**2))
    return y
#probability distributions of the test values (x) for each class
#for negative test (healthy)
mu0, sg0 = 50, 15
x = np.arange(0.0, 150.0, 0.01)
probY0_X = standardDistribution(mu0, sg0, x)
#for positive test (sick)
mu1, sg1 = 100, 20
probY1_X = standardDistribution(mu1, sg1, x)

fig, (ax1, ax2) = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(15,5))
ax1.plot(x, probY0_X, linewidth=2, color='b')
ax1.plot(x, probY1_X, linewidth=2, color='r')
ax1.set_title('The joint distributions of Y0 and Y1 with X')
#their sum is proportional to the marginal p(x), assuming equal priors
ax2.plot(x, probY1_X + probY0_X, linewidth=2, color='g')
ax2.set_title('Probability of X')
plt.show()
If we don't know anything about the individuals, whether they are sick or not, we only see the green chart, which is the probability distribution of the results of the test. We can see by intuition that it has two modes, which correspond to the means of the sick and healthy cases.
Note that in this process I am going to assume that both distributions are normal or close to normal, which will be the case if each result behaves like the average of a significant number of random samples (central limit theorem).
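As a quick illustration of that assumption (a toy sketch of my own, not part of the original analysis), averaging many uniform random draws already produces a near-normal histogram:

#central limit theorem, toy illustration:
#averages of 50 uniform draws are approximately normal
avgs = np.random.uniform(0, 1, size=(100000, 50)).mean(axis=1)
plt.hist(avgs, bins=200, density=True)
plt.title('Averages of 50 uniform samples look Gaussian')
plt.show()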
Let's review the first chart in detail; we see four regions that are of interest in our case:
- True positive: TP -> Good! Accurate identification of the class
- True negative: TN -> Good! Accurate identification of the class
- False negative: FN -> Bad! The result is attributed to class 0 (no disease in our example) when it really belongs to class 1
- False positive: FP -> Bad! The result is attributed to class 1 when it really belongs to class 0
The areas of the last two regions (FN and FP) measure how wrong the results are, so their sum is a good error function to minimize in order to get the best results from the model:

E(M) = ∫_{−∞}^{M} p(x, Y1) dx + ∫_{M}^{+∞} p(x, Y0) dx

where M is the candidate cut point. Writing this out explicitly just requires remembering that these joint probabilities are Gaussian.

For more than two outcomes, the error area generalizes to:

E = Σk ∫_{outside Rk} p(x, Yk) dx

where Rk is the region of x assigned to class k.
At this point it is easy to introduce a bias in the error to account for risk. In our example, of the two "bad" results, we want to penalize the false negative more heavily. We introduce factors Rfn and Rfp into the error calculation to account for the respective penalties:

E(M) = Rfn ∫_{−∞}^{M} p(x, Y1) dx + Rfp ∫_{M}^{+∞} p(x, Y0) dx
We now have an optimization problem: find the minimum of the error-area function. The derivatives of the integrals with respect to M are just the Gaussian integrands evaluated at M, so setting dE/dM = 0 gives:

Rfn p(M, Y1) − Rfp p(M, Y0) = 0

M is the cut point that minimizes the error as we have defined it, given the risk assigned to each error type.
The next step is to solve this last equation, which I am going to do in Python:
#equation to solve, derived above
#for negative test
mu0, sg0 = 50, 15
#for positive test
mu1, sg1 = 100, 20

def func(w):
    #derivative of the weighted error area at w, dropping the common 1/sqrt(2*pi) factor
    r = (rFN/sg1)*(np.exp(-((w-mu1)**2)/(2*sg1**2))) - (rFP/sg0)*(np.exp(-((w-mu0)**2)/(2*sg0**2)))
    return r

#solution with no penalty
rFN, rFP = 1, 1
sol0 = scipy.optimize.fsolve(func, x0=60)
#solution with penalty 5:1
rFN, rFP = 5, 1
sol1 = scipy.optimize.fsolve(func, x0=60)
#solution with penalty 10:1
rFN, rFP = 10, 1
sol2 = scipy.optimize.fsolve(func, x0=60)
#plot with the solutions
plt.figure(figsize=(12, 10))
plt.plot(x, probY0_X, linewidth=1, color='b', label='Y0 -> healthy')
plt.plot(x, probY1_X, linewidth=1, color='r', label='Y1 -> disease')
plt.axvline(x=sol0, color='black', ls='--', label='Cut, no penalty')
plt.axvline(x=sol1, color='gray', ls='--', label='Cut, penalty 5:1')
plt.axvline(x=sol2, color='brown', ls='--', label='Cut, penalty 10:1')
plt.legend(bbox_to_anchor=(1.0, 1), loc='upper left')
plt.show()
The vertical lines represent the solutions for the best point M under different weights or penalties, illustrating the impact of the manually introduced asymmetry between the categories.
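As a sanity check (my own addition, not part of the original code), we can also minimize the weighted error area directly by numerical integration and confirm that it lands on the same cut point:

#sanity check: minimize the weighted error area E(M) directly
def errorArea(M, rFN=5, rFP=1):
    #FN area: sick cases below the cut; FP area: healthy cases above it
    fn = np.trapz(probY1_X[x < M], x[x < M])
    fp = np.trapz(probY0_X[x >= M], x[x >= M])
    return rFN*fn + rFP*fp

res = scipy.optimize.minimize_scalar(errorArea, bounds=(0, 150), method='bounded')
print(res.x)  #should be close to sol1 (penalty 5:1)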
Applying Bayes' theorem, these are the same results plotted over the posterior functions p(Y|X):
#plot of p(Y|x) for Y0 and Y1
#with equal priors, the posterior is each density divided by the sum of both
plt.figure(figsize=(12, 10))
plt.plot(x, probY0_X/(probY1_X + probY0_X), linewidth=1, color='b', label='Y0 -> healthy')
plt.plot(x, probY1_X/(probY1_X + probY0_X), linewidth=1, color='r', label='Y1 -> disease')
plt.axvline(x=sol0, color='black', ls='--', label='Cut, no penalty')
plt.axvline(x=sol1, color='gray', ls='--', label='Cut, penalty 5:1')
plt.axvline(x=sol2, color='brown', ls='--', label='Cut, penalty 10:1')
plt.legend(bbox_to_anchor=(1.0, 1), loc='upper left')
plt.show()
In a real-life machine learning scenario, we can attack this same kind of optimization problem in three different ways:
- Use the joint probability p(y, x) of y and x occurring, as I just did above (the two distributions of having a blood value x while having or not having the disease), computed on the training set, and then determine the best cut point.
- Use the posterior p(Y|X), the probability of having the disease given a test result. The cut point is again determined as an optimization problem (see the sketch after this list).
- Train a direct classification model with binary output, making sure the labels in the training set account for the different risks, or resampling in the case of unbalanced classes. This method can be quicker, but it has several drawbacks: it does not give much information about possible factors (real-life problems are generally multivariable), it removes the possibility of manually accounting for risk, and it offers no option to reject low-confidence results (those close to the decision point).
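For the second approach, here is a minimal sketch of what that optimization could look like, reusing the arrays from the code above (my own illustration, not part of the original walkthrough). The condition Rfn p(Y1|x) = Rfp p(Y0|x) is equivalent to the one we solved on the joint distributions, since both sides are divided by the same p(x):

#cut point from the posteriors: find x where rFN*p(Y1|x) equals rFP*p(Y0|x)
postY0 = probY0_X/(probY1_X + probY0_X)
postY1 = probY1_X/(probY1_X + probY0_X)
rFN, rFP = 5, 1
#index where the weighted posteriors cross
idx = np.argmin(np.abs(rFN*postY1 - rFP*postY0))
print(x[idx])  #should be close to sol1 from fsolve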