Understanding Why There Is No Such Thing as ‘Correct Probability’ in Data Science
Last Updated on January 25, 2024 by Editorial Team
Author(s): Peyman Kor
Originally published on Towards AI.
Probability Should be Conditioned on your Current State of Information
Intro
In the field of data science, individuals frequently use the term “Probability”. However, a fundamental concept needs to be emphasized:
There is no such thing as a Correct Probability in Data Science.
Let’s go a little deeper. We heavily rely on building machine learning models that give the probability of future events.
For example: what is the probability that the next applicant will default on a bank loan, or that a particular transaction is fraudulent?
Uncertainty Vs. Variability:
One common confusion I see is the failure to distinguish between two terms: uncertainty and variability.
The confusion arises because both uncertainty and variability can be expressed in terms of probability, yet they are different concepts.
Variability
Variability is a state of things: it is quantified by the frequency of values that have actually been observed.
For example, you can quantify the variability of the heights of 100 students at a high school by plotting the empirical distribution (the relative frequency of each observed height) of the 100 data points.
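A minimal sketch of what quantifying variability could look like. The 100 heights below are synthetic: they are drawn from a seeded normal distribution whose mean and spread (170 cm, 8 cm) are assumptions chosen only for illustration, not real measurements. The helper name `empirical_pmf` is also hypothetical.

```python
import random
from collections import Counter

# Synthetic stand-in for 100 measured heights in cm. NOT real data:
# the mean (170) and spread (8) are illustrative assumptions.
rng = random.Random(42)
heights_cm = [round(rng.gauss(170, 8)) for _ in range(100)]


def empirical_pmf(values):
    """Return {value: relative frequency} for the observed values."""
    counts = Counter(values)
    n = len(values)
    return {value: count / n for value, count in sorted(counts.items())}


pmf = empirical_pmf(heights_cm)

# Text histogram: one '#' per student observed at each height.
for height, freq in pmf.items():
    bar = "#" * round(freq * len(heights_cm))
    print(f"{height:>3} cm  {freq:.2f}  {bar}")
```

Everything here describes students who have already been measured; nothing in it is a statement about a student we have not seen.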
Uncertainty
Uncertainty is a state of mind: It is quantified by the probability of a future event being true or not.
Continuing the previous example, when we want to assign a probability to the height of the next student (number 101), who hasn’t been measured yet, we are in the realm of uncertainty.
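To make the contrast concrete, here is a small sketch under stated assumptions. The counts (12 of 100 students taller than 180 cm) are hypothetical, and the belief about the next student uses Laplace's rule of succession, which assumes a uniform prior over the unknown proportion. That prior is just one possible choice; someone with different information could reasonably use another.

```python
from fractions import Fraction


def observed_frequency(k, n):
    """Variability: share of the n measured students above the threshold."""
    return Fraction(k, n)


def belief_next_exceeds(k, n):
    """Uncertainty: probability that the NEXT student exceeds the threshold.

    Laplace's rule of succession, (k + 1) / (n + 2), which follows from a
    uniform prior over the unknown proportion (an assumption, not a fact).
    """
    return Fraction(k + 1, n + 2)


# Hypothetical counts: 12 of the 100 measured students are taller than 180 cm.
k, n = 12, 100
print(f"Observed frequency (variability): {float(observed_frequency(k, n)):.3f}")
print(f"Belief about student 101 (uncertainty): {float(belief_next_exceeds(k, n)):.3f}")

# With no measurements at all, the same rule gives a belief of 1/2.
print(f"Belief with no data: {float(belief_next_exceeds(0, 0)):.3f}")
```

The first number is a property of the data already collected; the second is a statement of belief that depends on the prior we were willing to assume.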
Uncertainty Comes from the Person
In probability classes, you might have heard the term “fair coin.” In reality, the factory producing the coin is not necessarily concerned with ensuring that the coin is perfectly fair.
It is our assumption, our belief, that the probabilities of heads and tails are equal.
Probability is nothing more than our degree of belief, and it is much more useful to think about it as a measure that is conditioned on the current state of information we have.
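As a rough illustration of belief conditioned on information, the sketch below uses a Beta prior over the coin's chance of heads (a standard Bayesian model, but my choice here, not something the coin dictates). The flip counts and prior strengths are hypothetical, and `prob_heads` is an illustrative helper name.

```python
def prob_heads(prior_heads, prior_tails, heads_seen=0, tails_seen=0):
    """Probability that the next flip is heads under a Beta prior.

    The prior is Beta(prior_heads, prior_tails); after observing the flips,
    the posterior is Beta(prior_heads + heads_seen, prior_tails + tails_seen),
    and the probability of heads on the next flip is the posterior mean.
    """
    a = prior_heads + heads_seen
    b = prior_tails + tails_seen
    return a / (a + b)


# Before any flips, a symmetric prior encodes the "fair coin" belief.
print(prob_heads(1, 1))

# Hypothetical evidence: 8 heads in 10 flips shifts a weakly held belief a lot...
print(prob_heads(1, 1, heads_seen=8, tails_seen=2))

# ...and shifts a strongly held belief in fairness (Beta(50, 50)) only a little.
print(prob_heads(50, 50, heads_seen=8, tails_seen=2))
```

Same coin, same flips, different probabilities: the difference comes entirely from what each person believed beforehand.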
Imagine two analysts sharing their weather predictions for tomorrow. One suggests that there is a 70% chance of rain tomorrow, while the other predicts a 50% chance.
Now, suppose tomorrow it rains. Who was correct? They both were. If it does not rain, they are still both equally correct.
There’s no such thing as an “actual probability” of rain; each person presented a belief about the chance of rain, conditioned on the state of information they had.
Bayesian Example: What is the Probability of Rain Tomorrow?
Imagine that I want to forecast the probability of rain in my city tomorrow.
I can simply look at last January's records and count the rainy days. Say 20 of the 30 days on record were rainy. With this information, I can assign the probability of rain for tomorrow:
P(Rain) = 20/30 ≈ 0.67
Now suppose you check the weather forecast, and it predicts heavy clouds and high humidity. Historically, such conditions were present on 70% of the days when it rained and on 30% of the days when it did not, so P(Cloud | Rain) = 0.7 and P(Cloud | No Rain) = 0.3.
This is new information. As we said, probability reflects our state of information, and it changes when the information we have changes.
We can apply Bayes' rule to update our belief:
P(Rain | Cloud) = P(Cloud | Rain) × P(Rain) / [P(Cloud | Rain) × P(Rain) + P(Cloud | No Rain) × P(No Rain)]
P(Rain | Cloud) = (0.7 × 2/3) / (0.7 × 2/3 + 0.3 × 1/3) ≈ 0.82
Now the probability of rain (with the weather forecast information) is around 0.82, which is different from the 2/3 (about 0.67) we started with.
As we receive more information, the probability we assign to the uncertain event changes, which makes probability a measure of our state of information.
Here is a short Python snippet that reproduces the example:
# Prior: 20 rainy days out of 30 observed
prior_prob_rain = 20 / 30
print(f"Prior Probability of Rain: {prior_prob_rain:.2f}")
# Likelihoods: P(heavy cloud | rain) and P(heavy cloud | no rain)
prob_heavycloud_rain = 0.7
prob_heavycloud_norain = 0.3
# Calculate the total probability of heavy cloud
prob_heavycloud = prior_prob_rain * prob_heavycloud_rain + (1 - prior_prob_rain) * prob_heavycloud_norain
# Calculate the updated probability of rain (Bayes' rule)
updated_prob_rain = prior_prob_rain * prob_heavycloud_rain / prob_heavycloud
# Print the results
print(f"Total Probability of Heavy Cloud: {prob_heavycloud:.2f}")
print(f"Updated Probability of Rain: {updated_prob_rain:.2f}")
Main Message:
The main message I wish to convey is about assigning probabilities to uncertain events.
Two data scientists can, legitimately, assign different probabilities to an uncertain event if they have different information or process the same information differently.
Probability is subjective and personal. There is no “the probability”, only “a probability”, and that probability is conditioned on the current state of information we have.
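The point about two data scientists can be sketched with the rain example's own numbers. Here analyst A only knows the historical base rate, while analyst B has also seen the heavy-cloud forecast. The function name `bayes_update` and the analyst labels are mine, introduced only for illustration.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Return P(H | E) from the prior P(H) and the two likelihoods.

    p_evidence_if_true  = P(E | H)
    p_evidence_if_false = P(E | not H)
    """
    numerator = prior * p_evidence_if_true
    total_evidence = numerator + (1 - prior) * p_evidence_if_false
    return numerator / total_evidence


# Shared information: 20 rainy days out of 30, as in the rain example.
base_rate = 20 / 30

# Analyst A has nothing beyond the base rate.
analyst_a = base_rate

# Analyst B also knows the forecast shows heavy clouds, with
# P(cloud | rain) = 0.7 and P(cloud | no rain) = 0.3 from the rain example.
analyst_b = bayes_update(base_rate, 0.7, 0.3)

print(f"Analyst A (base rate only): {analyst_a:.2f}")
print(f"Analyst B (base rate + forecast): {analyst_b:.2f}")
```

Neither number is “the” probability of rain. Each is a coherent summary of what that analyst knew at the time.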
References:
- [1] Foundations of Decision Analysis, Ronald Howard (Author), Ali Abbas (Author)
- [2] I appreciate the insightful dialog I had with Professor Reidar Brumer Bratvold and Professor Steve Begg. Their book on making good decisions will be a valuable resource to explore.