5 Paradoxes in Statistics Every Data Scientist Should Be Familiar With
Last Updated on July 17, 2023 by Editorial Team
Author(s): Simranjeet Singh
Originally published on Towards AI.
Introduction
Statistics is an essential part of data science, and it provides us with various tools and techniques to analyze and understand data. However, sometimes statistical results can be counterintuitive or even paradoxical, leading to confusion and misinterpretation. In this blog, we will explore five statistical paradoxes that every data scientist should be familiar with. We will explain what each paradox is, why it occurs, and how to avoid common pitfalls associated with it.
By the end of this blog, you will have a better understanding of some of the strange and unexpected outcomes that can arise from statistical analysis, and be better equipped to handle them in your work.
Table of Contents
- Accuracy Paradox
- False Positive Paradox
- Gambler's Fallacy
- Simpson's Paradox
- Berkson's Paradox
- Conclusion
Accuracy Paradox
The Accuracy Paradox refers to the situation where a model achieves high accuracy despite having no real predictive power. It can occur when the distribution of classes in the dataset is imbalanced. For example, consider a dataset where 90% of the observations belong to one class and 10% to another. A model that predicts the majority class for all observations will achieve an accuracy of 90%, even though it is not really predicting anything.
Here is an example in Python to illustrate the concept:
import numpy as np
from sklearn.metrics import accuracy_score
# create imbalanced dataset
y_true = np.array([0] * 900 + [1] * 100)
y_pred = np.zeros(1000)
# calculate accuracy
accuracy = accuracy_score(y_true, y_pred)
print('Accuracy:', accuracy)
In this example, we create an imbalanced dataset with 900 observations in one class and 100 in another class. We then create a model that predicts the majority class (0) for all observations. Despite not actually predicting anything, the model achieves an accuracy of 90%.
A real-world example of the Accuracy Paradox can be seen in medical testing. Consider a rare disease that affects only 0.1% of the population, and a test that is 99.9% accurate. The headline accuracy looks excellent, but at this prevalence the test produces roughly one false positive for every true positive: out of 100,000 people tested, about 100 actually have the disease, while about 100 healthy people are incorrectly flagged as sick.
Accuracy alone is therefore often a poor way to evaluate classification tasks; precision and recall are better alternatives.
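To make this concrete, here is a minimal sketch that recomputes the imbalanced example above with scikit-learn's precision and recall metrics (the zero_division argument simply suppresses the warning that arises because the model never predicts the positive class):
import numpy as np
from sklearn.metrics import precision_score, recall_score
# same setup as above: 900 negatives, 100 positives,
# and a "model" that always predicts the majority class (0)
y_true = np.array([0] * 900 + [1] * 100)
y_pred = np.zeros(1000, dtype=int)
# precision and recall for the positive (minority) class
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred)
print('Precision:', precision)  # 0.0 - nothing flagged, so no true positives
print('Recall:', recall)        # 0.0 - none of the 100 positives were found
Both come out to 0.0, exposing exactly what the 90% accuracy figure hides. These metrics are closely related to the False Positive Paradox, which we discuss next.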
False Positive Paradox
The False Positive Paradox occurs when a highly accurate test or model is applied to a population in which the condition it detects is rare: even a small false positive rate then produces more false positives than true positives, so a positive prediction is more likely to be wrong than right. This paradox can lead to incorrect conclusions and decision-making.
Here is a simple Python example of the False Positive Paradox, using a fraud-detection scenario with 1 fraudulent transaction among 10,000 and a classifier that catches every fraud but has a 5% false positive rate:
# Define the scenario: 10,000 transactions, 1 of which is fraud
normal_count = 9999
fraud_count = 1
# expected false alarms from a 5% false positive rate; every fraud is caught
false_positives = normal_count * 0.05  # 499.95
false_negatives = 0
# Precision: what fraction of flagged transactions are actually fraud?
precision = fraud_count / (fraud_count + false_positives)
print(f"Precision: {precision:.2f}")
# Recall: what fraction of actual frauds are caught?
recall = fraud_count / (fraud_count + false_negatives)
print(f"Recall: {recall:.2f}")
# Accuracy: what fraction of all transactions are classified correctly?
true_negatives = normal_count - false_positives
accuracy = (true_negatives + fraud_count) / (normal_count + fraud_count)
print(f"Accuracy: {accuracy:.2f}")
Output:
Precision: 0.00
Recall: 1.00
Accuracy: 0.95
For example, imagine a medical test for a disease that affects only 0.1% of the population. If the test is 99% accurate, then out of 1,000 people tested, about 10 healthy people will test positive even though only 1 person actually has the disease. This means that a positive test result is more likely to be a false positive than a true positive.
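We can verify this arithmetic directly. The sketch below uses the figures from the example above (0.1% prevalence, 1,000 people tested) and assumes the 99% accuracy applies equally to sick and healthy people, i.e., sensitivity = specificity = 0.99:
# population of 1,000 people, 0.1% prevalence, 99% accurate test
population = 1000
prevalence = 0.001
test_accuracy = 0.99  # assumed to be both sensitivity and specificity
sick = population * prevalence                   # 1 person
healthy = population - sick                      # 999 people
true_positives = sick * test_accuracy            # ~0.99
false_positives = healthy * (1 - test_accuracy)  # ~9.99
# chance that a positive result is real (positive predictive value)
ppv = true_positives / (true_positives + false_positives)
print(f"Expected positive results: {true_positives + false_positives:.1f}")
print(f"P(disease | positive test): {ppv:.2f}")  # ~0.09
Even with 99% accuracy, barely 9% of positive results are genuine, which is the paradox in a nutshell.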
Here's a fuller Python example that trains a model on a synthetic dataset and computes these metrics from the confusion matrix:
from sklearn.metrics import confusion_matrix
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# generate a binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=42)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# train a logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
# predict on test set and get the confusion matrix
y_pred = model.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
# calculate the accuracy, precision, and recall
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
Output:
Accuracy: 0.79
Precision: 0.82
Recall: 0.75
In such cases, precision and recall are better measures to evaluate the model's performance. Precision measures the proportion of true positives among all positive classifications, while recall measures the proportion of true positives among all actual positive instances. These measures can help avoid the False Positive Paradox and provide a more accurate evaluation of the model's performance.
Gambler's Fallacy
The Gambler's Fallacy is the belief that past events can influence the probability of future events in a random process. For example, in a game of roulette, some players believe that if the ball has landed on black for several consecutive spins, the chances of it landing on red next time are higher, even though the outcome is still equally random.
To illustrate this with a Python example, we can simulate flipping a fair coin using the numpy module and measure the longest streak of identical outcomes:
import numpy as np
# Simulate flipping a fair coin 10 times (0 = tails, 1 = heads)
results = np.random.randint(0, 2, size=10)
print(f"Coin flips: {results}")
# Track the longest run of identical consecutive outcomes
longest, current = 1, 1
for i in range(1, len(results)):
    if results[i] == results[i-1]:
        current += 1
        longest = max(longest, current)
    else:
        current = 1
print(f"Longest streak of identical flips: {longest}")
Output (this will vary from run to run):
Coin flips: [0 1 0 0 0 0 0 0 1 0]
Longest streak of identical flips: 6
In the above example, the code simulates flipping a coin 10 times and finds the longest streak of identical outcomes. The Gambler's Fallacy would suggest that after a streak like the six consecutive tails above, the next flip is more likely to break the pattern. In reality, each flip of the coin is independent and has an equal chance of resulting in heads or tails, no matter what came before.
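We can also test the fallacy head-on by simulating a large number of flips and estimating the probability of heads immediately after three heads in a row (a small sketch; for a fair coin, the estimate stays close to 0.5):
import numpy as np
rng = np.random.default_rng(42)
flips = rng.integers(0, 2, size=1_000_000)  # 1 = heads, 0 = tails
# collect every flip that immediately follows three consecutive heads
after_streak = [flips[i] for i in range(3, len(flips))
                if flips[i-3] == flips[i-2] == flips[i-1] == 1]
print(f"Flips following three heads: {len(after_streak)}")
print(f"P(heads | three heads in a row): {np.mean(after_streak):.3f}")  # ~0.500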
A real-world example of the Gamblerβs Fallacy could be seen in the stock market. Some investors may believe that if a stock has been consistently rising in value for several days, it is more likely to fall soon, even though market movements are still inherently unpredictable and subject to a range of factors.
Simpson's Paradox
Simpson's Paradox is a statistical phenomenon in which a trend that appears within each subgroup of a dataset disappears or reverses when the subgroups are combined. This can lead to incorrect conclusions if the data is not analyzed at the right level of aggregation.
Let's consider an example to understand this phenomenon better. Suppose we want to compare the admission rates of male and female applicants to a university with two departments, A and B:
Department  Gender  Applicants  Admitted  Admission Rate
A           Male    800         480       60.0%
A           Female  100         70        70.0%
B           Male    200         20        10.0%
B           Female  800         160       20.0%
In the table above, the aggregated admission rate is 50% for men (500 of 1,000) but only about 26% for women (230 of 900). However, when we analyze the data by department, we see that in each department the admission rate for women is higher than the admission rate for men. This seems counterintuitive, since the overall admission rate is higher for men.
This paradox occurs because the number of applicants and the admission rates differ across departments. Department A has a high admission rate but receives mostly male applicants, while Department B has a low admission rate and receives mostly female applicants. Men therefore look better in aggregate simply because more of them applied to the easier department.
In Python, we can replicate this example using the following code:
import pandas as pd
# Create a dataframe with the admissions data
df = pd.DataFrame({'Department': ['A', 'A', 'B', 'B'],
                   'Gender': ['Male', 'Female', 'Male', 'Female'],
                   'Applicants': [800, 100, 200, 800],
                   'Admitted': [480, 70, 20, 160]})
# Calculate admission rates per department and gender
df['Admission Rate'] = df['Admitted'] / df['Applicants'] * 100
# Display the dataframe
print(df)
# Aggregate by gender alone and recompute the admission rates
overall = df.groupby('Gender')[['Applicants', 'Admitted']].sum()
overall['Admission Rate'] = overall['Admitted'] / overall['Applicants'] * 100
print(overall)
Output:
  Department  Gender  Applicants  Admitted  Admission Rate
0          A    Male         800       480            60.0
1          A  Female         100        70            70.0
2          B    Male         200        20            10.0
3          B  Female         800       160            20.0
        Applicants  Admitted  Admission Rate
Gender
Female         900       230       25.555556
Male          1000       500       50.000000
In the above code, we create a dataframe with the same data as in the table, calculate the per-department admission rates, and display the dataframe. We then aggregate the data by gender alone: overall, 50% of men are admitted versus only about 26% of women. Yet within each individual department, the admission rate for women is higher than for men. This reversal between the subgroup level and the aggregate level is Simpson's Paradox.
Berkson's Paradox
Berkson's paradox is a statistical phenomenon in which two variables that are independent (or even positively correlated) in the overall population appear negatively correlated in a selected sample, or vice versa. It arises from selection bias: when observations enter the sample based on a condition that depends on both variables, an artificial association is induced between them.
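Before turning to a real dataset, here is a minimal simulation of the mechanism with made-up numbers: two traits are generated completely independently, but once we keep only the observations where at least one trait is high (mimicking a selective sample such as hospital admissions), a clear negative correlation appears:
import numpy as np
rng = np.random.default_rng(0)
n = 100_000
# two completely independent variables
x = rng.normal(size=n)
y = rng.normal(size=n)
print(f"Correlation in full population: {np.corrcoef(x, y)[0, 1]:.3f}")  # ~0.000
# selection effect: keep only cases where at least one value is high
selected = (x > 1) | (y > 1)
print(f"Correlation in selected sample: {np.corrcoef(x[selected], y[selected])[0, 1]:.3f}")  # negative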
To explore this paradox using the iris dataset, let's consider sepal length and sepal width as the two variables of interest. We can calculate the correlation coefficient between these two variables using the corr() method in pandas:
import pandas as pd
import seaborn as sns
iris = sns.load_dataset('iris')
correlation = iris['sepal_length'].corr(iris['sepal_width'])
print('Correlation between sepal length and width:', correlation)
Correlation between sepal length and width: -0.11756978413300208
As we can see, there is a negative correlation between sepal length and width in the overall dataset.
However, if we split the dataset by species and calculate the correlation coefficient for each species separately, we might get a different result. For example, if we only consider the setosa species, we get a positive correlation:
setosa = iris[iris['species'] == 'setosa']
correlation_setosa = setosa['sepal_length'].corr(setosa['sepal_width'])
print('Correlation between sepal length and width for setosa:', correlation_setosa)
Correlation between sepal length and width for setosa: 0.7425466856651597
This means that there is a positive correlation between sepal length and width for the setosa species, which is opposite to the overall negative correlation.
This reversal occurs because each iris species occupies a different region of the sepal length-width space. Pooling the species mixes these between-group differences into the correlation, producing a negative overall relationship that masks the positive relationship within each species. Restricting the analysis to a single species, a form of sample selection, reveals the opposite association.
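To confirm that setosa is not a special case, we can compute the within-species correlation for all three species (a short sketch continuing with the same dataset):
import seaborn as sns
iris = sns.load_dataset('iris')
# correlation between sepal length and width within each species
for species, group in iris.groupby('species'):
    corr = group['sepal_length'].corr(group['sepal_width'])
    print(f"{species}: {corr:.3f}")
All three within-species correlations come out positive, so the negative sign in the pooled data is an artifact of mixing the species.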
Conclusion
In conclusion, understanding statistical paradoxes is crucial for data scientists as they can help in avoiding common mistakes and biases in data analysis.
- The Accuracy Paradox teaches us that accuracy alone is not enough to evaluate classification tasks, and precision and recall are more informative.
- The False Positive Paradox highlights that when a condition is rare, even a very accurate test can produce more false positives than true positives.
- The Gambler's Fallacy reminds us that each event in a random process is independent, and past outcomes do not affect future ones.
- Simpson's Paradox shows how aggregating data can obscure or reverse relationships between variables and lead to incorrect conclusions.
- Finally, Berkson's Paradox shows how sampling bias can distort correlations when working with non-random samples from a population.
Being aware of these paradoxes can help data scientists make more accurate and informed decisions in their work.