Exploration of Markov’s and Chebyshev’s Inequalities: Their Application in Data Science

Last Updated on June 3, 2024 by Editorial Team

Author(s): Ghadah AlHabib

Originally published on Towards AI.

Image generated by ChatGPT

Introduction to Markov’s Inequality

Markov’s inequality provides an upper bound on the probability that a non-negative random variable is at least as large as a given value or threshold. It lets us bound the probability of extreme events using only minimal information about the distribution (its mean), which makes it especially useful when the distribution is heavily skewed.
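In symbols, for a non-negative random variable $X$ with finite mean and any threshold $a > 0$:

```latex
P(X \geq a) \leq \frac{E[X]}{a}
```

Applying the bound requires knowing nothing about the distribution beyond $E[X]$.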

Photo taken by author

Introduction to Chebyshev’s Inequality

Chebyshev’s inequality guarantees that, for a wide class of probability distributions, no more than a specific fraction of values can lie beyond a given distance from the mean. It provides an upper bound on the probability that the absolute deviation of a random variable from its mean exceeds a given threshold: at most $1/k^2$ of a distribution’s values can be $k$ or more standard deviations away from the mean, and consequently at least $1 - 1/k^2$ of the values must lie strictly within $k$ standard deviations of the mean.
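In symbols, for a random variable $X$ with mean $\mu$, standard deviation $\sigma$, and any $k > 0$:

```latex
P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2},
\qquad
P(|X - \mu| < k\sigma) \geq 1 - \frac{1}{k^2}
```

Unlike Markov’s inequality, this uses two pieces of information, the mean and the standard deviation, which is what makes the bound tighter.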

Analyzing and Visualizing Hospital Length of Stay using Markov’s and Chebyshev’s Inequality

Photo taken by Author

A Use Case for Markov’s Inequality

Markov’s inequality is useful for bounding the likelihood of encountering values at or beyond a predefined threshold and flagging them as potential outliers. In the example below, our goal is to estimate an upper bound on the probability that a patient’s length of stay exceeds a certain number of days (the threshold), which is crucial for hospital resource management and planning.

import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic Length of Stay (LOS) data from a gamma distribution
np.random.seed(42)
data_los = np.random.gamma(shape=2, scale=3, size=1000)

# Calculate the mean length of stay
mean_los = np.mean(data_los)

# Set a threshold for a long length of stay, e.g., 15 days
threshold_los = 15

# Markov's inequality: P(X >= a) <= E[X] / a
probability_bound_los = mean_los / threshold_los

# Plot the distribution and the threshold
plt.figure(figsize=(10, 6))
plt.hist(data_los, bins=40, alpha=0.7, label='LOS Distribution')
plt.axvline(x=threshold_los, color='red', linestyle='--', label=f'Threshold = {threshold_los} days')
plt.axvline(x=mean_los, color='blue', linestyle='--', label=f'Mean LOS = {mean_los:.2f} days')
plt.xlabel('Length of Stay (days)')
plt.ylabel('Frequency')
plt.title("Histogram of Length of Stay with Markov's Inequality Threshold")
plt.legend()
plt.show()

print(f"Mean Length of Stay: {mean_los:.2f} days")
print(f"Markov's Inequality Estimate: P(LOS >= {threshold_los}) <= {probability_bound_los:.3f}")

In the code above, the generated dataset follows a gamma distribution of patients’ lengths of stay in a hospital, which produces values concentrated around the mean with a long right tail (a skewed distribution). We then set the threshold (threshold_los = 15) to investigate the tail of the distribution, since we’re interested in how often values equal or exceed 15.

Apply Markov’s inequality:

With the setup in place, we use the inequality to estimate an upper bound on the probability that a randomly selected value from this distribution is at least 15.

# Output:
Mean Length of Stay: 6.18 days
Markov's Inequality Estimate: P(LOS >= 15) <= 0.412

Understanding the results:

  1. Mean length of stay: The average stay is about 6.18 days. This gives us an estimate of the general turnover of hospital beds.
  2. Threshold: set at 15 days to identify long-term stays.
  3. Probability bound from Markov’s inequality: at most about 41% of patients can have a LOS of 15 days or more. This is a conservative estimate; the actual percentage is much lower, because Markov’s inequality gives an upper bound, not the actual probability.
Graph created by author

So now we know that no more than about 41% of patients are expected to stay 15 days or longer, which allows hospitals to plan on the assumption that most beds will turn over more frequently than that.
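As a quick sanity check, the Markov bound can be compared against the observed tail of the synthetic data. The sketch below regenerates the same data as above (same seed and gamma parameters); note that Markov’s inequality also holds exactly for the empirical distribution, so the observed fraction can never exceed the bound:

```python
import numpy as np

# Regenerate the synthetic LOS data used above (same seed and parameters)
np.random.seed(42)
data_los = np.random.gamma(shape=2, scale=3, size=1000)

threshold_los = 15
markov_bound = np.mean(data_los) / threshold_los     # E[X] / a
empirical_tail = np.mean(data_los >= threshold_los)  # observed fraction >= 15

print(f"Markov bound on P(LOS >= {threshold_los}): {markov_bound:.3f}")
print(f"Observed fraction of LOS >= {threshold_los}: {empirical_tail:.3f}")
```

The gap between the two numbers is exactly the conservatism discussed above.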

Next steps for a more accurate outlier-detection analysis:

Markov’s inequality is very general and doesn’t account for the shape of the distribution beyond its mean. More accurate analyses might involve:

  • Chebyshev’s Inequality: This inequality uses both the mean and the variance of the distribution, so when this extra information (the standard deviation) is available, it provides a tighter bound than Markov’s inequality.
  • Empirical Quantiles: Instead of bounds, use the actual data to calculate quantiles. This gives a direct measure of, for example, the 90th percentile of hospital stays.
  • Extreme Value Theory (EVT): to model the tail of the distribution.
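Of these, the empirical-quantile approach is the cheapest to try. A minimal sketch (reusing the same synthetic data as above) reads the 90th percentile directly off the sample instead of bounding it:

```python
import numpy as np

# Same synthetic LOS data as in the examples above
np.random.seed(42)
data_los = np.random.gamma(shape=2, scale=3, size=1000)

# Empirical 90th percentile: 90% of observed stays fall at or below this value
p90 = np.quantile(data_los, 0.90)
print(f"90th percentile of LOS: {p90:.2f} days")
```

Because this is a direct measurement on the data rather than a distribution-free bound, it is tighter, but it only describes the observed sample.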

A Use Case for Chebyshev’s Inequality

# Calculate the mean and standard deviation
mean_los = np.mean(data_los)
std_los = np.std(data_los)

# Set k for Chebyshev's inequality (e.g., 2 standard deviations)
k = 2

# Chebyshev's inequality: P(|X - mean| >= k * SD) <= 1 / k^2
probability_bound_los = 1 / (k ** 2)

# Visualize the distribution and thresholds
plt.figure(figsize=(10, 6))
plt.hist(data_los, bins=40, alpha=0.7, label='LOS Distribution')
plt.axvline(x=mean_los, color='blue', linestyle='--', label='Mean LOS')
plt.axvline(x=mean_los + k*std_los, color='red', linestyle='--', label=f'+{k} SD')
plt.axvline(x=mean_los - k*std_los, color='red', linestyle='--', label=f'-{k} SD')
plt.xlabel('Length of Stay (days)')
plt.ylabel('Frequency')
plt.title("Histogram of Length of Stay with Chebyshev's Inequality Thresholds")
plt.legend()
plt.show()

print(f"Mean LOS: {mean_los:.2f} days")
print(f"Standard Deviation LOS: {std_los:.2f} days")
print(f"Chebyshev's Inequality Estimate: P(|LOS - Mean| >= {k} * SD) <= {probability_bound_los:.3f}")
Graph created by author
# Output:
Mean LOS: 6.18 days
Standard Deviation LOS: 4.21 days
Chebyshev's Inequality Estimate: P(|LOS - Mean| >= 2 * SD) <= 0.250

The plot shows the distribution of LOS along with lines indicating the mean and the bounds set at ±2 standard deviations from the mean.

Chebyshev’s inequality tells us that no more than 25% of the data (1/k² = 1/4 = 0.25) should fall outside of this range. This helps in identifying what percentage of stays can be expected to be unusually long or short, which is important for capacity planning and managing exceptions in hospital operations.
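As with Markov’s bound, this can be checked against the data. Chebyshev’s inequality also holds exactly for the empirical distribution (when the standard deviation is computed with np.std’s default ddof=0), so the observed fraction outside ±k SD can never exceed 1/k². A minimal sketch, regenerating the same synthetic data:

```python
import numpy as np

# Same synthetic LOS data as above
np.random.seed(42)
data_los = np.random.gamma(shape=2, scale=3, size=1000)

mean_los = np.mean(data_los)
std_los = np.std(data_los)  # population SD (ddof=0)
k = 2

# Fraction of observations at least k standard deviations from the mean
outside = np.abs(data_los - mean_los) >= k * std_los
chebyshev_bound = 1 / k**2

print(f"Chebyshev bound: {chebyshev_bound:.3f}")
print(f"Observed fraction outside {k} SD: {np.mean(outside):.3f}")
```

The observed fraction is well below 0.25, which previews the comparison with the Z-score method in the next section.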

Let’s explicitly identify the outliers using the Z-score method:

import numpy as np

# Calculate the mean and standard deviation
mean_los = np.mean(data_los)
std_los = np.std(data_los)
# Calculate Z-scores
z_scores = (data_los - mean_los) / std_los
# Define a threshold, the number of standard deviations
threshold = 2
# Identify outliers
outliers = data_los[np.abs(z_scores) > threshold]
print("Outliers using Z-score method:", outliers)

The Z-score method calculates the mean and standard deviation of the dataset and then finds each data point’s Z-score, i.e., the number of standard deviations it lies from the mean. Points whose absolute Z-score exceeds a chosen threshold (in our case, 2) are considered outliers, which are identified here:

# Output
Outliers using Z-score method: [21.89399337 16.26622656 20.38959659 18.54517793
18.27028866 18.83459759 19.18005698 16.35479045 15.97616906 17.98111993
15.34512497 17.12223 14.79403961 14.70875979 19.9580484 16.36012467 16.12363236
23.05543792 14.86851646 15.47192187 15.67907888 15.66813034 22.56087248
16.10462967 18.92919721 15.72112453 15.18718067 15.11888241 18.12696851
18.23344525 16.68003173 21.27966247 18.75937418 14.67804919 22.20391272
18.87330885 23.36067128 14.65687522 15.1141912 22.89315476 17.48881529
16.16709294 16.42671481 15.48385797 17.95156816 16.33028343 16.55798115
16.03273815 14.98218812 14.85726405]

Let’s compare the Z-score method results to Chebyshev’s inequality:

# Calculate the proportion of outliers flagged by the Z-score method
proportion_outliers = len(outliers) / len(data_los)

# Output: 0.05

Using the Z-score method with a threshold of 2 standard deviations, the actual proportion of outliers in the dataset is 5%. However, Chebyshev’s inequality only tells us that at most 25% of the data could lie beyond 2 standard deviations from the mean.

This demonstrates that Chebyshev’s inequality provides a very conservative estimate. The actual number of outliers identified by the Z-score method can be significantly lower, as in our dataset, where only 5% of the data points are flagged. This highlights the usefulness of methods like the Z-score for distributions that are approximately normal, where they give a more precise estimate of outlier proportions than broad bounds like Chebyshev’s.

Thank you for reading!

Let’s Connect!

Twitter: https://twitter.com/ghadah_alha/

LinkedIn: https://www.linkedin.com/in/ghadah-alhabib/


Published via Towards AI
