Exploration of Markov’s and Chebyshev’s Inequality: Its Application in Data Science

Last Updated on June 3, 2024 by Editorial Team

Author(s): Ghadah AlHabib

Originally published on Towards AI.

Image generated by ChatGPT

Introduction to Markov’s Inequality

Markov’s inequality provides an upper bound on the probability that a non-negative random variable is at least as large as a certain value or threshold. It helps us learn about probabilities of extreme events using little information about the distribution (the mean of the distribution) and is useful when the random variable is significantly skewed.
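For reference, the statement above can be written compactly. The symbols $X$ (the non-negative random variable) and $a$ (the threshold) are notation assumed here for illustration. The second line is a brief sketch of why the bound holds.

```latex
\[
P(X \ge a) \le \frac{\mathbb{E}[X]}{a}, \qquad X \ge 0,\ a > 0.
\]
% Sketch of the argument: for X >= 0 we have X >= a * 1{X >= a}, so
\[
\mathbb{E}[X] \ge \mathbb{E}\big[a \cdot \mathbf{1}\{X \ge a\}\big] = a \, P(X \ge a).
\]
```

Only the mean $\mathbb{E}[X]$ appears on the right-hand side, which is why the bound needs so little information about the distribution.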

Photo taken by author

Introduction to Chebyshev’s Inequality

Chebyshev’s inequality guarantees that, for any probability distribution with a finite mean and variance, no more than a specific fraction of values can lie beyond a given distance from the mean. It provides an upper bound on the probability that the absolute deviation of a random variable from its mean will reach or exceed a given threshold: at most $1/k²$ of a distribution’s values can be $k$ or more standard deviations away from the mean. Equivalently, at least $1–1/k²$ of a distribution’s values must lie strictly within $k$ standard deviations of the mean.
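In symbols, with $\mu$ and $\sigma$ denoting the mean and standard deviation (notation assumed here for illustration), the bound can be sketched as follows. The second line shows one common way to obtain it, by applying Markov’s inequality to the squared deviation.

```latex
\[
P\big(|X - \mu| \ge k\sigma\big) \le \frac{1}{k^2}, \qquad k > 0.
\]
% Markov's inequality applied to (X - \mu)^2 with threshold k^2 \sigma^2:
\[
P\big((X - \mu)^2 \ge k^2 \sigma^2\big)
  \le \frac{\mathbb{E}\big[(X - \mu)^2\big]}{k^2 \sigma^2}
  = \frac{\sigma^2}{k^2 \sigma^2}
  = \frac{1}{k^2}.
\]
```

The bound is only informative for $k > 1$, since for $k \le 1$ the right-hand side is at least 1.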

Analyzing and Visualizing Hospital Length of Stay using Markov’s and Chebyshev’s Inequality

Photo taken by Author

Use Case of Using Markov’s Inequality

Markov’s inequality is useful for bounding the likelihood of encountering values at or beyond a predefined threshold, which can then be treated as potential outliers. In the example below, our goal is to compute an upper bound on the probability that a patient’s length of stay (LOS) reaches or exceeds a certain number of days (the threshold), which can be crucial for hospital resource management and planning.

import numpy as np
import matplotlib.pyplot as plt

# Generating synthetic Length of Stay data using a gamma distribution
np.random.seed(42)
data_los = np.random.gamma(shape=2, scale=3, size=1000)
# Calculate the mean length of stay
mean_los = np.mean(data_los)
# Set a threshold for a long length of stay, e.g., 15 days
threshold_los = 15
# Markov's inequality: P(LOS >= threshold) <= mean / threshold
probability_bound_los = mean_los / threshold_los
# Plotting the distribution and the threshold
plt.figure(figsize=(10, 6))
plt.hist(data_los, bins=40, alpha=0.7, label='LOS Distribution')
plt.axvline(x=threshold_los, color='red', linestyle='--', label=f'Threshold = {threshold_los} days')
plt.axvline(x=mean_los, color='blue', linestyle='--', label=f'Mean LOS = {mean_los:.2f} days')
plt.xlabel('Length of Stay (days)')
plt.ylabel('Frequency')
plt.title("Histogram of Length of Stay with Markov's Inequality Threshold")
plt.legend()
plt.show()
print(f"Mean Length of Stay: {mean_los:.2f} days")
print(f"Markov's Inequality Estimate: P(LOS >= {threshold_los}) <= {probability_bound_los:.3f}")

In the previous code, the synthetic dataset of patients’ lengths of stay is drawn from a gamma distribution. This results in values concentrated around the mean but with a long right tail (a right-skewed distribution). Next, we set our desired threshold (threshold_los = 15) to investigate the tail of the distribution, because we’re interested in how often values equal or exceed 15.

Apply Markov’s inequality:

With the setup in place, we use the inequality to compute an upper bound on the probability that a randomly selected value from this distribution is at least 15.

# Output:
Mean Length of Stay: 6.18 days
Markov's Inequality Estimate: P(LOS >= 15) <= 0.412

Understanding the results:

  1. Mean length of stay: The average stay is about 6.18 days. This gives us an estimate of the general turnover of hospital beds.
  2. Threshold: Set to 15 days to identify long-term stays.
  3. Probability bound from Markov’s inequality: At most about 41% of patients will have a LOS of 15 days or more. This is a conservative upper bound rather than an estimate of the actual proportion, which is much lower.

Graph created by author

So now we know that no more than about 41% of patients are expected to stay 15 days or longer, which lets hospitals plan on the majority of beds turning over in less than 15 days.
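To see how conservative the bound is, it can be compared with the fraction of stays that actually reach the threshold. The following is a minimal sketch that assumes the same synthetic setup as the example above (same seed, gamma parameters, and threshold); the variable names `markov_bound` and `empirical_tail` are illustrative.

```python
import numpy as np

# Assumption: regenerate the same synthetic LOS data as in the example above
np.random.seed(42)
data_los = np.random.gamma(shape=2, scale=3, size=1000)

threshold_los = 15

# Markov's upper bound uses only the mean
markov_bound = np.mean(data_los) / threshold_los

# Observed fraction of stays at or beyond the threshold
empirical_tail = np.mean(data_los >= threshold_los)

print(f"Markov bound:       P(LOS >= {threshold_los}) <= {markov_bound:.3f}")
print(f"Empirical fraction: P(LOS >= {threshold_los}) ~= {empirical_tail:.3f}")
```

For non-negative data, the observed fraction can never exceed the bound computed from the sample mean, so the comparison mainly shows how much slack the bound leaves.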

Next steps for a more accurate analysis for outlier detection:

Markov’s inequality is very general and doesn’t account for the shape of the distribution beyond its mean. More accurate analyses might involve:

  • Chebyshev’s Inequality: This inequality uses both the mean and the variance of the distribution. When the standard deviation is available, it often provides a tighter bound than Markov’s inequality.
  • Empirical Quantiles: Instead of bounds, use the actual data to calculate quantiles. This gives a direct measure of, for example, the 90th percentile of hospital stays.
  • Extreme Value Theory (EVT): to model the tail of the distribution.
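As an illustration of the empirical-quantiles option listed above, here is a minimal sketch using `np.percentile`. It assumes the same synthetic gamma data as the earlier example; the chosen percentiles (50, 90, 95, 99) are arbitrary picks for illustration, not values from the original analysis.

```python
import numpy as np

# Assumption: same synthetic LOS data as in the earlier example
np.random.seed(42)
data_los = np.random.gamma(shape=2, scale=3, size=1000)

# Percentiles to report (illustrative choice)
percentiles = [50, 90, 95, 99]
quantiles = np.percentile(data_los, percentiles)

for p, q in zip(percentiles, quantiles):
    print(f"{p}th percentile of LOS: {q:.2f} days")
```

Unlike the inequalities, these values describe this particular sample directly, so they are only as reliable as the data is representative.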

Use Case of Using Chebyshev’s Inequality

# Calculate the mean and standard deviation
mean_los = np.mean(data_los)
std_los = np.std(data_los)

# Set k for Chebyshev's Inequality (e.g., 2 standard deviations)
k = 2
# Apply Chebyshev's Inequality
probability_bound_los = 1 / (k ** 2)
# Visualizing the distribution and thresholds
plt.figure(figsize=(10, 6))
plt.hist(data_los, bins=40, alpha=0.7, label='LOS Distribution')
plt.axvline(x=mean_los, color='blue', linestyle='--', label='Mean LOS')
plt.axvline(x=mean_los + k*std_los, color='red', linestyle='--', label=f'+{k} SD')
plt.axvline(x=mean_los - k*std_los, color='red', linestyle='--', label=f'-{k} SD')
plt.xlabel('Length of Stay (days)')
plt.ylabel('Frequency')
plt.title("Histogram of Length of Stay with Chebyshev's Inequality Thresholds")
plt.legend()
plt.show()
print(f"Mean LOS: {mean_los:.2f} days")
print(f"Standard Deviation LOS: {std_los:.2f} days")
print(f"Chebyshev's Inequality Estimate: P(|LOS - Mean| >= {k} * SD) <= {probability_bound_los:.3f}")

Graph created by author

# Output:
Mean LOS: 6.18 days
Standard Deviation LOS: 4.21 days
Chebyshev's Inequality Estimate: P(|LOS - Mean| >= 2 * SD) <= 0.250

The plot shows the distribution of LOS along with lines indicating the mean and the bounds set at ±2 standard deviations from the mean.

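Chebyshev’s inequality tells us that no more than 25% of the data (1/k² = 1/4 = 0.25) can fall outside of this range. This helps bound the percentage of stays that can be expected to be unusually long or short, which is important for capacity planning and managing exceptions in hospital operations. Note that with a mean of 6.18 days and a standard deviation of 4.21 days, the lower limit (mean − 2 SD) is negative, so in this example only unusually long stays can fall outside the range.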
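The bound can be checked against the data for several values of k. The sketch below assumes the same synthetic data as above; the set of k values (1.5, 2, 3) and the `results` list are illustrative choices.

```python
import numpy as np

# Assumption: same synthetic LOS data as in the example above
np.random.seed(42)
data_los = np.random.gamma(shape=2, scale=3, size=1000)

mean_los = np.mean(data_los)
std_los = np.std(data_los)  # population SD (ddof=0), as in the code above

results = []
for k in (1.5, 2, 3):
    bound = 1 / k**2
    # Observed fraction of stays at least k standard deviations from the mean
    observed = np.mean(np.abs(data_los - mean_los) >= k * std_los)
    results.append((k, bound, observed))
    print(f"k = {k}: Chebyshev bound = {bound:.3f}, observed fraction = {observed:.3f}")
```

Because the mean and the population standard deviation are computed from the same sample, the observed fraction cannot exceed 1/k² for any k, which makes this a useful sanity check.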

Let’s explicitly identify the outliers using the Z-score method

import numpy as np

# Calculate the mean and standard deviation
mean_los = np.mean(data_los)
std_los = np.std(data_los)
# Calculate Z-scores
z_scores = (data_los - mean_los) / std_los
# Define a threshold, the number of standard deviations
threshold = 2
# Identify outliers
outliers = data_los[np.abs(z_scores) > threshold]
print("Outliers using Z-score method:", outliers)

The Z-score method involves calculating the standard deviation and mean of the dataset, and then finding the “Z-score” of each data point, which is the number of standard deviations it is from the mean. Points whose absolute Z-score exceeds a certain value (in our case 2) are considered outliers, which are identified here:

# Output
Outliers using Z-score method: [21.89399337 16.26622656 20.38959659 18.54517793
18.27028866 18.83459759 19.18005698 16.35479045 15.97616906 17.98111993
15.34512497 17.12223 14.79403961 14.70875979 19.9580484 16.36012467 16.12363236
23.05543792 14.86851646 15.47192187 15.67907888 15.66813034 22.56087248
16.10462967 18.92919721 15.72112453 15.18718067 15.11888241 18.12696851
18.23344525 16.68003173 21.27966247 18.75937418 14.67804919 22.20391272
18.87330885 23.36067128 14.65687522 15.1141912 22.89315476 17.48881529
16.16709294 16.42671481 15.48385797 17.95156816 16.33028343 16.55798115
16.03273815 14.98218812 14.85726405]

Let’s compare the Z-score method results to the Chebyshev’s inequality

# Calculate the proportion of outliers
# `outliers` is the array found by the Z-score method above
proportion_outliers_z = len(outliers) / len(data_los)
print(proportion_outliers_z)

# Output: 0.05

Using the Z-score method with a threshold of 2 standard deviations, the actual proportion of outliers in the dataset is 5%. Chebyshev’s inequality, however, only guarantees that at most 25% of the data can lie 2 or more standard deviations from the mean.

This demonstrates that Chebyshev’s inequality provides a very conservative bound, because it must hold for every distribution with a finite variance. The actual number of outliers as identified by the Z-score method can be significantly lower, as seen in our dataset, where only 5% of the data points are identified as outliers. This highlights the usefulness of methods like the Z-score for distributions that approximate normality, where they can provide a more precise estimation of outlier proportions compared to broader bounds like Chebyshev’s.
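The two inequalities can also be compared at the same 15-day threshold. For a threshold a above the mean, P(LOS ≥ a) ≤ P(|LOS − mean| ≥ a − mean) ≤ variance / (a − mean)², so Chebyshev’s inequality yields a bound on the same tail. Using the reported mean of 6.18 and standard deviation of 4.21, this works out to roughly 0.23, compared with 0.412 from Markov’s inequality. The sketch below assumes the same synthetic data as above; the variable names are illustrative.

```python
import numpy as np

# Assumption: same synthetic LOS data as in the examples above
np.random.seed(42)
data_los = np.random.gamma(shape=2, scale=3, size=1000)

mean_los = np.mean(data_los)
var_los = np.var(data_los)  # population variance (ddof=0)
threshold_los = 15          # must be above the mean for the Chebyshev step

# Markov: uses only the mean
markov_bound = mean_los / threshold_los

# Chebyshev applied to the deviation (threshold - mean)
chebyshev_bound = var_los / (threshold_los - mean_los) ** 2

# Observed fraction of stays at or beyond the threshold
empirical_tail = np.mean(data_los >= threshold_los)

print(f"Markov bound:       {markov_bound:.3f}")
print(f"Chebyshev bound:    {chebyshev_bound:.3f}")
print(f"Empirical fraction: {empirical_tail:.3f}")
```

Both values are upper bounds on the same probability, so the smaller of the two is the more useful one for planning.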

Thank you for reading!

Let’s Connect!

Twitter: https://twitter.com/ghadah_alha/

LinkedIn: https://www.linkedin.com/in/ghadah-alhabib/
