Part 3:

Last Updated on July 25, 2023 by Editorial Team

Author(s): Rohini Vaidya

Originally published on Towards AI.

Unraveling the Complexity of Distributions in Statistics | by Rohini Vaidya | Apr 2023 | Towards AI

Navigating the complex world of distributions: a complete guide to the distributions you must know before starting data analysis.

“Data are just summaries of thousands of stories — tell a few of those stories to help make the data meaningful.”

— Chip and Dan Heath.

Data science revolves around data, and we draw conclusions from the data we are given. Distributions give a detailed picture of that data: how it behaves with respect to different features and what its characteristics are. We can then draw inferences from the distribution of the data.

In data science, a statistical distribution describes how a set of data is spread out over a range of values.

It is a parameterized mathematical function that gives the probabilities of the different outcomes of a random variable.

A distribution can be represented graphically using plots such as a histogram, a density plot, or a bar plot. The shape of a distribution is described by properties such as its skewness (asymmetry) and kurtosis (heaviness of the tails).

There are two major types of distributions, based on the kind of outcome the data can take: discrete distributions and continuous distributions.

A distribution that deals with discrete data is known as a discrete distribution. For example, the number of times a coin lands heads in 3 tosses.

Bernoulli distribution
Binomial distribution
Uniform distribution
Geometric distribution
Poisson distribution

A distribution that deals with continuous data is known as a continuous distribution. For example, the exam scores of the students in a class, when a score can take any value in a range.

Normal/Gaussian distribution
Exponential distribution
Student’s T distribution
Chi-square distribution

Now, we will go through each distribution in depth.

A. Discrete distributions

1. Bernoulli distribution:

The Bernoulli distribution is a discrete distribution named after the Swiss mathematician Jacob Bernoulli. It applies to experiments with a binary (Boolean-valued) outcome: the random variable takes the value 1 (success/true/yes) with probability p and the value 0 (failure/false/no) with probability q = 1 - p.

The probability mass function (PMF) of a random variable x that follows a Bernoulli distribution is:

P(x) = p^x (1 - p)^(1 - x), for x in {0, 1}

Here, p is the probability that x is a “success” and 1 - p is the probability that x is a “failure”.

From the PMF, we can calculate the expected value and variance of x. Taking x = 1 for a “success” and x = 0 for a “failure”, E(x) and Var(x) are:

E(x) = p, Var(x) = p(1 - p)

The Bernoulli distribution is a special case of the binomial distribution where a single trial is conducted.
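As a minimal sketch, the expected value and variance of a Bernoulli variable can be computed directly from its PMF, assuming the standard form P(x) = p^x (1 - p)^(1 - x). The function names and the value p = 0.3 are illustrative choices, not part of the original article.

```python
def bernoulli_pmf(x: int, p: float) -> float:
    """P(X = x) for a Bernoulli(p) variable; non-zero only for x in {0, 1}."""
    if x == 1:
        return p
    if x == 0:
        return 1.0 - p
    return 0.0


def bernoulli_mean_var(p: float) -> tuple[float, float]:
    """Expected value and variance, summed over the two outcomes of the PMF."""
    mean = sum(x * bernoulli_pmf(x, p) for x in (0, 1))
    var = sum((x - mean) ** 2 * bernoulli_pmf(x, p) for x in (0, 1))
    return mean, var


p = 0.3  # hypothetical success probability
mean, var = bernoulli_mean_var(p)
print(f"E(x) = {mean:.2f} (p = {p}), Var(x) = {var:.2f} (p(1 - p) = {p * (1 - p):.2f})")
```

The two sums reduce to p and p(1 - p), which is why those closed forms are usually quoted directly.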

2. Binomial distribution:

The binomial distribution is also a discrete distribution. It describes the number of successes in n identical, independent Bernoulli trials, where each trial has two possible outcomes: success or failure.

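The distribution has two parameters: the success probability p and the number of trials n. The PMF for observing k successes is:

P(X = k) = C(n, k) p^k (1 - p)^(n - k), for k = 0, 1, ..., n

where C(n, k) = n! / (k! (n - k)!) is the number of ways to choose k successes out of n trials.

As a binomial random variable is the sum of n independent, identical Bernoulli variables, the expected value and the variance are:

E(X) = np, Var(X) = np(1 - p)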
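The earlier example of counting heads in 3 coin tosses can be worked through in Python. This is a sketch that assumes the standard binomial PMF, C(n, k) p^k (1 - p)^(n - k), and a fair coin (p = 0.5); the function name is illustrative.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k): probability of exactly k successes in n independent trials."""
    if not 0 <= k <= n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 3, 0.5  # three tosses of a fair coin (assumed values)
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(pmf)  # [0.125, 0.375, 0.375, 0.125]

# Mean and variance computed from the PMF agree with np and np(1 - p).
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
print(mean, n * p)           # 1.5 1.5
print(var, n * p * (1 - p))  # 0.75 0.75
```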

3. Uniform distribution:

The uniform distribution can be either continuous or discrete. It is a distribution in which every outcome is equally likely. An example of the discrete uniform distribution is rolling a fair die, which is equally likely to land showing any number from 1 to 6. Uniform distributions are a family of symmetric probability distributions that describe an experiment whose outcome is arbitrary but lies within certain bounds. For a continuous uniform distribution over a range from a to b, the total probability over the entire range must equal 1, so the probability density function is:

f(x) = 1 / (b - a) for a ≤ x ≤ b, and 0 otherwise
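Both flavours of the uniform distribution can be sketched in a few lines of Python. The sketch assumes the standard forms: probability 1/6 per face for a fair die, and density 1 / (b - a) on the interval from a to b. The function names and the interval [0, 10] are illustrative.

```python
from fractions import Fraction


def discrete_uniform_pmf(x: int, low: int, high: int) -> Fraction:
    """P(X = x) when every integer from low to high is equally likely."""
    if low <= x <= high:
        return Fraction(1, high - low + 1)
    return Fraction(0)


def continuous_uniform_pdf(x: float, a: float, b: float) -> float:
    """Density of Uniform(a, b): constant 1 / (b - a) inside [a, b], 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def continuous_uniform_prob(c: float, d: float, a: float, b: float) -> float:
    """P(c <= X <= d) for X ~ Uniform(a, b): overlap length divided by b - a."""
    overlap = min(b, d) - max(a, c)
    return max(0.0, overlap) / (b - a)


# A fair die: each face has probability 1/6 and the probabilities sum to 1.
print([discrete_uniform_pmf(face, 1, 6) for face in range(1, 7)])
print(sum(discrete_uniform_pmf(face, 1, 6) for face in range(1, 7)))

# A continuous uniform on [0, 10] (hypothetical bounds).
print(continuous_uniform_pdf(4.0, 0.0, 10.0))
print(continuous_uniform_prob(2.0, 5.0, 0.0, 10.0))
print(continuous_uniform_prob(0.0, 10.0, 0.0, 10.0))
```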

4. Geometric distribution:

The geometric distribution is a discrete probability distribution that describes the number of trials needed to obtain the first success in a series of independent Bernoulli trials, where each trial has two possible outcomes (success or failure) and the probability of success is constant across all trials. An equivalent convention counts the number of failures before the first success; this article uses the number of trials.

The geometric distribution gives the probability that the first success occurs on the kth of a series of independent trials, each with success probability p. That probability is:

P(X = k) = (1 - p)^(k - 1) p, for k = 1, 2, 3, ...

Because it is a discrete distribution, k can take only integer values starting from 1. Applications include quality control, where it gives the probability that a product first fails on a given inspection, and marketing, where it can be used to estimate the number of times an advertisement needs to be seen before a customer takes an action, such as making a purchase.
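The "first success on the kth trial" idea can be sketched in Python, assuming the standard PMF (1 - p)^(k - 1) p: k - 1 failures followed by one success. The value p = 0.2 is a hypothetical per-view conversion probability, loosely following the advertising example.

```python
def geometric_pmf(k: int, p: float) -> float:
    """P(first success happens on trial k): k - 1 failures, then one success."""
    if k < 1:
        return 0.0
    return (1 - p) ** (k - 1) * p


def geometric_cdf(k: int, p: float) -> float:
    """P(first success happens on or before trial k)."""
    if k < 1:
        return 0.0
    return 1 - (1 - p) ** k


p = 0.2  # hypothetical chance that a single ad view leads to a purchase
print(geometric_pmf(3, p))  # chance that the third view is the first to convert
print(geometric_cdf(5, p))  # chance of converting within the first five views

# The expected number of trials is 1 / p; a long partial sum gets very close.
approx_mean = sum(k * geometric_pmf(k, p) for k in range(1, 500))
print(approx_mean, 1 / p)
```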

5. Poisson distribution:

The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space when the events are rare and randomly distributed. It has a single parameter, lambda (λ), which represents the average number of events per interval. The probability mass function of a Poisson distribution is given by,

P(X = k) = λ^k e^(-λ) / k!

where,

k is the number of occurrences (k = 0, 1, 2, ...)

e is Euler’s number (e ≈ 2.71828)

The Poisson distribution assumes that the events are independent of each other and occur at a constant average rate over time or space.
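A short Python sketch of the Poisson PMF, assuming the standard form λ^k e^(-λ) / k!. The rate of 4 events per interval is a made-up value for illustration.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k): probability of exactly k events when the average is lam."""
    if k < 0:
        return 0.0
    return lam**k * exp(-lam) / factorial(k)


lam = 4.0  # hypothetical average number of events per interval
for k in range(8):
    print(k, round(poisson_pmf(k, lam), 4))

# The probabilities sum to 1 and the mean equals lam (checked on a long prefix).
total = sum(poisson_pmf(k, lam) for k in range(60))
mean = sum(k * poisson_pmf(k, lam) for k in range(60))
print(total, mean)
```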

B. Continuous distributions

1. Normal/Gaussian distribution:

“The normal distribution is a universal phenomenon that can be found everywhere in the natural world” — Steven Strogatz.

The normal distribution is the most widely used continuous probability distribution. It is also known as the Gaussian distribution, named after Carl Friedrich Gauss.

A normally distributed random variable x has a symmetric, bell-shaped probability density function, and its mean, median, and mode are equal. Human height and IQ scores are everyday examples of quantities that are approximately normal, and stock returns are often modelled as roughly normal. A single die roll or coin toss is not normal, but the total of many rolls or tosses is approximately normal. Many statistical and inferential methods assume normally distributed data.

Consider a random variable representing the blood pressure of the human population, with mean m and standard deviation s. Suppose we repeatedly draw samples from the population and compute the mean of each sample. The sample mean has its own probability distribution, which converges towards a normal distribution as the size of each sample increases, whatever the shape of the population distribution (provided its variance is finite). This is known as the Central Limit Theorem.
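The Central Limit Theorem can be illustrated with a small simulation. The sketch below does not use blood pressure data; as a stand-in it assumes a strongly skewed exponential population with mean 1, and the sample size (50), number of samples (2000), and seed are arbitrary choices.

```python
import random
import statistics


def sample_means(sample_size: int, num_samples: int, rng: random.Random) -> list[float]:
    """Draw num_samples samples of sample_size values each and return their means.

    The population is exponential with mean 1 (an assumption for this demo),
    which is clearly not normal.
    """
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(num_samples)
    ]


rng = random.Random(42)  # fixed seed so the run is repeatable
means = sample_means(sample_size=50, num_samples=2000, rng=rng)

center = statistics.fmean(means)
spread = statistics.stdev(means)
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)

print(f"mean of sample means: {center:.3f} (population mean is 1)")
print(f"std of sample means:  {spread:.3f} (theory: 1/sqrt(50) = {1 / 50 ** 0.5:.3f})")
print(f"share within 1 std:   {within_one_sd:.3f} (a normal curve gives about 0.68)")
```

Even though individual draws are heavily skewed, the sample means cluster symmetrically around the population mean, with a spread close to s divided by the square root of the sample size.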

The normal distribution follows the empirical rule: about 68% of the data lie within 1 standard deviation of the mean, about 95% lie within 2 standard deviations, and about 99.7% lie within 3 standard deviations.

The normal distribution with a mean of 0 and a standard deviation of 1 is called a Standard Normal Distribution.
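The 68%, 95%, and 99.7% figures of the empirical rule can be recovered from the normal cumulative distribution function. This sketch uses `statistics.NormalDist` from the Python standard library; the helper name `coverage_within` and the second distribution (mean 100, standard deviation 15) are illustrative.

```python
from statistics import NormalDist


def coverage_within(k: float, dist: NormalDist = NormalDist(0, 1)) -> float:
    """Probability that a value lies within k standard deviations of the mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)


for k in (1, 2, 3):
    print(k, round(coverage_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973

# The rule does not depend on the particular mean and standard deviation.
print(round(coverage_within(1, NormalDist(100, 15)), 4))  # 0.6827
```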

2. Exponential distribution:

The exponential distribution is another continuous distribution. It describes the time between events in a Poisson point process. If the number of calls a company receives follows a Poisson distribution with rate λ, then the time interval between calls follows an exponential distribution with the same rate. Its mean is 1/λ and its variance is 1/λ². The exponential distribution also has the memoryless property: the probability of an event occurring in the next time interval is not affected by how much time has elapsed since the last event.
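The memoryless property can be checked numerically. The sketch assumes the standard survival function of the exponential distribution, P(T > t) = e^(-λt), and a hypothetical rate of 0.5 calls per minute; the function names are illustrative.

```python
from math import exp


def exponential_survival(t: float, lam: float) -> float:
    """P(T > t) for T ~ Exponential(rate=lam)."""
    return exp(-lam * t) if t >= 0 else 1.0


def conditional_survival(s: float, t: float, lam: float) -> float:
    """P(T > s + t | T > s): chance of waiting t more, given s has already passed."""
    return exponential_survival(s + t, lam) / exponential_survival(s, lam)


lam = 0.5  # hypothetical rate: 0.5 calls per minute, so the mean gap is 2 minutes

# Waiting at least 2 more minutes is equally likely whether 0, 5, or 30 minutes
# have already passed without a call.
print(exponential_survival(2.0, lam))
print(conditional_survival(5.0, 2.0, lam))
print(conditional_survival(30.0, 2.0, lam))
```

All three values agree up to floating-point rounding, because e^(-λ(s + t)) / e^(-λs) simplifies to e^(-λt).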

3. Student’s-T distribution:

Student’s t distribution is another member of the family of continuous probability distributions. It arises when estimating the mean of a normally distributed population when the sample size is small (a common rule of thumb is fewer than 30) and the population standard deviation is unknown. The shape of the t distribution depends on the degrees of freedom (df), which, for a single sample, is the number of observations minus one. As the degrees of freedom increase, the distribution tends towards the normal distribution.
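The convergence towards the normal distribution can be seen by comparing densities. The sketch below writes out the standard t density by hand, using `math.lgamma` for numerical stability, and measures its largest gap from the standard normal density on a grid from -4 to 4. The grid and the chosen df values are arbitrary illustrative choices.

```python
from math import exp, lgamma, log, pi
from statistics import NormalDist


def student_t_pdf(t: float, df: float) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + t * t / df))


def max_gap_to_normal(df: float) -> float:
    """Largest |t pdf - standard normal pdf| on a grid from -4 to 4."""
    normal_pdf = NormalDist(0, 1).pdf
    grid = [i / 10 for i in range(-40, 41)]
    return max(abs(student_t_pdf(x, df) - normal_pdf(x)) for x in grid)


for df in (1, 5, 30, 1000):
    print(df, round(max_gap_to_normal(df), 5))
```

With few degrees of freedom the t density has a lower peak and heavier tails than the normal; by df = 1000 the two curves are almost indistinguishable.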

4. Chi-square distribution:

The chi-square distribution is one of the most widely used probability distributions in inferential statistics, notably in hypothesis testing. It takes only non-negative values and is right-skewed. It is a special case of the gamma distribution.
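The link to the gamma distribution can be sketched in Python. The code relies on two standard facts that are not spelled out in the article: a chi-square distribution with df degrees of freedom is a gamma distribution with shape df / 2 and scale 2, and it is the distribution of a sum of df squared standard normal variables. The function names, the seed, and df = 4 are illustrative.

```python
import random
import statistics
from math import exp, lgamma, log


def gamma_pdf(x: float, shape: float, scale: float) -> float:
    """Density of the gamma distribution, computed on the log scale."""
    if x <= 0:
        return 0.0
    log_pdf = (shape - 1) * log(x) - x / scale - lgamma(shape) - shape * log(scale)
    return exp(log_pdf)


def chi_square_pdf(x: float, df: int) -> float:
    """Chi-square density, written as the gamma special case shape=df/2, scale=2."""
    return gamma_pdf(x, df / 2, 2.0)


# Simulate chi-square draws as sums of df squared standard normal values.
rng = random.Random(0)  # fixed seed so the run is repeatable
df = 4
draws = [sum(rng.gauss(0, 1) ** 2 for _ in range(df)) for _ in range(20000)]

print(min(draws) >= 0)           # only non-negative values
print(statistics.fmean(draws))   # close to df
print(statistics.median(draws))  # below the mean, a sign of right skew
print(chi_square_pdf(3.0, df))
```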

Statistical distributions are everywhere in daily life. They help a data scientist understand the data in depth, perform better analysis, draw inferences from the data, and select a model suitable for a specific dataset.

If you found this article insightful, follow me on LinkedIn and Medium.

Stay tuned !!!

Thank you !!!

If you have not yet read part 1 and part 2 of this statistics series, check them out!


Published via Towards AI
