
Part 3:

Last Updated on July 25, 2023 by Editorial Team

Author(s): Rohini Vaidya

Originally published on Towards AI.

Unraveling the Complexity of Distributions in Statistics | by Rohini Vaidya | Apr 2023 | Towards AI

Navigating the Complex World of Distributions: A Complete Guide to the distributions you should know before starting data analysis.

Photo by Carlos Muza on Unsplash

“Data are just summaries of thousands of stories — tell a few of those stories to help make the data meaningful.”

— Chip and Dan Heath.

Data science revolves around data, and the conclusions we draw depend on how well we understand that data. Distributions give a detailed picture of it: how the data behaves with respect to different features and what its characteristics are. From the distribution of the data, we can draw inferences.

In data science, a statistical distribution refers to the way in which a set of data is spread out over a range of values.

It is a parameterized mathematical function that gives the probability of the different outcomes of a random variable.

A distribution can be represented graphically using various plots such as a histogram, a density plot, or a bar plot. The shape of a distribution is described by properties such as its center, spread, skewness, and kurtosis.

There are two major types of distributions based on the kind of outcome: discrete distributions and continuous distributions.

A distribution that deals with discrete data is known as a discrete distribution. For example, the number of times a coin lands heads in 3 tosses.

  1. Bernoulli distribution
  2. Binomial distribution
  3. Uniform distribution
  4. Geometric distribution
  5. Poisson distribution

A distribution that deals with continuous data is known as a continuous distribution. For example, the scores of the students in a class, when measured on a continuous scale.

  1. Normal/Gaussian distribution
  2. Exponential distribution
  3. Student’s T distribution
  4. Chi-square distribution

Now, we will go through each distribution in depth.

A. Discrete distributions

  1. Bernoulli distribution:

Bernoulli distribution is a type of discrete distribution. It is named after the Swiss mathematician Jacob Bernoulli. It is used only for binary outputs. It is a distribution of a random variable that takes the value 1 with a probability of p and 0 with a probability q=1-p. It is applied to experiments in which questions lead to outcomes that are Boolean-valued. It can take a value like success/true/yes/one with a probability p and values like failure/false/no/zero with a probability 1-p.

The probability mass function (PMF) of a random variable x that follows a Bernoulli distribution is:

$$P(x) = p^{x}(1-p)^{1-x}, \quad x \in \{0, 1\}$$

Here, p is the probability that random variable x is a “success” and 1-p is the probability that random variable x is a “failure”.

From the PMF, we can calculate the expected value and variance of a random variable x. Let’s consider x=1 for a “success” and x=0 for a “failure”; then E(x) and Var(x) are:

$$E(x) = p, \qquad Var(x) = p(1-p)$$

The Bernoulli distribution is a special case of the binomial distribution where a single trial is conducted.

Bernoulli distribution
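The Bernoulli PMF, expected value, and variance described above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `bernoulli_pmf` and the value p = 0.3 are assumptions chosen for illustration and do not come from the original article.

```python
def bernoulli_pmf(x, p):
    """P(X = x) for a Bernoulli(p) variable; x is 1 (success) or 0 (failure)."""
    if x not in (0, 1):
        return 0.0
    return p if x == 1 else 1.0 - p

p = 0.3  # assumed success probability, for illustration only

# Expected value and variance computed directly from the PMF.
mean = sum(x * bernoulli_pmf(x, p) for x in (0, 1))
variance = sum((x - mean) ** 2 * bernoulli_pmf(x, p) for x in (0, 1))

print(mean)      # matches p
print(variance)  # matches p * (1 - p), up to floating-point rounding
```

Computing the moments from the PMF rather than hard-coding p and p(1-p) makes it easy to see where the two formulas come from.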

2. Binomial distribution:

The binomial distribution is also a type of discrete distribution. It describes the number of successes in n independent and identical Bernoulli trials, where each trial has two possible outcomes: success or failure. The probability of success is the same in every trial, and each trial is independent of the others.

The distribution has two parameters: the success probability p and the number of trials n. The PMF is defined using the formula:

$$P(x = k) = \binom{n}{k} p^{k}(1-p)^{n-k}, \quad k = 0, 1, \dots, n$$

As a binomial random variable is the sum of n independent, identical Bernoulli variables, the expected value and the variance are as follows:

$$E(x) = np, \qquad Var(x) = np(1-p)$$
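As a small worked example, the sketch below evaluates the binomial PMF for the coin example mentioned earlier (the number of heads in 3 tosses of a fair coin) and recovers the mean and variance from it. The helper name `binomial_pmf` is an assumption made for this illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 3, 0.5  # three tosses of a fair coin

pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(pmf)       # [0.125, 0.375, 0.375, 0.125]
print(mean)      # 1.5, which is n * p
print(variance)  # 0.75, which is n * p * (1 - p)
```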

3. Uniform distribution:

The uniform distribution can be either a continuous or a discrete distribution. It is a distribution in which every outcome is equally likely. An example of the discrete uniform distribution is rolling a fair die, which is equally likely to land showing any number from 1 to 6. For a continuous uniform distribution over some range, say from a to b, the probability density is constant and the total probability over the entire range must be equal to 1. Uniform distributions are a family of symmetric probability distributions that describe an experiment whose outcome is arbitrary but lies between certain bounds. The probability density function of a continuous uniform distribution is:

$$f(x) = \frac{1}{b-a} \ \text{for } a \le x \le b, \qquad f(x) = 0 \ \text{otherwise}$$

Image source: GeeksforGeeks
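Both flavors of the uniform distribution can be sketched in a few lines. The code below is illustrative only: the names `die_pmf`, `uniform_pdf`, and `uniform_prob`, and the interval [0, 4], are assumptions introduced here rather than anything from the original article.

```python
from fractions import Fraction

# Discrete case: a fair six-sided die, each face with probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
total = sum(die_pmf.values())  # the probabilities sum to 1

def uniform_pdf(x, a, b):
    """Density of the continuous uniform distribution on [a, b]."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_prob(lo, hi, a, b):
    """P(lo <= X <= hi): length of the overlap with [a, b], divided by (b - a)."""
    overlap = max(0.0, min(hi, b) - max(lo, a))
    return overlap / (b - a)

print(total)                             # 1
print(uniform_pdf(2.0, 0.0, 4.0))        # 0.25
print(uniform_prob(1.0, 3.0, 0.0, 4.0))  # 0.5
```

In the continuous case a single point has probability zero, so probabilities are always computed over intervals, as `uniform_prob` does.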

4. Geometric distribution:

The geometric distribution is a probability distribution that describes the number of trials needed to achieve the first success in a series of independent Bernoulli trials, where each trial has two possible outcomes (success or failure) and the probability of success is constant across all the trials. Equivalently, k-1 successive failures are observed before the first success on trial k.

The geometric distribution gives the probability that the first occurrence of success requires k independent trials, each with the success probability p. If the probability of success on each trial is p, then the probability that the kth trial is the first success is:

$$P(x = k) = (1-p)^{k-1} p, \quad k = 1, 2, 3, \dots$$

It is a type of discrete distribution, meaning that k can take only integer values starting from 1. It is used in applications such as quality control, to calculate the probability of a product failing after a certain number of inspections, and marketing, where marketers can use the geometric distribution to estimate the number of times an advertisement needs to be seen before the customer takes an action, such as making a purchase.

Geometric distribution
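The geometric PMF can be sketched as below, using the "number of trials until the first success" convention (k starts at 1). The function name `geometric_pmf` and p = 0.2 are assumptions for illustration. The sketch also checks the standard result, not stated in the article, that the expected number of trials is 1/p.

```python
def geometric_pmf(k, p):
    """P(first success occurs on trial k), for k = 1, 2, 3, ..."""
    if k < 1:
        return 0.0
    return (1 - p) ** (k - 1) * p

p = 0.2  # assumed success probability per trial, for illustration only

first_try = geometric_pmf(1, p)  # success immediately: equals p
third_try = geometric_pmf(3, p)  # two failures, then a success

# A long truncated sum approximates the expected number of trials, 1 / p.
approx_mean = sum(k * geometric_pmf(k, p) for k in range(1, 1000))

print(first_try, third_try, approx_mean)
```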

5. Poisson distribution:

The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space when the events occur randomly and independently. It has a single parameter, lambda (λ), which represents the average rate of occurrence of events. The probability mass function of a Poisson distribution is given by,

$$P(x = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

where,

k is the number of occurrences (k = 0, 1, 2, ...)

e is Euler’s number (e ≈ 2.71828)

The Poisson distribution assumes that the events are independent of each other and occur at a constant average rate over time or space.

Poisson distribution with lambda at one
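To make the Poisson PMF concrete, here is a minimal sketch for λ = 1, the case shown in the figure caption above. The helper name `poisson_pmf` is an assumption for this illustration.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(exactly k events) when events occur at average rate lam per interval."""
    return lam**k * exp(-lam) / factorial(k)

lam = 1.0  # average number of events per interval

pmf = [poisson_pmf(k, lam) for k in range(6)]
total = sum(poisson_pmf(k, lam) for k in range(50))  # very close to 1

for k, pk in enumerate(pmf):
    print(k, round(pk, 4))
```

With λ = 1, observing 0 events and observing 1 event are equally likely, and the probabilities fall off quickly for larger k.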

B. Continuous distributions

  1. Normal/Gaussian distribution:

“The normal distribution is a universal phenomenon that can be found everywhere in the natural world” — Steven Strogatz.

The normal distribution is the most widely used continuous probability distribution. It is also known as the Gaussian distribution, named after Carl Friedrich Gauss.

For a random variable x that follows a normal distribution, the probability density function forms a symmetric bell-shaped curve, and the mean, mode, and median are equal. The height of a person and the IQ level of a student are examples that are approximately normal in day-to-day life, and stock market returns are often modeled this way as a first approximation. A single dice roll or coin toss is not normal, but the total of many rolls or tosses is approximately normal. Many statistical and inferential methods assume that the data follow a normal distribution.


Consider a random variable that takes the blood pressure values of the human population, with mean m and standard deviation s. We collect some samples to represent the random variable, and each sample has its own mean. If we keep collecting samples and calculating the mean of each one, the sample mean has its own probability distribution, which converges towards a normal distribution as the size of each sample increases, even if the population itself is not normally distributed (provided its variance is finite). This is known as the Central Limit Theorem.
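The Central Limit Theorem can be illustrated with a small simulation. The sketch below does not use blood pressure data; as an assumption for illustration, it draws from a strongly skewed exponential population with mean 1 and standard deviation 1, and the sample size (50) and number of samples (5000) are arbitrary choices.

```python
import random
import statistics
from math import sqrt

random.seed(0)  # fixed seed so the run is reproducible

def sample_mean(n):
    """Mean of n draws from a skewed exponential population (mean 1, sd 1)."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(n)])

n = 50  # size of each sample
means = [sample_mean(n) for _ in range(5000)]

center = statistics.fmean(means)  # should be close to the population mean, 1
spread = statistics.stdev(means)  # should be close to 1 / sqrt(n)

print(round(center, 3), round(spread, 3), round(1 / sqrt(n), 3))
```

Although individual draws are far from bell-shaped, the sample means cluster around the population mean with a spread close to s divided by the square root of the sample size, which is what the theorem predicts.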

The normal distribution follows an empirical rule. This rule states that approximately 68% of the data lie within 1 standard deviation of the mean, 95% lie within 2 standard deviations, and 99.7% lie within 3 standard deviations.

Standard deviation
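The empirical rule can be verified from the cumulative distribution function of the standard normal distribution. This is a short sketch using `statistics.NormalDist` from the Python standard library; the variable names are chosen here for illustration.

```python
from statistics import NormalDist

z = NormalDist(mu=0.0, sigma=1.0)  # standard normal distribution

# Probability mass within k standard deviations of the mean.
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, prob in coverage.items():
    print(k, round(prob, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```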

The normal distribution with a mean of 0 and a standard deviation of 1 is called a Standard Normal Distribution.

2. Exponential distribution:

The exponential distribution is another type of continuous distribution. It describes the time between events in a Poisson point process. If the number of calls a company receives follows a Poisson distribution with rate λ, then the time interval between calls follows the exponential distribution. The mean of an exponential distribution is 1/λ and its variance is 1/λ². The exponential distribution also has the memoryless property: the probability of an event occurring in the next time interval is not affected by how much time has elapsed since the last event occurred.

Exponential distribution
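The memoryless property can be checked directly from the survival function P(T > t) = e^(-λt) of the exponential distribution. In the sketch below, the rate of 2 calls per minute and the waiting times are assumptions chosen only for illustration.

```python
from math import exp

def survival(t, lam):
    """P(T > t): probability that the wait for the next event exceeds t."""
    return exp(-lam * t)

lam = 2.0          # assumed rate: 2 calls per minute
s, t = 1.5, 0.5    # already waited s minutes; how likely is t more?

conditional = survival(s + t, lam) / survival(s, lam)  # P(T > s + t | T > s)
unconditional = survival(t, lam)                       # P(T > t)

mean_wait = 1 / lam      # mean of the exponential distribution
variance = 1 / lam**2    # variance of the exponential distribution

print(conditional, unconditional)  # the two agree up to rounding
```

Having already waited 1.5 minutes does not change the chance of waiting another half minute, which is exactly what "memoryless" means.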

3. Student’s-T distribution:

It is another member of the family of continuous probability distributions. It arises when estimating the mean of a normally distributed population when the sample size is small (commonly taken as less than 30) and the population standard deviation is unknown. The shape of the t distribution depends on the degrees of freedom (df), which here is the number of independent observations in the sample minus one. As the degrees of freedom increase, this distribution tends towards the normal distribution.

Student’s T distribution
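The convergence of the t distribution towards the normal distribution can be seen by comparing their densities. The Python standard library has no t distribution, so the sketch below implements the standard density formula by hand; `t_pdf` is a helper written for this illustration, and the chosen degrees of freedom are arbitrary.

```python
from math import exp, lgamma, log, pi
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    # Work in log space so large df values do not overflow the gamma function.
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + x * x / df))

normal_pdf = NormalDist().pdf  # standard normal density

# Gap between the two densities at the center shrinks as df grows.
gaps = [abs(t_pdf(0.0, df) - normal_pdf(0.0)) for df in (1, 5, 30, 1000)]
print([round(g, 5) for g in gaps])

# With few degrees of freedom the t distribution has heavier tails.
print(t_pdf(3.0, 5) > normal_pdf(3.0))  # True
```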

4. Chi-square distribution:

It is one of the most widely used probability distributions in inferential statistics, notably in hypothesis testing. It takes only non-negative values and is right-skewed. It is a special case of the gamma distribution.

Chi-square distribution
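One standard way to build intuition for the chi-square distribution is that, with k degrees of freedom, it is the distribution of the sum of squares of k independent standard normal variables. That definition is not stated in the article. The simulation below is a sketch under that definition, with k = 4 and the number of draws chosen arbitrarily; it checks the non-negativity and right skew described above, along with the standard mean k and variance 2k.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible

def chi_square_draw(k):
    """One chi-square draw: sum of k squared standard normal values."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))

k = 4  # degrees of freedom, chosen for illustration
draws = [chi_square_draw(k) for _ in range(20000)]

mean = statistics.fmean(draws)         # close to k
variance = statistics.variance(draws)  # close to 2 * k
median = statistics.median(draws)

print(min(draws) >= 0)  # True: values are never negative
print(mean > median)    # True here: the long right tail pulls the mean up
```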

Statistical distributions are everywhere in our daily life. Distributions play a vital role in helping the data scientist know the data in depth, perform better data analysis, draw inferences from the data, and select a model suitable for a specific dataset.

If you found this article insightful, follow me on LinkedIn and Medium.

Stay tuned !!!

Thank you !!!

If you have not checked part 1 and part 2 of this statistics series, then check them out if you are interested!

Basic Concepts of statistics that every Data scientist should know.

It’s easy to lie with statistics, It’s hard to tell the truth without statistics.

pub.towardsai.net

A Comprehensive Guide for Handling Outliers

Part: 2

pub.towardsai.net


Published via Towards AI
