Breaking Down the Central Limit Theorem: What You Need to Know
Last Updated on March 19, 2023 by Editorial Team
Author(s): Chinmay Bhalerao
Originally published on Towards AI.
The Importance of the Central Limit Theorem in Statistical Inference
“Even if you are not normal, the average is normal!” – Josh Starmer
The central limit theorem is a fundamental concept in probability theory and statistics. But before diving into the theorem itself, you should have an idea of the normal distribution. I have explained the normal distribution in very simple words, with examples, in the blog below; you can refer to it for an introduction. If you are already familiar with the normal distribution, you can skip the link and the next few paragraphs.
Why is everyone obsessed with the normal distribution?
“I always used to think: my life may not be normal, but my data’s distribution must be!” – Chinmay
There are two important quantities that describe the normal distribution.
- Mean — the average of all the points in the sample, computed by summing the values and dividing by the number of values in the sample.
- Standard deviation — a measure of how much the data deviate from the sample mean.
A normal distribution is determined by two parameters: the mean and the variance. A normal distribution with a mean of 0 and a standard deviation of 1 is called the standard normal distribution.
The normal distribution is a bell-shaped curve in which, ideally, mean = mode = median.
When the data are concentrated around the center (the mean) and the values taper off toward the higher and lower ends, the distribution is typically normal. Plotted, it looks like a bell, which is why it is known as a bell-shaped distribution.
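To see these properties concretely, here is a minimal sketch using only Python's standard library: it draws a large sample from the standard normal distribution and checks that the sample mean is near 0, the sample standard deviation is near 1, and (by symmetry) the mean and median roughly agree.

```python
import random
import statistics

random.seed(42)

# Draw a large sample from the standard normal distribution N(0, 1)
sample = [random.gauss(0, 1) for _ in range(100_000)]

sample_mean = statistics.mean(sample)      # expected to be close to 0
sample_std = statistics.stdev(sample)      # expected to be close to 1
sample_median = statistics.median(sample)  # for a symmetric bell, mean ≈ median
```

With 100,000 draws, all three quantities land very close to their theoretical values, illustrating why the bell curve is fully described by just its mean and standard deviation.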
In statistics, sampling refers to the process of selecting a subset of individuals or observations from a larger population in order to make inferences about the population as a whole.
The individuals or observations selected for the sample should be representative of the population from which they are drawn, so that inferences based on the sample can be generalized to the entire population.
The easiest way to explain it: suppose I have 10 people who play cricket, 10 who play basketball, and 10 who play chess, and I need to choose 6 people for a sports association committee. To make the committee unbiased, I choose 2 people from each of the 3 sports, and within each sport the choice is random (there are formal methods for choosing a sample of size k, but the selection here is still random). Those chosen people are a sample drawn from the sample space of all the students.
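The committee example above is stratified random sampling: the strata (sports) are fixed, and the picks within each stratum are random. A minimal sketch, with hypothetical player names:

```python
import random

random.seed(7)

# Hypothetical rosters: 10 players per sport
players = {
    "cricket":    [f"cricket_{i}" for i in range(10)],
    "basketball": [f"basketball_{i}" for i in range(10)],
    "chess":      [f"chess_{i}" for i in range(10)],
}

# Stratified random sampling: pick 2 members at random from each sport
committee = []
for sport, roster in players.items():
    committee.extend(random.sample(roster, 2))
```

Every run gives a 6-person committee with exactly 2 representatives per sport, which is what keeps the selection unbiased across the three groups.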
Statistics and data mining are concerned with data. How do we link sample spaces and events to data? The link is provided by the concept of a random variable.
A random variable is a mapping
X : Ω → R
that assigns a real number X(ω) to each outcome ω in the sample space Ω.
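As a tiny illustration of this definition (a sketch, not part of the formal theory): take Ω to be the four outcomes of two coin tosses, and let the random variable X map each outcome to its number of heads.

```python
import itertools

# Sample space Ω for two coin tosses: ('H','H'), ('H','T'), ('T','H'), ('T','T')
omega = list(itertools.product("HT", repeat=2))

# Random variable X: Ω → R, here X(ω) = number of heads in the outcome ω
def X(outcome):
    return sum(1 for toss in outcome if toss == "H")

values = {w: X(w) for w in omega}
```

Each outcome ω gets a real number: ('H','H') maps to 2, ('T','T') maps to 0, and the two mixed outcomes map to 1.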
Central limit theorem
The basic definition of the central limit theorem can be stated as,
“The sums or averages of a large number of independent and identically distributed random variables will be approximately normally distributed, regardless of the underlying distribution of the individual random variables.”
The central limit theorem has three main components.
- The first component is the requirement that the random variables be independent and identically distributed. This means that each random variable is drawn from the same probability distribution and that the outcome of one variable does not affect the outcome of any other variable. This requirement ensures that the behavior of the random variables is consistent across the sample and reduces the effect of any outliers or extreme values.
- The second component of the central limit theorem is the requirement that the sample size is large. This means that the sum or average of the random variables is based on a significant number of observations. As the sample size increases, the distribution of the sum or average becomes more normal, regardless of the underlying distribution of the individual random variables.
- The third component of the central limit theorem is that the distribution of the sum or average of the random variables converges to a normal distribution. This means that as the sample size increases, the distribution of the sum or average becomes more tightly clustered around the mean of the distribution, and the shape of the distribution becomes more bell-shaped.
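The three components above can be demonstrated in a few lines of Python. This sketch starts from a heavily skewed (exponential) distribution, repeatedly averages iid samples of size 50, and checks that the resulting sample means cluster around µ with spread close to σ/√n, just as the theorem predicts. For the exponential distribution with rate 1, both µ and σ² equal 1.

```python
import random
import statistics

random.seed(0)

n = 50           # sample size for each average
trials = 20_000  # how many sample means we collect

# Underlying distribution: exponential with mean 1 (heavily skewed, not normal)
def one_sample_mean():
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

means = [one_sample_mean() for _ in range(trials)]

# CLT prediction: the means are approximately Normal(µ = 1, σ²/n = 1/50)
mean_of_means = statistics.mean(means)               # should be near 1
std_of_means = statistics.stdev(means)               # should be near sqrt(1/50) ≈ 0.141
```

Plotting a histogram of `means` would show the familiar bell shape, even though a histogram of the raw exponential draws is sharply skewed.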
The law of large numbers says that the distribution of X̄n piles up near µ. This alone isn't enough to help us approximate probability statements about X̄n; for that, we need the central limit theorem. Suppose that X1, …, Xn are iid with mean µ and variance σ². The central limit theorem (CLT) says that
X̄n = (1/n) Σi Xi
has a distribution that is approximately Normal with mean µ and variance σ²/n. This is remarkable, since nothing is assumed about the distribution of the Xi except the existence of the mean and variance.
Probability statements about X̄n can therefore be approximated using the Normal distribution. It is the probability statements we are approximating, not the random variable itself.
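Here is what "approximating a probability statement" looks like in practice, as a sketch: for samples of size n = 30 from Uniform(0, 1) (which has µ = 1/2 and σ² = 1/12), we estimate P(X̄n ≤ 0.55) by brute-force simulation and compare it with the CLT approximation, Φ((0.55 − µ) / √(σ²/n)).

```python
import math
import random
import statistics

random.seed(1)

mu, sigma2, n = 0.5, 1 / 12, 30  # Uniform(0, 1): mean 1/2, variance 1/12

# Simulated probability that the sample mean falls at or below 0.55
trials = 50_000
hits = sum(
    statistics.mean(random.random() for _ in range(n)) <= 0.55
    for _ in range(trials)
)
simulated = hits / trials

# CLT approximation: X̄n ≈ Normal(mu, sigma2 / n), so standardize and
# evaluate the standard normal CDF Φ(z) = (1 + erf(z / √2)) / 2
z = (0.55 - mu) / math.sqrt(sigma2 / n)
approximated = 0.5 * (1 + math.erf(z / math.sqrt(2)))
```

The two numbers agree to within simulation noise even at n = 30, because the uniform distribution is symmetric and the normal approximation kicks in quickly.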
Where is the CLT used today?
The central limit theorem has many practical applications. One of the most important is hypothesis testing. [I am going to write a separate blog on hypothesis testing, but until then, you can refer to the attached link.] Hypothesis testing involves using a sample to make inferences about a population. The central limit theorem allows us to make assumptions about the distribution of the sample mean, which is often used as a test statistic in hypothesis testing. For example, if we are testing whether the mean of a population is equal to a certain value, we can use the central limit theorem to assume that the distribution of the sample mean is approximately normal, regardless of the underlying distribution of the individual observations.
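As a concrete (hypothetical) example of CLT-backed hypothesis testing, here is a minimal two-sided z-test: we test H0: µ = 50 against H1: µ ≠ 50, relying on the CLT to treat the standardized sample mean as approximately standard normal. The data values below are made up for illustration.

```python
import math
import statistics

# Hypothetical measurements; null hypothesis H0: population mean = 50
data = [52.1, 49.8, 50.9, 51.7, 48.6, 53.0, 50.4, 52.5, 49.2, 51.1,
        50.7, 52.9, 48.9, 51.4, 50.2, 53.3, 49.5, 51.8, 50.6, 52.2]
mu0 = 50.0

n = len(data)
xbar = statistics.mean(data)
s = statistics.stdev(data)

# CLT: under H0, the standardized sample mean z is approximately N(0, 1)
z = (xbar - mu0) / (s / math.sqrt(n))

# Two-sided p-value from the standard normal CDF
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

Here z comes out above 3, giving a p-value well below 0.05, so at the usual 5% level we would reject H0 and conclude the population mean differs from 50. (For small n, a t-test would be the more careful choice; the z-test is used here purely to show the CLT at work.)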
Another important application of the central limit theorem is in confidence interval estimation. Confidence intervals are used to estimate the range of values within which a population parameter is likely to fall. The central limit theorem allows us to assume that the distribution of the sample mean is approximately normal, which allows us to construct confidence intervals using the properties of the normal distribution.
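The confidence-interval construction described above is short enough to sketch directly: thanks to the CLT, a 95% interval for the population mean is the sample mean plus or minus 1.96 standard errors. The sample below is simulated from a hypothetical population for illustration.

```python
import math
import random
import statistics

random.seed(3)

# Hypothetical sample drawn from a population whose mean we want to estimate
sample = [random.gauss(100, 15) for _ in range(200)]

n = len(sample)
xbar = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean

# 95% confidence interval via the normal approximation (z-critical ≈ 1.96)
lower, upper = xbar - 1.96 * se, xbar + 1.96 * se
```

By the CLT, intervals built this way capture the true population mean in roughly 95% of repeated samples, whatever the shape of the underlying population, provided n is large enough.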
The central limit theorem also has important applications in statistical process control. Statistical process control involves monitoring and controlling a process to ensure that it remains within certain limits. The central limit theorem allows us to assume that the distribution of the sample mean is approximately normal, which allows us to establish control limits based on the properties of the normal distribution.
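A minimal sketch of a control chart for subgroup means, with hypothetical process parameters: the CLT says subgroup means are approximately Normal(µ, σ²/n), so the classic 3-sigma control limits for the mean are µ ± 3σ/√n.

```python
import math
import statistics

# Hypothetical in-control process: target mu = 10.0, sigma = 0.2,
# monitored with subgroups of size n = 5
mu, sigma, n = 10.0, 0.2, 5

# CLT: subgroup means ≈ Normal(mu, sigma²/n), so the 3-sigma limits are:
se = sigma / math.sqrt(n)
ucl = mu + 3 * se  # upper control limit
lcl = mu - 3 * se  # lower control limit

# Example subgroup: flag the process if its mean falls outside the limits
subgroup = [10.1, 9.9, 10.2, 9.8, 10.0]
in_control = lcl <= statistics.mean(subgroup) <= ucl
```

Because the limits are set at 3 standard errors, an in-control process triggers a false alarm only about 0.3% of the time, which is exactly the normal-distribution property the CLT lets us borrow.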
Limitations of CLT
Despite its wide applications, the central limit theorem has some limitations. One limitation is that it assumes that the random variables are independent and identically distributed. In practice, this assumption may not always be valid. For example, in time series data, observations may be correlated over time, which violates the independence assumption. Additionally, the central limit theorem assumes that the sample size is large. In practice, it may be difficult or expensive to collect a large sample, which can limit the usefulness of the central limit theorem.
The central limit theorem is a fundamental concept in probability theory and statistics. It states that, under certain conditions, the sum or average of a large number of independent and identically distributed random variables will be approximately normally distributed, regardless of the underlying distribution of the individual random variables. The central limit theorem has many important applications, including hypothesis testing, confidence interval estimation, and statistical process control. However, it has some limitations, including the assumptions of independence and a large sample size. Despite these limitations, the central limit theorem remains a powerful tool for analyzing and understanding data.
If you have found this article insightful
It is a proven fact that “generosity makes you a happier person”; therefore, give claps to this article if you liked it. If you found it insightful, follow me on LinkedIn and Medium. You can also subscribe to get notified when I publish articles. Let’s create a community! Thanks for your support!
You can read my other blogs related to:
Mastering the Bias-Variance Dilemma: A Guide for Machine Learning Practitioners
The Yin and Yang of Machine Learning: Balancing Bias and Variance
Comprehensive Guide: Top Computer Vision Resources All in One Blog
Save this blog for a comprehensive set of computer vision resources
Unlock Your Data’s Potential with Data Science
A bird-eye view introduction of everything about data science
Feature selection techniques for data
Heuristic and Evolutionary feature selection techniques