Gentle Introduction to Statistics for Machine Learning

Last Updated on May 14, 2022 by Editorial Team

Author(s): Hope Ogidan

Originally published on Towards AI.

Photo by Edge2Edge Media on Unsplash

WHAT IS STATISTICS?

Statistics is the science of collecting, analyzing, and interpreting data. It is a long-established field and essential knowledge for every data scientist. It involves studying data to obtain actionable insights that support decision-making. That is the definition of statistics in its simplest form.

There are some terms in statistics that you need to be familiar with. I will define a few of them here:

1. CONSTRUCT: A construct is a concept or occurrence that is difficult to measure directly. Examples include happiness, sadness, and how well you slept. None of these has a built-in unit of measurement.

2. OPERATIONAL DEFINITION: An operational definition specifies how a construct will be measured, turning it into something observable. For example, "how well you slept" could be operationally defined as the number of hours slept.

3. POPULATION: The population is the entire set of people or things under study.

4. SAMPLE: A sample is a subset of the population that is actually observed or measured.

5. VARIABLES: Variables are characteristics or factors that can take different values and may influence an outcome.

6. HYPOTHESIS: A hypothesis is a testable statement describing the relationship between variables.

Now that you know some basic terminology in the field of statistics, let's recap the definition of machine learning.

Machine learning is the ability of computers to learn patterns from data and make predictions. It uses data science techniques to analyze data, and statistics is one of the core disciplines within data science. Many techniques used in machine learning are made possible by statistics.

Therefore, knowledge of statistics is very important for you as a data scientist to understand what is going on under the hood. You can build models as a machine learning engineer without knowing statistics, but a good understanding of how the process works is important for your progress and for the explainability of your code. Companies prefer to hire someone who can properly explain their code and understands what they are doing.

Branches of Statistics

Statistics is mainly divided into:

1. Descriptive Statistics

2. Inferential Statistics

Descriptive Statistics

Photo by Luke Chesser on Unsplash

In descriptive statistics, you are mainly organizing and summarizing your data using numbers and graphs. For example, you could summarize your data into a bar graph, pie chart, histogram, etc.

To describe your data using graphs, you can make use of the following:

  • Bar graph
  • Line graph
  • Histogram
  • Pie chart

To describe your data using numbers, you mainly make use of the following:

1. Measure of center

2. Measure of dispersion

Measure of center

A measure of central tendency is a single value that describes a dataset by identifying the central position within it.

There are three measures of center:

I. Mean

The mean of a dataset is the sum of the values divided by the number of values. It is sensitive to outliers.

The formula for the mean of n values x_1, ..., x_n:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

II. Median

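The median is the middle value of the data once it has been sorted. Unlike the mean, it is not strongly affected by outliers.

The formula for the median of n sorted values x_{(1)} \le x_{(2)} \le ... \le x_{(n)}:

$$\text{median} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) & \text{if } n \text{ is even} \end{cases}$$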

III. Mode

The mode is the most frequently occurring value in the distribution.
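The three measures of center can be compared on a small dataset. The sketch below uses Python's standard `statistics` module; the `salaries` list is made up for illustration and includes one deliberate outlier to show how differently the mean and the median react to it.

```python
import statistics

# Hypothetical data: salaries in thousands, with one extreme value at the end.
salaries = [30, 32, 32, 35, 38, 40, 250]

mean_value = statistics.mean(salaries)      # sum of values / number of values
median_value = statistics.median(salaries)  # middle value of the sorted data
mode_value = statistics.mode(salaries)      # most frequently occurring value

print(f"mean={mean_value:.2f}, median={median_value}, mode={mode_value}")

# Dropping the outlier shifts the mean a lot, but the median only slightly.
without_outlier = salaries[:-1]
mean_without = statistics.mean(without_outlier)
median_without = statistics.median(without_outlier)
print(f"mean={mean_without}, median={median_without}")
```

With the outlier, the mean is roughly 65 while the median is 35. Without it, the mean falls to 34.5 and the median barely moves, to 33.5.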

Measure of dispersion

Photo by Sven Read on Unsplash

A measure of dispersion helps you determine how spread out the data points are.

We have three measures of dispersion: range, variance, and standard deviation.

i. Range

This is the difference between the maximum and minimum values in the distribution.

The formula for the range is:

$$\text{range} = x_{\max} - x_{\min}$$

ii. Variance

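This is the average of the squared deviations of the data points from the mean.

The formula for the variance of a population of N values with mean \mu:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

For a sample of n values with mean \bar{x}, the sum is divided by n - 1 instead:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$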

iii. Standard deviation

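This is the square root of the variance.

The formula for the standard deviation of a population of N values with mean \mu is:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$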
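As a rough illustration of the three measures of dispersion, the sketch below computes them by hand for a made-up list `data`. It then uses `statistics.variance` to show the sample version, which divides by n - 1 instead of n.

```python
import math
import statistics

# Hypothetical data chosen so the numbers come out round.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# Population variance: average squared deviation from the mean.
mean_value = sum(data) / len(data)
population_variance = sum((x - mean_value) ** 2 for x in data) / len(data)

# Standard deviation: square root of the variance.
population_std = math.sqrt(population_variance)

# Sample variance divides by n - 1 instead of n.
sample_variance = statistics.variance(data)

print(data_range, population_variance, population_std)
print(round(sample_variance, 3))
```

For this data the range is 7, the population variance is 4.0, and the population standard deviation is 2.0. The sample variance is slightly larger because of the smaller divisor.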

Empirical rule

For data that follows an approximately normal distribution, the empirical rule states that:

About 68% of the data falls within one standard deviation of the mean.

About 95% of the data falls within two standard deviations of the mean.

About 99.7% of the data falls within three standard deviations of the mean.

The empirical rule has many applications in probability, but we will not dive into them because they are beyond the scope of this article.
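One way to get a feel for the empirical rule is to simulate it. The sketch below draws values from a normal distribution and counts how many fall within one, two, and three standard deviations of the mean. The mean of 50, the standard deviation of 10, and the sample count are arbitrary choices for illustration, and the rule is only expected to hold for data that is roughly normal.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Assumed parameters of the simulated normal distribution.
mu, sigma = 50.0, 10.0
samples = [random.gauss(mu, sigma) for _ in range(100_000)]

def fraction_within(k):
    """Fraction of samples lying within k standard deviations of the mean."""
    inside = sum(1 for x in samples if abs(x - mu) <= k * sigma)
    return inside / len(samples)

for k in (1, 2, 3):
    print(f"within {k} std: {fraction_within(k):.3f}")
```

The printed fractions should land close to 0.68, 0.95, and 0.997, with small differences caused by random sampling.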

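Central limit theorem

The central limit theorem states that as the sample size increases, the distribution of the sample mean approaches a normal distribution, regardless of the shape of the population distribution (provided the population has a finite variance).

It is often confused with the law of large numbers, which states that as the number of trials increases, the observed probability approaches the theoretical probability.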
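The central limit theorem is about the distribution of sample means, and a small simulation can make that concrete. The sketch below repeatedly draws samples from a strongly skewed exponential distribution (mean 1, standard deviation 1) and looks at how the sample means behave. The sample size of 50 and the 5,000 repetitions are arbitrary assumptions.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from an exponential distribution with rate 1."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

sample_size = 50
means = [sample_mean(sample_size) for _ in range(5_000)]

center = statistics.fmean(means)  # should be near the population mean, 1
spread = statistics.stdev(means)  # should be near 1 / sqrt(sample_size)

# For a roughly normal distribution, about 68% of values lie within one std.
within_one_std = sum(abs(m - center) <= spread for m in means) / len(means)

print(f"center={center:.3f}, spread={spread:.3f}, within 1 std={within_one_std:.3f}")
```

Even though individual exponential draws are far from normal, the sample means cluster symmetrically around 1, with a spread near 1/sqrt(50) and roughly 68% of them within one standard deviation of the center.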

Z score

The z-score measures how many standard deviations a value lies from the mean.
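In code, a z-score is the value minus the mean, divided by the standard deviation. The sketch below applies that to a made-up list of test scores. The helper name `z_score` is our own, and the population standard deviation (`statistics.pstdev`) is used as an assumption about which standard deviation is meant.

```python
import statistics

# Hypothetical test scores.
scores = [55, 60, 65, 70, 75, 80, 85]

mean_score = statistics.mean(scores)   # 70
std_score = statistics.pstdev(scores)  # population standard deviation, 10

def z_score(x, mean, std):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std

print(z_score(85, mean_score, std_score))  # 1.5 standard deviations above the mean
print(z_score(60, mean_score, std_score))  # 1 standard deviation below the mean
```

A positive z-score means the value is above the mean, and a negative one means it is below.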

Inferential statistics

Photo by Scott Graham on Unsplash

This branch of statistics uses sample data to draw conclusions about the population. In inferential statistics, you will learn about estimation and how you can get information about a population from its sample.

In real-world problems, it is often impractical to measure an entire population, so we work with samples in most cases.
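The idea of estimating a population quantity from a sample can be sketched with simulated data. Below, the "population" is 100,000 made-up heights, and a random sample of 200 is used to estimate the population mean. All of the numbers are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(7)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 heights in centimetres.
population = [random.gauss(170, 8) for _ in range(100_000)]

# In practice we rarely see the whole population, so we draw a sample.
sample = random.sample(population, 200)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

# Standard error: how much the sample mean is expected to vary between samples.
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"population mean = {population_mean:.2f}")
print(f"sample mean     = {sample_mean:.2f} (standard error {standard_error:.2f})")
```

The sample mean will not match the population mean exactly, but it should usually land within a few standard errors of it.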

Correlation

Correlation helps us describe the relationships between variables in our dataset.

As a data scientist, it is important to know how well your independent variables correlate with each other, so that you know which variables to combine during feature engineering. You must also know how the independent variables correlate with the dependent variable.

Just as we have units for measuring length, mass, time, and so on, statistics has a measure of correlation called the correlation coefficient (r).

It quantifies the strength and direction of the linear relationship between two variables.

There are two important points that you need to note:

A lack of correlation doesn't imply independence. Correlation doesn't imply causation.

Correlation coefficients:

Close to 1 = strong positive correlation

Close to -1 = strong negative correlation

Close to 0 = no linear relationship
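The correlation coefficient referred to here is usually Pearson's r. The sketch below implements it directly from its definition; the function name `pearson_r` and both datasets are made up for illustration. The second dataset shows why a coefficient near 0 does not imply independence: y is completely determined by x, yet r is 0 because the relationship is not linear.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances up to the same constant factor).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: hours studied vs. exam score, rising together.
hours = [1, 2, 3, 4, 5]
exam_scores = [52, 55, 61, 64, 70]
r_linear = pearson_r(hours, exam_scores)

# y depends entirely on x, but the relationship is a parabola, not a line.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]
r_parabola = pearson_r(xs, ys)

print(f"hours vs. scores: r = {r_linear:.3f}")
print(f"x vs. x squared:  r = {r_parabola:.3f}")
```

The first coefficient comes out close to 1, while the second is 0 even though ys is fully dependent on xs.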

Conclusion

I hope you now have a good idea of what statistics is and why it matters to you as a data scientist or machine learning engineer.

Thanks to all who inspired me to do this. Connect with me on LinkedIn and Twitter and see how well we can bond.

