Master Essential — Key Statistics Every Aspiring Data Scientist Must Grasp Before Taking the Dive — Part-I

Last Updated on February 2, 2024 by Editorial Team

Author(s): Kamireddy Mahendra

Originally published on Towards AI.

The most underestimated way to excel in any field is to build a strong foundation. Keep strengthening your fundamentals.

Before getting into the main topic of this article, it is important to make sure we are comfortable with the fundamentals required to enter the world of data.

Let’s start with the basics in brief.

Variable:

A variable is a characteristic of any entity being studied that is capable of taking different values.

Example: single characters such as x, y, z, a, b, etc.

Example: descriptive names such as height, length, temperature, weight, etc.

Measurement:

It is the standard process of assigning numbers to a particular characteristic of a variable.

Example: finding the height of a person, finding the current temperature, etc.

Data:

Data is a recorded measurement, that is, the values assigned to a variable. On its own, data does not convey any message or context. There are two major types of data: qualitative and quantitative.

Example: x=9, e=mahendra, t=strength, etc.

Information:

Information is data that conveys meaning or provides context for a situation: values assigned to variables whose meaning is known.

Example: height of a person=9ft, today's temperature=98F, object colour=white, etc.

Data Analysis:

Data analysis is the process of examining, transforming, and arranging raw data in a specific way to generate useful information from it. Data analysis looks at past events to understand past results in a given context.

Types of Data Analysis:

Descriptive Analysis
Diagnostic Analysis
Predictive Analysis
Prescriptive Analysis
Inferential Analysis

Data Analytics:

It is the scientific process of transforming data into insights for making better decisions. Data analytics is also used to explore future events by applying different techniques.

Types of Data Analytics:

Descriptive Analytics → What happened
Diagnostic Analytics → Why did it happen
Predictive Analytics → What will happen
Prescriptive Analytics → How can we make it happen

Probability:

It is the measure of the chance that a specific event will occur. Its value always lies between 0 (the event is impossible) and 1 (the event is certain), inclusive.
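
As a small illustration, here is a minimal sketch of the classical definition of probability (favorable outcomes divided by all equally likely outcomes). The function name `probability` and the die example are my own assumptions for illustration, not something from the reference.

```python
from fractions import Fraction


def probability(favorable_outcomes, total_outcomes):
    """Classical probability: favorable outcomes / all equally likely outcomes."""
    if total_outcomes <= 0 or not 0 <= favorable_outcomes <= total_outcomes:
        raise ValueError("need 0 <= favorable_outcomes <= total_outcomes, total > 0")
    return Fraction(favorable_outcomes, total_outcomes)


# Hypothetical example: rolling an even number (2, 4, or 6) on a fair six-sided die.
p_even = probability(3, 6)
print(p_even, float(p_even))  # 1/2 0.5

# Complement rule: P(not even) = 1 - P(even).
p_not_even = 1 - p_even
```

Because the number of favorable outcomes can never exceed the total, the result always stays between 0 and 1.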

Now let's get into the main topic of the article and look at the statistical concepts a data scientist should be aware of.

Statistics:

It is the study of collecting, analyzing, and interpreting data to find patterns and make sense of information.

Types of Statistics:

There are two major types of statistics that every data scientist should be using:

i. Descriptive Statistics

ii. Inferential Statistics (discussed in the next article)

i. Descriptive statistics:

Descriptive statistics summarize data using measures such as the mean, median, and mode, along with visualizations, to describe the central tendency and distribution of the data.

These are the fundamental statistics that data scientists use in exploratory data analysis (EDA).
Some important concepts in descriptive statistics are explained below.

a. Measures of Central Tendency:

Mean:

It is the average of a group of numbers. It is computed by summing all values in the data set and dividing the sum by the number of values. The mean is affected by every value in the data set, including extreme values.

Example: given an array of numbers such as 3,5,6,7,8, the mean is (3+5+6+7+8)/5 = 29/5 = 5.8
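
The same calculation can be sketched in Python. This is a minimal example using the standard library `statistics` module; the extra value 100 is a made-up outlier added only to show how strongly one extreme value pulls the mean.

```python
import statistics

values = [3, 5, 6, 7, 8]

# Manual computation: sum of the values divided by how many there are.
manual_mean = sum(values) / len(values)

# The standard library gives the same result.
library_mean = statistics.mean(values)
print(manual_mean, library_mean)  # 5.8 5.8

# A single extreme value (hypothetical) shifts the mean noticeably.
with_outlier = values + [100]
outlier_mean = statistics.mean(with_outlier)
print(outlier_mean)  # 21.5
```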

Weighted Mean:

Sometimes we wish to average numbers while assigning more importance, or weight, to some of them. The weighted mean is the sum of each value multiplied by its weight, divided by the sum of the weights.

Example: mid-1 marks=25 with 70% weightage and mid-2 marks=20 with 30% weightage. The weighted mean is ((25*0.7)+(20*0.3))/(0.7+0.3) = 23.5/1 = 23.5
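
Here is a minimal sketch of a weighted mean in Python. The helper name `weighted_mean` is my own, and dividing by the sum of the weights (rather than by the number of values) is what keeps the result correct even when the weights do not add up to 1.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


marks = [25, 20]       # mid-1 and mid-2 marks
weights = [0.7, 0.3]   # 70% and 30% weightage

result = weighted_mean(marks, weights)
print(round(result, 2))  # 23.5

# Weights given as 7 and 3 describe the same proportions and give the same answer.
same_result = weighted_mean(marks, [7, 3])
```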

Median:

The middle value in an ordered array. If the array has an odd number of terms, then the middle value is the median. If the array has an even number of terms, then the median is the average of the two middle numbers.

Example: given an array like 4,3,2,6,7,8,43,4, first sort it in ascending (or descending) order: 2,3,4,4,6,7,8,43. The array has an even number of terms, so the median is the average of the two middle values, 4 and 6. Therefore, the median is 5.
Another example is 4,5,7,8,9,53,22,2,5. After sorting, the array is 2,4,5,5,7,8,9,22,53. It has an odd number of terms, so the median is the middle value, 7.
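
Both examples can be checked with a few lines of Python. This sketch relies on `statistics.median` from the standard library, which sorts the data internally, so the input does not need to be ordered first.

```python
import statistics

even_count = [4, 3, 2, 6, 7, 8, 43, 4]      # 8 values: average the two middle ones
odd_count = [4, 5, 7, 8, 9, 53, 22, 2, 5]   # 9 values: take the middle one

print(sorted(even_count))             # [2, 3, 4, 4, 6, 7, 8, 43]
median_even = statistics.median(even_count)
print(median_even)                    # 5.0

print(sorted(odd_count))              # [2, 4, 5, 5, 7, 8, 9, 22, 53]
median_odd = statistics.median(odd_count)
print(median_odd)                     # 7
```

Notice that the value 43 in the first array has almost no effect on the median, unlike the mean.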

Mode:

The most frequently occurring value in a data set. It applies to all levels of data, including categorical data. A data set can have more than one mode:

i. Bimodal: a data set that has two modes.

ii. Multimodal: a data set that has more than two modes.
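
A minimal sketch of finding modes in Python is shown below. It uses `statistics.multimode` (available from Python 3.8), and the helper `describe_modes` and the sample lists are hypothetical names and data chosen for illustration.

```python
import statistics


def describe_modes(data):
    """Return the list of modes and a label based on how many there are."""
    modes = statistics.multimode(data)
    if len(modes) == 1:
        label = "unimodal"
    elif len(modes) == 2:
        label = "bimodal"
    else:
        # Note: data where every value is unique also ends up here in this simple sketch.
        label = "multimodal"
    return modes, label


print(describe_modes([2, 4, 4, 6]))          # ([4], 'unimodal')
print(describe_modes([1, 2, 2, 3, 3, 4]))    # ([2, 3], 'bimodal')
print(describe_modes([1, 1, 2, 2, 3, 3]))    # ([1, 2, 3], 'multimodal')

# The mode also works for non-numeric data.
print(describe_modes(["red", "blue", "red", "green"]))  # (['red'], 'unimodal')
```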

Percentile:

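Percentiles are measures of location that divide an ordered group of data into 100 parts. There are 99 percentile values marking the boundaries between those parts; for example, the 90th percentile is the value below which about 90% of the data falls.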

Quartiles:

Quartiles are measures of location that divide an ordered group of data into four subgroups, using three cut points:

Q1- 25% of the data set is below the first quartile.
Q2- 50% of the data set is below the second quartile (this is the median).
Q3- 75% of the data set is below the third quartile.
The remaining 25% of the data set lies above Q3 and forms the fourth subgroup.
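
Quartiles and percentiles can be sketched with `statistics.quantiles` from the Python standard library (Python 3.8+). Keep in mind that several conventions exist for interpolating between data points, so other tools may return slightly different values; the `method="inclusive"` choice here is an assumption on my part, not a rule from the reference.

```python
import statistics

data = [4, 5, 7, 8, 9, 53, 22, 2, 5]

# n=4 returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
print(q1, q2, q3)  # 5.0 7.0 9.0

# n=100 returns the 99 percentile cut points; index 89 is the 90th percentile.
percentiles = statistics.quantiles(data, n=100, method="inclusive")
p90 = percentiles[89]
print(len(percentiles))  # 99
```

The second quartile matches the median of the same data, as expected.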

b. Measures of Dispersion or Variability:

Measures of variability describe the spread or dispersion of a set of data. Dispersion indicates how reliable a measure of central tendency is: the smaller the spread, the better the central value represents the data. It is also used to compare the dispersion of different samples.

Range:

The difference between the largest and the smallest values in a set of data. It is simple to compute because it depends only on the two extreme data points, but for the same reason it ignores all other values and is very sensitive to outliers.

Interquartile Range:

The range of values between the first and third quartiles, that is, the range of the middle half of the data. It is less influenced by extreme values than the range.

Interquartile range (IQR) = Q3 - Q1
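
The difference between the two measures is easy to see in code. This is a minimal sketch; the helper names are my own, and the quartiles use the `inclusive` interpolation method of `statistics.quantiles` (Python 3.8+), which is one of several accepted conventions.

```python
import statistics


def value_range(data):
    """Largest value minus smallest value."""
    return max(data) - min(data)


def interquartile_range(data):
    """Q3 minus Q1, using the 'inclusive' quartile convention."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1


data = [2, 4, 5, 5, 7, 8, 9, 22, 53]
print(value_range(data))          # 51
print(interquartile_range(data))  # 4.0

# Dropping the single largest value changes the range a lot, the IQR only slightly.
trimmed = data[:-1]
print(value_range(trimmed), interquartile_range(trimmed))
```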

Mean Absolute Deviation:

The average of the absolute deviations from the mean.
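
As a rough sketch, the mean absolute deviation can be computed in two steps: find the mean, then average the distances from it. The function name `mean_absolute_deviation` is my own.

```python
import statistics


def mean_absolute_deviation(data):
    """Average distance of each value from the mean."""
    center = statistics.mean(data)
    return sum(abs(x - center) for x in data) / len(data)


values = [3, 5, 6, 7, 8]   # mean is 5.8
mad = mean_absolute_deviation(values)
print(round(mad, 2))  # 1.44
```

The absolute value is needed because the raw deviations from the mean always sum to zero.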

Population Variance:

The average of the squared deviations from the arithmetic mean.

Standard Deviation:

The square root of the population variance is called the standard deviation.

Uses of S.D.:

Indicator of financial risk (a high S.D. means more risk; a low S.D. means less risk).
Quality control, for example in the construction of quality control charts (low variance indicates more consistent quality).
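
Population variance and standard deviation can be sketched as follows. The manual computation follows the definitions given above, and `statistics.pvariance` and `statistics.pstdev` from the standard library are used as a cross-check. The data is a small made-up list.

```python
import math
import statistics

values = [3, 5, 6, 7, 8]
mean = sum(values) / len(values)

# Population variance: average of the squared deviations from the mean.
population_variance = sum((x - mean) ** 2 for x in values) / len(values)

# Standard deviation: square root of the variance, in the same units as the data.
population_sd = math.sqrt(population_variance)
print(round(population_variance, 2), round(population_sd, 2))  # 2.96 1.72

# Cross-check with the standard library's population versions.
library_variance = statistics.pvariance(values)
library_sd = statistics.pstdev(values)
```

Note that `statistics.variance` and `statistics.stdev` divide by n-1 instead of n; they are meant for samples rather than whole populations.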

Coefficient of Variation:

It is the ratio of the standard deviation to the mean, expressed as a percentage. It is the measurement of relative dispersion.
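
The sketch below shows why a relative measure is useful. The two samples are hypothetical and were chosen to have the same standard deviation but different means; the function name `coefficient_of_variation` is my own.

```python
import statistics


def coefficient_of_variation(data):
    """Population standard deviation divided by the mean, as a percentage."""
    mean = statistics.mean(data)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.pstdev(data) / mean * 100


# Hypothetical samples with identical spread but different means.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)
print(round(cv_heights, 2), round(cv_weights, 2))  # 8.32 20.2
```

Both lists have the same standard deviation, yet the weights vary much more relative to their mean.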

c. Measures of Shape:

Skewness:

The skewness of a distribution can be assessed by comparing the relative positions of the mean, median, and mode.

The distribution is symmetrical if Mean = Median = Mode.
The distribution is typically skewed right (positively skewed) if the median lies between the mode and the mean, and the mode is less than the mean:
Mode < Median < Mean
The distribution is typically skewed left (negatively skewed) if the median lies between the mean and the mode, and the mode is greater than the mean:
Mean < Median < Mode
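
To make this concrete, here is a minimal sketch that compares the mean, median, and mode and also computes a numeric skewness value. The numeric formula (third central moment divided by the cubed standard deviation) is a common moment-based definition that I am adding for illustration; it is not the comparison method described above, and the function name `skewness` is my own.

```python
import statistics


def skewness(data):
    """Moment-based population skewness: m3 / m2 ** 1.5."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


right_skewed = [2, 3, 4, 4, 6, 7, 8, 43]
mode = statistics.mode(right_skewed)
median = statistics.median(right_skewed)
mean = statistics.mean(right_skewed)
print(mode, median, mean)          # 4 5.0 9.625
print(skewness(right_skewed) > 0)  # True

symmetric = [1, 2, 3, 4, 5]
print(skewness(symmetric))         # 0.0
```

The large value 43 pulls the mean to the right of the median, which matches the Mode < Median < Mean ordering.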

Kurtosis:

Kurtosis is a statistical measure that describes the distribution of data points in a dataset. It provides information about the shape of the distribution, specifically focusing on the tails and the overall “peakiness” or “flatness” compared to a normal distribution.

There are three types of kurtosis:

1. Mesokurtic (normal distribution):

A distribution with excess kurtosis equal to 0 (that is, kurtosis equal to 3, the value for a normal distribution). This means that the tails of the distribution are neither heavier nor lighter than those of a normal distribution.

2. Leptokurtic:

A distribution with positive excess kurtosis. This indicates that the tails of the distribution are heavier than those of a normal distribution. The distribution has relatively more values in the tails, often accompanied by a sharper peak.

3. Platykurtic:

A distribution with negative excess kurtosis. This means that the tails of the distribution are lighter than those of a normal distribution. The distribution has relatively fewer values in the tails, often accompanied by a flatter peak.
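
The three types can be told apart with a short calculation. This sketch uses the moment-based definition of excess kurtosis (fourth central moment divided by the squared variance, minus 3). The function names, the sample data, and the `tolerance` used to decide what counts as "close to 0" are all illustrative assumptions of mine.

```python
def excess_kurtosis(data):
    """Moment-based population excess kurtosis: m4 / m2 ** 2 - 3."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return m4 / m2 ** 2 - 3


def classify_kurtosis(data, tolerance=0.1):
    """Label the data; `tolerance` is an arbitrary cut-off chosen for this sketch."""
    value = excess_kurtosis(data)
    if value > tolerance:
        return "leptokurtic"
    if value < -tolerance:
        return "platykurtic"
    return "mesokurtic"


flat = [1, 2, 3, 4, 5]                    # evenly spread, light tails
heavy_tail = [2, 3, 4, 4, 6, 7, 8, 43]    # one value far out in the tail

print(round(excess_kurtosis(flat), 2), classify_kurtosis(flat))  # -1.3 platykurtic
print(classify_kurtosis(heavy_tail))                             # leptokurtic
```

With very small samples like these the result is only a rough indication; kurtosis estimates become meaningful with larger data sets.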

Box-and-Whisker Plots (Boxplots):

A boxplot summarizes a data set using five specific values:
— Median, Q2
— First quartile, Q1
— Third quartile, Q3
— Minimum value in the data set
— Maximum value in the data set
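
These five values are often called the five-number summary, and they can be sketched with the standard library before drawing any plot. The helper name `five_number_summary` is my own, and the quartiles use the `inclusive` method of `statistics.quantiles` (Python 3.8+), so plotting libraries may show slightly different quartile values.

```python
import statistics


def five_number_summary(data):
    """Minimum, Q1, median (Q2), Q3, and maximum of the data."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": min(data),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(data),
    }


data = [4, 5, 7, 8, 9, 53, 22, 2, 5]
summary = five_number_summary(data)
print(summary)
# {'min': 2, 'q1': 5.0, 'median': 7.0, 'q3': 9.0, 'max': 53}
```

The box of the plot spans Q1 to Q3 with a line at the median, and the whiskers reach toward the minimum and maximum.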

I hope this article helps you build a basic understanding of the statistics that are essential to excel in the field of data as a data scientist. In the next article, we will discuss inferential statistics in detail.

Reference: Data analytics with Python.

Published via Towards AI