Module 1, Part 01: Building Blocks of Data Analytics
Author(s): Sudeep
Originally published on Towards AI.
If you are wondering what this Module 1 and the related material are about, please refer to this post: What is Data Analytics
So it all starts with Statistics
At a high level, statistics is a collection of methods that help us analyze, summarize, and interpret data. To dive into statistics, we first need to understand what data is and its various types.
Data
Data is a collection of facts, numbers, words, or observations that can be used to learn about something. Data can be represented in many different ways and can be used for a variety of purposes.
Data can be divided into mainly 3 types
Now let's learn about the types of statistics
There are broadly two types of statistics
1) Descriptive Statistics
Formal definition: Descriptive statistics are methods used to summarize and describe the main features of a dataset.
In short, it offers many methods that help us get a summary of the data, such as the mean, median, and mode.
2) Inductive/Inferential Statistics
Formal definition: Inferential statistics involves drawing conclusions or making inferences about a population based on data collected from a sample of that population.
In short, inferential statistics is about using a sample to say something about the whole population, and about understanding the "why" and "how" behind the data patterns we observe.
Terms
Population: The whole dataset is called the population.
Then what is a part of the population called? 🤔 Well, it's called a sample. Each individual record in it is also known as an observation or a tuple, and the records stacked together as rows form the feature matrix.
Bonus: Attributes are also known as features.
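To make the population/sample idea concrete, here is a minimal sketch using only Python's standard library. The club ages, the population size of 1,000, and the sample size of 50 are all made-up values for illustration, not data from this series.

```python
import random
import statistics

# Hypothetical population: the ages of all 1,000 members of a club.
random.seed(42)  # fixed seed so the run is repeatable
population = [random.randint(18, 65) for _ in range(1000)]

# A sample is a subset of the population: here, 50 members picked at random.
sample = random.sample(population, k=50)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population size: {len(population)}, mean age: {population_mean:.2f}")
print(f"Sample size:     {len(sample)}, mean age: {sample_mean:.2f}")
```

The two means will usually be close but not identical. Using the sample mean as an estimate of the population mean is a small taste of what inferential statistics does.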
Variables
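Variables are mainly of two types: numerical and categorical. Categorical variables are further divided into nominal variables (categories with no natural order) and ordinal variables (categories with a natural order).
Examples of nominal variables are as follows: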
Colors of cars in a parking lot (Red, Blue, Black, White).
Types of payment methods used in a store (Cash, Credit Card, Debit Card, Mobile Payment).
Examples of ordinal (ordered) variables are as follows:
Education levels (High School, Bachelor's, Master's, PhD)
Customer satisfaction ratings (Very Dissatisfied, Dissatisfied, Neutral, Satisfied, Very Satisfied)
T-shirt sizes (XS, S, M, L, XL)
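The practical difference between the two shows up when you try to sort or compare values. The sketch below reuses the car colors and T-shirt sizes from the examples above; the specific lists and the `rank` helper are assumptions made up for illustration.

```python
from collections import Counter

# Nominal: no natural order, so counting categories is the meaningful operation.
car_colors = ["Red", "Blue", "Black", "Red", "White", "Red", "Blue"]
color_counts = Counter(car_colors)
print(color_counts.most_common(1))  # [('Red', 3)]

# Ordinal: there is a natural order, but we have to state it explicitly.
size_order = ["XS", "S", "M", "L", "XL"]
rank = {size: position for position, size in enumerate(size_order)}

shirts = ["L", "XS", "XL", "M", "S"]
alphabetical = sorted(shirts)           # ignores the real order of sizes
by_size = sorted(shirts, key=rank.get)  # respects the ordinal scale

print(alphabetical)  # ['L', 'M', 'S', 'XL', 'XS']
print(by_size)       # ['XS', 'S', 'M', 'L', 'XL']
```

Sorting the colors would be equally arbitrary in any order, which is exactly what makes them nominal.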
Let's focus more on descriptive statistics...
What is a measure of central tendency? 🤔
It's nothing complicated: it is a single value that describes the center of a dataset, and it includes methods like the mean, median, and mode.
1. Mean
- Definition: The average value.
- Formula: Mean = (Sum of all values) / (Number of values).
- Example: The average age of students in a class.
2. Median
- Definition: The middle value when data is sorted.
- Tip: For even-sized datasets, take the average of the two middle values.
- Example: The median salary in a company can give you a better idea of employee earnings when there are outliers.
3. Mode
- Definition: The most frequent value in a dataset.
- Example: The most popular product sold in an online store.
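The three measures above can be sketched with Python's built-in `statistics` module. The salary figures are hypothetical and chosen so that one outlier pulls the mean away from the median.

```python
import statistics

# Hypothetical monthly salaries (in thousands); the last value is an outlier.
salaries = [30, 32, 32, 35, 38, 40, 200]

mean_salary = statistics.mean(salaries)      # sum of values / number of values
median_salary = statistics.median(salaries)  # middle value of the sorted data
mode_salary = statistics.mode(salaries)      # most frequent value

print(f"Mean:   {mean_salary:.2f}")  # Mean:   58.14
print(f"Median: {median_salary}")    # Median: 35
print(f"Mode:   {mode_salary}")      # Mode:   32

# Even-sized dataset: the median is the average of the two middle values.
even_median = statistics.median([30, 32, 35, 38])
print(even_median)  # 33.5
```

Notice how the single salary of 200 drags the mean up to about 58, while the median stays at 35. This is why the median is often preferred when outliers are present.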
But what is a measure of dispersion?
A measure of dispersion is a statistical value that indicates how spread out a set of data is around a central value. It can help you determine whether the data is stretched out or squeezed together.
Some examples are:
Range: It is defined as the difference between the largest and the smallest value in the distribution.
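Mean Deviation: It is the arithmetic mean of the absolute differences between the values and their mean. (Without the absolute value, the positive and negative differences would cancel out to zero.)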
Standard Deviation: It is the square root of the arithmetic average of the square of the deviations measured from the mean.
Variance: It is defined as the average of the square deviation from the mean of the given data set.
Quartile Deviation: It is defined as half of the difference between the third quartile and the first quartile in a given data set.
Interquartile Range: The difference between the upper (Q3) and lower (Q1) quartiles is called the Interquartile Range. Its formula is given as Q3 - Q1.
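Here is a minimal sketch of these measures in Python, again using only the standard library. The `data` list is made up for illustration. Variance and standard deviation follow the definitions above, dividing by the number of values, so the sketch uses `statistics.pvariance` and `statistics.pstdev`; the sample versions that divide by n - 1 are `statistics.variance` and `statistics.stdev`.

```python
import statistics

data = [4, 8, 15, 16, 23, 42]  # hypothetical values
mean = statistics.mean(data)   # 18

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# Mean deviation: average of the absolute differences from the mean.
mean_deviation = sum(abs(x - mean) for x in data) / len(data)

# Variance: average of the squared differences from the mean.
variance = statistics.pvariance(data)

# Standard deviation: square root of the variance.
std_dev = statistics.pstdev(data)

# Quartiles: cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                 # interquartile range
quartile_deviation = iqr / 2  # half of the interquartile range

print(f"Range:              {data_range}")  # Range:              38
print(f"Mean deviation:     {mean_deviation:.2f}")
print(f"Variance:           {variance:.2f}")
print(f"Standard deviation: {std_dev:.2f}")
print(f"IQR:                {iqr}")
print(f"Quartile deviation: {quartile_deviation}")
```

One caveat: there are several conventions for computing quartiles, so Q1 and Q3 from `statistics.quantiles` (which defaults to `method="exclusive"`) may differ slightly from what a spreadsheet or another library reports.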
In summary, what we discussed in this post are the fundamental building blocks of data analytics, specifically:
The foundation of statistics and its two main branches:
- Descriptive Statistics: Methods for summarizing data.
- Inferential Statistics: Drawing conclusions about populations based on sample data.
Basic terminology in data analytics:
- Population: The complete dataset.
- Samples: Subsets of the population.
- Features (also called attributes): Characteristics we measure.
The classification of variables:
- Numerical variables.
- Categorical variables, which include:
- Nominal data (e.g., colors, payment methods).
- Ordinal data (e.g., education levels, satisfaction ratings).
Important statistical measures:
1. Measures of Central Tendency:
- Mean: The average.
- Median: The middle value.
- Mode: The most frequent value.
2. Measures of Dispersion:
- Range: Difference between the largest and smallest values.
- Mean Deviation: The mean of the absolute differences from the mean.
- Standard Deviation: The square root of the average squared differences from the mean.
- Variance: The average of the squared differences from the mean.
- Quartile Deviation: Half the difference between the third and first quartiles.
- Interquartile Range: The difference between the upper (Q3) and lower (Q1) quartiles.
That's it for this post! Stay tuned for Part 2, coming soon in the next 3-5 days. 😉