Must-Know Statistics Questions for Data Science Interviews

Last Updated on May 16, 2023 by Editorial Team

Author(s): Roli Trivedi

Originally published on Towards AI.

Photo by Marjhon Obsioma on Unsplash

What is Inferential Statistics?
Inferential statistics makes predictions and inferences about a population based on a sample of data taken from that population. It has two main use cases:
‣ To make estimates about a population.
‣ To draw conclusions about a population.
For example, a pharmaceutical company tests a new drug on a sample of patients and uses the results to infer how effective the drug is in the wider population.
What is the difference between Population and Sample?
A population is the complete set of items or individuals we want to study, whereas a sample is a smaller subset drawn from that population. We usually cannot work with the whole population because of the high computational cost and because not all data points are available. Instead, we calculate statistics from the sample and use those sample statistics to draw conclusions about the population.
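As a rough illustration of sampling from a population, here is a minimal sketch using Python's standard library. The population of simulated heights, the sample size of 500, and the seed are all assumptions made up for the example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 simulated heights in cm.
population = [random.gauss(170, 10) for _ in range(100_000)]

# Draw a simple random sample without replacement.
sample = random.sample(population, 500)

population_mean = statistics.fmean(population)  # parameter (usually unknown in practice)
sample_mean = statistics.fmean(sample)          # statistic (what we can actually compute)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two printed values should be close to each other, which is the reason a sample statistic can stand in for the population parameter.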
What is the relationship between mean and median in a normal distribution?
In a normal distribution, the mean is equal to the median (and to the mode), because the distribution is symmetric about its center.
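This can be checked with a quick simulation. The sketch below assumes a normal distribution with mean 100 and standard deviation 15 (values chosen only for illustration) and compares the mean and median of the simulated draws.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Simulated draws from a normal distribution (mean 100, sd 15 are assumed values).
draws = [random.gauss(100, 15) for _ in range(100_000)]

mean = statistics.fmean(draws)
median = statistics.median(draws)

print(f"mean:   {mean:.2f}")
print(f"median: {median:.2f}")
```

In a finite sample the two numbers will not match exactly, but they should differ only by a small amount of sampling noise.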
What is an outlier?
An outlier is a data point that lies at an abnormal distance from most of the other points in the dataset.
What can I do with Outliers?
You can keep the outliers:
When there are a lot of outliers (skewed data)
When results are critical
When outliers have meaning (fraud data)
You can remove the outliers:
When we know the data point is wrong (for example, a negative age for a person)
When we have lots of data
When we need to provide two analyses: one with the outliers and one without
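What is the difference between population parameters and sample statistics?
Population parameters describe the whole population, while sample statistics are computed from a sample and used to estimate those parameters.
Population parameters are:
Mean = μ
Standard deviation = σ
Sample statistics are:
Mean = x̄
Standard deviation = s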
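The practical difference shows up in the standard deviation formula: σ divides by N, while s divides by n - 1. A minimal sketch with Python's `statistics` module, using a small made-up dataset and treating it first as a whole population and then as a sample:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values for illustration

mean = statistics.mean(data)

# Treating the data as the entire population: divide by N.
sigma = statistics.pstdev(data)

# Treating the data as a sample from a larger population: divide by n - 1.
s = statistics.stdev(data)

print(f"mean = {mean}")
print(f"population standard deviation (sigma) = {sigma:.3f}")
print(f"sample standard deviation (s) = {s:.3f}")
```

The sample standard deviation comes out slightly larger, because dividing by n - 1 compensates for estimating the mean from the same data.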
What is the difference between inferential statistics and descriptive statistics?
Descriptive statistics is the processing of data without drawing inferences from it. It is useful for describing and summarizing data through numerical calculations, graphs, or tables.
Inferential statistics draws inferences or makes predictions about the population based on sample data.
What are the most common characteristics used in descriptive statistics?
‣ Measures of central tendency: Mean, median, mode
‣ Measures of variability/spread/dispersion: Standard deviation, variance, range, IQR
‣ Measures of shape: Skewness (asymmetry), kurtosis (tail heaviness)
‣ Outliers: Values that lie abnormally far from most of the values in the dataset.
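The measures listed above can be computed with the standard library alone. The following is a sketch on a small made-up dataset. The skewness and kurtosis use the simple moment-based (population) formulas, and the quartiles use the "inclusive" method, both of which are choices made for this example; other conventions give slightly different numbers.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values for illustration
n = len(data)

# Central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Spread
stdev = statistics.stdev(data)        # sample standard deviation
variance = statistics.variance(data)  # sample variance
value_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Shape: moment-based skewness and kurtosis (population form)
m2 = sum((x - mean) ** 2 for x in data) / n
m3 = sum((x - mean) ** 3 for x in data) / n
m4 = sum((x - mean) ** 4 for x in data) / n
skewness = m3 / m2 ** 1.5
kurtosis = m4 / m2 ** 2  # a normal distribution has kurtosis 3

print(mean, median, mode)
print(round(stdev, 3), round(variance, 3), value_range, iqr)
print(round(skewness, 3), round(kurtosis, 3))
```

A positive skewness, as here, indicates a longer tail on the right side of the distribution.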
How do you determine Outliers?
‣ Method 1: IQR (interquartile range)
The IQR is the range spanned by the middle 50% of the dataset: the distance between the third quartile and the first quartile (Q3 - Q1). It measures variability by dividing the dataset into quartiles. Quartiles are the values that divide the data into four equal parts once the data is sorted in ascending order.
IQR = Q3 - Q1
Q1 = 1st quartile (lower quartile, the 25th percentile; 25% of the data lies below it)
Q2 = 2nd quartile (the median, the 50th percentile)
Q3 = 3rd quartile (upper quartile, the 75th percentile; 25% of the data lies above it)
Note: Percentage and percentile are two different things. If the 25th percentile is 8, then 25% of the data is less than 8. If the 75th percentile is 40, then 75% of the data is less than 40.
If a data value < Q1 - 1.5 × IQR or a data value > Q3 + 1.5 × IQR, it is treated as an outlier.
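The IQR rule can be sketched in a few lines of Python. The function name `iqr_outliers` is hypothetical, and the quartiles are computed with the "inclusive" method of `statistics.quantiles`, which is one of several common quartile conventions. The ages are the ones used in the sorting example in this article.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] and the two fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)

ages = [4, 6, 9, 2, 10, 12, 102]
outliers, (lower, upper) = iqr_outliers(ages)

print(f"fences: [{lower}, {upper}]")  # fences: [-4.0, 20.0]
print(f"outliers: {outliers}")        # outliers: [102]
```

Here Q1 = 5 and Q3 = 11, so IQR = 6 and any age below -4 or above 20 is flagged.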
‣ Method 2: Z-score
The z-score, also known as the standard score, tells us how far a data point is from the mean, measured in standard deviations. If the data follow a normal distribution, about 99.7% of the points lie within 3 standard deviations of the mean, so points beyond 3 standard deviations on either side can be treated as outliers.
A z-score of 2.5 means the value is 2.5 standard deviations above the mean, and a z-score of -2.5 means it is 2.5 standard deviations below the mean. In other words, the z-score is the number of standard deviations by which a value falls above or below the mean.
The main advantage of the z-score is that, under a normal distribution, it can be converted into a percentage: it tells you what share of values is expected to be as extreme as a given point.
Z-score = (x - μ) / σ
x is an observation
μ is the mean (estimated by the sample mean x̄ when working with a sample)
σ is the standard deviation (estimated by the sample standard deviation s when working with a sample)
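A minimal sketch of z-score outlier detection follows. The function name `z_score_outliers`, the cutoff of 3, and the readings are assumptions for illustration. The mean and standard deviation are estimated from the sample itself.

```python
import statistics

def z_score_outliers(values, threshold=3.0):
    """Return (value, z) pairs whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    scored = [(v, (v - mean) / sd) for v in values]
    return [(v, z) for v, z in scored if abs(z) > threshold]

# Hypothetical readings: twenty values near 50 plus one extreme value.
readings = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
            50, 48, 52, 50, 49, 51, 50, 47, 53, 50, 120]

flagged = z_score_outliers(readings)
for value, z in flagged:
    print(f"{value} has z = {z:.2f}")
```

One caveat worth knowing: when the standard deviation is computed from the same sample, the largest possible absolute z-score is (n - 1) / √n. For a sample of 7 values that limit is about 2.27, so a cutoff of 3 can never flag anything in such a small sample.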
‣ Method 3: Sort data and look for extreme values
This is the most basic method: sort the data, then look for extreme values, which are the candidate outliers.
For example, suppose we are given the ages 4, 6, 9, 2, 10, 12, 102.
Step 1: Sort the data: 2, 4, 6, 9, 10, 12, 102
Step 2: Look for extreme values. Here 102 is far from the rest, so it could be an outlier.
‣ Method 4: Plotting a scatterplot or boxplot
Scatterplot: A scatterplot shows whether there is a pattern between two variables. It is used for paired numerical data, when you want to examine the relationship between two variables. It can also be used for outlier detection, since outliers appear as points far from the main cluster.
Boxplot: It summarizes sample data using the 25th percentile, the 50th percentile (median), and the 75th percentile. It gives insight into the quartiles, the median, and outliers, which are typically drawn as individual points beyond the whiskers.
‣ Method 5: Hypothesis testing
You can use hypothesis tests to find outliers. Many outlier tests exist, but I'll focus on one, Grubbs' test, to illustrate how they work. It tests the following hypotheses:
Null: All values in the sample were drawn from a single population that follows the same normal distribution.
Alternative: One value in the sample was not drawn from the same normally distributed population as the other values.
If the p-value for this test is less than your significance level, you can reject the null and conclude that one of the values is an outlier.
(For details, refer to: World of Outliers)
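For reference, the two-sided Grubbs' test statistic and its rejection rule can be written as follows. This is the standard textbook form rather than something stated in this article. Here N is the sample size, x̄ and s are the sample mean and sample standard deviation, and t is the upper critical value of the t-distribution with N - 2 degrees of freedom at significance level α/(2N).

```latex
G = \frac{\max_{i} \lvert x_i - \bar{x} \rvert}{s},
\qquad
\text{reject } H_0 \text{ at level } \alpha \text{ if } \quad
G > \frac{N - 1}{\sqrt{N}} \sqrt{\frac{t^{2}}{N - 2 + t^{2}}},
\qquad
t = t_{\alpha/(2N),\, N-2}
```

G is simply the largest absolute z-score in the sample, so the test asks whether the most extreme point is further from the mean than chance alone would plausibly allow under normality.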
When do you reject or accept the null hypothesis? List the steps.
Step 1: State the hypotheses
In this step, you state the null hypothesis and the alternative hypothesis. Sometimes it is easier to state the alternative hypothesis first, because it reflects what the researcher expects from the experiment.
Step 2: Reject or fail to reject the null hypothesis
Several methods exist, and the choice depends on the data you have. For example, you can use the p-value method: reject the null hypothesis when the p-value is less than your significance level.
In general, you reject the null hypothesis when your test statistic falls into the rejection region; otherwise you fail to reject it (strictly speaking, the null hypothesis is never "accepted").
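The p-value method can be sketched with a two-sided one-sample z-test. This particular test is an assumption chosen for illustration, since it needs only the standard library; it presumes the population standard deviation is known. The function name `two_sided_z_test` and the numbers passed to it are hypothetical.

```python
import math
from statistics import NormalDist

def two_sided_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Test H0: population mean == mu0, assuming a known population sd sigma."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Probability of a test statistic at least this extreme if H0 is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return z, p_value, decision

# Hypothetical numbers: sample mean 52 from n = 100, H0 mean 50, known sd 10.
z, p_value, decision = two_sided_z_test(52, 50, 10, 100)
print(f"z = {z:.2f}, p-value = {p_value:.4f}, decision: {decision}")
```

With these inputs z is 2.0 and the p-value is about 0.0455, which is below the 0.05 significance level, so the null hypothesis is rejected.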

