Must-Know Statistics Questions for Data Science Interviews
Last Updated on May 16, 2023 by Editorial Team
Author(s): Roli Trivedi
Originally published on Towards AI.
- What is Inferential Statistics?
Inferential statistics makes predictions and inferences about a population based on a sample of data drawn from that population. It has two main use cases:
‣ To make estimates about the population.
‣ To draw conclusions about the population.
For example, a pharmaceutical company runs a study on a sample of patients to test the effectiveness of a new drug, then generalizes the result to the wider patient population.
- What is the difference between Population and Sample?
A population is the complete set of items or individuals you want to study, whereas a sample is a small part of that population. We work with a sample because using the whole population is usually impractical: the computational cost is high, and not all data points are available. From the sample we calculate statistics, and from those sample statistics we draw conclusions about the population.
- What is the relationship between mean and median in a normal distribution?
In a normal distribution, the mean is equal to the median, because the distribution is symmetric around its center.
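A minimal sketch of this relationship, using Python's standard `statistics` module. The values of `mu` and `sigma` and the `ages` list are illustrative assumptions, not data from a real study. The contrast with a skewed sample shows why the equality depends on symmetry.

```python
from statistics import NormalDist, mean, median

# Hypothetical normal distribution; mu and sigma are illustrative values.
dist = NormalDist(mu=50, sigma=10)
normal_mean = dist.mean
normal_median = dist.inv_cdf(0.5)  # the median is the 50th percentile

# A skewed sample for contrast: one extreme value pulls the mean upward,
# while the median stays near the bulk of the data.
ages = [2, 4, 6, 9, 10, 12, 102]
skewed_mean = mean(ages)
skewed_median = median(ages)

print("normal:", normal_mean, normal_median)
print("skewed:", skewed_mean, skewed_median)
```

In the skewed sample the mean is far above the median, which is one quick hint that the data are not normally distributed.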
- What is an outlier?
An outlier is a point that lies at an abnormal distance from most of the other points in the dataset.
- What can I do with Outliers?
You can keep the outliers:
‣ When there are a lot of outliers (skewed data)
‣ When results are critical
‣ When outliers have meaning (fraud data)
You can remove the outliers:
‣ When we know the data point is wrong (negative age of a person)
‣ When we have lots of data
‣ When we need to provide two analyses: one with the outliers and another without them
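- What is the difference between population parameters and sample statistics?
Population parameters are fixed values that describe the entire population; sample statistics are calculated from a sample and are used to estimate those parameters.
Population parameters are:
Mean = μ
Standard deviation = σ
Sample statistics are:
Mean = x̄
Standard deviation = s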
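The notation above can be sketched in code. This is an illustration under stated assumptions: `population` and `sample` are small hypothetical lists, and Python's `statistics.pstdev` and `statistics.stdev` stand in for σ (divide by N) and s (divide by n - 1) respectively.

```python
from statistics import mean, pstdev, stdev

# Hypothetical data: pretend this list is the entire population.
population = [2, 4, 4, 4, 5, 5, 7, 9]
# Hypothetical sample drawn from that population.
sample = [2, 4, 5, 9]

mu = mean(population)       # population mean (μ)
sigma = pstdev(population)  # population standard deviation (σ), divides by N
x_bar = mean(sample)        # sample mean (x̄)
s = stdev(sample)           # sample standard deviation (s), divides by n - 1

print(f"mu={mu}, sigma={sigma:.3f}")
print(f"x_bar={x_bar}, s={s:.3f}")
```

In practice μ and σ are usually unknown, and x̄ and s are the quantities you actually compute.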
- What is the difference between inferential statistics and descriptive statistics?
Descriptive statistics summarizes and describes the data at hand without drawing inferences beyond it, using numerical measures, graphs, or tables.
Inferential statistics draws inferences or makes predictions about the population based on sample data.
- What are the most common characteristics used in descriptive statistics?
‣ Measure of central tendency: Mean, median, mode
‣ Measure of variability/spread/dispersion: Standard deviation, variance, range, IQR
‣ Measure of symmetry/shape: Skewness, kurtosis
‣ Outliers: Values that lie abnormally far from most of the values in the dataset.
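As a rough sketch, the central tendency and spread measures listed above can be computed with Python's standard `statistics` module. The `ages` list is hypothetical, and the `"inclusive"` quartile method is one of several conventions, so other tools may report slightly different quartiles.

```python
from statistics import mean, median, mode, quantiles, stdev, variance

ages = [2, 4, 4, 6, 9, 10, 12]  # hypothetical sample

# Measures of central tendency
central = {"mean": mean(ages), "median": median(ages), "mode": mode(ages)}

# Measures of spread (sample variance and standard deviation)
q1, q2, q3 = quantiles(ages, n=4, method="inclusive")
spread = {
    "range": max(ages) - min(ages),
    "variance": variance(ages),
    "std_dev": stdev(ages),
    "iqr": q3 - q1,
}

print(central)
print(spread)
```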
- How do you determine Outliers?
‣ Method 1 : IQR (Interquartile range)
The IQR is the spread of the middle 50% of the dataset: the distance between the third quartile and the first quartile (Q3 - Q1). It measures variability by dividing the dataset into quartiles, which are the values that split the data, sorted in ascending order, into 4 equal parts.
IQR = Q3 - Q1
Q1 = 1st quartile (lower quartile, the 25th percentile, below which the lowest 25% of the data falls)
Q2 = 2nd quartile (median, the 50th percentile)
Q3 = 3rd quartile (upper quartile, the 75th percentile, above which the highest 25% of the data falls)
Note: Percentage and percentile are two different things. If the 25th percentile is 8, it simply means 25% of the data is less than 8. If the 75th percentile is 40, it simply means 75% of the data is less than 40.
If a data value < Q1 - 1.5 × IQR or a data value > Q3 + 1.5 × IQR, it is treated as an outlier.
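The IQR rule can be sketched as follows. This is a minimal illustration, not a definitive implementation: `iqr_outliers` is a hypothetical helper name, the `ages` list is illustrative, and the `"inclusive"` quartile method is an assumption (quartile conventions differ between tools, which can shift the fences slightly).

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] and the fences used."""
    # quantiles() sorts the data internally and returns Q1, Q2, Q3 for n=4.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return flagged, (lower, upper)

ages = [4, 6, 9, 2, 10, 12, 102]  # illustrative ages
outliers, (lower, upper) = iqr_outliers(ages)
print("fences:", lower, upper)
print("outliers:", outliers)
```

The multiplier `k=1.5` follows the rule stated above; a larger multiplier makes the rule more conservative.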
‣ Method 2: Z-Score
The z-score, also known as the standard score, tells us how far a data point deviates from the mean, measured in standard deviations. If the data follow a normal distribution, about 99.7% of the points lie within 3 standard deviations of the mean, so points beyond that on either side can be treated as outliers.
So a z-score of 2.5 means the value is 2.5 standard deviations above the mean, and a z-score of -2.5 means it is 2.5 standard deviations below the mean. In other words, the z-score is the number of standard deviations by which a value falls above or below the mean.
The main advantage of the z-score is that it quantifies how extreme a value is: for normally distributed data, it can be converted into the percentage of values expected to lie that far from the mean.
Z Score = (x - μ)/σ
x is an observation in the sample
μ is the mean of the observations in the sample
σ is the standard deviation of the observations in the sample
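A minimal sketch of z-score outlier detection follows. The helper name `zscore_outliers`, the threshold of 3, and the `ages` list are assumptions for illustration. The standard deviation is computed with `statistics.pstdev` to match the σ in the formula; using the sample standard deviation instead would give slightly smaller z-scores.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return (value, z) pairs whose |z| exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # standard deviation of the observations
    scored = [(v, (v - mu) / sigma) for v in values]
    return [(v, z) for v, z in scored if abs(z) > threshold]

# Hypothetical ages: 19 typical values plus one extreme value.
ages = [22, 25, 27, 24, 26, 23, 28, 25, 24, 26,
        27, 25, 23, 24, 26, 25, 27, 24, 26, 102]

flagged = zscore_outliers(ages)
for value, z in flagged:
    print(f"{value} has z-score {z:.2f}")
```

A longer list is used here on purpose: in a very small sample the extreme value inflates the standard deviation so much that no point can reach |z| = 3 (with 7 observations the largest possible |z| is about 2.45), so the z-score rule works best with a reasonable amount of data.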
‣ Method 3: Sort data and see extreme values
This is the most basic method: sort the data, then look for extreme values, which are your outlier candidates.
For example, suppose we are given ages 4, 6, 9, 2, 10, 12, 102.
Step 1: Sort the data: 2, 4, 6, 9, 10, 12, 102
Step 2: Look for extreme values. Here 102 is an extreme value, so it could be an outlier.
‣ Method 4: Plotting a scatter plot or boxplot
Scatterplot: It shows whether there is a pattern between two variables. It is used for paired numerical data, when you want to determine the relationship between two variables, but it is also useful for outlier detection, because points far from the main cluster stand out.
Boxplot: It summarizes sample data using the 25th, 50th, and 75th percentiles. It gives insight into the quartiles, the median, and outliers.
‣ Method 5: Hypothesis Testing
You can use hypothesis tests to find outliers. Many outlier tests exist, but I’ll focus on one to illustrate how they work. I will demonstrate Grubbs’ test, which tests the following hypotheses:
Null: All values in the sample were drawn from a single population that follows the same normal distribution.
Alternative: One value in the sample was not drawn from the same normally distributed population as the other values.
If the p-value for this test is less than your significance level, you can reject the null and conclude that one of the values is an outlier.
(For details, refer to: World of Outliers)
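The core of Grubbs' test can be sketched with the standard library alone. This sketch makes several assumptions: `grubbs_statistic` is a hypothetical helper name, the `ages` list is illustrative, and instead of computing a p-value it compares the test statistic G with a critical value, which leads to the same decision. The critical value of roughly 2.02 (N = 7, two-sided, significance level 0.05) is taken from commonly published Grubbs tables and should be checked against a table or a statistics package before real use.

```python
from statistics import mean, stdev

def grubbs_statistic(values):
    """Return (G, suspect): the largest absolute deviation from the mean,
    in units of the sample standard deviation, and the value producing it."""
    x_bar = mean(values)
    s = stdev(values)
    suspect = max(values, key=lambda v: abs(v - x_bar))
    return abs(suspect - x_bar) / s, suspect

ages = [2, 4, 6, 9, 10, 12, 102]  # illustrative ages

# Assumed critical value for N = 7, two-sided test, alpha = 0.05,
# as listed in commonly published Grubbs tables (approximate).
G_CRITICAL = 2.02

g, suspect = grubbs_statistic(ages)
reject_null = g > G_CRITICAL
print(f"G = {g:.3f}, suspect = {suspect}, reject null: {reject_null}")
```

Grubbs' test assumes the remaining data are approximately normal and tests one suspected outlier at a time, so the result above is only as reliable as that assumption.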
- When do you reject or accept the null hypothesis? List Steps.
Step 1: State the hypotheses
In this step, you state the null hypothesis and the alternative hypothesis. Sometimes it is easier to state the alternative hypothesis first, because it reflects what the researcher expects from the experiment.
Step 2: Reject or fail to reject the null hypothesis
Several methods exist, and the choice depends on the data you have. For example, you can use the p-value method: reject the null hypothesis when the p-value is less than your significance level.
Basically, you reject the null hypothesis when your test value falls into the rejection region; otherwise you fail to reject it, which is often loosely described as accepting it.
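The p-value method can be illustrated with a two-sided one-sample z-test. This is a sketch under stated assumptions: the helper `two_sided_z_test` and all the numbers (a sample mean of 52 from 100 observations, a hypothesized mean of 50, a known population standard deviation of 10) are hypothetical, and a z-test is only one of many possible tests.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Return (z, p_value, reject) for H0: population mean == mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Two-sided p-value: probability of a result at least this extreme under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha

z, p_value, reject = two_sided_z_test(sample_mean=52, mu0=50, sigma=10, n=100)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("Reject the null hypothesis" if reject else "Fail to reject the null hypothesis")
```

With these illustrative numbers the p-value falls just below 0.05, so the null hypothesis is rejected at the 5% level but would not be at the 1% level.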