The Ultimate Guide to Statistics: Part 1— Descriptive Statistics

Last Updated on July 17, 2023 by Editorial Team

Author(s): Simranjeet Singh

Originally published on Towards AI.

Introduction

Welcome to my statistics blog series! We’ll explore several different subjects in this series, including Descriptive Statistics, Inferential Statistics, Bayesian Statistics, Regression Analysis, Time Series Analysis, Financial Modeling and Analysis, and various Projects and Case Studies.

This blog post covers Descriptive Statistics, the branch of statistics that summarises and describes data that has already been collected. It lets you describe the key characteristics of a data set, such as its central tendency, variability, and shape. Descriptive statistics are frequently the first step in data analysis, providing a preliminary understanding of the data before more complex statistical techniques are applied. In this blog, we’ll look at the fundamental concepts of descriptive statistics, such as data types, measures of central tendency and dispersion, and data visualization techniques.

Fig.1 — Descriptive Statistics

By the end of this post, you’ll have a firm grasp of descriptive statistics and be prepared to tackle more advanced statistical concepts in future blog posts.

Table of Contents

  1. Importance of descriptive statistics in data science
  2. Types of Data
  3. Measures of Central Tendency
  4. When to use each measure
  5. Measures of Dispersion
  6. Skewness and Kurtosis
  7. Probability Distributions
  8. Correlation and Regression Analysis
  9. Model selection techniques
  10. Statistical Inference
  11. Project on Descriptive Statistics

1. Importance of Descriptive Statistics

Descriptive statistics can be used to gain insights into data, identify outliers and anomalies, detect trends and patterns, and make data-informed decisions. It helps in the identification of key features and variables that can then be used for modeling and prediction. It is used in a variety of fields in data science, including business, finance, healthcare, social sciences, and others. It is critical in data exploration, data cleaning, and data visualization, all of which are necessary steps in the data analysis process. It would be difficult to make sense of the data and draw meaningful conclusions without descriptive statistics. It is the first step in data analysis and involves describing and summarising the data using various measures such as mean, median, mode, variance, standard deviation, and percentiles.

2. Types of data

Data in statistics can be divided into two categories: quantitative and qualitative. Quantitative data is numerical data that can be measured, whereas qualitative data is non-numerical data that cannot be measured numerically. Understanding the type of data being analyzed is critical in determining the best statistical methods for analysis.

Quantitative data can be divided into two types: discrete data and continuous data. Continuous data can take on any value within a certain range, whereas discrete data can take on a countable number of values, such as the number of students in a class.

Fig.2 — Types of Data

On the other hand, nominal and ordinal data are subcategories of qualitative data. Data that are used to name or label variables, such as the kind of fruit in a basket, are known as nominal data. Ordinal data, on the other hand, can be ranked or ordered and include things like survey respondents’ levels of satisfaction.

3. Measures of Central Tendency

Measures of central tendency are statistical measures that determine the central or typical value of a dataset. These measures indicate where the center of the distribution of data lies. Mean, median, and mode are the most common measures of central tendency.

Fig.3 — Measure of Central Tendency

1. Mean, Median, and Mode

Mean, median, and mode are measures of central tendency that are commonly used in descriptive statistics to summarize a dataset. Here are the formulas and Python code examples for each of these measures:

a. Mean: The mean is the arithmetic average of a set of numbers.

Fig.4— Mean Formula
def mean(numbers):
    return sum(numbers) / len(numbers)

b. Median: The median is the middle value of a set of numbers.

Fig.5 — Median Formulas
def median(numbers):
    n = len(numbers)
    sorted_numbers = sorted(numbers)
    if n % 2 == 0:
        # Even count: average the two middle values
        return (sorted_numbers[n//2-1] + sorted_numbers[n//2]) / 2
    else:
        # Odd count: take the single middle value
        return sorted_numbers[n//2]

c. Mode: The mode is the most frequently occurring value in a set of numbers.

Fig.6 — Mode Formula
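from collections import Counter

def mode(numbers):
    # Count occurrences, then collect every value tied for the highest count
    count = Counter(numbers)
    if not count:
        return None  # empty input has no mode
    max_count = max(count.values())
    modes = [k for k, v in count.items() if v == max_count]
    return modes[0]  # first of the tied modes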
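As a quick cross-check, Python's standard library statistics module offers equivalent functions. The sketch below uses a made-up list of values, chosen only for illustration:

```python
import statistics

# Hypothetical sample, chosen only for illustration
numbers = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(numbers)       # arithmetic average
median_value = statistics.median(numbers)   # average of the two middle values (even count)
modes = statistics.multimode(numbers)       # every value tied for most frequent

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode(s):", modes)
```

Note that multimode returns a list, which makes ties explicit instead of silently picking one value.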

4. When to use each measure

The choice of which measure of central tendency to use depends on the nature of the data and the research question.

  • Mean: The mean works best for roughly symmetric data without outliers, such as normally distributed data. It requires interval or ratio data, because it relies on adding and dividing values.
  • Median: The median is preferred when the data is skewed or contains outliers, because extreme values barely affect it. It is also appropriate for ordinal data, where values can be ranked but not meaningfully averaged.
  • Mode: The mode is the only one of the three that applies to nominal (categorical) data. It is also useful for identifying the most commonly occurring value in any dataset.
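To see why outliers matter for this choice, here is a small sketch with invented salary figures (in thousands). Adding a single extreme value moves the mean a long way but barely changes the median:

```python
import statistics

# Hypothetical salaries in thousands, for illustration only
salaries = [30, 32, 35, 36, 40]
with_outlier = salaries + [500]   # one extreme value added

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("Without outlier: mean =", mean_before, "median =", median_before)
print("With outlier:    mean =", mean_after, "median =", median_after)
```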

5. Measures of Dispersion

Measures of Dispersion are used to describe how much variation there is in a dataset. There are several measures of dispersion, including range, variance, standard deviation, and interquartile range.

a. Range: The range is the difference between the maximum and minimum values in a dataset.

Range = maximum value - minimum value

b. Variance: The variance is the average squared deviation of each point from the mean. The formula below is the population variance; the sample variance divides by n - 1 instead of n.

Variance = (1/n) * sum((xi - mean)²)

c. Standard Deviation: The standard deviation is the square root of the variance. It is an important measure of dispersion in data science that helps to understand how much the data values are spread out from the mean value. Standard deviation is particularly useful when the data is normally distributed. It helps to identify the outliers or extreme values in the data set, which can provide important insights into the data.

Fig.7 — Standard Deviation

However, there are certain situations where standard deviation may not be the best measure of dispersion. For example, when the data is skewed or has outliers, other measures such as interquartile range (IQR) or median absolute deviation (MAD) may be more appropriate.

Standard Deviation = sqrt(Variance)

d. Interquartile Range: The Interquartile Range (IQR) is a measure of variability based on dividing a dataset into quartiles. Quartiles divide a rank-ordered dataset into four equal parts. The IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile). It gives us an idea of how spread out the middle 50% of the dataset is. It is useful when dealing with skewed data or when outliers are present.

Fig.8 — Interquartile Range

The IQR should be used when the data has extreme values that would distort measures based on the mean, such as the variance or standard deviation.

Interquartile Range = Q3 - Q1

import numpy as np

# Sample dataset
data = np.array([5, 10, 15, 20, 25, 30])

# Range
range_val = np.max(data) - np.min(data)
print("Range:", range_val)

# Variance
variance_val = np.var(data)
print("Variance:", variance_val)

# Standard Deviation
std_dev_val = np.std(data)
print("Standard Deviation:", std_dev_val)

# Interquartile Range
q1, q3 = np.percentile(data, [25, 75])
iqr_val = q3 - q1
print("Interquartile Range:", iqr_val)
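Two details are worth a short sketch. First, np.var and np.std divide by n by default (population formulas); passing ddof=1 gives the sample versions. Second, the median absolute deviation (MAD) mentioned earlier can be computed directly with NumPy. The example reuses the same sample dataset:

```python
import numpy as np

data = np.array([5, 10, 15, 20, 25, 30])

# Population variance divides by n; sample variance divides by n - 1
population_var = np.var(data)
sample_var = np.var(data, ddof=1)

# Median absolute deviation: median distance from the median
mad = np.median(np.abs(data - np.median(data)))

print("Population variance:", population_var)
print("Sample variance:", sample_var)
print("MAD:", mad)
```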

6. Skewness and Kurtosis

Skewness and kurtosis are two measures of the shape of a distribution.

Skewness is a measure of the asymmetry of the distribution. A distribution is considered to be skewed if one tail is longer than the other. A positively skewed distribution has a long tail on the right side, while a negatively skewed distribution has a long tail on the left side.

Fig.9 — Skewness

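Kurtosis, on the other hand, measures how heavy the tails of a distribution are relative to a normal distribution. It is often described as peakedness, but it is driven mainly by the tails. A distribution with high kurtosis has heavier (fatter) tails and more extreme values, often with a sharper peak, while a distribution with low kurtosis has lighter (thinner) tails and a flatter peak.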

Fig.10 — Kurtosis

In data science, skewness and kurtosis are important because they can help to identify potential problems with a dataset. For example, if a dataset is heavily skewed or has high kurtosis, this can indicate that the data is not normally distributed, which can affect the accuracy of statistical tests and models.

import numpy as np
from scipy.stats import skew, kurtosis

# Create a sample dataset
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Calculate skewness and kurtosis
data_skewness = skew(data)
data_kurtosis = kurtosis(data)

print("Skewness:", data_skewness)
print("Kurtosis:", data_kurtosis)

The skewness and kurtosis of a dataset can also be calculated using mathematical formulas.

Skewness Formula:

Fig.11- Skewness Formula

Kurtosis Formula:

Fig.12- Kurtosis Formula

It’s important to note that these formulas are for the population skewness and kurtosis, not the sample skewness and kurtosis. To calculate the sample skewness and kurtosis, you would need to use slightly different formulas that account for the degrees of freedom.
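Assuming the figures above show the usual moment-based definitions, the population skewness and kurtosis can be sketched with NumPy alone. The variable names are my own. With default arguments, scipy.stats.skew and scipy.stats.kurtosis compute these same population-style quantities, and kurtosis reports excess kurtosis (the value minus 3):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)

# Central moments of the data (population versions, dividing by n)
deviations = data - data.mean()
m2 = np.mean(deviations ** 2)
m3 = np.mean(deviations ** 3)
m4 = np.mean(deviations ** 4)

skewness = m3 / m2 ** 1.5
excess_kurtosis = m4 / m2 ** 2 - 3   # subtract 3 so a normal distribution scores 0

print("Skewness:", skewness)
print("Excess kurtosis:", excess_kurtosis)
```

The dataset is symmetric, so the skewness is zero, and the negative excess kurtosis reflects tails that are lighter than those of a normal distribution.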

7. Probability Distributions

Probability distributions are mathematical functions that describe the likelihood of different outcomes in a random event. In data science, probability distributions are used to model and analyze various types of data, such as continuous or discrete variables. They provide important information about the central tendency, variability, and shape of data, and can be used to make predictions about future events. There are many different types of probability distributions, each with its own characteristics and applications, including normal distribution, binomial distribution, Poisson distribution, and many others.

a. Types of Probability Distributions

There are many types of probability distributions, but some of the most common ones are:

  1. Normal distribution: A symmetric bell-shaped curve that represents the distribution of many natural phenomena, such as height or weight.
  2. Binomial distribution: A discrete distribution that represents the probability of a certain number of successes in a fixed number of independent trials.
  3. Poisson distribution: A discrete distribution that represents the probability of a certain number of events occurring in a fixed interval of time or space.
  4. Exponential distribution: A continuous distribution that represents the probability of waiting a certain amount of time for an event to occur.
  5. Uniform distribution: A continuous distribution where all outcomes are equally likely.

To identify which distribution to use for a particular dataset, it’s important to look at the shape of the data and the nature of the variable being measured. For example, if the variable is continuous and symmetric, a normal distribution may be appropriate. If the variable is discrete and represents a count, a Poisson or binomial distribution may be more appropriate.

b. Probability Mass Function (PMF) and Probability Density Function (PDF)

Probability Mass Function (PMF) and Probability Density Function (PDF) are two important concepts in probability theory and statistics.

PMF is used for discrete random variables and gives the probability of each possible outcome. It is defined as:

Fig.13— PMF function

PDF is used for continuous random variables and gives the probability density at each possible value of the variable. It is defined as:

Fig.14 — PDF Function

In Python, we can use the scipy.stats module to calculate PMF and PDF.

Fig.15 — PMF and PDF Functions

Here is an example of how to calculate a PMF for a discrete variable and a PDF for a continuous one:

import numpy as np
from scipy.stats import binom, norm

# PMF of a discrete variable: number of heads in 10 fair coin flips
k = np.arange(0, 11)
pmf = binom.pmf(k, n=10, p=0.5)
print(pmf)

# PDF of a continuous variable: normal density fitted to a random sample
data = np.random.normal(0, 1, 100)
x = np.linspace(data.min(), data.max(), 10)
pdf = norm.pdf(x, np.mean(data), np.std(data))
print(pdf)

To determine which one to use, we need to consider whether the random variable is discrete or continuous. If it is discrete, we use PMF, and if it is continuous, we use PDF.
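The practical difference is that PMF values are probabilities and sum to 1, while PDF values are densities that must be integrated over an interval to give a probability. Here is a minimal standard-library sketch; the helper function names are my own, and the integral is approximated with a simple midpoint sum:

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# PMF: the probabilities of all outcomes add up to 1
pmf_total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))

# PDF: approximate P(-1 <= X <= 1) with a midpoint Riemann sum
step = 0.001
prob_within_1sd = sum(normal_pdf(-1 + (i + 0.5) * step) * step for i in range(2000))

print("Sum of PMF:", pmf_total)
print("P(-1 <= X <= 1):", prob_within_1sd)
```

The second result is the familiar figure of roughly 68% of a normal distribution lying within one standard deviation of the mean.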

8. Correlation and Regression Analysis

Correlation and regression analysis are two important statistical techniques used in data analysis. Correlation analysis is used to measure the strength and direction of the relationship between two variables, while regression analysis is used to model the relationship between a dependent variable and one or more independent variables. Both techniques are commonly used in various fields, including finance, economics, social sciences, and engineering.

a. Pearson Correlation

The Pearson correlation coefficient is a measure of the linear correlation between two variables X and Y. It measures the strength and direction of the linear relationship between the two variables. The value of the Pearson correlation coefficient ranges from -1 to 1, where -1 indicates a perfect negative linear correlation, 0 indicates no linear correlation, and 1 indicates a perfect positive linear correlation.

Fig.16 — Pearson correlation

In Python, we can calculate the Pearson correlation coefficient using the pearsonr() function from the scipy.stats module:

from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]

corr, pvalue = pearsonr(x, y)

print("Pearson correlation coefficient:", corr)
print("p-value:", pvalue)

Pearson correlation coefficient: -1.0
p-value: 0.0

This indicates a perfect negative linear correlation between X and Y. The p-value is also very small, indicating that the correlation is statistically significant.

We use the Pearson correlation coefficient when we want to measure the strength and direction of a linear relationship between two continuous variables. However, it only measures linear relationships and does not capture other types, such as nonlinear relationships. It is also sensitive to outliers and influential observations, so a few extreme points can distort the coefficient, and the p-value reported alongside it assumes the data is approximately normally distributed.
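To make the linear-only limitation concrete, here is a sketch that computes the coefficient from its standard definition (covariance divided by the product of the standard deviations). The function name and the second dataset are my own illustrations. A perfect quadratic relationship yields a coefficient of zero:

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from the standard definition."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Perfectly linear, decreasing relationship
r_linear = pearson_r([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])

# Perfectly quadratic relationship: y = x squared
x = [-2, -1, 0, 1, 2]
r_quadratic = pearson_r(x, [v ** 2 for v in x])

print("Linear:", r_linear)
print("Quadratic:", r_quadratic)
```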

9. Model selection techniques

Model selection is the process of selecting the best model from among a set of candidate models that fit a given dataset. Model selection techniques can help to identify the most appropriate model for a given dataset by evaluating models based on their ability to fit the data and their complexity. Here are three commonly used model selection techniques:

  1. Stepwise selection: Stepwise selection is a process of iteratively adding or removing variables from a model until the best fit is achieved. This method involves two approaches: forward selection and backward elimination. In forward selection, variables are added to the model one by one until the best fit is achieved. In backward elimination, all variables are included in the model at first and then one by one, the less important variables are removed until the best fit is achieved. The stepwise method is commonly used in regression analysis.
  2. Akaike Information Criterion (AIC): AIC is a model selection technique that aims to balance the goodness of fit of a model with its complexity. It measures the relative quality of a statistical model for a given set of data. The AIC score for a model is calculated using the log-likelihood function and the number of parameters in the model. The model with the lowest AIC score is considered the best.
  3. Bayesian Information Criterion (BIC): BIC is similar to AIC, but it puts a stronger penalty on the number of parameters in the model. This method is useful for avoiding overfitting, where a model fits the training data too well and is unable to generalize to new data. The model with the lowest BIC score is considered the best.
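Both criteria are simple to compute once the log-likelihood is known: AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of parameters and n the number of observations. The sketch below uses two invented candidate models (the log-likelihoods and parameter counts are hypothetical) to show how the stronger BIC penalty can change the winner:

```python
import math

def aic(log_likelihood, num_params):
    return 2 * num_params - 2 * log_likelihood

def bic(log_likelihood, num_params, num_obs):
    return num_params * math.log(num_obs) - 2 * log_likelihood

# Hypothetical fits: name -> (log-likelihood, number of parameters)
candidates = {"small": (-105.0, 2), "large": (-100.0, 6)}
n_obs = 50

for name, (ll, k) in candidates.items():
    print(name, "AIC:", aic(ll, k), "BIC:", round(bic(ll, k, n_obs), 2))

best_by_aic = min(candidates, key=lambda m: aic(*candidates[m]))
best_by_bic = min(candidates, key=lambda m: bic(*candidates[m], n_obs))
print("Best by AIC:", best_by_aic)
print("Best by BIC:", best_by_bic)
```

Here AIC prefers the larger model, while BIC prefers the smaller one because each extra parameter costs ln(50), about 3.9, instead of 2.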

Here are the steps involved in applying model selection techniques:

  1. Identify candidate models: First, identify a set of candidate models that can potentially fit the data.
  2. Split the data: Split the data into training and testing sets. The training set is used to fit the models, and the testing set is used to evaluate the performance of the models.
  3. Fit the models: Fit each candidate model to the training data and calculate the goodness of fit.
  4. Evaluate the models: Evaluate the performance of each model using the testing data.
  5. Select the best model: Select the model with the best performance on the testing data using one of the model selection techniques discussed above.

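Here is a simplified Python example using the statsmodels library. It is not a full stepwise search: it fits the full model once and keeps the two predictors with the largest absolute t-values:

import statsmodels.api as sm
import pandas as pd

# Load data (data.csv is a placeholder with columns y, x1, x2, x3)
data = pd.read_csv('data.csv')

# Define response variable and predictor variables (with an intercept)
y = data['y']
X = sm.add_constant(data[['x1', 'x2', 'x3']])

# Fit the full model, then keep the two predictors with the largest |t|
model = sm.OLS(y, X).fit()
selected = model.tvalues.drop('const').abs().sort_values(ascending=False)[:2].index
final_model = sm.OLS(y, X[['const', *selected]]).fit()

# Print final model summary
print(final_model.summary())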
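For comparison, a genuine forward selection loop can be sketched with NumPy alone. This is one possible version under stated assumptions, not a definitive implementation: the data is synthetic, the function names are my own, and AIC is used as the stopping criterion (computed for a least-squares fit with Gaussian errors, up to an additive constant):

```python
import numpy as np

def gaussian_aic(y, design):
    """AIC of a least-squares fit, up to an additive constant."""
    n = len(y)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    return n * np.log(rss / n) + 2 * design.shape[1]

def forward_selection(y, X, names):
    """Add one predictor at a time while the AIC keeps improving."""
    intercept = np.ones((len(y), 1))
    selected = []
    best_aic = gaussian_aic(y, intercept)
    while len(selected) < X.shape[1]:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        scores = [
            (gaussian_aic(y, np.hstack([intercept, X[:, selected + [j]]])), j)
            for j in remaining
        ]
        candidate_aic, j = min(scores)
        if candidate_aic >= best_aic:
            break  # no remaining predictor improves the model
        best_aic = candidate_aic
        selected.append(j)
    return [names[j] for j in selected]

# Synthetic data: y depends on x1 and x2, while x3 is pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

chosen = forward_selection(y, X, ["x1", "x2", "x3"])
print("Selected predictors:", chosen)
```

Backward elimination follows the same pattern in reverse: start from the full model and drop the predictor whose removal lowers the AIC the most.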

When interpreting a full model summary, it is important to understand the meaning of the different values and what they represent. Here are some tips:

  1. R-squared: This is a measure of how much variance in the dependent variable is explained by the independent variables in the model. A higher value of R-squared indicates a better fit of the model to the data.
  2. Coefficients: These show the relationship between the independent variables and the dependent variable. A positive coefficient indicates a positive relationship, while a negative coefficient indicates a negative relationship. The magnitude gives the expected change in the dependent variable for a one-unit change in that predictor, holding the other predictors fixed, so it depends on the units of the predictor.
  3. Standard errors: These indicate the variability of the estimated coefficients. Smaller standard errors indicate more precise estimates.
  4. t-values: These show the ratio of the estimated coefficient to its standard error. Higher absolute t-values indicate greater statistical significance of the relationship between the independent variable and the dependent variable.
  5. p-values: These indicate the statistical significance of the estimated coefficients. A p-value less than 0.05 is typically considered statistically significant.
  6. Confidence intervals: These provide a range of values within which the true value of the coefficient is likely to fall. A wider confidence interval indicates more uncertainty in the estimate.

10. Statistical Inference

Statistical Inference is a process of drawing conclusions about a population based on a sample of data. It involves using various statistical methods to analyze the sample data and make inferences or predictions about the larger population. The goal of statistical inference is to make accurate and reliable estimates or predictions based on limited data. It is widely used in various fields, including finance, medicine, social sciences, and engineering.

a. Population vs Sample

In statistics, a population refers to the entire group of individuals, objects, or events that we are interested in studying. A sample, on the other hand, is a smaller subset of the population that is selected to represent the population. The goal of statistical inference is to use information obtained from a sample to make inferences or draw conclusions about the larger population. By studying the sample, we hope to gain insights into the population without having to measure every single member of the population.

Fig.17 — Population vs Sample

b. Sampling Techniques

Sampling is a process of selecting a subset of individuals or observations from a population to estimate or infer something about the whole population. There are several types of sampling techniques used in statistics, including:

1. Simple random sampling: In this technique, each individual or observation has an equal chance of being selected for the sample. For example, if we want to survey 100 people from a population of 1000, we can randomly select 100 individuals from the population.

Formula: Probability of selection of each element = (Sample size)/(Population size)

Python code:

import random

population = [1,2,3,4,5,6,7,8,9,10]
sample_size = 5

sample = random.sample(population, sample_size)
print(sample)

2. Stratified sampling: In this technique, the population is divided into non-overlapping subgroups or strata, and a random sample is taken from each stratum. This ensures that the sample is representative of the population.

Formula: Sample size of each stratum = (Size of stratum/Total population size) x Sample size

Python code:

import pandas as pd

population = pd.DataFrame({'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
                           'Age': [20, 25, 30, 35, 40, 45, 50, 55],
                           'Income': [2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000]})

# Draw 2 rows at random, without replacement, from each gender stratum
strata = population.groupby('Gender').sample(n=2).reset_index(drop=True)
print(strata)

3. Cluster sampling: In this technique, the population is divided into clusters, and a random sample of clusters is selected. Then, all individuals within the selected clusters are included in the sample.

Formula: Sample size = sum of the sizes of the selected clusters

Python code:

import random
import pandas as pd

population = pd.DataFrame({'City': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
                           'Age': [20, 25, 30, 35, 40, 45, 50, 55],
                           'Income': [2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000]})

# Randomly choose 2 of the 4 cities, then keep every row from those cities
selected_cities = random.sample(sorted(population['City'].unique()), 2)
clusters = population[population['City'].isin(selected_cities)]
print(clusters)

4. Systematic sampling: In this technique, the population is first ordered, and then a random starting point is selected. Then, every kth individual is selected for the sample.

Formula: Sampling interval = Population size/Sample size

Python code:

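import random
import pandas as pd

population = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                           'Age': [20, 25, 30, 35, 40, 45, 50, 55, 60, 65],
                           'Income': [2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 11000]})

sample_size = 5

# Sampling interval k, a random start within the first k rows, then every kth row
k = len(population) // sample_size
start = random.randrange(k)
sample = population.iloc[start::k]
print(sample)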

c. Confidence Intervals

A confidence interval is a range of values, computed from sample data, that is expected to contain the true population parameter with a stated level of confidence. For example, a 95% confidence interval is built by a procedure that captures the true parameter in about 95% of repeated samples.

The formula for confidence interval is given as:

Fig.18- Confidence Intervals
import numpy as np
from scipy.stats import norm

# generate a sample of data
data = np.random.normal(10, 2, 100)

# calculate sample mean and standard deviation
sample_mean = np.mean(data)
sample_std = np.std(data, ddof=1)

# set the desired level of confidence
confidence_level = 0.95

# calculate the z-score associated with the desired level of confidence
z_score = norm.ppf(1 - ((1 - confidence_level) / 2))

# calculate the lower and upper bounds of the confidence interval
lower_bound = sample_mean - (z_score * (sample_std / np.sqrt(len(data))))
upper_bound = sample_mean + (z_score * (sample_std / np.sqrt(len(data))))

# print the results
print("Sample mean:", sample_mean)
print("Confidence interval:", lower_bound, "-", upper_bound)

Suppose we want to estimate the average height of all male students in a university. We take a random sample of 100 male students and calculate the sample mean height to be 175 cm and the sample standard deviation to be 5 cm. We want to construct a 95% confidence interval for the true population mean height.

Using the formula above, we can calculate the confidence interval as:

Fig.19— Solving the problem using Formula of CI
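The same calculation can be reproduced in a few lines of Python. This sketch assumes the z-based interval used in the code earlier in this section (mean ± z × s / sqrt(n)) and plugs in the numbers from the example:

```python
import math
from statistics import NormalDist

# Values from the worked example
n = 100
sample_mean = 175.0   # cm
sample_std = 5.0      # cm
confidence_level = 0.95

# z-score that leaves 2.5% in each tail (about 1.96)
z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)

margin = z * sample_std / math.sqrt(n)
lower, upper = sample_mean - margin, sample_mean + margin

print(f"95% confidence interval: {lower:.2f} cm to {upper:.2f} cm")
```

The margin of error is about 0.98 cm, giving an interval of roughly 174.02 cm to 175.98 cm.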

d. Hypothesis testing: Null and Alternative

Hypothesis testing is a statistical method that is used to determine if there is enough evidence to reject a null hypothesis in favor of an alternative hypothesis. The null hypothesis is a statement that there is no difference between two populations, while the alternative hypothesis is a statement that there is a difference.

Fig.20 — Hypothesis testing: Null and Alternative

The process of hypothesis testing involves the following steps:

  1. State the null and alternative hypothesis
  2. Select a significance level (alpha)
  3. Collect data and calculate test statistics
  4. Determine the p-value
  5. Make a decision and interpret the results

The p-value is the probability of obtaining the observed results or more extreme results assuming that the null hypothesis is true. If the p-value is less than the significance level, we reject the null hypothesis and accept the alternative hypothesis.

Type I error is the rejection of a true null hypothesis, while type II error is the failure to reject a false null hypothesis. The significance level, denoted by alpha, is the probability of committing a type I error. It is typically set at 0.05 or 0.01.

For example, in a clinical trial, the null hypothesis might be that a new drug has no effect, while the alternative hypothesis is that it does. The p-value is calculated based on the results of the trial, and if it is less than the significance level, the null hypothesis is rejected, indicating that the drug has a significant effect. If the p-value is greater than the significance level, the null hypothesis is not rejected, indicating that there is not enough evidence to conclude that the drug has a significant effect.
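The link between alpha and the type I error rate can be checked with a small simulation. In this sketch, which is my own illustration with arbitrary settings, every sample is drawn from a population where the null hypothesis is true, and a two-sided z-test with known standard deviation is applied each time. The share of (wrong) rejections should land close to alpha:

```python
import math
import random
from statistics import NormalDist, mean

random.seed(0)

alpha = 0.05
n, mu0, sigma = 30, 0.0, 1.0   # the null hypothesis (mean = mu0) is true here
trials = 2000

rejections = 0
for _ in range(trials):
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    z = (mean(sample) - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value
    if p_value < alpha:
        rejections += 1   # a type I error: rejecting a true null hypothesis

type_i_rate = rejections / trials
print("Observed type I error rate:", type_i_rate)
```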

There are several types of hypothesis testing, including:

1. One-sample t-test: A one-sample t-test is used to determine whether the mean of a population is significantly different from a specified value. This test is commonly used in hypothesis testing when the sample size is small and the population standard deviation is unknown.

Mathematical formula: t = (x̄ - μ) / (s / sqrt(n))

where x̄ is the sample mean, μ is the hypothesized population mean, s is the sample standard deviation, and n is the sample size.

Python code:

from scipy.stats import ttest_1samp

# Example data
data = [10, 15, 12, 18, 14]

# Null hypothesis: population mean is 12
# Alternative hypothesis: population mean is not 12
t_statistic, p_value = ttest_1samp(data, 12)

print("t-statistic:", t_statistic)
print("p-value:", p_value)

2. Two-sample t-test: A two-sample t-test is used to determine whether the means of two populations are significantly different from each other. This test is commonly used in hypothesis testing when comparing the means of two independent samples.

Mathematical formula: t = (x̄1 - x̄2) / sqrt((s1² / n1) + (s2² / n2))

where x̄1 and x̄2 are the sample means, s1 and s2 are the sample standard deviations, and n1 and n2 are the sample sizes.

Python code:

from scipy.stats import ttest_ind

# Example data
group1 = [10, 15, 12, 18, 14]
group2 = [8, 11, 13, 9, 12]

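# Null hypothesis: population means are equal
# Alternative hypothesis: population means are not equal
# equal_var=False selects Welch's t-test, which matches the unpooled formula above
t_statistic, p_value = ttest_ind(group1, group2, equal_var=False)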

print("t-statistic:", t_statistic)
print("p-value:", p_value)

3. Chi-squared test: The chi-squared test is used to determine if there is a significant association between two categorical variables. It measures how much the observed data differs from the data expected under a specific hypothesis, usually that the two variables are independent. The test calculates a chi-squared statistic, which is compared to a critical value from a chi-squared distribution to determine if the results are statistically significant. The formula for the chi-squared statistic is:

Chi-squared statistic = Σ (Observed frequency - Expected frequency)² / Expected frequency

In Python, the scipy.stats module provides functions to perform the chi-squared test. Here's an example:

import numpy as np
from scipy.stats import chi2_contingency

# Create a contingency table
observed = np.array([[50, 30], [20, 40]])

# Perform chi-squared test
chi2_stat, p_val, dof, expected = chi2_contingency(observed)

print("Chi-squared statistic:", chi2_stat)
print("P-value:", p_val)
print("Degrees of freedom:", dof)
print("Expected frequencies:\n", expected)
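To connect the formula to the code, here is a sketch that computes the expected frequencies and the statistic by hand for the same table. Under independence, each expected count is (row total × column total) / grand total. One caveat: for 2x2 tables, chi2_contingency applies Yates' continuity correction by default, so its statistic is slightly smaller than the uncorrected value below; passing correction=False reproduces the plain formula.

```python
observed = [[50, 30], [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Sum of (observed - expected)^2 / expected over all cells
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print("Expected frequencies:", expected)
print("Chi-squared statistic (uncorrected):", chi2)
```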

4. ANOVA (Analysis of Variance): ANOVA is used to determine if there is a significant difference between the means of three or more groups. It partitions the total variance of the data into different sources and calculates an F-statistic, which is compared to a critical value from an F-distribution to determine if the results are statistically significant. The formula for the F-statistic is:

F-statistic = Between-group variance / Within-group variance

where the between-group variance and within-group variance (also called mean squares) are calculated as:

Between-group variance = Σ nj * (group mean j - overall mean)² / (k-1)

Within-group variance = Σ (observation - group mean)² / (n-k)

Here k is the number of groups, n is the total sample size, and nj is the number of observations in group j.

In Python, the statsmodels module provides functions to perform ANOVA. Here's an example:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load data
data = pd.read_csv("data.csv")

# Perform ANOVA
model = ols('score ~ group', data=data).fit()
anova = sm.stats.anova_lm(model, typ=2)

print(anova)
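Since the example above depends on an external data.csv file, here is a self-contained sketch that computes the F-statistic by hand: the between-group and within-group sums of squares are each divided by their degrees of freedom, and the ratio of the two is F. The three small groups are invented for illustration:

```python
# Hypothetical scores for three groups
groups = {"A": [1, 2, 3], "B": [2, 3, 4], "C": [5, 6, 7]}

all_values = [v for values in groups.values() for v in values]
n, k = len(all_values), len(groups)
grand_mean = sum(all_values) / n

# Between-group sum of squares: group size * squared distance of group mean from grand mean
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)

# Within-group sum of squares: squared distance of each value from its own group mean
ss_within = sum(
    (v - sum(values) / len(values)) ** 2
    for values in groups.values()
    for v in values
)

f_statistic = (ss_between / (k - 1)) / (ss_within / (n - k))
print("F-statistic:", f_statistic)
```

A large F means the group means differ by much more than the spread within each group would suggest.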

5. T-Test

A t-test is a statistical hypothesis test used to determine if there is a significant difference between the means of two groups. It is a type of parametric test that assumes the data is normally distributed and that the variances of the two groups are equal.

There are two main types of t-tests:

  1. One-sample t-test: used to compare the mean of a single group to a known value or a hypothesized value.
  2. Two-sample t-test: used to compare the means of two independent groups.

The t-test works by calculating the t-statistic, which measures the difference between the means of the two groups relative to the variation within each group. The t-statistic is then compared to a critical value from the t-distribution, based on the degrees of freedom and the desired level of significance (usually 0.05). For a two-sided test, if the absolute value of the calculated t-statistic is greater than the critical value, the null hypothesis (that the means are equal) is rejected in favor of the alternative hypothesis (that the means are different).

Here is the formula for the t-statistic when both groups have the same size:

t = (x1 - x2) / (s * sqrt(2/n))

where:

  • x1 and x2 are the sample means of the two groups
  • s is the pooled standard deviation of the two groups
  • n is the sample size of each group

And here is some example Python code for conducting a two-sample t-test:

from scipy.stats import ttest_ind

group1 = [1, 2, 3, 4, 5]
group2 = [6, 7, 8, 9, 10]

t_stat, p_value = ttest_ind(group1, group2)

print("t-statistic:", t_stat)
print("p-value:", p_value)
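The pooled calculation can also be written out by hand. This sketch uses the general pooled-variance form, which reduces to s * sqrt(2/n) in the denominator when both groups have the same size:

```python
import math
from statistics import mean, variance

group1 = [1, 2, 3, 4, 5]
group2 = [6, 7, 8, 9, 10]
n1, n2 = len(group1), len(group2)

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)

# Difference in means divided by its standard error
t_stat = (mean(group1) - mean(group2)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

print("Pooled standard deviation:", math.sqrt(pooled_var))
print("t-statistic:", t_stat)
```

The result is negative because the first group has the smaller mean; the degrees of freedom for this test are n1 + n2 - 2 = 8.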

6. Regression analysis: Regression analysis is used to examine the relationship between a dependent variable and one or more independent variables. It calculates the coefficients of the regression equation, which can be used to make predictions. The most common type of regression analysis is linear regression, which assumes a linear relationship between the variables. The formula for the regression equation is:

Y = a + bX

where Y is the dependent variable, X is the independent variable, a is the intercept, and b is the slope. The coefficients are estimated using the method of least squares, which minimizes the sum of squared residuals. The t-test and F-test are used to determine if the coefficients are statistically significant.

In Python, the statsmodels module provides functions to perform linear regression. Here's an example:

import pandas as pd
import statsmodels.api as sm

# Load data
data = pd.read_csv("data.csv")

# Perform linear regression
X = data["X"]
Y = data["Y"]
X = sm.add_constant(X)
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)

print(model.summary())
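For a single predictor, the least-squares estimates have a closed form: the slope b is the sum of cross-deviations divided by the sum of squared deviations of X, and the intercept is a = mean(Y) - b × mean(X). A minimal sketch with made-up data points:

```python
# Hypothetical data, roughly following Y = 2X
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Slope: sum of cross-deviations over sum of squared deviations of x
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = s_xy / s_xx

# Intercept: the fitted line passes through the point of means
intercept = mean_y - slope * mean_x

# Residuals are the gaps between the observed and fitted values
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print("Slope b:", slope)
print("Intercept a:", intercept)
print("Sum of squared residuals:", sum(r ** 2 for r in residuals))
```

These are the same coefficients that sm.OLS would report for this data, since it minimizes the same sum of squared residuals.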

11. Project on Descriptive Statistics

Check out the Python notebook on GitHub for a descriptive analysis project on Yahoo Finance data.

Top-Machine-Learning-Algorithms-Python/Descriptive_Statistics_Project.ipynb at main ·…


For the next part of this Ultimate Guide to Statistics series, on inferential statistics, follow the link below:

The Ultimate Guide to Statistics: Part 2 — Inferential Statistics

Learn about Inferential Statistics and Hypothesis Testing along with Confidence Intervals. Regression analysis is also…

medium.com

Published via Towards AI
