
Univariate, Bivariate, and Multivariate Analysis

Last Updated on July 17, 2023 by Editorial Team

Author(s): Ann Mary Shaju

Originally published on Towards AI.

A beginner guide to exploratory data analysis using Matplotlib and Seaborn

Photo by Luke Chesser on Unsplash

The volume of data grows every day, and to extract useful information and insights from it we need to analyse it. An analysis can involve one, two or more variables, and each variable can be either numerical or categorical. The number and type of variables determine which analysis techniques apply. In this article, I will explain univariate, bivariate and multivariate analysis of data.

Image by Author

The dataset used in the article is available on Github. The original source of the dataset is Kaggle. However, in order to explain a few functionalities, I have modified the dataset and saved it on Github.

Before performing the data analysis, it is important to identify the numerical and categorical variables of the data frame. The info() method helps here: it lists every column together with its non-null count and dtype.

data.info()
Image by Author

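From the info() output we can identify the numerical and categorical variables of the given data frame: columns with a numeric dtype (such as int64 or float64) are numerical, while columns with the object dtype usually hold categorical values.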
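The same split can also be obtained programmatically. The snippet below is a minimal sketch that uses select_dtypes() on a small, made-up data frame; the column names mirror the loan dataset, but the values are invented for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the loan dataset; the values are invented.
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
    "LoanAmount": [128.0, 66.0, 120.0, 141.0],
})

# Columns with a numeric dtype (int, float) are treated as numerical.
numerical_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical in this sketch.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

Note that a column stored as numbers can still be categorical in meaning (for example a 0/1 flag), so the dtype is only a first hint.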

Univariate Analysis

Uni means single and variate means variable. Hence univariate means the analysis of a single variable. In univariate analysis, we are trying to analyse the pattern present in a single variable.

Image by Author

Univariate Analysis for Numerical Variables

Univariate analysis of numerical variables can be performed using multiple methods.

  • Histogram: A histogram shows the distribution of a numerical variable. The x-axis is divided into bins of values and the y-axis shows the count of observations that fall into each bin.
import matplotlib.pyplot as plt
plt.hist(data["Loan_Amount_Term"]) # Histogram plot using matplotlib
plt.xlabel("Loan Amount Term")
plt.ylabel("Count")
Image by Author
import seaborn as sns
sns.histplot(data, x='Loan_Amount_Term') # Histogram plot using seaborn
Image by Author
  • Box plot: A box plot summarises a variable through its minimum, first quartile, median, third quartile and maximum.
plt.boxplot(data["ApplicantIncome"]) # Box plot using matplotlib
Image by Author
sns.boxplot(data=data, y="ApplicantIncome") # Box plot using seaborn (keyword arguments work across seaborn versions)
Image by Author
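  • Violin plot: A violin plot combines a box plot with a density plot. Instead of the bars of a histogram, it draws a smoothed (kernel density) estimate of the distribution on both sides of the box.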
plt.violinplot(data["CoapplicantIncome"]) # Violin plot using matplotlib
Image by Author
sns.violinplot(data=data, y="CoapplicantIncome") # Violin plot using seaborn (keyword arguments work across seaborn versions)
Image by Author
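The five numbers that a box plot draws can also be computed directly. The sketch below uses a small, invented income series (not the real dataset) and pandas' quantile(). It also applies the common 1.5 × IQR rule, which matplotlib and seaborn use by default to decide where the whiskers end and which points are drawn as outliers.

```python
import pandas as pd

# Invented incomes for illustration; one value is deliberately extreme.
income = pd.Series(
    [1500, 2500, 3000, 3500, 4000, 4500, 6000, 20000],
    name="ApplicantIncome",
)

# Minimum, first quartile, median, third quartile and maximum.
summary = income.quantile([0, 0.25, 0.5, 0.75, 1.0])
print(summary)

# Interquartile range and the 1.5 * IQR fences.
q1, q3 = summary.loc[0.25], summary.loc[0.75]
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are the ones a box plot marks individually.
outliers = income[(income < lower_fence) | (income > upper_fence)]
print(outliers.tolist())
```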

Univariate Analysis for Categorical Variables

Univariate analysis of categorical variables can be performed using the following methods.

  • Count plot: A count plot is similar to a histogram, but instead of binning numeric values it shows the frequency of each category of a categorical variable.
sns.countplot(data, x="Gender")
Image by author
  • Bar graph: A bar graph can also be used to represent the frequency distribution of a categorical variable.
# Bar graph using matplotlib
gender_counts = data['Gender'].value_counts()
plt.bar(gender_counts.index, gender_counts.values)
plt.title("Gender Distribution")
plt.xlabel("Gender")
plt.ylabel("Count")
plt.show()
Image by Author
# Bar graph using seaborn
gender_counts = data['Gender'].value_counts()
# x and y are passed as vectors, so the data argument is not needed
sns.barplot(x=gender_counts.index, y=gender_counts.values)
Image by Author
  • Pie chart: A pie chart is a circular graph divided into slices, where the size of each slice represents the proportion of one category of a variable.
# Pie chart using matplotlib
dependents = data['Dependents'].value_counts()
plt.pie(dependents.values, labels=dependents.index, autopct='%1.1f%%')
plt.title("Dependents Distribution")
Image by Author
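The count plot, bar graph and pie chart above all visualise the same underlying frequency table. As a minimal sketch, the snippet below builds that table with value_counts() on an invented Dependents column; normalize=True gives the shares that a pie chart displays as percentages.

```python
import pandas as pd

# Invented category values for illustration only.
dependents = pd.Series(
    ["0", "1", "0", "2", "0", "3+", "1", "0"],
    name="Dependents",
)

# Absolute frequency of each category (what a count plot or bar graph shows).
counts = dependents.value_counts()
print(counts)

# Relative frequency in percent (what a pie chart shows).
shares = dependents.value_counts(normalize=True) * 100
print(shares.round(1))
```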

Bivariate Analysis

Bivariate means the analysis of two variables. Using bivariate analysis we can examine how two variables are related to each other. Depending on the types of the two variables, bivariate analysis falls into 3 cases:

  • Two numerical variables
  • Two categorical variables
  • One numerical and one categorical variable
Image by Author

Bivariate Analysis of Numerical Variables

  • Scatterplot: Scatterplot uses dots to represent the relationship between two numeric variables.
# Scatter plot using matplotlib
plt.scatter(data["ApplicantIncome"], data["LoanAmount"])
plt.title("Scatter Plot of Applicant Income vs LoanAmount")
plt.xlabel("ApplicantIncome")
plt.ylabel("LoanAmount")
plt.show()
Image by Author
# Scatter plot using Seaborn
sns.scatterplot(data, x="ApplicantIncome", y="LoanAmount")
plt.title("Scatter Plot of Applicant Income vs LoanAmount")
plt.xlabel("ApplicantIncome")
plt.ylabel("LoanAmount")
Image by Author
  • Joint plot: As the name suggests, a joint plot joins a bivariate graph with the univariate graphs of the two variables, which are drawn along the margins. Using the parameter kind we can choose the kind of plot to draw (i.e. scatter, hex, hist, kde, reg, resid).
# Joint plot using Seaborn
sns.jointplot(data,x="LoanAmount",y="ApplicantIncome",kind="scatter")
plt.suptitle("Joint Plot of Loan Amount & Applicant Income")
Image by Author
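A scatter plot shows the relationship visually; a correlation coefficient puts a number on it. The sketch below computes the Pearson correlation with pandas' Series.corr() on a few invented rows. The values are made up so that the two columns rise together, and they are not taken from the real dataset.

```python
import pandas as pd

# Invented rows for illustration; both columns increase together.
sample = pd.DataFrame({
    "ApplicantIncome": [2000, 3000, 4000, 5000, 6000],
    "LoanAmount": [80.0, 100.0, 130.0, 150.0, 190.0],
})

# Pearson correlation (the default method) ranges from -1 to 1.
r = sample["ApplicantIncome"].corr(sample["LoanAmount"])
print(round(r, 3))
```

A value close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little or no linear relationship.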

Bivariate Analysis of Categorical Variables

  • Count plot: Using the hue parameter we can analyse two categorical variables in a count plot.
sns.countplot(data, x="Education", hue="Loan_Status")
Image by Author
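The numbers behind a count plot with hue can be tabulated with pd.crosstab(). The following is a minimal sketch on invented rows; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Invented rows for illustration only.
sample = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Graduate",
                  "Not Graduate", "Not Graduate", "Graduate"],
    "Loan_Status": ["Y", "Y", "N", "N", "Y", "Y"],
})

# Rows are education levels, columns are loan statuses,
# and each cell is the number of records with that combination.
table = pd.crosstab(sample["Education"], sample["Loan_Status"])
print(table)
```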

Bivariate Analysis of Numerical & Categorical Variable

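  • Bar plot: Using the x and y parameters we can plot the relationship between a categorical and a numeric variable. By default, each bar shows the mean of the numeric variable for one category, with an error bar for its confidence interval.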
sns.barplot(data, x="Gender", y="ApplicantIncome")
Image by Author
  • Box plot: Using the x and y parameters we can compare the distribution of a numeric variable across the categories of a categorical variable.
sns.boxplot(data, x="Self_Employed", y="ApplicantIncome")
Image by Author
  • Violin plot: Using the x and y parameters we can likewise draw one violin per category and compare the shape of the numeric variable's distribution between categories.
sns.violinplot(data, x="Property_Area", y="LoanAmount")
Image by Author
  • Displot: displot plots the distribution of a numeric variable, and with the hue parameter it draws one distribution per category. Using the parameter kind we can choose the kind of plot to draw (for example hist or kde).
sns.displot(data, x= "ApplicantIncome", hue="Married", kind="kde")
Image by Author
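The per-category statistics that these plots display can be checked with groupby(). The sketch below, on invented rows, computes the mean (the height of each bar in a seaborn bar plot by default) and the median (the centre line of each box) of a numeric column for every category.

```python
import pandas as pd

# Invented rows for illustration only.
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "ApplicantIncome": [5000, 4000, 3000, 6000, 4000],
})

# One row per category with its mean, median and number of records.
stats = sample.groupby("Gender")["ApplicantIncome"].agg(["mean", "median", "count"])
print(stats)
```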

Multivariate Analysis

Multivariate means the analysis of more than two variables.

Image by Author
  • Bar plot: In the bar plot, the x and y parameters are used to plot the relationship between a categorical and a numerical variable. In addition, the hue parameter splits each bar by a second categorical variable, so three variables are shown at once.
sns.barplot(data, x="Gender", y="ApplicantIncome", hue="Loan_Status")
Image by Author
  • Pair plot: A pair plot plots the pairwise relationships in a dataset. By default it uses only the numerical variables, but by listing the variables we want in the vars parameter we can include categorical variables of the data frame as well.
# By passing data.columns in vars parameter,
# we can analyse categorical and numerical variables of a data frame
sns.pairplot(data, vars=list(data.columns))
sns.pairplot(data, vars=['ApplicantIncome', 'Gender', 'Education'])
Image by Author
  • Heatmap: The heatmap is used to visualise the correlations between pairs of numeric variables.
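# numeric_only=True skips the categorical columns; without it,
# pandas 2.0 and later raise an error on non-numeric data
sns.heatmap(data.corr(numeric_only=True), annot=True)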
Image by Author
  • hist(): hist() is a function of the pandas library. It makes histograms of all numeric variables of a data frame. hist() calls matplotlib.pyplot.hist() on each series of the data frame.

data.hist(bins=50, figsize=(12, 8))
Image by Author
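The tables behind two of the plots above can be sketched without any plotting. In the snippet below, pivot_table() reproduces the grouped means that a bar plot with hue displays, and corr() on the numeric columns produces the matrix that the heatmap colours. The rows are invented for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Invented rows for illustration only.
sample = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Loan_Status": ["Y", "N", "Y", "Y", "Y", "N"],
    "ApplicantIncome": [5000, 3000, 4000, 6000, 7000, 2000],
    "LoanAmount": [120.0, 90.0, 100.0, 150.0, 170.0, 60.0],
})

# Mean income for every Gender / Loan_Status combination
# (the bar heights of a bar plot with x="Gender" and hue="Loan_Status").
pivot = sample.pivot_table(
    index="Gender",
    columns="Loan_Status",
    values="ApplicantIncome",
    aggfunc="mean",
)
print(pivot)

# Pairwise correlations of the numeric columns only
# (the matrix passed to a heatmap).
corr = sample.select_dtypes(include="number").corr()
print(corr.round(2))
```

Selecting the numeric columns first with select_dtypes() is one way to avoid errors from categorical columns; it is an alternative to passing numeric_only=True to corr().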

Conclusion

In this article, we looked at what univariate, bivariate and multivariate analysis are. We also learnt various ways of plotting data using the Matplotlib and Seaborn libraries. The graphs discussed in this article are not the only ones available; which graph to use depends on the use case.


Published via Towards AI
