
# Univariate, Bivariate, and Multivariate Analysis

Last Updated on July 17, 2023 by Editorial Team

#### Author(s): Ann Mary Shaju

Originally published on Towards AI.

## A beginner's guide to exploratory data analysis using Matplotlib and Seaborn

The volume of data grows daily, and to extract useful information and insights from it we need to analyse it. An analysis can involve one, two, or more variables, and each variable can be either numerical or categorical. Different techniques apply depending on the number and type of variables. In this article, I explain univariate, bivariate, and multivariate analysis of data.

The dataset used in this article is available on GitHub. The original source of the dataset is Kaggle; however, to demonstrate a few features I have modified the dataset and saved it on GitHub.

Before performing the data analysis, it is important to identify the numerical and categorical variables of the data frame. We can use info() for this.

`data.info()`

In the info() output, the Dtype column shows the type of each column: int64 and float64 columns are numerical, while object columns are usually categorical.
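The same split can also be done programmatically rather than by eye. The sketch below is only an illustration: `toy` is a small made-up frame (its column names are borrowed from the article's dataset, its values are invented), and it assumes that every non-numeric column is categorical, which does not hold when categories are stored as numeric codes.

```python
import pandas as pd

# Hypothetical toy frame standing in for the article's loan dataset.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "ApplicantIncome": [5849, 4583, 3000],
    "LoanAmount": [128.0, 66.0, 120.0],
})

# Integer and float columns are treated as numerical variables.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
# All remaining columns (here: a string column) are treated as categorical.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

The two lists can then be used to decide which of the plots in the following sections applies to each column.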

## Univariate Analysis

Uni means single and variate means variable. Hence univariate means the analysis of a single variable. In univariate analysis, we are trying to analyse the pattern present in a single variable.

## Univariate Analysis for Numerical Variables

Univariate analysis of numerical variables can be performed using multiple methods.

• Histogram: A histogram shows the distribution of a numerical variable. The x-axis is divided into bins of values and the y-axis shows the number of observations that fall into each bin.
`# Histogram plot using matplotlib`
`import matplotlib.pyplot as plt`
`plt.hist(data["Loan_Amount_Term"])`
`plt.xlabel("Loan Amount Term")`
`plt.ylabel("Count")`
`# Histogram plot using seaborn`
`import seaborn as sns`
`sns.histplot(data, x='Loan_Amount_Term')`
• Box plot: A box plot summarises a variable through its minimum, first quartile, median, third quartile, and maximum.
`plt.boxplot(data["ApplicantIncome"]) # Box plot using matplotlib`
`sns.boxplot(data["ApplicantIncome"]) # Box plot using seaborn`
• Violin plot: A violin plot combines a box plot with a kernel density estimate, so it shows the shape of the distribution alongside its summary statistics.
`plt.violinplot(data["CoapplicantIncome"]) # Violin plot using matplotlib`
`sns.violinplot(data["CoapplicantIncome"]) # Violin plot using seaborn`
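The numbers behind a box plot can also be computed directly. The following is a minimal sketch on invented values (the series `income` is hypothetical, not taken from the dataset) using pandas' `quantile`. Note that, by default, Matplotlib and Seaborn draw the whiskers only as far as the most extreme point within 1.5 times the interquartile range of the box, so the whiskers match the minimum and maximum only when there are no outliers.

```python
import pandas as pd

# Hypothetical income values, invented for illustration.
income = pd.Series([1000, 2000, 3000, 4000, 5000], name="ApplicantIncome")

# Five-number summary: minimum, first quartile, median, third quartile, maximum.
summary = income.quantile([0, 0.25, 0.5, 0.75, 1.0])

# Interquartile range: the height of the box in a box plot.
iqr = summary[0.75] - summary[0.25]

print(summary)
print("IQR:", iqr)
```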

## Univariate Analysis for Categorical Variables

Univariate analysis of categorical variables can be performed using the following methods:

• Count plot: A count plot is similar to a histogram, but it shows the frequency of each category of a categorical variable instead of binned numerical values.
`sns.countplot(data, x="Gender")`
• Bar graph: A bar graph can also be used to represent the frequency distribution of a categorical variable.
`# Bar graph using matplotlib`
`gender_counts = data['Gender'].value_counts()`
`plt.bar(gender_counts.index, gender_counts.values)`
`plt.title("Gender Distribution")`
`plt.xlabel("Gender")`
`plt.ylabel("Count")`
`plt.show()`
`# Bar graph using seaborn (the counts are passed directly as x and y)`
`gender_counts = data['Gender'].value_counts()`
`sns.barplot(x=gender_counts.index, y=gender_counts.values)`
• Pie chart: A pie chart is a circular graph divided into slices, where the size of each slice represents the proportion of one category of a variable.
`# Pie chart using matplotlib`
`dependents = data['Dependents'].value_counts()`
`plt.pie(dependents.values, labels=dependents.index, autopct='%1.1f%%')`
`plt.title("Dependents Distribution")`
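The percentages that `autopct` prints on a pie chart can be reproduced with `value_counts(normalize=True)`. This is a small sketch on invented values; the series `dependents` below is hypothetical and only mimics the shape of the Dependents column.

```python
import pandas as pd

# Hypothetical values mimicking the Dependents column.
dependents = pd.Series(["0", "1", "0", "2", "0", "1", "3+", "0"], name="Dependents")

# Absolute frequency of each category (what a count plot or bar graph shows).
counts = dependents.value_counts()

# Relative frequency in percent (what the slices of a pie chart show).
shares = dependents.value_counts(normalize=True) * 100

print(counts)
print(shares)
```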

## Bivariate Analysis

Bivariate means the analysis of two variables. Using bivariate analysis we can examine how the two variables are related. There are three types of bivariate analysis, depending on the types of the variables involved:

• Two numerical variables
• Two categorical variables
• One numerical and one categorical variable

## Bivariate Analysis of Numerical Variables

• Scatterplot: Scatterplot uses dots to represent the relationship between two numeric variables.
`# Scatter plot using matplotlib`
`plt.scatter(data["ApplicantIncome"], data["LoanAmount"])`
`plt.title("Scatter Plot of Applicant Income vs LoanAmount")`
`plt.xlabel("ApplicantIncome")`
`plt.ylabel("LoanAmount")`
`plt.show()`
`# Scatter plot using Seaborn`
`sns.scatterplot(data, x="ApplicantIncome", y="LoanAmount")`
`plt.title("Scatter Plot of Applicant Income vs LoanAmount")`
`plt.xlabel("ApplicantIncome")`
`plt.ylabel("LoanAmount")`
• Joint plot: As the name suggests, a joint plot combines a bivariate graph with the univariate graphs of the two variables. The kind parameter selects the kind of plot to draw (i.e. scatter, hex, hist, kde, reg, resid).
`# Joint plot using Seaborn`
`sns.jointplot(data, x="LoanAmount", y="ApplicantIncome", kind="scatter")`
`plt.suptitle("Joint Plot of Loan Amount & Applicant Income")`
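A scatter plot shows a relationship visually; a correlation coefficient puts a number on its linear part. The sketch below computes the Pearson correlation with pandas' `Series.corr` on a made-up frame (`toy` and its values are hypothetical, only the column names come from the article).

```python
import pandas as pd

# Hypothetical values, invented so that loan amount rises with income.
toy = pd.DataFrame({
    "ApplicantIncome": [2000, 4000, 6000, 8000],
    "LoanAmount": [60, 110, 170, 220],
})

# Pearson correlation coefficient (the default method of Series.corr).
# Values near +1 or -1 indicate a strong linear relationship, values near 0 a weak one.
r = toy["ApplicantIncome"].corr(toy["LoanAmount"])

print(round(r, 4))
```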

## Bivariate Analysis of Categorical Variables

• Count plot: Using the hue parameter we can analyse two categorical variables in a count plot.
`sns.countplot(data, x="Education", hue="Loan_Status")`
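The bar heights of such a grouped count plot correspond to a contingency table, which can be built with `pd.crosstab`. This is a sketch on invented rows (`toy` is a hypothetical stand-in for the dataset).

```python
import pandas as pd

# Hypothetical rows, invented for illustration.
toy = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate", "Not Graduate"],
    "Loan_Status": ["Y", "N", "N", "Y", "Y"],
})

# Rows are Education levels, columns are Loan_Status values,
# and each cell counts the observations with that combination.
table = pd.crosstab(toy["Education"], toy["Loan_Status"])

print(table)
```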

## Bivariate Analysis of Numerical & Categorical Variable

• Bar plot: With a categorical variable on x and a numerical variable on y, a bar plot shows the mean of the numerical variable for each category (the mean is the default estimator).
`sns.barplot(data, x="Gender", y="ApplicantIncome")`
• Box plot: With a categorical variable on x and a numerical variable on y, a box plot draws one box per category, so the distributions of the numerical variable can be compared across categories.
`sns.boxplot(data, x="Self_Employed", y="ApplicantIncome")`
• Violin plot: With a categorical variable on x and a numerical variable on y, a violin plot draws one violin per category.
`sns.violinplot(data, x="Property_Area", y="LoanAmount")`
• Displot: With a numerical variable on x and a categorical variable as hue, displot draws one distribution per category. The kind parameter selects the kind of plot to draw (e.g. hist, kde).
`sns.displot(data, x="ApplicantIncome", hue="Married", kind="kde")`
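The per-category statistics that these plots display can be checked with a `groupby`. The sketch below, on invented values (`toy` is hypothetical), computes the mean, which is what a Seaborn bar plot shows by default, and the median, which is the centre line of a box plot.

```python
import pandas as pd

# Hypothetical values, invented for illustration.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "ApplicantIncome": [4000, 3000, 6000, 5000],
})

# One row per category, with the mean and median of the numerical variable.
by_gender = toy.groupby("Gender")["ApplicantIncome"].agg(["mean", "median"])

print(by_gender)
```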

## Multivariate Analysis

Multivariate means the analysis of more than two variables.

• Bar plot: In a bar plot, the x and y parameters plot the relationship between a categorical and a numerical variable. In addition, the hue parameter splits each bar by a third, categorical variable.
`sns.barplot(data, x="Gender", y="ApplicantIncome", hue="Loan_Status")`
• Pair plot: A pair plot shows the pairwise relationships in a dataset. By default it plots only the numerical variables, but the vars parameter lets us choose which columns to include, so categorical and numerical variables of a data frame can be analysed together.
`# By passing data.columns in the vars parameter,`
`# we can analyse categorical and numerical variables of a data frame`
`sns.pairplot(data, vars=list(data.columns))`
`sns.pairplot(data, vars=['ApplicantIncome', 'Gender', 'Education'])`
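• Heatmap: A heatmap visualises the correlations between pairs of numeric variables. From pandas 2.0 onwards, corr() needs numeric_only=True when the data frame also contains non-numeric columns.
`sns.heatmap(data.corr(numeric_only=True), annot=True)`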
• hist(): hist() is a function of the pandas library. It makes histograms of all numeric variables of a data frame. hist() calls matplotlib.pyplot.hist() on each series of the data frame.
`data.hist(bins=50, figsize=(12, 8))`
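The grouped bar plot above (a categorical x, a numerical y, and a categorical hue) has a tabular counterpart in `pivot_table`. The following is a sketch on invented rows; `toy` and its values are hypothetical, only the column names come from the article.

```python
import pandas as pd

# Hypothetical rows, invented for illustration.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Loan_Status": ["Y", "N", "Y", "N", "Y", "Y"],
    "ApplicantIncome": [5000, 3000, 4000, 2000, 7000, 6000],
})

# Mean income for every combination of Gender (rows) and Loan_Status (columns),
# i.e. the bar heights of a bar plot with x="Gender" and hue="Loan_Status".
pivot = toy.pivot_table(
    index="Gender",
    columns="Loan_Status",
    values="ApplicantIncome",
    aggfunc="mean",
)

print(pivot)
```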

## Conclusion

In this article, we looked at what univariate, bivariate, and multivariate analysis are. We also learnt various ways of plotting data using the Matplotlib and Seaborn libraries. The graphs discussed here are not the only ones available; which graph to use depends on the use case.


Published via Towards AI