How do I Verify the Assumptions of Linear Regression?

Last Updated on July 26, 2023 by Editorial Team

Author(s): Gowtham S R

Originally published on Towards AI.

What are the assumptions of linear regression? and how to verify them with python?

Photo from Unsplash uploaded by Thong Vo

Linear regression is a model that estimates the relationship between independent variables and a dependent variable using a straight line. However, in order to use a linear regression model, we have to verify a few assumptions.

The 5 main assumptions of linear regression are,

A linear relationship between dependent and independent variables.
No/Very less multicollinearity.
Normality of Residuals
Homoscedasticity
No Autocorrelation of Errors

Let's understand each of the above assumptions in detail with the help of python code.

Simple ways to write Complex Patterns in Python in just 4mins.

Easy way to write complex pattern programs in python

medium.com

Import the required libraries, and read the dataset.

Separate the dependent and independent features, and split the data into train and test sets as shown below.

Create a linear regression model and calculate the residuals.

Let us verify the assumptions of linear regression for the above data.

A Complete End-to-End Machine Learning Based Recommendation Project

A machine learning recommendation project based on collaborative filtering and popularity-based filtering

pub.towardsai.net

1. Linear Relationship

In order to perform a linear regression, the first and foremost assumption is to have a linear relationship between the independent and the dependent features. Means — As the value of the X increases, the value of y should also increase or decrease linearly. If there are multiple independent features, each of the independent features should have a linear relationship with the dependent feature.

We can verify this assumption using a scatter plot as shown below.

In the above scatter plots we can clearly say that features 1 and 3 are having a clear linear relationship with the target. However, feature 2 is not having a linear relationship with the target.

2. Multicollinearity

Multicollinearity is a scenario in which two of the independent features are highly correlated. So, now the question is, what is correlation? Correlation is the scenario in which two variables are strongly related to each other.

Eg, If we have a dataset where age and years_of_experience are the two independent features in our dataset. It is highly possible that as age increases, years_of_experience also increase. So, in this case, age and years of experience are highly positively correlated.

If we have age and years_left_to_retire as independent features, then as age increases, the years_left_to_retire decreases. So, here we say that the two features are highly negatively correlated.

If we have any one of the above scenarios (strong positive correlation or negative correlation), then we say that there is multicollinearity.

We can verify if there is any multicollinearity in our data, using a correlation matrix or VIF as shown in the below figure.

From the above VIF and correlation matrix, we can say that there is no multicollinearity in our dataset.

If you are interested in understanding multicollinearity in detail, please read my blog on why multicollinearity is a problem

Why multicollinearity is a problem?

What is multicollinearity? and why should we take care of multicollinearity before creating a machine learning model?

medium.com

3. Normality of Residuals

Residual = actual y value − predicted y value. Having a negative residual means that the predicted value is too high, similarly, if you have a positive residual, it means that the predicted value was too low. The aim of a regression line is to minimize the sum of residuals.

The assumption says that if we plot the residual, then the plot should be normal or sort of normal.

We can verify this assumption with the help of the KDE plot and Q-Q plot, as shown below.

4. Homoscedasticity

Homo means same and scedasticity means scatter/spread. So, the meaning of homoscedasticity is having same scatter. It means the condition in which the variance of the residual, or error term, in a regression model is constant.

When we plot the residuals, the spread should be equal. We can check this by using a scatter plot, where the x-axis will have the predictions, and the y-axis will have the residuals, as shown in the below figure.

The residuals are spread uniformly, which holds the assumption of homoscedasticity.

5. No autocorrelation of errors

This assumption says that there should not be any relationship between the residuals. This can be verified by plotting the residuals as shown in the below figure. The plot should not result in any particular patterns.

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Frequently Used, Contextual References

Resources

Publication

How do I Verify the Assumptions of Linear Regression?

Author(s): Gowtham S R

What are the assumptions of linear regression? and how to verify them with python?

Simple ways to write Complex Patterns in Python in just 4mins.

Easy way to write complex pattern programs in python

A Complete End-to-End Machine Learning Based Recommendation Project

A machine learning recommendation project based on collaborative filtering and popularity-based filtering

1. Linear Relationship

2. Multicollinearity

Why multicollinearity is a problem?

What is multicollinearity? and why should we take care of multicollinearity before creating a machine learning model?

3. Normality of Residuals

4. Homoscedasticity

5. No autocorrelation of errors

Anything that can go wrong will go wrong.

List of some of Murphy’s laws that we come across in our daily life

Simple ways to write Complex Patterns in Python in just 4mins.

Easy way to write complex pattern programs in python

Which Feature Scaling Technique To Use- Standardization vs Normalization

Is feature scaling mandatory? when to use standardization? when to use normalization? what will happen to the…

Feedback ↓ Cancel reply

Popular posts

Best Laptops for Deep Learning, Machine Learning (ML), and Data Science for 2023

Best Workstations for Deep Learning, Data Science, and Machine Learning (ML) for 2022

Descriptive Statistics for Data-driven Decision Making with Python

Best Machine Learning (ML) Books - Free and Paid - Editorial Recommendations for 2022

Best Data Science Books - Free and Paid - Editorial Recommendations for 2022

Updates

Recent Posts

AI Safety on a Budget: Your Guide to Free, Open-Source Tools for Implementing Safer LLMs

AI Safety on a Budget: Your Guide to Free, Open-Source Tools for Implementing Safer LLMs

AI Safety on a Budget: Your Guide to Free, Open-Source Tools for Implementing Safer LLMs

You Can Now Call ChatGPT From Your Phone

You Can Now Call ChatGPT From Your Phone

The World’s Leading AI and Technology Publication.

Company

CONTACT US

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Publication

How do I Verify the Assumptions of Linear Regression?

Author(s): Gowtham S R

What are the assumptions of linear regression? and how to verify them with python?

Simple ways to write Complex Patterns in Python in just 4mins.

Easy way to write complex pattern programs in python

A Complete End-to-End Machine Learning Based Recommendation Project

A machine learning recommendation project based on collaborative filtering and popularity-based filtering

1. Linear Relationship

2. Multicollinearity

Why multicollinearity is a problem?

What is multicollinearity? and why should we take care of multicollinearity before creating a machine learning model?

3. Normality of Residuals

4. Homoscedasticity

5. No autocorrelation of errors

Anything that can go wrong will go wrong.

List of some of Murphy’s laws that we come across in our daily life

Simple ways to write Complex Patterns in Python in just 4mins.

Easy way to write complex pattern programs in python

Which Feature Scaling Technique To Use- Standardization vs Normalization

Is feature scaling mandatory? when to use standardization? when to use normalization? what will happen to the…

Related posts

Feedback ↓ Cancel reply

Popular posts

Updates

Recent Posts

The World’s Leading AI and Technology Publication.

Company

CONTACT US

GDPR CCPA Statement