The Outlier Story — Leverage and Influential Point in Linear Regression

Last Updated on September 19, 2021 by Editorial Team

Author(s): Supriya Ghosh

Data Science

Image by Will Myers on unsplash

To understand outliers and unusual observations in linear regression, it helps to start with a basic, formal definition of linear regression.

“Linear regression is a linear approach for modelling the relationship between a dependent response variable and one or more independent/explanatory variables. It is considered to be one of the most used algorithms to predict continuous values in Supervised Machine learning techniques”.

Image by Annie Spratt on unsplash

The linear regression equation can be represented as:

Yi = β0 + β1Xi + εi

Where β0 is the intercept term,

β1 is the slope (which is also the regression coefficient) between Y (the dependent/response variable) and X (the independent variable),

εi (pronounced epsilon) is the error term. It captures the part of Yi that the straight line does not explain, such as errors in the measurement of Y.

Yi is the actual (observed) value of Y for the i-th data point.

Ŷi = β0 + β1Xi represents the predicted value of Y. It is the value of Y obtained using the fitted regression line.

Ȳ (Y-bar) represents the mean of the data points of the Y variable, i.e., the response variable.

X̅ (X-bar) represents the mean of the data points of the X variable, i.e., the independent variable.

Residual measures the vertical distance between the actual value of Y and the predicted value of Y from the regression line.

In other words, it measures the vertical distance between the actual data point and the predicted point on the line.
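To make the idea of a residual concrete, here is a minimal Python sketch. The data values and the helper name fit_line are made up for illustration and are not from the original article. The fit uses the standard least-squares formulas for a single X variable.

```python
# Hypothetical toy data, invented for this illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(x, y):
    """Return the least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_line(x, y)

# Predicted values on the fitted line.
y_hat = [b0 + b1 * xi for xi in x]

# Residual = actual Y minus predicted Y (the vertical distance to the line).
residuals = [yi - yh for yi, yh in zip(y, y_hat)]

print("intercept:", round(b0, 3), "slope:", round(b1, 3))
print("residuals:", [round(e, 3) for e in residuals])
```

A point above the line has a positive residual and a point below it has a negative one. For a least-squares fit with an intercept, the residuals add up to zero.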

With this much introduction about Linear regression, let’s move on to Unusual Observations.

Unusual observations in linear regression are generally referred to as outliers.

Image by Author

An outlier is defined as a data point that is very far away from the rest of the data, i.e., an unusual observation with respect to either its x-value or its y-value.

It is generally an observation whose residual is large in magnitude compared to the other observations in the data set. This means its actual Y value is far from the value predicted by the model.

In simple words, data points that are far from the regression line of fit are outliers.

Outliers do not fit the model well and they may or may not have a large effect on the model.

For example, in the picture below, all the points circled in yellow are outliers.

But what about the points circled in green? They are also far away from the other observations in the data set, but are they outliers?

No, they are not outliers in the strict sense.

Why is it so?

Are all outliers problematic?

Do all outliers tend to affect the regression results significantly?

We will understand this further.

Picture 1

In fact, when we do regression modeling, we don’t always care about a few data points being far from the rest of the data. They start to matter when they break the pattern or do not follow the general trend of the rest of the data, i.e., when they change the regression line of fit and its slope (the regression coefficient) to a great extent.

A regression coefficient is the same thing as the slope of the regression line of fit.

To understand problematic outliers let’s understand two more important terms.

1. Leverage Point

2. Influential Point

Let’s define the Leverage point formally.

Leverage is a measure of how far away the X value (independent variable value) of an observation is from those of the other observations. High-leverage points are outliers with respect to the independent variables.

Hence, a leverage point may or may not be an outlier. Its leverage depends only on the x values, not the y values.

A high-leverage point with a small residual generally doesn’t affect the slope because it follows the linear trend of the rest of the data, and it is not considered an outlier.
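As a rough illustration of leverage depending only on the x values, the sketch below computes the leverage of each point for a regression with a single X variable. It uses the standard formula h_i = 1/n + (x_i - X̅)² / Σ(x_j - X̅)². The data, the function name leverages, and the 2p/n cutoff (a common rule of thumb, with p = 2 fitted parameters) are assumptions added here, not part of the original article.

```python
def leverages(x):
    """Leverage of each point in simple linear regression.

    h_i = 1/n + (x_i - x_bar)^2 / sum_j (x_j - x_bar)^2
    Note that y does not appear anywhere in this formula.
    """
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Hypothetical x values; the last one sits far from the others.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
h = leverages(x)

# Rule-of-thumb cutoff (an assumption, not a strict rule): 2 * p / n,
# where p = 2 is the number of fitted parameters (intercept and slope).
cutoff = 2 * 2 / len(x)
high_leverage = [i for i, hi in enumerate(h) if hi > cutoff]

for xi, hi in zip(x, h):
    print(f"x = {xi:5.1f}  leverage = {hi:.3f}")
print("indices flagged as high leverage:", high_leverage)
```

Only the point at x = 20 is flagged, and that holds whatever its y value is. Whether it also changes the fitted line depends on its residual.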

There are two types of leverage points.

1. High leverage point

Properties of high leverage point

a. It can affect the regression line of fit, sometimes drastically if its residual is large.

b. It may or may not have a large residual.

2. Low leverage point

Properties of low leverage point

a. It does not affect the regression line of fit to a great extent.

b. It can still have a large residual, which makes it an outlier with respect to its y-value.

For example, in the picture below, the points circled in green are low leverage points and the points circled in yellow are high leverage points.

Picture 2

Let’s define the Influential point formally.

An influential point is an outlier that greatly affects the slope of the regression line and has a relatively large effect on the regression model’s predictions.

Although an influential point will typically have high leverage, a high leverage point is not necessarily an influential point.
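One common way to put a number on influence is Cook's distance, which combines a point's residual with its leverage. The article does not name this measure, so treat the sketch below as an added illustration: the data, the function name cooks_distances, and the "greater than 1" cutoff (a rule of thumb) are all assumptions.

```python
def cooks_distances(x, y):
    """Cook's distance for each point in simple linear regression.

    D_i = (e_i^2 / (p * s^2)) * h_i / (1 - h_i)^2
    where e_i is the residual, h_i the leverage, p = 2 parameters,
    and s^2 = sum(e_i^2) / (n - p) is the residual variance estimate.
    """
    n, p = len(x), 2
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    lev = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    s2 = sum(e ** 2 for e in resid) / (n - p)
    return [(e ** 2 / (p * s2)) * hi / (1.0 - hi) ** 2
            for e, hi in zip(resid, lev)]


# Hypothetical data: the first five points follow y of roughly 2x,
# the last point has an extreme x and does not follow that trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 15.0]

d = cooks_distances(x, y)
most_influential = d.index(max(d))

for xi, yi, di in zip(x, y, d):
    print(f"({xi:4.1f}, {yi:4.1f})  Cook's distance = {di:.3f}")
print("most influential point index:", most_influential)
```

The raw residual of an influential point can look modest in the fit that includes it, because the point has already pulled the line toward itself. That is why a measure that also accounts for leverage is useful here.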

For example, in the picture below, the points circled in blue are highly influential points.

Picture 3

Let me put it across in a table for clear visualization.

Image by Author

Now I guess it will be clear to you all why the points circled in green in Picture 1 are not outliers.

Although they are far away from the rest of the observations, they are close to the regression line of fit and have low residuals. Hence, they do not affect the slope, the regression coefficients, or the predictions to a significant extent, and they are not labeled as outliers.

Summarizing:

Low Leverage, Large Residual -> Small Influence (Affects the slope of the regression line of fit up to a certain extent) and is an outlier.

High Leverage, Small Residual -> Small Influence (Affects the slope of the regression line of fit to a minimal extent) and is not an outlier.

High Leverage, Large Residual -> Large Influence (Largely affects the slope, the regression line of fit, and further model predictions) and is an outlier.
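The three cases above can be checked with a small experiment: fit a line to well-behaved data, add one extra point of each kind, and compare the slopes. This is only a sketch. The base data, the three added points, and the helper name slope are invented for illustration and do not come from the article's pictures.

```python
def slope(x, y):
    """Least-squares slope of y on x for a single X variable."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical base data that roughly follows y = 2x.
base_x = [float(v) for v in range(1, 11)]
base_y = [2.2, 3.9, 6.1, 8.0, 9.8, 12.1, 14.2, 15.9, 18.1, 19.9]

# One extra point per case (all values are made up):
scenarios = {
    "low leverage, large residual": (6.0, 25.0),    # x near the middle, y far off the line
    "high leverage, small residual": (30.0, 60.0),  # extreme x, but on the trend
    "high leverage, large residual": (30.0, 20.0),  # extreme x and off the trend
}

base_slope = slope(base_x, base_y)
slope_change = {}
for name, (px, py) in scenarios.items():
    new_slope = slope(base_x + [px], base_y + [py])
    slope_change[name] = new_slope - base_slope
    print(f"{name}: slope {base_slope:.3f} -> {new_slope:.3f}")
```

In this toy setup only the third point moves the slope substantially, which matches the summary: high leverage alone or a large residual alone has a small effect, while the combination has a large one.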

We can say that outliers with a large influence affect the slope of the regression line and the regression model’s predictions the most, and they should be handled carefully while developing regression models.

Hope this gives you a clear picture of Outliers and the Leverage and Influential Points in Linear Regression.

Image by Stan B on unsplash

You can follow me on medium as well as

LinkedIn: Supriya Ghosh

And Twitter: @isupriyaghosh


The Outlier Story — Leverage and Influential Point in Linear Regression was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI
