Assumptions of Linear Regression — What Fellow Data Scientists Should Know
Machine Learning

Author(s): Shaurya Lalwani

Photo by Marius Masalar on Unsplash

Linear Regression is a linear approach to modeling the relationship between a target variable and one or more independent variables. This modeled relationship is then used for predictive analytics. Fitting the linear regression model is just half the work. The other half lies in understanding the following assumptions that this technique depends on:

1. Normality of Residuals

For linear regression to work on the given data, it is assumed that the errors (residuals) follow a normal distribution, although this is not strictly required when the sample size is very large. Normality can be checked using the Q-Q Plot (Data Quantiles VS Normal Quantiles), where we plot the quantiles from our data-set against the quantiles from a hypothetical normal distribution. If the residuals are normal, we expect to see an almost straight line.

Test: Jarque-Bera Test, Shapiro-Wilk Test, Residuals Plot

Example: Below we can see a histogram of the Residuals, with the kernel density estimation, which shows us that in this case, the residuals are fairly normal.

Figure 1: KDE Plot of Residuals
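
To make the tests listed above concrete, here is a minimal sketch using SciPy's `jarque_bera` and `shapiro` functions. The data is synthetic and the variable names are assumptions made for illustration; they are not the data-set behind the figures in this post.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a straight line plus Gaussian noise.
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=200)

# Fit a straight line by ordinary least squares and compute the residuals.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Both tests use "the residuals are normally distributed" as the null hypothesis.
jb_stat, jb_pvalue = stats.jarque_bera(residuals)
sw_stat, sw_pvalue = stats.shapiro(residuals)

print(f"Jarque-Bera p-value:  {jb_pvalue:.3f}")
print(f"Shapiro-Wilk p-value: {sw_pvalue:.3f}")
```

A small p-value (commonly below 0.05) would suggest that the residuals deviate from normality; a large one means the test found no evidence against it.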

2. Homoscedasticity

Homoscedasticity describes a situation where the noise/disturbance (the variance of the errors) in the relationship between the independent features and the target variable is the same across all values of the independent variables. We can check this using the Residuals VS Predicted Values Scatterplot. We should not see a pattern on this scatterplot, and all the data should be randomly distributed. This verifies the homoscedasticity.

Test: Goldfeld-Quandt Test, Residuals VS Fitted Plot

Example: Below we can see the residuals vs fitted plot, and another one with scaled residual values. The second plot is there to show that rescaling can change how the scatter looks, but the randomness will still be present in homoscedastic data, which is evident from both plots.

Figure 2: Residuals VS Fitted Plot
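
The idea behind the Goldfeld-Quandt test can be sketched in a few lines: sort the data by a predictor, fit a regression on the lower and upper halves separately, and compare the two residual variances with an F-test. The function below is a simplified, hand-rolled version written for illustration (the helper names and synthetic data are assumptions); in practice, statsmodels offers `het_goldfeldquandt` in `statsmodels.stats.diagnostic`.

```python
import numpy as np
from scipy import stats


def goldfeld_quandt(x, y):
    """Return (F statistic, p-value) for H0: both halves have equal residual variance."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    half = len(x) // 2

    def residual_variance(xs, ys):
        # Ordinary least squares with an intercept on one half of the data.
        X = np.column_stack([np.ones_like(xs), xs])
        beta, *_ = np.linalg.lstsq(X, ys, rcond=None)
        resid = ys - X @ beta
        dof = len(xs) - X.shape[1]
        return resid @ resid / dof, dof

    var_low, dof_low = residual_variance(x[:half], y[:half])
    var_high, dof_high = residual_variance(x[half:], y[half:])
    f_stat = var_high / var_low
    # One-sided alternative: the variance increases with x.
    p_value = stats.f.sf(f_stat, dof_high, dof_low)
    return f_stat, p_value


rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=400)
y_homo = 1.0 + 2.0 * x + rng.normal(0, 1.0, size=400)        # constant noise
y_hetero = 1.0 + 2.0 * x + rng.normal(0, 1.0, size=400) * x  # noise grows with x

f_homo, p_homo = goldfeld_quandt(x, y_homo)
f_hetero, p_hetero = goldfeld_quandt(x, y_hetero)
print(f"homoscedastic:   F = {f_homo:.2f}, p = {p_homo:.3f}")
print(f"heteroscedastic: F = {f_hetero:.2f}, p = {p_hetero:.3g}")
```

An F statistic close to 1 is what we hope for; a large F with a small p-value indicates that the spread of the residuals changes across the range of the predictor.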

3. Linearity of Residuals

Residuals are the error terms obtained by calculating the difference between the predicted target value and the observed target value. Linearity holds when the predictor variables have a straight-line relationship with the target variable. This is generally not a concern if the residuals are normally distributed and homoscedastic.

Test: Rainbow Test, Probability Plot

Note: When looking at the equation, linearity is not to be judged from the power of the features/variables from the data-set, but from the power of the Beta parameters.

Example: Y = a + (β1*X1) + (β2*X2²)

In the above example, X2 has a power of 2, which means that a variable from our data-set has a power of 2, but none of the beta parameters (coefficients obtained on performing regression) have a power other than 1. The model is therefore still linear in its parameters, which is the linearity that matters here.
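
To see why a squared feature does not break linearity, here is a small sketch that fits Y = a + (β1*X1) + (β2*X2²) with ordinary least squares. The "true" coefficient values and the synthetic data are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data generated from a = 1.5, β1 = 2.0, β2 = 0.5 (assumed values).
x1 = rng.uniform(-3, 3, size=300)
x2 = rng.uniform(-3, 3, size=300)
y = 1.5 + 2.0 * x1 + 0.5 * x2**2 + rng.normal(0, 0.1, size=300)

# The squared term is just another column of the design matrix,
# so the problem is still linear in a, β1 and β2.
X = np.column_stack([np.ones_like(x1), x1, x2**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a_hat, b1_hat, b2_hat = coef

print(f"a = {a_hat:.2f}, β1 = {b1_hat:.2f}, β2 = {b2_hat:.2f}")
```

Plain least squares recovers coefficients close to the assumed ones, because squaring X2 happens before the fit and leaves the parameters themselves at power 1.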

Below we can see the Probability plot, i.e. Observed VS Theoretical Quantiles of the Normal Distribution, to check for linearity (which holds fairly well for the lower data points in the data below). Remember that your data won’t be perfectly linear, but it has to tend to linearity.

Figure 3: Observed VS Theoretical Quantiles
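
A probability plot can also be summarized numerically. The sketch below uses SciPy's `probplot`, which returns the theoretical and ordered observed quantiles together with the correlation coefficient `r` of the straight-line fit through them. The two samples are synthetic stand-ins for residuals, assumed here purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Stand-ins for residuals (assumptions): one normal sample, one skewed sample.
normal_resid = rng.normal(0, 1, size=300)
skewed_resid = rng.exponential(1.0, size=300)

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r)).
(theo_q, ordered), (slope, intercept, r_normal) = stats.probplot(normal_resid, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_resid, dist="norm")

print(f"r for normal sample: {r_normal:.4f}")
print(f"r for skewed sample: {r_skewed:.4f}")
```

The closer `r` is to 1, the closer the points lie to a straight line; the skewed sample gives a visibly lower value than the normal one.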

4. No-Multicollinearity

Multi-collinearity is a state of very high inter-correlations or inter-association among independent variables. This disturbance weakens the statistical power of the regression model, which is why low or no multi-collinearity is desirable.

Test: Variance Inflation Factor (VIF), Correlation Matrix/Heatmap

Example: The Correlation Heatmap below shows the correlation among the independent variables, as well as the correlation of those independent variables with the target (which is Price in this case). To check for Multi-Collinearity, we don’t need to look at how the independent variables relate to the target, so we can ignore Price here. For clarity, let’s check some of the relations:

  1. INDUS has a high negative correlation with DIS (As INDUS increases, DIS decreases)
  2. INDUS has a high positive correlation with TAX (As INDUS increases, TAX increases)

So, when the variables are highly correlated, it may be necessary to remove some of them. Otherwise, the coefficient estimates become unstable and the model may overfit, because these variables essentially provide the same information to the model.

Figure 4: Correlation Heatmap
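
The Variance Inflation Factor mentioned above can be computed by regressing each feature on all the others and taking 1 / (1 - R²). Below is a minimal NumPy sketch of that definition; the `vif` helper and the synthetic columns are assumptions for illustration, and statsmodels provides a ready-made `variance_inflation_factor` in `statsmodels.stats.outliers_influence`.

```python
import numpy as np


def vif(X):
    """Variance inflation factor of each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress feature j on the remaining features (with an intercept).
        A = np.column_stack([np.ones(len(target)), others])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        r_squared = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)


rng = np.random.default_rng(3)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)  # nearly a copy of a
c = rng.normal(size=500)                 # unrelated to a and b

vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

The two nearly identical columns get very large factors, while the unrelated column stays close to 1. A common rule of thumb treats values above 5 or 10 as a sign of problematic multi-collinearity, though the exact cut-off is a judgment call.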

5. No-Autocorrelation

Auto-correlation occurs when residual errors are dependent on each other, and this ultimately reduces the model’s accuracy. A correlogram (also called an Auto Correlation Function (ACF) Plot or Autocorrelation plot) is a visual way to show the serial correlation in data that changes over time. This usually occurs in Time-Series models, where the next instant is dependent on the previous instant. Put simply, Autocorrelation means that an error at one point in time carries over to a subsequent point in time. For example, you might overestimate the cost of customer acquisition for the first month, leading to an overestimate of that cost for the following months.

Test: Durbin-Watson Test

Example: The Correlogram below shows the correlation coefficient on the Y-Axis and the time lag of that correlation on the X-Axis. We see that the correlation value is high for only a few lags, and there is no upward or downward pattern along the X-Axis, so serial correlation can be ruled out.

Figure 5: Auto Correlation Function (ACF) Plot
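
The Durbin-Watson statistic is simple enough to compute by hand: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below applies it to two synthetic residual series, which are assumptions for illustration; statsmodels also ships a `durbin_watson` function in `statsmodels.stats.stattools`.

```python
import numpy as np


def durbin_watson(residuals):
    """Durbin-Watson statistic: about 2 means no first-order autocorrelation."""
    residuals = np.asarray(residuals, dtype=float)
    diff = np.diff(residuals)
    return diff @ diff / (residuals @ residuals)


rng = np.random.default_rng(4)
white = rng.normal(size=1000)  # independent residuals

# Residuals where each value carries over 80% of the previous one (AR(1) process).
ar1 = np.zeros(1000)
for t in range(1, 1000):
    ar1[t] = 0.8 * ar1[t - 1] + white[t]

dw_white = durbin_watson(white)
dw_ar1 = durbin_watson(ar1)
print(f"independent residuals:    DW = {dw_white:.2f}")
print(f"autocorrelated residuals: DW = {dw_ar1:.2f}")
```

The statistic ranges from 0 to 4. Values near 2 suggest no first-order autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.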

Summary

The above-mentioned assumptions must be satisfied before proceeding with any linear regression problem. However, there may be certain exceptions at times, such as the one stated in “Linearity of Residuals”. For the other assumptions as well, we do not require perfection, but the outcome shouldn’t be very different from what we have assumed. So, if Autocorrelation is low, or the Linearity of Residuals is tending to linearity, then that is acceptable up to a limit. Plots help us visualize how well the data maintains the assumptions we have planned to stick to, and the tests confirm it.

Thoroughly checked all of the above? Now, we can proceed with finding the best-fit regression line!


Assumptions of Linear Regression — What Fellow Data Scientists Should Know was originally published in Towards AI — Multidisciplinary Science Journal on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI
