Bias & Variance in Machine Learning
Last Updated on July 28, 2020 by Editorial Team
Author(s): Shaurya Lalwani
Linear regression is a machine learning algorithm that predicts a quantitative target from independent variables modeled in a linear manner, fitting a line or a plane (or hyperplane) that contains the predicted data points. For simplicity, let's think of this as the best-fit line. Points from the training data usually don't lie exactly on the best-fit line, and that makes perfect sense because real data is never perfect. That is why we fit a model to the overall trend in the first place, instead of just plotting an arbitrary line.
The linear regression line cannot curve in order to pass through all the training data points, and so at times it cannot capture the relationship accurately. This shortfall is called bias. In mathematical terms, the intercept in the linear regression equation is often referred to as the bias term.
Why do I say that?
Let me explain. Here's an example linear regression equation:
y = Intercept + Slope1*x1 + Slope2*x2
The target (y) has known values in the dataset, and the equation above calculates the predicted values for it. If the intercept alone is large enough to land close to the y values, then the other two terms of the equation, which carry the independent variables x1 and x2, contribute little to the prediction. The amount of variance in y explained by x1 and x2 is then small, and the result is an underfitting model. An underfitting model has a low R-squared (the proportion of variance in the target that is explained by the independent variables).
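To make the R-squared argument concrete, here is a minimal sketch in plain Python. The toy data, the coefficient values, and the helper names `predict` and `r_squared` are assumptions made up for illustration; they are not from any particular library. It compares a model whose intercept does all the work with one whose slopes carry the relationship.

```python
def predict(intercept, slopes, rows):
    """Apply y = intercept + slope1*x1 + slope2*x2 to each (x1, x2) row."""
    return [intercept + sum(s * x for s, x in zip(slopes, row)) for row in rows]


def r_squared(actual, predicted):
    """Share of the variance in `actual` that the predictions explain."""
    mean_y = sum(actual) / len(actual)
    ss_total = sum((y - mean_y) ** 2 for y in actual)
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total


# Hypothetical toy data generated from y = 1 + 2*x1 + 3*x2 (no noise).
rows = [(1, 2), (2, 1), (3, 4), (4, 3)]
y = [9, 8, 19, 18]

# Model A: the intercept does all the work (it is simply the mean of y).
intercept_only = predict(sum(y) / len(y), (0, 0), rows)
# Model B: the slopes carry the relationship between x1, x2 and y.
with_slopes = predict(1, (2, 3), rows)

r2_intercept_only = r_squared(y, intercept_only)
r2_with_slopes = r_squared(y, with_slopes)
print(r2_intercept_only, r2_with_slopes)
```

On this noise-free toy data the intercept-only model explains none of the variance in y, while the model that uses x1 and x2 explains all of it. Real data would fall somewhere in between.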
Underfitting can also be understood by thinking about how the best-fit line/plane is obtained in the first place. The best-fit line/plane captures the relationship between the target and the independent variables. If this relationship is captured to a very high extent, bias is low, and vice versa.
Now that we understand what bias is, and how high bias causes an underfitting model, it becomes clear that a robust model needs to reduce this underfitting.
If we instead create a curve that passes through all the data points and reproduces the existing relationship between the independent variables and the dependent variable, there would be no bias in the model.
However, a model that has overfitted the train data runs into a new problem called "variance". Time to consider two models:
Model1: High Bias (Unable to capture the relationship properly)
Model2: Low Bias (Captures relationship to a very high extent)
Error measurement while validating a model:
Error = Actual Values - Predicted Values
On calculating the errors on the training data (test data is not in the picture yet), we observe the following:
Model1: Validation of model on train data shows that errors are high
Model2: Validation of model on train data shows that errors are low
Now, let's bring in the test data, and understand variance.
If the model has overfitted the train data, it "understands" and "knows" that data so well that it may struggle with the test data, and fail to capture the relationship when test data is used as input. In broader terms, there will be a large difference in fit between the train data and the test data (the fit on train data looks near-perfect, while the fit on test data is poor). This difference in fit is referred to as "variance", and it usually arises when the model has learned only the train data and struggles with any new input given to it.
On validating the above models on test data, we notice this:
Model1: The relationship isn't captured correctly here either, but there isn't a huge gap of understanding between the train and test data, so the variance is low
Model2: There is a huge gap of understanding between the train and test data, so the variance is high
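The comparison of Model1 and Model2 can be sketched in plain Python. Everything here is an assumption made for illustration: the toy data (y = x² plus noise), the choice of a straight line as the high-bias model, and a 1-nearest-neighbour "memorizer" standing in for the model that passes through every training point. The error is summarized as the mean of the squared (actual - predicted) values.

```python
def fit_line(xs, ys):
    """Least-squares straight line; returns a function x -> predicted y."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """1-nearest-neighbour: returns the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error: average of (actual - predicted)^2."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: y = x^2 plus noise of +/-2. The test set is a fresh
# sample at the same x values with the noise signs flipped.
x = [0, 1, 2, 3, 4, 5]
y_train = [2, -1, 6, 7, 18, 23]
y_test = [-2, 3, 2, 11, 14, 27]

results = {}
for name, fit in [("Model1 (line)", fit_line), ("Model2 (memorizer)", fit_memorizer)]:
    model = fit(x, y_train)
    train_err = mse(model, x, y_train)
    test_err = mse(model, x, y_test)
    results[name] = (train_err, test_err, test_err - train_err)
    print(f"{name}: train={train_err:.2f} test={test_err:.2f} gap={test_err - train_err:.2f}")
```

On this toy data the line has a sizeable error on both sets but only a small gap between them, while the memorizer has zero train error and a much larger train-to-test gap, which is the pattern described for Model1 and Model2.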
The Trade-Off between Bias & Variance
Now we understand that both bias and variance can cause problems in our prediction model. So, how do we go about solving this issue?
A couple of terms to understand before we proceed:
Overfit: Low Bias & High Variance. The model fits the train data very well, but struggles with test data because it has learned only the train data
Underfit: High Bias & Low Variance. The model is unable to capture the relationship even on train data. Since it hasn't captured the relationship anyway, there isn't much of a gap of understanding between train and test data, so variance is low
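For readers who want the formal counterpart of these two terms, the standard decomposition of expected squared error is sketched below. The notation is an assumption introduced here and is not used elsewhere in this article: the data is taken to follow y = f(x) + ε with zero-mean noise of variance σ², f̂ is the fitted model, and the expectation runs over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

An overfit model keeps the first term small and inflates the second, while an underfit model does the opposite. The train-to-test gap described above is an informal, practical symptom of the variance term.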
Coming back to the solution, we can do the following to try to build a trade-off between the bias and variance being caused:
Usually, a model is built on train data and evaluated on that same data, but there's an alternative that people prefer: holding out a part of the train data and testing the model on it. This held-out part is called the validation data.
So, what is Cross-Validation?
As mentioned, model validation is done on a part of the train data. If we keep choosing a new set of data points from the train data for validation in each iteration, and average the results obtained from these sets, we are doing cross-validation. This gives a more reliable picture of the model's behavior than a single split, and it is a way to check whether the model is overfitting.
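The idea of rotating the validation set and averaging the results can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: the data is a made-up noisy line, the model is a simple least-squares line, and the helper names (`fit_line`, `k_fold_indices`, `cross_validate`) and the interleaved fold assignment are choices made for this example only.

```python
def fit_line(xs, ys):
    """Least-squares straight line; returns a function x -> predicted y."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds (index i goes to fold i % k)."""
    return [list(range(fold, n, k)) for fold in range(k)]


def cross_validate(xs, ys, k):
    """Return the validation MSE of each fold and their average."""
    scores = []
    for val_idx in k_fold_indices(len(xs), k):
        held_out = set(val_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Fit only on the points that are NOT in the current validation fold.
        model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(ys[i] - model(xs[i])) ** 2 for i in val_idx]
        scores.append(sum(errors) / len(errors))
    return scores, sum(scores) / len(scores)


# Hypothetical data: roughly y = 2x + 1 with small fixed noise.
xs = list(range(10))
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.0, 0.2]
ys = [2 * x + 1 + e for x, e in zip(xs, noise)]

fold_scores, average_mse = cross_validate(xs, ys, k=5)
print(fold_scores, average_mse)
```

Every data point is used for validation exactly once, and the averaged score is less dependent on one lucky or unlucky split than a single hold-out set would be.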
Types of Cross-Validation:
K-Fold CV: K is the number of subsets (folds) that we split the train set into. Each of the K folds is used once for model validation while the model is trained on the remaining folds, and the results obtained from these K folds are averaged to give a final result, which helps to detect overfitting.
Leave-One-Out CV: Leave-One-Out CV works much like K-Fold CV, but it takes the process to the extreme: every single data point in the train data is used once as its own validation set. This is obviously time-consuming, but it uses the data as fully as possible when checking for overfitting.
Forward Chaining: With time-series data, K-Fold CV and Leave-One-Out CV can create a problem. Some years may have a pattern that other years don't have, so using random sets of data for cross-validation would not make sense, and existing trends could go unnoticed, which is not what we want. In this kind of case a forward-chaining method is usually used. Each fold's train set is formed by adding the next year's data to the previous train set, and the model is validated on a test set that contains only the year immediately following the latest year in the train set.
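The last two schemes differ from K-Fold only in how the splits are formed, which a short plain-Python sketch can show. The function names and the list of years are hypothetical and exist only for this illustration.

```python
def leave_one_out_splits(n):
    """Each data point is the validation set exactly once."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]


def forward_chaining_splits(periods):
    """Train on every period up to a cut-off, validate on the next one."""
    for cut in range(1, len(periods)):
        yield periods[:cut], periods[cut]


years = [2015, 2016, 2017, 2018, 2019]  # hypothetical yearly data

loo = list(leave_one_out_splits(4))
chained = list(forward_chaining_splits(years))

for train_idx, val_idx in loo:
    print(f"LOO: train on rows {train_idx} -> validate on row {val_idx}")
for train_years, test_year in chained:
    print(f"Forward chaining: train on {train_years} -> validate on {test_year}")
```

In the forward-chaining splits the validation year always comes after every training year, so the model is never validated on the past using information from the future.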
Regularization is a technique that helps in reducing the variance, usually at the cost of a small increase in bias, by penalizing the beta coefficients attached to our model's independent variables.
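As a rough illustration of what "penalizing the beta coefficients" does, here is a minimal plain-Python sketch of an L2 (ridge-style) penalty on a single slope. The data, the penalty values, and the function name `ridge_slope` are assumptions for this example, and only one of several regularization types is shown.

```python
def ridge_slope(xs, ys, penalty):
    """Slope b minimising sum((y - a - b*x)^2) + penalty * b^2.

    The intercept a is left unpenalised, which gives the closed form
    b = Sxy / (Sxx + penalty).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / (sxx + penalty)


# Hypothetical toy data.
xs = [0, 1, 2, 3, 4, 5]
ys = [2, -1, 6, 7, 18, 23]

penalties = [0, 1, 10, 100]
slopes = [ridge_slope(xs, ys, p) for p in penalties]
for p, b in zip(penalties, slopes):
    print(f"penalty={p:>3}: slope={b:.3f}")
```

With a penalty of 0 the result is the ordinary least-squares slope. As the penalty grows, the slope shrinks toward zero, so the model reacts less to the quirks of one particular training sample.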
I’ve written a whole article on “Feature Selection in Machine Learning”, where I have described Regularization and its types in much more depth. Feel free to check it out here:
Feature Selection in Machine Learning
There is no perfect model. It has to be made better by using its imperfections in a positive manner. Once you are able to identify that bias or variance exists in your model, there are a ton of things you can do to change that. You may try feature selection and feature transformation as well. You may try removing some variables that cause overfitting. Based on what is possible at that moment, the decision can be made, and the model can be improved wherever there is room for improvement.
Thank you for reading! Happy learning!
Bias & Variance in Machine Learning was originally published in Towards AI — Multidisciplinary Science Journal on Medium, where people are continuing the conversation by highlighting and responding to this story.