Bias & Variance in Machine Learning
Last Updated on July 28, 2020 by Editorial Team
Author(s): Shaurya Lalwani
Linear Regression is a machine learning algorithm used to predict a quantitative target with the help of independent variables, modeled in a linear manner to fit a line (or a plane, or hyperplane) through the predicted data points. For a moment, let's consider this to be the best-fit line (for easier understanding). Usually, points from the training data don't lie exactly on the best-fit line, and that makes perfect sense, because no data is perfect. That is why we are making predictions in the first place, and not just plotting a random line.
Understanding Bias
The linear regression line cannot be curved in order to include all the training set data points, and hence it is unable to capture an accurate relationship at times. This is called bias. In mathematical terms, the intercept obtained in the linear regression equation is the bias.
Why do I say that?
Let me explain. Here's a random Linear Regression equation:
y = Intercept + Slope1*x1 + Slope2*x2
The target (y) has some values in the data set, and the above equation calculates the predicted values for it. If the 'Intercept' itself is very high and reaches close to the predicted y values, then the changes in y caused by the other two parts of our equation, the independent variables (x1 and x2), would be small. This means that the amount of variance explained by x1 and x2 would be low, which would eventually lead to an underfitting model. An underfitting model has a low R-squared (the proportion of variance in the target explained by the independent variables).
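To make this concrete, here is a minimal sketch (using NumPy and scikit-learn, with made-up data) of a straight line underfitting a nonlinear target; the low R-squared is the signature of high bias:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: the target follows a sine wave, so a straight
# line (high bias) cannot capture the relationship well.
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 50).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.3, size=50)

model = LinearRegression().fit(x, y)
r2 = model.score(x, y)  # R-squared: variance in y explained by x
print(f"Intercept: {model.intercept_:.2f}, R-squared: {r2:.3f}")
```

No matter how the line is placed, it cannot follow the curve, so a large share of the variance in y goes unexplained.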
Underfitting can also be understood by thinking about how the best-fit line/plane is found in the first place. The best-fit line/plane captures the relationship between the target and the independent variables. If this relationship is captured to a high extent, the bias is low, and vice versa.
Now that we understand what bias is, and how a high bias causes an underfitting model, it becomes clear that for a robust model, we need to remove this underfit.
In a scenario where we create a curve that passes through all data points and showcases the existing relationship between the independent variables and the dependent variable, there would be no bias in the model.
Understanding Variance
A model that has overfitted on train data will exhibit a phenomenon called 'variance'. Time to consider a few models:
Model1: High Bias (Unable to capture the relationship properly)
Model2: Low Bias (Captures relationship to a very high extent)
Error measurement while validating a model:
Error = Actual Values - Predicted Values
On calculating the errors on the training data (test data is not in the picture yet), we observe the following:
Model1: Validation of the model on train data shows that errors are high
Model2: Validation of the model on train data shows that errors are low
Now, let's bring in the test data and understand variance.
So, if the model has overfitted on train data, it 'understands' and 'knows' the train data to such a high extent that it is likely to struggle with the test data, failing to capture the relationship when test data is given as input. In broader terms, this means there will be a large difference of fit between the train data and the test data (the train data shows near-perfect validation, while the test data does not). This difference of fit is referred to as 'variance', and it is usually caused when the model understands only the train data and struggles with any new input given to it.
On validating the above models on test data, we notice this:
Model1: The relationship isn't captured correctly here either, but there isn't a huge gap of understanding between the train and test data, so the variance is low
Model2: There is a huge gap of understanding between the train and test data, so the variance is high
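The two models above can be sketched in code (a hypothetical example using scikit-learn, where a degree-1 fit stands in for Model1 and a degree-12 polynomial stands in for Model2):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up nonlinear data with a little noise.
rng = np.random.default_rng(42)
x = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, size=30)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)

errors = {}
for name, degree in [("Model1 (high bias)", 1), ("Model2 (low bias)", 12)]:
    m = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    m.fit(x_tr, y_tr)
    train_err = mean_squared_error(y_tr, m.predict(x_tr))
    test_err = mean_squared_error(y_te, m.predict(x_te))
    errors[degree] = (train_err, test_err)
    print(f"{name}: train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```

Model2 drives its train error close to zero, but the gap between its train and test errors (the variance) is much larger than Model1's.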
The Trade-Off between Bias & Variance
Now we understand that both bias and variance can cause problems in our prediction model. So, how do we go about solving this issue?
A couple of terms to understand before we proceed:
Overfit: Low Bias & High Variability. The model fits great on train data but struggles with test data, because it understands only the train data well.
Underfit: High Bias & Low Variability. The model is unable to capture the relationship in the train data, but since it hasn't captured the relationship anyway, there isn't much of a gap of understanding between train and test data, so variance is low.
Coming back to the solution, we can do the following to strike a trade-off between bias and variance:
1. Cross-Validation
Usually, a model is built on train data and tested on the same, but there's one more thing people prefer: testing the model on a held-out part of the train data, which is called the validation data.
So, what is Cross-Validation?
As mentioned, model validation is done on a part of the train data. If we keep choosing a new set of data points from the train data for validation in each iteration, and average the results obtained from these sets, we are doing cross-validation. This is an optimized method of understanding the model's behavior on the train data, and a way to check whether an overfit is present.
Types of Cross-Validation:
K-Fold CV: K here represents the number of sets we break our train set into. Each of these K sets is used for model validation in turn, and the results obtained from these K sets are averaged to give a final result, which helps avoid overfitting.
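As a sketch, K-Fold CV with K = 5 might look like this in scikit-learn (the data here is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Made-up data with a known linear relationship plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, size=100)

# Break the data into K = 5 folds; each fold serves once as the
# validation set while the model trains on the remaining four.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kfold, scoring="r2")
print("Per-fold R-squared:", np.round(scores, 3))
print("Average R-squared: ", round(scores.mean(), 3))
```

The averaged score is a more reliable read on how the model behaves on unseen data than a single train/validation split.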
Leave-One-Out CV: The working technique of Leave-One-Out CV is similar to that of K-Fold CV, but it takes the process to a new level, since it calculates the cross-validation results using each and every data point in the train data. This is obviously time-consuming, but it definitely helps in avoiding overfitting.
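The same idea can be sketched with Leave-One-Out CV (again with made-up data): each of the n points is held out once, so n models are fit in total:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 2))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=20)

# One fold per data point: train on n-1 points, validate on the one left out.
scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
print("Number of folds:", len(scores))
print("Average MSE across folds:", round(-scores.mean(), 4))
```

With 20 points this is cheap, but on a large data set fitting one model per point quickly becomes expensive, which is the time cost mentioned above.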
Forward Chaining: While working with time-series data, K-Fold CV and Leave-One-Out CV can create a problem: it is very possible that some years have a pattern that other years don't, so using random sets of data for cross-validation would not make sense. In fact, the existing trends could go unnoticed, which is not what we want. So, in this kind of case, a forward-chaining method is usually used, wherein each fold (for cross-validation) contains a train set created by adding the data of one more consecutive year to the previous train set, and the model is validated on a test set containing only the year that immediately follows the latest year in the train set.
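scikit-learn's TimeSeriesSplit implements this forward-chaining idea; here is a minimal sketch on 12 hypothetical consecutive yearly observations:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

years = np.arange(2009, 2021)  # 12 hypothetical consecutive years

# Each successive fold extends the train set forward in time and
# validates on the years that immediately follow it.
folds = list(TimeSeriesSplit(n_splits=3).split(years))
for train_idx, test_idx in folds:
    print("train:", years[train_idx], "-> validate on:", years[test_idx])
```

Unlike K-Fold, the validation years always come after the training years, so no fold ever "peeks" at the future.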
2. Regularization
Regularization is a technique that helps reduce both bias and variance by penalizing the beta coefficients attached to our model's independent variables.
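As a sketch, ridge regression (one common form of regularization; the data here is hypothetical) shrinks the beta coefficients relative to a plain least-squares fit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Made-up data: 10 features, but only the first one actually matters.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 10))
y = X[:, 0] + rng.normal(0, 1.0, size=30)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha controls the penalty strength
print("Sum of |coefficients|, OLS:  ", round(np.abs(ols.coef_).sum(), 3))
print("Sum of |coefficients|, Ridge:", round(np.abs(ridge.coef_).sum(), 3))
```

By penalizing large coefficients, ridge keeps the model from leaning too heavily on noise in the irrelevant features, which is how regularization curbs overfitting.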
I've written a whole article on 'Feature Selection in Machine Learning', where I have described Regularization and its types in much more depth. Feel free to check it out here:
Feature Selection in MachineΒ Learning
Conclusion
There is no perfect model; it has to be made perfect by using its imperfections in a positive manner. Once you are able to identify that bias or variability exists in your model, you can do a number of things to change that. You may try feature selection and feature transformation, or try removing some overfitting variables. Based on what is possible at that moment, a decision can be made, and the model can definitely be improved if there is room for it.
Thank you for reading! Happy learning!
Bias & Variance in Machine Learning was originally published in Towards AI, a multidisciplinary science journal, on Medium.