Bias & Variance in Machine Learning

Last Updated on July 28, 2020 by Editorial Team

Linear Regression is a machine learning algorithm used to predict a quantitative target from independent variables that are modeled linearly, by fitting a line or a plane (or hyperplane) on which the predicted values lie. For simplicity, let’s call this the best-fit line. Usually, points from the training data don’t all lie exactly on the best-fit line, and that makes perfect sense because real data is noisy. That is why we fit a model to make predictions in the first place, rather than just plotting an arbitrary line.

Understanding Bias

The linear regression line cannot curve to pass through every point in the training set, so at times it is unable to capture the true relationship accurately. This systematic error is called bias. The intercept in the linear regression equation is also often called the bias term, and looking at it gives one way to build intuition for underfitting.

Why do I say that?

Let me explain. Here’s a sample Linear Regression equation:

y = Intercept + Slope1*x1 + Slope2*x2

The target (y) has some values in the dataset, and the above equation calculates the predicted values for it. If the “Intercept” alone accounts for most of each predicted y value, so that the predictions stay close to a constant even though the target itself varies, then the changes in y contributed by the other two parts of our equation, the independent variables (x1 and x2), are small. This means that the amount of variance explained by x1 and x2 is low, which results in an underfitting model. An underfitting model has a low R-squared (the proportion of variance in the target that is explained by the independent variables).
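To make R-squared concrete, here is a minimal sketch that fits the equation above with NumPy and compares it against a model that uses the intercept alone. The data is synthetic, and the coefficients (2.0, 3.0, -1.5), the noise level, and the helper name `r_squared` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data (assumed for illustration): y depends linearly on x1 and x2.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

def r_squared(y_true, y_pred):
    """Proportion of the target's variance explained by the predictions."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Full model: y = Intercept + Slope1*x1 + Slope2*x2, fitted by least squares.
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
r2_full = r_squared(y, X @ coef)

# Intercept-only model: every prediction is the mean of y.
r2_intercept_only = r_squared(y, np.full(n, y.mean()))

print("Intercept, Slope1, Slope2:", coef)
print("R-squared, full model:", r2_full)
print("R-squared, intercept only:", r2_intercept_only)
```

When the predictions are just a constant, x1 and x2 explain none of the variance, so the R-squared of the intercept-only model is zero, which is the extreme case of the underfit described above.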

Underfitting can also be understood by thinking about how the best-fit line/plane is obtained in the first place. The best-fit line/plane captures the relationship between the target and the independent variables. If this relationship is captured to a very high extent, the bias is low, and vice versa.

Now that we understand what bias is, and how a high bias causes an underfitting model, it becomes clear that for a robust model, we need to remove this underfit.

In a scenario where we create a curve that passes through all data points and can showcase the existing relationship between the independent variables and the dependent variable, there would be no bias in the model on the train data.
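The idea that a straight line cannot bend to follow a curved relationship can be seen in a small sketch. It assumes a synthetic quadratic relationship (y = x² plus a little noise) and uses `np.polyfit` to fit a line and a curve to the same train data; the data and the helper name `train_mse` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic train data (assumed): a curved relationship with a little noise.
x = np.linspace(-2.0, 2.0, 21)
y = x ** 2 + rng.normal(scale=0.1, size=x.size)

def train_mse(degree):
    """Fit a polynomial of the given degree and return its error on the train data."""
    coefs = np.polyfit(x, y, deg=degree)
    return float(np.mean((y - np.polyval(coefs, x)) ** 2))

mse_line = train_mse(1)   # straight line: cannot curve, so bias is high
mse_curve = train_mse(2)  # curve: able to follow the relationship, so bias is low

print("Train MSE, straight line:", mse_line)
print("Train MSE, curve:", mse_curve)
```

The line leaves a large error that no choice of slope or intercept can remove, while the curve's remaining error is roughly the size of the noise.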

Understanding Variance

A model that has overfitted on the train data gives rise to a new phenomenon called “variance”. Time to consider two models:

Model1: High Bias (Unable to capture the relationship properly)

Model2: Low Bias (Captures relationship to a very high extent)

Error measurement while validating a model:

Error = Actual Values - Predicted Values

On calculating the errors on the training data (test data is not in the picture yet), we observe the following:

Model1: Validation of model on train data shows that errors are high

Model2: Validation of model on train data shows that errors are low

Now, let’s bring in the test data, and understand variance.

So, if the model has overfitted on the train data, it “understands” and “knows” the train data so well that it is likely to struggle with the test data, and so it fails to capture the relationship when test data is used as input. In broader terms, this means that there will be a large difference in fit between the train data and the test data (the train data shows a near-perfect fit, while the fit on the test data is poor). This difference in fit is referred to as “variance”, and it usually arises when the model has learned only the train data and struggles with any new input given to it. More formally, variance measures how much the model’s predictions would change if it were trained on a different sample of data; the gap between train and test performance is its visible symptom.

On validating the above models on test data, we notice this:

Model1: Relationship isn’t captured correctly here as well, but there isn’t a huge gap of understanding between the train and test data, so the variance is low

Model2: There is a huge gap of understanding between the train and test data, so the variance is high
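The comparison between the two models can be sketched in a few lines. This is only an illustration under assumed settings: the data follows a noisy sine curve, Model1 is a straight line, and Model2 is a degree-9 polynomial that can pass through all 10 train points. None of these choices come from the article itself.

```python
import numpy as np

rng = np.random.default_rng(42)

def true_relationship(x):
    # Assumed underlying relationship, for illustration only.
    return np.sin(3.0 * x)

# 10 evenly spaced train points and 200 fresh test points, both noisy.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = true_relationship(x_train) + rng.normal(scale=0.3, size=x_train.size)
x_test = rng.uniform(-1.0, 1.0, size=200)
y_test = true_relationship(x_test) + rng.normal(scale=0.3, size=x_test.size)

def mse(coefs, x, y):
    """Mean of squared errors, where Error = Actual Values - Predicted Values."""
    return float(np.mean((y - np.polyval(coefs, x)) ** 2))

degrees = {"Model1": 1, "Model2": 9}  # high bias vs. low bias
train_mse, test_mse = {}, {}
for name, degree in degrees.items():
    coefs = np.polyfit(x_train, y_train, deg=degree)
    train_mse[name] = mse(coefs, x_train, y_train)
    test_mse[name] = mse(coefs, x_test, y_test)
    gap = test_mse[name] - train_mse[name]
    print(f"{name}: train MSE={train_mse[name]:.4f}, "
          f"test MSE={test_mse[name]:.4f}, gap={gap:.4f}")
```

Model2's train error is essentially zero because it passes through every train point, so whatever error it makes on the test data shows up entirely as a gap. The exact numbers depend on the random seed.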

The Trade-Off between Bias & Variance

Now we understand that both bias and variance can cause problems in our prediction model. So, how do we go about solving this issue?

A couple of terms to understand before we proceed:

Overfit: Low Bias & High Variance. The model fits great on train data, but struggles with test data because it has learned only the train data well.

Underfit: High Bias & Low Variance. The model is unable to capture the relationship on the train data, and since it hasn’t captured the relationship anyway, there isn’t much of a gap between its fit on train and test data, so variance is low.

Coming back to the solution, we can do the following to try to build a trade-off between the bias and variance being caused:

1. Cross-Validation

Usually, a model is built on train data and tested on the same, but there’s one more thing that people prefer: testing the model on a held-out part of the train data, which is called the validation data.

So, what is Cross-Validation?

As mentioned, model validation is done on a part of the train data. So, if we choose a new set of data points from the train data for validation in each iteration, train on the rest, and average the results obtained across these sets, we are doing cross-validation. This gives a more reliable picture of how the model behaves on data it was not fitted on, and a way to check whether overfitting is present.
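The procedure just described can be sketched with NumPy alone. This is a hand-rolled version for illustration, not a library API: the function name `k_fold_mse`, the synthetic data, and the choice of a straight-line model are all assumptions.

```python
import numpy as np

def k_fold_mse(x, y, k, seed=0):
    """Average validation MSE of a straight-line fit over k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))   # shuffle the rows once
    folds = np.array_split(indices, k)  # k roughly equal validation sets
    fold_errors = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
        predictions = slope * x[val_idx] + intercept
        fold_errors.append(float(np.mean((y[val_idx] - predictions) ** 2)))
    return float(np.mean(fold_errors)), fold_errors

# Synthetic train data (assumed): a linear relationship with noise.
rng = np.random.default_rng(7)
x = rng.uniform(0.0, 10.0, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)

cv_mse, fold_errors = k_fold_mse(x, y, k=5)
print("Per-fold validation MSE:", fold_errors)
print("Cross-validated MSE:", cv_mse)
```

If the cross-validated error is much higher than the error measured on the full train data, that is a sign the model is overfitting.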

Types of Cross-Validation:

K-Fold CV: K here represents the number of sets (folds) we break our train set into. Each of these K sets is used once for model validation while the model is trained on the remaining K-1 sets, and the results obtained from these K sets are averaged to give a final result, which helps in detecting overfitting.

Leave-One-Out CV: The working technique of Leave-One-Out CV is similar to that of K-Fold CV, but it takes the process to a new level: each data point in the train data is used once as its own validation set, so there are as many folds as there are data points. This is obviously time-consuming, but it helps in detecting overfitting, especially on small datasets.

Forward Chaining: While working with time-series data, K-Fold CV and Leave-One-Out CV can create a problem. It is very much possible that some years have a pattern that other years don’t have, and random sets of data would mix future observations into the train folds, so using them for cross-validation would not make sense. In fact, it is possible that the existing trends could go unnoticed, which is not what we want. So, usually, in this kind of case, a forward-chaining method is used. Each fold that we form (for cross-validation) contains a train set, created by adding the data of the next consecutive year to the previous train set, and a test set that contains only the year immediately after the latest year used in the train set.
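The way forward-chaining folds grow can be sketched in plain Python. The function name `forward_chaining_splits`, its `min_train` parameter, and the example years are hypothetical and only meant to show the shape of the folds.

```python
def forward_chaining_splits(periods, min_train=1):
    """Yield (train_periods, test_period) pairs in chronological order.

    Each fold trains on all periods up to a point and validates on the
    single period that comes immediately after them.
    """
    for end in range(min_train, len(periods)):
        yield periods[:end], periods[end]

years = [2015, 2016, 2017, 2018, 2019]  # example years, assumed
splits = list(forward_chaining_splits(years))

for train_years, test_year in splits:
    print(f"train on {train_years} -> validate on {test_year}")
```

Unlike K-Fold CV, no fold ever trains on data that comes after its validation year, so the model is always evaluated on the future relative to what it has seen.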

2. Regularization

Regularization is a technique that helps in reducing the variance, usually at the cost of a small increase in bias, by penalizing the beta coefficients attached to our model’s independent variables.
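One common form of this penalty is ridge (L2) regularization, used here purely as an illustration rather than as the specific method this article has in mind. The sketch below assumes synthetic data and a hypothetical helper, `ridge_coefficients`, that solves the penalized least-squares problem in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 5

# Synthetic data (assumed): only three of the five variables matter.
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_beta + rng.normal(scale=1.0, size=n)

# Center the data so the intercept is handled separately and not penalized.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_coefficients(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for the beta coefficients."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

beta_ols = ridge_coefficients(Xc, yc, lam=0.0)      # no penalty
beta_ridge = ridge_coefficients(Xc, yc, lam=100.0)  # strong penalty

norm_ols = float(np.linalg.norm(beta_ols))
norm_ridge = float(np.linalg.norm(beta_ridge))
print("Coefficients without penalty:", beta_ols)
print("Coefficients with penalty:", beta_ridge)
print("Coefficient norms:", norm_ols, norm_ridge)
```

The penalty pulls the coefficients toward zero, which makes the model less sensitive to the particular train sample. A penalty that is too strong shrinks useful coefficients as well, which is where the extra bias comes from.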

I’ve written a whole article on “Feature Selection in Machine Learning”, where I have described Regularization and its types in much more depth. Feel free to check it out here:

Feature Selection in Machine Learning

Conclusion

There is no perfect model. It has to be improved by working with its imperfections. Once you are able to identify that bias or variance exists in your model, you can do a number of things to change that. You may try feature selection and feature transformation as well. You may try removing some variables that contribute to overfitting. Based on what is possible at that moment, the decision can be made, and the model can be improved wherever there is room for improvement.

Thank you for reading! Happy learning!

Bias & Variance in Machine Learning was originally published in Towards AI — Multidisciplinary Science Journal on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI

Author(s): Shaurya Lalwani
