From Overfitting to Excellence: Harnessing the Power of Regularization
Author(s): Sandeepkumar Racherla
Originally published on Towards AI.
The Role of Regularization: Balancing Complexity and Generalization in Machine Learning
When it comes to Machine Learning, our goal is to find the ML model that makes the best predictions on data it hasn't been trained on.
To do so, we train our ML models on training data and see how they perform in making predictions. Then, we compare the performance of the predictions on the train set and on the test set (that is, the set with new, unseen data) to decide which of the ML models we're testing is the best one for the problem we're facing.
In this article, we'll describe regularization in Machine Learning as a way to avoid overfitting.
The problem of overfitting in Machine Learning
Overfitting is a problem in Machine Learning that occurs when a model becomes "too specialized and focused" on the particular data it's been trained on and can't generalize well.
Let's look at a picture describing the three possible conditions of an ML model: overfitting, underfitting, and a good fit.
So, overfitting is like students memorizing the answers to specific questions: odds are that they'll fail the test because they didn't understand the subject. They just remembered how to answer some questions.
On the performance side, what generally happens is that we can tell an ML model is overfitting when the metric we've chosen to evaluate it is very high (near 100%) on the train set but significantly lower on the test set.
So, overfitting, as well as underfitting, is something that we, as Data Scientists, absolutely need to avoid.
A way to avoid overfitting is through regularization. Let's see how.
Resolving overfitting through regularization
Regularization adds a penalty term to the loss function during the training of the model: this "discourages" the model from becoming too complex.
The idea is that by limiting the modelβs ability to fit the training data too closely, we can also reduce its ability to fit noise or random variations in the data. Thus, we avoid overfitting.
So, suppose we have a cost function defined as follows:
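In the generic notation we'll use here (a loss written as a function of the parameters and the data; the exact form, e.g. the MSE, is not important yet), the cost function can be expressed as:

J(\theta; x, y)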
Where we have:
- `theta` is the estimator (the parameters of our model).
- `x` and `y` are, respectively, the features and the labels of our model.
If we define the regularization parameter as `lambda` and the regularization function as `omega`, then the regularized objective function is:
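With these definitions, the regularized objective can be written (in its standard form) as:

\tilde{J}(\theta; x, y) = J(\theta; x, y) + \lambda \, \Omega(\theta)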
So, the function to minimize is now the sum of the cost function and a regularization function.
The regularization function and the regularization parameter can vary, depending on the type of regularization we want to use.
In the following paragraphs, we'll describe the two most commonly used regularization functions in Machine Learning.
Describing Lasso and Ridge regularization
The Lasso regularized model
Lasso (least absolute shrinkage and selection operator) regularization performs an L1 regularization, which adds a penalty equal to the absolute value of the magnitude of the coefficients. This regularization method drives the weights closer to the origin by adding a regularization term:
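For L1 regularization, this penalty term is the L1 norm of the weight vector, that is, the sum of the absolute values of the weights:

\Omega(w) = \lVert w \rVert_1 = \sum_j \lvert w_j \rvert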
We can add an L1 regularization penalty, for example, to the MSE (or to any other cost function we'd like) like so:
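Using the MSE as the base cost function and denoting the model weights by w_j (the remaining quantities are defined just below), one common way to write the L1-regularized objective is:

\tilde{J}(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_j \lvert w_j \rvert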
where we have:
- `n` is the total number of data points we're studying
- `y_i` are the actual values
- `hat y_i` are the values predicted by our model
The geometrical interpretation of Lasso regularization
Lasso regularization, under the hood, performs what we call feature selection. Let's give a graphical interpretation of Lasso regularization to understand this fact. For the sake of simplicity, we'll create a 2-D drawing with two weights, w_1 and w_2:
The orange concentric ellipses are the geometrical representation of the cost function we've chosen (the MSE, in our case), while the light blue rhombus is the geometrical interpretation of the penalty function (L1).
Our goal is to minimize the loss function; as we can see from the above illustration, the ellipses can be tangent to the rhombus only at one of its corners. At these corners, one of the weights is 0. The tangency may be better visualized in the following image, because the minimization occurs where the two functions are tangent, not where they intersect:
Now, imagine we repeat this process for all the weights w_i; in the end, we'll have what we call a sparse model, which means a model with fewer features than the initial ones, because Lasso regularization can shrink some of the weights exactly to 0. This means that our model has performed feature selection, dropping some features (their weights have been set to 0, simplifying the initial model).
Mathematically speaking, remembering that we have to calculate the gradient of the function to minimize it, we should calculate the following:
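For the L1 penalty written above, the (sub)gradient of the regularized objective with respect to the weights takes the standard form:

\nabla_w \tilde{J}(w) = \nabla_w J(w) + \lambda \, \mathrm{sign}(w)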
In this context, sparsity refers to the fact that some parameters have an optimal value of exactly zero. In other words, the solution of this equation can set some of the weights exactly to 0, effectively dropping the corresponding features: this is what we call feature selection.
The Ridge regularized model
The Ridge model performs an L2 regularization, which adds a penalty equal to the square of the magnitude of the coefficients. This regularization strategy drives the weights w_i closer to the origin by adding a regularization term.
We can add an L2 regularization penalty, for example, to the MSE (or to any other cost function we'd like) like so:
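Using the same notation as in the Lasso case, one common way to write the L2-regularized MSE is:

\tilde{J}(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_j w_j^2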
The geometrical interpretation of Ridge regularization
Let's give a graphical interpretation of Ridge regularization as well. For the sake of simplicity, we'll again create a 2-D drawing with two weights, w_1 and w_2:
In contrast to Lasso regularization, Ridge does not perform feature selection. Ridge decreases the complexity of the model without reducing the number of independent variables because it never shrinks a coefficient exactly to 0. Let's visualize it with the help of the following image:
If we look at the above image, we can see that the ellipses can be tangent to the circle (the geometrical interpretation of the L2 penalty) anywhere, which means that the model penalizes large weights without shrinking them all the way to 0. This means that the final model will include all the independent variables.
Also, in this case, remembering that we have to calculate the gradient of the function to minimize it, we should calculate the following:
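For the L2 penalty written above, the gradient of the regularized objective with respect to the weights is:

\nabla_w \tilde{J}(w) = \nabla_w J(w) + 2\lambda w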
The difference with Lasso regularization is that this gradient shrinks every weight proportionally but never sets it exactly to 0, and this is why Ridge does not perform feature selection.
Implementing Ridge and Lasso regularized models in Python
So, let's see how we can implement both these regularized methods.
We start with the Lasso regularization:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.pipeline import make_pipeline
# Ignore warnings (e.g., convergence warnings raised by Lasso)
import warnings
warnings.filterwarnings('ignore')
# Create a dataset
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)
# Scale the features
scaler = StandardScaler()
X = scaler.fit_transform(X)
# Subdivide the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create 6-degree polynomial features from the train data
poly = PolynomialFeatures(degree=6)
X_train_poly = poly.fit_transform(X_train)
# Fit a plain (non-regularized) linear regression on the polynomial features
poly_reg = LinearRegression()
poly_reg.fit(X_train_poly, y_train)
# Evaluate the polynomial on the train and test data
X_test_poly = poly.transform(X_test)
y_pred_train = poly_reg.predict(X_train_poly)
y_pred_test = poly_reg.predict(X_test_poly)
r2_score_poly_train = r2_score(y_train, y_pred_train)
r2_score_poly_test = r2_score(y_test, y_pred_test)
print(f'R-squared of 6-degree polynomial on train set: {r2_score_poly_train: .3f}')
print(f'R-squared of 6-degree polynomial on test set: {r2_score_poly_test: .3f}')
# Fit a regularized 6-degree polynomial to the train data
lasso_reg = make_pipeline(PolynomialFeatures(degree=6), Lasso(alpha=1))
lasso_reg.fit(X_train, y_train)
# Evaluate the regularized polynomial on the test data
y_pred_lasso = lasso_reg.predict(X_test)
r2_score_lasso = r2_score(y_test, y_pred_lasso)
print(f'R-squared of regularized 6-degree polynomial on test set: {r2_score_lasso: .3f}')
And we get:
R-squared of 6-degree polynomial on train set: 1.000
R-squared of 6-degree polynomial on test set: -84.110
R-squared of regularized 6-degree polynomial on test set: 0.827
So, as we can see, the "standard" 6-degree polynomial overfits because it has an R² of 1 on the train set and a negative one on the test set.
Instead, the 6-degree regularized model has an acceptable R² on the test set, meaning it has good generalization performance. Thus, regularization has improved the performance of the non-regularized model.
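To see the feature selection discussed earlier in action, we can also count how many polynomial coefficients the Lasso step shrank exactly to zero. The following is a minimal sketch that reuses the `lasso_reg` pipeline fitted above; `named_steps['lasso']` is the default step name assigned by `make_pipeline`:

# Access the fitted Lasso step inside the pipeline
lasso_step = lasso_reg.named_steps['lasso']
# Count the coefficients shrunk exactly to zero
n_zero = int(np.sum(lasso_step.coef_ == 0))
n_total = lasso_step.coef_.size
print(f'Lasso set {n_zero} of {n_total} polynomial coefficients exactly to 0')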
Now, let's make a similar example using the Ridge model:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
# Create a dataset
X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=42)
# Scale the features
scaler = StandardScaler()
X = scaler.fit_transform(X)
# Subdivide the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create 6-degree polynomial features from the train data
poly = PolynomialFeatures(degree=6)
X_train_poly = poly.fit_transform(X_train)
# Fit a plain (non-regularized) linear regression on the polynomial features
poly_reg = LinearRegression()
poly_reg.fit(X_train_poly, y_train)
# Evaluate the polynomial on the train and test data
X_test_poly = poly.transform(X_test)
y_pred_train = poly_reg.predict(X_train_poly)
y_pred_test = poly_reg.predict(X_test_poly)
r2_score_poly_train = r2_score(y_train, y_pred_train)
r2_score_poly_test = r2_score(y_test, y_pred_test)
print(f'R-squared of 6-degree polynomial on train set: {r2_score_poly_train}')
print(f'R-squared of 6-degree polynomial on test set: {r2_score_poly_test}')
# Fit a regularized 6-degree polynomial to the train data
ridge_reg = make_pipeline(PolynomialFeatures(degree=6), Ridge(alpha=1))
ridge_reg.fit(X_train, y_train)
# Evaluate the regularized polynomial on the test data
y_pred_ridge = ridge_reg.predict(X_test)
r2_score_ridge = r2_score(y_test, y_pred_ridge)
print(f'R-squared of regularized 6-degree polynomial on test set: {r2_score_ridge}')
And we get:
R-squared of 6-degree polynomial on train set: 1.0
R-squared of 6-degree polynomial on test set: -1612.4842791834997
R-squared of regularized 6-degree polynomial on test set: 0.9266258222037977
So, Ridge regularization, even in this case, has increased the performance of the non-regularized polynomial model. It also scores higher than the Lasso model above, though keep in mind that the two examples use different synthetic datasets, so the scores are not directly comparable.
Conclusions
In this article, we've explained the problem of overfitting and how to solve it through regularization.
But when should we use one regularized model rather than the other? As a rule of thumb:
- We'd better use Ridge regularization when the features are highly correlated. This is why it is important to study the correlation matrix before deciding which features to delete from the study of our problem.
- Lasso regularization helps reduce the number of features in a dataset: it automatically selects a subset of the most important features while shrinking the coefficients of the less important ones to zero. So, Lasso regularization is useful:
- For datasets with high dimensionality.
- When a significant number of features are correlated.
- To impose sparsity on the model coefficients.