Counter Overfitting with L1 and L2 Regularization

Last Updated on June 29, 2024 by Editorial Team

Author(s): Eashan Mahajan

Originally published on Towards AI.

Photo by Arseny Togulev on Unsplash

Overfitting. A modeling error many of us have encountered or will encounter while training a model. Simply put, overfitting is when a model learns the details and noise of its training dataset to the extent that it hurts performance on new data. An overfitted model does well on the training dataset but performs poorly on unseen data, leading to poor results on the test and validation sets.

To counter this, researchers have developed several techniques, two of which are L1 and L2 regularization. L1 regularization (Lasso regression) adds the absolute values of the coefficients as a penalty term to the loss function. L2 regularization (Ridge regression) adds the squared values of the coefficients as a penalty term to the loss function. In this article, we’ll explore how both regularization techniques work, how to use them, and the benefits and disadvantages of each.

Introduction

Causes of Overfitting

As stated above, overfitting occurs when the model learns too much about the details and noise of a dataset or is overtrained on it. Because most of the fine details and noise within a training dataset don’t occur in real-world data, the model performs poorly on unseen data. Overfitting can be caused by a variety of factors:

  1. Model Complexity: When a model has too many parameters, it can fit the training data too closely and focus on irrelevant details. Using complex models such as neural networks or ensemble methods without collecting enough data can also lead to overfitting (see the short sketch after this list).
  2. Training Data: If the training data is insufficient, a model can easily memorize the training examples, including the noise and outliers present, instead of learning the underlying patterns.
  3. Improper Feature Selection: If the data hasn’t been properly pre-processed before the model trains on it, there will be features that don’t have a causal relationship with the target variable.
  4. Training Time: Running too many iterations or epochs will force the model to start fitting the noise within the training data.
  5. Noise: Datasets that contain lots of noise can mislead the model and cause it to learn random and incorrect patterns.
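
To make the first two causes concrete, here is a minimal, self-contained sketch (the sample size, noise level, and polynomial degrees are arbitrary choices for illustration): a high-degree polynomial fit to a handful of noisy points achieves a near-zero training MSE while generalizing far worse than simpler fits.

# A minimal overfitting sketch: high model complexity plus very little data
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))                 # only 20 observations
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=20)  # simple underlying pattern plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")

# The degree-15 model fits the 10 training points almost perfectly,
# but its test MSE is typically far worse than that of the simpler fits.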

Why Care?

Often you’ll see people ignore overfitting and ship their model off to production, only to face dozens of complaints from angry customers. Overfitting isn’t something you can ignore. It has to be dealt with, otherwise your entire model is useless. The model will fit the training data just fine, but on unseen data it won’t perform to the expectations others have.

Compromising the model’s ability to generalize defeats the whole purpose of creating a machine learning model. That’s why researchers have developed techniques such as L1 and L2 regularization. So, without further ado, let’s talk about L1 regularization, otherwise known as Lasso regression.

L1 Regularization

Before we can understand how L1 regularization works, we need to analyze an equation. Let’s look at the equation for a linear regression model.

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$$

Where:

  1. y: The dependent variable
  2. x: The independent variable used to predict the dependent variable
  3. β: The first β value, β₀, is the bias (intercept) term: the value of y when all the x-values are 0. The rest of the β values represent the change in the dependent variable for a one-unit change in the corresponding independent variable.
  4. ϵ: The error term. This term captures all of the noise and other factors that affect y and are not explained by the linear relationship with the independent variable.

The values for β are chosen using the least squares method, which minimizes the residual sum of squares (RSS):

$$\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

In the equation, $y_i$ is the actual value for the i-th observation, while $\hat{y}_i$ (the y with the caret on top) is its predicted value.
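
As a quick illustration on made-up data (not the dataset used later in this article), the least-squares coefficients and the RSS they minimize can be computed directly with NumPy:

# Sketch: least-squares coefficients and the RSS they minimize, on toy data
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                          # two independent variables
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, 50)  # true coefficients 3 and 2, plus noise

X_design = np.column_stack([np.ones(len(X)), X])      # prepend a column of 1s for the bias term
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)   # least-squares estimate of [β0, β1, β2]

rss = np.sum((y - X_design @ beta) ** 2)              # residual sum of squares
print("beta:", beta)
print("RSS:", rss)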

However, when the predictor variables become highly correlated, multicollinearity becomes a problem, and the model will perform poorly when applied to a new dataset. To solve this issue, we can use L1 regularization and apply a penalty term to the equation:

$$\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert$$

This new term is known as a shrinkage penalty, where 1 ≤ j ≤ p and λ > 0. In this term, we sum the absolute values of all of the coefficients. λ, or lambda, is a tuning parameter that strengthens the effect of the penalty term. As lambda increases, the penalty on larger coefficients becomes more prominent, which drives some of the coefficients down to exactly 0, effectively performing feature selection. The result is a sparse model in which only a few features contribute to the prediction.
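
Here is a small sketch of that idea on made-up data (note that scikit-learn calls the tuning parameter alpha and rescales the loss slightly differently than the equation above, but the behavior is the same): as the penalty grows, more coefficients are driven exactly to 0.

# Sketch: the L1-penalized objective written out, and its feature-selection effect
import numpy as np
from sklearn.linear_model import Lasso

def l1_objective(beta, X, y, lam):
    """RSS plus lambda times the sum of absolute coefficient values (shown for reference)."""
    rss = np.sum((y - X @ beta) ** 2)
    return rss + lam * np.sum(np.abs(beta))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                         # 10 features, most of them irrelevant
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, 100)  # only the first two features matter

for alpha in (0.001, 0.1, 1.0):                        # alpha plays the role of lambda
    n_zero = np.sum(Lasso(alpha=alpha, max_iter=10_000).fit(X, y).coef_ == 0)
    print(f"alpha={alpha}: {n_zero} of 10 coefficients are exactly 0")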

Creating the Model

Let’s design a basic model here for linear regression.

# Imports
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV
import matplotlib.pyplot as plt

np.random.seed(42)

# Generate random data
n_samples = 100
X1 = np.random.rand(n_samples)
X2 = np.random.rand(n_samples)
X3 = X1 + np.random.normal(0, 0.1, n_samples) # X3 is highly correlated with X1
X4 = X2 + np.random.normal(0, 0.1, n_samples) # X4 is highly correlated with X2
X5 = np.random.rand(n_samples) # Irrelevant feature
X6 = np.random.rand(n_samples) # Irrelevant feature
X7 = X2 + np.random.normal(0, 0.1, n_samples) # X7 is highly correlated with X2
X8 = X1 + X2 + np.random.normal(0, 0.1, n_samples) # X8 is correlated with X1 and X2
X9 = np.random.rand(n_samples) # Irrelevant feature
X10 = np.random.rand(n_samples) # Irrelevant feature
X11 = np.random.rand(n_samples) # Irrelevant feature
X12 = np.random.rand(n_samples) # Irrelevant feature
X13 = np.random.rand(n_samples) # Irrelevant feature
X14 = np.random.rand(n_samples) # Irrelevant feature
X15 = np.random.rand(n_samples) # Irrelevant feature
X16 = np.random.rand(n_samples) # Irrelevant feature

# Generating a target variable that depends only on X1 and X2, plus random noise
y = 3 * X1 + 2 * X2 + np.random.normal(0, 1, n_samples)

# Combining into a dataframe
data = pd.DataFrame({
    'X1': X1,
    'X2': X2,
    'X3': X3,
    'X4': X4,
    'X5': X5,
    'X6': X6,
    'X7': X7,
    'X8': X8,
    'X9': X9,
    'X10': X10,
    'X11': X11,
    'X12': X12,
    'X13': X13,
    'X14': X14,
    'X15': X15,
    'X16': X16,
    'y': y
})

# Splitting data into features and target
X = data.drop(columns=['y'])
y = data['y']

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training linear regression model
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
mse_lr = mean_squared_error(y_test, y_pred_lr)

print("Linear Regression Coefficients:", lr.coef_)
print("Linear Regression MSE:", mse_lr)

In regard to the data we created, X3 is highly correlated with X1, and X4 and X7 are highly correlated with X2, which introduces multicollinearity; a linear regression model will struggle to separate their individual effects. Features such as X5, X6, and X9 through X16 are irrelevant and only contribute noise. In a real-world dataset, you’re going to encounter even more of these situations, which makes the analysis much more complex.

What we’re going to analyze is the MSE, or mean squared error. The closer an MSE value is to 0, the better it is. The code outputs:

Linear Regression Coefficients: [ 3.25072566 2.21799104 0.71344211 1.50151585 0.17049475 0.45632903
-0.21593949 -1.00099295 -0.13812986 0.20603788 -0.39050274 0.14718376
-0.78339654 0.81501732 0.27833921 0.5122955 ]
Linear Regression MSE: 1.1321449395570302

Wow! That’s already a really good MSE value. Let’s see if we can lower it even more.

Lasso Model

Using scikit-learn, we can easily implement a lasso regression or L1 regularization model. We can do so as below:

# Tune and train the lasso model (alpha is scikit-learn's name for the lambda penalty strength)
lasso = Lasso()
param_grid = {'alpha': np.logspace(-4, 0, 50)}
grid_search = GridSearchCV(lasso, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Find the best model
best_lasso = grid_search.best_estimator_
y_pred_lasso = best_lasso.predict(X_test)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)

# Results
print("Lasso Regression Coefficients:", best_lasso.coef_)
print("Lasso Regression MSE:", mse_lasso)

This model outputs:

Lasso Regression Coefficients: [ 1.85169128 0.97233989 0.61109015 1.16770492 0. 0.24083797
0. 0.2178444 -0. 0. -0.24803869 0.
-0.62376248 0.6690866 0.13260896 0.2182543 ]
Lasso Regression MSE: 1.0618433653062156

It outputs an MSE of 1.0618433653062156, closer to 0 than without applying L1 regularization! Notice as well that the lasso regression coefficients are much closer to 0 than the linear regression coefficients, with some of them shrunk exactly to 0.

Now once again, keep in mind this isn’t the most complex model. In a real-world dataset, you’ll encounter more features that are closely related and more features that are just noise. In those cases, L1 regularization is a strong choice, as it shrinks the coefficients of unhelpful features toward 0 and drives some of them exactly to 0.
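
If you want to see that shrinkage directly, one option is to trace each lasso coefficient as alpha grows. This sketch reuses the X_train and y_train already defined above, and the same alpha grid we searched over:

# Sketch: lasso coefficient paths on the training data from above
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso

alphas = np.logspace(-4, 0, 50)
coef_path = np.array([
    Lasso(alpha=a, max_iter=10_000).fit(X_train, y_train).coef_
    for a in alphas
])

plt.plot(alphas, coef_path)
plt.xscale('log')
plt.xlabel('alpha (lambda)')
plt.ylabel('coefficient value')
plt.title('Lasso coefficient paths')
plt.show()

# The coefficients of the noisy and redundant features hit exactly 0 first as the penalty grows.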

But, there’s another way to counter this. Let’s transition to L2 regularization, otherwise known as Ridge regression.

L2 Regularization

Once again, before we can get into how we can apply L2 regularization, we need to understand the math behind it.

Let’s reference the equation for linear regression.

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$$

Remember, the values for the coefficients of x are chosen using the least squares method, with the equation for it right here:

$$\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

For L2 regularization, we’re going to add a penalty term, as such:

$$\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$

In this penalty term, lambda acts the same as it did in L1 regularization. It remains the tuning parameter that strengthens the effect of the penalty term. As lambda approaches infinity, the effect becomes greater and the coefficients gradually approach, but do not equal, 0.

Within the summation, we add the squared values of the coefficients to the loss function. This spreads the penalty across all of the weights, leading to smaller and more uniformly distributed coefficients.
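
For the curious, ridge regression even has a closed-form solution, β = (XᵀX + λI)⁻¹Xᵀy. Here is a minimal sketch of it on toy data; treat it purely as an illustration, since scikit-learn’s Ridge uses more robust solvers and leaves the intercept unpenalized.

# Sketch: ridge coefficients from the closed-form solution (X^T X + lambda * I)^(-1) X^T y
import numpy as np

def ridge_coefficients(X, y, lam):
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, 100)

for lam in (0.0, 1.0, 100.0):
    print(f"lambda={lam:>5}: {np.round(ridge_coefficients(X, y, lam), 3)}")

# As lambda grows, every coefficient shrinks toward 0, but none becomes exactly 0.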

Alright, enough math. Let’s get into the model.

Creating the Model

We’ll reuse the same data and baseline linear regression model from the L1 section above, which gave a linear regression MSE of 1.1321449395570302.

Now, let’s code the L2 regularization model. Thanks to scikit-learn, the code is fairly simple:

# Tune and train ridge regression model
ridge = Ridge()
param_grid_ridge = {'alpha': np.logspace(-4, 0, 50)}
grid_search_ridge = GridSearchCV(ridge, param_grid_ridge, cv=5, scoring='neg_mean_squared_error')
grid_search_ridge.fit(X_train, y_train)

# Best model from grid search
best_ridge = grid_search_ridge.best_estimator_
y_pred_ridge = best_ridge.predict(X_test)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)

print("Lasso Regression Coefficients:", best_lasso.coef_)
print("Lasso Regression MSE:", mse_lasso)

This will output:

Ridge Regression Coefficients: [ 1.25034253 0.65926717 0.88205049 0.8464581 0.0471434 0.34281097
0.34180269 0.56652748 -0.05426079 0.03019687 -0.26896752 0.17650297
-0.66223805 0.73950862 0.30905173 0.38244519]
Ridge Regression MSE: 1.016471655923085

With an MSE of 1.016471655923085, it beats out not only the linear regression model but also the lasso regression model! Take note that none of the ridge coefficients are exactly 0, but they are much closer to 0 than the linear regression coefficients.
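
A quick sanity check, reusing the fitted best_lasso and best_ridge objects from above, makes that difference explicit:

# Compare the two fitted models: lasso zeroes some coefficients, ridge only shrinks them
import numpy as np

print("Lasso coefficients exactly 0:", int(np.sum(best_lasso.coef_ == 0)))
print("Ridge coefficients exactly 0:", int(np.sum(best_ridge.coef_ == 0)))
print("Largest |coefficient|, lasso:", float(np.max(np.abs(best_lasso.coef_))))
print("Largest |coefficient|, ridge:", float(np.max(np.abs(best_ridge.coef_))))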

Photo by Pietro Jeng on Unsplash

Comparison

When should L1 or L2 regularization be used?

Great question! Let’s review some guidelines that’ll help you decide which one you need to use.

L1 Regularization (Lasso Regression):

  1. Feature Selection: If you believe that some of the features your model learns from are irrelevant, use L1 regularization to drive the coefficients of the less important features to 0 (a short sketch of this appears after this list).
  2. Sparse Models: A sparse model is one with very few non-zero coefficients, which makes it much simpler; if that is what you want, L1 regularization will produce one.
  3. High-Dimensional Data: Often you’ll encounter datasets with a very large number of features, sometimes exceeding the number of observations. This frequently leads to overfitting, so it is a good idea to employ L1 regularization to drive the coefficients of less important features to 0.
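
For the feature-selection point above, here is a brief sketch using scikit-learn’s SelectFromModel to turn a fitted lasso into a feature selector. It reuses the X_train and y_train from earlier; the alpha value is an arbitrary choice for illustration.

# Sketch: using a fitted Lasso as a feature selector
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

selector = SelectFromModel(Lasso(alpha=0.1, max_iter=10_000))
selector.fit(X_train, y_train)
print("Kept features:", list(X_train.columns[selector.get_support()]))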

L2 Regularization (Ridge Regression):

  1. Multicollinearity: If many of your features are highly correlated, L2 regularization can be effective because, unlike L1 regularization, which tends to pick one of the correlated features and ignore the others, L2 distributes the coefficient values among correlated features.
  2. Numerical Stability: If you’re working with a dataset that includes values that often change, such as the stock market, ridge regression prevents coefficients from becoming too big. This ensures numerical stability in the presence of extreme variance.
  3. Model Stability: If you’re developing a demand forecasting model for a supply chain, data can often be noisy and subject to change. L2 regularization can provide a stable model that is resistant to changes across the dataset, which ensures a solid performance even with older or newer data.

Choosing between L1 or L2 regularization involves knowing the specifics of a dataset and the goals of your model. Lasso regression will be helpful when feature selection or sparse models are important, and ridge regression is particularly helpful for dealing with multicollinearity and model stability.

If you’re still unsure which one you need, the table below summarizes the key differences.

|  | L1 (Lasso) | L2 (Ridge) |
| --- | --- | --- |
| Penalty term | λ times the sum of absolute coefficient values | λ times the sum of squared coefficient values |
| Effect on coefficients | Shrinks some exactly to 0 | Shrinks all toward 0, never exactly to 0 |
| Feature selection | Yes, produces sparse models | No, keeps all features |
| Correlated features | Tends to keep one and drop the rest | Distributes weight among them |
| Typical use cases | High-dimensional data, sparse or interpretable models | Multicollinearity, numerical and model stability |

Concluding Thoughts

Let’s recap: L1 regularization, or lasso regression, adds the absolute values of all of the coefficients as a penalty term, forcing the coefficients to approach 0 or become exactly 0. It’s useful when you want to perform feature selection, have high-dimensional data, or want higher model interpretability.

L2 regularization, or ridge regression, adds the squared values of the coefficients as a penalty term, forcing the coefficients to shrink towards 0. It’s useful when dealing with multicollinearity, wanting to retain all of the features, or wanting higher model stability.

Both of these techniques are a great way to counter overfitting. Take note that they don’t just work with linear regression but also with other machine learning algorithms such as logistic regression, support vector machines, and neural networks.
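
For instance, here is a brief sketch of the same penalties applied to logistic regression, on a synthetic classification dataset from make_classification (in scikit-learn, C is the inverse of the regularization strength, so smaller C means a stronger penalty):

# Sketch: L1 and L2 penalties on logistic regression
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_clf, y_clf = make_classification(n_samples=200, n_features=20,
                                   n_informative=5, random_state=42)

l1_clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X_clf, y_clf)
l2_clf = LogisticRegression(penalty='l2', C=0.1).fit(X_clf, y_clf)

print("L1 zero coefficients:", int((l1_clf.coef_ == 0).sum()))
print("L2 zero coefficients:", int((l2_clf.coef_ == 0).sum()))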

In the future, we’ll discuss more techniques to counter overfitting or other problems in machine learning. But for now, I highly encourage you to try using L1 or L2 regularization in your future endeavors, as the more practice you do with them, the better you’ll understand them. Thank you for reading!
