Counter Overfitting with L1 and L2 Regularization

Last Updated on June 29, 2024 by Editorial Team

Author(s): Eashan Mahajan

Originally published on Towards AI.

Photo by Arseny Togulev on Unsplash

Overfitting. A modeling error many of us have encountered or will encounter while training a model. Simply put, overfitting is when a model learns the details and noise of its training dataset to the extent that it hurts performance on new data. An overfitted model does well on the training dataset but performs poorly on unseen data, leading to poor results on the test and validation sets.

To counter this, researchers have developed several techniques, two of which are L1 and L2 regularization. L1 regularization (Lasso regression) adds the absolute values of the coefficients as a penalty term to the loss function. L2 regularization (Ridge regression) adds the squared values of the coefficients as a penalty term to the loss function. In this article, we’ll explore how both regularization techniques work, how to use them, and the benefits and disadvantages of each.

Introduction

Causes of Overfitting

As stated above, overfitting occurs when the model learns too much about the details and noise of a dataset or is overtrained on it. Because most of the fine details and noise within a training dataset don’t occur in real-world data, the model performs poorly on unseen data. Overfitting can be caused by a variety of factors:

  1. Model Complexity: When a model has too many parameters, it can fit the training data too closely and focus on irrelevant details. Using complex models such as neural networks or ensemble methods without collecting enough data can also lead to overfitting (see the short sketch after this list).
  2. Training Data: If the training data is insufficient, a model can easily memorize the training examples, including the noise and outliers present, instead of learning the underlying patterns.
  3. Improper Feature Selection: If the data hasn’t been properly pre-processed before the model trains on it, there will be features that don’t have a causal relationship with the target variable.
  4. Training Time: Running too many iterations or epochs will force the model to start fitting the noise within the training data.
  5. Noise: Datasets that contain lots of noise can mislead the model and cause it to learn random and incorrect patterns.
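
To make the first two causes concrete, here is a minimal, self-contained sketch (the sample size, noise level, and polynomial degrees are arbitrary choices for illustration): a high-degree polynomial fit to a handful of noisy points achieves a near-zero training MSE while generalizing far worse than simpler fits.

# A minimal overfitting sketch: high model complexity plus very little data
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))                 # only 20 observations
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=20)  # simple underlying pattern plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")

# The degree-15 model fits the 10 training points almost perfectly,
# but its test MSE is typically far worse than that of the simpler fits.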

Why Care?

Often you’ll see people ignore overfitting and ship their model off to production, only to face dozens of complaints from angry customers. Overfitting isn’t something you can ignore. It has to be dealt with, otherwise your entire model is useless. The model will fit the training data just fine, but on unseen data it won’t perform to the expectations others have.

Compromising the model’s ability to generalize defeats the whole purpose of creating a machine learning model. That’s why researchers have developed techniques such as L1 and L2 regularization. So, without further ado, let’s talk about L1 regularization, otherwise known as Lasso regression.

L1 Regularization

Before we can understand how L1 regularization works, we need to analyze an equation. Let’s look at the equation for a linear regression model.

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$$

Where:

  1. y: The dependent variable
  2. x: The independent variable used to predict the dependent variable
  3. β: The first β value, β₀, is the bias (intercept) term: the value of y when all the x-values are 0. The rest of the β values represent the change in the dependent variable for a one-unit change in the corresponding independent variable.
  4. ϵ: The error term. This term captures all of the noise and other factors that affect y and are not explained by the linear relationship with the independent variable.

The values for β are chosen using the least squares method, which minimizes the residual sum of squares (RSS):

$$\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

In the equation, $y_i$ is the actual value for the i-th observation, while $\hat{y}_i$ (the y with the caret on top) is its predicted value.
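
As a quick illustration on made-up data (not the dataset used later in this article), the least-squares coefficients and the RSS they minimize can be computed directly with NumPy:

# Sketch: least-squares coefficients and the RSS they minimize, on toy data
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                          # two independent variables
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, 50)  # true coefficients 3 and 2, plus noise

X_design = np.column_stack([np.ones(len(X)), X])      # prepend a column of 1s for the bias term
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)   # least-squares estimate of [β0, β1, β2]

rss = np.sum((y - X_design @ beta) ** 2)              # residual sum of squares
print("beta:", beta)
print("RSS:", rss)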

However, when the predictor variables become highly correlated, multicollinearity becomes a problem, and the model will perform poorly when applied to a new dataset. To solve this issue, we can use L1 regularization and apply a penalty term to the equation:

$$\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert$$

This new term is known as a shrinkage penalty, where 1 ≤ j ≤ p and λ > 0. In this term, we sum the absolute values of all of the coefficients. λ, or lambda, is a tuning parameter that strengthens the effect of the penalty term. As lambda increases, the penalty on larger coefficients becomes more prominent, which drives some of the coefficients down to exactly 0, effectively performing feature selection. The result is a sparse model in which only a few features contribute to the prediction.
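
Here is a small sketch of that idea on made-up data (note that scikit-learn calls the tuning parameter alpha and rescales the loss slightly differently than the equation above, but the behavior is the same): as the penalty grows, more coefficients are driven exactly to 0.

# Sketch: the L1-penalized objective written out, and its feature-selection effect
import numpy as np
from sklearn.linear_model import Lasso

def l1_objective(beta, X, y, lam):
    """RSS plus lambda times the sum of absolute coefficient values (shown for reference)."""
    rss = np.sum((y - X @ beta) ** 2)
    return rss + lam * np.sum(np.abs(beta))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                         # 10 features, most of them irrelevant
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, 100)  # only the first two features matter

for alpha in (0.001, 0.1, 1.0):                        # alpha plays the role of lambda
    n_zero = np.sum(Lasso(alpha=alpha, max_iter=10_000).fit(X, y).coef_ == 0)
    print(f"alpha={alpha}: {n_zero} of 10 coefficients are exactly 0")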

Creating the Model

Let’s design a basic model here for linear regression.

# Imports
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV
import matplotlib.pyplot as plt

np.random.seed(42)

# Generate random data
n_samples = 100
X1 = np.random.rand(n_samples)
X2 = np.random.rand(n_samples)
X3 = X1 + np.random.normal(0, 0.1, n_samples) # X3 is highly correlated with X1
X4 = X2 + np.random.normal(0, 0.1, n_samples) # X4 is highly correlated with X2
X5 = np.random.rand(n_samples) # Irrelevant feature
X6 = np.random.rand(n_samples) # Irrelevant feature
X7 = X2 + np.random.normal(0, 0.1, n_samples) # X7 is highly correlated with X2
X8 = X1 + X2 + np.random.normal(0, 0.1, n_samples) # X8 is correlated with X1 and X2
X9 = np.random.rand(n_samples) # Irrelevant feature
X10 = np.random.rand(n_samples) # Irrelevant feature
X11 = np.random.rand(n_samples) # Irrelevant feature
X12 = np.random.rand(n_samples) # Irrelevant feature
X13 = np.random.rand(n_samples) # Irrelevant feature
X14 = np.random.rand(n_samples) # Irrelevant feature
X15 = np.random.rand(n_samples) # Irrelevant feature
X16 = np.random.rand(n_samples) # Irrelevant feature

# Generating a target variable that depends only on X1 and X2, plus random noise
y = 3 * X1 + 2 * X2 + np.random.normal(0, 1, n_samples)

# Combining into a dataframe
data = pd.DataFrame({
    'X1': X1,
    'X2': X2,
    'X3': X3,
    'X4': X4,
    'X5': X5,
    'X6': X6,
    'X7': X7,
    'X8': X8,
    'X9': X9,
    'X10': X10,
    'X11': X11,
    'X12': X12,
    'X13': X13,
    'X14': X14,
    'X15': X15,
    'X16': X16,
    'y': y
})

# Splitting data into features and target
X = data.drop(columns=['y'])
y = data['y']

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training linear regression model
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
mse_lr = mean_squared_error(y_test, y_pred_lr)

print("Linear Regression Coefficients:", lr.coef_)
print("Linear Regression MSE:", mse_lr)

In regard to the data we created, X3 is highly correlated with X1, and X4 and X7 are highly correlated with X2, which introduces multicollinearity; a linear regression model will struggle to separate their individual effects. Features such as X5, X6, and X9 through X16 are irrelevant and only contribute noise. In a real-world dataset, you’re going to encounter even more of these situations, which makes the analysis much more complex.

What we’re going to analyze is the MSE, or mean squared error. The closer an MSE value is to 0, the better it is. The code outputs:

Linear Regression Coefficients: [ 3.25072566 2.21799104 0.71344211 1.50151585 0.17049475 0.45632903
-0.21593949 -1.00099295 -0.13812986 0.20603788 -0.39050274 0.14718376
-0.78339654 0.81501732 0.27833921 0.5122955 ]
Linear Regression MSE: 1.1321449395570302

Wow! That’s already a really good MSE value. Let’s see if we can lower it even more.

Lasso Model

Using scikit-learn, we can easily implement a lasso regression or L1 regularization model. We can do so as below:

# Tune and train the lasso model (alpha is scikit-learn's name for the lambda penalty strength)
lasso = Lasso()
param_grid = {'alpha': np.logspace(-4, 0, 50)}
grid_search = GridSearchCV(lasso, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Find the best model
best_lasso = grid_search.best_estimator_
y_pred_lasso = best_lasso.predict(X_test)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)

# Results
print("Lasso Regression Coefficients:", best_lasso.coef_)
print("Lasso Regression MSE:", mse_lasso)

This model outputs:

Lasso Regression Coefficients: [ 1.85169128 0.97233989 0.61109015 1.16770492 0. 0.24083797
0. 0.2178444 -0. 0. -0.24803869 0.
-0.62376248 0.6690866 0.13260896 0.2182543 ]
Lasso Regression MSE: 1.0618433653062156

It outputs an MSE of 1.0618433653062156, closer to 0 than without applying L1 regularization! Notice as well that the lasso regression coefficients are much closer to 0 than the linear regression coefficients, with some of them shrunk exactly to 0.

Now once again, keep in mind this isn’t the most complex model. In a real-world dataset, you’ll encounter more features that are closely related and more features that are just noise. In those cases, L1 regularization is a strong choice, as it shrinks the coefficients of unhelpful features toward 0 and drives some of them exactly to 0.
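
If you want to see that shrinkage directly, one option is to trace each lasso coefficient as alpha grows. This sketch reuses the X_train and y_train already defined above, and the same alpha grid we searched over:

# Sketch: lasso coefficient paths on the training data from above
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso

alphas = np.logspace(-4, 0, 50)
coef_path = np.array([
    Lasso(alpha=a, max_iter=10_000).fit(X_train, y_train).coef_
    for a in alphas
])

plt.plot(alphas, coef_path)
plt.xscale('log')
plt.xlabel('alpha (lambda)')
plt.ylabel('coefficient value')
plt.title('Lasso coefficient paths')
plt.show()

# The coefficients of the noisy and redundant features hit exactly 0 first as the penalty grows.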

But, there’s another way to counter this. Let’s transition to L2 regularization, otherwise known as Ridge regression.

L2 Regularization

Once again, before we can get into how we can apply L2 regularization, we need to understand the math behind it.

Let’s reference the equation for linear regression.

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$$

Remember, the values for the coefficients of x are chosen using the least squares method, with the equation for it right here:

$$\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

For L2 regularization, we’re going to add a penalty term, as such:

$$\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$

In this penalty term, lambda acts the same as it did in L1 regularization. It remains the tuning parameter that strengthens the effect of the penalty term. As lambda approaches infinity, the effect becomes greater and the coefficients gradually approach, but do not equal, 0.

Within the summation, we add the squared values of the coefficients to the loss function. This spreads the penalty across all of the weights, leading to smaller and more uniformly distributed coefficients.
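
For the curious, ridge regression even has a closed-form solution, β = (XᵀX + λI)⁻¹Xᵀy. Here is a minimal sketch of it on toy data; treat it purely as an illustration, since scikit-learn’s Ridge uses more robust solvers and leaves the intercept unpenalized.

# Sketch: ridge coefficients from the closed-form solution (X^T X + lambda * I)^(-1) X^T y
import numpy as np

def ridge_coefficients(X, y, lam):
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, 100)

for lam in (0.0, 1.0, 100.0):
    print(f"lambda={lam:>5}: {np.round(ridge_coefficients(X, y, lam), 3)}")

# As lambda grows, every coefficient shrinks toward 0, but none becomes exactly 0.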

Alright, enough math. Let’s get into the model.

Creating the Model

We’ll reuse the same data and baseline linear regression model from the L1 section above, which gave a linear regression MSE of 1.1321449395570302.

Now, let’s code the L2 regularization model. Thanks to scikit-learn, the code is fairly simple:

# Tune and train ridge regression model
ridge = Ridge()
param_grid_ridge = {'alpha': np.logspace(-4, 0, 50)}
grid_search_ridge = GridSearchCV(ridge, param_grid_ridge, cv=5, scoring='neg_mean_squared_error')
grid_search_ridge.fit(X_train, y_train)

# Best model from grid search
best_ridge = grid_search_ridge.best_estimator_
y_pred_ridge = best_ridge.predict(X_test)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)

print("Lasso Regression Coefficients:", best_lasso.coef_)
print("Lasso Regression MSE:", mse_lasso)

This will output:

Ridge Regression Coefficients: [ 1.25034253 0.65926717 0.88205049 0.8464581 0.0471434 0.34281097
0.34180269 0.56652748 -0.05426079 0.03019687 -0.26896752 0.17650297
-0.66223805 0.73950862 0.30905173 0.38244519]
Ridge Regression MSE: 1.016471655923085

With an MSE of 1.016471655923085, it beats out not only the linear regression model but also the lasso regression model! Take note that none of the ridge coefficients are exactly 0, but they are much closer to 0 than the linear regression coefficients.
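
A quick sanity check, reusing the fitted best_lasso and best_ridge objects from above, makes that difference explicit:

# Compare the two fitted models: lasso zeroes some coefficients, ridge only shrinks them
import numpy as np

print("Lasso coefficients exactly 0:", int(np.sum(best_lasso.coef_ == 0)))
print("Ridge coefficients exactly 0:", int(np.sum(best_ridge.coef_ == 0)))
print("Largest |coefficient|, lasso:", float(np.max(np.abs(best_lasso.coef_))))
print("Largest |coefficient|, ridge:", float(np.max(np.abs(best_ridge.coef_))))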

Photo by Pietro Jeng on Unsplash

Comparison

When should L1 or L2 regularization be used?

Great question! Let’s review some guidelines that’ll help you decide which one you need to use.

L1 Regularization (Lasso Regression):

  1. Feature Selection: If you believe that some of the features your model learns from are irrelevant, use L1 regularization to drive the coefficients of the less important features to 0 (a short sketch of this appears after this list).
  2. Sparse Models: A sparse model is one with very few non-zero coefficients, which makes it much simpler; if that is what you want, L1 regularization will produce one.
  3. High-Dimensional Data: Often you’ll encounter datasets with a very large number of features, sometimes exceeding the number of observations. This frequently leads to overfitting, so it is a good idea to employ L1 regularization to drive the coefficients of less important features to 0.
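
For the feature-selection point above, here is a brief sketch using scikit-learn’s SelectFromModel to turn a fitted lasso into a feature selector. It reuses the X_train and y_train from earlier; the alpha value is an arbitrary choice for illustration.

# Sketch: using a fitted Lasso as a feature selector
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

selector = SelectFromModel(Lasso(alpha=0.1, max_iter=10_000))
selector.fit(X_train, y_train)
print("Kept features:", list(X_train.columns[selector.get_support()]))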

L2 Regularization (Ridge Regression):

  1. Multicollinearity: If many of your features are highly correlated, L2 regularization can be effective because, unlike L1 regularization, which tends to pick one of the correlated features and ignore the others, L2 distributes the coefficient values among correlated features.
  2. Numerical Stability: If you’re working with a dataset that includes values that often change, such as the stock market, ridge regression prevents coefficients from becoming too big. This ensures numerical stability in the presence of extreme variance.
  3. Model Stability: If you’re developing a demand forecasting model for a supply chain, data can often be noisy and subject to change. L2 regularization can provide a stable model that is resistant to changes across the dataset, which ensures a solid performance even with older or newer data.

Choosing between L1 or L2 regularization involves knowing the specifics of a dataset and the goals of your model. Lasso regression will be helpful when feature selection or sparse models are important, and ridge regression is particularly helpful for dealing with multicollinearity and model stability.

If you’re still unsure which one you need, the table below summarizes the key differences.

|  | L1 (Lasso) | L2 (Ridge) |
| --- | --- | --- |
| Penalty term | λ times the sum of absolute coefficient values | λ times the sum of squared coefficient values |
| Effect on coefficients | Shrinks some exactly to 0 | Shrinks all toward 0, never exactly to 0 |
| Feature selection | Yes, produces sparse models | No, keeps all features |
| Correlated features | Tends to keep one and drop the rest | Distributes weight among them |
| Typical use cases | High-dimensional data, sparse or interpretable models | Multicollinearity, numerical and model stability |

Concluding Thoughts

Let’s recap: L1 regularization, or lasso regression, adds the absolute values of all of the coefficients as a penalty term, forcing the coefficients to approach 0 or become exactly 0. It’s useful when you want to perform feature selection, have high-dimensional data, or want higher model interpretability.

L2 regularization, or ridge regression, adds the squared values of the coefficients as a penalty term, forcing the coefficients to shrink towards 0. It’s useful when dealing with multicollinearity, wanting to retain all of the features, or wanting higher model stability.

Both of these techniques are a great way to counter overfitting. Take note that they don’t just work with linear regression but also with other machine learning algorithms such as logistic regression, support vector machines, and neural networks.
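
For instance, here is a brief sketch of the same penalties applied to logistic regression, on a synthetic classification dataset from make_classification (in scikit-learn, C is the inverse of the regularization strength, so smaller C means a stronger penalty):

# Sketch: L1 and L2 penalties on logistic regression
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_clf, y_clf = make_classification(n_samples=200, n_features=20,
                                   n_informative=5, random_state=42)

l1_clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X_clf, y_clf)
l2_clf = LogisticRegression(penalty='l2', C=0.1).fit(X_clf, y_clf)

print("L1 zero coefficients:", int((l1_clf.coef_ == 0).sum()))
print("L2 zero coefficients:", int((l2_clf.coef_ == 0).sum()))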

In the future, we’ll discuss more techniques to counter overfitting or other problems in machine learning. But for now, I highly encourage you to try using L1 or L2 regularization in your future endeavors, as the more practice you do with them, the better you’ll understand them. Thank you for reading!
