Counter Overfitting with L1 and L2 Regularization
Author(s): Eashan Mahajan
Originally published on Towards AI.
Overfitting. A modeling error many of us have encountered or will encounter while training a model. Simply put, overfitting is when a model learns the details and noise of a dataset to the extent that it negatively impacts performance on new data. An overfitted model will do well on the training dataset but poorly on unseen data, leading to weak performance on test and validation sets.
To counter this, researchers have created several techniques, two of which are known as L1 and L2 regularization. L1 regularization (Lasso regression) adds the absolute values of the coefficients as a penalty term to the loss function. L2 regularization (Ridge regression) adds the squared values of the coefficients as a penalty term to the loss function. In this article, we'll explore how both regularization techniques work, how to use them, and the benefits and disadvantages of each.
Introduction
Causes of Overfitting
As stated above, overfitting occurs when the model learns too much about the details and noise of a dataset or is overtrained on it. Because most of the fine details and noise within a training dataset don't occur in real-world data, the model ends up performing poorly on unseen data. Overfitting is caused by a variety of factors (a short sketch after the list below illustrates the resulting train/test gap). Those can be:
- Model Complexity: When a model has too many parameters, it will fit the data too closely and can latch onto irrelevant details. Likewise, using complex models such as neural networks or ensemble methods without collecting enough data can lead to overfitting.
- Training Data: If the training data is insufficient, a model will easily memorize the training examples, including the noise and outliers present, instead of looking for the underlying patterns.
- Improper Feature Selection: If the data hasn't been properly pre-processed before letting the model train on it, there will be features that don't have a causal relationship with the target variable.
- Training Time: Running too many iterations or epochs will force the model to start fitting the noise within the training data.
- Noise: Datasets that contain lots of noise can mislead the model and cause it to learn random and incorrect patterns.
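To see what this looks like in practice, here is a minimal, self-contained sketch (a hypothetical toy example, not part of this article's main dataset) that fits both a simple linear model and a degree-15 polynomial to the same noisy data. The high-degree model drives its training error toward zero while its test error blows up, which is overfitting in a nutshell.
# Toy illustration of overfitting: the degree-15 polynomial memorizes the
# training points (tiny train MSE) but generalizes far worse than a linear fit.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = 0.5 * X.ravel() + rng.normal(0, 0.5, size=40)  # truly linear signal plus noise
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")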
Why Care?
Often you'll see people ignore overfitting and ship their model off to production, only to face dozens of complaints from angry customers. Overfitting isn't something you can ignore. It has to be dealt with; otherwise, your entire model is useless. You'll be able to analyze the training data just fine, but on unseen data your model won't be able to perform to the expectations others have.
Compromising the model's ability to generalize goes against the whole purpose of creating a machine-learning model. That's why researchers have developed techniques such as L1 and L2 regularization. So, without further ado, let's talk about L1 regularization, otherwise known as Lasso regression.
L1 Regularization
Before we can understand how L1 regularization works, we need to analyze an equation. Let's look at the equation for a linear regression model:
y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε
Where:
- y: The dependent variable
- x: The independent variable used to predict the dependent variable
- β: The first β value (β₀) is the bias term, the value of y when all the x-values are 0. The rest of the β values represent the change in the dependent variable when there is a one-unit change in the corresponding independent variable.
- ε: The error term. This term captures all of the noise and other factors that affect y and are not explained by the linear relationship with the independent variables.
The values for β are chosen using the least squares method, which minimizes the residual sum of squares (RSS):
RSS = Σᵢ (yᵢ − ŷᵢ)²
In the equation, yᵢ is the actual value, while ŷᵢ (the y with the caret on top) is the predicted value for the i-th observation, and the sum runs over all n observations.
However, when the predictor variables become highly correlated, multicollinearity becomes a problem: the model will perform poorly when applied to a new dataset. To solve this issue, we can use L1 regularization and add a penalty term to the equation, so that we minimize:
RSS + λ Σⱼ |βⱼ|, where 1 ≤ j ≤ p and λ > 0
This new term is known as a shrinkage penalty. In it, we add up the absolute values of all of the coefficients. λ (lambda) is a tuning parameter that strengthens the effect of the penalty term. As lambda increases, the penalty on large coefficients becomes more prominent, which drives some of the coefficients all the way down to 0, effectively performing feature selection. The result is a sparse model in which only a few features contribute to the prediction.
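To make that concrete, here is a small hedged sketch (on synthetic data, separate from the article's main example) showing how increasing scikit-learn's alpha parameter, which plays the role of λ, zeroes out more and more coefficients:
# Sketch: as alpha (the lambda tuning parameter) grows, Lasso sets more coefficients to exactly 0.
import numpy as np
from sklearn.linear_model import Lasso
rng = np.random.RandomState(0)
X_demo = rng.randn(100, 10)
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.randn(100)  # only two informative features
for alpha in (0.001, 0.1, 1.0):
    lasso_demo = Lasso(alpha=alpha).fit(X_demo, y_demo)
    n_zero = np.sum(lasso_demo.coef_ == 0)
    print(f"alpha={alpha}: {n_zero} of 10 coefficients are exactly 0")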
Creating the Model
Let's design a basic linear regression model.
# Imports
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV
import matplotlib.pyplot as plt
np.random.seed(42)
# Generate random data
n_samples = 100
X1 = np.random.rand(n_samples)
X2 = np.random.rand(n_samples)
X3 = X1 + np.random.normal(0, 0.1, n_samples) # X3 is highly correlated with X1
X4 = X2 + np.random.normal(0, 0.1, n_samples) # X4 is highly correlated with X2
X5 = np.random.rand(n_samples) # Irrelevant feature
X6 = np.random.rand(n_samples) # Irrelevant feature
X7 = X2 + np.random.normal(0, 0.1, n_samples) # X7 is highly correlated with X2
X8 = X1 + X2 + np.random.normal(0, 0.1, n_samples) # X8 is correlated with X1 and X2
X9 = np.random.rand(n_samples) # Irrelevant feature
X10 = np.random.rand(n_samples) # Irrelevant feature
X11 = np.random.rand(n_samples) # Irrelevant feature
X12 = np.random.rand(n_samples) # Irrelevant feature
X13 = np.random.rand(n_samples) # Irrelevant feature
X14 = np.random.rand(n_samples) # Irrelevant feature
X15 = np.random.rand(n_samples) # Irrelevant feature
X16 = np.random.rand(n_samples) # Irrelevant feature
# Generate the target variable: it depends only on X1 and X2, plus noise
y = 3 * X1 + 2 * X2 + np.random.normal(0, 1, n_samples)
# Combining into a dataframe
data = pd.DataFrame({
'X1': X1,
'X2': X2,
'X3': X3,
'X4': X4,
'X5': X5,
'X6': X6,
'X7': X7,
'X8': X8,
'X9': X9,
'X10': X10,
'X11': X11,
'X12': X12,
'X13': X13,
'X14': X14,
'X15': X15,
'X16': X16,
'y': y
})
# Splitting data into features and target
X = data.drop(columns=['y'])
y = data['y']
# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Training linear regression model
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
mse_lr = mean_squared_error(y_test, y_pred_lr)
print("Linear Regression Coefficients:", lr.coef_)
print("Linear Regression MSE:", mse_lr)
Looking at the data we created: X3 is highly correlated with X1, and X4 and X7 are highly correlated with X2, which introduces multicollinearity; a linear regression model will struggle to determine their individual effects. Features such as X5, X6, and X9 through X16 are irrelevant to the target and only add noise. In a real-world dataset, you're going to encounter more of these situations, which makes the required analysis much more complex.
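If you want to verify that multicollinearity yourself, one quick optional check is the correlation matrix of the engineered features (using the data DataFrame built above):
# Optional sanity check: pairwise correlations among the engineered features.
# X3 should correlate strongly with X1; X4 and X7 with X2; X8 with both.
print(data[['X1', 'X2', 'X3', 'X4', 'X7', 'X8']].corr().round(2))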
What we're going to analyze is the MSE, or mean squared error. The closer an MSE value is to 0, the better it is. The code outputs:
Linear Regression Coefficients: [ 3.25072566 2.21799104 0.71344211 1.50151585 0.17049475 0.45632903
-0.21593949 -1.00099295 -0.13812986 0.20603788 -0.39050274 0.14718376
-0.78339654 0.81501732 0.27833921 0.5122955 ]
Linear Regression MSE: 1.1321449395570302
Wow! That's already a really good MSE value. Let's see if we can lower it even more.
Lasso Model
Using scikit-learn, we can easily implement a lasso regression or L1 regularization model. We can do so as below:
# Train the model
lasso = Lasso()
param_grid = {'alpha': np.logspace(-4, 0, 50)}
grid_search = GridSearchCV(lasso, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
# Find the best model
best_lasso = grid_search.best_estimator_
y_pred_lasso = best_lasso.predict(X_test)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
# Results
print("Lasso Regression Coefficients:", best_lasso.coef_)
print("Lasso Regression MSE:", mse_lasso)
This model outputs:
Lasso Regression Coefficients: [ 1.85169128 0.97233989 0.61109015 1.16770492 0. 0.24083797
0. 0.2178444 -0. 0. -0.24803869 0.
-0.62376248 0.6690866 0.13260896 0.2182543 ]
Lasso Regression MSE: 1.0618433653062156
It outputs an MSE of 1.0618433653062156, closer to 0 than without applying L1 regularization! Notice as well that the lasso regression coefficients are much closer to 0 than the coefficients for linear regression, with some of them being exactly 0.
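If you're curious which penalty strength the grid search settled on, and exactly which features were zeroed out, a short follow-up using the objects defined above looks like this:
# Inspect the tuned penalty strength and the features Lasso eliminated.
print("Best alpha:", grid_search.best_params_['alpha'])
zeroed = [col for col, coef in zip(X.columns, best_lasso.coef_) if coef == 0]
print("Features with zero coefficients:", zeroed)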
Now once again, keep in mind this isn't the most complex model. In a real-world dataset, you'll encounter more features that are closely related, and more features that are just noise. In that case, L1 regularization is the perfect solution, as it will shrink the coefficients of the features that only hurt generalization.
But, there's another way to counter this. Let's transition to L2 regularization, otherwise known as Ridge regression.
L2 Regularization
Once again, before we can get into how we can apply L2 regularization, we need to understand the math behind it.
Let's reference the equation for linear regression:
y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε
Remember, the values for the coefficients are chosen using the least squares method, which minimizes the residual sum of squares:
RSS = Σᵢ (yᵢ − ŷᵢ)²
For L2 regularization, we're going to add a penalty term, as such:
RSS + λ Σⱼ βⱼ², where 1 ≤ j ≤ p and λ > 0
In this penalty term, lambda acts the same as it did in L1 regularization. It remains the tuning parameter that strengthens the effect of the penalty term. As lambda approaches infinity, the effect becomes greater and the coefficients gradually approach, but do not equal, 0.
Within the summation notation, we add the squared values of the coefficients to the loss function. This spreads the shrinkage across all of the weights, leading to smaller and more uniformly distributed coefficients.
Alright, enough math. Let's get into the model.
Creating the Model
We'll reuse the same data and baseline linear regression model we designed above, which produced the coefficients and MSE shown earlier.
Now, let's code the L2 regularization model. Thanks to scikit-learn, the code is fairly simple:
# Tune and train ridge regression model
ridge = Ridge()
param_grid_ridge = {'alpha': np.logspace(-4, 0, 50)}
grid_search_ridge = GridSearchCV(ridge, param_grid_ridge, cv=5, scoring='neg_mean_squared_error')
grid_search_ridge.fit(X_train, y_train)
# Best model from grid search
best_ridge = grid_search_ridge.best_estimator_
y_pred_ridge = best_ridge.predict(X_test)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
print("Ridge Regression Coefficients:", best_ridge.coef_)
print("Ridge Regression MSE:", mse_ridge)
This will output:
Ridge Regression Coefficients: [ 1.25034253 0.65926717 0.88205049 0.8464581 0.0471434 0.34281097
0.34180269 0.56652748 -0.05426079 0.03019687 -0.26896752 0.17650297
-0.66223805 0.73950862 0.30905173 0.38244519]
Ridge Regression MSE: 1.016471655923085
With an MSE of 1.016471655923085, it beats out not only the linear regression model but also the lasso regression model! Take note that none of the coefficients for L2 regularization are 0, but they are much closer to 0 than the coefficients for linear regression.
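To see that shrink-but-never-zero behavior directly, here is a small optional sweep over alpha, reusing the training data from above; even with a very large penalty, the Ridge coefficients get tiny but stay non-zero:
# Sketch: Ridge coefficients shrink toward 0 as alpha grows, but never reach exactly 0.
for alpha in (0.01, 1, 100, 10000):
    ridge_demo = Ridge(alpha=alpha).fit(X_train, y_train)
    max_coef = np.abs(ridge_demo.coef_).max()
    n_zero = np.sum(ridge_demo.coef_ == 0)
    print(f"alpha={alpha}: max |coef| = {max_coef:.4f}, zero coefficients = {n_zero}")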
Comparison
When should L1 or L2 regularization be used?
Great question! Let's review some guidelines that'll help you decide which one you need to use.
L1 Regularization (Lasso Regression):
- Feature Selection: If you believe that some of the features your model learns from are irrelevant, use L1 regularization to drive the coefficients of the less important features to 0.
- Sparse Models: Sparse models are those that possess very few non-zero coefficients, which makes them much simpler. If you want one, L1 regularization will produce a sparse model.
- High-Dimensional Data: You'll often encounter datasets with a number of features that exceeds the number of observations. This frequently leads to overfitting, so it is a good idea to employ L1 regularization to set the coefficients of less important features to 0.
L2 Regularization (Ridge Regression):
- Multicollinearity: If many of your features are highly correlated, L2 regularization can be effective because, unlike L1 regularization, which tends to pick one feature from a correlated group and ignore the others, L2 distributes coefficient values among the correlated features.
- Numerical Stability: If you're working with a dataset whose values often change, such as the stock market, ridge regression prevents coefficients from becoming too big. This ensures numerical stability in the presence of extreme variance.
- Model Stability: If you're developing a demand forecasting model for a supply chain, data can often be noisy and subject to change. L2 regularization can provide a stable model that is resistant to changes across the dataset, which ensures solid performance even with older or newer data.
Choosing between L1 and L2 regularization involves knowing the specifics of your dataset and the goals of your model. Lasso regression is helpful when feature selection or sparse models are important, and ridge regression is particularly helpful for dealing with multicollinearity and model stability.
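One practical way to build that intuition is to line the three sets of coefficients up next to each other, reusing the fitted models from earlier:
# Compare how each model treats the same features, side by side.
comparison = pd.DataFrame({
    'feature': X.columns,
    'linear': lr.coef_,
    'lasso': best_lasso.coef_,
    'ridge': best_ridge.coef_,
})
print(comparison.round(3))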
If you're still confused about which one you need, the table below should hopefully make it simpler to understand.

| | L1 Regularization (Lasso) | L2 Regularization (Ridge) |
| --- | --- | --- |
| Penalty term | Sum of absolute values of coefficients | Sum of squared coefficients |
| Effect on coefficients | Shrinks some coefficients to exactly 0 | Shrinks all coefficients toward, but never to, 0 |
| Feature selection | Yes, produces sparse models | No, retains all features |
| Best suited for | Irrelevant features, high-dimensional data | Multicollinearity, numerical and model stability |
Concluding Thoughts
Let's recap: L1 regularization, or lasso regression, adds the absolute values of all of the coefficients as a penalty term, forcing the coefficients to approach 0 or become exactly 0. It's useful when you want to perform feature selection, have high-dimensional data, or want higher model interpretability.
L2 regularization, or ridge regression, adds the squared values of the coefficients as a penalty term, forcing the coefficients to shrink towards 0. It's useful when dealing with multicollinearity, wanting to retain all of the features, or wanting higher model stability.
Both of these techniques are a great way to counter overfitting. Take note that they don't just work on linear regression but also on other machine learning algorithms such as logistic regression, support vector machines, decision trees, etc.
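For instance, scikit-learn's LogisticRegression exposes the same idea through its penalty parameter. Here is a brief sketch on made-up classification data (C is the inverse of the regularization strength, so smaller C means a stronger penalty):
# Sketch: the same L1/L2 penalties applied to logistic regression.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X_clf, y_clf = make_classification(n_samples=200, n_features=20, n_informative=4, random_state=42)
l1_clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X_clf, y_clf)
l2_clf = LogisticRegression(penalty='l2', C=0.1).fit(X_clf, y_clf)
print("L1 zero coefficients:", int((l1_clf.coef_ == 0).sum()))
print("L2 zero coefficients:", int((l2_clf.coef_ == 0).sum()))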
In the future, we'll discuss more techniques to counter overfitting and other problems in machine learning. But for now, I highly encourage you to try using L1 or L2 regularization in your future endeavors, as the more practice you get with them, the better you'll understand them. Thank you for reading!