From Overfitting to Excellence: Harnessing the Power of Regularization
Author(s): Sandeepkumar Racherla
Originally published on Towards AI.
The Role of Regularization: Balancing Complexity and Generalization in Machine Learning
When it comes to Machine Learning, our goal is to find the ML model that makes the best predictions on data it hasn't been trained on.
To do so, we train our ML models on training data and see how they perform in making predictions. Then, we compare the performance of the predictions on the train set and on the test set (that is, the set with new, unseen data) to decide which of the ML models we're testing is the best one for the problem we're facing.
In this article, we'll describe regularization in Machine Learning as a way to avoid overfitting.
The problem of overfitting in Machine Learning
Overfitting is a problem in Machine Learning that occurs when a model becomes "too specialized and focused" on the particular data it's been trained on and can't generalize well.
Let's look at a picture describing the three possible conditions of an ML model: overfitting, underfitting, and a good fit.
So, overfitting is like students memorizing the answers to specific questions: odds are that they'll fail the test because they didn't understand the subject. They just remembered how to answer some questions.
On the performance side, what generally happens is that we can tell an ML model is overfitting when the metric we've chosen to evaluate it is very high (near 100%) on the train set but significantly lower on the test set.
So, overfitting, as well as underfitting, is something that we, as Data Scientists, absolutely need to avoid.
A way to avoid overfitting is through regularization. Let's see how.
Resolving overfitting through regularization
Regularization adds a penalty term to the loss function during the training of the model: this "discourages" the model from becoming too complex.
The idea is that by limiting the modelβs ability to fit the training data too closely, we can also reduce its ability to fit noise or random variations in the data. Thus, we avoid overfitting.
So, suppose we have a cost function defined as follows:
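In the generic notation we'll use here (a loss written as a function of the parameters and the data; the exact form, e.g. the MSE, is not important yet), the cost function can be expressed as:

J(\theta; x, y)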
Where we have:
- `theta` is the estimator (the parameters of our model).
- `x` and `y` are, respectively, the features and the labels of our model.
If we define the regularization parameter as `lambda` and the regularization function as `omega`, then the regularized objective function is:
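With these definitions, the regularized objective can be written (in its standard form) as:

\tilde{J}(\theta; x, y) = J(\theta; x, y) + \lambda \, \Omega(\theta)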
So, the function to minimize is now the sum of the cost function and a regularization function.
The regularization function and the regularization parameter can vary, depending on the type of regularization we want to use.
In the following paragraphs, we'll describe the two most commonly used regularization functions in Machine Learning.
Describing Lasso and Ridge regularization
The Lasso regularized model
Lasso (least absolute shrinkage and selection operator) regularization performs an L1 regularization, which adds a penalty equal to the absolute value of the magnitude of the coefficients. This regularization method drives the weights closer to the origin by adding a regularization term:
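For L1 regularization, this penalty term is the L1 norm of the weight vector, that is, the sum of the absolute values of the weights:

\Omega(w) = \lVert w \rVert_1 = \sum_j \lvert w_j \rvert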
We can add an L1 regularization penalty, for example, to the MSE (or to any other cost function we'd like) like so:
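Using the MSE as the base cost function and denoting the model weights by w_j (the remaining quantities are defined just below), one common way to write the L1-regularized objective is:

\tilde{J}(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_j \lvert w_j \rvert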
where we have:
- `n` is the total number of data points we're studying
- `y_i` are the actual values
- `hat y_i` are the values predicted by our model
The geometrical interpretation of Lasso regularization
Lasso regularization, under the hood, performs what we call feature selection. Let's give a graphical interpretation of Lasso regularization to understand this fact. For the sake of simplicity, we'll create a 2-D drawing with two weights, w_1 and w_2:
The orange concentric ellipses are the geometrical representation of the cost function we've chosen (the MSE, in our case), while the light blue rhombus is the geometrical interpretation of the penalty function (L1).
Our goal is to minimize the loss function; as we can see from the above illustration, the ellipses can be tangent to the rhombus only at one of its corners. At these corners, one of the weights is 0. The tangency may be better visualized in the following image, because the minimization occurs where the two functions are tangent, not where they intersect:
Now, imagine we repeat this process for all the weights w_i; in the end, we'll have what we call a sparse model, which means a model with fewer features than the initial ones, because Lasso regularization can shrink some of the weights exactly to 0. This means that our model has performed feature selection, dropping some features (their weights have been set to 0, simplifying the initial model).
Mathematically speaking, remembering that we have to calculate the gradient of the function to minimize it, we should calculate the following:
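For the L1 penalty written above, the (sub)gradient of the regularized objective with respect to the weights takes the standard form:

\nabla_w \tilde{J}(w) = \nabla_w J(w) + \lambda \, \mathrm{sign}(w)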
In this context, sparsity refers to the fact that some parameters have an optimal value of exactly zero. In other words, the solution of this equation can set some of the weights exactly to 0, effectively dropping the corresponding features: this is what we call feature selection.
The Ridge regularized model
The Ridge model performs an L2 regularization, which adds a penalty equal to the square of the magnitude of the coefficients. This regularization strategy drives the weights w_i closer to the origin by adding a regularization term.
We can add an L2 regularization penalty, for example, to the MSE (or to any other cost function we'd like) like so:
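Using the same notation as in the Lasso case, one common way to write the L2-regularized MSE is:

\tilde{J}(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_j w_j^2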
The geometrical interpretation of Ridge regularization
Let's give a graphical interpretation of Ridge regularization as well. For the sake of simplicity, we'll again create a 2-D drawing with two weights, w_1 and w_2:
In contrast to Lasso regularization, Ridge does not perform feature selection. Ridge decreases the complexity of the model without reducing the number of independent variables because it never shrinks a coefficient exactly to 0. Let's visualize it with the help of the following image:
If we look at the above image, we can see that the ellipses can be tangent to the circle (the geometrical interpretation of the L2 penalty) anywhere, which means that the model penalizes large weights without shrinking them all the way to 0. This means that the final model will include all the independent variables.
Also, in this case, remembering that we have to calculate the gradient of the function to minimize it, we should calculate the following:
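For the L2 penalty written above, the gradient of the regularized objective with respect to the weights is:

\nabla_w \tilde{J}(w) = \nabla_w J(w) + 2\lambda w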
The difference with Lasso regularization is that this gradient shrinks every weight proportionally but never sets it exactly to 0, and this is why Ridge does not perform feature selection.
Implementing Ridge and Lasso regularized models in Python
So, let's see how we can implement both these regularized methods.
We start with the Lasso regularization:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.pipeline import make_pipeline
# Ignore warnings (e.g., convergence warnings raised by Lasso)
import warnings
warnings.filterwarnings('ignore')
# Create a dataset
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)
# Scale the features
scaler = StandardScaler()
X = scaler.fit_transform(X)
# Subdivide the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create 6-degree polynomial features from the train data
poly = PolynomialFeatures(degree=6)
X_train_poly = poly.fit_transform(X_train)
# Fit a plain (non-regularized) linear regression on the polynomial features
poly_reg = LinearRegression()
poly_reg.fit(X_train_poly, y_train)
# Evaluate the polynomial on the train and test data
X_test_poly = poly.transform(X_test)
y_pred_train = poly_reg.predict(X_train_poly)
y_pred_test = poly_reg.predict(X_test_poly)
r2_score_poly_train = r2_score(y_train, y_pred_train)
r2_score_poly_test = r2_score(y_test, y_pred_test)
print(f'R-squared of 6-degree polynomial on train set: {r2_score_poly_train: .3f}')
print(f'R-squared of 6-degree polynomial on test set: {r2_score_poly_test: .3f}')
# Fit a regularized 6-degree polynomial to the train data
lasso_reg = make_pipeline(PolynomialFeatures(degree=6), Lasso(alpha=1))
lasso_reg.fit(X_train, y_train)
# Evaluate the regularized polynomial on the test data
y_pred_lasso = lasso_reg.predict(X_test)
r2_score_lasso = r2_score(y_test, y_pred_lasso)
print(f'R-squared of regularized 6-degree polynomial on test set: {r2_score_lasso: .3f}')
And we get:
R-squared of 6-degree polynomial on train set: 1.000
R-squared of 6-degree polynomial on test set: -84.110
R-squared of regularized 6-degree polynomial on test set: 0.827
So, as we can see, the "standard" 6-degree polynomial overfits because it has an R² of 1 on the train set and a negative one on the test set.
Instead, the 6-degree regularized model has an acceptable R² on the test set, meaning it has good generalization performance. Thus, regularization has improved the performance of the non-regularized model.
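To see the feature selection discussed earlier in action, we can also count how many polynomial coefficients the Lasso step shrank exactly to zero. The following is a minimal sketch that reuses the `lasso_reg` pipeline fitted above; `named_steps['lasso']` is the default step name assigned by `make_pipeline`:

# Access the fitted Lasso step inside the pipeline
lasso_step = lasso_reg.named_steps['lasso']
# Count the coefficients shrunk exactly to zero
n_zero = int(np.sum(lasso_step.coef_ == 0))
n_total = lasso_step.coef_.size
print(f'Lasso set {n_zero} of {n_total} polynomial coefficients exactly to 0')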
Now, let's make a similar example using the Ridge model:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
# Create a dataset
X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=42)
# Scale the features
scaler = StandardScaler()
X = scaler.fit_transform(X)
# Subdivide the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create 6-degree polynomial features from the train data
poly = PolynomialFeatures(degree=6)
X_train_poly = poly.fit_transform(X_train)
# Fit a plain (non-regularized) linear regression on the polynomial features
poly_reg = LinearRegression()
poly_reg.fit(X_train_poly, y_train)
# Evaluate the polynomial on the train and test data
X_test_poly = poly.transform(X_test)
y_pred_train = poly_reg.predict(X_train_poly)
y_pred_test = poly_reg.predict(X_test_poly)
r2_score_poly_train = r2_score(y_train, y_pred_train)
r2_score_poly_test = r2_score(y_test, y_pred_test)
print(f'R-squared of 6-degree polynomial on train set: {r2_score_poly_train}')
print(f'R-squared of 6-degree polynomial on test set: {r2_score_poly_test}')
# Fit a regularized 6-degree polynomial to the train data
ridge_reg = make_pipeline(PolynomialFeatures(degree=6), Ridge(alpha=1))
ridge_reg.fit(X_train, y_train)
# Evaluate the regularized polynomial on the test data
y_pred_ridge = ridge_reg.predict(X_test)
r2_score_ridge = r2_score(y_test, y_pred_ridge)
print(f'R-squared of regularized 6-degree polynomial on test set: {r2_score_ridge}')
And we get:
R-squared of 6-degree polynomial on train set: 1.0
R-squared of 6-degree polynomial on test set: -1612.4842791834997
R-squared of regularized 6-degree polynomial on test set: 0.9266258222037977
So, Ridge regularization, even in this case, has increased the performance of the non-regularized polynomial model. It also scores higher than the Lasso model above, though keep in mind that the two examples use different synthetic datasets, so the scores are not directly comparable.
Conclusions
In this article, we've explained the problem of overfitting and how to solve it through regularization.
But when should we use one regularized model rather than the other? As a rule of thumb:
- We'd better use Ridge regularization when the features are highly correlated. This is why it is important to study the correlation matrix before deciding which features to delete from the study of our problem.
- Lasso regularization helps reduce the number of features in a dataset: it automatically selects a subset of the most important features while shrinking the coefficients of the less important ones to zero. So, Lasso regularization is useful:
- For datasets with high dimensionality.
- When a significant number of features are correlated.
- To impose sparsity on the model coefficients.