From Overfitting to Excellence: Harnessing the Power of Regularization

Last Updated on August 7, 2023 by Editorial Team

Author(s): Sandeepkumar Racherla

Originally published on Towards AI.

The Role of Regularization: Balancing Complexity and Generalization in Machine Learning

Source: Image by Andrew Martin on Pixabay

When it comes to Machine Learning, our goal is to find the ML model that makes the best predictions on data it hasn’t been trained on.

To do so, we train our ML models on training data and measure how well they predict. We then compare each model’s performance on the train set with its performance on the test set (the set of new, unseen data) to decide which of the candidate models best solves the problem we’re facing.

In this article, we’ll describe regularization in Machine Learning as a way to avoid overfitting.

The problem of overfitting in Machine Learning

Overfitting occurs when a model becomes “too specialized and focused” on the particular data it was trained on and therefore can’t generalize well to new data.

Let’s show a picture describing the three conditions in ML: overfitting, underfitting, and good fit.

The three conditions of fitting the data in ML. Source: Image by Author.

So, overfitting is like students memorizing the answers to specific questions: odds are that they’ll fail the test because they didn’t understand the subject. They just remembered how to answer some questions.

In terms of performance, a typical sign that an ML model is overfitting is that the metric we’ve chosen to evaluate it is very high (near 100%) on the train set but much lower on the test set.
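One way to check for overfitting in practice is to compute the same metric on both sets and look at the gap between them. The following is a minimal sketch with NumPy only; the synthetic sine data, the even/odd split, and the helper names `r2` and `train_test_gap` are assumptions made for illustration, not part of the original article.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination (R^2)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def train_test_gap(degree, x_train, y_train, x_test, y_test):
    """Fit a polynomial of the given degree and return (train R^2, test R^2)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_score = r2(y_train, np.polyval(coeffs, x_train))
    test_score = r2(y_test, np.polyval(coeffs, x_test))
    return train_score, test_score

# Hypothetical toy data: a noisy sine curve
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=x.size)

# Even-indexed points for training, odd-indexed points for testing
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

for degree in (1, 3, 12):
    train_score, test_score = train_test_gap(degree, x_train, y_train, x_test, y_test)
    print(f"degree={degree:2d}  train R2={train_score: .3f}  test R2={test_score: .3f}")
```

A train score that keeps rising with the degree while the test score stalls or drops is the pattern to watch for.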

So, overfitting — as well as underfitting — is something that we, as Data Scientists, absolutely need to avoid.

A way to avoid overfitting is through regularization. Let’s see how.

Resolving overfitting through regularization

Regularization adds a penalty term to the loss function during the training of the model: this “discourages” the model from becoming too complex.

The idea is that by limiting the model’s ability to fit the training data too closely, we can also reduce its ability to fit noise or random variations in the data. Thus, we avoid overfitting.

So, suppose we have a cost function defined as follows:

Our cost function. Source: Image by Author.

Where we have:

  • `Theta` is the set of parameters (weights) that the model estimates.
  • `x` and `y` are, respectively, the features and the labels of our data.

If we define the regularization parameter as `lambda` and the regularization function as `omega`, then the regularized objective function is:

The regularized objective function. Source: Image by Author.
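The two formulas referred to above were images in the original and are not reproduced here. A notation consistent with the surrounding description would be the following, where the symbols J for the cost function and J with a tilde for the regularized objective are assumptions:

```latex
\begin{align}
  &J(\theta; x, y)
    && \text{(cost function)} \\
  &\tilde{J}(\theta; x, y) = J(\theta; x, y) + \lambda \, \Omega(\theta)
    && \text{(regularized objective)}
\end{align}
```

With `lambda` equal to 0 the penalty disappears and we are back to the original cost function; larger values of `lambda` give the penalty more weight.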

So, the function to study now is the sum of the cost function plus a regularization function.

The regularization function and the regularization parameter can vary, depending on the type of regularization we want to use.

In the following section, we’ll describe the two regularization functions most commonly used in Machine Learning.

Describing Lasso and Ridge regularization

The Lasso regularized model

Lasso (least absolute shrinkage and selection operator) regularization performs an L1 regularization, which adds a penalty equal to the sum of the absolute values of the coefficients. This regularization method drives the weights closer to the origin by adding a regularization term:

The regularization term for the Lasso model. Source: Image by Author.

We can add an L1 regularization penalty, for example to the MSE (or to any other cost function we’d like) like so:

The Mean Squared Error (MSE) metric with Lasso regularization term. Source: Image by Author.

where we have:

  • `n` is the total number of data points we’re studying
  • `y_i` are the actual values
  • `ŷ_i` (“y hat”) are the values predicted by our model
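The Lasso formulas above were images in the original. A standard way to write them, assuming the weights are indexed as w_j and the regularization parameter is written as lambda, is:

```latex
\begin{align}
  \Omega(\theta) &= \lVert w \rVert_1 = \sum_{j} \lvert w_j \rvert \\
  \mathrm{MSE}_{\text{Lasso}} &= \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
    + \lambda \sum_{j} \lvert w_j \rvert
\end{align}
```

Libraries may scale these terms differently (for example, with an extra constant factor in front of the squared error), so the value of `lambda` is not always comparable across implementations.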

The geometrical interpretation of Lasso regularization

Lasso regularization, under the hood, performs what we call feature selection. Let’s give a graphical interpretation of Lasso regularization to understand why. For the sake of simplicity, we’ll create a 2-D drawing with two weights, w_1 and w_2:

The geometrical interpretation of Lasso regularization. Source: Image by Author.

The orange concentric ellipses are the contour lines of the cost function we’ve chosen (MSE in our case), while the light blue rhombus is the geometrical interpretation of the penalty function (L1).

Our goal is to minimize the loss function; as we can see from the above illustration, the ellipses tend to touch the rhombus at one of its corners. At these corners, one of the weights is 0. The following image shows this more clearly, because the minimum lies where the two shapes just touch (the tangency), not where they intersect:

The tangency between the loss function and the penalty function. Source: Image by Author.

Now, imagine we repeat this process for all the weights w_i. In the end, we’ll have what we call a sparse model, that is, a model with fewer features than the initial one, because Lasso regularization drives the weights of the least useful features exactly to 0. This means that our model has performed feature selection, dropping some features (their weights have been set to 0, which simplifies the initial model).
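The sparsity described above can be observed by counting the coefficients that end up exactly at zero. This is a small sketch, assuming a synthetic dataset in which only 5 of 20 features are informative; the dataset settings and `alpha=1.0` are arbitrary choices for illustration, and an unregularized `LinearRegression` is fitted only as a reference.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

# Hypothetical dataset: 20 features, only 5 of which actually drive the target
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0
)

# Fit an unregularized model and a Lasso model on the same data
ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# Count the coefficients that are exactly zero in each model
n_zero_ols = int(np.sum(ols.coef_ == 0.0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))

print(f"Zero coefficients without regularization: {n_zero_ols}")
print(f"Zero coefficients with Lasso:             {n_zero_lasso}")
```

The features whose Lasso coefficient is zero no longer contribute to the predictions, which is the feature selection effect in practice.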

Mathematically speaking, remembering that we have to calculate the gradient of the function to minimize it, we should calculate the following:

The formula of Lasso regularization with the gradient. Source: Image by Author.

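In this context, sparsity refers to the fact that some parameters have an optimal value of zero. The L1 term contributes a push of constant size (`lambda` times the sign of each weight) no matter how small the weight already is, so the minimum can sit exactly at 0 for some weights: the corresponding features are dropped, and this is called feature selection.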
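To see how an L1 penalty can produce exact zeros, it may help to look at soft-thresholding, the update commonly used by L1 solvers such as coordinate descent. This is a swapped-in illustration, not the formula from the article’s image; the function names and the sample weights are hypothetical.

```python
import numpy as np

def l1_penalty_gradient(w, lam):
    """(Sub)gradient of lam * sum(|w|): same magnitude lam for every nonzero weight."""
    return lam * np.sign(w)

def soft_threshold(w, threshold):
    """Shrink every weight toward 0 by `threshold`; weights smaller than that become 0."""
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

# Hypothetical weights: two large ones and two small ones
w = np.array([3.0, -0.4, 0.05, -2.0])

print(l1_penalty_gradient(w, lam=0.5))  # the push does not depend on the weight size
print(soft_threshold(w, threshold=0.5)) # the two small weights are set to zero
```

The large weights are only reduced by the threshold, while the weights whose magnitude is below the threshold are removed entirely.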

The Ridge regularized model

The Ridge model performs an L2 regularization, which adds a penalty equal to the sum of the squared coefficients. This regularization strategy drives the weights w_i closer to the origin by adding a regularization term:

The regularization term for the Ridge model. Source: Image by Author.

We can add an L2 regularization penalty, for example, to the MSE (or to any other cost function we’d like) like so:

The Mean Squared Error (MSE) metric with Ridge regularization term. Source: Image by Author.
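The Ridge formulas above were images in the original. A standard way to write them, assuming the weights are indexed as w_j and the regularization parameter is written as lambda, is shown below; some texts put a factor of 1/2 in front of the penalty, which only rescales `lambda`.

```latex
\begin{align}
  \Omega(\theta) &= \lVert w \rVert_2^2 = \sum_{j} w_j^2 \\
  \mathrm{MSE}_{\text{Ridge}} &= \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
    + \lambda \sum_{j} w_j^2
\end{align}
```

Here `n`, `y_i` and the predicted values have the same meaning as in the Lasso case.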

The geometrical interpretation of Ridge regularization

Let’s give a graphical interpretation of Ridge regularization as well. For the sake of simplicity, we’ll create a 2-D drawing with two weights, w_1 and w_2:

The geometrical interpretation of Ridge regularization. Source: Image by Author.

In contrast to Lasso regularization, Ridge does not perform feature selection. Ridge decreases the complexity of the model without reducing the number of independent variables because it shrinks the coefficients without, in general, setting any of them exactly to 0. Let us visualize it with the help of the following image:

The tangency between the loss function and the penalty function. Source: Image by Author.

If we look at the above image, we can see that the ellipses can be tangent to the circle (the geometrical interpretation of the L2 penalization) everywhere, which means that the model penalizes large weights without shrinking them to 0. This means that the final model will include all the independent variables.

Also, in this case, remembering that we have to calculate the gradient of the function to minimize it, we should calculate the following:

The formula of Ridge regularization with the gradient. Source: Image by Author.

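The difference from Lasso regularization is that the gradient of the L2 penalty is proportional to the weight itself, so the push toward the origin fades as a weight approaches 0. The weights get smaller but generally never reach exactly 0, and this is why Ridge does not perform feature selection.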
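For a linear model with an MSE cost, the L2-penalized problem can be solved in closed form, which makes the shrinking behavior easy to inspect. The sketch below assumes the objective (1/n) * sum of squared errors + `lam` * sum of squared weights, with no intercept; the function name `ridge_closed_form` and the toy data are hypothetical.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize (1/n) * ||y - X w||^2 + lam * ||w||^2 for a linear model without intercept.

    Setting the gradient -(2/n) X^T (y - X w) + 2 * lam * w to zero gives
    (X^T X + n * lam * I) w = X^T y.
    """
    n_samples, n_features = X.shape
    A = X.T @ X + n_samples * lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Hypothetical toy data: the second feature has no effect on the target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
true_w = np.array([2.0, 0.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

for lam in (0.0, 0.1, 1.0, 10.0):
    w = ridge_closed_form(X, y, lam)
    print(f"lam={lam:5.1f}  w={np.round(w, 4)}  norm={np.linalg.norm(w):.4f}")
```

As `lam` grows, the norm of the weight vector decreases, but the weight of the irrelevant feature is only pushed close to zero rather than set to it.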

Implementing Ridge and Lasso regularized models in Python

So, let’s see how we can implement both these regularized methods.

We start with the Lasso regularization:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
#warnings
import warnings

#ignoring warnings
warnings.filterwarnings('ignore')

# Create a dataset
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)

# Scale the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Subdivide the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Expand the train features into 6-degree polynomial terms
poly = PolynomialFeatures(degree=6)
X_train_poly = poly.fit_transform(X_train)

# Fit the 6-degree polynomial to the train data
poly_reg = LinearRegression()
poly_reg.fit(X_train_poly, y_train)

# Evaluate the polynomial on the train and test data
X_test_poly = poly.transform(X_test)
y_pred_train = poly_reg.predict(X_train_poly)
y_pred_test = poly_reg.predict(X_test_poly)
r2_score_poly_train = r2_score(y_train, y_pred_train)
r2_score_poly_test = r2_score(y_test, y_pred_test)
print(f'R-squared of 6-degree polynomial on train set: {r2_score_poly_train: .3f}')
print(f'R-squared of 6-degree polynomial on test set: {r2_score_poly_test: .3f}')

# Fit a regularized 6-degree polynomial to the train data
lasso_reg = make_pipeline(PolynomialFeatures(degree=6), Lasso(alpha=1))
lasso_reg.fit(X_train, y_train)

# Evaluate the regularized polynomial on the test data
y_pred_lasso = lasso_reg.predict(X_test)
r2_score_lasso = r2_score(y_test, y_pred_lasso)
print(f'R-squared of regularized 6-degree polynomial on test set: {r2_score_lasso: .3f}')

And we get:

R-squared of 6-degree polynomial on train set: 1.000
R-squared of 6-degree polynomial on test set: -84.110
R-squared of regularized 6-degree polynomial on test set: 0.827

So, as we can see, the “standard” 6-degree polynomial overfits because it has an R² of 1 on the train set and a negative one on the test set.

Instead, the 6-degree regularized model has an acceptable R² on the test set, meaning it has good generalization performance. Thus, regularization has improved the performance of the non-regularized model.

Now, let’s make a similar example using the Ridge model:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

# Create a dataset
X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=42)

# Scale the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Subdivide the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Expand the train features into 6-degree polynomial terms
poly = PolynomialFeatures(degree=6)
X_train_poly = poly.fit_transform(X_train)

# Fit the 6-degree polynomial to the train data
poly_reg = LinearRegression()
poly_reg.fit(X_train_poly, y_train)

# Evaluate the polynomial on the train and test data
X_test_poly = poly.transform(X_test)
y_pred_train = poly_reg.predict(X_train_poly)
y_pred_test = poly_reg.predict(X_test_poly)
r2_score_poly_train = r2_score(y_train, y_pred_train)
r2_score_poly_test = r2_score(y_test, y_pred_test)
print(f'R-squared of 6-degree polynomial on train set: {r2_score_poly_train}')
print(f'R-squared of 6-degree polynomial on test set: {r2_score_poly_test}')

# Fit a regularized 6-degree polynomial to the train data
ridge_reg = make_pipeline(PolynomialFeatures(degree=6), Ridge(alpha=1))
ridge_reg.fit(X_train, y_train)

# Evaluate the regularized polynomial on the test data
y_pred_ridge = ridge_reg.predict(X_test)
r2_score_ridge = r2_score(y_test, y_pred_ridge)
print(f'R-squared of regularized 6-degree polynomial on test set: {r2_score_ridge}')

And we get:

R-squared of 6-degree polynomial on train set: 1.0
R-squared of 6-degree polynomial on test set: -1612.4842791834997
R-squared of regularized 6-degree polynomial on test set: 0.9266258222037977

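So, ridge regularization, even in this case, has increased the performance of the non-regularized polynomial model. Its test R² is also higher than the one we got with Lasso, but the two scores are not directly comparable: this example uses a different dataset (3 features and noise=5 instead of 5 features and noise=10).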
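To compare the two penalties on equal footing, both models can be fitted on the same data and the same split. The sketch below reuses the dataset settings of the Lasso example; the function name `compare_on_same_data` is hypothetical, and placing the scaler inside the pipeline (so that it is fitted on the train set only) is a variation on the code above, not the author’s setup. No scores are reported here because they depend on these choices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

def compare_on_same_data(alpha=1.0, degree=6):
    """Return the test R^2 of Lasso and Ridge fitted on the same data and split."""
    X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    models = {
        "lasso": Lasso(alpha=alpha, max_iter=5000),
        "ridge": Ridge(alpha=alpha),
    }
    scores = {}
    for name, model in models.items():
        # The scaler sits inside the pipeline, so it only sees the train data
        pipeline = make_pipeline(
            StandardScaler(), PolynomialFeatures(degree=degree), model
        )
        pipeline.fit(X_train, y_train)
        scores[name] = float(r2_score(y_test, pipeline.predict(X_test)))
    return scores

scores = compare_on_same_data()
for name, score in scores.items():
    print(f"R-squared of {name} on test set: {score: .3f}")
```

Since `alpha` does not have the same scale in the two models, a fair comparison would normally also tune `alpha` separately for each of them.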

Conclusions

In this article, we’ve explained the problem of overfitting and how to solve it through regularization.

But when should we use one regularized model rather than the other? As a rule of thumb:

  1. We’d better use Ridge regularization when the features are highly correlated. Hence, it is important to study the correlation matrix before deciding which features to delete from the study of our problem.
  2. Lasso regularization helps reduce the features in a dataset. It automatically selects a subset of the most important features by shrinking the coefficients of the less important ones to zero. So, Lasso regularization is useful:
     • For datasets with high dimensionality.
     • When a significant number of features are correlated.
     • To impose sparsity on the model coefficients.


Published via Towards AI
