# From Overfitting to Excellence: Harnessing the Power of Regularization

Last Updated on August 7, 2023 by Editorial Team

**Author(s): Sandeepkumar Racherla**

Originally published on Towards AI.

## The Role of Regularization: Balancing Complexity and Generalization in Machine Learning

When it comes to Machine Learning, our goal is to find the model that makes the best predictions on data it hasn't been trained on.

To do so, we train our ML models on training data and see how they perform in making predictions. Then, we compare the performance of the predictions on the train set and on the test set (that is, the set with new data) to decide which of the models we're testing is the best one for the problem we're facing.

In this article, we'll describe regularization in Machine Learning as a way to avoid overfitting.

## The problem of overfitting in Machine Learning

Overfitting is a problem that occurs in Machine Learning when a model becomes "too specialized and focused" on the particular data it's been trained on and can't generalize well.

In ML, a model can be in one of three conditions: underfitting, good fit, or overfitting.

So, overfitting is like students memorizing the answers to specific questions: odds are that they'll fail the test because they didn't understand the subject; they just remembered how to answer some questions.

Performance-wise, we can generally tell that an ML model is overfitting when the metric we've chosen to evaluate it is very high (near 100%) on the train set but significantly lower on the test set.

So, overfitting, as well as underfitting, is something that we, as Data Scientists, absolutely need to avoid.

A way to avoid overfitting is through regularization. Let's see how.

## Resolving overfitting through regularization

Regularization adds a penalty term to the loss function during the training of the model: this "discourages" the model from becoming too complex.

The idea is that by limiting the model's ability to fit the training data too closely, we can also reduce its ability to fit noise or random variations in the data. Thus, we avoid overfitting.

So, suppose we have a cost function defined as follows:

$$J(\theta; X, y)$$

Where we have:

- $\theta$ is the vector of the model's parameters (our estimator).
- $X$ and $y$ are, respectively, the features and the labels of our model.

If we define the regularization parameter as $\lambda$ and the regularization function as $\Omega$, then the regularized objective function is:

$$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \lambda \Omega(\theta)$$

So, the function to minimize now is the sum of the cost function plus a regularization term.

The regularization function and the regularization parameter can vary, depending on the type of regularization we want to use.

In the following paragraphs, we'll describe the two types of regularization functions that are the most used in Machine Learning.

## Describing Lasso and Ridge regularization

## The Lasso regularized model

Lasso (least absolute shrinkage and selection operator) regularization performs an *L1* regularization, which adds a penalty equal to the absolute value of the magnitude of the coefficients. This regularization method drives the weights closer to the origin by adding the regularization term:

$$\Omega(w) = \lVert w \rVert_1 = \sum_j \lvert w_j \rvert$$

We can add an *L1* regularization penalty, for example, to the MSE (or to any other cost function we'd like) like so:

$$L = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_j \lvert w_j \rvert$$

where we have:

- $n$ is the total number of the points/data we're studying.
- $y_i$ are the actual values.
- $\hat{y}_i$ are the values predicted by our model.
- $w_j$ are the model's weights, and $\lambda$ is the regularization parameter.
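To make the formula concrete, here is a minimal sketch (with made-up data and weights, assumed only for illustration) that computes the L1-penalized MSE by hand with NumPy:

```python
import numpy as np

# Hypothetical data, weights, and predictions (y_hat = X @ w), for illustration only
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0]])
y = np.array([3.0, 3.5, 7.0])
w = np.array([0.9, 1.1])  # hypothetical weights
lam = 0.1                 # regularization parameter lambda

y_hat = X @ w
mse = np.mean((y - y_hat) ** 2)       # (1/n) * sum((y_i - y_hat_i)^2)
l1_penalty = lam * np.sum(np.abs(w))  # lambda * sum(|w_j|)
loss = mse + l1_penalty
print(f'MSE: {mse:.4f}, L1 penalty: {l1_penalty:.4f}, penalized loss: {loss:.4f}')
```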

## The geometrical interpretation of Lasso regularization

Lasso regularization, under the hood, performs what we call feature selection. Let's give a graphical interpretation of Lasso regularization to understand this fact. For the sake of simplicity, we'll create a 2-D drawing with two weights, $w_1$ and $w_2$:

The orange concentric ellipses are the geometrical representation of the cost function we've chosen (MSE in our case), while the light blue rhombus is the geometrical interpretation of the penalty function (L1).

Our goal is to minimize the loss function; as we can see from the above illustration, the ellipses can be tangent to the rhombus only on one of its corners. At these corners, one of the weights is 0. The tangency may be better visualized in the following image, because the minimization occurs at the point of tangency between the functions, not at their intersection:

Now, imagine we do this process for all the weights $w_i$: in the end, we'll have what we call a sparse model, that is, a model with fewer features than the initial ones, because Lasso regularization shrinks the least important weights to exactly 0. This means that our model has performed feature selection, dropping some features (their weights have been set to 0) to simplify the initial model.

Mathematically speaking, remembering that we have to calculate the gradient of the function to minimize it, we should calculate the following:

$$\nabla_w \tilde{J}(w; X, y) = \lambda \, \mathrm{sign}(w) + \nabla_w J(w; X, y)$$

In this context, sparsity refers to the fact that some parameters have an optimal value of exactly zero. Because the penalty's contribution to the gradient is $\lambda \, \mathrm{sign}(w)$, which does not vanish as a weight shrinks, the minimum can occur at points where some weights are exactly 0: some features are dropped, and this is called feature selection.
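We can check this sparsity effect empirically. Here is a minimal sketch, assuming a synthetic dataset where only a few features are truly informative, that fits scikit-learn's `Lasso` and counts the coefficients driven exactly to 0:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic dataset: 20 features, but only 5 carry signal
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10, random_state=42)

lasso = Lasso(alpha=1.0)
lasso.fit(X, y)

# Count the coefficients that Lasso set exactly to 0
n_zero = np.sum(lasso.coef_ == 0)
print(f'{n_zero} of {len(lasso.coef_)} coefficients were set exactly to 0')
```

With a large enough `alpha`, most of the uninformative coefficients typically end up exactly at 0, which is the feature selection described above.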

## The Ridge regularized model

The Ridge model performs an L2 regularization, which adds a penalty equal to the square of the magnitude of the coefficients. This regularization strategy drives the weights $w_i$ closer to the origin by adding the regularization term:

$$\Omega(w) = \lVert w \rVert_2^2 = \sum_j w_j^2$$

We can add an L2 regularization penalty, for example, to the MSE (or to any other cost function we'd like) like so:

$$L = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_j w_j^2$$

## The geometrical interpretation of Ridge regularization

Let's give a graphical interpretation of Ridge regularization as well. For the sake of simplicity, we'll again create a 2-D drawing with two weights, $w_1$ and $w_2$:

In contrast to Lasso regularization, Ridge does not perform feature selection. Ridge decreases the complexity of the model without reducing the number of independent variables, because it never drives a coefficient exactly to 0. Let us visualize it with the help of the following image:

If we look at the above image, we can see that the ellipses can be tangent to the circle (the geometrical interpretation of the *L2* penalization) at any point, which means that the model penalizes large weights without shrinking them to exactly 0. This means that the final model will include all the independent variables.

Also in this case, remembering that we have to calculate the gradient of the function to minimize it, we should calculate the following:

$$\nabla_w \tilde{J}(w; X, y) = 2\lambda w + \nabla_w J(w; X, y)$$

The difference with Lasso regularization is that here the penalty's contribution to the gradient is proportional to $w$ itself, so it shrinks each weight toward 0 without ever setting it exactly to 0, and this is why Ridge does not perform feature selection.
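Again, we can verify this empirically. In this minimal sketch (same assumed synthetic setup as in the Lasso check above), `Ridge` shrinks the coefficients but typically leaves all of them nonzero:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Same assumed synthetic dataset: 20 features, only 5 informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10, random_state=42)

ridge = Ridge(alpha=1.0)
ridge.fit(X, y)

# Ridge shrinks weights toward 0 but does not zero them out
n_zero = np.sum(ridge.coef_ == 0)
print(f'{n_zero} of {len(ridge.coef_)} coefficients are exactly 0')
```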

## Implementing Ridge and Lasso regularized models in Python

So, let's see how we can implement both of these regularized methods.

We start with the Lasso regularization:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.pipeline import make_pipeline

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Create a dataset
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)

# Scale the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Subdivide the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create 6-degree polynomial features from the train data
poly = PolynomialFeatures(degree=6)
X_train_poly = poly.fit_transform(X_train)

# Fit the 6-degree polynomial to the train data
poly_reg = LinearRegression()
poly_reg.fit(X_train_poly, y_train)

# Evaluate the polynomial on the train and test data
X_test_poly = poly.transform(X_test)
y_pred_train = poly_reg.predict(X_train_poly)
y_pred_test = poly_reg.predict(X_test_poly)
r2_score_poly_train = r2_score(y_train, y_pred_train)
r2_score_poly_test = r2_score(y_test, y_pred_test)
print(f'R-squared of 6-degree polynomial on train set: {r2_score_poly_train: .3f}')
print(f'R-squared of 6-degree polynomial on test set: {r2_score_poly_test: .3f}')

# Fit a regularized 6-degree polynomial to the train data
lasso_reg = make_pipeline(PolynomialFeatures(degree=6), Lasso(alpha=1))
lasso_reg.fit(X_train, y_train)

# Evaluate the regularized polynomial on the test data
y_pred_lasso = lasso_reg.predict(X_test)
r2_score_lasso = r2_score(y_test, y_pred_lasso)
print(f'R-squared of regularized 6-degree polynomial on test set: {r2_score_lasso: .3f}')
```

And we get:

```
R-squared of 6-degree polynomial on train set: 1.000
R-squared of 6-degree polynomial on test set: -84.110
R-squared of regularized 6-degree polynomial on test set: 0.827
```

So, as we can see, the "standard" 6-degree polynomial overfits: it has an R² of 1 on the train set and a large negative one on the test set.

Instead, the 6-degree regularized model has an acceptable R² on the test set, meaning it generalizes well. Thus, regularization has improved on the non-regularized model.

Now, let's make a similar example using the Ridge model:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.pipeline import make_pipeline

# Create a dataset
X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=42)

# Scale the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Subdivide the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create 6-degree polynomial features from the train data
poly = PolynomialFeatures(degree=6)
X_train_poly = poly.fit_transform(X_train)

# Fit the 6-degree polynomial to the train data
poly_reg = LinearRegression()
poly_reg.fit(X_train_poly, y_train)

# Evaluate the polynomial on the train and test data
X_test_poly = poly.transform(X_test)
y_pred_train = poly_reg.predict(X_train_poly)
y_pred_test = poly_reg.predict(X_test_poly)
r2_score_poly_train = r2_score(y_train, y_pred_train)
r2_score_poly_test = r2_score(y_test, y_pred_test)
print(f'R-squared of 6-degree polynomial on train set: {r2_score_poly_train}')
print(f'R-squared of 6-degree polynomial on test set: {r2_score_poly_test}')

# Fit a regularized 6-degree polynomial to the train data
ridge_reg = make_pipeline(PolynomialFeatures(degree=6), Ridge(alpha=1))
ridge_reg.fit(X_train, y_train)

# Evaluate the regularized polynomial on the test data
y_pred_ridge = ridge_reg.predict(X_test)
r2_score_ridge = r2_score(y_test, y_pred_ridge)
print(f'R-squared of regularized 6-degree polynomial on test set: {r2_score_ridge}')
```

And we get:

```
R-squared of 6-degree polynomial on train set: 1.0
R-squared of 6-degree polynomial on test set: -1612.4842791834997
R-squared of regularized 6-degree polynomial on test set: 0.9266258222037977
```

So, Ridge regularization, even in this case, has increased the performance of the non-regularized polynomial model. It also scores higher than the Lasso model above, though note that the two examples use different datasets (3 features with noise=5 here versus 5 features with noise=10 before), so the two scores are not directly comparable.
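Since the two examples use different datasets, a fairer comparison is to fit both regularized pipelines on the same data. Here is a minimal sketch, reusing the setup of the Lasso example (the `alpha=1` values are just the defaults used above, not tuned):

```python
import warnings
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline

warnings.filterwarnings('ignore')

# Same dataset as in the Lasso example
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit both regularized 6-degree polynomial pipelines on the same split
for name, model in [('Lasso', Lasso(alpha=1)), ('Ridge', Ridge(alpha=1))]:
    reg = make_pipeline(PolynomialFeatures(degree=6), model)
    reg.fit(X_train, y_train)
    print(f'{name} test R-squared: {r2_score(y_test, reg.predict(X_test)): .3f}')
```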

## Conclusions

In this article, we've explained the problem of overfitting and how to solve it through regularization.

But when should we use one regularized model rather than the other? As a rule of thumb:

- We'd better use Ridge regularization when the features are highly correlated. Hence, it is important to study the correlation matrix (see the sketch after this list) before deciding which features to delete from the study of our problem.
- Lasso regularization helps reduce the features in a dataset: it automatically selects a subset of the most important features while simultaneously shrinking the coefficients of the less important features to zero. So, Lasso regularization is useful:
    - For datasets with high dimensionality.
    - When a significant number of features are correlated.
    - To impose sparsity on the model coefficients.
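As a quick way to apply the first rule of thumb, we can inspect the correlation matrix of the features before choosing a model. A minimal sketch, assuming a feature matrix `X` like the ones generated above:

```python
import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)

# Correlation matrix between the features (columns of X)
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))

# Pairs of features with absolute correlation above 0.8 (off-diagonal only)
high = np.argwhere(np.triu(np.abs(corr) > 0.8, k=1))
print('Highly correlated feature pairs:', high)
```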


Published via Towards AI