Striking the Right Balance: Understanding Underfitting and Overfitting in Machine Learning Models

Last Updated on August 1, 2023 by Editorial Team

Author(s): Shivamshinde

Originally published on Towards AI.

This article explains the basic concepts of overfitting and underfitting from a machine learning and deep learning perspective.

Photo by Ag PIC on Unsplash

Seeing underfitting and overfitting as a problem

Everyone working on a machine learning problem wants their model to perform as well as possible. But there are times when the model does not behave the way we want: its accuracy might be worse than ideal, or it might look better than ideal. In machine learning, both of these are considered problems.

Most people accept that less-than-ideal accuracy is a problem, but some might wonder why we consider above-ideal accuracy a problem too.

Sometimes our model finds relationships where there are none, i.e., in unnecessary features or in noise in the data, and that is where this extra accuracy comes from. Let’s understand this with an example.

Suppose we are training a model that predicts a person’s salary. For this problem, our data has four features: the person’s name, education, experience, and skill set. Common sense tells us that a person’s name does not affect their salary. Despite this, if we include the name as one of the features, the model might find some spurious relationship between name and salary, and this relationship might add some extra accuracy on the training data. This produces more-than-ideal accuracy, and in such cases our model is trained incorrectly.
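To make this concrete, here is a minimal sketch with made-up salary data (the numbers and the choice of DecisionTreeRegressor are illustrative, not from the article): an unconstrained tree can split on a meaningless per-person ID and memorize the training set, scoring almost perfectly on the training data while doing noticeably worse on the test data.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
n = 500
experience = rng.uniform(0, 20, n)            ## meaningful feature: years of experience
name_id = rng.permutation(n).astype(float)    ## meaningless feature: a unique "name" ID
salary = 30_000 + 3_000 * experience + rng.normal(0, 5_000, n)

X = np.column_stack([experience, name_id])
X_train, X_test, y_train, y_test = train_test_split(X, salary, random_state=0)

## an unconstrained tree memorizes the unique IDs in the training set
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print("train R^2:", model.score(X_train, y_train))   ## close to 1.0: memorized
print("test R^2:", model.score(X_test, y_test))      ## noticeably lower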

Basic terminologies

Before diving in, let’s understand two kinds of errors that are necessary for understanding underfitting and overfitting.

  1. Bias error: The bias error is the error we measure using the training data and the trained model. In other words, we compute the error on the same data that was used to train the model. The error can be of any kind, such as mean squared error, mean absolute error, etc.
  2. Variance error: The variance error is the error we measure using the test data and the trained model. Again, the error can be of any type. Even though any error metric would do, we use the same metric that we used for the bias, so that the bias and variance values are comparable.

Note that the ideal condition for our trained model is low bias and low variance.
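Here is a minimal sketch of these two working definitions on synthetic data, using mean squared error as the shared metric (the data and the model choice are illustrative only):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, 200)   ## linear signal plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

bias_error = mean_squared_error(y_train, model.predict(X_train))      ## error on training data
variance_error = mean_squared_error(y_test, model.predict(X_test))   ## error on test data
print(f"bias error: {bias_error:.3f}, variance error: {variance_error:.3f}")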

What are overfitting and underfitting in everyday life?

Let’s say you are visiting a foreign country and the taxi driver rips you off. You might be tempted to say that all the taxi drivers in that country are greedy. This is what we call over-generalization.

The same over-generalization can happen to our trained machine learning and deep learning models. In machine and deep learning, over-generalization is known as overfitting of the model.

Similarly, under-generalization is known as underfitting of the model.

What does overfitting mean from a machine learning perspective?

We say our model suffers from overfitting if it has low bias and high variance.

Overfitting happens when the model is too complex relative to the amount and noisiness of the training data.

Possible solutions to the overfitting issue

  1. Simplify the model in one of the following ways:
     - Select a machine learning model with fewer parameters.
     - Reduce the number of features or columns used for training the model.
     - Constrain the model (using regularization methods).
  2. Gather more training data.
  3. Reduce the noise in the data. The noise could be errors in the data, the presence of outliers, etc.
  4. Use early stopping (a short sketch of regularization and early stopping follows this list).
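As a sketch of solutions 1 (constraining the model) and 4 (early stopping) in Keras, assuming the 11-feature wine data used later in this article; the layer width, regularization strength, and patience value are arbitrary illustrative choices, not values from the article:

import tensorflow.keras.layers as tfl
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.regularizers import l2

## a constrained model: one modest hidden layer with L2 weight regularization
inputs = tfl.Input(shape=(11,))   ## 11 features, as in the wine data below
hidden = tfl.Dense(32, activation='relu', kernel_regularizer=l2(0.01))(inputs)
output = tfl.Dense(10, activation='softmax')(hidden)
model = Model(inputs=inputs, outputs=output)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

## early stopping: halt training once validation accuracy stops improving
early_stop = EarlyStopping(monitor='val_accuracy', patience=10, restore_best_weights=True)
## model.fit(X_train, y_train, epochs=150, validation_data=(X_test, y_test), callbacks=[early_stop])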

What is underfitting?

Underfitting happens when a machine learning model is not able to capture the relationship between our independent and dependent features. In other words, in the case of underfitting, our model shows high bias and high variance. There might be several reasons behind this.

Possible solutions to an underfitting issue

  1. Use a more complex model that can capture the relationship between the independent and dependent features (see the sketch after this list).
  2. Relax the constraints on the model, i.e., reduce the regularization.
  3. Try to obtain more training data.
  4. Try to increase the duration of model training, i.e., train the model for more epochs.
  5. Try to clean the data to reduce the noise.
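For solution 1, here is a hedged sketch of what a more complex model could look like in Keras (the widths and depth are arbitrary; the overfitting example later in this article takes the same approach):

import tensorflow.keras.layers as tfl
from tensorflow.keras.models import Model

## a higher-capacity model: wider and deeper than the underfitting one below
inputs = tfl.Input(shape=(11,))   ## 11 wine features, as in the examples below
hidden = tfl.Dense(128, activation='relu')(inputs)
hidden = tfl.Dense(128, activation='relu')(hidden)
output = tfl.Dense(10, activation='softmax')(hidden)
model = Model(inputs=inputs, outputs=output)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
## train for more epochs than before, e.g.:
## model.fit(X_train, y_train, epochs=300, validation_data=(X_test, y_test))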

Let’s see what overfitting and underfitting look like using some plots

Let’s use the red-wine-quality dataset to understand the concepts of underfitting and overfitting.

Underfitting:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

from sklearn.model_selection import train_test_split
import tensorflow.keras.layers as tfl
from tensorflow.keras.models import Model

## Reading the data (note: the UCI winequality-red.csv file is semicolon-separated; pass sep=';' if using that file)
wine = pd.read_csv('wine.csv')

## Splitting the data into independent and dependent features
X = wine.drop('quality',axis=1)
y = wine['quality']

## Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

## creating the model: a single small hidden layer (low capacity)
inputs = tfl.Input(shape=X.shape[1:])
hidden1 = tfl.Dense(6, activation='relu')(inputs)
output = tfl.Dense(10, activation='softmax')(hidden1)
model = Model(inputs=inputs, outputs=output)

## compiling the model
model.compile(optimizer='adam',loss='sparse_categorical_crossentropy', metrics=['accuracy'])

## training the model (the test set is used as validation data)
history = model.fit(X_train, y_train, epochs=150, validation_data=(X_test, y_test))

## visualizing the train and test accuracy
plt.plot(history.history['accuracy'],color='red',label='train accuracy')
plt.plot(history.history['val_accuracy'],color='blue',label='test accuracy')
plt.legend()
plt.show()

Observe the above plot. The model’s accuracy on both the training data and the test data stays below 55%, which is quite low. So our model, in this case, is suffering from the underfitting problem. This happens because the model is too simple.

Overfitting:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
from sklearn.model_selection import train_test_split
import tensorflow.keras.layers as tfl
from tensorflow.keras.models import Model

## Reading the data
wine = pd.read_csv('wine.csv')

## Splitting the data into independent and dependent features
X = wine.drop('quality',axis=1)
y = wine['quality']

## Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

## creating the model: four wide hidden layers (high capacity)
inputs = tfl.Input(shape=X.shape[1:])
hidden1 = tfl.Dense(100, activation='relu')(inputs)
hidden2 = tfl.Dense(100, activation='relu')(hidden1)
hidden3 = tfl.Dense(100, activation='relu')(hidden2)
hidden4 = tfl.Dense(100, activation='relu')(hidden3)
output = tfl.Dense(10, activation='softmax')(hidden4)
model = Model(inputs=inputs, outputs=output)

## compiling the model
model.compile(optimizer='adam',loss='sparse_categorical_crossentropy', metrics=['accuracy'])

## training the model (the test set is used as validation data)
history = model.fit(X_train, y_train, epochs=150, validation_data=(X_test, y_test))

## visualizing the train and test accuracy
plt.plot(history.history['accuracy'],color='red',label='train accuracy')
plt.plot(history.history['val_accuracy'],color='blue',label='test accuracy')
plt.legend()
plt.show()

Observing the above plot, one can tell that the gap between the two curves widens towards the right side (i.e., as the number of epochs increases). This means that as training continues for more epochs, the training accuracy keeps increasing while the test accuracy does not. This situation is what we call overfitting. Such a model does not generalize well to the test data or to new data.

We need to train the model in such a way that it gives good enough accuracy on both the training data and the test data. Such a model sits on the middle line between underfitting and overfitting.
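One hedged way to aim for that middle line, continuing from the overfitting code above (the patience value is an arbitrary choice), is to let Keras stop training once test accuracy stops improving and keep the best weights:

from tensorflow.keras.callbacks import EarlyStopping

## stop once validation (test) accuracy has not improved for 10 epochs
early_stop = EarlyStopping(monitor='val_accuracy', patience=10, restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=150, validation_data=(X_test, y_test), callbacks=[early_stop])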

I hope you liked the article. If you have any thoughts on it, please let me know. Any constructive feedback is highly appreciated.

Connect with me on LinkedIn.

Mail me at [email protected]

Have a great day!
