Important Techniques to Handle Imbalanced Data in Machine Learning Python
Last Updated on September 19, 2022 by Editorial Team
Author(s): Muttineni Sai Rohith
Originally published on Towards AI the World’s Leading AI and Technology News and Media Company. If you are building an AI-related product or service, we invite you to consider becoming an AI sponsor. At Towards AI, we help scale AI and technology startups. Let us help you unleash your technology to the masses.
How to Handle Imbalanced Data in ML Classification usingΒ Python
In this article, we will discuss what is Imbalanced Data, the Metrics we should use to evaluate the model with Imbalanced Data, and the Techniques used to Handle Imbalanced Data.
While doing binary classification, almost every data scientist might have encountered the problem of handling Imbalanced Data. Generally Imbalanced data occurs when the datasets are distributed unequally i.e. when the frequency of data points or the number of rows in one class is much more than in other classes, then the data is imbalanced.
For example, suppose we have a covid Dataset, and our target class is whether a person is having covid or not, if the positive ratio is 10% in our class and the negative ratio is 90%, then we can say that our Data is imbalanced.
Problem with Imbalanced Data
Most machine learning algorithms are designed in a way to improve accuracy and reduces errors. In this process, they donβt consider the distribution of classes. Also, standard machine learning algorithms like Decision trees and Logistic Regression have a bias toward Majority classes and tend to ignore minority classes. So in these cases, even though the model has 95% accuracy, it cannot be said as a perfect model as the frequency of the number of classes in testing data may be 95%, and 5% wrongly predicted data must be from the minorityΒ class.
Accuracy pitfall
Before diving into the handling of Imbalanced Datasets, Letβs understand the metrics we should use while evaluating the models. Generally, accuracy_score is calculated as the ratio of the number of correct predictions to the total number of predictions.
Accuracy = Number of Correct Predictions / Total Number of Predictions.
So we can see that accuracy_score will not consider the distribution of classes. It only focuses on the Number of Correct Predictions. So Even though we get 95+ accuracy, as shown in the above example, we canβt guarantee the performance of the model and its prediction of the minorityΒ class.
So for classification techniques, instead of accuracy_score, it is recommended to use Confusion Matrix, precision_score, recall_score, and Area under the ROC Curve(AUC) as Evaluation Metrics.
Handling Imbalanced Data
A technique that is widely used while handling imbalanced data is Sampling. There are two types of SamplingΒ β
- Under Sampling
- Over Sampling
In Under Sampling, samples are removed from the majority class, whereas, in Over Sampling, samples are added to the minorityΒ class.
To demonstrate the usage of the above techniques, initially, we will consider an example without Handling Imbalanced Data. Dataset used can be foundΒ here.
Importing Libraries
# import necessary modules
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
Loading Data
df = pd.read_csv("/content/drive/MyDrive/creditcard.csv")
Preparing Data
# normalise the amount column
df['normAmount'] = StandardScaler().fit_transform(np.array( df['Amount']).reshape(-1, 1))
# drop Time and Amount columns as they are not relevant for prediction purpose
data = df.drop(['Time', 'Amount'], axis = 1)
# as you can see there are 492 fraud transactions.
data['Class'].value_counts()
X = data.drop(['Class'], axis = 1)
y = data["Class"]
Splitting train-test-data
from sklearn.model_selection import train_test_split
# split into 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
# describes info about train and test set
print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)
Classification
# logistic regression object
lr = LogisticRegression()
# train the model on train set
lr.fit(X_train, y_train.ravel())
predictions = lr.predict(X_test)
# print classification report
print("Accuracy score is: ",accuracy_score(y_test, predictions))
print("Recall score is: ",recall_score(y_test, predictions))
print("Precision score is: ",precision_score(y_test, predictions))
print("Confusion Matrix: \n",confusion_matrix(y_test, predictions))
As we can see, even though the Accuracy score is 99.9%, we can see that the Recall score is 61.9% which is relatively low, and the Precision score isΒ 88.3%.
This is because the dataset is imbalanced, and now we will try to improve these scores using the techniques mentioned above.
Handling Imbalanced Data using UnderΒ Sampling
Under Sampling involves the removal of records from the majority class to balance out with the minorityΒ class.
The simplest technique involved in under-sampling is Random under-sampling. This technique involves the removal of random records from the majority class. But there will be a loss of important information if we randomly remove the rows. So various techniques are implemented for undersampling the data. One such import technique is NearMiss Undersampling.
NearMiss Undersampling
In this technique, data points are selected based on the distance between the majority and minority classes. It has 3 different versions, and each version considers the different data points from the majorityΒ class.
- Version 1βββIt selects data points of the majority class whose average distances to the K closest instances of minority class is theΒ smallest
- Version 2βββIt selects data points of the majority class whose average distances to the K farthest instances of minority class is theΒ smallest
- Version 3βββIt works in 2 steps. Firstly, for each minority class instance, their M nearest neighbors will be stored. Then finally, the majority class instances are selected for which the average distance to the N nearest neighbors is theΒ largest.
In short, Version 3 is the more accurate version as it will remove the tomek links and makes the classification process easy as it forms a decision boundary.
# apply near miss
from imblearn.under_sampling import NearMiss
nr = NearMiss()
X_train_miss, y_train_miss = nr.fit_resample(X_train, y_train.ravel())
print('After Undersampling, the shape of train_X: {}'.format(X_train_miss.shape))
print('After Undersampling, the shape of train_y: {} \n'.format(y_train_miss.shape))
print("After Undersampling, counts of label '1': {}".format(sum(y_train_miss == 1)))
print("After Undersampling, counts of label '0': {}".format(sum(y_train_miss == 0)))
We have undersampled the majority classβββ0and balanced it out with the minority classβββ1. Now Letβs train and evaluate theΒ data.
lr2 = LogisticRegression()
lr2.fit(X_train_miss, y_train_miss)
predictions = lr2.predict(X_test)
# print evaluation metrics
print("Accuracy score is: ",accuracy_score(y_test, predictions))
print("Recall score is: ",recall_score(y_test, predictions))
print("Precision score is: ",precision_score(y_test, predictions))
print("Confusion Matrix: \n",confusion_matrix(y_test, predictions))
So even though the recall score is more, we can see that accuracy is less. But by our observations, when the prediction of minority class is a priority, we can use this technique.
Handling Imbalanced Data using OverΒ Sampling
Unlike undersampling, where we remove records from the majority class, In Over sampling, we will add records in the minority class. Under-Sampling can be used when we have tons of Data, whereas Oversampling can be used when we have lessΒ Data.
The simplest technique involved in over-sampling is Random Over Sampling, where we randomly add more copies to the minority class to balance out with the majority class, but the disadvantage of Over Sampling is that it causes overfitting and generalization of data, thereby decreasing the accuracy. So for this purpose, we use SMOTE technique.
SMOTE(Synthetic Minority Oversampling Technique)
SMOTE techniques work by randomly picking a data point from a minority class and computing the K-Nearest Neighbour from that point, and adding random points between this chosen point and its neighbors.
SMOTE Algorithm works in 4 StepsΒ β
- Choose the minority class as the inputΒ vector.
- Find its K-Nearest Neighbours by using Euclidean distance.
- Choose one of these Neighbours and add a synthetic point between the chosen point and its Neighbours.
- Repeat the steps until it is balanced.
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state = 2)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())
print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape))
print("After OverSampling, counts of label '1': {}".format(sum(y_train_res == 1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res == 0)))
We have oversampled the minority classβββ1 and balanced it out with the majority classβββ0. Letβs train and evaluate theΒ data.
lr1 = LogisticRegression()
lr1.fit(X_train_res, y_train_res.ravel())
predictions = lr1.predict(X_test)
# print Evaluation Metrics
print("Accuracy score is: ",accuracy_score(y_test, predictions))
print("Recall score is: ",recall_score(y_test, predictions))
print("Precision score is: ",precision_score(y_test, predictions))
print("Confusion Matrix: \n",confusion_matrix(y_test, predictions))
As we can see with minimal decrease in Accuracy, the recall_score has increased significantly when compared to the original statistics.
Comparing the model performance
Below is the comparison of the model without handling imbalanced data, with undersampling and with oversampling β
By the above metrics, one can understand that SMOTE technique has goodΒ results.
Conclusion
In this article, we have discussed how to handle Imbalanced data using different techniques. We have used Logistic regression in the above example, we can try various algorithms and improve the model performance.
I hope this is helpfulβ¦. Happy Codingβ¦..
Important Techniques to Handle Imbalanced Data in Machine Learning Python was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
Join thousands of data leaders on the AI newsletter. Itβs free, we donβt spam, and we never share your email address. Keep up to date with the latest work in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming aΒ sponsor.
Published via Towards AI