From Raw to Refined: A Journey Through Data Preprocessing — Part 6: Imbalanced Datasets

Last Updated on January 10, 2024 by Editorial Team

Author(s): Shivamshinde

Originally published on Towards AI.

Photo by Colton Sturgeon on Unsplash

Table of Contents

  1. What is imbalanced data?
  2. Degree of imbalance
  3. Why is having an imbalanced dataset a problem?
  4. Methods to deal with imbalanced data
    Try getting more data
    Try changing the performance metric
    Try sampling the data
    Try different algorithms
    Try penalized models
  5. Important tips
  6. Outro
  7. References and Code

What is imbalanced data?

When tackling a classification problem in data science projects, we often come across data where one of the classes (or labels) has significantly more data points than the remaining classes (or labels). This type of data is known as imbalanced data.

Some examples of imbalanced data are:

  1. Any kind of data related to rare diseases: If we take, for example, the disease called ‘Breast Cancer,’ then out of all the data points, only a handful of data points will belong to the positive class (diseased people) while the remaining will belong to the negative class (healthy people).
  2. Natural disaster-related data: Natural disasters happen quite rarely. Therefore, when dealing with such data, we will often have a large number of data points with a negative class (disaster didn’t happen) and very few data points with a positive class (disaster happened).
  3. Fraud detection data: Out of a large population, only very few people engage in fraudulent financial activities. That’s why fraud detection data will contain very few positive data points (fraud happened) and a large number of negative data points (fraud didn’t happen).

The classes that make up a large portion of the data are known as majority classes. And those that make up a smaller portion of the data are known as minority classes.

Degree of Imbalance

Degree of Imbalance (Image by Author)

The degree of imbalance is usually described by the proportion of the dataset that belongs to the minority class. Following the categorization in the Google Machine Learning guide cited in the References, the imbalance is mild when the minority class makes up 20–40% of the dataset, moderate when it makes up 1–20%, and extreme when it makes up less than 1%.

Why is having an imbalanced dataset a problem?

Let’s take an example of a dataset with 1000 data points. Out of these, suppose 50 data points belong to class ‘A’ and the remaining 950 belong to class ‘B’.

While training a machine learning model on this dataset, the model will encounter a class ‘B’ data point about 95% of the time. That’s why our model may become biased toward class ‘B’. In cases of extreme imbalance, we might even end up with a model that predicts class ‘B’ regardless of the input data point. Such a model would achieve 95% accuracy on the training data, but it still wouldn’t give satisfactory results on the test data.

Therefore,

When working with imbalanced data, it’s essential to address the class imbalance; otherwise, we may end up with a model that performs poorly on the test data. To avoid this, we need to prepare our data in a way that ensures the model can generalize well.
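To see this failure mode concretely, here is a minimal sketch using scikit-learn’s DummyClassifier as a stand-in for a model that always predicts the majority class; the class names and sizes match the example above.

# Importing the required classes and functions
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 950 points of class 'B' (majority) and 50 points of class 'A' (minority);
# the single zero-valued feature is just a placeholder
X = np.zeros((1000, 1))
y = np.array(['B'] * 950 + ['A'] * 50)

# A "model" that always predicts the most frequent class
clf = DummyClassifier(strategy='most_frequent').fit(X, y)
ypred = clf.predict(X)

print(accuracy_score(y, ypred))               # 0.95, which looks great
print(recall_score(y, ypred, pos_label='A'))  # 0.0: class 'A' is never predicted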

Methods to deal with imbalanced data

Try getting more data

The most basic remedy for not having enough minority-class data is simply to gather more of it. Before going for any fancy techniques, check whether there is any source that can provide you with more data for the minority class.

Try changing the performance metric

For any kind of machine learning problem, our first instinct is to use accuracy as the evaluation metric. However, in the case of an imbalanced classification problem, the accuracy metric gives misleading results. That’s why you can try the following performance metrics to make sense of the trained model’s predictions (a short example follows the list):

  1. Confusion Matrix
  2. Precision, Recall, and F1 Score
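
As a quick, self-contained sketch of these metrics in scikit-learn (the label arrays below are hypothetical and only meant to show the API):

# Importing the required functions
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical true and predicted labels for an imbalanced binary problem
ytrue = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
ypred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(confusion_matrix(ytrue, ypred))
print(classification_report(ytrue, ypred))  # per-class precision, recall, and F1 score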

If you want to learn how to make sense of the predictions made by the model using the confusion matrix, precision, recall, and F1 score, you can read my article linked below.

Know Different Performance Measures for Machine Learning Classification Problem

This article will teach you the different performance measures used in machine learning classification tasks. The…

pub.towardsai.net

Try sampling the data

Source: ResearchGate (see the reference at the bottom of the article)

Sampling the data means either increasing or decreasing the number of data points.

Increasing data points of the minority class is called oversampling. On the other hand, decreasing the data points of the majority class is called undersampling.

We can make use of the Python library called imblearn to implement the methods for undersampling and oversampling.

There are two ways to perform oversampling in the data:

  1. Randomly sampling records with replacement from the existing minority-class data
  2. Generating new records using the K-nearest neighbors approach on the existing minority data

To demonstrate oversampling and undersampling, I will be using the code snippets from one of my Kaggle notebooks. If you are interested in the other preprocessing steps, you can check the notebook using the link given at the end of the article.

The first approach can be implemented using the RandomOverSampler class of the imblearn library.

# Importing the required classes and functions
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
import seaborn as sns

# Performing oversampling on the training data
X_oversampled, y_oversampled = RandomOverSampler(random_state=0).fit_resample(Xtrain, ytrain)
print(f"X_oversampled new size: {X_oversampled.shape}")
print(f"y_oversampled new size: {y_oversampled.shape}")

# Training the random forest classifier on the oversampled data
rfr = RandomForestClassifier(max_depth=3)
rfr.fit(X_oversampled, y_oversampled)

# Making predictions using the trained model and plotting the confusion matrix
ypred = rfr.predict(Xtest)
cf_matrix = confusion_matrix(ytest, ypred)
sns.heatmap(cf_matrix, annot=True, cmap='crest', fmt='.3g')

The second approach can be implemented using the SMOTE or ADASYN classes of the imblearn library.

# Importing the required class
from imblearn.over_sampling import SMOTE

# Performing oversampling on the training data
X_oversampled, y_oversampled = SMOTE().fit_resample(Xtrain, ytrain)

print(f"X_oversampled new size: {X_oversampled.shape}")
print(f"y_oversampled new size: {y_oversampled.shape}")

# Training the random forest classifier on the oversampled data
rfr = RandomForestClassifier(max_depth=3)
rfr.fit(X_oversampled, y_oversampled)

# Making predictions using the trained model and plotting the confusion matrix
ypred = rfr.predict(Xtest)
cf_matrix = confusion_matrix(ytest, ypred)
sns.heatmap(cf_matrix, annot=True, cmap='crest', fmt='.3g')

Similarly, you can use the ADASYN class too.
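
For example, here is a minimal sketch, assuming the same Xtrain and ytrain as in the snippets above; ADASYN follows the same fit_resample interface as SMOTE:

# Importing the required class
from imblearn.over_sampling import ADASYN

# ADASYN also generates synthetic minority-class records from nearest neighbors
X_oversampled, y_oversampled = ADASYN(random_state=0).fit_resample(Xtrain, ytrain)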

You can use a similar class to perform the undersampling of the majority class, too.

# Importing the required class
from imblearn.under_sampling import RandomUnderSampler

# Performing undersampling on the training data
X_undersampled, y_undersampled = RandomUnderSampler(random_state=0).fit_resample(Xtrain, ytrain)

print(f"X_undersampled new size: {X_undersampled.shape}")
print(f"y_undersampled new size: {y_undersampled.shape}\n")

# Training the random forest classifier on the undersampled data
rfr = RandomForestClassifier(max_depth=3)
rfr.fit(X_undersampled, y_undersampled)

# Making predictions using the trained model and plotting the confusion matrix
ypred = rfr.predict(Xtest)
cf_matrix = confusion_matrix(ytest, ypred)
sns.heatmap(cf_matrix, annot=True, cmap='crest', fmt='.3g')

Try different algorithms

Rather than worrying about the imbalance in the data, one approach is to use a machine learning algorithm that handles imbalance well, such as the Random Forest algorithm. Additionally, you can use algorithms such as SVC from Scikit-Learn, which lets us assign weights to the classes present in the data.
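
As a rough sketch, again assuming the Xtrain and ytrain from the earlier snippets, class weights can be supplied through the class_weight parameter:

# Importing the required class
from sklearn.svm import SVC

# 'balanced' weights classes inversely proportional to their frequencies;
# an explicit dict such as {0: 1, 1: 10} sets the weights manually
svc = SVC(class_weight='balanced')
svc.fit(Xtrain, ytrain)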

However, in cases of extreme imbalance, gathering more data would be the wisest choice.

Try penalized models

We can make use of penalized models to deal with the imbalance in the data. Penalized classification, as the name suggests, punishes the model by imposing an extra cost when it makes a classification mistake on the minority classes. This way, the model is forced to pay more attention to the minority classes. Some examples of penalized models are penalized SVM, penalized LDA, etc.
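
One way to realize this penalty in scikit-learn is through the class_weight parameter, sketched below with logistic regression; the weight of 20 on the minority class is an arbitrary illustration, not a recommended value:

# Importing the required class
from sklearn.linear_model import LogisticRegression

# A higher weight on the minority class (label 1 here) imposes a larger cost
# for misclassifying it, forcing the model to pay it more attention
logreg = LogisticRegression(class_weight={0: 1, 1: 20}, max_iter=1000)
logreg.fit(Xtrain, ytrain)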

Important Tips

  1. You should always split the data before applying any balancing technique. This ensures that no information leaks into the test data (see the sketch after this list).
  2. It’s not advisable to use undersampling as the primary solution for tackling the imbalance in your dataset. Undersampling removes a significant portion of your data, which can alter the data’s distribution. If the distribution changes, the undersampled data may no longer accurately represent the problem domain, leading to inaccuracies in your analysis.
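
A minimal sketch of the first tip, assuming some feature matrix X and label vector y: split first, then balance only the training portion.

# Importing the required classes and functions
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# Split first so that the test set stays untouched by any resampling
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# Then balance only the training portion
X_balanced, y_balanced = RandomOverSampler(random_state=0).fit_resample(Xtrain, ytrain)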

Outro

Thanks for reading! If you have any thoughts on the article, then please let me know.

Are you struggling to choose what to read next? Don’t worry, I have got you covered.

From Raw to Refined: A Journey Through Data Preprocessing — Part 5: Outliers

A Simple Guide to Navigating Data Anomalies. Decode the mystery behind outliers in data science. From detection to…

ai.plainenglish.io

and one more…

From Raw to Refined: A Journey Through Data Preprocessing — Part 4: Data Encoding

Why data encoding is necessary

pub.towardsai.net

Shivam Shinde

Have a great day!

References:

Imbalanced Data | Machine Learning | Google for Developers

A classification data set with skewed class proportions is called imbalanced. Classes that make up a large proportion…

developers.google.com

5 Techniques to Handle Imbalanced Data For a Classification Problem

Techniques to handle imbalanced data for a classification problem. Here we discuss what is imbalanced data, and how to…

www.analyticsvidhya.com

Having an Imbalanced Dataset? Here Is How You Can Fix It.

Different Ways to Handle Imbalanced Datasets.

towardsdatascience.com

https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/

imbalanced-learn documentation – Version 0.11.0

The user guide provides in-depth information on the key concepts of imbalanced-learn with useful background information…

imbalanced-learn.org

Undersampling and oversampling image reference

Machine Learning with Oversampling and Undersampling Techniques: Overview Study and Experimental Results — Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/Differences-between-undersampling-and-oversampling_fig1_340978368 [accessed 5 Jan, 2024]

Code

Imbalanced Data demo using Cerebral Stroke Dataset

Explore and run machine learning code with Kaggle Notebooks | Using data from Cerebral Stroke Prediction-Imbalanced…

www.kaggle.com


Published via Towards AI
