From Raw to Refined: A Journey Through Data Preprocessing — Part 6: Imbalanced Datasets
Last Updated on January 10, 2024 by Editorial Team
Author(s): Shivamshinde
Originally published on Towards AI.
Table of Contents
- What is imbalanced data?
- Degree of imbalance
- Why is having an imbalanced dataset a problem?
- Methods to deal with imbalanced data
– Try getting more data
– Try changing the performance metric
– Try resampling the data
– Try different algorithms
– Try penalized models
- Important tips
- Outro
- References and Code
What is imbalanced data?
When tackling a classification problem in data science projects, we often come across data where one of the classes (or labels) has significantly more data points than the remaining classes (or labels). This type of data is known as imbalanced data.
Some of the examples of imbalanced data are:
- Any kind of data related to rare diseases: If we take, for example, the disease called ‘Breast Cancer,’ then out of all the data points, only a handful of data points will belong to the positive class (diseased people) while the remaining will belong to the negative class (healthy people).
- Natural disaster-related data: Natural disasters happen quite rarely. Therefore, when dealing with such data, we will often have a large number of data points with a negative class (disaster didn’t happen) and very few data points with a positive class (disaster happened).
- Fraud detection data: Out of a large population, fraudulent financial activities involve only a small number of people. That’s why fraud detection data will contain very few positive data points (fraud happened) and a large number of negative data points (fraud didn’t happen).
The classes that make up a large portion of the data are known as majority classes. And those that make up a smaller portion of the data are known as minority classes.
Degree of Imbalance
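Not every imbalance is equally severe, and the severity influences which remedy makes sense. The Google for Developers guide listed in the references groups imbalance roughly by the proportion of the minority class: mild (20–40% of the dataset), moderate (1–20%), and extreme (less than 1%). A quick way to check where your dataset falls is to look at the normalized class counts; the sketch below is a minimal example, assuming the labels live in a pandas Series called y.
import pandas as pd
# Hypothetical labels: 950 negative and 50 positive examples
y = pd.Series([0] * 950 + [1] * 50)
# Proportion of each class; the smallest value indicates the degree of imbalance
print(y.value_counts(normalize=True))
# 0    0.95
# 1    0.05  -> 5% minority class, i.e., a moderate imbalance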
Why is having an imbalanced dataset a problem?
Let’s take an example of a dataset that has 1000 data points. Out of these 1000 data points, suppose 50 data points belong to class ‘A’, and the remaining data points belong to class ‘B’.
While training a machine learning model on this dataset, 95% of the data points the model sees will belong to class ‘B’. That’s why our model may become biased toward class ‘B’. In an extreme imbalance, we might even end up with a model that predicts class ‘B’ regardless of the input. Such a model would score about 95% accuracy, yet it would never correctly identify a single data point of class ‘A’, so it won’t give satisfactory results on the test data.
Therefore,
When working with imbalanced data, it’s essential to address the class imbalance; otherwise, we may end up with a model that performs poorly on the minority class and fails to generalize to the test data.
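To see this effect concretely, the short sketch below (using synthetic data, so the numbers are only illustrative) shows that a classifier that always predicts the majority class already reaches about 95% accuracy on a 95:5 split, even though it never identifies a single minority example.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
# Synthetic 95:5 imbalanced labels with a single dummy feature
X = np.zeros((1000, 1))
y = np.array([0] * 950 + [1] * 50)
# A "model" that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
ypred = baseline.predict(X)
print(f"Accuracy: {accuracy_score(y, ypred):.2f}")             # ~0.95, looks great
print(f"Minority-class recall: {recall_score(y, ypred):.2f}")  # 0.00, useless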
Methods to deal with imbalanced data
Try getting more data
The most basic remedy for having too little minority-class data is to gather more of it. Before reaching for any fancy techniques, check whether there is a source that can provide you with more data for the minority class.
Try changing the performance metric
For any kind of machine learning problem, our first instinct is to use accuracy as an evaluation metric. However, in the case of an imbalanced classification problem, accuracy gives misleading results. Instead, you can try the following performance metrics to make sense of the trained model’s predictions:
- Confusion Matrix
- Precision, Recall, and F1 Score
If you want to learn how to make sense of the predictions made by the model using the confusion matrix, precision, recall, and F1 score, you can read my article at the link below.
Know Different Performance Measures for Machine Learning Classification Problem
This article will teach you the different performance measures used in machine learning classification tasks. The…
pub.towardsai.net
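As a quick illustration before moving on, scikit-learn provides all of these metrics out of the box. The sketch below assumes you already have ytest and ypred arrays from your own trained classifier.
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
# ytest and ypred are assumed to come from your own train/test split and model
print(confusion_matrix(ytest, ypred))
print(f"Precision: {precision_score(ytest, ypred):.3f}")
print(f"Recall:    {recall_score(ytest, ypred):.3f}")
print(f"F1 score:  {f1_score(ytest, ypred):.3f}")
# classification_report summarizes precision, recall, and F1 per class
print(classification_report(ytest, ypred))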
Try resampling the data
Resampling the data means either increasing or decreasing the number of data points.
Increasing the number of minority-class data points is called oversampling. On the other hand, decreasing the number of majority-class data points is called undersampling.
We can make use of the Python library imbalanced-learn (imported as imblearn) to implement the methods for undersampling and oversampling.
There are two ways to perform oversampling in the data:
- To randomly sample the records with replacement from the existing minority class data
- To generate the new records using the K-nearest neighbor approach on the existing minority data
To demonstrate oversampling and undersampling, I will be using the code snippets from one of my Kaggle notebooks. If you are interested in the other preprocessing steps, you can check the notebook using the link given at the end of the article.
The first approach can be implemented using the RandomOverSampler class of the imblearn library.
# Importing the required classes
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
import seaborn as sns
# Performing oversampling on the training data
X_oversampled, y_oversampled = RandomOverSampler(random_state=0).fit_resample(Xtrain, ytrain)
print(f"X_oversampled new size: {X_oversampled.shape}")
print(f"y_oversampled new size: {y_oversampled.shape}")
# Training the random forest classifier on the oversampled data
rfc = RandomForestClassifier(max_depth=3)
rfc.fit(X_oversampled, y_oversampled)
# Making predictions using the trained model and plotting the confusion matrix
ypred = rfc.predict(Xtest)
cf_matrix = confusion_matrix(ytest, ypred)
sns.heatmap(cf_matrix, annot=True, cmap='crest', fmt='.3g')
The second approach can be implemented using the SMOTE or ADASYN classes of the imblearn library.
# Importing the required classes
from imblearn.over_sampling import SMOTE
# Performing oversampling on the training data using SMOTE (synthetic samples via k-nearest neighbors)
X_oversampled, y_oversampled = SMOTE().fit_resample(Xtrain, ytrain)
print(f"X_oversampled new size: {X_oversampled.shape}")
print(f"y_oversampled new size: {y_oversampled.shape}")
# Training the random forest classifier on the oversampled data
rfc = RandomForestClassifier(max_depth=3)
rfc.fit(X_oversampled, y_oversampled)
# Making predictions using the trained model and plotting the confusion matrix
ypred = rfc.predict(Xtest)
cf_matrix = confusion_matrix(ytest, ypred)
sns.heatmap(cf_matrix, annot=True, cmap='crest', fmt='.3g')
Similarly, you can use the ADASYN class too.
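For example, the only change needed in the SMOTE snippet above would be the import and the sampler class:
from imblearn.over_sampling import ADASYN
# ADASYN also generates synthetic minority samples, focusing on regions that are harder to learn
X_oversampled, y_oversampled = ADASYN().fit_resample(Xtrain, ytrain)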
You can use a similar class to perform the undersampling of the majority class, too.
# Importing the required classes
from imblearn.under_sampling import RandomUnderSampler
# Performing undersampling on the training data
X_undersampled, y_undersampled = RandomUnderSampler(random_state=0).fit_resample(Xtrain, ytrain)
print(f"X_undersampled new size: {X_undersampled.shape}")
print(f"y_undersampled new size: {y_undersampled.shape}\n")
# Training the random forest classifier on the undersampled data
rfc = RandomForestClassifier(max_depth=3)
rfc.fit(X_undersampled, y_undersampled)
# Making predictions using the trained model and plotting the confusion matrix
ypred = rfc.predict(Xtest)
cf_matrix = confusion_matrix(ytest, ypred)
sns.heatmap(cf_matrix, annot=True, cmap='crest', fmt='.3g')
Try different algorithms
Rather than worrying about the imbalance in the data, one approach is to use a machine learning algorithm that handles imbalance well, such as the Random Forest algorithm. Additionally, you can use estimators such as SVC from scikit-learn, which let us assign weights to the classes present in the data.
However, in the case of an extreme imbalance, gathering more data is usually the wisest choice.
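For instance, both SVC and RandomForestClassifier in scikit-learn accept a class_weight parameter; setting it to 'balanced' weights each class inversely to its frequency in the training labels. A minimal sketch, reusing the Xtrain and ytrain variables from the snippets above:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
# 'balanced' assigns each class a weight inversely proportional to its frequency in ytrain
svc = SVC(class_weight="balanced").fit(Xtrain, ytrain)
rfc = RandomForestClassifier(class_weight="balanced", max_depth=3).fit(Xtrain, ytrain)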
Try penalized models
We can make use of penalized models to deal with the imbalance in the data. Penalized classification, as the name suggests, punishes the model by imposing an extra cost whenever it misclassifies a minority-class data point. This way, the model is forced to pay more attention to the minority classes. Some examples of penalized models are penalized SVM, penalized LDA, etc.
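A common way to build such a penalized model with scikit-learn is to pass an explicit class_weight dictionary, so that mistakes on the minority class cost more than mistakes on the majority class. The weights below are purely illustrative, and the minority class is assumed to be labeled 1.
from sklearn.svm import SVC
# A penalized SVM: errors on the minority class (label 1) are weighted 10x more heavily
penalized_svm = SVC(class_weight={0: 1, 1: 10})
penalized_svm.fit(Xtrain, ytrain)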
Important Tips
- You should always split the data before applying any kind of balancing technique, and resample only the training split. This way, we can ensure that no information leaks into the test data (see the sketch after this list).
- It’s not advisable to use undersampling as the primary solution for tackling the imbalance issue in your dataset. This is because undersampling involves removing a significant portion of your data, which can potentially alter the data’s distribution. If the distribution changes, the undersampled data may not accurately represent the domain of the problem, leading to inaccuracies in your analysis.
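Putting the first tip into practice, here is a minimal sketch (assuming features X and labels y) that splits the data first, with stratification so both splits keep the original class ratio, and only then oversamples the training portion.
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
# Split first; stratify=y keeps the class ratio identical in both splits
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
# Resample only the training data -- the test set stays untouched
X_resampled, y_resampled = RandomOverSampler(random_state=0).fit_resample(Xtrain, ytrain)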
Outro
Thanks for reading! If you have any thoughts on the article, then please let me know.
Are you struggling to choose what to read next? Don’t worry, I have got you covered.
From Raw to Refined: A Journey Through Data Preprocessing — Part 5: Outliers
A Simple Guide to Navigating Data Anomalies. Decode the mystery behind outliers in data science. From detection to…
ai.plainenglish.io
and one more…
From Raw to Refined: A Journey Through Data Preprocessing — Part 4: Data Encoding
Why data encoding is necessary
pub.towardsai.net
Have a great day!
References:
Imbalanced Data | Machine Learning | Google for Developers
A classification data set with skewed class proportions is called imbalanced. Classes that make up a large proportion…
developers.google.com
5 Techniques to Handle Imbalanced Data For a Classification Problem
Techniques to handle imbalanced data for a classification problem. Here we discuss what is imbalanced data, and how to…
www.analyticsvidhya.com
Having an Imbalanced Dataset? Here Is How You Can Fix It.
Different Ways to Handle Imbalanced Datasets.
towardsdatascience.com
imbalanced-learn documentation – Version 0.11.0
The user guide provides in-depth information on the key concepts of imbalanced-learn with useful background information…
imbalanced-learn.org
Undersampling and oversampling image reference
Machine Learning with Oversampling and Undersampling Techniques: Overview Study and Experimental Results — Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/Differences-between-undersampling-and-oversampling_fig1_340978368 [accessed 5 Jan, 2024]
Code
Imbalanced Data demo using Cerebral Stroke Dataset
Explore and run machine learning code with Kaggle Notebooks | Using data from Cerebral Stroke Prediction-Imbalanced…
www.kaggle.com
Published via Towards AI