Imbalanced Data and How to Balance It

Last Updated on March 12, 2021 by Editorial Team

Author(s): Vaishnavi Patil

Machine Learning

Working with data has become a tedious job, largely because most of the time goes into cleaning and preprocessing it. Real-world data rarely arrives in the form we expect: it contains many irregularities and is challenging to work with. One of the main obstacles is class imbalance. In binary classification problems, the target feature is typically dominated by one class, while the other class has few or no training samples. In multi-class classification problems, the training samples of one class can greatly outnumber those of the other classes. This hampers model performance. Listed below are some ways to mitigate the problem and make sure your model does justice to your data.

Undersampling
Oversampling
Setting the ‘weights’ hyperparameter of the model.
Setting the train_test_split stratify attribute.

Undersampling

Undersampling removes training samples that belong to the majority class of the target feature. The method has its pros and cons. On the plus side, the model trains faster because it has fewer samples to train on, and the classes end up evenly balanced. On the minus side, discarding samples throws away information, so even a capable model may not generalize well to unseen data and may sometimes overfit the smaller training set. As the saying goes, the more data the model gets to see, the better it predicts on unseen data.
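As a rough illustration of the idea, here is a minimal sketch of random undersampling using only the Python standard library. The function name random_undersample and the toy data are assumptions made for this example, and dropping rows uniformly at random is only one of several possible undersampling strategies.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    # Group row indices by class label.
    indices_by_class = {}
    for i, label in enumerate(y):
        indices_by_class.setdefault(label, []).append(i)
    # Target size: the number of samples in the smallest class.
    n_min = min(len(idx) for idx in indices_by_class.values())
    keep = []
    for idx in indices_by_class.values():
        keep.extend(rng.sample(idx, n_min))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data (hypothetical): 8 samples of class 0 and 2 samples of class 1.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_res, y_res = random_undersample(X, y)
print(Counter(y_res))  # both classes now have the same number of samples
```

Note that six of the eight majority-class rows are discarded here, which is exactly the information loss described above.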

Oversampling

Oversampling increases the number of training samples of the minority class so that the majority and minority classes have a similar number of samples. That description covers binary classification, where there are only two classes; in multi-class classification, the samples of each minority class are increased. One of the key oversampling techniques is SMOTE (Synthetic Minority Oversampling Technique).

In this method, a random training sample is selected from the minority class, and its k nearest neighbors within the minority class (usually k=5) are found using K-Nearest Neighbors. One of these neighbors is then selected at random. A synthetic training sample is generated at a random point on the line segment between the originally selected sample and this neighbor. With oversampling, your model will probably generalize better to new data. One drawback to consider is that oversampling is a poor fit for datasets that already have a large number of training samples, because adding even more samples slows down training.
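The steps above can be sketched in a few lines of plain Python. This is a simplified illustration of how a single synthetic point is produced, not a full SMOTE implementation; the function name smote_sample and the toy points are assumptions made for this example. In practice you would more likely rely on a library such as imbalanced-learn, which provides a SMOTE class.

```python
import math
import random

def smote_sample(minority, k=5, rng=None):
    """Generate one synthetic point from a list of minority-class feature vectors.

    Assumes the list contains at least two samples.
    """
    if rng is None:
        rng = random.Random(0)
    # 1. Pick a random minority sample.
    base = rng.choice(minority)
    # 2. Find its k nearest minority-class neighbors (excluding itself).
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(base, p))
    neighbors = others[:k]
    # 3. Pick one of those neighbors at random.
    neighbor = rng.choice(neighbors)
    # 4. Interpolate at a random position on the segment between the two points.
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]

# Toy minority class (hypothetical values) with two features per sample.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7], [1.4, 1.0]]
rng = random.Random(42)
synthetic = [smote_sample(minority, k=5, rng=rng) for _ in range(4)]
print(synthetic)
```

Because every synthetic point is an interpolation between two existing minority samples, it always falls inside the region already covered by the minority class.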

Setting the ‘weights’ hyperparameter of the model.

Many common machine learning models, such as a Random Forest Classifier, let you weight samples or classes during training. In scikit-learn, sample_weight is an argument passed to the model's fit method and holds one weight per training sample. For example, in a binary classification problem where class 1 has three times as many samples as class 0, you could set sample_weight=np.array([3 if label == 0 else 1 for label in y_train]), where y_train holds the training labels, so that each class 0 sample counts three times as much as a class 1 sample. Similarly, the CatBoost classifier has a class_weights hyperparameter, which can be set according to the class distribution. For this you need to determine the proportion of each class. For example, if class 0 contributes 70 percent of the training samples and the remaining 30 percent belong to class 1, you can set class_weights=[0.3, 0.7]. Because class 0 has 70% of the training samples, it is assigned the weight 0.3, and class 1, with 30% of the samples, is assigned 0.7, so that the weighted contributions of the two classes are equal (0.7 × 0.3 = 0.3 × 0.7).
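The 70/30 example above can be reproduced without hard-coding the numbers. The sketch below derives inverse-frequency class weights from the labels using the formula n_samples / (n_classes * class_count), which is the heuristic scikit-learn documents for class_weight='balanced'. The helper name inverse_frequency_weights and the toy labels are assumptions made for this example.

```python
from collections import Counter

def inverse_frequency_weights(y):
    """Return {label: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Toy labels (hypothetical): 70 percent class 0, 30 percent class 1.
y_train = [0] * 70 + [1] * 30
class_weights = inverse_frequency_weights(y_train)

# One weight per training row, for estimators that expect per-sample weights.
sample_weight = [class_weights[label] for label in y_train]
print(class_weights)
```

The resulting weights are in the same 3:7 ratio as [0.3, 0.7]; only the overall scale differs, and with these weights each class contributes the same total weight to training.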

Setting the train_test_split stratify attribute.

Here is a direct and easy method that helps when working with imbalanced data. When splitting the dataset into training and test sets, you can specify the ‘stratify’ attribute of train_test_split. It does not resample the data; instead, it splits the original dataset so that the proportion of both classes (binary classification) is preserved in the training and test sets. For example, if the two classes occur in an 80:20 ratio in the original data, train_test_split(X,y,test_size=0.2,stratify=y) will preserve that 80:20 ratio in the training set as well as the test set. This is a straightforward method; however, you can try any of the methods above, each of which can contribute to the overall performance of your model.
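To make the effect of stratification concrete, here is a hand-written sketch of a stratified split using only the standard library. It illustrates the idea and is not scikit-learn's implementation; the function name stratified_split and the toy data are assumptions made for this example. In a real project you would call train_test_split with stratify=y instead.

```python
import random
from collections import Counter

def stratified_split(X, y, test_size=0.2, seed=0):
    """Split so each class sends roughly test_size of its rows to the test set."""
    rng = random.Random(seed)
    # Group row indices by class label.
    indices_by_class = {}
    for i, label in enumerate(y):
        indices_by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for idx in indices_by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Toy data (hypothetical): classes 0 and 1 in an 80:20 ratio.
X = [[i] for i in range(100)]
y = [0] * 80 + [1] * 20
X_train, X_test, y_train, y_test = stratified_split(X, y, test_size=0.2)
print(Counter(y_train), Counter(y_test))
```

Both the training and test sets keep the original class ratio, so the minority class cannot end up missing from either split by chance.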

References

scikit-learn official documentation. Images by Author. Images from Unsplash.

Conclusion

Real-world data is often unstructured and imbalanced, so handling data imbalance is one of the primary points to consider during data preprocessing. I hope this article has provided valuable information on handling imbalanced datasets. Lastly, one recent method for handling imbalance is the Ensemble Resampling method by Andreas Mueller. Feel free to check it out along with the methods above.

Imbalanced Data and How to Balance It was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI
