Imbalanced Data and How to Balance It
Last Updated on March 12, 2021 by Editorial Team
Author(s): Vaishnavi Patil
Dealing with data has become a tedious job, considering that most of the time is spent on cleaning and preprocessing it. Real-world data is rarely available in the form we expect. It has a lot of irregularities, and it's challenging to deal with. One of the main obstacles is that the data is imbalanced. Typically, in a binary classification problem, most samples of the target feature belong to one class while the other class has few to no training samples. Likewise, in multi-class classification problems, the training samples of one class can greatly outnumber the samples of the other classes. This hampers model performance. Listed below are some of the ways you can avoid such problems and ensure that your model does justice to your data.
- Undersampling.
- Oversampling (SMOTE).
- Setting the weights hyperparameter of the model.
- Setting the stratify attribute of train_test_split.
Undersampling refers to removing training samples that belong to the majority class of the target feature. This method comes with its pros and cons. On the plus side, training is faster because the model has fewer samples to train on, and the dataset ends up equally balanced. On the downside, because training samples are discarded, even a highly capable model loses information, may not generalize well to unseen data, and can sometimes overfit the smaller dataset. As the saying goes, the more data the model gets to see, the better it predicts on unseen data.
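To make the idea concrete, here is a minimal sketch of random undersampling using only the Python standard library. The function name random_undersample and the toy data are made up for illustration, and the choice to shrink every class to the size of the rarest class is one common convention, not the only one.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    # Every class is cut down to the size of the smallest class.
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))
    kept.sort()  # keep the original row order
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy data (hypothetical): 8 rows of class 0 and 2 rows of class 1.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_bal, y_bal = random_undersample(X, y)
print(Counter(y), "->", Counter(y_bal))
```

After the call both classes have two rows each, which also shows the cost: six of the eight majority rows are simply thrown away.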
Oversampling is a technique in which the number of training samples of the minority class is increased to balance it against the majority class. In binary classification there is a single minority class to grow; in multi-class classification, the samples of each minority class are increased. One of the key techniques used for oversampling is SMOTE (Synthetic Minority Oversampling Technique).
In this method, a random training sample is selected from the minority class, and its k nearest neighbors within the minority class (usually k=5) are found using K-Nearest Neighbors. One of these neighbors is then picked at random, and a synthetic training sample is generated at a random point on the line segment between the original sample and the picked neighbor. With oversampling applied, your model will probably generalize better on new data. The one drawback to consider is that oversampling isn't a good choice for datasets that already have a large number of training samples, because adding even more samples slows down the training of the model at hand.
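The steps described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the reference SMOTE implementation: the function name smote_sketch and the toy points are hypothetical, distances are Euclidean, and neighbors are found by brute force.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other points
    synthetic = []
    for _ in range(n_new):
        # 1. Pick a random minority sample.
        base = rng.choice(minority)
        # 2. Rank the other minority samples by Euclidean distance to it.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        # 3. Pick one of the k nearest neighbors at random.
        neighbor = rng.choice(others[:k])
        # 4. Place the new point at a random position on the segment between them.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority class (hypothetical): four 2-D points.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=6, k=3)
print(new_points)
```

In practice you would more likely reach for a maintained implementation, such as the SMOTE class in the imbalanced-learn package, rather than hand-rolling this.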
Setting the ‘weights’ hyperparameter of the model.
Many machine learning models let you weight classes or samples to counter imbalance. With scikit-learn's Random Forest Classifier, for example, you can pass per-sample weights through the sample_weight argument of fit (strictly a fit argument rather than a hyperparameter; the class_weight hyperparameter does the same job per class). E.g., in a binary classification problem where class 1 has three times as many samples as class 0, you could set sample_weight=np.array([3 if i == 0 else 1 for i in y]), so that each class 0 sample counts three times as much as a class 1 sample. Similarly, the CatBoost classifier has a class_weights hyperparameter, which can be set depending on the class distribution. For this, you need to determine the share of each class. E.g., if class 0 contributes 70 percent of the training samples and the remaining 30 percent belong to class 1, you set class_weights=[0.3,0.7]. Because class 0 holds 70% of the training samples, it is assigned the smaller weight 0.3, and class 1, with 30% of the samples, is assigned 0.7, so that both classes carry the same total weight (0.7 × 0.3 = 0.3 × 0.7).
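As a rough sketch of how such weights can be derived from the data instead of typed in by hand, the snippet below computes inverse-frequency class weights with the standard library only. The helper name balanced_class_weights is hypothetical; the formula n_samples / (n_classes * class_count) is the one scikit-learn documents for class_weight="balanced". The library calls appear only in comments, as an indication of where the values would go.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy target (hypothetical): 70% class 0, 30% class 1, as in the example above.
y = [0] * 70 + [1] * 30

class_weights = balanced_class_weights(y)
# One weight per row, looked up from the row's class.
sample_weight = [class_weights[label] for label in y]
print(class_weights)

# Where these values would go (not executed here):
#   RandomForestClassifier(class_weight=class_weights)
#   RandomForestClassifier().fit(X, y, sample_weight=sample_weight)
#   CatBoostClassifier(class_weights=[class_weights[0], class_weights[1]])
```

The resulting weights are in the same 3:7 ratio as the hand-picked [0.3, 0.7]; only the overall scale differs, and for class weighting it is the ratio between classes that matters.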
Setting the train_test_split stratify attribute.
Lastly, let's look at a direct and easy method. When splitting the dataset into training and test sets, you can specify the 'stratify' attribute of the function. Note that this does not resample or rebalance the data; instead, train_test_split splits the original dataset in such a way that the proportion of each class is preserved in both the training and the test sets, so the minority class is not left under-represented in either split. E.g., if the two classes of a binary classification problem are distributed in an 80:20 ratio, train_test_split(X,y,test_size=0.2,stratify=y) will preserve that 80:20 class proportion in the training set as well as in the test set. This is the most straightforward of the methods; that said, it's worth getting your hands dirty with any of them, since each can contribute to the overall performance of your model.
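To show what stratification does under the hood, here is a hand-rolled stratified split in plain Python. It is only a sketch of the idea and not scikit-learn's implementation: the function name stratified_split is hypothetical, and unlike train_test_split it returns the rows grouped by class rather than shuffled together.

```python
import random
from collections import Counter


def stratified_split(X, y, test_size=0.2, seed=0):
    """Split each class separately so both parts keep the original class ratio."""
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Send the same fraction of every class to the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def take(idx):
        return [X[i] for i in idx], [y[i] for i in idx]

    X_train, y_train = take(train_idx)
    X_test, y_test = take(test_idx)
    return X_train, X_test, y_train, y_test


# Toy data (hypothetical): classes in an 80:20 ratio.
X = [[i] for i in range(100)]
y = [0] * 80 + [1] * 20

X_train, X_test, y_train, y_test = stratified_split(X, y, test_size=0.2)
print(Counter(y_train), Counter(y_test))
```

Both parts keep the 80:20 class ratio of the original data: the dataset is still imbalanced after the split, but neither the training set nor the test set is more skewed than the original.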
Scikit-learn's official documentation. Images by Author. Images from Unsplash.
Real-world data is often unstructured and imbalanced, so handling data imbalance is one of the primary points to consider during data preprocessing. I hope this article has provided valuable information on handling imbalanced datasets. Lastly, one recent method for handling imbalance is the Ensemble Resampling method by Andreas Mueller. Feel free to check it out along with the methods above.
Published via Towards AI