Why Accuracy Is Not A Good Metric For Imbalanced Data
Last Updated on August 11, 2022 by Editorial Team
Author(s): Rafay Qayyum
Classification, in machine learning, is a supervised learning task in which data points are assigned to different classes, for example, determining whether an email is “spam” or “not spam”, or determining the blood type of a patient.
Machine Learning Classification is generally divided into three categories:
- Binary Classification
- Multi-class Classification
- Multi-label Classification
What are imbalanced classes or data?
Imbalanced data refers to a problem where the distribution of examples across the known classes is biased (one class has many more instances than the other). For example, one class may have 10,000 instances while the other has just 100.
Class imbalance can range from small to huge differences in the number of instances per class. Small imbalances such as 4:1 or 10:1 won’t harm your model much, but as the imbalance grows toward ratios like 1000:1 or 5000:1, it can create real problems for your machine learning model.
The class (or classes) with many instances in an imbalanced classification problem is known as the majority class(es).
The class (or classes) with few instances is known as the minority class(es).
Why can imbalanced classes cause problems?
When working with imbalanced data, the minority class is usually the one we care about. When detecting “spam” emails, for example, spam messages are few compared to “not spam” messages. Yet machine learning algorithms tend to favor the larger class and, if the data is highly imbalanced, may effectively ignore the smaller one.
Machine learning algorithms are designed to learn from the training data to minimize the loss and maximize accuracy. Let’s see how a machine learning algorithm works with highly imbalanced data.
An Example
Consider this example where there are 9900 instances of Class “A” and 100 instances of Class “B”.
from sklearn.datasets import make_classification
x, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)
The count plot of the dataset can be created with the seaborn library:
import numpy as np
import seaborn as sns

np.unique(y, return_counts=True)   # class counts: 9900 vs. 100
y = np.where(y == 0, 'A', 'B')     # relabel: majority class 0 -> 'A', minority class 1 -> 'B'
sns.countplot(x=y)
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.20, random_state=42)
print(np.unique(ytrain, return_counts=True))
print(np.unique(ytest, return_counts=True))
After splitting the dataset into training and test data using train_test_split with a test size of 20%, we are left with 7919 training examples for Class “A” and 81 training examples for Class “B”. The test examples are 1981 for Class “A” and 19 for Class “B”.
Let’s first train a Logistic Regression model with our training data.
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(xtrain, ytrain)
lr.score(xtest, ytest)
Now, if we check the accuracy of the model using the score method, it is 0.992. 99.2% accuracy? It’s performing great, right? Let’s check the confusion matrix.
from sklearn.metrics import confusion_matrix

pred_lr = lr.predict(xtest)
print(confusion_matrix(ytest, pred_lr))
Although Class “A” has an accuracy of 100%, only 3 out of the 19 test examples of Class “B” were classified correctly. It must be a mistake, right?
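One handy way to quantify this per class (not shown in the original post, but a quick sanity check) is scikit-learn’s recall_score, which reports the fraction of each class’s test examples that were classified correctly:

from sklearn.metrics import recall_score

# Per-class recall: one value per label, in the order given by labels=
print(recall_score(ytest, pred_lr, labels=['A', 'B'], average=None))
# Roughly [1.00, 0.16]: near-perfect for 'A', very poor for 'B'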
Let’s use a Random Forest classifier on the same dataset and check what’s happening.
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(xtrain, ytrain)
rfc.score(xtest, ytest)
The accuracy score is 0.991 this time, but what did we learn last time? The real results hide behind the accuracy. Let’s check the confusion matrix for the Random Forest classifier’s predictions.
pred_rfc=rfc.predict(xtest)
print(confusion_matrix(ytest,pred_rfc))
Only 1 out of 1981 test examples for Class “A” was classified incorrectly, but only 2 out of 19 test examples for Class “B” were classified correctly.
What are our machine learning models doing here?
As discussed before, machine learning models try to maximize accuracy, and that’s exactly what is happening here. Since instances of Class “A” make up 99% of the data, the models learn to classify them correctly and largely ignore (or learn very little about) Class “B”, because assigning everything to Class “A” already yields 99% accuracy.
You can match the accuracy of these models by writing a single statement in Python. Shocked?
pred = ['A'] * len(ytest)  # predict the majority class 'A' for every test example
This statement creates a list of length 2000 (the test set is 20% of 10,000) and fills it with “A”. Since about 99% of the test examples belong to Class “A”, this “model” scores roughly 99% accuracy.
from sklearn.metrics import accuracy_score
accuracy_score(ytest, pred)
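scikit-learn even ships this baseline as an estimator. A minimal sketch (not part of the original post), assuming the same train/test split as above, using DummyClassifier:

from sklearn.dummy import DummyClassifier

# Always predicts the most frequent class seen in training ('A' here)
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(xtrain, ytrain)
print(dummy.score(xtest, ytest))  # ~0.99, matching the "real" models on accuracy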
How can you handle an imbalanced dataset?
There are many ways to handle an imbalanced dataset. Some require domain knowledge; others use algorithms to increase the number of minority-class instances (over-sampling) or decrease the number of majority-class instances (under-sampling).
- With domain knowledge, you can either split a majority class into more than one class or merge several minority classes into one class with more instances.
- Imbalanced-learn is a Python library that provides various methods for under-sampling and over-sampling (a minimal resampling sketch follows this list).
- Another way is to set weights for the different classes when creating the machine learning model object (see the class_weight sketch after this list). Check how class_weight works on Stack Overflow.
- Some over-sampling methods in Imbalanced-learn are SMOTE, RandomOverSampler, BorderlineSMOTE, KMeansSMOTE, and ADASYN.
- Some under-sampling methods in Imbalanced-learn are ClusterCentroids, CondensedNearestNeighbour, EditedNearestNeighbours, NearMiss, OneSidedSelection, RandomUnderSampler, TomekLinks, and NeighbourhoodCleaningRule.
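For example, both LogisticRegression and RandomForestClassifier accept a class_weight parameter. A minimal sketch, assuming the same xtrain/ytrain split created above:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# 'balanced' weights each class inversely to its frequency, so mistakes
# on the minority class 'B' cost far more during training
lr_weighted = LogisticRegression(class_weight='balanced')
lr_weighted.fit(xtrain, ytrain)
print(confusion_matrix(ytest, lr_weighted.predict(xtest)))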
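And a minimal resampling sketch with Imbalanced-learn (assuming the package is installed, e.g. via pip install imbalanced-learn, and reusing the same training split):

import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Over-sample the minority class with synthetic examples...
x_over, y_over = SMOTE().fit_resample(xtrain, ytrain)
print(np.unique(y_over, return_counts=True))   # both classes should now have ~7919 examples

# ...or under-sample the majority class instead
x_under, y_under = RandomUnderSampler().fit_resample(xtrain, ytrain)
print(np.unique(y_under, return_counts=True))  # both classes should now have ~81 examples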