
Why Accuracy Is Not A Good Metric For Imbalanced Data

Last Updated on August 11, 2022 by Editorial Team

Author(s): Rafay Qayyum


Classification, in machine learning, is a supervised learning task where data points are assigned to one of a set of classes. For example, determining whether an email is “spam” or “not spam”, or determining a patient’s blood type.

Machine Learning Classification is generally divided into three categories:

  • Binary Classification
  • Multi-class Classification
  • Multi-label Classification

What are Imbalanced classes or data?

Imbalanced data refers to a problem where the distribution of examples across the known classes is biased: one class has far more instances than the other. For example, one class may have 10,000 instances while the other has just 100.

Class with majority instances is weighed more than the class with minority instances — Google

Data imbalance can range from small to huge differences in the number of instances per class. Small imbalances such as 4:1 or 10:1 won’t hurt your model much, but as the imbalance grows to 1000:1 or 5000:1, it can create real problems for your machine learning model.

The class (or classes) in an imbalanced classification problem that has many instances is known as the Majority Class(es).

The class (or classes) in an imbalanced classification problem that has few instances is known as the Minority Class(es).
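As a quick illustration (with hypothetical labels, not data from this article), you can inspect the class distribution and the imbalance ratio in a few lines:

from collections import Counter

labels = ['A'] * 10000 + ['B'] * 100                 # hypothetical imbalanced labels
counts = Counter(labels)
print(counts)                                        # Counter({'A': 10000, 'B': 100})
print(max(counts.values()) / min(counts.values()))   # imbalance ratio: 100.0

Here “A” would be the majority class and “B” the minority class.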

Why do imbalanced classes cause problems?

When working with imbalanced data, the minority class is usually the one we care about. When detecting “spam” emails, for instance, spam messages are far fewer than “not spam” ones. Yet machine learning algorithms favor the larger class, and if the data is highly imbalanced, they may effectively ignore the smaller one.

Machine learning algorithms are designed to learn from the training data to minimize the loss and maximize accuracy. Let’s see how a machine learning algorithm works with highly imbalanced data.

An Example

Consider this example where there are 9,900 instances of Class “A” and 100 instances of Class “B”.

from sklearn.datasets import make_classification
x, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)

A count plot of the dataset can be created with the seaborn library:

import numpy as np
import seaborn as sns

print(np.unique(y, return_counts=True))   # class counts before relabeling
y = np.where(y == 0, 'A', 'B')            # rename class 0 to "A" and class 1 to "B"
sns.countplot(x=y)

Count plot of the dataset.
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.20, random_state=42)
print(np.unique(ytrain, return_counts=True))
print(np.unique(ytest, return_counts=True))

After splitting the dataset into training and test data using train_test_split with a test size of 20%, we are left with 7919 training examples of Class “A” and 81 of Class “B”. The test set contains 1981 examples of Class “A” and 19 of Class “B”.
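As a side note that goes beyond the run shown here, train_test_split also accepts a stratify argument that preserves the class proportions in both splits, which is usually worth using on imbalanced data. A minimal sketch (the variable names are hypothetical and not used below):

# Hypothetical variant: a stratified split keeps the ~99:1 ratio in both sets
xtr_s, xte_s, ytr_s, yte_s = train_test_split(
    x, y, test_size=0.20, random_state=42, stratify=y)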

Let’s first train a Logistic Regression model with our training data.

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(xtrain, ytrain)
lr.score(xtest, ytest)   # mean accuracy on the test set

Now, if we check the accuracy of the model using the score method, it is 0.992. 99.2% accuracy? It’s performing great, right? Let’s check the confusion matrix.

from sklearn.metrics import confusion_matrix

pred_lr = lr.predict(xtest)
print(confusion_matrix(ytest, pred_lr))
Confusion matrix for Logistic Regression

Although Class “A” is classified with nearly 100% accuracy, only 3 out of the 19 Class “B” test examples were classified correctly. It must be a mistake, right?

Let’s use a Random Forest Classifier on the same dataset and see what happens.

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(xtrain, ytrain)
rfc.score(xtest, ytest)

The accuracy score is 0.991 this time, but we already know better: the real results hide behind the accuracy. Let’s check the confusion matrix for the Random Forest Classifier’s predictions.

pred_rfc=rfc.predict(xtest)
print(confusion_matrix(ytest,pred_rfc))
Confusion matrix for Random Forest Classifier

Only 1 of the 1981 Class “A” test examples was misclassified, but only 2 of the 19 Class “B” test examples were classified correctly.

What are our machine learning models doing here?

As discussed before, machine learning models try to maximize accuracy, and that is exactly what is happening here. Since Class “A” makes up 99% of the data, the models learn to classify it correctly and learn little about Class “B”: predicting Class “A” for every example already yields 99% accuracy.

You can match the accuracy of these models with a single line of Python. Shocked?

pred = ['A'] * len(ytest)   # predict the majority class for every test example

This statement creates a list of length 2000 (the test set is 20% of the 10,000 examples) and fills it with “A”. Since about 99% of the test samples belong to Class “A”, this trivial baseline scores roughly 99% with accuracy_score.

from sklearn.metrics import accuracy_score
accuracy_score(ytest, pred)
Confusion matrix for the “pred” list
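As an aside that goes beyond the original article, scikit-learn’s DummyClassifier builds exactly this majority-class baseline, and classification_report exposes the per-class picture that a single accuracy number hides. A minimal sketch, assuming the xtrain/xtest/ytrain/ytest and pred_lr variables from above:

from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

# Baseline that always predicts the most frequent training label ("A")
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(xtrain, ytrain)
print(dummy.score(xtest, ytest))   # ~0.99 accuracy without ever predicting "B"

# Per-class precision, recall, and F1 for the Logistic Regression predictions
print(classification_report(ytest, pred_lr))

The report shows near-perfect recall for Class “A” and very low recall for Class “B”, which is exactly the failure that accuracy alone conceals.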

How can you handle an imbalanced dataset?

There are many ways to handle an imbalanced dataset. Some require domain knowledge; others use resampling algorithms that increase the number of minority-class instances (over-sampling) or decrease the number of majority-class instances (under-sampling), as sketched below.
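As a rough sketch of over-sampling (this helper code is an assumption, not the author’s implementation), the minority class can be randomly resampled with replacement using scikit-learn’s resample utility, reusing the xtrain/ytrain variables from above:

import numpy as np
from sklearn.utils import resample

# Separate the training data by class (labels 'A'/'B' as defined earlier)
x_major, y_major = xtrain[ytrain == 'A'], ytrain[ytrain == 'A']
x_minor, y_minor = xtrain[ytrain == 'B'], ytrain[ytrain == 'B']

# Randomly over-sample the minority class until it matches the majority class
x_minor_up, y_minor_up = resample(x_minor, y_minor, replace=True,
                                  n_samples=len(y_major), random_state=42)

xtrain_bal = np.vstack([x_major, x_minor_up])
ytrain_bal = np.concatenate([y_major, y_minor_up])

Under-sampling works the other way around: the majority class is resampled down to the size of the minority class. Many estimators also accept class_weight='balanced' to weight the minority class more heavily without changing the data at all.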

