Why Accuracy Is Not A Good Metric For Imbalanced Data

Last Updated on August 11, 2022 by Editorial Team

Author(s): Rafay Qayyum

Classification, in machine learning, is a supervised learning task in which data points are assigned to different classes. Examples include determining whether an email is "spam" or "not spam", or determining the blood type of a patient.

Machine Learning Classification is generally divided into three categories:

  • Binary Classification
  • Multi-class Classification
  • Multi-label Classification

What are imbalanced classes or data?

Imbalanced data refers to a problem where the distribution of examples across the known classes is biased: one class has far more instances than the others. For example, one class may have 10,000 instances while another has just 100.

The class with the majority of instances is weighted more heavily than the class with the minority of instances (Google).

Data imbalance can range from small to huge differences in the number of instances per class. Small imbalances such as 4:1 or 10:1 won't harm your model much, but as the imbalance grows toward ratios like 1000:1 or 5000:1, it can create serious problems for your machine learning model.
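
As a quick illustration, the imbalance ratio can be checked directly from the labels. Here is a minimal, self-contained sketch with made-up labels (not the dataset built later in the article):

from collections import Counter

labels = ['A'] * 9900 + ['B'] * 100                  # a made-up 99:1 label distribution
counts = Counter(labels)                             # Counter({'A': 9900, 'B': 100})
print(max(counts.values()) / min(counts.values()))   # 99.0, i.e. a 99:1 imbalance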

The class (or classes) in an imbalanced classification problem that has many instances is known as the Majority Class(es).

The class (or classes) in an imbalanced classification problem that has few instances is known as the Minority Class(es).

Why can imbalanced classes cause problems?

When working with imbalanced data, the minority class is usually the one we care about. For example, when detecting "spam" emails, spam messages are few compared to "not spam" emails. Machine learning algorithms therefore tend to favor the larger class and, if the data is highly imbalanced, sometimes all but ignore the smaller class.

Machine learning algorithms are designed to learn from the training data to minimize the loss and maximize accuracy. Let's see how a machine learning algorithm works with highly imbalanced data.

An Example

Consider this example, where there are 9,900 instances of Class "A" and 100 instances of Class "B".

from sklearn.datasets import make_classification

x, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)

The count plot of the dataset can be created with the seaborn library:

import numpy as np
import seaborn as sns

np.unique(y, return_counts=True)   # (array([0, 1]), array([9900, 100]))
y = np.where(y == 0, 'A', 'B')     # relabel: 0 -> 'A' (majority), 1 -> 'B' (minority)
sns.countplot(x=y)
Count plot of the dataset.
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.20, random_state=42)
print(np.unique(ytrain, return_counts=True))
print(np.unique(ytest, return_counts=True))

After splitting the dataset into training and test data using train_test_split with a test size of 20%, we are left with 7,919 training examples of Class "A" and 81 of Class "B". The test set contains 1,981 examples of Class "A" and 19 of Class "B".

Let's first train a Logistic Regression model with our training data.

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(xtrain, ytrain)
lr.score(xtest, ytest)

Now, if we check the accuracy of the model using the score method, it is 0.992. 99.2% accuracy? It's performing great, right? Let's check the confusion matrix.

from sklearn.metrics import confusion_matrix

pred_lr = lr.predict(xtest)
print(confusion_matrix(ytest, pred_lr))
Confusion matrix for Logistic Regression

Although Class "A" is classified with 100% accuracy, only 3 out of 19 test examples of Class "B" were classified correctly. It must be a mistake, right?

Let's use a Random Forest Classifier on the same dataset and check what happens.

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(xtrain, ytrain)
rfc.score(xtest, ytest)

The accuracy score is 0.991 this time, but what did we learn last time? The real results hide behind the accuracy. Let's check the confusion matrix for the Random Forest Classifier's predictions.

pred_rfc = rfc.predict(xtest)
print(confusion_matrix(ytest, pred_rfc))
Confusion matrix for Random Forest Classifier

Only 1 out of 1,981 test examples of Class "A" was misclassified, but only 2 out of 19 test examples of Class "B" were classified correctly.
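
Per-class metrics make this failure visible without reading the confusion matrix cell by cell. As a minimal sketch (reusing the variables defined above), scikit-learn's classification_report shows precision and recall for each class:

from sklearn.metrics import classification_report

# Recall for Class "B" comes out near 0.1 (2 of 19 correct),
# even though overall accuracy is above 0.99.
print(classification_report(ytest, pred_rfc))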

What are our machine learning models doing here?

As we discussed before, machine learning models try to maximize accuracy, and that is exactly what is happening here. Since instances of Class "A" make up 99% of the data, the models learn to classify them correctly and largely ignore (or learn very little about) Class "B", because assigning all of the data to Class "A" already yields 99% accuracy.

You can match the accuracy of these models with a single statement in Python. Shocked?

pred = ['A'] * len(ytest)   # predict the majority class for every test example

This statement creates a list of length 2,000 (the test set is 20% of the 10,000 examples) and fills it with "A". Since 99% of the sample belongs to Class "A", this list reaches about 99% accuracy.

from sklearn.metrics import accuracy_score

accuracy_score(ytest, pred)   # ~0.99
Confusion matrix for the "pred" list
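
scikit-learn even ships this trivial baseline as a ready-made estimator. A short sketch using DummyClassifier with the most_frequent strategy, which reproduces roughly the same 99% accuracy on the variables above:

from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy="most_frequent")   # always predicts the majority class
dummy.fit(xtrain, ytrain)
print(dummy.score(xtest, ytest))                    # ~0.99, matching the "real" models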

How can you handle an imbalanced dataset?

There are many ways to handle an imbalanced dataset. Some require domain knowledge; others use resampling algorithms that increase the number of minority-class instances (over-sampling) or decrease the number of majority-class instances (under-sampling), as the sketch below illustrates.
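
As one illustration of the resampling idea (a minimal sketch, not the only approach), the minority class can be over-sampled with scikit-learn's resample utility, reusing the training split created earlier:

import numpy as np
from sklearn.utils import resample

# Split the training data by class.
x_major, y_major = xtrain[ytrain == 'A'], ytrain[ytrain == 'A']
x_minor, y_minor = xtrain[ytrain == 'B'], ytrain[ytrain == 'B']

# Over-sample the minority class (with replacement) up to the majority size.
x_minor_up, y_minor_up = resample(x_minor, y_minor,
                                  replace=True,
                                  n_samples=len(y_major),
                                  random_state=42)

x_balanced = np.vstack([x_major, x_minor_up])
y_balanced = np.concatenate([y_major, y_minor_up])

Under-sampling works the same way in the opposite direction (shrinking the majority class), and dedicated packages such as imbalanced-learn provide ready-made over- and under-samplers.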

