
Why Accuracy Is Not A Good Metric For Imbalanced Data

Last Updated on August 11, 2022 by Editorial Team

Author(s): Rafay Qayyum

Originally published on Towards AI, the world's leading AI and technology news and media company.

Classification, in machine learning, is a supervised learning task where data points are assigned to one of several classes: for example, determining whether an email is "spam" or "not spam", or determining a patient's blood type.

Machine Learning Classification is generally divided into three categories:

  • Binary Classification
  • Multi-class Classification
  • Multi-label Classification
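As a quick illustration of how the three categories differ, here are toy label arrays (hypothetical values, not taken from any dataset in this article); the shape and range of the target is what distinguishes them:

```python
import numpy as np

# Toy targets illustrating the three categories (hypothetical values).
binary = np.array([0, 1, 1, 0])           # each sample: one of 2 classes
multi_class = np.array([0, 2, 1, 2])      # each sample: one of >2 classes
multi_label = np.array([[1, 0, 1],        # each sample: a set of labels,
                        [0, 1, 0]])       # encoded as a 0/1 indicator row

print(binary.shape, multi_class.shape, multi_label.shape)  # (4,) (4,) (2, 3)
```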

What are Imbalanced classes or data?

Imbalanced data refers to a problem where the distribution of examples across the known classes is biased: one class has far more instances than another. For example, one class may have 10,000 instances while the other has just 100.
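A minimal sketch of measuring such an imbalance with NumPy; the counts below simply mirror the 10,000-vs-100 example and do not come from the article's dataset:

```python
import numpy as np

# Hypothetical labels: 100 of one class, 9,900 of the other.
y = np.array(["minority"] * 100 + ["majority"] * 9900)
classes, counts = np.unique(y, return_counts=True)  # per-class counts
ratio = counts.max() / counts.min()
print(ratio)  # 99.0 -> a 99:1 imbalance
```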

The class with majority instances is weighted more than the class with minority instances. – Google

Data imbalance can range from small to huge differences in the number of instances per class. Small imbalances such as 4:1 or 10:1 won't harm your model much, but as the imbalance grows toward 1000:1 or 5000:1, it can create real problems for your machine learning model.

The class (or classes) in an imbalanced classification problem that has many instances is known as the Majority Class(es).

The class (or classes) in an imbalanced classification problem that has few instances is known as the Minority Class(es).

Why can Imbalanced Classes cause problems?

When working with imbalanced data, the minority class is usually the one we are interested in. When detecting "spam" emails, for example, spam messages are few compared to "not spam" emails. Machine learning algorithms therefore favor the larger class, and may even ignore the smaller class entirely if the data is highly imbalanced.

Machine learning algorithms are designed to learn from the training data to minimize the loss and maximize accuracy. Let's see how a machine learning algorithm works with highly imbalanced data.

An Example

Consider this example where there are 9,900 instances of Class "A" and 100 instances of Class "B".

from sklearn.datasets import make_classification

x, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)

The count plot of the dataset can be created with the seaborn library:

import numpy as np
import seaborn as sns

print(np.unique(y, return_counts=True))
y = np.where(y == 0, 'A', 'B')
sns.countplot(x=y)
Count plot of the dataset
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.20, random_state=42)
print(np.unique(ytrain, return_counts=True))
print(np.unique(ytest, return_counts=True))

After splitting the dataset into training and test sets using train_test_split with a test size of 20%, we are left with 7,919 training examples for Class "A" and 81 for Class "B". The test set contains 1,981 examples of Class "A" and 19 of Class "B".

Let's first train a Logistic Regression model with our training data.

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(xtrain, ytrain)
lr.score(xtest, ytest)

Now, if we check the accuracy of the model using the score method, it is 0.992. 99.2% accuracy? It's performing great, right? Let's check the confusion matrix.

from sklearn.metrics import confusion_matrix

pred_lr = lr.predict(xtest)
print(confusion_matrix(ytest, pred_lr))
Confusion matrix for Logistic Regression

Although Class "A" is classified with 100% accuracy, only 3 out of 19 test examples of Class "B" were classified correctly. It must be a mistake, right?

Let's use a Random Forest Classifier on the same dataset and see what happens.

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(xtrain, ytrain)
rfc.score(xtest, ytest)

The accuracy score is 0.991 this time, but what did we learn last time? The real results hide behind the accuracy. Let's check the confusion matrix for the Random Forest Classifier's predictions.

pred_rfc=rfc.predict(xtest)
print(confusion_matrix(ytest,pred_rfc))
Confusion matrix for Random Forest Classifier

Only 1 out of 1,981 test examples of Class "A" was misclassified, but only 2 out of 19 test examples of Class "B" were classified correctly.
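Per-class recall makes this failure explicit. A minimal sketch, hard-coding a confusion matrix with the counts quoted above for the Random Forest (assuming scikit-learn's usual layout, where rows are the true classes):

```python
import numpy as np

# Counts from the text: 1 of 1,981 "A" wrong; only 2 of 19 "B" right.
cm = np.array([[1980, 1],    # true "A": 1,980 correct, 1 misclassified
               [17, 2]])     # true "B": 17 misclassified, only 2 correct
recall = cm.diagonal() / cm.sum(axis=1)
print(recall.round(3))  # [0.999 0.105]
```

The 99.1% accuracy hides a minority-class recall of roughly 10%.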

What are our machine learning models doing here?

As we have discussed before, machine learning models try to maximize accuracy, and that is exactly what is happening here. Since instances of Class "A" make up 99% of the data, the models learn to classify them correctly and largely ignore Class "B", because classifying everything as Class "A" already yields 99% accuracy.

You can match the accuracy of these models just by writing a single statement in Python. Shocked?

pred = ['A'] * len(ytest)

This statement creates a list of length 2,000 (the test set is 20% of 10,000) and fills it with "A". Since 99% of the sample is Class "A", we get an accuracy of 99% from the accuracy score.

from sklearn.metrics import accuracy_score

accuracy_score(ytest, pred)
Confusion matrix for the "pred" list

How can you handle an imbalanced dataset?

There are many ways to handle an imbalanced dataset. Some require domain knowledge; others use resampling algorithms to increase the instances of the minority class (over-sampling) or decrease the instances of the majority class (under-sampling).
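As one deliberately simple illustration, random over-sampling of the minority class can be sketched with plain NumPy; the class sizes here are hypothetical, and libraries such as imbalanced-learn provide more principled resamplers:

```python
import numpy as np

rng = np.random.default_rng(42)
y = np.array(["A"] * 5 + ["B"] * 100)   # toy imbalanced labels

minority = y[y == "A"]
# Over-sample: draw minority samples with replacement up to the majority size.
upsampled = rng.choice(minority, size=100, replace=True)
y_balanced = np.concatenate([upsampled, y[y == "B"]])

print(np.unique(y_balanced, return_counts=True))  # both classes now at 100
```

In practice you would resample the feature rows together with the labels (for example, by drawing indices), and apply over-sampling only to the training split so the test set stays untouched.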

