Stop Oversampling: Why You Should Avoid It
Last Updated on November 3, 2024 by Editorial Team
Author(s): Davide Nardini
Originally published on Towards AI.
The Hidden Dangers of Oversampling: the study that undermines the entire family of oversampling techniques
Photo of Cauayan Island Resort on Unsplash

In Machine Learning, class imbalance is a persistent challenge that researchers and practitioners face daily.
When one class significantly outnumbers another in a dataset, traditional algorithms often struggle to learn effectively, leading to poor performance on the minority class.
To address this problem, data scientists typically choose between two options: Undersampling and Oversampling. While undersampling removes samples from the majority class, oversampling generates new synthetic minority-class samples from the training data.
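To make the two strategies concrete, here is a minimal sketch using the imbalanced-learn library on a synthetic dataset. The toy data, the 95/5 class split, and the variable names are illustrative assumptions, not taken from the paper discussed below.

```python
# A minimal sketch of the two resampling strategies, assuming a
# synthetic binary dataset with a ~5% minority class.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Toy dataset: 1,000 samples, roughly 95/5 class split (illustrative).
X, y = make_classification(
    n_samples=1_000, weights=[0.95, 0.05], random_state=42
)
print("original:", Counter(y))  # roughly {0: 950, 1: 50}

# Undersampling: discard majority-class samples until the classes balance.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("undersampled:", Counter(y_under))  # both classes at minority size

# Oversampling: synthesize new minority-class samples (here via SMOTE).
X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)
print("oversampled:", Counter(y_over))  # both classes at majority size
```

The Counter printouts make the trade-off visible: undersampling balances the classes by throwing data away, while SMOTE balances them by manufacturing new minority samples.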
In this article, I'll show you how an interesting paper [1] has highlighted the pitfalls of oversampling techniques.
Contents:
Class Imbalance Problem
Undersampling and Oversampling
Dangers of Oversampling: the study
Conclusion
One of the biggest challenges you can face in Data Science and Machine Learning projects is dealing with imbalanced classes, especially in the target variable.
For example, imagine you want to build a churn prediction model: it's very likely (and also desirable, if you're the business owner!) that the percentage of churners is very low, say 1% of the total customer base.
If you trained a standard classification algorithm on such a target, it would struggle: it would only learn to recognize the dominant class, effectively making…
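This failure mode is easy to reproduce. In the hedged sketch below (synthetic data; class 1 stands in for the churners, and the 99/1 split mirrors the example above), a baseline that always predicts the majority class scores about 99% accuracy while catching exactly zero churners:

```python
# Illustration of the accuracy trap on a 99/1 class split: a model
# that always predicts "no churn" looks excellent on accuracy alone.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic stand-in for a churn dataset: 10,000 customers, 1% churners.
X, y = make_classification(
    n_samples=10_000, weights=[0.99, 0.01], flip_y=0, random_state=0
)

# Baseline that always predicts the most frequent class (the majority).
model = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = model.predict(X)

print(f"accuracy: {accuracy_score(y, y_pred):.2%}")        # ~99%
print(f"churner recall: {recall_score(y, y_pred):.2%}")    # 0%
```

This is exactly why resampling techniques were proposed in the first place, and why minority-class metrics such as recall, precision, or F1 matter far more than raw accuracy in this setting.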
Published via Towards AI