
Handling Imbalanced Datasets in Machine Learning: SMOTE, Oversampling & Undersampling Explained
Author(s): Abinaya Subramaniam

What are Imbalanced Datasets?
In many real-world classification problems, the number of samples in each class is not balanced. This is called an imbalanced dataset. For example, in fraud detection, there might be 99,000 normal transactions and only 1,000 fraudulent ones. Similarly, in medical diagnostics, 95% of the patients might be healthy while only 5% are diagnosed with a rare disease.
A dataset is imbalanced when the number of examples in each class is not approximately equal.
This imbalance creates a problem. Traditional machine learning models assume equal class distribution. As a result, the model becomes biased toward the majority class. We might get a high accuracy, but that's misleading. If our model always predicts "healthy" and the person is actually not, it will fail miserably in practice.
Why is accuracy not enough?
Imagine a classifier for detecting a rare disease, where 950 patients are healthy and 50 patients have the disease. If your model predicts everyone is "healthy", it still gets:
- 950 / 1000 = 95% accuracy. But it misses all 50 disease cases!
That's why we need better metrics like:
- Precision: Of all positive predictions, how many were correct?
- Recall: Of all actual positives, how many did we correctly find?
- F1-score: Harmonic mean of Precision and Recall
In imbalanced datasets, F1-score and ROC-AUC are often much better indicators than plain accuracy.
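To see this with the numbers above, here is a minimal sketch using scikit-learn's metric functions for a classifier that predicts "healthy" (class 0) for all 1,000 patients; the label arrays mirror the 950/50 example and are made up for illustration:
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 950 healthy (0) and 50 diseased (1) patients, as in the example above
y_true = np.array([0] * 950 + [1] * 50)
y_pred = np.zeros_like(y_true)   # the model predicts "healthy" for everyone

print("Accuracy :", accuracy_score(y_true, y_pred))                     # 0.95 -- looks great
print("Precision:", precision_score(y_true, y_pred, zero_division=0))   # 0.0  -- no positive predictions
print("Recall   :", recall_score(y_true, y_pred, zero_division=0))      # 0.0  -- all 50 cases missed
print("F1-score :", f1_score(y_true, y_pred, zero_division=0))          # 0.0
The accuracy looks impressive while every disease case is missed, which is exactly why the metrics below matter more here.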
Techniques to handle Imbalanced data
1. Random Oversampling / Upsampling
Random oversampling is the simplest way to balance a dataset by duplicating samples from the minority class. It artificially increases the representation of the minority class by making copies of its existing examples. If your dataset is small and underrepresented, you can boost minority class visibility without losing majority class data. This helps the model learn better decision boundaries for the rare class.
Let's say we have:
- 900 samples in Class 0 (majority)
- 100 samples in Class 1 (minority)
Oversampling will randomly pick minority samples with replacement until both classes have 900 samples. It doesn't create new samples, just copies of existing ones.
from sklearn.utils import resample
import pandas as pd

# Load the imbalanced dataset
data = pd.read_csv("our_imbalanced_dataset.csv")
majority = data[data['label'] == 0]
minority = data[data['label'] == 1]

# Oversample the minority class (with replacement) to match the majority size
minority_oversampled = resample(minority,
                                replace=True,
                                n_samples=len(majority),
                                random_state=42)

# Combine and shuffle
balanced_data = pd.concat([majority, minority_oversampled]).sample(frac=1, random_state=42)
Here, we are performing random oversampling to balance our imbalanced dataset. We first load the dataset and separate it into the majority (label == 0) and minority (label == 1) classes. Then, we upsample the minority class by randomly duplicating its samples (with replacement) until it matches the size of the majority class. After that, we combine the oversampled minority data with the majority class and shuffle the resulting dataset. This helps ensure that our machine learning model learns equally from both classes, improving its ability to generalize.

We use replacement (replace = True) during oversampling to allow the same minority class samples to be selected multiple times. Since the original minority class may have too few unique samples, sampling with replacement ensures we can generate enough data to match the majority class size, helping to balance the dataset without requiring new or external data.
This is a simple and quick technique that works well with small datasets, making it easy to balance classes without complex algorithms. However, since it duplicates existing minority class samples, it doesn't introduce any new information and can lead to overfitting, as the model may memorize repeated examples instead of learning general patterns.
2. Random Undersampling / Downsampling
Undersampling is the exact opposite of oversampling. It balances an imbalanced dataset by reducing the size of the majority class rather than increasing the minority class. It works by randomly removing samples from the majority class until it matches the size of the minority class.
This approach is helpful when we have a large dataset and can afford to discard some data, as it speeds up training and avoids the risk of overfitting from duplicated records. However, it may also result in the loss of potentially valuable information.
The code is similar to oversampling, with a few modifications.
# Undersample the majority class (without replacement) to match the minority size
majority_undersampled = resample(majority,
                                 replace=False,
                                 n_samples=len(minority),
                                 random_state=42)

# Combine and shuffle
balanced_data = pd.concat([majority_undersampled, minority]).sample(frac=1, random_state=42)
In this code, we perform random undersampling to balance the dataset by reducing the number of majority class samples. First, we use the resample function to randomly select a subset of the majority class without replacement, making its size equal to that of the minority class. Then, we combine this undersampled majority data with the original minority class and shuffle the combined dataset.
This results in a balanced dataset where both classes have an equal number of samples, which can help prevent a model from being biased toward the majority class.

We set replace = False because we don't want to pick the same sample more than once when reducing the majority class. Since we are already cutting down the number of majority samples, there's no need to repeat any. Sampling without replacement means we select different, unique samples, which helps keep the data more natural and avoids creating repeated or biased examples.
Random undersampling is a quick and efficient method that helps speed up training and avoids overfitting since it doesnβt duplicate data. However, it comes with the risk of discarding important information from the majority class, which can lead to underfitting if the model loses valuable patterns needed for accurate predictions.
3. SMOTE (Synthetic Minority Oversampling Technique)
SMOTE (Synthetic Minority Over-sampling Technique) is an advanced oversampling method used to handle imbalanced datasets. Unlike random oversampling, which simply duplicates existing minority class samples, SMOTE creates entirely new synthetic samples. These new samples are generated by looking at existing minority class data points and creating new examples along the lines that connect them to their neighbors, making the minority class more diverse.
SMOTE is useful because it expands the minority class without introducing duplicates, which reduces the risk of overfitting. At the same time, it keeps all the majority class data intact, unlike undersampling. By generating more realistic and varied samples for the minority class, SMOTE helps machine learning models learn a better and more general decision boundary, especially when the original data has sparse regions or gaps in the minority class.

For each sample in the minority class, SMOTE first finds its k nearest neighbors (commonly 5). Then it randomly selects one of those neighbors and creates a new synthetic point somewhere along the straight line connecting the two. This process is repeated to generate as many synthetic samples as needed. Visually, this looks like filling in the space between existing minority points, which helps smooth the distribution and provide more learning material for the model.
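As a rough illustration of that interpolation step, here is a simplified sketch with made-up feature values (not the library's actual implementation):
import numpy as np

rng = np.random.default_rng(42)

x = np.array([2.0, 3.0])           # an existing minority-class sample (hypothetical values)
neighbor = np.array([4.0, 5.0])    # one of its k nearest minority neighbors (hypothetical)

lam = rng.random()                       # random interpolation factor in [0, 1)
synthetic = x + lam * (neighbor - x)     # new point on the line segment between the two
print(synthetic)
In practice, the imbalanced-learn library used below handles the neighbor search and interpolation for the whole minority class.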
# Install the imbalanced-learn library, which provides SMOTE
!pip install imbalanced-learn

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split features and labels, preserving the class ratio in both sets
X = data.drop('label', axis=1)
y = data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Apply SMOTE to the training data only
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)
In this code, we first install the imbalanced-learn library to access SMOTE for handling class imbalance. We then import the necessary modules and split the dataset into training and testing sets using train_test_split, ensuring that the class distribution is preserved in both sets by setting stratify=y.
After that, we apply SMOTE to the training data (X_train and y_train) to generate synthetic samples for the minority class, creating a balanced dataset. The resampled features (X_resampled) and labels (y_resampled) are returned, which can now be used for training a model.

SMOTE offers several advantages, such as avoiding overfitting by generating meaningful synthetic samples rather than duplicating existing ones, and it often leads to a dramatic improvement in model performance by expanding the minority class in a more realistic way. However, it has some drawbacks, including the risk of introducing noise if the synthetic points are not representative of the actual data, and it is not ideal for datasets with categorical features, for which SMOTENC (a variant designed for categorical data) should be used instead.
4. Using class weights
Using class weights is a technique where, instead of altering the dataset, we adjust the model to focus more on the minority class by assigning it a higher weight. This is done by modifying the loss function, such that errors made on the minority class are penalized more heavily. Algorithms like Logistic Regression, SVM, and Random Forest have built-in parameters to apply class weights, such as class_weight='balanced' in scikit-learn. This means the model inherently learns to pay more attention to the minority class without needing any changes to the data itself.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# 'balanced' weights each class inversely proportional to its frequency
model = LogisticRegression(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
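For reference, scikit-learn's 'balanced' setting assigns each class a weight of n_samples / (n_classes * class count). A small sketch using its compute_class_weight helper on the training labels from the earlier split shows what those weights work out to:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
print(dict(zip(classes, weights)))   # e.g. roughly {0: 0.56, 1: 5.0} for a 900:100 class ratio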
One of the biggest advantages of using class weights is that it avoids the need for resampling, making it a clean and elegant solution, especially useful for large datasets. However, it may not be as effective when the imbalance is extreme, such as in a 99:1 ratio, where the model could still struggle to learn the minority class effectively. In such cases, other techniques like SMOTE or ensemble models may be more appropriate to address the imbalance.

When to apply these techniques
Any oversampling or undersampling technique, such as SMOTE, random oversampling, or random undersampling, should only be applied to the Training data and not the validation or test data. Applying these techniques to the entire dataset, including the test set, can introduce data leakage, as it alters the test data that is meant to represent unseen, real-world data. The purpose of these techniques is to address class imbalance during training, allowing the model to learn better from the data. The test set should remain untouched and reflect the natural distribution of data, ensuring that the modelβs performance is evaluated accurately and without bias.
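One way to keep resampling confined to the training data, sketched below, is imbalanced-learn's Pipeline, which applies SMOTE only when the pipeline is fit. The X and y are from the earlier SMOTE example, and the logistic regression is chosen just for illustration:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),        # resampling is applied only when the pipeline is fit
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each fold's held-out data is never resampled, so the evaluation stays leakage-free
scores = cross_val_score(pipe, X, y, scoring="f1", cv=5)
print(scores.mean())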
Imbalanced datasets are common and challenging, and relying on accuracy alone can be misleading. To build a more robust model, it's essential to employ strategies like oversampling, undersampling, SMOTE, or class weights to address the imbalance.
Additionally, evaluating the model with metrics like the F1-score, ROC-AUC, and the confusion matrix provides a better understanding of performance, especially for rare but critical cases. By applying the right techniques, your model can learn to identify the minority class effectively, improving its ability to detect important instances like disease diagnoses, spam emails, or fraudulent transactions.