Customer Segmentation and Time Series Forecasting Based on Sales Data #2/3

Last Updated on October 5, 2024 by Editorial Team

Author(s): Naveen Malla

Originally published on Towards AI.

This is the second article in a 3-part series. In the first part, I covered some initial data analysis steps you can take before diving into the actual Customer Segmentation. You don’t have to read that before this one, but it’ll give you some great insights and set you up for the more exciting stuff we’re about to cover.

Customer Segmentation and Time Series Forecasting Based on Sales Data #1/3 (pub.towardsai.net)

In this part, we’ll look into segmenting customers into clusters and coming up with some marketing strategies for each cluster to maximize returns.

🗂️ Creating a New Dataset with Engineered Features for Customer Analysis

In the last article, we engineered some features from the original dataset to gain deeper insights. Now, we’ll create a new dataset using those features, which will serve as a base for segmentation.


# `df` is the cleaned sales DataFrame from Part 1; `total_revenue` and `recency`
# are the per-customer DataFrames engineered there.

# Average order value and total quantity per customer
avg_order_value = df.groupby('customer_number')['revenue'].mean().reset_index()
avg_order_value.columns = ['customer_number', 'avg_order_value']
total_quantity = df.groupby('customer_number')['quantity'].sum().reset_index()
total_quantity.columns = ['customer_number', 'total_quantity']

# Merge all features into a single DataFrame
customer_data = total_revenue.merge(avg_order_value, on='customer_number')
customer_data = customer_data.merge(total_quantity, on='customer_number')
customer_data = customer_data.merge(recency, on='customer_number')

# The merges leave two copies of total_revenue (suffixed _x and _y); keep one
customer_data['total_revenue'] = customer_data['total_revenue_x']
customer_data = customer_data.drop(['total_revenue_x', 'total_revenue_y'], axis=1)

# Display the aggregated data
print(customer_data.shape)
print(customer_data.head())
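If you don't have the Part 1 variables at hand, the same four features can be built in a single groupby. This is only a sketch on a made-up transaction log: the `order_date` column and the definition of recency (days between a customer's last order and the latest order in the data) are my assumptions here, not necessarily what the notebook does.

```python
import pandas as pd

# Toy transaction log; column names mirror the ones used above,
# the values and the `order_date` column are made up for illustration.
df = pd.DataFrame({
    "customer_number": [1, 1, 2, 2, 2, 3],
    "order_date": pd.to_datetime(
        ["2024-01-02", "2024-01-20", "2024-01-05",
         "2024-01-10", "2024-01-28", "2023-12-15"]
    ),
    "revenue": [10.0, 30.0, 5.0, 15.0, 10.0, 50.0],
    "quantity": [2, 6, 1, 3, 2, 10],
})

# Assumption: recency = days between a customer's last order and the
# most recent order date in the whole dataset.
snapshot = df["order_date"].max()

customer_data = (
    df.groupby("customer_number")
    .agg(
        total_revenue=("revenue", "sum"),
        avg_order_value=("revenue", "mean"),
        total_quantity=("quantity", "sum"),
        last_order=("order_date", "max"),
    )
    .reset_index()
)
customer_data["recency"] = (snapshot - customer_data["last_order"]).dt.days
customer_data = customer_data.drop(columns="last_order")

print(customer_data)
```

Named aggregation gives every feature its final column name up front, so there are no `_x` / `_y` suffixes to clean up afterwards.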

Customer segmentation is the process of dividing a company’s customer base into different groups, or “segments”, based on shared characteristics. The goal is to identify clusters of customers who have similar needs and behaviours.

🎯 Why Is Customer Segmentation Important?

Segmentation helps in:

understanding the customer base better.
creating targeted marketing strategies.
improving customer service.

⚙️ Building a Model with KMeans Clustering

KMeans is an unsupervised machine learning algorithm that groups similar data points into clusters based on their features. It works by assigning each data point to the nearest cluster center (centroid) and then iteratively adjusting the centroids until the clusters stabilize.
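To make the assign-then-update loop concrete, here is a bare-bones NumPy sketch of it. The function name `kmeans_sketch` and the toy points are made up for illustration; in the rest of the article we use scikit-learn's `KMeans`, which adds smarter initialization (k-means++) and multiple restarts on top of this idea.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain assign-then-update loop: not a replacement for sklearn's KMeans."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points picked at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the clusters are stable
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of made-up 2-D points
X = np.array([[0.0, 0.0], [0.3, 0.1], [0.1, 0.4],
              [5.0, 5.0], [5.3, 5.2], [5.1, 5.5]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```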

Why KMeans Clustering?

Simple to implement.
Efficient, fast, and scales well to large datasets.

📈 Selecting Features for Clustering

from sklearn.preprocessing import StandardScaler

features = customer_data[['total_revenue', 'avg_order_value', 'total_quantity', 'recency']]

# Normalize the features
scaler = StandardScaler()
normalized_features = scaler.fit_transform(features)
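Why normalize at all? KMeans works on distances, so a feature measured in hundreds (revenue) would otherwise drown out one measured in days (recency). The sketch below, on made-up numbers, shows that `StandardScaler` simply subtracts each column's mean and divides by its standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values on very different scales: revenue (currency) and recency (days)
features = np.array([[1000.0, 5.0],
                     [700.0, 7.0],
                     [550.0, 27.0],
                     [900.0, 3.0]])

scaled = StandardScaler().fit_transform(features)

# Same thing by hand: subtract the column mean, divide by the column
# standard deviation (population std, ddof=0, which StandardScaler uses)
manual = (features - features.mean(axis=0)) / features.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so each feature gets a comparable say in the distance calculation.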

🔍 So how do we decide the number of clusters?

Elbow Method

The KMeans algorithm is run for different values of k (number of clusters), and for each k, the within-cluster sum of squares (WCSS) is calculated: the sum of squared distances from each point to its cluster center.
As k increases, the WCSS decreases, because more clusters allow the data points to group more tightly.
After a certain point, adding more clusters doesn’t significantly reduce the WCSS, and that is where an “elbow” forms in the graph, signifying our optimal number of clusters.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Elbow method to determine the optimal number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=42)
    kmeans.fit(normalized_features)
    wcss.append(kmeans.inertia_)

# Plotting the results
plt.figure(figsize=(8, 4))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Method for Optimal Number of Clusters')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS (inertia)')
plt.show()

The graph shows a sharp decline in the within-cluster sum of squares as k increases from 1 to 3, after which the decrease becomes much smaller. That puts the elbow at k = 3.
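Reading an elbow off a plot is a bit subjective, so it can help to print how much each extra cluster actually buys you. The sketch below does that on synthetic data (three well-separated blobs from `make_blobs`, standing in for our normalized features); it is just one simple way to sanity-check the plot, not a formal rule.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the normalized features: three clearly separated groups
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=42)

ks = list(range(1, 8))
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

# How much WCSS falls when going from k-1 to k clusters
for k, prev, cur in zip(ks[1:], wcss[:-1], wcss[1:]):
    print(f"k={k}: WCSS falls by {prev - cur:.1f} ({100 * (prev - cur) / prev:.1f}%)")
```

On data like this, the drop up to k = 3 is far larger than anything after it, which is exactly the pattern the elbow plot shows visually.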

🚥 Validating with Silhouette Score

Just to be sure, I used another metric, the Silhouette Score, which measures how similar a data point is to its own cluster compared to other clusters.

The score ranges from -1 to 1. The higher the Silhouette Score, the better the clustering.
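For reference, this is the standard definition behind the score. For a point i, a(i) is its mean distance to the other points in its own cluster, and b(i) is the smallest mean distance from i to the points of any other cluster. The overall score is the average of s(i) over all N points.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\qquad
\text{Silhouette Score} = \frac{1}{N} \sum_{i=1}^{N} s(i),
\qquad
-1 \le s(i) \le 1
```

A value near 1 means the point sits well inside its cluster, a value near 0 means it lies between two clusters, and a negative value suggests it may belong to a different cluster.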

from sklearn.metrics import silhouette_score

# Calculate silhouette scores for a range of cluster numbers
silhouette_scores = []
for n_clusters in range(2, 11):
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=300, n_init=10, random_state=42)
    cluster_labels = kmeans.fit_predict(normalized_features)
    silhouette_avg = silhouette_score(normalized_features, cluster_labels)
    silhouette_scores.append(silhouette_avg)

# Plot silhouette scores
plt.figure(figsize=(8, 4))
plt.plot(range(2, 11), silhouette_scores, marker='o', linestyle='--')
plt.title('Silhouette Scores for Different Numbers of Clusters')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.show()

The Silhouette Score drops significantly once k goes beyond 3, which is the opposite of what we want. So we can confirm k = 3 and build a model that groups the whole customer base into 3 clusters.

🧩 Applying KMeans with 3 Clusters

from sklearn.cluster import KMeans

optimal_clusters = 3
kmeans_3 = KMeans(n_clusters=optimal_clusters, init='k-means++', max_iter=300, n_init=10, random_state=42)
customer_data['cluster_3'] = kmeans_3.fit_predict(normalized_features)

# cluster numbers should start from 1
customer_data['cluster_3'] = customer_data['cluster_3'] + 1

# Analyze the characteristics of each cluster (3 clusters)
cluster_summary_3 = customer_data.groupby('cluster_3').agg({
    'total_revenue': ['mean', 'sum'],
    'avg_order_value': 'mean',
    'total_quantity': 'mean',
    'recency': 'mean',
    'customer_number': 'count'
}).reset_index()

# Flatten the MultiIndex columns
cluster_summary_3.columns = ['cluster', 'avg_total_revenue', 'sum_total_revenue', 'avg_order_value', 'avg_total_quantity', 'avg_recency', 'customer_count']
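Two things are often useful once the model is fitted: reading the cluster centers in the original units, and assigning a new customer to a cluster. The sketch below shows both on a made-up `toy` DataFrame (a stand-in for `customer_data`; the values and the `new_customer` row are invented). The key point is that anything passed to `predict` has to go through the same fitted scaler first.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

cols = ["total_revenue", "avg_order_value", "total_quantity", "recency"]

# Made-up customers standing in for customer_data
toy = pd.DataFrame({
    "total_revenue":   [950, 1000, 980, 700, 690, 720, 100, 120, 90],
    "avg_order_value": [13, 14, 13, 12, 12, 12, 10, 11, 10],
    "total_quantity":  [1400, 1350, 1380, 1000, 1010, 990, 200, 220, 180],
    "recency":         [5, 6, 5, 7, 8, 7, 30, 28, 27],
})

scaler = StandardScaler()
X = scaler.fit_transform(toy[cols])
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Centroids live in scaled space; map them back to the original units
centroids = pd.DataFrame(scaler.inverse_transform(model.cluster_centers_), columns=cols)
print(centroids.round(1))

# A new customer must go through the same fitted scaler before predict()
new_customer = pd.DataFrame([[960, 13, 1390, 5]], columns=cols)
label = model.predict(scaler.transform(new_customer))[0]
print("assigned cluster:", label)
```

Note that KMeans labels here start at 0; add 1 if you want them to line up with the 1-based cluster numbers used in this article.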

📊 Visualizing the Clusters and Their Characteristics

No. of customers in the clusters

import seaborn as sns

plt.figure(figsize=(8, 4))
sns.countplot(data=customer_data, x='cluster_3', palette='viridis')
plt.title('Customer Segmentation - 3 Clusters')
plt.xlabel('Cluster')
plt.ylabel('Number of Customers')
plt.show()

The second cluster seems to be the largest.

Average Total Revenue per Cluster

plt.figure(figsize=(8, 4))
sns.barplot(x='cluster', y='avg_total_revenue', data=cluster_summary_3, palette='viridis')
plt.title('Average Total Revenue per Cluster')
plt.xlabel('Cluster')
plt.ylabel('Average Total Revenue')
plt.show()

Cluster 1 seems to be the largest in revenue followed by 2 and 3.

Recency by Cluster

plt.figure(figsize=(8, 4))
sns.boxplot(x='cluster_3', y='recency', data=customer_data, palette='viridis')
plt.title('Recency of Orders by Cluster')
plt.xlabel('Cluster')
plt.ylabel('Recency (Days since last order)')
plt.show()

Customers in cluster 3 have gone the longest since their last order, so they seem to be the least active of all.

Displaying the Cluster Summary

print(cluster_summary_3)

Outliers present in the data can skew some of the results. Discussion with stakeholders is necessary to decide how to handle them.
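Before that discussion, it helps to know which customers are the outliers. One common rule of thumb (my choice for this sketch, not something taken from the notebook) is to flag values more than 1.5 × IQR outside the middle 50% of the data. The revenue figures below are made up.

```python
import pandas as pd

# Made-up revenue figures; the last customer is an obvious extreme value
revenue = pd.Series([520, 610, 700, 680, 950, 870, 640, 15000], name="total_revenue")

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = revenue.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (revenue < lower) | (revenue > upper)
print(revenue[is_outlier])
```

Flagging is the easy part. Whether to cap, transform, or keep these customers (they may be your best accounts) is the business decision.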

🔎 Individual Cluster Analysis and Tailoring Marketing Strategies

Cluster 1

Average Total Revenue: $969.89
Sum Total Revenue: $320,976.92
Average Order Value: $13.25
Average Total Quantity: 1,379.24
Average Recency: 5.50 days
Customer Count: 333

Analysis

High-value customers with the highest revenue and quantity.
These customers are highly engaged: on average, their last order was only 5.5 days ago.

Marketing Strategy

Upsell and cross-sell products to increase revenue because these customers are already engaged.
Offer loyalty programs to encourage continued frequent purchases.

Cluster 2

Average Total Revenue: $696.06
Sum Total Revenue: $355,686.77
Average Order Value: $12.23
Average Total Quantity: 1,007.60
Average Recency: 7.29 days
Customer Count: 511

Analysis

Moderate-spending customers with steady purchases.
These customers have ordered slightly less recently than Cluster 1 but still show regular engagement.

Marketing Strategy

Implement a retention campaign to keep these customers engaged.
Run referral programs. Although referrals sound like a general strategy, I think they can pay off most with this cluster, since it is the largest one. (Give it a thought)

Cluster 3

Average Total Revenue: $557.54
Sum Total Revenue: $86,976.31
Average Order Value: $10.76
Average Total Quantity: 824.18
Average Recency: 27.24 days
Customer Count: 156

Analysis

Low-spending customers who have also gone the longest without ordering (about 27 days on average).
These customers are at a higher risk of churning. (Churn: when a customer stops purchasing from a business.)

Marketing Strategy

Gather feedback, maybe through surveys (with well-framed questions), to understand why these customers are not purchasing frequently.
Launch a re-engagement campaign with special offers and discounts.

In the next article, we will use ⏳ Time Series Forecasting to predict product sales for the next 3 months. Stay tuned! 🚀

There is more analysis and code in the Appendix section of the notebook, which I left out here to keep this article from getting too long. Star the repo for future reference here:

GitHub – naveen-malla/Customer-Segmentation-and-SKU-Forecasting (github.com)

This repo contains code for performing customer segmentation and sales forecast prediction on a company's sales data. …

If you enjoyed this post, please consider

holding the clap button for a few seconds (it goes up to 50) and
following me for more updates.

It gives me the motivation to keep going and helps the story reach more people like you. I share stories every week about machine learning concepts and tutorials on interesting projects. See you next week. Happy learning!

LinkedIn, Medium, GitHub
