Customer Segmentation and Time Series Forecasting Based on Sales Data #2/3
Last Updated on October 5, 2024 by Editorial Team
Author(s): Naveen Malla
Originally published on Towards AI.
This is the second article in a 3-part series. In the first part, I covered some initial data analysis steps you can take before diving into the actual Customer Segmentation. You don't have to read that before this one, but it'll give you some great insights and set you up for the more exciting stuff we're about to cover.
In this part, we'll look into segmenting customers into clusters and coming up with some marketing strategies for each cluster to maximize returns.
🗂️ Creating a New Dataset with Engineered Features for Customer Analysis
In the last article, we engineered some features from the original dataset to gain deeper insights. Now, we'll create a new dataset using those features, which will serve as a base for segmentation.
# total_revenue and recency are not defined in this snippet; they are assumed to
# be the per-customer DataFrames built during feature engineering in part 1.
# Mean revenue per row for each customer
avg_order_value = df.groupby('customer_number')['revenue'].mean().reset_index()
avg_order_value.columns = ['customer_number', 'avg_order_value']
# Total quantity purchased by each customer
total_quantity = df.groupby('customer_number')['quantity'].sum().reset_index()
total_quantity.columns = ['customer_number', 'total_quantity']
# Merge all features into a single DataFrame
customer_data = total_revenue.merge(avg_order_value, on='customer_number')
customer_data = customer_data.merge(total_quantity, on='customer_number')
customer_data = customer_data.merge(recency, on='customer_number')
# pandas adds _x/_y suffixes when both merged frames contain a total_revenue column; keep one copy
customer_data['total_revenue'] = customer_data['total_revenue_x']
customer_data = customer_data.drop(['total_revenue_x', 'total_revenue_y'], axis=1)
# Display the aggregated data
print(customer_data.shape)
print(customer_data.head())
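The snippet above relies on total_revenue and recency carried over from part 1. As a rough, self-contained sketch of how all four features could be built in one pass, the example below uses pandas named aggregation on a toy table. The column name order_date and the choice of the latest order date as the recency reference point are assumptions made for illustration, not necessarily what the original notebook does.

```python
import pandas as pd

# Toy transactions; column names mirror the article, except order_date,
# which is an assumed name for the purchase timestamp.
df_demo = pd.DataFrame({
    "customer_number": [1, 1, 2, 2, 2],
    "revenue": [10.0, 30.0, 5.0, 5.0, 20.0],
    "quantity": [2, 6, 1, 1, 4],
    "order_date": pd.to_datetime(
        ["2024-01-01", "2024-01-10", "2024-01-03", "2024-01-05", "2024-01-08"]
    ),
})

# Reference date for recency: the most recent order in the data (an assumption).
snapshot = df_demo["order_date"].max()

# One groupby with named aggregation yields flat, already-named columns.
customer_demo = (
    df_demo.groupby("customer_number")
    .agg(
        total_revenue=("revenue", "sum"),
        avg_order_value=("revenue", "mean"),
        total_quantity=("quantity", "sum"),
        last_order=("order_date", "max"),
    )
    .reset_index()
)

# Recency in days since each customer's last order.
customer_demo["recency"] = (snapshot - customer_demo["last_order"]).dt.days
customer_demo = customer_demo.drop(columns="last_order")
print(customer_demo)
```

Building everything in a single aggregation also sidesteps the duplicate-column suffixes that appear when separately built frames are merged.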
Customer segmentation is the process of dividing a company's customer base into different groups, or "segments", based on shared characteristics. The goal is to identify clusters of customers who have similar needs and behaviours.
🎯 Why Is Customer Segmentation Important?
Segmentation helps in:
- understanding the customer base better.
- creating targeted marketing strategies.
- improving customer service.
⚙️ Building a Model with KMeans Clustering
KMeans is an unsupervised machine learning algorithm that groups similar data points into clusters based on their features. It works by assigning each data point to the nearest cluster center (centroid) and then iteratively adjusting the centroids until the clusters stabilize.
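To make the assign-then-update loop concrete, here is a minimal NumPy sketch of the basic algorithm (often called Lloyd's algorithm). It is only an illustration: the function name kmeans_sketch and the toy blobs are made up, initialization is plain random sampling rather than k-means++, and the scikit-learn implementation used later in this article is more careful and much faster.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: assign points, update centroids, repeat."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the clusters have stabilized
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs as toy data.
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print(centroids_demo)
```

On data this cleanly separated, the two centroids should land near the centers of the two blobs.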
Why KMeans Clustering?
- Simple to implement.
- Efficient, fast, and scales well to large datasets.
📈 Selecting Features for Clustering
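from sklearn.preprocessing import StandardScaler
# Select the engineered features used for clustering
features = customer_data[['total_revenue', 'avg_order_value', 'total_quantity', 'recency']]
# Normalize the features so that no single feature dominates the distance calculation
scaler = StandardScaler()
normalized_features = scaler.fit_transform(features)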
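KMeans relies on Euclidean distance, so a feature measured in the thousands (like revenue) would otherwise drown out one measured in days. The short sketch below, on made-up numbers, shows what StandardScaler does to each column: it subtracts the column mean and divides by the column standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features on very different scales (values are made up for illustration).
raw = np.array([
    [1000.0, 5.0],
    [2000.0, 10.0],
    [3000.0, 30.0],
])

scaled = StandardScaler().fit_transform(raw)

# The same transformation written by hand: z = (x - mean) / std, per column.
# NumPy's default std is the population standard deviation, which is what
# StandardScaler uses as well.
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```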
🔍 So how do we decide the number of clusters?
Elbow Method
- The KMeans algorithm is run for different values of k (number of clusters), and for each k, the sum of squared distances within each cluster is calculated.
- As k increases, the total within-cluster variance decreases, because more clusters allow data points to group more tightly together.
- After a certain point, adding more clusters doesn't significantly reduce the variance. That is where an "elbow" forms in the graph, marking our optimal number of clusters.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Elbow method to determine the optimal number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=42)
    kmeans.fit(normalized_features)
    wcss.append(kmeans.inertia_)
# Plotting the results
plt.figure(figsize=(8, 4))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Method for Optimal Number of Clusters')
plt.xlabel('Number of clusters')
plt.ylabel('Within-Cluster Variance')
plt.show()
The graph shows a sharp decline in the within-cluster variance as k increases from 1 to 3 and then the decrease in variance becomes less significant.
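The quantity on the y-axis is the inertia_ attribute of the fitted model, which scikit-learn defines as the sum of squared distances from each point to its assigned centroid. As a quick sanity check, the sketch below recomputes it by hand on random stand-in data (not the article's dataset) and compares the two values.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in for normalized_features (values are synthetic).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)

# Look up the centroid assigned to every point, then sum the squared distances.
assigned_centroids = km.cluster_centers_[km.labels_]
wcss_manual = ((X_demo - assigned_centroids) ** 2).sum()

print(wcss_manual, km.inertia_)
```

The two printed numbers should agree up to floating-point error.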
🚥 Validating with Silhouette Score
Just to be sure, I used another metric called the Silhouette Score, which measures how similar a point is to its own cluster compared with other clusters.
The score ranges from -1 to 1. The higher the Silhouette Score, the better the clustering.
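For a single point, the silhouette value is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below computes this by hand on a tiny made-up dataset and compares it with scikit-learn's silhouette_samples; the helper silhouette_of_point is a hypothetical name used only here.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Tiny 1-D toy data with two obvious groups (made up for illustration).
X_demo = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels_demo = np.array([0, 0, 0, 1, 1, 1])

def silhouette_of_point(X, labels, i):
    """s(i) = (b - a) / max(a, b) for a single point i."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # exclude the point itself from its own cluster average
    a = dists[same].mean()  # mean distance to the rest of its own cluster
    # Mean distance to each other cluster; b is the smallest of these.
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([
    silhouette_of_point(X_demo, labels_demo, i) for i in range(len(X_demo))
])
reference = silhouette_samples(X_demo, labels_demo)
print(manual.round(3))
```

The overall Silhouette Score used below is simply the mean of these per-point values.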
from sklearn.metrics import silhouette_score
# Calculate silhouette scores for a range of cluster numbers
silhouette_scores = []
for n_clusters in range(2, 11):
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=300, n_init=10, random_state=42)
    cluster_labels = kmeans.fit_predict(normalized_features)
    silhouette_avg = silhouette_score(normalized_features, cluster_labels)
    silhouette_scores.append(silhouette_avg)
# Plot silhouette scores
plt.figure(figsize=(8, 4))
plt.plot(range(2, 11), silhouette_scores, marker='o', linestyle='--')
plt.title('Silhouette Scores for Different Numbers of Clusters')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.show()
There is a significant drop in the Silhouette Score after k = 3, which is the opposite of what we want. So we can confirm k = 3 and build a model to group the whole customer base into 3 clusters.
🧩 Applying KMeans with 3 Clusters
from sklearn.cluster import KMeans
optimal_clusters = 3
kmeans_3 = KMeans(n_clusters=optimal_clusters, init='k-means++', max_iter=300, n_init=10, random_state=42)
customer_data['cluster_3'] = kmeans_3.fit_predict(normalized_features)
# cluster numbers should start from 1
customer_data['cluster_3'] = customer_data['cluster_3'] + 1
# Analyze the characteristics of each cluster (3 clusters)
cluster_summary_3 = customer_data.groupby('cluster_3').agg({
    'total_revenue': ['mean', 'sum'],
    'avg_order_value': 'mean',
    'total_quantity': 'mean',
    'recency': 'mean',
    'customer_number': 'count'
}).reset_index()
# Flatten the MultiIndex columns
cluster_summary_3.columns = [
    'cluster', 'avg_total_revenue', 'sum_total_revenue', 'avg_order_value',
    'avg_total_quantity', 'avg_recency', 'customer_count'
]
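Another way to read the clusters is to look at the fitted centroids themselves. They live in the scaled feature space, but the scaler's inverse_transform maps them back to the original units. The sketch below uses synthetic data and hypothetical variable names (demo, scaler_demo, km_demo) and checks that the back-transformed centroids line up with the per-cluster means from a groupby.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for customer_data (values are made up).
rng = np.random.default_rng(1)
feature_cols = ["total_revenue", "recency"]
demo = pd.DataFrame({
    "total_revenue": rng.gamma(shape=2.0, scale=400.0, size=300),
    "recency": rng.integers(0, 60, size=300).astype(float),
})

scaler_demo = StandardScaler()
X_scaled = scaler_demo.fit_transform(demo[feature_cols].to_numpy())

# tol=0 disables the early stop on small centroid shifts, so the fit runs until
# assignments stop changing (or max_iter is hit).
km_demo = KMeans(n_clusters=3, n_init=10, tol=0, random_state=42).fit(X_scaled)
demo["cluster"] = km_demo.labels_

# Map the centroids from z-score space back to the original units.
centroids_original = pd.DataFrame(
    scaler_demo.inverse_transform(km_demo.cluster_centers_),
    columns=feature_cols,
)
cluster_means = demo.groupby("cluster")[feature_cols].mean()

print(centroids_original.round(2))
print(cluster_means.round(2))
```

Because standard scaling is an affine transformation, a centroid that is the mean of its points in scaled space is also the mean of those points in the original units, which is why the two tables match.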
📊 Visualizing the Clusters and Their Characteristics
No. of customers in the clusters
import seaborn as sns
# Count of customers in each cluster
plt.figure(figsize=(8, 4))
sns.countplot(data=customer_data, x='cluster_3', palette='viridis')
plt.title('Customer Segmentation - 3 Clusters')
plt.xlabel('Cluster')
plt.ylabel('Number of Customers')
plt.show()
- The second cluster seems to be the largest.
Average Total Revenue per Cluster
plt.figure(figsize=(8, 4))
sns.barplot(x='cluster', y='avg_total_revenue', data=cluster_summary_3, palette='viridis')
plt.title('Average Total Revenue per Cluster')
plt.xlabel('Cluster')
plt.ylabel('Average Total Revenue')
plt.show()
- Cluster 1 seems to be the largest in revenue followed by 2 and 3.
Recency by Cluster
plt.figure(figsize=(8, 4))
sns.boxplot(x='cluster_3', y='recency', data=customer_data, palette='viridis')
plt.title('Recency of Orders by Cluster')
plt.xlabel('Cluster')
plt.ylabel('Recency (Days since last order)')
plt.show()
- Customers in cluster 3 seem to have gone the longest since their last order.
Displaying the Cluster Summary
print(cluster_summary_3)
Outliers present in the data can skew some of the results. Discussion with stakeholders is necessary to decide how to handle them.
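One common way to at least flag candidate outliers before that discussion is the 1.5 × IQR rule. The sketch below applies it to made-up revenue totals. Both the rule and the threshold are illustrative choices here, not something the original analysis necessarily used.

```python
import pandas as pd

# Made-up revenue totals with one extreme customer.
revenue_demo = pd.Series(
    [520.0, 610.0, 580.0, 700.0, 650.0, 9000.0], name="total_revenue"
)

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = revenue_demo.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values that fall outside the fences.
is_outlier = (revenue_demo < lower) | (revenue_demo > upper)
print(revenue_demo[is_outlier])
```

Flagging rather than dropping keeps the decision about what to do with these customers open for the stakeholder conversation.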
🔎 Individual Cluster Analysis and Tailoring Marketing Strategies
Cluster 1
- Average Total Revenue: $969.89
- Sum Total Revenue: $320,976.92
- Average Order Value: $13.25
- Average Total Quantity: 1,379.24
- Average Recency: 5.50 days
- Customer Count: 333
Analysis
- High-value customers with high revenue and quantity.
- These customers are highly engaged with frequent purchases.
Marketing Strategy
- Upsell and cross-sell products to increase revenue because these customers are already engaged.
- Offer loyalty programs to encourage continued frequent purchases.
Cluster 2
- Average Total Revenue: $696.06
- Sum Total Revenue: $355,686.77
- Average Order Value: $12.23
- Average Total Quantity: 1,007.60
- Average Recency: 7.29 days
- Customer Count: 511
Analysis
- Moderate-spending customers with steady purchases.
- These customers are less frequent than Cluster 1 but still show regular engagement.
Marketing Strategy
- Implement a retention campaign to keep these customers engaged.
- Run referral programs. Although referrals sound like a general strategy, in my opinion they can be more beneficial in this cluster. (Give it a thought.)
Cluster 3
- Average Total Revenue: $557.54
- Sum Total Revenue: $86,976.31
- Average Order Value: $10.76
- Average Total Quantity: 824.18
- Average Recency: 27.24 days
- Customer Count: 156
Analysis
- Low-spending customers who are also less frequent.
- These customers are at a higher risk of churning. (Churn: When a customer stops purchasing from a business)
Marketing Strategy
- Gather feedback, maybe through surveys (with well-framed questions), to understand why these customers are not purchasing frequently.
- Launch a re-engagement campaign with special offers and discounts.
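To compare the three clusters on the same footing, the customer counts and revenue sums reported above can be turned into shares of the whole. The short sketch below does that arithmetic using only the figures listed in the cluster summaries.

```python
# Figures copied from the cluster summaries above.
clusters = {
    1: {"customers": 333, "revenue": 320_976.92},
    2: {"customers": 511, "revenue": 355_686.77},
    3: {"customers": 156, "revenue": 86_976.31},
}

total_customers = sum(c["customers"] for c in clusters.values())
total_revenue_all = sum(c["revenue"] for c in clusters.values())

# Share of customers and share of revenue for each cluster.
shares = {}
for cluster_id, c in clusters.items():
    customer_share = c["customers"] / total_customers
    revenue_share = c["revenue"] / total_revenue_all
    shares[cluster_id] = (customer_share, revenue_share)
    print(
        f"Cluster {cluster_id}: {customer_share:.1%} of customers, "
        f"{revenue_share:.1%} of revenue"
    )
```

Cluster 1 accounts for a larger share of revenue than of customers, while clusters 2 and 3 account for a smaller share of revenue than of customers, which is consistent with the strategies suggested for each.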
In the next article, we will use ⏳ Time Series Forecasting to predict product sales for the next 3 months. Stay tuned! 🚀
There is more analysis and code in the Appendix section of the notebook, which I left out here to keep this article from getting too long. Star the repo for future reference:
GitHub – naveen-malla/Customer-Segmentation-and-SKU-Forecasting (github.com): This repo contains code for performing customer segmentation and sales forecast prediction on a company's sales data. …