

Customer Segmentation and Time Series Forecasting Based on Sales Data #2/3

Last Updated on October 5, 2024 by Editorial Team

Author(s): Naveen Malla

Originally published on Towards AI.

This is the second article in a 3-part series. In the first part, I covered some initial data analysis steps you can take before diving into the actual Customer Segmentation. You don’t have to read that before this one, but it’ll give you some great insights and set you up for the more exciting stuff we’re about to cover.

Customer Segmentation and Time Series Forecasting Based on Sales Data #1/3

project that got me an ml internship

In this part, we’ll look into segmenting customers into clusters and coming up with some marketing strategies for each cluster to maximize returns.

🗂️ Creating a New Dataset with Engineered Features for Customer Analysis

In the last article, we engineered some features from the original dataset to gain deeper insights. Now, we’ll create a new dataset using those features, which will serve as a base for segmentation.

# total_revenue and recency are assumed to be the per-customer DataFrames built in Part 1

# Average order value and total quantity per customer
avg_order_value = df.groupby('customer_number')['revenue'].mean().reset_index()
avg_order_value.columns = ['customer_number', 'avg_order_value']
total_quantity = df.groupby('customer_number')['quantity'].sum().reset_index()
total_quantity.columns = ['customer_number', 'total_quantity']

# Merge all features into a single DataFrame
customer_data = total_revenue.merge(avg_order_value, on='customer_number')
customer_data = customer_data.merge(total_quantity, on='customer_number')
customer_data = customer_data.merge(recency, on='customer_number')

# The merges leave two total_revenue columns (suffixed _x and _y); keep a single copy
customer_data['total_revenue'] = customer_data['total_revenue_x']
customer_data = customer_data.drop(['total_revenue_x', 'total_revenue_y'], axis=1)

# Display the aggregated data
customer_data.head()

Customer segmentation is the process of dividing a company’s customer base into different groups, or “segments”, based on shared characteristics. The goal is to identify clusters of customers who have similar needs and behaviours.

Photo by freestocks on Unsplash

🎯 Why Is Customer Segmentation Important?

Segmentation helps in

  • understanding the customer base better.
  • creating targeted marketing strategies.
  • improving customer service.

⚙️ Building a Model with KMeans Clustering

KMeans is an unsupervised machine learning algorithm that groups similar data points into clusters based on their features. It works by assigning each data point to the nearest cluster center (centroid) and then iteratively adjusting the centroids until the clusters stabilize.
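
To make the assign-then-update loop concrete, here is a minimal pure-Python sketch of it. This is not how scikit-learn implements KMeans, and it takes the starting centroids as an argument instead of using k-means++. The function name `kmeans` and the toy points are made up for illustration.

```python
import math


def kmeans(points, init_centroids, max_iter=100):
    """Toy k-means loop on a list of equal-length tuples (illustrative only)."""
    centroids = list(init_centroids)
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the clusters are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Made-up 2-D points forming two obvious groups.
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
toy_labels, toy_centroids = kmeans(toy_points, init_centroids=[toy_points[0], toy_points[3]])
print(toy_labels)
print(toy_centroids)
```

The loop stops as soon as an assignment pass produces the same labels as the previous one, which is the "clusters stabilize" condition described above.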

Why KMeans Clustering?

  • Simple to implement.
  • Efficient, fast, and scales well to large datasets.

📈 Selecting Features for Clustering

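from sklearn.preprocessing import StandardScaler

features = customer_data[['total_revenue', 'avg_order_value', 'total_quantity', 'recency']]

# Standardize the features (zero mean, unit variance) so that no single feature dominates the distance calculation
scaler = StandardScaler()
normalized_features = scaler.fit_transform(features)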
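
If you are wondering what StandardScaler does to each column, the sketch below reproduces the idea by hand: subtract the column mean and divide by the population standard deviation. The helper name `standardize` and the revenue values are made up for illustration; the real scaler works on whole arrays and stores the fitted mean and scale for reuse.

```python
import statistics


def standardize(column):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population std (divides by n, not n - 1)
    return [(x - mean) / std for x in column]


# Made-up revenue values, for illustration only.
toy_revenue = [100.0, 200.0, 300.0, 400.0]
z_scores = standardize(toy_revenue)
print([round(z, 3) for z in z_scores])
```

After this step every feature has mean 0 and standard deviation 1, so a feature measured in dollars and one measured in days contribute on the same scale to the distances KMeans computes.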

🔍 So how do we decide the number of clusters?

Elbow Method

  1. The KMeans algorithm is run for different values of k (number of clusters), and for each k, the within-cluster sum of squared distances (WCSS) is calculated.
  2. As k increases, the total within-cluster variance decreases, because more clusters allow data points to group more tightly around their centroids.
  3. After a certain point, adding more clusters no longer reduces the variance significantly. That point forms an “elbow” in the graph and indicates the optimal number of clusters.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Elbow method to determine the optimal number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=42)
    kmeans.fit(normalized_features)
    wcss.append(kmeans.inertia_)  # inertia_ is the within-cluster sum of squares

# Plotting the results
plt.figure(figsize=(8, 4))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Method for Optimal Number of Clusters')
plt.xlabel('Number of clusters')
plt.ylabel('Within-Cluster Variance')
plt.show()

The graph shows a sharp decline in the within-cluster variance as k increases from 1 to 3, after which the decrease becomes much less pronounced. That bend suggests k = 3.
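
I read the elbow off the plot by eye. If you want a rough numeric cross-check, one simple heuristic (not something I used in the project) is to pick the k where the curve bends the most, i.e. where the second difference of the WCSS values is largest. The function name `elbow_k` and the WCSS numbers below are made up for illustration; they are not the values from this dataset.

```python
def elbow_k(wcss_values, k_values):
    """Pick the k where the WCSS curve bends most (largest second difference)."""
    best_k, best_bend = None, float("-inf")
    # The first and last k have no neighbor on one side, so they are skipped.
    for i in range(1, len(wcss_values) - 1):
        bend = wcss_values[i - 1] - 2 * wcss_values[i] + wcss_values[i + 1]
        if bend > best_bend:
            best_k, best_bend = k_values[i], bend
    return best_k


# Hypothetical WCSS values for k = 1..10, shaped like a typical elbow curve.
toy_wcss = [4000, 2400, 1200, 1000, 850, 740, 650, 580, 520, 470]
toy_k = elbow_k(toy_wcss, list(range(1, 11)))
print(toy_k)
```

Treat this as a sanity check only. On noisy curves the largest bend can land on a different k than the one a human would pick from the plot.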

🚥 Validating with Silhouette Score

Just to be sure, I used another metric, the Silhouette Score, which measures how similar a data point is to its own cluster compared to other clusters. It ranges from -1 to 1.

The higher the Silhouette Score, the better the clustering.
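
For intuition, the score of a single point is s = (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster. Below is a small pure-Python sketch of that formula on made-up 1-D points. The function name `silhouette_values` is hypothetical, and scikit-learn's `silhouette_score` is what I actually use next.

```python
import math


def silhouette_values(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b) with Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        values.append((b - a) / max(a, b))
    return values


# Made-up 1-D points: two tight groups far apart.
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_values(toy_points, [0, 0, 1, 1])  # sensible grouping
bad = silhouette_values(toy_points, [0, 1, 0, 1])   # groups mixed up

good_score = sum(good) / len(good)
bad_score = sum(bad) / len(bad)
print(round(good_score, 3), round(bad_score, 3))
```

The sensible grouping scores close to 1, while the mixed-up grouping goes negative, which is why a higher average score points to a better choice of k.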

from sklearn.metrics import silhouette_score

# Calculate silhouette scores for a range of cluster numbers
silhouette_scores = []
for n_clusters in range(2, 11):
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=300, n_init=10, random_state=42)
    cluster_labels = kmeans.fit_predict(normalized_features)
    silhouette_avg = silhouette_score(normalized_features, cluster_labels)
    silhouette_scores.append(silhouette_avg)

# Plot silhouette scores
plt.figure(figsize=(8, 4))
plt.plot(range(2, 11), silhouette_scores, marker='o', linestyle='--')
plt.title('Silhouette Scores for Different Numbers of Clusters')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.show()

The Silhouette Score drops significantly after k = 3, which is the opposite of what we want. So we can confirm k = 3 and build a model that groups the whole customer base into 3 clusters.

🧩 Applying KMeans with 3 Clusters

from sklearn.cluster import KMeans

optimal_clusters = 3
kmeans_3 = KMeans(n_clusters=optimal_clusters, init='k-means++', max_iter=300, n_init=10, random_state=42)
customer_data['cluster_3'] = kmeans_3.fit_predict(normalized_features)

# Cluster numbers should start from 1
customer_data['cluster_3'] = customer_data['cluster_3'] + 1

# Analyze the characteristics of each cluster (3 clusters)
cluster_summary_3 = customer_data.groupby('cluster_3').agg({
    'total_revenue': ['mean', 'sum'],
    'avg_order_value': 'mean',
    'total_quantity': 'mean',
    'recency': 'mean',
    'customer_number': 'count'
}).reset_index()

# Flatten the MultiIndex columns
cluster_summary_3.columns = ['cluster', 'avg_total_revenue', 'sum_total_revenue', 'avg_order_value', 'avg_total_quantity', 'avg_recency', 'customer_count']

📊 Visualizing the Clusters and Their Characteristics

No. of customers in the clusters

import seaborn as sns

plt.figure(figsize=(8, 4))
sns.countplot(data=customer_data, x='cluster_3', palette='viridis')
plt.title('Customer Segmentation - 3 Clusters')
plt.ylabel('Number of Customers')
plt.show()

  • Cluster 2 is the largest by customer count.

Average Total Revenue per Cluster

plt.figure(figsize=(8, 4))
sns.barplot(x='cluster', y='avg_total_revenue', data=cluster_summary_3, palette='viridis')
plt.title('Average Total Revenue per Cluster')
plt.ylabel('Average Total Revenue')
plt.show()

  • Cluster 1 has the highest average total revenue per customer, followed by clusters 2 and 3.

Recency by Cluster

plt.figure(figsize=(8, 4))
sns.boxplot(x='cluster_3', y='recency', data=customer_data, palette='viridis')
plt.title('Recency of Orders by Cluster')
plt.ylabel('Recency (Days since last order)')
plt.show()

  • Customers in cluster 3 ordered least recently: their recency (days since last order) is the highest of all.

Displaying the Cluster Summary


Outliers present in the data can skew some of the results. Discussion with stakeholders is necessary to decide how to handle them.
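
To see why this matters, here is a tiny illustration with made-up revenue numbers (not taken from the dataset): a single very large customer pulls the cluster mean far away from what a typical customer spends, while the median barely moves.

```python
import statistics

# Made-up per-customer revenue for one cluster, for illustration only.
typical = [520.0, 540.0, 560.0, 580.0, 600.0]
with_outlier = typical + [9000.0]  # one unusually large customer

mean_before = statistics.fmean(typical)
mean_after = statistics.fmean(with_outlier)
median_before = statistics.median(typical)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

Since the cluster summary relies on means, and KMeans centroids are means too, a handful of such customers can shift both the cluster profiles and the cluster boundaries.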

🔎 Individual Cluster Analysis and Tailoring Marketing Strategies

Cluster 1:

  • Average Total Revenue: $969.89
  • Sum Total Revenue: $320,976.92
  • Average Order Value: $13.25
  • Average Total Quantity: 1,379.24
  • Average Recency: 5.50 days
  • Customer Count: 333


  • High-value customers with high revenue and quantity.
  • These customers are highly engaged with frequent purchases.

Marketing Strategy

  • Upsell and cross-sell products to increase revenue because these customers are already engaged.
  • Offer loyalty programs to encourage continued frequent purchases.

Cluster 2:

  • Average Total Revenue: $696.06
  • Sum Total Revenue: $355,686.77
  • Average Order Value: $12.23
  • Average Total Quantity: 1,007.60
  • Average Recency: 7.29 days
  • Customer Count: 511


  • Moderate spending customers with steady purchases.
  • These customers are less frequent than Cluster 1 but still show regular engagement.

Marketing Strategy

  • Implement a retention campaign to keep these customers engaged.
  • Run referral programs. This may sound like a general strategy, but in my opinion it can be more beneficial with this cluster. (Give it a thought.)

Cluster 3:

  • Average Total Revenue: $557.54
  • Sum Total Revenue: $86,976.31
  • Average Order Value: $10.76
  • Average Total Quantity: 824.18
  • Average Recency: 27.24 days
  • Customer Count: 156


  • Low-spending customers who are also less frequent.
  • These customers are at a higher risk of churning. (Churn: When a customer stops purchasing from a business)

Marketing Strategy

  • Gather feedback, for example through surveys with well-framed questions, to understand why these customers are not purchasing frequently.
  • Launch a re-engagement campaign with special offers and discounts.

We will look into using the concepts of ⏳ Time Series Forecasting to predict the sales of the products for the next 3 months in the next article. Stay tuned! 🚀

There is more analysis and code in the Appendix section of the notebook, which I left out here to keep this article from getting too long. Star the repo for future reference here:

GitHub – naveen-malla/Customer-Segmentation-and-SKU-Forecasting: This repo contains code for…


If you enjoyed this post, please consider

  • holding the clap button for a few seconds (it goes up to 50) and
  • following me for more updates.

It gives me the motivation to keep going and helps the story reach more people like you. I share stories every week about machine learning concepts and tutorials on interesting projects. See you next week. Happy learning!

LinkedIn, Medium, GitHub


Published via Towards AI
