Customer Segmentation and Time Series Forecasting Based on Sales Data #1/3
Last Updated on September 27, 2024 by Editorial Team
Author(s): Naveen Malla
Originally published on Towards AI.
Hey, first things first. This blog is divided into a 3-part series where I am going to focus on three different aspects:
- Exploratory Data Analysis (EDA)
- Customer segmentation
- SKU forecasting (with bonus code)
Why Should You Care About This Series❓
This work got me a machine learning internship. I presented my notebooks to the whole team, including both technical and non-technical stakeholders, and they loved it. I put a lot of hours into this project to make it as informative and easy to understand as possible. I was really glad the team found it so engaging, and I wanted to share it here. So, if you're curious about working on machine learning projects to gain knowledge, or are looking for ideas on how to present projects like this in your upcoming interviews, this series is for you. Let's dive in.
🗂️ The Dataset
This is a typical sales dataset for healthcare products. It contains sales data for 7 months of the year, starting from January. It also contains added noise, so some results may look more evenly distributed than real-world data would.
But this isn't going to be about the data itself; rather, it's about what you can do with it.
The features in the dataset are kinda self-explanatory, so I am not gonna waste your time giving descriptions for each. You'll see what they mean as we proceed.
- order_number
- order_date
- customer_number
- type
- month
- item_number (SKU)
- quantity
- category
- revenue
- customer_source
- order_source
📊 Let's Do Some EDA!!!
I will skip some parts of the code like data loading and processing to keep it more engaging. I will attach a link to the whole notebook at the end of the article, though.
print(df.nunique())
Observation:
There are a total of
- 1000 unique customers
- 1000 unique items
- 2 categories
Monthly Revenue
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter
df['month'] = df['order_date'].dt.month
month_abbr = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'}
df['month_abbr'] = df['month'].map(month_abbr)
# Aggregate monthly revenue
monthly_revenue = df.groupby('month_abbr')['revenue'].sum().reset_index()
# Define a categorical type for the month_abbr to ensure proper sorting
monthly_revenue['month_abbr'] = pd.Categorical(monthly_revenue['month_abbr'],
categories=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
ordered=True)
# Sort by the 'month_abbr' column
monthly_revenue = monthly_revenue.sort_values(by='month_abbr')
# Plot monthly revenue
plt.figure(figsize=(8, 4))
sns.barplot(x='month_abbr', y='revenue', data=monthly_revenue, palette='viridis')
plt.xlabel('Month')
plt.ylabel('Revenue')
plt.title('Monthly Revenue')
# Format y-label as shortened form
ax = plt.gca()
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: f'{int(x/1000)}k'))
plt.xticks(rotation=45)
plt.show()
Observation:
- The revenue seems to be increasing over the months.
- Larger jumps in revenue can be seen starting from May, compared to the earlier months.
🔧 Feature Engineering to create more meaningful features
Feature Engineering is the process of creating new features from the existing features in the dataset.
- This helps us analyse the data better as the new features are more meaningful.
- Example: top 10 products, average revenue per category, etc.
# Top 10 performing products by revenue
top_10_products = df.groupby('item_number')['revenue'].sum().nlargest(10).reset_index()
# Average revenue by category
avg_revenue_category = df.groupby('category')['revenue'].mean().reset_index()
# Average revenue by type
avg_revenue_type = df.groupby('type')['revenue'].mean().reset_index()
# Average revenue by customer source
avg_revenue_customer_source = df.groupby('customer_source')['revenue'].mean().reset_index()
# Average revenue by order source
avg_revenue_order_source = df.groupby('order_source')['revenue'].mean().reset_index()
📈 Visualizing the engineered features
Top 10 Products by Revenue
# Bar chart for Top 10 performing products
plt.figure(figsize=(8, 4))
sns.barplot(x='item_number', y='revenue', data=top_10_products, palette='viridis')
plt.title('Top 10 Performing Products by Revenue')
plt.xlabel('Item Number')
plt.ylabel('Total Revenue')
plt.xticks(rotation=45)
plt.show()
Observation:
- The top-performing products seem to generate similar amounts of revenue.
Now, let's see what percentage of total revenue the top 10 products contribute.
top_10_revenue = top_10_products['revenue'].sum()
total_revenue = df['revenue'].sum()
print(f'Top 10 products contribute {top_10_revenue/total_revenue:.2%} to the total revenue')
1.45% of total revenue isn't a significant share, indicating that these products do not dominate the overall sales.
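To put that 1.45% in context, you can look at how concentrated revenue is across SKUs with a cumulative-share calculation. A minimal sketch with made-up numbers (only the column names match the dataset):

```python
import pandas as pd

# Toy rows with the same columns as the dataset (revenues are made up)
toy = pd.DataFrame({
    'item_number': ['A', 'B', 'B', 'C', 'D'],
    'revenue':     [400, 200, 100, 200, 100],
})

# Revenue per SKU, largest first, then each SKU's cumulative share of the total
per_sku = toy.groupby('item_number')['revenue'].sum().sort_values(ascending=False)
cum_share = per_sku.cumsum() / per_sku.sum()
print(cum_share)
# Here the top SKU alone covers 40% of revenue; in our data the top
# 10 SKUs cover only ~1.45%, so revenue is spread thinly across many products.
```

The same two lines run on the real df would show how many SKUs you need before the cumulative share reaches, say, 80%.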
Average Revenue per Categorical variable
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
# Pie chart for Average Revenue by Customer Source
axes[0].pie(avg_revenue_customer_source['revenue'], labels=avg_revenue_customer_source['customer_source'], autopct='%1.1f%%', colors=sns.color_palette('viridis', 2))
axes[0].set_title('Average Revenue by Customer Source')
# Donut chart for Average Revenue by Order Source
axes[1].pie(avg_revenue_order_source['revenue'], labels=avg_revenue_order_source['order_source'], autopct='%1.1f%%', colors=sns.color_palette('viridis', 5))
axes[1].set_title('Average Revenue by Order Source')
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
axes[1].add_artist(centre_circle)  # add the donut hole to the second chart
plt.tight_layout()
plt.show()  # render the first figure before starting the second
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
# Pie chart for Average Revenue by Category
axes[0].pie(avg_revenue_category['revenue'], labels=avg_revenue_category['category'], autopct='%1.1f%%', colors=sns.color_palette('viridis', 2))
axes[0].set_title('Average Revenue by Category')
# Pie chart for Average Revenue by Type
axes[1].pie(avg_revenue_type['revenue'], labels=avg_revenue_type['type'], autopct='%1.1f%%', colors=sns.color_palette('viridis', 2))
axes[1].set_title('Average Revenue by Type')
plt.tight_layout()
plt.show()
Observation:
The revenue seems to be spread almost equally across categories, types, customer sources, and order sources.
Units sold by category
category_counts = df.groupby('category')['quantity'].sum()
# Plot total units sold by category
plt.figure(figsize=(8, 4))
sns.barplot(x=category_counts.index, y=category_counts, palette='viridis')
plt.title('Total Units Sold by Category')
plt.xlabel('Category')
plt.ylabel('Total Units Sold')
plt.xticks(rotation=45)
plt.show()
Observation:
- Although DIABETES and HYPERTENSIVES products generate similar revenue, more HYPERTENSIVES units are sold than DIABETES units.
- This suggests that the DIABETES products being purchased are, on average, more expensive than the HYPERTENSIVES products.
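One way to check that inference is to compute an average unit price per category (total revenue divided by total units). A minimal sketch with made-up quantities and revenues, using the dataset's column names:

```python
import pandas as pd

# Toy rows mirroring the dataset's columns (quantities and revenues are made up)
toy = pd.DataFrame({
    'category': ['DIABETES', 'DIABETES', 'HYPERTENSIVES', 'HYPERTENSIVES'],
    'quantity': [2, 3, 10, 10],
    'revenue':  [100, 150, 100, 150],
})

# Average unit price = total revenue / total units sold, per category
per_cat = toy.groupby('category')[['revenue', 'quantity']].sum()
per_cat['unit_price'] = per_cat['revenue'] / per_cat['quantity']
print(per_cat['unit_price'])
# Same revenue (250 each) but far fewer DIABETES units, so its unit price is higher
```

Run against the real df, this would give the actual average price gap between the two categories.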
total_products_sold = df.groupby('category')['quantity'].sum()
print(total_products_sold)
- HYPERTENSIVES products indeed seem to be more popular among customers, leading by around 50,000 units sold over the last 7 months.
Revenue Distribution
# Calculate total revenue for each customer
total_revenue = df.groupby('customer_number')['revenue'].sum().reset_index()
total_revenue.columns = ['customer_number', 'total_revenue']
# Distribution Plot
plt.figure(figsize=(8, 4))
sns.histplot(total_revenue['total_revenue'], kde=True, color='skyblue')
plt.title('Distribution of Total Revenue by Customer')
plt.xlabel('Total Revenue')
plt.ylabel('Frequency')
plt.show()
Observations:
🔔 Normal Distribution
- A normal distribution is a way to show how data is spread out, where most values are close to the middle (the average). It looks symmetric and bell-shaped.
- The graph above shows a near-normal distribution.
- In our case, the average spending is around the $700 to $800 mark, which you can identify at the middle of the curve.
💰 Revenue Spread
- Revenue spread indicates how diverse customer spending is. We look at the lowest (leftmost) and highest (rightmost) figures to understand this metric.
- There is a significant spread, ranging from $200 to $1400.
- This indicates a diverse range of customer spending.
💎 High-Value Customers/Outliers
- Higher end tail represents high-value customers.
- Might be B2B customers or customers who buy in bulk.
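The shape described above can also be quantified rather than eyeballed: skewness near 0 and excess kurtosis near 0 are consistent with a near-normal curve. A sketch on simulated spend data (the real customer totals aren't reproduced here, so the numbers are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Simulated customer totals centered near $750, a stand-in for the real histogram
spend = pd.Series(rng.normal(loc=750, scale=200, size=1000))

print(f"mean:     {spend.mean():.0f}")
print(f"skewness: {spend.skew():.2f}")   # ~0 for a symmetric distribution
print(f"kurtosis: {spend.kurt():.2f}")   # excess kurtosis; ~0 for a normal curve
```

Replacing `spend` with the real `total_revenue['total_revenue']` column would tell you how far the actual distribution deviates from normal.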
📅 Calculate recency for each customer
Recency is the number of days since the last purchase made by the customer.
recency = df.groupby('customer_number')['order_date'].max().reset_index()
recency['recency'] = (df['order_date'].max() - recency['order_date']).dt.days
recency = recency.drop('order_date', axis=1)
# Heatmap of Recency by Revenue Segment
recency = pd.merge(recency, total_revenue, on='customer_number')
recency['revenue_segment'] = pd.qcut(recency['total_revenue'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])
plt.figure(figsize=(8, 4))
sns.heatmap(recency.pivot_table(index='revenue_segment', columns='recency', aggfunc='size', fill_value=0), cmap='Blues')
plt.title('Heatmap of Recency by Revenue Segment')
plt.xlabel('Days Since Last Order')
plt.ylabel('Revenue Segment')
plt.show()
Observation:
- There are three things to note in this heatmap: the revenue segment, divided into 4 categories (Y-axis); the days since the last order (X-axis); and the number of users falling in each revenue segment. The darker the shade, the more users.
- It is clear that many users keep purchasing on a regular basis, which can be seen in the bottom-left corner of the graph.
- As we move to the right, the days since the last order increase, and we can observe fewer users (lighter shades) in the Low revenue segment.
Question to stakeholders:
What actions are needed to ensure that customers with low recency and revenue generation are retained?
🔄 Customer Retention Analysis
Customer retention is the ability of a company to retain its customers over some specified period. This reflects customer loyalty and satisfaction.
from operator import attrgetter
# Cohort Analysis
df['order_date'] = pd.to_datetime(df['order_date'])
df['order_month'] = df['order_date'].dt.to_period('M')
df['cohort'] = df.groupby('customer_number')['order_date'].transform('min').dt.to_period('M')
cohort_data = df.groupby(['cohort', 'order_month']).agg({
'customer_number': 'nunique',
'order_number': 'count',
'revenue': 'sum'
}).reset_index()
cohort_data['cohort_index'] = (cohort_data['order_month'] - cohort_data['cohort']).apply(attrgetter('n'))
# Pivot table to visualize cohort retention
cohort_pivot = cohort_data.pivot_table(index='cohort', columns='cohort_index', values='customer_number')
cohort_size = cohort_pivot.iloc[:, 0]
retention_matrix = cohort_pivot.divide(cohort_size, axis=0)
import seaborn as sns
plt.figure(figsize=(8, 4))
sns.heatmap(retention_matrix, annot=True, fmt='.0%', cmap='viridis')
plt.title('Cohort Analysis - Retention Rates')
plt.ylabel('Cohort Month')
plt.xlabel('Months Since First Purchase')
plt.show()
If this is the first time you are seeing a graph like this, here are some notes to make sense of it:
- Here too, there are three things to consider: months since first purchase on the X-axis, when the cohort starts on the Y-axis, and the retention percentage shown by the shading.
- Each horizontal block (we have 3 here) is a Cohort.
- The first month of every cohort always has 100% retention since it is the starting point for our analysis.
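If the pivot above is still abstract, here is the same computation on a three-customer toy dataset (hypothetical dates; same column names and logic as the code above):

```python
import pandas as pd
from operator import attrgetter

# Toy orders: c1 and c2 first buy in Jan, c1 returns in Feb; c3 first buys in Feb
orders = pd.DataFrame({
    'customer_number': ['c1', 'c2', 'c1', 'c3'],
    'order_date': pd.to_datetime(['2024-01-05', '2024-01-20',
                                  '2024-02-10', '2024-02-15']),
})

orders['order_month'] = orders['order_date'].dt.to_period('M')
orders['cohort'] = (orders.groupby('customer_number')['order_date']
                    .transform('min').dt.to_period('M'))
# Months elapsed since the customer's first purchase
orders['cohort_index'] = (orders['order_month'] - orders['cohort']).apply(attrgetter('n'))

pivot = orders.pivot_table(index='cohort', columns='cohort_index',
                           values='customer_number', aggfunc='nunique')
retention = pivot.divide(pivot.iloc[:, 0], axis=0)
print(retention)
# Jan cohort: 100% in month 0, 50% in month 1 (only c1 came back)
# Feb cohort: 100% in month 0
```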
Observations:
- High Retention in Initial Months: Retention rates remain strong during the first few months across cohorts.
- Drop in Retention: There's a noticeable drop to 50% retention in the fourth month for the March 2024 cohort.
Question to stakeholders:
What might be the reason for the significant drop in retention rate?
This question might be better answered with more business context.
🧑‍💻 Creating a new dataset with engineered features for customer analysis
We engineered many new features from existing ones. Besides visualizing them, we can also use them to obtain very useful insights in tasks like customer segmentation. So let's merge these new, more useful features into a new dataset for our next step.
avg_order_value = df.groupby('customer_number')['revenue'].mean().reset_index()
avg_order_value.columns = ['customer_number', 'avg_order_value']
total_quantity = df.groupby('customer_number')['quantity'].sum().reset_index()
total_quantity.columns = ['customer_number', 'total_quantity']
# Merge all features into a single DataFrame
customer_data = total_revenue.merge(avg_order_value, on='customer_number')
customer_data = customer_data.merge(total_quantity, on='customer_number')
customer_data = customer_data.merge(recency, on='customer_number')
# 'recency' also carries a total_revenue column from the earlier merge, so
# pandas suffixes the clashing columns as _x/_y; keep one copy and drop both
customer_data['total_revenue'] = customer_data['total_revenue_x']
customer_data = customer_data.drop(['total_revenue_x', 'total_revenue_y'], axis=1)
# Display the aggregated data
print(customer_data.shape)
print(customer_data.head())
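A slightly cleaner variant (a sketch, with small made-up stand-ins for the frames built above) drops the duplicate total_revenue column from recency before merging, so pandas never has to invent _x/_y suffixes:

```python
import pandas as pd

# Minimal stand-ins (illustrative values) for the frames built earlier
rev = pd.DataFrame({'customer_number': [1, 2], 'total_revenue': [500.0, 900.0]})
aov = pd.DataFrame({'customer_number': [1, 2], 'avg_order_value': [50.0, 90.0]})
qty = pd.DataFrame({'customer_number': [1, 2], 'total_quantity': [10, 12]})
rec = pd.DataFrame({'customer_number': [1, 2],
                    'recency': [3, 40],
                    'total_revenue': [500.0, 900.0],   # duplicate column
                    'revenue_segment': ['Low', 'High']})

# Drop the duplicate column before merging so no _x/_y suffixes appear
merged = (rev
          .merge(aov, on='customer_number')
          .merge(qty, on='customer_number')
          .merge(rec.drop(columns='total_revenue'), on='customer_number'))
print(merged.columns.tolist())
```

Either route ends with one clean total_revenue column; this one just avoids the cleanup step.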
We will look at how to cluster users on this data and come up with marketing strategies tailored to each cluster in next week's article.
If you enjoyed this post, please consider
- holding the clap button for a few seconds (it goes up to 50) and
- following me for more updates.
It gives me the motivation to keep going and helps the story reach more people like you. I share stories every week about machine learning concepts and tutorials on interesting projects. See you next week. Happy learning!
Published via Towards AI