Customer Segmentation and Time Series Forecasting Based on Sales Data #1/3
Last Updated on September 27, 2024 by Editorial Team
Author(s): Naveen Malla
Originally published on Towards AI.
Hey, first things first. This blog is divided into a 3-part series where I am going to focus on three different aspects:
- Exploratory Data Analysis (EDA)
- Customer segmentation
- SKU forecasting (with bonus code)
Why Should You Care About This Series❓
This work got me a machine learning internship. I presented my notebooks to the whole team, including both technical and non-technical stakeholders, and they loved it. I put a lot of hours into this project to make it as informative and easy to understand as possible. I was really glad the team found it so engaging, and I wanted to share it here. So, if you're curious about working on machine learning projects to gain knowledge, or are looking for ideas on how to present projects like this in your upcoming interviews, this series is for you. Let's dive in.
🗂️ The Dataset
This is a typical sales dataset for healthcare products. It contains sales data for 7 months of the year, starting from January. It also contains added noise, so some results may look more evenly distributed than real-world data would.
But this isn't going to be about the data itself; rather, it's about what you can do with it.
The features in the dataset are kinda self-explanatory, so I am not gonna waste your time giving descriptions for each. You'll see what they mean as we proceed.
- order_number
- order_date
- customer_number
- type
- month
- item_number (SKU)
- quantity
- category
- revenue
- customer_source
- order_source
📊 Let's Do Some EDA!!!
I will skip some parts of the code like data loading and processing to keep it more engaging. I will attach a link to the whole notebook at the end of the article, though.
print(df.nunique())
Observation:
There are a total of
- 1000 unique customers
- 1000 unique items
- 2 categories
Monthly Revenue
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter
df['month'] = df['order_date'].dt.month
month_abbr = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'}
df['month_abbr'] = df['month'].map(month_abbr)
# Aggregate monthly revenue
monthly_revenue = df.groupby('month_abbr')['revenue'].sum().reset_index()
# Define a categorical type for the month_abbr to ensure proper sorting
monthly_revenue['month_abbr'] = pd.Categorical(monthly_revenue['month_abbr'],
categories=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
ordered=True)
# Sort by the 'month_abbr' column
monthly_revenue = monthly_revenue.sort_values(by='month_abbr')
# Plot monthly revenue
plt.figure(figsize=(8, 4))
sns.barplot(x='month_abbr', y='revenue', data=monthly_revenue, palette='viridis')
plt.xlabel('Month')
plt.ylabel('Revenue')
plt.title('Monthly Revenue')
# Format y-label as shortened form
ax = plt.gca()
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: f'{int(x/1000)}k'))
plt.xticks(rotation=45)
plt.show()
Observation:
- The revenue seems to be increasing over the months.
- Larger jumps in revenue can be seen starting from May, compared to the earlier months.
🔧 Feature Engineering to create more meaningful features
Feature Engineering is the process of creating new features from the existing features in the dataset.
- This helps us analyse the data better as the new features are more meaningful.
- Example: top 10 products, average revenue per category, etc.
# Top 10 performing products by revenue
top_10_products = df.groupby('item_number')['revenue'].sum().nlargest(10).reset_index()
# Average revenue by category
avg_revenue_category = df.groupby('category')['revenue'].mean().reset_index()
# Average revenue by type
avg_revenue_type = df.groupby('type')['revenue'].mean().reset_index()
# Average revenue by customer source
avg_revenue_customer_source = df.groupby('customer_source')['revenue'].mean().reset_index()
# Average revenue by order source
avg_revenue_order_source = df.groupby('order_source')['revenue'].mean().reset_index()
📈 Visualizing the engineered features
Top 10 Products by Revenue
# Bar chart for Top 10 performing products
plt.figure(figsize=(8, 4))
sns.barplot(x='item_number', y='revenue', data=top_10_products, palette='viridis')
plt.title('Top 10 Performing Products by Revenue')
plt.xlabel('Item Number')
plt.ylabel('Total Revenue')
plt.xticks(rotation=45)
plt.show()
Observation:
- The top-performing products seem to generate similar amounts of revenue.
Now, let's see what percentage of total revenue the top 10 products contribute.
top_10_revenue = top_10_products['revenue'].sum()
total_revenue = df['revenue'].sum()
print(f'Top 10 products contribute {top_10_revenue/total_revenue:.2%} to the total revenue')
1.45% of total revenue isn't a significant share, indicating that these products do not dominate the overall sales.
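To put that 1.45% in context, you can look at how concentrated revenue is across SKUs with a cumulative-share calculation. A minimal sketch with made-up numbers (only the column names match the dataset):

```python
import pandas as pd

# Toy rows with the same columns as the dataset (revenues are made up)
toy = pd.DataFrame({
    'item_number': ['A', 'B', 'B', 'C', 'D'],
    'revenue':     [400, 200, 100, 200, 100],
})

# Revenue per SKU, largest first, then each SKU's cumulative share of the total
per_sku = toy.groupby('item_number')['revenue'].sum().sort_values(ascending=False)
cum_share = per_sku.cumsum() / per_sku.sum()
print(cum_share)
# Here the top SKU alone covers 40% of revenue; in our data the top
# 10 SKUs cover only ~1.45%, so revenue is spread thinly across many products.
```

The same two lines run on the real df would show how many SKUs you need before the cumulative share reaches, say, 80%.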
Average Revenue per Categorical variable
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
# Pie chart for Average Revenue by Customer Source
axes[0].pie(avg_revenue_customer_source['revenue'], labels=avg_revenue_customer_source['customer_source'], autopct='%1.1f%%', colors=sns.color_palette('viridis', 2))
axes[0].set_title('Average Revenue by Customer Source')
# Donut chart for Average Revenue by Order Source
axes[1].pie(avg_revenue_order_source['revenue'], labels=avg_revenue_order_source['order_source'], autopct='%1.1f%%', colors=sns.color_palette('viridis', 5))
axes[1].set_title('Average Revenue by Order Source')
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
axes[1].add_artist(centre_circle)  # add the donut hole to the second chart
plt.tight_layout()
plt.show()  # render the first figure before starting the second
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
# Pie chart for Average Revenue by Category
axes[0].pie(avg_revenue_category['revenue'], labels=avg_revenue_category['category'], autopct='%1.1f%%', colors=sns.color_palette('viridis', 2))
axes[0].set_title('Average Revenue by Category')
# Pie chart for Average Revenue by Type
axes[1].pie(avg_revenue_type['revenue'], labels=avg_revenue_type['type'], autopct='%1.1f%%', colors=sns.color_palette('viridis', 2))
axes[1].set_title('Average Revenue by Type')
plt.tight_layout()
plt.show()
Observation:
The revenue seems to be spread almost equally across categories, types, customer sources, and order sources.
Units sold by category
category_counts = df.groupby('category')['quantity'].sum()
# Plot total units sold by category
plt.figure(figsize=(8, 4))
sns.barplot(x=category_counts.index, y=category_counts, palette='viridis')
plt.title('Total Units Sold by Category')
plt.xlabel('Category')
plt.ylabel('Total Units Sold')
plt.xticks(rotation=45)
plt.show()
Observation:
- Although DIABETES and HYPERTENSIVES products generate similar revenue, more HYPERTENSIVES units are sold than DIABETES units.
- This suggests that the DIABETES products being purchased are, on average, more expensive than the HYPERTENSIVES products.
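One way to check that inference is to compute an average unit price per category (total revenue divided by total units). A minimal sketch with made-up quantities and revenues, using the dataset's column names:

```python
import pandas as pd

# Toy rows mirroring the dataset's columns (quantities and revenues are made up)
toy = pd.DataFrame({
    'category': ['DIABETES', 'DIABETES', 'HYPERTENSIVES', 'HYPERTENSIVES'],
    'quantity': [2, 3, 10, 10],
    'revenue':  [100, 150, 100, 150],
})

# Average unit price = total revenue / total units sold, per category
per_cat = toy.groupby('category')[['revenue', 'quantity']].sum()
per_cat['unit_price'] = per_cat['revenue'] / per_cat['quantity']
print(per_cat['unit_price'])
# Same revenue (250 each) but far fewer DIABETES units, so its unit price is higher
```

Run against the real df, this would give the actual average price gap between the two categories.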
total_products_sold = df.groupby('category')['quantity'].sum()
print(total_products_sold)
- HYPERTENSIVES products indeed seem to be more popular among customers, leading by around 50,000 units sold over the last 7 months.
Revenue Distribution
# Calculate total revenue for each customer
total_revenue = df.groupby('customer_number')['revenue'].sum().reset_index()
total_revenue.columns = ['customer_number', 'total_revenue']
# Distribution Plot
plt.figure(figsize=(8, 4))
sns.histplot(total_revenue['total_revenue'], kde=True, color='skyblue')
plt.title('Distribution of Total Revenue by Customer')
plt.xlabel('Total Revenue')
plt.ylabel('Frequency')
plt.show()
Observations:
🔔 Normal Distribution
- A normal distribution is a way to show how data is spread out, where most values are close to the middle (the average). It looks symmetric and bell-shaped.
- The graph above shows a near-normal distribution.
- In our case, the average spending is around the $700 to $800 mark, which you can identify at the middle of the curve.
💰 Revenue Spread
- Revenue spread indicates how diverse customer spending is. We look at the lowest (leftmost) and highest (rightmost) figures to understand this metric.
- There is a significant spread, ranging from $200 to $1400.
- This indicates a diverse range of customer spending.
💎 High-Value Customers/Outliers
- Higher end tail represents high-value customers.
- Might be B2B customers or customers who buy in bulk.
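The shape described above can also be quantified rather than eyeballed: skewness near 0 and excess kurtosis near 0 are consistent with a near-normal curve. A sketch on simulated spend data (the real customer totals aren't reproduced here, so the numbers are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Simulated customer totals centered near $750, a stand-in for the real histogram
spend = pd.Series(rng.normal(loc=750, scale=200, size=1000))

print(f"mean:     {spend.mean():.0f}")
print(f"skewness: {spend.skew():.2f}")   # ~0 for a symmetric distribution
print(f"kurtosis: {spend.kurt():.2f}")   # excess kurtosis; ~0 for a normal curve
```

Replacing `spend` with the real `total_revenue['total_revenue']` column would tell you how far the actual distribution deviates from normal.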
📅 Calculate recency for each customer
Recency is the number of days since the last purchase made by the customer.
recency = df.groupby('customer_number')['order_date'].max().reset_index()
recency['recency'] = (df['order_date'].max() - recency['order_date']).dt.days
recency = recency.drop('order_date', axis=1)
# Heatmap of Recency by Revenue Segment
recency = pd.merge(recency, total_revenue, on='customer_number')
recency['revenue_segment'] = pd.qcut(recency['total_revenue'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])
plt.figure(figsize=(8, 4))
sns.heatmap(recency.pivot_table(index='revenue_segment', columns='recency', aggfunc='size', fill_value=0), cmap='Blues')
plt.title('Heatmap of Recency by Revenue Segment')
plt.xlabel('Days Since Last Order')
plt.ylabel('Revenue Segment')
plt.show()
Observation:
- There are three things to note in this heatmap: the revenue segment, divided into 4 categories (Y-axis); the days since the last order (X-axis); and the number of users falling in each revenue segment. The darker the shade, the more users.
- It is clear that many users keep purchasing on a regular basis, which can be seen in the bottom-left corner of the graph.
- As we move to the right, the days since the last order increase, and we can observe fewer users (lighter shades) in the Low revenue segment.
Question to stakeholders:
What actions are needed to ensure that customers with low recency and revenue generation are retained?
🔄 Customer Retention Analysis
Customer retention is the ability of a company to retain its customers over some specified period. This reflects customer loyalty and satisfaction.
from operator import attrgetter
# Cohort Analysis
df['order_date'] = pd.to_datetime(df['order_date'])
df['order_month'] = df['order_date'].dt.to_period('M')
df['cohort'] = df.groupby('customer_number')['order_date'].transform('min').dt.to_period('M')
cohort_data = df.groupby(['cohort', 'order_month']).agg({
'customer_number': 'nunique',
'order_number': 'count',
'revenue': 'sum'
}).reset_index()
cohort_data['cohort_index'] = (cohort_data['order_month'] - cohort_data['cohort']).apply(attrgetter('n'))
# Pivot table to visualize cohort retention
cohort_pivot = cohort_data.pivot_table(index='cohort', columns='cohort_index', values='customer_number')
cohort_size = cohort_pivot.iloc[:, 0]
retention_matrix = cohort_pivot.divide(cohort_size, axis=0)
import seaborn as sns
plt.figure(figsize=(8, 4))
sns.heatmap(retention_matrix, annot=True, fmt='.0%', cmap='viridis')
plt.title('Cohort Analysis - Retention Rates')
plt.ylabel('Cohort Month')
plt.xlabel('Months Since First Purchase')
plt.show()
If this is the first time you are seeing a graph like this, here are some notes to make sense of it:
- Here too, there are three things to consider: months since first purchase on the X-axis, when the cohort starts on the Y-axis, and the retention percentage shown by the shading.
- Each horizontal block (we have 3 here) is a Cohort.
- The first month of every cohort always has 100% retention since it is the starting point for our analysis.
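If the pivot above is still abstract, here is the same computation on a three-customer toy dataset (hypothetical dates; same column names and logic as the code above):

```python
import pandas as pd
from operator import attrgetter

# Toy orders: c1 and c2 first buy in Jan, c1 returns in Feb; c3 first buys in Feb
orders = pd.DataFrame({
    'customer_number': ['c1', 'c2', 'c1', 'c3'],
    'order_date': pd.to_datetime(['2024-01-05', '2024-01-20',
                                  '2024-02-10', '2024-02-15']),
})

orders['order_month'] = orders['order_date'].dt.to_period('M')
orders['cohort'] = (orders.groupby('customer_number')['order_date']
                    .transform('min').dt.to_period('M'))
# Months elapsed since the customer's first purchase
orders['cohort_index'] = (orders['order_month'] - orders['cohort']).apply(attrgetter('n'))

pivot = orders.pivot_table(index='cohort', columns='cohort_index',
                           values='customer_number', aggfunc='nunique')
retention = pivot.divide(pivot.iloc[:, 0], axis=0)
print(retention)
# Jan cohort: 100% in month 0, 50% in month 1 (only c1 came back)
# Feb cohort: 100% in month 0
```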
Observations:
- High Retention in Initial Months: Retention rates remain strong during the first few months across cohorts.
- Drop in Retention: There's a noticeable drop to 50% retention in the fourth month for the March 2024 cohort.
Question to stakeholders:
What might be the reason for the significant drop in retention rate?
This question might be better answered with more business context.
🧑‍💻 Creating a new dataset with engineered features for customer analysis
We engineered many new features from existing ones. Besides visualizing them, we can also use them to obtain very useful insights in tasks like customer segmentation. So let's merge these new, more useful features into a new dataset for our next step.
avg_order_value = df.groupby('customer_number')['revenue'].mean().reset_index()
avg_order_value.columns = ['customer_number', 'avg_order_value']
total_quantity = df.groupby('customer_number')['quantity'].sum().reset_index()
total_quantity.columns = ['customer_number', 'total_quantity']
# Merge all features into a single DataFrame
customer_data = total_revenue.merge(avg_order_value, on='customer_number')
customer_data = customer_data.merge(total_quantity, on='customer_number')
customer_data = customer_data.merge(recency, on='customer_number')
# 'recency' also carries a total_revenue column from the earlier merge, so
# pandas suffixes the clashing columns as _x/_y; keep one copy and drop both
customer_data['total_revenue'] = customer_data['total_revenue_x']
customer_data = customer_data.drop(['total_revenue_x', 'total_revenue_y'], axis=1)
# Display the aggregated data
print(customer_data.shape)
print(customer_data.head())
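A slightly cleaner variant (a sketch, with small made-up stand-ins for the frames built above) drops the duplicate total_revenue column from recency before merging, so pandas never has to invent _x/_y suffixes:

```python
import pandas as pd

# Minimal stand-ins (illustrative values) for the frames built earlier
rev = pd.DataFrame({'customer_number': [1, 2], 'total_revenue': [500.0, 900.0]})
aov = pd.DataFrame({'customer_number': [1, 2], 'avg_order_value': [50.0, 90.0]})
qty = pd.DataFrame({'customer_number': [1, 2], 'total_quantity': [10, 12]})
rec = pd.DataFrame({'customer_number': [1, 2],
                    'recency': [3, 40],
                    'total_revenue': [500.0, 900.0],   # duplicate column
                    'revenue_segment': ['Low', 'High']})

# Drop the duplicate column before merging so no _x/_y suffixes appear
merged = (rev
          .merge(aov, on='customer_number')
          .merge(qty, on='customer_number')
          .merge(rec.drop(columns='total_revenue'), on='customer_number'))
print(merged.columns.tolist())
```

Either route ends with one clean total_revenue column; this one just avoids the cleanup step.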
We will look at how to cluster users on this data and come up with marketing strategies tailored to each cluster in next week's article.
If you enjoyed this post, please consider
- holding the clap button for a few seconds (it goes up to 50) and
- following me for more updates.
It gives me the motivation to keep going and helps the story reach more people like you. I share stories every week about machine learning concepts and tutorials on interesting projects. See you next week. Happy learning!
Published via Towards AI