
Customer Segmentation and Time Series Forecasting Based on Sales Data #1/3

Last Updated on September 27, 2024 by Editorial Team

Author(s): Naveen Malla

Originally published on Towards AI.

Hey, first things first. This blog is the first of a 3-part series, with each part focusing on a different aspect:

  1. Exploratory Data Analysis (EDA)
  2. Customer segmentation
  3. SKU forecasting (with bonus code)

Why Should You Care About This Series

This work got me a machine learning internship. I presented my notebooks to the whole team, including both technical and non-technical stakeholders, and they loved it. I put a lot of hours into this project to make it as informative and easy to understand as possible, and I was really glad it turned out to be so engaging for the team, so I wanted to share it here. If you’re curious about working on machine learning projects to gain knowledge, or looking for ideas on how to present projects on this topic in your upcoming interviews, this series is for you. Let’s dive in.

🗂️ The Dataset

This is a typical sales dataset for healthcare products. It contains sales data for 7 months of the year, starting from January. It also contains noise, so some results may look more evenly distributed than they would in real data.

But this isn’t going to be about the data itself; rather, it’s about what you can do with it.

The features in the dataset are kinda self-explanatory, so I am not gonna waste your time describing each one. You’ll see what they mean as we proceed.

  • order_number
  • order_date
  • customer_number
  • type
  • month
  • item_number (SKU)
  • quantity
  • category
  • revenue
  • customer_source
  • order_source

📊 Let’s Do Some EDA!!!

I will skip some parts of the code, like data loading and processing, to keep things engaging. I will attach a link to the whole notebook at the end of the article, though.
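If you want to follow along anyway, a minimal loading step might look like the sketch below. The file name is a placeholder for whatever your data is called; everything that follows also assumes these imports and that order_date has been parsed as a datetime:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter

# Hypothetical file name -- replace with your own dataset
df = pd.read_csv('sales_data.csv')

# Parse order_date so the .dt accessors used below work
df['order_date'] = pd.to_datetime(df['order_date'])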

print(df.nunique())
Unique value counts for each feature (all images by author)

Observation:

There are a total of

  • 1000 unique customers
  • 1000 unique items
  • 2 categories

Monthly Revenue

df['month'] = df['order_date'].dt.month
month_abbr = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'}
df['month_abbr'] = df['month'].map(month_abbr)

# Aggregate monthly revenue
monthly_revenue = df.groupby('month_abbr')['revenue'].sum().reset_index()

# Define a categorical type for the month_abbr to ensure proper sorting
monthly_revenue['month_abbr'] = pd.Categorical(
    monthly_revenue['month_abbr'],
    categories=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    ordered=True
)

# Sort by the 'month_abbr' column
monthly_revenue = monthly_revenue.sort_values(by='month_abbr')

# Plot monthly revenue
plt.figure(figsize=(8, 4))
sns.barplot(x='month_abbr', y='revenue', data=monthly_revenue, palette='viridis')
plt.xlabel('Month')
plt.ylabel('Revenue')
plt.title('Monthly Revenue')

# Format y-label as shortened form
ax = plt.gca()
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: f'{int(x/1000)}k'))

plt.xticks(rotation=45)
plt.show()

Observation:

  • The revenue seems to be increasing over the months.
  • Larger jumps in revenue can be seen from May onwards compared to the previous months (quantified in the quick check below).
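To put a number on those jumps, one quick check (not part of the original notebook) is to compute month-over-month growth on the aggregated revenue:

# Month-over-month revenue growth, in percent
monthly_revenue['mom_growth'] = monthly_revenue['revenue'].pct_change() * 100
print(monthly_revenue[['month_abbr', 'revenue', 'mom_growth']].round(1))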

🔧 Feature Engineering to create more meaningful features

Feature Engineering is the process of creating new features from the existing features in the dataset.

  • This helps us analyse the data better as the new features are more meaningful.
  • Example: top 10 products, average revenue per category, etc.

# Top 10 performing products by revenue
top_10_products = df.groupby('item_number')['revenue'].sum().nlargest(10).reset_index()

# Average revenue by category
avg_revenue_category = df.groupby('category')['revenue'].mean().reset_index()

# Average revenue by type
avg_revenue_type = df.groupby('type')['revenue'].mean().reset_index()

# Average revenue by customer source
avg_revenue_customer_source = df.groupby('customer_source')['revenue'].mean().reset_index()

# Average revenue by order source
avg_revenue_order_source = df.groupby('order_source')['revenue'].mean().reset_index()

📈 Visualizing the engineered features

Top 10 Products by Revenue

# Bar chart for Top 10 performing products
plt.figure(figsize=(8, 4))
sns.barplot(x='item_number', y='revenue', data=top_10_products, palette='viridis')
plt.title('Top 10 Performing Products by Revenue')
plt.xlabel('Item Number')
plt.ylabel('Total Revenue')
plt.xticks(rotation=45)
plt.show()

Observation:

  • The top performing products seem to be generating a similar amount of revenue.

Now, let’s see what percentage of total revenue is contributed by the top 10 products.


top_10_revenue = top_10_products['revenue'].sum()
total_revenue = df['revenue'].sum()

print(f'Top 10 products contribute {top_10_revenue/total_revenue:.2%} to the total revenue')

1.45% of total revenue isn’t a significant amount, indicating that these products do not dominate the overall sales.

Average Revenue per Categorical Variable

fig, axes = plt.subplots(1, 2, figsize=(8, 4))

# Pie chart for Average Revenue by Customer Source
axes[0].pie(avg_revenue_customer_source['revenue'], labels=avg_revenue_customer_source['customer_source'], autopct='%1.1f%%', colors=sns.color_palette('viridis', 2))
axes[0].set_title('Average Revenue by Customer Source')

# Donut chart for Average Revenue by Order Source
axes[1].pie(avg_revenue_order_source['revenue'], labels=avg_revenue_order_source['order_source'], autopct='%1.1f%%', colors=sns.color_palette('viridis', 5))
axes[1].set_title('Average Revenue by Order Source')
# Add a white centre circle to turn the second pie into a donut
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
axes[1].add_artist(centre_circle)

plt.tight_layout()
plt.show()

fig, axes = plt.subplots(1, 2, figsize=(8, 4))

# Pie chart for Average Revenue by Category
axes[0].pie(avg_revenue_category['revenue'], labels=avg_revenue_category['category'], autopct='%1.1f%%', colors=sns.color_palette('viridis', 2))
axes[0].set_title('Average Revenue by Category')

# Pie chart for Average Revenue by Type
axes[1].pie(avg_revenue_type['revenue'], labels=avg_revenue_type['type'], autopct='%1.1f%%', colors=sns.color_palette('viridis', 2))
axes[1].set_title('Average Revenue by Type')

plt.tight_layout()
plt.show()

Observation:

The distribution of revenue seems to be spread almost equally among categories, types, customer sources, and order sources.

Count of products sold by category

category_counts = df.groupby('category')['quantity'].sum()

# Plot the count of orders by category
plt.figure(figsize=(8, 4))
sns.barplot(x=category_counts.index, y=category_counts, palette='viridis')
plt.title('Total Products Sold by Category')
plt.xlabel('Category')
plt.ylabel('Total Products Sold')
plt.xticks(rotation=45)
plt.show()

Observation:

  • Despite DIABETES and HYPERTENSIVES products generating similar revenue, more HYPERTENSIVES units are sold than DIABETES units.
  • This suggests that the DIABETES products being purchased are more expensive than the HYPERTENSIVES products (sanity-checked below).

total_products_sold = df.groupby('category')['quantity'].sum()
print(total_products_sold)

  • HYPERTENSIVES products indeed seem to be more popular among customers, leading by around 50,000 units sold over the last 7 months.
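One way to sanity-check the “DIABETES products are more expensive” hypothesis is to divide total revenue by total units sold per category. A quick check, not part of the original notebook:

# Average price per unit sold, by category (revenue / quantity)
unit_price = df.groupby('category')[['revenue', 'quantity']].sum()
unit_price['avg_unit_price'] = unit_price['revenue'] / unit_price['quantity']
print(unit_price['avg_unit_price'])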

Revenue Distribution

# Calculate total revenue for each customer
total_revenue = df.groupby('customer_number')['revenue'].sum().reset_index()
total_revenue.columns = ['customer_number', 'total_revenue']

# Distribution Plot
plt.figure(figsize=(8, 4))
sns.histplot(total_revenue['total_revenue'], kde=True, color='skyblue')
plt.title('Distribution of Total Revenue by Customer')
plt.xlabel('Total Revenue')
plt.ylabel('Frequency')
plt.show()

Observations:

🔔 Normal Distribution

  • Normal distribution is a way to show how data is spread out, where most values are close to the middle (the average). It looks symmetric and bell-shaped.
  • The above graph is a near-normal distribution.
  • In our case, the average spending is around the $700 to $800 mark, which you can identify at the middle of the curve.

💰 Revenue Spread

  • Revenue spread indicates how diverse customer spending is. We look at the lowest (leftmost) and the highest (rightmost) figures to understand this metric.
  • There is a significant spread, ranging from $200 to $1,400.
  • This indicates a diverse range of customer spending.

💎 High-Value Customers/Outliers

  • The higher-end tail represents high-value customers.
  • These might be B2B customers or customers who buy in bulk (see the summary statistics below).
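To back these observations with numbers, a describe() on the customer totals gives the mean (the peak of the curve), the min/max (the spread), and the upper quantiles (where the high-value tail begins). Again, a quick check rather than part of the original notebook:

# Summary statistics for total revenue per customer
print(total_revenue['total_revenue'].describe())

# The 99th percentile marks roughly where the high-value tail begins
print(total_revenue['total_revenue'].quantile(0.99))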

📅 Calculate recency for each customer

Recency is the number of days since the last purchase made by the customer.

# Most recent order date per customer
recency = df.groupby('customer_number')['order_date'].max().reset_index()
# Days since that order, measured against the latest date in the dataset
recency['recency'] = (df['order_date'].max() - recency['order_date']).dt.days
recency = recency.drop('order_date', axis=1)


# Heatmap of Recency by Revenue Segment
recency = pd.merge(recency, total_revenue, on='customer_number')
recency['revenue_segment'] = pd.qcut(recency['total_revenue'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])

plt.figure(figsize=(8, 4))
sns.heatmap(recency.pivot_table(index='revenue_segment', columns='recency', aggfunc='size', fill_value=0), cmap='Blues')
plt.title('Heatmap of Recency by Revenue Segment')
plt.xlabel('Days Since Last Order')
plt.ylabel('Revenue Segment')
plt.show()

Observation:

  • There are three things to note in this heatmap: the revenue segment, divided into 4 categories (y-axis); the days since the last order (x-axis); and the number of users falling in each revenue segment. The darker the shade, the more users.
  • It is clear that many users keep purchasing on a regular basis, which can be seen in the bottom-left corner of the graph.
  • As we move towards the right, the days since the last order increase, and we can observe fewer users (lighter shades) in the Low revenue segment.

Question to stakeholders:

What actions are needed to ensure that customers with low recency and revenue generation are retained?

🔄 Customer Retention Analysis

Customer retention is the ability of a company to retain its customers over some specified period. This reflects customer loyalty and satisfaction.

from operator import attrgetter
# Cohort Analysis
df['order_date'] = pd.to_datetime(df['order_date'])
df['order_month'] = df['order_date'].dt.to_period('M')
df['cohort'] = df.groupby('customer_number')['order_date'].transform('min').dt.to_period('M')

cohort_data = df.groupby(['cohort', 'order_month']).agg({
    'customer_number': 'nunique',
    'order_number': 'count',
    'revenue': 'sum'
}).reset_index()

# Number of whole months between the cohort's first purchase and each order month
cohort_data['cohort_index'] = (cohort_data['order_month'] - cohort_data['cohort']).apply(attrgetter('n'))

# Pivot table to visualize cohort retention
cohort_pivot = cohort_data.pivot_table(index='cohort', columns='cohort_index', values='customer_number')
cohort_size = cohort_pivot.iloc[:, 0]
retention_matrix = cohort_pivot.divide(cohort_size, axis=0)

plt.figure(figsize=(8, 4))
sns.heatmap(retention_matrix, annot=True, fmt='.0%', cmap='viridis')
plt.title('Cohort Analysis - Retention Rates')
plt.ylabel('Cohort Month')
plt.xlabel('Months Since First Purchase')
plt.show()

If this is the first time you are seeing a graph like this, here are some notes to help make sense of it:

  • Here too, there are three things to consider: the months since first purchase on the x-axis, the cohort start month on the y-axis, and the retention percentage indicated by the shading.
  • Each horizontal block (we have 3 here) is a cohort.
  • The first month of every cohort always has 100% retention, since it is the starting point of our analysis.

Observations:

  • High Retention in Initial Months: Retention rates remain strong during the first few months across cohorts.
  • Drop in Retention: There’s a noticeable drop to 50% retention in the fourth month for the March 2024 cohort (you can pull this cell programmatically, as sketched below).
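If you want to read that cell programmatically rather than off the heatmap, the matrix is indexed by cohort period and integer month offset. A hypothetical lookup (whether “fourth month” maps to offset 3 or 4 depends on how you count, so adjust accordingly):

# Retention of the March 2024 cohort, 4 months after first purchase
print(retention_matrix.loc[pd.Period('2024-03', freq='M'), 4])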

Question to stakeholders:

What might be the reason for the significant drop in retention rate?

This question might be better answered with more business context.

🧑‍💻 Creating new dataset with engineered features for customer analysis

We engineered many new features from existing ones. Aside from visualizing them, we can also use them to obtain very useful insights in tasks like customer segmentation. So let’s merge these new, more useful features into a new dataset for our next step.


avg_order_value = df.groupby('customer_number')['revenue'].mean().reset_index()
avg_order_value.columns = ['customer_number', 'avg_order_value']
total_quantity = df.groupby('customer_number')['quantity'].sum().reset_index()
total_quantity.columns = ['customer_number', 'total_quantity']


# Merge all features into a single DataFrame
customer_data = total_revenue.merge(avg_order_value, on='customer_number')
customer_data = customer_data.merge(total_quantity, on='customer_number')
customer_data = customer_data.merge(recency, on='customer_number')
# The recency frame already carried a total_revenue column, so the merge
# created _x/_y suffixes; keep one copy and drop the duplicates
customer_data['total_revenue'] = customer_data['total_revenue_x']
customer_data = customer_data.drop(['total_revenue_x', 'total_revenue_y'], axis=1)

# Display the aggregated data
print(customer_data.shape)
print(customer_data.head())
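As a design note, the chained merges above work, but they produce the _x/_y suffix collision we had to clean up by hand. A single groupby().agg() pass builds an equivalent table (minus the revenue_segment column that came along from the recency frame) without any merging; a sketch under the same column names:

# Build the customer-level table in one aggregation pass
snapshot_date = df['order_date'].max()
customer_data = df.groupby('customer_number').agg(
    total_revenue=('revenue', 'sum'),
    avg_order_value=('revenue', 'mean'),
    total_quantity=('quantity', 'sum'),
    last_order=('order_date', 'max'),
).reset_index()

# Recency = days between the dataset's last date and each customer's last order
customer_data['recency'] = (snapshot_date - customer_data['last_order']).dt.days
customer_data = customer_data.drop(columns='last_order')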

We will look at how to perform clustering of users on this data, and come up with marketing strategies tailored to each cluster, in next week’s article.

If you enjoyed this post, please consider

  • holding the clap button for a few seconds (it goes up to 50) and
  • following me for more updates.

It gives me the motivation to keep going and helps the story reach more people like you. I share stories every week about machine learning concepts and tutorials on interesting projects. See you next week. Happy learning!

LinkedIn, Medium, GitHub


Published via Towards AI
