Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Read by thought-leaders and decision-makers around the world. Phone Number: +1-650-246-9381 Email: [email protected]
228 Park Avenue South New York, NY 10003 United States
Website: Publisher: https://towardsai.net/#publisher Diversity Policy: https://towardsai.net/about Ethics Policy: https://towardsai.net/about Masthead: https://towardsai.net/about
Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Founders: Roberto Iriondo, , Job Title: Co-founder and Advisor Works for: Towards AI, Inc. Follow Roberto: X, LinkedIn, GitHub, Google Scholar, Towards AI Profile, Medium, ML@CMU, FreeCodeCamp, Crunchbase, Bloomberg, Roberto Iriondo, Generative AI Lab, Generative AI Lab Denis Piffaretti, Job Title: Co-founder Works for: Towards AI, Inc. Louie Peters, Job Title: Co-founder Works for: Towards AI, Inc. Louis-François Bouchard, Job Title: Co-founder Works for: Towards AI, Inc. Cover:
Towards AI Cover
Logo:
Towards AI Logo
Areas Served: Worldwide Alternate Name: Towards AI, Inc. Alternate Name: Towards AI Co. Alternate Name: towards ai Alternate Name: towardsai Alternate Name: towards.ai Alternate Name: tai Alternate Name: toward ai Alternate Name: toward.ai Alternate Name: Towards AI, Inc. Alternate Name: towardsai.net Alternate Name: pub.towardsai.net
5 stars – based on 497 reviews

Frequently Used, Contextual References

TODO: Remember to copy unique IDs whenever it needs used. i.e., URL: 304b2e42315e

Resources

Take our 85+ lesson From Beginner to Advanced LLM Developer Certification: From choosing a project to deploying a working product this is the most comprehensive and practical LLM course out there!

Publication

Explainable Clustering — The Introduction of Recursive Embedding and Clustering and its Application
Data Visualization   Latest   Machine Learning

Explainable Clustering — The Introduction of Recursive Embedding and Clustering and its Application

Last Updated on January 3, 2025 by Editorial Team

Author(s): Yuki Shizuya

Originally published on Towards AI.

The explainable clustering method overview

Recently, I read an interesting article from Spotify titled “Recursive Embedding and Clustering.” According to the blog, Spotify developed an explainable clustering method and utilizes this technique to research user clustering. Although this method is helpful for all other companies, there are no implementation codes. Thus, in this blog, I will explain this technique step by step with Python implementation, and I hope it will help someone who wants to treat clustering in an explainable way.

Table of Contents

  1. Introduction of Recursive embedding and clustering method
  2. Real-world data Application : E-commerce — Users of a French C2C fashion store dataset

1. Introduction of Recursive embedding and clustering method

Spotify has around 600 million users worldwide. A deeper understanding of users is essential to provide a better user experience, but because of its popularity, it is challenging. Moreover, although user segmenting by clustering is one of the effective methods, also because of its popularity, it is difficult to explain or examine why the cluster exists. Therefore, Spotify developed the explainable clustering method to find significant clusters and explain why those clusters exist. The overall architecture is shown below.

The over all architecture for explainable clustering method

It mainly comprises two stages.

  1. Find the low-dimensional representation from the original data using UMAP and cluster it using HDBSCAN. Examine the clusters’ characteristics.
  2. Train XGBoost using the raw data as input and labels that HDBSCAN classified as output and understand the feature contribution using SHAP.

Let’s dive into it step by step.

For the first stage, we utilize UMAP and HDBSCAN to construct the understandable clusterings. Let’s quickly recap about UMAP and HDBSCAN.

1.1 UMAP

UMAP image adapted from the original paper [2]

Uniform Manifold Approximation and Projection (UMAP) is one of the non-linear dimension reduction algorithms. UMAP learns the manifold structure in the input data’s high-dimensional space and finds the low-dimensional representation, preserving the manifold structure as much as possible. It is classified in the same category as T-SNE. One of the advantages of UMAP is that it is more robust in preserving the high-dimensional structure than T-SNE because of the loss function (For your information, T-SNE uses KL divergence, and UMAP uses Cross-entropy loss). You can see more details in this blog [3].

Thus, we utilize UMAP as a dimension reduction algorithm to reduce the dimension of input data to 2 or 3 dimensions.

1.2 HDBSCAN

HDBSCAN clustering result adapted from the official documentation

Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is a density-based clustering method. It is fast and robust and finds meaningful clusters if there are any. The algorithm overview from the official guide is below.

  1. Transform the space according to the density/sparsity.
  2. Build the minimum spanning tree of the distance-weighted graph.
  3. Construct a cluster hierarchy of connected components.
  4. Condense the cluster hierarchy based on minimum cluster size.
  5. Extract the stable clusters from the condensed tree.

Therefore, we utilize HDBSCAN to find dense clusters based on the UMAP representation. We can assign labels using HDBSCAN and check the features of clusters by examining the raw data.

In summary, we conduct two processes in the first stage.

  1. Condense the high-dimensional information of input data to low-dimensional representation using UMAP.
  2. Find clusters using low-dimensional data by UMAP.

After processing this procedure, we can visualize the clusters and check the details using the raw data.

1.3 XGBoost + SHAP

XGBoost is a popular gradient-boosting tree algorithm. It solves many data science problems quickly and accurately. On the other hand, SHAP is one of the explainable AI methods for any machine learning algorithm. We can check the contribution of each feature using SHAP. Since we don’t want to proceed to the first stage many times, which is time-consuming, we train the XGBoost model using the high-dimensional data as input and HDBSCAN labels as output in the second stage. Note that we use XGBoost as a one-versus-all model for each cluster to allow very fast, accurate training and a precise understanding of the feature’s importance. The illustration of this process is as follows:

The illustration of the second stage process

We train multiple XGBoost classifiers for each label and can get the detailed importance of the feature using SHAP.

So far, we have gone through the algorithmic perspective of this method. In the next section, let’s see the power of this method using real-world data!

2. Real-world data Application: E-commerce-Users of a French C2C fashion store dataset

In the last section, I will guide you through applying recursive embedding and clustering methods using the TikTok user profile dataset [4]. As its name suggests, the TikTok user profile dataset contains the user information, and each row represents one user’s information. Let’s implement it step by step.

As the first step, you need to set an environment. I used ubuntu20.04 and Python3.10 environment. Firstly, I create the virtual environment using conda.

conda create --name rclustering python==3.10 -y
conda activate sam
conda install pip
## optional: To avoid install libraries on the local environment,
## check the which pip will be used to store libraries
which pip
# I use /opt/conda/envs/rclustering/bin/pip in my enviornment.

Next, you need to install the following libraries.

conda install -c conda-forge umap-learn=0.5.6 -y
conda install -c plotly plotly=5.23.0 -y
conda install -c conda-forge pandas=2.2.2 -y
conda install -c conda-forge nbformat -y
conda install -c conda-forge py-xgboost -y
conda install -c conda-forge shap -y
conda install -c conda-forge matplotlib

We’ve finished to set an environment, so let’s see the TikTok user profile dataset. I will attach the whole code that I used later.

The TikTok user profile dataset contains 1,000 rows and 19 columns. The first five rows are shown below.

df = pd.read_csv('../TikTok profiles dataset (Public web data).csv')
df.head()
The first five row data

The dataset contains the following columns besides the unrelated columns in this setting, such as timestamp, biolink, create_time, id, top_videos, URL, and profile_pic_url.

  • account_id : Unique identifier for each account.
  • nickname : account name.
  • biography : account’s biography.
  • awg_engagement_rate : average engagement rate.
  • comment_engagement_rate
  • like_engagement_rate
  • is_verified : the verified status on TikTok.
  • followers : the number of followers.
  • following : the number of following creators.
  • likes : the number of likes the user have obtained.
  • videos_count : the number of videos.

The dataset distribution is shown below.

fig = make_subplots(rows=3, cols=2)
viz_columns = ['awg_engagement_rate', 'comment_engagement_rate', 'like_engagement_rate', 'followers', 'following', 'videos_count']

for i in range(len(viz_columns)):
x = i % 2 + 1
y = i // 2 + 1

col = viz_columns[i]
values = df[col]

fig.add_trace(go.Histogram(x=values, name=viz_columns[i]), row=y, col=x)
fig.show()
The distribution of TikTok user profile dataset

As you can see, the distribution of all features tends to be right-skewed, and each feature’s scale varies.

For the next step, we need to preprocess data. In this case, we can drop account_id, nickname, and biography for simplicity. If you want to use nicknames and biographies, such as text information, we need to utilize the language model to encode this information. Moreover, we have one categorical variable, is_verified, so we must encode it using LabelEncoder. Before proceeding with UMAP, we need to rescale input variables using StandardScaler.

Now, let’s using UMAP. UMAP has some hyperparameters that we need to adjust. “metric” and “neighbor_num” tend to make a bigger difference in the result. For your information, “metric” measures the distance between each point, and “neighbor_num” balances local and global structures. In this blog, I change the metric and neighbor_num to find distinguishable low-dimensional representations. The following graphs show the results of UMAP.

def apply_umap(features: np.array, random_state: int = 7, metric: str = 'euclidean', neighbor_num: int = 15):
reducer = umap.UMAP(random_state=random_state, n_neighbors=neighbor_num, metric=metric)
embedding = reducer.fit_transform(features)

return reducer, embedding

def visualize_sample(df: pd.DataFrame, embedding: np.array, metric: str):
df['umap_feat1'] = embedding[:, 0].reshape(-1, 1)
df['umap_feat2'] = embedding[:, 1].reshape(-1, 1)

fig = px.scatter(sample_df, x='umap_feat1', y='umap_feat2', hover_data=scaled_columns, title=f'{metric}')
fig.show()

metric = 'euclidean'
_, embedding = apply_umap(scaled_features, neighbor_num=10, metric=metric)
visualize_sample(sample_df, embedding, metric)
The UMAP result for each metric

The case using Euclidean space as a metric shows so-so distinguishability, so I chose Euclidean one for the HDBSCAN input.

The next step is HDBSCAN part. HDBSCAN also has hyperparameters, such as the minimum number of data assigned to one cluster. Note that you need to adjust hyperparameters to be compatible with your data. The HDBCAN clustering result is shown below.

# HDBSCAN
hdb = HDBSCAN(min_cluster_size=10)
hdb.fit(embedding)
labels = hdb.labels_

sample_df['HDBSCAN_label'] = labels.reshape(-1, 1)
# Change the type of label for better visualization in plotly
sample_df['HDBSCAN_label'] = sample_df['HDBSCAN_label'].astype(str)

fig = px.scatter(sample_df, x='umap_feat1', y='umap_feat2', color='HDBSCAN_label', hover_data=scaled_columns)
fig.show()
(Left) HDBSCAN result for all data, (Right) HDBSCAN result for data except for unassigned data

The left chart shows the HDBSCAN result using all data, and the right one shows the result except for data not assigned to any clusters. HDBSCAN found 23 clusters in this data. To examine the clusters thoroughly, I found each cluster has the following features. I will show you some of the main clusters below.

  • Cluster 0: verified influencer cluster.
  • Cluster 1: non-verified and many followers with low engagement rate cluster.
  • Cluster 4: Normal number of followers, following, and more engagement level clusters.
  • Cluster 13: the number of following is larger than the number of followers.

To find the feature of the cluster, I visualize the distribution in the target cluster. For example, we can do it as follows:

# cluster 1
cls1_df = sample_df[sample_df['HDBSCAN_label'] == '1']

fig = make_subplots(rows=3, cols=2)
viz_columns = ['awg_engagement_rate', 'comment_engagement_rate', 'like_engagement_rate', 'followers', 'following', 'videos_count']

for i in range(len(viz_columns)):
x = i % 2 + 1
y = i // 2 + 1

col = viz_columns[i]
values = cls1_df[col]

fig.add_trace(go.Histogram(x=values, name=viz_columns[i]), row=y, col=x)
fig.show()
The visualization result for the distribution of the cluster 1

The above example shows cluster 1’s feature distribution. As you can see, this cluster has many followers and a low rate for three engagement features.

Now, let’s train XGBoost for each cluster to verify that my assumption is correct. In this blog, I will skip training the small-sized clusters because they tend to be difficult to train and produce less business value than other bigger clusters. The following figures show the results for the bigger clusters.

# To cut off the smaller size of clusters
# But if you want to pass specified cluster, please enter the label into ok_labels list
def label_check(df_length: int, label: str, minimum_num: int = 40, ok_labels: list = []):
if label in ok_labels:
return True

if label == '-1':
return False

if df_length < minimum_num:
return False
else:
return True


minimum_num = 30 # the minumum number of the cluster size
ok_labels = ['9']
labels = sample_df['HDBSCAN_label'].unique().tolist()

os.makedirs('models', exist_ok=True)

for label in labels:
df_length = len(sample_df[sample_df['HDBSCAN_label'] == label])

if not label_check(df_length, label, ok_labels=ok_labels):
print(f'skip Label {label}')
continue

print(f'processing labels {label}')
sample_df['label'] = 0
sample_df.loc[sample_df['HDBSCAN_label'] == label, 'label'] = 1

X_train, X_test, y_train, y_test = train_test_split(sample_df[scaled_columns].values,
sample_df['label'],
stratify=sample_df['label'],
test_size=0.05)
print(np.unique(y_train, return_counts=True)[-1])
train_df = pd.DataFrame(X_train, columns=scaled_columns)
test_df = pd.DataFrame(X_test, columns=scaled_columns)

classes_weights = class_weight.compute_sample_weight(
class_weight='balanced',
y=y_train
)

tree = XGBClassifier(n_estimators=2, max_depth=2, objective='binary:logistic', sample_weight=classes_weights)
tree.fit(train_df, y_train)

preds = tree.predict(X_test)

print(f"\nLabel{label}: Classification Report:")
print(classification_report(y_test, preds))

xgboost.plot_importance(tree)
pl.title(f"Label{label}: xgboost.plot_importance(model)")
pl.show()

explainer = shap.Explainer(tree, train_df)
shap_values = explainer(train_df)

shap.plots.bar(shap_values)

tree.save_model(f'models/xgboost-{label}.model')
The XGBoost + SHAP result for cluster 1 and 4
The XGBoost + SHAP result for cluster 9 and 13

According to the result, XGBoost tends to focus on four features: “videos_count,” “following,” “followers,” and “awg_engagement_rate.” Furthermore, the explanation from SHAP is similar to my assumption. Therefore, you can gain a better understanding of the cluster in the first stage via XGBoost + SHAP. Note that it might be overfitted if your data size is small, so you need to add some regularization methods to prevent it.

Once you complete training XGBoost, you can use it when you get new data. Isn’t it very convenient? Now, you get the powerful tool to explain to your stakeholders why the cluster exists easily.

In this blog, I introduced the explainable clustering method developed by Spotify. It is simple but powerful, and I hope this blog helps you get good user segmentation or other things. The code I used is attached below.

References

[1] Pereira, G., Recursive Embedding and Clustering, Spotify R&D Engineering

[2] McInnes, L., Healy, J., Melville, J., UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, Arxiv

[3] Oskolkov, N., Why UMAP is Superior over T-SNE, Towards Data Science

[4] Kumar, M., TikTok User Profiles Dataset, kaggle

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Feedback ↓