From Clusters to Customers: Supercharging Segmentation with Generative AI

Last Updated on February 3, 2026 by Editorial Team

Author(s): Abhijeet Sahoo

Originally published on Towards AI.

The “Consultant’s Confession”

If you’ve spent any time in pharma consulting, you know the drill. We live in a world of high-stakes “Patient Journeys” and “HCP Target Lists.” I’ve spent more hours than I care to admit staring at spreadsheets of patient data, trying to find that “Aha!” moment. (The article uses pharma examples throughout, but this method easily adapts to any consulting domain.)

In the “old days” — about two years ago — a segmentation project meant running a K-Means clustering model and surfacing four or five neat little groups. But let’s be honest: nothing kills a “Data-Driven Culture” faster than a slide labeled Cluster 0 through Cluster 4.

You’ve spent weeks cleaning data, tuning hyperparameters, and debating “Elbow vs. Silhouette” curves. You proudly present your findings, only for the Head of Marketing to ask the one question that makes your heart sink: “Great, but… what do I actually say to the people in Cluster 2?”

Suddenly, your precise mathematical groups feel like abstract art — pretty to look at, but impossible to use. We’ve been stuck in this gap between Math and Meaning for years.

But what if the “missing link” isn’t more data, but a better translator? Imagine replacing the overworked analyst (currently vibrating on overpriced espresso and billable-hour anxiety) with an LLM sidekick that turns cold centroids into a warm, actionable business playbook.

In this article, I’m pulling back the curtain on a project I built in Google Colab. I’ll show you how I moved from raw (synthetic) migraine patient data to a full-blown strategic playbook using a blend of K-Means clustering and Google’s Gemini. We aren’t just grouping data anymore; we’re automating the “Strategic Soul” of the business.

Step 1: The Foundation (Synthetic Data)

I’d probably be sacked faster than a non-compliant sales rep if I used actual IQVIA data for this article, so to keep my career intact while proving the point, we’ll create synthetic data that mimics real-world dynamics without the legal paperwork. In the code snippet below, I avoid generating patient- or claim-level data. Instead, I create practical aggregated features, directly derivable from patient data, that are highly relevant to segmentation.

# Step 1: GenAI-created code for generating a synthetic dataset for the segmentation process

import numpy as np
import pandas as pd

np.random.seed(42)

n_hcps = 50000

# 1. Core identifiers and attributes
hcp_ids = [f"HCP_{i+1}" for i in range(n_hcps)]

specialties = np.random.choice(
    ["Neurologist", "PCP", "Other"],
    size=n_hcps,
    p=[0.25, 0.5, 0.25]  # tweak as needed
)

states = np.random.choice(
    ["CA", "TX", "NY", "FL", "IL", "PA", "OH", "GA", "NC", "MI"],
    size=n_hcps
)

top_payer_type = np.random.choice(
    ["Commercial", "Medicare", "Medicaid", "Other"],
    size=n_hcps,
    p=[0.5, 0.25, 0.15, 0.10]  # commercial-heavy mix
)

# 2. Access score (1-10, higher = better)
# Let Neurologists have slightly better access on average
base_access = np.random.normal(loc=6.5, scale=2.0, size=n_hcps)
base_access += np.where(specialties == "Neurologist", 0.8, 0.0)
base_access += np.where(top_payer_type == "Medicaid", -0.7, 0.0)
access_score = np.clip(np.round(base_access), 1, 10).astype(int)

# 3. Class TRx (100-500 migraine market TRx, higher for Neuro)
base_class_trx = np.random.normal(loc=260, scale=60, size=n_hcps)
base_class_trx += np.where(specialties == "Neurologist", 60, 0)
base_class_trx += np.where(specialties == "PCP", -30, 0)
class_trx = np.clip(np.round(base_class_trx), 100, 500).astype(int)

# 4. Brand share and brand TRx
# Target ~45% overall market share, modulated by access & specialty
raw_share = (
    0.45
    + 0.03 * (access_score - 5)  # better access -> higher share
    + np.where(specialties == "Neurologist", 0.05, 0.0)
    + np.where(top_payer_type == "Medicaid", -0.05, 0.0)
    + np.random.normal(0, 0.05, n_hcps)  # noise
)

brand_share = np.clip(raw_share, 0.05, 0.9)
brand_trx = np.round(class_trx * brand_share).astype(int)

# Check overall market share (optional sanity check)
overall_share = brand_trx.sum() / class_trx.sum()
print(f"Overall synthetic market share: {overall_share:.3f}")

# 5. Channel engagement: digital (emails etc.) and rep calls
# Let access & specialty influence channel mix
digital_base = 4 + 0.6 * (access_score - 5)
digital_base += np.where(specialties == "Neurologist", 2, 0)
digital_base += np.where(top_payer_type == "Commercial", 1, 0)
digital_engagement = np.random.poisson(lam=np.clip(digital_base, 0.5, 20))

rep_base = 3 + 0.4 * (access_score - 5)
rep_base += np.where(specialties == "Neurologist", 1.5, 0)
rep_base += np.where(top_payer_type == "Medicaid", -0.5, 0)
rep_calls = np.random.poisson(lam=np.clip(rep_base, 0.2, 20))

# New vs existing patients mix (0-1; higher = more new starts)
new_start_ratio = np.clip(
    np.random.beta(a=2, b=3, size=n_hcps)
    + 0.05 * (access_score - 5) / 5,
    0, 1
)

# Chronic burden index (proxy for comorbidity severity, 0-5)
chronic_burden_index = np.clip(
    np.random.normal(loc=2.5, scale=1.0, size=n_hcps)
    + np.where(top_payer_type == "Medicare", 0.8, 0.0),
    0, 5
)

# Adherence proxy (proportion of patients with MPR > 80%)
adherence_rate = np.clip(
    0.7
    + 0.02 * (access_score - 5)
    + np.random.normal(0, 0.05, n_hcps),
    0.3, 0.95
)

# Build the final DataFrame
df_hcp = pd.DataFrame({
    "hcp_id": hcp_ids,
    "specialty": specialties,
    "access_score": access_score,
    "class_trx": class_trx,
    "brand_trx": brand_trx,
    "digital_engagement": digital_engagement,
    "rep_calls": rep_calls,
    "state": states,
    "top_payer_type": top_payer_type,
    "brand_share": brand_trx / class_trx,
    "new_start_ratio": new_start_ratio,
    "chronic_burden_index": chronic_burden_index,
    "adherence_rate": adherence_rate
})

df_hcp.head()
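Before clustering, it’s worth a quick sanity check that the engineered relationships actually landed in the data. Here’s a minimal, standalone sketch that re-derives the access-to-share link on a smaller sample (so it runs without the full notebook in scope); the exact numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Re-create the access -> brand share relationship from the generator above
access = np.clip(np.round(rng.normal(6.5, 2.0, n)), 1, 10)
share = np.clip(0.45 + 0.03 * (access - 5) + rng.normal(0, 0.05, n), 0.05, 0.9)

# Higher access should correlate with higher brand share
corr = np.corrcoef(access, share)[0, 1]
print(f"access vs brand share correlation: {corr:.2f}")
```

If a relationship you deliberately baked in doesn’t show up here, the clusters downstream won’t reflect it either.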

Step 2: The Logic (K-Means & The Mathematical Anchor)

This is the “classic” part of the process — the ML foundation that remains the bedrock of any solid analysis. In pharma, segmentation isn’t just about grouping; it’s about finding distinct, non-overlapping patient profiles that respond differently to treatment or messaging.

We use K-Means, an unsupervised learning algorithm that groups data points by minimizing the distance between each point and its cluster center (the centroid). However, K-Means is a bit like a GPS: it will take you wherever you ask, but you have to tell it how many stops to make.
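Under the hood, K-Means is just an alternation of two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. Here is a minimal NumPy sketch of those iterations (illustrative only, with a naive random init that can land in a local optimum; scikit-learn’s k-means++ init, which the pipeline below uses, is more robust):

```python
import numpy as np

def kmeans_lloyd(X, k, iters=50, seed=42):
    """Naive Lloyd's algorithm: assign to nearest centroid, then re-average."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Three tight toy blobs to cluster
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(100, 2)) for loc in (-3.0, 0.0, 3.0)])
labels, cents = kmeans_lloyd(X, k=3)
print(cents.round(1))
```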


To avoid picking a number out of thin air, we use two diagnostic tools:

  • The Elbow Method: We look for the “bend” where adding more clusters stops providing significant improvements in tightness (Inertia).
  • The Silhouette Score: This measures how well a patient fits their own group versus the neighboring one. We want high scores — meaning “tight” clusters and “clear” boundaries.
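To make the silhouette score concrete, here is a toy example (assumed names, not part of the article’s pipeline): two well-separated blobs scored once with the true split and once with a random split. The true split should score near 1, the random one near 0:

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated 2D blobs of 100 points each
X = np.vstack([rng.normal(-3, 0.5, (100, 2)), rng.normal(3, 0.5, (100, 2))])

true_labels = np.array([0] * 100 + [1] * 100)   # matches the real blobs
random_labels = rng.integers(0, 2, 200)         # ignores the structure

good = silhouette_score(X, true_labels)
bad = silhouette_score(X, random_labels)
print(f"true split: {good:.2f}, random split: {bad:.2f}")
```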

The following snippet runs this diagnostic marathon to find the “Goldilocks” number of clusters:

## Importing the libraries required for the K-Means clustering pipeline
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

# Select numeric features for clustering
# These capture HCP behavior, intensity, and potential
# Exclude IDs and pure categoricals
numeric_features = [
    "access_score",          # Access to therapy (1-10)
    "class_trx",             # Total migraine TRx volume
    "brand_trx",             # Brand-specific TRx
    "brand_share",           # Brand penetration (0-1)
    "digital_engagement",    # Email/webinar engagement
    "rep_calls",             # Field rep interactions
    "new_start_ratio",       # New prescriptions vs refills
    "chronic_burden_index",  # Patient complexity proxy
    "adherence_rate"         # Persistence quality
]

print(f"Selected {len(numeric_features)} features for clustering")

# Extract features as a NumPy array
X = df_hcp[numeric_features].values

print("Feature matrix shape:", X.shape)
print("Sample feature values (first 5 HCPs):")
print(X[:5])

# Standardize features (CRITICAL for clustering)
# K-means uses Euclidean distance; unscaled features would dominate
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Scaling complete. Mean and std of scaled features:")
print(pd.DataFrame(X_scaled, columns=numeric_features).describe().round(3))

# Elbow method + silhouette score to choose k
ks = range(2, 11)  # Test k = 2 to 10
inertias = []
sil_scores = []

for k in ks:
    # Fit K-means for this k
    km = KMeans(n_clusters=k, random_state=42, n_init="auto")
    labels = km.fit_predict(X_scaled)

    # Store metrics
    inertias.append(km.inertia_)  # Within-cluster sum of squares
    sil_scores.append(silhouette_score(X_scaled, labels))

# Plot both diagnostics
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(ks, inertias, marker="o", linewidth=2)
ax1.set_xlabel("Number of clusters (k)")
ax1.set_ylabel("Inertia (within-cluster SSE)")
ax1.set_title("Elbow Method")
ax1.grid(True, alpha=0.3)

ax2.plot(ks, sil_scores, marker="o", linewidth=2, color="orange")
ax2.set_xlabel("Number of clusters (k)")
ax2.set_ylabel("Average Silhouette Score")
ax2.set_title("Silhouette Analysis")
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print the results table
results_table = pd.DataFrame({
    "k": ks,
    "inertia": np.round(inertias, 0),
    "silhouette": np.round(sil_scores, 3)
})
print("Diagnostic results:")
print(results_table)
Output of the diagnostic code above, which helps us finalize k

With the optimal “K” value identified as 4, we processed our 50,000 HCP data points through the K-means model. This transformed our raw data into the structured cluster visualization shown below.
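If you would rather not eyeball the elbow plot, the silhouette curve can also pick k programmatically: take the k with the highest average score. Here is a standalone sketch on synthetic blobs standing in for the scaled HCP features (names like `best_k` are my own, not from the notebook):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Four well-separated synthetic blobs stand in for the scaled HCP features
centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=2000, centers=centers, cluster_std=0.8, random_state=42)
X = StandardScaler().fit_transform(X)

ks = range(2, 11)
sil = []
for k in ks:
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    sil.append(silhouette_score(X, labels))

best_k = ks[int(np.argmax(sil))]
print(f"Best k by silhouette: {best_k}")  # -> 4 for these blobs
```

In practice you would still sanity-check the winner against the elbow curve and against business usability (a field force can act on 4 segments, rarely on 9).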

So, what’s next? Am I expected to run descriptive analytics on each cluster across all the features we used, invest hours crafting marketing-friendly buzzwords for them, and then also develop a strategic playbook tailored to each cluster?

What if all the heavy lifting (descriptive analytics, buzzword crafting, cluster playbooks) could be jump-started by a single code snippet? That means less time on mechanics and more time on strategy: you start from a strong foundation and then tailor it into a roadmap for each client.

## Setting up the Gemini LLM model

!pip install -q google-generativeai
import google.generativeai as genai
from google.colab import userdata

# The API key is stored as a Colab secret named 'Default'
genai.configure(api_key=userdata.get('Default'))

## Let's try the LLM way now

def run_clustering_with_gemini_descriptions(df_hcp, k_final=4):
    """
    Full pipeline: K-means clustering -> profiles -> Gemini descriptions
    """
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt
    import seaborn as sns
    from IPython.display import display, Markdown

    # 1. CLUSTERING
    numeric_features = [
        "access_score", "class_trx", "brand_trx", "brand_share",
        "digital_engagement", "rep_calls", "new_start_ratio",
        "chronic_burden_index", "adherence_rate"
    ]

    X = df_hcp[numeric_features].values
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Fit K-means
    kmeans = KMeans(n_clusters=k_final, random_state=42, n_init="auto")
    df_hcp["cluster_ml"] = kmeans.fit_predict(X_scaled)

    # 2. PROFILES
    cluster_profile = df_hcp.groupby("cluster_ml")[numeric_features].mean().round(2)
    cluster_profile["size"] = df_hcp["cluster_ml"].value_counts().sort_index().values
    cluster_profile["pct_total"] = (cluster_profile["size"] / len(df_hcp) * 100).round(1)

    print("✅ ML Clusters created!")
    print(cluster_profile)

    # 3. PCA VISUALIZATION
    pca = PCA(n_components=2)
    pcs = pca.fit_transform(X_scaled)
    df_hcp["pc1"] = pcs[:, 0]
    df_hcp["pc2"] = pcs[:, 1]

    plt.figure(figsize=(8, 6))
    sns.scatterplot(
        data=df_hcp.sample(5000, random_state=42),
        x="pc1", y="pc2", hue="cluster_ml",
        palette="tab10", alpha=0.7, s=30
    )
    plt.title(f"HCP Clusters (k={k_final})")
    plt.legend(title="Cluster")
    plt.show()

    # 4. GEMINI DESCRIPTIONS
    descriptions = generate_gemini_cluster_stories(cluster_profile, df_hcp)

    display(Markdown("## 🤖 **Gemini: HCP Cluster Personas**"))
    display(Markdown(descriptions))

    return df_hcp, cluster_profile


def generate_gemini_cluster_stories(cluster_profile, df_hcp):
    """
    Use Gemini to generate business-ready cluster descriptions
    """
    # Prepare data for Gemini
    table_md = cluster_profile.to_markdown()

    # Get categorical insights
    specialty_dist = pd.crosstab(df_hcp["cluster_ml"], df_hcp["specialty"],
                                 normalize="index").round(2).to_markdown()
    payer_dist = pd.crosstab(df_hcp["cluster_ml"], df_hcp["top_payer_type"],
                             normalize="index").round(2).to_markdown()

    prompt = f"""
You are a pharma commercial excellence consultant analyzing HCP segments for a migraine brand.

**ML CLUSTER PROFILES** (means per cluster):
{table_md}

**SPECIALTY MIX**:
{specialty_dist}

**PAYER MIX**:
{payer_dist}

**TASK**: For each cluster (0-{len(cluster_profile)-1}), create:

## Cluster X: [2-3 word business name]
**Profile** (2-3 bullets):
• Key characteristics in plain business language
• What makes this HCP unique

**Priority** (High/Med/Low):
• Why this segment matters for brand growth

**Engagement playbook** (3 tactics):
• Rep strategy
• Digital strategy
• Access/messaging focus

Format as clean markdown. Make it actionable for field force leaders.
"""

    # Generate with Gemini
    # Make sure your credentials/keys are configured and
    # the model you select is available to your account
    model = genai.GenerativeModel('gemini-2.5-flash')
    response = model.generate_content(prompt)

    return response.text


# RUN THE FULL PIPELINE
df_with_clusters, cluster_profiles = run_clustering_with_gemini_descriptions(df_hcp, k_final=4)
Voilà!

In minutes, raw HCP data turned into actionable personas like “Elite Advocates” and “High-Potential Expanders” — work that used to take weeks now runs instantly in a free Colab notebook. This gives you a strong starting point with clear field playbooks, but as the analyst, you still need to validate each cluster’s story: do “Elite Advocates” really show high brand loyalty and engagement? Do “High-Potential Expanders” have the class volume and access to justify the growth opportunity? Cross-check with revenue concentration, payer mix alignment, and field feedback before deployment. AI handles the heavy lifting while you bring the expertise to make it production-ready.
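Those cross-checks are easy to script, too. Here is a minimal sketch of one of them, with a mocked-up frame standing in for the clustered output (the column names follow the notebook; the numbers are made up): revenue concentration and within-cluster brand share, two quick tests of whether a persona label like “Elite Advocates” is earned:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000

# Mock of the clustered output: labels plus the two TRx columns
df = pd.DataFrame({
    "cluster_ml": rng.integers(0, 4, n),
    "class_trx": rng.integers(100, 500, n),
})
df["brand_trx"] = (df["class_trx"] * rng.uniform(0.1, 0.8, n)).round().astype(int)

# Revenue concentration: what share of total brand TRx each cluster holds
conc = df.groupby("cluster_ml")["brand_trx"].sum() / df["brand_trx"].sum()

# Within-cluster brand share, to sanity-check "high loyalty" narratives
agg = df.groupby("cluster_ml")[["brand_trx", "class_trx"]].sum()
within_share = agg["brand_trx"] / agg["class_trx"]

print(pd.DataFrame({"revenue_concentration": conc.round(3),
                    "brand_share": within_share.round(3)}))
```

A “high-value” cluster that holds little of the brand’s total volume, or a “loyalist” cluster with middling within-cluster share, is a sign the LLM’s narrative outran the math.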

Check out my Colab notebook for the full walkthrough: https://colab.research.google.com/drive/1wuit_8PXH2ykGhDubG65poYIBLRo2Igq?usp=sharing

— From someone who knows exactly how it feels when your carefully planned three-week segmentation timeline becomes a 72-hour fire drill. (This may or may not be an exaggeration.)


Published via Towards AI

