From Clusters to Customers: Supercharging Segmentation with Generative AI

Last Updated on February 3, 2026 by Editorial Team

Author(s): Abhijeet Sahoo

Originally published on Towards AI.

The “Consultant’s Confession”

If you’ve spent any time in pharma consulting, you know the drill. We live in a world of high-stakes “Patient Journeys” and “HCP Target Lists.” I’ve spent more hours than I care to admit staring at spreadsheets of patient data, trying to find that “Aha!” moment. (The article uses pharma examples throughout, but this method easily adapts to any consulting domain.)

In the “old days” — about two years ago — a segmentation project meant running a K-Means clustering model and surfacing four or five neat little groups. But let’s be honest: nothing kills a “Data-Driven Culture” faster than a slide labeled Cluster 0 through Cluster 4.

You’ve spent weeks cleaning data, tuning hyperparameters, and debating “Elbow vs. Silhouette” curves. You proudly present your findings, only for the Head of Marketing to ask the one question that makes your heart sink: “Great, but… what do I actually say to the people in Cluster 2?”

Suddenly, your precise mathematical groups feel like abstract art — pretty to look at, but impossible to use. We’ve been stuck in this gap between Math and Meaning for years.

But what if the “missing link” isn’t more data, but a better translator? Imagine replacing the overworked analyst (currently vibrating on overpriced espresso and billable-hour anxiety) with an LLM sidekick that turns cold centroids into a warm, actionable business playbook.

In this article, I’m pulling back the curtain on a project I built in Google Colab. I’ll show you how I moved from raw (synthetic) migraine patient data to a full-blown strategic playbook using a blend of K-Means clustering and Google’s Gemini. We aren’t just grouping data anymore; we’re automating the “Strategic Soul” of the business.

Step 1: The Foundation (Synthetic Data)

I’d probably be sacked faster than a non-compliant sales rep if I used actual IQVIA data for this article, so to keep my career intact while proving the point, we’ll create synthetic data that mimics real-world dynamics without the legal paperwork. In the code snippet below, I avoid generating patient- or claim-level data. Instead, I create practical aggregated features, directly derivable from patient data, that are highly relevant to segmentation.

# Step 1: GenAI-created code for generating a synthetic dataset for the segmentation process

import numpy as np
import pandas as pd

np.random.seed(42)

n_hcps = 50000

# 1. Core identifiers and attributes
hcp_ids = [f"HCP_{i+1}" for i in range(n_hcps)]

specialties = np.random.choice(
    ["Neurologist", "PCP", "Other"],
    size=n_hcps,
    p=[0.25, 0.5, 0.25]  # tweak as needed
)

states = np.random.choice(
    ["CA", "TX", "NY", "FL", "IL", "PA", "OH", "GA", "NC", "MI"],
    size=n_hcps
)

top_payer_type = np.random.choice(
    ["Commercial", "Medicare", "Medicaid", "Other"],
    size=n_hcps,
    p=[0.5, 0.25, 0.15, 0.10]  # commercial-heavy mix
)

# 2. Access score (1-10, higher = better)
# Let Neurologists have slightly better access on average
base_access = np.random.normal(loc=6.5, scale=2.0, size=n_hcps)
base_access += np.where(specialties == "Neurologist", 0.8, 0.0)
base_access += np.where(top_payer_type == "Medicaid", -0.7, 0.0)
access_score = np.clip(np.round(base_access), 1, 10).astype(int)

# 3. Class TRx (100-500 migraine market TRx, higher for Neuro)
base_class_trx = np.random.normal(loc=260, scale=60, size=n_hcps)
base_class_trx += np.where(specialties == "Neurologist", 60, 0)
base_class_trx += np.where(specialties == "PCP", -30, 0)
class_trx = np.clip(np.round(base_class_trx), 100, 500).astype(int)

# 4. Brand share and brand TRx
# Target ~45% overall market share, modulated by access & specialty
raw_share = (
    0.45
    + 0.03 * (access_score - 5)  # better access -> higher share
    + np.where(specialties == "Neurologist", 0.05, 0.0)
    + np.where(top_payer_type == "Medicaid", -0.05, 0.0)
    + np.random.normal(0, 0.05, n_hcps)  # noise
)

brand_share = np.clip(raw_share, 0.05, 0.9)
brand_trx = np.round(class_trx * brand_share).astype(int)

# Check overall market share (optional sanity check)
overall_share = brand_trx.sum() / class_trx.sum()
print(f"Overall synthetic market share: {overall_share:.3f}")

# 5. Channel engagement: digital (emails etc.) and rep calls
# Let access & specialty influence channel mix
digital_base = 4 + 0.6 * (access_score - 5)
digital_base += np.where(specialties == "Neurologist", 2, 0)
digital_base += np.where(top_payer_type == "Commercial", 1, 0)
digital_engagement = np.random.poisson(lam=np.clip(digital_base, 0.5, 20))

rep_base = 3 + 0.4 * (access_score - 5)
rep_base += np.where(specialties == "Neurologist", 1.5, 0)
rep_base += np.where(top_payer_type == "Medicaid", -0.5, 0)
rep_calls = np.random.poisson(lam=np.clip(rep_base, 0.2, 20))

# New vs existing patients mix (0-1; higher = more new starts)
new_start_ratio = np.clip(
    np.random.beta(a=2, b=3, size=n_hcps)
    + 0.05 * (access_score - 5) / 5,
    0, 1
)

# Chronic burden index (proxy for comorbidity severity, 0-5)
chronic_burden_index = np.clip(
    np.random.normal(loc=2.5, scale=1.0, size=n_hcps)
    + np.where(top_payer_type == "Medicare", 0.8, 0.0),
    0, 5
)

# Adherence proxy (proportion of patients with MPR > 80%)
adherence_rate = np.clip(
    0.7
    + 0.02 * (access_score - 5)
    + np.random.normal(0, 0.05, n_hcps),
    0.3, 0.95
)

# Build the final DataFrame
df_hcp = pd.DataFrame({
    "hcp_id": hcp_ids,
    "specialty": specialties,
    "access_score": access_score,
    "class_trx": class_trx,
    "brand_trx": brand_trx,
    "digital_engagement": digital_engagement,
    "rep_calls": rep_calls,
    "state": states,
    "top_payer_type": top_payer_type,
    "brand_share": brand_trx / class_trx,
    "new_start_ratio": new_start_ratio,
    "chronic_burden_index": chronic_burden_index,
    "adherence_rate": adherence_rate
})

df_hcp.head()
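Before clustering, it’s worth a quick sanity check that the engineered relationships actually landed in the data. Here’s a minimal, standalone sketch that re-derives the access-to-share link on a smaller sample (so it runs without the full notebook in scope); the exact numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Re-create the access -> brand share relationship from the generator above
access = np.clip(np.round(rng.normal(6.5, 2.0, n)), 1, 10)
share = np.clip(0.45 + 0.03 * (access - 5) + rng.normal(0, 0.05, n), 0.05, 0.9)

# Higher access should correlate with higher brand share
corr = np.corrcoef(access, share)[0, 1]
print(f"access vs brand share correlation: {corr:.2f}")
```

If a relationship you deliberately baked in doesn’t show up here, the clusters downstream won’t reflect it either.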

Step 2: The Logic (K-Means & The Mathematical Anchor)

This is the “classic” part of the process — the ML foundation that remains the bedrock of any solid analysis. In pharma, segmentation isn’t just about grouping; it’s about finding distinct, non-overlapping patient profiles that respond differently to treatment or messaging.

We use K-Means, an unsupervised learning algorithm that groups data points by minimizing the distance between each point and its cluster center (the centroid). However, K-Means is a bit like a GPS: it will take you wherever you ask, but you have to tell it how many stops to make.
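Under the hood, K-Means is just an alternation of two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. Here is a minimal NumPy sketch of those iterations (illustrative only, with a naive random init that can land in a local optimum; scikit-learn’s k-means++ init, which the pipeline below uses, is more robust):

```python
import numpy as np

def kmeans_lloyd(X, k, iters=50, seed=42):
    """Naive Lloyd's algorithm: assign to nearest centroid, then re-average."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Three tight toy blobs to cluster
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(100, 2)) for loc in (-3.0, 0.0, 3.0)])
labels, cents = kmeans_lloyd(X, k=3)
print(cents.round(1))
```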


To avoid picking a number out of thin air, we use two diagnostic tools:

  • The Elbow Method: We look for the “bend” where adding more clusters stops providing significant improvements in tightness (Inertia).
  • The Silhouette Score: This measures how well a patient fits their own group versus the neighboring one. We want high scores — meaning “tight” clusters and “clear” boundaries.
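To make the silhouette score concrete, here is a toy example (assumed names, not part of the article’s pipeline): two well-separated blobs scored once with the true split and once with a random split. The true split should score near 1, the random one near 0:

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated 2D blobs of 100 points each
X = np.vstack([rng.normal(-3, 0.5, (100, 2)), rng.normal(3, 0.5, (100, 2))])

true_labels = np.array([0] * 100 + [1] * 100)   # matches the real blobs
random_labels = rng.integers(0, 2, 200)         # ignores the structure

good = silhouette_score(X, true_labels)
bad = silhouette_score(X, random_labels)
print(f"true split: {good:.2f}, random split: {bad:.2f}")
```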

The following snippet runs this diagnostic marathon to find the “Goldilocks” number of clusters:

## Importing the libraries required for the K-Means clustering pipeline
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

# Select numeric features for clustering
# These capture HCP behavior, intensity, and potential
# Exclude IDs and pure categoricals
numeric_features = [
    "access_score",          # Access to therapy (1-10)
    "class_trx",             # Total migraine TRx volume
    "brand_trx",             # Brand-specific TRx
    "brand_share",           # Brand penetration (0-1)
    "digital_engagement",    # Email/webinar engagement
    "rep_calls",             # Field rep interactions
    "new_start_ratio",       # New prescriptions vs refills
    "chronic_burden_index",  # Patient complexity proxy
    "adherence_rate"         # Persistence quality
]

print(f"Selected {len(numeric_features)} features for clustering")

# Extract features as a NumPy array
X = df_hcp[numeric_features].values

print("Feature matrix shape:", X.shape)
print("Sample feature values (first 5 HCPs):")
print(X[:5])

# Standardize features (CRITICAL for clustering)
# K-means uses Euclidean distance; unscaled features would dominate
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Scaling complete. Mean and std of scaled features:")
print(pd.DataFrame(X_scaled, columns=numeric_features).describe().round(3))

# Elbow method + silhouette score to choose k
ks = range(2, 11)  # Test k = 2 to 10
inertias = []
sil_scores = []

for k in ks:
    # Fit K-means for this k
    km = KMeans(n_clusters=k, random_state=42, n_init="auto")
    labels = km.fit_predict(X_scaled)

    # Store metrics
    inertias.append(km.inertia_)  # Within-cluster sum of squares
    sil_scores.append(silhouette_score(X_scaled, labels))

# Plot both diagnostics
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(ks, inertias, marker="o", linewidth=2)
ax1.set_xlabel("Number of clusters (k)")
ax1.set_ylabel("Inertia (within-cluster SSE)")
ax1.set_title("Elbow Method")
ax1.grid(True, alpha=0.3)

ax2.plot(ks, sil_scores, marker="o", linewidth=2, color="orange")
ax2.set_xlabel("Number of clusters (k)")
ax2.set_ylabel("Average Silhouette Score")
ax2.set_title("Silhouette Analysis")
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print the results table
results_table = pd.DataFrame({
    "k": ks,
    "inertia": np.round(inertias, 0),
    "silhouette": np.round(sil_scores, 3)
})
print("Diagnostic results:")
print(results_table)
Output of the diagnostic code above, which helps us finalize k

With the optimal “K” value identified as 4, we processed our 50,000 HCP data points through the K-means model. This transformed our raw data into the structured cluster visualization shown below.
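If you would rather not eyeball the elbow plot, the silhouette curve can also pick k programmatically: take the k with the highest average score. Here is a standalone sketch on synthetic blobs standing in for the scaled HCP features (names like `best_k` are my own, not from the notebook):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Four well-separated synthetic blobs stand in for the scaled HCP features
centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=2000, centers=centers, cluster_std=0.8, random_state=42)
X = StandardScaler().fit_transform(X)

ks = range(2, 11)
sil = []
for k in ks:
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    sil.append(silhouette_score(X, labels))

best_k = ks[int(np.argmax(sil))]
print(f"Best k by silhouette: {best_k}")  # -> 4 for these blobs
```

In practice you would still sanity-check the winner against the elbow curve and against business usability (a field force can act on 4 segments, rarely on 9).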

So, what’s next? Am I expected to run descriptive analytics on each cluster across all the features we used, invest hours crafting marketing-friendly buzzwords for them, and then also develop a strategic playbook tailored to each cluster?

What if all the heavy lifting (descriptive analytics, buzzword crafting, cluster playbooks) could be jump-started by a single code snippet? That means less time on mechanics and more time on strategy: you start from a strong foundation and then tailor it into a roadmap for each client.

## Setting up the Gemini LLM model

!pip install -q google-generativeai
import google.generativeai as genai
from google.colab import userdata

# The API key is stored as a Colab secret named 'Default'
genai.configure(api_key=userdata.get('Default'))

## Let's try the LLM way now

def run_clustering_with_gemini_descriptions(df_hcp, k_final=4):
    """
    Full pipeline: K-means clustering -> profiles -> Gemini descriptions
    """
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt
    import seaborn as sns
    from IPython.display import display, Markdown

    # 1. CLUSTERING
    numeric_features = [
        "access_score", "class_trx", "brand_trx", "brand_share",
        "digital_engagement", "rep_calls", "new_start_ratio",
        "chronic_burden_index", "adherence_rate"
    ]

    X = df_hcp[numeric_features].values
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Fit K-means
    kmeans = KMeans(n_clusters=k_final, random_state=42, n_init="auto")
    df_hcp["cluster_ml"] = kmeans.fit_predict(X_scaled)

    # 2. PROFILES
    cluster_profile = df_hcp.groupby("cluster_ml")[numeric_features].mean().round(2)
    cluster_profile["size"] = df_hcp["cluster_ml"].value_counts().sort_index().values
    cluster_profile["pct_total"] = (cluster_profile["size"] / len(df_hcp) * 100).round(1)

    print("✅ ML Clusters created!")
    print(cluster_profile)

    # 3. PCA VISUALIZATION
    pca = PCA(n_components=2)
    pcs = pca.fit_transform(X_scaled)
    df_hcp["pc1"] = pcs[:, 0]
    df_hcp["pc2"] = pcs[:, 1]

    plt.figure(figsize=(8, 6))
    sns.scatterplot(
        data=df_hcp.sample(5000, random_state=42),
        x="pc1", y="pc2", hue="cluster_ml",
        palette="tab10", alpha=0.7, s=30
    )
    plt.title(f"HCP Clusters (k={k_final})")
    plt.legend(title="Cluster")
    plt.show()

    # 4. GEMINI DESCRIPTIONS
    descriptions = generate_gemini_cluster_stories(cluster_profile, df_hcp)

    display(Markdown("## 🤖 **Gemini: HCP Cluster Personas**"))
    display(Markdown(descriptions))

    return df_hcp, cluster_profile


def generate_gemini_cluster_stories(cluster_profile, df_hcp):
    """
    Use Gemini to generate business-ready cluster descriptions
    """
    # Prepare data for Gemini
    table_md = cluster_profile.to_markdown()

    # Get categorical insights
    specialty_dist = pd.crosstab(df_hcp["cluster_ml"], df_hcp["specialty"],
                                 normalize="index").round(2).to_markdown()
    payer_dist = pd.crosstab(df_hcp["cluster_ml"], df_hcp["top_payer_type"],
                             normalize="index").round(2).to_markdown()

    prompt = f"""
You are a pharma commercial excellence consultant analyzing HCP segments for a migraine brand.

**ML CLUSTER PROFILES** (means per cluster):
{table_md}

**SPECIALTY MIX**:
{specialty_dist}

**PAYER MIX**:
{payer_dist}

**TASK**: For each cluster (0-{len(cluster_profile)-1}), create:

## Cluster X: [2-3 word business name]
**Profile** (2-3 bullets):
• Key characteristics in plain business language
• What makes this HCP unique

**Priority** (High/Med/Low):
• Why this segment matters for brand growth

**Engagement playbook** (3 tactics):
• Rep strategy
• Digital strategy
• Access/messaging focus

Format as clean markdown. Make it actionable for field force leaders.
"""

    # Generate with Gemini
    # Make sure your credentials/keys are configured and
    # the model you select is available to your account
    model = genai.GenerativeModel('gemini-2.5-flash')
    response = model.generate_content(prompt)

    return response.text


# RUN THE FULL PIPELINE
df_with_clusters, cluster_profiles = run_clustering_with_gemini_descriptions(df_hcp, k_final=4)
Voilà!

In minutes, raw HCP data turned into actionable personas like “Elite Advocates” and “High-Potential Expanders” — work that used to take weeks now runs instantly in a free Colab notebook. This gives you a strong starting point with clear field playbooks, but as the analyst, you still need to validate each cluster’s story: do “Elite Advocates” really show high brand loyalty and engagement? Do “High-Potential Expanders” have the class volume and access to justify the growth opportunity? Cross-check with revenue concentration, payer mix alignment, and field feedback before deployment. AI handles the heavy lifting while you bring the expertise to make it production-ready.
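Those cross-checks are easy to script, too. Here is a minimal sketch of one of them, with a mocked-up frame standing in for the clustered output (the column names follow the notebook; the numbers are made up): revenue concentration and within-cluster brand share, two quick tests of whether a persona label like “Elite Advocates” is earned:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000

# Mock of the clustered output: labels plus the two TRx columns
df = pd.DataFrame({
    "cluster_ml": rng.integers(0, 4, n),
    "class_trx": rng.integers(100, 500, n),
})
df["brand_trx"] = (df["class_trx"] * rng.uniform(0.1, 0.8, n)).round().astype(int)

# Revenue concentration: what share of total brand TRx each cluster holds
conc = df.groupby("cluster_ml")["brand_trx"].sum() / df["brand_trx"].sum()

# Within-cluster brand share, to sanity-check "high loyalty" narratives
agg = df.groupby("cluster_ml")[["brand_trx", "class_trx"]].sum()
within_share = agg["brand_trx"] / agg["class_trx"]

print(pd.DataFrame({"revenue_concentration": conc.round(3),
                    "brand_share": within_share.round(3)}))
```

A “high-value” cluster that holds little of the brand’s total volume, or a “loyalist” cluster with middling within-cluster share, is a sign the LLM’s narrative outran the math.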

Check out my Colab notebook for the full walkthrough: https://colab.research.google.com/drive/1wuit_8PXH2ykGhDubG65poYIBLRo2Igq?usp=sharing

— From someone who knows exactly how it feels when your carefully planned three-week segmentation timeline becomes a 72-hour fire drill. (This may or may not be an exaggeration.)


Published via Towards AI

