Synthetic Data That Behaves: A Practical Guide to Generating Realistic Healthcare-Like Data Without Violating Privacy

Author(s): Abhishek Yadav

Originally published on Towards AI.

A hands-on guide to building synthetic data that looks, feels, and behaves like the real world without privacy risk

Photo by Luke Chesser on Unsplash

Healthcare organizations sit on treasure chests of data: appointments, lab results, care journeys, billing patterns, social determinants. Yet the very rules designed to protect patient privacy often make it nearly impossible for analysts and data scientists to experiment freely.

And that’s a problem.

Because innovation doesn’t start in production systems.
It starts with play: trying ideas, building prototypes, and running experiments without fear of leaking sensitive data.

That’s where synthetic data becomes one of the most powerful tools in a data scientist’s toolkit.

In this guide, I’ll show you a practical, no-nonsense approach to generating healthcare-like synthetic data that behaves like the real thing, preserving shape, distributions, trends, and correlations without exposing a single patient.

No GANs.
No deep learning.
No fragile black-box models.

Just a clean, transparent method that you can explain to your compliance team in under two minutes.

Why Synthetic Data Matters (More Than Ever)

Most organizations face three common constraints:

1. Privacy laws delay or block innovation.
HIPAA, GDPR, and internal policies often prevent teams from sharing even de-identified datasets internally.

2. Analysts can’t experiment fast.
Requesting access to PHI can take weeks or months, killing momentum for early ideas.

3. Teams need realistic datasets for prototypes and demos.
Executive demos, model validations, and vendor evaluations all need data… but not real patient data.

Synthetic data solves all three problems when done correctly:

✔ Behaves like real data
✔ Contains no patient information
✔ Can be shared freely across teams
✔ Speeds up model development dramatically

But Not All Synthetic Data Is Good Synthetic Data

If your synthetic data looks like uniform noise, you’ve only created fake data, not useful synthetic data.

Good synthetic data has three properties:

1. Statistical fidelity
Distributions, seasonal trends, missingness patterns, and class ratios resemble real data.

2. Relationship fidelity
Correlations stay intact, e.g., older age relates to higher visit frequency, chronic conditions relate to higher readmission risk, etc.

3. Behavioral fidelity
The shape of the data over time looks right: wait times, appointment lead times, cancellations, and so on.

Our goal is not perfect replication of the real dataset.
Our goal is behavioral realism.

The Approach: A Transparent 3-Layer Synthetic Generator

Deep generative models (GANs, VAEs) can create beautiful synthetic datasets, but they are:

Hard to tune
Risk-prone (they may leak patterns from small datasets)
A black box to compliance teams

Instead, I use a simple and explainable three-layer generator that works for most healthcare operations datasets.

Layer 1: Distributions That Match Reality

Every variable gets a distribution based on real-world behavior.

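Examples (the choices used in the generator later in this guide):

Age → log-normal, clipped to 18–95
Appointment lead time (days) → log-normal
Provider type → categorical with fixed probabilities
Baseline no-show probability → Beta(2, 10)
Chronic condition count → Poisson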

This layer ensures your dataset looks real.
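One way to choose distribution parameters without touching row-level records is to start from aggregate summary statistics. The sketch below assumes you know only a target median and 90th percentile for lead time (the values 10 and 19 days are hypothetical) and backs out log-normal parameters from them. The helper name lognormal_from_quantiles is illustrative, not part of any library.

```python
import numpy as np
from scipy.stats import norm

def lognormal_from_quantiles(median, p90):
    """Return (mu, sigma) of a log-normal with the given median and 90th percentile."""
    mu = np.log(median)                              # median of a log-normal is exp(mu)
    sigma = (np.log(p90) - mu) / norm.ppf(0.90)      # solve the 90th-percentile equation
    return mu, sigma

rng = np.random.default_rng(42)

# Hypothetical aggregate targets, e.g. taken from an operations report
mu, sigma = lognormal_from_quantiles(median=10.0, p90=19.0)
lead_time_days = rng.lognormal(mean=mu, sigma=sigma, size=20_000)

print(f"mu={mu:.3f}, sigma={sigma:.3f}")
print("sample median:", round(float(np.median(lead_time_days)), 1))
print("sample p90:", round(float(np.percentile(lead_time_days, 90)), 1))
```

With these targets the fitted parameters land close to the mean=2.3, sigma=0.5 used for lead time in the generator later in this guide, which is one way such values could be justified to a reviewer.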

Layer 2: Respecting Correlations

Example relationships we preserve:

Older patients → higher visit frequency
Chronic conditions → longer appointment lead times
New patients → higher no-show probability
Certain specialties → higher cancellation rates

We approximate correlations using:

Spearman rank correlation for skewed healthcare variables
A copula model to generate correlated samples
Logical rules layered on top (e.g., chronic_conditions > 3 → high_visit_risk)

The dataset now behaves like the real world.
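As a minimal sketch of the copula step, the example below uses a Gaussian copula to link age and chronic condition count, then layers the logical rule on top. The latent correlation of 0.5 is an assumed value for illustration, not a figure from real data, and the marginals reuse the same distribution families as the generator later in this guide.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5000
target_corr = 0.5  # assumed latent correlation between age and chronic conditions

# Step 1: draw correlated standard normals
cov = np.array([[1.0, target_corr], [target_corr, 1.0]])
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)

# Step 2: map each column to uniforms with the normal CDF (this is the copula)
u = stats.norm.cdf(z)

# Step 3: push the uniforms through each marginal's inverse CDF
age = stats.lognorm.ppf(u[:, 0], s=0.4, scale=np.exp(3.6))
age = np.clip(age, 18, 95).astype(int)
chronic_conditions = stats.poisson.ppf(u[:, 1], mu=1.2).astype(int)

# Step 4: logical rule layered on top of the sampled values
high_visit_risk = (chronic_conditions > 3).astype(int)

rho, _ = stats.spearmanr(age, chronic_conditions)
print(f"Spearman correlation (age vs chronic conditions): {rho:.2f}")
```

The marginals keep their own shapes while the rank correlation carries over from the latent normals. The observed Spearman value will sit a little below the latent correlation, partly because a Gaussian copula maps Pearson to Spearman nonlinearly and partly because the Poisson column has many ties.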

Layer 3: Realistic Time Behavior

Healthcare data is not static. It has patterns and seasonality:

Hourly cycles (e.g., AM peak, lunch dip, PM ramp)
Weekly patterns (Monday load, weekend effect)
Seasonality (flu season spikes, December cancellations)

We add these with simple, explainable curves and holiday flags. No black boxes, no AI magic needed.
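The three time patterns can be sketched as multipliers applied to a base hourly rate. Every number below (the hourly curve, weekday factors, seasonal amplitude, base rate, and the short holiday list) is a hypothetical placeholder to be replaced with values that match your own setting.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
ts = pd.date_range("2023-01-01", "2023-12-31 23:00", freq="h")

# Hypothetical hourly shape: quiet overnight, AM peak, lunch dip, PM ramp
hour_curve = np.full(24, 0.1)
hour_curve[8:12] = [1.0, 1.3, 1.2, 1.0]
hour_curve[12:14] = [0.6, 0.7]
hour_curve[14:18] = [0.9, 1.0, 1.1, 0.8]

# Hypothetical weekday factors, Monday=0 ... Sunday=6
weekday_factor = np.array([1.25, 1.0, 1.0, 1.0, 0.9, 0.3, 0.2])

# Annual cosine peaking in mid-January as a stand-in for flu season
day_of_year = ts.dayofyear.to_numpy()
season = 1.0 + 0.25 * np.cos(2 * np.pi * (day_of_year - 15) / 365.0)

# Holiday flag from a short illustrative list
holidays = pd.to_datetime(["2023-01-01", "2023-07-04", "2023-12-25"])
is_holiday = ts.normalize().isin(holidays)

base_rate = 12.0  # assumed appointments per hour at multiplier 1.0
rate = (base_rate
        * hour_curve[ts.hour.to_numpy()]
        * weekday_factor[ts.dayofweek.to_numpy()]
        * season)
rate = np.where(is_holiday, rate * 0.2, rate)

demand = pd.DataFrame({
    "timestamp": ts,
    "expected": rate,
    "appointments": rng.poisson(rate),  # noisy counts around the expected curve
})
print(demand.groupby(demand["timestamp"].dt.dayofweek)["appointments"].mean().round(2))
```

Because each pattern is a separate named array, any one of them can be shown to a reviewer, plotted, or swapped out on its own.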

A Minimal Python Generator You Can Trust

Here is a clean and readable version you can publish:

import numpy as np
import pandas as pd

np.random.seed(42)  # fixed seed so the dataset is reproducible

N = 5000  # dataset size

# ---- Layer 1: Distributions ----
# Age: right-skewed log-normal (median ~37), clipped to an adult range
ages = np.random.lognormal(mean=3.6, sigma=0.4, size=N).astype(int)
ages = np.clip(ages, 18, 95)

# Appointment lead time in days (median ~10)
lead_time = np.random.lognormal(mean=2.3, sigma=0.5, size=N).astype(int)

provider_type = np.random.choice(
    ['Primary Care', 'Cardiology', 'Dermatology', 'Neurology'],
    size=N,
    p=[0.55, 0.20, 0.15, 0.10]
)

# Baseline no-show probability per appointment (Beta(2, 10) has mean ~0.17)
no_show_prob = np.random.beta(2, 10, size=N)

# ---- Layer 2: Relationships ----
# Visit frequency rises with age, plus noise; floored at zero
visit_frequency = np.round((ages / 30) + np.random.normal(0, 1, N)).clip(0).astype(int)
chronic_conditions = np.random.poisson(lam=1.2, size=N)

# Rule-based adjustment: each chronic condition adds 0.02 to the no-show probability
adjusted_prob = np.clip(no_show_prob + chronic_conditions * 0.02, 0, 1)
no_show_flag = (np.random.rand(N) < adjusted_prob).astype(int)

# ---- Layer 3: Time Behavior ----
# One timestamp per hour; 5000 hours is roughly 208 days
timestamps = pd.date_range(start="2023-01-01", periods=N, freq="h")
# Three full sine cycles across the series (about 69 days per cycle)
seasonality = np.sin(np.linspace(0, 6*np.pi, N))

wait_time = (lead_time * (1 + 0.3*seasonality)).clip(0).astype(int)

# ---- Output ----
df = pd.DataFrame({
    'age': ages,
    'lead_time_days': lead_time,
    'provider': provider_type,
    'chronic_conditions': chronic_conditions,
    'visit_frequency': visit_frequency,
    'no_show': no_show_flag,
    'appointment_date': timestamps,
    'adjusted_wait_time': wait_time
})

print(df.head())

This gives you:

Realistic age curves
Realistic lead times
Realistic no-show patterns
Seasonal behavior
Clean correlations

A dataset that behaves, without ever touching PHI.

How Close Does Synthetic Data Need to Be?

The rule I use when working with compliance and clinical partners:

“The data should behave like the real world, not mimic any real patient.”

You can validate this using three checks:

1. Distribution plots

Compare real vs. synthetic at a high level.
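Alongside the plots, a two-sample Kolmogorov-Smirnov statistic gives a single number per column. The sketch below uses a randomly generated stand-in for the real column, since no real data appears in this guide; in practice the reference array would come from the approved extract. The helper name ks_distance is illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Stand-in for a real lead-time column (hypothetical, not real patient data)
reference_lead_time = rng.lognormal(mean=2.3, sigma=0.5, size=3000)

# Two candidate synthetic columns: one shaped like the reference, one uniform noise
good_synthetic = rng.lognormal(mean=2.3, sigma=0.5, size=3000)
naive_synthetic = rng.uniform(0, 40, size=3000)

def ks_distance(reference, synthetic):
    """Two-sample KS statistic: near 0 for similar empirical CDFs, up to 1 for disjoint ones."""
    return ks_2samp(reference, synthetic).statistic

good_ks = ks_distance(reference_lead_time, good_synthetic)
naive_ks = ks_distance(reference_lead_time, naive_synthetic)

print(f"KS, matched shape: {good_ks:.3f}")
print(f"KS, uniform noise: {naive_ks:.3f}")
```

A small statistic says the shapes agree; it does not say anything about privacy, so it complements the plots rather than replacing a review.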

2. Correlation matrices

Compare Spearman correlation matrices; directions should match, magnitudes should be plausible (not identical to any specific dataset).
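A rough way to automate this check is to compare the signs of the meaningful correlations and report the largest gap in magnitude. Both frames below are hypothetical stand-ins produced by the illustrative make_frame helper, and the 0.1 threshold for a "meaningful" correlation is an assumption.

```python
import numpy as np
import pandas as pd

def make_frame(rng, n, age_effect):
    """Hypothetical stand-in data: visit frequency rises with age by `age_effect`."""
    age = np.clip(rng.lognormal(3.6, 0.4, n), 18, 95)
    visits = np.clip(age * age_effect + rng.normal(0, 1, n), 0, None)
    lead = rng.lognormal(2.3, 0.5, n)
    return pd.DataFrame({"age": age, "visit_frequency": visits, "lead_time_days": lead})

rng = np.random.default_rng(1)
reference = make_frame(rng, 4000, age_effect=1 / 30)   # plays the role of real data
synthetic = make_frame(rng, 4000, age_effect=1 / 25)   # similar, not identical

ref_corr = reference.corr(method="spearman")
syn_corr = synthetic.corr(method="spearman")

# Only judge direction where the reference correlation is not just noise
off_diag = ~np.eye(len(ref_corr), dtype=bool)
meaningful = (ref_corr.abs().to_numpy() > 0.1) & off_diag

ref_vals, syn_vals = ref_corr.to_numpy(), syn_corr.to_numpy()
signs_match = bool((np.sign(ref_vals[meaningful]) == np.sign(syn_vals[meaningful])).all())
max_gap = float(np.abs(ref_vals - syn_vals)[off_diag].max())

print("directions match:", signs_match)
print(f"largest magnitude gap: {max_gap:.2f}")
```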

3. Downstream model accuracy

Train a simple model (e.g., logistic regression for no-show). The trends on synthetic data, such as which features matter and in which direction, should approximate those on real data, without the metrics being identical.
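The synthetic half of that comparison might look like the sketch below, which assumes scikit-learn is available. The features and the "true" no-show relationship are invented for illustration; once real data is approved, the same pipeline would be fit there and the coefficient directions compared.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 5000

X = pd.DataFrame({
    "age": np.clip(rng.lognormal(3.6, 0.4, n), 18, 95),
    "lead_time_days": rng.lognormal(2.3, 0.5, n),
    "chronic_conditions": rng.poisson(1.2, n),
})

# Assumed relationship, chosen only for illustration
logit = (-2.0
         + 0.03 * X["lead_time_days"]
         + 0.4 * X["chronic_conditions"]
         - 0.01 * (X["age"] - 40))
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Standardize so coefficient sizes are comparable across features
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
coefs = pd.Series(model.named_steps["logisticregression"].coef_[0], index=X.columns)

print(f"AUC on held-out synthetic rows: {auc:.2f}")
print(coefs.sort_values(ascending=False).round(2))
```

What matters for the check is the ordering and sign of the coefficients, not the AUC itself, since the AUC here only reflects how strong the invented relationship is.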

Quick Visualization Code:

import matplotlib.pyplot as plt

# Histograms for three key columns of the generated df
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
plt.hist(df['age'], bins=30, color='#4c78a8')
plt.title('Age')
plt.subplot(1, 3, 2)
plt.hist(df['lead_time_days'], bins=30, color='#f58518')
plt.title('Lead Time (days)')
plt.subplot(1, 3, 3)
plt.hist(df['chronic_conditions'], bins=15, color='#54a24b')
plt.title('Chronic Conditions')
plt.tight_layout()
plt.show()

# Spearman correlation heatmap
cols = ['age', 'lead_time_days', 'chronic_conditions', 'visit_frequency']
corr = df[cols].corr(method='spearman')
plt.figure(figsize=(5, 4))
plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar()
plt.xticks(range(len(cols)), cols, rotation=45)
plt.yticks(range(len(cols)), cols)
plt.title('Spearman Correlation')
plt.tight_layout()
plt.show()

Distribution plots for Age, Lead Time, and Chronic Conditions generated using Python (Matplotlib)

Spearman correlation heatmap generated using Python (Matplotlib)

A Case Study: Appointment Optimization

A healthcare organization wanted to explore:

reducing no-shows
optimizing scheduling slots
measuring wait time bottlenecks

But analysts didn’t have access to operational data until approvals were cleared.

By building a synthetic dataset like the one above, they could:

Test feature engineering
Build early prototypes
Try no-show prediction models
Experiment with scheduling simulations

When real data access was granted weeks later, 80% of the pipeline was already built. The team went from idea → deployment in 13 days, instead of 2–3 months.

Synthetic data didn’t replace real data.
It unlocked speed.

The Real Benefit: Freedom to Experiment

Synthetic data gives teams:

✔ The freedom to try ideas
✔ The freedom to fail safely
✔ The freedom to share datasets across teams
✔ The freedom to prototype without waiting for permissions
✔ The freedom to innovate faster than bureaucracy

In an industry where delays can be measured in lives, that freedom matters.

Final Thoughts

Healthcare data doesn’t need to be held hostage until access approvals are complete. With thoughtful synthetic generation, teams can collaborate, build, iterate, and innovate responsibly.

Not all synthetic data is equal, but when it’s done transparently and with behavioral realism, it becomes one of the most powerful accelerators for healthcare analytics.


Published via Towards AI


Note: Article content contains the views of the contributing authors and not Towards AI.
