Synthetic Data That Behaves: A Practical Guide to Generating Realistic Healthcare-Like Data Without Violating Privacy
Author(s): Abhishek Yadav
Originally published on Towards AI.
A hands-on guide to building synthetic data that looks, feels, and behaves like the real world without privacy risk
Healthcare organizations sit on treasure chests of data: appointments, lab results, care journeys, billing patterns, social determinants. Yet the very rules designed to protect patient privacy often make it nearly impossible for analysts and data scientists to experiment freely.
And that’s a problem.
Because innovation doesn’t start in production systems.
It starts with play: trying ideas, building prototypes, and running experiments without fear of leaking sensitive data.
That’s where synthetic data becomes one of the most powerful tools in a data scientist’s toolkit.
In this guide, I’ll show you a practical, no-nonsense approach to generating healthcare-like synthetic data that behaves like the real thing, preserving shape, distributions, trends, and correlations without exposing a single patient.
No GANs.
No deep learning.
No fragile black-box models.
Just a clean, transparent method that you can explain to your compliance team in under two minutes.
Why Synthetic Data Matters (More Than Ever)
Most organizations face three common constraints:
1. Privacy laws delay or block innovation.
HIPAA, GDPR, and internal policies often prevent teams from sharing even de-identified datasets internally.
2. Analysts can’t experiment fast.
Requesting access to PHI can take weeks or months, killing momentum for early ideas.
3. Teams need realistic datasets for prototypes and demos.
Executive demos, model validations, and vendor evaluations all need data… but not real patient data.
Synthetic data solves all three problems when done correctly:
✔ Behaves like real data
✔ Contains no patient information
✔ Can be shared freely across teams
✔ Speeds up model development dramatically
But Not All Synthetic Data Is Good Synthetic Data
If your synthetic data looks like uniform noise, you’ve only created fake data, not useful synthetic data.
Good synthetic data has three properties:
1. Statistical fidelity
Distributions, seasonal trends, missingness patterns, and class ratios resemble real data.
2. Relationship fidelity
Correlations stay intact, e.g., older age relates to higher visit frequency, chronic conditions relate to higher readmission risk, etc.
3. Behavioral fidelity
The shape of the data over time looks right: wait times, appointment lead times, cancellations, etc.
Our goal is not perfect replication of the real dataset.
Our goal is behavioral realism.
The Approach: A Transparent 3-Layer Synthetic Generator
Deep generative models (GANs, VAEs) can create beautiful synthetic datasets, but they are:
- Hard to tune
- Risk-prone (they may leak patterns from small datasets)
- A black box to compliance teams
Instead, I use a simple and explainable three-layer generator that works for most healthcare operations datasets.
Layer 1: Distributions That Match Reality
Every variable gets a distribution based on real-world behavior.
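Examples, matching the generator later in this guide:
- Age: log-normal, clipped to 18–95
- Appointment lead time (days): log-normal with a long right tail
- Provider type: categorical with fixed shares (Primary Care 55%, Cardiology 20%, Dermatology 15%, Neurology 10%)
- Baseline no-show probability: Beta(2, 10)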
This layer ensures your dataset looks real.
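One way to pick distribution parameters without touching row-level records is to start from aggregate targets, such as a median and a 90th percentile read off an existing report. The sketch below is a minimal illustration under that assumption; the helper names `lognormal_params` and `beta_params` and the target values (median of 10 days, 90th percentile of 19 days) are hypothetical, not figures from a real dataset.

```python
import numpy as np

Z90 = 1.2815515655446004  # 90th percentile of the standard normal distribution


def lognormal_params(median, p90):
    """Return (mu, sigma) of a log-normal with the given median and 90th percentile."""
    mu = np.log(median)                   # the median of a log-normal is exp(mu)
    sigma = (np.log(p90) - mu) / Z90      # the 90th percentile is exp(mu + Z90 * sigma)
    return mu, sigma


def beta_params(mean, concentration):
    """Return (a, b) of a Beta distribution with the given mean and a + b = concentration."""
    return mean * concentration, (1.0 - mean) * concentration


rng = np.random.default_rng(42)

# Hypothetical aggregate targets for appointment lead time, in days
mu, sigma = lognormal_params(median=10, p90=19)
lead_time = rng.lognormal(mean=mu, sigma=sigma, size=5000)

# Hypothetical target: average no-show probability of about 1 in 6
a, b = beta_params(mean=1 / 6, concentration=12)
no_show_prob = rng.beta(a, b, size=5000)

print(f"mu={mu:.3f}, sigma={sigma:.3f}, a={a:.2f}, b={b:.2f}")
```

With these particular targets the log-normal parameters land close to the `mean=2.3, sigma=0.5` used in the generator further down, which is one way such numbers could be justified to a reviewer.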
Layer 2: Respecting Correlations
Example relationships we preserve:
- Older patients → higher visit frequency
- Chronic conditions → longer appointment lead times
- New patients → higher no-show probability
- Certain specialties → higher cancellation rates
We approximate correlations using:
- Spearman rank correlation for skewed healthcare variables
- A copula model to generate correlated samples
- Logical rules layered on top (e.g., chronic_conditions > 3 → high_visit_risk)
The dataset now behaves like the real world.
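The copula step can be sketched with NumPy alone. The version below is a minimal, assumption-laden illustration of a Gaussian copula applied by rank reordering: draw each variable from its own distribution, draw correlated latent normals, then reorder the independent draws so they follow the latent ranks. The function names, the target Spearman value of 0.5, and the gamma distribution for visits are all hypothetical choices for the example.

```python
import numpy as np


def spearman_to_pearson(rho_s):
    """Pearson correlation of a bivariate normal whose Spearman correlation is rho_s."""
    return 2.0 * np.sin(np.pi * rho_s / 6.0)


def gaussian_copula_pair(x_draws, y_draws, rho_s, rng):
    """Reorder two independent samples so their rank correlation is close to rho_s.

    The marginal distributions are untouched: only the pairing of values changes.
    """
    n = len(x_draws)
    r = spearman_to_pearson(rho_s)
    cov = np.array([[1.0, r], [r, 1.0]])
    z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
    # Rank (0..n-1) of each latent normal within its column
    rank_x = z[:, 0].argsort().argsort()
    rank_y = z[:, 1].argsort().argsort()
    # The k-th smallest draw is placed wherever the latent rank is k
    return np.sort(x_draws)[rank_x], np.sort(y_draws)[rank_y]


def spearman(a, b):
    """Rank correlation for a quick check (ties are broken arbitrarily, not averaged)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])


rng = np.random.default_rng(7)
n = 5000
age_draws = rng.lognormal(mean=3.6, sigma=0.4, size=n)   # independent marginal
visit_draws = rng.gamma(shape=2.0, scale=1.5, size=n)    # hypothetical visits per year

age, visits = gaussian_copula_pair(age_draws, visit_draws, rho_s=0.5, rng=rng)
print(f"Spearman before: {spearman(age_draws, visit_draws):+.2f}")
print(f"Spearman after:  {spearman(age, visits):+.2f}")
```

Logical rules can then be applied on top of the reordered columns, for example flagging rows with a comparison such as `chronic_conditions > 3`. Heavily tied (discrete) variables will usually show a somewhat weaker rank correlation than the target.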
Layer 3: Realistic Time Behavior
Healthcare data is not static. It has patterns and seasonality:
- Hourly cycles (e.g., AM peak, lunch dip, PM ramp)
- Weekly patterns (Monday load, weekend effect)
- Seasonality (flu season spikes, December cancellations)
We add these via simple, explainable curves and holiday flags. No black boxes, no AI magic needed.
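As a rough sketch of how those curves could be combined, the example below multiplies an hourly shape, a day-of-week multiplier, a yearly seasonal curve, and a holiday flag into an expected appointment count per hour, then samples Poisson counts from it. Every number here (clinic hours of 08:00 to 18:00, the multipliers, the three holiday dates, the base rate of 6 per hour) is an illustrative assumption rather than a measured value.

```python
import numpy as np
import pandas as pd

hours = pd.date_range("2023-01-01", "2023-12-31 23:00", freq="h")
hour = hours.hour.to_numpy()
dow = hours.dayofweek.to_numpy()        # Monday = 0 ... Sunday = 6
doy = hours.dayofyear.to_numpy()

# Hourly cycle: morning peak near 10:00, lunch dip, smaller afternoon peak near 15:00
hourly = np.exp(-0.5 * ((hour - 10) / 1.5) ** 2) + 0.8 * np.exp(-0.5 * ((hour - 15) / 1.5) ** 2)
hourly = np.where((hour >= 8) & (hour < 18), hourly, 0.0)   # closed outside 08:00-18:00

# Weekly pattern: heavier Mondays, light weekends (hypothetical multipliers)
weekly = np.array([1.2, 1.0, 1.0, 1.0, 0.9, 0.3, 0.2])[dow]

# Yearly seasonality: a smooth winter peak around mid-January
seasonal = 1.0 + 0.25 * np.cos(2 * np.pi * (doy - 15) / 365)

# Holiday flag: demand drops to 10% on a few example dates
holidays = pd.to_datetime(["2023-01-01", "2023-07-04", "2023-12-25"])
is_holiday = hours.normalize().isin(holidays)
holiday = np.where(is_holiday, 0.1, 1.0)

base_rate = 6.0  # hypothetical appointments per hour at the morning peak
expected = base_rate * hourly * weekly * seasonal * holiday

rng = np.random.default_rng(42)
demand = pd.DataFrame({
    "timestamp": hours,
    "expected": expected,
    "appointments": rng.poisson(expected),
})
print(demand.groupby(dow)["appointments"].mean().round(2))
```

Because each factor is a plain array, any one of them can be plotted or swapped out on its own, which keeps the time layer easy to explain.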
A Minimal Python Generator You Can Trust
Here is a clean and readable version you can publish:
import numpy as np
import pandas as pd

np.random.seed(42)  # fixed seed so the dataset is reproducible
N = 5000  # dataset size (one row per synthetic appointment)
# ---- Layer 1: Distributions ----
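# Log-normal ages: median around exp(3.6) ≈ 37, clipped to an adult range
ages = np.random.lognormal(mean=3.6, sigma=0.4, size=N).astype(int)
ages = np.clip(ages, 18, 95)
# Log-normal lead times in days: median around exp(2.3) ≈ 10, long right tail
lead_time = np.random.lognormal(mean=2.3, sigma=0.5, size=N).astype(int)
provider_type = np.random.choice(
    ['Primary Care', 'Cardiology', 'Dermatology', 'Neurology'],
    size=N,
    p=[0.55, 0.20, 0.15, 0.10]
)
# Beta(2, 10) baseline no-show probability: mean 2/12 ≈ 0.17
no_show_prob = np.random.beta(2, 10, size=N)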
# ---- Layer 2: Relationships ----
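# Visit frequency rises with age (about one extra visit per 30 years) plus noise
visit_frequency = np.round((ages / 30) + np.random.normal(0, 1, N)).clip(0).astype(int)
chronic_conditions = np.random.poisson(lam=1.2, size=N)
# Rule-based adjustment: each chronic condition adds 2 percentage points of no-show risk
no_show_rate = np.clip(no_show_prob + chronic_conditions * 0.02, 0, 1)
no_show_flag = (np.random.rand(N) < no_show_rate).astype(int)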
# ---- Layer 3: Time Behavior ----
days = pd.date_range(start="2023-01-01", periods=N, freq="h")  # one row per hour
# Three full sine cycles across the N hourly rows (about 69 days per cycle)
seasonality = np.sin(np.linspace(0, 6*np.pi, N))
wait_time = (lead_time * (1 + 0.3*seasonality)).clip(0).astype(int)
# ---- Output ----
df = pd.DataFrame({
    'age': ages,
    'lead_time_days': lead_time,
    'provider': provider_type,
    'chronic_conditions': chronic_conditions,
    'visit_frequency': visit_frequency,
    'no_show': no_show_flag,
    'appointment_date': days,
    'adjusted_wait_time': wait_time
})
df.head()
This gives you:
- Realistic age curves
- Realistic lead times
- Realistic no-show patterns
- Seasonal behavior
- Clean correlations
A dataset that behaves, without ever touching PHI.
How Close Does Synthetic Data Need to Be?
The rule I use when working with compliance and clinical partners:
“The data should behave like the real world, not mimic any real patient.”
You can validate this using three checks:
1. Distribution plots
Compare real vs. synthetic histograms at a high level: shape, range, and skew should look alike.
2. Correlation matrices
Compare Spearman correlation matrices; directions should match, magnitudes should be plausible (not identical to any specific dataset).
3. Downstream model accuracy
Train a simple model (e.g., logistic regression for no‑show). Performance trends on synthetic data should approximate the real world (e.g., which features matter), without the metrics being identical.
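The third check can be sketched without any modeling library. The example below fits a small logistic regression by gradient descent in NumPy and prints standardized coefficients, so you can see which feature the model leans on. It is an illustration only: the data is generated inline, the `fit_logistic` helper is hypothetical, and the chronic-condition effect (8 percentage points per condition) is deliberately stronger than in the generator above so the signal is easy to read.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
age = np.clip(rng.lognormal(3.6, 0.4, n), 18, 95)
chronic = rng.poisson(1.2, n)

# Hypothetical rule: no-show risk depends on chronic conditions, not on age
p_no_show = np.clip(0.10 + 0.08 * chronic, 0, 1)
no_show = (rng.random(n) < p_no_show).astype(float)


def fit_logistic(X, y, lr=0.5, steps=2000):
    """Full-batch gradient descent on the logistic loss, using standardized features."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Xb = np.column_stack([np.ones(len(Xs)), Xs])   # prepend an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))          # predicted probabilities
        w -= lr * Xb.T @ (p - y) / len(y)          # gradient of the mean log-loss
    return w


X = np.column_stack([age, chronic])
weights = fit_logistic(X, no_show)
for name, w in zip(["intercept", "age", "chronic_conditions"], weights):
    print(f"{name:>20s}: {w:+.3f}")
```

Running the same fit on a real extract, once access is granted, and comparing the sign and rough ordering of the coefficients is one simple way to judge whether the synthetic data tells the same story.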
Quick Visualization Code:
import matplotlib.pyplot as plt

# Histograms of three key variables
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
plt.hist(df['age'], bins=30, color='#4c78a8')
plt.title('Age')
plt.subplot(1, 3, 2)
plt.hist(df['lead_time_days'], bins=30, color='#f58518')
plt.title('Lead Time (days)')
plt.subplot(1, 3, 3)
plt.hist(df['chronic_conditions'], bins=15, color='#54a24b')
plt.title('Chronic Conditions')
plt.tight_layout()
plt.show()

# Spearman correlation heatmap
cols = ['age', 'lead_time_days', 'chronic_conditions', 'visit_frequency']
corr = df[cols].corr(method='spearman')
plt.figure(figsize=(5, 4))
plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar()
plt.xticks(range(len(cols)), cols, rotation=45)
plt.yticks(range(len(cols)), cols)
plt.title('Spearman Correlation')
plt.tight_layout()
plt.show()
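Plots are useful for a first look, but the first two checks can also be reduced to numbers that fit in a review note. The sketch below is one possible approach, not a standard: a two-sample Kolmogorov-Smirnov statistic for distribution shape, and a comparison of Spearman matrices that reports sign agreement on the stronger pairs plus the largest gap. The `reference` frame is a stand-in generated inline, since no real extract is assumed to be available, and the 0.1 strength threshold is an arbitrary choice.

```python
import numpy as np
import pandas as pd


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.abs(cdf_a - cdf_b).max())


def compare_correlations(real, synth, cols, min_strength=0.1):
    """Return (share of strong pairs with matching sign, largest off-diagonal gap)."""
    r = real[cols].corr(method="spearman").to_numpy()
    s = synth[cols].corr(method="spearman").to_numpy()
    off_diag = ~np.eye(len(cols), dtype=bool)
    strong = off_diag & (np.abs(r) > min_strength)   # ignore near-zero correlations
    same_sign = float((np.sign(r) == np.sign(s))[strong].mean()) if strong.any() else float("nan")
    max_gap = float(np.abs(r - s)[off_diag].max())
    return same_sign, max_gap


def make_frame(seed, n=5000):
    """Stand-in data source used for both frames in this illustration."""
    rng = np.random.default_rng(seed)
    age = np.clip(rng.lognormal(3.6, 0.4, n), 18, 95)
    visits = np.round(age / 30 + rng.normal(0, 1, n)).clip(0)
    chronic = rng.poisson(1.2, n)
    return pd.DataFrame({"age": age, "visit_frequency": visits, "chronic_conditions": chronic})


reference = make_frame(seed=1)   # placeholder for the real extract
synthetic = make_frame(seed=2)

cols = ["age", "visit_frequency", "chronic_conditions"]
for col in cols:
    print(f"KS {col}: {ks_statistic(reference[col], synthetic[col]):.3f}")
same_sign, max_gap = compare_correlations(reference, synthetic, cols)
print(f"sign agreement on strong pairs: {same_sign:.2f}, largest gap: {max_gap:.3f}")
```

A KS value near 0 means the two distributions overlap closely and a value near 1 means they barely overlap. What counts as close enough is a judgment call to make with your compliance and clinical partners.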


A Case Study: Appointment Optimization
A healthcare organization wanted to explore:
- reducing no-shows
- optimizing scheduling slots
- measuring wait time bottlenecks
But analysts didn’t have access to operational data until approvals were cleared.
By building a synthetic dataset like the one above, they could:
- Test feature engineering
- Build early prototypes
- Try no-show prediction models
- Experiment with scheduling simulations
When real data access was granted weeks later, 80% of the pipeline was already built. The team went from idea → deployment in 13 days, instead of 2–3 months.
Synthetic data didn’t replace real data.
It unlocked speed.
The Real Benefit: Freedom to Experiment
Synthetic data gives teams:
✔ The freedom to try ideas
✔ The freedom to fail safely
✔ The freedom to share datasets across teams
✔ The freedom to prototype without waiting for permissions
✔ The freedom to innovate faster than bureaucracy
In an industry where delays can be measured in lives, that freedom matters.
Final Thoughts
Healthcare data doesn’t need to be held hostage until access approvals are complete. With thoughtful synthetic generation, teams can collaborate, build, iterate, and innovate responsibly.
Not all synthetic data is equal, but when it’s done transparently and with behavioral realism, it becomes one of the most powerful accelerators for healthcare analytics.
Published via Towards AI
Note: Content contains the views of the contributing authors and not Towards AI.