Synthetic Data That Behaves: A Practical Guide to Generating Realistic Healthcare-Like Data Without Violating Privacy

Author(s): Abhishek Yadav

Originally published on Towards AI.

A hands-on guide to building synthetic data that looks, feels, and behaves like the real world without privacy risk

Photo by Luke Chesser on Unsplash

Healthcare organizations sit on treasure chests of data: appointments, lab results, care journeys, billing patterns, social determinants. Yet the very rules designed to protect patient privacy often make it nearly impossible for analysts and data scientists to experiment freely.

And that’s a problem.

Because innovation doesn’t start in production systems.
It starts with play: trying ideas, building prototypes, and running experiments without fear of leaking sensitive data.

That’s where synthetic data becomes one of the most powerful tools in a data scientist’s toolkit.

In this guide, I’ll show you a practical, no-nonsense approach to generating healthcare-like synthetic data that behaves like the real thing, preserving shape, distributions, trends, and correlations without exposing a single patient.

No GANs.
No deep learning.
No fragile black-box models.

Just a clean, transparent method that you can explain to your compliance team in under two minutes.

Why Synthetic Data Matters (More Than Ever)

Most organizations face three common constraints:

1. Privacy laws delay or block innovation.
HIPAA, GDPR, and internal policies often prevent teams from sharing even de-identified datasets internally.

2. Analysts can’t experiment fast.
Requesting access to PHI can take weeks or months, killing momentum for early ideas.

3. Teams need realistic datasets for prototypes and demos.
Executive demos, model validations, and vendor evaluations all need data… but not real patient data.

Synthetic data solves all three problems when done correctly:

✔ Behaves like real data
✔ Contains no patient information
✔ Can be shared freely across teams
✔ Speeds up model development dramatically

But Not All Synthetic Data Is Good Synthetic Data

If your synthetic data looks like uniform noise, you’ve only created fake data, not useful synthetic data.

Good synthetic data has three properties:

1. Statistical fidelity
Distributions, seasonal trends, missingness patterns, and class ratios resemble real data.

2. Relationship fidelity
Correlations stay intact, e.g., older age relates to higher visit frequency, chronic conditions relate to higher readmission risk, etc.

3. Behavioral fidelity
The shape of data over time looks right: wait times, appointment lead times, cancellations, etc.

Our goal is not perfect replication of the real dataset.
Our goal is behavioral realism.

The Approach: A Transparent 3-Layer Synthetic Generator

Deep generative models (GANs, VAEs) can create beautiful synthetic datasets, but they are:

  • Hard to tune
  • Risk-prone (they may leak patterns from small datasets)
  • A black box to compliance teams

Instead, I use a simple and explainable three-layer generator that works for most healthcare operations datasets.

Layer 1: Distributions That Match Reality

Every variable gets a distribution based on real-world behavior.

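Examples (the same choices used in the generator later in this guide):

  • Age → log-normal, clipped to 18–95
  • Appointment lead time → log-normal (right-skewed)
  • Provider type → categorical with fixed proportions
  • No-show probability → beta distribution
  • Chronic conditions → Poisson counts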

This layer ensures your dataset looks real.
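
One way to pick distribution parameters without touching record-level data is to work backwards from summary statistics that an operations team can usually share. The sketch below assumes a log-normal shape and derives its parameters from a target median and 90th percentile. The helper name `lognormal_params` and the target values (10 and 19 days) are illustrative assumptions, not figures from a real system.

```python
import numpy as np
from scipy.stats import norm

def lognormal_params(median, p90):
    """Return (mu, sigma) of a log-normal with the given median and 90th percentile."""
    if not 0 < median < p90:
        raise ValueError("need 0 < median < p90")
    mu = np.log(median)                               # median of a log-normal is exp(mu)
    sigma = (np.log(p90) - mu) / norm.ppf(0.90)       # p90 is exp(mu + z_0.90 * sigma)
    return mu, sigma

rng = np.random.default_rng(0)

# Hypothetical targets: median lead time of 10 days, 90% of bookings within 19 days
mu, sigma = lognormal_params(median=10, p90=19)
lead_time = rng.lognormal(mean=mu, sigma=sigma, size=100_000)

print(round(mu, 2), round(sigma, 2))
print(np.median(lead_time), np.quantile(lead_time, 0.90))
```

With these targets the parameters come out at roughly 2.3 and 0.5, which is close to the lead-time settings used in the generator later in this guide.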

Layer 2: Respecting Correlations

Example relationships we preserve:

  • Older patients → higher visit frequency
  • Chronic conditions → longer appointment lead times
  • New patients → higher no-show probability
  • Certain specialties → higher cancellation rates

We approximate correlations using:

  • Spearman rank correlation for skewed healthcare variables
  • A copula model to generate correlated samples
  • Logical rules layered on top (e.g., chronic_conditions > 3 → high_visit_risk)

The dataset now behaves like the real world.
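
The copula step can be sketched in a few lines. This is a minimal Gaussian copula under stated assumptions: the target Spearman correlation of 0.4 between age and visit frequency, and the Poisson marginal for visit frequency, are made-up illustrative choices. The age and chronic-condition settings reuse the values from the generator later in this guide.

```python
import numpy as np
from scipy.stats import norm, lognorm, poisson, spearmanr

rng = np.random.default_rng(42)
N = 5000

# Assumed target: Spearman correlation between age and visit frequency
target_rho = 0.4
# For a Gaussian copula, the latent Pearson r that yields a given Spearman rho
r = 2 * np.sin(np.pi * target_rho / 6)
cov = np.array([[1.0, r], [r, 1.0]])

# Step 1: correlated standard normals
z = rng.multivariate_normal(mean=[0, 0], cov=cov, size=N)
# Step 2: map to correlated uniforms
u = norm.cdf(z)
# Step 3: push each uniform through the inverse CDF of its marginal
age = lognorm.ppf(u[:, 0], s=0.4, scale=np.exp(3.6))
age = np.clip(age, 18, 95).astype(int)
visit_frequency = poisson.ppf(u[:, 1], mu=1.5).astype(int)  # assumed marginal

# Logical rule layered on top (chronic conditions drawn independently here)
chronic_conditions = rng.poisson(lam=1.2, size=N)
high_visit_risk = (chronic_conditions > 3).astype(int)

rho, _ = spearmanr(age, visit_frequency)
print(f"target Spearman = {target_rho}, realized = {rho:.2f}")
```

Because visit frequency is a small integer count with many ties, the realized rank correlation typically lands a little below the target. If that matters, nudge `target_rho` up and re-check.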

Layer 3: Realistic Time Behavior

Healthcare data is not static. It has patterns and seasonality:

  • Hourly cycles (e.g., AM peak, lunch dip, PM ramp)
  • Weekly patterns (Monday load, weekend effect)
  • Seasonality (flu season spikes, December cancellations)
  • Holidays (handled with simple holiday flags)

We add these via simple, explainable curves; no black boxes or AI magic needed.
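
To make the curves concrete, here is a rough sketch of an hourly appointment-volume series built from multiplicative factors. Every number in it is an assumption chosen for illustration: the peak hours, the weekday multipliers, the mid-January seasonal peak, the two holiday dates, and the base rate of 12 appointments per hour.

```python
import numpy as np
import pandas as pd

ts = pd.date_range("2023-01-01", periods=24 * 365, freq="h")

# Hourly cycle: two bumps (AM peak near 09:00, PM ramp near 15:00) with a lunch dip between
hour = ts.hour.to_numpy()
hourly = (np.exp(-0.5 * ((hour - 9) / 1.5) ** 2)
          + 0.8 * np.exp(-0.5 * ((hour - 15) / 2.0) ** 2))

# Weekly pattern: multipliers for Monday..Sunday (heavy Monday, quiet weekend)
weekday_factor = np.array([1.3, 1.1, 1.0, 1.0, 0.9, 0.3, 0.2])
weekly = weekday_factor[ts.dayofweek.to_numpy()]

# Seasonality: one cosine cycle per year, peaking in mid-January (flu season)
day_of_year = ts.dayofyear.to_numpy()
seasonal = 1 + 0.25 * np.cos(2 * np.pi * (day_of_year - 15) / 365)

# Holiday flags: volume drops to 10% on the listed dates (hypothetical list)
holidays = pd.to_datetime(["2023-01-01", "2023-12-25"])
is_holiday = ts.normalize().isin(holidays)
holiday_factor = np.where(is_holiday, 0.1, 1.0)

base_rate = 12  # assumed appointments per hour at the morning peak
rate = base_rate * hourly * weekly * seasonal * holiday_factor

rng = np.random.default_rng(7)
arrivals = pd.Series(rng.poisson(rate), index=ts, name="appointments")

print(arrivals.groupby(arrivals.index.hour).mean().round(1))
```

Each factor is a plain array you can plot and explain on its own, which is the point of keeping this layer free of learned models.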

A Minimal Python Generator You Can Trust

Here is a clean and readable version:

import numpy as np
import pandas as pd

np.random.seed(42)  # fixed seed so the dataset is reproducible

N = 5000  # dataset size

# ---- Layer 1: Distributions ----
# Right-skewed ages, truncated to an adult range
ages = np.random.lognormal(mean=3.6, sigma=0.4, size=N).astype(int)
ages = np.clip(ages, 18, 95)

# Appointment lead time in whole days (right-skewed)
lead_time = np.random.lognormal(mean=2.3, sigma=0.5, size=N).astype(int)

provider_type = np.random.choice(
    ['Primary Care', 'Cardiology', 'Dermatology', 'Neurology'],
    size=N,
    p=[0.55, 0.20, 0.15, 0.10]
)

# Baseline no-show probability per appointment (Beta(2, 10) has mean 2/12)
no_show_prob = np.random.beta(2, 10, size=N)

# ---- Layer 2: Relationships ----
# Visit frequency rises with age, plus noise; negative values are floored at 0
visit_frequency = np.round((ages / 30) + np.random.normal(0, 1, N)).clip(0)
chronic_conditions = np.random.poisson(lam=1.2, size=N)

# Rule-based adjustment: each chronic condition adds 0.02 to the no-show probability
no_show_flag = (np.random.rand(N) < (no_show_prob + chronic_conditions*0.02)).astype(int)

# ---- Layer 3: Time Behavior ----
# One timestamp per hour, so N = 5000 rows span roughly 208 days
days = pd.date_range(start="2023-01-01", periods=N, freq="h")
# Three full sine cycles across the date range (about 69 days per cycle)
seasonality = np.sin(np.linspace(0, 6*np.pi, N))

# Lead time scaled up or down by as much as 30% depending on the season
wait_time = (lead_time * (1 + 0.3*seasonality)).clip(0).astype(int)

# ---- Output ----
df = pd.DataFrame({
    'age': ages,
    'lead_time_days': lead_time,
    'provider': provider_type,
    'chronic_conditions': chronic_conditions,
    'visit_frequency': visit_frequency,
    'no_show': no_show_flag,
    'appointment_date': days,
    'adjusted_wait_time': wait_time
})

print(df.head())

This gives you:

  • Realistic age curves
  • Realistic lead times
  • Realistic no-show patterns
  • Seasonal behavior
  • Clean correlations

A dataset that behaves, without ever touching PHI.

How Close Does Synthetic Data Need to Be?

The rule I use when working with compliance and clinical partners:

“The data should behave like the real world, not mimic any real patient.”

You can validate this using three checks:

1. Distribution plots

Compare real vs. synthetic at a high level.

2. Correlation matrices

Compare Spearman correlation matrices; directions should match, magnitudes should be plausible (not identical to any specific dataset).

3. Downstream model accuracy

Train a simple model (e.g., logistic regression for no‑show). The patterns it learns on synthetic data, such as which features matter and in which direction, should resemble those learned on real data, without the metrics being identical.
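
The third check can be sketched as follows: fit the same logistic regression on both datasets and compare the direction and ranking of the coefficients. In practice a library such as scikit-learn would be the usual choice; this version fits the model with `scipy.optimize.minimize` only to keep dependencies to NumPy and SciPy. Both datasets below are simulated stand-ins (the "real" one included), and the effect sizes in `make_data` are assumptions made up for the demo.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def fit_logistic(X, y):
    """Fit logistic regression by maximum likelihood; returns [intercept, coef_1, ...]."""
    Xb = np.column_stack([np.ones(len(X)), X])

    def nll(w):
        z = Xb @ w
        # mean negative log-likelihood, written in a numerically stable form
        return np.mean(np.logaddexp(0, z) - y * z)

    def grad(w):
        return Xb.T @ (expit(Xb @ w) - y) / len(y)

    return minimize(nll, np.zeros(Xb.shape[1]), jac=grad, method="BFGS").x

def make_data(rng, n, chronic_effect):
    """Hypothetical no-show data driven by chronic conditions and lead time."""
    chronic = rng.poisson(1.2, n)
    lead = rng.lognormal(2.3, 0.5, n)
    logit = -2.0 + chronic_effect * chronic + 0.03 * lead
    no_show = (rng.random(n) < expit(logit)).astype(int)
    X = np.column_stack([chronic, lead])
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize so coefficients are comparable
    return X, no_show

rng = np.random.default_rng(3)
X_real, y_real = make_data(rng, 4000, chronic_effect=0.50)  # stand-in for a real extract
X_syn, y_syn = make_data(rng, 5000, chronic_effect=0.45)    # synthetic dataset

w_real = fit_logistic(X_real, y_real)
w_syn = fit_logistic(X_syn, y_syn)

for name, a, b in zip(["chronic_conditions", "lead_time_days"], w_real[1:], w_syn[1:]):
    print(f"{name:20s} real={a:+.2f}  synthetic={b:+.2f}")
```

Matching signs and the same feature ranking on both sides is the signal to look for. Identical coefficients would be a warning sign rather than a success.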

Quick Visualization Code:

import matplotlib.pyplot as plt

plt.figure(figsize=(12,4))
plt.subplot(1,3,1); plt.hist(df['age'], bins=30, color='#4c78a8'); plt.title('Age')
plt.subplot(1,3,2); plt.hist(df['lead_time_days'], bins=30, color='#f58518'); plt.title('Lead Time (days)')
plt.subplot(1,3,3); plt.hist(df['chronic_conditions'], bins=15, color='#54a24b'); plt.title('Chronic Conditions')
plt.tight_layout(); plt.show()

# Spearman correlation heatmap
cols = ['age','lead_time_days','chronic_conditions','visit_frequency']
corr = df[cols].corr(method='spearman')
plt.figure(figsize=(5,4))
plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar(); plt.xticks(range(len(cols)), cols, rotation=45); plt.yticks(range(len(cols)), cols)
plt.title('Spearman Correlation'); plt.tight_layout(); plt.show()

Figure: Distribution plots for Age, Lead Time, and Chronic Conditions generated using Python (Matplotlib)
Figure: Spearman correlation heatmap generated using Python (Matplotlib)
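
Plots are easier to trust when backed by a couple of numbers. The sketch below is one possible numeric companion to the first two checks: it compares quantiles and a two-sample Kolmogorov-Smirnov statistic for one column, then compares Spearman matrices for direction and size. The `toy_frame` and `compare_spearman` helpers, the `min_abs` threshold, and both datasets are hypothetical stand-ins; in practice `real_df` would come from an approved extract.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def toy_frame(rng, n, slope, mu):
    """Hypothetical stand-in data: age, age-driven visit frequency, log-normal lead time."""
    age = rng.integers(18, 96, size=n)
    visits = np.round(age * slope + rng.normal(0, 1, n)).clip(0)
    lead = rng.lognormal(mu, 0.5, n)
    return pd.DataFrame({"age": age, "visit_frequency": visits, "lead_time_days": lead})

def compare_spearman(real_df, synth_df, cols, min_abs=0.1):
    """Return (share of meaningful pairs with matching sign, largest off-diagonal gap)."""
    real_corr = real_df[cols].corr(method="spearman").to_numpy()
    synth_corr = synth_df[cols].corr(method="spearman").to_numpy()
    off_diag = ~np.eye(len(cols), dtype=bool)
    # Judge direction only where the real correlation is not close to zero
    # (if no pair clears min_abs, the share is undefined and comes back as nan)
    meaningful = off_diag & (np.abs(real_corr) >= min_abs)
    sign_match = float(np.mean(np.sign(real_corr[meaningful]) == np.sign(synth_corr[meaningful])))
    max_gap = float(np.max(np.abs(real_corr - synth_corr)[off_diag]))
    return sign_match, max_gap

rng = np.random.default_rng(1)
real_df = toy_frame(rng, 3000, slope=1 / 25, mu=2.35)   # stand-in for real data
synth_df = toy_frame(rng, 5000, slope=1 / 30, mu=2.30)  # synthetic data

# Check 1: distribution shape for one column
qs = [0.10, 0.50, 0.90]
real_q = real_df["lead_time_days"].quantile(qs).to_numpy()
synth_q = synth_df["lead_time_days"].quantile(qs).to_numpy()
ks_stat, _ = ks_2samp(real_df["lead_time_days"], synth_df["lead_time_days"])
print("real quantiles:     ", real_q.round(1))
print("synthetic quantiles:", synth_q.round(1))
print(f"KS statistic: {ks_stat:.3f} (0 means identical empirical distributions)")

# Check 2: correlation direction and magnitude
sign_match, max_gap = compare_spearman(real_df, synth_df, list(real_df.columns))
print(f"sign agreement: {sign_match:.0%}, largest correlation gap: {max_gap:.2f}")
```

What counts as "close enough" is a judgment call for each project; the thresholds here are placeholders to agree on with compliance and clinical partners.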

A Case Study: Appointment Optimization

A healthcare organization wanted to explore:

  • reducing no-shows
  • optimizing scheduling slots
  • measuring wait time bottlenecks

But analysts didn’t have access to operational data until approvals were cleared.

By building a synthetic dataset like the one above, they could:

  • Test feature engineering
  • Build early prototypes
  • Try no-show prediction models
  • Experiment with scheduling simulations

When real data access was granted weeks later, 80% of the pipeline was already built. The team went from idea → deployment in 13 days, instead of 2–3 months.

Synthetic data didn’t replace real data.
It unlocked speed.

The Real Benefit: Freedom to Experiment

Synthetic data gives teams:

✔ The freedom to try ideas
✔ The freedom to fail safely
✔ The freedom to share datasets across teams
✔ The freedom to prototype without waiting for permissions
✔ The freedom to innovate faster than bureaucracy

In an industry where delays can be measured in lives, that freedom matters.

Final Thoughts

Healthcare data doesn’t need to be held hostage until access approvals are complete. With thoughtful synthetic generation, teams can collaborate, build, iterate, and innovate responsibly.

Not all synthetic data is equal, but when it’s done transparently and with behavioral realism, it becomes one of the most powerful accelerators for healthcare analytics.
