Why Missing Data Is Not Missing at Random and Why That Matters

Last Updated on November 11, 2025 by Editorial Team

Author(s): ANGELI WICKRAMA ARACHCHI

Originally published on Towards AI.


The medical study that got it wrong

You’re analyzing a clinical trial for a new antidepressant:

1,000 patients enrolled
700 completed the 3-month follow-up
300 patients have missing follow-up scores

Your team says: “Let’s just use the 700 complete cases or fill missing values with the average.”

But here’s what actually happened:

Patients who got much worse didn’t return (too depressed).
Patients who felt completely cured didn’t return (they didn’t think they needed to).
Only the moderate-improvement group reliably showed up.

Your conclusion: “The drug shows mild effectiveness.”
Reality: the drug either works brilliantly or fails catastrophically.

The missing data was the story.
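To make the story concrete, here is a minimal simulation sketch. The group sizes (150 much worse, 150 cured, 700 moderate) follow the scenario above, with the 300 dropouts split evenly as an assumption; the score scale, means, and spreads are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical change in depression score after 3 months (negative = improvement).
# Group sizes follow the story; the means and spreads are made up.
much_worse = rng.normal(loc=8.0, scale=2.0, size=150)
cured = rng.normal(loc=-20.0, scale=2.0, size=150)
moderate = rng.normal(loc=-5.0, scale=2.0, size=700)

change = np.concatenate([much_worse, cured, moderate])
# The first 300 patients (both extreme groups) never return for follow-up.
returned = np.concatenate([np.zeros(300, dtype=bool), np.ones(700, dtype=bool)])

print("patients with follow-up:", int(returned.sum()))
print(f"mean change, completers:  {change[returned].mean():.1f}")
print(f"mean change, everyone:    {change.mean():.1f}")
print(f"spread (std), completers: {change[returned].std(ddof=1):.1f}")
print(f"spread (std), everyone:   {change.std(ddof=1):.1f}")
```

Under these assumed numbers the two averages land close together, but the spread among completers is a fraction of the real one. The mean alone would never reveal the two extreme groups.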

The Three Types of Missing Data (And Why It Matters)

🟢 MCAR — Missing Completely At Random

What it means: Data is missing due to pure randomness — like a coin flip.

Examples:

Lab technician spills coffee on random survey forms
Sensor randomly fails due to manufacturing defect
Database bug drops random records

Visual pattern:

Patient | Age | Income | Health
--------|-----|---------|-------
Alice | 25 | $50,000 | 85
Bob | 34 | [?] | 90 ← Random
Carol | 45 | $75,000 | [?] ← Random
David | 29 | $48,000 | 78

Good news: Simple methods are defensible. Deleting incomplete rows gives unbiased estimates, and mean imputation keeps the mean unbiased (though it still shrinks the variance).

Bad news: MCAR is rare in real data.

🟡 MAR — Missing At Random

What it means: Missingness depends on observed variables. Once you account for them, it no longer depends on the missing value itself.

Example: Older people are less likely to report income (privacy concerns). Age (which you have) predicts missingness; among people of the same age, actual income (which is missing) doesn’t.

Visual pattern:

Patient | Age | Income | Responded?
--------|-----|---------|------------
Emma | 67 | [?] | No ← Older
Frank | 65 | [?] | No ← Older
Grace | 28 | $45,000 | Yes
Henry | 31 | $52,000 | Yes

Key insight: Because age explains the missingness, you can use the observed relationship between age and income to impute intelligently.

🔴 MNAR — Missing Not At Random

What it means: The chance that a value is missing depends on the value itself, even after accounting for everything you observed.

Examples:

Weight-tracking app: People skip logging after weight gain
Income surveys: Very high and very low earners don’t report
Depression studies: The most depressed patients don’t return

Visual pattern:

User | Week1 | Week2 | Week3
-----|-------|-------|-------
Jane | 150 | 149 | 148 ✓ Losing
John | 180 | 182 | [?] ✗ Gaining = stopped logging

The danger: The data you see is fundamentally biased. You can’t see the pattern because the evidence is missing.
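The three mechanisms are easier to tell apart when you generate them yourself. The sketch below is a simulation under assumed numbers: the population, the `age` and `income` variables, and the missingness probabilities are all hypothetical. It measures how far the mean of the observed incomes drifts from the true mean under each mechanism.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical population: income rises with age.
age = rng.uniform(20, 70, size=n)
income = 20_000 + 1_000 * age + rng.normal(0, 10_000, size=n)

# MCAR: every value has the same 30% chance of being missing.
mcar_missing = rng.random(n) < 0.30

# MAR: the chance of being missing depends only on age (which is observed).
mar_missing = rng.random(n) < np.where(age > 50, 0.60, 0.10)

# MNAR: the chance of being missing depends on income itself (which is not).
mnar_missing = rng.random(n) < np.where(income > 80_000, 0.60, 0.10)

true_mean = income.mean()
bias = {}
for name, mask in [("MCAR", mcar_missing), ("MAR", mar_missing), ("MNAR", mnar_missing)]:
    bias[name] = income[~mask].mean() - true_mean
    print(f"{name}: observed mean is off by {bias[name]:+,.0f}")
```

In this setup the observed mean stays close to the truth only under MCAR. It is pulled down under both MAR and MNAR; the difference is that under MAR the variable driving the gap (age) is in your data, so the bias can be corrected, while under MNAR it is not.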

Why Mean Imputation Destroys MNAR Data

Scenario: Employee salary survey. High earners refuse to report (privacy).

Your data:

Employee | Salary
---------|----------
Alice | $45,000
Bob | $52,000
Carol | $48,000
David | [?] ← Actually $250,000
Emma | $55,000
Frank | [?] ← Actually $300,000
Grace | $50,000

Mean of observed values = $50,000
Imputing missing with $50k hides the real extremes.

Consequence:

True mean ≈ $114,000, but you report $50,000 — a 56% underestimate.
Variance collapses. Correlations break. Policy decisions become catastrophically wrong.

Rule of thumb: Mean imputation is only defensible under MCAR, and even there it shrinks variance. Under MNAR, it introduces massive bias.
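The arithmetic behind those numbers can be checked in a few lines. This sketch uses the seven salaries from the table above; the "actual" values for David and Frank are known only because the example states them, which is never the case with real data.

```python
import numpy as np

# Salaries from the example; NaN marks the two non-responders.
reported = np.array([45_000, 52_000, 48_000, np.nan, 55_000, np.nan, 50_000])
# True values, available here only because the example reveals them.
actual = np.array([45_000, 52_000, 48_000, 250_000, 55_000, 300_000, 50_000], dtype=float)

observed_mean = np.nanmean(reported)
imputed = np.where(np.isnan(reported), observed_mean, reported)

true_mean = actual.mean()
underestimate = (true_mean - imputed.mean()) / true_mean

print(f"observed mean:       {observed_mean:,.0f}")    # 50,000
print(f"mean after imputing: {imputed.mean():,.0f}")   # 50,000
print(f"true mean:           {true_mean:,.0f}")        # 114,286
print(f"underestimate:       {underestimate:.0%}")     # 56%
print(f"std after imputing:  {imputed.std(ddof=1):,.0f}")
print(f"true std:            {actual.std(ddof=1):,.0f}")
```

Filling the gaps with the observed mean leaves the mean exactly where it was and drops two values with zero deviation into the data, which is why the spread shrinks as well.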

How to Detect Which Type You Have

1. Visual Inspection

Create a missingness indicator and plot it against other features:

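import seaborn as sns

# Assumes df is a pandas DataFrame with 'income' and 'age' columns

# Create missingness indicator (1 = income is missing, 0 = observed)
df['income_missing'] = df['income'].isna().astype(int)

# Compare the age distribution of rows with and without income
sns.boxplot(data=df, x='income_missing', y='age')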

Look for:

MCAR: No visible pattern.
MAR: Missingness correlates with observed columns (age, gender).
MNAR: Cannot be confirmed from the observed data alone. Watch for missingness that associates with related outcomes (e.g., spending), or domain logic suggesting the missing value itself causes nonresponse.
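A plot is easier to read next to a number. As a rough companion to the boxplot, this sketch tabulates the share of missing income per age band. The ten rows of data are invented for illustration, and the band edges are an arbitrary choice; the column names follow the snippet above.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data.
df = pd.DataFrame({
    "age":    [25, 34, 45, 29, 67, 65, 28, 31, 70, 62],
    "income": [50_000, 61_000, 75_000, 48_000, np.nan, np.nan,
               45_000, 52_000, np.nan, 58_000],
})
df["income_missing"] = df["income"].isna().astype(int)

# Share of missing income within each age band.
age_band = pd.cut(df["age"], bins=[0, 40, 60, 120],
                  labels=["<=40", "41-60", ">60"]).rename("age_band")
missing_rate = df.groupby(age_band, observed=False)["income_missing"].mean()
print(missing_rate)

# Correlation between the 0/1 indicator and age.
age_corr = df["income_missing"].corr(df["age"])
print(f"correlation with age: {age_corr:.2f}")
```

A missingness rate that climbs with age points away from MCAR. It does not separate MAR from MNAR, because the values you would need for that comparison are the ones you do not have.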

2. Little’s MCAR Test

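A statistical test that checks whether the means of the observed variables differ across missingness patterns.

H₀: Data is MCAR
If p < 0.05 → reject MCAR (so data is MAR or MNAR; the test cannot tell which)
If p ≥ 0.05 → MCAR is plausible, not proven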
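The sketch below is not Little's test. It is a much simpler single-variable check in the same spirit: a permutation test of whether one observed variable (age) has a different mean in rows where another variable (income) is missing. The function name and the ten data points are hypothetical.

```python
import numpy as np

def permutation_test_missingness(values, is_missing, n_permutations=10_000, seed=0):
    """Two-sided permutation test: does `values` differ in mean between
    rows where another column is missing and rows where it is observed?"""
    values = np.asarray(values, dtype=float)
    is_missing = np.asarray(is_missing, dtype=bool)
    rng = np.random.default_rng(seed)

    observed_diff = values[is_missing].mean() - values[~is_missing].mean()
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the labels to mimic missingness unrelated to `values`.
        shuffled = rng.permutation(is_missing)
        diff = values[shuffled].mean() - values[~shuffled].mean()
        if abs(diff) >= abs(observed_diff):
            hits += 1
    # Add-one correction keeps the p-value strictly above zero.
    return observed_diff, (hits + 1) / (n_permutations + 1)

age = [25, 34, 45, 29, 67, 65, 28, 31, 70, 62]
income_missing = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]
diff, p_value = permutation_test_missingness(age, income_missing)
print(f"mean age difference: {diff:.1f}, p = {p_value:.4f}")
```

A small p-value here is evidence against MCAR for this pair of columns only. Little's test combines that kind of comparison across all variables and all missingness patterns in one statistic.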

3. Domain Knowledge (Most Important)

Ask yourself:

Would people with extreme values hide their answers?
Is this sensitive data? (income, weight, health)
Is this self-reported?
Do dropouts happen in longitudinal studies?

If yes → Probably MNAR

How to Handle Missing Data: The Right Way

For MCAR (Rare):

# Simple deletion works
df_clean = df.dropna()

Unbiased only if missingness is truly random.
Loses statistical power, so it is practical only when the share of incomplete rows is small.
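How much power you lose is easy to underestimate, because dropna() discards a row if any column has a gap. A sketch with assumed numbers (10 columns, 5% missing in each, independently):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_rows, n_cols, miss_rate = 10_000, 10, 0.05

# Hypothetical numeric table with 5% MCAR missingness in every column.
data = rng.normal(size=(n_rows, n_cols))
data[rng.random((n_rows, n_cols)) < miss_rate] = np.nan
df = pd.DataFrame(data, columns=[f"x{i}" for i in range(n_cols)])

kept = len(df.dropna()) / n_rows
expected_kept = (1 - miss_rate) ** n_cols  # chance that a row has no gaps at all

print(f"rows kept by dropna(): {kept:.1%}")
print(f"expected under MCAR:   {expected_kept:.1%}")
```

With these settings roughly 40% of the rows disappear, even though each individual column is 95% complete.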

For MAR (Common):

MICE (Multiple Imputation by Chained Equations):

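# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Assumes df contains only numeric columns; the result is a NumPy array
imputer = IterativeImputer(max_iter=10, random_state=0)
df_imputed = imputer.fit_transform(df)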

KNN Imputation:

from sklearn.impute import KNNImputer

# Fills each gap with the average of the 5 most similar rows (numeric columns only)
imputer = KNNImputer(n_neighbors=5)
df_imputed = imputer.fit_transform(df)

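These use relationships between observed variables to predict missing values. As written, each call returns a single filled-in dataset, so the uncertainty of the imputed values is not preserved; that requires multiple imputation, i.e. several imputed datasets whose results are pooled.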
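To carry imputation uncertainty into the final estimate, one common approach is to impute several times and pool the results with Rubin's rules. The sketch below does this with IterativeImputer and sample_posterior=True, using a different random_state per run. The simulated `age`/`income` data, the choice of 20 imputations, and the target statistic (mean income) are all assumptions for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: income depends on age; older people often skip the question (MAR).
age = rng.uniform(20, 70, size=n)
income = 20_000 + 1_000 * age + rng.normal(0, 10_000, size=n)
income_obs = income.copy()
income_obs[(age > 50) & (rng.random(n) < 0.6)] = np.nan
X = np.column_stack([age, income_obs])

m = 20  # number of imputed datasets
means, variances = [], []
for seed in range(m):
    # sample_posterior=True draws imputations instead of using point predictions.
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    completed_income = imputer.fit_transform(X)[:, 1]
    means.append(completed_income.mean())
    variances.append(completed_income.var(ddof=1) / n)  # squared SE of the mean

# Rubin's rules: average the estimates, then add within- and between-imputation variance.
pooled_mean = np.mean(means)
within = np.mean(variances)
between = np.var(means, ddof=1)
total_se = np.sqrt(within + (1 + 1 / m) * between)

print(f"complete-case mean: {np.nanmean(income_obs):,.0f}")
print(f"pooled MI mean:     {pooled_mean:,.0f} (SE {total_se:,.0f})")
print(f"true mean:          {income.mean():,.0f}")
```

The between-imputation term is what a single imputed dataset cannot give you: it grows when the imputations disagree, which widens the standard error accordingly.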

For MNAR (Most Dangerous):

You cannot “fix” MNAR without assumptions or extra data. Use:

Pattern-mixture models: model observed vs missing patterns separately.
Selection models: two-part models that model missingness probability and outcomes together.
Sensitivity analysis: try optimistic/pessimistic scenarios and report a range.

Example reporting table (illustrative numbers):

Scenario | Estimated Mean | Conclusion
----------------------|----------------|------------
Optimistic (MAR) | $50,000 | No action needed
Conservative | $75,000 | Moderate concern
Realistic (MNAR) | $85,000 | Significant inequality
Worst case | $120,000 | Crisis level

Report: "True value likely $75k-$85k, not $50k"

The honest approach: Report a range, not false precision.
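One simple way to produce such a range is to fill the gaps under several explicit assumptions and report every result. The sketch below reuses the five observed salaries from the earlier survey example. The scenario multipliers are hypothetical choices, not estimates, and the output is not meant to reproduce the table above.

```python
import numpy as np

# Observed salaries from the earlier example; two responses are missing.
observed = np.array([45_000, 52_000, 48_000, 55_000, 50_000], dtype=float)
n_missing = 2
base = observed.mean()  # what plain mean imputation would fill in

# Hypothetical scenarios: non-responders earn `multiplier` times the observed mean.
scenarios = {"same as responders": 1.0, "1.5x": 1.5, "2x": 2.0, "5x": 5.0}

results = {}
for label, multiplier in scenarios.items():
    filled = np.concatenate([observed, np.full(n_missing, base * multiplier)])
    results[label] = filled.mean()
    print(f"{label:>20}: estimated mean = {results[label]:,.0f}")

print(f"reported range: {min(results.values()):,.0f} to {max(results.values()):,.0f}")
```

The value of the exercise is less in any single number than in seeing how fast the estimate moves once the "non-responders look like responders" assumption is relaxed.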

Quick decision framework

1. How much is missing?

<5% — likely manageable
5–20% — investigate carefully
>20% — serious concern

2. Run Little’s MCAR test

p > 0.05 → MCAR plausible
p < 0.05 → not MCAR

3. Check domain knowledge

Sensitive/self-reported → likely MNAR
Technical failure → likely MCAR
Correlates with observed vars → MAR

4. Choose method

MCAR → drop/simple impute
MAR → MICE / KNN
MNAR → sensitivity analysis, pattern/selection models

5. Validate

Compare analyses with and without imputation
Report assumptions and uncertainty
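The checklist can also be written down as a small helper, which makes the ordering of the rules explicit. This is only a sketch: the function name and arguments are hypothetical, the thresholds are the ones listed above, and letting domain knowledge override the test result is a design choice made here, not a statistical rule.

```python
def suggest_strategy(missing_fraction, little_p_value,
                     sensitive_or_self_reported, explained_by_observed):
    """Encode the quick decision framework as a lookup (illustrative only)."""
    if missing_fraction < 0.05:
        severity = "likely manageable"
    elif missing_fraction <= 0.20:
        severity = "investigate carefully"
    else:
        severity = "serious concern"

    # Domain knowledge is checked first, since no test can rule out MNAR.
    if sensitive_or_self_reported:
        mechanism = "MNAR"
        method = "sensitivity analysis, pattern-mixture or selection models"
    elif little_p_value < 0.05 or explained_by_observed:
        mechanism = "MAR"
        method = "MICE / KNN imputation"
    else:
        mechanism = "MCAR"
        method = "drop rows or simple imputation"

    return {"severity": severity, "likely_mechanism": mechanism, "method": method}

print(suggest_strategy(0.30, 0.001, True, False))
print(suggest_strategy(0.03, 0.40, False, False))
```

Treat the output as a starting point for discussion, then validate as in step 5.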

Key Takeaways

✅ Missingness is information — Patterns reveal the data’s true story

✅ Not all missing data is equal — MCAR/MAR/MNAR need different approaches

✅ Mean imputation is dangerous — Only works for MCAR, catastrophic for MNAR

✅ Domain knowledge > fancy algorithms — Understanding WHY matters most

✅ You can’t always fix it — Sometimes honesty about uncertainty is best

The Question That Changes Everything

Next time you see missing data, don’t ask: ❌ “How do I fill this in quickly?”

Ask: ✅ “Why is this missing — and what does that tell me?”

Because the people who didn’t answer might be exactly the ones you need to understand most.


