Why Missing Data Is Not Missing at Random and Why That Matters
Last Updated on November 11, 2025 by Editorial Team
Author(s): ANGELI WICKRAMA ARACHCHI
Originally published on Towards AI.

The medical study that got it wrong
You’re analyzing a clinical trial for a new antidepressant:
- 1,000 patients enrolled
- 700 completed the 3-month follow-up
- 300 patients have missing follow-up scores
Your team says: “Let’s just use the 700 complete cases or fill missing values with the average.”
But here’s what actually happened:
- Patients who got much worse didn’t return (too depressed).
- Patients who felt completely cured didn’t return (they didn’t think they needed to).
- Only the moderate-improvement group reliably showed up.
Your conclusion: “The drug shows mild effectiveness.”
Reality: the drug either works brilliantly or fails catastrophically.
The missing data was the story.
The Three Types of Missing Data (And Why It Matters)
🟢 MCAR — Missing Completely At Random
What it means: The chance that a value is missing is unrelated to any variable, observed or unobserved. It behaves like a coin flip.
Examples:
- Lab technician spills coffee on random survey forms
- Sensor randomly fails due to manufacturing defect
- Database bug drops random records
Visual pattern:
Patient | Age | Income | Health
--------|-----|---------|-------
Alice | 25 | $50,000 | 85
Bob | 34 | [?] | 90 ← Random
Carol | 45 | $75,000 | [?] ← Random
David | 29 | $48,000 | 78
Good news: Simple methods work. Deleting incomplete rows gives unbiased estimates, and mean imputation preserves the mean (though it still shrinks variance).
Bad news: Truly MCAR data is rare in practice (roughly ~5% of cases, as a rule of thumb rather than a measured rate)
🟡 MAR — Missing At Random
What it means: Missingness depends on observed variables, not the missing value itself.
Example: Older people are less likely to report income (privacy concerns). Age (which you have) predicts missingness; once you account for age, the actual income (which is missing) does not.
Visual pattern:
Patient | Age | Income | Responded?
--------|-----|---------|------------
Emma | 67 | [?] | No ← Older
Frank | 65 | [?] | No ← Older
Grace | 28 | $45,000 | Yes
Henry | 31 | $52,000 | Yes
Key insight: You can use the relationship between age and missingness to impute intelligently.
🔴 MNAR — Missing Not At Random
What it means: Missingness depends on the missing value itself, even after accounting for everything you observed.
Examples:
- Weight-tracking app: People skip logging after weight gain
- Income surveys: Very high and very low earners don’t report
- Depression studies: Most depressed patients don’t return
Visual pattern:
User | Week1 | Week2 | Week3
-----|-------|-------|-------
Jane | 150 | 149 | 148 ✓ Losing
John | 180 | 182 | [?] ✗ Gaining = stopped logging
The danger: The data you see is fundamentally biased. You can’t see the pattern because the evidence is missing.
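To make the three mechanisms concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the synthetic age/income population, the logistic missingness probabilities, and the helper name `age_adjusted_mean`. It compares the naive mean of the reported incomes with a mean that adjusts for age using a simple linear fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical population: income rises with age, plus noise.
age = rng.uniform(20, 70, size=n)
income = 20_000 + 1_000 * age + rng.normal(0, 10_000, size=n)
true_mean = income.mean()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Probability that income goes unreported under each mechanism.
p_missing = {
    "MCAR": np.full(n, 0.3),                        # unrelated to anything
    "MAR": sigmoid((age - 45) / 5),                 # driven by observed age only
    "MNAR": sigmoid((income - true_mean) / 5_000),  # driven by income itself
}

def age_adjusted_mean(observed):
    """Fit income ~ age on observed rows, then average predictions over everyone."""
    slope, intercept = np.polyfit(age[observed], income[observed], 1)
    return (intercept + slope * age).mean()

results = {}
for name, p in p_missing.items():
    observed = rng.random(n) >= p  # True where income was reported
    results[name] = {
        "naive": income[observed].mean(),
        "age_adjusted": age_adjusted_mean(observed),
    }
    print(f"{name}: true {true_mean:,.0f} | naive {results[name]['naive']:,.0f} "
          f"| age-adjusted {results[name]['age_adjusted']:,.0f}")
```

In this setup the naive mean should land close to the truth only under MCAR. Adjusting for age should repair the MAR case, because age explains the missingness. Under MNAR neither estimate recovers the truth, since the information needed for the correction is exactly what went missing.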
Why Mean Imputation Destroys MNAR Data
Scenario: Employee salary survey. High earners refuse to report (privacy).
Your data:
Employee | Salary
---------|----------
Alice | $45,000
Bob | $52,000
Carol | $48,000
David | [?] ← Actually $250,000
Emma | $55,000
Frank | [?] ← Actually $300,000
Grace | $50,000
- Mean of observed values = $50,000
- Imputing missing with $50k hides the real extremes.
Consequence:
- True mean ≈ $114,000, but you report $50,000 — a 56% underestimate.
- Variance collapses. Correlations break. Policy decisions become catastrophically wrong.
Rule of thumb: Mean imputation is defensible only under MCAR, and even there it shrinks variance. Under MNAR, it introduces massive bias.
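The arithmetic behind those numbers can be checked in a few lines. This sketch uses the seven salaries from the table above; the variable names are illustrative only.

```python
import numpy as np

reported = np.array([45_000, 52_000, 48_000, 55_000, 50_000], dtype=float)
hidden = np.array([250_000, 300_000], dtype=float)  # David's and Frank's true salaries

observed_mean = reported.mean()  # 50,000
truth = np.concatenate([reported, hidden])
imputed = np.concatenate([reported, np.full(hidden.size, observed_mean)])

true_mean = truth.mean()                        # 800,000 / 7, about 114,286
underestimate = 1 - imputed.mean() / true_mean  # 0.5625, i.e. about 56%

print(f"mean after imputation: {imputed.mean():,.0f}")
print(f"true mean:             {true_mean:,.0f}")
print(f"underestimate:         {underestimate:.1%}")
print(f"std dev, truth:        {truth.std(ddof=1):,.0f}")
print(f"std dev, imputed:      {imputed.std(ddof=1):,.0f}")
```

The imputed column has an even smaller spread than the five reported salaries, because the two filled-in values sit exactly on the mean. That is the variance collapse in miniature.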
How to Detect Which Type You Have
1. Visual Inspection
Create a missingness indicator and plot it against other features:
import seaborn as sns
import matplotlib.pyplot as plt

# Create a 0/1 indicator: 1 where income is missing
df['income_missing'] = df['income'].isna().astype(int)
# Compare the age distribution of nonrespondents (1) vs. respondents (0)
sns.boxplot(data=df, x='income_missing', y='age')
plt.show()
Look for:
- MCAR: No visible pattern.
- MAR: Missingness correlates with observed columns (age, gender).
- MNAR: Cannot be confirmed from the observed data alone. Clues: missingness associates with proxies of the missing value (e.g., spending as a proxy for income), or domain logic suggests the value itself drives nonresponse.
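A plot is a good start, but a number helps when there are many columns. The sketch below is one possible check, not a standard API: the helper `missingness_smd` and the synthetic `age`, `tenure`, and `income` columns are assumptions made for illustration. It computes the standardized mean difference of each observed feature between rows where the target is missing and rows where it is present.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5_000
df = pd.DataFrame({
    "age": rng.uniform(20, 70, size=n),
    "tenure": rng.uniform(0, 30, size=n),  # hypothetical column unrelated to missingness
})
df["income"] = 20_000 + 1_000 * df["age"] + rng.normal(0, 10_000, size=n)

# Make income missing mostly for older respondents (a MAR-style mechanism).
drop = rng.random(n) < 1 / (1 + np.exp(-(df["age"] - 45) / 5))
df.loc[drop, "income"] = np.nan

def missingness_smd(df, target, features):
    """Standardized mean difference of each feature between rows where
    `target` is missing and rows where it is observed."""
    missing = df[target].isna()
    out = {}
    for col in features:
        a, b = df.loc[missing, col], df.loc[~missing, col]
        pooled_sd = np.sqrt((a.var() + b.var()) / 2)
        out[col] = (a.mean() - b.mean()) / pooled_sd
    return pd.Series(out)

smd = missingness_smd(df, "income", ["age", "tenure"])
print(smd.round(2))
```

A value near zero means the feature looks the same in both groups; a large absolute value means the feature predicts missingness, which argues against MCAR. Note what this cannot do: a clean result here does not rule out MNAR, because the variable that matters is the one you cannot see.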
2. Little’s MCAR Test
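A statistical test that compares variable means across missingness patterns.
- H₀: Data is MCAR
- If p < 0.05 → reject MCAR (so data is MAR or MNAR)
- If p ≥ 0.05 → MCAR is plausible, not proven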
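For readers who want to see what the test computes, this is the statistic as it is usually stated. The notation is an assumption of this sketch, not the article's: rows are grouped into J distinct missingness patterns; pattern j has n_j rows and p_j observed variables out of p in total; ȳ_j is the mean of the observed variables in that pattern; μ̂_j and Σ̂_j are the matching parts of the maximum-likelihood mean and covariance estimated from all the data.

```latex
d^2 = \sum_{j=1}^{J} n_j \,(\bar{y}_j - \hat{\mu}_j)^\top \hat{\Sigma}_j^{-1} (\bar{y}_j - \hat{\mu}_j),
\qquad
d^2 \;\sim\; \chi^2_{\,\sum_{j} p_j - p} \quad \text{(asymptotically, under } H_0\text{)}
```

In words: if each pattern's observed means sit close to the overall estimates, d² is small and MCAR stays plausible. The test only looks at means, so it can miss departures from MCAR that show up elsewhere, and it cannot tell MAR from MNAR.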
3. Domain Knowledge (Most Important)
Ask yourself:
- Would people with extreme values hide their answers?
- Is this sensitive data? (income, weight, health)
- Is this self-reported?
- Do dropouts happen in longitudinal studies?
If yes → Probably MNAR
How to Handle Missing Data: The Right Way

For MCAR (Rare):
# Simple deletion works
df_clean = df.dropna()
- Works only if missingness is truly random.
- Loses power; safe only when percent missing is tiny.
For MAR (Common):
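MICE-style imputation (Multiple Imputation by Chained Equations; scikit-learn's IterativeImputer is inspired by MICE but returns a single imputed dataset by default):
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
# Assumes df contains only numeric columns
imputer = IterativeImputer(max_iter=10, random_state=0)
# fit_transform returns a NumPy array, so rebuild the DataFrame
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)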
KNN Imputation:
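import pandas as pd
from sklearn.impute import KNNImputer
# Neighbors are found by distance, so this assumes numeric columns on comparable scales
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)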
These use relationships between observed variables to predict missing values. A single imputed dataset understates uncertainty; to preserve it, generate several imputed datasets and pool the results.
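Here is a minimal sketch of what multiple imputation can look like with scikit-learn. It assumes synthetic age/income data with MAR-style missingness, and it relies on `IterativeImputer(sample_posterior=True)` run with different seeds to draw several plausible completions. The pooling step follows Rubin's rules; the choice of `m = 5` is arbitrary.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical data: income (in $1,000s) rises with age.
age = rng.uniform(20, 70, size=n)
income = 20 + 1.0 * age + rng.normal(0, 10, size=n)
true_mean = income.mean()

# MAR-style missingness: older respondents skip the income question more often.
missing = rng.random(n) < 1 / (1 + np.exp(-(age - 45) / 5))
X = np.column_stack([age, np.where(missing, np.nan, income)])
naive_mean = income[~missing].mean()

m = 5  # number of imputed datasets
estimates, variances = [], []
for seed in range(m):
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    completed_income = imputer.fit_transform(X)[:, 1]
    estimates.append(completed_income.mean())
    variances.append(completed_income.var(ddof=1) / n)  # variance of the mean

# Rubin's rules: average the estimates, then combine within- and
# between-imputation variance.
pooled_mean = np.mean(estimates)
within = np.mean(variances)
between = np.var(estimates, ddof=1)
total_var = within + (1 + 1 / m) * between

print(f"true mean:   {true_mean:.2f}")
print(f"naive mean:  {naive_mean:.2f}")
print(f"pooled mean: {pooled_mean:.2f} (std. error {np.sqrt(total_var):.2f})")
```

The between-imputation term is the part a single imputed dataset throws away. It is what lets the reported standard error reflect the fact that the filled-in values are guesses.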
For MNAR (Most Dangerous):
You cannot “fix” MNAR without assumptions or extra data. Use:
- Pattern-mixture models: model observed vs missing patterns separately.
- Selection models: two-part models that model missingness probability and outcomes together.
- Sensitivity analysis: try optimistic/pessimistic scenarios and report a range.
Example reporting table:
Scenario | Estimated Mean | Conclusion
----------------------|----------------|------------
Optimistic (MAR) | $50,000 | No action needed
Conservative | $75,000 | Moderate concern
Realistic (MNAR) | $85,000 | Significant inequality
Worst case | $120,000 | Crisis level
Report: "True value likely $75k-$85k, not $50k"
The honest approach: Report a range, not false precision.
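One simple way to produce such a range is a scenario sweep. The sketch below reuses the seven-person salary survey from earlier (five reported salaries, two refusals), not the figures in the table above. The multipliers and the helper `mean_under_shift` are hypothetical: each scenario assumes nonresponders earn some multiple of the observed mean.

```python
import numpy as np

reported = np.array([45_000, 52_000, 48_000, 55_000, 50_000], dtype=float)
n_missing = 2

def mean_under_shift(reported, n_missing, multiplier):
    """Overall mean if each nonresponder earned `multiplier` x the observed mean."""
    assumed = np.full(n_missing, multiplier * reported.mean())
    return np.concatenate([reported, assumed]).mean()

# Hypothetical scenarios; 1.0 is equivalent to plain mean imputation.
multipliers = [1.0, 2.0, 3.0, 5.0]
results = {k: mean_under_shift(reported, n_missing, k) for k in multipliers}

for k, estimate in results.items():
    print(f"nonresponders at {k:.0f}x observed mean -> estimated mean ${estimate:,.0f}")
```

None of these scenarios is "the answer". The point is to show readers how far the conclusion moves as the assumption about nonresponders changes, and to let domain knowledge decide which multipliers are believable.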
Quick decision framework
1. How much is missing?
- <5%: likely manageable
- 5–20%: investigate carefully
- >20%: serious concern
2. Run Little’s MCAR test
- p > 0.05 → MCAR plausible
- p < 0.05 → not MCAR
3. Check domain knowledge
- Sensitive/self-reported → likely MNAR
- Technical failure → likely MCAR
- Correlates with observed vars → MAR
4. Choose method
- MCAR → drop/simple impute
- MAR → MICE / KNN
- MNAR → sensitivity analysis, pattern/selection models
5. Validate
- Compare analyses with and without imputation
- Report assumptions and uncertainty
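Step 1 of the framework is easy to automate. This is a small sketch, assuming a pandas DataFrame and a hypothetical helper named `triage_missingness` that applies the thresholds listed above to each column.

```python
import numpy as np
import pandas as pd

def triage_missingness(df: pd.DataFrame) -> pd.DataFrame:
    """Label each column by its share of missing values:
    <5% manageable, 5-20% investigate, >20% serious."""
    frac = df.isna().mean()
    label = np.select(
        [frac < 0.05, frac <= 0.20],
        ["likely manageable", "investigate carefully"],
        default="serious concern",
    )
    return pd.DataFrame({"missing_fraction": frac, "triage": label})

# Tiny illustrative dataset (values are made up).
survey = pd.DataFrame({
    "age": [25, 34, 45, 29, 51, 38, 42, 60, 33, 47],
    "income": [50, None, 75, 48, None, None, 52, 61, 58, 49],
    "health": [85, 90, None, 78, 88, 92, 81, 79, 84, 90],
})
report = triage_missingness(survey)
print(report)
```

The percentages only tell you how much is missing, not why. A column with 3% missing can still be MNAR, so the remaining steps of the framework still apply.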
Key Takeaways
✅ Missingness is information — Patterns reveal the data’s true story
✅ Not all missing data is equal — MCAR/MAR/MNAR need different approaches
✅ Mean imputation is dangerous — Only works for MCAR, catastrophic for MNAR
✅ Domain knowledge > fancy algorithms — Understanding WHY matters most
✅ You can’t always fix it — Sometimes honesty about uncertainty is best
The Question That Changes Everything
Next time you see missing data, don’t ask: ❌ “How do I fill this in quickly?”
Ask: ✅ “Why is this missing — and what does that tell me?”
Because the people who didn’t answer might be exactly the ones you need to understand most.