Why Missing Data Is Not Missing at Random and Why That Matters

Last Updated on November 11, 2025 by Editorial Team

Author(s): ANGELI WICKRAMA ARACHCHI

Originally published on Towards AI.

The medical study that got it wrong

You’re analyzing a clinical trial for a new antidepressant:

  • 1,000 patients enrolled
  • 700 completed the 3-month follow-up
  • 300 patients have missing follow-up scores

Your team says: “Let’s just use the 700 complete cases or fill missing values with the average.”

But here’s what actually happened:

  • Patients who got much worse didn’t return (too depressed).
  • Patients who felt completely cured didn’t return (they didn’t think they needed to).
  • Only the moderate-improvement group reliably showed up.

Your conclusion: “The drug shows mild effectiveness.”
Reality: the drug either works brilliantly or fails catastrophically.

The missing data was the story.

The Three Types of Missing Data (And Why It Matters)

🟢 MCAR — Missing Completely At Random

What it means: Data is missing due to pure randomness — like a coin flip.

Examples:

  • Lab technician spills coffee on random survey forms
  • Sensor randomly fails due to manufacturing defect
  • Database bug drops random records

Visual pattern:

Patient | Age | Income | Health
--------|-----|---------|-------
Alice | 25 | $50,000 | 85
Bob | 34 | [?] | 90 ← Random
Carol | 45 | $75,000 | [?] ← Random
David | 29 | $48,000 | 78

Good news: Simple methods give unbiased estimates of the mean (delete rows or use mean imputation), although mean imputation still shrinks the variance.

Bad news: MCAR is rare in real data. Treat it as an assumption to verify, not a default.

🟡 MAR — Missing At Random

What it means: Missingness depends on observed variables, not the missing value itself.

Example: Older people don’t report income (privacy concerns). Age (which you have) predicts missingness; once you account for age, the actual income (which is missing) adds nothing more about who responds.

Visual pattern:

Patient | Age | Income | Responded?
--------|-----|---------|------------
Emma | 67 | [?] | No ← Older
Frank | 65 | [?] | No ← Older
Grace | 28 | $45,000 | Yes
Henry | 31 | $52,000 | Yes

Key insight: You can use the relationship between age and missingness to impute intelligently.

🔴 MNAR — Missing Not At Random

What it means: The probability that a value is missing depends on the value itself, even after accounting for everything you observed.

Examples:

  • Weight-tracking app: People skip logging after weight gain
  • Income surveys: Very high and very low earners don’t report
  • Depression studies: The most severely depressed patients don’t return

Visual pattern:

User | Week1 | Week2 | Week3
-----|-------|-------|-------
Jane | 150 | 149 | 148 ✓ Losing
John | 180 | 182 | [?] ✗ Gaining = stopped logging

The danger: The data you see is fundamentally biased. You can’t see the pattern because the evidence is missing.
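To make the three mechanisms concrete, here is a small simulation sketch. Everything in it is an assumption for illustration: the synthetic population, the logistic dropout rules, and the helper name `regression_imputed_mean` are hypothetical, not taken from a real study. It compares the bias of the complete-case mean with the bias after imputing income from age.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical population (assumption): income rises with age, plus noise.
age = rng.uniform(20, 70, n)
income = 20_000 + 1_000 * age + rng.normal(0, 10_000, n)


def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))


# True means "income is missing". The dropout rules are illustrative choices.
masks = {
    "MCAR": rng.random(n) < 0.30,                                 # pure chance
    "MAR": rng.random(n) < logistic((age - 55) / 5),              # depends on observed age
    "MNAR": rng.random(n) < logistic((income - 80_000) / 8_000),  # depends on income itself
}


def regression_imputed_mean(age, income, missing):
    """Fit income ~ age on observed rows, fill the missing rows, average all rows."""
    slope, intercept = np.polyfit(age[~missing], income[~missing], deg=1)
    filled = np.where(missing, intercept + slope * age, income)
    return filled.mean()


true_mean = income.mean()
bias = {}
for name, missing in masks.items():
    bias[name] = {
        "complete_case": income[~missing].mean() - true_mean,
        "impute_from_age": regression_imputed_mean(age, income, missing) - true_mean,
    }
    print(f"{name:5s} complete-case bias: {bias[name]['complete_case']:>9,.0f}   "
          f"impute-from-age bias: {bias[name]['impute_from_age']:>9,.0f}")
```

In this setup, MCAR should leave both estimates close to the truth, MAR should bias the complete-case mean but be largely repaired by using age, and MNAR should stay biased either way, because the information needed to correct it is exactly what went missing.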

Why Mean Imputation Destroys MNAR Data

Scenario: Employee salary survey. High earners refuse to report (privacy).

Your data:

Employee | Salary
---------|----------
Alice | $45,000
Bob | $52,000
Carol | $48,000
David | [?] ← Actually $250,000
Emma | $55,000
Frank | [?] ← Actually $300,000
Grace | $50,000

  • Mean of observed values = $50,000
  • Imputing missing with $50k hides the real extremes.

Consequence:

  • True mean ≈ $114,000, but you report $50,000 — a 56% underestimate.
  • Variance collapses. Correlations break. Policy decisions become catastrophically wrong.

Rule of thumb: Mean imputation is only safe for MCAR. For MNAR, it introduces massive bias.
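The arithmetic in the salary example can be checked directly. This sketch uses only the seven salaries from the table above; the variable names are my own.

```python
import numpy as np

# Salaries from the survey table: five reported, two withheld.
observed = np.array([45_000, 52_000, 48_000, 55_000, 50_000], dtype=float)
hidden = np.array([250_000, 300_000], dtype=float)  # true values the survey never saw

observed_mean = observed.mean()
true_all = np.concatenate([observed, hidden])
true_mean = true_all.mean()

# Mean imputation: every missing salary is replaced by the observed mean.
imputed = np.concatenate([observed, np.full(hidden.size, observed_mean)])
underestimate = 1 - imputed.mean() / true_mean

# Spread before and after imputation (sample standard deviation).
true_std = true_all.std(ddof=1)
imputed_std = imputed.std(ddof=1)

print(f"Observed mean:       {observed_mean:,.0f}")
print(f"True mean:           {true_mean:,.0f}")
print(f"Underestimate:       {underestimate:.1%}")
print(f"True std dev:        {true_std:,.0f}")
print(f"Std dev after fill:  {imputed_std:,.0f}")
```

The filled-in dataset keeps the observed mean exactly, so the gap to the true mean never closes, and its standard deviation ends up even smaller than that of the five reported salaries alone.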

How to Detect Which Type You Have

1. Visual Inspection

Create a missingness indicator and plot it against other features:

import seaborn as sns

# Create missingness indicator (1 = income is missing, 0 = observed)
df['income_missing'] = df['income'].isna().astype(int)

# Check whether missingness tracks another variable, here age
sns.boxplot(data=df, x='income_missing', y='age')

Look for:

  • MCAR: No visible pattern.
  • MAR: Missingness correlates with observed columns (age, gender).
  • MNAR: Missingness associates with related outcomes (e.g., spending), or domain logic suggests the missing value itself causes nonresponse.
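A plot can be backed up with a quick numeric check. The sketch below builds a synthetic dataset (an assumption, including the rule that respondents aged 60 and over skip the income question half the time) and compares ages between rows with and without income using Welch's t-test from SciPy.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 2_000

# Synthetic survey (assumption): age is always recorded, income sometimes is not.
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "income": rng.normal(60_000, 15_000, n),
})
skip = (df["age"] >= 60) & (rng.random(n) < 0.5)
df.loc[skip, "income"] = np.nan

df["income_missing"] = df["income"].isna().astype(int)

# Average age in each group: 0 = income observed, 1 = income missing.
summary = df.groupby("income_missing")["age"].agg(["count", "mean"])
print(summary)

# Welch's t-test: do the two groups differ in age?
t_stat, p_value = stats.ttest_ind(
    df.loc[df["income_missing"] == 1, "age"],
    df.loc[df["income_missing"] == 0, "age"],
    equal_var=False,
)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```

A small p-value here argues against MCAR. It cannot tell MAR from MNAR, since that distinction depends on values you never observed.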

2. Little’s MCAR Test

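A statistical test that compares the means of the observed variables across the different missingness patterns.

  • H₀: Data is MCAR
  • If p < 0.05 → reject MCAR (so data is MAR or MNAR)
  • If p ≥ 0.05, MCAR is plausible but not proven; the test cannot separate MAR from MNAR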
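Neither pandas nor scikit-learn ships Little's test, so here is a rough sketch of the idea. Be aware of the big assumption: the real test plugs in maximum-likelihood (EM) estimates of the mean and covariance, while this version substitutes available-case means and pairwise covariances, so its p-values are only approximate. The function name `approx_little_mcar` and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2


def approx_little_mcar(df):
    """Approximate Little's d2 statistic for numeric columns.

    Assumption: available-case moments stand in for the EM estimates
    used by the real test, so treat the p-value as a rough guide.
    """
    mu = df.mean()   # available-case means
    cov = df.cov()   # pairwise-complete covariance matrix
    # Encode each row's missingness pattern as an integer.
    pattern_id = (df.isna().to_numpy() * (2 ** np.arange(df.shape[1]))).sum(axis=1)

    d2, dof = 0.0, -df.shape[1]
    for _, block in df.groupby(pattern_id):
        cols = block.columns[block.notna().all().to_numpy()]  # observed in this pattern
        if len(cols) == 0:
            continue
        diff = (block[cols].mean() - mu[cols]).to_numpy()
        sigma = cov.loc[cols, cols].to_numpy()
        d2 += len(block) * diff @ np.linalg.solve(sigma, diff)
        dof += len(cols)
    return d2, dof, chi2.sf(d2, dof)


# Demo on synthetic MAR data (assumption): older respondents often skip income.
rng = np.random.default_rng(1)
n = 2_000
age = rng.uniform(20, 80, n)
income = 30_000 + 500 * age + rng.normal(0, 10_000, n)
demo = pd.DataFrame({"age": age, "income": income})
demo.loc[(demo["age"] > 60) & (rng.random(n) < 0.5), "income"] = np.nan

d2, dof, p_value = approx_little_mcar(demo)
print(f"d2 = {d2:.1f}, dof = {dof}, p = {p_value:.3g}")
```

For real work, prefer a vetted implementation of the test (for example in R or a dedicated Python package) over this approximation.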

3. Domain Knowledge (Most Important)

Ask yourself:

  • Would people with extreme values hide their answers?
  • Is this sensitive data? (income, weight, health)
  • Is this self-reported?
  • Is this a longitudinal study where participants drop out?

If yes → Probably MNAR

How to Handle Missing Data: The Right Way

For MCAR (Rare):

# Simple deletion works
df_clean = df.dropna()

  • Works only if missingness is truly random.
  • Loses statistical power; safe only when the percentage missing is tiny.

For MAR (Common):

MICE (Multiple Imputation by Chained Equations):

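import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to unlock IterativeImputer)
from sklearn.impute import IterativeImputer

# Expects numeric columns; fit_transform returns a NumPy array, so rebuild the DataFrame
imputer = IterativeImputer(max_iter=10, random_state=0)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)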

KNN Imputation:

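import pandas as pd
from sklearn.impute import KNNImputer

# Each missing value is filled with the average of its 5 nearest rows (numeric columns only)
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)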

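These use relationships between observed variables to predict missing values. Each call above returns a single filled-in dataset, so it does not carry imputation uncertainty by itself. To reflect that uncertainty, create several imputed datasets (for example, IterativeImputer with sample_posterior=True and different random_state values) and pool the results.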
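To carry imputation uncertainty into the final estimate, one option is to draw several imputed datasets and pool them with Rubin's rules. The sketch below does this with scikit-learn's `IterativeImputer(sample_posterior=True)`. The synthetic data, the choice of 20 imputations, and the MAR dropout rule are all assumptions for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 1_000

# Synthetic MAR data (assumption): older people are more likely to withhold income.
age = rng.uniform(20, 70, n)
income = 20_000 + 1_000 * age + rng.normal(0, 10_000, n)
missing = rng.random(n) < 1 / (1 + np.exp(-(age - 55) / 5))
X = np.column_stack([age, income])
X[missing, 1] = np.nan

m = 20  # number of imputed datasets (illustrative choice)
estimates, variances = [], []
for seed in range(m):
    # sample_posterior=True draws imputations instead of using point predictions.
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    completed_income = imputer.fit_transform(X)[:, 1]
    estimates.append(completed_income.mean())
    variances.append(completed_income.var(ddof=1) / n)  # variance of the mean

# Rubin's rules: combine within-imputation and between-imputation variance.
pooled_mean = np.mean(estimates)
within = np.mean(variances)
between = np.var(estimates, ddof=1)
pooled_se = np.sqrt(within + (1 + 1 / m) * between)

print(f"Complete-case mean: {income[~missing].mean():,.0f}")
print(f"Pooled mean:        {pooled_mean:,.0f} (SE {pooled_se:,.0f})")
print(f"True sample mean:   {income.mean():,.0f}")
```

The pooled standard error is wider than any single imputed dataset would suggest, which is the point: the extra width is the price of not knowing the missing values.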

For MNAR (Most Dangerous):

You cannot “fix” MNAR without assumptions or extra data. Use:

  • Pattern-mixture models: model observed vs missing patterns separately.
  • Selection models: two-part models that model missingness probability and outcomes together.
  • Sensitivity analysis: try optimistic/pessimistic scenarios and report a range.

Example reporting table:

Scenario | Estimated Mean | Conclusion
----------------------|----------------|------------
Optimistic (MAR) | $50,000 | No action needed
Conservative | $75,000 | Moderate concern
Realistic (MNAR) | $85,000 | Significant inequality
Worst case | $120,000 | Crisis level

Report: "True value likely $75k-$85k, not $50k"

The honest approach: Report a range, not false precision.
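A range like this can come from a simple sensitivity analysis: fix an assumption about the non-respondents, recompute, repeat. The sketch below reuses the five reported salaries from the earlier survey example. The multipliers are hypothetical assumptions about how much more the non-respondents earn, so the output will not match the illustrative table above.

```python
import numpy as np

# Reported salaries from the earlier survey example; two employees did not answer.
observed = np.array([45_000, 52_000, 48_000, 55_000, 50_000], dtype=float)
n_missing = 2

# Hypothetical scenarios: non-respondents earn k times the observed mean.
scenarios = {
    "MAR (no shift)": 1.0,
    "Missing earn 2x": 2.0,
    "Missing earn 4x": 4.0,
    "Missing earn 6x": 6.0,
}


def mean_under_scenario(observed, n_missing, multiplier):
    """Overall mean if every missing salary equals multiplier * observed mean."""
    assumed = np.full(n_missing, observed.mean() * multiplier)
    return np.concatenate([observed, assumed]).mean()


results = {
    name: mean_under_scenario(observed, n_missing, k)
    for name, k in scenarios.items()
}

for name, value in results.items():
    print(f"{name:16s} estimated mean: {value:>10,.0f}")
```

The useful output is not any single row but how fast the estimate moves as the assumption changes. If the conclusion flips within a plausible range of multipliers, say so in the report.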

Quick decision framework

1. How much is missing?

  • <5% — likely manageable
  • 5–20% — investigate carefully
  • >20% — serious concern

2. Run Little’s MCAR test

  • p > 0.05 → MCAR plausible
  • p < 0.05 → not MCAR

3. Check domain knowledge

  • Sensitive/self-reported → likely MNAR
  • Technical failure → likely MCAR
  • Correlates with observed vars → MAR

4. Choose method

  • MCAR → drop/simple impute
  • MAR → MICE / KNN
  • MNAR → sensitivity analysis, pattern/selection models

5. Validate

  • Compare analyses with and without imputation
  • Report assumptions and uncertainty
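Step 1 of the framework is easy to automate. This sketch applies the thresholds listed above (<5%, 5–20%, >20%) to each column; the function name `triage` and the toy DataFrame are hypothetical.

```python
import numpy as np
import pandas as pd


def triage(fraction_missing):
    """Map a column's missing fraction to the framework's three levels."""
    if fraction_missing < 0.05:
        return "likely manageable"
    if fraction_missing <= 0.20:
        return "investigate carefully"
    return "serious concern"


# Toy data (assumption): 10 rows with 0, 1, and 3 missing values per column.
demo = pd.DataFrame({
    "age": [25, 34, 45, 29, 51, 38, 62, 47, 33, 58],
    "income": [50, np.nan, 75, 48, 52, 61, 58, 49, 66, 70],
    "health": [85, 90, np.nan, 78, np.nan, 88, np.nan, 91, 80, 77],
})

report = pd.DataFrame({"fraction_missing": demo.isna().mean()})
report["triage"] = report["fraction_missing"].map(triage)
print(report)
```

The percentages only tell you how much is at stake. Steps 2 and 3 still decide which mechanism you are dealing with.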

Key Takeaways

Missingness is information — Patterns reveal the data’s true story

Not all missing data is equal — MCAR/MAR/MNAR need different approaches

Mean imputation is dangerous — Only works for MCAR, catastrophic for MNAR

Domain knowledge > fancy algorithms — Understanding WHY matters most

You can’t always fix it — Sometimes honesty about uncertainty is best

The Question That Changes Everything

Next time you see missing data, don’t ask: ❌ “How do I fill this in quickly?”

Ask: ✅ “Why is this missing — and what does that tell me?”

Because the people who didn’t answer might be exactly the ones you need to understand most.
