
The Builder’s Notes: Why AI Systems Break in Production (The Infrastructure No One Talks About)

Last Updated on October 18, 2025 by Editorial Team

Author(s): Piyoosh Rai

Originally published on Towards AI.

Your monitoring stack shows green. Your infrastructure is on fire. This is why 30% of AI projects fail after deployment — not because of the models, but because of the invisible infrastructure no one architects for. (Image: The reality of production AI systems at 2:47 AM)

📐 THE BUILDER’S NOTES

Technical deep-dives for engineers and CTOs who build systems that can’t fail.

Written by someone who’s spent 20 years debugging production disasters in healthcare, financial services, and government — where downtime means lawsuits, not just tweets.

This is the inaugural issue. Follow for weekly Builder’s Notes every Tuesday.

The alert came at 2:47 AM on a Tuesday.

“Model accuracy dropped to 43%. Patient risk predictions failing. Emergency rollback in progress.”

A healthcare AI system we’d built was hemorrhaging. Not because of the model — the neural network was fine. Not because of the data — inputs were clean. The system was dying because of something far more mundane: a dependency we didn’t know we had.

Eighteen months of development. $3.2 million invested. Passed every pre-production test. Failed spectacularly under real-world load.

By 6 AM, we’d isolated it: a third-party API rate limit we’d never hit in staging. When patient volume spiked, the system silently degraded. No alerts. No circuit breakers. Just quiet, catastrophic failure.

This wasn’t a technology problem. It was an architecture problem. And it’s killing AI systems everywhere.

The Production Reality No One Prepares For

According to Gartner’s 2025 analysis, at least 30% of generative AI projects will be abandoned after proof of concept due to poor data quality, inadequate risk controls, escalating costs, or unclear business value.

But here’s what the analysts miss: most failures aren’t about the AI at all.

I’ve watched brilliant machine learning engineers build models that achieve 98% accuracy in development, only to see them collapse within weeks of production deployment. The pattern is always the same:

  • Perfect model performance in controlled environments
  • Flawless demo presentations that wow executives
  • Zero consideration for the infrastructure that keeps it alive

Then reality hits.

Production environments don’t care about your accuracy metrics. They care about:

  • What happens when your data pipeline gets 10x the expected load
  • How your system behaves when a critical dependency fails
  • Whether you can detect drift before it costs you millions
  • If you can roll back without losing data integrity

The difference between systems that survive and systems that die isn’t the algorithm. It’s the infrastructure you built around it.

What Actually Breaks AI Systems in Production

After two decades building AI systems for regulated industries — healthcare, financial services, government — I’ve seen every failure mode. Here are the real killers:

1. Invisible Dependencies That Silently Fail

Most AI systems depend on dozens of external services: data lakes, feature stores, model registries, monitoring tools, authentication providers. In staging, these all behave perfectly.

In production, any one of them can fail at any time.

The problem: most teams don’t architect for graceful degradation. They assume 100% uptime from every component. When a dependency fails, the entire system collapses — often without clear alerts.

What successful systems do differently:

  • Implement circuit breakers at every external call
  • Design fallback behaviors for each dependency failure
  • Monitor dependency health independently of your system health
  • Set aggressive timeout thresholds (if an API takes >500ms, something’s wrong)

We rebuilt our healthcare system with explicit circuit breakers on every external dependency. When that rate limit hit again three months later, the system degraded gracefully instead of catastrophically. Patient predictions continued using cached features. Alerts fired immediately. We had it resolved in 22 minutes instead of 3 hours.
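
For illustration, a bare-bones circuit breaker around an external feature call might look like the sketch below. This is not our production code; `fetch_features_remote` and `cached_features` are placeholder names for the third-party call and the cached fallback.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trip open after N consecutive failures,
    then allow a single probe call once a cooldown period has passed."""

    def __init__(self, max_failures=3, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        # While the circuit is open and the cooldown hasn't elapsed, don't
        # even attempt the external call: go straight to the fallback.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback()
            self.opened_at = None  # half-open: let one probe call through

        try:
            result = fn()
            self.failures = 0  # success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()


# Hypothetical usage: degrade to cached features when the API misbehaves.
features_breaker = CircuitBreaker(max_failures=3, reset_after_s=30.0)

def get_features(patient_id, fetch_features_remote, cached_features):
    return features_breaker.call(
        fn=lambda: fetch_features_remote(patient_id, timeout=0.5),  # aggressive timeout
        fallback=lambda: cached_features(patient_id),
    )
```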

2. Data Drift That Your Monitoring Can’t See

Model drift is well-understood. Data drift is where systems actually die.

Here’s the pattern: Your model was trained on data with specific distributions. In production, those distributions shift — gradually or suddenly. Your model keeps making predictions. They’re just increasingly wrong.

The insidious part? Traditional monitoring won’t catch it.

You’re watching accuracy metrics, latency, throughput. Everything looks fine. Meanwhile, your input distributions have shifted 15%, and your model is confidently making terrible predictions.

MIT’s 2025 research on generative AI deployments found that 95% of GenAI pilots deliver little to no measurable bottom-line impact — not because the models are bad, but because organizations can’t detect when they stop working.

What successful systems do differently:

  • Monitor input distributions in real-time, not just model outputs
  • Set statistical thresholds for acceptable distribution shifts
  • Automatically trigger retraining when those thresholds are breached
  • Maintain versioned datasets to compare against production inputs

We instrument every feature going into our models. When distribution shifts exceed 2 standard deviations from training data, we trigger automatic alerts and model evaluation — before predictions degrade enough to impact patients.
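
In code, the core of that check is small. The sketch below assumes you keep per-feature training statistics around; the schema is illustrative, not our actual instrumentation.

```python
import numpy as np

def drift_alerts(training_stats, live_batch, threshold_sd=2.0):
    """Flag features whose live mean has drifted more than `threshold_sd`
    standard deviations from the training mean.

    training_stats: {feature_name: (train_mean, train_std)}
    live_batch:     {feature_name: np.ndarray of recent production values}
    """
    alerts = []
    for name, values in live_batch.items():
        train_mean, train_std = training_stats[name]
        if train_std == 0:
            continue  # constant feature in training; the ratio is meaningless
        shift_in_sd = abs(float(values.mean()) - train_mean) / train_std
        if shift_in_sd > threshold_sd:
            alerts.append((name, round(shift_in_sd, 2)))
    return alerts  # wire these into alerting and model-evaluation triggers
```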

3. Observability Theater (You’re Measuring the Wrong Things)

Most teams think they have observability because they track:

  • Request latency
  • Error rates
  • Model accuracy
  • Infrastructure metrics

This is observability theater. You’re collecting metrics that make dashboards look impressive while missing the signals that actually matter.

Real observability for AI systems means:

For every prediction:

  • Input feature values and distributions
  • Model version used
  • Confidence scores
  • Time to prediction
  • Downstream action taken

For the system:

  • Feature pipeline health (not just “is it running?”)
  • Data freshness (when was this data last updated?)
  • Model staleness (how long since last retraining?)
  • Dependency response times
  • Circuit breaker states

When our healthcare system failed at 2:47 AM, we had beautiful dashboards showing “everything green.” But we weren’t measuring the right things.

After the rebuild, we can trace every prediction back to its inputs, identify exactly when distributions started shifting, and pinpoint which dependency failed — all within 30 seconds of anomaly detection.
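
Concretely, "full context" means something like the record below: one structured log entry per prediction. The field names are illustrative rather than our actual schema.

```python
from dataclasses import dataclass, field, asdict
import json, time, uuid

@dataclass
class PredictionTrace:
    """One structured record per prediction, so any output can be traced
    back to its inputs, model version, and downstream action."""
    model_version: str
    feature_values: dict      # exact inputs used for this prediction
    confidence: float
    latency_ms: float
    downstream_action: str    # e.g. "flagged_for_review", "auto_approved"
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)

def log_prediction(trace: PredictionTrace) -> None:
    # In a real system this goes to a log pipeline; stdout keeps the sketch simple.
    print(json.dumps(asdict(trace)))
```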

4. State Management That Assumes Perfect Consistency

AI systems are stateful. Models have versions. Features have dependencies. Predictions create state that downstream systems rely on.

Most teams architect for the happy path: everything succeeds, state stays consistent, the world is deterministic.

Production is not the happy path.

Networks partition. Databases lag. Services restart. Messages get delivered out of order. Your model updates while a prediction is in flight.

If your architecture assumes perfect consistency, you’re building a system designed to fail.

What successful systems do differently:

  • Design for eventual consistency from day one
  • Implement idempotent operations everywhere
  • Version every artifact (data, models, features, predictions)
  • Build audit trails that survive partial failures
  • Use event sourcing for critical state changes

In financial services, we can’t afford “eventually consistent” predictions. So we built explicit versioning: every prediction includes the exact model version, feature set version, and data snapshot used. If something fails, we can reconstruct the exact state and retry with perfect consistency.
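
A stripped-down version of that idea, with made-up field names: every prediction carries the identifiers needed to reconstruct and replay it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VersionedPrediction:
    """Immutable record tying a prediction to every artifact that produced it."""
    prediction_id: str
    model_version: str        # e.g. "risk-model:2025.10.03"
    feature_set_version: str  # version of the feature engineering pipeline
    data_snapshot_id: str     # snapshot of the source data behind the features
    value: float

def replay(pred, load_snapshot, build_features, load_model):
    """Reconstruct the exact state and recompute the prediction.
    The three loader arguments are assumed hooks into a snapshot store,
    feature pipeline, and model registry, not real APIs."""
    raw = load_snapshot(pred.data_snapshot_id)
    features = build_features(raw, version=pred.feature_set_version)
    model = load_model(pred.model_version)
    return model.predict(features)
```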

5. The Retraining Pipeline No One Plans For

Your model is in production. It’s working. Six weeks later, performance starts degrading. Time to retrain.

How long does that take?

Most teams discover the answer in production: “We don’t actually know.”

Retraining isn’t just “run the training script again.” In production systems, it’s:

  • Identifying which new data to include
  • Validating data quality hasn’t degraded
  • Rerunning feature engineering pipelines
  • Training the new model
  • Validating that it performs better than the current production model
  • Deploying with zero downtime
  • Monitoring for regression
  • Rolling back if something goes wrong

If you can’t do all of this in under 24 hours, you don’t have a production AI system. You have a demo that occasionally works.

What successful systems do differently:

  • Automate the entire retraining pipeline
  • Run shadow deployments (new model predicts, old model decides)
  • Use A/B testing for gradual rollouts
  • Build automatic rollback on performance degradation
  • Document every step so any engineer can execute it

We run weekly retraining cycles. New models deploy to 5% of traffic first. If performance metrics improve, we gradually roll to 100%. If anything degrades, automatic rollback happens in under 90 seconds. No manual intervention required.
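
The routing and rollback logic is not complicated. A toy version, with invented metric names and thresholds, looks like this:

```python
import random

def route_request(features, old_model, new_model, new_traffic_share=0.05):
    """Send a small fraction of live traffic to the candidate model;
    everything else stays on the current production model."""
    model = new_model if random.random() < new_traffic_share else old_model
    return model.predict(features)

def next_traffic_share(current_share, candidate_metrics, baseline_metrics,
                       max_regression=0.01):
    """Double the candidate's traffic share while it holds up;
    roll back to 0% the moment it regresses beyond tolerance."""
    if candidate_metrics["error_rate"] > baseline_metrics["error_rate"] + max_regression:
        return 0.0  # automatic rollback: all traffic returns to the old model
    return min(1.0, current_share * 2)  # 5% -> 10% -> 20% -> ... -> 100%
```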

The Architecture Pattern That Actually Works

After rebuilding production AI systems repeatedly, here’s the pattern that survives:

Layer 1: Chaos-Resistant Data Pipeline

  • Event-driven architecture (not request-response)
  • Circuit breakers on every external dependency
  • Automatic retry with exponential backoff
  • Dead letter queues for failures
  • Real-time distribution monitoring
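
As a sketch of the retry and dead-letter pieces of that list (the queue and handler here are assumptions, not any specific broker's API):

```python
import time

def process_with_retry(event, handler, dead_letter_queue,
                       max_attempts=5, base_delay_s=0.5):
    """Retry a failing handler with exponential backoff; once retries are
    exhausted, park the event in a dead letter queue instead of dropping it."""
    for attempt in range(max_attempts):
        try:
            return handler(event)
        except Exception as exc:
            if attempt == max_attempts - 1:
                dead_letter_queue.append({"event": event, "error": str(exc)})
                return None
            time.sleep(base_delay_s * (2 ** attempt))  # 0.5s, 1s, 2s, 4s, ...
```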

Layer 2: Observable Model Serving

  • Every prediction logged with full context
  • A/B testing infrastructure built-in
  • Shadow deployment capability
  • Automatic performance comparison
  • Instant rollback mechanisms

Layer 3: Continuous Retraining

  • Automated data validation
  • Versioned training datasets
  • Reproducible training environments
  • Shadow model evaluation
  • Gradual rollout with monitoring

Layer 4: Intelligent Degradation

  • Graceful fallbacks for every failure mode
  • Feature importance rankings (which can we live without?)
  • Confidence thresholds that trigger human review
  • Manual override capabilities
  • Audit trails for compliance
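
One way to express the confidence-threshold part of Layer 4; the thresholds and action names are illustrative, not prescriptive:

```python
def route_by_confidence(prediction, confidence,
                        auto_threshold=0.90, review_threshold=0.60):
    """Act automatically only on high-confidence predictions, send the
    uncertain middle band to human review, and refuse to act below that."""
    if confidence >= auto_threshold:
        return {"action": "auto", "prediction": prediction}
    if confidence >= review_threshold:
        return {"action": "human_review", "prediction": prediction}
    return {"action": "manual_only", "prediction": None}  # graceful refusal
```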

This isn’t sexy. It’s not the architecture that wins hackathons. But it’s the architecture that keeps systems alive when dependencies fail, data drifts, and load spikes.

What This Means For Your Next Production Deployment

If you’re building an AI system for production right now, ask yourself:

Infrastructure Questions:

  • Can your system survive the failure of any single dependency?
  • Do you monitor input distributions in real-time?
  • Can you trace every prediction back to exact inputs and model version?
  • How long would it take to retrain and deploy a new model?
  • What happens when your data pipeline gets 10x expected load?

Observability Questions:

  • Are you measuring data freshness or just system uptime?
  • Can you detect distribution drift before model performance degrades?
  • Do you know which features matter most for each prediction?
  • Can you replay any prediction to debug failures?

Resilience Questions:

  • What’s your rollback strategy if the new model performs worse?
  • Do you have circuit breakers on every external call?
  • How do you handle partial failures without corrupting state?
  • Can any engineer execute your retraining pipeline at 3 AM?

If you answered “no” or “I don’t know” to more than three of these questions, you’re building a demo, not a production system.

The good news? You can fix this. The companies that succeed aren’t necessarily smarter — they just build infrastructure that assumes failure instead of success.

The Uncomfortable Truth About Production AI

Most engineers think production AI is hard because machine learning is complex.

Production AI is hard because distributed systems are hard.

The model is often the easy part. The hard part is:

  • Building pipelines that survive chaos
  • Detecting problems before they cascade
  • Rolling back without data loss
  • Maintaining consistency under failures
  • Operating at scale without constant firefighting

Your PhD in machine learning won’t save you when a rate limit takes down your system at 2:47 AM. What saves you is boring infrastructure work: circuit breakers, observability, versioning, automated testing, gradual rollouts.

The AI systems that survive in production don’t have better models. They have better infrastructure.

The question isn’t whether you can build an accurate model. The question is whether you can build an infrastructure that keeps it alive.

What’s Next

This is the first Builder’s Notes — a weekly series on the technical realities of building AI systems that actually work in regulated, high-stakes environments.

Next week: How We Built Self-Healing AI Infrastructure (Without Burning $2M)

If you’re tired of ML tutorials that ignore production realities and want insights from someone who’s debugged catastrophic failures at 3 AM, follow me.

I publish Tuesdays (technical) and Thursdays (business strategy). Real problems. Real architecture. No bullshit.

Piyoosh Rai is the Founder & CEO of The Algorithm, where he builds native-AI platforms for healthcare, financial services, and government sectors. After 20 years of watching technically perfect systems fail in production, he writes about the unglamorous infrastructure work that separates demos from deployments. His systems process millions of predictions daily in environments where failure means regulatory action, not just retry logic.
