Your AI Agent Works in Demo. It Dies in Production. Here’s Why.

Last Updated on January 15, 2026 by Editorial Team

Author(s): Elliott Girard

Originally published on Towards AI.

I’ve built 38 AI agents in 2025. 6 made it to production. 32 didn’t. Here’s what killed them — and how to avoid the same fate.

The Demo That Fooled Everyone

You know the feeling.

Your agent works perfectly in testing. It fetches data, reasons through problems, calls the right tools. Your demo goes flawlessly. Stakeholders are impressed.

Then you ship it.

Users complain it’s slow. Your API bill explodes. The same query returns completely different results each time. Support tickets pile up.

What happened?

After building Wasaphi (a trading agent), Izimail (an email assistant), calendar bots, other personal assistants, and a lot of client projects, I’ve identified the three killers that murder AI agents between demo and production:

  1. Non-determinism — Same input, different outputs. Impossible to debug.
  2. Token explosion — Your context window becomes a black hole for money.
  3. Latency death spiral — 30 seconds per response kills user trust.

Let me show you exactly how each one kills your agent — and what to do about it.

Image created with AI by the author

Killer #1: Non-Determinism (The Silent Chaos)

The Problem

Your agent works… most of the time.

Ask it the same question twice:

  • First response: Perfect, concise, uses the right tool
  • Second response: Rambling, calls 3 unnecessary tools, misses the point

Users hate this. They don’t know what to expect. They lose trust fast.

Why It Happens

LLMs are probabilistic. Even with temperature=0, you get variation:

  • Slight differences in token generation
  • Tool selection isn’t deterministic
  • Context window changes affect reasoning
  • Different conversation histories = different behaviors

Real Example

My trading agent Wasaphi had this problem early on:

User: "What's the sentiment on NVDA?"
Response 1: [calls Reddit tool → analyzes → returns clean summary]
Response 2: [calls Reddit tool → calls news tool → calls SEC tool → runs out of iterations → returns incomplete garbage]

Same query. Wildly different execution paths. Unpredictable costs.

The Fix: Constrained Workflows

For critical paths, don’t let the LLM decide. Build rigid workflows.

Option 1: Skill Servers

Instead of letting the agent freestyle, create explicit skill definitions:

# skills/sentiment_analysis.py
SKILL = {
    "name": "sentiment_analysis",
    "description": "Analyze sentiment for a given ticker",
    "workflow": [
        {"tool": "get_reddit_sentiment", "required": True},
        {"tool": "get_news_headlines", "required": False},
        {"tool": "synthesize_sentiment", "required": True}
    ],
    "max_iterations": 3
}

The agent doesn’t choose what to call — the workflow does.
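To make that concrete, here is a minimal sketch of a runner for such a skill definition. The tool registry and stub tools are assumptions for illustration, not the article's actual implementation:

```python
# Hypothetical sketch: execute a skill's workflow steps in a fixed order,
# instead of letting the LLM pick tools freely.
def run_skill(skill: dict, tool_registry: dict, query: str) -> list:
    """Run each workflow step deterministically; skip optional steps that fail."""
    outputs = []
    for step in skill["workflow"][: skill.get("max_iterations", 10)]:
        tool = tool_registry[step["tool"]]
        try:
            outputs.append(tool(query))
        except Exception:
            if step["required"]:
                raise  # a required step failing aborts the whole skill
            # an optional step failed: continue without it
    return outputs

# Usage with stub tools standing in for real ones
registry = {
    "get_reddit_sentiment": lambda q: f"reddit sentiment for {q}",
    "get_news_headlines": lambda q: f"headlines for {q}",
    "synthesize_sentiment": lambda q: f"summary for {q}",
}
SKILL = {
    "name": "sentiment_analysis",
    "workflow": [
        {"tool": "get_reddit_sentiment", "required": True},
        {"tool": "get_news_headlines", "required": False},
        {"tool": "synthesize_sentiment", "required": True},
    ],
    "max_iterations": 3,
}
results = run_skill(SKILL, registry, "NVDA")
```

Every run of the same skill executes the same steps in the same order; the only thing that varies is the tool output itself.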

Option 2: Parlant for Behavioral Constraints

I’ve been testing Parlant — it adds a determinism layer to agents:

import parlant.sdk as p

@p.tool
async def get_sentiment(context: p.ToolContext, ticker: str) -> p.ToolResult:
    # Your sentiment logic
    return p.ToolResult(f"Bullish sentiment for {ticker}")

async def main():
    async with p.Server() as server:
        agent = await server.create_agent(
            name="SentimentBot",
            description="Stock sentiment analyzer"
        )

        # This is the magic: behavioral constraints
        await agent.create_guideline(
            condition="User asks about sentiment",
            action="Always use get_sentiment tool first, then summarize",
            tools=[get_sentiment]
        )

The guideline system forces specific behaviors based on input patterns. No more "maybe it calls this tool, maybe it doesn't."

Option 3: Aggressive Prompt Engineering

If you can’t restructure, at least constrain via prompts:

CRITICAL RULES:
1. For sentiment queries: ALWAYS call get_reddit_sentiment FIRST
2. NEVER call more than 3 tools per query
3. If uncertain which tool to use, ask user for clarification instead of guessing

Not as reliable as structured workflows, but better than nothing.

The Tradeoff

More determinism = less flexibility.

That’s okay. For production agents, predictability beats creativity.

Reduce risk from the beginning

This often happens when you want a single agent to do too many things. The more focused the need and the scope of action are when you start a project, the more predictable the agent will be.

Killer #2: Token Explosion (The Bill That Kills)

The Problem

Your agent works great with short conversations. Then:

  • Context accumulates
  • RAG retrieves too much
  • Tool responses bloat the context
  • Suddenly each request costs $0.50 instead of $0.02

At scale, this kills your margins — or your funding.

Why It Happens

RAG gone wrong:

# Bad: Retrieve everything remotely relevant
results = vectorstore.similarity_search(query, k=20)
context = "\n".join([doc.page_content for doc in results])
# context is now 15,000 tokens of mostly irrelevant content

Tool responses uncontrolled:

# Bad: Return raw API responses
@tool
def get_stock_data(ticker: str):
    return yahoo_finance.get_everything(ticker)  # 5,000 tokens of JSON

Conversation history unbounded:

# Bad: Keep everything forever
messages.append({"role": "user", "content": user_message})
messages.append({"role": "assistant", "content": assistant_response})
# After 50 exchanges: 30,000 tokens of history
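A common fix for the unbounded-history case is a sliding window. Here is a minimal sketch, assuming an OpenAI-style message list; the cap of 12 messages is an arbitrary illustration:

```python
# Hypothetical sketch: cap conversation history with a sliding window
# so context never grows without bound.
def trim_history(messages: list, max_messages: int = 12) -> list:
    """Keep the system prompt plus only the most recent exchanges."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]

# Simulate 50 exchanges that would otherwise bloat the context
messages = [{"role": "system", "content": "You are a trading assistant."}]
for i in range(50):
    messages.append({"role": "user", "content": f"question {i}"})
    messages.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(messages)
# → 1 system message + the 12 most recent messages
```

A token-budget variant (summing per-message token counts until a limit) works the same way; the key point is that the history has a hard ceiling.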

Real Example

Early Wasaphi would retrieve Reddit posts like this:

# The token bomb
posts = reddit.get_hot_posts("wallstreetbets", limit=50)
for post in posts:
    post['comments'] = reddit.get_comments(post['id'], limit=100)

50 posts × 100 comments × ~200 tokens each = 1,000,000 tokens per request.

My API bill after one day of testing: $47.

The Fixes

Fix 1: Active RAG (Only When Needed)

Don’t retrieve by default. Let the agent decide if it needs external context:

@tool
def search_knowledge_base(query: str) -> str:
    """Search internal docs. Only use if you need specific information not in your training."""
    results = vectorstore.similarity_search(query, k=3)  # k=3, not k=20
    return format_compact(results)

Switching from passive RAG (inject context on every query) to active RAG (the agent chooses when to retrieve) can cut token usage by 70%.

Fix 2: Auto Model Selection

Not every query needs GPT-4 or Claude Opus.

def select_model(query: str, conversation_length: int) -> str:
    """Route to appropriate model based on complexity."""

    # Simple queries → cheap model
    if is_simple_factual(query):
        return "gpt-5-nano"  # $0.025/1M tokens input

    # Long conversations → still use efficient model
    if conversation_length > 10:
        return "claude-3-haiku"  # Fast, cheap

    # Complex reasoning → premium model
    if needs_deep_analysis(query):
        return "claude-4-5-sonnet"

    return "gpt-5-nano"  # Default to cheap

In Wasaphi, users can choose their model, but I also built an “Auto” mode that routes intelligently. 80% of queries work fine with the cheapest model.

Fix 3: Skill Servers (Again)

Skills define not just what to do, but how much context is needed:

SKILL = {
    "name": "quick_price_check",
    "max_context_tokens": 500,
    "rag_enabled": False,
    "model": "gpt-4o-mini"
}

Simple queries get simple handling. No bloat.

You also don’t have to load long prompt context explaining how to use every tool up front. Instead, expose an endpoint that returns short descriptions of all tools plus a way to fetch a tool’s complete instructions, and load the full instructions only when they’re actually needed.

Fix 4: Compact Tool Responses

Format for AI, not humans:

# Bad: Raw JSON (500 tokens)
{
    "ticker": "NVDA",
    "price": 142.50,
    "change": 2.3,
    "volume": 45000000,
    "market_cap": 3500000000000,
    "pe_ratio": 65.2,
    "52_week_high": 152.89,
    # ... 20 more fields
}

# Good: Compact format (50 tokens)
"NVDA: $142.50 (+2.3%) | Vol: 45M | MCap: 3.5T | P/E: 65"

10x token reduction. Same information density for the LLM.
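A formatter that produces the compact line above is trivial to write. This is a sketch; the field names in the input dict are assumptions about what the upstream API returns:

```python
# Hypothetical sketch: collapse a verbose API payload into a compact
# single-line format for the LLM. Field names are illustrative.
def format_compact(data: dict) -> str:
    """Render a stock snapshot as one short pipe-delimited line."""
    return (
        f"{data['ticker']}: ${data['price']:.2f} ({data['change']:+.1f}%) | "
        f"Vol: {data['volume'] / 1e6:.0f}M | "
        f"MCap: {data['market_cap'] / 1e12:.1f}T | "
        f"P/E: {data['pe_ratio']:.0f}"
    )

snapshot = {
    "ticker": "NVDA", "price": 142.50, "change": 2.3,
    "volume": 45_000_000, "market_cap": 3_500_000_000_000,
    "pe_ratio": 65.2,
}
print(format_compact(snapshot))
# → NVDA: $142.50 (+2.3%) | Vol: 45M | MCap: 3.5T | P/E: 65
```

The LLM parses this line just as reliably as the raw JSON, at a fraction of the token cost.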

Fix 5: TOON Architecture (Advanced)

TOON (Tool-Oriented Orchestration Networks) is a more complex pattern where you separate:

  • Orchestrator: Lightweight model that routes requests
  • Specialists: Domain-specific agents that handle execution

The orchestrator uses minimal tokens. Specialists only activate when needed.

This is harder to implement but scales beautifully. It’s worth exploring if you’re building something serious, but only after achieving product-market fit and making the previous improvements.
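To make the orchestrator/specialist split concrete, here is a hypothetical sketch. The keyword routing and the specialist functions are stand-ins for a cheap classifier model and real domain agents:

```python
# Hypothetical sketch of the orchestrator/specialist split: a lightweight
# router classifies the request, and only the chosen specialist runs.
def price_specialist(query: str) -> str:
    return "price path: fast model, one tool"

def sentiment_specialist(query: str) -> str:
    return "sentiment path: reddit + news tools"

def general_specialist(query: str) -> str:
    return "general path: premium model"

def orchestrate(query: str) -> str:
    """Route with a cheap check; specialists only activate when needed."""
    q = query.lower()
    if "price" in q or "quote" in q:
        return price_specialist(query)
    if "sentiment" in q:
        return sentiment_specialist(query)
    return general_specialist(query)

orchestrate("What's the sentiment on NVDA?")
# → "sentiment path: reddit + news tools"
```

In a real system the orchestrator would be a small, cheap model and each specialist its own agent with its own tools and context budget; the token savings come from the fact that the orchestrator’s context stays tiny.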

Killer #3: Latency Death Spiral (The UX Killer)

The Problem

User sends a message. Waits. And waits. And waits.

30 seconds later: “Here’s your answer!”

By then, they’ve already:

  • Opened another tab
  • Lost trust in your product
  • Decided to do it manually

Why It Happens

Sequential tool calls:

1. Call Reddit API (2s)
2. Wait for response
3. Call News API (1.5s)
4. Wait for response
5. Call SEC API (3s)
6. Wait for response
7. LLM processes (2s)
8. Generate response (1s)
Total: 9.5 seconds minimum

RAG retrieval overhead:

1. Embed query (0.5s)
2. Vector search (0.3s)
3. Retrieve documents (0.2s)
4. Re-rank results (1s)
5. Inject into context
6. Now the LLM can start...

Model cold starts: Serverless deployments add 2–5s on first request.

Real Example

Early Wasaphi analyzed 5 stocks like this:

For each stock (5 total):
- get_company_info (1s)
- get_historical_prices (1s)
- get_news (1s)
- get_sentiment (2s)

5 stocks × 5s each = 25 seconds
+ LLM reasoning = 30+ seconds total

Users thought the app was broken.

The Fixes

Fix 1: Meta-Tools (Parallel Fetching)

I covered this in my MCP article, but it’s crucial for latency:

@tool
async def get_stock_snapshot(ticker: str) -> str:
    """Fetch ALL data for a stock in one call."""

    # Parallel execution - all at once
    results = await asyncio.gather(
        get_company_info(ticker),
        get_historical_prices(ticker),
        get_news(ticker),
        get_sentiment(ticker),
        return_exceptions=True
    )

    return format_snapshot(results)

Before: 5 sequential calls × 5 stocks = 25 tool calls
After: 1 meta-tool call × 5 stocks = 5 tool calls (parallelized internally)

Latency reduction: 80%+

Fix 2: Stream Everything

Don’t wait for completion. Stream the response:

async def stream_response(query: str):
    # Stream thinking indicators
    yield "🔍 Analyzing your request...\n"

    # Stream tool usage
    async for tool_call in agent.process_with_tools(query):
        yield f"📊 Fetching {tool_call.name}...\n"

    # Stream the actual response
    async for token in agent.generate_response():
        yield token

In Wasaphi, I stream:

  1. “Thinking…” indicator
  2. Tool calls as they happen, with friendly names
  3. The actual response token by token

Perceived latency drops dramatically even if actual latency stays the same.

Fix 3: Auto Model for Speed

Simple queries don’t need powerful models:

# User: "What's the price of AAPL?"
# → Route to GPT-4o-mini (fast, cheap)
# → Response in <1 second
# User: "Analyze the options flow and recommend a strategy"
# → Route to Claude Sonnet (slower, smarter)
# → Stream response over 5-10 seconds

Match model to query complexity. Fast models for fast queries.

Fix 4: Active vs Passive RAG

Passive RAG: Retrieve on every query (adds 1–2s latency)

Active RAG: Retrieve only when agent requests it

# Active RAG - the agent decides
@tool
def search_docs(query: str) -> str:
    """Search documentation. Use only when you need specific information."""
    return vectorstore.search(query, k=3)

# Agent receives query "What's 2+2?"
# → Doesn't call search_docs
# → Responds instantly

# Agent receives query "What was in our Q3 report?"
# → Calls search_docs
# → Worth the latency

Fix 5: Skill-Based Routing

Skills can specify latency requirements:

SKILLS = {
    "quick_lookup": {
        "max_latency_ms": 2000,
        "model": "gpt-4o-mini",
        "rag": False,
        "tools": ["get_price"]
    },
    "deep_analysis": {
        "max_latency_ms": 30000,
        "model": "claude-sonnet",
        "rag": True,
        "tools": ["get_stock_snapshot", "analyze_options"]
    }
}

The router picks the skill based on query type, automatically optimizing for latency.
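A minimal router over such a skill table might look like this. The keyword matching is a stand-in for whatever classifier (or cheap LLM call) you actually use:

```python
# Hypothetical sketch: pick a skill config by query type, so latency
# budget, model, and tools are all decided before the agent runs.
SKILLS = {
    "quick_lookup": {"max_latency_ms": 2000, "model": "gpt-4o-mini", "rag": False},
    "deep_analysis": {"max_latency_ms": 30000, "model": "claude-sonnet", "rag": True},
}

def route(query: str) -> dict:
    """Keyword routing as a stand-in for a classifier or cheap LLM call."""
    quick_markers = ("price", "quote", "ticker")
    if any(marker in query.lower() for marker in quick_markers):
        return SKILLS["quick_lookup"]
    return SKILLS["deep_analysis"]

route("What's the price of AAPL?")["model"]
# → "gpt-4o-mini"
```

Because the skill is chosen up front, you can enforce the latency budget (e.g. time out a `quick_lookup` at 2 seconds) instead of discovering the overrun after the fact.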

The Pattern That Saves Agents

After all these failures and fixes, here’s the architecture that actually works:

My flow for a production-ready agent

This pattern addresses all three killers:

  • Non-determinism: Skill routing + workflows
  • Token explosion: Model selection + active RAG + compact formatting
  • Latency: Meta-tools + streaming + fast model routing

The Checklist Before You Ship

Before your agent goes to production, verify:

Determinism

  • Critical paths have defined workflows (not LLM freestyle)
  • Tool selection is constrained for common queries
  • Same input produces consistent output (test 10x)
  • Behavioral guidelines are explicit, not implicit
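The “test 10x” item above is easy to automate. Here is a hedged sketch of a consistency check; `agent_run` is a hypothetical stand-in that returns the sequence of tools the agent called for a query:

```python
# Hypothetical sketch: run the same query N times and measure how often
# the agent takes its most common execution path.
from collections import Counter

def consistency_check(agent_run, query: str, runs: int = 10) -> float:
    """Return the fraction of runs matching the most common tool path."""
    paths = [tuple(agent_run(query)) for _ in range(runs)]
    most_common_count = Counter(paths).most_common(1)[0][1]
    return most_common_count / runs

# Stub agent that is perfectly deterministic, for illustration
stub = lambda q: ["get_reddit_sentiment", "synthesize_sentiment"]
consistency_check(stub, "What's the sentiment on NVDA?")
# → 1.0
```

Anything well below 1.0 on a critical path means the workflow isn’t constrained enough yet.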

Token Control

  • RAG is active, not passive (agent chooses when to retrieve)
  • Tool responses are compact (formatted for AI, not humans)
  • Model selection is automatic based on complexity
  • Context has hard limits (max history, max retrieval)

Latency

  • Meta-tools exist for multi-source queries
  • Data fetching is parallelized
  • Responses are streamed (thinking + tools + output)
  • Simple queries route to fast models

The Uncomfortable Truth

Here it is:

The agent that demos well is not the agent that ships well.

Demos hide non-determinism (you show the good runs). Demos hide token costs (you’re not at scale). Demos hide latency (you have fresh context, warm models).

Production exposes everything.

The agents that survive are the ones built with constraints from day one:

  • Constrained tool access
  • Constrained token budgets
  • Constrained latency targets

Freedom is for demos. Discipline is for production.

Thanks for reading! I’m Elliott, a Python & Agentic AI consultant and Entrepreneur who builds practical AI tools and shares what actually works (and what spectacularly doesn’t). I write weekly about the tools I’m experimenting with, the projects I’m building, and the hard lessons I learn — usually the expensive way, so you don’t have to.

If this saved you from a production disaster, smash that clap button 👏 (50 times if you’re feeling generous) and follow for more honest takes on AI tooling and agent architecture.

Built an agent that died in production? Drop a comment — I’d love to hear what killed it and how you fixed it (or didn’t).


Published via Towards AI


Note: Article content contains the views of the contributing authors and not Towards AI.