Your AI Agent Works in Demo. It Dies in Production. Here’s Why.

Last Updated on January 15, 2026 by Editorial Team

Author(s): Elliott Girard

Originally published on Towards AI.

I’ve built 38 AI agents in 2025. 6 made it to production. 32 didn’t. Here’s what killed them — and how to avoid the same fate.

The Demo That Fooled Everyone

You know the feeling.

Your agent works perfectly in testing. It fetches data, reasons through problems, calls the right tools. Your demo goes flawlessly. Stakeholders are impressed.

Then you ship it.

Users complain it’s slow. Your API bill explodes. The same query returns completely different results each time. Support tickets pile up.

What happened?

After building Wasaphi (a trading agent), Izimail (an email assistant), calendar bots, other personal assistants, and a lot of client projects, I’ve identified the three killers that murder AI agents between demo and production:

  1. Non-determinism — Same input, different outputs. Impossible to debug.
  2. Token explosion — Your context window becomes a black hole for money.
  3. Latency death spiral — 30 seconds per response kills user trust.

Let me show you exactly how each one kills your agent — and what to do about it.

Image created with AI by the author

Killer #1: Non-Determinism (The Silent Chaos)

The Problem

Your agent works… most of the time.

Ask it the same question twice:

  • First response: Perfect, concise, uses the right tool
  • Second response: Rambling, calls 3 unnecessary tools, misses the point

Users hate this. They don’t know what to expect. They lose trust fast.

Why It Happens

LLMs are probabilistic. Even with temperature=0, you get variation:

  • Slight differences in token generation
  • Tool selection isn’t deterministic
  • Context window changes affect reasoning
  • Different conversation histories = different behaviors

Real Example

My trading agent Wasaphi had this problem early on:

User: "What's the sentiment on NVDA?"
Response 1: [calls Reddit tool → analyzes → returns clean summary]
Response 2: [calls Reddit tool → calls news tool → calls SEC tool → runs out of iterations → returns incomplete garbage]

Same query. Wildly different execution paths. Unpredictable costs.

The Fix: Constrained Workflows

For critical paths, don’t let the LLM decide. Build rigid workflows.

Option 1: Skill Servers

Instead of letting the agent freestyle, create explicit skill definitions:

# skills/sentiment_analysis.py
SKILL = {
    "name": "sentiment_analysis",
    "description": "Analyze sentiment for a given ticker",
    "workflow": [
        {"tool": "get_reddit_sentiment", "required": True},
        {"tool": "get_news_headlines", "required": False},
        {"tool": "synthesize_sentiment", "required": True}
    ],
    "max_iterations": 3
}

The agent doesn’t choose what to call — the workflow does.
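To make that concrete, here is a minimal sketch of a runner for such a skill definition. The tool registry and stub tools are assumptions for illustration, not the article's actual implementation:

```python
# Hypothetical sketch: execute a skill's workflow steps in a fixed order,
# instead of letting the LLM pick tools freely.
def run_skill(skill: dict, tool_registry: dict, query: str) -> list:
    """Run each workflow step deterministically; skip optional steps that fail."""
    outputs = []
    for step in skill["workflow"][: skill.get("max_iterations", 10)]:
        tool = tool_registry[step["tool"]]
        try:
            outputs.append(tool(query))
        except Exception:
            if step["required"]:
                raise  # a required step failing aborts the whole skill
            # an optional step failed: continue without it
    return outputs

# Usage with stub tools standing in for real ones
registry = {
    "get_reddit_sentiment": lambda q: f"reddit sentiment for {q}",
    "get_news_headlines": lambda q: f"headlines for {q}",
    "synthesize_sentiment": lambda q: f"summary for {q}",
}
SKILL = {
    "name": "sentiment_analysis",
    "workflow": [
        {"tool": "get_reddit_sentiment", "required": True},
        {"tool": "get_news_headlines", "required": False},
        {"tool": "synthesize_sentiment", "required": True},
    ],
    "max_iterations": 3,
}
results = run_skill(SKILL, registry, "NVDA")
```

Every run of the same skill executes the same steps in the same order; the only thing that varies is the tool output itself.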

Option 2: Parlant for Behavioral Constraints

I’ve been testing Parlant — it adds a determinism layer to agents:

import parlant.sdk as p

@p.tool
async def get_sentiment(context: p.ToolContext, ticker: str) -> p.ToolResult:
    # Your sentiment logic
    return p.ToolResult(f"Bullish sentiment for {ticker}")

async def main():
    async with p.Server() as server:
        agent = await server.create_agent(
            name="SentimentBot",
            description="Stock sentiment analyzer"
        )

        # This is the magic: behavioral constraints
        await agent.create_guideline(
            condition="User asks about sentiment",
            action="Always use get_sentiment tool first, then summarize",
            tools=[get_sentiment]
        )

The guideline system forces specific behaviors based on input patterns. No more "maybe it calls this tool, maybe it doesn't."

Option 3: Aggressive Prompt Engineering

If you can’t restructure, at least constrain via prompts:

CRITICAL RULES:
1. For sentiment queries: ALWAYS call get_reddit_sentiment FIRST
2. NEVER call more than 3 tools per query
3. If uncertain which tool to use, ask user for clarification instead of guessing

Not as reliable as structured workflows, but better than nothing.

The Tradeoff

More determinism = less flexibility.

That’s okay. For production agents, predictability beats creativity.

Reduce risk from the beginning

This often happens when you want a single agent to do too many things. The more focused the need and the scope of action are when you start a project, the more predictable the agent will be.

Killer #2: Token Explosion (The Bill That Kills)

The Problem

Your agent works great with short conversations. Then:

  • Context accumulates
  • RAG retrieves too much
  • Tool responses bloat the context
  • Suddenly each request costs $0.50 instead of $0.02

At scale, this kills your margins — or your funding.

Why It Happens

RAG gone wrong:

# Bad: Retrieve everything remotely relevant
results = vectorstore.similarity_search(query, k=20)
context = "\n".join([doc.page_content for doc in results])
# context is now 15,000 tokens of mostly irrelevant content

Tool responses uncontrolled:

# Bad: Return raw API responses
@tool
def get_stock_data(ticker: str):
    return yahoo_finance.get_everything(ticker)  # 5,000 tokens of JSON

Conversation history unbounded:

# Bad: Keep everything forever
messages.append({"role": "user", "content": user_message})
messages.append({"role": "assistant", "content": assistant_response})
# After 50 exchanges: 30,000 tokens of history
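A common fix for the unbounded-history case is a sliding window. Here is a minimal sketch, assuming an OpenAI-style message list; the cap of 12 messages is an arbitrary illustration:

```python
# Hypothetical sketch: cap conversation history with a sliding window
# so context never grows without bound.
def trim_history(messages: list, max_messages: int = 12) -> list:
    """Keep the system prompt plus only the most recent exchanges."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]

# Simulate 50 exchanges that would otherwise bloat the context
messages = [{"role": "system", "content": "You are a trading assistant."}]
for i in range(50):
    messages.append({"role": "user", "content": f"question {i}"})
    messages.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(messages)
# → 1 system message + the 12 most recent messages
```

A token-budget variant (summing per-message token counts until a limit) works the same way; the key point is that the history has a hard ceiling.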

Real Example

Early Wasaphi would retrieve Reddit posts like this:

# The token bomb
posts = reddit.get_hot_posts("wallstreetbets", limit=50)
for post in posts:
    post['comments'] = reddit.get_comments(post['id'], limit=100)

50 posts × 100 comments × ~200 tokens each = 1,000,000 tokens per request.

My API bill after one day of testing: $47.

The Fixes

Fix 1: Active RAG (Only When Needed)

Don’t retrieve by default. Let the agent decide if it needs external context:

@tool
def search_knowledge_base(query: str) -> str:
    """Search internal docs. Only use if you need specific information not in your training."""
    results = vectorstore.similarity_search(query, k=3)  # k=3, not k=20
    return format_compact(results)

Switching from passive RAG (inject context on every query) to active RAG (the agent chooses when to retrieve) can cut token usage by 70%.

Fix 2: Auto Model Selection

Not every query needs GPT-4 or Claude Opus.

def select_model(query: str, conversation_length: int) -> str:
    """Route to appropriate model based on complexity."""

    # Simple queries → cheap model
    if is_simple_factual(query):
        return "gpt-5-nano"  # $0.025/1M tokens input

    # Long conversations → still use efficient model
    if conversation_length > 10:
        return "claude-3-haiku"  # Fast, cheap

    # Complex reasoning → premium model
    if needs_deep_analysis(query):
        return "claude-4-5-sonnet"

    return "gpt-5-nano"  # Default to cheap

In Wasaphi, users can choose their model, but I also built an “Auto” mode that routes intelligently. 80% of queries work fine with the cheapest model.

Fix 3: Skill Servers (Again)

Skills define not just what to do, but how much context is needed:

SKILL = {
    "name": "quick_price_check",
    "max_context_tokens": 500,
    "rag_enabled": False,
    "model": "gpt-4o-mini"
}

Simple queries get simple handling. No bloat.

You also don’t have to load long prompt context explaining how to use every tool up front. Instead, expose an endpoint that returns short descriptions of all tools plus a way to fetch a tool’s complete instructions, and load the full instructions only when they’re actually needed.

Fix 4: Compact Tool Responses

Format for AI, not humans:

# Bad: Raw JSON (500 tokens)
{
    "ticker": "NVDA",
    "price": 142.50,
    "change": 2.3,
    "volume": 45000000,
    "market_cap": 3500000000000,
    "pe_ratio": 65.2,
    "52_week_high": 152.89,
    # ... 20 more fields
}

# Good: Compact format (50 tokens)
"NVDA: $142.50 (+2.3%) | Vol: 45M | MCap: 3.5T | P/E: 65"

10x token reduction. Same information density for the LLM.
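A formatter that produces the compact line above is trivial to write. This is a sketch; the field names in the input dict are assumptions about what the upstream API returns:

```python
# Hypothetical sketch: collapse a verbose API payload into a compact
# single-line format for the LLM. Field names are illustrative.
def format_compact(data: dict) -> str:
    """Render a stock snapshot as one short pipe-delimited line."""
    return (
        f"{data['ticker']}: ${data['price']:.2f} ({data['change']:+.1f}%) | "
        f"Vol: {data['volume'] / 1e6:.0f}M | "
        f"MCap: {data['market_cap'] / 1e12:.1f}T | "
        f"P/E: {data['pe_ratio']:.0f}"
    )

snapshot = {
    "ticker": "NVDA", "price": 142.50, "change": 2.3,
    "volume": 45_000_000, "market_cap": 3_500_000_000_000,
    "pe_ratio": 65.2,
}
print(format_compact(snapshot))
# → NVDA: $142.50 (+2.3%) | Vol: 45M | MCap: 3.5T | P/E: 65
```

The LLM parses this line just as reliably as the raw JSON, at a fraction of the token cost.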

Fix 5: TOON Architecture (Advanced)

TOON (Tool-Oriented Orchestration Networks) is a more complex pattern where you separate:

  • Orchestrator: Lightweight model that routes requests
  • Specialists: Domain-specific agents that handle execution

The orchestrator uses minimal tokens. Specialists only activate when needed.

This is harder to implement but scales beautifully. It’s worth exploring if you’re building something serious, but only after achieving product-market fit and making the previous improvements.
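To make the orchestrator/specialist split concrete, here is a hypothetical sketch. The keyword routing and the specialist functions are stand-ins for a cheap classifier model and real domain agents:

```python
# Hypothetical sketch of the orchestrator/specialist split: a lightweight
# router classifies the request, and only the chosen specialist runs.
def price_specialist(query: str) -> str:
    return "price path: fast model, one tool"

def sentiment_specialist(query: str) -> str:
    return "sentiment path: reddit + news tools"

def general_specialist(query: str) -> str:
    return "general path: premium model"

def orchestrate(query: str) -> str:
    """Route with a cheap check; specialists only activate when needed."""
    q = query.lower()
    if "price" in q or "quote" in q:
        return price_specialist(query)
    if "sentiment" in q:
        return sentiment_specialist(query)
    return general_specialist(query)

orchestrate("What's the sentiment on NVDA?")
# → "sentiment path: reddit + news tools"
```

In a real system the orchestrator would be a small, cheap model and each specialist its own agent with its own tools and context budget; the token savings come from the fact that the orchestrator’s context stays tiny.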

Killer #3: Latency Death Spiral (The UX Killer)

The Problem

User sends a message. Waits. And waits. And waits.

30 seconds later: “Here’s your answer!”

By then, they’ve already:

  • Opened another tab
  • Lost trust in your product
  • Decided to do it manually

Why It Happens

Sequential tool calls:

1. Call Reddit API (2s)
2. Wait for response
3. Call News API (1.5s)
4. Wait for response
5. Call SEC API (3s)
6. Wait for response
7. LLM processes (2s)
8. Generate response (1s)
Total: 9.5 seconds minimum

RAG retrieval overhead:

1. Embed query (0.5s)
2. Vector search (0.3s)
3. Retrieve documents (0.2s)
4. Re-rank results (1s)
5. Inject into context
6. Now the LLM can start...

Model cold starts: Serverless deployments add 2–5s on first request.

Real Example

Early Wasaphi analyzed 5 stocks like this:

For each stock (5 total):
- get_company_info (1s)
- get_historical_prices (1s)
- get_news (1s)
- get_sentiment (2s)

5 stocks × 5s each = 25 seconds
+ LLM reasoning = 30+ seconds total

Users thought the app was broken.

The Fixes

Fix 1: Meta-Tools (Parallel Fetching)

I covered this in my MCP article, but it’s crucial for latency:

@tool
async def get_stock_snapshot(ticker: str) -> str:
    """Fetch ALL data for a stock in one call."""

    # Parallel execution - all at once
    results = await asyncio.gather(
        get_company_info(ticker),
        get_historical_prices(ticker),
        get_news(ticker),
        get_sentiment(ticker),
        return_exceptions=True
    )

    return format_snapshot(results)

Before: 5 sequential calls × 5 stocks = 25 tool calls
After: 1 meta-tool call × 5 stocks = 5 tool calls (parallelized internally)

Latency reduction: 80%+

Fix 2: Stream Everything

Don’t wait for completion. Stream the response:

async def stream_response(query: str):
    # Stream thinking indicators
    yield "🔍 Analyzing your request...\n"

    # Stream tool usage
    async for tool_call in agent.process_with_tools(query):
        yield f"📊 Fetching {tool_call.name}...\n"

    # Stream the actual response
    async for token in agent.generate_response():
        yield token

In Wasaphi, I stream:

  1. “Thinking…” indicator
  2. Tool calls as they happen, with friendly names
  3. The actual response token by token

Perceived latency drops dramatically even if actual latency stays the same.

Fix 3: Auto Model for Speed

Simple queries don’t need powerful models:

# User: "What's the price of AAPL?"
# → Route to GPT-4o-mini (fast, cheap)
# → Response in <1 second
# User: "Analyze the options flow and recommend a strategy"
# → Route to Claude Sonnet (slower, smarter)
# → Stream response over 5-10 seconds

Match model to query complexity. Fast models for fast queries.

Fix 4: Active vs Passive RAG

Passive RAG: Retrieve on every query (adds 1–2s latency)

Active RAG: Retrieve only when agent requests it

# Active RAG - the agent decides
@tool
def search_docs(query: str) -> str:
    """Search documentation. Use only when you need specific information."""
    return vectorstore.search(query, k=3)

# Agent receives query "What's 2+2?"
# → Doesn't call search_docs
# → Responds instantly

# Agent receives query "What was in our Q3 report?"
# → Calls search_docs
# → Worth the latency

Fix 5: Skill-Based Routing

Skills can specify latency requirements:

SKILLS = {
    "quick_lookup": {
        "max_latency_ms": 2000,
        "model": "gpt-4o-mini",
        "rag": False,
        "tools": ["get_price"]
    },
    "deep_analysis": {
        "max_latency_ms": 30000,
        "model": "claude-sonnet",
        "rag": True,
        "tools": ["get_stock_snapshot", "analyze_options"]
    }
}

The router picks the skill based on query type, automatically optimizing for latency.
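A minimal router over such a skill table might look like this. The keyword matching is a stand-in for whatever classifier (or cheap LLM call) you actually use:

```python
# Hypothetical sketch: pick a skill config by query type, so latency
# budget, model, and tools are all decided before the agent runs.
SKILLS = {
    "quick_lookup": {"max_latency_ms": 2000, "model": "gpt-4o-mini", "rag": False},
    "deep_analysis": {"max_latency_ms": 30000, "model": "claude-sonnet", "rag": True},
}

def route(query: str) -> dict:
    """Keyword routing as a stand-in for a classifier or cheap LLM call."""
    quick_markers = ("price", "quote", "ticker")
    if any(marker in query.lower() for marker in quick_markers):
        return SKILLS["quick_lookup"]
    return SKILLS["deep_analysis"]

route("What's the price of AAPL?")["model"]
# → "gpt-4o-mini"
```

Because the skill is chosen up front, you can enforce the latency budget (e.g. time out a `quick_lookup` at 2 seconds) instead of discovering the overrun after the fact.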

The Pattern That Saves Agents

After all these failures and fixes, here’s the architecture that actually works:

My flow for a production-ready agent

This pattern addresses all three killers:

  • Non-determinism: Skill routing + workflows
  • Token explosion: Model selection + active RAG + compact formatting
  • Latency: Meta-tools + streaming + fast model routing

The Checklist Before You Ship

Before your agent goes to production, verify:

Determinism

  • Critical paths have defined workflows (not LLM freestyle)
  • Tool selection is constrained for common queries
  • Same input produces consistent output (test 10x)
  • Behavioral guidelines are explicit, not implicit
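The “test 10x” item above is easy to automate. Here is a hedged sketch of a consistency check; `agent_run` is a hypothetical stand-in that returns the sequence of tools the agent called for a query:

```python
# Hypothetical sketch: run the same query N times and measure how often
# the agent takes its most common execution path.
from collections import Counter

def consistency_check(agent_run, query: str, runs: int = 10) -> float:
    """Return the fraction of runs matching the most common tool path."""
    paths = [tuple(agent_run(query)) for _ in range(runs)]
    most_common_count = Counter(paths).most_common(1)[0][1]
    return most_common_count / runs

# Stub agent that is perfectly deterministic, for illustration
stub = lambda q: ["get_reddit_sentiment", "synthesize_sentiment"]
consistency_check(stub, "What's the sentiment on NVDA?")
# → 1.0
```

Anything well below 1.0 on a critical path means the workflow isn’t constrained enough yet.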

Token Control

  • RAG is active, not passive (agent chooses when to retrieve)
  • Tool responses are compact (formatted for AI, not humans)
  • Model selection is automatic based on complexity
  • Context has hard limits (max history, max retrieval)

Latency

  • Meta-tools exist for multi-source queries
  • Data fetching is parallelized
  • Responses are streamed (thinking + tools + output)
  • Simple queries route to fast models

The Uncomfortable Truth

Here it is:

The agent that demos well is not the agent that ships well.

Demos hide non-determinism (you show the good runs). Demos hide token costs (you’re not at scale). Demos hide latency (you have fresh context, warm models).

Production exposes everything.

The agents that survive are the ones built with constraints from day one:

  • Constrained tool access
  • Constrained token budgets
  • Constrained latency targets

Freedom is for demos. Discipline is for production.

Thanks for reading! I’m Elliott, a Python & Agentic AI consultant and Entrepreneur who builds practical AI tools and shares what actually works (and what spectacularly doesn’t). I write weekly about the tools I’m experimenting with, the projects I’m building, and the hard lessons I learn — usually the expensive way, so you don’t have to.

If this saved you from a production disaster, smash that clap button 👏 (50 times if you’re feeling generous) and follow for more honest takes on AI tooling and agent architecture.

Built an agent that died in production? Drop a comment — I’d love to hear what killed it and how you fixed it (or didn’t).


Published via Towards AI


Note: Article content contains the views of the contributing authors and not Towards AI.