
Resources

Get your free Agents Cheatsheet here. Our proven framework for choosing the right AI architecture.
3 years of hands-on work with real clients into 6 pages.


AI Agents in Production: What Actually Works (Based on 300+ Deployments)

Last Updated on December 29, 2025 by Editorial Team

Author(s): Artem Shelamanov

Originally published on Towards AI.

As 2025 comes to an end, everyone wants to wrap themselves in cozy blankets, stare at the Christmas tree, and relax with a mug of hot cocoa. Instead, many data scientists are working overtime on weekends because some LLM agents just made very expensive mistakes.

To avoid exactly this, a group of smart folks from Stanford, IBM, and other top institutions ran a large study of 306 AI practitioners whose agents are actually deployed in the real world and expected to work reliably, with near human-level quality.

This article shows the main insights from their 50-page research paper [1]. I have been working with AI agents since the GPT-3 era, and I was genuinely surprised by several of their findings.

So let’s get straight into the key takeaways and see what they tell us about building smarter AI agents in 2026.

Photo by Kieran White on Unsplash

1. How do you measure value?

Before talking about what works and what does not, we first need to define what “works” even means. How do you know whether an agent is good or bad? How do you measure its actual usefulness, beyond vibes and a colleague’s “looks good to me”?

It turns out that successful agents are not judged by how autonomous or impressive they look. In most cases, they are not even customer-facing products. Instead, they quietly live inside companies as internal tools, doing boring but necessary work and never seeing the outside world.

According to the paper, the top motivations for deploying agents are:

  • Increasing productivity: 73%;
  • Reducing human task-hours: 64%;
  • Automating routine labor: 50%.

In other words, agents get deployed when they save people time. Not when they replace humans. Not when they do some magical data processing that classical algorithms could never dream of. Just when they make humans faster.

This is also reflected in how agents are used: 92.5% of deployed agents primarily serve human users. They sit next to people, not instead of them.

Key insight: The main metric that matters is painfully simple: how many human hours did we save?
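That metric is easy to make concrete. Here is a minimal sketch of an hours-saved tracker; the function name and all the numbers are illustrative assumptions, not figures from the paper:

```python
# Rough "human hours saved" metric for an internal agent.
# Function name and numbers are illustrative, not from the paper.

def hours_saved(tasks_completed: int,
                human_minutes_per_task: float,
                review_minutes_per_task: float) -> float:
    """Net hours saved: time a person would have spent on the tasks,
    minus the time people still spend reviewing agent output."""
    would_have_spent = tasks_completed * human_minutes_per_task
    still_spent = tasks_completed * review_minutes_per_task
    return (would_have_spent - still_spent) / 60.0

# e.g. 120 tickets triaged: 15 min each by hand, 3 min to review
print(hours_saved(120, 15, 3))  # 24.0
```

Note that the review time is subtracted: an agent whose output takes as long to check as to produce by hand saves nothing.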

2. What actually works?

The biggest pattern across successful deployments is simple: limit the agent’s autonomy.

This was very unexpected, at least for me. Autonomy is supposed to be the defining feature of AI agents. It is the reason they became popular in the first place. We have all seen cool demos where an agent completes a one-hour task in dozens of steps and thought, “wow, this is definitely the future”.

As it turns out, reality is less fun and much more down to earth. The less freedom you give an agent, the better it works.

Some numbers from the paper:

  • 68% of deployed agents execute at most 10 steps before requiring human intervention;
  • 47% execute 4 steps or fewer;
  • Most agents run inside static, predefined workflows, not self-planning loops.
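A hard step budget like this is straightforward to enforce in code. Here is a sketch of a capped loop; `run_step` is a hypothetical stand-in for whatever performs one tool or LLM call in your system:

```python
# Step-capped agent loop: the agent gets a fixed budget of steps,
# then control is handed back to a human. `run_step` is a stand-in
# for whatever does one tool call / LLM call in your system.

MAX_STEPS = 10  # 68% of surveyed agents stop within 10 steps

def run_agent(task, run_step, max_steps=MAX_STEPS):
    state = {"task": task, "history": []}
    for step in range(max_steps):
        result = run_step(state)            # one bounded action
        state["history"].append(result)
        if result.get("final"):             # agent believes it is done
            return {"status": "done",
                    "answer": result["answer"],
                    "steps": step + 1}
    # Budget exhausted: escalate instead of looping forever.
    return {"status": "needs_human", "steps": max_steps, "state": state}

# Toy step function that finishes on its third call:
def demo_step(state):
    n = len(state["history"])
    return {"final": n == 2, "answer": "ok"}

print(run_agent("triage ticket", demo_step))
```

The point is the shape, not the details: the loop cannot run away, and the fallback is explicit escalation to a human rather than an unbounded retry.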

So while research agents look impressive and futuristic in papers, production agents are kept on a very short leash. Not because teams lack ambition, but because every extra step multiplies the chance of something going wrong at 3 a.m.

Key insight: reliability beats capability. Teams routinely trade away “intelligence” and autonomy to gain predictability, control, and the ability to sleep at night.

3. How do these agents get implemented?

Let’s start with the obvious question: what LLM do the good guys use?

According to the paper, 17 out of 20 case studies rely on closed-source frontier models. Think OpenAI’s GPT models, Anthropic’s Claude, Google’s Gemini. Open-source models appear in only two situations:

  • When the token volume is absolutely massive;
  • Or when privacy and regulation make external APIs impossible.

In most cases, teams simply pick the smartest model available. Even if it is expensive. Even if it is slow. The reason is simple: it is still much cheaper than human experts doing the same work.

Even more interesting, around 70% of production agents use no fine-tuning at all. Fine-tuning shows up only in rare cases where tasks are extremely narrow and domain- or client-specific. And even then, the fine-tuned model almost never works alone. It is usually paired with a general-purpose LLM.

Key insight: strong base models plus good prompting are already good enough for most real deployments.

But how are they built in practice?

Mostly through prompt engineering. A lot of it.

Prompt construction is almost always manual. Automated prompt optimization appears in fewer than 9% of cases. Humans write prompts, tweak them, break them, fix them, and repeat. LLMs sometimes help refine prompts, but they are rarely trusted to do this autonomously.

Naturally, RAG is everywhere. Open-ended exploration is almost never allowed.

Prompt length is also surprisingly reasonable:

  • Around half are under 500 tokens;
  • Only 12% exceed 10,000 tokens.

Sometimes, less really is more.

When prompts do get large, they usually contain things no one likes to talk about in demos: business rules, failure handling, edge cases, and strict output constraints. In other words, all the stuff that actually matters in production.
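To make that concrete, here is a sketch of how such a prompt tends to be assembled: business rules, failure handling, and a strict output format, glued together by plain string code. All the rule text and names below are made up for illustration:

```python
# Sketch of how a "large" production prompt is typically assembled:
# not clever reasoning instructions, but business rules, edge cases,
# and strict output constraints. All rule text here is made up.

BUSINESS_RULES = [
    "Refunds over $500 always require manager approval.",
    "Never quote internal ticket IDs to the customer.",
]
FAILURE_HANDLING = (
    "If any required field is missing, respond with "
    '{"status": "needs_info", "missing": [...]} and stop.'
)
OUTPUT_SCHEMA = '{"status": "ok" | "needs_info", "reply": string}'

def build_system_prompt(rules, failure_handling, schema):
    rules_block = "\n".join(f"- {r}" for r in rules)
    return (
        "You are an internal support assistant.\n\n"
        f"Business rules:\n{rules_block}\n\n"
        f"Failure handling:\n{failure_handling}\n\n"
        f"Respond ONLY with JSON matching: {schema}"
    )

prompt = build_system_prompt(BUSINESS_RULES, FAILURE_HANDLING, OUTPUT_SCHEMA)
print(prompt)
```

Keeping the pieces in code rather than in one giant string also makes the prompt diffable and testable like any other config.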

Key insight: prompts quietly replace training data and program logic.

And what about frameworks?

Here things get interesting.

There is a clear survey vs reality gap:

  • Around 61% report using agent frameworks;
  • But 85% of interview case studies rely on custom implementations.

The reasons are practical: dependency bloat, flexibility, and security. Core agent loops are usually simple enough that teams prefer to build their own solutions rather than fight abstractions they do not fully control.

That said, frameworks are far from useless. They are commonly used for prototyping, experimentation, or small separate components. They just tend to disappear once things get serious.

Key insight: frameworks are great for demos and prototypes, but production agents usually grow their own skeleton.

4. What about testing?

You would expect good production agents to be built on top of hundreds of tests, carefully created evaluation datasets, and custom benchmarks. As it turns out, the exact opposite is true.

Only 25% of interviewed teams built custom benchmarks. The remaining 75% deploy agents without any formal benchmark at all. Evaluation mostly boils down to a very simple question: “is this output correct enough?”

Evaluation methods themselves are also surprisingly human-heavy. The most common approaches are:

  • Human-in-the-loop evaluation: 74%;
  • LLM-as-a-judge: 52%;
  • Rule-based checks: around 40%.
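Rule-based checks are the cheapest of the three: deterministic code that runs on every response before it reaches anyone. A minimal sketch, with illustrative rules (a JSON-shape check and a length cap) that are my assumptions, not from the paper:

```python
import json

# Minimal rule-based output check: deterministic and cheap, run on
# every agent response before it reaches a user. The specific rules
# here are illustrative, not from the paper.

def check_output(raw: str) -> list[str]:
    """Return a list of rule violations (empty list = passes)."""
    problems = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if data.get("status") not in {"ok", "needs_info"}:
        problems.append("unknown status")
    reply = data.get("reply", "")
    if not isinstance(reply, str) or len(reply) > 2000:
        problems.append("reply missing or too long")
    return problems

print(check_output('{"status": "ok", "reply": "Done."}'))  # []
print(check_output("not json at all"))
```

Checks like these catch the boring failures (malformed output, schema drift) so that humans and LLM judges can focus on the subtle ones.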

Importantly, every single team using LLM judges still relies on humans. Apparently, we are not yet in the phase of trusting LLMs with judging themselves. Probably for good reasons.

An example evaluation looks like this:

  1. An LLM evaluates its own output and assigns a confidence score;
  2. High-confidence outputs are auto-accepted;
  3. Low-confidence outputs are routed to humans;
  4. Humans periodically look over the “safe” outputs, just in case.
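The routing in those four steps boils down to a threshold over a judge score plus a small audit sample. A sketch, where `judge` stands in for an LLM-as-a-judge call returning a score in [0, 1], and the threshold and spot-check rate are made-up values:

```python
import random

# Confidence-based routing for agent outputs. `judge` stands in for
# an LLM-as-a-judge call returning a score in [0, 1]; the threshold
# and spot-check rate below are made-up values.

AUTO_ACCEPT = 0.90      # above this: ship without review
SPOT_CHECK_RATE = 0.05  # fraction of "safe" outputs humans still see

def route(output: str, judge, rng=random.random) -> str:
    score = judge(output)
    if score >= AUTO_ACCEPT:
        # Periodically sample accepted outputs for human audit.
        return "spot_check" if rng() < SPOT_CHECK_RATE else "auto_accept"
    return "human_review"

# Stub judge: longer outputs look "confident" in this toy example.
toy_judge = lambda out: min(1.0, len(out) / 20)

print(route("short", toy_judge))  # human_review
print(route("a long, detailed answer here", toy_judge, rng=lambda: 0.5))  # auto_accept
```

Passing `rng` in explicitly keeps the spot-check sampling testable; in production the judge call is the expensive part, so it runs once per output.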

Why all this caution? Many teams point to the same problem: nondeterminism. An agent can answer a question correctly ten times in a row and still fail the eleventh time. Passing yesterday’s test does not mean it will pass tomorrow’s.

On top of that, outputs are often hard to score programmatically. Not every answer is strictly right or wrong. Many are partially correct, subtly wrong, or wrong in ways that only a domain expert notices after reading it twice.

Key insight: there is no trusted, fully automated evaluation loop today. In production, correctness is still a social and human problem, not a solved technical one.

5. The main problem with the agents?

One word: reliability.

Every single company ran into one of these issues:

  • Correctness;
  • Verifiability;
  • Predictability.

As I said before, agents are often unpredictable. You can never trust an agent to do something correctly 100% of the time.

One thing that was surprising is what did not matter that much: latency.

Only about 15% of teams reported latency as a hard deployment blocker. Some companies barely mentioned it at all. Many agents take minutes to run, and teams are completely fine with that. A few minutes of waiting is still a massive improvement if the alternative is hours of human work, meetings, or back-and-forth emails.

Production agents are not optimized for speed or elegance. They are optimized for usefulness.

Key insight: agents reach production by constraining scope. Smaller tasks = fewer surprises, and fewer surprises = higher reliability.

Summary

According to the paper, if you want an agent to survive in production, do this:

  1. Start with a static workflow;
  2. Cut steps and model calls aggressively;
  3. Always keep a human in the loop;
  4. Use the best available model first.

And avoid:

  1. Open-ended, many-step planning;
  2. Fully autonomous tasks;
  3. Fine-tuning and training custom models early;
  4. Chasing benchmark perfection and aggressive testing.

The final key insight is simple and slightly uncomfortable: production agents do not succeed because they are smart. They succeed because they are constrained.

I really recommend reading the original paper this article is based on. It’s full of useful insights, and it’s written in a very practical way. I’m not usually a fan of dense research papers, but I had a good time reading this one.

Sources

  1. Measuring Agents in Production (https://arxiv.org/pdf/2512.04123)


Published via Towards AI




Note: Content contains the views of the contributing authors and not Towards AI.