How to Evaluate Your AI Agent
Last Updated on June 3, 2024 by Editorial Team
Author(s): Peter Richens
Originally published on Towards AI.
Agents that equip LLMs with the ability to interact with the world autonomously are perhaps the hottest idea of the current AI summer. Over the last year, I've been at the coalface of this trend, developing and evaluating a pioneering AI agent. Our early efforts were characterized by failure, but as we developed a process to track and correct these failures we were able to turn a source of frustration into our core strength: a proprietary dataset to continuously improve agent performance. The evaluation system we built is not especially sophisticated, based on simple heuristics and our domain knowledge, but the core decisions we got right greatly accelerated our development and created an ever-compounding data asset.
While the potential of AI agents has been demonstrated in carefully controlled settings, few companies have a robust product ready to release into the wild. This will no doubt change soon. Those that succeed will have something in common: an agent architecture and evaluation system that makes it easy to iterate quickly.
Keep it simple
A casual search for LLM evaluation approaches will yield a bewildering array of acronyms, benchmarks and arXiv papers. The more you read, the higher your perplexity becomes. Nobody agrees on a general approach to evaluation, but that's OK. If you're building a domain-specific agent, you most likely have a much narrower problem to solve, and it's an area you already know well. So stop googling and start thinking about the problem you're trying to solve. Don't think of evaluation as a hairy problem to tackle later. Start early and small, and incrementally build up your evaluation suite as you develop the agent.
The first step is to observe how your agent fails. It shouldn't be difficult to find failure cases, although pinpointing exactly where in a long chain of LLM calls your agent went off course may take a little more work. You can then focus on a specific step in the chain. Craft a simple assertion that describes the behavior you think the agent should exhibit in this specific scenario. This is similar to a traditional unit test (indeed, that's the term Hamel Husain uses). Now, tweak your agent and re-run the step. Of course, LLMs are probabilistic beasts and we cannot expect a 100% success rate. The aim of the game is to improve the pass rate over time by repeating this loop: identify failures, add assertions and tweak the agent's behavior. This step-by-step approach allows for iterative development and evaluation with a rapid feedback loop.
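As a rough illustration, here is what one of these assertions might look like as a pytest case. `run_search_step`, its return shape and the expected tool names are hypothetical stand-ins for a single step of your own agent, not part of any specific framework.

```python
# test_agent_steps.py -- a minimal sketch of assertion-style agent tests.
# `run_search_step` and the expected behaviors are hypothetical stand-ins
# for one step in your own agent's chain of LLM calls.
import pytest

from my_agent import run_search_step  # hypothetical: replays one agent step

# Each case captures a real failure we observed: the input that triggered it
# and the behavior we assert the agent should exhibit in that scenario.
FAILURE_CASES = [
    {"query": "refund policy for enterprise plan", "expected_tool": "search_docs"},
    {"query": "what is 2 + 2", "expected_tool": None},  # should answer directly
]


@pytest.mark.parametrize("case", FAILURE_CASES)
def test_agent_picks_sensible_tool(case):
    result = run_search_step(case["query"])
    # A simple, domain-specific heuristic rather than an exact-match check.
    assert result.tool_called == case["expected_tool"], (
        f"Expected {case['expected_tool']!r}, agent chose {result.tool_called!r}"
    )
```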
While this process may sound simple, implementing it efficiently depends on some important underlying capabilities. To pinpoint failures quickly, you need to trace and visualize agent trajectories. To restore and replay a particular step, you require some kind of checkpointing. To monitor the assertion pass rate over time, you rely on a versioned dataset of test cases and a UI to track your metrics. Fortunately, in a gold rush, we can expect mining technology to improve rapidly. LLM tooling and platforms are proliferating. Onboarding LangSmith for tracing, dataset management and metric tracking brought us a huge productivity boost.
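For example, here is a rough sketch of how this might look with the LangSmith Python SDK: a `@traceable` decorator to capture agent steps and a named dataset to hold test cases. It assumes the `langsmith` package is installed and an API key is configured via environment variables; the step name, dataset name and example contents are illustrative.

```python
# Rough sketch of tracing and dataset management with the LangSmith SDK.
# Assumes the `langsmith` package is installed and an API key is configured
# (e.g. via the LANGCHAIN_API_KEY / LANGCHAIN_TRACING_V2 environment variables).
from langsmith import Client, traceable


@traceable(name="plan_step")  # every call is traced, so failures are easy to pinpoint
def plan_step(query: str) -> str:
    ...  # your LLM call for this step of the agent


client = Client()

# Keep observed failures in a named dataset so the assertion suite and
# pass-rate metrics always run against the same versioned test cases.
dataset = client.create_dataset(dataset_name="agent-failure-cases")
client.create_example(
    inputs={"query": "refund policy for enterprise plan"},
    outputs={"expected_tool": "search_docs"},
    dataset_id=dataset.id,
)
```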
Focus on flow not prompts
"Prompt engineering" was, until recently, widely hyped as a must-have new skill set. However, too much emphasis here could be a sign you're headed down the wrong track. You don't want to find yourself asking for JSON PLEASE, offering to tip the LLM or threatening its grandmother. All of these strategies have been recommended with a straight face and may work in certain scenarios, but there's a real risk of focusing on a local optimum and missing a much bigger picture. If prompts have only a minor impact on agent performance, where should you focus? A higher leverage area is tweaking the multi-step iterative flow your agent follows, what AlphaCodium has popularized as "flow engineering".
A key ingredient for success is the ability to quickly evaluate changes to your agent's control flow. Solving this problem goes beyond evaluation to the core system design. The agent's architecture must be composable: a Lego set of interchangeable blocks. There are no easy solutions here; it will require careful design tailored to your problem. The first round of successful agents will be anchored around specific use cases, but general agent frameworks will no doubt mature quickly. LangGraph is one to monitor and great for prototyping: you can quickly compose cyclical flows for your agent with built-in support for checkpointing and "time travel". These types of capabilities will not only be critical for evaluation and iteration speed, but could also lay the foundation for user-facing features such as the time-scrubbing interface introduced by Devin.
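To make that concrete, below is a minimal sketch of a cyclical generate-and-review flow with checkpointing in LangGraph. It assumes a recent `langgraph` release; the node logic and state fields are placeholder stubs rather than a real agent.

```python
# Minimal sketch of a cyclical agent flow with checkpointing in LangGraph.
# Assumes a recent `langgraph` release; node logic is stubbed out.
from typing import TypedDict

from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver


class AgentState(TypedDict):
    task: str
    draft: str
    approved: bool


def generate(state: AgentState) -> dict:
    return {"draft": f"proposed solution for {state['task']}"}  # LLM call in practice


def review(state: AgentState) -> dict:
    return {"approved": bool(state["draft"])}  # LLM or heuristic critique in practice


workflow = StateGraph(AgentState)
workflow.add_node("generate", generate)
workflow.add_node("review", review)
workflow.set_entry_point("generate")
workflow.add_edge("generate", "review")
# Loop back to `generate` until the review step approves the draft.
workflow.add_conditional_edges(
    "review",
    lambda s: "done" if s["approved"] else "retry",
    {"done": END, "retry": "generate"},
)

# The checkpointer records every step, enabling replay and "time travel".
app = workflow.compile(checkpointer=MemorySaver())
result = app.invoke(
    {"task": "summarize ticket #123", "draft": "", "approved": False},
    config={"configurable": {"thread_id": "demo-1"}},
)
```

Passing a `thread_id` in the config is what lets the checkpointer later reconstruct and replay any point in that thread's trajectory.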
Turn failures into assets
Once you have a flexible agent architecture and the infrastructure to identify, replay and correct failure cases, things start to get interesting. You will naturally fall into an iterative evaluation process that, when tracked in a standardized way, yields an extremely valuable dataset. Your development velocity will accelerate as comprehensive evaluation coverage provides confidence to move faster. The same dataset can be re-used for fine-tuning, allowing the model to fully internalize the set of past failures and course corrections. Your agent will increasingly "just know" the right action to take, without needing to tweak prompts or re-order steps in the flow.
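As a sketch of that reuse, the same tracked failures and their corrections can be exported in the chat-style JSONL format accepted by common fine-tuning APIs. The record fields below are a hypothetical example of what an evaluation log might contain, not a prescribed schema.

```python
# Rough sketch: export tracked failure cases and their corrections as a
# chat-format JSONL file suitable for fine-tuning. The record fields are a
# hypothetical example of what an evaluation log might contain.
import json

corrected_cases = [
    {
        "system": "You are a support agent. Use tools when documentation is needed.",
        "user_input": "refund policy for enterprise plan",
        "corrected_action": '{"tool": "search_docs", "query": "enterprise refund policy"}',
    },
]

with open("agent_finetune.jsonl", "w") as f:
    for case in corrected_cases:
        record = {
            "messages": [
                {"role": "system", "content": case["system"]},
                {"role": "user", "content": case["user_input"]},
                # The corrected action becomes the target the model should learn.
                {"role": "assistant", "content": case["corrected_action"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```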
If there's anything more important than your agent's control flow, it's the capabilities of the underlying LLM. New general-purpose models are coming out thick and fast and foundation models specifically designed for agents are on the horizon. These developments are not a threat to use case-specific agents, but the reverse: they will push the boundaries of what's possible and reward those betting on the most ambitious AI products. LLMs will continue to excel at rote learning and the value of a high-quality domain-specific dataset will only grow. It's a game where data quality trumps quantity. Any company that efficiently tailors this process to their use case will reap huge rewards.