How to Evaluate Your AI Agent

Last Updated on June 3, 2024 by Editorial Team

Author(s): Peter Richens

Originally published on Towards AI.

Photo by Dominik Scythe on Unsplash

Agents that equip LLMs with the ability to interact with the world autonomously are perhaps the hottest idea of the current AI summer. Over the last year, I’ve been at the coalface of this trend, developing and evaluating a pioneering AI agent. Our early efforts were characterized by failure, but as we developed a process to track and correct these failures, we turned a source of frustration into our core strength: a proprietary dataset for continuously improving agent performance. The evaluation system we built is not especially sophisticated; it rests on simple heuristics and our domain knowledge. But the core decisions we got right greatly accelerated our development and created an ever-compounding data asset.

While the potential of AI agents has been demonstrated in carefully controlled settings, few companies have a robust product ready to release into the wild. This will no doubt change soon. Those that succeed will have something in common: an agent architecture and evaluation system that makes it easy to iterate quickly.

Keep it simple

A casual search for LLM evaluation approaches will yield a bewildering array of acronyms, benchmarks and arXiv papers. The more you read, the higher your perplexity becomes. Nobody agrees on a general approach to evaluation, but that’s OK. If you’re building a domain-specific agent, you most likely have a much narrower problem to solve, and it’s an area you already know well. So stop googling and start thinking about the problem you’re trying to solve. Don’t treat evaluation as a hairy problem to tackle later. Start early and small, and incrementally build up your evaluation suite as you develop the agent.

The first step is to observe how your agent fails. It shouldn’t be difficult to find failure cases, although pinpointing exactly where in a long chain of LLM calls your agent went off course may take a little more work. You can then focus on a specific step in the chain. Craft a simple assertion that describes the behavior you think the agent should exhibit in this specific scenario. This is similar to a traditional unit test (indeed, that’s the term Hamel Husain uses). Now, tweak your agent and re-run the step. Of course, LLMs are probabilistic beasts and we cannot expect a 100% success rate. The aim of the game is to improve the pass rate over time by repeating this loop: identify failures, add assertions and tweak the agent’s behavior. This step-by-step approach allows for iterative development and evaluation with a rapid feedback loop.
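
To make this concrete, here is a minimal sketch of what a step-level assertion can look like, written pytest-style in Python. The step function run_refund_step, its input fields and its output schema are hypothetical placeholders rather than code from our agent; substitute whatever step and fields your own agent produces.

```python
# A step-level "unit test" for one agent step, replaying a captured failure case.
# run_refund_step and the state/output fields below are hypothetical examples.
from my_agent import run_refund_step  # hypothetical helper that re-runs a single step


def test_refund_step_quotes_correct_amount():
    # A real failure case captured from a trace, replayed in isolation.
    state = {
        "customer_message": "I was charged twice for order #1042, please refund one charge.",
        "order_total": 49.99,
    }
    result = run_refund_step(state)

    # Simple heuristic assertions on this step's output, not the whole trajectory.
    assert result["action"] == "issue_refund"
    assert abs(result["amount"] - 49.99) < 0.01
    assert "1042" in result["customer_reply"]
```

Run the whole suite after every change; no single test needs to pass every time, but the overall pass rate should trend upward.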

While this process may sound simple, implementing it efficiently depends on some important underlying capabilities. To pinpoint failures quickly, you need to trace and visualize agent trajectories. To restore and replay a particular step, you require some kind of checkpointing. To monitor the assertion pass rate over time, you rely on a versioned dataset of test cases and a UI to track your metrics. Fortunately, in a gold rush, we can expect mining technology to improve rapidly. LLM tooling and platforms are proliferating. Onboarding LangSmith for tracing, dataset management and metric tracking brought us a huge productivity boost.
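
As a rough illustration, the pattern can look something like this with LangSmith’s Python SDK (exact method names and arguments may differ between SDK versions, so treat this as a sketch rather than a reference): decorate each step so it is traced, then promote an observed failure into a versioned dataset example. The step function and payloads are hypothetical.

```python
# Sketch: trace individual agent steps and promote a failure case into a dataset.
# The step function and example payloads are hypothetical; LangSmith API details
# may vary by SDK version.
from langsmith import Client, traceable


@traceable  # each call becomes a trace you can inspect and replay in the LangSmith UI
def classify_intent(message: str) -> str:
    # ... LLM call goes here ...
    return "refund_request"


client = Client()

# A versioned dataset of test cases, built up one corrected failure at a time.
dataset = client.create_dataset("agent-failure-cases")
client.create_example(
    inputs={"message": "I was charged twice for order #1042"},
    outputs={"intent": "refund_request"},  # the corrected, expected behavior
    dataset_id=dataset.id,
)
```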

Focus on flow, not prompts

“Prompt engineering” was, until recently, widely hyped as a must-have new skill set. However, too much emphasis here could be a sign you’re headed down the wrong track. You don’t want to find yourself asking for JSON PLEASE, offering to tip the LLM or threatening its grandmother. All of these strategies have been recommended with a straight face and may work in certain scenarios, but there’s a real risk of fixating on a local optimum and missing a much bigger picture. If prompts have only a minor impact on agent performance, where should you focus? A higher-leverage area is tweaking the multi-step iterative flow your agent follows, which AlphaCodium has popularized as “flow engineering”.

A key ingredient for success is the ability to quickly evaluate changes to your agent’s control flow. Solving this problem goes beyond evaluation to the core system design. The agent’s architecture must be composable, a Lego set of interchangeable blocks. There are no easy solutions here; it will require careful design tailored to your problem. The first round of successful agents will be anchored around specific use cases, but general agent frameworks will no doubt mature quickly. LangGraph is one to monitor and great for prototyping: you can quickly compose cyclical flows for your agent with built-in support for checkpointing and “time travel”. These types of capabilities will not only be critical for evaluation and iteration speed, but could also lay the foundation for user-facing features such as the time-scrubbing interface introduced by Devin.
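
As a sketch of what this composability buys you, here is a minimal cyclical draft-and-review flow in LangGraph with a checkpointer attached, so individual steps can be inspected and replayed. The node logic is hypothetical, and LangGraph’s API moves quickly, so imports and method names may differ in current versions.

```python
# Minimal cyclical flow (draft -> review -> maybe draft again) with checkpointing.
# Node bodies are hypothetical stand-ins for LLM calls.
from typing import TypedDict

from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver


class AgentState(TypedDict):
    task: str
    draft: str
    revisions: int


def draft_step(state: AgentState) -> dict:
    # ... LLM call to produce or revise a draft ...
    return {"draft": f"draft for: {state['task']}", "revisions": state["revisions"] + 1}


def review_step(state: AgentState) -> dict:
    # ... LLM call to critique the draft ...
    return {}


def should_continue(state: AgentState) -> str:
    return "revise" if state["revisions"] < 3 else "done"


builder = StateGraph(AgentState)
builder.add_node("draft", draft_step)
builder.add_node("review", review_step)
builder.set_entry_point("draft")
builder.add_edge("draft", "review")
builder.add_conditional_edges("review", should_continue, {"revise": "draft", "done": END})

graph = builder.compile(checkpointer=MemorySaver())

config = {"configurable": {"thread_id": "case-1042"}}
graph.invoke({"task": "summarize the refund policy", "draft": "", "revisions": 0}, config)

# "Time travel": walk back through saved checkpoints to see where a run went wrong.
for snapshot in graph.get_state_history(config):
    print(snapshot.values)
```

Because each node is an interchangeable block, re-ordering steps or swapping in a different review strategy is a one-line change rather than a rewrite.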

Turn failures into assets

Once you have a flexible agent architecture and the infrastructure to identify, replay and correct failure cases, things start to get interesting. You will naturally fall into an iterative evaluation process that, when tracked in a standardized way, yields an extremely valuable dataset. Your development velocity will accelerate as comprehensive evaluation coverage provides confidence to move faster. The same dataset can be re-used for fine-tuning, allowing the model to fully internalize the set of past failures and course corrections. Your agent will increasingly “just know” the right action to take, without needing to tweak prompts or re-order steps in the flow.
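
One way this re-use can look in practice: export each reviewed failure case, paired with its corrected output, into a chat-style fine-tuning file. The OpenAI JSONL fine-tuning format is shown below; the structure of the captured records is a hypothetical example, not our internal schema.

```python
# Sketch: convert reviewed failure/correction pairs into a fine-tuning JSONL file.
# The record structure below is a hypothetical example of captured eval data.
import json

failure_cases = [
    {
        "inputs": {"message": "I was charged twice for order #1042"},
        "corrected_output": '{"action": "issue_refund", "order_id": "1042"}',
    },
    # ... every reviewed and corrected failure case from the eval dataset ...
]

with open("agent_finetune.jsonl", "w") as f:
    for case in failure_cases:
        record = {
            "messages": [
                {"role": "system", "content": "You are a customer-support agent."},
                {"role": "user", "content": case["inputs"]["message"]},
                {"role": "assistant", "content": case["corrected_output"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```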

If there’s anything more important than your agent’s control flow, it’s the capabilities of the underlying LLM. New general-purpose models are coming out thick and fast, and foundation models specifically designed for agents are on the horizon. These developments are not a threat to use-case-specific agents but the reverse: they will push the boundaries of what’s possible and reward those betting on the most ambitious AI products. LLMs will continue to excel at rote learning, and the value of a high-quality domain-specific dataset will only grow. It’s a game where data quality trumps quantity. Any company that efficiently tailors this process to their use case will reap huge rewards.

Published via Towards AI
