
Taming the Oracle: Key Principles That Bring Our LLM Agents to Production

Author(s): Nate Liebmann

Originally published on Towards AI.

A Tame Oracle. Generated with Microsoft Designer

With the second anniversary of the ChatGPT earthquake right around the corner, the rush to build useful applications based on large language models (LLMs) seems to be in full force. But despite the aura of magic surrounding demos of LLM agents or involved conversations, I am sure many can relate to my own experience developing LLM-based applications: you start with an example that seems to work great, but buyer’s remorse soon follows. Other variations of the task can simply fail miserably, without a clear differentiator, and agentic flows can reveal their tendency to diverge once you stray from the original prototyping happy path.

If not for the title, you might have thought at this point that I was a generative AI Luddite, which could not be further from the truth. The journey my team at Torq and I have been on over the past two years, developing LLM-based software features that enhance the no-code automation building experience on our platform, has taught me a lot about the great power LLMs bring when handled correctly.

From here on I will discuss three core principles that guide our development and allow our agents to reach successful production deployment and customer utility. I believe they are just as relevant to other LLM-based applications.

❶ The least freedom principle

LLMs interact through free text, but that is not always how our users will interact with our LLM-based application. In many cases, even if the input is indeed a textual description provided by the user, the output is much more structured, and could be used to take actions in the application automatically. In such a setting, the LLM’s great power to solve tasks that would otherwise require massive, complex deterministic logic or human intervention can turn into a problem. The more leeway we give the LLM, the more prone our application is to hallucinations and diverging agentic flows. Therefore, much like the principle of least privilege in security, I believe it’s important to constrain the LLM as much as possible.

Fig. 1: The unconstrained, multi-step agentic flow

Consider an agent that takes a snapshot of a hand-written grocery list, extracts the text via OCR, locates the most relevant items in stock, and prepares an order. It may sound tempting to opt for a flexible multi-step agentic flow in which the agent can call methods such as search_product and add_to_order (see fig. 1 above). However, this process could turn out to be very slow, include superfluous steps, and might even get stuck in a loop if some function call returns an error the model struggles to recover from. An alternative approach constrains the flow to two steps: the first is a batch search that returns a filtered product tree object, and the second generates the order, referencing appropriate products from the partial product tree returned by the search call (see fig. 2 below). Apart from the clear performance benefit, we can be much more confident that the agent will stay on track and complete the task.
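As an illustration, here is a minimal sketch of the constrained two-step flow, assuming an OpenAI-style chat client. The function batch_search_products is a hypothetical stand-in for the deterministic stock search, and the product tree it returns is a toy example.

```python
import json
from openai import OpenAI

client = OpenAI()

def batch_search_products(item_names: list[str]) -> dict:
    # Hypothetical placeholder: a real application would query the stock
    # deterministically and return only the relevant branches of the tree.
    return {"food": {"dairy": {"cheeses": {"goat": ["chevre"]}}}}

def build_order(grocery_list_text: str, model: str = "gpt-4o-mini") -> dict:
    # Step 1: one deterministic batch search over all extracted items,
    # instead of letting the agent call search_product item by item.
    item_names = [line.strip() for line in grocery_list_text.splitlines() if line.strip()]
    product_tree = batch_search_products(item_names)

    # Step 2: a single LLM call that must reference products by JSON path
    # within the returned product tree.
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                f"Grocery list items: {item_names}\n"
                f"Available products, referenced by JSON path: {json.dumps(product_tree)}\n"
                'Respond with JSON: {"order": [{"product_path": "...", "quantity": 1}]}'
            ),
        }],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```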

Fig. 2: A structured agentic flow with deterministic auto-fixing

When dealing with problems in the generated output, I believe it’s best to do as much of the correction as possible deterministically, without involving the LLM again. Counter to intuition, sending an error back to an LLM agent and asking it to correct it does not always get it back on track, and might even increase the likelihood of further errors, as some evidence has shown. Circling back to the grocery shopping agent, it is very likely that in some cases invalid JSON paths will be produced to refer to products (e.g., food.cheeses.goats[0] instead of food.dairy.cheeses.goat[0]). As we have the entire stock at hand, we can apply a simple heuristic to fix the incorrect path automatically and deterministically, for example by using an edit distance algorithm to find the valid path closest to the generated one in the product tree. Even then, some invalid paths might be too far from any valid one. In such a case, we might want to simply retry the LLM request rather than add the error to the context and ask the model to fix it.
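A minimal sketch of such a deterministic auto-fix, using difflib from the standard library as the closest-match heuristic (the article does not prescribe a specific edit distance algorithm, and the product tree here is a toy example):

```python
import difflib

def enumerate_paths(tree, prefix=""):
    """Yield every leaf path in a nested dict/list product tree."""
    if isinstance(tree, dict):
        for key, value in tree.items():
            yield from enumerate_paths(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(tree, list):
        for i, value in enumerate(tree):
            yield from enumerate_paths(value, f"{prefix}[{i}]")
    else:
        yield prefix

def fix_path(generated_path: str, product_tree: dict, cutoff: float = 0.6):
    """Snap an invalid generated path to its closest valid neighbour.
    Returns None if nothing is close enough, in which case we would retry
    the LLM request rather than ask the model to self-correct."""
    valid_paths = list(enumerate_paths(product_tree))
    if generated_path in valid_paths:
        return generated_path
    matches = difflib.get_close_matches(generated_path, valid_paths, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Example: the invalid path from the text snaps to the valid one.
tree = {"food": {"dairy": {"cheeses": {"goat": ["chevre"]}},
                 "produce": {"apples": ["gala"]}}}
print(fix_path("food.cheeses.goats[0]", tree))  # -> food.dairy.cheeses.goat[0]
```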

❷ Automated empirical evaluation

Unlike traditional third-party APIs, calling an LLM with the exact same input can produce different results each time, even when setting the temperature hyper-parameter to zero. This is in direct conflict with a fundamental principle of good software engineering: giving users a predictable and consistent experience. The key to resolving this conflict is automated empirical evaluation, which I consider the LLM edition of test-driven development.

The evaluation suite can be implemented as a regular test suite, which has the benefit of natural integration into the development cycle and CI/CD pipelines. Crucially, however, the LLMs must actually be called, not mocked. Each evaluation case consists of user inputs and an initial system state, as well as a grading function for the generated output or modified state. Unlike traditional test cases, the notion of PASS or FAIL is insufficient here, because the evaluation suite plays an important role in guiding improvements and enhancements, as well as catching unintended degradations. The grading function should therefore return a fitness score for the output or state modifications our agent produces.

How do we actually implement the grading function? Think, for example, of a simple LLM task for generating small Python utility functions. An evaluation case could prompt it to write a function that computes the nth element of the Fibonacci sequence. The model’s implementation might take either the iterative or the recursive path, both valid (though suboptimal, because there is a closed-form expression), so we cannot make assertions about the specifics of the function’s code. The grading function in this case could, however, take a handful of test values for the Fibonacci function’s argument, spin up an isolated environment, run the generated function on those values, and verify the results. This black-box grading of the produced output does not make unnecessary assumptions, while strictly validating it in a fully deterministic fashion.
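A minimal sketch of such a grading function, assuming the generated code defines a function named fibonacci and using a subprocess as a simple stand-in for a properly isolated environment:

```python
import json
import subprocess
import sys

EXPECTED = {0: 0, 1: 1, 2: 1, 7: 13, 10: 55, 20: 6765}  # known Fibonacci values

def grade_fibonacci(generated_code: str, timeout_s: float = 5.0) -> float:
    """Return a fitness score in [0, 1]: the fraction of test inputs for which
    the generated fibonacci(n) returns the expected value."""
    harness = generated_code + "\n" + (
        "import json\n"
        f"expected = {EXPECTED!r}\n"
        "results = {str(n): fibonacci(n) == v for n, v in expected.items()}\n"
        "print(json.dumps(results))\n"
    )
    try:
        proc = subprocess.run(
            [sys.executable, "-c", harness],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # the generated code hung, e.g. an infinite loop
    if proc.returncode != 0:
        return 0.0  # the generated code crashed or is not valid Python
    results = json.loads(proc.stdout.strip().splitlines()[-1])
    return sum(results.values()) / len(results)

# Usage: score = grade_fibonacci(llm_generated_source)
```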

While I believe that should be the preferred approach, it is not suitable for all applications. There are cases where we cannot fully validate the result, but we can still make assertions about some of its properties. For example, consider an agent that generates short summaries of system logs. Some properties of its outputs, like length, are easy to check deterministically. Other, semantic ones, not so much. If the otherwise business-as-usual logs serving as input for an evaluation case contain a single record about a kernel panic, we want to make sure the summary mentions it. A naive approach for the grading function in this case would involve an LLM task that directly produces a fitness score for the summary based on the log records. This approach might lock our evaluation in a sort of LLM complacency loop, with none of the guarantees provided by deterministic checks. A more nuanced approach could still use an LLM for grading, but craft the task differently: given a summary, the model could be instructed to answer multiple-choice factual questions (e.g., “Has there been a major incident in the covered period? (a) No (b) Yes, a kernel panic (c) Yes, a network connectivity loss…”). We can be much more confident that the LLM would simply not be able to consistently answer such questions correctly if the key information is missing from the summary, making the score much more reliable.
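A minimal sketch of this multiple-choice grading, again assuming an OpenAI-style chat client; the questions and expected answers are illustrative and belong to the evaluation case, not to the summary agent itself:

```python
from openai import OpenAI

client = OpenAI()

QUESTIONS = [
    {
        "question": "Has there been a major incident in the covered period? "
                    "(a) No (b) Yes, a kernel panic (c) Yes, a network connectivity loss",
        "expected": "b",
    },
    # ...more factual questions derived from the evaluation case's input logs
]

def grade_summary(summary: str, model: str = "gpt-4o-mini") -> float:
    """Fraction of factual questions answered correctly using only the summary."""
    correct = 0
    for q in QUESTIONS:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system",
                 "content": "Answer using only the summary below. Reply with a single letter."},
                {"role": "user", "content": f"Summary:\n{summary}\n\nQuestion: {q['question']}"},
            ],
            temperature=0,
        )
        answer = resp.choices[0].message.content.strip().lower().lstrip("(")
        correct += answer.startswith(q["expected"])
    return correct / len(QUESTIONS)
```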

Finally, due to non-determinism, each evaluation case must be run several times, with the results aggregated to form a final evaluation report. I have found it very useful to implement the evaluation suite early and use it to guide our development. Once the application has reached some maturity, it could make sense to fail the integration pipeline if the aggregate score for its evaluation suite drops below some set threshold, to prevent catastrophic degradations.
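As a sketch of how this aggregation could look (the case objects, their run_agent and grade methods, and the threshold value are all hypothetical), each case is run several times and the pipeline fails only if the aggregate mean score drops below the threshold:

```python
import statistics

N_RUNS = 5            # how many times to run each evaluation case
MIN_MEAN_SCORE = 0.8  # illustrative threshold, set once the application matures

def evaluate_case(case, n_runs: int = N_RUNS) -> float:
    """Run one evaluation case several times and return its mean fitness score."""
    scores = [case.grade(case.run_agent()) for _ in range(n_runs)]
    return statistics.mean(scores)

def run_evaluation_suite(evaluation_cases) -> None:
    report = {case.name: evaluate_case(case) for case in evaluation_cases}
    aggregate = statistics.mean(report.values())
    # Failing here fails the integration pipeline and blocks catastrophic degradations.
    assert aggregate >= MIN_MEAN_SCORE, f"Evaluation score dropped below threshold: {report}"
```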

❸ Not letting the tail wag the dog

Good LLM-based software is, first and foremost, good software. The magic factor we see in LLMs (which is telling of human nature and the role language plays in our perception of other intelligent beings, a topic I will of course not cover here) might tempt us to think of LLM-based software as a whole new field, requiring novel tools, frameworks and development processes. As discussed above, the non-deterministic nature of commercial LLMs, as well as their unstructured API, does necessitate dedicated handling. But I would argue that instead of looking at an LLM-based application as a whole new creature that might here and there utilise familiar coding patterns, we should treat it as any other application, except where it is not. The power of this approach is that we do not let external abstractions hide away the low-level LLM handling, which is crucial for truly understanding the LLM’s capabilities and limitations in the scope of our application. Abstractions can and should be adopted where they save time and reduce boilerplate code, but never at the cost of losing control over the most important part of your application: the intricate touchpoints between the LLM and your deterministic code, which should be tailored to your specific use case.

Wrapping up, LLMs can be viewed as powerful oracles that enable previously unfeasible applications. My experience developing LLM-based agents has taught me several principles that correlate with successful production deployment and utility. Firstly, agents should be given the least possible freedom: flows should be structured, and whatever can be done deterministically should be. Secondly, automated empirical evaluation of the LLM task and the surrounding logic should be a cornerstone of the development process, relying as much as possible on deterministic scoring. Thirdly, abstractions provided by libraries and frameworks should not be adopted where they hide essential details of the integration between the LLM and our code, which is the core of LLM-based applications.

Feel free to reach out to discuss this matter further and tell me what you think!


Published via Towards AI
