Reliable Agentic RAG with LLM Trustworthiness Estimates

Last Updated on September 18, 2024 by Editorial Team

Author(s): Chris Mauck

Originally published on Towards AI.

This article demonstrates an agentic system to ensure reliable answers in Retrieval-Augmented Generation, while also ensuring that latency and compute costs do not exceed the processing needed to accurately respond to complex queries. Our system relies on trustworthiness scores for LLM outputs, in order to dynamically adjust retrieval strategies until sufficient context has been retrieved to generate a trustworthy RAG answer.

Using the trustworthiness score, the RAG Agent can choose to return the response or utilize more complex retrieval plans to retrieve better context.

Introduction

Retrieval Augmented Generation (RAG) combines the strengths of large language models (LLMs) with powerful retrieval systems to generate more accurate responses grounded in knowledge databases. Simple RAG systems retrieve relevant information to a query via semantic search based on vector embeddings of query and database contents, but this strategy fails for more complex queries.

Agentic RAG considers various Retrieval strategies as tools available to an LLM orchestrator that can iteratively decide which tools to call next based on what it’s seen thus far. This Agent can plan, execute, and refine multi-step retrieval processes, but it is critical to ensure latency and compute costs do not exceed what is required to produce a good answer for a user’s query. Despite advancements from LLMs → RAG → Agentic RAG with sophisticated Retrieval strategies, AI-generated responses still suffer from hallucinations today, producing incorrect or nonsensical information with unwarranted confidence.

This blog outlines an Agentic RAG system that can produce trustworthy answers even for complex queries, in a manner that keeps latency/costs in check. Our system relies on the trustworthiness score of a candidate response (based on the query and currently retrieved context), having the Agent orchestrate a better Retrieval strategy to improve the context when the current response is deemed too untrustworthy. This system starts with cheaper Retrieval strategies and dynamically tries strategies with greater runtime/costs only for complex queries where they are necessary to produce a trustworthy response.

Trustworthiness Score

For a given user query, the RAG system will retrieve relevant context, which are then both fed into a LLM to produce the response. But how do we know when the response is untrustworthy? For instance, here is question incorrectly answered by ChatGPT with no indication it should not be trusted.

We utilize an Uncertainty Estimator (see the system diagram above) to automate this determination by producing a trustworthiness score (between 0–1) for each response from the LLM. The Uncertainty Estimator can be implemented using a variety of hallucination detection methods such as Trustworthy Language Model, BSDetector, SelfCheckGPT, or Prometheus 2.

trust_score = TrustworthyLanguageModel.get_trustworthiness_score(prompt, response) 
print(trust_score)
>>> 0.4

For the above prompt & ChatGPT response, the Uncertainty Estimator computes a low confidence-value indicating this response should not be trusted. Since today’s machine learning remains fundamentally brittle, quantifying confidence in model outputs is crucial for reliable AI deployment. An effective trustworthiness scoring method should be able to identify hallucinations and otherwise erroneous LLM responses with high precision/recall.

Utilizing the trustworthiness score in Agentic RAG

A user’s query is processed by our RAG system as follows: the Retrieval Planner Agent orchestrates a series of Retrieval strategies in order to discover relevant context that when appended to the query, leads to an LLM response with a sufficiently high trustworthiness score. The Agent is encouraged to start with faster/cheaper Retrieval strategies and only escalate to more complex Retrieval when a high trustworthiness score has not yet been achieved. As soon as a trustworthy LLM response is produced, it is returned to the user.

This high-level system can work with many types of Retrieval Planner Agent implementations (e.g. via frameworks like LangGraph and tool-use algorithms like OpenAI Function Calling), as well as all sorts of Retrieval strategies. The goal is to minimize the runtime/costs required to process most queries, while still being able to produce trustworthy responses for complex queries that necessitate more compute.

Potential Retrieval Strategies

As a concrete example, our Retrieval Planner Agent might choose from the following Retrieval strategies, increasing in time and compute complexity:

No Retrieval

Complexity: None
The query is answerable with general knowledge the LLM already knows.

2. Semantic Search (vector embedding similarity)

Complexity: Low
Vector database (Pinecone, Qdrant, Weaviate, etc.) is searched using top similarities in the space of embeddings (Sentence Transformers, Voyage, etc.)

3. Hybrid Search (vector + keyword search) with Reciprocal Rank Fusion

Complexity: Low/Medium
Knowledge database is searched via a combination of vector similarity and classical keyword search like BM25, with results rankings from different searches aggregated via the RRF method.

4. Re-Ranking retrieved results

Complexity: Medium
A specialized re-ranker model is applied to the retrieved results from either vector or keyword search that more accurately estimates which ones are relevant to the query.

5. Query Expansion

Complexity: Medium/High
User query is rewritten into possibly multiple queries before (possibly multiple steps of) Retrieval. This includes entity recognition, separate keyword searches, and methods like Hypothetical Document Embeddings (Hyde) or Step-Back Prompting.

6. Chunk/Document Expansion (Multi-Hop RAG, GraphRAG)

Complexity: Medium/High
Returned chunks from the vector database search are expanded by referring to the original document from which they came and traversing related documents or Knowledge Graphs like Neo4j. This may be required to find additional information that is needed for the retrieved context to be useful.

Examples of our Trustworthy Agentic RAG in action

To make things more concrete, we consider a RAG application intended to answer questions based on Nvidia’s product documentation. Let’s first consider a simple query that a user may pose:

Which component of a computer is responsible for graphics rendering?

On the first pass, our RAG Agent chooses the least complex retrieval plan: do not retrieve anything. The response and associated trustworthiness score are:

Response: The component of a computer responsible for graphics rendering is the Graphics Processing Unit (GPU).

Trustworthiness Score: 0.986

In this case, the RAG Agent determines the response to be trustworthy and approves the response to be used downstream.

Now for a more complex query:

How many NVIDIA CUDA Cores does the RTX 4090 have?

Our RAG Agent might again initially choose not to retrieve anything, yielding:

Response: As of my last update in October 2022, NVIDIA has not released an RTX 4090 graphics card. Therefore, there are no official specifications.

Trustworthiness Score: 0.725

It appears that the LLM used does not have access to this knowledge. This is quantified by the trustworthiness score, enabling our system to determine this is not a reliable response and then to automatically employ a more complex retrieval plan: vector embedding search.

On the second pass, the context, response, and trustworthiness score are:

Retrieved Context: GeForce RTX 4090 GPU Engine Specs: NVIDIA CUDA Cores 16384 Shader Cores Ada Lovelace 83 TFLOPS Ray Tracing Cores 3rd Generation 191 TFLOPS Tensor Cores (AI) 4th Generation 1321 AI TOPS Boost Clock (GHz) 2.52 Base Clock (GHz) 2.23

Response: The NVIDIA GeForce RTX 4090 features 16,384 CUDA cores.

Trustworthiness Score: 0.988

Nice! The RAG Agent was able to dynamically up the complexity of the retrieval plan and now has a satisfactory answer with a high trustworthiness score.

Let’s consider a more complex query:

What were the key design challenges faced by NVIDIA in developing the RTX 4090, and how were they overcome?

Our RAG Agent might again initially choose not to retrieve anything, yielding:

Response: The NVIDIA RTX 4090 was developed to push the boundaries of GPU performance, requiring significant advancements in architecture, thermal management, and power efficiency.

Trustworthiness Score: 0.635

Our uncertainty estimator identifies this response as generic and does not confidently contain the right information. In the next pass, our RAG Agent might opt for vector similarity search again, yielding:

Retrieved Context: NVIDIA’s RTX 4090 was anticipated to be a major leap in GPU technology. There were discussions about improving overall architecture and enhancing ray tracing capabilities. However, specific details about the challenges during its development are scarce. Some sources suggest that NVIDIA focused on maximizing performance for gaming and AI applications, but there is little concrete information on the exact hurdles faced.

Response: The NVIDIA RTX 4090 focused on improving performance and ray tracing capabilities, but specific details about the challenges encountered during development are not well-documented. It is believed that NVIDIA prioritized enhancements in gaming and AI, but the exact difficulties they faced remain unclear.

Trustworthiness Score: 0.805

In this pass, our RAG Agent retrieved unsatisfactory information. The corresponding moderate trustworthiness score reflects the lack of specific context. Our RAG Agent thus considers another Retrieval pass with a more complex Hybrid search strategy, yielding:

Retrieved Context: The RTX 4090’s development saw NVIDIA addressing several engineering hurdles. The thermal challenges were exacerbated by the sheer density of the CUDA cores, necessitating an innovative cooling solution. NVIDIA introduced a new fan design and a larger vapor chamber to dissipate heat more effectively. On the power side, the 4090 required a new power connector standard, the PCIe Gen 5 16-pin, which allowed for higher wattage delivery, ensuring stable operation under load. For AI integration, NVIDIA refined its Tensor Core technology to balance power and performance, ensuring that the 4090 could handle advanced AI workloads without bottlenecking other processes.

Response: NVIDIA’s RTX 4090 development involved overcoming several key challenges, particularly around thermal management and power delivery. The GPU’s high density of 16,384 CUDA cores required an innovative cooling system, leading to the introduction of a new fan design and larger vapor chamber. Additionally, the need for higher power delivery led to the adoption of the PCIe Gen 5 16-pin connector. To integrate AI features effectively, NVIDIA refined its Tensor Core technology, balancing power and performance to handle advanced AI workloads efficiently.

Trustworthiness Score: 0.975

With the expanded context, the RAG Agent produces a highly detailed and specific response that earns a high enough trustworthiness score to return to the user. That said, this higher-quality answer did require additional time/costs to produce.

The same process continues for queries with greater levels of complexity. Our RAG Agent continues escalating Retrieval strategies until the trustworthiness score indicates a reliable response or the maximum time limit is reached. If no Retrieval strategy yields a trustworthy LLM response, then our system responds that it cannot handle this question without additional clarifications/information.

Wrap Up

Integrating trustworthiness scores enables Agentic RAG systems that can ensure accurate answers to complex queries while bounding latency/costs for regular queries. You can adopt this approach to navigate the delicate balance between speed, costs, and accuracy across diverse RAG applications — from customer service to specialized fields like finance, law, and medicine.

While traditional RAG systems generate responses of unknown quality based on predefined steps to process every query, the future of AI lies in systems that assess response trustworthiness and adapt processing plans to each query’s complexity. Agentic RAG with trustworthiness scores offers a promising step toward the future of reliable AI.

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Frequently Used, Contextual References

Resources

Publication

Reliable Agentic RAG with LLM Trustworthiness Estimates

Author(s): Chris Mauck

Introduction

Trustworthiness Score

Utilizing the trustworthiness score in Agentic RAG

Potential Retrieval Strategies

Examples of our Trustworthy Agentic RAG in action

Wrap Up

Feedback ↓ Cancel reply

Popular posts

Best Laptops for Deep Learning, Machine Learning (ML), and Data Science for 2023

Best Workstations for Deep Learning, Data Science, and Machine Learning (ML) for 2022

Descriptive Statistics for Data-driven Decision Making with Python

Best Machine Learning (ML) Books - Free and Paid - Editorial Recommendations for 2022

Best Data Science Books - Free and Paid - Editorial Recommendations for 2022

Updates

Recent Posts

LAI #66: Information Theory for People in a Hurry

🔎 Decoding LLM Pipeline — Step 1: Input Processing & Tokenization

Meta to Launch Its Own In-House AI Chip

I Built an AI Money Coach in Python — Here’s How You Can Too (Step-by-Step Guide!)

ChatGPT Now Works Natively in Xcode and VS Code

The World’s Leading AI and Technology Publication.

Company

CONTACT US

🔥 Recommended Articles 🔥

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Publication

Reliable Agentic RAG with LLM Trustworthiness Estimates

Author(s): Chris Mauck

Introduction

Trustworthiness Score

Utilizing the trustworthiness score in Agentic RAG

Potential Retrieval Strategies

Examples of our Trustworthy Agentic RAG in action

Wrap Up

Related posts

Feedback ↓ Cancel reply

Popular posts

Updates

Recent Posts

The World’s Leading AI and Technology Publication.

Company

CONTACT US

GDPR CCPA Statement

Subscribe to our AI newsletter!

🔥 Recommended Articles 🔥