
A Closer Look at ‘Fast or Better?’: Evaluating User Control in RAG
Last Updated on March 3, 2025 by Editorial Team
Author(s): Nikhilesh Pandey
Originally published on Towards AI.
Context
What if large language models (LLMs) could dynamically adapt to user needs, balancing accuracy and cost in real time? This question lies at the heart of Retrieval-Augmented Generation (RAG), a promising approach to mitigating LLM hallucinations by integrating external knowledge. However, existing RAG systems often fail to adapt to the varying complexity of user queries, leading to inefficiencies in both computational cost and response quality.
This post aims to provide a high-level overview of the key concepts and build a solid foundation for understanding the paper “Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control”. The paper introduces Flare-Aug, a novel RAG framework that addresses these limitations by enabling user-controllable adaptive retrieval. By leveraging two classifiers — one optimized for cost and another for reliability — and a tunable parameter (α), Flare-Aug allows users to dynamically balance accuracy and retrieval cost. This framework represents a significant step toward making RAG systems more adaptable and practical for real-world applications.
How can we ensure that large language models (LLMs) provide accurate, reliable answers without incurring unnecessary computational costs?
This question has become increasingly urgent as LLMs like ChatGPT are deployed in diverse applications, from customer service to medical diagnosis. Despite their impressive capabilities, LLMs are prone to hallucinations — responses that appear plausible but are factually incorrect. This problem is particularly acute for queries involving recent events, obscure facts, or domain-specific knowledge that is not well represented in the model’s parametric memory.
Retrieval-Augmented Generation (RAG) offers a solution by enabling LLMs to retrieve and incorporate external knowledge. However, current RAG systems often apply retrieval indiscriminately, leading to two key inefficiencies:
- Over-Retrieval: Fetching unnecessary information for queries that can be answered using the model’s internal knowledge, increasing latency and computational cost.
- Under-Retrieval: Failing to retrieve iteratively for complex queries that require multi-step reasoning, resulting in incomplete or incorrect answers.
Diverse User Query Complexities and Retrieval Strategies
Let’s kick things off by exploring the diverse complexity of user queries and how queries of different complexity demand different retrieval approaches. As shown in Figure 1, the complexity of a query dictates the retrieval method: as queries move from simple to intricate, the strategy changes. We focus on three core retrieval techniques: no retrieval, single-step retrieval, and multi-step retrieval, representing zero, one, and multiple retrieval stages, respectively. Which approach is appropriate depends on how well the model’s internal knowledge covers the query and how tightly retrieval and reasoning must interleave to produce an accurate response.
Compared to no retrieval and single-step retrieval, multi-step retrieval incurs the highest computational cost, since each additional retrieval step adds latency and resource consumption.
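To make the three strategies concrete, below is a minimal Python sketch of zero-, one-, and multi-step retrieval. Here, llm_answer and retrieve are hypothetical stand-ins for a real model call and a real search index; the paper does not prescribe this code.

```python
from typing import List

# Hypothetical stand-ins for an LLM call and a retriever; a real system
# would plug in its own model and search index.
def llm_answer(prompt: str) -> str:
    return f"answer({prompt!r})"

def retrieve(query: str, k: int = 3) -> List[str]:
    return [f"doc_{i}('{query}')" for i in range(k)]

def no_retrieval(query: str) -> str:
    # Zero retrieval steps: rely entirely on the model's parametric memory.
    return llm_answer(query)

def single_step_retrieval(query: str) -> str:
    # One retrieval step: fetch supporting documents once, then generate.
    docs = retrieve(query)
    return llm_answer(f"{query} | context: {docs}")

def multi_step_retrieval(query: str, hops: int = 2) -> str:
    # Interleaved retrieve-and-reason rounds; every extra hop pays for
    # another retrieval and another generation, hence the higher cost.
    context: List[str] = []
    partial = query
    for _ in range(hops):
        context.extend(retrieve(partial))
        partial = llm_answer(f"{query} | context: {context}")
    return partial
```

The loop in multi_step_retrieval is exactly where the extra cost comes from: each hop adds one more retrieval call and one more generation call.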
A Core Challenge: Adaptive Retrieval
In the real world, user queries come in all shapes and sizes, and a “one-size-fits-all” retrieval system just doesn’t cut it. A rigid approach can lead to wasted resources, slow responses, or just plain inaccurate answers. For simple questions (like “What’s the capital of France?”), overcomplicating things by relying on heavy retrieval is like using a sledgehammer to crack a nut. On the flip side, for real-time queries (like “What’s the current temperature in Tokyo?”), sticking to outdated information won’t get the job done. Complex queries, though, need that extra step — multi-layered retrieval — because a quick answer just won’t do. The key is building a system that strikes the right balance between speed, precision, and efficiency.
Limitations of Previous Approaches
Past solutions, like Adaptive-RAG, try to balance speed and cost, but they only adjust to the complexity of the query — not the unique needs of each user. What we need is a truly adaptive system that listens to the user and tailors itself to their specific requirements. Take a lawyer, for example: they’d prefer top-notch accuracy, even if it takes a little longer, while a customer-support chatbot would prioritize fast, cost-effective responses. An investment strategist might need a more detailed, retrieval-heavy system, while a real-time assistant answering quick trivia should be all about speed, not exhaustiveness. A flexible, dynamic retrieval system is key: one that adjusts to both the task at hand and user preferences.
The Solution: Flare-Aug (FLexible, Adaptive REtrieval Augmented Generation)
Flare-Aug introduces a user-controllable adaptive retrieval framework that dynamically balances accuracy and cost. By incorporating two classifiers — a Cost-Optimized Classifier and a Reliability-Optimized Classifier — and a tunable parameter (α), Flare-Aug allows users to tailor the retrieval strategy to their specific needs. This approach not only improves efficiency but also enhances the practicality of RAG systems in real-world applications.
How Flare-Aug Works
The framework’s main component is a pair of classifiers: a Cost-Optimized Classifier, which is dynamic and LLM-dependent, and a Reliability-Optimized Classifier, which is static and dataset-dependent.
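Before stepping through the pipeline, here is a minimal sketch of how each classifier’s training labels could be derived from the descriptions above. The names coc_label, roc_label, and answer_with are hypothetical, and the paper’s actual training procedure may differ.

```python
from typing import Callable

# Strategies ordered cheapest-first, so list position doubles as cost rank.
STRATEGIES = ["no_retrieval", "single_step", "multi_step"]

def coc_label(query: str, gold: str,
              answer_with: Callable[[str, str], str]) -> str:
    # Label for the cost-optimized classifier: the cheapest strategy under
    # which this specific LLM still answers correctly (hence "dynamic and
    # LLM-dependent"). Fall back to multi-step if nothing cheaper succeeds.
    for strategy in STRATEGIES:
        if answer_with(query, strategy) == gold:
            return strategy
    return "multi_step"

def roc_label(is_multi_hop: bool) -> str:
    # Label for the reliability-optimized classifier: retrieval always
    # happens, so the only choice is single- vs. multi-step, driven by
    # dataset-level structure (hence "static and dataset-dependent").
    return "multi_step" if is_multi_hop else "single_step"
```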
1. User Query (q): The process starts with a user query.
2. Cost-Optimized Classifier (Wcoc)
- Objective: Minimize retrieval cost while ensuring correct answers.
- Mechanism: Trained on data from the specific LLM, this classifier selects the cheapest retrieval strategy that still yields a correct answer.
- Equation (schematic): ŝ(q) = argmin { cost(s) : s ∈ S, the LLM answers q correctly under s }, where S = {no retrieval, single-step, multi-step} is ordered by cost.
- Use Case: Ideal for cost-sensitive applications, such as customer service chatbots.
3. Reliability-Optimized Classifier (Wroc)
- Objective: Ensure high accuracy by always retrieving information.
- Mechanism: Uses dataset-level biases to determine whether single-step or multi-step retrieval is required.
- Equation (schematic): ŝ(q) = argmax_{s ∈ {single-step, multi-step}} P_roc(s | q); retrieval always occurs, so the classifier only decides how many steps to take.
- Use Case: Suitable for high-stakes applications, such as medical diagnosis or legal research.
4. User-Controllable Adaptive Classifier (Wα)
- Combines the outputs of (Wcoc) and (Wroc) based on the user-defined parameter α.
- Equation (schematic): Wα(q) = argmax_s [ (1 − α) · P_coc(s | q) + α · P_roc(s | q) ]
- User Control with Parameter α: Flare-Aug introduces a tunable parameter (α) that allows users to control the trade-off between cost and accuracy; α = 0 recovers the cost-optimized decision, α = 1 the reliability-optimized one, and the sketch below shows how values in between play out.
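To make the combination concrete, here is a minimal sketch of one plausible way to blend the two classifiers, assuming each emits a probability distribution over the three strategies. The function and variable names are illustrative, and the paper’s exact formulation may differ.

```python
import numpy as np

STRATEGIES = ["no_retrieval", "single_step", "multi_step"]

def adaptive_decision(p_coc: np.ndarray, p_roc: np.ndarray,
                      alpha: float) -> str:
    # Linearly interpolate the two classifiers' strategy distributions:
    # alpha = 0 recovers the cost-optimized decision, alpha = 1 the
    # reliability-optimized one.
    assert 0.0 <= alpha <= 1.0
    p_alpha = (1.0 - alpha) * p_coc + alpha * p_roc
    return STRATEGIES[int(np.argmax(p_alpha))]

# Toy example: the cost classifier believes the LLM can answer directly,
# while the reliability classifier insists on multi-step retrieval.
p_coc = np.array([0.7, 0.2, 0.1])  # favors the cheapest viable strategy
p_roc = np.array([0.0, 0.3, 0.7])  # always retrieves: no-retrieval mass is 0
for alpha in (0.0, 0.5, 1.0):
    print(f"alpha={alpha}: {adaptive_decision(p_coc, p_roc, alpha)}")
```

The same pair of distributions yields different decisions as α moves from 0 to 1, which is exactly the behavior the modes below describe.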
Real-World Applications
Low-Cost Mode (α = 0)
- Example Query: “What’s the weather today?”
- Benefit: Skips retrieval if the LLM already knows the answer, saving time and resources.
High-Accuracy Mode (α = 1)
- Example Query: “What are the latest FDA guidelines for AI medical devices?”
- Benefit: Ensures detailed, reliable answers by always retrieving up-to-date information.
Balanced Mode (0 < α < 1)
- Example Query: “What’s the current stock price of Apple?”
- Benefit: Provides a mix of speed and precision, fetching only necessary data.
Conclusion
Flare-Aug represents a significant advancement in making RAG systems more flexible and user-centric. By enabling users to control the trade-off between accuracy and cost, it addresses the limitations of existing adaptive retrieval systems and enhances their practicality for real-world applications. Whether the goal is speed, precision, or a balance of both, Flare-Aug provides a robust framework for tailoring retrieval strategies to diverse user needs.
This work underscores the importance of user-driven adaptability in AI systems and opens new avenues for research in retrieval-augmented generation. As LLMs continue to evolve, frameworks like Flare-Aug will play a critical role in ensuring their safe, efficient, and effective deployment.
References
- https://arxiv.org/pdf/2502.12145
- https://arxiv.org/pdf/2403.14403
- https://arxiv.org/pdf/2108.00573
- https://arxiv.org/pdf/2212.10511