
RAG vs. CAG: Can Cache-Augmented Generation Really Replace Retrieval?

Last Updated on February 5, 2025 by Editorial Team

Author(s): Alden Do Rosario

Originally published on Towards AI.

A recent VentureBeat article highlights a new Cache-Augmented Generation (CAG) method that promises no retrieval overhead and even better performance than Retrieval-Augmented Generation (RAG).

Sounds too good to be true?

We decided to find out by running our own tests on KV-Cache (a popular CAG implementation) versus RAG.

Below are our insights on what happens when you apply these methods to real workloads.

1. Setting the Stage: RAG vs. KV-Cache (CAG)

RAG

What It Is
A Retrieval-Augmented Generation approach that uses a retriever to find relevant documents, then passes them to a large language model for final answers.
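In Python, a minimal retrieve-then-prompt sketch looks like the snippet below. It is illustrative only: the embedding model and helper names here are assumptions made for this example, not the production RAG pipeline used in the benchmark.

```python
from sentence_transformers import SentenceTransformer, util

# Toy knowledge base; a real deployment indexes thousands of documents.
documents = [
    "Rosie Mac was Emilia Clarke's body double on Game of Thrones.",
    "HotpotQA is a multi-hop question answering dataset.",
    "Cache-Augmented Generation loads every document into the context window.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

def retrieve(question: str, top_k: int = 5) -> list[str]:
    """Return the top_k documents most similar to the question."""
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_embeddings, top_k=top_k)[0]
    return [documents[hit["corpus_id"]] for hit in hits]

def build_prompt(question: str) -> str:
    """Only the retrieved snippets go to the LLM, keeping the prompt small."""
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("Who was Emilia Clarke's body double?"))
```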

Where It Shines

  • Handles larger or frequently updated datasets without loading everything at once.
  • Avoids massive prompts, which can lead to truncation or context overload.

Key Limitations

  • Adds a retrieval step, which can be slower.
  • Often relies on external APIs or indexing overhead.

KV-Cache (CAG)

What It Is
A method that aims for near-zero retrieval time by loading all documents directly into the model’s context window. In principle, it cuts out the retriever entirely.

Note: In our benchmarks, we used a “No Cache” version of KV-Cache because the model was too large to run locally. Instead, we mimicked the same behavior via an API (OpenRouter) by feeding all documents each time. We’re not comparing retrieval speed here, since KV-Cache would obviously win if run locally on a suitable setup.
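Concretely, that “No Cache” setup amounts to something like the sketch below, assuming OpenRouter’s OpenAI-compatible endpoint and the openai Python client (the environment-variable name is arbitrary):

```python
import os
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API; the env var name here is our own choice.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def answer_with_full_context(documents: list[str], question: str) -> str:
    """Send every document on every request: no retriever, and no real KV cache."""
    context = "\n\n".join(documents)
    response = client.chat.completions.create(
        model="meta-llama/llama-3.1-8b-instruct",
        messages=[
            {"role": "system", "content": "Answer strictly from the provided documents."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```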

Where It Shines

  • If your entire knowledge base easily fits in the model’s context, you get almost instant answers (no retrieval step).
  • Best for stable datasets that rarely change.

Key Limitations

  • Context Size: If you exceed the model’s capacity, you must truncate or compress, killing accuracy.
  • Local Requirement: Real caching needs control over memory, meaning you must run the model on your own infrastructure.
  • Frequent Updates: Reloading the entire knowledge in context is impractical for dynamic data.

2. The BIG BUT (and We Cannot Lie)

Long-context LLMs (like Google Gemini or Claude with hundreds of thousands of tokens) are emerging, making CAG more appealing for some workloads.

But there’s a big condition:

  • You must run the model locally and have access to its memory to enable caching. Many high-powered LLMs are hosted, impose context-length limits, and do not expose their memory for user-level cache manipulation via an API.
  • Once your dataset crosses a threshold, you might exceed the context window. If that happens, the method can break entirely or force you to truncate vital information, tanking accuracy.

This snippet from one error log says it all:

"error": {"message": "This endpoint's maximum context length is 131072 tokens. However, you requested about 719291 tokens…"}

Translation: You’re out of luck unless you compress or chunk your data, which can significantly degrade performance.
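One practical safeguard is to estimate prompt size before sending anything. A rough sketch using tiktoken’s cl100k_base encoding as a stand-in (it is not Llama’s tokenizer, so treat the count as an approximation):

```python
import tiktoken

MAX_CONTEXT = 131_072  # the endpoint limit from the error message above

def estimated_tokens(documents: list[str]) -> int:
    """Rough token count for the concatenated knowledge base."""
    enc = tiktoken.get_encoding("cl100k_base")
    return sum(len(enc.encode(doc)) for doc in documents)

documents = ["first document text", "second document text"]  # placeholder knowledge base
if estimated_tokens(documents) > MAX_CONTEXT:
    print("Knowledge base will not fit in one prompt; fall back to retrieval or chunking.")
```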

3. Our Benchmark Setup

We used the HotpotQA dataset (known for multi-hop QA) and ran our tests on the meta-llama/llama-3.1-8b-instruct model. We posed 50 questions against each of two knowledge sizes (50 documents and 500 documents) to see how each method performs at different scales.

Because we used an API (OpenRouter) for KV-Cache, there was no actual “cache” or local memory optimization happening; we simply passed all documents in each request.

  • top_k=5 for RAG, and no top_k for KV-Cache (it loads everything).
  • No retrieval time comparison: Our focus is on semantic accuracy, since KV-Cache would trivially have zero retrieval overhead if it were truly caching locally. (A condensed sketch of the benchmark loop follows this list.)
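As referenced above, here is a condensed sketch of the benchmark loop. The scoring uses embedding cosine similarity as an approximation of our metric, and qa_pairs plus the two answer functions are placeholders for the HotpotQA sample and the two pipelines:

```python
from statistics import mean
from typing import Callable

from sentence_transformers import SentenceTransformer, util

scorer = SentenceTransformer("all-MiniLM-L6-v2")

def similarity(answer: str, reference: str) -> float:
    """Cosine similarity between the embeddings of a generated answer and its reference."""
    emb = scorer.encode([answer, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def run_benchmark(qa_pairs: list[tuple[str, str]],
                  answer_fn: Callable[[str], str]) -> float:
    """Average similarity of answer_fn's outputs against the ground-truth answers."""
    return mean(similarity(answer_fn(q), ref) for q, ref in qa_pairs)

# Hypothetical wiring: 50 HotpotQA (question, answer) pairs per knowledge size,
# scored once with the RAG pipeline (top_k=5) and once with the all-documents prompt.
# rag_score = run_benchmark(qa_pairs, rag_answer)
# cag_score = run_benchmark(qa_pairs, full_context_answer)
```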

4. Results

Our benchmark tests on the HotpotQA dataset revealed interesting insights into the performance of RAG and KV-Cache (CAG) under different knowledge sizes.

Below are the key findings:

Figure 1: Average semantic similarity scores for KV-Cache (No Cache) and RAG across knowledge sizes (k=50 and k=500). Tests were conducted on the HotpotQA dataset using the meta-llama/llama-3.1-8b-instruct model, with 50 questions per knowledge size. KV-Cache used an API (OpenRouter) without local caching, while RAG employed top_k=5 for retrieval.

Key Takeaways

  • KV-Cache Struggles with Scale: As the dataset grows, KV-Cache faces context size limits, which require prompt truncation or compression.
  • RAG Handles Complexity: RAG’s retrieval mechanism ensures only relevant documents are used, avoiding context overload and maintaining accuracy.

The Bottom Line

While KV-Cache shines with small, stable datasets, RAG proves more robust for larger, dynamic knowledge bases, making it a better fit for real-world, enterprise-level tasks.

5. KV-Cache (CAG): Pros & Cons

CAG can appear unbeatable in early or small-scale tests (e.g. ~50 documents). But scaling up to 500+ documents reveals some crucial issues:

Context Overflow

When you exceed the model’s max context window, you risk prompt truncation or outright token-limit errors. Vital information gets cut, and accuracy suffers.

Local Hardware

To truly leverage KV-Cache, you need direct access to the model’s memory. If you rely on a hosted or API-driven model, there’s no way to manage caching yourself.
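For contrast, this is roughly what real KV caching looks like when you do control the model’s memory: a minimal sketch with Hugging Face Transformers, using gpt2 as a small stand-in for llama-3.1-8b-instruct.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # small stand-in; the same pattern applies to Llama
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# 1) Pre-fill once: encode the (static) knowledge base and keep its key/value states.
kb_ids = tok("Rosie Mac was Emilia Clarke's body double on Game of Thrones.",
             return_tensors="pt").input_ids
with torch.no_grad():
    kv_cache = model(kb_ids, use_cache=True).past_key_values

# 2) Per question, only the new tokens are processed against the cached states;
#    a hosted API gives you no handle on this cache between requests.
q_ids = tok(" Question: Who was the body double? Answer:", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(q_ids, past_key_values=kv_cache, use_cache=True).logits
print(tok.decode([logits[0, -1].argmax().item()]))  # greedy guess at the next token
```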

Frequent Updates

Every time your data changes, you have to rebuild the entire cache. This overhead can undermine the supposed “instant” advantage that KV-Cache promises.

6. Quizzing Time: Score Wars (or Why ‘Rosie Mac’ is the Winner)

Not all scores tell the full story. When evaluating model responses, similarity metrics compare generated answers to a reference text. But what happens when one answer is more detailed than the reference? Does it get rewarded, or penalized? Let’s look at a real example from our benchmark.

The Question:

Q: Who was the body double for Emilia Clarke playing Daenerys Targaryen in Game of Thrones?

Two Correct Answers:

Answer A

“Rosie Mac was the body double for Emilia Clarke in her portrayal of Daenerys Targaryen in Game of Thrones.”

Answer B

“Rosie Mac.”

Which one do you think scored higher on our similarity metric? Most people might assume the more detailed answer (A) wins. But here are the actual scores:

  • Answer A: 0.60526
  • Answer B: 0.98361

Yes, the shorter “Rosie Mac.” received the higher score. Why? Because the ground-truth reference answer was simply “Rosie Mac”, so the more detailed response introduced extra words that lowered the alignment score.

This doesn’t mean longer answers are worse; often, they provide better context. But it highlights why similarity metrics should be interpreted with caution, especially in nuanced or multi-hop reasoning tasks. Our overall results remain valid, but it’s important to look beyond raw scores to gain a comprehensive, unbiased perspective on how these models truly perform.
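For intuition, an illustrative check with the all-MiniLM-L6-v2 embedding model (mentioned in the FAQ below) reproduces the same pattern, though the exact numbers depend on the scoring pipeline:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "Rosie Mac"
answer_a = ("Rosie Mac was the body double for Emilia Clarke in her portrayal "
            "of Daenerys Targaryen in Game of Thrones.")
answer_b = "Rosie Mac."

ref_emb, a_emb, b_emb = model.encode([reference, answer_a, answer_b], convert_to_tensor=True)
print("Answer A:", util.cos_sim(a_emb, ref_emb).item())  # extra tokens dilute the match
print("Answer B:", util.cos_sim(b_emb, ref_emb).item())  # near-identical to the reference
```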

Image Credit: Kevin Michael Schindler

7. Final Thoughts: No Free Lunch

Yes, Cache-Augmented Generation can truly offer zero retrieval overhead, provided your entire knowledge base and context fit comfortably in your local LLM. But for many enterprise or multi-hop tasks, that’s a big “if.”

If your data is large or updates frequently, RAG approaches like CustomGPT.ai may remain the more robust and flexible choice.

8. Frequently Asked Questions

What is Retrieval-Augmented Generation (RAG)?

It’s a technique that fetches external documents at inference time to enrich a model’s responses, allowing you to handle bigger or changing data sets without overloading the model’s context.

How did you measure semantic similarity?

We used a BERTScore model (“all-MiniLM-L6-v2”) to compare generated answers with ground-truth references.

What does “No Cache” KV-Cache mean in your diagrams?

It indicates we didn’t run an actual local caching mechanism. Instead, we replicated the effect by passing all documents via an API request each time, so we could compare its semantic accuracy without focusing on speed.

Why was HotpotQA used?

HotpotQA requires retrieving multiple documents to answer a single question, making it ideal for testing retrieval methods like RAG and highlighting KV-Cache’s limitations with large knowledge bases.

When is multi-hop retrieval needed?

When no single document contains the full answer, which is common in research, legal analysis, and complex reasoning tasks requiring fact linking.


Published via Towards AI
