RAG vs. CAG: Can Cache-Augmented Generation Really Replace Retrieval?
Last Updated on February 5, 2025 by Editorial Team
Author(s): Alden Do Rosario
Originally published on Towards AI.
A recent VentureBeat article highlights a new Cache-Augmented Generation (CAG) method that promises no retrieval overhead and even better performance than Retrieval-Augmented Generation (RAG).
Sounds too good to be true?
We decided to find out by running our own tests on KV-Cache (a popular CAG implementation) versus RAG.
Below are our insights on what happens when you apply these methods to real workloads.
1. Setting the Stage: RAG vs. KV-Cache (CAG)
RAG
What It Is
A Retrieval-Augmented Generation approach that uses a retriever to find relevant documents, then passes them to a large language model for final answers.
Where It Shines
- Handles larger or frequently updated datasets without loading everything at once.
- Avoids massive prompts, which can lead to truncation or context overload.
Key Limitations
- Adds a retrieval step, which can be slower.
- Often relies on external APIs or indexing overhead.
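To make that flow concrete, here is a minimal retrieve-then-generate sketch in Python. It assumes a sentence-transformers embedding model and an OpenAI-compatible endpoint such as OpenRouter; the helper names and prompt wording are illustrative, not the exact code from our benchmark.

```python
# Minimal RAG sketch: embed documents, retrieve the top-k by cosine similarity,
# then ask the LLM to answer using only the retrieved context.
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")  # any OpenAI-compatible host

def retrieve(question: str, documents: list[str], top_k: int = 5) -> list[str]:
    doc_vecs = embedder.encode(documents, normalize_embeddings=True)
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec                       # cosine similarity (vectors are normalized)
    best = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in best]

def rag_answer(question: str, documents: list[str]) -> str:
    context = "\n\n".join(retrieve(question, documents, top_k=5))
    response = client.chat.completions.create(
        model="meta-llama/llama-3.1-8b-instruct",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```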
KV-Cache (CAG)
What It Is
A method that aims for near-zero retrieval time by loading all documents directly into the model’s context window. In principle, it cuts out the retriever entirely.
Note: In our benchmarks, we used a “No Cache” version of KV-Cache because the model was too large to run locally. Instead, we mimicked the same behavior via an API (OpenRouter) by feeding all documents each time. We’re not comparing retrieval speed here, since KV-Cache would obviously win if run locally on a suitable setup.
Where It Shines
- If your entire knowledge base easily fits in the model’s context, you get almost instant answers (no retrieval step).
- Best for stable datasets that rarely change.
Key Limitations
- Context Size: If you exceed the model’s capacity, you must truncate or compress, killing accuracy.
- Local Requirement: Real caching needs control over memory, meaning you must run the model on your own infrastructure.
- Frequent Updates: Reloading the entire knowledge in context is impractical for dynamic data.
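For comparison with the RAG sketch above, the “No Cache” approximation described in the note amounts to stuffing every document into the prompt on each request. A rough sketch, again with illustrative prompt wording rather than our exact code:

```python
# "No Cache" CAG approximation: pass the entire knowledge base in the prompt
# on every request instead of retrieving a subset.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")

def cag_answer(question: str, documents: list[str]) -> str:
    context = "\n\n".join(documents)  # everything goes in, no retriever
    response = client.chat.completions.create(
        model="meta-llama/llama-3.1-8b-instruct",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```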
2. The BIG BUT (and We Cannot Lie)
Long-context LLMs (like Google Gemini or Claude with hundreds of thousands of tokens) are emerging, making CAG more appealing for some workloads.
But there are two big conditions:
- You must run the model locally and have access to its memory to enable caching. Many high-powered LLMs are hosted and cap context lengths, and you obviously can’t manipulate their memory through an API.
- Once your dataset grows past a certain size, you exceed the context window. If that happens, the method can break entirely or force you to truncate vital information, tanking accuracy.
This snippet from one error log says it all:
"error": {"message": "This endpoint's maximum context length is 131072 tokens. However, you requested about 719291 tokens…"}
Translation: You’re out of luck unless you compress or chunk your data, which can significantly hurt performance.
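One cheap way to avoid hitting that wall is to estimate the prompt’s token count before sending it. The sketch below uses tiktoken’s cl100k_base encoding as a rough proxy; it is not Llama’s tokenizer, so treat the count as an approximation rather than an exact guard.

```python
# Rough guard against context-overflow errors like the one above.
import tiktoken

MAX_CONTEXT_TOKENS = 131_072  # the endpoint limit from the error message

def approx_tokens(text: str) -> int:
    # cl100k_base is only an approximation; exact counts depend on the model's own tokenizer.
    return len(tiktoken.get_encoding("cl100k_base").encode(text))

def fits_in_context(prompt: str, reserve_for_answer: int = 1_024) -> bool:
    return approx_tokens(prompt) + reserve_for_answer <= MAX_CONTEXT_TOKENS
```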
3. Our Benchmark Setup
We used the HotpotQA dataset (known for multi-hop QA) and ran our tests on the meta-llama/llama-3.1-8b-instruct model. We posed 50 questions against each of two knowledge sizes, 50 documents and 500 documents, to see how each method performs at different scales.
Because we used an API (OpenRouter) for KV-Cache, there was no actual “cache” or local memory optimization happening; we simply passed all documents in each request.
- top_k=5 for RAG, and no top_k for KV-Cache (it loads everything).
- No retrieval time comparison: Our focus is on semantic accuracy, since KV-Cache would trivially have zero retrieval overhead if it were truly caching locally.
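To make the setup concrete, here is a rough sketch of the evaluation loop, assuming the rag_answer and cag_answer helpers sketched earlier and the Hugging Face hotpot_qa dataset. The exact way we built the 50- and 500-document knowledge bases lives in our fork; this only shows the shape of the loop.

```python
# Benchmark sketch: ask the same HotpotQA questions with both methods and
# score each answer against the reference with embedding cosine similarity.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util

scorer = SentenceTransformer("all-MiniLM-L6-v2")

# 50 questions from HotpotQA; one "document" per supporting-context paragraph.
dataset = load_dataset("hotpot_qa", "distractor", split="validation").select(range(50))
documents = [
    " ".join(sentences)
    for sample in dataset
    for sentences in sample["context"]["sentences"]
]

def similarity(answer: str, reference: str) -> float:
    vecs = scorer.encode([answer, reference], convert_to_tensor=True)
    return util.cos_sim(vecs[0], vecs[1]).item()

results = {"rag": [], "cag": []}
for sample in dataset:
    question, reference = sample["question"], sample["answer"]
    results["rag"].append(similarity(rag_answer(question, documents), reference))
    results["cag"].append(similarity(cag_answer(question, documents), reference))

for method, scores in results.items():
    print(f"{method}: mean similarity = {sum(scores) / len(scores):.4f}")
```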
4. Results
Our benchmark tests on the HotpotQA dataset show how RAG and KV-Cache (CAG) hold up at the two knowledge sizes.
Here are the key findings:
Key Takeaways
- KV-Cache Struggles with Scale: As the dataset grows, KV-Cache faces context size limits, which require prompt truncation or compression.
- RAG Handles Complexity: RAG’s retrieval mechanism ensures only relevant documents are used, avoiding context overload and maintaining accuracy.
The Bottom Line
While KV-Cache shines with small, stable datasets, RAG proves more robust for larger, dynamic knowledge bases, making it a better fit for real-world, enterprise-level tasks.
5. KV-Cache (CAG): Pros & Cons
CAG can appear unbeatable in early or small-scale tests (e.g. ~50 documents). But scaling up to 500+ documents reveals some crucial issues:
Context Overflow
When you exceed the model’s max context window, you risk prompt truncation or outright token-limit errors. Vital information gets cut, and accuracy suffers.
Local Hardware
To truly leverage KV-Cache, you need direct access to the model’s memory. If you rely on a hosted or API-driven model, there’s no way to manage caching yourself (see the sketch at the end of this section for what local caching involves).
Frequent Updates
Every time your data changes, you have to rebuild the entire cache. This overhead can undermine the supposed “instant” advantage that KV-Cache promises.
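For contrast, here is roughly what real local caching looks like with Hugging Face transformers: the knowledge-base tokens pass through the model once, and the resulting key/value cache is reused for every question. This is a simplified sketch of the idea, not the code we benchmarked; it assumes local access to the gated Llama weights and a recent transformers version that accepts a precomputed past_key_values in generate, and the official CAG repo handles details such as resetting the cache between questions.

```python
# Local KV-Cache sketch: precompute the key/value cache for the knowledge base
# once, then reuse it for each question instead of re-reading the documents.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # needs local weights, a GPU, and gated-model access
documents = ["..."]                               # placeholder: your knowledge base
question = "Who was the body double for Emilia Clarke playing Daenerys Targaryen?"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# 1) Run the knowledge base through the model once and keep the key/value cache.
knowledge_ids = tokenizer("\n\n".join(documents), return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    cache = model(knowledge_ids, use_cache=True).past_key_values

# 2) Answer a question by reusing that cache instead of re-processing the documents.
question_ids = tokenizer(question, return_tensors="pt").input_ids.to(model.device)
full_ids = torch.cat([knowledge_ids, question_ids], dim=-1)
output = model.generate(
    input_ids=full_ids,
    past_key_values=cache,   # only the uncached question tokens are processed (recent transformers)
    max_new_tokens=64,
)
print(tokenizer.decode(output[0, full_ids.shape[-1]:], skip_special_tokens=True))
```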
6. Quizzing Time: Score Wars (Why “Rosie Mac” is the Winner)
Not all scores tell the full story. When evaluating model responses, similarity metrics compare generated answers to a reference text. But what happens when one answer is more detailed than the reference? Does it get rewarded, or penalized? Let’s look at a real example from our benchmark.
The Question:
Q: Who was the body double for Emilia Clarke playing Daenerys Targaryen in Game of Thrones?
Two Correct Answers:
Answer A
“Rosie Mac was the body double for Emilia Clarke in her portrayal of Daenerys Targaryen in Game of Thrones.”
Answer B
“Rosie Mac.”
Which one do you think scored higher on our similarity metric? Most people might assume the more detailed answer (A) wins. But here are the actual scores:
- Answer A: 0.60526
- Answer B: 0.98361
Yes, the shorter “Rosie Mac.” received the higher score. Why? Because the ground-truth reference answer was simply “Rosie Mac”, so the more detailed response introduced extra words that lowered the alignment score.
This doesn’t mean longer answers are worse; often, they provide better context. But it highlights why similarity metrics should be interpreted with caution, especially in nuanced or multi-hop reasoning tasks. Our overall results still hold, but it’s important to look beyond raw scores to get an honest picture of how these models actually perform.
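If you want to see the length effect for yourself, the snippet below scores both answers against the short reference using embedding cosine similarity with all-MiniLM-L6-v2. It is illustrative only; your exact numbers will depend on the metric and model, so they may not match the scores reported above.

```python
# Illustrative check of the length effect: score both answers against the short reference.
from sentence_transformers import SentenceTransformer, util

scorer = SentenceTransformer("all-MiniLM-L6-v2")
reference = "Rosie Mac"
answers = {
    "A (detailed)": "Rosie Mac was the body double for Emilia Clarke in her portrayal "
                    "of Daenerys Targaryen in Game of Thrones.",
    "B (terse)": "Rosie Mac.",
}

ref_vec = scorer.encode(reference, convert_to_tensor=True)
for label, answer in answers.items():
    score = util.cos_sim(scorer.encode(answer, convert_to_tensor=True), ref_vec).item()
    print(f"Answer {label}: {score:.5f}")  # the terse answer aligns more closely with the reference
```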
7. Final Thoughts: No Free Lunch
Yes, Cache-Augmented Generation can truly offer zero retrieval overhead, provided your entire knowledge base fits comfortably in the context window of a model you run locally. But for many enterprise or multi-hop tasks, that’s a big “if.”
If your data is large or updates frequently, RAG approaches like CustomGPT.ai may remain the more robust and flexible choice.
8. Frequently Asked Questions
What is Retrieval-Augmented Generation (RAG)?
It’s a technique that fetches external documents at inference time to enrich a model’s responses, allowing you to handle bigger or changing datasets without overloading the model’s context.
How did you measure semantic similarity?
We used BERTScore with the “all-MiniLM-L6-v2” model to compare generated answers against ground-truth references.
What does “No Cache” KV-Cache mean in your diagrams?
It indicates we didn’t run an actual local caching mechanism. Instead, we replicated the effect by passing all documents via an API request each time, so we could compare its semantic accuracy without focusing on speed.
Why was HotpotQA used?
HotpotQA requires retrieving multiple documents to answer a single question, making it ideal for testing retrieval methods like RAG and highlighting KV-Cache’s limitations with large knowledge bases.
When is multi-hop retrieval needed?
When no single document contains the full answer, which is common in research, legal analysis, and complex reasoning tasks that require linking facts.
Learn More
- CustomGPT.ai: RAG-as-a-Service for large or dynamic data
- VentureBeat Article: How Cache-Augmented Generation Reduces Latency & Complexity
- CAG Official Repo: CAG Original
- Our Modified Fork (With Results): CAG Fork
Published via Towards AI