
RAG vs. CAG: Can Cache-Augmented Generation Really Replace Retrieval?

Last Updated on February 5, 2025 by Editorial Team

Author(s): Alden Do Rosario

Originally published on Towards AI.

A recent VentureBeat article highlights a new Cache-Augmented Generation (CAG) method that promises no retrieval overhead and even better performance than Retrieval-Augmented Generation (RAG).

Sounds too good to be true?

We decided to find out by running our own tests on KV-Cache (a popular CAG implementation) versus RAG.

Below are our insights on what happens when you apply these methods to real workloads.

1. Setting the Stage: RAG vs. KV-Cache (CAG)

RAG

What It Is
A Retrieval-Augmented Generation approach that uses a retriever to find relevant documents, then passes them to a large language model for final answers.
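In Python, a minimal retrieve-then-prompt sketch looks like the snippet below. It is illustrative only: the embedding model and helper names here are assumptions made for this example, not the production RAG pipeline used in the benchmark.

```python
from sentence_transformers import SentenceTransformer, util

# Toy knowledge base; a real deployment indexes thousands of documents.
documents = [
    "Rosie Mac was Emilia Clarke's body double on Game of Thrones.",
    "HotpotQA is a multi-hop question answering dataset.",
    "Cache-Augmented Generation loads every document into the context window.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

def retrieve(question: str, top_k: int = 5) -> list[str]:
    """Return the top_k documents most similar to the question."""
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_embeddings, top_k=top_k)[0]
    return [documents[hit["corpus_id"]] for hit in hits]

def build_prompt(question: str) -> str:
    """Only the retrieved snippets go to the LLM, keeping the prompt small."""
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("Who was Emilia Clarke's body double?"))
```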

Where It Shines

  • Handles larger or frequently updated datasets without loading everything at once.
  • Avoids massive prompts, which can lead to truncation or context overload.

Key Limitations

  • Adds a retrieval step, which can be slower.
  • Often relies on external APIs or indexing overhead.

KV-Cache (CAG)

What It Is
A method that aims for near-zero retrieval time by loading all documents directly into the model’s context window. In principle, it cuts out the retriever entirely.

Note: In our benchmarks, we used a “No Cache” version of KV-Cache because the model was too large to run locally. Instead, we mimicked the same behavior via an API (OpenRouter) by feeding all documents each time. We’re not comparing retrieval speed here, since KV-Cache would obviously win if run locally on a suitable setup.
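Concretely, that “No Cache” setup amounts to something like the sketch below, assuming OpenRouter’s OpenAI-compatible endpoint and the openai Python client (the environment-variable name is arbitrary):

```python
import os
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API; the env var name here is our own choice.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def answer_with_full_context(documents: list[str], question: str) -> str:
    """Send every document on every request: no retriever, and no real KV cache."""
    context = "\n\n".join(documents)
    response = client.chat.completions.create(
        model="meta-llama/llama-3.1-8b-instruct",
        messages=[
            {"role": "system", "content": "Answer strictly from the provided documents."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```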

Where It Shines

  • If your entire knowledge base easily fits in the model’s context, you get almost instant answers (no retrieval step).
  • Best for stable datasets that rarely change.

Key Limitations

  • Context Size: If you exceed the model’s capacity, you must truncate or compress, killing accuracy.
  • Local Requirement: Real caching needs control over memory, meaning you must run the model on your own infrastructure.
  • Frequent Updates: Reloading the entire knowledge in context is impractical for dynamic data.

2. The BIG BUT (and We Cannot Lie)

Long-context LLMs (like Google Gemini or Claude with hundreds of thousands of tokens) are emerging, making CAG more appealing for some workloads.

But there’s a big condition:

  • You must run the model locally and have access to its memory to enable caching. Many high-powered LLMs are hosted, impose context-length limits, and do not expose their memory for user-level cache manipulation via an API.
  • Once your dataset crosses a threshold, you might exceed the context window. If that happens, the method can break entirely or force you to truncate vital information, tanking accuracy.

This snippet from one error log says it all:

"error": {"message": "This endpoint's maximum context length is 131072 tokens. However, you requested about 719291 tokens…"}

Translation: You’re out of luck unless you compress or chunk your data, which can significantly degrade performance.
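One practical safeguard is to estimate prompt size before sending anything. A rough sketch using tiktoken’s cl100k_base encoding as a stand-in (it is not Llama’s tokenizer, so treat the count as an approximation):

```python
import tiktoken

MAX_CONTEXT = 131_072  # the endpoint limit from the error message above

def estimated_tokens(documents: list[str]) -> int:
    """Rough token count for the concatenated knowledge base."""
    enc = tiktoken.get_encoding("cl100k_base")
    return sum(len(enc.encode(doc)) for doc in documents)

documents = ["first document text", "second document text"]  # placeholder knowledge base
if estimated_tokens(documents) > MAX_CONTEXT:
    print("Knowledge base will not fit in one prompt; fall back to retrieval or chunking.")
```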

3. Our Benchmark Setup

We used the HotpotQA dataset (known for multi-hop QA) and ran our tests on the meta-llama/llama-3.1-8b-instruct model. We posed 50 questions against each of two knowledge sizes (50 documents and 500 documents) to see how each method performs at different scales.

Because we used an API (OpenRouter) for KV-Cache, there was no actual “cache” or local memory optimization happening; we simply passed all documents in each request.

  • top_k=5 for RAG, and no top_k for KV-Cache (it loads everything).
  • No retrieval time comparison: Our focus is on semantic accuracy, since KV-Cache would trivially have zero retrieval overhead if it were truly caching locally. (A condensed sketch of the benchmark loop follows this list.)
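As referenced above, here is a condensed sketch of the benchmark loop. The scoring uses embedding cosine similarity as an approximation of our metric, and qa_pairs plus the two answer functions are placeholders for the HotpotQA sample and the two pipelines:

```python
from statistics import mean
from typing import Callable

from sentence_transformers import SentenceTransformer, util

scorer = SentenceTransformer("all-MiniLM-L6-v2")

def similarity(answer: str, reference: str) -> float:
    """Cosine similarity between the embeddings of a generated answer and its reference."""
    emb = scorer.encode([answer, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def run_benchmark(qa_pairs: list[tuple[str, str]],
                  answer_fn: Callable[[str], str]) -> float:
    """Average similarity of answer_fn's outputs against the ground-truth answers."""
    return mean(similarity(answer_fn(q), ref) for q, ref in qa_pairs)

# Hypothetical wiring: 50 HotpotQA (question, answer) pairs per knowledge size,
# scored once with the RAG pipeline (top_k=5) and once with the all-documents prompt.
# rag_score = run_benchmark(qa_pairs, rag_answer)
# cag_score = run_benchmark(qa_pairs, full_context_answer)
```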

4. Results

Our benchmark tests on the HotpotQA dataset revealed interesting insights into the performance of RAG and KV-Cache (CAG) under different knowledge sizes.

Below are the key findings:

Figure 1: Average semantic similarity scores for KV-Cache (No Cache) and RAG across knowledge sizes (k=50 and k=500). Tests were conducted on the HotpotQA dataset using the meta-llama/llama-3.1-8b-instruct model, with 50 questions per knowledge size. KV-Cache used an API (OpenRouter) without local caching, while RAG employed top_k=5 for retrieval.

Key Takeaways

  • KV-Cache Struggles with Scale: As the dataset grows, KV-Cache faces context size limits, which require prompt truncation or compression.
  • RAG Handles Complexity: RAG’s retrieval mechanism ensures only relevant documents are used, avoiding context overload and maintaining accuracy.

The Bottom Line

While KV-Cache shines with small, stable datasets, RAG proves more robust for larger, dynamic knowledge bases, making it a better fit for real-world, enterprise-level tasks.

5. KV-Cache (CAG): Pros & Cons

CAG can appear unbeatable in early or small-scale tests (e.g. ~50 documents). But scaling up to 500+ documents reveals some crucial issues:

Context Overflow

When you exceed the model’s max context window, you risk prompt truncation or outright token-limit errors. Vital information gets cut, and accuracy suffers.

Local Hardware

To truly leverage KV-Cache, you need direct access to the model’s memory. If you rely on a hosted or API-driven model, there’s no way to manage caching yourself.
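For contrast, this is roughly what real KV caching looks like when you do control the model’s memory: a minimal sketch with Hugging Face Transformers, using gpt2 as a small stand-in for llama-3.1-8b-instruct.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # small stand-in; the same pattern applies to Llama
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# 1) Pre-fill once: encode the (static) knowledge base and keep its key/value states.
kb_ids = tok("Rosie Mac was Emilia Clarke's body double on Game of Thrones.",
             return_tensors="pt").input_ids
with torch.no_grad():
    kv_cache = model(kb_ids, use_cache=True).past_key_values

# 2) Per question, only the new tokens are processed against the cached states;
#    a hosted API gives you no handle on this cache between requests.
q_ids = tok(" Question: Who was the body double? Answer:", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(q_ids, past_key_values=kv_cache, use_cache=True).logits
print(tok.decode([logits[0, -1].argmax().item()]))  # greedy guess at the next token
```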

Frequent Updates

Every time your data changes, you have to rebuild the entire cache. This overhead can undermine the supposed “instant” advantage that KV-Cache promises.

6. Quizzing Time: Score Wars (or Why ‘Rosie Mac’ is the Winner)

Not all scores tell the full story. When evaluating model responses, similarity metrics compare generated answers to a reference text. But what happens when one answer is more detailed than the reference? Does it get rewarded, or penalized? Let’s look at a real example from our benchmark.

The Question:

Q: Who was the body double for Emilia Clarke playing Daenerys Targaryen in Game of Thrones?

Two Correct Answers:

Answer A

“Rosie Mac was the body double for Emilia Clarke in her portrayal of Daenerys Targaryen in Game of Thrones.”

Answer B

“Rosie Mac.”

Which one do you think scored higher on our similarity metric? Most people might assume the more detailed answer (A) wins. But here are the actual scores:

  • Answer A: 0.60526
  • Answer B: 0.98361

Yes, the shorter “Rosie Mac.” received the higher score. Why? Because the ground-truth reference answer was simply “Rosie Mac”, so the more detailed response introduced extra words that lowered the alignment score.

This doesn’t mean longer answers are worse; often, they provide better context. But it highlights why similarity metrics should be interpreted with caution, especially in nuanced or multi-hop reasoning tasks. Our overall results remain valid, but it’s important to look beyond raw scores to gain a comprehensive, unbiased perspective on how these models truly perform.
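For intuition, an illustrative check with the all-MiniLM-L6-v2 embedding model (mentioned in the FAQ below) reproduces the same pattern, though the exact numbers depend on the scoring pipeline:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "Rosie Mac"
answer_a = ("Rosie Mac was the body double for Emilia Clarke in her portrayal "
            "of Daenerys Targaryen in Game of Thrones.")
answer_b = "Rosie Mac."

ref_emb, a_emb, b_emb = model.encode([reference, answer_a, answer_b], convert_to_tensor=True)
print("Answer A:", util.cos_sim(a_emb, ref_emb).item())  # extra tokens dilute the match
print("Answer B:", util.cos_sim(b_emb, ref_emb).item())  # near-identical to the reference
```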

Image Credit: Kevin Michael Schindler

7. Final Thoughts: No Free Lunch

Yes, Cache-Augmented Generation can truly offer zero retrieval overhead, provided your entire knowledge base and context fit comfortably in your local LLM. But for many enterprise or multi-hop tasks, that’s a big “if.”

If your data is large or updates frequently, RAG approaches like CustomGPT.ai may remain the more robust and flexible choice.

8. Frequently Asked Questions

What is Retrieval-Augmented Generation (RAG)?

It’s a technique that fetches external documents at inference time to enrich a model’s responses, allowing you to handle bigger or changing data sets without overloading the model’s context.

How did you measure semantic similarity?

We used a BERTScore model (“all-MiniLM-L6-v2”) to compare generated answers with ground-truth references.

What does “No Cache” KV-Cache mean in your diagrams?

It indicates we didn’t run an actual local caching mechanism. Instead, we replicated the effect by passing all documents via an API request each time, so we could compare its semantic accuracy without focusing on speed.

Why was HotpotQA used?

HotpotQA requires retrieving multiple documents to answer a single question, making it ideal for testing retrieval methods like RAG and highlighting KV-Cache’s limitations with large knowledge bases.

When is multi-hop retrieval needed?

When no single document contains the full answer, which is common in research, legal analysis, and complex reasoning tasks requiring fact linking.


Published via Towards AI
