
Beyond Training Data: How RAG Lets LLMs Retrieve, Not Guess
Author(s): DarkBones
Originally published on Towards AI.
Large Language Models (LLMs) like GPT-4 don't actually "know" anything; they predict words based on old training data. Retrieval-Augmented Generation (RAG) changes that by letting AI pull in fresh, real-world knowledge before answering.
RAG enhances LLMs by enabling them to retrieve relevant information from external sources before generating a response. Because LLMs rely on static training data and don't update automatically, RAG gives them access to fresh, domain-specific, or private knowledge without the need for costly retraining.
Let's explore how RAG works, why it is useful, and how it differs from traditional LLM prompting.
What is Retrieval-Augmented Generation (RAG) in AI?
Retrieval-Augmented Generation (RAG) helps AI models retrieve external information before generating a response. But how exactly does this process work, and why is it important?
Large Language Models excel at many tasks. They can code, draft emails, hallucinate ingredients for the perfect sandwich, and even write articles, although I still prefer doing that myself. However, they have a major limitation. They lack real-time knowledge. Because training LLMs is a time-consuming process, they do not "know" about recent events. If you ask one about last week, it will either display a disclaimer, provide an outdated answer, or generate something completely inaccurate.
"Some LLMs overcome their biggest limitation of stale training data by retrieving up-to-date information before responding."
RAG fetches relevant information before generating an answer, making AI responses more accurate and reducing hallucinations.
RAG Explained in Simple Terms
But how does RAG actually work? Instead of looking it up ourselves, let's ask our favorite LLM:
This is not quite what we were hoping for. No problem, we can ask Bob instead.
Surprisingly, Bob did not know the answer either, but he was able to retrieve it. Here is what happened:
- We asked Bob about RAG.
- Bob went to the library and asked the librarian for information.
- The librarian pointed him to the right aisle.
- Bob retrieved the information.
- Bob augmented his understanding by consuming the information before generating an answer.
- Now Bob sounds like an expert. Thanks, Bob.
This breakdown reveals that Bob is effectively functioning as a RAG agent.
With that insight, let's explore exactly how a RAG agent operates.
RAG, Simplified
Let's transform our interaction with Bob into an actual RAG system:
- Bob represents the RAG system.
- The librarian acts as an embedder.
- The library functions as a vector database.
"Rather than prompting an LLM directly, a RAG system acts as a knowledge bridge: retrieving, augmenting, and then generating responses."
Vectorizing the Input
The RAG system then forwards the prompt to the embedder, which converts it into a vector. This vector is a numeric representation of the prompt. The idea is that information with similar meaning will have similar vector representations.
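As a rough sketch, the embedding step can be a single call. The snippet below assumes the sentence-transformers library and the all-MiniLM-L6-v2 model, which are stand-ins for illustration; any embedding model works the same way:

```python
from sentence_transformers import SentenceTransformer

# Any embedding model will do; all-MiniLM-L6-v2 is just a small, convenient example.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

prompt = "Hey LLM, tell me about RAG"
prompt_vector = embedder.encode(prompt)

print(prompt_vector.shape)  # (384,) for this model: one number per dimension
```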
"Vectors unlock relevance. This vector allows the system to retrieve the most meaningful information from the vector database."
When the vector representation of the user's prompt is sent to the database, it retrieves the most relevant matches.
The RAG system then enhances the user's prompt by including the retrieved information:
<context>
the information returned from the database
</context>
<user-prompt>
the user's original prompt
</user-prompt>
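In code, the augment step is little more than string formatting. A minimal sketch (the function name and variables are made up for illustration):

```python
def build_augmented_prompt(user_prompt: str, retrieved_chunks: list[str]) -> str:
    """Wrap the retrieved context and the original question in the template above."""
    context = "\n".join(retrieved_chunks)
    return (
        "<context>\n"
        f"{context}\n"
        "</context>\n"
        "<user-prompt>\n"
        f"{user_prompt}\n"
        "</user-prompt>"
    )
```

The resulting string is what actually gets sent to the LLM, which then generates the final answer.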
That is the entire process. Retrieve, Augment, and Generate. RAG.
Adding to the Knowledge Base
However, the system cannot retrieve information that has not been added to the database. How do we store new data? The process is straightforward. Instead of using the vector to find relevant information, the system stores the data along with its vector representation.
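A minimal, in-memory version of that storage step might look like the sketch below. A real system would use a proper vector database, but the idea is identical (the embedder here is the one from the earlier snippet):

```python
import numpy as np

knowledge_base = []  # each entry: (vector, original text)

def add_document(text: str) -> None:
    """Embed the text and store it next to its vector so it can be retrieved later."""
    vector = np.asarray(embedder.encode(text))
    knowledge_base.append((vector, text))

add_document("RAG stands for Retrieval Augmented Generation")
```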
If you were only interested in the big picture, congratulations. You now understand the core concept. However, if you're a fellow neckbeard, let's talk a bit more about vectors and embedders.
What is a Vector?
In simple terms, a vector is a set of coordinates that describe how to move from A to B. Look at this graph:
This graph has two dimensions. Each point, A, B, C, and D, can be described using a two-number coordinate system. The first number tells us how far to move to the right from the origin (0), while the second number tells us how far to move up. To reach A, the vector is [3, 7]. To reach D, the vector is [3, 0].
Dimensionality of Vectors
The same principle applies in three dimensions. To move from your desk to the coffee machine, you must travel a certain distance along the x, y, and z axes, forming a three-number coordinate system.
"Humans struggle to visualize beyond three dimensions. Computers thrive in multi-dimensional spaces."
The math remains the same. Four dimensions? That requires a four-number coordinate system. One hundred dimensions? That requires a 100-number coordinate system.
"The embedder I use operates in a mind-bending, 768-dimensional coordinate system, far beyond human perception."
When you have finished trying to visualize that, we can return to simpler, easy-to-draw, two-dimensional graphs.
How Vector Embeddings Help LLMs Retrieve Data
Vectors by themselves are simply n-dimensional coordinates that represent points in n-dimensional space.
"Vectors aren't just numbers; they encode meaning. Their true power lies in the information they represent."
In the same way, vectors are coordinates not to places, but to information. A specialized LLM, an embedder, is trained on a large corpus of text to figure out similarities and to place these pieces of information somewhere in n-dimensional space such that similar topics tend to be grouped together.
It's like when you go to a social event: you're likely to stick with your friends, colleagues, or at least a group of like-minded people.
Grouping Similar Concepts Together
This graph shows how words that are similar in meaning tend to get grouped together in this n-dimensional space. Modern embedders (like BERT) don't use single-word embeddings anymore, but generate contextual embeddings.
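You can observe this grouping directly by comparing a few embeddings. A small sketch, again assuming the sentence-transformers setup from earlier (exact scores depend on the model):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
cat, dog, car = model.encode(["cat", "dog", "car"])

print(util.cos_sim(cat, dog))  # "cat" vs "dog": relatively high similarity
print(util.cos_sim(cat, car))  # "cat" vs "car": noticeably lower
```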
The ability to group similar concepts in vector space makes embeddings powerful. However, early embedding models like Word2Vec had a significant limitation that modern models have addressed.
Quick Tech Tangent
If you've been working on AI systems for as long as I have, you might be familiar with Word2Vec. While groundbreaking when it came out in 2013, it has a major flaw: it assigns a single vector to each word, no matter the context.
Take the word "bat".
- Are we talking about the flying mammal? Then it should be near "mammal", "cave", and "nocturnal".
- Or do we mean a baseball bat? Then it belongs near "ball", "pitch", and "base" (but what base? Military?)
- And what if we're in the world of fiction? Then "bat" relates to "vampire" and "transformation".
Word2Vec can't tell the difference. It picks one and sticks with it.
One thing I find particularly fascinating with Word2Vec is that, since words are now represented by numbers, you can actually do arithmetic on them.
You can make equations like "king - man + woman = queen", a legendary example of how AI models map relationships in vector space.
It's wild, but it works (most of the time).
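If you want to try the arithmetic yourself, gensim ships pre-trained word vectors that make it a one-liner. The sketch below uses small GloVe vectors rather than the original Word2Vec model, but the trick is the same (the top result can vary by model):

```python
import gensim.downloader as api

# Downloads a small set of pre-trained GloVe word vectors on first use.
vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman = ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically prints something like [('queen', 0.85)]
```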
Tangent over.
How are Vectors Used?
Now that we understand vectors, the next step is straightforward. We embed the information we want the LLM to access, and when we ask a question about that information, the question itself should be close to the relevant content in vector space. The vector database retrieves the n most relevant pieces of content, where n is a configurable number.
It also returns the cosine similarity score for each result, indicating how closely the retrieved content matches the query.
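Here is a minimal sketch of that retrieval step with NumPy, assuming the documents have already been embedded into a matrix (the names are illustrative, not any particular library's API):

```python
import numpy as np

def top_n_matches(query_vec: np.ndarray, doc_vecs: np.ndarray, n: int = 3):
    """Return indices and cosine similarity scores of the n closest documents."""
    query_vec = query_vec / np.linalg.norm(query_vec)
    doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = doc_vecs @ query_vec            # cosine similarity per document
    best = np.argsort(scores)[::-1][:n]      # highest scores first
    return best, scores[best]
```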
Cosine Similarity
"Cosine similarity doesn't just compare numbers; it measures meaning by calculating the angle between two vectors."
A smaller angle indicates greater similarity, meaning the retrieved data is more relevant to the prompt.
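For two vectors A and B, that angle is captured by a simple formula:

cosine_similarity(A, B) = (A · B) / (‖A‖ × ‖B‖)

A score of 1 means the vectors point in exactly the same direction, while a score near 0 means they are essentially unrelated.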
In our example, A and B represent the phrases "RAG stands for Retrieval Augmented Generation" and "Hey LLM, tell me about RAG". Since they are closely related, their vectors are similar. If we instead ask "Describe an Eclipse", its vector will be far from the others, making it unrelated. However, if "RAG stands for Retrieval Augmented Generation" is the only entry in the database, it will still be retrieved, even if it is not relevant to the query.
Limitations of RAG
Typically, we do not store and retrieve entire documents in the vector database. If we did, a single large document could easily exceed the context window of the LLM. If the system is configured to return the ten most relevant pieces of information, and each of them is the size of a full article, your computer quickly turns into a space heater. To prevent this, we split the information into chunks of a predefined size, such as 1,000 characters, and we try to keep the sentences and paragraphs intact.
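A deliberately naive chunker might split on paragraphs and only start a new chunk once the size limit would be exceeded. A sketch (real text splitters are considerably smarter about sentence boundaries and overlap):

```python
def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Split text into chunks of roughly max_chars, keeping paragraphs intact when possible."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        # If adding this paragraph would blow past the limit, close the current chunk.
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks
```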
However, splitting information into chunks introduces a new problem. Just as Word2Vec struggles to determine meaning from a single word, RAG often fails to understand the full context of a single chunk, especially when that chunk is extracted from the middle of a document.
Here is a problem I encountered recently. I keep a detailed work diary where I document all my professional achievements. It is extremely useful during performance reviews. However, when I ask my RAG system what I achieved at my current company, it confidently includes accomplishments from my previous jobs. Because I write this diary in the first person and also include information from other sources written in the first person, the system cannot distinguish between them. As a result, it starts attributing achievements to me that I had nothing to do with. That is how I realized something was wrong. My system was suddenly telling me about all the interesting things I supposedly did away from the computer, which is impossible since I never leave my desk.
Conclusion
RAG makes LLMs more useful by letting them retrieve information they wouldn't otherwise have access to. But it's not magic. It comes with its own challenges, from handling context properly to avoiding irrelevant results.
But as I learned firsthand, fetching information isn't the same as understanding it. That's why making RAG systems context-aware is the next big challenge, one I'll tackle in my next article.