RAG From Scratch
Last Updated on October 5, 2024 by Editorial Team
Author(s): Barhoumi Mosbeh
Originally published on Towards AI.
I'm working as a machine learning engineer, and I frequently use Claude or ChatGPT to help me write code. However, in some cases, the model starts to repeat itself or hallucinate, especially during complex or lengthy tasks. This can happen due to limitations in the model's context window or the nature of the prompt. When this occurs, I typically write pseudo-code to guide the model along with the prompt, and it works well in most cases. The "context window" refers to the maximum amount of text (measured in tokens) that the model can process in a single input. Exceeding this limit can lead to issues like information loss or confusion in longer tasks.
In some cases, when I feel the model starts to "forget" the information I provided, I use a simple trick: I give it the necessary information in chunks to avoid exceeding the context window. This strategy helps the model retain important details throughout the interaction. This is actually the core idea behind Retrieval-Augmented Generation (RAG) systems, with some adjustments, which will be covered in the upcoming sections.
Indexing
- Documents: This is where we start with our collection of information sources, such as books, articles, or any other text-based knowledge.
- Chunking: In this step, we break down the large documents into smaller, manageable pieces. This makes it easier to process and retrieve specific information later.
- Embedding: Here, we convert each chunk of text into a numerical representation that captures its meaning. This allows computers to understand and compare the content efficiently.
- Index: Finally, we store these numerical representations in a special structure that allows for quick and efficient searching.
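To make these steps concrete, here is a minimal sketch of an indexing pipeline in plain Python. The `embed_text` function is a placeholder for whatever embedding model you use (a sentence-transformers model, an embeddings API, and so on), and the "index" here is just an in-memory list; a real system would store the vectors in a vector database.

```python
import numpy as np

def embed_text(text: str) -> np.ndarray:
    """Placeholder: swap in a real embedding model or embeddings API call."""
    raise NotImplementedError

def chunk(document: str, chunk_size: int = 500) -> list[str]:
    """Naive fixed-length chunking by characters (splitting strategies come next)."""
    return [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]

def build_index(documents: list[str]) -> list[dict]:
    """Chunk every document, embed each chunk, and store text and vector together."""
    index = []
    for doc in documents:
        for piece in chunk(doc):
            index.append({"text": piece, "embedding": embed_text(piece)})
    return index
```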
Chunking
I believe all stages are clear except for the chunking part. How do we split documents?
Many methods have been introduced for this purpose. Some of them are static, such as Fixed-Length Splitting, which involves dividing the text into chunks of a predetermined number of tokens or characters. Others are more semantic.
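As an illustration, here is a minimal sketch of fixed-length splitting with a small overlap; the 500-character size and 50-character overlap are arbitrary choices, and a semantic chunker would split on meaning instead of a fixed count.

```python
def fixed_length_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks, overlapping them slightly
    so sentences cut at a boundary still appear intact in one of the chunks."""
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]
```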
For a more detailed explanation, please visit this article: Five Levels of Chunking Strategies in RAG.
Retrieval
After splitting our document into chunks, embedding, and indexing them, we won't feed all chunks to the LLM alongside the query. Instead, we will select the top k most relevant chunks based on the user's question. To elaborate, we will embed the query or question and compare it with our embedded chunks to identify the k most relevant ones.
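In code, top-k retrieval is just a similarity search between the query embedding and the chunk embeddings. A minimal sketch using cosine similarity over the in-memory index from the earlier indexing sketch (`embed_text` is still a placeholder):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, index: list[dict], k: int = 3) -> list[str]:
    """Embed the query, score every chunk against it, and return the k best texts."""
    q = embed_text(query)  # same placeholder embedding function as in the indexing sketch
    ranked = sorted(index, key=lambda item: cosine_similarity(q, item["embedding"]), reverse=True)
    return [item["text"] for item in ranked[:k]]
```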
Generation
This is nearly the final part of our RAG pipeline. We will now take the query with the most relevant documents and pass them as a prompt to the LLM, which will return the final answer.
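In practice, the generation step is just stuffing the retrieved chunks and the question into a prompt. A minimal sketch, where `call_llm` is a placeholder for whichever chat API you use and the prompt wording is only an example:

```python
def call_llm(prompt: str) -> str:
    """Placeholder: replace with a call to your LLM provider's chat API."""
    raise NotImplementedError

def generate_answer(question: str, retrieved_chunks: list[str]) -> str:
    """Build a grounded prompt from the retrieved chunks and ask the LLM."""
    context = "\n\n".join(retrieved_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return call_llm(prompt)
```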
This was the most basic RAG pipeline, but it comes with a limitation: RAG works well for answering user queries about private data when the query itself isn't ambiguous and is phrased similarly to the documents. In practice, a question can be worded very differently from the text that contains the answer, so the model may not understand our task or retrieve the right context, which means we need to translate the query.
Query translation
Multi-Query Prompt
We start with an original question and rephrase it into several different queries. Next, we retrieve the relevant chunks for each query. Finally, we can use different methods to combine these queries and their results before sending them, along with the original question, to the model.
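Here is a minimal sketch of the idea, reusing the placeholder `call_llm` and `retrieve` functions from the earlier sketches; the rewrite prompt and the number of rephrasings are arbitrary choices.

```python
def multi_query_retrieve(question: str, index: list[dict], n_variants: int = 3, k: int = 3) -> list[str]:
    """Ask the LLM for alternative phrasings, retrieve chunks for each, and deduplicate."""
    prompt = f"Rewrite the following question in {n_variants} different ways, one per line:\n{question}"
    variants = [question] + [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]
    unique_chunks: list[str] = []
    for q in variants:
        for chunk_text in retrieve(q, index, k=k):
            if chunk_text not in unique_chunks:  # keep only unique entries across queries
                unique_chunks.append(chunk_text)
    return unique_chunks
```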
RAG-Fusion
The RAG-Fusion technique utilizes a method known as Reciprocal Rank Fusion (RRF). Understanding RRF is crucial, especially since we'll revisit it when discussing Re-ranking later on.
When we employ Multi-Query strategies, we generate multiple queries and collect the top relevant chunks of information for each. After gathering these results, we review them to eliminate any duplicates, ensuring we only work with unique entries.
Now, let's consider a situation where two queries return chunks with the same score. For instance, if both queries produce a chunk rated at 0.55, we face the challenge of determining their order when we combine them. This is where RRF becomes beneficial.
RRF helps us rank these chunks based on their relevance from each query, giving us a way to decide which chunk should take precedence in our final list. By applying RRF, we can effectively merge the results while prioritizing the most relevant chunks, leading to a more efficient and accurate representation of the information we need.
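Concretely, RRF gives each chunk a score of 1 / (rank + c) in every ranked list it appears in and sums these contributions, where c (commonly around 60) damps the influence of lower ranks. A minimal sketch, assuming each query's results are already ordered lists of chunk texts:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], c: int = 60) -> list[str]:
    """Fuse several ranked lists: each chunk earns 1 / (rank + c) per list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, chunk_text in enumerate(ranked, start=1):
            scores[chunk_text] += 1.0 / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=lambda text: scores[text], reverse=True)
```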
To explore the formula and see some examples, check these articles:
- RAG Fusion Revolution – A Paradigm Shift in Generative AI (medium.com)
- Reciprocal Rank Fusion (RRF) explained in 4 mins (medium.com)
Decomposition
Query decomposition is a method where we break down a complex task in a query into smaller, simpler, easier-to-solve subtasks. We then answer these subtasks, either recursively or individually, and merge the answers.
Recursively Answer
This method involves breaking down a complex query into a series of simpler, smaller questions. When an LLM (Large Language Model) answers the first question, we take both the question and its answer as the context for the next question. We continue this process with the second, third, and so on until all the smaller questions are answered. In some cases, we can stop here; in others, we take all these questions and their answers as the context to address the original main query. The alternative, the Answer Individually method, answers each sub-question independently, without carrying the previous answers forward, and then combines all the answers to address the original query.
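A minimal sketch of the recursive approach, again reusing the placeholder `call_llm` and `retrieve` functions from earlier; the decomposition prompt and the number of sub-questions are assumptions made for illustration.

```python
def decompose(question: str, n: int = 3) -> list[str]:
    """Ask the LLM to split a complex question into simpler sub-questions."""
    prompt = f"Break this question into {n} simpler sub-questions, one per line:\n{question}"
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

def answer_recursively(question: str, index: list[dict]) -> str:
    """Answer sub-questions in sequence, carrying earlier Q&A pairs forward as context."""
    qa_context = ""
    for sub_q in decompose(question):
        chunks = "\n".join(retrieve(sub_q, index))
        prompt = (
            f"Previous questions and answers:\n{qa_context}\n\n"
            f"Context:\n{chunks}\n\nQuestion: {sub_q}\nAnswer:"
        )
        qa_context += f"\nQ: {sub_q}\nA: {call_llm(prompt)}"
    # Final step: use all sub-questions and answers to address the original query.
    return call_llm(f"Using these findings:\n{qa_context}\n\nAnswer the original question: {question}")
```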
Routing
Building a large RAG application often won't just focus on a single fixed task or work with only one uploaded file. For example, if you have a high school student studying for exams, you might create an RAG application that helps them across all subjects, not just one. However, the application needs a router to determine the subject of each query. For example, if the student asks a physics question, the router would direct the RAG to search in the physics textbooks the student has uploaded, ignoring other subjects. Based on the user's query, the application will choose the most suitable route or chain to get the best result.
Now, letβs talk about the methods we can use for routing:
Using a Custom Function:
Simply create custom chains for each subject. For example, if the subject is mathematics, you can design a chain that tells the model it's a math teacher and should answer the question based on its knowledge of math. Similarly, you'd have another chain for physics, and so on. A simple function would check the query for specific words (e.g., "math") and select the corresponding chain for that subject.
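A minimal keyword-based router might look like the sketch below; the subject keywords and the fallback route are made up for illustration.

```python
SUBJECT_KEYWORDS = {
    "math": ["math", "algebra", "equation", "geometry"],
    "physics": ["physics", "force", "velocity", "energy"],
}

def route_by_keywords(query: str) -> str:
    """Pick a subject chain by looking for telltale words in the query."""
    lowered = query.lower()
    for subject, keywords in SUBJECT_KEYWORDS.items():
        if any(word in lowered for word in keywords):
            return subject
    return "general"  # fallback chain when nothing matches
```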
Machine Learning Classifier:
If you have a lot of questions from different subjects like physics, math, etc., and you want the system to determine which subject is being asked about, you can train a classifier. This classifier will categorize the query into the correct subject, and once classified, the system can route the query to the relevant textbooks or chain associated with that subject.
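As a sketch with scikit-learn, you could train a tiny text classifier on labelled example questions and use its prediction as the route; the training examples below are invented and far too few for a real router.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled data; a real router would need many more examples per subject.
examples = [
    "What is the derivative of x^2?",
    "State Newton's second law.",
    "Solve 2x + 3 = 7.",
    "What is kinetic energy?",
]
labels = ["math", "physics", "math", "physics"]

router = make_pipeline(TfidfVectorizer(), LogisticRegression())
router.fit(examples, labels)

print(router.predict(["Solve the equation 5x = 20"]))  # should route to 'math' on this toy data
```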
Semantic Similarity:
More advanced functions can be used beyond simple keyword matching. For example, you could create an embedding (a numerical representation) for each chain. When a query comes in, its embedding is compared to the embeddings of each chain, and the chain with the closest match is selected.
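A minimal sketch of semantic routing, reusing the placeholder `embed_text` and the `cosine_similarity` helper from the retrieval sketch; the route descriptions are invented for illustration.

```python
ROUTE_DESCRIPTIONS = {
    "math": "algebra, geometry, calculus, equations and proofs",
    "physics": "mechanics, electricity, optics, forces and energy",
}

def route_by_similarity(query: str) -> str:
    """Embed the query and each route description, then pick the closest route."""
    q = embed_text(query)
    return max(
        ROUTE_DESCRIPTIONS,
        key=lambda route: cosine_similarity(q, embed_text(ROUTE_DESCRIPTIONS[route])),
    )
```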
Hybrid Approach:
As the name suggests, you can use a hybrid approach that combines multiple methods mentioned above to achieve better and more accurate routing.
The importance of routing lies in ensuring that the context provided to the LLM is truly relevant to the question asked. This prevents the model from using unrelated contexts from a different subject. Effective routing also improves the indexing process by ensuring that we are searching in the correct chunks, saving time and leading to better results.
Query Construction
When constructing a query, most of the files in the data we're using likely have metadata, right? So what is metadata? Metadata is simply data about data.
For example, if we have a video, the data would be the content of the video itself. However, the metadata could be various things like the source of the video, its production date, duration, or other details.
Now, if we have a query like:
Which video is longer, A or B?
The first solution might be that the model looks at the number of words in the transcript of the first video, then compares it to the number of words in the second video's transcript. The one with more words would be longer, right?
Not exactly. One video could have more silence than the other, so even if it has fewer words, it could still be longer. But, even if the video with more words is longer, what's the faster way to get the answer? Is it for the model to analyze the entire video content, or just to grab the duration from the metadata and compare the lengths?
So, what do you think about using metadata to improve chunking and retrieval? It's possible that the query written by the user could be answered directly through the metadata, and from that, we can provide an answer much more easily.
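As a minimal sketch of the idea: store metadata alongside each item so that a question like "which video is longer?" can be answered by reading the duration field directly, without touching the transcripts. The field names and values are invented for illustration.

```python
# Each indexed item carries metadata next to its content.
videos = [
    {"title": "A", "transcript": "...", "metadata": {"duration_seconds": 640, "source": "upload"}},
    {"title": "B", "transcript": "...", "metadata": {"duration_seconds": 580, "source": "upload"}},
]

def longer_video(items: list[dict]) -> str:
    """Answer the duration comparison from metadata alone."""
    longest = max(items, key=lambda v: v["metadata"]["duration_seconds"])
    return longest["title"]

print(longer_video(videos))  # -> "A"
```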
That's it for this article. I may share another one about more advanced concepts in the upcoming days!
Resources:
- Retrieval-Augmented Generation (RAG) from basics to advanced (medium.com)
- LangChain (www.langchain.com)
- freeCodeCamp.org (www.freecodecamp.org)
- Advanced RAG Techniques: Unlocking the Next Level (medium.com)