
RAG From Scratch

Last Updated on October 5, 2024 by Editorial Team

Author(s): Barhoumi Mosbeh

Originally published on Towards AI.

I’m working as a machine learning engineer, and I frequently use Claude or ChatGPT to help me write code. However, in some cases, the model starts to repeat itself or hallucinate, especially during complex or lengthy tasks. This can happen due to limitations in the model’s context window or the nature of the prompt. When this occurs, I typically write pseudo-code to guide the model along with the prompt, and it works well in most cases. The “context window” refers to the maximum amount of text (measured in tokens) that the model can process in a single input. Exceeding this limit can lead to issues like information loss or confusion in longer tasks.


In some cases, when I feel the model starts to “forget” the information I provided, I use a simple trick: I give it the necessary information in chunks to avoid exceeding the context window. This strategy helps the model retain important details throughout the interaction. This is actually the core idea behind Retrieval-Augmented Generation (RAG) systems, with some adjustments, which will be covered in the upcoming sections.

Image from the author

Indexing

Image from the author
  • Documents: This is where we start with our collection of information sources, such as books, articles, or any other text-based knowledge.
  • Chunking: In this step, we break down the large documents into smaller, manageable pieces. This makes it easier to process and retrieve specific information later.
  • Embedding: Here, we convert each chunk of text into a numerical representation that captures its meaning. This allows computers to understand and compare the content efficiently.
  • Index: Finally, we store these numerical representations in a special structure that allows for quick and efficient searching.
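
As a rough illustration of these four steps, here is a minimal, self-contained Python sketch of the indexing stage. The embed() function is only a placeholder that returns stable pseudo-random vectors; a real pipeline would call an actual embedding model (a sentence-transformer, an embeddings API, etc.) instead.

```python
import hashlib
import numpy as np

def chunk(text: str, size: int = 200) -> list[str]:
    """Fixed-length chunking: split the text every `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(texts: list[str], dim: int = 64) -> np.ndarray:
    """Placeholder embedding: one unit-length vector per text, seeded from its hash.
    Swap in a real embedding model to get vectors that actually capture meaning."""
    vectors = []
    for text in texts:
        seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
        v = np.random.default_rng(seed).normal(size=dim)
        vectors.append(v / np.linalg.norm(v))
    return np.stack(vectors)

# Documents -> chunks -> embeddings -> index
documents = ["First source document ...", "Second source document ..."]
chunks = [piece for doc in documents for piece in chunk(doc)]
index = embed(chunks)   # one embedding per chunk, kept for searching later
```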

Chunking

I believe all stages are clear except for the chunking part. How do we split documents?

Many methods have been introduced for this purpose. Some of them are static, such as Fixed-Length Splitting, which involves dividing the text into chunks of a predetermined number of tokens or characters. Others are more semantic.
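
The chunk() helper in the indexing sketch above splits purely by length; a common refinement of fixed-length splitting is to add a small overlap between consecutive chunks so that a sentence cut at a boundary still appears intact in at least one chunk. A minimal sketch (the sizes are arbitrary example values):

```python
def fixed_length_split(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split `text` into chunks of `size` characters, overlapping by `overlap`."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

parts = fixed_length_split("A long document about RAG ... " * 50)
print(len(parts), parts[0][:40])
```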

For a more detailed explanation, please visit this article: Five Levels of Chunking Strategies in RAG.

Retrieval

Image from the author

After splitting our document into chunks, embedding, and indexing them, we won’t feed all chunks to the LLMs alongside the query. Instead, we will select the top k most relevant chunks based on the user’s question. To elaborate, we will embed the query or question and compare it with our embedded chunks to identify the k most relevant ones.
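
Here is a minimal sketch of this retrieval step, reusing chunks, index, and the placeholder embed() from the indexing sketch above. With a real embedding model, the similarity scores become semantically meaningful.

```python
import numpy as np

def retrieve(query: str, chunks: list[str], index: np.ndarray, k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query embedding."""
    q = embed([query])[0]                   # embed the user question
    scores = index @ q                      # cosine similarity (all vectors are unit length)
    top_k = np.argsort(scores)[::-1][:k]    # indices of the k highest-scoring chunks
    return [chunks[i] for i in top_k]

relevant_chunks = retrieve("What does the contract say about refunds?", chunks, index)
```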

Generation

Image from the author

This is nearly the final part of our RAG pipeline. We will now take the query with the most relevant documents and pass them as a prompt to the LLM, which will return the final answer.
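
A minimal sketch of the generation step. The OpenAI chat client is just one example and the model name is a placeholder; any chat LLM would work here.

```python
from openai import OpenAI

client = OpenAI()   # assumes the OPENAI_API_KEY environment variable is set

def generate(question: str, context_chunks: list[str]) -> str:
    """Build a prompt from the retrieved chunks and ask the LLM for the final answer."""
    context = "\n\n".join(context_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # example model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

answer = generate("What does the contract say about refunds?", relevant_chunks)
```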

This was the most basic RAG pipeline, but it comes with a limitation worth spelling out: basic RAG works best for answering questions about private data when the query is unambiguous and phrased in roughly the same terms as the documents. When the question is worded very differently from the source text, the model may fail to understand our task or retrieve the right context, which means we need to translate the query.


Query translation


Multi-Query Prompt

We start with an original question and rephrase it into several different queries. Next, we look at the relevant pieces of information for each query. Finally, we can use different methods to combine these queries and their results before sending them, along with the original question, to the model.
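
Here is a rough sketch of the multi-query idea, reusing the client and retrieve() sketches from the earlier sections: the LLM proposes a few rephrasings, we retrieve chunks for each variant, and we keep only the unique chunks before the final generation step.

```python
def multi_query_retrieve(question: str, chunks, index, n_variants: int = 3, k: int = 3) -> list[str]:
    """Retrieve chunks for the original question plus several LLM-generated rephrasings."""
    rewrite = client.chat.completions.create(
        model="gpt-4o-mini",   # example model name
        messages=[{"role": "user", "content":
                   f"Rewrite the question below in {n_variants} different ways, one per line:\n{question}"}],
    ).choices[0].message.content

    queries = [question] + [q.strip() for q in rewrite.splitlines() if q.strip()]

    unique_chunks: list[str] = []
    for q in queries:
        for c in retrieve(q, chunks, index, k=k):
            if c not in unique_chunks:      # drop duplicate chunks across queries
                unique_chunks.append(c)
    return unique_chunks
```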

RAG-Fusion

The RAG-fusion technique utilizes a method known as Reciprocal Rank Fusion (RRF). Understanding RRF is crucial, especially since we’ll revisit it when discussing Re-ranking later on.

When we employ Multi-Query strategies, we generate multiple queries and collect the top relevant chunks of information for each. After gathering these results, we review them to eliminate any duplicates, ensuring we only work with unique entries.

Now, let’s consider a situation where the scores returned for different queries are not directly comparable, or where two queries return chunks with the same score (say, both rated at 0.55). In either case, we face the challenge of determining their order when we combine them. This is where RRF becomes beneficial.

RRF helps us rank these chunks based on their relevance from each query, giving us a way to decide which chunk should take precedence in our final list. By applying RRF, we can effectively merge the results while prioritizing the most relevant chunks, leading to a more efficient and accurate representation of the information we need.
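
Here is a minimal sketch of RRF itself. Each query contributes 1 / (k + rank) for every chunk it returned, and the chunks are re-ordered by the summed score; k (commonly set to 60) softens the impact of low ranks and is unrelated to the top-k used during retrieval.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunks into one list ordered by RRF score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two queries returned overlapping chunk IDs; chunks ranked high in both lists come out first.
print(reciprocal_rank_fusion([["c1", "c2", "c3"], ["c2", "c4", "c1"]]))
```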

To explore the formula and see some examples, check these articles:

  • RAG Fusion Revolution - A Paradigm Shift in Generative AI (medium.com)
  • Reciprocal Rank Fusion (RRF) explained in 4 mins (medium.com)

Decomposition

Query decomposition is a method where we break down a complex task in a query into smaller, simpler, easier-to-solve sub-questions. Afterward, we merge their answers, either by solving the sub-questions recursively or by addressing them individually.

Image from the author

Recursively Answer

This method involves breaking a complex query down into a series of simpler, smaller questions. When the LLM answers the first question, we take both that question and its answer as the context for the next one, and we continue this way with the second, third, and so on until all the smaller questions are answered. In some cases, the answer to the last sub-question already answers the original query and we can stop there; in other cases, we take all these questions and their answers as the context to address the original main query. The related Answer Individually method instead answers each sub-question independently and then combines all the question-answer pairs to answer the original query.

Image from the author
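
Here is a rough sketch of decomposition with recursive answering, reusing the client, retrieve(), and generate() sketches from the earlier sections: the LLM first proposes sub-questions, each answer is carried forward as context for the next sub-question, and the original question is finally answered from all the accumulated question-answer pairs.

```python
def answer_recursively(question: str, chunks, index) -> str:
    """Decompose the question, answer the sub-questions in sequence, then answer the original."""
    decomposition = client.chat.completions.create(
        model="gpt-4o-mini",   # example model name
        messages=[{"role": "user", "content":
                   f"Break this question into 3 simpler sub-questions, one per line:\n{question}"}],
    ).choices[0].message.content

    qa_pairs: list[str] = []
    for sub_q in (q for q in decomposition.splitlines() if q.strip()):
        docs = retrieve(sub_q, chunks, index)
        answer = generate(sub_q, context_chunks=qa_pairs + docs)   # earlier answers feed forward
        qa_pairs.append(f"Q: {sub_q}\nA: {answer}")

    # Final step: answer the original question from all the sub-answers.
    return generate(question, context_chunks=qa_pairs)
```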

Routing

Building a large RAG application often won’t focus on just a single fixed task or work with only one uploaded file. For example, for a high school student studying for exams, you might create a RAG application that helps them across all subjects, not just one. The application then needs a router to determine the subject of each query: if the student asks a physics question, the router directs the RAG system to search the physics textbooks the student has uploaded, ignoring the other subjects. Based on the user’s query, the application chooses the most suitable route or chain to get the best result.

Now, let’s talk about the methods we can use for routing:

Using a Custom Function:

Simply create custom chains for each subject. For example, if the subject is mathematics, you can design a chain that tells the model it’s a math teacher and should answer the question based on its knowledge of math. Similarly, you’d have another chain for physics, and so on. A simple function would check the query for specific words (e.g., “math”) and select the corresponding chain for that subject.

Machine Learning Classifier:

If you have a lot of questions from different subjects like physics, math, etc., and you want the system to determine which subject is being asked about, you can train a classifier. This classifier will categorize the query into the correct subject, and once classified, the system can route the query to the relevant textbooks or chain associated with that subject.

Semantic Similarity:

More advanced functions can be used beyond simple keyword matching. For example, you could create an embedding (a numerical representation) for each chain. When a query comes in, its embedding is compared to the embeddings of each chain, and the chain with the closest match is selected (a short sketch of this appears after this section).

Hybrid Approach:

As the name suggests, you can use a hybrid approach that combines multiple methods mentioned above to achieve better and more accurate routing.

The importance of routing lies in ensuring that the context provided to the LLM is truly relevant to the question asked. This prevents the model from using unrelated contexts from a different subject. Effective routing also improves the indexing process by ensuring that we are searching in the correct chunks, saving time and leading to better results.

Image from the author
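
To make the semantic-similarity option concrete, here is a minimal sketch that reuses the placeholder embed() from the indexing section. Each subject (route) gets a short description, and the query is sent to the subject whose description embedding it matches best. With the placeholder embeddings the match is essentially arbitrary, so a real embedding model is needed for meaningful routing.

```python
import numpy as np

routes = {
    "math":    "algebra, calculus, equations, geometry, probability",
    "physics": "forces, motion, energy, electricity, waves, thermodynamics",
}
route_names = list(routes)
route_vectors = embed(list(routes.values()))   # one embedding per route description

def route(query: str) -> str:
    """Pick the route whose description is most similar to the query."""
    q = embed([query])[0]
    scores = route_vectors @ q                 # cosine similarity with each route
    return route_names[int(np.argmax(scores))]

subject = route("How do I compute the derivative of x squared?")
# With a real embedding model, this query would be routed to the "math" chain.
```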

Query Construction

When constructing a query, most of the files in the data we’re using likely have metadata, right? So what is metadata? Metadata is simply data about data.

For example, if we have a video, the data would be the content of the video itself. However, the metadata could be various things like the source of the video, its production date, duration, or other details.

Now, if we have a query like:

Which video is longer, A or B?

The first solution might be that the model looks at the number of words in the transcript of the first video, then compares it to the number of words in the second video’s transcript. The one with more words would be longer, right?

Not exactly. One video could have more silence than the other, so even if it has fewer words, it could still be longer. But, even if the video with more words is longer, what’s the faster way to get the answer? Is it for the model to analyze the entire video content, or just to grab the duration from the metadata and compare the lengths?

So, what do you think about using metadata to improve chunking? It’s possible that the query written by the user could be answered directly through the metadata, and from that, we can provide an answer much more easily.

Image from the author
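
A tiny sketch of the idea: if each video carries metadata such as its duration, the "which video is longer" question can be answered by comparing a metadata field directly, without embedding or reading any transcript. In practice, a query-construction step would have the LLM turn the user's question into such a structured filter or comparison.

```python
videos = [
    {"title": "A", "duration_seconds": 754,  "transcript": "..."},
    {"title": "B", "duration_seconds": 1292, "transcript": "..."},
]

def longer_video(catalog: list[dict]) -> str:
    """Compare the duration metadata directly instead of analyzing the content."""
    return max(catalog, key=lambda v: v["duration_seconds"])["title"]

print(longer_video(videos))   # -> "B"
```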

That’s it for this article. I may share another one about more advanced concepts in the upcoming days!

Resources:

  • Retrieval-Augmented Generation (RAG) from basics to advanced (medium.com)
  • LangChain (www.langchain.com)
  • freeCodeCamp.org (www.freecodecamp.org)
  • Advanced RAG Techniques: Unlocking the Next Level (medium.com)
