
RAG From Scratch

Last Updated on October 5, 2024 by Editorial Team

Author(s): Barhoumi Mosbeh

Originally published on Towards AI.

I’m working as a machine learning engineer, and I frequently use Claude or ChatGPT to help me write code. However, in some cases, the model starts to repeat itself or hallucinate, especially during complex or lengthy tasks. This can happen due to limitations in the model’s context window or the nature of the prompt. When this occurs, I typically write pseudo-code to guide the model along with the prompt, and it works well in most cases. The “context window” refers to the maximum amount of text (measured in tokens) that the model can process in a single input. Exceeding this limit can lead to issues like information loss or confusion in longer tasks.


In some cases, when I feel the model starts to “forget” the information I provided, I use a simple trick: I give it the necessary information in chunks to avoid exceeding the context window. This strategy helps the model retain important details throughout the interaction. This is actually the core idea behind Retrieval-Augmented Generation (RAG) systems, with some adjustments, which will be covered in the upcoming sections.

Image from the author

Indexing

Image from the author
  • Documents: This is where we start with our collection of information sources, such as books, articles, or any other text-based knowledge.
  • Chunking: In this step, we break down the large documents into smaller, manageable pieces. This makes it easier to process and retrieve specific information later.
  • Embedding: Here, we convert each chunk of text into a numerical representation that captures its meaning. This allows computers to understand and compare the content efficiently.
  • Index: Finally, we store these numerical representations in a special structure that allows for quick and efficient searching.
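
As a rough illustration of these four steps, here is a minimal, self-contained Python sketch of the indexing stage. The embed() function is only a placeholder that returns stable pseudo-random vectors; a real pipeline would call an actual embedding model (a sentence-transformer, an embeddings API, etc.) instead.

```python
import hashlib
import numpy as np

def chunk(text: str, size: int = 200) -> list[str]:
    """Fixed-length chunking: split the text every `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(texts: list[str], dim: int = 64) -> np.ndarray:
    """Placeholder embedding: one unit-length vector per text, seeded from its hash.
    Swap in a real embedding model to get vectors that actually capture meaning."""
    vectors = []
    for text in texts:
        seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
        v = np.random.default_rng(seed).normal(size=dim)
        vectors.append(v / np.linalg.norm(v))
    return np.stack(vectors)

# Documents -> chunks -> embeddings -> index
documents = ["First source document ...", "Second source document ..."]
chunks = [piece for doc in documents for piece in chunk(doc)]
index = embed(chunks)   # one embedding per chunk, kept for searching later
```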

Chunking

I believe all stages are clear except for the chunking part. How do we split documents?

Many methods have been introduced for this purpose. Some of them are static, such as Fixed-Length Splitting, which involves dividing the text into chunks of a predetermined number of tokens or characters. Others are more semantic.
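
The chunk() helper in the indexing sketch above splits purely by length; a common refinement of fixed-length splitting is to add a small overlap between consecutive chunks so that a sentence cut at a boundary still appears intact in at least one chunk. A minimal sketch (the sizes are arbitrary example values):

```python
def fixed_length_split(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split `text` into chunks of `size` characters, overlapping by `overlap`."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

parts = fixed_length_split("A long document about RAG ... " * 50)
print(len(parts), parts[0][:40])
```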

For a more detailed explanation, please visit this article: Five Levels of Chunking Strategies in RAG.

Retrieval

Image from the author

After splitting our document into chunks, embedding, and indexing them, we won’t feed all chunks to the LLMs alongside the query. Instead, we will select the top k most relevant chunks based on the user’s question. To elaborate, we will embed the query or question and compare it with our embedded chunks to identify the k most relevant ones.
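
Here is a minimal sketch of this retrieval step, reusing chunks, index, and the placeholder embed() from the indexing sketch above. With a real embedding model, the similarity scores become semantically meaningful.

```python
import numpy as np

def retrieve(query: str, chunks: list[str], index: np.ndarray, k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query embedding."""
    q = embed([query])[0]                   # embed the user question
    scores = index @ q                      # cosine similarity (all vectors are unit length)
    top_k = np.argsort(scores)[::-1][:k]    # indices of the k highest-scoring chunks
    return [chunks[i] for i in top_k]

relevant_chunks = retrieve("What does the contract say about refunds?", chunks, index)
```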

Generation

Image from the author

This is nearly the final part of our RAG pipeline. We will now take the query with the most relevant documents and pass them as a prompt to the LLM, which will return the final answer.
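
A minimal sketch of the generation step. The OpenAI chat client is just one example and the model name is a placeholder; any chat LLM would work here.

```python
from openai import OpenAI

client = OpenAI()   # assumes the OPENAI_API_KEY environment variable is set

def generate(question: str, context_chunks: list[str]) -> str:
    """Build a prompt from the retrieved chunks and ask the LLM for the final answer."""
    context = "\n\n".join(context_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # example model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

answer = generate("What does the contract say about refunds?", relevant_chunks)
```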

This was the most basic RAG pipeline, but it comes with a limitation worth spelling out: basic RAG works best for answering questions about private data when the query is unambiguous and phrased in roughly the same terms as the documents. When the question is worded very differently from the source text, the model may fail to understand our task or retrieve the right context, which means we need to translate the query.


Query translation


Multi-Query Prompt

We start with an original question and rephrase it into several different queries. Next, we look at the relevant pieces of information for each query. Finally, we can use different methods to combine these queries and their results before sending them, along with the original question, to the model.
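
Here is a rough sketch of the multi-query idea, reusing the client and retrieve() sketches from the earlier sections: the LLM proposes a few rephrasings, we retrieve chunks for each variant, and we keep only the unique chunks before the final generation step.

```python
def multi_query_retrieve(question: str, chunks, index, n_variants: int = 3, k: int = 3) -> list[str]:
    """Retrieve chunks for the original question plus several LLM-generated rephrasings."""
    rewrite = client.chat.completions.create(
        model="gpt-4o-mini",   # example model name
        messages=[{"role": "user", "content":
                   f"Rewrite the question below in {n_variants} different ways, one per line:\n{question}"}],
    ).choices[0].message.content

    queries = [question] + [q.strip() for q in rewrite.splitlines() if q.strip()]

    unique_chunks: list[str] = []
    for q in queries:
        for c in retrieve(q, chunks, index, k=k):
            if c not in unique_chunks:      # drop duplicate chunks across queries
                unique_chunks.append(c)
    return unique_chunks
```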

RAG-Fusion

The RAG-fusion technique utilizes a method known as Reciprocal Rank Fusion (RRF). Understanding RRF is crucial, especially since we’ll revisit it when discussing Re-ranking later on.

When we employ Multi-Query strategies, we generate multiple queries and collect the top relevant chunks of information for each. After gathering these results, we review them to eliminate any duplicates, ensuring we only work with unique entries.

Now, let’s consider a situation where the scores returned for different queries are not directly comparable, or where two queries return chunks with the same score (say, both rated at 0.55). In either case, we face the challenge of determining their order when we combine them. This is where RRF becomes beneficial.

RRF helps us rank these chunks based on their relevance from each query, giving us a way to decide which chunk should take precedence in our final list. By applying RRF, we can effectively merge the results while prioritizing the most relevant chunks, leading to a more efficient and accurate representation of the information we need.
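
Here is a minimal sketch of RRF itself. Each query contributes 1 / (k + rank) for every chunk it returned, and the chunks are re-ordered by the summed score; k (commonly set to 60) softens the impact of low ranks and is unrelated to the top-k used during retrieval.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunks into one list ordered by RRF score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two queries returned overlapping chunk IDs; chunks ranked high in both lists come out first.
print(reciprocal_rank_fusion([["c1", "c2", "c3"], ["c2", "c4", "c1"]]))
```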

To explore the formula and see some examples, check these articles:

  • RAG Fusion Revolution - A Paradigm Shift in Generative AI (medium.com)
  • Reciprocal Rank Fusion (RRF) explained in 4 mins (medium.com)

Decomposition

Query decomposition is a method where we break down a complex task in a query into smaller, simpler, easier-to-solve sub-questions. Afterward, we merge their answers, either by solving the sub-questions recursively or by addressing them individually.

Image from the author

Recursively Answer

This method involves breaking a complex query down into a series of simpler, smaller questions. When the LLM answers the first question, we take both that question and its answer as the context for the next one, and we continue this way with the second, third, and so on until all the smaller questions are answered. In some cases, the answer to the last sub-question already answers the original query and we can stop there; in other cases, we take all these questions and their answers as the context to address the original main query. The related Answer Individually method instead answers each sub-question independently and then combines all the question-answer pairs to answer the original query.

Image from the author
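
Here is a rough sketch of decomposition with recursive answering, reusing the client, retrieve(), and generate() sketches from the earlier sections: the LLM first proposes sub-questions, each answer is carried forward as context for the next sub-question, and the original question is finally answered from all the accumulated question-answer pairs.

```python
def answer_recursively(question: str, chunks, index) -> str:
    """Decompose the question, answer the sub-questions in sequence, then answer the original."""
    decomposition = client.chat.completions.create(
        model="gpt-4o-mini",   # example model name
        messages=[{"role": "user", "content":
                   f"Break this question into 3 simpler sub-questions, one per line:\n{question}"}],
    ).choices[0].message.content

    qa_pairs: list[str] = []
    for sub_q in (q for q in decomposition.splitlines() if q.strip()):
        docs = retrieve(sub_q, chunks, index)
        answer = generate(sub_q, context_chunks=qa_pairs + docs)   # earlier answers feed forward
        qa_pairs.append(f"Q: {sub_q}\nA: {answer}")

    # Final step: answer the original question from all the sub-answers.
    return generate(question, context_chunks=qa_pairs)
```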

Routing

Building a large RAG application often won’t focus on just a single fixed task or work with only one uploaded file. For example, for a high school student studying for exams, you might create a RAG application that helps them across all subjects, not just one. The application then needs a router to determine the subject of each query: if the student asks a physics question, the router directs the RAG system to search the physics textbooks the student has uploaded, ignoring the other subjects. Based on the user’s query, the application chooses the most suitable route or chain to get the best result.

Now, let’s talk about the methods we can use for routing:

Using a Custom Function:

Simply create custom chains for each subject. For example, if the subject is mathematics, you can design a chain that tells the model it’s a math teacher and should answer the question based on its knowledge of math. Similarly, you’d have another chain for physics, and so on. A simple function would check the query for specific words (e.g., “math”) and select the corresponding chain for that subject.

Machine Learning Classifier:

If you have a lot of questions from different subjects like physics, math, etc., and you want the system to determine which subject is being asked about, you can train a classifier. This classifier will categorize the query into the correct subject, and once classified, the system can route the query to the relevant textbooks or chain associated with that subject.

Semantic Similarity:

More advanced functions can be used beyond simple keyword matching. For example, you could create an embedding (a numerical representation) for each chain. When a query comes in, its embedding is compared to the embeddings of each chain, and the chain with the closest match is selected (a short sketch of this appears after this section).

Hybrid Approach:

As the name suggests, you can use a hybrid approach that combines multiple methods mentioned above to achieve better and more accurate routing.

The importance of routing lies in ensuring that the context provided to the LLM is truly relevant to the question asked. This prevents the model from using unrelated contexts from a different subject. Effective routing also improves the indexing process by ensuring that we are searching in the correct chunks, saving time and leading to better results.

Image from the author
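
To make the semantic-similarity option concrete, here is a minimal sketch that reuses the placeholder embed() from the indexing section. Each subject (route) gets a short description, and the query is sent to the subject whose description embedding it matches best. With the placeholder embeddings the match is essentially arbitrary, so a real embedding model is needed for meaningful routing.

```python
import numpy as np

routes = {
    "math":    "algebra, calculus, equations, geometry, probability",
    "physics": "forces, motion, energy, electricity, waves, thermodynamics",
}
route_names = list(routes)
route_vectors = embed(list(routes.values()))   # one embedding per route description

def route(query: str) -> str:
    """Pick the route whose description is most similar to the query."""
    q = embed([query])[0]
    scores = route_vectors @ q                 # cosine similarity with each route
    return route_names[int(np.argmax(scores))]

subject = route("How do I compute the derivative of x squared?")
# With a real embedding model, this query would be routed to the "math" chain.
```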

Query Construction

When constructing a query, most of the files in the data we’re using likely have metadata, right? So what is metadata? Metadata is simply data about data.

For example, if we have a video, the data would be the content of the video itself. However, the metadata could be various things like the source of the video, its production date, duration, or other details.

Now, if we have a query like:

Which video is longer, A or B?

The first solution might be that the model looks at the number of words in the transcript of the first video, then compares it to the number of words in the second video’s transcript. The one with more words would be longer, right?

Not exactly. One video could have more silence than the other, so even if it has fewer words, it could still be longer. But, even if the video with more words is longer, what’s the faster way to get the answer? Is it for the model to analyze the entire video content, or just to grab the duration from the metadata and compare the lengths?

So, what do you think about using metadata to improve chunking? It’s possible that the query written by the user could be answered directly through the metadata, and from that, we can provide an answer much more easily.

Image from the author
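
A tiny sketch of the idea: if each video carries metadata such as its duration, the "which video is longer" question can be answered by comparing a metadata field directly, without embedding or reading any transcript. In practice, a query-construction step would have the LLM turn the user's question into such a structured filter or comparison.

```python
videos = [
    {"title": "A", "duration_seconds": 754,  "transcript": "..."},
    {"title": "B", "duration_seconds": 1292, "transcript": "..."},
]

def longer_video(catalog: list[dict]) -> str:
    """Compare the duration metadata directly instead of analyzing the content."""
    return max(catalog, key=lambda v: v["duration_seconds"])["title"]

print(longer_video(videos))   # -> "B"
```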

That’s it for this article. I may share another one about more advanced concepts in the upcoming days!

Resources:

  • Retrieval-Augmented Generation (RAG) from basics to advanced (medium.com)
  • LangChain (www.langchain.com)
  • freeCodeCamp.org (www.freecodecamp.org)
  • Advanced RAG Techniques: Unlocking the Next Level (medium.com)
