Another Few Tips for Better Results from LLM RAG Solutions
Author(s): Dmitry Malishev
Originally published on Towards AI.
A structured, thoughtful response from an LLM built on well-selected data from RAG is a promising technique. And it can be even better!
By now, I've completed several projects built around LLMs with RAG: article analyzers, knowledge-based Q&A bots, and insight generators. Along the way, I had a chance to try many frameworks, models, and pipelines with an uncountable number of parameters, settings, and other tunings capable of turning these pieces of code into a real magic spell. This article is a compilation of my notes on building a better-performing LLM RAG pipeline, in the hope of helping other engineers with their experiments.
Before digging into the details, let me recall a general LLM RAG pipeline:
As shown in the diagram, we deal with sequential steps that turn a user query into an LLM response. In a real project, any step can contain additional pre- and post-processing routines.
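To make these steps concrete, here is the kind of minimal pipeline the frameworks give you out of the box. This is only a sketch, assuming a recent LlamaIndex (0.10+) package layout, a local ./data folder with your documents, and an embedding model and LLM already configured (by default, an OpenAI API key); adjust the names to your setup.

```python
# Minimal RAG pipeline sketch with LlamaIndex (assumes the 0.10+ package layout
# and a ./data folder containing the knowledge-base documents).
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()      # load and parse documents
index = VectorStoreIndex.from_documents(documents)         # chunk, embed, and index them

query_engine = index.as_query_engine(similarity_top_k=4)   # retrieve 4 chunks per query
response = query_engine.query("What does the knowledge base say about topic X?")
print(response)
```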
All steps are currently covered by well-known third-party frameworks. For the RAG part, it's either LlamaIndex or LangChain. For the LLM part, it's typically a local setup based on the open Llama architecture or one of the cloud-based GPT services like ChatGPT. You may find dozens of additional details just by googling any of them.
I'm not going into detail on how the pipeline works, as I'm targeting a more experienced audience. But if you're only starting, I recommend checking out the LlamaIndex and LangChain beginner guides. The basic workflows are easy to understand. There's no need for deep knowledge of large language models and data processing algorithms.
All preparations are complete, and it's time to get straight to the tips!
Context Length
This parameter relates to both the LLM and RAG parts: it defines the final length of the prompt containing the query and the context. It is often set to the maximum context length supported by the chosen LLM, with the intuition that "the bigger, the better". Meanwhile, I found several side effects of a big context value:
- If the required information is contained in only one contiguous part of the knowledge base, a large context forces the RAG part to pad the prompt with many irrelevant chunks, which can push the LLM toward wrong answers.
- There's the so-called Lost-in-the-Middle problem, where the LLM pays attention only to the chunks at the beginning and at the end of the context and skips the middle.
- Most of the time, a longer context means more VRAM demand for the pipeline and, consequently, more computation and processing time.
And one last argument! There are LLMs with really big context sizes (64k, 128k, …) that may fit the entire knowledge base and make the RAG part unnecessary. It might be interesting to experiment with, and it may work out in certain cases, but as long as RAG frameworks are still on the scene, a big context doesn't solve all the problems.
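If you decide to keep the prompt short, one simple approach is to fill the context against an explicit token budget instead of the model maximum. Below is a minimal sketch of that idea in plain Python; it uses a rough characters-per-token estimate instead of a real tokenizer, so treat the numbers as placeholders.

```python
# Sketch: cap the prompt at a fixed token budget rather than the model maximum.
# A rough 4-characters-per-token estimate stands in for a real tokenizer.

def build_context(chunks: list[str], max_tokens: int = 2000) -> str:
    budget = max_tokens * 4          # approximate character budget
    selected, used = [], 0
    for chunk in chunks:             # chunks are assumed ordered by relevance
        if used + len(chunk) > budget:
            break
        selected.append(chunk)
        used += len(chunk)
    return "\n\n".join(selected)
```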
Text Cleaning
Depending on the task, you may decide to strip the context of all unnecessary symbols (extra spaces, newlines, etc.). It sounds especially appealing when the knowledge base is extracted from text formats like DOCX and PDF. Be careful at this step: I discovered that some LLMs pay a lot of attention to how information is divided into sections, paragraphs, and bulleted lists, and perform much worse without these special symbols, which, by the way, don't take many bytes to store.
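Here is a sketch of such a conservative cleaner in plain Python: it collapses runs of spaces and extra blank lines but keeps single newlines and bullet markers, so the document structure survives.

```python
import re

def clean_text(text: str) -> str:
    """Collapse noisy whitespace while preserving sections, paragraphs, and lists."""
    text = re.sub(r"[ \t]+", " ", text)        # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)     # keep at most one blank line in a row
    lines = (line.rstrip() for line in text.splitlines())
    return "\n".join(lines).strip()
```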
Two Parts, Two Queries
By default, the input query goes to both the RAG and LLM parts unaltered: vector searching, composing a prompt, and sending it to the LLM. These parts do very different jobs, yet we expect the same input to be good enough for both. And this is not mandatory! It's possible to craft an optimized query variant for each part of the pipeline. This is particularly useful when you work with fixed queries to find certain types of answers in the knowledge base. So, create the two versions manually whenever possible, or use an LLM to compose the modified versions.
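As a sketch, here is what two variants of a single fixed task could look like; the task and the wording are invented for illustration. The keyword-heavy string goes to the vector search, while the instruction-style string goes into the LLM prompt.

```python
# Hypothetical fixed task: both strings describe it, but each is tuned
# for its own part of the pipeline.
topic = "net revenue for the last quarter"

# Keyword-style query for the retriever (vector search).
retrieval_query = f"quarterly financial report, net revenue, {topic}"

# Instruction-style query for the LLM prompt.
llm_query = (
    f"Using only the provided context, state the {topic} as a single "
    "number and quote the sentence it comes from."
)

# retrieval_query -> embedding + similarity search (RAG part)
# llm_query       -> prompt template sent to the model (LLM part)
```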
Other Languages
The vast majority of LLM training data is English text. Other languages are added to the training set too, but usually in much smaller volumes. Trained LLMs can also be fine-tuned for other languages, but that is hard. Either way, when English isn't the primary language of your solution, you may find yourself with few options. For this case, I want to share another trick! LLMs are almost always much better at understanding non-English languages than at composing responses in them. Just keep your knowledge base in the local language, but ask the LLM to provide the response in English. This may be enough by itself if English answers are acceptable; if not, use a translation service to turn the response into the local language.
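A minimal sketch of that prompt, assuming the retrieved context stays in the local language; llm_complete() and translate_to_local() are hypothetical stand-ins for your LLM client and whatever translation service you would call afterwards.

```python
def build_cross_language_prompt(context: str, question: str) -> str:
    """Context stays in the local language; the model is asked to answer in English."""
    return (
        "The context below is not in English. Read it carefully and answer "
        "the question in English.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer (in English):"
    )

# english_answer = llm_complete(build_cross_language_prompt(context, question))
# local_answer = translate_to_local(english_answer)  # hypothetical translation step
```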
Faster on the Same Host
LLM-based solutions work rather slowly. Even with a powerful CPU and GPU, you still want to make them faster without digging into low-level algorithmic optimization. I have a little something for this case, too! Just try building your app for an alternative OS and launching it in a container (Docker). Once, I got a 20% performance boost by running my pipeline built for Linux in a container on Windows, compared to the initial Windows-native build. I assume frameworks may ship different optimization sets for different platforms and OSs, which is also the case for GPU drivers and low-level stacks. This advice is easy to try: all the needed frameworks are cross-platform, and Docker provides ready-made images with GPU support.
That's it for now.
I hope your list of ideas to test is now bigger. I'll be glad to receive feedback, and let's keep in touch!
Published via Towards AI