

Another Few Tips for Better Results from LLM RAG Solutions

Author(s): Dmitry Malishev

Originally published on Towards AI.

A structured, thoughtful response from an LLM, grounded in well-selected data from RAG, is already a powerful technique. And it can be even better!

Image generated by the author using Leonardo.ai

By now, I’ve completed several RAG-based LLM projects: article analyzers, knowledge-based Q&A bots, and insight generators. Along the way, I had the chance to try many frameworks, models, and pipelines with countless parameters, settings, and other tunings capable of turning these pieces of code into a real magic spell. This article is a compilation of my notes on building a better-performing LLM RAG pipeline, in the hope of helping other engineers with their experiments.

Before digging into the details, let me recall a general LLM RAG pipeline:

Image created by the author using Draw.io

As shown in the diagram, we deal with a sequence of steps that turn a user query into an LLM response. In a real project, any step can include additional pre- and post-processing routines.

All steps are currently covered by well-known third-party frameworks. For the RAG part, it’s either LlamaIndex or LangChain. For the LLM part, it’s typically a local setup based on the open Llama architecture or one of the cloud-based GPT services like ChatGPT. You can find dozens of additional details just by googling any of them.

I’m not going into detail on how the pipeline works, as I’m targeting a more experienced audience. But if you’re only starting, I recommend checking out the LlamaIndex and LangChain beginner’s guides. The basic workflows are easy to understand, and no deep knowledge of large language models or data-processing algorithms is needed.
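
For orientation, here is a minimal sketch of such a pipeline built with LlamaIndex, one of the frameworks mentioned above. The directory path and the query are placeholders, and a default LLM and embedding model (for example, via an OpenAI API key) are assumed to be configured:

    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

    # Indexing: load the knowledge base and embed it into a vector index.
    documents = SimpleDirectoryReader("./knowledge_base").load_data()
    index = VectorStoreIndex.from_documents(documents)

    # Retrieval + generation: fetch the most relevant chunks and let the LLM answer.
    query_engine = index.as_query_engine(similarity_top_k=3)
    response = query_engine.query("What does the report say about Q3 revenue?")
    print(response)

The tips below are mostly about tuning what happens behind these few lines.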

All preparations are complete, so it’s time to get straight to the tips!

Context Length

This parameter relates to both the LLM and RAG parts: it defines the final length of the prompt containing the query and the context. It is often set to the maximum context length supported by the chosen LLM, with the intuition "the bigger, the better". Meanwhile, I found several side effects of a large context value:

  • If the required information is concentrated in one contiguous part of the knowledge base, you may end up with many irrelevant chunks retrieved by the RAG part, which can push the LLM toward wrong answers.
  • There’s the so-called lost-in-the-middle problem, where the LLM pays attention only to the chunks at the beginning and the end of the context and skips the middle.
  • Most of the time, a longer context means more VRAM demand for the pipeline and, consequently, more computation and processing time.

And one last argument! There are LLMs with really large context sizes (64k, 128k, …) that could fit the entire knowledge base and make the RAG part entirely unnecessary. It might be interesting to experiment with, and it may work out in certain cases, but as long as RAG frameworks remain on the scene, a big context doesn’t solve all the problems.
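
As an illustration, here is how one might cap the effective context instead of maxing it out, again using LlamaIndex as an example. Exact setting names may differ between versions, and the numbers are arbitrary starting points rather than recommendations:

    from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex

    # Keep chunks small and cap the effective context, even if the LLM supports more.
    Settings.chunk_size = 512        # tokens per indexed chunk
    Settings.context_window = 8192   # upper bound on prompt length

    index = VectorStoreIndex.from_documents(
        SimpleDirectoryReader("./knowledge_base").load_data()
    )

    # Retrieve only a few of the most relevant chunks rather than as many as fit.
    query_engine = index.as_query_engine(similarity_top_k=3)

Fewer, more relevant chunks also mitigate the lost-in-the-middle effect, since there is simply less middle to get lost in.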

Text Cleaning

Depending on the task, you may decide to clean the context of all unnecessary symbols (extra spaces, newlines, etc.). It sounds especially appealing when the knowledge base is extracted from text formats like DOCX and PDF. Be careful at this step: I discovered that some LLMs pay a lot of attention to how information is divided into sections, paragraphs, and bulleted lists, and they perform much worse without these special symbols, which, by the way, don’t take many bytes to store.
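
A small sketch of what I mean, in plain Python (the regular expressions are illustrative, not a complete cleaning recipe): collapse runs of spaces and excess blank lines, but keep the paragraph breaks and bullet markers the model seems to rely on.

    import re

    def clean_text(raw: str) -> str:
        # Collapse noisy whitespace while preserving document structure.
        text = raw.replace("\r\n", "\n")
        text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
        text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line in a row
        return text.strip()

    print(clean_text("Overview  of   results\r\n\r\n\r\n- item   one\r\n- item two"))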

Two Parts – Two Queries

By default, the input query goes to both the RAG and LLM parts unaltered: vector search, composing a prompt, and sending it to the LLM. These two parts do very different jobs, yet we expect the same input to be good enough for both. It doesn’t have to be! You can prepare optimized variants of the query and provide one to each part of the pipeline. This is particularly useful when you work with fixed queries to find certain types of answers in the knowledge base. So, create the two versions manually whenever possible, or use an LLM to compose the modified versions.
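
Here is a rough sketch of that split with LlamaIndex. Both queries are placeholders, and in practice they could also be generated from the original question by an LLM; a configured default LLM is assumed:

    from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex

    index = VectorStoreIndex.from_documents(
        SimpleDirectoryReader("./knowledge_base").load_data()
    )

    # A keyword-heavy query tends to suit vector search better...
    retrieval_query = "hardware segment revenue 2023 quarterly figures"
    # ...while a full natural-language instruction suits the LLM better.
    llm_question = "Summarize how hardware revenue changed during 2023 and why."

    nodes = index.as_retriever(similarity_top_k=3).retrieve(retrieval_query)
    context = "\n\n".join(node.get_content() for node in nodes)

    prompt = f"Context:\n{context}\n\nQuestion: {llm_question}\nAnswer:"
    print(Settings.llm.complete(prompt).text)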

Other Languages

The vast majority of LLMs are trained on English text data. Other languages are included in the training sets too, but usually in much smaller volumes. Trained LLMs can also be fine-tuned for other languages, but that’s hard. So when English isn’t the primary language of your solution, you may find yourself with few options. For this case, I want to share another trick! LLMs are almost always much better at understanding non-English languages than at composing responses in them. Try keeping your knowledge base in the local language but asking the LLM to respond in English. This may be enough by itself if English answers are acceptable; if not, use a translation service to turn the response back into the local language.
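
In practice the trick can be as simple as an instruction in the query itself. A sketch, assuming the documents in the knowledge base are in German and English answers are acceptable:

    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

    # The indexed documents are in the local language (German in this example).
    index = VectorStoreIndex.from_documents(
        SimpleDirectoryReader("./knowledge_base").load_data()
    )
    query_engine = index.as_query_engine(similarity_top_k=3)

    # The model reads the German context but is asked to reply in English,
    # which most LLMs handle more reliably than generating fluent German.
    response = query_engine.query(
        "Answer in English, even though the context is in German: "
        "What warranty period does the contract specify?"
    )
    print(response)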

Faster on the Same Host

LLM-based solutions run rather slowly. Even with a powerful CPU and GPU, you still want to make them faster without digging into low-level algorithmic optimization. I have a little something for this case, too! Try building your app for an alternative OS and launching it in a container (Docker). Once, I got a 20% performance boost by running my pipeline built for Linux in a container on Windows, compared to the initial Windows-native setup. I assume frameworks may have different optimization sets for different platforms and OSs, which is also the case for GPU drivers and low-level stacks. This advice is easy to try: all the needed frameworks are cross-platform, and Docker provides ready-made images with GPU support.

That’s it for now.
I hope your list of ideas to test is now longer. I’ll be glad to receive feedback, and let’s keep in touch!


Published via Towards AI
