Ways to Deal With Hallucinations in LLMs
Author(s): Igor Novikov
Originally published on Towards AI.
One of the major challenges in using LLMs in business is that LLMs hallucinate. How can you entrust your clients to a chatbot that can go mad and tell them something inappropriate at any moment? Or how can you trust your corporate AI assistant if it makes things up randomly?
That's a problem, especially given that an LLM can't be fired or held accountable.
That's the thing with AI systems: they don't benefit from lying to you in any way, but at the same time, despite sounding intelligent, they are not a person, so they can't be blamed either.
Some tout RAG as a cure-all approach, but in reality it only solves one particular cause and doesn't help with others. Only a combination of several methods can help.
Not all hope is lost, though. There are ways to work with this, so let's look at them.
To avoid getting too philosophical about what a hallucination is, let's define the most important cases:
- The model understands the question but gives an incorrect answer
- The model didn't understand the question and thus gave an incorrect answer
- There is no right or wrong answer, and therefore if you disagree with the model, that doesn't make it incorrect. For example, if you ask Apple vs. Android, whatever it answers is technically just an opinion
Let's start with the second case. These are the reasons why a model can misunderstand a question:
- The question is crap (ambiguous, not clear, etc.), and therefore the answer is crap. Not the model's fault, ask better questions
- The model does not have context
- Language: the model does not understand the language you are using
- Bad luck or, in other words, the stochastic sampling led the reasoning down a weird path
Now let's look at the first one: why would a model lie, that is, give factually and verifiably incorrect information, if it understands the question?
- It didn't follow all the logical steps to arrive at a conclusion
- It didn't have enough context
- The information (context) it was given is incorrect
- It has the right information but got confused
- It was trained to give incorrect answers (for political and similar reasons)
- Bad luck: the stochastic sampling led the reasoning down a weird path
- It was configured so it is allowed to fantasize (which can sometimes be desirable)
- Overfitting and underfitting: the model was trained in a specific field and tries to apply its logic to a different field, leading to incorrect deduction or induction in answering
- The model is overwhelmed with data and starts to lose context
I'm not going to discuss things that are not a model problem, like bad questions or questions with no right answers. Let's concentrate on what we can try to solve, one by one.
The model does not have enough context or information, or the information provided to it is incorrect or incomplete
This is where RAG comes into play. RAG, when correctly implemented, should provide the model with the necessary context so it can answer. Here is an article on how to do RAG properly.
It is important to do it right, with all the required metadata about the information's structure and attributes. It is desirable to use something like GraphRAG and reranking in the retrieval phase, so that the model is given only relevant context; otherwise, the model can get confused.
It is also extremely important to keep the data you provide to the model up to date and continuously update it, taking versioning into account. If you have data conflicts, which is not uncommon, the model will start generating conflicting answers as well. There are methods, such as the Maximum Marginal Relevance (MMR) algorithm, which considers the relevance and novelty of information for filtering and reordering. However, this is not a panacea, and it is best to address this issue at the data storage stage.
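To make the MMR step concrete, here is a minimal sketch of the selection loop, assuming you already have dense vectors for the query and the candidate chunks; the number of results k and the relevance/novelty trade-off lam are placeholders you would tune:

```python
# Maximal Marginal Relevance (MMR): pick chunks that are relevant to the query
# but not redundant with the chunks already selected.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def mmr(query_vec: np.ndarray, doc_vecs: list[np.ndarray], k: int = 5, lam: float = 0.7) -> list[int]:
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalize similarity to anything already picked (the novelty term).
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected  # indices of the chunks to pass to the LLM, in selection order
```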
Language
Not all models understand all languages equally well. It is always preferable to use English for prompts, as it works best for most models. If you have to use a specific language, you may need to use a model built for it, like Qwen for Chinese.
A model does not follow all the logical steps to arrive at a conclusion
You can force the model to follow a particular thinking process with techniques like SelfRag, Chain of Thought, or SelfCheckGPT. Here is an article about these techniques.
The general idea is to ask the model to think in steps and explain/validate its conclusions and intermediate steps, so it can catch its errors.
Alternatively, you can use an agent-based setup, where several LLM agents communicate with each other and verify each other's outputs at each step.
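To illustrate the idea, here is a minimal two-pass sketch in the spirit of Chain of Thought plus a self-check; llm(prompt) is a placeholder for whatever completion call you use, not a specific framework's API:

```python
# Pass 1: ask for numbered, step-by-step reasoning.
REASONING_PROMPT = """Answer the question below.
Think step by step, numbering each step, and only then state the final answer.

Question: {question}
"""

# Pass 2: ask the model to check its own steps and revise if needed.
VERIFY_PROMPT = """Here is a question and a proposed step-by-step answer.
Check each step for factual or logical errors. If you find one, correct it
and give a revised final answer; otherwise repeat the original final answer.

Question: {question}

Proposed answer:
{draft}
"""

def answer_with_self_check(question: str, llm) -> str:
    draft = llm(REASONING_PROMPT.format(question=question))
    return llm(VERIFY_PROMPT.format(question=question, draft=draft))
```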
A model got confused with the information it had, and "bad luck"
These two are actually caused by the same thing, and it is a tricky one. The way models work is that they stochastically predict the next token in a sequence. The process is somewhat random, so it is possible that the model will pick a less probable route and go off course. This is built into the model and the way it works.
There are several methods to handle this:
- MultiQuery: run several queries for the same question and pick the best answer using a relevance score like a Cross-Encoder. If you get three very similar answers and one very different one, the different one is likely a random hallucination. This adds some overhead, so you pay a price, but it is a very good method to ensure you don't randomly get a bad answer (see the sketch after this list)
- Set the model temperature to a lower value to discourage it from going in less probable directions (i.e., fantasizing)
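Here is a minimal sketch of the MultiQuery idea, assuming the llm callable samples with a non-zero temperature and using a cross-encoder from the sentence-transformers library to score each answer against the question (the checkpoint name is only an example):

```python
# Sample several answers for the same question and keep the one that scores
# highest against the question; low-scoring outliers are likely random hallucinations.
from sentence_transformers import CrossEncoder

scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def best_of_n(question: str, llm, n: int = 4) -> str:
    answers = [llm(question) for _ in range(n)]            # temperature > 0, so answers differ
    scores = scorer.predict([(question, a) for a in answers])
    return answers[int(scores.argmax())]
```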
There is one more cause, which is harder to fix. The model keeps semantically similar ideas close together in vector space. If you ask about a fact that has other, unrelated facts sitting nearby, the model will follow the path of least resistance. The model has an associative memory, so to speak: it thinks in associations, and that mode of thinking is not suitable for tasks like playing chess or doing math. In Kahneman's terms, the model has a fast-thinking brain but lacks a slow one.
For example, you ask a model what 3 + 7 is and it answers 37. Why?
But it makes a certain kind of sense: if you look at 3 and 7 in vector space, the closest vector to them is 37. Here the mistake is obvious, but it can be much more subtle.
Example: ask an LLM who the mother of Alfonso II, the third king of Portugal, was. A model may confidently reply that the mother of Alfonso II was Urraca of Castile.
The answer is incorrect.
- "Afonso" was the third king of Portugal, not "Alfonso." There was no "Alfonso II" as king of Portugal.
- The mother of "Afonso II" was Dulce of Aragon, not Urraca of Castile.
From the LLM's perspective, "Alfonso" is basically the same as "Afonso," and "mother" is a direct match. Therefore, if there is no "mother" fact stored close to "Afonso," the LLM will choose the Alfonso/mother combination.
Here is an article explaining this in detail, along with potential ways to fix it. Also, in general, fine-tuning the model on data from your domain will make this less likely to happen, as the model will be less confused by similar facts in edge cases.
The model was configured so it is allowed to fantasize
This can be done either through a master prompt or by setting the model temperature too high. So basically you need to:
- Instruct the model not to give an answer if it is not sure or does not have the information
- Ensure nothing in the prompt instructs the model to make up facts and, in general, make the instructions very clear
- Set the temperature lower (a minimal sketch follows this list)
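A minimal sketch of both knobs using the OpenAI Python client; the model name, the temperature value, and the exact wording of the system prompt are placeholders rather than recommendations:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Answer only from the provided context. "
    "If the context does not contain the answer, say you don't know. "
    "Do not invent facts, names, numbers, or citations."
)

def ask(question: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",          # placeholder model name
        temperature=0.1,              # low temperature discourages improbable continuations
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```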
Overfitting and underfitting
If you use a model that was trained in the healthcare space to solve programming tasks, it will hallucinate; in other words, it will try to put square pegs into round holes, because that is all it knows how to do. That's kind of obvious. The same applies if you use a generic model, trained on generic data from the internet, to solve industry-specific tasks.
The solution is to use a proper model for your industry and fine-tune/train it in that area. That will improve correctness dramatically in certain cases. I'm not saying you always have to do that, but you might have to.
Another case of this is using a model that is too small (in terms of parameters) for your tasks. Yes, certain tasks may not require a large model, but some certainly do, and you should not use a model smaller than appropriate. Using a model that is too big will cost you, but at least it will work correctly.
The model is overwhelmed with data and starts to lose context
You may think that the more data you have, the better, but that is not the case at all!
The model's context window and attention span are limited. Even recent models with context windows of millions of tokens do not handle them well. They start to forget things, ignore things in the middle, and so on.
The solution here is to use RAG with proper context-size management. You have to pre-select only the relevant data, rerank it, and feed it to the LLM.
Here is my article that overviews some of the techniques to do that.
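As a rough illustration, here is a sketch of packing only the best-ranked chunks into a fixed token budget; tiktoken is used for counting, and the budget and encoding are assumptions you would adjust for your model:

```python
# Fit reranked chunks into a token budget instead of dumping everything into the prompt.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def pack_context(ranked_chunks: list[str], budget_tokens: int = 4000) -> str:
    picked, used = [], 0
    for chunk in ranked_chunks:            # assumed to be sorted by relevance already
        cost = len(enc.encode(chunk))
        if used + cost > budget_tokens:
            continue                       # skip chunks that would blow the budget
        picked.append(chunk)
        used += cost
    return "\n\n".join(picked)
```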
Also, some models do not handle long context well at all, and at a certain point the quality of answers starts to degrade as the context size grows. Here is a research paper on that.
Other general techniques
Human in the loop
You can always have someone in the loop to fact-check LLM outputs. For example, if you use an LLM for data annotation (which is a great idea), you will need to use it in conjunction with real humans to validate the results. Or use your system in co-pilot mode, where humans make the final decision. This doesn't scale well, though.
Oracles
Alternatively, you can use an automated oracle to fact-check the system's results, if that option is available.
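What such an oracle looks like depends entirely on your domain; as a minimal sketch, trusted_lookup below is a hypothetical stand-in for your database, internal API, or calculator:

```python
# Re-check a claim the LLM made against a trusted source of truth.
def oracle_check(entity: str, claimed_value: str, trusted_lookup) -> bool:
    actual = trusted_lookup(entity)        # e.g. a SQL query or an internal API call
    return actual is not None and str(actual).strip().lower() == claimed_value.strip().lower()

# Answers that fail the check can be rejected, regenerated, or routed to a human.
```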
External tools
Certain things, like calculations and math, should be done outside of the LLM, using tools that are provided to the LLM. For example, you can use the LLM to generate a query for a SQL database or Elasticsearch, execute that query, and then use the results to generate the final answer.
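For illustration, here is a minimal text-to-SQL sketch along those lines; llm(prompt) is a placeholder, and in a real system you would validate the generated query (read-only access, allow-listed tables) before executing it:

```python
# The LLM writes the query, the database computes the facts,
# and the final answer is grounded in the returned rows.
import sqlite3

def answer_from_db(question: str, schema: str, db_path: str, llm) -> str:
    sql = llm(f"Schema:\n{schema}\n\nWrite a single SQLite SELECT query that answers: {question}")
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(sql).fetchall()
    return llm(f"Question: {question}\nQuery results: {rows}\nAnswer using only these results.")
```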
What to read next:
Peace!