
Ways to Deal With Hallucinations in LLMs

Author(s): Igor Novikov

Originally published on Towards AI.

Image by the author

One of the major challenges in using LLMs in business is that LLMs hallucinate. How can you entrust your clients to a chatbot that can go mad and tell them something inappropriate at any moment? Or how can you trust your corporate AI assistant if it makes things up randomly?

That’s a problem, especially given that an LLM can’t be fired or held accountable.

That's the thing with AI systems: they don't benefit from lying to you in any way, but at the same time, despite sounding intelligent, they are not a person, so they can't be blamed either.

Some tout RAG as a cure-all approach, but in reality it only solves one particular cause and doesn’t help with others. Only a combination of several methods can help.

Not all hope is lost, though. There are ways to work around it, so let's look at them.

Without getting too philosophical about what a hallucination is, let's define the most important cases:

  1. The model understands the question but gives an incorrect answer
  2. The model didn’t understand the question and thus gave an incorrect answer
  3. There is no right or wrong answer, so if you disagree with the model, that does not make it incorrect. If you ask Apple vs. Android, whatever it answers is technically just an opinion

Let's start with the second case. These are the reasons why a model can misunderstand a question:

  1. The question is crap (ambiguous, unclear, etc.), and therefore the answer is crap. Not the model's fault; ask better questions
  2. The model does not have context
  3. Language: the model does not understand the language you are using
  4. Bad luck or, in other words, the stochastic sampling process led the reasoning somewhere weird

Now let's look at the first case: why would a model lie, that is, give factually and verifiably incorrect information, if it understands the question?

  1. It didn’t follow all the logical steps to arrive at a conclusion
  2. It didn’t have enough context
  3. The information (context) it was given is incorrect as well
  4. It has the right information but got confused
  5. It was trained to give incorrect answers (for political and similar reasons)
  6. Bad luck: the stochastic sampling process led the reasoning somewhere weird
  7. It was configured so it is allowed to fantasize (which can be sometimes desirable)
  8. Overfitting and underfitting: the model was trained in a specific field and tries to apply its logic to a different field, leading to incorrect deduction or induction in answering
  9. The model is overwhelmed with data and starts to lose context

I’m not going to discuss things that are not a model problem, like bad questions or questions with no right answers. Let’s concentrate on what we can try to solve, one by one.

The model does not have enough context or information, or the information that was provided to it is not correct or complete

This is where RAG comes into play. RAG, when correctly implemented, provides the model with the context it needs to answer. Here is the article on how to do RAG properly.

It is important to do it right, with all the required metadata about the information's structure and attributes. It is desirable to use something like GraphRAG and reranking in the retrieval phase, so that the model is given only relevant context; otherwise, the model can get confused.

It is also extremely important to keep the data you provide to the model up to date and continuously update it, taking versioning into account. If you have data conflicts, which is not uncommon, the model will start generating conflicting answers as well. There are methods, such as the Maximum Marginal Relevance (MMR) algorithm, which considers the relevance and novelty of information for filtering and reordering. However, this is not a panacea, and it is best to address this issue at the data storage stage.
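
To make the MMR idea concrete, here is a minimal sketch, assuming you already have a query embedding and per-chunk embeddings from whatever embedding model you use. It is an illustration of the relevance-vs-novelty trade-off, not a production retriever:

```python
# Minimal MMR sketch: balance relevance to the query against redundancy with
# chunks that were already selected. Embeddings can come from any embedding model.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def mmr(query_emb, doc_embs, k=5, lambda_mult=0.7):
    selected, candidates = [], list(range(len(doc_embs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_emb, doc_embs[i])
            redundancy = max((cosine(doc_embs[i], doc_embs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected  # indices of the chunks to pass to the LLM
```

A lambda_mult closer to 1 favors relevance, closer to 0 favors novelty; the right value depends on how redundant your corpus is.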

Language

Not all models understand all languages equally well. It is always preferable to use English for prompts, as it works best for most models. If you have to use a specific language, you may need a model built for it, like Qwen for Chinese.

A model does not follow all the logical steps to arrive at a conclusion

You can force the model to follow a particular thinking process with techniques like SelfRag, Chain of Thought, or SelfCheckGPT. Here is an article about these techniques.

The general idea is to ask the model to think in steps and explain/validate its conclusions and intermediate steps, so it can catch its errors.

Alternatively, you can use an agent-based setup, where several LLM agents communicate with each other and verify each other's outputs at each step.
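
Here is a minimal sketch of that generate-then-verify pattern. `call_llm` is a placeholder for whatever client you use, and the prompts are illustrative rather than the exact Self-RAG or SelfCheckGPT recipe:

```python
# Sketch of a draft/review/revise loop: one prompt produces a step-by-step draft,
# a second "reviewer" prompt checks its claims against the context, and a third
# pass revises if needed. call_llm is a placeholder for your LLM client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def answer_with_verification(question: str, context: str) -> str:
    draft = call_llm(
        "Answer the question using only the context. "
        "Think step by step and list the facts you rely on.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    verdict = call_llm(
        "You are a strict reviewer. Check every claim in the draft against the context. "
        "Reply 'OK' if all claims are supported; otherwise list the unsupported claims.\n\n"
        f"Context:\n{context}\n\nDraft:\n{draft}"
    )
    if verdict.strip().startswith("OK"):
        return draft
    # One revision pass; in practice you might loop or fall back to "I don't know".
    return call_llm(
        "Revise the draft so it only contains claims supported by the context.\n"
        f"Reviewer notes:\n{verdict}\n\nContext:\n{context}\n\nDraft:\n{draft}"
    )
```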

A model got confused with the information it had and "bad luck"

These two are actually caused by the same thing, and it is a tricky one. Models work by stochastically predicting the next token in a sequence. The process is somewhat random, so it is possible that the model will pick a less probable route and go off course. This is built into the model and the way it works.

There are several methods for handling this:

  1. MultiQuery: run several queries for the same question and pick the best answer using a relevance score, for example from a cross-encoder. If you get three very similar answers and one very different one, the outlier was most likely a random hallucination. This adds overhead, so you pay a price, but it is a very good way to ensure you don't randomly get a bad answer (see the sketch after this list)
  2. Set the model temperature to a lower value to discourage it from going in less probable directions (i.e., fantasizing)
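
Here is a minimal sketch of the MultiQuery idea. `generate` stands in for whatever LLM call you use, and the cross-encoder model name is just one example of a pairwise similarity scorer:

```python
# Sketch: sample several answers, score each against the others with a cross-encoder,
# and keep the one most consistent with the rest. Outliers (likely hallucinations)
# score low and are discarded. The model name is an example, not a requirement.
from sentence_transformers import CrossEncoder

def most_consistent_answer(question: str, generate, n: int = 4) -> str:
    answers = [generate(question) for _ in range(n)]  # temperature > 0 so samples differ
    scorer = CrossEncoder("cross-encoder/stsb-roberta-base")
    consistency = []
    for i, a in enumerate(answers):
        pairs = [(a, b) for j, b in enumerate(answers) if j != i]
        consistency.append(float(sum(scorer.predict(pairs))) / len(pairs))
    return answers[max(range(n), key=lambda i: consistency[i])]
```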

There is one more cause, which is harder to fix. The model keeps semantically similar ideas close together in vector space. Being asked about a fact that has other facts nearby, close in the space but not actually related, will lead the model down the path of least resistance. The model has associative memory, so to speak; it thinks in associations, and that mode of thinking is not suitable for tasks like playing chess or math. In Kahneman's terms, the model has a fast-thinking brain but lacks a slow one.

For example, you ask a model what 3 + 7 is and it answers 37. Why???

But it all makes sense: if you look at 3 and 7 in vector space, the closest vector to them is 37. Here the mistake is obvious, but it may be much more subtle.

Example:

Image by the author

The answer is incorrect.

  • "Afonso" was the third king of Portugal. Not "Alfonso." There was no "Alfonso II" as the king of Portugal.
  • The mother of "Afonso II" was Dulce of Aragon, not Urraca of Castile.

From the LLM's perspective, "Alfonso" is basically the same as "Afonso," and "mother" is a direct match. Therefore, if there is no "mother" close to "Afonso," the LLM will choose the Alfonso/mother combination.

Here is an article explaining this in detail, along with potential ways to fix it. Also, in general, fine-tuning the model on data from your domain will make this less likely, as the model will be less confused by similar facts in edge cases.

The model was configured so it is allowed to fantasize

This can happen either through the master prompt or because the model temperature is set too high. So basically you need to:

  1. Instruct the model not to give an answer if it is not sure or does not have the information
  2. Ensure nothing in the prompt instructs the model to make up facts and, in general, make the instructions very clear
  3. Set the temperature lower (a configuration sketch follows this list)
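
As an illustration, here is a minimal sketch of both levers together, using the OpenAI Python SDK purely as an example; the model name and prompt wording are assumptions, and any chat API with a system message and a temperature parameter works the same way:

```python
# Sketch: constrain the model via the system prompt and a low temperature.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "Answer only from the provided context. "
    "If the context does not contain the answer, say 'I don't know'. "
    "Do not guess or invent facts."
)

def ask(question: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        temperature=0.1,      # low temperature discourages improbable continuations
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```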

Overfitting and underfitting

If you use a model that was trained in the healthcare space to solve programming tasks, it will hallucinate; in other words, it will try to put square pegs into round holes, because that is all it knows how to do. That's kind of obvious. The same goes for using a generic model, trained on generic data from the internet, to solve industry-specific tasks.

The solution is to use a proper model for your industry and fine-tune/train it in that area. That will improve the correctness dramatically in certain cases. I’m not saying you always have to do that, but you might have to.

Another case of this is using a model that is too small (in terms of parameters) to solve your tasks. Yes, certain tasks may not require a large model, but others certainly do, and you should not use a model smaller than appropriate. Using a model that is too big will cost you, but at least it will work correctly.

The model is overwhelmed with data and starts to lose context

You may think that the more data you have, the better, but that is not the case at all!

A model's context window and attention span are limited. Even recent models with context windows of millions of tokens do not use them well. They start to forget things, ignore things in the middle of the context, and so on.

The solution here is to use RAG with proper context-size management. You have to pre-select only relevant data, rerank it, and feed only that to the LLM.
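
As an illustration, here is a minimal sketch of that idea: rerank the retrieved chunks against the question and pack only as many as fit a token budget. The reranker model, tokenizer, and budget are assumptions, not recommendations:

```python
# Sketch: rerank retrieved chunks with a cross-encoder and pack the best ones
# until a token budget is reached, instead of dumping everything into the prompt.
import tiktoken
from sentence_transformers import CrossEncoder

def build_context(question: str, chunks: list[str], token_budget: int = 4000) -> str:
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(question, c) for c in chunks])
    ranked = [c for _, c in sorted(zip(scores, chunks), key=lambda x: -x[0])]

    enc = tiktoken.get_encoding("cl100k_base")
    picked, used = [], 0
    for chunk in ranked:
        cost = len(enc.encode(chunk))
        if used + cost > token_budget:
            break
        picked.append(chunk)
        used += cost
    return "\n\n".join(picked)  # feed this, not the whole corpus, to the LLM
```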

Here is my article that overviews some of the techniques to do that.

Also, some models do not handle long context well at all, and at a certain point, the quality of their answers starts to degrade as the context size increases.

Here is a research paper on that.

Other general techniques

Human in the loop

You can always have someone in the loop to fact-check LLM outputs. For example, if you use an LLM for data annotation (which is a great idea), you will need to use it in conjunction with real humans who validate the results. Or use your system in co-pilot mode, where humans make the final decision. This doesn't scale well, though.

Oracles

Alternatively, you can use an automated oracle to fact-check the system's results, if that option is available.

External tools

Certain things, like calculations and math, should be done outside of the LLM, using tools that are provided to it. For example, you can use the LLM to generate a query for a SQL database or Elasticsearch, execute it, and then use the results to generate the final answer.
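
Here is a minimal sketch of that pattern with SQLite; `call_llm`, the table schema, and the prompts are placeholders for illustration:

```python
# Sketch: the LLM writes the SQL, the database computes the exact result,
# and the LLM only phrases the final answer from those results.
import sqlite3

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def answer_with_sql(question: str, db_path: str = "sales.db") -> str:
    schema = "orders(id INTEGER, customer TEXT, amount REAL, created_at TEXT)"
    sql = call_llm(
        "Write a single SQLite SELECT query that answers the question.\n"
        f"Schema: {schema}\nQuestion: {question}\nReturn only the SQL."
    )
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(sql).fetchall()  # in production, validate the SQL first
    return call_llm(
        f"Question: {question}\nQuery result rows: {rows}\n"
        "Answer the question using only these results."
    )
```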

What to read next:

RAG architecture guide

Advanced RAG guide

Peace!


Published via Towards AI
