
Ways to Deal With Hallucinations in LLMs

Author(s): Igor Novikov

Originally published on Towards AI.

Image by the author

One of the major challenges in using LLMs in business is that LLMs hallucinate. How can you entrust your clients to a chatbot that can go mad and tell them something inappropriate at any moment? Or how can you trust your corporate AI assistant if it makes things up randomly?

That’s a problem, especially given that an LLM can’t be fired or held accountable.

That's the thing with AI systems: they don't benefit from lying to you in any way, but at the same time, despite sounding intelligent, they are not a person, so they can't be blamed either.

Some tout RAG as a cure-all approach, but in reality it only solves one particular cause and doesn’t help with others. Only a combination of several methods can help.

Not all hope is lost, though. There are ways to work around it, so let's look at them.

Without getting too philosophical about what a hallucination is, let's define the most important cases:

  1. The model understands the question but gives an incorrect answer
  2. The model didn’t understand the question and thus gave an incorrect answer
  3. There is no right or wrong answer, so if you disagree with the model, that does not make it incorrect. If you ask Apple vs. Android, whatever it answers is technically just an opinion

Let's start with the second case. These are the reasons why a model can misunderstand a question:

  1. The question is crap (ambiguous, unclear, etc.), and therefore the answer is crap. Not the model's fault; ask better questions
  2. The model does not have context
  3. Language: the model does not understand the language you are using
  4. Bad luck or, in other words, the stochastic sampling process led the reasoning somewhere weird

Now let's look at the first case: why would a model lie, that is, give factually and verifiably incorrect information, if it understands the question?

  1. It didn’t follow all the logical steps to arrive at a conclusion
  2. It didn’t have enough context
  3. The information (context) it was given is incorrect as well
  4. It has the right information but got confused
  5. It was trained to give incorrect answers (for political and similar reasons)
  6. Bad luck: the stochastic sampling process led the reasoning somewhere weird
  7. It was configured so it is allowed to fantasize (which can be sometimes desirable)
  8. Overfitting and underfitting: the model was trained in a specific field and tries to apply its logic to a different field, leading to incorrect deduction or induction in answering
  9. The model is overwhelmed with data and starts to lose context

I’m not going to discuss things that are not a model problem, like bad questions or questions with no right answers. Let’s concentrate on what we can try to solve, one by one.

The model does not have enough context or information, or the information that was provided to it is not correct or complete

This is where RAG comes into play. RAG, when correctly implemented, provides the model with the context it needs to answer. Here is the article on how to do RAG properly.

It is important to do it right, with all the required metadata about the information's structure and attributes. It is desirable to use something like GraphRAG and reranking in the retrieval phase, so that the model is given only relevant context; otherwise, the model can get confused.

It is also extremely important to keep the data you provide to the model up to date and continuously update it, taking versioning into account. If you have data conflicts, which is not uncommon, the model will start generating conflicting answers as well. There are methods, such as the Maximum Marginal Relevance (MMR) algorithm, which considers the relevance and novelty of information for filtering and reordering. However, this is not a panacea, and it is best to address this issue at the data storage stage.
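
To make the MMR idea concrete, here is a minimal sketch, assuming you already have a query embedding and per-chunk embeddings from whatever embedding model you use. It is an illustration of the relevance-vs-novelty trade-off, not a production retriever:

```python
# Minimal MMR sketch: balance relevance to the query against redundancy with
# chunks that were already selected. Embeddings can come from any embedding model.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def mmr(query_emb, doc_embs, k=5, lambda_mult=0.7):
    selected, candidates = [], list(range(len(doc_embs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_emb, doc_embs[i])
            redundancy = max((cosine(doc_embs[i], doc_embs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected  # indices of the chunks to pass to the LLM
```

A lambda_mult closer to 1 favors relevance, closer to 0 favors novelty; the right value depends on how redundant your corpus is.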

Language

Not all models understand all languages equally well. It is always preferable to use English for prompts, as it works best for most models. If you have to use a specific language, you may need a model built for it, like Qwen for Chinese.

A model does not follow all the logical steps to arrive at a conclusion

You can force the model to follow a particular thinking process with techniques like SelfRag, Chain of Thought, or SelfCheckGPT. Here is an article about these techniques.

The general idea is to ask the model to think in steps and explain/validate its conclusions and intermediate steps, so it can catch its errors.

Alternatively, you can use an agent-based setup, where several LLM agents communicate with each other and verify each other's outputs at each step.
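
Here is a minimal sketch of that generate-then-verify pattern. `call_llm` is a placeholder for whatever client you use, and the prompts are illustrative rather than the exact Self-RAG or SelfCheckGPT recipe:

```python
# Sketch of a draft/review/revise loop: one prompt produces a step-by-step draft,
# a second "reviewer" prompt checks its claims against the context, and a third
# pass revises if needed. call_llm is a placeholder for your LLM client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def answer_with_verification(question: str, context: str) -> str:
    draft = call_llm(
        "Answer the question using only the context. "
        "Think step by step and list the facts you rely on.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    verdict = call_llm(
        "You are a strict reviewer. Check every claim in the draft against the context. "
        "Reply 'OK' if all claims are supported; otherwise list the unsupported claims.\n\n"
        f"Context:\n{context}\n\nDraft:\n{draft}"
    )
    if verdict.strip().startswith("OK"):
        return draft
    # One revision pass; in practice you might loop or fall back to "I don't know".
    return call_llm(
        "Revise the draft so it only contains claims supported by the context.\n"
        f"Reviewer notes:\n{verdict}\n\nContext:\n{context}\n\nDraft:\n{draft}"
    )
```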

A model got confused with the information it had and "bad luck"

These two are actually caused by the same thing, and it is a tricky one. Models work by stochastically predicting the next token in a sequence. The process is somewhat random, so it is possible that the model will pick a less probable route and go off course. This is built into the model and the way it works.

There are several methods for handling this:

  1. MultiQuery: run several queries for the same question and pick the best answer using a relevance score, for example from a cross-encoder. If you get three very similar answers and one very different one, the outlier was most likely a random hallucination. This adds overhead, so you pay a price, but it is a very good way to ensure you don't randomly get a bad answer (see the sketch after this list)
  2. Set the model temperature to a lower value to discourage it from going in less probable directions (i.e., fantasizing)
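
Here is a minimal sketch of the MultiQuery idea. `generate` stands in for whatever LLM call you use, and the cross-encoder model name is just one example of a pairwise similarity scorer:

```python
# Sketch: sample several answers, score each against the others with a cross-encoder,
# and keep the one most consistent with the rest. Outliers (likely hallucinations)
# score low and are discarded. The model name is an example, not a requirement.
from sentence_transformers import CrossEncoder

def most_consistent_answer(question: str, generate, n: int = 4) -> str:
    answers = [generate(question) for _ in range(n)]  # temperature > 0 so samples differ
    scorer = CrossEncoder("cross-encoder/stsb-roberta-base")
    consistency = []
    for i, a in enumerate(answers):
        pairs = [(a, b) for j, b in enumerate(answers) if j != i]
        consistency.append(float(sum(scorer.predict(pairs))) / len(pairs))
    return answers[max(range(n), key=lambda i: consistency[i])]
```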

There is one more cause, which is harder to fix. The model keeps semantically similar ideas close together in vector space. Being asked about a fact that has other facts nearby, close in the space but not actually related, will lead the model down the path of least resistance. The model has associative memory, so to speak; it thinks in associations, and that mode of thinking is not suitable for tasks like playing chess or math. In Kahneman's terms, the model has a fast-thinking brain but lacks a slow one.

For example, you ask a model what 3 + 7 is and it answers 37. Why???

But it all makes sense: if you look at 3 and 7 in vector space, the closest vector to them is 37. Here the mistake is obvious, but it may be much more subtle.

Example:

Image by the author

The answer is incorrect.

  • "Afonso" was the third king of Portugal. Not "Alfonso." There was no "Alfonso II" as the king of Portugal.
  • The mother of "Afonso II" was Dulce of Aragon, not Urraca of Castile.

From the LLM's perspective, "Alfonso" is basically the same as "Afonso," and "mother" is a direct match. Therefore, if there is no "mother" close to "Afonso," the LLM will choose the Alfonso/mother combination.

Here is an article explaining this in detail, along with potential ways to fix it. Also, in general, fine-tuning the model on data from your domain will make this less likely, as the model will be less confused by similar facts in edge cases.

The model was configured so it is allowed to fantasize

This can happen either through the master prompt or because the model temperature is set too high. So basically you need to:

  1. Instruct the model not to give an answer if it is not sure or does not have the information
  2. Ensure nothing in the prompt instructs the model to make up facts and, in general, make the instructions very clear
  3. Set the temperature lower (a configuration sketch follows this list)
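
As an illustration, here is a minimal sketch of both levers together, using the OpenAI Python SDK purely as an example; the model name and prompt wording are assumptions, and any chat API with a system message and a temperature parameter works the same way:

```python
# Sketch: constrain the model via the system prompt and a low temperature.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "Answer only from the provided context. "
    "If the context does not contain the answer, say 'I don't know'. "
    "Do not guess or invent facts."
)

def ask(question: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        temperature=0.1,      # low temperature discourages improbable continuations
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```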

Overfitting and underfitting

If you use a model that was trained in the healthcare space to solve programming tasks, it will hallucinate; in other words, it will try to put square pegs into round holes, because that is all it knows how to do. That's kind of obvious. The same goes for using a generic model, trained on generic data from the internet, to solve industry-specific tasks.

The solution is to use a proper model for your industry and fine-tune/train it in that area. That will improve the correctness dramatically in certain cases. I’m not saying you always have to do that, but you might have to.

Another case of this is using a model that is too small (in terms of parameters) to solve your tasks. Yes, certain tasks may not require a large model, but others certainly do, and you should not use a model smaller than appropriate. Using a model that is too big will cost you, but at least it will work correctly.

The model is overwhelmed with data and starts to lose context

You may think that the more data you have, the better, but that is not the case at all!

A model's context window and attention span are limited. Even recent models with context windows of millions of tokens do not use them well. They start to forget things, ignore things in the middle of the context, and so on.

The solution here is to use RAG with proper context-size management. You have to pre-select only relevant data, rerank it, and feed only that to the LLM.
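
As an illustration, here is a minimal sketch of that idea: rerank the retrieved chunks against the question and pack only as many as fit a token budget. The reranker model, tokenizer, and budget are assumptions, not recommendations:

```python
# Sketch: rerank retrieved chunks with a cross-encoder and pack the best ones
# until a token budget is reached, instead of dumping everything into the prompt.
import tiktoken
from sentence_transformers import CrossEncoder

def build_context(question: str, chunks: list[str], token_budget: int = 4000) -> str:
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(question, c) for c in chunks])
    ranked = [c for _, c in sorted(zip(scores, chunks), key=lambda x: -x[0])]

    enc = tiktoken.get_encoding("cl100k_base")
    picked, used = [], 0
    for chunk in ranked:
        cost = len(enc.encode(chunk))
        if used + cost > token_budget:
            break
        picked.append(chunk)
        used += cost
    return "\n\n".join(picked)  # feed this, not the whole corpus, to the LLM
```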

Here is my article that overviews some of the techniques to do that.

Also, some models do not handle long context well at all, and at a certain point, the quality of their answers starts to degrade as the context size increases.

Here is a research paper on that.

Other general techniques

Human in the loop

You can always have someone in the loop to fact-check LLM outputs. For example, if you use an LLM for data annotation (which is a great idea), you will need to use it in conjunction with real humans who validate the results. Or use your system in co-pilot mode, where humans make the final decision. This doesn't scale well, though.

Oracles

Alternatively, you can use an automated oracle to fact-check the system's results, if that option is available.

External tools

Certain things, like calculations and math, should be done outside of the LLM, using tools that are provided to it. For example, you can use the LLM to generate a query for a SQL database or Elasticsearch, execute it, and then use the results to generate the final answer.
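
Here is a minimal sketch of that pattern with SQLite; `call_llm`, the table schema, and the prompts are placeholders for illustration:

```python
# Sketch: the LLM writes the SQL, the database computes the exact result,
# and the LLM only phrases the final answer from those results.
import sqlite3

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def answer_with_sql(question: str, db_path: str = "sales.db") -> str:
    schema = "orders(id INTEGER, customer TEXT, amount REAL, created_at TEXT)"
    sql = call_llm(
        "Write a single SQLite SELECT query that answers the question.\n"
        f"Schema: {schema}\nQuestion: {question}\nReturn only the SQL."
    )
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(sql).fetchall()  # in production, validate the SQL first
    return call_llm(
        f"Question: {question}\nQuery result rows: {rows}\n"
        "Answer the question using only these results."
    )
```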

What to read next:

RAG architecture guide

Advanced RAG guide

Peace!


Published via Towards AI
