
Reducing Hallucinations in VLMs using REVERSE
Last Updated on September 12, 2025 by Editorial Team
Author(s): Youssef Farag
Originally published on Towards AI.
Paper link: https://arxiv.org/abs/2504.13169
Code: https://github.com/tsunghan-wu/reverse_vlm
Released: 17th of April 2025

Large Language Models (LLMs) are brilliant liars. They can produce fabricated information out of thin air with a highly convincing tone. Although these “hallucinations” are sometimes useful in a creative setting, they’re a huge liability when it comes to real-world applications.
Vision Language Models (VLMs) inherit the same problem. They might misidentify an object in an image or describe something that’s not there at all. Although there are various methods to mitigate VLM hallucinations, our paper for the day introduces a novel hallucination-aware approach that automatically modifies outputs during generation.
Retrospective Verification and Self-correction [1] (or REVERSE for short) is a novel framework for online verification and correction of hallucinations. Trained on a modified version of the LLaVA-v1.5-665k dataset, REVERSE is designed to inherently recognize hallucination patterns in real time.
Their contributions are:
- A new dataset, built on LLaVA-v1.5-665k, containing tagged, intentionally hallucinated samples, used to teach models hallucination patterns.
- REVERSE: a framework that automatically detects hallucinations during generation and corrects them using either rejection sampling or query rewriting.
Dataset Used

The primary goal of this paper is to develop a VLM capable of dynamically detecting its hallucinations and adapting accordingly. To achieve this behavior, the model must be trained on a specially curated dataset containing manually generated hallucinations alongside factual responses.
The newly curated dataset, which builds on LLaVA-v1.5 as mentioned above, contains 6.8 million question–answer pairs: roughly 3.8 million correct answers and 2.9 million hallucinated ones. Each question–answer pair is tagged to indicate whether the answer is a hallucination or a well-grounded statement.
Each phrase in the answers is annotated using a set of custom tokens/tags added to the VLM’s vocabulary. Confident/grounded phrases are wrapped between <SPAN> and </CN> tokens, while ungrounded phrases are closed with the </UN> token instead. An example of the tagging mechanism is shown in Figure 2.
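To make the scheme concrete, a tagged answer might look like the line below (an illustrative example based on the description above, not an actual sample from the dataset):
The image shows <SPAN>a red plastic cup</CN> sitting next to <SPAN>a green glass bottle</UN>.
Here the first span is grounded in the image, while the second is a deliberately injected hallucination.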
Since LLaVA-v1.5 only contains positive pairs, the authors use a multi-step augmentation pipeline to generate hallucinated samples. It begins by categorizing each <question, answer> pair based on the answer type. Possible categories are:
- Bounding Box
- Number
- Yes/No Questions
- Directions
- Multiple-Choice
- General Answers **
For each category, the pipeline applies specific rule-based modifications to introduce hallucinations.
- For bounding box answers, the coordinates are changed.
- For Yes/No or Multiple-Choice questions, the answers are either negated or a different (wrong) choice is chosen.
- For general answers, the authors used GPT-4o-mini to alter factual details so that the answer still resembles the original but carries a different meaning.
** For example, the detection of a “red plastic cup” in an image could be replaced with a “green glass bottle”, or a “giant hot dog” with a “small burger”. It could also shift the perspective semantics of the scene, from “back view” to “top view”. Such minor alterations leave the structure of the sentence as is, but change the meaning significantly, rendering it wrong when compared to the input image, and thus a hallucination.
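To give a flavor of what these rule-based modifications might look like in code, here is a minimal Python sketch. It is an illustration under assumptions, not the authors’ actual pipeline: the category names, perturbation ranges, the whole-answer (rather than phrase-level) tagging, and the rewrite_with_llm stub are all hypothetical.

```python
import random

CONF_OPEN, CONF_CLOSE, UNCONF_CLOSE = "<SPAN>", "</CN>", "</UN>"

def tag(phrase: str, grounded: bool) -> str:
    """Wrap a phrase in the confidence tags described above."""
    return f"{CONF_OPEN}{phrase}{CONF_CLOSE if grounded else UNCONF_CLOSE}"

def perturb_bbox(bbox):
    """Jitter bounding-box coordinates so they no longer match the image."""
    return [round(v + random.uniform(-0.2, 0.2), 3) for v in bbox]

def negate_yes_no(answer: str) -> str:
    """Flip a yes/no answer."""
    return "No" if answer.strip().lower().startswith("yes") else "Yes"

def wrong_choice(answer: str, choices: list) -> str:
    """Pick a different (incorrect) multiple-choice option."""
    return random.choice([c for c in choices if c != answer])

def rewrite_with_llm(answer: str) -> str:
    """Stub for the GPT-4o-mini rewriting step used for general answers."""
    raise NotImplementedError("call an LLM to alter factual details here")

def make_hallucinated_sample(question: str, answer: str, category: str, **extra) -> dict:
    """Turn a positive <question, answer> pair into a tagged hallucinated one."""
    if category == "bounding_box":
        wrong = str(perturb_bbox(extra["bbox"]))
    elif category == "yes_no":
        wrong = negate_yes_no(answer)
    elif category == "multiple_choice":
        wrong = wrong_choice(answer, extra["choices"])
    else:  # general answers and the remaining categories
        wrong = rewrite_with_llm(answer)
    return {"question": question, "answer": tag(wrong, grounded=False)}

# toy usage: flip a yes/no answer and tag it as ungrounded
print(make_hallucinated_sample("Is there a dog in the image?", "Yes, there is.", "yes_no"))
```

The original, correctly tagged answers are kept as the positive samples, which is what produces the correct/hallucinated split mentioned above.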
REVERSE, Simplified
Utilizing the modified dataset, the authors applied some fundamental changes to both the training and inference stages of a typical VLM.
Training
The goal here is split into two tasks:
- Increasing the generation of phrases encapsulated by <SPAN> and </CN> tokens and decreasing </UN> phrases.
- Calculating the likelihood of generating the ungrounded token </UN>, which will later be used to detect hallucinations.
This can be achieved by modifying the cross-entropy training loss: phrases encapsulated in a confidence sequence receive positive weights, while ungrounded phrases receive zero weight. This also pushes the model to avoid hallucinations rather than actively seek them.
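In PyTorch-like code, this weighting amounts to a masked cross-entropy, roughly as sketched below (the tensor names and mask construction are assumptions for illustration, not the official implementation):

```python
import torch
import torch.nn.functional as F

def hallucination_aware_ce(logits, targets, confident_mask):
    """
    logits:         (batch, seq_len, vocab_size) next-token scores from the VLM
    targets:        (batch, seq_len) ground-truth token ids
    confident_mask: (batch, seq_len) 1.0 for tokens inside <SPAN>...</CN> spans,
                    0.0 for tokens inside ungrounded <SPAN>...</UN> spans
    """
    per_token_nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    # Ungrounded-span tokens get zero weight, so the model is never trained
    # to reproduce the hallucinated phrases themselves.
    weighted = per_token_nll * confident_mask
    return weighted.sum() / confident_mask.sum().clamp(min=1.0)
```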
Given a VLM parameterized by θ and a sequence of previously generated tokens [y(1), …, y(i−1)], the loss of a sample S is defined as the negative log-likelihood summed over all output tokens y(i), conditioned on the input image X and all previously generated tokens.

To control the effect of hallucinations on training, an indicator function 1_Hall(i) is used. It is set to 1 when the current token y(i) falls within a confident phrase (enclosed by <SPAN> and </CN>), and to 0 when it lies within an unconfident phrase. This mechanism encourages the model to prioritize grounded responses and prevents it from reinforcing hallucinations during training.
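Putting the two pieces together, the objective described above can be written roughly as follows (a reconstruction from the description in this post, not the paper’s exact notation):

```latex
\mathcal{L}_S(\theta) = -\sum_{i} \mathbf{1}_{\mathrm{Hall}}(i)\,
  \log p_\theta\!\left(y^{(i)} \mid X,\, y^{(1)}, \ldots, y^{(i-1)}\right),
\qquad
\mathbf{1}_{\mathrm{Hall}}(i) =
\begin{cases}
1 & \text{if } y^{(i)} \text{ lies in a confident span (\texttt{<SPAN>}\ldots\texttt{</CN>})}\\
0 & \text{if } y^{(i)} \text{ lies in an unconfident span (\texttt{<SPAN>}\ldots\texttt{</UN>})}
\end{cases}
```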
Inference
After training the model with the new loss function and adding the </UN> and </CN> tags to the VLM’s vocabulary, the likelihood of hallucination can now be calculated by checking the probability of generating the unconfident tag, P(</UN>). A higher value indicates a greater likelihood that the model is producing a hallucinated response.
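In code, the check performed at the end of each span might look roughly like this (the token id, vocabulary size, and softmax-over-logits formulation are assumptions for illustration):

```python
import torch

TAU_GENERATIVE = 0.003  # threshold used for generative tasks (see the algorithm below)

def p_unconfident(span_end_logits: torch.Tensor, unconf_token_id: int) -> float:
    """Probability mass the model assigns to the </UN> tag at the end of a span."""
    return torch.softmax(span_end_logits, dim=-1)[unconf_token_id].item()

# toy usage with random logits standing in for the model's real output
span_end_logits = torch.randn(32_000)   # pretend vocabulary of 32k tokens
unconf_token_id = 31_999                # hypothetical id of the </UN> token
if p_unconfident(span_end_logits, unconf_token_id) > TAU_GENERATIVE:
    print("likely hallucination: trigger self-correction")
```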
With the model now able to detect hallucinations during inference, the authors propose a self-correcting algorithm that proactively fixes potentially flawed outputs. This algorithm is based on two core strategies:
- Rejection sampling: Resampling multiple generations with an increased model temperature to reduce the probability of hallucinations and allow the model to explore alternative outputs.
- Query Rewriting: Dynamically reformulating the prompt to encourage more factual generations by supplying clarification hints to the prompt. For example, in Figure 1, when the VLM falsely generates “girl” as an option with high uncertainty, this can be added as a hint in the next inference step:
<sys-prompt><image><question> (Hint: potential incorrect phrases → <girl>)

The algorithm, which uses two counters, K and N, works as follows (a corresponding flowchart with each step labeled is shown in Figure 3).
1. End-of-span evaluation: After each span, whether it ends in </CN> or </UN>, calculate P(</UN>) to check the probability of hallucination.
2. Threshold check and self-correction trigger: If P(</UN>) exceeds the threshold τ (τ = 0.003 for generative tasks and 0.5 for discriminative tasks), initiate the self-correction process. The threshold is needed because P(</UN>) is usually low, and checking it against a threshold helps identify hallucinations before they fully form.
3. Backtracking and retry: Backtrack to the last confident span (marked by </CN>) and execute steps 4 and 5, up to K times. If the issue persists after K attempts, proceed to step 6.
4. Rejection sampling: Apply rejection sampling, increasing the temperature by ∆T = 0.1 using the formula T = min(T + ∆T, T0 + 0.5), where T0 is the initial temperature.
5. Query rewriting: Modify the input prompt slightly by adding hints or context designed to guide the model toward a more grounded, non-hallucinatory output.
6. Update counters and retry: Decrease K and N by one, then return to step 1 to re-evaluate the hallucination probability.
 - If K = 0, reset K and skip to step 7.
 - If N = 0, skip to step 8.
7. Deeper backtracking: Since the problem persisted for all K attempts, the issue may stem from an earlier segment. In this case, backtrack to the most recent punctuation mark and begin again from step 1.
8. Final fallback: If the issue persists for N total attempts, the output is finalized and returned to the user along with an indicator that a hallucination was detected but could not be corrected.
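A compressed sketch of this control flow is shown below. It is a non-authoritative simplification: the model/state interface (generate_until_span_end, backtrack_to_punctuation, p_unconfident, is_eos), the default values of K, N, and T0, and the way backtracking is realized by simply discarding the flagged span are all assumptions, not the released implementation.

```python
def add_hint(prompt: str, suspect_phrase: str) -> str:
    """Query rewriting: append the flagged phrase as a hint (cf. the example above)."""
    return f"{prompt} (Hint: potential incorrect phrases -> <{suspect_phrase}>)"

def reverse_decode(model, prompt, image, tau=0.003, K=3, N=10, T0=0.2, dT=0.1):
    """Sketch of REVERSE's inference-time verification and self-correction loop."""
    T, k, n = T0, K, N
    accepted = []                                            # spans kept so far

    while True:
        span = model.generate_until_span_end(prompt, image, accepted, temperature=T)
        if span.is_eos:                                      # generation finished cleanly
            return "".join(accepted), "ok"

        if span.p_unconfident <= tau:                        # steps 1-2: span looks grounded
            accepted.append(span.text)
            continue

        # Step 3: dropping the flagged span effectively backtracks to the last </CN>.
        T = min(T + dT, T0 + 0.5)                            # step 4: rejection sampling
        prompt = add_hint(prompt, span.text)                 # step 5: query rewriting

        k -= 1                                               # step 6: update counters
        n -= 1
        if k == 0:                                           # step 7: deeper backtracking
            k = K
            accepted = model.backtrack_to_punctuation(accepted)
        if n == 0:                                           # step 8: final fallback
            return "".join(accepted), "hallucination detected but not corrected"
```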
Key Findings & Analysis
The results of REVERSE are compared against several hallucination mitigation methods, including post-hoc refinement methods (Woodpecker [5]) and training-free generative adjustment techniques (DoLa [4]). For evaluation, the authors use benchmarks across two major tasks: image captioning (using CHAIR-MSCOCO [6] and AMBER-G [7]) and open-ended question answering (using MMHal-Bench [8] and HaloQuest [9]). While further comparisons and experiments are included in the paper, we will only focus on the most noteworthy results here.
In the following table, results of REVERSE are shown using two different threshold values (τ=0.003 and τ=0.0003), demonstrating the effect of decreasing threshold values, which in turn decreases the rate of hallucination. A more thorough analysis will be covered in the next section.
For image captioning, hallucination is measured using the CHAIR score (Caption Hallucination Assessment with Image Relevance) [10]. It is defined as one minus the intersection over union between (1) the set of ground-truth objects in an image and (2) the objects mentioned in the model’s generated caption.
- CHAIR(i) measures the rate of hallucinated objects among all objects mentioned across the captions.
- CHAIR(s) measures the fraction of captions that contain at least one hallucinated object.
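As a toy sketch of how these two quantities could be computed from object sets (the object extraction and synonym matching in the real benchmarks are considerably more involved, so treat this as illustrative only):

```python
def chair_metrics(captions_objects, ground_truth_objects):
    """
    captions_objects:     list of sets of objects mentioned in each generated caption
    ground_truth_objects: list of sets of objects actually present in each image
    Returns (CHAIR_i, CHAIR_s) as described in the bullets above.
    """
    hallucinated_mentions, total_mentions, flagged_captions = 0, 0, 0
    for mentioned, truth in zip(captions_objects, ground_truth_objects):
        hallucinated = mentioned - truth            # mentioned but not in the image
        hallucinated_mentions += len(hallucinated)
        total_mentions += len(mentioned)
        flagged_captions += bool(hallucinated)      # caption has >= 1 hallucination
    chair_i = hallucinated_mentions / max(total_mentions, 1)     # object-level rate
    chair_s = flagged_captions / max(len(captions_objects), 1)   # caption-level rate
    return chair_i, chair_s

# toy usage: the second caption hallucinates a "car"
print(chair_metrics(
    captions_objects=[{"dog", "frisbee"}, {"cat", "car"}],
    ground_truth_objects=[{"dog", "frisbee", "person"}, {"cat", "sofa"}],
))  # -> (0.25, 0.5)
```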

As shown in Table 1, REVERSE, even in its “light detection setting” (τ = 0.003), already outperforms all other baseline models on the hallucination metrics across both CHAIR-MSCOCO and AMBER-G. Not only that, it also maintains high coverage, performing better than DoLa and Woodpecker and falling short only of HALVA [11], by 1.5%.
When evaluated under the more aggressive detection setting (τ = 0.0003), hallucination scores improve dramatically, outperforming other models by at least 65% (when comparing REVERSE with HALVA on the AMBER-G CHAIR score). This shows that although a more aggressive threshold lowers coverage (a drop of roughly 48%, from 52.2 to 26.9), it can still yield a very large reduction in hallucinations.

Table 2 presents results on MMHal-Bench [8] and HaloQuest [9], two benchmarks of open-ended questions used mainly to test how the model behaves in context-lacking scenarios. Similar to the findings observed in image captioning, REVERSE in its light setting outperformed the strongest models on both datasets, achieving a 28% improvement on HaloQuest and a 10% improvement on MMHal-Bench.
However, REVERSE shows a clear performance drop in the visually challenging images category (VC Acc), performing 27% worse than DoLa with LLaVA-v1.5. The authors attribute this behavior to REVERSE’s more “conservative” and cautious stance toward text generation, which hinders its accuracy in visually challenging tasks that require more speculative answers.
De-Hallucination vs Expressiveness
As you may have noticed in the previous sections, there appears to be an inverse relationship between lowering the hallucination detection threshold and the model’s ability to cover or express all elements in the image. In the paper, this was analyzed further by varying the threshold value τ and measuring hallucinations using the CHAIR metric. The effect of varying τ was demonstrated using two different VLMs as backbones.

As shown in Figure 3, coverage decreases as the threshold value decreases, and so does the CHAIR score (i.e., fewer hallucinations). The two ends of this spectrum present an interesting trade-off.
When a very low value of τ is used, we observe that the model remarkably hallucinates even less than GPT-4V! On the other hand, if a high τ is used (0.01), the model still achieves a CHAIR score comparable to Woodpecker’s.
The researchers analyze this pattern further and observe that even without the full REVERSE algorithm, simply training on the new dataset improves the model’s ability to avoid hallucinations: training either LLaVA-v1.5 or LLaVA-More on it lowers the CHAIR score from 7.8/7.2 to 6.0/6.0, a reduction of 1.8/1.2. This improvement can be attributed to the modified loss function combined with the hallucination annotations in the training data (tagged by </UN>), which help the model develop a better fundamental understanding of hallucinations.
Conclusion
In this blog post, we explored REVERSE, a novel online verification and self-correction framework that outperforms existing post-hoc and training-based hallucination mitigation methods while maintaining high factual coverage. We also highlighted the authors’ curated dataset of deliberate hallucinations, as well as their insightful analysis of the trade-off between expressiveness and hallucination risk.
Notably, the authors showed that incorporating hallucination-aware data during training enhances a model’s ability to generate grounded responses. These findings provide a promising direction for future work aimed at reducing hallucinations without sacrificing accuracy or coverage, pushing us closer to trustworthy AI systems.
References:
[1] T.-H. Wu, H. Lee, J. Ge, J. E. Gonzalez, T. Darrell and D. M. Chan, Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling (2025), arXiv preprint arXiv:2504.13169
[2] S. Petryk, D. M. Chan, A. Kachinthaya, H. Zou, J. Canny, J. E. Gonzalez and T. Darrell, ALOHa: A New Measure for Hallucination in Captioning Models (2024), arXiv preprint arXiv:2404.02904
[3] W. An, F. Tian, S. Leng, J. Nie, H. Lin, Q. Wang, G. Dai, P. Chen and S. Lu, AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention (2025), arXiv preprint arXiv:2406.12718
[4] Y.-S. Chuang, Y. Xie, H. Luo, Y. Kim, J. Glass and P. He, DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models (2024), arXiv preprint arXiv:2309.03883
[5] S. Yin, C. Fu, S. Zhao, T. Xu, H. Wang, D. Sui, Y. Shen, K. Li, X. Sun and E. Chen, Woodpecker: Hallucination Correction for Multimodal Large Language Models (2024), arXiv preprint arXiv:2310.16045
[6] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell and K. Saenko, Object Hallucination in Image Captioning (2018), arXiv preprint arXiv:1809.02156
[7] J. Wang, Y. Wang, G. Xu, J. Zhang, Y. Gu, H. Jia, J. Wang, H. Xu, M. Yan, J. Zhang and J. Sang, AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation (2024), arXiv preprint arXiv:2311.07397
[8] Z. Sun, S. Shen, S. Cao, H. Liu, C. Li, Y. Shen, C. Gan, L.-Y. Gui, Y.-X. Wang, Y. Yang, K. Keutzer and T. Darrell, Aligning Large Multimodal Models with Factually Augmented RLHF (2023), arXiv preprint arXiv:2309.14525
[9] Z. Wang, G. Bingham, A. Yu, Q. Le, T. Luong and G. Ghiasi, HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning (2024), arXiv preprint arXiv:2407.15680
[10] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell and K. Saenko, Object Hallucination in Image Captioning (2019), arXiv preprint arXiv:1809.02156
[11] P. Sarkar, S. Ebrahimi, A. Etemad, A. Beirami, S. Ö. Arık and T. Pfister, Mitigating Object Hallucination in MLLMs via Data-augmented Phrase-level Alignment (2025), arXiv preprint arXiv:2405.18654