
ARGUS: Vision-Centric Reasoning with Grounded Chain-of-Thought

Author(s): Yash Thube

Originally published on Towards AI.

Existing multimodal LLMs (MLLMs), whose progress has been driven primarily by advances in large language models, often underperform when successful reasoning depends on accurate visual perception and understanding of specific regions of interest (RoIs). Argus tackles this with a new visual attention grounding mechanism: a framework designed to address the limitations of current MLLMs in vision-centric scenarios.

It draws inspiration from cognitive visual intelligence, in particular the distinction between stimulus-driven (involuntary) and goal-directed (voluntary) visual attention. In MLLMs, stimulus-driven attention corresponds to image tokenization by pre-trained visual models, whereas goal-directed attention corresponds to language-conditioned engagement with image features inside the LLM. The paper highlights that while stimulus-driven attention via unconditioned image tokenization has been widely explored, the effect of explicit language-guided visual engagement is less studied.

Argus proposes a grounding-driven visual attention re-engagement module. Instead of relying solely on the LLM's implicit self-attention to model language-directed attention over visual tokens, it explicitly performs a top-down visual search to locate the image region of interest (RoI) most relevant to the text prompt. This search guides the model to focus on that region for subsequent reasoning and answer generation.

The framework utilises text-to-box object-centric grounding as an intermediate reasoning stage. The predicted bounding boxes serve as simple yet effective visual chain-of-thought (CoT) signals to improve the quality of the final reasoning step.

📌Architectural Design

Source: https://yunzeman.github.io/argus/

Argus builds upon the standard autoregressive MLLM paradigm, where images are converted to visual tokens and processed alongside language tokens by an LLM.

1️⃣Visual Encoders – Argus employs a Mixture-of-Vision-Experts (MoVEs) strategy, combining outputs from three different vision foundation models: CLIP, ConvNeXt, and EVA-02. These encoders are crucial for abstracting image information with minimal loss and for aligning vision and language. The 2D embeddings are interpolated to a common resolution, concatenated, and then mapped to the text token space by an MLP projector.

2️⃣LLM Decoder – A state-of-the-art pretrained LLM, specifically Llama3-8B, is used as the transformer decoder for next-token prediction.

3️⃣Region-of-Interest (RoI) Sampling – The model can predict bounding boxes corresponding to regions mentioned in the question prompt. These bounding boxes are represented in text format using normalized coordinates ([xmin, ymin, xmax, ymax]). The predicted box guides the cropping of the relevant RoI from the input image for re-engagement.
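
To make this concrete, here is a minimal sketch of how a normalized box emitted as text might be parsed and turned into an image crop. The regex, the PIL-based cropping, and the example strings are illustrative assumptions, not the paper's actual implementation.

```python
import re
from PIL import Image

BOX_PATTERN = re.compile(r"\[([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\]")

def parse_normalized_box(text: str):
    """Extract the first [xmin, ymin, xmax, ymax] box (values in [0, 1]) from generated text."""
    match = BOX_PATTERN.search(text)
    if match is None:
        return None
    return tuple(float(v) for v in match.groups())

def crop_roi(image: Image.Image, box):
    """Convert a normalized box to pixel coordinates and crop the region of interest."""
    w, h = image.size
    xmin, ymin, xmax, ymax = box
    return image.crop((int(xmin * w), int(ymin * h), int(xmax * w), int(ymax * h)))

# Example: the model's intermediate grounding output might look like this
generated = "The region relevant to the question is [0.42, 0.31, 0.68, 0.55]."
box = parse_normalized_box(generated)
# roi = crop_roi(Image.open("example.jpg"), box)  # the crop is re-engaged in the next reasoning step
```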

Source: https://yunzeman.github.io/argus/

📌Directed Visual Context Re-engagement

Source: https://yunzeman.github.io/argus/

The predicted bounding boxes highlight the most relevant visual context. Argus investigates four strategies for engaging with these sampled RoIs:

→ Implicit Self-Attention – The baseline, which relies on the LLM’s global self-attention to attend to the visual context, with minimal control over specific RoIs.

→ Implicit Box Guidance – Predicting bounding boxes as text tokens acts as a CoT signal, implicitly nudging self-attention towards the RoIs without any explicit visual re-engagement.

→ Explicit RoI Re-encoding – Crops the image region defined by the RoI and processes it through the vision encoders to generate a new set of visual tokens. This explicitly introduces context-specific signals but increases computation and requires preprocessing (padding, resizing).

→ Explicit RoI Re-sampling – Instead of re-encoding, this method retrieves visual embeddings from the initial encoding stage based on their overlap with the RoI bounding box. It leverages cached tokens for efficiency and preserves positional context, which can be lost in the preprocessing required for re-encoding (a rough sketch of this token selection appears below).
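
As a rough illustration of the re-sampling idea, the sketch below gathers the already-computed patch tokens whose grid cells overlap the predicted box instead of re-encoding a crop. The grid size, tensor shapes, and overlap rule are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def resample_roi_tokens(visual_tokens: torch.Tensor, box, grid: int = 24):
    """Gather cached patch tokens whose grid cells overlap the predicted RoI box.

    visual_tokens: (num_patches, dim) tokens from the initial encoding, laid out
                   row-major on a grid x grid patch grid.
    box:           normalized (xmin, ymin, xmax, ymax) coordinates in [0, 1].
    """
    xmin, ymin, xmax, ymax = box
    # Patch row/column ranges whose cells intersect the box.
    col_lo, col_hi = int(xmin * grid), min(int(xmax * grid), grid - 1)
    row_lo, row_hi = int(ymin * grid), min(int(ymax * grid), grid - 1)
    rows = torch.arange(row_lo, row_hi + 1)
    cols = torch.arange(col_lo, col_hi + 1)
    idx = (rows[:, None] * grid + cols[None, :]).flatten()
    return visual_tokens[idx]  # appended back to the context as RoI tokens

# Usage with dummy tokens: a 24x24 grid of 1024-d patch embeddings
tokens = torch.randn(24 * 24, 1024)
roi_tokens = resample_roi_tokens(tokens, (0.42, 0.31, 0.68, 0.55))
```

Because the tokens are simply gathered rather than recomputed, no extra encoder forward pass is needed, which is where the efficiency gain over re-encoding comes from.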

📌Training Pipeline

Training is divided into two stages:

→ Alignment and Pre-training – Vision encoders and the MLP projector are trained on LLaVA-595K while the LLM is frozen. This stage includes a vision expert pre-alignment.

→ Supervised Fine-Tuning (SFT) – The full model (vision encoders, MLP projectors, LLM) is fine-tuned on a mixture of Eagle1.8M (conversational data), VCoT (visual CoT), and grounding datasets (GRIT, Shikra). This stage enables the model to predict RoI boxes and leverage visual CoT.
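
In (hypothetical) PyTorch terms, the two-stage schedule amounts to toggling which parameter groups are trainable. The module names, learning rates, and optimizer choice below are placeholders for illustration, not the paper's actual training code.

```python
import torch

def configure_stage(model, stage: int):
    """Freeze/unfreeze modules according to the two-stage recipe described above.

    Assumes the model exposes .vision_experts, .mlp_projector, and .llm submodules;
    these names (and the learning rates) are illustrative, not from the paper.
    """
    if stage == 1:
        # Alignment and pre-training: train vision experts + projector, keep the LLM frozen.
        trainable = [model.vision_experts, model.mlp_projector]
        frozen = [model.llm]
    else:
        # Supervised fine-tuning: the full model is trainable.
        trainable = [model.vision_experts, model.mlp_projector, model.llm]
        frozen = []

    for module in frozen:
        for p in module.parameters():
            p.requires_grad_(False)
    for module in trainable:
        for p in module.parameters():
            p.requires_grad_(True)

    params = [p for m in trainable for p in m.parameters()]
    return torch.optim.AdamW(params, lr=1e-3 if stage == 1 else 1e-5)
```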

📌Datasets Used

They utilise several datasets for both training and evaluation:

Training Datasets

→ Stage 1 (Alignment and Pre-training) – The initial pre-training stage uses the LLaVA-595K dataset, which consists of curated image-text pairs.

→ Stage 2 (Supervised Fine-Tuning, SFT) – This stage employs a diverse combination of datasets to ensure robust performance:

  • Eagle1.8M dataset – A comprehensive collection of conversational data aggregated from various sources, including LLaVA-Instruct, DocVQA, synDog-EN, ChartQA, DVQA, AI2D, ShareGPT-4V, LAION-GPT4v, LVIS-Instruct4V, LRV-Instruct, Geo170K, LLaVAR, Visual7W, and Open-Hermes 2.5.
  • VCoT dataset – Provides region-of-interest (RoI) bounding box annotations specifically designed for grounding and reasoning tasks. The data is structured as multi-turn conversations incorporating RoI prediction and visual chain-of-thought signals (an illustrative sample format is sketched after this list).
  • Grounding datasets – A mixture of GRIT (756K grounded image-text pairs) and Shikra (326K visual-grounding-oriented samples) is incorporated to enhance the model’s ability to ground concepts in unconstrained scenarios.
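
For intuition, a visual-CoT training sample of this kind could be laid out roughly as follows; the field names, special tokens, and turn structure are hypothetical and do not reflect the actual VCoT schema.

```python
# Hypothetical structure of a multi-turn visual-CoT sample: the model first
# grounds the question to an RoI box, then answers using the re-engaged region.
vcot_sample = {
    "image": "example.jpg",
    "conversations": [
        {"role": "user",
         "content": "What is written on the small sign next to the door?"},
        {"role": "assistant",
         "content": "Relevant region: [0.61, 0.44, 0.72, 0.58]"},   # RoI prediction turn
        {"role": "user",
         "content": "<roi_tokens>"},                                 # re-engaged RoI context
        {"role": "assistant",
         "content": "The sign reads 'Staff Only'."},                 # final grounded answer
    ],
}
```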

Evaluation Benchmarks

→ Multimodal Reasoning Tasks – Evaluated on a range of vision-language benchmarks covering Vision-Centric Tasks (V-Star, CV-Bench 2D/3D, MMVP, RealWorldQA), Text Understanding (ChartQA, OCRBench, TextVQA, DocVQA), and General Tasks (MMMU, MMB, SEED, GQA).

→ Referring Expression Grounding Tasks – The model’s object grounding capabilities are evaluated on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks. The performance metric is grounding accuracy at an IoU threshold of 0.5 ([email protected]); a minimal sketch of this metric follows.
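
Concretely, a predicted box counts as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5. A minimal sketch of the metric, assuming normalized (xmin, ymin, xmax, ymax) boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def grounding_accuracy(predictions, targets, threshold=0.5):
    """Fraction of predicted boxes whose IoU with the ground truth meets the threshold."""
    hits = sum(iou(p, t) >= threshold for p, t in zip(predictions, targets))
    return hits / max(len(targets), 1)
```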

📌Evaluation and Results

Source: https://yunzeman.github.io/argus/

Argus is benchmarked on visual reasoning and referring grounding tasks.

→ Visual Reasoning – It achieves state-of-the-art performance among public MLLMs of comparable size and training scale. It shows substantial improvements in vision-centric and text understanding tasks, demonstrating the effectiveness of the goal-conditioned visual search and attention mechanisms.

→ Referring Grounding – Argus demonstrates leading performance among comparable generalist MLLMs and is competitive with specialist grounding models. This indicates its strength in both high-level reasoning and precise visual localisation.

→ Qualitative Results – Examples show Argus successfully performing challenging reasoning tasks with visually grounded CoT.

📌Ablation Studies and Analysis

Controlled experiments validate the design choices:

→ CoT and Grounding – Incorporating CoT reasoning consistently boosts performance. Explicit visual CoT (re-encoding/re-sampling) offers greater gains than implicit box guidance. Adding grounding datasets further enhances reasoning by improving object-centric perception and bounding box predictions.

→ Re-engagement Strategies – Both explicit re-encoding and re-sampling outperform implicit methods. Re-sampling is generally superior due to better context preservation and less distribution shift, except for tasks requiring fine-grained details of small objects (like V-Star), where re-encoding performs better.

→ Encoder Capacity – Higher-capacity vision encoders improve performance. Re-encoding is less dependent on the initial feature quality than re-sampling.

→ Context Expansion – Re-encoding benefits from moderate RoI context expansion (20–40%), which helps with slightly inaccurate boxes and relative positioning. Re-sampling performs best with the original box size, as it already leverages overlapping patches for context. Excessive expansion hurts performance for both (a small box-expansion sketch follows this list).

→ Non-shared MLPs – Using separate MLPs for initial and re-engaged visual tokens marginally improves re-sampling performance by optimising for different image/RoI distributions.

→ Computational Efficiency – Re-sampling is significantly more computationally efficient than re-encoding, requiring fewer operations and fewer additional visual tokens, leading to faster inference.
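
For reference, expanding a normalized box by a relative margin while clipping to the image bounds could look like the sketch below; the expansion convention (a fraction of the box size per axis) is an assumption for illustration.

```python
def expand_box(box, ratio: float = 0.3):
    """Expand a normalized (xmin, ymin, xmax, ymax) box by `ratio` of its size per axis,
    clipping the result to the image bounds [0, 1]."""
    xmin, ymin, xmax, ymax = box
    dw = (xmax - xmin) * ratio / 2.0
    dh = (ymax - ymin) * ratio / 2.0
    return (max(0.0, xmin - dw), max(0.0, ymin - dh),
            min(1.0, xmax + dw), min(1.0, ymax + dh))

# 30% expansion of the earlier example box
print(expand_box((0.42, 0.31, 0.68, 0.55), ratio=0.3))
```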

📌Limitations and Future Work

The authors acknowledge several limitations and directions for future work: evaluating the approach at larger model scales, addressing the limited diversity and availability of large-scale visual CoT data, and expanding coverage to tasks such as open-world detection.

Paper

Stay Curious ☺️… See you in the next one!


Published via Towards AI
