
ARGUS: Vision-Centric Reasoning with Grounded Chain-of-Thought

Author(s): Yash Thube

Originally published on Towards AI.

Existing multimodal LLMs (MLLMs), whose progress has been driven primarily by advances in large language models, often underperform when successful reasoning depends on accurate visual perception and understanding of specific regions of interest (RoIs). Argus tackles this with a new visual attention grounding mechanism: a framework designed to address the limitations of current MLLMs in vision-centric scenarios.

It draws inspiration from cognitive visual intelligence, in particular the distinction between stimulus-driven (involuntary) and goal-directed (voluntary) visual attention. In MLLMs, stimulus-driven attention corresponds to image tokenization by pre-trained visual models, whereas goal-directed attention corresponds to language-conditioned engagement with image features inside the LLM. The paper highlights that while stimulus-driven attention via unconditioned image tokenization has been widely explored, the effect of explicit language-guided visual engagement is less studied.

Argus proposes a grounding-driven visual attention re-engagement module. Instead of relying solely on the LLM's implicit self-attention to model language-directed attention over visual tokens, it explicitly performs a top-down visual search to locate the image region of interest (RoI) most relevant to the text prompt. This search guides the model to focus on that region for subsequent reasoning and answer generation.

The framework utilises text-to-box object-centric grounding as an intermediate reasoning stage. The predicted bounding boxes serve as simple yet effective visual chain-of-thought (CoT) signals to improve the quality of the final reasoning step.

📌Architectural Design

Source: https://yunzeman.github.io/argus/

Argus builds upon the standard autoregressive MLLM paradigm, where images are converted to visual tokens and processed alongside language tokens by an LLM.

1️⃣Visual Encoders – Argus employs a Mixture-of-Vision-Experts (MoVEs) strategy, combining outputs from three different vision foundation models: CLIP, ConvNeXt, and EVA-02. These encoders are crucial for abstracting image information with minimal loss and for aligning vision and language. The 2D embeddings are interpolated to a common resolution, concatenated, and then mapped to the text token space by an MLP projector.

2️⃣LLM Decoder – A state-of-the-art pretrained LLM, specifically Llama3-8B, is used as the transformer decoder for next-token prediction.

3️⃣Region-of-Interest (RoI) Sampling – The model can predict bounding boxes corresponding to regions mentioned in the question prompt. These bounding boxes are represented in text format using normalized coordinates ([xmin, ymin, xmax, ymax]). The predicted box guides the cropping of the relevant RoI from the input image for re-engagement.
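
To make this concrete, here is a minimal sketch of how a normalized box emitted as text might be parsed and turned into an image crop. The regex, the PIL-based cropping, and the example strings are illustrative assumptions, not the paper's actual implementation.

```python
import re
from PIL import Image

BOX_PATTERN = re.compile(r"\[([\d.]+),\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)\]")

def parse_normalized_box(text: str):
    """Extract the first [xmin, ymin, xmax, ymax] box (values in [0, 1]) from generated text."""
    match = BOX_PATTERN.search(text)
    if match is None:
        return None
    return tuple(float(v) for v in match.groups())

def crop_roi(image: Image.Image, box):
    """Convert a normalized box to pixel coordinates and crop the region of interest."""
    w, h = image.size
    xmin, ymin, xmax, ymax = box
    return image.crop((int(xmin * w), int(ymin * h), int(xmax * w), int(ymax * h)))

# Example: the model's intermediate grounding output might look like this
generated = "The region relevant to the question is [0.42, 0.31, 0.68, 0.55]."
box = parse_normalized_box(generated)
# roi = crop_roi(Image.open("example.jpg"), box)  # the crop is re-engaged in the next reasoning step
```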

Source: https://yunzeman.github.io/argus/

📌Directed Visual Context Re-engagement

Source: https://yunzeman.github.io/argus/

The predicted bounding boxes highlight the most relevant visual context. Argus investigates four strategies for engaging with these sampled RoIs:

→ Implicit Self-Attention – The baseline, which relies on the LLM’s global self-attention to attend to the visual context, with minimal control over specific RoIs.

→ Implicit Box Guidance – Predicting bounding boxes as text tokens acts as a CoT signal, implicitly nudging self-attention towards the RoIs without any explicit visual re-engagement.

→ Explicit RoI Re-encoding – Crops the image region defined by the RoI and processes it through the vision encoders to generate a new set of visual tokens. This explicitly introduces context-specific signals but increases computation and requires preprocessing (padding, resizing).

→ Explicit RoI Re-sampling – Instead of re-encoding, this method retrieves visual embeddings from the initial encoding stage based on their overlap with the RoI bounding box. It leverages cached tokens for efficiency and preserves positional context, which can be lost in the preprocessing required for re-encoding (a rough sketch of this token selection appears below).
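
As a rough illustration of the re-sampling idea, the sketch below gathers the already-computed patch tokens whose grid cells overlap the predicted box instead of re-encoding a crop. The grid size, tensor shapes, and overlap rule are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def resample_roi_tokens(visual_tokens: torch.Tensor, box, grid: int = 24):
    """Gather cached patch tokens whose grid cells overlap the predicted RoI box.

    visual_tokens: (num_patches, dim) tokens from the initial encoding, laid out
                   row-major on a grid x grid patch grid.
    box:           normalized (xmin, ymin, xmax, ymax) coordinates in [0, 1].
    """
    xmin, ymin, xmax, ymax = box
    # Patch row/column ranges whose cells intersect the box.
    col_lo, col_hi = int(xmin * grid), min(int(xmax * grid), grid - 1)
    row_lo, row_hi = int(ymin * grid), min(int(ymax * grid), grid - 1)
    rows = torch.arange(row_lo, row_hi + 1)
    cols = torch.arange(col_lo, col_hi + 1)
    idx = (rows[:, None] * grid + cols[None, :]).flatten()
    return visual_tokens[idx]  # appended back to the context as RoI tokens

# Usage with dummy tokens: a 24x24 grid of 1024-d patch embeddings
tokens = torch.randn(24 * 24, 1024)
roi_tokens = resample_roi_tokens(tokens, (0.42, 0.31, 0.68, 0.55))
```

Because the tokens are simply gathered rather than recomputed, no extra encoder forward pass is needed, which is where the efficiency gain over re-encoding comes from.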

📌Training Pipeline

Training is divided into two stages:

→ Alignment and Pre-training – Vision encoders and the MLP projector are trained on LLaVA-595K while the LLM is frozen. This stage includes a vision expert pre-alignment.

→ Supervised Fine-Tuning (SFT) – The full model (vision encoders, MLP projectors, LLM) is fine-tuned on a mixture of Eagle1.8M (conversational data), VCoT (visual CoT), and grounding datasets (GRIT, Shikra). This stage enables the model to predict RoI boxes and leverage visual CoT.
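
In (hypothetical) PyTorch terms, the two-stage schedule amounts to toggling which parameter groups are trainable. The module names, learning rates, and optimizer choice below are placeholders for illustration, not the paper's actual training code.

```python
import torch

def configure_stage(model, stage: int):
    """Freeze/unfreeze modules according to the two-stage recipe described above.

    Assumes the model exposes .vision_experts, .mlp_projector, and .llm submodules;
    these names (and the learning rates) are illustrative, not from the paper.
    """
    if stage == 1:
        # Alignment and pre-training: train vision experts + projector, keep the LLM frozen.
        trainable = [model.vision_experts, model.mlp_projector]
        frozen = [model.llm]
    else:
        # Supervised fine-tuning: the full model is trainable.
        trainable = [model.vision_experts, model.mlp_projector, model.llm]
        frozen = []

    for module in frozen:
        for p in module.parameters():
            p.requires_grad_(False)
    for module in trainable:
        for p in module.parameters():
            p.requires_grad_(True)

    params = [p for m in trainable for p in m.parameters()]
    return torch.optim.AdamW(params, lr=1e-3 if stage == 1 else 1e-5)
```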

📌Datasets Used

They utilise several datasets for both training and evaluation:

Training Datasets

→ Stage 1 (Alignment and Pre-training) – The initial pre-training stage uses the LLaVA-595K dataset, which consists of curated image-text pairs.

→ Stage 2 (Supervised Fine-Tuning, SFT) – This stage employs a diverse combination of datasets to ensure robust performance:

  • Eagle1.8M dataset – A comprehensive collection of conversational data aggregated from various sources, including LLaVA-Instruct, DocVQA, synDog-EN, ChartQA, DVQA, AI2D, ShareGPT-4V, LAION-GPT4v, LVIS-Instruct4V, LRV-Instruct, Geo170K, LLaVAR, Visual7W, and Open-Hermes 2.5.
  • VCoT dataset – Provides region-of-interest (RoI) bounding box annotations specifically designed for grounding and reasoning tasks. The data is structured as multi-turn conversations incorporating RoI prediction and visual chain-of-thought signals (an illustrative sample format is sketched after this list).
  • Grounding datasets – A mixture of GRIT (756K grounded image-text pairs) and Shikra (326K visual-grounding-oriented samples) is incorporated to enhance the model’s ability to ground concepts in unconstrained scenarios.
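
For intuition, a visual-CoT training sample of this kind could be laid out roughly as follows; the field names, special tokens, and turn structure are hypothetical and do not reflect the actual VCoT schema.

```python
# Hypothetical structure of a multi-turn visual-CoT sample: the model first
# grounds the question to an RoI box, then answers using the re-engaged region.
vcot_sample = {
    "image": "example.jpg",
    "conversations": [
        {"role": "user",
         "content": "What is written on the small sign next to the door?"},
        {"role": "assistant",
         "content": "Relevant region: [0.61, 0.44, 0.72, 0.58]"},   # RoI prediction turn
        {"role": "user",
         "content": "<roi_tokens>"},                                 # re-engaged RoI context
        {"role": "assistant",
         "content": "The sign reads 'Staff Only'."},                 # final grounded answer
    ],
}
```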

Evaluation Benchmarks

→ Multimodal Reasoning Tasks – Evaluated on a range of vision-language benchmarks covering Vision-Centric Tasks (V-Star, CV-Bench 2D/3D, MMVP, RealWorldQA), Text Understanding (ChartQA, OCRBench, TextVQA, DocVQA), and General Tasks (MMMU, MMB, SEED, GQA).

→ Referring Expression Grounding Tasks – The model’s object grounding capabilities are evaluated on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks. The performance metric is grounding accuracy at an IoU threshold of 0.5 ([email protected]); a minimal sketch of this metric follows.
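
Concretely, a predicted box counts as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5. A minimal sketch of the metric, assuming normalized (xmin, ymin, xmax, ymax) boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def grounding_accuracy(predictions, targets, threshold=0.5):
    """Fraction of predicted boxes whose IoU with the ground truth meets the threshold."""
    hits = sum(iou(p, t) >= threshold for p, t in zip(predictions, targets))
    return hits / max(len(targets), 1)
```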

📌Evaluation and Results

Source: https://yunzeman.github.io/argus/

Argus is benchmarked on visual reasoning and referring grounding tasks.

→ Visual Reasoning – It achieves state-of-the-art performance among public MLLMs of comparable size and training scale. It shows substantial improvements in vision-centric and text understanding tasks, demonstrating the effectiveness of the goal-conditioned visual search and attention mechanisms.

→ Referring Grounding – Argus demonstrates leading performance among comparable generalist MLLMs and is competitive with specialist grounding models. This indicates its strength in both high-level reasoning and precise visual localisation.

→ Qualitative Results – Examples show Argus successfully performing challenging reasoning tasks with visually grounded CoT.

📌Ablation Studies and Analysis

Controlled experiments validate the design choices:

→ CoT and Grounding – Incorporating CoT reasoning consistently boosts performance. Explicit visual CoT (re-encoding/re-sampling) offers greater gains than implicit box guidance. Adding grounding datasets further enhances reasoning by improving object-centric perception and bounding box predictions.

→ Re-engagement Strategies – Both explicit re-encoding and re-sampling outperform implicit methods. Re-sampling is generally superior due to better context preservation and less distribution shift, except for tasks requiring fine-grained details of small objects (like V-Star), where re-encoding performs better.

→ Encoder Capacity – Higher-capacity vision encoders improve performance. Re-encoding is less dependent on the initial feature quality than re-sampling.

→ Context Expansion – Re-encoding benefits from moderate RoI context expansion (20–40%), which helps with slightly inaccurate boxes and relative positioning. Re-sampling performs best with the original box size, as it already leverages overlapping patches for context. Excessive expansion hurts performance for both (a small box-expansion sketch follows this list).

→ Non-shared MLPs – Using separate MLPs for initial and re-engaged visual tokens marginally improves re-sampling performance by optimising for different image/RoI distributions.

→ Computational Efficiency – Re-sampling is significantly more computationally efficient than re-encoding, requiring fewer operations and fewer additional visual tokens, leading to faster inference.
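
For reference, expanding a normalized box by a relative margin while clipping to the image bounds could look like the sketch below; the expansion convention (a fraction of the box size per axis) is an assumption for illustration.

```python
def expand_box(box, ratio: float = 0.3):
    """Expand a normalized (xmin, ymin, xmax, ymax) box by `ratio` of its size per axis,
    clipping the result to the image bounds [0, 1]."""
    xmin, ymin, xmax, ymax = box
    dw = (xmax - xmin) * ratio / 2.0
    dh = (ymax - ymin) * ratio / 2.0
    return (max(0.0, xmin - dw), max(0.0, ymin - dh),
            min(1.0, xmax + dw), min(1.0, ymax + dh))

# 30% expansion of the earlier example box
print(expand_box((0.42, 0.31, 0.68, 0.55), ratio=0.3))
```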

📌Limitations and Future Work

The authors acknowledge several limitations and directions for future work: evaluating the approach at larger model scales, addressing the limited diversity and availability of large-scale visual CoT data, and expanding coverage to tasks such as open-world detection.

Paper

Stay Curious ☺️… See you in the next one!


Published via Towards AI
