Building an ARC-2 Solver — Part 2
Last Updated on January 26, 2026 by Editorial Team
Author(s): Rahul Ray
Originally published on Towards AI.

Here’s something that should be easy: count the holes in a shape. A five-year-old can do it. A frontier AI model worth billions in compute? Not so much.
In Part 1, I introduced my multi-agent Socratic reasoning system for tackling ARC-AGI-2 puzzles. This post goes deeper — into what might be the most surprising bottleneck I’ve discovered: LLMs can’t reliably see what’s in front of them. This isn’t a reasoning problem. It’s a perception problem. And it changes how we should think about solving ARC-2.
What I Will Share
• A framework for breaking down ARC solving into four distinct skills — and why isolating them matters
• Experimental evidence that even when given the correct solution, LLMs fail to execute it
• Why combining text and images outperforms either alone
• A simple technique that dramatically boosts accuracy
Why ARC-2 Matters
ARC-AGI-2 isn’t just another benchmark. It exposes fundamental gaps in how LLMs process information — gaps that matter far beyond puzzle-solving.
ARC puzzles require extrapolation from limited data, typically just 2–3 examples. This is the opposite of how LLMs are trained, where they consume vast quantities of data to learn patterns. ARC asks: can you derive the rule from almost nothing?
Current frontier models perform poorly on ARC-AGI-2; the benchmark remains effectively unsolved.

The Four Skills Framework
Working on this project clarified something important: solving ARC-2 requires four distinct capabilities, and failure in any one blocks success.
Skill 1: Perception. Can the model accurately see what’s in the grid? Identify objects, count features, recognize boundaries, detect spatial relationships?
Skill 2: Reasoning. Given accurate perception, can the model figure out the transformation rule?
Skill 3: Execution. Given the transformation rules, can the model precisely follow instructions to construct the output?
Skill 4: Verification. Does the model know when it’s right? This is why induction approaches (derive rules, then apply) beat transduction (output the answer directly) — derived rules can be tested against training examples before committing to an answer.
Most research focuses on Skill 2 (reasoning). This post focuses on Skill 3 (execution) — and reveals it’s actually a Skill 1 (perception) problem in disguise.
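To make Skill 4 concrete, here is a minimal sketch of induction-style verification: a candidate rule is trusted only if it reproduces every training output. The `Grid` and `Rule` aliases are illustrative names of my own, not part of any ARC tooling.

```python
from typing import Callable, List, Tuple

# Illustrative type aliases — not part of any ARC library.
Grid = List[List[int]]
Rule = Callable[[Grid], Grid]

def passes_training(rule: Rule, train_pairs: List[Tuple[Grid, Grid]]) -> bool:
    """Accept a candidate rule only if it maps every training input to its output."""
    return all(rule(inp) == out for inp, out in train_pairs)

# Example: a trivial identity rule checked against one training pair.
pairs = [([[1, 0], [0, 1]], [[1, 0], [0, 1]])]
print(passes_training(lambda g: g, pairs))  # True
```

This is the structural advantage of induction over transduction: a wrong rule fails loudly on the training pairs before the test grid is ever touched.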
The Experiment: What If We Give the Model the Answer?
I designed a simple test: inject the correct transformation rules and ask the model to apply them. This isolates Skill 3 from Skills 1 and 2. If the model knows exactly what to do, can it do it?
The puzzle I focused on (task e3721c99) requires counting holes in gray objects and recoloring based on a legend. The transformation rules are straightforward: identify each gray object, count its holes, look up the corresponding color, apply it.
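For readers who want to see what "counting holes" means computationally, here is a sketch using breadth-first flood fill: background regions that cannot be reached from the grid border are holes. This is my own illustration of the rule, not anything the model received; 4-connectivity is assumed.

```python
from collections import deque

def count_holes(grid, bg=0):
    """Count enclosed background regions (holes) in an object mask.
    Background reachable from the border is 'outside', not a hole."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]

    def flood(r, c):
        # BFS over 4-connected background cells.
        q = deque([(r, c)])
        seen[r][c] = True
        while q:
            y, x = q.popleft()
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] and grid[ny][nx] == bg:
                    seen[ny][nx] = True
                    q.append((ny, nx))

    # Mark every background region touching the border as outside.
    for r in range(h):
        for c in range(w):
            if (r in (0, h - 1) or c in (0, w - 1)) and grid[r][c] == bg and not seen[r][c]:
                flood(r, c)

    # Each remaining background region is one hole.
    holes = 0
    for r in range(h):
        for c in range(w):
            if grid[r][c] == bg and not seen[r][c]:
                holes += 1
                flood(r, c)
    return holes

donut = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
print(count_holes(donut))  # 1
```

A dozen lines of flood fill solve what the experiments below show frontier models struggling with.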


I tested multiple frontier models: Gemini 2.5 Pro, GPT-4o, Claude Sonnet 4, Gemma 3 27B (on Hugging Face), DeepSeek (on OpenRouter), and Qwen 2.5 VL 32B (on OpenRouter).
The results below focus on Gemini 2.5 Pro, which performed best on this task.
Experiment 1: Text-Only Input
I provided the grids as text (numeric arrays) along with the correct transformation instructions.
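The exact prompt format isn't reproduced here, but a representative text serialization looks like the sketch below — rows of space-separated cell values, one line per row.

```python
def grid_to_text(grid):
    """Serialize a grid as rows of space-separated digits, one line per row."""
    return "\n".join(" ".join(str(v) for v in row) for row in grid)

print(grid_to_text([[0, 5, 0], [5, 5, 5]]))
# 0 5 0
# 5 5 5
```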

Result: The model failed on almost every object. Worse, it couldn’t even pass validation on the training examples — meaning in a real run, it would never reach the test grid. The few successes were objects without holes (trivial cases).

Experiment 2: Image-Only Input
Maybe this is a representation problem. What if we give the model a visual image instead of text arrays?

Result: Dramatically worse — 0% similarity. The model couldn’t even reconstruct the basic grid structure, let alone apply transformations.

Experiment 3: Text + Image Combined
What if we provide both representations? Text gives precise cell values; images give spatial context.

Result: Significant improvement. Critically, the model now passes validation on training examples — real progress. But it still makes strange errors: miscounting holes on complex shapes, failing to preserve object boundaries.

Experiment 4: Enhanced Images
If combined input helps, can we improve the visual representation? I made three changes: doubled the pixel size per cell (32px → 64px), added gray gridlines between cells, and added row/column numbers.
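As an illustration of these enhancements, here is a sketch using Pillow: larger cells, gray gridlines, and row/column labels in a margin. The color palette mapping is an assumption for demonstration and doesn't match ARC's canonical palette.

```python
from PIL import Image, ImageDraw

# Illustrative palette — assumed mapping, not ARC's official colors.
PALETTE = {0: (0, 0, 0), 1: (0, 116, 217), 2: (255, 65, 54), 5: (128, 128, 128)}

def render_grid(grid, cell=64, line=(90, 90, 90)):
    """Render a grid with large cells, gray gridlines, and coordinate labels."""
    h, w = len(grid), len(grid[0])
    margin = cell // 2  # space for row/column numbers
    img = Image.new("RGB", (w * cell + margin, h * cell + margin), (255, 255, 255))
    d = ImageDraw.Draw(img)
    for r, row in enumerate(grid):
        for c, v in enumerate(row):
            x, y = margin + c * cell, margin + r * cell
            d.rectangle([x, y, x + cell, y + cell],
                        fill=PALETTE.get(v, (255, 255, 255)), outline=line)
    for c in range(w):  # column numbers along the top
        d.text((margin + c * cell + cell // 2, 4), str(c), fill=(0, 0, 0))
    for r in range(h):  # row numbers along the left
        d.text((4, margin + r * cell + cell // 2), str(r), fill=(0, 0, 0))
    return img

render_grid([[5, 0], [0, 5]]).save("grid.png")
```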

Result: Score improved further. The enhanced visual representation helps the model distinguish individual cells and track positions more accurately.

Experiment 5: Explicit Reasoning Step
Here’s the key question: Is the output grid a true reflection of what the model sees? Or is the model perceiving correctly but failing to construct the output?
To test this, I added an intermediate step: before generating the output grid, describe in plain English what you see and what you plan to do. This isn’t asking for transformation rules (we already provided those) — it’s asking the model to articulate its specific plan for this grid.
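A sketch of this two-step flow, where `llm` stands in for any multimodal completion function — a hypothetical signature for illustration, not a real API.

```python
def solve_with_verbal_plan(llm, rules, grid_text, grid_image):
    """Two-step flow: verbalize a plan first, then construct the grid from it.
    `llm` is a hypothetical callable (prompt, image) -> text."""
    # Step 1: force the model to commit to a specific interpretation in English.
    plan = llm(
        f"Rules:\n{rules}\n\nGrid:\n{grid_text}\n\n"
        "Before producing any grid, describe in plain English each object "
        "you see, its hole count, and the color you will apply.",
        image=grid_image,
    )
    # Step 2: construct the output grid from that verbal plan.
    output = llm(
        f"Rules:\n{rules}\n\nGrid:\n{grid_text}\n\nYour plan:\n{plan}\n\n"
        "Now construct the output grid exactly as stated in your plan.",
        image=grid_image,
    )
    return plan, output
```

The point of the split is that the plan is inspectable: perception errors show up in readable English before they get baked into a grid.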

The English description reveals the model’s perception errors clearly. It misidentifies objects, merges distinct shapes, and miscounts holes. But it also gets more right than the direct grid output did.
When I inject this verbal plan back into the grid construction step, something interesting happens: the output grid tracks the verbal description more faithfully than the original direct approach. The model still makes errors, but they’re more consistent and predictable.
Result: 92.4% similarity. Adding an explicit reasoning step before grid construction produces dramatically better results.
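One plausible way to compute a similarity score like this — assumed here for illustration, not necessarily the exact metric used — is cell-wise agreement between the predicted and target grids.

```python
def grid_similarity(pred, target):
    """Fraction of cells that match; grids of mismatched shape score 0.
    An assumed definition of 'similarity', shown for illustration."""
    if len(pred) != len(target) or any(len(a) != len(b) for a, b in zip(pred, target)):
        return 0.0
    total = sum(len(row) for row in target)
    correct = sum(p == t for rp, rt in zip(pred, target) for p, t in zip(rp, rt))
    return correct / total

print(grid_similarity([[1, 2], [3, 4]], [[1, 2], [3, 0]]))  # 0.75
```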

Key Insights
Text and images provide orthogonal information. Neither alone is sufficient. Text gives precise values but loses spatial structure. Images preserve relationships but lose precision. The combination outperforms either.
Image quality matters more than expected. Higher resolution, clear gridlines, and coordinate labels all help. The model isn’t just “looking” at images — it’s parsing them, and better visual structure makes parsing easier.
Explicit verbalization improves execution. Asking the model to describe its plan before acting forces it to commit to a specific interpretation. This surfaces errors earlier and produces more consistent output.
Perception remains the bottleneck. Even with the correct transformation rules and an explicit reasoning step, the model still makes errors that a child wouldn’t. Counting holes in a simple shape remains genuinely hard for frontier models.
Summary of Results

What’s Next
These experiments focused on a single puzzle to isolate variables. The next step is scaling: does this approach generalize across the ARC-2 evaluation set? Do different puzzle types require different perception strategies?
There’s also a deeper question worth exploring: if frontier models struggle with basic visual perception, what does this mean for applications that require precise spatial reasoning? Document understanding, UI automation, robotics — all depend on accurate perception as a foundation.
More to follow.
Acknowledgments
Thank you to lamda.ai for providing cloud compute and to Hugging Face for the Gemma model.
— — —
If you’re working on ARC-AGI or multi-agent reasoning systems, I’d love to hear from you. Follow me for Part 3, where I’ll explore scaling these techniques across the full evaluation set.
Missed Part 1? Read it here: Building an ARC-2 Solver: My Multi-Agent Socratic Reasoning Journey
Published via Towards AI