Building an ARC-2 Solver — Part 2
Last Updated on January 26, 2026 by Editorial Team
Author(s): Rahul Ray
Originally published on Towards AI.

Here’s something that should be easy: count the holes in a shape. A five-year-old can do it. A frontier AI model worth billions in compute? Not so much.
In Part 1, I introduced my multi-agent Socratic reasoning system for tackling ARC-AGI-2 puzzles. This post goes deeper — into what might be the most surprising bottleneck I’ve discovered: LLMs can’t reliably see what’s in front of them. This isn’t a reasoning problem. It’s a perception problem. And it changes how we should think about solving ARC-2.
What I Will Share
• A framework for breaking down ARC solving into four distinct skills — and why isolating them matters
• Experimental evidence that even when given the correct solution, LLMs fail to execute it
• Why combining text and images outperforms either alone
• A simple technique that dramatically boosts accuracy
Why ARC-2 Matters
ARC-AGI-2 isn’t just another benchmark. It exposes fundamental gaps in how LLMs process information — gaps that matter far beyond puzzle-solving.
ARC puzzles require extrapolation from limited data, typically just 2–3 examples. This is the opposite of how LLMs are trained, where they consume vast quantities of data to learn patterns. ARC asks: can you derive the rule from almost nothing?
Current frontier models perform poorly on ARC-AGI-2; the benchmark remains effectively unsolved.

The Four Skills Framework
Working on this project clarified something important: solving ARC-2 requires four distinct capabilities, and failure in any one blocks success.
Skill 1: Perception. Can the model accurately see what’s in the grid? Identify objects, count features, recognize boundaries, detect spatial relationships?
Skill 2: Reasoning. Given accurate perception, can the model figure out the transformation rule?
Skill 3: Execution. Given the transformation rules, can the model precisely follow instructions to construct the output?
Skill 4: Verification. Does the model know when it’s right? This is why induction approaches (derive rules, then apply) beat transduction (output the answer directly) — derived rules can be tested against training examples before committing to an answer.
Most research focuses on Skill 2 (reasoning). This post focuses on Skill 3 (execution) — and reveals it’s actually a Skill 1 (perception) problem in disguise.
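To make Skill 4 concrete, here is a minimal sketch of induction-style verification: a candidate rule is trusted only if it reproduces every training output. The `Grid` and `Rule` aliases are illustrative names of my own, not part of any ARC tooling.

```python
from typing import Callable, List, Tuple

# Illustrative type aliases — not part of any ARC library.
Grid = List[List[int]]
Rule = Callable[[Grid], Grid]

def passes_training(rule: Rule, train_pairs: List[Tuple[Grid, Grid]]) -> bool:
    """Accept a candidate rule only if it maps every training input to its output."""
    return all(rule(inp) == out for inp, out in train_pairs)

# Example: a trivial identity rule checked against one training pair.
pairs = [([[1, 0], [0, 1]], [[1, 0], [0, 1]])]
print(passes_training(lambda g: g, pairs))  # True
```

This is the structural advantage of induction over transduction: a wrong rule fails loudly on the training pairs before the test grid is ever touched.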
The Experiment: What If We Give the Model the Answer?
I designed a simple test: inject the correct transformation rules and ask the model to apply them. This isolates Skill 3 from Skills 1 and 2. If the model knows exactly what to do, can it do it?
The puzzle I focused on (task e3721c99) requires counting holes in gray objects and recoloring based on a legend. The transformation rules are straightforward: identify each gray object, count its holes, look up the corresponding color, apply it.
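For readers who want to see what "counting holes" means computationally, here is a sketch using breadth-first flood fill: background regions that cannot be reached from the grid border are holes. This is my own illustration of the rule, not anything the model received; 4-connectivity is assumed.

```python
from collections import deque

def count_holes(grid, bg=0):
    """Count enclosed background regions (holes) in an object mask.
    Background reachable from the border is 'outside', not a hole."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]

    def flood(r, c):
        # BFS over 4-connected background cells.
        q = deque([(r, c)])
        seen[r][c] = True
        while q:
            y, x = q.popleft()
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] and grid[ny][nx] == bg:
                    seen[ny][nx] = True
                    q.append((ny, nx))

    # Mark every background region touching the border as outside.
    for r in range(h):
        for c in range(w):
            if (r in (0, h - 1) or c in (0, w - 1)) and grid[r][c] == bg and not seen[r][c]:
                flood(r, c)

    # Each remaining background region is one hole.
    holes = 0
    for r in range(h):
        for c in range(w):
            if grid[r][c] == bg and not seen[r][c]:
                holes += 1
                flood(r, c)
    return holes

donut = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
print(count_holes(donut))  # 1
```

A dozen lines of flood fill solve what the experiments below show frontier models struggling with.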


I tested multiple frontier models: Gemini 2.5 Pro, GPT-4o, Claude Sonnet 4, Gemma 3 27B (on Hugging Face), DeepSeek (on OpenRouter), and Qwen 2.5 VL 32B (on OpenRouter).
The results below focus on Gemini 2.5 Pro, which performed best on this task.
Experiment 1: Text-Only Input
I provided the grids as text (numeric arrays) along with the correct transformation instructions.
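The exact prompt format isn't reproduced here, but a representative text serialization looks like the sketch below — rows of space-separated cell values, one line per row.

```python
def grid_to_text(grid):
    """Serialize a grid as rows of space-separated digits, one line per row."""
    return "\n".join(" ".join(str(v) for v in row) for row in grid)

print(grid_to_text([[0, 5, 0], [5, 5, 5]]))
# 0 5 0
# 5 5 5
```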

Result: The model failed on almost every object. Worse, it couldn’t even pass validation on the training examples — meaning in a real run, it would never reach the test grid. The few successes were objects without holes (trivial cases).

Experiment 2: Image-Only Input
Maybe this is a representation problem. What if we give the model a visual image instead of text arrays?

Result: Dramatically worse — 0% similarity. The model couldn’t even reconstruct the basic grid structure, let alone apply transformations.

Experiment 3: Text + Image Combined
What if we provide both representations? Text gives precise cell values; images give spatial context.

Result: Significant improvement. Critically, the model now passes validation on training examples — real progress. But it still makes strange errors: miscounting holes on complex shapes, failing to preserve object boundaries.

Experiment 4: Enhanced Images
If combined input helps, can we improve the visual representation? I made three changes: doubled the pixel size per cell (32px → 64px), added gray gridlines between cells, and added row/column numbers.
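As an illustration of these enhancements, here is a sketch using Pillow: larger cells, gray gridlines, and row/column labels in a margin. The color palette mapping is an assumption for demonstration and doesn't match ARC's canonical palette.

```python
from PIL import Image, ImageDraw

# Illustrative palette — assumed mapping, not ARC's official colors.
PALETTE = {0: (0, 0, 0), 1: (0, 116, 217), 2: (255, 65, 54), 5: (128, 128, 128)}

def render_grid(grid, cell=64, line=(90, 90, 90)):
    """Render a grid with large cells, gray gridlines, and coordinate labels."""
    h, w = len(grid), len(grid[0])
    margin = cell // 2  # space for row/column numbers
    img = Image.new("RGB", (w * cell + margin, h * cell + margin), (255, 255, 255))
    d = ImageDraw.Draw(img)
    for r, row in enumerate(grid):
        for c, v in enumerate(row):
            x, y = margin + c * cell, margin + r * cell
            d.rectangle([x, y, x + cell, y + cell],
                        fill=PALETTE.get(v, (255, 255, 255)), outline=line)
    for c in range(w):  # column numbers along the top
        d.text((margin + c * cell + cell // 2, 4), str(c), fill=(0, 0, 0))
    for r in range(h):  # row numbers along the left
        d.text((4, margin + r * cell + cell // 2), str(r), fill=(0, 0, 0))
    return img

render_grid([[5, 0], [0, 5]]).save("grid.png")
```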

Result: Score improved further. The enhanced visual representation helps the model distinguish individual cells and track positions more accurately.

Experiment 5: Explicit Reasoning Step
Here’s the key question: Is the output grid a true reflection of what the model sees? Or is the model perceiving correctly but failing to construct the output?
To test this, I added an intermediate step: before generating the output grid, describe in plain English what you see and what you plan to do. This isn’t asking for transformation rules (we already provided those) — it’s asking the model to articulate its specific plan for this grid.
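A sketch of this two-step flow, where `llm` stands in for any multimodal completion function — a hypothetical signature for illustration, not a real API.

```python
def solve_with_verbal_plan(llm, rules, grid_text, grid_image):
    """Two-step flow: verbalize a plan first, then construct the grid from it.
    `llm` is a hypothetical callable (prompt, image) -> text."""
    # Step 1: force the model to commit to a specific interpretation in English.
    plan = llm(
        f"Rules:\n{rules}\n\nGrid:\n{grid_text}\n\n"
        "Before producing any grid, describe in plain English each object "
        "you see, its hole count, and the color you will apply.",
        image=grid_image,
    )
    # Step 2: construct the output grid from that verbal plan.
    output = llm(
        f"Rules:\n{rules}\n\nGrid:\n{grid_text}\n\nYour plan:\n{plan}\n\n"
        "Now construct the output grid exactly as stated in your plan.",
        image=grid_image,
    )
    return plan, output
```

The point of the split is that the plan is inspectable: perception errors show up in readable English before they get baked into a grid.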

The English description reveals the model’s perception errors clearly. It misidentifies objects, merges distinct shapes, and miscounts holes. But it also gets more right than the direct grid output did.
When I inject this verbal plan back into the grid construction step, something interesting happens: the output grid tracks the verbal description more faithfully than the original direct approach. The model still makes errors, but they’re more consistent and predictable.
Result: 92.4% similarity. Adding an explicit reasoning step before grid construction produces dramatically better results.
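One plausible way to compute a similarity score like this — assumed here for illustration, not necessarily the exact metric used — is cell-wise agreement between the predicted and target grids.

```python
def grid_similarity(pred, target):
    """Fraction of cells that match; grids of mismatched shape score 0.
    An assumed definition of 'similarity', shown for illustration."""
    if len(pred) != len(target) or any(len(a) != len(b) for a, b in zip(pred, target)):
        return 0.0
    total = sum(len(row) for row in target)
    correct = sum(p == t for rp, rt in zip(pred, target) for p, t in zip(rp, rt))
    return correct / total

print(grid_similarity([[1, 2], [3, 4]], [[1, 2], [3, 0]]))  # 0.75
```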

Key Insights
Text and images provide orthogonal information. Neither alone is sufficient. Text gives precise values but loses spatial structure. Images preserve relationships but lose precision. The combination outperforms either.
Image quality matters more than expected. Higher resolution, clear gridlines, and coordinate labels all help. The model isn’t just “looking” at images — it’s parsing them, and better visual structure makes parsing easier.
Explicit verbalization improves execution. Asking the model to describe its plan before acting forces it to commit to a specific interpretation. This surfaces errors earlier and produces more consistent output.
Perception remains the bottleneck. Even with the correct transformation rules and an explicit reasoning step, the model still makes errors that a child wouldn’t. Counting holes in a simple shape remains genuinely hard for frontier models.
Summary of Results

What’s Next
These experiments focused on a single puzzle to isolate variables. The next step is scaling: does this approach generalize across the ARC-2 evaluation set? Do different puzzle types require different perception strategies?
There’s also a deeper question worth exploring: if frontier models struggle with basic visual perception, what does this mean for applications that require precise spatial reasoning? Document understanding, UI automation, robotics — all depend on accurate perception as a foundation.
More to follow.
Acknowledgments
Thank you to lamda.ai for providing cloud compute and to Hugging Face for the Gemma model.
— — —
If you’re working on ARC-AGI or multi-agent reasoning systems, I’d love to hear from you. Follow me for Part 3, where I’ll explore scaling these techniques across the full evaluation set.
Missed Part 1? Read it here: Building an ARC-2 Solver: My Multi-Agent Socratic Reasoning Journey
Published via Towards AI