
Building an ARC-2 Solver — Part 2

Last Updated on January 26, 2026 by Editorial Team

Author(s): Rahul Ray

Originally published on Towards AI.


Here’s something that should be easy: count the holes in a shape. A five-year-old can do it. A frontier AI model worth billions in compute? Not so much.

In Part 1, I introduced my multi-agent Socratic reasoning system for tackling ARC-AGI-2 puzzles. This post goes deeper — into what might be the most surprising bottleneck I’ve discovered: LLMs can’t reliably see what’s in front of them. This isn’t a reasoning problem. It’s a perception problem. And it changes how we should think about solving ARC-2.

What I Will Share

• A framework for breaking down ARC solving into four distinct skills — and why isolating them matters

• Experimental evidence that even when given the correct solution, LLMs fail to execute it

• Why combining text and images outperforms either alone

• A simple technique that dramatically boosts accuracy

Why ARC-2 Matters

ARC-AGI-2 isn’t just another benchmark. It exposes fundamental gaps in how LLMs process information — gaps that matter far beyond puzzle-solving.

ARC puzzles require extrapolation from limited data, typically just 2–3 examples. This is the opposite of how LLMs are trained, where they consume vast quantities of data to learn patterns. ARC asks: can you derive the rule from almost nothing?

Current frontier models score poorly on ARC-AGI-2, and the benchmark remains effectively unsolved.

ARC-2 Leaderboard

The Four Skills Framework

Working on this project clarified something important: solving ARC-2 requires four distinct capabilities, and failure in any one blocks success.

Skill 1: Perception. Can the model accurately see what’s in the grid? Identify objects, count features, recognize boundaries, detect spatial relationships?

Skill 2: Reasoning. Given accurate perception, can the model figure out the transformation rule?

Skill 3: Execution. Given the transformation rules, can the model precisely follow instructions to construct the output?

Skill 4: Verification. Does the model know when it’s right? This is why induction approaches (derive rules, then apply) beat transduction (output the answer directly) — derived rules can be tested against training examples before committing to an answer.

Most research focuses on Skill 2 (reasoning). This post focuses on Skill 3 (execution) — and reveals it’s actually a Skill 1 (perception) problem in disguise.
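The verification loop behind Skill 4 can be sketched in a few lines. This is an illustrative sketch, not the author's implementation: a candidate rule (here just a Python function) is accepted only if it reproduces every training output, and only then is it applied to the test grid.

```python
# Sketch of induction-style verification: derive a rule, test it against
# the training pairs, and only commit to an answer if it passes.
# All names here are illustrative, not from the author's solver.

def passes_training(rule, train_pairs):
    """Accept a candidate rule only if it reproduces every training output."""
    return all(rule(inp) == out for inp, out in train_pairs)

# Toy task whose hidden rule is "transpose the grid".
train_pairs = [
    ([[1, 2], [3, 4]], [[1, 3], [2, 4]]),
    ([[5, 6], [7, 8]], [[5, 7], [6, 8]]),
]

def transpose(grid):
    return [list(row) for row in zip(*grid)]

if passes_training(transpose, train_pairs):
    test_input = [[9, 0], [1, 2]]
    answer = transpose(test_input)  # only now commit to an answer
```

A transduction approach has no equivalent checkpoint: the model emits a grid directly, and there is nothing to test it against before submission.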

The Experiment: What If We Give the Model the Answer?

I designed a simple test: inject the correct transformation rules and ask the model to apply them. This isolates Skill 3 from Skills 1 and 2. If the model knows exactly what to do, can it do it?

The puzzle I focused on (task e3721c99) requires counting holes in gray objects and recoloring based on a legend. The transformation rules are straightforward: identify each gray object, count its holes, look up the corresponding color, apply it.

Task e3721c99
Transformation Rules for task e3721c99
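As a point of contrast with the models' struggles, the hole-counting step is mechanically trivial for a program. Here is a flood-fill sketch (my own illustration, not part of the solver): background regions reachable from the grid border are not holes; any background region left over is fully enclosed, i.e. a hole. A real solver would first separate the gray objects and count holes per object.

```python
from collections import deque

def count_holes(grid, object_color):
    """Count enclosed background regions ("holes") in a grid.
    Uses 4-connectivity; any cell not equal to object_color is background.
    Background connected to the border is not a hole; each remaining
    background region is one hole."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]

    def flood(r0, c0):
        q = deque([(r0, c0)])
        seen[r0][c0] = True
        while q:
            r, c = q.popleft()
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if (0 <= nr < h and 0 <= nc < w and not seen[nr][nc]
                        and grid[nr][nc] != object_color):
                    seen[nr][nc] = True
                    q.append((nr, nc))

    # Mark all background reachable from the border: not holes.
    for r in range(h):
        for c in (0, w - 1):
            if grid[r][c] != object_color and not seen[r][c]:
                flood(r, c)
    for c in range(w):
        for r in (0, h - 1):
            if grid[r][c] != object_color and not seen[r][c]:
                flood(r, c)

    # Every remaining unseen background region is enclosed: a hole.
    holes = 0
    for r in range(h):
        for c in range(w):
            if grid[r][c] != object_color and not seen[r][c]:
                holes += 1
                flood(r, c)
    return holes
```

A 3×3 gray ring around one empty cell yields 1; a solid block yields 0. That a few dozen lines of BFS solve what frontier models miss is exactly the perception gap this post is about.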

I tested multiple frontier models: Gemini 2.5 Pro, GPT-4o, Claude Sonnet 4, Gemma 3 27B (on Hugging Face), DeepSeek (on OpenRouter), and Qwen 2.5 VL 32B (on OpenRouter).

The results below focus on Gemini 2.5 Pro, which performed best on this task.

Experiment 1: Text-Only Input

I provided the grids as text (numeric arrays) along with the correct transformation instructions.

Experiment 1: Training input/output and Test input grids in text format
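The text serialization is nothing exotic. Something like the following (an assumption about the exact format: space-separated numeric rows) was the model's entire view of the grid in this experiment.

```python
def grid_to_text(grid):
    """Serialize a grid as space-separated numeric rows, one line per row."""
    return "\n".join(" ".join(str(v) for v in row) for row in grid)

# A gray (5) ring with a single hole:
print(grid_to_text([[5, 5, 5], [5, 0, 5], [5, 5, 5]]))
# 5 5 5
# 5 0 5
# 5 5 5
```

Note what this representation discards: a human reading those three lines still sees a ring, but only because our visual system reassembles the rows. The model must do that reassembly in token space.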

Result: The model failed on almost every object. Worse, it couldn’t even pass validation on the training examples — meaning in a real run, it would never reach the test grid. The few successes were objects without holes (trivial cases).

Experiment 1 Solution

Experiment 2: Image-Only Input

Maybe this is a representation problem. What if we give the model a visual image instead of text arrays?

Experiment 2: Training input/output and Test input grids in image format

Result: Dramatically worse — 0% similarity. The model couldn’t even reconstruct the basic grid structure, let alone apply transformations.
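The post reports scores as percent similarity without defining the metric, so here is one natural reading (my assumption, not the author's definition): cell-wise agreement, scored 0 when the predicted grid does not even match the target's dimensions, which is what happened in this run.

```python
def grid_similarity(pred, target):
    """Fraction of cells that match between two grids.
    Returns 0.0 when the predicted grid fails to reproduce the target's
    shape, since no cell-wise comparison is meaningful then."""
    if len(pred) != len(target) or any(
            len(p) != len(t) for p, t in zip(pred, target)):
        return 0.0
    total = sum(len(row) for row in target)
    matches = sum(p == t
                  for pr, tr in zip(pred, target)
                  for p, t in zip(pr, tr))
    return matches / total
```

Under this metric, failing to reconstruct the grid structure at all scores exactly 0%, while a prediction wrong in only a few cells scores close to 1.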

Experiment 2 Solution

Experiment 3: Text + Image Combined

What if we provide both representations? Text gives precise cell values; images give spatial context.

Experiment 3: Training input/output and Test input grids in text and image format

Result: Significant improvement. Critically, the model now passes validation on training examples — real progress. But it still makes strange errors: miscounting holes on complex shapes, failing to preserve object boundaries.

Experiment 3 Solution

Experiment 4: Enhanced Images

If combined input helps, can we improve the visual representation? I made three changes: doubled the pixel size per cell (32px → 64px), added gray gridlines between cells, and added row/column numbers.
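For concreteness, the three rendering changes can be sketched with Pillow. This is an illustrative reconstruction, assuming Pillow is available; the palette, margin, and label placement are placeholders, not the actual values used.

```python
# Sketch of the enhanced rendering: larger cells, gray gridlines, and
# row/column labels. PALETTE is a stand-in, not ARC's official color map.
from PIL import Image, ImageDraw

PALETTE = {0: (0, 0, 0), 5: (128, 128, 128)}  # extend per ARC color

def render_grid(grid, cell=64, margin=32):
    h, w = len(grid), len(grid[0])
    img = Image.new("RGB", (margin + w * cell, margin + h * cell), "white")
    d = ImageDraw.Draw(img)
    for r, row in enumerate(grid):
        for c, v in enumerate(row):
            x, y = margin + c * cell, margin + r * cell
            d.rectangle([x, y, x + cell, y + cell],
                        fill=PALETTE.get(v, (255, 0, 0)),
                        outline=(180, 180, 180))  # gray gridlines
    for c in range(w):  # column numbers along the top margin
        d.text((margin + c * cell + cell // 2, margin // 2),
               str(c), fill="black")
    for r in range(h):  # row numbers along the left margin
        d.text((margin // 2, margin + r * cell + cell // 2),
               str(r), fill="black")
    return img
```

The gridlines and coordinate labels give the model explicit anchors for cell boundaries and positions, rather than forcing it to infer them from raw pixel runs.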

Experiment 4: Training input/output and Test input grids in text and higher resolution image format

Result: Score improved further. The enhanced visual representation helps the model distinguish individual cells and track positions more accurately.

Experiment 4 Solution

Experiment 5: Explicit Reasoning Step

Here’s the key question: Is the output grid a true reflection of what the model sees? Or is the model perceiving correctly but failing to construct the output?

To test this, I added an intermediate step: before generating the output grid, describe in plain English what you see and what you plan to do. This isn’t asking for transformation rules (we already provided those) — it’s asking the model to articulate its specific plan for this grid.
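The two-stage flow can be sketched as follows, with `call_model` standing in for any chat-completion client; the function name and prompt wording are my own illustration, not the author's exact prompts.

```python
# Describe-then-construct: force a verbal commitment before the model is
# allowed to emit a grid, then feed that plan back into construction.

def solve_with_plan(call_model, rules, grid_text):
    # Stage 1: elicit a plain-English plan for this specific grid.
    plan = call_model(
        "Rules:\n" + rules +
        "\n\nGrid:\n" + grid_text +
        "\n\nIn plain English, describe each object you see, how many "
        "holes it has, and what you will do to it. Do not output a grid."
    )
    # Stage 2: inject the plan and ask only for the transformed grid.
    return call_model(
        "Rules:\n" + rules +
        "\n\nGrid:\n" + grid_text +
        "\n\nYour plan:\n" + plan +
        "\n\nNow output only the transformed grid as numeric rows."
    )
```

Because the plan is ordinary text, it can be logged and inspected, which is exactly what surfaces the perception errors described next.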

Experiment 5: Solver produces solution in words instead of a grid

The English description reveals the model’s perception errors clearly. It misidentifies objects, merges distinct shapes, and miscounts holes. But it also gets more right than the direct grid output did.

When I inject this verbal plan back into the grid construction step, something interesting happens: the output grid tracks the verbal description more faithfully than the original direct approach. The model still makes errors, but they’re more consistent and predictable.

Result: 92.4% similarity. Adding an explicit reasoning step before grid construction produces dramatically better results.

Experiment 5 Solution

Key Insights

Text and images provide orthogonal information. Neither alone is sufficient. Text gives precise values but loses spatial structure. Images preserve relationships but lose precision. The combination outperforms either.

Image quality matters more than expected. Higher resolution, clear gridlines, and coordinate labels all help. The model isn’t just “looking” at images — it’s parsing them, and better visual structure makes parsing easier.

Explicit verbalization improves execution. Asking the model to describe its plan before acting forces it to commit to a specific interpretation. This surfaces errors earlier and produces more consistent output.

Perception remains the bottleneck. Even with the correct transformation rules and an explicit reasoning step, the model still makes errors that a child wouldn’t. Counting holes in a simple shape remains genuinely hard for frontier models.

Summary of Results

Arc of Experiments

What’s Next

These experiments focused on a single puzzle to isolate variables. The next step is scaling: does this approach generalize across the ARC-2 evaluation set? Do different puzzle types require different perception strategies?

There’s also a deeper question worth exploring: if frontier models struggle with basic visual perception, what does this mean for applications that require precise spatial reasoning? Document understanding, UI automation, robotics — all depend on accurate perception as a foundation.

More to follow.

Acknowledgments

Thank you to lamda.ai for providing cloud compute and to Hugging Face for the Gemma model.

— — —

If you’re working on ARC-AGI or multi-agent reasoning systems, I’d love to hear from you. Follow me for Part 3, where I’ll explore scaling these techniques across the full evaluation set.

Missed Part 1? Read it here: Building an ARC-2 Solver: My Multi-Agent Socratic Reasoning Journey


Published via Towards AI



Note: Article content contains the views of the contributing authors and not Towards AI.