
Multimodal Autonomous AI Agents: Enhancing Web Interactions Through Tree Search
Last Updated on April 16, 2025 by Editorial Team
Author(s): Kapardhi Kannekanti
Originally published on Towards AI.
I've been thinking a lot about AI agents lately: systems that can actually do things for us online instead of just answering questions. Last week, Professor Ruslan Salakhutdinov from CMU gave a lecture that really got me excited about where this field is heading. His work on multimodal AI agents shows how these systems can navigate websites and handle tasks that we do every day.
Why AI Agents Matter
Ruslan started with a simple but powerful point: we spend tons of time doing boring tasks on our computers and phones. Think about all the clicking, searching, and form-filling we do every day. What if AI could handle these things for us?
Today's language models are pretty smart. They can learn from examples, follow instructions, and even do things they weren't specifically trained for. But to turn them into agents that can actually get stuff done for us online, they need extra abilities, especially the power to see and understand websites the way we do.
How Web Agents Actually Work
The part that got me leaning forward in my seat was when Salakhutdinov explained how these web agents are built. It's not just one big AI; it's several pieces working together:
- Visual Understanding: The agent needs to "see" what's on the screen
- HTML Processing: It needs to read the code behind the webpage
- Web Grounding: It has to connect what it sees with what it can do
- Language Model: This is the "brain" that makes decisions
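To make that division of labor concrete, here's a minimal sketch of how those pieces could fit together. The class, method, and browser interface names are my own illustration for this post, not the actual architecture from the lecture.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    screenshot: bytes               # what the agent "sees" on the screen
    html: str                       # the code behind the webpage
    actionable_elements: list[str]  # grounded targets the agent could click or type into

class WebAgent:
    """Illustrative wrapper: a language-model brain plus perception and grounding."""

    def __init__(self, language_model):
        self.llm = language_model  # assumed to expose a generate(prompt) -> str method

    def observe(self, browser) -> Observation:
        # Combine a screenshot and the page HTML into one grounded observation
        # (browser.screenshot/html/clickable_elements are hypothetical calls).
        return Observation(browser.screenshot(), browser.html(), browser.clickable_elements())

    def decide(self, goal: str, obs: Observation) -> str:
        # Ask the language model which grounded element to act on next.
        prompt = f"Goal: {goal}\nElements: {obs.actionable_elements}\nNext action?"
        return self.llm.generate(prompt)
```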
When these agents try to complete a task, they work in layers:
- First, they make a plan (like "I need to find the cheapest printer and buy it")
- Then, they figure out what they're looking at ("this is a product listing page")
- Finally, they take specific actions (clicking a button or typing text)
This is just like shopping online: we don't just randomly click. We have a plan, look around the page, and then click on what seems useful.
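To sketch that layering in code (reusing the hypothetical WebAgent from above, with the browser interface again an assumption):

```python
def run_task(agent: WebAgent, browser, goal: str, max_steps: int = 30):
    # 1. Make a high-level plan ("I need to find the cheapest printer and buy it").
    plan = agent.llm.generate(f"Write a short step-by-step plan for: {goal}")
    for _ in range(max_steps):
        # 2. Figure out what we're looking at ("this is a product listing page").
        obs = agent.observe(browser)
        # 3. Take a specific action (click a button, type text), or stop when done.
        action = agent.decide(f"{goal}\nPlan: {plan}", obs)
        if action.strip().lower() == "stop":
            break
        browser.execute(action)  # hypothetical call that performs the click or typing
```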
The Big Problem: Mistakes Add Up Fast
Here's the main challenge these agents face: the "exponential error compounding" problem.
Imagine you're following a recipe with 30 steps. If you have a 90% chance of getting each step right, you might think you'd do pretty well. But the math says otherwise: your chance of getting the whole recipe right drops to just 4.24%!
The same thing happens with AI agents. Even if they're pretty good at each small step (clicking the right button, typing the right thing), when they have to do many steps in a row, they often fail. One small mistake early on can derail the whole process.
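You can check that number yourself; assuming each of the 30 steps succeeds independently with probability 0.9:

```python
p_all_steps = 0.9 ** 30
print(f"{p_all_steps:.2%}")  # 4.24% chance of getting every single step right
```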
Tree Search: The Clever Solution I Got Excited About
This is where the lecture grabbed me: when Salakhutdinov explained how "tree search" can fix this problem. It's like giving the AI the ability to try different paths and backtrack when it makes mistakes, just like we do!
Here's how it works:
- The agent tries a few possible actions
- It keeps track of how promising each path looks
- If it hits a dead end, it goes back and tries something else
- It keeps searching until it finds a solution that works
What makes this cool is that it mimics how humans navigate websites. When we are looking for something and click the wrong button, we don't give up; we just hit the back button and try something else.
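Here's a minimal best-first search over saved browser states, in the spirit of what Salakhutdinov described. The snapshot/restore calls and the propose_actions, score_state, and is_done functions are assumptions for this sketch, not the actual implementation from the lecture.

```python
import heapq
import itertools

def tree_search(browser, goal, propose_actions, score_state, is_done, max_expansions=50):
    """Try a few actions per state, keep the most promising paths,
    and backtrack (restore an earlier snapshot) when a branch dead-ends."""
    tie = itertools.count()  # tie-breaker so the heap never has to compare states
    start = browser.snapshot()
    frontier = [(-score_state(start, goal), next(tie), [start], [])]
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, states, actions = heapq.heappop(frontier)    # most promising path so far
        if is_done(states[-1], goal):
            return actions                                 # a sequence of actions that works
        for action in propose_actions(states[-1], goal):   # try a few possible actions
            browser.restore(states[-1])                    # "hit the back button" first
            browser.execute(action)
            new_state = browser.snapshot()
            heapq.heappush(frontier, (-score_state(new_state, goal), next(tie),
                                      states + [new_state], actions + [action]))
    return None                                            # search budget exhausted
```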
The results were pretty amazing. When they added tree search to GPT-4o, its success rate on web tasks jumped from 17% to 26% on one benchmark. Llama models saw similar improvements.
A great example: an agent was trying to find and compare canned fruit products. When it made a wrong turn, instead of being stuck, it backtracked and found another way to complete the task. Just like a real person would!
Why Agents Still Mess Up (and How We'll Fix It)
Salakhutdinov also walked through how and why these agents still fail:
- Sometimes they get stuck in loops, bouncing between the same two pages
- They might give up too early before finding the solution
- They often click the wrong things because they misunderstand what theyβre seeing
- They struggle with spatial tasks like "find the product in the first row"
But he was optimistic about solutions:
- Better ways to evaluate which paths are promising
- Teaching agents to improve their strategies through experience
- Figuring out when to make the base agent smarter versus when to let it explore more options
- Making these systems work in real websites, not just in test environments
I found myself nodding along, thinking about all the times I've watched someone struggle to navigate a website, making the same mistakes repeatedly. These AI solutions mirror how we teach each other to use technology.
Training These Agents at Internet Scale
The last part of the lecture introduced a project called "Towards Internet-Scale Training For Agents" (InSTA), and it really got me thinking about practical applications.
Instead of paying humans to demonstrate thousands of web tasks (super expensive!), they're using language models to generate realistic tasks across thousands of websites. For example:
- "Find a free WordPress theme for a personal blog"
- "Look up the meaning of the Om symbol in ancient cultures"
- "Compare prices of Nikon D850 and D500 cameras"
Their process is simple but clever:
- Generate realistic tasks for different websites
- Let agents try to complete them
- Use another AI to check if they succeeded
- Collect all this data to train better agents
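In rough Python, that loop might look like this (generate_task, run_agent, and judge_success are placeholders for whatever models the InSTA pipeline actually uses):

```python
def collect_training_data(websites, generate_task, run_agent, judge_success):
    dataset = []
    for site in websites:
        task = generate_task(site)              # 1. an LLM writes a realistic task for this site
        trajectory = run_agent(site, task)      # 2. an agent tries to complete the task
        if judge_success(task, trajectory):     # 3. another model checks whether it succeeded
            dataset.append((task, trajectory))  # 4. keep successful runs as training data
    return dataset
```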
Agents trained this way learned much faster and could handle new websites they'd never seen before. This approach seems much more practical for creating agents that can work across the entire internet, not just a few test websites.
What This Means For Our Future
After sitting through Salakhutdinov's lecture, I couldn't help but think about how these technologies might change my daily life. Imagine having an assistant that could actually book your flights, find the best deals, research topics for you, or fill out those annoying forms, all by understanding websites the way you do.
The tree search technique really stuck with me. It's such a human approach to problem-solving: try something, see if it works, and if not, back up and try something else. By giving AI this ability to explore and recover from mistakes, we're making them much more reliable for real-world tasks.
We're still in the early days (success rates of 26% are better than 8%, but far from perfect), but the progress is happening fast. I think in a few years, we'll look back at having to navigate websites ourselves as a weird chore from the past, like how we now view memorizing phone numbers.
This article discusses research by Professor Ruslan Salakhutdinov from Carnegie Mellon University, presented as part of UC Berkeley's Advanced Large Language Model Agents MOOC for Spring 2025.
Published via Towards AI