
Multimodal Autonomous AI Agents: Enhancing Web Interactions Through Tree Search
Last Updated on April 16, 2025 by Editorial Team
Author(s): Kapardhi Kannekanti
Originally published on Towards AI.
I've been thinking a lot about AI agents lately: systems that can actually do things for us online instead of just answering questions. Last week, Professor Ruslan Salakhutdinov from CMU gave a lecture that really got me excited about where this field is heading. His work on multimodal AI agents shows how these systems can navigate websites and handle tasks that we do every day.
Why AI Agents Matter
Ruslan started with a simple but powerful point: we spend tons of time doing boring tasks on our computers and phones. Think about all the clicking, searching, and form-filling we do every day. What if AI could handle these things for us?
Today's language models are pretty smart. They can learn from examples, follow instructions, and even do things they weren't specifically trained for. But to turn them into agents that can actually get stuff done for us online, they need extra abilities, especially the power to see and understand websites the way we do.
How Web Agents Actually Work
The part that got me leaning forward in my seat was when Salakhutdinov explained how these web agents are built. It's not just one big AI; it's several pieces working together:
- Visual Understanding: The agent needs to "see" what's on the screen
- HTML Processing: It needs to read the code behind the webpage
- Web Grounding: It has to connect what it sees with what it can do
- Language Model: This is the "brain" that makes decisions
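To make that division of labor concrete, here's a minimal sketch of how those pieces could fit together. The class, method, and browser interface names are my own illustration for this post, not the actual architecture from the lecture.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    screenshot: bytes               # what the agent "sees" on the screen
    html: str                       # the code behind the webpage
    actionable_elements: list[str]  # grounded targets the agent could click or type into

class WebAgent:
    """Illustrative wrapper: a language-model brain plus perception and grounding."""

    def __init__(self, language_model):
        self.llm = language_model  # assumed to expose a generate(prompt) -> str method

    def observe(self, browser) -> Observation:
        # Combine a screenshot and the page HTML into one grounded observation
        # (browser.screenshot/html/clickable_elements are hypothetical calls).
        return Observation(browser.screenshot(), browser.html(), browser.clickable_elements())

    def decide(self, goal: str, obs: Observation) -> str:
        # Ask the language model which grounded element to act on next.
        prompt = f"Goal: {goal}\nElements: {obs.actionable_elements}\nNext action?"
        return self.llm.generate(prompt)
```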
When these agents try to complete a task, they work in layers:
- First, they make a plan (like "I need to find the cheapest printer and buy it")
- Then, they figure out what they're looking at ("this is a product listing page")
- Finally, they take specific actions (clicking a button or typing text)
This is just like shopping online: we don't just randomly click. We have a plan, look around the page, and then click on what seems useful.
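To sketch that layering in code (reusing the hypothetical WebAgent from above, with the browser interface again an assumption):

```python
def run_task(agent: WebAgent, browser, goal: str, max_steps: int = 30):
    # 1. Make a high-level plan ("I need to find the cheapest printer and buy it").
    plan = agent.llm.generate(f"Write a short step-by-step plan for: {goal}")
    for _ in range(max_steps):
        # 2. Figure out what we're looking at ("this is a product listing page").
        obs = agent.observe(browser)
        # 3. Take a specific action (click a button, type text), or stop when done.
        action = agent.decide(f"{goal}\nPlan: {plan}", obs)
        if action.strip().lower() == "stop":
            break
        browser.execute(action)  # hypothetical call that performs the click or typing
```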
The Big Problem: Mistakes Add Up Fast
Here's the main challenge these agents face: the "exponential error compounding" problem.
Imagine you're following a recipe with 30 steps. If you have a 90% chance of getting each step right, you might think you'd do pretty well. But the math says otherwise: your chance of getting the whole recipe right drops to just 4.24%!
The same thing happens with AI agents. Even if they're pretty good at each small step (clicking the right button, typing the right thing), when they have to do many steps in a row, they often fail. One small mistake early on can derail the whole process.
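You can check that number yourself; assuming each of the 30 steps succeeds independently with probability 0.9:

```python
p_all_steps = 0.9 ** 30
print(f"{p_all_steps:.2%}")  # 4.24% chance of getting every single step right
```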
Tree Search: The Clever Solution I Got Excited About
This is where the lecture grabbed me: when Salakhutdinov explained how "tree search" can fix this problem. It's like giving the AI the ability to try different paths and backtrack when it makes mistakes, just like we do!
Here's how it works:
- The agent tries a few possible actions
- It keeps track of how promising each path looks
- If it hits a dead end, it goes back and tries something else
- It keeps searching until it finds a solution that works
What makes this cool is that it mimics how humans navigate websites. When we are looking for something and click the wrong button, we don't give up; we just hit the back button and try something else.
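Here's a minimal best-first search over saved browser states, in the spirit of what Salakhutdinov described. The snapshot/restore calls and the propose_actions, score_state, and is_done functions are assumptions for this sketch, not the actual implementation from the lecture.

```python
import heapq
import itertools

def tree_search(browser, goal, propose_actions, score_state, is_done, max_expansions=50):
    """Try a few actions per state, keep the most promising paths,
    and backtrack (restore an earlier snapshot) when a branch dead-ends."""
    tie = itertools.count()  # tie-breaker so the heap never has to compare states
    start = browser.snapshot()
    frontier = [(-score_state(start, goal), next(tie), [start], [])]
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, states, actions = heapq.heappop(frontier)    # most promising path so far
        if is_done(states[-1], goal):
            return actions                                 # a sequence of actions that works
        for action in propose_actions(states[-1], goal):   # try a few possible actions
            browser.restore(states[-1])                    # "hit the back button" first
            browser.execute(action)
            new_state = browser.snapshot()
            heapq.heappush(frontier, (-score_state(new_state, goal), next(tie),
                                      states + [new_state], actions + [action]))
    return None                                            # search budget exhausted
```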
The results were pretty amazing. When they added tree search to GPT-4o, its success rate on web tasks jumped from 17% to 26% on one benchmark. Llama models saw similar improvements.
A great example: an agent was trying to find and compare canned fruit products. When it made a wrong turn, instead of being stuck, it backtracked and found another way to complete the task. Just like a real person would!
Why Agents Still Mess Up (and How We'll Fix It)
Salakhutdinov also walked through how and why these agents still fail:
- Sometimes they get stuck in loops, bouncing between the same two pages
- They might give up too early before finding the solution
- They often click the wrong things because they misunderstand what theyβre seeing
- They struggle with spatial tasks like "find the product in the first row"
But he was optimistic about solutions:
- Better ways to evaluate which paths are promising
- Teaching agents to improve their strategies through experience
- Figuring out when to make the base agent smarter versus when to let it explore more options
- Making these systems work in real websites, not just in test environments
I found myself nodding along, thinking about all the times I've watched someone struggle to navigate a website, making the same mistakes repeatedly. These AI solutions mirror how we teach each other to use technology.
Training These Agents at Internet Scale
The last part of the lecture introduced a project called "Towards Internet-Scale Training For Agents" (InSTA), and it really got me thinking about practical applications.
Instead of paying humans to demonstrate thousands of web tasks (super expensive!), they're using language models to generate realistic tasks across thousands of websites. For example:
- "Find a free WordPress theme for a personal blog"
- "Look up the meaning of the Om symbol in ancient cultures"
- "Compare prices of Nikon D850 and D500 cameras"
Their process is simple but clever:
- Generate realistic tasks for different websites
- Let agents try to complete them
- Use another AI to check if they succeeded
- Collect all this data to train better agents
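In rough Python, that loop might look like this (generate_task, run_agent, and judge_success are placeholders for whatever models the InSTA pipeline actually uses):

```python
def collect_training_data(websites, generate_task, run_agent, judge_success):
    dataset = []
    for site in websites:
        task = generate_task(site)              # 1. an LLM writes a realistic task for this site
        trajectory = run_agent(site, task)      # 2. an agent tries to complete the task
        if judge_success(task, trajectory):     # 3. another model checks whether it succeeded
            dataset.append((task, trajectory))  # 4. keep successful runs as training data
    return dataset
```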
Agents trained this way learned much faster and could handle new websites they'd never seen before. This approach seems much more practical for creating agents that can work across the entire internet, not just a few test websites.
What This Means For Our Future
After sitting through Salakhutdinov's lecture, I couldn't help but think about how these technologies might change my daily life. Imagine having an assistant that could actually book your flights, find the best deals, research topics for you, or fill out those annoying forms, all by understanding websites the way you do.
The tree search technique really stuck with me. It's such a human approach to problem-solving: try something, see if it works, and if not, back up and try something else. By giving AI this ability to explore and recover from mistakes, we're making them much more reliable for real-world tasks.
We're still in the early days (success rates of 26% are better than 8%, but far from perfect), but the progress is happening fast. I think in a few years, we'll look back at having to navigate websites ourselves as a weird chore from the past, like how we now view memorizing phone numbers.
This article discusses research by Professor Ruslan Salakhutdinov from Carnegie Mellon University, presented as part of UC Berkeley's Advanced Large Language Model Agents MOOC for Spring 2025.
Published via Towards AI