Meet WebAgent: DeepMind’s New LLM that Follows Instructions and Completes Tasks on Websites
Last Updated on August 2, 2023 by Editorial Team
Author(s): Jesus Rodriguez
Originally published on Towards AI.
The model combines language understanding and web navigation.
I recently started an AI-focused educational newsletter that already has over 160,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:
TheSequence | Jesus Rodriguez | Substack
The best source to stay up-to-date with the developments in the machine learning, artificial intelligence, and data…
thesequence.substack.com
The integration between large language models (LLMs) and websites is one of the areas that can unlock a new wave of LLM-powered applications. LLMs have demonstrated remarkable proficiency in a wide array of natural language tasks, ranging from basic arithmetic and logical reasoning to more complex challenges such as commonsense understanding, question answering, and even interactive decision-making. Augmenting these capabilities with web navigation results in a very powerful combination. Recently, Google DeepMind unveiled WebAgent, an LLM-driven autonomous agent capable of navigating real websites based on user instructions.
The real-world implementation of web navigation has posed unique challenges, including:
(1) the absence of a predefined action space;
(2) much longer HTML observations than those found in simulators; and
(3) the lack of domain-specific knowledge about HTML within LLMs.
These hurdles arise from the open-ended nature of real-world websites and the complexity of instructions, which make it difficult to define an appropriate action space in advance. While some research has highlighted the potential of instruction-finetuning and reinforcement learning from human feedback to improve HTML understanding and navigation accuracy, LLM designs have not always been optimized for processing HTML documents effectively. In particular, most LLMs have relatively short context windows, insufficient for the average token lengths found on real websites, and may not adopt techniques crucial for handling structured documents.
Enter WebAgent
WebAgent approaches the task by planning sub-instructions for each step, summarizing long HTML pages into relevant snippets based on those sub-instructions, and executing the task by grounding the sub-instructions and HTML snippets into executable Python code. To build WebAgent, Google DeepMind combines two LLMs: Flan-U-PaLM, which generates grounded code, and HTML-T5, a newly introduced domain-expert pre-trained language model responsible for task planning and conditional HTML summarization. HTML-T5, designed with an encoder-decoder architecture, excels at capturing the structure of lengthy HTML pages by utilizing local and global attention mechanisms, and it is pre-trained in a self-supervised fashion on a vast corpus of HTML data synthesized from CommonCrawl.
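To make this division of labor concrete, here is a minimal sketch of the loop described above. The helper functions plan_step, summarize_html, generate_program, and execute are hypothetical stand-ins for calls to HTML-T5 and Flan-U-PaLM; they are not part of any released API.

```python
def web_agent(instruction: str, get_page_html, max_steps: int = 10):
    history = []  # sub-instructions predicted so far
    for _ in range(max_steps):
        raw_html = get_page_html()

        # 1. Planning: HTML-T5 predicts the next sub-instruction from the
        #    user instruction, the current page, and the history of past steps.
        sub_instruction = plan_step(instruction, raw_html, history)
        if sub_instruction is None:  # the planner signals completion
            break

        # 2. Summarization: HTML-T5 keeps only the snippets of the long HTML
        #    page that are relevant to this sub-instruction.
        snippets = summarize_html(raw_html, sub_instruction)

        # 3. Grounded program synthesis: Flan-U-PaLM decodes an executable
        #    Python program conditioned on the sub-instruction and snippets.
        program = generate_program(sub_instruction, snippets)
        execute(program)

        history.append(sub_instruction)
    return history
```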
Existing LLM-driven agents typically handle decision-making tasks with a single LLM, prompting different examples per role. For more complex real-world tasks, however, this approach falls short. Google DeepMind’s evaluations demonstrate that WebAgent’s combined method, integrating plug-in language models, significantly improves HTML understanding and grounding, leading to better generalization. WebAgent achieves an increase of over 50% in success rate for real-world web navigation, and detailed analysis reveals the critical role of coupling task planning with HTML summarization via specialized language models. Furthermore, WebAgent performs well on static website comprehension tasks, surpassing single LLMs in QA accuracy and performing competitively against strong baselines.
Google DeepMind’s WebAgent is an innovative composition of two distinct language models, HTML-T5 and Flan-U-PaLM, working together to enable efficient web automation tasks that involve navigating and processing HTML documents.
HTML-T5, a domain-expert encoder-decoder language model, plays a crucial role in predicting sub-instructions for the next-step program and conditionally summarizing lengthy HTML documents. This specialized model strikes a balance between the general capabilities of language models such as T5, Flan-T5, and InstructGPT, which exhibit strong web navigation and HTML comprehension, and the HTML-specific inductive biases found in prior transformer models such as those proposed by Guo et al. HTML-T5 leverages local and global attention mechanisms in the encoder to effectively handle the hierarchical structure of HTML inputs. The local attention focuses on nearby tokens to the left and right of each element in the HTML, such as <input>, <label>, or <button>, while the transient global attention allows tokens to attend beyond their immediate neighborhood by dividing the input sequence into blocks and computing global tokens through summation and normalization. This hierarchical approach aligns naturally with the structure of HTML documents, where elements are defined locally and iteratively integrated into larger containers such as <body>, <form>, or <div>. The model is pre-trained on a large-scale HTML corpus extracted from CommonCrawl, utilizing a mixture of long-span denoising objectives.
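To make the “summation and normalization” step more tangible, here is a rough numerical sketch of how per-block transient global tokens might be formed. The block size and the normalization details are illustrative assumptions, not the exact choices used in HTML-T5 or LongT5.

```python
import numpy as np

def transient_global_tokens(token_embeddings: np.ndarray, block_size: int = 16) -> np.ndarray:
    """Form one 'global' token per block of the input sequence by summing the
    token embeddings in that block and normalizing the result."""
    seq_len, d_model = token_embeddings.shape
    n_blocks = -(-seq_len // block_size)  # ceiling division
    padded = np.zeros((n_blocks * block_size, d_model))
    padded[:seq_len] = token_embeddings
    blocks = padded.reshape(n_blocks, block_size, d_model)

    summed = blocks.sum(axis=1)                       # summation per block
    mean = summed.mean(axis=-1, keepdims=True)
    std = summed.std(axis=-1, keepdims=True) + 1e-6
    return (summed - mean) / std                      # normalization

# Each token then attends locally to its neighbors and globally to these
# per-block summary tokens, giving it a path to distant parts of the page.
```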
In contrast to traditional dense attention used in natural language models, HTML-T5’s adoption of local and global attention mechanisms better captures the explicit hierarchy present in HTML documents. Leaf elements in HTML (e.g., <input>, <label>, etc.) are effectively processed using local attention, while internal elements (e.g., <body>, <form>, etc.) are compressed into transient global attention. This design choice aligns seamlessly with the hierarchical syntax of HTML, enhancing the model’s ability to comprehend the structure and attributes of HTML elements.
To implement HTML-T5, Google DeepMind builds upon the base architecture of LongT5, which incorporates dense attention in the decoder. This combination of local and global attention mechanisms with the decoder’s dense attention provides an effective and well-rounded solution for handling HTML documents in web automation tasks.
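For hands-on experimentation with this base architecture, the publicly released LongT5 checkpoint with transient global attention can be loaded through Hugging Face Transformers. Note that this is the generic LongT5 model, not the HTML-T5 checkpoint described in the paper, and the prompt below is purely illustrative.

```python
from transformers import AutoTokenizer, LongT5ForConditionalGeneration

# Generic LongT5 checkpoint with transient global attention; a stand-in for
# the base architecture that HTML-T5 extends, NOT HTML-T5 itself.
tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
model = LongT5ForConditionalGeneration.from_pretrained("google/long-t5-tglobal-base")

html = "<body><form><label>City</label><input name='city'/></form></body>"
inputs = tokenizer("summarize the form fields: " + html, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```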
The Results
To illustrate the capabilities of Google DeepMind’s WebAgent, let’s look at automation in the context of a real estate website. In this setting, WebAgent follows a flow of planning, summarization, and grounded program synthesis. HTML-T5 leads the process, iteratively predicting decomposed sub-instructions and task-relevant snippets from the HTML document, the user instruction, and its previous predictions. Flan-U-PaLM, conditioned on those sub-instructions and snippets, then decodes the Python programs that act on the page.
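For intuition, the decoded programs are ordinary web-interaction code. The Selenium snippet below only illustrates the flavor of such a program for a sub-instruction like “type the city name and submit the search”; the URL, selectors, and values are hypothetical, and the actual programs WebAgent emits are grounded in the snippets selected by HTML-T5 rather than hand-written selectors.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example-realestate.com")        # hypothetical site

city_box = driver.find_element(By.NAME, "city")     # selector taken from an HTML snippet
city_box.send_keys("San Diego")

driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
```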
WebAgent achieves a success rate of around 70% on real websites, surpassing the single-LLM approach by more than 50%. This result indicates that breaking the task down into a sequence of sub-problems, each tackled by a different language model, can substantially improve overall task success.
In addition, Google DeepMind presents a scalable recipe for crafting HTML-specialized language models: training local and global attention mechanisms with a mixture of long-span denoising objectives. The aim is to capture the hierarchical structure that underlies HTML documents, paving the way for improved comprehension and more effective handling of HTML-related tasks.
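For readers unfamiliar with span denoising, here is a toy sketch of the idea: long contiguous spans of the token sequence are replaced with sentinel tokens in the encoder input, and the decoder learns to reconstruct them. The span length, corruption rate, and sentinel format below are illustrative assumptions, not the exact mixture used to pre-train HTML-T5.

```python
import random

def long_span_denoise(tokens: list, span_length: int = 32, corrupt_frac: float = 0.15):
    """Toy long-span denoising: mask a few long contiguous spans with sentinel
    tokens and build the reconstruction target the decoder should produce."""
    budget = max(1, int(len(tokens) * corrupt_frac))  # how many tokens to mask
    inputs, targets = [], []
    i, sentinel = 0, 0
    while i < len(tokens):
        if budget > 0 and random.random() < corrupt_frac:
            span = min(span_length, len(tokens) - i, budget)
            inputs.append(f"<extra_id_{sentinel}>")   # mask the span in the input
            targets.append(f"<extra_id_{sentinel}>")  # target: sentinel + original span
            targets.extend(tokens[i:i + span])
            sentinel += 1
            budget -= span
            i += span
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

tokens = "<body> <form> <label> City </label> <input name=city> </form> </body>".split()
enc_in, dec_out = long_span_denoise(tokens)
```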
Published via Towards AI