How Can GPTs Interact with Computers? OmniParser Explained
Last Updated on October 31, 2024 by Editorial Team
Author(s): Barhoumi Mosbeh
Originally published on Towards AI.
Microsoft has quietly released OmniParser, an open-source tool designed to convert screenshots into structured, easy-to-interpret elements for Vision Agents. The goal of this tool is to advance the emerging field of enabling large language models (LLMs) to interact with graphical user interfaces (GUIs). Recently, Anthropic announced a similar but closed-source tool for interacting with computer interfaces. However, creating a similar system isn't as challenging as it might seem; the concept is straightforward. Microsoft's OmniParser is thoroughly documented in an accompanying paper, which explains each component in a clear and accessible way. This article will explore how we can build a tool comparable to Anthropic's.
Introduction
To give an idea of what we're trying to accomplish, imagine you needed ChatGPT's help for a UI task on the web. For example, if you wanted to set up a webhook, ChatGPT doesn't need to "see" the UI. It simply provides instructions like "click here" or "navigate to that option" based on information from sources like Stack Overflow.
Now, we want to take this a step further. The vision-enabled agent will actually be able to see what's on your screen, understand the interface, and make decisions about the next steps, such as which button to click. To do this effectively, it will need to identify the exact coordinates of UI elements.
How OmniParser works
Complex UI interaction tasks can be broken down into two fundamental requirements for Vision Language Models (VLMs):
- Understanding the current UI screen state
- Predicting the next appropriate action to accomplish the task
Instead of handling both requirements in a single step, OmniParser breaks the process down into multiple steps. First, the model must understand the current state of the screenshot: it has to recognize the objects in the screenshot and predict what will happen if each object is clicked. Microsoft researchers have also used OCR to identify clickable elements that contain text, providing more context, and they have fine-tuned an icon description model.
With this approach, the model gains knowledge of the coordinates of different components on the screen and understands what each component does.
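To make this concrete, you can picture the parsed output handed to the VLM as a list of labeled elements, each with an ID, a bounding box, and a description. The structure below is purely illustrative, not OmniParser's actual schema:

# Illustrative structure of a parsed screenshot (hypothetical field names)
parsed_elements = [
    {"id": 0, "type": "icon", "bbox": [0.12, 0.08, 0.18, 0.12],
     "description": "Settings gear icon, opens the preferences menu"},
    {"id": 1, "type": "text", "bbox": [0.40, 0.55, 0.60, 0.58],
     "description": "Submit"},
]

# The VLM reasons over these labeled elements and returns the ID of the
# element to act on, rather than guessing raw pixel coordinates.
next_action = {"action": "click", "element_id": 1}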
Interactable Element Detection
To achieve the first step in the system, Microsoft researchers trained a YOLOv8 model on 66,990 samples for 20 epochs, achieving approximately 75% mAP@50. In addition to interactable region detection, they also developed an OCR module to extract bounding boxes of text. They then merge the bounding boxes from the OCR module and the icon detection module, removing boxes with high overlap (using a threshold of over 90%). Each remaining box is labeled with a unique numeric ID, using a simple algorithm that minimizes overlap between the numeric labels and the other bounding boxes.
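A minimal sketch of that merging step might look like the following. The 90% threshold comes from the paper; the function names, the [x1, y1, x2, y2] box format, and the choice of which overlapping box to keep are my assumptions, since the paper does not spell them out:

def iou(box_a, box_b):
    # Boxes are [x1, y1, x2, y2]; returns intersection-over-union.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def merge_boxes(icon_boxes, ocr_boxes, threshold=0.9):
    # Keep every OCR box, then add icon boxes that do not overlap any
    # already-kept box above the threshold; finally assign numeric IDs.
    merged = list(ocr_boxes)
    for box in icon_boxes:
        if all(iou(box, kept) <= threshold for kept in merged):
            merged.append(box)
    return {element_id: box for element_id, box in enumerate(merged)}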
Semantic Understanding
To handle semantic understanding of UI elements, Microsoft researchers fine-tuned a BLIP-v2 model on a custom dataset of 7,000 icon-description pairs. This dataset was specifically curated using GPT-4 to ensure high-quality, relevant descriptions of UI components. The fine-tuned model processes two types of elements differently: for detected interactive icons, it generates functional descriptions explaining their purpose and behavior, while for text elements identified by the OCR module, it utilizes both the extracted text content and its corresponding label. This semantic layer feeds into the larger system by providing the VLM with explicit functional context for each UI element, reducing the need for the model to infer element purposes solely from visual appearance.
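Conceptually, the icon-description step can be sketched with the Hugging Face BLIP-2 classes as below. This is not OmniParser's released code, and the checkpoint name is a placeholder rather than Microsoft's fine-tuned weights:

from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Placeholder checkpoint; Microsoft's fine-tuned icon-caption weights
# ship with OmniParser, but the exact identifier may differ.
CHECKPOINT = "Salesforce/blip2-opt-2.7b"

processor = Blip2Processor.from_pretrained(CHECKPOINT)
model = Blip2ForConditionalGeneration.from_pretrained(CHECKPOINT)

def describe_icon(screenshot: Image.Image, bbox):
    # Crop the detected icon region and generate a functional description.
    x1, y1, x2, y2 = bbox
    crop = screenshot.crop((x1, y1, x2, y2))
    inputs = processor(images=crop, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()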
The system can fail in several interesting ways that highlight areas for potential improvement in vision-based GUI interaction. Let's explore these limitations and discuss potential solutions that could enhance the system's reliability.
Challenges with Repeated Elements
The system can fail when encountering repeated UI elements on the same page. For instance, when multiple identical "Submit" buttons appear in different sections, the current implementation struggles to distinguish between these identical elements effectively. This can lead to incorrect action predictions when the user task requires clicking on a specific instance of these repeated elements.
# Current approach: a bare, ambiguous description
description = "Submit button"

# Improved approach could look like:
enhanced_description = {
    "element_type": "Submit button",
    "context": "Form section: User Details",
    "position": "Primary submit in main form",
    "relative_location": "Bottom right of user information section",
}
The solution likely lies in implementing "contextual fingerprinting": adding layer-specific and position-specific identifiers to seemingly identical elements. This would allow the system to generate unique descriptions for each instance of repeated elements.
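A rough sketch of that idea (entirely hypothetical, not part of OmniParser) could hash together the element's type, its enclosing section, and its normalized position, so each duplicate gets a stable, distinct identity:

import hashlib

def contextual_fingerprint(element_type, section, bbox):
    # bbox is normalized [x1, y1, x2, y2]; round the center so small
    # layout jitter does not change the fingerprint.
    center = (round((bbox[0] + bbox[2]) / 2, 2), round((bbox[1] + bbox[3]) / 2, 2))
    raw = f"{element_type}|{section}|{center}"
    return hashlib.sha1(raw.encode()).hexdigest()[:10]

# Two "Submit" buttons in different sections now get distinct IDs.
a = contextual_fingerprint("Submit button", "User Details form", [0.6, 0.4, 0.7, 0.45])
b = contextual_fingerprint("Submit button", "Newsletter signup", [0.6, 0.8, 0.7, 0.85])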
Granularity Issues in Bounding Box Detection
Another notable limitation involves the precision of bounding box detection, particularly with text elements. The OCR module sometimes generates overly broad bounding boxes that can lead to inaccurate click predictions. This becomes especially problematic with hyperlinks and interactive text elements.
Consider this common scenario: the OCR module returns a single broad box around an entire phrase, and the predicted click point is simply the center of that box, which may miss the actual interactive region.

[Read More About Our Services]
              ^
              Current click point (center of the broad OCR box)
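The failure mode is easy to reproduce on paper. In the tiny sketch below (all coordinates are made up), the OCR box spans the whole sentence while only part of it is the real link, so the center-of-box click lands on inert text:

def click_point(bbox):
    # Predicted click target: the geometric center of the bounding box.
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2, (y1 + y2) / 2)

# OCR box covering the whole sentence vs. the region of the actual link.
broad_ocr_box = (100, 500, 520, 530)
actual_link_box = (360, 500, 520, 530)

cx, cy = click_point(broad_ocr_box)   # (310.0, 515.0)
hits_link = actual_link_box[0] <= cx <= actual_link_box[2]
print(hits_link)                      # False: the click misses the link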
Resources:
https://microsoft.github.io/OmniParser/
https://www.microsoft.com/en-us/research/articles/omniparser-for-pure-vision-based-gui-agent/
Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku (Anthropic announcement): www.anthropic.com
Published via Towards AI