How Can GPTs Interact with Computers? OmniParser Explained
Last Updated on October 31, 2024 by Editorial Team
Author(s): Barhoumi Mosbeh
Originally published on Towards AI.
Microsoft has quietly released OmniParser, an open-source tool designed to convert screenshots into structured, easy-to-interpret elements for Vision Agents. The goal of this tool is to advance the emerging field of enabling large language models (LLMs) to interact with graphical user interfaces (GUIs). Recently, Anthropic announced a similar but closed-source tool for interacting with computer interfaces. However, creating a similar system isn't as challenging as it might seem; the concept is straightforward. Microsoft's OmniParser is thoroughly documented in an accompanying paper, which explains each component in a clear and accessible way. This article will explore how we can build a tool comparable to Anthropic's.
Introduction
To give an idea of what we're trying to accomplish, imagine you needed ChatGPT's help for a UI task on the web. For example, if you wanted to set up a webhook, ChatGPT doesn't need to "see" the UI. It simply provides instructions like "click here" or "navigate to that option" based on information from sources like Stack Overflow.
Now, we want to take this a step further. The vision-enabled agent will actually be able to see what's on your screen, understand the interface, and make decisions about the next steps, such as which button to click. To do this effectively, it will need to identify the exact coordinates of UI elements.
How OmniParser works
Complex UI interaction tasks can be broken down into two fundamental requirements for Vision Language Models (VLMs):
- Understanding the current UI screen state
- Predicting the next appropriate action to accomplish the task
Instead of handling both requirements in a single step, OmniParser breaks the process down into multiple steps. First, the model must understand the current state of the screenshot: it has to recognize the objects in the screenshot and predict what will happen if each object is clicked. Microsoft researchers have also used OCR to identify clickable elements that contain text, providing more context, and they have fine-tuned an icon description model.
With this approach, the model gains knowledge of the coordinates of different components on the screen and understands what each component does.
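To make this concrete, you can picture the parsed output handed to the VLM as a list of labeled elements, each with an ID, a bounding box, and a description. The structure below is purely illustrative, not OmniParser's actual schema:

# Illustrative structure of a parsed screenshot (hypothetical field names)
parsed_elements = [
    {"id": 0, "type": "icon", "bbox": [0.12, 0.08, 0.18, 0.12],
     "description": "Settings gear icon, opens the preferences menu"},
    {"id": 1, "type": "text", "bbox": [0.40, 0.55, 0.60, 0.58],
     "description": "Submit"},
]

# The VLM reasons over these labeled elements and returns the ID of the
# element to act on, rather than guessing raw pixel coordinates.
next_action = {"action": "click", "element_id": 1}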
Interactable Element Detection
To achieve the first step in the system, Microsoft researchers trained a YOLOv8 model on 66,990 samples for 20 epochs, achieving approximately 75% mAP@50. In addition to interactable region detection, they also developed an OCR module to extract bounding boxes of text. They then merge the bounding boxes from the OCR module and the icon detection module, removing boxes with high overlap (using a threshold of over 90%). Each remaining box is labeled with a unique numeric ID, using a simple algorithm that minimizes overlap between the numeric labels and the other bounding boxes.
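A minimal sketch of that merging step might look like the following. The 90% threshold comes from the paper; the function names, the [x1, y1, x2, y2] box format, and the choice of which overlapping box to keep are my assumptions, since the paper does not spell them out:

def iou(box_a, box_b):
    # Boxes are [x1, y1, x2, y2]; returns intersection-over-union.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def merge_boxes(icon_boxes, ocr_boxes, threshold=0.9):
    # Keep every OCR box, then add icon boxes that do not overlap any
    # already-kept box above the threshold; finally assign numeric IDs.
    merged = list(ocr_boxes)
    for box in icon_boxes:
        if all(iou(box, kept) <= threshold for kept in merged):
            merged.append(box)
    return {element_id: box for element_id, box in enumerate(merged)}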
Semantic Understanding
To handle semantic understanding of UI elements, Microsoft researchers fine-tuned a BLIP-v2 model on a custom dataset of 7,000 icon-description pairs. This dataset was specifically curated using GPT-4 to ensure high-quality, relevant descriptions of UI components. The fine-tuned model processes two types of elements differently: for detected interactive icons, it generates functional descriptions explaining their purpose and behavior, while for text elements identified by the OCR module, it utilizes both the extracted text content and its corresponding label. This semantic layer feeds into the larger system by providing the VLM with explicit functional context for each UI element, reducing the need for the model to infer element purposes solely from visual appearance.
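Conceptually, the icon-description step can be sketched with the Hugging Face BLIP-2 classes as below. This is not OmniParser's released code, and the checkpoint name is a placeholder rather than Microsoft's fine-tuned weights:

from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Placeholder checkpoint; Microsoft's fine-tuned icon-caption weights
# ship with OmniParser, but the exact identifier may differ.
CHECKPOINT = "Salesforce/blip2-opt-2.7b"

processor = Blip2Processor.from_pretrained(CHECKPOINT)
model = Blip2ForConditionalGeneration.from_pretrained(CHECKPOINT)

def describe_icon(screenshot: Image.Image, bbox):
    # Crop the detected icon region and generate a functional description.
    x1, y1, x2, y2 = bbox
    crop = screenshot.crop((x1, y1, x2, y2))
    inputs = processor(images=crop, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()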
The system can fail in several interesting ways that highlight areas for potential improvement in vision-based GUI interaction. Let's explore these limitations and discuss potential solutions that could enhance the system's reliability.
Challenges with Repeated Elements
The system can fail when encountering repeated UI elements on the same page. For instance, when multiple identical "Submit" buttons appear in different sections, the current implementation struggles to distinguish between these identical elements effectively. This can lead to incorrect action predictions when the user task requires clicking on a specific instance of these repeated elements.
# Current approach: a bare, ambiguous description
description = "Submit button"

# Improved approach could look like:
enhanced_description = {
    "element_type": "Submit button",
    "context": "Form section: User Details",
    "position": "Primary submit in main form",
    "relative_location": "Bottom right of user information section",
}
The solution likely lies in implementing "contextual fingerprinting": adding layer-specific and position-specific identifiers to seemingly identical elements. This would allow the system to generate unique descriptions for each instance of repeated elements.
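A rough sketch of that idea (entirely hypothetical, not part of OmniParser) could hash together the element's type, its enclosing section, and its normalized position, so each duplicate gets a stable, distinct identity:

import hashlib

def contextual_fingerprint(element_type, section, bbox):
    # bbox is normalized [x1, y1, x2, y2]; round the center so small
    # layout jitter does not change the fingerprint.
    center = (round((bbox[0] + bbox[2]) / 2, 2), round((bbox[1] + bbox[3]) / 2, 2))
    raw = f"{element_type}|{section}|{center}"
    return hashlib.sha1(raw.encode()).hexdigest()[:10]

# Two "Submit" buttons in different sections now get distinct IDs.
a = contextual_fingerprint("Submit button", "User Details form", [0.6, 0.4, 0.7, 0.45])
b = contextual_fingerprint("Submit button", "Newsletter signup", [0.6, 0.8, 0.7, 0.85])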
Granularity Issues in Bounding Box Detection
Another notable limitation involves the precision of bounding box detection, particularly with text elements. The OCR module sometimes generates overly broad bounding boxes that can lead to inaccurate click predictions. This becomes especially problematic with hyperlinks and interactive text elements.
Consider this common scenario: the OCR module returns a single broad box around an entire phrase, and the predicted click point is simply the center of that box, which may miss the actual interactive region.

[Read More About Our Services]
              ^
              Current click point (center of the broad OCR box)
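The failure mode is easy to reproduce on paper. In the tiny sketch below (all coordinates are made up), the OCR box spans the whole sentence while only part of it is the real link, so the center-of-box click lands on inert text:

def click_point(bbox):
    # Predicted click target: the geometric center of the bounding box.
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2, (y1 + y2) / 2)

# OCR box covering the whole sentence vs. the region of the actual link.
broad_ocr_box = (100, 500, 520, 530)
actual_link_box = (360, 500, 520, 530)

cx, cy = click_point(broad_ocr_box)   # (310.0, 515.0)
hits_link = actual_link_box[0] <= cx <= actual_link_box[2]
print(hits_link)                      # False: the click misses the link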
Resources:
https://microsoft.github.io/OmniParser/
https://www.microsoft.com/en-us/research/articles/omniparser-for-pure-vision-based-gui-agent/
Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku (Anthropic announcement): www.anthropic.com
Published via Towards AI