
How Can GPTs Interact with Computers? OmniParser Explained

Last Updated on October 31, 2024 by Editorial Team

Author(s): Barhoumi Mosbeh

Originally published on Towards AI.

Microsoft has quietly released OmniParser, an open-source tool designed to convert screenshots into structured, easy-to-interpret elements for Vision Agents. The goal of this tool is to advance the emerging field of enabling large language models (LLMs) to interact with graphical user interfaces (GUIs). Anthropic recently announced a similar but closed-source tool for interacting with computer interfaces. However, creating a comparable system isn't as challenging as it might seem; the underlying concept is straightforward. Microsoft's OmniParser is thoroughly documented in an accompanying paper, which explains each component in a clear and accessible way. This article explores how we can build a tool comparable to Anthropic's.

Introduction

To give an idea of what we're trying to accomplish, imagine you need ChatGPT's help with a UI task on the web. For example, if you want to set up a webhook, ChatGPT doesn't need to "see" the UI. It simply provides instructions like "click here" or "navigate to that option" based on information from sources like Stack Overflow.

Image from the author

Now, we want to take this a step further. The vision-enabled agent will actually be able to see what’s on your screen, understand the interface, and make decisions about the next steps, such as which button to click. To do this effectively, it will need to identify the exact coordinates of UI elements.

How OmniParser Works

Complex UI interaction tasks can be broken down into two fundamental requirements for Vision Language Models (VLMs):

  1. Understanding the current UI screen state
  2. Predicting the next appropriate action in order to accomplish the task

Instead of handling both requirements in a single step, OmniParser breaks the process into multiple steps. First, the model must understand the current state of the screenshot: it has to recognize the objects on the screen and predict what will happen if each one is clicked. The Microsoft researchers also used OCR to identify clickable elements that contain text and add context, and they fine-tuned an icon description model.

With this approach, the model gains knowledge of the coordinates of different components on the screen and understands what each component does.

Image from the author
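
As a rough sketch, the structured list such a parsing step hands to the VLM might look like the following. The field names and values are illustrative assumptions, not OmniParser's actual output schema.

# Hypothetical structured output for one parsed screenshot
parsed_screen = [
    {"id": 0, "type": "icon",
     "bbox": [0.12, 0.08, 0.18, 0.12],  # normalized x1, y1, x2, y2
     "description": "Settings gear icon, opens the preferences panel"},
    {"id": 1, "type": "text",
     "bbox": [0.40, 0.52, 0.61, 0.56],
     "description": "Text 'Create webhook', likely a clickable link"},
]

# The VLM reasons over these descriptions and returns the ID of the element
# to act on, so it never has to predict raw pixel coordinates itself.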

Interactable Element Detection

To achieve the first step in the system, Microsoft researchers trained a YOLOv8 model on 66,990 samples for 20 epochs, reaching approximately 75% mAP@50. In addition to interactable region detection, they developed an OCR module to extract bounding boxes of text. They then merged the bounding boxes from the OCR module and the icon detection module, removing boxes with high overlap (using a threshold of over 90%). Each remaining bounding box is labeled with a unique ID by a simple algorithm that minimizes overlap between the numeric labels and the other bounding boxes.

Figure from the OmniParser paper
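
The merging step can be sketched in a few lines of Python. This is a simplified approximation of the logic described above, assuming boxes in (x1, y1, x2, y2) format; the exact overlap metric and ordering used by OmniParser may differ.

def overlap_ratio(box_a, box_b):
    # Intersection area divided by the smaller box's area
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    smaller = min(area_a, area_b)
    return inter / smaller if smaller > 0 else 0.0

def merge_boxes(ocr_boxes, icon_boxes, threshold=0.9):
    # Keep every OCR text box, and keep an icon box only if it does not
    # overlap any text box by more than the threshold.
    merged = list(ocr_boxes)
    for icon in icon_boxes:
        if all(overlap_ratio(icon, text) <= threshold for text in ocr_boxes):
            merged.append(icon)
    # Assign a unique numeric ID to each surviving box for on-screen labeling.
    return {i: box for i, box in enumerate(merged)}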

Semantic Understanding

To handle semantic understanding of UI elements, Microsoft researchers fine-tuned a BLIP-2 model on a custom dataset of 7,000 icon-description pairs. This dataset was curated using GPT-4 to ensure high-quality, relevant descriptions of UI components. The fine-tuned model treats the two types of elements differently: for detected interactive icons, it generates functional descriptions explaining their purpose and behavior, while for text elements identified by the OCR module, it uses both the extracted text content and its corresponding label. This semantic layer feeds the VLM explicit functional context for each UI element, reducing the need to infer an element's purpose solely from its visual appearance.

Figure from the OmniParser paper
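
For intuition, here is a minimal sketch of generating a functional description for one cropped icon with an off-the-shelf BLIP-2 checkpoint from Hugging Face. The public base model, the image path, and the prompt below are stand-ins; Microsoft used their own fine-tuned weights and training setup.

from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# A single detected icon, cropped from the screenshot (path is illustrative)
icon_crop = Image.open("icon_crop.png")
prompt = "Question: what does this UI icon do? Answer:"
inputs = processor(images=icon_crop, text=prompt, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])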

The system can fail in several interesting ways that highlight areas for potential improvement in vision-based GUI interaction. Let’s explore these limitations and discuss potential solutions that could enhance the system’s reliability.

Challenges with Repeated Elements

The system can fail when encountering repeated UI elements on the same page. For instance, when multiple identical "Submit" buttons appear in different sections, the current implementation struggles to distinguish between these identical elements effectively. This can lead to incorrect action predictions when the user task requires clicking on a specific instance of these repeated elements.

# Current approach
description = "Submit button"

# Improved approach could look like:
enhanced_description = {
    "element_type": "Submit button",
    "context": "Form section: User Details",
    "position": "Primary submit in main form",
    "relative_location": "Bottom right of user information section",
}

The solution likely lies in implementing "contextual fingerprinting": adding layer-specific and position-specific identifiers to seemingly identical elements. This would allow the system to generate unique descriptions for each instance of a repeated element.
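
A minimal sketch of that idea, assuming the parser already returns a description and a bounding box per element (the function and field names below are hypothetical):

def fingerprint(element, section_name, index):
    # Fold the enclosing section and the element's position into its
    # description so identical-looking elements stay distinguishable.
    x1, y1, _, _ = element["bbox"]
    return {
        "element_type": element["description"],   # e.g. "Submit button"
        "context": f"Form section: {section_name}",
        "instance": index,                         # order of appearance on the page
        "relative_location": f"x={x1:.2f}, y={y1:.2f}",
    }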

Granularity Issues in Bounding Box Detection

Another notable limitation involves the precision of bounding box detection, particularly with text elements. The OCR module sometimes generates overly broad bounding boxes that can lead to inaccurate click predictions. This becomes especially problematic with hyperlinks and interactive text elements.

Consider this common scenario:

[Read More About Our Services]
              ^
              Current click point (center)
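
A common heuristic is to click the center of the predicted box, which is exactly where overly broad OCR boxes cause misses. A minimal sketch, assuming (x1, y1, x2, y2) boxes:

def click_point(bbox):
    # Click target: the geometric center of the bounding box
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2, (y1 + y2) / 2)

# If the OCR box spans the whole line but the "Read More" link only covers
# its left half, the center lands on plain text and the click misses.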

Resources:

https://microsoft.github.io/OmniParser/

https://www.microsoft.com/en-us/research/articles/omniparser-for-pure-vision-based-gui-agent/

Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku (www.anthropic.com): a refreshed, more powerful Claude 3.5 Sonnet, Claude 3.5 Haiku, and a new experimental AI capability, computer use.


Published via Towards AI
