Inside Ferret-UI: Apple’s Multimodal LLM for Mobile Screen Understanding
Author(s): Jesus Rodriguez
Originally published on Towards AI.
I recently started an AI-focused educational newsletter that already has over 165,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing at thesequence.substack.com.
The AI world anxiously waits to see what Apple is going to do in this space. Unlike other tech incumbents such as Microsoft, Google, and Meta, Apple has been relatively quiet when it comes to contributions to AI. A safe bet is that anything Apple does in the space will be tied to mobile applications, leveraging its iPhone/iPad distribution. Not surprisingly, every time Apple Research publishes a paper, it triggers a tremendous level of speculation, and that has certainly been the case with its recent work on mobile screen understanding.
One of the most interesting trends in autonomous agents is based on computer vision models that can infer actions from screens. Earlier this year, we were all amazed by Rabbit's large action model demo at CES. Companies like Adept.ai have been pushing screen understanding as the right way to build autonomous agents. One of the areas in which this paradigm can undoubtedly have an impact is mobile apps. Recently, Apple dabbled in this space by publishing a paper outlining Ferret-UI, a multimodal LLM optimized for mobile screen understanding.
Ferret-UI was crafted specifically to manage and interpret user interface (UI) screens. This model excels at both understanding and executing open-ended language instructions related to UIs. The development of Ferret-UI focuses on three main areas: refining the model’s architecture, enhancing the data it learns from, and setting new benchmarks for evaluation.
Model Architecture
The foundation of Ferret-UI is the existing Ferret model, noted for its strong ability to understand, refer to, and ground elements in natural images. Building on that capability, Ferret-UI adds an "any resolution" mechanism: the screen is broken down into smaller sub-images based on its aspect ratio, accommodating both portrait and landscape orientations effectively. This structure preserves finer on-screen details and improves overall visual recognition performance.
Beyond the encoder changes, the architecture has been specifically tuned to understand and interact with UI screens across a wide range of UI-related tasks. Unlike approaches that rely on a screen's view hierarchy or external detection modules, Ferret-UI operates directly on raw pixel data from screens, enhancing its ability to conduct in-depth interactions with single screens and opening up new possibilities for applications such as accessibility enhancements.
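To make the "any resolution" idea concrete, here is a minimal sketch of the aspect-ratio-aware split the paper describes: portrait screens are divided into top and bottom halves, landscape screens into left and right halves, and the full screen plus its sub-images are each encoded separately. The 336×336 encoder input size and the PIL-based cropping are illustrative assumptions, not details from the paper.

```python
from PIL import Image

def split_screen(img):
    """Split a screenshot into two sub-images along its longer axis:
    portrait screens into top/bottom halves, landscape into left/right."""
    w, h = img.size
    if h >= w:  # portrait: horizontal division
        boxes = [(0, 0, w, h // 2), (0, h // 2, w, h)]
    else:       # landscape: vertical division
        boxes = [(0, 0, w // 2, h), (w // 2, 0, w, h)]
    return [img.crop(box) for box in boxes]

# The full screen plus its sub-images are each resized to the visual
# encoder's input size and embedded separately, so small UI elements
# keep enough pixels to be recognized.
screen = Image.open("screenshot.png")  # placeholder path
views = [screen] + split_screen(screen)
encoder_inputs = [v.resize((336, 336)) for v in views]  # assumed encoder size
```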
Training Data
The training regimen for Ferret-UI involves curating data that ranges from basic to complex UI tasks. Initially, basic training scenarios are set up using a template-driven method that covers essential UI tasks like widget identification and text recognition. These foundational tasks are crucial for the model to learn the basic semantics and spatial layout of UI components. For more complex scenarios, Apple uses advanced data generation techniques that produce detailed descriptions and simulated interactions, preparing Ferret-UI for more sophisticated discussions and decision-making about UI elements.
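The paper does not publish its exact templates, but a template-driven setup might look something like the following sketch, where hypothetical detection output (element type, visible text, bounding box) is slotted into instruction/answer templates to produce grounded QA pairs. The task names and detection schema here are illustrative, not the paper's actual format.

```python
# Hypothetical detection output for one screen: element type, visible
# text, and a bounding box in (x1, y1, x2, y2) pixel coordinates.
elements = [
    {"type": "button", "text": "Sign In", "box": (40, 820, 340, 880)},
    {"type": "text", "text": "Forgot password?", "box": (60, 910, 320, 940)},
]

# Instruction/answer templates for elementary referring and grounding
# tasks (names are illustrative, not the paper's exact task list).
TEMPLATES = {
    "ocr": ("What does the region {box} say?", "{text}"),
    "find_text": ('Where on the screen is the text "{text}"?', "{box}"),
    "widget_class": ("What type of widget is at {box}?", "{type}"),
}

def make_sample(element, task):
    question, answer = TEMPLATES[task]
    return {
        "task": task,
        "question": question.format(**element),
        "answer": answer.format(**element),
    }

samples = [make_sample(e, t) for e in elements for t in TEMPLATES]
```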
To tailor a model capable of effectively interacting with mobile UIs, it was essential to gather a diverse collection of screen data from both iPhone and Android devices. This collection forms the basis of the training and testing datasets, which are meticulously prepared to cover a variety of screen sizes and UI tasks. The data spans basic descriptive tasks to more advanced interaction simulations, ensuring comprehensive training coverage.
The dataset tackles three fundamental types of tasks:
1) Spotlight Tasks
Initial datasets include tasks that describe UI elements and predict their functionality, derived from existing datasets and enhanced with conversational QA pairs. These tasks serve as a foundational component of the training process, preparing the model for more complex interactions.
2) Elementary Tasks
The training also includes a variety of elementary UI tasks that focus on identifying and interacting with specific UI elements.
3) Advanced Tasks
Advanced tasks involve deeper reasoning and are supported by sophisticated data generation techniques that use the latest language models to simulate real-world interactions; a sketch of what this generation could look like follows below.
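As referenced above, here is a rough sketch of how such advanced data generation could work: a textual rendering of the screen's elements is handed to a stronger teacher LLM, which produces detailed descriptions, function inferences, or multi-turn conversations. The generate() call and the prompt wording are hypothetical stand-ins, not Apple's actual pipeline.

```python
def screen_to_text(elements):
    """Render detected UI elements as text a teacher LLM can reason over."""
    return "\n".join(f'{e["type"]} "{e["text"]}" at {e["box"]}' for e in elements)

PROMPTS = {
    "detailed_description": "Describe this screen's purpose and layout in detail.",
    "function_inference": "What can a user accomplish on this screen?",
    "conversation": "Write a multi-turn Q&A dialogue about this screen.",
}

def build_prompt(elements, task):
    return (
        "You are given the UI elements of a mobile screen:\n"
        f"{screen_to_text(elements)}\n\n{PROMPTS[task]}"
    )

# `generate` stands in for whatever LLM API produces the training text:
# training_example = {
#     "prompt": build_prompt(elements, "detailed_description"),
#     "response": generate(build_prompt(elements, "detailed_description")),
# }
```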
Benchmarking and Evaluation
To validate Ferret-UI's capabilities, Apple developed a comprehensive benchmark that includes a wide array of mobile UI tasks. This benchmark not only tests basic UI understanding but also challenges the model with complex interaction scenarios on both iPhone and Android platforms. Extensive tests show that Ferret-UI significantly outperforms its predecessor and exhibits superior performance on complex UI tasks compared with other multimodal LLMs. The paper also includes side-by-side outputs from different models on screen understanding tasks, which clearly highlight Ferret-UI's stronger results.
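For a sense of how such a benchmark can be scored, here is a small sketch of plausible metrics: grounding answers judged by an intersection-over-union threshold and text answers by normalized exact match. The 0.5 threshold and the matching rules are illustrative assumptions, not the paper's exact protocol.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def correct(pred, gold, iou_threshold=0.5):
    """Score one answer: boxes by IoU threshold, text by exact match."""
    if isinstance(gold, tuple):  # grounding task: box prediction
        return iou(pred, gold) >= iou_threshold
    return str(pred).strip().lower() == str(gold).strip().lower()

def accuracy(predictions, references):
    return sum(correct(p, g) for p, g in zip(predictions, references)) / len(references)

# e.g. accuracy([(40, 818, 338, 882), "sign in"],
#               [(40, 820, 340, 880), "Sign In"])  # -> 1.0
```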
Ferret-UI is a very interesting multimodal LLM that can be applied to many automation tasks on mobile devices. As the AI community continues to wait for Apple's positioning in the space, Ferret-UI is a very thoughtful piece of research that might signal the path forward.