Building Large Action Models: Insights from Microsoft
Author(s): Jesus Rodriguez
Originally published on Towards AI.
I recently started an AI-focused educational newsletter that already has over 175,000 subscribers. TheSequence (thesequence.ai) is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes five minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing.
Action execution is one of the key building blocks of agentic workflows. One of the most interesting debates in that area is whether actions are executed by the model itself or by an external coordination layer. Supporters of the former hypothesis have lined up behind an approach known as large action models (LAMs), with projects like Gorilla or the Rabbit r1 as key pioneers. However, there are still only a few practical examples of LAM frameworks. Recently, Microsoft Research published one of the most complete papers in this area, outlining an end-to-end framework for building LAMs. Microsoft's core idea is to bridge the gap between the language-understanding prowess of LLMs and the need for real-world action execution.
From LLMs to LAMs: A Paradigm Shift
The limitations of traditional LLMs in interacting with and manipulating the physical world necessitate the development of LAMs. While LLMs excel at generating intricate textual responses, their inability to translate understanding into tangible actions restricts their applicability in real-world scenarios. LAMs address this challenge by extending the expertise of LLMs from language processing to action generation, enabling them to perform actions in both physical and digital environments. This transition signifies a shift from passive language understanding to active task completion, marking a significant milestone in AI development.
Key Architectural Components: A Step-by-Step Approach
Microsoftβs framework for developing LAMs outlines a systematic process, encompassing crucial stages from inception to deployment. The key architectural components include:
Data Collection and Preparation
This foundational step involves gathering and curating high-quality, action-oriented data for specific use cases. This data includes user queries, environmental context, potential actions, and any other relevant information required to train the LAM effectively. A two-phase data collection approach is adopted:
Task-Plan Collection
This phase focuses on collecting data consisting of tasks and their corresponding plans. Tasks represent user requests expressed in natural language, while plans outline detailed step-by-step procedures designed to fulfill these requests. This data is crucial for training the model to generate effective plans and enhance its high-level reasoning and planning capabilities. Sources for this data include application documentation, online how-to guides like WikiHow, and historical search queries.
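A task-plan record pairs a natural-language request with its step-by-step plan. The sketch below shows what such a record might look like; the field names and the example task are illustrative assumptions, not the paper's actual schema:

```python
# Hypothetical task-plan record, illustrating the structure described
# above: a user task in natural language plus an ordered plan.
task_plan_record = {
    "task": "Highlight the word 'hello' in the document",
    "plan": [
        "Open the target document",
        "Locate the word 'hello' in the text",
        "Select the word with the cursor",
        "Apply the highlight formatting",
    ],
    # Provenance, per the sources named above (docs, WikiHow, queries)
    "source": "application help documentation",
}
```

Records like this train the model's high-level planning before any grounding in a concrete environment happens.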
Task-Action Collection
This phase converts task-plan data into executable steps. It involves refining tasks and plans to be more concrete and grounded within a specific environment. Action sequences are generated, representing actionable instructions that directly interact with the environment, such as select_text(text="hello") or click(on=Button("20"), how="left", double=False). This data provides the necessary granularity for training a LAM to perform reliable and accurate task executions in real-world scenarios.
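The paper's two example actions can be expressed as structured records rather than raw strings. This is a minimal sketch of such a schema; the `Action` class and its fields are assumptions for illustration, not the framework's actual representation:

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """A grounded, executable step (illustrative schema)."""
    function: str
    args: dict = field(default_factory=dict)

    def render(self) -> str:
        """Serialize back to the call syntax used in the article."""
        params = ", ".join(f"{k}={v!r}" for k, v in self.args.items())
        return f"{self.function}({params})"


# The two example actions from the text, in this schema:
trajectory = [
    Action("select_text", {"text": "hello"}),
    Action("click", {"on": "Button('20')", "how": "left", "double": False}),
]
```

Keeping actions structured (function name plus typed arguments) is what lets an executor validate and ground each step before touching the UI.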
Model Training
This stage involves training or fine-tuning LLMs to perform actions rather than merely generate text. A staged training strategy, consisting of four phases, is employed:
- Phase 1: Task-Plan Pretraining: This phase focuses on training the model to generate coherent and logical plans for various tasks, utilizing a dataset of 76,672 task-plan pairs. This pretraining establishes a foundational understanding of task structures, enabling the model to decompose tasks into logical steps.
- Phase 2: Learning from Experts: The model learns to execute actions by imitating expert-labeled task-action trajectories. This phase aligns plan generation with actionable steps, teaching the model how to perform actions based on observed UI states and corresponding actions.
- Phase 3: Self-Boosting Exploration: This phase encourages the model to explore and handle tasks that even expert demonstrations failed to solve. By interacting with the environment and trying alternative strategies, the model autonomously generates new success cases, promoting diversity and adaptability.
- Phase 4: Learning from a Reward Model: This phase incorporates reinforcement learning (RL) principles to optimize decision-making. A reward model is trained on success and failure data to predict the quality of actions. This model is then used to fine-tune the LAM in an offline RL setting, allowing the model to learn from failures and improve action selection without additional environmental interactions.
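The core idea of Phase 4 can be sketched in a few lines: a reward model scores recorded (state, action) pairs offline, and fine-tuning then favors high-reward actions. The toy heuristic below stands in for the learned reward model; everything here is a simplified assumption, not the paper's implementation:

```python
def reward_model(state: str, action: str) -> float:
    """Stand-in for a learned reward model trained on
    success/failure trajectories; returns an action-quality score."""
    # Toy heuristic: saving an edited document is the "good" action.
    return 1.0 if "save" in action and "edited" in state else 0.0


def rank_actions(state: str, candidates: list) -> list:
    """Rank candidate actions by predicted reward (offline, no
    additional environment interaction needed)."""
    return sorted(candidates, key=lambda a: reward_model(state, a),
                  reverse=True)


best = rank_actions("document edited", ["close()", "save()"])[0]
```

Because scoring happens against logged data rather than a live environment, the model can learn from past failures without the cost or risk of re-executing them.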
Integration and Grounding
The trained LAM is integrated into an agent framework, enabling interaction with external tools, maintaining memory, and interfacing with the environment. This integration transforms the model into a functional agent capable of making meaningful impacts in the physical world. Microsoft's UFO, a GUI agent for Windows OS interaction, exemplifies this integration. The AppAgent within UFO serves as the operational platform for the LAM.
Evaluation
Rigorous evaluation processes are essential to assess the reliability, robustness, and safety of the LAM before real-world deployment. This evaluation involves testing the model in a variety of scenarios to ensure generalization across different environments and tasks, as well as effective handling of unexpected situations. Both offline and online evaluations are conducted:
- Offline Evaluation: The LAM's performance is assessed using an offline dataset in a controlled, static environment. This allows for systematic analysis of task success rates, precision, and recall metrics.
- Online Evaluation: The LAM's performance is evaluated in a real-world environment. This involves measuring aspects like task completion accuracy, efficiency, and effectiveness.
Key Building Blocks: Essential Features of LAMs
Several key building blocks empower LAMs to perform complex real-world tasks:
- Action Generation: The ability to translate user intentions into actionable steps grounded in the environment is a defining feature of LAMs. These actions can manifest as operations on graphical user interfaces (GUIs), API calls for software applications, physical manipulations by robots, or even code generation.
- Dynamic Planning and Adaptation: LAMs are capable of decomposing complex tasks into subtasks and dynamically adjusting their plans in response to environmental changes. This adaptive planning ensures robust performance in dynamic, real-world scenarios where unexpected situations are common.
- Specialization and Efficiency: LAMs can be tailored for specific domains or tasks, achieving high accuracy and efficiency within their operational scope. This specialization allows for reduced computational overhead and improved response times compared to general-purpose LLMs.
- Agent Systems: Agent systems provide the operational framework for LAMs, equipping them with tools, memory, and feedback mechanisms. This integration allows LAMs to interact with the world and execute actions effectively. UFO's AppAgent, for example, employs components like action executors, memory, and environment data collection to facilitate seamless interaction between the LAM and the Windows OS environment.
The UFO Agent: Grounding LAMs in Windows OS
Microsoft's UFO agent exemplifies the integration and grounding of LAMs in a real-world environment. Key aspects of UFO include:
- Architecture: UFO comprises a HostAgent for decomposing user requests into subtasks and an AppAgent for executing these subtasks within specific applications. This hierarchical structure facilitates the handling of complex, cross-application tasks.
- AppAgent Structure: The AppAgent, where the LAM resides, consists of:
- Environment Data Collection: The agent gathers information about the application environment, including UI elements and their properties, to provide context for the LAM.
- LAM Inference Engine: The LAM, serving as the brain of the AppAgent, processes the collected information and infers the necessary actions to fulfill the user request.
- Action Executor: This component grounds the LAM's predicted actions, translating them into concrete interactions with the application's UI, such as mouse clicks, keyboard inputs, or API calls.
- Memory: The agent maintains a memory of previous actions and plans, providing crucial context for the LAM to make informed and adaptive decisions.
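The four AppAgent components above form a simple loop: observe the environment, let the LAM infer an action, execute it, and record the step in memory. This is a minimal sketch of that loop with stub functions standing in for UFO's real components; none of the function names are from the actual codebase:

```python
def collect_environment() -> dict:
    """Stub for environment data collection: UI elements and state."""
    return {"ui_elements": ["Button('Save')", "TextBox('Body')"]}


def lam_inference(state: dict, memory: list) -> str:
    """Stub for the LAM inference engine: choose the next action
    given the current UI state and the history in memory."""
    return f"click(on={state['ui_elements'][0]}, how='left', double=False)"


def execute(action: str) -> bool:
    """Stub for the action executor (mouse, keyboard, or API calls)."""
    return True


memory = []
for _ in range(1):  # one illustrative iteration
    state = collect_environment()
    action = lam_inference(state, memory)
    success = execute(action)
    # Memory keeps prior steps so later inferences can adapt.
    memory.append({"state": state, "action": action, "success": success})
```

The memory entry appended at the end of each iteration is what gives the LAM the context the article mentions for informed, adaptive decisions on subsequent steps.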
Evaluation and Performance: Benchmarking LAMs
Microsoft employs a comprehensive evaluation framework to assess the performance of LAMs in both controlled and real-world environments. Key metrics include:
- Task Success Rate (TSR): This measures the percentage of tasks successfully completed out of the total attempted. It evaluates the agent's ability to accurately and reliably complete tasks.
- Task Completion Time: This measures the total time taken to complete a task, from the initial request to the final action. It reflects the efficiency of the LAM and agent system.
- Object Accuracy: This measures the accuracy of selecting the correct UI element for each task step. It assesses the agent's ability to interact with the appropriate UI components.
- Step Success Rate (SSR): This measures the percentage of individual steps completed successfully within a task. It provides a granular assessment of action execution accuracy.
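The rate-based metrics above are straightforward to compute from logged results. Here is a small sketch with made-up numbers (the data is illustrative, not from the paper):

```python
def task_success_rate(task_results: list) -> float:
    """TSR: fraction of tasks fully completed."""
    return sum(task_results) / len(task_results)


def step_success_rate(step_results: list) -> float:
    """SSR: fraction of individual steps completed successfully."""
    return sum(step_results) / len(step_results)


def object_accuracy(selected: list, expected: list) -> float:
    """Fraction of steps where the correct UI element was chosen."""
    correct = sum(s == e for s, e in zip(selected, expected))
    return correct / len(expected)


# Illustrative run: 3 of 4 tasks succeed, 9 of 10 steps succeed.
tsr = task_success_rate([True, True, False, True])
ssr = step_success_rate([True] * 9 + [False])
acc = object_accuracy(["Save", "Body"], ["Save", "Body"])
```

Note that a task counts toward TSR only if every one of its steps lands, which is why SSR typically sits above TSR for the same run.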
In online evaluations using Microsoft Word as the target application, LAM achieved a TSR of 71.0%, demonstrating competitive performance compared to baseline models like GPT-4o. Importantly, LAM exhibited superior efficiency, achieving the shortest task completion times and lowest average step latencies. These results underscore the efficacy of Microsoft's framework in building LAMs that are not only accurate but also efficient in real-world applications.
Limitations
Despite the advancements made, LAMs are still in their early stages of development. Key limitations and future research areas include:
- Safety Risks: The ability of LAMs to interact with the real world introduces potential safety concerns. Robust mechanisms are needed to ensure that LAMs operate safely and reliably, minimizing the risk of unintended consequences.
- Ethical Considerations: The development and deployment of LAMs raise ethical considerations, particularly regarding bias, fairness, and accountability. Future research needs to address these concerns to ensure responsible LAM development and deployment.
- Scalability and Adaptability: Scaling LAMs to new domains and tasks can be challenging due to the need for extensive data collection and training. Developing more efficient training methods and exploring techniques like transfer learning are crucial for enhancing the scalability and adaptability of LAMs.
Conclusion
Microsoft's framework for building LAMs represents a significant advancement in AI, enabling a shift from passive language understanding to active real-world engagement. The framework's comprehensive approach, encompassing data collection, model training, agent integration, and rigorous evaluation, provides a robust foundation for building LAMs. While challenges remain, the transformative potential of LAMs in revolutionizing human-computer interaction and automating complex tasks is undeniable. Continued research and development efforts will pave the way for more sophisticated, reliable, and ethically sound LAM applications, bringing us closer to a future where AI seamlessly integrates with our lives, augmenting human capabilities and transforming our interaction with the world around us.