Building Large Action Models: Insights from Microsoft
Author(s): Jesus Rodriguez
Originally published on Towards AI.
I recently started an AI-focused educational newsletter that already has over 175,000 subscribers. TheSequence (thesequence.ai) is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes five minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing.
Action execution is one of the key building blocks of agentic workflows. One of the most interesting debates in that area is whether actions are executed by the model itself or by an external coordination layer. Supporters of the former hypothesis have lined up behind an approach known as large action models (LAMs), with projects like Gorilla or the Rabbit r1 as key pioneers. However, there are still only a few practical examples of LAM frameworks. Recently, Microsoft Research published one of the most complete papers in this area, outlining an end-to-end framework for building LAMs. Microsoft's core idea is to bridge the gap between the language-understanding prowess of LLMs and the need for real-world action execution.
From LLMs to LAMs: A Paradigm Shift
The limitations of traditional LLMs in interacting with and manipulating the physical world necessitate the development of LAMs. While LLMs excel at generating intricate textual responses, their inability to translate understanding into tangible actions restricts their applicability in real-world scenarios. LAMs address this challenge by extending the expertise of LLMs from language processing to action generation, enabling them to perform actions in both physical and digital environments. This transition signifies a shift from passive language understanding to active task completion, marking a significant milestone in AI development.
Key Architectural Components: A Step-by-Step Approach
Microsoftβs framework for developing LAMs outlines a systematic process, encompassing crucial stages from inception to deployment. The key architectural components include:
Data Collection and Preparation
This foundational step involves gathering and curating high-quality, action-oriented data for specific use cases. This data includes user queries, environmental context, potential actions, and any other relevant information required to train the LAM effectively. A two-phase data collection approach is adopted:
Task-Plan Collection
This phase focuses on collecting data consisting of tasks and their corresponding plans. Tasks represent user requests expressed in natural language, while plans outline detailed step-by-step procedures designed to fulfill these requests. This data is crucial for training the model to generate effective plans and enhance its high-level reasoning and planning capabilities. Sources for this data include application documentation, online how-to guides like WikiHow, and historical search queries.
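A task-plan record pairs a natural-language request with its step-by-step plan. The sketch below shows what such a record might look like; the field names and the example task are illustrative assumptions, not the paper's actual schema:

```python
# Hypothetical task-plan record, illustrating the structure described
# above: a user task in natural language plus an ordered plan.
task_plan_record = {
    "task": "Highlight the word 'hello' in the document",
    "plan": [
        "Open the target document",
        "Locate the word 'hello' in the text",
        "Select the word with the cursor",
        "Apply the highlight formatting",
    ],
    # Provenance, per the sources named above (docs, WikiHow, queries)
    "source": "application help documentation",
}
```

Records like this train the model's high-level planning before any grounding in a concrete environment happens.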
Task-Action Collection
This phase converts task-plan data into executable steps. It involves refining tasks and plans to be more concrete and grounded within a specific environment. Action sequences are generated, representing actionable instructions that directly interact with the environment, such as select_text(text="hello") or click(on=Button("20"), how="left", double=False). This data provides the necessary granularity for training a LAM to perform reliable and accurate task executions in real-world scenarios.
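The paper's two example actions can be expressed as structured records rather than raw strings. This is a minimal sketch of such a schema; the `Action` class and its fields are assumptions for illustration, not the framework's actual representation:

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """A grounded, executable step (illustrative schema)."""
    function: str
    args: dict = field(default_factory=dict)

    def render(self) -> str:
        """Serialize back to the call syntax used in the article."""
        params = ", ".join(f"{k}={v!r}" for k, v in self.args.items())
        return f"{self.function}({params})"


# The two example actions from the text, in this schema:
trajectory = [
    Action("select_text", {"text": "hello"}),
    Action("click", {"on": "Button('20')", "how": "left", "double": False}),
]
```

Keeping actions structured (function name plus typed arguments) is what lets an executor validate and ground each step before touching the UI.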
Model Training
This stage involves training or fine-tuning LLMs to perform actions rather than merely generate text. A staged training strategy, consisting of four phases, is employed:
- Phase 1: Task-Plan Pretraining: This phase focuses on training the model to generate coherent and logical plans for various tasks, utilizing a dataset of 76,672 task-plan pairs. This pretraining establishes a foundational understanding of task structures, enabling the model to decompose tasks into logical steps.
- Phase 2: Learning from Experts: The model learns to execute actions by imitating expert-labeled task-action trajectories. This phase aligns plan generation with actionable steps, teaching the model how to perform actions based on observed UI states and corresponding actions.
- Phase 3: Self-Boosting Exploration: This phase encourages the model to explore and handle tasks that even expert demonstrations failed to solve. By interacting with the environment and trying alternative strategies, the model autonomously generates new success cases, promoting diversity and adaptability.
- Phase 4: Learning from a Reward Model: This phase incorporates reinforcement learning (RL) principles to optimize decision-making. A reward model is trained on success and failure data to predict the quality of actions. This model is then used to fine-tune the LAM in an offline RL setting, allowing the model to learn from failures and improve action selection without additional environmental interactions.
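The core idea of Phase 4 can be sketched in a few lines: a reward model scores recorded (state, action) pairs offline, and fine-tuning then favors high-reward actions. The toy heuristic below stands in for the learned reward model; everything here is a simplified assumption, not the paper's implementation:

```python
def reward_model(state: str, action: str) -> float:
    """Stand-in for a learned reward model trained on
    success/failure trajectories; returns an action-quality score."""
    # Toy heuristic: saving an edited document is the "good" action.
    return 1.0 if "save" in action and "edited" in state else 0.0


def rank_actions(state: str, candidates: list) -> list:
    """Rank candidate actions by predicted reward (offline, no
    additional environment interaction needed)."""
    return sorted(candidates, key=lambda a: reward_model(state, a),
                  reverse=True)


best = rank_actions("document edited", ["close()", "save()"])[0]
```

Because scoring happens against logged data rather than a live environment, the model can learn from past failures without the cost or risk of re-executing them.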
Integration and Grounding
The trained LAM is integrated into an agent framework, enabling interaction with external tools, maintaining memory, and interfacing with the environment. This integration transforms the model into a functional agent capable of making meaningful impacts in the physical world. Microsoft's UFO, a GUI agent for Windows OS interaction, exemplifies this integration. The AppAgent within UFO serves as the operational platform for the LAM.
Evaluation
Rigorous evaluation processes are essential to assess the reliability, robustness, and safety of the LAM before real-world deployment. This evaluation involves testing the model in a variety of scenarios to ensure generalization across different environments and tasks, as well as effective handling of unexpected situations. Both offline and online evaluations are conducted:
- Offline Evaluation: The LAM's performance is assessed using an offline dataset in a controlled, static environment. This allows for systematic analysis of task success rates, precision, and recall metrics.
- Online Evaluation: The LAM's performance is evaluated in a real-world environment. This involves measuring aspects like task completion accuracy, efficiency, and effectiveness.
Key Building Blocks: Essential Features of LAMs
Several key building blocks empower LAMs to perform complex real-world tasks:
- Action Generation: The ability to translate user intentions into actionable steps grounded in the environment is a defining feature of LAMs. These actions can manifest as operations on graphical user interfaces (GUIs), API calls for software applications, physical manipulations by robots, or even code generation.
- Dynamic Planning and Adaptation: LAMs are capable of decomposing complex tasks into subtasks and dynamically adjusting their plans in response to environmental changes. This adaptive planning ensures robust performance in dynamic, real-world scenarios where unexpected situations are common.
- Specialization and Efficiency: LAMs can be tailored for specific domains or tasks, achieving high accuracy and efficiency within their operational scope. This specialization allows for reduced computational overhead and improved response times compared to general-purpose LLMs.
- Agent Systems: Agent systems provide the operational framework for LAMs, equipping them with tools, memory, and feedback mechanisms. This integration allows LAMs to interact with the world and execute actions effectively. UFO's AppAgent, for example, employs components like action executors, memory, and environment data collection to facilitate seamless interaction between the LAM and the Windows OS environment.
The UFO Agent: Grounding LAMs in Windows OS
Microsoft's UFO agent exemplifies the integration and grounding of LAMs in a real-world environment. Key aspects of UFO include:
- Architecture: UFO comprises a HostAgent for decomposing user requests into subtasks and an AppAgent for executing these subtasks within specific applications. This hierarchical structure facilitates the handling of complex, cross-application tasks.
- AppAgent Structure: The AppAgent, where the LAM resides, consists of:
- Environment Data Collection: The agent gathers information about the application environment, including UI elements and their properties, to provide context for the LAM.
- LAM Inference Engine: The LAM, serving as the brain of the AppAgent, processes the collected information and infers the necessary actions to fulfill the user request.
- Action Executor: This component grounds the LAM's predicted actions, translating them into concrete interactions with the application's UI, such as mouse clicks, keyboard inputs, or API calls.
- Memory: The agent maintains a memory of previous actions and plans, providing crucial context for the LAM to make informed and adaptive decisions.
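The four AppAgent components above form a simple loop: observe the environment, let the LAM infer an action, execute it, and record the step in memory. This is a minimal sketch of that loop with stub functions standing in for UFO's real components; none of the function names are from the actual codebase:

```python
def collect_environment() -> dict:
    """Stub for environment data collection: UI elements and state."""
    return {"ui_elements": ["Button('Save')", "TextBox('Body')"]}


def lam_inference(state: dict, memory: list) -> str:
    """Stub for the LAM inference engine: choose the next action
    given the current UI state and the history in memory."""
    return f"click(on={state['ui_elements'][0]}, how='left', double=False)"


def execute(action: str) -> bool:
    """Stub for the action executor (mouse, keyboard, or API calls)."""
    return True


memory = []
for _ in range(1):  # one illustrative iteration
    state = collect_environment()
    action = lam_inference(state, memory)
    success = execute(action)
    # Memory keeps prior steps so later inferences can adapt.
    memory.append({"state": state, "action": action, "success": success})
```

The memory entry appended at the end of each iteration is what gives the LAM the context the article mentions for informed, adaptive decisions on subsequent steps.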
Evaluation and Performance: Benchmarking LAMs
Microsoft employs a comprehensive evaluation framework to assess the performance of LAMs in both controlled and real-world environments. Key metrics include:
- Task Success Rate (TSR): This measures the percentage of tasks successfully completed out of the total attempted. It evaluates the agent's ability to accurately and reliably complete tasks.
- Task Completion Time: This measures the total time taken to complete a task, from the initial request to the final action. It reflects the efficiency of the LAM and agent system.
- Object Accuracy: This measures the accuracy of selecting the correct UI element for each task step. It assesses the agent's ability to interact with the appropriate UI components.
- Step Success Rate (SSR): This measures the percentage of individual steps completed successfully within a task. It provides a granular assessment of action execution accuracy.
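The rate-based metrics above are straightforward to compute from logged results. Here is a small sketch with made-up numbers (the data is illustrative, not from the paper):

```python
def task_success_rate(task_results: list) -> float:
    """TSR: fraction of tasks fully completed."""
    return sum(task_results) / len(task_results)


def step_success_rate(step_results: list) -> float:
    """SSR: fraction of individual steps completed successfully."""
    return sum(step_results) / len(step_results)


def object_accuracy(selected: list, expected: list) -> float:
    """Fraction of steps where the correct UI element was chosen."""
    correct = sum(s == e for s, e in zip(selected, expected))
    return correct / len(expected)


# Illustrative run: 3 of 4 tasks succeed, 9 of 10 steps succeed.
tsr = task_success_rate([True, True, False, True])
ssr = step_success_rate([True] * 9 + [False])
acc = object_accuracy(["Save", "Body"], ["Save", "Body"])
```

Note that a task counts toward TSR only if every one of its steps lands, which is why SSR typically sits above TSR for the same run.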
In online evaluations using Microsoft Word as the target application, LAM achieved a TSR of 71.0%, demonstrating competitive performance compared to baseline models like GPT-4o. Importantly, LAM exhibited superior efficiency, achieving the shortest task completion times and lowest average step latencies. These results underscore the efficacy of Microsoft's framework in building LAMs that are not only accurate but also efficient in real-world applications.
Limitations
Despite the advancements made, LAMs are still in their early stages of development. Key limitations and future research areas include:
- Safety Risks: The ability of LAMs to interact with the real world introduces potential safety concerns. Robust mechanisms are needed to ensure that LAMs operate safely and reliably, minimizing the risk of unintended consequences.
- Ethical Considerations: The development and deployment of LAMs raise ethical considerations, particularly regarding bias, fairness, and accountability. Future research needs to address these concerns to ensure responsible LAM development and deployment.
- Scalability and Adaptability: Scaling LAMs to new domains and tasks can be challenging due to the need for extensive data collection and training. Developing more efficient training methods and exploring techniques like transfer learning are crucial for enhancing the scalability and adaptability of LAMs.
Conclusion
Microsoft's framework for building LAMs represents a significant advancement in AI, enabling a shift from passive language understanding to active real-world engagement. The framework's comprehensive approach, encompassing data collection, model training, agent integration, and rigorous evaluation, provides a robust foundation for building LAMs. While challenges remain, the transformative potential of LAMs in revolutionizing human-computer interaction and automating complex tasks is undeniable. Continued research and development efforts will pave the way for more sophisticated, reliable, and ethically sound LAM applications, bringing us closer to a future where AI seamlessly integrates with our lives, augmenting human capabilities and transforming our interaction with the world around us.