Inside xLAM: Salesforce’s Models Specialized for Agentic Tasks
Last Updated on September 18, 2024 by Editorial Team
Author(s): Jesus Rodriguez
Originally published on Towards AI.
I recently started an AI-focused educational newsletter that already has over 170,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:
TheSequence | Jesus Rodriguez | Substack
The best source to stay up-to-date with the developments in the machine learning, artificial intelligence, and data…
thesequence.substack.com
Agentic workflows are one of the most interesting categories in foundation model research. By agentic workflows we mean AI programs that can execute actions in a specific environment. One of the main debates in the agent community is how many capabilities should live inside the model itself versus in peripheral methods like RAG. Recently, Salesforce Research published major work on agentic AI with xLAM, a series of models optimized for agentic tasks.
xLAM is a new series of action models designed specifically for AI agent tasks. It includes five different models, built using either dense or mixture-of-experts architectures. These models range in size from 1 billion to 8×22 billion parameters. A flexible and scalable training pipeline was used to enhance their performance across a variety of environments by combining and augmenting diverse datasets. Initial tests show that xLAM consistently performs well, placing first on the Berkeley Function-Calling Leaderboard and surpassing other prominent models like GPT-4 and Claude-3 in specific tasks, particularly those requiring tool use.
Agentic Models vs. Agentic RAG
As agentic AI evolves, there is a lively debate about which capabilities should be built into the models themselves rather than provided as external components. Techniques such as retrieval-augmented generation (RAG) are the most common candidates for agentic tasks. Recently, however, there has been a growth in the number of models specialized in agentic tasks, specifically in API calling.
Much of the difficulty of function calling stems from the stochastic nature of LLMs. Function calling is by definition a discrete task, so incorporating it into an LLM introduces a number of interesting challenges, to say the least. These are some of the challenges that xLAM tries to tackle.
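To make that discreteness concrete, here is a minimal sketch of the kind of structured output a function-calling model is expected to produce. The tool schema and query are hypothetical, but the pattern is the point: a free-form natural-language request has to be mapped to an exact, machine-executable call, where a single wrong argument name or value breaks the call.

```python
import json

# Hypothetical tool schema the model is allowed to call (illustrative only).
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "city": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
    }
]

user_query = "What's the weather in Paris in celsius?"

# The model's answer must be an exact, parseable call -- not free-form text.
expected_output = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}

print(json.dumps(expected_output))
```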
xLAM
The xLAM series offers a range of models suited for various needs, from smaller models like the 1B and 7B versions that are optimized for on-device applications, to larger models like the 8x7B and 8x22B versions intended for more complex tasks. Insights from the training of these models emphasize key lessons in data handling, such as the importance of unifying and augmenting data to increase its diversity and reduce overfitting. The use of synthetic data has been especially valuable, enabling xLAM models to secure top positions on competitive leaderboards.
Three main xLAM models are available, designed for different use cases:
· xLAM-7b-r: A 7B-parameter model for quick experimentation in academic settings, particularly when resources are limited.
· xLAM-8x7b-r: An 8x7B mixture-of-experts model designed for industrial applications where balancing latency, resources, and performance is key.
· xLAM-8x22b-r: The largest model, suitable for projects with substantial computational resources and high-performance demands.
These models can handle both single-turn and multi-turn tasks across various benchmarks and environments. Earlier versions, such as xLAM-1b-fc-r and xLAM-7b-fc-r, were developed for single-turn function-calling tasks, with xLAM-7b-fc-r previously achieving second place on the Berkeley Function-Calling Leaderboard, although it now ranks sixteenth in the latest version. Meanwhile, the smaller xLAM-1b-fc-r, known as the “Tiny Giant,” is optimized for mobile use.
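As a quick illustration of how one of these released checkpoints might be used, below is a minimal sketch of loading an xLAM model with the Hugging Face transformers library. The model id Salesforce/xLAM-7b-r comes from the public release, but the exact prompt format and chat template expected for function calling should be checked against the model card; the tool description in the system message is a hypothetical placeholder.

```python
# Minimal sketch: loading an xLAM checkpoint with Hugging Face transformers.
# Assumptions: the "Salesforce/xLAM-7b-r" model id and that its tokenizer ships
# a chat template; consult the model card for the officially supported prompt
# format for function calling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/xLAM-7b-r"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical tool description passed in the system prompt.
messages = [
    {"role": "system", "content": "You can call get_weather(city: str). Respond with a JSON function call."},
    {"role": "user", "content": "What's the weather in Paris?"},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```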
Data Processing and Augmentation
The data pipeline for xLAM models involves several critical steps. First, data is unified into a standard format that works across different environments and tasks. This makes it easier to apply augmentations and identify errors, such as incorrect function calls or hallucinations. The augmentation process itself focuses on improving data diversity by applying various transformations, producing new synthetic samples. The unified format simplifies this process, ensuring consistency.
Error detection is another crucial part of the data pipeline, with rule-based and large language model (LLM) tools used to spot issues like undefined functions and poor reasoning.
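The paper does not publish a single canonical schema, so the snippet below is only a sketch of what a unified sample plus a rule-based error check could look like: every trajectory is normalized into the same fields (query, available tools, generated calls), which makes it straightforward to augment samples and to flag calls to functions that were never defined, one of the hallucination patterns mentioned above. The field names are illustrative assumptions, not the exact schema used by the xLAM pipeline.

```python
# Sketch of a unified data format and a simple rule-based error check.
# Field names ("query", "tools", "calls") are illustrative assumptions.
sample = {
    "query": "Book a table for two at an Italian restaurant tonight.",
    "tools": [{"name": "find_restaurants", "parameters": ["cuisine", "party_size", "time"]}],
    "calls": [{"name": "find_restaurants",
               "arguments": {"cuisine": "Italian", "party_size": 2, "time": "19:00"}}],
}

def check_sample(sample: dict) -> list[str]:
    """Rule-based checks: flag calls to undefined functions or unknown arguments."""
    errors = []
    defined = {t["name"]: set(t["parameters"]) for t in sample["tools"]}
    for call in sample["calls"]:
        if call["name"] not in defined:
            errors.append(f"undefined function: {call['name']}")
            continue
        unknown = set(call["arguments"]) - defined[call["name"]]
        if unknown:
            errors.append(f"unknown arguments for {call['name']}: {sorted(unknown)}")
    return errors

print(check_sample(sample))  # [] -> the sample passes the rule-based checks
```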
Data Synthesis and Mixture
xLAM uses a systematic data synthesis framework known as APIGen. This framework creates verified datasets based on executable APIs, ensuring high-quality data through a multi-step verification process. Data from several sources, including synthetic datasets and general instruction datasets, is combined for supervised fine-tuning of xLAM models.
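APIGen's key idea is that every generated sample is verified before it enters the training mix, roughly: is the call well formed, does it execute against a live API, and does the result actually answer the query. The sketch below mirrors that staged structure with simplified placeholder checks; the function names and the stubbed semantic check are assumptions, not APIGen's actual implementation.

```python
# Sketch of an APIGen-style, multi-stage verification pipeline. The three stages
# (format, execution, semantic) follow the high-level description above; the
# concrete checks are simplified placeholders.
import json

def format_check(raw_output: str) -> dict | None:
    """Stage 1: is the generated call parseable JSON with the expected fields?"""
    try:
        call = json.loads(raw_output)
        return call if {"name", "arguments"} <= call.keys() else None
    except json.JSONDecodeError:
        return None

def execution_check(call: dict, executable_apis: dict) -> object | None:
    """Stage 2: does the call actually run against an executable API?"""
    fn = executable_apis.get(call["name"])
    if fn is None:
        return None
    try:
        return fn(**call["arguments"])
    except Exception:
        return None

def semantic_check(query: str, result: object) -> bool:
    """Stage 3: does the result answer the query? APIGen uses an LLM-based judge
    here; this stub only checks that a result was produced."""
    return result is not None

# Toy executable API registry and a toy generated sample.
apis = {"add": lambda a, b: a + b}
query = "What is 2 plus 3?"
raw = '{"name": "add", "arguments": {"a": 2, "b": 3}}'

call = format_check(raw)
result = execution_check(call, apis) if call else None
verified = call is not None and semantic_check(query, result)
print(verified)  # True -> the sample would be kept for fine-tuning
```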
Training
Model training follows a supervised fine-tuning (SFT) approach, making use of a flexible data pipeline. The training framework is built on HuggingFace libraries and PyTorch, and xLAM models undergo multiple training epochs with shuffled datasets to ensure robust learning. For the largest model, xLAM-8x22b-r, the LoRA method is used to preserve its abilities while preventing it from forgetting previously learned information. LoRA is also employed in aligning all xLAM models with the DPO method, and a cosine learning rate scheduler is used to further optimize training.
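The description above gives the high-level recipe (SFT on the unified data, LoRA for the largest model and for DPO alignment, cosine learning-rate schedule) rather than exact settings, so the snippet below is only a sketch of what that setup looks like in the Hugging Face stack (transformers, peft, trl). The base model id, LoRA rank, and learning rate are illustrative assumptions, not the values used to train the xLAM models; the dataset id refers to the function-calling data Salesforce released alongside APIGen.

```python
# Sketch of an SFT + LoRA setup with a cosine schedule, mirroring the recipe
# described above. Hyperparameters and the base model are illustrative.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# APIGen-verified function-calling data released by Salesforce.
dataset = load_dataset("Salesforce/xlam-function-calling-60k", split="train")

def to_text(example):
    # Collapse a unified sample into one training string (field names are taken
    # from the public dataset card; the prompt format is illustrative).
    return {"text": f"Tools: {example['tools']}\nQuery: {example['query']}\nCalls: {example['answers']}"}

dataset = dataset.map(to_text)

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

training_args = SFTConfig(
    output_dir="xlam-sft-sketch",
    num_train_epochs=3,                 # multiple epochs over shuffled data
    learning_rate=2e-5,
    lr_scheduler_type="cosine",         # cosine schedule, as described above
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    bf16=True,
)

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-v0.1",  # placeholder base model
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```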
Benchmark Results
To assess xLAM’s performance, a variety of benchmarks are used, including:
· Webshop: An environment simulating online shopping tasks to test how well agents assist in e-commerce. Performance is measured through success and progress rates.
· ToolQuery: A benchmark testing how well agents use tools to retrieve information across multiple domains. This test includes a variety of settings like weather and movies, with success and progress rates used as metrics.
· ToolBench: A real-time evaluation platform for multi-turn reasoning via RapidAPI, with a pass rate metric used to gauge success. This benchmark includes in-domain and out-of-domain tasks.
· Berkeley Function-Calling Leaderboard: This benchmark tests an agent’s ability to handle function calls in various programming languages and application domains. With over 2,200 test cases, it challenges models with complex tasks involving multiple function calls. The evaluation includes accuracy metrics like Abstract Syntax Tree (AST) accuracy and executable accuracy.
These benchmarks confirm that xLAM models are highly capable in environments requiring complex reasoning and tool use.
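To make the AST accuracy metric mentioned in the last benchmark concrete: a predicted function call is judged by whether it matches the reference structurally (function name, argument names, acceptable values) rather than as an exact string. The sketch below uses Python's ast module to illustrate the idea; it is a simplified stand-in, not the leaderboard's actual checker, which also handles multiple acceptable values and parallel calls.

```python
# Simplified illustration of AST-style matching of function calls: a prediction
# counts as correct if it parses to the same function name and argument mapping
# as the reference, regardless of formatting or argument order.
import ast

def call_signature(call_str: str) -> tuple[str, dict[str, object]] | None:
    """Parse a call like 'f(a=1, b="x")' into its function name and keyword arguments."""
    try:
        node = ast.parse(call_str, mode="eval").body
    except SyntaxError:
        return None
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        return None
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return node.func.id, kwargs

reference = 'get_weather(city="Paris", unit="celsius")'
prediction = 'get_weather(unit="celsius", city="Paris")'  # different order, same call

print(call_signature(prediction) == call_signature(reference))  # True
```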
xLAM is one of the most interesting efforts highlighting the potential of building agentic workflows directly into LLMs. It is going to be interesting to see how Salesforce incorporates xLAM into its own products.
Published via Towards AI