Learn to Schedule Communication between Cooperative Agents
Author(s): Sherwin Chen
Originally published on Towards AI.
A novel architecture for communication scheduling in multi-agent environments
Introduction
In multi-agent environments, one way to improve coordination is to let agents communicate with one another in a distributed manner and behave as a group. In this article, we discuss SchedNet, a multi-agent reinforcement learning framework proposed by Kim et al. (2019) at ICLR, in which agents learn how to schedule communication, how to encode messages, and how to act upon received messages.
Problem Setup
We consider multi-agent scenarios in which the task at hand is cooperative and agents are situated in a partially observable environment. We formulate such scenarios as a multi-agent sequential decision-making problem in which all agents share the goal of maximizing the same discounted sum of rewards. Because we rely on a method to schedule communication between agents, we impose two restrictions on medium access:
- Bandwidth constraint: each agent can pass only a message of at most L bits to the medium at each time step.
- Contention constraint: the agents share the communication medium, so only K out of n agents can broadcast their messages at each time step.
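As a minimal illustration of these two constraints (the helper and constants below are hypothetical, not from the paper), the shared medium at each time step could be modeled as:

```python
L_BITS = 8   # bandwidth limit in bits per message (assumed value)
K = 2        # number of agents allowed to broadcast per step (assumed value)

def medium_access(messages, schedule):
    """Apply both constraints to one time step of communication."""
    # Bandwidth constraint: every message must fit in L bits.
    assert all(len(m) <= L_BITS for m in messages)
    # Contention constraint: at most K of the n agents may broadcast.
    assert sum(schedule) <= K
    # Only scheduled messages actually reach the shared medium.
    return [m for m, c in zip(messages, schedule) if c == 1]

print(medium_access(["00000001", "10101010", "11110000"], [1, 0, 1]))
# ['00000001', '11110000']
```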
We now formalize MARL using a DEC-POMDP (DECentralized Partially Observable Markov Decision Process), a generalization of the MDP that allows distributed control by multiple agents that may be unable to observe the global state. We describe a DEC-POMDP by a tuple ⟨S, A, r, P, Ω, O, γ⟩, where:
- s ∈ S is the environment state, which is not available to the agents
- aᵢ ∈ A and oᵢ ∈ Ω are the action and observation of agent i ∈ N
- r: S × Aᴺ → ℝ is the reward function shared by all agents
- P: S × Aᴺ → S is the transition function
- O: S × N → Ω is the emission/observation probability
- γ denotes the discount factor
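As a rough sketch (the field types here are placeholders, not from the paper), the tuple can be laid out in code as follows:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class DecPOMDP:
    """Schematic container for the tuple <S, A, r, P, Ω, O, γ>; types are placeholders."""
    states: Sequence        # S: environment states, hidden from the agents
    actions: Sequence       # A: per-agent action space
    reward: Callable        # r(s, joint_action) -> float, shared by all agents
    transition: Callable    # P(s, joint_action) -> next state (or a distribution)
    observations: Sequence  # Ω: per-agent observation space
    observe: Callable       # O(s, i) -> observation of agent i
    gamma: float            # γ: discount factor
```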
SchedNet
Overview
Before diving into details, we first take a quick look at the architecture (Figure 1) to get an overview of what's going on. At each time step, each agent receives its observation and passes it to a weight generator and an encoder, which produce a weight value w and a message m, respectively. All weight values are then sent to a central scheduler, which determines which agents' messages are scheduled to broadcast via a schedule vector c = [cᵢ], cᵢ ∈ {0, 1}. The message center aggregates all messages along with the schedule vector c and then broadcasts the selected messages to all agents. At last, each agent takes an action based on these messages and its own observation.
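Putting the pieces together, one time step of this pipeline might look roughly like the following (the function and argument names are hypothetical stand-ins for the learned components):

```python
def schednet_step(observations, weight_generator, encoder,
                  wsa_schedule, action_selector, k):
    """One schematic time step of SchedNet (not the official implementation)."""
    # 1. Each agent turns its observation into a scheduling weight and a message.
    weights  = [weight_generator(o) for o in observations]
    messages = [encoder(o) for o in observations]

    # 2. The central scheduler maps the weights to a binary schedule vector c.
    schedule = wsa_schedule(weights, k)   # e.g. Top(k) or Softmax(k)

    # 3. The message center broadcasts only the scheduled messages.
    broadcast = [m for m, c in zip(messages, schedule) if c == 1]

    # 4. Each agent acts on its own observation plus the broadcast messages.
    actions = [action_selector(o, broadcast) for o in observations]
    return actions, schedule
```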
As we will see next, SchedNet trains all its components through a shared critic, following the centralized training and distributed execution framework.
Weight Generator
Let's start with the weight generator. The weight generator takes an observation as input and outputs a weight value, which is then used by the scheduler to schedule messages. We train the weight generator through the critic by maximizing Q(s, w), an action-value function. To get a better sense of what's going on here, think of the weight generator as a deterministic policy network and absorb every other component except the critic into the environment. The weight generator and critic then form a DDPG structure, and the weight generator is responsible for answering the question: "which weight should I generate to maximize the environment reward from here on?". As a result, we have the following objective
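In standard deterministic policy-gradient form, with the weight generator of agent i written as a deterministic policy f^wg_i(oᵢ) = wᵢ, the objective and its gradient can be sketched as follows (the notation may differ slightly from the paper):

```latex
\max_{\theta}\; J(\theta) = \mathbb{E}\big[\, Q(s, \mathbf{w}) \,\big],
\qquad
\nabla_{\theta_i} J \approx
\mathbb{E}\Big[\, \nabla_{w_i} Q(s, \mathbf{w})\big|_{\mathbf{w} = f^{wg}(\mathbf{o})}\;
\nabla_{\theta_i} f^{wg}_i(o_i) \,\Big]
```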
It is essential to distinguish s from o; s is the environment state, while o is the observation from the viewpoint of each agent.
Scheduler
Back when we described the problem setup, two constraints were imposed on the communication process. The bandwidth limitation L can easily be implemented by restricting the size of the message m. We now focus on imposing K on the scheduling part.
The scheduler adopts a simple weight-based algorithm, called WSA (Weight-based Scheduling Algorithm), to select K agents. We consider two proposals from the paper:
- Top(k): selecting the top k agents in terms of their weight values
- Softmax(k): computing a softmax value for each agent i based on its weight value, and then randomly selecting k agents according to these softmax probabilities
The WSA module outputs a schedule vector c = [cᵢ], cᵢ ∈ {0, 1}, where each cᵢ determines whether agent i's message is scheduled to broadcast or not.
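A minimal sketch of the two WSA variants (a hypothetical helper written with NumPy, not the official code):

```python
import numpy as np

def wsa_schedule(weights, k, mode="top"):
    """Turn per-agent weights into a binary schedule vector c with sum(c) == k."""
    weights = np.asarray(weights, dtype=np.float64)
    n = len(weights)
    if mode == "top":
        # Top(k): deterministically pick the k agents with the largest weights.
        chosen = np.argsort(weights)[-k:]
    else:
        # Softmax(k): sample k distinct agents with probabilities softmax(weights).
        p = np.exp(weights - weights.max())
        p /= p.sum()
        chosen = np.random.choice(n, size=k, replace=False, p=p)
    c = np.zeros(n, dtype=int)
    c[chosen] = 1
    return c

print(wsa_schedule([0.9, 0.1, 0.5], k=2, mode="top"))      # [1 0 1]
print(wsa_schedule([0.9, 0.1, 0.5], k=2, mode="softmax"))  # stochastic
```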
Message Encoder, Message Center, and Action Selector
The message encoder encodes observations to produce a message m. The message center aggregates all messages m and selects which messages to broadcast based on c. The resulting message m⊙c is the concatenation of all selected messages. For example, if m = [000, 010, 111] and c = [101], the final message to broadcast is m⊙c = [000111]. Each agent's action selector then chooses an action based on this message and its observation.
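The aggregation in the example above can be sketched as (hypothetical helper):

```python
def message_center(messages, schedule):
    """Concatenate the messages of the scheduled agents into one broadcast."""
    return "".join(m for m, c in zip(messages, schedule) if c == 1)

print(message_center(["000", "010", "111"], [1, 0, 1]))  # -> "000111"
```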
We train the message encoders and action selectors via an on-policy algorithm, with the state-value function V(s) in the critic. The gradient of its objective is
where π denotes the aggregate network of the encoder and selector, and V is trained with the following objective
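Both expressions can be sketched in standard actor-critic form, with the TD error serving as the advantage estimate for the policy gradient and V trained by regression toward the bootstrapped target (again, the notation may differ slightly from the paper):

```latex
\nabla_{\theta} J \approx
\mathbb{E}\Big[\, \nabla_{\theta} \log \pi_{\theta}(a \mid o, m \odot c)\,
\big(r + \gamma V(s') - V(s)\big) \,\Big],
\qquad
\min_{\psi}\; \mathbb{E}\Big[\, \big(r + \gamma V_{\psi}(s') - V_{\psi}(s)\big)^{2} \,\Big]
```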
Discussion
Two Different Training Procedures?
Kim et al. train the weight generators and action selectors using different methods but with the same data source. Specifically, they train the weight generators with a deterministic policy-gradient algorithm (an off-policy method) while simultaneously training the action selectors with a stochastic policy-gradient algorithm (an on-policy method). This could be problematic in practice, since the stochastic policy-gradient method can diverge when trained with off-policy data. The official implementation mitigates this problem by using a small replay buffer, which, however, may impair the performance of the on-policy part.
We could bypass this problem by reparameterizing the critic so that it takes as input the state s together with the actions a₁, a₂, … and outputs the corresponding Q-value. In this way, both parts can be trained with off-policy methods. Another conceivable way is to separate the training process from environment interaction if one insists on stochastic policy-gradient methods. Note that it is not enough to simply separate the policy training, since updates to the weight generator can change the environment's state distribution.
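A rough sketch of the first workaround, written as a PyTorch-style module (hypothetical, not the official implementation): the critic consumes the state together with all agents' weights and actions, so both parts can be updated from replayed data.

```python
import torch
import torch.nn as nn

class JointCritic(nn.Module):
    """Q(s, w_1..w_n, a_1..a_n): a single critic over state, weights, and actions,
    so both the weight generators and the action selectors can train off-policy."""
    def __init__(self, state_dim, n_agents, action_dim, hidden=128):
        super().__init__()
        in_dim = state_dim + n_agents * (1 + action_dim)  # weights are scalars
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, weights, actions):
        # state: [B, state_dim], weights: [B, n_agents], actions: [B, n_agents * action_dim]
        x = torch.cat([state, weights, actions], dim=-1)
        return self.net(x)
```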
References
Daewoo Kim, Sangwoo Moon, David Hostallero, Wan Ju Kang, Taeyoung Lee, Kyunghwan Son, and Yung Yi. 2019. "Learning to Schedule Communication in Multi-Agent Reinforcement Learning." ICLR 2019.