
Understanding Reinforcement Learning and Multi-Agent Systems: A Beginner’s Guide to MARL (Part 1)
Author(s): Arthur Kakande
Originally published on Towards AI.
When we learn from labeled data, we call it supervised learning. When we learn by grouping similar items, we call it clustering. When we learn by observing rewards or gains, we call it reinforcement learning.
To put it simply, reinforcement learning is the process of figuring out the best actions or strategies based on observed rewards. This type of learning is especially useful for tasks with a large number of possible action sequences. For example, imagine a simple maze-style game where you can move left, right, up, or down. A specific sequence of moves, like up → left → up → right, might result in winning the game. Reinforcement learning helps an agent (the decision-maker) explore different move combinations and learn which ones consistently lead to victory. In some cases, multiple agents can learn and interact together. A good example is autonomous cars sharing the same road. This is known as Multi-Agent Reinforcement Learning (MARL).
What is Autonomous Control (AC)?
Now that I have introduced autonomous vehicles above, let's look at what autonomous control is. AC refers to systems where decisions are decentralized, meaning individual components such as robots or vehicles can make independent choices within their environment. MARL is particularly useful here. Take logistics as an example: we could attach an intelligent software agent to a container, a vehicle, and a storage facility. This creates a multi-agent system in which the container can independently explore the best storage facility to use as its destination and select a suitable transport provider to move it there, maximizing efficiency overall. In this simple illustration there is just one container; now imagine how efficient it would be if multiple containers could be grouped and transported together in the same manner. Similarly, a fleet of delivery robots tasked with dropping off packages would need to coordinate to stay efficient and avoid delays. This is where MARL becomes crucial, as it enables this kind of strategic decision-making.
Now looking back at autonomous cars, consider a scenario where multiple self-driving cars have to share a road or coordinate their activity at a junction or roundabout. To do this manually, one might create a schedule that ensures only a specific number of cars cross a specific junction at a specific time to avoid collisions. That would be very difficult and not scalable. To tackle this challenge, the autonomous cars must learn to coordinate their movements to avoid accidents and improve traffic flow overall. Predicting and responding to each other's actions creates a smoother driving experience. The same illustration applies to a fleet of delivery robots.
Single-Agent vs. Multi-Agent Reinforcement Learning
Now that we understand what autonomous control is, we can dive deeper into RL and understand how combining the two leads to efficient systems. But first, we should understand how reinforcement learning works for a single agent. There are a few key concepts you must understand as you dive into RL. An "agent" is the decision-maker. The "environment" is the space in which the agent operates. The agent operates by taking "actions", the choices available to it, which can change the condition of the environment. That current condition is the "state". As the agent navigates all this, it receives feedback based on the actions taken in particular states, and this feedback is known as "rewards".
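To make these terms concrete, here is a minimal sketch in Python. The "LineWorld" environment is invented for illustration (it is not from any particular library): the agent starts at cell 0 and is rewarded for reaching cell 4, and the agent here simply acts at random to show how states, actions, and rewards flow between the two.

```python
import random

class LineWorld:
    """Toy environment: the agent starts at cell 0 and tries to reach cell 4."""
    def __init__(self):
        self.state = 0                        # the state: current condition of the environment

    def step(self, action):
        # The action (-1 = move left, +1 = move right) changes the state.
        self.state = max(0, min(4, self.state + action))
        reward = 1 if self.state == 4 else 0  # the reward: feedback for the agent
        done = self.state == 4
        return self.state, reward, done

env = LineWorld()                             # the environment
state, done = env.state, False
while not done:
    action = random.choice([-1, +1])          # the agent picks an action (here, at random)
    state, reward, done = env.step(action)
    print(f"action={action:+d}, new state={state}, reward={reward}")
```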
A popular algorithm for training a single agent is Q-learning. It works by helping the agent estimate the reward it can expect from performing different actions in different states. An action could be moving a step forward, and the state would be the new condition of the environment after the action has been taken. The agent observes this state and may receive a reward. After exploring many actions and states and observing the rewards, the agent updates its knowledge whenever it observes a new reward and estimates which combinations of states and actions yielded reward. These estimates are called Q-values, and over time they can converge, yielding optimal decisions. For example, the moves up → left → up → right that I previously introduced would be the optimal decisions, i.e. the sequence of states and actions with the highest Q-values.
Here’s how Q-learning works, expressed as its update rule:

Q_{t+1}(s, a) = Q_t(s, a) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q_t(s', a') - Q_t(s, a) \right]

where s is the current state, a is the chosen action, Q_t(s, a) is the current value estimate for the state-action pair at time step t, t + 1 denotes the next time step, r_{t+1} is the payoff the agent receives after taking action a in state s, s' is the resulting state, γ is the discount factor, and α is the learning rate.
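As a rough illustration, this update can be written as a few lines of Python. The one-dimensional world and the hyperparameter values below are made up for the example; the point is only to show the update rule in code.

```python
import random

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2    # learning rate, discount factor, exploration rate
ACTIONS = [-1, +1]                        # step left or step right

def step(state, action):
    """Invented environment dynamics: reach cell 4 to earn a reward of 1."""
    next_state = max(0, min(4, state + action))
    reward = 1 if next_state == 4 else 0
    return next_state, reward, next_state == 4

Q = {(s, a): 0.0 for s in range(5) for a in ACTIONS}   # Q-values start at zero

for episode in range(200):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: mostly pick the best-known action, occasionally explore.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # The update rule above: nudge Q(s, a) toward the reward plus the
        # discounted value of the best action in the next state.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# After training, the greedy action in each state should point toward the goal.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(5)})
```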
Challenges in Multi-Agent RL
When multiple agents share an environment, things get more complex, because the agents influence each other’s decisions. From any single agent’s point of view, the environment is no longer stationary. Say delivery agent 1 picked up an item for delivery in state K and received a reward; what would stop delivery agent 2 from picking up that same item in a different state during a different episode? The environment effectively changes whenever the other agents change their behaviour.
Additionally, there are multiple settings in which the approaches differ. In a competitive setting, an agent may try to outsmart opponents by predicting their moves, whereas in a cooperative setting agents work together to maximize a shared reward. This complexity means multi-agent systems require more advanced strategies than single-agent RL. This brings us to our next question: how do multiple agents learn together?
There are different approaches to multi-agent learning. We can let one agent make decisions for everyone, taking the role of a coordinator that delegates tasks to all the other agents; this is known as centralized learning. Alternatively, we can let each agent learn and act independently, learning from observing the others’ actions; this is known as decentralized learning. A third option is centralized training with decentralized execution, an approach where agents receive global information during training but act independently when deployed.
During this learning, agents can coordinate either explicitly, by directly exchanging messages, or implicitly, by inferring other agents’ actions without direct message exchange.
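As a very rough sketch of the decentralized option only (independent learners, no message exchange), each agent can keep its own Q-table and run its own update using only what it observes. The two-agent "shared line" environment and its collision rule below are invented for illustration; this is not a complete training loop, just the structure of per-agent learning.

```python
import random
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9
ACTIONS = [-1, +1]

# Decentralized learning: each agent owns its Q-table and updates it independently,
# using only its own observations and rewards.
agents = {"agent_1": defaultdict(float), "agent_2": defaultdict(float)}
positions = {"agent_1": 0, "agent_2": 1}

def joint_step(positions, moves):
    """Invented shared environment: both agents move on the same line; an agent is
    rewarded for reaching cell 4, but nobody is rewarded if the two collide."""
    new_positions = {n: max(0, min(4, positions[n] + moves[n])) for n in positions}
    collided = len(set(new_positions.values())) < len(new_positions)
    rewards = {n: (1 if new_positions[n] == 4 and not collided else 0) for n in new_positions}
    return new_positions, rewards

for _ in range(500):
    moves = {n: random.choice(ACTIONS) for n in agents}       # each agent explores on its own
    new_positions, rewards = joint_step(positions, moves)
    for name, Q in agents.items():
        s, a, s2 = positions[name], moves[name], new_positions[name]
        best_next = max(Q[(s2, b)] for b in ACTIONS)
        # Same Q-learning update as before, applied per agent with its local view only.
        Q[(s, a)] += ALPHA * (rewards[name] + GAMMA * best_next - Q[(s, a)])
    positions = new_positions
```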
What’s Next?
Now that I have introduced you to the basics of RL and multi-agent systems, we should dive deeper into what MARL algorithms are and look at how they differ. In Part 2 of this blog series, we shall explore elements of independent Q-learning for MARL alongside team-based approaches. Stay tuned!
Published via Towards AI