Introduction to Reinforcement Learning Series. Tutorial 2: The Return, Value Functions & Bellman Equation
Last Updated on July 17, 2023 by Editorial Team
Author(s): Towards AI Editorial Team
Originally published on Towards AI.
Table of Contents:
1. Return G_t
- Aside: How many timesteps does an MDP have?
- Why not maximize the reward r_t?
2. Discounting the Future γ
3. Exercise 1 – Barry's Blissful Breakfast Maze
4. Value Functions v_π(s)
5. Bellman Equation
6. Example: Calculate the Value function with Bellman
- Pseudocode to Calculate the Value function using Bellman
- What would this look like in Python?
7. Exercise: Plane Repair Value
Parts of this tutorial have been adapted from Reinforcement Learning: An Introduction by Sutton and Barto.
Introduction
In the previous tutorial, we saw how reinforcement learning algorithms learn a policy. The algorithm's aim is to find the optimal policy: the policy that takes the actions that maximize the sum of future rewards received.
In this tutorial, we start by better defining the goal of learning the optimal policy. We then introduce the key concept (value functions) and equation (Bellman Equation) that allow us to build our first reinforcement learning algorithm in Tutorial 3!
This series initially formed part of Delta Academy and is brought to you in partnership with the Delta Academy team's new project Superflows!
1. Return G_t
In Tutorial 1 we discussed, informally, the objective of reinforcement learning algorithms. We said that the goal of a reinforcement learning algorithm is to maximize the cumulative reward it receives in the long run.
We define this as the return, denoted G_t.
Simple Return Formula
The simplest way to express the return is as the sum of all future rewards you'll receive, where r_t is the reward at time t, up to the final timestep T (if it exists; otherwise the sum continues to infinity, ∞):
G_t^Sum = r_t + r_{t+1} + r_{t+2} + ... + r_T
Note: we call this G_t^Sum since we define G_t in the general case below.
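As a tiny illustration (a sketch, not from the original article), the undiscounted return is just a sum over a list of rewards:

# Minimal sketch: the undiscounted return G_t^Sum over one episode's rewards.
rewards_from_t = [0, 1, -2, 3]  # hypothetical rewards r_t, r_{t+1}, ..., r_T
g_t_sum = sum(rewards_from_t)
print(g_t_sum)  # 2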
Aside: How many timesteps does an MDP have?
Not all MDPs last for an infinite number of timesteps. Those with terminal states are called episodic MDPs. The episode ends when you reach those states.
Typically this final timestep in an MDP is denoted by T, as in the equation above.
E.g. in a game of chess, there are many possible states where the game is over: one player has won, or the game is a stalemate. These are the terminal states, and every game of chess is one episode.
Example #2: if we're getting a robot to navigate to a certain location, episodes could terminate when it successfully reaches this location, or when it runs out of battery!
Whether the MDP is episodic (has 1 or more terminal states) or not is typically part of the environment and not up to the RL designer.
In some cases, however (e.g., the robot navigation one), the designer also chooses whether the MDP is episodic or not, since they design the environment the agent trains in.
Why not maximize the reward r_t?
The real world is complex; not every desirable action yields an instant reward. So we can't just consider the immediate reward for an action, we also have to consider possible future rewards.
This is best explained with an example.
Example: Barry's Blissful Breakfast
Alongside being a busy airline CEO, Barry loves an American breakfast with bacon and pancakes – nothing brings him greater pleasure. Barry is only rewarded at the end of his breakfast preparation, when he eats that first blissful mouthful.
He must first choose which oil to use: vegetable or motor oil. This is the only decision he needs to make; he then cooks the breakfast and serves it up before he can eat it.
We can express Barry's cooking as the Markov Decision Process below. When one of the two final states is reached (the circles on the right of the diagram), the Markov Decision Process terminates, meaning the episode is over.
If we were in the starting state "Barry is hungry" and only considered the reward we'd get immediately after picking the oil, the two options would look the same: both give a reward of 0.
However, in the long run (looking 2 steps into the future), we can see that using motor oil is a terrible idea since it always leads to a negative reward!
If we instead looked at the return we'd get from each successor state when choosing the action at the start (positive for vegetable oil, negative for motor oil), then Barry can make the correct choice.
2. Discounting the Future γ
Do you value $100 given to you now as much as $100 given to you 2 weeks in the future?
You'd probably rather have the $100 now. Future rewards are subject to uncertainty, so we intrinsically discount them compared with immediate rewards.
Discounting future rewards also encourages prompt action, since acting sooner means you have to wait less time for your rewards.
To imitate this, we introduce a discount factor, γ (the lowercase Greek letter gamma), when calculating the return. We define 0 ≤ γ ≤ 1, since gamma discounts the relative value of future rewards.
For the reward that we receive n timesteps into the future, we multiply that reward by γ^n to discount it.
This means each reward is discounted by an exponentially decaying gamma term (γ^n). This decay is shown in the diagram below.
Note: we normally use a γ closer to 1, such as 0.9 or 0.99, as we often deal with many timesteps.
This also has a nice mathematical property: it allows us to solve MDPs that have infinite timesteps without having to deal with infinite returns. Because the γ^n terms decay exponentially towards zero as you look further into the future, the infinite sum converges to a finite value (provided the rewards are bounded).
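To get a feel for how quickly the discount shrinks future rewards, here is a minimal sketch (illustrative only) printing the weight γ^n applied to a reward n timesteps ahead:

# Minimal sketch: the weight gamma**n applied to a reward n timesteps ahead.
for gamma in (0.5, 0.9, 0.99):
    weights = [round(gamma ** n, 3) for n in range(6)]
    print(gamma, weights)
# gamma = 0.5 gives weights 1.0, 0.5, 0.25, 0.125, 0.062, 0.031, ...
# gamma = 0.99 keeps future rewards relevant for far longer: 1.0, 0.99, 0.98, 0.97, 0.961, 0.951, ...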
General Return Formula
This leads to the following general formula for the return:
G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ... = Σ_{n=0} γ^n·r_{t+n}
(where the sum runs up to the final timestep T if the MDP is episodic, and to infinity otherwise).
The G_t^Sum we saw earlier is a special case of this equation where γ = 1, so no terms are discounted (since γ^n = 1 for all n when γ = 1).
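As a minimal sketch of this general formula in Python (assuming the rewards from timestep t onwards are collected in a list):

# Minimal sketch of the general return formula G_t = sum_n gamma**n * r_{t+n}.
def discounted_return(rewards, gamma):
    return sum(gamma ** n * reward for n, reward in enumerate(rewards))

# With gamma = 1, this reduces to the simple sum G_t^Sum from earlier.
print(discounted_return([0, 1, -2, 3], gamma=1))  # 2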
Example of Discounted Return
First, let's imagine we're in an MDP where you're being paid $1 for each timestep for 3 more timesteps – you're being paid to guard a valuable piece of art and it's near the end of your shift.
To illustrate gamma, we'll set γ = 0.5 (very low).
Since you know all this, you can calculate the return:
- The reward at t is worth $1
- The reward at t+1 is worth $1 × 0.5 = 50¢
- The reward at t+2 is worth $1 × 0.5² = 25¢
- The episode terminates at t+3
Your return from time t would be $1 + 50¢ + 25¢ = $1.75, which isn't the actual amount of money you'd receive over the next 3 timesteps (that's still $3). Instead, it's how much you value that money at time t, discounting the future rewards.
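Checking this with the discounted_return sketch from above (or with plain arithmetic):

# The art-guard example: three rewards of $1 with gamma = 0.5.
gamma = 0.5
rewards = [1, 1, 1]
g_t = sum(gamma ** n * r for n, r in enumerate(rewards))
print(g_t)  # 1.75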
Review:
1. Give the general formula for a Markov decision process's return at time t.
➥ G_t = Σ_{i=0}^∞ γ^i·r_{t+i} (with the sum ending at the final timestep T in an episodic MDP).
2. What trick do we use to tractably compute the return for Markov decision processes with infinite timesteps?
➥ Use a discount factor γ < 1. Because the reward n steps into the future is scaled by γ^n, the later terms shrink towards zero and the sum converges to a finite value.
3. What does G_t denote in a Markov decision process?
➥ The return at time t, i.e. the quantity the agent tries to maximize.
4. What is an episodic Markov decision process?
➥ One with one or more terminal states, i.e. each episode lasts a finite number of timesteps.
5. Define G_t^Sum in terms of G_t.
➥ G_t^Sum = G_t with γ = 1.
6. Notation for a Markov decision process's return at time t?
➥ G_t.
7. What does T denote in the context of a Markov decision process?
➥ The final timestep.
8. Notation for the final timestep in an episodic Markov decision process?
➥ T.
9. What range of values can γ take in Markov decision processes?
➥ 0 ≤ γ ≤ 1
10. What does γ denote in Markov decision processes?
➥ A discount factor for future rewards.
11. For a chess-playing Markov decision process, what is an "episode"?
➥ A single game of chess.
3. Exercise 1 – Barry's Blissful Breakfast Maze
https://replit.com/team/delta-academy-RL7/21-Barrys-Blissful-Breakfasts
Barry decides to open Barry's Blissful Breakfasts – a fully operational restaurant that only serves his favourite breakfast of bacon cooked with vegetable oil.
To commemorate the anniversary of opening this restaurant, he wants to run an event for his customers. One Sunday evening after closing time, he builds a maze with obstacle-course elements in the restaurant. He hides prizes around the restaurant: vouchers for extra bacon or vegetable oil on your breakfast.
He then tests out the course and hidden prizes himself before deciding whether it's complete.
Calculate the return at each timestep of Barry's test run, given the rewards he receives in the test run.
4. Value Functions v_π(s)
Definition: The value function v_π(s) of a state s is the expected return from s while following policy π.
A higher value for a state means a higher expected return, i.e. the return is higher on average, given that you follow policy π.
Whereas a return is the discounted sum of future rewards you actually experience, the value function is the expected return from a state. Since we've only focused on deterministic examples thus far, this distinction may be unclear.
➤ Why does the policy affect the expected return?
We'll explain this with an example.
Suppose we think about the value of the starting state: "Barry is hungry". The expected return we get depends on which action we pick: if we "Use Vegetable Oil", the start state has a positive value, and if we "Use Motor Oil" it has a negative value.
Since the action is chosen by the policy, the value of the starting state depends on which policy you follow.
➤ Example: the difference between value and return
- Suppose there's a very simple MDP: you flip a fair coin once, get $1 if it lands on heads, and nothing if it lands on tails. Then the MDP terminates.
- Restated:
- 50% of the time, it lands on heads, so your return is $1.
- The other 50% of the time, it lands on tails, so your return is $0.
However, the value of your starting state (before flipping) is 50¢, since this is the expected return. Note that you never actually get a return of 50¢ in any single episode!
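To make the difference concrete, here is a minimal simulation sketch (illustrative only): every individual episode returns either $1 or $0, but the average over many episodes approaches the value of $0.50.

import random

# Each episode: flip a fair coin; the return is 1 (heads) or 0 (tails).
returns = [random.choice([1, 0]) for _ in range(100_000)]

print(set(returns))                 # individual returns are only ever 1 or 0
print(sum(returns) / len(returns))  # the average is roughly 0.5, the value of the start state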
Mathematical Definition
The value function is defined as:
v_π(s) = E_π[G_t | s_t = s]
➤ Notation
Definitions of notation we've not seen before:
E is the symbol for "expectation" and acts on the variables inside the brackets (in this case G_t). It means the average value of G_t if you ran infinitely many simulations from that state.
The subscript of E (in this case π) refers to the conditions under which the expectation is taken. It is equivalent to saying: "given that we follow policy π". Since the policy affects how the state changes and the rewards you see, it's important this is specified.
▼ Expected values – main ideas (video)
If you're still unclear: don't worry too much! Not understanding the mathematical notation won't stop you from understanding what the value intuitively means or how to estimate the value function algorithmically (covered in the next tutorial).
While a value function exists for any policy π, an agent will usually not know the exact value function of the policy it is following.
An approximate value function is commonly estimated, and re-estimated, from the sequences of states and rewards an agent observes through experience in the environment. The agent can then determine its policy by choosing the action that leads to the successor state with the highest value, as sketched below.
The process of estimating the value function is explored in the next tutorial.
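As a minimal sketch of that idea (the state names, values and transitions below are hypothetical, not from the tutorial): given value estimates and a known deterministic transition function, a greedy policy simply picks the action whose successor state has the highest estimated value.

# Hypothetical value estimates and deterministic transitions, for illustration only.
value_fn = {"A": 1.0, "B": 2.5, "C": -0.5}
transition_fn = {("Start", "left"): "A", ("Start", "middle"): "B", ("Start", "right"): "C"}

def greedy_action(state, actions, value_fn, transition_fn):
    # Choose the action leading to the successor state with the highest estimated value.
    return max(actions, key=lambda action: value_fn[transition_fn[(state, action)]])

print(greedy_action("Start", ["left", "middle", "right"], value_fn, transition_fn))  # middle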
5. Bellman Equation
This is one of the key equations of Reinforcement Learning algorithms. And it comes from a very simple observation.
The value function v_π(s_t), where s_t is the state at time t and π is the policy you're following, can be decomposed into two components:
- The reward r_t received at timestep t
- The value v_π(s_{t+1}) of the next state s_{t+1}, discounted by γ
To put it another way: the value of the state you're in is the reward you get this timestep, plus the discounted value of the next state you end up in.
Mathematically, when dealing with deterministic rewards, this is:
v_π(s_t) = r_t + γ·v_π(s_{t+1})
This is given that you follow policy π to select a_t (the action taken at state s_t).
γ is the discount factor we've just been introduced to.
▼ Mathematical generalization of this equation, which deals with randomness
- In general, the state transition function T (the function that determines the successor state given the previous state and action) can be random. Similarly, the reward function R (the function that outputs the reward from the environment) doesn't need to be deterministic.
- As a result, for the equation to hold in general, we must take an expectation over the rewards and successor states: v_π(s_t) = E_π[r_t + γ·v_π(s_{t+1})].
While this might seem self-evident, this equation is the heart of a key class of reinforcement learning algorithms.
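As a quick sanity check, here is a sketch that applies the Bellman Equation backwards through the art-guard example from earlier (γ = 0.5, $1 per timestep) and recovers the same $1.75 we got by summing discounted rewards directly:

# Work backwards from the end of the episode using v(s_t) = r_t + gamma * v(s_{t+1}).
gamma = 0.5
v_terminal = 0.0               # no reward left once the episode has terminated
v_t2 = 1 + gamma * v_terminal  # 1.0
v_t1 = 1 + gamma * v_t2        # 1.5
v_t0 = 1 + gamma * v_t1        # 1.75, matching the return computed earlier
print(v_t0)  # 1.75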
6. Example: Calculate the Value function with Bellman
So how can we calculate the value function of an MDP for a specific policy?
In the example above (an MDP with the three states "Pub", "Sleep" and "Learning RL"), we know: (1)
1. The state transitions for different actions (e.g. if you're in the state "Pub" and take the action "Go to sleep", then you move to the state "Sleep"). In general, this is referred to as the state transition function.
2. The reward for every possible state-action pair. In general, this is referred to as the reward function.
There aren't any terminal states in this MDP. Given this, we can "simulate" how the environment and rewards evolve into the future and calculate the return as a weighted sum of rewards.
In the code below, we first calculate the value of "Sleep", since it's visited from all other states when following the optimal policy. (2)
➤ Footnotes
(1) It's common to not have access to either of these in reinforcement learning problems. For example, in chess, the state transitions between your turns aren't easy to calculate, since they depend on the move the opponent makes. We'll see how this is dealt with in future tutorials.
(2) Reminder: the optimal policy is to "Go to sleep" when at the "Pub", and to alternate between "Learning RL" and "Sleep" otherwise.
Pseudocode to Calculate the Value function using Bellman
INITIALISE policy as the optimal policy
INITIALISE value_function as a mapping from each state to 0
INITIALISE state as "Sleep"

# First calculate the value of "Sleep"
FOR steps_into_future in {0, 1, ..., large_number}:
    Sample action from policy(state)
    Sample reward from reward_function(state, action)
    # This reward arrives steps_into_future timesteps into the future, so discount it
    UPDATE value_function "Sleep" term by adding (gamma ** steps_into_future) * reward
    UPDATE state to transition_function(state, action)
END FOR
Next, we can use the Bellman Equation to calculate the values for the other states, using the value for "Sleep" we just calculated.
# `Pub` is 1 step from `Sleep` if following the optimal policy
INITIALISE state as "Pub"
INITIALISE action as "Go to sleep"
Sample reward from reward_function(state, action)
UPDATE value_function "Pub" term to be reward + gamma * value_function("Sleep")

# `Learning RL` is also 1 step away from `Sleep` when following the optimal policy
INITIALISE state as "Learning RL"
INITIALISE action as "Go to sleep"
Sample reward from reward_function(state, action)
UPDATE value_function "Learning RL" term to be reward + gamma * value_function("Sleep")
In this case, you could also calculate the values for "Pub" and "Learning RL" the same way we did for "Sleep", but the above trick saves some time.
What would this look like in Python?
To make the above pseudocode more concrete, how can we represent the state transition function, the reward function, and the value function in Python code?
In general, these functions can be stochastic. In this case, all our functions are deterministic, so we can use a dictionary as a neat representation. Below are some examples:
optimal_policy = {
    "Pub": "Go to sleep",
    "Sleep": "Study",
    "Learning RL": "Go to sleep",
}

# Every state starts with an estimated value of 0
initial_value_fn = {"Pub": 0, "Sleep": 0, "Learning RL": 0}

# Below are the actual values for gamma = 0.9
value_function = {"Pub": 63.68, "Sleep": 76.32, "Learning RL": 73.68}
The state transition function is much longer, since it needs an entry for all state-action pairs (9 in this case). So it hasn't been written out above, but it would have the following format: {(state, action): successor_state}.
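Putting the pseudocode and these dictionaries together, here is a minimal, self-contained sketch of the whole calculation. Note that the transition and reward dictionaries below are assumptions for illustration (the tutorial's diagram is not reproduced here); the rewards of +10 for studying, +5 for going to sleep after learning RL and -5 for going to sleep after the pub were chosen because they reproduce the values quoted above for gamma = 0.9.

# Illustrative sketch: the transitions and rewards below are assumptions chosen to
# reproduce the values quoted above; they are not taken from the tutorial's diagram.
GAMMA = 0.9

optimal_policy = {"Pub": "Go to sleep", "Sleep": "Study", "Learning RL": "Go to sleep"}
transition_fn = {
    ("Pub", "Go to sleep"): "Sleep",
    ("Sleep", "Study"): "Learning RL",
    ("Learning RL", "Go to sleep"): "Sleep",
}
reward_fn = {
    ("Pub", "Go to sleep"): -5,
    ("Sleep", "Study"): 10,
    ("Learning RL", "Go to sleep"): 5,
}

# Step 1: estimate the value of "Sleep" by simulating far into the future.
value_function = {"Pub": 0.0, "Sleep": 0.0, "Learning RL": 0.0}
state = "Sleep"
for steps_into_future in range(1000):
    action = optimal_policy[state]
    reward = reward_fn[(state, action)]
    value_function["Sleep"] += GAMMA ** steps_into_future * reward
    state = transition_fn[(state, action)]

# Step 2: use the Bellman Equation to get the other states' values in one step.
for state in ("Pub", "Learning RL"):
    action = optimal_policy[state]
    reward = reward_fn[(state, action)]
    value_function[state] = reward + GAMMA * value_function["Sleep"]

print({s: round(v, 2) for s, v in value_function.items()})
# {'Pub': 63.68, 'Sleep': 76.32, 'Learning RL': 73.68}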
Review
1. How can we use a Markov decision process's estimated value function to form a policy?
➥ Choose the action which leads to the successor state with the highest value according to the value function.
2. Define a Markov decision process's value function (in words).
➥ The expected return from a given state while following a given policy.
3. How does the Bellman Equation help us solve for Markov decision processes' value functions?
➥ It allows us to solve for the value function at a given state using only the successor state's value, so there is no need to search the entire tree of future states for each state.
4. Give the Bellman Equation (assuming deterministic rewards).
➥ v_π(s_t) = r_t + γ·v_π(s_{t+1})
5. What is a Markov decision process's reward function?
➥ For every state and action, it gives the associated reward.
6. What is a Markov decision process's state transition function?
➥ Given a state and an action, it produces the successor state.
7. Give the mathematical definition of a Markov decision process's value function.
➥ v_π(s) = E_π[G_t | s_t = s]
7. Exercise: Plane Repair Value
https://replit.com/team/delta-academy-RL7/22-Plane-Repair-Value
Now in Hong Kong (that was quick!), Barry's execs are asking him for proof that the optimal policy you proposed for plane repairs (from the previous tutorial; see the diagram) is correct.
They want to see the value functions of the two possible deterministic* policies, with a discount factor γ = 0.99.
*Deterministic means that no randomness is involved. The opposite, a stochastic policy, would output a probability distribution over the possible actions (e.g. when Faulty, Fly 80% of the time and Repair 20% of the time). We aren't considering these for now.
Published via Towards AI