
Our Neat Value Function
Author(s): Rem E
Originally published on Towards AI.

So far, we’ve been discussing the environment (the problem) side. Now it’s time to talk about the solution: the agent!
And what better place to start than with value functions?
Before you begin, if you haven’t read The Whole Story of MDP in RL, make sure to check it out first!
🔁Returns
Remember our simple little value function? Yep, we’re starting from there again:

Qₜ(a) = (r₁ + r₂ + … + rₖ) / k

The value of an action a at time step t is the average of the rewards received (r₁, r₂, r₃, …, rₖ) over the k times that action was chosen.
Here, we are averaging all the rewards obtained from a certain state-action pair over k trials. k can represent the number of episodes, essentially asking: “What’s the average reward we got from this pair across k episodes?”
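To make that concrete, here's a minimal Python sketch of the simple average-based estimate (the rewards are made up, just for illustration):

# Hypothetical rewards observed over the k = 4 times action a was chosen
rewards_for_a = [1.0, 0.0, 2.0, 1.0]

# Simple average-based value estimate: (r1 + r2 + ... + rk) / k
q_estimate = sum(rewards_for_a) / len(rewards_for_a)
print(q_estimate)  # 1.0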
But there’s a problem: in this setup, we only care about the immediate rewards from that state-action pair. If you think about it, that’s not enough; we also want to consider all the upcoming rewards after taking that pair. Sometimes, a state-action pair might not give a great immediate reward but can lead to future pairs with much higher rewards. To account for that, we introduce a new concept called Returns:

Rₜ = rₜ₊₁ + rₜ₊₂ + rₜ₊₃ + … + r_T

Return Rₜ is the sum of all rewards from time step t+1 onward, up to the final time step T.
- rₜ: Reward you just got (already in the past at step t).
- Rₜ: Total rewards you will collect from now on (future-oriented, starting from t+1), coming from upcoming states or state-action pairs.
Some books use the notation Gₜ for return, derived from ‘gain,’ but both notations are widely accepted and mean the same thing.
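For example, if after time step t the agent happens to collect the rewards rₜ₊₁ = 1, rₜ₊₂ = 0, and rₜ₊₃ = 2 before the episode ends at T = t + 3, then Rₜ = 1 + 0 + 2 = 3.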
And if you've been paying attention, you might notice that we're currently giving all upcoming rewards equal importance (the same weight). That doesn't really make sense: a reward far in the future usually matters less than one we can collect right away.
This is where γ (the discount factor) comes in! Remember it from MDPs? It's an important element that helps us prioritize immediate rewards over distant ones. We apply γ by raising it to the power of how many steps away a reward is (so the discount grows exponentially), making rewards further in the future contribute less to the return:

Rₜ = rₜ₊₁ + γ·rₜ₊₂ + γ²·rₜ₊₃ + … + γ^(T−t−1)·r_T = Σₖ₌₀^(T−t−1) γᵏ·rₜ₊ₖ₊₁

Return Rₜ is the sum of the future rewards, each discounted by γ, from time step t+1 onward up to the final step T.
Here, k indexes how many steps we are beyond t+1: it starts at 0 for rₜ₊₁ and ends at T−t−1 for the final reward r_T. The upper limit of the summation is simply the difference between the end and the beginning: T−(t+1) = T−t−1. For continuing tasks, we can use ∞ instead, and the sum still converges thanks to the discount factor γ (as long as γ < 1 and the rewards are bounded).
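To see the discounting in action, here's a minimal Python sketch with a made-up reward sequence and an assumed γ = 0.9:

rewards_after_t = [1.0, 0.0, 2.0]  # hypothetical rewards r(t+1), r(t+2), r(t+3)
gamma = 0.9                        # assumed discount factor

# R_t = sum over k of gamma^k * r(t+k+1)
R_t = sum((gamma ** k) * r for k, r in enumerate(rewards_after_t))
print(round(R_t, 2))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62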
-How does γ affect the return formula?
- γ = 0: Only the immediate next reward rₜ₊₁ matters. The agent is myopic, focusing only on immediate gains.
- 0 < γ < 1 (the typical case): Future rewards are considered, but discounted exponentially by γ; rewards farther in the future contribute less.
- γ ≈ 1: Future rewards are nearly as important as immediate rewards. Used in tasks where long-term success matters.
- γ = 1 (no discounting): All future rewards are summed equally. This works only for episodic tasks; it is not applicable to continuing tasks, as it can lead to infinite returns.
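To get a feel for these cases, here's a small sketch that recomputes the return for the same made-up rewards under a few values of γ:

rewards_after_t = [1.0, 0.0, 2.0]  # same hypothetical rewards as above

for gamma in (0.0, 0.5, 0.9, 1.0):
    R_t = sum((gamma ** k) * r for k, r in enumerate(rewards_after_t))
    print(f"gamma = {gamma}: R_t = {R_t:.2f}")

# gamma = 0.0 -> 1.00  (only r(t+1) counts: the myopic agent)
# gamma = 0.5 -> 1.50
# gamma = 0.9 -> 2.62
# gamma = 1.0 -> 3.00  (all rewards count equally; fine here because the episode ends)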
✨Refined Value Function
Ok, now that the concept of Returns is clear, let’s look at our new, fancy value function:

Vπ(s) = Eπ[Rₜ ∣ sₜ = s]

The value of state s under policy π is the expected return Rₜ, given that we start in state s.
Calm down, nothing new here. Whether you use V(s) or Q(s, a) doesn’t matter. Both follow the same formula, but:
- V(s) depends on the state: [..∣sₜ=s]
- Q(s, a) depends on both state and action: [..∣sₜ=s, aₜ=a]
Note that writing ∣sₜ or ∣sₜ=s means the same thing; they're just two ways of saying "given that the current state at time step t is s."
-When do you choose V or Q?
It depends on your problem: if your decision depends only on the state, use V(s). If it depends on both the state and the action, use Q(s, a).
π is just there to show that we’re using the policy π.
-Okay, we understand the left-hand side, but what about the right-hand side?
If you're already familiar with expectation, you know it's basically an average that weights each outcome by its probability. The environment (and often the policy) is stochastic, so starting from the same state can produce different returns on different runs; later we'll expand this into a bigger equation that makes that stochasticity explicit. And instead of the immediate reward rₜ, we now use our return Rₜ. It's like saying:
“If I start in state s and follow policy π, what total reward can I expect in the long run?”
Wouldn’t that make more sense now?
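As a rough preview of the Monte Carlo idea covered in the next tutorial, here's a minimal sketch (all numbers are hypothetical) that approximates that expectation by averaging the returns observed from state s over several episodes:

def discounted_return(rewards, gamma):
    # R_t = sum over k of gamma^k * r(t+k+1) for one episode's rewards after t
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards collected after visiting state s in three made-up episodes
episodes_from_s = [
    [1.0, 0.0, 2.0],
    [0.0, 0.0, 5.0],
    [1.0, 1.0, 1.0],
]

gamma = 0.9
returns = [discounted_return(ep, gamma) for ep in episodes_from_s]
v_estimate = sum(returns) / len(returns)  # the sample average approximates E[R_t | s_t = s]
print(round(v_estimate, 3))  # ≈ 3.127

The more episodes we average over, the closer this sample average gets to the true expected return.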
✅ What We’ve Learned…
We started with the simple average-based value function and saw why it’s limited to immediate rewards. That’s where returns Rₜ come in, allowing us to account for future rewards after a state (or state-action pair). The discount factor γ helps balance immediate and long-term rewards. Finally, we updated our value function to use the expected return, giving it a forward-looking perspective that captures the full future impact of an agent’s decisions.
Before we wrap up, I know what you’re probably thinking. If you’ve been following this series from the beginning, you might be wondering:
-What about the EMA formula we derived earlier?
Of course, I haven’t forgotten about it! Just hang tight, it’s not time to bring it in yet… but you’ll see exactly where it fits in upcoming parts!
👉Next up: See it all in action!
Implementing the Value Function the Monte Carlo Way
Tutorial 5: In this tutorial, we’ll see in action how returns and state values are calculated using the Monte Carlo…
👉 Then, we’ll meet Bellman and his equations.
And see how this refined value function connects directly to them!
Why Is the Bellman Equation So Powerful in RL?
Breaking down the math to reveal how Bellman’s insight connects value, recursion, and optimality
✨ As always… stay curious, stay coding, and stay tuned!
📚References:
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.