
Our Neat Value Function
Author(s): Rem E
Originally published on Towards AI.

So far, we’ve been discussing the environment (the problem) side. Now it’s time to talk about the solution: the agent!
And what better place to start than with value functions?
Before you begin, if you haven’t read The Whole Story of MDP in RL, make sure to check it out first!
🔁Returns
Remember our simple little value function? Yep, we’re starting from there again:

Qₜ(a) = (r₁ + r₂ + … + rₖ) / k

The value of an action a at time step t is the average of the rewards received (r₁, r₂, r₃, …, rₖ) over the k times that action was chosen.
Here, we are averaging all the rewards obtained from a certain state-action pair over k trials. k can represent the number of episodes, essentially asking: “What’s the average reward we got from this pair across k episodes?”
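To make that concrete, here's a minimal Python sketch of the simple average-based estimate (the rewards are made up, just for illustration):

# Hypothetical rewards observed over the k = 4 times action a was chosen
rewards_for_a = [1.0, 0.0, 2.0, 1.0]

# Simple average-based value estimate: (r1 + r2 + ... + rk) / k
q_estimate = sum(rewards_for_a) / len(rewards_for_a)
print(q_estimate)  # 1.0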
But there’s a problem: in this setup, we only care about the immediate rewards from that state-action pair. If you think about it, that’s not enough; we also want to consider all the upcoming rewards after taking that pair. Sometimes, a state-action pair might not give a great immediate reward but can lead to future pairs with much higher rewards. To account for that, we introduce a new concept called Returns:

Rₜ = rₜ₊₁ + rₜ₊₂ + rₜ₊₃ + … + r_T

Return Rₜ is the sum of all rewards from time step t+1 onward, up to the final time step T.
- rₜ: Reward you just got (already in the past at step t).
- Rₜ: Total rewards you will collect from now on (future-oriented, starting from t+1), coming from upcoming states or state-action pairs.
Some books use the notation Gₜ for return, derived from ‘gain,’ but both notations are widely accepted and mean the same thing.
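For example, if after time step t the agent happens to collect the rewards rₜ₊₁ = 1, rₜ₊₂ = 0, and rₜ₊₃ = 2 before the episode ends at T = t + 3, then Rₜ = 1 + 0 + 2 = 3.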
And if you've been paying attention, you might notice that we're currently giving all upcoming rewards equal importance (the same weight). That doesn't really make sense: a reward far in the future usually matters less than one we can collect right away.
This is where γ (the discount factor) comes in! Remember it from MDPs? It's an important element that helps us prioritize immediate rewards over distant ones. We apply γ by raising it to the power of how many steps away a reward is (so the discount grows exponentially), making rewards further in the future contribute less to the return:

Rₜ = rₜ₊₁ + γ·rₜ₊₂ + γ²·rₜ₊₃ + … + γ^(T−t−1)·r_T = Σₖ₌₀^(T−t−1) γᵏ·rₜ₊ₖ₊₁

Return Rₜ is the sum of the future rewards, each discounted by γ, from time step t+1 onward up to the final step T.
Here, k indexes how many steps we are beyond t+1: it starts at 0 for rₜ₊₁ and ends at T−t−1 for the final reward r_T. The upper limit of the summation is simply the difference between the end and the beginning: T−(t+1) = T−t−1. For continuing tasks, we can use ∞ instead, and the sum still converges thanks to the discount factor γ (as long as γ < 1 and the rewards are bounded).
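To see the discounting in action, here's a minimal Python sketch with a made-up reward sequence and an assumed γ = 0.9:

rewards_after_t = [1.0, 0.0, 2.0]  # hypothetical rewards r(t+1), r(t+2), r(t+3)
gamma = 0.9                        # assumed discount factor

# R_t = sum over k of gamma^k * r(t+k+1)
R_t = sum((gamma ** k) * r for k, r in enumerate(rewards_after_t))
print(round(R_t, 2))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62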
-How does γ affect the return formula?
- γ = 0: Only the immediate next reward rₜ₊₁ matters. The agent is myopic, focusing only on immediate gains.
- 0 < γ < 1 (the typical case): Future rewards are considered, but discounted exponentially by γ; rewards farther in the future contribute less.
- γ ≈ 1: Future rewards are nearly as important as immediate rewards. Used in tasks where long-term success matters.
- γ = 1 (no discounting): All future rewards are summed equally. This works only for episodic tasks; it is not applicable to continuing tasks, as it can lead to infinite returns.
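To get a feel for these cases, here's a small sketch that recomputes the return for the same made-up rewards under a few values of γ:

rewards_after_t = [1.0, 0.0, 2.0]  # same hypothetical rewards as above

for gamma in (0.0, 0.5, 0.9, 1.0):
    R_t = sum((gamma ** k) * r for k, r in enumerate(rewards_after_t))
    print(f"gamma = {gamma}: R_t = {R_t:.2f}")

# gamma = 0.0 -> 1.00  (only r(t+1) counts: the myopic agent)
# gamma = 0.5 -> 1.50
# gamma = 0.9 -> 2.62
# gamma = 1.0 -> 3.00  (all rewards count equally; fine here because the episode ends)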
✨Refined Value Function
Ok, now that the concept of Returns is clear, let’s look at our new, fancy value function:

Vπ(s) = Eπ[Rₜ ∣ sₜ = s]

The value of state s under policy π is the expected return Rₜ, given that we start in state s.
Calm down, nothing new here. Whether you use V(s) or Q(s, a) doesn’t matter. Both follow the same formula, but:
- V(s) depends on the state: [..∣sₜ=s]
- Q(s, a) depends on both state and action: [..∣sₜ=s, aₜ=a]
Note that writing ∣sₜ or ∣sₜ=s means the same thing; they're just two ways of saying "given that the current state at time step t is s."
-When do you choose V or Q?
It depends on your problem: if your decision depends only on the state, use V(s). If it depends on both the state and the action, use Q(s, a).
π is just there to show that we’re using the policy π.
-Okay, we understand the left-hand side, but what about the right-hand side?
If you're already familiar with expectation, you know it's basically an average that weights each outcome by its probability. The environment (and often the policy) is stochastic, so starting from the same state can produce different returns on different runs; later we'll expand this into a bigger equation that makes that stochasticity explicit. And instead of the immediate reward rₜ, we now use our return Rₜ. It's like saying:
“If I start in state s and follow policy π, what total reward can I expect in the long run?”
Wouldn’t that make more sense now?
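As a rough preview of the Monte Carlo idea covered in the next tutorial, here's a minimal sketch (all numbers are hypothetical) that approximates that expectation by averaging the returns observed from state s over several episodes:

def discounted_return(rewards, gamma):
    # R_t = sum over k of gamma^k * r(t+k+1) for one episode's rewards after t
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards collected after visiting state s in three made-up episodes
episodes_from_s = [
    [1.0, 0.0, 2.0],
    [0.0, 0.0, 5.0],
    [1.0, 1.0, 1.0],
]

gamma = 0.9
returns = [discounted_return(ep, gamma) for ep in episodes_from_s]
v_estimate = sum(returns) / len(returns)  # the sample average approximates E[R_t | s_t = s]
print(round(v_estimate, 3))  # ≈ 3.127

The more episodes we average over, the closer this sample average gets to the true expected return.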
✅ What We’ve Learned…
We started with the simple average-based value function and saw why it’s limited to immediate rewards. That’s where returns Rₜ come in, allowing us to account for future rewards after a state (or state-action pair). The discount factor γ helps balance immediate and long-term rewards. Finally, we updated our value function to use the expected return, giving it a forward-looking perspective that captures the full future impact of an agent’s decisions.
Before we wrap up, I know what you’re probably thinking. If you’ve been following this series from the beginning, you might be wondering:
-What about the EMA formula we derived earlier?
Of course, I haven’t forgotten about it! Just hang tight, it’s not time to bring it in yet… but you’ll see exactly where it fits in upcoming parts!
👉Next up: See it all in action!
Implementing the Value Function the Monte Carlo Way
Tutorial 5: In this tutorial, we’ll see in action how returns and state values are calculated using the Monte Carlo…
👉 Then, we’ll meet Bellman and his equations.
And see how this refined value function connects directly to them!
Why Is the Bellman Equation So Powerful in RL?
Breaking down the math to reveal how Bellman’s insight connects value, recursion, and optimality
✨ As always… stay curious, stay coding, and stay tuned!
📚References:
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.