Our Neat Value Function
Author(s): Rem E
Originally published on Towards AI.

So far, we’ve been discussing the environment (the problem) side. Now it’s time to talk about the solution: the agent!
And what better place to start than with value functions?
Before you begin, if you haven’t read The Whole Story of MDP in RL, make sure to check it out first!
🔁Returns
Remember our simple little value function? Yep, we’re starting from there again:

The value of an action a at time step t is
the average of the rewards received (r₁,r₂,r₃,…,rₖ) when that action was chosen k times.
Here, we are averaging all the rewards obtained from a certain state-action pair over k trials. k can represent the number of episodes, essentially asking: “What’s the average reward we got from this pair across k episodes?”
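This averaging step can be sketched in a few lines of Python. The reward values and the action name below are made up purely for illustration:

```python
def average_value(rewards):
    """Simple value estimate: the mean of the rewards observed
    each time this action (or state-action pair) was chosen."""
    return sum(rewards) / len(rewards)

# Suppose action a was chosen k = 4 times and yielded these rewards:
rewards_for_a = [1.0, 0.0, 2.0, 1.0]
print(average_value(rewards_for_a))  # 1.0
```

Note that this estimate only ever sees the immediate rewards, which is exactly the limitation discussed next.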
But there’s a problem: in this setup, we only care about the immediate rewards from that state-action pair. If you think about it, that’s not enough; we also want to consider all the upcoming rewards after taking that pair. Sometimes, a state-action pair might not give a great immediate reward but can lead to future pairs with much higher rewards. To account for that, we introduce a new concept called Returns:

Return Rₜ is the sum of rewards from time step t+1 onward, up to the final step T.
- rₜ: Reward you just got (already in the past at step t).
- Rₜ: Total rewards you will collect from now on (future-oriented, starting from t+1), coming from upcoming states or state-action pairs.
Some books use the notation Gₜ for return, derived from ‘gain,’ but both notations are widely accepted and mean the same thing.
If you look closely, you might notice that we’re currently giving all upcoming rewards equal importance (the same weight). That doesn’t really make sense.
This is where γ (the discount factor) comes in! Remember it from MDPs? It’s an important element that helps us prioritize immediate rewards over distant ones. We apply γ by raising it to the power of how many steps away each reward is (exponential discounting), so rewards further in the future contribute less to the return:

Return Rₜ is the sum of discounted rewards (by γ) from time step t+1 onward, up to the final step T.
Here, k counts how many steps a reward lies beyond t+1, starting from k = 0. The upper limit of the summation is simply the difference between the end and the beginning: T−(t+1)=T−t−1. For continual tasks, we can use ∞ instead, and the sum still converges thanks to the discount factor γ.
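The discounted return can be computed directly from this formula. Here is a minimal sketch; the reward sequence and γ value are made up for illustration:

```python
def discounted_return(rewards, gamma):
    """R_t = sum over k of gamma**k * r_{t+1+k}.

    `rewards` holds the future rewards r_{t+1}, r_{t+2}, ..., r_T in order,
    so index k = 0 corresponds to the immediate next reward r_{t+1}.
    """
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

future_rewards = [1.0, 1.0, 1.0]
print(discounted_return(future_rewards, gamma=0.9))  # 1 + 0.9 + 0.81 ≈ 2.71
```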
-How does γ affect the return formula?
- γ = 0
Only the immediate next reward rₜ₊₁ matters. The agent is myopic, focusing only on immediate gains.
- 0 < γ < 1 (typical case)
Future rewards are considered but discounted exponentially by γ; rewards farther in the future contribute less.
- γ ≈ 1
Future rewards are nearly as important as immediate rewards. Used in tasks where long-term success matters.
- γ = 1 (no discounting)
All future rewards are summed equally. This only works for episodic tasks; in continual tasks it can lead to infinite returns.
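You can see these regimes in action by computing the return of the same reward stream under different values of γ. This is an illustrative sketch; the stream of ten rewards of 1 is made up:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_{t+1+k} over the future rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0] * 10  # ten future rewards of 1 each

for gamma in (0.0, 0.5, 0.99, 1.0):
    print(f"gamma={gamma}: return = {discounted_return(rewards, gamma):.4f}")
```

With γ = 0 only the first reward counts (return 1.0), with γ = 1 all ten are summed equally (return 10.0), and intermediate values land in between.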
✨Refined Value Function
Ok, now that the concept of Returns is clear, let’s look at our new, fancy value function:

The value of a state following policy π is equal to the expected return Rₜ, if we start in state s.
Calm down, nothing new here. Whether you use V(s) or Q(s, a) doesn’t matter. Both follow the same formula, but:
- V(s) depends on the state: [..∣sₜ=s]
- Q(s, a) depends on both state and action: [..∣sₜ=s, aₜ=a]
Note that writing ∣sₜ or ∣sₜ=s means the same thing; they’re just two ways of saying “given that the current state is s.”
-When do you choose V or Q?
It depends on your problem: if your decision depends only on the state, use V(s). If it depends on both the state and the action, use Q(s, a).
π is just there to show that we’re using the policy π.
-Okay, we understand the left-hand side, but what about the right-hand side?
If you’re already familiar with expectation, you know it’s basically an average that takes probabilities into account, which is exactly what lets the equation handle the stochastic environment. And instead of using the immediate reward rₜ, we use our return Rₜ. It’s like saying:
“If I start in state s and follow policy π, what total reward can I expect in the long run?”
Wouldn’t that make more sense now?
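One intuitive way to approximate that expectation is to sample many episodes starting from s and average the returns they produce (a preview of the Monte Carlo idea). The toy “environment” below is entirely hypothetical: from state s, an episode yields a return of 10 with probability 0.3 and 0 otherwise, so the true V(s) is 3:

```python
import random

random.seed(0)  # make the sketch reproducible

def sample_return():
    """Hypothetical environment: return 10 with probability 0.3, else 0."""
    return 10.0 if random.random() < 0.3 else 0.0

def estimate_v(sample_fn, n_episodes=10_000):
    """Approximate V(s) = E[R_t | s_t = s] by averaging sampled returns."""
    total = 0.0
    for _ in range(n_episodes):
        total += sample_fn()
    return total / n_episodes

print(estimate_v(sample_return))  # close to the true value of 3.0
```

With enough episodes the sample average converges to the expectation, which is precisely what the refined value function promises.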
✅ What We’ve Learned…
We started with the simple average-based value function and saw why it’s limited to immediate rewards. That’s where returns Rₜ come in, allowing us to account for future rewards after a state (or state-action pair). The discount factor γ helps balance immediate and long-term rewards. Finally, we updated our value function to use the expected return, giving it a forward-looking perspective that captures the full future impact of an agent’s decisions.
Before we wrap up, I know what you’re probably thinking. If you’ve been following this series from the beginning, you might be wondering:
–What about the EMA formula we derived earlier?
Of course, I haven’t forgotten about it! Just hang tight, it’s not time to bring it in yet… but you’ll see exactly where it fits in upcoming parts!
👉Next up: See it all in action!
Implementing the Value Function the Monte Carlo Way
Tutorial 5: In this tutorial, we’ll see in action how returns and state values are calculated using the Monte Carlo…
👉 Then, we’ll meet Bellman and his equations.
And see how this refined value function connects directly to them!
Why Is the Bellman Equation So Powerful in RL?
Breaking down the math to reveal how Bellman’s insight connects value, recursion, and optimality
✨ As always… stay curious, stay coding, and stay tuned!
📚References:
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.