Return, Value Functions & Bellman Equations
- Compute finite-horizon undiscounted and infinite-horizon discounted returns for a given trajectory and explain when each formulation is appropriate
- Define V^π(s), Q^π(s,a), and A^π(s,a) in terms of expected return, and explain the relationship A^π = Q^π - V^π
- State and interpret the Bellman equations for V^π and Q^π, and the Bellman optimality equations for V* and Q*
- Derive the optimal action from Q* without needing a separate policy, and explain why this works for discrete but not continuous action spaces
Reward and Return
At each timestep $t$, the environment emits a scalar reward $r_t$. The agent's goal is to maximize return: some aggregate of rewards over time.
Finite-Horizon Undiscounted Return
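For episodes with a fixed length $T$, the return of a trajectory $\tau$ is the plain sum of its rewards:

$$R(\tau) = \sum_{t=0}^{T-1} r_t$$

All rewards are counted equally. This formulation suits episodic tasks with a natural endpoint (e.g., a game that ends in win/loss).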
Infinite-Horizon Discounted Return
For tasks that continue indefinitely, we apply a discount factor $\gamma \in (0, 1)$ to ensure convergence:

$$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$$
The discount factor $\gamma$ has two interpretations:
- Mathematical: when rewards are bounded and $\gamma < 1$, it ensures the infinite sum is finite.
- Economic: future rewards are worth less than immediate ones. With discount factor $\gamma$, a reward 100 steps away is worth $\gamma^{100}$ of a reward today.
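To make both interpretations concrete, here is a minimal sketch that prints the weight $\gamma^{100}$ and the geometric-series bound $r_{\max} / (1 - \gamma)$ on the discounted return. The three values of $\gamma$ and the function names are illustrative assumptions, not values taken from this text.

```python
def discount_weight(gamma: float, k: int) -> float:
    """Weight applied to a reward that arrives k steps in the future."""
    return gamma ** k


def max_discounted_return(r_max: float, gamma: float) -> float:
    """Geometric-series bound: sum over t of gamma^t * r_max = r_max / (1 - gamma)."""
    return r_max / (1.0 - gamma)


# Illustrative discount factors (assumed for this example only).
for gamma in (0.9, 0.99, 0.999):
    weight = discount_weight(gamma, 100)
    bound = max_discounted_return(1.0, gamma)
    print(f"gamma={gamma}: weight at 100 steps = {weight:.4g}, "
          f"return bound for |r| <= 1 is {bound:.4g}")
```

The closer $\gamma$ is to 1, the more a distant reward counts and the larger the bound on the return becomes.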
In practice, implementations often use the discounted return formula even for episodic tasks, treating terminal states as absorbing (no more reward).
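As a sketch of how both returns might be computed for a recorded trajectory, the snippet below takes a plain list of rewards (an assumed representation; the function names are hypothetical). Padding an episode with zero rewards models an absorbing terminal state and leaves the return unchanged.

```python
from typing import Sequence


def undiscounted_return(rewards: Sequence[float]) -> float:
    """Finite-horizon undiscounted return: the plain sum of rewards."""
    return float(sum(rewards))


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted return sum_t gamma^t * r_t, accumulated from the last step backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


episode = [1.0, 1.0, 1.0]
print(undiscounted_return(episode))       # 1 + 1 + 1
print(discounted_return(episode, 0.5))    # 1 + 0.5 + 0.25

# An absorbing terminal state emits zero reward forever, so padding the
# episode with zeros does not change the discounted return.
padded = episode + [0.0] * 50
print(discounted_return(padded, 0.5) == discounted_return(episode, 0.5))
```

The backward accumulation uses the recursion $G_t = r_t + \gamma G_{t+1}$, which avoids computing powers of $\gamma$ explicitly.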
The RL Optimization Problem
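The agent's goal is to find a policy $\pi$ that maximizes expected return:

$$\pi^* = \arg\max_{\pi} J(\pi), \qquad J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \right]$$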
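Different RL algorithms attack this optimization in different ways: policy gradient methods optimize expected return directly with respect to the policy's parameters, while Q-learning does so indirectly by learning optimal action values and acting greedily on them.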
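The expected return of a policy can be estimated by averaging sampled returns. The sketch below uses a made-up one-step environment (action 0 pays reward 1, action 1 pays 0) and a policy described only by its probability of choosing action 0. Everything here is a hypothetical toy setup for illustration.

```python
import random


def sample_return(p_good: float, rng: random.Random) -> float:
    """Run one single-step episode and return its reward (which is also its return)."""
    action = 0 if rng.random() < p_good else 1
    return 1.0 if action == 0 else 0.0


def estimate_expected_return(p_good: float, n_episodes: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of expected return: the mean of sampled returns."""
    rng = random.Random(seed)
    total = sum(sample_return(p_good, rng) for _ in range(n_episodes))
    return total / n_episodes


# In this toy problem the true expected return equals p_good,
# so the best policy is the one that always picks action 0.
for p_good in (0.2, 0.7, 1.0):
    print(p_good, estimate_expected_return(p_good))
```

Comparing the estimates across candidate policies is the optimization problem in miniature: the policy with the highest expected return wins.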
Value Functions
Value functions answer: "how much expected return will I get from here?"
State-Value Function
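$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right]$$

This is the expected return starting from state $s$ and following policy $\pi$ thereafter. A high $V^\pi(s)$ means state $s$ is favorable under $\pi$.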
Action-Value Function
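$$Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s,\ a_0 = a \right]$$

This is the expected return starting from state $s$, taking action $a$ first, then following $\pi$. Unlike $V^\pi$, it conditions on the specific first action.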
Advantage Function
$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

The advantage measures how much better (or worse) action $a$ is compared to the average action taken by $\pi$ in state $s$. A positive advantage means "this action is better than the policy's average"; negative means worse.
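The advantage function is central to policy gradient methods: instead of raising the probability of actions in proportion to total return, we raise it in proportion to their advantage. Subtracting the state-dependent baseline $V^\pi(s)$ leaves the expected gradient unchanged but typically gives a much lower-variance signal.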
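A small numeric sketch of the relationship between the three functions in a single state: given action values and the policy's action probabilities (both made up for illustration), the state value is the policy-weighted average of the action values, and the advantages are the differences from that average.

```python
from typing import List, Sequence


def state_value(q_values: Sequence[float], policy_probs: Sequence[float]) -> float:
    """V(s) as the policy-weighted average of Q(s, a) over actions."""
    return sum(p * q for p, q in zip(policy_probs, q_values))


def advantages(q_values: Sequence[float], policy_probs: Sequence[float]) -> List[float]:
    """A(s, a) = Q(s, a) - V(s) for every action in this state."""
    v = state_value(q_values, policy_probs)
    return [q - v for q in q_values]


# Hypothetical numbers for one state with three actions.
q = [1.0, 2.0, 4.0]        # Q(s, a) for a = 0, 1, 2
pi = [0.5, 0.25, 0.25]     # pi(a | s)

v = state_value(q, pi)
adv = advantages(q, pi)
print("V(s) =", v)
print("A(s, a) =", adv)

# The advantage averaged under the policy itself is zero by construction.
print("E_pi[A] =", sum(p * a for p, a in zip(pi, adv)))
```

That the policy-weighted advantage is zero is exactly why "better than the policy's average" is the right reading of a positive advantage.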
Bellman Equations
The Bellman equations express a recursive consistency that any valid value function must satisfy. They connect the value at the current state to the value at successor states.
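Bellman Equation for $V^\pi$

$$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s),\ s' \sim P(\cdot \mid s, a)}\left[ r(s, a) + \gamma V^\pi(s') \right]$$

Here $r(s, a)$ is the reward for taking action $a$ in state $s$, and $P$ is the environment's transition distribution. In words: the value of state $s$ equals the expected immediate reward plus the discounted value of the next state $s'$.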
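Bellman Equation for $Q^\pi$

$$Q^\pi(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\left[ Q^\pi(s', a') \right] \right]$$

The connection between $V^\pi$ and $Q^\pi$:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ Q^\pi(s, a) \right]$$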
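The Bellman equation for $V^\pi$ can be used as an update rule: apply its right-hand side repeatedly until the values stop changing. The sketch below does this for a hypothetical two-state, two-action MDP under a uniform random policy. The transition table, rewards, and $\gamma = 0.9$ are all assumptions chosen for illustration.

```python
GAMMA = 0.9
STATES = (0, 1)
ACTIONS = (0, 1)  # 0 = stay, 1 = move to the other state

# P[s][a] is a list of (probability, next_state); R[s][a] is the reward r(s, a).
P = {0: {0: [(1.0, 0)], 1: [(1.0, 1)]},
     1: {0: [(1.0, 1)], 1: [(1.0, 0)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 2.0, 1: 0.0}}

# Uniform random policy: PI[s][a] = pi(a | s).
PI = {s: {a: 0.5 for a in ACTIONS} for s in STATES}


def q_from_v(V, s, a):
    """One-step lookahead: r(s, a) + gamma * E over s' of V(s')."""
    return R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])


def evaluate_policy(pi, tol=1e-10):
    """Iterate the Bellman equation for V^pi until the largest change is below tol."""
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {s: sum(pi[s][a] * q_from_v(V, s, a) for a in ACTIONS)
                 for s in STATES}
        delta = max(abs(new_V[s] - V[s]) for s in STATES)
        V = new_V
        if delta < tol:
            return V


V = evaluate_policy(PI)
Q = {s: {a: q_from_v(V, s, a) for a in ACTIONS} for s in STATES}

for s in STATES:
    # V(s) should match the policy-weighted average of Q(s, a).
    avg_q = sum(PI[s][a] * Q[s][a] for a in ACTIONS)
    print(f"s={s}: V={V[s]:.4f}, E_pi[Q]={avg_q:.4f}")
```

Solving the two linear Bellman equations by hand for this toy MDP gives $V^\pi(0) = 7.25$ and $V^\pi(1) = 7.75$, which the iteration should approach.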
Optimal Value Functions
The optimal value functions correspond to the best possible policy:

$$V^*(s) = \max_{\pi} V^\pi(s), \qquad Q^*(s, a) = \max_{\pi} Q^\pi(s, a)$$
Bellman Optimality Equations
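$$V^*(s) = \max_{a}\ \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ r(s, a) + \gamma V^*(s') \right]$$

$$Q^*(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ r(s, a) + \gamma \max_{a'} Q^*(s', a') \right]$$

These are the foundation of Q-learning: if we can learn $Q^*$, we can recover the optimal policy without ever explicitly representing it:

$$a^*(s) = \arg\max_{a} Q^*(s, a)$$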
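This is straightforward for discrete action spaces: evaluate $Q^*(s, a)$ for every action and pick the largest. For continuous action spaces the $\arg\max$ is itself an optimization problem that must be solved at every step, which motivates methods such as DDPG and TD3 that learn a policy to approximate the maximizing action.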
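For a small discrete problem, the Bellman optimality equation for $Q^*$ can be iterated directly and the greedy action read off by enumerating the actions. The sketch below does this on a hypothetical two-state, two-action MDP. The tables and $\gamma = 0.9$ are illustrative assumptions, and this is tabular Q-value iteration with a known model, not the sample-based Q-learning algorithm.

```python
GAMMA = 0.9
STATES = (0, 1)
ACTIONS = (0, 1)  # 0 = stay, 1 = move to the other state

# P[s][a] is a list of (probability, next_state); R[s][a] is the reward r(s, a).
P = {0: {0: [(1.0, 0)], 1: [(1.0, 1)]},
     1: {0: [(1.0, 1)], 1: [(1.0, 0)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 2.0, 1: 0.0}}


def q_value_iteration(tol=1e-10):
    """Iterate Q(s,a) <- r(s,a) + gamma * E over s' of max_a' Q(s',a') until it converges."""
    Q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    while True:
        new_Q = {
            s: {a: R[s][a] + GAMMA * sum(p * max(Q[s2].values())
                                         for p, s2 in P[s][a])
                for a in ACTIONS}
            for s in STATES
        }
        delta = max(abs(new_Q[s][a] - Q[s][a])
                    for s in STATES for a in ACTIONS)
        Q = new_Q
        if delta < tol:
            return Q


def greedy_action(Q, s):
    """With finitely many actions, the arg max is a simple enumeration."""
    return max(ACTIONS, key=lambda a: Q[s][a])


Q_star = q_value_iteration()
for s in STATES:
    print(f"s={s}: Q*={Q_star[s]}, greedy action={greedy_action(Q_star, s)}")
```

In this toy MDP the best behavior is to move from state 0 to state 1 and then stay there, so the iteration should approach $V^*(1) = 2 / (1 - 0.9) = 20$ and $V^*(0) = 1 + 0.9 \cdot 20 = 19$. The enumeration in `greedy_action` is the step that has no direct analogue when actions are continuous.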
Connecting the Pieces
Here's a summary of the key relationships:
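| Symbol | Name | Meaning |
|---|---|---|
| $R(\tau)$ | Return | Total (discounted) reward from trajectory $\tau$ |
| $V^\pi(s)$ | State-value | Expected return from $s$ following $\pi$ |
| $Q^\pi(s, a)$ | Action-value | Expected return from $(s, a)$ following $\pi$ |
| $A^\pi(s, a)$ | Advantage | $Q^\pi(s, a) - V^\pi(s)$: how much better is $a$ than the policy's average? |
| $V^*(s)$ | Optimal state-value | Best achievable $V^\pi(s)$ over all policies |
| $Q^*(s, a)$ | Optimal action-value | Best achievable $Q^\pi(s, a)$ over all policies |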
Almost every RL algorithm either (a) learns a parameterized approximation of one of these functions, or (b) uses sampled returns as a Monte Carlo estimate of them.