Deep Reinforcement Learning · RL Foundations

Return, Value Functions & Bellman Equations

14 min read

By the end of this reading you will be able to:

Compute finite-horizon undiscounted and infinite-horizon discounted returns for a given trajectory and explain when each formulation is appropriate
Define V^π(s), Q^π(s,a), and A^π(s,a) in terms of expected return, and explain the relationship A^π = Q^π - V^π
State and interpret the Bellman equations for V^π and Q^π, and the Bellman optimality equations for V* and Q*
Derive the optimal action from Q* without needing a separate policy, and explain why this works for discrete but not continuous action spaces

Reward and Return

At each timestep $t$ , the environment emits a scalar reward $r_t = R(s_t, a_t, s_{t+1})$ . The agent's goal is to maximize return — some aggregate of rewards over time.

Finite-Horizon Undiscounted Return

For episodes with a fixed length $T$ : $R(\tau) = \sum_{t=0}^{T} r_t$

All rewards are counted equally. Used in episodic tasks with a natural endpoint (e.g., a game that ends in win/loss).

Infinite-Horizon Discounted Return

For tasks that continue indefinitely, we apply a discount factor $\gamma \in (0, 1)$ to ensure convergence: $R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$

The discount factor has two interpretations:

Mathematical: ensures the sum is finite.
Economic: future rewards are worth less than immediate ones. With $\gamma = 0.99$ , a reward 100 steps away is worth $0.99^{100} \approx 0.37$ of a reward today.

In practice, implementations often use the discounted return formula even for episodic tasks, treating terminal states as absorbing (no more reward).

The RL Optimization Problem

The agent's goal is to find a policy $\pi$ that maximizes expected return: $\pi^* = \arg\max_\pi \; J(\pi) = \mathbb{E}_{\tau \sim \pi}[R(\tau)]$

Different RL algorithms attack this optimization in different ways — policy gradients directly, Q-learning indirectly.

Value Functions

Value functions answer: "how much expected return will I get from here?"

State-Value Function $V^\pi(s)$

$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \mid s_0 = s\right]$

This is the expected return starting from state $s$ and following policy $\pi$ thereafter. A high $V^\pi(s)$ means state $s$ is favorable under $\pi$ .

Action-Value Function $Q^\pi(s, a)$

$Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \mid s_0 = s, a_0 = a\right]$

This is the expected return starting from state $s$ , taking action $a$ first, then following $\pi$ . Unlike $V^\pi$ , it conditions on the specific first action.

Advantage Function $A^\pi(s, a)$

$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$

The advantage measures how much better (or worse) action $a$ is compared to the average action taken by $\pi$ in state $s$ . A positive advantage means "this action is better than the policy's average"; negative means worse.

The advantage function is central to policy gradient methods: instead of pushing up actions proportional to total return, we push them up proportional to their advantage — a much lower-variance signal.

Bellman Equations

The Bellman equations express a recursive consistency that any valid value function must satisfy. They connect the value at the current state to the value at successor states.

Bellman Equation for $V^\pi$

$V^\pi(s) = \mathbb{E}_{a \sim \pi, s' \sim P}\left[r(s,a) + \gamma V^\pi(s')\right]$

In words: the value of state $s$ equals the expected immediate reward plus the discounted value of the next state.

Bellman Equation for $Q^\pi$

$Q^\pi(s,a) = \mathbb{E}_{s' \sim P}\left[r(s,a) + \gamma \mathbb{E}_{a' \sim \pi}\left[Q^\pi(s', a')\right]\right]$

The connection between $V^\pi$ and $Q^\pi$ : $V^\pi(s) = \mathbb{E}_{a \sim \pi}\left[Q^\pi(s,a)\right]$

Optimal Value Functions

The optimal value functions correspond to the best possible policy:

$V^*(s) = \max_\pi V^\pi(s)$ $Q^*(s,a) = \max_\pi Q^\pi(s,a)$

Bellman Optimality Equations

$V^*(s) = \max_a \; \mathbb{E}_{s' \sim P}\left[r(s,a) + \gamma V^*(s')\right]$

$Q^*(s,a) = \mathbb{E}_{s' \sim P}\left[r(s,a) + \gamma \max_{a'} Q^*(s', a')\right]$

These are the foundation of Q-learning: if we can learn $Q^*$ , we can recover the optimal policy without ever explicitly representing it:

$a^*(s) = \arg\max_a Q^*(s,a)$

This works trivially for discrete action spaces (evaluate all options) but requires solving a separate optimization problem for continuous spaces — motivating DDPG and TD3.

Connecting the Pieces

Here's a summary of the key relationships:

Symbol	Name	Meaning
$R(\tau)$	Return	Total (discounted) reward from trajectory $\tau$
$V^\pi(s)$	State-value	Expected return from $s$ following $\pi$
$Q^\pi(s,a)$	Action-value	Expected return from $(s,a)$ following $\pi$
$A^\pi(s,a)$	Advantage	$Q^\pi(s,a) - V^\pi(s)$ : how much better is $a$ ?
$V^*(s)$	Optimal state-value	Best achievable $V^\pi(s)$
$Q^*(s,a)$	Optimal action-value	Best achievable $Q^\pi(s,a)$

Almost every RL algorithm either (a) learns a parameterized approximation of one of these functions, or (b) uses sampled returns as a Monte Carlo estimate of them.

References

OpenAI Spinning Up — Part 1: Key Concepts in RL

Sutton & Barto 2018 — Reinforcement Learning: An Introduction — Ch. 3

Previous Next →

Return, Value Functions & Bellman Equations

Reward and Return

Finite-Horizon Undiscounted Return

Infinite-Horizon Discounted Return

The RL Optimization Problem

Value Functions

State-Value Function Vπ(s)V^\pi(s)Vπ(s)

Action-Value Function Qπ(s,a)Q^\pi(s, a)Qπ(s,a)

Advantage Function Aπ(s,a)A^\pi(s, a)Aπ(s,a)

Bellman Equations

Bellman Equation for VπV^\piVπ

Bellman Equation for QπQ^\piQπ

Optimal Value Functions

Bellman Optimality Equations

Connecting the Pieces

Privacy Policy

What we collect

What we don't collect

Your choices

Contact

State-Value Function $V^\pi(s)$

Action-Value Function $Q^\pi(s, a)$

Advantage Function $A^\pi(s, a)$

Bellman Equation for $V^\pi$

Bellman Equation for $Q^\pi$