Deep Reinforcement Learning · RL Foundations

Return, Value Functions & Bellman Equations

14 min read
By the end of this reading you will be able to:
  • Compute finite-horizon undiscounted and infinite-horizon discounted returns for a given trajectory and explain when each formulation is appropriate
  • Define V^π(s), Q^π(s,a), and A^π(s,a) in terms of expected return, and explain the relationship A^π = Q^π - V^π
  • State and interpret the Bellman equations for V^π and Q^π, and the Bellman optimality equations for V* and Q*
  • Derive the optimal action from Q* without needing a separate policy, and explain why this works for discrete but not continuous action spaces

Reward and Return

At each timestep tt, the environment emits a scalar reward rt=R(st,at,st+1)r_t = R(s_t, a_t, s_{t+1}). The agent's goal is to maximize return — some aggregate of rewards over time.

Finite-Horizon Undiscounted Return

For episodes with a fixed length TT: R(τ)=t=0TrtR(\tau) = \sum_{t=0}^{T} r_t

All rewards are counted equally. Used in episodic tasks with a natural endpoint (e.g., a game that ends in win/loss).

Infinite-Horizon Discounted Return

For tasks that continue indefinitely, we apply a discount factor γ(0,1)\gamma \in (0, 1) to ensure convergence: R(τ)=t=0γtrtR(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t

The discount factor has two interpretations:

  • Mathematical: ensures the sum is finite.
  • Economic: future rewards are worth less than immediate ones. With γ=0.99\gamma = 0.99, a reward 100 steps away is worth 0.991000.370.99^{100} \approx 0.37 of a reward today.

In practice, implementations often use the discounted return formula even for episodic tasks, treating terminal states as absorbing (no more reward).

The RL Optimization Problem

The agent's goal is to find a policy π\pi that maximizes expected return: π=argmaxπ  J(π)=Eτπ[R(τ)]\pi^* = \arg\max_\pi \; J(\pi) = \mathbb{E}_{\tau \sim \pi}[R(\tau)]

Different RL algorithms attack this optimization in different ways — policy gradients directly, Q-learning indirectly.

Value Functions

Value functions answer: "how much expected return will I get from here?"

State-Value Function Vπ(s)V^\pi(s)

Vπ(s)=Eτπ[R(τ)s0=s]V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \mid s_0 = s\right]

This is the expected return starting from state ss and following policy π\pi thereafter. A high Vπ(s)V^\pi(s) means state ss is favorable under π\pi.

Action-Value Function Qπ(s,a)Q^\pi(s, a)

Qπ(s,a)=Eτπ[R(τ)s0=s,a0=a]Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \mid s_0 = s, a_0 = a\right]

This is the expected return starting from state ss, taking action aa first, then following π\pi. Unlike VπV^\pi, it conditions on the specific first action.

Advantage Function Aπ(s,a)A^\pi(s, a)

Aπ(s,a)=Qπ(s,a)Vπ(s)A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)

The advantage measures how much better (or worse) action aa is compared to the average action taken by π\pi in state ss. A positive advantage means "this action is better than the policy's average"; negative means worse.

The advantage function is central to policy gradient methods: instead of pushing up actions proportional to total return, we push them up proportional to their advantage — a much lower-variance signal.

Bellman Equations

The Bellman equations express a recursive consistency that any valid value function must satisfy. They connect the value at the current state to the value at successor states.

Bellman Equation for VπV^\pi

Vπ(s)=Eaπ,sP[r(s,a)+γVπ(s)]V^\pi(s) = \mathbb{E}_{a \sim \pi, s' \sim P}\left[r(s,a) + \gamma V^\pi(s')\right]

In words: the value of state ss equals the expected immediate reward plus the discounted value of the next state.

Bellman Equation for QπQ^\pi

Qπ(s,a)=EsP[r(s,a)+γEaπ[Qπ(s,a)]]Q^\pi(s,a) = \mathbb{E}_{s' \sim P}\left[r(s,a) + \gamma \mathbb{E}_{a' \sim \pi}\left[Q^\pi(s', a')\right]\right]

The connection between VπV^\pi and QπQ^\pi: Vπ(s)=Eaπ[Qπ(s,a)]V^\pi(s) = \mathbb{E}_{a \sim \pi}\left[Q^\pi(s,a)\right]

Optimal Value Functions

The optimal value functions correspond to the best possible policy:

V(s)=maxπVπ(s)V^*(s) = \max_\pi V^\pi(s) Q(s,a)=maxπQπ(s,a)Q^*(s,a) = \max_\pi Q^\pi(s,a)

Bellman Optimality Equations

V(s)=maxa  EsP[r(s,a)+γV(s)]V^*(s) = \max_a \; \mathbb{E}_{s' \sim P}\left[r(s,a) + \gamma V^*(s')\right]

Q(s,a)=EsP[r(s,a)+γmaxaQ(s,a)]Q^*(s,a) = \mathbb{E}_{s' \sim P}\left[r(s,a) + \gamma \max_{a'} Q^*(s', a')\right]

These are the foundation of Q-learning: if we can learn QQ^*, we can recover the optimal policy without ever explicitly representing it:

a(s)=argmaxaQ(s,a)a^*(s) = \arg\max_a Q^*(s,a)

This works trivially for discrete action spaces (evaluate all options) but requires solving a separate optimization problem for continuous spaces — motivating DDPG and TD3.

Connecting the Pieces

Here's a summary of the key relationships:

Symbol Name Meaning
R(τ)R(\tau) Return Total (discounted) reward from trajectory τ\tau
Vπ(s)V^\pi(s) State-value Expected return from ss following π\pi
Qπ(s,a)Q^\pi(s,a) Action-value Expected return from (s,a)(s,a) following π\pi
Aπ(s,a)A^\pi(s,a) Advantage Qπ(s,a)Vπ(s)Q^\pi(s,a) - V^\pi(s): how much better is aa?
V(s)V^*(s) Optimal state-value Best achievable Vπ(s)V^\pi(s)
Q(s,a)Q^*(s,a) Optimal action-value Best achievable Qπ(s,a)Q^\pi(s,a)

Almost every RL algorithm either (a) learns a parameterized approximation of one of these functions, or (b) uses sampled returns as a Monte Carlo estimate of them.