Deep Reinforcement Learning · RL Foundations

RL Fundamentals: Agents, Environments, and Policies

By the end of this reading you will be able to:
  • Define the agent-environment interaction loop, distinguishing states from observations and identifying where reward is produced
  • Distinguish discrete and continuous action spaces and explain why this distinction drives major differences in algorithm design
  • Implement a categorical policy (discrete) and a diagonal Gaussian policy (continuous) in PyTorch using torch.distributions
  • Explain the role of log-probability in training stochastic policies and compute it for both categorical and diagonal Gaussian distributions

The Core Loop

Reinforcement learning studies how agents learn to behave by trial and error. Unlike supervised learning — where a teacher provides correct labels — an RL agent receives only a scalar reward signal and must figure out, through exploration, which actions lead to higher cumulative reward.

The fundamental loop is simple:

  1. The agent observes the environment's current state.
  2. The agent selects an action.
  3. The environment transitions to a new state and emits a reward.
  4. Repeat.

Over many repetitions, the agent adjusts its behavior to maximize the total reward collected.
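The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: `CorridorEnv`, `run_episode`, and both policies are hypothetical names invented here, and the `reset`/`step` interface only loosely mirrors what common RL libraries expose.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action: 0 = left, 1 = right; the position is clipped to the corridor.
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        done = self.pos == self.length - 1
        # The environment, not the agent, produces the reward.
        reward = 1.0 if done else 0.0
        return self.pos, reward, done


def run_episode(env, policy, max_steps=100):
    state = env.reset()                          # 1. observe the current state
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # 2. select an action
        state, reward, done = env.step(action)   # 3. transition and receive reward
        total_reward += reward
        if done:
            break                                # 4. otherwise repeat
    return total_reward


rng = random.Random(0)
random_policy = lambda state: rng.choice([0, 1])   # ignores the state entirely
always_right = lambda state: 1                     # hand-written "good" behavior

print(run_episode(CorridorEnv(), always_right))
print(run_episode(CorridorEnv(), random_policy))
```

Nothing is learned in this sketch; it only shows where the state, action, and reward appear in the loop. A learning algorithm would replace the fixed policies with one that is updated from the collected rewards.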

States vs. Observations

A state $s$ is the complete description of the world: no information is hidden. An observation $o$ is what the agent actually sees, which may be a partial view of the state.

  • A chess board position is a full state (perfect information).
  • A single frame from an Atari game is an observation — the ball's velocity is not visible in one frame alone.

When the agent sees the full state, we say the environment is fully observed; otherwise it is partially observed. In practice, most deep RL papers write $s$ even when they technically mean $o$, because the policy conditions only on what is observed anyway.

States and observations are represented as real-valued tensors: a robot's state might be a vector of joint angles and velocities; a visual observation might be an $(84 \times 84 \times 3)$ RGB tensor.
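The Atari example, where velocity is not visible in a single frame, can be made concrete with a toy system. In the sketch below the full state is a `(position, velocity)` pair and the observation exposes only the position. The functions `observe` and `advance` are hypothetical, and the constant-velocity dynamics are an assumption chosen for simplicity.

```python
def observe(state):
    """Hypothetical sensor: exposes the position but hides the velocity."""
    position, velocity = state
    return position


def advance(state, dt=1.0):
    """Assumed dynamics: constant velocity, no forces."""
    position, velocity = state
    return (position + velocity * dt, velocity)


dt = 1.0
s0 = (0.0, 2.0)        # full state: position 0, velocity 2
s1 = advance(s0, dt)   # full state one step later

o0, o1 = observe(s0), observe(s1)

# One observation alone says nothing about velocity,
# but two consecutive observations are enough to recover it.
estimated_velocity = (o1 - o0) / dt
print(o0, o1, estimated_velocity)
```

This is one reason agents in partially observed settings are often given a short history of observations rather than a single one.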

Action Spaces

The action space is the set of all valid actions. Two major types exist:

Discrete action spaces contain a finite number of distinct actions — e.g., {left, right, fire} in Atari, or one of 361 board positions in Go. Algorithms like DQN are designed for discrete spaces.

Continuous action spaces contain real-valued vectors — e.g., joint torques for a robotic arm or steering/throttle for a simulated car. Algorithms like DDPG, TD3, and SAC are designed for continuous spaces.

This distinction is not cosmetic. Computing $\arg\max_a Q(s,a)$ is trivial when $a$ is discrete (evaluate every option and pick the best) but becomes an optimization problem in its own right when $a \in \mathbb{R}^n$, which is why continuous-action algorithms need different machinery, such as a separate actor network that outputs actions directly.
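The contrast can be seen in a small sketch. Both Q-functions below are made up for illustration, and the finite-difference gradient ascent is only one simple way to search a continuous action space; it is not how DDPG, TD3, or SAC choose actions.

```python
ACTIONS = ["left", "right", "fire"]


def q_discrete(state, action):
    """Hypothetical Q-values for three discrete actions (state is ignored)."""
    return {"left": 0.1, "right": 0.7, "fire": 0.4}[action]


def q_continuous(state, action):
    """Hypothetical Q-function over a scalar action, peaked at action = 0.3."""
    return -(action - 0.3) ** 2


# Discrete case: evaluate every action and keep the best one.
best_discrete = max(ACTIONS, key=lambda a: q_discrete(None, a))


def argmax_by_gradient_ascent(q, state, a0=0.0, lr=0.1, steps=200, eps=1e-5):
    """Approximate argmax_a q(state, a) with finite-difference gradient ascent."""
    a = a0
    for _ in range(steps):
        grad = (q(state, a + eps) - q(state, a - eps)) / (2 * eps)
        a += lr * grad
    return a


# Continuous case: there is no finite list to enumerate, so we must optimize.
best_continuous = argmax_by_gradient_ascent(q_continuous, None)
print(best_discrete, round(best_continuous, 3))
```

The discrete maximization costs three function evaluations. The continuous one needs an inner optimization loop for every single action choice, which is too expensive to run at each environment step with a neural-network Q-function.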

Policies

A policy is the agent's decision rule — a mapping from state to action. There are two types:

Deterministic policy $\mu_\theta$: returns a single action for each state, $a_t = \mu_\theta(s_t)$.

Stochastic policy $\pi_\theta$: returns a distribution over actions; the agent samples from it, $a_t \sim \pi_\theta(\cdot \mid s_t)$.

In deep RL, policies are parameterized by neural network weights $\theta$. Training adjusts these weights to improve performance.
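The difference between the two policy types shows up as soon as the same state is queried repeatedly. The sketch below uses hand-written rules rather than neural networks; the function names, the action labels, and the 0.8/0.2 probabilities are all assumptions made for illustration.

```python
import random


def deterministic_policy(state):
    """Hypothetical rule: always move toward the origin."""
    return "left" if state > 0 else "right"


def stochastic_policy(state, rng):
    """Hypothetical rule: usually move toward the origin, sometimes explore."""
    if state > 0:
        probs = {"left": 0.8, "right": 0.2}
    else:
        probs = {"left": 0.2, "right": 0.8}
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)

# The deterministic policy gives the same action every time for the same state.
det_actions = [deterministic_policy(3) for _ in range(5)]

# The stochastic policy gives a sample; repeated calls trace out the distribution.
sto_actions = [stochastic_policy(3, rng) for _ in range(1000)]
left_fraction = sto_actions.count("left") / len(sto_actions)

print(det_actions)
print(left_fraction)
```

With a neural network, the hand-written probability table would be replaced by the network's output, and the weights $\theta$ would determine those probabilities.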

Categorical Policies (Discrete Actions)

For discrete actions, the network outputs one logit per action and a softmax converts them to probabilities:

import torch
import torch.nn as nn
from torch.distributions import Categorical

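class CategoricalPolicy(nn.Module):
    """MLP that maps an observation to a categorical distribution over actions."""

    def __init__(self, obs_dim, act_dim, hidden=(64, 64)):
        super().__init__()
        # Build the hidden stack: Linear -> Tanh for each hidden size.
        layers = []
        in_dim = obs_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.Tanh()]
            in_dim = h
        # The final layer emits one unnormalized logit per action.
        layers.append(nn.Linear(in_dim, act_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, obs):
        logits = self.net(obs)
        # Categorical normalizes the logits internally (softmax).
        return Categorical(logits=logits)

    def act(self, obs):
        # Sample one action for a single (unbatched) observation; no gradient is needed.
        with torch.no_grad():
            return self.forward(obs).sample().item()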

Two key operations:

  • Sampling: dist.sample() — draw an action from the policy.
  • Log-likelihood: dist.log_prob(a) — needed for computing policy gradient updates.

The log-probability of action $a$ under a categorical distribution with probability vector $P_\theta(s)$ is $\log \pi_\theta(a \mid s) = \log [P_\theta(s)]_a$.
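To see what the log-probability formula computes, it can be worked out by hand without PyTorch. The sketch below assumes a made-up logit vector for three actions. `log_softmax` is a helper written here for illustration, and the result should agree with what `Categorical(logits=...).log_prob` returns for the same inputs.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


logits = [2.0, 1.0, 0.1]          # hypothetical network output for 3 actions
log_probs = log_softmax(logits)   # log P_theta(s), one entry per action

action = 0
log_prob_a = log_probs[action]    # log pi_theta(a | s) = log [P_theta(s)]_a

print(log_prob_a)
print(sum(math.exp(lp) for lp in log_probs))  # probabilities sum to 1
```

Working in log space from the start avoids taking the logarithm of a probability that has underflowed to zero.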

Diagonal Gaussian Policies (Continuous Actions)

For continuous actions, the network outputs a mean vector $\mu_\theta(s)$, paired with a log-standard-deviation vector $\log \sigma$ that is either state-independent or state-dependent. Actions are sampled from $\mathcal{N}(\mu_\theta(s), \operatorname{diag}(\sigma^2))$:

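import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """MLP that maps an observation to a diagonal Gaussian over actions."""

    def __init__(self, obs_dim, act_dim, hidden=(64, 64)):
        super().__init__()
        layers = []
        in_dim = obs_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.Tanh()]
            in_dim = h
        self.mean_net = nn.Sequential(*layers, nn.Linear(in_dim, act_dim))
        # Log std as a learnable parameter (not state-dependent)
        self.log_std = nn.Parameter(-0.5 * torch.ones(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        # Exponentiating keeps the standard deviation strictly positive.
        std = self.log_std.exp()
        # Normal treats each action dimension as an independent 1-D Gaussian.
        return Normal(mean, std)

    def act(self, obs):
        # Sample an action without tracking gradients.
        with torch.no_grad():
            return self.forward(obs).sample()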

For a diagonal Gaussian with mean $\mu$ and standard deviation $\sigma$, the log-probability of action $a$ is the sum of the per-dimension log-probabilities: $\log \pi_\theta(a \mid s) = \sum_{i=1}^{n} \log \mathcal{N}(a_i; \mu_i, \sigma_i^2)$

In code: dist.log_prob(a).sum(axis=-1) — the .sum(axis=-1) is critical; without it you get a tensor of shape (batch, act_dim) instead of (batch,).
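The per-dimension sum can also be checked by hand. The sketch below writes out the univariate Gaussian log-density and adds it up across action dimensions; the helper names and the numeric values are illustrative assumptions, with `log_std = -0.5` chosen to match the initial value in the policy class above.

```python
import math


def normal_log_prob(a, mu, sigma):
    """Log-density of a 1-D Gaussian N(mu, sigma^2) evaluated at a."""
    return (
        -0.5 * ((a - mu) / sigma) ** 2
        - math.log(sigma)
        - 0.5 * math.log(2 * math.pi)
    )


def diag_gaussian_log_prob(action, mean, log_std):
    """Sum the per-dimension log-densities to get one scalar per action."""
    return sum(
        normal_log_prob(a, m, math.exp(ls))
        for a, m, ls in zip(action, mean, log_std)
    )


mean = [0.0, 1.0]        # hypothetical network output for a 2-D action
log_std = [-0.5, -0.5]   # one log-std per action dimension
action = [0.0, 1.0]      # an action exactly at the mean

per_dim = [normal_log_prob(a, m, math.exp(ls)) for a, m, ls in zip(action, mean, log_std)]
total = diag_gaussian_log_prob(action, mean, log_std)

print(per_dim)   # what you would get without the sum: one value per dimension
print(total)     # the scalar log pi_theta(a | s)
```

Because the dimensions are independent, the joint density is a product of 1-D densities, and the log of a product is the sum of the logs. Forgetting the sum therefore leaves one log-probability per dimension rather than one per action.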

Trajectories

A trajectory (also called a rollout or episode) is a sequence of states and actions: $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$

The first state $s_0$ is sampled from the start-state distribution $\rho_0$. Each subsequent state transition follows the environment's dynamics $P(s_{t+1} \mid s_t, a_t)$, which the agent does not know (and usually doesn't need to know in model-free RL).

Trajectories are the raw data of RL. Every algorithm collects them, extracts reward signals, and uses them to update the policy.
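As a rough sketch of how a trajectory is generated, the example below samples one from a tiny two-state environment. The tables `rho0` and `P`, the state and action names, and the uniform random policy are all invented for illustration; in a real problem the agent would only see the sampled states, never the tables themselves.

```python
import random

# Hypothetical start-state distribution rho_0 over two states.
rho0 = {"A": 0.5, "B": 0.5}

# Hypothetical dynamics P(s' | s, a), stored as one distribution per (s, a) pair.
P = {
    ("A", "stay"): {"A": 0.9, "B": 0.1},
    ("A", "move"): {"A": 0.2, "B": 0.8},
    ("B", "stay"): {"A": 0.1, "B": 0.9},
    ("B", "move"): {"A": 0.8, "B": 0.2},
}


def sample(dist, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def rollout(policy, T, rng):
    """Return tau = [(s_0, a_0), ..., (s_T, a_T)]."""
    s = sample(rho0, rng)                 # s_0 ~ rho_0
    tau = []
    for t in range(T + 1):
        a = policy(s, rng)                # a_t chosen by the policy
        tau.append((s, a))
        if t < T:
            s = sample(P[(s, a)], rng)    # s_{t+1} ~ P(. | s_t, a_t)
    return tau


uniform_policy = lambda s, rng: rng.choice(["stay", "move"])

rng = random.Random(0)
T = 4
tau = rollout(uniform_policy, T, rng)
print(tau)
```

The randomness in a trajectory comes from three places: the start state, the policy's action choices, and the environment's transitions.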

What RL Can Do

RL has produced remarkable results in recent years:

  • Games: AlphaGo/AlphaZero (Go), OpenAI Five (Dota 2), DQN (Atari)
  • Robotics: Sim-to-real locomotion, robotic dexterous manipulation
  • Recommendation systems, chip design, protein folding guidance

Success stories share common features: well-defined reward signals, large amounts of simulator experience, and careful engineering. RL still struggles with sparse rewards and safety constraints, and it typically needs far more data than supervised learning to reach good performance.