Deep Reinforcement Learning · RL Foundations

RL Fundamentals: Agents, Environments, and Policies

By the end of this reading you will be able to:

Define the agent-environment interaction loop, distinguishing states from observations and identifying where reward is produced
Distinguish discrete and continuous action spaces and explain why this distinction drives major differences in algorithm design
Implement a categorical policy (discrete) and a diagonal Gaussian policy (continuous) in PyTorch using torch.distributions
Explain the role of log-probability in training stochastic policies and compute it for both categorical and diagonal Gaussian distributions

The Core Loop

Reinforcement learning studies how agents learn to behave by trial and error. Unlike supervised learning — where a teacher provides correct labels — an RL agent receives only a scalar reward signal and must figure out, through exploration, which actions lead to higher cumulative reward.

The fundamental loop is simple:

The agent observes the environment's current state.
The agent selects an action.
The environment transitions to a new state and emits a reward.
Repeat.

Over many repetitions, the agent adjusts its behavior to maximize the total reward collected.
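The loop above can be sketched in a few lines of plain Python. This is a minimal illustration, not a standard API: `CorridorEnv`, `random_agent`, the reward values, and the 1000-step cap are all hypothetical choices made for this example.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk along a 1-D corridor to reach the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos  # here the state is simply the agent's position

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the position.
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        done = self.pos == self.length - 1
        # The environment, not the agent, produces the reward.
        reward = 1.0 if done else -0.1
        return self.pos, reward, done


def random_agent(state, rng):
    """Placeholder agent that ignores the state and picks an action uniformly."""
    return rng.choice([0, 1])


rng = random.Random(0)
env = CorridorEnv()
state = env.reset()                          # 1. agent observes the current state
total_reward, steps, done = 0.0, 0, False
while not done and steps < 1000:
    action = random_agent(state, rng)        # 2. agent selects an action
    state, reward, done = env.step(action)   # 3. environment transitions, emits a reward
    total_reward += reward
    steps += 1                               # 4. repeat

print(f"steps={steps}, total_reward={total_reward:.1f}")
```

A learning agent would replace `random_agent` with a policy whose parameters are updated from the collected rewards; the loop itself stays the same.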

States vs. Observations

A state $s$ is the complete description of the world — no information is hidden. An observation $o$ is what the agent actually sees, which may be a partial view of the state.

A chess board position is a full state (perfect information).
A single frame from an Atari game is an observation — the ball's velocity is not visible in one frame alone.

When the agent sees the full state we say the environment is fully observed; otherwise it is partially observed. In practice, most deep RL papers write $s$ even when they technically mean $o$, because the policy conditions only on what is observed anyway.

States and observations are represented as real-valued tensors: a robot's state might be joint angles and velocities; a visual observation might be a $(84 \times 84 \times 3)$ RGB tensor.
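To make the state/observation gap concrete, here is a small sketch in the spirit of the Atari example. The `BallState` class, the `observe` function, and the constant-velocity dynamics are assumptions invented for illustration. A single observation hides the velocity, while two consecutive observations recover it, which is the intuition behind feeding several stacked frames to a policy.

```python
from dataclasses import dataclass


@dataclass
class BallState:
    """Full state of a hypothetical 1-D ball: nothing is hidden."""
    position: float
    velocity: float


def observe(state: BallState) -> float:
    """A single 'frame' shows where the ball is, but not how fast it moves."""
    return state.position


def step(state: BallState, dt: float = 1.0) -> BallState:
    """Constant-velocity dynamics, assumed here purely for simplicity."""
    return BallState(state.position + state.velocity * dt, state.velocity)


s0 = BallState(position=0.0, velocity=2.0)
s1 = step(s0)

o0, o1 = observe(s0), observe(s1)
# One observation is ambiguous about velocity; two consecutive ones are not.
estimated_velocity = (o1 - o0) / 1.0
print(o0, o1, estimated_velocity)
```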

Action Spaces

The action space is the set of all valid actions. Two major types exist:

Discrete action spaces contain a finite number of distinct actions — e.g., {left, right, fire} in Atari, or one of 361 board positions in Go. Algorithms like DQN are designed for discrete spaces.

Continuous action spaces contain real-valued vectors — e.g., joint torques for a robotic arm or steering/throttle for a simulated car. Algorithms like DDPG, TD3, and SAC are designed for continuous spaces.

This distinction is not cosmetic. Computing $\arg\max_a Q(s,a)$ is trivial when $a$ is discrete (evaluate $Q$ for every action and take the largest) but becomes an optimization problem in its own right when $a \in \mathbb{R}^n$, which is why continuous-space algorithms typically learn a separate policy network that outputs actions instead of searching over $Q$ directly.
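The difference shows up even in a toy setting. In the sketch below, both Q-functions are hypothetical stand-ins (not learned networks), and random search is used only as the simplest possible approximate maximizer; it is not what DDPG, TD3, or SAC actually do.

```python
import random


def q_discrete(state, action):
    """Hypothetical Q-values for three discrete actions (the state is ignored)."""
    return {"left": 0.2, "right": 1.5, "fire": 0.7}[action]


def q_continuous(state, action):
    """Hypothetical Q-function over a scalar action, peaked at action = 0.3."""
    return -(action - 0.3) ** 2


state = None  # placeholder; the toy Q-functions do not depend on it

# Discrete: enumerate every action and keep the best one. This is exact.
best_discrete = max(["left", "right", "fire"], key=lambda a: q_discrete(state, a))

# Continuous: there is nothing to enumerate, so the maximizer must be
# approximated, here with plain random search over the interval [-1, 1].
rng = random.Random(0)
candidates = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
best_continuous = max(candidates, key=lambda a: q_continuous(state, a))

print(best_discrete, round(best_continuous, 2))
```

The discrete case costs one Q evaluation per action. The continuous case needs many evaluations for an approximate answer, and that cost grows quickly with the action dimension.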

Policies

A policy is the agent's decision rule — a mapping from state to action. There are two types:

Deterministic policy $\mu_\theta$ : returns a single action for each state. $a_t = \mu_\theta(s_t)$

Stochastic policy $\pi_\theta$ : returns a distribution over actions; the agent samples from it. $a_t \sim \pi_\theta(\cdot \mid s_t)$

In deep RL, policies are parameterized by neural network weights $\theta$. Training adjusts these weights to improve performance.
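A short sketch of the two policy types, using single linear layers as stand-ins for real networks. The layer sizes and the variable names `mu`, `logits_net`, and `s_t` are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

torch.manual_seed(0)
obs_dim, n_actions, act_dim = 4, 3, 2  # illustrative sizes, not from the text

# Deterministic policy mu_theta: the network output *is* the action.
mu = nn.Linear(obs_dim, act_dim)

# Stochastic policy pi_theta: the network output parameterizes a distribution.
logits_net = nn.Linear(obs_dim, n_actions)

s_t = torch.randn(obs_dim)

with torch.no_grad():
    a_det_1 = mu(s_t)
    a_det_2 = mu(s_t)                          # same state -> same action
    pi = Categorical(logits=logits_net(s_t))
    samples = pi.sample((5,))                  # same state -> actions may differ

print(torch.equal(a_det_1, a_det_2), samples.tolist())
```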

Categorical Policies (Discrete Actions)

For discrete actions, the network outputs one logit per action and a softmax converts them to probabilities:

import torch
import torch.nn as nn
from torch.distributions import Categorical

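class CategoricalPolicy(nn.Module):
    """MLP that maps an observation to a categorical distribution over actions."""

    def __init__(self, obs_dim, act_dim, hidden=(64, 64)):
        super().__init__()
        layers = []
        in_dim = obs_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.Tanh()]
            in_dim = h
        # Final layer emits one unnormalized logit per action.
        layers.append(nn.Linear(in_dim, act_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, obs):
        logits = self.net(obs)
        # Categorical applies the softmax internally when given logits.
        return Categorical(logits=logits)

    def act(self, obs):
        # Sample one action for a single (unbatched) observation; .item()
        # raises on a batch of more than one. No gradients are needed here.
        with torch.no_grad():
            return self.forward(obs).sample().item()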

Two key operations:

Sampling: dist.sample() — draw an action from the policy.
Log-likelihood: dist.log_prob(a) — needed for computing policy gradient updates.

The log-probability of action $a$ under a categorical distribution with probability vector $P_\theta(s)$ is: $\log \pi_\theta(a|s) = \log [P_\theta(s)]_a$
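As a sanity check on this formula, the sketch below compares `dist.log_prob(a)` with a manual computation that takes the log-softmax of the logits and picks out the entry for the chosen action. The batch size, the number of actions, and the specific `actions` tensor are arbitrary illustrative values.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

torch.manual_seed(0)
logits = torch.randn(4, 3)            # batch of 4 states, 3 actions (illustrative)
actions = torch.tensor([0, 2, 1, 2])  # one chosen action index per state

dist = Categorical(logits=logits)
logp_dist = dist.log_prob(actions)    # shape (4,)

# Manual version: log-softmax gives log P_theta(s) for every action;
# gather selects the entry [.]_a for the action taken in each row.
log_probs = F.log_softmax(logits, dim=-1)
logp_manual = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

print(torch.allclose(logp_dist, logp_manual))
```

Working with logits and log-softmax, rather than taking the log of softmax probabilities, avoids numerical underflow when some actions have very small probability.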

Diagonal Gaussian Policies (Continuous Actions)

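For continuous actions, the network outputs a mean vector $\mu_\theta(s)$, and the log-standard-deviation $\log \sigma$ is either a state-independent learnable parameter vector (as in the code below) or a second, state-dependent network output. Actions are sampled from $\mathcal{N}(\mu_\theta(s), \operatorname{diag}(\sigma^2))$: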

import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """MLP mean with a state-independent, learnable log standard deviation."""

    def __init__(self, obs_dim, act_dim, hidden=(64, 64)):
        super().__init__()
        layers = []
        in_dim = obs_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.Tanh()]
            in_dim = h
        self.mean_net = nn.Sequential(*layers, nn.Linear(in_dim, act_dim))
        # Log std as a learnable parameter (not state-dependent)
        self.log_std = nn.Parameter(-0.5 * torch.ones(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        # Exponentiating the log std keeps the std strictly positive.
        std = self.log_std.exp()
        return Normal(mean, std)

    def act(self, obs):
        # Sampling for environment interaction needs no gradients.
        with torch.no_grad():
            return self.forward(obs).sample()

For a diagonal Gaussian with mean $\mu$ and standard deviation $\sigma$, the log-probability of action $a$ is the sum of per-dimension log-probs: $\log \pi_\theta(a \mid s) = \sum_{i=1}^{n} \log \mathcal{N}(a_i; \mu_i, \sigma_i^2)$

In code: dist.log_prob(a).sum(axis=-1) — the .sum(axis=-1) is critical; without it you get a tensor of shape (batch, act_dim) instead of (batch,).
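Expanding each Gaussian term gives the closed form $\log \pi_\theta(a \mid s) = -\frac{1}{2} \sum_{i=1}^{n} \left( \frac{(a_i - \mu_i)^2}{\sigma_i^2} + 2 \log \sigma_i + \log 2\pi \right)$. The sketch below checks this against `torch.distributions` and shows the shape difference that the sum removes. The batch size, action dimension, and random inputs are illustrative assumptions; wrapping the distribution in `Independent` is one alternative way to get the summed log-prob, not something the text above requires.

```python
import math

import torch
from torch.distributions import Independent, Normal

torch.manual_seed(0)
batch, act_dim = 5, 3                      # illustrative sizes
mean = torch.randn(batch, act_dim)
log_std = -0.5 * torch.ones(act_dim)       # same initial value as the policy above
std = log_std.exp()
a = torch.randn(batch, act_dim)

dist = Normal(mean, std)
per_dim = dist.log_prob(a)                 # shape (batch, act_dim)
logp = per_dim.sum(dim=-1)                 # shape (batch,)

# Closed form of the same quantity, summed over action dimensions.
logp_manual = -0.5 * (
    ((a - mean) / std) ** 2 + 2 * log_std + math.log(2 * math.pi)
).sum(dim=-1)

# Alternative: treat the last dimension as one event, so log_prob sums it.
logp_independent = Independent(dist, 1).log_prob(a)

print(per_dim.shape, logp.shape)
print(torch.allclose(logp, logp_manual, atol=1e-5))
print(torch.allclose(logp, logp_independent, atol=1e-5))
```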

Trajectories

A trajectory (also called a rollout or episode) is a sequence of states and actions: $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$

The first state $s_0$ is sampled from the start-state distribution $\rho_0$. Each subsequent state transition follows the environment's dynamics $P(s_{t+1} \mid s_t, a_t)$, which the agent does not know (and usually doesn't need to know in model-free RL).

Trajectories are the raw data of RL. Every algorithm collects them, extracts reward signals, and uses them to update the policy.
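Collecting a trajectory can be sketched as follows. Everything here is a hypothetical stand-in chosen for illustration: `sample_start_state` plays the role of $\rho_0$, `dynamics` plays the role of $P$, and `policy` is a fixed stochastic rule rather than a trained network.

```python
import random


def sample_start_state(rng):
    """Start-state distribution rho_0: a hypothetical uniform choice over cells 0-2."""
    return rng.randint(0, 2)


def dynamics(state, action, rng):
    """Environment dynamics P(s' | s, a): the intended move succeeds 80% of the time."""
    move = action if rng.random() < 0.8 else -action
    return min(max(state + move, 0), 5)  # cells are clamped to the range 0-5


def policy(state, rng):
    """Stand-in stochastic policy: step right (+1) with probability 0.7, else left (-1)."""
    return 1 if rng.random() < 0.7 else -1


def collect_trajectory(T, rng):
    """Return tau as a flat list [s_0, a_0, s_1, a_1, ..., s_T, a_T]."""
    tau = []
    s = sample_start_state(rng)
    for _ in range(T + 1):
        a = policy(s, rng)
        tau += [s, a]
        # The agent never inspects P itself; it only sees sampled next states.
        s = dynamics(s, a, rng)
    return tau


rng = random.Random(0)
tau = collect_trajectory(T=4, rng=rng)
states, actions = tau[0::2], tau[1::2]
print(states, actions)
```

In practice the rewards emitted at each step are stored alongside the states and actions, since the update rule needs them.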

What RL Can Do

RL has produced remarkable results in recent years:

Games: AlphaGo/AlphaZero (Go), OpenAI Five (Dota 2), DQN (Atari)
Robotics: Sim-to-real locomotion, robotic dexterous manipulation
Other domains: recommendation systems, chip design, protein folding guidance

Success stories share common features: well-defined reward signals, large amounts of simulator experience, and careful engineering. RL still struggles with sparse rewards and safety constraints, and it is typically far less sample-efficient than supervised learning.

References

Sutton & Barto 2018 — Reinforcement Learning: An Introduction (2nd ed.)

OpenAI Spinning Up — Part 1: Key Concepts in RL
