RL Fundamentals: Agents, Environments, and Policies
- Define the agent-environment interaction loop, distinguishing states from observations and identifying where reward is produced
- Distinguish discrete and continuous action spaces and explain why this distinction drives major differences in algorithm design
- Implement a categorical policy (discrete) and a diagonal Gaussian policy (continuous) in PyTorch using torch.distributions
- Explain the role of log-probability in training stochastic policies and compute it for both categorical and diagonal Gaussian distributions
The Core Loop
Reinforcement learning studies how agents learn to behave by trial and error. Unlike supervised learning — where a teacher provides correct labels — an RL agent receives only a scalar reward signal and must figure out, through exploration, which actions lead to higher cumulative reward.
The fundamental loop is simple:
- The agent observes the environment's current state.
- The agent selects an action.
- The environment transitions to a new state and emits a reward.
- Repeat.
Over many repetitions, the agent adjusts its behavior to maximize the total reward collected.
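The loop above can be sketched in a few lines of plain Python. This is only an illustration of the interaction pattern, not a learning algorithm: CoinFlipEnv, run_episode, and always_heads are hypothetical names invented for this example, and the reset/step interface is an assumption loosely modeled on common RL environment APIs.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a biased coin."""

    def __init__(self, p_heads=0.8, horizon=10, seed=0):
        self.p_heads = p_heads
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the state is just the step counter

    def step(self, action):
        # The environment, not the agent, produces the reward.
        coin = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return self.t, reward, done


def run_episode(env, policy):
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                  # agent selects an action
        state, reward, done = env.step(action)  # environment transitions and emits a reward
        total_reward += reward
    return total_reward


def always_heads(state):
    # A fixed, non-learning policy: always guess heads (1).
    return 1


episode_return = run_episode(CoinFlipEnv(), always_heads)
```

A learning agent would replace always_heads with a policy whose parameters are updated between episodes to raise episode_return.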
States vs. Observations
A state is the complete description of the world — no information is hidden. An observation is what the agent actually sees, which may be a partial view of the state.
- A chess board position is a full state (perfect information).
- A single frame from an Atari game is an observation — the ball's velocity is not visible in one frame alone.
When the agent sees the full state we say the environment is fully observed; otherwise it is partially observed. In practice, most deep RL papers write the state $s$ even when they technically mean the observation $o$, because the policy conditions only on what is observed anyway.
States and observations are represented as real-valued tensors: a robot's state might be a vector of joint angles and velocities; a visual observation might be an RGB tensor of pixel values.
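The gap between state and observation can be made concrete with a toy example. The sketch below assumes a made-up one-dimensional system whose full state is (position, velocity) while the agent observes only the position; step_state and observe are hypothetical helpers, not part of any library.

```python
def step_state(state, dt=0.1):
    """Advance a (position, velocity) state, assuming constant velocity."""
    pos, vel = state
    return (pos + vel * dt, vel)


def observe(state):
    """Return only the position; the velocity stays hidden from the agent."""
    pos, _vel = state
    return pos


dt = 0.1
s0 = (0.0, 2.0)            # full state: position and velocity
s1 = step_state(s0, dt)

o0, o1 = observe(s0), observe(s1)

# A single observation carries no velocity information, but two
# consecutive observations allow a finite-difference estimate.
vel_estimate = (o1 - o0) / dt
```

This is the same reason image-based agents commonly feed several consecutive frames to the policy rather than a single one.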
Action Spaces
The action space is the set of all valid actions. Two major types exist:
Discrete action spaces contain a finite number of distinct actions — e.g., {left, right, fire} in Atari, or one of 361 board positions in Go. Algorithms like DQN are designed for discrete spaces.
Continuous action spaces contain real-valued vectors — e.g., joint torques for a robotic arm or steering/throttle for a simulated car. Algorithms like DDPG, TD3, and SAC are designed for continuous spaces.
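This distinction is not cosmetic. Computing $\max_a Q(s, a)$, the best achievable action-value in state $s$, is trivial when the action space is discrete (evaluate every action and take the largest) but becomes an optimization problem in its own right when $a \in \mathbb{R}^n$, which is why continuous-action algorithms need a different architecture, typically a separate network that outputs the action.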
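To illustrate the difference, assume the quantity being maximized is an action-value function Q(s, a). In the sketch below, the Q-values and the quadratic toy_q function are invented for illustration; real algorithms learn Q from data and usually avoid the inner optimization loop shown here.

```python
import torch

# Discrete case: a Q-network outputs one value per action, so the
# greedy action is a single argmax over the last dimension.
q_values = torch.tensor([[0.1, 1.5, -0.3],
                         [2.0, 0.4, 0.9]])   # shape (batch=2, n_actions=3)
greedy_actions = q_values.argmax(dim=-1)     # one action index per row


# Continuous case: Q takes the action as an input, so finding the best
# action means searching over real numbers.
def toy_q(state, action):
    # Hypothetical Q-function whose maximum lies at action = state / 2.
    return -(action - 0.5 * state) ** 2


state = torch.tensor(1.0)
action = torch.zeros((), requires_grad=True)
optimizer = torch.optim.SGD([action], lr=0.1)
for _ in range(100):
    optimizer.zero_grad()
    loss = -toy_q(state, action)   # ascend Q by descending -Q
    loss.backward()
    optimizer.step()
best_action = action.item()        # approaches 0.5 for this toy Q
```

Running an optimizer like this at every decision step is expensive, which is one motivation for training a network to output the maximizing action directly.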
Policies
A policy is the agent's decision rule — a mapping from state to action. There are two types:
Deterministic policy $a = \mu_\theta(s)$: returns a single action for each state.
Stochastic policy $a \sim \pi_\theta(\cdot \mid s)$: returns a distribution over actions; the agent samples from it.
In deep RL, policies are parameterized by neural network weights $\theta$. Training adjusts these weights to improve performance.
Categorical Policies (Discrete Actions)
For discrete actions, the network outputs one logit per action and a softmax converts them to probabilities:
import torch
import torch.nn as nn
from torch.distributions import Categorical


class CategoricalPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=(64, 64)):
        super().__init__()
        layers = []
        in_dim = obs_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.Tanh()]
            in_dim = h
        # Final layer emits one logit per discrete action.
        layers.append(nn.Linear(in_dim, act_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, obs):
        # Categorical applies the softmax to the logits internally.
        logits = self.net(obs)
        return Categorical(logits=logits)

    def act(self, obs):
        # Sample one action for a single (unbatched) observation.
        with torch.no_grad():
            return self.forward(obs).sample().item()
Two key operations:
- Sampling: dist.sample() draws an action from the policy.
- Log-likelihood: dist.log_prob(a) returns the log-probability of action a, which is needed for computing policy gradient updates.
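The log-probability of action $a$ under a categorical distribution with probability vector $P_\theta(s)$ is obtained by indexing into that vector:
$$\log \pi_\theta(a \mid s) = \log \left[ P_\theta(s) \right]_a$$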
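As a quick numerical check, the categorical log-probability can be computed by hand and compared with what torch.distributions returns. The sketch below uses random logits in place of a trained network; the batch size and number of actions are arbitrary choices for illustration.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)

logits = torch.randn(4, 3)           # batch of 4 states, 3 actions
dist = Categorical(logits=logits)

actions = dist.sample()              # shape (4,), integer action indices
log_probs = dist.log_prob(actions)   # shape (4,)

# Manual version: log-softmax gives the log of the probability vector,
# then gather picks out the entry for each sampled action.
log_p = torch.log_softmax(logits, dim=-1)                    # (4, 3)
manual = log_p.gather(1, actions.unsqueeze(-1)).squeeze(-1)  # (4,)
```

The two results should agree up to floating-point error, which shows that log_prob is just an index into the log-softmax of the logits.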
Diagonal Gaussian Policies (Continuous Actions)
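For continuous actions, the network outputs a mean vector $\mu_\theta(s)$, paired with a log-standard-deviation vector $\log \sigma$ that is either a state-independent learnable parameter or a state-dependent network output. Actions are sampled as $a \sim \mathcal{N}(\mu_\theta(s), \mathrm{diag}(\sigma^2))$: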
import torch
import torch.nn as nn
from torch.distributions import Normal


class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=(64, 64)):
        super().__init__()
        layers = []
        in_dim = obs_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.Tanh()]
            in_dim = h
        self.mean_net = nn.Sequential(*layers, nn.Linear(in_dim, act_dim))
        # Log std as a learnable parameter (not state-dependent)
        self.log_std = nn.Parameter(-0.5 * torch.ones(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        # Exponentiating keeps the standard deviation positive.
        std = self.log_std.exp()
        return Normal(mean, std)

    def act(self, obs):
        # Sampling needs no gradient graph.
        with torch.no_grad():
            return self.forward(obs).sample()
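For a $k$-dimensional diagonal Gaussian with mean $\mu = \mu_\theta(s)$ and std $\sigma$, the log-probability of action $a$ is the sum of per-dimension log-probs:
$$\log \pi_\theta(a \mid s) = -\frac{1}{2} \left( \sum_{i=1}^{k} \left( \frac{(a_i - \mu_i)^2}{\sigma_i^2} + 2 \log \sigma_i \right) + k \log 2\pi \right)$$
In code: dist.log_prob(a).sum(axis=-1) (PyTorch accepts axis as an alias of dim). The .sum(axis=-1) is critical: Normal treats every action dimension as a separate distribution, so without the sum you get a tensor of shape (batch, act_dim) instead of one log-probability per sample with shape (batch,).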
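The shape difference, and the agreement with the closed-form expression for a diagonal Gaussian, can be checked directly. This sketch uses random means rather than a trained network; the batch size and action dimension are arbitrary. Wrapping the distribution in torch.distributions.Independent is shown as one alternative way to get the summed log-probability.

```python
import math

import torch
from torch.distributions import Independent, Normal

torch.manual_seed(0)

batch, act_dim = 5, 2
mean = torch.randn(batch, act_dim)
log_std = -0.5 * torch.ones(act_dim)
std = log_std.exp()

dist = Normal(mean, std)
actions = dist.sample()              # (5, 2)

per_dim = dist.log_prob(actions)     # (5, 2): one value per action dimension
joint = per_dim.sum(dim=-1)          # (5,): log-prob of each full action vector

# Closed-form diagonal Gaussian log-density, written out term by term.
manual = -0.5 * (
    (((actions - mean) / std) ** 2 + 2 * log_std).sum(dim=-1)
    + act_dim * math.log(2 * math.pi)
)

# Independent reinterprets the last batch dimension as an event dimension,
# so its log_prob already sums over action dimensions.
joint_alt = Independent(Normal(mean, std), 1).log_prob(actions)  # (5,)
```

All three of joint, manual, and joint_alt should match up to floating-point error.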
Trajectories
A trajectory (also called a rollout or episode) is a sequence of states and actions:
$$\tau = (s_0, a_0, s_1, a_1, \dots)$$
The first state is sampled from the start-state distribution, $s_0 \sim \rho_0(\cdot)$. Each subsequent state transition follows the environment's dynamics, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, which the agent does not know (and usually doesn't need to know in model-free RL).
Trajectories are the raw data of RL. Every algorithm collects them, extracts reward signals, and uses them to update the policy.
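A minimal sketch of trajectory collection follows, with the start-state distribution and the dynamics written as explicit sampling functions. Everything here is hypothetical: the Trajectory container, the integer-valued state, the 80% move probability, and the reward are invented for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    states: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    rewards: list = field(default_factory=list)


def sample_start_state(rng):
    # Hypothetical start-state distribution: uniform over {-1, 0, 1}.
    return rng.choice([-1, 0, 1])


def dynamics(state, action, rng):
    # Hypothetical stochastic dynamics: the action moves the state,
    # but 20% of the time the move is dropped.
    return state + action if rng.random() < 0.8 else state


def collect(policy, horizon, rng):
    traj = Trajectory()
    state = sample_start_state(rng)
    for _ in range(horizon):
        action = policy(state, rng)
        next_state = dynamics(state, action, rng)
        reward = -abs(next_state)    # made-up reward: stay near zero
        traj.states.append(state)
        traj.actions.append(action)
        traj.rewards.append(reward)
        state = next_state
    traj.states.append(state)        # final state reached after the last action
    return traj


def random_policy(state, rng):
    # Ignores the state and steps left or right with equal probability.
    return rng.choice([-1, 1])


rng = random.Random(0)
traj = collect(random_policy, horizon=5, rng=rng)
```

Note that collect only ever calls dynamics as a black box, which mirrors the model-free setting: the agent sees sampled transitions but never the transition probabilities themselves.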
What RL Can Do
RL has produced remarkable results in recent years:
- Games: AlphaGo/AlphaZero (Go), OpenAI Five (Dota 2), DQN (Atari)
- Robotics: Sim-to-real locomotion, robotic dexterous manipulation
- Recommendation systems, chip design, protein folding guidance
Success stories share common features: well-defined reward signals, large amounts of simulator experience, and careful engineering. RL still struggles with sparse rewards and safety constraints, and it is typically far less sample-efficient than supervised learning.