DNN Policy Learning Theory

Mathematical foundations of policy learning with deep neural networks.


Policy Gradient Methods

Policy Parameterization

Policy $\pi_\theta(a|s)$ parameterized by neural network with weights $\theta$.

Objective Function

Maximize expected return:

$$ J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right] $$

Where $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ is a trajectory.
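
As a small illustration, the discounted return of a single recorded trajectory follows directly from its reward sequence (the rewards below are made up for the example):

# Illustrative reward sequence r_0, ..., r_T from one trajectory (values made up)
rewards = [1.0, 0.0, 0.5, 1.0]
gamma = 0.99

# Single-trajectory estimate of J(theta): sum_t gamma^t * r_t
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))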


Policy Gradient Theorem

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right] $$

Where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the return from time $t$.
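
The returns $G_t$ for every step of a finished episode can be computed in a single backward pass over the rewards; the sketch below assumes a plain Python list of rewards:

def returns_to_go(rewards, gamma=0.99):
    """Compute G_t = sum_{k=t}^{T} gamma^(k-t) * r_k for every step t in one backward pass."""
    G = 0.0
    out = []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]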

Derivation Sketch

$$ \begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \mathbb{E}_{\tau}[R(\tau)] \\ &= \nabla_\theta \int P(\tau|\theta) R(\tau) \, d\tau \\ &= \int \nabla_\theta P(\tau|\theta) R(\tau) \, d\tau \\ &= \int P(\tau|\theta) \nabla_\theta \log P(\tau|\theta) R(\tau) \, d\tau \\ &= \mathbb{E}_{\tau}\left[\nabla_\theta \log P(\tau|\theta) R(\tau)\right] \end{aligned} $$

The fourth line uses the likelihood-ratio identity $\nabla_\theta P(\tau|\theta) = P(\tau|\theta) \nabla_\theta \log P(\tau|\theta)$. Since the environment dynamics do not depend on $\theta$, $\nabla_\theta \log P(\tau|\theta)$ reduces to $\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)$, which yields the per-step form above.


REINFORCE Algorithm

Update Rule

$$ \theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t $$
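
A minimal PyTorch sketch of one REINFORCE update. It assumes a `policy` network mapping states to action probabilities, an `optimizer` over its parameters, and per-episode lists `states`, `actions`, and `returns` holding $s_t$, $a_t$, and $G_t$; all of these names are placeholders, not code from this page:

import torch

# Assumed to exist (placeholders): `policy`, `optimizer`, and per-episode
# lists `states`, `actions`, `returns` collected beforehand.
states_t = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in states])
actions_t = torch.as_tensor(actions)
returns_t = torch.as_tensor(returns, dtype=torch.float32)

probs = policy(states_t)                                        # shape (T, action_dim)
log_probs = torch.log(probs.gather(1, actions_t.unsqueeze(1)).squeeze(1))

# Gradient ascent on J(theta) == gradient descent on the negated objective
loss = -(log_probs * returns_t).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()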

With Baseline

To reduce variance:

$$ \theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (G_t - b(s_t)) $$

Common baseline: $b(s_t) = V(s_t)$ (state value function)
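
Continuing the sketch above, the baseline version only changes the weighting term; `value_net` is an assumed state-value network trained alongside the policy:

import torch.nn.functional as F

# `value_net` is an assumed V_phi(s) network; states_t, returns_t, log_probs
# come from the REINFORCE sketch above.
values = value_net(states_t).squeeze(-1)        # b(s_t) = V(s_t)
advantages = returns_t - values.detach()        # G_t - b(s_t); no gradient through the baseline

actor_loss = -(log_probs * advantages).sum()
baseline_loss = F.mse_loss(values, returns_t)   # fit the baseline to the observed returns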


Actor-Critic Methods

Architecture

  • Actor: Policy network $\pi_\theta(a|s)$
  • Critic: Value network $V_\phi(s)$

Advantage Function

$$ A(s_t, a_t) = Q(s_t, a_t) - V(s_t) $$

One-step TD approximation:

$$ A(s_t, a_t) \approx r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) $$
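
A sketch of this one-step estimate for a single transition, where `V` stands for a hypothetical critic module mapping a state tensor to a scalar value:

import torch

def td_advantage(V, state, reward, next_state, done, gamma=0.99):
    """One-step estimate: r_t + gamma * V(s_{t+1}) - V(s_t)."""
    with torch.no_grad():                       # treat the bootstrap target as a constant
        bootstrap = 0.0 if done else gamma * V(next_state)
    return reward + bootstrap - V(state)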

Update Rules

Actor update:

$$ \theta \leftarrow \theta + \alpha_\theta \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A(s_t, a_t) $$

Critic update:

$$ \phi \leftarrow \phi - \alpha_\phi \nabla_\phi \left(r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\right)^2 $$


A3C (Asynchronous Advantage Actor-Critic)

Multiple workers collect experience in parallel and asynchronously apply gradient updates to shared network parameters.

Advantage Estimation

n-step advantage estimate:

$$ A(s_t, a_t) = \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n V(s_{t+n}) - V(s_t) $$
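
A sketch of the n-step advantage for the first state of a short rollout, again with `V` standing for an assumed critic:

def n_step_advantage(V, states, rewards, gamma=0.99):
    """Advantage for the first state of a rollout.

    states:  [s_t, ..., s_{t+n}]   (n + 1 states)
    rewards: [r_t, ..., r_{t+n-1}] (n rewards)
    V:       assumed value function, callable on a single state
    """
    n = len(rewards)
    n_step_return = sum(gamma**i * r for i, r in enumerate(rewards))
    n_step_return += gamma**n * V(states[-1])
    return n_step_return - V(states[0])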


PPO (Proximal Policy Optimization)

Clipped Objective

$$ L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t\right)\right] $$

Where:

$$ r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} $$

Clipping the probability ratio keeps each update close to the old policy and prevents excessively large policy changes.
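
A sketch of the clipped surrogate as a PyTorch loss, assuming precomputed tensors of new and old log-probabilities and advantage estimates (names are illustrative):

import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Negative clipped surrogate, to be minimized; inputs are assumed 1-D tensors."""
    ratio = torch.exp(log_probs_new - log_probs_old.detach())       # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()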


TRPO (Trust Region Policy Optimization)

Constrained Optimization

$$ \begin{aligned} \max_\theta \quad & \mathbb{E}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t\right] \\ \text{s.t.} \quad & \mathbb{E}_t\left[D_{KL}\left(\pi_{\theta_{old}}(\cdot|s_t) \,\|\, \pi_\theta(\cdot|s_t)\right)\right] \leq \delta \end{aligned} $$

Where $D_{KL}$ is the KL divergence.
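
TRPO solves this constrained problem approximately with a conjugate-gradient step and a backtracking line search; the sketch below only shows how the KL term itself could be estimated for discrete action distributions (tensor names are illustrative):

import torch

def categorical_kl(probs_old, probs_new, eps=1e-8):
    """Mean D_KL(pi_old(.|s) || pi_new(.|s)) over a batch of discrete action distributions."""
    kl = (probs_old * (torch.log(probs_old + eps) - torch.log(probs_new + eps))).sum(dim=-1)
    return kl.mean()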


Deterministic Policy Gradient (DPG)

For continuous action spaces:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\mu}\left[\nabla_\theta \mu_\theta(s) \, \nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(s)}\right] $$

Where $\mu_\theta(s)$ is a deterministic policy.


DDPG (Deep Deterministic Policy Gradient)

Combines DPG with DQN techniques:

Actor update:

$$ \nabla_\theta J \approx \mathbb{E}_{s_t}\left[\nabla_a Q(s,a|\phi)\big|_{s=s_t,\, a=\mu(s_t|\theta)} \, \nabla_\theta \mu(s|\theta)\big|_{s=s_t}\right] $$

Critic update:

$$ L = \mathbb{E}\left[(r + \gamma Q'(s', \mu'(s'|\theta')|\phi') - Q(s,a|\phi))^2\right] $$

Uses target networks (denoted with $'$) and experience replay.
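
A sketch of two of those ingredients, the soft (Polyak) target-network update and the actor objective of maximizing $Q(s, \mu(s))$; `actor` and `critic` are assumed `nn.Module`s, with the critic taking a state batch and an action batch, and `states` is an assumed batch tensor:

import torch

def soft_update(target, source, tau=0.005):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta'."""
    with torch.no_grad():
        for p_target, p in zip(target.parameters(), source.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)

# Actor objective: maximize Q(s, mu(s)), i.e. minimize its negation.
# `actor`, `critic`, and the batch tensor `states` are assumed to exist.
actor_loss = -critic(states, actor(states)).mean()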


Python Implementation (Simple Actor-Critic)

import torch
import torch.nn as nn
import torch.optim as optim

class Actor(nn.Module):
    """Policy network: maps a state to a categorical distribution over actions."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Value network: maps a state to a scalar estimate of V(s)."""
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )

    def forward(self, state):
        return self.net(state)

# Training
# Assumes `env`, `state_dim`, and `action_dim` come from the environment setup,
# with env.reset() -> state and env.step(action) -> (next_state, reward, done).
gamma = 0.99
actor = Actor(state_dim, action_dim)
critic = Critic(state_dim)
actor_optimizer = optim.Adam(actor.parameters(), lr=1e-3)
critic_optimizer = optim.Adam(critic.parameters(), lr=1e-3)

for episode in range(1000):
    state = env.reset()
    done = False

    while not done:
        # Sample an action from the current policy
        probs = actor(torch.FloatTensor(state))
        action = torch.multinomial(probs, 1).item()

        # Take the action in the environment
        next_state, reward, done = env.step(action)

        # One-step TD advantage; the bootstrap target is treated as a constant
        value = critic(torch.FloatTensor(state))
        with torch.no_grad():
            next_value = critic(torch.FloatTensor(next_state))
        advantage = reward + gamma * next_value * (1.0 - float(done)) - value

        # Update critic by gradient descent on the squared TD error
        critic_loss = advantage.pow(2).mean()
        critic_optimizer.zero_grad()
        critic_loss.backward()
        critic_optimizer.step()

        # Update actor with the policy gradient, weighting by the detached advantage
        log_prob = torch.log(probs[action])
        actor_loss = -log_prob * advantage.detach()
        actor_optimizer.zero_grad()
        actor_loss.backward()
        actor_optimizer.step()

        state = next_state
