DNN Policy Learning Theory
Deep Neural Network policy learning with mathematical foundations.
Policy Gradient Methods
Policy Parameterization
Policy $\pi_\theta(a|s)$ parameterized by neural network with weights $\theta$.
Objective Function
Maximize expected return:
$$ J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right] $$
Where $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ is a trajectory.
Policy Gradient Theorem
$$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right] $$
Where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the return from time $t$.
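As a concrete sketch (plain Python, no framework assumed), the returns $G_t$ can be computed from a recorded reward sequence by accumulating backwards:

```python
def returns_to_go(rewards, gamma=0.99):
    """Compute G_t = sum_{k=t}^{T} gamma^(k-t) * r_k for every timestep t."""
    G, out = 0.0, []
    for r in reversed(rewards):  # accumulate from the end of the episode
        G = r + gamma * G
        out.append(G)
    return out[::-1]

# Example with gamma = 0.5: G_2 = 3, G_1 = 2 + 0.5*3 = 3.5, G_0 = 1 + 0.5*3.5 = 2.75
print(returns_to_go([1.0, 2.0, 3.0], gamma=0.5))  # [2.75, 3.5, 3.0]
```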
Derivation Sketch
$$ \begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \mathbb{E}_{\tau}[R(\tau)] \\ &= \nabla_\theta \int P(\tau|\theta) R(\tau) \, d\tau \\ &= \int \nabla_\theta P(\tau|\theta) R(\tau) \, d\tau \\ &= \int P(\tau|\theta) \nabla_\theta \log P(\tau|\theta) R(\tau) \, d\tau \\ &= \mathbb{E}_{\tau}\left[\nabla_\theta \log P(\tau|\theta) R(\tau)\right] \end{aligned} $$
The fourth line uses the log-derivative identity $\nabla_\theta P = P \nabla_\theta \log P$. Since the environment dynamics $P(s_{t+1}|s_t, a_t)$ do not depend on $\theta$, $\nabla_\theta \log P(\tau|\theta)$ reduces to $\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)$.
REINFORCE Algorithm
Update Rule
$$ \theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t $$
With Baseline
To reduce variance:
$$ \theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (G_t - b(s_t)) $$
Common baseline: the state-value function $b(s_t) = V(s_t)$. Subtracting a state-dependent baseline leaves the gradient unbiased, since $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = 0$.
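A minimal PyTorch sketch of the resulting update as a loss to minimize (hypothetical names: `policy` maps states to action probabilities, `returns` holds $G_t$, `baseline` holds $b(s_t)$):

```python
import torch

def reinforce_loss(policy, states, actions, returns, baseline=None):
    """REINFORCE surrogate loss; minimizing it performs gradient ascent on J."""
    probs = policy(states)                                        # (T, action_dim)
    log_probs = torch.log(probs.gather(1, actions.unsqueeze(1))).squeeze(1)
    advantage = returns - baseline if baseline is not None else returns
    return -(log_probs * advantage.detach()).mean()
```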
Actor-Critic Methods
Architecture
- Actor: Policy network $\pi_\theta(a|s)$
- Critic: Value network $V_\phi(s)$
Advantage Function
$$ A(s_t, a_t) = Q(s_t, a_t) - V(s_t) $$
One-step temporal-difference (TD) approximation:
$$ A(s_t, a_t) \approx r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) $$
Update Rules
Actor update:
$$ \theta \leftarrow \theta + \alpha_\theta \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A(s_t, a_t) $$
Critic update:
$$ \phi \leftarrow \phi - \alpha_\phi \nabla_\phi \left(r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\right)^2 $$
A3C (Asynchronous Advantage Actor-Critic)
Multiple worker agents explore separate environment instances in parallel and asynchronously apply their gradients to shared parameters, which decorrelates updates without a replay buffer; see the sketch below.
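A rough sketch of the asynchronous update, following the common PyTorch A3C pattern (the rollout/loss helper `rollout_and_loss` and the dimension constants are hypothetical; the shared model must live in shared memory before workers start):

```python
import torch.multiprocessing as mp

def worker(shared_model, shared_optimizer):
    local_model = Actor(STATE_DIM, ACTION_DIM)  # private copy per worker process
    while True:
        local_model.load_state_dict(shared_model.state_dict())  # pull latest weights
        loss = rollout_and_loss(local_model)  # hypothetical: run env, build the loss
        local_model.zero_grad()
        loss.backward()
        # Push local gradients into the shared parameters, then take one step.
        for lp, sp in zip(local_model.parameters(), shared_model.parameters()):
            sp._grad = lp.grad
        shared_optimizer.step()

# Setup (sketch): call shared_model.share_memory(), then spawn workers with
# mp.Process(target=worker, args=(shared_model, shared_optimizer)).
```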
Advantage Estimation
n-step return:
$$ A(s_t, a_t) = \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n V(s_{t+n}) - V(s_t) $$
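As a sketch, the n-step advantage for the first state of a length-$n$ rollout segment (scalar values assumed for simplicity):

```python
def n_step_advantage(rewards, value, bootstrap_value, gamma=0.99):
    """rewards = [r_t, ..., r_{t+n-1}], value = V(s_t), bootstrap_value = V(s_{t+n})."""
    n = len(rewards)
    n_step_return = sum(gamma**i * r for i, r in enumerate(rewards))
    n_step_return += gamma**n * bootstrap_value   # bootstrap from the critic
    return n_step_return - value
```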
PPO (Proximal Policy Optimization)
Clipped Objective
$$ L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t\right)\right] $$
Where:
$$ r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} $$
Clipping the probability ratio removes the incentive to move $r_t(\theta)$ outside $[1-\epsilon, 1+\epsilon]$, preventing destructively large policy updates.
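A minimal PyTorch sketch of the clipped surrogate (log-probabilities and advantages are assumed to be precomputed tensors over a batch of timesteps):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Returns a loss whose minimization maximizes L^CLIP."""
    ratio = torch.exp(new_log_probs - old_log_probs.detach())     # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```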
TRPO (Trust Region Policy Optimization)
Constrained Optimization
$$ \begin{aligned} \max_\theta \quad & \mathbb{E}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t\right] \\ \text{s.t.} \quad & \mathbb{E}_t\left[D_{KL}\big(\pi_{\theta_{old}}(\cdot|s_t) \,\|\, \pi_\theta(\cdot|s_t)\big)\right] \leq \delta \end{aligned} $$
Where $D_{KL}$ is the KL divergence.
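For a discrete action space, the constraint term can be estimated directly from the two policies' action probabilities; a sketch:

```python
import torch

def mean_kl(old_probs, new_probs, eps=1e-8):
    """E_t[ D_KL(pi_old(.|s_t) || pi_theta(.|s_t)) ] for categorical policies."""
    kl = (old_probs * (torch.log(old_probs + eps) - torch.log(new_probs + eps))).sum(-1)
    return kl.mean()  # TRPO keeps this <= delta via a trust-region step
```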
Deterministic Policy Gradient (DPG)
For continuous action spaces:
$$ \nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\pi}\left[\nabla_\theta \mu_\theta(s) \, \nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(s)}\right] $$
Where $\mu_\theta(s)$ is a deterministic policy.
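In practice this gradient need not be assembled by hand: backpropagating through the critic realizes the chain rule automatically. A sketch, assuming `mu` (actor) and `q` (critic) are `nn.Module`s and `states` is a batch tensor:

```python
# Maximizing Q(s, mu(s)) ascends the deterministic policy gradient;
# autograd composes grad_a Q with grad_theta mu for us.
actor_loss = -q(states, mu(states)).mean()
actor_loss.backward()
```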
DDPG (Deep Deterministic Policy Gradient)
Combines DPG with DQN techniques:
Actor update:
$$ \nabla_\theta J \approx \mathbb{E}_{s_t}\left[\nabla_a Q(s,a|\phi)\big|_{s=s_t,\, a=\mu(s_t|\theta)} \, \nabla_\theta \mu(s|\theta)\big|_{s=s_t}\right] $$
Critic update:
$$ L = \mathbb{E}\left[(r + \gamma Q'(s', \mu'(s'|\theta')|\phi') - Q(s,a|\phi))^2\right] $$
Uses target networks (denoted with $'$) and experience replay.
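The target networks are typically tracked with Polyak averaging rather than hard copies; a sketch:

```python
import torch

def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    with torch.no_grad():
        for tp, op in zip(target_net.parameters(), online_net.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * op.data)
```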
Python Implementation (Simple Actor-Critic)
```python
import torch
import torch.nn as nn
import torch.optim as optim

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Softmax(dim=-1),
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state):
        return self.net(state)

# Training (env, state_dim, action_dim are assumed to come from your environment)
gamma = 0.99
actor = Actor(state_dim, action_dim)
critic = Critic(state_dim)
actor_optimizer = optim.Adam(actor.parameters(), lr=1e-3)
critic_optimizer = optim.Adam(critic.parameters(), lr=1e-3)

for episode in range(1000):
    state = env.reset()
    done = False

    while not done:
        # Sample an action from the current policy
        probs = actor(torch.FloatTensor(state))
        action = torch.multinomial(probs, 1).item()

        # Take action (classic gym API; gymnasium also returns a truncated flag and info)
        next_state, reward, done = env.step(action)

        # One-step TD advantage; the target is held constant (semi-gradient update)
        value = critic(torch.FloatTensor(state))
        with torch.no_grad():
            next_value = critic(torch.FloatTensor(next_state))
            target = reward + gamma * next_value * (1 - done)
        advantage = target - value

        # Update critic on the squared TD error
        critic_loss = advantage.pow(2)
        critic_optimizer.zero_grad()
        critic_loss.backward()
        critic_optimizer.step()

        # Update actor: ascend log-prob weighted by the (detached) advantage
        log_prob = torch.log(probs[action])
        actor_loss = -log_prob * advantage.detach()
        actor_optimizer.zero_grad()
        actor_loss.backward()
        actor_optimizer.step()

        state = next_state
```