Q-Learning Theory
Q-Learning algorithm theory with mathematical foundations.
Markov Decision Process (MDP)
An MDP is defined by the tuple $(S, A, P, R, \gamma)$:
- $S$: Set of states
- $A$: Set of actions
- $P$: Transition probability $P(s'|s,a)$
- $R$: Reward function $R(s,a,s')$
- $\gamma \in [0,1]$: Discount factor
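As a concrete illustration, the tuple can be written out explicitly for a small problem. The sketch below is a hypothetical two-state, two-action MDP (all names and numbers are made up for illustration), storing $P$ and $R$ as NumPy arrays indexed by $(s, a, s')$:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, purely for illustration.
n_states, n_actions = 2, 2
gamma = 0.9  # discount factor

# P[s, a, s'] = probability of landing in s' after taking action a in state s
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.8, 0.2]
P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]
P[1, 1] = [0.0, 1.0]

# R[s, a, s'] = immediate reward for the transition (s, a) -> s'
R = np.zeros((n_states, n_actions, n_states))
R[:, :, 1] = 1.0  # reaching state 1 pays a reward of 1
```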
Value Functions
State Value Function
Expected return when starting from state $s$ and following policy $\pi$:
$$ V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_0 = s\right] $$
Action Value Function (Q-Function)
Expected return when starting from state $s$, taking action $a$, and following policy $\pi$ thereafter:
$$ Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_0 = s, A_0 = a\right] $$
Bellman Equations
Bellman Expectation Equation
$$ V^\pi(s) = \sum_{a \in A} \pi(a|s) \sum_{s' \in S} P(s'|s,a)[R(s,a,s') + \gamma V^\pi(s')] $$
$$ Q^\pi(s,a) = \sum_{s' \in S} P(s'|s,a)[R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a')] $$
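When the model is known, the expectation equation can be applied directly as an iterative update (policy evaluation): sweep over all states, replace $V^\pi$ with the right-hand side, and repeat until the values stop changing. A minimal sketch, assuming the tabular `P`, `R`, and `gamma` from the toy MDP above and a policy given as a `(n_states, n_actions)` probability table:

```python
import numpy as np

def policy_evaluation(P, R, gamma, policy, tol=1e-8):
    """Iteratively apply the Bellman expectation equation for V^pi."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a in range(n_actions):
                # pi(a|s) * sum_s' P(s'|s,a) [R(s,a,s') + gamma * V(s')]
                V_new[s] += policy[s, a] * np.sum(P[s, a] * (R[s, a] + gamma * V))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Example: evaluate a uniform random policy on the toy MDP defined earlier
uniform = np.full((2, 2), 0.5)
V_pi = policy_evaluation(P, R, gamma, uniform)
```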
Bellman Optimality Equation
$$ V^*(s) = \max_{a \in A} \sum_{s' \in S} P(s'|s,a)[R(s,a,s') + \gamma V^*(s')] $$
$$ Q^*(s,a) = \sum_{s' \in S} P(s'|s,a)[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')] $$
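With a known model, the optimality equation gives value iteration: repeatedly apply the max-backup until the values converge. A minimal sketch for $Q^*$, again assuming the tabular `P`, `R`, and `gamma` from the toy MDP above:

```python
import numpy as np

def q_value_iteration(P, R, gamma, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup for Q^*."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        # Q_new(s,a) = sum_s' P(s'|s,a) [R(s,a,s') + gamma * max_a' Q(s',a')]
        Q_new = np.einsum('sap,sap->sa', P, R + gamma * Q.max(axis=1))
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
```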
Q-Learning Algorithm
Update Rule
$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right] $$
Where:
- $\alpha \in (0,1]$: Learning rate
- $\gamma \in [0,1]$: Discount factor
- $r_{t+1}$: Immediate reward
- $s_t, a_t$: Current state and action
- $s_{t+1}$: Next state
TD Error
The temporal difference error:
$$ \delta_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) $$
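As a quick worked example with made-up numbers: take $\alpha = 0.1$, $\gamma = 0.9$, $r_{t+1} = 1$, $Q(s_t, a_t) = 2$, and $\max_{a'} Q(s_{t+1}, a') = 3$. Then
$$ \delta_t = 1 + 0.9 \cdot 3 - 2 = 1.7, \qquad Q(s_t, a_t) \leftarrow 2 + 0.1 \cdot 1.7 = 2.17 $$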
Exploration vs Exploitation
ε-Greedy Policy
$$ \pi(a|s) = \begin{cases} 1 - \epsilon + \frac{\epsilon}{|A|} & \text{if } a = \arg\max_{a'} Q(s,a') \\ \frac{\epsilon}{|A|} & \text{otherwise} \end{cases} $$
Boltzmann Exploration (Softmax)
$$ \pi(a|s) = \frac{\exp(Q(s,a)/\tau)}{\sum_{a'} \exp(Q(s,a')/\tau)} $$
Where $\tau$ is the temperature parameter.
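A minimal NumPy sketch of Boltzmann action selection (the temperature value and Q-table layout are illustrative assumptions; subtracting the maximum before exponentiating is only for numerical stability):

```python
import numpy as np

def boltzmann_action(Q, state, tau=1.0):
    """Sample an action with probability proportional to exp(Q(s,a)/tau)."""
    prefs = Q[state] / tau
    prefs = prefs - prefs.max()          # numerical stability; distribution is unchanged
    probs = np.exp(prefs) / np.sum(np.exp(prefs))
    return np.random.choice(len(probs), p=probs)
```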
Convergence Conditions
Q-Learning converges to $Q^*$ if:
- All state-action pairs are visited infinitely often
- Learning rate satisfies: $$ \sum_{t=0}^{\infty} \alpha_t = \infty, \quad \sum_{t=0}^{\infty} \alpha_t^2 < \infty $$
- Example: $\alpha_t = \frac{1}{1+t}$
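In a tabular implementation these conditions are often met by decaying $\alpha$ with the number of visits to each state-action pair. The sketch below is a hypothetical helper (not part of the class shown next) using $\alpha_t = 1/(1 + n(s,a))$, which satisfies both sums:

```python
import numpy as np

n_states, n_actions = 100, 4
visits = np.zeros((n_states, n_actions))  # n(s, a): per-pair visit counts

def learning_rate(state, action):
    """Per-(s,a) step size alpha_t = 1/(1 + n(s,a))."""
    alpha = 1.0 / (1.0 + visits[state, action])
    visits[state, action] += 1
    return alpha
```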
Python Implementation
```python
import numpy as np

class QLearning:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration rate

    def choose_action(self, state):
        # ε-greedy: explore with probability epsilon, otherwise act greedily
        if np.random.random() < self.epsilon:
            return np.random.randint(self.Q.shape[1])
        return np.argmax(self.Q[state])

    def update(self, state, action, reward, next_state, done=False):
        # Q-Learning update: bootstrap with the greedy value of the next state,
        # dropping the bootstrap term on terminal transitions
        td_target = reward + (0.0 if done else self.gamma * np.max(self.Q[next_state]))
        td_error = td_target - self.Q[state, action]
        self.Q[state, action] += self.alpha * td_error
        return td_error

# Training loop (env is assumed to expose reset() and step(action) -> (next_state, reward, done))
agent = QLearning(n_states=100, n_actions=4)

for episode in range(1000):
    state = env.reset()
    done = False

    while not done:
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.update(state, action, reward, next_state, done)
        state = next_state
```
Double Q-Learning
To reduce the overestimation bias that arises when the same Q-estimates are used both to select and to evaluate the maximizing action, maintain two tables $Q_1$ and $Q_2$:
$$ Q_1(s_t, a_t) \leftarrow Q_1(s_t, a_t) + \alpha \left[r_{t+1} + \gamma Q_2(s_{t+1}, \arg\max_{a'} Q_1(s_{t+1}, a')) - Q_1(s_t, a_t)\right] $$
Randomly switch between updating $Q_1$ and $Q_2$.
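A minimal sketch of the tabular double update (class and parameter names mirror the `QLearning` example above but are otherwise illustrative):

```python
import numpy as np

class DoubleQLearning:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99):
        self.Q1 = np.zeros((n_states, n_actions))
        self.Q2 = np.zeros((n_states, n_actions))
        self.alpha, self.gamma = alpha, gamma

    def update(self, state, action, reward, next_state, done=False):
        # Flip a coin: one table selects the greedy next action, the other evaluates it
        if np.random.random() < 0.5:
            Q_sel, Q_eval = self.Q1, self.Q2
        else:
            Q_sel, Q_eval = self.Q2, self.Q1
        a_star = np.argmax(Q_sel[next_state])
        target = reward + (0.0 if done else self.gamma * Q_eval[next_state, a_star])
        Q_sel[state, action] += self.alpha * (target - Q_sel[state, action])
```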
SARSA (On-Policy Alternative)
$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right] $$
Key difference: SARSA uses the next action $a_{t+1}$ actually selected by the current policy instead of $\max_{a'} Q(s_{t+1}, a')$, which makes it on-policy.
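A sketch of the corresponding update (the Q-table layout and hyperparameter values mirror the earlier examples and are assumptions):

```python
def sarsa_update(Q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.99, done=False):
    """SARSA: bootstrap with the action the behaviour policy actually takes next."""
    target = reward + (0.0 if done else gamma * Q[next_state, next_action])
    Q[state, action] += alpha * (target - Q[state, action])
```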