Q-Learning Theory

Theory of the Q-Learning algorithm and its mathematical foundations.


Markov Decision Process (MDP)

An MDP is defined by the tuple $(S, A, P, R, \gamma)$:

  • $S$: Set of states
  • $A$: Set of actions
  • $P$: Transition probability $P(s'|s,a)$
  • $R$: Reward function $R(s,a,s')$
  • $\gamma \in [0,1]$: Discount factor
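
A small tabular MDP can be written down directly as these objects. The sketch below is a hypothetical two-state, two-action example; the state indices, probabilities, and rewards are made up purely for illustration:

import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions (all numbers are illustrative).
# P[s, a, s'] = transition probability, R[s, a, s'] = reward.
n_states, n_actions = 2, 2
P = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions, n_states))

P[0, 0] = [0.9, 0.1]   # in state 0, action 0 mostly stays in state 0
P[0, 1] = [0.2, 0.8]   # action 1 usually moves to state 1
P[1, 0] = [1.0, 0.0]
P[1, 1] = [0.0, 1.0]
R[0, 1, 1] = 1.0       # reward for reaching state 1 via action 1
gamma = 0.9            # discount factor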

Value Functions

State Value Function

Expected discounted return starting from state $s$ and following policy $\pi$:

$$ V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_0 = s\right] $$

Action Value Function (Q-Function)

Expected discounted return starting from state $s$, taking action $a$, and thereafter following policy $\pi$:

$$ Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_0 = s, A_0 = a\right] $$
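
The two value functions are related: $V^\pi$ averages $Q^\pi$ over the policy's action choices, while $Q^\pi$ looks one step ahead and then follows $V^\pi$:

$$ V^\pi(s) = \sum_{a \in A} \pi(a|s)\, Q^\pi(s,a), \qquad Q^\pi(s,a) = \sum_{s' \in S} P(s'|s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right] $$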


Bellman Equations

Bellman Expectation Equation

$$ V^\pi(s) = \sum_{a \in A} \pi(a|s) \sum_{s' \in S} P(s'|s,a)[R(s,a,s') + \gamma V^\pi(s')] $$

$$ Q^\pi(s,a) = \sum_{s' \in S} P(s'|s,a)[R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a')] $$
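
As a sketch of how the expectation equation is used in practice, iterative policy evaluation repeatedly applies the $V^\pi$ backup until it stops changing. The snippet below assumes tabular arrays `P[s, a, s']`, `R[s, a, s']`, and a policy `pi[s, a]` (hypothetical names, matching the toy MDP sketch above):

import numpy as np

def policy_evaluation(P, R, pi, gamma=0.99, tol=1e-8):
    """Iteratively apply the Bellman expectation backup for V^pi."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # One sweep: V(s) <- sum_a pi(a|s) sum_s' P(s'|s,a)[R(s,a,s') + gamma V(s')]
        V_new = np.einsum('sa,sap,sap->s', pi, P, R + gamma * V[None, None, :])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new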

Bellman Optimality Equation

$$ V^*(s) = \max_{a \in A} \sum_{s' \in S} P(s'|s,a)[R(s,a,s') + \gamma V^*(s')] $$

$$ Q^*(s,a) = \sum_{s' \in S} P(s'|s,a)[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')] $$
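
The optimality equation suggests a fixed-point iteration: repeatedly apply the right-hand side as an update until $Q$ converges (value iteration in Q-form). A minimal sketch, again assuming tabular `P` and `R` arrays as in the toy MDP above:

import numpy as np

def q_value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Apply the Bellman optimality backup for Q* until convergence."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        # Q(s,a) <- sum_s' P(s'|s,a)[R(s,a,s') + gamma max_a' Q(s',a')]
        Q_new = np.einsum('sap,sap->sa', P, R + gamma * Q.max(axis=1)[None, None, :])
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new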


Q-Learning Algorithm

Update Rule

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right] $$

Where:

  • $\alpha \in (0,1]$: Learning rate
  • $\gamma \in [0,1]$: Discount factor
  • $r_{t+1}$: Immediate reward
  • $s_t, a_t$: Current state and action
  • $s_{t+1}$: Next state
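
As a purely illustrative example of one update: suppose $\alpha = 0.1$, $\gamma = 0.9$, the current estimate is $Q(s_t, a_t) = 2.0$, the observed reward is $r_{t+1} = 1$, and $\max_{a'} Q(s_{t+1}, a') = 5$. The bracketed term is $1 + 0.9 \cdot 5 - 2.0 = 3.5$, so the updated value is $Q(s_t, a_t) = 2.0 + 0.1 \cdot 3.5 = 2.35$.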

TD Error

The temporal difference error:

$$ \delta_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) $$


Exploration vs Exploitation

ε-Greedy Policy

$$ \pi(a|s) = \begin{cases} 1 - \epsilon + \frac{\epsilon}{|A|} & \text{if } a = \arg\max_{a'} Q(s,a') \\ \frac{\epsilon}{|A|} & \text{otherwise} \end{cases} $$

Boltzmann Exploration (Softmax)

$$ \pi(a|s) = \frac{\exp(Q(s,a)/\tau)}{\sum_{a'} \exp(Q(s,a')/\tau)} $$

Where $\tau$ is the temperature parameter.
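
A numerically stable way to sample from this distribution (a hypothetical helper, not part of the class below):

import numpy as np

def boltzmann_action(q_values, tau=1.0):
    """Sample an action from the softmax distribution over Q-values."""
    # Subtracting the max before exponentiating avoids overflow and does
    # not change the resulting probabilities.
    prefs = (q_values - np.max(q_values)) / tau
    probs = np.exp(prefs) / np.sum(np.exp(prefs))
    return np.random.choice(len(q_values), p=probs)

As $\tau \to 0$ the policy approaches greedy action selection; as $\tau \to \infty$ it approaches a uniform random policy.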


Convergence Conditions

Q-Learning converges to $Q^*$ with probability 1, provided that:

  1. All state-action pairs are visited infinitely often
  2. The learning rate satisfies: $$ \sum_{t=0}^{\infty} \alpha_t = \infty, \quad \sum_{t=0}^{\infty} \alpha_t^2 < \infty $$ For example, $\alpha_t = \frac{1}{1+t}$ satisfies both conditions.
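
In tabular practice, condition 2 is often met by decaying the learning rate with the number of visits to each state-action pair. A minimal sketch of that bookkeeping (this counting scheme is one common choice, not the only one):

import numpy as np

class VisitCountLR:
    """Per-(state, action) learning rate alpha = 1 / (1 + N(s, a))."""
    def __init__(self, n_states, n_actions):
        self.counts = np.zeros((n_states, n_actions), dtype=int)

    def alpha(self, state, action):
        self.counts[state, action] += 1
        # The sequence 1/2, 1/3, ... diverges in sum but its squares converge,
        # so it satisfies the two conditions above for each pair.
        return 1.0 / (1.0 + self.counts[state, action])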

Python Implementation

import numpy as np

class QLearning:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration rate

    def choose_action(self, state):
        # epsilon-greedy: explore with probability epsilon, otherwise act greedily
        if np.random.random() < self.epsilon:
            return np.random.randint(self.Q.shape[1])
        return np.argmax(self.Q[state])

    def update(self, state, action, reward, next_state, done=False):
        # Q-Learning update: bootstrap from the greedy value of the next state,
        # but do not bootstrap past a terminal state
        td_target = reward + (0.0 if done else self.gamma * np.max(self.Q[next_state]))
        td_error = td_target - self.Q[state, action]
        self.Q[state, action] += self.alpha * td_error
        return td_error

# Training loop; env is assumed to expose reset() -> state and
# step(action) -> (next_state, reward, done)
agent = QLearning(n_states=100, n_actions=4)

for episode in range(1000):
    state = env.reset()
    done = False

    while not done:
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.update(state, action, reward, next_state, done)
        state = next_state

Double Q-Learning

To address overestimation bias:

$$ Q_1(s_t, a_t) \leftarrow Q_1(s_t, a_t) + \alpha \left[r_{t+1} + \gamma Q_2(s_{t+1}, \arg\max_{a'} Q_1(s_{t+1}, a')) - Q_1(s_t, a_t)\right] $$

Randomly switch between updating $Q_1$ and $Q_2$.
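
A minimal sketch of this update as a standalone function; the coin flip decides which table plays the role of $Q_1$, and the names are illustrative:

import numpy as np

def double_q_update(Q1, Q2, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """One Double Q-Learning step; call with (Q1, Q2) or (Q2, Q1) at random."""
    # Q1 selects the greedy next action, Q2 evaluates it.
    best_next = np.argmax(Q1[next_state])
    td_target = reward + gamma * Q2[next_state, best_next]
    Q1[state, action] += alpha * (td_target - Q1[state, action])

# Usage: flip a coin each step to decide which table gets updated.
# if np.random.random() < 0.5:
#     double_q_update(Q1, Q2, s, a, r, s_next)
# else:
#     double_q_update(Q2, Q1, s, a, r, s_next)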


SARSA (On-Policy Alternative)

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right] $$

Key difference: Uses actual next action $a_{t+1}$ instead of $\max_{a'} Q(s_{t+1}, a')$.
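
A sketch of the corresponding update, which requires the next action $a_{t+1}$ to have already been chosen by the behavior policy (illustrative function, mirroring the QLearning class above):

def sarsa_update(Q, state, action, reward, next_state, next_action, alpha=0.1, gamma=0.99):
    """On-policy TD update: bootstrap from the action actually taken next."""
    td_error = reward + gamma * Q[next_state, next_action] - Q[state, action]
    Q[state, action] += alpha * td_error
    return td_error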

