Entropy & Information Measures
Shannon Entropy
Average information content:
$$ H(X) = -\sum_i p(x_i) \log_2 p(x_i) $$
Units: bits (if log base 2), nats (if natural log)
```python
import numpy as np

def entropy(probabilities):
    """Calculate Shannon entropy"""
    p = np.array(probabilities)
    p = p[p > 0]  # Remove zeros
    return -np.sum(p * np.log2(p))

# Example: fair coin
p_coin = [0.5, 0.5]
H = entropy(p_coin)
print(f"Entropy: {H:.3f} bits")  # 1.000 bits
```
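A quick follow-up sketch, reusing the `entropy` helper above: a biased coin carries less than one bit, and multiplying by $\ln 2$ converts bits to nats.

```python
# Biased coin: entropy drops below 1 bit
p_biased = [0.9, 0.1]
H_bits = entropy(p_biased)
H_nats = H_bits * np.log(2)  # bits -> nats: multiply by ln 2
print(f"{H_bits:.3f} bits = {H_nats:.3f} nats")  # 0.469 bits = 0.325 nats
```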
Cross-Entropy
$$ H(p, q) = -\sum_i p(x_i) \log q(x_i) $$
Used in machine learning loss functions.
```python
def cross_entropy(p, q):
    """Cross-entropy between distributions p and q"""
    p = np.array(p)
    q = np.array(q)
    q = np.clip(q, 1e-10, 1)  # Avoid log(0)
    return -np.sum(p * np.log2(q))
```
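As an illustration of the loss-function use, the sketch below applies `cross_entropy` to a hypothetical 3-class prediction with a one-hot true distribution (the labels and probabilities are made up). Note that most ML libraries compute this loss with the natural log (nats rather than bits), so their values differ from these base-2 values by a constant factor of $\ln 2$.

```python
# Hypothetical classification example: one-hot true label vs. model prediction
p_true = [0, 1, 0]          # true class is index 1
q_pred = [0.1, 0.7, 0.2]    # model's predicted distribution (e.g. softmax output)
loss = cross_entropy(p_true, q_pred)
print(f"Cross-entropy loss: {loss:.3f} bits")  # -log2(0.7) ≈ 0.515
```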
KL Divergence
Measures "distance" between distributions:
$$ D_{KL}(p \| q) = \sum_i p(x_i) \log \frac{p(x_i)}{q(x_i)} $$
```python
def kl_divergence(p, q):
    """KL divergence D_KL(p || q)"""
    p = np.array(p, dtype=float)
    q = np.array(q, dtype=float)
    mask = p > 0                     # terms with p(x) = 0 contribute 0
    q = np.clip(q[mask], 1e-10, 1)   # avoid division by zero / log(0)
    return np.sum(p[mask] * np.log2(p[mask] / q))

# Relationship: H(p,q) = H(p) + D_KL(p||q)
```
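To make the relationship concrete, here is a small numerical check using the helpers defined above (the two distributions are just illustrative):

```python
# Verify H(p, q) = H(p) + D_KL(p || q) numerically
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(f"H(p,q) = {lhs:.4f}, H(p) + D_KL(p||q) = {rhs:.4f}")  # both ≈ 1.5219
```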
Properties
- $H(X) \geq 0$ (non-negative)
- $H(X) \leq \log_2 n$ for $n$ possible outcomes (maximum achieved by the uniform distribution)
- $D_{KL}(p \| q) \geq 0$ (non-negative)
- $D_{KL}(p \| q) \neq D_{KL}(q \| p)$ (not symmetric; see the check below)
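A brief numerical check of the uniform-maximum and asymmetry properties, reusing the helpers above (the distributions are illustrative):

```python
# Uniform distribution over 4 outcomes attains the maximum log2(4) = 2 bits
uniform = [0.25, 0.25, 0.25, 0.25]
print(entropy(uniform), np.log2(4))  # 2.0 2.0

# KL divergence is not symmetric
p = [0.8, 0.2]
q = [0.5, 0.5]
print(kl_divergence(p, q), kl_divergence(q, p))  # ≈ 0.278 vs ≈ 0.322
```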
Related Snippets
- Channel Capacity: Shannon's theorem and noisy channels
- Data Compression: Lossy vs lossless compression, Huffman coding
- Information Theory Basics: Fundamental concepts of information theory
- Mutual Information: Measuring dependence between variables