Entropy & Information Measures
Shannon Entropy
Average information content:
$$ H(X) = -\sum_i p(x_i) \log_2 p(x_i) $$
Units: bits (if log base 2), nats (if natural log)
```python
import numpy as np

def entropy(probabilities):
    """Calculate Shannon entropy"""
    p = np.array(probabilities)
    p = p[p > 0]  # Remove zeros
    return -np.sum(p * np.log2(p))

# Example: fair coin
p_coin = [0.5, 0.5]
H = entropy(p_coin)
print(f"Entropy: {H:.3f} bits")  # 1.000 bits
```
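A quick follow-up sketch, reusing the `entropy` helper above: a biased coin carries less than one bit, and multiplying by $\ln 2$ converts bits to nats.

```python
# Biased coin: entropy drops below 1 bit
p_biased = [0.9, 0.1]
H_bits = entropy(p_biased)
H_nats = H_bits * np.log(2)  # bits -> nats: multiply by ln 2
print(f"{H_bits:.3f} bits = {H_nats:.3f} nats")  # 0.469 bits = 0.325 nats
```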
Cross-Entropy
$$ H(p, q) = -\sum_i p(x_i) \log q(x_i) $$
Used in machine learning loss functions.
```python
def cross_entropy(p, q):
    """Cross-entropy between distributions p and q"""
    p = np.array(p)
    q = np.array(q)
    q = np.clip(q, 1e-10, 1)  # Avoid log(0)
    return -np.sum(p * np.log2(q))
```
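As an illustration of the loss-function use, the sketch below applies `cross_entropy` to a hypothetical 3-class prediction with a one-hot true distribution (the labels and probabilities are made up). Note that most ML libraries compute this loss with the natural log (nats rather than bits), so their values differ from these base-2 values by a constant factor of $\ln 2$.

```python
# Hypothetical classification example: one-hot true label vs. model prediction
p_true = [0, 1, 0]          # true class is index 1
q_pred = [0.1, 0.7, 0.2]    # model's predicted distribution (e.g. softmax output)
loss = cross_entropy(p_true, q_pred)
print(f"Cross-entropy loss: {loss:.3f} bits")  # -log2(0.7) ≈ 0.515
```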
KL Divergence
Measures "distance" between distributions:
$$ D_{KL}(p \| q) = \sum_i p(x_i) \log \frac{p(x_i)}{q(x_i)} $$
```python
def kl_divergence(p, q):
    """KL divergence D_KL(p || q)"""
    p = np.array(p, dtype=float)
    q = np.array(q, dtype=float)
    mask = p > 0                     # terms with p(x) = 0 contribute 0
    q = np.clip(q[mask], 1e-10, 1)   # avoid division by zero / log(0)
    return np.sum(p[mask] * np.log2(p[mask] / q))

# Relationship: H(p,q) = H(p) + D_KL(p||q)
```
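To make the relationship concrete, here is a small numerical check using the helpers defined above (the two distributions are just illustrative):

```python
# Verify H(p, q) = H(p) + D_KL(p || q) numerically
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(f"H(p,q) = {lhs:.4f}, H(p) + D_KL(p||q) = {rhs:.4f}")  # both ≈ 1.5219
```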
Properties
- $H(X) \geq 0$ (non-negative)
- $H(X) \leq \log_2 n$ for $n$ possible outcomes (maximum achieved by the uniform distribution)
- $D_{KL}(p \| q) \geq 0$ (non-negative)
- $D_{KL}(p \| q) \neq D_{KL}(q \| p)$ (not symmetric; see the check below)
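A brief numerical check of the uniform-maximum and asymmetry properties, reusing the helpers above (the distributions are illustrative):

```python
# Uniform distribution over 4 outcomes attains the maximum log2(4) = 2 bits
uniform = [0.25, 0.25, 0.25, 0.25]
print(entropy(uniform), np.log2(4))  # 2.0 2.0

# KL divergence is not symmetric
p = [0.8, 0.2]
q = [0.5, 0.5]
print(kl_divergence(p, q), kl_divergence(q, p))  # ≈ 0.278 vs ≈ 0.322
```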
Related Snippets
- Channel Capacity: Shannon's theorem and noisy channels
- Data Compression: Lossy vs lossless compression, Huffman coding
- Information Theory Basics: Fundamental concepts of information theory
- Mutual Information: Measuring dependence between variables