Entropy & Information Measures

Shannon Entropy

Average information content of a discrete random variable $X$:

$$ H(X) = -\sum_i p(x_i) \log_2 p(x_i) $$

Units: bits (if log base 2), nats (if natural log)
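
The two units differ only by a constant factor, since $\log_2 x = \ln x / \ln 2$:

$$ H_{\text{nats}} = (\ln 2)\, H_{\text{bits}} \approx 0.693\, H_{\text{bits}} $$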

import numpy as np

def entropy(probabilities):
    """Calculate Shannon entropy in bits"""
    p = np.array(probabilities)
    p = p[p > 0]  # Remove zeros (0 * log 0 is treated as 0)
    return -np.sum(p * np.log2(p))

# Example: fair coin
p_coin = [0.5, 0.5]
H = entropy(p_coin)
print(f"Entropy: {H:.3f} bits")  # 1.000 bits
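
As a quick check (an assumed example reusing the entropy function above): a biased coin is more predictable than a fair one, so its entropy falls below 1 bit.

# Biased coin: the more predictable the outcome, the lower the entropy
p_biased = [0.9, 0.1]
print(f"Entropy: {entropy(p_biased):.3f} bits")  # ~0.469 bits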

Cross-Entropy

$$ H(p, q) = -\sum_i p(x_i) \log q(x_i) $$

The average code length needed to encode samples from $p$ using a code optimized for $q$; widely used as a loss function in machine learning (e.g. classification).

def cross_entropy(p, q):
    """Cross-entropy between distributions p and q, in bits"""
    p = np.array(p)
    q = np.array(q)
    q = np.clip(q, 1e-10, 1)  # Avoid log(0)
    return -np.sum(p * np.log2(q))
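
A minimal usage sketch (assumed example values): when the true distribution p is one-hot, the cross-entropy reduces to the negative log-probability that the model assigns to the correct class.

# One-hot true label vs. a model's predicted probabilities (hypothetical values)
p_true = [1, 0, 0]
q_pred = [0.7, 0.2, 0.1]
print(f"Cross-entropy: {cross_entropy(p_true, q_pred):.3f} bits")  # -log2(0.7) ≈ 0.515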

KL Divergence

Measures how a distribution $q$ diverges from a reference distribution $p$. It behaves like a "distance" but is not a true metric (it is not symmetric):

$$ D_{KL}(p \| q) = \sum_i p(x_i) \log \frac{p(x_i)}{q(x_i)} $$

def kl_divergence(p, q):
    """KL divergence D_KL(p || q) in bits"""
    p = np.array(p, dtype=float)
    q = np.array(q, dtype=float)
    mask = p > 0              # Terms with p(x) = 0 contribute nothing
    q = np.clip(q, 1e-10, 1)  # Avoid division by zero / log(0)
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Relationship: H(p, q) = H(p) + D_KL(p || q)
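
A quick numerical check of that relationship, using assumed example distributions and the functions defined above:

# Verify H(p, q) = H(p) + D_KL(p || q)
p = [0.5, 0.5]
q = [0.9, 0.1]
print(f"H(p, q)           = {cross_entropy(p, q):.3f} bits")               # 1.737
print(f"H(p) + D_KL(p||q) = {entropy(p) + kl_divergence(p, q):.3f} bits")  # 1.737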

Properties

  • $H(X) \geq 0$ (non-negative)
  • $H(X) \leq \log_2 n$ (maximum for uniform distribution)
  • $D_{KL}(p \| q) \geq 0$ (non-negative; zero iff $p = q$)
  • $D_{KL}(p \| q) \neq D_{KL}(q \| p)$ (not symmetric; see the check below)
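
A short numerical check of these properties (assumed example distributions, reusing the functions above):

# Maximum entropy: uniform over n outcomes gives log2(n) bits
n = 4
print(f"H(uniform) = {entropy([1/n] * n):.3f}, log2(n) = {np.log2(n):.3f}")  # 2.000, 2.000

# Asymmetry of KL divergence
p, q = [0.5, 0.5], [0.9, 0.1]
print(f"D_KL(p||q) = {kl_divergence(p, q):.3f} bits")  # ~0.737
print(f"D_KL(q||p) = {kl_divergence(q, p):.3f} bits")  # ~0.531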
