# Data Compression

## Lossless vs Lossy
- Lossless: Perfect reconstruction (ZIP, PNG, FLAC)
- Lossy: Approximate reconstruction (JPEG, MP3, H.264)
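A quick way to see the lossless guarantee in action is a round trip through Python's built-in `zlib` (a DEFLATE implementation, the same family ZIP uses). A minimal sketch; the repetitive input is just an illustrative choice:

```python
import zlib

original = b"AAABBC" * 1000            # repetitive input, so it compresses well

compressed = zlib.compress(original)   # lossless compression
restored = zlib.decompress(compressed)

assert restored == original            # perfect, bit-for-bit reconstruction
print(f"{len(original)} bytes -> {len(compressed)} bytes")
```

A lossy codec could not pass this assert: it trades exact reconstruction for smaller output.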
## Huffman Coding

Optimal prefix-free code for known symbol probabilities.
```python
import heapq
from collections import Counter

def huffman_encoding(data):
    """Build a Huffman tree and encode data with it."""
    # Count symbol frequencies
    freq = Counter(data)

    # Build a min-heap of [weight, [symbol, code], ...] entries
    heap = [[weight, [symbol, ""]] for symbol, weight in freq.items()]
    heapq.heapify(heap)

    # Edge case: a single unique symbol still needs a 1-bit code
    if len(heap) == 1:
        heap[0][1][1] = "0"

    # Repeatedly merge the two lightest subtrees; codes in the lighter
    # subtree get a '0' prefix, those in the heavier a '1'
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])

    # Extract the code table, shortest codes first
    codes = {symbol: code
             for symbol, code in sorted(heapq.heappop(heap)[1:],
                                        key=lambda p: (len(p[-1]), p))}

    # Encode the input symbol by symbol
    encoded = ''.join(codes[symbol] for symbol in data)

    return codes, encoded

# Example
data = "AAABBC"
codes, encoded = huffman_encoding(data)
print(f"Codes: {codes}")
print(f"Encoded: {encoded}")
```
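Decoding relies on the prefix-free property: no code is a prefix of another, so the bit string can be consumed left to right without ambiguity. A minimal decoder sketch to pair with the function above (`huffman_decoding` is a hypothetical helper, not part of the original snippet):

```python
def huffman_decoding(encoded, codes):
    """Decode a bit string using the code table from huffman_encoding."""
    # Invert the table; prefix-freeness makes each lookup unambiguous
    inverse = {code: symbol for symbol, code in codes.items()}
    decoded, buffer = [], ""
    for bit in encoded:
        buffer += bit
        if buffer in inverse:          # a complete codeword has been read
            decoded.append(inverse[buffer])
            buffer = ""
    return ''.join(decoded)

# Round trip against the example above
assert huffman_decoding(encoded, codes) == data
```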
## Compression Ratio
$$ \text{Ratio} = \frac{\text{Original Size}}{\text{Compressed Size}} $$
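For the `"AAABBC"` example above, assuming the original is stored as 8-bit characters, the Huffman output is $3 \times 1 + 2 \times 2 + 1 \times 2 = 9$ bits:

$$ \text{Ratio} = \frac{6 \times 8 \text{ bits}}{9 \text{ bits}} \approx 5.3 $$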
## Further Reading

### Related Snippets
- Channel Capacity: Shannon's theorem and noisy channels
- Entropy & Information Measures: Shannon entropy, cross-entropy, and KL divergence
- Information Theory Basics: Fundamental concepts of information theory
- Mutual Information: Measuring dependence between variables