Data Compression

Lossless vs Lossy

  • Lossless: Perfect reconstruction (ZIP, PNG, FLAC)
  • Lossy: Approximate reconstruction (JPEG, MP3, H.264)

Huffman Coding

Optimal prefix-free code for known symbol probabilities.

 1import heapq
 2from collections import Counter
 3
 4def huffman_encoding(data):
 5    """Build Huffman tree and encode data"""
 6    # Count frequencies
 7    freq = Counter(data)
 8    
 9    # Build heap
10    heap = [[weight, [symbol, ""]] for symbol, weight in freq.items()]
11    heapq.heapify(heap)
12    
13    # Build tree
14    while len(heap) > 1:
15        lo = heapq.heappop(heap)
16        hi = heapq.heappop(heap)
17        for pair in lo[1:]:
18            pair[1] = '0' + pair[1]
19        for pair in hi[1:]:
20            pair[1] = '1' + pair[1]
21        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
22    
23    # Extract codes
24    codes = {symbol: code for symbol, code in sorted(heapq.heappop(heap)[1:], key=lambda p: (len(p[-1]), p))}
25    
26    # Encode
27    encoded = ''.join(codes[symbol] for symbol in data)
28    
29    return codes, encoded
30
31# Example
32data = "AAABBC"
33codes, encoded = huffman_encoding(data)
34print(f"Codes: {codes}")
35print(f"Encoded: {encoded}")

Compression Ratio

$$ \text{Ratio} = \frac{\text{Original Size}}{\text{Compressed Size}} $$

Further Reading

Related Snippets