String Similarity Algorithms

Algorithms for measuring similarity between strings, used in spell checking, DNA sequencing, plagiarism detection, and fuzzy matching.

1. Levenshtein Distance (Edit Distance)

Minimum number of single-character edits (insertions, deletions, substitutions) to transform one string into another.

How It Works

Core Idea: Build a table where dp[i][j] represents the minimum edits needed to transform the first i characters of string 1 into the first j characters of string 2.

Intuition:

  • If characters match: no edit needed, copy diagonal value
  • If different: take minimum of:
    • Delete from s1: dp[i-1][j] + 1
    • Insert into s1: dp[i][j-1] + 1
    • Substitute: dp[i-1][j-1] + 1

Example: Transform "kitten" → "sitting"

1    ""  s  i  t  t  i  n  g
2""   0  1  2  3  4  5  6  7
3k    1  1  2  3  4  5  6  7
4i    2  2  1  2  3  4  5  6
5t    3  3  2  1  2  3  4  5
6t    4  4  3  2  1  2  3  4
7e    5  5  4  3  2  2  3  4
8n    6  6  5  4  3  3  2  3

Result: 3 edits (k→s, e→i, insert g)

Dynamic Programming Solution

func LevenshteinDistance(s1, s2 string) int {
    m, n := len(s1), len(s2)

    // Create DP table
    dp := make([][]int, m+1)
    for i := range dp {
        dp[i] = make([]int, n+1)
    }

    // Initialize base cases
    for i := 0; i <= m; i++ {
        dp[i][0] = i // Delete all characters from s1
    }
    for j := 0; j <= n; j++ {
        dp[0][j] = j // Insert all characters from s2
    }

    // Fill DP table
    for i := 1; i <= m; i++ {
        for j := 1; j <= n; j++ {
            if s1[i-1] == s2[j-1] {
                dp[i][j] = dp[i-1][j-1] // No operation needed
            } else {
                dp[i][j] = 1 + min(
                    dp[i-1][j],   // Delete from s1
                    dp[i][j-1],   // Insert into s1
                    dp[i-1][j-1], // Substitute
                )
            }
        }
    }

    return dp[m][n]
}

func min(a, b, c int) int {
    if a < b {
        if a < c {
            return a
        }
        return c
    }
    if b < c {
        return b
    }
    return c
}

// Example
func main() {
    s1 := "kitten"
    s2 := "sitting"
    dist := LevenshteinDistance(s1, s2)
    fmt.Printf("Distance between '%s' and '%s': %d\n", s1, s2, dist)
    // Output: Distance between 'kitten' and 'sitting': 3
    // (k→s, e→i, insert g)
}

Time: $O(m \times n)$
Space: $O(m \times n)$

Space-Optimized Version

func LevenshteinDistanceOptimized(s1, s2 string) int {
    m, n := len(s1), len(s2)

    // Only need two rows
    prev := make([]int, n+1)
    curr := make([]int, n+1)

    // Initialize first row
    for j := 0; j <= n; j++ {
        prev[j] = j
    }

    for i := 1; i <= m; i++ {
        curr[0] = i

        for j := 1; j <= n; j++ {
            if s1[i-1] == s2[j-1] {
                curr[j] = prev[j-1]
            } else {
                curr[j] = 1 + min(prev[j], curr[j-1], prev[j-1])
            }
        }

        prev, curr = curr, prev
    }

    return prev[n]
}

Space: $O(n)$

2. Longest Common Subsequence (LCS)

Find the longest subsequence present in both strings (not necessarily contiguous).

How It Works

Core Idea: Build a table where dp[i][j] represents the length of LCS of first i characters of s1 and first j characters of s2.

Intuition:

  • If characters match: extend previous LCS by 1
  • If different: take the maximum of the LCS lengths obtained by dropping the current character from s1 or from s2

Example: "ABCDGH" and "AEDFHR"

1    ""  A  E  D  F  H  R
2""   0  0  0  0  0  0  0
3A    0  1  1  1  1  1  1
4B    0  1  1  1  1  1  1
5C    0  1  1  1  1  1  1
6D    0  1  1  2  2  2  2
7G    0  1  1  2  2  2  2
8H    0  1  1  2  2  3  3

LCS: "ADH" (length 3)

Key Difference from Edit Distance: We're finding common elements, not transforming one to another.

Dynamic Programming Solution

func LongestCommonSubsequence(s1, s2 string) int {
    m, n := len(s1), len(s2)

    dp := make([][]int, m+1)
    for i := range dp {
        dp[i] = make([]int, n+1)
    }

    for i := 1; i <= m; i++ {
        for j := 1; j <= n; j++ {
            if s1[i-1] == s2[j-1] {
                dp[i][j] = dp[i-1][j-1] + 1
            } else {
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
            }
        }
    }

    return dp[m][n]
}

// Get the actual LCS string
func GetLCS(s1, s2 string) string {
    m, n := len(s1), len(s2)

    dp := make([][]int, m+1)
    for i := range dp {
        dp[i] = make([]int, n+1)
    }

    // Fill DP table
    for i := 1; i <= m; i++ {
        for j := 1; j <= n; j++ {
            if s1[i-1] == s2[j-1] {
                dp[i][j] = dp[i-1][j-1] + 1
            } else {
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
            }
        }
    }

    // Backtrack to find LCS
    lcs := []byte{}
    i, j := m, n

    for i > 0 && j > 0 {
        if s1[i-1] == s2[j-1] {
            lcs = append([]byte{s1[i-1]}, lcs...)
            i--
            j--
        } else if dp[i-1][j] > dp[i][j-1] {
            i--
        } else {
            j--
        }
    }

    return string(lcs)
}

// Example
func main() {
    s1 := "ABCDGH"
    s2 := "AEDFHR"
    fmt.Printf("LCS length: %d\n", LongestCommonSubsequence(s1, s2))
    fmt.Printf("LCS: %s\n", GetLCS(s1, s2))
    // Output:
    // LCS length: 3
    // LCS: ADH
}

Time: $O(m \times n)$
Space: $O(m \times n)$
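
If only the length is needed, the table can be shrunk to two rows, exactly as in the space-optimized edit distance above. A minimal sketch (the name LCSLength2Rows is mine, and it assumes Go 1.21+ for the built-in max):

// LCSLength2Rows computes the LCS length while keeping only two rows of the DP table.
func LCSLength2Rows(s1, s2 string) int {
    n := len(s2)
    prev := make([]int, n+1) // row i-1
    curr := make([]int, n+1) // row i

    for i := 1; i <= len(s1); i++ {
        for j := 1; j <= n; j++ {
            if s1[i-1] == s2[j-1] {
                curr[j] = prev[j-1] + 1
            } else {
                curr[j] = max(prev[j], curr[j-1])
            }
        }
        prev, curr = curr, prev
    }

    return prev[n]
}

Note that the subsequence itself cannot be recovered from two rows alone; keep the full table (as in GetLCS) when the actual string is needed.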

3. Longest Common Substring

Find the longest contiguous substring present in both strings.

How It Works

Core Idea: Similar to LCS but requires consecutive matches. Reset to 0 when characters don't match.

Intuition:

  • If characters match: extend current substring length
  • If different: reset to 0 (must be contiguous)
  • Track maximum length and ending position

Example: "OldSite:GeeksforGeeks.org" and "NewSite:GeeksQuiz.com"

  • Common substrings: "Site:", "Geeks", "Site:Geeks"
  • Longest: "Site:Geeks" (length 10)

Key Difference from LCS: Must be consecutive characters, not just subsequence.

func LongestCommonSubstring(s1, s2 string) string {
    m, n := len(s1), len(s2)

    dp := make([][]int, m+1)
    for i := range dp {
        dp[i] = make([]int, n+1)
    }

    maxLen := 0
    endIndex := 0

    for i := 1; i <= m; i++ {
        for j := 1; j <= n; j++ {
            if s1[i-1] == s2[j-1] {
                dp[i][j] = dp[i-1][j-1] + 1

                if dp[i][j] > maxLen {
                    maxLen = dp[i][j]
                    endIndex = i
                }
            }
        }
    }

    if maxLen == 0 {
        return ""
    }

    return s1[endIndex-maxLen : endIndex]
}

// Example
func main() {
    s1 := "OldSite:GeeksforGeeks.org"
    s2 := "NewSite:GeeksQuiz.com"
    fmt.Printf("Longest common substring: %s\n", LongestCommonSubstring(s1, s2))
    // Output: Longest common substring: Site:Geeks
}

Time: $O(m \times n)$
Space: $O(m \times n)$

4. Hamming Distance

Number of positions at which corresponding symbols differ (strings must be same length).

How It Works

Core Idea: Simply count positions where characters differ.

Intuition: Used for fixed-length codes (error detection, DNA sequences).

Example: "karolin" vs "kathrin"

k a r o l i n
k a t h r i n
✓ ✓ ✗ ✗ ✗ ✓ ✓

Distance = 3 (the pairs r/t, o/h, and l/r differ)

Use Case: Error-correcting codes, detecting bit flips.

func HammingDistance(s1, s2 string) int {
    if len(s1) != len(s2) {
        return -1 // Invalid
    }

    distance := 0
    for i := 0; i < len(s1); i++ {
        if s1[i] != s2[i] {
            distance++
        }
    }

    return distance
}

// Example
func main() {
    s1 := "karolin"
    s2 := "kathrin"
    fmt.Printf("Hamming distance: %d\n", HammingDistance(s1, s2))
    // Output: Hamming distance: 3
    // (k-k, a-a, r-t, o-h, l-r, i-i, n-n)
}

Time: $O(n)$
Space: $O(1)$
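
The same count is used at the bit level for error detection: XOR the two words and count the set bits. A minimal sketch using the standard library's math/bits package (the name HammingDistanceBits is mine):

import "math/bits"

// HammingDistanceBits counts differing bit positions between two 64-bit words.
func HammingDistanceBits(a, b uint64) int {
    return bits.OnesCount64(a ^ b) // XOR marks differing bits; popcount tallies them
}

// Example: 0b1011 and 0b1001 differ in exactly one bit position,
// so HammingDistanceBits(0b1011, 0b1001) == 1.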

5. Jaro-Winkler Distance

Measures similarity between two strings on a [0, 1] scale, giving extra weight to a common prefix.

How It Works

Core Idea: Jaro distance considers matching characters within a window, then Winkler adds bonus for common prefix.

Jaro Distance Formula: $$ \text{Jaro} = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m-t}{m}\right) $$ where:

  • $m$ = number of matching characters
  • $t$ = number of transpositions (half the number of matched characters that appear in a different order)
  • Match window = $\lfloor \max(|s_1|, |s_2|) / 2 \rfloor - 1$

Jaro-Winkler: Adds prefix bonus $$ \text{JW} = \text{Jaro} + (l \times p \times (1 - \text{Jaro})) $$ where:

  • $l$ = length of common prefix (max 4)
  • $p$ = scaling factor (typically 0.1)

Example: "MARTHA" vs "MARHTA"

  • Matches: M, A, R, H, T, A (all 6)
  • Transpositions: T and H swapped (1 transposition)
  • Common prefix: "MAR" (length 3)
  • Jaro ≈ 0.944, Jaro-Winkler ≈ 0.961 (worked out below)
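
Plugging these numbers into the formulas above, with $m = 6$, $t = 1$, $l = 3$, and $p = 0.1$:

$$ \text{Jaro} = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6 - 1}{6}\right) \approx 0.944 \qquad \text{JW} = 0.944 + 3 \times 0.1 \times (1 - 0.944) \approx 0.961 $$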

Use Case: Record linkage, name matching (tolerates typos).

func JaroDistance(s1, s2 string) float64 {
    if s1 == s2 {
        return 1.0
    }

    len1, len2 := len(s1), len(s2)
    if len1 == 0 || len2 == 0 {
        return 0.0
    }

    // Maximum distance for matches
    matchDistance := max(len1, len2)/2 - 1
    if matchDistance < 0 {
        matchDistance = 0
    }

    s1Matches := make([]bool, len1)
    s2Matches := make([]bool, len2)

    matches := 0
    transpositions := 0

    // Find matches
    for i := 0; i < len1; i++ {
        start := max(0, i-matchDistance)
        end := min(i+matchDistance+1, len2)

        for j := start; j < end; j++ {
            if s2Matches[j] || s1[i] != s2[j] {
                continue
            }
            s1Matches[i] = true
            s2Matches[j] = true
            matches++
            break
        }
    }

    if matches == 0 {
        return 0.0
    }

    // Count transpositions
    k := 0
    for i := 0; i < len1; i++ {
        if !s1Matches[i] {
            continue
        }
        for !s2Matches[k] {
            k++
        }
        if s1[i] != s2[k] {
            transpositions++
        }
        k++
    }

    jaro := (float64(matches)/float64(len1) +
        float64(matches)/float64(len2) +
        float64(matches-transpositions/2)/float64(matches)) / 3.0

    return jaro
}

func JaroWinklerDistance(s1, s2 string) float64 {
    jaro := JaroDistance(s1, s2)

    // Find common prefix length (max 4)
    prefixLen := 0
    for i := 0; i < min(len(s1), len(s2), 4); i++ {
        if s1[i] == s2[i] {
            prefixLen++
        } else {
            break
        }
    }

    // Jaro-Winkler = Jaro + (prefix_length * 0.1 * (1 - Jaro))
    return jaro + float64(prefixLen)*0.1*(1.0-jaro)
}

// Example
func main() {
    s1 := "MARTHA"
    s2 := "MARHTA"
    fmt.Printf("Jaro-Winkler distance: %.4f\n", JaroWinklerDistance(s1, s2))
    // Output: Jaro-Winkler distance: 0.9611
}

Time: $O(m \times n)$
Space: $O(m + n)$

6. Damerau-Levenshtein Distance

Edit distance allowing insertions, deletions, substitutions, and transpositions of adjacent characters.

How It Works

Core Idea: Like Levenshtein but also allows swapping adjacent characters (transposition).

Intuition: Accounts for one of the most common typos, swapping two adjacent letters.

Example: "CA" → "ABC"

  • Levenshtein: 3 (e.g., delete C, insert B, insert C)
  • Damerau-Levenshtein: 2 (transpose C and A to get "AC", then insert B)

Operations:

  1. Insert
  2. Delete
  3. Substitute
  4. Transpose (swap two adjacent characters) - this is the new operation

Use Case: Spell checkers (80% of typos are single-character edits or transpositions).

func DamerauLevenshteinDistance(s1, s2 string) int {
    m, n := len(s1), len(s2)

    // Create DP table with extra row/column
    dp := make([][]int, m+2)
    for i := range dp {
        dp[i] = make([]int, n+2)
    }

    maxDist := m + n
    dp[0][0] = maxDist

    for i := 0; i <= m; i++ {
        dp[i+1][0] = maxDist
        dp[i+1][1] = i
    }
    for j := 0; j <= n; j++ {
        dp[0][j+1] = maxDist
        dp[1][j+1] = j
    }

    da := make(map[byte]int)

    for i := 1; i <= m; i++ {
        db := 0

        for j := 1; j <= n; j++ {
            k := da[s2[j-1]]
            l := db
            cost := 1

            if s1[i-1] == s2[j-1] {
                cost = 0
                db = j
            }

            dp[i+1][j+1] = min(
                dp[i][j]+cost,              // Substitution
                dp[i+1][j]+1,               // Insertion
                dp[i][j+1]+1,               // Deletion
                dp[k][l]+(i-k-1)+1+(j-l-1), // Transposition
            )
        }

        da[s1[i-1]] = i
    }

    return dp[m+1][n+1]
}

func min(vals ...int) int {
    minVal := vals[0]
    for _, v := range vals[1:] {
        if v < minVal {
            minVal = v
        }
    }
    return minVal
}

// Example
func main() {
    s1 := "CA"
    s2 := "ABC"
    fmt.Printf("Damerau-Levenshtein distance: %d\n", DamerauLevenshteinDistance(s1, s2))
    // Output: Damerau-Levenshtein distance: 2
}

Time: $O(m \times n)$
Space: $O(m \times n)$
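
A quick way to see the effect of the transposition operation is to compare both distances on a swapped-letter typo. This sketch assumes LevenshteinDistance from section 1 and DamerauLevenshteinDistance above live in the same package, sharing the variadic min helper:

func main() {
    // "teh" → "the" is a single adjacent transposition
    fmt.Println(LevenshteinDistance("teh", "the"))        // 2 (two substitutions)
    fmt.Println(DamerauLevenshteinDistance("teh", "the")) // 1 (one transposition)
}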

7. Cosine Similarity

Measures similarity based on character frequency (a bag-of-characters model).

How It Works

Core Idea: Treat each string as a vector of character frequencies and measure the cosine of the angle between the two vectors.

Intuition: Ignores order, focuses on character distribution.

Formula: $$ \text{Cosine} = \frac{\mathbf{A} \cdot \mathbf{B}}{||\mathbf{A}|| \times ||\mathbf{B}||} = \frac{\sum A_i B_i}{\sqrt{\sum A_i^2} \times \sqrt{\sum B_i^2}} $$

Example: "hello" vs "olleh"

  • Vector for "hello": {h:1, e:1, l:2, o:1}
  • Vector for "olleh": {o:1, l:2, e:1, h:1}
  • Same frequencies → Cosine = 1.0 (identical)

Key Property: Order-independent, perfect for anagrams.

Use Case: Document similarity, plagiarism detection.

import "math"

func CosineSimilarity(s1, s2 string) float64 {
    // Build frequency maps
    freq1 := make(map[rune]int)
    freq2 := make(map[rune]int)

    for _, ch := range s1 {
        freq1[ch]++
    }
    for _, ch := range s2 {
        freq2[ch]++
    }

    // Calculate dot product and magnitudes
    dotProduct := 0.0
    mag1 := 0.0
    mag2 := 0.0

    // Get all unique characters
    allChars := make(map[rune]bool)
    for ch := range freq1 {
        allChars[ch] = true
    }
    for ch := range freq2 {
        allChars[ch] = true
    }

    for ch := range allChars {
        f1 := float64(freq1[ch])
        f2 := float64(freq2[ch])

        dotProduct += f1 * f2
        mag1 += f1 * f1
        mag2 += f2 * f2
    }

    if mag1 == 0 || mag2 == 0 {
        return 0.0
    }

    return dotProduct / (math.Sqrt(mag1) * math.Sqrt(mag2))
}

// Example
func main() {
    s1 := "hello world"
    s2 := "hello there"
    fmt.Printf("Cosine similarity: %.4f\n", CosineSimilarity(s1, s2))
}

Time: $O(m + n)$
Space: $O(m + n)$
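
The snippet above compares character counts; for document similarity the same formula is normally applied to word counts. A minimal word-level sketch (the name CosineSimilarityWords is mine), splitting on whitespace with strings.Fields:

import (
    "math"
    "strings"
)

// CosineSimilarityWords treats each string as a bag of whitespace-separated words.
func CosineSimilarityWords(s1, s2 string) float64 {
    freq1 := make(map[string]int)
    freq2 := make(map[string]int)
    for _, w := range strings.Fields(s1) {
        freq1[w]++
    }
    for _, w := range strings.Fields(s2) {
        freq2[w]++
    }

    dot, mag1, mag2 := 0.0, 0.0, 0.0
    for w, c1 := range freq1 {
        mag1 += float64(c1) * float64(c1)
        if c2, ok := freq2[w]; ok {
            dot += float64(c1) * float64(c2)
        }
    }
    for _, c2 := range freq2 {
        mag2 += float64(c2) * float64(c2)
    }

    if mag1 == 0 || mag2 == 0 {
        return 0.0
    }
    return dot / (math.Sqrt(mag1) * math.Sqrt(mag2))
}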

8. Jaccard Similarity

Measures similarity as intersection over union of character sets.

How It Works

Core Idea: Compare unique character sets using set operations.

Formula: $$ \text{Jaccard} = \frac{|A \cap B|}{|A \cup B|} $$

Intuition: What fraction of all unique characters appear in both strings?

Example: "night" vs "nacht"

  • Set A: {n, i, g, h, t}
  • Set B: {n, a, c, h, t}
  • Intersection: {n, h, t} (3 chars)
  • Union: {n, i, g, h, t, a, c} (7 chars)
  • Jaccard = 3/7 ≈ 0.43

Properties:

  • Range: [0, 1]
  • 0 = no common characters
  • 1 = identical character sets

Use Case: Set similarity, recommendation systems.

func JaccardSimilarity(s1, s2 string) float64 {
    set1 := make(map[rune]bool)
    set2 := make(map[rune]bool)

    for _, ch := range s1 {
        set1[ch] = true
    }
    for _, ch := range s2 {
        set2[ch] = true
    }

    // Calculate intersection
    intersection := 0
    for ch := range set1 {
        if set2[ch] {
            intersection++
        }
    }

    // Calculate union
    union := len(set1) + len(set2) - intersection

    if union == 0 {
        return 0.0
    }

    return float64(intersection) / float64(union)
}

// Example
func main() {
    s1 := "night"
    s2 := "nacht"
    fmt.Printf("Jaccard similarity: %.4f\n", JaccardSimilarity(s1, s2))
    // Output: Jaccard similarity: 0.4286 (3 common chars / 7 unique chars)
}

Time: $O(m + n)$
Space: $O(m + n)$
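
Because it works on single characters, this version scores anagrams such as "listen" and "silent" as identical (1.0). A common refinement is to compare sets of adjacent character pairs (bigrams) instead; a minimal sketch (the name JaccardBigrams is mine):

// JaccardBigrams compares the sets of adjacent character pairs of both strings.
func JaccardBigrams(s1, s2 string) float64 {
    bigrams := func(s string) map[string]bool {
        set := make(map[string]bool)
        for i := 0; i+2 <= len(s); i++ {
            set[s[i:i+2]] = true
        }
        return set
    }
    set1, set2 := bigrams(s1), bigrams(s2)

    intersection := 0
    for b := range set1 {
        if set2[b] {
            intersection++
        }
    }
    union := len(set1) + len(set2) - intersection
    if union == 0 {
        return 0.0
    }
    return float64(intersection) / float64(union)
}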

Comparison Table

| Algorithm | Time | Space | Use Case |
| --- | --- | --- | --- |
| Levenshtein | $O(mn)$ | $O(mn)$ or $O(n)$ | General edit distance, spell check |
| LCS | $O(mn)$ | $O(mn)$ | Diff tools, DNA sequencing |
| Longest Common Substring | $O(mn)$ | $O(mn)$ | Find common patterns |
| Hamming | $O(n)$ | $O(1)$ | Equal-length strings, error detection |
| Jaro-Winkler | $O(mn)$ | $O(m+n)$ | Record linkage, name matching |
| Damerau-Levenshtein | $O(mn)$ | $O(mn)$ | Typo detection (transpositions) |
| Cosine | $O(m+n)$ | $O(m+n)$ | Document similarity |
| Jaccard | $O(m+n)$ | $O(m+n)$ | Set similarity |
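
The distance-based measures (Levenshtein, Damerau-Levenshtein, Hamming) can be turned into a [0, 1] similarity score for thresholding by normalizing against the longer string. A minimal sketch built on LevenshteinDistance from section 1 (the name Similarity is mine; assumes Go 1.21+ for the built-in max):

// Similarity converts an edit distance into a score in [0, 1], where 1 means identical.
func Similarity(s1, s2 string) float64 {
    longest := max(len(s1), len(s2))
    if longest == 0 {
        return 1.0 // two empty strings are identical
    }
    return 1.0 - float64(LevenshteinDistance(s1, s2))/float64(longest)
}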

Practical Applications

Spell Checker

func FindClosestWord(word string, dictionary []string, threshold int) []string {
    suggestions := []string{}

    for _, dictWord := range dictionary {
        dist := LevenshteinDistance(word, dictWord)
        if dist <= threshold {
            suggestions = append(suggestions, dictWord)
        }
    }

    return suggestions
}
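
A hypothetical call, with a made-up dictionary and a threshold of one edit:

func main() {
    dictionary := []string{"hello", "help", "world", "word"}
    fmt.Println(FindClosestWord("helo", dictionary, 1))
    // Output: [hello help]
}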

Fuzzy String Matching

func FuzzyMatch(query string, candidates []string, threshold float64) []string {
    matches := []string{}

    for _, candidate := range candidates {
        similarity := JaroWinklerDistance(query, candidate)
        if similarity >= threshold {
            matches = append(matches, candidate)
        }
    }

    return matches
}

DNA Sequence Alignment

func AlignSequences(seq1, seq2 string) int {
    return LongestCommonSubsequence(seq1, seq2)
}

When to Use Each Algorithm

Levenshtein: General purpose, spell checking
LCS: Version control diffs, plagiarism detection
Hamming: Error detection in equal-length codes
Jaro-Winkler: Name matching, record linkage
Damerau-Levenshtein: Typo detection with transpositions
Cosine: Document similarity, text classification
Jaccard: Set similarity, recommendation systems
