String Similarity Algorithms
Algorithms for measuring similarity between strings, used in spell checking, DNA sequencing, plagiarism detection, and fuzzy matching.
1. Levenshtein Distance (Edit Distance)
Minimum number of single-character edits (insertions, deletions, substitutions) to transform one string into another.
How It Works
Core Idea: Build a table where dp[i][j] represents the minimum edits needed to transform the first i characters of string 1 into the first j characters of string 2.
Intuition:
- If characters match: no edit needed, copy the diagonal value
- If they differ: take the minimum of:
  - Delete from s1: dp[i-1][j] + 1
  - Insert into s1: dp[i][j-1] + 1
  - Substitute: dp[i-1][j-1] + 1
Example: Transform "kitten" → "sitting"
1 "" s i t t i n g
2"" 0 1 2 3 4 5 6 7
3k 1 1 2 3 4 5 6 7
4i 2 2 1 2 3 4 5 6
5t 3 3 2 1 2 3 4 5
6t 4 4 3 2 1 2 3 4
7e 5 5 4 3 2 2 3 4
8n 6 6 5 4 3 3 2 3
Result: 3 edits (k→s, e→i, insert g)
Dynamic Programming Solution
func LevenshteinDistance(s1, s2 string) int {
    m, n := len(s1), len(s2)

    // Create DP table
    dp := make([][]int, m+1)
    for i := range dp {
        dp[i] = make([]int, n+1)
    }

    // Initialize base cases
    for i := 0; i <= m; i++ {
        dp[i][0] = i // Delete all characters from s1
    }
    for j := 0; j <= n; j++ {
        dp[0][j] = j // Insert all characters from s2
    }

    // Fill DP table
    for i := 1; i <= m; i++ {
        for j := 1; j <= n; j++ {
            if s1[i-1] == s2[j-1] {
                dp[i][j] = dp[i-1][j-1] // No operation needed
            } else {
                dp[i][j] = 1 + min(
                    dp[i-1][j],   // Delete from s1
                    dp[i][j-1],   // Insert into s1
                    dp[i-1][j-1], // Substitute
                )
            }
        }
    }

    return dp[m][n]
}

func min(a, b, c int) int {
    if a < b {
        if a < c {
            return a
        }
        return c
    }
    if b < c {
        return b
    }
    return c
}

// Example
func main() {
    s1 := "kitten"
    s2 := "sitting"
    dist := LevenshteinDistance(s1, s2)
    fmt.Printf("Distance between '%s' and '%s': %d\n", s1, s2, dist)
    // Output: Distance between 'kitten' and 'sitting': 3
    // (k→s, e→i, insert g)
}
Time: $O(m \times n)$
Space: $O(m \times n)$
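The filled table also encodes which edits achieve the minimum: backtrack from dp[m][n] to dp[0][0], recording the move taken at each cell. A minimal sketch of this recovery (the `EditScript` name and operation labels are illustrative, not a standard API):

```go
import "fmt"

// EditScript rebuilds the DP table, then walks from the bottom-right
// corner back to the origin, emitting one operation per edit.
func EditScript(s1, s2 string) []string {
    m, n := len(s1), len(s2)
    dp := make([][]int, m+1)
    for i := range dp {
        dp[i] = make([]int, n+1)
        dp[i][0] = i
    }
    for j := 0; j <= n; j++ {
        dp[0][j] = j
    }
    for i := 1; i <= m; i++ {
        for j := 1; j <= n; j++ {
            if s1[i-1] == s2[j-1] {
                dp[i][j] = dp[i-1][j-1]
            } else {
                dp[i][j] = 1 + min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1])
            }
        }
    }

    // Backtrack: at each cell, pick a predecessor that explains its value.
    ops := []string{}
    i, j := m, n
    for i > 0 || j > 0 {
        switch {
        case i > 0 && j > 0 && s1[i-1] == s2[j-1] && dp[i][j] == dp[i-1][j-1]:
            i-- // characters match: no edit
            j--
        case i > 0 && j > 0 && dp[i][j] == dp[i-1][j-1]+1:
            ops = append(ops, fmt.Sprintf("substitute %c->%c", s1[i-1], s2[j-1]))
            i--
            j--
        case j > 0 && dp[i][j] == dp[i][j-1]+1:
            ops = append(ops, fmt.Sprintf("insert %c", s2[j-1]))
            j--
        default:
            ops = append(ops, fmt.Sprintf("delete %c", s1[i-1]))
            i--
        }
    }
    // Reverse so the operations read left to right.
    for l, r := 0, len(ops)-1; l < r; l, r = l+1, r-1 {
        ops[l], ops[r] = ops[r], ops[l]
    }
    return ops
}
```

For "kitten" → "sitting" this recovers the three edits from the worked example: substitute k→s, substitute e→i, insert g.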
Space-Optimized Version
func LevenshteinDistanceOptimized(s1, s2 string) int {
    m, n := len(s1), len(s2)

    // Only need two rows
    prev := make([]int, n+1)
    curr := make([]int, n+1)

    // Initialize first row
    for j := 0; j <= n; j++ {
        prev[j] = j
    }

    for i := 1; i <= m; i++ {
        curr[0] = i

        for j := 1; j <= n; j++ {
            if s1[i-1] == s2[j-1] {
                curr[j] = prev[j-1]
            } else {
                curr[j] = 1 + min(prev[j], curr[j-1], prev[j-1])
            }
        }

        prev, curr = curr, prev
    }

    return prev[n]
}
Space: $O(n)$
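Raw edit distance grows with string length, so applications often convert it to a length-normalized similarity in [0, 1]. A minimal self-contained sketch using the same two-row method (the 1 − dist/maxLen normalization is a common convention, not part of the algorithm itself; uses Go 1.21's built-in min/max):

```go
// LevenshteinSimilarity maps edit distance into [0, 1],
// where 1.0 means the strings are identical.
func LevenshteinSimilarity(s1, s2 string) float64 {
    m, n := len(s1), len(s2)
    if m == 0 && n == 0 {
        return 1.0
    }

    // Two-row edit-distance computation, as in the optimized version.
    prev := make([]int, n+1)
    curr := make([]int, n+1)
    for j := 0; j <= n; j++ {
        prev[j] = j
    }
    for i := 1; i <= m; i++ {
        curr[0] = i
        for j := 1; j <= n; j++ {
            if s1[i-1] == s2[j-1] {
                curr[j] = prev[j-1]
            } else {
                curr[j] = 1 + min(prev[j], curr[j-1], prev[j-1])
            }
        }
        prev, curr = curr, prev
    }

    // Normalize by the longer string: at most that many edits are needed.
    return 1.0 - float64(prev[n])/float64(max(m, n))
}
```

For "kitten" vs "sitting" this gives 1 − 3/7 ≈ 0.571.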
2. Longest Common Subsequence (LCS)
Find the longest subsequence present in both strings (not necessarily contiguous).
How It Works
Core Idea: Build a table where dp[i][j] represents the length of LCS of first i characters of s1 and first j characters of s2.
Intuition:
- If characters match: extend previous LCS by 1
- If different: take maximum from either excluding current char from s1 or s2
Example: "ABCDGH" and "AEDFHR"
1 "" A E D F H R
2"" 0 0 0 0 0 0 0
3A 0 1 1 1 1 1 1
4B 0 1 1 1 1 1 1
5C 0 1 1 1 1 1 1
6D 0 1 1 2 2 2 2
7G 0 1 1 2 2 2 2
8H 0 1 1 2 2 3 3
LCS: "ADH" (length 3)
Key Difference from Edit Distance: We're finding common elements, not transforming one to another.
Dynamic Programming Solution
func LongestCommonSubsequence(s1, s2 string) int {
    m, n := len(s1), len(s2)

    dp := make([][]int, m+1)
    for i := range dp {
        dp[i] = make([]int, n+1)
    }

    for i := 1; i <= m; i++ {
        for j := 1; j <= n; j++ {
            if s1[i-1] == s2[j-1] {
                dp[i][j] = dp[i-1][j-1] + 1
            } else {
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
            }
        }
    }

    return dp[m][n]
}

// Get the actual LCS string
func GetLCS(s1, s2 string) string {
    m, n := len(s1), len(s2)

    dp := make([][]int, m+1)
    for i := range dp {
        dp[i] = make([]int, n+1)
    }

    // Fill DP table
    for i := 1; i <= m; i++ {
        for j := 1; j <= n; j++ {
            if s1[i-1] == s2[j-1] {
                dp[i][j] = dp[i-1][j-1] + 1
            } else {
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
            }
        }
    }

    // Backtrack to find LCS
    lcs := []byte{}
    i, j := m, n

    for i > 0 && j > 0 {
        if s1[i-1] == s2[j-1] {
            lcs = append([]byte{s1[i-1]}, lcs...)
            i--
            j--
        } else if dp[i-1][j] > dp[i][j-1] {
            i--
        } else {
            j--
        }
    }

    return string(lcs)
}

// Example
func main() {
    s1 := "ABCDGH"
    s2 := "AEDFHR"
    fmt.Printf("LCS length: %d\n", LongestCommonSubsequence(s1, s2))
    fmt.Printf("LCS: %s\n", GetLCS(s1, s2))
    // Output:
    // LCS length: 3
    // LCS: ADH
}
Time: $O(m \times n)$
Space: $O(m \times n)$
3. Longest Common Substring
Find the longest contiguous substring present in both strings.
How It Works
Core Idea: Similar to LCS but requires consecutive matches. Reset to 0 when characters don't match.
Intuition:
- If characters match: extend current substring length
- If different: reset to 0 (must be contiguous)
- Track maximum length and ending position
Example: "OldSite:GeeksforGeeks.org" and "NewSite:GeeksQuiz.com"
- Common substrings include "Site:" and "Geeks", which overlap
- Longest: "Site:Geeks" (length 10)
Key Difference from LCS: Must be consecutive characters, not just subsequence.
func LongestCommonSubstring(s1, s2 string) string {
    m, n := len(s1), len(s2)

    dp := make([][]int, m+1)
    for i := range dp {
        dp[i] = make([]int, n+1)
    }

    maxLen := 0
    endIndex := 0

    for i := 1; i <= m; i++ {
        for j := 1; j <= n; j++ {
            if s1[i-1] == s2[j-1] {
                dp[i][j] = dp[i-1][j-1] + 1

                if dp[i][j] > maxLen {
                    maxLen = dp[i][j]
                    endIndex = i
                }
            }
        }
    }

    if maxLen == 0 {
        return ""
    }

    return s1[endIndex-maxLen : endIndex]
}

// Example
func main() {
    s1 := "OldSite:GeeksforGeeks.org"
    s2 := "NewSite:GeeksQuiz.com"
    fmt.Printf("Longest common substring: %s\n", LongestCommonSubstring(s1, s2))
    // Output: Longest common substring: Site:Geeks
}
Time: $O(m \times n)$
Space: $O(m \times n)$
4. Hamming Distance
Number of positions at which corresponding symbols differ (strings must be same length).
How It Works
Core Idea: Simply count positions where characters differ.
Intuition: Used for fixed-length codes (error detection, DNA sequences).
Example: "karolin" vs "kathrin"
k a r o l i n
k a t h r i n
✓ ✓ ✗ ✗ ✗ ✓ ✓
Distance = 3 (zero-indexed positions 2, 3, and 4 differ)
Use Case: Error-correcting codes, detecting bit flips.
func HammingDistance(s1, s2 string) int {
    if len(s1) != len(s2) {
        return -1 // Invalid: defined only for equal-length strings
    }

    distance := 0
    for i := 0; i < len(s1); i++ {
        if s1[i] != s2[i] {
            distance++
        }
    }

    return distance
}

// Example
func main() {
    s1 := "karolin"
    s2 := "kathrin"
    fmt.Printf("Hamming distance: %d\n", HammingDistance(s1, s2))
    // Output: Hamming distance: 3
    // (k-k, a-a, r-t, o-h, l-r, i-i, n-n)
}
Time: $O(n)$
Space: $O(1)$
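For the error-detection use case, the same count works directly on bits: XOR the two words, then count the set bits. A sketch using the standard `math/bits` package:

```go
import "math/bits"

// HammingDistanceBits counts bit positions where two fixed-width
// words differ: XOR leaves a 1 exactly where the inputs disagree,
// and OnesCount64 (population count) tallies those 1s.
func HammingDistanceBits(a, b uint64) int {
    return bits.OnesCount64(a ^ b)
}
```

For example, `HammingDistanceBits(0b1011, 0b1001)` is 1: a single flipped bit, the event error-detecting codes are designed to catch.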
5. Jaro-Winkler Distance
Measures similarity between two strings on a [0, 1] scale (1.0 = identical), giving extra weight to a shared prefix. Despite the conventional name, the score is a similarity, not a distance.
How It Works
Core Idea: Jaro distance considers matching characters within a window, then Winkler adds bonus for common prefix.
Jaro Distance Formula: $$ \text{Jaro} = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m-t}{m}\right) $$ where:
- $m$ = number of matching characters
- $t$ = number of transpositions (matching chars in wrong order)
- Match window = $\max(|s_1|, |s_2|) / 2 - 1$
Jaro-Winkler: Adds prefix bonus $$ \text{JW} = \text{Jaro} + (l \times p \times (1 - \text{Jaro})) $$ where:
- $l$ = length of common prefix (max 4)
- $p$ = scaling factor (typically 0.1)
Example: "MARTHA" vs "MARHTA"
- Matches: M, A, R, H, T, A (all 6)
- Transpositions: T and H swapped (1 transposition)
- Common prefix: "MAR" (length 3)
- Jaro ≈ 0.944, Jaro-Winkler ≈ 0.961
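Plugging the MARTHA / MARHTA numbers into the formulas above ($m = 6$, $t = 1$, prefix length $l = 3$, $p = 0.1$):
$$ \text{Jaro} = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6-1}{6}\right) = \frac{1}{3}\left(1 + 1 + \tfrac{5}{6}\right) \approx 0.944 $$
$$ \text{JW} = 0.944 + (3 \times 0.1 \times (1 - 0.944)) \approx 0.961 $$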
Use Case: Record linkage, name matching (tolerates typos).
func JaroDistance(s1, s2 string) float64 {
    if s1 == s2 {
        return 1.0
    }

    len1, len2 := len(s1), len(s2)
    if len1 == 0 || len2 == 0 {
        return 0.0
    }

    // Maximum distance for matches
    matchDistance := max(len1, len2)/2 - 1
    if matchDistance < 0 {
        matchDistance = 0
    }

    s1Matches := make([]bool, len1)
    s2Matches := make([]bool, len2)

    matches := 0
    transpositions := 0

    // Find matches
    for i := 0; i < len1; i++ {
        start := max(0, i-matchDistance)
        end := min(i+matchDistance+1, len2)

        for j := start; j < end; j++ {
            if s2Matches[j] || s1[i] != s2[j] {
                continue
            }
            s1Matches[i] = true
            s2Matches[j] = true
            matches++
            break
        }
    }

    if matches == 0 {
        return 0.0
    }

    // Count transpositions
    k := 0
    for i := 0; i < len1; i++ {
        if !s1Matches[i] {
            continue
        }
        for !s2Matches[k] {
            k++
        }
        if s1[i] != s2[k] {
            transpositions++
        }
        k++
    }

    jaro := (float64(matches)/float64(len1) +
        float64(matches)/float64(len2) +
        float64(matches-transpositions/2)/float64(matches)) / 3.0

    return jaro
}

func JaroWinklerDistance(s1, s2 string) float64 {
    jaro := JaroDistance(s1, s2)

    // Find common prefix length (max 4)
    prefixLen := 0
    for i := 0; i < min(len(s1), len(s2), 4); i++ {
        if s1[i] == s2[i] {
            prefixLen++
        } else {
            break
        }
    }

    // Jaro-Winkler = Jaro + (prefix_length * 0.1 * (1 - Jaro))
    return jaro + float64(prefixLen)*0.1*(1.0-jaro)
}

// Example
func main() {
    s1 := "MARTHA"
    s2 := "MARHTA"
    fmt.Printf("Jaro-Winkler distance: %.4f\n", JaroWinklerDistance(s1, s2))
    // Output: Jaro-Winkler distance: 0.9611
}
Time: $O(m \times n)$
Space: $O(m + n)$
6. Damerau-Levenshtein Distance
Edit distance allowing insertions, deletions, substitutions, and transpositions of adjacent characters.
How It Works
Core Idea: Like Levenshtein but also allows swapping adjacent characters (transposition).
Intuition: Accounts for common typo - swapping two adjacent letters.
Example: "CA" → "ABC"
- Levenshtein: 3 (delete C, delete A, insert A, B, C)
- Damerau-Levenshtein: 2 (insert A before, insert B after)
Operations:
- Insert
- Delete
- Substitute
- Transpose (swap adjacent) - this is the addition
Use Case: Spell checkers (in Damerau's original study, roughly 80% of human misspellings were a single insertion, deletion, substitution, or adjacent transposition).
func DamerauLevenshteinDistance(s1, s2 string) int {
    m, n := len(s1), len(s2)

    // Create DP table with extra row/column
    dp := make([][]int, m+2)
    for i := range dp {
        dp[i] = make([]int, n+2)
    }

    maxDist := m + n
    dp[0][0] = maxDist

    for i := 0; i <= m; i++ {
        dp[i+1][0] = maxDist
        dp[i+1][1] = i
    }
    for j := 0; j <= n; j++ {
        dp[0][j+1] = maxDist
        dp[1][j+1] = j
    }

    da := make(map[byte]int)

    for i := 1; i <= m; i++ {
        db := 0

        for j := 1; j <= n; j++ {
            k := da[s2[j-1]]
            l := db
            cost := 1

            if s1[i-1] == s2[j-1] {
                cost = 0
                db = j
            }

            dp[i+1][j+1] = min(
                dp[i][j]+cost,              // Substitution
                dp[i+1][j]+1,               // Insertion
                dp[i][j+1]+1,               // Deletion
                dp[k][l]+(i-k-1)+1+(j-l-1), // Transposition
            )
        }

        da[s1[i-1]] = i
    }

    return dp[m+1][n+1]
}

func min(vals ...int) int {
    minVal := vals[0]
    for _, v := range vals[1:] {
        if v < minVal {
            minVal = v
        }
    }
    return minVal
}

// Example
func main() {
    s1 := "CA"
    s2 := "ABC"
    fmt.Printf("Damerau-Levenshtein distance: %d\n", DamerauLevenshteinDistance(s1, s2))
    // Output: Damerau-Levenshtein distance: 2
}
Time: $O(m \times n)$
Space: $O(m \times n)$
7. Cosine Similarity
Measures similarity based on character frequency (a bag-of-characters model).
How It Works
Core Idea: Treat strings as vectors of character frequencies, measure angle between vectors.
Intuition: Ignores order, focuses on character distribution.
Formula: $$ \text{Cosine} = \frac{\mathbf{A} \cdot \mathbf{B}}{||\mathbf{A}|| \times ||\mathbf{B}||} = \frac{\sum A_i B_i}{\sqrt{\sum A_i^2} \times \sqrt{\sum B_i^2}} $$
Example: "hello" vs "olleh"
- Vector for "hello": {h:1, e:1, l:2, o:1}
- Vector for "olleh": {o:1, l:2, e:1, h:1}
- Same frequencies → Cosine = 1.0 (identical)
Key Property: Order-independent, perfect for anagrams.
Use Case: Document similarity, plagiarism detection.
import "math"

func CosineSimilarity(s1, s2 string) float64 {
    // Build frequency maps
    freq1 := make(map[rune]int)
    freq2 := make(map[rune]int)

    for _, ch := range s1 {
        freq1[ch]++
    }
    for _, ch := range s2 {
        freq2[ch]++
    }

    // Calculate dot product and magnitudes
    dotProduct := 0.0
    mag1 := 0.0
    mag2 := 0.0

    // Get all unique characters
    allChars := make(map[rune]bool)
    for ch := range freq1 {
        allChars[ch] = true
    }
    for ch := range freq2 {
        allChars[ch] = true
    }

    for ch := range allChars {
        f1 := float64(freq1[ch])
        f2 := float64(freq2[ch])

        dotProduct += f1 * f2
        mag1 += f1 * f1
        mag2 += f2 * f2
    }

    if mag1 == 0 || mag2 == 0 {
        return 0.0
    }

    return dotProduct / (math.Sqrt(mag1) * math.Sqrt(mag2))
}

// Example
func main() {
    s1 := "hello world"
    s2 := "hello there"
    fmt.Printf("Cosine similarity: %.4f\n", CosineSimilarity(s1, s2))
}
Time: $O(m + n)$
Space: $O(m + n)$
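For the document-similarity use case, the same computation is usually run over word counts rather than characters. A minimal sketch with naive whitespace tokenization (real pipelines also normalize case and punctuation):

```go
import (
    "math"
    "strings"
)

// CosineSimilarityWords treats each string as a bag of
// whitespace-separated tokens and compares the frequency vectors.
func CosineSimilarityWords(s1, s2 string) float64 {
    freq1 := make(map[string]int)
    freq2 := make(map[string]int)
    for _, w := range strings.Fields(s1) {
        freq1[w]++
    }
    for _, w := range strings.Fields(s2) {
        freq2[w]++
    }

    dot, mag1, mag2 := 0.0, 0.0, 0.0
    for w, c1 := range freq1 {
        // The dot product only needs tokens present in both vectors;
        // absent tokens contribute a zero term.
        dot += float64(c1) * float64(freq2[w])
        mag1 += float64(c1) * float64(c1)
    }
    for _, c2 := range freq2 {
        mag2 += float64(c2) * float64(c2)
    }

    if mag1 == 0 || mag2 == 0 {
        return 0.0
    }
    return dot / (math.Sqrt(mag1) * math.Sqrt(mag2))
}
```

"hello world" vs "hello there" share one of two tokens each, giving a similarity of 0.5; word order is still ignored, as with characters.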
8. Jaccard Similarity
Measures similarity as intersection over union of character sets.
How It Works
Core Idea: Compare unique character sets using set operations.
Formula: $$ \text{Jaccard} = \frac{|A \cap B|}{|A \cup B|} $$
Intuition: What fraction of all unique characters appear in both strings?
Example: "night" vs "nacht"
- Set A: {n, i, g, h, t}
- Set B: {n, a, c, h, t}
- Intersection: {n, h, t} (3 chars)
- Union: {n, i, g, h, t, a, c} (7 chars)
- Jaccard = 3/7 ≈ 0.43
Properties:
- Range: [0, 1]
- 0 = no common characters
- 1 = identical character sets
Use Case: Set similarity, recommendation systems.
func JaccardSimilarity(s1, s2 string) float64 {
    set1 := make(map[rune]bool)
    set2 := make(map[rune]bool)

    for _, ch := range s1 {
        set1[ch] = true
    }
    for _, ch := range s2 {
        set2[ch] = true
    }

    // Calculate intersection
    intersection := 0
    for ch := range set1 {
        if set2[ch] {
            intersection++
        }
    }

    // Calculate union
    union := len(set1) + len(set2) - intersection

    if union == 0 {
        return 0.0
    }

    return float64(intersection) / float64(union)
}

// Example
func main() {
    s1 := "night"
    s2 := "nacht"
    fmt.Printf("Jaccard similarity: %.4f\n", JaccardSimilarity(s1, s2))
    // Output: Jaccard similarity: 0.4286 (3 common chars / 7 unique chars)
}
Time: $O(m + n)$
Space: $O(m + n)$
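Character sets ignore order entirely: anagrams like "listen" and "silent" score a perfect 1.0. A common refinement compares sets of character bigrams (adjacent two-character shingles) instead, which preserves some local ordering; a sketch:

```go
// JaccardBigrams compares the sets of adjacent character pairs,
// so ordering information survives where single characters lose it.
func JaccardBigrams(s1, s2 string) float64 {
    bigrams := func(s string) map[string]bool {
        set := make(map[string]bool)
        for i := 0; i+2 <= len(s); i++ {
            set[s[i:i+2]] = true
        }
        return set
    }
    set1, set2 := bigrams(s1), bigrams(s2)

    intersection := 0
    for b := range set1 {
        if set2[b] {
            intersection++
        }
    }
    union := len(set1) + len(set2) - intersection

    if union == 0 {
        return 0.0
    }
    return float64(intersection) / float64(union)
}
```

Under bigrams, "night" vs "nacht" share only "ht" out of seven distinct bigrams, so the score drops from 3/7 to 1/7 ≈ 0.14, reflecting how little of the ordering the two words share.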
Comparison Table
| Algorithm | Time | Space | Use Case |
|---|---|---|---|
| Levenshtein | $O(mn)$ | $O(mn)$ or $O(n)$ | General edit distance, spell check |
| LCS | $O(mn)$ | $O(mn)$ | Diff tools, DNA sequencing |
| Longest Common Substring | $O(mn)$ | $O(mn)$ | Find common patterns |
| Hamming | $O(n)$ | $O(1)$ | Equal-length strings, error detection |
| Jaro-Winkler | $O(mn)$ | $O(m+n)$ | Record linkage, name matching |
| Damerau-Levenshtein | $O(mn)$ | $O(mn)$ | Typo detection (transpositions) |
| Cosine | $O(m+n)$ | $O(m+n)$ | Document similarity |
| Jaccard | $O(m+n)$ | $O(m+n)$ | Set similarity |
Practical Applications
Spell Checker
func FindClosestWord(word string, dictionary []string, threshold int) []string {
    suggestions := []string{}

    for _, dictWord := range dictionary {
        dist := LevenshteinDistance(word, dictWord)
        if dist <= threshold {
            suggestions = append(suggestions, dictWord)
        }
    }

    return suggestions
}
Fuzzy String Matching
func FuzzyMatch(query string, candidates []string, threshold float64) []string {
    matches := []string{}

    for _, candidate := range candidates {
        similarity := JaroWinklerDistance(query, candidate)
        if similarity >= threshold {
            matches = append(matches, candidate)
        }
    }

    return matches
}
DNA Sequence Alignment
func AlignSequences(seq1, seq2 string) int {
    return LongestCommonSubsequence(seq1, seq2)
}
When to Use Each Algorithm
✅ Levenshtein: General purpose, spell checking
✅ LCS: Version control diffs, plagiarism detection
✅ Hamming: Error detection in equal-length codes
✅ Jaro-Winkler: Name matching, record linkage
✅ Damerau-Levenshtein: Typo detection with transpositions
✅ Cosine: Document similarity, text classification
✅ Jaccard: Set similarity, recommendation systems
Related Snippets
- Binary Search
- Binary Tree
- General Tree
- Heap Data Structure
- Heap Sort
- Linked List Operations
- LRU and LFU Cache
- Merge Sort
- Quadtree
- Quick Sort
- Terrain & Heightmap Generation
- Trie (Prefix Tree)
- Wave Function Collapse