Similar String Groups: Union-Find Explained
Similar String Groups is LeetCode problem 839, rated hard. You are given a list of strings that are all anagrams of each other. Your goal is to count how many groups these strings form, where two strings belong to the same group if you can transform one into the other through a chain of single swaps.
The problem
Two strings are "similar" if you can swap exactly two characters in one string to produce the other (or if they are already equal). A group of similar strings is a set where every string is connected to every other string through a chain of similarities. Given an array of anagram strings, return the number of groups.
Input: strs = ["tars","rats","arts","star"]
Output: 2
"tars" and "rats" are similar (swap positions 0 and 1). "tars" and "arts" are also similar (swap positions 0 and 1). "rats" and "arts" are similar too (swap positions 0 and 2). But "star" is not similar to any of the other three, because it differs by more than a single swap from each. That gives us two groups: ("tars", "rats", "arts") and ("star").
The key insight: Union-Find on swap similarity
This is a connected components problem. Each string is a node. If two strings are similar, there is an edge between them. The answer is the number of connected components.
Union-Find is a natural fit. You iterate through all pairs of strings, check whether they are similar, and if so, union them. At the end, count the number of distinct roots.
The similarity check is the crucial subroutine. Compare two strings character by character and count the positions where they differ. If the count is exactly 2, the strings are similar (swapping those two positions in one string produces the other, since all strings are anagrams). If the count is 0, the strings are identical and therefore similar. Any other count means they are not similar.
The similarity check runs in O(m) time where m is the string length. You can short-circuit early: as soon as you find a third mismatch, return false immediately. This optimization matters when strings are long.
Step-by-step walkthrough
Let's trace the Union-Find process on the example ["tars", "rats", "arts", "star"]. We compare all pairs, check similarity, and union when appropriate.
Step 1: Initialize Union-Find
Each string starts as its own component. parent = [0, 1, 2, 3]. Strings: 0=tars, 1=rats, 2=arts, 3=star.
Step 2: Compare "tars" and "rats"
Exactly 2 positions differ (t/r at 0, a/a match, r/r match, s/s match... actually r/a at pos 1). Swap makes them equal. union(0, 1), parent[1] = 0.
Step 3: Compare "tars" and "arts"
"tars" vs "arts": positions 0 and 1 differ (t/a and a/t). Swapping gives a match. union(0, 2), parent[2] = 0.
Step 4: Compare "tars" and "star"
All 4 positions differ. More than 2 differences means no single swap can transform one into the other. Skip.
Step 5: Compare "rats" and "arts"
Positions 0 and 2 differ (r/a and t/t... r/a at 0, a/r at 2). Similar, but find(1) = find(2) = 0 already. Same component, no change.
Step 6: Remaining pairs (rats-star, arts-star)
Neither "rats" nor "arts" is similar to "star" (4 differences each). "star" stays isolated. Final: 2 distinct components.
The code
class UnionFind:
def __init__(self, n):
self.parent = list(range(n))
self.rank = [0] * n
def find(self, x):
if self.parent[x] != x:
self.parent[x] = self.find(self.parent[x])
return self.parent[x]
def union(self, x, y):
px, py = self.find(x), self.find(y)
if px == py:
return
if self.rank[px] < self.rank[py]:
px, py = py, px
self.parent[py] = px
if self.rank[px] == self.rank[py]:
self.rank[px] += 1
def is_similar(a, b):
diff = 0
for x, y in zip(a, b):
if x != y:
diff += 1
if diff > 2:
return False
return diff == 0 or diff == 2
def num_similar_groups(strs):
n = len(strs)
uf = UnionFind(n)
for i in range(n):
for j in range(i + 1, n):
if is_similar(strs[i], strs[j]):
uf.union(i, j)
return len(set(uf.find(i) for i in range(n)))
The is_similar function is the heart of the solution. It walks through both strings in parallel, counting mismatches. The early exit when diff > 2 prevents wasted work on clearly dissimilar strings. After comparing all pairs, we count unique roots by calling find on every index and collecting the results into a set.
Notice that we call uf.find(i) in the final count, not just uf.parent[i]. Without path compression in the final pass, some nodes might still point to intermediate parents rather than the true root.
Complexity analysis
| Approach | Time | Space |
|---|---|---|
| DFS/BFS | O(n^2 * m) | O(n * m) |
| Union-Find | O(n^2 * m * alpha(n)) | O(n) |
Here n is the number of strings and m is the length of each string. The O(n^2) comes from checking all pairs. Each similarity check costs O(m). The alpha(n) factor from Union-Find is the inverse Ackermann function, which is effectively constant for all practical inputs. The space advantage of Union-Find is notable: you only store the parent and rank arrays of size n, while DFS/BFS may need to store an adjacency list of size O(n^2).
The building blocks
Union-Find with path compression
The Union-Find template here is identical to what you see in problems like Number of Provinces, Accounts Merge, and Redundant Connection. The find function uses path compression to flatten the tree on each lookup, making future queries nearly O(1). The union function uses rank to keep the tree balanced by attaching the shorter tree under the taller one.
What varies between problems is the "when to union" logic. In Accounts Merge, you union when two accounts share an email. In Redundant Connection, you union when processing each edge. Here, you union when two strings are similar. The Union-Find skeleton stays the same.
Similarity checking as edge detection
The is_similar function is effectively an adjacency check. Instead of being given an explicit graph, you are given raw data and must determine connectivity yourself. This pattern appears in many Union-Find problems: the challenge is not the data structure itself, but figuring out which elements should be connected.
Edge cases
- All strings identical: Every pair is similar (0 differences). Everything unions into one group. Answer: 1.
- All strings unique by more than one swap: No pair is similar. Every string is its own group. Answer: n.
- Two strings only: Just one comparison. Either 1 group (similar) or 2 groups (not similar).
- Duplicate strings in the array: Two copies of the same string are similar (0 differences), so they belong to the same group. The algorithm handles this correctly.
- Very long strings: The early exit in
is_similar(breaking when diff exceeds 2) prevents scanning the entire string when the answer is obvious early on.
From understanding to recall
You have seen how Similar String Groups reduces to a connected components problem once you define "similarity" as the edge relation. The Union-Find handles the grouping, and the is_similar function handles the edge detection. The combination is clean and efficient.
The tricky parts that are easy to forget under pressure: the similarity check must handle both the 0-difference case (identical strings) and the 2-difference case, the early exit optimization matters for performance, and the final count requires calling find on every node rather than just reading the parent array. These are exactly the kind of details that fade from memory without practice.
Spaced repetition locks in the full pipeline. You practice writing the Union-Find template, the similarity check with its early exit, the nested loop over pairs, and the final root-counting step. Each review reinforces the pattern until you can produce the complete solution without hesitation.
Related posts
- Number of Provinces - Union-Find for connected components in a graph
- Accounts Merge - Union-Find to group related accounts
- Redundant Connection - Union-Find to detect cycles