Short Encoding of Words: Suffix Deduplication with Tries
You are given a list of words, and you need to encode them into a single reference string. The encoding uses # as a separator. A word can be "covered" by another word if it appears as a suffix. For example, "me" is a suffix of "time", so the encoding "time#" already covers both words. Your job is to find the shortest possible encoding string.
This is LeetCode 820: Short Encoding of Words, and it is a clean application of suffix deduplication. Given words = ["time", "me", "bell"], the answer is 10 because the encoding "time#bell#" covers all three words. "me" does not need its own entry since it is a suffix of "time".
Why this problem matters
Suffix deduplication shows up in real systems more often than you might expect. Compilers use suffix sharing in string tables. DNS systems compress domain names by pointing to shared suffixes. File systems and serialization formats use similar tricks to reduce redundant storage.
This problem isolates the core idea: when you have a collection of strings and some are suffixes of others, you can eliminate the shorter ones. The remaining strings, the ones that are not suffixes of any longer string, form the minimal encoding. The challenge is figuring out which words survive and which get absorbed.
The problem also bridges two important patterns. You can solve it with a set-based suffix removal approach (simple and elegant) or with a reverse trie (which gives you deeper insight into the structure). Both approaches teach you something different about string manipulation.
The key insight
A word should be excluded from the encoding if and only if it is a suffix of some other word in the list. The word "me" is a suffix of "time", so "me" gets absorbed. The word "bell" is not a suffix of any other word, so it survives.
The set-based approach makes this concrete: put all words in a set, then for each word, remove all of its suffixes from the set. After processing every word, the set contains only the words that are not suffixes of any other word. The encoding length is the sum of len(word) + 1 for each remaining word (the +1 accounts for the # separator).
The trie approach works differently but arrives at the same answer. If you reverse each word and insert it into a trie, the leaf nodes correspond to words that are not suffixes of any longer word. Non-leaf nodes represent words that are proper suffixes of some longer word. You count only the leaf nodes.
Solution approach: set-based suffix removal
The cleanest solution avoids building a trie entirely. Instead, it uses a set to track which words survive:
- Add all words to a set.
- For each word, generate all of its proper suffixes (
word[1:],word[2:], etc.) and remove them from the set. - The remaining words are the ones that no other word covers.
- Sum up
len(word) + 1for each remaining word.
def minimumLengthEncoding(words):
word_set = set(words)
for word in words:
for i in range(1, len(word)):
word_set.discard(word[i:])
return sum(len(word) + 1 for word in word_set)
This works because discard is safe to call even if the suffix is not in the set. After the loop, only words that are not a suffix of any other word remain. Each surviving word needs len(word) + 1 characters in the encoding (the word itself plus the # delimiter).
Alternative: the trie approach
If you reverse each word and insert it into a trie, words that are suffixes of longer words will share a path from the root. The word "me" reversed is "em", and "time" reversed is "emit". When you insert "em" and then "emit", the path for "em" is a prefix of the path for "emit". That means "em" (which is "me" reversed) ends at a non-leaf node.
Only leaf nodes in the reverse trie correspond to words that need their own entry in the encoding. You walk the trie and count the depth of each leaf node plus 1 (for the #).
class TrieNode:
def __init__(self):
self.children = {}
def minimumLengthEncoding(words):
root = TrieNode()
nodes = {}
for word in set(words):
node = root
for ch in reversed(word):
if ch not in node.children:
node.children[ch] = TrieNode()
node = node.children[ch]
nodes[node] = len(word) + 1
return sum(depth for node, depth in nodes.items() if not node.children)
The trie approach is more involved, but it reveals the underlying structure. If you are studying trie patterns, it is worth implementing both solutions to see how they connect.
Visual walkthrough
Let's trace through the set-based approach step by step with words = ["time", "me", "bell"].
Step 1: Start with all words in a set
Add every word to a set. This set tracks which words survive as independent entries in the encoding.
Step 2: Process "time". Remove all suffixes from the set.
"me" is a suffix of "time", so it does not need its own entry. Remove it. "ime" and "e" are not in the set, so nothing else changes.
Step 3: Process "me". Remove all suffixes from the set.
"me" only has one proper suffix ("e"), which is not in the set. No changes.
Step 4: Process "bell". Remove all suffixes from the set.
None of the suffixes of "bell" are in the word set. "bell" survives as an independent word.
Step 5: Calculate the encoding length
Each surviving word contributes its length plus 1 (for the "#" separator). The answer is 10, corresponding to the encoding string "time#bell#".
The critical moment is Step 2. When we process "time", we check its suffixes: "ime", "me", and "e". The suffix "me" is in the set, so we remove it. After that, "me" is gone and will not contribute to the encoding length. The remaining words "time" and "bell" each contribute their length plus 1, giving us 5 + 5 = 10.
Complexity analysis
Let n be the number of words and m be the average word length.
| Metric | Set-based | Trie-based |
|---|---|---|
| Time | O(n * m^2) | O(n * m) |
| Space | O(n * m) | O(n * m) |
Time for set-based: For each of the n words, you generate up to m suffixes. Each suffix creation is O(m) due to string slicing and set lookup. That gives O(n * m^2). In practice this is fast because word lengths are bounded (the constraint says 1 <= words[i].length <= 7).
Time for trie-based: Each word is inserted in O(m) time, so the total is O(n * m).
Space for both approaches is O(n * m) to store the words (in a set or in the trie nodes).
The building blocks
This problem is built from two reusable pieces that CodeBricks drills independently:
1. Suffix enumeration
The core loop that generates all proper suffixes of a word:
for i in range(1, len(word)):
suffix = word[i:]
This is a simple but important pattern. You are slicing the word from each starting position to generate every possible suffix. The same suffix enumeration shows up in problems involving suffix arrays, rolling hashes, and string matching. Once you can write this loop from memory, you can apply it to any problem that asks "is one string a suffix of another?"
2. Set-based deduplication
The idea of using a set to track candidates and then removing entries that are "covered" by other entries:
word_set = set(words)
for word in words:
for i in range(1, len(word)):
word_set.discard(word[i:])
This pattern of "add everything, then remove what is redundant" appears in interval merging, duplicate removal, and greedy elimination problems. The discard method is key here because it does not raise an error if the element is missing.
Edge cases
Before submitting, check these:
- All words are the same:
["me", "me", "me"]should give3(just"me#"). The set deduplicates identical words automatically. - No suffix relationships:
["abc", "def", "ghi"]gives12. Every word survives independently. - Chain of suffixes:
["time", "ime", "me", "e"]gives5. Only"time"survives because each shorter word is a suffix of a longer one. - Single word:
["hello"]gives6. Just the word plus the#. - Single character words:
["a", "b", "c"]gives6. None is a suffix of another (they are all length 1), so each gets its own entry.
From understanding to recall
You have read through the short encoding of words solution. You see how the set-based approach removes suffixes and how the trie approach identifies leaf nodes. But can you write the solution from scratch in five minutes?
That is the real test. The set-based solution has a subtle detail: you iterate range(1, len(word)) to avoid removing the word itself (only proper suffixes). The trie solution requires you to insert reversed words and then filter for leaf nodes. These are small details, but they are the kind that trip you up under pressure if you have not drilled them.
Spaced repetition closes that gap. You practice writing the suffix removal loop and the set deduplication pattern from memory at increasing intervals. After a few reps, the pattern is automatic. You see a suffix-related problem and the code flows out without hesitation.
The suffix enumeration pattern and the set-based deduplication pattern are two of roughly 60 reusable building blocks that cover hundreds of LeetCode problems. Learning them individually and drilling them with spaced repetition is far more effective than grinding random problems and hoping the patterns stick.
Related posts
- Implement Trie - The foundational trie data structure. If you want to understand the trie-based approach to this problem, start here.
- Longest Common Prefix - Another string problem that benefits from understanding shared prefixes and how tries organize them.
- Design Add and Search Words - A trie problem that combines prefix trees with DFS. Good follow-up if you want to deepen your trie skills.
CodeBricks breaks the short encoding of words problem into its suffix-enumeration and set-deduplication building blocks, then drills them independently with spaced repetition. You type each piece from scratch until the pattern is automatic. When a suffix deduplication problem shows up in your interview, you do not think about it. You just write it.