Skip to content
← All posts

Short Encoding of Words: Suffix Deduplication with Tries

7 min read
leetcodeproblemmediumstringstrie

You are given a list of words, and you need to encode them into a single reference string. The encoding uses # as a separator. A word can be "covered" by another word if it appears as a suffix. For example, "me" is a suffix of "time", so the encoding "time#" already covers both words. Your job is to find the shortest possible encoding string.

This is LeetCode 820: Short Encoding of Words, and it is a clean application of suffix deduplication. Given words = ["time", "me", "bell"], the answer is 10 because the encoding "time#bell#" covers all three words. "me" does not need its own entry since it is a suffix of "time".

words:"time""me""bell"suffix ofsurvivesabsorbedReverse Trie (insert reversed words)elmlietbrootelm"me" ends hereliet"time" ends (leaf)b"bell" ends (leaf)encoding:time#bell#length = 10
Words "time", "me", and "bell". Since "me" is a suffix of "time", the encoding only needs "time#bell#". The reverse trie shows "me" (ending at node m) is not a leaf, so it gets absorbed.

Why this problem matters

Suffix deduplication shows up in real systems more often than you might expect. Compilers use suffix sharing in string tables. DNS systems compress domain names by pointing to shared suffixes. File systems and serialization formats use similar tricks to reduce redundant storage.

This problem isolates the core idea: when you have a collection of strings and some are suffixes of others, you can eliminate the shorter ones. The remaining strings, the ones that are not suffixes of any longer string, form the minimal encoding. The challenge is figuring out which words survive and which get absorbed.

The problem also bridges two important patterns. You can solve it with a set-based suffix removal approach (simple and elegant) or with a reverse trie (which gives you deeper insight into the structure). Both approaches teach you something different about string manipulation.

The key insight

A word should be excluded from the encoding if and only if it is a suffix of some other word in the list. The word "me" is a suffix of "time", so "me" gets absorbed. The word "bell" is not a suffix of any other word, so it survives.

The set-based approach makes this concrete: put all words in a set, then for each word, remove all of its suffixes from the set. After processing every word, the set contains only the words that are not suffixes of any other word. The encoding length is the sum of len(word) + 1 for each remaining word (the +1 accounts for the # separator).

The trie approach works differently but arrives at the same answer. If you reverse each word and insert it into a trie, the leaf nodes correspond to words that are not suffixes of any longer word. Non-leaf nodes represent words that are proper suffixes of some longer word. You count only the leaf nodes.

Solution approach: set-based suffix removal

The cleanest solution avoids building a trie entirely. Instead, it uses a set to track which words survive:

  1. Add all words to a set.
  2. For each word, generate all of its proper suffixes (word[1:], word[2:], etc.) and remove them from the set.
  3. The remaining words are the ones that no other word covers.
  4. Sum up len(word) + 1 for each remaining word.
def minimumLengthEncoding(words):
    word_set = set(words)
    for word in words:
        for i in range(1, len(word)):
            word_set.discard(word[i:])
    return sum(len(word) + 1 for word in word_set)

This works because discard is safe to call even if the suffix is not in the set. After the loop, only words that are not a suffix of any other word remain. Each surviving word needs len(word) + 1 characters in the encoding (the word itself plus the # delimiter).

Alternative: the trie approach

If you reverse each word and insert it into a trie, words that are suffixes of longer words will share a path from the root. The word "me" reversed is "em", and "time" reversed is "emit". When you insert "em" and then "emit", the path for "em" is a prefix of the path for "emit". That means "em" (which is "me" reversed) ends at a non-leaf node.

Only leaf nodes in the reverse trie correspond to words that need their own entry in the encoding. You walk the trie and count the depth of each leaf node plus 1 (for the #).

class TrieNode:
    def __init__(self):
        self.children = {}

def minimumLengthEncoding(words):
    root = TrieNode()
    nodes = {}
    for word in set(words):
        node = root
        for ch in reversed(word):
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        nodes[node] = len(word) + 1

    return sum(depth for node, depth in nodes.items() if not node.children)

The trie approach is more involved, but it reveals the underlying structure. If you are studying trie patterns, it is worth implementing both solutions to see how they connect.

Visual walkthrough

Let's trace through the set-based approach step by step with words = ["time", "me", "bell"].

Step 1: Start with all words in a set

words = {"time", "me", "bell"}

Add every word to a set. This set tracks which words survive as independent entries in the encoding.

Step 2: Process "time". Remove all suffixes from the set.

suffixes: "ime", "me", "e" -- discard "me" from set

"me" is a suffix of "time", so it does not need its own entry. Remove it. "ime" and "e" are not in the set, so nothing else changes.

Step 3: Process "me". Remove all suffixes from the set.

suffixes: "e" -- "e" not in set, nothing removed

"me" only has one proper suffix ("e"), which is not in the set. No changes.

Step 4: Process "bell". Remove all suffixes from the set.

suffixes: "ell", "ll", "l" -- none in set, nothing removed

None of the suffixes of "bell" are in the word set. "bell" survives as an independent word.

Step 5: Calculate the encoding length

remaining = {"time", "bell"} => len("time")+1 + len("bell")+1 = 5 + 5 = 10

Each surviving word contributes its length plus 1 (for the "#" separator). The answer is 10, corresponding to the encoding string "time#bell#".

The critical moment is Step 2. When we process "time", we check its suffixes: "ime", "me", and "e". The suffix "me" is in the set, so we remove it. After that, "me" is gone and will not contribute to the encoding length. The remaining words "time" and "bell" each contribute their length plus 1, giving us 5 + 5 = 10.

Complexity analysis

Let n be the number of words and m be the average word length.

MetricSet-basedTrie-based
TimeO(n * m^2)O(n * m)
SpaceO(n * m)O(n * m)

Time for set-based: For each of the n words, you generate up to m suffixes. Each suffix creation is O(m) due to string slicing and set lookup. That gives O(n * m^2). In practice this is fast because word lengths are bounded (the constraint says 1 <= words[i].length <= 7).

Time for trie-based: Each word is inserted in O(m) time, so the total is O(n * m).

Space for both approaches is O(n * m) to store the words (in a set or in the trie nodes).

The building blocks

This problem is built from two reusable pieces that CodeBricks drills independently:

1. Suffix enumeration

The core loop that generates all proper suffixes of a word:

for i in range(1, len(word)):
    suffix = word[i:]

This is a simple but important pattern. You are slicing the word from each starting position to generate every possible suffix. The same suffix enumeration shows up in problems involving suffix arrays, rolling hashes, and string matching. Once you can write this loop from memory, you can apply it to any problem that asks "is one string a suffix of another?"

2. Set-based deduplication

The idea of using a set to track candidates and then removing entries that are "covered" by other entries:

word_set = set(words)
for word in words:
    for i in range(1, len(word)):
        word_set.discard(word[i:])

This pattern of "add everything, then remove what is redundant" appears in interval merging, duplicate removal, and greedy elimination problems. The discard method is key here because it does not raise an error if the element is missing.

Edge cases

Before submitting, check these:

  • All words are the same: ["me", "me", "me"] should give 3 (just "me#"). The set deduplicates identical words automatically.
  • No suffix relationships: ["abc", "def", "ghi"] gives 12. Every word survives independently.
  • Chain of suffixes: ["time", "ime", "me", "e"] gives 5. Only "time" survives because each shorter word is a suffix of a longer one.
  • Single word: ["hello"] gives 6. Just the word plus the #.
  • Single character words: ["a", "b", "c"] gives 6. None is a suffix of another (they are all length 1), so each gets its own entry.

From understanding to recall

You have read through the short encoding of words solution. You see how the set-based approach removes suffixes and how the trie approach identifies leaf nodes. But can you write the solution from scratch in five minutes?

That is the real test. The set-based solution has a subtle detail: you iterate range(1, len(word)) to avoid removing the word itself (only proper suffixes). The trie solution requires you to insert reversed words and then filter for leaf nodes. These are small details, but they are the kind that trip you up under pressure if you have not drilled them.

Spaced repetition closes that gap. You practice writing the suffix removal loop and the set deduplication pattern from memory at increasing intervals. After a few reps, the pattern is automatic. You see a suffix-related problem and the code flows out without hesitation.

The suffix enumeration pattern and the set-based deduplication pattern are two of roughly 60 reusable building blocks that cover hundreds of LeetCode problems. Learning them individually and drilling them with spaced repetition is far more effective than grinding random problems and hoping the patterns stick.

Related posts

  • Implement Trie - The foundational trie data structure. If you want to understand the trie-based approach to this problem, start here.
  • Longest Common Prefix - Another string problem that benefits from understanding shared prefixes and how tries organize them.
  • Design Add and Search Words - A trie problem that combines prefix trees with DFS. Good follow-up if you want to deepen your trie skills.

CodeBricks breaks the short encoding of words problem into its suffix-enumeration and set-deduplication building blocks, then drills them independently with spaced repetition. You type each piece from scratch until the pattern is automatic. When a suffix deduplication problem shows up in your interview, you do not think about it. You just write it.