Delete Duplicate Folders in System: Trie Serialization and Hash-Based Dedup
You are given a list of folder paths, where each path is a list of strings representing the chain of folders from the root. Two folders are considered identical if they have the exact same subtree structure, meaning the same set of children with the same subtrees recursively. If any two folders anywhere in the tree are identical, you must delete all of them. Return the remaining paths after deletion.
This is LeetCode 1948: Delete Duplicate Folders in System, and it combines three important techniques: building a trie from paths, serializing subtrees to detect structural duplicates, and using a hash map to find which subtrees appear more than once. It is one of the harder trie problems you will encounter, but each individual piece is a pattern you can learn and drill.
The problem
Given paths = [["a"],["c"],["d"],["a","b"],["c","b"],["d","a"],["d","a","b"]], build the folder tree and find all folders with duplicate subtree structures. In this example, folders a, c, and d/a each have a single child b with no further children. Their subtree structures are identical, so all three get deleted. Folder d had a unique subtree structure (child a with grandchild b), but once d/a is deleted, d becomes a leaf. Since its original subtree (a(b)) was unique, d itself is not deleted. The answer is [["d"]].
The approach
The solution has four phases:
-
Build the trie. Insert every path into a trie where each node represents a folder. Shared prefixes reuse existing nodes, just like a standard trie.
-
Serialize each subtree. Run a post-order DFS from the root. For each node, build a string that encodes its entire subtree structure. A leaf node gets an empty string. A node with children
bandc(both leaves) gets"(b)(c)"with children sorted alphabetically. This serialization uniquely identifies the shape of each subtree. -
Find duplicates via hash map. Store each non-empty serialization in a hash map that maps the serial string to a list of nodes. Any serial that appears two or more times means those nodes are duplicates and must be deleted.
-
Collect remaining paths. Run another DFS from the root, skipping any node marked for deletion. Build the path as you go and add it to the result whenever you visit a surviving node.
The key insight is that subtree serialization reduces the structural comparison problem to a string equality problem. Two subtrees are structurally identical if and only if they produce the same serial string. Once you have strings, a hash map instantly tells you which ones are duplicates.
Visual walkthrough
Step 1: Build the trie from input paths
Each path like ["d","a","b"] creates the chain root -> d -> a -> b. Shared prefixes reuse existing nodes.
Step 2: Serialize each subtree to find structural duplicates
Post-order DFS builds a string for each node. Leaf nodes serialize as empty string. A node with child "b" (leaf) serializes as "(b)".
Step 3: Mark folders with duplicate subtrees for deletion
Any serialization appearing more than once means those folders are duplicates. All three folders with serial "(b)" are marked for deletion.
Step 4: Collect remaining paths from surviving folders
After deletion, only root -> d remains. DFS from root collects the path ["d"]. Final answer: [["d"]].
Notice how the serialization works bottom-up. You cannot compute a node's serial until you know all its children's serials. That is why post-order DFS is the right traversal: you visit children first, build their serials, then combine them into the parent's serial.
The code
class TrieNode:
def __init__(self):
self.children = {}
self.deleted = False
class Solution:
def deleteDuplicateFolder(self, paths: list[list[str]]) -> list[list[str]]:
root = TrieNode()
for path in paths:
node = root
for folder in path:
if folder not in node.children:
node.children[folder] = TrieNode()
node = node.children[folder]
serial_count = {}
def serialize(node):
if not node.children:
return ""
parts = []
for name in sorted(node.children):
child = node.children[name]
child_serial = serialize(child)
parts.append(f"({name}{child_serial})")
serial = "".join(parts)
serial_count[serial] = serial_count.get(serial, 0) + 1
node.serial = serial
return serial
serialize(root)
def mark_deleted(node):
for child in node.children.values():
if hasattr(child, "serial") and child.serial and serial_count.get(child.serial, 0) > 1:
child.deleted = True
else:
mark_deleted(child)
mark_deleted(root)
result = []
def collect(node, path):
for name in sorted(node.children):
child = node.children[name]
if not child.deleted:
path.append(name)
result.append(list(path))
collect(child, path)
path.pop()
collect(root, [])
return result
Let's walk through each phase.
Building the trie: For each path in the input, walk from the root and create nodes for any folders that do not exist yet. This is the standard trie insert loop. After processing all paths, the trie represents the full folder tree.
Serialization: The serialize function runs post-order DFS. For each node, it collects the serials of all children (sorted by name to ensure consistent ordering), wraps each in parentheses with the child's name, and joins them. Sorting is critical because children {"b": ..., "a": ...} and {"a": ..., "b": ...} must produce the same serial. The serial is stored in a counter map.
Marking deletions: After serialization, any serial that appears more than once in serial_count identifies duplicate subtrees. The mark_deleted function walks the tree and flags any node whose serial has a count greater than 1. Once a node is flagged, we skip its entire subtree (no need to recurse further since the whole subtree will be removed).
Collecting results: The final DFS collects paths from all nodes that were not deleted, building the path array as it goes.
Complexity analysis
| Complexity | |
|---|---|
| Time | O(n * L * log(C)) |
| Space | O(n * L) |
Here n is the number of paths, L is the average path length, and C is the maximum number of children any single node has.
Time: Building the trie takes O(n * L). Serialization visits every node once and sorts each node's children, giving O(n * L * log(C)) in total. The hash map operations are O(1) amortized per entry. Collecting results is another O(n * L) pass. The dominant term is the serialization step.
Space: The trie stores all n * L folder nodes. The serialization strings can collectively be O(n * L) in size. The hash map and result list are also bounded by O(n * L).
Building blocks
1. Trie construction from paths
The first building block is inserting paths into a trie. Each path is a list of folder names, and you walk the trie creating nodes as needed. This is the same pattern from Implement Trie, except each "character" is an entire folder name string rather than a single letter.
node = root
for folder in path:
if folder not in node.children:
node.children[folder] = TrieNode()
node = node.children[folder]
This pattern appears whenever you need to build a tree structure from a flat list of paths, whether for file systems, URLs, or hierarchical data.
2. Subtree serialization
The second building block is converting a subtree into a canonical string so you can compare structures. Post-order DFS visits children first, builds their serials, then combines them. Sorting children by name guarantees that two structurally identical subtrees always produce the same string.
parts = []
for name in sorted(node.children):
child_serial = serialize(node.children[name])
parts.append(f"({name}{child_serial})")
serial = "".join(parts)
This serialization technique also appears in problems like finding duplicate subtrees in binary trees (LeetCode 652), where you serialize left and right children to detect structural matches.
3. Hash-based deduplication
The third building block is using a hash map to count occurrences of each serial. Any serial with a count above 1 identifies duplicates. This is the same grouping pattern from Group Anagrams, where you use a canonical key to bucket equivalent items together.
Edge cases
-
All folders are unique: No serialization appears more than once. Nothing gets deleted. Return all original paths.
-
Single-level paths only: If every path has length 1, every root-level folder is a leaf with serial
"". Leaf nodes (empty serial) should not trigger deletion. The problem only considers non-leaf subtrees as candidates for duplication. -
Deeply nested identical branches: Two paths like
["a","b","c","d"]and["x","b","c","d"]share the subtree rooted atb. Bothbnodes (and everything beneath them) get deleted. Make sure your DFS skips the entire subtree once a node is marked. -
Deletion cascading up: When
d/agets deleted in our example,dloses its only child. Butdis not deleted because its original subtree serial"(a(b))"was unique. Deletion decisions are based on the original tree structure, not the post-deletion structure.
From understanding to recall
You have seen how this problem breaks into three clean phases: build the trie, serialize subtrees, and deduplicate with a hash map. Each phase uses a pattern that shows up across many other problems. Trie construction is the same loop from Implement Trie. Subtree serialization is the same idea from Duplicate Subtrees in binary trees. Hash-based grouping is the same technique from Group Anagrams.
The challenge is not understanding any one piece. It is combining all three fluently under time pressure. Spaced repetition lets you drill each building block independently until it is automatic, then practice the combination. After a few reps, the flow from trie to serialization to hash map becomes second nature.
Related posts
- Implement Trie - The foundational trie data structure
- Design Add Search Words - Trie with pattern matching
- Group Anagrams - Hash-based grouping of equivalent structures