Find Duplicate File in System: Grouping by Content with Hash Maps
You are given a list of directory info strings, each containing a root directory followed by one or more files with their content in parentheses. Your job is to find all groups of duplicate files - files that share the exact same content across the entire file system. This is LeetCode 609: Find Duplicate File in System.
Why this problem matters
This problem is a practical take on the grouping pattern. Instead of grouping words by sorted characters (like in Group Anagrams), you are grouping file paths by their content. The twist is that you first need to parse a messy input format before you can apply the pattern. That parsing step is what makes this a medium rather than an easy - the core algorithm is the same group-by-key idea you have seen before.
It also mirrors real-world tasks. Tools like fdupes and rdfind do exactly this: scan a file system, hash file contents, and group duplicates. The difference is that real tools hash the bytes to avoid comparing full contents, but the grouping logic is identical.
The approach
The algorithm breaks down into two parts: parsing and grouping.
Parsing. Each input string looks like "root/a 1.txt(abcd) 2.txt(efgh)". You split on spaces to get the directory (root/a) and the file entries (1.txt(abcd), 2.txt(efgh)). For each file entry, you find the opening parenthesis to separate the filename from the content, then combine the directory and filename into a full path.
Grouping. Once you have (full_path, content) pairs, you drop them into a hash map keyed by content. After processing every path string, you filter the map for entries where the list has more than one file. Those are your duplicate groups.
from collections import defaultdict
def findDuplicate(paths):
content_map = defaultdict(list)
for path in paths:
parts = path.split()
directory = parts[0]
for file_info in parts[1:]:
paren = file_info.index('(')
filename = file_info[:paren]
content = file_info[paren + 1:-1]
full_path = directory + '/' + filename
content_map[content].append(full_path)
return [group for group in content_map.values() if len(group) > 1]
The index('(') call finds the position of the first opening parenthesis. Everything before it is the filename, everything after it (minus the closing paren) is the content. This avoids regex and keeps the parsing fast and readable.
Step-by-step walkthrough
Step 1: Parse each input string into directory, filename, and content.
Split on spaces to get the directory and file entries. For each file entry, find the opening parenthesis to separate the filename from the content.
Step 2: Build a hash map from content to file paths.
Each content string maps to a list of full file paths that contain that content.
Step 3: Filter for groups with more than one file.
A group with only one file has no duplicates. Keep only groups where the list has two or more paths.
Step 4: Return the remaining groups as the answer.
Each inner list is a group of duplicate files. The order of groups and files within groups does not matter.
Complexity analysis
| Metric | Value |
|---|---|
| Time | O(n * k) where n is the number of path strings and k is the average length of each string |
| Space | O(n * k) to store all file paths and content strings in the hash map |
Every character in every input string gets processed exactly once during parsing. The hash map operations (insert and lookup) are O(1) amortized per entry. The final filter pass is linear in the number of unique content strings.
Edge cases to watch for
- One file per directory. Each path string might have only one file entry. The parsing still works since
parts[1:]handles any number of file entries. - No duplicates at all. Every file has unique content. The filter step returns an empty list. No special handling needed.
- All files are duplicates. Every file shares the same content. You get one big group. The algorithm handles this naturally.
- Empty content. A file like
1.txt()has empty content"". Multiple files with empty content will correctly group together under the key"". - Parentheses in content. The problem guarantees content is enclosed in a single pair of parentheses, so
index('(')always finds the right boundary.
The building blocks
This problem is built from two core building blocks:
String parsing
The input format is not a standard data structure - it is a custom string encoding. You need to split it apart before you can do anything useful. The parsing here is simple (split on spaces, find the parenthesis), but the skill of breaking down a messy string into structured data shows up in many problems. Compare this to parsing nested encodings in Decode String or splitting version numbers in Compare Version Numbers.
Group-by-key
The same pattern from Group Anagrams: iterate through a collection, compute a key for each element, and group elements that share the same key. Here the "key" is the file content and the "element" is the full file path. The template is always:
groups = defaultdict(list)
for item in collection:
key = compute_key(item)
groups[key].append(item)
Once you have the group-by-key brick internalized, any problem that says "find items that share a property" becomes a one-pass hash map solution.
From understanding to recall
This problem combines two skills - parsing and grouping - and the risk is that you remember the grouping but forget the parsing details. How do you extract the filename from 1.txt(abcd)? Where does the directory come from? These small details trip people up under pressure.
Spaced repetition helps you lock in both the high-level pattern and the parsing mechanics. After a few review cycles, you will not need to re-derive the index('(') trick or wonder whether parts[0] is the directory or the first file. The details become automatic, so you can focus on adapting the approach to harder variants.
Related posts
- Group Anagrams - The canonical group-by-key problem using sorted characters as the key
- Two Sum - The complement lookup pattern, another core hash map technique
- Find All Duplicates in an Array - A different duplicate-finding strategy using index marking instead of hash maps