Longest Duplicate Substring: Binary Search and Rolling Hash
Longest Duplicate Substring is LeetCode 1044. Given a string s, you need to find the longest substring that appears at least twice. If no such substring exists, return an empty string. This is a hard problem, but it becomes very manageable once you see that it combines two well-known techniques: binary search on the answer and Rabin-Karp rolling hash.
For example, given s = "banana", the answer is "ana", which appears starting at index 1 and again at index 3.
Why this problem matters
This problem is a clean example of combining binary search on the answer with hashing. The binary search narrows down the length of the duplicate you are looking for, and the rolling hash efficiently checks whether any duplicate of that length exists. These two techniques individually appear across dozens of problems, and this problem is one of the best places to see them work together.
The key insight
If a duplicate substring of length L exists, then a duplicate of length L - 1 also exists. Why? Take the two copies of the length-L duplicate and chop off their last character. You now have two identical substrings of length L - 1. This means the "feasibility" of finding a duplicate is monotonic with respect to length: there is some maximum length where duplicates exist, and every shorter length also has duplicates.
That monotonicity is exactly what binary search needs. You can binary search on the length of the duplicate substring. For each candidate length, you need to check whether any substring of that length appears at least twice. A brute force check would compare all pairs of substrings, but Rabin-Karp rolling hash lets you do it in O(n) time by computing a hash for each substring in a sliding window and looking for collisions.
The solution
def longestDupSubstring(s):
n = len(s)
MOD = (1 << 61) - 1
BASE = 131
def check(length):
if length == 0:
return ""
h = 0
power = pow(BASE, length, MOD)
seen = {}
for i in range(length):
h = (h * BASE + ord(s[i])) % MOD
seen[h] = [0]
for i in range(1, n - length + 1):
h = (h * BASE - ord(s[i - 1]) * power + ord(s[i + length - 1])) % MOD
if h in seen:
candidate = s[i:i + length]
for j in seen[h]:
if s[j:j + length] == candidate:
return candidate
seen[h].append(i)
else:
seen[h] = [i]
return ""
lo, hi = 0, n - 1
result = ""
while lo <= hi:
mid = (lo + hi) // 2
found = check(mid)
if found:
result = found
lo = mid + 1
else:
hi = mid - 1
return result
The outer loop is the binary search. lo and hi define the range of candidate lengths. At each step, check(mid) uses a rolling hash to scan every substring of length mid and returns the first duplicate it finds (or an empty string if none exists).
Inside check, the hash is computed for the first window of size length. Then, for each subsequent position, the hash is updated in O(1) by removing the contribution of the leftmost character and adding the new rightmost character. This is the classic Rabin-Karp sliding window.
When a hash collision occurs, the code verifies the actual strings match. This handles the rare case of two different substrings producing the same hash value.
The modulus (1 << 61) - 1 is a Mersenne prime, which gives a large hash space and makes collisions extremely unlikely. The base 131 is chosen because it is a prime larger than the size of the ASCII alphabet.
The binary search on answer pattern works whenever there is monotonicity in feasibility. Ask yourself: "If this value works, does every smaller (or larger) value also work?" If yes, you can binary search on the answer instead of trying every possibility.
Visual walkthrough
Let's trace through s = "banana". The binary search tries different lengths, and for each length the rolling hash scans the string.
Step 1: Binary search on length. lo=1, hi=5, mid=3. Check if any substring of length 3 appears twice.
Binary search picks length 3 as the first candidate to check.
Step 2: Slide the rolling hash. Window moves to 'ana' (i=1). Compute hash in O(1).
Remove 'b' from hash, add 'a'. The rolling hash updates in constant time.
Step 3: Window moves to 'nan' (i=2). Hash computed, no collision.
Three distinct hashes so far. No duplicate of length 3 found yet.
Step 4: Window moves to 'ana' (i=3). Hash matches h2! Verify the actual strings match.
Duplicate of length 3 found: 'ana'. Since length 3 works, try longer. Set lo = mid + 1 = 4.
Step 5: lo=4, hi=5, mid=4. Check length 4. Substrings: 'bana', 'anan', 'nana'. All distinct hashes.
No duplicate of length 4. Set hi = mid - 1 = 3. Now lo > hi, search ends.
Result: The longest duplicate substring is 'ana' (length 3).
Binary search narrowed the length. Rolling hash checked each length in O(n). Total: O(n log n).
The binary search tried only two lengths (3 and 4) instead of checking all five possible lengths. For each length, the rolling hash scanned the string in linear time. The combination keeps the total work to O(n log n).
Complexity analysis
| Approach | Time | Space |
|---|---|---|
| Binary search + rolling hash | O(n log n) | O(n) |
The binary search runs O(log n) iterations. Each iteration calls check, which scans the string once in O(n) time, computing and storing hashes for each window position. That gives O(n log n) total time.
The space is O(n) for the hash map that stores starting indices for each hash value.
One caveat: hash collisions can degrade performance. If many substrings produce the same hash, the string comparison fallback could push a single check call toward O(n^2). In practice, with a large prime modulus like (1 << 61) - 1, collisions are extremely rare and the expected time remains O(n log n).
The building blocks
1. Binary search on answer
When the answer itself lives in a range and feasibility is monotonic, you do not enumerate every candidate. You binary search the range. The template looks like this:
lo, hi = min_possible, max_possible
result = default
while lo <= hi:
mid = (lo + hi) // 2
if is_feasible(mid):
result = mid
lo = mid + 1
else:
hi = mid - 1
You have already seen this pattern in Koko Eating Bananas (search the speed) and Split Array Largest Sum (search the max sum). Here, you are searching the length of the duplicate substring. Same skeleton, different feasibility check.
2. Rabin-Karp rolling hash
The rolling hash lets you compute the hash of a sliding window in O(1) per step. The idea is to treat the substring as a number in some base, take it modulo a large prime, and update it incrementally as the window slides:
h = (h * BASE - old_char * power + new_char) % MOD
Removing the old character and adding the new one is just arithmetic. No need to rehash the entire window. This is the same technique used in Rabin-Karp string matching, and it is what makes the per-length check O(n) instead of O(n^2).
Edge cases
- All identical characters:
s = "aaaa". The longest duplicate is"aaa"(appears at index 0 and index 1). The rolling hash finds it quickly because every window of the same length has the same hash. - No duplicates:
s = "abcd". Every substring is unique. The binary search narrows to length 0 and returns"". - Two characters:
s = "aa". The answer is"a". The smallest non-trivial case. - Very long strings: The O(n log n) time and O(n) space scale well. The Mersenne prime modulus keeps hash collisions negligible even for
nup to 100,000.
From understanding to recall
The idea is clean: binary search on the length, rolling hash to check each length. But in an interview, the details matter. What base and modulus do you use? How do you handle hash collisions? How do you update the rolling hash formula? These are exactly the kinds of details that slip away after a few days.
Spaced repetition locks them in. You write the solution from scratch today, again tomorrow, then in three days. After a few rounds, the rolling hash update formula and the binary search boundaries become automatic. You stop rederiving and start writing.
Related posts
- Shortest Palindrome - Rolling hash (KMP/Rabin-Karp) for string matching
- Koko Eating Bananas - Binary search on the answer pattern
- Find the Duplicate Number - Finding duplicates with clever techniques
CodeBricks uses spaced repetition to turn these patterns into reflex. You solve a problem, then the system schedules reviews at increasing intervals. By the time you sit down for an interview, the binary search template and the rolling hash formula are second nature, not something you need to look up.