Skip to content
← All posts

Longest Duplicate Substring: Binary Search and Rolling Hash

6 min read
leetcodeproblemhardstringsbinary-search

Longest Duplicate Substring is LeetCode 1044. Given a string s, you need to find the longest substring that appears at least twice. If no such substring exists, return an empty string. This is a hard problem, but it becomes very manageable once you see that it combines two well-known techniques: binary search on the answer and Rabin-Karp rolling hash.

For example, given s = "banana", the answer is "ana", which appears starting at index 1 and again at index 3.

b0a1n2a3n4a5"ana" at i=1"ana" at i=3
s = "banana". The substring "ana" appears at index 1 and index 3, making it the longest duplicate substring.

Why this problem matters

This problem is a clean example of combining binary search on the answer with hashing. The binary search narrows down the length of the duplicate you are looking for, and the rolling hash efficiently checks whether any duplicate of that length exists. These two techniques individually appear across dozens of problems, and this problem is one of the best places to see them work together.

The key insight

If a duplicate substring of length L exists, then a duplicate of length L - 1 also exists. Why? Take the two copies of the length-L duplicate and chop off their last character. You now have two identical substrings of length L - 1. This means the "feasibility" of finding a duplicate is monotonic with respect to length: there is some maximum length where duplicates exist, and every shorter length also has duplicates.

That monotonicity is exactly what binary search needs. You can binary search on the length of the duplicate substring. For each candidate length, you need to check whether any substring of that length appears at least twice. A brute force check would compare all pairs of substrings, but Rabin-Karp rolling hash lets you do it in O(n) time by computing a hash for each substring in a sliding window and looking for collisions.

The solution

def longestDupSubstring(s):
    n = len(s)
    MOD = (1 << 61) - 1
    BASE = 131

    def check(length):
        if length == 0:
            return ""
        h = 0
        power = pow(BASE, length, MOD)
        seen = {}
        for i in range(length):
            h = (h * BASE + ord(s[i])) % MOD
        seen[h] = [0]
        for i in range(1, n - length + 1):
            h = (h * BASE - ord(s[i - 1]) * power + ord(s[i + length - 1])) % MOD
            if h in seen:
                candidate = s[i:i + length]
                for j in seen[h]:
                    if s[j:j + length] == candidate:
                        return candidate
                seen[h].append(i)
            else:
                seen[h] = [i]
        return ""

    lo, hi = 0, n - 1
    result = ""
    while lo <= hi:
        mid = (lo + hi) // 2
        found = check(mid)
        if found:
            result = found
            lo = mid + 1
        else:
            hi = mid - 1
    return result

The outer loop is the binary search. lo and hi define the range of candidate lengths. At each step, check(mid) uses a rolling hash to scan every substring of length mid and returns the first duplicate it finds (or an empty string if none exists).

Inside check, the hash is computed for the first window of size length. Then, for each subsequent position, the hash is updated in O(1) by removing the contribution of the leftmost character and adding the new rightmost character. This is the classic Rabin-Karp sliding window.

When a hash collision occurs, the code verifies the actual strings match. This handles the rare case of two different substrings producing the same hash value.

The modulus (1 << 61) - 1 is a Mersenne prime, which gives a large hash space and makes collisions extremely unlikely. The base 131 is chosen because it is a prime larger than the size of the ASCII alphabet.

The binary search on answer pattern works whenever there is monotonicity in feasibility. Ask yourself: "If this value works, does every smaller (or larger) value also work?" If yes, you can binary search on the answer instead of trying every possibility.

Visual walkthrough

Let's trace through s = "banana". The binary search tries different lengths, and for each length the rolling hash scans the string.

Step 1: Binary search on length. lo=1, hi=5, mid=3. Check if any substring of length 3 appears twice.

bananahash('ban') = h1 → store in seen

Binary search picks length 3 as the first candidate to check.

Step 2: Slide the rolling hash. Window moves to 'ana' (i=1). Compute hash in O(1).

bananahash('ana') = h2 → store in seen

Remove 'b' from hash, add 'a'. The rolling hash updates in constant time.

Step 3: Window moves to 'nan' (i=2). Hash computed, no collision.

bananahash('nan') = h3 → store in seen

Three distinct hashes so far. No duplicate of length 3 found yet.

Step 4: Window moves to 'ana' (i=3). Hash matches h2! Verify the actual strings match.

bananahash('ana') = h2 → collision! s[1:4] == s[3:6] confirmed

Duplicate of length 3 found: 'ana'. Since length 3 works, try longer. Set lo = mid + 1 = 4.

Step 5: lo=4, hi=5, mid=4. Check length 4. Substrings: 'bana', 'anan', 'nana'. All distinct hashes.

bananaNo hash collision for any length-4 window

No duplicate of length 4. Set hi = mid - 1 = 3. Now lo > hi, search ends.

Result: The longest duplicate substring is 'ana' (length 3).

banana

Binary search narrowed the length. Rolling hash checked each length in O(n). Total: O(n log n).

The binary search tried only two lengths (3 and 4) instead of checking all five possible lengths. For each length, the rolling hash scanned the string in linear time. The combination keeps the total work to O(n log n).

Complexity analysis

ApproachTimeSpace
Binary search + rolling hashO(n log n)O(n)

The binary search runs O(log n) iterations. Each iteration calls check, which scans the string once in O(n) time, computing and storing hashes for each window position. That gives O(n log n) total time.

The space is O(n) for the hash map that stores starting indices for each hash value.

One caveat: hash collisions can degrade performance. If many substrings produce the same hash, the string comparison fallback could push a single check call toward O(n^2). In practice, with a large prime modulus like (1 << 61) - 1, collisions are extremely rare and the expected time remains O(n log n).

The building blocks

1. Binary search on answer

When the answer itself lives in a range and feasibility is monotonic, you do not enumerate every candidate. You binary search the range. The template looks like this:

lo, hi = min_possible, max_possible
result = default
while lo <= hi:
    mid = (lo + hi) // 2
    if is_feasible(mid):
        result = mid
        lo = mid + 1
    else:
        hi = mid - 1

You have already seen this pattern in Koko Eating Bananas (search the speed) and Split Array Largest Sum (search the max sum). Here, you are searching the length of the duplicate substring. Same skeleton, different feasibility check.

2. Rabin-Karp rolling hash

The rolling hash lets you compute the hash of a sliding window in O(1) per step. The idea is to treat the substring as a number in some base, take it modulo a large prime, and update it incrementally as the window slides:

h = (h * BASE - old_char * power + new_char) % MOD

Removing the old character and adding the new one is just arithmetic. No need to rehash the entire window. This is the same technique used in Rabin-Karp string matching, and it is what makes the per-length check O(n) instead of O(n^2).

Edge cases

  • All identical characters: s = "aaaa". The longest duplicate is "aaa" (appears at index 0 and index 1). The rolling hash finds it quickly because every window of the same length has the same hash.
  • No duplicates: s = "abcd". Every substring is unique. The binary search narrows to length 0 and returns "".
  • Two characters: s = "aa". The answer is "a". The smallest non-trivial case.
  • Very long strings: The O(n log n) time and O(n) space scale well. The Mersenne prime modulus keeps hash collisions negligible even for n up to 100,000.

From understanding to recall

The idea is clean: binary search on the length, rolling hash to check each length. But in an interview, the details matter. What base and modulus do you use? How do you handle hash collisions? How do you update the rolling hash formula? These are exactly the kinds of details that slip away after a few days.

Spaced repetition locks them in. You write the solution from scratch today, again tomorrow, then in three days. After a few rounds, the rolling hash update formula and the binary search boundaries become automatic. You stop rederiving and start writing.

Related posts

CodeBricks uses spaced repetition to turn these patterns into reflex. You solve a problem, then the system schedules reviews at increasing intervals. By the time you sit down for an interview, the binary search template and the rolling hash formula are second nature, not something you need to look up.