Skip to content
← All posts

Repeated DNA Sequences: Sliding Window with a Hash Set

6 min read
leetcodeproblemmediumstringshash-mapbit-manipulation

LeetCode Repeated DNA Sequences (problem 187, medium) asks you to find all 10-letter-long subsequences in a DNA string that appear more than once. The string s only contains the characters 'A', 'C', 'G', and 'T'.

Example 1: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT" returns ["AAAAACCCCC", "CCCCCAAAAA"].

Example 2: s = "AAAAAAAAAAAAA" returns ["AAAAAAAAAA"].

At first glance, this looks like it could require comparing every pair of substrings. But there is a much cleaner approach: slide a fixed-size window across the string and use a hash set to detect repeats in O(1) per lookup.

A0A1A2A3A4C5C6C7C8C9A10A11A12A13A14C15C16C17C18C19C20"AAAAACCCCC" (first seen at i=0)"CCCCCAAAAA" (first seen at i=5)Two overlapping 10-letter windows slide across the DNA string
A sliding window of size 10 moves one position at a time. Each substring is checked against a "seen" set.

The approach: fixed-size sliding window + hash set

The core idea is simple:

  1. Slide a window of exactly 10 characters across the string, one position at a time.
  2. At each position, extract the 10-character substring.
  3. If you have seen this substring before, add it to the result.
  4. If not, add it to the "seen" set.

You need two sets: seen to track every substring you have encountered, and result to collect the duplicates. Using a set for result (instead of a list) avoids adding the same duplicate multiple times if it appears three or more times.

The window moves from index 0 to index len(s) - 10. At each position, extracting a 10-character substring is O(10) = O(1), and the hash set lookup is O(1) on average. Total: O(n) time.

Visual walkthrough

Let's trace through s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT". The window is always 10 characters wide. At each step, you check whether the current substring is already in the seen set.

Step 1: Window at index 0. Extract 'AAAAACCCCC'.

AAAAACCCCCAAAAACCCCCstartendseen: { }"AAAAACCCCC" new, add to seen

Set is empty. Add 'AAAAACCCCC' to seen.

Step 2: Window at index 1. Extract 'AAAACCCCCA'.

AAAAACCCCCAAAAACCCCCstartendseen: {1 substring}"AAAACCCCCA" new, add to seen

Not in seen. Add it.

Step 3: Window at index 5. Extract 'CCCCCAAAAA'.

AAAAACCCCCAAAAACCCCCstartendseen: {5 substrings}"CCCCCAAAAA" new, add to seen

Not in seen. Add it. First time seeing this substring.

Step 4: Window at index 10. Extract 'AAAAACCCCC'.

AAAAACCCCCAAAAACCCCCCstartendseen: {2 substrings}"AAAAACCCCC" already seen!result: ["AAAAACCCCC"]

'AAAAACCCCC' IS in seen! Add it to result.

Step 5: Window at index 11. Extract 'AAAACCCCCC'.

AAAAACCCCCAAAAACCCCCCstartendseen: {2 substrings}"AAAACCCCCC" new, add to seenresult: ["AAAAACCCCC"]

Not in seen. Add it. Keep sliding.

Step 6: Window at index 15. Extract 'CCCCCAAAAA'.

CCCCCAAAAACCCCCCAAAAAstartendseen: {2 substrings}"CCCCCCAAAA" already seen!result: ["AAAAACCCCC", "CCCCCAAAAA"]

'CCCCCAAAAA' IS in seen! Add it to result. Final answer: ['AAAAACCCCC', 'CCCCCAAAAA'].

The algorithm finishes when the window reaches the end of the string. Every 10-letter substring that appeared more than once is now in the result set.

Python solution

def find_repeated_dna_sequences(s: str) -> list[str]:
    seen = set()
    result = set()

    for i in range(len(s) - 9):
        substring = s[i:i + 10]
        if substring in seen:
            result.add(substring)
        else:
            seen.add(substring)

    return list(result)

That is the entire solution. Let's walk through why each piece works.

Line by line:

  • seen holds every 10-character substring encountered so far.
  • result holds substrings that appeared at least twice. Using a set here prevents duplicates in the output.
  • The loop runs from i = 0 to i = len(s) - 10. At each index, you extract the substring s[i:i+10].
  • If the substring is already in seen, it is a repeat, so add it to result.
  • If not, add it to seen so future windows can detect it.
  • Return result as a list.

The range(len(s) - 9) ensures the window never extends past the end of the string. When i = len(s) - 10, the slice s[i:i+10] grabs exactly the last 10 characters.

Why seen.add is in the else branch

You might wonder why the code does not always add to seen. If a substring is already in seen, adding it again is harmless (sets ignore duplicate insertions). But putting the seen.add in the else branch makes the logic clearer: you either add to result (duplicate found) or add to seen (first occurrence). Both paths are valid. A common alternative is to always call seen.add(substring) unconditionally and just check before adding to result:

for i in range(len(s) - 9):
    substring = s[i:i + 10]
    if substring in seen:
        result.add(substring)
    seen.add(substring)

Both versions produce the same output.

Complexity analysis

Time: O(n). The window slides across n - 9 positions. At each position, extracting a 10-character substring is O(10) = O(1), and the set lookup/insert is O(1) on average. Total: O(n).

Space: O(n). In the worst case, every 10-character window is unique, so the seen set stores up to n - 9 strings. Each string is 10 characters long, so the total memory is O(10 * n) = O(n).

ApproachTimeSpace
Brute force (compare all pairs)O(n^2)O(1)
Sliding window + hash setO(n)O(n)

For this problem, there is also a bit-manipulation approach where you encode each 10-letter sequence as a 20-bit integer (2 bits per character). This reduces the memory per substring from 10 characters to a single integer, and avoids the cost of hashing strings. The time complexity stays O(n), but the constant factor improves. The hash set approach above is the one to reach for in interviews since it is simpler and generalizes to any substring length.

The building blocks

This problem decomposes into two reusable pieces that CodeBricks drills independently:

1. Fixed-size sliding window

Unlike variable-size sliding windows (where you grow and shrink based on a condition), this window is always exactly 10 characters wide. You do not need two pointers. You just iterate i from 0 to len(s) - 10 and extract s[i:i+10].

for i in range(len(s) - 9):
    window = s[i:i + 10]

This same fixed-window pattern appears in problems like "Find All Anagrams in a String," "Minimum Size Subarray Sum" (when the window is fixed), and any problem that asks about substrings or subarrays of a specific length. The window size changes between problems, but the iteration structure stays identical.

2. Set-based duplicate detection

You maintain a set that grows as you process elements. Before processing a new item, check if it is already in the set. If yes, you have a duplicate. If no, add it and continue.

seen = set()
for item in collection:
    if item in seen:
        # duplicate found
    else:
        seen.add(item)

This is the same building block used in Contains Duplicate, Happy Number, and many other problems. The only difference here is that the "item" being tracked is a 10-character string instead of a single integer. The pattern is identical.

These two pieces snap together cleanly in this problem: the fixed-size window generates the substrings, and the seen-set detects which ones repeat.

Edge cases

Before submitting, verify your solution handles:

  • String shorter than 10 characters: s = "ACGT" should return []. The loop range(len(s) - 9) produces range(-5), which is empty. No iterations, no results. Handled automatically.
  • All same characters: s = "AAAAAAAAAAAAA" (13 A's). The 10-character window "AAAAAAAAAA" appears at indices 0, 1, 2, and 3. The first occurrence goes into seen. The second occurrence triggers adding it to result. The third and fourth occurrences find it already in result, so the set ignores the duplicate insertion. Output: ["AAAAAAAAAA"].
  • No repeats: s = "ACGTACGTAC" (exactly 10 characters). Only one window exists, so there can be no duplicates. Output: [].
  • Every window is the same: s = "AAAAAAAAAA" (exactly 10 A's). Only one window, no duplicates possible. Output: [].

From understanding to recall

Reading through this solution is the easy part. The real challenge is reproducing it from scratch when you encounter a similar problem a week from now. The fixed-size sliding window and set-based duplicate detection are building blocks that appear in dozens of LeetCode problems. If you can write them from memory, you will recognize where they fit the moment you read a new problem statement.

That is what spaced repetition is for. You drill the fixed-window extraction and the seen-set check as individual building blocks, typing them from scratch at increasing intervals. After a few review cycles, the code is in your fingers, not just your head.

Related posts