Substring with Concatenation of All Words

May 12, 20267 min read

leetcodeproblemhardstringshash-mapsliding-window

LeetCode's Substring with Concatenation of All Words (LeetCode 30) asks you to find every starting index in a string s where a substring is formed by concatenating all words from a given list exactly once, in any order.

Given a string s and an array words where every word has the same length, return all starting indices of substrings in s that are a concatenation of each word in words exactly once, in any order.

Example: s = "barfoothefoobarman", words = ["foo", "bar"] returns [0, 9].

At index 0: "barfoo" is "bar" + "foo".
At index 9: "foobar" is "foo" + "bar".

The fact that every word has the same length is the critical constraint. It means the concatenation window is always exactly len(words) * len(words[0]) characters long, and you can split it into fixed-size chunks. This transforms the problem from a messy substring search into a clean sliding window over word-sized slots.

At index 0, "bar"+"foo" uses each word exactly once. At index 9, "foo"+"bar" does too. Both are valid concatenations.

Brute force and why it fails

The most direct approach: for every possible starting index i in s, extract a substring of length word_len * num_words, split it into chunks of word_len, and check whether those chunks match the word list exactly. That check requires building a frequency map of the chunks and comparing it to a frequency map of words.

def find_substring_brute(s, words):
    if not s or not words:
        return []
    word_len = len(words[0])
    num_words = len(words)
    total_len = word_len * num_words
    from collections import Counter
    word_count = Counter(words)
    result = []

    for i in range(len(s) - total_len + 1):
        seen = Counter()
        for j in range(num_words):
            chunk = s[i + j * word_len : i + (j + 1) * word_len]
            if chunk not in word_count:
                break
            seen[chunk] += 1
            if seen[chunk] > word_count[chunk]:
                break
        else:
            result.append(i)
    return result

This runs in O(n * num_words) where n is the length of s. For each of the n starting positions, you check up to num_words chunks. It passes on LeetCode for small inputs, but it does redundant work. When you shift the starting index by 1, you are re-examining almost all of the same chunks again. The sliding window approach eliminates that redundancy.

The approach: fixed-chunk sliding window with frequency maps

The key insight is that because all words have the same length word_len, you only need to consider word_len different starting offsets (0, 1, 2, ..., word_len - 1). For each offset, you slide a window of num_words chunks across s, adding one chunk on the right and removing one chunk on the left.

Here is the strategy:

Build a frequency map word_count from words.
For each offset from 0 to word_len - 1:
- Initialize a window_count map and a matched counter (number of words whose window frequency equals the required frequency).
- Slide through s in steps of word_len, maintaining a window of exactly num_words chunks.
- When a new chunk enters on the right, update window_count and matched.
- When the window has more than num_words chunks, remove the leftmost chunk and update window_count and matched.
- When matched equals the number of unique words in word_count, the current window is a valid concatenation. Record the starting index.

The matched counter works exactly like the formed counter in Minimum Window Substring. You increment it when a word's window frequency reaches its required frequency, and you decrement it when the frequency drops below the requirement. This avoids scanning the entire frequency map on every step.

Running word_len separate sliding window passes might look expensive, but each pass processes about n / word_len chunks. The total work across all passes is O(n), not O(n * word_len). Each character of s is examined a constant number of times.

Walking through the example

Let's trace s = "barfoothefoobarman" with words = ["foo", "bar"]. Here word_len = 3, num_words = 2, and total_len = 6. We need word_count = {foo: 1, bar: 1}.

For offset 0, the chunks are: "bar", "foo", "the", "foo", "bar", "man". The window slides over pairs of consecutive chunks.

Step 1: offset = 0. First chunk s[0:3] = "bar". bar is in words. Add it.

Found "bar" (1/1). Still need "foo".

Step 2: Next chunk s[3:6] = "foo". foo is in words. Window is full.

matched == total words. Record index 0 as a valid start.

Step 3: Slide forward. Drop "bar" from left, add s[6:9] = "the".

"the" is not in words. Reset window and skip ahead.

Step 4: After reset, reach s[9:12] = "foo". Start fresh window.

Found "foo" (1/1). Need "bar" next.

Step 5: Next chunk s[12:15] = "bar". Both words found!

matched == total words. Record index 9. Final answer: [0, 9].

Offset 1 would check chunks starting at indices 1, 4, 7, 10, 13, 16. Offset 2 checks chunks starting at 2, 5, 8, 11, 14, 17. Neither offset produces valid concatenations in this example, but they would in strings where valid substrings start at non-multiple-of-word_len positions.

Practice this problem with spaced repetition

The code

def find_substring(s, words):
    if not s or not words:
        return []

    word_len = len(words[0])
    num_words = len(words)
    total_len = word_len * num_words
    from collections import Counter

    word_count = Counter(words)
    required = len(word_count)
    result = []

    for offset in range(word_len):
        window_count = {}
        matched = 0
        left = offset

        for right in range(offset, len(s) - word_len + 1, word_len):
            chunk = s[right : right + word_len]

            if chunk in word_count:
                window_count[chunk] = window_count.get(chunk, 0) + 1
                if window_count[chunk] == word_count[chunk]:
                    matched += 1
                elif window_count[chunk] == word_count[chunk] + 1:
                    matched -= 1
            else:
                window_count = {}
                matched = 0
                left = right + word_len
                continue

            while right - left + word_len > total_len:
                leaving = s[left : left + word_len]
                if leaving in word_count:
                    if window_count[leaving] == word_count[leaving]:
                        matched -= 1
                    window_count[leaving] -= 1
                left += word_len

            if matched == required:
                result.append(left)

    return result

A few things to notice:

The outer loop runs word_len times, once per offset. Each offset produces a fresh sliding window.
right advances in steps of word_len, not 1. Each step reads one word-length chunk.
When a chunk is not in word_count, the window is broken. You reset everything and start fresh from the next chunk.
The matched counter increments when window_count[chunk] reaches word_count[chunk] and decrements when it exceeds that value. This keeps the validity check at O(1).
The while loop removes chunks from the left when the window exceeds total_len. This is the standard sliding window shrink step.

The elif window_count[chunk] == word_count[chunk] + 1 line handles the case where a word appears more times than needed. When the count exceeds the requirement, matched must drop because the word is no longer at its exact target.

Complexity

Approach	Time	Space
Brute force	O(n * num_words)	O(num_words)
Fixed-chunk sliding window	O(n * word_len)	O(num_words)

Time: O(n * word_len). There are word_len offsets. For each offset, the window slides through roughly n / word_len positions, each doing O(1) work. Total across all offsets: word_len * (n / word_len) = O(n). But reading each chunk is O(word_len) for the string slice, giving O(n * word_len) overall. In practice word_len is small and bounded by the input constraints.

Space: O(num_words). The word_count and window_count maps each hold at most num_words entries.

Drill sliding window patterns

The building blocks

This problem decomposes into two reusable pieces that appear in many string and hash map problems.

1. Frequency matching

Comparing two frequency maps to check if one "covers" the other. You build a reference map from words, then incrementally maintain a window map as chunks enter and leave. The matched counter avoids rescanning the map on every step. This is the same technique used in Minimum Window Substring (with the formed counter) and in Group Anagrams (where sorted keys serve as canonical forms for frequency comparison).

if window_count[chunk] == word_count[chunk]:
    matched += 1
elif window_count[chunk] == word_count[chunk] + 1:
    matched -= 1

2. Fixed-chunk sliding window

Instead of sliding by 1 character at a time, you slide by word_len characters. The window always contains a whole number of word-sized chunks, so you never need to deal with partial words. The word_len offsets ensure you cover every valid alignment. This pattern is specific to problems where the elements have uniform length, like this one and Find All Anagrams in a String (LeetCode 438).

for offset in range(word_len):
    left = offset
    for right in range(offset, len(s) - word_len + 1, word_len):
        chunk = s[right : right + word_len]

When you can write both of these blocks from memory, you can assemble the full solution without memorizing the problem itself. The frequency matching block is shared with many other problems. The fixed-chunk window is less common but instantly recognizable once you have drilled it.

Edge cases

Before submitting, verify your solution handles:

Empty input. If s is empty or words is empty, return [].
Single word. words = ["foo"] means the window is exactly one chunk. You are just searching for occurrences of that word in s.
Duplicate words. words = ["foo", "foo", "bar"] means you need foo twice and bar once. The frequency map handles this naturally.
Total length exceeds s. If word_len * num_words is greater than len(s), no valid window can exist. Return [].
All characters the same. s = "aaaaaa", words = ["aa", "aa"]. Multiple overlapping matches are possible. The offset loop catches them.
No matches. s = "abcdef", words = ["gh"]. Return [].

From understanding to recall

Reading through this solution once gives you the concept, but it will not stick under interview pressure. The tricky parts are the offset loop (remembering that you need word_len passes), the matched counter logic (when to increment versus decrement), and the reset behavior when a chunk is not in the word list.

These are the details that slip away after a few days. Spaced repetition forces you to reconstruct the solution from scratch at increasing intervals. After a few rounds, the fixed-chunk sliding window pattern and the frequency matching block become automatic. You stop thinking about the mechanics and start thinking about which blocks to combine.

Minimum Window Substring - The classic shrinking-window problem that uses the same formed/matched counter technique for frequency validation
Group Anagrams - Uses frequency comparison (via sorted canonical keys) to group strings, the same core idea of matching character distributions

Try CodeBricks free