Substring with Concatenation of All Words
LeetCode's Substring with Concatenation of All Words (LeetCode 30) asks you to find every starting index in a string s where a substring is formed by concatenating all words from a given list exactly once, in any order.
Given a string s and an array words where every word has the same length, return all starting indices of substrings in s that are a concatenation of each word in words exactly once, in any order.
Example: s = "barfoothefoobarman", words = ["foo", "bar"] returns [0, 9].
- At index 0:
"barfoo"is"bar" + "foo". - At index 9:
"foobar"is"foo" + "bar".
The fact that every word has the same length is the critical constraint. It means the concatenation window is always exactly len(words) * len(words[0]) characters long, and you can split it into fixed-size chunks. This transforms the problem from a messy substring search into a clean sliding window over word-sized slots.
Brute force and why it fails
The most direct approach: for every possible starting index i in s, extract a substring of length word_len * num_words, split it into chunks of word_len, and check whether those chunks match the word list exactly. That check requires building a frequency map of the chunks and comparing it to a frequency map of words.
def find_substring_brute(s, words):
if not s or not words:
return []
word_len = len(words[0])
num_words = len(words)
total_len = word_len * num_words
from collections import Counter
word_count = Counter(words)
result = []
for i in range(len(s) - total_len + 1):
seen = Counter()
for j in range(num_words):
chunk = s[i + j * word_len : i + (j + 1) * word_len]
if chunk not in word_count:
break
seen[chunk] += 1
if seen[chunk] > word_count[chunk]:
break
else:
result.append(i)
return result
This runs in O(n * num_words) where n is the length of s. For each of the n starting positions, you check up to num_words chunks. It passes on LeetCode for small inputs, but it does redundant work. When you shift the starting index by 1, you are re-examining almost all of the same chunks again. The sliding window approach eliminates that redundancy.
The approach: fixed-chunk sliding window with frequency maps
The key insight is that because all words have the same length word_len, you only need to consider word_len different starting offsets (0, 1, 2, ..., word_len - 1). For each offset, you slide a window of num_words chunks across s, adding one chunk on the right and removing one chunk on the left.
Here is the strategy:
- Build a frequency map
word_countfromwords. - For each offset from 0 to
word_len - 1:- Initialize a
window_countmap and amatchedcounter (number of words whose window frequency equals the required frequency). - Slide through
sin steps ofword_len, maintaining a window of exactlynum_wordschunks. - When a new chunk enters on the right, update
window_countandmatched. - When the window has more than
num_wordschunks, remove the leftmost chunk and updatewindow_countandmatched. - When
matchedequals the number of unique words inword_count, the current window is a valid concatenation. Record the starting index.
- Initialize a
The matched counter works exactly like the formed counter in Minimum Window Substring. You increment it when a word's window frequency reaches its required frequency, and you decrement it when the frequency drops below the requirement. This avoids scanning the entire frequency map on every step.
Running word_len separate sliding window passes might look expensive, but each pass processes about n / word_len chunks. The total work across all passes is O(n), not O(n * word_len). Each character of s is examined a constant number of times.
Walking through the example
Let's trace s = "barfoothefoobarman" with words = ["foo", "bar"]. Here word_len = 3, num_words = 2, and total_len = 6. We need word_count = {foo: 1, bar: 1}.
For offset 0, the chunks are: "bar", "foo", "the", "foo", "bar", "man". The window slides over pairs of consecutive chunks.
Step 1: offset = 0. First chunk s[0:3] = "bar". bar is in words. Add it.
Found "bar" (1/1). Still need "foo".
Step 2: Next chunk s[3:6] = "foo". foo is in words. Window is full.
matched == total words. Record index 0 as a valid start.
Step 3: Slide forward. Drop "bar" from left, add s[6:9] = "the".
"the" is not in words. Reset window and skip ahead.
Step 4: After reset, reach s[9:12] = "foo". Start fresh window.
Found "foo" (1/1). Need "bar" next.
Step 5: Next chunk s[12:15] = "bar". Both words found!
matched == total words. Record index 9. Final answer: [0, 9].
Offset 1 would check chunks starting at indices 1, 4, 7, 10, 13, 16. Offset 2 checks chunks starting at 2, 5, 8, 11, 14, 17. Neither offset produces valid concatenations in this example, but they would in strings where valid substrings start at non-multiple-of-word_len positions.
The code
def find_substring(s, words):
if not s or not words:
return []
word_len = len(words[0])
num_words = len(words)
total_len = word_len * num_words
from collections import Counter
word_count = Counter(words)
required = len(word_count)
result = []
for offset in range(word_len):
window_count = {}
matched = 0
left = offset
for right in range(offset, len(s) - word_len + 1, word_len):
chunk = s[right : right + word_len]
if chunk in word_count:
window_count[chunk] = window_count.get(chunk, 0) + 1
if window_count[chunk] == word_count[chunk]:
matched += 1
elif window_count[chunk] == word_count[chunk] + 1:
matched -= 1
else:
window_count = {}
matched = 0
left = right + word_len
continue
while right - left + word_len > total_len:
leaving = s[left : left + word_len]
if leaving in word_count:
if window_count[leaving] == word_count[leaving]:
matched -= 1
window_count[leaving] -= 1
left += word_len
if matched == required:
result.append(left)
return result
A few things to notice:
- The outer loop runs
word_lentimes, once per offset. Each offset produces a fresh sliding window. rightadvances in steps ofword_len, not 1. Each step reads one word-length chunk.- When a chunk is not in
word_count, the window is broken. You reset everything and start fresh from the next chunk. - The
matchedcounter increments whenwindow_count[chunk]reachesword_count[chunk]and decrements when it exceeds that value. This keeps the validity check at O(1). - The while loop removes chunks from the left when the window exceeds
total_len. This is the standard sliding window shrink step.
The elif window_count[chunk] == word_count[chunk] + 1 line handles the case where a word appears more times than needed. When the count exceeds the requirement, matched must drop because the word is no longer at its exact target.
Complexity
| Approach | Time | Space |
|---|---|---|
| Brute force | O(n * num_words) | O(num_words) |
| Fixed-chunk sliding window | O(n * word_len) | O(num_words) |
Time: O(n * word_len). There are word_len offsets. For each offset, the window slides through roughly n / word_len positions, each doing O(1) work. Total across all offsets: word_len * (n / word_len) = O(n). But reading each chunk is O(word_len) for the string slice, giving O(n * word_len) overall. In practice word_len is small and bounded by the input constraints.
Space: O(num_words). The word_count and window_count maps each hold at most num_words entries.
The building blocks
This problem decomposes into two reusable pieces that appear in many string and hash map problems.
1. Frequency matching
Comparing two frequency maps to check if one "covers" the other. You build a reference map from words, then incrementally maintain a window map as chunks enter and leave. The matched counter avoids rescanning the map on every step. This is the same technique used in Minimum Window Substring (with the formed counter) and in Group Anagrams (where sorted keys serve as canonical forms for frequency comparison).
if window_count[chunk] == word_count[chunk]:
matched += 1
elif window_count[chunk] == word_count[chunk] + 1:
matched -= 1
2. Fixed-chunk sliding window
Instead of sliding by 1 character at a time, you slide by word_len characters. The window always contains a whole number of word-sized chunks, so you never need to deal with partial words. The word_len offsets ensure you cover every valid alignment. This pattern is specific to problems where the elements have uniform length, like this one and Find All Anagrams in a String (LeetCode 438).
for offset in range(word_len):
left = offset
for right in range(offset, len(s) - word_len + 1, word_len):
chunk = s[right : right + word_len]
When you can write both of these blocks from memory, you can assemble the full solution without memorizing the problem itself. The frequency matching block is shared with many other problems. The fixed-chunk window is less common but instantly recognizable once you have drilled it.
Edge cases
Before submitting, verify your solution handles:
- Empty input. If
sis empty orwordsis empty, return[]. - Single word.
words = ["foo"]means the window is exactly one chunk. You are just searching for occurrences of that word ins. - Duplicate words.
words = ["foo", "foo", "bar"]means you needfootwice andbaronce. The frequency map handles this naturally. - Total length exceeds s. If
word_len * num_wordsis greater thanlen(s), no valid window can exist. Return[]. - All characters the same.
s = "aaaaaa",words = ["aa", "aa"]. Multiple overlapping matches are possible. The offset loop catches them. - No matches.
s = "abcdef",words = ["gh"]. Return[].
From understanding to recall
Reading through this solution once gives you the concept, but it will not stick under interview pressure. The tricky parts are the offset loop (remembering that you need word_len passes), the matched counter logic (when to increment versus decrement), and the reset behavior when a chunk is not in the word list.
These are the details that slip away after a few days. Spaced repetition forces you to reconstruct the solution from scratch at increasing intervals. After a few rounds, the fixed-chunk sliding window pattern and the frequency matching block become automatic. You stop thinking about the mechanics and start thinking about which blocks to combine.
Related posts
- Minimum Window Substring - The classic shrinking-window problem that uses the same
formed/matchedcounter technique for frequency validation - Group Anagrams - Uses frequency comparison (via sorted canonical keys) to group strings, the same core idea of matching character distributions