Skip to content
← All posts

Occurrences After Bigram: Linear Scan Pattern

5 min read
leetcodeproblemeasystrings

Given a string text along with two words first and second, find all occurrences of the bigram first second in text and collect the word that immediately follows each occurrence. Return those collected words as a list.

This is LeetCode 1078: Occurrences After Bigram, and it is a clean introduction to scanning text at the word level. Instead of matching individual characters, you split the text into words and look for a pattern across consecutive entries.

firstsecondthird (result)aliceisagoodgirlsheisagoodstudent0123456789bigram matchbigram match
text = "alice is a good girl she is a good student", first = "a", second = "good". The two bigram matches yield result = ["girl", "student"].

Why this problem matters

Bigrams (pairs of consecutive words) are a fundamental concept in natural language processing. Counting bigrams, finding words that follow them, and building frequency tables over word pairs are common tasks in text analysis, autocomplete systems, and language modeling. This problem isolates the core mechanic: scan a word sequence and look for a specific pair.

It also reinforces a key interviewing skill, which is recognizing when you can reduce a string problem to an array problem. Once you split the text into words, the entire problem becomes a linear scan over an array with a window of three consecutive elements. No complex string matching is needed.

The key insight

Split the text into an array of words. Then scan through the array, and at each position i, check whether words[i] equals first and words[i+1] equals second. If both match, append words[i+2] to the result. You need i+2 to be in bounds, so the loop runs from index 0 to len(words) - 3.

This is essentially a sliding window of size 3 over the word array. You check the first two elements of the window against the bigram, and if they match, you collect the third.

The solution

def find_ocurrences(text: str, first: str, second: str) -> list[str]:
    words = text.split()
    result = []

    for i in range(len(words) - 2):
        if words[i] == first and words[i + 1] == second:
            result.append(words[i + 2])

    return result

The text.split() call handles any amount of whitespace and produces a clean list of words. Then we iterate through all valid starting positions for a triple-word window. At each position, a simple equality check on the first two words tells us whether we have a bigram match. If we do, we grab the third word.

Notice that the loop bound is len(words) - 2. This ensures words[i + 2] never goes out of bounds. When i is at its maximum value of len(words) - 3, the indices i, i+1, and i+2 are the last three elements.

When a string problem talks about "words" in a sentence, your first move should almost always be text.split(). This converts the problem from character-level string matching to array-level scanning, which is usually simpler to reason about and implement.

Visual walkthrough

Step 1: Check window at indices 0, 1, 2

words[0]="alice", words[1]="is". Neither matches first="a", so skip.

aliceisagoodgirlsheisagoodstudent

result = []

Step 2: Check window at indices 1, 2, 3

words[1]="is", words[2]="a". words[1] does not equal first="a", so skip.

aliceisagoodgirlsheisagoodstudent

result = []

Step 3: Check window at indices 2, 3, 4

words[2]="a" matches first, words[3]="good" matches second. Bigram found! Collect words[4]="girl".

aliceisagoodgirlsheisagoodstudent

result = ["girl"]

Steps 4-6: Scan indices 3 through 6

No window starting at indices 3, 4, 5, or 6 has words[i]="a" followed by words[i+1]="good". Keep scanning.

aliceisagoodgirlsheisagoodstudent

result = ["girl"]

Step 7: Check window at indices 7, 8, 9

words[7]="a" matches first, words[8]="good" matches second. Bigram found again! Collect words[9]="student".

aliceisagoodgirlsheisagoodstudent

result = ["girl", "student"]

Done: All windows checked

We scanned every triple-word window. Two bigram matches produced two result words.

aliceisagoodgirlsheisagoodstudent

result = ["girl", "student"]

The scan visits each triple-word window exactly once. When the first two words match the bigram, we collect the third. No backtracking, no nested loops, just a single left-to-right pass through the word array.

Complexity analysis

ApproachTimeSpace
Linear scanO(n)O(n)

Here n is the length of the text string.

Time: Splitting the text into words takes O(n) where n is the number of characters. The scan through the words array visits each word once, which is also O(n) total character comparisons in the worst case. Overall: O(n).

Space: The words array stores all the words from the text, which takes O(n) space. The result list takes at most O(n) as well. Total: O(n).

The building blocks

1. Word splitting and array scanning

The foundation of this solution is converting a sentence into a word array and then scanning it. This is the same pattern used in problems like Reverse Words in a String and Most Common Word. Any time you see a problem that operates on "words in a sentence," the first step is split(). Once you have an array, you can use indexing, iteration, and standard array techniques.

words = text.split()
for i in range(len(words)):
    process(words[i])

2. Triple-word window pattern

The second building block is checking a fixed-size window of consecutive elements. Here the window is three words wide: you check the first two against a pattern and collect the third. This is a simplified version of the sliding window technique. The key difference is that the window size is fixed and tiny, so there is no need for complex pointer management. You just index i, i+1, and i+2.

for i in range(len(arr) - 2):
    if arr[i] == target1 and arr[i + 1] == target2:
        collect(arr[i + 2])

Edge cases

  • No bigram match: if the bigram first second never appears in the text, the result is an empty list. The loop runs but never triggers the if condition.
  • Bigram at the very end: if the text ends with first second and there is no word after, the loop bound len(words) - 2 prevents that position from being checked. No index error occurs.
  • Overlapping bigrams: consider text = "a a a", first = "a", second = "a". At index 0, words[0]="a" and words[1]="a" match, so words[2]="a" is collected. At index 1, words[1]="a" and words[2]="a" also match, but there is no words[3]. The loop bound stops at index 0 (len=3, range(1)), so only one result is collected. This is correct.
  • Single or two words in text: if the text has fewer than three words, range(len(words) - 2) produces an empty range. The loop body never executes, and the result is an empty list.
  • Repeated bigram with different following words: each match is independent. If the bigram appears five times with five different following words, all five are collected in order.

From understanding to recall

The logic here is minimal: split, scan triples, collect matches. But in an interview, the small details matter. Remembering to use len(words) - 2 as the bound, not len(words) - 1 or len(words), is the kind of off-by-one detail that causes bugs under pressure.

Spaced repetition drills these details into muscle memory. You practice writing the split-and-scan pattern from scratch, getting the loop bound right every time. After a few rounds, you do not need to think about whether it is -1 or -2. Your fingers just write it correctly because they have done it before.

Related posts

CodeBricks breaks Occurrences After Bigram into its word-splitting and triple-window scanning building blocks, then drills them independently with spaced repetition. You type each piece from scratch until the pattern is automatic. When a word-level string problem appears in your interview, you do not hesitate. You just write it.