Most Common Word: Filtering and Counting with Hash Maps
Most Common Word (LeetCode 819) is a great warmup problem that combines two fundamental skills: string normalization and frequency counting. You are given a paragraph of text and a list of banned words. Your job is to find the most frequently occurring word that is not banned, ignoring case and punctuation.
Example: paragraph = "Bob hit a ball, the hit BALL flew far after it was hit", banned = ["hit"]. The answer is "ball" because "ball" appears twice (the most of any non-banned word), while "hit" appears three times but is excluded.
Why this problem matters
Most Common Word is not just a toy exercise. It tests your ability to handle messy, real-world text. In practice, you almost never get clean input. Strings arrive with mixed casing, stray punctuation, and words you need to ignore. This problem trains you to write a reliable normalization pipeline before you even start counting.
The underlying pattern, counting frequencies and filtering by a condition, appears constantly in data processing, log analysis, and interview questions alike.
The key insight
The problem has two separate concerns, and you should handle them in order:
- Normalize first. Convert the paragraph to lowercase and strip all non-letter characters. This turns messy input into a clean list of words.
- Count second. Walk through the words, skip any that appear in the banned set, and count frequencies for the rest.
By separating normalization from counting, you avoid a tangle of conditionals and edge cases. Each step is simple on its own.
Solution approach
Here is the plan:
- Lowercase the paragraph
- Replace every non-letter character with a space
- Split on whitespace to get individual words
- Build a set from the banned list (also lowercased) for O(1) lookups
- Count word frequencies with a dictionary or
Counter, skipping banned words - Return the word with the highest count
The code
import re
from collections import Counter
def most_common_word(paragraph: str, banned: list[str]) -> str:
normalized = re.sub(r'[^a-zA-Z]', ' ', paragraph).lower()
words = normalized.split()
banned_set = set(w.lower() for w in banned)
counts = Counter(w for w in words if w not in banned_set)
return counts.most_common(1)[0][0]
Let's walk through what each line does:
re.sub(r'[^a-zA-Z]', ' ', paragraph)replaces every character that is not a letter with a space. Commas, periods, apostrophes, digits; they all become spaces..lower()converts everything to lowercase so "BALL" and "ball" count as the same word..split()splits on any whitespace (including multiple consecutive spaces), giving us a clean list of words.set(w.lower() for w in banned)builds a set of banned words in lowercase. Using a set means each "is this word banned?" check is O(1).Counter(w for w in words if w not in banned_set)counts only the non-banned words. The generator expression filters out banned words before they ever enter the counter.counts.most_common(1)[0][0]grabs the single most frequent word.most_common(1)returns a list like[("ball", 2)], so[0][0]extracts just"ball".
Visual walkthrough
Let's trace through the example step by step to see exactly how the algorithm processes the input.
Step 1: Normalize the paragraph.
Lowercase everything and replace non-letter characters (commas, periods, etc.) with spaces. This gives you a clean string with only lowercase letters and spaces.
Step 2: Split into words and build the banned set.
Splitting on whitespace extracts each word. Convert the banned list to a set so lookups are O(1).
Step 3: Count frequencies, skipping banned words.
"hit" appears 3 times but is banned, so it never enters the frequency map. "ball" appears twice, the highest among remaining words.
Step 4: Find the word with the highest count.
Scan the frequency map for the entry with the largest value. "ball" has count 2, which is the maximum. Return "ball".
That is the whole algorithm. Normalize, filter, count, return the max. Four steps, and the code is five lines.
Complexity analysis
| Metric | Value |
|---|---|
| Time | O(n + m) where n is the length of the paragraph and m is the total length of banned words |
| Space | O(n + m) for storing the word list, banned set, and frequency map |
The regex substitution and split are both O(n). Building the banned set is O(m). Counting frequencies is O(n) since we visit each word once. Finding the max is O(k) where k is the number of distinct non-banned words, which is bounded by n. Everything is linear.
The building blocks
This problem is built from two patterns that show up over and over in string and hash map problems.
String normalization
Before you can do anything useful with text, you need to clean it up. The template is:
normalized = re.sub(r'[^a-zA-Z]', ' ', text).lower()
words = normalized.split()
This two-line pipeline handles mixed case, punctuation, digits, and multiple spaces all at once. You will use this same approach (or something very similar) whenever a problem gives you messy text input.
Frequency counting with filtering
The act of counting elements while skipping certain ones. The template is:
banned_set = set(excluded_items)
counts = Counter(item for item in collection if item not in banned_set)
This is a variation of the standard frequency counter pattern. The only addition is the if item not in banned_set filter in the generator expression. Using a set for the banned list keeps the filter at O(1) per check.
Edge cases
A few things to watch for when coding this up:
- Punctuation at the end of words. "ball," and "ball" should count as the same word. The regex approach handles this by replacing the comma with a space before splitting.
- Mixed case. "BALL", "Ball", and "ball" are all the same word. Lowercasing the entire paragraph at the start handles this.
- Multiple spaces after substitution. When you replace
","or"."with spaces, you might get consecutive spaces.split()without arguments handles this correctly by splitting on any whitespace run. - All words are banned except one. The algorithm still works because the Counter will have exactly one entry.
- Banned list contains uppercase. Always lowercase the banned list too. The problem statement might not guarantee they are lowercase.
From understanding to recall
Most Common Word feels easy once you see the solution. "Normalize the text, count the words, skip the banned ones." But the real skill here is not solving this one problem. It is recognizing the normalization-then-counting pipeline as a reusable pattern.
Next time you see a problem involving messy text and frequency analysis, you want the re.sub plus Counter template to fire automatically. You should not have to re-derive it.
Spaced repetition makes this stick. You practice the normalization pipeline and the filtered counting pattern at increasing intervals. After a few cycles, the moment a problem mentions "ignore case and punctuation" or "exclude certain words," you already know the first three lines of your solution.
Related posts
- Group Anagrams - Another hash map grouping problem where string normalization (sorting characters) is the key step
- Valid Anagram - The simplest frequency counting problem, a great warmup before Most Common Word
- First Unique Character in a String - Frequency counting to find the first character with count 1, the inverse of finding the max count