Remove Comments: Parsing with a State Machine
Remove Comments is LeetCode problem 722. You are given a C++ program represented as a list of strings, one per line, and you need to strip out all comments. The tricky part is that there are two kinds of comments, and one of them can span multiple lines.
The problem
Given a C++ program as a list of strings (one per line), remove all comments. Line comments start with // and block comments are delimited by /* and */. Return the cleaned source code with empty lines removed.
A few rules make this interesting:
- A
//comment removes everything from that point to the end of the line. - A
/*comment removes everything until the matching*/, which might be on a different line entirely. - Comments do not nest. If you are inside a block comment, any
//or/*is just text being skipped. - If removing comments leaves a line empty, drop it from the output.
The approach: two-state parser
The key insight is that at any point while scanning, you are in exactly one of two states: either you are inside a block comment, or you are not. A single boolean flag, in_block, tracks which state you are in.
When in_block is False, you scan character by character looking for // or /*. If you find //, you stop processing the current line (everything after it is a comment). If you find /*, you flip in_block to True and skip over the /* marker. Otherwise, the character is normal code and you keep it.
When in_block is True, you scan character by character looking for */. If you find it, you flip in_block back to False and skip over the */ marker. Otherwise, you skip the character since it is inside the comment.
At the end of each line, if in_block is False and you have accumulated any output characters, you join them into a string and add it to the result. If in_block is True, you do not emit a line yet because the block comment continues onto the next line. The accumulated characters carry over.
Visual walkthrough
Here is a trace of the parser working through the string "a/*b*/c". Watch how in_block flips when the parser encounters /* and */, and how characters outside the comment are kept while characters inside are skipped.
Step 1: Read 'a', normal code
in_block = Falsein_block is False. Character 'a' is regular code. Append it to output. Advance i to 1.
out: "a"Step 2: See /*, enter block comment
in_block = Truein_block is False. Characters at i and i+1 form "/*". Set in_block = True. Skip both, advance i to 3.
out: "a"Step 3: Skip 'b' inside block comment
in_block = Truein_block is True. Character 'b' is inside the block comment. Skip it. Advance i to 4.
out: "a"Step 4: See */, exit block comment
in_block = Falsein_block is True. Characters at i and i+1 form "*/". Set in_block = False. Skip both, advance i to 6.
out: "a"Step 5: Read 'c', normal code again
in_block = Falsein_block is False. Character 'c' is regular code. Append it to output. Line done. Output: "ac".
out: "ac"Notice how the pointer jumps by 2 when it hits /* or */ (it needs to skip both characters of the delimiter), but advances by 1 for every other character. This two-character lookahead is why we use line[i:i+2] in the code.
The code
def remove_comments(source):
result = []
in_block = False
new_line = []
for line in source:
i = 0
while i < len(line):
if in_block:
if i + 1 < len(line) and line[i:i+2] == "*/":
in_block = False
i += 2
else:
i += 1
else:
if i + 1 < len(line) and line[i:i+2] == "//":
break
elif i + 1 < len(line) and line[i:i+2] == "/*":
in_block = True
i += 2
else:
new_line.append(line[i])
i += 1
if not in_block and new_line:
result.append("".join(new_line))
new_line = []
return result
Key details to notice:
new_lineis initialized once before the outer loop, not inside it. When a block comment spans multiple lines, the characters before/*on the first line and after*/on a later line need to be joined into one output line.- The
breakfor//exits the inner while loop but not the outer for loop. It just means "stop reading this line." - The
if not in_block and new_linecheck at the end of each line ensures we only emit a result line when we are outside a block comment and there is actually content to emit.
The two-character lookahead (line[i:i+2]) is safe in Python even when i is at the last character. Slicing past the end of a string returns a shorter string rather than raising an error, so line[i:i+2] at the last index returns a single character that will not match //, /*, or */.
Complexity analysis
| Metric | Value | Reasoning |
|---|---|---|
| Time | O(n) | Every character in the source is visited exactly once. The pointer i advances by at least 1 on every iteration of the while loop. |
| Space | O(n) | The result list and new_line buffer together hold at most as many characters as the original source. |
Here, n is the total number of characters across all lines. Each character is examined once and either appended to new_line or skipped, so the work is linear.
The building blocks
Character-by-character parsing
Many string problems require you to scan one character at a time while making decisions based on what you see. This problem uses a while loop with a manual index i instead of a for loop because sometimes you need to advance by 2 (when you encounter a two-character delimiter like /* or */). This same technique appears in problems like string-to-integer conversion and expression evaluation, where the parser needs to consume variable-length tokens.
State machines
A state machine is any algorithm where you track a "mode" or "state" that determines how each piece of input is processed. Here, the state is a single boolean: are you inside a block comment or not? More complex parsers might have several states (for example, "reading digits," "reading a sign," "skipping whitespace"). The pattern is always the same: check the current state, decide what to do with the current input, and optionally transition to a new state.
Edge cases to watch for
- Block comment spanning multiple lines. The
in_blockflag persists across lines, andnew_lineis not reset until a line is emitted. Characters before/*and after*/on different lines merge into one output line. //inside a block comment is ignored. Whenin_blockisTrue, you only look for*/. The//check never runs./*inside a line comment is ignored. Once you see//, youbreakout of the line. You never reach the/*check for the rest of that line.- Adjacent comments:
a/*comment*/bbecomesabon the same line. After closing the block comment, the parser continues scanning the same line. Thebis regular code and gets appended tonew_lineright aftera. - Empty lines after removing comments should be excluded. The condition
if not in_block and new_linehandles this. If a comment removal leavesnew_lineempty, nothing is appended to the result.
From understanding to recall
You can follow the logic of this two-state parser without much trouble. But when you sit down to write it from scratch, the details trip you up. Where do you initialize new_line, before the outer loop or inside it? When do you emit a line, at the end of every line or only when in_block is False? Do you break for // or set a flag?
Spaced repetition locks in exactly these details. You drill the state machine skeleton as one brick: the in_block flag, the two branches, the two-character lookahead. You drill the line emission logic as another: new_line lives outside the loop, and you only flush it when the block comment is closed. After a few rounds at increasing intervals, the whole structure is automatic. When you see a parsing problem in an interview, you are assembling practiced bricks, not reinventing the parser from scratch.
Related posts
- Simplify Path - Another string parsing problem with state tracking, LeetCode 71
- Decode String - Parsing nested structures character by character, LeetCode 394
- Basic Calculator - Parsing expressions with a state machine, LeetCode 224
Visual Learner? Explore how string parsing algorithms work with our interactive visualizations.