Remove Comments: Parsing with a State Machine

May 22, 20266 min read

leetcodeproblemmediumstrings

Remove Comments is LeetCode problem 722. You are given a C++ program represented as a list of strings, one per line, and you need to strip out all comments. The tricky part is that there are two kinds of comments, and one of them can span multiple lines.

The problem

Given a C++ program as a list of strings (one per line), remove all comments. Line comments start with // and block comments are delimited by /* and */. Return the cleaned source code with empty lines removed.

A few rules make this interesting:

A // comment removes everything from that point to the end of the line.
A /* comment removes everything until the matching */, which might be on a different line entirely.
Comments do not nest. If you are inside a block comment, any // or /* is just text being skipped.
If removing comments leaves a line empty, drop it from the output.

Source code with comments highlighted. Line comments (yellow) start with // and run to end of line. Block comments (blue) span from /* to */ across any number of lines.

The approach: two-state parser

The key insight is that at any point while scanning, you are in exactly one of two states: either you are inside a block comment, or you are not. A single boolean flag, in_block, tracks which state you are in.

When in_block is False, you scan character by character looking for // or /*. If you find //, you stop processing the current line (everything after it is a comment). If you find /*, you flip in_block to True and skip over the /* marker. Otherwise, the character is normal code and you keep it.

When in_block is True, you scan character by character looking for */. If you find it, you flip in_block back to False and skip over the */ marker. Otherwise, you skip the character since it is inside the comment.

At the end of each line, if in_block is False and you have accumulated any output characters, you join them into a string and add it to the result. If in_block is True, you do not emit a line yet because the block comment continues onto the next line. The accumulated characters carry over.

Practice this pattern

Visual walkthrough

Here is a trace of the parser working through the string "a/*b*/c". Watch how in_block flips when the parser encounters /* and */, and how characters outside the comment are kept while characters inside are skipped.

Step 1: Read 'a', normal code

in_block = False

in_block is False. Character 'a' is regular code. Append it to output. Advance i to 1.

out: "a"

Step 2: See /*, enter block comment

in_block = True

in_block is False. Characters at i and i+1 form "/*". Set in_block = True. Skip both, advance i to 3.

out: "a"

Step 3: Skip 'b' inside block comment

in_block = True

in_block is True. Character 'b' is inside the block comment. Skip it. Advance i to 4.

out: "a"

Step 4: See */, exit block comment

in_block = False

in_block is True. Characters at i and i+1 form "*/". Set in_block = False. Skip both, advance i to 6.

out: "a"

Step 5: Read 'c', normal code again

in_block = False

in_block is False. Character 'c' is regular code. Append it to output. Line done. Output: "ac".

out: "ac"

Notice how the pointer jumps by 2 when it hits /* or */ (it needs to skip both characters of the delimiter), but advances by 1 for every other character. This two-character lookahead is why we use line[i:i+2] in the code.

The code

def remove_comments(source):
    result = []
    in_block = False
    new_line = []

    for line in source:
        i = 0
        while i < len(line):
            if in_block:
                if i + 1 < len(line) and line[i:i+2] == "*/":
                    in_block = False
                    i += 2
                else:
                    i += 1
            else:
                if i + 1 < len(line) and line[i:i+2] == "//":
                    break
                elif i + 1 < len(line) and line[i:i+2] == "/*":
                    in_block = True
                    i += 2
                else:
                    new_line.append(line[i])
                    i += 1
        if not in_block and new_line:
            result.append("".join(new_line))
            new_line = []

    return result

Key details to notice:

new_line is initialized once before the outer loop, not inside it. When a block comment spans multiple lines, the characters before /* on the first line and after */ on a later line need to be joined into one output line.
The break for // exits the inner while loop but not the outer for loop. It just means "stop reading this line."
The if not in_block and new_line check at the end of each line ensures we only emit a result line when we are outside a block comment and there is actually content to emit.

The two-character lookahead (line[i:i+2]) is safe in Python even when i is at the last character. Slicing past the end of a string returns a shorter string rather than raising an error, so line[i:i+2] at the last index returns a single character that will not match //, /*, or */.

Drill this solution with spaced repetition

Complexity analysis

Metric	Value	Reasoning
Time	O(n)	Every character in the source is visited exactly once. The pointer `i` advances by at least 1 on every iteration of the while loop.
Space	O(n)	The result list and `new_line` buffer together hold at most as many characters as the original source.

Here, n is the total number of characters across all lines. Each character is examined once and either appended to new_line or skipped, so the work is linear.

The building blocks

Character-by-character parsing

Many string problems require you to scan one character at a time while making decisions based on what you see. This problem uses a while loop with a manual index i instead of a for loop because sometimes you need to advance by 2 (when you encounter a two-character delimiter like /* or */). This same technique appears in problems like string-to-integer conversion and expression evaluation, where the parser needs to consume variable-length tokens.

State machines

A state machine is any algorithm where you track a "mode" or "state" that determines how each piece of input is processed. Here, the state is a single boolean: are you inside a block comment or not? More complex parsers might have several states (for example, "reading digits," "reading a sign," "skipping whitespace"). The pattern is always the same: check the current state, decide what to do with the current input, and optionally transition to a new state.

Edge cases to watch for

Block comment spanning multiple lines. The in_block flag persists across lines, and new_line is not reset until a line is emitted. Characters before /* and after */ on different lines merge into one output line.
// inside a block comment is ignored. When in_block is True, you only look for */. The // check never runs.
/* inside a line comment is ignored. Once you see //, you break out of the line. You never reach the /* check for the rest of that line.
Adjacent comments: a/*comment*/b becomes ab on the same line. After closing the block comment, the parser continues scanning the same line. The b is regular code and gets appended to new_line right after a.
Empty lines after removing comments should be excluded. The condition if not in_block and new_line handles this. If a comment removal leaves new_line empty, nothing is appended to the result.

From understanding to recall

You can follow the logic of this two-state parser without much trouble. But when you sit down to write it from scratch, the details trip you up. Where do you initialize new_line, before the outer loop or inside it? When do you emit a line, at the end of every line or only when in_block is False? Do you break for // or set a flag?

Spaced repetition locks in exactly these details. You drill the state machine skeleton as one brick: the in_block flag, the two branches, the two-character lookahead. You drill the line emission logic as another: new_line lives outside the loop, and you only flush it when the block comment is closed. After a few rounds at increasing intervals, the whole structure is automatic. When you see a parsing problem in an interview, you are assembling practiced bricks, not reinventing the parser from scratch.

Try CodeBricks free

Simplify Path - Another string parsing problem with state tracking, LeetCode 71
Decode String - Parsing nested structures character by character, LeetCode 394
Basic Calculator - Parsing expressions with a state machine, LeetCode 224

Visual Learner? Explore how string parsing algorithms work with our interactive visualizations.

The problem

The approach: two-state parser

Visual walkthrough

The code

Complexity analysis

The building blocks

Character-by-character parsing

State machines

Edge cases to watch for

From understanding to recall

Related posts