Skip to content
← All posts

HTML Entity Parser: String Replacement with a Map

5 min read
leetcodeproblemmediumstringshash-map

Given a string containing HTML entities like > and &, replace each recognized entity with its corresponding character. The tricky part is that not every & starts a valid entity, and you need to handle all six entity types without accidentally mangling the rest of the string. This is LeetCode 1410: HTML Entity Parser.

The problem

You receive a string text that may contain the following HTML entities:

  • " becomes "
  • ' becomes '
  • & becomes &
  • > becomes >
  • &lt; becomes <
  • &frasl; becomes /

No other entities need to be handled. Return the string with all recognized entities replaced by their characters.

Input:  "&amp; is an HTML entity but &ambassador; is not."
Output: "& is an HTML entity but &ambassador; is not."

Input:  "and I quote: &quot;...&quot;"
Output: 'and I quote: "..."'

Notice in the first example that &ambassador; stays unchanged because it is not one of the six recognized entities. Only exact matches get replaced.

Entity Mapping Tableentitycharacter&quot;"&apos;'&amp;&&gt;>&lt;<&frasl;/Exampleinputx &gt; y &amp;&amp; a &lt; boutputx > y && a < b
Six HTML entities and their replacement characters. The parser scans the input string and replaces each recognized entity with the corresponding character.

The map-and-replace approach

The simplest way to solve this is to store all six entities in a dictionary and apply replacements. You can use Python's str.replace in a loop, but there is an important ordering detail: you must replace &amp; first. If you replace &gt; before &amp;, then a literal &amp;gt; in the input would first become &gt; (from the &amp; replacement) and then incorrectly become > on the second pass.

A cleaner approach avoids this pitfall entirely. Instead of chaining str.replace calls, you scan through the string character by character and build the output in one pass.

Scanning with a pointer

Walk through the string left to right. When you see an &, try to match the substring starting at that position against each of the six entities. If one matches, append the replacement character and jump the pointer past the entity. If none match, just append the & and move on by one character.

This single-pass approach is safe from ordering issues because you process each entity exactly once at its position. You never re-scan characters you have already emitted.

The solution

def entityParser(text: str) -> str:
    entities = {
        "&quot;": '"',
        "&apos;": "'",
        "&amp;": "&",
        "&gt;": ">",
        "&lt;": "<",
        "&frasl;": "/",
    }

    result = []
    i = 0
    while i < len(text):
        if text[i] == "&":
            matched = False
            for entity, char in entities.items():
                if text[i:i + len(entity)] == entity:
                    result.append(char)
                    i += len(entity)
                    matched = True
                    break
            if not matched:
                result.append(text[i])
                i += 1
        else:
            result.append(text[i])
            i += 1
    return "".join(result)

Step-by-step walkthrough

Let's trace through the input "x &gt; 1". The parser scans left to right, looking for ampersands that begin valid entities.

Step 1: Scan "x" and " " -- no ampersand, copy to output.

input charactersx &gt; 1output so farx

Regular characters pass through unchanged. Output is "x ".

Step 2: Hit "&" at index 2. Start collecting an entity.

input charactersx &gt; 1output so farx

An ampersand signals the possible start of an entity. Begin buffering.

Step 3: Continue scanning "g", "t", ";" to complete "&gt;".

input charactersx &gt; 1output so farx

We read until the semicolon. The buffer holds "&gt;" which matches an entity.

Step 4: "&gt;" maps to ">". Append ">" to output.

input charactersx &gt; 1output so farx >

Entity replaced. The four characters "&gt;" become the single character ">".

Step 5: Scan " " and "1" -- no ampersand, copy to output.

input charactersx &gt; 1output so farx > 1

Remaining characters are plain text. Final output: "x > 1".

The key moment is Step 3. When the parser hits &, it does not immediately output it. Instead, it reads ahead to see if the following characters form a recognized entity. Here &gt; matches, so the four-character entity gets replaced with a single >. If the characters after & had not matched any entity, the & itself would just pass through to the output unchanged.

Complexity analysis

MetricValue
TimeO(n)
SpaceO(n)

Time: O(n). You scan through the string once. At each &, you check against at most six entities, each at most 7 characters long. That inner work is bounded by a constant (6 entities times 7 characters = 42 comparisons at most), so the overall time is linear in the length of the input string.

Space: O(n). The result list holds at most as many characters as the input. The entity dictionary uses constant space since it always has exactly six entries.

Building blocks

Hash map for pattern lookup

Storing the six entities in a dictionary gives O(1) average-case lookup once you know the key. This is the same pattern you see whenever you need to map a fixed set of inputs to outputs: Morse code translators, Roman numeral converters, or any problem where a small lookup table replaces a long chain of if-else branches.

Single-pass scan with lookahead

Walking through a string with a pointer and peeking ahead to decide what to consume is a fundamental parsing technique. You see it in tokenizers, state machines, and problems like removing comments from source code. The key idea is that when you encounter a trigger character (here &), you try to match a longer pattern before falling back to treating it as a plain character.

Edge cases to watch for

  • Partial entities: A string like "&am" does not end with a semicolon and does not match &amp;. The & and following characters should pass through unchanged.
  • Back-to-back entities: "&amp;&gt;" should produce &>. Each entity is independent, and one replacement should not affect the next.
  • No entities at all: A string with no & characters is returned as-is. The scanner just copies every character.
  • Ampersand without a match: "&unknown;" is not one of the six recognized entities, so it stays as &unknown; in the output.
  • Empty string: Returns an empty string immediately.

From understanding to recall

You have seen how the parser walks through the string and checks a dictionary at every ampersand. The logic is simple, but the details matter in an interview. Do you check for &amp; replacement order if using str.replace? Do you remember all six entities? Do you advance the pointer by the entity length on a match and by one on a miss?

These details slip away without practice. Spaced repetition locks them in. You write the entity dictionary, the scan loop, and the lookahead logic from scratch at increasing intervals until the whole solution is automatic.

Related posts

CodeBricks breaks the HTML Entity Parser pattern into its reusable building blocks and drills them with spaced repetition. You type the solution from scratch each time, so when LeetCode 1410 or any string parsing problem shows up in an interview, the code flows without hesitation.