Unique Email Addresses: String Normalization

May 25, 2026 · 4 min read

leetcode problem easy arrays strings hash-map

LeetCode 929 asks you to count the number of unique email addresses in a list, where certain transformations apply to the local name (the part before the @ sign). Specifically, dots in the local name are ignored, and everything from the first + onward in the local name is discarded. The domain is left unchanged.

"test.email+alex@leetcode.com" normalizes to "testemail@leetcode.com": the dots are removed, the + and everything after it are discarded, and the domain stays unchanged.

Why this problem matters

String normalization is a bread-and-butter skill in software engineering. You encounter it everywhere: canonicalizing URLs, normalizing usernames, deduplicating records. This problem distills that pattern into its simplest form. You parse a structured string, apply transformation rules to part of it, then check for uniqueness. Once you recognize this pattern, you can apply it to more complex real-world scenarios where data arrives in inconsistent formats.

The approach

The algorithm is clean and direct:

For each email, split on @ to separate the local name from the domain.
In the local name, truncate everything from the first + onward.
Remove all dots from the remaining local name.
Rejoin the processed local name with @ and the domain.
Add the normalized email to a set.
Return the size of the set.

The key insight

The domain is never modified; only the local name gets normalized. By splitting on @ first, you isolate the part that needs processing. The two local-name rules (remove dots, truncate at +) give the same result in either order, because removing dots never adds, removes, or reorders a +. Truncating at + first simply avoids doing work on characters that will be discarded anyway.
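To check the claim that the two rules can be applied in either order, here is a small sketch. The helper names truncate_then_strip and strip_then_truncate are made up for this illustration, and the sample local names are arbitrary.

```python
def truncate_then_strip(local):
    # Cut at the first '+', then drop dots from what remains.
    return local.split("+")[0].replace(".", "")


def strip_then_truncate(local):
    # Drop every dot first (including dots after '+'), then cut at the first '+'.
    return local.replace(".", "").split("+")[0]


# Arbitrary sample local names, chosen to cover dots before and after '+'.
samples = ["test.email+alex", "a.b.c", "x+y.z+w", "plain"]
for local in samples:
    assert truncate_then_strip(local) == strip_then_truncate(local)
```

Both orders agree on every sample. The second version does more work because it also strips dots from the suffix that gets thrown away.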

Always split on @ first. The dot and plus rules only apply to the local name, not the domain. Applying them to the full email string would corrupt addresses like "user@my.company.com".
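A quick illustration of the failure mode, using the address from the paragraph above. The variable names are only for this sketch.

```python
email = "user@my.company.com"

# Wrong: stripping dots from the whole string also rewrites the domain.
corrupted = email.replace(".", "")

# Right: split first, then touch only the local name.
local, domain = email.split("@")
normalized = local.split("+")[0].replace(".", "") + "@" + domain

print(corrupted)   # user@mycompanycom
print(normalized)  # user@my.company.com
```

The corrupted form points at a different domain entirely, which is why the split has to come before any other rule.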

Walking through it step by step

Step 1: Process first email

Split on '@'. In local part 'test.email+alex': truncate at '+' to get 'test.email', then remove dots to get 'testemail'. Rejoin with the domain to get 'testemail@leetcode.com' and add it to the set.

Step 2: Process second email

Local part 'test.e.mail+bob': truncating at '+' gives 'test.e.mail', and removing dots gives 'testemail'. The domain is the same, so this normalized form is already in our set and no new entry is added.

Step 3: Process third email

Local part 'testemail+david': truncating at '+' gives 'testemail', and there are no dots to remove. The domain is 'lee.tcode.com', which differs from the first two, so a new unique address is added.

Step 4: Final result

Three emails normalize to 2 unique addresses. The first two map to the same address (same local after normalization, same domain).
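The walkthrough can be reproduced with a short trace. The full addresses below are reconstructed from the local parts and domains mentioned in the steps, so treat the exact input list as an assumption. normalize_email is a helper name introduced for this sketch.

```python
def normalize_email(email):
    # Split first so the rules only ever touch the local name.
    local, domain = email.split("@")
    local = local.split("+")[0].replace(".", "")
    return local + "@" + domain


# Assumed input, reconstructed from the walkthrough steps.
emails = [
    "test.email+alex@leetcode.com",
    "test.e.mail+bob@leetcode.com",
    "testemail+david@lee.tcode.com",
]

seen = set()
for email in emails:
    normalized = normalize_email(email)
    is_new = normalized not in seen
    seen.add(normalized)
    print(email, "->", normalized, "(new)" if is_new else "(duplicate)")

print(len(seen))  # 2
```

The second address is reported as a duplicate and the third as new, matching steps 2 and 3.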

The solution

def numUniqueEmails(emails):
    seen = set()
    for email in emails:
        local, domain = email.split("@")
        local = local.split("+")[0]
        local = local.replace(".", "")
        seen.add(local + "@" + domain)
    return len(seen)

We initialize an empty set to track unique normalized addresses.
For each email, we split at the @ symbol to get the local name and domain.
We split the local name at + and keep only the first part (discarding the suffix).
We remove all dots from the remaining local name using replace(".", "").
We reconstruct the normalized email and add it to the set (duplicates are automatically ignored).
Finally, we return the size of the set, which is the count of unique addresses.

Complexity analysis

Metric	Value
Time	O(n * m) where n is the number of emails and m is the average email length
Space	O(n * m) for the set storing normalized emails

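Each email is processed in a constant number of linear passes over its characters (split, replace, and concatenation are each linear in the string length), so the total work is proportional to the total number of input characters. The set stores at most n entries, and each normalized address is no longer than the email it came from.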
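To make the linear bound concrete, here is a character-level sketch that avoids the intermediate strings created by split and replace. It is an alternative to the solution above, not a replacement for it. normalize_single_pass is a hypothetical name, and the code assumes each email contains exactly one '@', as the problem guarantees.

```python
def normalize_single_pass(email):
    out = []
    at = email.index("@")       # position of the separator (one '@' assumed)
    for ch in email[:at]:       # scan the local name left to right
        if ch == "+":
            break               # the '+' and everything after it are discarded
        if ch != ".":
            out.append(ch)      # dots are skipped, other characters are kept
    out.append(email[at:])      # '@' plus the domain, copied unchanged
    return "".join(out)


print(normalize_single_pass("test.email+alex@leetcode.com"))  # testemail@leetcode.com
```

Every character is examined a bounded number of times, so the cost per email is still O(m). The asymptotic complexity is the same as the split-and-replace version.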

Building blocks

1. String normalization

def normalize_local(local):
    local = local.split("+")[0]  # truncate at +
    local = local.replace(".", "")  # remove dots
    return local

This is the reusable kernel. Any time you need to canonicalize a string before comparison, you follow the same pattern: identify which characters or substrings to discard, apply the rules, and return the cleaned version.

2. Set deduplication

def count_distinct(collection, normalize):
    # normalize: any function mapping an item to its canonical form
    seen = set()
    for item in collection:
        seen.add(normalize(item))
    return len(seen)

Sets give you O(1) average-case insertion and lookup, not counting the cost of hashing the key, which for a string is linear in its length. When the question is "how many distinct items exist after some transformation," a set is almost always the right data structure.

The combination of "normalize then deduplicate" appears in many problems. Group Anagrams uses sorted characters as the normalization key. This problem uses dot-removal and plus-truncation. The structural pattern is identical.
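For comparison, here is a minimal sketch of the Group Anagrams variant of the same pattern. This is one common way to solve it, shown only to highlight the shared structure: the normalization step changes (sorted characters instead of dot-removal and plus-truncation), and a dictionary replaces the set because the groups themselves are needed.

```python
from collections import defaultdict


def group_anagrams(words):
    groups = defaultdict(list)
    for word in words:
        key = "".join(sorted(word))  # normalization: sorted characters
        groups[key].append(word)     # collect originals under the canonical key
    return list(groups.values())


print(group_anagrams(["eat", "tea", "tan", "ate", "nat", "bat"]))
# [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]
```

Swapping the key function for the email normalization and counting the keys gives back the solution to this problem.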

Edge cases

No dots or plus signs: The local name is already normalized, so the email passes through unchanged.
Multiple dots in a row: "a...b@example.com" normalizes to "ab@example.com". All dots are removed regardless of position.
Plus sign with nothing after it: "user+@domain.com" normalizes to "user@domain.com". Splitting on + and taking index 0 handles this naturally.
Dots in the domain: These are preserved. "user@my.long.domain.com" keeps its domain intact because we only process the local part.
All emails identical after normalization: The set will have size 1.
All emails unique even after normalization: The set will have size n (the full input length).
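The edge cases above can be written as quick checks. The helper names normalize and count_unique are made up for this sketch and mirror the solution. The inputs for the last two cases are invented examples.

```python
def normalize(email):
    local, domain = email.split("@")
    return local.split("+")[0].replace(".", "") + "@" + domain


def count_unique(emails):
    return len({normalize(email) for email in emails})


# No dots or plus signs: unchanged.
assert normalize("plain@example.com") == "plain@example.com"
# Multiple dots in a row: all removed.
assert normalize("a...b@example.com") == "ab@example.com"
# Plus sign with nothing after it.
assert normalize("user+@domain.com") == "user@domain.com"
# Dots in the domain are preserved.
assert normalize("user@my.long.domain.com") == "user@my.long.domain.com"
# All emails identical after normalization.
assert count_unique(["a.b@x.com", "ab+tag@x.com", "ab@x.com"]) == 1
# All emails unique even after normalization.
assert count_unique(["a@x.com", "b@x.com", "a@y.com"]) == 3
```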

From understanding to recall

Understanding the logic behind this problem takes minutes, but retaining the pattern for months requires deliberate practice. String normalization shows up in varied disguises: URL canonicalization, file path resolution, identifier deduplication. If you only solve it once and move on, you will likely fumble the implementation details when it reappears under time pressure.

Spaced repetition addresses this by surfacing the problem at intervals calibrated to your forgetting curve. The first review might come a day later, then three days, then a week. Each retrieval strengthens the neural pathway from "normalize then deduplicate" to the concrete steps: split, truncate, remove, rejoin, collect in a set.

CodeBricks schedules these reviews automatically so you spend your limited study time on patterns that are about to fade rather than ones already solid in memory.

Group Anagrams - String normalization as a grouping key
Valid Anagram - Character frequency comparison after normalization
Isomorphic Strings - Pattern matching between string representations

This problem is a great entry point into the world of string processing interview questions. Once the normalize-and-deduplicate pattern clicks, you will find it everywhere. CodeBricks helps you lock it in permanently through intelligent review scheduling.