Skip to content
← All posts

UTF-8 Validation: Bit Pattern Matching

5 min read
leetcodeproblemmediumbit-manipulation

Given an array of integers representing bytes, determine whether it forms a valid UTF-8 encoding. That is LeetCode 393. Each integer in the array represents one byte (only the lowest 8 bits matter), and you need to verify that the entire sequence follows the UTF-8 encoding rules.

UTF-8 is a variable-length encoding. A character can be 1, 2, 3, or 4 bytes long, and each format has a specific bit pattern that the bytes must match. Your job is to check every byte against these patterns.

UTF-8 Encoding Rulesprefix bits (fixed pattern)data bits (x = 0 or 1)1-byte0 to 1270xxxxxxx1 prefix, 7 data bits2-byte128 to 2047110xxxxx10xxxxxx5 prefix, 11 data bits3-byte2048 to 655351110xxxx10xxxxxx10xxxxxx8 prefix, 16 data bits4-byte65536 to 111411111110xxx10xxxxxx10xxxxxx10xxxxxx11 prefix, 21 data bits
UTF-8 encoding uses prefix bits to signal the byte count. Continuation bytes always start with 10. The x positions carry the actual character data.

The encoding rules

The first byte of a character tells you how many bytes that character uses:

  • 1-byte character: starts with 0. The pattern is 0xxxxxxx.
  • 2-byte character: starts with 110. The pattern is 110xxxxx 10xxxxxx.
  • 3-byte character: starts with 1110. The pattern is 1110xxxx 10xxxxxx 10xxxxxx.
  • 4-byte character: starts with 11110. The pattern is 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.

Every continuation byte (the bytes after the first one) must start with 10. If any byte does not match the expected pattern, the encoding is invalid.

The approach

You walk through the array byte by byte. When you encounter a leading byte, you use bit masks to determine how many continuation bytes should follow. Then you check each continuation byte to confirm it starts with 10. If anything is off, you return False.

The bit masks you need:

  • byte & 0x80 == 0: single byte (starts with 0)
  • byte & 0xE0 == 0xC0: two-byte lead (starts with 110)
  • byte & 0xF0 == 0xE0: three-byte lead (starts with 1110)
  • byte & 0xF8 == 0xF0: four-byte lead (starts with 11110)
  • byte & 0xC0 == 0x80: valid continuation byte (starts with 10)

The solution

def validUtf8(data: list[int]) -> bool:
    remaining = 0

    for byte in data:
        if remaining > 0:
            if byte >> 6 != 0b10:
                return False
            remaining -= 1
        elif byte >> 7 == 0:
            remaining = 0
        elif byte >> 5 == 0b110:
            remaining = 1
        elif byte >> 4 == 0b1110:
            remaining = 2
        elif byte >> 3 == 0b11110:
            remaining = 3
        else:
            return False

    return remaining == 0

The variable remaining tracks how many continuation bytes you still expect. When it is 0, you are at the start of a new character and you read the leading byte to determine the character length. When it is greater than 0, you verify the current byte is a valid continuation byte (top two bits are 10).

Step-by-step walkthrough

Let's trace through the input [197, 130, 1]:

  • 197 in binary is 11000101
  • 130 in binary is 10000010
  • 1 in binary is 00000001

Step 1: Read byte 197 (binary 11000101)

byte 011000101110xxxxx: 2-byte char

The top 3 bits are 110, so this starts a 2-byte character. We expect 1 continuation byte next.

Step 2: Read byte 130 (binary 10000010)

byte 11000001010xxxxxx: valid continuation

Top 2 bits are 10, confirming this is a valid continuation byte. The 2-byte character is complete.

Step 3: Read byte 1 (binary 00000001)

byte 2000000010xxxxxxx: 1-byte char

The top bit is 0, so this is a valid single-byte ASCII character. All bytes validated. Return True.

The sequence is valid. Byte 197 starts a 2-byte character, byte 130 is the required continuation byte, and byte 1 is a standalone ASCII character.

The key insight is using right shifts instead of masks. byte >> 5 == 0b110 is equivalent to byte & 0xE0 == 0xC0, but the shift version reads more naturally since you are directly comparing the prefix bits.

Complexity analysis

AspectValue
Time complexityO(n), one pass through the array
Space complexityO(1), only a counter variable

You visit each byte exactly once and perform a constant number of bit operations per byte. No extra data structures are needed.

Building blocks

This problem decomposes into two core skills:

Bit masking and shifting

The ability to isolate specific bits of a byte using shifts and comparisons. You need to extract the top N bits of a byte to determine whether it is a lead byte or a continuation byte. This is the same skill used in counting set bits, reversing bits, and checking powers of two.

byte >> 5 == 0b110

State tracking across a sequence

You maintain a counter (remaining) that carries state from one byte to the next. This pattern of processing a stream of tokens while maintaining a simple state variable appears in many validation problems, from parentheses matching to finite automata.

Edge cases

Empty array: The input has no bytes to validate. remaining starts at 0 and never changes, so you return True. An empty sequence is trivially valid.

Incomplete multi-byte character: The array ends while you still expect continuation bytes. For example, [197] starts a 2-byte character but has no continuation byte. The loop finishes with remaining = 1, so you return False.

Byte starting with 10 when no continuation expected: If remaining is 0 and you encounter a byte like 10xxxxxx, it does not match any leading byte pattern. The final else branch catches this and returns False.

Invalid 5-byte or longer prefix: A byte like 11111000 (starts with 11111) does not match any valid UTF-8 lead pattern. The cascading if-elif chain rejects it in the else branch.

Only the lowest 8 bits matter: The problem states that each integer can be larger than 255. You only examine the lowest 8 bits. Since we use right shifts to check the top bits of the lowest 8 bits, values larger than 255 will naturally fail the prefix checks and be handled correctly. If you want to be explicit, you can mask with byte & 0xFF at the start.

From understanding to recall

UTF-8 validation is a clean exercise in translating a specification into bit-level code. The encoding rules are not hard to look up, but in an interview you need to translate them into masks and shifts quickly and without hesitation.

Spaced repetition helps here. The first time you write the cascading shift checks, you will probably need to reference the bit patterns. After a few repetitions spread over days, the mapping from prefix bits to character length becomes automatic. You stop thinking about which mask to use and start focusing on the state machine logic instead.

Related posts

  • Number of 1 Bits - Another bit manipulation problem that uses masking and shifting to examine individual bits. The n & 1 and right-shift pattern is the foundation for the byte-level checks in UTF-8 validation.
  • Reverse Bits - Practices the same bit extraction and placement skills. Both problems require you to think about bits at specific positions within a fixed-width value.
  • Single Number - Uses XOR as a bit manipulation primitive. While XOR is not used in UTF-8 validation, building fluency across all bit operations (AND, OR, XOR, shifts) makes each individual problem easier to approach.