UTF-8 Validation: Bit Pattern Matching
Given an array of integers representing bytes, determine whether it forms a valid UTF-8 encoding. That is LeetCode 393. Each integer in the array represents one byte (only the lowest 8 bits matter), and you need to verify that the entire sequence follows the UTF-8 encoding rules.
UTF-8 is a variable-length encoding. A character can be 1, 2, 3, or 4 bytes long, and each format has a specific bit pattern that the bytes must match. Your job is to check every byte against these patterns.
The encoding rules
The first byte of a character tells you how many bytes that character uses:
- 1-byte character: starts with
0. The pattern is0xxxxxxx. - 2-byte character: starts with
110. The pattern is110xxxxx 10xxxxxx. - 3-byte character: starts with
1110. The pattern is1110xxxx 10xxxxxx 10xxxxxx. - 4-byte character: starts with
11110. The pattern is11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
Every continuation byte (the bytes after the first one) must start with 10. If any byte does not match the expected pattern, the encoding is invalid.
The approach
You walk through the array byte by byte. When you encounter a leading byte, you use bit masks to determine how many continuation bytes should follow. Then you check each continuation byte to confirm it starts with 10. If anything is off, you return False.
The bit masks you need:
byte & 0x80 == 0: single byte (starts with0)byte & 0xE0 == 0xC0: two-byte lead (starts with110)byte & 0xF0 == 0xE0: three-byte lead (starts with1110)byte & 0xF8 == 0xF0: four-byte lead (starts with11110)byte & 0xC0 == 0x80: valid continuation byte (starts with10)
The solution
def validUtf8(data: list[int]) -> bool:
remaining = 0
for byte in data:
if remaining > 0:
if byte >> 6 != 0b10:
return False
remaining -= 1
elif byte >> 7 == 0:
remaining = 0
elif byte >> 5 == 0b110:
remaining = 1
elif byte >> 4 == 0b1110:
remaining = 2
elif byte >> 3 == 0b11110:
remaining = 3
else:
return False
return remaining == 0
The variable remaining tracks how many continuation bytes you still expect. When it is 0, you are at the start of a new character and you read the leading byte to determine the character length. When it is greater than 0, you verify the current byte is a valid continuation byte (top two bits are 10).
Step-by-step walkthrough
Let's trace through the input [197, 130, 1]:
- 197 in binary is
11000101 - 130 in binary is
10000010 - 1 in binary is
00000001
Step 1: Read byte 197 (binary 11000101)
The top 3 bits are 110, so this starts a 2-byte character. We expect 1 continuation byte next.
Step 2: Read byte 130 (binary 10000010)
Top 2 bits are 10, confirming this is a valid continuation byte. The 2-byte character is complete.
Step 3: Read byte 1 (binary 00000001)
The top bit is 0, so this is a valid single-byte ASCII character. All bytes validated. Return True.
The sequence is valid. Byte 197 starts a 2-byte character, byte 130 is the required continuation byte, and byte 1 is a standalone ASCII character.
The key insight is using right shifts instead of masks. byte >> 5 == 0b110 is equivalent to byte & 0xE0 == 0xC0, but the shift version reads more naturally since you are directly comparing the prefix bits.
Complexity analysis
| Aspect | Value |
|---|---|
| Time complexity | O(n), one pass through the array |
| Space complexity | O(1), only a counter variable |
You visit each byte exactly once and perform a constant number of bit operations per byte. No extra data structures are needed.
Building blocks
This problem decomposes into two core skills:
Bit masking and shifting
The ability to isolate specific bits of a byte using shifts and comparisons. You need to extract the top N bits of a byte to determine whether it is a lead byte or a continuation byte. This is the same skill used in counting set bits, reversing bits, and checking powers of two.
byte >> 5 == 0b110
State tracking across a sequence
You maintain a counter (remaining) that carries state from one byte to the next. This pattern of processing a stream of tokens while maintaining a simple state variable appears in many validation problems, from parentheses matching to finite automata.
Edge cases
Empty array: The input has no bytes to validate. remaining starts at 0 and never changes, so you return True. An empty sequence is trivially valid.
Incomplete multi-byte character: The array ends while you still expect continuation bytes. For example, [197] starts a 2-byte character but has no continuation byte. The loop finishes with remaining = 1, so you return False.
Byte starting with 10 when no continuation expected: If remaining is 0 and you encounter a byte like 10xxxxxx, it does not match any leading byte pattern. The final else branch catches this and returns False.
Invalid 5-byte or longer prefix: A byte like 11111000 (starts with 11111) does not match any valid UTF-8 lead pattern. The cascading if-elif chain rejects it in the else branch.
Only the lowest 8 bits matter: The problem states that each integer can be larger than 255. You only examine the lowest 8 bits. Since we use right shifts to check the top bits of the lowest 8 bits, values larger than 255 will naturally fail the prefix checks and be handled correctly. If you want to be explicit, you can mask with byte & 0xFF at the start.
From understanding to recall
UTF-8 validation is a clean exercise in translating a specification into bit-level code. The encoding rules are not hard to look up, but in an interview you need to translate them into masks and shifts quickly and without hesitation.
Spaced repetition helps here. The first time you write the cascading shift checks, you will probably need to reference the bit patterns. After a few repetitions spread over days, the mapping from prefix bits to character length becomes automatic. You stop thinking about which mask to use and start focusing on the state machine logic instead.
Related posts
- Number of 1 Bits - Another bit manipulation problem that uses masking and shifting to examine individual bits. The
n & 1and right-shift pattern is the foundation for the byte-level checks in UTF-8 validation. - Reverse Bits - Practices the same bit extraction and placement skills. Both problems require you to think about bits at specific positions within a fixed-width value.
- Single Number - Uses XOR as a bit manipulation primitive. While XOR is not used in UTF-8 validation, building fluency across all bit operations (AND, OR, XOR, shifts) makes each individual problem easier to approach.