Skip to content
← All posts

Minimum Genetic Mutation: BFS on Gene Strings

5 min read
leetcodeproblemmediumstringsgraph

You are given two gene strings startGene and endGene, each 8 characters long using only the characters A, C, G, and T. You are also given a gene bank, a list of valid gene strings. A single mutation changes exactly one character. Find the minimum number of mutations needed to transform startGene into endGene, where every intermediate gene must be in the bank. If no valid transformation exists, return -1.

This is LeetCode 433: Minimum Genetic Mutation, a classic BFS shortest path problem. If you have seen Word Ladder, this will feel familiar. The twist is that instead of 26 lowercase letters, you only have 4 possible characters at each position.

Example: startGene = "AACCGGTT", endGene = "AAACGGTA", bank = ["AACCGGTA","AACCGCTA","AAACGGTA"]. The answer is 2: AACCGGTT -> AACCGGTA -> AAACGGTA.

AACCGGTTAACCGGTAAACCGCTAAAACGGTAstarttarget (2 mutations)
Gene strings as graph nodes. Edges connect genes that differ by exactly one character. The green path is the shortest route from start to target.

Why this problem matters

Minimum Genetic Mutation is a perfect stepping stone for graph BFS problems. The gene string constraint (only 4 characters, fixed length 8) makes the search space small and predictable, which lets you focus on the BFS pattern itself without worrying about performance tricks.

The underlying pattern is identical to Word Ladder: model strings as nodes in a graph, connect nodes that differ by exactly one character, and run BFS to find the shortest path. Once you internalize this "strings as graph nodes" reframing, you can apply it to any problem where you need the minimum number of single-character transformations.

This problem also reinforces a critical BFS detail: you must check that every intermediate state is valid (in the bank). The bank acts as a filter on which nodes exist in the graph. Without it, you would explore an exponential number of invalid gene strings.

The approach

The key insight is that each gene string is a node in an implicit graph. Two nodes share an edge if they differ by exactly one character. "Find the minimum mutations" becomes "find the shortest path in an unweighted graph," and BFS solves that directly.

For each gene in the queue, you try all 4 characters (A, C, G, T) at each of the 8 positions. If the resulting string is in the bank and has not been visited, you add it to the queue. When BFS reaches endGene, the current mutation count is the answer.

from collections import deque

def min_mutation(start_gene: str, end_gene: str, bank: list[str]) -> int:
    bank_set = set(bank)
    if end_gene not in bank_set:
        return -1

    queue = deque([(start_gene, 0)])
    visited = {start_gene}

    while queue:
        gene, mutations = queue.popleft()

        for i in range(8):
            for c in "ACGT":
                if c == gene[i]:
                    continue
                next_gene = gene[:i] + c + gene[i + 1:]
                if next_gene == end_gene:
                    return mutations + 1
                if next_gene in bank_set and next_gene not in visited:
                    visited.add(next_gene)
                    queue.append((next_gene, mutations + 1))

    return -1

Step-by-step walkthrough

Let's trace BFS through the example: startGene = "AACCGGTT", endGene = "AAACGGTA", bank = ["AACCGGTA","AACCGCTA","AAACGGTA"].

Step 1: Initialize BFS with startGene

mutations:0
queue:
AACCGGTT
visited:
AACCGGTT

Start BFS from "AACCGGTT". Mutation count begins at 0.

Step 2: Process "AACCGGTT". Find one valid neighbor in the bank.

mutations:1
queue:
AACCGGTA
visited:
AACCGGTTAACCGGTA

"AACCGGTT" to "AACCGGTA": position 7 changes T to A. That is the only bank word one mutation away. Mutation count is now 1.

Step 3: Process "AACCGGTA". Find two valid neighbors.

mutations:2
queue:
AACCGCTAAAACGGTA
visited:
AACCGGTTAACCGGTAAACCGCTAAAACGGTA

"AACCGGTA" to "AACCGCTA": position 5, G to C. "AACCGGTA" to "AAACGGTA": position 2, C to A. Both are in the bank and unvisited. Mutation count is now 2.

Step 4: "AAACGGTA" is the endGene. Return 2!

mutations:2
queue:
(empty)
visited:
AACCGGTTAACCGGTAAACCGCTAAAACGGTA

BFS reached "AAACGGTA" after 2 mutations. Path: AACCGGTT -> AACCGGTA -> AAACGGTA. Return 2.

BFS explores genes level by level. Each level corresponds to one mutation. The first time BFS reaches the endGene, the mutation count is guaranteed to be minimal because BFS always finds the shortest path in an unweighted graph.

Notice that "AACCGCTA" gets enqueued but never leads to the target in fewer steps. BFS naturally handles this by processing all nodes at mutation count 2 simultaneously. The target is found at level 2 regardless of which branch reaches it first.

Complexity analysis

ApproachTimeSpace
BFS with character replacementO(B * M * K)O(B * M)

B is the number of genes in the bank. M is the length of each gene (always 8). K is the number of possible characters (always 4).

Time: In the worst case, BFS visits every gene in the bank. For each gene, you try K characters at each of M positions. Building each neighbor string costs O(M) due to string slicing. Total: O(B * M * K * M) = O(B * M^2 * K). Since M = 8 and K = 4 are constants, this simplifies to O(B) in practice.

Space: The visited set and queue can hold up to O(B) genes, each of length M. Total: O(B * M), or effectively O(B) since M is constant.

Edge cases to watch for

  • endGene not in bank: Return -1 immediately. No valid mutation can produce a gene outside the bank.
  • startGene equals endGene: Return 0. Zero mutations needed (though the problem constraints say they differ).
  • No path exists: BFS exhausts the queue without finding endGene. Return -1.
  • startGene is in the bank: This does not cause issues. BFS marks it as visited at the start, so it will not be revisited.
  • Single mutation path: If endGene differs from startGene by one character and endGene is in the bank, return 1.

The building blocks

This problem is built on two core reusable patterns that CodeBricks drills independently:

1. BFS shortest path on an implicit graph

You never build an adjacency list. Instead, you generate neighbors on the fly by trying all single-character mutations and filtering through the bank set. This implicit graph BFS template appears in Word Ladder, open-the-lock puzzles, and many other transformation problems. The skeleton is always the same: dequeue a node, generate valid neighbors, enqueue unvisited ones, and return as soon as you hit the target.

2. Neighbor generation via character replacement

For gene strings, you generate neighbors by replacing each position with each of the 4 valid characters. For word ladder, it is 26 letters. For lock combinations, it is incrementing or decrementing each digit. The approach changes, but the pattern of "try all single-position modifications and check validity" stays the same. This is the same strategy used in Clone Graph for exploring adjacent nodes, adapted to string-based state spaces.

From understanding to recall

You now see how Minimum Genetic Mutation reduces to BFS on an implicit graph of gene strings. The bank set, the character replacement loop, the visited check on enqueue. It all clicks when you read through it. But in a timed interview, you need to write the nested loop for 8 positions and 4 characters, the string slicing, and the early return, all from memory.

That is where spaced repetition helps. CodeBricks breaks this problem into its building blocks (implicit graph BFS and character replacement neighbor generation) and drills them at increasing intervals. You type the queue initialization, the mutation loop, and the bank lookup from scratch. After a few sessions, the pattern is automatic. When a transformation problem shows up in your interview, you recognize it immediately and write the solution without hesitation.

Related posts

  • Word Ladder - The classic BFS transformation problem with 26 letters instead of 4, same implicit graph approach
  • Clone Graph - Graph traversal with a visited map, reinforcing the BFS/DFS pattern for exploring node neighbors
  • Number of Islands - Grid BFS/DFS traversal, another problem where recognizing the implicit graph structure is the key insight