Distinct Subsequences II: DP with Duplicate Tracking

May 25, 20266 min read

leetcodeproblemhardstringsdynamic-programming

Given a string s, return the number of distinct non-empty subsequences of s. Since the answer can be very large, return it modulo 10^9 + 7.

A subsequence is a sequence derived from the string by deleting zero or more characters without changing the order of the remaining characters. The tricky part is that two subsequences are considered the same if they have identical character sequences, even if they were formed by choosing different indices. Counting distinct subsequences while handling these duplicates efficiently is the core challenge.

This is LeetCode 940: Distinct Subsequences II.

For s = "abc", the DP value doubles at each step (plus 1 for the new character alone), giving 1, 3, 7.

The key insight: doubling with deduplication

Consider what happens when you process a new character c from the string. Every distinct subsequence you have built so far can either include c at the end or not. That gives you twice as many subsequences, plus c by itself as a brand new single-character subsequence.

If c has never appeared before, that is all you need:

dp[i] = 2 * dp[i-1]

Here dp[i] counts the number of distinct subsequences (including the empty subsequence as a building block) after processing the first i+1 characters, with dp[-1] = 1 representing just the empty subsequence.

But when c has appeared before, you overcount. The subsequences that ended with the previous occurrence of c get generated again. To fix this, subtract the count right before that previous occurrence:

dp[i] = 2 * dp[i-1] - dp[last[c] - 1]

where last[c] is the index of the most recent prior occurrence of character c.

At the end, subtract 1 from the final dp value to exclude the empty subsequence, since the problem asks for non-empty subsequences only.

The formula works because dp[last[c] - 1] is exactly the number of subsequences that existed before the previous 'c' was processed. Those subsequences, when appended with 'c', produce the same results both times. Subtracting them once removes the double-count precisely.

Practice this problem with spaced repetition

Walking through the example

Take s = "aba". Let us trace the DP array step by step.

dp[-1] = 1 (just the empty subsequence)
i = 0, c = 'a': First time seeing 'a'. dp[0] = 2 * 1 = 2. Subsequences: ("empty", "a").
i = 1, c = 'b': First time seeing 'b'. dp[1] = 2 * 2 = 4. Subsequences: ("empty", "a", "b", "ab").
i = 2, c = 'a': Seen 'a' before at index 0. dp[2] = 2 * 4 - dp[-1] = 8 - 1 = 7. Subsequences: ("empty", "a", "b", "ab", "ba", "aa", "aba").

The answer is dp[2] - 1 = 6. Those six distinct non-empty subsequences are: "a", "b", "ab", "ba", "aa", "aba".

Notice how the subtraction of dp[-1] = 1 at step 2 prevented us from double-counting the subsequence "a", which would have been generated again by appending the second 'a' to the empty subsequence.

The solution

def distinctSubseqII(s):
    MOD = 10 ** 9 + 7
    dp = 1
    last = {}
    prev_dp = {}
    for i, c in enumerate(s):
        new_dp = (2 * dp) % MOD
        if c in last:
            new_dp = (new_dp - prev_dp[c]) % MOD
        prev_dp[c] = dp
        last[c] = i
        dp = new_dp
    return (dp - 1) % MOD

The variable dp holds the running count (including the empty subsequence). The dictionary prev_dp maps each character to the dp value that existed right before that character was last processed. When you see a repeat character, you subtract that stored value to remove duplicates.

Step 1: Initialize and process s[0] = 'a'

Start with dp[-1] = 1 (empty subsequence). First occurrence of 'a', so dp[0] = 2 * dp[-1] = 2. This represents: {empty, "a"}. Record last['a'] = 0.

Step 2: Process s[1] = 'b'

First occurrence of 'b', so dp[1] = 2 * dp[0] = 2 * 2 = 4. This represents: {empty, "a", "b", "ab"}. Record last['b'] = 1.

Step 3: Process s[2] = 'a' (duplicate!)

Character 'a' appeared before at index 0. dp[2] = 2 * dp[1] - dp[last['a'] - 1] = 2 * 4 - dp[-1] = 8 - 1 = 7. Subtracting dp[-1] removes the duplicate subsequences that were already counted when 'a' first appeared.

Step 4: Extract the answer

Final answer = dp[2] - 1 = 7 - 1 = 6 (subtract the empty subsequence). The 6 distinct non-empty subsequences of "aba" are: "a", "b", "ab", "ba", "aa", "aba".

Why the formula is correct

Think of it this way. At any point, dp represents the total number of distinct subsequences including the empty one. When you encounter character c:

Doubling: every existing subsequence can either skip c or append c. That doubles the count.
Overcounting: if c appeared before, the subsequences that existed right before the previous c were already extended with c once. Doubling extends them with c again, creating duplicates.
Correction: subtracting prev_dp[c] (the count right before the last time c was processed) removes exactly those duplicates.

This works because the set of duplicated subsequences maps one-to-one with the set of subsequences that existed before the previous occurrence of c. The subtraction is exact, not an approximation.

Build your pattern library with CodeBricks

Complexity analysis

Approach	Time	Space
DP with last-seen map	O(n)	O(1)

Time: O(n). You process each character once, doing O(1) work per character (a multiplication, a dictionary lookup, and possibly a subtraction).

Space: O(1). The last and prev_dp dictionaries hold at most 26 entries (one per lowercase letter), which is constant. You only keep a single running dp value, not an array.

Edge cases to watch for

Single character: s = "a". dp = 2 * 1 = 2, answer = 2 - 1 = 1. Just the one subsequence "a".
All same characters: s = "aaa". After each step, the duplicate subtraction kicks in. dp goes 1 -> 2 -> 3 -> 4, and the answer is 3. The distinct subsequences are "a", "aa", "aaa".
No repeats: s = "abc". No subtractions ever happen. dp goes 1 -> 2 -> 4 -> 8, and the answer is 7. Every possible non-empty subsequence is distinct.
Modular arithmetic: always apply % MOD after every arithmetic operation to prevent overflow. The subtraction can go negative, so use (new_dp - prev_dp[c]) % MOD which in Python handles negative values correctly.

The building blocks

This problem uses two core building blocks:

DP with deduplication. The doubling formula dp[i] = 2 * dp[i-1] is the "generate all new subsequences" step, and the subtraction - dp[last[c] - 1] is the "remove duplicates" step. This pattern of "count everything, then subtract overcounts" appears in many combinatorial DP problems.

Character tracking with a last-seen map. Storing the most recent index (or the dp value at that index) for each character lets you efficiently identify and remove duplicate contributions. This is the same idea used in problems like longest substring without repeating characters, where you track the last position of each character to make decisions in O(1) time.

The combination of these two building blocks gives you a clean O(n) solution. The doubling handles the combinatorial explosion, and the last-seen map keeps duplicates in check.

From understanding to recall

The formula dp[i] = 2 * dp[i-1] - dp[last[c] - 1] is compact, but it packs a lot of reasoning into a single line. Can you reconstruct why you double? Can you remember what exactly gets subtracted and why? Will you handle the edge case where the character has not appeared before?

Spaced repetition turns this from "I read it once" into "I can derive it cold." You revisit the doubling logic and the duplicate subtraction at increasing intervals, writing them from memory each time. After a few reps, the connection between "append c to all existing subsequences" and "subtract the ones from before the last c" becomes automatic.

This DP-with-deduplication pattern is one of roughly 60 reusable building blocks that cover hundreds of LeetCode problems. Drilling them individually with spaced repetition beats random problem grinding every time.

Distinct Subsequences - Count subsequences matching a target string, a related DP problem with a different formulation
Decode Ways - Linear DP with conditional transitions, another great drill for the DP pipeline
Counting Bits - Uses a similar "build on previous results" DP pattern with bit manipulation

Try CodeBricks free