Distinct Subsequences II: DP with Duplicate Tracking
Given a string s, return the number of distinct non-empty subsequences of s. Since the answer can be very large, return it modulo 10^9 + 7.
A subsequence is a sequence derived from the string by deleting zero or more characters without changing the order of the remaining characters. The tricky part is that two subsequences are considered the same if they have identical character sequences, even if they were formed by choosing different indices. Counting distinct subsequences while handling these duplicates efficiently is the core challenge.
This is LeetCode 940: Distinct Subsequences II.
The key insight: doubling with deduplication
Consider what happens when you process a new character c from the string. Every distinct subsequence you have built so far can either include c at the end or not. That gives you twice as many subsequences, plus c by itself as a brand new single-character subsequence.
If c has never appeared before, that is all you need:
dp[i] = 2 * dp[i-1]
Here dp[i] counts the number of distinct subsequences (including the empty subsequence as a building block) after processing the first i+1 characters, with dp[-1] = 1 representing just the empty subsequence.
But when c has appeared before, you overcount. The subsequences that ended with the previous occurrence of c get generated again. To fix this, subtract the count right before that previous occurrence:
dp[i] = 2 * dp[i-1] - dp[last[c] - 1]
where last[c] is the index of the most recent prior occurrence of character c.
At the end, subtract 1 from the final dp value to exclude the empty subsequence, since the problem asks for non-empty subsequences only.
The formula works because dp[last[c] - 1] is exactly the number of subsequences that existed before the previous 'c' was processed. Those subsequences, when appended with 'c', produce the same results both times. Subtracting them once removes the double-count precisely.
Walking through the example
Take s = "aba". Let us trace the DP array step by step.
- dp[-1] = 1 (just the empty subsequence)
- i = 0, c = 'a': First time seeing 'a'.
dp[0] = 2 * 1 = 2. Subsequences: ("empty", "a"). - i = 1, c = 'b': First time seeing 'b'.
dp[1] = 2 * 2 = 4. Subsequences: ("empty", "a", "b", "ab"). - i = 2, c = 'a': Seen 'a' before at index 0.
dp[2] = 2 * 4 - dp[-1] = 8 - 1 = 7. Subsequences: ("empty", "a", "b", "ab", "ba", "aa", "aba").
The answer is dp[2] - 1 = 6. Those six distinct non-empty subsequences are: "a", "b", "ab", "ba", "aa", "aba".
Notice how the subtraction of dp[-1] = 1 at step 2 prevented us from double-counting the subsequence "a", which would have been generated again by appending the second 'a' to the empty subsequence.
The solution
def distinctSubseqII(s):
MOD = 10 ** 9 + 7
dp = 1
last = {}
prev_dp = {}
for i, c in enumerate(s):
new_dp = (2 * dp) % MOD
if c in last:
new_dp = (new_dp - prev_dp[c]) % MOD
prev_dp[c] = dp
last[c] = i
dp = new_dp
return (dp - 1) % MOD
The variable dp holds the running count (including the empty subsequence). The dictionary prev_dp maps each character to the dp value that existed right before that character was last processed. When you see a repeat character, you subtract that stored value to remove duplicates.
Step 1: Initialize and process s[0] = 'a'
Start with dp[-1] = 1 (empty subsequence). First occurrence of 'a', so dp[0] = 2 * dp[-1] = 2. This represents: {empty, "a"}. Record last['a'] = 0.
Step 2: Process s[1] = 'b'
First occurrence of 'b', so dp[1] = 2 * dp[0] = 2 * 2 = 4. This represents: {empty, "a", "b", "ab"}. Record last['b'] = 1.
Step 3: Process s[2] = 'a' (duplicate!)
Character 'a' appeared before at index 0. dp[2] = 2 * dp[1] - dp[last['a'] - 1] = 2 * 4 - dp[-1] = 8 - 1 = 7. Subtracting dp[-1] removes the duplicate subsequences that were already counted when 'a' first appeared.
Step 4: Extract the answer
Final answer = dp[2] - 1 = 7 - 1 = 6 (subtract the empty subsequence). The 6 distinct non-empty subsequences of "aba" are: "a", "b", "ab", "ba", "aa", "aba".
Why the formula is correct
Think of it this way. At any point, dp represents the total number of distinct subsequences including the empty one. When you encounter character c:
- Doubling: every existing subsequence can either skip
cor appendc. That doubles the count. - Overcounting: if
cappeared before, the subsequences that existed right before the previouscwere already extended withconce. Doubling extends them withcagain, creating duplicates. - Correction: subtracting
prev_dp[c](the count right before the last timecwas processed) removes exactly those duplicates.
This works because the set of duplicated subsequences maps one-to-one with the set of subsequences that existed before the previous occurrence of c. The subtraction is exact, not an approximation.
Complexity analysis
| Approach | Time | Space |
|---|---|---|
| DP with last-seen map | O(n) | O(1) |
Time: O(n). You process each character once, doing O(1) work per character (a multiplication, a dictionary lookup, and possibly a subtraction).
Space: O(1). The last and prev_dp dictionaries hold at most 26 entries (one per lowercase letter), which is constant. You only keep a single running dp value, not an array.
Edge cases to watch for
- Single character:
s = "a".dp = 2 * 1 = 2, answer =2 - 1 = 1. Just the one subsequence "a". - All same characters:
s = "aaa". After each step, the duplicate subtraction kicks in. dp goes 1 -> 2 -> 3 -> 4, and the answer is 3. The distinct subsequences are "a", "aa", "aaa". - No repeats:
s = "abc". No subtractions ever happen. dp goes 1 -> 2 -> 4 -> 8, and the answer is 7. Every possible non-empty subsequence is distinct. - Modular arithmetic: always apply
% MODafter every arithmetic operation to prevent overflow. The subtraction can go negative, so use(new_dp - prev_dp[c]) % MODwhich in Python handles negative values correctly.
The building blocks
This problem uses two core building blocks:
DP with deduplication. The doubling formula dp[i] = 2 * dp[i-1] is the "generate all new subsequences" step, and the subtraction - dp[last[c] - 1] is the "remove duplicates" step. This pattern of "count everything, then subtract overcounts" appears in many combinatorial DP problems.
Character tracking with a last-seen map. Storing the most recent index (or the dp value at that index) for each character lets you efficiently identify and remove duplicate contributions. This is the same idea used in problems like longest substring without repeating characters, where you track the last position of each character to make decisions in O(1) time.
The combination of these two building blocks gives you a clean O(n) solution. The doubling handles the combinatorial explosion, and the last-seen map keeps duplicates in check.
From understanding to recall
The formula dp[i] = 2 * dp[i-1] - dp[last[c] - 1] is compact, but it packs a lot of reasoning into a single line. Can you reconstruct why you double? Can you remember what exactly gets subtracted and why? Will you handle the edge case where the character has not appeared before?
Spaced repetition turns this from "I read it once" into "I can derive it cold." You revisit the doubling logic and the duplicate subtraction at increasing intervals, writing them from memory each time. After a few reps, the connection between "append c to all existing subsequences" and "subtract the ones from before the last c" becomes automatic.
This DP-with-deduplication pattern is one of roughly 60 reusable building blocks that cover hundreds of LeetCode problems. Drilling them individually with spaced repetition beats random problem grinding every time.
Related posts
- Distinct Subsequences - Count subsequences matching a target string, a related DP problem with a different formulation
- Decode Ways - Linear DP with conditional transitions, another great drill for the DP pipeline
- Counting Bits - Uses a similar "build on previous results" DP pattern with bit manipulation