String Compression II: DP on Run-Length Encoding
You are given a string s of lowercase English letters and an integer k. You can delete at most k characters from s. After deletions, the remaining characters (in their original order) are run-length encoded. The encoding replaces consecutive identical characters with the character followed by the count, but a count of 1 is omitted. For example, "aabccc" encodes as "a2bc3" (length 5). Return the minimum possible length of the run-length encoded string after deleting at most k characters.
This is LeetCode 1531: String Compression II. It is one of the hardest DP problems on the platform because the state space is tricky to define and the transitions require careful counting of how deletions affect run lengths.
Why this problem matters
Most DP problems on strings involve matching, editing, or counting subsequences. String Compression II is different because the cost function (run-length encoded length) is nonlinear. Adding one more character to a run of 9 bumps the encoded length from 2 to 3 (because the count goes from one digit to two digits). This means you cannot just greedily extend runs. You need to explore all possible deletion strategies and let the DP find the minimum.
This problem also combines two skills that appear separately in easier problems: reasoning about run-length encoding structure and managing a deletion budget across recursive calls. If you can solve this one cleanly, you have a strong handle on advanced string DP.
The key insight
Define dp(i, k) as the minimum encoded length of s[i:] when you have k deletions remaining. At each position i, you have two broad choices:
- Delete
s[i]: spend one deletion and recurse todp(i + 1, k - 1). - Keep
s[i]: start a run with the characters[i]. Then scan forward through positionsj = i+1, i+2, ...and for every character that does not matchs[i], count it as a deletion. As long as your deletion count stays within budget, you can keep extending the run. At each stopping point, the cost is the encoded length of the current run plusdp(j + 1, remaining_k)for whatever comes after.
The trick is option 2. When you keep s[i], you do not just look at the immediate run of identical characters. You look ahead and consider deleting non-matching characters in between to merge separated groups of the same character into one longer run. For example, in "aabaa" with k = 1, keeping the first a and deleting the b at index 2 merges all four as into a single run of length 4, encoded as "a4" (length 2).
The solution
from functools import lru_cache
def getLengthOfOptimalCompression(s: str, k: int) -> int:
n = len(s)
def encode_len(count):
if count == 0:
return 0
if count == 1:
return 1
if count < 10:
return 2
if count < 100:
return 3
return 4
@lru_cache(maxsize=None)
def dp(i, k):
if k < 0:
return float('inf')
if i >= n:
return 0
res = dp(i + 1, k - 1)
run_len = 0
deletions = 0
for j in range(i, n):
if s[j] == s[i]:
run_len += 1
else:
deletions += 1
if deletions > k:
break
res = min(res, encode_len(run_len) + dp(j + 1, k - deletions))
return res
return dp(0, k)
The dp function tries both options at every position. The deletion option is dp(i + 1, k - 1). The keep option scans forward, tracking how many characters match s[i] (building the run) and how many do not match (requiring deletion). At each point in the scan, it computes the cost of encoding the current run plus the optimal cost for the rest of the string with the remaining deletion budget.
The encode_len helper maps run counts to encoded lengths. A single character costs 1 (just the character, no digit). Counts from 2 to 9 cost 2 (character plus one digit). Counts from 10 to 99 cost 3 (character plus two digits). Counts of 100 cost 4 (character plus three digits). Getting this mapping right is essential because it drives the entire optimization.
Visual walkthrough
Step 1: Define the state dp(i, k)
dp(i, k) returns the minimum run-length encoded length of s[i:] when you can still delete at most k characters. The answer is dp(0, k).
Step 2: Keep s[i] and extend its run
When you keep s[i], scan forward. For each j from i+1 to n, count non-matching characters. If the non-matching count exceeds k, stop. Otherwise, deleting those non-matching characters extends the run of s[i] further. The encoded length depends on the run length: 1 char costs 1, 2-9 chars cost 2, 10-99 cost 3.
Step 3: Or delete s[i] instead
The other option at each position is to simply delete s[i], spending one deletion. This recurses to dp(i+1, k-1). The DP takes the minimum across all options: delete s[i], or keep s[i] and try every possible run extension.
Step 4: Full recurrence and optimal solution
dp(0, 1) evaluates all branches. Keeping s[0]="a" and deleting the "b" at index 2 produces the run "aaaa", encoded as "a4" (length 2) with nothing remaining. This beats keeping the "b" (which would give "a2ba2" = length 5). The minimum encoded length is 2.
The walkthrough above uses s = "aabaa" with k = 1. The optimal strategy is to delete the lone "b" at index 2, merging the two groups of "a" into a single run of 4. The encoded result is "a4" with length 2. Without any deletions, the encoding would be "a2ba2" with length 5. That one deletion saves 3 characters of encoded length.
Complexity analysis
| Approach | Time | Space |
|---|---|---|
| DP with memoization | O(n^2 * k) | O(n * k) |
There are O(n * k) unique states in the memoization table (one for each combination of starting index and remaining deletions). For each state, the inner loop scans up to n positions forward. This gives O(n^2 * k) total work. The space is O(n * k) for the cache.
The constraint n is at most 100 and k is at most n, so the worst case is roughly 100 * 100 * 100 = 1,000,000 operations. This comfortably fits within time limits.
The building blocks
1. Run-length encoding length calculation
The nonlinear cost function is what makes this problem hard. You need to know that a run of 1 costs 1, a run of 2-9 costs 2, a run of 10-99 costs 3, and a run of 100+ costs 4. This determines how valuable it is to extend a run by deleting intervening characters. Extending a run from 1 to 2 costs an extra digit in the encoding (+1 length), but extending from 2 to 3 costs nothing extra. Extending from 9 to 10 costs an extra digit (+1 length). These thresholds at 1, 10, and 100 create the nonlinearity that the DP must navigate.
2. DP with a deletion budget
The k parameter acts as a budget. Every call to dp(i, k) knows exactly how many deletions it can still afford. When you delete non-matching characters to extend a run, you spend from this budget. When you delete s[i] itself, you also spend from it. The budget prevents the DP from deleting everything (which would give an encoded length of 0 but is only valid when k >= n).
This budget pattern appears in other problems too. Any time you have a limited number of modifications you can make to optimize some objective, you are likely dealing with a DP where one dimension tracks the remaining budget.
Edge cases
k >= n: you can delete all characters. The result is 0 (empty string has encoded length 0).k = 0: no deletions allowed. You just compute the run-length encoding ofsas-is.- All characters identical: the entire string is one run. No deletions help (unless
k >= n, which reduces to the empty string). The answer isencode_len(n)orencode_len(n - k)if deletions shorten the run past a threshold (for example, deleting one character from a run of 10 drops the count to 9, saving one encoded character). - All characters distinct: each character is its own run of length 1, so the encoding equals the string. Deletions just remove characters, giving encoded length
n - k. - Run lengths near thresholds (1, 10, 100): be careful with
encode_len. Extending a run of 9 to 10 increases encoded length by 1. The DP must weigh whether deleting an intervening character to extend a run across a threshold is worth the deletion cost.
From understanding to recall
You now understand the state definition, the two choices at each position, and why the inner scan loop is necessary. The hardest part to remember under pressure is the inner loop: when you keep s[i], you do not just count the immediate consecutive run. You scan all the way forward, deleting non-matching characters along the way, and check every possible stopping point.
The other detail that trips people up is the encode_len function. It is tempting to treat every run as "character plus count," but runs of length 1 have no count digit. Missing this case will produce wrong answers on nearly every test.
String Compression II combines the deletion-budget DP pattern with a nonlinear cost function. Both pieces are reusable building blocks. The budget pattern shows up in problems like Edit Distance (where each edit has a cost) and the nonlinear encoding cost is specific to compression problems. Practicing both blocks separately, then combining them, is the fastest way to internalize this solution.
Related posts
- Decode String - Working with compressed string representations
- Edit Distance - DP with deletion/insertion operations on strings
- Distinct Subsequences - DP with character selection decisions