Skip to content
← All posts

Maximum Number of Non-Overlapping Substrings: Greedy Interval Selection on Strings

6 min read
leetcodeproblemhardstringsgreedy

LeetCode 1520, Maximum Number of Non-Overlapping Substrings, asks you to carve a string into as many non-overlapping pieces as possible, where each piece is "self-contained." A substring is self-contained if it includes every occurrence of every character that appears in it. If multiple answers have the same count, you return the one with the smallest total length.

This is a hard problem, but the solution is surprisingly clean once you see it as an interval scheduling problem built on top of character ranges.

The problem

You are given a string s. Find the maximum number of non-overlapping substrings such that:

  1. Each substring contains all occurrences of every character within it.
  2. No two substrings overlap.
  3. Among all solutions with the maximum count, choose the one with minimum total length.

For example, given s = "adefaddaccc", the answer is ["e", "f", "ccc"]. The characters a and d are tangled together across much of the string, so the best strategy is to grab the small, independent substrings.

012345678910adefaddaccca: [0,7]d: [1,6]e: [2,2]f: [3,3]c: [8,10]= valid self-contained interval= valid but overlaps others
For s = "adefaddaccc", each character defines an interval from its first to last occurrence. Green intervals are the greedy picks: "e", "f", and "ccc".

The brute force approach

You could enumerate every possible substring, check whether it is self-contained (does it include all occurrences of its characters?), then try all combinations of non-overlapping self-contained substrings to maximize the count. This explodes combinatorially. With up to 10^5 characters, brute force is not viable.

The key observation is that you do not need to consider arbitrary substrings. The only candidates worth looking at are the intervals defined by characters themselves.

The key insight

Every character in the string defines a natural interval: from its first occurrence to its last occurrence. A valid self-contained substring must span at least this range for every character it contains. So for each character, you can try to build a minimal valid interval by starting with [first[c], last[c]] and expanding it whenever you find a character inside the range whose first or last occurrence falls outside.

If during expansion you discover a character whose first occurrence is before your starting point, the interval is invalid (it would need to expand backward past where you started, meaning it is not a standalone substring).

Once you have all valid intervals, the problem reduces to classic greedy interval scheduling: sort by right endpoint and pick non-overlapping intervals.

Think of each character as defining a "claim" on a range of the string. Expanding intervals is like merging overlapping claims. The intervals that survive expansion without contradicting themselves are your candidates. Then it is just activity selection.

Step-by-step walkthrough

Step 1: Find the first and last occurrence of each character

adefaddaccc012345678910

first/last: a=[0,7], d=[1,6], e=[2,2], f=[3,3], c=[8,10]

Step 2: Expand each interval to include all characters within it

adefaddaccc012345678910a: [0,7] contains d,e,f => stays [0,7]d: [1,6] contains a(last=7) => invalid!e: [2,2]f: [3,3]c: [8,10]

For "d", the range [1,6] contains "a" at index 4, but "a" last appears at 7 (outside [1,6]). This interval is invalid. The valid intervals are [0,7], [2,2], [3,3], [8,10].

Step 3: Sort valid intervals by right endpoint

adefaddaccc012345678910e: end=2f: end=3a: end=7c: end=10

Sorted order: (end=2, start=2), (end=3, start=3), (end=7, start=0), (end=10, start=8).

Step 4: Greedily select non-overlapping intervals (earliest end first)

adefaddaccc012345678910

Pick "e" [2,2] (prev_end=-1, start=2 > -1, take it, prev_end=2). Pick "f" [3,3] (start=3 > 2, take it, prev_end=3). Skip "a" [0,7] (start=0, not > 3). Pick "ccc" [8,10] (start=8 > 3, take it, prev_end=10). Result: ["e", "f", "ccc"] = 3 substrings.

The code

def maxNumOfSubstrings(s):
    first = {}
    last = {}
    for i, c in enumerate(s):
        if c not in first:
            first[c] = i
        last[c] = i

    intervals = []
    for c in first:
        start = first[c]
        end = last[c]
        j = start
        valid = True
        while j <= end:
            if first[s[j]] < start:
                valid = False
                break
            end = max(end, last[s[j]])
            j += 1
        if valid:
            intervals.append((end, start))

    intervals.sort()
    result = []
    prev_end = -1
    for end, start in intervals:
        if start > prev_end:
            result.append(s[start:end + 1])
            prev_end = end

    return result

The algorithm has three phases. First, scan the string to record the first and last index of each character. Second, for each character, try to build a valid interval by expanding from [first[c], last[c]]. As you scan through the range, if you find a character whose first occurrence is before your start, the interval is invalid because you cannot extend backward. Otherwise, you push the end out to cover every character's last occurrence. Third, sort the valid intervals by their right endpoint and apply the classic greedy selection: pick an interval if its start is past the previous interval's end.

The intervals are stored as (end, start) tuples so that sorting by the first element automatically sorts by right endpoint. This is a common trick to avoid writing a custom comparator.

Why does backward expansion invalidate an interval? If character x appears inside [start, end] but first[x] is before start, then a valid substring starting at start cannot include all occurrences of x. You would need to move start backward, but then you might discover even more characters that pull the interval further left. The algorithm short-circuits: if any character pulls left past the original start, reject this interval entirely.

Complexity analysis

ApproachTimeSpace
Brute force (enumerate all substrings)O(2^n)O(n)
Optimized (expand + greedy)O(n * sigma)O(sigma)

Here sigma is the alphabet size (at most 26 for lowercase letters). The first pass is O(n). The expansion phase visits each character's range, and in the worst case each of the 26 characters could expand across the entire string, giving O(26 * n). Sorting at most 26 intervals is O(sigma log sigma), which is constant. The greedy scan is O(sigma). Total: O(n * sigma), which simplifies to O(n) for a fixed alphabet. Space is O(sigma) for the first/last maps and the interval list.

The building blocks

Character interval expansion

The idea of defining an interval per character and then expanding it to include all "dependencies" is a powerful pattern. You saw a simpler version in Partition Labels, where you only track the last occurrence and never need to validate backward. Here, backward validation adds a layer: not all character intervals produce valid substrings. This expansion pattern also appears in problems where you need to find the smallest window containing certain elements.

Greedy interval scheduling (activity selection)

Once you have a set of intervals, selecting the maximum number of non-overlapping ones by sorting by right endpoint is one of the oldest greedy algorithms. It appears in Non-overlapping Intervals (where you count removals instead of selections) and in job scheduling problems. The proof is an exchange argument: if the optimal solution skips the earliest-ending interval, you can always swap it in without reducing the count.

Edge cases

Single character string like "a": there is one character with interval [0, 0]. Expansion does nothing. The result is ["a"].

All characters identical like "aaaa": only one character interval [0, 3]. It is valid (no other characters to conflict). The result is ["aaaa"].

All characters unique like "abcdef": each character has interval [i, i]. All are valid and none overlap. The result is ["a", "b", "c", "d", "e", "f"].

Fully interleaved characters like "abab": a=[0,2], b=[1,3]. Expanding a's interval includes b at index 1, pushing end to 3. Expanding b's interval includes a at index 2, but first[a]=0 is before start=1, so b's interval is invalid. Only a's expanded interval [0,3] is valid. The result is ["abab"].

Empty string: no characters, so no intervals and no substrings. Return [].

From understanding to recall

You have seen the three-phase approach: compute character ranges, expand and validate intervals, then greedily select. The tricky part is the expansion loop. You need to remember that backward expansion (finding a character whose first occurrence is before your start) means the interval is invalid, not that you should extend leftward. This is the detail that separates this problem from the simpler Partition Labels.

Spaced repetition helps you internalize the expansion logic so you do not confuse it with other interval techniques under pressure. You practice writing the three phases from memory, first at short intervals, then longer ones. When you see a "find valid self-contained substrings" problem, the pattern fires automatically: expand, validate, schedule.

Related posts

  • Partition Labels - Greedy string partitioning using last-occurrence tracking, a simpler version of the same expand-and-split idea
  • Non-overlapping Intervals - Greedy interval scheduling by sorting on end time, the same selection step used here
  • Merge Intervals - Interval merging, the foundational pattern for thinking about overlapping ranges