Skip to content
← All posts

Reduce Array Size to The Half: Greedy Frequency Counting

5 min read
leetcodeproblemmediumarrayshash-mapgreedy

Given an integer array arr, you can choose a set of integers and remove all occurrences of those integers from the array. Return the minimum size of the set so that at least half of the integers in the array are removed.

This is LeetCode 1338: Reduce Array Size to The Half, and it is one of the cleanest examples of a greedy strategy built on top of frequency counting. The problem asks you to find the fewest distinct values whose combined occurrences cover at least half the array.

03132333455565728297length = 10, half = 5Frequency count:12344val=33val=52val=21val=7
arr = [3,3,3,3,5,5,5,2,2,7]. Blue bars = values picked for removal (freq 4 and 3). Removing values 3 and 5 eliminates 7 elements, which exceeds half (5).

Why this problem matters

Frequency counting is one of the most common building blocks in interview problems. You see it in "Top K Frequent Elements," "Group Anagrams," "Sort Characters by Frequency," and dozens of others. What makes this problem valuable is that it layers a greedy decision on top of the frequency map. You are not just counting. You are using those counts to make an optimal selection.

The greedy insight is natural once you see it: if you want to remove the fewest distinct values possible, you should always remove the most frequent value first. Each removal of a high-frequency value eliminates more elements than a low-frequency one. This is the same greedy reasoning that appears in scheduling problems and coin change variants.

Once you internalize this "count, sort, greedily pick" pattern, you will recognize it in problems that look very different on the surface but share the same structure underneath.

The approach

The algorithm has three phases:

  1. Count frequencies. Use a hash map to count how many times each value appears in the array.
  2. Sort the frequencies. Extract just the counts (you no longer care about which value has which count) and sort them in descending order.
  3. Greedily remove. Walk through the sorted frequencies from largest to smallest. Add each frequency to a running total. Once the running total reaches at least len(arr) // 2, return the number of values you have picked.
from collections import Counter

def min_set_size(arr: list[int]) -> int:
    counts = Counter(arr)
    freqs = sorted(counts.values(), reverse=True)

    removed = 0
    half = len(arr) // 2

    for i, freq in enumerate(freqs):
        removed += freq
        if removed >= half:
            return i + 1

    return len(freqs)

You do not need to track which values you are removing. The problem only asks for the count of distinct values in your set. Once you have the sorted frequency list, the values themselves are irrelevant.

Step-by-step walkthrough

Let's trace through the example arr = [3, 3, 3, 3, 5, 5, 5, 2, 2, 7].

Step 1: Count frequencies with a hash map

34532271

Scan the array once. For each element, increment its count in the map.

Step 2: Sort frequencies in descending order

idx 04idx 13idx 22idx 31

Extract the frequency values and sort: [4, 3, 2, 1]. The identity of each value no longer matters.

Step 3: Pick the most frequent value (freq = 4)

pick freq = 4removed = 44 is less than 5. Keep going.result count = 1

Removed 4 elements so far. We need to remove at least 5 (half of 10). Not done yet. Count = 1.

Step 4: Pick the next most frequent value (freq = 3)

pick freq = 3removed = 4 + 3 = 77 >= 5. Done!result count = 2

Removed 4 + 3 = 7 elements total. 7 >= 5 (half of 10). Done! Answer = 2.

The greedy strategy picks the most frequent values first. After picking value 3 (frequency 4) and value 5 (frequency 3), we have removed 7 elements. Since 7 >= 5 (half of 10), the answer is 2. We never needed to consider values 2 or 7.

Here is the final clean solution:

from collections import Counter

def min_set_size(arr: list[int]) -> int:
    counts = Counter(arr)
    freqs = sorted(counts.values(), reverse=True)

    removed = 0
    half = len(arr) // 2

    for i, freq in enumerate(freqs):
        removed += freq
        if removed >= half:
            return i + 1

    return len(freqs)

Complexity analysis

ApproachTimeSpace
Greedy + sorted frequenciesO(n log n)O(n)
Greedy + bucket sort on frequenciesO(n)O(n)

Time for the sorting approach is O(n log n). Building the frequency map takes O(n). Sorting the frequency values takes O(k log k) where k is the number of distinct values. Since k is at most n, this is O(n log n) in the worst case. The greedy scan is O(k).

Space is O(n) for the frequency map. In the worst case, every element is unique, so the map has n entries.

You can optimize to O(n) time by using bucket sort on the frequencies. Since every frequency is between 1 and n, you can create a bucket array of size n + 1 and iterate from the largest bucket downward. This avoids the O(n log n) sort but adds implementation complexity. For interviews, the sorting version is cleaner and almost always sufficient.

Edge cases to watch for

  • All elements are the same. arr = [5, 5, 5, 5]. The frequency map has one entry with count 4. Half is 2. Removing value 5 eliminates 4 elements, which is >= 2. Answer is 1.
  • All elements are distinct. arr = [1, 2, 3, 4]. Every frequency is 1. Half is 2. You need to pick at least 2 values. Answer is 2.
  • Array of length 2. arr = [1, 1]. Half is 1. Removing value 1 eliminates 2 elements. Answer is 1. Also, arr = [1, 2]. Half is 1. Removing either value eliminates 1 element. Answer is 1.
  • Ties in frequency. arr = [1, 1, 2, 2, 3, 3]. All frequencies are 2. Half is 3. You need to pick 2 values (removing 4 >= 3). The order among tied frequencies does not matter.
  • Odd-length array. arr = [1, 2, 2]. Half is 3 // 2 = 1. Removing value 2 eliminates 2 elements, which is >= 1. Answer is 1.

The building blocks

Frequency counting with a hash map

The core building block is the same one you see everywhere: scan an array, count occurrences, and use those counts to make decisions. The Counter class in Python (or a dictionary) makes this mechanical:

from collections import Counter
counts = Counter(arr)

This building block drives Top K Frequent Elements, Group Anagrams, Valid Anagram, Sort Characters by Frequency, and many more. If you can write it without thinking, you are already halfway through a large family of problems.

Greedy selection from a sorted collection

The second building block is the greedy loop: sort your options by "value per unit" (here, frequency), then pick greedily until you hit a threshold. This pattern appears in fractional knapsack, activity selection, and any problem where local optimal choices lead to the global optimum. The key requirement for greedy to work is the "greedy choice property," and in this problem it holds because removing a higher-frequency value always eliminates at least as many elements as removing a lower-frequency one.

From understanding to recall

You have seen the pattern: count frequencies, sort descending, greedily accumulate until you reach half. The logic is clean and the code is short. But under interview pressure, the details matter. Do you sort ascending or descending? Do you compare against len(arr) // 2 or len(arr) / 2? Do you return i or i + 1?

These are not conceptual questions. They are recall questions, and they are exactly the kind of thing that falls apart when you are nervous. Spaced repetition closes the gap. You practice writing the frequency-count-and-greedily-pick pattern from memory at increasing intervals until it becomes automatic. When you see "minimize the set size" or "remove the fewest elements" in a problem statement, the template is already in your hands.

Related posts

  • Top K Frequent Elements - Another problem where frequency counting drives the solution
  • Group Anagrams - Using hash maps to categorize and count elements
  • Sort Colors - A problem where understanding element frequencies leads to an efficient solution