Skip to content
← All posts

Statistics from a Large Sample: Frequency Array Analysis

5 min read
leetcodeproblemmediummath

You are given a large sample of integers in the range [0, 255]. Instead of storing every individual value, the sample is represented as a count array where count[k] tells you how many times the value k appears. Your job is to compute five statistics from this compressed representation: the minimum, maximum, mean, median, and mode.

This is LeetCode 1093: Statistics from a Large Sample.

count01234501132235452617819valueminmaxmodemin = 1max = 9mode = 4median = 4mean = (1+2+2+2+3+3+4+4+4+4+4+7+7+8+9) / 15 = 4.27
A frequency count array where count[k] tracks how many times value k appears. The five statistics are computed from this histogram.

Why this problem matters

At first glance, this looks like a basic math exercise. But the real lesson is about working with frequency arrays, a data structure that shows up constantly in algorithm problems. Instead of operating on raw data, you operate on a compressed summary. That shift in perspective unlocks efficient solutions to a wide range of problems.

Frequency arrays appear in counting sort, bucket sort, anagram detection, and sliding window problems. If you can compute statistics from a count array without expanding it into a full list of values, you understand the abstraction deeply enough to apply it in harder contexts.

The problem also tests whether you can handle the median carefully. When the total number of elements is even, you need the average of two middle values, and finding those two values from a running prefix sum requires precision.

The key insight

You never need to reconstruct the original list. Every statistic can be computed directly from the count array in one or two passes over its 256 entries.

The minimum is the smallest index with a nonzero count. The maximum is the largest. The mean is the weighted sum of indices divided by the total count. The mode is the index with the highest count. And the median comes from walking through the count array while tracking a running total, stopping when you reach the middle position.

Since the array has a fixed size of 256, every pass takes constant time regardless of how many samples exist. Even if the sample contains billions of values, you only ever loop over 256 entries.

The solution

def sample_stats(count: list[int]) -> list[float]:
    n = sum(count)

    minimum = next(i for i in range(256) if count[i] > 0)
    maximum = next(i for i in range(255, -1, -1) if count[i] > 0)

    total = sum(i * count[i] for i in range(256))
    mean = total / n

    mode = max(range(256), key=lambda i: count[i])

    median = find_median(count, n)

    return [float(minimum), float(maximum), mean, median, float(mode)]


def find_median(count: list[int], n: int) -> float:
    mid1 = (n + 1) // 2
    mid2 = (n + 2) // 2
    m1 = m2 = 0
    running = 0
    for i in range(256):
        running += count[i]
        if m1 == 0 and running >= mid1:
            m1 = i
        if m2 == 0 and running >= mid2:
            m2 = i
    return (m1 + m2) / 2.0

The main function computes each statistic independently. For minimum, it scans left to right and stops at the first nonzero count. For maximum, it scans right to left. The mean uses a weighted sum. The mode finds the index with the largest count.

The median helper handles both odd and even totals. It identifies the positions of the two "middle" elements using (n + 1) // 2 and (n + 2) // 2. When n is odd, both point to the same element. When n is even, they point to the two elements that need to be averaged. A single pass through the count array with a running sum locates both positions.

When computing the median from a frequency array, always handle even and odd totals with two position variables. This avoids branching logic and works uniformly for both cases.

Visual walkthrough

Step 1: Find minimum and maximum

Scan left to right through count[0..255]. The first index with count > 0 is the minimum. Scan right to left for the maximum. For count = [0, 1, 3, 2, 5, 0, 2, 1, 0, 1], the first nonzero is at index 1 (min = 1) and the last nonzero is at index 9 (max = 9).

Two boundary scans, each O(256) at worst.

Step 2: Compute the mean

Walk through each index i from 0 to 255. Accumulate total_sum += i * count[i] and total_count += count[i]. Then mean = total_sum / total_count. Here total_sum = 64, total_count = 15, so mean = 64 / 15 = 4.27.

One pass, running totals only.

Step 3: Find the mode

Track the index with the highest count. Walk through count[0..255], updating mode whenever count[i] > max_seen. Index 4 has count 5, the highest, so mode = 4.

The mode is the most frequent value, not the most frequent count.

Step 4: Find the median

The median is the middle value when all elements are sorted. With 15 total elements, the median is the 8th value. Walk through count[], accumulating a running total. When the running sum first reaches or passes 8, you have found the median index. Running sums: count[1]=1, count[2]=4, count[3]=6, count[4]=11. At index 4 we pass 8, so median = 4.

For even total counts, average the two middle values.

Step 5: Return the result

Return [minimum, maximum, mean, median, mode] = [1.0, 9.0, 4.267, 4.0, 4.0]. All five statistics are computed without ever expanding the frequency array into a full list of individual elements.

Total work: a few passes over 256 entries, regardless of how many samples there are.

Each step is a simple scan over the 256-entry count array. You accumulate running totals and compare against thresholds. No sorting, no expansion, no extra data structures.

Complexity analysis

ApproachTimeSpace
Frequency scanO(n)O(1)

Here n refers to the size of the count array, which is always 256. That means the time complexity is technically O(1) since the input size is fixed. The space complexity is also O(1) because you only use a handful of variables.

If you think of n as the total number of samples (the sum of all counts), then the approach is still O(256) time and O(1) extra space, which is far better than the O(n log n) you would need if you sorted the raw samples.

The building blocks

1. Prefix sums over a frequency array

The median computation relies on accumulating a running total as you walk through the count array. This is essentially a prefix sum: after processing index i, you know how many total elements have values from 0 to i.

running = 0
for i in range(256):
    running += count[i]
    if running >= target:
        return i

This pattern appears whenever you need to find the k-th element in a frequency distribution.

2. Weighted aggregation

The mean uses weighted aggregation: multiply each value by its frequency, sum everything, and divide by the total count. This same pattern shows up in expected value calculations, center-of-mass problems, and probability distributions.

total = sum(i * count[i] for i in range(256))
mean = total / n

Keeping the sum as a running accumulator avoids floating point drift when dealing with very large counts.

Edge cases

  • Single distinct value: Every count is zero except one. Min, max, mode, and median are all the same value, and the mean equals that value too.
  • Two distinct values with equal counts: The median is the average of the two values, not either one individually.
  • All 256 values present: The algorithm still runs 256 iterations, but now min = 0 and max = 255.
  • Extremely large counts: With counts up to 10^9, the weighted sum i * count[i] can overflow 32-bit integers. Use 64-bit arithmetic or a language that handles big integers natively (like Python).
  • Even vs. odd total count: The median logic must average two middle values for even totals. Missing this is the most common bug.

From understanding to recall

Reading through this solution once gives you the idea, but you will not remember the median formula next week. The trick is to practice retrieving the technique from memory, not just recognizing it when you see it.

Try closing this page and writing the find_median helper from scratch. If you get stuck, come back and check. Then try again tomorrow. That spaced retrieval is what turns a "solved it once" problem into a pattern you own.

CodeBricks automates this cycle for you. It breaks problems into atomic building blocks (prefix sum over frequency arrays, weighted aggregation, boundary scanning) and schedules them for review at increasing intervals. You stop grinding and start retaining.

Related posts

  • Find Peak Element - another problem where scanning with a simple condition finds the answer
  • Two Sum - frequency-based lookup using a hash map
  • Majority Element - computing statistics from frequency information

The five statistics in this problem are not hard individually. What makes the problem interesting is computing all of them from a single compressed representation. If you can internalize the frequency array abstraction, you will find it popping up everywhere, from counting sort to sliding window frequency maps.