Skip to content
← All posts

Find Median from Data Stream: Two Heaps Explained

7 min read
leetcodeproblemhardheapdesign

LeetCode #295, Find Median from Data Stream, is one of the most popular hard problems in coding interviews. You need to design a data structure that accepts a stream of integers and can return the median at any time. The tricky part is that you do not know the numbers in advance, and you need both operations to be fast. If you have ever been asked a "streaming statistics" question, this is the classic version of it.

The problem

Design a class MedianFinder with two methods:

  • addNum(num: int) adds an integer from the data stream.
  • findMedian() -> float returns the median of all elements added so far.

The median is the middle value in an ordered list. If the list has an even number of elements, the median is the average of the two middle values.

mf = MedianFinder()
mf.addNum(5)
mf.addNum(15)
mf.findMedian()   # returns 10.0 (average of 5 and 15)
mf.addNum(1)
mf.findMedian()   # returns 5.0 (middle of [1, 5, 15])
mf.addNum(3)
mf.findMedian()   # returns 4.0 (average of 3 and 5)
mf.addNum(8)
mf.findMedian()   # returns 5.0 (middle of [1, 3, 5, 8, 15])

After inserting all five numbers, the sorted order is [1, 3, 5, 8, 15] and the median is the middle element, 5.

Two Heaps After Inserting [5, 15, 1, 3, 8]numbers flow in from the streamMax-Heap (smaller half)Min-Heap (larger half)315815Median = 5size = 2size = 3
The max-heap holds the smaller half of numbers. The min-heap holds the larger half. The median comes from the root of the larger heap (or the average of both roots when sizes are equal).

Why brute force is slow

The simplest approach is to store all numbers in a list and sort it every time you call findMedian. Sorting takes O(n log n), so if you call findMedian after every insertion, the total cost is O(n^2 log n) for n insertions. That is far too slow for large streams.

You could also maintain a sorted list and use binary search to find the insertion point, then insert. Finding the position is O(log n), but inserting into the middle of a list is O(n) because you have to shift elements. So addNum becomes O(n) and findMedian is O(1). Better, but still not ideal.

In an interview, mentioning the sorted list approach and explaining why insertion is O(n) shows you understand the problem before jumping to the optimal solution. Interviewers appreciate seeing you reason through tradeoffs.

The key insight: two heaps

Instead of maintaining one sorted structure, split the numbers into two halves:

  1. A max-heap that stores the smaller half of the numbers. The root is the largest of the small numbers.
  2. A min-heap that stores the larger half of the numbers. The root is the smallest of the large numbers.

If both heaps have the same size, the median is the average of the two roots. If one heap is larger by one element, the median is the root of the larger heap. The key constraint is that the heaps must stay balanced: their sizes can differ by at most 1.

When a new number arrives, you compare it to the roots to decide which heap it belongs to. Then you rebalance if one heap gets too large.

In Python, heapq only provides a min-heap. To simulate a max-heap, negate the values before pushing and negate again when popping. This is the standard trick and interviewers expect it.

Step-by-step walkthrough

Let's trace through inserting [5, 15, 1, 3, 8] and see how both heaps evolve after each insertion.

Step 1: Insert 5

Median = 5.0
inserting5Max-Heapsize=0Min-Heapsize=1empty5

First number goes to the min-heap. Max-heap is empty, min-heap has [5]. Median = 5.0 (only element).

Step 2: Insert 15

Median = 10.0
inserting15Max-Heapsize=1Min-Heapsize=1515

15 is larger than min-heap root (5), so it goes to min-heap. Rebalance: move 5 from min-heap to max-heap. Max-heap = [5], min-heap = [15]. Median = (5 + 15) / 2 = 10.0.

Step 3: Insert 1

Median = 5.0
inserting1Max-Heapsize=1Min-Heapsize=21515

1 is smaller than max-heap root (5), so it goes to max-heap. Sizes differ by 1 which is fine. Max-heap = [1], min-heap = [5, 15]. Median = 5.0 (root of the larger heap).

Step 4: Insert 3

Median = 4.0
inserting3Max-Heapsize=2Min-Heapsize=231515

3 is smaller than min-heap root (5), so it goes to max-heap. Sizes are equal: max-heap = [3, 1], min-heap = [5, 15]. Median = (3 + 5) / 2 = 4.0.

Step 5: Insert 8

Median = 5.0
inserting8Max-Heapsize=2Min-Heapsize=3315815

8 is larger than max-heap root (3), so it goes to min-heap. Min-heap = [5, 8, 15], max-heap = [3, 1]. Sizes differ by 1 which is fine. Median = 5.0 (root of the larger heap).

Notice the pattern: the max-heap always stores values that are less than or equal to the min-heap root, and the min-heap stores values that are greater than or equal to the max-heap root. This invariant means the two roots are always the "middle" values, and computing the median is O(1).

The code

import heapq

class MedianFinder:
    def __init__(self):
        self.lo = []
        self.hi = []

    def addNum(self, num: int) -> None:
        heapq.heappush(self.lo, -num)
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def findMedian(self) -> float:
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2.0
  1. self.lo is the max-heap (smaller half). We negate values to simulate a max-heap with Python's min-heap.
  2. self.hi is the min-heap (larger half). Values are stored as-is.
  3. In addNum, we always push to the max-heap first (negated), then move the max-heap root to the min-heap. This guarantees the max-heap root is always less than or equal to the min-heap root.
  4. If the min-heap becomes larger than the max-heap, we move its root back. This keeps lo either the same size as hi or exactly one element larger.
  5. In findMedian, if lo is larger, the median is its root (negated back). If they are the same size, the median is the average of both roots.

Complexity analysis

Time: addNum is O(log n) because each heap operation (push/pop) is O(log n), and we perform at most three heap operations per call. findMedian is O(1) since we just read the heap roots.

Space: O(n) to store all n elements across both heaps.

ApproachaddNumfindMedianSpace
Sort on every queryO(1)O(n log n)O(n)
Sorted list + binary searchO(n)O(1)O(n)
Two heapsO(log n)O(1)O(n)

Common mistakes

  1. Forgetting to negate for the max-heap. Python's heapq is a min-heap. If you push raw values into lo, it becomes a min-heap and the algorithm breaks. Always push -num and negate when reading.

  2. Getting the rebalancing logic backwards. The order matters: push to lo first, move to hi, then rebalance if hi is too big. Reversing these steps can violate the invariant that all values in lo are less than or equal to all values in hi.

  3. Off-by-one in size comparison. The heaps should differ by at most 1. A common bug is checking len(self.hi) > len(self.lo) + 1 instead of len(self.hi) > len(self.lo). With the "push to lo, move to hi, maybe move back" pattern, you only need the simpler check.

  4. Returning an integer instead of a float. When the total count is even, the median is the average of two values and might not be a whole number. Make sure you return a float.

Watch out for the negation when reading from the max-heap. It is easy to forget the minus sign in findMedian and return the negative of the actual median. Double-check that you negate on both push and read.

The building blocks

This problem teaches a reusable pattern: maintaining a running statistic over a stream by partitioning data into two balanced structures.

Two-heap partition

import heapq

lo = []
hi = []

def add(num):
    heapq.heappush(lo, -num)
    heapq.heappush(hi, -heapq.heappop(lo))
    if len(hi) > len(lo):
        heapq.heappush(lo, -heapq.heappop(hi))

This pattern of "push to one side, rebalance to the other" shows up whenever you need to track the median or any order-statistic in a stream. Variations include using two heaps to solve the sliding window median problem (LeetCode 480) and interval scheduling problems where you split events into two groups.

The core idea extends beyond medians. Anytime you need fast access to the "middle" of a changing dataset, two heaps give you O(log n) updates and O(1) queries.

Practice writing the three-line addNum from memory. The sequence (push to lo, move to hi, maybe move back) is the entire algorithm. Once that becomes automatic, the rest is just reading roots.

Edge cases

  • Single element. After one addNum, lo has one element and hi is empty. findMedian returns that element.
  • Two elements. Both heaps have one element. The median is the average of both roots.
  • All identical values. If every number is the same, both heaps contain that value and the median is always that value.
  • Negative numbers. The negation trick still works because negating a negative number makes it positive, and the heap ordering is preserved.
  • Large streams. With n = 50,000 calls (the LeetCode constraint), O(log n) per insertion is roughly 16 operations per call. Well within time limits.

From understanding to recall

The two-heap median pattern is one of those solutions that feels obvious once you see it but is surprisingly hard to reproduce under pressure. The part that trips people up is not the concept of splitting into two halves. It is the mechanical details: which heap do you push to first, how do you rebalance, and where do the negations go.

Spaced repetition turns those details into reflexes. You practice writing the three-line addNum method from scratch, at increasing intervals, until the push-move-rebalance sequence is automatic. CodeBricks schedules these reps for you, so the pattern stays sharp long after you first learn it.

Related posts

The two-heap pattern is a building block that pays dividends across many problems. Once you can write it from memory, median and order-statistic problems become mechanical. CodeBricks helps you drill that recall so the pattern sticks when it counts.