Find Median from Data Stream: Two Heaps Explained
LeetCode #295, Find Median from Data Stream, is one of the most popular hard problems in coding interviews. You need to design a data structure that accepts a stream of integers and can return the median at any time. The tricky part is that you do not know the numbers in advance, and you need both operations to be fast. If you have ever been asked a "streaming statistics" question, this is the classic version of it.
The problem
Design a class MedianFinder with two methods:
addNum(num: int)adds an integer from the data stream.findMedian() -> floatreturns the median of all elements added so far.
The median is the middle value in an ordered list. If the list has an even number of elements, the median is the average of the two middle values.
mf = MedianFinder()
mf.addNum(5)
mf.addNum(15)
mf.findMedian() # returns 10.0 (average of 5 and 15)
mf.addNum(1)
mf.findMedian() # returns 5.0 (middle of [1, 5, 15])
mf.addNum(3)
mf.findMedian() # returns 4.0 (average of 3 and 5)
mf.addNum(8)
mf.findMedian() # returns 5.0 (middle of [1, 3, 5, 8, 15])
After inserting all five numbers, the sorted order is [1, 3, 5, 8, 15] and the median is the middle element, 5.
Why brute force is slow
The simplest approach is to store all numbers in a list and sort it every time you call findMedian. Sorting takes O(n log n), so if you call findMedian after every insertion, the total cost is O(n^2 log n) for n insertions. That is far too slow for large streams.
You could also maintain a sorted list and use binary search to find the insertion point, then insert. Finding the position is O(log n), but inserting into the middle of a list is O(n) because you have to shift elements. So addNum becomes O(n) and findMedian is O(1). Better, but still not ideal.
In an interview, mentioning the sorted list approach and explaining why insertion is O(n) shows you understand the problem before jumping to the optimal solution. Interviewers appreciate seeing you reason through tradeoffs.
The key insight: two heaps
Instead of maintaining one sorted structure, split the numbers into two halves:
- A max-heap that stores the smaller half of the numbers. The root is the largest of the small numbers.
- A min-heap that stores the larger half of the numbers. The root is the smallest of the large numbers.
If both heaps have the same size, the median is the average of the two roots. If one heap is larger by one element, the median is the root of the larger heap. The key constraint is that the heaps must stay balanced: their sizes can differ by at most 1.
When a new number arrives, you compare it to the roots to decide which heap it belongs to. Then you rebalance if one heap gets too large.
In Python, heapq only provides a min-heap. To simulate a max-heap, negate the values before pushing and negate again when popping. This is the standard trick and interviewers expect it.
Step-by-step walkthrough
Let's trace through inserting [5, 15, 1, 3, 8] and see how both heaps evolve after each insertion.
Step 1: Insert 5
Median = 5.0First number goes to the min-heap. Max-heap is empty, min-heap has [5]. Median = 5.0 (only element).
Step 2: Insert 15
Median = 10.015 is larger than min-heap root (5), so it goes to min-heap. Rebalance: move 5 from min-heap to max-heap. Max-heap = [5], min-heap = [15]. Median = (5 + 15) / 2 = 10.0.
Step 3: Insert 1
Median = 5.01 is smaller than max-heap root (5), so it goes to max-heap. Sizes differ by 1 which is fine. Max-heap = [1], min-heap = [5, 15]. Median = 5.0 (root of the larger heap).
Step 4: Insert 3
Median = 4.03 is smaller than min-heap root (5), so it goes to max-heap. Sizes are equal: max-heap = [3, 1], min-heap = [5, 15]. Median = (3 + 5) / 2 = 4.0.
Step 5: Insert 8
Median = 5.08 is larger than max-heap root (3), so it goes to min-heap. Min-heap = [5, 8, 15], max-heap = [3, 1]. Sizes differ by 1 which is fine. Median = 5.0 (root of the larger heap).
Notice the pattern: the max-heap always stores values that are less than or equal to the min-heap root, and the min-heap stores values that are greater than or equal to the max-heap root. This invariant means the two roots are always the "middle" values, and computing the median is O(1).
The code
import heapq
class MedianFinder:
def __init__(self):
self.lo = []
self.hi = []
def addNum(self, num: int) -> None:
heapq.heappush(self.lo, -num)
heapq.heappush(self.hi, -heapq.heappop(self.lo))
if len(self.hi) > len(self.lo):
heapq.heappush(self.lo, -heapq.heappop(self.hi))
def findMedian(self) -> float:
if len(self.lo) > len(self.hi):
return -self.lo[0]
return (-self.lo[0] + self.hi[0]) / 2.0
self.lois the max-heap (smaller half). We negate values to simulate a max-heap with Python's min-heap.self.hiis the min-heap (larger half). Values are stored as-is.- In
addNum, we always push to the max-heap first (negated), then move the max-heap root to the min-heap. This guarantees the max-heap root is always less than or equal to the min-heap root. - If the min-heap becomes larger than the max-heap, we move its root back. This keeps
loeither the same size ashior exactly one element larger. - In
findMedian, iflois larger, the median is its root (negated back). If they are the same size, the median is the average of both roots.
Complexity analysis
Time: addNum is O(log n) because each heap operation (push/pop) is O(log n), and we perform at most three heap operations per call. findMedian is O(1) since we just read the heap roots.
Space: O(n) to store all n elements across both heaps.
| Approach | addNum | findMedian | Space |
|---|---|---|---|
| Sort on every query | O(1) | O(n log n) | O(n) |
| Sorted list + binary search | O(n) | O(1) | O(n) |
| Two heaps | O(log n) | O(1) | O(n) |
Common mistakes
-
Forgetting to negate for the max-heap. Python's
heapqis a min-heap. If you push raw values intolo, it becomes a min-heap and the algorithm breaks. Always push-numand negate when reading. -
Getting the rebalancing logic backwards. The order matters: push to
lofirst, move tohi, then rebalance ifhiis too big. Reversing these steps can violate the invariant that all values inloare less than or equal to all values inhi. -
Off-by-one in size comparison. The heaps should differ by at most 1. A common bug is checking
len(self.hi) > len(self.lo) + 1instead oflen(self.hi) > len(self.lo). With the "push to lo, move to hi, maybe move back" pattern, you only need the simpler check. -
Returning an integer instead of a float. When the total count is even, the median is the average of two values and might not be a whole number. Make sure you return a float.
Watch out for the negation when reading from the max-heap. It is easy to forget the minus sign in findMedian and return the negative of the actual median. Double-check that you negate on both push and read.
The building blocks
This problem teaches a reusable pattern: maintaining a running statistic over a stream by partitioning data into two balanced structures.
Two-heap partition
import heapq
lo = []
hi = []
def add(num):
heapq.heappush(lo, -num)
heapq.heappush(hi, -heapq.heappop(lo))
if len(hi) > len(lo):
heapq.heappush(lo, -heapq.heappop(hi))
This pattern of "push to one side, rebalance to the other" shows up whenever you need to track the median or any order-statistic in a stream. Variations include using two heaps to solve the sliding window median problem (LeetCode 480) and interval scheduling problems where you split events into two groups.
The core idea extends beyond medians. Anytime you need fast access to the "middle" of a changing dataset, two heaps give you O(log n) updates and O(1) queries.
Practice writing the three-line addNum from memory. The sequence (push to lo, move to hi, maybe move back) is the entire algorithm. Once that becomes automatic, the rest is just reading roots.
Edge cases
- Single element. After one
addNum,lohas one element andhiis empty.findMedianreturns that element. - Two elements. Both heaps have one element. The median is the average of both roots.
- All identical values. If every number is the same, both heaps contain that value and the median is always that value.
- Negative numbers. The negation trick still works because negating a negative number makes it positive, and the heap ordering is preserved.
- Large streams. With
n = 50,000calls (the LeetCode constraint), O(log n) per insertion is roughly 16 operations per call. Well within time limits.
From understanding to recall
The two-heap median pattern is one of those solutions that feels obvious once you see it but is surprisingly hard to reproduce under pressure. The part that trips people up is not the concept of splitting into two halves. It is the mechanical details: which heap do you push to first, how do you rebalance, and where do the negations go.
Spaced repetition turns those details into reflexes. You practice writing the three-line addNum method from scratch, at increasing intervals, until the push-move-rebalance sequence is automatic. CodeBricks schedules these reps for you, so the pattern stays sharp long after you first learn it.
Related posts
- Kth Largest Element in an Array - Uses a heap for selection, similar heap-based thinking
- Merge K Sorted Lists - Another heap application for maintaining sorted order across streams
- Sliding Window Maximum - Maintaining a running statistic over a data stream
The two-heap pattern is a building block that pays dividends across many problems. Once you can write it from memory, median and order-statistic problems become mechanical. CodeBricks helps you drill that recall so the pattern sticks when it counts.