The Pipes and Filters Pattern in Software Architecture

Every time you chain commands together in a Unix terminal, you are using one of the most powerful patterns in software architecture. The pipes and filters pattern is the idea that data should flow through a sequence of independent processing stages. Each stage does one thing. The output of one stage becomes the input of the next. And because each stage is self-contained, you can rearrange, replace, or add stages without breaking the rest of the pipeline.

This pattern predates most of what you think of as "modern" software engineering. It was baked into Unix in the 1970s. But it keeps showing up because the core idea is timeless: when you decompose a complex transformation into small, focused steps, everything gets easier to understand, test, and change.

The Unix origin story

If you have ever typed something like this into a terminal, you already know pipes and filters:

cat server.log | grep "error" | sort | uniq -c | sort -rn

Each command is a filter. It reads input, transforms it, and writes output:

  • cat server.log reads a file and outputs its contents.
  • grep "error" keeps only lines containing "error."
  • sort puts the remaining lines in order, so that duplicate lines become adjacent.
  • uniq -c collapses runs of adjacent duplicate lines and counts them (which is why the sort comes first).
  • sort -rn sorts by count in descending order.

The | operator is the pipe. It connects the output of one filter to the input of the next. No temporary files. No shared state. Just data flowing through a sequence of transformations.

The result: you get a ranked list of the most common error messages in your server log. Five simple commands, each doing one thing, composed into something powerful.

The four components

The pipes and filters pattern has four building blocks:

Filters are the processing units. Each filter takes input, applies a transformation, and produces output. A filter does not know where its input came from or where its output is going. It just does its job. This independence is what makes filters reusable.

Pipes are the connectors between filters. A pipe takes the output of one filter and delivers it as input to the next. In Unix, this is the | operator. In code, it might be a function call, a queue, a stream, or simply passing a return value to the next function.

Data source is where the pipeline starts. It could be a file, a database query, an API response, or any other source of raw input.

Data sink is where the pipeline ends. It could be a file on disk, a database table, a response sent to a client, or a value returned to the caller.

The key constraint: every filter must accept and produce a compatible data format. In Unix, that format is text (lines of characters). In your own pipelines, you get to choose. The point is that all filters agree on the shape of data flowing between them.
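
In code, you can make that shared contract explicit. Here is a minimal sketch in Python, where Filter is a type alias introduced purely for illustration:

from typing import Callable

# The agreed-upon shape: every filter maps text to text, mirroring a
# Unix command that reads stdin and writes stdout.
Filter = Callable[[str], str]


def remove_blank_lines(text: str) -> str:
    """One filter satisfying the contract."""
    return "\n".join(line for line in text.splitlines() if line.strip())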

Core properties

The pipes and filters pattern gives you several properties that make systems easier to reason about.

Independence. Each filter is self-contained. It does not depend on the internal state of other filters. You can understand, test, and debug each filter in isolation.

Composability. Filters can be reordered, replaced, or added without affecting the rest of the pipeline. Need to add a logging step? Insert a new filter. Need to change how data gets cleaned? Swap out that one filter. Everything else stays the same.

Single responsibility. Each filter has one job. This makes individual filters simple, even when the overall pipeline is complex. The complexity lives in the composition, not in any single component.

Uniform interface. The same data format flows between all stages. This is what makes composition possible. If every filter expects text in and text out (like Unix commands), you can connect any filter to any other filter.
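
To see why the uniform interface matters, here is a small sketch: a generic compose helper that chains any same-shaped filters into a pipeline:

from functools import reduce


def compose(*filters):
    """Chain any number of same-shaped filters, left to right."""
    return lambda data: reduce(lambda acc, f: f(acc), filters, data)


# Because every filter here is text in, text out, any ordering is valid:
clean = compose(str.strip, str.lower)
print(clean("  Hello World  "))  # "hello world"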

Python example: text processing pipeline

Let's build a text processing pipeline that takes raw HTML content and produces a list of stemmed tokens. Each step is a filter.

Procedural version

The simplest approach is just a sequence of function calls:

import re


def read_source(filepath):
    """Data source: read raw content from a file."""
    with open(filepath) as f:
        return f.read()


def remove_html(text):
    """Filter 1: strip HTML tags."""
    return re.sub(r"<[^>]+>", "", text)


def normalize_whitespace(text):
    """Filter 2: collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Filter 3: split text into lowercase words."""
    return [word.lower() for word in text.split()]


def stem_words(words):
    """Filter 4: apply basic suffix stripping."""
    suffixes = ("ing", "ly", "ed", "tion", "ness")
    result = []
    for word in words:
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        result.append(word)
    return result


# The pipeline: data flows through each filter in sequence
raw = read_source("article.html")
cleaned = remove_html(raw)
normalized = normalize_whitespace(cleaned)
tokens = tokenize(normalized)
stemmed = stem_words(tokens)

print(stemmed)

Each function is a filter. The variable assignments are the pipes. You can test remove_html without thinking about stem_words. You can swap stem_words for a more sophisticated stemmer without touching the rest of the pipeline.
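
That isolation makes testing trivial. A sketch of plain-assert checks against the filters above (the expected outputs follow directly from the definitions):

assert remove_html("<p>Hello <b>world</b></p>") == "Hello world"
assert normalize_whitespace("  a \n b\t  c ") == "a b c"
assert tokenize("Hello World") == ["hello", "world"]
assert stem_words(["running", "quickly"]) == ["runn", "quick"]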

Class-based version

For more complex pipelines, you might want a reusable pipeline framework:

class Pipeline:
    def __init__(self):
        self.filters = []

    def add_filter(self, func):
        self.filters.append(func)
        return self

    def run(self, data):
        for f in self.filters:
            data = f(data)
        return data


# Build the pipeline by composing filters
pipeline = Pipeline()
pipeline.add_filter(remove_html)
pipeline.add_filter(normalize_whitespace)
pipeline.add_filter(tokenize)
pipeline.add_filter(stem_words)

raw_html = read_source("article.html")
result = pipeline.run(raw_html)

This version makes the pipeline itself a first-class object. You can pass it around, store different pipeline configurations, or build pipelines dynamically based on runtime conditions. And because add_filter returns self, calls can be chained fluently.
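
For example, a sketch of assembling a pipeline from a runtime condition, where the stem flag is an illustrative parameter:

def build_pipeline(stem=True):
    """Assemble a text pipeline whose stages depend on configuration."""
    pipeline = Pipeline()
    pipeline.add_filter(remove_html)
    pipeline.add_filter(normalize_whitespace)
    pipeline.add_filter(tokenize)
    if stem:
        pipeline.add_filter(stem_words)
    return pipeline


tokens = build_pipeline(stem=False).run("<p>Some   raw HTML</p>")
print(tokens)  # ['some', 'raw', 'html']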

Python example: data transformation pipeline

Here is a second example that processes CSV data through a series of transformations:

import csv
from io import StringIO


def parse_csv(raw_text):
    """Data source filter: parse raw CSV into rows of dicts."""
    reader = csv.DictReader(StringIO(raw_text))
    return list(reader)


def filter_active(rows):
    """Filter: keep only rows where status is 'active'."""
    return [row for row in rows if row["status"] == "active"]


def parse_amounts(rows):
    """Filter: convert amount strings to floats."""
    for row in rows:
        row["amount"] = float(row["amount"])
    return rows


def aggregate_by_category(rows):
    """Filter: sum amounts by category."""
    totals = {}
    for row in rows:
        cat = row["category"]
        totals[cat] = totals.get(cat, 0) + row["amount"]
    return totals


def format_report(totals):
    """Data sink filter: produce a readable summary."""
    lines = []
    for category, total in sorted(totals.items()):
        lines.append(f"{category}: ${total:,.2f}")
    return "\n".join(lines)


# The pipeline
raw_csv = read_source("transactions.csv")  # read_source from the first example
rows = parse_csv(raw_csv)
active = filter_active(rows)
parsed = parse_amounts(active)
totals = aggregate_by_category(parsed)
report = format_report(totals)

print(report)

Five filters, each with a single responsibility. If you want to change how filtering works (say, filter by date range instead of status), you replace one function. If you want to add a deduplication step, you insert a new filter between filter_active and parse_amounts. The rest of the pipeline does not change.
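
A sketch of that deduplication filter, dropping exact duplicate rows while preserving order:

def deduplicate(rows):
    """Filter: remove exact duplicate rows, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Slot it in between existing stages; nothing else changes:
active = filter_active(rows)
unique = deduplicate(active)
parsed = parse_amounts(unique)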

Real-world examples

The pipes and filters pattern shows up in a surprising number of places once you know what to look for.

Unix command line

This is the original and most famous example. The classic Unix utilities are filters that read from stdin and write to stdout, and the shell's | operator is the pipe. This design is a large part of why the Unix toolset has stayed useful for over 50 years: small, focused tools composed into powerful workflows.

ETL pipelines

Extract-Transform-Load pipelines are pipes and filters by definition. Extract data from a source (data source), run it through a sequence of transformations (filters), and load it into a destination (data sink). Tools like Apache Airflow, dbt, and Luigi are all built on this pattern.
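
Stripped of any framework, the shape looks like this. A minimal sketch with illustrative file and table names, using only the standard library:

import json
import sqlite3


def extract(path):
    """Extract (data source): read raw records from a JSON file."""
    with open(path) as f:
        return json.load(f)


def transform(records):
    """Transform (filter): keep valid records and normalize fields."""
    return [
        {"name": r["name"].strip().title(), "amount": float(r["amount"])}
        for r in records
        if r.get("amount") is not None
    ]


def load(records, db_path):
    """Load (data sink): write transformed records to a database table."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (:name, :amount)", records)
    conn.commit()
    conn.close()


load(transform(extract("payments.json")), "warehouse.db")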

Compiler phases

A compiler is a pipeline. Source code flows through a sequence of processing stages:

  1. Lexing turns raw text into tokens.
  2. Parsing turns tokens into an abstract syntax tree (AST).
  3. Semantic analysis checks types and resolves names.
  4. Optimization transforms the AST or intermediate representation to improve performance.
  5. Code generation produces the final output (machine code, bytecode, etc.).

Each phase is a filter. The output of one phase is the input to the next. You can work on the optimizer without touching the parser. You can swap out the code generator to target a different platform. This is why compilers are often taught as the textbook example of pipes and filters.
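
To make the shape concrete, here is a deliberately tiny sketch: a three-phase "compiler" for sums like "1 + 2 + 3", with each phase as a filter. Real compiler phases are vastly more involved; the point is only the pipeline structure:

def lex(source):
    """Phase 1: raw text -> list of tokens."""
    return source.split()


def parse(tokens):
    """Phase 2: tokens -> a flat 'AST' of integer operands."""
    return [int(t) for t in tokens if t != "+"]


def codegen(operands):
    """Phase 3: operands -> stack-machine instructions."""
    code = [("PUSH", n) for n in operands]
    code += [("ADD",)] * (len(operands) - 1)
    return code


print(codegen(parse(lex("1 + 2 + 3"))))
# [('PUSH', 1), ('PUSH', 2), ('PUSH', 3), ('ADD',), ('ADD',)]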

Image processing

Image editing tools chain filters together: crop, resize, adjust brightness, apply blur, sharpen. Each operation takes an image in and produces an image out. It is no accident that image editors call these operations "filters."

Stream processing

Systems like Apache Kafka and Apache Spark Streaming process data through a series of transformations. Read from a topic (source), apply a chain of map/filter/reduce operations (filters), write to an output topic (sink). The data flows continuously, but the architecture is the same.
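
In plain Python, generators give you the same continuous-flow architecture: each stage lazily pulls items from the previous one, so no stage materializes the whole dataset. A sketch, with an illustrative log format:

def read_lines(path):
    """Source: yield lines one at a time instead of loading the file."""
    with open(path) as f:
        yield from f


def only_errors(lines):
    """Filter: pass through only lines mentioning an error."""
    return (line for line in lines if "error" in line)


def first_field(lines):
    """Filter: keep just the first whitespace-separated field."""
    return (line.split()[0] for line in lines)


# Composing the generators builds the pipeline; iterating pulls data through.
for field in first_field(only_errors(read_lines("server.log"))):
    print(field)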

Connection to coding problems

If you practice coding problems, you have already used pipes and filters without calling it that.

Sliding window as a filter loop

The sliding window pattern is a mini pipe-and-filter system running inside a loop. Each iteration pushes data through the same sequence of stages:

  1. Expand: move the right pointer to include more data.
  2. Validate: check if the current window meets the constraint.
  3. Contract: if the window is invalid or you want to try smaller, move the left pointer.
  4. Record: update the best answer seen so far.

Each step is a filter. The "data" is the window state (pointers, running totals, frequency maps). You can debug each step independently. If your answer is wrong, you ask: is my expand step correct? Is my validation correct? Is my contraction correct? The pipeline structure gives you that isolation.
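
A sketch using a classic problem, the shortest subarray whose sum reaches a target, with the four stages labeled:

def min_subarray_len(target, nums):
    left = 0
    window_sum = 0
    best = float("inf")
    for right, value in enumerate(nums):
        window_sum += value                     # Expand
        while window_sum >= target:             # Validate
            best = min(best, right - left + 1)  # Record
            window_sum -= nums[left]            # Contract
            left += 1
    return 0 if best == float("inf") else best


print(min_subarray_len(7, [2, 3, 1, 2, 4, 3]))  # 2 (the window [4, 3])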

Product of Array Except Self

Product of Array Except Self is a two-stage pipeline:

  1. Forward pass (Filter 1): compute prefix products. Takes an array, produces an array.
  2. Backward pass (Filter 2): compute suffix products and multiply with prefix results. Takes an array, produces the final array.

Each pass is a filter with one job. The output of the first pass flows into the second pass. You can reason about the forward pass without thinking about the backward pass. That is the pipes and filters pattern in miniature.
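
A sketch with the two passes written as explicit, clearly separated stages:

def product_except_self(nums):
    n = len(nums)
    result = [1] * n
    # Filter 1 (forward pass): result[i] = product of everything left of i.
    prefix = 1
    for i in range(n):
        result[i] = prefix
        prefix *= nums[i]
    # Filter 2 (backward pass): multiply in everything right of i.
    suffix = 1
    for i in range(n - 1, -1, -1):
        result[i] *= suffix
        suffix *= nums[i]
    return result


print(product_except_self([1, 2, 3, 4]))  # [24, 12, 8, 6]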

Advantages

Modularity. Each filter is a self-contained unit. You can understand it, build it, and maintain it independently. When the pipeline grows complex, you are not dealing with one massive function. You are dealing with many small, focused ones.

Testability. You can test each filter in isolation. Give it known input, check the output. No need to set up the entire pipeline to verify that one stage works correctly. Unit testing becomes trivial when each filter is a pure function.

Reusability. A filter that normalizes whitespace is useful in many pipelines, not just one. Because filters are independent, you can pull them out and reuse them wherever that transformation is needed.

Parallelism. Stages of a pipeline can run concurrently. While Filter 2 is processing one item, Filter 1 can already be working on the next, like stations on an assembly line. Stream processing systems exploit this heavily.

Disadvantages

Data passing overhead. Every filter produces a complete output that gets passed to the next filter. For large datasets, this means creating intermediate copies at each stage. In performance-critical code, this overhead can matter. Lazy, streaming filters (like the generator sketch above) mitigate it by handing items through one at a time instead of materializing whole intermediate collections.

Not great for interactive systems. Pipes and filters assume a linear, one-directional data flow. If you need back-and-forth communication (like a user interface responding to events), this pattern is a poor fit. Event-driven or observer patterns work better for interactive systems.

Error handling complexity. When a filter in the middle of a pipeline fails, what happens? Should the pipeline stop? Should it skip that item and continue? Should it retry? Error handling across multiple independent stages requires careful design. Each filter might need to handle errors differently, and propagating error context through the pipeline adds complexity.
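
One way to tame this is to make the policy explicit in the pipeline runner. A sketch extending the Pipeline class from earlier; the on_error parameter and its values are illustrative choices, not a standard API:

class SafePipeline(Pipeline):
    def __init__(self, on_error="stop"):
        super().__init__()
        self.on_error = on_error  # "stop" re-raises; "skip" moves on

    def run(self, data):
        for f in self.filters:
            try:
                data = f(data)
            except Exception as exc:
                if self.on_error == "stop":
                    raise RuntimeError(f"filter {f.__name__} failed") from exc
                # "skip": leave data unchanged and try the next filter
        return data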

Latency. Data must pass through every filter in sequence. You cannot skip ahead. If one filter is slow, it becomes a bottleneck for the entire pipeline. This is fine for batch processing but can be a problem for latency-sensitive systems.

When to use it

The pipes and filters pattern works best when you have:

  • Data transformation pipelines. Any time you need to take data from one form and convert it to another through multiple steps, this pattern is a natural fit.
  • Batch processing. Processing large volumes of data where throughput matters more than latency. Log processing, report generation, data migration.
  • Compiler design. Translating source code through lexing, parsing, optimization, and code generation. Each phase is a filter.
  • Log processing and monitoring. Collecting logs, parsing them, filtering for relevant events, aggregating metrics, and sending alerts. Each step maps cleanly to a filter.
  • Any system where you need flexibility in composition. If you expect the processing steps to change over time (new filters added, existing ones replaced), pipes and filters gives you that flexibility for free.

Avoid this pattern when you need bidirectional communication, when the processing steps are tightly interdependent, or when you need real-time interactive responsiveness.
