Batch vs stream processing
Process data in large scheduled chunks or continuously as it arrives — a choice between throughput and freshness that shapes every data pipeline.
Two ways to process data
- Batch — collect data over a window, then process it all at once on a schedule (hourly, nightly). Think “compute yesterday’s revenue report at 2am.”
- Stream — process each event as it arrives, continuously, within milliseconds to seconds. Think “update the live dashboard on every click.”
The split is bounded, scheduled chunks vs an unbounded, continuous flow.
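To make the contrast concrete, here is a minimal sketch of the same per-user total computed both ways. The event shape (user, amount) and the function names are invented for illustration; real batch and stream engines add scheduling, partitioning, and fault tolerance on top of this idea.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# Hypothetical events: (user_id, amount) pairs, invented for illustration.
Event = tuple[str, float]

def batch_totals(events: list[Event]) -> dict[str, float]:
    """Batch: the whole bounded window is available; one result at the end."""
    totals: dict[str, float] = defaultdict(float)
    for user, amount in events:
        totals[user] += amount
    return dict(totals)

def stream_totals(events: Iterable[Event]) -> Iterator[dict[str, float]]:
    """Stream: update running state per event and emit a fresh snapshot each time."""
    totals: dict[str, float] = defaultdict(float)
    for user, amount in events:
        totals[user] += amount
        yield dict(totals)

events = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)]
batch_result = batch_totals(events)                   # one answer, after the window closes
stream_snapshots = list(stream_totals(iter(events)))  # one answer per event
```

The final stream snapshot equals the batch result; the difference is that the stream version had a usable (partial) answer after every event.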
Batch processing
- Strengths: maximum throughput and efficiency (amortize overhead over huge datasets); simpler to reason about and re-run; easy to handle late or corrected data (just reprocess the window).
- Weaknesses: high latency. A result can be as stale as the full window plus the job's run time, so you may learn about a problem hours later.
- Use for: reports, billing, ETL, model training, nightly aggregations — where completeness matters more than immediacy.
- Tools: Hadoop MapReduce, Spark.
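The "just reprocess the window" point can be sketched as an idempotent job that recomputes one day from the raw data and overwrites the stored result. The in-memory lists and the run_daily_job name are assumptions for illustration; a real job would read from a warehouse or files and be triggered by a scheduler.

```python
from datetime import date

# Hypothetical in-memory "tables"; a real job would read from storage.
raw_orders = [
    {"day": date(2024, 1, 1), "amount": 100.0},
    {"day": date(2024, 1, 1), "amount": 50.0},
    {"day": date(2024, 1, 2), "amount": 70.0},
]
daily_revenue: dict[date, float] = {}

def run_daily_job(day: date) -> None:
    """Recompute one day's revenue from scratch and overwrite the stored result."""
    total = sum(o["amount"] for o in raw_orders if o["day"] == day)
    daily_revenue[day] = total  # overwrite, so re-running the job is idempotent

run_daily_job(date(2024, 1, 1))
first_result = daily_revenue[date(2024, 1, 1)]

# A late order for Jan 1 shows up; re-running the same window picks it up.
raw_orders.append({"day": date(2024, 1, 1), "amount": 25.0})
run_daily_job(date(2024, 1, 1))
corrected_result = daily_revenue[date(2024, 1, 1)]
```

Because the job overwrites rather than appends, running it twice for the same day never double-counts, which is what makes batch re-runs easy to reason about.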
Stream processing
- Strengths: low latency / freshness — react to events now: fraud alerts, live metrics, trending, real-time recommendations.
- Weaknesses: more complex — must handle out-of-order and late events, windowing, and exactly-once semantics; harder to operate and debug; constant resource use.
- Use for: anything where stale-by-hours is useless: monitoring, anomaly detection, live counts, real-time personalization.
- Tools: Kafka Streams, Flink, Spark Structured Streaming.
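Much of the extra complexity comes from event time: events can arrive out of order, so a window cannot simply close when the clock passes its end. The sketch below is a toy tumbling-window counter with a simple watermark, assuming integer timestamps in seconds; the class name, window size, and lateness allowance are invented, and it does not attempt exactly-once delivery or fault tolerance, which real engines handle with checkpointing.

```python
from collections import defaultdict

WINDOW_SECONDS = 60    # tumbling window size (assumed for illustration)
ALLOWED_LATENESS = 30  # how far behind the newest event a late event may be (assumed)

class TumblingCounter:
    """Counts events per event-time window, tolerating bounded out-of-order arrival."""

    def __init__(self) -> None:
        self.open_windows: dict[int, int] = defaultdict(int)
        self.closed: dict[int, int] = {}  # finalized window_start -> count
        self.watermark = 0                # newest event time seen minus allowed lateness
        self.dropped = 0                  # events that arrived after their window closed

    def on_event(self, event_time: int) -> None:
        window_start = event_time - event_time % WINDOW_SECONDS
        if window_start + WINDOW_SECONDS <= self.watermark:
            self.dropped += 1  # too late: this window was already finalized
            return
        self.open_windows[window_start] += 1
        self.watermark = max(self.watermark, event_time - ALLOWED_LATENESS)
        # Finalize every open window whose end is at or before the watermark.
        for start in sorted(self.open_windows):
            if start + WINDOW_SECONDS <= self.watermark:
                self.closed[start] = self.open_windows.pop(start)

counter = TumblingCounter()
for t in [10, 50, 70, 55, 130, 40]:  # 55 is out of order but accepted; 40 is too late
    counter.on_event(t)
```

After these six events, the window starting at 0 is finalized with three events, the event at time 40 is dropped, and the windows starting at 60 and 120 are still open. Choosing the lateness allowance is itself a freshness-vs-completeness decision: a larger value drops fewer events but delays every result.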
The core trade-off
It’s throughput/simplicity vs latency/freshness. Batch is cheaper and simpler and gives complete results late; streaming gives fresh results immediately at the cost of complexity and continuous compute. Pick by asking: how stale can the answer be before it’s worthless?
The pragmatic hybrid
Real systems often run both, because each covers the other’s weakness:
- Lambda architecture: a streaming (speed) layer for fast, approximate, real-time results plus a batch layer that periodically recomputes the accurate, authoritative version. You get freshness and correctness, at the cost of building and operating the same logic in two code paths.
- Kappa architecture: treat everything as a stream and, when you need to backfill or fix a bug, replay the retained event log through the same streaming code, dropping the separate batch layer.
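A rough sketch of how the two architectures answer the same query, assuming a small in-memory event log of (timestamp, video_id) view events. All names and the cutoff value are hypothetical; in a real Lambda setup the speed layer is typically approximate, whereas here both layers are exact to keep the example short.

```python
# Hypothetical event log: (timestamp, video_id) view events.
log = [(1, "v1"), (2, "v1"), (3, "v2"), (4, "v1"), (5, "v2")]

def count_views(events):
    """Shared counting logic: video_id -> number of views."""
    counts = {}
    for _, video in events:
        counts[video] = counts.get(video, 0) + 1
    return counts

# Lambda: the batch view covers events up to the last batch run,
# and the speed view covers everything that arrived since.
BATCH_CUTOFF = 3
batch_view = count_views(e for e in log if e[0] <= BATCH_CUTOFF)
speed_view = count_views(e for e in log if e[0] > BATCH_CUTOFF)

def lambda_query(video):
    """Serving layer: merge the authoritative batch view with the recent speed view."""
    return batch_view.get(video, 0) + speed_view.get(video, 0)

# Kappa: a single code path; a backfill is a replay of the whole log through it.
kappa_view = count_views(log)
```

Both approaches give the same counts here. The practical difference is operational: Lambda merges two views produced by two pipelines, while Kappa depends on the log being retained long enough to replay.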
The interview cue
“The live view-count is a stream so it updates instantly (eventually consistent, approximate); the billing numbers come from a nightly batch job that’s authoritative and reconcilable.” Naming which path each metric takes — and why a system might run both — shows you understand the freshness-vs-cost exchange behind data pipelines. (This is latency vs throughput applied to processing.)