Skip to content
System design course
Ch.4 · Designing real systems·how to build it ·7 min read

Building a metrics & logging platform

Implement the batched ingestion pipeline, time-series compression, rollup downsampling, and approximate percentile queries.


Ingestion: agents → queue → processors

Agents batch locally and push; a queue absorbs bursts; processors write to storage:

# agent (on each host): batch points, flush periodically
def agent_flush(buffer):
    collector.send(compress(buffer))                  # batched, compressed over the wire

# collector → queue
def on_receive(batch):
    for point in decompress(batch):
        kafka.publish("metrics", key=point.series_id, value=point)   # partition by series

# stream processor
def process():
    for point in kafka.consume("metrics"):
        enrich(point)                                 # add tags/host metadata
        tsdb.append(point)                            # write to the time-series store
        rollup_aggregator.update(point)               # feed live rollups

Batching + the queue is what sustains millions of points/sec and survives spikes (back- pressure, replay) without dropping data.

Time-series storage and compression

A data point is (series_id, timestamp, value) where series_id = hash(metric, tags). Store per-series, time-ordered, and compress hard:

- delta-of-delta encode timestamps (regular intervals compress to ~bits)
- XOR-compress consecutive float values (Gorilla-style)
- partition by time window (e.g. 2h blocks) + series → queries scan only relevant blocks

This gets ~1–2 bytes per point and makes “series X over range T” a sequential block read. Logs instead go to an inverted index (parse fields → index) for text/field search.

Rollups (downsampling)

Pre-aggregate raw points into coarser intervals so long-range queries are cheap:

def rollup_aggregator():                              # continuous, per series + interval
    for window in tumbling(raw_stream, sizes=["1m", "1h", "1d"]):
        agg = {"sum": sum(window.values), "count": len(window.values),
               "min": min(window.values), "max": max(window.values),
               "p99_sketch": tdigest(window.values)}  # mergeable percentile sketch
        rollup_store.write(window.series, window.size, window.end, agg)

A “CPU over the last year” query reads 1d rollups (365 points), not billions of raw points. Retention tiers: raw for days, 1m for weeks, 1h/1d for years, then expire/archive.

Querying with the right resolution + approximation

The query engine picks the coarsest rollup that satisfies the time range, then merges:

def query(metric, tags, start, end, agg):
    res = choose_resolution(end - start)              # long range → coarse rollup
    series = storage.read(series_id(metric, tags), start, end, res)
    if agg == "p99":
        return merge_tdigests(s.p99_sketch for s in series).quantile(0.99)  # approximate
    return reduce(agg, series)

Percentiles use mergeable sketches (t-digest) and cardinality uses HyperLogLog — exact would be prohibitively expensive at this volume. Cache hot dashboard queries.

Alerting (streaming evaluation)

Evaluate rules against the live stream/recent rollups; dedupe and group before paging:

def evaluate_alerts():
    for rule in alert_rules:                          # e.g. p99 latency > 500ms for 5m
        if rule.fires(recent(rule.metric, rule.window)):
            key = rule.dedupe_key()
            if alert_state.transition_to_firing(key): # only on state change
                notify_grouped(rule, severity=rule.severity)   # notification service

State transitions (ok→firing→resolved) and grouping prevent alert storms (one outage = one page, not thousands).

Scale and failure handling

  • Ingestion spike → queue buffers; processors autoscale on lag; agents batch.
  • Processor crash → resume from the Kafka offset (at-least-once; writes idempotent by series+timestamp).
  • Storage growth → rollups + retention tiers + compression keep cost bounded.
  • High-cardinality explosion (too many tag combinations) → the real failure mode; cap/ drop high-cardinality tags, alert on cardinality.
  • Some loss tolerable → sampling logs, approximate metrics — acceptable for observability.

The takeaway

Concrete signals: batched queue-buffered ingestion (write-heavy), compressed time- series storage partitioned by time+series, rollups + retention tiers so range queries read aggregates, and approximate sketches (t-digest/HLL) for percentiles/cardinality, with stateful grouped alerting. Write-optimized pipeline + precomputed rollups is the analytics counterpart to the read-heavy fan-out feeds — and the capstone of Chapter 4.