Designing a metrics & logging platform

Ingest and query billions of metrics and logs — the write-heavy pipeline, time-series storage with rollups, and dashboards plus alerting.

The problem

Design an observability platform (Datadog/Prometheus + an ELK-style log stack): ingest metrics (numeric time series) and logs from thousands of services at enormous volume, store them efficiently, and let users query, dashboard, and alert. This is the write-heavy + analytics capstone — the opposite end from the read-heavy social feeds.

Step 1 — Requirements

Functional: ingest metrics (e.g. cpu{host=h1}=0.7 @ t) and logs (structured events) at high rate; query/aggregate over time ranges (sum, rate, percentiles, group-by); dashboards; alerting on thresholds/anomalies.

Non-functional: massive write throughput (millions of data points/sec, so write-heavy), scalable storage (with retention/cost control), fast time-range queries, near-real-time freshness, high availability; some data loss/approximation is tolerable.

Step 2 — Ingestion pipeline (write-optimized)

The classic write-heavy shape (Chapter 3) is buffer, then process:

agents/services → collectors → message queue (Kafka) → stream processors → storage

Agents on each host batch and push metrics/logs to collectors.
A queue (Kafka) absorbs bursts and decouples ingestion from storage (back-pressure, durability, replay).
Stream processors parse, enrich, aggregate, and write to storage.

The queue + async processing is what lets you survive ingestion spikes without dropping data.
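The agent side of this pipeline can be sketched as a bounded buffer that sends points in batches. This is a minimal illustration, not a real agent: the class name BatchingBuffer, the send callback, and the drop-oldest overflow policy are all assumptions made for the example (dropping is only acceptable because the requirements tolerate some data loss).

```python
from collections import deque
from typing import Callable, List


class BatchingBuffer:
    """Bounded agent-side buffer that flushes points in batches (hypothetical helper)."""

    def __init__(self, send: Callable[[List[dict]], None],
                 batch_size: int = 3, capacity: int = 5):
        self.send = send                      # pushes one batch to a collector
        self.batch_size = batch_size
        self.points = deque(maxlen=capacity)  # when full, the oldest point is discarded
        self.dropped = 0                      # counts points lost to overflow

    def add(self, point: dict) -> None:
        if len(self.points) == self.points.maxlen:
            self.dropped += 1                 # append() below will evict the oldest point
        self.points.append(point)

    def flush(self) -> int:
        """Send up to batch_size points; keep them buffered if the send fails."""
        size = min(self.batch_size, len(self.points))
        batch = [self.points[i] for i in range(size)]
        if not batch:
            return 0
        try:
            self.send(batch)
        except ConnectionError:
            return 0                          # collector unavailable: retry on next flush
        for _ in batch:
            self.points.popleft()
        return len(batch)
```

Points are removed from the buffer only after a successful send, so a collector outage delays data rather than losing it until the buffer fills.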

Step 3 — Storage: time-series and logs differ

Metrics → a time-series database (TSDB): each data point is (metric, tags, timestamp, value). TSDBs (Prometheus, InfluxDB, M3) store data in compressed, column-oriented blocks partitioned by time (and metric/tags), using encodings such as delta-of-delta on timestamps. They are optimized for “values of metric X over time range T, grouped by tag.”
Logs → an inverted index or columnar store: logs are searched by text and fields, so use an inverted index (Elasticsearch) or a columnar log store. Volume is high, so logs are often sampled.

Both shard by time + key so a query touches only relevant partitions.
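To make the delta-of-delta idea concrete, here is a small sketch of the encoding step on integer timestamps. The function names are made up for this example, and a real TSDB would go on to bit-pack the resulting small integers, which is not shown here.

```python
from typing import List


def delta_of_delta_encode(timestamps: List[int]) -> List[int]:
    """Store the first timestamp, then the change in the gap between points."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev_ts = timestamps[0]
    prev_delta = 0
    for ts in timestamps[1:]:
        delta = ts - prev_ts
        out.append(delta - prev_delta)  # 0 whenever the scrape interval is steady
        prev_ts, prev_delta = ts, delta
    return out


def delta_of_delta_decode(encoded: List[int]) -> List[int]:
    """Invert delta_of_delta_encode."""
    if not encoded:
        return []
    out = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        out.append(out[-1] + delta)
    return out


encoded = delta_of_delta_encode([1000, 1010, 1020, 1030, 1041])
print(encoded)  # [1000, 10, 0, 0, 1]
```

With a regular collection interval, almost every encoded value is zero, which is why this encoding compresses so well.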

Step 4 — Rollups and downsampling (cost + query speed)

You can’t keep every raw data point forever, and dashboards rarely need second-resolution for last year. Pre-aggregate:

Rollups — precompute aggregates at coarser intervals (1m → 1h → 1d) so a “last year” query reads daily rollups, not billions of raw points.
Retention tiers — keep raw data short-term (high res), downsampled data long-term, then expire/archive. Drastically cuts storage and speeds queries.
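A rollup can be sketched as bucketing points by a fixed interval and keeping a few aggregates per bucket. The function name rollup and the choice of aggregates (count, sum, min, max) are assumptions for illustration, not a particular database's format.

```python
from typing import Dict, Iterable, Tuple


def rollup(points: Iterable[Tuple[int, float]], interval: int) -> Dict[int, dict]:
    """Aggregate (timestamp, value) points into buckets of `interval` seconds."""
    buckets: Dict[int, dict] = {}
    for ts, value in points:
        start = ts - ts % interval  # align to the start of the bucket
        b = buckets.get(start)
        if b is None:
            buckets[start] = {"count": 1, "sum": value, "min": value, "max": value}
        else:
            b["count"] += 1
            b["sum"] += value
            b["min"] = min(b["min"], value)
            b["max"] = max(b["max"], value)
    return buckets


raw = [(0, 1.0), (30, 3.0), (60, 5.0)]
print(rollup(raw, 60))
```

Storing count and sum rather than a precomputed average means coarser rollups can be built from finer ones by adding the fields, without going back to raw points.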

This is fan-out-on-write for analytics: precompute so reads are cheap.
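On the read side, the payoff is that a query can pick the finest resolution that still keeps the number of points small. The sketch below assumes raw data at 1-second resolution plus the 1m, 1h, and 1d rollups, and a hypothetical budget of 1000 points per series; both the function name and the budget are illustrative.

```python
# Available resolutions in seconds: raw (assumed 1 s), 1m, 1h, 1d.
RESOLUTIONS = [1, 60, 3600, 86400]


def pick_resolution(range_seconds: int, max_points: int = 1000) -> int:
    """Return the finest resolution that yields at most max_points per series."""
    for res in RESOLUTIONS:
        if range_seconds / res <= max_points:
            return res
    return RESOLUTIONS[-1]  # very long ranges fall back to the coarsest rollup


print(pick_resolution(10 * 60))      # 10 minutes -> 1 (raw)
print(pick_resolution(86400))        # 1 day      -> 3600 (hourly)
print(pick_resolution(365 * 86400))  # 1 year     -> 86400 (daily)
```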

Step 5 — Query and dashboards

A query engine answers time-range + aggregation queries (rate, p99, group-by-tag) by reading the right partitions/rollups, merging, and returning series for dashboards. Cache common dashboard queries. Percentiles at scale use approximate sketches such as t-digest, since exact computation is too expensive; HyperLogLog (HLL) plays the same role for distinct counts.
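The key property of a percentile sketch is that partial results from different shards can be merged. The example below uses a fixed-bucket histogram, which is a deliberately simpler stand-in for t-digest and not t-digest itself; the class name and bucket bounds are assumptions for illustration.

```python
import bisect
from typing import List


class HistogramSketch:
    """Fixed-bucket histogram: a simple, mergeable approximation of a distribution."""

    def __init__(self, bounds: List[float]):
        self.bounds = sorted(bounds)                # upper bound of each bucket
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the overflow bucket

    def add(self, value: float) -> None:
        # bisect_left finds the first bucket whose upper bound is >= value
        self.counts[bisect.bisect_left(self.bounds, value)] += 1

    def merge(self, other: "HistogramSketch") -> None:
        if self.bounds != other.bounds:
            raise ValueError("sketches must share bucket bounds")
        for i, c in enumerate(other.counts):
            self.counts[i] += c

    def quantile(self, q: float) -> float:
        """Return the upper bound of the bucket that contains quantile q."""
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must be in [0, 1]")
        total = sum(self.counts)
        if total == 0:
            raise ValueError("empty sketch")
        target = q * total
        seen = 0
        for i, c in enumerate(self.counts):
            seen += c
            if c > 0 and seen >= target:
                return self.bounds[i] if i < len(self.bounds) else float("inf")
        return float("inf")
```

Each shard builds its own sketch and the query engine merges them, so the cost depends on the number of buckets rather than the number of raw points. The answer is only as precise as the bucket widths, which is the accuracy trade-off that adaptive sketches like t-digest improve on.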

Step 6 — Alerting

Continuously evaluate alert rules against the stream/recent data (threshold, rate-of-change, anomaly); on trigger, deduplicate/group alerts and route them via the notification service (don’t page 500 times for one outage). Alert evaluation often runs on a separate streaming path for low latency.
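Grouping can be sketched as suppressing repeat notifications for the same group within a time window. The grouping key (rule, service), the 300-second window, and the alert field names are assumptions chosen for the example.

```python
from typing import Dict, Tuple


class AlertGrouper:
    """Allow one notification per group per window (hypothetical helper)."""

    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.last_sent: Dict[Tuple[str, str], int] = {}

    def should_notify(self, alert: dict, now: int) -> bool:
        key = (alert["rule"], alert["service"])  # assumed grouping key
        last = self.last_sent.get(key)
        if last is not None and now - last < self.window:
            return False                         # already notified for this group
        self.last_sent[key] = now
        return True


grouper = AlertGrouper(window_seconds=300)
alerts = [{"rule": "high_cpu", "service": "api", "host": f"h{i}"} for i in range(500)]
pages = sum(grouper.should_notify(a, now=10) for a in alerts)
print(pages)  # 1
```

Five hundred hosts firing the same rule for the same service produce a single page; the next one is allowed only after the window has passed.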

Step 7 — Architecture

agents → collectors → Kafka → stream processors → TSDB (metrics) / log store (logs)
                                   │                    ▲
                                   └ rollups/downsample ┘
query engine → dashboards (+ cache) ;  alerting (stream rules) → notifications

Trade-offs to raise

Raw retention (full fidelity, costly) vs rollups/downsampling (cheap, but approximate over long ranges). Tier by age.
Exact vs approximate (percentiles, cardinality) — sketches trade accuracy for huge savings.
Sampling logs (cheaper, lossy) vs keeping everything.
Write throughput prioritized; reads served from rollups/indexes.

The interview cue

“High-volume ingest through a queue + stream processors (write-heavy, burst-absorbing); metrics in a compressed TSDB, logs in an inverted index, both sharded by time+key; rollups + retention tiers so long-range queries read aggregates, not raw points; approximate sketches for percentiles; streaming alert evaluation → grouped notifications.” Write-optimized pipeline + time-series storage + rollups is the answer; implementation next.