Designing a metrics & logging platform
Ingest and query billions of metrics and logs — the write-heavy pipeline, time-series storage with rollups, and dashboards plus alerting.
The problem
Design an observability platform (Datadog/Prometheus + an ELK-style log stack): ingest metrics (numeric time series) and logs from thousands of services at enormous volume, store them efficiently, and let users query, dashboard, and alert. This is the write-heavy + analytics capstone — the opposite end from the read-heavy social feeds.
Step 1 — Requirements
Functional: ingest metrics (e.g. cpu{host=h1}=0.7 @ t) and logs (structured events) at high rate; query/aggregate over time ranges (sum, rate, percentiles, group-by); dashboards; alerting on thresholds/anomalies.
Non-functional: massive write throughput (millions of data points/sec, write-heavy), scalable storage (with retention/cost control), fast time-range queries, near-real-time freshness, high availability; some data loss/approximation is tolerable.
Step 2 — Ingestion pipeline (write-optimized)
The classic write-heavy shape (Chapter 3): buffer, then process:
agents/services → collectors → message queue (Kafka) → stream processors → storage
- Agents on each host batch and push metrics/logs to collectors.
- A queue (Kafka) absorbs bursts and decouples ingestion from storage (back-pressure, durability, replay).
- Stream processors parse, enrich, aggregate, and write to storage.
The queue + async processing is what lets you survive ingestion spikes without dropping data.
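The first buffer in this pipeline sits in the agent itself. The following is a minimal sketch of agent-side batching, not any particular vendor's agent: the `BatchingAgent` class, its `send` callback, and the `batch_size`/`max_buffer` parameters are all hypothetical names chosen for illustration. It assumes the bounded buffer drops the oldest points during a long collector outage, in line with the "some data loss tolerable" requirement.

```python
from collections import deque


class BatchingAgent:
    """Buffers points in memory and ships them to a collector in batches (hypothetical)."""

    def __init__(self, send, batch_size=3, max_buffer=10):
        self.send = send                        # callable that ships one batch
        self.batch_size = batch_size
        self.buffer = deque(maxlen=max_buffer)  # when full, the oldest point is dropped

    def record(self, point):
        self.buffer.append(point)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        batch = list(self.buffer)
        try:
            self.send(batch)
        except ConnectionError:
            return  # keep the points buffered; the next flush retries them
        self.buffer.clear()


# Healthy collector: points arrive in ordered batches.
sent = []
agent = BatchingAgent(send=sent.append, batch_size=3, max_buffer=10)
for i in range(7):
    agent.record(i)


# Collector outage: the buffer stays bounded and keeps the newest points.
def unreachable(batch):
    raise ConnectionError("collector down")


stalled = BatchingAgent(send=unreachable, batch_size=2, max_buffer=4)
for i in range(6):
    stalled.record(i)
```

The agent only smooths small bursts and short outages. Durability and replay still come from the queue behind the collectors.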
Step 3 — Storage: time-series and logs differ
- Metrics → a time-series database (TSDB): each data point is (metric, tags, timestamp, value). TSDBs (Prometheus, InfluxDB, M3) store data in compressed, column-oriented form, partitioned by time (and by metric/tags), with heavy compression (delta-of-delta on timestamps, etc.). Optimized for “values of metric X over time range T, grouped by tag.”
- Logs → an inverted index or columnar store: logs are searched by text and fields, so use an inverted index (Elasticsearch) or a columnar log store. Volume is high, so logs are often sampled.
Both shard by time + key so a query touches only relevant partitions.
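To see why delta-of-delta compresses timestamps so well, here is a small sketch of the arithmetic only. The function names are illustrative, and the variable-length bit packing that real TSDBs apply on top of these values is omitted. For a metric scraped at a steady interval, almost every encoded value is zero.

```python
def delta_of_delta_encode(timestamps):
    """Encode timestamps as: first value, then the change in successive deltas."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev_delta = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        encoded.append(delta - prev_delta)
        prev_delta = delta
    return encoded


def delta_of_delta_decode(encoded):
    """Invert delta_of_delta_encode."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps


# A 10-second scrape interval with one sample arriving a second late.
raw = [1000, 1010, 1020, 1030, 1041, 1051]
encoded = delta_of_delta_encode(raw)  # [1000, 10, 0, 0, 1, -1]
```

Small values near zero need only a few bits each once packed, which is where the storage savings come from.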
Step 4 — Rollups and downsampling (cost + query speed)
You can’t keep every raw data point forever, and dashboards rarely need second-resolution for last year. Pre-aggregate:
- Rollups — precompute aggregates at coarser intervals (1m → 1h → 1d) so a “last year” query reads daily rollups, not billions of raw points.
- Retention tiers — keep raw data short-term (high res), downsampled data long-term, then expire/archive. Drastically cuts storage and speeds queries.
This is fan-out-on-write for analytics: precompute so reads are cheap.
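A rollup tier can be sketched as follows. This is an illustrative sketch, assuming each bucket stores count, sum, min, and max; the function names are hypothetical. Storing sum and count rather than a precomputed average is what lets 1m buckets be merged into 1h buckets without rereading raw points.

```python
def merge(a, b):
    """Combine two aggregates covering disjoint sets of points."""
    return {
        "count": a["count"] + b["count"],
        "sum": a["sum"] + b["sum"],
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
    }


def rollup(points, interval):
    """Aggregate raw (timestamp_seconds, value) points into fixed-width buckets."""
    buckets = {}
    for ts, value in points:
        start = ts - ts % interval
        agg = {"count": 1, "sum": value, "min": value, "max": value}
        buckets[start] = merge(buckets[start], agg) if start in buckets else agg
    return buckets


def coarsen(buckets, interval):
    """Merge existing buckets into wider ones (e.g. 1m -> 1h)."""
    out = {}
    for start, agg in buckets.items():
        coarse = start - start % interval
        out[coarse] = merge(out[coarse], agg) if coarse in out else agg
    return out


raw = [(0, 2), (30, 4), (60, 6), (90, 8), (3600, 10)]
per_minute = rollup(raw, 60)
per_hour = coarsen(per_minute, 3600)  # built from rollups, not from raw points
average_first_hour = per_hour[0]["sum"] / per_hour[0]["count"]
```

Percentiles do not compose this way, which is one reason long-range percentile queries rely on mergeable sketches instead.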
Step 5 — Query and dashboards
A query engine answers time-range + aggregation queries (rate, p99, group-by-tag) by reading the right partitions/rollups, merging, and returning series for dashboards. Cache common dashboard queries. Percentiles at scale use approximate sketches (e.g. t-digest), since exact percentiles require retaining and sorting every raw value; cardinality (distinct counts) uses HyperLogLog (HLL).
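The sketch below illustrates the idea of a mergeable percentile sketch. It is not t-digest; it is a simpler log-bucketed histogram in the spirit of DDSketch, handling positive values only, and the `LogHistogram` class and its methods are hypothetical names. Each shard builds its own sketch, and the query engine merges them before reading a quantile.

```python
import math
from collections import Counter


class LogHistogram:
    """Log-bucketed histogram with bounded relative error (positive values only)."""

    def __init__(self, relative_error=0.01):
        self.gamma = (1 + relative_error) / (1 - relative_error)
        self.log_gamma = math.log(self.gamma)
        self.counts = Counter()  # bucket index -> number of values
        self.total = 0

    def add(self, value):
        # Bucket i covers (gamma**(i-1), gamma**i].
        index = math.ceil(math.log(value) / self.log_gamma)
        self.counts[index] += 1
        self.total += 1

    def merge(self, other):
        # Sketches from different shards combine by adding bucket counts.
        self.counts.update(other.counts)
        self.total += other.total

    def quantile(self, q):
        if self.total == 0:
            raise ValueError("empty sketch")
        rank = q * (self.total - 1)
        seen = 0
        for index in sorted(self.counts):
            seen += self.counts[index]
            if seen > rank:
                # Representative value for the bucket, within relative_error of any member.
                return 2 * self.gamma ** index / (self.gamma + 1)


# Two shards each see half of the latencies 1..1000 ms.
shard_a, shard_b = LogHistogram(), LogHistogram()
for latency in range(1, 1001):
    (shard_a if latency % 2 else shard_b).add(latency)

shard_a.merge(shard_b)
p99 = shard_a.quantile(0.99)  # close to the exact value of 990
```

Memory grows with the number of occupied buckets rather than the number of points, and the merge step is what lets rollups and shards be combined at query time.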
Step 6 — Alerting
Continuously evaluate alert rules against the stream/recent data (threshold, rate-of-change, anomaly); on trigger, deduplicate/group alerts and route them via the notification service (don’t page 500 times for one outage). Alert evaluation often runs on a separate streaming path so that it stays low-latency.
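A minimal sketch of threshold evaluation plus grouping follows. The function names, the label layout, and the "must exceed the threshold for N consecutive points" rule are assumptions made for illustration; real alerting systems add state, silences, and routing on top.

```python
from collections import defaultdict


def firing_series(samples, threshold, for_points):
    """Return label sets whose last `for_points` values all exceed `threshold`.

    samples maps a tuple of (label, value) pairs to values ordered oldest to newest.
    """
    firing = []
    for labels, values in samples.items():
        recent = values[-for_points:]
        if len(recent) == for_points and all(v > threshold for v in recent):
            firing.append(labels)
    return firing


def group_alerts(firing, group_by):
    """Collapse firing series into one notification per distinct group key."""
    groups = defaultdict(list)
    for labels in firing:
        label_map = dict(labels)
        key = tuple((name, label_map.get(name)) for name in group_by)
        groups[key].append(label_map)
    return dict(groups)


samples = {
    (("service", "api"), ("host", "h1")): [0.95, 0.97, 0.99],
    (("service", "api"), ("host", "h2")): [0.91, 0.96, 0.93],
    (("service", "api"), ("host", "h3")): [0.50, 0.95, 0.96],
    (("service", "db"), ("host", "h4")): [0.20, 0.30, 0.95],  # one spike, not sustained
}

firing = firing_series(samples, threshold=0.9, for_points=2)
notifications = group_alerts(firing, group_by=["service"])
api_hosts = sorted(a["host"] for a in notifications[(("service", "api"),)])
```

Three hosts breach the threshold, but grouping by service produces a single notification that lists all of them.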
Step 7 — Architecture
agents → collectors → Kafka → stream processors → TSDB (metrics) / log store (logs)
                              │                    ▲
                              └ rollups/downsample ┘
query engine → dashboards (+ cache) ; alerting (stream rules) → notifications
Trade-offs to raise
- Raw retention (full fidelity, costly) vs rollups/downsampling (cheap, but coarser over long ranges). Tier by age.
- Exact vs approximate (percentiles, cardinality) — sketches trade accuracy for huge savings.
- Sampling logs (cheaper, lossy) vs keeping everything.
- Write throughput prioritized; reads served from rollups/indexes.
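The log-sampling trade-off can be made concrete with a short sketch. It is one possible policy, not a standard: the `keep_log` function, the `level` and `trace_id` field names, and the choice to always keep errors are assumptions for illustration. Hashing the trace ID makes the decision deterministic, so all lines of one request are kept or dropped together.

```python
import zlib


def keep_log(event, sample_rate=0.1):
    """Decide whether to store a log event (hypothetical policy)."""
    # Always keep high-severity events; sampling applies to routine logs only.
    if event.get("level") in ("ERROR", "FATAL"):
        return True
    # Deterministic hash of the trace ID into 10,000 buckets.
    key = str(event.get("trace_id", "")).encode()
    bucket = zlib.crc32(key) % 10_000
    return bucket < sample_rate * 10_000


info = {"level": "INFO", "trace_id": "req-42", "msg": "handled request"}
error = {"level": "ERROR", "trace_id": "req-42", "msg": "upstream timeout"}

same_decision = keep_log(info) == keep_log(info)     # stable across calls and hosts
error_kept = keep_log(error, sample_rate=0.0)        # errors bypass sampling
```

A random per-line coin flip would be cheaper to write but would leave most requests with gaps in their log trail.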
The interview cue
“High-volume ingest through a queue + stream processors (write-heavy, burst-absorbing); metrics in a compressed TSDB, logs in an inverted index, both sharded by time+key; rollups + retention tiers so long-range queries read aggregates, not raw points; approximate sketches for percentiles; streaming alert evaluation → grouped notifications.” Write-optimized pipeline + time-series storage + rollups is the answer; implementation next.