Designing Pastebin
Like a URL shortener but storing large text blobs — the moment to separate small metadata in a database from large content in a blob store behind a CDN.
The problem
Design Pastebin: users paste text (often large), get a short URL, and anyone with the link can read the paste. It’s the URL shortener plus a large-content twist — the canonical place to learn metadata-in-DB, blob-in-object-store.
Step 1 — Requirements
Functional: create a paste (text up to a few MB), get a short URL, read it via the URL; optional expiry, visibility (public/unlisted/private), syntax/size limits.
Non-functional: read-heavy (reads ≫ writes), low-latency reads, durable storage of potentially large blobs, scalable, available.
Step 2 — Estimate
- Say 10M pastes/day → 10M ÷ 86,400 s ≈ 115 writes/sec. Reads at maybe 10–100× that → roughly 1,200 to 12,000 reads/sec.
- Average paste ~10 KB, max a few MB → 10M × 10 KB = ~100 GB/day of content → don’t put blobs in the relational DB.
That storage number is the whole lesson: large content goes in a blob/object store, not the database.
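The arithmetic behind those estimates can be checked in a few lines. This is only a back-of-envelope sketch: it assumes decimal units (1 KB = 1,000 bytes), and the yearly figure is a straight extrapolation of the daily one, not a number from the lesson itself.

```python
# Back-of-envelope numbers from the estimate above.
# Assumption: decimal units (1 KB = 1,000 bytes, 1 GB = 1e9 bytes).
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

pastes_per_day = 10_000_000
avg_paste_bytes = 10_000  # ~10 KB average paste

# Write rate, and the 10x to 100x read multiplier applied to it.
writes_per_sec = pastes_per_day / SECONDS_PER_DAY
reads_per_sec_low = writes_per_sec * 10
reads_per_sec_high = writes_per_sec * 100

# Content volume per day, extrapolated to a year (assumes no expiry).
content_gb_per_day = pastes_per_day * avg_paste_bytes / 1e9
content_tb_per_year = content_gb_per_day * 365 / 1000

print(f"writes/sec: {writes_per_sec:.1f}")
print(f"reads/sec: {reads_per_sec_low:.0f} to {reads_per_sec_high:.0f}")
print(f"content: {content_gb_per_day:.0f} GB/day, {content_tb_per_year:.1f} TB/year")
```

At tens of terabytes per year of opaque text, the content volume dwarfs the metadata, which is why it is kept out of the relational database.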
Step 3 — The key split: metadata vs content
- Metadata (small, queried) → a database: paste_id, owner, created_at, expires_at, visibility, content_url, size.
- Content (large, blob) → an object store (S3-style), one object per paste, fronted by a CDN for hot reads.
create: client → app → store blob in S3 → store metadata row (with blob URL) → return code
read: client → app → metadata (cache) → fetch blob from CDN/S3
The DB stays small and fast; the object store handles the bulk; the CDN serves popular pastes from the edge.
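The create and read flows above can be sketched with in-memory stand-ins. This is a minimal illustration, not a real client: the two dictionaries are hypothetical substitutes for the object store and the metadata database, the `pastes/<id>` key layout is an assumption, and the cache and CDN layers are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass
class PasteMeta:
    """The small, queryable row; field names follow the metadata list above."""
    paste_id: str
    owner: Optional[str]
    created_at: datetime
    expires_at: Optional[datetime]
    visibility: str
    content_url: str
    size: int


# Hypothetical in-memory stand-ins for the real services.
blob_store: Dict[str, bytes] = {}       # object store: key -> paste bytes
metadata_db: Dict[str, PasteMeta] = {}  # database: paste_id -> metadata row


def create_paste(paste_id: str, text: str, owner: Optional[str] = None,
                 expires_at: Optional[datetime] = None,
                 visibility: str = "public") -> str:
    data = text.encode("utf-8")
    content_url = f"pastes/{paste_id}"  # assumed key layout: one object per paste
    # Store the blob first, then the metadata row that points at it.
    blob_store[content_url] = data
    metadata_db[paste_id] = PasteMeta(
        paste_id=paste_id,
        owner=owner,
        created_at=datetime.now(timezone.utc),
        expires_at=expires_at,
        visibility=visibility,
        content_url=content_url,
        size=len(data),
    )
    return paste_id


def read_paste(paste_id: str) -> Optional[str]:
    # Metadata lookup first (small and fast), then the blob fetch.
    meta = metadata_db.get(paste_id)
    if meta is None:
        return None
    return blob_store[meta.content_url].decode("utf-8")
```

Writing the blob before the metadata row means a reader never finds a row whose content is missing; the cost is that a failed metadata write can leave an orphaned blob for a cleanup job to collect.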
Step 4 — Code generation
Same as the URL shortener: base62 of a range-leased ID or a key-generation service for unique short codes (reuse that lesson). No need to re-derive it — say “same ID strategy as the shortener.”
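As a quick reminder of what "base62 of an ID" means, here is a minimal sketch. The alphabet ordering (digits, then lowercase, then uppercase) is an assumption, and how the numeric ID is leased or generated is out of scope here.

```python
import string

# Assumed alphabet order: 0-9, a-z, A-Z (62 symbols in total).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def base62_encode(n: int) -> str:
    """Turn a non-negative integer ID into a short base62 code."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    # Digits come out least-significant first, so reverse them.
    return "".join(reversed(digits))


def base62_decode(code: str) -> int:
    """Inverse of base62_encode."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

Seven base62 characters cover 62^7 (about 3.5 trillion) IDs, which is ample headroom at 10M pastes per day.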
Step 5 — Reads (cache + CDN)
- Metadata in a cache (Redis) for the hot pastes.
- Content via CDN; pastes are immutable, so caching is trivial: an edit never has to invalidate a cached copy, and TTLs exist only to bound cache memory.
- Private pastes bypass the CDN (or use signed URLs) and check auth at the app.
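The signed-URL option for private pastes can be illustrated with a generic HMAC scheme. This is a sketch of the idea only: real CDNs and object stores define their own signed-URL formats, and the secret, the parameter names (`expires`, `sig`), and the path layout here are all hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"demo-secret"  # hypothetical; a real key would come from a secret store


def _signature(path: str, expires_at: int) -> str:
    msg = f"{path}:{expires_at}".encode("utf-8")
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()


def sign_url(path: str, expires_at: int) -> str:
    """Issued by the app after it has checked that the user may read the paste."""
    query = urlencode({"expires": expires_at, "sig": _signature(path, expires_at)})
    return f"{path}?{query}"


def verify_url(url: str, now: Optional[int] = None) -> bool:
    """Checked wherever the content is served: valid signature and not expired."""
    now = int(time.time()) if now is None else now
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    if now > expires_at:
        return False
    expected = _signature(parsed.path, expires_at)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8"))
```

For example, `sign_url("/pastes/abc123", int(time.time()) + 300)` would yield a link that verifies for five minutes and fails if the path or expiry is altered.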
Step 6 — Expiry and limits
- Expiry → expires_at; a background job deletes expired metadata rows and their blobs (don’t leak storage); lazy-check on read.
- Size/abuse limits → cap paste size; rate-limit creates; scan for malicious content.
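The two halves of expiry, the lazy check on read and the background purge, can be sketched as follows. The dictionaries are hypothetical in-memory stand-ins for the metadata database and the object store, and a real sweeper would query an index on expires_at in batches rather than scan every row.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

# Hypothetical in-memory stand-ins for the metadata DB and the object store.
metadata: Dict[str, dict] = {}
blobs: Dict[str, bytes] = {}


def is_expired(meta: dict, now: datetime) -> bool:
    expires_at = meta.get("expires_at")
    return expires_at is not None and expires_at <= now


def read_paste(paste_id: str, now: datetime) -> Optional[bytes]:
    """Lazy check: an expired paste reads as missing even before the sweep runs."""
    meta = metadata.get(paste_id)
    if meta is None or is_expired(meta, now):
        return None
    return blobs.get(meta["content_url"])


def sweep_expired(now: datetime) -> int:
    """Background job: delete expired blobs, then their metadata rows."""
    expired_ids = [pid for pid, meta in metadata.items() if is_expired(meta, now)]
    for pid in expired_ids:
        # Blob first: if the job dies midway, the surviving row still points
        # at the blob key, so the next run can retry instead of leaking storage.
        blobs.pop(metadata[pid]["content_url"], None)
        del metadata[pid]
    return len(expired_ids)
```

Deleting the blob before the row is one reasonable ordering; the opposite order would risk orphaned blobs that no row references.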
Trade-offs to raise
- Blob store + CDN vs DB BLOBs — never store multi-MB blobs in the relational DB (bloats it, kills cache efficiency); the object-store split is the right call.
- Immutability simplifies caching — pastes don’t change, so cache freely.
- Public (CDN-cacheable) vs private (auth’d, signed URLs) content paths.
The interview cue
“Same short-code generation as the URL shortener, but content is large — so paste text goes to an object store fronted by a CDN, and only small metadata lives in the DB (cached for hot pastes). Pastes are immutable, so caching is trivial; expiry purges both metadata and blobs.” The metadata/blob split + CDN is the new idea here and recurs in every media system ahead.