Designing Yelp (proximity search)
Find businesses near a location — the geospatial indexing (geohash, quadtree) that makes "what's nearby" fast, plus reviews and ranking.
The problem
Design Yelp (or “nearby places”): given a user’s location, return nearby businesses matching filters (category, rating, open now), with reviews and ranking. The core is geospatial search at scale — “find all points within radius R of (lat, lng), fast” — which recurs in Uber, DoorDash, and Tinder.
Step 1 — Requirements
Functional: search businesses near a location (radius / map viewport) with filters (category, price, rating, open now); view business details + reviews; write reviews/ ratings.
Non-functional: low-latency proximity queries, read-heavy (search/browse ≫ writes), scalable (tens of millions of businesses), available; review writes are lighter.
Step 2 — Why naive geo queries fail
“Businesses within 5km” via WHERE lat BETWEEN ... AND lng BETWEEN ... forces a scan/box
that doesn’t use indexes well and returns a rectangle, not a radius. At scale you need a
spatial index that buckets the world so a nearby query touches only a few buckets.
Step 3 — Geospatial indexing (the core)
Two dominant approaches:
- Geohash — encode (lat, lng) into a short string; nearby points share a prefix. Index businesses by geohash; a proximity query computes the user’s geohash cell and queries that cell plus its 8 neighbors (to cover the radius and handle borders). Simple, works with a plain string index, easy to shard. The common interview answer.
- Quadtree — recursively subdivide space into 4 quadrants until each cell holds ≤ K points; dense areas (cities) get finer cells, sparse areas coarser. A query descends to the relevant cells. Adapts to density better than uniform geohash cells.
- S2 / H3 — Google’s/Uber’s hierarchical cell systems (refinements of the above); mention as production-grade options.
Filter the spatial candidates by category/rating/open-now and sort by distance/relevance.
Step 4 — Data model
Business: id, name, lat, lng, geohash, category, price, hours, avg_rating, review_count
Review: id, business_id, user_id, rating, text, created_at
Businesses are mostly static (good for caching); reviews append and update aggregates.
Step 5 — Architecture
search → API → geo index (geohash/quadtree) → candidate businesses in nearby cells
→ filter (category, open now) + rank (distance, rating, relevance)
→ hydrate details (cached)
review write → store review → async update business avg_rating/count
- Geo index in memory / a spatial DB (PostGIS, Elasticsearch geo, or a custom geohash index), sharded by region.
- Business details heavily cached (static); reviews in a sharded store with async aggregation of rating/count (avoid hot-row updates).
Step 6 — Ranking
Sort nearby candidates by a blend of distance, rating, review count, relevance to the query, and possibly personalization/ads. Most queries want “close + good.”
Trade-offs to raise
- Geohash (simple, uniform cells, border handling) vs quadtree (density-adaptive, more complex).
- Precision vs recall — cell size trades how many candidates you scan vs missing edge results (query neighbor cells to fix borders).
- Static businesses (cacheable) vs fresh “open now”/rating — cache details, compute open-now and live aggregates separately.
The interview cue
“Index businesses with a geohash (or quadtree for density); a proximity search queries the user’s cell plus neighbors, filters by category/open-now, and ranks by distance + rating + relevance; business details are cached (mostly static), reviews stored with async rating aggregation.” Geospatial indexing is the crux and the reusable core for the rest of this group; implementation next.