Designing Yelp (proximity search)

Find businesses near a location — the geospatial indexing (geohash, quadtree) that makes "what's nearby" fast, plus reviews and ranking.

The problem

Design Yelp (or “nearby places”): given a user’s location, return nearby businesses matching filters (category, rating, open now), with reviews and ranking. The core is geospatial search at scale — “find all points within radius R of (lat, lng), fast” — which recurs in Uber, DoorDash, and Tinder.

Step 1 — Requirements

Functional: search businesses near a location (radius / map viewport) with filters (category, price, rating, open now); view business details + reviews; write reviews/ ratings.

Non-functional: low-latency proximity queries, read-heavy (search/browse ≫ writes), scalable (tens of millions of businesses), available; review writes are lighter.

Step 2 — Why naive geo queries fail

“Businesses within 5km” via WHERE lat BETWEEN ... AND lng BETWEEN ... forces a scan/box that doesn’t use indexes well and returns a rectangle, not a radius. At scale you need a spatial index that buckets the world so a nearby query touches only a few buckets.

Step 3 — Geospatial indexing (the core)

Two dominant approaches:

Geohash — encode (lat, lng) into a short string; nearby points share a prefix. Index businesses by geohash; a proximity query computes the user’s geohash cell and queries that cell plus its 8 neighbors (to cover the radius and handle borders). Simple, works with a plain string index, easy to shard. The common interview answer.
Quadtree — recursively subdivide space into 4 quadrants until each cell holds ≤ K points; dense areas (cities) get finer cells, sparse areas coarser. A query descends to the relevant cells. Adapts to density better than uniform geohash cells.
S2 / H3 — Google’s/Uber’s hierarchical cell systems (refinements of the above); mention as production-grade options.

Filter the spatial candidates by category/rating/open-now and sort by distance/relevance.

Step 4 — Data model

Business: id, name, lat, lng, geohash, category, price, hours, avg_rating, review_count
Review:   id, business_id, user_id, rating, text, created_at

Businesses are mostly static (good for caching); reviews append and update aggregates.

Step 5 — Architecture

search → API → geo index (geohash/quadtree) → candidate businesses in nearby cells
              → filter (category, open now) + rank (distance, rating, relevance)
              → hydrate details (cached)
review write → store review → async update business avg_rating/count

Geo index in memory / a spatial DB (PostGIS, Elasticsearch geo, or a custom geohash index), sharded by region.
Business details heavily cached (static); reviews in a sharded store with async aggregation of rating/count (avoid hot-row updates).

Step 6 — Ranking

Sort nearby candidates by a blend of distance, rating, review count, relevance to the query, and possibly personalization/ads. Most queries want “close + good.”

Trade-offs to raise

Geohash (simple, uniform cells, border handling) vs quadtree (density-adaptive, more complex).
Precision vs recall — cell size trades how many candidates you scan vs missing edge results (query neighbor cells to fix borders).
Static businesses (cacheable) vs fresh “open now”/rating — cache details, compute open-now and live aggregates separately.

The interview cue

“Index businesses with a geohash (or quadtree for density); a proximity search queries the user’s cell plus neighbors, filters by category/open-now, and ranks by distance + rating + relevance; business details are cached (mostly static), reviews stored with async rating aggregation.” Geospatial indexing is the crux and the reusable core for the rest of this group; implementation next.