Skip to content
System design course
Ch.4 · Designing real systems·how to build it ·6 min read

Building Yelp (proximity search)

Implement geohash proximity queries with neighbor cells, the filter-and-rank step, and async review-aggregate updates.


Indexing businesses by geohash

Store each business under its geohash so a prefix identifies a cell:

def index_business(b):
    b.geohash = geohash.encode(b.lat, b.lng, precision=7)   # ~150m cell
    geo_index.add(b.geohash, b.id)                          # cell -> business ids
    business_store.put(b.id, b)

Precision picks cell size: shorter prefix = bigger cell. Choose precision for your typical search radius.

The proximity query (cell + neighbors)

A radius query covers the user’s cell and its 8 neighbors, so results near a cell border aren’t missed:

def search_nearby(lat, lng, radius_km, filters):
    precision = precision_for(radius_km)
    cell = geohash.encode(lat, lng, precision)
    cells = [cell] + geohash.neighbors(cell)               # 3x3 block of cells
    ids = []
    for c in cells:
        ids += geo_index.get(c)
    results = []
    for b in business_store.batch_get(ids):
        if haversine(lat, lng, b.lat, b.lng) <= radius_km and passes(b, filters):
            results.append(b)
    return rank(results, lat, lng)

Geohash gives a coarse candidate set (the cells); the exact haversine distance check filters to the true radius. For large radii, query a coarser precision or more rings of neighbors.

Filtering and ranking

def passes(b, f):
    return ((not f.category or b.category == f.category)
            and (not f.open_now or is_open(b.hours, now()))
            and b.avg_rating >= f.min_rating)

def rank(results, lat, lng):
    return sorted(results, key=lambda b: -score(b, lat, lng))   # distance + rating + relevance

open_now is computed from hours at query time (not stored), so it stays correct.

Reviews with async aggregation

A new review must not hot-update the business row on every write; aggregate asynchronously:

def add_review(business_id, user, rating, text):
    reviews.put(review_id(), {"business": business_id, "user": user,
                              "rating": rating, "text": text, "ts": now()})
    rating_queue.publish({"business": business_id, "rating": rating})

def rating_aggregator():                                   # batched
    for business_id, ratings in grouped(rating_queue.consume_batched()):
        b = business_store.get(business_id)
        b.review_count += len(ratings)
        b.avg_rating = recompute_avg(b, ratings)           # running average
        business_store.put(business_id, b)
        cache.invalidate(business_id)

Caching

  • Business details (mostly static) cached aggressively + at the edge.
  • Hot search areas (downtowns) — cache popular cell queries briefly.
  • Geo index hot cells replicated; dense cities get finer precision to bound candidates per cell.

Scale and failure handling

  • Geo index sharded by region; a hot dense cell → finer precision / split.
  • Read-heavy → caching + replicas carry it; the geo index is in memory.
  • Review write spike → queue + async aggregation absorbs it.
  • Border results → neighbor-cell query handles them; exact haversine prunes false positives.

The takeaway

Concrete signals: geohash cell + neighbor proximity queries refined by exact haversine distance, query-time filters (open-now), and async review aggregation. The geospatial index is the reusable workhorse — Uber and DoorDash add real-time moving locations and matching on top of exactly this.