Building Yelp (proximity search)
Implement geohash proximity queries with neighbor cells, the filter-and-rank step, and async review-aggregate updates.
Indexing businesses by geohash
Store each business under its geohash so a prefix identifies a cell:
def index_business(b):
b.geohash = geohash.encode(b.lat, b.lng, precision=7) # ~150m cell
geo_index.add(b.geohash, b.id) # cell -> business ids
business_store.put(b.id, b)
Precision picks cell size: shorter prefix = bigger cell. Choose precision for your typical search radius.
The proximity query (cell + neighbors)
A radius query covers the user’s cell and its 8 neighbors, so results near a cell border aren’t missed:
def search_nearby(lat, lng, radius_km, filters):
precision = precision_for(radius_km)
cell = geohash.encode(lat, lng, precision)
cells = [cell] + geohash.neighbors(cell) # 3x3 block of cells
ids = []
for c in cells:
ids += geo_index.get(c)
results = []
for b in business_store.batch_get(ids):
if haversine(lat, lng, b.lat, b.lng) <= radius_km and passes(b, filters):
results.append(b)
return rank(results, lat, lng)
Geohash gives a coarse candidate set (the cells); the exact haversine distance check filters to the true radius. For large radii, query a coarser precision or more rings of neighbors.
Filtering and ranking
def passes(b, f):
return ((not f.category or b.category == f.category)
and (not f.open_now or is_open(b.hours, now()))
and b.avg_rating >= f.min_rating)
def rank(results, lat, lng):
return sorted(results, key=lambda b: -score(b, lat, lng)) # distance + rating + relevance
open_now is computed from hours at query time (not stored), so it stays correct.
Reviews with async aggregation
A new review must not hot-update the business row on every write; aggregate asynchronously:
def add_review(business_id, user, rating, text):
reviews.put(review_id(), {"business": business_id, "user": user,
"rating": rating, "text": text, "ts": now()})
rating_queue.publish({"business": business_id, "rating": rating})
def rating_aggregator(): # batched
for business_id, ratings in grouped(rating_queue.consume_batched()):
b = business_store.get(business_id)
b.review_count += len(ratings)
b.avg_rating = recompute_avg(b, ratings) # running average
business_store.put(business_id, b)
cache.invalidate(business_id)
Caching
- Business details (mostly static) cached aggressively + at the edge.
- Hot search areas (downtowns) — cache popular cell queries briefly.
- Geo index hot cells replicated; dense cities get finer precision to bound candidates per cell.
Scale and failure handling
- Geo index sharded by region; a hot dense cell → finer precision / split.
- Read-heavy → caching + replicas carry it; the geo index is in memory.
- Review write spike → queue + async aggregation absorbs it.
- Border results → neighbor-cell query handles them; exact haversine prunes false positives.
The takeaway
Concrete signals: geohash cell + neighbor proximity queries refined by exact haversine distance, query-time filters (open-now), and async review aggregation. The geospatial index is the reusable workhorse — Uber and DoorDash add real-time moving locations and matching on top of exactly this.