Building a file sync service (Dropbox)

Implement content-defined chunking, the dedup upload that skips chunks the server has, delta sync between devices, and version-vector conflict handling.

Chunking and content addressing

Split a file into chunks and name each by its hash. Content-defined chunking (rolling hash to pick boundaries) keeps most chunks stable when bytes are inserted — unlike fixed chunks, which shift everything after an insert:

def chunk_file(path):
    chunks = []
    for data in content_defined_split(path, target=4*MB):   # Rabin fingerprint boundaries
        h = sha256(data)
        chunks.append((h, data))
    return chunks                      # ordered list of (hash, bytes)

The file’s metadata becomes just the ordered list of chunk hashes.

The dedup upload (only new chunks move)

Before uploading, ask the server which chunk hashes it already has — upload only the missing ones:

def upload(file_path, file_id):
    chunks = chunk_file(file_path)
    hashes = [h for h, _ in chunks]
    missing = metadata.which_missing(hashes)      # server checks its chunk index
    for h, data in chunks:
        if h in missing:
            blob_store.put(h, data)               # content-addressed; dedup automatic
    metadata.commit_version(file_id, hashes)      # new version = new chunk list

which_missing is what gives global dedup: a chunk any user already uploaded is skipped. The same installer from a million users is stored once.

Delta sync to other devices

A notified device reconciles by comparing chunk lists and fetching only what it lacks:

def sync_down(file_id):
    remote = metadata.get_version(file_id)        # list of chunk hashes
    local  = local_index.get(file_id)             # what we already have
    for h in remote.hashes:
        if h not in local.chunks and not have_locally(h):
            data = blob_store.get(h)              # fetch only missing chunks
            store_local(h, data)
    reassemble(file_id, remote.hashes)            # concat chunks in order

Change notification

The client holds a long-poll/WebSocket to a notification service; on a committed version it gets a small “file X changed” event and pulls the new metadata, then runs sync_down. (Reuse the realtime-delivery + notification patterns.)

Conflict handling with version vectors

Each file version carries a version vector (per-device counters). On commit, the server checks causality:

def commit_version(file_id, hashes, device_vv):
    cur = metadata.head(file_id)
    rel = compare(device_vv, cur.vv)
    if rel == "descends":                         # client built on the latest → fast-forward
        metadata.set_head(file_id, hashes, device_vv.merge(cur.vv))
    elif rel == "concurrent":                     # both edited offline → conflict
        metadata.add_conflict_copy(file_id, hashes, device_vv)   # "(conflicted copy)"
    # else stale → tell client to sync first

Keeping a conflicted copy never loses a user’s work — the safe, expected behavior.

Versioning storage cost

Because versions are just lists of chunk hashes, old versions are nearly free — unchanged chunks are shared across versions. Garbage-collect chunks referenced by no version (reference counting on the chunk index).

Scaling and failure handling

Blob store (chunks) scales independently and is heavily deduped.
Metadata DB shards by account; it’s the sync hot path — cache hot file trees.
Resumable uploads — chunks are independent, so a dropped connection resumes from the missing chunks, not the start.
Integrity — verify each chunk’s hash on download (content addressing makes this free).

The takeaway

Concrete signals: content-defined chunking + content addressing for delta sync and global dedup, a which-missing upload so only new bytes move, metadata-as-chunk-lists making versioning cheap, and version-vector conflicts → conflicted copies. Chunking + dedup is the reusable idea behind backup, object storage, and any large-file system.