Building a planet-scale file storage and synchronization service: how bytes flow from a laptop to an object store, how edits ripple to every device a user owns, and how to do all of it without re-uploading what we already have.
Before naming any technology, list the user-visible behaviours the system must guarantee. These five together define the problem.
Any file type, sizes from a few KB up to multi-gigabyte videos. Resumable on flaky networks; never lose bytes mid-flight.
An edit on the laptop appears on the phone seconds later. Each client converges to the same view without manual refresh.
Generate a link, invite by email, assign a role. Permissions propagate to nested folders and respect revocation.
Every save is recoverable. Users can roll back yesterday's accidental delete without contacting support.
Strong durability (11 nines), eventual consistency across devices, sub-second metadata reads, bandwidth-efficient sync, and per-user encryption at rest.
Five services collaborate. The client treats the cloud like a remote disk; behind the curtain, metadata and bytes live in different stores so each can be scaled, replicated, and optimised on its own terms.
Treating a 4 GB video as one giant blob is a recipe for retries, timeouts, and wasted bandwidth. Slice it into uniform 4 MB blocks and the problem collapses into many tiny, parallelisable, retriable transfers.
Parallelism. A 1 GB file becomes 250 four-megabyte uploads that can saturate the user's pipe instead of starving on a single TCP stream.
Retriability. If one block fails the integrity check, retry just that block — no need to re-send the other 249.
Streaming. The client can start playing a video as soon as the first few blocks land, instead of waiting for the whole file.
Why 4 MB? Big enough that per-block overhead (HTTP headers, metadata rows) stays small. Small enough that a single failed retry costs little. Dropbox famously picked 4 MB; values from 1–16 MB all work.
The same PDF lives in a thousand inboxes. The same npm tarball lives in a million repos. Identical bytes should be stored a single time, globally, across every user.
Before uploading a block, hash it. Ask the server: do you already have this hash? If yes, skip the byte transfer entirely and just point this user's file at the existing block.
A reference counter per block: when the last file pointing at a block is deleted, the block itself can finally be garbage-collected. Privacy-sensitive deployments often disable cross-tenant dedup so one user can't probe whether another stores a given file.
A user edits one paragraph in a 200-page document and hits save. A naive sync re-uploads the whole file. A delta-aware sync re-uploads two blocks. Same correctness, 99% less bandwidth.
Blocks in the object store are anonymous bytes. The metadata DB gives them names, parents, owners, version history, and permissions — everything the user thinks of as "my file."
| column | type | notes |
|---|---|---|
file_id | uuid | primary key |
owner_id | uuid | creator |
parent_id | uuid | folder it sits in |
name | string | display name |
current_version | uuid | → versions.id |
is_deleted | bool | soft-delete flag |
| column | type | notes |
|---|---|---|
version_id | uuid | primary key |
file_id | uuid | which file |
block_list | jsonb / array | ordered hashes |
size_bytes | int64 | full file size |
created_at | timestamp | |
created_by | uuid | which device |
| column | type | notes |
|---|---|---|
block_hash | string (sha256) | primary key |
size_bytes | int | actual size |
ref_count | int | GC when 0 |
storage_key | string | S3 location |
| column | type | notes |
|---|---|---|
file_id | uuid | |
principal_id | uuid | user or group |
role | enum | viewer / editor / owner |
granted_at | timestamp |
Sharding key: owner_id for files/versions — keeps a user's data on a single shard for fast folder reads. The blocks table is sharded by block_hash prefix since it is global.
The bytes themselves go nowhere near a relational database. They land in an object store — S3, GCS, or an in-house equivalent — where the key is not a filename but the cryptographic hash of the contents.
Dedup falls out for free. Two identical blocks map to the same key. A second write is a no-op.
Tamper-evident. Re-hash on read; if it doesn't match the key, the storage layer corrupted it. Self-healing replication can pick a different replica.
Cache-friendly. The hash is the etag. CDNs and clients can cache aggressively because the key never refers to different bytes.
Lifecycle policies migrate blocks down the tiers based on last-read timestamp. Reference counts in metadata drive the eventual delete.
An upload isn't done when the bytes land — it's done when the user's phone, tablet, and second laptop all know about it. The notification service fans the event out to every subscribed device.
Versioning gets cheap and correct if you commit to one rule: blocks never change. A new version is just a new ordered list of block hashes; old versions still point at their old hashes.
Versions become cheap. v2 only stores the delta (one block hash) on top of v1; the unchanged blocks are referenced, not copied.
Time travel is a pointer flip. "Restore yesterday's version" updates a single row in the metadata DB — no byte movement.
Deletes are safe. Removing a version decrements ref counts; blocks survive until truly orphaned.
A retention policy caps how many versions to keep (e.g., last 30 days, last 100 versions). The oldest version row is dropped; its referenced blocks lose a ref-count and become eligible for GC if no other version still needs them.
A rename touches only files.name — the version chain and the bytes are unaffected. Moving across folders is just a parent_id update.
Two laptops go offline, both edit budget.xlsx, both reconnect. The
server now has to pick — or refuse to pick — a winner. The right answer
depends on whether silently dropping work is acceptable.
How: Tag each upload with a server-assigned timestamp or monotonic version number. Whichever arrives last becomes the current version.
Cost: The losing device's edits are demoted to an older version. They are not lost — version history preserves them — but they no longer appear as the live file.
Good for: documents where users expect a single canonical "latest", and the chance of true simultaneous edits is low (most consumer use cases).
How: Detect that both clients diverged from the same base version. Instead of choosing, create budget (conflicted copy from Laptop B).xlsx alongside the original.
Cost: Users see two files and must reconcile manually. Annoying — but no edits silently disappear from the file tree.
Good for: spreadsheets, code, anything where merging unseen changes is unsafe. This is Dropbox's classic default.
Every upload carries the parent_version_id it was derived from. If a client tries to commit v3 on top of v2 but the server has already advanced to v3', the server knows the client's parent is stale. That's the trigger — without it, both strategies collapse to "blindly overwrite."
For real-time collaborative editing (Docs, Sheets), file-level conflict resolution gives way to operation-level merging — every keystroke is an op that gets transformed against concurrent ops. That's a different problem and lives one layer above the storage system.
Permissions are the most subtle correctness problem in a file system. A single bad row leaks private data; an over-eager cache leaks stale private data. Get the model right before scaling it.
permissions tying a user (or group) to a role.Folder-level grants cascade to children, but explicit grants on a child can extend (not narrow) — preventing a confusing situation where a user has access to a folder but mysteriously lacks access to a file inside it.
can_access(user, file, action):
1. walk file → parent → ... → root
2. collect grants where principal ∈ user's groups
3. find highest role; check role ⊇ action
4. deny otherwise
This walk happens on every metadata read, so the path-to-grants index has to be hot. A common pattern: denormalise an effective_acl column on each file, recomputed when ancestor grants change.
Every grant change is logged with actor, target, before, and after. Revocation is immediate for token-based checks but takes up to the cache TTL for denormalised ACLs — usually seconds, which is acceptable for files that aren't actively under attack.
Strip away the specifics and the same handful of ideas keep returning. Each is a lever you can adjust to trade bandwidth, latency, durability, or privacy against one another.
The object store holds anonymous blocks; the metadata DB holds the names, trees, and ACLs. Each scales on its own axis.
Fixed-size blocks plus content-addressing turn one hard problem (giant file transfers) into many easy ones.
Dedup eliminates duplicate writes; delta sync eliminates redundant uploads; pointer-flips eliminate copies.
Cheap versions, safe deletes, simple caching, and bullet-proof integrity all fall out of one rule.
Every client should converge on reconnect via a monotonic cursor; real-time notifications only make convergence faster.
Last-write-wins is simple but can hide work; keep-both is safer but noisier. Pick deliberately, document loudly.