Chapter Fifteen

Design Google Drive

Building a planet-scale file storage and synchronization service: how bytes flow from a laptop to an object store, how edits ripple to every device a user owns, and how to do all of it without re-uploading what we already have.

Topic
Cloud Storage & Sync

Scope
Architecture + Trade-offs

Slides
13

What does a file service actually owe its users?

Before naming any technology, list the user-visible behaviours the system must guarantee. These five together define the problem.

01

Upload & download

Any file type, sizes from a few KB up to multi-gigabyte videos. Resumable on flaky networks; never lose bytes mid-flight.

02

Sync across devices

An edit on the laptop appears on the phone seconds later. Each client converges to the same view without manual refresh.

03

Share with others

Generate a link, invite by email, assign a role. Permissions propagate to nested folders and respect revocation.

04

Version history

Every save is recoverable. Users can roll back yesterday's accidental delete without contacting support.

05

Non-functional

Strong durability (11 nines), eventual consistency across devices, sub-second metadata reads, bandwidth-efficient sync, and per-user encryption at rest.

A bird's-eye view of the moving parts

Five services collaborate. The client treats the cloud like a remote disk; behind the curtain, metadata and bytes live in different stores so each can be scaled, replicated, and optimised on its own terms.

Cut every file into fixed-size blocks

Treating a 4 GB video as one giant blob is a recipe for retries, timeouts, and wasted bandwidth. Slice it into uniform 4 MB blocks and the problem collapses into many tiny, parallelisable, retriable transfers.

Why fixed-size blocks?

Parallelism. A 1 GB file becomes 250 four-megabyte uploads that can saturate the user's pipe instead of starving on a single TCP stream.

Retriability. If one block fails the integrity check, retry just that block — no need to re-send the other 249.

Streaming. The client can start playing a video as soon as the first few blocks land, instead of waiting for the whole file.

Why 4 MB? Big enough that per-block overhead (HTTP headers, metadata rows) stays small. Small enough that a single failed retry costs little. Dropbox famously picked 4 MB; values from 1–16 MB all work.

parallel resumable streamable

Hash the block, store it exactly once

The same PDF lives in a thousand inboxes. The same npm tarball lives in a million repos. Identical bytes should be stored a single time, globally, across every user.

The trick

Before uploading a block, hash it. Ask the server: do you already have this hash? If yes, skip the byte transfer entirely and just point this user's file at the existing block.

Two layers of dedup

Per-user dedup — a user uploading the same attachment twice pays for it once.
Global dedup — across all tenants, popular blocks (templates, common libraries, identical photos) live in storage exactly once.

The cost

A reference counter per block: when the last file pointing at a block is deleted, the block itself can finally be garbage-collected. Privacy-sensitive deployments often disable cross-tenant dedup so one user can't probe whether another stores a given file.

SHA-256 ref-count GC privacy trade-off

Only re-upload what actually changed

A user edits one paragraph in a 200-page document and hits save. A naive sync re-uploads the whole file. A delta-aware sync re-uploads two blocks. Same correctness, 99% less bandwidth.

The metadata DB is where the file actually lives

Blocks in the object store are anonymous bytes. The metadata DB gives them names, parents, owners, version history, and permissions — everything the user thinks of as "my file."

files indexed on owner_id, parent_id

column	type	notes
`file_id`	uuid	primary key
`owner_id`	uuid	creator
`parent_id`	uuid	folder it sits in
`name`	string	display name
`current_version`	uuid	→ versions.id
`is_deleted`	bool	soft-delete flag

versions append-only

column	type	notes
`version_id`	uuid	primary key
`file_id`	uuid	which file
`block_list`	jsonb / array	ordered hashes
`size_bytes`	int64	full file size
`created_at`	timestamp
`created_by`	uuid	which device

blocks content-addressed

column	type	notes
`block_hash`	string (sha256)	primary key
`size_bytes`	int	actual size
`ref_count`	int	GC when 0
`storage_key`	string	S3 location

permissions access control

column	type	notes
`file_id`	uuid
`principal_id`	uuid	user or group
`role`	enum	viewer / editor / owner
`granted_at`	timestamp

Sharding key: owner_id for files/versions — keeps a user's data on a single shard for fast folder reads. The blocks table is sharded by block_hash prefix since it is global.

Block storage: an object store keyed by content

The bytes themselves go nowhere near a relational database. They land in an object store — S3, GCS, or an in-house equivalent — where the key is not a filename but the cryptographic hash of the contents.

Why content-addressed?

Dedup falls out for free. Two identical blocks map to the same key. A second write is a no-op.

Tamper-evident. Re-hash on read; if it doesn't match the key, the storage layer corrupted it. Self-healing replication can pick a different replica.

Cache-friendly. The hash is the etag. CDNs and clients can cache aggressively because the key never refers to different bytes.

Tiering

Hot tier — SSD-backed, ~ms latency. Recent or frequently-read blocks.
Warm tier — HDD-backed, ~10ms. Older versions, archived files.
Cold tier — tape/Glacier, minutes to restore. Long-retention compliance copies.

Lifecycle policies migrate blocks down the tiers based on last-read timestamp. Reference counts in metadata drive the eventual delete.

3x replicated erasure coded (cold) encrypted at rest

Tell every other device, fast

An upload isn't done when the bytes land — it's done when the user's phone, tablet, and second laptop all know about it. The notification service fans the event out to every subscribed device.

Immutable blocks, mutable pointers

Versioning gets cheap and correct if you commit to one rule: blocks never change. A new version is just a new ordered list of block hashes; old versions still point at their old hashes.

Why immutability wins

Versions become cheap. v2 only stores the delta (one block hash) on top of v1; the unchanged blocks are referenced, not copied.

Time travel is a pointer flip. "Restore yesterday's version" updates a single row in the metadata DB — no byte movement.

Deletes are safe. Removing a version decrements ref counts; blocks survive until truly orphaned.

What about retention?

A retention policy caps how many versions to keep (e.g., last 30 days, last 100 versions). The oldest version row is dropped; its referenced blocks lose a ref-count and become eligible for GC if no other version still needs them.

Edge: rename

A rename touches only files.name — the version chain and the bytes are unaffected. Moving across folders is just a parent_id update.

When two devices edit the same file

Two laptops go offline, both edit budget.xlsx, both reconnect. The server now has to pick — or refuse to pick — a winner. The right answer depends on whether silently dropping work is acceptable.

Strategy A

Last-write-wins (LWW)

How: Tag each upload with a server-assigned timestamp or monotonic version number. Whichever arrives last becomes the current version.

Cost: The losing device's edits are demoted to an older version. They are not lost — version history preserves them — but they no longer appear as the live file.

Good for: documents where users expect a single canonical "latest", and the chance of true simultaneous edits is low (most consumer use cases).

Strategy B

Keep-both (fork)

How: Detect that both clients diverged from the same base version. Instead of choosing, create budget (conflicted copy from Laptop B).xlsx alongside the original.

Cost: Users see two files and must reconcile manually. Annoying — but no edits silently disappear from the file tree.

Good for: spreadsheets, code, anything where merging unseen changes is unsafe. This is Dropbox's classic default.

Detecting the conflict at all

Every upload carries the parent_version_id it was derived from. If a client tries to commit v3 on top of v2 but the server has already advanced to v3', the server knows the client's parent is stale. That's the trigger — without it, both strategies collapse to "blindly overwrite."

vector clock (per-device counters) CRDTs for rich-text docs operational transform for live editing

For real-time collaborative editing (Docs, Sheets), file-level conflict resolution gives way to operation-level merging — every keystroke is an op that gets transformed against concurrent ops. That's a different problem and lives one layer above the storage system.

Sharing, links, and who is allowed to do what

Permissions are the most subtle correctness problem in a file system. A single bad row leaks private data; an over-eager cache leaks stale private data. Get the model right before scaling it.

Two share modes

Per-principal — "Bob can edit this folder." Stored as a row in permissions tying a user (or group) to a role.
Link-based — "Anyone with this URL can view." The URL embeds an opaque, unguessable token (e.g., 128-bit random) that the server looks up. Revoking a link rotates or invalidates the token.

Roles, smallest set that works

Viewer — read & download.
Commenter — viewer + leave comments.
Editor — read + write + create new versions.
Owner — editor + reshare + delete + change permissions.

Inheritance

Folder-level grants cascade to children, but explicit grants on a child can extend (not narrow) — preventing a confusing situation where a user has access to a folder but mysteriously lacks access to a file inside it.

The check, on every request

can_access(user, file, action):
  1. walk file → parent → ... → root
  2. collect grants where principal ∈ user's groups
  3. find highest role; check role ⊇ action
  4. deny otherwise

This walk happens on every metadata read, so the path-to-grants index has to be hot. A common pattern: denormalise an effective_acl column on each file, recomputed when ancestor grants change.

Audit & revocation

Every grant change is logged with actor, target, before, and after. Revocation is immediate for token-based checks but takes up to the cache TTL for denormalised ACLs — usually seconds, which is acceptable for files that aren't actively under attack.

RBAC capability URLs cascading grants audit log

Six principles that shape a file-sync system

Strip away the specifics and the same handful of ideas keep returning. Each is a lever you can adjust to trade bandwidth, latency, durability, or privacy against one another.

01

Separate bytes from metadata

The object store holds anonymous blocks; the metadata DB holds the names, trees, and ACLs. Each scales on its own axis.

02

Chunk first, hash always

Fixed-size blocks plus content-addressing turn one hard problem (giant file transfers) into many easy ones.

03

Never move bytes you can avoid

Dedup eliminates duplicate writes; delta sync eliminates redundant uploads; pointer-flips eliminate copies.

04

Make blocks immutable

Cheap versions, safe deletes, simple caching, and bullet-proof integrity all fall out of one rule.

05

Push is an optimisation, cursors are correctness

Every client should converge on reconnect via a monotonic cursor; real-time notifications only make convergence faster.

06

Decide your conflict policy up front

Last-write-wins is simple but can hide work; keep-both is safer but noisier. Pick deliberately, document loudly.