Chapter 14 · System Design

Design YouTube

A walk through the architecture behind a planet-scale video-sharing service — how raw uploads are turned into adaptive streams, how those bytes travel through storage tiers and a global CDN, and how metadata like views and comments stays consistent while the video files themselves live somewhere very different.

Slide 02 · Scope

What we are actually building

Before any boxes-and-arrows, pin down what the system must do and what it must survive. We will treat the workload as read-heavy, write-light, and globally distributed.

Functional

Creators upload a video file from web or mobile.
Viewers watch on phones, laptops, TVs and slow connections.
Comments, likes, subscriptions and basic social state.
Personalised home and "up next" recommendations.
Search by title, channel and topic.

Out of scope (today)

Live streaming, monetisation flows, copyright matching, and studio-grade editing tools. Each deserves its own design.

Non-functional

Billions of registered viewers, hundreds of millions active daily.
Playback start under two seconds at the 95th percentile.
Eleven nines of durability for stored masters.
Reads dominate writes by roughly two orders of magnitude.
Graceful degradation: a comment outage must not stop playback.

Workload shape

Many short watches, very few uploads. The hot head (last 24 hours) drives almost all bandwidth; the long tail dominates storage.

Slide 03 · Back-of-envelope

How big is "big" here?

Numbers below are rough sketches to size the system — they are not benchmarks of any real service. The point is to find the order of magnitude that the architecture must respect.

~720 h

New uploads / minute

Across all creators. Each minute of source video averages around 120 MB after capture.

~7.5 PB

Raw ingest / day

720 h × 60 min/h × 120 MB/min × 1,440 min/day ≈ 7.5 PB of raw source per day, before encoding. Renditions add roughly another 1×.

~3.5 EB

Net storage / year

After dedup, garbage collection of failed uploads, and compression of long-tail content into cheaper codecs.

~1 B

Daily playback sessions

Average session pulls 4 Mbps for 8 minutes ≈ 240 MB egress per session at typical HD quality.

~240 PB

Egress / day from edge

Most of this never touches the origin — the CDN absorbs well above 95% of bytes.

~22 Tbps

Average global bandwidth

240 PB/day ÷ 86,400 s ≈ 22 Tbps averaged over the day. Diurnal peaks land in the evening of each large timezone and sit above that average; we design for the union of those peaks.
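The estimates on this slide can be rechecked in a few lines. The sketch below only recomputes the figures from the inputs quoted above (720 h/min, 120 MB per video-minute, 1 B sessions, 4 Mbps, 8 minutes); the variable names are illustrative, and decimal units (1 PB = 10^15 bytes) are an assumption.

```python
# Back-of-envelope sizing; every input is one of the rough figures quoted above.
MB = 10**6
PB = 10**15

upload_hours_per_minute = 720      # hours of new video per wall-clock minute
source_mb_per_video_minute = 120   # MB per minute of source video
sessions_per_day = 1_000_000_000   # daily playback sessions
session_mbps = 4                   # average bitrate per session
session_minutes = 8                # average session length

# Ingest: video-minutes uploaded per wall-clock minute, scaled to a full day.
video_minutes_per_minute = upload_hours_per_minute * 60
raw_ingest_bytes_per_day = (
    video_minutes_per_minute * source_mb_per_video_minute * MB * 1440
)

# Egress: megabits to megabytes (divide by 8), then scale by sessions.
session_bytes = session_mbps * session_minutes * 60 / 8 * MB
egress_bytes_per_day = session_bytes * sessions_per_day

# Bandwidth implied by spreading the daily egress evenly over 86,400 seconds.
average_tbps = egress_bytes_per_day * 8 / 86_400 / 10**12

print(f"raw ingest/day : {raw_ingest_bytes_per_day / PB:.2f} PB")
print(f"egress/session : {session_bytes / MB:.0f} MB")
print(f"egress/day     : {egress_bytes_per_day / PB:.0f} PB")
print(f"average egress : {average_tbps:.1f} Tbps")
```

With these inputs the script reports about 7.46 PB of raw ingest per day, 240 MB per session, 240 PB of daily egress, and 22.2 Tbps when that egress is averaged over the day.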

Slide 04 · The big picture

High-level architecture

Two planes run side by side. A control plane carries metadata (titles, captions, comments, view counts), while a separate blob plane moves the actual pixels through encoders, storage tiers and the CDN.

Slide 05 · Ingest

Upload flow — chunked, resumable, idempotent

An hour-long 4K file is many gigabytes traveling over flaky hotel Wi-Fi. Treat the upload as many small commitments, not one big bet, and let the client retry without re-sending what already landed.

Step by step

Init — client calls the upload service, receives an uploadId plus pre-signed chunk URLs.
Chunk — file is split into ~8 MB parts, each uploaded directly to object storage with its own SHA-256.
Resume — on retry the client asks "which offsets did you receive?" and only re-sends gaps.
Commit — server-side multipart finalisation atomically stitches parts into one immutable object.
Notify — a message lands on the transcoding queue; the upload API returns a watch-later placeholder.

Why it matters

Idempotent uploads survive flaky links, and direct-to-blob transfers keep the API tier out of the data path.
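The chunk, resume and commit steps can be sketched against an in-memory stand-in for object storage. Everything here is hypothetical: `FakeObjectStore`, its methods and the `fail_on` parameter are illustrative names, parts are tracked by index rather than byte offset, and pre-signed URLs and the transcoding queue are left out.

```python
import hashlib

CHUNK_SIZE = 8 * 1024 * 1024  # ~8 MB parts, as in the flow above


class FakeObjectStore:
    """In-memory stand-in for object storage with multipart uploads."""

    def __init__(self):
        self.parts = {}    # (upload_id, part_no) -> bytes
        self.objects = {}  # upload_id -> committed bytes

    def put_part(self, upload_id, part_no, data, sha256_hex):
        # Reject corrupted parts; re-sending an identical part is harmless.
        if hashlib.sha256(data).hexdigest() != sha256_hex:
            raise ValueError(f"checksum mismatch on part {part_no}")
        self.parts[(upload_id, part_no)] = data

    def received_parts(self, upload_id):
        return {n for (uid, n) in self.parts if uid == upload_id}

    def commit(self, upload_id, part_count):
        missing = set(range(part_count)) - self.received_parts(upload_id)
        if missing:
            raise RuntimeError(f"missing parts: {sorted(missing)}")
        self.objects[upload_id] = b"".join(
            self.parts[(upload_id, n)] for n in range(part_count)
        )


def upload(store, upload_id, data, size=CHUNK_SIZE, fail_on=frozenset()):
    """Send only parts the store has not acknowledged; return (sent, total)."""
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    already = store.received_parts(upload_id)
    sent = 0
    for part_no, chunk in enumerate(chunks):
        if part_no in already:
            continue  # resume: skip what already landed
        if part_no in fail_on:
            continue  # simulate a dropped connection for this part
        digest = hashlib.sha256(chunk).hexdigest()
        store.put_part(upload_id, part_no, chunk, digest)
        sent += 1
    return sent, len(chunks)


# Demo with tiny chunks: the first attempt loses two parts, the retry fills gaps.
store = FakeObjectStore()
video = bytes(range(256)) * 100                      # 25,600 bytes
first = upload(store, "u1", video, size=4096, fail_on={2, 5})
second = upload(store, "u1", video, size=4096)
store.commit("u1", part_count=second[1])
```

In this demo the file splits into seven parts; the first attempt lands five, the retry sends only the two gaps, and the committed object equals the original bytes.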

Slide 06 · Encoding

Transcoding pipeline as a DAG

A raw master fans out into many derived artifacts. Each node is a pure function over its inputs; missing outputs are simply rebuilt by the scheduler.
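One way to picture "missing outputs are simply rebuilt" is a scheduler that walks the graph in dependency order and runs only the nodes whose artifacts are absent. The node names and the `run_node` stand-in below are hypothetical; the ordering uses Python's standard `graphlib` module (3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: node -> set of nodes it depends on.
PIPELINE = {
    "probe": {"master"},
    "audio_aac": {"probe"},
    "video_1080p": {"probe"},
    "video_480p": {"probe"},
    "thumbnails": {"probe"},
    "package": {"audio_aac", "video_1080p", "video_480p"},
}


def run_node(name, inputs):
    """Stand-in for real work: a deterministic function of the inputs."""
    return f"{name}({','.join(inputs[d] for d in sorted(inputs))})"


def build(artifacts):
    """Rebuild whatever is missing from `artifacts`, in dependency order."""
    rebuilt = []
    for node in TopologicalSorter(PIPELINE).static_order():
        if node in artifacts:
            continue  # output already exists; nothing to do
        if node not in PIPELINE:
            # A source such as the master cannot be derived from anything.
            raise KeyError(f"source artifact {node!r} is missing")
        inputs = {dep: artifacts[dep] for dep in PIPELINE[node]}
        artifacts[node] = run_node(node, inputs)
        rebuilt.append(node)
    return rebuilt


artifacts = {"master": "master.mov"}
first_run = build(artifacts)     # builds every derived artifact
del artifacts["video_480p"]      # simulate a lost rendition
second_run = build(artifacts)    # rebuilds only what is missing
```

Because each node is a pure function of its inputs, the second run recreates the lost rendition without touching the artifacts that depend on it.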

Slide 07 · Why so many copies?

One source, many renditions

A flagship phone on a fibre link and a five-year-old tablet on crowded café Wi-Fi cannot use the same file. The transcoder builds a ladder of versions; the player picks the rung that fits right now.

Devices vary wildly

Codec support, screen size, decoder power, RAM and DRM envelope all differ. We can't ship one universal artifact.

Networks vary even more

Throughput swings inside a single watch session — elevator, subway, kitchen, café. Static quality forces either rebuffering or wasted bytes.

Newer codecs save bytes

AV1 and VP9 cut bandwidth by roughly 30–50% relative to H.264, but decoding them in software is too expensive for low-end devices that lack hardware support. We ship multiple codecs so each device picks its best trade-off.

Slide 08 · Cost shape

Hot, warm, cold — long tail to deep archive

A tiny fraction of videos receive almost all views in their first week. Everything else becomes long-tail, watched rarely but kept forever. The storage cost equation breaks unless we move bytes with their access pattern.

Promotion / demotion

An access-frequency tracker watches reads per object per hour. Crossing thresholds triggers async copy up or down a tier. The viewer never blocks on this.
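A minimal sketch of the threshold check, assuming three tiers and reads-per-hour cut-offs that are invented purely for illustration. The gap between the promote and demote values is an added assumption, used so an object hovering near a boundary does not bounce between tiers.

```python
TIERS = ["cold", "warm", "hot"]  # ordered cheapest to fastest

# Hypothetical thresholds in reads per hour.
PROMOTE_AT = {"cold": 10, "warm": 1000}  # move up when reads/hour >= value
DEMOTE_AT = {"hot": 200, "warm": 2}      # move down when reads/hour < value


def next_tier(current, reads_per_hour):
    """Return the tier an object should live in after the latest hourly count."""
    if current in PROMOTE_AT and reads_per_hour >= PROMOTE_AT[current]:
        return TIERS[TIERS.index(current) + 1]
    if current in DEMOTE_AT and reads_per_hour < DEMOTE_AT[current]:
        return TIERS[TIERS.index(current) - 1]
    return current
```

A result that differs from the current tier would enqueue an asynchronous copy; reads keep being served from the existing tier until that copy completes.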

Re-encode the tail

Long-tail videos quietly get re-encoded into more efficient codecs and lower max-rungs to shrink their footprint. The master copy stays in archive.

Eleven nines on the master

Renditions are derived data and can always be rebuilt from the master. We pay top-tier durability only on the source.

Slide 09 · How bytes reach the player

HLS & DASH — manifests over plain HTTP

We do not stream one giant file. Each rendition is sliced into short segments (2–6 seconds each), and a manifest tells the player where every segment lives.

Why segment?

Small files cache cleanly at every layer, allow byte-range fetches, and let the player change quality between segments without reloading the whole stream.

HLS vs DASH

HLS uses a master .m3u8 playlist that points at one media playlist per rendition; DASH describes every rendition inside a single .mpd manifest. Modern packagers emit both side by side.

Plain HTTP wins

Every request is a GET for a static object: easy to cache, easy to scale, and friendly to existing CDN infrastructure.
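To make the manifest idea concrete, the sketch below generates a small HLS master playlist and one media playlist as plain text. The ladder values, file names and 4-second segment length are assumptions; only standard HLS tags are used, and a real packager emits many more attributes (codecs, frame rate, audio groups).

```python
import math

# Hypothetical ladder: (name, peak bits per second, width, height).
LADDER = [
    ("240p", 400_000, 426, 240),
    ("720p", 2_800_000, 1280, 720),
    ("1080p", 5_000_000, 1920, 1080),
]


def master_playlist(ladder):
    """List every rendition with the bandwidth the player needs to sustain it."""
    lines = ["#EXTM3U"]
    for name, bandwidth, width, height in ladder:
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={width}x{height}"
        )
        lines.append(f"{name}/playlist.m3u8")
    return "\n".join(lines) + "\n"


def media_playlist(duration_s, segment_s=4.0):
    """List the segments of one rendition of a finished (VOD) video."""
    count = math.ceil(duration_s / segment_s)
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{math.ceil(segment_s)}",
        "#EXT-X-MEDIA-SEQUENCE:0",
        "#EXT-X-PLAYLIST-TYPE:VOD",
    ]
    for i in range(count):
        # The final segment may be shorter than the target length.
        length = min(segment_s, duration_s - i * segment_s)
        lines.append(f"#EXTINF:{length:.3f},")
        lines.append(f"seg_{i:05d}.ts")
    lines.append("#EXT-X-ENDLIST")
    return "\n".join(lines) + "\n"


print(master_playlist(LADDER))
print(media_playlist(10.0))
```

Both outputs are static text files, so they are cached and served the same way as the segments they point to.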

Slide 10 · The player is half the system

Adaptive bitrate selection

Quality decisions happen on the device, segment by segment. The server just exposes the menu; the client orders the dish that fits its current appetite.

How the player decides

Times each segment download and updates a rolling bandwidth estimate.
Watches its own playback buffer — low buffer means "drop quality now, recover later".
Adds hysteresis so it does not flap on every wobble.
Considers device hints — battery, decoder load, screen resolution.

Why client-side?

The device sees real-time conditions the server cannot — radio strength, contention, throttling — and reacts within a single segment.
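The decision loop can be sketched as a small controller combining a rolling bandwidth estimate, a buffer check and hysteresis. This is one simple heuristic under stated assumptions, not how any particular player works: the ladder, the EWMA weight, the safety factor and the step-up margin are all invented for illustration, and device hints are not modelled.

```python
LADDER_KBPS = [400, 1200, 2800, 5000]  # hypothetical rungs, lowest first


class AbrController:
    def __init__(self, alpha=0.3, safety=0.8, low_buffer_s=5.0, up_margin=1.25):
        self.alpha = alpha              # EWMA weight given to the newest sample
        self.safety = safety            # spend only this share of the estimate
        self.low_buffer_s = low_buffer_s
        self.up_margin = up_margin      # extra headroom required to step up
        self.estimate_kbps = None
        self.rung = 0                   # start at the lowest quality

    def observe(self, segment_bits, download_seconds):
        """Fold one timed segment download into the rolling estimate."""
        sample = segment_bits / download_seconds / 1000
        if self.estimate_kbps is None:
            self.estimate_kbps = sample
        else:
            self.estimate_kbps = (
                self.alpha * sample + (1 - self.alpha) * self.estimate_kbps
            )

    def choose(self, buffer_seconds):
        """Pick the rung index for the next segment."""
        if self.estimate_kbps is None:
            return self.rung
        if buffer_seconds < self.low_buffer_s:
            # Low buffer: drop quality now, recover later.
            self.rung = max(0, self.rung - 1)
            return self.rung
        budget = self.estimate_kbps * self.safety
        fit = max(
            (i for i, kbps in enumerate(LADDER_KBPS) if kbps <= budget),
            default=0,
        )
        if fit < self.rung:
            self.rung = fit             # step down as far as needed
        elif fit > self.rung and LADDER_KBPS[self.rung + 1] * self.up_margin <= budget:
            self.rung += 1              # hysteresis: climb one rung, with headroom
        return self.rung


abr = AbrController()
abr.observe(segment_bits=16_000_000, download_seconds=2.0)  # about 8,000 kbps
climb = [abr.choose(buffer_seconds=20.0) for _ in range(3)]
after_stall_risk = abr.choose(buffer_seconds=2.0)
```

Stepping down immediately but climbing one rung at a time is a deliberate asymmetry: a rebuffer is assumed to hurt more than a few segments at lower quality.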

Slide 11 · The last mile

CDN — edge POPs, mid-tier shield, origin

A three-layer cache hierarchy puts bytes milliseconds away from every viewer while protecting origin storage from a thundering herd. The trick is collapsing concurrent cache misses for the same object so the origin sees one request, not a million.
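Miss collapsing, often called request coalescing, can be sketched with a lock and one event per in-flight key. The class and function names are hypothetical, the cache never evicts, and a real shield tier would do this per node across a network rather than across threads in one process.

```python
import threading
import time


class CoalescingCache:
    """Concurrent misses for the same key share a single origin fetch."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._lock = threading.Lock()
        self._cache = {}      # key -> bytes
        self._inflight = {}   # key -> Event set when the fetch finishes

    def get(self, key):
        with self._lock:
            if key in self._cache:
                return self._cache[key]
            done = self._inflight.get(key)
            leader = done is None
            if leader:
                done = self._inflight[key] = threading.Event()
        if leader:
            try:
                value = self._fetch(key)       # the only trip to the origin
                with self._lock:
                    self._cache[key] = value
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()                     # wake every waiting follower
            return value
        done.wait()
        # Followers retry: a cache hit, or a new leader if the fetch failed.
        return self.get(key)


origin_calls = []


def slow_origin(key):
    origin_calls.append(key)
    time.sleep(0.05)                           # pretend the origin is far away
    return f"bytes-for-{key}".encode()


cache = CoalescingCache(slow_origin)
results = []
threads = [
    threading.Thread(target=lambda: results.append(cache.get("seg_00001.ts")))
    for _ in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In the demo, fifty concurrent requests for the same segment result in one call to `slow_origin`; the other forty-nine wait on the event and are then served from the cache.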

Slide 12 · Two databases, two destinies

Metadata and blobs live apart

Titles, comments, view counts and recommendations are small and accessed via complex queries. Video bytes are huge and accessed by primary key only. Mixing them is the fastest way to a crashed database.

Why split?

Different access patterns, different durability budgets, different scaling axes. Metadata grows with users; blobs grow with hours of video.

The link is just a key

The metadata row stores an object path. Replacing the blob backend later is a one-table migration.

Failure isolation

A metadata outage might hide titles, but the CDN keeps serving bytes to ongoing watches. Each plane can scale and fail on its own clock.

Slide 13 · Counting at scale

View counts — async aggregation

A trending video can attract a million plays per minute. Writing one row per view into a relational table will incinerate the database. Counts are eventually-consistent metrics, not transactions.
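A minimal sketch of the aggregation idea: buffer view events in memory and flush one increment per video instead of one row per view. The `ViewAggregator` class and its `flush_every` parameter are hypothetical, and a plain dict stands in for the durable counter table. A production pipeline would more likely consume events from a durable log so that a crash does not lose the buffer.

```python
from collections import Counter


class ViewAggregator:
    """Buffers view events and writes one increment per video on flush."""

    def __init__(self, store, flush_every=1000):
        self.store = store              # stand-in for the counter table
        self.flush_every = flush_every  # buffered events that trigger a flush
        self.pending = Counter()
        self.buffered = 0

    def record(self, video_id):
        self.pending[video_id] += 1
        self.buffered += 1
        if self.buffered >= self.flush_every:
            self.flush()

    def flush(self):
        """Apply buffered deltas; return how many store writes that took."""
        writes = len(self.pending)
        for video_id, delta in self.pending.items():
            self.store[video_id] = self.store.get(video_id, 0) + delta
        self.pending.clear()
        self.buffered = 0
        return writes

    def approximate_count(self, video_id):
        # Fast answer now: the durable total plus whatever is still buffered.
        return self.store.get(video_id, 0) + self.pending[video_id]


store = {}
agg = ViewAggregator(store, flush_every=1000)
for _ in range(2500):
    agg.record("cat-video")
durable_before = store["cat-video"]
shown = agg.approximate_count("cat-video")
agg.flush()
durable_after = store["cat-video"]
```

Here 2,500 views cost three writes to the store rather than 2,500. The displayed count stays current through the in-memory delta, while the durable total lags until the next flush.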

Slide 14 · Closing

Six principles to take with you

A video platform is many systems wearing a trench coat. The patterns below repeat at every layer — keep them in mind when sketching the next read-heavy, blob-heavy service.

1

Separate the byte plane from the control plane

Metadata and blobs scale and fail on different curves. Glue them with a key, not a join.

2

Treat heavy work as a DAG of small, retryable jobs

Transcoding, packaging and reconciliation are all idempotent steps whose outputs can be rebuilt on demand.

3

Push intelligence to the edges

The CDN absorbs the bytes, the player picks the quality, the origin only sees the misses that survive both.

4

Move bytes with their access pattern

Hot at the edge, warm regionally, cold in deep storage. Pay top-tier durability only on the master.

5

Counts are streams, not transactions

Views, likes and trending are async aggregates. Show a fast approximation now, reconcile the truth later.

6

Design every component to fail without taking the watch with it

Comments down, recommendations down, search down — playback must keep rolling.