Chapter 14 · System Design

Design YouTube

A walk through the architecture behind a planet-scale video-sharing service — how raw uploads are turned into adaptive streams, how those bytes travel through storage tiers and a global CDN, and how metadata like views and comments stays consistent while the video files themselves live somewhere very different.

Slide 02 · Scope

What we are actually building

Before any boxes-and-arrows, pin down what the system must do and what it must survive. We will treat the workload as read-heavy, write-light, and globally distributed.

Functional

  • Creators upload a video file from web or mobile.
  • Viewers watch on phones, laptops, TVs and slow connections.
  • Comments, likes, subscriptions and basic social state.
  • Personalised home and "up next" recommendations.
  • Search by title, channel and topic.

Out of scope (today)

Live streaming, monetisation flows, copyright matching, and studio-grade editing tools. Each deserves its own design.

Non-functional

  • Billions of registered viewers, hundreds of millions active daily.
  • Playback start under two seconds at the 95th percentile.
  • Eleven nines of durability for stored masters.
  • Reads dominate writes by roughly two orders of magnitude.
  • Graceful degradation: a comment outage must not stop playback.

Workload shape

Many short watches, very few uploads. The hot tail (last 24 hours) drives almost all bandwidth; the long tail dominates storage.

Slide 03 · Back-of-envelope

How big is "big" here?

Numbers below are rough sketches to size the system — they are not benchmarks of any real service. The point is to find the order of magnitude that the architecture must respect.

~720 h
New uploads / minute
Across all creators. Each minute of source video averages around 120 MB after capture.
~5 PB
Raw ingest / day
720 hours × 60 min × 120 MB × 1440 ≈ 7.5 PB raw before encoding. Renditions add roughly another 1×.
~3.5 EB
Net storage / year
After dedup, garbage collection of failed uploads, and compression of long-tail content into cheaper codecs.
~1 B
Daily playback sessions
Average session pulls 4 Mbps for 8 minutes ≈ 240 MB egress per session at typical HD quality.
~240 PB
Egress / day from edge
Most of this never touches the origin — the CDN absorbs well above 95% of bytes.
~22 Tbps
Peak global bandwidth
Diurnal peaks land in the evening of each large timezone. We design for the union of those peaks.
Slide 04 · The big picture

High-level architecture

Two data planes run side by side. A control plane carries metadata — titles, captions, comments, view counts — while a separate blob plane moves the actual pixels through encoders, storage tiers and the CDN.

Creator uploader Viewer player API Gateway auth · rate · routing Upload Service resumable chunks Raw Blob Store S3-class object storage Transcoder DAG renditions · thumbs Rendition Store HLS/DASH segments Streaming Svc manifests · tokens Global CDN edge POPs Metadata DB titles · users · views control / metadata path blob / video path
Slide 05 · Ingest

Upload flow — chunked, resumable, idempotent

An hour-long 4K file is gigabytes of bytes traveling over a flaky hotel Wi-Fi. Treat the upload as many small commitments, not one big bet, and let the client retry without re-sending what already landed.

Client splits file into chunks Upload Svc signs URLs tracks offsets Blob Store multipart parts + checksum Upload State uploadId · parts resumable cursor Job Q notify transcoder ① init ② parts ③ cursor ④ commit

Step by step

  • Init — client calls the upload service, receives an uploadId plus pre-signed chunk URLs.
  • Chunk — file is split into ~8 MB parts, each uploaded directly to object storage with its own SHA-256.
  • Resume — on retry the client asks "which offsets did you receive?" and only re-sends gaps.
  • Commit — server-side multipart finalisation atomically stitches parts into one immutable object.
  • Notify — a message lands on the transcoding queue; the upload API returns a watch-later placeholder.

Why it matters

Idempotent uploads survive flaky links, and direct-to-blob transfers keep the API tier out of the data path.

Slide 06 · Encoding

Transcoding pipeline as a DAG

A raw master fans out into many derived artifacts. Each node is a pure function over its inputs; missing outputs are simply rebuilt by the scheduler.

Raw Master ProRes / H.264 Probe duration · fps Chunker GOP-aligned Encode 240p · H.264 Encode 720p · H.264 Encode 1080p · AV1 Encode 1080p · VP9 Encode 4K · AV1 Thumbs · Captions · Preview Packager HLS · DASH Manifest Build .m3u8 / .mpd Publish flip "ready" flag
Slide 07 · Why so many copies?

One source, many renditions

A flagship phone on a fibre link and a five-year-old tablet on a crowded café Wi-Fi cannot use the same file. The transcoder builds a ladder of versions; the player picks the rung that fits right now.

Adaptive bitrate ladder 2160p · AV1 ~16 Mbps 1440p · VP9 ~9 Mbps 1080p · H.264 ~5 Mbps 720p · H.264 ~2.5 Mbps 480p · H.264 ~1 Mbps 360p · H.264 ~0.5 Mbps 240p · H.264 ~0.2 Mbps Player picks the highest rung that fits its current bandwidth. low BW high BW

Devices vary wildly

Codec support, screen size, decoder power, RAM and DRM envelope all differ. We can't ship one universal artifact.

Networks vary even more

Throughput swings inside a single watch session — elevator, subway, kitchen, café. Static quality forces either rebuffering or wasted bytes.

Newer codecs save bytes

AV1 and VP9 cut bandwidth ~30–50% over H.264 but are too slow for low-end CPUs. We ship multiple codecs so each device picks its best trade-off.

Slide 08 · Cost shape

Hot, warm, cold — long tail to deep archive

A tiny fraction of videos receive almost all views in their first week. Everything else becomes long-tail, watched rarely but kept forever. The storage cost equation breaks unless we move bytes with their access pattern.

HOT SSD / RAM cache at edge trending, last 24 h, sub-ms reads WARM Regional object storage popular renditions, < 100 ms reads COLD Multi-region erasure-coded long-tail watches, seconds to fetch ARCHIVE masters + rare originals · minutes to restore

Promotion / demotion

An access-frequency tracker watches reads per object per hour. Crossing thresholds triggers async copy up or down a tier. The viewer never blocks on this.

Re-encode the tail

Long-tail videos quietly get re-encoded into more efficient codecs and lower max-rungs to shrink their footprint. The master copy stays in archive.

Eleven nines on the master

Renditions are derived data and can always be rebuilt from the master. We pay top-tier durability only on the source.

Slide 09 · How bytes reach the player

HLS & DASH — manifests over plain HTTP

We do not stream one giant file. Each rendition is sliced into short segments (2–6 seconds each), and a manifest tells the player where every segment lives.

master.m3u8 #EXTM3U #STREAM 240p 0.2M #STREAM 720p 2.5M #STREAM 1080p 5.0M #STREAM 2160p 16M 720p.m3u8 seg-0001.ts 6.0s seg-0002.ts 6.0s seg-0003.ts 6.0s Segments on disk / CDN 720p · ~2.5 Mbps · 6 s each 1080p · ~5 Mbps · 6 s each 240p · ~0.2 Mbps · 6 s each

Why segment?

Small files cache cleanly at every layer, allow byte-range fetches, and let the player change quality between segments without reloading the whole stream.

HLS vs DASH

Both expose a master manifest pointing at per-rendition playlists. HLS uses .m3u8, DASH uses .mpd. Modern packagers emit both side by side.

Plain HTTP wins

Everything is GETs of static objects — easy to cache, easy to scale, friendly to existing CDN infrastructure.

Slide 10 · The player is half the system

Adaptive bitrate selection

Quality decisions happen on the device, segment by segment. The server just exposes the menu; the client orders the dish that fits its current appetite.

Estimated bandwidth (Mbps) time 8 5 2 0 720p 1080p 1080p 360p 240p 1080p 2160p measured throughput chosen rendition

How the player decides

  • Times each segment download and updates a rolling bandwidth estimate.
  • Watches its own playback buffer — low buffer means "drop quality now, recover later".
  • Adds hysteresis so it does not flap on every wobble.
  • Considers device hints — battery, decoder load, screen resolution.

Why client-side?

The device sees real-time conditions the server cannot — radio strength, contention, throttling — and reacts within a single segment.

Slide 11 · The last mile

CDN — edge POPs, mid-tier shield, origin

A three-layer cache hierarchy puts bytes seconds away from every viewer while protecting origin storage from the herd. The trick is collapsing cache misses so the origin sees one request, not a million.

Origin Store master rendition copy Origin Shield mid-tier · few regions Edge POP · São Paulo disk + RAM cache Edge POP · Frankfurt disk + RAM cache Edge POP · Mumbai disk + RAM cache Edge POP · Tokyo disk + RAM cache viewers ≥ 95% bytes served here collapses misses cold-byte source of truth
Slide 12 · Two databases, two destinies

Metadata and blobs live apart

Titles, comments, view counts and recommendations are small and accessed via complex queries. Video bytes are huge and accessed by primary key only. Mixing them is the fastest way to a crashed database.

METADATA PLANE videos id · title · owner users id · email · channel comments video_id · text views_agg video_id · count search index · recommender features embedding · tag · trending score sharded SQL / NoSQL indexed, transactional, cached BLOB PLANE videos/abc123/master.mp4 videos/abc123/1080p/seg-0001.ts videos/abc123/1080p/seg-0002.ts videos/abc123/720p/seg-0001.ts videos/abc123/thumb.jpg flat object storage · immutable · key-only access key

Why split?

Different access patterns, different durability budgets, different scaling axes. Metadata grows with users; blobs grow with hours of video.

The link is just a key

The metadata row stores an object path. Replacing the blob backend later is a one-table migration.

Failure isolation

A metadata outage might hide titles, but the CDN keeps serving bytes to ongoing watches. Each plane can scale and fail on its own clock.

Slide 13 · Counting at scale

View counts — async aggregation

A trending video can attract a million plays per minute. Writing one row per view into a relational table will incinerate the database. Counts are eventually-consistent metrics, not transactions.

Player emits play events Edge Collector validate · dedup Kafka topic view_events Stream Aggregator per-min · per-video Batch Reconciler hourly truth · dedup Live Counter (KV) approx, sub-minute lag Canonical views_agg authoritative count reconcile The UI shows the live counter for snappiness; the batch path corrects bots, duplicates and partial fan-out.
Slide 14 · Closing

Six principles to take with you

A video platform is many systems wearing a trench coat. The patterns below repeat at every layer — keep them in mind when sketching the next read-heavy, blob-heavy service.

1

Separate the byte plane from the control plane

Metadata and blobs scale and fail on different curves. Glue them with a key, not a join.

2

Treat heavy work as a DAG of small, retryable jobs

Transcoding, packaging and reconciliation are all idempotent steps that can be rebuilt on demand.

3

Push intelligence to the edges

The CDN absorbs the bytes, the player picks the quality, the origin only sees the misses that survive both.

4

Move bytes with their access pattern

Hot at the edge, warm regionally, cold in deep storage. Pay top-tier durability only on the master.

5

Counts are streams, not transactions

Views, likes and trending are async aggregates. Show a fast approximation now, reconcile the truth later.

6

Design every component to fail without taking the watch with it

Comments down, recommendations down, search down — playback must keep rolling.

← → SPACE · F fullscreen