Chapter 4 · System Design Fundamentals

Design a Rate Limiter

Every public API needs a doorman that decides, on each request, whether to let it through or push back. This is your first real system-design build — small enough to finish, deep enough to teach buckets, windows, atomicity, and the wire protocol all at once.

▶ Open the companion slides

Reading time ~12 min Prerequisites Ch 3, Ch 2 Audio 🔊 Hinglish read-aloud Next Consistent Hashing

A rate limiter is the simplest piece of infrastructure that still forces every hard idea in distributed systems: where do you put control, how do you count safely across machines, and how do you tell the caller the truth? Get this one right and the pattern repeats everywhere — quotas, billing, backpressure, fairness.

Primary source

Alex Xu, System Design Interview (Vol. 1), Chapter 4. Companion deck: slides — jump to any slide with the Slide N chips.
Go deeper: Stripe Engineering — Scaling your API with rate limiters.

Why rate-limit at all? Slide 2

A limiter is a small, sharp tool with four overlapping jobs. Each one alone justifies running it; together they make it non-optional for any public surface.

Block abuse

Credential stuffing, scraping, brute-force, and amplification floods all look the same from above — too many requests, too fast, from too few sources. The limiter cuts the long tail before it reaches your auth code or database.

Cap the cost

Pay-per-use APIs (LLM tokens, SMS, geocoding, egress) turn a stuck retry loop into a real invoice. A budget enforced at the edge converts an unbounded blowup into a bounded one.

Keep it fair

One tenant running a backfill should not starve everyone else. Per-key quotas make a shared cluster feel like a private one, and turn capacity planning into a per-tenant exercise.

Stay stable

Every downstream has a knee in its latency curve. Shedding load above the knee keeps the system in its linear region instead of dragging it into collapse.

Where does the limiter live? Slide 3

The same algorithm behaves very differently depending on where you put it. Real systems use more than one layer, with different limits at each — and the edge is the one place that sees a tenant's traffic across all services.

Three layers: client is polite but untrusted, the edge is shared truth, and service-side limits protect specific resources.

Client — cheap, polite, untrusted

The SDK pre-throttles so a stuck retry loop in user code doesn't slam the network. Good for UX, useless for security: a hostile client just removes it.

Edge / gateway — the primary control

Sees every request before it costs compute. Centralises policy, terminates abuse early, and is the only place one tenant's traffic is visible across all services.

Service-side — last line of defense

Protects a specific resource and knows what the edge can't — which call is cheap, which writes to disk. Usually a finer, per-endpoint limit.

Pick by trust and scope

Layer them: coarse at the edge for abuse, fine at the service for capacity, advisory in the SDK for politeness. One layer is never enough.

Token bucket Slide 4

A bucket holds up to B tokens and refills at r tokens per second. Each request removes one token; if the bucket is empty, the request is rejected. Simple, fast, and the default for most APIs.

State is just (tokens, last_refill_ts). On each request, add elapsed × r tokens (capped at B), then try to subtract one.

Bursts are first-class

An idle client builds up to B tokens, so a real-world "quiet, quiet, burst, quiet" pattern is served correctly. The long-run average stays bounded at r.

The burst can be sharp

If B is large, a starved client can spend all its credit in milliseconds. Fine for most apps, bad if the downstream can't absorb a spike — tune B to what the next hop can take.

Leaky bucket Slide 5

Requests enter a fixed-size FIFO queue and leave at a constant rate L. Excess arrivals are dropped. Where token bucket shapes credit, leaky bucket shapes output — perfectly smooth.

Output is perfectly smooth

Downstream sees a constant arrival rate no matter how spiky the upstream is — ideal when you're protecting something that hates bursts, like a database with a fixed connection pool.

Bursts wait or die

Every accepted request that lands during a backlog pays added latency, and bursts beyond the queue are silently lost. Token bucket forgives bursts; leaky bucket punishes them.

Fixed window counter Slide 6

Bucket time into fixed slots (say one minute) and keep one integer per slot per key. Increment on each request; reject when the count exceeds the limit. Trivial to build — and easy to exploit at the boundary.

With a 10-per-minute limit, 10 requests at 12:00:59 plus 10 at 12:01:00 is 20 in two seconds — the letter of the limit is intact; the spirit is not.

The edge effect

Storage is one INCR keyed by user_id:window_start, and old keys expire on their own — but never make fixed window your only defense. It's coarse, cheap, and good enough only for non-adversarial traffic where the limit is large.

Sliding window log Slide 7

Keep a sorted set of every request timestamp in the last W seconds. On arrival, drop entries older than now − W, count the rest, and allow if it's under the limit. Exact — but expensive.

Perfectly accurate

There's no boundary artifact — the window truly slides. The count at any instant is the real number of requests in the last W seconds. No other algorithm matches this.

Memory grows with traffic

A million-RPS client needs sixty million timestamps stored at all times for a 60-second window. Use it only when accuracy genuinely matters — billing, security-sensitive flows.

Sliding window counter Slide 8

Keep just two fixed-window counters — current and previous — and interpolate. You get most of the accuracy of a sliding log at the storage cost of a fixed window: two integers per key. This is the workhorse of high-volume public APIs.

Measure overlap

overlap = (now − current_window_start) / W — how far into the current slot you are.

Weight the previous slot

estimate = prev × (1 − overlap) + current — the old slot's contribution decays to zero.

Decide

e.g. 8 × 0.64 + 5 ≈ 10.1 — over a limit of 10, so this request is denied.

Storage and CPU match a fixed window, but the boundary doubling-attack is gone because the previous slot's weight decays linearly. The cost: it assumes traffic was uniform inside the previous window. Real traffic isn't — so the estimate drifts in pathological cases, usually by a few percent. Almost always worth it.

Pick the right algorithm Slide 9

All five solve the same problem from different angles. The choice turns on what you value: bursts, smoothness, accuracy, or cheap storage.

Algorithm	Accuracy	Memory / key	Burst handling	Best for
Token bucket	Good	2 numbers	Allows bursts up to B	General API throttling
Leaky bucket	Good	queue of N	Smooths bursts out	Fragile downstream
Fixed window	Edge effect	1 integer	Up to 2× at boundaries	Cheap coarse, internal
Sliding log	Exact	one / request	Hard cap	Billing, security-critical
Sliding counter	~95%+	2 integers	Hard cap (estimated)	High-volume public APIs

Rule of thumb: start with the sliding window counter; reach for sliding-log only when an audit or invoice depends on exactness.

One limiter, many machines Slide 10

A single edge node sees only its slice of traffic. To enforce a global limit you need shared state — usually Redis. Now correctness depends on race-free updates, and the naive version is broken.

Read-modify-write is a race

A naive limiter does GET the counter, compare to the limit, then INCR if allowed. Between the GET and the INCR, another node does the same — both read "under the limit," both let the request through, and the global limit is breached.

Make the decision atomic

Push the whole check-and-increment into one Redis Lua script. The server runs it without interleaving, so read, compare, and write become a single operation visible everywhere.

Operate it deliberately

Pin all writes for a key to one shard (cross-shard scripts aren't atomic); cache the decision locally for a few ms to absorb the round-trip on hot keys; and decide to fail open or closed before an outage forces the choice.

Tell the client what just happened Slide 11

A throttled client should recover gracefully without guesswork. Return a clear status, a wait time, and enough quota metadata to back off intelligently.

429, not 503

429 means "you're sending too much." 503 means "the server is unhealthy." Mixing them confuses retry logic — 503 implies the request might succeed elsewhere; 429 says it won't until the caller slows down.

Retry-After

An integer of seconds or an HTTP-date. Without it, good clients fall back to exponential backoff with jitter and bad ones hammer immediately. Setting it costs nothing and helps both.

X-RateLimit-* on success

Return Limit, Remaining, and Reset on every response, not just 429s. Clients can self-throttle before they hit the wall, and your support load drops.

A machine-readable body

Pair the headers with a small JSON error (rate_limit_exceeded, the limit, the window, the retry seconds) so SDKs can react without screen-scraping a header.

Monitoring and abuse detection Slide 12

A limiter without instrumentation is invisible until it breaks — and the metrics it emits are also your best raw material for spotting attacks and tuning policy.

What to graph

Throttled requests per second (overall and per-key), allow/deny ratio per route, the latency the limiter itself adds (Redis round-trip P99), and the top-N keys by denial rate.

When traffic stops looking organic

A few keys producing most denials; requests spaced perfectly evenly (a bot, not a human); sudden cross-key correlation from one subnet; or a denial spike on login and password-reset endpoints.

tighten

auto

limit after repeat violations

tier

block

move chronic offenders

page

systemic

not single-key spikes

review

replay

denied reqs for fraud

Edge cases that bite in production Slide 13

Most rate-limiter bugs aren't in the algorithm — they're in the assumptions around it. Two show up in every postmortem on the topic.

Clock skew

If each node writes timestamps from its local clock, two nodes a few hundred ms apart disagree about what falls inside the window. Fix: use the store's clock — pass Redis TIME from inside the Lua script — pin a key's writes to one shard, and alert on NTP drift.

Thundering herd at reset

When a fixed window resets, every throttled client retries at the same instant — a synchronized spike larger than the unthrottled traffic. Fix: jitter Retry-After, prefer sliding windows (no global reset), and soft-shed before the hard limit.

Principles to carry forward Slide 14

The algorithms differ; the engineering instincts behind a good limiter don't.

1 · Default to the sliding window counter

Two integers per key, no boundary attack, near-exact accuracy. The right starting point for almost any public API.

2 · Make every decision atomic

In a distributed system, a check-then-act limiter is a broken limiter. Push it into a Lua script or an atomic counter primitive.

3 · Layer your limits

Coarse at the edge, fine at the service, advisory in the SDK. One layer is never enough.

4 · Speak clearly on the wire

429 plus Retry-After plus X-RateLimit-* on every response. A client that knows its budget won't hammer you.

5 · Jitter everything that resets

Synchronized clients cause synchronized spikes. Spread reset times, backoff intervals, and refill ticks across the population.

6 · Treat denial as a signal

The graph of 429s per key is your earliest abuse-detection feed. Monitor it like errors — it's not noise, it's intelligence.

Active recall

Cover the answers. Say each one out loud before you tap to check.

Name the four jobs a rate limiter does.

One word each.

Block abuse, cap cost, keep it fair across tenants, and keep the system stable inside its safe envelope.

Token bucket: which two numbers define it, and what do they do?

Capacity and a rate.

B = capacity (the max burst) and r = refill rate (the long-run average). State is just (tokens, last_refill_ts).

Why is the fixed window counter exploitable?

Think about the seam.

The boundary burst: a client can fire a full quota just before a window ends and another just after — up to 2× the limit in a few seconds.

How does the sliding window counter estimate the count?

Two slots, weighted.

prev × (1 − overlap) + current. The previous slot's contribution decays linearly to zero, killing the boundary attack at two-integer cost.

Why is GET-then-INCR wrong in a distributed limiter?

Two nodes, same instant.

It's a race: between the read and the increment, another node also reads "under the limit," so both allow the request. Fix: one atomic Lua script.

What three things should a 429 response carry?

Status plus metadata.

Retry-After (when to come back) and the X-RateLimit-* quota headers (Limit, Remaining, Reset) — on every response, not just denials.

Check yourself

Q1 Which layer is the primary control point for a rate limiter?

Why: The edge sees every request before it costs compute and is the only place one tenant's traffic is visible across all services.

Q2 What makes the token bucket good at handling bursts?

Why: An idle client builds up to B tokens, so a "quiet then burst" pattern is served while the long-run average stays at r.

Q3 The fixed window counter's main weakness is that it can…

Why: A client can spend a full quota just before and just after a window edge, doubling the effective rate in a few seconds.

Q4 Why must the distributed counter update be atomic?

Why: Check-then-act races: both nodes read "under the limit" and both allow. One atomic Lua script collapses read, compare, and write into one step.

Q5 Which status code best tells a client it is being throttled?

Why: 429 means "slow down," distinct from 503's "server unhealthy." Retry-After tells the caller exactly when to come back.