Chapter 4 · System Design Fundamentals
Design a Rate Limiter
Every public API needs a doorman that decides, on each request, whether to let it through or push back. This is your first real system-design build — small enough to finish, deep enough to teach buckets, windows, atomicity, and the wire protocol all at once.
▶ Open the companion slidesA rate limiter is the simplest piece of infrastructure that still forces every hard idea in distributed systems: where do you put control, how do you count safely across machines, and how do you tell the caller the truth? Get this one right and the pattern repeats everywhere — quotas, billing, backpressure, fairness.
Alex Xu, System Design Interview (Vol. 1), Chapter 4. Companion deck:
slides — jump to any slide with the
Slide N chips.
Go deeper: Stripe Engineering — Scaling your API with rate limiters.
Why rate-limit at all? Slide 2
A limiter is a small, sharp tool with four overlapping jobs. Each one alone justifies running it; together they make it non-optional for any public surface.
Block abuse
Credential stuffing, scraping, brute-force, and amplification floods all look the same from above — too many requests, too fast, from too few sources. The limiter cuts the long tail before it reaches your auth code or database.
Cap the cost
Pay-per-use APIs (LLM tokens, SMS, geocoding, egress) turn a stuck retry loop into a real invoice. A budget enforced at the edge converts an unbounded blowup into a bounded one.
Keep it fair
One tenant running a backfill should not starve everyone else. Per-key quotas make a shared cluster feel like a private one, and turn capacity planning into a per-tenant exercise.
Stay stable
Every downstream has a knee in its latency curve. Shedding load above the knee keeps the system in its linear region instead of dragging it into collapse.
Where does the limiter live? Slide 3
The same algorithm behaves very differently depending on where you put it. Real systems use more than one layer, with different limits at each — and the edge is the one place that sees a tenant's traffic across all services.
Client — cheap, polite, untrusted
The SDK pre-throttles so a stuck retry loop in user code doesn't slam the network. Good for UX, useless for security: a hostile client just removes it.
Edge / gateway — the primary control
Sees every request before it costs compute. Centralises policy, terminates abuse early, and is the only place one tenant's traffic is visible across all services.
Service-side — last line of defense
Protects a specific resource and knows what the edge can't — which call is cheap, which writes to disk. Usually a finer, per-endpoint limit.
Pick by trust and scope
Layer them: coarse at the edge for abuse, fine at the service for capacity, advisory in the SDK for politeness. One layer is never enough.
Token bucket Slide 4
A bucket holds up to B tokens and refills at r tokens per second. Each request removes one token; if the bucket is empty, the request is rejected. Simple, fast, and the default for most APIs.
Bursts are first-class
An idle client builds up to B tokens, so a real-world "quiet, quiet, burst, quiet" pattern is served correctly. The long-run average stays bounded at r.
The burst can be sharp
If B is large, a starved client can spend all its credit in milliseconds. Fine for most apps, bad if the downstream can't absorb a spike — tune B to what the next hop can take.
Leaky bucket Slide 5
Requests enter a fixed-size FIFO queue and leave at a constant rate L. Excess arrivals are dropped. Where token bucket shapes credit, leaky bucket shapes output — perfectly smooth.
Output is perfectly smooth
Downstream sees a constant arrival rate no matter how spiky the upstream is — ideal when you're protecting something that hates bursts, like a database with a fixed connection pool.
Bursts wait or die
Every accepted request that lands during a backlog pays added latency, and bursts beyond the queue are silently lost. Token bucket forgives bursts; leaky bucket punishes them.
Fixed window counter Slide 6
Bucket time into fixed slots (say one minute) and keep one integer per slot per key. Increment on each request; reject when the count exceeds the limit. Trivial to build — and easy to exploit at the boundary.
Storage is one INCR keyed by user_id:window_start, and old keys
expire on their own — but never make fixed window your only defense. It's coarse,
cheap, and good enough only for non-adversarial traffic where the limit is large.
Sliding window log Slide 7
Keep a sorted set of every request timestamp in the last W seconds. On
arrival, drop entries older than now − W, count the rest, and allow if it's
under the limit. Exact — but expensive.
Perfectly accurate
There's no boundary artifact — the window truly slides. The count at any instant is the real number of requests in the last W seconds. No other algorithm matches this.
Memory grows with traffic
A million-RPS client needs sixty million timestamps stored at all times for a 60-second window. Use it only when accuracy genuinely matters — billing, security-sensitive flows.
Sliding window counter Slide 8
Keep just two fixed-window counters — current and previous — and interpolate. You get most of the accuracy of a sliding log at the storage cost of a fixed window: two integers per key. This is the workhorse of high-volume public APIs.
Storage and CPU match a fixed window, but the boundary doubling-attack is gone because the previous slot's weight decays linearly. The cost: it assumes traffic was uniform inside the previous window. Real traffic isn't — so the estimate drifts in pathological cases, usually by a few percent. Almost always worth it.
Pick the right algorithm Slide 9
All five solve the same problem from different angles. The choice turns on what you value: bursts, smoothness, accuracy, or cheap storage.
| Algorithm | Accuracy | Memory / key | Burst handling | Best for |
|---|---|---|---|---|
| Token bucket | Good | 2 numbers | Allows bursts up to B | General API throttling |
| Leaky bucket | Good | queue of N | Smooths bursts out | Fragile downstream |
| Fixed window | Edge effect | 1 integer | Up to 2× at boundaries | Cheap coarse, internal |
| Sliding log | Exact | one / request | Hard cap | Billing, security-critical |
| Sliding counter | ~95%+ | 2 integers | Hard cap (estimated) | High-volume public APIs |
Rule of thumb: start with the sliding window counter; reach for sliding-log only when an audit or invoice depends on exactness.
One limiter, many machines Slide 10
A single edge node sees only its slice of traffic. To enforce a global limit you need shared state — usually Redis. Now correctness depends on race-free updates, and the naive version is broken.
A naive limiter does GET the counter, compare to the limit, then
INCR if allowed. Between the GET and the INCR, another node does the same — both
read "under the limit," both let the request through, and the global limit is breached.
Make the decision atomic
Push the whole check-and-increment into one Redis Lua script. The server runs it without interleaving, so read, compare, and write become a single operation visible everywhere.
Operate it deliberately
Pin all writes for a key to one shard (cross-shard scripts aren't atomic); cache the decision locally for a few ms to absorb the round-trip on hot keys; and decide to fail open or closed before an outage forces the choice.
Tell the client what just happened Slide 11
A throttled client should recover gracefully without guesswork. Return a clear status, a wait time, and enough quota metadata to back off intelligently.
429, not 503
429 means "you're sending too much." 503 means "the server is unhealthy." Mixing them confuses retry logic — 503 implies the request might succeed elsewhere; 429 says it won't until the caller slows down.
Retry-After
An integer of seconds or an HTTP-date. Without it, good clients fall back to exponential backoff with jitter and bad ones hammer immediately. Setting it costs nothing and helps both.
X-RateLimit-* on success
Return Limit, Remaining, and Reset on every response, not just 429s. Clients can self-throttle before they hit the wall, and your support load drops.
A machine-readable body
Pair the headers with a small JSON error (rate_limit_exceeded, the limit, the window, the retry seconds) so SDKs can react without screen-scraping a header.
Monitoring and abuse detection Slide 12
A limiter without instrumentation is invisible until it breaks — and the metrics it emits are also your best raw material for spotting attacks and tuning policy.
What to graph
Throttled requests per second (overall and per-key), allow/deny ratio per route, the latency the limiter itself adds (Redis round-trip P99), and the top-N keys by denial rate.
When traffic stops looking organic
A few keys producing most denials; requests spaced perfectly evenly (a bot, not a human); sudden cross-key correlation from one subnet; or a denial spike on login and password-reset endpoints.
Edge cases that bite in production Slide 13
Most rate-limiter bugs aren't in the algorithm — they're in the assumptions around it. Two show up in every postmortem on the topic.
Clock skew
If each node writes timestamps from its local clock, two nodes a few hundred ms apart disagree about what falls inside the window. Fix: use the store's clock — pass Redis TIME from inside the Lua script — pin a key's writes to one shard, and alert on NTP drift.
Thundering herd at reset
When a fixed window resets, every throttled client retries at the same instant — a synchronized spike larger than the unthrottled traffic. Fix: jitter Retry-After, prefer sliding windows (no global reset), and soft-shed before the hard limit.
Principles to carry forward Slide 14
The algorithms differ; the engineering instincts behind a good limiter don't.
1 · Default to the sliding window counter
Two integers per key, no boundary attack, near-exact accuracy. The right starting point for almost any public API.
2 · Make every decision atomic
In a distributed system, a check-then-act limiter is a broken limiter. Push it into a Lua script or an atomic counter primitive.
3 · Layer your limits
Coarse at the edge, fine at the service, advisory in the SDK. One layer is never enough.
4 · Speak clearly on the wire
429 plus Retry-After plus X-RateLimit-* on every response. A client that knows its budget won't hammer you.
5 · Jitter everything that resets
Synchronized clients cause synchronized spikes. Spread reset times, backoff intervals, and refill ticks across the population.
6 · Treat denial as a signal
The graph of 429s per key is your earliest abuse-detection feed. Monitor it like errors — it's not noise, it's intelligence.
Active recall
Cover the answers. Say each one out loud before you tap to check.
(tokens, last_refill_ts).prev × (1 − overlap) + current. The previous slot's contribution
decays linearly to zero, killing the boundary attack at two-integer cost.