Chapter 04 · System Design

Design a Rate Limiter

A control plane that decides, on every request, whether to let it through or push back — protecting downstream systems from abuse, runaway cost, and noisy neighbors.

Topic · Throttling Surface · Edge & Service Slides · 14

Chapter 04

01 / 14

§ 4.1 · Motivation

Why rate-limit at all?

A rate limiter is a small, sharp tool with four overlapping jobs. Each one alone justifies the cost of running it; together they make it non-optional for any public surface.

Abuse

Block the obvious attackers

Credential stuffing, scraping, brute-force, and amplification floods all look the same from above: too many requests, too fast, from too few sources. A limiter cuts the long tail before it reaches your auth code or database.

Cost

Cap the cloud bill

Pay-per-use APIs (LLM tokens, SMS, geocoding, egress bandwidth) turn a stuck retry loop into a real invoice. A budget enforced at the edge converts an unbounded blowup into a bounded one.

Fairness

Stop noisy neighbors

One tenant running a backfill should not starve everyone else of capacity. Per-key quotas keep a shared cluster feeling like a private one, and make capacity planning a per-tenant exercise.

Stability

Stay inside the safe envelope

Every downstream — your database, a third-party API, a queue — has a knee in its latency curve. Shedding load above the knee keeps the system in its linear region instead of dragging it into collapse.

Why rate-limit

02 / 14

§ 4.2 · Placement

Where does the limiter live?

The same algorithm behaves very differently depending on where you put it. Most production systems use more than one of these layers, with different limits at each.

Client

Cheap, polite, untrusted

The SDK pre-throttles so a stuck retry loop in user code does not slam the network. Useful for UX, useless for security — a hostile client just removes it.

API gateway / edge

The primary control point

Sees every request before it costs you compute. Centralizes policy, terminates abuse early, and is the only place where one tenant's traffic is visible across all services.

Service-side

Last line of defense

Protects a specific resource (a database, a paid API). Knows things the edge does not — like which call is cheap and which one writes to disk. Often a finer-grained per-endpoint limit.

Placement

03 / 14

§ 4.3 · Algorithm

Token bucket

A bucket holds up to B tokens and refills at r tokens per second. Each request removes one token; if the bucket is empty, the request is rejected. Simple, fast, and the default choice for most APIs.

Mechanics

Two numbers describe it

The state is just (tokens, last_refill_ts). On each request, compute how much time has passed, add elapsed * r tokens (capped at B), then try to subtract one.

Strength

Bursts are first-class

An idle client builds up to B tokens, so a real-world traffic pattern of "quiet, quiet, burst, quiet" is served correctly. The long-run average is still bounded at r.

Trade-off

The burst can be sharp

If B is large, a starved client can spend all its credit in milliseconds — fine for most apps, bad if the downstream cannot absorb a spike. Tune B to what the next hop can take.

Token bucket

04 / 14

§ 4.4 · Algorithm

Leaky bucket

Requests enter a fixed-size queue and are processed (or forwarded) at a constant rate. Excess arrivals are dropped. Where token bucket shapes credit, leaky bucket shapes output — perfectly smooth.

Mechanics

A bounded queue with a constant drain

Implementation is literally a fixed-capacity FIFO plus a worker that releases one item every 1/L seconds. Adding to a full queue means the request is dropped.

Strength

Output is perfectly smooth

Downstream sees a constant arrival rate no matter how spiky the upstream is — ideal when you are protecting something that hates bursts, like a database with a fixed connection pool.

Trade-off

Bursts wait or die

Latency is added to every accepted request that arrives during a backlog. Bursts that exceed the queue are silently lost. Token bucket forgives bursts; leaky bucket punishes them.

Leaky bucket

05 / 14

§ 4.5 · Algorithm

Fixed window counter

Bucket time into fixed slots (say, one minute). Keep one integer per slot per key. Increment on each request; reject when the count exceeds the limit. Easy to implement, easy to exploit at boundaries.

Mechanics

One key per (user, window)

Storage is one integer keyed by user_id:window_start. A single INCR in Redis is enough. Old keys expire automatically.

The edge effect

Twice the limit at boundaries

With a 10-per-minute limit, a client can fire 10 requests at 12:00:59 and another 10 at 12:01:00 — that is 20 requests in two seconds. The limiter is technically not violated; the spirit of the limit is.

When it works

Coarse, cheap, good enough

For non-adversarial workloads where the limit is large relative to typical traffic, this is the cheapest option that exists. Just do not use it as your only line of defense.

Fixed window

06 / 14

§ 4.6 · Algorithm

Sliding window log

Keep a sorted set of every request timestamp in the last W seconds. On arrival, drop entries older than now − W, count the rest, and allow if it is under the limit. Exact, but expensive.

Mechanics

Redis sorted set per key

A ZADD with the timestamp as score, then ZREMRANGEBYSCORE 0 (now-W), then ZCARD. Wrap in MULTI/EXEC or a Lua script.

Strength

Perfectly accurate

There is no boundary artifact — the window truly slides. The count at any instant is the actual number of requests in the last W seconds. No other algorithm matches this.

Trade-off

Memory grows with traffic

A million-RPS client needs sixty million timestamps stored at all times for a 60-second window. Use only when accuracy genuinely matters — billing, security-sensitive flows.

Sliding window log

07 / 14

§ 4.7 · Algorithm

Sliding window counter (hybrid)

Keep just two fixed-window counters — current and previous — and interpolate. You get most of the accuracy of a sliding log at the storage cost of a fixed window: two integers per key.

Mechanics

Linear interpolation across two slots

Compute the fraction of the previous window still inside the sliding view: overlap = (now − win_start) / W. The estimated count is prev × (1 − overlap) + current.

Strength

Two integers, no edge effect

Storage and CPU match fixed-window. The boundary doubling-attack is gone because the contribution of the previous slot decays linearly to zero. Most production systems use this.

Trade-off

Approximate, not exact

It assumes traffic was uniform inside the previous window. Real traffic is not — so the estimate is off in pathological cases, though typically by a few percent. Tradeoff is almost always worth it.

Sliding window counter

08 / 14

§ 4.8 · Decision matrix

Pick the right algorithm

All five algorithms solve the same problem from different angles. The right one depends on whether you value bursts, smoothness, accuracy, or cheap storage.

Algorithm	Accuracy	Memory / key	Burst handling	Implementation	Best for
Token bucket	good	2 numbers	Allows bursts up to B	simple	General-purpose API throttling
Leaky bucket	good	Queue of N	Smooths bursts out	moderate	Protecting a fragile downstream
Fixed window	edge effect	1 integer	Up to 2× at boundaries	trivial	Cheap coarse limits, internal use
Sliding log	exact	One per request	Hard cap	heavy	Billing, audit, security-critical
Sliding counter	~95%+	2 integers	Hard cap (estimated)	simple	Default for high-volume public APIs

2

integers for sliding counter

~5%

typical sliding-counter error

2×

worst case for fixed window

O(R)

memory for sliding log

Comparison

09 / 14

§ 4.9 · Distributed

One limiter, many machines

A single edge node sees only its slice of traffic. To enforce a global limit you need shared state — usually Redis. Now correctness depends on race-free updates.

The race

Read-modify-write is wrong

A naive limiter GETs the counter, compares to the limit, and INCRs if allowed. Between the GET and the INCR, another node can do the same — both think they were under the limit, both let the request through.

The fix

Make the decision atomic

Push the entire check-and-increment into a single Redis Lua script. The server runs it without interleaving, so the read, the compare, and the write are one operation visible everywhere.

Operational notes

Pin all writes for one key to one Redis shard — cross-shard scripts are not atomic.
Cache the limit decision locally for a few ms to absorb the Redis round-trip on hot keys.
Fail open or closed deliberately — pick before an outage forces the choice.

Atomic sliding-window counter in Lua

A single round-trip, race-free

-- KEYS[1] = bucket key, ARGV: now, window_ms, limit
local key    = KEYS[1]
local now    = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit  = tonumber(ARGV[3])

-- drop timestamps older than the window
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

local count = redis.call('ZCARD', key)
if count >= limit then
  return {0, count}     -- denied
end

redis.call('ZADD', key, now, now)
redis.call('PEXPIRE', key, window)
return {1, count + 1}   -- allowed

Distributed

10 / 14

§ 4.10 · Wire protocol

Tell the client what just happened

A throttled client should be able to recover gracefully without guesswork. Return a clear status, a wait time, and enough quota metadata for the caller to back off intelligently.

Denied response

429 Too Many Requests

HTTP/1.1 429 Too Many Requests Content-Type: application/json Retry-After: 12 X-RateLimit-Limit: 100 X-RateLimit-Remaining: 0 X-RateLimit-Reset: 1748470812 { "error": "rate_limit_exceeded", "limit": 100, "window": "1m", "retry_after_seconds": 12 }

Status code

429, not 503

429 means "you are sending too much." 503 means "the server is unhealthy." Mixing them confuses retry logic — 503 implies the request might succeed elsewhere; 429 implies it will not until the caller slows down.

Retry-After

Tell them when to come back

An integer (seconds) or HTTP-date. Without it, well-behaved clients fall back to exponential backoff with jitter; misbehaving clients hammer immediately. Setting it costs nothing and helps both.

X-RateLimit-* on success

Quota headers on every response

Return Limit, Remaining, and Reset on every response, not just 429s. Clients can then self-throttle before they hit the wall, and your support load goes down.

HTTP response design

11 / 14

§ 4.11 · Observability

Monitoring and abuse detection

A rate limiter without instrumentation is invisible until it breaks. The metrics it emits are also the best raw material you have for spotting attacks and tuning policy.

Core metrics

What to graph

Throttled requests per second (overall and per-key).
Allow / deny ratio per route.
Latency added by the limiter itself (Redis round-trip P99).
Top-N keys by denial rate.

Abuse signals

When traffic stops looking organic

A handful of keys producing the bulk of denials.
Requests perfectly evenly spaced — that is a bot, not a human.
Sudden cross-key correlation from one ASN or subnet.
High denial rate on login or password-reset endpoints.

Closing the loop

From signal to action

Auto-tighten limits on a key after repeated violations.
Move chronic offenders into a stricter tier or block list.
Page on systemic deny-rate spikes, not single-key ones.
Replay denied requests in a dashboard for fraud review.

Monitoring

12 / 14

§ 4.12 · Failure modes

Edge cases that bite in production

Most rate limiter bugs are not in the algorithm — they are in the assumptions around it. These two show up in every postmortem on the topic.

Clock skew

Whose "now" do we trust?

If each limiter node uses its local clock to write timestamps into Redis, two nodes a few hundred milliseconds apart will disagree about what falls inside the window. With NTP misbehaving, the disagreement gets worse.

Use the store's clock — in Redis, pass TIME from inside the Lua script so every write uses the same reference.
Pin all writes for a key to one shard so monotonicity is local.
Monitor NTP offset on every limiter node; alert at >50ms drift.

Thundering herd at reset

Everyone fires at the same instant

When a fixed window resets, every client who got throttled retries the moment they think the new window opened. The result: a synchronized spike against your backend that can be larger than the unthrottled traffic.

Jitter Retry-After — return a random offset between e.g. [ttl, 1.3·ttl] per client.
Prefer sliding windows — there is no global reset moment, so the herd cannot synchronize.
Soft-shed before the limit — start delaying (not denying) at 80% utilization to spread the pressure.

Edge cases

13 / 14

§ 4.13 · Wrap-up

Principles to carry forward

The algorithms differ; the engineering instincts behind a good rate limiter do not.

01

Default to the sliding window counter

Two integers per key, no boundary attack, near-exact accuracy. The right starting point for almost any public API.

02

Make every decision atomic

In a distributed system, a check-then-act limiter is a broken limiter. Push it into a Lua script or an atomic counter primitive.

03

Layer your limits

Coarse at the edge for abuse, fine at the service for capacity, advisory in the SDK for politeness. One layer is never enough.

04

Speak clearly on the wire

429 plus Retry-After plus X-RateLimit-* on every response. A client that knows its budget is a client that will not hammer you.

05

Jitter everything that resets

Synchronized clients cause synchronized spikes. Spread reset times, backoff intervals, and refill ticks across the population.

06

Treat denial as a signal, not just an outcome

The graph of 429s per key is your earliest abuse-detection feed. Monitor it like you monitor errors — it is not noise, it is intelligence.

Summary

14 / 14