A control plane that decides, on every request, whether to let it through or push back — protecting downstream systems from abuse, runaway cost, and noisy neighbors.
A rate limiter is a small, sharp tool with four overlapping jobs. Each one alone justifies the cost of running it; together they make it non-optional for any public surface.
Credential stuffing, scraping, brute-force, and amplification floods all look the same from above: too many requests, too fast, from too few sources. A limiter cuts the long tail before it reaches your auth code or database.
Pay-per-use APIs (LLM tokens, SMS, geocoding, egress bandwidth) turn a stuck retry loop into a real invoice. A budget enforced at the edge converts an unbounded blowup into a bounded one.
One tenant running a backfill should not starve everyone else of capacity. Per-key quotas keep a shared cluster feeling like a private one, and make capacity planning a per-tenant exercise.
Every downstream — your database, a third-party API, a queue — has a knee in its latency curve. Shedding load above the knee keeps the system in its linear region instead of dragging it into collapse.
The same algorithm behaves very differently depending on where you put it. Most production systems use more than one of these layers, with different limits at each.
The SDK pre-throttles so a stuck retry loop in user code does not slam the network. Useful for UX, useless for security — a hostile client just removes it.
Sees every request before it costs you compute. Centralizes policy, terminates abuse early, and is the only place where one tenant's traffic is visible across all services.
Protects a specific resource (a database, a paid API). Knows things the edge does not — like which call is cheap and which one writes to disk. Often a finer-grained per-endpoint limit.
A bucket holds up to B tokens and refills at r tokens per second. Each request removes one token; if the bucket is empty, the request is rejected. Simple, fast, and the default choice for most APIs.
The state is just (tokens, last_refill_ts). On each request, compute how much time has passed, add elapsed * r tokens (capped at B), then try to subtract one.
An idle client builds up to B tokens, so a real-world traffic pattern of "quiet, quiet, burst, quiet" is served correctly. The long-run average is still bounded at r.
If B is large, a starved client can spend all its credit in milliseconds — fine for most apps, bad if the downstream cannot absorb a spike. Tune B to what the next hop can take.
Requests enter a fixed-size queue and are processed (or forwarded) at a constant rate. Excess arrivals are dropped. Where token bucket shapes credit, leaky bucket shapes output — perfectly smooth.
Implementation is literally a fixed-capacity FIFO plus a worker that releases one item every 1/L seconds. Adding to a full queue means the request is dropped.
Downstream sees a constant arrival rate no matter how spiky the upstream is — ideal when you are protecting something that hates bursts, like a database with a fixed connection pool.
Latency is added to every accepted request that arrives during a backlog. Bursts that exceed the queue are silently lost. Token bucket forgives bursts; leaky bucket punishes them.
Bucket time into fixed slots (say, one minute). Keep one integer per slot per key. Increment on each request; reject when the count exceeds the limit. Easy to implement, easy to exploit at boundaries.
Storage is one integer keyed by user_id:window_start. A single INCR in Redis is enough. Old keys expire automatically.
With a 10-per-minute limit, a client can fire 10 requests at 12:00:59 and another 10 at 12:01:00 — that is 20 requests in two seconds. The limiter is technically not violated; the spirit of the limit is.
For non-adversarial workloads where the limit is large relative to typical traffic, this is the cheapest option that exists. Just do not use it as your only line of defense.
Keep a sorted set of every request timestamp in the last W seconds. On arrival, drop entries older than now − W, count the rest, and allow if it is under the limit. Exact, but expensive.
A ZADD with the timestamp as score, then ZREMRANGEBYSCORE 0 (now-W), then ZCARD. Wrap in MULTI/EXEC or a Lua script.
There is no boundary artifact — the window truly slides. The count at any instant is the actual number of requests in the last W seconds. No other algorithm matches this.
A million-RPS client needs sixty million timestamps stored at all times for a 60-second window. Use only when accuracy genuinely matters — billing, security-sensitive flows.
Keep just two fixed-window counters — current and previous — and interpolate. You get most of the accuracy of a sliding log at the storage cost of a fixed window: two integers per key.
Compute the fraction of the previous window still inside the sliding view: overlap = (now − win_start) / W. The estimated count is prev × (1 − overlap) + current.
Storage and CPU match fixed-window. The boundary doubling-attack is gone because the contribution of the previous slot decays linearly to zero. Most production systems use this.
It assumes traffic was uniform inside the previous window. Real traffic is not — so the estimate is off in pathological cases, though typically by a few percent. Tradeoff is almost always worth it.
All five algorithms solve the same problem from different angles. The right one depends on whether you value bursts, smoothness, accuracy, or cheap storage.
| Algorithm | Accuracy | Memory / key | Burst handling | Implementation | Best for |
|---|---|---|---|---|---|
| Token bucket | good | 2 numbers | Allows bursts up to B | simple | General-purpose API throttling |
| Leaky bucket | good | Queue of N | Smooths bursts out | moderate | Protecting a fragile downstream |
| Fixed window | edge effect | 1 integer | Up to 2× at boundaries | trivial | Cheap coarse limits, internal use |
| Sliding log | exact | One per request | Hard cap | heavy | Billing, audit, security-critical |
| Sliding counter | ~95%+ | 2 integers | Hard cap (estimated) | simple | Default for high-volume public APIs |
A single edge node sees only its slice of traffic. To enforce a global limit you need shared state — usually Redis. Now correctness depends on race-free updates.
A naive limiter GETs the counter, compares to the limit, and INCRs if allowed. Between the GET and the INCR, another node can do the same — both think they were under the limit, both let the request through.
Push the entire check-and-increment into a single Redis Lua script. The server runs it without interleaving, so the read, the compare, and the write are one operation visible everywhere.
-- KEYS[1] = bucket key, ARGV: now, window_ms, limit local key = KEYS[1] local now = tonumber(ARGV[1]) local window = tonumber(ARGV[2]) local limit = tonumber(ARGV[3]) -- drop timestamps older than the window redis.call('ZREMRANGEBYSCORE', key, 0, now - window) local count = redis.call('ZCARD', key) if count >= limit then return {0, count} -- denied end redis.call('ZADD', key, now, now) redis.call('PEXPIRE', key, window) return {1, count + 1} -- allowed
A throttled client should be able to recover gracefully without guesswork. Return a clear status, a wait time, and enough quota metadata for the caller to back off intelligently.
429 means "you are sending too much." 503 means "the server is unhealthy." Mixing them confuses retry logic — 503 implies the request might succeed elsewhere; 429 implies it will not until the caller slows down.
An integer (seconds) or HTTP-date. Without it, well-behaved clients fall back to exponential backoff with jitter; misbehaving clients hammer immediately. Setting it costs nothing and helps both.
Return Limit, Remaining, and Reset on every response, not just 429s. Clients can then self-throttle before they hit the wall, and your support load goes down.
A rate limiter without instrumentation is invisible until it breaks. The metrics it emits are also the best raw material you have for spotting attacks and tuning policy.
Most rate limiter bugs are not in the algorithm — they are in the assumptions around it. These two show up in every postmortem on the topic.
If each limiter node uses its local clock to write timestamps into Redis, two nodes a few hundred milliseconds apart will disagree about what falls inside the window. With NTP misbehaving, the disagreement gets worse.
TIME from inside the Lua script so every write uses the same reference.When a fixed window resets, every client who got throttled retries the moment they think the new window opened. The result: a synchronized spike against your backend that can be larger than the unthrottled traffic.
[ttl, 1.3·ttl] per client.The algorithms differ; the engineering instincts behind a good rate limiter do not.
Two integers per key, no boundary attack, near-exact accuracy. The right starting point for almost any public API.
In a distributed system, a check-then-act limiter is a broken limiter. Push it into a Lua script or an atomic counter primitive.
Coarse at the edge for abuse, fine at the service for capacity, advisory in the SDK for politeness. One layer is never enough.
429 plus Retry-After plus X-RateLimit-* on every response. A client that knows its budget is a client that will not hammer you.
Synchronized clients cause synchronized spikes. Spread reset times, backoff intervals, and refill ticks across the population.
The graph of 429s per key is your earliest abuse-detection feed. Monitor it like you monitor errors — it is not noise, it is intelligence.