Design a Rate Limiter
Who Asks This Question?
The rate limiter is a favorite at companies that operate large-scale APIs. Based on interview reports, it's frequently asked at:
- Stripe — Their entire business is API-first; they use token bucket for API throttling
- Cloudflare — They process millions of requests per second and built their own rate limiter
- Uber — Multiple Glassdoor reports confirm this as a system design round question
- Amazon — AWS API Gateway rate limiting is a core service
- Atlassian — Reported as a code design question with per-customer limits
- Reddit — Asked candidates to design multi-API rate limiters with per-user quotas
- Google — Appears in both system design and infrastructure-focused interviews
This question tests whether you've dealt with production traffic problems. Companies that ask it want to see that you understand the gap between "counting requests" (easy) and "counting requests across 50 servers without adding latency" (hard).
What the Interviewer Is Really Testing
Most candidates treat this as an algorithm problem: "Which counting algorithm should I pick?" That's only 20% of what interviewers evaluate. Here's the actual scoring breakdown at most companies:
| Evaluation Area | Weight | What They're Looking For |
|---|---|---|
| Requirements gathering | 15% | Do you ask the right questions, or do you start drawing boxes? |
| Algorithm knowledge | 20% | Can you explain trade-offs, not just recite names? |
| System architecture | 25% | Where does this component live? How does data flow? |
| Distributed challenges | 25% | Race conditions, consistency, multi-region — this is where seniors shine |
| Production awareness | 15% | Monitoring, failure modes, graceful degradation |
The #1 reason candidates fail this question: they spend 15 minutes explaining token bucket mechanics while the interviewer waits for them to mention literally anything about distributed systems. The algorithm is table stakes — the distributed part is the interview.
Step 1: Clarify Requirements
Questions You Must Ask
Don't just nod and start designing. These questions change your architecture fundamentally:
"Is this client-side or server-side rate limiting?" Always server-side. Client-side limits are trivially bypassed. But asking this shows you know the difference.
"What's the throttle key — user ID, IP, API key, or something else?" This determines your counter storage design. User ID requires authentication first. IP is simpler but problematic with NAT and VPNs (thousands of users sharing one IP). API key is the cleanest for B2B APIs.
"Single data center or globally distributed?" This is the most important question. A single-server rate limiter is a 10-minute problem. A globally distributed one is a 35-minute discussion. Most interviewers want distributed.
"What happens when the rate limiter itself fails?" This reveals your production mindset. Two valid answers:
- Fail open: Allow all traffic through (risk: no protection during outage)
- Fail closed: Block all traffic (risk: complete service disruption)
Most production systems fail open because a rate limiter outage shouldn't become a full service outage.
Requirements You Should State
After asking questions, explicitly state what you're building:
Functional:
- Limit requests based on configurable rules (per user, per IP, per endpoint)
- Return HTTP 429 with rate limit headers when throttled
- Support different limits for different API endpoints
Non-functional:
- Must add less than 5ms latency to each request (the rate limiter shouldn't be the bottleneck)
- Must work across a horizontally scaled server fleet
- Must degrade gracefully — if the rate limiter is down, traffic still flows
Step 2: High-Level Architecture
Where Does It Live?
There are three placement options, and your choice signals your experience level:
Inline middleware (most common answer): Every API server has rate limiter middleware that checks a shared counter store (Redis) before forwarding the request. This is what most candidates describe and it's a solid baseline.
API gateway layer (production-grade answer): In microservices architectures, the API gateway already handles auth, routing, and TLS termination. Adding rate limiting here avoids duplicating logic across every service. AWS API Gateway, Kong, and Envoy all support this natively.
Sidecar proxy (advanced answer): In service mesh architectures (Istio, Linkerd), rate limiting runs as a sidecar alongside each service. This gives per-service limiting without code changes.
A strong answer mentions at least two options and explains your choice: "I'd implement this at the API gateway layer because we already route all traffic through it, and it keeps rate limiting logic centralized rather than duplicated across services."
Architecture Components

Request flow (a code sketch follows the list):
- Request arrives at the API server
- Rate limiter middleware extracts the throttle key (user ID, IP, etc.)
- Middleware loads the applicable rule from its in-memory cache
- Middleware checks the counter in Redis using an atomic operation
- If under limit → forward to backend, return response with rate limit headers
- If over limit → return HTTP 429 immediately, never touching the backend
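To make the flow concrete, here's a minimal middleware sketch using the JDK's built-in HTTP server types. It reuses the DistributedRateLimiter defined in Step 4; the X-Api-Key header and the 100-requests-per-60-seconds rule are illustrative assumptions, not a prescribed API.

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import java.io.IOException;

public class RateLimitHandler implements HttpHandler {
    private final DistributedRateLimiter limiter; // from Step 4
    private final HttpHandler backend;

    public RateLimitHandler(DistributedRateLimiter limiter, HttpHandler backend) {
        this.limiter = limiter;
        this.backend = backend;
    }

    @Override
    public void handle(HttpExchange exchange) throws IOException {
        // Extract the throttle key: a hypothetical API-key header, falling back to IP.
        String key = exchange.getRequestHeaders().getFirst("X-Api-Key");
        if (key == null) {
            key = exchange.getRemoteAddress().getAddress().getHostAddress();
        }
        if (limiter.allowRequest(key, 60, 100)) { // example rule: 100 per 60s
            backend.handle(exchange);             // under limit: forward
        } else {
            exchange.getResponseHeaders().set("Retry-After", "60");
            exchange.sendResponseHeaders(429, -1); // over limit: reject, no body
            exchange.close();
        }
    }
}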
Rate Limit Rules
Rules are configuration, not code. Store them in a config service or file and cache them in memory:
- endpoint: "/api/v1/messages"
  key: user_id
  limit: 100
  window: 60     # seconds
- endpoint: "/api/v1/auth/login"
  key: ip_address
  limit: 5
  window: 300    # 5 failed logins per 5 minutes
- endpoint: "/api/v1/search"
  key: api_key
  limit: 1000
  window: 3600   # per hour for B2B customers
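As a minimal sketch (an assumption about shape, not a prescribed schema), those rules might be parsed into a record and cached in a map keyed by endpoint. A real system would refresh the map from the config service rather than hardcode it:

import java.util.Map;

public record RateLimitRule(String endpoint, String key, int limit, int windowSec) {}

class RuleCache {
    // Populated once here for illustration; production code would reload
    // this from the config service on a timer or via push.
    private final Map<String, RateLimitRule> rulesByEndpoint = Map.of(
        "/api/v1/messages", new RateLimitRule("/api/v1/messages", "user_id", 100, 60),
        "/api/v1/auth/login", new RateLimitRule("/api/v1/auth/login", "ip_address", 5, 300),
        "/api/v1/search", new RateLimitRule("/api/v1/search", "api_key", 1000, 3600)
    );

    RateLimitRule lookup(String endpoint) {
        return rulesByEndpoint.get(endpoint);
    }
}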
HTTP Headers
Throttled or not, always return rate limit headers. This is a detail that separates thoughtful answers from generic ones:
| Header | Purpose |
|---|---|
| X-RateLimit-Limit | Maximum requests allowed in the window |
| X-RateLimit-Remaining | Requests remaining before throttling |
| X-RateLimit-Reset | Unix timestamp when the window resets |
| Retry-After | Seconds to wait before retrying (only on 429 responses) |
Step 3: Deep Dive — Algorithms
The Five Algorithms (Know the Trade-offs, Not Just the Names)
Interviewers don't want you to recite all five algorithms. They want you to pick one and justify it, then briefly acknowledge alternatives. Here's what you need to know:
Token Bucket
Used by: Amazon (AWS), Stripe
A bucket holds tokens up to a maximum capacity. Tokens refill at a steady rate. Each request consumes one token. No tokens left? Request rejected.
Why it's popular: It naturally handles bursts. If a user was idle for a while, they've accumulated tokens and can make several quick requests. This matches real user behavior — people don't send requests at perfectly uniform intervals.
public class TokenBucket {
    private final int capacity;
    private final double refillRate; // tokens added per second
    private double tokens;
    private long lastRefillNanos;

    public TokenBucket(int capacity, double refillRate) {
        this.capacity = capacity;
        this.refillRate = refillRate;
        this.tokens = capacity; // start full so new keys can burst immediately
        this.lastRefillNanos = System.nanoTime();
    }

    public synchronized boolean tryConsume() {
        refill();
        if (tokens >= 1) {
            tokens--;
            return true;
        }
        return false; // bucket empty: reject
    }

    // Lazily top up the bucket based on time elapsed since the last call.
    private void refill() {
        long now = System.nanoTime();
        double elapsedSec = (now - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSec * refillRate);
        lastRefillNanos = now;
    }
}
When to pick it: General-purpose rate limiting where you want to allow short bursts. Good default choice.
Trade-off you should state: "Token bucket allows bursts up to the bucket capacity, which is usually what we want. But if we need a perfectly smooth request rate — like feeding a downstream payment processor that can only handle exactly 10 TPS — I'd use a leaking bucket instead."
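If the interviewer takes that bait, a minimal leaking-bucket sketch looks like this (the capacity and drain interval are assumptions; e.g., a 100ms interval gives exactly 10 TPS downstream):

import java.util.concurrent.*;

public class LeakingBucket {
    private final BlockingQueue<Runnable> queue;

    public LeakingBucket(int capacity, long drainIntervalMs) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        ScheduledExecutorService drainer = Executors.newSingleThreadScheduledExecutor();
        // Release at most one queued request per interval -- a perfectly smooth output rate.
        drainer.scheduleAtFixedRate(() -> {
            Runnable next = queue.poll();
            if (next != null) next.run();
        }, drainIntervalMs, drainIntervalMs, TimeUnit.MILLISECONDS);
    }

    // Returns false (reject) when the bucket is already full.
    public boolean tryEnqueue(Runnable request) {
        return queue.offer(request);
    }
}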

Sliding Window Counter
Used by: Cloudflare (they published a blog post reporting a 0.003% error rate across 400 million requests with this approach)
This is the algorithm most production systems actually use because it's both memory-efficient and accurate enough. It combines two fixed windows with a weighted average:
Estimated count = currentWindowCount + previousWindowCount * overlapFraction
where overlapFraction is the share of the rolling window that still overlaps the previous fixed window (1 minus how far we are into the current window).
Example: The limit is 100 requests/minute. The previous window had 84 requests, the current window has 36, and we're 25% into the current window, so overlapFraction = 0.75.
Estimated = 36 + 84 * 0.75 = 36 + 63 = 99 → under limit → allow

public class SlidingWindowCounter {
    private final int limit;
    private final long windowMs;
    private int previousCount;
    private int currentCount;
    private long currentWindowStart;

    public SlidingWindowCounter(int limit, long windowMs) {
        this.limit = limit;
        this.windowMs = windowMs;
        this.currentWindowStart = System.currentTimeMillis();
    }

    public synchronized boolean tryConsume() {
        long now = System.currentTimeMillis();
        long elapsedMs = now - currentWindowStart;
        if (elapsedMs >= windowMs) {
            // If more than one full window has passed, the previous window
            // saw no traffic at all -- don't reuse a stale count.
            previousCount = (elapsedMs >= 2 * windowMs) ? 0 : currentCount;
            currentCount = 0;
            currentWindowStart = now - (elapsedMs % windowMs);
        }
        double fraction = (now - currentWindowStart) / (double) windowMs;
        double estimated = currentCount + previousCount * (1.0 - fraction);
        if (estimated < limit) {
            currentCount++;
            return true;
        }
        return false;
    }
}
When to pick it: Most production systems. Memory cost is O(1) per key — just two counters regardless of request volume. Accurate enough for virtually all use cases.
The Other Three (Know Them, Don't Lead With Them)
| Algorithm | One-Line Description | When to Mention |
|---|---|---|
| Leaking Bucket | FIFO queue that drains at a constant rate | When the interviewer asks about smooth output rate (payment processing, order queues) |
| Fixed Window Counter | Count requests in fixed time slots, reset at boundaries | Only to explain its boundary-burst problem as motivation for sliding window |
| Sliding Window Log | Store every request timestamp, count within rolling window | When the interviewer demands exact accuracy and you're OK with O(n) memory per user |
Common mistake: Walking through all five algorithms takes 15 minutes and leaves no time for the distributed design. Pick one, justify it, and move on. If the interviewer wants to hear about others, they'll ask.
Step 4: Deep Dive — Distributed Challenges
This is where strong-hire and no-hire diverge. Anyone can implement a counter on a single server. The interview is really about what happens across 50 servers in 3 data centers.
Challenge 1: Race Conditions
Two servers read the same counter from Redis at the same time, both see "count = 4" (limit is 5), both allow the request, and both write back 5. Six requests have actually been served while the counter says five — one request should have been rejected.

Naive approach (what weak candidates say): "Use a lock." Distributed locks add latency and create contention. At 10,000 requests/second, lock contention becomes the bottleneck.
Production approach (what strong candidates say): Use Redis's atomic INCR command. It increments and returns the new value in a single operation — no read-then-write race. Wrapping INCR in a Lua script also lets us set the window expiry atomically:
-- Redis Lua script: atomic check-and-increment
local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])

local current = redis.call('INCR', key)
if current == 1 then
    -- First request in this window: set the expiry so the key self-cleans.
    redis.call('EXPIRE', key, window)
end
if current > limit then
    return 0 -- rejected
end
return current -- allowed; returns the count used so far in this window
import java.util.List;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

public class DistributedRateLimiter {
    private final JedisPool pool;
    private final String script;

    public DistributedRateLimiter(JedisPool pool) {
        this.pool = pool;
        this.script =
            "local current = redis.call('INCR', KEYS[1]) " +
            "if current == 1 then " +
            "  redis.call('EXPIRE', KEYS[1], ARGV[1]) " +
            "end " +
            "if current > tonumber(ARGV[2]) then " +
            "  return 0 " +
            "end " +
            "return current";
    }

    public boolean allowRequest(String userId, int windowSec, int limit) {
        // Bucket the key by fixed window so it rolls over automatically.
        String key = "rl:" + userId + ":" + (System.currentTimeMillis() / (windowSec * 1000L));
        try (Jedis jedis = pool.getResource()) {
            Long result = (Long) jedis.eval(
                script,
                List.of(key),
                List.of(String.valueOf(windowSec), String.valueOf(limit))
            );
            return result > 0;
        }
    }
}
Challenge 2: Multi-Region Consistency
Your service runs in US-East, US-West, and EU-West. Each region has its own Redis. A user sending requests to different regions could exceed their global limit because each region counts independently.

Option A: Centralized Redis (simple but slow) All regions talk to one Redis cluster. Round-trip latency from EU to US is ~100ms — unacceptable for a rate limiter that should add less than 5ms of latency.
Option B: Local Redis + async sync (fast but approximate) Each region has its own Redis. Counters sync across regions via background replication. A user's true count might be slightly behind, allowing a brief over-limit window.
Option C: Accept regional limits (pragmatic) Instead of a global limit of 100/min, give each region a proportional share: US-East gets 50, US-West gets 30, EU gets 20. No cross-region communication needed.
The strongest answer acknowledges the trade-off explicitly: "Perfect global accuracy requires cross-region latency that violates our latency budget. I'd use local counters with eventual consistency — we might briefly allow 105 requests instead of 100, but that's acceptable for rate limiting. If the interviewer needs exact global limiting, I'd use Option C with proportional regional budgets."
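Option C is simple enough to sketch. Each region enforces its slice with purely local counters — the region names and weights below are illustrative assumptions:

import java.util.Map;

public class RegionalBudget {
    private static final Map<String, Double> REGION_SHARE = Map.of(
        "us-east", 0.5,
        "us-west", 0.3,
        "eu-west", 0.2
    );

    // regionalLimit("eu-west", 100) -> 20; no cross-region calls needed.
    public static int regionalLimit(String region, int globalLimit) {
        return (int) Math.floor(globalLimit * REGION_SHARE.getOrDefault(region, 0.0));
    }
}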
Challenge 3: What If Redis Goes Down?
This is the follow-up question interviewers love to ask after you mention Redis.
Weak answer: "Redis won't go down, it's highly available."
Strong answer: "We need a fallback strategy."
Three fallback options (a fail-open sketch follows the list):
- Fail open — Allow all traffic. The rate limiter is a safety net, not a critical path. Brief over-serving is better than a total outage.
- Local in-memory fallback — Each server maintains a local counter. Not globally accurate, but provides basic protection.
- Redis Cluster with replicas — Use Redis Sentinel or Cluster mode for automatic failover. This is infrastructure-level resilience, not application-level.
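A fail-open wrapper is only a few lines. This sketch assumes the Jedis-based DistributedRateLimiter from Challenge 1; the key point is that a counter-store error turns into an allow, not an outage:

import redis.clients.jedis.exceptions.JedisException;

public class FailOpenRateLimiter {
    private final DistributedRateLimiter delegate;

    public FailOpenRateLimiter(DistributedRateLimiter delegate) {
        this.delegate = delegate;
    }

    public boolean allowRequest(String userId, int windowSec, int limit) {
        try {
            return delegate.allowRequest(userId, windowSec, limit);
        } catch (JedisException e) {
            // Fail open: a rate limiter outage must not become a service outage.
            // A local in-memory counter (option 2 above) could slot in here instead.
            return true;
        }
    }
}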
Common Mistakes
These are real patterns from interview debriefs, not hypothetical problems:
Mistake 1: Algorithm Recitation Without Justification
Describing all five algorithms without picking one shows you've memorized a textbook but can't make engineering decisions. Pick one, justify it with the requirements, and move on.
Mistake 2: Ignoring the "Where Does It Live?" Question
Jumping straight to algorithms without discussing where the rate limiter sits in the architecture. Is it middleware? API gateway? Sidecar? This decision affects everything downstream.
Mistake 3: Single-Server Design for a Distributed Question
If the interviewer said "distributed environment" and your design uses synchronized or a local HashMap, you've missed the core challenge. The single-server solution should take 2 minutes, then pivot to distributed.
Mistake 4: Over-Engineering for Small Scale
When the interviewer says "1,000 users per day," you don't need sharding, distributed caching, and multi-region sync. Scale your design to the stated requirements, then discuss what changes at higher scale.
Mistake 5: Not Mentioning Monitoring
A rate limiter without monitoring is a black box. Strong candidates mention tracking (see the metrics sketch after this list):
- How many requests are being throttled (is the limit too strict?)
- P99 latency of the rate limiter itself (is it adding too much overhead?)
- Counter store availability (is Redis healthy?)
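A bare-bones sketch of those metrics — in practice these would be Prometheus or Micrometer counters with per-endpoint labels, not AtomicLongs:

import java.util.concurrent.atomic.AtomicLong;

public class RateLimiterMetrics {
    final AtomicLong allowed = new AtomicLong();
    final AtomicLong throttled = new AtomicLong();
    final AtomicLong storeErrors = new AtomicLong(); // Redis timeouts/failures

    void record(boolean wasAllowed) {
        (wasAllowed ? allowed : throttled).incrementAndGet();
    }

    // A throttle rate near 0% (limit too loose?) or far above expectation
    // (limit too strict? attack?) both deserve alerts.
    double throttleRate() {
        long total = allowed.get() + throttled.get();
        return total == 0 ? 0.0 : (double) throttled.get() / total;
    }
}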
Interviewer Follow-Up Questions
Prepare for these — they're designed to push beyond rehearsed answers:
"What if a customer complains they're being rate limited unfairly?" You need observability. Log throttle events with the customer ID, endpoint, and current counter value. Provide a dashboard where support can see a customer's recent rate limit status. Consider a "burst credit" system where long-idle customers get a temporary higher limit.
"How would you handle rate limiting for a flash sale?" Token bucket shines here — idle users have accumulated tokens and can burst. Alternatively, temporarily raise limits for specific endpoints via your config system. The key insight: rate limits should be dynamically configurable, not hardcoded.
"Should rate limiting happen before or after authentication?" Both. IP-based rate limiting should happen before auth (to protect the auth service itself from brute force). User-based rate limiting happens after auth (because you need the user identity).
"How do you rate limit WebSocket connections?" Different from HTTP. You can't return 429 on each message — the connection is already open. Options: count messages per connection per window, or use a token bucket that drains as messages are sent and refills over time.
Summary: Your 35-Minute Interview Plan
| Time | What to Do |
|---|---|
| 0-5 min | Clarify requirements: throttle key, single-server vs. distributed, failure mode |
| 5-10 min | High-level architecture: placement, components, request flow |
| 10-18 min | Algorithm: pick one (sliding window counter or token bucket), justify, code the core logic |
| 18-28 min | Distributed challenges: race conditions (atomic INCR), multi-region, Redis failure |
| 28-33 min | Production: monitoring, dynamic config, HTTP headers |
| 33-35 min | Wrap up: state your trade-offs, what you'd improve |
The rate limiter interview is really a distributed systems interview in disguise. The algorithm is the easy part — showing you can reason about race conditions, consistency trade-offs, and failure modes is what gets you the offer.