
API Rate Limiting & Throttling

Protecting your API from abuse, one request at a time

Without rate limiting:

  • ❌ One client can overwhelm your API
  • ❌ DDoS attacks possible
  • ❌ Unfair resource usage
  • ❌ Backend services crash

With rate limiting:

  • ✅ Fair usage for all clients
  • ✅ Protection from abuse
  • ✅ Backend services protected
  • ✅ Predictable costs

Token Bucket

A bucket holds tokens that refill at a fixed rate; each request consumes one token.


Characteristics:

  • ✅ Allows bursts (up to bucket capacity)
  • ✅ Smooths to average rate over time
  • ✅ Simple to implement
  • ✅ Most popular algorithm

Algorithm:

  1. Initialize bucket with capacity tokens
  2. Refill tokens at refill_rate per second
  3. When request arrives:
    • If tokens available → consume token, allow request
    • If no tokens → reject request (429)
"token_bucket.py
import time
from threading import Lock
from typing import Optional
class TokenBucket:
"""Token bucket rate limiter"""
def __init__(self, capacity: int, refill_rate: float):
"""
Args:
capacity: Maximum tokens in bucket
refill_rate: Tokens added per second
"""
self.capacity = capacity
self.refill_rate = refill_rate
self.tokens = capacity
self.last_refill = time.time()
self.lock = Lock()
def acquire(self, tokens: int = 1) -> bool:
"""Try to acquire tokens. Returns True if successful."""
with self.lock:
self._refill()
if self.tokens >= tokens:
self.tokens -= tokens
return True
else:
return False
def _refill(self):
"""Refill tokens based on elapsed time"""
now = time.time()
elapsed = now - self.last_refill
# Calculate tokens to add
tokens_to_add = elapsed * self.refill_rate
# Add tokens (don't exceed capacity)
self.tokens = min(self.capacity, self.tokens + tokens_to_add)
self.last_refill = now
def get_available_tokens(self) -> int:
"""Get current number of available tokens"""
with self.lock:
self._refill()
return int(self.tokens)
# Usage
rate_limiter = TokenBucket(capacity=10, refill_rate=2.0) # 10 tokens, refill 2/sec
def handle_request():
if rate_limiter.acquire():
# Process request
return "Request processed"
else:
return "Rate limit exceeded", 429

Leaky Bucket

A bucket holds queued requests that leak out at a fixed rate. If the bucket is full, new requests are rejected.


Characteristics:

  • ✅ Smooths traffic to constant rate
  • ✅ No bursts allowed
  • ✅ Predictable output rate
  • ❌ Less flexible than token bucket

Algorithm:

  1. Initialize bucket with capacity (queue size)
  2. Process requests at leak_rate per second
  3. When request arrives:
    • If bucket has space → add to bucket
    • If bucket full → reject (429)
"leaky_bucket.py
import time
import queue
from threading import Lock, Thread
from typing import Optional
class LeakyBucket:
"""Leaky bucket rate limiter"""
def __init__(self, capacity: int, leak_rate: float):
"""
Args:
capacity: Maximum requests in bucket
leak_rate: Requests processed per second
"""
self.capacity = capacity
self.leak_rate = leak_rate
self.bucket = queue.Queue(maxsize=capacity)
self.lock = Lock()
self.processing = False
def start_processing(self):
"""Start processing requests"""
if not self.processing:
self.processing = True
Thread(target=self._process_requests, daemon=True).start()
def add_request(self, request) -> bool:
"""Try to add request to bucket. Returns True if successful."""
try:
self.bucket.put_nowait(request)
return True
except queue.Full:
return False
def _process_requests(self):
"""Process requests at leak rate"""
interval = 1.0 / self.leak_rate # Time between requests
while self.processing:
try:
request = self.bucket.get(timeout=interval)
# Process request
self._handle_request(request)
except queue.Empty:
continue
def _handle_request(self, request):
"""Handle processed request"""
# Override in subclass or pass handler
pass
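
A minimal usage sketch, continuing from the class above (the subclass name and request labels are illustrative):

class LoggingLeakyBucket(LeakyBucket):
    """Hypothetical subclass that just logs each drained request."""
    def _handle_request(self, request):
        print(f"processing {request}")

limiter = LoggingLeakyBucket(capacity=5, leak_rate=2.0)  # drain 2 requests/sec
limiter.start_processing()

for i in range(8):
    if not limiter.add_request(f"req-{i}"):
        print(f"req-{i} rejected (429)")  # bucket full

time.sleep(5)  # give the background thread time to drain the queue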

Sliding Window

Tracks individual request timestamps inside a sliding time window. More accurate than a fixed window.


Characteristics:

  • ✅ More accurate than fixed window
  • ✅ No burst at boundaries
  • ❌ More memory intensive (stores timestamps)
  • ❌ More complex to implement
"sliding_window.py
import time
from collections import deque
from threading import Lock
class SlidingWindowRateLimiter:
"""Sliding window rate limiter"""
def __init__(self, limit: int, window_seconds: int):
"""
Args:
limit: Maximum requests allowed
window_seconds: Time window in seconds
"""
self.limit = limit
self.window_seconds = window_seconds
self.requests = deque() # Store request timestamps
self.lock = Lock()
def is_allowed(self) -> bool:
"""Check if request is allowed"""
with self.lock:
now = time.time()
# Remove old requests outside window
while self.requests and self.requests[0] < now - self.window_seconds:
self.requests.popleft()
# Check if under limit
if len(self.requests) < self.limit:
self.requests.append(now)
return True
else:
return False
def get_remaining_requests(self) -> int:
"""Get remaining requests in current window"""
with self.lock:
now = time.time()
# Remove old requests
while self.requests and self.requests[0] < now - self.window_seconds:
self.requests.popleft()
return max(0, self.limit - len(self.requests))
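
A brief usage sketch (the limit values are illustrative):

limiter = SlidingWindowRateLimiter(limit=100, window_seconds=60)

if limiter.is_allowed():
    print("allowed,", limiter.get_remaining_requests(), "requests left this window")
else:
    print("throttled")  # reject with 429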

Fixed Window

Divides time into fixed windows with a counter per window. Simple, but bursts slip through at window boundaries: with a limit of 100 per minute, a client can send 100 requests at 0:59 and another 100 at 1:00, 200 requests in roughly a second (see the sketch after the code below).


Characteristics:

  • ✅ Simple to implement
  • ✅ Low memory usage
  • ❌ Allows bursts at boundaries
  • ❌ Less accurate
"fixed_window.py
import time
from threading import Lock
class FixedWindowRateLimiter:
"""Fixed window rate limiter"""
def __init__(self, limit: int, window_seconds: int):
"""
Args:
limit: Maximum requests per window
window_seconds: Window size in seconds
"""
self.limit = limit
self.window_seconds = window_seconds
self.count = 0
self.window_start = time.time()
self.lock = Lock()
def is_allowed(self) -> bool:
"""Check if request is allowed"""
with self.lock:
now = time.time()
# Check if window expired
if now - self.window_start >= self.window_seconds:
# Reset window
self.count = 0
self.window_start = now
# Check limit
if self.count < self.limit:
self.count += 1
return True
else:
return False
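
A small sketch of the boundary problem (timings illustrative):

# Limit of 5 per 1-second window
limiter = FixedWindowRateLimiter(limit=5, window_seconds=1)

time.sleep(0.9)                                        # near the end of window 1
burst1 = sum(limiter.is_allowed() for _ in range(5))   # all 5 allowed
time.sleep(0.2)                                        # window rolls over
burst2 = sum(limiter.is_allowed() for _ in range(5))   # 5 more allowed
print(burst1 + burst2, "requests allowed in ~0.2s")    # prints 10

A sliding window limiter with the same parameters would reject the second burst, because all ten requests fall inside a single 1-second span.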

Distributed Rate Limiting

An in-process limiter only sees one server's traffic. For multiple servers, keep the bucket state in Redis and update it atomically with a Lua script:

"distributed_rate_limiter.py
import redis
import time
class DistributedTokenBucket:
"""Distributed token bucket using Redis"""
def __init__(self, redis_client: redis.Redis, key_prefix: str,
capacity: int, refill_rate: float):
self.redis = redis_client
self.key_prefix = key_prefix
self.capacity = capacity
self.refill_rate = refill_rate
def is_allowed(self, identifier: str) -> bool:
"""Check if request is allowed for identifier"""
key = f"{self.key_prefix}:{identifier}"
now = time.time()
# Use Lua script for atomic operations
lua_script = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now
-- Refill tokens
local elapsed = now - last_refill
local tokens_to_add = elapsed * refill_rate
tokens = math.min(capacity, tokens + tokens_to_add)
-- Check if can consume token
if tokens >= 1 then
tokens = tokens - 1
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 3600) -- Expire after 1 hour
return 1
else
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 3600)
return 0
end
"""
result = self.redis.eval(lua_script, 1, key,
self.capacity, self.refill_rate, now)
return bool(result)
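
A usage sketch, assuming a Redis instance on localhost (connection details are illustrative):

r = redis.Redis(host="localhost", port=6379)
bucket = DistributedTokenBucket(r, key_prefix="ratelimit",
                                capacity=100, refill_rate=10)

if bucket.is_allowed("user:42"):
    print("allowed")
else:
    print("throttled")  # reject with 429

Because the Lua script runs atomically, concurrent checks from different servers never double-spend the same token.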

Rate Limiting Strategies

Limits are keyed on a client identifier: IP address, authenticated user, API key, or subscription tier. The Redis-backed limiter above accepts an identifier, so a single instance can track many clients (get_user_id and get_user_tier are application helpers):

# Shared limiter: a 100-token bucket per identifier, refilled at 10 tokens/sec
# (redis_client as created in the sketch above)
rate_limiter = DistributedTokenBucket(redis_client, "ratelimit",
                                      capacity=100, refill_rate=10)

# By client IP
client_ip = request.remote_addr
if not rate_limiter.is_allowed(f"ip:{client_ip}"):
    return "Rate limit exceeded", 429

# By authenticated user
user_id = get_user_id(request)
if not rate_limiter.is_allowed(f"user:{user_id}"):
    return "Rate limit exceeded", 429

# By API key
api_key = request.headers.get('X-API-Key')
if not rate_limiter.is_allowed(f"key:{api_key}"):
    return "Rate limit exceeded", 429

# Different limits for different subscription tiers
limits = {
    'free': TokenBucket(capacity=100, refill_rate=10),
    'premium': TokenBucket(capacity=1000, refill_rate=100),
    'enterprise': TokenBucket(capacity=10000, refill_rate=1000),
}
tier = get_user_tier(user_id)
if not limits[tier].acquire():
    return "Rate limit exceeded", 429

Rate Limit Headers

Inform clients about their limits on every response:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1640995200

When the limit is exceeded:

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640995200
Retry-After: 60
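
A minimal Flask sketch for attaching these headers, assuming a single global SlidingWindowRateLimiter (a real API would key the limiter per client, as shown earlier; X-RateLimit-Reset is omitted for brevity):

from flask import Flask, jsonify

app = Flask(__name__)
limiter = SlidingWindowRateLimiter(limit=100, window_seconds=60)

@app.before_request
def check_rate_limit():
    # Returning a response here short-circuits request handling
    if not limiter.is_allowed():
        response = jsonify(error="Rate limit exceeded")
        response.status_code = 429
        response.headers['Retry-After'] = '60'  # at most one full window away
        return response

@app.after_request
def add_rate_limit_headers(response):
    response.headers['X-RateLimit-Limit'] = str(limiter.limit)
    response.headers['X-RateLimit-Remaining'] = str(limiter.get_remaining_requests())
    return response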

Algorithm Comparison

Algorithm      | Bursts | Accuracy  | Memory | Complexity
---------------|--------|-----------|--------|-----------
Token Bucket   | ✅ Yes | High      | Low    | Medium
Leaky Bucket   | ❌ No  | High      | Medium | Medium
Sliding Window | ❌ No  | Very High | High   | High
Fixed Window   | ✅ Yes | Medium    | Low    | Low

Recommendation: Use Token Bucket for most cases. It’s simple, accurate, and allows bursts.


Key Takeaways

🪣 Token Bucket: Most Popular

Token bucket allows bursts while smoothing to the average rate over time. It is the most widely used algorithm.

📊 Sliding Window: Most Accurate

Sliding window is most accurate but uses more memory. Use when accuracy is critical.

🌐 Distributed: Use Redis

For multiple servers, use Redis with Lua scripts for atomic operations.

🔢 Return 429

When the rate limit is exceeded, return HTTP 429 with a Retry-After header, and advertise limits to clients via the X-RateLimit-* headers.


Where to go next:

  • Review API Gateway - rate limiting is often implemented in gateways
  • Learn Caching - cache rate limit data for performance
  • Understand Distributed Systems - distributed rate limiting challenges