API Rate Limiting: Complete Guide for 2026
Every production API needs rate limiting. Here's everything you need to know: algorithms, implementations, and the mistakes that will wake you up at 3 AM.
What Is Rate Limiting (and Why You Can't Skip It)
Rate limiting controls how many requests a client can make to your API within a given time period. It's not optional. Without it, a single misbehaving client (or a targeted DDoS attack) can take down your entire service.
In 2026, with AI agents making automated API calls at machine speed, rate limiting is more critical than ever. A single autonomous agent can generate thousands of requests per second without human throttling. Your infrastructure needs to handle that gracefully.
Rate limiting serves three core purposes:
- Protection: Prevent abuse, brute-force attacks, and accidental traffic spikes from crashing your servers
- Fairness: Ensure one heavy user doesn't degrade service quality for everyone else
- Cost control: Especially critical with pay-per-use infrastructure (serverless, managed databases, third-party API calls)
💡 Quick math: An unthrottled API endpoint handling database queries at 5ms each can process 200 req/s per instance. One aggressive scraper sending 1,000 req/s will queue 800 requests per second, eventually exhausting your connection pool. Use our API Cost Calculator to model your limits.
The Four Core Rate Limiting Algorithms
Every rate limiter you've ever used is built on one of these four patterns. Understanding the tradeoffs is the difference between a rate limiter that works and one that fails under load.
1. Fixed Window Counter
The simplest approach. Divide time into fixed intervals (e.g., 1-minute windows), count requests per window, reject when the count exceeds the limit.
// Fixed Window - Conceptual model
// Window: [00:00 - 01:00] → counter = 0
// Request at 00:15 → counter = 1 ✅
// Request at 00:45 → counter = 2 ✅
// ...
// Request at 00:59 → counter = 100 ✅ (at limit)
// Request at 00:59 → counter = 101 ❌ REJECTED
// Window resets at 01:00 → counter = 0
Pros: Dead simple to implement, minimal memory (one counter per key per window), fast O(1) operations.
Cons: The boundary problem. A client can send 100 requests at 00:59 and another 100 at 01:01, making 200 requests in roughly 2 seconds while technically staying within a "100 per minute" limit. This burst at window boundaries can be twice your intended rate.
Best for: Simple internal services, non-critical endpoints, situations where boundary bursts are acceptable.
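For concreteness, the fixed-window check fits in a few lines. This is a single-process, in-memory sketch (the `Map` store and function names are illustrative, not a production implementation):

```javascript
// In-memory fixed-window counter (single process, illustrative).
// Keys are "<client>:<window index>", values are request counts.
const windows = new Map();

function fixedWindowAllow(key, limit, windowMs, now = Date.now()) {
  const windowIndex = Math.floor(now / windowMs);
  const bucketKey = `${key}:${windowIndex}`;
  const count = (windows.get(bucketKey) || 0) + 1;
  windows.set(bucketKey, count);
  return count <= limit; // reject once the window's counter passes the limit
}
```

The boundary problem is visible here: a request just before a window boundary and one just after land in different buckets, so both pass.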
2. Sliding Window Log
Instead of fixed intervals, maintain a timestamped log of all requests. For each new request, remove entries older than the window size, then count remaining entries.
// Sliding Window Log
// Window size: 60 seconds, Limit: 100
//
// Request at T=150 → log: [150], count=1 ✅
// Request at T=155 → log: [150, 155], count=2 ✅
// ...
// Request at T=200 → Remove entries before T=140
// log: [150, 155, ... 200], count=87 ✅
// No boundary problem: the window slides with each request
Pros: Perfectly accurate. No boundary bursts. The window truly slides.
Cons: Memory-hungry. You're storing every request timestamp. At high volume, this gets expensive fast. O(n) cleanup on each request.
Best for: Low-to-medium traffic APIs where accuracy matters more than memory. Authentication endpoints, payment processing.
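A minimal sketch of the log-based check, again in-memory and single-process (names are illustrative). The filter on every call is the O(n) cleanup mentioned above:

```javascript
// In-memory sliding-window log (single process, illustrative).
// logs maps each client key to an array of request timestamps (ms).
const logs = new Map();

function slidingLogAllow(key, limit, windowMs, now = Date.now()) {
  // Drop entries that have aged out of the window (O(n) per request)
  const entries = (logs.get(key) || []).filter((t) => t > now - windowMs);
  if (entries.length >= limit) {
    logs.set(key, entries); // persist the pruned log even on rejection
    return false;
  }
  entries.push(now);
  logs.set(key, entries);
  return true;
}
```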
3. Sliding Window Counter
A hybrid that combines the memory efficiency of fixed windows with the accuracy of sliding windows. Use two adjacent fixed windows and weight their counts based on the current position within the window.
// Sliding Window Counter
// Limit: 100 req/min
// Previous window (00:00-01:00): 84 requests
// Current window (01:00-02:00): 36 requests
// Current time: 01:15 (25% into current window)
//
// Weighted count = (84 × 0.75) + (36 × 1.0) = 63 + 36 = 99
// Next request? 99 < 100 → ALLOWED
Pros: Smooths out boundary spikes. Low memory (just two counters per key). Near-accurate sliding behavior. This is what most production systems use.
Cons: It's an approximation. Not perfectly accurate, but close enough for virtually all use cases. Cloudflare reported only 0.003% of requests were incorrectly allowed or rejected with this approach.
Best for: Production APIs. This is the default choice unless you have a specific reason for something else.
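The weighted-count math from the example above is one line of arithmetic. A small sketch (function names are my own):

```javascript
// Sliding-window counter: weight the previous window's count by how much
// of it still overlaps the sliding window, then add the current count.
function weightedCount(prevCount, currCount, elapsedFraction) {
  // elapsedFraction: position within the current window, 0..1
  return prevCount * (1 - elapsedFraction) + currCount;
}

function slidingCounterAllow(prevCount, currCount, elapsedFraction, limit) {
  return weightedCount(prevCount, currCount, elapsedFraction) < limit;
}
```

With the article's numbers, `weightedCount(84, 36, 0.25)` gives 99, so the next request is allowed.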
4. Token Bucket
Imagine a bucket that holds tokens. Tokens are added at a fixed rate (e.g., 10 per second). Each request consumes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, which allows controlled bursts.
// Token Bucket
// Bucket capacity: 100 tokens
// Refill rate: 10 tokens/second
//
// T=0: Bucket = 100 (full)
// T=0: 50 requests → Bucket = 50 ✅ (burst allowed!)
// T=1: Bucket = 60 (refilled 10)
// T=1: 60 requests → Bucket = 0 ✅
// T=2: Bucket = 10 (refilled 10)
// T=2: 15 requests → 10 ✅, 5 ❌ REJECTED
Pros: Naturally handles bursts (bucket capacity = burst size). Smooth long-term rate. Used by AWS, Stripe, and most major APIs. Two parameters give you fine-grained control: sustained rate and burst size.
Cons: Slightly more complex to implement. Distributed implementations require careful synchronization.
Best for: Public APIs, user-facing rate limits, anywhere you want to allow short bursts while enforcing a sustained rate.
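What makes token buckets cheap to run is lazy refilling: store only a token count and a last-refill timestamp, and top up from elapsed time whenever a request arrives, rather than running a timer. An in-process sketch with illustrative names (the Redis implementation in the next section handles the distributed case):

```javascript
// In-memory token bucket (single process, illustrative).
// Tokens are refilled lazily from the elapsed time since the last call.
function makeTokenBucket(capacity, refillRatePerSec, now = Date.now()) {
  let tokens = capacity;
  let lastRefill = now;
  return function take(at = Date.now()) {
    const elapsed = (at - lastRefill) / 1000;
    tokens = Math.min(capacity, tokens + elapsed * refillRatePerSec);
    lastRefill = at;
    if (tokens >= 1) {
      tokens -= 1;
      return true;
    }
    return false;
  };
}
```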
Algorithm Comparison
| Algorithm | Memory | Accuracy | Burst Handling | Complexity |
|---|---|---|---|---|
| Fixed Window | Very Low | Low | Poor (boundary bursts) | Trivial |
| Sliding Log | High | Perfect | Good | Moderate |
| Sliding Counter | Low | Near-perfect | Good | Moderate |
| Token Bucket | Low | Good | Excellent (configurable) | Moderate |
Implementation: Node.js Rate Limiter
Here's a production-ready token bucket rate limiter using Redis. This handles distributed deployments where multiple server instances share rate limit state.
// token-bucket-limiter.js
import Redis from 'ioredis';
const redis = new Redis(process.env.REDIS_URL);
/**
* Token Bucket Rate Limiter
* @param {string} key - Unique identifier (e.g., user ID, IP, API key)
* @param {number} capacity - Max tokens (burst size)
* @param {number} refillRate - Tokens added per second
* @returns {{ allowed: boolean, remaining: number, retryAfter: number }}
*/
export async function checkRateLimit(key, capacity = 100, refillRate = 10) {
const now = Date.now();
const bucketKey = `rl:${key}`;
// Lua script for atomic operation
const luaScript = `
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refillRate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local bucket = redis.call('HMGET', key, 'tokens', 'lastRefill')
local tokens = tonumber(bucket[1])
local lastRefill = tonumber(bucket[2])
-- Initialize bucket if new
if tokens == nil then
tokens = capacity
lastRefill = now
end
-- Refill tokens based on elapsed time
local elapsed = (now - lastRefill) / 1000
tokens = math.min(capacity, tokens + (elapsed * refillRate))
lastRefill = now
local allowed = 0
local remaining = math.floor(tokens)
if tokens >= 1 then
tokens = tokens - 1
allowed = 1
remaining = math.floor(tokens)
end
-- Store updated state (expire key after 2x window to auto-cleanup)
redis.call('HMSET', key, 'tokens', tokens, 'lastRefill', lastRefill)
redis.call('EXPIRE', key, math.ceil(capacity / refillRate) * 2)
return {allowed, remaining}
`;
const [allowed, remaining] = await redis.eval(
luaScript, 1, bucketKey, capacity, refillRate, now
);
return {
allowed: allowed === 1,
remaining,
retryAfter: allowed === 1 ? 0 : Math.ceil(1 / refillRate),
};
}
// Express middleware
export function rateLimitMiddleware(capacity = 100, refillRate = 10) {
return async (req, res, next) => {
const key = req.headers['x-api-key'] || req.ip;
const result = await checkRateLimit(key, capacity, refillRate);
// Always set rate limit headers (RFC 6585 / draft-ietf-httpapi-ratelimit-headers)
res.set({
'X-RateLimit-Limit': capacity,
'X-RateLimit-Remaining': result.remaining,
'X-RateLimit-Reset': Math.ceil(Date.now() / 1000) + result.retryAfter,
'RateLimit-Policy': `${capacity};w=${Math.ceil(capacity / refillRate)}`,
});
if (!result.allowed) {
res.set('Retry-After', result.retryAfter);
return res.status(429).json({
error: 'Too Many Requests',
message: `Rate limit exceeded. Try again in ${result.retryAfter}s.`,
retryAfter: result.retryAfter,
});
}
next();
};
}
🔧 Tip: Use Lua scripts for Redis-based rate limiters. They execute atomically, eliminating race conditions between read-check-update operations. Without atomicity, two concurrent requests can both read "99 remaining" and both proceed, exceeding your limit.
Implementation: Python Rate Limiter
Here's the equivalent sliding window counter in Python, suitable for FastAPI or Django REST Framework:
# sliding_window_limiter.py
import time
import redis.asyncio as redis
pool = redis.ConnectionPool.from_url("redis://localhost:6379")
async def check_rate_limit(
key: str,
limit: int = 100,
window_seconds: int = 60,
) -> dict:
"""
Sliding window counter rate limiter.
Returns: {"allowed": bool, "remaining": int, "reset": int}
"""
r = redis.Redis(connection_pool=pool)
now = time.time()
current_window = int(now // window_seconds)
previous_window = current_window - 1
window_position = now % window_seconds / window_seconds
curr_key = f"rl:{key}:{current_window}"
prev_key = f"rl:{key}:{previous_window}"
async with r.pipeline(transaction=True) as pipe:
pipe.get(prev_key)
pipe.get(curr_key)
prev_count, curr_count = await pipe.execute()
prev_count = int(prev_count or 0)
curr_count = int(curr_count or 0)
# Weighted count: previous window weight decreases as we move through current
weighted = prev_count * (1 - window_position) + curr_count
remaining = max(0, limit - int(weighted) - 1)
allowed = weighted < limit
if allowed:
async with r.pipeline(transaction=True) as pipe:
pipe.incr(curr_key)
pipe.expire(curr_key, window_seconds * 2)
await pipe.execute()
reset_at = (current_window + 1) * window_seconds
return {
"allowed": allowed,
"remaining": remaining if allowed else 0,
"reset": int(reset_at),
}
# FastAPI middleware example
from fastapi import Request, HTTPException
from starlette.middleware.base import BaseHTTPMiddleware
class RateLimitMiddleware(BaseHTTPMiddleware):
def __init__(self, app, limit=100, window=60):
super().__init__(app)
self.limit = limit
self.window = window
async def dispatch(self, request: Request, call_next):
key = request.headers.get("x-api-key", request.client.host)
result = await check_rate_limit(key, self.limit, self.window)
if not result["allowed"]:
raise HTTPException(
status_code=429,
detail="Rate limit exceeded",
headers={
"Retry-After": str(result["reset"] - int(time.time())),
"X-RateLimit-Remaining": "0",
},
)
response = await call_next(request)
response.headers["X-RateLimit-Limit"] = str(self.limit)
response.headers["X-RateLimit-Remaining"] = str(result["remaining"])
response.headers["X-RateLimit-Reset"] = str(result["reset"])
return response
Standard Rate Limit Response Headers
In 2026, the IETF rate limit headers draft has become the de facto standard. Here's what you should include in every response:
| Header | Purpose | Example |
|---|---|---|
| `RateLimit-Limit` | Maximum requests allowed in the window | `100` |
| `RateLimit-Remaining` | Requests remaining in current window | `47` |
| `RateLimit-Reset` | Seconds until the window resets | `30` |
| `Retry-After` | Seconds to wait before retrying (only on 429) | `5` |
| `RateLimit-Policy` | Machine-readable policy description | `100;w=60` |
Always return these headers on every response, not just 429s. Clients should be able to self-throttle before hitting the limit. Good API design is about making the happy path obvious.
Production Best Practices
1. Use Tiered Rate Limits
Don't apply a single rate limit to your entire API. Different endpoints have different costs. A search endpoint hitting your database is more expensive than a health check.
// Tiered limits by endpoint category (refillRate is tokens per second,
// so a per-minute limit divides by 60)
const tiers = {
  auth:   { capacity: 5,    refillRate: 5 / 60 },    // 5/min - strict
  search: { capacity: 30,   refillRate: 30 / 60 },   // 30/min - moderate
  read:   { capacity: 100,  refillRate: 100 / 60 },  // 100/min - generous
  write:  { capacity: 50,   refillRate: 50 / 60 },   // 50/min - moderate
  health: { capacity: 1000, refillRate: 1000 / 60 }, // 1000/min - relaxed
};
2. Identify Clients Correctly
Rate limiting by IP alone is fragile. NAT gateways, corporate proxies, and VPNs can funnel thousands of legitimate users through a single IP. Prefer API keys for authenticated endpoints, and use a combination of IP + fingerprinting for anonymous traffic.
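A sketch of that keying strategy (the header names follow the article's earlier examples; the `key:`/`ip:` namespacing prefixes are my own, added so an API key string can't collide with an IP-based key):

```javascript
// Derive a rate-limit key: prefer the API key, fall back to client IP.
function clientKey(req) {
  const apiKey = req.headers['x-api-key'];
  if (apiKey) return `key:${apiKey}`;
  // Behind a trusted proxy, take the left-most X-Forwarded-For entry;
  // otherwise use the socket address.
  const forwarded = (req.headers['x-forwarded-for'] || '').split(',')[0].trim();
  return `ip:${forwarded || req.ip}`;
}
```

Only trust `X-Forwarded-For` when your proxy strips client-supplied values; otherwise it is trivially spoofable.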
3. Handle 429 Responses Gracefully (Client Side)
If you're consuming a rate-limited API, implement exponential backoff with jitter. Don't hammer a 429-returning endpoint in a tight loop.
async function fetchWithRetry(url, options = {}, maxRetries = 3) {
for (let attempt = 0; attempt <= maxRetries; attempt++) {
const response = await fetch(url, options);
if (response.status !== 429) return response;
const retryAfter = parseInt(response.headers.get('Retry-After') || '1', 10);
const jitter = Math.random() * 1000;
const delay = retryAfter * 1000 * Math.pow(2, attempt) + jitter;
console.warn(`Rate limited. Retrying in ${Math.round(delay / 1000)}s...`);
await new Promise(r => setTimeout(r, delay));
}
throw new Error(`Failed after ${maxRetries} retries due to rate limiting`);
}
4. Separate Rate Limits by Authentication Level
Free tier users, paid users, and internal services should have different limits. This is both a business model and an infrastructure protection strategy.
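A minimal way to wire that up is a per-plan lookup that feeds the limiter's parameters. The plan names and numbers below are assumptions for illustration:

```javascript
// Per-plan token bucket parameters (illustrative values).
const planLimits = {
  free:     { capacity: 60,    refillRate: 1 },   // 1/s sustained, burst 60
  pro:      { capacity: 600,   refillRate: 10 },
  internal: { capacity: 10000, refillRate: 500 },
};

function limitsForPlan(plan) {
  // Unknown or missing plans get the most restrictive tier
  return planLimits[plan] || planLimits.free;
}
```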
5. Monitor and Alert
Track your rate limiting metrics: how many requests are being throttled, which clients are hitting limits, and whether your limits are set correctly. If 30% of legitimate requests are getting 429s, your limits are too low. If nobody ever hits them, they might be too high to matter.
⚠️ Common mistake: Don't apply rate limiting after expensive operations. If your endpoint does a database query, calls a third-party API, and then checks the rate limit, you've already incurred the cost. Rate limiting should be the first middleware in your request pipeline.
6. Provide Clear Error Messages
When a request is rate limited, tell the client exactly what happened, what their limit is, and when they can retry. Vague 429s without headers or messages create support tickets.
// Good 429 response
{
"error": "rate_limit_exceeded",
"message": "You've exceeded 100 requests per minute. Please retry after 23 seconds.",
"limit": 100,
"window": "60s",
"retryAfter": 23,
"docs": "https://api.example.com/docs/rate-limiting"
}
7. Consider Cost-Based Rate Limiting
In the AI API era, not all requests cost the same. A GPT-4 request with a 32K context window costs 100x more than a simple embedding call. Assign weights to requests and deduct from the bucket accordingly:
// Cost-weighted rate limiting
const costs = {
'GET /users': 1, // cheap read
  'POST /search': 5, // moderate: hits search index
  'POST /ai/generate': 50, // expensive: calls LLM
  'POST /ai/image': 100, // very expensive: image generation
};
// Each request deducts its cost from the token bucket
// Bucket capacity: 1000 "credits" per minute
// A user can make 1000 cheap reads, OR 10 AI image generations
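Mechanically, this is the token bucket from earlier with a variable deduction instead of a fixed one. A sketch (names and values illustrative):

```javascript
// Cost-weighted token bucket: each request deducts its cost in credits.
function makeCreditBucket(capacity, refillRatePerSec, now = 0) {
  let credits = capacity;
  let lastRefill = now;
  return function spend(cost, at) {
    const elapsed = (at - lastRefill) / 1000;
    credits = Math.min(capacity, credits + elapsed * refillRatePerSec);
    lastRefill = at;
    if (credits >= cost) {
      credits -= cost;
      return true;
    }
    return false;
  };
}

// 1000 credits/min: room for 1000 cheap reads OR 10 image generations
const spend = makeCreditBucket(1000, 1000 / 60, 0);
```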
Testing Your Rate Limiter
Rate limiters are notoriously hard to test because they depend on time and concurrency. Here are the scenarios you must cover:
- Basic limit enforcement: Send limit+1 requests, verify the last is rejected
- Window reset: Wait for window expiry, verify requests are allowed again
- Concurrent requests: Send many parallel requests, verify the total allowed doesn't exceed the limit
- Different keys: Verify client A's usage doesn't affect client B
- Header correctness: Verify the remaining count decrements correctly
- Clock drift: In distributed systems, test with slightly skewed clocks
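As a concrete starting point, the first, second, and fourth scenarios can be scripted against any limiter with a `(key, now) -> allowed` interface. The tiny fixed-window limiter here is a stand-in for your real implementation:

```javascript
// Stand-in limiter for the test sketch (fixed window, in-memory).
function makeLimiter(limit, windowMs) {
  const counts = new Map();
  return (key, now) => {
    const k = `${key}:${Math.floor(now / windowMs)}`;
    const n = (counts.get(k) || 0) + 1;
    counts.set(k, n);
    return n <= limit;
  };
}

function runScenarios() {
  const allow = makeLimiter(5, 1000);
  // 1. Basic enforcement: limit+1 requests, the last is rejected
  const results = [];
  for (let i = 0; i < 6; i++) results.push(allow('clientA', 0));
  const basicOk = results.slice(0, 5).every(Boolean) && results[5] === false;
  // 2. Window reset: after expiry, requests are allowed again
  const resetOk = allow('clientA', 1000) === true;
  // 4. Different keys: clientB is unaffected by clientA's usage
  const isolationOk = allow('clientB', 0) === true;
  return { basicOk, resetOk, isolationOk };
}
```

Injecting `now` as a parameter, rather than calling `Date.now()` inside the limiter, is what makes the time-dependent scenarios deterministic.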
Use our JSON Formatter to inspect API response bodies, and our Regex Tester to validate rate limit header patterns in your integration tests.
Wrapping Up
Rate limiting isn't glamorous, but it's one of those things that separates production-quality APIs from hobby projects. The sliding window counter is your best starting point: it's accurate enough, memory-efficient, and battle-tested at scale by Cloudflare.
Start simple (fixed window), graduate to sliding window counter when you need accuracy, and reach for token bucket when you need burst control. Whatever you choose, ship it before you need it. Adding rate limiting after a production incident is a lot less fun than having it from day one.
For quick calculations while designing your rate limits, check out our API Cost Calculator. And if you're building an AI agent that needs to respect rate limits autonomously, take a look at OpenClaw โ we built that part in from the start.