OpenAI enforces limits on several axes at once — requests per minute (RPM), tokens per minute (TPM), and on lower tiers requests/tokens per day — separately for each model. You can be comfortably under one axis and blocked by another, which is why “I’m nowhere near my request limit” and a 429 happily coexist. This page explains how to read your real limits, watch your headroom in real time, and engineer so you rarely hit them.
The axes that throttle you
- RPM: requests per minute. A rolling 60-second window. Bursty parallelism trips it even when your average rate is low: 600 requests fired in 10 seconds is 3,600 RPM in that window.
- TPM: tokens per minute. Counts input + output tokens. Long context, RAG payloads, and image inputs burn TPM fast, which makes it the usual bottleneck for serious workloads. OpenAI estimates the request’s max token cost up front (input + your max_tokens), so a large max_tokens reserves budget even if the model replies briefly.
- RPD / TPD: per day. Present on Free and the lowest tiers. The daily cap catches people whose per-minute math looks fine but who run all day.
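To make the first two axes concrete, here is a minimal sketch of the arithmetic. The helper names are hypothetical, and the reservation formula only restates the description above (input plus max_tokens); OpenAI's actual estimator may differ in detail.

```python
def burst_rpm(requests: int, seconds: float) -> float:
    """Per-minute rate implied by a burst of `requests` sent over `seconds`."""
    return requests * 60.0 / seconds


def reserved_tokens(input_tokens: int, max_tokens: int) -> int:
    """Tokens assumed to be reserved up front: input plus declared max output."""
    return input_tokens + max_tokens


# 600 requests in 10 seconds look like 3,600 RPM to a per-minute limiter.
print(burst_rpm(600, 10))            # 3600.0

# Same 1,500-token prompt, different max_tokens: the reservation differs
# even if both replies end up only a few tokens long.
print(reserved_tokens(1500, 4096))   # 5596
print(reserved_tokens(1500, 256))    # 1756
```

The second pair is the reason a generous max_tokens can block a request that would have fit comfortably with a tighter one.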
The error’s type field (requests vs tokens) tells you which ceiling you hit. Full troubleshooting lives in the OpenAI 429 fix; this page is the reference behind it.
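As a rough sketch of acting on that field, the helper below takes an already-parsed 429 body and reports which axis was hit. The payload shape (`{"error": {"type": ...}}`) and the function names are assumptions for illustration, and the sample body is made up; check a real error response before relying on it.

```python
def throttled_axis(error_body: dict) -> str:
    """Return 'requests', 'tokens', or 'unknown' for a parsed 429 body."""
    # Assumed shape: {"error": {"type": "requests" | "tokens", ...}}
    error = error_body.get("error") or {}
    kind = str(error.get("type", "")).lower()
    return kind if kind in ("requests", "tokens") else "unknown"


def suggested_fix(axis: str) -> str:
    """Map the throttled axis to the first remedy worth trying."""
    fixes = {
        "requests": "pace or batch your calls; bursts are tripping RPM",
        "tokens": "trim context or max_tokens, or move up a tier",
    }
    return fixes.get(axis, "inspect the error message and response headers")


# Illustrative body only, not captured from a real response.
sample = {"error": {"type": "tokens", "message": "Rate limit reached"}}
axis = throttled_axis(sample)
print(axis, "->", suggested_fix(axis))
```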
Read the headers — adapt before you 429
Every response reports exactly how much headroom remains. Reacting to these is far better than waiting for the 429:
| Header | Meaning |
|---|---|
| x-ratelimit-limit-requests | Your RPM ceiling for this model |
| x-ratelimit-remaining-requests | Requests left in the current window |
| x-ratelimit-reset-requests | Time until the request window resets |
| x-ratelimit-limit-tokens | Your TPM ceiling for this model |
| x-ratelimit-remaining-tokens | Tokens left in the current window |
| x-ratelimit-reset-tokens | Time until the token window resets |
| retry-after | Seconds to wait (sent on some 429s) |
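Because retry-after is only sent on some 429s, a retry path needs a fallback. The sketch below prefers the header when it is present and numeric, and otherwise falls back to capped exponential backoff with full jitter. The function name, the `base` and `cap` defaults, and the use of a plain dict with lowercase keys are assumptions for illustration.

```python
import random


def retry_delay(headers: dict, attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based) after a 429."""
    raw = headers.get("retry-after")
    if raw is not None:
        try:
            return max(0.0, float(raw))
        except ValueError:
            pass  # Not a plain number of seconds; fall through to backoff.
    # Full jitter: uniform between 0 and the capped exponential ceiling.
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


print(retry_delay({"retry-after": "7"}, attempt=0))  # 7.0
print(retry_delay({}, attempt=3))                    # somewhere in [0, 8]
```

Jitter matters for fan-out jobs: without it, every worker that was throttled together retries together.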
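curl -sS -D - https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"hi"}]}' \
  -o /dev/null | grep -i x-ratelimit

import re
import time
from openai import OpenAI

client = OpenAI()

def reset_seconds(value):
    """Convert a reset header such as "1s", "6m0s", or "250ms" to seconds."""
    # The duration format is assumed from observed values; fall back to 1s.
    units = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
    parts = re.findall(r"(\d+(?:\.\d+)?)(ms|s|m|h)", value)
    if not parts:
        return 1.0
    return sum(float(n) * units[u] for n, u in parts)

def chat_adaptive(messages, model="gpt-4o-mini", floor=2000):
    # with_raw_response exposes the HTTP headers alongside the parsed body.
    resp = client.chat.completions.with_raw_response.create(model=model, messages=messages)
    h = resp.headers
    remaining = int(h.get("x-ratelimit-remaining-tokens", "999999"))
    if remaining < floor:
        # Pre-emptively wait out the window instead of risking a 429 next call.
        reset = h.get("x-ratelimit-reset-tokens", "1s")
        print(f"Low token headroom ({remaining}); pausing ~{reset}")
        time.sleep(reset_seconds(reset))
    return resp.parse()

print(chat_adaptive([{"role": "user", "content": "Hello"}]).choices[0].message.content)

Usage-tier table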
Limits scale automatically as your account ages and spends. Tiers gate on both cumulative spend and time since first payment — you must meet both. The numbers below are representative for gpt-4o; confirm yours in Settings → Limits, as they change and vary by model.
| Tier | Qualifies at | RPM | TPM |
|---|---|---|---|
| Free | No billing set up | 3 | 40,000 |
| Tier 1 | $5+ paid | 500 | 30,000 |
| Tier 2 | $50+ & 7 days | 5,000 | 450,000 |
| Tier 3 | $100+ & 7 days | 5,000 | 800,000 |
| Tier 4 | $250+ & 14 days | 10,000 | 2,000,000 |
| Tier 5 | $1,000+ & 30 days | 10,000 | 30,000,000 |
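The "must meet both" rule is easy to misread, so here is a small sketch of it. The thresholds are copied from the representative table above and may not match your account; the `usage_tier` name and the `TIERS` layout are hypothetical.

```python
# (name, minimum cumulative spend in USD, minimum days since first payment)
TIERS = [
    ("Tier 5", 1000, 30),
    ("Tier 4", 250, 14),
    ("Tier 3", 100, 7),
    ("Tier 2", 50, 7),
    ("Tier 1", 5, 0),
]


def usage_tier(spend_usd: float, days_since_first_payment: int) -> str:
    """Highest tier whose spend AND age requirements are both satisfied."""
    for name, min_spend, min_days in TIERS:
        if spend_usd >= min_spend and days_since_first_payment >= min_days:
            return name
    return "Free"


print(usage_tier(300, 10))   # Tier 3: spend qualifies for Tier 4, age does not
print(usage_tier(2000, 3))   # Tier 1: heavy spend cannot skip the waiting period
```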
Smaller models (e.g. gpt-4o-mini) carry far higher RPM/TPM than flagship models. Routing tolerant tasks to a mini model is often the fastest way to add headroom without touching your tier.
The pattern that matters: moving up a tier raises TPM far more than RPM. If big prompts are 429ing you, tier up or trim context — don’t just slow your request rate.
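A quick way to see which ceiling binds is to convert TPM into a request rate for your typical request size. This is a back-of-the-envelope sketch using the representative Tier 1 and Tier 5 figures from the table; the function names are hypothetical and `tokens_per_request` should include input plus max_tokens.

```python
def effective_rpm(rpm_limit: int, tpm_limit: int, tokens_per_request: int) -> float:
    """Sustainable requests per minute once both ceilings are applied."""
    return float(min(rpm_limit, tpm_limit / tokens_per_request))


def binding_axis(rpm_limit: int, tpm_limit: int, tokens_per_request: int) -> str:
    """Name the ceiling that is reached first at this request size."""
    return "TPM" if tpm_limit / tokens_per_request < rpm_limit else "RPM"


# Tier 1 figures: 500 RPM, 30,000 TPM, 2,000 tokens per request.
print(effective_rpm(500, 30_000, 2_000), binding_axis(500, 30_000, 2_000))
# 15.0 TPM

# Tier 5 figures: 10,000 RPM, 30,000,000 TPM, same request size.
print(effective_rpm(10_000, 30_000_000, 2_000), binding_axis(10_000, 30_000_000, 2_000))
# 10000.0 RPM
```

With 2,000-token requests, the Tier 1 numbers allow only 15 requests a minute despite a nominal 500 RPM, which is why slowing the request rate alone rarely fixes token-driven 429s.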
Engineering around the limits
Throttle batches with a token bucket
For any fan-out, cap your send rate just below the ceiling so concurrent workers and clock skew don’t push you over:
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI()

class RateLimiter:
    """Smooth token bucket: at most `rpm` requests per 60s."""
    def __init__(self, rpm: int):
        self.interval = 60.0 / rpm
        self._lock = asyncio.Lock()
        self._next = time.monotonic()

    async def acquire(self):
        # Reserve the next send slot under the lock, then sleep outside it
        # so waiting tasks do not block each other from reserving.
        async with self._lock:
            now = time.monotonic()
            wait = max(0.0, self._next - now)
            self._next = max(now, self._next) + self.interval
        if wait:
            await asyncio.sleep(wait)

limiter = RateLimiter(rpm=450)  # headroom under a 500 RPM ceiling

async def one(prompt):
    await limiter.acquire()
    r = await client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}])
    return r.choices[0].message.content

async def main(prompts):
    # Start every task at once; the limiter spaces out the actual sends.
    return await asyncio.gather(*(one(p) for p in prompts))

print(asyncio.run(main([f"Summarize item {i}" for i in range(1000)])))

Other levers
- Batch API: for offline work, it has separate, higher limits and a discount, keeping bulk jobs out of your live RPM/TPM budget entirely.
- Prompt caching: lowers cost and latency for long, repeated prefixes (system prompts, few-shot examples). On OpenAI, cached tokens still count toward TPM, so treat it as a cost lever rather than a way to add headroom.
- Trim max_tokens: since OpenAI reserves TPM based on your declared max output, an inflated max_tokens wastes headroom. Set it to what you actually need.
- Separate keys/projects per workload: so a nightly batch can’t starve user-facing traffic that shares the same limit.
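The request limiter shown earlier paces calls but knows nothing about tokens. As a complement, here is a minimal sketch of a client-side sliding-window token budget that charges each request its input plus max_tokens, mirroring the reservation behaviour described on this page. `TokenBudget` and `try_reserve` are hypothetical names, and the server's own accounting may differ, so leave a margin under your real TPM.

```python
import time
from collections import deque
from typing import Optional


class TokenBudget:
    """Sliding-window budget: at most `tpm` reserved tokens per `window` seconds."""

    def __init__(self, tpm: int, window: float = 60.0):
        self.tpm = tpm
        self.window = window
        self._spent = deque()  # (timestamp, tokens) pairs, oldest first
        self._total = 0

    def _expire(self, now: float) -> None:
        # Drop reservations that have aged out of the window.
        while self._spent and now - self._spent[0][0] >= self.window:
            _, tokens = self._spent.popleft()
            self._total -= tokens

    def try_reserve(self, tokens: int, now: Optional[float] = None) -> float:
        """Reserve `tokens` and return 0.0, or return the seconds to wait."""
        if tokens > self.tpm:
            raise ValueError("request is larger than the whole per-minute budget")
        now = time.monotonic() if now is None else now
        self._expire(now)
        if self._total + tokens <= self.tpm:
            self._spent.append((now, tokens))
            self._total += tokens
            return 0.0
        # Walk the oldest reservations until enough budget would be freed.
        freed = 0
        for stamp, spent in self._spent:
            freed += spent
            if self._total - freed + tokens <= self.tpm:
                return stamp + self.window - now
        return self.window  # Not reached: freeing everything always suffices.


budget = TokenBudget(tpm=30_000)
print(budget.try_reserve(20_000, now=0.0))   # 0.0, reserved
print(budget.try_reserve(15_000, now=10.0))  # 50.0, wait for the first to expire
print(budget.try_reserve(15_000, now=60.0))  # 0.0, reserved
```

Passing `now` explicitly keeps the example deterministic; in real use, omit it and sleep for the returned number of seconds before trying again.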
How OpenAI compares to other providers
The mechanics are the same everywhere; the naming and split differ:
- Anthropic Claude splits tokens into input (ITPM) and output (OTPM) rather than one TPM, and adds a 529 Overloaded on top of 429. See the Anthropic Claude rate limits reference.
- Google Gemini uses RESOURCE_EXHAUSTED (HTTP 429) with both per-minute and per-day quotas; the daily cap is the common surprise.
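If you do call more than one provider, it helps to fold these signals into one vocabulary before deciding how to retry. The sketch below uses only the status codes and error name mentioned in this list; the function name, the provider strings, and the three category labels are assumptions for illustration.

```python
def classify_throttle(provider: str, status: int, error_status: str = "") -> str:
    """Map a provider response to 'rate_limited', 'overloaded', or 'other'."""
    provider = provider.lower()
    if provider == "anthropic" and status == 529:
        # Provider-side overload rather than your own quota.
        return "overloaded"
    if provider == "gemini" and error_status == "RESOURCE_EXHAUSTED":
        return "rate_limited"
    if status == 429:
        return "rate_limited"
    return "other"


print(classify_throttle("openai", 429))                        # rate_limited
print(classify_throttle("anthropic", 529))                     # overloaded
print(classify_throttle("gemini", 429, "RESOURCE_EXHAUSTED"))  # rate_limited
```

Keeping "overloaded" separate is a design choice: slowing your own send rate helps with a quota error but does little when the provider itself is short on capacity.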
Spreading load across providers is a legitimate strategy at scale — compare the economics in OpenAI vs Anthropic pricing.
What to do next
- Read the headers and add adaptive slowdown (code above) so you act before the 429.
- Confirm your tier in Settings → Limits; request an increase if you’ve outgrown it.
- Put batch work behind a token bucket or move it to the Batch API.
- Already hitting 429s? Jump to the 429 troubleshooting guide.
Frequently asked questions
Are OpenAI rate limits per key or per account?
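What's usually the real bottleneck, RPM or TPM?
Usually TPM: long context, RAG payloads, and image inputs use up tokens much faster than they use up requests.
Why do I get 429s when I'm under my request limit?
Limits apply on several axes at once, so you can be under RPM and still be over TPM or a daily cap. Check the error's type field (requests vs tokens) to see which.
Does a high max_tokens affect my rate limit?
Yes. OpenAI estimates the request's token cost up front as input plus max_tokens, before the model responds. An inflated max_tokens consumes headroom you didn't use, so set it to what you actually need.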