OpenAI Rate Limits Explained: RPM, TPM & Tiers

OpenAI enforces limits on several axes at once — requests per minute (RPM), tokens per minute (TPM), and on lower tiers requests/tokens per day — separately for each model. You can be comfortably under one axis and blocked by another, which is why “I’m nowhere near my request limit” and a 429 happily coexist. This page explains how to read your real limits, watch your headroom in real time, and engineer so you rarely hit them.

The axes that throttle you

  • RPM — requests per minute. A rolling 60-second window. Bursty parallelism trips it even when your average rate is low: 600 requests fired in 10 seconds is 3,600 RPM in that window.
  • TPM — tokens per minute. Counts input + output tokens. Long context, RAG payloads, and image inputs burn TPM fast — it’s the usual bottleneck for serious workloads. OpenAI estimates the request’s max token cost up front (input + your max_tokens), so a large max_tokens reserves budget even if the model replies briefly.
  • RPD / TPD — per day. Present on Free and the lowest tiers. The daily cap catches people whose per-minute math looks fine but who run all day.
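
The arithmetic behind the RPM and TPM bullets can be sketched in a few lines. This is a rough model that assumes the up-front reservation is simply input tokens plus max_tokens, as described above; OpenAI's exact accounting is not public, and the function names are illustrative only.

```python
def reserved_tokens(input_tokens: int, max_tokens: int) -> int:
    """Tokens assumed to count against TPM up front: input plus the output cap."""
    return input_tokens + max_tokens


def max_requests_per_minute(tpm_limit: int, input_tokens: int, max_tokens: int) -> int:
    """How many identical requests fit into one minute of TPM budget."""
    return tpm_limit // reserved_tokens(input_tokens, max_tokens)


def burst_rate_rpm(requests: int, seconds: float) -> float:
    """Rate of a burst, scaled to a per-minute figure."""
    return requests * 60.0 / seconds


# A 3,000-token prompt against a 30,000 TPM budget:
print(max_requests_per_minute(30_000, 3_000, 4_096))  # 4
print(max_requests_per_minute(30_000, 3_000, 300))    # 9
# 600 requests fired in 10 seconds:
print(burst_rate_rpm(600, 10))                        # 3600.0
```

Under this model, lowering max_tokens from 4,096 to 300 more than doubles the number of requests that fit in the same token budget.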

The error’s type field (requests vs tokens) tells you which ceiling you hit. Full troubleshooting lives in the OpenAI 429 fix; this page is the reference behind it.
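
As a minimal sketch of that check, the helper below reads a parsed 429 body and reports which axis was hit. It assumes the body has the usual `error` object with `type` and `message` fields; the sample payloads are hypothetical, and the fallback to scanning the message is a defensive guess rather than documented behavior.

```python
def limit_axis(error_body: dict) -> str:
    """Return "tokens", "requests", or "unknown" for a parsed 429 body."""
    err = error_body.get("error", {})
    kind = str(err.get("type", "")).lower()
    message = str(err.get("message", "")).lower()
    # Prefer the type field; fall back to the human-readable message.
    for text in (kind, message):
        for axis in ("tokens", "requests"):
            if axis in text:
                return axis
    return "unknown"


# Hypothetical payloads, for illustration only.
print(limit_axis({"error": {"type": "tokens", "message": "Rate limit reached"}}))
print(limit_axis({"error": {"type": "requests", "message": "Rate limit reached"}}))
```

A caller can then choose the matching remedy: trim context or max_tokens for "tokens", smooth out bursts for "requests".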

Read the headers — adapt before you 429

Every response reports exactly how much headroom remains. Reacting to these is far better than waiting for the 429:

OpenAI rate-limit response headers (as of June 2026).

Header                            Meaning
--------------------------------  ------------------------------------
x-ratelimit-limit-requests        Your RPM ceiling for this model
x-ratelimit-remaining-requests    Requests left in the current window
x-ratelimit-reset-requests        Time until the request window resets
x-ratelimit-limit-tokens          Your TPM ceiling for this model
x-ratelimit-remaining-tokens      Tokens left in the current window
x-ratelimit-reset-tokens          Time until the token window resets
retry-after                       Seconds to wait (sent on some 429s)

cURL — inspect your headroom
curl -sS -D - https://api.openai.com/v1/chat/completions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"hi"}]}' \
-o /dev/null | grep -i x-ratelimit
Python — slow down when remaining tokens run low
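import re
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

_UNITS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}

def parse_reset(value):
    """Convert a reset header such as "1s", "250ms", or "6m0s" to seconds.

    Assumes the duration format seen in practice; returns 0.0 if nothing matches.
    """
    parts = re.findall(r"(\d+(?:\.\d+)?)(ms|h|m|s)", value)
    return sum(float(n) * _UNITS[u] for n, u in parts)

def chat_adaptive(messages, model="gpt-4o-mini", floor=2000):
    # with_raw_response exposes the HTTP headers alongside the parsed body.
    resp = client.chat.completions.with_raw_response.create(model=model, messages=messages)
    h = resp.headers
    remaining = int(h.get("x-ratelimit-remaining-tokens", "999999"))
    if remaining < floor:
        # Pre-emptively wait out the window instead of risking a 429 next call.
        reset = parse_reset(h.get("x-ratelimit-reset-tokens", "1s")) or 1.0
        print(f"Low token headroom ({remaining}); pausing {reset:.1f}s")
        time.sleep(reset)
    return resp.parse()

print(chat_adaptive([{"role": "user", "content": "Hello"}]).choices[0].message.content)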
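
When a 429 does slip through, the retry-after header from the table above can drive the wait. The helper below is a sketch, not an official recipe: it assumes retry-after carries a number of seconds, falls back to exponential backoff with full jitter when the header is missing or unparsable, and uses illustrative names and defaults (`base`, `cap`, `rng`).

```python
import random


def retry_delay(headers, attempt, base=1.0, cap=60.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based) after a 429.

    `headers` is any mapping with lowercase keys, e.g. dict(response.headers).
    """
    raw = headers.get("retry-after")
    if raw is not None:
        try:
            return max(0.0, float(raw))
        except ValueError:
            pass  # not a plain number (e.g. an HTTP date); use backoff instead
    # Full jitter: a random fraction of min(cap, base * 2**attempt).
    return rng() * min(cap, base * (2 ** attempt))


print(retry_delay({"retry-after": "7"}, attempt=0))  # 7.0
```

Jitter matters for fan-out workloads: without it, every worker that was throttled at the same moment retries at the same moment.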

Usage-tier table

Limits scale automatically as your account ages and spends. Tiers gate on both cumulative spend and time since first payment — you must meet both. The numbers below are representative for gpt-4o; confirm yours in Settings → Limits, as they change and vary by model.

Representative OpenAI limits for gpt-4o by tier (as of June 2026). Confirm current values in the dashboard.

Tier     Qualifies at         RPM      TPM
-------  -------------------  -------  ----------
Free     No billing set up    3        40,000
Tier 1   $5+ paid             500      30,000
Tier 2   $50+ & 7 days        5,000    450,000
Tier 3   $100+ & 7 days       5,000    800,000
Tier 4   $250+ & 14 days      10,000   2,000,000
Tier 5   $1,000+ & 30 days    10,000   30,000,000
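
The "must meet both" rule is easy to misread, so here is a small sketch of it. The thresholds are copied from the representative table above and may not match your account; `tier_for` is a hypothetical helper, not part of any OpenAI SDK.

```python
# (tier, minimum cumulative spend in USD, minimum days since first payment),
# highest tier first. Values are the representative ones from the table above.
TIERS = [
    ("Tier 5", 1000, 30),
    ("Tier 4", 250, 14),
    ("Tier 3", 100, 7),
    ("Tier 2", 50, 7),
    ("Tier 1", 5, 0),
]


def tier_for(spend_usd: float, days_since_first_payment: int) -> str:
    """Return the highest tier whose spend AND age thresholds are both met."""
    for name, min_spend, min_days in TIERS:
        if spend_usd >= min_spend and days_since_first_payment >= min_days:
            return name
    return "Free"


print(tier_for(300, 10))   # Tier 3: enough spend for Tier 4, but not 14 days yet
print(tier_for(2000, 3))   # Tier 1: spend alone does not skip the waiting period
```
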
Limits are per model, and minis are roomier

Smaller models (e.g. gpt-4o-mini) carry far higher RPM/TPM than flagship models. Routing tolerant tasks to a mini model is often the fastest way to add headroom without touching your tier.

The pattern that matters: moving up a tier raises TPM far more than RPM. If big prompts are 429ing you, tier up or trim context — don’t just slow your request rate.

Engineering around the limits

Throttle batches with a token bucket

For any fan-out, cap your send rate just below the ceiling so concurrent workers and clock skew don’t push you over:

Python — async RPM limiter over a batch
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI()

class RateLimiter:
    """Evenly paced limiter: hands out one send slot every 60/rpm seconds."""
    def __init__(self, rpm: int):
        self.interval = 60.0 / rpm
        self._lock = asyncio.Lock()
        self._next = time.monotonic()  # earliest time the next slot is free

    async def acquire(self):
        # Reserve a slot under the lock, then sleep outside it so other
        # callers can queue up their own slots in the meantime.
        async with self._lock:
            now = time.monotonic()
            wait = max(0.0, self._next - now)
            self._next = max(now, self._next) + self.interval
        if wait > 0:
            await asyncio.sleep(wait)

limiter = RateLimiter(rpm=450)  # headroom under a 500 RPM ceiling

async def one(prompt):
    await limiter.acquire()
    r = await client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}])
    return r.choices[0].message.content

async def main(prompts):
    return await asyncio.gather(*(one(p) for p in prompts))

# 1,000 prompts at 450 RPM take a little over two minutes to send.
print(asyncio.run(main([f"Summarize item {i}" for i in range(1000)])))
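
The limiter above paces requests, but TPM is often the tighter ceiling. A token bucket keyed on estimated token cost is one way to cover that axis as well. The sketch below is an assumption-laden illustration: `TokenBucket` and `delay_for` are hypothetical names, the per-request cost is whatever estimate you supply (for example input tokens plus max_tokens), and the clock is injectable only so the behavior can be checked without sleeping.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate_per_min`, holds at most `capacity`."""

    def __init__(self, rate_per_min: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_min / 60.0  # tokens added per second
        self.capacity = capacity
        self.clock = clock
        self.level = capacity            # start full
        self.updated = clock()

    def _refill(self) -> None:
        now = self.clock()
        self.level = min(self.capacity, self.level + (now - self.updated) * self.rate)
        self.updated = now

    def delay_for(self, cost: float) -> float:
        """Reserve `cost` tokens and return how long the caller should sleep."""
        if cost > self.capacity:
            raise ValueError("request is larger than the bucket capacity")
        self._refill()
        self.level -= cost  # may go negative; the debt is repaid by waiting
        return max(0.0, -self.level / self.rate)


# Fake clock for illustration: 6,000 tokens/min is 100 tokens/s.
now = [0.0]
bucket = TokenBucket(rate_per_min=6000, capacity=1000, clock=lambda: now[0])
print(bucket.delay_for(1000))  # 0.0, the bucket starts full
print(bucket.delay_for(500))   # 5.0, 500 tokens of debt at 100 tokens/s
```

In an asyncio worker this would be used as `await asyncio.sleep(bucket.delay_for(cost))` just before sending; because `delay_for` never awaits, it needs no extra lock on a single event loop.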

Other levers

  • Batch API — for offline work, it has separate, higher limits and a discount, keeping bulk jobs out of your live RPM/TPM budget entirely.
  • Prompt caching — cuts the input tokens that count against TPM for long, repeated prefixes (system prompts, few-shot examples).
  • Trim max_tokens — since OpenAI reserves TPM based on your declared max output, an inflated max_tokens wastes headroom. Set it to what you actually need.
  • Separate keys/projects per workload — so a nightly batch can’t starve user-facing traffic that shares the same limit.
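
For the Batch API lever, the input is a JSONL file with one request per line. The sketch below builds such a file; the `custom_id` scheme, default model, and `max_tokens` value are illustrative choices, and the submission calls are left as comments because they need a live API key.

```python
import json


def batch_lines(prompts, model="gpt-4o-mini", max_tokens=200):
    """Yield one JSONL line per prompt in the Batch API request format."""
    for i, prompt in enumerate(prompts):
        yield json.dumps({
            "custom_id": f"item-{i}",  # used to match results to inputs later
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        })


def write_batch_file(path, prompts, **kwargs):
    """Write the batch input file to `path`."""
    with open(path, "w", encoding="utf-8") as f:
        for line in batch_lines(prompts, **kwargs):
            f.write(line + "\n")


# Submitting the file (not run here):
#   from openai import OpenAI
#   client = OpenAI()
#   upload = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
#   job = client.batches.create(input_file_id=upload.id,
#                               endpoint="/v1/chat/completions",
#                               completion_window="24h")
```

Keeping max_tokens explicit in each batch line applies the same trimming advice as the live path.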

How OpenAI compares to other providers

The mechanics are the same everywhere; the naming and split differ:

  • Anthropic Claude splits tokens into input (ITPM) and output (OTPM) rather than one TPM, and adds a 529 Overloaded on top of 429. See the Anthropic Claude rate limits reference.
  • Google Gemini uses RESOURCE_EXHAUSTED (HTTP 429) with both per-minute and per-day quotas; the daily cap is the common surprise.

Spreading load across providers is a legitimate strategy at scale — compare the economics in OpenAI vs Anthropic pricing.

What to do next

  1. Read the headers and add adaptive slowdown (code above) so you act before the 429.
  2. Confirm your tier in Settings → Limits; request an increase if you’ve outgrown it.
  3. Put batch work behind a token bucket or move it to the Batch API.
  4. Already hitting 429s? Jump to the 429 troubleshooting guide.

Frequently asked questions

Are OpenAI rate limits per key or per account?
Per organization/project and per model, not per key. Multiple keys in the same project share the same limits, so a batch job on one key can starve live traffic on another. Use separate projects to isolate workloads.

What's usually the real bottleneck, RPM or TPM?
TPM, for anything with long context. Moving up a tier raises TPM far more than RPM, so the fix for token-bound workloads is trimming context, lowering max_tokens, or tiering up, rather than just sending fewer requests.

Why do I get 429s when I'm under my request limit?
Almost always because you hit TPM rather than RPM, or because a burst exceeded RPM within a single 60-second window even though your average is fine. Check the error's type field (requests vs tokens) to see which.

Does a high max_tokens affect my rate limit?
Yes. OpenAI reserves TPM based on input tokens plus your declared max_tokens, before the model responds. An inflated max_tokens reserves headroom you never use, so set it to what you actually need.

How do I increase my rate limits?
Add billing and prepay to reach Tier 1 immediately, then accumulate spend and account age to move up tiers automatically (both thresholds must be met). For specific high-throughput needs, request an increase in the dashboard or move bulk work to the Batch API.