OpenAI Error 429: Rate Limit Exceeded

Quick Fix

Getting 429 Too Many Requests from OpenAI right now? Try these first:

Add exponential backoff with jitter and retry the request — most 429s are transient and clear within seconds.
Check Retry-After and the x-ratelimit-remaining-requests / -tokens headers to see exactly which limit you hit (RPM vs TPM) and when it resets.
Confirm it’s not a quota/billing issue: a 429 with insufficient_quota means you’re out of credits, not rate limited, and backoff won’t fix it.

A 429 from OpenAI means your request was rejected because you sent too many — too many requests per minute (RPM), too many tokens per minute (TPM), or you’ve exhausted your quota. It’s one of the most common errors in production AI apps because limits are per-minute and bursty traffic blows through them in seconds. The good news: nearly all of them are recoverable in code, and the rest are a billing or tier fix.

This page covers what the error actually means, the specific conditions that trigger it, and copy-paste fixes in Python, JavaScript, and cURL — plus how to stop it happening again.

What this error means

HTTP 429 Too Many Requests is the standard status for rate limiting. OpenAI returns it in two distinct situations that are easy to confuse:

You hit a rate limit — you exceeded your account’s RPM or TPM for that model in the current window. This is temporary. Wait, then retry.
You ran out of quota — your error body has "code": "insufficient_quota". You’re out of prepaid credits or hit a hard spend cap. Retrying will never succeed; you need to add credits or raise the cap.

Always read the error body. A rate-limit 429 looks like this:

Rate-limit 429 (retryable)

{
"error": {
  "message": "Rate limit reached for gpt-4o in organization org-xxxx on requests per min (RPM): Limit 500, Used 500. Please try again in 120ms.",
  "type": "requests",
  "param": null,
  "code": "rate_limit_exceeded"
}
}

A quota 429 looks like this — note the different code:

Quota 429 (NOT retryable)

{
"error": {
  "message": "You exceeded your current quota, please check your plan and billing details.",
  "type": "insufficient_quota",
  "param": null,
  "code": "insufficient_quota"
}
}

Branch on error.code before you retry. Backing off on an insufficient_quota error just wastes time and hammers the API.
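As a minimal sketch of that branching, the function below maps a parsed 429 body to an action. It assumes the JSON shape shown in the two examples above; the name classify_429 and the returned labels are illustrative, not part of any SDK.

```python
from typing import Any, Mapping

RETRYABLE = "retry"        # transient rate limit: back off and try again
NOT_RETRYABLE = "billing"  # quota exhausted: needs billing action
UNKNOWN = "unknown"        # anything else: inspect before retrying

def classify_429(body: Mapping[str, Any]) -> str:
    """Map a parsed 429 error body to an action, using error.code."""
    error = body.get("error") or {}
    code = error.get("code")
    if code == "insufficient_quota":
        return NOT_RETRYABLE
    if code == "rate_limit_exceeded":
        return RETRYABLE
    return UNKNOWN

rate_limited = {"error": {"type": "requests", "code": "rate_limit_exceeded"}}
out_of_quota = {"error": {"type": "insufficient_quota", "code": "insufficient_quota"}}
print(classify_429(rate_limited))  # retry
print(classify_429(out_of_quota))  # billing
```

Returning a label rather than sleeping inside the function keeps the decision separate from the retry loop, so the same check can guard any client.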

Common causes

The five usual suspects

Most 429s trace back to one of these. Identify yours before reaching for a fix.

Bursty concurrency. You fire 50 requests in parallel (a batch job, a fan-out, a retry storm). Even if your average rate is fine, the burst exceeds RPM for that minute. This is the single most common cause.
Token-per-minute (TPM) ceiling, not request count. Large prompts — long context, big RAG payloads, image inputs — burn TPM fast. You can be well under RPM and still 429 on tokens. The error type field tells you which: requests vs tokens.
Low usage tier. New accounts start on Free or Tier 1 with tight limits (e.g. a handful of RPM on Free). Your code is fine; the ceiling is just low until you move up tiers.
Out of quota / hit a budget cap. insufficient_quota. Prepaid credits ran out, a monthly budget limit triggered, or a card failed. Looks like a 429 but is a billing problem.
Shared key across services. One API key used by your app, a cron job, and three developers’ laptops shares one limit. The cron job’s nightly batch starves your live traffic.
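To see why token volume can trip a limit long before request count does, here is a small arithmetic sketch. The function binding_limit is hypothetical, and the 500 RPM / 30,000 TPM figures are illustrative values, not a statement of your account's limits.

```python
def binding_limit(requests_per_min: int, tokens_per_request: int,
                  rpm_limit: int, tpm_limit: int) -> str:
    """Return which per-minute ceiling a steady workload would hit first, if any."""
    tokens_per_min = requests_per_min * tokens_per_request
    rpm_use = requests_per_min / rpm_limit   # fraction of the request budget used
    tpm_use = tokens_per_min / tpm_limit     # fraction of the token budget used
    if max(rpm_use, tpm_use) <= 1.0:
        return "within limits"
    return "TPM" if tpm_use >= rpm_use else "RPM"

# 10 requests/min of 8,000 tokens each: only 2% of RPM,
# but 80,000 tokens/min is about 267% of TPM.
print(binding_limit(10, 8_000, rpm_limit=500, tpm_limit=30_000))   # TPM
print(binding_limit(600, 20, rpm_limit=500, tpm_limit=30_000))     # RPM
print(binding_limit(100, 200, rpm_limit=500, tpm_limit=30_000))    # within limits
```

This models a steady rate only. A burst can still exceed a limit inside one window even when the per-minute average says "within limits".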

How to fix it

The default fix for a rate-limit 429 is retry with exponential backoff and jitter. Respect the Retry-After header when present, cap your retries, and never retry an insufficient_quota error.

Exponential backoff (the default fix)

The OpenAI SDKs already retry 429s automatically (twice by default). The fastest real fix is often just to raise that built-in retry count and let the SDK handle timing:

Python — let the SDK retry, then add your own

import random
import time

import openai
from openai import OpenAI

# The SDK retries 429/5xx automatically. Bump the built-in count first.
# Note: the loop below wraps those built-in retries, so worst-case calls are
# roughly max_attempts x (max_retries + 1). Keep both numbers modest.
client = OpenAI(max_retries=5)

def chat_with_backoff(messages, model="gpt-4o-mini", max_attempts=6):
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except openai.RateLimitError as e:
            # Don't retry a quota error; it will never clear on its own.
            if getattr(e, "code", "") == "insufficient_quota":
                raise
            if attempt == max_attempts - 1:
                raise
            # Honor Retry-After if the server sent a numeric value,
            # else fall back to exponential backoff plus jitter.
            retry_after = (e.response.headers.get("retry-after")
                           if e.response is not None else None)
            try:
                delay = float(retry_after)
            except (TypeError, ValueError):
                delay = (2 ** attempt) + random.random()
            time.sleep(delay)

resp = chat_with_backoff([{"role": "user", "content": "Hello"}])
print(resp.choices[0].message.content)

JavaScript / Node — explicit backoff with jitter

import OpenAI from "openai";

const client = new OpenAI({ maxRetries: 5 });

const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

async function chatWithBackoff(messages, { model = "gpt-4o-mini", maxAttempts = 6 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await client.chat.completions.create({ model, messages });
    } catch (err) {
      // Quota errors never clear, so bail immediately.
      if (err?.code === "insufficient_quota") throw err;
      if (err?.status !== 429 || attempt === maxAttempts - 1) throw err;

      const retryAfter = Number(err?.headers?.["retry-after"]);
      const backoff = Number.isFinite(retryAfter)
        ? retryAfter * 1000
        : 2 ** attempt * 1000 + Math.random() * 1000;
      await sleep(backoff);
    }
  }
}

const resp = await chatWithBackoff([{ role: "user", content: "Hello" }]);
console.log(resp.choices[0].message.content);

cURL — retry on 429 with curl's built-in flags

# curl can retry transient failures itself. --retry already treats HTTP 429
# (and 408/5xx) as transient, so --retry-all-errors is not needed. The delay
# doubles after each attempt, and --retry-max-time caps the total time spent
# retrying. Caveat: curl does not read the error body, so it will also retry
# a non-retryable insufficient_quota 429.
curl https://api.openai.com/v1/chat/completions \
  --retry 5 \
  --retry-max-time 60 \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Token bucket / request throttling

Backoff reacts after you’ve been throttled. A token bucket prevents it by capping how fast you send. Use this for batch jobs and any fan-out that would otherwise burst:

Python — async rate limiter (RPM cap) over a batch

import asyncio, time
from openai import AsyncOpenAI

client = AsyncOpenAI()

class RateLimiter:
  """Evenly paces requests (a one-slot token bucket): at most `rpm` per 60s."""
  def __init__(self, rpm: int):
      self.interval = 60.0 / rpm
      self._lock = asyncio.Lock()
      self._next = time.monotonic()

  async def acquire(self):
      async with self._lock:
          now = time.monotonic()
          wait = max(0.0, self._next - now)
          self._next = max(now, self._next) + self.interval
      if wait:
          await asyncio.sleep(wait)

limiter = RateLimiter(rpm=450)  # stay under a 500 RPM ceiling with headroom

async def one(prompt):
  await limiter.acquire()
  r = await client.chat.completions.create(
      model="gpt-4o-mini",
      messages=[{"role": "user", "content": prompt}],
  )
  return r.choices[0].message.content

async def main(prompts):
  return await asyncio.gather(*(one(p) for p in prompts))

print(asyncio.run(main([f"Summarize item {i}" for i in range(2000)])))

Set the limiter a little below your real ceiling (here 450 against 500) so concurrent workers and clock skew don’t push you over.
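The limiter above caps requests only. If tokens are your bottleneck, the same idea can be applied to TPM. The sketch below is one possible shape, not an official client feature: TokenBudget, reserve, and the injectable clock are hypothetical names, and it assumes you can estimate each request's tokens (prompt plus expected output) before sending.

```python
import time
from typing import Callable

class TokenBudget:
    """Token bucket for TPM: refills continuously, holds at most `tpm` tokens."""

    def __init__(self, tpm: int, clock: Callable[[], float] = time.monotonic):
        self.capacity = float(tpm)
        self.rate = tpm / 60.0          # tokens refilled per second
        self.available = float(tpm)     # start with a full bucket
        self.clock = clock
        self.last = clock()

    def _refill(self) -> None:
        now = self.clock()
        self.available = min(self.capacity,
                             self.available + (now - self.last) * self.rate)
        self.last = now

    def reserve(self, tokens: int) -> float:
        """Reserve `tokens` and return how many seconds to wait before sending."""
        if tokens > self.capacity:
            raise ValueError("request is larger than the whole per-minute budget")
        self._refill()
        self.available -= tokens        # may go negative: that debt is the wait
        return max(0.0, -self.available / self.rate)

# Demo with a fake clock so the numbers are deterministic.
fake_now = [0.0]
budget = TokenBudget(tpm=30_000, clock=lambda: fake_now[0])
print(budget.reserve(20_000))  # 0.0  (bucket had room)
print(budget.reserve(20_000))  # 20.0 (10,000 tokens short at 500 tokens/s)
fake_now[0] = 60.0
print(budget.reserve(5_000))   # 0.0  (bucket refilled in the meantime)
```

In an async worker you would await asyncio.sleep() on the returned wait before calling the API, alongside the RPM limiter's acquire().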

Tier upgrade checklist

If you’re consistently at the ceiling even after throttling, you’ve outgrown your tier. OpenAI raises limits automatically as your account ages and its cumulative spend grows. To move up:

Add a payment method and prepay credits. Free-tier limits are tiny; adding billing moves you to Tier 1 immediately.
Spend the threshold for the next tier. Tiers gate on cumulative spend and time since first payment — both must be met. Check your exact tier in the dashboard under Settings → Limits.
Verify your organization. Some higher limits and newer models require organization verification.
Request a limit increase if a specific workload needs more than your tier grants — there’s a form in the dashboard for this.

Switch to a model with higher limits

Limits are per model. Smaller/cheaper models almost always have far higher RPM and TPM than flagship models. If your task tolerates it, switching model is an instant fix:

Python — fall back to a higher-limit model on repeated 429s

import openai
from openai import OpenAI

client = OpenAI(max_retries=2)

# Ordered by capability; each fallback typically has higher rate limits.
MODEL_CHAIN = ["gpt-4o", "gpt-4o-mini"]

def chat_with_fallback(messages):
  last_err = None
  for model in MODEL_CHAIN:
      try:
          return client.chat.completions.create(model=model, messages=messages)
      except openai.RateLimitError as e:
          if getattr(e, "code", "") == "insufficient_quota":
              raise  # billing problem — switching model won't help
          last_err = e
          continue
  raise last_err

print(chat_with_fallback([{"role": "user", "content": "Hello"}]).choices[0].message.content)

How to prevent it

Fixing a 429 in the moment is reactive. These keep them from firing at all:

Throttle proactively. Run every batch behind a token bucket (above) sized just under your tier’s RPM/TPM. Cheaper than retry storms and far more predictable.
Read the rate-limit headers and adapt. Every response includes x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, and x-ratelimit-reset-*. Slow down when remaining is low instead of waiting for the 429.
Cache aggressively. Identical or near-identical prompts? Cache responses. Use OpenAI’s prompt caching for long, repeated prefixes (system prompts, few-shot examples) to cut the tokens that count against TPM.
Batch offline work. For non-realtime jobs, use the Batch API — it has separate, higher limits and a big discount, keeping bulk work out of your live RPM/TPM budget entirely.
Separate keys per workload. Give your live app, cron jobs, and dev environments their own keys (and ideally separate projects) so a batch job can’t starve user-facing traffic.
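Adapting to the rate-limit headers can be sketched as follows. This assumes the reset headers carry durations such as 120ms, 1s, or 6m0s and that header names are lowercase in the mapping you pass in; parse_reset, pause_before_next, and the two floor values are hypothetical and should be tuned to your workload.

```python
import re
from typing import Mapping

_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0}

def parse_reset(value: str) -> float:
    """Convert a duration such as '120ms', '1s' or '6m0s' to seconds."""
    parts = _PART.findall(value)
    if not parts:
        raise ValueError(f"unrecognized duration: {value!r}")
    total = 0.0
    for number, unit in parts:
        total += float(number) / 1000 if unit == "ms" else float(number) * _SECONDS[unit]
    return total

def pause_before_next(headers: Mapping[str, str],
                      min_requests: int = 5, min_tokens: int = 2_000) -> float:
    """Seconds to wait before the next call; 0.0 while there is headroom."""
    waits = [0.0]
    for kind, floor in (("requests", min_requests), ("tokens", min_tokens)):
        remaining = headers.get(f"x-ratelimit-remaining-{kind}")
        reset = headers.get(f"x-ratelimit-reset-{kind}")
        if remaining is not None and reset is not None and int(remaining) <= floor:
            waits.append(parse_reset(reset))
    return max(waits)

sample = {
    "x-ratelimit-remaining-requests": "499",
    "x-ratelimit-reset-requests": "120ms",
    "x-ratelimit-remaining-tokens": "1500",
    "x-ratelimit-reset-tokens": "6m0s",
}
print(pause_before_next(sample))  # 360.0 (tokens are nearly exhausted)
```

Waiting for the full reset is the conservative choice. A gentler variant could scale the pause by how close remaining is to zero.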

Watch out

Retrying without a cap, without jitter, or on insufficient_quota turns a brief blip into a self-inflicted outage: every client retries in lockstep, re-saturates the limit, and 429s compound. Always cap attempts and add randomized jitter.
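The lockstep problem is easy to see in a small simulation. The sketch below is illustrative only: backoff_delay and its 30-second cap are assumptions, and each simulated client gets its own seeded random generator so the run is repeatable.

```python
import random

def backoff_delay(attempt: int, rng: random.Random,
                  base: float = 1.0, cap: float = 30.0) -> float:
    """Capped exponential delay plus up to one second of random jitter."""
    return min(cap, base * 2 ** attempt) + rng.random()

# Three clients that all hit a 429 at the same instant.
clients = [random.Random(seed) for seed in (1, 2, 3)]
for attempt in range(4):
    delays = [round(backoff_delay(attempt, rng), 2) for rng in clients]
    print(attempt, delays)
```

Without the rng.random() term every client would sleep exactly 1s, 2s, 4s, 8s and retry at the same moment. With it, retries spread across each one-second window, and the cap keeps late attempts from sleeping for minutes.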

How OpenAI rate limits compare across tiers

Limits vary by model; the table below shows representative gpt-4o ceilings to illustrate how tiers scale. Verify your exact numbers in the dashboard — these change.

Representative OpenAI rate limits by usage tier (gpt-4o, as of June 2026). Confirm current values in Settings → Limits.
Tier	Qualifies at	RPM	TPM
Free	No billing set up	3	40,000
Tier 1	$5+ paid	500	30,000
Tier 2	$50+ & 7 days	5,000	450,000
Tier 3	$100+ & 7 days	5,000	800,000
Tier 4	$250+ & 14 days	10,000	2,000,000
Tier 5	$1,000+ & 30 days	10,000	30,000,000

The pattern that matters: from Tier 1 upward, each tier raises TPM (often the real bottleneck for long-context apps) far more than RPM. In the table, RPM grows 20x from Tier 1 to Tier 5 while TPM grows 1,000x. If big prompts are 429ing you, tier up or trim context; don’t just slow your request rate.

Related providers

Other major providers rate-limit in broadly the same way, with different names and ceilings:

Anthropic (Claude) returns 429 on RPM/TPM and a separate 529 Overloaded when its own capacity is saturated — handle both with backoff. See Anthropic Claude rate limits and the Claude 529 overloaded fix.
Google Gemini uses RESOURCE_EXHAUSTED (HTTP 429) with per-minute and per-day quotas; the per-day cap catches people out. We’ll link a dedicated Gemini page here as it ships.

If you regularly hit ceilings on one provider, spreading load across two is a legitimate strategy — see OpenAI vs Anthropic pricing and the OpenAI → Anthropic migration guide.

One key, many providers

If you want automatic failover across providers when one rate-limits you, a router can handle it without you maintaining multiple SDKs. OpenRouter exposes OpenAI-compatible endpoints across many models and falls back across providers on 429s — useful as a pressure valve for bursty workloads. (Affiliate link; you can wire equivalent failover yourself with the code above.)

What to do next

Read your error body — branch on rate_limit_exceeded vs insufficient_quota.
Add capped exponential backoff with jitter (code above) for the retryable case.
Put batch jobs behind a token bucket sized just under your tier limits.
Check Settings → Limits to confirm your tier and request an increase if you’ve outgrown it.

Frequently asked questions

Does a 429 mean I'm out of money?

Not necessarily. A 429 with code: rate_limit_exceeded means you exceeded your per-minute request or token rate — retry with backoff. A 429 with code: insufficient_quota means you're out of credits or hit a budget cap — that one needs billing action, not retries.

How long should I wait before retrying?

If the response includes a Retry-After header, honor it exactly. Otherwise use exponential backoff with jitter: ~1s, 2s, 4s, 8s plus a random fraction of a second, capped at 5–6 attempts. Most rate-limit 429s clear within a few seconds.

What's the difference between RPM and TPM limits?

RPM caps how many requests you send per minute; TPM caps how many tokens (input + output) you process per minute. Big prompts blow through TPM long before RPM. The error's type field (requests vs tokens) tells you which one you hit.

Why do I get 429s when I'm clearly under my limit?

Usually bursty concurrency: your average rate is fine but a parallel fan-out exceeds the per-minute limit in one window. It can also be a shared key (a cron job consuming the same budget) or TPM rather than RPM. Throttle with a token bucket and separate keys per workload.

Do the OpenAI SDKs retry 429s automatically?

Yes. The official Python and Node SDKs retry 429 and 5xx errors automatically (two retries by default) with backoff. Raise the count with max_retries / maxRetries. The built-in retries are capped, so a persistent error such as insufficient_quota still surfaces to your code; check error.code yourself to fail fast, and keep monitoring retry volume.

How do I increase my rate limits?

Add billing and prepay to reach Tier 1 immediately, then accumulate spend and account age to move up tiers automatically (both thresholds must be met). For specific high-throughput needs, request an increase via the form in the dashboard, or move bulk work to the Batch API, which has separate higher limits.

Will switching to gpt-4o-mini stop the 429s?

Often, yes — smaller models carry much higher RPM/TPM ceilings than flagship models, so routing tolerant tasks to a mini model is an instant relief valve. It won't help if the 429 is actually an insufficient_quota billing error, which applies across models.
