Anthropic Claude Rate Limits Explained: RPM, ITPM & OTPM

Anthropic limits Claude on three axes — requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM) — separately for each model. Splitting input from output is the key difference from OpenAI’s single TPM number, and it changes how you reason about headroom: a generation-heavy workload hits OTPM long before it troubles RPM or ITPM.

The three axes

  • RPM: requests per minute. Anthropic enforces limits with a token bucket that refills continuously rather than resetting on a fixed minute boundary, so a short burst can trip RPM even when the per-minute average is low.
  • ITPM: input tokens per minute. Covers the prompt, context, documents, and images. Cache reads are discounted against ITPM (Anthropic's documentation says most current models do not count them at all), which matters a lot for long, stable prefixes (see below).
  • OTPM: output tokens per minute. Covers generated tokens. This is the lowest token limit on most tiers and the one most workloads hit first. Anthropic estimates output usage up front from your max_tokens, so an inflated max_tokens reserves OTPM you may not use.

You’re throttled the moment you cross any one of the three.
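
To see which axis binds first for a given workload, a rough back-of-the-envelope sketch helps: divide each token limit by the tokens one request consumes and compare the result with RPM. The function name `binding_axis` and the workload numbers are illustrative assumptions, the limits are the representative Tier 1 figures from the usage-tier table in this article, and the OTPM line assumes each request reserves its full max_tokens.

```python
def binding_axis(rpm, itpm, otpm, input_tokens, max_tokens):
    """Return (axis, requests_per_minute) for the limit that binds first.

    Simplifying assumptions: every request sends `input_tokens` uncached
    input tokens and reserves `max_tokens` of output budget.
    """
    ceilings = {
        "RPM": float(rpm),
        "ITPM": itpm / input_tokens,
        "OTPM": otpm / max_tokens,
    }
    axis = min(ceilings, key=ceilings.get)
    return axis, ceilings[axis]


# Illustrative workload: 2,000-token prompts with max_tokens=1024.
axis, per_minute = binding_axis(rpm=50, itpm=30_000, otpm=8_000,
                                input_tokens=2_000, max_tokens=1_024)
print(f"{axis} binds first at about {per_minute:.1f} requests/minute")
# prints: OTPM binds first at about 7.8 requests/minute
```

With these numbers RPM would allow 50 requests and ITPM 15, but OTPM caps you near 8, which is why trimming max_tokens is often the cheapest fix.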

429 is not 529

A 429 (rate_limit_error) is your limit — back off, or tier up. A 529 Overloaded (overloaded_error) is Anthropic’s capacity — back off and add a fallback; tiering up won’t help. See the 529 Overloaded fix.
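
One way to keep the two cases apart in code is a small decision function that maps the status code to a next step. This is a minimal sketch, not Anthropic's recommended policy: the name `plan_retry`, the action labels, the three-attempt fallback threshold, and the jittered exponential backoff are all assumptions chosen for illustration.

```python
import random


def plan_retry(status, retry_after=None, attempt=0, base=1.0, cap=60.0):
    """Decide how to react to an error status. Returns (action, delay_seconds)."""
    backoff = min(cap, base * 2 ** attempt)
    if status == 429:
        # Your own limit: honor retry-after when the server sent one.
        if retry_after is not None:
            return "wait_then_retry", float(retry_after)
        return "wait_then_retry", backoff
    if status == 529:
        # Provider capacity: tiering up will not help, so fall back
        # after a few attempts (threshold of 3 is an assumption).
        if attempt >= 3:
            return "use_fallback", 0.0
        # Random jitter spreads retries out so clients do not stampede.
        return "wait_then_retry", random.uniform(0, backoff)
    return "raise", 0.0


print(plan_retry(429, retry_after="12"))
print(plan_retry(529, attempt=3))
```

The caller decides what "use_fallback" means in practice, for example a second model or a queued retry.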

Read the headers

Anthropic returns per-axis limit, remaining, and reset headers. Adapt to them instead of waiting for the 429:

Anthropic rate-limit response headers (as of June 2026).

Header                                       Meaning
anthropic-ratelimit-requests-limit           Your RPM ceiling for this model
anthropic-ratelimit-requests-remaining       Requests left this window
anthropic-ratelimit-input-tokens-remaining   Input tokens left (ITPM)
anthropic-ratelimit-output-tokens-remaining  Output tokens left (OTPM)
anthropic-ratelimit-tokens-reset             When the token windows reset
retry-after                                  Seconds to wait before retrying

cURL — inspect Anthropic rate-limit headers
curl -sS -D - https://api.anthropic.com/v1/messages \
-H "x-api-key: $ANTHROPIC_API_KEY" \
-H "anthropic-version: 2023-06-01" \
-H "content-type: application/json" \
-d '{"model":"claude-sonnet-4-6","max_tokens":16,"messages":[{"role":"user","content":"hi"}]}' \
-o /dev/null | grep -i ratelimit
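
Once you can see the headers, the next step is turning them into numbers your client can act on. The sketch below parses the headers from the table above into a plain dict. The function name `read_rate_limits`, the output keys, and the sample values are assumptions for illustration, and it assumes the reset header carries an RFC 3339 timestamp.

```python
from datetime import datetime, timezone


def read_rate_limits(headers, now=None):
    """Extract remaining budgets and seconds until the token window resets.

    `headers` is a mapping of lower-case header names to strings. HTTP
    client header objects are usually case-insensitive; a plain dict is not.
    Missing headers come back as None instead of raising.
    """
    now = now or datetime.now(timezone.utc)

    def as_int(name):
        value = headers.get(name)
        return int(value) if value is not None else None

    reset_in = None
    reset_raw = headers.get("anthropic-ratelimit-tokens-reset")
    if reset_raw:
        # Assumed RFC 3339 timestamp; normalise a trailing "Z" for older Pythons.
        reset_at = datetime.fromisoformat(reset_raw.replace("Z", "+00:00"))
        reset_in = max(0.0, (reset_at - now).total_seconds())

    return {
        "requests_remaining": as_int("anthropic-ratelimit-requests-remaining"),
        "input_tokens_remaining": as_int("anthropic-ratelimit-input-tokens-remaining"),
        "output_tokens_remaining": as_int("anthropic-ratelimit-output-tokens-remaining"),
        "token_reset_in_seconds": reset_in,
    }


# Made-up header values, for illustration only.
sample_headers = {
    "anthropic-ratelimit-requests-remaining": "42",
    "anthropic-ratelimit-input-tokens-remaining": "27500",
    "anthropic-ratelimit-output-tokens-remaining": "900",
    "anthropic-ratelimit-tokens-reset": "2026-06-01T12:00:30Z",
}
fixed_now = datetime(2026, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
limits = read_rate_limits(sample_headers, now=fixed_now)
print(limits)
```

Passing `now` explicitly keeps the function easy to test; in production you would omit it and let it default to the current UTC time.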

Usage-tier table

Tiers are based on cumulative deposits/spend; limits rise as you move up, and custom limits are available via sales beyond the standard tiers. Numbers below are representative for a Sonnet-class model — confirm yours in the Console, as they change and differ per model.

Representative Anthropic usage tiers, Sonnet class (as of June 2026). Confirm in the Console.

Tier    Qualifies at    RPM    ITPM     OTPM
Tier 1  $5+ deposit     50     30,000   8,000
Tier 2  $40+ deposit    1,000  80,000   16,000
Tier 3  $200+ deposit   2,000  160,000  32,000
Tier 4  $400+ deposit   4,000  400,000  80,000

Notice that OTPM is consistently the smaller token budget, roughly a fifth to a quarter of ITPM in these figures. If you're getting 429s, check output first.
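
To make the table actionable, you can check a planned workload against each tier in turn. The sketch below hard-codes the representative figures from the table above (confirm your own in the Console); the function name `lowest_sufficient_tier` and the sample workload are assumptions, and it treats every request as reserving its full max_tokens.

```python
# Representative Sonnet-class figures from the table above: (tier, RPM, ITPM, OTPM).
TIERS = [
    ("Tier 1", 50, 30_000, 8_000),
    ("Tier 2", 1_000, 80_000, 16_000),
    ("Tier 3", 2_000, 160_000, 32_000),
    ("Tier 4", 4_000, 400_000, 80_000),
]


def lowest_sufficient_tier(requests_per_min, input_tokens, max_tokens):
    """Return the first tier whose three limits all cover the workload, else None."""
    need_itpm = requests_per_min * input_tokens
    need_otpm = requests_per_min * max_tokens
    for name, rpm, itpm, otpm in TIERS:
        if requests_per_min <= rpm and need_itpm <= itpm and need_otpm <= otpm:
            return name
    return None


# Hypothetical workload: 20 requests/minute, 3,000 input tokens, max_tokens=1,000.
print(lowest_sufficient_tier(20, 3_000, 1_000))  # prints: Tier 3
```

In this example the 60,000 input tokens per minute would fit Tier 2, but the 20,000 output tokens per minute do not, so output alone pushes the workload up a tier.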

Prompt caching changes the math

For long, stable prefixes (system prompts, few-shot examples, a large document you query repeatedly), prompt caching lets you mark the prefix as cacheable. Cache reads are discounted against ITPM (Anthropic's documentation says most current models do not count them at all) and are billed at a fraction of the normal input price, so caching buys real input headroom on top of its cost savings:

Python — cache a long, reused prefix
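from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Reused across many queries. Prefixes shorter than the model's minimum
# cacheable length are processed normally and not cached.
LONG_CONTEXT = Path("policy_manual.txt").read_text(encoding="utf-8")


def ask(question):
    """Answer a question against the cached policy manual."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": LONG_CONTEXT,
                "cache_control": {"type": "ephemeral"},  # mark prefix cacheable
            }
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text


print(ask("What is the refund window?"))
print(ask("Who approves exceptions?"))  # reuses the cached prefix -> cheaper, lighter on ITPM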
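
To confirm the cache is actually being hit, inspect the usage block on each response. The sketch below summarises the `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens` fields that the Messages API reports in `usage`; the helper name `cache_report` is an assumption, and the `SimpleNamespace` stands in for a real `response.usage` with invented numbers.

```python
from types import SimpleNamespace


def cache_report(usage):
    """Summarise how much of a request's input was served from the prompt cache."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0
    total = read + written + uncached
    return {
        "cache_read": read,
        "cache_written": written,
        "uncached": uncached,
        "hit_ratio": read / total if total else 0.0,
    }


# Stand-in for `response.usage` on a second call; the numbers are invented.
second_call = SimpleNamespace(
    input_tokens=12,
    cache_read_input_tokens=4_988,
    cache_creation_input_tokens=0,
    output_tokens=40,
)
report = cache_report(second_call)
print(report)
```

A first call normally shows tokens under `cache_written`; if later calls still report zero `cache_read`, the prefix is probably changing between requests or is too short to cache.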

Engineering around the limits

JS — adaptive slowdown using the output-token header
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();

async function ask(messages, { model = "claude-sonnet-4-6", floor = 1000 } = {}) {
  // withResponse() returns the parsed message plus the raw HTTP response.
  const res = await client.messages.create({ model, max_tokens: 512, messages }).withResponse();
  const remainingOut = Number(res.response.headers.get("anthropic-ratelimit-output-tokens-remaining") ?? "1e9");
  if (remainingOut < floor) {
    // OTPM is the tightest axis: pause briefly before the next generation-heavy call.
    await new Promise((r) => setTimeout(r, 1000));
  }
  return res.data.content[0].text;
}

Other levers:

  • Message Batches API — separate, higher limits for offline work; keep bulk jobs off your live path.
  • Lower max_tokens — since OTPM is reserved from your declared max output, trim it to what you need to reclaim headroom.
  • Cache stable prefixes (above) to relieve ITPM.
  • Watch OTPM first in your dashboards — it’s the axis most likely to throttle you.
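
If you want to pace calls before the server has to refuse them, a client-side budget that mirrors OTPM is one option. The sketch below is a simple token bucket keyed on max_tokens. The class name `OutputTokenBudget` and its methods are hypothetical, and handing back unused tokens in `settle` is an assumption about how you want to account locally, not a statement about how Anthropic meters usage.

```python
import time


class OutputTokenBudget:
    """Client-side token bucket that paces calls by their declared max_tokens."""

    def __init__(self, otpm, clock=time.monotonic):
        self.capacity = float(otpm)
        self.available = float(otpm)
        self.rate = otpm / 60.0  # tokens refilled per second
        self.clock = clock
        self.last = clock()

    def _refill(self):
        now = self.clock()
        gained = (now - self.last) * self.rate
        self.available = min(self.capacity, self.available + gained)
        self.last = now

    def seconds_until(self, max_tokens):
        """How long to wait before a call reserving `max_tokens` would fit."""
        if max_tokens > self.capacity:
            raise ValueError("max_tokens exceeds the per-minute budget")
        self._refill()
        shortfall = max_tokens - self.available
        return max(0.0, shortfall / self.rate)

    def reserve(self, max_tokens):
        """Reserve budget if it fits right now; return True on success."""
        if self.seconds_until(max_tokens) > 0:
            return False
        self.available -= max_tokens
        return True

    def settle(self, max_tokens, actual_output_tokens):
        """Hand back the unused part of a reservation (local accounting assumption)."""
        self._refill()
        unused = max(0, max_tokens - actual_output_tokens)
        self.available = min(self.capacity, self.available + unused)


budget = OutputTokenBudget(otpm=8_000)
if budget.reserve(512):
    print("send the request now")
else:
    print(f"wait {budget.seconds_until(512):.1f}s first")
```

Call `reserve` before each request, sleep for `seconds_until` when it returns False, and call `settle` with the response's reported output tokens. The smaller your max_tokens, the more calls fit in the same budget.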

How Claude compares to OpenAI

  • Token split: Anthropic separates ITPM/OTPM; OpenAI uses a single TPM. Output-heavy work is more sharply constrained on Anthropic and more “averaged” on OpenAI. See the OpenAI rate limits reference.
  • Extra failure mode: Anthropic adds 529 Overloaded (provider capacity), distinct from 429 (your limit). OpenAI surfaces capacity issues through standard 5xx codes such as 500 or 503.
  • Cost angle: caching’s ITPM relief pairs with its pricing discount — factor both into provider choice via OpenAI vs Anthropic pricing.

What to do next

  1. Instrument the OTPM header and slow down before you hit it (code above).
  2. Cache long, reused prefixes to relieve ITPM and cut cost.
  3. Confirm your tier in the Console; move bulk work to the Batches API.
  4. Seeing capacity errors? Read the Claude 529 Overloaded fix. Planning a move from OpenAI? See the migration guide.

Frequently asked questions

Why does Anthropic split input and output token limits?
Output generation is more expensive to serve, so Anthropic meters OTPM separately from input and sets it lower. Output-heavy workloads typically hit OTPM before RPM or ITPM, so it's the axis to watch first.
Does prompt caching help with rate limits?
Yes. Cache reads of a marked prefix are discounted against ITPM (Anthropic's documentation says most current models do not count them at all) and cost less, so caching long, stable prefixes like system prompts or reused documents buys real input-token headroom.
What's the difference between a 429 and a 529 on Claude?
A 429 (rate_limit_error) means you exceeded your own RPM/ITPM/OTPM — throttle or tier up. A 529 (overloaded_error) means Anthropic's API is temporarily over capacity — back off and add a fallback; tiering up does nothing for it.
Does a high max_tokens count against my rate limit?
Yes. Anthropic reserves OTPM based on your declared max_tokens before generating, so an inflated value consumes output headroom you may not use. Set it to a realistic ceiling for the task.
How do I move up an Anthropic tier?
Tiers are based on cumulative deposits/spend. Add credits to reach the next threshold and limits increase automatically; for needs beyond the standard tiers, contact Anthropic sales for custom limits.