Anthropic limits Claude on three axes — requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM) — separately for each model. Splitting input from output is the key difference from OpenAI’s single TPM number, and it changes how you reason about headroom: a generation-heavy workload hits OTPM long before it troubles RPM or ITPM.
The three axes
- RPM: requests per minute. A rolling minute window; bursts trip it even at a low average.
- ITPM: input tokens per minute. Prompt, context, documents, and images. Prompt-cached reads are metered at a reduced rate, which matters a lot for long, stable prefixes (see below).
- OTPM: output tokens per minute. Generated tokens. This is the lowest of the three on most tiers and the one most workloads hit first. Anthropic estimates output usage up front from your max_tokens, so an inflated max_tokens reserves OTPM you may not use.
You’re throttled the moment you cross any one of the three.
A 429 (rate_limit_error) is your limit — back off, or tier up. A 529 Overloaded (overloaded_error) is Anthropic’s capacity — back off and add a fallback; tiering up won’t help. See the 529 Overloaded fix.
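The 429 versus 529 split above maps naturally onto two retry policies. The following is a minimal sketch, not Anthropic's recommended client logic: plan_retry is a hypothetical helper, the headers argument is assumed to be a plain dict with lower-case keys, and the base and cap values are arbitrary defaults.

```python
import random


def plan_retry(status, headers, attempt, base=1.0, cap=60.0):
    """Return (delay_seconds, use_fallback), or None if the status is not retryable.

    Hypothetical helper: `headers` is assumed to be a dict with lower-case keys.
    """
    backoff = min(cap, base * 2 ** attempt)  # exponential backoff, capped
    if status == 429:
        # Your own limit: prefer the server's retry-after hint when it parses as seconds.
        try:
            return float(headers["retry-after"]), False
        except (KeyError, ValueError):
            return backoff, False
    if status == 529:
        # Provider capacity: jitter the delay and signal that a fallback route is worth trying.
        return random.uniform(0, backoff), True
    return None
```

A caller would sleep for the returned delay and, when the second value is true, consider routing the request to whatever fallback it has configured.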
Read the headers
Anthropic returns per-axis limit, remaining, and reset headers. Adapt to them instead of waiting for the 429:
| Header | Meaning |
|---|---|
| anthropic-ratelimit-requests-limit | Your RPM ceiling for this model |
| anthropic-ratelimit-requests-remaining | Requests left this window |
| anthropic-ratelimit-input-tokens-remaining | Input tokens left (ITPM) |
| anthropic-ratelimit-output-tokens-remaining | Output tokens left (OTPM) |
| anthropic-ratelimit-tokens-reset | When the token windows reset |
| retry-after | Seconds to wait before retrying |
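
One way to act on these headers is to parse them into a small structure after every response. This is a sketch under assumptions: RateLimitSnapshot, snapshot, and should_pause are hypothetical names, the header names are the ones in the table above, and headers is assumed to be a mapping with lower-case keys (a plain dict is case-sensitive, unlike most HTTP client header objects).

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RateLimitSnapshot:
    requests_remaining: Optional[int]
    input_tokens_remaining: Optional[int]
    output_tokens_remaining: Optional[int]
    retry_after_s: Optional[float]


def _parse(headers: Mapping[str, str], name: str, cast):
    """Return cast(headers[name]), or None if the header is missing or malformed."""
    value = headers.get(name)
    if value is None:
        return None
    try:
        return cast(value)
    except ValueError:
        return None


def snapshot(headers: Mapping[str, str]) -> RateLimitSnapshot:
    """Collect the per-axis 'remaining' headers from one response."""
    return RateLimitSnapshot(
        requests_remaining=_parse(headers, "anthropic-ratelimit-requests-remaining", int),
        input_tokens_remaining=_parse(headers, "anthropic-ratelimit-input-tokens-remaining", int),
        output_tokens_remaining=_parse(headers, "anthropic-ratelimit-output-tokens-remaining", int),
        retry_after_s=_parse(headers, "retry-after", float),
    )


def should_pause(snap: RateLimitSnapshot, output_floor: int = 1000) -> bool:
    """True when remaining output tokens are known and below the chosen floor."""
    return snap.output_tokens_remaining is not None and snap.output_tokens_remaining < output_floor
```

The output_floor default of 1000 is arbitrary; pick a value close to the max_tokens of your typical request.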
curl -sS -D - https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-sonnet-4-6","max_tokens":16,"messages":[{"role":"user","content":"hi"}]}' \
  -o /dev/null | grep -i ratelimit

Usage-tier table
Tiers are based on cumulative deposits/spend; limits rise as you move up, and custom limits are available via sales beyond the standard tiers. Numbers below are representative for a Sonnet-class model — confirm yours in the Console, as they change and differ per model.
| Tier | Qualifies at | RPM | ITPM | OTPM |
|---|---|---|---|---|
| Tier 1 | $5+ deposit | 50 | 30,000 | 8,000 |
| Tier 2 | $40+ deposit | 1,000 | 80,000 | 16,000 |
| Tier 3 | $200+ deposit | 2,000 | 160,000 | 32,000 |
| Tier 4 | $400+ deposit | 4,000 | 400,000 | 80,000 |
Notice OTPM is consistently the smallest column. If you’re 429ing, check output first.
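To see why output tends to bind first, it can help to convert each axis into a sustainable requests-per-minute figure for a given request shape. The sketch below uses the representative Tier 1 row from the table and assumes, as described earlier, that each request reserves its full max_tokens against OTPM; binding_axis and the 500-token prompt are illustrative choices, not measured values.

```python
def binding_axis(limits, input_tokens, max_tokens):
    """Return (axis, requests_per_minute) for the axis that caps throughput first.

    Assumes every request sends `input_tokens` uncached input tokens and
    reserves `max_tokens` of output budget.
    """
    per_axis = {
        "RPM": float(limits["RPM"]),
        "ITPM": limits["ITPM"] / input_tokens,
        "OTPM": limits["OTPM"] / max_tokens,
    }
    axis = min(per_axis, key=per_axis.get)
    return axis, per_axis[axis]


TIER_1 = {"RPM": 50, "ITPM": 30_000, "OTPM": 8_000}  # representative row from the table

axis, rate = binding_axis(TIER_1, input_tokens=500, max_tokens=512)
print(axis, rate)  # OTPM 15.625
```

With these numbers RPM allows 50 requests a minute and ITPM allows 60, but OTPM allows fewer than 16, so output is the axis to watch.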
Prompt caching changes the math
For long, stable prefixes — system prompts, few-shot examples, a large document you query repeatedly — prompt caching lets you mark that prefix as cacheable. Cache reads are counted against ITPM at a reduced rate (and cost far less), so caching buys real input headroom on top of its cost savings:
import anthropic

client = anthropic.Anthropic()
LONG_CONTEXT = open("policy_manual.txt").read()  # reused across many queries

def ask(question):
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": LONG_CONTEXT,
                "cache_control": {"type": "ephemeral"},  # mark prefix cacheable
            }
        ],
        messages=[{"role": "user", "content": question}],
    ).content[0].text

print(ask("What is the refund window?"))
print(ask("Who approves exceptions?"))  # reuses the cached prefix -> cheaper, lighter on ITPM

Engineering around the limits
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function ask(messages, { model = "claude-sonnet-4-6", floor = 1000 } = {}) {
  const res = await client.messages.create({ model, max_tokens: 512, messages }).withResponse();
  const remainingOut = Number(res.response.headers.get("anthropic-ratelimit-output-tokens-remaining") ?? "1e9");
  if (remainingOut < floor) {
    // OTPM is the tightest axis, so pause briefly before the next generation-heavy call.
    await new Promise((r) => setTimeout(r, 1000));
  }
  return res.data.content[0].text;
}

Other levers:
- Message Batches API — separate, higher limits for offline work; keep bulk jobs off your live path.
- Lower max_tokens. OTPM is reserved from your declared max output, so trim it to what you need to reclaim headroom.
- Cache stable prefixes (above) to relieve ITPM.
- Watch OTPM first in your dashboards — it’s the axis most likely to throttle you.
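Another lever is to pace requests on the client so the output budget is never overdrawn in the first place. The following is a minimal token-bucket sketch, assuming you know your OTPM ceiling and that each request reserves its full max_tokens. OutputBudget is a hypothetical helper, not part of the Anthropic SDK, and the real server-side accounting may differ from this model.

```python
import time


class OutputBudget:
    """Client-side token bucket that mirrors an OTPM ceiling (hypothetical helper)."""

    def __init__(self, otpm, clock=time.monotonic):
        self.capacity = float(otpm)
        self.tokens = float(otpm)   # start with a full bucket
        self.rate = otpm / 60.0     # tokens refilled per second
        self.clock = clock          # injectable so the bucket can be tested without sleeping
        self.last = clock()

    def _refill(self):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def wait_time(self, max_tokens):
        """Seconds until a request reserving max_tokens would fit; 0.0 if it fits now."""
        if max_tokens > self.capacity:
            raise ValueError("max_tokens exceeds the per-minute budget")
        self._refill()
        return max(0.0, (max_tokens - self.tokens) / self.rate)

    def reserve(self, max_tokens):
        """Deduct max_tokens from the bucket if available; return True on success."""
        self._refill()
        if max_tokens <= self.tokens:
            self.tokens -= max_tokens
            return True
        return False
```

A caller would check reserve(max_tokens) before each request and sleep for wait_time(max_tokens) when it returns False. Lowering max_tokens shortens those waits directly, which is the same effect the lever above describes.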
How Claude compares to OpenAI
- Token split: Anthropic separates ITPM/OTPM; OpenAI uses a single TPM. Output-heavy work is more sharply constrained on Anthropic and more “averaged” on OpenAI. See the OpenAI rate limits reference.
- Extra failure mode: Anthropic adds 529 Overloaded (provider capacity), distinct from 429 (your limit). OpenAI surfaces capacity issues as 5xx.
- Cost angle: caching’s ITPM relief pairs with its pricing discount, so factor both into provider choice via OpenAI vs Anthropic pricing.
What to do next
- Instrument the OTPM header and slow down before you hit it (code above).
- Cache long, reused prefixes to relieve ITPM and cut cost.
- Confirm your tier in the Console; move bulk work to the Batches API.
- Seeing capacity errors? Read the Claude 529 Overloaded fix. Planning a move from OpenAI? See the migration guide.
Frequently asked questions
Why does Anthropic split input and output token limits?
Does prompt caching help with rate limits?
What's the difference between a 429 and a 529 on Claude?
A 429 (rate_limit_error) means you exceeded your own RPM/ITPM/OTPM: throttle or tier up. A 529 (overloaded_error) means Anthropic's API is temporarily over capacity: back off and add a fallback; tiering up does nothing for it.
Does a high max_tokens count against my rate limit?
Yes. Anthropic estimates output usage from max_tokens before generating, so an inflated value consumes output headroom you may not use. Set it to a realistic ceiling for the task.