AI API pricing, compared honestly

The sticker price of a model — dollars per million tokens — is only the start. Real cost depends on your input/output ratio, prompt caching, batch endpoints, and which tier is actually good enough for the job. Two providers with near-identical per-token rates can still produce a 3× difference on your bill.

The comparisons below cover OpenAI, Anthropic, Google Gemini, and DeepSeek, with worked examples on realistic token counts. Prefer to plug in your own traffic? The AI API cost calculator prices every model side by side. Pricing changes often — figures are stamped "as of" a date; confirm against the provider's live page before committing a budget.

OpenAI vs Anthropic Pricing: A Developer's Cost Comparison

This page compares OpenAI and Anthropic on the things that actually show up on the bill, benchmarks Google Gemini and DeepSeek alongside them, and includes worked math you can copy.

Verify before you budget

AI pricing changes frequently and quietly. Figures below are standard list rates as of June 2026 (the same data behind our cost calculator) — confirm on each provider’s official pricing page before committing a budget.

The cost model: input, output, cache, batch

Four numbers determine your spend per request:

  1. Input price — per million input tokens (prompt, context, system message).
  2. Output price — per million generated tokens, usually several times the input price (2–8× in the rates below).
  3. Cached input price — a steep discount for repeated prompt prefixes served from cache (automatic at some providers, opt-in at others).
  4. Batch discount — typically ~50% off for non-realtime work via batch endpoints.
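Put together, the four numbers give a monthly cost formula. This is a sketch that writes R for requests per month, T_in, T_cache, and T_out for per-request input, cached-input, and output tokens (T_cache is counted inside T_in), r for each USD-per-million rate, and d for the batch discount as a fraction. It ignores any cache-write surcharge a provider may add. The second line plugs in the Claude Sonnet 4.6 support-bot numbers from the worked example further down, with no caching or batching:

```latex
\begin{aligned}
C_{\text{month}} &= R \cdot \frac{(T_{\text{in}} - T_{\text{cache}})\, r_{\text{in}} + T_{\text{cache}}\, r_{\text{cache}} + T_{\text{out}}\, r_{\text{out}}}{10^{6}} \cdot (1 - d) \\
C_{\text{support, Sonnet}} &= 10^{6} \cdot \frac{2000 \cdot 3.00 + 200 \cdot 15.00}{10^{6}} = 6{,}000 + 3{,}000 = \$9{,}000
\end{aligned}
```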

The single biggest mistake is comparing on input price alone. Output is the expensive half, so a chatbot that reads long context but answers briefly (input-bound) behaves completely differently from an agent that writes long completions (output-bound).
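To see why the input/output mix matters more than the headline rate, here is a small sketch that splits a single request's cost into its input and output halves. It assumes Claude Sonnet 4.6's list rates from the rate table below ($3.00 input, $15.00 output per 1M tokens) and the two request shapes used in the worked example; `bill_split` is a hypothetical helper name.

```python
def bill_split(in_tok, out_tok, in_rate, out_rate):
    """Per-request (input cost, output cost, output share); rates are USD per 1M tokens."""
    input_cost = in_tok * in_rate / 1_000_000
    output_cost = out_tok * out_rate / 1_000_000
    return input_cost, output_cost, output_cost / (input_cost + output_cost)


# Claude Sonnet 4.6 list rates (assumed from the rate table): $3.00 in, $15.00 out per 1M tokens.
for name, in_tok, out_tok in [("support bot", 2_000, 200), ("drafting agent", 1_000, 1_500)]:
    _, _, share = bill_split(in_tok, out_tok, 3.00, 15.00)
    print(f"{name}: output is {share:.0%} of the bill")
```

At these rates output is about a third of the support bot's bill but about 88% of the drafting agent's, so the same model can look cheap for one shape and expensive for the other.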

Per-model rates — all four providers

Below are the current standard list rates across OpenAI, Anthropic, Google, and DeepSeek — the same data that powers our cost calculator. Don’t rank on the headline input price; rank on your input/output mix.

Standard per-million-token API rates, USD (as of June 2026). Confirm on each provider's pricing page.
| Provider | Model | Input / 1M | Output / 1M | Cached input / 1M |
|---|---|---|---|---|
| OpenAI | GPT-5.5 | $5.00 | $30.00 | $0.50 |
| OpenAI | GPT-5.4 | $2.50 | $15.00 | $0.25 |
| OpenAI | GPT-5.4 mini | $0.75 | $4.50 | $0.075 |
| Anthropic | Claude Opus 4.8 | $5.00 | $25.00 | $0.50 |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 |
| Anthropic | Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 |
| Google | Gemini 3.5 Flash | $1.50 | $9.00 | not listed |
| Google | Gemini 2.5 Pro | $1.25 | $10.00 | not listed |
| Google | Gemini 2.5 Flash | $0.30 | $2.50 | not listed |
| DeepSeek | DeepSeek-V4-Flash | $0.14 | $0.28 | $0.0028 |
| DeepSeek | DeepSeek-V4-Pro | $0.435 | $0.87 | $0.0036 |

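Three things to notice. First, output always costs more than input, but the multiple varies: 5× at Anthropic, 6× at OpenAI, about 6–8× at Google, and just 2× at DeepSeek. Second, cached-input rates vary widely: OpenAI and Anthropic bill cached reads at 10% of base input (a 90% discount), DeepSeek at 2% or less, and no cached rate is listed here for Gemini. That spread can flip the cheapest pick for prompt-heavy, cache-friendly workloads. Third, the small tiers (GPT-5.4 mini, Claude Haiku 4.5, Gemini 2.5 Flash) cost roughly 4–7× less per token than their own provider's flagship, and DeepSeek's output rates are lower still.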
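A quick way to check those ratios yourself is to encode the rate table and compute them. This is a minimal sketch: the `RATES` dict simply transcribes the June 2026 list rates above (`None` where no cached rate is listed), and `output_multiple` and `cache_discount` are hypothetical helper names, not part of any provider SDK.

```python
# Standard list rates, USD per 1M tokens: (input, output, cached input), as of June 2026.
# None means the table lists no cached-input rate for that model.
RATES = {
    "OpenAI GPT-5.5": (5.00, 30.00, 0.50),
    "OpenAI GPT-5.4": (2.50, 15.00, 0.25),
    "OpenAI GPT-5.4 mini": (0.75, 4.50, 0.075),
    "Anthropic Claude Opus 4.8": (5.00, 25.00, 0.50),
    "Anthropic Claude Sonnet 4.6": (3.00, 15.00, 0.30),
    "Anthropic Claude Haiku 4.5": (1.00, 5.00, 0.10),
    "Google Gemini 3.5 Flash": (1.50, 9.00, None),
    "Google Gemini 2.5 Pro": (1.25, 10.00, None),
    "Google Gemini 2.5 Flash": (0.30, 2.50, None),
    "DeepSeek DeepSeek-V4-Flash": (0.14, 0.28, 0.0028),
    "DeepSeek DeepSeek-V4-Pro": (0.435, 0.87, 0.0036),
}


def output_multiple(model):
    """How many times more an output token costs than an input token."""
    in_rate, out_rate, _ = RATES[model]
    return out_rate / in_rate


def cache_discount(model):
    """Fractional discount of cached input vs. base input, or None if no cached rate is listed."""
    in_rate, _, cached = RATES[model]
    return None if cached is None else 1 - cached / in_rate


for model in RATES:
    disc = cache_discount(model)
    disc_text = "n/a" if disc is None else f"{disc:.0%}"
    print(f"{model:28} output {output_multiple(model):.1f}x input, cache discount {disc_text}")
```

Re-run it whenever the table's rates change; the ratios, not the headline input price, are what drive the rankings that follow.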

Worked example: an output-light vs output-heavy app

Take 1,000,000 requests/month. Compare two shapes:

  • Support bot (input-heavy): 2,000 input tokens, 200 output tokens each.
  • Drafting agent (output-heavy): 1,000 input tokens, 1,500 output tokens each.

Monthly token volume:

Token volume per workload shape, 1M requests/month.
| Workload | Input tokens / month | Output tokens / month |
|---|---|---|
| Support bot | 2.0B | 0.2B |
| Drafting agent | 1.0B | 1.5B |

Now cost both workloads across every model (no caching), using tokens ÷ 1,000,000 × rate, sorted cheapest-first by the support-bot bill:

Estimated monthly cost across all models, no caching or batching (rates as of June 2026).
| Provider | Model | Support bot / mo | Drafting agent / mo |
|---|---|---|---|
| DeepSeek | DeepSeek-V4-Flash | $336 | $560 |
| DeepSeek | DeepSeek-V4-Pro | $1,044 | $1,740 |
| Google | Gemini 2.5 Flash | $1,100 | $4,050 |
| OpenAI | GPT-5.4 mini | $2,400 | $7,500 |
| Anthropic | Claude Haiku 4.5 | $3,000 | $8,500 |
| Google | Gemini 2.5 Pro | $4,500 | $16,250 |
| Google | Gemini 3.5 Flash | $4,800 | $15,000 |
| OpenAI | GPT-5.4 | $8,000 | $25,000 |
| Anthropic | Claude Sonnet 4.6 | $9,000 | $25,500 |
| Anthropic | Claude Opus 4.8 | $15,000 | $42,500 |
| OpenAI | GPT-5.5 | $16,000 | $50,000 |
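The table can be reproduced with a few lines of arithmetic. This sketch assumes the two request shapes and the 1M requests/month from the worked example, applies monthly tokens ÷ 1,000,000 × rate separately to input and output, and ignores caching and batching; `monthly_bill` is a hypothetical helper, and the rates are transcribed from the June 2026 rate table.

```python
# Worked-example shapes: (input tokens, output tokens) per request.
WORKLOADS = {"support bot": (2_000, 200), "drafting agent": (1_000, 1_500)}
REQUESTS_PER_MONTH = 1_000_000

# (input, output) list rates in USD per 1M tokens, as of June 2026; no caching or batching.
RATES = {
    "OpenAI GPT-5.5": (5.00, 30.00),
    "OpenAI GPT-5.4": (2.50, 15.00),
    "OpenAI GPT-5.4 mini": (0.75, 4.50),
    "Anthropic Claude Opus 4.8": (5.00, 25.00),
    "Anthropic Claude Sonnet 4.6": (3.00, 15.00),
    "Anthropic Claude Haiku 4.5": (1.00, 5.00),
    "Google Gemini 3.5 Flash": (1.50, 9.00),
    "Google Gemini 2.5 Pro": (1.25, 10.00),
    "Google Gemini 2.5 Flash": (0.30, 2.50),
    "DeepSeek DeepSeek-V4-Flash": (0.14, 0.28),
    "DeepSeek DeepSeek-V4-Pro": (0.435, 0.87),
}


def monthly_bill(rates, in_tok, out_tok, requests=REQUESTS_PER_MONTH):
    """Monthly USD cost: monthly tokens / 1,000,000 x rate, for input and output."""
    in_rate, out_rate = rates
    input_millions = requests * in_tok / 1_000_000  # e.g. 2.0B tokens -> 2,000
    output_millions = requests * out_tok / 1_000_000
    return input_millions * in_rate + output_millions * out_rate


bills = {
    model: {name: monthly_bill(rates, *shape) for name, shape in WORKLOADS.items()}
    for model, rates in RATES.items()
}

# Cheapest-first by the support-bot bill, matching the table's ordering.
for model in sorted(bills, key=lambda m: bills[m]["support bot"]):
    b = bills[model]
    print(f"{model:28} ${b['support bot']:>8,.0f}  ${b['drafting agent']:>8,.0f}")
```

Swap in your own per-request token counts in `WORKLOADS` to get the same ranking for your traffic.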

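Two patterns jump out. First, the spread is large: at the same traffic, each provider's small tier (GPT-5.4 mini, Claude Haiku 4.5, Gemini 2.5 Flash) costs about 4–7× less than its flagship, and DeepSeek-V4-Flash comes in about 48× under GPT-5.5 for the support bot and about 89× under it for the drafting agent. Second, rankings can flip between workloads: Gemini 2.5 Pro beats Gemini 3.5 Flash on the read-heavy support bot ($4,500 vs $4,800) but loses on the output-heavy agent ($16,250 vs $15,000), because its output rate is higher ($10.00 vs $9.00) and output dominates that bill. So don’t crown one “cheapest model”; pick per workload. Want this math at your own traffic? The cost calculator does it live across all 11 models.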
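Spotting ranking flips by eye gets harder as you add models and workloads. A minimal sketch of an automatic check: it compares every pair of models and flags pairs whose cost order differs between the two workloads. The dollar figures are copied from the cost table above; `ranking_flips` is a hypothetical helper.

```python
from itertools import combinations

# Monthly bills from the cost table (USD): (support bot, drafting agent).
BILLS = {
    "DeepSeek DeepSeek-V4-Flash": (336, 560),
    "DeepSeek DeepSeek-V4-Pro": (1_044, 1_740),
    "Google Gemini 2.5 Flash": (1_100, 4_050),
    "OpenAI GPT-5.4 mini": (2_400, 7_500),
    "Anthropic Claude Haiku 4.5": (3_000, 8_500),
    "Google Gemini 2.5 Pro": (4_500, 16_250),
    "Google Gemini 3.5 Flash": (4_800, 15_000),
    "OpenAI GPT-5.4": (8_000, 25_000),
    "Anthropic Claude Sonnet 4.6": (9_000, 25_500),
    "Anthropic Claude Opus 4.8": (15_000, 42_500),
    "OpenAI GPT-5.5": (16_000, 50_000),
}


def ranking_flips(bills):
    """Pairs of models whose cost order differs between the two workloads."""
    flips = []
    for a, b in combinations(bills, 2):
        support_diff = bills[a][0] - bills[b][0]
        drafting_diff = bills[a][1] - bills[b][1]
        # Opposite signs: a is cheaper on one workload and pricier on the other.
        if support_diff * drafting_diff < 0:
            flips.append((a, b))
    return flips


print(ranking_flips(BILLS))
```

On these figures the only flipped pair is Gemini 2.5 Pro vs Gemini 3.5 Flash; with your own bills, an empty list means one model ordering holds across both workloads.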

How caching and batching change the answer

Two discounts that flip the result

Prompt caching can cut repeated-prefix input costs by 50–90%. Batch endpoints typically halve total cost for non-realtime work. A provider that looks pricier at sticker price can win once these are applied — always model your own usage with them on.

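The support bot above reuses a large system prompt on every call. If 1,800 of its 2,000 input tokens are a cacheable prefix served from cache on every request, Claude Sonnet 4.6’s cached-input rate ($0.30/1M vs $3.00 base) cuts its bill from $9,000 to about $4,140 a month (ignoring any cache-write charges). GPT-5.4 gets the same 90% discount ($0.25 vs $2.50) and drops from $8,000 to about $3,950, so the gap with OpenAI narrows from $1,000 to about $190 rather than reversing. Caching helps input-bound apps the most; it does nothing for output cost.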
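Here is the caching comparison as code. It assumes the full 1,800-token prefix hits the cache on every request at both providers (real hit rates depend on traffic and cache lifetime), uses the cached rates from the rate table, and ignores any cache-write surcharge; `cost_with_cache` is a hypothetical helper.

```python
def cost_with_cache(requests, in_tok, out_tok, in_rate, out_rate, cached_in_tok=0, cached_rate=0.0):
    """Monthly USD; rates are USD per 1M tokens. Ignores any cache-write surcharge."""
    uncached = (in_tok - cached_in_tok) * in_rate  # input tokens billed at the base rate
    cached = cached_in_tok * cached_rate  # prefix tokens billed at the cached rate
    return requests * (uncached + cached + out_tok * out_rate) / 1_000_000


# (input, output, cached input) USD per 1M tokens, from the rate table (as of June 2026).
MODELS = {
    "Claude Sonnet 4.6": (3.00, 15.00, 0.30),
    "GPT-5.4": (2.50, 15.00, 0.25),
}

for model, (in_rate, out_rate, cached_rate) in MODELS.items():
    no_cache = cost_with_cache(1_000_000, 2_000, 200, in_rate, out_rate)
    with_cache = cost_with_cache(1_000_000, 2_000, 200, in_rate, out_rate, 1_800, cached_rate)
    print(f"{model}: ${no_cache:,.0f} -> ${with_cache:,.0f}")
```

Because both providers discount cached reads by the same 90% here, caching shrinks the input-driven part of the gap but leaves the identical $3,000 output cost untouched.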

For offline jobs (bulk classification, embeddings backfills, nightly summarization), routing through a provider’s batch API, where one is offered, roughly halves the bill and uses separate, higher rate limits: a double win that holds regardless of which provider is cheaper at sticker price.
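Most workloads are only partly deferrable, so the useful question is what a given batch share saves. A minimal sketch, assuming a flat 50% batch discount (the "typically ~50%" figure above; actual discounts and eligibility vary by provider) and treating the drafting agent on Claude Sonnet 4.6 ($25,500/month realtime, from the cost table) as if some of its drafts could wait. `blended_cost` is a hypothetical helper.

```python
def blended_cost(realtime_cost, batch_share, batch_discount=0.5):
    """Monthly cost when batch_share of the work moves to a batch endpoint.

    realtime_cost: cost of the whole workload at standard realtime rates (USD).
    batch_discount: assumed flat discount on batched work (0.5 means ~50% off).
    """
    if not 0.0 <= batch_share <= 1.0:
        raise ValueError("batch_share must be between 0 and 1")
    realtime_part = realtime_cost * (1 - batch_share)
    batch_part = realtime_cost * batch_share * (1 - batch_discount)
    return realtime_part + batch_part


# Hypothetical: what if 0%, 50%, or 100% of the drafting agent's volume could be batched?
for share in (0.0, 0.5, 1.0):
    print(f"{share:.0%} batched: ${blended_cost(25_500, share):,.0f}")
```

Moving half the volume to batch takes the bill from $25,500 to $19,125; moving all of it halves it to $12,750.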

A quick cost estimator

Python — estimate monthly cost for your own numbers
```python
def monthly_cost(requests, in_tok, out_tok, in_rate, out_rate,
                 cached_in_tok=0, cached_rate=0.0, batch_discount=0.0):
    """Estimate monthly spend in USD.

    Token counts are per request; rates are USD per 1M tokens; batch_discount
    is a fraction (0.5 = 50% off). Ignores any cache-write surcharge.
    """
    # Cached prefix tokens bill at cached_rate; the rest of the input at in_rate.
    billable_in = (in_tok - cached_in_tok) * in_rate + cached_in_tok * cached_rate
    per_req = (billable_in + out_tok * out_rate) / 1_000_000
    total = per_req * requests
    return total * (1 - batch_discount)

# Support bot on Anthropic Sonnet, 1.8k of 2k input cached, realtime (no batch):
print(round(monthly_cost(
    requests=1_000_000, in_tok=2000, out_tok=200,
    in_rate=3.00, out_rate=15.00,
    cached_in_tok=1800, cached_rate=0.30,
)))  # -> 4140, vs $9,000 without caching
```

When to switch, when to split

  • Switch if one provider is meaningfully cheaper for your specific input/output ratio and quality is acceptable for the task. Confirm with the estimator on real token counts, not headline rates.
  • Split by routing cheap, tolerant tasks to a mini/Haiku model and reserving flagship/Sonnet for hard ones. This usually saves more than picking a single provider, and a thin abstraction makes it easy — see the wrapper in the OpenAI → Anthropic migration guide.
  • Factor in non-token costs: rate-limit headroom (a cheaper provider you constantly 429 against isn’t cheaper), latency, and quality on your eval set.
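The "split" option can be sketched as a tiny router plus a blended-cost estimate. Everything here is illustrative: the `easy`/`hard` labels, the routing rule, and the 80/20 traffic mix are hypothetical, and the per-model figures are the support-bot bills from the cost table. It is not the migration guide's wrapper, just the cost arithmetic behind the idea.

```python
# Hypothetical router: a task-difficulty label picks the model tier.
ROUTES = {"easy": "Claude Haiku 4.5", "hard": "Claude Sonnet 4.6"}

# Support-bot monthly cost if ALL 1M requests went to each model (from the cost table, USD).
FULL_TRAFFIC_COST = {"Claude Haiku 4.5": 3_000, "Claude Sonnet 4.6": 9_000}


def route(difficulty):
    """Send easy tasks to the cheap tier; anything else (including unknown labels) to Sonnet."""
    return ROUTES.get(difficulty, ROUTES["hard"])


def split_cost(mix):
    """Monthly cost when mix[label] is the fraction of traffic carrying that label."""
    return sum(share * FULL_TRAFFIC_COST[route(label)] for label, share in mix.items())


print(f"80/20 split: ${split_cost({'easy': 0.8, 'hard': 0.2}):,.0f}")
print(f"all Sonnet:  ${split_cost({'hard': 1.0}):,.0f}")
```

With an 80/20 mix this comes to about $4,200 a month against $9,000 for all-Sonnet, which is why validating the cheap tier on your eval set for the "easy" slice usually pays off.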

What to do next

  1. Plug your real token counts into the estimator above — separate input from output.
  2. Include caching and batching in your cost model before deciding; they often change the winner.
  3. Validate quality on your own eval set, not vendor benchmarks.
  4. Ready to move workloads? Follow the OpenAI → Anthropic migration guide, and size throughput with the OpenAI and Anthropic Claude rate-limit references.

Frequently asked questions

Is OpenAI or Anthropic cheaper?
It depends entirely on your input/output ratio and whether you use caching and batching. Output-heavy workloads favor whichever provider has the lower output rate; input-heavy, cache-friendly workloads can flip the result. Model your own token counts — don't trust the headline input price.
Why is output so much more expensive than input?
Generating tokens is more compute-intensive than reading them, so providers price output higher: 2–8× the input rate in the table above (5× at Anthropic, 6× at OpenAI). That's why an output-heavy agent can cost several times more than a read-heavy classifier at the same request volume.
How much can prompt caching save?
For stable, repeated prefixes (system prompts, few-shot examples, long documents) caching commonly cuts that portion of input cost by 50–90% or more; the cached rates in the table above run 90–99% below base input. It only helps input cost, but for input-bound apps it can change which provider is cheapest overall.
Does the Batch API really halve cost?
For non-realtime work, batch endpoints typically apply around a 50% discount and use separate, higher rate limits. If a job doesn't need an immediate response, batching is usually the single biggest lever on its cost.
Should I just pick the cheapest provider for everything?
Usually no. Splitting workloads — mini/Haiku for easy tasks, flagship/Sonnet for hard ones — saves more than a single-provider choice, and you should weigh rate-limit headroom, latency, and quality on your own evals alongside raw token price.