LLM Pricing Calculator — Compare GPT-5, Claude, Gemini & 160+ Models
Quick Answer
GPT-5 costs $0.625 per million input tokens and $5.00 per million output tokens. Claude Sonnet 4.6 costs $3 input and $15 output per million. Claude Opus 4.6 costs $5 input and $25 output per million. Gemini 3.1 Pro costs $2 input and $12 output. The cheapest frontier options are DeepSeek V3 ($0.014/$0.028), Claude Haiku 4.5 ($1/$5), GPT-5 Nano ($0.05/$0.40), and Gemini 2.5 Flash Lite ($0.10/$0.40). All prices verified from pricepertoken.com and provider official pricing pages on April 11, 2026.
Also searched as: claude pricing, openai pricing, anthropic pricing, gpt-5 pricing, openai api price, chatgpt cost, llm cost calculator, gpt-5 cost, claude api cost calculator, cheapest llm api, openai vs claude pricing, gemini 3 pricing
Quick Cost Calculator — Paste Your Prompt
Token counts update as you type. Paste or type what you'd actually send to the model — a system prompt, a user question, a document you want summarized — and we'll count the tokens and show the exact cost.
💡 Tip: output tokens typically cost 4-5× more than input. Output length is usually the bigger cost driver.
What does N tokens look like? (reference)
10 tokens ≈ 7-8 English words — "What's the capital of France?"
100 tokens ≈ 75 words or 1 short paragraph — a typical tweet or chatbot answer
500 tokens ≈ 375 words or ~1 page of prose — a typical email or 3-4 detailed paragraphs
2,000 tokens ≈ 1,500 words or ~3 pages — a long email, short blog post, or detailed chatbot response
10,000 tokens ≈ 7,500 words or ~15 pages — a long article, short story, or medium-sized PDF
100,000 tokens ≈ 75,000 words or ~150 pages — a typical novella or large codebase file
1,000,000 tokens ≈ 750,000 words or ~1,500 pages — 3-4 full novels or a very large codebase
Rule of thumb: 1 token ≈ 4 English characters ≈ 0.75 words. Token count is estimated ±10-15% vs. the actual tokenizer.
Your Workload
Percentage of input tokens that are reused (cached) across requests. Lower your bill with prompt caching.
Totals
Monthly input: 15M tokens
Monthly output: 15M tokens
Monthly requests: 30k
Cost Across Popular Models
Showing popular models
| Model | In / 1M | Out / 1M | Per Req | Per Day | Per Month |
|---|---|---|---|---|---|
Cheapest model for your workload highlighted in green. Rows with "⚠ CTX OVERFLOW" can't fit your input into their context window and are dimmed. Prices verified from provider pricing pages and pricepertoken.com on April 11, 2026 and may change.
Embedding Models
| Model | Price / 1M tok | Per Req | Per Day | Per Month |
|---|---|---|---|---|
Embeddings are ~100× cheaper than LLM calls for semantic search, deduplication, and classification tasks. Use embeddings + rerank before any LLM generation step.
Image Generation Models
| Model | Per Image | Per Day | Per Month |
|---|---|---|---|
Image generation is priced per image at the stated resolution, not per token. Flux Schnell is typically the cheapest at $0.003/image; DALL-E 3 HD and Flux Pro are the premium options.
Live Playground — See Token Counts and Costs in Real Time
Type or paste real text below. The token count and cost update live for every model. Click an example to load a common prompt.
0 tokens estimated
Click to lock in this prompt and see costs across all models. Results also update live as you type.
Cost for this single request across top models:
How LLM Pricing Works
LLM API pricing is calculated per token, where one token corresponds to roughly 4 characters of English text or about three-quarters of a word. Every major provider (OpenAI, Anthropic, Google, Meta via hosted endpoints, Mistral, DeepSeek, xAI, Cohere, and Alibaba) charges separately for input tokens (the prompt you send) and output tokens (the response you receive). Output tokens are almost always priced higher than input tokens, typically three to five times higher, because generating text is computationally more expensive than ingesting it. According to OpenAI's pricing page, GPT-4o charges $2.50 per million input tokens and $10 per million output tokens. Anthropic's Claude Sonnet 4.6 is priced at $3 per million input and $15 per million output.
Beyond the base rate, three major discount mechanisms can dramatically change the real cost: prompt caching, batch API mode, and model tier selection. Prompt caching reuses a static prefix (such as a system prompt or RAG context) across requests at up to 90 percent off, and is supported by Anthropic, OpenAI, and Google Gemini. Batch API mode processes non-urgent workloads asynchronously within 24 hours at a flat 50 percent discount. Choosing a smaller "mini" or "haiku" tier model can cut costs by 90 to 95 percent while retaining most capabilities for many tasks. Combining all three optimizations can reduce total spend by 80 to 95 percent versus naively using the flagship model at standard rates.
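A sketch of how these three discount mechanisms combine into an effective per-request cost. Prices and discount rates are parameters, not quotes from any provider — plug in the rates from the tables on this page:

```python
def request_cost(input_tokens, output_tokens, in_price, out_price,
                 cached_fraction=0.0, cache_discount=0.9, batch=False):
    """Dollar cost of one request, with per-million-token prices.

    cached_fraction: share of input tokens served from the prompt cache.
    cache_discount:  discount applied to cached tokens (0.9 = 90% off).
    batch:           apply the flat 50% batch-API discount to everything.
    """
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    cost = (fresh * in_price
            + cached * in_price * (1 - cache_discount)
            + output_tokens * out_price) / 1_000_000
    return cost * 0.5 if batch else cost
```

For example, one million input tokens at a $3/1M rate cost $3.00 uncached, $0.30 fully cached at 90% off, and half of either figure again in batch mode — which is where the "80 to 95 percent" combined savings come from.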
Understanding Tokens and Tokenization
A token is a unit of text used by the language model, generally corresponding to a common word, sub-word, or punctuation mark. Different models use different tokenizers, so the same sentence can produce slightly different token counts across providers. OpenAI's GPT-4o and newer models use the tiktoken library with the o200k_base encoding. Anthropic uses a proprietary tokenizer optimized for multilingual text. Google Gemini and Meta Llama models use SentencePiece-based tokenizers. On English text, all major tokenizers produce counts within 10 to 15 percent of each other, so the character-based approximation used in this calculator (one token ≈ 4 characters) is accurate enough for cost estimation.
Common token counts to remember: 1 token is about 4 English characters or 0.75 words. 100 tokens is roughly 75 words or one short paragraph. 1,000 tokens is about 750 words, or 1.5 single-spaced pages. 10,000 tokens is 7,500 words, roughly equivalent to a short magazine article. 100,000 tokens is around 75,000 words, or the length of a 200-page novel. Most modern frontier models support context windows of 128,000 to 1,000,000 tokens, which is enough to fit entire books or large codebases in a single prompt.
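The rules of thumb above are easy to encode as a quick estimator. This is a rough character-based sketch, not a real tokenizer — use tiktoken or a provider's token-counting endpoint when exact counts matter:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 English characters per token (accurate ±10-15%)."""
    return max(1, round(len(text) / 4))

def words_to_tokens(words: int) -> int:
    """Rough estimate: 1 token ≈ 0.75 words, so 1 word ≈ 1.33 tokens."""
    return round(words / 0.75)

# 750 words ≈ 1,000 tokens, matching the rule of thumb above.
```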
Provider Pricing Snapshot — April 2026
The table below shows base pricing for the flagship models of each major provider as of April 2026. Prices are per million tokens and reflect standard (non-batch, non-cached) API rates. Always verify current rates directly with the provider before committing to a workload at scale, since pricing changes periodically.
| Provider | Flagship Model | Input / 1M | Output / 1M | Context Window |
|---|---|---|---|---|
| Anthropic | Claude Opus 4.6 | $5.00 | $25.00 | 200k (1M beta) |
| OpenAI | GPT-4.1 | $2.00 | $8.00 | 1M |
| Google | Gemini 2.5 Pro | $1.25 | $10.00 | 1M |
| Mistral | Mistral Large 2 | $2.00 | $6.00 | 128k |
| DeepSeek | DeepSeek V3 | $0.27 | $1.10 | 64k |
| xAI | Grok 3 | $3.00 | $15.00 | 131k |
| Meta (Groq host) | Llama 3.3 70B | $0.59 | $0.79 | 128k |
| Cohere | Command R+ | $2.50 | $10.00 | 128k |
Real-World Cost Examples
Example 1 — Customer support chatbot: A typical support chatbot processes 1,000 conversations per day with an average of 500 input tokens and 400 output tokens per turn. Running this on GPT-4o costs approximately $0.0053 per conversation ($5.25 per day, about $158 per month). The same workload on GPT-4o-mini costs about $0.00032 per conversation ($0.32 per day, about $9.45 per month), a 94 percent cost reduction. Running it on Claude Haiku 4.5 costs $0.0025 per conversation ($2.50 per day, $75 per month).
Example 2 — RAG system with large documents: A retrieval-augmented system sends 8,000 tokens of context plus a 200-token question on every request and receives 500 tokens of output. At 500 requests per day on Claude Sonnet 4.6, daily cost is approximately $16.05 (about $480/month). Enabling Anthropic prompt caching on the 8,000-token context cuts that portion of input cost by roughly 90 percent after the first request, dropping total cost to about $5.25/day (about $158/month), a 67 percent savings.
Example 3 — Bulk document processing: Summarizing 10,000 research papers, each ~30,000 input tokens with ~1,000 output tokens per summary, costs about $850 on GPT-4o via the standard API ($750 input plus $100 output). Using the batch API drops this to $425. Switching to Gemini 2.5 Flash drops it further to about $115 via standard API, or roughly $58 with the batch discount. For non-urgent bulk workloads, the cheaper model plus the batch discount turns an $850 job into a sub-$60 job — a 93 percent reduction.
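Example 1's arithmetic can be reproduced in a few lines (GPT-4o prices as stated on OpenAI's pricing page; volumes from the example):

```python
IN_PRICE, OUT_PRICE = 2.50, 10.00        # GPT-4o, dollars per 1M tokens
per_conversation = (500 * IN_PRICE + 400 * OUT_PRICE) / 1_000_000
per_day = per_conversation * 1_000       # 1,000 conversations per day
per_month = per_day * 30
print(f"${per_conversation:.5f}/conv  ${per_day:.2f}/day  ${per_month:.2f}/month")
# → $0.00525/conv  $5.25/day  $157.50/month
```

Swap in any model's input/output rates from the tables on this page to reprice the same workload.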
How to Reduce LLM Token Costs — 15 Proven Strategies
Most production LLM workloads overspend by 3 to 10 times what is actually necessary. The savings are in the engineering, not the vendor. Here are the specific techniques that deliver real cost reductions, ordered roughly from highest impact to lowest. Every number in this section comes from our own testing, provider documentation, and published benchmarks.
| Strategy | Typical Savings | Effort |
|---|---|---|
| 1. Switch to mini / haiku / flash tier model | 80 – 95% | Low |
| 2. Enable prompt caching | 50 – 90% | Low |
| 3. Use batch API for async work | 50% flat | Low |
| 4. Limit output length explicitly | 20 – 50% | Low |
| 5. Compress prompts / remove filler | 20 – 40% | Low |
| 6. Truncate / summarize chat history | 30 – 70% | Medium |
| 7. Use structured outputs (JSON mode) | 15 – 30% | Low |
| 8. Two-stage pipeline (cheap → flagship) | 60 – 80% | Medium |
| 9. Better RAG chunking + reranking | 40 – 70% | Medium |
| 10. Embeddings search before LLM | 90%+ | Medium |
| 11. Remove stale few-shot examples | 20 – 50% | Low |
| 12. Test open-source via Groq / DeepSeek | 80 – 95% | Medium |
| 13. Set hard token budgets + alerts | Prevents 10×+ spikes | Low |
| 14. Stop runaway agent loops | Prevents 100×+ spikes | Medium |
| 15. Stream + terminate early | 10 – 30% | Low |
1. Switch to a Smaller Model Tier (80-95% savings)
The single biggest lever. Most production workloads default to GPT-4o or Claude Sonnet when GPT-4o-mini or Claude Haiku 4.5 would do the job. Mini and haiku tier models are roughly 10-20% of the cost with 85-95% of the quality on most general-purpose tasks. Rule of thumb: always start with the cheapest capable tier and only escalate when you have a specific quality failure to point to. For example, switching a customer-support chatbot from GPT-4o ($2.50/$10 per million) to GPT-4o-mini ($0.15/$0.60) drops monthly cost by 94% — for most chat use cases this is quality-neutral.
2. Enable Prompt Caching (50-90% savings)
Identify the static portion of your prompt — system instructions, persona, document context, few-shot examples, tool definitions — and mark it as cacheable. Anthropic offers up to 90% off cached input, OpenAI 50-75%, and Google Gemini has context caching with similar savings. Real example: a RAG chatbot with an 8,000-token document context and a 200-token user query, running 500 times per day on Claude Sonnet 4.6, costs ~$12/day in input uncached. With caching on the 8,000-token prefix, the discount drops that to ~$1.50/day — close to a 90% reduction on input. Caching is a configuration change, not a refactor — it can usually be enabled in one afternoon.
3. Use the Batch API for Non-Urgent Work (50% flat)
Both OpenAI and Anthropic batch APIs give a flat 50% discount on both input and output tokens, with 24-hour turnaround. Ideal for bulk workloads that do not need real-time responses: summarization of large document sets, classification/labeling jobs, data extraction pipelines, bulk translation, embedding generation, quality scoring, and offline agent research tasks. Processing 10 million input tokens with GPT-4o costs $25 on the standard API and $12.50 via batch. For any workload where "within 24 hours" is acceptable, batch is the single cheapest optimization you can make.
4. Limit Output Length Explicitly (20-50% savings)
Output tokens cost 3 to 5 times more than input tokens on every major provider. A small reduction in output length produces an outsized reduction in total cost. Always set max_tokens in your API calls to cap runaway responses. Explicitly prompt the model to be concise: "Answer in one sentence", "Respond in under 50 words", "Return only the JSON object, no preamble". For chatbot workloads, reducing average output length from 600 to 400 tokens drops total cost by approximately 25%. For summarization tasks, prompting for bullet points instead of paragraphs typically saves 30-40% on output.
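Why trimming output moves the bill so much: a quick check at GPT-4o's $2.50/$10 rates. The exact percentage depends on your input/output split, so treat the numbers as illustrative:

```python
def chat_cost(in_tok, out_tok, in_price=2.50, out_price=10.00):
    """Per-request cost in dollars at per-million-token prices."""
    return (in_tok * in_price + out_tok * out_price) / 1_000_000

before = chat_cost(500, 600)       # $0.00725 per request
after = chat_cost(500, 400)        # $0.00525 per request
saving = 1 - after / before        # ≈ 28% off the total bill
```

At these prices, cutting output by a third removes close to 28% of total cost; the more output-heavy the workload, the closer the saving tracks the cut.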
5. Compress Your Prompts (20-40% savings)
Most prompts in production systems are bloated with filler words ("please", "kindly", "if you could"), redundant instructions, verbose examples, and boilerplate. Rewrite them in tight imperative style. Replace prose instructions with numbered bullet points. Remove politeness tokens — the model is not offended. Use abbreviations where the meaning is unambiguous. Real example: a 1,200-token system prompt compressed to 720 tokens reduces input cost by 40% per request at zero quality loss. Over a million requests per month, that is a 40% reduction in the entire input line item. Tools like Anthropic's prompt improver or GPT-4 itself can suggest compressions.
6. Truncate or Summarize Chat History (30-70% savings)
The naive way to build a chatbot is to send the full conversation history on every turn. After 20 turns, your input prompt is 20 times longer than the user's latest message, and you are paying for every re-send. Better: use a sliding window — keep only the last 5-6 turns verbatim. Even better: summarize older turns with a cheap model (Haiku / GPT-4o-mini / Gemini Flash) and inject the summary as context. This reduces chat history cost by 60-80% on long conversations while preserving continuity. Libraries like LangChain and LlamaIndex have built-in "ConversationSummaryMemory" primitives for this.
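The sliding-window-plus-summary pattern fits in a few lines. This is a minimal sketch: the `summarize` callable is a stand-in for a call to a cheap model (Haiku / GPT-4o-mini / Gemini Flash), replaced here with a trivial placeholder:

```python
def build_history(turns: list[str], keep_last: int = 6,
                  summarize=lambda old: "Summary: " + " | ".join(t[:40] for t in old)):
    """Sliding-window chat memory: keep the last `keep_last` turns verbatim
    and collapse everything older into a single cheap-model summary."""
    if len(turns) <= keep_last:
        return turns
    old, recent = turns[:-keep_last], turns[-keep_last:]
    return [summarize(old)] + recent
```

A 20-turn conversation is sent as 7 messages (one summary plus six verbatim turns) instead of 20, which is where the 60-80% history savings come from.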
7. Use Structured Outputs / JSON Mode (15-30% savings)
When you need structured data, ask for JSON directly instead of parsing prose. JSON is more token-dense than natural language for structured content, produces predictable output lengths you can budget around, and eliminates the "parsing preamble" ("Sure, here is the data you requested: ...") that flagship models love to add. OpenAI's JSON mode, Anthropic's tool use, and Gemini's structured outputs all enforce strict JSON schemas. Real example: extracting 8 fields from user messages via prose output averages ~180 tokens per response; the same extraction via JSON mode averages ~110 tokens — a 39% output reduction.
8. Two-Stage Pipelines — Cheap Model Then Flagship (60-80% savings)
Instead of using a flagship model for everything, use a cheap model (GPT-4o-mini, Haiku, Flash) for the bulk work and only escalate to a flagship model when the cheap model's output requires review or extension. Examples: use Haiku to classify incoming support tickets by category, only route "complex" ones to Sonnet for the actual response. Use GPT-4o-mini to draft the initial response, only use GPT-4o to polish before sending. Use Flash to pre-filter 1,000 candidate documents down to 10, then use Pro to synthesize. Two-stage pipelines typically cut total cost 60-80% versus single-model approaches for the same output quality.
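The escalation pattern is a few lines of routing logic. The three callables below are stand-ins for real API calls (e.g. a Haiku classifier, a Haiku responder, a Sonnet responder) — a sketch of the control flow, not a client library:

```python
def route(ticket: str, classify_cheap, answer_cheap, answer_flagship):
    """Two-stage pipeline: a cheap model classifies every ticket, and only
    tickets it labels 'complex' escalate to the flagship model."""
    label = classify_cheap(ticket)       # cheap tokens on every request
    if label == "complex":
        return answer_flagship(ticket)   # expensive tokens, rare path
    return answer_cheap(ticket)          # cheap tokens, common path
```

If, say, 20% of tickets escalate, roughly 80% of answer tokens move to the cheap tier while every hard case still gets flagship treatment.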
9. Better RAG Chunking and Reranking (40-70% savings)
Most RAG systems retrieve 10-20 large chunks and dump them all into the prompt. This is wasteful. Two fixes: shrink chunk size (smaller chunks mean less irrelevant content in the injected context), and rerank retrieved chunks with a cheap cross-encoder before sending them to the generation model. Cohere's rerank model is $2 per 1,000 searches and reduces the chunks you need to send by 50-70%. Net effect: smaller RAG prompts, faster responses, and dramatically lower per-query costs. For a RAG system doing 10,000 queries/day at ~10k tokens context each, better chunking + reranking typically cuts cost from ~$300/day to ~$80-120/day.
10. Use Embeddings Before the LLM (90%+ savings)
For semantic search, similarity matching, deduplication, clustering, and classification of items against a fixed taxonomy, embeddings are roughly 100 times cheaper than sending the same task to an LLM. OpenAI's text-embedding-3-small is $0.02 per million tokens — essentially free at scale. Build your retrieval layer on embeddings, let the LLM handle only final synthesis. A search-over-10,000-documents system that sends each document to GPT-4o for matching costs ~$500 per query; the same system built on embeddings costs ~$0.05 per query. When your task is "find the right thing" rather than "generate the thing", embeddings win every time.
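The retrieval layer needs no LLM at all — just pre-computed embedding vectors and cosine similarity. A minimal pure-Python sketch (production systems would use numpy or a vector database, and real embeddings from an embedding API):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=3):
    """Indices of the k documents most similar to the query embedding.
    Only these winners ever reach the LLM synthesis step."""
    scored = sorted(enumerate(doc_vecs),
                    key=lambda iv: cosine(query_vec, iv[1]), reverse=True)
    return [i for i, _ in scored[:k]]
```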
11. Remove Stale Few-Shot Examples (20-50% savings)
Many production systems ship with 5-10 few-shot examples in the prompt and never revisit them. As models improve, those examples become dead weight: the newer model no longer needs them to produce quality output. Test removing or shrinking your few-shot examples periodically — often you can cut 3-5 examples without any quality loss, reducing prompt size by 30-50%. This is especially true with frontier models (Claude Sonnet 4.6, GPT-4.1, Gemini 2.5 Pro) which have strong zero-shot performance on most tasks.
12. Test Open-Source Models via Groq, DeepSeek, Together (80-95% savings)
For commodity tasks — classification, extraction, simple generation, summarization, translation, data cleanup — open-source models hosted by Groq, Together, Fireworks, DeepInfra, Replicate, and OpenRouter are 10 to 50 times cheaper than GPT-4o with acceptable quality. Llama 3.3 70B via Groq is $0.59 input and $0.79 output per million tokens, versus $2.50/$10 for GPT-4o. DeepSeek V3 direct is $0.27/$1.10. For a task where "GPT-4o quality" is not required, switching to one of these options routinely saves 85-95% of spend.
13. Set Hard Token Budgets and Alerts
The single easiest way to lose $10,000 on LLM bills is to ship a bug that sends requests in a tight loop, or lets a user abuse your system with 100k-token prompts. Every production system should have: per-user daily token budgets with automatic blocking, per-endpoint rate limits, and billing alerts at multiple thresholds (50%, 80%, 120% of budget). OpenAI and Anthropic both offer usage limits in their dashboards. Set them. This does not reduce your cost per request but it prevents catastrophic overages — which is often the biggest "savings" you can get.
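A per-user budget guard can be this small. A minimal in-memory sketch — a real system would persist the counters, reset them daily, and wire `utilization` into billing alerts:

```python
from collections import defaultdict

class TokenBudget:
    """Per-user daily token budget with automatic blocking (minimal sketch)."""
    def __init__(self, daily_limit: int):
        self.daily_limit = daily_limit
        self.used = defaultdict(int)

    def charge(self, user: str, tokens: int) -> bool:
        """Record usage; return False (block the request) on overflow."""
        if self.used[user] + tokens > self.daily_limit:
            return False
        self.used[user] += tokens
        return True

    def utilization(self, user: str) -> float:
        """Fraction of the daily budget consumed — compare against alert
        thresholds (e.g. 0.5, 0.8) to fire billing alerts."""
        return self.used[user] / self.daily_limit
```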
14. Stop Runaway Agent Loops
Agentic workflows (multi-step tool-using agents) are the #1 cause of unexpected LLM bills. An agent that gets stuck in a loop can easily burn 1 million tokens in an hour. Always enforce a hard maximum on tool-call iterations (typically 10-20), total tokens per agent run (typically 100k-500k), and wall-clock time per run (typically 5-10 minutes). Log every step. Kill runs that exceed the budget. For production agents, also implement circuit breakers that disable the agent entirely if error rate spikes above a threshold.
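All three budgets can wrap any agent loop in one guard. `step` below is a stand-in for one plan/tool-call iteration against a real model and must report whether it finished and how many tokens it spent — a sketch of the guard, not an agent framework:

```python
import time

def run_agent(step, max_iters=15, max_tokens=200_000, max_seconds=300):
    """Enforce hard budgets on iterations, total tokens, and wall-clock time.
    `step(i)` performs one agent iteration and returns (done, tokens, result)."""
    total_tokens, start = 0, time.monotonic()
    for i in range(max_iters):
        done, tokens, result = step(i)
        total_tokens += tokens
        if done:
            return result
        if total_tokens > max_tokens:
            raise RuntimeError(f"token budget exceeded: {total_tokens}")
        if time.monotonic() - start > max_seconds:
            raise RuntimeError("wall-clock budget exceeded")
    raise RuntimeError("max iterations exceeded")
```

Killing a stuck run at iteration 15 instead of iteration 15,000 is the difference between a harmless error and a four-figure bill.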
15. Stream and Terminate Early (10-30% savings)
Streaming responses allows your application to stop consumption as soon as enough output has been generated. For interactive use cases (chatbots, autocomplete, code suggestion), this is a free optimization: users see the first tokens faster and you can abort the stream if the user cancels, clicks away, or if the output already satisfies the use case. Also useful for parsing-early scenarios: if you are extracting a JSON field, you can stop streaming once the field arrives instead of waiting for the full response. Savings are smaller than the other strategies here (10-30%) but the latency improvements alone usually justify it.
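The early-termination idea in sketch form. `chunks` stands in for a provider's streaming iterator; providers generally stop generating (and billing) soon after the client closes the stream:

```python
def consume_stream(chunks, stop_when):
    """Consume a token stream and stop as soon as `stop_when` says the
    accumulated output already satisfies the use case (e.g. the JSON
    object you were extracting has fully arrived)."""
    buf = []
    for chunk in chunks:
        buf.append(chunk)
        if stop_when("".join(buf)):
            break   # abort the stream instead of paying for the rest
    return "".join(buf)
```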
Best LLM by Use Case (2026 Recommendations)
The right model depends on the task, your quality requirements, and your budget. Below are my current recommendations by use case, updated for April 2026. Every recommendation is based on published benchmarks and the real-world cost math this calculator can verify.
Best LLM for a Customer Support Chatbot
Primary pick: GPT-5 Mini ($0.25/$2.00) or Claude Haiku 4.5 ($1/$5). For a chatbot handling 1,000 conversations a day with 500-token messages, GPT-5 Mini costs approximately $1.13/day ($34/month) and Claude Haiku 4.5 costs approximately $3.00/day ($90/month). Both deliver near-frontier conversational quality at a small fraction of flagship prices. Only escalate to Claude Sonnet 4.6 or GPT-5 if your users are repeatedly hitting edge cases the mini tier cannot handle. Alternative: DeepSeek V3 at $0.014/$0.028 per million is roughly 18× cheaper on input than GPT-5 Mini if quality-per-dollar is your only constraint.
Best LLM for Coding and Developer Tools
Primary pick: Claude Sonnet 4.6 ($3/$15) or GPT-5 Codex ($1.25/$10). Claude Sonnet 4.6 is still widely considered the best coding model as of April 2026, with strong performance on SWE-bench, real-world bug fixing, and multi-file refactoring. GPT-5 Codex is a strong secondary for OpenAI-stack teams. For budget-conscious coding workloads, Mistral Codestral 2508 ($0.30/$0.90) and Devstral Small 1.1 ($0.07/$0.28) offer ~80% of Sonnet's quality at a tiny fraction of the cost. DeepSeek Coder V2 is another strong budget option. For heavyweight agentic coding (Claude Code, Cursor Composer, Cline), Claude Sonnet 4.6 or GPT-5.1 Codex are the standards.
Best LLM for RAG and Document Q&A
Primary pick: Gemini 2.5 Flash ($0.30/$2.50) or Claude Sonnet 4.6 with caching ($3/$15 before cache). For RAG, the biggest cost factor is the context you inject. Use Gemini 2.5 Flash for its combination of 1M context window, fast speed, and low price. If you need higher-quality synthesis and have Anthropic in your stack, use Claude Sonnet 4.6 with prompt caching enabled — the 90% cache discount on the static document context dramatically reduces per-query cost. Always pair with an embedding + rerank step (use OpenAI text-embedding-3-small at $0.02/1M or Cohere Embed v3 at $0.10/1M) rather than dumping all retrieved chunks into the prompt.
Best LLM for Reasoning and Complex Problem Solving
Primary pick: OpenAI o3 ($2/$8) or DeepSeek R1 ($0.55/$2.00). For hard reasoning tasks — math, logic puzzles, complex multi-step problems, scientific reasoning — the reasoning-tuned models decisively beat base models. o3 is the most accurate and now priced reasonably at $2/$8. DeepSeek R1 is the budget alternative at roughly 25% of the cost with 80-90% of the quality. Claude Opus 4.6 Thinking ($5/$25) is the top option if you value Anthropic's reasoning style. Reserve o1 Pro ($150/$600) for the most demanding problems where every percentage point of accuracy matters — at its price point, you should be using it sparingly.
Best LLM for Bulk Processing and Data Extraction
Primary pick: GPT-5 Nano ($0.05/$0.40) or DeepSeek V3 ($0.014/$0.028), both with batch API. For classification, entity extraction, labeling, or bulk summarization of millions of documents, go with the cheapest capable model and use the batch API for an additional 50% discount. GPT-5 Nano in batch mode costs effectively $0.025/$0.20 per million. DeepSeek V3 without batch is already cheaper than any batch-mode competitor at $0.014/$0.028 per million. Processing 10 million documents averaging 2,000 input tokens each (20 billion tokens total) costs roughly $50,000 in input alone on GPT-4o, versus about $500 on GPT-5 Nano in batch mode or about $280 on DeepSeek V3.
Best LLM for Agent Workflows
Primary pick: Claude Sonnet 4.6 ($3/$15) or GPT-5 ($0.625/$5). Agentic workflows with tool use are the fastest-growing cost category — a single poorly-designed agent can burn millions of tokens per session. Claude Sonnet 4.6 has the strongest tool-use and multi-step planning performance in production agents. GPT-5 is a very close second and noticeably cheaper per request. Both support function calling, JSON mode, and structured outputs. Avoid flagship Opus and o1 Pro tier models for agent loops unless you are paying attention to wall-clock budgets — their per-token cost multiplies quickly across many tool calls. Always enforce a hard max-iterations limit and per-run token budget in production.
Best LLM for Creative Writing and Long-Form Content
Primary pick: Claude Opus 4.6 ($5/$25) or GPT-5 ($0.625/$5). Claude's voice consistency, narrative flow, and nuance on creative prompts remains the gold standard in 2026. Opus 4.6 at the new $5/$25 price is dramatically more accessible than the old $15/$75 Opus 4 pricing. For budget creative work, GPT-5 Chat ($1.25/$10) and Gemini 3.1 Pro ($2/$12) are solid choices. Mistral Small Creative ($0.10/$0.30) is specifically tuned for creative writing at a tiny fraction of the cost — worth testing for story generation, marketing copy, and ideation.
Best LLM for Multilingual Tasks
Primary pick: Gemini 3.1 Pro ($2/$12) or GPT-5 ($0.625/$5). Google's Gemini models have historically been the strongest on non-English tasks, and Gemini 3.1 Pro maintains that lead. GPT-5 and Claude Sonnet 4.6 are close seconds. For budget multilingual work at scale, Gemini 2.5 Flash Lite ($0.10/$0.40) is competitive. DeepSeek V3 handles Chinese and many Asian languages well. Mistral Saba is specifically trained for Arabic and Middle Eastern languages if that is your target.
Model Family Quick Reference (2026)
Here is the at-a-glance price ladder for the most-used model families as of April 2026. Use this to quickly identify the cheapest frontier option in each ecosystem before opening the full calculator below.
| Family | Cheapest | Balanced | Flagship |
|---|---|---|---|
| OpenAI GPT-5 | Nano $0.05/$0.40 | Mini $0.25/$2.00 | GPT-5 $0.625/$5 · Pro $15/$120 |
| OpenAI o-series | o3 Mini $0.55/$2.20 | o3 $2/$8 | o1 $15/$60 · o1 Pro $150/$600 |
| Anthropic Claude | Haiku 4.5 $1/$5 | Sonnet 4.6 $3/$15 | Opus 4.6 $5/$25 |
| Google Gemini | 2.5 Flash Lite $0.10/$0.40 | 3 Flash $0.50/$3 | 3.1 Pro $2/$12 |
| Meta Llama | 3.1 8B $0.02/$0.05 | 4 Scout $0.08/$0.30 | 3.1 405B $0.90/$0.90 |
| DeepSeek | V3 $0.014/$0.028 | V3.1 $0.15/$0.75 | R1 $0.55/$2.00 |
| Mistral | Nemo $0.02/$0.04 | Small 3.2 24B $0.075/$0.20 | Large 3 $0.50/$1.50 |
| xAI Grok | Grok 4 Fast $0.20/$0.50 | Grok 3 Mini $0.25/$0.50 | Grok 4 $3/$15 |
Frequently Asked Questions
How much does Claude API cost?
Claude API pricing varies by model. Claude Opus 4.6 costs $5 per million input tokens and $25 per million output tokens. Claude Sonnet 4.6 costs $3 per million input tokens and $15 per million output tokens. Claude Haiku 4.5 is the cheapest at $1 per million input tokens and $5 per million output tokens. Anthropic also offers prompt caching discounts of up to 90 percent on cached input and a 50 percent discount via the batch API. Prices verified from anthropic.com/pricing as of April 2026.
How much does the OpenAI API cost?
OpenAI pricing depends on the model. GPT-4.1 costs $2 per million input tokens and $8 per million output tokens. GPT-4o is $2.50 input and $10 output per million. GPT-4o-mini is $0.15 input and $0.60 output per million, making it one of the cheapest frontier models. The reasoning model o1 is $15 input and $60 output per million, while o3-mini is $1.10 input and $4.40 output. OpenAI also offers a 50 percent discount via the batch API and up to 75 percent savings on cached input. Prices verified from platform.openai.com/pricing as of April 2026.
What is the cheapest LLM API?
The cheapest production-grade LLM APIs as of April 2026 are DeepSeek V3 at approximately $0.27 per million input tokens and $1.10 per million output tokens, Google Gemini 2.5 Flash Lite at $0.10 per million input and $0.40 per million output, and OpenAI GPT-4o-mini at $0.15 input and $0.60 output per million. Open-source models like Llama 3.3 70B are even cheaper via hosts like Groq (approximately $0.59 input and $0.79 output per million). For most general purpose tasks, Gemini 2.5 Flash Lite and GPT-4o-mini offer the best balance of cost and capability.
How are LLM tokens counted?
A token is roughly equivalent to 4 characters of English text or about 0.75 words. The exact count varies by model because each provider uses its own tokenizer. OpenAI uses tiktoken with the o200k_base encoding for GPT-4o and newer models. Anthropic uses its own proprietary tokenizer. Google Gemini and Llama models use SentencePiece-based tokenizers. As a general rule, 1,000 tokens is about 750 words, and a typical novel page of 300 words is approximately 400 tokens. This calculator uses a character-based approximation that is accurate within 10 to 15 percent for English text.
What is prompt caching and how much can it save?
Prompt caching lets you reuse a large static prefix (such as a system prompt, document, or RAG context) across multiple requests at a heavily discounted rate. Anthropic offers up to 90 percent off cached input tokens. OpenAI offers 50 to 75 percent off cached input. Google Gemini offers context caching with similar savings. For workloads with repeated context such as chatbots, agents, or RAG systems, caching can reduce total costs by 50 to 80 percent. For example, a chatbot with a 5,000-token system prompt and 500-token queries saves approximately 70 percent on input costs when using Anthropic prompt caching.
How much does it cost to run a chatbot with GPT-4?
A typical chatbot using GPT-4o with 500-token user messages and 500-token responses costs approximately $0.0063 per conversation ($2.50 per million input tokens plus $10 per million output tokens). At 1,000 conversations per day, that totals about $6.25 per day or $188 per month. Switching the same workload to GPT-4o-mini reduces the cost to approximately $0.375 per day or $11.25 per month, a 94 percent savings. Using Claude Haiku 4.5 would cost approximately $3 per day or $90 per month. For cost-sensitive chatbots, the mini and haiku tier models offer dramatic savings over flagship models.
What is the batch API discount?
The batch API allows you to submit large volumes of requests and receive responses within 24 hours at a 50 percent discount off standard pricing. Both OpenAI and Anthropic offer batch APIs with this discount. It is ideal for non-interactive workloads such as bulk summarization, classification, translation, or data processing where real-time responses are not required. For example, processing 10 million input tokens with GPT-4o costs $25 via the standard API but only $12.50 via the batch API. Google Gemini offers similar batch pricing. Batch APIs are the single biggest cost optimization for bulk workloads.
Which LLM is best for cost vs quality?
The best cost-to-quality ratio as of April 2026 depends on the task. For general chat and content generation, Claude Haiku 4.5 and GPT-4o-mini offer near-frontier quality at roughly one-tenth the cost of flagship models. For coding tasks, Claude Sonnet 4.6 is widely considered the best at a moderate price point of $3 input and $15 output per million tokens. For reasoning-heavy tasks, OpenAI o1-mini and DeepSeek R1 offer strong performance at lower cost than full o1 or Claude Opus. For multilingual and multimodal tasks, Google Gemini 2.5 Pro is highly competitive. The cheapest serviceable option for most tasks is Gemini 2.5 Flash Lite or DeepSeek V3.
📖 Want to go deep on AI costs?
Read our comprehensive AI Cost Guide 2026 — a 5,000-word deep dive covering token economics, provider comparison, cost by use case, hidden costs of AI development, budgeting templates, 20 optimization strategies, fine-tuning vs prompting math, enterprise contracts, and real cost horror stories with lessons learned.
Read the AI Cost Guide →