OpenAI API Cost Calculator: Full Pricing Guide for Production Teams (2026)
Current OpenAI API pricing for all major models, a practical cost calculator, and strategies to reduce your bill by 40–70% using intelligent model selection.
Most engineering teams don't have an accurate picture of their LLM API costs until the monthly invoice surprises them. The OpenAI pricing page lists per-token rates, but converting those rates into monthly projections for your specific workload requires knowing your call volume, average token counts, model distribution, and the cost of any features you're using (caching, batching, embeddings).
This guide covers current OpenAI API pricing across all major models, provides a reusable cost calculator, and explains where the biggest optimization opportunities are.
Current OpenAI API Pricing (2026)
GPT-4o Family
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cached Input | Notes |
|---|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | $1.25 | Current flagship |
| GPT-4o-mini | $0.15 | $0.60 | $0.075 | 16x cheaper on output |
| GPT-4.1 | $2.00 | $8.00 | $0.50 | Strong on code tasks |
| GPT-4.1-mini | $0.40 | $1.60 | $0.10 | Mid-tier option |
| o3 (reasoning) | $10.00 | $40.00 | $2.50 | Extended thinking tasks |
| o4-mini (reasoning) | $1.10 | $4.40 | $0.275 | Cost-efficient reasoning |
GPT-3.5 Legacy
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| GPT-3.5 Turbo | $0.50 | $1.50 | Legacy, GPT-4o-mini is better and cheaper |
Embeddings and Other Models
| Model | Cost | Use Case |
|---|---|---|
| text-embedding-3-small | $0.02 / 1M tokens | Most efficient embeddings |
| text-embedding-3-large | $0.13 / 1M tokens | Higher accuracy embeddings |
| Whisper (audio transcription) | $0.006 / minute | Speech to text |
| TTS (text to speech) | $15.00 / 1M characters | Audio generation |
| DALL-E 3 (1024x1024) | $0.040 / image | Image generation |
Batch API Discounts
OpenAI's Batch API processes requests asynchronously with a 24-hour completion window at 50% off both input and output prices. For workloads that are not latency-sensitive — document processing, nightly report generation, dataset enrichment — this is a straightforward 50% discount that requires minimal code changes.
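The "minimal code changes" amount to writing your requests as JSONL instead of calling the API directly. A sketch of building the batch input file — the request shape (one JSON object per line with `custom_id`, `method`, `url`, `body`) follows OpenAI's documented Batch API format; the function name and defaults here are illustrative:

```python
import json

def build_batch_lines(prompts, model="gpt-4o-mini", max_tokens=300):
    """Build JSONL lines for OpenAI's Batch API: one chat-completion
    request per line, each tagged with a custom_id for matching results."""
    lines = []
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }
        lines.append(json.dumps(request))
    return "\n".join(lines)

# Write the JSONL to a file, then (with an OpenAI client) upload it and
# create the batch -- roughly, per the openai-python SDK:
#   batch_file = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
#   client.batches.create(input_file_id=batch_file.id,
#                         endpoint="/v1/chat/completions",
#                         completion_window="24h")
jsonl = build_batch_lines(["Summarize this report.", "Classify this ticket."])
```

Results arrive within the 24-hour window as an output file keyed by the `custom_id` you assigned to each request.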
How to Calculate Your Monthly LLM Costs
Token costs are straightforward once you understand what you're counting:
- Input tokens: Everything in your request — system prompt, conversation history, user message, function definitions
- Output tokens: The model's response
A rough rule of thumb: 1,000 tokens ≈ 750 words of English text.
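That rule of thumb converts directly into a quick estimator. This uses the article's 750-words-per-1,000-tokens ratio, not an exact tokenizer — for precise counts, use a tokenizer library such as tiktoken:

```python
def estimate_tokens(word_count: int) -> int:
    """Rough token estimate from an English word count,
    using the 1,000 tokens ~= 750 words rule of thumb."""
    return round(word_count / 0.75)

def estimate_words(token_count: int) -> int:
    """Inverse: rough word count that fits a token budget."""
    return round(token_count * 0.75)

print(estimate_tokens(750))   # -> 1000
print(estimate_words(1000))   # -> 750
```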
The Cost Formula
```
Monthly Cost =
    (Monthly Input Tokens / 1,000,000) × Input Price Per Million
  + (Monthly Output Tokens / 1,000,000) × Output Price Per Million
```
Python Cost Calculator
```python
def calculate_monthly_llm_cost(
    monthly_calls: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    model: str = "gpt-4o",
    cached_input_fraction: float = 0.0,
    batch_fraction: float = 0.0,
) -> dict:
    """
    Calculate estimated monthly OpenAI API cost.

    Args:
        monthly_calls: Total API calls per month
        avg_input_tokens: Average input tokens per call
        avg_output_tokens: Average output tokens per call
        model: Model name (gpt-4o, gpt-4o-mini, etc.)
        cached_input_fraction: Fraction of input tokens that are cached (0.0-1.0)
        batch_fraction: Fraction of calls using the Batch API (0.0-1.0)

    Returns:
        dict with cost breakdown
    """
    pricing = {
        "gpt-4o": {"input": 2.50, "output": 10.00, "cached": 1.25},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60, "cached": 0.075},
        "gpt-4.1": {"input": 2.00, "output": 8.00, "cached": 0.50},
        "gpt-4.1-mini": {"input": 0.40, "output": 1.60, "cached": 0.10},
        "o3": {"input": 10.00, "output": 40.00, "cached": 2.50},
        "o4-mini": {"input": 1.10, "output": 4.40, "cached": 0.275},
    }
    if model not in pricing:
        raise ValueError(f"Unknown model: {model}")
    prices = pricing[model]

    total_input_tokens = monthly_calls * avg_input_tokens
    total_output_tokens = monthly_calls * avg_output_tokens

    # Split input tokens into cached and standard
    cached_input = total_input_tokens * cached_input_fraction
    standard_input = total_input_tokens * (1 - cached_input_fraction)

    # Batch API: 50% discount on both input and output for the
    # batched fraction of calls
    batch_multiplier = 1 - 0.5 * batch_fraction

    input_cost = (
        standard_input / 1_000_000 * prices["input"]
        + cached_input / 1_000_000 * prices["cached"]
    ) * batch_multiplier
    output_cost = (
        total_output_tokens / 1_000_000 * prices["output"] * batch_multiplier
    )
    total_cost = input_cost + output_cost

    return {
        "model": model,
        "monthly_calls": monthly_calls,
        "total_input_tokens": total_input_tokens,
        "total_output_tokens": total_output_tokens,
        "input_cost": round(input_cost, 2),
        "output_cost": round(output_cost, 2),
        "total_monthly_cost": round(total_cost, 2),
        "cost_per_call": round(total_cost / monthly_calls, 6),
    }


# Example: 200K calls/month, 1K input tokens, 500 output tokens
result = calculate_monthly_llm_cost(
    monthly_calls=200_000,
    avg_input_tokens=1_000,
    avg_output_tokens=500,
    model="gpt-4o",
)
print(f"Monthly cost: ${result['total_monthly_cost']:,.2f}")
print(f"Cost per call: ${result['cost_per_call']:.4f}")
# Output:
# Monthly cost: $1,500.00
# Cost per call: $0.0075
```
Quick Reference: Cost Per 1,000 Calls
At common token volumes, costs per 1,000 calls look like this:
| Avg Tokens (In/Out) | GPT-4o | GPT-4o-mini | Savings from Routing |
|---|---|---|---|
| 500 / 250 | $3.75 | $0.23 | 94% |
| 1,000 / 500 | $7.50 | $0.45 | 94% |
| 2,000 / 1,000 | $15.00 | $0.90 | 94% |
| 5,000 / 2,000 | $32.50 | $1.95 | 94% |
These ratios hold regardless of scale. GPT-4o-mini's input and output prices are both roughly 6% of GPT-4o's, so routing a call to the cheaper model yields a consistent ~94% cost reduction whatever the token mix.
The Three Biggest Cost Drivers
Understanding where your money goes helps prioritize optimization effort.
1. Output tokens (the most expensive line item)
Output tokens cost four times as much as input tokens on GPT-4o ($10.00 vs. $2.50 per million). This means a response-heavy workload — where the model produces long, detailed outputs — is disproportionately expensive.
The optimization levers for output costs:
- Model routing: GPT-4o-mini output costs $0.60/M vs $10.00/M, a 94% reduction on output
- Output length control: explicit `max_tokens` constraints and instructions to "respond concisely" reduce average output length
- Streaming with early stopping: for user-facing applications, streaming allows clients to stop generation when they have what they need
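The first two levers live in the request itself. A minimal sketch — the wrapper name, the 150-token cap, and the brevity instruction are illustrative choices, not recommendations:

```python
def concise_chat_kwargs(user_message: str,
                        model: str = "gpt-4o-mini",
                        max_output_tokens: int = 150) -> dict:
    """Build chat-completion kwargs that cap output length with both a
    hard max_tokens ceiling and a brevity instruction in the system prompt."""
    return {
        "model": model,
        "max_tokens": max_output_tokens,  # hard ceiling on billed output tokens
        "messages": [
            {"role": "system", "content": "Respond concisely."},
            {"role": "user", "content": user_message},
        ],
    }

# Usage with the OpenAI client (assumed):
#   client.chat.completions.create(**concise_chat_kwargs("Summarize: ..."))
```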
2. System prompt tokens (the silent cost multiplier)
A 2,000-token system prompt sent with every request multiplies across your entire call volume. At 500,000 calls/month, a 2,000-token system prompt contributes 1 billion input tokens — $2,500/month just for the system prompt on GPT-4o.
Prompt caching addresses this: cached tokens cost $1.25/M instead of $2.50/M (50% off) on GPT-4o, and $0.075/M instead of $0.15/M on GPT-4o-mini. For high-volume applications with stable system prompts, caching alone reduces input costs by 30–50%.
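Those savings are easy to quantify. A small helper, using the GPT-4o rates from the pricing table above (the 80% cached fraction in the example is an assumption for illustration):

```python
def cached_input_savings(monthly_input_tokens: int,
                         cached_fraction: float,
                         input_price: float = 2.50,    # GPT-4o, $/1M tokens
                         cached_price: float = 1.25) -> dict:
    """Compare monthly input spend with and without prompt caching."""
    millions = monthly_input_tokens / 1_000_000
    uncached_cost = millions * input_price
    cached_cost = (millions * (1 - cached_fraction) * input_price
                   + millions * cached_fraction * cached_price)
    return {
        "without_caching": round(uncached_cost, 2),
        "with_caching": round(cached_cost, 2),
        "savings": round(uncached_cost - cached_cost, 2),
    }

# The scenario above: 1 billion input tokens/month, assuming 80% of them
# are a stable, cacheable system prompt
print(cached_input_savings(1_000_000_000, cached_fraction=0.8))
# -> {'without_caching': 2500.0, 'with_caching': 1500.0, 'savings': 1000.0}
```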
3. Conversation history accumulation
Multi-turn chat applications re-send the full conversation history with every message. A 10-turn conversation where each turn averages 300 tokens results in roughly 2,700 tokens of history being re-processed on turn 10 — even though most of the early turns are likely irrelevant to the current response.
Context compression strategies — summarizing older turns, pruning irrelevant history, using sliding window context — can reduce effective context length by 40–60% on long-running conversations.
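The simplest of these, a sliding window, can be sketched in a few lines. The 4-characters-per-token heuristic and the drop-oldest-first policy are assumptions; production systems often summarize older turns instead of dropping them outright:

```python
def sliding_window_history(messages: list[dict],
                           max_history_tokens: int = 2_000) -> list[dict]:
    """Keep the system prompt plus the most recent turns that fit in the
    token budget, dropping the oldest turns first."""
    def rough_tokens(msg: dict) -> int:
        return max(1, len(msg["content"]) // 4)  # ~4 chars/token heuristic

    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    kept, budget = [], max_history_tokens
    for msg in reversed(turns):          # walk newest-first
        cost = rough_tokens(msg)
        if cost > budget:
            break                        # oldest remaining turns are dropped
        kept.append(msg)
        budget -= cost
    return system + list(reversed(kept))
```

Each request then sends `sliding_window_history(full_history)` instead of the full transcript, bounding per-turn input cost no matter how long the conversation runs.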
Estimating Costs Before You Build
When architecting a new LLM feature, use this framework to project costs before writing code:
```python
def estimate_feature_monthly_cost(
    feature_name: str,
    daily_active_users: int,
    avg_calls_per_user_per_day: float,
    avg_input_tokens: int,
    avg_output_tokens: int,
    model: str = "gpt-4o",
) -> None:
    """Print a cost projection for a new feature."""
    monthly_calls = int(daily_active_users * avg_calls_per_user_per_day * 30)
    result = calculate_monthly_llm_cost(
        monthly_calls=monthly_calls,
        avg_input_tokens=avg_input_tokens,
        avg_output_tokens=avg_output_tokens,
        model=model,
    )
    # Also calculate with routing to gpt-4o-mini
    mini_result = calculate_monthly_llm_cost(
        monthly_calls=monthly_calls,
        avg_input_tokens=avg_input_tokens,
        avg_output_tokens=avg_output_tokens,
        model="gpt-4o-mini",
    )
    savings = result["total_monthly_cost"] - mini_result["total_monthly_cost"]
    print(f"\nCost projection: {feature_name}")
    print(f"  Monthly calls: {monthly_calls:,}")
    print(f"  GPT-4o cost:      ${result['total_monthly_cost']:>10,.2f}/month")
    print(f"  GPT-4o-mini cost: ${mini_result['total_monthly_cost']:>10,.2f}/month")
    print(f"  Routing savings:  ${savings:>10,.2f}/month")


# Example: Email summarization feature
estimate_feature_monthly_cost(
    feature_name="Email summarization",
    daily_active_users=5_000,
    avg_calls_per_user_per_day=3,
    avg_input_tokens=1_500,
    avg_output_tokens=300,
    model="gpt-4o",
)
# Output:
# Cost projection: Email summarization
#   Monthly calls: 450,000
#   GPT-4o cost:      $  3,037.50/month
#   GPT-4o-mini cost: $    182.25/month
#   Routing savings:  $  2,855.25/month
```
Running this calculation at feature design time reveals whether the default model choice makes economic sense — before it's committed to production.
Where Automatic Routing Fits In
Manual cost calculation and model selection is useful for planning. At scale, you need automatic routing that applies the right model selection to every individual request.
PromptUnit routes your LLM calls automatically — analyzing each request, classifying its complexity, and routing it to the cheapest model that meets the quality bar for that task type. The integration is a single base URL change:
```python
from openai import OpenAI

# Before
client = OpenAI(api_key="sk-...")

# After — all routing, cost attribution, and monitoring activated
client = OpenAI(
    api_key="sk-...",
    base_url="https://api.promptunit.ai/proxy/openai",
    default_headers={"x-promptunit-key": "YOUR_KEY"},
)
```
Every response includes headers with the actual cost, the model used, and the saving versus the requested model:
```
x-promptunit-model: gpt-4o-mini
x-promptunit-original-model: gpt-4o
x-promptunit-cost: 0.00023
x-promptunit-saving: 0.00727
x-promptunit-quality-score: 94
```
The dashboard aggregates these into total monthly savings, broken down by feature, model, and provider. The pricing model is 20% of verified savings — PromptUnit only charges when it demonstrably reduces your bill.
The Cost Scenarios That Catch Teams Off Guard
Prompt injection generating excessive output
A malicious input that causes the model to generate a 10,000-token response on every call instead of the expected 300 tokens increases output costs by 33x for affected requests. Without per-call monitoring and output token circuit breakers, these events are invisible until you see an unexpected invoice spike.
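A per-call output-token circuit breaker can be as simple as checking the usage reported on each response against a hard ceiling. The class name, threshold, and exception here are illustrative:

```python
class OutputTokenBreaker:
    """Trips when a single call's output tokens exceed a hard ceiling --
    e.g. a prompt injection causing runaway generation."""

    def __init__(self, max_output_tokens_per_call: int = 2_000):
        self.ceiling = max_output_tokens_per_call
        self.tripped = False

    def check(self, output_tokens: int) -> None:
        """Call with response.usage.completion_tokens after each request."""
        if output_tokens > self.ceiling:
            self.tripped = True
            raise RuntimeError(
                f"Output token circuit breaker tripped: "
                f"{output_tokens} > {self.ceiling}"
            )

breaker = OutputTokenBreaker(max_output_tokens_per_call=2_000)
breaker.check(300)        # a normal response passes
# breaker.check(10_000)   # would raise: 10,000 > 2,000
```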
Retry loops multiplying call volume
An application bug that causes a retry on every request doubles or triples your call volume instantaneously. Budget enforcement at the proxy layer — circuit breakers on rolling spend windows — can stop this before it becomes expensive.
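A rolling spend window can be sketched with a deque of timestamped per-call costs; the budget and window size below are assumptions, and the `now` parameter exists so the logic can be exercised without a live clock:

```python
import time
from collections import deque

class SpendCircuitBreaker:
    """Rejects new calls once spend within a rolling window exceeds a budget."""

    def __init__(self, budget_usd: float, window_seconds: float = 3600.0):
        self.budget = budget_usd
        self.window = window_seconds
        self.events = deque()  # (timestamp, cost_usd) pairs

    def window_spend(self) -> float:
        return sum(cost for _, cost in self.events)

    def record(self, cost_usd: float, now=None) -> None:
        """Record one call's cost, or raise if it would blow the budget."""
        now = time.time() if now is None else now
        # Evict events that have aged out of the rolling window
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()
        if self.window_spend() + cost_usd > self.budget:
            raise RuntimeError("Spend budget exceeded for rolling window")
        self.events.append((now, cost_usd))

# $10/hour ceiling; each normal GPT-4o call from the earlier example is $0.0075
breaker = SpendCircuitBreaker(budget_usd=10.0, window_seconds=3600)
breaker.record(0.0075)
```

A retry loop that doubles call volume hits the budget ceiling and starts raising instead of silently doubling the invoice.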
A/B tests adding unexpected model volume
Running a quality test of GPT-4o-mini against GPT-4o is reasonable. Running it on 100% of traffic for two weeks without noticing is expensive. Shadow testing through a proxy applies test traffic without duplicating production costs.
Key Takeaways
- GPT-4o is priced at $2.50/M input and $10.00/M output. GPT-4o-mini is priced at $0.15/M input and $0.60/M output — roughly 16x cheaper on both input and output, at every volume level.
- Output tokens are your most expensive cost driver — optimize output length and model selection for output-heavy workloads first.
- System prompt caching reduces input costs by 50% on high-volume applications with stable system prompts. Batch API reduces all costs by 50% for non-latency-sensitive workloads.
- Calculate feature costs at design time using token estimates — surprises at invoice time are avoidable with 20 minutes of upfront calculation.
- Manual model selection doesn't scale. Automatic routing that classifies each request and routes to the appropriate model tier captures savings across your entire call volume, not just the workloads you've manually targeted.
- PromptUnit's 14-day observation mode shows you exactly what routing would save on your specific traffic before any routing changes go live. For teams uncertain about their routability, this is the zero-risk path to an accurate savings estimate.
For a deeper look at which tasks are safe to route to cheaper models — and which aren't — see GPT-4o vs GPT-4o-mini: When Does the Cheaper Model Actually Win? and the complete guide to LLM model routing.