
OpenAI API Cost Calculator: Full Pricing Guide for Production Teams (2026)

Current OpenAI API pricing for all major models, a practical cost calculator, and strategies to reduce your bill by 40–70% using intelligent model selection.

openai-pricing · api-cost-calculator · gpt-4o-pricing · llm-cost-optimization

Most engineering teams don't have an accurate picture of their LLM API costs until the monthly invoice surprises them. The OpenAI pricing page lists per-token rates, but converting those rates into monthly projections for your specific workload requires knowing your call volume, average token counts, model distribution, and the cost of any features you're using (caching, batching, embeddings).

This guide covers current OpenAI API pricing across all major models, provides a reusable cost calculator, and explains where the biggest optimization opportunities are.


Current OpenAI API Pricing (2026)

GPT-4o Family

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cached Input | Notes |
| --- | --- | --- | --- | --- |
| GPT-4o | $2.50 | $10.00 | $1.25 | Current flagship |
| GPT-4o-mini | $0.15 | $0.60 | $0.075 | 16x cheaper on output |
| GPT-4.1 | $2.00 | $8.00 | $0.50 | Strong on code tasks |
| GPT-4.1-mini | $0.40 | $1.60 | $0.10 | Mid-tier option |
| o3 (reasoning) | $10.00 | $40.00 | $2.50 | Extended thinking tasks |
| o4-mini (reasoning) | $1.10 | $4.40 | $0.275 | Cost-efficient reasoning |

GPT-3.5 Legacy

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
| --- | --- | --- | --- |
| GPT-3.5 Turbo | $0.50 | $1.50 | Legacy; GPT-4o-mini is better and cheaper |

Embeddings and Other Models

| Model | Cost | Use Case |
| --- | --- | --- |
| text-embedding-3-small | $0.02 / 1M tokens | Most efficient embeddings |
| text-embedding-3-large | $0.13 / 1M tokens | Higher accuracy embeddings |
| Whisper (audio transcription) | $0.006 / minute | Speech to text |
| TTS (text to speech) | $15.00 / 1M characters | Audio generation |
| DALL-E 3 (1024x1024) | $0.040 / image | Image generation |

Batch API Discounts

OpenAI's Batch API processes requests asynchronously with a 24-hour completion window at 50% off both input and output prices. For workloads that are not latency-sensitive — document processing, nightly report generation, dataset enrichment — this is a straightforward 50% discount that requires minimal code changes.
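The Batch API takes a JSONL input file with one request object per line. A minimal sketch of building that file body (the `custom_id` scheme here is illustrative; any unique string you can match back to your records works):

```python
import json

def build_batch_requests(prompts, model="gpt-4o-mini"):
    """Build the JSONL body for a Batch API input file: one request per line."""
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"req-{i}",       # your own ID, echoed back in the results
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return "\n".join(lines)
```

The resulting file is uploaded through the Files API with `purpose="batch"`, then submitted via the batches endpoint with a `completion_window` of `"24h"`; results arrive as a matching JSONL output file keyed by `custom_id`.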


How to Calculate Your Monthly LLM Costs

Token costs are straightforward once you understand what you're counting:

  • Input tokens: Everything in your request — system prompt, conversation history, user message, function definitions
  • Output tokens: The model's response

A rough rule of thumb: 1,000 tokens ≈ 750 words of English text.
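That rule of thumb can be wrapped into a quick estimator for back-of-envelope planning (a heuristic only; exact counts come from the tokenizer for your specific model):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate from the ~750 English words per 1,000 tokens rule."""
    words = len(text.split())
    return round(words * 1000 / 750)
```

For anything that feeds a billing projection, prefer the real tokenizer; this heuristic drifts on code, non-English text, and heavy punctuation.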

The Cost Formula

Monthly Cost = 
  (Monthly Input Tokens / 1,000,000) × Input Price Per Million
  + (Monthly Output Tokens / 1,000,000) × Output Price Per Million

Python Cost Calculator

def calculate_monthly_llm_cost(
    monthly_calls: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    model: str = "gpt-4o",
    cached_input_fraction: float = 0.0,
    batch_fraction: float = 0.0,
) -> dict:
    """
    Calculate estimated monthly OpenAI API cost.
    
    Args:
        monthly_calls: Total API calls per month
        avg_input_tokens: Average input tokens per call
        avg_output_tokens: Average output tokens per call
        model: Model name (gpt-4o, gpt-4o-mini, etc.)
        cached_input_fraction: Fraction of input tokens that are cached (0.0-1.0)
        batch_fraction: Fraction of calls using the Batch API (0.0-1.0)
    
    Returns:
        dict with cost breakdown
    """
    pricing = {
        "gpt-4o":        {"input": 2.50,  "output": 10.00, "cached": 1.25},
        "gpt-4o-mini":   {"input": 0.15,  "output": 0.60,  "cached": 0.075},
        "gpt-4.1":       {"input": 2.00,  "output": 8.00,  "cached": 0.50},
        "gpt-4.1-mini":  {"input": 0.40,  "output": 1.60,  "cached": 0.10},
        "o3":            {"input": 10.00, "output": 40.00, "cached": 2.50},
        "o4-mini":       {"input": 1.10,  "output": 4.40,  "cached": 0.275},
    }
    
    if model not in pricing:
        raise ValueError(f"Unknown model: {model}")
    
    prices = pricing[model]
    
    total_input_tokens = monthly_calls * avg_input_tokens
    total_output_tokens = monthly_calls * avg_output_tokens
    
    # Split input tokens into cached and uncached portions
    cached_input = total_input_tokens * cached_input_fraction
    standard_input = total_input_tokens * (1 - cached_input_fraction)
    
    # Batch API: 50% discount on both input and output.
    # Apply the batch fraction to uncached input and to all output
    # (cached tokens are priced at the cached rate regardless).
    batch_input = standard_input * batch_fraction
    standard_input -= batch_input
    batch_output = total_output_tokens * batch_fraction
    standard_output = total_output_tokens * (1 - batch_fraction)
    
    # Convert token counts to dollars (prices are per 1M tokens)
    input_cost = (
        standard_input / 1_000_000 * prices["input"] +
        batch_input / 1_000_000 * prices["input"] * 0.5 +  # 50% batch discount
        cached_input / 1_000_000 * prices["cached"]
    )
    output_cost = (
        standard_output / 1_000_000 * prices["output"] +
        batch_output / 1_000_000 * prices["output"] * 0.5  # 50% batch discount
    )
    
    total_cost = input_cost + output_cost
    
    return {
        "model": model,
        "monthly_calls": monthly_calls,
        "total_input_tokens": total_input_tokens,
        "total_output_tokens": total_output_tokens,
        "input_cost": round(input_cost, 2),
        "output_cost": round(output_cost, 2),
        "total_monthly_cost": round(total_cost, 2),
        "cost_per_call": round(total_cost / monthly_calls, 6),
    }

# Example: 200K calls/month, 1K input tokens, 500 output tokens
result = calculate_monthly_llm_cost(
    monthly_calls=200_000,
    avg_input_tokens=1_000,
    avg_output_tokens=500,
    model="gpt-4o",
)
print(f"Monthly cost: ${result['total_monthly_cost']:,.2f}")
print(f"Cost per call: ${result['cost_per_call']:.4f}")
# Output:
# Monthly cost: $1,500.00
# Cost per call: $0.0075

Quick Reference: Cost Per 1,000 Calls

At common token volumes, costs per 1,000 calls look like this:

| Avg Tokens (In/Out) | GPT-4o | GPT-4o-mini | Savings from Routing |
| --- | --- | --- | --- |
| 500 / 250 | $3.75 | $0.23 | 94% |
| 1,000 / 500 | $7.50 | $0.45 | 94% |
| 2,000 / 1,000 | $15.00 | $0.90 | 94% |
| 5,000 / 2,000 | $32.50 | $1.95 | 94% |

These ratios hold regardless of scale: GPT-4o-mini costs 6% of GPT-4o on both input ($0.15 vs $2.50/M) and output ($0.60 vs $10.00/M), so routed calls save ~94% at any token mix.
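The quick-reference numbers fall straight out of the per-million rates. A sketch that reproduces them from the GPT-4o and GPT-4o-mini prices listed earlier:

```python
# Per-1M-token prices from the pricing tables above: (input, output)
PRICES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

def cost_per_1k_calls(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars of 1,000 calls at the given average token counts."""
    in_price, out_price = PRICES[model]
    return 1_000 * (input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price)

for in_tok, out_tok in [(500, 250), (1_000, 500), (2_000, 1_000), (5_000, 2_000)]:
    big = cost_per_1k_calls("gpt-4o", in_tok, out_tok)
    small = cost_per_1k_calls("gpt-4o-mini", in_tok, out_tok)
    print(f"{in_tok:>5} / {out_tok:<5}  ${big:>6.2f}  ${small:>5.2f}  {1 - small / big:.0%}")
```

Swapping other models' rates into `PRICES` gives the same table for any pair you are comparing.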


The Three Biggest Cost Drivers

Understanding where your money goes helps prioritize optimization effort.

1. Output tokens (the most expensive line item)

Output tokens cost four times as much as input tokens on GPT-4o ($10.00 vs $2.50 per million). This means a response-heavy workload — where the model produces long, detailed outputs — is disproportionately expensive.

The optimization levers for output costs:

  • Model routing: GPT-4o-mini output costs $0.60/M vs $10.00/M — a 94% reduction on output
  • Output length control: Explicit max_tokens constraints and instructions to "respond concisely" reduce average output length
  • Streaming with early stopping: For user-facing applications, streaming allows clients to stop generation when they have what they need
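The output-length lever is easy to quantify. A sketch of the output-side spend at GPT-4o's $10.00/M rate, comparing an uncapped 500-token average against a 300-token cap (volumes are illustrative):

```python
def monthly_output_cost(monthly_calls: int, avg_output_tokens: int,
                        output_price_per_million: float = 10.00) -> float:
    """Output-token spend only, in dollars per month."""
    return monthly_calls * avg_output_tokens / 1_000_000 * output_price_per_million

uncapped = monthly_output_cost(200_000, 500)  # $1,000.00/month
capped = monthly_output_cost(200_000, 300)    # $600.00/month
```

A 40% reduction in average output length is a 40% reduction in the output line item, which is why concise-response instructions and `max_tokens` caps pay off before any model change.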

2. System prompt tokens (the silent cost multiplier)

A 2,000-token system prompt sent with every request multiplies across your entire call volume. At 500,000 calls/month, a 2,000-token system prompt contributes 1 billion input tokens — $2,500/month just for the system prompt on GPT-4o.

Prompt caching addresses this: cached tokens cost $1.25/M instead of $2.50/M (50% off) on GPT-4o, and $0.075/M instead of $0.15/M on GPT-4o-mini. For high-volume applications with stable system prompts, caching alone reduces input costs by 30–50%.
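The caching arithmetic is simple to sketch (volumes are illustrative; prices are the GPT-4o rates above):

```python
def system_prompt_input_cost(calls: int, prompt_tokens: int, cached_fraction: float,
                             input_price: float = 2.50,
                             cached_price: float = 1.25) -> float:
    """Monthly input cost of the system prompt alone, in dollars."""
    millions = calls * prompt_tokens / 1_000_000
    return (millions * (1 - cached_fraction) * input_price
            + millions * cached_fraction * cached_price)

# 500K calls/month with a 2,000-token system prompt
uncached = system_prompt_input_cost(500_000, 2_000, cached_fraction=0.0)  # $2,500.00
mostly_cached = system_prompt_input_cost(500_000, 2_000, cached_fraction=0.9)  # $1,375.00
```

The cached fraction depends on request patterns and cache eligibility rules, so measure your actual cache hit rate before banking on the full 50%.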

3. Conversation history accumulation

Multi-turn chat applications re-send the full conversation history with every message. A 10-turn conversation where each turn averages 300 tokens results in ~3,000 tokens of history being re-processed on turn 10 — even though turns 1–8 are likely irrelevant to the current response.

Context compression strategies — summarizing older turns, pruning irrelevant history, using sliding window context — can reduce effective context length by 40–60% on long-running conversations.
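A minimal sliding-window sketch of the last strategy (the token counter is injected, so any estimator or tokenizer works; message shape is simplified to plain strings):

```python
def sliding_window_history(messages: list[str], max_history_tokens: int,
                           count_tokens) -> list[str]:
    """Keep the most recent turns that fit a token budget, dropping oldest first."""
    kept: list[str] = []
    budget = max_history_tokens
    for msg in reversed(messages):   # walk newest to oldest
        cost = count_tokens(msg)
        if cost > budget:
            break                    # this turn and everything older is dropped
        kept.append(msg)
        budget -= cost
    kept.reverse()                   # restore chronological order
    return kept
```

A common refinement is to summarize the dropped turns into a single synthetic message so long-range context survives in compressed form.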


Estimating Costs Before You Build

When architecting a new LLM feature, use this framework to project costs before writing code:

def estimate_feature_monthly_cost(
    feature_name: str,
    daily_active_users: int,
    avg_calls_per_user_per_day: float,
    avg_input_tokens: int,
    avg_output_tokens: int,
    model: str = "gpt-4o",
) -> None:
    """Print a cost projection for a new feature."""
    monthly_calls = int(daily_active_users * avg_calls_per_user_per_day * 30)
    
    result = calculate_monthly_llm_cost(
        monthly_calls=monthly_calls,
        avg_input_tokens=avg_input_tokens,
        avg_output_tokens=avg_output_tokens,
        model=model,
    )
    
    # Also calculate with routing to gpt-4o-mini
    mini_result = calculate_monthly_llm_cost(
        monthly_calls=monthly_calls,
        avg_input_tokens=avg_input_tokens,
        avg_output_tokens=avg_output_tokens,
        model="gpt-4o-mini",
    )
    
    print(f"\nCost projection: {feature_name}")
    print(f"  Monthly calls: {monthly_calls:,}")
    print(f"  GPT-4o cost:      ${result['total_monthly_cost']:>10,.2f}/month")
    print(f"  GPT-4o-mini cost: ${mini_result['total_monthly_cost']:>10,.2f}/month")
    print(f"  Routing savings:  ${result['total_monthly_cost'] - mini_result['total_monthly_cost']:>10,.2f}/month")

# Example: Email summarization feature
estimate_feature_monthly_cost(
    feature_name="Email summarization",
    daily_active_users=5_000,
    avg_calls_per_user_per_day=3,
    avg_input_tokens=1_500,
    avg_output_tokens=300,
    model="gpt-4o",
)
# Output:
# Cost projection: Email summarization
#   Monthly calls: 450,000
#   GPT-4o cost:        $3,037.50/month
#   GPT-4o-mini cost:     $182.25/month
#   Routing savings:    $2,855.25/month

Running this calculation at feature design time reveals whether the default model choice makes economic sense — before it's committed to production.


Where Automatic Routing Fits In

Manual cost calculation and model selection is useful for planning. At scale, you need automatic routing that applies the right model selection to every individual request.

PromptUnit routes your LLM calls automatically — analyzing each request, classifying its complexity, and routing it to the cheapest model that meets the quality bar for that task type. The integration is a single base URL change:

# Before
client = OpenAI(api_key="sk-...")

# After — all routing, cost attribution, and monitoring activated
client = OpenAI(
    api_key="sk-...",
    base_url="https://api.promptunit.ai/proxy/openai",
    default_headers={"x-promptunit-key": "YOUR_KEY"},
)

Every response includes headers with the actual cost, the model used, and the saving versus the requested model:

x-promptunit-model: gpt-4o-mini
x-promptunit-original-model: gpt-4o
x-promptunit-cost: 0.00023
x-promptunit-saving: 0.00727
x-promptunit-quality-score: 94

The dashboard aggregates these into total monthly savings, broken down by feature, model, and provider. The pricing model is 20% of verified savings — PromptUnit only charges when it demonstrably reduces your bill.


The Cost Scenarios That Catch Teams Off Guard

Prompt injection generating excessive output

A malicious input that causes the model to generate a 10,000-token response on every call instead of the expected 300 tokens increases output costs by 33x for affected requests. Without per-call monitoring and output token circuit breakers, these events are invisible until you see an unexpected invoice spike.

Retry loops multiplying call volume

An application bug that causes a retry on every request doubles or triples your call volume instantaneously. Budget enforcement at the proxy layer — circuit breakers on rolling spend windows — can stop this before it becomes expensive.
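One way to sketch such a rolling-window breaker (class and method names are illustrative, not a real library API):

```python
import time
from collections import deque

class SpendCircuitBreaker:
    """Refuse requests once spend inside a rolling window exceeds a budget."""

    def __init__(self, budget_usd: float, window_seconds: float = 3600.0,
                 clock=time.monotonic):
        self.budget = budget_usd
        self.window = window_seconds
        self.clock = clock           # injectable for testing
        self.events: deque[tuple[float, float]] = deque()  # (timestamp, cost)
        self.spent = 0.0

    def _evict_old(self) -> None:
        now = self.clock()
        while self.events and now - self.events[0][0] > self.window:
            _, cost = self.events.popleft()
            self.spent -= cost

    def record(self, cost_usd: float) -> None:
        """Call after each completed API request with its actual cost."""
        self._evict_old()
        self.events.append((self.clock(), cost_usd))
        self.spent += cost_usd

    def allow_request(self) -> bool:
        """Check before sending; False means the rolling budget is exhausted."""
        self._evict_old()
        return self.spent < self.budget
```

A retry storm then fails fast at the client once the hourly budget is consumed, instead of burning tokens until the invoice arrives.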

A/B tests adding unexpected model volume

Running a quality test of GPT-4o-mini against GPT-4o is reasonable. Running it on 100% of traffic for two weeks without noticing is expensive. Shadow testing through a proxy applies test traffic without duplicating production costs.


Key Takeaways

  • GPT-4o is priced at $2.50/M input and $10.00/M output; GPT-4o-mini at $0.15/M input and $0.60/M output, roughly 16x cheaper on output at any volume.
  • Output tokens are your most expensive cost driver — optimize output length and model selection for output-heavy workloads first.
  • System prompt caching reduces input costs by 50% on high-volume applications with stable system prompts. Batch API reduces all costs by 50% for non-latency-sensitive workloads.
  • Calculate feature costs at design time using token estimates — surprises at invoice time are avoidable with 20 minutes of upfront calculation.
  • Manual model selection doesn't scale. Automatic routing that classifies each request and routes to the appropriate model tier captures savings across your entire call volume, not just the workloads you've manually targeted.
  • PromptUnit's 14-day observation mode shows you exactly what routing would save on your specific traffic before any routing changes go live. For teams uncertain about their routability, this is the zero-risk path to an accurate savings estimate.

For a deeper look at which tasks are safe to route to cheaper models — and which aren't — see GPT-4o vs GPT-4o-mini: When Does the Cheaper Model Actually Win? and the complete guide to LLM model routing.

Start your 14-day observation period

See exactly how much you'd save before paying anything. Zero risk — if we save you $0, you pay $0.

Get started free →