OpenAI API Cost Calculator: Full Pricing Guide for Production Teams (2026)
Current OpenAI API pricing for all major models, a practical cost calculator, and strategies to reduce your bill by 40–70% using intelligent model selection.
Most engineering teams don't have an accurate picture of their LLM API costs until the monthly invoice surprises them. The OpenAI pricing page lists per-token rates, but converting those rates into monthly projections for your specific workload requires knowing your call volume, average token counts, model distribution, and the cost of any features you're using (caching, batching, embeddings).
This guide covers current OpenAI API pricing across all major models, provides a reusable cost calculator, and explains where the biggest optimization opportunities are.
Current OpenAI API Pricing (2026)
GPT-4o Family
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cached Input | Notes |
|---|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | $1.25 | Current flagship |
| GPT-4o-mini | $0.15 | $0.60 | $0.075 | 16x cheaper on output |
| GPT-4.1 | $2.00 | $8.00 | $0.50 | Strong on code tasks |
| GPT-4.1-mini | $0.40 | $1.60 | $0.10 | Mid-tier option |
| o3 (reasoning) | $10.00 | $40.00 | $2.50 | Extended thinking tasks |
| o4-mini (reasoning) | $1.10 | $4.40 | $0.275 | Cost-efficient reasoning |
GPT-3.5 Legacy
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| GPT-3.5 Turbo | $0.50 | $1.50 | Legacy, GPT-4o-mini is better and cheaper |
Embeddings and Other Models
| Model | Cost | Use Case |
|---|---|---|
| text-embedding-3-small | $0.02 / 1M tokens | Most efficient embeddings |
| text-embedding-3-large | $0.13 / 1M tokens | Higher accuracy embeddings |
| Whisper (audio transcription) | $0.006 / minute | Speech to text |
| TTS (text to speech) | $15.00 / 1M characters | Audio generation |
| DALL-E 3 (1024x1024) | $0.040 / image | Image generation |
Batch API Discounts
OpenAI's Batch API processes requests asynchronously with a 24-hour completion window at 50% off both input and output prices. For workloads that are not latency-sensitive — document processing, nightly report generation, dataset enrichment — this is a straightforward 50% discount that requires minimal code changes.
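The "minimal code changes" amount to writing your requests as JSONL instead of calling the API directly. A sketch of building the batch input file — the request shape (one JSON object per line with `custom_id`, `method`, `url`, `body`) follows OpenAI's documented Batch API format; the function name and defaults here are illustrative:

```python
import json

def build_batch_lines(prompts, model="gpt-4o-mini", max_tokens=300):
    """Build JSONL lines for OpenAI's Batch API: one chat-completion
    request per line, each tagged with a custom_id for matching results."""
    lines = []
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }
        lines.append(json.dumps(request))
    return "\n".join(lines)

# Write the JSONL to a file, then (with an OpenAI client) upload it and
# create the batch -- roughly, per the openai-python SDK:
#   batch_file = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
#   client.batches.create(input_file_id=batch_file.id,
#                         endpoint="/v1/chat/completions",
#                         completion_window="24h")
jsonl = build_batch_lines(["Summarize this report.", "Classify this ticket."])
```

Results arrive within the 24-hour window as an output file keyed by the `custom_id` you assigned to each request.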
How to Calculate Your Monthly LLM Costs
Token costs are straightforward once you understand what you're counting:
- Input tokens: Everything in your request — system prompt, conversation history, user message, function definitions
- Output tokens: The model's response
A rough rule of thumb: 1,000 tokens ≈ 750 words of English text.
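That rule of thumb converts directly into a quick estimator. This uses the article's 750-words-per-1,000-tokens ratio, not an exact tokenizer — for precise counts, use a tokenizer library such as tiktoken:

```python
def estimate_tokens(word_count: int) -> int:
    """Rough token estimate from an English word count,
    using the 1,000 tokens ~= 750 words rule of thumb."""
    return round(word_count / 0.75)

def estimate_words(token_count: int) -> int:
    """Inverse: rough word count that fits a token budget."""
    return round(token_count * 0.75)

print(estimate_tokens(750))   # -> 1000
print(estimate_words(1000))   # -> 750
```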
The Cost Formula
```
Monthly Cost =
    (Monthly Input Tokens / 1,000,000) × Input Price Per Million
  + (Monthly Output Tokens / 1,000,000) × Output Price Per Million
```
Python Cost Calculator
```python
def calculate_monthly_llm_cost(
    monthly_calls: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    model: str = "gpt-4o",
    cached_input_fraction: float = 0.0,
    batch_fraction: float = 0.0,
) -> dict:
    """
    Calculate estimated monthly OpenAI API cost.

    Args:
        monthly_calls: Total API calls per month
        avg_input_tokens: Average input tokens per call
        avg_output_tokens: Average output tokens per call
        model: Model name (gpt-4o, gpt-4o-mini, etc.)
        cached_input_fraction: Fraction of input tokens that are cached (0.0-1.0)
        batch_fraction: Fraction of calls using the Batch API (0.0-1.0)

    Returns:
        dict with cost breakdown
    """
    pricing = {
        "gpt-4o": {"input": 2.50, "output": 10.00, "cached": 1.25},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60, "cached": 0.075},
        "gpt-4.1": {"input": 2.00, "output": 8.00, "cached": 0.50},
        "gpt-4.1-mini": {"input": 0.40, "output": 1.60, "cached": 0.10},
        "o3": {"input": 10.00, "output": 40.00, "cached": 2.50},
        "o4-mini": {"input": 1.10, "output": 4.40, "cached": 0.275},
    }
    if model not in pricing:
        raise ValueError(f"Unknown model: {model}")
    prices = pricing[model]

    total_input_tokens = monthly_calls * avg_input_tokens
    total_output_tokens = monthly_calls * avg_output_tokens

    # Split input tokens into cached and standard
    cached_input = total_input_tokens * cached_input_fraction
    standard_input = total_input_tokens * (1 - cached_input_fraction)

    # Batch API: 50% discount on both input and output for the
    # batched fraction of calls
    batch_multiplier = 1 - 0.5 * batch_fraction

    input_cost = (
        standard_input / 1_000_000 * prices["input"]
        + cached_input / 1_000_000 * prices["cached"]
    ) * batch_multiplier
    output_cost = (
        total_output_tokens / 1_000_000 * prices["output"] * batch_multiplier
    )
    total_cost = input_cost + output_cost

    return {
        "model": model,
        "monthly_calls": monthly_calls,
        "total_input_tokens": total_input_tokens,
        "total_output_tokens": total_output_tokens,
        "input_cost": round(input_cost, 2),
        "output_cost": round(output_cost, 2),
        "total_monthly_cost": round(total_cost, 2),
        "cost_per_call": round(total_cost / monthly_calls, 6),
    }


# Example: 200K calls/month, 1K input tokens, 500 output tokens
result = calculate_monthly_llm_cost(
    monthly_calls=200_000,
    avg_input_tokens=1_000,
    avg_output_tokens=500,
    model="gpt-4o",
)
print(f"Monthly cost: ${result['total_monthly_cost']:,.2f}")
print(f"Cost per call: ${result['cost_per_call']:.4f}")
# Output:
# Monthly cost: $1,500.00
# Cost per call: $0.0075
```
Quick Reference: Cost Per 1,000 Calls
At common token volumes, costs per 1,000 calls look like this:
| Avg Tokens (In/Out) | GPT-4o | GPT-4o-mini | Savings from Routing |
|---|---|---|---|
| 500 / 250 | $3.75 | $0.23 | 94% |
| 1,000 / 500 | $7.50 | $0.45 | 94% |
| 2,000 / 1,000 | $15.00 | $0.90 | 94% |
| 5,000 / 2,000 | $32.50 | $1.95 | 94% |
These ratios hold regardless of scale. GPT-4o-mini's input and output prices are both roughly 6% of GPT-4o's, so routing a call to the cheaper model yields a consistent ~94% cost reduction whatever the token mix.
The Three Biggest Cost Drivers
Understanding where your money goes helps prioritize optimization effort.
1. Output tokens (the most expensive line item)
Output tokens cost four times as much as input tokens on GPT-4o ($10.00 vs. $2.50 per million). This means a response-heavy workload — where the model produces long, detailed outputs — is disproportionately expensive.
The optimization levers for output costs:
- Model routing: GPT-4o-mini output costs $0.60/M vs $10.00/M, a 94% reduction on output
- Output length control: explicit `max_tokens` constraints and instructions to "respond concisely" reduce average output length
- Streaming with early stopping: for user-facing applications, streaming allows clients to stop generation when they have what they need
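The first two levers live in the request itself. A minimal sketch — the wrapper name, the 150-token cap, and the brevity instruction are illustrative choices, not recommendations:

```python
def concise_chat_kwargs(user_message: str,
                        model: str = "gpt-4o-mini",
                        max_output_tokens: int = 150) -> dict:
    """Build chat-completion kwargs that cap output length with both a
    hard max_tokens ceiling and a brevity instruction in the system prompt."""
    return {
        "model": model,
        "max_tokens": max_output_tokens,  # hard ceiling on billed output tokens
        "messages": [
            {"role": "system", "content": "Respond concisely."},
            {"role": "user", "content": user_message},
        ],
    }

# Usage with the OpenAI client (assumed):
#   client.chat.completions.create(**concise_chat_kwargs("Summarize: ..."))
```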
2. System prompt tokens (the silent cost multiplier)
A 2,000-token system prompt sent with every request multiplies across your entire call volume. At 500,000 calls/month, a 2,000-token system prompt contributes 1 billion input tokens — $2,500/month just for the system prompt on GPT-4o.
Prompt caching addresses this: cached tokens cost $1.25/M instead of $2.50/M (50% off) on GPT-4o, and $0.075/M instead of $0.15/M on GPT-4o-mini. For high-volume applications with stable system prompts, caching alone reduces input costs by 30–50%.
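Those savings are easy to quantify. A small helper, using the GPT-4o rates from the pricing table above (the 80% cached fraction in the example is an assumption for illustration):

```python
def cached_input_savings(monthly_input_tokens: int,
                         cached_fraction: float,
                         input_price: float = 2.50,    # GPT-4o, $/1M tokens
                         cached_price: float = 1.25) -> dict:
    """Compare monthly input spend with and without prompt caching."""
    millions = monthly_input_tokens / 1_000_000
    uncached_cost = millions * input_price
    cached_cost = (millions * (1 - cached_fraction) * input_price
                   + millions * cached_fraction * cached_price)
    return {
        "without_caching": round(uncached_cost, 2),
        "with_caching": round(cached_cost, 2),
        "savings": round(uncached_cost - cached_cost, 2),
    }

# The scenario above: 1 billion input tokens/month, assuming 80% of them
# are a stable, cacheable system prompt
print(cached_input_savings(1_000_000_000, cached_fraction=0.8))
# -> {'without_caching': 2500.0, 'with_caching': 1500.0, 'savings': 1000.0}
```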
3. Conversation history accumulation
Multi-turn chat applications re-send the full conversation history with every message. A 10-turn conversation where each turn averages 300 tokens results in roughly 2,700 tokens of history being re-processed on turn 10 — even though most of the early turns are likely irrelevant to the current response.
Context compression strategies — summarizing older turns, pruning irrelevant history, using sliding window context — can reduce effective context length by 40–60% on long-running conversations.
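The simplest of these, a sliding window, can be sketched in a few lines. The 4-characters-per-token heuristic and the drop-oldest-first policy are assumptions; production systems often summarize older turns instead of dropping them outright:

```python
def sliding_window_history(messages: list[dict],
                           max_history_tokens: int = 2_000) -> list[dict]:
    """Keep the system prompt plus the most recent turns that fit in the
    token budget, dropping the oldest turns first."""
    def rough_tokens(msg: dict) -> int:
        return max(1, len(msg["content"]) // 4)  # ~4 chars/token heuristic

    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    kept, budget = [], max_history_tokens
    for msg in reversed(turns):          # walk newest-first
        cost = rough_tokens(msg)
        if cost > budget:
            break                        # oldest remaining turns are dropped
        kept.append(msg)
        budget -= cost
    return system + list(reversed(kept))
```

Each request then sends `sliding_window_history(full_history)` instead of the full transcript, bounding per-turn input cost no matter how long the conversation runs.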
Estimating Costs Before You Build
When architecting a new LLM feature, use this framework to project costs before writing code:
```python
def estimate_feature_monthly_cost(
    feature_name: str,
    daily_active_users: int,
    avg_calls_per_user_per_day: float,
    avg_input_tokens: int,
    avg_output_tokens: int,
    model: str = "gpt-4o",
) -> None:
    """Print a cost projection for a new feature."""
    monthly_calls = int(daily_active_users * avg_calls_per_user_per_day * 30)
    result = calculate_monthly_llm_cost(
        monthly_calls=monthly_calls,
        avg_input_tokens=avg_input_tokens,
        avg_output_tokens=avg_output_tokens,
        model=model,
    )
    # Also calculate with routing to gpt-4o-mini
    mini_result = calculate_monthly_llm_cost(
        monthly_calls=monthly_calls,
        avg_input_tokens=avg_input_tokens,
        avg_output_tokens=avg_output_tokens,
        model="gpt-4o-mini",
    )
    savings = result["total_monthly_cost"] - mini_result["total_monthly_cost"]
    print(f"\nCost projection: {feature_name}")
    print(f"  Monthly calls: {monthly_calls:,}")
    print(f"  GPT-4o cost:      ${result['total_monthly_cost']:>10,.2f}/month")
    print(f"  GPT-4o-mini cost: ${mini_result['total_monthly_cost']:>10,.2f}/month")
    print(f"  Routing savings:  ${savings:>10,.2f}/month")


# Example: Email summarization feature
estimate_feature_monthly_cost(
    feature_name="Email summarization",
    daily_active_users=5_000,
    avg_calls_per_user_per_day=3,
    avg_input_tokens=1_500,
    avg_output_tokens=300,
    model="gpt-4o",
)
# Output:
# Cost projection: Email summarization
#   Monthly calls: 450,000
#   GPT-4o cost:      $  3,037.50/month
#   GPT-4o-mini cost: $    182.25/month
#   Routing savings:  $  2,855.25/month
```
Running this calculation at feature design time reveals whether the default model choice makes economic sense — before it's committed to production.
Where Automatic Routing Fits In
Manual cost calculation and model selection is useful for planning. At scale, you need automatic routing that applies the right model selection to every individual request.
PromptUnit routes your LLM calls automatically — analyzing each request, classifying its complexity, and routing it to the cheapest model that meets the quality bar for that task type. The integration is a single base URL change:
```python
from openai import OpenAI

# Before
client = OpenAI(api_key="sk-...")

# After — all routing, cost attribution, and monitoring activated
client = OpenAI(
    api_key="sk-...",
    base_url="https://api.promptunit.ai/proxy/openai",
    default_headers={"x-promptunit-key": "YOUR_KEY"},
)
```
Every response includes headers with the actual cost, the model used, and the saving versus the requested model:
```
x-promptunit-model: gpt-4o-mini
x-promptunit-original-model: gpt-4o
x-promptunit-cost: 0.00023
x-promptunit-saving: 0.00727
x-promptunit-quality-score: 94
```
The dashboard aggregates these into total monthly savings, broken down by feature, model, and provider. The pricing model is 20% of verified savings — PromptUnit only charges when it demonstrably reduces your bill.
The Cost Scenarios That Catch Teams Off Guard
Prompt injection generating excessive output
A malicious input that causes the model to generate a 10,000-token response on every call instead of the expected 300 tokens increases output costs by 33x for affected requests. Without per-call monitoring and output token circuit breakers, these events are invisible until you see an unexpected invoice spike.
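A per-call output-token circuit breaker can be as simple as checking the usage reported on each response against a hard ceiling. The class name, threshold, and exception here are illustrative:

```python
class OutputTokenBreaker:
    """Trips when a single call's output tokens exceed a hard ceiling --
    e.g. a prompt injection causing runaway generation."""

    def __init__(self, max_output_tokens_per_call: int = 2_000):
        self.ceiling = max_output_tokens_per_call
        self.tripped = False

    def check(self, output_tokens: int) -> None:
        """Call with response.usage.completion_tokens after each request."""
        if output_tokens > self.ceiling:
            self.tripped = True
            raise RuntimeError(
                f"Output token circuit breaker tripped: "
                f"{output_tokens} > {self.ceiling}"
            )

breaker = OutputTokenBreaker(max_output_tokens_per_call=2_000)
breaker.check(300)        # a normal response passes
# breaker.check(10_000)   # would raise: 10,000 > 2,000
```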
Retry loops multiplying call volume
An application bug that causes a retry on every request doubles or triples your call volume instantaneously. Budget enforcement at the proxy layer — circuit breakers on rolling spend windows — can stop this before it becomes expensive.
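A rolling spend window can be sketched with a deque of timestamped per-call costs; the budget and window size below are assumptions, and the `now` parameter exists so the logic can be exercised without a live clock:

```python
import time
from collections import deque

class SpendCircuitBreaker:
    """Rejects new calls once spend within a rolling window exceeds a budget."""

    def __init__(self, budget_usd: float, window_seconds: float = 3600.0):
        self.budget = budget_usd
        self.window = window_seconds
        self.events = deque()  # (timestamp, cost_usd) pairs

    def window_spend(self) -> float:
        return sum(cost for _, cost in self.events)

    def record(self, cost_usd: float, now=None) -> None:
        """Record one call's cost, or raise if it would blow the budget."""
        now = time.time() if now is None else now
        # Evict events that have aged out of the rolling window
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()
        if self.window_spend() + cost_usd > self.budget:
            raise RuntimeError("Spend budget exceeded for rolling window")
        self.events.append((now, cost_usd))

# $10/hour ceiling; each normal GPT-4o call from the earlier example is $0.0075
breaker = SpendCircuitBreaker(budget_usd=10.0, window_seconds=3600)
breaker.record(0.0075)
```

A retry loop that doubles call volume hits the budget ceiling and starts raising instead of silently doubling the invoice.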
A/B tests adding unexpected model volume
Running a quality test of GPT-4o-mini against GPT-4o is reasonable. Running it on 100% of traffic for two weeks without noticing is expensive. Shadow testing through a proxy applies test traffic without duplicating production costs.
Key Takeaways
- GPT-4o is priced at $2.50/M input and $10.00/M output. GPT-4o-mini is priced at $0.15/M input and $0.60/M output — roughly 16x cheaper on both input and output, at every volume level.
- Output tokens are your most expensive cost driver — optimize output length and model selection for output-heavy workloads first.
- System prompt caching reduces input costs by 50% on high-volume applications with stable system prompts. Batch API reduces all costs by 50% for non-latency-sensitive workloads.
- Calculate feature costs at design time using token estimates — surprises at invoice time are avoidable with 20 minutes of upfront calculation.
- Manual model selection doesn't scale. Automatic routing that classifies each request and routes to the appropriate model tier captures savings across your entire call volume, not just the workloads you've manually targeted.
- PromptUnit's 14-day observation mode shows you exactly what routing would save on your specific traffic before any routing changes go live. For teams uncertain about their routability, this is the zero-risk path to an accurate savings estimate.
For a deeper look at which tasks are safe to route to cheaper models — and which aren't — see GPT-4o vs GPT-4o-mini: When Does the Cheaper Model Actually Win? and the complete guide to LLM model routing.