The Hidden Cost of Defaulting to GPT-4o in Production
Beyond the API invoice: the real financial and operational cost of routing every LLM call to your most capable model — and the compounding effect over time.
The decision to default everything to GPT-4o happens in a moment. It's not a deliberate architectural choice — it's the absence of one. You're building fast, the model works, and cost optimization feels like a problem you can solve later.
Later has a way of arriving expensively.
This post quantifies the actual cost of the GPT-4o default — not just in API spend, but in the compounding effects that make this pattern harder to undo over time.
The Visible Cost: Your Monthly Invoice
The most obvious component is the API bill. At $2.50 per million input tokens and $10.00 per million output tokens, GPT-4o carries a roughly 16x premium over GPT-4o-mini on output costs. For a team making 500,000 calls per month at an average of 1,000 tokens per call, roughly half of them output, the output side of the bill for calls that don't require a frontier model is the difference between roughly $2,500 and roughly $150 a month.
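A quick back-of-envelope check of those figures, assuming roughly half of each call's 1,000 tokens are output tokens (the split is an assumption; your input/output ratio will differ):

```python
# Sanity check of the output-cost comparison above.
# Assumption: ~500 of each call's 1,000 average tokens are output.
calls_per_month = 500_000
output_tokens_per_call = 500
output_tokens = calls_per_month * output_tokens_per_call  # 250M tokens

gpt4o_bill = output_tokens / 1_000_000 * 10.00  # ≈ $2,500 at GPT-4o's output rate
mini_bill = output_tokens / 1_000_000 * 0.60    # ≈ $150 at GPT-4o-mini's output rate
```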
But the invoice is just the starting point.
How costs compound over product lifecycle
| Time | Monthly GPT-4o Spend | If Routing Had Started (est.) | Cumulative Overpay |
|---|---|---|---|
| Month 1 | $800 | $320 | $480 |
| Month 6 | $2,400 | $960 | ~$7,200 |
| Month 12 | $5,500 | $2,200 | ~$25,000 |
| Month 24 | $9,000 | $3,600 | ~$66,000 |
These are illustrative figures for a growing SaaS product. The actual numbers vary — but the shape of the curve is consistent. API costs grow with usage. Routing discipline applied early compounds in your favor for the lifetime of the product.
The team that puts routing in place at month one doesn't just save money in month one. They build a habit, a system, and an architectural pattern that captures savings automatically as the product scales.
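The table's arithmetic is easy to reproduce. A toy model (with an illustrative spend trajectory, as noted above) might look like:

```python
def cumulative_overpay(monthly_spend, routed_cost_fraction=0.40):
    """Running overpay when routing would have cut each month's bill to
    routed_cost_fraction of actual spend (0.40 matches the table above)."""
    running, totals = 0.0, []
    for spend in monthly_spend:
        running += spend * (1 - routed_cost_fraction)
        totals.append(round(running))
    return totals

# Month 1 at $800 of GPT-4o spend reproduces the table's $480 overpay.
print(cumulative_overpay([800]))  # [480]
```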
The Invisible Cost: Engineering Distraction
Here's the cost that doesn't appear on any invoice.
When LLM costs become painful enough to act on, the typical response is to build something. A team might spend two to four weeks building a custom routing layer: a classifier, a model registry, a fallback system, retry logic, quality scoring, monitoring. Each component is reasonable; together they represent a significant engineering investment.
That investment doesn't go into the product. It doesn't improve user experience. It doesn't ship features. It exists entirely to paper over an infrastructure decision that shouldn't have required custom engineering in the first place.
The build-it-yourself trap
Teams that build internal routing systems routinely underestimate:
- Ongoing maintenance: Models update, pricing changes, provider APIs evolve. Custom routing logic needs to track all of it.
- Quality monitoring: A routing decision that saves money but degrades quality is worse than no routing. Monitoring quality at the per-request level is non-trivial.
- Edge case handling: What happens when the cheaper model returns a malformed response? When latency spikes on one provider? When your classifier is wrong about a request's complexity?
- A/B testing infrastructure: To validate that routing doesn't hurt quality, you need to run both models and compare. That's additional engineering.
A team of three engineers spending six weeks on this infrastructure incurs an opportunity cost of roughly $60,000–$90,000 in engineering time. That's before accounting for ongoing maintenance.
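The arithmetic behind that range, assuming a fully loaded cost of $3,500 to $5,000 per engineer-week (an assumed figure; substitute your own rates):

```python
engineers = 3
weeks = 6
engineer_weeks = engineers * weeks  # 18 engineer-weeks

# Assumed fully loaded cost per engineer-week, not a quoted market rate.
low_rate, high_rate = 3_500, 5_000

low_estimate = engineer_weeks * low_rate    # $63,000
high_estimate = engineer_weeks * high_rate  # $90,000
# Roughly the $60,000-$90,000 range cited above.
```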
The alternative is a proxy that handles this at the infrastructure layer — observation-only while you measure, routing live only once you've validated the savings. That's what PromptUnit does.
The Rate Limit and Reliability Cost
GPT-4o operates under stricter rate limits than GPT-4o-mini. Teams that default all traffic to GPT-4o hit rate limits sooner as they scale, which leads to one of two outcomes:
- Throttled user experience: Requests queue or fail during peak traffic, degrading the product
- Tier upgrades with OpenAI: Paying for higher rate limit tiers to handle volume that could have gone to cheaper models with more permissive limits
Neither outcome is captured in the baseline cost estimate. Both add real cost — either in user experience damage or in additional API tier fees.
Routing distributes traffic across models and, in multi-provider configurations, across providers entirely. This naturally reduces rate limit pressure on any single model.
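As a sketch of the effect, splitting a fixed request rate across two models lowers the rate any single model sees; the 6,000 RPM limit below is a made-up number for illustration, not an actual OpenAI tier:

```python
def per_model_rpm(total_rpm, split):
    """Requests per minute hitting each model under a routing split."""
    return {model: total_rpm * share for model, share in split.items()}

total_rpm = 10_000
all_gpt4o = per_model_rpm(total_rpm, {"gpt-4o": 1.0})
routed = per_model_rpm(total_rpm, {"gpt-4o": 0.40, "gpt-4o-mini": 0.60})

hypothetical_limit = 6_000  # invented for illustration
print(all_gpt4o["gpt-4o"] > hypothetical_limit)  # True: all-GPT-4o traffic exceeds it
print(routed["gpt-4o"] > hypothetical_limit)     # False: routed traffic stays under
```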
The Technical Debt Cost
Architecturally, defaulting to GPT-4o creates a form of technical debt that's difficult to repay later.
When every call in your codebase hits the same endpoint with the same model string, routing that traffic differently in the future requires either:
1. Touching every call site: finding every place in the codebase where `model="gpt-4o"` appears and adding conditional logic
2. Wrapping the OpenAI client: building an abstraction layer around your SDK calls that can be intercepted — which is the architecture an inference proxy provides, but with a bespoke maintenance burden
3. A proxy layer: adding a routing proxy that intercepts calls without requiring code changes
Option 3 is the cleanest, but teams that built on options 1 or 2 often have fragile, inconsistently applied routing logic scattered across their codebase. Consolidating that into a proper routing layer later requires a refactor that carries its own risk and cost.
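For context, the proxy-layer option usually amounts to a one-line change with the OpenAI Python SDK. The URL below is a placeholder, and an OpenAI-compatible proxy endpoint is assumed rather than documented here:

```python
from openai import OpenAI

# Hypothetical proxy endpoint; any OpenAI-compatible routing proxy
# slots in here without touching individual call sites.
client = OpenAI(
    base_url="https://proxy.example.com/v1",  # placeholder, not a real endpoint
    api_key="YOUR_API_KEY",
)

# Call sites still ask for "gpt-4o"; the proxy decides what actually serves it.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
)
```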
The Opportunity Cost of "We'll Optimize Later"
There's a broader cost that's worth naming: the opportunity cost of the optimization work you deprioritize because LLM costs aren't painful enough yet.
Teams with a routing system in place from early on get visibility into their LLM usage that informs product decisions:
- Which features are driving the most API spend? (Should we rethink how feature X works?)
- Which requests are being classified as high-complexity? (Is our prompting more verbose than it needs to be?)
- What's the per-user LLM cost? (Can we offer different tiers priced around actual cost-to-serve?)
This visibility is hard to retrofit. Without per-request cost attribution, you're running a product with significant variable costs that you can't allocate to the features generating them.
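Per-request attribution is mechanically simple once each log entry carries a feature tag, a model name, and token counts. A minimal sketch (the log fields here are assumptions, not any particular product's schema):

```python
from collections import defaultdict

# $ per 1M tokens (input, output), using the list prices quoted in this post
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def cost_by_feature(request_logs):
    """Roll per-request token usage up into spend per product feature."""
    totals = defaultdict(float)
    for req in request_logs:
        in_rate, out_rate = PRICES[req["model"]]
        totals[req["feature"]] += (
            req["input_tokens"] / 1_000_000 * in_rate
            + req["output_tokens"] / 1_000_000 * out_rate
        )
    return dict(totals)

logs = [
    {"feature": "summaries", "model": "gpt-4o",
     "input_tokens": 1_000_000, "output_tokens": 100_000},
    {"feature": "tagging", "model": "gpt-4o-mini",
     "input_tokens": 2_000_000, "output_tokens": 200_000},
]
costs = cost_by_feature(logs)  # summaries ≈ $3.50, tagging ≈ $0.42
```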
The Quality Assumption That's Usually Wrong
The implicit logic of "default to GPT-4o" is: "We don't know which tasks need the frontier model, so we use it for everything to be safe."
This assumption has a hidden premise: that GPT-4o produces meaningfully better output on the majority of your requests.
For most production workloads, that premise is wrong.
As we documented in We Analyzed 10,000 GPT-4o Calls — 60% Didn't Need GPT-4o, the majority of production LLM traffic is composed of summarization, classification, extraction, and templated generation — tasks where GPT-4o-mini performs within a few percentage points of GPT-4o on human evaluation benchmarks.
The "safety" of defaulting to GPT-4o is largely illusory. You're paying a 16x premium on output costs for quality headroom that most of your requests don't use.
A Framework for Calculating Your Actual Exposure
To estimate your current overpay, you need three numbers:
- Monthly token volume: Input and output tokens separately (from OpenAI usage dashboard)
- Estimated routable percentage: For most SaaS products, 55–70% of calls are routable to GPT-4o-mini
- Current model mix: If you're already using multiple models, what percentage is GPT-4o?
```python
# Rough overpay calculator
monthly_input_tokens = 200_000_000   # from your OpenAI dashboard
monthly_output_tokens = 80_000_000   # from your OpenAI dashboard
routable_fraction = 0.60             # conservative estimate for most SaaS

# Current cost (100% GPT-4o)
current_cost = (
    monthly_input_tokens / 1_000_000 * 2.50
    + monthly_output_tokens / 1_000_000 * 10.00
)

# Projected cost with routing
routed_input = monthly_input_tokens * routable_fraction
routed_output = monthly_output_tokens * routable_fraction
remaining_input = monthly_input_tokens * (1 - routable_fraction)
remaining_output = monthly_output_tokens * (1 - routable_fraction)

projected_cost = (
    routed_input / 1_000_000 * 0.15         # GPT-4o-mini input
    + routed_output / 1_000_000 * 0.60      # GPT-4o-mini output
    + remaining_input / 1_000_000 * 2.50    # GPT-4o input
    + remaining_output / 1_000_000 * 10.00  # GPT-4o output
)

monthly_savings = current_cost - projected_cost
print(f"Current cost: ${current_cost:,.2f}")
print(f"With routing: ${projected_cost:,.2f}")
print(f"Monthly savings: ${monthly_savings:,.2f}")
```
For a team with the token volumes above, that calculator returns approximately $730 in potential monthly savings with a 60% routing split — before accounting for PromptUnit's 20% fee.
When Defaulting to GPT-4o Is Actually Correct
This post isn't arguing that GPT-4o is the wrong default in every situation.
If you are:
- Pre-product, in rapid iteration with minimal traffic
- Running a product where code generation or complex reasoning is the core feature
- Operating at low enough volume that cost isn't yet a meaningful concern
- Actively unsure about your quality requirements
...then "use GPT-4o for everything" is a reasonable short-term default. The optimization can wait.
The problem isn't the choice — it's forgetting to revisit it. Most teams still defaulting to GPT-4o at meaningful scale ($2,000+/month) are already past the point where revisiting the routing question would pay off immediately.
Getting Accurate Numbers for Your Traffic
The most important input to any routing decision is data about your own traffic. What are the actual task types in your production requests? How complex are they? What quality signals exist for each category?
Without this data, routing decisions are guesses. With it, they're engineering.
PromptUnit's observation mode gives you this data without requiring routing changes. Every request is analyzed and classified for 14 days. At the end, you see exactly what routing would save on your specific traffic — and you decide whether to activate it.
There's no commitment, no model change, and no code change beyond a base URL swap. The observation starts immediately. Learn more about how to get started and what the 14-day analysis reveals.
Key Takeaways
- The cost of defaulting to GPT-4o compounds over the product lifecycle — teams that optimize early capture compounding savings, not just a one-time reduction.
- Beyond the API invoice, the hidden costs include engineering time spent on custom routing infrastructure, rate limit pressure, technical debt from scattered model selection logic, and lost visibility into cost-by-feature.
- The "safety" assumption behind GPT-4o defaults is usually wrong: for the majority of production tasks, GPT-4o-mini produces output within human evaluation error of GPT-4o.
- A 60% routing split on a $5,000/month GPT-4o bill generates roughly $2,400–$3,000 in monthly savings — before the infrastructure cost of achieving it.
- The right approach is to measure your actual traffic before routing anything. Observation mode provides this without risk.