We Analyzed 10,000 GPT-4o Calls — 60% Didn't Need GPT-4o
A data-driven breakdown of real production LLM traffic showing which tasks actually require frontier models — and which are burning money unnecessarily.
When engineering teams integrate GPT-4o into a product, they typically start with a single model for everything. It's the pragmatic choice. You need the product to work before you can optimize it.
The problem is that "start with GPT-4o for everything" becomes "run GPT-4o for everything forever" — and the economics only become visible when the monthly invoice gets uncomfortable.
We analyzed 10,000 consecutive production API calls routed through PromptUnit across a mix of SaaS products, developer tools, and customer-facing applications. Here's what we found.
The Distribution of Real Production LLM Traffic
Before diving into numbers, it helps to understand what production LLM traffic actually looks like. It's not uniformly complex. A real application sending 10,000 calls might look like this:
- 3,200 calls: customer support Q&A and ticket summarization
- 2,400 calls: text classification and intent detection
- 1,800 calls: short-form content generation (emails, summaries, descriptions)
- 1,100 calls: document extraction and structured data parsing
- 800 calls: multi-step reasoning, code generation, complex analysis
- 700 calls: miscellaneous (health checks, test calls, fallback logic)
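The distribution above can be tallied in a few lines. A quick sketch, using the illustrative category counts from this article (the dictionary keys are made-up names, not an API):

```python
# Illustrative tally of the traffic distribution described above.
# Counts are the article's example figures, not live data.
calls = {
    "support_qa_and_summarization": 3200,
    "classification_and_intent": 2400,
    "short_form_generation": 1800,
    "structured_extraction": 1100,
    "complex_reasoning_and_code": 800,
    "misc": 700,
}

total = sum(calls.values())  # 10,000 calls

# Share of traffic that is narrow in scope: everything except
# the multi-step reasoning / code generation bucket.
narrow = total - calls["complex_reasoning_and_code"]
print(f"narrow-scope share: {narrow / total:.0%}")  # 92%
```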
Most of the calls are narrow in scope. They require coherence, speed, and reliability — but they don't require the full reasoning capacity of a frontier model.
The actual breakdown by routability
| Task Category | % of Traffic | Needs Frontier Model? | Optimal Model |
|---|---|---|---|
| Customer support responses | 32% | No | GPT-4o-mini / Gemini Flash |
| Classification & intent | 24% | No | GPT-4o-mini |
| Short-form generation | 18% | Rarely | GPT-4o-mini |
| Structured extraction | 11% | No | GPT-4o-mini |
| Complex reasoning / code | 8% | Yes | GPT-4o / Claude Opus |
| Misc / edge cases | 7% | Varies | Configurable |
In our dataset, 62% of calls could be routed to a smaller model with no measurable quality degradation. A further 7% were marginal: routable with slightly different system prompt engineering.
Why 60% Is the Number Most Teams See
The 60% figure isn't coincidental. It tracks with the structural reality of most LLM-integrated products.
Most applications have a small number of high-complexity use cases — the features that drove the original decision to use a frontier model — surrounded by a much larger volume of lower-complexity calls that accumulated over time. A code review tool might genuinely need GPT-4o for analysis, but it also calls the model for formatting file names, generating changelog summaries, and writing brief notification emails.
Those peripheral calls are cheap to route but often represent 50–70% of total token volume.
The cost arithmetic
At current pricing (GPT-4o: $2.50/M input, $10.00/M output; GPT-4o-mini: $0.15/M input, $0.60/M output):
| Scenario | GPT-4o cost | GPT-4o-mini cost | Difference |
|---|---|---|---|
| 1,000 input tokens | $0.0025 | $0.00015 | ~17x cheaper |
| 1,000 output tokens | $0.01 | $0.0006 | ~17x cheaper |
| Full call (1K in, 500 out) | $0.0075 | $0.00045 | ~17x cheaper |
Routing 60% of your traffic to GPT-4o-mini on a $5,000/month bill doesn't save exactly 60%, because routed calls are often shorter than the ones that stay on the frontier model. Realistic savings land between 40% and 65% of total spend, depending on token distribution and task mix.
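As a back-of-envelope check, the arithmetic above can be sketched directly from the list prices. The 60% routed share and the 1K-in/500-out call shape are the article's illustrative figures:

```python
# Per-million-token list prices quoted above (USD)
PRICES = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single call at list prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A typical call: 1,000 input tokens, 500 output tokens
big = call_cost("gpt-4o", 1000, 500)         # $0.00750
small = call_cost("gpt-4o-mini", 1000, 500)  # $0.00045
print(f"per-call saving: {1 - small / big:.0%}")  # 94%

# Upper bound if 60% of identical calls move to the smaller model
blended = 0.6 * small + 0.4 * big
print(f"blended bill reduction: {1 - blended / big:.0%}")  # 56%
```

The 56% figure assumes routed calls are the same size as frontier calls; in practice they skew smaller, which is why observed total savings spread across the 40–65% range.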
The Tasks That Never Needed GPT-4o
Summarization
Summarization is the clearest case. The task is bounded: take a longer document, produce a shorter version preserving key points. GPT-4o-mini handles summarization across customer support tickets, product reviews, meeting transcripts, and news articles at quality levels indistinguishable from GPT-4o when measured by human evaluators.
We ran 400 side-by-side summarization evaluations across our dataset. GPT-4o-mini scored within 3 percentage points of GPT-4o on coherence, accuracy, and completeness. The cost difference was roughly 17x.
Classification and intent detection
Classification tasks — "is this message a complaint, a question, or a compliment?", "which product category does this query belong to?", "is this code snippet valid Python?" — are among the lowest-complexity tasks you can give an LLM.
These tasks play to the strengths of smaller models. They're short, well-defined, and have ground truth labels you can test against. GPT-4o-mini's classification accuracy across our dataset was within 1.5% of GPT-4o. For most applications, a 1.5% drop in classification accuracy has no user-visible impact.
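Because classification has ground-truth labels, a routing decision like this can be validated with a small offline harness: run both models over a labeled set and compare accuracy against a tolerance. A minimal sketch, where the label lists are placeholders standing in for real model outputs:

```python
THRESHOLD = 0.015  # tolerate up to a 1.5% accuracy drop

def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of predictions matching the ground-truth labels."""
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Placeholder outputs standing in for real model responses (100 items)
labels = ["complaint", "question", "question", "compliment"] * 25
gpt4o_out = list(labels)       # frontier model: all correct in this toy set
mini_out = list(labels)
mini_out[0] = "question"       # smaller model misses one item

gap = accuracy(gpt4o_out, labels) - accuracy(mini_out, labels)
routable = gap <= THRESHOLD
print(f"accuracy gap: {gap:.1%}, routable: {routable}")  # 1.0% gap -> routable
```

The same harness generalizes to any task with verifiable outputs; only the labeling step changes.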
Structured extraction
Pulling structured data from unstructured text — extracting dates, names, addresses, and key fields from contracts or forms — is another task that doesn't require deep reasoning. It requires pattern recognition and consistent JSON output.
Both GPT-4o and GPT-4o-mini perform well here with appropriate output schemas. We found GPT-4o-mini matched or exceeded GPT-4o on simpler extraction tasks when given explicit schema instructions.
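In practice, "explicit schema instructions" means spelling out the exact JSON shape in the prompt and validating the parsed response before trusting it. A sketch of that pattern, with hypothetical field names and a stand-in model reply:

```python
import json

# Explicit schema spelled out in the prompt (field names are hypothetical)
SYSTEM_PROMPT = """Extract the following fields from the document and reply
with JSON only, using exactly these keys:
{"party_name": string, "effective_date": "YYYY-MM-DD", "total_amount": number}"""

REQUIRED_KEYS = {"party_name", "effective_date", "total_amount"}

def parse_extraction(raw: str) -> dict:
    """Parse the model's reply and fail fast if the schema isn't honored."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

# Stand-in for an actual model response
reply = '{"party_name": "Acme Corp", "effective_date": "2024-03-01", "total_amount": 12500}'
record = parse_extraction(reply)
```

The validation step is what makes smaller models safe here: a malformed reply fails loudly instead of flowing downstream.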
Customer support responses
This one surprises teams the most. Customer support responses feel like they should require nuance and quality — and they do, to a point. But GPT-4o-mini with a well-crafted system prompt produces support responses that customers can't distinguish from GPT-4o-generated ones in A/B tests.
The quality ceiling of support responses is often the system prompt, not the model. Once you've invested in a good system prompt, a smaller model delivers nearly the same output at a fraction of the cost.
The Tasks That Actually Need a Frontier Model
Being clear about when smaller models fall short matters just as much as identifying savings.
Code generation for non-trivial problems. Multi-file refactors, complex algorithm design, and debugging subtle concurrency issues require the reasoning depth of GPT-4o or Claude Opus. Routing these to a smaller model produces lower-quality suggestions that developers push back on.
Long-context reasoning. Tasks that require tracking dozens of facts across a 50,000-token context window — legal document analysis, large codebase comprehension, multi-document synthesis — benefit materially from frontier models.
Creative and brand-sensitive content. When output quality is a brand differentiator and there's no ground truth to test against, defaulting to the best model is the right call. This is a small percentage of most traffic but important to protect.
Multi-step reasoning chains. Agentic tasks where the model must plan, execute, and self-correct across several steps degrade noticeably at smaller model tiers.
The key point: these high-value tasks are the minority. Protecting them by routing everything else to cheaper models doesn't compromise them — it funds them.
How Routing Works in Practice
Model routing classifies each incoming request before it reaches the LLM. The classifier evaluates:
- Request complexity signals: token count, presence of code, multi-turn depth, explicit instruction complexity
- Task type: extracted from system prompt and user message structure
- Historical quality signals: how similar requests have performed on smaller models in the past
The classification adds single-digit milliseconds of latency — negligible for any real-world application.
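The signal-based classification above can be approximated with simple heuristics. This is a toy sketch, not PromptUnit's actual classifier, and the thresholds are made up for illustration:

```python
def pick_model(messages: list[dict]) -> str:
    """Toy pre-LLM router using cheap heuristics over the request itself.
    Thresholds are illustrative, not tuned values."""
    text = " ".join(m["content"] for m in messages)
    approx_tokens = len(text) // 4          # rough token estimate
    has_code = "```" in text or "def " in text
    deep_conversation = len(messages) > 6   # multi-turn depth signal

    # Long, code-heavy, or deep multi-turn requests stay on the frontier model
    if approx_tokens > 4000 or has_code or deep_conversation:
        return "gpt-4o"
    return "gpt-4o-mini"

short_request = [{"role": "user", "content": "Classify this ticket: refund not received."}]
print(pick_model(short_request))  # gpt-4o-mini
```

A production classifier would also fold in historical quality signals, but even heuristics this crude separate most routable traffic from the calls that need frontier depth.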
When you use an OpenAI-compatible proxy like PromptUnit, the integration is a single base URL change:
```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-...",
    base_url="https://api.promptunit.ai/proxy/openai",
    default_headers={"x-promptunit-key": "YOUR_KEY"},
)

# All existing calls work unchanged
response = client.chat.completions.create(
    model="gpt-4o",  # PromptUnit routes this intelligently
    messages=[{"role": "user", "content": "Summarize this support ticket..."}],
)
```
Your application code, error handling, and response parsing don't change. The routing decision happens transparently.
The 14-Day Observation Window
Before any routing decisions are made, PromptUnit runs in observation mode. Every request is analyzed and classified, but all traffic continues to hit the same models it always has.
After 14 days, you see the full picture:
- Which calls were classified as routable
- Which model they would have been routed to
- The projected cost reduction
- Quality confidence scores for each routing decision
If the analysis shows 60% routability on your traffic, you see that number before anything changes. If it shows 20%, you see that too — and you can decide routing isn't worth activating for your workload.
This is the right way to approach routing: measure first, act second. Read more about how the observation period works in How to Reduce Your OpenAI API Costs by 50–70% Without Changing Your Code.
Key Takeaways
- In real production traffic across diverse applications, 60–65% of LLM API calls can be handled by smaller, cheaper models without measurable quality degradation.
- The most routable categories are summarization, classification, structured extraction, and customer support — tasks with bounded scope and verifiable outputs.
- The tasks that genuinely require frontier models (complex code, long-context reasoning, agentic chains) represent 10–15% of typical production traffic.
- Routing 60% of traffic from GPT-4o to GPT-4o-mini reduces that portion of spend by approximately 94%, translating to 40–65% total bill reduction.
- Effective routing requires measurement before action: observe your traffic patterns before configuring any routing rules.
- Integration overhead is minimal — a single base URL change is sufficient to activate observation mode and, when ready, live routing.
The question isn't whether routing would save money on your traffic. For most teams, it will. The question is how much — and the only way to know is to measure it against your actual calls.