GPT-5.4 vs GPT-5.4 Mini: The Routing Decision OpenAI Just Made For You
GPT-5.4 mini scored 72.1% on OSWorld against the flagship's 75.0%, at 70% lower cost. Here is the routing math after the March 17 release.
OpenAI released GPT-5.4 mini on March 17, 2026. On OSWorld-Verified, the standard benchmark for computer-use agents, it scored 72.1%. The flagship GPT-5.4 scored 75.0%. That is a 2.9-point gap, and mini costs roughly 70% less per token.
Most production traffic running on GPT-5.4 today does not need GPT-5.4. It needed something like it back when "something like it" did not exist. Now it does, and the price is $0.75 per million input tokens and $4.50 per million output tokens, against the flagship's $2.50 and $15.00. If you are routing every call to the flagship by default, you are paying for capability your prompts will never spend.
This is not a hot take. It is a benchmark-by-benchmark walk through what happened on March 17, and what it should change about how you route requests in production.
The benchmarks
SWE-Bench Pro, the public standard for autonomous software engineering tasks, gave GPT-5.4 mini a 54.4%. Nano scored 52.4%. The previous-generation GPT-5 mini scored 45.7%. In a single point release, mini gained nearly nine points on the hardest coding benchmark in public circulation, putting it within striking distance of the flagship on tasks that previously required full reasoning.
Terminal-Bench 2.0, which tests command-line and shell agent behavior, gave mini a 60.0%. Claude Haiku 4.5 scored 41.0% on the same benchmark. Gemini 3 Flash scored 47.6%. Mini is now the best-in-class small model for terminal work, and by a meaningful margin.
τ2-bench, the multi-turn tool-use benchmark, jumped from 74.1% on GPT-5 mini to 93.4% on GPT-5.4 mini. That is a 19-point lift on a benchmark that historically tracks closely with real-world agent reliability.
MCP Atlas, which tests Model Context Protocol fluency, climbed from 47.6% to 57.7%. That matters if your stack passes structured tool definitions to the model, which most production stacks now do.
Then OSWorld-Verified: 72.1% for mini, 75.0% for flagship. That is the smallest gap we have ever seen between a mini-tier OpenAI model and its flagship sibling.
Where mini still loses
Mini is not a flagship killer. It is a flagship competitor on a defined surface.
The gap widens on long-horizon reasoning, on math beyond grade-school arithmetic, on tasks that require stitching together more than a few minutes of context, and on creative generation where the flagship's larger parameter count buys real diversity in output. If you are running an agent that has to plan five steps ahead, hold a 200K-token document in working memory, and reconcile contradictions across sources, the flagship is still the right call.
The gap also matters when output quality is the product. If you are charging a customer for a generated essay and they will read every word, do not save 70 cents per call to ship something noticeably worse. The cost math only works if spend is the bottleneck and quality has a floor mini can clear.
Nano is a different conversation. Scores of 52.4% on SWE-Bench Pro and 39.0% on OSWorld mean nano is a classification, extraction, and short-form generation model. Do not route reasoning-heavy work to it.
The cost math
Take a team running 100 million tokens of completion traffic per month, split 60/40 input to output. At the rates above, that is 60 million input tokens and 40 million output tokens: roughly $750 per month on the flagship, and roughly $225 on mini. The delta is $525 per month, or $6,300 per year, on a single workload.
That is the small case. Most engineering teams we see running production LLM workloads are at 1 to 5 billion tokens per month across all features. At 1 billion tokens, the same split puts the flagship at $7,500 and mini at $2,250 per month. At 5 billion, the flagship costs $37,500 and mini costs $11,250. The annualized delta for a 5-billion-token customer is $315,000.
You do not have to route 100% of traffic to mini to capture most of that. If you route the 60% to 70% of calls that would not lose quality on mini (a number we see consistently in customer traffic, and which we wrote about previously in our analysis of 10,000 GPT-4o calls), the savings still land in the 40% to 50% range against your current bill.
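The arithmetic above is simple enough to sanity-check in a few lines. This is a minimal sketch using the per-million-token rates quoted in this article; the model names are just dictionary keys here, not an official API:

```python
# Per-million-token rates quoted in the article, in dollars.
RATES = {
    "gpt-5.4":      {"input": 2.50, "output": 15.00},
    "gpt-5.4-mini": {"input": 0.75, "output": 4.50},
}

def monthly_cost(model: str, total_tokens: float, input_share: float = 0.6) -> float:
    """Monthly cost for total_tokens split input_share / (1 - input_share)."""
    r = RATES[model]
    millions = total_tokens / 1_000_000
    return millions * (input_share * r["input"] + (1 - input_share) * r["output"])

def blended_cost(total_tokens: float, mini_share: float) -> float:
    """Cost when mini_share of traffic routes to mini, the rest to flagship."""
    return (monthly_cost("gpt-5.4-mini", total_tokens * mini_share)
            + monthly_cost("gpt-5.4", total_tokens * (1 - mini_share)))

flagship = monthly_cost("gpt-5.4", 100e6)       # 100M tokens on flagship
mini = monthly_cost("gpt-5.4-mini", 100e6)      # same volume on mini
partial = blended_cost(100e6, 0.65)             # 65% of calls routed to mini
savings = 1 - partial / flagship                # fraction saved vs. all-flagship
```

Routing 65% of traffic to mini lands the blended savings in the mid-40s percent, which is where the 40% to 50% figure above comes from: mini is 70% cheaper per token, applied to roughly two thirds of the volume.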
What to actually route
After two years of production routing data, the rules of thumb from the GPT-4o era still hold, and the 5.4 release has tightened them. The pattern is the same one we covered in our earlier comparison of when the cheaper model wins; the gap has simply narrowed further on coding and tool use.
Route to mini: classification, intent detection, short summaries (under 500 tokens of output), tool-call orchestration, code completion on well-defined functions, terminal-style tasks, document tagging, structured extraction (JSON from unstructured input), retrieval reranking, semantic search over chunked content, and most agent inner-loop steps where the action space is bounded.
Keep on flagship: long-form essay generation, multi-document synthesis, novel reasoning chains longer than five steps, code architecture spanning more than two files, math and proofs, ambiguous customer-facing output where tone matters, and anything where a bad output costs more than 10x the savings.
Route to nano: classification with fewer than five labels, sentiment scoring, language detection, content moderation pre-screening, and any task where you are essentially using the model as a regex with vibes.
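If your pipeline already tags each request with a use-case label, the three buckets above collapse into a static lookup table. This is an illustrative sketch, assuming hypothetical task labels and model identifiers; the actual labels are whatever your own instrumentation emits:

```python
# Hypothetical task labels mapped to the cheapest adequate tier.
ROUTES = {
    # nano tier: the model as a regex with vibes
    "sentiment": "gpt-5.4-nano",
    "language_detection": "gpt-5.4-nano",
    "moderation_prescreen": "gpt-5.4-nano",
    # mini tier: bounded, well-defined work
    "intent_detection": "gpt-5.4-mini",
    "short_summary": "gpt-5.4-mini",
    "tool_orchestration": "gpt-5.4-mini",
    "structured_extraction": "gpt-5.4-mini",
    "code_completion": "gpt-5.4-mini",
    # flagship tier: long-horizon reasoning and quality-is-the-product work
    "long_form_generation": "gpt-5.4",
    "multi_doc_synthesis": "gpt-5.4",
    "code_architecture": "gpt-5.4",
    "math": "gpt-5.4",
}

def pick_model(task_label: str) -> str:
    # Unclassified work should fail expensive, not fail wrong:
    # default to the flagship.
    return ROUTES.get(task_label, "gpt-5.4")
```

The deliberate design choice is the default: anything the table does not recognize goes to the flagship, so a missing label costs you money rather than quality.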
The default-flagship trap
Most teams default to GPT-5.4 because it was the easiest decision at integration time. The implicit contract was: pay the highest rate, get the fewest surprises. That trade made sense in 2024, when mini-tier models had real quality gaps. It makes less sense in April 2026, when mini hits 72% on OSWorld and 60% on Terminal-Bench.
The cost of defaulting to flagship is no longer hypothetical. We covered this in our piece on the hidden cost of defaulting to GPT-4o in production, and the math is now sharper, not softer. With the 5.4 release, the routing decision flipped from "is mini good enough on this task" to "is the flagship demonstrably better on this task." That is a different question, and the answer is "no" more often than most teams have measured.
The fix is not to rip out your flagship integration. It is to add a routing layer that classifies each request and sends it to the cheapest model that will not degrade output quality. The classification can be heuristic (token count, prompt structure, tool definitions present) or learned (fingerprinting based on prompt embeddings), but it has to exist. Without it, you are paying flagship rates for nano-tier requests, and that is where the bill bloats.
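A heuristic version of that routing layer can be sketched from the signals named above: token count, output length, and whether tool definitions are present. The thresholds and model names here are illustrative placeholders, not tuned production values:

```python
def route(request: dict) -> str:
    """Pick a model tier from coarse request features.

    `request` is assumed to carry a pre-computed prompt token count,
    an optional list of tool definitions, and a max_tokens cap.
    """
    prompt_tokens = request.get("prompt_tokens", 0)
    has_tools = bool(request.get("tools"))
    wants_long_output = request.get("max_tokens", 0) > 2000

    # Tiny, tool-free, short-output requests: nano territory.
    if prompt_tokens < 300 and not has_tools and not wants_long_output:
        return "gpt-5.4-nano"
    # Bounded prompts with short-to-medium outputs, with or without
    # tools: mini's sweet spot after the 5.4 release.
    if prompt_tokens < 20_000 and not wants_long_output:
        return "gpt-5.4-mini"
    # Long context or long-form output stays on the flagship.
    return "gpt-5.4"
```

A learned router replaces these if-statements with a classifier over prompt embeddings, but the contract is the same: a pure function from request features to a model name, sitting in front of your existing client.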
What to do this week
Pull a week of completion traffic from your logs. Bucket by use case. For each bucket, sample 20 calls and run them through GPT-5.4 mini in parallel with the flagship. Compare outputs on whatever quality metric your team already tracks (eval pass rate, human rating, regex match against expected output). Any bucket where mini matches or comes within your acceptable quality tolerance is a candidate for routing.
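The bake-off loop is small enough to write in an afternoon. This sketch assumes you inject two project-specific pieces yourself: a `complete` function that replays messages against the candidate model and returns its text, and a `quality_check` function wrapping whatever metric your team already tracks; both are placeholders, not a real client:

```python
import random

def bakeoff(bucket_calls, complete, quality_check, n=20, seed=0):
    """Replay up to n sampled logged calls through `complete` and
    return the fraction that pass `quality_check`.

    bucket_calls: logged requests for one use-case bucket, each a dict
                  with at least a "messages" field.
    complete:     callable(messages) -> candidate model's output text.
    quality_check: callable(call, output) -> bool, your existing metric.
    """
    rng = random.Random(seed)  # fixed seed so reruns sample the same calls
    sample = rng.sample(bucket_calls, min(n, len(bucket_calls)))
    passed = sum(
        1 for call in sample
        if quality_check(call, complete(call["messages"]))
    )
    return passed / len(sample)
```

Run it once per bucket with mini as the candidate; any bucket whose pass rate clears your quality floor is a routing candidate. Fixing the sampling seed keeps the comparison reproducible when you rerun against the flagship.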
If you want to skip the bake-off, that is what proxy-based routing exists for. PromptUnit's Inferio routing layer scores each incoming prompt against a 22-layer model and routes it to the cheapest model that can return equivalent quality. When OpenAI shipped GPT-5.4 mini, the routing weights updated within hours, and customer traffic that had previously gone to the flagship for borderline cases started landing on mini automatically. Customers do not change code. They swap a base URL, run for 14 days in observation mode to see projected savings, then flip the switch. Pricing is 20% of verified savings; if the routing does not save you money, you do not pay.
If you are running 100 million or more tokens per month on GPT-5.4 and have not benchmarked your traffic against mini, that is the highest-leverage thing you can do this week. Start at promptunit.ai.