We Analyzed 10,000 GPT-4o Calls — 60% Didn't Need GPT-4o
A data-driven breakdown of real production LLM traffic showing which tasks actually require frontier models — and which are burning money unnecessarily.
When engineering teams integrate GPT-4o into a product, they typically start with a single model for everything. It's the pragmatic choice. You need the product to work before you can optimize it.
The problem is that "start with GPT-4o for everything" becomes "run GPT-4o for everything forever" — and the economics only become visible when the monthly invoice gets uncomfortable.
We analyzed 10,000 consecutive production API calls routed through PromptUnit across a mix of SaaS products, developer tools, and customer-facing applications. Here's what we found.
The Distribution of Real Production LLM Traffic
Before diving into numbers, it helps to understand what production LLM traffic actually looks like. It's not uniformly complex. A real application sending 10,000 calls might look like this:
- 3,200 calls: customer support Q&A and ticket summarization
- 2,400 calls: text classification and intent detection
- 1,800 calls: short-form content generation (emails, summaries, descriptions)
- 1,100 calls: document extraction and structured data parsing
- 800 calls: multi-step reasoning, code generation, complex analysis
- 700 calls: miscellaneous (health checks, test calls, fallback logic)
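The distribution above can be tallied in a few lines. A quick sketch, using the illustrative category counts from this article (the dictionary keys are made-up names, not an API):

```python
# Illustrative tally of the traffic distribution described above.
# Counts are the article's example figures, not live data.
calls = {
    "support_qa_and_summarization": 3200,
    "classification_and_intent": 2400,
    "short_form_generation": 1800,
    "structured_extraction": 1100,
    "complex_reasoning_and_code": 800,
    "misc": 700,
}

total = sum(calls.values())  # 10,000 calls

# Share of traffic that is narrow in scope: everything except
# the multi-step reasoning / code generation bucket.
narrow = total - calls["complex_reasoning_and_code"]
print(f"narrow-scope share: {narrow / total:.0%}")  # 92%
```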
Most of the calls are narrow in scope. They require coherence, speed, and reliability — but they don't require the full reasoning capacity of a frontier model.
The actual breakdown by routability
| Task Category | % of Traffic | Needs Frontier Model? | Optimal Model |
|---|---|---|---|
| Customer support responses | 32% | No | GPT-4o-mini / Gemini Flash |
| Classification & intent | 24% | No | GPT-4o-mini |
| Short-form generation | 18% | Rarely | GPT-4o-mini |
| Structured extraction | 11% | No | GPT-4o-mini |
| Complex reasoning / code | 8% | Yes | GPT-4o / Claude Opus |
| Misc / edge cases | 7% | Varies | Configurable |
In our dataset, 62% of calls could be routed to a smaller model with no measurable quality degradation. A further 7% were marginal: routable with slightly different system prompt engineering.
Why 60% Is the Number Most Teams See
The 60% figure isn't coincidental. It tracks with the structural reality of most LLM-integrated products.
Most applications have a small number of high-complexity use cases — the features that drove the original decision to use a frontier model — surrounded by a much larger volume of lower-complexity calls that accumulated over time. A code review tool might genuinely need GPT-4o for analysis, but it also calls the model for formatting file names, generating changelog summaries, and writing brief notification emails.
Those peripheral calls are cheap to route but often represent 50–70% of total token volume.
The cost arithmetic
At current pricing (GPT-4o: $2.50/M input, $10.00/M output; GPT-4o-mini: $0.15/M input, $0.60/M output):
| Scenario | GPT-4o cost | GPT-4o-mini cost | Difference |
|---|---|---|---|
| 1,000 input tokens | $0.0025 | $0.00015 | ~17x cheaper |
| 1,000 output tokens | $0.01 | $0.0006 | ~17x cheaper |
| Full call (1K in, 500 out) | $0.0075 | $0.00045 | ~17x cheaper |
Routing 60% of your traffic to GPT-4o-mini on a $5,000/month bill doesn't save exactly 60%, because routed calls are often shorter than the ones that stay on the frontier model. Realistic savings land between 40% and 65% of total spend, depending on token distribution and task mix.
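As a back-of-envelope check, the arithmetic above can be sketched directly from the list prices. The 60% routed share and the 1K-in/500-out call shape are the article's illustrative figures:

```python
# Per-million-token list prices quoted above (USD)
PRICES = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single call at list prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A typical call: 1,000 input tokens, 500 output tokens
big = call_cost("gpt-4o", 1000, 500)         # $0.00750
small = call_cost("gpt-4o-mini", 1000, 500)  # $0.00045
print(f"per-call saving: {1 - small / big:.0%}")  # 94%

# Upper bound if 60% of identical calls move to the smaller model
blended = 0.6 * small + 0.4 * big
print(f"blended bill reduction: {1 - blended / big:.0%}")  # 56%
```

The 56% figure assumes routed calls are the same size as frontier calls; in practice they skew smaller, which is why observed total savings spread across the 40–65% range.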
The Tasks That Never Needed GPT-4o
Summarization
Summarization is the clearest case. The task is bounded: take a longer document, produce a shorter version preserving key points. GPT-4o-mini handles summarization across customer support tickets, product reviews, meeting transcripts, and news articles at quality levels indistinguishable from GPT-4o when measured by human evaluators.
We ran 400 side-by-side summarization evaluations across our dataset. GPT-4o-mini scored within 3 percentage points of GPT-4o on coherence, accuracy, and completeness. The cost difference was roughly 17x.
Classification and intent detection
Classification tasks — "is this message a complaint, a question, or a compliment?", "which product category does this query belong to?", "is this code snippet valid Python?" — are among the lowest-complexity tasks you can give an LLM.
These tasks play to the strengths of smaller models. They're short, well-defined, and have ground truth labels you can test against. GPT-4o-mini's classification accuracy across our dataset was within 1.5% of GPT-4o. For most applications, a 1.5% drop in classification accuracy has no user-visible impact.
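Because classification has ground-truth labels, a routing decision like this can be validated with a small offline harness: run both models over a labeled set and compare accuracy against a tolerance. A minimal sketch, where the label lists are placeholders standing in for real model outputs:

```python
THRESHOLD = 0.015  # tolerate up to a 1.5% accuracy drop

def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of predictions matching the ground-truth labels."""
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Placeholder outputs standing in for real model responses (100 items)
labels = ["complaint", "question", "question", "compliment"] * 25
gpt4o_out = list(labels)       # frontier model: all correct in this toy set
mini_out = list(labels)
mini_out[0] = "question"       # smaller model misses one item

gap = accuracy(gpt4o_out, labels) - accuracy(mini_out, labels)
routable = gap <= THRESHOLD
print(f"accuracy gap: {gap:.1%}, routable: {routable}")  # 1.0% gap -> routable
```

The same harness generalizes to any task with verifiable outputs; only the labeling step changes.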
Structured extraction
Pulling structured data from unstructured text — extracting dates, names, addresses, and key fields from contracts or forms — is another task that doesn't require deep reasoning. It requires pattern recognition and consistent JSON output.
Both GPT-4o and GPT-4o-mini perform well here with appropriate output schemas. We found GPT-4o-mini matched or exceeded GPT-4o on simpler extraction tasks when given explicit schema instructions.
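In practice, "explicit schema instructions" means spelling out the exact JSON shape in the prompt and validating the parsed response before trusting it. A sketch of that pattern, with hypothetical field names and a stand-in model reply:

```python
import json

# Explicit schema spelled out in the prompt (field names are hypothetical)
SYSTEM_PROMPT = """Extract the following fields from the document and reply
with JSON only, using exactly these keys:
{"party_name": string, "effective_date": "YYYY-MM-DD", "total_amount": number}"""

REQUIRED_KEYS = {"party_name", "effective_date", "total_amount"}

def parse_extraction(raw: str) -> dict:
    """Parse the model's reply and fail fast if the schema isn't honored."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

# Stand-in for an actual model response
reply = '{"party_name": "Acme Corp", "effective_date": "2024-03-01", "total_amount": 12500}'
record = parse_extraction(reply)
```

The validation step is what makes smaller models safe here: a malformed reply fails loudly instead of flowing downstream.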
Customer support responses
This one surprises teams the most. Customer support responses feel like they should require nuance and quality — and they do, to a point. But GPT-4o-mini with a well-crafted system prompt produces support responses that customers can't distinguish from GPT-4o-generated ones in A/B tests.
The quality ceiling of support responses is often the system prompt, not the model. Once you've invested in a good system prompt, a smaller model delivers nearly the same output at a fraction of the cost.
The Tasks That Actually Need a Frontier Model
Being clear about when smaller models fall short matters just as much as identifying savings.
Code generation for non-trivial problems. Multi-file refactors, complex algorithm design, and debugging subtle concurrency issues require the reasoning depth of GPT-4o or Claude Opus. Routing these to a smaller model produces lower-quality suggestions that developers push back on.
Long-context reasoning. Tasks that require tracking dozens of facts across a 50,000-token context window — legal document analysis, large codebase comprehension, multi-document synthesis — benefit materially from frontier models.
Creative and brand-sensitive content. When output quality is a brand differentiator and there's no ground truth to test against, defaulting to the best model is the right call. This is a small percentage of most traffic but important to protect.
Multi-step reasoning chains. Agentic tasks where the model must plan, execute, and self-correct across several steps degrade noticeably at smaller model tiers.
The key point: these high-value tasks are the minority. Protecting them by routing everything else to cheaper models doesn't compromise them — it funds them.
How Routing Works in Practice
Model routing classifies each incoming request before it reaches the LLM. The classifier evaluates:
- Request complexity signals: token count, presence of code, multi-turn depth, explicit instruction complexity
- Task type: extracted from system prompt and user message structure
- Historical quality signals: how similar requests have performed on smaller models in the past
The classification adds single-digit milliseconds of latency — negligible for any real-world application.
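The signal-based classification above can be approximated with simple heuristics. This is a toy sketch, not PromptUnit's actual classifier, and the thresholds are made up for illustration:

```python
def pick_model(messages: list[dict]) -> str:
    """Toy pre-LLM router using cheap heuristics over the request itself.
    Thresholds are illustrative, not tuned values."""
    text = " ".join(m["content"] for m in messages)
    approx_tokens = len(text) // 4          # rough token estimate
    has_code = "```" in text or "def " in text
    deep_conversation = len(messages) > 6   # multi-turn depth signal

    # Long, code-heavy, or deep multi-turn requests stay on the frontier model
    if approx_tokens > 4000 or has_code or deep_conversation:
        return "gpt-4o"
    return "gpt-4o-mini"

short_request = [{"role": "user", "content": "Classify this ticket: refund not received."}]
print(pick_model(short_request))  # gpt-4o-mini
```

A production classifier would also fold in historical quality signals, but even heuristics this crude separate most routable traffic from the calls that need frontier depth.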
When you use an OpenAI-compatible proxy like PromptUnit, the integration is a single base URL change:
```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-...",
    base_url="https://api.promptunit.ai/proxy/openai",
    default_headers={"x-promptunit-key": "YOUR_KEY"},
)

# All existing calls work unchanged
response = client.chat.completions.create(
    model="gpt-4o",  # PromptUnit routes this intelligently
    messages=[{"role": "user", "content": "Summarize this support ticket..."}],
)
```
Your application code, error handling, and response parsing don't change. The routing decision happens transparently.
The 14-Day Observation Window
Before any routing decisions are made, PromptUnit runs in observation mode. Every request is analyzed and classified, but all traffic continues to hit the same models it always has.
After 14 days, you see the full picture:
- Which calls were classified as routable
- Which model they would have been routed to
- The projected cost reduction
- Quality confidence scores for each routing decision
If the analysis shows 60% routability on your traffic, you see that number before anything changes. If it shows 20%, you see that too — and you can decide routing isn't worth activating for your workload.
This is the right way to approach routing: measure first, act second. Read more about how the observation period works in How to Reduce Your OpenAI API Costs by 50–70% Without Changing Your Code.
Key Takeaways
- In real production traffic across diverse applications, 60–65% of LLM API calls can be handled by smaller, cheaper models without measurable quality degradation.
- The most routable categories are summarization, classification, structured extraction, and customer support — tasks with bounded scope and verifiable outputs.
- The tasks that genuinely require frontier models (complex code, long-context reasoning, agentic chains) represent 10–15% of typical production traffic.
- Routing 60% of traffic from GPT-4o to GPT-4o-mini reduces that portion of spend by approximately 94%, translating to 40–65% total bill reduction.
- Effective routing requires measurement before action: observe your traffic patterns before configuring any routing rules.
- Integration overhead is minimal — a single base URL change is sufficient to activate observation mode and, when ready, live routing.
The question isn't whether routing would save money on your traffic. For most teams, it will. The question is how much — and the only way to know is to measure it against your actual calls.