AI infrastructure costs spiral quickly. What starts as a few hundred dollars in API calls becomes tens of thousands monthly as usage scales. For many organizations, operational costs threaten to exceed the business value their AI systems deliver.
The common response is to accept these costs as the price of innovation. But most AI-dependent organizations have 50-70% cost reduction opportunities hiding in plain sight, achievable without degrading performance.
The Hidden Cost Multipliers
Most AI cost optimization efforts focus on obvious targets: switching to cheaper models, reducing API calls, or implementing caching. These help, but they miss the larger structural issues.
Three patterns drive the majority of unnecessary AI spend:
Redundant Processing: Systems that reprocess the same or similar content repeatedly because they lack content-aware deduplication. A document processing pipeline ingesting thousands of contracts daily often reanalyzes near-identical clauses across documents. MinHash LSH deduplication can detect these overlaps before they hit the model, cutting redundant inference by 30-40%.
Over-Specified Requests: Using frontier models for tasks that simpler, cheaper models handle equally well. Consider a customer support system that routes every query through GPT-4 even though 55% of those queries are simple classification tasks a fine-tuned small model handles with equivalent accuracy.
Inefficient Prompt Design: Verbose prompts that inflate token costs without improving output quality. Switching from free-text instructions to structured output formats can reduce token consumption by 10x for extraction and classification tasks, with no degradation in result quality.
These compound as systems scale, creating cost curves that outpace business growth.
The Cost Reduction Framework
A systematic four-phase approach consistently delivers 50-70% cost reduction while maintaining or improving output quality. The phases build on each other — audit first, then attack the highest-leverage areas.
```mermaid
graph TD
    subgraph Phase1["Phase 1: Audit & Classify"]
        A1["Map all API calls"]
        A2["Record model, tokens,<br/>frequency, function"]
        A3["Classify into tiers:<br/>Critical / Important / Routine"]
    end
    subgraph Phase2["Phase 2: Right-Size Models"]
        B1["A/B test Tier 2 & 3<br/>with smaller models"]
        B2["Build evaluation harness<br/>against ground truth"]
        B3["Route by complexity:<br/>tiered model selection"]
    end
    subgraph Phase3["Phase 3: Intelligent Caching"]
        C1["Exact-match cache layer"]
        C2["Semantic similarity layer"]
        C3["Model fallback layer"]
    end
    subgraph Phase4["Phase 4: Prompt & Request Optimization"]
        D1["Compress prompts:<br/>structured formats"]
        D2["Content-aware dedup:<br/>MinHash LSH"]
        D3["Batch processing<br/>windows"]
    end
    Phase1 --> Phase2 --> Phase3 --> Phase4
    style Phase1 fill:#1a1a2e,stroke:#16c79a,color:#fff
    style Phase2 fill:#1a1a2e,stroke:#0f3460,color:#fff
    style Phase3 fill:#1a1a2e,stroke:#e94560,color:#fff
    style Phase4 fill:#1a1a2e,stroke:#ffd700,color:#fff
```

Phase 1: Audit and Classify
Map every AI API call in your system. For each one, record: the model used, average input/output tokens, frequency, and the business function it serves. Most teams discover that 20% of their call patterns account for 80% of their spend.
Classify each call into tiers:
- Tier 1 — Critical: Calls where quality directly impacts revenue or user experience. These justify frontier models.
- Tier 2 — Important: Calls that need good quality but tolerate minor degradation. Candidates for model downgrades.
- Tier 3 — Routine: Classification, extraction, formatting, and other structured tasks. Should use the cheapest capable model.
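The audit step above can be sketched in a few lines. The record fields, model names, tier numbers, and per-token prices here are illustrative placeholders, not real pricing:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class CallRecord:
    """One logged API call: model, token counts, tier, and business function."""
    model: str
    input_tokens: int
    output_tokens: int
    function: str          # business function the call serves
    tier: int              # 1 = Critical, 2 = Important, 3 = Routine
    cost_per_1k_in: float  # illustrative per-1k-token prices
    cost_per_1k_out: float

    @property
    def cost(self) -> float:
        return (self.input_tokens / 1000) * self.cost_per_1k_in \
             + (self.output_tokens / 1000) * self.cost_per_1k_out

def spend_by_function(records):
    """Aggregate spend per business function to surface the top cost centers."""
    totals = defaultdict(float)
    for r in records:
        totals[r.function] += r.cost
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Toy sample: two extraction calls and one support reply, all on a frontier model
records = [
    CallRecord("frontier", 1200, 400, "contract_extraction", 3, 0.01, 0.03),
    CallRecord("frontier", 900, 300, "support_reply", 1, 0.01, 0.03),
    CallRecord("frontier", 1100, 350, "contract_extraction", 3, 0.01, 0.03),
]
top = spend_by_function(records)
```

Sorting spend by function is what makes the 20%/80% concentration visible: the first few entries of `top` are the cost centers worth attacking first.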
The audit alone often reveals immediate wins. One organization discovered that 40% of their frontier model spend was on structured data extraction — a task their fine-tuned model handled identically at 1/20th the cost.
Phase 2: Right-Size Models
Run A/B tests on Tier 2 and Tier 3 calls with smaller models. Industry data from Stanford HAI's AI Index suggests that 40-60% of calls currently using frontier models produce equivalent results with models that cost 10-20x less.
The key is measuring output quality objectively, not relying on intuition. Build evaluation harnesses that score outputs against ground truth or human ratings. You'll often find that the expensive model is marginally better in ways that don't matter for the use case.
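A minimal evaluation harness can be sketched as follows. Exact-match scoring only suits tasks with a single right answer (classification, extraction); generative tasks would need human ratings or a graded rubric instead:

```python
def exact_match_score(predictions, ground_truth):
    """Fraction of outputs that exactly match the reference answer."""
    assert len(predictions) == len(ground_truth)
    hits = sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(predictions, ground_truth))
    return hits / len(predictions)

def compare_models(outputs_by_model, ground_truth, tolerance=0.02):
    """Return the models whose score is within `tolerance` of the best model.

    Any model that survives this filter is a downgrade candidate: it is
    statistically indistinguishable from the best model on this task."""
    scores = {m: exact_match_score(o, ground_truth)
              for m, o in outputs_by_model.items()}
    best = max(scores.values())
    return {m: s for m, s in scores.items() if best - s <= tolerance}
```

A 2% tolerance mirrors the "less than 2% quality degradation" bar cited below; tune it to what actually matters for the use case.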
Tiered routing is the implementation pattern that operationalizes this. A lightweight classifier examines each incoming request and routes it to the appropriate model tier:
- Simple classification and extraction tasks go to small, fast models
- Standard generation goes to mid-tier models
- Only complex reasoning and nuanced generation routes to frontier models
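The routing structure above can be sketched with cheap lexical heuristics. A production router would use a trained classifier; the patterns and model names here are hypothetical placeholders:

```python
import re

# Hypothetical model tiers; names are placeholders, not real endpoints.
TIERS = {
    "small": "small-fast-model",
    "mid": "mid-tier-model",
    "frontier": "frontier-model",
}

CLASSIFY_PATTERNS = re.compile(
    r"\b(classify|categorize|extract|label|which category|tag)\b", re.I)
REASONING_PATTERNS = re.compile(
    r"\b(why|explain|compare|analyze|trade-?off|strategy)\b", re.I)

def route(request: str) -> str:
    """Route a request to a model tier: cheap tasks to small models,
    reasoning to frontier models, everything else to the mid tier."""
    if CLASSIFY_PATTERNS.search(request):
        return TIERS["small"]
    if REASONING_PATTERNS.search(request):
        return TIERS["frontier"]
    return TIERS["mid"]
```

The design choice worth noting: the router itself must be far cheaper than the savings it unlocks, which is why regex or a small classifier is used rather than an LLM call.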
Organizations implementing tiered routing consistently report 55% cost reduction with less than 2% quality degradation on objective evaluations.
For high-volume, narrow tasks, fine-tuning a small model on your specific domain data often matches frontier model quality at a fraction of the cost. A fine-tuned model for contract clause extraction can match GPT-4 accuracy while running 15x cheaper and 5x faster.
Phase 3: Intelligent Caching
Implement semantic caching — not just exact-match caching. Many queries are functionally equivalent even when phrased differently. A similarity-based cache with a well-tuned threshold can eliminate 30-50% of redundant calls.
Layer your caching strategy for maximum coverage:
- Exact match (fastest, cheapest): Identical inputs get cached responses instantly. Handles repeated queries with zero compute.
- Semantic similarity: Embed incoming queries and compare against cached query embeddings. A cosine similarity threshold of 0.95+ catches paraphrased equivalents. Organizations report 40% cache hit rates with semantic caching alone.
- Model fallback: Requests that miss both cache layers go to the model. Responses are cached for future similar queries.
The cache layers are additive. Exact match catches 10-15% of requests. Semantic similarity catches another 25-40%. Together, they eliminate up to half of all model invocations.
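The two cache layers can be sketched as below. The bag-of-words "embedding" is a stand-in so the example stays self-contained; a real system would call an embedding model and use a vector index:

```python
import hashlib
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Placeholder embedding: bag-of-words counts. A real system would
    call an embedding model here."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class LayeredCache:
    """Exact-match layer in front of a semantic-similarity layer."""
    def __init__(self, threshold=0.95):
        self.exact = {}       # hash(query) -> response
        self.semantic = []    # (embedding, response) pairs
        self.threshold = threshold

    def get(self, query: str):
        key = hashlib.sha256(query.encode()).hexdigest()
        if key in self.exact:                    # layer 1: exact match
            return self.exact[key]
        q = toy_embed(query)
        for emb, resp in self.semantic:          # layer 2: semantic match
            if cosine(q, emb) >= self.threshold:
                return resp
        return None                              # miss -> call the model

    def put(self, query: str, response: str):
        key = hashlib.sha256(query.encode()).hexdigest()
        self.exact[key] = response
        self.semantic.append((toy_embed(query), response))
```

The linear scan over cached embeddings is fine for a sketch; at scale the semantic layer is backed by an approximate-nearest-neighbor index so lookups stay sublinear.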
Phase 4: Prompt and Request Optimization
Most prompts are 2-5x longer than necessary. Systematic prompt compression — removing redundant instructions, using structured formats, and eliminating examples that don't improve output — reduces token costs significantly.
Structured output formats are the single highest-leverage change. Replacing free-text prompts with JSON schema constraints or structured extraction templates reduces both input and output tokens dramatically. A prompt that says "extract the following fields from this document and return them in JSON format" with a schema definition produces results identical to a 500-word instruction set while consuming roughly a tenth of the tokens.
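The contrast can be made concrete. The token counter below is a crude chars/4 estimate (real counts require the model's tokenizer), and the schema fields are hypothetical:

```python
import json

def rough_token_count(text: str) -> int:
    """Crude token estimate (~4 chars per token); real counts need the
    model's tokenizer."""
    return max(1, len(text) // 4)

# Verbose free-text instructions (heavily abbreviated for the sketch)
verbose_prompt = (
    "Please carefully read the following document. I would like you to "
    "identify the party names, the effective date, and the governing law. "
    "For each field, explain where you found it, then present all fields "
    "together at the end, formatted as JSON with appropriate keys. "
    # ... imagine several hundred more words of instructions here ...
)

# Compact schema-constrained equivalent
schema = {"parties": "list[str]", "effective_date": "str", "governing_law": "str"}
compact_prompt = "Extract per schema, return JSON only:\n" + json.dumps(schema)
```

The schema version also shrinks output tokens, because the model returns bare JSON instead of narrating where it found each field.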
Content-aware deduplication prevents redundant processing before requests reach the model. For document processing pipelines, MinHash Locality-Sensitive Hashing identifies near-duplicate content across documents. Two contracts that share 80% of their language only need the unique 20% analyzed, not the full text of both.
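The core of MinHash deduplication fits in a short sketch. This implements shingling and MinHash signatures from scratch so it stays dependency-free; a production pipeline would add LSH banding (or a library such as datasketch) so near-duplicates are found without pairwise comparison:

```python
import hashlib

def shingles(text: str, k: int = 5) -> set:
    """Character k-shingles of the whitespace-normalized text."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(1, len(t) - k + 1))}

def minhash_signature(items: set, num_perm: int = 64) -> list:
    """MinHash signature: the minimum hash value under each seeded hash."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in items))
    return sig

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two clauses that differ by a single term score high, unrelated text scores near zero; the pipeline analyzes only content that falls below the similarity threshold.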
Batch processing windows aggregate similar requests and process them together during off-peak hours. This enables bulk API pricing, reduces per-request overhead, and smooths infrastructure utilization. Non-time-sensitive tasks like nightly report generation, weekly summaries, and batch classification are natural candidates.
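The batching pattern can be sketched as an aggregator that groups requests by task type. The `BatchWindow` class and its limits are illustrative, not a real bulk API client; `flush` would hand the groups to a batch endpoint during the off-peak window:

```python
from collections import defaultdict

class BatchWindow:
    """Collect non-urgent requests and release them together in batches."""
    def __init__(self, max_size: int = 100):
        self.max_size = max_size
        self.pending = defaultdict(list)   # task type -> queued requests

    def submit(self, task_type: str, request: dict):
        self.pending[task_type].append(request)

    def flush(self):
        """Emit up to max_size requests per task type; keep any overflow
        queued for the next window."""
        batches = {}
        for t in list(self.pending):
            reqs = self.pending[t]
            batches[t] = reqs[:self.max_size]
            rest = reqs[self.max_size:]
            if rest:
                self.pending[t] = rest
            else:
                del self.pending[t]
        return batches
```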
Shorter prompts often produce better results. Verbose instructions can confuse models and introduce conflicting guidance. Concise, well-structured prompts are both cheaper and more effective.
Implementation Roadmap
Rolling out the full framework takes four weeks when executed methodically:
- Week 1: Complete the audit. Instrument all API calls with logging for model, tokens, latency, and cost. Classify every call into tiers. Identify the top 5 cost centers.
- Week 2: Implement tiered routing for Tier 3 calls. Stand up evaluation harnesses. Begin A/B testing smaller models on high-volume, low-complexity tasks.
- Week 3: Deploy semantic caching infrastructure. Tune similarity thresholds against your specific query patterns. Implement content-aware dedup for document pipelines.
- Week 4: Compress prompts for top cost centers. Switch to structured output formats. Set up batch processing windows for non-real-time workloads. Establish ongoing cost monitoring dashboards.
Each week delivers measurable savings. Organizations following this sequence typically see 20-30% cost reduction after week 2, growing to 50-70% by the end of week 4.
Measuring Success
Track four metrics to validate the framework works without degrading quality:
- Cost per successful output: Total API spend divided by outputs that pass quality thresholds. This is the primary metric — it captures both cost reduction and quality maintenance.
- Quality score distribution: Monitor the full distribution of quality scores, not just the average. A shift in the tail (more low-quality outputs) signals that model right-sizing went too far.
- Cache hit rate: The percentage of requests served from cache. Target 35-50% after tuning. Below 20% means caching thresholds are too strict or query diversity is genuinely high.
- Latency improvement: Smaller models and cache hits are faster. Track P50 and P99 latency — cost optimization should improve these as a side effect.
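The primary metric above reduces to a one-line computation; the quality scores and threshold here are illustrative:

```python
def cost_per_successful_output(total_spend: float, quality_scores: list,
                               quality_threshold: float = 0.8) -> float:
    """Total API spend divided by the count of outputs that pass the
    quality threshold. Cutting cost while failing more outputs makes
    this metric worse, which is exactly the point."""
    successes = sum(1 for s in quality_scores if s >= quality_threshold)
    if successes == 0:
        raise ValueError("no outputs passed the quality threshold")
    return total_spend / successes
```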
Expected Results
Applying this framework consistently produces:
- 50-70% reduction in monthly AI API spend
- 15-30% improvement in response latency (smaller models are faster)
- Equal or better output quality on objective evaluations
- Clearer cost attribution and forecasting
The biggest barrier is organizational, not technical. Teams default to the most powerful model because it feels safer. Building confidence in right-sized models requires measurement infrastructure and a culture that values efficiency alongside capability.
Operating Solution
Apply cost optimization as a system program: right-size model routing, introduce intelligent caching, and reduce redundant inference before negotiating unit pricing.
Next Steps
- This week: Assign an owner to instrument all AI API calls with cost, token, and quality logging. Identify the top 3 cost centers by spend.
- This month: Complete the audit-and-classify phase and run A/B tests on your highest-volume Tier 3 calls with smaller models. Measure quality objectively.
- Next 90 days: Track cost-per-successful-output as the primary efficiency metric. Target 50%+ reduction from baseline while maintaining quality score distribution.
When This Approach Does Not Apply
Cost optimization without quality guardrails damages user trust and can reverse the business gains that justified the AI investment. When organizations lack the measurement infrastructure to objectively evaluate output quality — no ground truth datasets, no human evaluation pipeline, no automated quality scoring — cost cuts are made blind. Teams switch to cheaper models, compress prompts, and increase cache aggressiveness based on intuition rather than evidence. The cost savings appear immediately in the API bill, but quality degradation shows up weeks later in user complaints, reduced engagement, or downstream errors.
The warning signs are specific: stakeholders report that "the AI seems worse lately" without the team being able to confirm or deny it with data, customer support tickets increase for AI-generated outputs, or downstream processes that depend on AI outputs start producing more errors. By the time these signals are strong enough to notice, the damage to user confidence is already done — and rebuilding trust takes longer than the cost savings were worth.
Before implementing cost reduction measures, establish quality baselines and automated monitoring for every API call tier. This means building evaluation harnesses that score outputs against ground truth, setting minimum quality thresholds per tier, and configuring alerts that fire when quality scores degrade. Only then can you right-size models and adjust caching with confidence that you're reducing cost without reducing value. If building this measurement infrastructure takes longer than the cost optimization itself, that's the correct sequencing — the measurement capability has lasting value beyond this single optimization effort.