LLM inference in production: latency, caching, and the hidden costs

Almost every team building on LLMs has the same month-three realization: the demo was a lie. Not a malicious one - but the demo talked to a top-tier model, over a fast connection, with exactly one user (you), asking one question at a time. Production has hundreds of users, a latency budget, a finance team, and a long tail of weird inputs. The model didn’t get worse; the conditions got honest. This post is the field guide we hand teams making that transition - what actually moves the needle once real users show up. Whether you’re an engineer tuning a server or a PM trying to understand why “it’s slow” is complicated, there’s something here for you.

Think of it as a restaurant kitchen, not a vending machine

The mental model that unlocks everything: an LLM feature is a kitchen, not a vending machine. A vending machine gives you the whole product instantly or not at all. A kitchen plates dishes progressively, batches similar orders, keeps common prep ready, and lives or dies on how it manages a queue under a dinner rush. Almost every technique below is a kitchen technique. Once you stop expecting a vending machine, the engineering gets obvious.

The latency budget

Here’s the uncomfortable physics: a user starts to feel a response is “slow” at around 200–300 milliseconds, and a language model generating a full answer almost never fits inside that window. So every production LLM integration is, at its core, a latency management problem wearing an AI costume. You are not making the model faster - you’re managing the user’s perception of time. Three levers do most of the work.

Lever 1: Stream the tokens (do this first, today)

Don’t wait for the whole dish to plate before serving. Flush tokens to the UI as they generate. Counterintuitively, a response that starts in 100ms and finishes in 2 seconds feels faster than one that delivers everything at once at 1.2 seconds - because the user sees motion immediately and starts reading. Streaming costs almost nothing to implement and buys an enormous amount of perceived performance. It is the single highest-leverage thing on this entire page, and it’s the first thing to ship.

# The whole win, in a few lines: yield tokens as they arrive.
stream = client.chat.completions.create(model="...", messages=msgs, stream=True)
for chunk in stream:
    if token := chunk.choices[0].delta.content:
        yield token   # user sees the answer growing in real time

The measurement that matters here is TTFT - time to first token - not total generation time. Optimize for “how fast does something appear,” because that’s the number the user’s gut is actually grading.

Lever 2: Cache the prompt prefix (mise en place)

A good kitchen preps the common ingredients before service. The model equivalent is prefix caching: the major API providers can cache the internal key-value state for a fixed prompt prefix and reuse it across requests. If your system prompt is 3,000 tokens of instructions that never change, the model shouldn’t re-read them from scratch every single call - and with caching, it doesn’t. In our deployments this shaved 40–60% off TTFT once the cache warmed.

There’s one rule that bites everyone: the cached prefix must be byte-identical. Inject anything dynamic - a timestamp, a user name, a session ID - into the front of the prompt and you’ve invalidated the cache on every request, getting zero benefit and quietly wondering why caching “doesn’t work.” Keep the system prompt frozen and put all the variable context in the user turn:

[ system: 3,000 tokens, byte-for-byte identical every call ]   ← cached
[ user:   "Given the document below, ...{dynamic content}" ]   ← not cached, and that's fine

Lever 3: Batch the background work (cook similar orders together)

Not everything is interactive. Document enrichment, classification, nightly summarization - these don’t have a human watching a spinner, so stop sending them one at a time. Batching lets the inference server parallelize work and slash your wall-clock time. On one nightly job enriching 10,000 documents, moving from serial to batched requests cut runtime from 4 hours to 38 minutes - same token cost, a fraction of the clock. The dinner-rush queue moves faster when the kitchen cooks the ten identical orders as one ticket.

If you run your own server: the KV cache is the dial

If you’ve moved to self-hosted inference (vLLM, TGI, or llama.cpp), one knob dominates the rest: the KV cache. It’s the model’s short-term memory of the conversation so far, and it lives in GPU VRAM. Tuning it is the whole game.

Cache size versus throughput. The KV cache trades VRAM for concurrency. A bigger cache lets more requests run at once, but leaves less room for the model weights. For a 7B model on a 24GB card, we typically hand 20–25% of VRAM to the cache and then watch the meter - vLLM exposes gpu_cache_usage on its metrics endpoint, and that number tells you whether you’ve starved the cache or the weights.

Prefix sharing. vLLM’s --enable-prefix-caching is the self-hosted twin of the API’s prompt caching - turn it on unconditionally if multiple users share a system prompt. In one multi-user deployment it cut cache misses by 71%, which is a free latency win for the cost of a flag.

Chunked prefill. Without it, one user pasting a 20-page document can stall every shorter request behind it while the GPU chews through the long input - the kitchen equivalent of one giant catering order freezing the whole line. Chunked prefill processes long inputs in segments so short requests keep flowing. Enable it the moment you support long-context inputs.

Self-host or call the API? The honest answer

Usually: don’t self-host. We say this as people who enjoy self-hosting.

The API providers have poured staggering engineering into their inference infrastructure, and replicating even a fraction of it at small or medium scale is a losing trade. The rough breakeven for self-hosting a 70B-class model - once you count GPU hardware amortization, power, ops, and the engineer-hours to keep it healthy - typically sits north of $50–80K/month in equivalent API spend. Below that line, managed inference almost always wins on total cost, and definitely on your team’s attention.

Self-hosting genuinely makes sense in three cases:

Data residency you can’t satisfy otherwise. A contract or regulation pins the data to a jurisdiction no hosted provider covers. The model has to come to the data.

A fine-tuned model you can’t expose. You’ve fine-tuned on proprietary data and can’t ship that model to a third party. Self-hosting is the only door.

Extreme, predictable volume. Millions of short, uniform requests - classification, embeddings, tagging - that can saturate a dedicated GPU around the clock. At full, steady utilization the per-token math finally tilts your way.

For everything else, pick the provider that fits your latency and quality needs and pour your energy into the integration, not the infrastructure.

The hidden cost nobody puts on the slide

Here’s the line that should be on every LLM project’s first planning doc: tokens are cheap; engineering time is not. The sticker price everyone fixates on - cost per million tokens - is rarely where the budget actually goes. The real spend is three things teams discover too late:

The prompt iteration treadmill - prompts are never “done,” and every tweak risks regressing something that used to work. The observability you have to build - you can’t debug what you can’t see, and LLM outputs are non-deterministic, so you need logging and tracing tuned for “why did it say that.” And the evaluation harness - the thing that tells you a model upgrade or prompt change didn’t quietly break three features.

Budget for all three before you write the first line of inference code, or they’ll bill you later, with interest.

Evals are not optional (and not fancy)

If your product depends on output quality, you need evals in your CI pipeline - not as a someday-project, as a deployment gate. The good news: effective evals are unglamorous. We use a golden-set approach - 200 to 400 hand-curated examples with expected outputs - run automatically on every model or prompt change, reporting the score delta as part of the deploy:

$ npm run eval -- --against=golden-set
  ✔ 372/400 passed   (prev: 369/400)   Δ +3
  ✗ 28 regressions in "tone" category  ← block the deploy, investigate

That’s it. No exotic framework, no LLM-judging-LLM hall of mirrors required to start. This boring little harness has caught three significant regressions for us in the last year - each one a bad model upgrade or a “harmless” prompt tweak that would otherwise have shipped straight to users. The teams that skip this don’t avoid the regressions; they just find them in production, from angry customers, instead of in CI, from a red checkmark.

The short version

Stream first - it’s the cheapest, biggest perceived-speed win you’ll ever ship. Cache your prompt prefix and keep it byte-identical. Batch anything a human isn’t watching. If you self-host, the KV cache is the dial that matters; if you’re under the five-figure-monthly mark, you probably shouldn’t self-host at all. And budget for the real costs - iteration, observability, evals - because the tokens were never the expensive part.

We’ve shipped LLM features into production across several domains, and these are the lessons that survived contact with real users. If you’re building something in this space and want a second opinion on the architecture before month three teaches it to you the hard way, reach out.