
Serving & Inference

Inference is running a trained model to produce output: the serving side, as opposed to training. Serving LLMs efficiently is its own discipline, because the work in an LLM request is split unevenly between reading the prompt and generating the response.

A single LLM request runs in two distinct phases with very different performance profiles:

PREFILL · processes the whole prompt at once · all tokens computed in parallel · compute-bound, fast → sets "time to first token"
DECODE · generates one token at a time · each token = a full forward pass · memory-bandwidth-bound, slower → sets "time per output token"
  • Prefill reads the entire prompt at once. It is highly parallel, so it’s fast per token even for long prompts, although time to first token still grows with prompt length.
  • Decode is inherently sequential: token N+1 can’t start until token N exists. This is why a long response takes longer than a long prompt, and why output tokens are usually priced higher than input tokens.
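The two phases can be put into a rough latency estimate. The sketch below is a back-of-the-envelope model, not a benchmark: the function name and both rates (`prefill_tok_per_s`, `decode_tok_per_s`) are made-up assumptions chosen only to show the shape of the result.

```python
def estimate_latency(prompt_tokens, output_tokens,
                     prefill_tok_per_s=5000.0, decode_tok_per_s=40.0):
    """Return (time_to_first_token, total_latency) in seconds.

    Both rates are illustrative assumptions, not measurements.
    """
    if output_tokens < 1:
        raise ValueError("need at least one output token")
    # Prefill handles the whole prompt in one parallel pass and yields the first token.
    ttft = prompt_tokens / prefill_tok_per_s
    # Every further token needs its own sequential decode step.
    total = ttft + (output_tokens - 1) / decode_tok_per_s
    return ttft, total


long_prompt = estimate_latency(prompt_tokens=2000, output_tokens=50)
long_response = estimate_latency(prompt_tokens=50, output_tokens=2000)
print("long prompt, short response:", long_prompt)
print("short prompt, long response:", long_response)
```

Under these assumed rates the long response dominates total latency by a wide margin, even though both requests touch the same number of tokens overall.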

During decoding, each new token’s attention needs the keys and values of every previous token. Recomputing those each step would be hugely wasteful, so they are stored — the KV cache.
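A toy version makes the saving concrete. This is a minimal sketch with a single attention head and hand-picked stand-in projections (`ToyLayer` and its arithmetic are hypothetical, and the token vector doubles as the query); it only shows that caching gives the same outputs with far fewer key/value computations.

```python
import math


class ToyLayer:
    """Hypothetical one-head attention layer with fixed toy projections."""

    def __init__(self):
        self.kv_projections = 0  # how many times K/V were computed

    def project_kv(self, x):
        self.kv_projections += 1
        key = [0.5 * v for v in x]    # stand-in for a learned key projection
        value = [v + 1.0 for v in x]  # stand-in for a learned value projection
        return key, value

    def attend(self, query, keys, values):
        scale = 1.0 / math.sqrt(len(query))
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        return [sum(w * val[i] for w, val in zip(weights, values)) / norm
                for i in range(len(values[0]))]


def decode_without_cache(layer, tokens):
    outputs = []
    for step in range(1, len(tokens) + 1):
        # Recompute K and V for every token seen so far, at every step.
        pairs = [layer.project_kv(x) for x in tokens[:step]]
        keys = [k for k, _ in pairs]
        values = [v for _, v in pairs]
        outputs.append(layer.attend(tokens[step - 1], keys, values))
    return outputs


def decode_with_cache(layer, tokens):
    keys, values, outputs = [], [], []  # keys/values together are the KV cache
    for x in tokens:
        k, v = layer.project_kv(x)  # only the newest token is projected
        keys.append(k)
        values.append(v)
        outputs.append(layer.attend(x, keys, values))
    return outputs


tokens = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [0.4, 0.4]]
slow, fast = ToyLayer(), ToyLayer()
out_slow = decode_without_cache(slow, tokens)
out_fast = decode_with_cache(fast, tokens)
print(slow.kv_projections, fast.kv_projections)
```

Without the cache the projection count grows quadratically with sequence length; with it, the count grows linearly, at the price of keeping every key and value in memory.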

The KV cache is fast, but it’s memory, and it grows with both sequence length and the number of concurrent requests. On a serving GPU, the KV cache — not the model weights — is often what limits how many users you can handle at once. Long contexts are expensive partly because their KV cache is large.
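The size is easy to estimate: each token stores one key and one value vector per layer. The sketch below assumes standard multi-head attention and 16-bit values; the model shape (32 layers, 32 KV heads, head dimension 128) is an illustrative assumption, not a specific model's published configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_value=2):
    """Approximate KV-cache size in bytes (2 bytes per value = fp16/bf16)."""
    # The leading 2 covers one key tensor plus one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim * bytes_per_value
            * seq_len * batch_size)


GIB = 1024 ** 3
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)  # assumed shape

per_token = kv_cache_bytes(seq_len=1, batch_size=1, **shape)
one_request = kv_cache_bytes(seq_len=4096, batch_size=1, **shape)
sixteen_requests = kv_cache_bytes(seq_len=4096, batch_size=16, **shape)

print(per_token, "bytes per token")
print(one_request / GIB, "GiB for one 4096-token request")
print(sixteen_requests / GIB, "GiB for 16 such requests")
```

With this assumed shape, sixteen concurrent 4096-token requests need more cache memory than many GPUs have in total, which is why concurrency is so often bounded by the cache rather than the weights.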

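A GPU decoding one request at a time is mostly idle: each step has to read the model weights from memory just to produce a single token. Batching runs many requests together, so one pass over the weights serves every request in the batch and the hardware is used far better.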

Naive (static) batching makes everyone wait for the slowest request in the batch. Modern inference servers use continuous batching (a.k.a. in-flight batching): finished requests leave the batch and new ones join mid-flight, so the GPU stays saturated. This is one of the biggest real-world throughput wins.
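The difference shows up even in a tiny simulation. The sketch below counts decode steps for the same queue of requests under both policies. It is a simplified model under stated assumptions: prefill is ignored, every step costs the same regardless of how full the batch is, and the function names and request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        batch = output_lengths[start:start + batch_size]
        steps += max(batch)
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Finished requests free their slot; waiting requests join immediately."""
    waiting = deque(output_lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1  # one decode step yields one token per active request
        active = [left - 1 for left in active if left - 1 > 0]
    return steps


# Two long requests mixed in with short ones (output lengths in tokens).
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

In this example the static scheduler spends most of its steps running a batch in which three of four slots are already finished, while the continuous scheduler refills those slots as soon as they free up.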

You don’t write inference loops by hand — you use a serving engine that bundles these optimizations:

Engine              Notes
vLLM                Popular open-source server; continuous batching, efficient KV-cache paging
TGI                 Hugging Face’s text-generation server
TensorRT-LLM        NVIDIA’s highly optimized stack
Ollama / llama.cpp  Local and laptop-scale serving, CPU/GPU

They also apply optimizations like PagedAttention (manage the KV cache like OS virtual memory, cutting waste), quantization (smaller weights — see GPUs & Hardware), and speculative decoding (a small model drafts tokens, the big model verifies several at once).
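To see why paging cuts waste, compare reserving a full maximum-length region per request with handing out small fixed-size blocks on demand. This is only an accounting sketch of the idea, not how any particular engine implements it; the block size of 16 tokens, the 2048-token maximum, and the request lengths are assumptions for illustration.

```python
def contiguous_slots(seq_lens, max_len):
    """Token slots reserved when every request gets a max_len region up front."""
    return len(seq_lens) * max_len


def paged_slots(seq_lens, block_size):
    """Token slots reserved when requests take fixed-size blocks as they grow."""
    blocks = sum(-(-n // block_size) for n in seq_lens)  # ceiling division
    return blocks * block_size


seq_lens = [30, 700, 120, 2048]  # current lengths of four in-flight requests
used = sum(seq_lens)

contiguous_waste = contiguous_slots(seq_lens, max_len=2048) - used
paged_waste = paged_slots(seq_lens, block_size=16) - used
print(contiguous_waste, paged_waste)
```

With paging, the only unused slots are the tail of each request's last block, so the waste is bounded by one block per request instead of by the maximum context length.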

The central serving trade-off:

  • Latency — how fast one request completes. What an interactive user feels.
  • Throughput — how many requests per second across all users. What sets your cost-per-request.

Bigger batches raise throughput (cheaper per request) but raise latency (each request waits longer). Tune to the workload:

Workload             Optimize for  Approach
Interactive chat     Latency       Smaller batches, stream tokens
Bulk / offline jobs  Throughput    Large batches, latency irrelevant
Mixed traffic        Balanced      Continuous batching, maybe separate pools
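A toy cost model shows both effects at once. It assumes each decode step has a fixed cost (reading the weights once) plus a small cost per request in the batch; the 20 ms and 1 ms figures, the 200-token response, and the function names are invented for illustration and will differ on real hardware.

```python
def decode_step_seconds(batch_size, fixed_s=0.020, per_request_s=0.001):
    """Assumed cost of one decode step: shared fixed cost plus per-request cost."""
    return fixed_s + per_request_s * batch_size


def tokens_per_second(batch_size):
    """Aggregate throughput: every request in the batch gets one token per step."""
    return batch_size / decode_step_seconds(batch_size)


def request_latency_seconds(batch_size, output_tokens=200):
    """Decode time one user waits for a full response at this batch size."""
    return output_tokens * decode_step_seconds(batch_size)


for batch_size in (1, 8, 32):
    print(batch_size,
          round(tokens_per_second(batch_size)),
          round(request_latency_seconds(batch_size), 1))
```

Under these assumptions, going from a batch of 1 to a batch of 32 multiplies throughput many times over while each individual response takes a few times longer, which is the trade the table above asks you to make per workload.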

LLM inference splits into a fast, parallel prefill phase and a slow, sequential decode phase — which is why responses stream and output costs more. The KV cache speeds decoding but consumes memory that limits concurrency. Continuous batching keeps the GPU saturated and is a major throughput win. Use a serving engine like vLLM rather than rolling your own. Throughput and latency trade off — tune batching to whether humans are waiting.