Serving & Inference

Inference is running a trained model to produce output: the serving side, as opposed to training. Serving LLMs efficiently is its own discipline, because an LLM request has an unusual cost profile: one parallel pass over the prompt, followed by a long, strictly sequential generation loop.

Two phases: prefill and decode

A single LLM request runs in two distinct phases with very different performance profiles:

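Prefill processes the entire prompt in one parallel pass, so its cost per token is low even for long prompts. It is typically compute-bound and determines the time to first token.
Decode is inherently sequential: token N+1 can’t start until token N exists. Each step produces a single token and is typically limited by memory bandwidth rather than compute. This is why a long response takes longer than an equally long prompt, and why providers usually charge more per output token than per input token.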
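
The difference between the two phases can be made concrete with a toy latency model. This is a minimal sketch, not a benchmark: the function name and both rates (5000 prompt tokens per second for prefill, 50 tokens per second for decode) are assumed values chosen only for illustration.

```python
def request_latency_s(prompt_tokens, output_tokens,
                      prefill_tokens_per_s=5000.0,   # assumed rate, illustration only
                      decode_tokens_per_s=50.0):     # assumed rate, illustration only
    """Toy model: one parallel prefill pass, then one decode step per output token."""
    prefill_s = prompt_tokens / prefill_tokens_per_s   # roughly the time to first token
    decode_s = output_tokens / decode_tokens_per_s     # sequential generation time
    return prefill_s, decode_s, prefill_s + decode_s


# Same total token count, opposite shapes.
long_prompt = request_latency_s(prompt_tokens=2000, output_tokens=100)
long_answer = request_latency_s(prompt_tokens=100, output_tokens=2000)

print("long prompt, short answer: %.2f s" % long_prompt[2])
print("short prompt, long answer: %.2f s" % long_answer[2])
```

Under these assumed rates the request with the long answer is far slower, even though both requests touch the same number of tokens in total.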

The KV cache

During decoding, each new token’s attention needs the keys and values of every previous token. Recomputing those each step would be hugely wasteful, so they are stored — the KV cache.

The KV cache saves compute, but it costs memory: it grows linearly with both sequence length and the number of concurrent requests. On a serving GPU, the KV cache, not the model weights, is often what limits how many users you can handle at once. Long contexts are expensive partly because their KV cache is large.
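
A back-of-the-envelope calculation shows how quickly this memory adds up. The sketch below assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token. The configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values) is illustrative, roughly the shape of a 7B-parameter model, and the 40 GiB budget is a hypothetical figure.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size=1, bytes_per_value=2):
    """Approximate KV-cache size; the leading 2 counts keys and values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * bytes_per_value * seq_len * batch_size)


def max_concurrent_requests(budget_bytes, bytes_per_request):
    """How many full-length requests fit in the memory left for the cache."""
    return budget_bytes // bytes_per_request


# Illustrative configuration (an assumption, not a specific model's spec).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
per_request = kv_cache_bytes(32, 32, 128, seq_len=4096)

print(per_token / 2**20, "MiB per token")               # 0.5
print(per_request / 2**30, "GiB per 4096-token request")  # 2.0
print(max_concurrent_requests(40 * 2**30, per_request), "requests in 40 GiB")  # 20
```

Models that use grouped-query attention have fewer KV heads than attention heads, which shrinks this figure proportionally.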

Batching

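A GPU decoding one request at a time leaves most of its compute idle, because each step reads all the model weights from memory to produce a single token. Batching runs many requests through the same step, so one read of the weights serves all of them and the hardware is used far better.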

Naive (static) batching makes everyone wait for the slowest request in the batch. Modern inference servers use continuous batching (a.k.a. in-flight batching): finished requests leave the batch and new ones join mid-flight, so the GPU stays saturated. This is one of the biggest real-world throughput wins.
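
The difference between the two schemes can be sketched with a small step-counting simulation. It assumes all requests are queued at the start, that every decode step takes the same time regardless of how many slots are occupied, and that prefill is free. The function names and request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(output_lengths), batch_size):
        total += max(output_lengths[i:i + batch_size])
    return total


def continuous_batching_steps(output_lengths, batch_size):
    """Finished requests free their slot; queued ones join at the next step."""
    queue = deque(output_lengths)
    active = []   # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        # Fill any free slots before running the step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; drop those that just finished.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical workload: two long responses mixed with short ones.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching_steps(lengths, batch_size=4))      # 200
print(continuous_batching_steps(lengths, batch_size=4))  # 105
```

With static batching, each group of four is held for 100 steps by its one long request. With continuous batching, the short requests drain through the free slots while the long ones are still running.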

Inference servers

You don’t write inference loops by hand — you use a serving engine that bundles these optimizations:

Engine	Notes
vLLM	Popular open-source server; continuous batching, efficient KV-cache paging
TGI	Hugging Face’s text-generation server
TensorRT-LLM	NVIDIA’s highly optimized stack
Ollama / llama.cpp	Local and laptop-scale serving, CPU/GPU

They also apply optimizations like PagedAttention (manage the KV cache like OS virtual memory, cutting waste), quantization (smaller weights — see GPUs & Hardware), and speculative decoding (a small model drafts tokens, the big model verifies several at once).
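
Speculative decoding is the least intuitive of these, so here is a minimal sketch of its greedy variant. The two "models" are hypothetical stand-in functions over integer tokens, not real networks, and production engines use a probabilistic acceptance rule rather than the exact-match check shown here. In a real system the verification of all drafted positions happens in a single forward pass of the large model, which this sketch counts as one target pass.

```python
def target_next(seq):
    """Stand-in for the large model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft_next(seq):
    """Stand-in for the small model: agrees with the target except after a 6."""
    return 0 if seq[-1] == 6 else (seq[-1] + 1) % 10


def greedy_decode(next_token, prompt, max_new_tokens):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(target, draft, prompt, max_new_tokens, k=4):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model cheaply proposes k tokens.
        proposal = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target model checks every proposed position (one pass).
        target_passes += 1
        accepted = []
        for proposed in proposal:
            expected = target(tokens + accepted)
            if proposed != expected:
                accepted.append(expected)   # fix the first mismatch, discard the rest
                break
            accepted.append(proposed)
        else:
            # All k drafts matched, so the same pass yields one extra token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens], target_passes


output, passes = speculative_decode(target_next, draft_next, [0], max_new_tokens=12)
print(output, passes)
```

In this toy run the output is identical to plain greedy decoding with the target model, but it takes 3 target passes instead of 12. The gain depends entirely on how often the draft model agrees with the target.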

Throughput vs. latency

The central serving trade-off:

Latency — how fast one request completes. What an interactive user feels.
Throughput — how many requests per second across all users. What sets your cost-per-request.

Bigger batches raise throughput (cheaper per request) but also raise latency, because each decode step takes longer and requests may have to queue for a slot. Tune to the workload:

Workload	Optimize for	Approach
Interactive chat	Latency	Smaller batches, stream tokens
Bulk / offline jobs	Throughput	Large batches, latency irrelevant
Mixed traffic	Balanced	Continuous batching, maybe separate pools
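
The trade-off can be illustrated with a toy cost model for a single decode step. The model and its constants are assumptions, not measurements: a fixed 20 ms per step to read the weights, plus 0.5 ms of extra work per request in the batch. Queueing delay is not modeled.

```python
def decode_step_ms(batch_size, fixed_ms=20.0, per_request_ms=0.5):
    """Hypothetical step time: fixed weight-read cost plus per-request work."""
    return fixed_ms + per_request_ms * batch_size


def tokens_per_second(batch_size):
    """Aggregate throughput: every request in the batch gets one token per step."""
    return batch_size * 1000.0 / decode_step_ms(batch_size)


for batch_size in (1, 8, 64):
    print("batch=%d  step=%.1f ms  throughput=%.0f tok/s"
          % (batch_size, decode_step_ms(batch_size), tokens_per_second(batch_size)))
```

Under these assumptions, going from a batch of 1 to a batch of 64 raises the time each user waits per token from 20.5 ms to 52 ms, while aggregate throughput rises roughly 25 times. Where to sit on that curve depends on whether a person is watching the tokens arrive.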

Key takeaways

LLM inference splits into a fast, parallel prefill phase and a slow, sequential decode phase, which is why responses stream and output costs more.
The KV cache speeds decoding but consumes memory that limits concurrency.
Continuous batching keeps the GPU saturated and is a major throughput win.
Use a serving engine like vLLM rather than rolling your own.
Throughput and latency trade off, so tune batching to whether humans are waiting.