Cost, Latency & Reliability

Every LLM system design is a negotiation between three forces: cost, latency, and reliability/quality. You cannot maximize all three — pushing one usually costs another. Good design makes the trade-off deliberate.

LLM cost is tokens × price-per-token, billed separately for input and output. Estimate it explicitly before launch:

Per request:
  input  = system prompt + history + retrieved context + user message
  output = the response
  cost   = input_tokens × input_price + output_tokens × output_price

Per month:
  monthly_cost = cost_per_request × requests_per_month
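As a rough illustration, the estimate above can be written as a few lines of Python. The prices, token counts, and the `Pricing` helper are placeholders invented for this sketch, not any provider's real figures; substitute your own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in USD per one million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def cost_per_request(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """cost = input_tokens * input_price + output_tokens * output_price"""
    total = (input_tokens * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


def monthly_cost(per_request: float, requests_per_month: int) -> float:
    return per_request * requests_per_month


def input_tokens_for(chunks: int) -> int:
    # input = system prompt + history + retrieved context + user message
    # (all sizes here are assumed for the example)
    return 500 + 1_000 + chunks * 400 + 100


# Placeholder prices, not real provider pricing.
pricing = Pricing(input_per_mtok=3.0, output_per_mtok=15.0)

lean = cost_per_request(input_tokens_for(chunks=4), output_tokens=300, pricing=pricing)
heavy = cost_per_request(input_tokens_for(chunks=20), output_tokens=300, pricing=pricing)

print(f"4 chunks:  ${lean:.4f} per request, ${monthly_cost(lean, 1_000_000):,.2f} per month")
print(f"20 chunks: ${heavy:.4f} per request, ${monthly_cost(heavy, 1_000_000):,.2f} per month")
```

With these assumed numbers, retrieving 20 chunks instead of 4 more than doubles the cost of every request, even though the output is the same length.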

Two facts dominate the math:

  • Output tokens typically cost several times more than input tokens (the exact ratio varies by provider and model). Concise outputs save real money.
  • RAG context is often the biggest input cost. Retrieving 20 chunks when 4 would do can multiply your bill.

Caching is the first way to cut that bill. There are three kinds:

  • Exact-match cache — identical request seen before? Return the stored response. Free and instant for repeated queries (FAQs, retries).
  • Semantic cache — embed the query; if a similar past query exists, reuse its answer. Higher hit rate, but tune the similarity threshold or you’ll serve subtly wrong answers.
  • Prompt caching — providers cache a long, stable prefix (system prompt, big context) so you’re billed less for resending it. Order prompts stable-part-first to exploit this.
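The first two cache types can be sketched in one small class. This is a minimal illustration under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, `TwoLevelCache` and `cached_call` are hypothetical names, the cache is assumed to serve a single model, and the 0.8 threshold is arbitrary.

```python
import hashlib
import math
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class TwoLevelCache:
    """Exact-match lookup first, then a similarity search over past queries."""

    def __init__(self, threshold: float = 0.9):
        self.exact: Dict[str, str] = {}
        self.semantic: List[Tuple[Counter, str]] = []
        self.threshold = threshold  # too low and you serve subtly wrong answers

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> Optional[str]:
        hit = self.exact.get(self._key(prompt))
        if hit is not None:
            return hit  # identical request seen before
        query = toy_embed(prompt)
        best_score, best_answer = 0.0, None
        for vector, answer in self.semantic:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        self.exact[self._key(prompt)] = response
        self.semantic.append((toy_embed(prompt), response))


def cached_call(cache: TwoLevelCache, prompt: str, call_llm: Callable[[str], str]) -> str:
    cached = cache.get(prompt)
    if cached is not None:
        return cached
    response = call_llm(prompt)  # only pay for the model on a miss
    cache.put(prompt, response)
    return response


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; records how often it is invoked."""
    calls.append(prompt)
    return f"answer to: {prompt}"


cache = TwoLevelCache(threshold=0.8)
cached_call(cache, "how do I reset my password", fake_llm)         # miss
cached_call(cache, "how do I reset my password", fake_llm)         # exact hit
cached_call(cache, "how do I reset my password please", fake_llm)  # semantic hit
cached_call(cache, "what is your refund policy", fake_llm)         # miss
print(f"model calls: {len(calls)} of 4 requests")
```

A production semantic cache would normally use a real embedding model and a vector index rather than a linear scan. Prompt caching is not shown because it happens on the provider side; the only client-side work is keeping the stable prefix at the front of the prompt.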

Don’t send every request to your most expensive model. Route by difficulty: a small, cheap model handles classification and simple queries; the frontier model is reserved for genuinely hard ones. A cheap classifier — or rules — does the routing. This is often the single largest cost lever.
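A rule-based router can be very small. The sketch below assumes two hypothetical model names and a made-up keyword heuristic; a real system would tune the rules or train a cheap classifier on its own traffic.

```python
SMALL_MODEL = "small-model"        # hypothetical model name
FRONTIER_MODEL = "frontier-model"  # hypothetical model name

# Illustrative hints only; these keywords are an assumption of this sketch.
HARD_HINTS = ("prove", "refactor", "analyze", "compare", "step by step")
LONG_QUERY_WORDS = 150


def estimate_difficulty(query: str) -> str:
    """Cheap rules standing in for a difficulty classifier."""
    text = query.lower()
    if len(text.split()) > LONG_QUERY_WORDS:
        return "hard"
    if any(hint in text for hint in HARD_HINTS):
        return "hard"
    return "easy"


def route(query: str) -> str:
    """Send hard queries to the frontier model and everything else to the small one."""
    return FRONTIER_MODEL if estimate_difficulty(query) == "hard" else SMALL_MODEL


print(route("Is this email spam?"))
print(route("Analyze this contract and compare the two indemnity clauses"))
```

The design choice that matters is that the router itself must be far cheaper than the calls it saves, which is why rules or a small classifier are used rather than another large model.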

Tactic                                 Cost   Latency   Quality
Exact / semantic caching               ▼▼     ▼▼
Route easy traffic to a small model    ▼▼     ▼         ▼ slightly
Shorter prompts & smaller context      ▼      ▼         unchanged / ▼
Lower max_tokens                       ▼      ▼         ▼ if truncated

LLM latency has two parts: time to first token (TTFT) and time per output token. Total response time scales with how much text is generated.
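A back-of-the-envelope model makes the relationship concrete. The function name and the timing figures below are assumptions chosen for illustration, not measurements of any particular model.

```python
def estimated_latency_s(ttft_s: float, output_tokens: int, seconds_per_token: float) -> float:
    """Total time is roughly time to first token plus generation time."""
    return ttft_s + output_tokens * seconds_per_token


# Assumed figures: 0.5 s to first token, 20 ms per output token.
short_answer = estimated_latency_s(0.5, output_tokens=100, seconds_per_token=0.02)
long_answer = estimated_latency_s(0.5, output_tokens=1_000, seconds_per_token=0.02)

print(f"100 tokens:   {short_answer:.1f} s")
print(f"1,000 tokens: {long_answer:.1f} s")
```

Under these assumptions TTFT is a fixed cost, and almost all of the difference between the two responses comes from output length.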

  • Stream the response. Send tokens as they’re produced. The total time is unchanged, but perceived latency drops dramatically — the user reads while the model writes.
  • Parallelize independent calls instead of chaining them.
  • Shorten outputs. Fewer output tokens is the most direct latency win.
  • Route to a faster model for latency-sensitive paths.
  • Cache — a cache hit is the fastest possible response.
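Streaming and parallel calls can both be demonstrated with `asyncio`. In the sketch below, `fake_stream` and `fake_call` are stand-ins that sleep instead of calling a real provider, and the delays are arbitrary; the point is the shape of the code, not the numbers.

```python
import asyncio
import time
from typing import AsyncIterator, List, Tuple


async def fake_stream(prompt: str, n_tokens: int = 5, delay_s: float = 0.01) -> AsyncIterator[str]:
    """Stand-in for a streaming LLM API: yields one token at a time."""
    for i in range(n_tokens):
        await asyncio.sleep(delay_s)
        yield f"tok{i} "


async def fake_call(prompt: str, delay_s: float = 0.05) -> str:
    """Stand-in for a non-streaming LLM call."""
    await asyncio.sleep(delay_s)
    return f"result for {prompt}"


async def stream_to_user(prompt: str) -> Tuple[float, float]:
    """Forward tokens as they arrive; return (time to first token, total time)."""
    start = time.perf_counter()
    first_token_s = None
    async for token in fake_stream(prompt):
        if first_token_s is None:
            first_token_s = time.perf_counter() - start
        print(token, end="", flush=True)  # the user starts reading here
    print()
    return first_token_s, time.perf_counter() - start


async def chained(prompts: List[str]) -> List[str]:
    # Each call waits for the previous one to finish.
    return [await fake_call(p) for p in prompts]


async def parallel(prompts: List[str]) -> List[str]:
    # Independent calls run concurrently; gather keeps the input order.
    return list(await asyncio.gather(*(fake_call(p) for p in prompts)))


async def timed(coro) -> Tuple[List[str], float]:
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start


async def main():
    first_token_s, total_s = await stream_to_user("hello")
    prompts = ["summarize", "classify", "extract"]
    chained_out, chained_s = await timed(chained(prompts))
    parallel_out, parallel_s = await timed(parallel(prompts))
    return first_token_s, total_s, chained_out, chained_s, parallel_out, parallel_s


first_token_s, total_s, chained_out, chained_s, parallel_out, parallel_s = asyncio.run(main())
print(f"first token after {first_token_s:.3f} s, full response after {total_s:.3f} s")
print(f"chained: {chained_s:.3f} s, parallel: {parallel_s:.3f} s")
```

Parallelizing only helps when the calls really are independent. If one call needs another's output, the chain is unavoidable and shortening each output is the remaining lever.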

Provider APIs have outages, rate limits, and latency spikes. Build for it:

  • Timeouts + retries with exponential backoff and jitter on every call.
  • Fallback models — if the primary provider fails or is rate-limited, fail over to a secondary. The LLM gateway is where this lives.
  • Circuit breakers — stop hammering a provider that’s clearly down.
  • Graceful degradation — when AI is unavailable, degrade to something useful: cached results, a simpler non-AI path, or an honest “try again shortly” — never a crash.
  • Rate-limit yourself — queue and smooth your own traffic so you don’t trip provider limits during spikes.
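Several of these patterns fit together in a small wrapper. The following is a sketch, not a production gateway: `ProviderError`, `CircuitBreaker`, `call_with_retries`, and `complete` are hypothetical names, each provider callable is assumed to enforce its own timeout and raise `ProviderError` on failure, and the thresholds and delays are arbitrary. Self-imposed rate limiting is left out to keep it short.

```python
import random
import time
from typing import Callable, List, Optional, Tuple


class ProviderError(Exception):
    """Hypothetical error for timeouts, rate limits, and 5xx responses."""


class CircuitBreaker:
    """Opens after N consecutive failures; allows traffic again after a cooldown."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # stop hammering this provider


Call = Callable[[str], str]  # assumed to enforce its own timeout


def call_with_retries(call: Call, prompt: str, breaker: CircuitBreaker,
                      max_attempts: int = 3, base_delay_s: float = 0.5,
                      max_delay_s: float = 8.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    for attempt in range(max_attempts):
        if not breaker.allow():
            break
        try:
            result = call(prompt)
        except ProviderError:
            breaker.record(ok=False)
            if attempt < max_attempts - 1:
                # Exponential backoff with full jitter.
                sleep(random.uniform(0, min(max_delay_s, base_delay_s * 2 ** attempt)))
        else:
            breaker.record(ok=True)
            return result
    raise ProviderError("provider unavailable")


def complete(prompt: str, providers: List[Tuple[Call, CircuitBreaker]],
             sleep: Callable[[float], None] = time.sleep) -> str:
    """Try providers in order; degrade to an honest message if all of them fail."""
    for call, breaker in providers:
        try:
            return call_with_retries(call, prompt, breaker, sleep=sleep)
        except ProviderError:
            continue  # fail over to the next provider
    return "The assistant is temporarily unavailable. Please try again shortly."


attempts: List[str] = []


def primary(prompt: str) -> str:
    attempts.append(prompt)
    raise ProviderError("503 from primary")


def secondary(prompt: str) -> str:
    return f"secondary answered: {prompt}"


def no_sleep(seconds: float) -> None:
    """Skip real waiting so the demo runs instantly."""


primary_breaker = CircuitBreaker()
providers = [(primary, primary_breaker), (secondary, CircuitBreaker())]
print(complete("hi", providers, sleep=no_sleep))        # 3 failed tries, then failover
print(complete("hi again", providers, sleep=no_sleep))  # breaker open, primary skipped
print(f"primary was called {len(attempts)} times")
```

In the second request the breaker is already open, so the primary is skipped without a single attempt and the user does not pay the retry delay again.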

There is no universal right answer — it depends on the use case:

  • Interactive chat — prioritize latency: stream, route to a fast model, cache aggressively.
  • Batch processing — prioritize cost: cheapest capable model, large batches, ignore latency.
  • High-stakes output (legal, medical, financial) — prioritize quality and reliability: strongest model, verification, human review; accept the cost.

State the priority order explicitly for each feature. That single decision drives every other choice.
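One way to make that decision explicit is to record it as configuration next to the feature. The sketch below is illustrative only: `FeaturePolicy`, its fields, and the model tier names are hypothetical, and the ordering of the second and third priorities for interactive chat is an assumption rather than something stated above.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

FORCES = ("cost", "latency", "quality")


@dataclass(frozen=True)
class FeaturePolicy:
    priorities: Tuple[str, str, str]  # most important first
    model_tier: str                   # hypothetical tier names
    stream: bool = False
    cache: bool = False
    batch: bool = False
    human_review: bool = False

    def __post_init__(self) -> None:
        # Force an explicit ranking: each force appears exactly once.
        if sorted(self.priorities) != sorted(FORCES):
            raise ValueError(f"priorities must rank each of {FORCES} exactly once")


POLICIES: Dict[str, FeaturePolicy] = {
    "interactive_chat": FeaturePolicy(
        priorities=("latency", "quality", "cost"),  # order after latency is assumed
        model_tier="fast", stream=True, cache=True,
    ),
    "batch_processing": FeaturePolicy(
        priorities=("cost", "quality", "latency"),
        model_tier="cheapest-capable", batch=True,
    ),
    "high_stakes_output": FeaturePolicy(
        priorities=("quality", "latency", "cost"),
        model_tier="strongest", human_review=True,
    ),
}


def top_priority(feature: str) -> str:
    return POLICIES[feature].priorities[0]


for name, policy in POLICIES.items():
    print(f"{name}: {' > '.join(policy.priorities)} (model tier: {policy.model_tier})")
```

The validation step is the useful part: a feature cannot be registered without someone deciding which of the three forces comes first.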

Model token cost up front — output tokens and RAG context dominate. Cut cost with caching and difficulty-based model routing. Cut perceived latency by streaming, and real latency by shortening output. Engineer reliability with timeouts, retries, fallback models, circuit breakers, and graceful degradation. You can’t optimize cost, latency, and quality at once — rank them per feature and design to that ranking.