Skip to content
About

LLM Application Architecture

Most production LLM applications share the same skeleton. Learn it once and you can read — or design — almost any of them.

User API Gateway auth · rate limiting · request shaping Input Guardrails injection · abuse · off-topic checks Orchestration Layer your code — decides: retrieve? call a tool? loop? Prompt Assembly Retrieval (RAG) Tool / Action Execution LLM Gateway routing · retries · fallback Model providers Output Guardrails schema · safety · PII User Observability — every stage emits traces, metrics & cost

The ordinary front door: authentication, per-user rate limiting, request validation. Nothing AI-specific, but it’s where you stop abuse before it costs you model calls.

The first AI-aware layer. Cheap, fast checks before an expensive model call: prompt-injection detection, off-topic filtering, and abuse classification. Reject early.

The brain of the application — your own code that decides the control flow:

  • A simple feature: assemble a prompt, call the model, return the result.
  • A RAG feature: retrieve context first, then call the model.
  • An agent: loop — call the model, run the tool it requested, feed the result back, repeat. See AI Agents.

This layer is deterministic application code. Resist the urge to let the LLM control flow it doesn’t need to.

Builds the final prompt from parts: a versioned system prompt, conversation history (often trimmed or summarized to fit the context window), retrieved context, and the user’s message. Treat prompts as versioned artifacts, not string literals scattered through the codebase.

Fetches relevant knowledge — vector search, keyword search, SQL, an API call — and feeds it to prompt assembly. This is RAG, and it’s how you give the model facts it doesn’t have.

When the model needs to do something — query a database, call an API, run code — this layer executes the requested tool in a sandbox and returns the result. Every tool call is a security boundary: validate arguments, scope permissions narrowly.

A single internal choke point for all model calls. It centralizes retries, timeouts, fallback to a second provider, model routing (cheap model for easy requests, strong model for hard ones), caching, and cost tracking. This is the swappable-model principle made concrete.

Before a response reaches the user: validate it against a schema, scan for unsafe content and leaked PII or secrets, and confirm it stays on policy. Fail closed.

Cross-cutting, not a step. Every layer emits traces (the full chain for one request), metrics (latency, token cost, error and cache-hit rates), and evals (quality scored on sampled traffic). Covered in MLOps.

You do not build all of this on day one. Most applications begin as:

Gateway → Prompt → LLM → Response

Add layers when a real problem demands them: retrieval when the model lacks knowledge, guardrails when you see abuse, an LLM gateway when you need fallback or routing, orchestration loops when one call can’t finish the job. Premature architecture is as costly as no architecture.

Production LLM apps share a skeleton: gateway, input guardrails, an orchestration layer that owns control flow, prompt assembly, retrieval, tool execution, an LLM gateway, output guardrails, and pervasive observability. Orchestration is your deterministic code — keep it that way. The LLM gateway centralizes resilience and routing. Start with the minimal path and add layers only when a concrete problem justifies them.