Skip to content
About

LLMOps

LLMOps is MLOps adapted to systems built on large language models. The operational mindset carries over — version, evaluate, monitor, roll out safely — but LLM systems shift what you operate.

AspectClassical MLOpsLLMOps
The modelYou train and own itYou consume a vendor’s or open model
What you versionModel weights, datasetsPrompts, model choice, retrieval data
The artifact you tuneModel parametersPrompts, context, the system around the model
EvaluationMetrics on a labeled test setHarder — output is open-ended text
MonitoringDrift, accuracy+ quality, hallucination, cost, latency, safety
Main costGPU training costPer-token inference cost — recurring
Failure modeWrong predictionWrong, or unsafe / off-topic / injected

The shift in a sentence: you don’t manage a model’s weights — you manage the prompts, context, and guardrails wrapped around someone else’s model.

The prompt is now the artifact under change control. Apply the discipline from Advanced Patterns: prompts in version control, externalized from code, templated, and model-pinned — since a prompt is tuned to a specific model and a model upgrade is a change to re-evaluate.

The unit of versioning is the whole configuration: prompt + model + parameters + retrieval setup. Change any one and you have a new version to evaluate.

The single most important LLMOps practice. Because LLM output is open-ended, “does it work?” is meaningless without a measurement — so the evaluation set becomes the center of development:

Change a prompt / model / retrieval setting Run the evaluation suite programmatic · LLM-as-judge · sampled human Compare scores to the production version Improved → ship Regressed → reject

This is the LLM equivalent of the evaluation gate. It turns prompt engineering from vibes into engineering — every change is proven, every regression is caught before users see it.

LLM systems — especially RAG and agents — are multi-step, so a single quality metric can’t tell you where something went wrong. You need tracing: the full record of one request.

A trace should capture every step — retrieval queries and the chunks returned, the fully assembled prompt, the raw model response, every tool call, plus token counts, latency, and cost per step. When a user reports a bad answer, the trace is how you find the broken step instead of guessing. (Tools: LangSmith, Langfuse, Arize Phoenix, and others.)

Beyond standard service health, watch the LLM-specific signals:

  • Quality — run evals on a sample of live traffic, not just your test set.
  • Hallucination / faithfulness — for RAG, are answers grounded in retrieved context?
  • Cost — tokens and dollars per request, per feature, per user — with alerts. LLM spend drifts upward quietly.
  • Latency — time-to-first-token and total, at p99.
  • Safety — guardrail trigger rates; refusals; flagged content.
  • User signals — thumbs up/down, edits, retries, escalations — cheap, honest quality data.

Guardrails — input/output safety checks — are not write-once. They are operated: monitor their trigger rates, review what they catch and miss, and update them as new misuse and injection patterns appear. New attacks surface continuously.

  1. Baseline — prompts in version control; a small evaluation set; basic request logging.
  2. Repeatable — automated eval suite gating every change; full tracing; cost and latency dashboards.
  3. Mature — evals on live traffic; drift and quality alerting; systematic guardrail review; automated regression detection.

Start at the bottom. An eval set and versioned prompts beat an elaborate platform you don’t yet need.

LLMOps keeps the MLOps mindset but changes the artifact: you operate prompts, context, and guardrails around a model you don’t own. Version the whole configuration — prompt, model, parameters, retrieval — and pin the model. Eval-driven development is the core loop: gate every change against an evaluation suite. Trace multi-step requests end to end. Monitor quality, cost, latency, and safety on live traffic, and operate guardrails as living systems.