Deployment & Monitoring

Getting a model into production is half the job. Keeping it working — as data shifts beneath it — is the other half, and the half teams most often skip.

Deployment takes one of two main shapes, picked by how predictions are consumed:

  • Online / real-time — the model sits behind an API and answers per request. For interactive features. Needs low latency and scaling. (See AI Infrastructure.)
  • Batch — the model scores a large dataset on a schedule, results written to a store. For “score every customer nightly.” Simpler; no latency pressure.

Choose online only when predictions are genuinely needed on demand — batch is cheaper and simpler when freshness allows it.
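The difference between the two shapes is easiest to see side by side. The following is a minimal sketch, not a serving framework: `predict` is a hypothetical stand-in for a trained model, and the feature names, coefficients, and in-memory `store` are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, Tuple

Features = Dict[str, float]


def predict(features: Features) -> float:
    """Hypothetical stand-in for a trained model: a fixed linear score."""
    return 0.5 * features["recency"] + 0.1 * features["orders"]


def handle_request(features: Features) -> Dict[str, float]:
    """Online shape: score one request and answer immediately."""
    return {"score": predict(features)}


def score_batch(rows: Iterable[Tuple[str, Features]],
                write: Callable[[str, float], None]) -> int:
    """Batch shape: score every row and write each result to a store."""
    count = 0
    for customer_id, features in rows:
        write(customer_id, predict(features))
        count += 1
    return count


# A nightly job would call score_batch on a schedule; here a dict plays the store.
store: Dict[str, float] = {}
customers = [
    ("c1", {"recency": 2.0, "orders": 10.0}),
    ("c2", {"recency": 0.0, "orders": 5.0}),
]
scored = score_batch(customers, store.__setitem__)
```

The model call is identical in both paths. What changes is everything around it: the online handler has to meet a latency budget on every call, while the batch job only has to finish before the results are needed.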

A model registry is version control for trained models — the source of truth for what exists and what’s live.

Each registered model carries its version, the metrics it earned, a link to the training run and data, and a stage: staging, production, archived. The registry makes one critical operation trivial: rollback. When a new model misbehaves, you re-point production at the previous version immediately — no retraining, no scramble.
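To make the rollback idea concrete, here is a toy in-memory registry. It is a sketch under stated assumptions, not the API of any real registry product: the class names, the `history` list, and the example metrics and run ids are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ModelVersion:
    version: int
    metrics: Dict[str, float]
    run_id: str              # link back to the training run and data
    stage: str = "staging"   # staging | production | archived


@dataclass
class ModelRegistry:
    versions: Dict[int, ModelVersion] = field(default_factory=dict)
    history: List[int] = field(default_factory=list)  # versions that went live, oldest first

    def register(self, metrics: Dict[str, float], run_id: str) -> ModelVersion:
        """Record a newly trained model; it starts in staging."""
        version = len(self.versions) + 1
        self.versions[version] = ModelVersion(version, metrics, run_id)
        return self.versions[version]

    def promote(self, version: int) -> None:
        """Point production at `version`, archiving whatever was live before."""
        if self.history:
            self.versions[self.history[-1]].stage = "archived"
        self.versions[version].stage = "production"
        self.history.append(version)

    def rollback(self) -> int:
        """Re-point production at the previous live version. No retraining involved."""
        if len(self.history) < 2:
            raise RuntimeError("no previous production version to roll back to")
        bad = self.history.pop()
        self.versions[bad].stage = "archived"
        previous = self.history[-1]
        self.versions[previous].stage = "production"
        return previous

    def live(self) -> Optional[int]:
        return self.history[-1] if self.history else None


# Illustrative values only.
registry = ModelRegistry()
registry.register({"auc": 0.91}, run_id="run-001")
registry.register({"auc": 0.93}, run_id="run-002")
registry.promote(1)
registry.promote(2)
restored = registry.rollback()  # version 2 misbehaved; version 1 is live again
```

Rollback here is only a pointer change, which is why it can be immediate: the previous model's artifact was never deleted, only archived.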

Never flip 100% of traffic to a new model at once. Roll it out progressively:

Strategy   | How it works
Shadow     | New model runs on real traffic; its output is logged, not served. Zero-risk validation.
Canary     | New model serves a small slice (1–5%); watch metrics; widen or roll back.
Blue-green | Two environments; switch traffic over, switch back instantly on trouble.
A/B test   | Two models split traffic to compare a business metric directly.

Shadow then canary is a strong default: prove the model on live traffic without risk, then expose it gradually.
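The shadow-then-canary default can be sketched as a single routing function. This is an illustration under assumptions rather than a production router: `serve`, `canary_bucket`, and the `shadow_log` list are hypothetical names, and models are assumed to be plain callables. Hashing the request id keeps the split stable, so the same id always lands on the same side.

```python
import hashlib
from typing import Any, Callable, List, Optional

Model = Callable[[dict], Any]


def canary_bucket(request_id: str) -> float:
    """Map a request id to a stable value in [0, 1)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def serve(request_id: str, features: dict, current: Model, candidate: Model,
          canary_fraction: float = 0.0,
          shadow_log: Optional[List[tuple]] = None) -> Any:
    """Serve one request.

    Canary: a `canary_fraction` share of ids is answered by the candidate.
    Shadow: when `shadow_log` is given, the candidate also runs on requests
    served by the current model, and its output is logged, never returned.
    """
    if canary_bucket(request_id) < canary_fraction:
        return candidate(features)
    served = current(features)
    if shadow_log is not None:
        try:
            shadow_log.append((request_id, served, candidate(features)))
        except Exception as exc:  # a failing candidate must not affect the user
            shadow_log.append((request_id, served, exc))
    return served


def old_model(features: dict) -> str:
    return "old"


def new_model(features: dict) -> str:
    return "new"


log: List[tuple] = []
shadow_answer = serve("r1", {}, old_model, new_model, shadow_log=log)      # phase 1: shadow only
canary_answer = serve("r1", {}, old_model, new_model, canary_fraction=0.05)  # phase 2: 5% canary
```

Widening the rollout means raising `canary_fraction`; rolling back means setting it to zero.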

A deployed model degrades silently — no errors, no alerts, just slowly worsening predictions. Monitoring is what makes that visible. Watch four layers:

  1. Operational — latency, throughput, error rate, cost. Standard service health.
  2. Data quality — are incoming features valid: schema, ranges, null rates, missing values? Bad inputs are the most common production failure.
  3. Drift — has the input distribution moved away from training data?
  4. Model performance — accuracy, precision, recall on live data, once ground truth is available.
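The data quality layer is the most mechanical of the four and is easy to sketch. The example below assumes incoming features arrive as a list of dicts; the feature names, ranges, and null-rate limits in `EXPECTATIONS` are hypothetical and would normally be derived from the training data.

```python
import math
from typing import Dict, List

# Hypothetical expectations per feature: allowed range and maximum null rate.
EXPECTATIONS = {
    "age": {"min": 0.0, "max": 120.0, "max_null_rate": 0.01},
    "basket_value": {"min": 0.0, "max": 10_000.0, "max_null_rate": 0.05},
}


def is_missing(value) -> bool:
    return value is None or (isinstance(value, float) and math.isnan(value))


def is_number(value) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_batch(rows: List[Dict], expectations: Dict = EXPECTATIONS) -> List[str]:
    """Return human-readable violations for a batch of feature rows."""
    if not rows:
        return ["empty batch"]
    problems = []
    for name, rule in expectations.items():
        values = [row.get(name) for row in rows]  # an absent key counts as missing
        present = [v for v in values if not is_missing(v)]
        null_rate = (len(values) - len(present)) / len(values)
        if null_rate > rule["max_null_rate"]:
            problems.append(f"{name}: null rate {null_rate:.1%} above limit")
        bad_type = sum(1 for v in present if not is_number(v))
        if bad_type:
            problems.append(f"{name}: {bad_type} non-numeric value(s)")
        out_of_range = sum(
            1 for v in present
            if is_number(v) and not (rule["min"] <= v <= rule["max"])
        )
        if out_of_range:
            problems.append(f"{name}: {out_of_range} value(s) out of range")
    return problems
```

A check like this would run before scoring, so a batch with violations can be held back or flagged instead of quietly producing bad predictions.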

Drift is the core reason models decay:

  • Data drift — the input distribution changes. A new customer segment, a pricing change, seasonality. The model now sees inputs unlike its training set.
  • Concept drift — the relationship between input and output changes. What predicted fraud last year no longer does, because fraud tactics evolved.
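Data drift can be measured without labels by comparing a feature's live distribution against its training distribution. One common choice, swapped in here for illustration since the article does not prescribe a metric, is the Population Stability Index, PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), over the share of values in each bin. The sketch assumes non-empty numeric samples, equal-width bins taken from the training range, and a small floor to avoid taking the log of zero.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index of `actual` against `expected` (training) values."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all training values are equal

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp outliers into the edge bins
        # Floor each share so empty bins do not cause log(0) or division by zero.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


train = [i / 100 for i in range(1000)]   # roughly uniform on [0, 10)
shifted = [v + 3 for v in train]         # the same shape, moved to the right
stable_score = psi(train, train)
drift_score = psi(train, shifted)
```

A commonly cited rule of thumb treats PSI above roughly 0.25 as a significant shift; that threshold is a convention, not something this article specifies. A metric like this only sees data drift. Concept drift changes the input-output relationship, so it shows up in performance metrics once labels arrive.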

When monitoring shows decay, the model is retrained on newer data, re-entering the lifecycle loop. A retrain can be started in two ways:

  • Scheduled — retrain every week or month. Simple, predictable.
  • Triggered — retrain when drift or a performance metric crosses a threshold. Efficient, but needs reliable monitoring to fire it.

Every retrained model goes through the same evaluation gate and staged rollout. “Retrain” never means “deploy blindly” — newer data does not guarantee a better model.
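The two triggers and the gate fit in a few lines. This is a sketch under assumptions: the function names and every threshold (30 days, a drift score of 0.25, 0.85 accuracy) are hypothetical placeholders, and the gate is reduced to a single held-out metric for brevity.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def retrain_reasons(now: datetime, last_trained: datetime, drift_score: float,
                    live_accuracy: Optional[float], *,
                    max_age: timedelta = timedelta(days=30),
                    drift_threshold: float = 0.25,
                    min_accuracy: float = 0.85) -> List[str]:
    """Return why a retrain should start now; an empty list means do nothing."""
    reasons = []
    if now - last_trained >= max_age:
        reasons.append("scheduled")
    if drift_score > drift_threshold:
        reasons.append("drift")
    # Live accuracy is None until ground-truth labels have arrived.
    if live_accuracy is not None and live_accuracy < min_accuracy:
        reasons.append("performance")
    return reasons


def passes_gate(candidate_metric: float, production_metric: float,
                min_gain: float = 0.0) -> bool:
    """Evaluation gate: the retrained model must at least match production."""
    return candidate_metric >= production_metric + min_gain


reasons = retrain_reasons(datetime(2024, 3, 1), datetime(2024, 2, 20),
                          drift_score=0.4, live_accuracy=None)
deployable = passes_gate(candidate_metric=0.90, production_metric=0.92)
```

In this example the drift trigger fires, but the retrained model scores below production on the held-out set, so it stops at the gate and the current model stays live.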

Serve models online for on-demand predictions, in batch when freshness allows — batch is simpler. A model registry versions trained models and makes rollback instant. Deploy progressively: shadow, then canary, never a hard cutover. Monitor operations, data quality, drift, and performance — because models fail silently. Drift (in data or concept) is the main cause of decay; watch input distributions since true labels arrive late. Retrain on a schedule or a trigger, always through the same gate and rollout.