The ML Lifecycle

Shipping a model is not a straight line from “train” to “deploy.” It’s a lifecycle — a loop that runs continuously, because a model in production slowly goes stale.

The lifecycle loop

The loop is the point. Classical software that is correct generally stays correct until its code or dependencies change. A model can degrade with no change to its code at all, because the world drifts away from its training data, so the lifecycle never really ends.

Problem framing

Before any code: what decision will this model drive, what does success look like as a metric, and what’s the baseline? Many ML projects fail here — a technically fine model solving the wrong problem.
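One way to make the baseline question concrete is to score a trivial predictor on the success metric before training anything. The sketch below assumes a classification problem with accuracy as the metric; the function name `majority_baseline_accuracy` and the sample labels are hypothetical, chosen only for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    if not labels:
        raise ValueError("labels must be non-empty")
    # most_common(1) returns [(label, count)] for the most frequent label.
    _, majority_count = Counter(labels).most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical labels: 8 negatives and 2 positives.
labels = [0] * 8 + [1] * 2
baseline = majority_baseline_accuracy(labels)
```

On data like this, a model that reports 82% accuracy barely beats predicting the majority class every time, which is the kind of finding that belongs in the framing stage rather than after deployment.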

Data preparation

Usually the largest share of the work: collecting, cleaning, labeling, and validating data. Two practices keep it sane:

Data validation — automated checks on schema, ranges, nulls, and class balance for every dataset. Bad data silently produces bad models.
Data versioning — datasets change; pin and version them so any model can be traced to the exact data it learned from. (Tools: DVC, lakeFS, or dataset snapshots.)
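The validation checks listed above (schema, ranges, nulls, class balance) can be sketched in plain Python. This is a minimal illustration, not a substitute for a dedicated validation library: the function `validate_rows`, its thresholds, and the sample columns are all assumptions made for the example, and rows are assumed to be dictionaries.

```python
from collections import Counter


def validate_rows(rows, schema, ranges, label_col="label",
                  max_null_rate=0.05, min_class_share=0.10):
    """Return a list of problems found; an empty list means every check passed."""
    if not rows:
        return ["dataset is empty"]
    problems = []
    n = len(rows)

    # Schema and null checks. A missing key is treated the same as a null value.
    for col, expected_type in schema.items():
        values = [row.get(col) for row in rows]
        null_rate = sum(v is None for v in values) / n
        if null_rate > max_null_rate:
            problems.append(f"{col}: null rate {null_rate:.2f} exceeds {max_null_rate:.2f}")
        if any(v is not None and not isinstance(v, expected_type) for v in values):
            problems.append(f"{col}: found values that are not {expected_type.__name__}")

    # Range checks, applied only to numeric values (type errors are reported above).
    for col, (low, high) in ranges.items():
        out_of_range = [
            row[col] for row in rows
            if isinstance(row.get(col), (int, float)) and not (low <= row[col] <= high)
        ]
        if out_of_range:
            problems.append(f"{col}: {len(out_of_range)} value(s) outside [{low}, {high}]")

    # Class balance: at least two classes, each with a minimum share of labeled rows.
    counts = Counter(row[label_col] for row in rows if row.get(label_col) is not None)
    labeled = sum(counts.values())
    if len(counts) < 2:
        problems.append(f"{label_col}: fewer than two classes present")
    for cls, count in counts.items():
        if count / labeled < min_class_share:
            problems.append(f"{label_col}: class {cls!r} is below {min_class_share:.2f} of rows")
    return problems


# Hypothetical schema and data for illustration.
schema = {"age": int, "label": int}
ranges = {"age": (0, 120)}
good_rows = [{"age": 30, "label": 0}, {"age": 41, "label": 1},
             {"age": 25, "label": 0}, {"age": 57, "label": 1}]
bad_rows = [{"age": 30, "label": 0}, {"age": -4, "label": 0},
            {"age": None, "label": 0}, {"age": "41", "label": 0}]
problems = validate_rows(bad_rows, schema, ranges)
```

Returning a list of problems rather than raising on the first failure lets a pipeline report everything wrong with a dataset in one pass, then refuse to train if the list is non-empty.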

Experimentation and tracking

Training is iterative — many runs with different data, features, and hyperparameters. Without records this becomes chaos, so use experiment tracking.

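import mlflow

with mlflow.start_run():
    # Record the configuration that defines this run.
    mlflow.log_params({"model": "xgboost", "max_depth": 6, "lr": 0.1})
    model = train(...)  # placeholder for your own training routine
    # Illustrative values; in practice, log the metrics computed on validation data.
    mlflow.log_metrics({"val_auc": 0.91, "val_f1": 0.88})
    # Assumes the trained model has already been serialized to model.pkl on disk.
    mlflow.log_artifact("model.pkl")
    # Now this run is comparable to every other.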

A tracker (MLflow, Weights & Biases) logs each run’s parameters, metrics, code version, and artifacts — so you can compare runs objectively and reproduce the winner instead of guessing which notebook produced the good number.

Reproducibility

A result you can’t reproduce isn’t an asset — it’s a rumor. Reproducibility means pinning all four inputs to a training run:

Code — a git commit.
Data — a dataset version.
Environment — pinned dependencies, ideally containerized.
Configuration — hyperparameters and random seeds.

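If all four are captured, anyone can retrain an equivalent model, and a bit-identical one when the training stack itself is deterministic (some GPU kernels and multithreaded operations are not). If any of the four is left floating, you have a model nobody can rebuild or safely change.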
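One way to capture the four inputs is to write a small manifest next to every trained model. The sketch below is an assumption about how that could look, using only the standard library: the function names, the manifest layout, and the choice to fingerprint the environment by hashing a dependency lockfile are all illustrative. A container image digest would serve the same purpose for the environment.

```python
import hashlib
import json
import platform
import subprocess


def current_git_commit():
    """Return the checked-out commit hash, or None outside a git repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_run_manifest(data_path, lockfile_path, config):
    """Collect code, data, environment, and configuration for one training run."""
    return {
        "code": {"git_commit": current_git_commit()},
        "data": {"path": str(data_path), "sha256": sha256_of_file(data_path)},
        "environment": {
            "python": platform.python_version(),
            "lockfile_sha256": sha256_of_file(lockfile_path),
        },
        # Hyperparameters and random seeds travel together in the config.
        "config": dict(config),
    }


def save_manifest(manifest, path):
    """Write the manifest as stable, diff-friendly JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(manifest, handle, indent=2, sort_keys=True)
```

A call such as `build_run_manifest("train.csv", "requirements.txt", {"max_depth": 6, "seed": 42})` (hypothetical file names) yields a dictionary that can be logged as an artifact alongside the model, so a later rebuild can check each input against the recorded value.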

CI/CD for ML

ML extends CI/CD with model-specific stages. A change to any of the code, the data, or the model can trigger the pipeline.

The evaluation gate is the ML-specific part: a build can pass every unit test and still ship a worse model. The pipeline must compare new model metrics against the current production model — and the baseline — and fail on a regression. Tests catch broken code; the gate catches a degraded model.
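The gate itself can be a small comparison step. The following is a minimal sketch under stated assumptions: metrics are dictionaries where higher is better, the function name `evaluation_gate` and the `tolerance` parameter are hypothetical, and the metric values are made up for illustration.

```python
def evaluation_gate(candidate, production, baseline, tolerance=0.0):
    """Compare candidate metrics with production and baseline metrics.

    Returns (passed, reasons). Assumes higher is better for every metric.
    """
    reasons = []
    for name, prod_value in production.items():
        if name not in candidate:
            reasons.append(f"{name}: missing from candidate metrics")
            continue
        cand_value = candidate[name]
        # Regression check: allow at most `tolerance` below the production model.
        if cand_value < prod_value - tolerance:
            reasons.append(
                f"{name}: candidate {cand_value:.3f} is below production {prod_value:.3f}"
            )
        # Baseline check: the candidate must strictly beat the trivial baseline.
        base_value = baseline.get(name)
        if base_value is not None and cand_value <= base_value:
            reasons.append(
                f"{name}: candidate {cand_value:.3f} does not beat baseline {base_value:.3f}"
            )
    return (not reasons, reasons)


# Illustrative metric values only.
candidate = {"val_auc": 0.91, "val_f1": 0.88}
production = {"val_auc": 0.90, "val_f1": 0.89}
baseline = {"val_auc": 0.50}

passed, reasons = evaluation_gate(candidate, production, baseline, tolerance=0.005)
```

Here the candidate improves `val_auc` but loses more than the allowed tolerance on `val_f1`, so the gate fails even though every unit test could still be green. In a CI job, the script would exit with a non-zero status when `passed` is false so the deploy stage never runs.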

Key takeaways

ML delivery is a continuous loop (frame, prepare data, experiment, evaluate, deploy, monitor, repeat) because models drift as the world changes.
Validate and version data.
Track every experiment so runs are comparable and the winner is reproducible.
Reproducibility requires pinning code, data, environment, and config together.
CI/CD for ML adds an evaluation gate that blocks any deploy that regresses model quality.