Model Evaluation
Building a model is easy. Knowing whether it’s any good is the hard part, and it is what separates engineers who ship reliable AI from those who ship confident-looking failures. This page covers the most transferable skill in the guide: the same discipline evaluates a fraud model and a RAG pipeline.
Always start with a baseline
Before celebrating “92% accuracy,” ask: what does doing nothing get? If 92% of transactions are legitimate, a model that predicts “legitimate” every time scores 92% and catches zero fraud.
A baseline is the trivial reference point: predict the most common class, predict the average, or use last week’s value. Your model’s score is only meaningful as a delta over the baseline. Compute the baseline first, every time.
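The fraud example above can be sketched in a few lines. This is a minimal illustration with made-up labels and a hypothetical model output, not a real dataset; the function names are assumptions chosen for this example.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most common label."""
    _, top_count = Counter(labels).most_common(1)[0]
    return top_count / len(labels)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical data: 92 legitimate (0) and 8 fraudulent (1) transactions.
y_true = [0] * 92 + [1] * 8
# Hypothetical model output: 2 false alarms, 4 of the 8 frauds caught.
y_pred = [0] * 90 + [1] * 2 + [1] * 4 + [0] * 4

baseline_acc = majority_baseline_accuracy(y_true)
model_acc = accuracy(y_true, y_pred)
print(f"baseline={baseline_acc:.2f} model={model_acc:.2f} "
      f"delta={model_acc - baseline_acc:+.2f}")
```

On this toy data the model's 94% is only two points above the do-nothing baseline, which is the number worth reporting.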
Classification metrics
When classes are imbalanced, accuracy lies. Use the confusion matrix and the metrics built from it. TP, TN, FP, and FN below are the counts of true positives, true negatives, false positives, and false negatives.
- Accuracy = (TP + TN) / all. Fine only when classes are balanced.
- Precision = TP / (TP + FP). Of the items we flagged, how many were right? Matters when false positives are costly.
- Recall = TP / (TP + FN). Of the items we should have flagged, how many did we catch? Matters when misses are costly.
- F1 score = the harmonic mean of precision and recall — one number when you care about both.
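As a rough sketch, the four metrics above can be computed directly from confusion-matrix counts. The data is the same hypothetical fraud example (1 = fraud), and returning 0.0 when a ratio is undefined is a convention assumed here, not the only reasonable choice.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1; undefined ratios fall back to 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical labels: 92 legitimate (0), 8 fraudulent (1).
y_true = [0] * 92 + [1] * 8
always_legit = [0] * 100                              # the do-nothing baseline
model_pred = [0] * 90 + [1] * 2 + [1] * 4 + [0] * 4   # made-up model output

for name, pred in [("always-legitimate", always_legit), ("model", model_pred)]:
    print(name, classification_metrics(*confusion_counts(y_true, pred)))
```

The baseline scores 92% accuracy with zero recall, while the made-up model trades a little precision for catching half the fraud. Accuracy alone hides that difference.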
Regression metrics
For models predicting a number:
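- MAE (mean absolute error): the average absolute miss, in the original units. Easy to explain: “off by $12k on average.”
- RMSE (root mean squared error): like MAE, but errors are squared before averaging, so large misses are punished much harder. Use when big errors are disproportionately bad.
- R²: fraction of variance explained. It is at most 1 and usually lands between 0 and 1, but it can go negative on held-out data when the model does worse than always predicting the mean. A quick “how much of the pattern did we capture?”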
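A minimal sketch of the three regression metrics, written from their standard definitions. The numbers are invented for illustration; one deliberately large miss shows how RMSE reacts more strongly than MAE.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error; squaring makes large misses dominate."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """1 - SS_res / SS_tot. Undefined (division by zero) if y_true is constant."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical targets and predictions; the last prediction misses by 100.
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 310, 300]

print(f"MAE={mae(y_true, y_pred):.1f} RMSE={rmse(y_true, y_pred):.1f} "
      f"R2={r2(y_true, y_pred):.3f}")

# Predictions that are worse than the mean give a negative R2.
print(f"R2 (reversed predictions)={r2(y_true, y_pred=[400, 300, 200, 100]):.1f}")
```

Here MAE is 32.5 while RMSE is about 50.7, because the single 100-unit miss dominates the squared errors.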
Cross-validation
A single train/test split can be lucky or unlucky. k-fold cross-validation averages that luck out: split the data into k parts, train on k−1 and test on the held-out one, rotate so each part is the test set once, then average the k scores.
You get a more reliable estimate and a sense of variance — if scores swing wildly across folds, your model is unstable.
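The rotation can be sketched without any ML library. Everything here is illustrative: `k_fold_indices` and `cross_validate` are hypothetical helpers, the "model" is a trivial mean predictor, and the data is synthetic. The shuffle assumes rows are independent; it is not appropriate for time-ordered data.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) so each row is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # only valid for non-temporal data
    folds = [idx[i::k] for i in range(k)]  # k roughly equal parts
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def cross_validate(xs, ys, fit, score, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, return all k scores."""
    scores = []
    for train, test in k_fold_indices(len(xs), k, seed):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return scores

# Toy "model": always predict the training mean. Score: mean absolute error.
def fit_mean(xs, ys):
    return statistics.fmean(ys)

def score_mae(model, xs, ys):
    return statistics.fmean(abs(y - model) for y in ys)

xs = list(range(30))
ys = [float(i % 7) for i in range(30)]  # synthetic targets

scores = cross_validate(xs, ys, fit_mean, score_mae, k=5)
print(f"mean MAE={statistics.fmean(scores):.3f} "
      f"spread (stdev)={statistics.stdev(scores):.3f}")
```

Reporting the spread alongside the mean is what surfaces the fold-to-fold swings described above.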
Data leakage: the silent killer
Data leakage is when information unavailable at prediction time sneaks into training. The model looks brilliant in testing, then collapses in production. Classic causes:
- Target leakage: a feature that’s a proxy for the answer. Example: predicting churn using cancellation_date, which only exists because the user churned.
- Train/test contamination: scaling, imputing, or selecting features using the whole dataset before splitting, so test statistics leak into training.
- Temporal leakage: training on future data to predict the past. Time-series splits must respect the clock; never shuffle randomly.
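Two of these causes have simple mechanical guards, sketched below under stated assumptions: the rows, the `day` and `amount` fields, and the helper names are all hypothetical. The split respects time order, and the scaling statistics come from the training rows only.

```python
import statistics

def time_ordered_split(rows, time_key, test_fraction=0.2):
    """Oldest rows go to train, newest to test. No shuffling."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    cut = len(ordered) - round(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]

def fit_scaler(values):
    """Learn mean and standard deviation from the given values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid dividing by zero
    return mean, std

def transform(values, scaler):
    mean, std = scaler
    return [(v - mean) / std for v in values]

# Hypothetical rows, deliberately out of order.
rows = [{"day": d, "amount": d * 10.0} for d in (3, 1, 10, 7, 2, 9, 5, 8, 4, 6)]

train_rows, test_rows = time_ordered_split(rows, "day")

# Correct: statistics come from the training rows alone.
scaler = fit_scaler([r["amount"] for r in train_rows])
scaled_test = transform([r["amount"] for r in test_rows], scaler)

# Leaky: statistics computed on all rows let test data influence training.
leaky_scaler = fit_scaler([r["amount"] for r in rows])

print("train-only mean:", scaler[0], "| leaky mean:", leaky_scaler[0])
```

The two means differ (45.0 versus 55.0 on this toy data), which is exactly the test-set information that would otherwise seep into training. Target leakage has no mechanical guard of this kind; it takes a feature-by-feature review of what is known at prediction time.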
Offline metrics aren’t the finish line
A model that wins on your test set can still lose in production. Distribution shift, where real inputs drift away from your training data, degrades models silently over time. That’s why the workflow doesn’t end at evaluation: you monitor live performance and re-evaluate continuously. That operational loop is a core part of MLOps.
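One simple form of live monitoring is a rolling comparison against the offline score. The sketch below is an assumption-heavy illustration: `RollingAccuracyMonitor`, its window size, and its tolerance are hypothetical, and it presumes true labels eventually arrive for live predictions. Accuracy is used for brevity; the same pattern applies to precision or recall.

```python
from collections import deque

class RollingAccuracyMonitor:
    """Track accuracy over the most recent labelled predictions."""

    def __init__(self, offline_accuracy, window=100, tolerance=0.05):
        self.offline_accuracy = offline_accuracy
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # old outcomes drop off automatically

    def record(self, y_true, y_pred):
        self.outcomes.append(y_true == y_pred)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self):
        """True once a full window scores below offline accuracy minus tolerance."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        return self.accuracy() < self.offline_accuracy - self.tolerance

# Hypothetical usage: offline accuracy was 0.90; live results get worse over time.
monitor = RollingAccuracyMonitor(offline_accuracy=0.90, window=10)
for _ in range(10):
    monitor.record(y_true=1, y_pred=1)        # healthy period
print("degraded after healthy period:", monitor.degraded())
for i in range(10):
    monitor.record(y_true=1, y_pred=i % 2)    # half the predictions now miss
print("degraded after drift:", monitor.degraded())
```

A check like this only catches degradation after labels arrive; comparing input feature distributions against the training data is a common complement when labels are delayed.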
Key takeaways
Compute a baseline before trusting any score. Use precision and recall, not accuracy, for imbalanced problems, and let the cost of each error type pick the trade-off. Cross-validate to reduce the luck of a single split. Hunt relentlessly for data leakage: it’s one of the most common causes of models that look great offline and fail in production.