Core Algorithms

You don’t need to derive these from scratch. You need to recognize them, know what they’re good at, and pick sensibly. This is a field guide, not a textbook.

Linear & logistic regression

Linear and logistic regression are the simplest models, and a smart default baseline.

  • Linear regression predicts a number as a weighted sum of features.
  • Logistic regression does the same, then squashes the result into a 0–1 probability for classification.

They’re fast, need little data, and, crucially, are interpretable: each weight tells you how much the prediction (for logistic regression, the log-odds) changes when that feature increases by one unit and the others stay fixed. Always try a linear model first. If a complex model can’t beat it, it isn’t earning its complexity.
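
As a minimal sketch of the two models described above, the functions below compute a weighted sum and then pass it through the sigmoid. The weights, bias, and feature values are made up for illustration; in practice they would be learned from data.

```python
import math


def linear_predict(weights, bias, features):
    """Linear regression: a weighted sum of the features plus a bias."""
    return bias + sum(w * x for w, x in zip(weights, features))


def logistic_predict(weights, bias, features):
    """Logistic regression: the same weighted sum, squashed into (0, 1)."""
    z = linear_predict(weights, bias, features)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for two features (illustration only).
weights = [0.5, -2.0]
bias = 1.0
x = [4.0, 1.0]

score = linear_predict(weights, bias, x)   # 1.0 + 0.5*4.0 - 2.0*1.0 = 1.0
prob = logistic_predict(weights, bias, x)  # sigmoid(1.0), roughly 0.731
```

The sign and size of each weight are what make the model readable: here the second feature pushes the score down by 2.0 per unit.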

Decision trees

A decision tree is a tree of yes/no questions on features: “income > 50k? → yes → age < 30? → …”. Each leaf is a prediction. Trees handle non-linear patterns, need no feature scaling, and a single tree is easy to read. Their weakness: one deep tree overfits badly. The fix is to combine many of them.
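
To make the yes/no structure concrete, here is a small hand-built tree and the loop that walks it. The features, thresholds, and leaf labels are hypothetical, and every question is phrased as "is the value above the threshold?" to keep the sketch short; a real tree would be learned from data.

```python
def predict(node, row):
    """Follow yes/no answers from the root until a leaf is reached."""
    while "leaf" not in node:
        answer = "yes" if row[node["feature"]] > node["threshold"] else "no"
        node = node[answer]
    return node["leaf"]


# Hypothetical tree: income > 50k? then age > 30?
tree = {
    "feature": "income", "threshold": 50_000,
    "yes": {
        "feature": "age", "threshold": 30,
        "yes": {"leaf": "approve"},
        "no": {"leaf": "review"},
    },
    "no": {"leaf": "decline"},
}

decision = predict(tree, {"income": 60_000, "age": 45})
```

Reading a prediction is just reading the path taken, which is why a single shallow tree is easy to explain.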

Ensembles: random forests & gradient boosting

Ensembles combine many individual models into one stronger one.

  • Random forest: train hundreds of trees, each on a bootstrap sample of the rows and considering a random subset of features at each split, then average their predictions (or take a majority vote for classification). Robust, hard to misuse, a great default.
  • Gradient boosting: build trees sequentially, each one correcting the previous ensemble’s errors. Libraries: XGBoost, LightGBM, CatBoost.
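
The "each one corrects the previous errors" idea can be sketched in a few lines. This is a toy version for squared error on one feature, using one-split trees (stumps); it is not how XGBoost, LightGBM, or CatBoost are implemented, and the data, round count, and learning rate are arbitrary choices for illustration.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree (a stump) by exhaustive search."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keeps both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # current errors
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def boosted_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Made-up data with a step-like shape that a straight line fits poorly.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 5.2, 5.1, 4.9]

model = fit_boosted(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [boosted_predict(model, x) for x in xs])
```

A random forest differs in one key way: its trees are trained independently on resampled data and averaged, rather than each tree depending on the ones before it.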

k-nearest neighbors (kNN)

To classify a new point, kNN finds the k closest known points and takes a majority vote. There’s no real “training”: it just stores the data. Simple and intuitive, but slow at prediction time on large datasets, because every query is compared against the stored points. Its core idea, “similar inputs have similar outputs”, is the same intuition behind vector search.
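
A minimal sketch of that procedure, assuming Euclidean distance and a brute-force scan of every stored point. The points and labels are invented for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest stored points."""
    # "Training" is just keeping `train` around; all the work happens here.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Made-up labelled points: (features, label).
train = [
    ((1.0, 1.0), "red"), ((1.5, 2.0), "red"), ((2.0, 1.0), "red"),
    ((6.0, 6.0), "blue"), ((7.0, 6.5), "blue"), ((6.5, 7.0), "blue"),
]

label = knn_predict(train, (1.2, 1.4))
```

Sorting every stored point per query is what makes this slow on large datasets; vector search systems typically replace the full scan with an approximate index.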

k-Means clustering

k-Means is the go-to unsupervised algorithm. Pick k, and it partitions data into k groups by iteratively assigning points to the nearest cluster center and recomputing each center as the mean of its points. Used for customer segmentation and exploratory analysis. You must choose k yourself, and it assumes roughly round, similar-sized clusters.
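
The assign-then-recompute loop can be sketched as follows. To keep the result deterministic, this version takes the starting centers as an argument and runs a fixed number of iterations; both are simplifications, and library implementations usually pick starting centers randomly and stop on convergence.

```python
import math


def kmeans(points, centers, n_iters=10):
    """Alternate between assigning points and recomputing centers."""
    clusters = []
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 2: move each center to the mean of its points.
        # An empty cluster keeps its old center.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Made-up 2-D points forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, centers=[(1, 1), (8, 8)])
```

With k = 2 the two groups are recovered, with centers near (1.33, 1.33) and (8.33, 8.33). The algorithm would split the data just as confidently for k = 3 or k = 5, which is why choosing k is left to you.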

Neural networks

Covered in depth in Deep Learning. In one line: layers of simple units that, stacked deep, learn their own features from raw data. They dominate unstructured input (images, audio, text) and often underperform boosted trees on small tabular datasets.
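
To show what "layers of simple units" means mechanically, here is a forward pass through a two-layer network. The weights are hard-coded, hypothetical values; a real network learns them during training, which this sketch leaves out entirely.

```python
def relu(values):
    """Non-linearity: negative values become zero."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One layer: each unit is a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(unit_weights, inputs)) + b
            for unit_weights, b in zip(weights, biases)]


x = [1.0, 2.0]

# Hidden layer with two units, then an output layer with one unit.
hidden = relu(dense(x, [[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0]))  # [0.0, 2.0]
output = dense(hidden, [[1.0, 0.5]], [0.1])                      # about [1.1]
```

Each unit is just the weighted sum from linear regression; the non-linearity between layers is what lets the stack represent patterns a single linear model cannot.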

Situation                          | Start with
Tabular data, need a baseline      | Logistic / linear regression
Tabular data, want best accuracy   | Gradient boosting (XGBoost / LightGBM)
Need a human-explainable model     | Linear model or a shallow decision tree
Images, audio, or text             | A neural network
No labels, want groups             | k-Means or hierarchical clustering
Small dataset, simple relationship | kNN or linear regression

Feature engineering: still the highest-leverage work

A feature is an input variable. Feature engineering is transforming raw data into inputs that expose the signal — and for classical ML it routinely matters more than the algorithm choice.

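# Raw timestamp -> features a model can actually use.
# Assumes df is a pandas DataFrame whose "ts" and "signup" columns are datetimes.
df["hour"] = df["ts"].dt.hour
df["is_weekend"] = df["ts"].dt.dayofweek >= 5  # Monday is 0, so 5 and 6 are Sat/Sun
df["days_since_signup"] = (df["ts"] - df["signup"]).dt.days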

Common transformations: encoding categories as numbers (one-hot, target encoding), scaling numeric ranges, bucketing continuous values, extracting parts of dates, and combining columns into ratios.
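
Several of these transformations fit in a few lines each. The sketch below uses plain Python so the mechanics are visible; the category names, bucket edges, and sample values are invented for illustration. With pandas you would more likely reach for helpers such as `pd.get_dummies` and `pd.cut`.

```python
def one_hot(value, categories):
    """Encode a category as a 0/1 vector with one slot per known category."""
    return [1 if value == c else 0 for c in categories]


def min_max_scale(values):
    """Rescale numbers to the 0-1 range (assumes they are not all equal)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def bucket(value, edges):
    """Index of the first edge the value falls below, or len(edges) if none."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)


def ratio(numerator, denominator):
    """Combine two columns into a ratio, guarding against division by zero."""
    return numerator / denominator if denominator else 0.0


plan = one_hot("pro", ["free", "pro", "team"])   # [0, 1, 0]
scaled = min_max_scale([10, 20, 30])             # [0.0, 0.5, 1.0]
age_band = bucket(37, [18, 30, 50])              # 2 (the 30 to 49 band)
spend_per_visit = ratio(30, 120)                 # 0.25
```

Target encoding is left out here because it needs the labels and careful handling to avoid leaking them into the features.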

Start every tabular problem with a linear baseline, then try gradient boosting — it wins most structured-data tasks. Use neural networks for images, audio, and text. Use k-Means when you have no labels. And remember that for classical ML, thoughtful feature engineering often beats a fancier algorithm.