Training Neural Networks

Training a neural network is the training loop from AI Fundamentals scaled up. The new ingredient is backpropagation — the algorithm that figures out how to adjust millions of parameters at once.

After a forward pass produces a prediction, the loss function scores how wrong it was. Backpropagation then works backward through the layers, using the chain rule of calculus to compute — for every single parameter — how much it contributed to the error. That per-parameter blame is the gradient. The optimizer then nudges each parameter against its gradient. Forward pass, backward pass, update: that cycle, repeated millions of times, is training.
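
To make the chain rule concrete, here is a minimal sketch in plain Python, assuming a toy two-parameter network (`w1`, `w2`, the ReLU, and the squared-error loss are all illustrative choices, not something this page prescribes). It runs a forward pass, walks backward to get each parameter's gradient, checks one gradient numerically, and applies a single update.

```python
# Toy network (illustrative): h = relu(w1 * x), y = w2 * h, loss = (y - target)^2

def forward(w1, w2, x, target):
    z = w1 * x                      # pre-activation
    h = max(z, 0.0)                 # ReLU
    y = w2 * h                      # prediction
    loss = (y - target) ** 2        # squared error
    cache = (x, z, h, y, target)    # intermediates the backward pass will need
    return loss, cache

def backward(w2, cache):
    x, z, h, y, target = cache
    dloss_dy = 2.0 * (y - target)                    # d(loss)/dy
    dloss_dw2 = dloss_dy * h                         # because y = w2 * h
    dloss_dh = dloss_dy * w2                         # pass the blame back to h
    dloss_dz = dloss_dh * (1.0 if z > 0 else 0.0)    # ReLU derivative
    dloss_dw1 = dloss_dz * x                         # because z = w1 * x
    return dloss_dw1, dloss_dw2

w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
loss, cache = forward(w1, w2, x, target)
g1, g2 = backward(w2, cache)

# Sanity check: compare the analytic gradient with a finite-difference estimate.
eps = 1e-6
numeric_g1 = (forward(w1 + eps, w2, x, target)[0]
              - forward(w1 - eps, w2, x, target)[0]) / (2 * eps)

# One optimizer update: move each parameter against its gradient.
lr = 0.01
new_loss, _ = forward(w1 - lr * g1, w2 - lr * g2, x, target)
print(loss, g1, g2, numeric_g1, new_loss)
```

Note the `cache` tuple: the backward pass cannot run without the values saved during the forward pass.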

You will essentially never implement this yourself: frameworks like PyTorch compute gradients automatically (autograd). But knowing how it works explains why training needs so much memory: the framework has to keep the intermediate values from the forward pass (the activations) around until the backward pass has used them.

The optimizer decides how to apply gradients.

  • SGD (stochastic gradient descent) — step directly against the gradient. Simple, sometimes still best for vision models.
  • Adam / AdamW — adapts the step size per-parameter and smooths updates with momentum. The default for training transformers and LLMs.
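
The difference between the two update rules is easier to see on a single scalar parameter. The sketch below is a simplified, assumed version of the standard formulas (the function names, the `state` dictionary, and the default values are illustrative; real framework optimizers operate on whole tensors and have more options).

```python
import math

def sgd_step(param, grad, lr):
    # Plain SGD: step directly against the gradient.
    return param - lr * grad

def new_adam_state():
    # Per-parameter state: step count, running mean of gradients (m),
    # running mean of squared gradients (v).
    return {"t": 0, "m": 0.0, "v": 0.0}

def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    state["t"] += 1
    t = state["t"]
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad          # momentum
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad   # gradient scale
    m_hat = state["m"] / (1 - beta1 ** t)                         # bias correction
    v_hat = state["v"] / (1 - beta2 ** t)
    # AdamW-style decoupled weight decay (a no-op when weight_decay is 0).
    param = param - lr * weight_decay * param
    # The step is normalized per parameter by the gradient's running scale.
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

# A huge and a tiny gradient produce very different SGD steps...
print(1.0 - sgd_step(1.0, 100.0, lr=1e-3), 1.0 - sgd_step(1.0, 0.01, lr=1e-3))
# ...but nearly identical first Adam steps, each about lr in size.
print(1.0 - adam_step(1.0, 100.0, new_adam_state()),
      1.0 - adam_step(1.0, 0.01, new_adam_state()))
```

Adam also stores two extra numbers (`m` and `v`) for every parameter, which is where the optimizer's memory cost comes from.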

The learning rate — the step size — is the single most important knob. Too high and training diverges into nonsense; too low and it crawls or gets stuck. Real training uses a schedule: warm up the learning rate, then decay it.
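
One common shape for such a schedule is linear warmup followed by cosine decay. The sketch below assumes that shape; the function name `lr_at` and the specific numbers (peak rate, warmup length, total steps) are placeholders, not recommendations.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=100, total_steps=1000, min_lr=0.0):
    """Learning rate for a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp up linearly from near zero to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine curve from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in (0, 50, 99, 100, 550, 1000):
    print(step, lr_at(step))
```

In a training loop you would call something like this once per step and hand the result to the optimizer before the update.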

You don’t feed the whole dataset in at once — it wouldn’t fit in memory.

  • Batch — a small group of examples (e.g. 32) processed together.
  • Step — one forward + backward pass on one batch; parameters update once.
  • Epoch — one full pass over the entire training dataset.

    for epoch in range(epochs):
        for batch in dataloader:               # one step per batch
            preds = model(batch.inputs)        # forward pass
            loss = loss_fn(preds, batch.labels)
            loss.backward()                    # backprop: compute gradients
            optimizer.step()                   # apply the update
            optimizer.zero_grad()              # reset for the next batch
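
If you want to run the batch/step/epoch structure without a framework, here is a self-contained sketch in plain Python. It assumes a deliberately tiny problem (fitting `y = 2x + 1` with two parameters and hand-written gradients); the dataset, learning rate, and batch size are made up for illustration.

```python
import random

random.seed(0)

# Synthetic dataset of 40 examples drawn from y = 2x + 1.
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-20, 20)]

w, b = 0.0, 0.0                       # the two learnable parameters
lr, batch_size, epochs = 0.1, 8, 50
steps = 0

for epoch in range(epochs):           # one epoch = one full pass over the data
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad_w = grad_b = 0.0
        for x, y in batch:
            err = (w * x + b) - y                 # forward pass: prediction error
            grad_w += 2 * err * x / len(batch)    # gradient of mean squared error
            grad_b += 2 * err / len(batch)
        w -= lr * grad_w                          # one update per batch = one step
        b -= lr * grad_b
        steps += 1

print(f"steps={steps} w={w:.4f} b={b:.4f}")
```

With 40 examples and a batch size of 8 there are 5 steps per epoch, so 50 epochs is 250 parameter updates.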

Batch size trades off speed and stability: larger batches use the GPU more efficiently but need more memory and can generalize slightly worse.

Deep networks have enough capacity to memorize their training data. Regularization keeps them honest:

  • Dropout — randomly switch off a fraction of neurons at each training step, so the network can’t lean on any single one. It is turned off at inference time. One of the most widely used techniques.
  • Weight decay — gently push weights toward zero, discouraging overly complex fits.
  • Early stopping — watch validation loss; stop when it starts rising even as training loss keeps falling.
  • Data augmentation — expand the dataset with realistic variations (crop, rotate, paraphrase) so the model sees more diversity.
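
Two of these are small enough to sketch directly. The code below assumes the usual "inverted" form of dropout (survivors are scaled up so the expected activation is unchanged) and a patience-based early-stopping rule; the function names and the `patience` parameter are illustrative, and the validation losses are invented numbers.

```python
import random

def dropout(activations, p, training, rng=random):
    """Inverted dropout: zero each value with probability p, scale the rest by 1/(1-p)."""
    if not training or p == 0.0:
        return list(activations)      # no-op at inference time
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:    # no improvement for `patience` epochs
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

print(dropout([1.0, 2.0, 3.0, 4.0], p=0.5, training=True))
print(dropout([1.0, 2.0, 3.0, 4.0], p=0.5, training=False))

# Made-up validation curve: improves, then starts rising.
stop_epoch, best_epoch = early_stopping_epoch([1.0, 0.8, 0.7, 0.72, 0.75, 0.9])
print(stop_epoch, best_epoch)
```

In practice you would also save the model's weights at `best_epoch` and restore them after stopping.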

Parameters are learned; hyperparameters are chosen by you before training: learning rate, batch size, number of layers, dropout rate, number of epochs. Tuning them is empirical — you try combinations and compare validation scores. Start from known-good defaults and change one thing at a time.
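
"Change one thing at a time" can be written down as a list of configurations to try. This is only a sketch: the baseline values and candidate lists are placeholders, and the actual training and validation of each configuration is left to you.

```python
# Hypothetical baseline and alternatives, for illustration only.
baseline = {"learning_rate": 3e-4, "batch_size": 32, "dropout": 0.1}
candidates = {
    "learning_rate": [1e-4, 1e-3],
    "batch_size": [16, 64],
    "dropout": [0.0, 0.3],
}

def one_at_a_time(baseline, candidates):
    """Baseline first, then every config that differs from it in exactly one hyperparameter."""
    configs = [dict(baseline)]
    for name, values in candidates.items():
        for value in values:
            if value != baseline[name]:
                configs.append({**baseline, name: value})
    return configs

configs = one_at_a_time(baseline, candidates)
for config in configs:
    print(config)   # train with each config, then compare validation scores
```

Because each run differs from the baseline in a single setting, any change in validation score can be attributed to that setting.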

Training is billions of matrix multiplications. A CPU has a few dozen powerful cores built for sequential work; a GPU has thousands of simpler cores built for doing the same operation on lots of data at once — exactly the shape of neural network math.

The practical limit is usually memory. The GPU must hold the model parameters, the gradients, the optimizer state, and every forward-pass intermediate — simultaneously. Run out of GPU memory and training stops. This is why large models are trained across many GPUs at once, and it’s the central constraint of AI Infrastructure.
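
A back-of-the-envelope estimate shows how quickly this adds up. The sketch below assumes everything is stored in 32-bit floats (4 bytes each) and an Adam-style optimizer that keeps two extra values per parameter; it deliberately leaves out the forward-pass intermediates, which depend on batch size and architecture. The function name and defaults are illustrative.

```python
def training_memory_gb(num_params, bytes_per_value=4, optimizer_values_per_param=2):
    """Rough memory for weights + gradients + optimizer state, in GB (activations excluded)."""
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    optimizer_state = num_params * bytes_per_value * optimizer_values_per_param
    return (weights + gradients + optimizer_state) / 1e9

# Under these assumptions, a 7-billion-parameter model needs 16 bytes per parameter.
print(training_memory_gb(7_000_000_000))   # 112.0
```

That is more than a single typical GPU holds, before counting any activations. Mixed precision and other techniques change the per-parameter numbers, so treat this as an order-of-magnitude estimate only.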

Backpropagation computes a per-parameter gradient by working backward from the loss; the optimizer (usually AdamW) applies it. Training proceeds in batches, steps, and epochs. Regularization (dropout, weight decay, early stopping, data augmentation) prevents memorization. Hyperparameters are tuned empirically. GPUs are the practical choice because the work is massively parallel matrix math, and GPU memory is the ceiling on how large a model you can train.