Training Neural Networks

Training a neural network is the training loop from AI Fundamentals scaled up. The new ingredient is backpropagation — the algorithm that figures out how to adjust millions of parameters at once.

Backpropagation in one paragraph

After a forward pass produces a prediction, the loss function scores how wrong it was. Backpropagation then works backward through the layers, using the chain rule of calculus to compute — for every single parameter — how much it contributed to the error. That per-parameter blame is the gradient. The optimizer then nudges each parameter against its gradient. Forward pass, backward pass, update: that cycle, repeated millions of times, is training.

You will essentially never implement this — frameworks like PyTorch compute gradients automatically (autograd). But knowing it exists explains why training needs so much memory: the framework must remember every intermediate value from the forward pass to compute the backward pass.

Optimizers

The optimizer decides how to apply gradients.

SGD (stochastic gradient descent) — step directly against the gradient. Simple, sometimes still best for vision models.
Adam / AdamW — adapts the step size per-parameter and smooths updates with momentum. The default for training transformers and LLMs.

The learning rate — the step size — is the single most important knob. Too high and training diverges into nonsense; too low and it crawls or gets stuck. Real training uses a schedule: warm up the learning rate, then decay it.

Batches, epochs, and steps

You don’t feed the whole dataset in at once — it wouldn’t fit in memory.

Batch — a small group of examples (e.g. 32) processed together.
Step — one forward + backward pass on one batch; parameters update once.
Epoch — one full pass over the entire training dataset.

for epoch in range(epochs):
    for batch in dataloader:          # one step per batch
        preds = model(batch.inputs)
        loss  = loss_fn(preds, batch.labels)
        loss.backward()               # backprop: compute gradients
        optimizer.step()              # apply the update
        optimizer.zero_grad()         # reset for the next batch

Batch size trades off speed and stability: larger batches use the GPU more efficiently but need more memory and can generalize slightly worse.

Regularization: fighting overfitting

Deep networks have enough capacity to memorize their training data. Regularization keeps them honest:

Dropout — randomly switch off a fraction of neurons each step, so the network can’t lean on any single one. The most common technique.
Weight decay — gently push weights toward zero, discouraging overly complex fits.
Early stopping — watch validation loss; stop when it starts rising even as training loss keeps falling.
Data augmentation — expand the dataset with realistic variations (crop, rotate, paraphrase) so the model sees more diversity.

Hyperparameters

Parameters are learned; hyperparameters are chosen by you before training: learning rate, batch size, number of layers, dropout rate, number of epochs. Tuning them is empirical — you try combinations and compare validation scores. Start from known-good defaults and change one thing at a time.

Why GPUs

Training is billions of matrix multiplications. A CPU has a few dozen powerful cores built for sequential work; a GPU has thousands of simpler cores built for doing the same operation on lots of data at once — exactly the shape of neural network math.

The practical limit is usually memory. The GPU must hold the model parameters, the gradients, the optimizer state, and every forward-pass intermediate — simultaneously. Run out of GPU memory and training stops. This is why large models are trained across many GPUs at once, and it’s the central constraint of AI Infrastructure.

Key takeaways

Backpropagation computes a per-parameter gradient by working backward from the loss; the optimizer (usually AdamW) applies it. Training proceeds in batches, steps, and epochs. Regularization — dropout, weight decay, early stopping — prevents memorization. Hyperparameters are tuned empirically. GPUs are mandatory because the work is massively parallel matrix math, and GPU memory is the ceiling on model size.