Key Architectures

The neuron is universal, but how you wire neurons together — the architecture — should match the structure of your data. Four families cover almost everything you’ll encounter.

MLP — the fully-connected network

The multi-layer perceptron is the basic design: every neuron in one layer connects to every neuron in the next. It treats input as a flat list of numbers with no assumed structure.
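
To make "every neuron connects to every neuron" concrete, here is a minimal sketch of a two-layer MLP forward pass in plain Python. The weights, the layer sizes, and the helper names (dense, relu, mlp) are hypothetical and hand-picked for illustration; a real network would learn its weights and use a numerical library.

```python
def dense(inputs, weights, biases):
    """One fully-connected layer: every output neuron sees every input."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]

def relu(values):
    """Common nonlinearity: negative activations become zero."""
    return [max(0.0, v) for v in values]

# Hypothetical hand-picked weights: 3 inputs -> 2 hidden neurons -> 1 output.
W1 = [[0.5, 1.0, 0.25],
      [1.0, -1.0, 0.5]]
b1 = [0.0, -2.0]
W2 = [[2.0, -1.0]]
b2 = [0.5]

def mlp(x):
    hidden = relu(dense(x, W1, b1))
    return dense(hidden, W2, b2)

x = [1.0, 2.0, 4.0]   # a flat list of numbers, no assumed structure
output = mlp(x)
print(output)         # [7.5]
```

Note that the input is just a list: shuffle its order (and the weight columns to match) and nothing about the computation changes, which is what "no assumed structure" means here.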

MLPs work on tabular data, though gradient boosting usually beats them there. Their real importance today is as components inside bigger architectures, including every transformer block.

CNN — convolutional neural network, for grids

A CNN is built for data with spatial structure: images, video, spectrograms. Instead of connecting everything to everything, it slides small learnable filters across the input, detecting local patterns regardless of where they appear.

This encodes two priors that suit images well: nearby pixels are related, and a feature (an edge, an eye) means the same thing anywhere in the frame. Stacked, CNN layers build the hierarchy edges → textures → parts → objects. CNNs drove the 2012–2017 computer-vision boom and still power many production vision systems.
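
The sliding-filter idea can be sketched in one dimension, assuming a short list of numbers stands in for a row of pixels. The function slide_filter and the two-value edge kernel are hypothetical names and values chosen for illustration; real CNNs learn their filters and usually slide them over 2-D grids.

```python
def slide_filter(signal, kernel):
    """Slide `kernel` along `signal`, taking a dot product at each position.

    This is a 'valid' cross-correlation, the operation deep-learning
    libraries usually call convolution.
    """
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# Hypothetical edge detector: responds where the value steps up.
edge_kernel = [-1.0, 1.0]

early_edge = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
late_edge = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]

early_response = slide_filter(early_edge, edge_kernel)
late_response = slide_filter(late_edge, edge_kernel)
print(early_response)  # [0.0, 1.0, 0.0, 0.0, 0.0]
print(late_response)   # [0.0, 0.0, 0.0, 1.0, 0.0]
```

The same two weights find the edge wherever it sits; only the position of the response moves. A fully-connected layer would need separate weights for each location.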

RNN — recurrent neural network, for sequences

An RNN processes a sequence one element at a time, carrying a hidden state — a memory — from each step to the next. That made it the natural choice for text and time series. LSTMs and GRUs are improved RNNs with gating that helps them remember longer.
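
As a rough sketch of that loop, here is a scalar vanilla RNN in plain Python. The weights w_h, w_x, and b and the function names are assumptions made for illustration; real RNNs use vectors and learned weight matrices, and LSTMs and GRUs add gates that this sketch leaves out.

```python
import math

def rnn_step(h_prev, x, w_h, w_x, b):
    """One step of a scalar vanilla RNN: mix the old state with the new input."""
    return math.tanh(w_h * h_prev + w_x * x + b)

def run_rnn(sequence, w_h=0.5, w_x=1.0, b=0.0):
    h = 0.0                      # initial hidden state: an empty memory
    states = []
    for x in sequence:           # strictly in order: each step needs the previous h
        h = rnn_step(h, x, w_h, w_x, b)
        states.append(h)
    return states

states = run_rnn([1.0, 0.0, 0.0])
print(states)
```

With a single nonzero input at the start, the hidden state still carries a trace of it two steps later, but that trace is smaller at every step.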

But RNNs have two serious flaws that the next architecture fixed:

They’re sequential. Step t needs step t−1, so training can’t be parallelized across the sequence, which makes it painfully slow at scale.
They forget. Information from far back in a long sequence gets diluted with each step.
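
The forgetting can be put in numbers with a deliberately simplified model: assume a linear recurrence h = w * h + x with a single hypothetical weight w below 1, feed in one input, then feed zeros. Real RNNs have nonlinearities and weight matrices, so treat this as an intuition aid rather than a faithful simulation.

```python
def first_input_influence(w, steps):
    """Weight left on the first input after `steps` updates of h = w * h + x."""
    h = 1.0                 # the first input, x_0 = 1
    for _ in range(steps):
        h = w * h           # every later input is 0, so only the decay remains
    return h

for steps in (1, 10, 100):
    print(steps, first_input_influence(0.9, steps))
```

The influence is w raised to the number of steps, so with w = 0.9 it falls to roughly a third after 10 steps and to a few hundred-thousandths after 100. The loop also shows the first flaw: each update needs the result of the one before it.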

Transformer — attention, and why it won

The 2017 paper “Attention Is All You Need” introduced the transformer, and it now underpins most of modern AI: LLMs, embedding models, even state-of-the-art vision.

Its core mechanism is self-attention. Instead of passing a sequence through step by step, the transformer looks at all positions at once and, for each token, computes how much it should “attend to” every other token. The word “it” can directly and instantly draw information from a noun thirty words earlier.
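
A minimal sketch of scaled dot-product self-attention in plain Python follows. It assumes the token vectors serve directly as queries, keys, and values; a real transformer derives those three from learned projections and runs several attention heads in parallel. The function names and the 2-d token vectors are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a whole sequence."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:                        # each token's row is independent of the others
        weights = softmax([dot(q, k) / scale for k in keys])
        mixed = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-d token vectors; tokens 0 and 2 are identical.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
print(weights[0])
```

Token 0 puts equal weight on itself and on token 2, and less on token 1, no matter how far apart they sit in the sequence. The loop over queries is written serially here, but no row depends on another, so a GPU can compute all of them at once.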

Two consequences made it dominant:

Parallelism. With no step-by-step recurrence, the whole sequence is processed simultaneously — a perfect fit for GPUs. This is what made internet-scale training feasible.
Long-range context. Any token can reach any other token directly, with no decay over distance.

Choosing an architecture

Data	Architecture
Tabular rows and columns	Gradient boosting; MLP as a fallback
Images, video, spectrograms	CNN (or a vision transformer)
Sequences (legacy or streaming systems)	RNN / LSTM
Text, code, and most modern AI	Transformer

In practice, as a builder you’ll work with transformers far more than you’ll choose an architecture — the foundation models you consume are transformers, and the choice was made for you.

Key takeaways

Match the architecture to the data’s structure: MLPs for flat input, CNNs for spatial grids, RNNs for sequences. The transformer replaced RNNs for text because self-attention is parallelizable (a good fit for GPUs) and connects distant tokens directly. The price is that every token attends to every other token, so attention cost scales quadratically with sequence length. That is a key driver of context limits and of per-token pricing throughout modern LLM systems.
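
The quadratic scaling is easy to check with arithmetic. The sketch below counts only the query-key pairs that full self-attention scores in one head of one layer; it ignores vector sizes, layer counts, and any optimizations, and the function name is hypothetical.

```python
def attention_score_count(sequence_length):
    """Query-key pairs scored by full self-attention (one head, one layer)."""
    return sequence_length * sequence_length

for n in (1_000, 2_000, 4_000):
    print(n, attention_score_count(n))
# 1000 1000000
# 2000 4000000
# 4000 16000000
```

Doubling the sequence length quadruples the number of scores, which is why long contexts get expensive quickly.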