How LLMs Work

A large language model does exactly one thing: given a sequence of text, it predicts the next token. Everything else — answering questions, writing code, holding a conversation — is that single capability, applied over and over.

Next-token prediction

At each step, the model outputs a probability distribution over its entire vocabulary: one probability per token, estimating how likely that token is to come next.

It picks a token (the rule for picking is called the decoding strategy), appends it to the input, and predicts again. This loop, called autoregressive generation, is why output streams in token by token, and why a long response takes longer to produce: each new token requires its own forward pass through the network.
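The loop can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not a real model: `VOCAB`, `toy_logits`, and `generate` are hypothetical names, and the "forward pass" is a hard-coded rule standing in for a neural network. What it does show faithfully is the shape of the process: scores become probabilities via softmax, one token is chosen, and the extended sequence is fed back in.

```python
import math
import random

# Hypothetical six-token vocabulary; real models use tens of thousands of tokens.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]


def toy_logits(tokens):
    """Stand-in for a forward pass: one raw score per vocabulary entry.

    Assumption: this toy just favors whichever token follows the last one
    in VOCAB order. A real model computes these scores with a network.
    """
    last = VOCAB.index(tokens[-1])
    target = (last + 1) % len(VOCAB)
    return [5.0 if i == target else 0.0 for i in range(len(VOCAB))]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt, max_new_tokens, greedy=True, seed=0):
    """Autoregressive loop: predict, pick, append, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(tokens))  # one "forward pass" per new token
        if greedy:
            # Greedy decoding: always take the most probable token.
            next_id = max(range(len(VOCAB)), key=probs.__getitem__)
        else:
            # Sampling: draw a token in proportion to its probability.
            next_id = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        tokens.append(VOCAB[next_id])
        if tokens[-1] == ".":  # treat "." as an end-of-sequence marker
            break
    return tokens


print(generate(["the"], max_new_tokens=10))
# ['the', 'cat', 'sat', 'on', 'mat', '.']
```

Switching `greedy=False` samples from the distribution instead of taking the top token, which is one reason the same prompt can produce different outputs on different runs.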

Pretraining: where the knowledge comes from

Pretraining is the massive, expensive phase. The model is shown trillions of tokens of text (web pages, books, code) and trained to predict the next token at every position. The training is self-supervised: the text itself supplies the correct answers, so no human labeling is needed.
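As a rough sketch of what "trained to predict the next token" means numerically: the usual objective is cross-entropy, the average negative log-probability the model assigned to the token that actually came next. The function name `next_token_loss` and the tiny three-token distributions below are made up for illustration; real training computes this over huge batches and uses it to update the network's weights.

```python
import math


def next_token_loss(predicted_probs, target_ids):
    """Average negative log-probability of the true next tokens.

    predicted_probs: one probability list per position (each sums to 1).
    target_ids: index of the token that actually came next at each position.
    """
    if len(predicted_probs) != len(target_ids):
        raise ValueError("need one target per position")
    total = 0.0
    for probs, target in zip(predicted_probs, target_ids):
        total += -math.log(probs[target])
    return total / len(target_ids)


targets = [0, 1]
# A model that puts most of its probability on the right tokens...
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
# ...versus one that spreads probability evenly.
unsure = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]

print(next_token_loss(confident, targets) < next_token_loss(unsure, targets))
# True
```

Lower loss means the model put more probability on what actually came next, which is the only signal pretraining optimizes.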

To get good at that prediction, the model is forced to internalize grammar, facts, writing styles, translation, and patterns of reasoning — because all of those help predict text. Knowledge is a side effect of compression. This phase costs millions of dollars and produces a base model: knowledgeable, but just an autocomplete engine — it’ll happily continue your question with ten more questions.

Post-training: making it useful

A base model is turned into an assistant through post-training:

Instruction tuning (SFT) — fine-tune on curated examples of instructions paired with good responses. The model learns to answer rather than continue.
Preference alignment (RLHF / DPO) — humans rank competing responses; the model is tuned to produce the kind humans prefer — helpful, harmless, appropriately formatted.
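To make preference tuning a little more concrete, here is a minimal sketch of the DPO loss for a single preference pair. It assumes you already have the summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under both the model being tuned and a frozen reference copy. The function name, argument names, and example numbers are illustrative, and RLHF with a separate reward model works differently.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the total log-probability of a whole response.
    beta controls how strongly the policy is pushed away from the reference.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: no preference learned yet.
baseline = dpo_loss(-13.0, -14.0, -13.0, -14.0)
# Policy has shifted probability toward the chosen response.
improved = dpo_loss(-12.0, -15.0, -13.0, -14.0)

print(improved < baseline)
# True
```

The loss falls as the tuned model raises the chosen response's likelihood relative to the rejected one, measured against the reference model rather than in absolute terms.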

When you call a chat model through an API, you’re using a post-trained model. Its “personality” and refusals are artifacts of this stage.

Emergent abilities

Some capabilities — multi-step reasoning, in-context learning, basic arithmetic — were never explicitly trained. They appeared once models crossed a certain scale of parameters and data. In-context learning is the most useful: showing a few examples in the prompt makes the model perform a task it was never fine-tuned for. That single property is what makes prompt engineering possible.
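In practice, in-context learning often comes down to careful string assembly. The sketch below builds a few-shot prompt for a sentiment task; the helper name `build_few_shot_prompt`, the `Review:` / `Sentiment:` labels, and the example reviews are all invented for illustration, and the resulting string would be sent to whichever model you use.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and an unanswered query."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # End with the same pattern left incomplete, so the most likely
    # continuation is the label for the new input.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    instruction="Classify each review as positive or negative.",
    examples=[
        ("Arrived quickly and works perfectly.", "positive"),
        ("Broke after two days.", "negative"),
    ],
    query="Better than I expected for the price.",
)
print(prompt)
```

The design leans on next-token prediction itself: the examples establish a pattern, and the prompt stops exactly where the model's continuation is the answer.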

What LLMs are good and bad at

Strong at                                            | Weak at
-----------------------------------------------------+------------------------------------------------
Fluent language: summarizing, rewriting, translating | Reliable facts (they hallucinate plausibly)
Transforming text from format A to B                 | Exact arithmetic and counting
Drafting and ideating from a prompt                  | Knowing about events after their training cutoff
Extracting structure from messy text                 | Knowing what they don’t know
Code generation and explanation                      | Truly deterministic, repeatable output

Hallucination, honestly

A hallucination is fluent, confident, wrong output. It is not a bug to be patched; it is intrinsic to how the model works. The model is optimized to produce plausible text, and nothing in generation checks for truth: tokens are ranked by likelihood, so a plausible falsehood can score as high as the correct answer.

You don’t eliminate hallucination; you engineer around it:

Ground the model in trusted data with retrieval-augmented generation (RAG) instead of relying on its memory.
Verify outputs against authoritative sources or with code.
Constrain the task — extraction and transformation hallucinate far less than open-ended recall.
Keep a human in the loop for high-stakes decisions.
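As one hedged illustration of "constrain" and "verify" working together, the sketch below checks a model's JSON extraction against the source text before trusting it. Everything here is hypothetical: the field names, the invoice text, and the hard-coded strings standing in for model output. The substring check is deliberately crude, and a real system would normalize formats and apply stricter validation.

```python
import json

# Hypothetical schema: the fields we asked the model to extract.
REQUIRED_FIELDS = {"invoice_number", "total"}


def verify_extraction(source_text, model_output):
    """Return (ok, problems) for a JSON extraction produced by a model."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError:
        return False, ["output is not valid JSON"]
    if not isinstance(data, dict):
        return False, ["output is not a JSON object"]

    problems = []
    for field in sorted(REQUIRED_FIELDS - data.keys()):
        problems.append(f"missing field: {field}")
    for field in sorted(REQUIRED_FIELDS & data.keys()):
        # Extraction should copy from the source, so every value must
        # appear verbatim there; anything else is treated as suspect.
        if str(data[field]) not in source_text:
            problems.append(f"value not found in source: {field}")
    return not problems, problems


source = "Invoice INV-1042, total due 318.50 USD."
good = '{"invoice_number": "INV-1042", "total": "318.50"}'
bad = '{"invoice_number": "INV-1043", "total": "318.50"}'

print(verify_extraction(source, good))
# (True, [])
print(verify_extraction(source, bad))
# (False, ['value not found in source: invoice_number'])
```

Outputs that fail the check can be retried or routed to a person, which is where the human-in-the-loop step fits.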

Key takeaways

An LLM predicts the next token, repeatedly — that’s the entire mechanism. Pretraining compresses internet-scale text into knowledge; post-training turns the result into a helpful assistant. Useful behaviors like in-context learning emerged at scale. LLMs excel at language transformation and struggle with facts, math, and self-knowledge. Hallucination is inherent — design systems that ground, verify, and constrain rather than trust.