How LLMs Work
A large language model does exactly one thing: given a sequence of text, it predicts the next token. Everything else — answering questions, writing code, holding a conversation — is that single capability, applied over and over.
Next-token prediction
The model outputs a probability distribution over its vocabulary: one probability per token, describing how likely that token is to come next.
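As a rough illustration of that output, the sketch below turns raw scores (logits) into probabilities with a softmax. The four-token vocabulary and the scores are invented for the example; a real model scores tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - top) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores for the token that follows "The cat sat on the"
logits = {"mat": 4.0, "floor": 3.0, "sofa": 2.5, "moon": 0.5}
probs = softmax(logits)

# Print the distribution, most likely token first
for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok:>6}: {p:.3f}")
```

Every token gets a non-zero probability; the model never says "impossible", only "unlikely".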
It picks a token (the picking strategy is called decoding), appends it to the input, and predicts again. This loop, called autoregressive generation, is why output streams in token by token, and why a long response genuinely takes longer to produce: each token requires a separate forward pass through the network.
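The loop can be sketched with a toy stand-in for the model. Everything here is hypothetical: `TABLE` replaces a real network and looks only at the last token, whereas a real LLM conditions on the whole sequence. Decoding is greedy (always take the most probable token), which is just one of several strategies.

```python
# Hypothetical lookup table standing in for a trained model.
TABLE = {
    "the": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 0.9, "ran": 0.1},
    "sat": {"down": 0.8, "up": 0.2},
    "down": {"</s>": 1.0},
}

def next_token_probs(tokens):
    """Toy 'forward pass': called once per generated token."""
    return TABLE[tokens[-1]]

def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    passes = 0
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        passes += 1
        next_tok = max(probs, key=probs.get)  # greedy decoding
        if next_tok == "</s>":  # end-of-sequence marker stops the loop
            break
        tokens.append(next_tok)  # the output becomes part of the next input
    return tokens, passes

tokens, passes = generate(["the"])
print(" ".join(tokens), f"({passes} forward passes)")
```

The pass counter makes the cost visible: the number of forward passes grows with the number of tokens produced.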
Pretraining: where the knowledge comes from
Pretraining is the massive, expensive phase. The model is shown trillions of tokens of text (web pages, books, code) and trained self-supervised to predict the next token at every position.
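"Self-supervised" means the labels come from the text itself: each position's target is simply the token that follows it. The sketch below builds those pairs for a four-token sequence and scores two invented predictions with cross-entropy, the loss commonly used for this objective. The probabilities are made up for illustration.

```python
import math

def training_pairs(tokens):
    """Each prefix is an input; the token that follows it is the label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(predicted_probs, target):
    """Loss at one position: negative log-probability given to the true next token."""
    return -math.log(predicted_probs[target])

tokens = ["the", "cat", "sat", "down"]
pairs = training_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)

# Hypothetical model outputs for the context ["the", "cat"]
confident = {"sat": 0.9, "ran": 0.1}
unsure = {"sat": 0.5, "ran": 0.5}
print("loss when confident and right:", round(cross_entropy(confident, "sat"), 3))
print("loss when unsure:", round(cross_entropy(unsure, "sat"), 3))
```

Training nudges the weights to lower this loss across all positions, so no human labeling is required.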
To get good at that prediction, the model is forced to internalize grammar, facts, writing styles, translation, and patterns of reasoning — because all of those help predict text. Knowledge is a side effect of compression. This phase costs millions of dollars and produces a base model: knowledgeable, but just an autocomplete engine — it’ll happily continue your question with ten more questions.
Post-training: making it useful
A base model is turned into an assistant through post-training:
- Instruction tuning (SFT) — fine-tune on curated examples of instructions paired with good responses. The model learns to answer rather than continue.
- Preference alignment (RLHF / DPO) — humans rank competing responses; the model is tuned to produce the kind humans prefer — helpful, harmless, appropriately formatted.
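To make the two stages concrete, here is a hedged sketch of what the training data might look like, plus one common way to turn a human ranking into a loss: the pairwise form often used when training RLHF reward models. DPO uses a related objective on log-probability ratios rather than this exact function. The records and scores are invented.

```python
import math

# Hypothetical SFT record: an instruction paired with a good response.
sft_example = {
    "instruction": "Summarize: The meeting moved from Monday to Wednesday.",
    "response": "The meeting is now on Wednesday.",
}

# Hypothetical preference record: two responses, ranked by a human.
preference_example = {
    "prompt": "Explain what an API is.",
    "chosen": "An API is a defined way for programs to talk to each other.",
    "rejected": "API.",
}

def preference_loss(score_chosen, score_rejected):
    """-log(sigmoid(chosen - rejected)): small when the preferred response scores higher."""
    margin = score_chosen - score_rejected
    return math.log(1.0 + math.exp(-margin))

# Scores are made up; in practice they come from a model being trained.
print("ranking respected:", round(preference_loss(2.0, 0.0), 3))
print("ranking violated: ", round(preference_loss(0.0, 2.0), 3))
```

Minimizing this loss pushes the score of the preferred response above the rejected one, which is how rankings become a training signal.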
When you call a chat model through an API, you’re using a post-trained model. Its “personality” and refusals are artifacts of this stage.
Emergent abilities
Some capabilities (multi-step reasoning, in-context learning, basic arithmetic) were never explicit training objectives. They appeared once models reached sufficient scale in parameters and data. In-context learning is the most useful: showing a few examples in the prompt lets the model perform a task it was never fine-tuned for. That single property is what makes prompt engineering possible.
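In practice, in-context learning amounts to careful string construction. The helper below is a minimal sketch that assembles a few-shot prompt; the task wording, the example reviews, and the function name `few_shot_prompt` are all illustrative assumptions, and no model is called.

```python
def few_shot_prompt(examples, query):
    """Build a prompt that demonstrates the task before asking the real question."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")  # left open for the model to complete
    return "\n".join(lines)

examples = [
    ("Arrived quickly and works perfectly.", "positive"),
    ("Broke after two days.", "negative"),
]
prompt = few_shot_prompt(examples, "Battery life is far better than I expected.")
print(prompt)
```

The prompt ends on an unfinished pattern, so the most likely continuation is a label in the same format as the examples.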
What LLMs are good and bad at
| Strong | Weak |
|---|---|
| Fluent language: summarizing, rewriting, translating | Reliable facts — they hallucinate plausibly |
| Transforming text from format A to B | Exact arithmetic and counting |
| Drafting and ideating from a prompt | Knowing recent events past their training cutoff |
| Extracting structure from messy text | Knowing what they don’t know |
| Code generation and explanation | Truly deterministic, repeatable output |
Hallucination, honestly
A hallucination is fluent, confident, wrong output. It is not a bug to be patched; it is intrinsic. The model is optimized to produce plausible text, and during generation a plausible falsehood can score just as well as the truth.
You don’t eliminate hallucination; you engineer around it:
- Ground the model in trusted data with RAG instead of relying on its memory.
- Verify outputs against authoritative sources or with code.
- Constrain the task — extraction and transformation hallucinate far less than open-ended recall.
- Keep a human in the loop for high-stakes decisions.
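The "verify" and "constrain" steps above can be as simple as checking that every value a model extracted actually appears in the source document. The sketch below assumes an extraction task with a flat dictionary as output; the invoice text, the field names, and the helper `unsupported_fields` are hypothetical.

```python
def unsupported_fields(extracted, source_text):
    """Return the names of fields whose values do not appear verbatim in the source."""
    haystack = source_text.lower()
    return [
        name
        for name, value in extracted.items()
        if str(value).lower() not in haystack
    ]

source = "Invoice 1042 was issued to Acme Corp on 2024-03-01 for 1,250.00 USD."

# Hypothetical model output: the due date is not stated anywhere in the source.
model_output = {
    "invoice_id": "1042",
    "customer": "Acme Corp",
    "due_date": "2024-04-01",
}

flagged = unsupported_fields(model_output, source)
if flagged:
    print("Needs human review:", flagged)
```

A verbatim match is a deliberately strict check: it can flag correct paraphrases, but anything it passes is at least grounded in the source, and flagged fields can be routed to a person.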
Key takeaways
- An LLM predicts the next token, repeatedly. That is the entire mechanism.
- Pretraining compresses internet-scale text into knowledge; post-training turns the result into a helpful assistant.
- Useful behaviors like in-context learning emerged at scale.
- LLMs excel at language transformation and struggle with facts, math, and self-knowledge.
- Hallucination is inherent: design systems that ground, verify, and constrain rather than trust.