How LLMs Work

A large language model does exactly one thing: given a sequence of text, it predicts the next token. Everything else — answering questions, writing code, holding a conversation — is that single capability, applied over and over.

The model outputs a probability for every token in its vocabulary being next:

Prompt: "The capital of France is ___"

Next token  | Probability
" Paris"    | 0.91
" a"        | 0.03
" located"  | 0.02
" home"     | 0.01
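One common way such a distribution is produced is a softmax over one raw score (a logit) per vocabulary token. The sketch below assumes that setup; the four-token vocabulary and the scores are made up for illustration and do not come from a real model.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and scores, for illustration only.
vocab = [" Paris", " a", " located", " home"]
logits = [6.0, 2.5, 2.0, 1.5]

probs = softmax(logits)
for token, p in sorted(zip(vocab, probs), key=lambda pair: -pair[1]):
    print(f"{token!r}: {p:.3f}")
```

A real vocabulary has tens of thousands of entries, so most of the probability mass that is not on the top few tokens is spread thinly across the rest.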

It picks a token (how it picks is called decoding), appends it to the input, and predicts again. This loop, called autoregressive generation, is why output streams in token by token, and why a long response takes longer to produce: each generated token requires its own forward pass through the network.
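The loop can be sketched in a few lines. This is a toy, not a real model: `TOY_MODEL`, `next_token_probs`, and the `<eos>` stop marker are hypothetical stand-ins, and the lookup conditions only on the last token, whereas a real LLM conditions on the whole sequence. Greedy decoding (always take the most likely token) is used here as the simplest decoding rule.

```python
# Hypothetical stand-in for a real model: maps the last token to a
# next-token distribution. A real LLM conditions on the whole sequence.
TOY_MODEL = {
    "The": {" capital": 0.9, " city": 0.1},
    " capital": {" of": 0.95, " is": 0.05},
    " of": {" France": 0.8, " Spain": 0.2},
    " France": {" is": 0.9, ",": 0.1},
    " is": {" Paris": 0.91, " a": 0.09},
    " Paris": {"<eos>": 1.0},
}

def next_token_probs(tokens):
    """One 'forward pass': a probability for each candidate next token."""
    return TOY_MODEL.get(tokens[-1], {"<eos>": 1.0})

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):        # one forward pass per new token
        probs = next_token_probs(tokens)
        token = max(probs, key=probs.get)  # greedy decoding: most likely token
        if token == "<eos>":               # end-of-sequence marker stops the loop
            break
        tokens.append(token)               # feed the output back in as input
    return tokens

print("".join(generate(["The"])))  # prints "The capital of France is Paris"
```

The cost structure is visible in the loop: producing N new tokens means N calls to the model.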

Pretraining: where the knowledge comes from

Pretraining is the massive, expensive phase. The model is shown trillions of tokens of text — web pages, books, code — and trained self-supervised to predict the next token at every position.

To get good at that prediction, the model is forced to internalize grammar, facts, writing styles, translation, and patterns of reasoning — because all of those help predict text. Knowledge is a side effect of compression. This phase costs millions of dollars and produces a base model: knowledgeable, but just an autocomplete engine — it’ll happily continue your question with ten more questions.
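The training signal described above can be made concrete. The sketch below shows how one token sequence yields a prediction example at every position, and how cross-entropy, a standard loss for next-token prediction, scores a single prediction. The function names and the two probability tables are assumptions made for illustration.

```python
import math

def training_pairs(tokens):
    """Every position yields one example: the context so far and the next token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(predicted_probs, target):
    """Loss for one position: low when the true next token got high probability."""
    return -math.log(predicted_probs[target])

tokens = ["The", " capital", " of", " France", " is", " Paris"]
for context, target in training_pairs(tokens):
    print(f"{''.join(context)!r} -> {target!r}")

# Hypothetical model outputs for the final position, for illustration only.
confident = {" Paris": 0.91, " a": 0.09}
unsure = {" Paris": 0.10, " a": 0.90}
print(cross_entropy(confident, " Paris"))  # small loss
print(cross_entropy(unsure, " Paris"))     # much larger loss
```

No labels are written by hand: the text itself supplies the target at every position, which is what "self-supervised" means here.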

A base model is turned into an assistant through post-training:

  1. Instruction tuning (SFT) — fine-tune on curated examples of instructions paired with good responses. The model learns to answer rather than continue.
  2. Preference alignment (RLHF / DPO) — humans rank competing responses; the model is tuned to produce the kind humans prefer — helpful, harmless, appropriately formatted.
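For the preference step, one concrete objective is the DPO loss for a single ranked pair. The sketch below assumes you already have the summed log-probability of each response under the model being tuned (the policy) and under a frozen reference copy. The argument names, the example numbers, and `beta=0.1` are illustrative choices, not values from any particular training run.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_gain = policy_chosen - ref_chosen        # how much the policy raised the preferred response
    rejected_gain = policy_rejected - ref_rejected  # how much it raised the dispreferred one
    margin = beta * (chosen_gain - rejected_gain)
    return math.log1p(math.exp(-margin))            # equals -log(sigmoid(margin))

# Before tuning, the policy matches the reference, so the margin is zero.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))

# After tuning, the policy favors the preferred response: the loss drops.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The loss falls only when the policy raises the preferred response relative to the rejected one, measured against the reference model, which keeps the tuned model from drifting arbitrarily far from its starting point.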

When you call a chat model through an API, you’re using a post-trained model. Its “personality” and refusals are artifacts of this stage.

Some capabilities, such as multi-step reasoning, in-context learning, and basic arithmetic, were never explicit training targets. They emerged as models grew in parameters and training data. In-context learning is the most useful: showing a few examples in the prompt lets the model perform a task it was never fine-tuned for. That property is what makes few-shot prompting, and much of prompt engineering, possible.
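In practice, in-context learning usually means building a few-shot prompt. A minimal sketch follows, assuming a plain `Input:` / `Output:` layout; the helper name `few_shot_prompt` and the layout are illustrative conventions, not a required format.

```python
def few_shot_prompt(examples, query, instruction=""):
    """Build a prompt that shows worked examples, then leaves the last answer open."""
    blocks = [instruction] if instruction else []
    for text, label in examples:
        blocks.append(f"Input: {text}\nOutput: {label}")
    blocks.append(f"Input: {query}\nOutput:")  # the model continues from here
    return "\n\n".join(blocks)

examples = [
    ("The battery died after two days.", "negative"),
    ("Setup took thirty seconds and it just works.", "positive"),
]
prompt = few_shot_prompt(
    examples,
    "The screen is gorgeous but the speakers are tinny.",
    instruction="Classify the sentiment of each review.",
)
print(prompt)
```

The model's weights do not change. The examples only shape what the most likely continuation of the prompt looks like.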

Strong                                                | Weak
Fluent language: summarizing, rewriting, translating  | Reliable facts (they hallucinate plausibly)
Transforming text from format A to B                  | Exact arithmetic and counting
Drafting and ideating from a prompt                   | Knowing recent events past their training cutoff
Extracting structure from messy text                  | Knowing what they don’t know
Code generation and explanation                       | Truly deterministic, repeatable output

A hallucination is fluent, confident, wrong output. It is not a bug to be patched; it is intrinsic. The model is optimized to produce plausible text, and nothing in the generation step checks truth, so a plausible falsehood can score just as well as the truth.

You don’t eliminate hallucination; you engineer around it:

  • Ground the model in trusted data with RAG instead of relying on its memory.
  • Verify outputs against authoritative sources or with code.
  • Constrain the task — extraction and transformation hallucinate far less than open-ended recall.
  • Keep a human in the loop for high-stakes decisions.
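The first two items in the list above can be sketched together. This is a toy under stated assumptions: retrieval here is plain word overlap, where a real RAG system would typically use embeddings or a search index, and all function names, the prompt wording, and the three documents are made up for illustration.

```python
import re

def words(text):
    """Lowercased word set, used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, k=1):
    """Rank documents by how many words they share with the question (toy retrieval)."""
    q = words(question)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:k]

def grounded_prompt(question, passages):
    """Ground: put trusted text in the prompt instead of relying on model memory."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def is_supported(answer, passages):
    """Verify: accept an answer only if it appears verbatim in a retrieved passage."""
    return any(answer.lower() in p.lower() for p in passages)

documents = [
    "Paris is the capital of France.",
    "Madrid is the capital of Spain.",
    "The Seine flows through Paris.",
]
question = "What is the capital of France?"
passages = retrieve(question, documents)

print(grounded_prompt(question, passages))
print(is_supported("Paris", passages))  # True
print(is_supported("Lyon", passages))   # False
```

The verbatim check is deliberately strict and only suits extraction-style answers; it illustrates the idea of checking output against a source rather than trusting it.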

An LLM predicts the next token, repeatedly — that’s the entire mechanism. Pretraining compresses internet-scale text into knowledge; post-training turns the result into a helpful assistant. Useful behaviors like in-context learning emerged at scale. LLMs excel at language transformation and struggle with facts, math, and self-knowledge. Hallucination is inherent — design systems that ground, verify, and constrain rather than trust.