Adapting LLMs
A foundation model is general. Your task is specific. Adaptation closes that gap, and a common mistake in LLM engineering is reaching for the heaviest tool first.
The three levers
There are three ways to adapt an LLM, in increasing order of cost and effort:
- Prompting / in-context learning — put instructions and examples in the prompt. No training. Covered in Prompt Engineering.
- Retrieval-augmented generation (RAG) — fetch relevant data at request time and insert it into the prompt. Covered in RAG.
- Fine-tuning — actually update the model’s weights on your examples.
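The first two levers change the prompt, not the model. The sketch below shows both in one place. It assumes a hypothetical `build_prompt` helper and a toy word-overlap `retrieve` function standing in for a real search index; the documents and question are made up for illustration.

```python
def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank documents by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(instructions, examples, context, question):
    """Assemble instructions, few-shot examples, and retrieved context."""
    parts = [instructions]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    if context:
        parts.append("Context:\n" + "\n".join(context))
    parts.append(f"Input: {question}\nOutput:")
    return "\n\n".join(parts)


# Illustrative data only.
docs = [
    "Refunds are issued within 14 days of purchase.",
    "Shipping takes 3 to 5 business days.",
]
question = "How many days do refunds take?"

prompt = build_prompt(
    instructions="Answer using only the context.",                   # prompting
    examples=[("What is your name?", "I am a support assistant.")],  # in-context example
    context=retrieve(question, docs),                                # RAG
    question=question,
)
print(prompt)
```

Neither step touches the model's weights, which is why both are cheaper to try and to undo than fine-tuning.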
The decision framework
Teams often jump to fine-tuning when prompting plus RAG would have solved the problem faster and cheaper. Diagnose by the kind of gap:
| The problem is… | The fix is… |
|---|---|
| The model doesn’t know what to do | Better prompting — clearer instructions, examples |
| The model lacks facts / private data | RAG — retrieve and inject the data |
| The data changes constantly | RAG — re-index, no retraining |
| The model can’t match a format, tone, or style | Fine-tuning |
| You need a narrow task done cheaper/faster on a small model | Fine-tuning |
| Prompts have grown huge and repetitive | Fine-tuning — bake the behavior in |
How fine-tuning works
Fine-tuning is the training loop again, but started from a pretrained model and run on a small, curated dataset of input→output examples for your task.
Full fine-tuning updates every parameter — expensive in compute and memory, and it produces a whole new copy of the model. Almost nobody needs it.
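To make "the training loop again, but started from a pretrained model" concrete, here is a toy sketch. It assumes a two-parameter linear model in place of an LLM. The "pretrained" values and the task data are invented for illustration, and every parameter is updated, as in full fine-tuning.

```python
def mse(weights, examples):
    """Mean squared error of y = w * x + b on the examples."""
    w, b = weights
    return sum(((w * x + b) - y) ** 2 for x, y in examples) / len(examples)


def fine_tune(weights, examples, learning_rate=0.05, epochs=500):
    """Plain gradient descent on MSE, starting from the given weights."""
    w, b = weights
    n = len(examples)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in examples:
            error = (w * x + b) - y
            grad_w += 2 * error * x / n
            grad_b += 2 * error / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical "pretrained" parameters, good for the general task y = 2x.
pretrained = (2.0, 0.0)
# Small curated dataset for the specific task y = 2x + 1.
task_examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

tuned = fine_tune(pretrained, task_examples)
print("loss before:", mse(pretrained, task_examples))
print("loss after: ", mse(tuned, task_examples))
```

The loop is ordinary training. Only the starting point and the dataset differ, so a handful of examples is enough to shift the model toward the task.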
LoRA and parameter-efficient fine-tuning
LoRA (Low-Rank Adaptation) is the practical standard. Instead of updating billions of parameters, it freezes the original model and trains a small set of new low-rank “adapter” matrices alongside it.
- Trains on far less hardware — often a single GPU.
- The adapter is megabytes, not gigabytes.
- One base model can serve many swappable LoRA adapters.
This is what people usually mean by “fine-tuning” in practice.
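A minimal sketch of the idea in plain Python, assuming a single weight matrix and a row-vector convention (`x @ W`). The helper names are hypothetical, and real projects would use a library, not hand-rolled matrices. The frozen weight `W` is left untouched while two small matrices `A` and `B` carry all the trainable parameters.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, base_weight, lora_a, lora_b, scale=1.0):
    """Compute x @ W + scale * (x @ A) @ B, with W frozen."""
    base_out = matmul(x, base_weight)
    adapter_out = matmul(matmul(x, lora_a), lora_b)
    return [
        [base + scale * delta for base, delta in zip(base_row, adapter_row)]
        for base_row, adapter_row in zip(base_out, adapter_out)
    ]


def trainable_params(d_in, d_out, rank):
    """Parameters in A (d_in x rank) plus B (rank x d_out)."""
    return rank * (d_in + d_out)


rng = random.Random(0)
d_in, d_out, rank = 8, 8, 2
base_weight = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]  # frozen
lora_a = [[rng.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]     # trained
lora_b = [[0.0] * d_out for _ in range(rank)]                                  # trained, starts at zero

x = [[1.0] * d_in]
# With B initialised to zero the adapter contributes nothing yet,
# so the adapted layer starts out identical to the base layer.
assert lora_forward(x, base_weight, lora_a, lora_b) == matmul(x, base_weight)

# Illustrative sizes: one 4096 x 4096 matrix versus its rank-8 adapter.
full_params = 4096 * 4096
adapter_params = trainable_params(4096, 4096, 8)
print(full_params, adapter_params, full_params // adapter_params)
```

For these illustrative sizes the adapter holds 256 times fewer parameters than the matrix it adapts, which is why adapters are small and one base model can serve many of them.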
Data is the hard part
Fine-tuning quality is bounded by data quality. A few hundred to a few thousand clean, consistent examples usually beat tens of thousands of noisy ones, because the model imitates whatever you show it, flaws included. Hold some examples back as a test set, just like in any ML project.
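One way to hold examples back is sketched below. The `split_examples` helper is hypothetical and the 20% test fraction is an arbitrary default. It removes exact duplicates before splitting, because a duplicate that lands on both sides of the split inflates the test score.

```python
import random


def split_examples(examples, test_fraction=0.2, seed=0):
    """Drop exact duplicates, shuffle deterministically, and hold out a test set."""
    unique = list(dict.fromkeys(examples))  # keeps first occurrence, preserves order
    random.Random(seed).shuffle(unique)
    test_size = max(1, int(len(unique) * test_fraction))
    return unique[test_size:], unique[:test_size]


# Placeholder (input, output) pairs for illustration.
examples = [(f"input {i}", f"output {i}") for i in range(10)]
examples.append(examples[0])  # an accidental duplicate

train_set, test_set = split_examples(examples)
print(len(train_set), "training examples,", len(test_set), "held out")
```

A fixed seed keeps the split stable across runs, so pass rates from different fine-tuning attempts stay comparable.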
Evaluating LLM output
You cannot improve what you can’t measure, and “it looked good in the playground” is not measurement. Build an evaluation set before you optimize anything.
Methods, from cheapest to most expensive:
- Programmatic checks — is it valid JSON? Does it contain the required fields? Is the extracted ID a real ID? Fast, deterministic, automate everything you can.
- LLM-as-judge — use a strong model to score outputs against a rubric for qualities like faithfulness or helpfulness. Scalable; calibrate it against human ratings.
- Human review — the gold standard for nuance and high-stakes output. Slow and costly; reserve it for what automation can’t catch.
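Calibrating an LLM judge against human ratings can start with something as simple as the sketch below. It assumes both raters assign one categorical label per case. The labels are made up, and Cohen's kappa is one common choice of chance-corrected agreement, not a method prescribed here.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of cases where the judge and the human gave the same label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def cohens_kappa(judge_labels, human_labels):
    """Agreement corrected for the agreement expected by chance."""
    n = len(human_labels)
    observed = agreement_rate(judge_labels, human_labels)
    labels = set(judge_labels) | set(human_labels)
    expected = sum(
        (judge_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for illustration only.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]

print("agreement:", agreement_rate(judge, human))
print("kappa:", round(cohens_kappa(judge, human), 3))
```

If agreement is low, revise the rubric or the judge prompt before trusting the judge's scores at scale.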
```python
# A minimal eval harness: the most valuable 20 lines in an LLM project.
eval_set = [
    {"input": "...", "must_contain": "refund", "must_be_json": True},
    # ...dozens more cases, including known failures
]

for case in eval_set:
    out = run_my_llm_feature(case["input"])
    record(passed=check(out, case))

# Track the pass rate on every prompt or model change.
```

Run this on every prompt edit and model swap. Without it, “improvements” are guesses and regressions ship silently.
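The harness above leaves `run_my_llm_feature` and `check` undefined. A runnable sketch under stated assumptions follows: `run_my_llm_feature` is a stub standing in for the real model call, `check` implements only the two keys shown in the eval case (`must_contain`, `must_be_json`), and the eval inputs are invented for illustration.

```python
import json


def run_my_llm_feature(prompt: str) -> str:
    """Hypothetical stand-in: replace with the real model call."""
    return json.dumps({"answer": f"We can offer a refund for: {prompt}"})


def check(output: str, case: dict) -> bool:
    """Apply the programmatic checks named in the eval case."""
    if case.get("must_be_json"):
        try:
            json.loads(output)
        except json.JSONDecodeError:
            return False
    required = case.get("must_contain")
    if required is not None and required not in output:
        return False
    return True


def pass_rate(eval_set: list[dict]) -> float:
    """Run every case through the feature and return the fraction that pass."""
    results = [check(run_my_llm_feature(case["input"]), case) for case in eval_set]
    return sum(results) / len(results)


# Illustrative cases; the second is expected to fail against the stub.
eval_set = [
    {"input": "order 1234 arrived broken", "must_contain": "refund", "must_be_json": True},
    {"input": "where is my parcel?", "must_contain": "tracking", "must_be_json": True},
]

print(f"pass rate: {pass_rate(eval_set):.0%}")
```

Keeping a known failure in the set is deliberate: the pass rate should move when a prompt or model change fixes it, and that movement is the signal worth tracking.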
Key takeaways
Adapt with the lightest tool that works: prompt first, add RAG for missing knowledge, fine-tune only for missing behavior. Fine-tuning teaches form, not facts, and LoRA makes it cheap and modular. Whatever you choose, build an evaluation set first: programmatic checks where possible, LLM-as-judge for nuance, humans for the high-stakes cases.