Adapting LLMs

A foundation model is general. Your task is specific. Adaptation is closing that gap — and the most common mistake in LLM engineering is reaching for the heaviest tool first.

There are three ways to adapt an LLM, in increasing order of cost and effort:

  1. Prompting / in-context learning — put instructions and examples in the prompt. No training. Covered in Prompt Engineering.
  2. Retrieval-augmented generation (RAG) — fetch relevant data at request time and insert it into the prompt. Covered in RAG.
  3. Fine-tuning — actually update the model’s weights on your examples.
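
The first two approaches only change what goes into the prompt, which a short sketch can make concrete. The function names, the prompt wording, and the keyword-overlap retriever below are illustrative assumptions, not part of any library; a real RAG system would typically use embedding search rather than this toy ranking.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Approach 1: instructions plus worked examples, no training."""
    shots = "\n\n".join(f"Input: {inp}\nOutput: {out}" for inp, out in examples)
    return f"{instruction}\n\n{shots}\n\nInput: {query}\nOutput:"


def retrieve(query, documents, k=2):
    """Toy retriever (assumption): rank documents by shared lowercase words."""
    words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_rag_prompt(query, documents, k=2):
    """Approach 2: fetch relevant text at request time and insert it."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# Hypothetical documents, for illustration only.
docs = [
    "Refunds are issued within 14 days.",
    "Our office is in Berlin.",
    "Shipping takes 3 days.",
]
print(build_rag_prompt("when are refunds issued", docs, k=1))
```

In both cases the model's weights are untouched; only the text sent to it changes, which is why these approaches are cheap to try and cheap to undo.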

Teams often jump to fine-tuning when prompting plus RAG would have solved the problem faster and cheaper. Diagnose by the kind of gap:

The problem is…                                              | The fix is…
The model doesn’t know what to do                            | Better prompting: clearer instructions, examples
The model lacks facts / private data                         | RAG: retrieve and inject the data
The data changes constantly                                  | RAG: re-index, no retraining
The model can’t match a format, tone, or style               | Fine-tuning
You need a narrow task done cheaper/faster on a small model  | Fine-tuning
Prompts have grown huge and repetitive                       | Fine-tuning: bake the behavior in

Fine-tuning is the training loop again, but started from a pretrained model and run on a small, curated dataset of input→output examples for your task.

Full fine-tuning updates every parameter — expensive in compute and memory, and it produces a whole new copy of the model. Almost nobody needs it.

LoRA (Low-Rank Adaptation) is the practical standard. Instead of updating billions of parameters, it freezes the original model and trains a small set of new “adapter” weights alongside it: pairs of low-rank matrices whose product is added to selected weight matrices.

  • Trains on far less hardware — often a single GPU.
  • The adapter is megabytes, not gigabytes.
  • One base model can serve many swappable LoRA adapters.

[Figure: a frozen base model (billions of params) with LoRA adapter A (support-bot tone), LoRA adapter B (legal summary style), and LoRA adapter C (SQL generation). One frozen base model serves many small, swappable adapters, each just megabytes.]

This is what people usually mean by “fine-tuning” in practice.
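
To see why the adapter is so small, here is a minimal plain-Python sketch of the LoRA update for a single linear layer, h = W x + (alpha / r) · B (A x), where W stays frozen and only the small matrices A and B are trained. The helper names, the toy dimensions, and the 4096-wide layer with rank 8 are illustrative assumptions; in practice you would use a fine-tuning library rather than hand-rolled matrix code.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen path W x plus the scaled low-rank update (alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def param_counts(d_in, d_out, r):
    """Trainable parameters for one layer: full fine-tuning vs. a rank-r adapter."""
    return d_in * d_out, r * (d_in + d_out)


rng = random.Random(0)
d_in, d_out, r, alpha = 6, 4, 2, 4  # toy sizes, chosen only for illustration
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                               # trainable, d_out x r
x = [1.0] * d_in

# B starts at zero, so the adapter is a no-op until training changes it.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

full, lora = param_counts(4096, 4096, 8)
print(full, lora, full // lora)  # 16777216 65536 256
```

For this assumed layer, the adapter trains 256 times fewer parameters than full fine-tuning would, and swapping adapters means swapping only A and B while W stays shared.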

Fine-tuning quality is bounded by data quality. A few hundred to a few thousand clean, consistent examples beat tens of thousands of noisy ones, because the model imitates whatever you show it, flaws included. Hold some examples back as a test set, as in any ML project.
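
Holding examples back can be as simple as the sketch below. It removes exact duplicates first, so the same example cannot land in both the training and test sets, then shuffles with a fixed seed. The "input"/"output" field names, the 20% test fraction, and the function names are assumptions made for illustration.

```python
import random


def dedupe(examples):
    """Drop exact duplicate (input, output) pairs, keeping first occurrences."""
    seen, unique = set(), []
    for ex in examples:
        key = (ex["input"], ex["output"])
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique


def split_examples(examples, test_fraction=0.2, seed=0):
    """Return (train, test) after deduplicating and shuffling reproducibly."""
    shuffled = dedupe(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical examples, for illustration only.
examples = [{"input": f"question {i}", "output": f"answer {i}"} for i in range(10)]
examples.append(dict(examples[0]))  # an exact duplicate that should be dropped

train, test = split_examples(examples)
print(len(train), len(test))  # 8 2
```

Exact-match deduplication is only a first pass; near-duplicates and inconsistent labels still need a manual look.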

You cannot improve what you can’t measure, and “it looked good in the playground” is not measurement. Build an evaluation set before you optimize anything.

Methods, from cheapest to most expensive:

  • Programmatic checks — is it valid JSON? Does it contain the required fields? Is the extracted ID a real ID? Fast, deterministic, automate everything you can.
  • LLM-as-judge — use a strong model to score outputs against a rubric for qualities like faithfulness or helpfulness. Scalable; calibrate it against human ratings.
  • Human review — the gold standard for nuance and high-stakes output. Slow and costly; reserve it for what automation can’t catch.

    # A minimal eval harness: the most valuable 20 lines in an LLM project.
    # run_my_llm_feature() and check() are placeholders for your own code.
    eval_set = [
        {"input": "...", "must_contain": "refund", "must_be_json": True},
        # ...dozens more cases, including known failures
    ]
    results = []
    for case in eval_set:
        out = run_my_llm_feature(case["input"])
        results.append(check(out, case))
    # Track the pass rate on every prompt or model change.
    print(f"pass rate: {sum(results) / len(results):.0%}")
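
The harness leaves check() undefined. One possible implementation of it as a programmatic check, assuming the output is a string and using only the two case keys shown above ("must_contain" and "must_be_json"), might look like this:

```python
import json


def check(out, case):
    """Return True if the output meets every requirement the case declares."""
    if case.get("must_be_json"):
        try:
            json.loads(out)
        except (TypeError, ValueError):  # json.JSONDecodeError subclasses ValueError
            return False
    needle = case.get("must_contain")
    if needle is not None and needle not in out:
        return False
    return True


case = {"input": "...", "must_contain": "refund", "must_be_json": True}
print(check('{"action": "refund"}', case))  # True
print(check("We issued a refund.", case))   # False: not valid JSON
```

Each new kind of requirement becomes one more key and one more branch, which keeps the eval set readable as plain data.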

Run this on every prompt edit and model swap. Without it, “improvements” are guesses and regressions ship silently.
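
For the LLM-as-judge method, calibrating against human ratings can start with a simple agreement measurement on a sample both have labeled. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The pass/fail labels are made-up illustrative data and the function names are assumptions.

```python
from collections import Counter


def agreement(judge, human):
    """Fraction of cases where the judge and the human gave the same label."""
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def cohens_kappa(judge, human):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    p_o = agreement(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    p_e = sum(judge_counts[label] * human_counts[label] for label in labels) / (n * n)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)


# Hypothetical labels for six outputs, for illustration only.
judge = ["pass", "pass", "fail", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "pass", "fail", "pass"]

print(round(agreement(judge, human), 3))     # 0.667
print(round(cohens_kappa(judge, human), 3))  # 0.333
```

If agreement is low, revise the rubric or the judge prompt and re-measure before trusting the judge's scores at scale.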

Adapt with the lightest tool that works: prompt first, add RAG for missing knowledge, fine-tune only for missing behavior. Fine-tuning teaches form, not facts — and LoRA makes it cheap and modular. Whatever you choose, build an evaluation set first: programmatic checks where possible, LLM-as-judge for nuance, humans for the high-stakes cases.