Adapting LLMs

A foundation model is general. Your task is specific. Adaptation is closing that gap — and the most common mistake in LLM engineering is reaching for the heaviest tool first.

The three levers

There are three ways to adapt an LLM, in increasing order of cost and effort:

Prompting / in-context learning — put instructions and examples in the prompt. No training. Covered in Prompt Engineering.
Retrieval-augmented generation (RAG) — fetch relevant data at request time and insert it into the prompt. Covered in RAG.
Fine-tuning — actually update the model’s weights on your examples.
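
The first two levers only change what goes into the prompt, which a few lines of code can make concrete. The sketch below is illustrative, not a recommended implementation: the function names, the prompt layout, and the word-overlap retriever are all assumptions made for this example (a real RAG system would normally use embedding search). Fine-tuning has no equivalent here because it changes the model's weights rather than the prompt.

```python
def build_few_shot_prompt(instruction, examples, user_input):
    """Prompting: instructions plus worked examples, no external data."""
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{instruction}\n\n{shots}\n\nInput: {user_input}\nOutput:"


def retrieve(query, documents, k=1):
    """Toy retriever (assumption): rank documents by shared lowercase words."""
    words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_rag_prompt(instruction, documents, user_input):
    """RAG: fetch relevant text at request time and insert it into the prompt."""
    context = "\n".join(retrieve(user_input, documents))
    return f"{instruction}\n\nContext:\n{context}\n\nQuestion: {user_input}\nAnswer:"


docs = ["Refunds are issued within 14 days.", "Shipping takes 3 to 5 days."]
few_shot = build_few_shot_prompt(
    "Classify the sentiment.", [("Great service", "positive")], "Too slow"
)
rag = build_rag_prompt("Answer from the context only.", docs, "How long do refunds take?")
```

In both cases the model itself is untouched; only the text sent to it differs.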

The decision framework

Teams often jump to fine-tuning when prompting plus RAG would have solved the problem faster and cheaper. Diagnose by the kind of gap:

The problem is…	The fix is…
The model doesn’t know what to do	Better prompting — clearer instructions, examples
The model lacks facts / private data	RAG — retrieve and inject the data
The data changes constantly	RAG — re-index, no retraining
The model can’t match a format, tone, or style	Fine-tuning
You need a narrow task done cheaper/faster on a small model	Fine-tuning
Prompts have grown huge and repetitive	Fine-tuning — bake the behavior in
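
The table can also be read as a lookup from symptoms to levers. The sketch below encodes it that way; the symptom keys and the `recommend` function are hypothetical names invented for this example, and the ordering simply follows the table, lightest lever first.

```python
# (symptom key, recommended lever), in the same order as the table above.
# The keys are hypothetical labels, not part of any library.
FIXES = [
    ("unclear_instructions", "prompting"),
    ("missing_facts", "rag"),
    ("data_changes_often", "rag"),
    ("style_mismatch", "fine-tuning"),
    ("needs_small_fast_model", "fine-tuning"),
    ("prompt_too_long", "fine-tuning"),
]


def recommend(symptoms):
    """Return the levers to try, lightest first, without duplicates."""
    levers = [lever for key, lever in FIXES if key in symptoms]
    return list(dict.fromkeys(levers))  # dict keys preserve first-seen order


plan = recommend({"missing_facts", "data_changes_often", "style_mismatch"})
```

Here `plan` lists RAG before fine-tuning, so the cheaper fix is tried first.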

How fine-tuning works

Fine-tuning is the training loop again, but started from a pretrained model and run on a small, curated dataset of input→output examples for your task.
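
To make "the training loop again, but started from a pretrained model" concrete, here is a deliberately tiny stand-in. It is not an LLM: the "model" is a line with two parameters, the "pretrained" values and the three task examples are made up, and `fine_tune` is a hypothetical helper. The only point is that training resumes from existing weights on a small task dataset instead of starting from scratch.

```python
def fine_tune(weights, examples, lr=0.1, epochs=200):
    """Continue gradient descent from existing weights on (x, y) pairs."""
    w, b = weights
    for _ in range(epochs):
        for x, y in examples:
            err = (w * x + b) - y   # prediction error on this example
            w -= lr * err * x       # gradient of 0.5 * err**2 with respect to w
            b -= lr * err           # gradient of 0.5 * err**2 with respect to b
    return w, b


def mean_squared_error(weights, examples):
    w, b = weights
    return sum(((w * x + b) - y) ** 2 for x, y in examples) / len(examples)


pretrained = (2.0, 0.0)  # stand-in for weights learned elsewhere
task_examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # the task: y = 2x + 1

tuned = fine_tune(pretrained, task_examples)
print(mean_squared_error(pretrained, task_examples))
print(mean_squared_error(tuned, task_examples))
```

The starting weights already have the right slope, so a short run on three examples is enough to correct the offset. Real fine-tuning follows the same pattern at a vastly larger scale.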

Full fine-tuning updates every parameter. It is expensive in compute and memory, and it produces a whole new copy of the model. Most application teams do not need it.

LoRA and parameter-efficient fine-tuning

LoRA (Low-Rank Adaptation) is the practical standard. Instead of updating billions of parameters, it freezes the original model and trains a small set of new “adapter” weights alongside it: pairs of low-rank matrices whose product is added to selected frozen weight matrices.

Trains on far less hardware — often a single GPU.
The adapter is megabytes, not gigabytes.
One base model can serve many swappable LoRA adapters.

This is what people usually mean by “fine-tuning” in practice.
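
A little arithmetic shows why the adapter is so small. For a frozen weight matrix of shape d_out x d_in, LoRA trains two factors, B (d_out x r) and A (r x d_in), so it trains r * (d_out + d_in) parameters instead of d_out * d_in. The model shape below (hidden size 4096, 32 layers, two adapted matrices per layer, rank 8, 16-bit weights) is a hypothetical configuration chosen only for illustration.

```python
def full_params(d_out, d_in):
    """Parameters in one dense weight matrix W of shape (d_out, d_in)."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical model shape, for illustration only.
hidden, rank, layers, adapted_per_layer = 4096, 8, 32, 2

per_matrix_full = full_params(hidden, hidden)
per_matrix_lora = lora_params(hidden, hidden, rank)
adapter_params = per_matrix_lora * layers * adapted_per_layer
adapter_mib = adapter_params * 2 / 2**20  # 2 bytes per parameter at 16-bit precision

print(f"one matrix, full: {per_matrix_full:,} parameters")
print(f"one matrix, LoRA: {per_matrix_lora:,} parameters")
print(f"whole adapter:    {adapter_params:,} parameters, about {adapter_mib:.0f} MiB")
```

Under these assumptions each adapted matrix trains 256 times fewer parameters, and the whole adapter comes to about 8 MiB.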

Data is the hard part

Fine-tuning quality is bounded by data quality. A few hundred to a few thousand clean, consistent examples usually beat tens of thousands of noisy ones, because the model imitates whatever you show it, flaws included. Hold some examples back as a test set, just like in any ML project.
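
Holding examples back can be as simple as the sketch below. It assumes examples are (input, output) tuples; `split_examples`, the 20% test fraction, and the fixed seed are illustrative choices, not a standard. Exact duplicates are removed first so that the same example cannot appear on both sides of the split.

```python
import random


def split_examples(examples, test_fraction=0.2, seed=0):
    """Deduplicate, shuffle reproducibly, and return (train, test)."""
    unique = list(dict.fromkeys(examples))  # drops exact duplicates, keeps order
    rng = random.Random(seed)               # fixed seed makes the split repeatable
    rng.shuffle(unique)
    n_test = max(1, int(len(unique) * test_fraction))
    return unique[n_test:], unique[:n_test]


examples = [(f"question {i}", f"answer {i}") for i in range(10)]
examples.append(("question 0", "answer 0"))  # an accidental duplicate

train, test = split_examples(examples)
print(len(train), len(test))
```

Near-duplicates (the same example with trivial rewording) are not caught by this check and can still leak between the two sets.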

Evaluating LLM output

You cannot improve what you can’t measure, and “it looked good in the playground” is not measurement. Build an evaluation set before you optimize anything.

Methods, from cheapest to most expensive:

Programmatic checks: is it valid JSON? Does it contain the required fields? Is the extracted ID a real ID? These are fast and deterministic, so automate everything you can.
LLM-as-judge: use a strong model to score outputs against a rubric for qualities like faithfulness or helpfulness. It scales well, but calibrate it against human ratings.
Human review: the gold standard for nuance and high-stakes output. Slow and costly; reserve it for what automation can’t catch.
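
The three questions under programmatic checks translate almost directly into code. The sketch below is one possible shape: `check_output`, the `order_id` field name, and the sample IDs are hypothetical, and a real feature would substitute its own schema and ID source.

```python
import json


def check_output(raw, required_fields, valid_ids, id_field="order_id"):
    """Return a list of failure reasons; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]

    failures = [f"missing field: {name}" for name in required_fields if name not in data]

    # "Is the extracted ID a real ID?" Compare against a known set of IDs.
    if id_field in data:
        value = data[id_field]
        if not (isinstance(value, str) and value in valid_ids):
            failures.append(f"unknown {id_field}: {value}")
    return failures


known_ids = {"A1", "B2"}
good = check_output('{"order_id": "A1", "action": "refund"}', ["order_id", "action"], known_ids)
bad = check_output('{"order_id": "Z9"}', ["order_id", "action"], known_ids)
```

Returning reasons rather than a bare pass or fail makes it easier to see which check a regression tripped.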

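# A minimal eval harness: the most valuable 20 lines in an LLM project.
import json

eval_set = [
    {"input": "...", "must_contain": "refund", "must_be_json": True},
    # ...dozens more cases, including known failures
]

def check(out, case):
    if case.get("must_be_json"):
        try:
            json.loads(out)
        except ValueError:
            return False
    return case.get("must_contain", "") in out

# run_my_llm_feature is your own code: input text in, output text out.
results = [check(run_my_llm_feature(case["input"]), case) for case in eval_set]
print(f"pass rate: {sum(results)}/{len(results)}")  # track this on every prompt or model change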

Run this on every prompt edit and model swap. Without it, “improvements” are guesses and regressions ship silently.
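
Calibrating an LLM judge against human ratings, as suggested in the list of methods, usually starts by having both label the same sample and measuring how often they agree. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The function name and the sample labels are made up for illustration; no real judge or human data is shown.

```python
from collections import Counter


def agreement_and_kappa(judge_labels, human_labels):
    """Return (observed agreement, Cohen's kappa) for two equal-length label lists."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Chance agreement: probability both raters pick the same label independently.
    judge_freq, human_freq = Counter(judge_labels), Counter(human_labels)
    expected = sum(judge_freq[label] * human_freq[label] for label in judge_freq) / (n * n)

    if expected == 1.0:  # both raters used one identical label throughout
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)


# Made-up labels for eight outputs rated by both the judge model and a human.
judge = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
human = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]

observed, kappa = agreement_and_kappa(judge, human)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```

A judge that agrees with humans only slightly more often than chance would predict is not yet a trustworthy metric, however high its raw agreement looks.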

Key takeaways

Adapt with the lightest tool that works: prompt first, add RAG for missing knowledge, fine-tune only for missing behavior. Fine-tuning teaches form, not facts — and LoRA makes it cheap and modular. Whatever you choose, build an evaluation set first: programmatic checks where possible, LLM-as-judge for nuance, humans for the high-stakes cases.