Running Models Locally
Running a capable LLM on your own laptop or server is now routine. It’s worth doing — for privacy, cost, offline use, and the genuine intuition you gain from operating a model directly.
Why run locally
Section titled “Why run locally”- Privacy — data never leaves your machine. Decisive for sensitive content.
- Cost — no per-token bill; once the hardware exists, inference is “free.”
- Offline — works with no network, in air-gapped environments.
- Learning — nothing builds intuition like running, quantizing, and prompting a model yourself.
- No rate limits — bounded only by your hardware.
The trade-off: a model that fits on consumer hardware is smaller and less capable than a frontier API model. Local is excellent for many tasks — and not a drop-in replacement for the strongest closed models.
The tools
Section titled “The tools”Ollama
Section titled “Ollama”The easiest entry point. One command pulls and runs a model, with a local API server that mimics common LLM APIs.
ollama run llama3.2 # download + chat, interactively# A local OpenAI-compatible endpoint also starts at localhost:11434Your application code can point at that local endpoint with almost no changes — ideal for development and for swapping between local and hosted models.
llama.cpp
Section titled “llama.cpp”The engine under much local inference, Ollama included. A highly optimized C++ runtime that runs models efficiently on CPUs, consumer GPUs, and Apple Silicon. Use it directly when you want maximum control or to embed inference in an app.
Others
Section titled “Others”Desktop apps like LM Studio and Jan offer a GUI over the same engines; vLLM handles serious local serving on a GPU.
The GGUF format and quantization
Section titled “The GGUF format and quantization”Local models are usually distributed as GGUF files — a format built for efficient CPU/GPU inference that bundles the model and its quantization.
Quantization is what makes local viable: 4-bit weights cut memory ~4× versus 16-bit, moving a model from “data-center GPU” to “your laptop.” You’ll pick a quantization level — a trade-off:
| Quantization | Memory | Quality | Use when |
|---|---|---|---|
| 8-bit (Q8) | Highest | Near-original | You have the RAM/VRAM to spare |
| 4–5-bit (Q4/Q5) | Moderate | Very good | The common sweet spot |
| 2–3-bit | Lowest | Noticeably degraded | Only if nothing else fits |
Q4/Q5 is the usual default — most of the quality, a fraction of the memory.
Hardware: what you need
Section titled “Hardware: what you need”The binding constraint is memory to hold the model — see the memory math. Roughly, for a 4-bit model:
| Model size | Memory (~Q4) | Runs comfortably on |
|---|---|---|
| 1–3B | ~1–3 GB | Almost any modern laptop |
| 7–8B | ~5–8 GB | 16 GB RAM laptop; mainstream GPU |
| 13–14B | ~9–12 GB | 32 GB RAM; a GPU with 12 GB+ VRAM |
| 30–34B | ~20–24 GB | High-memory machine or a 24 GB GPU |
| 70B | ~40 GB+ | Workstation; multi-GPU; high-RAM Mac |
Also watch speed: memory bandwidth sets token throughput, and small or heavily quantized models may run acceptably on CPU alone — but a GPU (or Apple Silicon) is far better for anything interactive.
When local makes sense — and when it doesn’t
Section titled “When local makes sense — and when it doesn’t”Good fit: development and experimentation; privacy-sensitive data; offline or air-gapped use; high-volume simple tasks (classification, extraction) where a small model suffices; learning.
Poor fit: you need frontier-level capability; scaling to many concurrent users (that’s real serving infrastructure, not a laptop); you want zero operational effort.
A practical pattern: develop locally against a small model via Ollama’s API-compatible endpoint, then deploy against either a self-hosted model or a hosted API — the same code, just a different base URL.
Key takeaways
Section titled “Key takeaways”Running models locally gives privacy, zero per-token cost, offline use, and real intuition — at the price of using smaller, less capable models. Ollama is the easiest start; llama.cpp is the engine beneath it. Models ship as quantized GGUF files, and Q4/Q5 is the usual quality/memory sweet spot. Memory is the constraint — size hardware to the model, and note that Apple Silicon’s unified memory makes Macs unusually capable. Use local for dev, privacy, and simple high-volume tasks; use hosted infrastructure for frontier capability and real multi-user scale.