GPUs & Hardware

The single most useful infrastructure skill for an application engineer: estimating whether a given model fits on given hardware, and what that costs. It’s mostly arithmetic.

A CPU has a few dozen cores tuned for fast sequential work. A GPU has thousands of simpler cores tuned for doing the same operation across lots of data at once — exactly the matrix multiplications of a neural network. For LLM work, the spec that matters most isn’t raw speed; it’s memory (VRAM).

To serve a model, its parameters must fit in GPU memory. The estimate:

memory for weights ≈ parameters × bytes-per-parameter

| Precision | Bytes/param | A 7B model | A 70B model |
|---|---|---|---|
| FP32 (full) | 4 | ~28 GB | ~280 GB |
| FP16 / BF16 (typical serving) | 2 | ~14 GB | ~140 GB |
| INT8 (quantized) | 1 | ~7 GB | ~70 GB |
| INT4 (quantized) | 0.5 | ~3.5 GB | ~35 GB |
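The table is just the formula applied eight times, which a few lines of Python can reproduce. This is a minimal sketch; the function name `weight_memory_gb` and the `BYTES_PER_PARAM` mapping are illustrative names, and 1 GB is taken as 10^9 bytes.

```python
# Bytes per parameter for each precision listed in the table.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for name, params in [("7B", 7e9), ("70B", 70e9)]:
    for precision in ("fp32", "fp16", "int8", "int4"):
        print(f"{name} @ {precision}: ~{weight_memory_gb(params, precision):.1f} GB")
```

The result covers weights only; KV cache, activations, and framework overhead come on top.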

So a 7B model at FP16 needs ~14 GB just for weights. Then add overhead:

  • The KV cache — grows with context length and concurrency; can be many GB.
  • Activations and framework overhead.
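To see why the KV cache can reach many GB, it helps to size it directly. The sketch below uses the standard accounting of one key and one value vector per layer per token. The model shape (32 layers, 32 KV heads, head dimension 128) is an assumed, illustrative 7B-class configuration, not a figure from this page, and the function name is hypothetical.

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                context_tokens: int, concurrent_seqs: int,
                bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB (1 GB = 1e9 bytes)."""
    # Factor of 2: each token stores one key and one value vector per layer.
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens * concurrent_seqs / 1e9


# Assumed 7B-class shape, FP16 cache, 4096-token context.
one_seq = kv_cache_gb(32, 32, 128, context_tokens=4096, concurrent_seqs=1)
eight_seqs = kv_cache_gb(32, 32, 128, context_tokens=4096, concurrent_seqs=8)
print(f"1 sequence:  ~{one_seq:.1f} GB")
print(f"8 sequences: ~{eight_seqs:.1f} GB")
```

Under these assumptions the cache scales linearly with both context length and the number of concurrent sequences, so eight full-length requests can need more memory than the 7B weights themselves.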

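A working rule of thumb: budget ~1.5–2× the weight size for real serving. For the 7B/FP16 model that is roughly 21–28 GB, so a ~24 GB GPU serves it comfortably at modest context lengths and concurrency. A 70B model at FP16 (~140 GB of weights alone) exceeds any single 80 GB GPU, so it has to be split across several GPUs or quantized.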
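The rule of thumb turns into a quick fit check. This is a rough sketch: the function names are illustrative, the overhead factor is the 1.5–2× heuristic rather than a measurement, and the GPU count assumes the model shards evenly across identical cards.

```python
import math


def serving_budget_gb(num_params: float, bytes_per_param: float,
                      overhead_factor: float = 1.5) -> float:
    """Weights times a heuristic overhead factor, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param * overhead_factor / 1e9


def fits(num_params: float, bytes_per_param: float, vram_gb: float,
         overhead_factor: float = 1.5) -> bool:
    """True if the heuristic budget fits in a single GPU's VRAM."""
    return serving_budget_gb(num_params, bytes_per_param, overhead_factor) <= vram_gb


def gpus_needed(num_params: float, bytes_per_param: float, vram_gb: float,
                overhead_factor: float = 1.5) -> int:
    """Minimum GPU count, assuming perfectly even sharding (an idealization)."""
    budget = serving_budget_gb(num_params, bytes_per_param, overhead_factor)
    return math.ceil(budget / vram_gb)


print(fits(7e9, 2, vram_gb=24))          # 7B at FP16 on a 24 GB card
print(fits(70e9, 2, vram_gb=80))         # 70B at FP16 on one 80 GB card
print(gpus_needed(70e9, 2, vram_gb=80))  # 80 GB cards needed for 70B at FP16
```

Raising `overhead_factor` toward 2.0 models longer contexts or more concurrent requests; at 2.0 the 7B/FP16 budget is 28 GB and no longer fits in 24 GB.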

Quantization stores weights at lower precision — INT8 or INT4 instead of 16-bit. From the table, INT4 cuts memory ~4× versus FP16. That can move a model from “needs a data-center GPU” to “runs on a consumer card” or even a laptop.

The trade-off is a usually small quality loss. Modern methods (GPTQ, AWQ, GGUF/llama.cpp k-quants) are good enough that 8-bit is often nearly indistinguishable from 16-bit, and 4-bit is acceptable for many uses. Quantization is the main reason open models are practical to self-host.
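To make "lower precision, small quality loss" concrete, here is a toy sketch of symmetric absmax INT8 quantization. It is deliberately simpler than GPTQ, AWQ, or the llama.cpp k-quants named above and is not how any of them work internally; the function names and sample weights are made up for illustration.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.0, 0.49]  # made-up sample values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("max round-trip error:", max_error)
```

Each weight now takes one byte plus a shared scale, and the round-trip error is bounded by half the scale. Production methods reduce the error further by quantizing in small groups and calibrating against real activations.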

Training and inference have very different memory appetites:

  • Inference needs memory for weights + KV cache. This is the focus of most app engineers.
  • Training / fine-tuning also needs gradients, optimizer state, and saved activations: often 3–4× the inference footprint at the same precision. This is why LoRA matters so much: it trains only a small adapter while the base weights stay frozen, which often brings fine-tuning within reach of a single GPU.
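The gap between the two can be estimated per parameter. The sketch below assumes plain FP32 training with an Adam-style optimizer (4 bytes of weights, 4 of gradients, 8 of optimizer state per trainable parameter) and ignores activations. Those byte counts, the function name, and the 20M-parameter adapter size are assumptions chosen for illustration.

```python
def training_state_gb(trainable_params: float, frozen_params: float = 0.0,
                      weight_bytes: float = 4, grad_bytes: float = 4,
                      optimizer_bytes: float = 8, frozen_bytes: float = 2) -> float:
    """Weights + gradients + optimizer state in GB, excluding activations."""
    trainable = trainable_params * (weight_bytes + grad_bytes + optimizer_bytes)
    frozen = frozen_params * frozen_bytes  # frozen base weights, e.g. held in FP16
    return (trainable + frozen) / 1e9


# Full fine-tuning of a 7B model: every parameter is trainable.
full = training_state_gb(7e9)

# LoRA-style setup: 7B frozen base weights plus a hypothetical 20M-parameter adapter.
lora = training_state_gb(20e6, frozen_params=7e9)

print(f"full fine-tune: ~{full:.0f} GB, about {full / 28:.0f}x the FP32 weights")
print(f"LoRA-style:     ~{lora:.1f} GB")
```

Under these assumptions full fine-tuning needs about 4× the FP32 weight memory before activations are counted, while the adapter setup stays close to the cost of holding the frozen weights.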
| Need | Typical choice |
|---|---|
| Local dev, small/quantized models | Consumer GPU, or Apple Silicon (unified memory) |
| Serving mid-size models | One data-center GPU (e.g. A10, L40S, A100) |
| Serving large models / training | Multiple high-VRAM GPUs (A100/H100), networked |
| Occasional or spiky workloads | Rent cloud GPUs by the hour |

Key specs, in order: VRAM (does it fit?), memory bandwidth (decode speed), then compute throughput.
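Memory bandwidth sets decode speed because generating each token requires reading roughly all the weights from VRAM. A back-of-the-envelope upper bound follows, assuming a single sequence, one full pass over the weights per token, and no other memory traffic. The 1000 GB/s figure is a hypothetical round number, not the spec of any particular GPU.

```python
def max_decode_tokens_per_s(weight_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on single-sequence decode speed when bandwidth-bound."""
    # Each generated token needs roughly one full read of the weights.
    return bandwidth_gb_per_s / weight_gb


BANDWIDTH = 1000.0  # GB/s, hypothetical GPU

print(f"7B FP16 (14 GB):  <= {max_decode_tokens_per_s(14.0, BANDWIDTH):.0f} tokens/s")
print(f"7B INT4 (3.5 GB): <= {max_decode_tokens_per_s(3.5, BANDWIDTH):.0f} tokens/s")
```

Real throughput lands below this bound, but the ratio is a useful guide: shrinking the weights by quantization raises the ceiling by about the same factor.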

  • Hyperscaler cloud (AWS/GCP/Azure) — flexible, integrated, priciest per GPU-hour.
  • Specialized GPU clouds — often cheaper per hour for raw compute.
  • Owned hardware — lowest cost if utilization is consistently high; you own capacity planning, failures, and idle time.
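Whether owned hardware beats renting comes down to utilization, and the break-even point is simple arithmetic. Every price in this sketch is a made-up placeholder, not a quote from any vendor or cloud, and the function names are illustrative. The model also ignores staff time and financing.

```python
HOURS_PER_YEAR = 24 * 365


def owned_cost_per_used_hour(purchase_price: float, lifetime_years: float,
                             yearly_running_cost: float, utilization: float) -> float:
    """Cost of an owned GPU per hour of actual use (utilization in 0..1)."""
    yearly_cost = purchase_price / lifetime_years + yearly_running_cost
    return yearly_cost / (HOURS_PER_YEAR * utilization)


def break_even_utilization(purchase_price: float, lifetime_years: float,
                           yearly_running_cost: float,
                           rental_price_per_hour: float) -> float:
    """Utilization above which owning is cheaper than renting by the hour."""
    yearly_cost = purchase_price / lifetime_years + yearly_running_cost
    return yearly_cost / (HOURS_PER_YEAR * rental_price_per_hour)


# Placeholder numbers: 30,000 purchase, 3-year life, 4,000/year power and hosting,
# versus renting a comparable GPU at 2.50 per hour.
threshold = break_even_utilization(30_000, 3, 4_000, 2.50)
print(f"owning wins above ~{threshold:.0%} utilization")
print(f"at 25% utilization: {owned_cost_per_used_hour(30_000, 3, 4_000, 0.25):.2f} per used hour")
```

With these placeholder inputs, owning only pays off when the GPU is busy well over half the time, which is the "consistently high utilization" condition in the list above.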

Cloud GPUs are scarce and expensive — multi-GPU instances especially. This scarcity, more than anything, is why hosted model APIs win for most teams: the provider amortizes GPUs across many customers far better than you can alone.

GPUs serve LLMs because the math is massively parallel, and VRAM is the binding constraint. Estimate weight memory as parameters × bytes-per-parameter, then budget ~1.5–2× that figure to cover the KV cache and overhead. Quantization (INT8/INT4) cuts memory 2–4× versus FP16 for a usually small quality cost, making self-hosting feasible. Training often needs 3–4× the memory of inference. Match hardware to VRAM first, and remember that API providers exist largely because GPUs are scarce.