AI Infrastructure

At some point “just call the API” stops being the answer, whether because of cost, data residency, latency, customization, or scale. AI infrastructure is what runs underneath: the servers, GPUs, and scaling systems that turn a model into a service.

You don’t need to be an infrastructure specialist. You do need enough fluency to estimate what a model costs to run, decide between an API and self-hosting, and talk credibly with the platform team.

In this section

Serving & Inference: Inference servers, batching, the KV cache, prefill vs. decode, and the throughput-versus-latency trade-off.

GPUs & Hardware: GPU memory math, what model size really costs, quantization, and choosing hardware.

Scaling & Cost: Autoscaling, cold starts, capacity planning, and the real economics of API vs. self-hosting.

What you’ll be able to do

Estimate the GPU memory and cost to serve a given model, reason about inference throughput and latency, and make a defensible API-vs-self-host decision.
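
As a preview of the kind of estimate meant here, the sketch below shows one common back-of-envelope approach: weight memory is roughly parameter count times bytes per parameter, KV cache grows with layers, KV heads, head dimension, context length, and batch size, and serving cost follows from GPU price divided by throughput. The function names, the 7B model shape, the $2/hour GPU price, and the 1,000 tokens/second throughput are all illustrative assumptions, not figures from this section. Real deployments also need headroom for activations and framework overhead, which this sketch ignores.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                context_tokens: int, batch_size: int,
                bytes_per_value: float = 2) -> float:
    """KV cache size in GB for a decoder-only transformer."""
    # The leading 2 accounts for one key and one value tensor per layer.
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * context_tokens * batch_size * bytes_per_value)
    return total_bytes / 1e9


def cost_per_million_tokens(gpu_hourly_usd: float,
                            tokens_per_second: float) -> float:
    """Serving cost in USD per one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1e6


# Hypothetical 7B-parameter model; every number below is an assumption.
fp16_weights = weight_memory_gb(7e9, 2)    # 16-bit weights: about 14 GB
int4_weights = weight_memory_gb(7e9, 0.5)  # 4-bit quantized: about 3.5 GB
kv = kv_cache_gb(num_layers=32, num_kv_heads=32, head_dim=128,
                 context_tokens=4096, batch_size=1)  # about 2.15 GB
cost = cost_per_million_tokens(gpu_hourly_usd=2.0,
                               tokens_per_second=1000)  # about $0.56

print(f"FP16 weights: {fp16_weights:.1f} GB, 4-bit weights: {int4_weights:.1f} GB")
print(f"KV cache at 4096 tokens, batch 1: {kv:.2f} GB")
print(f"Cost per 1M tokens: ${cost:.2f}")
```

Even this rough arithmetic is enough to see why quantization and batch size dominate the hardware decision: halving bytes per parameter halves weight memory, while KV cache scales linearly with both context length and concurrent requests.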

Prerequisites

Deep Learning and LLM Engineering.