AI Infrastructure
At some point “just call the API” stops being the answer — because of cost, data residency, latency, customization, or scale. AI infrastructure is what runs underneath: the servers, GPUs, and scaling systems that turn a model into a service.
You don’t need to be an infrastructure specialist. You do need enough fluency to estimate what a model costs to run, decide between an API and self-hosting, and talk credibly with the platform team.
In this section
Serving & Inference: Inference servers, batching, the KV cache, prefill vs. decode, and the throughput-versus-latency trade-off.
GPUs & Hardware: GPU memory math, what model size really costs, quantization, and choosing hardware.
Scaling & Cost: Autoscaling, cold starts, capacity planning, and the real economics of API vs. self-hosting.
What you’ll be able to do
Estimate the GPU memory and cost to serve a given model, reason about inference throughput and latency, and make a defensible API-vs-self-host decision.
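As a taste of the GPU memory estimate, here is a minimal sketch under stated assumptions. It uses only the standard approximation that weight memory is roughly parameter count times bytes per parameter. The function name `weight_memory_gb`, the precision table, and the 7-billion-parameter example are illustrative choices, not taken from this section. The figure covers weights only; KV cache, activations, and runtime overhead come on top.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: this counts model weights only. KV cache, activations,
# and framework overhead are excluded, so real usage will be higher.

# Bytes needed to store one parameter at each precision (illustrative table).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str = "fp16") -> float:
    """Return approximate weight memory in GB, with 1 GB = 1e9 bytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    # Hypothetical 7-billion-parameter model at three precisions.
    for precision in ("fp16", "int8", "int4"):
        gb = weight_memory_gb(7e9, precision)
        print(f"7B params @ {precision}: ~{gb:.1f} GB for weights")
```

Treat the result as a lower bound when sizing hardware: a model whose weights barely fit on a GPU will likely leave too little room for the KV cache under real traffic.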