LLM System Design
The LLM system design interview is the round that matters most for AI engineering roles. It’s a classic system design interview with an AI core — and it rewards structure, trade-off reasoning, and awareness of how AI systems fail.
A framework
Use a repeatable structure so you cover everything under pressure. The worked example below walks through its seven numbered steps in order.
Steps 4–6 are where AI system design diverges from ordinary system design. Spend your time there.
Worked example: “Design a customer support AI assistant”
1. Clarify
Ask before designing: What can it do: answer questions only, or take actions (refunds, ticket updates)? What is the knowledge source: help docs, past tickets? What volume should it handle? What latency do users expect? What happens when it is unsure? Which channels and languages does it serve?
Assume: answers product questions from help docs and past tickets, can escalate to a human, ~10k conversations/day, chat latency, English first.
2. Define success
State metrics up front; it signals maturity. Track resolution rate (no human needed), answer accuracy and faithfulness, escalation rate, latency (p99), cost per conversation, and user satisfaction. Note explicitly that a wrong answer is worse than an escalation; that framing drives the whole design.
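To make these metrics concrete, here is a minimal sketch of how they might be computed from conversation logs. The `Conversation` record and its field names are hypothetical, and counting every non-escalated conversation as resolved is a simplifying assumption. Accuracy and faithfulness are left out because they need a judged test set rather than raw logs.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Conversation:
    # Hypothetical log record; field names are illustrative only.
    escalated: bool             # handed off to a human
    latency_ms: float           # end-to-end response latency
    cost_usd: float             # model and retrieval spend
    thumbs_up: Optional[bool]   # None when the user gave no feedback


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(conversations: list[Conversation]) -> dict[str, float]:
    """Aggregate the success metrics over a non-empty batch of conversations."""
    n = len(conversations)
    escalated = sum(c.escalated for c in conversations)
    rated = [c for c in conversations if c.thumbs_up is not None]
    satisfaction = (
        sum(bool(c.thumbs_up) for c in rated) / len(rated) if rated else float("nan")
    )
    return {
        # Assumption: a conversation that was not escalated counts as resolved.
        "resolution_rate": (n - escalated) / n,
        "escalation_rate": escalated / n,
        "latency_p99_ms": percentile([c.latency_ms for c in conversations], 99),
        "cost_per_conversation_usd": sum(c.cost_usd for c in conversations) / n,
        "satisfaction": satisfaction,
    }
```

Reporting p99 rather than the mean keeps the slowest conversations visible, which matters for a chat product.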
3. Sketch the architecture
This is a RAG system, for two reasons: the model has a knowledge gap (it does not know the help docs or past tickets), and that data keeps changing.
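One way to sketch the request flow is as a small orchestration function: retrieve, check that retrieval is strong enough, then either generate a grounded answer or escalate. Everything here is an assumption made for illustration: the `Chunk` and `Reply` types, the `handle_message` name, the escalation message, and the `min_score` threshold are hypothetical, and the retriever and generator are passed in as plain callables so the sketch stays independent of any vendor.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Chunk:
    source: str   # e.g. a help-doc path or ticket id
    text: str
    score: float  # retrieval relevance, higher is better


@dataclass
class Reply:
    text: str
    sources: list[str]
    escalated: bool


# Hypothetical component interfaces; real ones would wrap a vector DB and an LLM API.
Retriever = Callable[[str], list[Chunk]]
Generator = Callable[[str, list[Chunk]], str]


def handle_message(
    question: str,
    retrieve: Retriever,
    generate: Generator,
    min_score: float = 0.5,  # assumed threshold; tune it against an eval set
) -> Reply:
    """Answer from retrieved context, or hand off to a human when retrieval is weak."""
    chunks = retrieve(question)
    strong = [c for c in chunks if c.score >= min_score]
    if not strong:
        # Weak retrieval: escalate rather than risk a confident wrong answer.
        return Reply("Let me connect you with a support agent.", [], escalated=True)
    answer = generate(question, strong)
    return Reply(answer, [c.source for c in strong], escalated=False)
```

Keeping the components behind callables makes it easy to swap the model or retriever later without touching the flow.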
4. Deep-dive: the AI core
Walk through the real decisions:
- Retrieval — chunk docs and tickets; embed; store in a vector DB. Hybrid search so exact terms (error codes, product names) aren’t lost; rerank; metadata-filter by product and recency.
- Generation — a grounding prompt: answer only from retrieved context, cite sources, and say “I’m not sure” rather than guess. Low temperature.
- Model choice — start with a capable hosted model; later route easy questions to a cheaper model for cost.
- Actions — keep refunds and account changes behind tool calls with strict validation and human approval. Read-only by default.
- The unsure path — when retrieval is weak or confidence is low, escalate. Better than a confident wrong answer — and you defined it that way in step 2.
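The retrieval and generation bullets can be sketched as follows. For hybrid search this uses reciprocal rank fusion (RRF), which is one common way to merge a keyword ranking with a vector ranking; the text above does not name a fusion method, so treat that choice as an assumption. The function names, the prompt wording, and the `k = 60` constant (a conventional default) are likewise illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    so a chunk ranked well by both keyword and vector search rises to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def build_grounded_prompt(
    question: str,
    chunks: dict[str, str],   # chunk id -> chunk text
    ranked_ids: list[str],
    top_n: int = 3,
) -> str:
    """Build a prompt that restricts the model to the retrieved context."""
    context = "\n".join(f"[{cid}] {chunks[cid]}" for cid in ranked_ids[:top_n])
    return (
        "Answer using only the context below. "
        "Cite the bracketed source ids you used. "
        "If the context does not contain the answer, reply exactly: I'm not sure.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

In use, the keyword ranking keeps exact matches such as error codes in play, the vector ranking covers paraphrases, and the fused list feeds the prompt. A reranker and metadata filters would sit between the two functions.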
5. Evaluation
Build a test set of real questions with ideal answers and source chunks. Score retrieval (context recall) and generation (faithfulness, correctness) separately. Gate every prompt or model change on that test set. In production, run evals on sampled live traffic and watch thumbs-down and escalations. See Advanced RAG & Evaluation.
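A minimal sketch of the offline part might look like this. It assumes each test case records the expected source chunk ids and that a faithfulness verdict comes from a human or an LLM judge elsewhere. Context recall is simplified here to the fraction of expected chunks that were retrieved, and `EvalCase`, `evaluate`, and `passes_gate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str
    expected_chunk_ids: set[str]     # source chunks the ideal answer relies on
    retrieved_chunk_ids: list[str]   # what the retriever actually returned
    faithful: bool                   # judge verdict, produced outside this sketch


def context_recall(case: EvalCase) -> float:
    """Fraction of the expected chunks that retrieval actually found."""
    if not case.expected_chunk_ids:
        return 1.0
    hits = case.expected_chunk_ids & set(case.retrieved_chunk_ids)
    return len(hits) / len(case.expected_chunk_ids)


def evaluate(cases: list[EvalCase]) -> dict[str, float]:
    """Score retrieval and generation separately over a non-empty test set."""
    n = len(cases)
    return {
        "context_recall": sum(context_recall(c) for c in cases) / n,
        "faithfulness": sum(c.faithful for c in cases) / n,
    }


def passes_gate(
    candidate: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.01,  # assumed allowance for run-to-run noise
) -> bool:
    """Allow a prompt or model change only if no metric regresses beyond tolerance."""
    return all(candidate[m] >= baseline[m] - tolerance for m in baseline)
```

Scoring the two stages separately tells you whether a regression came from retrieval or from generation.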
6. Operate
- Cost: estimate tokens per conversation; cache common questions; consider model routing.
- Latency: stream responses; parallelize retrieval.
- Reliability: timeouts, retries, a fallback model, graceful escalation if the AI is down.
- Monitoring: trace every conversation; track cost, latency, faithfulness, escalation rate.
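The reliability item can be sketched as a wrapper that retries the primary model, falls back to a second model, and finally signals escalation. This assumes each model is a callable that enforces its own timeout and raises `TimeoutError` or `ConnectionError` on transient failure. The function name, retry count, and backoff values are hypothetical.

```python
import time
from typing import Callable, Optional

ModelCall = Callable[[str], str]

# Assumption: only these errors are treated as transient and worth retrying.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)


def answer_with_fallback(
    prompt: str,
    primary: ModelCall,
    fallback: ModelCall,
    retries: int = 2,
    backoff_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests run instantly
) -> Optional[str]:
    """Return a model answer, or None to tell the caller to escalate to a human."""
    for model in (primary, fallback):
        for attempt in range(retries + 1):
            try:
                return model(prompt)
            except TRANSIENT_ERRORS:
                if attempt < retries:
                    # Exponential backoff before retrying the same model.
                    sleep(backoff_s * 2 ** attempt)
    return None
```

Returning `None` instead of raising keeps the graceful-escalation decision in the caller, where the conversation state lives.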
7. Trade-offs
Name them unprompted: hybrid search and reranking cost latency but lift accuracy; a stronger model costs more but escalates less; aggressive caching risks staleness. Show you see the tensions and chose deliberately.
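The model-cost versus escalation tension can be made concrete with a little arithmetic. The sketch below assumes a simple cost model in which every escalation incurs a fixed human-handling cost. The function names are hypothetical, and any numbers you plug in are yours to estimate, not figures from this guide.

```python
def expected_cost_per_conversation(
    model_cost_usd: float,
    escalation_rate: float,
    human_cost_usd: float,
) -> float:
    """Model spend plus the expected cost of handing off to a human."""
    return model_cost_usd + escalation_rate * human_cost_usd


def break_even_human_cost(extra_model_cost_usd: float, escalation_drop: float) -> float:
    """Human-handling cost above which the stronger model is cheaper overall.

    extra_model_cost_usd: added model spend per conversation.
    escalation_drop: reduction in escalation rate (e.g. 0.10 for 30% -> 20%).
    """
    if escalation_drop <= 0:
        raise ValueError("the stronger model must reduce escalations to break even")
    return extra_model_cost_usd / escalation_drop
```

For example, if a stronger model adds 0.04 USD per conversation and lowers the escalation rate by 10 percentage points, it pays for itself whenever a human handoff costs more than roughly 0.40 USD.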
What interviewers reward
| They want to see | They worry when you |
|---|---|
| Clarifying before designing | Jump straight to a solution |
| Evaluation as a first-class concern | Never mention measuring quality |
| Awareness of failure modes | Assume the LLM is always right |
| Explicit cost/latency/quality trade-offs | Ignore cost and latency |
| Guardrails and the “unsure” path | Let the model act unconstrained |
| Starting simple, then scaling | Over-engineer from slide one |
Key takeaways
Drive the LLM system design interview with a fixed framework: clarify, define success, sketch, deep-dive the AI core, evaluate, operate, iterate. Spend your time on the AI-specific parts: retrieval, prompts, model choice, guardrails. Raise evaluation, failure modes, and cost/latency/quality trade-offs yourself. Start simple and scale deliberately. Treating a wrong answer as worse than an escalation, and designing for it, is what marks a real practitioner.