Chunking & Retrieval
If a RAG system gives bad answers, the cause is usually retrieval, not generation — the model never received the right context. This page covers the levers that fix retrieval.
Chunking
A chunk is the unit of retrieval: the passage you embed and later inject into the prompt. Chunk size is a real trade-off:
- Too large — one vector blurs many topics, retrieval gets imprecise, and chunks eat context-window budget.
- Too small — a chunk loses the surrounding context needed to make sense.
Strategies
| Strategy | How it works | Use when |
|---|---|---|
| Fixed-size | N tokens per chunk, with overlap | Quick baseline, uniform text |
| Recursive | Split on paragraphs → sentences to respect structure | General-purpose default |
| Document-aware | Split on Markdown headings, code blocks, sections | Structured docs, code |
| Semantic | Split where the topic shifts (embedding-based) | High-value corpora; costlier to build |
Overlap — repeating ~10–20% of text between adjacent chunks — keeps a sentence split across a boundary from being lost.
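As a minimal sketch of the fixed-size strategy with overlap, the hypothetical helper below (`fixed_size_chunks` is an illustrative name, not a library function) slides a window over the text. It assumes whitespace-separated words are an acceptable stand-in for tokens; a real pipeline would likely count tokens with the embedding model's tokenizer.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 75) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()  # assumption: words approximate tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(fixed_size_chunks(sample, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

With the defaults shown (500 and 75), adjacent chunks share 15% of their text, which sits inside the 10 to 20% range mentioned above.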
```python
# Recursive splitting: respect natural boundaries before falling back to length.
chunks = recursive_split(
    document,
    separators=["\n\n", "\n", ". ", " "],  # try paragraphs, then lines, ...
    chunk_size=500,
    chunk_overlap=75,
)
```
Enriching chunks
Quality jumps when each chunk carries more than its raw text:
- Metadata — source, title, date, section, permissions — for filtering and citations.
- Context prefix — prepend the document title and section heading so an isolated chunk still self-describes.
- Small-to-big — embed a small chunk for precise matching, but return its larger parent for richer context.
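The three enrichments above could be represented roughly as follows. This is a sketch under assumptions: the `Chunk` class, its field names, the `"title > section"` prefix format, and the `expand_to_parents` helper are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str          # raw passage text (the small unit that gets embedded)
    doc_title: str
    section: str
    parent_id: str     # id of the larger parent passage, used for small-to-big
    metadata: dict = field(default_factory=dict)  # source, date, permissions, ...

    def text_for_embedding(self) -> str:
        """Context prefix: prepend title and section heading to the raw text."""
        return f"{self.doc_title} > {self.section}\n{self.text}"


def expand_to_parents(hits: list[Chunk], parents: dict[str, str]) -> list[str]:
    """Small-to-big: swap each matched chunk for its parent passage, without duplicates."""
    seen: set[str] = set()
    expanded: list[str] = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            expanded.append(parents[hit.parent_id])
    return expanded
```

Embedding `text_for_embedding()` rather than `text` lets an isolated sentence still match queries that mention the document or section, while `metadata` stays available for filtering and citations.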
Hybrid search
Pure vector search is strong on meaning but weak on exact strings such as error codes, product names, and IDs. Hybrid search combines semantic (vector) and lexical (keyword/BM25) retrieval and fuses the results.
Hybrid search is one of the highest-ROI upgrades to naive RAG. If exact terms matter at all in your domain, add it early.
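The text does not say how the two result lists are fused. One widely used option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method this page prescribes. The chunk ids are made up, and `k = 60` is a commonly used default constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk ids; each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


vector_hits = ["c3", "c1", "c7"]   # hypothetical ids from vector search
keyword_hits = ["c1", "c9", "c3"]  # hypothetical ids from BM25
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# ['c1', 'c3', 'c9', 'c7']
```

RRF only needs ranks, so it sidesteps the problem that vector similarities and BM25 scores live on different scales.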
Query transformation
The user’s raw question is often a poor search query. Transform it before retrieval:
- Query rewriting — turn a vague or conversational question into a clean, keyword-rich query. Essential in chat, where “what about the second one?” only makes sense with history.
- Multi-query — generate several phrasings, retrieve for each, and union the results. Catches relevant chunks that any single phrasing would miss.
- Decomposition — break a multi-part question into sub-questions, retrieve per sub-question, then synthesize.
- HyDE — have the LLM draft a hypothetical answer, then embed that to search; a full answer often sits closer to real passages than a terse question.
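Of the transformations above, multi-query is the easiest to sketch without committing to a particular LLM. The function below assumes the phrasings have already been generated and that `retrieve` is whatever search function your system exposes; both names, and the toy index, are hypothetical.

```python
from typing import Callable


def multi_query_retrieve(
    phrasings: list[str],
    retrieve: Callable[[str], list[str]],
) -> list[str]:
    """Run retrieval once per phrasing and union the chunk ids in first-seen order."""
    seen: set[str] = set()
    merged: list[str] = []
    for query in phrasings:
        for chunk_id in retrieve(query):
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged


# Toy stand-in for a real retriever: a fixed lookup table (made-up data).
fake_index = {
    "reset password": ["c1", "c2"],
    "recover account access": ["c2", "c5"],
}
phrasings = ["reset password", "recover account access"]  # would come from an LLM
print(multi_query_retrieve(phrasings, lambda q: fake_index.get(q, [])))
# ['c1', 'c2', 'c5']
```

Decomposition can reuse the same shape: pass the sub-questions in place of the phrasings, then synthesize an answer from the merged results.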
Reranking
Retrieval optimizes for speed and casts a wide net; it’s not precise about ordering. A reranker fixes that as a second stage.
A reranker (a cross-encoder model) reads the query and each chunk together, producing a far more accurate relevance score than embedding similarity alone. Retrieve broadly, rerank, then keep only the top few. This reliably lifts answer quality and shrinks the context you send — so it can cut cost too.
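The retrieve-broadly, rerank, keep-the-top-few flow can be sketched with the scoring model left pluggable. In the sketch below `score` stands in for a cross-encoder; the `word_overlap` function is only a toy placeholder (an assumption made so the example runs without a model), and the candidate texts are invented.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Score every (query, chunk) pair and keep the top_n highest-scoring chunks."""
    ranked = sorted(candidates, key=lambda chunk: score(query, chunk), reverse=True)
    return ranked[:top_n]


def word_overlap(query: str, chunk: str) -> float:
    """Toy scorer: fraction of query words that appear in the chunk."""
    q_words = set(query.lower().split())
    return len(q_words & set(chunk.lower().split())) / max(len(q_words), 1)


candidates = [
    "Billing runs on the first of the month.",
    "Error E42 means the token expired.",
    "Fix error E42 by refreshing the token.",
]
print(rerank("fix error E42", candidates, word_overlap, top_n=2))
# ['Fix error E42 by refreshing the token.', 'Error E42 means the token expired.']
```

Because the scorer sees the query and the chunk together, swapping in a real cross-encoder changes only the `score` argument, not the surrounding pipeline.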
A tuned retrieval pipeline
Add the stages covered above (hybrid search, query transformation, reranking) one at a time, measuring the effect of each (see evaluation). Not every system needs all of them.
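One simple way to measure a stage is to compare a retrieval metric before and after adding it. The sketch below uses recall@k over a small labeled set; the metric choice, the chunk ids, and the two result lists are illustrative assumptions, not results from a real system.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the known-relevant chunk ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


relevant = {"c1", "c4"}                        # hypothetical labeled answer chunks
baseline = ["c9", "c1", "c7", "c3", "c4"]      # e.g. vector search only
with_rerank = ["c1", "c4", "c9", "c7", "c3"]   # e.g. after adding a reranker
print(recall_at_k(baseline, relevant, k=3), recall_at_k(with_rerank, relevant, k=3))
# 0.5 1.0
```

Averaging this over a fixed set of test questions gives a single number to track as each stage is introduced.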
Key takeaways
- Most RAG failures are retrieval failures.
- Chunk with structure-aware recursive splitting and modest overlap; enrich chunks with metadata and context.
- Add hybrid search so exact terms aren’t lost.
- Transform weak user questions into good search queries before retrieving.
- Add a reranker to turn a broad candidate set into a precise, compact context.
- Introduce each lever incrementally and measure.