Chunking & Retrieval

If a RAG system gives bad answers, the cause is usually retrieval, not generation — the model never received the right context. This page covers the levers that fix retrieval.

Chunking

A chunk is the unit of retrieval — the passage you embed and later inject. Chunk size is a real trade-off:

Too large — one vector blurs many topics, retrieval gets imprecise, and chunks eat context-window budget.
Too small — a chunk loses the surrounding context needed to make sense.

Strategies

Strategy	How it works	Use when
Fixed-size	N tokens per chunk, with overlap	Quick baseline, uniform text
Recursive	Split on paragraphs → sentences to respect structure	General-purpose default
Document-aware	Split on Markdown headings, code blocks, sections	Structured docs, code
Semantic	Split where the topic shifts (embedding-based)	High-value corpora; costlier to build

Overlap (repeating ~10–20% of text between adjacent chunks) makes it more likely that a sentence straddling a chunk boundary still appears whole in at least one chunk.

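# Recursive splitting: respect natural boundaries before falling back to length.
# `recursive_split` and `document` are placeholders: substitute your splitter and input text.
chunks = recursive_split(
    document,
    separators=["\n\n", "\n", ". ", " "],  # try paragraphs, then lines, then sentences, then words
    chunk_size=500,    # maximum chunk length, in the splitter's own units
    chunk_overlap=75,  # 15% of chunk_size, inside the 10-20% range above
)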
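
To make the call above concrete, here is a minimal sketch of what a recursive splitter might look like. It is an illustration, not any particular library's implementation. Assumptions: lengths are measured in characters (a real system may count tokens), and the helper name `_split_pieces` is hypothetical.

```python
def _split_pieces(text, separators, max_len):
    """Break text into pieces of at most max_len characters,
    preferring the coarsest separator that occurs in the text."""
    if len(text) <= max_len:
        return [text]
    for i, sep in enumerate(separators):
        if sep in text:
            pieces = []
            parts = text.split(sep)
            for j, part in enumerate(parts):
                # Re-attach the separator so no characters are dropped.
                if j < len(parts) - 1:
                    part += sep
                if part:
                    # Oversized parts are retried with the finer separators.
                    pieces.extend(_split_pieces(part, separators[i + 1:], max_len))
            return pieces
    # No separator applies: fall back to a hard cut by length.
    return [text[k:k + max_len] for k in range(0, len(text), max_len)]


def recursive_split(text, separators, chunk_size, chunk_overlap=0):
    """Return chunks of at most chunk_size characters. Each chunk after the
    first starts with the last chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    budget = chunk_size - chunk_overlap  # room left for new text in each chunk
    pieces = _split_pieces(text, separators, budget)

    # Greedily merge adjacent pieces while they still fit in the budget.
    bodies, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > budget:
            bodies.append(current)
            current = ""
        current += piece
    if current:
        bodies.append(current)

    # Prepend the tail of the previous body as overlap.
    chunks = []
    for idx, body in enumerate(bodies):
        prefix = bodies[idx - 1][-chunk_overlap:] if idx > 0 and chunk_overlap else ""
        chunks.append(prefix + body)
    return chunks


sample = (
    "Alpha paragraph one. It has two sentences.\n\n"
    "Beta paragraph two is here.\n\n"
    "Gamma."
)
chunks = recursive_split(sample, ["\n\n", "\n", ". ", " "], chunk_size=50, chunk_overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

Reserving the overlap inside the size budget keeps every chunk within `chunk_size`; other designs let the overlap push chunks slightly over the limit instead.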

Enriching chunks

Quality jumps when each chunk carries more than its raw text:

Metadata — source, title, date, section, permissions — for filtering and citations.
Context prefix — prepend the document title and section heading so an isolated chunk still self-describes.
Small-to-big — embed a small chunk for precise matching, but return its larger parent for richer context.
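
The three enrichments above can be sketched with a small data structure. This is one possible shape, not a prescribed schema: the `Chunk` class, its field names, and both helper functions are hypothetical, and the sample text is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None  # link to the larger parent passage
    metadata: dict = field(default_factory=dict)  # source, title, section, ...


def with_context_prefix(chunk):
    """Build the text to embed: title and section heading, then the chunk body."""
    title = chunk.metadata.get("title", "")
    section = chunk.metadata.get("section", "")
    header = " > ".join(part for part in (title, section) if part)
    return f"{header}\n{chunk.text}" if header else chunk.text


def expand_to_parents(hits, parents):
    """Small-to-big: swap each matched small chunk for its parent passage,
    keeping hit order and dropping duplicate parents."""
    seen, expanded = set(), []
    for hit in hits:
        parent = parents.get(hit.parent_id)
        target = parent if parent is not None else hit
        if target.chunk_id not in seen:
            seen.add(target.chunk_id)
            expanded.append(target)
    return expanded


meta = {"title": "Security Guide", "section": "API keys", "source": "guide.md"}
parent = Chunk(chunk_id="p1", text="Full section on rotating and expiring API keys.", metadata=meta)
child_a = Chunk(chunk_id="c1", text="Keys expire after 90 days.", parent_id="p1", metadata=meta)
child_b = Chunk(chunk_id="c2", text="Rotate keys from the dashboard.", parent_id="p1", metadata=meta)

embed_text = with_context_prefix(child_a)
context = expand_to_parents([child_a, child_b], {"p1": parent})
```

The small chunks are what get embedded and matched; the parent is what gets injected into the prompt, and the metadata travels with both for filtering and citations.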

Hybrid search

Pure vector search is strong on meaning, weak on exact strings — error codes, product names, IDs. Hybrid search combines semantic (vector) and lexical (keyword/BM25) retrieval and fuses the results.

Hybrid search is one of the highest-ROI upgrades to naive RAG. If exact terms matter at all in your domain, add it early.
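
The text above does not say how the two result lists are fused. One common choice is reciprocal rank fusion (RRF), sketched below as an assumption rather than the required method. It needs only the ranked IDs from each retriever, so the scores from vector and BM25 search never have to be put on the same scale. The constant `k=60` is a conventional default, and the chunk IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list.

    Each chunk scores 1 / (k + rank) for every list it appears in,
    so chunks ranked well by more than one retriever rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; break ties by ID for a deterministic order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_hits = ["c1", "c2", "c3"]  # from semantic search
lexical_hits = ["c2", "c4"]       # from keyword/BM25 search
fused = reciprocal_rank_fusion([vector_hits, lexical_hits])
print(fused)
```

Here `c2` appears in both lists, so it outranks `c1` even though `c1` was the top vector hit.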

Query transformation

The user’s raw question is often a poor search query. Transform it before retrieval:

Query rewriting — turn a vague or conversational question into a clean, keyword-rich query. Essential in chat, where “what about the second one?” only makes sense with history.
Multi-query — generate several phrasings, retrieve for each, and union the results. Catches relevant chunks that any single phrasing would miss.
Decomposition — break a multi-part question into sub-questions, retrieve per sub-question, then synthesize.
HyDE — have the LLM draft a hypothetical answer, then embed that to search; a full answer often sits closer to real passages than a terse question.
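
As a sketch of the multi-query pattern from the list above: retrieve for the original question and for each rephrasing, then union the results without duplicates. Everything here is a stand-in. `toy_rephrase` returns hard-coded phrasings where a real system would call an LLM, and `toy_retrieve` is a word-overlap search over a three-entry corpus invented for the example.

```python
CORPUS = {
    "kb-1": "reset your password from the account settings page",
    "kb-2": "recover a locked login with the email recovery link",
    "kb-3": "update billing details and invoices",
}


def toy_retrieve(query, top_k):
    """Stand-in retriever: rank chunks by number of words shared with the query."""
    words = set(query.lower().split())
    scored = [(len(words & set(text.split())), cid) for cid, text in CORPUS.items()]
    hits = sorted((pair for pair in scored if pair[0] > 0), key=lambda p: (-p[0], p[1]))
    return [cid for _, cid in hits[:top_k]]


def toy_rephrase(question):
    """Stand-in for an LLM call that proposes alternative phrasings."""
    return ["reset password", "recover locked login"]


def multi_query_retrieve(question, rephrase, retrieve, top_k=5):
    """Retrieve for the question and each rephrasing; union results in first-seen order."""
    queries = [question, *rephrase(question)]
    seen, merged = set(), []
    for query in queries:
        for chunk_id in retrieve(query, top_k):
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged


question = "cannot sign in"
single = toy_retrieve(question, 5)
merged = multi_query_retrieve(question, toy_rephrase, toy_retrieve)
print(single, merged)
```

In this toy setup the raw question shares no words with the corpus and retrieves nothing, while the rephrasings surface both relevant entries. Passing `rephrase` and `retrieve` as arguments keeps the pattern independent of any specific LLM or search backend.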

Reranking

First-stage retrieval optimizes for speed and casts a wide net, so its ordering is imprecise. A reranker fixes that as a second stage.

A reranker (a cross-encoder model) reads the query and each chunk together, producing a far more accurate relevance score than embedding similarity alone. Retrieve broadly, rerank, then keep only the top few. This reliably lifts answer quality and shrinks the context you send — so it can cut cost too.
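
The retrieve-broadly, rerank, keep-the-top-few flow can be sketched as follows. The scoring function is pluggable: in practice it would wrap a cross-encoder model, but to keep this example self-contained it uses `toy_score`, a word-overlap (Jaccard) stand-in that is not a real reranker. The candidate texts are invented.

```python
def toy_score(query, text):
    """Stand-in for a cross-encoder: Jaccard overlap between query and chunk words."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def rerank(query, candidates, score_pair, keep=3):
    """Score every (query, chunk) pair, sort by score, and keep the best few.

    candidates is a list of (chunk_id, text) in first-stage order; the sort
    is stable, so ties keep their first-stage order.
    """
    scored = [(score_pair(query, text), chunk_id) for chunk_id, text in candidates]
    scored.sort(key=lambda pair: -pair[0])
    return [chunk_id for _, chunk_id in scored[:keep]]


# A broad first-stage candidate set, in the order the retriever returned it.
candidates = [
    ("a", "how to rotate api keys safely in production"),
    ("b", "api rate limits explained"),
    ("c", "rotate api keys with the cli"),
    ("d", "billing faq"),
]
top = rerank("rotate api keys", candidates, toy_score, keep=2)
print(top)
```

Only the two best chunks go on to the prompt, which is where the context and cost savings come from.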

A tuned retrieval pipeline

Add these stages (chunk enrichment, hybrid search, query transformation, reranking) one at a time, measuring each; see evaluation. Not every system needs all of them.

Key takeaways

Most RAG failures are retrieval failures.
Chunk with structure-aware recursive splitting and modest overlap; enrich chunks with metadata and context.
Add hybrid search so exact terms aren’t lost.
Transform weak user questions into good search queries before retrieving.
Add a reranker to turn a broad candidate set into a precise, compact context.
Introduce each lever incrementally and measure.