
Chunking & Retrieval

If a RAG system gives bad answers, the cause is usually retrieval, not generation — the model never received the right context. This page covers the levers that fix retrieval.

A chunk is the unit of retrieval — the passage you embed and later inject. Chunk size is a real trade-off:

  • Too large — one vector blurs many topics, retrieval gets imprecise, and chunks eat context-window budget.
  • Too small — a chunk loses the surrounding context needed to make sense.

| Strategy | How it works | Use when |
| --- | --- | --- |
| Fixed-size | N tokens per chunk, with overlap | Quick baseline, uniform text |
| Recursive | Split on paragraphs → sentences to respect structure | General-purpose default |
| Document-aware | Split on Markdown headings, code blocks, sections | Structured docs, code |
| Semantic | Split where the topic shifts (embedding-based) | High-value corpora; costlier to build |

Overlap — repeating ~10–20% of text between adjacent chunks — keeps a sentence split across a boundary from being lost.
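
As a rough sketch of the fixed-size baseline with overlap, the helper below windows a token list so that adjacent chunks share a few tokens. The name `fixed_size_chunks` and the whitespace tokenization are assumptions for illustration; a real pipeline would more likely count tokens with the embedding model's own tokenizer.

```python
def fixed_size_chunks(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical input: whitespace "tokens" stand in for real tokenizer output.
tokens = "the quick brown fox jumps over the lazy dog again".split()
chunks = fixed_size_chunks(tokens, chunk_size=5, overlap=1)
# Each chunk starts with the last token of the previous one.
```

With `overlap=1`, the token at each boundary appears in both neighbouring chunks, which is the effect the 10–20% guideline aims for at larger scale.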

# Recursive splitting: respect natural boundaries before falling back to length.
chunks = recursive_split(
    document,
    separators=["\n\n", "\n", ". ", " "],  # try paragraphs, then lines, ...
    chunk_size=500,
    chunk_overlap=75,
)
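
The `recursive_split` call above is not tied to a particular library here. One possible minimal implementation is sketched below, assuming sizes are measured in characters rather than tokens and that overlap is a plain character tail (so it may start mid-word). The helper `_split_pieces` is a name introduced only for this sketch.

```python
def _split_pieces(text: str, separators: list[str], size: int) -> list[str]:
    """Break text into pieces no longer than `size`, trying coarse separators first."""
    if len(text) <= size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut by length.
        return [text[i:i + size] for i in range(0, len(text), size)]
    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    pieces = []
    for i, part in enumerate(parts):
        if i < len(parts) - 1:
            part += sep  # keep the separator attached so no characters are lost
        pieces.extend(_split_pieces(part, finer, size))
    return pieces


def recursive_split(text: str, separators: list[str],
                    chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Greedily merge small pieces into chunks of at most chunk_size characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    # Leave room for the overlap tail so merged chunks stay within chunk_size.
    pieces = _split_pieces(text, separators, chunk_size - chunk_overlap)
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > chunk_size:
            chunks.append(current)
            # Start the next chunk with the tail of the one just emitted.
            current = current[-chunk_overlap:] if chunk_overlap else ""
        current += piece
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample document for illustration.
sample = "Alpha one. Alpha two.\n\nBeta one. Beta two.\n\nGamma."
sample_chunks = recursive_split(sample, ["\n\n", "\n", ". ", " "],
                                chunk_size=25, chunk_overlap=5)
```

Paragraph breaks are tried first, and finer separators are used only for pieces that are still too long, which is what keeps most chunks aligned with natural boundaries.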

Quality jumps when each chunk carries more than its raw text:

  • Metadata — source, title, date, section, permissions — for filtering and citations.
  • Context prefix — prepend the document title and section heading so an isolated chunk still self-describes.
  • Small-to-big — embed a small chunk for precise matching, but return its larger parent for richer context.
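
The three enrichments above can be sketched with a small data structure. Everything here (the `Chunk` class, the `parent_id` field, the `expand_to_parents` helper, and the sample texts) is hypothetical and only meant to show the shape of the idea, not a specific library's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    parent_id: str  # id of the larger section this small chunk was cut from
    metadata: dict = field(default_factory=dict)

    def embedding_text(self) -> str:
        """Text to embed: document title and section heading prepended to the chunk."""
        title = self.metadata.get("title", "")
        section = self.metadata.get("section", "")
        prefix = " > ".join(p for p in (title, section) if p)
        return f"{prefix}\n{self.text}" if prefix else self.text


def expand_to_parents(hits: list[Chunk], parents: dict[str, str]) -> list[str]:
    """Small-to-big: swap each matched chunk for its parent text, without duplicates."""
    seen, expanded = set(), []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            expanded.append(parents[hit.parent_id])
    return expanded


# Hypothetical data for illustration.
parents = {"sso-setup": "Full text of the Enabling SSO section."}
hit_a = Chunk("Set ENABLE_SSO=true in the config.", "sso-setup",
              {"title": "Admin Guide", "section": "Enabling SSO", "source": "admin.md"})
hit_b = Chunk("Restart the service after changing the flag.", "sso-setup",
              {"title": "Admin Guide", "section": "Enabling SSO", "source": "admin.md"})
context = expand_to_parents([hit_a, hit_b], parents)
```

Two small hits from the same section collapse into one parent passage, so the model sees the richer context once rather than two fragments.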

Pure vector search is strong on meaning but weak on exact strings such as error codes, product names, and IDs. Hybrid search combines semantic (vector) and lexical (keyword/BM25) retrieval and fuses the results.

Example: the query "config flag ENABLE_SSO not working" goes to both retrievers. Vector search contributes meaning (chunks about SSO setup); keyword search contributes precision (the exact string "ENABLE_SSO"). Reciprocal Rank Fusion merges the two ranked lists.

Hybrid search is one of the highest-ROI upgrades to naive RAG. If exact terms matter at all in your domain, add it early.
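
Reciprocal Rank Fusion is simple enough to sketch directly: each document scores the sum of 1 / (k + rank) over every ranked list it appears in. The chunk ids below are made up for illustration, and k = 60 is a commonly used default rather than a value taken from this page.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of ids into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the two retrievers.
vector_hits = ["sso_overview", "sso_setup", "login_faq"]
keyword_hits = ["enable_sso_flag", "sso_setup"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# "sso_setup" ranks first because both retrievers returned it.
```

Because RRF uses only ranks, it needs no score normalization between the vector and keyword retrievers, which is a large part of why it is a convenient first fusion method.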

The user’s raw question is often a poor search query. Transform it before retrieval:

  • Query rewriting — turn a vague or conversational question into a clean, keyword-rich query. Essential in chat, where “what about the second one?” only makes sense with history.
  • Multi-query — generate several phrasings, retrieve for each, and union the results. Catches relevant chunks that any single phrasing would miss.
  • Decomposition — break a multi-part question into sub-questions, retrieve per sub-question, then synthesize.
  • HyDE — have the LLM draft a hypothetical answer, then embed that to search; a full answer often sits closer to real passages than a terse question.
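
A minimal sketch of the multi-query pattern follows. The `rephrase` and `search` callables are placeholders: in a real system `rephrase` would be an LLM call and `search` a retriever, and the toy versions here exist only so the example runs on its own.

```python
from typing import Callable


def multi_query_retrieve(question: str,
                         rephrase: Callable[[str], list[str]],
                         search: Callable[[str, int], list[str]],
                         top_k: int = 5) -> list[str]:
    """Retrieve for the question and each rephrasing, then union the results in order."""
    seen, merged = set(), []
    for query in [question, *rephrase(question)]:
        for chunk_id in search(query, top_k):
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged


# Toy stand-ins (assumptions for illustration only).
INDEX = {
    "c1": "reset your password from the login page",
    "c2": "recover account access with a backup code",
    "c3": "change the billing email address",
}


def toy_search(query: str, top_k: int) -> list[str]:
    """Rank chunks by how many query words they contain."""
    words = set(query.lower().split())
    scored = [(len(words & set(text.split())), cid) for cid, text in INDEX.items()]
    return [cid for score, cid in sorted(scored, reverse=True) if score > 0][:top_k]


def toy_rephrase(question: str) -> list[str]:
    """Pretend an LLM produced these alternative phrasings."""
    return ["reset password", "recover account access"]


merged = multi_query_retrieve("locked out of my account", toy_rephrase, toy_search)
```

In this toy index the original question only matches `c2`; the "reset password" phrasing is what brings in `c1`.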

Retrieval optimizes for speed and casts a wide net; it’s not precise about ordering. A reranker fixes that as a second stage:

Retrieve top 25 candidates (fast, approximate, wide net) → Reranker: a cross-encoder scores each query–chunk pair (slow, accurate) → Keep the best 5 and send them to the LLM (precise, compact context)

A reranker (a cross-encoder model) reads the query and each chunk together, producing a far more accurate relevance score than embedding similarity alone. Retrieve broadly, rerank, then keep only the top few. This reliably lifts answer quality and shrinks the context you send — so it can cut cost too.
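
The two-stage shape can be sketched independently of any model. Here `score_pair` is a placeholder for whatever scores a query and a chunk together; in practice it would wrap a cross-encoder, while the word-overlap scorer below is only a stand-in so the example runs without dependencies.

```python
from typing import Callable


def retrieve_then_rerank(query: str,
                         candidates: list[str],
                         score_pair: Callable[[str, str], float],
                         keep: int = 5) -> list[str]:
    """Score every (query, chunk) pair and keep the highest-scoring chunks."""
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep input order
    return [chunk for _, chunk in scored[:keep]]


def toy_score(query: str, chunk: str) -> float:
    """Stand-in scorer: fraction of query words that appear in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


# Hypothetical first-stage candidates.
candidates = [
    "How to enable SSO for your workspace",
    "Billing and invoices",
    "SSO troubleshooting: ENABLE_SSO flag not working",
]
top = retrieve_then_rerank("ENABLE_SSO flag not working", candidates, toy_score, keep=2)
```

The first stage stays cheap because it never scores pairs; only the short candidate list pays the per-pair cost of the reranker.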

User query → Query transformation (rewrite, multi-query) → Hybrid search (vector + keyword, ~25 candidates) → Metadata filter (permissions, recency) → Rerank (score and narrow to top 5 chunks) → Generation

Add these stages one at a time, measuring each — see evaluation. Not every system needs all of them.
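
One way to make incremental adoption practical is to keep every stage optional, so the same entry point can be measured with and without it. The sketch below assumes hypothetical callables for each stage (`transform`, `allow`, `rerank`); none of these names come from a specific framework.

```python
from typing import Callable, Optional


def run_pipeline(query: str,
                 search: Callable[[str], list[str]],
                 transform: Optional[Callable[[str], list[str]]] = None,
                 allow: Optional[Callable[[str], bool]] = None,
                 rerank: Optional[Callable[[str, list[str]], list[str]]] = None,
                 keep: int = 5) -> list[str]:
    """Run retrieval with whichever optional stages are supplied."""
    queries = transform(query) if transform else [query]
    candidates: list[str] = []
    for q in queries:
        for chunk_id in search(q):
            if chunk_id not in candidates:  # union results across queries
                candidates.append(chunk_id)
    if allow:
        candidates = [c for c in candidates if allow(c)]  # metadata filter
    if rerank:
        candidates = rerank(query, candidates)
    return candidates[:keep]


# Toy stages (assumptions for illustration only).
RESULTS = {"sso": ["a", "b", "c"], "single sign-on": ["c", "d"]}


def search(q: str) -> list[str]:
    return RESULTS.get(q, [])


def transform(q: str) -> list[str]:
    return [q, "single sign-on"]


def allow(chunk_id: str) -> bool:
    return chunk_id != "b"  # pretend the user lacks permission for chunk "b"


def rerank(q: str, cands: list[str]) -> list[str]:
    return sorted(cands, reverse=True)  # placeholder ordering, not a real model


baseline = run_pipeline("sso", search)
full = run_pipeline("sso", search, transform, allow, rerank, keep=2)
```

Running the same evaluation set against `baseline` and against each added stage shows which stages actually earn their place.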

Most RAG failures are retrieval failures. Chunk with structure-aware recursive splitting and modest overlap; enrich chunks with metadata and context. Add hybrid search so exact terms aren’t lost. Transform weak user questions into good search queries before retrieving. Add a reranker to turn a broad candidate set into a precise, compact context. Introduce each lever incrementally and measure.