Retrieval-Augmented Generation (RAG)
Combine prompts with retrieval to ground answers in external knowledge, improving accuracy and traceability.
Indexing and Chunking Tactics — The Art of Turning a Novel into Useful Googleable Snacks
"You can give a model the entire internet, but if it can’t find the right sentence, it’s still guessing." — Probably me, at 2 a.m.
You already know the basics of RAG (we covered concepts and benefits) and how embeddings vectorize meaning (that lovely chapter we just did). Now we’re entering the pragmatic, slightly messy, very important realm: how to slice your documents and how to store those slices so retrieval is fast, accurate, and cheap. This is the stage where retrieval meets engineering discipline — and where many projects quietly explode or quietly succeed.
Why indexing + chunking matter (without the fluff)
- Embeddings live at the chunk level. If chunks are garbage, embeddings are garbage.
- Chunking controls precision vs recall: big chunks = more context but fuzzier retrieval; small chunks = precise hits but more noise and more storage.
- Index structure affects latency, memory, and the scalability of your RAG system.
(Tip: if you read the previous section on Tools, Functions, and Agentic Workflows, think of chunking/indexing as the planner’s strategy choice — the executor actually runs the splits, embeds, upserts, and monitors outcomes.)
Core tactics: chunking strategies
1) Semantic chunking (preferred when possible)
- What: Split by logical units — paragraphs, sections, headings, code blocks.
- Why: Keeps semantically coherent pieces, so an embedding represents a single idea.
- When not to use: documents with no clear structure to split on (e.g., raw logs).
2) Fixed-size chunking (token-based)
- What: Split into N-token chunks (e.g., 200–500 tokens).
- Why: Predictable embedding sizes and costs; aligns with model token limits.
- Drawback: May split ideas mid-sentence.
3) Sliding windows / overlap
- Add 10–30% overlap between chunks to preserve context across boundaries.
- Helps when the answer spans a boundary; costs more space but increases retrieval recall.
4) Hybrid: headings + truncation
- Use headings to create semantic chunks, but if a heading block is huge, break it by tokens.
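The fixed-size and sliding-window tactics above can be sketched in a few lines. This is a minimal illustration that uses whitespace tokens as a stand-in for a real tokenizer (in practice you'd count tokens with your embedding model's tokenizer, e.g., tiktoken); the function name and defaults are just for this example.

```python
def chunk_tokens(text, size=300, overlap=60):
    """Split text into fixed-size token chunks with a sliding-window overlap.

    Uses whitespace-separated words as a crude stand-in for real tokens;
    swap in your model's tokenizer for production use.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    tokens = text.split()
    step = size - overlap  # how far the window slides each iteration
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

With `size=300, overlap=60` you get the ~20% overlap recommended above; neighboring chunks share a 60-token boundary region, so an answer that straddles a split still lands fully inside at least one chunk.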
Heuristics: how big should a chunk be?
- Short content (FAQs): 50–150 tokens
- Documentation & articles: 150–400 tokens
- Books or long reports: 300–800 tokens with overlap
Table: Chunk size trade-offs
| Chunk size | Pros | Cons |
|---|---|---|
| Small (50–150 tokens) | Precise retrieval, cheap to re-rank | More vectors, potential missing context |
| Medium (150–400 tokens) | Good balance of context and precision | Moderate storage & compute |
| Large (400–800 tokens) | Lots of context in one hit | Lower precision, higher cost |
Indexing architectures — which engine for which vibe
- Flat vector (brute force): simple and exact (no approximation), great for small corpora.
- HNSW (Hierarchical Navigable Small World): excellent latency & recall for medium/large datasets.
- IVF + PQ (Inverted File + Product Quantization): efficient for huge datasets where memory matters.
- Managed vector DBs (Pinecone, Weaviate, Milvus, Qdrant): add metadata filters, multi-tenancy, and durability.
Practical rule: start simple (FAISS or managed DB) and optimize when you hit latency or cost problems.
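To make the "start simple" rule concrete, here's what a flat (brute-force) search amounts to: normalize, dot-product against every vector, take the top k. This is a NumPy-only sketch for intuition, not a substitute for FAISS or a managed DB, which do the same thing with optimized kernels and (for HNSW/IVF+PQ) approximate shortcuts.

```python
import numpy as np

def top_k(query, vectors, k=3):
    """Exact cosine-similarity search over a small in-memory corpus.

    query:   (d,) embedding of the user's question
    vectors: (n, d) matrix of chunk embeddings
    Returns the indices and scores of the k most similar chunks.
    """
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                      # cosine similarity per chunk
    idx = np.argsort(-scores)[:k]      # highest scores first
    return idx, scores[idx]
```

When this O(n) scan becomes your latency bottleneck, that's the signal to move to HNSW or IVF+PQ, not before.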
Metadata, filtering, and hybrid retrieval
- Always store and index metadata with each chunk: doc_id, section, timestamp, source_url, confidentiality flags.
- Use metadata filters for precision (e.g., only fetch from docs last updated before 2024, or only from internal manuals).
- Hybrid retrieval: combine lexical search (BM25) + vector search for cases where exact phrase matching matters (e.g., legal citations, code).
Example flow:
- Run BM25 to get candidates (good for exact matches).
- Run vector search to get semantic candidates.
- Union/rerank candidates using a cross-encoder or scoring heuristic.
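For the union/rerank step, one common scoring heuristic (lighter than a cross-encoder) is reciprocal rank fusion: each document earns 1/(k + rank) from every candidate list it appears in, so items ranked well by both BM25 and vector search float to the top. A minimal sketch, with hypothetical doc ids:

```python
def rrf_merge(ranked_lists, k=60):
    """Merge several ranked candidate lists with reciprocal rank fusion.

    ranked_lists: e.g., [bm25_ids, vector_ids], each best-first.
    k is a smoothing constant (60 is a conventional default).
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score calibration between the two retrievers (BM25 scores and cosine similarities live on different scales), which is why it's a popular first choice before investing in a cross-encoder.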
Practical pipeline (planner → executor) with functions & error handling
Planner: decides split strategy, chooses index, selects embedding model, sets metadata.
Executor: runs splitting, embeds chunks, upserts to vector DB, logs metrics, retries errors.
Pseudocode (Python-style):
# Planner: pick strategy, split, prepare metadata
strategy = choose_chunking_strategy(doc)
chunks = split_doc(doc, strategy)
metadata = build_metadata(doc)

# Executor: embed, upsert, handle failures robustly
try:
    embeddings = embed_batch(chunks)
    upsert_to_index(chunks, embeddings, metadata)
except TransientError:
    # Rate limits and network blips are worth retrying
    retry(upsert_to_index, chunks, embeddings, metadata, attempts=3)
except Exception as e:
    log_error(e, context=doc.id)
    alert_oncall(e)
Observability best practices: track embedding time, upsert latency, index size, retrieval latency, recall@k, and MRR. Content hashing and index versioning make debugging reproducible.
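The two retrieval-quality metrics named above are easy to compute once you have labeled (query, relevant-docs) pairs; here is a small sketch so the definitions are unambiguous:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant docs that appear in the top-k retrieved list."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(queries):
    """Mean reciprocal rank over (retrieved_list, relevant_set) pairs.

    Each query contributes 1/rank of the first relevant hit, or 0 if
    nothing relevant was retrieved.
    """
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```

Track both: recall@k tells you whether the answer is in the candidate set at all; MRR tells you how high it ranks, which matters when you only stuff the top few chunks into the prompt.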
Incremental indexing, re-embedding, and lifecycle
- Upserts vs full rebuilds: upserts are faster, but schema or embedding model changes often require a rebuild.
- Version your embeddings and index schema: store embedding_model_id with each vector.
- Re-embed only changed documents where possible.
When to re-embed:
- You change the embedding model
- You adjust chunking strategy significantly
- You change tokenization behavior
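The "re-embed only changed documents" rule usually comes down to a content-hash comparison. A minimal sketch, assuming a hypothetical `index_state` mapping that stores the hash and embedding-model id recorded at last upsert (the names here are illustrative, not a real library API):

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reembed(docs, index_state, model_id):
    """Return ids of docs whose text changed or whose vectors were
    built with a different embedding model.

    docs:        {doc_id: text} for the current corpus
    index_state: {doc_id: (stored_hash, stored_model_id)} from the index
    """
    changed = []
    for doc_id, text in docs.items():
        stored = index_state.get(doc_id)
        if stored is None or stored != (content_hash(text), model_id):
            changed.append(doc_id)
    return changed
```

Storing `embedding_model_id` alongside the hash is what lets a model upgrade trigger a clean, targeted rebuild instead of a guessing game; the same hash doubles as the integrity check mentioned in the privacy section below.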
Privacy, PII, and legal hygiene
- Remove or redact PII before indexing, or mark chunks as restricted and apply filters.
- Track data provenance in metadata for audits.
- Embed hashes of original text for integrity checks (don’t store raw PII when you can avoid it).
Quick checklist (because you’ll forget one of these)
- Decide semantic vs token chunking
- Use 10–30% overlap for boundary sensitivity
- Store rich metadata and model ids
- Prefer HNSW for moderate scale; IVF+PQ for massive scale
- Implement hybrid lexical + semantic retrieval where needed
- Version your index and embeddings
- Monitor recall@k, latency, and index growth
- Redact or flag PII
Final Mic Drop / TL;DR
Chunking is the design decision that governs everything: recall, precision, cost, and how often your system says something confidently wrong. Treat chunking and indexing as product features, not afterthoughts. Use the planner–executor pattern from our Tools & Functions section: the planner picks the strategy, the executor runs robust, observable jobs. Start with semantic chunks + medium size + 10–20% overlap, store metadata, and iterate.
Go forth and slice responsibly. Your users will thank you. (Possibly with bug reports — but fewer ones if you follow this guide.)