Retrieval-Augmented Generation (RAG)
Combine prompts with retrieval to ground answers in external knowledge, improving accuracy and traceability.
Embeddings and Vectorization
Embeddings and Vectorization — The Secret Sauce of RAG (with Extra Spice)
"Embeddings are how we turn messy human meaning into neat numerical vibes the model can touch." — Your overexcited TA
You already know why RAG matters and how tools/functions can be orchestrated into planner–executor dances (and how to fall back when a tool ghosts you). Now we're past the architecture pep talk and into the actual plumbing: embeddings and vectorization — the tiny mathematical elves that fetch the right context for your LLM so it can stop hallucinating and start answering.
Why embeddings? Quick refresher (no duplicate lecture)
Recall RAG's two-step rhythm: 1) retrieve relevant documents, 2) condition the LLM on them to generate responses. Embeddings let retrieval be semantic, not just keyword mashups. Instead of matching raw text strings, we compare meaning in vector space.
Think: a search that understands "How to tie a Windsor" and returns a tutorial on the knot — not an ad for Windsor laundry detergent.
What is an embedding, actually?
- Definition: A dense vector (commonly a few hundred to a few thousand dimensions) representing the semantic meaning of a piece of text.
- Key idea: Similar texts => nearby vectors.
Important terms:
- Vectorization — converting tokens/text to embeddings
- Vector store (index) — a database optimized for similarity search
- Similarity metric — cosine, dot product, L2 (Euclidean)
How embeddings are created (high-level)
- Normalize text: lowercasing, minimal cleaning (don’t over-sanitize — context matters).
- Chunking: split long docs into passages (200–500 tokens typical).
- Batch encode: call an embedding model to get vectors.
- Index: add vectors + metadata to a vector DB (FAISS, HNSW, Qdrant, Pinecone, etc.).
Pseudocode: Batch vectorization and indexing
```python
# Pseudocode (not a library-specific snippet)
chunks = chunk_document(doc, size=300, overlap=50)
vectors = []
for batch in batchify(chunks, batch_size=64):
    vectors.extend(embed_model.encode(batch))
# Upsert vectors together with per-chunk metadata (doc ID, offsets, ...)
vecstore.upsert(vectors, metadata=[chunk.metadata for chunk in chunks])
```
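To make the sketch above concrete, here is a self-contained toy version. `toy_embed` is a deterministic stand-in for a real embedding model (it hashes text, so it is not semantic), and a plain dict plays the role of the vector store; in a real pipeline you would swap in an embedding model client and a vector DB.

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Deterministic stand-in for an embedding model (NOT semantic)."""
    raw = hashlib.sha256(text.encode()).digest()[:dim]
    vec = [b / 255.0 for b in raw]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # L2-normalize up front

def chunk_words(doc: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Overlapping word-window chunking (token-based in real systems)."""
    words = doc.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def batchify(items: list, batch_size: int):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

store: dict[int, tuple[list[float], dict]] = {}  # id -> (vector, metadata)
doc = "the quick brown fox jumps over the lazy dog " * 90  # ~810 words
chunks = chunk_words(doc)
next_id = 0
for batch in batchify(chunks, batch_size=64):
    for chunk in batch:
        store[next_id] = (toy_embed(chunk), {"doc_id": "doc-1", "chunk": next_id})
        next_id += 1
```

Note that the embedder normalizes vectors at encode time; doing this once up front is a common design choice that lets the index use a cheap dot product as the similarity metric.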
Vector stores at a glance (quick reference table)
| Vector DB | Strengths | Typical use-case |
|---|---|---|
| FAISS | Fast, local, configurable IVF/HNSW, offline-friendly | Research, prototype, self-hosted |
| Milvus | Scalable, GPU support, hybrid search | Enterprises needing throughput |
| Pinecone | Managed, easy API, metadata filtering | Fast production deployment |
| Qdrant | Open-source, filters & collections | Flexible production/self-hosting |
| Weaviate | Schema-aware, hybrid | Semantic search + KG use |
Similarity metrics & normalization — pick your fights
- Cosine similarity — most common for semantic embeddings (scale-invariant).
- Dot product — fast; equivalent to cosine when embeddings are L2-normalized, and some models are trained expecting it.
- Euclidean (L2) — sometimes used in dense vector spaces but less common for text.
Tip: normalize vectors if your index/metric expects it — mismatched metrics = sad retrieval.
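A minimal sketch of the three metrics, using plain Python lists as vectors. The final assertion demonstrates the tip above: for L2-normalized vectors, dot product and cosine similarity coincide, which is why many indexes only implement inner product.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v: list[float]) -> list[float]:
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

a, b = [3.0, 4.0], [4.0, 3.0]
# Dot product of normalized vectors == cosine of the originals.
assert abs(dot(normalize(a), normalize(b)) - cosine(a, b)) < 1e-9
```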
Chunking strategy — the underrated hero
- Keep chunks semantically coherent (stop mid-sentence = anger of the gods).
- Use overlapping windows (e.g., 50–100 token overlap) to preserve context across boundaries.
- Store chunk-level metadata: source doc ID, chunk start/end, timestamp, author.
Why metadata matters: it enables filtering (e.g., "only policies after 2022") and helps with observability/logging of retrieval decisions (remember that previous module on observability?).
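A sketch of chunk-level metadata enabling that kind of filter. The field names and sample records here are illustrative, not a standard schema; the point is that "only policies after 2022" becomes a metadata pre-filter applied before (or alongside) the vector similarity search.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    start: int   # word/token offset of the chunk in the source doc
    end: int
    year: int
    text: str

chunks = [
    Chunk("policy-1", 0, 300, 2021, "old travel policy ..."),
    Chunk("policy-2", 0, 300, 2023, "new travel policy ..."),
]

# Metadata filter: restrict the candidate set before similarity scoring.
eligible = [c for c in chunks if c.year > 2022]
```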
Hybrid retrieval: marry BM25 with vectors
Vectors capture semantics; lexical methods (BM25) capture exact matches and rare tokens. A practical pipeline:
- Run BM25 to get lexical top-K.
- Run vector search for semantic top-K.
- Merge/re-rank using a cross-encoder or scoring function.
This reduces misses on factual strings (IDs, product SKUs, names).
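One common way to merge the two ranked lists is Reciprocal Rank Fusion (RRF), which needs only ranks, not comparable scores. The doc IDs and result lists below are made up for illustration.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; larger k dampens the weight of top ranks."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

bm25_top = ["sku-42", "doc-a", "doc-b"]    # lexical hits (exact strings, IDs)
vector_top = ["doc-a", "doc-c", "sku-42"]  # semantic hits
fused = rrf([bm25_top, vector_top])
```

Documents that appear in both lists (here `doc-a` and `sku-42`) accumulate score from each and float to the top, which is exactly the behavior you want for IDs and SKUs that lexical search catches but vector search may miss.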
Re-ranking and cross-encoders
After retrieving candidate passages, a cross-encoder (full attention) can rerank them more precisely. Pros: improved precision. Cons: latency & cost.
Use re-rankers on a small final shortlist (e.g., the top-10 candidates), not on the full candidate list; the latency adds up fast.
Embedding versioning, freshness & drift
- Version your embedding model! If you re-embed with a new model, similarity relationships change. Keep old vectors or reingest strategically.
- Freshness strategy: incremental indexing for new documents; use background jobs to re-embed periodically.
- Concept drift: monitor retrieval quality over time; set alerts if top-k precision drops.
Observability note: log retrieval IDs, embedding model version, and similarity scores for every query. This makes debugging hallucinations traceable back to a bad vector or stale doc (we saw this pattern in agentic workflows debugging).
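A minimal sketch of such a log entry as structured JSON; the field names are illustrative, but the key idea is that every line carries the embedding model version alongside the retrieved IDs and scores.

```python
import json
import time

def log_retrieval(query: str, results: list[tuple[str, float]],
                  model_version: str) -> str:
    """Serialize one retrieval event as a JSON log line."""
    entry = {
        "ts": time.time(),
        "query": query,
        "embedding_model": model_version,  # version every log line
        "results": [{"id": doc_id, "score": score} for doc_id, score in results],
    }
    return json.dumps(entry)

line = log_retrieval("refund policy?", [("doc-7", 0.91), ("doc-2", 0.84)], "embed-v3")
```

With versioned logs like this, a hallucination can be traced back to a specific query, a specific set of retrieved chunks, and the model version that embedded them.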
Evaluating retrieval quality — metrics to track
- Precision@k, Recall@k
- MRR (Mean Reciprocal Rank)
- nDCG (normalized discounted cumulative gain)
- Latency and throughput
- Embedding model version heatmap (track performance per version)
Pro tip: log the LLM output with the retrieved passages. When hallucinations happen, you’ll quickly see whether relevant docs were missing or mis-ranked.
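Precision@k and MRR are simple enough to implement from scratch. A sketch over a tiny relevance-judged set (the judgments below are invented for illustration):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    """Mean reciprocal rank of the first relevant hit per query."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)

q1 = (["d1", "d2", "d3"], {"d2"})  # first relevant hit at rank 2 -> RR 0.5
q2 = (["d4", "d5", "d6"], {"d4"})  # first relevant hit at rank 1 -> RR 1.0
```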
Prompting with retrieved context — safety & performance tips
- Prepend a brief instruction describing the provenance of each passage (source + score).
- Limit tokens: choose top-K by score until you hit the token budget.
- Sanitize: remove obvious malicious inputs; treat retrieved text as untrusted.
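The token-budget step above can be sketched as a greedy packer. The 4-characters-per-token estimate is a rough heuristic standing in for a real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate; use the model's tokenizer in production."""
    return max(1, len(text) // 4)

def pack_context(passages: list[tuple[str, float]], budget: int) -> list[str]:
    """Take passages in descending score order until the budget is spent."""
    chosen: list[str] = []
    used = 0
    for text, _score in sorted(passages, key=lambda p: p[1], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > budget:
            break
        chosen.append(text)
        used += cost
    return chosen

passages = [("a" * 400, 0.9), ("b" * 400, 0.8), ("c" * 400, 0.7)]
ctx = pack_context(passages, budget=210)  # each passage costs ~100 tokens
```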
And remember fallback mechanisms: if the vector DB fails or latency spikes, fall back to retrieval-free modes (e.g., a knowledge-base summary or cached answers) to maintain UX (you already learned how to design that in the functions/tools module).
Common pitfalls & quick remedies
- Low diversity in vectors → try a different embedding model or fine-tune.
- Token limits exhausted by verbose retrieval → tighten chunk size, re-rank more aggressively.
- Drift after model upgrade → flag old vectors and stagger re-embedding.
- Index corruption / latency spike → implement graceful degradations and cached results.
Final checklist — What to implement next
- Choose an embedding model and version it.
- Decide chunking and overlap strategy; store chunk metadata.
- Implement batch embedding + efficient upsert into a vector store.
- Add hybrid retrieval (BM25 + vector) and a light re-ranker.
- Log retrieval results, scores, and model versions for observability.
- Plan fallback behavior for index downtime or degraded quality.
Closing Mic Drop
Embeddings are the quiet genius of RAG: they convert our messy, poetic human intent into something a model can actually retrieve and reason over. Do them well — chunk neatly, version boldly, log obsessively — and your LLM will be less of an imaginative fiction-writer and more of a reliable research assistant.
Want a tiny homework prompt? Try: embed a Wikipedia article with two different models, run similarity queries for 10 sample questions, and compare Precision@5. Log the differences and see which model gives you fewer hallucinations. Go forth and vectorize.
"If vectors are the ink, then embeddings are the pen. Write thoughtfully." — Still your TA