Retrieval-Augmented Generation (RAG)
Combine prompts with retrieval to ground answers in external knowledge, improving accuracy and traceability.
Embeddings and Vectorization
Embeddings and Vectorization — The Secret Sauce of RAG (with Extra Spice)
"Embeddings are how we turn messy human meaning into neat numerical vibes the model can touch." — Your overexcited TA
You already know why RAG matters and how tools/functions can be orchestrated into planner–executor dances (and how to fall back when a tool ghosts you). Now we're past the architecture pep talk and into the actual plumbing: embeddings and vectorization — the tiny mathematical elves that fetch the right context for your LLM so it can stop hallucinating and start answering.
Why embeddings? Quick refresher (no duplicate lecture)
Recall RAG's two-step rhythm: 1) retrieve relevant documents, 2) condition the LLM on them to generate responses. Embeddings let retrieval be semantic, not just keyword mashups. Instead of matching raw text strings, we compare meaning in vector space.
Think: a search that understands "How to tie a Windsor" and returns a tutorial on the knot — not an ad for Windsor laundry detergent.
What is an embedding, actually?
- Definition: A dense vector (commonly a few hundred to a few thousand dimensions) representing the semantic meaning of a piece of text.
- Key idea: Similar texts => nearby vectors.
Important terms:
- Vectorization — converting tokens/text to embeddings
- Vector store (index) — a database optimized for similarity search
- Similarity metric — cosine, dot product, L2 (Euclidean)
How embeddings are created (high-level)
- Normalize text: lowercasing, minimal cleaning (don’t over-sanitize — context matters).
- Chunking: split long docs into passages (200–500 tokens typical).
- Batch encode: call an embedding model to get vectors.
- Index: add vectors + metadata to a vector DB (FAISS, HNSW, Qdrant, Pinecone, etc.).
Pseudocode: Batch vectorization and indexing
```python
# Pseudocode (not a library-specific snippet)
chunks = chunk_document(doc, size=300, overlap=50)
vectors = []
for batch in batchify(chunks, batch_size=64):
    vectors.extend(embed_model.encode(batch))
# Upsert vectors together with per-chunk metadata (doc ID, offsets, ...)
vecstore.upsert(vectors, metadata=[chunk.metadata for chunk in chunks])
```
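To make the sketch above concrete, here is a self-contained toy version. `toy_embed` is a deterministic stand-in for a real embedding model (it hashes text, so it is not semantic), and a plain dict plays the role of the vector store; in a real pipeline you would swap in an embedding model client and a vector DB.

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Deterministic stand-in for an embedding model (NOT semantic)."""
    raw = hashlib.sha256(text.encode()).digest()[:dim]
    vec = [b / 255.0 for b in raw]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # L2-normalize up front

def chunk_words(doc: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Overlapping word-window chunking (token-based in real systems)."""
    words = doc.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def batchify(items: list, batch_size: int):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

store: dict[int, tuple[list[float], dict]] = {}  # id -> (vector, metadata)
doc = "the quick brown fox jumps over the lazy dog " * 90  # ~810 words
chunks = chunk_words(doc)
next_id = 0
for batch in batchify(chunks, batch_size=64):
    for chunk in batch:
        store[next_id] = (toy_embed(chunk), {"doc_id": "doc-1", "chunk": next_id})
        next_id += 1
```

Note that the embedder normalizes vectors at encode time; doing this once up front is a common design choice that lets the index use a cheap dot product as the similarity metric.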
Vector stores at a glance (quick reference table)
| Vector DB | Strengths | Typical use-case |
|---|---|---|
| FAISS | Fast, local, configurable IVF/HNSW, offline-friendly | Research, prototype, self-hosted |
| Milvus | Scalable, GPU support, hybrid search | Enterprises needing throughput |
| Pinecone | Managed, easy API, metadata filtering | Fast production deployment |
| Qdrant | Open-source, filters & collections | Flexible production/self-hosting |
| Weaviate | Schema-aware, hybrid | Semantic search + KG use |
Similarity metrics & normalization — pick your fights
- Cosine similarity — most common for semantic embeddings (scale-invariant).
- Dot product — fast; equivalent to cosine when embeddings are L2-normalized, and some models are trained expecting it.
- Euclidean (L2) — sometimes used in dense vector spaces but less common for text.
Tip: normalize vectors if your index/metric expects it — mismatched metrics = sad retrieval.
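A minimal sketch of the three metrics, using plain Python lists as vectors. The final assertion demonstrates the tip above: for L2-normalized vectors, dot product and cosine similarity coincide, which is why many indexes only implement inner product.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v: list[float]) -> list[float]:
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

a, b = [3.0, 4.0], [4.0, 3.0]
# Dot product of normalized vectors == cosine of the originals.
assert abs(dot(normalize(a), normalize(b)) - cosine(a, b)) < 1e-9
```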
Chunking strategy — the underrated hero
- Keep chunks semantically coherent (stop mid-sentence = anger of the gods).
- Use overlapping windows (e.g., 50–100 token overlap) to preserve context across boundaries.
- Store chunk-level metadata: source doc ID, chunk start/end, timestamp, author.
Why metadata matters: it enables filtering (e.g., "only policies after 2022") and helps with observability/logging of retrieval decisions (remember that previous module on observability?).
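A sketch of chunk-level metadata enabling that kind of filter. The field names and sample records here are illustrative, not a standard schema; the point is that "only policies after 2022" becomes a metadata pre-filter applied before (or alongside) the vector similarity search.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    start: int   # word/token offset of the chunk in the source doc
    end: int
    year: int
    text: str

chunks = [
    Chunk("policy-1", 0, 300, 2021, "old travel policy ..."),
    Chunk("policy-2", 0, 300, 2023, "new travel policy ..."),
]

# Metadata filter: restrict the candidate set before similarity scoring.
eligible = [c for c in chunks if c.year > 2022]
```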
Hybrid retrieval: marry BM25 with vectors
Vectors capture semantics; lexical methods (BM25) capture exact matches and rare tokens. A practical pipeline:
- Run BM25 to get lexical top-K.
- Run vector search for semantic top-K.
- Merge/re-rank using a cross-encoder or scoring function.
This reduces misses on factual strings (IDs, product SKUs, names).
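One common way to merge the two ranked lists is Reciprocal Rank Fusion (RRF), which needs only ranks, not comparable scores. The doc IDs and result lists below are made up for illustration.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; larger k dampens the weight of top ranks."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

bm25_top = ["sku-42", "doc-a", "doc-b"]    # lexical hits (exact strings, IDs)
vector_top = ["doc-a", "doc-c", "sku-42"]  # semantic hits
fused = rrf([bm25_top, vector_top])
```

Documents that appear in both lists (here `doc-a` and `sku-42`) accumulate score from each and float to the top, which is exactly the behavior you want for IDs and SKUs that lexical search catches but vector search may miss.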
Re-ranking and cross-encoders
After retrieving candidate passages, a cross-encoder (full attention) can rerank them more precisely. Pros: improved precision. Cons: latency & cost.
Use re-rankers on a small final shortlist (e.g., the top-10 candidates), not on the full candidate list; the latency adds up fast.
Embedding versioning, freshness & drift
- Version your embedding model! If you re-embed with a new model, similarity relationships change. Keep old vectors or reingest strategically.
- Freshness strategy: incremental indexing for new documents; use background jobs to re-embed periodically.
- Concept drift: monitor retrieval quality over time; set alerts if top-k precision drops.
Observability note: log retrieval IDs, embedding model version, and similarity scores for every query. This makes debugging hallucinations traceable back to a bad vector or stale doc (we saw this pattern in agentic workflows debugging).
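A minimal sketch of such a log entry as structured JSON; the field names are illustrative, but the key idea is that every line carries the embedding model version alongside the retrieved IDs and scores.

```python
import json
import time

def log_retrieval(query: str, results: list[tuple[str, float]],
                  model_version: str) -> str:
    """Serialize one retrieval event as a JSON log line."""
    entry = {
        "ts": time.time(),
        "query": query,
        "embedding_model": model_version,  # version every log line
        "results": [{"id": doc_id, "score": score} for doc_id, score in results],
    }
    return json.dumps(entry)

line = log_retrieval("refund policy?", [("doc-7", 0.91), ("doc-2", 0.84)], "embed-v3")
```

With versioned logs like this, a hallucination can be traced back to a specific query, a specific set of retrieved chunks, and the model version that embedded them.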
Evaluating retrieval quality — metrics to track
- Precision@k, Recall@k
- MRR (Mean Reciprocal Rank)
- nDCG (normalized discounted cumulative gain)
- Latency and throughput
- Embedding model version heatmap (track performance per version)
Pro tip: log the LLM output with the retrieved passages. When hallucinations happen, you’ll quickly see whether relevant docs were missing or mis-ranked.
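Precision@k and MRR are simple enough to implement from scratch. A sketch over a tiny relevance-judged set (the judgments below are invented for illustration):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    """Mean reciprocal rank of the first relevant hit per query."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)

q1 = (["d1", "d2", "d3"], {"d2"})  # first relevant hit at rank 2 -> RR 0.5
q2 = (["d4", "d5", "d6"], {"d4"})  # first relevant hit at rank 1 -> RR 1.0
```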
Prompting with retrieved context — safety & performance tips
- Prepend a brief instruction describing the provenance of each passage (source + score).
- Limit tokens: choose top-K by score until you hit the token budget.
- Sanitize: remove obvious malicious inputs; treat retrieved text as untrusted.
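The token-budget step above can be sketched as a greedy packer. The 4-characters-per-token estimate is a rough heuristic standing in for a real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate; use the model's tokenizer in production."""
    return max(1, len(text) // 4)

def pack_context(passages: list[tuple[str, float]], budget: int) -> list[str]:
    """Take passages in descending score order until the budget is spent."""
    chosen: list[str] = []
    used = 0
    for text, _score in sorted(passages, key=lambda p: p[1], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > budget:
            break
        chosen.append(text)
        used += cost
    return chosen

passages = [("a" * 400, 0.9), ("b" * 400, 0.8), ("c" * 400, 0.7)]
ctx = pack_context(passages, budget=210)  # each passage costs ~100 tokens
```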
And remember fallback mechanisms: if the vector DB fails or latency spikes, fall back to retrieval-free modes (e.g., a knowledge-base summary or cached answers) to maintain UX (you already learned how to design that in the functions/tools module).
Common pitfalls & quick remedies
- Low diversity in vectors → try a different embedding model or fine-tune.
- Token limits exhausted by verbose retrieval → tighten chunk size, re-rank more aggressively.
- Drift after model upgrade → flag old vectors and stagger re-embedding.
- Index corruption / latency spike → implement graceful degradations and cached results.
Final checklist — What to implement next
- Choose an embedding model and version it.
- Decide chunking and overlap strategy; store chunk metadata.
- Implement batch embedding + efficient upsert into a vector store.
- Add hybrid retrieval (BM25 + vector) and a light re-ranker.
- Log retrieval results, scores, and model versions for observability.
- Plan fallback behavior for index downtime or degraded quality.
Closing Mic Drop
Embeddings are the quiet genius of RAG: they convert our messy, poetic human intent into something a model can actually retrieve and reason over. Do them well — chunk neatly, version boldly, log obsessively — and your LLM will be less of an imaginative fiction-writer and more of a reliable research assistant.
Want a tiny homework prompt? Try: embed a Wikipedia article with two different models, run similarity queries for 10 sample questions, and compare Precision@5. Log the differences and see which model gives you fewer hallucinations. Go forth and vectorize.
"If vectors are the ink, then embeddings are the pen. Write thoughtfully." — Still your TA