Retrieval-Augmented Generation (RAG)
Combine prompts with retrieval to ground answers in external knowledge, improving accuracy and traceability.
Re-Ranking and Fusion in RAG — The Chaotic Good Guide
"Think of retrieval like speed-dating for knowledge. Re-ranking is where you decide which of the suitors actually deserve a second date. Fusion is the awkward montage where you try to stitch together all the good parts without sounding insane."
You're already comfortable with query-construction tricks (Position 4) and chunking/indexing tactics (Position 3). Great — we won't waste time on the basics. Instead, we'll take those well-formed queries and well-chunked indexes and show how to make retrieval actually useful: re-ranking the hits and fusing them into a coherent, faithful, and useful answer. We'll also fold in the previous topic (tools, functions, planner–executor patterns) so your pipeline is not only smart, but debuggable and manageable.
Why re-rank & fuse? (aka the problem statement)
- Raw retrieval (BM25/k-NN embeddings) gives you candidates — often noisy, partially relevant, or overlapping.
- A generator left to its own devices will either ignore relevant documents or hallucinate from tangential ones.
Re-ranking sifts the pile for the highest-quality evidence. Fusion assembles that evidence into an answer that maximizes useful information while minimizing contradictions and hallucination.
Re-Ranking: The Gatekeeper
What it is: Re-ranking takes an initial candidate set R (from BM25 or vector search) and re-orders or re-scores them using a finer-grained model (usually a cross-encoder or a stronger bi-encoder).
Common techniques:
- Lexical re-rankers: BM25 or TF-IDF refinement (fast; baselines).
- Dense re-rankers: dot-product similarity with stronger embeddings (approximate-nearest-neighbor refinement via FAISS/Annoy/k-NN).
- Cross-encoder re-rankers: feed (query, doc) pairs to a transformer that outputs a relevance score — slow but high fidelity.
- Learning-to-rank ensembles: combine features (BM25 score, embedding similarity, doc recency, citations) into a learned model.
When to use what
| Method | Speed | Accuracy | Cost | Use when... |
|---|---|---|---|---|
| BM25 | Very fast | Low-medium | Cheap | small index; keyword-heavy queries |
| Dense bi-encoder | Fast | Medium | Moderate | semantic matches, many queries |
| Cross-encoder | Slow | High | Expensive | few candidates (<100), need precision |
| Hybrid (BM25 + cross) | Medium | High | Moderate-High | practical production tradeoff |
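A common hybrid recipe is reciprocal rank fusion (RRF), which combines multiple ranked lists using only rank positions, so you never have to normalize BM25 scores against cosine similarities. A minimal sketch (the doc IDs are illustrative; `k=60` is the constant from the original RRF paper):

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: combine several ranked lists of doc IDs.

    A document's fused score is the sum of 1 / (k + rank) over every
    list it appears in; documents ranked high by multiple retrievers
    rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]   # lexical ranking
dense_hits = ["d1", "d5", "d3"]  # embedding ranking
fused = rrf_merge([bm25_hits, dense_hits])
# "d1" and "d3" appear high in both lists, so they lead the fused ranking
```

RRF is a cheap first hybrid to try before investing in a learned ranker: no training data, no score calibration, and it degrades gracefully when one retriever has a bad day.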
Example re-ranker prompt / objective
A minimal cross-encoder re-ranker in Python (`cross_encoder` stands in for whatever scoring model you use — it just needs to return a scalar relevance score for a query–document pair):

```python
def cross_rank(query, candidates):
    scores = []
    for doc in candidates:
        # model returns a scalar relevance score for the (query, doc) pair
        score = cross_encoder.score(query + "\n--\n" + doc.text)
        scores.append((doc, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```
Question to ask yourself: Do I care more about precision (top-1 quality) or throughput? If precision, favor cross-encoders.
Fusion: The Art of Not Being Dumb with Good Docs
Fusion means combining multiple documents into what the generator will use. There are two broad families:
- Early Fusion: merge text chunks into a single prompt/context before generation (concatenation, summarization).
- Late Fusion: generate answers from individual chunks (or subsets) and then aggregate (voting, scoring, final synthesis).
Notable patterns
- Concatenate (naive): just glue top-k into the prompt. Simple, but context length and contradictions bite you.
- Extract-and-Consolidate: extract facts from each doc (or ask a model to summarize each), then synthesize those summaries.
- Fusion-in-Decoder (FiD): encode each doc independently, pass encoded representations to the decoder so it can attend cross-doc — higher-quality but requires architecture support.
- RAG-Sequence vs RAG-Token:
- RAG-Sequence: generates sequences conditioned on individual retrieved docs and then merges candidate outputs.
- RAG-Token: fuses at the token level — the model considers all documents while generating each token (more coherent, more compute).
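The naive concatenate pattern above is worth seeing in code, if only to appreciate where it breaks. A sketch under one loud assumption — character count as a crude stand-in for real token counting (swap in a proper tokenizer in production):

```python
def concat_context(query, ranked_docs, max_chars=4000):
    """Naive early fusion: glue top-ranked chunks into one prompt,
    stopping before the context budget is exceeded. Each chunk is
    tagged with its source ID so the generator can cite it."""
    parts, used = [], 0
    for doc in ranked_docs:
        block = f"[source: {doc['id']}]\n{doc['text']}\n"
        if used + len(block) > max_chars:
            break  # budget exhausted; lower-ranked docs are dropped
        parts.append(block)
        used += len(block)
    return (
        "Answer using only the sources below.\n\n"
        + "".join(parts)
        + "\nQuestion: " + query
    )
```

Note what this does silently: anything past the budget is dropped in rank order, so a bad re-ranker directly starves the generator of evidence — which is exactly why re-ranking quality matters so much for early fusion.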
Late fusion strategies (practical)
- Voting: generate answers per doc, pick most common answer.
- Score-weighted merge: weight each doc's contribution by re-ranker score, then synthesize.
- Fact extraction + aggregator: extract structured facts (triples), then render them into prose.
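The first two strategies are simple enough to sketch directly. Assuming you already have one candidate answer per document (and, for the weighted variant, its re-ranker score):

```python
from collections import Counter

def vote_merge(per_doc_answers):
    """Late fusion by voting: each retrieved doc yields a candidate
    answer (e.g. one generation per chunk); return the most common one.
    Real systems normalize more aggressively before counting."""
    normalized = [a.strip().lower() for a in per_doc_answers]
    return Counter(normalized).most_common(1)[0][0]

def weighted_merge(answers_with_scores):
    """Score-weighted variant: sum re-ranker scores per distinct
    answer, so one highly relevant doc can outvote several weak ones."""
    totals = {}
    for answer, score in answers_with_scores:
        key = answer.strip().lower()
        totals[key] = totals.get(key, 0.0) + score
    return max(totals, key=totals.get)

vote_merge(["Paris", "paris", "Lyon"])  # returns "paris"
```

Voting works best for short factoid answers; for open-ended generation you'll want the fact-extraction route, where aggregation happens over structured facts rather than whole answer strings.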
Putting it together: A Planner–Executor Pipeline (with tools)
You used planner–executor before for tools. Do the same here.
- Planner (tool): builds a retrieval plan — which indices, query rewrites, k candidates per index.
- Retriever (executor tool): runs BM25 and dense search in parallel.
- Re-Ranker (tool/function): a cross-encoder or learned ranker reorders the combined set.
- Fusion module (function): chooses strategy (FiD/concat/extract+merge) and prepares inputs for the generator.
- Generator (LLM): produces the final answer. Optionally call a citation function to attach sources.
- Observability tool: logs scores, chosen docs, hallucination flags.
Pseudocode:
```python
plan = planner.create(query)               # which indices, query rewrites, k per index
candidates = retriever.search(plan)        # BM25 + dense, in parallel
ranked = reranker.rank(query, candidates)  # cross-encoder or learned ranker
fused_input = fusion.prepare(ranked.top_k) # FiD / concat / extract+merge
answer = generator.generate(fused_input)
logger.log({"query": query, "doc_ids": [d.id for d in ranked.top_k],
            "scores": ranked.scores, "answer": answer})
```
Tips: make each piece a callable function (tool) so you can instrument errors, timeouts, and retries. If the re-ranker times out, fall back to a faster ranker — graceful degradation.
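The graceful-degradation idea can be wrapped in a few lines. A sketch, assuming the tool runner surfaces timeouts as exceptions (the `cross_rank` and `lexical_rank` callables are placeholders for your actual ranker tools):

```python
def rank_with_fallback(query, candidates, cross_rank, lexical_rank):
    """Wrap the expensive re-ranker so the pipeline degrades gracefully:
    if it raises (model down, timeout raised by the tool runner), fall
    back to the fast lexical ranking instead of failing the request.
    Returns (ranked_candidates, method_used) so the logger can record
    which path was taken."""
    try:
        return cross_rank(query, candidates), "cross_encoder"
    except Exception:
        return lexical_rank(query, candidates), "lexical_fallback"
```

Logging which path was taken matters: a quietly rising fallback rate is often your first signal that the re-ranker service is degrading.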
Evaluation: How do you measure success?
- Retrieval metrics: Recall@k, MRR — are the true evidence docs in the candidates?
- Re-ranker metrics: NDCG, MAP — is ordering improved?
- Generation metrics: ROUGE/BLEU (weak for open answers), factuality checks, hallucination rate (automatic fact checks), citation precision.
- Human eval: faithfulness, helpfulness, concision.
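The two retrieval metrics above are a few lines each and worth wiring into CI. A minimal sketch (doc IDs are illustrative):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant doc IDs that appear in the top-k
    retrieved list -- did the candidates contain the true evidence?"""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant doc (0.0 if none were
    retrieved) -- averaged over queries, this is Mean Reciprocal Rank."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d4", "d1", "d9"]
relevant = {"d1", "d2"}
recall_at_k(retrieved, relevant, 3)  # 0.5 -- one of two relevant docs in top-3
mrr(retrieved, relevant)             # 0.5 -- first relevant doc sits at rank 2
```

Run these before and after adding the re-ranker: recall@k should hold steady (the re-ranker can't add candidates) while MRR and NDCG should climb.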
Quick practical checklist (copy-paste for your next sprint)
- Build initial retriever with BM25 + dense embeddings.
- Add a cross-encoder re-ranker for top-100 candidates.
- Choose fusion: FiD if you can, else extract-and-consolidate.
- Instrument every tool call for latency, failures, and selected docs.
- Add fallback rules (e.g., if re-ranker fails, use BM25 top-k).
- Track recall@k and hallucination metrics each deployment.
Final notes & spicy thoughts
- Re-ranking is the hill where you win or lose precision. If your top-3 are garbage, the generator will be glamorous garbage. Invest in a re-ranker.
- Fusion is the art of making multiple truths sing in harmony without producing a choir of lies. Structured extraction + careful weighting is often your best friend.
- Treat re-ranking and fusion as independent modules (tools) you can A/B and observe. The planner–executor pattern you learned earlier fits beautifully here.
"Good retrieval gets you the sources; clever re-ranking picks the right ones; thoughtful fusion makes the model tell you the truth in a way that doesn’t make you want to cry into your keyboard."
Want a cheat-sheet prompt for testing a re-ranker? Try this
System: You are a relevance scorer. Score how well the document answers the user query.
User: [QUERY]
Document: [DOC]
Assistant: Provide a numeric score 0-100 and a short justification (1-2 lines).
Use that output to debug misrankings: where is your re-ranker overconfident? Underconfident? Fix by adding features or augmenting training data.
Summary: Re-rank to get the right evidence. Fuse to make that evidence readable, accurate, and concise. Wrap them as tools in your planner–executor pipeline, instrument everything, and always have fallbacks. Now go make search-stories that don't lie to people.