Supplying Context and Grounding
Feed the model the right facts at the right time using structured context blocks, delimiters, and source pinning.
Planning Context Budgets — The Art of Feeding the Beast Without Starving the Brain
"If context were calories, you'd be trying to feed a marathoner with a cupcake and a sticky note." — Your inner prompt engineer, probably drunk on tokens
You're already familiar with retrieval summaries and citing/linking evidence, and we've seen how roles, personas, and system prompts can steer the model's behavior. Now we need to get pragmatic: what actually goes into the context window, why, and how to pick the bits that matter when tokens are limited, latency matters, or costs start looking like a bad dinner tab. This is Planning Context Budgets: choosing, compressing, and allocating the precious real estate of your prompt so the model produces useful, grounded output.
Why a context budget is a thing (and why you care)
- Token limits are real: LLMs have finite context windows and tokens cost money. You cannot dump the entire Internet into every prompt.
- Relevance beats volume: More text isn't always better; irrelevant context often creates noise and hallucination risk.
- Latency and UX: Large contexts slow things down and increase user wait time. Your users want answers, not a loading spinner named Regret.
Think of it like packing a carry-on for a week: prioritize essentials, compress bulky stuff, and pick outfits that mix-and-match.
Quick recap: where this sits in the pipeline
- Retrieval summaries: you already use them to condense retrieved docs into a succinct digest. Those summaries should be part of your context budget.
- Citing/linking evidence: when you include sources, you have to decide which sources to include verbatim, which to summarize, and which to only cite by reference.
- Roles/personas/system prompts: decide which persona-level constraints and priorities live in the system layer (cheap, persistent tokens) vs the upfront prompt. Use system instructions to offload constant expectations.
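To make that last split concrete, here's a minimal sketch assuming a chat-style API that takes role-tagged messages (the exact field names vary by provider; these mirror the common shape):

```python
# A minimal sketch of the permanence split: stable instructions live in the
# system layer; dynamic facts ride along with each request.

SYSTEM_PROMPT = (
    "You are a billing support assistant. Cite a source for every claim. "
    "Prefer account records over policy documents when they conflict."
)  # stable: paid once per conversation, not re-justified per retrieval

def build_messages(user_query: str, retrieved_facts: list[str]) -> list[dict]:
    """Pair the persistent system layer with per-request dynamic context."""
    context_block = "\n".join(f"- {fact}" for fact in retrieved_facts)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"CONTEXT:\n{context_block}\n\nQUERY: {user_query}"},
    ]

messages = build_messages(
    "Why did my bill go up?",
    ["2024-03-01: plan upgraded from Basic to Pro", "Pro tier bills at $30/mo"],
)
```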
Core principles for planning context budgets (aka the commandments)
- Prioritize by use-case impact. If a piece of context changes the answer, include it. If it only mildly colors phrasing, summarize or omit.
- Compress aggressively. Summaries, bullet points, structured metadata — all reduce token cost while preserving signal.
- Segment context by permanence. Put stable instructions (tone, role, safety rules) in system prompts. Put dynamic facts (user state, recent retrievals) in the request context.
- Score and select. Rank candidate documents by relevance, recency, and trustworthiness; include top-k until budget is reached.
- Fallback to summaries. When docs exceed budget, include short summaries and explicit citations rather than the full text.
- Be explicit about process. Tell the model which sections are authoritative, which are optional, and where to look first.
A practical workflow: Plan, Score, Allocate, Execute
- Plan: determine the total token budget for context (B). Subtract the estimated token needs for system persona, user query, and required output format. Remaining is the allocable budget.
- Score: for each retrieved item, compute a relevance score combining topicality (semantic similarity), recency, trust/source quality, and a length penalty (see the sketch after this list).
- Allocate: include full text for top items until you hit a threshold (e.g., 60% of allocable budget). Convert less-critical items into summaries or metadata until you fill the budget.
- Execute: build the final prompt with clear section markers and role/system instructions to guide prioritization.
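Here's one way the Plan and Score steps might look in Python. The weights, half-life, and length penalty below are illustrative guesses, not gospel; tune them against your own retrieval data:

```python
import math
import time

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def relevance_score(doc: dict, query_emb: list[float],
                    w_topic: float = 0.6, w_recency: float = 0.2,
                    w_trust: float = 0.2, half_life_days: float = 30.0) -> float:
    """Blend topicality, recency, and trust, then penalize very long documents."""
    topicality = cosine(query_emb, doc["embedding"])
    age_days = (time.time() - doc["timestamp"]) / 86400
    recency = 0.5 ** (age_days / half_life_days)             # exponential decay
    trust = doc["trust"]                                     # 0..1, e.g. from a curated source list
    length_penalty = min(1.0, 1000 / max(doc["tokens"], 1))  # favor concise sources
    return (w_topic * topicality + w_recency * recency + w_trust * trust) * length_penalty

# Plan step: the allocable budget is whatever remains after fixed overheads.
B = 8000                            # tokens reserved for the whole prompt
allocable = B - (600 + 150 + 1200)  # system persona + user query + expected answer
```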
Pseudocode for selection
```python
# Plan: allocable budget after fixed overheads
budget = B - (system_tokens + user_query_tokens + expected_answer_tokens)

def select_context(docs, query_emb, budget, full_text_fraction=0.6):
    """Greedy fill: full text for top-ranked docs, summaries once that share is spent."""
    ranked = sorted(docs, key=lambda d: relevance_score(d, query_emb), reverse=True)
    selected, used = [], 0
    for doc in ranked:
        if used + doc["tokens"] <= budget * full_text_fraction:
            selected.append({"type": "full", "doc": doc})
            used += doc["tokens"]
        else:
            summary = summarize(doc)  # compress to key claims + citations
            if used + summary["tokens"] > budget:
                break  # budget exhausted; remaining docs become citations only
            selected.append({"type": "summary", "doc": summary})
            used += summary["tokens"]
    return build_prompt(selected)
```
Replace summarize(doc) with your retrieval-summary pipeline that compresses content into key claims and citations.
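Design note: the greedy pass spends the full-text share on the highest-ranked documents first, so one decisive source can't be crowded out by a pile of mediocre ones. Anything that won't fit even as a summary drops to citation-only, which is exactly what Section C in the layout below is for.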
Concrete prompt structure (example)
Use explicit labels so the model knows what's priority:
- System prompt: persona, constraints, output format, and a one-line instruction such as: 'Prioritize information from Section A over Section B.'
- Section A: High-priority documents (full text or long summaries)
- Section B: Supporting data (shorter summaries, metadata)
- Section C: Links and citations only (for traceability)
- User query: the actual task
- Tools or memory: optional extras, included only when the task needs them
Example layout (markdown in the prompt):
```
SYSTEM: You are an expert summarizer. Follow the priority order below. Output must include citations.

SECTION A - HIGH PRIORITY (full text up to X tokens)
- Document 1: ...

SECTION B - SUMMARIES
- Doc 4 summary (source: url)

USER QUERY: ...
```
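A small helper can render that layout mechanically. This sketch (named render_sections to keep it distinct from build_prompt in the selection code above) assumes documents, summaries, and citations arrive as pre-formatted strings:

```python
def render_sections(section_a: list[str], section_b: list[str],
                    section_c: list[str], user_query: str) -> str:
    """Assemble the labeled layout above from pre-formatted strings."""
    lines = ["SECTION A - HIGH PRIORITY"]
    lines += [f"- {doc}" for doc in section_a]
    lines += ["", "SECTION B - SUMMARIES"]
    lines += [f"- {summary}" for summary in section_b]
    lines += ["", "SECTION C - CITATIONS ONLY"]
    lines += [f"- {citation}" for citation in section_c]
    lines += ["", f"USER QUERY: {user_query}"]
    return "\n".join(lines)
```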
How to choose between full docs, summaries, and citations
| Situation | Strategy | Why it works |
|---|---|---|
| High-confidence, decisive evidence | Full text | Gives the model raw facts and avoids summarization error |
| Supporting or long docs | Summarize key claims + citations | Saves tokens while preserving signal |
| Many low-relevance hits | List citations or metadata only | Lets the model ask to fetch more when needed |
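If you'd rather encode that table than memorize it, a thin dispatcher over the relevance score does the job; the thresholds are placeholders to calibrate:

```python
def strategy_for(doc: dict, decisive: float = 0.8, supporting: float = 0.5) -> str:
    """Map a scored document to a row of the table above (thresholds are placeholders)."""
    if doc["score"] >= decisive:
        return "full"      # decisive evidence: include verbatim
    if doc["score"] >= supporting:
        return "summary"   # supporting: key claims + citation
    return "citation"      # low relevance: reference only; fetch on demand
```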
Tricks and advanced tactics
- Dynamic budget shifting: if the model requests a deeper dive, allow an on-demand retrieval step and append relevant doc excerpts to a follow-up prompt.
- Chunking + sliding window: for long sources, chunk by section headings and include only the chunks with the highest similarity to the query (see the sketch after this list).
- Role-based weighting: system prompt can instruct the model to prioritize certain sources or types of evidence (e.g., peer-reviewed over blogs).
- Progressive summarization: first-level summaries condensed into second-level ultra-summaries if tokens are scarce.
- Cache golden summaries: for frequently retrieved documents, store compact summaries so you don't pay the summarization cost every time.
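As a sketch of the chunking tactic: split a markdown source at its headings, embed each chunk, and keep only the top-k by similarity. Here cosine is the helper from the scoring sketch, and embed stands in for whatever embedding function you use:

```python
import re

def chunk_by_headings(markdown_text: str) -> list[str]:
    """Split a long source at markdown headings so each chunk stays self-contained."""
    parts = re.split(r"(?m)^(?=#{1,3} )", markdown_text)
    return [p.strip() for p in parts if p.strip()]

def top_chunks(chunks: list[str], query_emb: list[float], embed, k: int = 3) -> list[str]:
    """Keep only the k chunks most similar to the query."""
    return sorted(chunks, key=lambda c: cosine(query_emb, embed(c)), reverse=True)[:k]
```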
Example micro-case: customer support agent
Scenario: A user asks why their bill suddenly increased. You have 20 documents: account history, service change logs, system outage notices, pricing policy, and a support chat.
- Put account history and billing adjustments in Section A (full or near-full).
- Summarize pricing policy and outage notices in Section B.
- Include a Section C list of raw transcripts and logs with timestamps for traceability.
This avoids burying the agent in entire support transcripts while preserving the facts that change the user's bill.
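Expressed as a budget allocation, the scenario might reduce to something like this (all document names are hypothetical):

```python
# Hypothetical allocation for the billing scenario; names are illustrative.
allocation = {
    "A": ["account_history", "billing_adjustments"],           # full or near-full text
    "B": ["pricing_policy_summary", "outage_notice_summary"],  # compressed to key claims
    "C": ["support_chat_transcript", "system_logs"],           # timestamps + citations only
}
```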
Closing — TL;DR and parting wisdom
- Plan the budget first, then fill it. Don't blindly shove everything into the window.
- Prioritize impact, compress ruthlessly, and use system prompts to offload permanence.
- Make the model's life easier by labeling priorities. Explicit structure leads to better, more grounded answers.
Final thought: a good context budget is like a good playlist — curated, purposeful, and leaves room for the encore. If your model keeps hallucinating, it's probably starving for the right tracks.
Versioning: store your budget strategies alongside prompt templates so you can iterate. The model isn't a magician — it's a very fancy parrot whose attention you must manage.