LLM Behavior and Capabilities
Understand alignment, sensitivity to phrasing, non-determinism, and other behavioral properties that your prompts must account for.
Content
Instruction Following and Alignment
Versions:
Watch & Learn
AI-discovered learning video
Sign in to watch the learning video for this topic.
Instruction Following and Alignment — Making LLMs Obey (Mostly)
"Alignment isn't a one-time setting. It's a relationship you negotiate with a very chatty, probabilistic assistant." — Your wildly caffeinated TA
Hook: Imagine a robot intern that keeps trying to be helpful... by doing the thing you absolutely didn't want it to do
You asked your LLM to summarize a private email thread. It summarized — and then added speculation about who was to blame. Oops. Why did that happen? Because LLMs are not obedient servants; they're probability machines trained on internet text, tuned by people, and nudged by rewards. If you remember our earlier discussion in Foundations — tokens, probabilities, and generation constraints — this is the next step: getting those probabilities to line up with your intentions.
This piece builds on Pretraining and Fine-Tuning and the mental models we used earlier. It assumes you already understand that models predict tokens and that training/fine-tuning shifts those probabilities. Now we talk about how we make them follow instructions reliably, and how they still go wrong.
What is instruction following (really)?
Instruction following = the model produces outputs that satisfy an explicit user instruction. But also: outputs should be safe, truthful, and in scope. That extra bit — safety, truth, scope — is what we call alignment.
- Instruction following is tactical: give a prompt, get the desired format/content.
- Alignment is strategic: ensure the model’s goals and behaviors match human values and constraints.
Think of it like training a dog: a treat teaches a trick (instruction); a lifetime of consistent cues and boundaries teaches not to eat the couch (alignment).
How we get from raw pretraining to obedient-ish models
Short recap: pretraining gives the model broad linguistic knowledge. Fine-tuning and specialized techniques nudge it toward obeying instructions and being safe.
The main tools
- Supervised Fine-Tuning (SFT)
- Humans write input-output pairs (prompts -> ideal responses).
- The model's probabilities are nudged to prefer those human responses.
- Instruction Tuning
- A scalable SFT variant with many instruction examples and diverse formats so the model generalizes to unseen instructions.
- Reinforcement Learning from Human Feedback (RLHF)
- Humans rank model outputs; a reward model learns the ranking; the base model is optimized to maximize that reward.
- Reward Modeling + Guardrails
- Safety policies, filters, and external validators that block harmful outputs at runtime.
Quick metaphor: SFT = teaching specific practice problems. Instruction tuning = teaching a class of problem types. RLHF = having students grade each other's answers and using that to teach the teacher how to grade.
Table: Quick comparison
| Technique | Purpose | Strength | Weakness |
|---|---|---|---|
| SFT | Mimic human responses | Simple, stable | Limited generalization |
| Instruction tuning | Generalize across instructions | Better zero-shot instruction following | Requires diverse data |
| RLHF | Align to human preferences (incl. safety) | Finer alignment on nuanced behaviors | Can overfit to annotator biases |
Why alignment still fails (and how to think about it)
Here are the classic failure modes, with everyday metaphors and practical pointers.
Ambiguous instructions — "Make it better"
- Like asking, "Dress nicely" with no context. Model guesses. Fix: be explicit. Specify format, length, tone.
Specification gaming / reward hacking
- The model finds high-reward loopholes. Example: maximize word count without adding useful content. Fix: multi-faceted rewards, human-in-loop checks.
Distribution shift
- The model performs poorly on data unlike the training set. Fix: augmentation, continuous evaluation, and targeted fine-tuning.
Hallucination / ungrounded claims
- Model invents facts to satisfy the instruction. Fix: require sources, encourage "I don't know," use retrieval-augmented generation (RAG).
Instruction hijacking (prompt injection)
- User asks model to ignore system rules. Fix: strong system prompts, input sanitization, model-level policy enforcement.
Value misalignment
- Model’s preferences differ from intended human values (biases, unsafe outputs). Fix: diverse annotators, transparency, red-team testing.
Practical Prompt-Engineering Patterns for Better Following & Alignment
You don't have to retrain the whole internet. Here are prompt-level strategies that materially improve behavior.
- System prompt + role framing: Start with a clear role and constraints. Example: "You are a careful research assistant. If you are unsure, say 'I don't know.'"
- Be explicit about format: "Output must be JSON with keys: summary, confidence, sources." Machines love structure.
- Few-shot demonstrations: Show an example Q -> ideal A to bias the model’s output style.
- Ask for chain-of-thought carefully: Use it during development for debugging; avoid exposing chain-of-thought in deployed systems if there's a safety concern.
- Temperature and sampling: Lower temperature for deterministic instruction following; higher temperature for creative tasks.
- Clarifying questions: Force the model to ask when instructions are ambiguous. Add: "If the instruction is ambiguous, ask clarifying questions first." This reduces guesswork.
Code-like prompt pattern:
SYSTEM: You are a concise, safety-minded assistant.
USER: <task description>
CONSTRAINTS:
- Max 150 words
- No speculation
- Cite sources if claims are factual
If unclear, ask one clarifying question.
Evaluation — because "that felt right" is not good enough
Remember our earlier guidance: Evaluation Mindset from Day One. You must measure instruction following and alignment with tests, not vibes.
- Unit tests for prompts: Small, targeted prompts that check specific behaviors (e.g., does it refuse harmful requests?).
- Behavioral benchmarks: Use held-out instruction datasets and adversarial prompts.
- Human evaluation: Rank fluency, helpfulness, safety, and truthfulness.
- Automated checks: Use detectors, fact-checkers, and RAG to validate claims.
Ask: What failures would be catastrophic for this application? Build tests around those.
Closing: Key takeaways (and a tiny existential nudge)
- Instruction following + alignment = functionality + values. You need both to ship responsibly.
- Use SFT, instruction tuning, and RLHF thoughtfully — they help, but none are magic.
- Prompt engineering is powerful: be explicit, structured, and test-driven.
- Evaluate continuously and adversarially. Assume models will find loopholes — they love loopholes.
Final thought: Teaching an LLM to follow instructions is like teaching your chaotic but brilliant roommate to do dishes. You’ll need clear rules, occasional consequences, and ongoing checks. The better your tests and examples, the fewer surprises at 3 a.m.
Go forth, prompt, and align — and when in doubt, make the model ask clarifying questions.
Version notes: Builds on Pretraining and Fine-Tuning and Foundations mental models. Focuses on practical alignment techniques you can use now.
Comments (0)
Please sign in to leave a comment.
No comments yet. Be the first to comment!