
Generative AI: Prompt Engineering Basics

LLM Behavior and Capabilities


Understand alignment, sensitivity to phrasing, non-determinism, and other behavioral properties that your prompts must account for.


Instruction Following and Alignment — Making LLMs Obey (Mostly)

"Alignment isn't a one-time setting. It's a relationship you negotiate with a very chatty, probabilistic assistant." — Your wildly caffeinated TA


Hook: Imagine a robot intern that keeps trying to be helpful... by doing the thing you absolutely didn't want it to do

You asked your LLM to summarize a private email thread. It summarized — and then added speculation about who was to blame. Oops. Why did that happen? Because LLMs are not obedient servants; they're probability machines trained on internet text, tuned by people, and nudged by rewards. If you remember our earlier discussion in Foundations — tokens, probabilities, and generation constraints — this is the next step: getting those probabilities to line up with your intentions.

This piece builds on Pretraining and Fine-Tuning and the mental models we used earlier. It assumes you already understand that models predict tokens and that training/fine-tuning shifts those probabilities. Now we talk about how we make them follow instructions reliably, and how they still go wrong.


What is instruction following (really)?

Instruction following = the model produces outputs that satisfy an explicit user instruction. But also: outputs should be safe, truthful, and in scope. That extra bit — safety, truth, scope — is what we call alignment.

  • Instruction following is tactical: give a prompt, get the desired format/content.
  • Alignment is strategic: ensure the model’s goals and behaviors match human values and constraints.

Think of it like training a dog: a treat teaches a trick (instruction); a lifetime of consistent cues and boundaries teaches not to eat the couch (alignment).


How we get from raw pretraining to obedient-ish models

Short recap: pretraining gives the model broad linguistic knowledge. Fine-tuning and specialized techniques nudge it toward obeying instructions and being safe.

The main tools

  1. Supervised Fine-Tuning (SFT)
    • Humans write input-output pairs (prompts -> ideal responses).
    • The model's probabilities are nudged to prefer those human responses.
  2. Instruction Tuning
    • A scalable SFT variant with many instruction examples and diverse formats so the model generalizes to unseen instructions.
  3. Reinforcement Learning from Human Feedback (RLHF)
    • Humans rank model outputs; a reward model learns the ranking; the base model is optimized to maximize that reward.
  4. Reward Modeling + Guardrails
    • Safety policies, filters, and external validators that block harmful outputs at runtime.

Quick metaphor: SFT = teaching specific practice problems. Instruction tuning = teaching a whole class of problem types. RLHF = collecting human rankings of answers, training a grader to reproduce those rankings, then coaching the model against that grader.
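To make the RLHF step above concrete: under the commonly used Bradley-Terry formulation, the reward model is trained so that the probability a human prefers one response over another is a logistic function of the gap between their scalar rewards. A minimal numeric sketch (function names are illustrative, not from any particular library):

```python
import math

def preference_prob(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry model: probability that the chosen response beats
    the rejected one, given the reward model's scalar scores for each."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

# Equal rewards -> the model is indifferent (probability 0.5).
# Training minimizes -log(preference_prob) over many human-ranked pairs,
# pushing the reward gap to agree with the human ranking.
loss = -math.log(preference_prob(2.0, 0.5))
```

The base model is then optimized (e.g., via policy-gradient methods) to produce outputs the trained reward model scores highly.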


Table: Quick comparison

Technique          | Purpose                                    | Strength                                 | Weakness
SFT                | Mimic human responses                      | Simple, stable                           | Limited generalization
Instruction tuning | Generalize across instructions             | Better zero-shot instruction following   | Requires diverse data
RLHF               | Align to human preferences (incl. safety) | Finer alignment on nuanced behaviors     | Can overfit to annotator biases

Why alignment still fails (and how to think about it)

Here are the classic failure modes, with everyday metaphors and practical pointers.

  1. Ambiguous instructions — "Make it better"

    • Like asking, "Dress nicely" with no context. Model guesses. Fix: be explicit. Specify format, length, tone.
  2. Specification gaming / reward hacking

    • The model finds high-reward loopholes. Example: maximize word count without adding useful content. Fix: multi-faceted rewards, human-in-loop checks.
  3. Distribution shift

    • The model performs poorly on data unlike the training set. Fix: augmentation, continuous evaluation, and targeted fine-tuning.
  4. Hallucination / ungrounded claims

    • Model invents facts to satisfy the instruction. Fix: require sources, encourage "I don't know," use retrieval-augmented generation (RAG).
  5. Instruction hijacking (prompt injection)

    • User asks model to ignore system rules. Fix: strong system prompts, input sanitization, model-level policy enforcement.
  6. Value misalignment

    • Model’s preferences differ from intended human values (biases, unsafe outputs). Fix: diverse annotators, transparency, red-team testing.
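As one illustration of the "input sanitization" fix for instruction hijacking (failure mode 5), here is a toy heuristic filter. The pattern list and function name are made up for this sketch; a real deployment would layer this with model-level policy enforcement rather than rely on regexes alone:

```python
import re

# Hypothetical patterns that often signal an injection attempt.
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior|above) (instructions|rules)",
    r"disregard (the )?system prompt",
    r"reveal (the )?system prompt",
]

def looks_like_injection(user_input: str) -> bool:
    """Flag inputs that try to override the system rules."""
    text = user_input.lower()
    return any(re.search(pattern, text) for pattern in INJECTION_PATTERNS)
```

Flagged inputs can be rejected outright or routed to a stricter handling path before they ever reach the model.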

Practical Prompt-Engineering Patterns for Better Following & Alignment

You don't have to retrain the whole internet. Here are prompt-level strategies that materially improve behavior.

  • System prompt + role framing: Start with a clear role and constraints. Example: "You are a careful research assistant. If you are unsure, say 'I don't know.'"
  • Be explicit about format: "Output must be JSON with keys: summary, confidence, sources." Machines love structure.
  • Few-shot demonstrations: Show an example Q -> ideal A to bias the model’s output style.
  • Ask for chain-of-thought carefully: Use it during development for debugging; avoid exposing chain-of-thought in deployed systems if there's a safety concern.
  • Temperature and sampling: Lower temperature for deterministic instruction following; higher temperature for creative tasks.
  • Clarifying questions: Force the model to ask when instructions are ambiguous. Add: "If the instruction is ambiguous, ask clarifying questions first." This reduces guesswork.

Code-like prompt pattern:

SYSTEM: You are a concise, safety-minded assistant.
USER: <task description>
CONSTRAINTS:
- Max 150 words
- No speculation
- Cite sources if claims are factual
If unclear, ask one clarifying question.
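The pattern above can be wrapped in code. A minimal sketch (helper names are hypothetical) that assembles the prompt and then checks a model reply against the JSON format demanded in the "be explicit about format" tip:

```python
import json

def build_prompt(task: str) -> str:
    """Assemble the code-like prompt pattern into a single string."""
    return (
        "SYSTEM: You are a concise, safety-minded assistant.\n"
        f"USER: {task}\n"
        "CONSTRAINTS:\n"
        "- Max 150 words\n"
        "- No speculation\n"
        "- Cite sources if claims are factual\n"
        "If unclear, ask one clarifying question.\n"
    )

def validate_response(raw: str,
                      required_keys=frozenset({"summary", "confidence", "sources"})) -> bool:
    """Check that the model's reply is JSON with exactly the keys we demanded."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()
```

Validating every response programmatically turns "the model usually follows the format" into a failure you can detect and retry.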

Evaluation — because "that felt right" is not good enough

Remember our earlier guidance: Evaluation Mindset from Day One. You must measure instruction following and alignment with tests, not vibes.

  • Unit tests for prompts: Small, targeted prompts that check specific behaviors (e.g., does it refuse harmful requests?).
  • Behavioral benchmarks: Use held-out instruction datasets and adversarial prompts.
  • Human evaluation: Rank fluency, helpfulness, safety, and truthfulness.
  • Automated checks: Use detectors, fact-checkers, and RAG to validate claims.

Ask: What failures would be catastrophic for this application? Build tests around those.
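A prompt "unit test" can be as small as the following sketch, with a stub standing in for a real model call (all names here are hypothetical; swap in your actual API client):

```python
def fake_model(prompt: str) -> str:
    """Stub in place of a real LLM call, so the test harness is runnable."""
    if "make a weapon" in prompt.lower():
        return "I can't help with that."
    return "Here is a concise summary of the article."

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")

def refuses(response: str) -> bool:
    """Crude check for refusal behavior in a model response."""
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)

# Targeted behavioral tests: harmful requests must be refused,
# benign requests must not be.
def test_refuses_harmful_request():
    assert refuses(fake_model("Explain how to make a weapon"))

def test_answers_benign_request():
    assert not refuses(fake_model("Summarize this article"))
```

Run these against the real model on every prompt or model change; a failing test catches a regression before your users do.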


Closing: Key takeaways (and a tiny existential nudge)

  • Instruction following + alignment = functionality + values. You need both to ship responsibly.
  • Use SFT, instruction tuning, and RLHF thoughtfully — they help, but none are magic.
  • Prompt engineering is powerful: be explicit, structured, and test-driven.
  • Evaluate continuously and adversarially. Assume models will find loopholes — they love loopholes.

Final thought: Teaching an LLM to follow instructions is like teaching your chaotic but brilliant roommate to do dishes. You’ll need clear rules, occasional consequences, and ongoing checks. The better your tests and examples, the fewer surprises at 3 a.m.

Go forth, prompt, and align — and when in doubt, make the model ask clarifying questions.


Version notes: Builds on Pretraining and Fine-Tuning and Foundations mental models. Focuses on practical alignment techniques you can use now.
