
Generative AI: Prompt Engineering Basics

LLM Behavior and Capabilities


Understand alignment, sensitivity to phrasing, non-determinism, and other behavioral properties that your prompts must account for.


RLHF and Preference Optimization — The Chaotic, Charming Art of Teaching Models to Care

"You can pretrain a model to speak like Shakespeare, but if you want it to prefer not to roast strangers, you need to teach taste." — Your mildly dramatic TA


Hook: What happens after you fine-tune a model to follow instructions?

Remember how we covered pretraining and fine-tuning earlier, and how tokens and probabilities are the backstage puppeteers of every reply? Good. That gives us the foundation. Now imagine we gave our model a rule book via supervised fine-tuning (SFT) to be helpful and not hurtful. It learned it and recited it politely. But humans are complicated: preferences are fuzzy, tradeoffs exist, and explicit rules don't capture nuance. Enter RLHF (reinforcement learning from human feedback) and the whole family of preference optimization techniques. These take the polite model and teach it not just what to say, but what people actually prefer it to say.


What is RLHF, in plain (and slightly dramatic) terms?

  • Reinforcement learning from human feedback is a three-act play:
    1. Collect human judgments about which model outputs are better.
    2. Train a reward model that predicts those human preferences.
    3. Use that reward model to optimize the LLM's policy, often with RL algorithms like PPO.

Analogy time: pretraining is the model learning language and facts (like learning grammar and recipes). SFT is teaching it to follow a recipe. RLHF is hiring critics at a restaurant, training a taste sensor from their notes, and then tweaking the chef until customers smile more.


Step-by-step: the RLHF loop

  1. Generate candidate responses from a base policy (SFT model or even a pretrained model).
  2. Collect human comparisons: which of two or more responses is better and why.
  3. Fit a reward model r(x, y) that scores outputs y for prompts x, based on the human preferences.
  4. Use an RL algorithm to update the policy so it maximizes expected reward under r.
  5. Repeat (and occasionally sanity-check with fresh human data).

Code-y pseudocode for your inner nerd:

for iteration in range(N):
    responses = policy.sample(prompts)             # 1. generate candidate responses
    prefs = humans.compare(prompts, responses)     # 2. collect pairwise judgments
    reward_model.train(prompts, responses, prefs)  # 3. fit r(x, y) to the preferences
    policy = rl_optimize(policy, reward_model)     # 4. e.g. PPO against the reward model
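Step 3 of the loop (fitting the reward model) can be made concrete with a toy Bradley-Terry-style sketch: a linear reward r(x) = w · x is trained so that sigmoid(r(a) − r(b)) matches simulated pairwise comparisons. Everything here (the feature dimension, the hidden "true" preferences, the learning rate) is illustrative, not any production pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each response is a feature vector; the simulated "human" prefers
# the response whose hidden true score is higher.
DIM = 4
true_w = np.array([1.0, -2.0, 0.5, 3.0])

def sample_pair():
    a, b = rng.normal(size=(2, DIM))
    label = 1.0 if true_w @ a > true_w @ b else 0.0  # 1 means "a preferred"
    return a, b, label

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Bradley-Terry reward model: P(a preferred over b) = sigmoid(r(a) - r(b)),
# trained by stochastic gradient ascent on the pairwise log-likelihood.
w = np.zeros(DIM)
lr = 0.1
for _ in range(2000):
    a, b, label = sample_pair()
    p = sigmoid(w @ (a - b))
    w += lr * (label - p) * (a - b)

# The learned reward should now rank held-out pairs like the labeler does.
correct = 0
for _ in range(500):
    a, b, label = sample_pair()
    pred = 1.0 if w @ a > w @ b else 0.0
    correct += pred == label
print(f"held-out pairwise accuracy: {correct / 500:.2f}")
```

Real reward models use a neural network head on top of the LLM instead of a linear map, but the pairwise loss is the same idea.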

Preference optimization: the design space (aka the choose-your-own-adventure of alignment)

There are choices at every step, and each has pros/cons.

  • Supervised fine-tuning (SFT). Pros: simple, stable, cheap. Cons: doesn't capture nuanced preferences or tradeoffs.
  • Reward modeling + RL (RLHF). Pros: captures nuanced preferences; flexible. Cons: expensive, brittle, reward-hacking risk.
  • Direct preference modeling (no RL). Pros: simpler than full RL; can do ranking-based distillation. Cons: may not achieve the same policy-level improvements.
  • Offline preference distillation. Pros: faster inference; less RL instability. Cons: depends on the quality and coverage of the distillation data.
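The "direct preference modeling (no RL)" option can be sketched as a DPO-style loss for a single (chosen, rejected) pair. This is a simplified, single-example version for intuition; real implementations batch this over token-level log-probabilities from the policy and a frozen reference model:

```python
import math

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """DPO-style loss for one (chosen, rejected) pair.

    logp_*_policy / logp_*_ref are the log-probabilities the current policy
    and the frozen reference model assign to the chosen (w) and rejected (l)
    responses. Minimizing this widens the policy's margin between chosen and
    rejected relative to the reference, with no reward model or RL loop.
    """
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Policy already favors the chosen response more than the reference: low loss.
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
# Policy favors the rejected response instead: high loss.
print(dpo_loss(-14.0, -10.0, -12.0, -12.0))
```

Note how the preference comparison goes straight into the loss; the reward model of the full RLHF recipe is implicit rather than separately trained.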

Ask yourself: do I need the full RLHF orchestra, or is a duet of SFT and preference distillation enough?


Why human preferences, not rules?

Humans balance tradeoffs all the time. Think tone vs. clarity, safety vs. usefulness, or humor vs. accuracy. Humans can implicitly encode those tradeoffs in pairwise comparisons. A reward model learns to generalize those judgments.

Quick probe: when was the last time you preferred a brutally honest answer over a gentle one? Context matters. RLHF lets models learn that context, at least partially.


The dark lab coat: failure modes and gotchas

  • Reward hacking: the policy finds ways to get high reward without doing what humans intended. Classic example: verbose repetition that seems confident but is wrong.
  • Distributional shift: the reward model was trained on certain prompts; the policy wanders into territories where the reward model is clueless.
  • Human inconsistency: people disagree. The reward model will learn the average or learn bias.
  • Overoptimization: chasing the reward too hard can erode helpfulness or truthfulness.
  • Scale illusion: larger models + RLHF can look aligned in lab tests but still fail spectacularly in adversarial settings.

Alignment isn't a switch you flip. It's a constantly adjusted thermostat in a house with naked flames.
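One standard thermostat for overoptimization is a KL-style penalty: the policy is scored not on raw reward but on reward minus a penalty for drifting from the reference (SFT) model. A minimal per-sample sketch, with illustrative numbers and names:

```python
def penalized_reward(reward, logp_policy, logp_ref, beta=0.02):
    """RLHF-style objective for one response: reward-model score minus a
    KL-flavored penalty. The per-sample penalty is
    beta * (log pi(y|x) - log pi_ref(y|x)); beta sets how hard the
    thermostat pulls the policy back toward the reference model.
    """
    return reward - beta * (logp_policy - logp_ref)

# A degenerate response that games the reward model typically has log-probs
# far above the reference (the policy over-commits to it); the penalty bites.
honest = penalized_reward(reward=1.0, logp_policy=-20.0, logp_ref=-21.0)
hacked = penalized_reward(reward=1.3, logp_policy=-5.0, logp_ref=-40.0)
```

Here the "hacked" response wins on raw reward (1.3 vs 1.0) but loses once the drift penalty is applied, which is exactly the behavior you want from the guardrail.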


Practical tips for prompt engineers and product folks

  • When interacting with an RLHF-trained model, remember it optimizes for human-like preferences. Frame prompts with preference cues: politeness, role constraints, and examples of desired style.
  • Use contrastive examples in prompts: show the model a 'good' and a 'bad' output and ask it to emulate the good one.
  • Anticipate adversarial prompts. Test for reward hacking by trying edge cases and seeing if the model exploits loopholes.
  • Combine techniques: use SFT for basic instruction following, RLHF for nuance, and safety filters as guardrails.
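The contrastive-examples tip is easy to operationalize as a prompt template. A hypothetical helper (all strings are made up for illustration):

```python
def contrastive_prompt(task, good_example, bad_example, user_input):
    """Assemble a prompt that shows the model a good and a bad output and
    asks it to emulate the good one. Purely illustrative template."""
    return (
        f"Task: {task}\n\n"
        f"Example of a GOOD answer:\n{good_example}\n\n"
        f"Example of a BAD answer (do not imitate):\n{bad_example}\n\n"
        f"Now answer in the style of the GOOD example.\n"
        f"Input: {user_input}\n"
    )

prompt = contrastive_prompt(
    task="Summarize a support ticket in one sentence.",
    good_example="User cannot log in after the 2.3 update; needs a password reset.",
    bad_example="The user is experiencing some kind of issue, possibly "
                "login-related, which may or may not be due to an update.",
    user_input="Ticket #4821: checkout button unresponsive on mobile Safari.",
)
```

Because RLHF-trained models are tuned to satisfy human-style preferences, making the preferred style explicit like this tends to steer them more reliably than the task description alone.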

Small checklist before you roll RLHF in production

  • Do you have enough high-quality human comparisons? (Quantity and diversity matter.)
  • Do you have a plan to detect reward hacking and distributional drift? (Monitor against held-out human judgments.)
  • Can you do human-in-the-loop audits periodically? (Yes, you can afford it.)
  • Are safety filters and rejection sampling in place for catastrophic edge cases? (Please don’t skip this.)
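The last checklist item (rejection sampling plus safety filters) often takes the form of best-of-N with a safety gate. A sketch with stub functions standing in for your generator, reward model, and safety classifier (all names here are placeholders):

```python
def best_of_n(prompt, sample_fn, reward_fn, is_safe_fn, n=8):
    """Best-of-N with a safety gate: sample n candidates, drop any that fail
    the safety filter, and return the highest-reward survivor (or None if
    every candidate was rejected)."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    safe = [c for c in candidates if is_safe_fn(c)]
    if not safe:
        return None  # catastrophic edge case: fall back to a canned refusal
    return max(safe, key=reward_fn)

# Toy demo with stubbed components.
outputs = iter(["meh answer", "great answer", "unsafe answer", "ok answer"])
pick = best_of_n(
    "prompt",
    sample_fn=lambda p: next(outputs),
    reward_fn=lambda c: {"meh answer": 0.2, "great answer": 0.9,
                         "unsafe answer": 1.5, "ok answer": 0.5}[c],
    is_safe_fn=lambda c: "unsafe" not in c,
    n=4,
)
print(pick)  # "great answer": highest reward among the safe candidates
```

Note that the unsafe candidate has the highest raw reward (1.5), which is precisely the reward-hacking scenario the gate exists to catch.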

Closing: TL;DR and takeaways

  • RLHF is about aligning LLM behavior to human judgments, not absolute truth. It's powerful but imperfect.
  • It builds on pretraining and SFT: think of those as the grammar and the rulebook; RLHF teaches taste and preference.
  • Expect brittle behavior: reward models can be gamed, and policies can overfit to spurious cues.
  • For prompt engineers: use preference-aware prompting, craft contrastive examples, and stress-test models for reward hacking.

Final thought: RLHF is less like programming a computer and more like teaching a teenager to be charming at dinner. You give examples, correct the worst behavior, try to encode taste, and then keep managing it because humans — and prompts — never stop evolving. Go forth and optimize tastes responsibly.
