LLM Behavior and Capabilities
Understand alignment, sensitivity to phrasing, non-determinism, and other behavioral properties that your prompts must account for.
RLHF and Preference Optimization
RLHF and Preference Optimization — The Chaotic, Charming Art of Teaching Models to Care
"You can pretrain a model to speak like Shakespeare, but if you want it to prefer not to roast strangers, you need to teach taste." — Your mildly dramatic TA
Hook: What happens after you fine-tune a model to follow instructions?
Remember how we covered pretraining and fine-tuning earlier, and how tokens and probabilities are the backstage puppeteers of every reply? Good. That gives us the foundation. Now imagine we gave our model a rule book via supervised fine-tuning (SFT) to be helpful and not hurtful. It learned it and recited it politely. But humans are complicated — preferences are fuzzy, tradeoffs exist, and explicit rules don't capture nuance. Enter: RLHF — reinforcement learning from human feedback — and the whole family of preference optimization techniques. These take the polite model and teach it not just what to say, but what people actually prefer it to say.
What is RLHF, in plain (and slightly dramatic) terms?
Reinforcement learning from human feedback is a three-act play:
- Collect human judgments about which model outputs are better.
- Train a reward model that predicts those human preferences.
- Use that reward model to optimize the LLM's policy, often with RL algorithms like PPO.
Analogy time: pretraining is the model learning language and facts (like learning grammar and recipes). SFT is teaching it to follow a recipe. RLHF is hiring critics at a restaurant, training a taste sensor from their notes, and then tweaking the chef until customers smile more.
Step-by-step: the RLHF loop
- Generate candidate responses from a base policy (SFT model or even a pretrained model).
- Collect human comparisons: which of two or more responses is better and why.
- Fit a reward model r(x, y) that scores outputs y for prompts x based on human prefs.
- Use an RL algorithm to update the policy so it maximizes expected reward under r.
- Repeat (and occasionally sanity-check with fresh human data).
Code-y pseudocode for your inner nerd:
```python
for iteration in range(N):
    # 1. Sample candidate responses from the current policy
    responses = policy.sample(prompts)
    # 2. Collect human preference comparisons over those responses
    prefs = humans.compare(responses)
    # 3. Fit the reward model to the new comparisons
    reward_model.train(prompts, responses, prefs)
    # 4. Update the policy to maximize the learned reward (e.g. with PPO)
    policy = rl_optimize(policy, reward_model)
```
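Step 3 of the loop — fitting r(x, y) to pairwise comparisons — is usually done with a Bradley-Terry-style objective: push the preferred response's reward above the rejected one's. Here is a minimal, dependency-free sketch of that loss on a single comparison (the function name and scalar rewards are illustrative; in practice the rewards come from a neural scorer and the loss is averaged over a batch):

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    Small when the reward model agrees with the human preference,
    large when it ranks the rejected response higher."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note the loss depends only on the *difference* in rewards, which is why reward models trained this way are calibrated for ranking, not for absolute scores.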
Preference optimization: the design space (aka the choose-your-own-adventure of alignment)
There are choices at every step, and each has pros/cons.
| Approach | Pros | Cons |
|---|---|---|
| Supervised fine-tuning (SFT) | Simple, stable, cheap | Doesn't capture nuanced preferences or tradeoffs |
| Reward modeling + RL (RLHF) | Captures nuanced preferences; flexible | Expensive, brittle, reward hacking risk |
| Direct preference optimization (e.g. DPO; no explicit RL) | Simpler than full RL; can do ranking-based distillation | May not achieve the same policy-level improvements |
| Offline preference distillation | Faster inference; less RL instability | Depends on quality and coverage of distillation data |
Ask yourself: do I need the full RLHF orchestra, or is a duet of SFT and preference distillation enough?
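To make the "no RL" row in the table concrete: DPO skips the separate reward model and RL loop entirely, optimizing the policy directly on comparison pairs using log-probabilities from the policy and a frozen reference (typically the SFT model). A minimal sketch of the per-pair loss, assuming you can score full responses under both models:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss on one comparison pair. Inputs are total log-probs of the
    chosen/rejected responses under the policy and the frozen reference.
    beta controls how strongly the policy is pulled toward the preference."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss falls as the policy raises the chosen response's likelihood (relative to the reference) faster than the rejected one's — the same preference signal as RLHF, with no reward model or sampling loop to babysit.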
Why human preferences, not rules?
Humans balance tradeoffs all the time. Think tone vs. clarity, safety vs. usefulness, or humor vs. accuracy. Humans can implicitly encode those tradeoffs in pairwise comparisons. A reward model learns to generalize those judgments.
Quick probe: when was the last time you preferred a brutally honest answer over a gentle one? Context matters. RLHF lets models learn that context, at least partially.
The dark lab coat: failure modes and gotchas
- Reward hacking: the policy finds ways to get high reward without doing what humans intended. Classic example: verbose, confident-sounding answers that are wrong.
- Distributional shift: the reward model was trained on certain prompts; the policy wanders into territories where the reward model is clueless.
- Human inconsistency: people disagree. The reward model will learn an average of their judgments, biases included.
- Overoptimization: chasing the reward too hard can erode helpfulness or truthfulness.
- Scale illusion: larger models + RLHF can look aligned in lab tests but still fail spectacularly in adversarial settings.
Alignment isn't a switch you flip. It's a constantly adjusted thermostat in a house with naked flames.
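The standard guardrail against overoptimization is a KL penalty: the objective isn't raw reward but reward minus a penalty for drifting away from the reference (SFT) policy. A sketch, using the common per-token log-ratio estimate of the KL term (function name and inputs are illustrative):

```python
def penalized_reward(reward, policy_logps, ref_logps, beta=0.05):
    """RLHF training signal with a KL penalty toward the reference policy.
    policy_logps/ref_logps are per-token log-probs of the sampled response
    under each model; their summed difference estimates the KL divergence.
    beta trades off reward-chasing against staying close to the reference."""
    kl_estimate = sum(p - r for p, r in zip(policy_logps, ref_logps))
    return reward - beta * kl_estimate
```

With beta too low, the policy games the reward model; too high, and it never moves off the SFT baseline — tuning it is part of the thermostat-adjusting the line above describes.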
Practical tips for prompt engineers and product folks
- When interacting with an RLHF-trained model, remember it optimizes for human-like preferences. Frame prompts with preference cues: politeness, role constraints, and examples of desired style.
- Use contrastive examples in prompts: show the model a 'good' and a 'bad' output and ask it to emulate the good one.
- Anticipate adversarial prompts. Test for reward hacking by trying edge cases and seeing if the model exploits loopholes.
- Combine techniques: use SFT for basic instruction following, RLHF for nuance, and safety filters as guardrails.
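The contrastive-examples tip can be as simple as a prompt template. A hypothetical example (the wording and scenario are made up for illustration):

```python
# A contrastive prompt: show a 'bad' and a 'good' answer, then ask the
# model to emulate the good one.
contrastive_prompt = """Rewrite the draft reply in the style of the GOOD example.

BAD example (dismissive, vague):
"Can't help with that. Read the docs."

GOOD example (specific, courteous):
"The error means the config file is missing a 'port' key. Add 'port: 8080'
under 'server' and restart; see the Configuration section of the README."

Draft reply to rewrite:
"Your setup is wrong somehow."
"""
```

Because RLHF-trained models were optimized on exactly this kind of better/worse signal, showing the contrast in-context tends to steer them more reliably than an abstract instruction like "be nicer."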
Small checklist before you roll RLHF in production
- Do you have enough high-quality human comparisons? (Quantity and diversity matter.)
- Do you have a plan to detect reward hacking and distributional drift? (Monitor against held-out human judgments.)
- Can you do human-in-the-loop audits periodically? (Yes, you can afford it.)
- Are safety filters and rejection sampling in place for catastrophic edge cases? (Please don’t skip this.)
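The "monitor against held-out human judgments" item can be operationalized as a single number: on fresh comparisons the reward model never trained on, how often does it rank the human-preferred response higher? A minimal sketch (helper name is illustrative):

```python
def agreement_rate(pairs):
    """Fraction of held-out human comparisons where the reward model scores
    the human-preferred response above the rejected one. Each pair is
    (reward_of_preferred, reward_of_rejected). A drop over time is a
    warning sign of distributional drift or reward hacking."""
    if not pairs:
        return 0.0
    agree = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return agree / len(pairs)
```

Track this on a rolling window of fresh annotations; alert when it dips below the rate measured at deployment time.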
Closing: TL;DR and takeaways
- RLHF is about aligning LLM behavior to human judgments, not absolute truth. It's powerful but imperfect.
- It builds on pretraining and SFT: think of those as the grammar and the rulebook; RLHF teaches taste and preference.
- Expect brittle behavior: reward models can be gamed, and policies can overfit to spurious cues.
- For prompt engineers: use preference-aware prompting, craft contrastive examples, and stress-test models for reward hacking.
Final thought: RLHF is less like programming a computer and more like teaching a teenager to be charming at dinner. You give examples, correct the worst behavior, try to encode taste, and then keep managing it because humans — and prompts — never stop evolving. Go forth and optimize tastes responsibly.