Iteration, Testing, and Prompt Debugging
Develop a rigorous workflow to test, analyze, and refine prompts using experiments, versioning, and red teaming.
A/B and Multivariate Prompt Tests — The Scientific Circus of Prompts
"You're not debugging prompts until you've broken at least two assumptions and blamed the model for being dramatic." — Your mildly exasperated TA
Hook: You already know how to think; now learn how to measure it
This isn't a fluff exercise. You've been taught outline-first strategies, hypothesis testing, and verification-first prompting — perfect. Now we turn that thoughtful process into controlled experiments. If your last prompt iteration felt like whispering suggestions to a fortune cookie, A/B and multivariate testing will give you repeatable, interpretable results.
Quick context linkup: remember Chain-of-Thought (CoT) considerations and the advice to eliminate irrelevant paths? Those help you design clean comparisons — keep the variables controlled and the noise out of your measurements.
What's the difference? A/B vs multivariate (the elevator pitch)
- A/B test: Compare two prompts (A vs B). Simple, low-friction, great for single-change hypotheses.
- Multivariate test: Compare multiple prompts across several factors (e.g., tone × structure × CoT). More powerful, reveals interactions, but needs more runs and clearer metrics.
Quick table
| Feature | A/B | Multivariate (factorial) |
|---|---|---|
| Complexity | Low | Medium → High |
| Number of variants | 2 | 4, 8, 16, ... |
| Best for | Single-change validation | Exploring multiple factor interactions |
| Sample size needs | Small | Larger |
Why run these tests? (Spoiler: opinions are lying)
- Turn subjective impressions (“this feels better”) into objective evidence.
- Catch interaction effects (that charming surprise where adding both a polite tone and CoT makes answers worse, not better).
- Avoid chasing phantom improvements caused by randomness — especially if you tune temperature, system prompt, or model version without controlling them.
Design your A/B or multivariate test (the engineering checklist)
- Define the hypothesis: e.g., Adding an outline-first instruction increases factual accuracy on multi-step math by 10%.
- Pick your metrics (quant + qual):
- Primary metric: factual accuracy, pass/fail on test cases (binary)
- Secondary: conciseness, hallucination rate, adherence to tone (Likert or rubric-scored)
- Control variables: model version, temperature, max tokens, seed (if available), dataset split.
- Create variants:
- A/B: Baseline prompt vs Baseline + outline-first line
- Multivariate: Factor 1 = CoT (on/off), Factor 2 = Outline-first (on/off), Factor 3 = Temperature (0.0 vs 0.7)
- Sample design: randomize assignment, stratify if your test cases are heterogeneous (e.g., easy/hard).
- Decide statistical test: proportion test (binomial) for accuracy, t-test or ANOVA for continuous scores, chi-squared for categorical.
Pro-tip: use the hypothesis-testing mindset from earlier modules. Define a null hypothesis and what counts as a meaningful effect size before you run anything.
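The "create variants" step of a factorial design is just the Cartesian product of factor levels. A minimal sketch using the three factors above (the prompt fragments are illustrative placeholders, not tested wording):

```python
from itertools import product

# Each factor maps to its levels; the prompt fragments are illustrative.
factors = {
    "cot": ["", "Think step by step. "],
    "outline": ["", "First give a 1-3 step outline of your plan. "],
    "temperature": [0.0, 0.7],
}

# One variant per combination of levels: 2 x 2 x 2 = 8 variants.
variants = [dict(zip(factors, combo)) for combo in product(*factors.values())]

print(len(variants))  # 8
```

Generating variants programmatically also guarantees you never forget a cell of the design, which is the most common way factorial tests silently break.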
Metrics: What to track (practically)
- Quantitative: accuracy, BLEU/ROUGE for constrained outputs, length, token usage
- Qualitative: rubric scores for hallucination, reasoning quality, adherence
- Operational: latency, cost per prompt
Always pair an objective primary metric (e.g., accuracy) with at least one human-in-the-loop qualitative check for subtle failure modes (like plausible-sounding-but-wrong answers).
Statistical basics (non-nerdy summary)
- Small differences can be noise. Increase sample size or accept uncertainty.
- For A/B with binary outcomes, use a two-proportion z-test or binomial test.
- For multivariate (factorial) designs, use ANOVA to check main effects and interactions.
- If you run many comparisons, adjust for multiple tests (Bonferroni, FDR).
If stats makes you want to cry, at least track effect sizes and confidence intervals — they tell you how meaningful the difference is, not just whether it passed a p-value barrier.
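Here is what "track confidence intervals" looks like in practice: a normal-approximation 95% CI for the difference between two accuracies, standard library only (the counts are illustrative placeholders):

```python
import math

def diff_ci(correct_a, n_a, correct_b, n_b, z=1.96):
    """95% normal-approximation CI for the accuracy difference p_b - p_a."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Illustrative counts: 86/100 vs 93/100 correct.
lo, hi = diff_ci(86, 100, 93, 100)
# If the interval contains 0, the observed gap may just be noise.
print(f"[{lo:.3f}, {hi:.3f}]")
```

Note that with these numbers the interval straddles zero: a 7-point gap on 100 samples is suggestive, not conclusive.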
Example: A/B test you can actually run
Baseline (A):
You are an expert tutor. Answer the question concisely with the final answer and a brief explanation.
Q: Solve for x: 2x + 7 = 15
Variant (B): add outline-first + verification:
You are an expert tutor. First provide a 1–3 step outline of your plan, then solve. After solving, add a short verification step showing the result substituted back into the equation.
Q: Solve for x: 2x + 7 = 15
Run N test questions, randomize which prompt is used per question, score correctness (binary). Compute proportions and test for significance.
Sample results table:
| Variant | N | Correct | Accuracy |
|---|---|---|---|
| Baseline | 100 | 86 | 86% |
| Outline+Verify | 100 | 93 | 93% |
If p < 0.05 and confidence intervals don't overlap, you probably have a win.
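The significance check can be sketched as a two-proportion z-test in pure standard library. Run it on the sample table above and you get a cautionary result: 86% vs 93% on N=100 each does not actually clear p < 0.05, which is exactly why sample size matters.

```python
import math

def two_prop_z(correct_a, n_a, correct_b, n_b):
    """Two-sided two-proportion z-test (normal approximation)."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

z, p = two_prop_z(86, 100, 93, 100)
print(f"z = {z:.2f}, p = {p:.3f}")  # p lands above 0.05 here
```

If you want that 7-point gap to reach significance, either collect more samples or accept a wider uncertainty band.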
Multivariate example (showing interactions)
Factors: CoT (yes/no) × Outline (yes/no)
Variants: 4 (00, 01, 10, 11). Run each on same test set randomly assigned. You might discover:
- CoT helps on complex reasoning
- Outline alone helps on step clarity
- But CoT + Outline together reduces conciseness and increases hallucination rate (an interaction!)
This is why multivariate testing is powerful: it uncovers when 'two goods' combine into a mess.
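One simple way to quantify that interaction in a 2x2 design is the difference-in-differences of cell means: the effect of CoT with the outline present, minus its effect without it. A minimal sketch with illustrative scores:

```python
# Mean rubric scores per cell of a 2x2 design (illustrative numbers).
# Keys: (cot_on, outline_on) -> mean score.
cells = {
    (0, 0): 0.70,  # baseline
    (1, 0): 0.82,  # CoT alone helps
    (0, 1): 0.78,  # outline alone helps
    (1, 1): 0.74,  # together: worse than either alone
}

# Interaction = (effect of CoT with outline) - (effect of CoT without outline).
interaction = (cells[(1, 1)] - cells[(0, 1)]) - (cells[(1, 0)] - cells[(0, 0)])
print(round(interaction, 2))  # negative -> the two factors interfere
```

A clearly nonzero interaction term is the signal that you cannot evaluate these levers one at a time.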
Pseudocode: automated test harness (concept)
```python
for prompt_variant in variants:
    for question in test_set:
        response = model.generate(prompt_variant + question.text)
        score = auto_or_human_score(response, question)
        log({"variant": prompt_variant, "question_id": question.id, "score": score})
# aggregate scores per variant, then run the statistical tests described above
```
Add bootstrapping if you have small N or non-normal score distributions.
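The bootstrapping suggestion can be sketched as a percentile CI over resampled score differences, standard library only (the score lists below are illustrative placeholders):

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_b) - mean(scores_a)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resample_a = [rng.choice(scores_a) for _ in scores_a]
        resample_b = [rng.choice(scores_b) for _ in scores_b]
        diffs.append(sum(resample_b) / len(resample_b)
                     - sum(resample_a) / len(resample_a))
    diffs.sort()
    return diffs[int(n_boot * alpha / 2)], diffs[int(n_boot * (1 - alpha / 2))]

# Binary correctness scores (illustrative): 86/100 vs 93/100.
scores_a = [1] * 86 + [0] * 14
scores_b = [1] * 93 + [0] * 7
lo, hi = bootstrap_diff_ci(scores_a, scores_b)
print(f"[{lo:.3f}, {hi:.3f}]")
```

The bootstrap makes no normality assumption, so it also works for rubric scores and other skewed distributions where the z-test's approximation is shaky.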
Pitfalls & Debugging Checklist
- You changed the model or temperature mid-test? Abort and restart.
- You compared prompts with different implicit constraints (length, tone) -> not apples-to-apples.
- Small sample illusions: repeat-run variability misleads you.
- Overfitting the test set: don’t tune on the same questions you evaluate on.
- Ignoring interactions: a variant that wins overall might fail for a subgroup.
When a variant fails: check logs, inspect examples where it lost, look for pattern (e.g., fails on multi-step arithmetic). Use your earlier decomposition skills: hypothesize failure mode, design targeted A/B between two micro-variants, and iterate.
Quick debugging recipes
- If answers are shorter but worse: test with and without a `max_tokens` cap and with an explicit brevity instruction.
- If hallucinations spike: run an A/B where only the verification instruction is toggled.
- If CoT outputs nonsense: test a reduced temperature, or a variant that asks for bullet-point outlines only.
Closing — TL;DR (the takeaway you can tattoo on your brain)
- Use A/B for focused, quick checks; use multivariate for exploring multiple levers and interactions.
- Control variables, predefine metrics, randomize, and use human checks for subtle failures.
- Lean on your outline-first and hypothesis-testing habits: design tests with clear hypotheses and verification steps.
- When in doubt, run a small factorial test rather than guessing. The model is not psychic; your experiments are.
Final mic drop: good prompt engineering is less about clever one-liners and more about clean experiments. If you can measure it, you can improve it.
Version notes: This builds on outline-first strategies and Chain-of-Thought considerations — use those approaches as factors in your A/B or multivariate designs for the highest signal-to-noise ratio.