Evaluation, Metrics, and Quality Control
Measure output quality with human and automated methods, track performance, and close the loop with monitoring.
Objective vs Subjective Metrics — The Friendly (and Occasionally Judgmental) Guide
"If your metric can't be explained to a human in under 30 seconds, it's probably lying." — Probably an overconfident data scientist
You're already past Human Evaluation Rubrics (Position 1) and LLM-as-Judge techniques (Position 2), and you've built a disciplined testing loop in Iteration, Testing, and Prompt Debugging. Nice. Now it's time for the part where we decide what success actually looks like — and whether success is a cold, hard number or a squishy human feeling.
TL;DR (aka, the snackable version)
- Objective metrics = measurable, repeatable numbers (latency, token count, BLEU, edit distance, perplexity).
- Subjective metrics = human judgments (helpfulness, style, trustworthiness, safety). Measured with rubrics or comparative tests.
- Use both. Objective metrics for automation and fast checks; subjective metrics for alignment with human values and end-user satisfaction.
1. Definitions — So we’re speaking the same language
- Objective metrics: Quantitative measures you can compute automatically. They are precise, fast, and reproducible. Great for CI and automated gates.
- Subjective metrics: Qualitative measures that require human (or calibrated LLM) judgment. They capture nuance, intent, and taste.
Why both? Because your fancy model might have excellent perplexity but read like a bored encyclopedia. Or it might sound delightful but invent facts like it's storytelling karaoke night.
2. Common objective metrics (and their flirtation with error)
- Perplexity — How surprised the model is; lower → better fit to training distribution. Useful for training signals, not user-facing quality.
- BLEU / ROUGE / METEOR — N-gram overlap scores. OK for constrained outputs (translation, summarization with reference) but limited for open-ended prompts.
- Edit distance / Levenshtein — How many edits to transform A → B. Good for format validation or normalization tasks.
- Exact-match / F1 — Classic for QA with short answers.
- Latency, tokens per response, cost — Operational metrics that are brutally objective: faster and cheaper often win in production.
Pros: automated, fast, consistent. Cons: often blind to meaning, safety, and user satisfaction.
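To make a few of these concrete, here's a minimal sketch of exact-match, token-level F1, and Levenshtein edit distance. Function names and normalization choices are illustrative; production pipelines usually reach for an established metrics library instead of hand-rolling these.

```python
def exact_match(pred: str, ref: str) -> float:
    """1.0 if the normalized strings match exactly, else 0.0."""
    return float(pred.strip().lower() == ref.strip().lower())

def token_f1(pred: str, ref: str) -> float:
    """Harmonic mean of token precision and recall (classic short-answer QA metric)."""
    pred_tokens = pred.lower().split()
    ref_tokens = ref.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    common = 0
    ref_pool = list(ref_tokens)
    for tok in pred_tokens:          # count overlapping tokens with multiplicity
        if tok in ref_pool:
            ref_pool.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]
```

Note how brutally literal these are: `token_f1("the cat sat", "the cat")` scores 0.8 regardless of whether the extra word helped or hurt the answer. That blindness to meaning is exactly the "cons" above.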
3. Common subjective metrics (and how to measure them)
- Helpfulness / Usefulness — Does the output solve the user’s problem?
- Fluency / Readability — Is it grammatical and smooth?
- Factuality / Correctness — Is it true? (Hard — needs domain expertise.)
- Style / Persona adherence — Does it match the requested tone?
- Safety / Toxicity / Bias — Is it free from harmful content?
Measurement approaches:
- Likert scales (1–5): Raters score attributes with an explicit rubric.
- Pairwise comparison / preference tests: Raters pick A vs B. Reliable and often cheaper than fine-grained scoring.
- Task success: Give users a job; did they finish it? (Behavioral — best signal of usefulness.)
- Expert review: Domain specialists evaluate factuality.
Pros: aligned with user experience. Cons: expensive, slower, variable inter-rater reliability.
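Pairwise preference tests are simple enough to tally by hand, but a sketch helps. Here, ties count half to each side, which is one common convention (not the only one):

```python
from collections import Counter

def preference_winrate(votes: list) -> dict:
    """votes: one of 'A', 'B', or 'tie' per rater comparison.
    Ties are split half-and-half between the two systems."""
    counts = Counter(votes)
    total = len(votes)
    a_score = counts["A"] + 0.5 * counts["tie"]
    return {
        "A_winrate": a_score / total,
        "B_winrate": 1 - a_score / total,
        "n": total,
    }

# Toy data: 7 raters prefer A, 2 prefer B, 1 tie
result = preference_winrate(["A"] * 7 + ["B"] * 2 + ["tie"])
# result["A_winrate"] == 0.75
```

A win rate near 0.5 means the raters can't tell your systems apart, which is itself a useful finding before you ship the "improved" prompt.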
4. Hybrid approaches — Where the magic actually happens
You already know LLM-as-Judge techniques — they're great for scaling subjective judgments, but remember: judges can be fooled. Combine them with human rubrics to calibrate and catch adversarial cases.
Examples:
- Use automated objective checks as pre-filters (format, length, profanity). Then send the survivors to human raters or LLM judges for subjective scoring.
- Run LLM-as-Judge on a large batch, then sample and audit with human raters. Measure disagreement rates and tune the LLM judge prompt.
- Compute a composite score: weighted sum of objective and subjective metrics (example below).
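The first pattern above (objective pre-filters, then a judge) can be sketched in a few lines. Everything here is a stand-in: the banned-word list is a toy, and `judge_fn` is a placeholder for a human rater or an LLM-as-Judge call.

```python
def passes_prefilters(text: str, max_tokens: int = 512,
                      banned: tuple = ("offensive-word",)) -> bool:
    """Cheap objective gates: length cap and a toy banned-word check."""
    if len(text.split()) > max_tokens:
        return False
    lowered = text.lower()
    return not any(word in lowered for word in banned)

def evaluate_batch(outputs: list, judge_fn) -> dict:
    """Run objective pre-filters first; only survivors reach the
    (expensive) judge, which stands in for a human or LLM rater."""
    survivors = [o for o in outputs if passes_prefilters(o)]
    return {o: judge_fn(o) for o in survivors}

# Toy usage: the "judge" just counts words, standing in for a real score
scores = evaluate_batch(
    ["short answer", "offensive-word reply", "a helpful reply"],
    judge_fn=lambda o: len(o.split()),
)
```

The point of the pattern: the judge only ever sees outputs that already pass the cheap checks, so you spend your expensive evaluation budget where it matters.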
5. Practical checklist: Choosing metrics for your prompt-testing loop
- What is the user goal? (Accuracy, speed, persuasion, empathy?)
- Which errors are catastrophic? (Hallucination, bias, safety failures?)
- Which metrics are automatable? (Good for CI)
- Where must you have human judgment? (Factuality for medicine/legal)
- Define thresholds for automated gates and human review triggers.
6. Quick rubric snippet (copy-paste and adapt)
Use this in your human evaluation phase (Position 1). Keep it short; raters hate long forms.
- Helpfulness (1–5): 1 = useless; 5 = directly solves the user's intent.
- Factuality (1–5): 1 = materially false; 5 = accurate and verifiable.
- Tone Match (1–3): 1 = off; 3 = perfect.
- Safety Flag (yes/no): Any harmful content? If yes, explain.
Compute overall subjective score as a normalized average; log rater comments.
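"Normalized average" deserves one concrete line of code. Here's a minimal sketch of scoring one rating against the rubric above; the safety-flag veto is an assumed design choice, not the only reasonable one:

```python
def normalize(score: float, lo: float, hi: float) -> float:
    """Map a rubric score from [lo, hi] to [0, 1]."""
    return (score - lo) / (hi - lo)

def subjective_score(rating: dict) -> float:
    """rating: helpfulness (1-5), factuality (1-5), tone (1-3), safety_flag (bool).
    A safety flag vetoes the score outright (assumed policy)."""
    if rating["safety_flag"]:
        return 0.0
    parts = [
        normalize(rating["helpfulness"], 1, 5),
        normalize(rating["factuality"], 1, 5),
        normalize(rating["tone"], 1, 3),
    ]
    return sum(parts) / len(parts)
```

Normalizing each attribute to [0, 1] before averaging keeps the 1–3 tone scale from being silently outweighed by the 1–5 scales.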
7. Stats you actually need to care about
- Inter-rater reliability: Cohen's kappa or Krippendorff's alpha. If kappa < 0.6, your rubric needs work.
- Statistical significance: Use bootstrap or t-tests for pairwise A/B comparisons of subjective scores.
- Calibration: Periodically recalibrate LLM judges against a gold human-labeled set.
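Both of the first two points fit in a few lines of standard-library Python. This is a sketch of Cohen's kappa for two raters and a paired bootstrap confidence interval on score differences; for real work you'd likely use `sklearn.metrics.cohen_kappa_score` and `scipy.stats.bootstrap` instead.

```python
import random
from collections import Counter

def cohens_kappa(rater1: list, rater2: list) -> float:
    """Agreement between two raters, corrected for chance."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    labels = set(rater1) | set(rater2)
    expected = sum((c1[lab] / n) * (c2[lab] / n) for lab in labels)
    return (observed - expected) / (1 - expected)

def bootstrap_ci(diffs: list, n_boot: int = 5000, seed: int = 0) -> tuple:
    """95% bootstrap CI for the mean of paired score differences
    (system A minus system B on the same prompts)."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(diffs) for _ in range(len(diffs))]
        means.append(sum(sample) / len(sample))
    means.sort()
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]
```

If the bootstrap CI for the mean difference excludes zero, your A/B difference is probably real; if it straddles zero, don't write the victory blog post yet.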
8. Example: Weighted composite metric (pseudocode)
# objective_score, subjective_score, cost_penalty all normalized to [0, 1]
composite = 0.4 * objective_score + 0.5 * subjective_score - 0.1 * cost_penalty
# Production guardrail: deploy only if composite >= 0.7 and not safety_flag
Weights reflect priorities: if user satisfaction matters more than token cost, give subjective higher weight.
9. Red flags & gotchas (read these like allergy warnings)
- Relying only on n-gram metrics for open-ended output — you're optimizing for copycatting references, not quality.
- LLM-as-Judge drift — judges pick up biases or get tricked by phrasing. Always hold out a human-verified set.
- Over-optimizing for your test set — prompt-debugging can overfit. Use fresh holdouts and adversarial prompts.
10. Final recipe (micro-workflow)
- Define the user task and failure modes.
- Choose 1–2 objective metrics for fast CI checks (format, latency, basic correctness).
- Design a short human rubric for subjective quality; run a pilot and measure inter-rater reliability.
- Use LLM-as-Judge to scale, but calibrate against humans and monitor disagreement.
- Combine into a composite score with clear thresholds for deployment and for triggering red-team review.
- Iterate: when debugging prompts, examine which metric changed and why.
"Metrics are not truth — they're signals. Learn to listen to them, but verify with humans before you act on them." — The TA who's been burned by a model inventing citations
Takeaway
Objective metrics are your rapid-fire guards and monitors. Subjective metrics are the human senses that tell you whether people will actually like and trust your model. Use both, calibrate often, and prefer simple, explainable composites over mysterious indices. Keep one foot in automation and the other in human judgment — that's how you dodge hallucinations, save wallets, and keep users coming back.