
Generative AI: Prompt Engineering Basics
Evaluation, Metrics, and Quality Control


Measure output quality with human and automated methods, track performance, and close the loop with monitoring.

Objective vs Subjective Metrics — The Friendly (and Occasionally Judgmental) Guide

"If your metric can't be explained to a human in under 30 seconds, it's probably lying." — Probably an overconfident data scientist

You've already worked through Human Evaluation Rubrics and LLM-as-Judge techniques, and you've built a disciplined testing loop in Iteration, Testing, and Prompt Debugging. Nice. Now it's time for the part where we decide what success actually looks like, and whether success is a cold, hard number or a squishy human feeling.


TL;DR (aka, the snackable version)

  • Objective metrics = measurable, repeatable numbers (latency, token count, BLEU, edit distance, perplexity).
  • Subjective metrics = human judgments (helpfulness, style, trustworthiness, safety). Measured with rubrics or comparative tests.
  • Use both. Objective metrics for automation and fast checks; subjective metrics for alignment with human values and end-user satisfaction.

1. Definitions — So we’re speaking the same language

  • Objective metrics: Quantitative measures you can compute automatically. They are precise, fast, and reproducible. Great for CI and automated gates.
  • Subjective metrics: Qualitative measures that require human (or calibrated LLM) judgment. They capture nuance, intent, and taste.

Why both? Because your fancy model might have excellent perplexity but read like a bored encyclopedia. Or it might sound delightful but invent facts like it's storytelling karaoke night.


2. Common objective metrics (and their flirtation with error)

  • Perplexity — How surprised the model is; lower → better fit to training distribution. Useful for training signals, not user-facing quality.
  • BLEU / ROUGE / METEOR — N-gram overlap scores. OK for constrained outputs (translation, summarization with reference) but limited for open-ended prompts.
  • Edit distance / Levenshtein — How many edits to transform A → B. Good for format validation or normalization tasks.
  • Exact-match / F1 — Classic for QA with short answers.
  • Latency, tokens per response, cost — Operational metrics that are brutally objective: faster and cheaper often win in production.

Pros: automated, fast, consistent. Cons: often blind to meaning, safety, and user satisfaction.
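
Two of the cheapest checks above, exact match and token-level F1, can be sketched in a few lines. This is a minimal illustration assuming crude whitespace tokenization and lowercasing, not a production normalizer:

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    # Crude normalization: lowercase and whitespace-tokenize.
    return text.lower().split()

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over shared tokens.
    pred, ref = normalize(prediction), normalize(reference)
    common = Counter(pred) & Counter(ref)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                       # 1.0
print(round(token_f1("the capital is Paris", "Paris"), 2)) # 0.4
```

Note how F1 partially rewards a verbose-but-correct answer while exact match gives it zero; that gap is exactly why short-answer QA usually reports both.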


3. Common subjective metrics (and how to measure them)

  • Helpfulness / Usefulness — Does the output solve the user’s problem?
  • Fluency / Readability — Is it grammatical and smooth?
  • Factuality / Correctness — Is it true? (Hard — needs domain expertise.)
  • Style / Persona adherence — Does it match the requested tone?
  • Safety / Toxicity / Bias — Is it free from harmful content?

Measurement approaches:

  • Likert scales (1–5): Raters score attributes with an explicit rubric.
  • Pairwise comparison / preference tests: Raters pick A vs B. Reliable and often cheaper than fine-grained scoring.
  • Task success: Give users a job; did they finish it? (Behavioral — best signal of usefulness.)
  • Expert review: Domain specialists evaluate factuality.

Pros: aligned with user experience. Cons: expensive, slower, variable inter-rater reliability.
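
Pairwise preference tests aggregate into a simple win rate. Here is a minimal sketch assuming each rater votes "A", "B", or "tie" on a pair of prompt variants, with ties split evenly:

```python
from collections import Counter

def preference_winrate(votes: list[str]) -> dict[str, float]:
    # votes: each rater picks "A", "B", or "tie" for one variant pair.
    counts = Counter(votes)
    n = len(votes)
    # Split ties evenly between the two variants.
    return {
        "A": (counts["A"] + counts["tie"] / 2) / n,
        "B": (counts["B"] + counts["tie"] / 2) / n,
    }

print(preference_winrate(["A", "A", "B", "tie", "A"]))
# {'A': 0.7, 'B': 0.3}
```

A win rate reliably above 0.5 (with enough raters; see the significance note below) is the usual bar for declaring one variant better.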


4. Hybrid approaches — Where the magic actually happens

You already know LLM-as-Judge techniques — they're great for scaling subjective judgments, but remember: judges can be fooled. Combine them with human rubrics to calibrate and catch adversarial cases.

Examples:

  • Use automated objective checks as pre-filters (format, length, profanity). Then send survivors to human or LLM-instructed judges for subjective scoring.
  • Run LLM-as-Judge on a large batch, then sample and audit with human raters. Measure disagreement rates and tune the LLM judge prompt.
  • Compute a composite score: weighted sum of objective and subjective metrics (example below).
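
The pre-filter idea can be sketched as a single gate function. Everything here is illustrative: the JSON-with-"answer"-key format, the 200-token budget, and the tiny blocklist are hypothetical stand-ins for your real checks:

```python
import json

BLOCKLIST = {"darn"}   # hypothetical profanity list, for illustration only
MAX_TOKENS = 200       # assumed length budget

def passes_prefilter(output: str) -> bool:
    """Cheap objective gates run before any human or LLM judging."""
    # Format check: output must be valid JSON with an "answer" key.
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return False
    if "answer" not in payload:
        return False
    # Length check: crude whitespace token count.
    if len(output.split()) > MAX_TOKENS:
        return False
    # Profanity check against the blocklist.
    words = {w.strip(".,!?").lower() for w in output.split()}
    return not (words & BLOCKLIST)

print(passes_prefilter('{"answer": "42"}'))  # True
print(passes_prefilter('not json at all'))   # False
```

Outputs that fail the gate never reach your (expensive) judges, which keeps subjective evaluation budgets focused on candidates that at least satisfy the basics.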

5. Practical checklist: Choosing metrics for your prompt-testing loop

  1. What is the user goal? (Accuracy, speed, persuasion, empathy?)
  2. Which errors are catastrophic? (Hallucination, bias, safety failures?)
  3. Which metrics are automatable? (Good for CI)
  4. Where must you have human judgment? (Factuality for medicine/legal)
  5. Define thresholds for automated gates and human review triggers.

6. Quick rubric snippet (copy-paste and adapt)

Use this in your human evaluation phase. Keep it short; raters hate long forms.

  • Helpfulness (1–5): 1 = useless; 5 = directly solves the user's intent.
  • Factuality (1–5): 1 = materially false; 5 = accurate and verifiable.
  • Tone Match (1–3): 1 = off; 3 = perfect.
  • Safety Flag (yes/no): Any harmful content? If yes, explain.

Compute overall subjective score as a normalized average; log rater comments.
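
That normalized average can be computed mechanically. A minimal sketch, assuming the scales from the rubric above (helpfulness and factuality on 1–5, tone on 1–3), each rescaled to [0, 1] before averaging:

```python
def subjective_score(ratings: dict) -> float:
    # Rescale each attribute to [0, 1] using its rubric range, then average.
    scales = {"helpfulness": (1, 5), "factuality": (1, 5), "tone": (1, 3)}
    normalized = [
        (ratings[k] - lo) / (hi - lo) for k, (lo, hi) in scales.items()
    ]
    return sum(normalized) / len(normalized)

print(round(subjective_score({"helpfulness": 4, "factuality": 5, "tone": 2}), 3))
# 0.75
```

Rescaling first matters: averaging raw scores would silently weight the 1–5 attributes more heavily than the 1–3 one.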


7. Stats you actually need to care about

  • Inter-rater reliability: Cohen's kappa or Krippendorff's alpha. If kappa < 0.6, your rubric needs work.
  • Statistical significance: Use bootstrap or t-tests for pairwise A/B comparisons of subjective scores.
  • Calibration: Periodically recalibrate LLM judges against a gold human-labeled set.
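
For two raters, Cohen's kappa is short enough to compute by hand: observed agreement corrected for the agreement you'd expect by chance given each rater's label frequencies. A minimal sketch with made-up labels:

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Two-rater Cohen's kappa: observed agreement corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick class c, summed over classes.
    expected = sum(
        (freq_a[c] / n) * (freq_b[c] / n)
        for c in set(rater_a) | set(rater_b)
    )
    return (observed - expected) / (1 - expected)

labels_a = [1, 2, 3, 3, 2, 1, 1, 2]  # illustrative ratings, not real data
labels_b = [1, 2, 3, 2, 2, 1, 3, 2]
print(round(cohens_kappa(labels_a, labels_b), 3))  # 0.619
```

In this toy example kappa lands just above the 0.6 bar, despite 75% raw agreement, which is exactly the point: raw agreement flatters your rubric, and kappa is the honest number.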

8. Example: Weighted composite metric (Python)

# Suppose: objective_score, subjective_score, cost_penalty all normalized to [0, 1]
def composite(objective_score, subjective_score, cost_penalty):
    return 0.4 * objective_score + 0.5 * subjective_score - 0.1 * cost_penalty

# Production guardrail: deploy only if composite >= 0.7 and no safety flag
def passes_gate(objective_score, subjective_score, cost_penalty, safety_flag):
    score = composite(objective_score, subjective_score, cost_penalty)
    return score >= 0.7 and not safety_flag

Weights reflect priorities: if user satisfaction matters more than token cost, give subjective higher weight.


9. Red flags & gotchas (read these like allergy warnings)

  • Relying only on n-gram metrics for open-ended output — you're optimizing for copycatting references, not quality.
  • LLM-as-Judge drift — judges pick up biases or get tricked by phrasing. Always hold out a human-verified set.
  • Over-optimizing for your test set — prompt-debugging can overfit. Use fresh holdouts and adversarial prompts.

10. Final recipe (micro-workflow)

  1. Define the user task and failure modes.
  2. Choose 1–2 objective metrics for fast CI checks (format, latency, basic correctness).
  3. Design a short human rubric for subjective quality; run a pilot and measure inter-rater reliability.
  4. Use LLM-as-Judge to scale, but calibrate against humans and monitor disagreement.
  5. Combine into a composite score with clear thresholds for deployment and for triggering red-team review.
  6. Iterate: when debugging prompts, examine which metric changed and why.

"Metrics are not truth — they're signals. Learn to listen to them, but verify with humans before you act on them." — The TA who's been burned by a model inventing citations

Takeaway

Objective metrics are your rapid-fire guards and monitors. Subjective metrics are the human senses that tell you whether people will actually like and trust your model. Use both, calibrate often, and prefer simple, explainable composites over mysterious indices. Keep one foot in automation and the other in human judgment — that's how you dodge hallucinations, save wallets, and keep users coming back.
