Evaluation, Metrics, and Quality Control
Measure output quality with human and automated methods, track performance, and close the loop with monitoring.
Objective vs Subjective Metrics — The Friendly (and Occasionally Judgmental) Guide
"If your metric can't be explained to a human in under 30 seconds, it's probably lying." — Probably an overconfident data scientist
You're already past Human Evaluation Rubrics (Position 1) and LLM-as-Judge techniques (Position 2), and you've built a disciplined testing loop in Iteration, Testing, and Prompt Debugging. Nice. Now it's time for the part where we decide what success actually looks like — and whether success is a cold, hard number or a squishy human feeling.
TL;DR (aka, the snackable version)
- Objective metrics = measurable, repeatable numbers (latency, token count, BLEU, edit distance, perplexity).
- Subjective metrics = human judgments (helpfulness, style, trustworthiness, safety). Measured with rubrics or comparative tests.
- Use both. Objective metrics for automation and fast checks; subjective metrics for alignment with human values and end-user satisfaction.
1. Definitions — So we’re speaking the same language
- Objective metrics: Quantitative measures you can compute automatically. They are precise, fast, and reproducible. Great for CI and automated gates.
- Subjective metrics: Qualitative measures that require human (or calibrated LLM) judgment. They capture nuance, intent, and taste.
Why both? Because your fancy model might have excellent perplexity but read like a bored encyclopedia. Or it might sound delightful but invent facts like it's storytelling karaoke night.
2. Common objective metrics (and their flirtation with error)
- Perplexity — How surprised the model is; lower → better fit to training distribution. Useful for training signals, not user-facing quality.
- BLEU / ROUGE / METEOR — N-gram overlap scores. OK for constrained outputs (translation, summarization with reference) but limited for open-ended prompts.
- Edit distance / Levenshtein — How many edits to transform A → B. Good for format validation or normalization tasks.
- Exact-match / F1 — Classic for QA with short answers.
- Latency, tokens per response, cost — Operational metrics that are brutally objective: faster and cheaper often win in production.
Pros: automated, fast, consistent. Cons: often blind to meaning, safety, and user satisfaction.
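To make a few of these concrete, here's a minimal sketch of exact-match, token-level F1, and Levenshtein edit distance. Function names and normalization choices are illustrative; production pipelines usually reach for an established metrics library instead of hand-rolling these.

```python
def exact_match(pred: str, ref: str) -> float:
    """1.0 if the normalized strings match exactly, else 0.0."""
    return float(pred.strip().lower() == ref.strip().lower())

def token_f1(pred: str, ref: str) -> float:
    """Harmonic mean of token precision and recall (classic short-answer QA metric)."""
    pred_tokens = pred.lower().split()
    ref_tokens = ref.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    common = 0
    ref_pool = list(ref_tokens)
    for tok in pred_tokens:          # count overlapping tokens with multiplicity
        if tok in ref_pool:
            ref_pool.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]
```

Note how brutally literal these are: `token_f1("the cat sat", "the cat")` scores 0.8 regardless of whether the extra word helped or hurt the answer. That blindness to meaning is exactly the "cons" above.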
3. Common subjective metrics (and how to measure them)
- Helpfulness / Usefulness — Does the output solve the user’s problem?
- Fluency / Readability — Is it grammatical and smooth?
- Factuality / Correctness — Is it true? (Hard — needs domain expertise.)
- Style / Persona adherence — Does it match the requested tone?
- Safety / Toxicity / Bias — Is it free from harmful content?
Measurement approaches:
- Likert scales (1–5): Raters score attributes with an explicit rubric.
- Pairwise comparison / preference tests: Raters pick A vs B. Reliable and often cheaper than fine-grained scoring.
- Task success: Give users a job; did they finish it? (Behavioral — best signal of usefulness.)
- Expert review: Domain specialists evaluate factuality.
Pros: aligned with user experience. Cons: expensive, slower, variable inter-rater reliability.
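Pairwise preference tests are simple enough to tally by hand, but a sketch helps. Here, ties count half to each side, which is one common convention (not the only one):

```python
from collections import Counter

def preference_winrate(votes: list) -> dict:
    """votes: one of 'A', 'B', or 'tie' per rater comparison.
    Ties are split half-and-half between the two systems."""
    counts = Counter(votes)
    total = len(votes)
    a_score = counts["A"] + 0.5 * counts["tie"]
    return {
        "A_winrate": a_score / total,
        "B_winrate": 1 - a_score / total,
        "n": total,
    }

# Toy data: 7 raters prefer A, 2 prefer B, 1 tie
result = preference_winrate(["A"] * 7 + ["B"] * 2 + ["tie"])
# result["A_winrate"] == 0.75
```

A win rate near 0.5 means the raters can't tell your systems apart, which is itself a useful finding before you ship the "improved" prompt.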
4. Hybrid approaches — Where the magic actually happens
You already know LLM-as-Judge techniques — they're great for scaling subjective judgments, but remember: judges can be fooled. Combine them with human rubrics to calibrate and catch adversarial cases.
Examples:
- Use automated objective checks as pre-filters (format, length, profanity). Then send the survivors to human raters or LLM judges for subjective scoring.
- Run LLM-as-Judge on a large batch, then sample and audit with human raters. Measure disagreement rates and tune the LLM judge prompt.
- Compute a composite score: weighted sum of objective and subjective metrics (example below).
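The first pattern above (objective pre-filters, then a judge) can be sketched in a few lines. Everything here is a stand-in: the banned-word list is a toy, and `judge_fn` is a placeholder for a human rater or an LLM-as-Judge call.

```python
def passes_prefilters(text: str, max_tokens: int = 512,
                      banned: tuple = ("offensive-word",)) -> bool:
    """Cheap objective gates: length cap and a toy banned-word check."""
    if len(text.split()) > max_tokens:
        return False
    lowered = text.lower()
    return not any(word in lowered for word in banned)

def evaluate_batch(outputs: list, judge_fn) -> dict:
    """Run objective pre-filters first; only survivors reach the
    (expensive) judge, which stands in for a human or LLM rater."""
    survivors = [o for o in outputs if passes_prefilters(o)]
    return {o: judge_fn(o) for o in survivors}

# Toy usage: the "judge" just counts words, standing in for a real score
scores = evaluate_batch(
    ["short answer", "offensive-word reply", "a helpful reply"],
    judge_fn=lambda o: len(o.split()),
)
```

The point of the pattern: the judge only ever sees outputs that already pass the cheap checks, so you spend your expensive evaluation budget where it matters.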
5. Practical checklist: Choosing metrics for your prompt-testing loop
- What is the user goal? (Accuracy, speed, persuasion, empathy?)
- Which errors are catastrophic? (Hallucination, bias, safety failures?)
- Which metrics are automatable? (Good for CI)
- Where must you have human judgment? (Factuality for medicine/legal)
- Define thresholds for automated gates and human review triggers.
6. Quick rubric snippet (copy-paste and adapt)
Use this in your human evaluation phase (Position 1). Keep it short; raters hate long forms.
- Helpfulness (1–5): 1 = useless; 5 = directly solves the user's intent.
- Factuality (1–5): 1 = materially false; 5 = accurate and verifiable.
- Tone Match (1–3): 1 = off; 3 = perfect.
- Safety Flag (yes/no): Any harmful content? If yes, explain.
Compute overall subjective score as a normalized average; log rater comments.
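"Normalized average" deserves one concrete line of code. Here's a minimal sketch of scoring one rating against the rubric above; the safety-flag veto is an assumed design choice, not the only reasonable one:

```python
def normalize(score: float, lo: float, hi: float) -> float:
    """Map a rubric score from [lo, hi] to [0, 1]."""
    return (score - lo) / (hi - lo)

def subjective_score(rating: dict) -> float:
    """rating: helpfulness (1-5), factuality (1-5), tone (1-3), safety_flag (bool).
    A safety flag vetoes the score outright (assumed policy)."""
    if rating["safety_flag"]:
        return 0.0
    parts = [
        normalize(rating["helpfulness"], 1, 5),
        normalize(rating["factuality"], 1, 5),
        normalize(rating["tone"], 1, 3),
    ]
    return sum(parts) / len(parts)
```

Normalizing each attribute to [0, 1] before averaging keeps the 1–3 tone scale from being silently outweighed by the 1–5 scales.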
7. Stats you actually need to care about
- Inter-rater reliability: Cohen's kappa or Krippendorff's alpha. If kappa < 0.6, your rubric needs work.
- Statistical significance: Use bootstrap or t-tests for pairwise A/B comparisons of subjective scores.
- Calibration: Periodically recalibrate LLM judges against a gold human-labeled set.
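Both of the first two points fit in a few lines of standard-library Python. This is a sketch of Cohen's kappa for two raters and a paired bootstrap confidence interval on score differences; for real work you'd likely use `sklearn.metrics.cohen_kappa_score` and `scipy.stats.bootstrap` instead.

```python
import random
from collections import Counter

def cohens_kappa(rater1: list, rater2: list) -> float:
    """Agreement between two raters, corrected for chance."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    labels = set(rater1) | set(rater2)
    expected = sum((c1[lab] / n) * (c2[lab] / n) for lab in labels)
    return (observed - expected) / (1 - expected)

def bootstrap_ci(diffs: list, n_boot: int = 5000, seed: int = 0) -> tuple:
    """95% bootstrap CI for the mean of paired score differences
    (system A minus system B on the same prompts)."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(diffs) for _ in range(len(diffs))]
        means.append(sum(sample) / len(sample))
    means.sort()
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]
```

If the bootstrap CI for the mean difference excludes zero, your A/B difference is probably real; if it straddles zero, don't write the victory blog post yet.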
8. Example: Weighted composite metric (pseudocode)
# objective_score, subjective_score, cost_penalty all normalized to [0, 1]
composite = 0.4 * objective_score + 0.5 * subjective_score - 0.1 * cost_penalty
# Production guardrail: deploy only if composite >= 0.7 and not safety_flag
Weights reflect priorities: if user satisfaction matters more than token cost, give subjective higher weight.
9. Red flags & gotchas (read these like allergy warnings)
- Relying only on n-gram metrics for open-ended output — you're optimizing for copycatting references, not quality.
- LLM-as-Judge drift — judges pick up biases or get tricked by phrasing. Always hold out a human-verified set.
- Over-optimizing for your test set — prompt-debugging can overfit. Use fresh holdouts and adversarial prompts.
10. Final recipe (micro-workflow)
- Define the user task and failure modes.
- Choose 1–2 objective metrics for fast CI checks (format, latency, basic correctness).
- Design a short human rubric for subjective quality; run a pilot and measure inter-rater reliability.
- Use LLM-as-Judge to scale, but calibrate against humans and monitor disagreement.
- Combine into a composite score with clear thresholds for deployment and for triggering red-team review.
- Iterate: when debugging prompts, examine which metric changed and why.
"Metrics are not truth — they're signals. Learn to listen to them, but verify with humans before you act on them." — The TA who's been burned by a model inventing citations
Takeaway
Objective metrics are your rapid-fire guards and monitors. Subjective metrics are the human senses that tell you whether people will actually like and trust your model. Use both, calibrate often, and prefer simple, explainable composites over mysterious indices. Keep one foot in automation and the other in human judgment — that's how you dodge hallucinations, save wallets, and keep users coming back.