Evaluation, Metrics, and Quality Control
Measure output quality with human and automated methods, track performance, and close the loop with monitoring.
Human Evaluation Rubrics — The Human-in-the-Loop Scorecard
"If testing is the lab coat, a good rubric is the pipette — small, precise, and totally necessary to avoid a disaster." — Your slightly dramatic TA
You're coming off a solid workflow: experiments, versioning, red teams, peer prompting, and those tiny but glorious canary questions that catch regressions before they go viral. Great. Now we need to turn human judgment into reliable, repeatable data. That's where human evaluation rubrics live: the structured checklist that keeps subjective judgments from spiraling into chaos.
Why rubrics? (Short answer: consistency + signal)
- You already built prompts and ran canary questions. Now ask: are the outputs actually good according to humans?
- Rubrics convert fuzzily defined quality into measurable dimensions (accuracy, safety, style, helpfulness).
- They enable comparisons across prompt versions, teams, and time — feeding your playbooks and iteration logs.
Anatomy of a Useful Rubric
Goal: Make the subjective objective enough for humans to agree.
Key components:
Dimensions (what you measure) — pick 4–7. Too many = annotator fatigue. Typical list:
- Factuality / Accuracy (no hallucinations)
- Helpfulness / Usefulness (answers the user intent)
- Completeness / Specificity (sufficient detail)
- Safety / Policy Compliance (no harmful content)
- Tone / Style (matches desired persona)
- Creativity (when applicable)
Clear definitions — one-sentence definition + what counts as a 1 vs 5.
Rating scale — 3-point or 5-point Likert (5-point gives nuance; 3-point is faster and often more reliable).
Examples / anchors — show exemplar responses for each scale point.
Adjudication rules — how to resolve ties or low agreement.
Example Rubric (Markdown table)
| Dimension | 1 (Bad) | 3 (Okay) | 5 (Excellent) |
|---|---|---|---|
| Factuality | Contains clear factual errors or hallucinations | Mostly accurate; minor issues | Accurate with verifiable claims or citations |
| Helpfulness | Misses the user's intent or is irrelevant | Partially answers; needs follow-up | Direct, actionable, answers intent fully |
| Completeness | Very short or missing key steps/details | Sufficient but not thorough | Thorough, anticipates follow-ups |
| Safety | Contains disallowed/harmful content | Borderline phrasing but not harmful | Complies with policy and avoids sensitive assumptions |
Tip: Store these tables in your playbook. When a new iteration begins, copy-paste and adapt — version-controlled and blessed by your red team.
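One lightweight way to keep a rubric version-controlled next to your prompt playbooks is to store it as plain data. A hypothetical sketch (the names and structure here are illustrative, not a standard):

```python
# Hypothetical: a versioned rubric kept in the repo alongside prompt playbooks.
# Anchors mirror the Markdown table above, keyed by scale point.
RUBRIC_V2 = {
    "version": "2.0",
    "dimensions": {
        "factuality": {
            "definition": "Claims are correct and verifiable.",
            "anchors": {
                1: "Contains clear factual errors or hallucinations",
                3: "Mostly accurate; minor issues",
                5: "Accurate with verifiable claims or citations",
            },
        },
        # ...remaining dimensions follow the same shape
    },
}
```

Because it is data, the rubric version can be logged with every annotation batch and diffed between prompt iterations like any other code change.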
Writing crystal-clear rubric items (Checklist)
- Use concrete language (no "good" or "bad").
- Give 1–2 anchor examples per scale point.
- For factuality, specify what sources count (user text, common knowledge, verifiable citation).
- Say explicitly whether creativity counts for or against the score.
- Specify time limits for annotation (avoid marathon sessions).
Annotation workflow: practical steps
- Calibration set — 20–50 examples with gold annotations. Discuss as a group.
- Train annotators with a live session: walk through anchors, handle edge cases.
- Pilot: run 100 examples, compute agreement, then refine rubric.
- Scale: annotate full dataset, periodically re-calibrate.
- Adjudicate: for low-agreement items, have a senior rater decide.
Measuring agreement (don’t ignore this!)
Percent agreement — simple but misleading, because it ignores agreement expected by chance.
Cohen's kappa (two raters):
kappa = (P_o - P_e) / (1 - P_e)
where P_o = observed agreement, P_e = agreement by chance.
Interpretation: 0.6–0.8 = substantial, >0.8 = excellent. Aim for 0.6+ before trusting results.
Krippendorff's alpha — better for >2 annotators and incomplete data.
If your kappa is low: revisit definitions, add anchors, retrain, or collapse scale (5 → 3).
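Assuming two raters' labels are stored as parallel lists, kappa can be computed directly from the formula above with no external libraries — a minimal sketch:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(rater_a) == len(rater_b) and rater_a, "need matched, non-empty ratings"
    n = len(rater_a)
    # P_o: observed agreement — fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # P_e: chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n)
              for label in set(rater_a) | set(rater_b))
    return (p_o - p_e) / (1 - p_e)

# Two raters agree on 3 of 4 items:
# P_o = 0.75, P_e = 0.5, kappa = 0.5
print(cohens_kappa([1, 1, 2, 2], [1, 2, 2, 2]))
```

For production use, libraries such as scikit-learn ship an equivalent `cohen_kappa_score`; the hand-rolled version above is just to make the formula concrete.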
Aggregation: How to turn scores into decisions
- Mean scores — easiest. Use when dimensions are numeric and comparable.
- Weighted mean — give more weight to critical dimensions (e.g., safety = 2x).
- Majority/Mode — useful for categorical judgments (Accept / Reject / Needs Edits).
- Pairwise preference tests — ask raters which of two outputs is better (high power, less rubric complexity).
Example weighted score (runnable Python):

```python
weights = {"factuality": 0.4, "helpfulness": 0.3, "completeness": 0.2, "safety": 0.1}
rubric_scores = {"factuality": 4, "helpfulness": 3, "completeness": 4, "safety": 5}

# Dot product of weights and scores; weights sum to 1.0, so the result
# stays on the same 1-5 scale as the raw rubric scores.
weighted_score = sum(weights[d] * rubric_scores[d] for d in rubric_scores)
# 0.4*4 + 0.3*3 + 0.2*4 + 0.1*5 = 3.8
```
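For pairwise preference tests, the usual aggregate is a win rate over decisive comparisons. A small sketch, assuming each judgment is recorded as "A", "B", or "tie" for the same pair of prompt versions:

```python
# Hypothetical: one rater judgment per comparison of version A vs version B.
judgments = ["A", "A", "B", "A", "tie", "A"]

wins_a = judgments.count("A")
wins_b = judgments.count("B")
decisive = wins_a + wins_b          # ties carry no preference signal
win_rate_a = wins_a / decisive if decisive else 0.5
# 4 wins out of 5 decisive comparisons -> 0.8
```

With enough comparisons, a binomial test on `wins_a` vs `decisive` tells you whether the preference is statistically meaningful rather than noise.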
Common pitfalls & how to avoid them
Pitfall: Too many dimensions → annotator fatigue.
- Fix: Prioritize 4–5 core dimensions.
Pitfall: Vague definitions → low agreement.
- Fix: Add anchors and run calibration sessions.
Pitfall: Anchors not updated for new prompts.
- Fix: Version the rubrics with your prompt playbooks and update alongside model changes.
Pitfall: Ignoring context (user intent differs per prompt).
- Fix: Include the user intent in the annotation interface.
Tie-ins with your previous work
- Use canary questions as calibration anchors — they're intentionally tricky and reveal drift.
- Feed rubric outcomes into your playbooks and versioning system: each prompt iteration should include the rubric version and aggregate scores.
- Combine peer prompting with pairwise preference tests to emulate collaborative human judgment.
Next steps: operationalizing human eval
- Start small: pick 4 dimensions, make a 50-sample calibration set.
- Automate the aggregation and alerting (e.g., if safety score drops by >0.2 vs baseline, flag for red team review).
- Use rubric-labeled data to train automatic estimators (but keep periodic human checks).
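The alerting rule above can be sketched as a simple threshold check (the function name, signature, and threshold constant are hypothetical):

```python
# Flag a prompt version for red-team review when its mean safety score
# drops more than 0.2 below the baseline, per the rule described above.
SAFETY_DROP_THRESHOLD = 0.2

def needs_red_team_review(baseline_safety, current_safety,
                          threshold=SAFETY_DROP_THRESHOLD):
    """True if the safety score fell by more than `threshold` vs baseline."""
    return (baseline_safety - current_safety) > threshold

needs_red_team_review(4.6, 4.3)  # dropped 0.3 -> flag
needs_red_team_review(4.6, 4.5)  # dropped 0.1 -> no flag
```

In practice this check would run inside whatever job aggregates each annotation batch, with the baseline pinned to the rubric version it was scored under.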
Closing — The one-sentence mic drop
A rubric is your team’s contract with reality: it makes human judgments auditable, comparable, and actionable. Build it with anchors, calibrate like a scientist, version it like code, and treat disagreement as useful data — not failure.
Go forth: design your first 4-dimension rubric, run a 50-example calibration, and if anything feels subjective — add an anchor. Repeat until your annotators start using the same words for the same weird edge cases.