Generative AI: Prompt Engineering Basics

Evaluation, Metrics, and Quality Control


Measure output quality with human and automated methods, track performance, and close the loop with monitoring.


Human Evaluation Rubrics — The Human-in-the-Loop Scorecard

"If testing is the lab coat, a good rubric is the pipette — small, precise, and totally necessary to avoid a disaster." — Your slightly dramatic TA

You're coming off a solid workflow: experiments, versioning, red teams, peer prompting, and those tiny but glorious canary questions that catch regressions before they go viral. Great. Now we need to turn human judgment into reliable, repeatable data. That's where human evaluation rubrics live: the structured checklist that keeps subjective judgments from spiraling into chaos.


Why rubrics? (Short answer: consistency + signal)

  • You already built prompts and ran canary questions. Now ask: are the outputs actually good according to humans?
  • Rubrics convert fuzzily defined quality into measurable dimensions (accuracy, safety, style, helpfulness).
  • They enable comparisons across prompt versions, teams, and time — feeding your playbooks and iteration logs.

Anatomy of a Useful Rubric

Goal: Make the subjective objective enough for humans to agree.

Key components:

  1. Dimensions (what you measure) — pick 4–7. Too many = annotator fatigue. Typical list:

    • Factuality / Accuracy (no hallucinations)
    • Helpfulness / Usefulness (answers the user intent)
    • Completeness / Specificity (sufficient detail)
    • Safety / Policy Compliance (no harmful content)
    • Tone / Style (matches desired persona)
    • Creativity (when applicable)
  2. Clear definitions — one-sentence definition + what counts as a 1 vs 5.

  3. Rating scale — 3-point or 5-point Likert (5-point gives nuance; 3-point is faster and often more reliable).

  4. Examples / anchors — show exemplar responses for each scale point.

  5. Adjudication rules — how to resolve ties or low agreement.


Example Rubric (Markdown table)

| Dimension | 1 (Bad) | 3 (Okay) | 5 (Excellent) |
| --- | --- | --- | --- |
| Factuality | Contains clear factual errors or hallucinations | Mostly accurate; minor issues | Accurate with verifiable claims or citations |
| Helpfulness | Misses the user's intent or is irrelevant | Partially answers; needs follow-up | Direct, actionable, answers intent fully |
| Completeness | Very short or missing key steps/details | Sufficient but not thorough | Thorough, anticipates follow-ups |
| Safety | Contains disallowed/harmful content | Unclear but not harmful | Complies with policy and avoids sensitive assumptions |

Tip: Store these tables in your playbook. When a new iteration begins, copy-paste and adapt — version-controlled and blessed by your red team.
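One way to make that copy-paste less error-prone is to keep the rubric as structured data next to your prompt versions. A minimal sketch (the dict layout and the `RUBRIC_V1` name are illustrative, not a prescribed schema), showing two of the dimensions from the table above:

```python
# Hypothetical versioned rubric definition, mirroring the table above.
# Storing it as data lets you diff rubric versions alongside prompt versions.
RUBRIC_V1 = {
    "version": "1.0",
    "scale": [1, 3, 5],
    "dimensions": {
        "factuality": {
            1: "Contains clear factual errors or hallucinations",
            3: "Mostly accurate; minor issues",
            5: "Accurate with verifiable claims or citations",
        },
        "safety": {
            1: "Contains disallowed/harmful content",
            3: "Unclear but not harmful",
            5: "Complies with policy and avoids sensitive assumptions",
        },
    },
}

# Every annotation batch records which rubric version produced it.
annotation_batch = {"rubric_version": RUBRIC_V1["version"], "scores": []}
```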


Writing crystal-clear rubric items (Checklist)

  • Use concrete language (no "good" or "bad").
  • Give 1–2 anchor examples per scale point.
  • For factuality, specify what sources count (user text, common knowledge, verifiable citation).
  • Say explicitly whether creativity is a positive or a detractor.
  • Specify time limits for annotation (avoid marathon sessions).

Annotation workflow: practical steps

  1. Calibration set — 20–50 examples with gold annotations. Discuss as a group.
  2. Train annotators with a live session: walk through anchors, handle edge cases.
  3. Pilot: run 100 examples, compute agreement, then refine rubric.
  4. Scale: annotate full dataset, periodically re-calibrate.
  5. Adjudicate: for low-agreement items, have a senior rater decide.

Measuring agreement (don’t ignore this!)

  • Percent agreement — simple, but misleading because it ignores agreement that happens by chance.

  • Cohen's kappa (two raters):

    kappa = (P_o - P_e) / (1 - P_e)

    where P_o = observed agreement, P_e = agreement by chance.

    Interpretation: 0.6–0.8 = substantial, >0.8 = excellent. Aim for 0.6+ before trusting results.

  • Krippendorff's alpha — better for >2 annotators and incomplete data.

If your kappa is low: revisit definitions, add anchors, retrain, or collapse scale (5 → 3).
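The kappa formula above translates directly into a few lines of code. A minimal sketch for two raters, using only the standard library (the `cohens_kappa` helper name is our own):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (P_o - P_e) / (1 - P_e), where P_o is observed agreement
    and P_e is the agreement expected by chance from each rater's
    label distribution.
    """
    assert len(rater_a) == len(rater_b), "raters must label the same items"
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((count_a[l] / n) * (count_b[l] / n) for l in labels)
    if p_e == 1.0:          # degenerate case: both raters used one label only
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Two raters scoring five outputs on a 3-point scale:
k = cohens_kappa([1, 1, 2, 2, 3], [1, 1, 2, 3, 3])  # ≈ 0.71 — substantial
```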


Aggregation: How to turn scores into decisions

  • Mean scores — easiest. Use when dimensions are numeric and comparable.
  • Weighted mean — give more weight to critical dimensions (e.g., safety = 2x).
  • Majority/Mode — useful for categorical judgments (Accept / Reject / Needs Edits).
  • Pairwise preference tests — ask raters which of two outputs is better (high power, less rubric complexity).

Example weighted score in Python (the original pseudocode, made runnable):

weights = {"factuality": 0.4, "helpfulness": 0.3, "completeness": 0.2, "safety": 0.1}
rubric_scores = {"factuality": 4, "helpfulness": 3, "completeness": 4, "safety": 5}
weighted_score = sum(weights[d] * rubric_scores[d] for d in rubric_scores)  # 3.8
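For pairwise preference tests, the usual summary statistic is a win rate over decisive judgments. A small sketch (the `win_rate` helper and the "A"/"B"/"tie" encoding are our own conventions, not a standard API):

```python
def win_rate(preferences):
    """Fraction of decisive pairwise judgments won by candidate A.

    preferences: list of "A", "B", or "tie", one entry per comparison.
    Ties are excluded; with no decisive judgments we report 0.5 (no signal).
    """
    wins_a = preferences.count("A")
    decisive = wins_a + preferences.count("B")
    return wins_a / decisive if decisive else 0.5

# Raters compared prompt v2 ("A") against v1 ("B") on four outputs:
rate = win_rate(["A", "A", "B", "tie"])  # 2 of 3 decisive votes for A
```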

Common pitfalls & how to avoid them

  • Pitfall: Too many dimensions → annotator fatigue.

    • Fix: Prioritize 4–5 core dimensions.
  • Pitfall: Vague definitions → low agreement.

    • Fix: Add anchors and run calibration sessions.
  • Pitfall: Anchors not updated for new prompts.

    • Fix: Version the rubrics with your prompt playbooks and update alongside model changes.
  • Pitfall: Ignoring context (user intent differs per prompt).

    • Fix: Include the user intent in the annotation interface.

Tie-ins with your previous work

  • Use canary questions as calibration anchors — they're intentionally tricky and reveal drift.
  • Feed rubric outcomes into your playbooks and versioning system: each prompt iteration should include the rubric version and aggregate scores.
  • Combine peer prompting with pairwise preference tests to emulate collaborative human judgment.

Next steps: operationalizing human eval

  • Start small: pick 4 dimensions, make a 50-sample calibration set.
  • Automate the aggregation and alerting (e.g., if safety score drops by >0.2 vs baseline, flag for red team review).
  • Use rubric-labeled data to train automatic estimators (but keep periodic human checks).
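The alerting step can start as a few lines of code long before it becomes a dashboard. A sketch of the drop-vs-baseline check described above (the function name, score layout, and default threshold are assumptions for illustration):

```python
def drifted_dimensions(current, baseline, threshold=0.2):
    """Return the dimensions whose mean score fell more than
    `threshold` below the baseline — candidates for review."""
    return [d for d in baseline if baseline[d] - current.get(d, 0.0) > threshold]

# Mean rubric scores from the last annotation batch vs the accepted baseline:
baseline = {"factuality": 4.1, "helpfulness": 3.8, "safety": 4.8}
current = {"factuality": 4.0, "helpfulness": 3.9, "safety": 4.5}

flagged = drifted_dimensions(current, baseline)  # safety dropped 0.3 > 0.2
```

In practice you would run this after each batch and route any flagged dimensions to the red team rather than failing silently.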

Closing — The one-sentence mic drop

A rubric is your team’s contract with reality: it makes human judgments auditable, comparable, and actionable. Build it with anchors, calibrate like a scientist, version it like code, and treat disagreement as useful data — not failure.

Go forth: design your first 4-dimension rubric, run a 50-example calibration, and if anything feels subjective — add an anchor. Repeat until your annotators start using the same words for the same weird edge cases.
