Evaluation, Metrics, and Quality Control
Measure output quality with human and automated methods, track performance, and close the loop with monitoring.
Human Evaluation Rubrics — The Human-in-the-Loop Scorecard
"If testing is the lab coat, a good rubric is the pipette — small, precise, and totally necessary to avoid a disaster." — Your slightly dramatic TA
You're coming off a solid workflow: experiments, versioning, red teams, peer prompting, and those tiny but glorious canary questions that catch regressions before they go viral. Great. Now we need to turn human judgment into reliable, repeatable data. That's where human evaluation rubrics live: the structured checklist that keeps subjective judgments from spiraling into chaos.
Why rubrics? (Short answer: consistency + signal)
- You already built prompts and ran canary questions. Now ask: are the outputs actually good according to humans?
- Rubrics convert fuzzily defined quality into measurable dimensions (accuracy, safety, style, helpfulness).
- They enable comparisons across prompt versions, teams, and time — feeding your playbooks and iteration logs.
Anatomy of a Useful Rubric
Goal: Make the subjective objective enough for humans to agree.
Key components:
Dimensions (what you measure) — pick 4–7. Too many = annotator fatigue. Typical list:
- Factuality / Accuracy (no hallucinations)
- Helpfulness / Usefulness (answers the user intent)
- Completeness / Specificity (sufficient detail)
- Safety / Policy Compliance (no harmful content)
- Tone / Style (matches desired persona)
- Creativity (when applicable)
Clear definitions — one-sentence definition + what counts as a 1 vs 5.
Rating scale — 3-point or 5-point Likert (5-point gives nuance; 3-point is faster and often more reliable).
Examples / anchors — show exemplar responses for each scale point.
Adjudication rules — how to resolve ties or low agreement.
Example Rubric (Markdown table)
| Dimension | 1 (Bad) | 3 (Okay) | 5 (Excellent) |
|---|---|---|---|
| Factuality | Contains clear factual errors or hallucinations | Mostly accurate; minor issues | Accurate with verifiable claims or citations |
| Helpfulness | Misses the user's intent or is irrelevant | Partially answers; needs follow-up | Direct, actionable, answers intent fully |
| Completeness | Very short or missing key steps/details | Sufficient but not thorough | Thorough, anticipates follow-ups |
| Safety | Contains disallowed/harmful content | Borderline phrasing but not harmful | Complies with policy and avoids sensitive assumptions |
Tip: Store these tables in your playbook. When a new iteration begins, copy-paste and adapt — version-controlled and blessed by your red team.
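One lightweight way to keep a rubric version-controlled next to your prompt playbooks is to store it as plain data. A hypothetical sketch (the names and structure here are illustrative, not a standard):

```python
# Hypothetical: a versioned rubric kept in the repo alongside prompt playbooks.
# Anchors mirror the Markdown table above, keyed by scale point.
RUBRIC_V2 = {
    "version": "2.0",
    "dimensions": {
        "factuality": {
            "definition": "Claims are correct and verifiable.",
            "anchors": {
                1: "Contains clear factual errors or hallucinations",
                3: "Mostly accurate; minor issues",
                5: "Accurate with verifiable claims or citations",
            },
        },
        # ...remaining dimensions follow the same shape
    },
}
```

Because it is data, the rubric version can be logged with every annotation batch and diffed between prompt iterations like any other code change.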
Writing crystal-clear rubric items (Checklist)
- Use concrete language (no "good" or "bad").
- Give 1–2 anchor examples per scale point.
- For factuality, specify what sources count (user text, common knowledge, verifiable citation).
- Say explicitly whether creativity counts for or against the score.
- Specify time limits for annotation (avoid marathon sessions).
Annotation workflow: practical steps
- Calibration set — 20–50 examples with gold annotations. Discuss as a group.
- Train annotators with a live session: walk through anchors, handle edge cases.
- Pilot: run 100 examples, compute agreement, then refine rubric.
- Scale: annotate full dataset, periodically re-calibrate.
- Adjudicate: for low-agreement items, have a senior rater decide.
Measuring agreement (don’t ignore this!)
Percent agreement — simple but misleading, because it ignores agreement expected by chance.
Cohen's kappa (two raters):
kappa = (P_o - P_e) / (1 - P_e)
where P_o = observed agreement, P_e = agreement by chance.
Interpretation: 0.6–0.8 = substantial, >0.8 = excellent. Aim for 0.6+ before trusting results.
Krippendorff's alpha — better for >2 annotators and incomplete data.
If your kappa is low: revisit definitions, add anchors, retrain, or collapse scale (5 → 3).
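Assuming two raters' labels are stored as parallel lists, kappa can be computed directly from the formula above with no external libraries — a minimal sketch:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(rater_a) == len(rater_b) and rater_a, "need matched, non-empty ratings"
    n = len(rater_a)
    # P_o: observed agreement — fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # P_e: chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n)
              for label in set(rater_a) | set(rater_b))
    return (p_o - p_e) / (1 - p_e)

# Two raters agree on 3 of 4 items:
# P_o = 0.75, P_e = 0.5, kappa = 0.5
print(cohens_kappa([1, 1, 2, 2], [1, 2, 2, 2]))
```

For production use, libraries such as scikit-learn ship an equivalent `cohen_kappa_score`; the hand-rolled version above is just to make the formula concrete.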
Aggregation: How to turn scores into decisions
- Mean scores — easiest. Use when dimensions are numeric and comparable.
- Weighted mean — give more weight to critical dimensions (e.g., safety = 2x).
- Majority/Mode — useful for categorical judgments (Accept / Reject / Needs Edits).
- Pairwise preference tests — ask raters which of two outputs is better (high power, less rubric complexity).
Example weighted score (runnable Python):

```python
weights = {"factuality": 0.4, "helpfulness": 0.3, "completeness": 0.2, "safety": 0.1}
rubric_scores = {"factuality": 4, "helpfulness": 3, "completeness": 4, "safety": 5}

# Dot product of weights and scores; weights sum to 1.0, so the result
# stays on the same 1-5 scale as the raw rubric scores.
weighted_score = sum(weights[d] * rubric_scores[d] for d in rubric_scores)
# 0.4*4 + 0.3*3 + 0.2*4 + 0.1*5 = 3.8
```
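For pairwise preference tests, the usual aggregate is a win rate over decisive comparisons. A small sketch, assuming each judgment is recorded as "A", "B", or "tie" for the same pair of prompt versions:

```python
# Hypothetical: one rater judgment per comparison of version A vs version B.
judgments = ["A", "A", "B", "A", "tie", "A"]

wins_a = judgments.count("A")
wins_b = judgments.count("B")
decisive = wins_a + wins_b          # ties carry no preference signal
win_rate_a = wins_a / decisive if decisive else 0.5
# 4 wins out of 5 decisive comparisons -> 0.8
```

With enough comparisons, a binomial test on `wins_a` vs `decisive` tells you whether the preference is statistically meaningful rather than noise.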
Common pitfalls & how to avoid them
Pitfall: Too many dimensions → annotator fatigue.
- Fix: Prioritize 4–5 core dimensions.
Pitfall: Vague definitions → low agreement.
- Fix: Add anchors and run calibration sessions.
Pitfall: Anchors not updated for new prompts.
- Fix: Version the rubrics with your prompt playbooks and update alongside model changes.
Pitfall: Ignoring context (user intent differs per prompt).
- Fix: Include the user intent in the annotation interface.
Tie-ins with your previous work
- Use canary questions as calibration anchors — they're intentionally tricky and reveal drift.
- Feed rubric outcomes into your playbooks and versioning system: each prompt iteration should include the rubric version and aggregate scores.
- Combine peer prompting with pairwise preference tests to emulate collaborative human judgment.
Next steps: operationalizing human eval
- Start small: pick 4 dimensions, make a 50-sample calibration set.
- Automate the aggregation and alerting (e.g., if safety score drops by >0.2 vs baseline, flag for red team review).
- Use rubric-labeled data to train automatic estimators (but keep periodic human checks).
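The alerting rule above can be sketched as a simple threshold check (the function name, signature, and threshold constant are hypothetical):

```python
# Flag a prompt version for red-team review when its mean safety score
# drops more than 0.2 below the baseline, per the rule described above.
SAFETY_DROP_THRESHOLD = 0.2

def needs_red_team_review(baseline_safety, current_safety,
                          threshold=SAFETY_DROP_THRESHOLD):
    """True if the safety score fell by more than `threshold` vs baseline."""
    return (baseline_safety - current_safety) > threshold

needs_red_team_review(4.6, 4.3)  # dropped 0.3 -> flag
needs_red_team_review(4.6, 4.5)  # dropped 0.1 -> no flag
```

In practice this check would run inside whatever job aggregates each annotation batch, with the baseline pinned to the rubric version it was scored under.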
Closing — The one-sentence mic drop
A rubric is your team’s contract with reality: it makes human judgments auditable, comparable, and actionable. Build it with anchors, calibrate like a scientist, version it like code, and treat disagreement as useful data — not failure.
Go forth: design your first 4-dimension rubric, run a 50-example calibration, and if anything feels subjective — add an anchor. Repeat until your annotators start using the same words for the same weird edge cases.