Evaluation, Metrics, and Quality Control
Measure output quality with human and automated methods, track performance, and close the loop with monitoring.
LLM-as-Judge Techniques: Letting the Model Roast (and Grade) Itself — Responsibly
"If you want speed, ask a model. If you want truth, ask a human. If you want both, train the model to judge... then watch carefully." — Your slightly paranoid but sensible TA
Opening: quick hook and context
You already built a rigorous prompt testing workflow, kept playbooks of what worked, and practiced peer prompting and human rubrics. Now we're leveling up: use LLMs themselves as judges. It sounds like asking a comedian to review their own set, but with the right structure, LLM-as-judge can give fast, consistent, and scalable evaluations — perfect for iteration cycles and catching regressions between versions.
This section explains what LLM-judging really is, when to trust it, how to design judge prompts and rubrics, and how to calibrate and guardrail the process so it doesn't silently drift into nonsense.
What is LLM-as-Judge, and why bother?
- Definition: Using one or more LLMs to evaluate outputs from another model (or from itself) against defined criteria, returning scores, rationales, or pairwise preferences.
- Why use it? Speed, cost-efficiency, near-instant feedback loops during prompt iteration, and the ability to run large-scale A/B tests without constant human labor.
But this is not a silver bullet. The model can be biased, overconfident, or hallucinate—so treat LLM-judges as high-quality assistants that still need human oversight.
Types of LLM-judge techniques (pick your weapon)
- Pairwise Comparison: Given A and B, which is better? Simple, robust for preference tasks.
- Scalar Rating: Give a score (0-5, 0-100) on specific axes (accuracy, helpfulness, tone).
- Critique + Revision: Judge critiques the response and proposes an improved version; useful for measuring improvement potential.
- Chain-of-Thought Explanation: Ask the judge to explain its reasoning and highlight evidence; increases transparency but costs more.
- Ensemble Voting: Multiple judge LLMs vote; majority wins or average score computed.
- Adjudication: If judges disagree, a stronger model or human adjudicator decides.
Each has tradeoffs: pairwise is robust for relative quality, scalar rating gives fine-grained numbers, critiques reveal error types.
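As a concrete sketch of the first and fifth techniques, here is pairwise judging with ensemble voting. `query_judge` is a hypothetical stand-in for whatever LLM client you use (it should return `"A"` or `"B"`); presenting the pair in both orders is a common mitigation for position bias:

```python
# Pairwise comparison with ensemble voting (sketch; `query_judge` is a
# placeholder for a real LLM call). Alternating the presentation order
# and flipping the label back reduces position bias.
from collections import Counter

def pairwise_verdict(response_a, response_b, query_judge, n_judges=5):
    votes = []
    for i in range(n_judges):
        if i % 2 == 0:
            vote = query_judge(first=response_a, second=response_b)
        else:
            # Present in reverse order, then flip the label back.
            raw = query_judge(first=response_b, second=response_a)
            vote = "A" if raw == "B" else "B"
        votes.append(vote)
    winner, count = Counter(votes).most_common(1)[0]
    return winner, count / n_judges  # majority label and its vote share

# Demo with a deterministic fake judge that prefers the longer text:
fake_judge = lambda first, second: "A" if len(first) >= len(second) else "B"
print(pairwise_verdict("long detailed answer", "short", fake_judge))  # → ('A', 1.0)
```

Majority vote on categorical labels sidesteps the scale-drift problems that plague averaged scalar scores.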
Designing judge prompts: templates and tips
Good judge prompts are explicit, short, and include the rubric. Always ask for both a numeric score and a concise justification.
Example judge prompt (use as a template):
```
You are an impartial evaluator. Evaluate the candidate response against the reference/task.
Criteria:
- Accuracy (0-5): factual correctness and faithfulness to the task.
- Relevance (0-5): directly answers the prompt.
- Clarity (0-5): readable and well-structured.
Provide: a JSON object with fields {score_total, breakdown: {accuracy, relevance, clarity}, short_justification (1-2 sentences)}.
Do not hallucinate facts in the justification.
```
Tips:
- Explicitness beats inspiration. List concrete criteria and scoring scales.
- Force structured output. JSON or strict templates simplify automatic parsing and aggregation.
- Limit chain-of-thought in production judges to avoid leakage; use it during analysis runs.
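Strict JSON only pays off if you validate what comes back. A minimal parsing sketch, using the field names from the template above; the fallback policy (return `None`, then retry or route to human review) is an assumption, not a prescription:

```python
# Validate a judge's strict-JSON output against the template's contract.
import json

REQUIRED = {"score_total", "breakdown", "short_justification"}
AXES = {"accuracy", "relevance", "clarity"}

def parse_judge_output(raw: str):
    """Return a validated score dict, or None if the judge broke the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED <= data.keys():
        return None
    breakdown = data["breakdown"]
    if not isinstance(breakdown, dict) or set(breakdown) != AXES:
        return None
    # Scores must land on the declared 0-5 scale.
    if not all(isinstance(breakdown[a], (int, float)) and 0 <= breakdown[a] <= 5
               for a in AXES):
        return None
    return data

good = ('{"score_total": 12, "breakdown": {"accuracy": 4, "relevance": 4, '
        '"clarity": 4}, "short_justification": "Accurate and clear."}')
print(parse_judge_output(good) is not None)  # → True
print(parse_judge_output("not json"))        # → None
```

Rejecting malformed judgments outright (rather than guessing at partial scores) keeps your aggregates honest.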
Calibration: Make the judge agree with humans (mostly)
LLM-judges will drift; calibration aligns them to human ground truth.
Steps:
- Create a representative human-evaluated seed set (100–500 examples depending on variability).
- Run the LLM-judge on that seed and compute correlations/agreements (Pearson/Spearman for scores, Cohen's kappa for categorical labels, Kendall's tau for rankings).
- If correlation is low, refine the prompt or instruct the judge to mimic human examples (few-shot).
- Optionally apply a mapping function to convert judge scores to human-equivalent scale (linear rescale, isotonic regression).
Rule of thumb: a median Spearman rho > 0.6 is a decent starting point; aim higher for high-stakes tasks.
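The correlation and rescaling steps need no special library. A dependency-free sketch (Spearman without tie correction, plus the linear-rescale mapping from the last step; the scores are illustrative):

```python
# Minimal calibration check: Spearman correlation between judge and human
# scores, plus a least-squares linear rescale onto the human scale.

def spearman(x, y):
    """Pearson correlation computed on ranks (no tie correction)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def linear_rescale(judge, human):
    """Fit human ~ a * judge + b by least squares; return the mapping."""
    n = len(judge)
    mj, mh = sum(judge) / n, sum(human) / n
    a = (sum((j - mj) * (h - mh) for j, h in zip(judge, human))
         / sum((j - mj) ** 2 for j in judge))
    b = mh - a * mj
    return lambda s: a * s + b

judge_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
human_scores = [1.5, 2.5, 3.0, 4.5, 5.0]
print(round(spearman(judge_scores, human_scores), 3))             # → 1.0
print(round(linear_rescale(judge_scores, human_scores)(3.0), 2))  # → 3.3
```

For real seed sets with tied scores, a library implementation (e.g. `scipy.stats.spearmanr`) handles ties correctly; isotonic regression replaces the linear fit when the relationship is monotone but not linear.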
Reliability, statistics, and how to read the scores
- Use inter-judge agreement measures: Cohen's kappa (two judges), Fleiss' kappa (many judges), or Krippendorff's alpha.
- Report confidence intervals for mean scores and test significance when comparing models (bootstrap the evaluation set).
- Sample size matters: for stable estimates, target at least a few hundred examples for moderate-variance tasks. Low-variance tasks can get away with less.
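Bootstrapping the evaluation set, as suggested above, is a few lines of stdlib Python. The scores below are illustrative; 10,000 resamples and the paired-difference setup are common defaults, not a prescription:

```python
# Bootstrap a confidence interval for a statistic over judge scores,
# here applied to the paired per-example difference between two models.
import random

def bootstrap_ci(values, stat=lambda v: sum(v) / len(v),
                 n_resamples=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    stats = sorted(
        stat([rng.choice(values) for _ in values]) for _ in range(n_resamples)
    )
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

scores_a = [3, 4, 4, 5, 3, 4, 5, 4, 3, 4]
scores_b = [2, 3, 4, 3, 3, 4, 3, 2, 3, 3]
diffs = [a - b for a, b in zip(scores_a, scores_b)]  # paired comparison
lo, hi = bootstrap_ci(diffs)
print(f"95% CI for mean(A - B): [{lo:.2f}, {hi:.2f}]")
```

If the interval excludes 0, the difference between models is significant at roughly the 5% level; the paired design removes the per-example variance that would swamp an unpaired comparison.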
Quick checklist:
- Are judge scores consistent across seeds? If not, add more examples.
- Do judge rationales match human error categories? If not, revise the prompt.
Pipeline pseudocode: automated judge + human checks
```python
# Sketch of the loop in Python. All helpers (generate_responses,
# call_llm_judge, human_review, ...) are placeholders for your own pipeline.
for model_version in versions:
    responses = generate_responses(model_version, evaluation_prompts, n=N)
    scores = [call_llm_judge(r, prompt_template) for r in responses]
    mean, ci = compute_mean_and_confidence_interval(scores)
    if significant_drop_or_edge_case_detected(mean, ci):
        flagged = sample_cases(responses, scores, k=K)  # random + triggered
        verdicts = human_review(flagged)
        if human_confirms_critical_failures(verdicts):
            mark_fail_and_roll_back(model_version)  # or patch the prompt
```
Vary judge seeds and run multiple judge instances where possible to reduce stochasticity.
Defenses: don't let the judge gaslight you
- Hallucination: Ask judges to cite lines or quote the model output when making factual claims.
- Calibration drift: Periodically re-run the seed human set and update the prompt.
- Overfitting to judge style: Rotate judge models or use ensemble judges to prevent gaming prompts that exploit a single judge's quirks.
- Adversarial responses: Red-team the system with intentionally adversarial inputs and check whether the judge fails to flag them.
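The first defense (cite or quote the output) becomes mechanically enforceable: reject any judgment whose quoted evidence does not appear verbatim in the text being judged. A small sketch; the `evidence` field name is an assumption about your judge's output schema:

```python
# Verify that every quote a judge cites actually appears in the model
# output it is judging. Whitespace and case are normalized before matching.
import json

def unsupported_evidence(judge_json: str, model_output: str):
    """Return the list of cited quotes that do NOT appear verbatim."""
    evidence = json.loads(judge_json).get("evidence", [])
    normalized = " ".join(model_output.split()).lower()
    return [q for q in evidence
            if " ".join(q.split()).lower() not in normalized]

output = "Paris is the capital of France."
judgment = '{"evidence": ["capital of France", "founded in 1802"]}'
print(unsupported_evidence(judgment, output))  # → ['founded in 1802']
```

Any non-empty result means the judge fabricated evidence, which is exactly the kind of failure that should trigger human review.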
Best practices / Playbook (short and furious)
- Always pair LLM-judge runs with human spot checks (random + triggered cases).
- Use structured outputs and strict JSON for reliability.
- Re-calibrate on a regular schedule, and immediately whenever you change base models or datasets.
- Keep a versioned record: judge prompts, seeds, agreement scores — put them in your playbook.
- Use multiple techniques: pairwise for preference, scalar for trend tracking, critique-for-revision for improvements.
Closing: TL;DR and next steps
- LLM-as-judge speeds up iteration and scales evaluation, but it requires explicit rubrics, calibration, and human oversight.
- Start small: calibrate with a human seed set, enforce structured outputs, and add human adjudication for edge cases.
Final challenge: take one prompt from your playbook, instrument it with a judge prompt like the template above, run 200 evaluations, and compare judge vs human agreement. If you don't find at least one systematic mismatch, congratulations — you probably missed something.