Evaluation, Metrics, and Quality Control
Measure output quality with human and automated methods, track performance, and close the loop with monitoring.
LLM-as-Judge Techniques: Letting the Model Roast (and Grade) Itself — Responsibly
"If you want speed, ask a model. If you want truth, ask a human. If you want both, train the model to judge... then watch carefully." — Your slightly paranoid but sensible TA
Opening: quick hook and context
You already built a rigorous prompt testing workflow, kept playbooks of what worked, and practiced peer prompting and human rubrics. Now we're leveling up: use LLMs themselves as judges. It sounds like asking a comedian to review their own set, but with the right structure, LLM-as-judge can give fast, consistent, and scalable evaluations — perfect for iteration cycles and catching regressions between versions.
This section explains what LLM-judging really is, when to trust it, how to design judge prompts and rubrics, and how to calibrate and guardrail the process so it doesn't silently drift into nonsense.
What is LLM-as-Judge, and why bother?
- Definition: Using one or more LLMs to evaluate outputs from another model (or from itself) against defined criteria, returning scores, rationales, or pairwise preferences.
- Why use it? Speed, cost-efficiency, near-instant feedback loops during prompt iteration, and the ability to run large-scale A/B tests without constant human labor.
But this is not a silver bullet. The model can be biased, overconfident, or hallucinate—so treat LLM-judges as high-quality assistants that still need human oversight.
Types of LLM-judge techniques (pick your weapon)
- Pairwise Comparison: Given A and B, which is better? Simple, robust for preference tasks.
- Scalar Rating: Give a score (0-5, 0-100) on specific axes (accuracy, helpfulness, tone).
- Critique + Revision: Judge critiques the response and proposes an improved version; useful for measuring improvement potential.
- Chain-of-Thought Explanation: Ask the judge to explain its reasoning and highlight evidence; increases transparency but costs more.
- Ensemble Voting: Multiple judge LLMs vote; majority wins or average score computed.
- Adjudication: If judges disagree, a stronger model or human adjudicator decides.
Each has tradeoffs: pairwise is robust for relative quality, scalar rating gives fine-grained numbers, critiques reveal error types.
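As a concrete sketch of the first and fifth techniques, here is pairwise judging with ensemble voting. `query_judge` is a hypothetical stand-in for whatever LLM client you use (it should return `"A"` or `"B"`); presenting the pair in both orders is a common mitigation for position bias:

```python
# Pairwise comparison with ensemble voting (sketch; `query_judge` is a
# placeholder for a real LLM call). Alternating the presentation order
# and flipping the label back reduces position bias.
from collections import Counter

def pairwise_verdict(response_a, response_b, query_judge, n_judges=5):
    votes = []
    for i in range(n_judges):
        if i % 2 == 0:
            vote = query_judge(first=response_a, second=response_b)
        else:
            # Present in reverse order, then flip the label back.
            raw = query_judge(first=response_b, second=response_a)
            vote = "A" if raw == "B" else "B"
        votes.append(vote)
    winner, count = Counter(votes).most_common(1)[0]
    return winner, count / n_judges  # majority label and its vote share

# Demo with a deterministic fake judge that prefers the longer text:
fake_judge = lambda first, second: "A" if len(first) >= len(second) else "B"
print(pairwise_verdict("long detailed answer", "short", fake_judge))  # → ('A', 1.0)
```

Majority vote on categorical labels sidesteps the scale-drift problems that plague averaged scalar scores.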
Designing judge prompts: templates and tips
Good judge prompts are explicit, short, and include the rubric. Always ask for both a numeric score and a concise justification.
Example judge prompt (use as a template):
```
You are an impartial evaluator. Evaluate the candidate response against the reference/task.
Criteria:
- Accuracy (0-5): factual correctness and faithfulness to the task.
- Relevance (0-5): directly answers the prompt.
- Clarity (0-5): readable and well-structured.
Provide: a JSON object with fields {score_total, breakdown: {accuracy, relevance, clarity}, short_justification (1-2 sentences)}.
Do not hallucinate facts in the justification.
```
Tips:
- Explicitness beats inspiration. List concrete criteria and scoring scales.
- Force structured output. JSON or strict templates simplify automatic parsing and aggregation.
- Limit chain-of-thought in production judges to avoid leakage; use it during analysis runs.
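Strict JSON only pays off if you validate what comes back. A minimal parsing sketch, using the field names from the template above; the fallback policy (return `None`, then retry or route to human review) is an assumption, not a prescription:

```python
# Validate a judge's strict-JSON output against the template's contract.
import json

REQUIRED = {"score_total", "breakdown", "short_justification"}
AXES = {"accuracy", "relevance", "clarity"}

def parse_judge_output(raw: str):
    """Return a validated score dict, or None if the judge broke the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED <= data.keys():
        return None
    breakdown = data["breakdown"]
    if not isinstance(breakdown, dict) or set(breakdown) != AXES:
        return None
    # Scores must land on the declared 0-5 scale.
    if not all(isinstance(breakdown[a], (int, float)) and 0 <= breakdown[a] <= 5
               for a in AXES):
        return None
    return data

good = ('{"score_total": 12, "breakdown": {"accuracy": 4, "relevance": 4, '
        '"clarity": 4}, "short_justification": "Accurate and clear."}')
print(parse_judge_output(good) is not None)  # → True
print(parse_judge_output("not json"))        # → None
```

Rejecting malformed judgments outright (rather than guessing at partial scores) keeps your aggregates honest.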
Calibration: Make the judge agree with humans (mostly)
LLM-judges will drift; calibration aligns them to human ground truth.
Steps:
- Create a representative human-evaluated seed set (100–500 examples depending on variability).
- Run the LLM-judge on that seed and compute correlations/agreements (Pearson/Spearman for scores, Cohen's kappa for categorical labels, Kendall's tau for rankings).
- If correlation is low, refine the prompt or instruct the judge to mimic human examples (few-shot).
- Optionally apply a mapping function to convert judge scores to human-equivalent scale (linear rescale, isotonic regression).
Rule of thumb: a median Spearman rho > 0.6 is a decent starting point; aim higher for high-stakes tasks.
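The correlation and rescaling steps need no special library. A dependency-free sketch (Spearman without tie correction, plus the linear-rescale mapping from the last step; the scores are illustrative):

```python
# Minimal calibration check: Spearman correlation between judge and human
# scores, plus a least-squares linear rescale onto the human scale.

def spearman(x, y):
    """Pearson correlation computed on ranks (no tie correction)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def linear_rescale(judge, human):
    """Fit human ~ a * judge + b by least squares; return the mapping."""
    n = len(judge)
    mj, mh = sum(judge) / n, sum(human) / n
    a = (sum((j - mj) * (h - mh) for j, h in zip(judge, human))
         / sum((j - mj) ** 2 for j in judge))
    b = mh - a * mj
    return lambda s: a * s + b

judge_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
human_scores = [1.5, 2.5, 3.0, 4.5, 5.0]
print(round(spearman(judge_scores, human_scores), 3))             # → 1.0
print(round(linear_rescale(judge_scores, human_scores)(3.0), 2))  # → 3.3
```

For real seed sets with tied scores, a library implementation (e.g. `scipy.stats.spearmanr`) handles ties correctly; isotonic regression replaces the linear fit when the relationship is monotone but not linear.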
Reliability, statistics, and how to read the scores
- Use inter-judge agreement measures: Cohen's kappa (two judges), Fleiss' kappa (many judges), or Krippendorff's alpha.
- Report confidence intervals for mean scores and test significance when comparing models (bootstrap the evaluation set).
- Sample size matters: for stable estimates, target at least a few hundred examples for moderate-variance tasks. Low-variance tasks can get away with less.
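Bootstrapping the evaluation set, as suggested above, is a few lines of stdlib Python. The scores below are illustrative; 10,000 resamples and the paired-difference setup are common defaults, not a prescription:

```python
# Bootstrap a confidence interval for a statistic over judge scores,
# here applied to the paired per-example difference between two models.
import random

def bootstrap_ci(values, stat=lambda v: sum(v) / len(v),
                 n_resamples=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    stats = sorted(
        stat([rng.choice(values) for _ in values]) for _ in range(n_resamples)
    )
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

scores_a = [3, 4, 4, 5, 3, 4, 5, 4, 3, 4]
scores_b = [2, 3, 4, 3, 3, 4, 3, 2, 3, 3]
diffs = [a - b for a, b in zip(scores_a, scores_b)]  # paired comparison
lo, hi = bootstrap_ci(diffs)
print(f"95% CI for mean(A - B): [{lo:.2f}, {hi:.2f}]")
```

If the interval excludes 0, the difference between models is significant at roughly the 5% level; the paired design removes the per-example variance that would swamp an unpaired comparison.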
Quick checklist:
- Are judge scores consistent across seeds? If not, add more examples.
- Do judge rationales match human error categories? If not, revise the prompt.
Pipeline pseudocode: automated judge + human checks
```python
# Sketch of the loop in Python. All helpers (generate_responses,
# call_llm_judge, human_review, ...) are placeholders for your own pipeline.
for model_version in versions:
    responses = generate_responses(model_version, evaluation_prompts, n=N)
    scores = [call_llm_judge(r, prompt_template) for r in responses]
    mean, ci = compute_mean_and_confidence_interval(scores)
    if significant_drop_or_edge_case_detected(mean, ci):
        flagged = sample_cases(responses, scores, k=K)  # random + triggered
        verdicts = human_review(flagged)
        if human_confirms_critical_failures(verdicts):
            mark_fail_and_roll_back(model_version)  # or patch the prompt
```
Vary judge seeds and run multiple judge instances where possible to reduce stochasticity.
Defenses: don't let the judge gaslight you
- Hallucination: Ask judges to cite lines or quote the model output when making factual claims.
- Calibration drift: Periodically re-run the seed human set and update the prompt.
- Overfitting to judge style: Rotate judge models or use ensemble judges to prevent gaming prompts that exploit a single judge's quirks.
- Adversarial responses: Red-team the system with intentionally adversarial inputs and check whether the judge fails to flag them.
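The first defense (cite or quote the output) becomes mechanically enforceable: reject any judgment whose quoted evidence does not appear verbatim in the text being judged. A small sketch; the `evidence` field name is an assumption about your judge's output schema:

```python
# Verify that every quote a judge cites actually appears in the model
# output it is judging. Whitespace and case are normalized before matching.
import json

def unsupported_evidence(judge_json: str, model_output: str):
    """Return the list of cited quotes that do NOT appear verbatim."""
    evidence = json.loads(judge_json).get("evidence", [])
    normalized = " ".join(model_output.split()).lower()
    return [q for q in evidence
            if " ".join(q.split()).lower() not in normalized]

output = "Paris is the capital of France."
judgment = '{"evidence": ["capital of France", "founded in 1802"]}'
print(unsupported_evidence(judgment, output))  # → ['founded in 1802']
```

Any non-empty result means the judge fabricated evidence, which is exactly the kind of failure that should trigger human review.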
Best practices / Playbook (short and furious)
- Always pair LLM-judge runs with human spot checks (random + triggered cases).
- Use structured outputs and strict JSON for reliability.
- Re-calibrate on a regular schedule, and immediately whenever you change base models or datasets.
- Keep a versioned record: judge prompts, seeds, agreement scores — put them in your playbook.
- Use multiple techniques: pairwise for preference, scalar for trend tracking, critique-for-revision for improvements.
Closing: TL;DR and next steps
- LLM-as-judge speeds up iteration and scales evaluation, but it requires explicit rubrics, calibration, and human oversight.
- Start small: calibrate with a human seed set, enforce structured outputs, and add human adjudication for edge cases.
Final challenge: take one prompt from your playbook, instrument it with a judge prompt like the template above, run 200 evaluations, and compare judge vs human agreement. If you don't find at least one systematic mismatch, congratulations — you probably missed something.