Validating Your Judge Against Human Labels

Measure how well your LLM judge agrees with human evaluators.

The Scenario

You've deployed an LLM judge for content moderation. Before trusting it at scale, you need to validate it against human moderator decisions. You have 100 items with human labels and want comprehensive metrics: accuracy, precision, recall, Cohen's kappa, and analysis of systematic biases.

What You'll Learn

  • Using compute_metrics() with ground truth datasets
  • Interpreting accuracy, precision, recall, F1, and kappa
  • Detecting systematic bias with systematic_bias()
  • Per-criterion breakdown for targeted improvements
  • Bootstrap confidence intervals for statistical rigor

The Solution

Step 1: Prepare Your Validation Dataset

Load a dataset with human-labeled ground truth:

from autorubric import RubricDataset

# Load dataset with ground truth labels
dataset = RubricDataset.from_file("content_moderation_labeled.json")

print(f"Dataset: {dataset.name}")
print(f"Items: {len(dataset)}")
print(f"Criteria: {dataset.criterion_names}")

# Verify ground truth coverage
items_with_gt = sum(1 for item in dataset if item.ground_truth is not None)
print(f"Items with ground truth: {items_with_gt}/{len(dataset)}")

Step 2: Run Evaluation

Evaluate the dataset with your grader:

from autorubric import LLMConfig, evaluate
from autorubric.graders import CriterionGrader

grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini", temperature=0.0)
)

# Run evaluation
result = await evaluate(
    dataset,
    grader,
    show_progress=True,
    experiment_name="judge-validation-v1"
)

print(f"Evaluated: {result.successful_items}/{result.total_items}")
print(f"Cost: ${result.total_completion_cost:.4f}")

Step 3: Compute Validation Metrics

Use compute_metrics() to compare predictions against ground truth:

metrics = result.compute_metrics(dataset)

# Overall metrics
print("=" * 50)
print("OVERALL VALIDATION METRICS")
print("=" * 50)
print(f"Criterion Accuracy:  {metrics.criterion_accuracy:.1%}")
print(f"Cohen's Kappa:       {metrics.cohens_kappa:.3f}")
print(f"Precision (MET):     {metrics.precision:.1%}")
print(f"Recall (MET):        {metrics.recall:.1%}")
print(f"F1 Score:            {metrics.f1_score:.3f}")

Interpreting the Metrics

Metric           What It Measures                        Good Value
Accuracy         % of verdicts matching ground truth     >85%
Cohen's Kappa    Agreement beyond chance                 >0.6 (substantial)
Precision        Of predicted MET, % actually MET        Depends on use case
Recall           Of actual MET, % predicted MET          Depends on use case
F1               Harmonic mean of precision/recall       >0.7
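
To make these definitions concrete, here is a small illustrative sketch with hypothetical verdict lists, treating MET as the positive class; in practice compute_metrics() derives these values for you:

# Illustration only: how the binary metrics above are defined.
predicted = ["MET", "UNMET", "MET", "MET", "UNMET", "UNMET"]   # judge verdicts
human     = ["MET", "UNMET", "UNMET", "MET", "MET", "UNMET"]   # human labels

tp = sum(p == "MET" and h == "MET" for p, h in zip(predicted, human))
fp = sum(p == "MET" and h == "UNMET" for p, h in zip(predicted, human))
fn = sum(p == "UNMET" and h == "MET" for p, h in zip(predicted, human))
tn = sum(p == "UNMET" and h == "UNMET" for p, h in zip(predicted, human))

accuracy = (tp + tn) / len(predicted)             # share of matching verdicts
precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted MET, % truly MET
recall = tp / (tp + fn) if (tp + fn) else 0.0     # of actual MET, % recovered
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0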

Kappa Interpretation

Kappa        Agreement Level
< 0.2        Slight
0.2–0.4      Fair
0.4–0.6      Moderate
0.6–0.8      Substantial
> 0.8        Near perfect
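
Kappa rescales observed agreement by the agreement two raters would reach by chance given their individual MET rates. A minimal worked sketch with hypothetical confusion counts:

# Cohen's kappa = (p_o - p_e) / (1 - p_e).
# Hypothetical judge-vs-human confusion counts for one criterion.
tp, fp, fn, tn = 40, 10, 8, 42
n = tp + fp + fn + tn

p_o = (tp + tn) / n                              # observed agreement
p_e = ((tp + fp) / n) * ((tp + fn) / n) \
    + ((fn + tn) / n) * ((fp + tn) / n)          # agreement expected by chance
kappa = (p_o - p_e) / (1 - p_e)

print(f"kappa = {kappa:.3f}")  # 0.640 here: substantial agreement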

Step 4: Per-Criterion Analysis

Identify which criteria need improvement:

print("\nPER-CRITERION BREAKDOWN")
print("-" * 50)
print(f"{'Criterion':<25} {'Acc':>8} {'Kappa':>8} {'F1':>8}")
print("-" * 50)

for name, cr_metrics in metrics.per_criterion.items():
    print(f"{name:<25} {cr_metrics.accuracy:>7.1%} {cr_metrics.cohens_kappa:>8.3f} "
          f"{cr_metrics.f1_score:>8.3f}")

Sample output:

PER-CRITERION BREAKDOWN
--------------------------------------------------
Criterion                      Acc    Kappa       F1
--------------------------------------------------
hate_speech                  92.0%    0.812    0.891
harassment                   87.0%    0.714    0.823
misinformation               78.0%    0.521    0.742
self_harm                    95.0%    0.876    0.923
spam                         89.0%    0.761    0.856

The misinformation criterion has the lowest kappa; consider few-shot calibration or a clearer criterion definition.
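
A short helper like the one below (the 0.6 cutoff is a convention, not a library default) can flag criteria that fall short of a target agreement level, so calibration effort goes where it matters most:

KAPPA_THRESHOLD = 0.6  # "substantial" agreement; adjust to your risk tolerance

weak_criteria = {
    name: cr.cohens_kappa
    for name, cr in metrics.per_criterion.items()
    if cr.cohens_kappa < KAPPA_THRESHOLD
}

# Lowest-agreement criteria first: the best candidates for few-shot examples
# or a rewritten requirement.
for name, kappa in sorted(weak_criteria.items(), key=lambda kv: kv[1]):
    print(f"Needs attention: {name} (kappa={kappa:.3f})")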

Step 5: Detect Systematic Bias

Check if the judge systematically over- or under-predicts MET:

from autorubric.metrics import systematic_bias

bias = systematic_bias(metrics)

print(f"\nSystematic Bias Analysis:")
print(f"  MET rate (predicted): {bias['predicted_met_rate']:.1%}")
print(f"  MET rate (ground truth): {bias['ground_truth_met_rate']:.1%}")
print(f"  Bias direction: {bias['direction']}")  # "permissive" or "strict"
print(f"  Bias magnitude: {bias['magnitude']:.1%}")

  • Permissive bias: Judge marks MET more often than humans
  • Strict bias: Judge marks MET less often than humans
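
Before acting on a bias estimate, it can help to check whether the asymmetry exceeds chance. One option (not part of the library) is an exact sign test on the discordant verdict pairs, shown here with hypothetical counts and assuming scipy is installed:

from scipy.stats import binomtest

# Hypothetical discordant-pair counts: pairs where judge and human disagree.
judge_met_human_unmet = 18   # judge was more permissive on these pairs
judge_unmet_human_met = 7    # judge was stricter on these pairs

n_discordant = judge_met_human_unmet + judge_unmet_human_met
# With no systematic bias, each discordant pair is equally likely to fall on
# either side, so the permissive count should look like Binomial(n, 0.5).
test = binomtest(judge_met_human_unmet, n_discordant, p=0.5)
print(f"Discordant pairs: {n_discordant}, two-sided p-value: {test.pvalue:.4f}")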

Step 6: Bootstrap Confidence Intervals

Get statistical confidence in your metrics:

metrics = result.compute_metrics(
    dataset,
    bootstrap=True,
    n_bootstrap=1000,
    confidence_level=0.95,
    seed=42
)

print(f"\nAccuracy: {metrics.criterion_accuracy:.1%}")
print(f"  95% CI: [{metrics.accuracy_ci[0]:.1%}, {metrics.accuracy_ci[1]:.1%}]")

print(f"\nKappa: {metrics.cohens_kappa:.3f}")
print(f"  95% CI: [{metrics.kappa_ci[0]:.3f}, {metrics.kappa_ci[1]:.3f}]")

Bootstrap Cost

Bootstrap analysis is computationally expensive. For quick iteration, use bootstrap=False during development, then enable for final validation.
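
For intuition only, the percentile bootstrap resamples the evaluated items with replacement and recomputes the metric on each resample. A small numpy sketch over hypothetical per-verdict correctness indicators (this is not the library's internal implementation):

import numpy as np

rng = np.random.default_rng(42)
# Hypothetical indicators: did the judge match the human label on each criterion?
correct = rng.random(500) < 0.85

bootstrap_accuracies = [
    rng.choice(correct, size=correct.size, replace=True).mean()
    for _ in range(1000)
]
lo, hi = np.percentile(bootstrap_accuracies, [2.5, 97.5])
print(f"Accuracy: {correct.mean():.1%}, 95% CI: [{lo:.1%}, {hi:.1%}]")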

Step 7: Score Correlation

Check how well predicted scores correlate with ground truth scores:

print(f"\nScore Correlation:")
print(f"  Pearson:  {metrics.score_pearson:.3f}")
print(f"  Spearman: {metrics.score_spearman:.3f}")
print(f"  RMSE:     {metrics.score_rmse:.3f}")

High correlation (>0.8) indicates the judge ranks items similarly to humans, even if individual verdicts differ.
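
If you want to sanity-check these numbers by hand, the same statistics can be reproduced with scipy and numpy from paired score lists; the scores below are hypothetical, and in practice compute_metrics() reports the values directly:

import numpy as np
from scipy.stats import pearsonr, spearmanr

predicted_scores = [0.90, 0.40, 0.70, 0.20, 0.80]   # hypothetical judge scores
human_scores     = [0.85, 0.50, 0.65, 0.10, 0.90]   # hypothetical human scores

pearson_r, _ = pearsonr(predicted_scores, human_scores)
spearman_rho, _ = spearmanr(predicted_scores, human_scores)
rmse = float(np.sqrt(np.mean((np.array(predicted_scores) - np.array(human_scores)) ** 2)))

print(f"Pearson {pearson_r:.3f}, Spearman {spearman_rho:.3f}, RMSE {rmse:.3f}")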

Step 8: Export Results

Save metrics for reporting:

# Get summary as text
print(metrics.summary())

# Export to DataFrame for analysis
df = metrics.to_dataframe()
df.to_csv("validation_results.csv")
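
For lightweight reporting (for example a CI gate), the headline numbers can also be dumped to JSON; this sketch uses only the attributes printed earlier and is not a library feature:

import json

summary = {
    "criterion_accuracy": metrics.criterion_accuracy,
    "cohens_kappa": metrics.cohens_kappa,
    "precision": metrics.precision,
    "recall": metrics.recall,
    "f1_score": metrics.f1_score,
}

# Write a small machine-readable summary alongside the CSV export.
with open("validation_summary.json", "w") as f:
    json.dump(summary, f, indent=2)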

Key Takeaways

  • compute_metrics() provides comprehensive validation against ground truth
  • Cohen's kappa is more meaningful than accuracy for imbalanced labels
  • Per-criterion analysis identifies weak spots for targeted improvement
  • Systematic bias detection reveals if judge is too permissive or strict
  • Bootstrap CIs provide statistical confidence in your metrics
  • Score correlation shows overall ranking agreement

Appendix: Complete Code

"""Judge Validation - Content Moderation Evaluation"""

import asyncio
from autorubric import (
    Rubric, RubricDataset, CriterionVerdict,
    LLMConfig, evaluate
)
from autorubric.graders import CriterionGrader


def create_moderation_dataset() -> RubricDataset:
    """Create a content moderation dataset with human labels."""

    rubric = Rubric.from_dict([
        {
            "name": "hate_speech",
            "weight": -15.0,
            "requirement": "Contains hate speech targeting protected groups"
        },
        {
            "name": "harassment",
            "weight": -12.0,
            "requirement": "Contains harassment or bullying directed at individuals"
        },
        {
            "name": "misinformation",
            "weight": -10.0,
            "requirement": "Contains demonstrably false claims about health/safety"
        },
        {
            "name": "appropriate_content",
            "weight": 10.0,
            "requirement": "Content is appropriate for general audience"
        }
    ])

    dataset = RubricDataset(
        prompt="Moderate this user-generated content.",
        rubric=rubric,
        name="content-moderation-v1"
    )

    # Sample content with human moderator labels
    # In practice, these would come from your human annotation process
    items = [
        {
            "submission": "I love this product! Works great for my family.",
            "description": "Positive review",
            "ground_truth": [CriterionVerdict.UNMET, CriterionVerdict.UNMET,
                           CriterionVerdict.UNMET, CriterionVerdict.MET]
        },
        {
            "submission": "This group of people are all criminals and should be deported.",
            "description": "Hate speech",
            "ground_truth": [CriterionVerdict.MET, CriterionVerdict.UNMET,
                           CriterionVerdict.UNMET, CriterionVerdict.UNMET]
        },
        {
            "submission": "The vaccine contains microchips for mind control.",
            "description": "Health misinformation",
            "ground_truth": [CriterionVerdict.UNMET, CriterionVerdict.UNMET,
                           CriterionVerdict.MET, CriterionVerdict.UNMET]
        },
        {
            "submission": "Just tried the new restaurant downtown. The pasta was amazing!",
            "description": "Normal review",
            "ground_truth": [CriterionVerdict.UNMET, CriterionVerdict.UNMET,
                           CriterionVerdict.UNMET, CriterionVerdict.MET]
        },
        {
            "submission": "@username you're so stupid, everyone hates you, just quit already",
            "description": "Direct harassment",
            "ground_truth": [CriterionVerdict.UNMET, CriterionVerdict.MET,
                           CriterionVerdict.UNMET, CriterionVerdict.UNMET]
        },
        {
            "submission": "Studies show 5G towers cause cancer. Wake up sheeple!",
            "description": "Health misinformation",
            "ground_truth": [CriterionVerdict.UNMET, CriterionVerdict.UNMET,
                           CriterionVerdict.MET, CriterionVerdict.UNMET]
        },
        {
            "submission": "Thanks for the help! Your customer service team is wonderful.",
            "description": "Positive feedback",
            "ground_truth": [CriterionVerdict.UNMET, CriterionVerdict.UNMET,
                           CriterionVerdict.UNMET, CriterionVerdict.MET]
        },
        {
            "submission": "Anyone else notice the weather has been weird lately?",
            "description": "Neutral observation",
            "ground_truth": [CriterionVerdict.UNMET, CriterionVerdict.UNMET,
                           CriterionVerdict.UNMET, CriterionVerdict.MET]
        },
        {
            "submission": "[Religious group] are taking over the country. We need to stop them.",
            "description": "Religious discrimination",
            "ground_truth": [CriterionVerdict.MET, CriterionVerdict.UNMET,
                           CriterionVerdict.UNMET, CriterionVerdict.UNMET]
        },
        {
            "submission": "The product arrived damaged. Very disappointed with shipping.",
            "description": "Negative but appropriate review",
            "ground_truth": [CriterionVerdict.UNMET, CriterionVerdict.UNMET,
                           CriterionVerdict.UNMET, CriterionVerdict.MET]
        }
    ]

    for item in items:
        dataset.add_item(**item)

    return dataset


async def main():
    # Create/load dataset
    dataset = create_moderation_dataset()
    print(f"Dataset: {dataset.name}")
    print(f"Items: {len(dataset)}")
    print(f"Criteria: {dataset.criterion_names}")

    # Configure grader
    grader = CriterionGrader(
        llm_config=LLMConfig(model="openai/gpt-4.1-mini", temperature=0.0)
    )

    # Run evaluation
    print("\n" + "=" * 60)
    print("RUNNING EVALUATION")
    print("=" * 60)

    result = await evaluate(
        dataset,
        grader,
        show_progress=True,
        experiment_name="moderation-validation"
    )

    print(f"\nEvaluated: {result.successful_items}/{result.total_items}")
    print(f"Cost: ${result.total_completion_cost or 0:.4f}")

    # Compute validation metrics
    print("\n" + "=" * 60)
    print("VALIDATION METRICS")
    print("=" * 60)

    metrics = result.compute_metrics(dataset)

    print(f"\nOverall Metrics:")
    print(f"  Criterion Accuracy:  {metrics.criterion_accuracy:.1%}")
    print(f"  Cohen's Kappa:       {metrics.cohens_kappa:.3f}")
    print(f"  Precision (MET):     {metrics.precision:.1%}")
    print(f"  Recall (MET):        {metrics.recall:.1%}")
    print(f"  F1 Score:            {metrics.f1_score:.3f}")

    # Per-criterion breakdown
    print("\n" + "-" * 60)
    print("PER-CRITERION METRICS")
    print("-" * 60)
    print(f"{'Criterion':<20} {'Acc':>8} {'Kappa':>8} {'Prec':>8} {'Recall':>8}")
    print("-" * 60)

    for name, cr_metrics in metrics.per_criterion.items():
        print(f"{name:<20} {cr_metrics.accuracy:>7.1%} {cr_metrics.cohens_kappa:>8.3f} "
              f"{cr_metrics.precision:>7.1%} {cr_metrics.recall:>7.1%}")

    # Score correlation
    print("\n" + "-" * 60)
    print("SCORE CORRELATION")
    print("-" * 60)
    print(f"  Pearson r:   {metrics.score_pearson:.3f}")
    print(f"  Spearman ρ:  {metrics.score_spearman:.3f}")
    print(f"  RMSE:        {metrics.score_rmse:.3f}")

    # Bias analysis
    print("\n" + "-" * 60)
    print("BIAS ANALYSIS")
    print("-" * 60)

    # Count predicted vs ground truth MET rates
    pred_met = 0
    gt_met = 0
    total = 0

    for item_result in result.item_results:
        if item_result.error:
            continue
        item = item_result.item
        if item.ground_truth is None:
            continue

        for j, cr in enumerate(item_result.report.report or []):
            total += 1
            if cr.verdict == CriterionVerdict.MET:
                pred_met += 1
            if item.ground_truth[j] == CriterionVerdict.MET:
                gt_met += 1

    if total > 0:
        pred_rate = pred_met / total
        gt_rate = gt_met / total
        direction = "permissive" if pred_rate > gt_rate else "strict"
        magnitude = abs(pred_rate - gt_rate)

        print(f"  Predicted MET rate:    {pred_rate:.1%}")
        print(f"  Ground truth MET rate: {gt_rate:.1%}")
        print(f"  Bias direction:        {direction}")
        print(f"  Bias magnitude:        {magnitude:.1%}")


if __name__ == "__main__":
    asyncio.run(main())