Metrics

Agreement and correlation metrics for validating LLM judges against ground truth.

Overview

When your dataset includes ground truth labels, compute_metrics() measures how well your LLM judge agrees with human annotations. Metrics include accuracy, precision, recall, F1, Cohen's kappa, correlations, and systematic bias analysis.

Research Background

Casabianca et al. (2025) recommend agreement metrics including ICC, Krippendorff's alpha, and quadratic-weighted kappa (QWK), with iterative refinement until agreement with human-labeled subsets is acceptable. He et al. (2025) emphasize that correlation alone can mask systematic bias.
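
To see why, consider a judge whose scores track the human ranking perfectly but are shifted upward by a constant: correlation is perfect while every score is inflated. A minimal illustration using numpy and scipy directly (independent of autorubric):

import numpy as np
from scipy.stats import pearsonr

human = np.array([0.2, 0.4, 0.5, 0.7, 0.9])
judge = human + 0.15  # tracks the ranking exactly, but over-scores every item

r, _ = pearsonr(human, judge)
print(f"Pearson r = {r:.3f}")                        # 1.000 -- looks perfect
print(f"Mean bias = {np.mean(judge - human):+.3f}")  # +0.150 -- systematic over-scoring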

Quick Example

from autorubric import RubricDataset, LLMConfig, evaluate
from autorubric.graders import CriterionGrader

dataset = RubricDataset.from_file("data_with_ground_truth.json")
grader = CriterionGrader(llm_config=LLMConfig(model="openai/gpt-4.1-mini"))

result = await evaluate(dataset, grader, show_progress=True)

# Compute metrics
metrics = result.compute_metrics(dataset)

# Formatted summary
print(metrics.summary())

# Export options
df = metrics.to_dataframe()
metrics.to_file("metrics.json")

Bootstrap Confidence Intervals

metrics = result.compute_metrics(
    dataset,
    bootstrap=True,
    n_bootstrap=1000,
    confidence_level=0.95,
    seed=42,
)

print(metrics.summary())
# Bootstrap CIs (95%):
#   Accuracy: [85.2%, 92.1%]
#   Kappa:    [0.712, 0.845]
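
The interval bounds can also be read programmatically from the bootstrap field (a BootstrapResults, documented below), for example:

if metrics.bootstrap is not None:
    acc_low, acc_high = metrics.bootstrap.accuracy_ci
    kappa_low, kappa_high = metrics.bootstrap.kappa_ci
    print(f"Accuracy CI: [{acc_low:.1%}, {acc_high:.1%}]")
    print(f"Kappa CI:    [{kappa_low:.3f}, {kappa_high:.3f}]")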

Per-Judge Metrics (Ensemble)

metrics = result.compute_metrics(
    dataset,
    per_judge=True,
)

for judge_id, jm in metrics.per_judge.items():
    print(f"{judge_id}: Accuracy={jm.criterion_accuracy:.1%}, RMSE={jm.score_rmse:.4f}")

Metric Fields

Field                 Description
criterion_accuracy    Overall accuracy across all criteria
criterion_precision   Precision for MET class
criterion_recall      Recall for MET class
criterion_f1          F1 score for MET class
mean_kappa            Mean Cohen's kappa across criteria
score_rmse            RMSE of cumulative scores
score_mae             MAE of cumulative scores
score_spearman        Spearman rank correlation
score_kendall         Kendall tau correlation
score_pearson         Pearson correlation
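
These fields are attributes of the returned MetricsResult; the correlation fields are CorrelationResult objects (see below), so the coefficient sits one level down. For example:

print(f"Accuracy:   {metrics.criterion_accuracy:.1%}")
print(f"Mean kappa: {metrics.mean_kappa:.3f}")
print(f"Score RMSE: {metrics.score_rmse:.4f}")
print(f"Spearman:   {metrics.score_spearman.coefficient:.3f} ({metrics.score_spearman.interpretation})")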

compute_metrics

Compute agreement metrics between predictions and ground truth.

compute_metrics

compute_metrics(eval_result: 'EvalResult', dataset: 'RubricDataset', *, bootstrap: bool = False, n_bootstrap: int = 1000, per_judge: bool = False, cannot_assess: Literal['exclude', 'as_unmet'] = 'exclude', na_mode: Literal['exclude', 'as_worst'] = 'exclude', confidence_level: float = 0.95, seed: int | None = None) -> MetricsResult

Compute comprehensive evaluation metrics.

This is the main entry point for computing metrics from an evaluation run. It compares predicted verdicts and scores against ground truth from the dataset. Supports binary, ordinal, and nominal (multi-choice) criteria.

PARAMETER DESCRIPTION
eval_result

The evaluation result from EvalRunner.

TYPE: 'EvalResult'

dataset

The dataset with ground truth labels.

TYPE: 'RubricDataset'

bootstrap

If True, compute bootstrap confidence intervals (expensive).

TYPE: bool DEFAULT: False

n_bootstrap

Number of bootstrap samples if bootstrap=True.

TYPE: int DEFAULT: 1000

per_judge

If True and ensemble, compute per-judge metrics.

TYPE: bool DEFAULT: False

cannot_assess

How to handle CANNOT_ASSESS verdicts (binary criteria):
- "exclude": Skip pairs where either is CA (default)
- "as_unmet": Treat CA as UNMET

TYPE: Literal['exclude', 'as_unmet'] DEFAULT: 'exclude'

na_mode

How to handle NA options (multi-choice criteria):
- "exclude": Skip pairs where either is NA (default)
- "as_worst": Keep NA in metrics (no special treatment)

TYPE: Literal['exclude', 'as_worst'] DEFAULT: 'exclude'

confidence_level

Confidence level for bootstrap CIs (default 0.95).

TYPE: float DEFAULT: 0.95

seed

Random seed for bootstrap reproducibility.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
MetricsResult

MetricsResult with comprehensive metrics and optional per-judge breakdown.

RAISES DESCRIPTION
ValueError

If no common items between eval_result and dataset.

Example

result = await evaluate(dataset, grader)
metrics = result.compute_metrics(dataset)
print(metrics.summary())
df = metrics.to_dataframe()
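
The same call with the verdict-handling options spelled out (the values shown are the defaults):

metrics = result.compute_metrics(
    dataset,
    cannot_assess="exclude",  # or "as_unmet" to count CANNOT_ASSESS as UNMET
    na_mode="exclude",        # or "as_worst" to keep NA options in the metrics
)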

Source code in src/autorubric/metrics/_compute.py
def compute_metrics(
    eval_result: "EvalResult",
    dataset: "RubricDataset",
    *,
    bootstrap: bool = False,
    n_bootstrap: int = 1000,
    per_judge: bool = False,
    cannot_assess: Literal["exclude", "as_unmet"] = "exclude",
    na_mode: Literal["exclude", "as_worst"] = "exclude",
    confidence_level: float = 0.95,
    seed: int | None = None,
) -> MetricsResult:
    """Compute comprehensive evaluation metrics.

    This is the main entry point for computing metrics from an evaluation run.
    It compares predicted verdicts and scores against ground truth from the dataset.
    Supports binary, ordinal, and nominal (multi-choice) criteria.

    Args:
        eval_result: The evaluation result from EvalRunner.
        dataset: The dataset with ground truth labels.
        bootstrap: If True, compute bootstrap confidence intervals (expensive).
        n_bootstrap: Number of bootstrap samples if bootstrap=True.
        per_judge: If True and ensemble, compute per-judge metrics.
        cannot_assess: How to handle CANNOT_ASSESS verdicts (binary criteria):
            - "exclude": Skip pairs where either is CA (default)
            - "as_unmet": Treat CA as UNMET
        na_mode: How to handle NA options (multi-choice criteria):
            - "exclude": Skip pairs where either is NA (default)
            - "as_worst": Keep NA in metrics (no special treatment)
        confidence_level: Confidence level for bootstrap CIs (default 0.95).
        seed: Random seed for bootstrap reproducibility.

    Returns:
        MetricsResult with comprehensive metrics and optional per-judge breakdown.

    Raises:
        ValueError: If no common items between eval_result and dataset.

    Example:
        >>> result = await evaluate(dataset, grader)
        >>> metrics = result.compute_metrics(dataset)
        >>> print(metrics.summary())
        >>> df = metrics.to_dataframe()
    """
    result_warnings: list[str] = []

    # Build map of item_idx -> ItemResult
    eval_map = {ir.item_idx: ir for ir in eval_result.item_results}

    # Check for missing/extra items
    dataset_indices = set(range(len(dataset)))
    eval_indices = set(eval_map.keys())

    missing = dataset_indices - eval_indices
    if missing:
        result_warnings.append(
            f"{len(missing)} items from dataset not found in eval_result"
        )

    extra = eval_indices - dataset_indices
    if extra:
        result_warnings.append(
            f"{len(extra)} items in eval_result not in dataset"
        )

    # Use intersection
    common_indices = sorted(dataset_indices & eval_indices)

    if not common_indices:
        raise ValueError("No common items between eval_result and dataset")

    # Validate rubric homogeneity for metrics computation
    # If using per-item rubrics, all must have the same structure
    if dataset.rubric is not None:
        reference_rubric = dataset.rubric
    else:
        # Get rubric from first item
        reference_rubric = dataset.get_item_rubric(common_indices[0])

    reference_n_criteria = len(reference_rubric.rubric)

    for idx in common_indices:
        item_rubric = dataset.get_item_rubric(idx)
        if len(item_rubric.rubric) != reference_n_criteria:
            raise ValueError(
                f"Cannot compute metrics: items have different rubric structures. "
                f"Item {idx} has {len(item_rubric.rubric)} criteria but "
                f"expected {reference_n_criteria}. "
                f"Metrics require homogeneous rubric structures across all items."
            )

    # Use the reference rubric for classification
    criteria = list(reference_rubric.rubric)
    criterion_types = classify_criteria(criteria)
    n_criteria = len(criteria)

    # Count criteria by type
    n_binary = sum(1 for ct in criterion_types if ct == "binary")
    n_ordinal = sum(1 for ct in criterion_types if ct == "ordinal")
    n_nominal = sum(1 for ct in criterion_types if ct == "nominal")

    # Per-criterion data storage
    # For binary: list[CriterionVerdict]
    # For multi-choice: list[int] (option indices)
    per_criterion_pred: list[list[CriterionVerdict | int]] = [[] for _ in range(n_criteria)]
    per_criterion_true: list[list[CriterionVerdict | int]] = [[] for _ in range(n_criteria)]

    # Overall scores
    all_pred_scores: list[float] = []
    all_true_scores: list[float] = []

    # For ensemble: per-judge data (binary only for now)
    judge_scores: dict[str, list[float]] = {}
    judge_verdicts: dict[str, list[list[CriterionVerdict]]] = {}
    is_ensemble = False

    items_with_ground_truth = 0

    # NA tracking for multi-choice
    total_na_true = 0
    total_na_pred = 0
    total_na_agreement = 0
    total_na_fp = 0
    total_na_fn = 0

    for idx in common_indices:
        item = dataset.items[idx]
        item_result = eval_map[idx]
        report = item_result.report

        if item.ground_truth is None:
            result_warnings.append(f"Item {idx} has no ground truth, skipping")
            continue

        if item_result.error is not None:
            continue

        items_with_ground_truth += 1

        # Extract predictions using type-aware extraction
        pred_all = extract_all_verdicts_from_report(report, criteria)

        # Resolve ground truth (string labels → indices for multi-choice)
        try:
            true_all = resolve_ground_truth(list(item.ground_truth), criteria)
        except ValueError as e:
            result_warnings.append(f"Item {idx}: {e}")
            continue

        # Store per-criterion data
        for c_idx in range(n_criteria):
            pred_val = pred_all[c_idx]
            true_val = true_all[c_idx]

            # Handle None predictions (failed extraction)
            if pred_val is None:
                if criterion_types[c_idx] == "binary":
                    pred_val = CriterionVerdict.UNMET
                else:
                    pred_val = 0  # Default to first option

            per_criterion_pred[c_idx].append(pred_val)
            per_criterion_true[c_idx].append(true_val)

        # Compute scores
        pred_score = report.score if not report.error else 0.0
        # For true score, need to pass the original ground truth format
        # compute_weighted_score expects CriterionVerdict for binary, str for multi-choice
        true_score_verdicts = []
        for c_idx in range(n_criteria):
            if criterion_types[c_idx] == "binary":
                true_score_verdicts.append(true_all[c_idx])
            else:
                # For multi-choice, pass the option label (string)
                criterion = criteria[c_idx]
                opt_idx = true_all[c_idx]
                if isinstance(opt_idx, int) and 0 <= opt_idx < len(criterion.options):
                    true_score_verdicts.append(criterion.options[opt_idx].label)
                else:
                    # Default to first option if index is invalid
                    true_score_verdicts.append(criterion.options[0].label)

        true_score = dataset.compute_weighted_score(true_score_verdicts)

        all_pred_scores.append(pred_score)
        all_true_scores.append(true_score)

        # Check if ensemble and collect per-judge data
        if hasattr(report, "judge_scores") and report.judge_scores:
            is_ensemble = True
            for jid, score in report.judge_scores.items():
                if jid not in judge_scores:
                    judge_scores[jid] = []
                    judge_verdicts[jid] = []
                judge_scores[jid].append(score)

            # Extract per-judge verdicts from EnsembleCriterionReport.votes (binary only)
            if hasattr(report, "report") and report.report:
                for jid in judge_scores.keys():
                    judge_v = []
                    for cr in report.report:
                        if hasattr(cr, "votes"):
                            for vote in cr.votes:
                                if vote.judge_id == jid:
                                    judge_v.append(vote.verdict)
                                    break
                            else:
                                judge_v.append(CriterionVerdict.UNMET)
                        else:
                            judge_v.append(CriterionVerdict.UNMET)
                    if jid in judge_verdicts:
                        judge_verdicts[jid].append(judge_v)

    n_items = items_with_ground_truth

    if n_items == 0:
        raise ValueError("No valid items with ground truth found")

    # Compute per-criterion metrics by type
    per_criterion: list[CriterionMetricsUnion] = []
    criterion_kappas: list[float] = []

    # For binary-only aggregate metrics
    binary_pred_flat: list[int] = []
    binary_true_flat: list[int] = []

    for c_idx in range(n_criteria):
        criterion = criteria[c_idx]
        c_type = criterion_types[c_idx]
        pred_data = per_criterion_pred[c_idx]
        true_data = per_criterion_true[c_idx]

        if c_type == "binary":
            # Binary criterion metrics
            pred_verdicts = [v for v in pred_data if isinstance(v, CriterionVerdict)]
            true_verdicts = [v for v in true_data if isinstance(v, CriterionVerdict)]

            # Filter CANNOT_ASSESS
            pred_filtered = []
            true_filtered = []
            for p, t in zip(pred_verdicts, true_verdicts):
                if cannot_assess == "exclude":
                    if p == CriterionVerdict.CANNOT_ASSESS or t == CriterionVerdict.CANNOT_ASSESS:
                        continue
                pred_filtered.append(_verdict_to_binary(p))
                true_filtered.append(_verdict_to_binary(t))

            # Add to aggregate
            binary_pred_flat.extend(pred_filtered)
            binary_true_flat.extend(true_filtered)

            name = criterion.name or f"Criterion {c_idx + 1}"

            if not pred_filtered:
                per_criterion.append(
                    CriterionMetrics(
                        name=name,
                        index=c_idx,
                        n_samples=0,
                        accuracy=0.0,
                        precision=0.0,
                        recall=0.0,
                        f1=0.0,
                        kappa=0.0,
                        kappa_interpretation="undefined",
                        support_true=0,
                        support_pred=0,
                    )
                )
                continue

            c_acc = accuracy_score(true_filtered, pred_filtered)
            c_prec = precision_score(true_filtered, pred_filtered, zero_division=0)
            c_rec = recall_score(true_filtered, pred_filtered, zero_division=0)
            c_f1 = f1_score(true_filtered, pred_filtered, zero_division=0)

            try:
                c_kappa = cohen_kappa_score(true_filtered, pred_filtered)
            except Exception:
                c_kappa = 0.0

            criterion_kappas.append(c_kappa)

            per_criterion.append(
                CriterionMetrics(
                    name=name,
                    index=c_idx,
                    n_samples=len(pred_filtered),
                    accuracy=float(c_acc),
                    precision=float(c_prec),
                    recall=float(c_rec),
                    f1=float(c_f1),
                    kappa=float(c_kappa),
                    kappa_interpretation=_interpret_kappa(c_kappa),
                    support_true=sum(true_filtered),
                    support_pred=sum(pred_filtered),
                )
            )

        elif c_type == "ordinal":
            # Ordinal multi-choice criterion metrics
            pred_indices = [v for v in pred_data if isinstance(v, int)]
            true_indices = [v for v in true_data if isinstance(v, int)]

            # Filter NA options
            pred_filtered, true_filtered, na_agree, na_fp, na_fn = filter_na_multi_choice(
                pred_indices, true_indices, criterion, mode=na_mode
            )

            # Track NA stats
            total_na_agreement += na_agree
            total_na_fp += na_fp
            total_na_fn += na_fn

            metrics = _compute_ordinal_criterion_metrics(
                pred_filtered, true_filtered, criterion, c_idx
            )
            per_criterion.append(metrics)

            # Use weighted kappa for ordinal in mean calculation
            criterion_kappas.append(metrics.weighted_kappa)

        else:  # nominal
            # Nominal multi-choice criterion metrics
            pred_indices = [v for v in pred_data if isinstance(v, int)]
            true_indices = [v for v in true_data if isinstance(v, int)]

            # Filter NA options
            pred_filtered, true_filtered, na_agree, na_fp, na_fn = filter_na_multi_choice(
                pred_indices, true_indices, criterion, mode=na_mode
            )

            # Track NA stats
            total_na_agreement += na_agree
            total_na_fp += na_fp
            total_na_fn += na_fn

            metrics = _compute_nominal_criterion_metrics(
                pred_filtered, true_filtered, criterion, c_idx
            )
            per_criterion.append(metrics)

            # Use unweighted kappa for nominal
            criterion_kappas.append(metrics.kappa)

    # Aggregate metrics
    mean_kappa = (
        sum(criterion_kappas) / len(criterion_kappas) if criterion_kappas else 0.0
    )

    # Binary-only aggregate metrics (precision/recall/f1 only make sense for binary)
    if binary_pred_flat:
        criterion_accuracy = accuracy_score(binary_true_flat, binary_pred_flat)
        criterion_precision = precision_score(binary_true_flat, binary_pred_flat, zero_division=0)
        criterion_recall = recall_score(binary_true_flat, binary_pred_flat, zero_division=0)
        criterion_f1 = f1_score(binary_true_flat, binary_pred_flat, zero_division=0)
    else:
        # No binary criteria - compute accuracy across all multi-choice
        # For multi-choice, accuracy is exact match
        all_correct = 0
        all_total = 0
        for c_idx in range(n_criteria):
            c_type = criterion_types[c_idx]
            if c_type != "binary":
                pred_data = per_criterion_pred[c_idx]
                true_data = per_criterion_true[c_idx]
                for p, t in zip(pred_data, true_data):
                    if isinstance(p, int) and isinstance(t, int):
                        all_total += 1
                        if p == t:
                            all_correct += 1

        criterion_accuracy = all_correct / all_total if all_total > 0 else 0.0
        # Precision/recall/f1 not meaningful for pure multi-choice rubrics
        criterion_precision = 0.0
        criterion_recall = 0.0
        criterion_f1 = 0.0

    # Score-level metrics
    score_rmse = float(np.sqrt(mean_squared_error(all_true_scores, all_pred_scores)))
    score_mae = float(mean_absolute_error(all_true_scores, all_pred_scores))

    score_spearman = _compute_correlation(all_pred_scores, all_true_scores, "spearman")
    score_kendall = _compute_correlation(all_pred_scores, all_true_scores, "kendall")
    score_pearson = _compute_correlation(all_pred_scores, all_true_scores, "pearson")

    # Bias analysis
    bias = systematic_bias(all_pred_scores, all_true_scores)

    # Bootstrap CIs (optional) - uses binary metrics for backwards compat
    bootstrap_results = None
    if bootstrap and binary_pred_flat:
        bootstrap_results = _compute_bootstrap_ci(
            binary_true_flat,
            binary_pred_flat,
            all_true_scores,
            all_pred_scores,
            n_bootstrap=n_bootstrap,
            confidence_level=confidence_level,
            seed=seed,
        )

    # Per-judge metrics (optional, for ensemble) - binary only for now
    per_judge_metrics = None
    if per_judge and is_ensemble and judge_scores:
        per_judge_metrics = {}
        for jid in judge_scores.keys():
            jv = judge_verdicts.get(jid, [])
            if not jv:
                continue

            # Extract binary verdicts for this judge
            binary_true_verdicts = []
            for true_item in per_criterion_true:
                binary_true_verdicts.append(
                    [v for v in true_item if isinstance(v, CriterionVerdict)]
                )

            per_judge_metrics[jid] = _compute_judge_metrics(
                judge_id=jid,
                judge_scores=judge_scores[jid],
                true_scores=all_true_scores,
                judge_verdicts=jv,
                true_verdicts=binary_true_verdicts[0] if binary_true_verdicts else [],
                cannot_assess=cannot_assess,
            )

    # NA stats (for multi-choice criteria)
    na_stats = None
    if n_ordinal > 0 or n_nominal > 0:
        # Calculate total NA counts
        for c_idx in range(n_criteria):
            if criterion_types[c_idx] != "binary":
                criterion = criteria[c_idx]
                na_indices = {i for i, opt in enumerate(criterion.options) if opt.na}
                if na_indices:
                    pred_data = per_criterion_pred[c_idx]
                    true_data = per_criterion_true[c_idx]
                    for p in pred_data:
                        if isinstance(p, int) and p in na_indices:
                            total_na_pred += 1
                    for t in true_data:
                        if isinstance(t, int) and t in na_indices:
                            total_na_true += 1

        total_na = total_na_true + total_na_pred
        na_stats = NAStats(
            na_count_true=total_na_true,
            na_count_pred=total_na_pred,
            na_agreement=total_na_agreement / max(1, total_na) if total_na > 0 else 0.0,
            na_false_positive=total_na_fp,
            na_false_negative=total_na_fn,
        )

    return MetricsResult(
        criterion_accuracy=float(criterion_accuracy),
        criterion_precision=float(criterion_precision),
        criterion_recall=float(criterion_recall),
        criterion_f1=float(criterion_f1),
        mean_kappa=float(mean_kappa),
        per_criterion=per_criterion,
        score_rmse=score_rmse,
        score_mae=score_mae,
        score_spearman=score_spearman,
        score_kendall=score_kendall,
        score_pearson=score_pearson,
        bias=bias,
        bootstrap=bootstrap_results,
        per_judge=per_judge_metrics,
        n_items=n_items,
        n_criteria=n_criteria,
        n_binary_criteria=n_binary,
        n_ordinal_criteria=n_ordinal,
        n_nominal_criteria=n_nominal,
        na_stats=na_stats,
        warnings=result_warnings,
    )

MetricsResult

Complete metrics result with aggregate and per-criterion breakdowns.

MetricsResult

Bases: BaseModel

Complete metrics result from compute_metrics().

This is the main result type returned by EvalResult.compute_metrics(). It provides a comprehensive view of evaluation quality including:

- Criterion-level agreement metrics
- Score-level correlation and error metrics
- Per-criterion breakdown (supports binary, ordinal, and nominal criteria)
- Optional bootstrap confidence intervals
- Optional per-judge metrics for ensemble evaluations

ATTRIBUTE DESCRIPTION
criterion_accuracy

Overall accuracy across all criteria.

TYPE: float

criterion_precision

Overall precision for MET class (binary criteria only).

TYPE: float

criterion_recall

Overall recall for MET class (binary criteria only).

TYPE: float

criterion_f1

Overall F1 for MET class (binary criteria only).

TYPE: float

mean_kappa

Mean kappa across criteria (weighted for ordinal, unweighted for binary/nominal).

TYPE: float

per_criterion

Per-criterion metrics breakdown (polymorphic union type).

TYPE: list[CriterionMetricsUnion]

score_rmse

RMSE of cumulative scores.

TYPE: float

score_mae

MAE of cumulative scores.

TYPE: float

score_spearman

Spearman correlation result.

TYPE: CorrelationResult

score_kendall

Kendall tau correlation result.

TYPE: CorrelationResult

score_pearson

Pearson correlation result.

TYPE: CorrelationResult

bias

Systematic bias analysis.

TYPE: BiasResult

bootstrap

Optional bootstrap confidence intervals.

TYPE: BootstrapResults | None

per_judge

Optional per-judge metrics for ensemble.

TYPE: dict[str, JudgeMetrics] | None

n_items

Number of items used in computation.

TYPE: int

n_criteria

Number of criteria.

TYPE: int

n_binary_criteria

Number of binary criteria (default 0 for backwards compat).

TYPE: int

n_ordinal_criteria

Number of ordinal multi-choice criteria.

TYPE: int

n_nominal_criteria

Number of nominal multi-choice criteria.

TYPE: int

na_stats

Statistics for NA handling in multi-choice criteria.

TYPE: NAStats | None

warnings

Any warnings generated during computation.

TYPE: list[str]

summary

summary() -> str

Return formatted text summary of metrics.

Source code in src/autorubric/metrics/_types.py
def summary(self) -> str:
    """Return formatted text summary of metrics."""
    lines = []
    lines.append("=" * 60)
    lines.append("METRICS SUMMARY")
    lines.append("=" * 60)

    # Show criteria type breakdown if mixed
    criteria_info = f"Items: {self.n_items}, Criteria: {self.n_criteria}"
    if self.n_ordinal_criteria > 0 or self.n_nominal_criteria > 0:
        type_parts = []
        if self.n_binary_criteria > 0:
            type_parts.append(f"{self.n_binary_criteria} binary")
        if self.n_ordinal_criteria > 0:
            type_parts.append(f"{self.n_ordinal_criteria} ordinal")
        if self.n_nominal_criteria > 0:
            type_parts.append(f"{self.n_nominal_criteria} nominal")
        criteria_info += f" ({', '.join(type_parts)})"
    lines.append(criteria_info)

    if self.warnings:
        lines.append(f"\nWarnings ({len(self.warnings)}):")
        for w in self.warnings:
            lines.append(f"  - {w}")

    lines.append("")
    lines.append("Criterion-Level Metrics:")
    lines.append(f"  Accuracy:   {self.criterion_accuracy:.1%}")
    if self.n_binary_criteria > 0:
        lines.append(f"  Precision:  {self.criterion_precision:.2f}")
        lines.append(f"  Recall:     {self.criterion_recall:.2f}")
        lines.append(f"  F1:         {self.criterion_f1:.2f}")
    lines.append(f"  Mean Kappa: {self.mean_kappa:.3f}")

    lines.append("")
    lines.append("Score-Level Metrics:")
    lines.append(f"  RMSE:     {self.score_rmse:.4f}")
    lines.append(f"  MAE:      {self.score_mae:.4f}")
    lines.append(
        f"  Spearman: {self.score_spearman.coefficient:.4f} "
        f"({self.score_spearman.interpretation})"
    )
    lines.append(
        f"  Kendall:  {self.score_kendall.coefficient:.4f} "
        f"({self.score_kendall.interpretation})"
    )
    lines.append(
        f"  Pearson:  {self.score_pearson.coefficient:.4f} "
        f"({self.score_pearson.interpretation})"
    )

    lines.append("")
    lines.append("Bias Analysis:")
    lines.append(
        f"  Mean Bias:   {self.bias.mean_bias:+.4f} ({self.bias.direction})"
    )
    lines.append(f"  Significant: {'Yes' if self.bias.is_significant else 'No'}")

    # NA stats for multi-choice
    if self.na_stats:
        lines.append("")
        lines.append("NA Handling:")
        lines.append(f"  NA in Ground Truth: {self.na_stats.na_count_true}")
        lines.append(f"  NA in Predictions:  {self.na_stats.na_count_pred}")
        lines.append(f"  NA Agreement:       {self.na_stats.na_agreement:.1%}")
        if self.na_stats.na_false_positive > 0 or self.na_stats.na_false_negative > 0:
            lines.append(
                f"  NA FP/FN:           {self.na_stats.na_false_positive} / "
                f"{self.na_stats.na_false_negative}"
            )

    if self.bootstrap:
        lines.append("")
        lines.append(f"Bootstrap CIs ({self.bootstrap.confidence_level:.0%}):")
        lines.append(
            f"  Accuracy: [{self.bootstrap.accuracy_ci[0]:.1%}, "
            f"{self.bootstrap.accuracy_ci[1]:.1%}]"
        )
        lines.append(
            f"  Kappa:    [{self.bootstrap.kappa_ci[0]:.3f}, "
            f"{self.bootstrap.kappa_ci[1]:.3f}]"
        )
        lines.append(
            f"  RMSE:     [{self.bootstrap.rmse_ci[0]:.4f}, "
            f"{self.bootstrap.rmse_ci[1]:.4f}]"
        )

    if self.per_judge:
        lines.append("")
        lines.append("Per-Judge Metrics:")
        for judge_id, jm in sorted(self.per_judge.items()):
            lines.append(
                f"  {judge_id}: RMSE={jm.score_rmse:.4f}, "
                f"Spearman={jm.score_spearman.coefficient:.4f}"
            )

    lines.append("")
    lines.append("Per-Criterion Breakdown:")

    # Separate display by criterion type
    binary_criteria = [cm for cm in self.per_criterion if cm.criterion_type == "binary"]
    ordinal_criteria = [cm for cm in self.per_criterion if cm.criterion_type == "ordinal"]
    nominal_criteria = [cm for cm in self.per_criterion if cm.criterion_type == "nominal"]

    if binary_criteria:
        if ordinal_criteria or nominal_criteria:
            lines.append("\nBinary Criteria:")
        header = f"{'Criterion':<20} {'Acc':>8} {'Prec':>8} {'Rec':>8} {'F1':>8} {'Kappa':>8}"
        lines.append(header)
        lines.append("-" * len(header))
        for cm in binary_criteria:
            lines.append(
                f"{cm.name:<20} {cm.accuracy:>8.1%} {cm.precision:>8.2f} "
                f"{cm.recall:>8.2f} {cm.f1:>8.2f} {cm.kappa:>8.3f}"
            )

    if ordinal_criteria:
        lines.append("\nOrdinal Criteria:")
        header = f"{'Criterion':<20} {'Exact':>8} {'Adj':>8} {'WKappa':>8} {'Spearman':>10} {'RMSE':>8}"
        lines.append(header)
        lines.append("-" * len(header))
        for cm in ordinal_criteria:
            lines.append(
                f"{cm.name:<20} {cm.exact_accuracy:>8.1%} {cm.adjacent_accuracy:>8.1%} "
                f"{cm.weighted_kappa:>8.3f} {cm.spearman.coefficient:>10.4f} {cm.rmse:>8.4f}"
            )

    if nominal_criteria:
        lines.append("\nNominal Criteria:")
        header = f"{'Criterion':<20} {'Accuracy':>10} {'Kappa':>8} {'Interpretation':<20}"
        lines.append(header)
        lines.append("-" * len(header))
        for cm in nominal_criteria:
            lines.append(
                f"{cm.name:<20} {cm.exact_accuracy:>10.1%} {cm.kappa:>8.3f} "
                f"{cm.kappa_interpretation:<20}"
            )

    return "\n".join(lines)

to_dataframe

to_dataframe() -> DataFrame

Export metrics to pandas DataFrame.

Returns a flat DataFrame with a 'level' column indicating:

- 'aggregate': Overall metrics
- 'criterion': Per-criterion metrics (binary, ordinal, and nominal; the 'criterion_type' column distinguishes them)
- 'judge': Per-judge metrics (if available)

Source code in src/autorubric/metrics/_types.py
def to_dataframe(self) -> "pd.DataFrame":
    """Export metrics to pandas DataFrame.

    Returns a flat DataFrame with a 'level' column indicating:
    - 'aggregate': Overall metrics
    - 'criterion': Per-criterion metrics (binary)
    - 'criterion_ordinal': Per-criterion metrics (ordinal)
    - 'criterion_nominal': Per-criterion metrics (nominal)
    - 'judge': Per-judge metrics (if available)
    """
    import pandas as pd

    rows = []

    # Aggregate row
    rows.append(
        {
            "level": "aggregate",
            "name": "overall",
            "criterion_type": "all",
            "accuracy": self.criterion_accuracy,
            "precision": self.criterion_precision,
            "recall": self.criterion_recall,
            "f1": self.criterion_f1,
            "kappa": self.mean_kappa,
            "rmse": self.score_rmse,
            "mae": self.score_mae,
            "spearman": self.score_spearman.coefficient,
            "kendall": self.score_kendall.coefficient,
            "pearson": self.score_pearson.coefficient,
            "bias": self.bias.mean_bias,
            "adjacent_accuracy": None,
            "weighted_kappa": None,
        }
    )

    # Per-criterion rows (handle different types)
    for cm in self.per_criterion:
        if cm.criterion_type == "binary":
            rows.append(
                {
                    "level": "criterion",
                    "name": cm.name,
                    "criterion_type": "binary",
                    "accuracy": cm.accuracy,
                    "precision": cm.precision,
                    "recall": cm.recall,
                    "f1": cm.f1,
                    "kappa": cm.kappa,
                    "rmse": None,
                    "mae": None,
                    "spearman": None,
                    "kendall": None,
                    "pearson": None,
                    "bias": None,
                    "adjacent_accuracy": None,
                    "weighted_kappa": None,
                }
            )
        elif cm.criterion_type == "ordinal":
            rows.append(
                {
                    "level": "criterion",
                    "name": cm.name,
                    "criterion_type": "ordinal",
                    "accuracy": cm.exact_accuracy,
                    "precision": None,
                    "recall": None,
                    "f1": None,
                    "kappa": cm.weighted_kappa,
                    "rmse": cm.rmse,
                    "mae": cm.mae,
                    "spearman": cm.spearman.coefficient,
                    "kendall": cm.kendall.coefficient,
                    "pearson": None,
                    "bias": None,
                    "adjacent_accuracy": cm.adjacent_accuracy,
                    "weighted_kappa": cm.weighted_kappa,
                }
            )
        else:  # nominal
            rows.append(
                {
                    "level": "criterion",
                    "name": cm.name,
                    "criterion_type": "nominal",
                    "accuracy": cm.exact_accuracy,
                    "precision": None,
                    "recall": None,
                    "f1": None,
                    "kappa": cm.kappa,
                    "rmse": None,
                    "mae": None,
                    "spearman": None,
                    "kendall": None,
                    "pearson": None,
                    "bias": None,
                    "adjacent_accuracy": None,
                    "weighted_kappa": None,
                }
            )

    # Per-judge rows (if available)
    if self.per_judge:
        for judge_id, jm in self.per_judge.items():
            rows.append(
                {
                    "level": "judge",
                    "name": judge_id,
                    "criterion_type": "all",
                    "accuracy": jm.criterion_accuracy,
                    "precision": jm.criterion_precision,
                    "recall": jm.criterion_recall,
                    "f1": jm.criterion_f1,
                    "kappa": jm.mean_kappa,
                    "rmse": jm.score_rmse,
                    "mae": jm.score_mae,
                    "spearman": jm.score_spearman.coefficient,
                    "kendall": jm.score_kendall.coefficient,
                    "pearson": jm.score_pearson.coefficient,
                    "bias": jm.bias.mean_bias,
                    "adjacent_accuracy": None,
                    "weighted_kappa": None,
                }
            )

    return pd.DataFrame(rows)
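
A usage sketch: because every row carries 'level' and 'criterion_type' columns, the exported frame can be sliced without reshaping:

df = metrics.to_dataframe()

overall = df[df["level"] == "aggregate"]        # single overall row
per_criterion = df[df["level"] == "criterion"]  # one row per criterion
print(per_criterion[["name", "criterion_type", "accuracy", "kappa"]])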

to_file

to_file(path: str | Path) -> None

Save metrics to a JSON file.

PARAMETER DESCRIPTION
path

Path to the output JSON file.

TYPE: str | Path

Source code in src/autorubric/metrics/_types.py
def to_file(self, path: str | Path) -> None:
    """Save metrics to a JSON file.

    Args:
        path: Path to the output JSON file.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(self.model_dump_json(indent=2), encoding="utf-8")
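
The file is plain JSON produced by model_dump_json, so saved runs can be reloaded with the standard library if you want to post-process them:

import json
from pathlib import Path

metrics.to_file("metrics.json")
data = json.loads(Path("metrics.json").read_text(encoding="utf-8"))
print(data["criterion_accuracy"], data["mean_kappa"])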

CriterionMetrics

Per-criterion binary metrics.

CriterionMetrics

Bases: BaseModel

Metrics for a single binary criterion.

ATTRIBUTE DESCRIPTION
name

Name of the criterion.

TYPE: str

index

Index of the criterion in the rubric.

TYPE: int

criterion_type

Type of criterion ("binary" for this class).

TYPE: Literal['binary']

n_samples

Number of samples used for this criterion.

TYPE: int

accuracy

Binary accuracy (proportion of exact matches).

TYPE: float

precision

Precision for MET class.

TYPE: float

recall

Recall for MET class.

TYPE: float

f1

F1 score for MET class.

TYPE: float

kappa

Cohen's kappa coefficient.

TYPE: float

kappa_interpretation

Human-readable interpretation of kappa.

TYPE: str

support_true

Count of MET in ground truth.

TYPE: int

support_pred

Count of MET in predictions.

TYPE: int


CorrelationResult

Correlation statistics between predicted and ground truth scores.

CorrelationResult

Bases: BaseModel

Result from correlation calculation (Spearman, Kendall, Pearson).

ATTRIBUTE DESCRIPTION
coefficient

The correlation coefficient (-1 to 1).

TYPE: float

p_value

P-value for testing the null hypothesis of no correlation.

TYPE: float | None

ci

Optional confidence interval for the coefficient.

TYPE: ConfidenceInterval | None

interpretation

Human-readable interpretation.

TYPE: str

n_samples

Number of samples used in calculation.

TYPE: int

method

Correlation method used (e.g., "spearman", "kendall", "pearson").

TYPE: str

interpret_correlation staticmethod

interpret_correlation(r: float) -> str

Return human-readable interpretation of correlation coefficient.

Source code in src/autorubric/metrics/_types.py
@staticmethod
def interpret_correlation(r: float) -> str:
    """Return human-readable interpretation of correlation coefficient."""
    abs_r = abs(r)
    if abs_r >= 0.9:
        strength = "very strong"
    elif abs_r >= 0.7:
        strength = "strong"
    elif abs_r >= 0.5:
        strength = "moderate"
    elif abs_r >= 0.3:
        strength = "weak"
    else:
        strength = "very weak"

    direction = "positive" if r >= 0 else "negative"
    return f"{strength} {direction}"

BootstrapResults

Bootstrap confidence intervals for key metrics.

BootstrapResults

Bases: BaseModel

Bootstrap confidence interval results.

ATTRIBUTE DESCRIPTION
accuracy_ci

95% CI for criterion-level accuracy.

TYPE: tuple[float, float]

kappa_ci

95% CI for mean kappa.

TYPE: tuple[float, float]

rmse_ci

95% CI for score RMSE.

TYPE: tuple[float, float]

n_bootstrap

Number of bootstrap samples used.

TYPE: int

confidence_level

Confidence level (default 0.95).

TYPE: float


BootstrapResult

Single bootstrap result with confidence interval.

BootstrapResult

Bases: BaseModel

Bootstrap confidence interval result.

ATTRIBUTE DESCRIPTION
estimate

Point estimate of the statistic.

TYPE: float

ci

Confidence interval from bootstrap.

TYPE: ConfidenceInterval

standard_error

Bootstrap standard error.

TYPE: float

n_bootstrap

Number of bootstrap samples used.

TYPE: int

bootstrap_distribution

Optional array of bootstrap estimates.

TYPE: list[float] | None


ConfidenceInterval

Confidence interval bounds.

ConfidenceInterval

Bases: BaseModel

Confidence interval for a statistic.

ATTRIBUTE DESCRIPTION
lower

Lower bound of the interval.

TYPE: float

upper

Upper bound of the interval.

TYPE: float

confidence

Confidence level (default 0.95 for 95% CI).

TYPE: float

method

Method used to compute the interval.

TYPE: str

width property

width: float

Width of the confidence interval.


JudgeMetrics

Per-judge metrics for ensemble evaluations.

JudgeMetrics

Bases: BaseModel

Metrics for a single judge in an ensemble.

ATTRIBUTE DESCRIPTION
judge_id

Identifier for this judge.

TYPE: str

criterion_accuracy

Overall criterion-level accuracy.

TYPE: float

criterion_precision

Overall precision for MET class.

TYPE: float

criterion_recall

Overall recall for MET class.

TYPE: float

criterion_f1

Overall F1 for MET class.

TYPE: float

mean_kappa

Mean Cohen's kappa across criteria.

TYPE: float

score_rmse

RMSE of cumulative scores.

TYPE: float

score_mae

MAE of cumulative scores.

TYPE: float

score_spearman

Spearman correlation result.

TYPE: CorrelationResult

score_kendall

Kendall tau correlation result.

TYPE: CorrelationResult

score_pearson

Pearson correlation result.

TYPE: CorrelationResult

bias

Systematic bias analysis result.

TYPE: BiasResult


References

Casabianca, J., McCaffrey, D. F., Johnson, M. S., Alper, N., and Zubenko, V. (2025). Validity Arguments For Constructed Response Scoring Using Generative Artificial Intelligence Applications. arXiv:2501.02334.

He, J., Shi, J., Zhuo, T. Y., Treude, C., Sun, J., Xing, Z., Du, X., and Lo, D. (2025). LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead. arXiv:2510.24367.