Metrics

Agreement and correlation metrics for validating LLM judges against ground truth.

Overview

When your dataset includes ground truth labels, compute_metrics() measures how well your LLM judge agrees with human annotations. Metrics include accuracy, precision, recall, F1, Cohen's kappa, correlations, and systematic bias analysis.

For ensemble (multi-judge) evaluations, each per-criterion metrics object also reports inter-judge agreement (judges vs. each other, independent of ground truth). The recommended statistic is Krippendorff's alpha (krippendorff_alpha): it handles unequal or missing raters and is level-aware (nominal vs. ordinal). Fleiss' kappa (fleiss_kappa) is also computed as the classic fixed-rater nominal measure, on complete cases only (ratings rows with a uniform rater count). Both are populated only for an ensemble of ≥2 judges and ≥2 items, and are None otherwise.

One inter-judge statistic on binary/nominal data

On binary and nominal data, Krippendorff's nominal α and Fleiss' κ coincide up to a finite-sample correction (1 − κ_F)/(N·R), where N is the number of items and R the number of raters per item. They are one statistic, not corroborating evidence. summary() therefore reports α as the single primary inter-judge column for binary/nominal criteria and drops the bare Fleiss column (a note explains the omission); to_dataframe() leaves the binary/nominal fleiss_kappa value None. On ordinal data α is distance-aware while Fleiss is nominal (different geometry), so both are kept with a distinguishing note.
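
The identity can be checked numerically. The following is a minimal standard-library sketch, not the library's implementation: `fleiss_kappa` and `krippendorff_alpha_nominal` are illustrative helper names, and the ratings are made-up complete-case data in which every judge rates every item.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for complete-case nominal ratings (one row of R labels per item)."""
    n_items, n_raters = len(ratings), len(ratings[0])
    totals = Counter()
    p_bar = 0.0
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        # Fraction of rater pairs that agree on this item.
        p_bar += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar /= n_items
    n = n_items * n_raters
    # Chance agreement from the pooled category proportions.
    p_e = sum((c / n) ** 2 for c in totals.values())
    return (p_bar - p_e) / (1 - p_e)


def krippendorff_alpha_nominal(ratings):
    """Krippendorff's nominal alpha for the same complete-case layout."""
    n_items, n_raters = len(ratings), len(ratings[0])
    n = n_items * n_raters
    totals = Counter()
    d_o = 0.0
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        # Ordered disagreeing pairs within the item, scaled by (R - 1).
        d_o += (n_raters**2 - sum(c * c for c in counts.values())) / (n_raters - 1)
    d_o /= n
    # Expected disagreement uses the n(n - 1) finite-sample denominator.
    d_e = (n * n - sum(c * c for c in totals.values())) / (n * (n - 1))
    return 1 - d_o / d_e


# Hypothetical votes from 3 judges on 5 items.
ratings = [
    ["met", "met", "met"],
    ["met", "unmet", "met"],
    ["unmet", "unmet", "unmet"],
    ["met", "unmet", "unmet"],
    ["met", "met", "met"],
]
kappa = fleiss_kappa(ratings)
alpha = krippendorff_alpha_nominal(ratings)
correction = (1 - kappa) / (len(ratings) * len(ratings[0]))
print(f"kappa={kappa:.4f} alpha={alpha:.4f} gap={alpha - kappa:.4f} correction={correction:.4f}")
```

The printed gap and correction match, which is why reporting both columns on binary or nominal criteria would add no independent information.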

Research Background

Casabianca et al. (2025) recommend agreement metrics including ICC, Krippendorff's alpha, and quadratic-weighted kappa (QWK), with iterative refinement until agreement with human-labeled subsets is acceptable. He et al. (2025) emphasize that correlation alone can mask systematic bias.
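
To see how correlation can hide a systematic offset, consider a hypothetical judge that scores every item exactly 0.1 higher than the human annotator. This is an illustrative sketch only: the scores are invented, `pearson` is a hand-written helper, and the sign convention for bias (judge minus human) is an assumption rather than something taken from the library.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; None for a constant array, where it is undefined."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


human = [0.2, 0.4, 0.5, 0.7, 0.8]
judge = [h + 0.1 for h in human]  # always 0.1 too generous

r = pearson(human, judge)
mean_bias = sum(j - h for j, h in zip(judge, human)) / len(human)
print(f"pearson={r:.3f}, mean_bias={mean_bias:+.3f}")
```

The correlation is perfect even though every score is inflated, so a bias check is needed alongside the correlation coefficients.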

Quick Example

from autorubric import RubricDataset, LLMConfig, evaluate
from autorubric.graders import CriterionGrader

dataset = RubricDataset.from_file("data_with_ground_truth.json")
grader = CriterionGrader(llm_config=LLMConfig(model="openai/gpt-4.1-mini"))

# evaluate() is awaited, so run this inside an async function or a notebook cell.
result = await evaluate(dataset, grader, show_progress=True)

# Compute metrics
metrics = result.compute_metrics(dataset)

# Formatted summary. The header names the handling modes
# (CANNOT_ASSESS / NA estimands), the criterion-level scalars carry their
# aggregation level (micro vs macro), and binary criteria show φ + FP/FN/FPR/FNR.
print(metrics.summary())

# verbose=True additionally prints the per-judge RMSE/Spearman columns and each
# judge's confusion matrix (the default per-judge line leads with accuracy + kappa + φ).
print(metrics.summary(verbose=True))

# Export options. to_dataframe() uses level-labelled aggregate keys
# (accuracy_micro / accuracy_macro / mean_kappa_macro / kappa_micro / phi_micro / ...)
# and round-trips the handling modes + coverage columns.
df = metrics.to_dataframe()
metrics.to_file("metrics.json")

Bootstrap Confidence Intervals

metrics = result.compute_metrics(
    dataset,
    bootstrap=True,
    n_bootstrap=1000,
    confidence_level=0.95,
    seed=42,
)

print(metrics.summary())
# Bootstrap CIs (95%):
#   Accuracy: [85.2%, 92.1%]
#   Kappa:    [0.712, 0.845]
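
For intuition about what an item-level resample does, here is a minimal standard-library sketch of a percentile bootstrap for accuracy. It is a simplification under stated assumptions: it treats each item as a single correct/incorrect outcome, the percentile method is assumed rather than confirmed as the library's choice, and `bootstrap_accuracy_ci` is a hypothetical helper, not part of autorubric.

```python
import random


def bootstrap_accuracy_ci(correct, n_bootstrap=1000, confidence_level=0.95, seed=None):
    """Percentile bootstrap CI for accuracy, resampling items with replacement."""
    n = len(correct)
    if n == 0:
        return None  # undefined on an empty axis
    rng = random.Random(seed)  # a fixed seed makes the interval reproducible
    stats = []
    for _ in range(n_bootstrap):
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    alpha = 1.0 - confidence_level
    lo_idx = int((alpha / 2) * n_bootstrap)
    hi_idx = min(n_bootstrap - 1, int((1 - alpha / 2) * n_bootstrap))
    return stats[lo_idx], stats[hi_idx]


# Hypothetical per-item outcomes: 1 = judge matched the human label, 0 = it did not.
correct = [1] * 44 + [0] * 6
low, high = bootstrap_accuracy_ci(correct, seed=42)
print(f"accuracy={sum(correct) / len(correct):.1%}, 95% CI=[{low:.1%}, {high:.1%}]")
```

Resampling whole items keeps all of an item's criteria together, which is what makes the interval reflect item-to-item variability.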

Per-Judge Metrics (Ensemble)

metrics = result.compute_metrics(
    dataset,
    per_judge=True,
)

for judge_id, jm in metrics.per_judge.items():
    # jm.criterion_accuracy is `float | None` (None when undefined); score_rmse is always a float.
    acc = f"{jm.criterion_accuracy:.1%}" if jm.criterion_accuracy is not None else "n/a"
    print(f"{judge_id}: Accuracy={acc}, RMSE={jm.score_rmse:.4f}")

Metric Fields

None means genuinely undefined, never a fabricated 0.0

The numeric metric fields below are typed float | None. A field is None when the metric is genuinely undefined for the data at hand — it is never silently reported as a fake 0.0. Always guard the format spec (e.g. f"{x:.2f}" if x is not None else "n/a") before printing these.
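
One way to apply that guard consistently is a small formatting helper. This is a sketch only; `fmt` is a hypothetical name and is not provided by the library.

```python
def fmt(value, spec=".2f", missing="n/a"):
    """Format a float | None metric, printing a placeholder when it is undefined."""
    return format(value, spec) if value is not None else missing


print(fmt(0.5))           # a defined metric
print(fmt(None))          # an undefined metric
print(fmt(0.912, ".1%"))  # percentage formatting
```

Passing the format spec as an argument keeps every print site from repeating the `is not None` check.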

Field	Description
`criterion_accuracy`	Overall accuracy across all criteria. `float \| None` — `None` when undefined (e.g. no paired predictions).
`criterion_precision`	Precision for the binary MET class. `float \| None` — `None` when not applicable, e.g. a multi-choice-only rubric (no binary MET class).
`criterion_recall`	Recall for the binary MET class. `float \| None` — `None` when not applicable (multi-choice-only rubric).
`criterion_f1`	F1 for the binary MET class. `float \| None` — `None` when not applicable (multi-choice-only rubric).
`mean_kappa`	Mean Cohen's kappa across criteria (macro — unweighted mean over criteria). `float \| None` — `None` when undefined (e.g. degenerate single-class).
`macro_accuracy`	Unweighted mean of the per-criterion accuracies (macro). `float \| None`.
`micro_kappa`	Cohen's kappa pooled across criteria (micro, distinct from the macro `mean_kappa`). `float \| None`.
`criterion_phi`	Matthews correlation coefficient (φ) pooled over the binary MET-vs-rest flats (micro). `float \| None` — `None` for a multi-choice-only rubric or on single-class data. φ = Pearson = Spearman = Kendall = MCC on binary data; the κ − φ gap is the judge's positive-rate drift.
`mean_krippendorff_alpha`	Macro mean of the per-criterion Krippendorff's α (inter-judge). `float \| None`.
`cannot_assess_mode` / `na_mode`	How CANNOT_ASSESS / NA were handled when the metrics were computed (`exclude` / `as_unmet` / `as_category`). Frozen on the result and round-tripped by `to_file` so a serialized number is never ambiguous among the estimands.
`n_samples`	Total paired observations contributing to the aggregate metrics. `int \| None`.
`coverage_stats`	Under the `exclude` mode, how much of the raw paired sample survived abstention/error exclusion (`CoverageStats \| None`). Counts `n_total` (raw pre-exclusion denominator), `n_covered` (== per-criterion `n_samples`), and `n_errored`; rates `coverage`, `judge_abstain_rate`, `gt_abstain_rate`, `union_exclusion_rate`, `error_rate` are each `float \| None` (`None` when `n_total == 0`).
`per_criterion`	Per-criterion metrics breakdown (polymorphic: `CriterionMetrics`, `OrdinalCriterionMetrics`, `NominalCriterionMetrics`). Their per-criterion numeric fields (`accuracy`, `precision`, `recall`, `f1`, `kappa`, `weighted_kappa`, `adjacent_accuracy`, per-option metrics) are likewise `float \| None` when undefined.
`score_rmse`	RMSE of cumulative scores (always a `float`).
`score_mae`	MAE of cumulative scores (always a `float`).
`score_spearman`	Spearman rank correlation (`CorrelationResult`). Its `.coefficient` is `float \| None` — `None` for a constant array or fewer than 3 samples.
`score_kendall`	Kendall tau correlation (`CorrelationResult`). `.coefficient` is `float \| None` (`None` for a constant array or < 3 samples).
`score_pearson`	Pearson correlation (`CorrelationResult`). `.coefficient` is `float \| None` (`None` for a constant array or < 3 samples).
`bias`	Systematic bias analysis (`BiasResult`). Its `.mean_bias` / `.std_bias` are `float \| None` — `mean_bias` is `None` at n=0 and `std_bias` is `None` for n < 2.
`bootstrap`	Bootstrap confidence intervals (`BootstrapResults`, if enabled)
`per_judge`	Per-judge metrics for ensemble (`dict[str, JudgeMetrics]`, if enabled)
`n_items`	Number of items used in computation
`n_criteria`	Number of criteria
`n_binary_criteria`	Number of binary criteria
`n_ordinal_criteria`	Number of ordinal multi-choice criteria
`n_nominal_criteria`	Number of nominal multi-choice criteria
`na_stats`	Statistics for NA handling in multi-choice criteria (`NAStats`): `na_count_true` / `na_count_pred` counts, `na_kappa` (`float \| None`) on the {NA, not-NA} dichotomy, and `na_false_positive` / `na_false_negative`.
`cannot_assess_stats`	Statistics for CANNOT_ASSESS handling in binary criteria (`CannotAssessStats`) — the binary parallel of `na_stats` (a distinct kind of abstention; see below): `ca_count_true` / `ca_count_pred` counts, `ca_kappa` (`float \| None`) on the {CANNOT_ASSESS, not-CANNOT_ASSESS} dichotomy, and `ca_false_positive` / `ca_false_negative`.
`warnings`	Any warnings generated during computation
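
The relationship between κ and φ noted for `criterion_phi` can be illustrated from a 2×2 confusion matrix. The sketch below uses the textbook formulas with made-up counts; `phi` and `cohen_kappa` are illustrative helpers, not the library's functions. When the judge's positive rate equals the ground-truth positive rate (FP == FN) the two coincide, and they separate once the rates drift apart.

```python
from math import sqrt


def phi(tp, fp, fn, tn):
    """Matthews correlation (phi) from a 2x2 confusion matrix; None when undefined."""
    denom = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denom == 0:
        return None  # single-class data: one marginal is empty
    return (tp * tn - fp * fn) / sqrt(denom)


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa from the same matrix; None when undefined."""
    n = tp + fp + fn + tn
    if n == 0:
        return None
    p_o = (tp + tn) / n
    # Chance agreement from predicted and true marginals.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if p_e == 1:
        return None
    return (p_o - p_e) / (1 - p_e)


balanced = (40, 10, 10, 40)  # judge and ground truth both say MET 50% of the time
drifted = (45, 20, 5, 30)    # judge says MET 65% of the time, ground truth 50%

for name, cm in (("balanced", balanced), ("drifted", drifted)):
    print(f"{name}: kappa={cohen_kappa(*cm):.4f} phi={phi(*cm):.4f}")
```

A κ noticeably below φ is therefore a hint to inspect the FP/FN columns rather than a sign of random disagreement.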

compute_metrics

Compute agreement metrics between predictions and ground truth.

compute_metrics(eval_result: EvalResult, dataset: RubricDataset, *, bootstrap: bool = False, n_bootstrap: int = 1000, per_judge: bool = False, cannot_assess: CannotAssessMode = 'exclude', na_mode: NAMode = 'exclude', confidence_level: float = 0.95, seed: int | None = None, per_item_metrics: Literal['auto', 'pooled', 'per_criterion'] = 'auto') -> MetricsResult

Compute comprehensive evaluation metrics.

This is the main entry point for computing metrics from an evaluation run. It compares predicted verdicts and scores against ground truth from the dataset. Supports binary, ordinal, and nominal (multi-choice) criteria.

PARAMETER	DESCRIPTION
`eval_result`	The evaluation result from EvalRunner. TYPE: `EvalResult`
`dataset`	The dataset with ground truth labels. TYPE: `RubricDataset`
`bootstrap`	If True, compute bootstrap confidence intervals (expensive). Covers ANY rubric type via an item-level resample: `accuracy_ci`←`criterion_accuracy`, `kappa_ci`←`mean_kappa` (ordinal quadratic-weighted), `rmse_ci`←`score_rmse`. Each CI is `None` when undefined (empty/degenerate axis). TYPE: `bool` DEFAULT: `False`
`n_bootstrap`	Number of bootstrap samples if bootstrap=True. TYPE: `int` DEFAULT: `1000`
`per_judge`	If True and ensemble, compute per-judge metrics. TYPE: `bool` DEFAULT: `False`
`cannot_assess`	How to handle CANNOT_ASSESS verdicts (binary criteria): "exclude": Skip pairs where either is CANNOT_ASSESS (default). "as_unmet": Treat CANNOT_ASSESS as UNMET. "as_category": Keep CANNOT_ASSESS as a distinct third class. Accuracy and Cohen's kappa are then computed over three classes (a CANNOT_ASSESS prediction matching a CANNOT_ASSESS ground truth counts as correct); precision/recall/f1 remain MET-vs-rest. TYPE: `CannotAssessMode` DEFAULT: `'exclude'`
`na_mode`	How to handle NA options (multi-choice criteria). Mirrors `cannot_assess` for binary — NA on multi-choice is the structural analog of CANNOT_ASSESS on binary: "exclude": Skip pairs where either is NA (default). "as_unmet": Remap NA to the score-minimizing non-NA option, weight-sign aware (lowest `value` for non-negative weight, highest `value` for negative weight). Shares `Criterion.worst_scored_option()` with the grader's `unknown`-error worst-case path so the layers cannot drift. "as_category": Keep NA as a distinct categorical column. Refused for ordinal criteria with an NA option (raises `ValueError`): NA has no ordinal position, so quadratic weighted Cohen's kappa would assign NA a geometrically meaningless distance. TYPE: `NAMode` DEFAULT: `'exclude'`
`confidence_level`	Confidence level for bootstrap CIs (default 0.95). TYPE: `float` DEFAULT: `0.95`
`seed`	Random seed for bootstrap reproducibility. TYPE: `int \| None` DEFAULT: `None`
`per_item_metrics`	How to handle per-item-rubric datasets (no global rubric): "auto" (default): pool rubric-point metrics ONLY when the dataset has no global rubric AND item rubrics genuinely differ (heterogeneous, e.g. HealthBench); otherwise use the normal per-criterion path. "pooled": always pool (see `MetricsResult.pooled_by_scale`). "per_criterion": always use the per-criterion path (requires a homogeneous criteria structure across items, else raises). TYPE: `Literal['auto', 'pooled', 'per_criterion']` DEFAULT: `'auto'`
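
The effect of the `cannot_assess` modes and of the `na_mode="as_unmet"` remapping can be sketched on toy data. The code below is an illustration under assumptions, not the library's code: verdicts are plain strings instead of `CriterionVerdict` values, and `apply_cannot_assess`, `accuracy`, and `worst_option_index` are hypothetical helpers (the last one only mimics the behaviour described for `Criterion.worst_scored_option()`).

```python
def apply_cannot_assess(pairs, mode):
    """Return the (prediction, truth) pairs that enter accuracy under each mode."""
    if mode == "exclude":
        return [(p, t) for p, t in pairs if "CANNOT_ASSESS" not in (p, t)]
    if mode == "as_unmet":
        def remap(v):
            return "UNMET" if v == "CANNOT_ASSESS" else v
        return [(remap(p), remap(t)) for p, t in pairs]
    if mode == "as_category":
        return list(pairs)  # CANNOT_ASSESS stays a distinct third class
    raise ValueError(f"unknown mode: {mode!r}")


def accuracy(pairs):
    """Share of matching pairs; None when no pairs survive."""
    return sum(p == t for p, t in pairs) / len(pairs) if pairs else None


def worst_option_index(values, weight, na_index=None):
    """Score-minimizing non-NA option: lowest value for weight >= 0, highest otherwise."""
    candidates = [i for i in range(len(values)) if i != na_index]
    pick = min if weight >= 0 else max
    return pick(candidates, key=lambda i: values[i])


pairs = [
    ("MET", "MET"),
    ("MET", "MET"),
    ("CANNOT_ASSESS", "UNMET"),
    ("CANNOT_ASSESS", "CANNOT_ASSESS"),
    ("UNMET", "MET"),
]
for mode in ("exclude", "as_unmet", "as_category"):
    kept = apply_cannot_assess(pairs, mode)
    print(f"{mode}: n={len(kept)}, accuracy={accuracy(kept):.3f}")

# Option values with an NA option at index 3 (hypothetical rubric).
values = [0.0, 0.5, 1.0, 0.0]
print(worst_option_index(values, weight=1.0, na_index=3))
print(worst_option_index(values, weight=-1.0, na_index=3))
```

The same data yields three different accuracies, which is why the chosen modes are stored with the result and should be reported alongside any number.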

RETURNS	DESCRIPTION
`MetricsResult`	MetricsResult with comprehensive metrics and optional per-judge breakdown. For heterogeneous per-item rubrics, `per_criterion` is empty and the pooled rubric-point view is in `pooled_by_scale` (one entry per scale type present).

RAISES	DESCRIPTION
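`ValueError`	If eval_result and dataset share no common items. The source below also raises it when no item has usable ground truth, when no item has a computed score, or when items have different rubric structures on the per-criterion path.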

Example

result = await evaluate(dataset, grader)
metrics = result.compute_metrics(dataset)
print(metrics.summary())
df = metrics.to_dataframe()

Source code in src/autorubric/metrics/_compute.py

def compute_metrics(
    eval_result: EvalResult,
    dataset: RubricDataset,
    *,
    bootstrap: bool = False,
    n_bootstrap: int = 1000,
    per_judge: bool = False,
    cannot_assess: CannotAssessMode = "exclude",
    na_mode: NAMode = "exclude",
    confidence_level: float = 0.95,
    seed: int | None = None,
    per_item_metrics: Literal["auto", "pooled", "per_criterion"] = "auto",
) -> MetricsResult:
    """Compute comprehensive evaluation metrics.

    This is the main entry point for computing metrics from an evaluation run.
    It compares predicted verdicts and scores against ground truth from the dataset.
    Supports binary, ordinal, and nominal (multi-choice) criteria.

    Args:
        eval_result: The evaluation result from EvalRunner.
        dataset: The dataset with ground truth labels.
        bootstrap: If True, compute bootstrap confidence intervals (expensive). Covers ANY
            rubric type via an item-level resample: ``accuracy_ci``←``criterion_accuracy``,
            ``kappa_ci``←``mean_kappa`` (ordinal quadratic-weighted), ``rmse_ci``←``score_rmse``.
            Each CI is ``None`` when undefined (empty/degenerate axis).
        n_bootstrap: Number of bootstrap samples if bootstrap=True.
        per_judge: If True and ensemble, compute per-judge metrics.
        cannot_assess: How to handle CANNOT_ASSESS verdicts (binary criteria):
            - "exclude": Skip pairs where either is CANNOT_ASSESS (default)
            - "as_unmet": Treat CANNOT_ASSESS as UNMET
            - "as_category": Keep CANNOT_ASSESS as a distinct third class. Accuracy and
              Cohen's kappa are then computed over three classes (a CANNOT_ASSESS
              prediction matching a CANNOT_ASSESS ground truth counts as correct);
              precision/recall/f1 remain MET-vs-rest.
        na_mode: How to handle NA options (multi-choice criteria). Mirrors
            ``cannot_assess`` for binary — NA on multi-choice is the structural
            analog of CANNOT_ASSESS on binary:

            - "exclude": Skip pairs where either is NA (default).
            - "as_unmet": Remap NA to the score-minimizing non-NA option,
              weight-sign aware (lowest ``value`` for non-negative weight,
              highest ``value`` for negative weight). Shares
              ``Criterion.worst_scored_option()`` with the grader's
              ``unknown``-error worst-case path so the layers cannot drift.
            - "as_category": Keep NA as a distinct categorical column.
              **Refused for ordinal criteria with an NA option** (raises
              ``ValueError``): NA has no ordinal position, so quadratic
              weighted Cohen's kappa would assign NA a geometrically
              meaningless distance.
        confidence_level: Confidence level for bootstrap CIs (default 0.95).
        seed: Random seed for bootstrap reproducibility.
        per_item_metrics: How to handle per-item-rubric datasets (no global rubric):

            - "auto" (default): pool rubric-point metrics ONLY when the dataset has no global
              rubric AND item rubrics genuinely differ (heterogeneous, e.g. HealthBench);
              otherwise use the normal per-criterion path.
            - "pooled": always pool (see ``MetricsResult.pooled_by_scale``).
            - "per_criterion": always use the per-criterion path (requires a homogeneous
              criteria structure across items, else raises).

    Returns:
        MetricsResult with comprehensive metrics and optional per-judge breakdown. For
        heterogeneous per-item rubrics, ``per_criterion`` is empty and the pooled rubric-point
        view is in ``pooled_by_scale`` (one entry per scale type present).

    Raises:
        ValueError: If no common items between eval_result and dataset.

    Example:
        >>> result = await evaluate(dataset, grader)
        >>> metrics = result.compute_metrics(dataset)
        >>> print(metrics.summary())
        >>> df = metrics.to_dataframe()
    """
    result_warnings: list[str] = []

    # Build map of item_idx -> ItemResult
    eval_map = {ir.item_idx: ir for ir in eval_result.item_results}

    # Check for missing/extra items
    dataset_indices = set(range(len(dataset)))
    eval_indices = set(eval_map.keys())

    missing = dataset_indices - eval_indices
    if missing:
        result_warnings.append(f"{len(missing)} items from dataset not found in eval_result")

    extra = eval_indices - dataset_indices
    if extra:
        result_warnings.append(f"{len(extra)} items in eval_result not in dataset")

    # Use intersection
    common_indices = sorted(dataset_indices & eval_indices)

    if not common_indices:
        raise ValueError("No common items between eval_result and dataset")

    # Per-item heterogeneous rubrics (e.g. HealthBench) have no shared per-criterion table;
    # pool rubric-point decisions instead of forcing an index-aligned per-criterion view.
    use_pooled = per_item_metrics == "pooled" or (
        per_item_metrics == "auto"
        and dataset.rubric is None
        and _has_heterogeneous_rubrics(dataset, common_indices)
    )
    if use_pooled:
        return _compute_per_item_pooled_metrics(
            eval_map,
            dataset,
            common_indices,
            result_warnings=result_warnings,
            cannot_assess=cannot_assess,
            na_mode=na_mode,
        )

    # Validate rubric homogeneity for metrics computation
    # If using per-item rubrics, all must have the same structure
    if dataset.rubric is not None:
        reference_rubric = dataset.rubric
    else:
        # Get rubric from first item
        reference_rubric = dataset.get_item_rubric(common_indices[0])

    reference_n_criteria = len(reference_rubric.rubric)

    for idx in common_indices:
        item_rubric = dataset.get_item_rubric(idx)
        if len(item_rubric.rubric) != reference_n_criteria:
            raise ValueError(
                f"Cannot compute metrics: items have different rubric structures. "
                f"Item {idx} has {len(item_rubric.rubric)} criteria but "
                f"expected {reference_n_criteria}. "
                f"Metrics require homogeneous rubric structures across all items."
            )

    # Use the reference rubric for classification
    criteria = list(reference_rubric.rubric)
    criterion_types = classify_criteria(criteria)
    n_criteria = len(criteria)

    # Count criteria by type
    n_binary = sum(1 for ct in criterion_types if ct == "binary")
    n_ordinal = sum(1 for ct in criterion_types if ct == "ordinal")
    n_nominal = sum(1 for ct in criterion_types if ct == "nominal")

    # Per-criterion data storage
    # For binary: list[CriterionVerdict]
    # For multi-choice: list[int] (option indices). A predicted index may transiently be
    # None for a genuine multi-choice error-abstain; it is normalized to the
    # effective NA index right after the effective criteria are built, so consumers below
    # only ever see CriterionVerdict | int.
    per_criterion_pred: list[list[CriterionVerdict | int | None]] = [[] for _ in range(n_criteria)]
    per_criterion_true: list[list[CriterionVerdict | int]] = [[] for _ in range(n_criteria)]

    # Overall scores
    all_pred_scores: list[float] = []
    all_true_scores: list[float] = []

    # For ensemble: per-judge data (binary verdicts + multi-choice option indices).
    judge_scores: dict[str, list[float]] = {}
    judge_verdicts: dict[str, list[list[CriterionVerdict]]] = {}
    # Per-judge multi-choice predictions (items x criteria); binary cells are a None
    # placeholder. A multi-choice cell may transiently be None (genuine error-abstain);
    # it is normalized to the effective NA index after the effective criteria are
    # built, mirroring the aggregate per_criterion_pred normalization.
    judge_mc_preds: dict[str, list[list[int | None]]] = {}
    judge_errors: dict[str, list[list[str | None]]] = {}
    is_ensemble = False

    # Per-item ground-truth verdicts (all criteria) aligned 1:1 with each item that
    # contributes ensemble per-judge data; used for the per-judge metrics fix.
    per_item_true: list[list[CriterionVerdict | int]] = []

    # Fleiss' kappa ratings rows, per criterion (only ensemble reports produce rows).
    fleiss_rows: dict[int, list[list[int]]] = {c: [] for c in range(n_criteria)}

    # Krippendorff's alpha: per criterion, one dict per ensemble item mapping
    # judge_id -> numeric cell value (np.nan = missing). Rows (judges) and columns
    # (items) are assembled after the loop using the final judge id set.
    alpha_cells: dict[int, list[dict[str, float]]] = {c: [] for c in range(n_criteria)}

    items_with_ground_truth = 0
    # Count GT-bearing items lost to a grading error (skipped below). These have ground truth
    # (the no-ground-truth case is handled separately) but no usable verdicts, so they reduce
    # coverage. Feeds the CoverageStats error_rate and a warning.
    n_errored_items = 0

    # NA tracking for multi-choice
    total_na_true = 0
    total_na_pred = 0
    total_na_fp = 0
    total_na_fn = 0

    for idx in common_indices:
        item = dataset.items[idx]
        item_result = eval_map[idx]
        report = item_result.report

        if item.ground_truth is None:
            result_warnings.append(f"Item {idx} has no ground truth, skipping")
            continue

        if item_result.error is not None:
            # GT-bearing item lost to a grading error: counted toward the raw coverage
            # denominator (it had ground truth) but contributes no usable verdicts.
            n_errored_items += 1
            continue

        items_with_ground_truth += 1

        # Extract predictions using type-aware extraction
        pred_all = extract_all_verdicts_from_report(report, criteria)

        # Resolve ground truth (string labels → indices for multi-choice)
        try:
            true_all = resolve_ground_truth(list(item.ground_truth), criteria)
        except ValueError as e:
            result_warnings.append(f"Item {idx}: {e}")
            continue

        # Store per-criterion data
        for c_idx in range(n_criteria):
            pred_val = pred_all[c_idx]
            true_val = true_all[c_idx]

            # Handle None predictions (failed extraction). Binary None -> UNMET (the
            # conservative default). A multi-choice None is a GENUINE error-abstain (no NA
            # option, forced-choice): leave it as None here and normalize it to the
            # effective criterion's NA index after the effective criteria are built below,
            # so it is recognized as NA instead of being silently counted as option 0.
            if pred_val is None and criterion_types[c_idx] == "binary":
                pred_val = CriterionVerdict.UNMET

            per_criterion_pred[c_idx].append(pred_val)
            per_criterion_true[c_idx].append(true_val)

        # Score-level aggregation (RMSE/correlation/bias). A grade-FAILURE has no score
        # (report.error set, score is None): EXCLUDE it from the paired score arrays
        # rather than fabricating a 0.0 — a fake 0.0 would corrupt RMSE/bias and is
        # indistinguishable from a real catastrophic score. The per-criterion verdict
        # arrays above are unaffected (they handle errored verdicts on their own terms).
        # Item-level errors are already skipped earlier; this catches a report-level
        # error with no item-level error (e.g. the "No judge results" report).
        if report.error is None and report.score is not None:
            # For true score, need to pass the original ground truth format.
            # compute_weighted_score expects CriterionVerdict for binary, str for multi-choice.
            true_score_verdicts = []
            for c_idx in range(n_criteria):
                if criterion_types[c_idx] == "binary":
                    true_score_verdicts.append(true_all[c_idx])
                else:
                    # For multi-choice, pass the option label (string)
                    criterion = criteria[c_idx]
                    opt_idx = true_all[c_idx]
                    if isinstance(opt_idx, int) and 0 <= opt_idx < len(criterion.options):
                        true_score_verdicts.append(criterion.options[opt_idx].label)
                    else:
                        # Default to first option if index is invalid
                        true_score_verdicts.append(criterion.options[0].label)

            # Use the item's effective rubric (== the global rubric when one is set, so this
            # is a no-op there) so homogeneous per-item-rubric datasets don't raise on a
            # missing global rubric. The predicted score was computed with the same rubric.
            true_score = dataset.compute_weighted_score(
                true_score_verdicts, rubric=dataset.get_item_rubric(idx)
            )

            all_pred_scores.append(report.score)
            all_true_scores.append(true_score)

        # Check if ensemble and collect per-judge data. Gate on the SAME score/error
        # condition as the score-level append above so per-item arrays stay length-aligned
        # with `all_true_scores`: a score-less report (report-level error, score None)
        # contributes nothing to per-judge metrics or inter-judge agreement, exactly as it
        # contributes nothing to the aggregate score metrics. (In normal operation a
        # score-less ensemble report has empty judge_scores anyway; this also keeps a
        # hand-built / deserialized score-less report from de-aligning the arrays.)
        if (
            report.error is None
            and report.score is not None
            and hasattr(report, "judge_scores")
            and report.judge_scores
        ):
            is_ensemble = True
            for jid, score in report.judge_scores.items():
                if jid not in judge_scores:
                    judge_scores[jid] = []
                    judge_verdicts[jid] = []
                    judge_mc_preds[jid] = []
                    judge_errors[jid] = []
                judge_scores[jid].append(score)

            # Align ground truth (all criteria) once per ensemble item.
            per_item_true.append(list(true_all))

            # Extract per-judge verdicts (binary) + multi-choice indices + errors from
            # EnsembleCriterionReport.votes / .multi_choice_votes. A binary criterion
            # yields a verdict and a None multi-choice placeholder; a multi-choice
            # criterion yields a placeholder UNMET verdict and the vote's selected_index
            # (raw int|None — None is a genuine abstain, normalized later). The error
            # is captured per criterion from whichever vote type matched, so errored MC
            # votes are skipped with the same parity as binary.
            if hasattr(report, "report") and report.report:
                for jid in judge_scores.keys():
                    judge_v: list[CriterionVerdict] = []
                    judge_mc: list[int | None] = []
                    judge_e: list[str | None] = []
                    for c_idx, cr in enumerate(report.report):
                        c_type = (
                            criterion_types[c_idx] if c_idx < len(criterion_types) else "binary"
                        )
                        if c_type == "binary":
                            judge_mc.append(None)
                            votes = getattr(cr, "votes", None) or []
                            for vote in votes:
                                if vote.judge_id == jid:
                                    judge_v.append(vote.verdict)
                                    judge_e.append(vote.error)
                                    break
                            else:
                                judge_v.append(CriterionVerdict.UNMET)
                                judge_e.append(None)
                        else:
                            judge_v.append(CriterionVerdict.UNMET)  # placeholder
                            mc_votes = getattr(cr, "multi_choice_votes", None) or []
                            for vote in mc_votes:
                                if vote.judge_id == jid:
                                    judge_mc.append(vote.selected_index)
                                    judge_e.append(vote.error)
                                    break
                            else:
                                judge_mc.append(None)
                                judge_e.append(None)
                    if jid in judge_verdicts:
                        judge_verdicts[jid].append(judge_v)
                        judge_mc_preds[jid].append(judge_mc)
                        judge_errors[jid].append(judge_e)

            # Inter-judge agreement collection (binary + multi-choice) from ensemble votes.
            if hasattr(report, "report") and report.report:
                n_judges = len(report.judge_scores)
                for c_idx in range(n_criteria):
                    cr = report.report[c_idx]
                    c_type = criterion_types[c_idx]
                    # Fleiss: complete-case ratings row (uniform rater count).
                    row = _build_fleiss_row(
                        cr,
                        criteria[c_idx],
                        c_type,
                        cannot_assess,
                        n_judges,
                    )
                    if row is not None:
                        fleiss_rows[c_idx].append(row)
                    # Krippendorff alpha: per-judge cells (missing handled natively).
                    votes = (
                        cr.votes if c_type == "binary" else getattr(cr, "multi_choice_votes", [])
                    )
                    cell_map: dict[str, float] = {
                        v.judge_id: _build_alpha_cell(v, c_type, cannot_assess)
                        for v in (votes or [])
                    }
                    alpha_cells[c_idx].append(cell_map)

    n_items = items_with_ground_truth

    if n_items == 0:
        raise ValueError("No valid items with ground truth found")

    # Score-level metrics need ≥1 scoreable (non-errored, real-float) item. Every
    # ground-truth item having a report-level error would leave these arrays empty
    # (sklearn's mean_squared_error rejects empty input). Treat it like no-valid-items
    # rather than fabricating a score.
    if not all_pred_scores:
        raise ValueError("No valid items with a computed score found")

    # Reconstruct the effective criterion for any multi-choice criterion whose graded
    # reports used an auto-injected NA option OR produced a genuine None error-abstain.
    # The grader appends an auto-injected NA at index N = len(author.options) — out of
    # range for the author rubric used above — and emits selected_index=None when it had to
    # abstain with no NA option. We normalize only when an out-of-range OR a None prediction
    # is actually observed, so forced-choice runs without abstains are unaffected and never
    # gain a spurious NA column. ``with_guaranteed_na_option`` is the same pure helper the
    # grader uses, so the two layers cannot drift.
    effective_criteria = list(criteria)
    for c_idx in range(n_criteria):
        if criterion_types[c_idx] == "binary":
            continue
        author_c = criteria[c_idx]
        n_author = len(author_c.options) if author_c.options else 0

        def _needs_na(v: object, n_author: int = n_author) -> bool:
            return (isinstance(v, int) and v >= n_author) or v is None

        observed = any(_needs_na(v) for v in per_criterion_pred[c_idx])
        if not observed:
            # Also consider per-judge multi-choice cells: a single judge may have
            # abstained (None) or picked the injected NA while the aggregate verdict
            # did not, so the effective criterion still needs an NA option for the
            # per-judge normalization to recognize that cell.
            observed = any(
                c_idx < len(row) and _needs_na(row[c_idx])
                for rows in judge_mc_preds.values()
                for row in rows
            )
        if observed:
            effective_criteria[c_idx] = author_c.with_guaranteed_na_option()

    # Normalize any remaining None multi-choice predictions (genuine error-abstains) to
    # the effective criterion's NA index, so every downstream consumer sees only ints and the
    # abstain is recognized as NA (FP/FN, na_kappa, filtering) under every na_mode. The
    # reconstruction above guarantees a NA option exists for any criterion that had a None.
    for c_idx in range(n_criteria):
        if criterion_types[c_idx] == "binary":
            continue
        na_idx = effective_criteria[c_idx].na_option_index
        if na_idx is None:
            continue
        per_criterion_pred[c_idx] = [na_idx if v is None else v for v in per_criterion_pred[c_idx]]

    # Mirror the aggregate None→NA normalization for each judge's multi-choice predictions,
    # using the SAME effective_criteria. A judge's None multi-choice cell is either a binary
    # placeholder (no NA option to point at) or a genuine abstain on a multi-choice
    # criterion; only multi-choice cells with a resolvable NA index are normalized, so binary
    # placeholders stay None and are ignored by the per-judge multi-choice path.
    for jid in judge_mc_preds:
        for item_row in judge_mc_preds[jid]:
            for c_idx in range(min(n_criteria, len(item_row))):
                if criterion_types[c_idx] == "binary":
                    continue
                if item_row[c_idx] is not None:
                    continue
                na_idx = effective_criteria[c_idx].na_option_index
                if na_idx is None:
                    continue
                item_row[c_idx] = na_idx

    # Compute per-criterion metrics by type
    per_criterion: list[CriterionMetricsUnion] = []
    # Collects the per-criterion kappas (binary Cohen, ordinal weighted, nominal). Each may
    # be None (degenerate single-class) — _mean_or_none excludes None when averaging.
    criterion_kappas: list[float | None] = []

    # Inter-judge agreement (Krippendorff's alpha + Fleiss' kappa) is only meaningful
    # with an ensemble of >=2 judges (>=2 items is enforced downstream).
    eligible = is_ensemble and len(judge_scores) >= 2

    # Precompute Krippendorff's alpha per criterion from the collected reliability cells.
    # Rows = judges (fixed judge-id order), columns = items; np.nan marks missing ratings.
    # Alpha uses ALL items (missing handled natively) — no complete-case dropping.
    judge_ids = list(judge_scores.keys())
    krippendorff_alphas: dict[int, float | None] = dict.fromkeys(range(n_criteria))
    if eligible:
        for c_idx in range(n_criteria):
            level: Literal["nominal", "ordinal"] = (
                "ordinal" if criterion_types[c_idx] == "ordinal" else "nominal"
            )
            cell_maps = alpha_cells.get(c_idx, [])
            reliability_data = [
                [cm.get(jid, float("nan")) for cm in cell_maps] for jid in judge_ids
            ]
            krippendorff_alphas[c_idx] = _compute_krippendorff_alpha(reliability_data, level)

    for c_idx in range(n_criteria):
        criterion = criteria[c_idx]
        c_type = criterion_types[c_idx]
        pred_data = per_criterion_pred[c_idx]
        true_data = per_criterion_true[c_idx]

        if c_type == "binary":
            # Binary criterion metrics
            pred_verdicts = [v for v in pred_data if isinstance(v, CriterionVerdict)]
            true_verdicts = [v for v in true_data if isinstance(v, CriterionVerdict)]

            # Handle CANNOT_ASSESS centrally and build label + MET-vs-rest reps.
            label_pred, label_true, met_pred, met_true = prepare_binary_metric_inputs(
                pred_verdicts, true_verdicts, cannot_assess
            )

            name = criterion.name or f"Criterion {c_idx + 1}"

            fleiss_kappa = _compute_fleiss_kappa(fleiss_rows.get(c_idx)) if eligible else None
            krippendorff_alpha = krippendorff_alphas.get(c_idx) if eligible else None

            if not label_pred:
                # No samples → metric values are undefined (None); counts stay 0. Do NOT
                # append to criterion_kappas (matches per-judge, which skips empty binary;
                # _mean_or_none would exclude a None regardless, so parity holds either way).
                per_criterion.append(
                    CriterionMetrics(
                        name=name,
                        index=c_idx,
                        n_samples=0,
                        accuracy=None,
                        precision=None,
                        recall=None,
                        f1=None,
                        kappa=None,
                        kappa_interpretation="undefined",
                        krippendorff_alpha=krippendorff_alpha,
                        fleiss_kappa=fleiss_kappa,
                        support_true=0,
                        support_pred=0,
                    )
                )
                continue

            c_acc = accuracy_score(label_true, label_pred)
            c_prec = precision_score(met_true, met_pred, zero_division=0)
            c_rec = recall_score(met_true, met_pred, zero_division=0)
            c_f1 = f1_score(met_true, met_pred, zero_division=0)

            # None on degenerate single-class data (NaN) or failure — never a fake 0.0.
            c_kappa = _kappa_or_none(label_true, label_pred)

            criterion_kappas.append(c_kappa)

            # 2x2 confusion matrix on the MET-vs-rest dichotomy (rows=true, cols=pred,
            # labels ["MET","UNMET"]). Built from the same met flats so FPR/FNR derived from
            # ``.fpr``/``.fnr`` honour undefined→None at a zero denominator.
            c_cm = _build_binary_2x2_confusion_matrix(met_true, met_pred)
            # phi (MCC) on the MET-vs-rest dichotomy: None on single-class — never a fake 0.0.
            c_phi = _mcc_or_none(met_true, met_pred)
            # Degenerate iff there were samples but agreement (kappa) could not be estimated
            # because the data collapsed onto a single class — distinct from the no-data case.
            c_degenerate = c_kappa is None

            per_criterion.append(
                CriterionMetrics(
                    name=name,
                    index=c_idx,
                    n_samples=len(label_pred),
                    accuracy=float(c_acc),
                    precision=float(c_prec),
                    recall=float(c_rec),
                    f1=float(c_f1),
                    kappa=c_kappa,
                    kappa_interpretation=(
                        KappaResult.interpret_kappa(c_kappa) if c_kappa is not None else "undefined"
                    ),
                    krippendorff_alpha=krippendorff_alpha,
                    fleiss_kappa=fleiss_kappa,
                    support_true=sum(met_true),
                    support_pred=sum(met_pred),
                    confusion_matrix=c_cm,
                    fpr=c_cm.fpr,
                    fnr=c_cm.fnr,
                    phi=c_phi,
                    is_degenerate=c_degenerate,
                )
            )

        elif c_type == "ordinal":
            # Ordinal multi-choice criterion metrics. Use the effective criterion so a
            # predicted auto-injected NA index is recognized.
            eff_criterion = effective_criteria[c_idx]
            pred_indices = [v for v in pred_data if isinstance(v, int)]
            true_indices = [v for v in true_data if isinstance(v, int)]

            # Filter NA options. na_agree is unused here (the NAStats block below
            # computes kappa on the {NA, not-NA} dichotomy from per-criterion data).
            pred_filtered, true_filtered, _na_agree, na_fp, na_fn = filter_na_multi_choice(
                pred_indices, true_indices, eff_criterion, mode=na_mode
            )

            # Track NA stats (FP/FN feed the diagnostic counts on NAStats)
            total_na_fp += na_fp
            total_na_fn += na_fn

            metrics = _compute_ordinal_criterion_metrics(
                pred_filtered,
                true_filtered,
                eff_criterion,
                c_idx,
                fleiss_matrix=(fleiss_rows.get(c_idx) if eligible else None),
                krippendorff_alpha=(krippendorff_alphas.get(c_idx) if eligible else None),
            )
            per_criterion.append(metrics)

            # Use weighted kappa for ordinal in mean calculation
            criterion_kappas.append(metrics.weighted_kappa)

        else:  # nominal
            # Nominal multi-choice criterion metrics. Use the effective criterion so a
            # predicted auto-injected NA index is recognized.
            eff_criterion = effective_criteria[c_idx]
            pred_indices = [v for v in pred_data if isinstance(v, int)]
            true_indices = [v for v in true_data if isinstance(v, int)]

            # Filter NA options. na_agree is unused here (the NAStats block below
            # computes kappa on the {NA, not-NA} dichotomy from per-criterion data).
            pred_filtered, true_filtered, _na_agree, na_fp, na_fn = filter_na_multi_choice(
                pred_indices, true_indices, eff_criterion, mode=na_mode
            )

            # Track NA stats (FP/FN feed the diagnostic counts on NAStats)
            total_na_fp += na_fp
            total_na_fn += na_fn

            metrics = _compute_nominal_criterion_metrics(
                pred_filtered,
                true_filtered,
                eff_criterion,
                c_idx,
                fleiss_matrix=(fleiss_rows.get(c_idx) if eligible else None),
                krippendorff_alpha=(krippendorff_alphas.get(c_idx) if eligible else None),
            )
            per_criterion.append(metrics)

            # Use unweighted kappa for nominal
            criterion_kappas.append(metrics.kappa)

    # Aggregate criterion-level scalars via the shared helper, so the aggregate and
    # per-judge paths cannot drift. accuracy/mean_kappa reproduce the prior expressions
    # exactly; the only behavior change is multi-choice-only precision/recall/f1 going
    # 0.0 → None (the binary MET-vs-rest metric is genuinely undefined without a MET
    # class). per_criterion_pred has been normalized to ints (no None) by here, so its
    # static type matches the helper's expected list[CriterionVerdict | int].
    (
        criterion_accuracy,
        criterion_precision,
        criterion_recall,
        criterion_f1,
        mean_kappa,
        criterion_phi,
        micro_kappa,
    ) = _criterion_level_scalars(
        per_criterion_pred,  # type: ignore[arg-type]
        per_criterion_true,
        list(criterion_types),
        cannot_assess,
        precomputed_kappas=criterion_kappas,
    )

    # Macro accuracy: unweighted mean of the per-criterion accuracies (binary ``accuracy`` /
    # multi-choice ``exact_accuracy``), the macro complement to the pooled (micro)
    # ``criterion_accuracy`` above. None when no criterion contributed an accuracy.
    per_criterion_accuracies: list[float | None] = []
    for cm in per_criterion:
        if cm.criterion_type == "binary":
            per_criterion_accuracies.append(cm.accuracy)
        else:
            per_criterion_accuracies.append(cm.exact_accuracy)
    macro_accuracy = _mean_or_none(per_criterion_accuracies)

    # Macro mean of the per-criterion Krippendorff's alpha (inter-judge agreement). None when
    # no criterion contributed an alpha (e.g. single-judge runs).
    mean_krippendorff_alpha = _mean_or_none([cm.krippendorff_alpha for cm in per_criterion])

    # Coverage / error diagnostics — only meaningful under the ``exclude`` handling modes,
    # where abstentions (CANNOT_ASSESS / NA) and grading errors drop a paired observation
    # from the agreement denominator. Under ``as_unmet`` / ``as_category`` nothing is
    # union-excluded, so coverage would be trivially 1.0 and we leave these ``None``. The raw
    # denominator counts every GT-bearing item (including those lost to a grading error), so
    # error_rate and the abstain rates share one consistent denominator.
    coverage_stats: CoverageStats | None = None
    coverage_mode = cannot_assess == "exclude" and na_mode == "exclude"
    if coverage_mode:
        n_total_raw = items_with_ground_truth + n_errored_items
        agg_judge_abstain = 0
        agg_gt_abstain = 0
        for c_idx in range(n_criteria):
            c_type = criterion_types[c_idx]
            raw_pred = per_criterion_pred[c_idx]
            raw_true = per_criterion_true[c_idx]
            if c_type == "binary":
                CA = CriterionVerdict.CANNOT_ASSESS
                judge_abstain = sum(1 for v in raw_pred if v == CA)
                gt_abstain = sum(1 for v in raw_true if v == CA)
            else:
                na_idx_set = {
                    i for i, opt in enumerate(effective_criteria[c_idx].options) if opt.na
                }
                judge_abstain = sum(1 for v in raw_pred if isinstance(v, int) and v in na_idx_set)
                gt_abstain = sum(1 for v in raw_true if isinstance(v, int) and v in na_idx_set)
            agg_judge_abstain += judge_abstain
            agg_gt_abstain += gt_abstain
            cstats = _build_coverage_stats(
                n_total=n_total_raw,
                n_covered=per_criterion[c_idx].n_samples,
                judge_abstain=judge_abstain,
                gt_abstain=gt_abstain,
                n_errored=n_errored_items,
            )
            per_criterion[c_idx] = per_criterion[c_idx].model_copy(
                update={"coverage_stats": cstats}
            )

        # Aggregate rollup: coverage pools the per-criterion *pairs* (raw pair count summed
        # over criteria; covered = sum of per-criterion covered counts), so the coverage /
        # abstain fractions reflect the full paired sample. ``n_errored``, by contrast, is an
        # *item* count (an errored item has no usable verdicts at all) — reported as the raw
        # item count for an intuitive read, matching the per-criterion value; its ``error_rate``
        # is the fraction of raw ground-truth-bearing items lost to a grading error.
        agg_total = n_total_raw * n_criteria
        agg_covered = sum(cm.n_samples for cm in per_criterion)
        agg_coverage = agg_covered / agg_total if agg_total else None
        coverage_stats = CoverageStats(
            n_total=agg_total,
            n_covered=agg_covered,
            coverage=agg_coverage,
            judge_abstain_rate=(agg_judge_abstain / agg_total if agg_total else None),
            gt_abstain_rate=(agg_gt_abstain / agg_total if agg_total else None),
            union_exclusion_rate=(1 - agg_coverage if agg_coverage is not None else None),
            n_errored=n_errored_items,
            error_rate=(n_errored_items / n_total_raw if n_total_raw else None),
        )

    if n_errored_items > 0:
        result_warnings.append(
            f"{n_errored_items} item(s) with ground truth were excluded from metrics because "
            "grading errored; their verdicts and scores do not contribute."
        )

    # Score-level metrics
    score_rmse = float(np.sqrt(mean_squared_error(all_true_scores, all_pred_scores)))
    score_mae = float(mean_absolute_error(all_true_scores, all_pred_scores))

    score_spearman = _compute_correlation(all_pred_scores, all_true_scores, "spearman")
    score_kendall = _compute_correlation(all_pred_scores, all_true_scores, "kendall")
    score_pearson = _compute_correlation(all_pred_scores, all_true_scores, "pearson")

    # Score-collapse warning: if the per-item ground-truth scores take at most two distinct
    # values, the score-level correlations are uninformative (a correlation against a
    # near-constant variable conveys almost nothing), so flag it rather than letting a reader
    # over-interpret the Spearman/Kendall/Pearson numbers.
    if len(set(all_true_scores)) <= 2:
        result_warnings.append(
            "Ground-truth scores take <=2 distinct values; score-level correlations "
            "(Spearman/Kendall/Pearson) are uninformative on a collapsed score range."
        )

    # Degeneracy warning: name the criteria that had samples but whose agreement coefficient
    # collapsed to None (single-class data). Distinct from no-data criteria.
    degenerate_names = [cm.name for cm in per_criterion if cm.is_degenerate]
    if degenerate_names:
        result_warnings.append(
            "Degenerate (single-class) data prevented an agreement estimate for "
            f"criteria: {', '.join(degenerate_names)}."
        )

    # Bias analysis
    bias = systematic_bias(all_pred_scores, all_true_scores)

    # Bootstrap CIs (optional) — item-level resample over ANY rubric type (binary /
    # multi-choice / mixed). Per-metric None when its resample axis is empty / degenerate.
    # per_criterion_pred has been normalized to ints (no None) by the effective-criteria pass
    # above, so its static type matches the helper's list[CriterionVerdict | int].
    bootstrap_results = None
    if bootstrap:
        bootstrap_results = _compute_bootstrap_ci(
            per_criterion_pred,  # type: ignore[arg-type]
            per_criterion_true,
            list(criterion_types),
            effective_criteria,
            cannot_assess,
            na_mode,
            all_true_scores,
            all_pred_scores,
            n_bootstrap=n_bootstrap,
            confidence_level=confidence_level,
            seed=seed,
        )

    # Per-judge metrics (optional, for ensemble). Each judge mirrors the aggregate's
    # type handling: binary criteria contribute MET-vs-rest + label metrics, multi-choice
    # criteria contribute exact-match accuracy and kappa via the same per-criterion
    # functions the aggregate uses.
    per_judge_metrics = None
    if per_judge and is_ensemble and judge_scores:
        per_judge_metrics = {}
        for jid in judge_scores.keys():
            jv = judge_verdicts.get(jid, [])
            if not jv:
                continue

            per_judge_metrics[jid] = _compute_judge_metrics(
                judge_id=jid,
                judge_scores=judge_scores[jid],
                true_scores=all_true_scores,
                judge_verdicts=jv,
                judge_mc_preds=judge_mc_preds.get(jid, []),
                judge_errors=judge_errors.get(jid, []),
                true_verdicts=per_item_true,
                criterion_types=list(criterion_types),
                criteria=criteria,
                effective_criteria=effective_criteria,
                cannot_assess=cannot_assess,
                na_mode=na_mode,
            )

    # NA stats (for multi-choice criteria)
    # Cohen's kappa on the {NA, not-NA} dichotomy across all multi-choice
    # criteria that define an NA option, paired pred-vs-truth. Reuses the
    # same chance-corrected statistic as the rest of the framework's
    # prediction-vs-ground-truth agreement metrics (binary `kappa`, ordinal
    # `weighted_kappa`, nominal `kappa`). Returns None when undefined.
    na_stats = None
    if n_ordinal > 0 or n_nominal > 0:
        na_pred_bool: list[bool] = []
        na_true_bool: list[bool] = []
        for c_idx in range(n_criteria):
            if criterion_types[c_idx] == "binary":
                continue
            criterion = effective_criteria[c_idx]
            na_indices = {i for i, opt in enumerate(criterion.options) if opt.na}
            if not na_indices:
                continue
            for p, t in zip(per_criterion_pred[c_idx], per_criterion_true[c_idx]):
                if isinstance(p, int) and isinstance(t, int):
                    p_is_na = p in na_indices
                    t_is_na = t in na_indices
                    na_pred_bool.append(p_is_na)
                    na_true_bool.append(t_is_na)
                    if p_is_na:
                        total_na_pred += 1
                    if t_is_na:
                        total_na_true += 1

        na_kappa: float | None = None
        if na_pred_bool:
            try:
                k = float(cohen_kappa_score(na_true_bool, na_pred_bool))
                na_kappa = None if math.isnan(k) else k
            except Exception:
                na_kappa = None
        na_kappa_interpretation = (
            KappaResult.interpret_kappa(na_kappa) if na_kappa is not None else None
        )

        na_stats = NAStats(
            na_count_true=total_na_true,
            na_count_pred=total_na_pred,
            na_kappa=na_kappa,
            na_kappa_interpretation=na_kappa_interpretation,
            na_false_positive=total_na_fp,
            na_false_negative=total_na_fn,
        )

    # CANNOT_ASSESS stats (for binary criteria) — the binary parallel of NA stats above.
    # Cohen's kappa on the {CANNOT_ASSESS, not-CANNOT_ASSESS} dichotomy across all binary
    # criteria, paired pred-vs-truth. CANNOT_ASSESS is a DISTINCT kind of abstention from
    # multi-choice NA (epistemic MET-vs-UNMET abstention rather than "no applicable
    # option"), so it is tracked by a separate stats type (CannotAssessStats) even though
    # both share the SKIP scoring path. Counts are mode-independent: read from the raw
    # per-criterion verdicts (set at the top of this function), never the
    # cannot_assess-filtered lists. Returns None when there are no binary criteria.
    cannot_assess_stats = None
    if n_binary > 0:
        ca_pred_bool: list[bool] = []
        ca_true_bool: list[bool] = []
        total_ca_true = 0
        total_ca_pred = 0
        total_ca_fp = 0
        total_ca_fn = 0
        CA = CriterionVerdict.CANNOT_ASSESS
        for c_idx in range(n_criteria):
            if criterion_types[c_idx] != "binary":
                continue
            for p, t in zip(per_criterion_pred[c_idx], per_criterion_true[c_idx]):
                if isinstance(p, CriterionVerdict) and isinstance(t, CriterionVerdict):
                    p_is_ca = p == CA
                    t_is_ca = t == CA
                    ca_pred_bool.append(p_is_ca)
                    ca_true_bool.append(t_is_ca)
                    if p_is_ca:
                        total_ca_pred += 1
                    if t_is_ca:
                        total_ca_true += 1
                    if p_is_ca and not t_is_ca:
                        total_ca_fp += 1
                    if t_is_ca and not p_is_ca:
                        total_ca_fn += 1

        ca_kappa: float | None = None
        if ca_pred_bool:
            try:
                k = float(cohen_kappa_score(ca_true_bool, ca_pred_bool))
                ca_kappa = None if math.isnan(k) else k
            except Exception:
                ca_kappa = None
        ca_kappa_interpretation = (
            KappaResult.interpret_kappa(ca_kappa) if ca_kappa is not None else None
        )

        cannot_assess_stats = CannotAssessStats(
            ca_count_true=total_ca_true,
            ca_count_pred=total_ca_pred,
            ca_kappa=ca_kappa,
            ca_kappa_interpretation=ca_kappa_interpretation,
            ca_false_positive=total_ca_fp,
            ca_false_negative=total_ca_fn,
        )

    return MetricsResult(
        # Each scalar may be None (genuinely undefined / not applicable), so only wrap a
        # present value in float() — never coerce None to 0.0.
        criterion_accuracy=criterion_accuracy
        if criterion_accuracy is None
        else float(criterion_accuracy),
        criterion_precision=criterion_precision
        if criterion_precision is None
        else float(criterion_precision),
        criterion_recall=criterion_recall if criterion_recall is None else float(criterion_recall),
        criterion_f1=criterion_f1 if criterion_f1 is None else float(criterion_f1),
        mean_kappa=mean_kappa if mean_kappa is None else float(mean_kappa),
        per_criterion=per_criterion,
        score_rmse=score_rmse,
        score_mae=score_mae,
        score_spearman=score_spearman,
        score_kendall=score_kendall,
        score_pearson=score_pearson,
        bias=bias,
        bootstrap=bootstrap_results,
        per_judge=per_judge_metrics,
        n_items=n_items,
        n_criteria=n_criteria,
        n_binary_criteria=n_binary,
        n_ordinal_criteria=n_ordinal,
        n_nominal_criteria=n_nominal,
        na_stats=na_stats,
        cannot_assess_stats=cannot_assess_stats,
        # Handling-mode provenance (recorded on the result so downstream readers know how
        # abstentions were treated when these numbers were produced).
        cannot_assess_mode=cannot_assess,
        na_mode=na_mode,
        # Additional aggregate scalars (every one honours undefined→None — never a fake 0.0).
        n_samples=sum(cm.n_samples for cm in per_criterion),
        mean_krippendorff_alpha=mean_krippendorff_alpha,
        criterion_phi=criterion_phi if criterion_phi is None else float(criterion_phi),
        macro_accuracy=macro_accuracy if macro_accuracy is None else float(macro_accuracy),
        micro_kappa=micro_kappa if micro_kappa is None else float(micro_kappa),
        coverage_stats=coverage_stats,
        warnings=result_warnings,
    )
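
The aggregate coverage rollup in the listing above mixes two denominators: coverage is computed over pooled (item, criterion) pairs, while `error_rate` is computed over raw items. The following is a minimal sketch of that arithmetic only, assuming every criterion shares the same raw item count. `CoverageSketch` and `rollup_coverage` are hypothetical names used for illustration, not part of the library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoverageSketch:
    """Hypothetical stand-in for a subset of the aggregate coverage fields."""

    n_total: int
    n_covered: int
    coverage: Optional[float]
    union_exclusion_rate: Optional[float]
    n_errored: int
    error_rate: Optional[float]


def rollup_coverage(
    items_with_ground_truth: int,
    n_errored_items: int,
    covered_per_criterion: list,
) -> CoverageSketch:
    # Raw denominator: every ground-truth-bearing item, errored ones included.
    n_total_raw = items_with_ground_truth + n_errored_items
    # Pair-level pooling: one (item, criterion) pair per criterion per raw item.
    agg_total = n_total_raw * len(covered_per_criterion)
    agg_covered = sum(covered_per_criterion)
    # Undefined (None) rather than 0.0 when there is nothing to divide by.
    coverage = agg_covered / agg_total if agg_total else None
    return CoverageSketch(
        n_total=agg_total,
        n_covered=agg_covered,
        coverage=coverage,
        union_exclusion_rate=(1 - coverage if coverage is not None else None),
        n_errored=n_errored_items,
        # Item-level fraction, deliberately not divided by the pair count.
        error_rate=(n_errored_items / n_total_raw if n_total_raw else None),
    )


# 8 graded items + 2 errored items, 3 criteria with 8, 6 and 7 covered pairs.
stats = rollup_coverage(8, 2, [8, 6, 7])
print(stats.n_total, stats.n_covered, stats.coverage, stats.error_rate)
```

With these inputs the pair-level denominator is 30 (10 raw items × 3 criteria) and 21 pairs are covered, while the error rate is 2 of 10 items. Reading `n_errored` against `n_total` would understate the share of items lost to grading errors, which is why the two denominators are kept separate.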

MetricsResult¶

Complete metrics result with aggregate and per-criterion breakdowns.

MetricsResult ¶

Bases: BaseModel

Complete metrics result from compute_metrics().

This is the main result type returned by EvalResult.compute_metrics(). It provides a comprehensive view of evaluation quality, including:

- Criterion-level agreement metrics
- Score-level correlation and error metrics
- Per-criterion breakdown (supports binary, ordinal, and nominal criteria)
- Optional bootstrap confidence intervals
- Optional per-judge metrics for ensemble evaluations

ATTRIBUTE	DESCRIPTION
`criterion_accuracy`	Overall accuracy across all criteria (binary label accuracy and/or multi-choice exact-match). `None` when undefined (no comparable pairs at all). TYPE: `float \| None`
`criterion_precision`	Overall precision for the binary MET class. `None` when not applicable — multi-choice-only rubrics have no MET class (the per-option precision/recall/f1 story lives in each criterion's `per_option`). TYPE: `float \| None`
`criterion_recall`	Overall recall for the binary MET class. `None` when not applicable (see `criterion_precision`). TYPE: `float \| None`
`criterion_f1`	Overall F1 for the binary MET class. `None` when not applicable (see `criterion_precision`). TYPE: `float \| None`
`mean_kappa`	Mean kappa across criteria (weighted for ordinal, unweighted for binary/nominal). `None` when undefined (no criterion contributed a kappa). TYPE: `float \| None`
`per_criterion`	Per-criterion metrics breakdown (polymorphic union type). TYPE: `list[CriterionMetricsUnion]`
`score_rmse`	RMSE of cumulative scores. TYPE: `float`
`score_mae`	MAE of cumulative scores. TYPE: `float`
`score_spearman`	Spearman correlation result. TYPE: `CorrelationResult`
`score_kendall`	Kendall tau correlation result. TYPE: `CorrelationResult`
`score_pearson`	Pearson correlation result. TYPE: `CorrelationResult`
`bias`	Systematic bias analysis. TYPE: `BiasResult`
`bootstrap`	Optional bootstrap confidence intervals. TYPE: `BootstrapResults \| None`
`per_judge`	Optional per-judge metrics for ensemble. TYPE: `dict[str, JudgeMetrics] \| None`
`n_items`	Number of items used in computation. TYPE: `int`
`n_criteria`	Number of criteria. TYPE: `int`
`n_binary_criteria`	Number of binary criteria (default 0 for backwards compat). TYPE: `int`
`n_ordinal_criteria`	Number of ordinal multi-choice criteria. TYPE: `int`
`n_nominal_criteria`	Number of nominal multi-choice criteria. TYPE: `int`
`na_stats`	Statistics for NA handling in multi-choice criteria. TYPE: `NAStats \| None`
`cannot_assess_stats`	Statistics for CANNOT_ASSESS handling in binary criteria — the binary parallel to `na_stats` (a distinct kind of abstention; see CannotAssessStats). TYPE: `CannotAssessStats \| None`
`cannot_assess_mode`	How binary CANNOT_ASSESS verdicts were handled when these metrics were computed (`exclude` / `as_unmet` / `as_category`). TYPE: `CannotAssessMode`
`na_mode`	How multi-choice NA options were handled when these metrics were computed (the multi-choice analog of `cannot_assess_mode`). TYPE: `NAMode`
`n_samples`	Total number of paired observations contributing to the aggregate metrics. `None` when not recorded (legacy checkpoints). TYPE: `int \| None`
`mean_krippendorff_alpha`	Macro mean of the per-criterion Krippendorff's alpha. `None` when no criterion contributed an alpha. TYPE: `float \| None`
`criterion_phi`	Aggregate (micro) Matthews correlation coefficient (φ) over the pooled binary {MET, UNMET} flats. `None` for multi-choice-only rubrics or when undefined. TYPE: `float \| None`
`macro_accuracy`	Unweighted mean of the per-criterion accuracies. `None` when no criterion contributed an accuracy. TYPE: `float \| None`
`micro_kappa`	Aggregate (micro) Cohen's kappa pooled across criteria. `None` when undefined. TYPE: `float \| None`
`coverage_stats`	Aggregate rollup of how much of the raw paired sample survived abstention/error exclusion. Only populated when both `cannot_assess_mode` and `na_mode` are `exclude`. TYPE: `CoverageStats \| None`
`warnings`	Any warnings generated during computation. TYPE: `list[str]`
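
`criterion_accuracy` (pooled, micro) and `macro_accuracy` (unweighted mean over criteria) can diverge sharply when criteria have unequal support. The sketch below illustrates the difference in plain Python. It is not the library's implementation: `mean_or_none` and `micro_and_macro_accuracy` are hypothetical helpers, and labels are compared by simple equality.

```python
from typing import Optional


def mean_or_none(values: list) -> Optional[float]:
    """Average the defined values; None if none are defined (hypothetical helper)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def micro_and_macro_accuracy(pairs_per_criterion: list) -> tuple:
    """pairs_per_criterion holds one list of (pred, true) label pairs per criterion."""
    # Micro: pool every paired decision, then take a single accuracy.
    pooled = [pair for pairs in pairs_per_criterion for pair in pairs]
    micro = sum(p == t for p, t in pooled) / len(pooled) if pooled else None
    # Macro: one accuracy per criterion, then an unweighted mean of those.
    per_criterion = [
        sum(p == t for p, t in pairs) / len(pairs) if pairs else None
        for pairs in pairs_per_criterion
    ]
    return micro, mean_or_none(per_criterion)


# Criterion A: 8 pairs, all correct. Criterion B: 2 pairs, both wrong.
crit_a = [("MET", "MET")] * 8
crit_b = [("MET", "UNMET")] * 2
micro, macro = micro_and_macro_accuracy([crit_a, crit_b])
print(micro, macro)
```

Here the high-support criterion dominates the micro figure (8 of 10 pairs correct), while the macro figure weights both criteria equally (the mean of 1.0 and 0.0). Reporting both makes a single weak, low-support criterion visible.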

summary ¶

summary(*, verbose: bool = False) -> str

Return formatted text summary of metrics.

PARAMETER	DESCRIPTION
`verbose`	When `True`, the per-judge table swaps in the secondary numeric columns (RMSE, Spearman) it omits by default and prints each judge's confusion matrix. The default (`False`) per-judge line leads with the chance-corrected accuracy + mean kappa (and Matthews phi), the metrics most directly comparable across judges. TYPE: `bool` DEFAULT: `False`
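
Because most aggregate scalars may be `None`, a text summary has to render undefined values without coercing them to `0.0`. The source of `summary()` does this through a private `_fmt_opt` helper whose body is not shown here. The sketch below is an assumption about how such a helper could behave: the name `fmt_opt` and the `"n/a"` placeholder are hypothetical.

```python
from typing import Optional


def fmt_opt(value: Optional[float], spec: str, missing: str = "n/a") -> str:
    """Format value with a format spec, or return a placeholder when undefined."""
    # None means "undefined / not applicable"; never substitute a fake 0.0.
    return missing if value is None else format(value, spec)


lines = [
    f"  Accuracy (micro):       {fmt_opt(0.875, '.1%')}",
    f"  Kappa (micro):          {fmt_opt(None, '.3f')}",
]
print("\n".join(lines))
```

Keeping the placeholder distinct from any numeric rendering lets a reader tell "kappa is undefined on this data" apart from "kappa is zero".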

Source code in src/autorubric/metrics/_types.py

def summary(self, *, verbose: bool = False) -> str:
    """Return formatted text summary of metrics.

    Args:
        verbose: When ``True``, the per-judge table swaps in the secondary numeric
            columns (RMSE, Spearman) it omits by default and prints each judge's
            confusion matrix. The default (``False``) per-judge line leads with the
            chance-corrected accuracy + mean kappa (and Matthews phi), the metrics most
            directly comparable across judges.
    """
    lines = []
    lines.append("=" * 60)
    lines.append("METRICS SUMMARY")
    lines.append("=" * 60)

    # Show criteria type breakdown if mixed
    criteria_info = f"Items: {self.n_items}, Criteria: {self.n_criteria}"
    if self.n_ordinal_criteria > 0 or self.n_nominal_criteria > 0:
        type_parts = []
        if self.n_binary_criteria > 0:
            type_parts.append(f"{self.n_binary_criteria} binary")
        if self.n_ordinal_criteria > 0:
            type_parts.append(f"{self.n_ordinal_criteria} ordinal")
        if self.n_nominal_criteria > 0:
            type_parts.append(f"{self.n_nominal_criteria} nominal")
        criteria_info += f" ({', '.join(type_parts)})"
    lines.append(criteria_info)

    # Handling modes: every accuracy/kappa/F1 below depends on how abstentions were
    # treated, so the estimand is named explicitly. A number reported without its
    # handling mode is ambiguous among the three estimands.
    lines.append(f"Handling modes: CANNOT_ASSESS={self.cannot_assess_mode}, NA={self.na_mode}")

    if self.warnings:
        lines.append(f"\nWarnings ({len(self.warnings)}):")
        for w in self.warnings:
            lines.append(f"  - {w}")

    # Criterion-level scalars span two aggregation levels. Pooled-over-decisions
    # metrics are micro; the unweighted mean over criteria is macro. They estimate
    # different quantities (a high-support criterion dominates micro, every criterion
    # counts equally for macro), so each carries its level explicitly.
    lines.append("")
    lines.append("Criterion-Level Metrics:")
    lines.append(f"  Accuracy (micro):       {_fmt_opt(self.criterion_accuracy, '.1%')}")
    lines.append(f"  Accuracy (macro):       {_fmt_opt(self.macro_accuracy, '.1%')}")
    if self.n_binary_criteria > 0:
        # Guaranteed non-None here, but render via _fmt_opt so ty is satisfied.
        lines.append(f"  Precision (micro):      {_fmt_opt(self.criterion_precision, '.2f')}")
        lines.append(f"  Recall (micro):         {_fmt_opt(self.criterion_recall, '.2f')}")
        lines.append(f"  F1 (micro):             {_fmt_opt(self.criterion_f1, '.2f')}")
    lines.append(f"  Mean Kappa (macro):     {_fmt_opt(self.mean_kappa, '.3f')}")
    lines.append(f"  Kappa (micro):          {_fmt_opt(self.micro_kappa, '.3f')}")
    if self.n_binary_criteria > 0:
        lines.append(f"  Phi (micro):            {_fmt_opt(self.criterion_phi, '.3f')}")
        # Single-source conflation note: on binary data phi coincides with the
        # Pearson/Spearman/Kendall/MCC family, and the kappa minus phi gap measures the
        # judge's positive-rate drift from the human's (not a second, corroborating
        # statistic). This note lives only here (and in to_dataframe()/docstrings).
        lines.append(
            "    (phi = Pearson = Spearman = Kendall = MCC on binary data; the "
            "Kappa - Phi gap is the judge's positive-rate drift, not extra evidence)"
        )
    lines.append(f"  Mean Kripp-α (macro):   {_fmt_opt(self.mean_krippendorff_alpha, '.3f')}")

    # Aggregate coverage continuation: under exclude mode an abstention/error drops a
    # paired observation, so the covered-subset metrics above are reported alongside
    # their coverage (a selective accuracy without its coverage is incomplete).
    if self.coverage_stats is not None:
        cs = self.coverage_stats
        lines.append(f"  Coverage:               {_fmt_opt(cs.coverage, '.1%')}")
        lines.append(
            f"    judge-abstain={_fmt_opt(cs.judge_abstain_rate, '.1%')}, "
            f"gt-abstain={_fmt_opt(cs.gt_abstain_rate, '.1%')}, "
            f"errored={cs.n_errored}"
        )

    lines.append("")
    lines.append("Score-Level Metrics (continuous per-item weighted score):")
    lines.append(f"  RMSE:     {self.score_rmse:.4f}")
    lines.append(f"  MAE:      {self.score_mae:.4f}")
    lines.append(
        f"  Spearman: {_fmt_opt(self.score_spearman.coefficient, '.4f')} "
        f"({self.score_spearman.interpretation})"
    )
    lines.append(
        f"  Kendall:  {_fmt_opt(self.score_kendall.coefficient, '.4f')} "
        f"({self.score_kendall.interpretation})"
    )
    lines.append(
        f"  Pearson:  {_fmt_opt(self.score_pearson.coefficient, '.4f')} "
        f"({self.score_pearson.interpretation})"
    )

    lines.append("")
    lines.append("Bias Analysis:")
    lines.append(
        f"  Mean Bias:   {_fmt_opt(self.bias.mean_bias, '+.4f')} ({self.bias.direction})"
    )
    lines.append(f"  Significant: {'Yes' if self.bias.is_significant else 'No'}")

    # NA stats for multi-choice
    if self.na_stats:
        lines.append("")
        lines.append("NA Handling:")
        lines.append(f"  NA in Ground Truth: {self.na_stats.na_count_true}")
        lines.append(f"  NA in Predictions:  {self.na_stats.na_count_pred}")
        if self.na_stats.na_kappa is not None:
            interp = self.na_stats.na_kappa_interpretation or ""
            lines.append(f"  NA Kappa:           {self.na_stats.na_kappa:.3f} ({interp})")
        if self.na_stats.na_false_positive > 0 or self.na_stats.na_false_negative > 0:
            lines.append(
                f"  NA FP/FN:           {self.na_stats.na_false_positive} / "
                f"{self.na_stats.na_false_negative}"
            )

    # CANNOT_ASSESS stats for binary criteria (parallel to NA Handling above; a
    # distinct kind of abstention — epistemic MET/UNMET rather than "no option").
    if self.cannot_assess_stats:
        ca = self.cannot_assess_stats
        lines.append("")
        lines.append("CANNOT_ASSESS Handling:")
        lines.append(f"  CA in Ground Truth: {ca.ca_count_true}")
        lines.append(f"  CA in Predictions:  {ca.ca_count_pred}")
        if ca.ca_kappa is not None:
            interp = ca.ca_kappa_interpretation or ""
            lines.append(f"  CA Kappa:           {ca.ca_kappa:.3f} ({interp})")
        if ca.ca_false_positive > 0 or ca.ca_false_negative > 0:
            lines.append(
                f"  CA FP/FN:           {ca.ca_false_positive} / {ca.ca_false_negative}"
            )

    if self.bootstrap:
        lines.append("")
        lines.append(f"Bootstrap CIs ({self.bootstrap.confidence_level:.0%}):")
        # Each CI may be None (genuinely undefined / no samples) — render "n/a".
        acc_ci = self.bootstrap.accuracy_ci
        kappa_ci = self.bootstrap.kappa_ci
        rmse_ci = self.bootstrap.rmse_ci
        lines.append(
            "  Accuracy: "
            + (f"[{acc_ci[0]:.1%}, {acc_ci[1]:.1%}]" if acc_ci is not None else "n/a")
        )
        lines.append(
            "  Kappa:    "
            + (f"[{kappa_ci[0]:.3f}, {kappa_ci[1]:.3f}]" if kappa_ci is not None else "n/a")
        )
        lines.append(
            "  RMSE:     "
            + (f"[{rmse_ci[0]:.4f}, {rmse_ci[1]:.4f}]" if rmse_ci is not None else "n/a")
        )

    if self.per_judge:
        lines.append("")
        lines.append("Per-Judge Metrics:")
        for judge_id, jm in sorted(self.per_judge.items()):
            # Default line leads with the chance-corrected accuracy + mean kappa (and
            # phi), the metrics most comparable across judges. RMSE/Spearman are demoted
            # to the verbose view.
            lines.append(
                f"  {judge_id}: Acc={_fmt_opt(jm.criterion_accuracy, '.1%')}, "
                f"Mean Kappa={_fmt_opt(jm.mean_kappa, '.3f')}, "
                f"Phi={_fmt_opt(jm.phi, '.3f')}"
            )
            if verbose:
                lines.append(
                    f"      RMSE={jm.score_rmse:.4f}, "
                    f"Spearman={_fmt_opt(jm.score_spearman.coefficient, '.4f')}, "
                    f"MAE={jm.score_mae:.4f}"
                )
                if jm.confusion_matrix is not None:
                    lines.append(
                        "      Confusion (" + ", ".join(jm.confusion_matrix.labels) + "):"
                    )
                    for label, mrow in zip(
                        jm.confusion_matrix.labels, jm.confusion_matrix.matrix
                    ):
                        lines.append(f"        {label:<14} {mrow}")

    lines.append("")
    lines.append("Per-Criterion Breakdown:")

    # Inter-judge agreement is only populated for ensembles with >=2 judges. Render is
    # type-aware: on binary/nominal data Krippendorff's nominal alpha and Fleiss' kappa
    # coincide up to a finite-sample correction (1 - kappa_F)/(N*R) — they are one
    # statistic, not corroborating evidence — so alpha is reported as the single primary
    # column and the bare Fleiss column is dropped. On ordinal data alpha is
    # distance-aware while Fleiss stays nominal (different geometry), so both are kept.
    def _has_alpha(criteria: list) -> bool:
        return any(cm.krippendorff_alpha is not None for cm in criteria)

    def _has_fleiss(criteria: list) -> bool:
        return any(cm.fleiss_kappa is not None for cm in criteria)

    def _alpha_header() -> str:
        return f" {'Kripp-α':>9}"

    def _alpha_cell(cm) -> str:
        return f" {_fmt_opt(cm.krippendorff_alpha, '>9.3f', 9)}"

    def _alpha_fleiss_header() -> str:
        return f" {'Kripp-α':>9} {'Fleiss':>8}"

    def _alpha_fleiss_cells(cm) -> str:
        return (
            f" {_fmt_opt(cm.krippendorff_alpha, '>9.3f', 9)}"
            f" {_fmt_opt(cm.fleiss_kappa, '>8.3f', 8)}"
        )

    # Marks a criterion that had samples but whose agreement coefficient collapsed to
    # None (single-class) — distinct from a no-data criterion (n_samples == 0).
    def _degen_suffix(cm) -> str:
        return "  [degenerate: agreement undefined, single class]" if cm.is_degenerate else ""

    # Separate display by criterion type
    binary_criteria = [cm for cm in self.per_criterion if cm.criterion_type == "binary"]
    ordinal_criteria = [cm for cm in self.per_criterion if cm.criterion_type == "ordinal"]
    nominal_criteria = [cm for cm in self.per_criterion if cm.criterion_type == "nominal"]

    alpha_note_needed = False

    if binary_criteria:
        if ordinal_criteria or nominal_criteria:
            lines.append("\nBinary Criteria:")
        # Binary/nominal: alpha primary, bare Fleiss dropped.
        show_alpha = _has_alpha(binary_criteria)
        alpha_note_needed = alpha_note_needed or show_alpha
        header = (
            f"{'Criterion':<20} {'Acc':>8} {'Prec':>8} {'Rec':>8} {'F1':>8} "
            f"{'Kappa':>8} {'Phi':>8} {'FP':>5} {'FN':>5} {'FPR':>7} {'FNR':>7}"
        )
        if show_alpha:
            header += _alpha_header()
        lines.append(header)
        lines.append("-" * len(header))
        for cm in binary_criteria:
            fp = cm.confusion_matrix.fp if cm.confusion_matrix is not None else None
            fn = cm.confusion_matrix.fn if cm.confusion_matrix is not None else None
            row = (
                f"{cm.name:<20} {_fmt_opt(cm.accuracy, '>8.1%', 8)} "
                f"{_fmt_opt(cm.precision, '>8.2f', 8)} "
                f"{_fmt_opt(cm.recall, '>8.2f', 8)} {_fmt_opt(cm.f1, '>8.2f', 8)} "
                f"{_fmt_opt(cm.kappa, '>8.3f', 8)} {_fmt_opt(cm.phi, '>8.3f', 8)} "
                f"{(str(fp) if fp is not None else 'n/a'):>5} "
                f"{(str(fn) if fn is not None else 'n/a'):>5} "
                f"{_fmt_opt(cm.fpr, '>7.2f', 7)} {_fmt_opt(cm.fnr, '>7.2f', 7)}"
            )
            if show_alpha:
                row += _alpha_cell(cm)
            row += _degen_suffix(cm)
            lines.append(row)

    if ordinal_criteria:
        lines.append("\nOrdinal Criteria:")
        # Ordinal: keep both alpha (distance-aware) and Fleiss (nominal) — different
        # geometry, see the note below.
        show_alpha = _has_alpha(ordinal_criteria)
        show_fleiss = _has_fleiss(ordinal_criteria)
        header = (
            f"{'Criterion':<20} {'Exact':>8} {'Adj':>8} "
            f"{'WKappa':>8} {'Spearman':>10} {'RMSE':>8}"
        )
        if show_alpha or show_fleiss:
            header += _alpha_fleiss_header()
        lines.append(header)
        lines.append("-" * len(header))
        for cm in ordinal_criteria:
            row = (
                f"{cm.name:<20} {_fmt_opt(cm.exact_accuracy, '>8.1%', 8)} "
                f"{_fmt_opt(cm.adjacent_accuracy, '>8.1%', 8)} "
                f"{_fmt_opt(cm.weighted_kappa, '>8.3f', 8)} "
                f"{_fmt_opt(cm.spearman.coefficient, '>10.4f', 10)} "
                f"{_fmt_opt(cm.rmse, '>8.4f', 8)}"
            )
            if show_alpha or show_fleiss:
                row += _alpha_fleiss_cells(cm)
            row += _degen_suffix(cm)
            lines.append(row)
        if show_fleiss:
            # Distinguishing note (NOT a conflation note): ordinal alpha and nominal
            # Fleiss measure different geometries and are both intentionally retained.
            lines.append(
                "  Note: ordinal Kripp-α is distance-aware while Fleiss is nominal — they "
                "measure different geometry, not the same statistic."
            )

    if nominal_criteria:
        lines.append("\nNominal Criteria:")
        # Binary/nominal: alpha primary, bare Fleiss dropped.
        show_alpha = _has_alpha(nominal_criteria)
        alpha_note_needed = alpha_note_needed or show_alpha
        header = f"{'Criterion':<20} {'Accuracy':>10} {'Kappa':>8} {'Interpretation':<20}"
        if show_alpha:
            header += _alpha_header()
        lines.append(header)
        lines.append("-" * len(header))
        for cm in nominal_criteria:
            row = (
                f"{cm.name:<20} {_fmt_opt(cm.exact_accuracy, '>10.1%', 10)} "
                f"{_fmt_opt(cm.kappa, '>8.3f', 8)} {cm.kappa_interpretation:<20}"
            )
            if show_alpha:
                row += _alpha_cell(cm)
            row += _degen_suffix(cm)
            lines.append(row)

    if alpha_note_needed:
        # Single-source conflation note for binary/nominal: alpha and Fleiss coincide up
        # to a finite-sample correction, so only the primary (alpha) is reported and
        # Fleiss is omitted. This note lives only here (and in to_dataframe()/docstrings).
        lines.append(
            "  Note: on binary/nominal data Krippendorff's nominal α equals Fleiss' κ up "
            "to a finite-sample correction (1 - κ_F)/(N·R) — one statistic, not "
            "corroborating evidence; α is reported as primary (bare Fleiss omitted)."
        )

    return "\n".join(lines)
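
The binary/nominal note that summary() appends can be checked numerically. The sketch below is not autorubric's implementation: `fleiss_kappa` and `krippendorff_alpha_nominal` are hypothetical stdlib-only helpers that assume complete data (every judge votes on every item), and the ratings are made up. Under those assumptions the gap between the two statistics is exactly (1 - κ_F)/(N·R).

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for complete data; ratings[i] holds the R labels for item i."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()
    agree_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Share of agreeing rater pairs within this item.
        agree_sum += sum(c * (c - 1) for c in counts.values()) / (n_raters * (n_raters - 1))
    p_obs = agree_sum / n_items
    n_total = n_items * n_raters
    p_exp = sum((c / n_total) ** 2 for c in totals.values())
    return (p_obs - p_exp) / (1 - p_exp)


def krippendorff_alpha_nominal(ratings):
    """Krippendorff's nominal alpha for complete data (same input layout)."""
    n_raters = len(ratings[0])
    n_total = len(ratings) * n_raters
    totals = Counter()
    disagree = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Ordered disagreeing pairs within the item, weighted by 1 / (R - 1).
        disagree += (n_raters**2 - sum(c * c for c in counts.values())) / (n_raters - 1)
    d_obs = disagree / n_total
    # Expected disagreement uses sampling without replacement: n * (n - 1) pairs.
    d_exp = (n_total**2 - sum(c * c for c in totals.values())) / (n_total * (n_total - 1))
    return 1 - d_obs / d_exp


# Made-up votes: 5 items, 3 judges each.
ratings = [
    ["MET", "MET", "MET"],
    ["MET", "UNMET", "MET"],
    ["UNMET", "UNMET", "UNMET"],
    ["UNMET", "MET", "UNMET"],
    ["MET", "MET", "UNMET"],
]
kappa = fleiss_kappa(ratings)
alpha = krippendorff_alpha_nominal(ratings)
correction = (1 - kappa) / (len(ratings) * len(ratings[0]))
print(round(kappa, 4), round(alpha, 4), round(alpha - kappa - correction, 12))
```

Because the gap shrinks as N·R grows, the two numbers carry the same information on nominal data, which is why only α is reported there.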

to_dataframe ¶

to_dataframe() -> DataFrame

Export metrics to pandas DataFrame.

Returns a flat DataFrame with a 'level' column indicating 'aggregate' / 'criterion' / 'judge'. The criterion-level scalars carry their aggregation level in the column name: accuracy_micro / precision_micro / recall_micro / f1_micro / kappa_micro / phi_micro are pooled over decisions, while accuracy_macro and mean_kappa_macro are unweighted means over criteria (the former bare accuracy / precision / recall / f1 / kappa columns are gone — they mixed levels). The handling modes (cannot_assess_mode / na_mode) and n_samples round-trip on the aggregate row, alongside coverage columns (coverage / judge_abstain_rate / gt_abstain_rate / union_exclusion_rate / n_errored / error_rate; None outside exclude mode). On binary/nominal data Krippendorff's α equals Fleiss' κ up to a finite-sample correction, so α is the single primary inter-judge column and the bare fleiss_kappa value is emitted only for ordinal criteria (different geometry).

Source code in src/autorubric/metrics/_types.py

def to_dataframe(self) -> "pd.DataFrame":
    """Export metrics to pandas DataFrame.

    Returns a flat DataFrame with a 'level' column indicating 'aggregate' / 'criterion'
    / 'judge'. The criterion-level scalars carry their **aggregation level** in the
    column name: ``accuracy_micro`` / ``precision_micro`` / ``recall_micro`` /
    ``f1_micro`` / ``kappa_micro`` / ``phi_micro`` are pooled over decisions, while
    ``accuracy_macro`` and ``mean_kappa_macro`` are unweighted means over criteria (the
    former bare ``accuracy`` / ``precision`` / ``recall`` / ``f1`` / ``kappa`` columns
    are gone — they mixed levels). The handling modes (``cannot_assess_mode`` /
    ``na_mode``) and ``n_samples`` round-trip on the aggregate row, alongside coverage
    columns (``coverage`` / ``judge_abstain_rate`` / ``gt_abstain_rate`` /
    ``union_exclusion_rate`` / ``n_errored`` / ``error_rate``; ``None`` outside exclude
    mode). On binary/nominal data Krippendorff's α equals Fleiss' κ up to a
    finite-sample correction, so α is the single primary inter-judge column and the bare
    ``fleiss_kappa`` value is emitted only for ordinal criteria (different geometry).
    """
    import pandas as pd

    rows = []

    def _coverage_cols(cs: "CoverageStats | None") -> dict:
        """Coverage columns, all None when coverage was not computed (non-exclude mode)."""
        if cs is None:
            return {
                "coverage": None,
                "judge_abstain_rate": None,
                "gt_abstain_rate": None,
                "union_exclusion_rate": None,
                "n_errored": None,
                "error_rate": None,
            }
        return {
            "coverage": cs.coverage,
            "judge_abstain_rate": cs.judge_abstain_rate,
            "gt_abstain_rate": cs.gt_abstain_rate,
            "union_exclusion_rate": cs.union_exclusion_rate,
            "n_errored": cs.n_errored,
            "error_rate": cs.error_rate,
        }

    # Aggregate row. The criterion-level scalars are labelled by aggregation level:
    # accuracy_micro / precision_micro / recall_micro / f1_micro / kappa_micro / phi_micro
    # are pooled over all decisions; accuracy_macro / mean_kappa_macro are unweighted
    # means over criteria. The bare accuracy/precision/recall/f1/kappa columns are gone
    # (they hid the level). On binary/nominal data Krippendorff's α is the single primary
    # inter-judge statistic (Fleiss coincides up to a finite-sample correction), so the
    # aggregate-level mean is mean_krippendorff_alpha and bare Fleiss is omitted there.
    rows.append(
        {
            "level": "aggregate",
            "name": "overall",
            "criterion_type": "all",
            "accuracy_micro": self.criterion_accuracy,
            "accuracy_macro": self.macro_accuracy,
            "precision_micro": self.criterion_precision,
            "recall_micro": self.criterion_recall,
            "f1_micro": self.criterion_f1,
            "mean_kappa_macro": self.mean_kappa,
            "kappa_micro": self.micro_kappa,
            "phi_micro": self.criterion_phi,
            "mean_krippendorff_alpha": self.mean_krippendorff_alpha,
            "cannot_assess_mode": self.cannot_assess_mode,
            "na_mode": self.na_mode,
            "n_samples": self.n_samples,
            "rmse": self.score_rmse,
            "mae": self.score_mae,
            "spearman": self.score_spearman.coefficient,
            "kendall": self.score_kendall.coefficient,
            "pearson": self.score_pearson.coefficient,
            "bias": self.bias.mean_bias,
            "adjacent_accuracy": None,
            "weighted_kappa": None,
            "phi": None,
            "fpr": None,
            "fnr": None,
            "is_degenerate": None,
            "krippendorff_alpha": None,
            "fleiss_kappa": None,
            **_coverage_cols(self.coverage_stats),
        }
    )

    # Per-criterion rows (handle different types). Each criterion's pooled accuracy /
    # kappa land in the *_micro columns (a single criterion has no macro/micro split);
    # the macro columns stay None at this level. Binary/nominal drop the bare Fleiss
    # value (α primary); ordinal keeps both (different geometry).
    for cm in self.per_criterion:
        if cm.criterion_type == "binary":
            rows.append(
                {
                    "level": "criterion",
                    "name": cm.name,
                    "criterion_type": "binary",
                    "accuracy_micro": cm.accuracy,
                    "accuracy_macro": None,
                    "precision_micro": cm.precision,
                    "recall_micro": cm.recall,
                    "f1_micro": cm.f1,
                    "mean_kappa_macro": None,
                    "kappa_micro": cm.kappa,
                    "phi_micro": None,
                    "mean_krippendorff_alpha": None,
                    "cannot_assess_mode": None,
                    "na_mode": None,
                    "n_samples": cm.n_samples,
                    "rmse": None,
                    "mae": None,
                    "spearman": None,
                    "kendall": None,
                    "pearson": None,
                    "bias": None,
                    "adjacent_accuracy": None,
                    "weighted_kappa": None,
                    "phi": cm.phi,
                    "fpr": cm.fpr,
                    "fnr": cm.fnr,
                    "is_degenerate": cm.is_degenerate,
                    "krippendorff_alpha": cm.krippendorff_alpha,
                    # Binary: bare Fleiss dropped (α primary).
                    "fleiss_kappa": None,
                    **_coverage_cols(cm.coverage_stats),
                }
            )
        elif cm.criterion_type == "ordinal":
            rows.append(
                {
                    "level": "criterion",
                    "name": cm.name,
                    "criterion_type": "ordinal",
                    "accuracy_micro": cm.exact_accuracy,
                    "accuracy_macro": None,
                    "precision_micro": None,
                    "recall_micro": None,
                    "f1_micro": None,
                    "mean_kappa_macro": None,
                    "kappa_micro": cm.weighted_kappa,
                    "phi_micro": None,
                    "mean_krippendorff_alpha": None,
                    "cannot_assess_mode": None,
                    "na_mode": None,
                    "n_samples": cm.n_samples,
                    "rmse": cm.rmse,
                    "mae": cm.mae,
                    "spearman": cm.spearman.coefficient,
                    "kendall": cm.kendall.coefficient,
                    "pearson": None,
                    "bias": None,
                    "adjacent_accuracy": cm.adjacent_accuracy,
                    "weighted_kappa": cm.weighted_kappa,
                    "phi": None,
                    "fpr": None,
                    "fnr": None,
                    "is_degenerate": cm.is_degenerate,
                    "krippendorff_alpha": cm.krippendorff_alpha,
                    # Ordinal: keep Fleiss (different geometry from α).
                    "fleiss_kappa": cm.fleiss_kappa,
                    **_coverage_cols(cm.coverage_stats),
                }
            )
        else:  # nominal
            rows.append(
                {
                    "level": "criterion",
                    "name": cm.name,
                    "criterion_type": "nominal",
                    "accuracy_micro": cm.exact_accuracy,
                    "accuracy_macro": None,
                    "precision_micro": None,
                    "recall_micro": None,
                    "f1_micro": None,
                    "mean_kappa_macro": None,
                    "kappa_micro": cm.kappa,
                    "phi_micro": None,
                    "mean_krippendorff_alpha": None,
                    "cannot_assess_mode": None,
                    "na_mode": None,
                    "n_samples": cm.n_samples,
                    "rmse": None,
                    "mae": None,
                    "spearman": None,
                    "kendall": None,
                    "pearson": None,
                    "bias": None,
                    "adjacent_accuracy": None,
                    "weighted_kappa": None,
                    "phi": None,
                    "fpr": None,
                    "fnr": None,
                    "is_degenerate": cm.is_degenerate,
                    "krippendorff_alpha": cm.krippendorff_alpha,
                    # Nominal: bare Fleiss dropped (α primary).
                    "fleiss_kappa": None,
                    **_coverage_cols(cm.coverage_stats),
                }
            )

    # Per-judge rows (if available)
    if self.per_judge:
        for judge_id, jm in self.per_judge.items():
            rows.append(
                {
                    "level": "judge",
                    "name": judge_id,
                    "criterion_type": "all",
                    "accuracy_micro": jm.criterion_accuracy,
                    "accuracy_macro": None,
                    "precision_micro": jm.criterion_precision,
                    "recall_micro": jm.criterion_recall,
                    "f1_micro": jm.criterion_f1,
                    "mean_kappa_macro": jm.mean_kappa,
                    "kappa_micro": None,
                    "phi_micro": None,
                    "mean_krippendorff_alpha": None,
                    "cannot_assess_mode": None,
                    "na_mode": None,
                    "n_samples": None,
                    "rmse": jm.score_rmse,
                    "mae": jm.score_mae,
                    "spearman": jm.score_spearman.coefficient,
                    "kendall": jm.score_kendall.coefficient,
                    "pearson": jm.score_pearson.coefficient,
                    "bias": jm.bias.mean_bias,
                    "adjacent_accuracy": None,
                    "weighted_kappa": None,
                    "phi": jm.phi,
                    "fpr": None,
                    "fnr": None,
                    "is_degenerate": None,
                    "krippendorff_alpha": None,
                    "fleiss_kappa": None,
                    **_coverage_cols(None),
                }
            )

    return pd.DataFrame(rows)
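
The `_micro` / `_macro` suffixes matter most when criteria have very different support. As a minimal sketch with made-up tallies (the criterion names and counts are hypothetical, and this is plain arithmetic rather than library code), pooling over decisions and averaging over criteria give different answers:

```python
# Hypothetical per-criterion tallies: (name, correct decisions, total decisions).
tallies = [
    ("cites_sources", 90, 100),  # high-support criterion
    ("is_polite", 5, 10),        # low-support criterion
]

# Micro: pool every decision, so the high-support criterion dominates.
accuracy_micro = sum(c for _, c, _ in tallies) / sum(t for _, _, t in tallies)

# Macro: unweighted mean of per-criterion accuracies, so each criterion counts equally.
accuracy_macro = sum(c / t for _, c, t in tallies) / len(tallies)

print(f"micro={accuracy_micro:.3f} macro={accuracy_macro:.3f}")  # micro=0.864 macro=0.700
```

On the exported frame, rows for one level can be selected with ordinary pandas boolean indexing, for example `df[df["level"] == "criterion"]`.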

to_file ¶

to_file(path: str | Path) -> None

Save metrics to a JSON file.

PARAMETER	DESCRIPTION
`path`	Path to the output JSON file. TYPE: `str \| Path`

Source code in src/autorubric/metrics/_types.py

def to_file(self, path: str | Path) -> None:
    """Save metrics to a JSON file.

    Args:
        path: Path to the output JSON file.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(self.model_dump_json(indent=2), encoding="utf-8")
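
Because to_file() writes plain UTF-8 JSON, the saved file can be read back with the standard library. The sketch below mirrors the three lines of to_file() with a stand-in dict in place of a real metrics object; the keys and values are placeholders, not a guaranteed schema.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for self.model_dump_json(indent=2); keys and values are placeholders.
payload = {"n_items": 50, "criterion_accuracy": 0.84, "mean_kappa": None}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "reports" / "run_01" / "metrics.json"
    path.parent.mkdir(parents=True, exist_ok=True)  # same directory handling as to_file()
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(loaded["mean_kappa"])  # None: JSON null round-trips to Python None
```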

CriterionMetrics¶

Per-criterion binary metrics.

CriterionMetrics ¶

Bases: BaseModel

Metrics for a single binary criterion.

ATTRIBUTE	DESCRIPTION
`name`	Name of the criterion. TYPE: `str`
`index`	Index of the criterion in the rubric. TYPE: `int`
`criterion_type`	Type of criterion ("binary" for this class). TYPE: `Literal['binary']`
`n_samples`	Number of samples used for this criterion. TYPE: `int`
`accuracy`	Binary accuracy (proportion of exact matches). None when undefined / no samples. TYPE: `float \| None`
`precision`	Precision for MET class. None when undefined / no samples. TYPE: `float \| None`
`recall`	Recall for MET class. None when undefined / no samples. TYPE: `float \| None`
`f1`	F1 score for MET class. None when undefined / no samples. TYPE: `float \| None`
`kappa`	Cohen's kappa coefficient. None when undefined (degenerate single-class) / no samples. TYPE: `float \| None`
`kappa_interpretation`	Human-readable interpretation of kappa ("undefined" when kappa is None). TYPE: `str`
`krippendorff_alpha`	Krippendorff's alpha — the general, recommended inter-judge agreement statistic. It natively handles unequal/missing raters (errored or excluded votes) and is level-aware (nominal for binary criteria). None unless this is an ensemble with >=2 judges and >=2 items. TYPE: `float \| None`
`fleiss_kappa`	Fleiss' kappa — the classic fixed-rater nominal inter-judge agreement measure, computed complete-case (only items where every judge cast a genuine counted vote contribute). Prefer `krippendorff_alpha` as the general statistic. None unless ensemble with >=2 judges and >=2 complete-case items. TYPE: `float \| None`
`support_true`	Count of MET in ground truth. TYPE: `int`
`support_pred`	Count of MET in predictions. TYPE: `int`
`confusion_matrix`	2×2 labelled confusion matrix (`["MET", "UNMET"]`, rows=true, cols=pred). `None` when there are no samples. TYPE: `ConfusionMatrix \| None`
`fpr`	False-positive rate: the share of ground-truth UNMET items predicted MET, FP / (FP + TN). `None` when undefined (no ground-truth UNMET items) / no samples. TYPE: `float \| None`
`fnr`	False-negative rate: the share of ground-truth MET items predicted UNMET, FN / (FN + TP). `None` when undefined (no ground-truth MET items) / no samples. TYPE: `float \| None`
`phi`	Matthews correlation coefficient (the φ coefficient) on the {MET, UNMET} dichotomy. `None` on constant / single-class data, where it is genuinely undefined (never a fabricated `0.0`). TYPE: `float \| None`
`is_degenerate`	True iff this criterion had samples (`n_samples > 0`) but `kappa` is still `None` — agreement could not be estimated because the data collapsed onto a single class. Distinct from the no-data case (`n_samples == 0`). TYPE: `bool`
`coverage_stats`	How much of the raw paired sample survived abstention/error exclusion. Only populated under the `exclude` handling mode; `None` otherwise. TYPE: `CoverageStats \| None`
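
The "None rather than 0.0" convention for `fpr`, `fnr`, and `phi` can be illustrated from raw 2×2 counts. This is a minimal sketch, not the library's code: `binary_rates` is a hypothetical helper, the counts are made up, and φ is computed with the standard Matthews formula.

```python
import math


def binary_rates(tp, fp, fn, tn):
    """Return (fpr, fnr, phi) from 2x2 counts, with None where a value is undefined."""
    # FPR needs at least one ground-truth UNMET item (fp + tn > 0).
    fpr = fp / (fp + tn) if (fp + tn) > 0 else None
    # FNR needs at least one ground-truth MET item (fn + tp > 0).
    fnr = fn / (fn + tp) if (fn + tp) > 0 else None
    # Phi is undefined when any row or column of the table is empty.
    denom = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    phi = (tp * tn - fp * fn) / math.sqrt(denom) if denom > 0 else None
    return fpr, fnr, phi


mixed = binary_rates(tp=40, fp=10, fn=5, tn=45)
single_class = binary_rates(tp=50, fp=0, fn=0, tn=0)  # every item is MET / MET

print(mixed)
print(single_class)  # (None, 0.0, None)
```

In the single-class case FNR is a genuine 0.0 (no MET item was missed), while FPR and φ have nothing to be computed from.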

CorrelationResult¶

Correlation statistics between predicted and ground truth scores.

CorrelationResult ¶

Bases: BaseModel

Result from correlation calculation (Spearman, Kendall, Pearson).

ATTRIBUTE	DESCRIPTION
`coefficient`	The correlation coefficient (-1 to 1). `None` when the correlation is genuinely undefined: a constant input array (zero variance → NaN) or fewer than 3 samples. Reporting `0.0` ("no correlation") in those cases would misstate an undefined result. TYPE: `float \| None`
`p_value`	P-value for testing the null hypothesis of no correlation. `None` when the coefficient is undefined (see `coefficient`). TYPE: `float \| None`
`ci`	Optional confidence interval for the coefficient. TYPE: `ConfidenceInterval \| None`
`interpretation`	Human-readable interpretation ("undefined" for a constant/NaN array, "insufficient data" for <3 samples). TYPE: `str`
`n_samples`	Number of samples used in calculation. TYPE: `int`
`method`	Correlation method used (e.g., "spearman", "kendall", "pearson"). TYPE: `str`
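
The two undefined cases in the table (fewer than 3 samples, constant input) can be sketched for Pearson's r as follows. `pearson_or_none` is a hypothetical helper written for illustration; it reuses the "insufficient data" and "undefined" labels from the table but is not the library's implementation.

```python
import math


def pearson_or_none(x, y):
    """Return (coefficient, reason); coefficient is None when r is undefined."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 3:
        return None, "insufficient data"
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Zero variance: the ratio would be 0/0, so there is no coefficient to report.
        return None, "undefined"
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy), None


too_few = pearson_or_none([1, 2], [1, 2])
constant = pearson_or_none([1, 1, 1], [1, 2, 3])
defined = pearson_or_none([1, 2, 3, 4], [2, 4, 6, 8])
print(too_few, constant, defined)
```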

interpret_correlation `staticmethod` ¶

interpret_correlation(r: float) -> str

Return human-readable interpretation of correlation coefficient.

Source code in src/autorubric/metrics/_types.py

@staticmethod
def interpret_correlation(r: float) -> str:
    """Return human-readable interpretation of correlation coefficient."""
    abs_r = abs(r)
    if abs_r >= 0.9:
        strength = "very strong"
    elif abs_r >= 0.7:
        strength = "strong"
    elif abs_r >= 0.5:
        strength = "moderate"
    elif abs_r >= 0.3:
        strength = "weak"
    else:
        strength = "very weak"

    direction = "positive" if r >= 0 else "negative"
    return f"{strength} {direction}"

BootstrapResults¶

Bootstrap confidence intervals for key metrics.

BootstrapResults ¶

Bases: BaseModel

Bootstrap confidence interval results.

The three CIs are MARGINAL — bootstrapped on two independent item-level resample axes (a verdict-item axis for accuracy/kappa, an independent scored-item axis for RMSE), so they reflect each statistic's own sampling distribution, not their joint covariance. Covers any rubric type (binary / multi-choice / mixed).

ATTRIBUTE	DESCRIPTION
`accuracy_ci`	CI for `criterion_accuracy` at `confidence_level` (95% by default). None when undefined / no samples. TYPE: `tuple[float, float] \| None`
`kappa_ci`	CI for `mean_kappa` at `confidence_level` (95% by default); ordinal criteria contribute quadratic-weighted kappa. Each replicate's mean conditions on which criteria were non-degenerate in that resample. None when undefined (kappa never defined across resamples) / no samples. TYPE: `tuple[float, float] \| None`
`rmse_ci`	CI for `score_rmse` at `confidence_level` (95% by default), over the scored-item subset. None when no samples (a single scored item yields a degenerate `(v, v)` interval, not None). TYPE: `tuple[float, float] \| None`
`n_bootstrap`	Number of bootstrap samples used. TYPE: `int`
`confidence_level`	Confidence level (default 0.95). TYPE: `float`
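
As a rough illustration of an item-level bootstrap interval, the sketch below resamples items with replacement and takes nearest-rank percentiles. The percentile method, the helper name `percentile_bootstrap_ci`, and the data are assumptions made for this example; the table above does not say which interval construction the library uses.

```python
import random


def percentile_bootstrap_ci(values, stat, n_bootstrap=2000, confidence_level=0.95, seed=0):
    """Percentile bootstrap CI for stat(values); None when there are no samples."""
    n = len(values)
    if n == 0:
        return None
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    replicates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)]) for _ in range(n_bootstrap)
    )
    tail = (1 - confidence_level) / 2
    lower = replicates[int(tail * (n_bootstrap - 1))]
    upper = replicates[int((1 - tail) * (n_bootstrap - 1))]
    return lower, upper


def mean(xs):
    return sum(xs) / len(xs)


# Made-up per-item correctness flags: 42 of 50 decisions match ground truth.
correct = [1] * 42 + [0] * 8
accuracy_ci = percentile_bootstrap_ci(correct, mean)
single_item_ci = percentile_bootstrap_ci([0.3], mean)  # degenerate (v, v) interval
empty_ci = percentile_bootstrap_ci([], mean)           # None: nothing to resample

print(accuracy_ci, single_item_ci, empty_ci)
```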

BootstrapResult¶

Single bootstrap result with confidence interval.

BootstrapResult ¶

Bases: BaseModel

Bootstrap confidence interval result.

ATTRIBUTE	DESCRIPTION
`estimate`	Point estimate of the statistic. TYPE: `float`
`ci`	Confidence interval from bootstrap. TYPE: `ConfidenceInterval`
`standard_error`	Bootstrap standard error. TYPE: `float`
`n_bootstrap`	Number of bootstrap samples used. TYPE: `int`
`bootstrap_distribution`	Optional array of bootstrap estimates. TYPE: `list[float] \| None`

ConfidenceInterval¶

Confidence interval bounds.

ConfidenceInterval ¶

Bases: BaseModel

Confidence interval for a statistic.

ATTRIBUTE	DESCRIPTION
`lower`	Lower bound of the interval. TYPE: `float`
`upper`	Upper bound of the interval. TYPE: `float`
`confidence`	Confidence level (default 0.95 for 95% CI). TYPE: `float`
`method`	Method used to compute the interval. TYPE: `str`

width `property` ¶

width: float

Width of the confidence interval.

JudgeMetrics¶

Per-judge metrics for ensemble evaluations.

JudgeMetrics ¶

Bases: BaseModel

Metrics for a single judge in an ensemble.

Mirrors the aggregate's type handling field-for-field: precision/recall/f1 are binary MET-vs-rest metrics, so they are None for a multi-choice-only rubric (no MET class); accuracy and mean_kappa generalize to multi-choice criteria but are None when undefined.

ATTRIBUTE	DESCRIPTION
`judge_id`	Identifier for this judge. TYPE: `str`
`criterion_accuracy`	Overall criterion-level accuracy (binary label and/or multi-choice exact-match). `None` when undefined. TYPE: `float \| None`
`criterion_precision`	Overall precision for the binary MET class. `None` when not applicable (multi-choice-only — no MET class). TYPE: `float \| None`
`criterion_recall`	Overall recall for the binary MET class. `None` when not applicable (see `criterion_precision`). TYPE: `float \| None`
`criterion_f1`	Overall F1 for the binary MET class. `None` when not applicable (see `criterion_precision`). TYPE: `float \| None`
`mean_kappa`	Mean Cohen's kappa across criteria. `None` when undefined. TYPE: `float \| None`
`phi`	Matthews correlation coefficient (φ) for this judge on the binary {MET, UNMET} dichotomy, pooled across criteria. `None` when undefined (constant / single class / no binary data). TYPE: `float \| None`
`confusion_matrix`	This judge's confusion matrix, aggregated across criteria from the raw pre-filter codes (binary MET/UNMET with an abstain `CANNOT_ASSESS` class last → 3×3). `None` when there is no data. TYPE: `ConfusionMatrix \| None`
`score_rmse`	RMSE of cumulative scores. TYPE: `float`
`score_mae`	MAE of cumulative scores. TYPE: `float`
`score_spearman`	Spearman correlation result. TYPE: `CorrelationResult`
`score_kendall`	Kendall tau correlation result. TYPE: `CorrelationResult`
`score_pearson`	Pearson correlation result. TYPE: `CorrelationResult`
`bias`	Systematic bias analysis result. TYPE: `BiasResult`
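
The `phi` field can be illustrated with the standard Matthews correlation formula on a 2×2 table. This is a minimal sketch, assuming the pooled MET/UNMET counts are already available; the function name `phi_coefficient` and its argument order are hypothetical and not part of the library.

```python
import math
from typing import Optional


def phi_coefficient(tp: int, fp: int, fn: int, tn: int) -> Optional[float]:
    """Hypothetical helper: Matthews correlation (phi) from 2x2 counts.

    tp/fp/fn/tn treat MET as the positive class.
    """
    denom = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denom == 0:
        # A constant prediction or a single ground-truth class makes phi
        # undefined, so return None rather than a placeholder 0.0.
        return None
    return (tp * tn - fp * fn) / math.sqrt(denom)


balanced = phi_coefficient(tp=3, fp=1, fn=1, tn=3)
single_class = phi_coefficient(tp=4, fp=0, fn=0, tn=0)
```

Returning None for the single-class case follows the rule stated for `phi` above; the exact degeneracy checks in the library may differ.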

BiasResult¶

Systematic bias analysis between predicted and ground truth scores.

BiasResult ¶

Bases: BaseModel

Result from systematic bias analysis.

Systematic bias occurs when one rater consistently scores higher or lower than another, independent of the item being rated.

A statistic is None when it is genuinely undefined for the sample size, never a fake 0.0. mean_bias is the single pred−true difference at n=1 (computable) and is None only at n=0. std_bias is None when undefined (n<2). effect_size (Cohen's d) is None when std_bias is 0 or undefined.

ATTRIBUTE	DESCRIPTION
`mean_bias`	Mean difference (predictions - actuals). `None` only at n=0. TYPE: `float \| None`
`std_bias`	Standard deviation of differences. `None` when undefined (n < 2). TYPE: `float \| None`
`is_significant`	Whether the bias is statistically significant (p < 0.05). TYPE: `bool`
`p_value`	P-value from t-test. TYPE: `float \| None`
`direction`	Direction of bias: "positive" when predictions exceed actuals, "negative" when they fall below. TYPE: `Literal['positive', 'negative', 'none']`
`effect_size`	Cohen's d effect size. `None` when undefined (std_bias 0 or undefined). TYPE: `float \| None`
`ci`	Confidence interval for mean bias. TYPE: `ConfidenceInterval \| None`
`n_samples`	Number of samples. TYPE: `int`
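
The None rules for `mean_bias`, `std_bias`, and `effect_size` can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helper name `bias_summary` is hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption.

```python
import statistics
from typing import Sequence


def bias_summary(pred: Sequence[float], true: Sequence[float]) -> dict:
    """Hypothetical helper mirroring the None rules described above."""
    diffs = [p - t for p, t in zip(pred, true)]
    n = len(diffs)
    # Computable from a single pair; undefined only when there is no data.
    mean_bias = statistics.fmean(diffs) if n >= 1 else None
    # Sample standard deviation (assumed); needs at least two differences.
    std_bias = statistics.stdev(diffs) if n >= 2 else None
    # Cohen's d is undefined when the spread is zero or missing.
    effect_size = mean_bias / std_bias if std_bias else None
    return {
        "mean_bias": mean_bias,
        "std_bias": std_bias,
        "effect_size": effect_size,
        "n_samples": n,
    }


three = bias_summary([0.8, 0.9, 0.7], [0.6, 0.8, 0.4])
one = bias_summary([0.5], [0.25])
empty = bias_summary([], [])
```

With three pairs every statistic is defined; with one pair only `mean_bias` is; with no pairs all three are None.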

interpret_effect_size `staticmethod` ¶

interpret_effect_size(d: float) -> str

Interpret effect size using Cohen's guidelines.

Source code in src/autorubric/metrics/_types.py

@staticmethod
def interpret_effect_size(d: float) -> str:
    """Interpret effect size using Cohen's guidelines."""
    abs_d = abs(d)
    if abs_d < 0.2:
        return "negligible"
    elif abs_d < 0.5:
        return "small"
    elif abs_d < 0.8:
        return "medium"
    else:
        return "large"

OrdinalCriterionMetrics¶

Per-criterion metrics for ordinal multi-choice criteria.

OrdinalCriterionMetrics ¶

Bases: BaseModel

Metrics for an ordinal multi-choice criterion.

Ordinal criteria have options with inherent ordering (e.g., satisfaction 1-4). This enables additional metrics like weighted kappa and rank correlations.

ATTRIBUTE	DESCRIPTION
`name`	Name of the criterion. TYPE: `str`
`index`	Index of the criterion in the rubric. TYPE: `int`
`criterion_type`	Type of criterion ("ordinal" for this class). TYPE: `Literal['ordinal']`
`n_samples`	Number of samples used in computation. TYPE: `int`
`n_options`	Number of options in this criterion. TYPE: `int`
`exact_accuracy`	Proportion of exact index matches. None when undefined / no samples. TYPE: `float \| None`
`adjacent_accuracy`	Proportion within +/-1 position. None when undefined / no samples. TYPE: `float \| None`
`weighted_kappa`	Quadratic-weighted Cohen's kappa (accounts for distance). None when undefined (degenerate single-class) / no samples. TYPE: `float \| None`
`kappa_interpretation`	Human-readable interpretation of kappa ("undefined" when weighted_kappa is None). TYPE: `str`
`krippendorff_alpha`	Krippendorff's alpha — the general, recommended inter-judge agreement statistic. Computed with `level_of_measurement="ordinal"` so it is distance-aware (near-miss disagreements penalized less than far-miss), and it natively handles unequal/missing raters. None unless ensemble with >=2 judges and >=2 items. TYPE: `float \| None`
`fleiss_kappa`	Fleiss' kappa — the classic fixed-rater nominal measure (ignores ordering), computed complete-case. Prefer `krippendorff_alpha` for ordinal criteria. None unless ensemble with >=2 judges and >=2 complete-case items. TYPE: `float \| None`
`spearman`	Spearman rank correlation result. TYPE: `CorrelationResult`
`kendall`	Kendall tau correlation result. TYPE: `CorrelationResult`
`rmse`	RMSE on option values (0-1 scale). None when undefined / no samples. TYPE: `float \| None`
`mae`	MAE on option values (0-1 scale). None when undefined / no samples. TYPE: `float \| None`
`per_option`	Per-option precision/recall/F1 breakdown. TYPE: `list[OptionMetrics]`
`confusion_matrix`	N×N labelled confusion matrix (rows=true, cols=pred); its `.labels` carry the option labels (the former `option_labels`). TYPE: `ConfusionMatrix`
`is_degenerate`	True iff this criterion had samples (`n_samples > 0`) but `weighted_kappa` is still `None` — agreement could not be estimated because the data collapsed onto a single class. Distinct from the no-data case (`n_samples == 0`), where every metric is `None` simply for lack of samples. TYPE: `bool`
`coverage_stats`	How much of the raw paired sample survived abstention/error exclusion. Only populated under the `exclude` handling mode; `None` otherwise. TYPE: `CoverageStats \| None`
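
To make the distance-aware metrics concrete, here is a minimal sketch of quadratic-weighted kappa and adjacent accuracy on option indices. The function names are hypothetical, and the rule for returning None (zero expected disagreement) is an assumption that may be narrower than the library's own degeneracy check.

```python
from typing import Optional, Sequence


def quadratic_weighted_kappa(
    true: Sequence[int], pred: Sequence[int], n_options: int
) -> Optional[float]:
    """Hypothetical helper: QWK on 0-based option indices."""
    n = len(true)
    if n == 0 or n_options < 2:
        return None
    # Observed confusion counts (rows = true, cols = pred).
    observed = [[0] * n_options for _ in range(n_options)]
    for t, p in zip(true, pred):
        observed[t][p] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n_options)) for j in range(n_options)]
    num = den = 0.0
    for i in range(n_options):
        for j in range(n_options):
            # Quadratic weight: far misses cost more than near misses.
            w = (i - j) ** 2 / (n_options - 1) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n
    # Both sides collapsed onto one class: agreement cannot be estimated.
    return None if den == 0 else 1.0 - num / den


def adjacent_accuracy(true: Sequence[int], pred: Sequence[int]) -> Optional[float]:
    """Proportion of predictions within one position of the true index."""
    if not true:
        return None
    return sum(abs(t - p) <= 1 for t, p in zip(true, pred)) / len(true)


truth = [0, 1, 2, 3]
judged = [0, 2, 2, 3]  # one near miss (1 -> 2)
qwk = quadratic_weighted_kappa(truth, judged, n_options=4)
adjacent = adjacent_accuracy(truth, judged)
```

In this example a single off-by-one error lowers exact accuracy to 0.75 while adjacent accuracy stays at 1.0 and the weighted kappa remains high.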

NominalCriterionMetrics¶

Per-criterion metrics for nominal multi-choice criteria.

NominalCriterionMetrics ¶

Bases: BaseModel

Metrics for a nominal multi-choice criterion.

Nominal criteria have unordered categories (e.g., "too few", "just right", "too many"). Distance between options is not meaningful, so only exact matches matter.

ATTRIBUTE	DESCRIPTION
`name`	Name of the criterion. TYPE: `str`
`index`	Index of the criterion in the rubric. TYPE: `int`
`criterion_type`	Type of criterion ("nominal" for this class). TYPE: `Literal['nominal']`
`n_samples`	Number of samples used in computation. TYPE: `int`
`n_options`	Number of options in this criterion. TYPE: `int`
`exact_accuracy`	Proportion of exact index matches. None when undefined / no samples. TYPE: `float \| None`
`kappa`	Unweighted Cohen's kappa (N×N). None when undefined (degenerate single-class) / no samples. TYPE: `float \| None`
`kappa_interpretation`	Human-readable interpretation of kappa ("undefined" when kappa is None). TYPE: `str`
`krippendorff_alpha`	Krippendorff's alpha — the general, recommended inter-judge agreement statistic. Computed with `level_of_measurement="nominal"` and natively handles unequal/missing raters. None unless ensemble with >=2 judges and >=2 items. TYPE: `float \| None`
`fleiss_kappa`	Fleiss' kappa — the classic fixed-rater nominal measure, computed complete-case. Prefer `krippendorff_alpha` as the general statistic. None unless ensemble with >=2 judges and >=2 complete-case items. TYPE: `float \| None`
`per_option`	Per-option precision/recall/F1 breakdown. TYPE: `list[OptionMetrics]`
`confusion_matrix`	N×N labelled confusion matrix (rows=true, cols=pred); its `.labels` carry the option labels (the former `option_labels`). TYPE: `ConfusionMatrix`
`is_degenerate`	True iff this criterion had samples (`n_samples > 0`) but `kappa` is still `None` — agreement could not be estimated because the data collapsed onto a single class. Distinct from the no-data case (`n_samples == 0`). TYPE: `bool`
`coverage_stats`	How much of the raw paired sample survived abstention/error exclusion. Only populated under the `exclude` handling mode; `None` otherwise. TYPE: `CoverageStats \| None`
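
For nominal criteria only exact matches count, which the following sketch of unweighted Cohen's kappa illustrates, together with the `is_degenerate` condition described above. The helper name `cohen_kappa` is hypothetical, and the degeneracy test (chance agreement equal to 1) is an assumption about how "collapsed onto a single class" is detected.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def cohen_kappa(
    true: Sequence[Hashable], pred: Sequence[Hashable]
) -> Optional[float]:
    """Hypothetical helper: unweighted Cohen's kappa on category labels."""
    n = len(true)
    if n == 0:
        return None
    # Observed agreement: fraction of exact matches.
    p_o = sum(t == p for t, p in zip(true, pred)) / n
    true_counts, pred_counts = Counter(true), Counter(pred)
    # Chance agreement from the two marginal distributions.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    # p_e == 1 means both sides used one identical class throughout.
    return None if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


truth = ["too few", "just right", "too many", "just right"]
judged = ["too few", "just right", "just right", "just right"]
kappa = cohen_kappa(truth, judged)

collapsed = ["just right"] * 4
collapsed_kappa = cohen_kappa(collapsed, collapsed)
# Samples exist, yet kappa is undefined: the degenerate case.
is_degenerate = len(collapsed) > 0 and collapsed_kappa is None
```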

NAStats¶

Statistics for NA (not applicable) handling in multi-choice criteria.

NAStats ¶

Bases: BaseModel

Statistics for NA (not applicable) handling in multi-choice criteria.

Tracks how the prediction and ground truth agree on the dichotomized {NA, not-NA} decision per item, similar to how CANNOT_ASSESS is handled for binary criteria.

ATTRIBUTE	DESCRIPTION
`na_count_true`	Number of NA selections in ground truth. TYPE: `int`
`na_count_pred`	Number of NA selections in predictions. TYPE: `int`
`na_kappa`	Cohen's kappa on the {NA, not-NA} dichotomy (pred vs truth). Range [-1, 1]; 1.0 is perfect agreement, 0 is chance-level, negative is worse than chance. None when undefined (no paired NA observations, single class, or NaN). The framework reports prediction-vs-ground-truth categorical agreement as Cohen's kappa across the board (binary `kappa`, ordinal `weighted_kappa`, nominal `kappa`); na_kappa is the dichotomized kappa for the orthogonal abstain decision. Readers who want a raw proportion can derive `A / (A + fp + fn)` from the counts below. TYPE: `float \| None`
`na_kappa_interpretation`	Landis & Koch interpretation of `na_kappa` via `KappaResult.interpret_kappa`. None when na_kappa is None. TYPE: `str \| None`
`na_false_positive`	Count where prediction was NA but ground truth was not. TYPE: `int`
`na_false_negative`	Count where ground truth was NA but prediction was not. TYPE: `int`
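
The counts and the dichotomized kappa above can be sketched from per-item NA flags. This is an illustration only: the helper name `na_stats` and the boolean-flag inputs are assumptions, and the library's conditions for returning None may be stricter than the single check used here.

```python
from typing import Sequence


def na_stats(true_is_na: Sequence[bool], pred_is_na: Sequence[bool]) -> dict:
    """Hypothetical helper: {NA, not-NA} agreement between pred and truth."""
    pairs = list(zip(true_is_na, pred_is_na))
    n = len(pairs)
    both = sum(t and p for t, p in pairs)
    false_pos = sum(p and not t for t, p in pairs)  # pred NA, truth not
    false_neg = sum(t and not p for t, p in pairs)  # truth NA, pred not
    neither = n - both - false_pos - false_neg
    kappa = None
    if n:
        p_o = (both + neither) / n
        # Chance agreement from the NA and not-NA marginals of each side.
        p_e = (
            (both + false_neg) * (both + false_pos)
            + (neither + false_pos) * (neither + false_neg)
        ) / (n * n)
        if p_e != 1.0:  # undefined when both sides use a single class
            kappa = (p_o - p_e) / (1.0 - p_e)
    return {
        "na_count_true": both + false_neg,
        "na_count_pred": both + false_pos,
        "na_kappa": kappa,
        "na_false_positive": false_pos,
        "na_false_negative": false_neg,
    }


stats = na_stats(
    true_is_na=[True, False, False, True, False],
    pred_is_na=[True, True, False, False, False],
)
```

The same 2×2 construction applies to the binary CANNOT_ASSESS dichotomy, with the field names prefixed accordingly.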

CannotAssessStats¶

Statistics for CANNOT_ASSESS handling in binary criteria — the binary parallel of NAStats. Both are abstentions that flow through the same SKIP scoring path and get a dichotomized Cohen's-kappa diagnostic, but they are tracked as distinct types: CANNOT_ASSESS is an epistemic abstention on a yes/no decision ("I cannot determine MET vs. UNMET"), while multi-choice NA is "no applicable option" (a statement about the option space). Its fields are ca_-prefixed: ca_count_true, ca_count_pred, ca_kappa (float | None), ca_kappa_interpretation, ca_false_positive, ca_false_negative.

CannotAssessStats ¶

Bases: BaseModel

Statistics for CANNOT_ASSESS handling in binary criteria.

The binary parallel of NAStats: tracks how the prediction and ground truth agree on the dichotomized {CANNOT_ASSESS, not-CANNOT_ASSESS} decision per item.

Both CANNOT_ASSESS (binary) and NA (multi-choice) are abstentions that flow through the same SKIP scoring path (score_reports), and both get a parallel dichotomized Cohen's-kappa diagnostic block. They are nonetheless distinct kinds of abstention, which is exactly why they are tracked by separate stats types rather than merged:

Binary CANNOT_ASSESS is the judge being unable to determine MET-vs-UNMET — an epistemic abstention on a yes/no question ("I cannot decide whether this requirement is met").
Multi-choice NA is "not applicable / cannot pick an applicable option" — abstaining because no scored category fits, a statement about the option space rather than a yes/no decision.

Keeping them separate (and prefixing these fields ca_) makes the semantic distinction explicit in the data model while preserving the structural analogy.

ATTRIBUTE	DESCRIPTION
`ca_count_true`	Number of CANNOT_ASSESS verdicts in ground truth. TYPE: `int`
`ca_count_pred`	Number of CANNOT_ASSESS verdicts in predictions. TYPE: `int`
`ca_kappa`	Cohen's kappa on the {CANNOT_ASSESS, not-CANNOT_ASSESS} dichotomy (pred vs truth). Range [-1, 1]; 1.0 is perfect agreement, 0 is chance-level, negative is worse than chance. None when undefined (no paired CANNOT_ASSESS observations, single class, or NaN). The framework reports prediction-vs-ground-truth categorical agreement as Cohen's kappa across the board (binary `kappa`, ordinal `weighted_kappa`, nominal `kappa`, and NA's `na_kappa`); ca_kappa is the dichotomized kappa for the orthogonal binary abstain decision. Readers who want a raw proportion can derive `A / (A + ca_fp + ca_fn)` from the counts below. TYPE: `float \| None`
`ca_kappa_interpretation`	Landis & Koch interpretation of `ca_kappa` via `KappaResult.interpret_kappa`. None when ca_kappa is None. TYPE: `str \| None`
`ca_false_positive`	Count where prediction was CANNOT_ASSESS but ground truth was not. TYPE: `int`
`ca_false_negative`	Count where ground truth was CANNOT_ASSESS but prediction was not. TYPE: `int`

CoverageStats¶

How much of the raw paired sample survived abstention/error exclusion. Built only under the exclude handling mode (under as_unmet / as_category no observation is dropped, so coverage would be trivially 1.0 and these stats are left None). n_total is the raw pre-exclusion denominator and n_covered equals the per-criterion n_samples; every rate (coverage, judge_abstain_rate, gt_abstain_rate, union_exclusion_rate, error_rate) is float | None, None when its denominator is zero.

CoverageStats ¶

Bases: BaseModel

How much of the raw paired sample survived abstention/error exclusion.

Built only under the exclude handling mode, where abstentions (CANNOT_ASSESS / NA) and grading errors drop a paired observation from the agreement denominator. Under as_unmet or as_category no observation is dropped, so coverage would be trivially 1.0 and these stats are not produced (left None by callers).

n_total is the raw pre-exclusion denominator; n_covered is what remained after the union of all exclusion reasons (it equals the per-criterion n_samples). Every rate honours undefined→None (None when its denominator is zero); counts stay int.

ATTRIBUTE	DESCRIPTION
`n_total`	Raw pre-exclusion paired count (the denominator before any drops). TYPE: `int`
`n_covered`	Paired count remaining after union-exclusion (== per-criterion `n_samples`). TYPE: `int`
`coverage`	`n_covered / n_total`. None when `n_total == 0`. TYPE: `float \| None`
`judge_abstain_rate`	Fraction of the raw pairs where the judge/prediction abstained. None when `n_total == 0`. TYPE: `float \| None`
`gt_abstain_rate`	Fraction of the raw pairs where the ground truth abstained. None when `n_total == 0`. TYPE: `float \| None`
`union_exclusion_rate`	Fraction excluded for any reason (`1 - coverage`). None when `n_total == 0`. TYPE: `float \| None`
`n_errored`	Count of paired observations dropped because grading errored. TYPE: `int`
`error_rate`	`n_errored / n_total`. None when `n_total == 0`. TYPE: `float \| None`
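
The relationships between these fields can be sketched in a few lines. The helper name `coverage_stats` and its count arguments are hypothetical; the point of the example is that every rate shares the `n_total` denominator, becomes None when that denominator is zero, and that the union exclusion rate is not the sum of the individual rates when exclusion reasons overlap.

```python
from typing import Optional


def coverage_stats(
    n_total: int,
    n_covered: int,
    n_judge_abstain: int,
    n_gt_abstain: int,
    n_errored: int,
) -> dict:
    """Hypothetical helper: derive the rates from raw counts."""

    def rate(count: int) -> Optional[float]:
        # Undefined (None) rather than 0.0 when there are no raw pairs.
        return count / n_total if n_total else None

    coverage = rate(n_covered)
    return {
        "n_total": n_total,
        "n_covered": n_covered,
        "coverage": coverage,
        "judge_abstain_rate": rate(n_judge_abstain),
        "gt_abstain_rate": rate(n_gt_abstain),
        "union_exclusion_rate": None if coverage is None else 1.0 - coverage,
        "n_errored": n_errored,
        "error_rate": rate(n_errored),
    }


# 10 raw pairs, 3 dropped. One pair has both a judge and a ground-truth
# abstention, so the individual counts (2 + 1 + 1) exceed the 3 drops.
stats = coverage_stats(
    n_total=10, n_covered=7, n_judge_abstain=2, n_gt_abstain=1, n_errored=1
)
no_data = coverage_stats(0, 0, 0, 0, 0)
```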

References¶

Casabianca, J., McCaffrey, D. F., Johnson, M. S., Alper, N., and Zubenko, V. (2025). Validity Arguments For Constructed Response Scoring Using Generative Artificial Intelligence Applications. arXiv:2501.02334.

He, J., Shi, J., Zhuo, T. Y., Treude, C., Sun, J., Xing, Z., Du, X., and Lo, D. (2025). LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead. arXiv:2510.24367.
