Metrics

Agreement and correlation metrics for validating LLM judges against ground truth.

Overview

When your dataset includes ground truth labels, compute_metrics() measures how well your LLM judge agrees with human annotations. Metrics include accuracy, precision, recall, F1, Cohen's kappa, correlations, and systematic bias analysis.

For ensemble (multi-judge) evaluations, each per-criterion metrics object also reports inter-judge agreement (judges vs. each other, independent of ground truth). The recommended statistic is Krippendorff's alpha (krippendorff_alpha): it handles unequal or missing raters and is level-aware (nominal vs. ordinal). Fleiss' kappa (fleiss_kappa) is also computed as the classic fixed-rater nominal measure, using complete cases only (items rated by every judge). Both are populated only for an ensemble of ≥2 judges and ≥2 items, and are None otherwise.

One inter-judge statistic on binary/nominal data

On binary and nominal data Krippendorff's nominal α and Fleiss' κ coincide up to a finite-sample correction (1 − κ_F)/(N·R) — they are one statistic, not corroborating evidence. summary() therefore reports α as the single primary inter-judge column for binary/nominal criteria and drops the bare Fleiss column (a note explains the omission); to_dataframe() leaves the binary/nominal fleiss_kappa value None. On ordinal data α is distance-aware while Fleiss is nominal (different geometry), so both are kept with a distinguishing note.
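
The finite-sample relationship stated above can be checked numerically. What follows is a minimal sketch in plain Python, assuming complete data (every judge rates every item) and a count-table input; the function names are illustrative and are not autorubric's API.

```python
def _mean_pairwise_agreement(counts):
    """Average within-item pairwise agreement (Fleiss' P-bar).

    counts[i][c] = number of raters who put item i in category c.
    Assumes every item has the same number of raters.
    """
    n_raters = sum(counts[0])
    per_item = [
        sum(c * (c - 1) for c in row) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    return sum(per_item) / len(per_item)


def fleiss_kappa(counts):
    total = len(counts) * sum(counts[0])
    p_bar = _mean_pairwise_agreement(counts)
    category_totals = [sum(col) for col in zip(*counts)]
    # Chance agreement from squared marginal proportions.
    p_e = sum((n_c / total) ** 2 for n_c in category_totals)
    return (p_bar - p_e) / (1 - p_e)


def krippendorff_alpha_nominal(counts):
    total = len(counts) * sum(counts[0])
    d_o = 1 - _mean_pairwise_agreement(counts)
    category_totals = [sum(col) for col in zip(*counts)]
    # Expected disagreement uses sampling without replacement: n(n-1) pairs.
    d_e = 1 - sum(n_c * (n_c - 1) for n_c in category_totals) / (total * (total - 1))
    return 1 - d_o / d_e


# 4 items, 3 judges, binary labels.
counts = [[3, 0], [2, 1], [0, 3], [1, 2]]
kappa = fleiss_kappa(counts)
alpha = krippendorff_alpha_nominal(counts)
gap = (1 - kappa) / (len(counts) * sum(counts[0]))  # (1 - kappa_F) / (N * R)
print(f"kappa={kappa:.4f} alpha={alpha:.4f} alpha-kappa={alpha - kappa:.4f} predicted={gap:.4f}")
```

On this table κ is 1/3 and α is 7/18, so the gap is exactly 1/18 = (1 − κ_F)/(N·R) with N = 4 and R = 3. The correction shrinks as the number of ratings grows, which is why the two values carry the same information.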

Research Background

Casabianca et al. (2025) recommend agreement metrics including ICC, Krippendorff's alpha, and quadratic-weighted kappa (QWK), with iterative refinement until agreement with human-labeled subsets is acceptable. He et al. (2025) emphasize that correlation alone can mask systematic bias.

Quick Example

from autorubric import RubricDataset, LLMConfig, evaluate
from autorubric.graders import CriterionGrader

dataset = RubricDataset.from_file("data_with_ground_truth.json")
grader = CriterionGrader(llm_config=LLMConfig(model="openai/gpt-4.1-mini"))

result = await evaluate(dataset, grader, show_progress=True)

# Compute metrics
metrics = result.compute_metrics(dataset)

# Formatted summary. The header names the handling modes
# (CANNOT_ASSESS / NA estimands), the criterion-level scalars carry their
# aggregation level (micro vs macro), and binary criteria show φ + FP/FN/FPR/FNR.
print(metrics.summary())

# verbose=True additionally prints the per-judge RMSE/Spearman columns and each
# judge's confusion matrix (the default per-judge line leads with accuracy + kappa + φ).
print(metrics.summary(verbose=True))

# Export options. to_dataframe() uses level-labelled aggregate keys
# (accuracy_micro / accuracy_macro / mean_kappa_macro / kappa_micro / phi_micro / ...)
# and round-trips the handling modes + coverage columns.
df = metrics.to_dataframe()
metrics.to_file("metrics.json")

Bootstrap Confidence Intervals

metrics = result.compute_metrics(
    dataset,
    bootstrap=True,
    n_bootstrap=1000,
    confidence_level=0.95,
    seed=42,
)

print(metrics.summary())
# Bootstrap CIs (95%):
#   Accuracy: [85.2%, 92.1%]
#   Kappa:    [0.712, 0.845]
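
For intuition about what an item-level resample does, the following is a minimal sketch of a percentile bootstrap for pooled accuracy. It is an assumption-laden illustration: the function name, the input layout, and the percentile method are all hypothetical and may differ from what compute_metrics() does internally.

```python
import random


def bootstrap_accuracy_ci(item_correct, n_bootstrap=1000, confidence_level=0.95, seed=None):
    """Percentile bootstrap CI for accuracy, resampling whole items.

    item_correct[i] is a list of booleans, one per criterion of item i
    (True where the judge matched ground truth). Hypothetical layout.
    """
    rng = random.Random(seed)
    n_items = len(item_correct)
    stats = []
    for _ in range(n_bootstrap):
        # Draw items with replacement; all criteria of an item move together.
        sample = [item_correct[rng.randrange(n_items)] for _ in range(n_items)]
        pooled = [flag for item in sample for flag in item]
        stats.append(sum(pooled) / len(pooled))
    stats.sort()
    tail = (1 - confidence_level) / 2
    lo = stats[int(tail * (n_bootstrap - 1))]
    hi = stats[int((1 - tail) * (n_bootstrap - 1))]
    return lo, hi


item_correct = [
    [True, True], [True, False], [False, False],
    [True, True], [True, True], [False, True],
]
lo, hi = bootstrap_accuracy_ci(item_correct, n_bootstrap=200, seed=42)
print(f"accuracy CI: [{lo:.1%}, {hi:.1%}]")
```

Resampling whole items rather than individual criterion verdicts keeps the within-item dependence between criteria intact, and fixing the seed makes the interval reproducible.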

Per-Judge Metrics (Ensemble)

metrics = result.compute_metrics(
    dataset,
    per_judge=True,
)

for judge_id, jm in metrics.per_judge.items():
    # jm.criterion_accuracy is `float | None` (None when undefined); score_rmse is always a float.
    acc = f"{jm.criterion_accuracy:.1%}" if jm.criterion_accuracy is not None else "n/a"
    print(f"{judge_id}: Accuracy={acc}, RMSE={jm.score_rmse:.4f}")

Metric Fields

None means genuinely undefined, never a fabricated 0.0

The numeric metric fields below are typed float | None. A field is None when the metric is genuinely undefined for the data at hand — it is never silently reported as a fake 0.0. Always guard the format spec (e.g. f"{x:.2f}" if x is not None else "n/a") before printing these.
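
One way to apply that guard consistently is a small formatting helper. This is a sketch only; fmt is a hypothetical name and not part of autorubric.

```python
def fmt(value, spec=".2f", missing="n/a"):
    """Format an optional float, printing a placeholder for None."""
    return format(value, spec) if value is not None else missing


print(fmt(0.8731, ".1%"))  # a defined metric
print(fmt(None))           # an undefined metric
print(fmt(0.0))            # a real zero stays distinguishable from None
```

Testing with "is not None" rather than truthiness matters here: a genuine 0.0 must still be printed as a number.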

Field Description
criterion_accuracy Overall accuracy across all criteria. float | None; None when undefined (e.g. no paired predictions).
criterion_precision Precision for the binary MET class. float | None; None when not applicable, e.g. a multi-choice-only rubric (no binary MET class).
criterion_recall Recall for the binary MET class. float | None; None when not applicable (multi-choice-only rubric).
criterion_f1 F1 for the binary MET class. float | None; None when not applicable (multi-choice-only rubric).
mean_kappa Mean Cohen's kappa across criteria (macro: unweighted mean over criteria). float | None; None when undefined (e.g. degenerate single-class).
macro_accuracy Unweighted mean of the per-criterion accuracies (macro). float | None.
micro_kappa Cohen's kappa pooled across criteria (micro, distinct from the macro mean_kappa). float | None.
criterion_phi Matthews correlation coefficient (φ) pooled over the binary MET-vs-rest flats (micro). float | None; None for a multi-choice-only rubric or on single-class data. φ = Pearson = Spearman = Kendall = MCC on binary data; the κ − φ gap is the judge's positive-rate drift.
mean_krippendorff_alpha Macro mean of the per-criterion Krippendorff's α (inter-judge). float | None.
cannot_assess_mode / na_mode How CANNOT_ASSESS / NA were handled when the metrics were computed (exclude / as_unmet / as_category). Frozen on the result and round-tripped by to_file so a serialized number is never ambiguous among the estimands.
n_samples Total paired observations contributing to the aggregate metrics. int | None.
coverage_stats Under the exclude mode, how much of the raw paired sample survived abstention/error exclusion (CoverageStats | None). Counts n_total (raw pre-exclusion denominator), n_covered (== per-criterion n_samples), and n_errored; rates coverage, judge_abstain_rate, gt_abstain_rate, union_exclusion_rate, error_rate are each float | None (None when n_total == 0).
per_criterion Per-criterion metrics breakdown (polymorphic: CriterionMetrics, OrdinalCriterionMetrics, NominalCriterionMetrics). Their per-criterion numeric fields (accuracy, precision, recall, f1, kappa, weighted_kappa, adjacent_accuracy, per-option metrics) are likewise float | None when undefined.
score_rmse RMSE of cumulative scores (always a float).
score_mae MAE of cumulative scores (always a float).
score_spearman Spearman rank correlation (CorrelationResult). Its .coefficient is float | None; None for a constant array or fewer than 3 samples.
score_kendall Kendall tau correlation (CorrelationResult). .coefficient is float | None (None for a constant array or < 3 samples).
score_pearson Pearson correlation (CorrelationResult). .coefficient is float | None (None for a constant array or < 3 samples).
bias Systematic bias analysis (BiasResult). Its .mean_bias / .std_bias are float | None: mean_bias is None at n=0 and std_bias is None for n < 2.
bootstrap Bootstrap confidence intervals (BootstrapResults, if enabled)
per_judge Per-judge metrics for ensemble (dict[str, JudgeMetrics], if enabled)
n_items Number of items used in computation
n_criteria Number of criteria
n_binary_criteria Number of binary criteria
n_ordinal_criteria Number of ordinal multi-choice criteria
n_nominal_criteria Number of nominal multi-choice criteria
na_stats Statistics for NA handling in multi-choice criteria (NAStats): na_count_true / na_count_pred counts, na_kappa (float | None) on the {NA, not-NA} dichotomy, and na_false_positive / na_false_negative.
cannot_assess_stats Statistics for CANNOT_ASSESS handling in binary criteria (CannotAssessStats) — the binary parallel of na_stats (a distinct kind of abstention; see below): ca_count_true / ca_count_pred counts, ca_kappa (float | None) on the {CANNOT_ASSESS, not-CANNOT_ASSESS} dichotomy, and ca_false_positive / ca_false_negative.
warnings Any warnings generated during computation
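
The micro/macro distinction used by several fields above can be illustrated with a short sketch. The criterion names and data are made up, and the two helper functions are hypothetical, not autorubric's implementation.

```python
def macro_accuracy(per_criterion):
    """Unweighted mean of per-criterion accuracies."""
    accs = [
        sum(p == t for p, t in zip(pred, true)) / len(pred)
        for pred, true in per_criterion.values()
    ]
    return sum(accs) / len(accs)


def micro_accuracy(per_criterion):
    """Accuracy over all (prediction, truth) pairs pooled across criteria."""
    pairs = [
        (p, t)
        for pred, true in per_criterion.values()
        for p, t in zip(pred, true)
    ]
    return sum(p == t for p, t in pairs) / len(pairs)


# Criteria can end up with different pair counts, e.g. after exclusions.
per_criterion = {
    "cites_sources": ([1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 0]),  # 7 of 8 correct
    "is_concise": ([1, 0], [0, 0]),                                          # 1 of 2 correct
}
macro = macro_accuracy(per_criterion)
micro = micro_accuracy(per_criterion)
print(f"macro={macro:.4f} micro={micro:.4f}")
```

Macro gives each criterion equal weight (0.6875 here), while micro gives each observation equal weight (0.8 here), so a small, poorly judged criterion pulls the macro figure down more.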

compute_metrics

Compute agreement metrics between predictions and ground truth.

compute_metrics(eval_result: EvalResult, dataset: RubricDataset, *, bootstrap: bool = False, n_bootstrap: int = 1000, per_judge: bool = False, cannot_assess: CannotAssessMode = 'exclude', na_mode: NAMode = 'exclude', confidence_level: float = 0.95, seed: int | None = None) -> MetricsResult

Compute comprehensive evaluation metrics.

This is the main entry point for computing metrics from an evaluation run. It compares predicted verdicts and scores against ground truth from the dataset. Supports binary, ordinal, and nominal (multi-choice) criteria.

PARAMETER DESCRIPTION
eval_result

The evaluation result from EvalRunner.

TYPE: EvalResult

dataset

The dataset with ground truth labels.

TYPE: RubricDataset

bootstrap

If True, compute bootstrap confidence intervals (expensive). Covers ANY rubric type via an item-level resample: accuracy_ci ← criterion_accuracy, kappa_ci ← mean_kappa (ordinal quadratic-weighted), rmse_ci ← score_rmse. Each CI is None when undefined (empty/degenerate axis).

TYPE: bool DEFAULT: False

n_bootstrap

Number of bootstrap samples if bootstrap=True.

TYPE: int DEFAULT: 1000

per_judge

If True and ensemble, compute per-judge metrics.

TYPE: bool DEFAULT: False

cannot_assess

How to handle CANNOT_ASSESS verdicts (binary criteria):

  • "exclude": Skip pairs where either is CANNOT_ASSESS (default).
  • "as_unmet": Treat CANNOT_ASSESS as UNMET.
  • "as_category": Keep CANNOT_ASSESS as a distinct third class. Accuracy and Cohen's kappa are then computed over three classes (a CANNOT_ASSESS prediction matching a CANNOT_ASSESS ground truth counts as correct); precision/recall/f1 remain MET-vs-rest.

TYPE: CannotAssessMode DEFAULT: 'exclude'
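
The three cannot_assess modes change which pairs enter the computation, so they yield different numbers on the same data. The sketch below shows this for accuracy only, using plain strings in place of the library's verdict enum; binary_accuracy is a hypothetical helper, not the library's code.

```python
CANNOT_ASSESS = "CANNOT_ASSESS"


def binary_accuracy(pairs, mode="exclude"):
    """Accuracy over (predicted, truth) verdict pairs under one handling mode."""
    if mode == "exclude":
        kept = [(p, t) for p, t in pairs if CANNOT_ASSESS not in (p, t)]
    elif mode == "as_unmet":
        def remap(v):
            return "UNMET" if v == CANNOT_ASSESS else v
        kept = [(remap(p), remap(t)) for p, t in pairs]
    elif mode == "as_category":
        kept = list(pairs)  # CANNOT_ASSESS is just a third class
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    if not kept:
        return None  # undefined, not 0.0
    return sum(p == t for p, t in kept) / len(kept)


pairs = [
    ("MET", "MET"),
    ("UNMET", "MET"),
    (CANNOT_ASSESS, "UNMET"),
    (CANNOT_ASSESS, CANNOT_ASSESS),
    ("MET", CANNOT_ASSESS),
]
results = {m: binary_accuracy(pairs, m) for m in ("exclude", "as_unmet", "as_category")}
print(results)
```

On these five pairs the modes give 0.5, 0.6, and 0.4 respectively, which is why a reported metric is only interpretable together with the mode that produced it.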

na_mode

How to handle NA options (multi-choice criteria). Mirrors cannot_assess for binary — NA on multi-choice is the structural analog of CANNOT_ASSESS on binary:

  • "exclude": Skip pairs where either is NA (default).
  • "as_unmet": Remap NA to the score-minimizing non-NA option, weight-sign aware (lowest value for non-negative weight, highest value for negative weight). Shares Criterion.worst_scored_option() with the grader's unknown-error worst-case path so the layers cannot drift.
  • "as_category": Keep NA as a distinct categorical column. Refused for ordinal criteria with an NA option (raises ValueError): NA has no ordinal position, so quadratic weighted Cohen's kappa would assign NA a geometrically meaningless distance.

TYPE: NAMode DEFAULT: 'exclude'
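
The weight-sign-aware remapping described for "as_unmet" can be sketched as follows. The Option dataclass and worst_scored_option_index are hypothetical stand-ins written for this illustration; the real Criterion.worst_scored_option() may differ in signature and return type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    label: str
    value: float
    na: bool = False


def worst_scored_option_index(options, weight):
    """Index of the non-NA option that minimizes weight * value."""
    candidates = [(i, o) for i, o in enumerate(options) if not o.na]
    if not candidates:
        raise ValueError("criterion has no non-NA option")
    if weight >= 0:
        # Non-negative weight: the lowest value scores worst.
        return min(candidates, key=lambda pair: pair[1].value)[0]
    # Negative weight (a penalty criterion): the highest value scores worst.
    return max(candidates, key=lambda pair: pair[1].value)[0]


options = [
    Option("poor", 0.0),
    Option("ok", 0.5),
    Option("great", 1.0),
    Option("NA", 0.0, na=True),
]
worst_for_reward = worst_scored_option_index(options, weight=1.0)
worst_for_penalty = worst_scored_option_index(options, weight=-1.0)
print(options[worst_for_reward].label, options[worst_for_penalty].label)
```

With a positive weight NA is remapped to "poor"; with a negative weight it is remapped to "great", because a high value on a penalty criterion lowers the total score the most.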

confidence_level

Confidence level for bootstrap CIs (default 0.95).

TYPE: float DEFAULT: 0.95

seed

Random seed for bootstrap reproducibility.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
MetricsResult

MetricsResult with comprehensive metrics and optional per-judge breakdown.

RAISES DESCRIPTION
ValueError

If eval_result and dataset share no items, if items have differing rubric structures, or if no item has both ground truth and a computed score.

Example

result = await evaluate(dataset, grader)
metrics = result.compute_metrics(dataset)
print(metrics.summary())
df = metrics.to_dataframe()

Source code in src/autorubric/metrics/_compute.py
def compute_metrics(
    eval_result: EvalResult,
    dataset: RubricDataset,
    *,
    bootstrap: bool = False,
    n_bootstrap: int = 1000,
    per_judge: bool = False,
    cannot_assess: CannotAssessMode = "exclude",
    na_mode: NAMode = "exclude",
    confidence_level: float = 0.95,
    seed: int | None = None,
) -> MetricsResult:
    """Compute comprehensive evaluation metrics.

    This is the main entry point for computing metrics from an evaluation run.
    It compares predicted verdicts and scores against ground truth from the dataset.
    Supports binary, ordinal, and nominal (multi-choice) criteria.

    Args:
        eval_result: The evaluation result from EvalRunner.
        dataset: The dataset with ground truth labels.
        bootstrap: If True, compute bootstrap confidence intervals (expensive). Covers ANY
            rubric type via an item-level resample: ``accuracy_ci``←``criterion_accuracy``,
            ``kappa_ci``←``mean_kappa`` (ordinal quadratic-weighted), ``rmse_ci``←``score_rmse``.
            Each CI is ``None`` when undefined (empty/degenerate axis).
        n_bootstrap: Number of bootstrap samples if bootstrap=True.
        per_judge: If True and ensemble, compute per-judge metrics.
        cannot_assess: How to handle CANNOT_ASSESS verdicts (binary criteria):
            - "exclude": Skip pairs where either is CANNOT_ASSESS (default)
            - "as_unmet": Treat CANNOT_ASSESS as UNMET
            - "as_category": Keep CANNOT_ASSESS as a distinct third class. Accuracy and
              Cohen's kappa are then computed over three classes (a CANNOT_ASSESS
              prediction matching a CANNOT_ASSESS ground truth counts as correct);
              precision/recall/f1 remain MET-vs-rest.
        na_mode: How to handle NA options (multi-choice criteria). Mirrors
            ``cannot_assess`` for binary — NA on multi-choice is the structural
            analog of CANNOT_ASSESS on binary:

            - "exclude": Skip pairs where either is NA (default).
            - "as_unmet": Remap NA to the score-minimizing non-NA option,
              weight-sign aware (lowest ``value`` for non-negative weight,
              highest ``value`` for negative weight). Shares
              ``Criterion.worst_scored_option()`` with the grader's
              ``unknown``-error worst-case path so the layers cannot drift.
            - "as_category": Keep NA as a distinct categorical column.
              **Refused for ordinal criteria with an NA option** (raises
              ``ValueError``): NA has no ordinal position, so quadratic
              weighted Cohen's kappa would assign NA a geometrically
              meaningless distance.
        confidence_level: Confidence level for bootstrap CIs (default 0.95).
        seed: Random seed for bootstrap reproducibility.

    Returns:
        MetricsResult with comprehensive metrics and optional per-judge breakdown.

    Raises:
        ValueError: If no common items between eval_result and dataset.

    Example:
        >>> result = await evaluate(dataset, grader)
        >>> metrics = result.compute_metrics(dataset)
        >>> print(metrics.summary())
        >>> df = metrics.to_dataframe()
    """
    result_warnings: list[str] = []

    # Build map of item_idx -> ItemResult
    eval_map = {ir.item_idx: ir for ir in eval_result.item_results}

    # Check for missing/extra items
    dataset_indices = set(range(len(dataset)))
    eval_indices = set(eval_map.keys())

    missing = dataset_indices - eval_indices
    if missing:
        result_warnings.append(f"{len(missing)} items from dataset not found in eval_result")

    extra = eval_indices - dataset_indices
    if extra:
        result_warnings.append(f"{len(extra)} items in eval_result not in dataset")

    # Use intersection
    common_indices = sorted(dataset_indices & eval_indices)

    if not common_indices:
        raise ValueError("No common items between eval_result and dataset")

    # Validate rubric homogeneity for metrics computation
    # If using per-item rubrics, all must have the same structure
    if dataset.rubric is not None:
        reference_rubric = dataset.rubric
    else:
        # Get rubric from first item
        reference_rubric = dataset.get_item_rubric(common_indices[0])

    reference_n_criteria = len(reference_rubric.rubric)

    for idx in common_indices:
        item_rubric = dataset.get_item_rubric(idx)
        if len(item_rubric.rubric) != reference_n_criteria:
            raise ValueError(
                f"Cannot compute metrics: items have different rubric structures. "
                f"Item {idx} has {len(item_rubric.rubric)} criteria but "
                f"expected {reference_n_criteria}. "
                f"Metrics require homogeneous rubric structures across all items."
            )

    # Use the reference rubric for classification
    criteria = list(reference_rubric.rubric)
    criterion_types = classify_criteria(criteria)
    n_criteria = len(criteria)

    # Count criteria by type
    n_binary = sum(1 for ct in criterion_types if ct == "binary")
    n_ordinal = sum(1 for ct in criterion_types if ct == "ordinal")
    n_nominal = sum(1 for ct in criterion_types if ct == "nominal")

    # Per-criterion data storage
    # For binary: list[CriterionVerdict]
    # For multi-choice: list[int] (option indices). A predicted index may transiently be
    # None for a genuine multi-choice error-abstain; it is normalized to the
    # effective NA index right after the effective criteria are built, so consumers below
    # only ever see CriterionVerdict | int.
    per_criterion_pred: list[list[CriterionVerdict | int | None]] = [[] for _ in range(n_criteria)]
    per_criterion_true: list[list[CriterionVerdict | int]] = [[] for _ in range(n_criteria)]

    # Overall scores
    all_pred_scores: list[float] = []
    all_true_scores: list[float] = []

    # For ensemble: per-judge data (binary verdicts + multi-choice option indices).
    judge_scores: dict[str, list[float]] = {}
    judge_verdicts: dict[str, list[list[CriterionVerdict]]] = {}
    # Per-judge multi-choice predictions (items x criteria); binary cells are a None
    # placeholder. A multi-choice cell may transiently be None (genuine error-abstain);
    # it is normalized to the effective NA index after the effective criteria are
    # built, mirroring the aggregate per_criterion_pred normalization.
    judge_mc_preds: dict[str, list[list[int | None]]] = {}
    judge_errors: dict[str, list[list[str | None]]] = {}
    is_ensemble = False

    # Per-item ground-truth verdicts (all criteria) aligned 1:1 with each item that
    # contributes ensemble per-judge data; used for the per-judge metrics fix.
    per_item_true: list[list[CriterionVerdict | int]] = []

    # Fleiss' kappa ratings rows, per criterion (only ensemble reports produce rows).
    fleiss_rows: dict[int, list[list[int]]] = {c: [] for c in range(n_criteria)}

    # Krippendorff's alpha: per criterion, one dict per ensemble item mapping
    # judge_id -> numeric cell value (np.nan = missing). Rows (judges) and columns
    # (items) are assembled after the loop using the final judge id set.
    alpha_cells: dict[int, list[dict[str, float]]] = {c: [] for c in range(n_criteria)}

    items_with_ground_truth = 0
    # Count GT-bearing items lost to a grading error (skipped below). These have ground truth
    # (the no-ground-truth case is handled separately) but no usable verdicts, so they reduce
    # coverage. Feeds the CoverageStats error_rate and a warning.
    n_errored_items = 0

    # NA tracking for multi-choice
    total_na_true = 0
    total_na_pred = 0
    total_na_fp = 0
    total_na_fn = 0

    for idx in common_indices:
        item = dataset.items[idx]
        item_result = eval_map[idx]
        report = item_result.report

        if item.ground_truth is None:
            result_warnings.append(f"Item {idx} has no ground truth, skipping")
            continue

        if item_result.error is not None:
            # GT-bearing item lost to a grading error: counted toward the raw coverage
            # denominator (it had ground truth) but contributes no usable verdicts.
            n_errored_items += 1
            continue

        items_with_ground_truth += 1

        # Extract predictions using type-aware extraction
        pred_all = extract_all_verdicts_from_report(report, criteria)

        # Resolve ground truth (string labels → indices for multi-choice)
        try:
            true_all = resolve_ground_truth(list(item.ground_truth), criteria)
        except ValueError as e:
            result_warnings.append(f"Item {idx}: {e}")
            continue

        # Store per-criterion data
        for c_idx in range(n_criteria):
            pred_val = pred_all[c_idx]
            true_val = true_all[c_idx]

            # Handle None predictions (failed extraction). Binary None -> UNMET (the
            # conservative default). A multi-choice None is a GENUINE error-abstain (no NA
            # option, forced-choice): leave it as None here and normalize it to the
            # effective criterion's NA index after the effective criteria are built below,
            # so it is recognized as NA instead of being silently counted as option 0.
            if pred_val is None and criterion_types[c_idx] == "binary":
                pred_val = CriterionVerdict.UNMET

            per_criterion_pred[c_idx].append(pred_val)
            per_criterion_true[c_idx].append(true_val)

        # Score-level aggregation (RMSE/correlation/bias). A grade-FAILURE has no score
        # (report.error set, score is None): EXCLUDE it from the paired score arrays
        # rather than fabricating a 0.0 — a fake 0.0 would corrupt RMSE/bias and is
        # indistinguishable from a real catastrophic score. The per-criterion verdict
        # arrays above are unaffected (they handle errored verdicts on their own terms).
        # Item-level errors are already skipped earlier; this catches a report-level
        # error with no item-level error (e.g. the "No judge results" report).
        if report.error is None and report.score is not None:
            # For true score, need to pass the original ground truth format.
            # compute_weighted_score expects CriterionVerdict for binary, str for multi-choice.
            true_score_verdicts = []
            for c_idx in range(n_criteria):
                if criterion_types[c_idx] == "binary":
                    true_score_verdicts.append(true_all[c_idx])
                else:
                    # For multi-choice, pass the option label (string)
                    criterion = criteria[c_idx]
                    opt_idx = true_all[c_idx]
                    if isinstance(opt_idx, int) and 0 <= opt_idx < len(criterion.options):
                        true_score_verdicts.append(criterion.options[opt_idx].label)
                    else:
                        # Default to first option if index is invalid
                        true_score_verdicts.append(criterion.options[0].label)

            true_score = dataset.compute_weighted_score(true_score_verdicts)

            all_pred_scores.append(report.score)
            all_true_scores.append(true_score)

        # Check if ensemble and collect per-judge data. Gate on the SAME score/error
        # condition as the score-level append above so per-item arrays stay length-aligned
        # with `all_true_scores`: a score-less report (report-level error, score None)
        # contributes nothing to per-judge metrics or inter-judge agreement, exactly as it
        # contributes nothing to the aggregate score metrics. (In normal operation a
        # score-less ensemble report has empty judge_scores anyway; this also keeps a
        # hand-built / deserialized score-less report from de-aligning the arrays.)
        if (
            report.error is None
            and report.score is not None
            and hasattr(report, "judge_scores")
            and report.judge_scores
        ):
            is_ensemble = True
            for jid, score in report.judge_scores.items():
                if jid not in judge_scores:
                    judge_scores[jid] = []
                    judge_verdicts[jid] = []
                    judge_mc_preds[jid] = []
                    judge_errors[jid] = []
                judge_scores[jid].append(score)

            # Align ground truth (all criteria) once per ensemble item.
            per_item_true.append(list(true_all))

            # Extract per-judge verdicts (binary) + multi-choice indices + errors from
            # EnsembleCriterionReport.votes / .multi_choice_votes. A binary criterion
            # yields a verdict and a None multi-choice placeholder; a multi-choice
            # criterion yields a placeholder UNMET verdict and the vote's selected_index
            # (raw int|None — None is a genuine abstain, normalized later). The error
            # is captured per criterion from whichever vote type matched, so errored MC
            # votes are skipped with the same parity as binary.
            if hasattr(report, "report") and report.report:
                for jid in judge_scores.keys():
                    judge_v: list[CriterionVerdict] = []
                    judge_mc: list[int | None] = []
                    judge_e: list[str | None] = []
                    for c_idx, cr in enumerate(report.report):
                        c_type = (
                            criterion_types[c_idx] if c_idx < len(criterion_types) else "binary"
                        )
                        if c_type == "binary":
                            judge_mc.append(None)
                            votes = getattr(cr, "votes", None) or []
                            for vote in votes:
                                if vote.judge_id == jid:
                                    judge_v.append(vote.verdict)
                                    judge_e.append(vote.error)
                                    break
                            else:
                                judge_v.append(CriterionVerdict.UNMET)
                                judge_e.append(None)
                        else:
                            judge_v.append(CriterionVerdict.UNMET)  # placeholder
                            mc_votes = getattr(cr, "multi_choice_votes", None) or []
                            for vote in mc_votes:
                                if vote.judge_id == jid:
                                    judge_mc.append(vote.selected_index)
                                    judge_e.append(vote.error)
                                    break
                            else:
                                judge_mc.append(None)
                                judge_e.append(None)
                    if jid in judge_verdicts:
                        judge_verdicts[jid].append(judge_v)
                        judge_mc_preds[jid].append(judge_mc)
                        judge_errors[jid].append(judge_e)

            # Inter-judge agreement collection (binary + multi-choice) from ensemble votes.
            if hasattr(report, "report") and report.report:
                n_judges = len(report.judge_scores)
                for c_idx in range(n_criteria):
                    cr = report.report[c_idx]
                    c_type = criterion_types[c_idx]
                    # Fleiss: complete-case ratings row (uniform rater count).
                    row = _build_fleiss_row(
                        cr,
                        criteria[c_idx],
                        c_type,
                        cannot_assess,
                        n_judges,
                    )
                    if row is not None:
                        fleiss_rows[c_idx].append(row)
                    # Krippendorff alpha: per-judge cells (missing handled natively).
                    votes = (
                        cr.votes if c_type == "binary" else getattr(cr, "multi_choice_votes", [])
                    )
                    cell_map: dict[str, float] = {
                        v.judge_id: _build_alpha_cell(v, c_type, cannot_assess)
                        for v in (votes or [])
                    }
                    alpha_cells[c_idx].append(cell_map)

    n_items = items_with_ground_truth

    if n_items == 0:
        raise ValueError("No valid items with ground truth found")

    # Score-level metrics need ≥1 scoreable (non-errored, real-float) item. Every
    # ground-truth item having a report-level error would leave these arrays empty
    # (sklearn's mean_squared_error rejects empty input). Treat it like no-valid-items
    # rather than fabricating a score.
    if not all_pred_scores:
        raise ValueError("No valid items with a computed score found")

    # Reconstruct the effective criterion for any multi-choice criterion whose graded
    # reports used an auto-injected NA option OR produced a genuine None error-abstain.
    # The grader appends an auto-injected NA at index N = len(author.options) — out of
    # range for the author rubric used above — and emits selected_index=None when it had to
    # abstain with no NA option. We normalize only when an out-of-range OR a None prediction
    # is actually observed, so forced-choice runs without abstains are unaffected and never
    # gain a spurious NA column. ``with_guaranteed_na_option`` is the same pure helper the
    # grader uses, so the two layers cannot drift.
    effective_criteria = list(criteria)
    for c_idx in range(n_criteria):
        if criterion_types[c_idx] == "binary":
            continue
        author_c = criteria[c_idx]
        n_author = len(author_c.options) if author_c.options else 0

        def _needs_na(v: object, n_author: int = n_author) -> bool:
            return (isinstance(v, int) and v >= n_author) or v is None

        observed = any(_needs_na(v) for v in per_criterion_pred[c_idx])
        if not observed:
            # Also consider per-judge multi-choice cells: a single judge may have
            # abstained (None) or picked the injected NA while the aggregate verdict
            # did not, so the effective criterion still needs an NA option for the
            # per-judge normalization to recognize that cell.
            observed = any(
                c_idx < len(row) and _needs_na(row[c_idx])
                for rows in judge_mc_preds.values()
                for row in rows
            )
        if observed:
            effective_criteria[c_idx] = author_c.with_guaranteed_na_option()

    # Normalize any remaining None multi-choice predictions (genuine error-abstains) to
    # the effective criterion's NA index, so every downstream consumer sees only ints and the
    # abstain is recognized as NA (FP/FN, na_kappa, filtering) under every na_mode. The
    # reconstruction above guarantees a NA option exists for any criterion that had a None.
    for c_idx in range(n_criteria):
        if criterion_types[c_idx] == "binary":
            continue
        na_idx = effective_criteria[c_idx].na_option_index
        if na_idx is None:
            continue
        per_criterion_pred[c_idx] = [na_idx if v is None else v for v in per_criterion_pred[c_idx]]

    # Mirror the aggregate None→NA normalization for each judge's multi-choice predictions,
    # using the SAME effective_criteria. A judge's None multi-choice cell is either a binary
    # placeholder (no NA option to point at) or a genuine abstain on a multi-choice
    # criterion; only multi-choice cells with a resolvable NA index are normalized, so binary
    # placeholders stay None and are ignored by the per-judge multi-choice path.
    for jid in judge_mc_preds:
        for item_row in judge_mc_preds[jid]:
            for c_idx in range(min(n_criteria, len(item_row))):
                if criterion_types[c_idx] == "binary":
                    continue
                if item_row[c_idx] is not None:
                    continue
                na_idx = effective_criteria[c_idx].na_option_index
                if na_idx is None:
                    continue
                item_row[c_idx] = na_idx

    # Compute per-criterion metrics by type
    per_criterion: list[CriterionMetricsUnion] = []
    # Collects the per-criterion kappas (binary Cohen, ordinal weighted, nominal). Each may
    # be None (degenerate single-class) — _mean_or_none excludes None when averaging.
    criterion_kappas: list[float | None] = []

    # Inter-judge agreement (Krippendorff's alpha + Fleiss' kappa) is only meaningful
    # with an ensemble of >=2 judges (>=2 items is enforced downstream).
    eligible = is_ensemble and len(judge_scores) >= 2

    # Precompute Krippendorff's alpha per criterion from the collected reliability cells.
    # Rows = judges (fixed judge-id order), columns = items; np.nan marks missing ratings.
    # Alpha uses ALL items (missing handled natively) — no complete-case dropping.
    judge_ids = list(judge_scores.keys())
    krippendorff_alphas: dict[int, float | None] = dict.fromkeys(range(n_criteria))
    if eligible:
        for c_idx in range(n_criteria):
            level: Literal["nominal", "ordinal"] = (
                "ordinal" if criterion_types[c_idx] == "ordinal" else "nominal"
            )
            cell_maps = alpha_cells.get(c_idx, [])
            reliability_data = [
                [cm.get(jid, float("nan")) for cm in cell_maps] for jid in judge_ids
            ]
            krippendorff_alphas[c_idx] = _compute_krippendorff_alpha(reliability_data, level)

    for c_idx in range(n_criteria):
        criterion = criteria[c_idx]
        c_type = criterion_types[c_idx]
        pred_data = per_criterion_pred[c_idx]
        true_data = per_criterion_true[c_idx]

        if c_type == "binary":
            # Binary criterion metrics
            pred_verdicts = [v for v in pred_data if isinstance(v, CriterionVerdict)]
            true_verdicts = [v for v in true_data if isinstance(v, CriterionVerdict)]

            # Handle CANNOT_ASSESS centrally and build label + MET-vs-rest reps.
            label_pred, label_true, met_pred, met_true = prepare_binary_metric_inputs(
                pred_verdicts, true_verdicts, cannot_assess
            )

            name = criterion.name or f"Criterion {c_idx + 1}"

            fleiss_kappa = _compute_fleiss_kappa(fleiss_rows.get(c_idx)) if eligible else None
            krippendorff_alpha = krippendorff_alphas.get(c_idx) if eligible else None

            if not label_pred:
                # No samples → metric values are undefined (None); counts stay 0. Do NOT
                # append to criterion_kappas (matches per-judge, which skips empty binary;
                # _mean_or_none would exclude a None regardless, so parity holds either way).
                per_criterion.append(
                    CriterionMetrics(
                        name=name,
                        index=c_idx,
                        n_samples=0,
                        accuracy=None,
                        precision=None,
                        recall=None,
                        f1=None,
                        kappa=None,
                        kappa_interpretation="undefined",
                        krippendorff_alpha=krippendorff_alpha,
                        fleiss_kappa=fleiss_kappa,
                        support_true=0,
                        support_pred=0,
                    )
                )
                continue

            c_acc = accuracy_score(label_true, label_pred)
            c_prec = precision_score(met_true, met_pred, zero_division=0)
            c_rec = recall_score(met_true, met_pred, zero_division=0)
            c_f1 = f1_score(met_true, met_pred, zero_division=0)

            # None on degenerate single-class data (NaN) or failure — never a fake 0.0.
            c_kappa = _kappa_or_none(label_true, label_pred)

            criterion_kappas.append(c_kappa)

            # 2x2 confusion matrix on the MET-vs-rest dichotomy (rows=true, cols=pred,
            # labels ["MET","UNMET"]). Built from the same met flats so FPR/FNR derived from
            # ``.fpr``/``.fnr`` honour undefined→None at a zero denominator.
            c_cm = _build_binary_2x2_confusion_matrix(met_true, met_pred)
            # phi (MCC) on the MET-vs-rest dichotomy: None on single-class — never a fake 0.0.
            c_phi = _mcc_or_none(met_true, met_pred)
            # Degenerate iff there were samples but agreement (kappa) could not be estimated
            # because the data collapsed onto a single class — distinct from the no-data case.
            c_degenerate = c_kappa is None

            per_criterion.append(
                CriterionMetrics(
                    name=name,
                    index=c_idx,
                    n_samples=len(label_pred),
                    accuracy=float(c_acc),
                    precision=float(c_prec),
                    recall=float(c_rec),
                    f1=float(c_f1),
                    kappa=c_kappa,
                    kappa_interpretation=(
                        KappaResult.interpret_kappa(c_kappa) if c_kappa is not None else "undefined"
                    ),
                    krippendorff_alpha=krippendorff_alpha,
                    fleiss_kappa=fleiss_kappa,
                    support_true=sum(met_true),
                    support_pred=sum(met_pred),
                    confusion_matrix=c_cm,
                    fpr=c_cm.fpr,
                    fnr=c_cm.fnr,
                    phi=c_phi,
                    is_degenerate=c_degenerate,
                )
            )

        elif c_type == "ordinal":
            # Ordinal multi-choice criterion metrics. Use the effective criterion so a
            # predicted auto-injected NA index is recognized.
            eff_criterion = effective_criteria[c_idx]
            pred_indices = [v for v in pred_data if isinstance(v, int)]
            true_indices = [v for v in true_data if isinstance(v, int)]

            # Filter NA options. na_agree is unused here (the NAStats block below
            # computes kappa on the {NA, not-NA} dichotomy from per-criterion data).
            pred_filtered, true_filtered, _na_agree, na_fp, na_fn = filter_na_multi_choice(
                pred_indices, true_indices, eff_criterion, mode=na_mode
            )

            # Track NA stats (FP/FN feed the diagnostic counts on NAStats)
            total_na_fp += na_fp
            total_na_fn += na_fn

            metrics = _compute_ordinal_criterion_metrics(
                pred_filtered,
                true_filtered,
                eff_criterion,
                c_idx,
                fleiss_matrix=(fleiss_rows.get(c_idx) if eligible else None),
                krippendorff_alpha=(krippendorff_alphas.get(c_idx) if eligible else None),
            )
            per_criterion.append(metrics)

            # Use weighted kappa for ordinal in mean calculation
            criterion_kappas.append(metrics.weighted_kappa)

        else:  # nominal
            # Nominal multi-choice criterion metrics. Use the effective criterion so a
            # predicted auto-injected NA index is recognized.
            eff_criterion = effective_criteria[c_idx]
            pred_indices = [v for v in pred_data if isinstance(v, int)]
            true_indices = [v for v in true_data if isinstance(v, int)]

            # Filter NA options. na_agree is unused here (the NAStats block below
            # computes kappa on the {NA, not-NA} dichotomy from per-criterion data).
            pred_filtered, true_filtered, _na_agree, na_fp, na_fn = filter_na_multi_choice(
                pred_indices, true_indices, eff_criterion, mode=na_mode
            )

            # Track NA stats (FP/FN feed the diagnostic counts on NAStats)
            total_na_fp += na_fp
            total_na_fn += na_fn

            metrics = _compute_nominal_criterion_metrics(
                pred_filtered,
                true_filtered,
                eff_criterion,
                c_idx,
                fleiss_matrix=(fleiss_rows.get(c_idx) if eligible else None),
                krippendorff_alpha=(krippendorff_alphas.get(c_idx) if eligible else None),
            )
            per_criterion.append(metrics)

            # Use unweighted kappa for nominal
            criterion_kappas.append(metrics.kappa)

    # Aggregate criterion-level scalars via the shared helper, so the aggregate and
    # per-judge paths cannot drift. accuracy/mean_kappa reproduce the prior expressions
    # exactly; the only behavior change is multi-choice-only precision/recall/f1 going
    # 0.0 → None (the binary MET-vs-rest metric is genuinely undefined without a MET
    # class). per_criterion_pred has been normalized to ints (no None) by here, so its
    # static type matches the helper's expected list[CriterionVerdict | int].
    (
        criterion_accuracy,
        criterion_precision,
        criterion_recall,
        criterion_f1,
        mean_kappa,
        criterion_phi,
        micro_kappa,
    ) = _criterion_level_scalars(
        per_criterion_pred,  # type: ignore[arg-type]
        per_criterion_true,
        list(criterion_types),
        cannot_assess,
        precomputed_kappas=criterion_kappas,
    )

    # Macro accuracy: unweighted mean of the per-criterion accuracies (binary ``accuracy`` /
    # multi-choice ``exact_accuracy``), the macro complement to the pooled (micro)
    # ``criterion_accuracy`` above. None when no criterion contributed an accuracy.
    per_criterion_accuracies: list[float | None] = []
    for cm in per_criterion:
        if cm.criterion_type == "binary":
            per_criterion_accuracies.append(cm.accuracy)
        else:
            per_criterion_accuracies.append(cm.exact_accuracy)
    macro_accuracy = _mean_or_none(per_criterion_accuracies)

    # Macro mean of the per-criterion Krippendorff's alpha (inter-judge agreement). None when
    # no criterion contributed an alpha (e.g. single-judge runs).
    mean_krippendorff_alpha = _mean_or_none([cm.krippendorff_alpha for cm in per_criterion])

    # Coverage / error diagnostics — only meaningful under the ``exclude`` handling modes,
    # where abstentions (CANNOT_ASSESS / NA) and grading errors drop a paired observation
    # from the agreement denominator. Under ``as_unmet`` / ``as_category`` nothing is
    # union-excluded, so coverage would be trivially 1.0 and we leave these ``None``. The raw
    # denominator counts every GT-bearing item (including those lost to a grading error), so
    # error_rate and the abstain rates share one consistent denominator.
    coverage_stats: CoverageStats | None = None
    coverage_mode = cannot_assess == "exclude" and na_mode == "exclude"
    if coverage_mode:
        n_total_raw = items_with_ground_truth + n_errored_items
        agg_judge_abstain = 0
        agg_gt_abstain = 0
        for c_idx in range(n_criteria):
            c_type = criterion_types[c_idx]
            raw_pred = per_criterion_pred[c_idx]
            raw_true = per_criterion_true[c_idx]
            if c_type == "binary":
                CA = CriterionVerdict.CANNOT_ASSESS
                judge_abstain = sum(1 for v in raw_pred if v == CA)
                gt_abstain = sum(1 for v in raw_true if v == CA)
            else:
                na_idx_set = {
                    i for i, opt in enumerate(effective_criteria[c_idx].options) if opt.na
                }
                judge_abstain = sum(1 for v in raw_pred if isinstance(v, int) and v in na_idx_set)
                gt_abstain = sum(1 for v in raw_true if isinstance(v, int) and v in na_idx_set)
            agg_judge_abstain += judge_abstain
            agg_gt_abstain += gt_abstain
            cstats = _build_coverage_stats(
                n_total=n_total_raw,
                n_covered=per_criterion[c_idx].n_samples,
                judge_abstain=judge_abstain,
                gt_abstain=gt_abstain,
                n_errored=n_errored_items,
            )
            per_criterion[c_idx] = per_criterion[c_idx].model_copy(
                update={"coverage_stats": cstats}
            )

        # Aggregate rollup: coverage pools the per-criterion *pairs* (raw pair count summed
        # over criteria; covered = sum of per-criterion covered counts), so the coverage /
        # abstain fractions reflect the full paired sample. ``n_errored``, by contrast, is an
        # *item* count (an errored item has no usable verdicts at all) — reported as the raw
        # item count for an intuitive read, matching the per-criterion value; its ``error_rate``
        # is the fraction of raw ground-truth-bearing items lost to a grading error.
        agg_total = n_total_raw * n_criteria
        agg_covered = sum(cm.n_samples for cm in per_criterion)
        agg_coverage = agg_covered / agg_total if agg_total else None
        coverage_stats = CoverageStats(
            n_total=agg_total,
            n_covered=agg_covered,
            coverage=agg_coverage,
            judge_abstain_rate=(agg_judge_abstain / agg_total if agg_total else None),
            gt_abstain_rate=(agg_gt_abstain / agg_total if agg_total else None),
            union_exclusion_rate=(1 - agg_coverage if agg_coverage is not None else None),
            n_errored=n_errored_items,
            error_rate=(n_errored_items / n_total_raw if n_total_raw else None),
        )

    if n_errored_items > 0:
        result_warnings.append(
            f"{n_errored_items} item(s) with ground truth were excluded from metrics because "
            "grading errored; their verdicts and scores do not contribute."
        )

    # Score-level metrics
    score_rmse = float(np.sqrt(mean_squared_error(all_true_scores, all_pred_scores)))
    score_mae = float(mean_absolute_error(all_true_scores, all_pred_scores))

    score_spearman = _compute_correlation(all_pred_scores, all_true_scores, "spearman")
    score_kendall = _compute_correlation(all_pred_scores, all_true_scores, "kendall")
    score_pearson = _compute_correlation(all_pred_scores, all_true_scores, "pearson")

    # Score-collapse warning: if the per-item ground-truth scores take at most two distinct
    # values, the score-level correlations are uninformative (a correlation against a
    # near-constant variable conveys almost nothing), so flag it rather than letting a reader
    # over-interpret the Spearman/Kendall/Pearson numbers.
    if len(set(all_true_scores)) <= 2:
        result_warnings.append(
            "Ground-truth scores take <=2 distinct values; score-level correlations "
            "(Spearman/Kendall/Pearson) are uninformative on a collapsed score range."
        )

    # Degeneracy warning: name the criteria that had samples but whose agreement coefficient
    # collapsed to None (single-class data). Distinct from no-data criteria.
    degenerate_names = [cm.name for cm in per_criterion if cm.is_degenerate]
    if degenerate_names:
        result_warnings.append(
            "Degenerate (single-class) data prevented an agreement estimate for "
            f"criteria: {', '.join(degenerate_names)}."
        )

    # Bias analysis
    bias = systematic_bias(all_pred_scores, all_true_scores)

    # Bootstrap CIs (optional) — item-level resample over ANY rubric type (binary /
    # multi-choice / mixed). Per-metric None when its resample axis is empty / degenerate.
    # per_criterion_pred has been normalized to ints (no None) by the effective-criteria pass
    # above, so its static type matches the helper's list[CriterionVerdict | int].
    bootstrap_results = None
    if bootstrap:
        bootstrap_results = _compute_bootstrap_ci(
            per_criterion_pred,  # type: ignore[arg-type]
            per_criterion_true,
            list(criterion_types),
            effective_criteria,
            cannot_assess,
            na_mode,
            all_true_scores,
            all_pred_scores,
            n_bootstrap=n_bootstrap,
            confidence_level=confidence_level,
            seed=seed,
        )

    # Per-judge metrics (optional, for ensemble). Each judge mirrors the aggregate's
    # type handling: binary criteria contribute MET-vs-rest + label metrics, multi-choice
    # criteria contribute exact-match accuracy and kappa via the same per-criterion
    # functions the aggregate uses.
    per_judge_metrics = None
    if per_judge and is_ensemble and judge_scores:
        per_judge_metrics = {}
        for jid in judge_scores.keys():
            jv = judge_verdicts.get(jid, [])
            if not jv:
                continue

            per_judge_metrics[jid] = _compute_judge_metrics(
                judge_id=jid,
                judge_scores=judge_scores[jid],
                true_scores=all_true_scores,
                judge_verdicts=jv,
                judge_mc_preds=judge_mc_preds.get(jid, []),
                judge_errors=judge_errors.get(jid, []),
                true_verdicts=per_item_true,
                criterion_types=list(criterion_types),
                criteria=criteria,
                effective_criteria=effective_criteria,
                cannot_assess=cannot_assess,
                na_mode=na_mode,
            )

    # NA stats (for multi-choice criteria)
    # Cohen's kappa on the {NA, not-NA} dichotomy across all multi-choice
    # criteria that define an NA option, paired pred-vs-truth. Reuses the
    # same chance-corrected statistic as the rest of the framework's
    # prediction-vs-ground-truth agreement metrics (binary `kappa`, ordinal
    # `weighted_kappa`, nominal `kappa`). Returns None when undefined.
    na_stats = None
    if n_ordinal > 0 or n_nominal > 0:
        na_pred_bool: list[bool] = []
        na_true_bool: list[bool] = []
        for c_idx in range(n_criteria):
            if criterion_types[c_idx] == "binary":
                continue
            criterion = effective_criteria[c_idx]
            na_indices = {i for i, opt in enumerate(criterion.options) if opt.na}
            if not na_indices:
                continue
            for p, t in zip(per_criterion_pred[c_idx], per_criterion_true[c_idx]):
                if isinstance(p, int) and isinstance(t, int):
                    p_is_na = p in na_indices
                    t_is_na = t in na_indices
                    na_pred_bool.append(p_is_na)
                    na_true_bool.append(t_is_na)
                    if p_is_na:
                        total_na_pred += 1
                    if t_is_na:
                        total_na_true += 1

        na_kappa: float | None = None
        if na_pred_bool:
            try:
                k = float(cohen_kappa_score(na_true_bool, na_pred_bool))
                na_kappa = None if math.isnan(k) else k
            except Exception:
                na_kappa = None
        na_kappa_interpretation = (
            KappaResult.interpret_kappa(na_kappa) if na_kappa is not None else None
        )

        na_stats = NAStats(
            na_count_true=total_na_true,
            na_count_pred=total_na_pred,
            na_kappa=na_kappa,
            na_kappa_interpretation=na_kappa_interpretation,
            na_false_positive=total_na_fp,
            na_false_negative=total_na_fn,
        )

    # CANNOT_ASSESS stats (for binary criteria) — the binary parallel of NA stats above.
    # Cohen's kappa on the {CANNOT_ASSESS, not-CANNOT_ASSESS} dichotomy across all binary
    # criteria, paired pred-vs-truth. CANNOT_ASSESS is a DISTINCT kind of abstention from
    # multi-choice NA (epistemic MET-vs-UNMET abstention rather than "no applicable
    # option"), so it is tracked by a separate stats type (CannotAssessStats) even though
    # both share the SKIP scoring path. Counts are mode-independent: read from the raw
    # per-criterion verdicts (set at the top of this function), never the
    # cannot_assess-filtered lists. Returns None when there are no binary criteria.
    cannot_assess_stats = None
    if n_binary > 0:
        ca_pred_bool: list[bool] = []
        ca_true_bool: list[bool] = []
        total_ca_true = 0
        total_ca_pred = 0
        total_ca_fp = 0
        total_ca_fn = 0
        CA = CriterionVerdict.CANNOT_ASSESS
        for c_idx in range(n_criteria):
            if criterion_types[c_idx] != "binary":
                continue
            for p, t in zip(per_criterion_pred[c_idx], per_criterion_true[c_idx]):
                if isinstance(p, CriterionVerdict) and isinstance(t, CriterionVerdict):
                    p_is_ca = p == CA
                    t_is_ca = t == CA
                    ca_pred_bool.append(p_is_ca)
                    ca_true_bool.append(t_is_ca)
                    if p_is_ca:
                        total_ca_pred += 1
                    if t_is_ca:
                        total_ca_true += 1
                    if p_is_ca and not t_is_ca:
                        total_ca_fp += 1
                    if t_is_ca and not p_is_ca:
                        total_ca_fn += 1

        ca_kappa: float | None = None
        if ca_pred_bool:
            try:
                k = float(cohen_kappa_score(ca_true_bool, ca_pred_bool))
                ca_kappa = None if math.isnan(k) else k
            except Exception:
                ca_kappa = None
        ca_kappa_interpretation = (
            KappaResult.interpret_kappa(ca_kappa) if ca_kappa is not None else None
        )

        cannot_assess_stats = CannotAssessStats(
            ca_count_true=total_ca_true,
            ca_count_pred=total_ca_pred,
            ca_kappa=ca_kappa,
            ca_kappa_interpretation=ca_kappa_interpretation,
            ca_false_positive=total_ca_fp,
            ca_false_negative=total_ca_fn,
        )

    return MetricsResult(
        # Each scalar may be None (genuinely undefined / not applicable), so only wrap a
        # present value in float() — never coerce None to 0.0.
        criterion_accuracy=criterion_accuracy
        if criterion_accuracy is None
        else float(criterion_accuracy),
        criterion_precision=criterion_precision
        if criterion_precision is None
        else float(criterion_precision),
        criterion_recall=criterion_recall if criterion_recall is None else float(criterion_recall),
        criterion_f1=criterion_f1 if criterion_f1 is None else float(criterion_f1),
        mean_kappa=mean_kappa if mean_kappa is None else float(mean_kappa),
        per_criterion=per_criterion,
        score_rmse=score_rmse,
        score_mae=score_mae,
        score_spearman=score_spearman,
        score_kendall=score_kendall,
        score_pearson=score_pearson,
        bias=bias,
        bootstrap=bootstrap_results,
        per_judge=per_judge_metrics,
        n_items=n_items,
        n_criteria=n_criteria,
        n_binary_criteria=n_binary,
        n_ordinal_criteria=n_ordinal,
        n_nominal_criteria=n_nominal,
        na_stats=na_stats,
        cannot_assess_stats=cannot_assess_stats,
        # Handling-mode provenance (recorded on the result so downstream readers know how
        # abstentions were treated when these numbers were produced).
        cannot_assess_mode=cannot_assess,
        na_mode=na_mode,
        # Additional aggregate scalars (every one honours undefined→None — never a fake 0.0).
        n_samples=sum(cm.n_samples for cm in per_criterion),
        mean_krippendorff_alpha=mean_krippendorff_alpha,
        criterion_phi=criterion_phi if criterion_phi is None else float(criterion_phi),
        macro_accuracy=macro_accuracy if macro_accuracy is None else float(macro_accuracy),
        micro_kappa=micro_kappa if micro_kappa is None else float(micro_kappa),
        coverage_stats=coverage_stats,
        warnings=result_warnings,
    )
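
The NA and CANNOT_ASSESS blocks at the end of the listing above both reduce to Cohen's kappa on a two-way dichotomy (abstained vs. did not abstain), with an undefined result mapped to None rather than 0.0. The following is a minimal pure-Python sketch of that idea, not the library's code: the listing itself delegates to scikit-learn's cohen_kappa_score, and the helper name dichotomy_kappa and the sample data are hypothetical.

```python
from __future__ import annotations


def dichotomy_kappa(true_flags: list[bool], pred_flags: list[bool]) -> float | None:
    """Cohen's kappa for paired boolean flags; None when it is undefined."""
    if not true_flags or len(true_flags) != len(pred_flags):
        return None
    n = len(true_flags)
    # Observed agreement: fraction of pairs where both sides give the same flag.
    p_observed = sum(t == p for t, p in zip(true_flags, pred_flags)) / n
    # Chance agreement from each side's marginal rate of True.
    rate_true = sum(true_flags) / n
    rate_pred = sum(pred_flags) / n
    p_chance = rate_true * rate_pred + (1 - rate_true) * (1 - rate_pred)
    if p_chance == 1.0:
        # Both sides are constant and identical: kappa is 0/0, so report None.
        return None
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical data: ground truth abstains on 2 of 4 pairs, the judge on 1.
truth = [True, True, False, False]
judge = [True, False, False, False]
kappa = dichotomy_kappa(truth, judge)

# Nobody ever abstains: a single-class dichotomy, so no estimate is possible.
degenerate = dichotomy_kappa([False, False, False], [False, False, False])
```

Returning None for the single-class case keeps a "no information" result distinguishable from a genuine kappa of zero, which is the convention the result fields follow.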

MetricsResult

Complete metrics result with aggregate and per-criterion breakdowns.

MetricsResult

Bases: BaseModel

Complete metrics result from compute_metrics().

This is the main result type returned by EvalResult.compute_metrics(). It provides a comprehensive view of evaluation quality including:

- Criterion-level agreement metrics
- Score-level correlation and error metrics
- Per-criterion breakdown (supports binary, ordinal, and nominal criteria)
- Optional bootstrap confidence intervals
- Optional per-judge metrics for ensemble evaluations
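
Two of the aggregate accuracy fields documented below can diverge when criteria have different sample counts: criterion_accuracy pools every pair (micro), while macro_accuracy is the unweighted mean of the per-criterion accuracies. A minimal sketch with made-up counts, assuming a hypothetical mean_or_none helper that stands in for the library's internal None-excluding mean:

```python
from __future__ import annotations


def mean_or_none(values: list[float | None]) -> float | None:
    """Mean of the non-None values; None when nothing contributed."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# Hypothetical (correct, total) pair counts per criterion; the last has no pairs.
counts = [(9, 10), (1, 2), (0, 0)]

# Per-criterion accuracy is undefined (None) when a criterion has no pairs.
per_criterion = [c / n if n else None for c, n in counts]

# Macro: unweighted mean over the criteria that have an accuracy.
macro_accuracy = mean_or_none(per_criterion)

# Micro: pool every pair across criteria before dividing.
total_pairs = sum(n for _, n in counts)
micro_accuracy = sum(c for c, _ in counts) / total_pairs if total_pairs else None
```

Here the small, poorly graded criterion pulls the macro figure (about 0.70) well below the pooled one (about 0.83), so comparing the two is a quick check for criteria that are hidden by sample size.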

ATTRIBUTE DESCRIPTION
criterion_accuracy

Overall accuracy across all criteria (binary label accuracy and/or multi-choice exact-match). None when undefined (no comparable pairs at all).

TYPE: float | None

criterion_precision

Overall precision for the binary MET class. None when not applicable — multi-choice-only rubrics have no MET class (the per-option precision/recall/f1 story lives in each criterion's per_option).

TYPE: float | None

criterion_recall

Overall recall for the binary MET class. None when not applicable (see criterion_precision).

TYPE: float | None

criterion_f1

Overall F1 for the binary MET class. None when not applicable (see criterion_precision).

TYPE: float | None

mean_kappa

Mean kappa across criteria (weighted for ordinal, unweighted for binary/nominal). None when undefined (no criterion contributed a kappa).

TYPE: float | None

per_criterion

Per-criterion metrics breakdown (polymorphic union type).

TYPE: list[CriterionMetricsUnion]

score_rmse

RMSE of cumulative scores.

TYPE: float

score_mae

MAE of cumulative scores.

TYPE: float

score_spearman

Spearman correlation result.

TYPE: CorrelationResult

score_kendall

Kendall tau correlation result.

TYPE: CorrelationResult

score_pearson

Pearson correlation result.

TYPE: CorrelationResult

bias

Systematic bias analysis.

TYPE: BiasResult

bootstrap

Optional bootstrap confidence intervals.

TYPE: BootstrapResults | None

per_judge

Optional per-judge metrics for ensemble.

TYPE: dict[str, JudgeMetrics] | None

n_items

Number of items used in computation.

TYPE: int

n_criteria

Number of criteria.

TYPE: int

n_binary_criteria

Number of binary criteria (defaults to 0 for backward compatibility).

TYPE: int

n_ordinal_criteria

Number of ordinal multi-choice criteria.

TYPE: int

n_nominal_criteria

Number of nominal multi-choice criteria.

TYPE: int

na_stats

Statistics for NA handling in multi-choice criteria.

TYPE: NAStats | None

cannot_assess_stats

Statistics for CANNOT_ASSESS handling in binary criteria — the binary parallel to na_stats (a distinct kind of abstention; see CannotAssessStats).

TYPE: CannotAssessStats | None

cannot_assess_mode

How binary CANNOT_ASSESS verdicts were handled when these metrics were computed (exclude / as_unmet / as_category).

TYPE: CannotAssessMode

na_mode

How multi-choice NA options were handled when these metrics were computed (the multi-choice analog of cannot_assess_mode).

TYPE: NAMode

n_samples

Total number of paired observations contributing to the aggregate metrics. None when not recorded (legacy checkpoints).

TYPE: int | None

mean_krippendorff_alpha

Macro mean of the per-criterion Krippendorff's alpha. None when no criterion contributed an alpha.

TYPE: float | None

criterion_phi

Aggregate (micro) Matthews correlation coefficient (φ) over the pooled binary {MET, UNMET} flats. None for multi-choice-only rubrics or when undefined.

TYPE: float | None

macro_accuracy

Unweighted mean of the per-criterion accuracies. None when no criterion contributed an accuracy.

TYPE: float | None

micro_kappa

Aggregate (micro) Cohen's kappa pooled across criteria. None when undefined.

TYPE: float | None

coverage_stats

Aggregate rollup of how much of the raw paired sample survived abstention/error exclusion. Only populated when both cannot_assess_mode and na_mode are exclude; None otherwise.

TYPE: CoverageStats | None

warnings

Any warnings generated during computation.

TYPE: list[str]
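
The aggregate coverage_stats fields come from simple counting, as the source listing above shows: pairs are pooled over criteria, so the raw denominator is every ground-truth item (including errored ones) times the number of criteria, while error_rate is computed per item. A minimal sketch with made-up counts; the CoverageSketch dataclass and the coverage_rollup function are hypothetical stand-ins and not the library's CoverageStats:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CoverageSketch:
    n_total: int
    n_covered: int
    coverage: float | None
    union_exclusion_rate: float | None
    error_rate: float | None


def coverage_rollup(
    items_with_ground_truth: int,
    n_errored_items: int,
    covered_per_criterion: list[int],
) -> CoverageSketch:
    # Raw item count includes items lost to a grading error.
    n_total_raw = items_with_ground_truth + n_errored_items
    # Pairs are pooled across criteria: one potential pair per item per criterion.
    n_total = n_total_raw * len(covered_per_criterion)
    n_covered = sum(covered_per_criterion)
    coverage = n_covered / n_total if n_total else None
    return CoverageSketch(
        n_total=n_total,
        n_covered=n_covered,
        coverage=coverage,
        union_exclusion_rate=None if coverage is None else 1 - coverage,
        # error_rate is per item, not per pair.
        error_rate=n_errored_items / n_total_raw if n_total_raw else None,
    )


# Hypothetical run: 8 graded items, 2 errored, 2 criteria with 8 and 7 covered pairs.
stats = coverage_rollup(8, 2, [8, 7])
```

With these numbers the pooled denominator is 20 pairs, 15 of which are covered, so coverage is 0.75 while the item-level error rate is 0.2. The two rates use different denominators by design.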

summary

summary(*, verbose: bool = False) -> str

Return formatted text summary of metrics.

PARAMETER DESCRIPTION
verbose

When True, the per-judge table swaps in the secondary numeric columns (RMSE, Spearman) it omits by default and prints each judge's confusion matrix. The default (False) per-judge line leads with the chance-corrected accuracy + mean kappa (and Matthews phi), the metrics most directly comparable across judges.

TYPE: bool DEFAULT: False

Source code in src/autorubric/metrics/_types.py
def summary(self, *, verbose: bool = False) -> str:
    """Return formatted text summary of metrics.

    Args:
        verbose: When ``True``, each per-judge entry gains a second line with the
            secondary numeric metrics omitted by default (RMSE, Spearman, MAE),
            followed by that judge's confusion matrix when one is available. The
            default (``False``) per-judge line shows accuracy, the chance-corrected
            mean kappa, and Matthews phi, the metrics most directly comparable
            across judges.
    """
    lines = []
    lines.append("=" * 60)
    lines.append("METRICS SUMMARY")
    lines.append("=" * 60)

    # Show criteria type breakdown if mixed
    criteria_info = f"Items: {self.n_items}, Criteria: {self.n_criteria}"
    if self.n_ordinal_criteria > 0 or self.n_nominal_criteria > 0:
        type_parts = []
        if self.n_binary_criteria > 0:
            type_parts.append(f"{self.n_binary_criteria} binary")
        if self.n_ordinal_criteria > 0:
            type_parts.append(f"{self.n_ordinal_criteria} ordinal")
        if self.n_nominal_criteria > 0:
            type_parts.append(f"{self.n_nominal_criteria} nominal")
        criteria_info += f" ({', '.join(type_parts)})"
    lines.append(criteria_info)

    # Handling modes: every accuracy/kappa/F1 below depends on how abstentions were
    # treated, so the estimand is named explicitly. A number reported without its
    # handling mode is ambiguous among the three estimands.
    lines.append(f"Handling modes: CANNOT_ASSESS={self.cannot_assess_mode}, NA={self.na_mode}")

    if self.warnings:
        lines.append(f"\nWarnings ({len(self.warnings)}):")
        for w in self.warnings:
            lines.append(f"  - {w}")

    # Criterion-level scalars span two aggregation levels. Pooled-over-decisions
    # metrics are micro; the unweighted mean over criteria is macro. They estimate
    # different quantities (a high-support criterion dominates micro, every criterion
    # counts equally for macro), so each carries its level explicitly.
    lines.append("")
    lines.append("Criterion-Level Metrics:")
    lines.append(f"  Accuracy (micro):       {_fmt_opt(self.criterion_accuracy, '.1%')}")
    lines.append(f"  Accuracy (macro):       {_fmt_opt(self.macro_accuracy, '.1%')}")
    if self.n_binary_criteria > 0:
        # Guaranteed non-None here, but render via _fmt_opt so ty is satisfied.
        lines.append(f"  Precision (micro):      {_fmt_opt(self.criterion_precision, '.2f')}")
        lines.append(f"  Recall (micro):         {_fmt_opt(self.criterion_recall, '.2f')}")
        lines.append(f"  F1 (micro):             {_fmt_opt(self.criterion_f1, '.2f')}")
    lines.append(f"  Mean Kappa (macro):     {_fmt_opt(self.mean_kappa, '.3f')}")
    lines.append(f"  Kappa (micro):          {_fmt_opt(self.micro_kappa, '.3f')}")
    if self.n_binary_criteria > 0:
        lines.append(f"  Phi (micro):            {_fmt_opt(self.criterion_phi, '.3f')}")
        # Single-source conflation note: on binary data phi coincides with the
        # Pearson/Spearman/Kendall/MCC family, and the kappa minus phi gap measures the
        # judge's positive-rate drift from the human's (not a second, corroborating
        # statistic). This note lives only here (and in to_dataframe()/docstrings).
        lines.append(
            "    (phi = Pearson = Spearman = Kendall = MCC on binary data; the "
            "Kappa - Phi gap is the judge's positive-rate drift, not extra evidence)"
        )
    lines.append(f"  Mean Kripp-α (macro):   {_fmt_opt(self.mean_krippendorff_alpha, '.3f')}")

    # Aggregate coverage continuation: under exclude mode an abstention/error drops a
    # paired observation, so the covered-subset metrics above are reported alongside
    # their coverage (a selective accuracy without its coverage is incomplete).
    if self.coverage_stats is not None:
        cs = self.coverage_stats
        lines.append(f"  Coverage:               {_fmt_opt(cs.coverage, '.1%')}")
        lines.append(
            f"    judge-abstain={_fmt_opt(cs.judge_abstain_rate, '.1%')}, "
            f"gt-abstain={_fmt_opt(cs.gt_abstain_rate, '.1%')}, "
            f"errored={cs.n_errored}"
        )

    lines.append("")
    lines.append("Score-Level Metrics (continuous per-item weighted score):")
    lines.append(f"  RMSE:     {self.score_rmse:.4f}")
    lines.append(f"  MAE:      {self.score_mae:.4f}")
    lines.append(
        f"  Spearman: {_fmt_opt(self.score_spearman.coefficient, '.4f')} "
        f"({self.score_spearman.interpretation})"
    )
    lines.append(
        f"  Kendall:  {_fmt_opt(self.score_kendall.coefficient, '.4f')} "
        f"({self.score_kendall.interpretation})"
    )
    lines.append(
        f"  Pearson:  {_fmt_opt(self.score_pearson.coefficient, '.4f')} "
        f"({self.score_pearson.interpretation})"
    )

    lines.append("")
    lines.append("Bias Analysis:")
    lines.append(
        f"  Mean Bias:   {_fmt_opt(self.bias.mean_bias, '+.4f')} ({self.bias.direction})"
    )
    lines.append(f"  Significant: {'Yes' if self.bias.is_significant else 'No'}")

    # NA stats for multi-choice
    if self.na_stats:
        lines.append("")
        lines.append("NA Handling:")
        lines.append(f"  NA in Ground Truth: {self.na_stats.na_count_true}")
        lines.append(f"  NA in Predictions:  {self.na_stats.na_count_pred}")
        if self.na_stats.na_kappa is not None:
            interp = self.na_stats.na_kappa_interpretation or ""
            lines.append(f"  NA Kappa:           {self.na_stats.na_kappa:.3f} ({interp})")
        if self.na_stats.na_false_positive > 0 or self.na_stats.na_false_negative > 0:
            lines.append(
                f"  NA FP/FN:           {self.na_stats.na_false_positive} / "
                f"{self.na_stats.na_false_negative}"
            )

    # CANNOT_ASSESS stats for binary criteria (parallel to NA Handling above; a
    # distinct kind of abstention — epistemic MET/UNMET rather than "no option").
    if self.cannot_assess_stats:
        ca = self.cannot_assess_stats
        lines.append("")
        lines.append("CANNOT_ASSESS Handling:")
        lines.append(f"  CA in Ground Truth: {ca.ca_count_true}")
        lines.append(f"  CA in Predictions:  {ca.ca_count_pred}")
        if ca.ca_kappa is not None:
            interp = ca.ca_kappa_interpretation or ""
            lines.append(f"  CA Kappa:           {ca.ca_kappa:.3f} ({interp})")
        if ca.ca_false_positive > 0 or ca.ca_false_negative > 0:
            lines.append(
                f"  CA FP/FN:           {ca.ca_false_positive} / {ca.ca_false_negative}"
            )

    if self.bootstrap:
        lines.append("")
        lines.append(f"Bootstrap CIs ({self.bootstrap.confidence_level:.0%}):")
        # Each CI may be None (genuinely undefined / no samples) — render "n/a".
        acc_ci = self.bootstrap.accuracy_ci
        kappa_ci = self.bootstrap.kappa_ci
        rmse_ci = self.bootstrap.rmse_ci
        lines.append(
            "  Accuracy: "
            + (f"[{acc_ci[0]:.1%}, {acc_ci[1]:.1%}]" if acc_ci is not None else "n/a")
        )
        lines.append(
            "  Kappa:    "
            + (f"[{kappa_ci[0]:.3f}, {kappa_ci[1]:.3f}]" if kappa_ci is not None else "n/a")
        )
        lines.append(
            "  RMSE:     "
            + (f"[{rmse_ci[0]:.4f}, {rmse_ci[1]:.4f}]" if rmse_ci is not None else "n/a")
        )

    if self.per_judge:
        lines.append("")
        lines.append("Per-Judge Metrics:")
        for judge_id, jm in sorted(self.per_judge.items()):
            # Default line leads with the chance-corrected accuracy + mean kappa (and
            # phi), the metrics most comparable across judges. RMSE/Spearman are demoted
            # to the verbose view.
            lines.append(
                f"  {judge_id}: Acc={_fmt_opt(jm.criterion_accuracy, '.1%')}, "
                f"Mean Kappa={_fmt_opt(jm.mean_kappa, '.3f')}, "
                f"Phi={_fmt_opt(jm.phi, '.3f')}"
            )
            if verbose:
                lines.append(
                    f"      RMSE={jm.score_rmse:.4f}, "
                    f"Spearman={_fmt_opt(jm.score_spearman.coefficient, '.4f')}, "
                    f"MAE={jm.score_mae:.4f}"
                )
                if jm.confusion_matrix is not None:
                    lines.append(
                        "      Confusion (" + ", ".join(jm.confusion_matrix.labels) + "):"
                    )
                    for label, mrow in zip(
                        jm.confusion_matrix.labels, jm.confusion_matrix.matrix
                    ):
                        lines.append(f"        {label:<14} {mrow}")

    lines.append("")
    lines.append("Per-Criterion Breakdown:")

    # Inter-judge agreement is only populated for ensembles with >=2 judges. Render is
    # type-aware: on binary/nominal data Krippendorff's nominal alpha and Fleiss' kappa
    # coincide up to a finite-sample correction (1 - kappa_F)/(N*R) — they are one
    # statistic, not corroborating evidence — so alpha is reported as the single primary
    # column and the bare Fleiss column is dropped. On ordinal data alpha is
    # distance-aware while Fleiss stays nominal (different geometry), so both are kept.
    def _has_alpha(criteria: list) -> bool:
        return any(cm.krippendorff_alpha is not None for cm in criteria)

    def _has_fleiss(criteria: list) -> bool:
        return any(cm.fleiss_kappa is not None for cm in criteria)

    def _alpha_header() -> str:
        return f" {'Kripp-α':>9}"

    def _alpha_cell(cm) -> str:
        return f" {_fmt_opt(cm.krippendorff_alpha, '>9.3f', 9)}"

    def _alpha_fleiss_header() -> str:
        return f" {'Kripp-α':>9} {'Fleiss':>8}"

    def _alpha_fleiss_cells(cm) -> str:
        return (
            f" {_fmt_opt(cm.krippendorff_alpha, '>9.3f', 9)}"
            f" {_fmt_opt(cm.fleiss_kappa, '>8.3f', 8)}"
        )

    # Marks a criterion that had samples but whose agreement coefficient collapsed to
    # None (single-class) — distinct from a no-data criterion (n_samples == 0).
    def _degen_suffix(cm) -> str:
        return "  [degenerate: agreement undefined, single class]" if cm.is_degenerate else ""

    # Separate display by criterion type
    binary_criteria = [cm for cm in self.per_criterion if cm.criterion_type == "binary"]
    ordinal_criteria = [cm for cm in self.per_criterion if cm.criterion_type == "ordinal"]
    nominal_criteria = [cm for cm in self.per_criterion if cm.criterion_type == "nominal"]

    alpha_note_needed = False

    if binary_criteria:
        if ordinal_criteria or nominal_criteria:
            lines.append("\nBinary Criteria:")
        # Binary/nominal: alpha primary, bare Fleiss dropped.
        show_alpha = _has_alpha(binary_criteria)
        alpha_note_needed = alpha_note_needed or show_alpha
        header = (
            f"{'Criterion':<20} {'Acc':>8} {'Prec':>8} {'Rec':>8} {'F1':>8} "
            f"{'Kappa':>8} {'Phi':>8} {'FP':>5} {'FN':>5} {'FPR':>7} {'FNR':>7}"
        )
        if show_alpha:
            header += _alpha_header()
        lines.append(header)
        lines.append("-" * len(header))
        for cm in binary_criteria:
            fp = cm.confusion_matrix.fp if cm.confusion_matrix is not None else None
            fn = cm.confusion_matrix.fn if cm.confusion_matrix is not None else None
            row = (
                f"{cm.name:<20} {_fmt_opt(cm.accuracy, '>8.1%', 8)} "
                f"{_fmt_opt(cm.precision, '>8.2f', 8)} "
                f"{_fmt_opt(cm.recall, '>8.2f', 8)} {_fmt_opt(cm.f1, '>8.2f', 8)} "
                f"{_fmt_opt(cm.kappa, '>8.3f', 8)} {_fmt_opt(cm.phi, '>8.3f', 8)} "
                f"{(str(fp) if fp is not None else 'n/a'):>5} "
                f"{(str(fn) if fn is not None else 'n/a'):>5} "
                f"{_fmt_opt(cm.fpr, '>7.2f', 7)} {_fmt_opt(cm.fnr, '>7.2f', 7)}"
            )
            if show_alpha:
                row += _alpha_cell(cm)
            row += _degen_suffix(cm)
            lines.append(row)

    if ordinal_criteria:
        lines.append("\nOrdinal Criteria:")
        # Ordinal: keep both alpha (distance-aware) and Fleiss (nominal) — different
        # geometry, see the note below.
        show_alpha = _has_alpha(ordinal_criteria)
        show_fleiss = _has_fleiss(ordinal_criteria)
        header = (
            f"{'Criterion':<20} {'Exact':>8} {'Adj':>8} "
            f"{'WKappa':>8} {'Spearman':>10} {'RMSE':>8}"
        )
        if show_alpha or show_fleiss:
            header += _alpha_fleiss_header()
        lines.append(header)
        lines.append("-" * len(header))
        for cm in ordinal_criteria:
            row = (
                f"{cm.name:<20} {_fmt_opt(cm.exact_accuracy, '>8.1%', 8)} "
                f"{_fmt_opt(cm.adjacent_accuracy, '>8.1%', 8)} "
                f"{_fmt_opt(cm.weighted_kappa, '>8.3f', 8)} "
                f"{_fmt_opt(cm.spearman.coefficient, '>10.4f', 10)} "
                f"{_fmt_opt(cm.rmse, '>8.4f', 8)}"
            )
            if show_alpha or show_fleiss:
                row += _alpha_fleiss_cells(cm)
            row += _degen_suffix(cm)
            lines.append(row)
        if show_fleiss:
            # Distinguishing note (NOT a conflation note): ordinal alpha and nominal
            # Fleiss measure different geometries and are both intentionally retained.
            lines.append(
                "  Note: ordinal Kripp-α is distance-aware while Fleiss is nominal — they "
                "measure different geometry, not the same statistic."
            )

    if nominal_criteria:
        lines.append("\nNominal Criteria:")
        # Binary/nominal: alpha primary, bare Fleiss dropped.
        show_alpha = _has_alpha(nominal_criteria)
        alpha_note_needed = alpha_note_needed or show_alpha
        header = f"{'Criterion':<20} {'Accuracy':>10} {'Kappa':>8} {'Interpretation':<20}"
        if show_alpha:
            header += _alpha_header()
        lines.append(header)
        lines.append("-" * len(header))
        for cm in nominal_criteria:
            row = (
                f"{cm.name:<20} {_fmt_opt(cm.exact_accuracy, '>10.1%', 10)} "
                f"{_fmt_opt(cm.kappa, '>8.3f', 8)} {cm.kappa_interpretation:<20}"
            )
            if show_alpha:
                row += _alpha_cell(cm)
            row += _degen_suffix(cm)
            lines.append(row)

    if alpha_note_needed:
        # Single-source conflation note for binary/nominal: alpha and Fleiss coincide up
        # to a finite-sample correction, so only the primary (alpha) is reported and
        # Fleiss is omitted. This note lives only here (and in to_dataframe()/docstrings).
        lines.append(
            "  Note: on binary/nominal data Krippendorff's nominal α equals Fleiss' κ up "
            "to a finite-sample correction (1 - κ_F)/(N·R) — one statistic, not "
            "corroborating evidence; α is reported as primary (bare Fleiss omitted)."
        )

    return "\n".join(lines)
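
The note emitted at the end of `summary()` rests on the identity α = κ_F + (1 − κ_F)/(N·R), which holds for nominal data when every one of the N items is rated by all R judges. The sketch below checks it numerically. It is an illustration under those complete-data assumptions, not the library's implementation: both helper functions and the toy `ratings` matrix are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for complete data: every item rated by the same R raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    total = n_items * n_raters
    # Mean per-item agreement: share of agreeing rater pairs within each item.
    p_bar = sum(
        sum(c * (c - 1) for c in Counter(item).values()) / (n_raters * (n_raters - 1))
        for item in ratings
    ) / n_items
    # Chance agreement from the pooled category proportions.
    pooled = Counter(label for item in ratings for label in item)
    p_e = sum((c / total) ** 2 for c in pooled.values())
    return (p_bar - p_e) / (1 - p_e)


def krippendorff_alpha_nominal(ratings):
    """Krippendorff's nominal alpha for complete data (no missing votes)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    total = n_items * n_raters
    # Observed disagreement: disagreeing ordered pairs within each item.
    d_o = sum(
        sum(c * (n_raters - c) for c in Counter(item).values()) / (n_raters - 1)
        for item in ratings
    ) / total
    # Expected disagreement: disagreeing pairs drawn without replacement from the pool.
    pooled = Counter(label for item in ratings for label in item)
    d_e = sum(c * (total - c) for c in pooled.values()) / (total * (total - 1))
    return 1 - d_o / d_e


# Hypothetical verdicts: 5 items (rows) by 3 judges (columns).
ratings = [
    ["MET", "MET", "MET"],
    ["MET", "MET", "UNMET"],
    ["UNMET", "UNMET", "UNMET"],
    ["MET", "UNMET", "UNMET"],
    ["MET", "MET", "MET"],
]
n_items, n_raters = len(ratings), len(ratings[0])

kappa = fleiss_kappa(ratings)
alpha = krippendorff_alpha_nominal(ratings)
correction = (1 - kappa) / (n_items * n_raters)

print(f"kappa={kappa:.4f} alpha={alpha:.4f} alpha-kappa={alpha - kappa:.4f}")
print(f"(1 - kappa)/(N*R)={correction:.4f}")
```

Because the gap shrinks as N·R grows, the two coefficients carry the same information on binary and nominal criteria, which is the reason only α is printed for them.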

to_dataframe

to_dataframe() -> DataFrame

Export metrics to pandas DataFrame.

Returns a flat DataFrame with a 'level' column indicating 'aggregate' / 'criterion' / 'judge'. The criterion-level scalars carry their aggregation level in the column name: accuracy_micro / precision_micro / recall_micro / f1_micro / kappa_micro / phi_micro are pooled over decisions, while accuracy_macro and mean_kappa_macro are unweighted means over criteria (the former bare accuracy / precision / recall / f1 / kappa columns are gone — they mixed levels). The handling modes (cannot_assess_mode / na_mode) and n_samples round-trip on the aggregate row, alongside coverage columns (coverage / judge_abstain_rate / gt_abstain_rate / union_exclusion_rate / n_errored / error_rate; None outside exclude mode). On binary/nominal data Krippendorff's α equals Fleiss' κ up to a finite-sample correction, so α is the single primary inter-judge column and the bare fleiss_kappa value is emitted only for ordinal criteria (different geometry).

Source code in src/autorubric/metrics/_types.py
def to_dataframe(self) -> "pd.DataFrame":
    """Export metrics to pandas DataFrame.

    Returns a flat DataFrame with a 'level' column indicating 'aggregate' / 'criterion'
    / 'judge'. The criterion-level scalars carry their **aggregation level** in the
    column name: ``accuracy_micro`` / ``precision_micro`` / ``recall_micro`` /
    ``f1_micro`` / ``kappa_micro`` / ``phi_micro`` are pooled over decisions, while
    ``accuracy_macro`` and ``mean_kappa_macro`` are unweighted means over criteria (the
    former bare ``accuracy`` / ``precision`` / ``recall`` / ``f1`` / ``kappa`` columns
    are gone — they mixed levels). The handling modes (``cannot_assess_mode`` /
    ``na_mode``) and ``n_samples`` round-trip on the aggregate row, alongside coverage
    columns (``coverage`` / ``judge_abstain_rate`` / ``gt_abstain_rate`` /
    ``union_exclusion_rate`` / ``n_errored`` / ``error_rate``; ``None`` outside exclude
    mode). On binary/nominal data Krippendorff's α equals Fleiss' κ up to a
    finite-sample correction, so α is the single primary inter-judge column and the bare
    ``fleiss_kappa`` value is emitted only for ordinal criteria (different geometry).
    """
    import pandas as pd

    rows = []

    def _coverage_cols(cs: "CoverageStats | None") -> dict:
        """Coverage columns, all None when coverage was not computed (non-exclude mode)."""
        if cs is None:
            return {
                "coverage": None,
                "judge_abstain_rate": None,
                "gt_abstain_rate": None,
                "union_exclusion_rate": None,
                "n_errored": None,
                "error_rate": None,
            }
        return {
            "coverage": cs.coverage,
            "judge_abstain_rate": cs.judge_abstain_rate,
            "gt_abstain_rate": cs.gt_abstain_rate,
            "union_exclusion_rate": cs.union_exclusion_rate,
            "n_errored": cs.n_errored,
            "error_rate": cs.error_rate,
        }

    # Aggregate row. The criterion-level scalars are labelled by aggregation level:
    # accuracy_micro / precision_micro / recall_micro / f1_micro / kappa_micro / phi_micro
    # are pooled over all decisions; accuracy_macro / mean_kappa_macro are unweighted
    # means over criteria. The bare accuracy/precision/recall/f1/kappa columns are gone
    # (they hid the level). On binary/nominal data Krippendorff's α is the single primary
    # inter-judge statistic (Fleiss coincides up to a finite-sample correction), so the
    # aggregate-level mean is mean_krippendorff_alpha and bare Fleiss is omitted there.
    rows.append(
        {
            "level": "aggregate",
            "name": "overall",
            "criterion_type": "all",
            "accuracy_micro": self.criterion_accuracy,
            "accuracy_macro": self.macro_accuracy,
            "precision_micro": self.criterion_precision,
            "recall_micro": self.criterion_recall,
            "f1_micro": self.criterion_f1,
            "mean_kappa_macro": self.mean_kappa,
            "kappa_micro": self.micro_kappa,
            "phi_micro": self.criterion_phi,
            "mean_krippendorff_alpha": self.mean_krippendorff_alpha,
            "cannot_assess_mode": self.cannot_assess_mode,
            "na_mode": self.na_mode,
            "n_samples": self.n_samples,
            "rmse": self.score_rmse,
            "mae": self.score_mae,
            "spearman": self.score_spearman.coefficient,
            "kendall": self.score_kendall.coefficient,
            "pearson": self.score_pearson.coefficient,
            "bias": self.bias.mean_bias,
            "adjacent_accuracy": None,
            "weighted_kappa": None,
            "phi": None,
            "fpr": None,
            "fnr": None,
            "is_degenerate": None,
            "krippendorff_alpha": None,
            "fleiss_kappa": None,
            **_coverage_cols(self.coverage_stats),
        }
    )

    # Per-criterion rows (handle different types). Each criterion's pooled accuracy /
    # kappa land in the *_micro columns (a single criterion has no macro/micro split);
    # the macro columns stay None at this level. Binary/nominal drop the bare Fleiss
    # value (α primary); ordinal keeps both (different geometry).
    for cm in self.per_criterion:
        if cm.criterion_type == "binary":
            rows.append(
                {
                    "level": "criterion",
                    "name": cm.name,
                    "criterion_type": "binary",
                    "accuracy_micro": cm.accuracy,
                    "accuracy_macro": None,
                    "precision_micro": cm.precision,
                    "recall_micro": cm.recall,
                    "f1_micro": cm.f1,
                    "mean_kappa_macro": None,
                    "kappa_micro": cm.kappa,
                    "phi_micro": None,
                    "mean_krippendorff_alpha": None,
                    "cannot_assess_mode": None,
                    "na_mode": None,
                    "n_samples": cm.n_samples,
                    "rmse": None,
                    "mae": None,
                    "spearman": None,
                    "kendall": None,
                    "pearson": None,
                    "bias": None,
                    "adjacent_accuracy": None,
                    "weighted_kappa": None,
                    "phi": cm.phi,
                    "fpr": cm.fpr,
                    "fnr": cm.fnr,
                    "is_degenerate": cm.is_degenerate,
                    "krippendorff_alpha": cm.krippendorff_alpha,
                    # Binary: bare Fleiss dropped (α primary).
                    "fleiss_kappa": None,
                    **_coverage_cols(cm.coverage_stats),
                }
            )
        elif cm.criterion_type == "ordinal":
            rows.append(
                {
                    "level": "criterion",
                    "name": cm.name,
                    "criterion_type": "ordinal",
                    "accuracy_micro": cm.exact_accuracy,
                    "accuracy_macro": None,
                    "precision_micro": None,
                    "recall_micro": None,
                    "f1_micro": None,
                    "mean_kappa_macro": None,
                    "kappa_micro": cm.weighted_kappa,
                    "phi_micro": None,
                    "mean_krippendorff_alpha": None,
                    "cannot_assess_mode": None,
                    "na_mode": None,
                    "n_samples": cm.n_samples,
                    "rmse": cm.rmse,
                    "mae": cm.mae,
                    "spearman": cm.spearman.coefficient,
                    "kendall": cm.kendall.coefficient,
                    "pearson": None,
                    "bias": None,
                    "adjacent_accuracy": cm.adjacent_accuracy,
                    "weighted_kappa": cm.weighted_kappa,
                    "phi": None,
                    "fpr": None,
                    "fnr": None,
                    "is_degenerate": cm.is_degenerate,
                    "krippendorff_alpha": cm.krippendorff_alpha,
                    # Ordinal: keep Fleiss (different geometry from α).
                    "fleiss_kappa": cm.fleiss_kappa,
                    **_coverage_cols(cm.coverage_stats),
                }
            )
        else:  # nominal
            rows.append(
                {
                    "level": "criterion",
                    "name": cm.name,
                    "criterion_type": "nominal",
                    "accuracy_micro": cm.exact_accuracy,
                    "accuracy_macro": None,
                    "precision_micro": None,
                    "recall_micro": None,
                    "f1_micro": None,
                    "mean_kappa_macro": None,
                    "kappa_micro": cm.kappa,
                    "phi_micro": None,
                    "mean_krippendorff_alpha": None,
                    "cannot_assess_mode": None,
                    "na_mode": None,
                    "n_samples": cm.n_samples,
                    "rmse": None,
                    "mae": None,
                    "spearman": None,
                    "kendall": None,
                    "pearson": None,
                    "bias": None,
                    "adjacent_accuracy": None,
                    "weighted_kappa": None,
                    "phi": None,
                    "fpr": None,
                    "fnr": None,
                    "is_degenerate": cm.is_degenerate,
                    "krippendorff_alpha": cm.krippendorff_alpha,
                    # Nominal: bare Fleiss dropped (α primary).
                    "fleiss_kappa": None,
                    **_coverage_cols(cm.coverage_stats),
                }
            )

    # Per-judge rows (if available)
    if self.per_judge:
        for judge_id, jm in self.per_judge.items():
            rows.append(
                {
                    "level": "judge",
                    "name": judge_id,
                    "criterion_type": "all",
                    "accuracy_micro": jm.criterion_accuracy,
                    "accuracy_macro": None,
                    "precision_micro": jm.criterion_precision,
                    "recall_micro": jm.criterion_recall,
                    "f1_micro": jm.criterion_f1,
                    "mean_kappa_macro": jm.mean_kappa,
                    "kappa_micro": None,
                    "phi_micro": None,
                    "mean_krippendorff_alpha": None,
                    "cannot_assess_mode": None,
                    "na_mode": None,
                    "n_samples": None,
                    "rmse": jm.score_rmse,
                    "mae": jm.score_mae,
                    "spearman": jm.score_spearman.coefficient,
                    "kendall": jm.score_kendall.coefficient,
                    "pearson": jm.score_pearson.coefficient,
                    "bias": jm.bias.mean_bias,
                    "adjacent_accuracy": None,
                    "weighted_kappa": None,
                    "phi": jm.phi,
                    "fpr": None,
                    "fnr": None,
                    "is_degenerate": None,
                    "krippendorff_alpha": None,
                    "fleiss_kappa": None,
                    **_coverage_cols(None),
                }
            )

    return pd.DataFrame(rows)

to_file

to_file(path: str | Path) -> None

Save metrics to a JSON file.

PARAMETER DESCRIPTION
path

Path to the output JSON file.

TYPE: str | Path

Source code in src/autorubric/metrics/_types.py
def to_file(self, path: str | Path) -> None:
    """Save metrics to a JSON file.

    Args:
        path: Path to the output JSON file.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(self.model_dump_json(indent=2), encoding="utf-8")
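
Reading a saved file back needs only the standard library. The sketch below mirrors the write pattern of `to_file` with a plain dict standing in for the metrics model; `save_json`, `load_json`, and the payload values are hypothetical. The assumption is that the real file uses the model's field names as JSON keys and encodes undefined metrics as `null`.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def save_json(payload: dict, path: str | Path) -> None:
    """Same write pattern as to_file: create parent directories, then write UTF-8."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_json(path: str | Path) -> dict:
    """Load the saved metrics as a plain dict; JSON null becomes None."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    # The intermediate directories do not exist yet; save_json creates them.
    target = Path(tmp) / "runs" / "latest" / "metrics.json"
    save_json({"mean_kappa": 0.62, "micro_kappa": None}, target)
    restored = load_json(target)

print(restored)
```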

CriterionMetrics

Per-criterion binary metrics.

CriterionMetrics

Bases: BaseModel

Metrics for a single binary criterion.

ATTRIBUTE DESCRIPTION
name

Name of the criterion.

TYPE: str

index

Index of the criterion in the rubric.

TYPE: int

criterion_type

Type of criterion ("binary" for this class).

TYPE: Literal['binary']

n_samples

Number of samples used for this criterion.

TYPE: int

accuracy

Binary accuracy (proportion of exact matches). None when undefined / no samples.

TYPE: float | None

precision

Precision for MET class. None when undefined / no samples.

TYPE: float | None

recall

Recall for MET class. None when undefined / no samples.

TYPE: float | None

f1

F1 score for MET class. None when undefined / no samples.

TYPE: float | None

kappa

Cohen's kappa coefficient. None when undefined (degenerate single-class) / no samples.

TYPE: float | None

kappa_interpretation

Human-readable interpretation of kappa ("undefined" when kappa is None).

TYPE: str

krippendorff_alpha

Krippendorff's alpha — the general, recommended inter-judge agreement statistic. It natively handles unequal/missing raters (errored or excluded votes) and is level-aware (nominal for binary criteria). None unless this is an ensemble with >=2 judges and >=2 items.

TYPE: float | None

fleiss_kappa

Fleiss' kappa — the classic fixed-rater nominal inter-judge agreement measure, computed complete-case (only items where every judge cast a genuine counted vote contribute). Prefer krippendorff_alpha as the general statistic. None unless ensemble with >=2 judges and >=2 complete-case items.

TYPE: float | None

support_true

Count of MET in ground truth.

TYPE: int

support_pred

Count of MET in predictions.

TYPE: int

confusion_matrix

2×2 labelled confusion matrix (["MET", "UNMET"], rows=true, cols=pred). None when there are no samples.

TYPE: ConfusionMatrix | None

fpr

False-positive rate: the share of ground-truth UNMET items that the judge predicted as MET. None when undefined (no ground-truth UNMET items) or when there are no samples.

TYPE: float | None

fnr

False-negative rate: the share of ground-truth MET items that the judge predicted as UNMET. None when undefined (no ground-truth MET items) or when there are no samples.

TYPE: float | None

phi

Matthews correlation coefficient (the φ coefficient) on the {MET, UNMET} dichotomy. None on constant / single-class data, where it is genuinely undefined (never a fabricated 0.0).

TYPE: float | None

is_degenerate

True iff this criterion had samples (n_samples > 0) but kappa is still None — agreement could not be estimated because the data collapsed onto a single class. Distinct from the no-data case (n_samples == 0).

TYPE: bool

coverage_stats

How much of the raw paired sample survived abstention/error exclusion. Only populated under the exclude handling mode; None otherwise.

TYPE: CoverageStats | None
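
The fpr, fnr, and phi attributes can all be derived from the four cells of the 2×2 confusion matrix. The sketch below shows one way to do so while returning None instead of a fabricated 0.0 when a quantity is undefined. It is an illustration, not the library's code: `binary_rates` and its argument names are hypothetical, and "undefined" is assumed to mean an empty denominator.

```python
import math


def binary_rates(tp, fn, fp, tn):
    """Return (fpr, fnr, phi) from 2x2 counts; each is None when undefined."""
    # FPR is conditioned on ground-truth UNMET items, FNR on ground-truth MET items.
    fpr = fp / (fp + tn) if (fp + tn) > 0 else None
    fnr = fn / (fn + tp) if (fn + tp) > 0 else None
    # phi (MCC) needs variation in both the truth margin and the prediction margin.
    denom = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    phi = (tp * tn - fp * fn) / math.sqrt(denom) if denom > 0 else None
    return fpr, fnr, phi


mixed = binary_rates(tp=6, fn=2, fp=1, tn=3)
single_class = binary_rates(tp=5, fn=0, fp=0, tn=0)

print(mixed)
print(single_class)  # (None, 0.0, None)
```

In the single-class case every item is MET in both truth and prediction, so FNR is a legitimate 0.0 while FPR and phi have nothing to be computed over.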


CorrelationResult

Correlation statistics between predicted and ground truth scores.

CorrelationResult

Bases: BaseModel

Result from correlation calculation (Spearman, Kendall, Pearson).

ATTRIBUTE DESCRIPTION
coefficient

The correlation coefficient (-1 to 1). None when the correlation is genuinely undefined: a constant input array (zero variance → NaN) or fewer than 3 samples. Reporting 0.0 ("no correlation") would be misleading in those cases.

TYPE: float | None

p_value

P-value for testing the null hypothesis of no correlation. None when the coefficient is undefined (see coefficient).

TYPE: float | None

ci

Optional confidence interval for the coefficient.

TYPE: ConfidenceInterval | None

interpretation

Human-readable interpretation ("undefined" for a constant/NaN array, "insufficient data" for <3 samples).

TYPE: str

n_samples

Number of samples used in calculation.

TYPE: int

method

Correlation method used (e.g., "spearman", "kendall", "pearson").

TYPE: str

interpret_correlation staticmethod

interpret_correlation(r: float) -> str

Return human-readable interpretation of correlation coefficient.

Source code in src/autorubric/metrics/_types.py
@staticmethod
def interpret_correlation(r: float) -> str:
    """Return human-readable interpretation of correlation coefficient."""
    abs_r = abs(r)
    if abs_r >= 0.9:
        strength = "very strong"
    elif abs_r >= 0.7:
        strength = "strong"
    elif abs_r >= 0.5:
        strength = "moderate"
    elif abs_r >= 0.3:
        strength = "weak"
    else:
        strength = "very weak"

    direction = "positive" if r >= 0 else "negative"
    return f"{strength} {direction}"
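The coefficient field is None for constant input or fewer than 3 samples. A rough sketch of that guard for the Pearson case is shown below. safe_pearson is a hypothetical name, not an autorubric function, and the library may well delegate to a statistics package rather than compute the sums by hand.

```python
import math
from typing import Optional


def safe_pearson(x: list[float], y: list[float]) -> Optional[float]:
    """Pearson r, or None when undefined (fewer than 3 pairs or zero variance).

    Hypothetical helper for illustration only.
    """
    n = len(x)
    if n != len(y):
        raise ValueError("x and y must have the same length")
    if n < 3:
        return None  # too few samples: the "insufficient data" case
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # constant input: r would be NaN, the "undefined" case
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


r_linear = safe_pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_constant = safe_pearson([1, 1, 1, 1], [1, 2, 3, 4])
r_short = safe_pearson([1, 2], [2, 1])
print(r_linear, r_constant, r_short)
```

Callers should check for None before passing the value to interpret_correlation, which expects a float.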

BootstrapResults

Bootstrap confidence intervals for key metrics.

BootstrapResults

Bases: BaseModel

Bootstrap confidence interval results.

The three CIs are MARGINAL — bootstrapped on two independent item-level resample axes (a verdict-item axis for accuracy/kappa, an independent scored-item axis for RMSE), so they reflect each statistic's own sampling distribution, not their joint covariance. Covers any rubric type (binary / multi-choice / mixed).

ATTRIBUTE DESCRIPTION
accuracy_ci

95% CI for criterion_accuracy. None when undefined / no samples.

TYPE: tuple[float, float] | None

kappa_ci

95% CI for mean_kappa (ordinal contributes quadratic-weighted kappa). Each replicate's mean conditions on which criteria were non-degenerate in that resample. None when undefined (kappa never defined across resamples) / no samples.

TYPE: tuple[float, float] | None

rmse_ci

95% CI for score_rmse over the scored-item subset. None when no samples (a single scored item yields a degenerate (v, v) interval, not None).

TYPE: tuple[float, float] | None

n_bootstrap

Number of bootstrap samples used.

TYPE: int

confidence_level

Confidence level (default 0.95).

TYPE: float


BootstrapResult

Single bootstrap result with confidence interval.

BootstrapResult

Bases: BaseModel

Bootstrap confidence interval result.

ATTRIBUTE DESCRIPTION
estimate

Point estimate of the statistic.

TYPE: float

ci

Confidence interval from bootstrap.

TYPE: ConfidenceInterval

standard_error

Bootstrap standard error.

TYPE: float

n_bootstrap

Number of bootstrap samples used.

TYPE: int

bootstrap_distribution

Optional array of bootstrap estimates.

TYPE: list[float] | None
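To make the estimate, ci, standard_error, and n_bootstrap fields concrete, here is a minimal percentile-bootstrap sketch for accuracy. It assumes item-level resampling with replacement and a percentile interval; the source does not state which interval method autorubric uses, so treat the method, the function name bootstrap_accuracy, and the seed parameter as assumptions.

```python
import random
import statistics


def bootstrap_accuracy(correct: list[bool], n_bootstrap: int = 1000,
                       confidence: float = 0.95, seed: int = 0) -> dict:
    """Percentile bootstrap for accuracy, resampling items with replacement.

    Hypothetical helper for illustration only (not the autorubric API).
    """
    n = len(correct)
    if n == 0:
        raise ValueError("need at least one item to bootstrap")
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    estimate = sum(correct) / n

    replicates = []
    for _ in range(n_bootstrap):
        sample = rng.choices(correct, k=n)  # one resample of the items
        replicates.append(sum(sample) / n)
    replicates.sort()

    # Percentile interval: cut the sorted replicates at alpha/2 and 1 - alpha/2.
    alpha = 1.0 - confidence
    lo_idx = int((alpha / 2) * (n_bootstrap - 1))
    hi_idx = int((1 - alpha / 2) * (n_bootstrap - 1))
    return {
        "estimate": estimate,
        "ci": (replicates[lo_idx], replicates[hi_idx]),
        "standard_error": statistics.stdev(replicates),
        "n_bootstrap": n_bootstrap,
    }


result = bootstrap_accuracy([True] * 80 + [False] * 20)
print(result)
```

The standard error here is simply the standard deviation of the replicate estimates, which is the usual bootstrap definition.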


ConfidenceInterval

Confidence interval bounds.

ConfidenceInterval

Bases: BaseModel

Confidence interval for a statistic.

ATTRIBUTE DESCRIPTION
lower

Lower bound of the interval.

TYPE: float

upper

Upper bound of the interval.

TYPE: float

confidence

Confidence level (default 0.95 for 95% CI).

TYPE: float

method

Method used to compute the interval.

TYPE: str

width property

width: float

Width of the confidence interval.


JudgeMetrics

Per-judge metrics for ensemble evaluations.

JudgeMetrics

Bases: BaseModel

Metrics for a single judge in an ensemble.

Mirrors the aggregate's type handling field-for-field: precision/recall/f1 are the binary MET-vs-rest metric → None for a multi-choice-only rubric (no MET class), and accuracy/mean_kappa generalize but are None when undefined.

ATTRIBUTE DESCRIPTION
judge_id

Identifier for this judge.

TYPE: str

criterion_accuracy

Overall criterion-level accuracy (binary label and/or multi-choice exact-match). None when undefined.

TYPE: float | None

criterion_precision

Overall precision for the binary MET class. None when not applicable (multi-choice-only — no MET class).

TYPE: float | None

criterion_recall

Overall recall for the binary MET class. None when not applicable (see criterion_precision).

TYPE: float | None

criterion_f1

Overall F1 for the binary MET class. None when not applicable (see criterion_precision).

TYPE: float | None

mean_kappa

Mean Cohen's kappa across criteria. None when undefined.

TYPE: float | None

phi

Matthews correlation coefficient (φ) for this judge on the binary {MET, UNMET} dichotomy, pooled across criteria. None when undefined (constant / single class / no binary data).

TYPE: float | None

confusion_matrix

This judge's confusion matrix, aggregated across criteria from the raw pre-filter codes (binary MET/UNMET with an abstain CANNOT_ASSESS class last → 3×3). None when there is no data.

TYPE: ConfusionMatrix | None

score_rmse

RMSE of cumulative scores.

TYPE: float

score_mae

MAE of cumulative scores.

TYPE: float

score_spearman

Spearman correlation result.

TYPE: CorrelationResult

score_kendall

Kendall tau correlation result.

TYPE: CorrelationResult

score_pearson

Pearson correlation result.

TYPE: CorrelationResult

bias

Systematic bias analysis result.

TYPE: BiasResult


BiasResult

Systematic bias analysis between predicted and ground truth scores.

BiasResult

Bases: BaseModel

Result from systematic bias analysis.

Systematic bias occurs when one rater consistently scores higher or lower than another, independent of the item being rated.

A statistic is None when it is genuinely undefined for the sample size, never a fake 0.0. mean_bias is the single pred−true difference at n=1 (computable) and is None only at n=0. std_bias is None when undefined (n<2). effect_size (Cohen's d) is None when std_bias is 0 or undefined.

ATTRIBUTE DESCRIPTION
mean_bias

Mean difference (predictions - actuals). None only at n=0.

TYPE: float | None

std_bias

Standard deviation of differences. None when undefined (n < 2).

TYPE: float | None

is_significant

Whether the bias is statistically significant (p < 0.05).

TYPE: bool

p_value

P-value from t-test.

TYPE: float | None

direction

Direction of bias: "positive" if predictions exceed actuals on average, "negative" if they fall below, "none" otherwise.

TYPE: Literal['positive', 'negative', 'none']

effect_size

Cohen's d effect size. None when undefined (std_bias 0 or undefined).

TYPE: float | None

ci

Confidence interval for mean bias.

TYPE: ConfidenceInterval | None

n_samples

Number of samples.

TYPE: int

interpret_effect_size staticmethod

interpret_effect_size(d: float) -> str

Interpret effect size using Cohen's guidelines.

Source code in src/autorubric/metrics/_types.py
@staticmethod
def interpret_effect_size(d: float) -> str:
    """Interpret effect size using Cohen's guidelines."""
    abs_d = abs(d)
    if abs_d < 0.2:
        return "negligible"
    elif abs_d < 0.5:
        return "small"
    elif abs_d < 0.8:
        return "medium"
    else:
        return "large"
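The None rules for mean_bias, std_bias, and effect_size can be sketched as follows. This assumes the sample standard deviation (n - 1 denominator) and Cohen's d computed as mean_bias / std_bias on the paired differences; both are assumptions consistent with the field descriptions, not confirmed details of the library. bias_summary is a hypothetical name.

```python
import math
from typing import Optional


def bias_summary(pred: list[float], true: list[float]) -> dict[str, Optional[float]]:
    """Mean bias, its standard deviation, and Cohen's d, with None when undefined.

    Hypothetical helper for illustration only.
    """
    diffs = [p - t for p, t in zip(pred, true)]
    n = len(diffs)

    # mean_bias is computable from a single pair; only n == 0 is undefined.
    mean_bias = sum(diffs) / n if n >= 1 else None

    # std_bias needs at least two differences (assumed n - 1 denominator).
    std_bias = None
    if n >= 2:
        variance = sum((d - mean_bias) ** 2 for d in diffs) / (n - 1)
        std_bias = math.sqrt(variance)

    # Cohen's d (assumed mean / std) is undefined when std_bias is None or 0.
    effect_size = None
    if std_bias is not None and std_bias > 0:
        effect_size = mean_bias / std_bias

    return {"mean_bias": mean_bias, "std_bias": std_bias, "effect_size": effect_size}


lenient = bias_summary([0.8, 0.9, 0.7, 0.6], [0.7, 0.7, 0.6, 0.6])
single = bias_summary([0.9], [0.5])
constant_offset = bias_summary([0.5, 1.0], [0.25, 0.75])
empty = bias_summary([], [])
print(lenient, single, constant_offset, empty, sep="\n")
```

The constant_offset case shows why effect_size must be nullable: every difference is identical, so std_bias is 0 and the ratio is undefined even though the bias itself is clearly nonzero.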

OrdinalCriterionMetrics

Per-criterion metrics for ordinal multi-choice criteria.

OrdinalCriterionMetrics

Bases: BaseModel

Metrics for an ordinal multi-choice criterion.

Ordinal criteria have options with inherent ordering (e.g., satisfaction 1-4). This enables additional metrics like weighted kappa and rank correlations.

ATTRIBUTE DESCRIPTION
name

Name of the criterion.

TYPE: str

index

Index of the criterion in the rubric.

TYPE: int

criterion_type

Type of criterion ("ordinal" for this class).

TYPE: Literal['ordinal']

n_samples

Number of samples used in computation.

TYPE: int

n_options

Number of options in this criterion.

TYPE: int

exact_accuracy

Proportion of exact index matches. None when undefined / no samples.

TYPE: float | None

adjacent_accuracy

Proportion within +/-1 position. None when undefined / no samples.

TYPE: float | None

weighted_kappa

Quadratic-weighted Cohen's kappa (accounts for distance). None when undefined (degenerate single-class) / no samples.

TYPE: float | None

kappa_interpretation

Human-readable interpretation of kappa ("undefined" when weighted_kappa is None).

TYPE: str

krippendorff_alpha

Krippendorff's alpha — the general, recommended inter-judge agreement statistic. Computed with level_of_measurement="ordinal" so it is distance-aware (near-miss disagreements penalized less than far-miss), and it natively handles unequal/missing raters. None unless ensemble with >=2 judges and >=2 items.

TYPE: float | None

fleiss_kappa

Fleiss' kappa — the classic fixed-rater nominal measure (ignores ordering), computed complete-case. Prefer krippendorff_alpha for ordinal criteria. None unless ensemble with >=2 judges and >=2 complete-case items.

TYPE: float | None

spearman

Spearman rank correlation result.

TYPE: CorrelationResult

kendall

Kendall tau correlation result.

TYPE: CorrelationResult

rmse

RMSE on option values (0-1 scale). None when undefined / no samples.

TYPE: float | None

mae

MAE on option values (0-1 scale). None when undefined / no samples.

TYPE: float | None

per_option

Per-option precision/recall/F1 breakdown.

TYPE: list[OptionMetrics]

confusion_matrix

N×N labelled confusion matrix (rows=true, cols=pred); its .labels carry the option labels (the former option_labels).

TYPE: ConfusionMatrix

is_degenerate

True iff this criterion had samples (n_samples > 0) but weighted_kappa is still None — agreement could not be estimated because the data collapsed onto a single class. Distinct from the no-data case (n_samples == 0), where every metric is None simply for lack of samples.

TYPE: bool

coverage_stats

How much of the raw paired sample survived abstention/error exclusion. Only populated under the exclude handling mode; None otherwise.

TYPE: CoverageStats | None
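The difference between exact_accuracy, adjacent_accuracy, and weighted_kappa is easiest to see on a small example. The sketch below uses the textbook quadratic-weighted kappa formula on option indices. ordinal_agreement is a hypothetical name, and the sketch returns None only when the expected disagreement is zero; the library's exact degeneracy rule is not shown in the source and may be stricter.

```python
from typing import Optional


def ordinal_agreement(true: list[int], pred: list[int],
                      n_options: int) -> dict[str, Optional[float]]:
    """Exact accuracy, adjacent accuracy and quadratic-weighted kappa.

    Hypothetical helper for illustration only; inputs are option indices
    in the range 0 .. n_options - 1.
    """
    n = len(true)
    if n == 0:
        return {"exact_accuracy": None, "adjacent_accuracy": None,
                "weighted_kappa": None}

    exact = sum(t == p for t, p in zip(true, pred)) / n
    adjacent = sum(abs(t - p) <= 1 for t, p in zip(true, pred)) / n

    # Observed confusion counts (rows = true, cols = pred) and their marginals.
    observed = [[0] * n_options for _ in range(n_options)]
    for t, p in zip(true, pred):
        observed[t][p] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n_options)) for j in range(n_options)]

    # Quadratic weights (i - j)^2; the usual 1 / (k - 1)^2 factor cancels
    # in the ratio of observed to expected weighted disagreement.
    num = 0.0
    den = 0.0
    for i in range(n_options):
        for j in range(n_options):
            weight = (i - j) ** 2
            num += weight * observed[i][j]
            den += weight * row[i] * col[j] / n
    kappa = 1.0 - num / den if den > 0 else None

    return {"exact_accuracy": exact, "adjacent_accuracy": adjacent,
            "weighted_kappa": kappa}


near_misses = ordinal_agreement([0, 1, 2, 3, 3, 2], [0, 1, 3, 3, 2, 2], n_options=4)
collapsed = ordinal_agreement([2, 2, 2], [2, 2, 2], n_options=4)
print(near_misses)
print(collapsed)
```

In near_misses two of six predictions are off by one position, so exact accuracy drops to 4/6 while adjacent accuracy stays at 1.0 and the weighted kappa remains high. In collapsed the data sits on a single class, so accuracy is 1.0 but kappa is undefined.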


NominalCriterionMetrics

Per-criterion metrics for nominal multi-choice criteria.

NominalCriterionMetrics

Bases: BaseModel

Metrics for a nominal multi-choice criterion.

Nominal criteria have unordered categories (e.g., "too few", "just right", "too many"). Distance between options is not meaningful, so only exact matches matter.

ATTRIBUTE DESCRIPTION
name

Name of the criterion.

TYPE: str

index

Index of the criterion in the rubric.

TYPE: int

criterion_type

Type of criterion ("nominal" for this class).

TYPE: Literal['nominal']

n_samples

Number of samples used in computation.

TYPE: int

n_options

Number of options in this criterion.

TYPE: int

exact_accuracy

Proportion of exact index matches. None when undefined / no samples.

TYPE: float | None

kappa

Unweighted Cohen's kappa (N×N). None when undefined (degenerate single-class) / no samples.

TYPE: float | None

kappa_interpretation

Human-readable interpretation of kappa ("undefined" when kappa is None).

TYPE: str

krippendorff_alpha

Krippendorff's alpha — the general, recommended inter-judge agreement statistic. Computed with level_of_measurement="nominal" and natively handles unequal/missing raters. None unless ensemble with >=2 judges and >=2 items.

TYPE: float | None

fleiss_kappa

Fleiss' kappa — the classic fixed-rater nominal measure, computed complete-case. Prefer krippendorff_alpha as the general statistic. None unless ensemble with >=2 judges and >=2 complete-case items.

TYPE: float | None

per_option

Per-option precision/recall/F1 breakdown.

TYPE: list[OptionMetrics]

confusion_matrix

N×N labelled confusion matrix (rows=true, cols=pred); its .labels carry the option labels (the former option_labels).

TYPE: ConfusionMatrix

is_degenerate

True iff this criterion had samples (n_samples > 0) but kappa is still None — agreement could not be estimated because the data collapsed onto a single class. Distinct from the no-data case (n_samples == 0).

TYPE: bool

coverage_stats

How much of the raw paired sample survived abstention/error exclusion. Only populated under the exclude handling mode; None otherwise.

TYPE: CoverageStats | None
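As an illustration of how a nominal Krippendorff's alpha tolerates missing raters, the following sketch builds the coincidence matrix directly. It follows the standard textbook formulation rather than the library's code; krippendorff_alpha_nominal and its list-of-lists input shape are hypothetical, with None marking a judge who did not rate an item.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def krippendorff_alpha_nominal(
    units: Sequence[Sequence[Optional[Hashable]]],
) -> Optional[float]:
    """Nominal Krippendorff's alpha; units[i] holds every judge's label for item i.

    Hypothetical helper for illustration only. None marks a missing rating.
    """
    coincidences: Counter = Counter()
    pairable_units = 0
    for ratings in units:
        values = [v for v in ratings if v is not None]
        m = len(values)
        if m < 2:
            continue  # an item rated by fewer than two judges cannot be paired
        pairable_units += 1
        counts = Counter(values)
        for c, n_c in counts.items():
            for k, n_k in counts.items():
                # Ordered pairs of ratings within this item, weighted by 1 / (m - 1).
                pairs = n_c * (n_c - 1) if c == k else n_c * n_k
                coincidences[(c, k)] += pairs / (m - 1)

    if pairable_units < 2:
        return None  # mirrors the ">= 2 items" requirement described above

    totals: Counter = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    observed = sum(w for (c, k), w in coincidences.items() if c != k) / n
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n * (n - 1))
    if expected == 0:
        return None  # every rating is the same category: alpha is undefined
    return 1.0 - observed / expected


perfect = krippendorff_alpha_nominal([["a", "a"], ["b", "b"]])
mixed = krippendorff_alpha_nominal([["a", "b"], ["a", "a"], ["b", "b"]])
with_missing = krippendorff_alpha_nominal(
    [["a", "a", None], ["b", None, "b"], ["a", "b", "b"]]
)
single_class = krippendorff_alpha_nominal([["a", "a"], ["a", "a"]])
print(perfect, mixed, with_missing, single_class)
```

Items with a missing rating still contribute through their remaining pairs, which is the property that a complete-case Fleiss' kappa lacks.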


NAStats

Statistics for NA (not applicable) handling in multi-choice criteria.

NAStats

Bases: BaseModel

Statistics for NA (not applicable) handling in multi-choice criteria.

Tracks how the prediction and ground truth agree on the dichotomized {NA, not-NA} decision per item, similar to how CANNOT_ASSESS is handled for binary criteria.

ATTRIBUTE DESCRIPTION
na_count_true

Number of NA selections in ground truth.

TYPE: int

na_count_pred

Number of NA selections in predictions.

TYPE: int

na_kappa

Cohen's kappa on the {NA, not-NA} dichotomy (pred vs truth). Range [-1, 1]; 1.0 is perfect agreement, 0 is chance-level, negative is worse than chance. None when undefined (no paired NA observations, single class, or NaN). The framework reports prediction-vs-ground-truth categorical agreement as Cohen's kappa across the board (binary kappa, ordinal weighted_kappa, nominal kappa); na_kappa is the dichotomized kappa for the orthogonal abstain decision. Readers who want a raw proportion can derive A / (A + fp + fn) from the counts below, where A is the number of items that both prediction and ground truth marked NA (A = na_count_true - na_false_negative), fp is na_false_positive, and fn is na_false_negative.

TYPE: float | None

na_kappa_interpretation

Landis & Koch interpretation of na_kappa via KappaResult.interpret_kappa. None when na_kappa is None.

TYPE: str | None

na_false_positive

Count where prediction was NA but ground truth was not.

TYPE: int

na_false_negative

Count where ground truth was NA but prediction was not.

TYPE: int
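The relationship between the NAStats counts, the dichotomized kappa, and the raw proportion A / (A + fp + fn) can be sketched as below. na_agreement is a hypothetical helper that takes per-item booleans ("was this item marked NA?"); it applies the standard two-rater Cohen's kappa formula and is not the library's implementation, whose single-class rule may be stricter.

```python
from typing import Optional


def na_agreement(true_is_na: list[bool], pred_is_na: list[bool]) -> dict:
    """Counts, Cohen's kappa and raw overlap on the {NA, not-NA} dichotomy.

    Hypothetical helper for illustration only.
    """
    pairs = list(zip(true_is_na, pred_is_na))
    n = len(pairs)
    both = sum(t and p for t, p in pairs)             # A: both marked NA
    false_pos = sum(p and not t for t, p in pairs)    # pred NA, truth not
    false_neg = sum(t and not p for t, p in pairs)    # truth NA, pred not
    neither = n - both - false_pos - false_neg
    na_count_true = both + false_neg
    na_count_pred = both + false_pos

    kappa: Optional[float] = None
    if n > 0:
        p_observed = (both + neither) / n
        p_expected = (
            na_count_true * na_count_pred
            + (n - na_count_true) * (n - na_count_pred)
        ) / (n * n)
        if p_expected < 1.0:  # equals 1.0 when both sides use a single class
            kappa = (p_observed - p_expected) / (1.0 - p_expected)

    union = both + false_pos + false_neg
    raw_overlap = both / union if union > 0 else None  # A / (A + fp + fn)

    return {
        "na_count_true": na_count_true,
        "na_count_pred": na_count_pred,
        "na_false_positive": false_pos,
        "na_false_negative": false_neg,
        "na_kappa": kappa,
        "raw_overlap": raw_overlap,
    }


truth = [True, True] + [False] * 8
preds = [True, False, True] + [False] * 7
stats = na_agreement(truth, preds)
no_na = na_agreement([False] * 5, [False] * 5)
print(stats)
print(no_na)
```

Here the two sides agree on 8 of 10 items, yet kappa is only 0.375 because most of that agreement is on the dominant not-NA class; the raw overlap of 1/3 tells the same story from the counts alone.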


CannotAssessStats

Statistics for CANNOT_ASSESS handling in binary criteria — the binary parallel of NAStats. Both are abstentions that flow through the same SKIP scoring path and get a dichotomized Cohen's-kappa diagnostic, but they are tracked as distinct types: CANNOT_ASSESS is an epistemic abstention on a yes/no decision ("I cannot determine MET vs. UNMET"), while multi-choice NA is "no applicable option" (a statement about the option space). Its fields are ca_-prefixed: ca_count_true, ca_count_pred, ca_kappa (float | None), ca_kappa_interpretation, ca_false_positive, ca_false_negative.

CannotAssessStats

Bases: BaseModel

Statistics for CANNOT_ASSESS handling in binary criteria.

The binary parallel of NAStats: tracks how the prediction and ground truth agree on the dichotomized {CANNOT_ASSESS, not-CANNOT_ASSESS} decision per item.

Both CANNOT_ASSESS (binary) and NA (multi-choice) are abstentions that flow through the same SKIP scoring path (score_reports), and both get a parallel dichotomized Cohen's-kappa diagnostic block. They are nonetheless distinct kinds of abstention, which is exactly why they are tracked by separate stats types rather than merged:

  • Binary CANNOT_ASSESS is the judge being unable to determine MET-vs-UNMET — an epistemic abstention on a yes/no question ("I cannot decide whether this requirement is met").
  • Multi-choice NA is "not applicable / cannot pick an applicable option" — abstaining because no scored category fits, a statement about the option space rather than a yes/no decision.

Keeping them separate (and prefixing these fields ca_) makes the semantic distinction explicit in the data model while preserving the structural analogy.

ATTRIBUTE DESCRIPTION
ca_count_true

Number of CANNOT_ASSESS verdicts in ground truth.

TYPE: int

ca_count_pred

Number of CANNOT_ASSESS verdicts in predictions.

TYPE: int

ca_kappa

Cohen's kappa on the {CANNOT_ASSESS, not-CANNOT_ASSESS} dichotomy (pred vs truth). Range [-1, 1]; 1.0 is perfect agreement, 0 is chance-level, negative is worse than chance. None when undefined (no paired CANNOT_ASSESS observations, single class, or NaN). The framework reports prediction-vs-ground-truth categorical agreement as Cohen's kappa across the board (binary kappa, ordinal weighted_kappa, nominal kappa, and NA's na_kappa); ca_kappa is the dichotomized kappa for the orthogonal binary abstain decision. Readers who want a raw proportion can derive A / (A + ca_fp + ca_fn) from the counts below, where A is the number of items that both prediction and ground truth marked CANNOT_ASSESS (A = ca_count_true - ca_false_negative), ca_fp is ca_false_positive, and ca_fn is ca_false_negative.

TYPE: float | None

ca_kappa_interpretation

Landis & Koch interpretation of ca_kappa via KappaResult.interpret_kappa. None when ca_kappa is None.

TYPE: str | None

ca_false_positive

Count where prediction was CANNOT_ASSESS but ground truth was not.

TYPE: int

ca_false_negative

Count where ground truth was CANNOT_ASSESS but prediction was not.

TYPE: int


CoverageStats

How much of the raw paired sample survived abstention/error exclusion. Built only under the exclude handling mode (under as_unmet / as_category no observation is dropped, so coverage would be trivially 1.0 and these stats are left None). n_total is the raw pre-exclusion denominator and n_covered equals the per-criterion n_samples; every rate (coverage, judge_abstain_rate, gt_abstain_rate, union_exclusion_rate, error_rate) is float | None, None when its denominator is zero.

CoverageStats

Bases: BaseModel

How much of the raw paired sample survived abstention/error exclusion.

Built only under the exclude handling mode, where abstentions (CANNOT_ASSESS / NA) and grading errors drop a paired observation from the agreement denominator. Under as_unmet or as_category no observation is dropped, so coverage would be trivially 1.0 and these stats are not produced (left None by callers).

n_total is the raw pre-exclusion denominator; n_covered is what remained after the union of all exclusion reasons (it equals the per-criterion n_samples). Every rate honours undefined→None (None when its denominator is zero); counts stay int.

ATTRIBUTE DESCRIPTION
n_total

Raw pre-exclusion paired count (the denominator before any drops).

TYPE: int

n_covered

Paired count remaining after union-exclusion (== per-criterion n_samples).

TYPE: int

coverage

n_covered / n_total. None when n_total == 0.

TYPE: float | None

judge_abstain_rate

Fraction of the raw pairs where the judge/prediction abstained. None when n_total == 0.

TYPE: float | None

gt_abstain_rate

Fraction of the raw pairs where the ground truth abstained. None when n_total == 0.

TYPE: float | None

union_exclusion_rate

Fraction excluded for any reason (1 - coverage). None when n_total == 0.

TYPE: float | None

n_errored

Count of paired observations dropped because grading errored.

TYPE: int

error_rate

n_errored / n_total. None when n_total == 0.

TYPE: float | None
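The rate fields above all share one denominator, n_total, and all become None when it is zero. A small sketch of that arithmetic is shown below. coverage_summary and its n_judge_abstain / n_gt_abstain parameters are hypothetical names chosen for illustration; only the output keys mirror the documented fields.

```python
from typing import Optional


def coverage_summary(n_total: int, n_covered: int, n_judge_abstain: int,
                     n_gt_abstain: int, n_errored: int) -> dict:
    """Derive the documented coverage rates from raw counts.

    Hypothetical helper for illustration only. Exclusion reasons can overlap
    on the same pair, so union_exclusion_rate is derived from coverage rather
    than by summing the individual rates.
    """
    def rate(count: int) -> Optional[float]:
        # Every rate is undefined (None) when there were no raw pairs at all.
        return count / n_total if n_total > 0 else None

    coverage = rate(n_covered)
    return {
        "n_total": n_total,
        "n_covered": n_covered,
        "coverage": coverage,
        "judge_abstain_rate": rate(n_judge_abstain),
        "gt_abstain_rate": rate(n_gt_abstain),
        "union_exclusion_rate": None if coverage is None else 1.0 - coverage,
        "n_errored": n_errored,
        "error_rate": rate(n_errored),
    }


typical = coverage_summary(n_total=100, n_covered=90, n_judge_abstain=6,
                           n_gt_abstain=4, n_errored=2)
no_pairs = coverage_summary(n_total=0, n_covered=0, n_judge_abstain=0,
                            n_gt_abstain=0, n_errored=0)
print(typical)
print(no_pairs)
```

In the typical case the individual rates add up to 0.12 while the union exclusion rate is 0.10, because two of the excluded pairs were dropped for more than one reason.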


References

Casabianca, J., McCaffrey, D. F., Johnson, M. S., Alper, N., and Zubenko, V. (2025). Validity Arguments For Constructed Response Scoring Using Generative Artificial Intelligence Applications. arXiv:2501.02334.

He, J., Shi, J., Zhuo, T. Y., Treude, C., Sun, J., Xing, Z., Du, X., and Lo, D. (2025). LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead. arXiv:2510.24367.