Metrics

Agreement and correlation metrics for validating LLM judges against ground truth.

Overview

When your dataset includes ground truth labels, compute_metrics() measures how well your LLM judge agrees with human annotations. Metrics include accuracy, precision, recall, F1, Cohen's kappa, correlations, and systematic bias analysis.

Research Background

Casabianca et al. (2025) recommend agreement metrics including ICC, Krippendorff's alpha, and quadratic-weighted kappa (QWK), with iterative refinement until agreement with human-labeled subsets is acceptable. He et al. (2025) emphasize that correlation alone can mask systematic bias.
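
To see why, consider a judge whose scores track the human ranking perfectly but are shifted upward by a constant: correlation is perfect while every score is inflated. A minimal illustration using numpy and scipy directly (independent of autorubric):

import numpy as np
from scipy.stats import pearsonr

human = np.array([0.2, 0.4, 0.5, 0.7, 0.9])
judge = human + 0.15  # tracks the ranking exactly, but over-scores every item

r, _ = pearsonr(human, judge)
print(f"Pearson r = {r:.3f}")                        # 1.000 -- looks perfect
print(f"Mean bias = {np.mean(judge - human):+.3f}")  # +0.150 -- systematic over-scoring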

Quick Example

from autorubric import RubricDataset, LLMConfig, evaluate
from autorubric.graders import CriterionGrader

dataset = RubricDataset.from_file("data_with_ground_truth.json")
grader = CriterionGrader(llm_config=LLMConfig(model="openai/gpt-4.1-mini"))

result = await evaluate(dataset, grader, show_progress=True)

# Compute metrics
metrics = result.compute_metrics(dataset)

# Formatted summary
print(metrics.summary())

# Export options
df = metrics.to_dataframe()
metrics.to_file("metrics.json")

Bootstrap Confidence Intervals

metrics = result.compute_metrics(
    dataset,
    bootstrap=True,
    n_bootstrap=1000,
    confidence_level=0.95,
    seed=42,
)

print(metrics.summary())
# Bootstrap CIs (95%):
#   Accuracy: [85.2%, 92.1%]
#   Kappa:    [0.712, 0.845]
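
The interval bounds can also be read programmatically from the bootstrap field (a BootstrapResults, documented below), for example:

if metrics.bootstrap is not None:
    acc_low, acc_high = metrics.bootstrap.accuracy_ci
    kappa_low, kappa_high = metrics.bootstrap.kappa_ci
    print(f"Accuracy CI: [{acc_low:.1%}, {acc_high:.1%}]")
    print(f"Kappa CI:    [{kappa_low:.3f}, {kappa_high:.3f}]")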

Per-Judge Metrics (Ensemble)

metrics = result.compute_metrics(
    dataset,
    per_judge=True,
)

for judge_id, jm in metrics.per_judge.items():
    print(f"{judge_id}: Accuracy={jm.criterion_accuracy:.1%}, RMSE={jm.score_rmse:.4f}")

Metric Fields

Field                 Description
criterion_accuracy    Overall accuracy across all criteria
criterion_precision   Precision for MET class
criterion_recall      Recall for MET class
criterion_f1          F1 score for MET class
mean_kappa            Mean Cohen's kappa across criteria
score_rmse            RMSE of cumulative scores
score_mae             MAE of cumulative scores
score_spearman        Spearman rank correlation
score_kendall         Kendall tau correlation
score_pearson         Pearson correlation
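
These fields are attributes of the returned MetricsResult; the correlation fields are CorrelationResult objects (see below), so the coefficient sits one level down. For example:

print(f"Accuracy:   {metrics.criterion_accuracy:.1%}")
print(f"Mean kappa: {metrics.mean_kappa:.3f}")
print(f"Score RMSE: {metrics.score_rmse:.4f}")
print(f"Spearman:   {metrics.score_spearman.coefficient:.3f} ({metrics.score_spearman.interpretation})")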

compute_metrics

Compute agreement metrics between predictions and ground truth.

compute_metrics

compute_metrics(eval_result: 'EvalResult', dataset: 'RubricDataset', *, bootstrap: bool = False, n_bootstrap: int = 1000, per_judge: bool = False, cannot_assess: Literal['exclude', 'as_unmet'] = 'exclude', na_mode: Literal['exclude', 'as_worst'] = 'exclude', confidence_level: float = 0.95, seed: int | None = None) -> MetricsResult

Compute comprehensive evaluation metrics.

This is the main entry point for computing metrics from an evaluation run. It compares predicted verdicts and scores against ground truth from the dataset. Supports binary, ordinal, and nominal (multi-choice) criteria.

PARAMETER DESCRIPTION
eval_result

The evaluation result from EvalRunner.

TYPE: 'EvalResult'

dataset

The dataset with ground truth labels.

TYPE: 'RubricDataset'

bootstrap

If True, compute bootstrap confidence intervals (expensive).

TYPE: bool DEFAULT: False

n_bootstrap

Number of bootstrap samples if bootstrap=True.

TYPE: int DEFAULT: 1000

per_judge

If True and ensemble, compute per-judge metrics.

TYPE: bool DEFAULT: False

cannot_assess

How to handle CANNOT_ASSESS verdicts (binary criteria):
- "exclude": Skip pairs where either is CA (default)
- "as_unmet": Treat CA as UNMET

TYPE: Literal['exclude', 'as_unmet'] DEFAULT: 'exclude'

na_mode

How to handle NA options (multi-choice criteria):
- "exclude": Skip pairs where either is NA (default)
- "as_worst": Keep NA in metrics (no special treatment)

TYPE: Literal['exclude', 'as_worst'] DEFAULT: 'exclude'

confidence_level

Confidence level for bootstrap CIs (default 0.95).

TYPE: float DEFAULT: 0.95

seed

Random seed for bootstrap reproducibility.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
MetricsResult

MetricsResult with comprehensive metrics and optional per-judge breakdown.

RAISES DESCRIPTION
ValueError

If no common items between eval_result and dataset.

Example

result = await evaluate(dataset, grader)
metrics = result.compute_metrics(dataset)
print(metrics.summary())
df = metrics.to_dataframe()
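
The same call with the verdict-handling options spelled out (the values shown are the defaults):

metrics = result.compute_metrics(
    dataset,
    cannot_assess="exclude",  # or "as_unmet" to count CANNOT_ASSESS as UNMET
    na_mode="exclude",        # or "as_worst" to keep NA options in the metrics
)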

Source code in src/autorubric/metrics/_compute.py
def compute_metrics(
    eval_result: "EvalResult",
    dataset: "RubricDataset",
    *,
    bootstrap: bool = False,
    n_bootstrap: int = 1000,
    per_judge: bool = False,
    cannot_assess: Literal["exclude", "as_unmet"] = "exclude",
    na_mode: Literal["exclude", "as_worst"] = "exclude",
    confidence_level: float = 0.95,
    seed: int | None = None,
) -> MetricsResult:
    """Compute comprehensive evaluation metrics.

    This is the main entry point for computing metrics from an evaluation run.
    It compares predicted verdicts and scores against ground truth from the dataset.
    Supports binary, ordinal, and nominal (multi-choice) criteria.

    Args:
        eval_result: The evaluation result from EvalRunner.
        dataset: The dataset with ground truth labels.
        bootstrap: If True, compute bootstrap confidence intervals (expensive).
        n_bootstrap: Number of bootstrap samples if bootstrap=True.
        per_judge: If True and ensemble, compute per-judge metrics.
        cannot_assess: How to handle CANNOT_ASSESS verdicts (binary criteria):
            - "exclude": Skip pairs where either is CA (default)
            - "as_unmet": Treat CA as UNMET
        na_mode: How to handle NA options (multi-choice criteria):
            - "exclude": Skip pairs where either is NA (default)
            - "as_worst": Keep NA in metrics (no special treatment)
        confidence_level: Confidence level for bootstrap CIs (default 0.95).
        seed: Random seed for bootstrap reproducibility.

    Returns:
        MetricsResult with comprehensive metrics and optional per-judge breakdown.

    Raises:
        ValueError: If no common items between eval_result and dataset.

    Example:
        >>> result = await evaluate(dataset, grader)
        >>> metrics = result.compute_metrics(dataset)
        >>> print(metrics.summary())
        >>> df = metrics.to_dataframe()
    """
    result_warnings: list[str] = []

    # Build map of item_idx -> ItemResult
    eval_map = {ir.item_idx: ir for ir in eval_result.item_results}

    # Check for missing/extra items
    dataset_indices = set(range(len(dataset)))
    eval_indices = set(eval_map.keys())

    missing = dataset_indices - eval_indices
    if missing:
        result_warnings.append(
            f"{len(missing)} items from dataset not found in eval_result"
        )

    extra = eval_indices - dataset_indices
    if extra:
        result_warnings.append(
            f"{len(extra)} items in eval_result not in dataset"
        )

    # Use intersection
    common_indices = sorted(dataset_indices & eval_indices)

    if not common_indices:
        raise ValueError("No common items between eval_result and dataset")

    # Validate rubric homogeneity for metrics computation
    # If using per-item rubrics, all must have the same structure
    if dataset.rubric is not None:
        reference_rubric = dataset.rubric
    else:
        # Get rubric from first item
        reference_rubric = dataset.get_item_rubric(common_indices[0])

    reference_n_criteria = len(reference_rubric.rubric)

    for idx in common_indices:
        item_rubric = dataset.get_item_rubric(idx)
        if len(item_rubric.rubric) != reference_n_criteria:
            raise ValueError(
                f"Cannot compute metrics: items have different rubric structures. "
                f"Item {idx} has {len(item_rubric.rubric)} criteria but "
                f"expected {reference_n_criteria}. "
                f"Metrics require homogeneous rubric structures across all items."
            )

    # Use the reference rubric for classification
    criteria = list(reference_rubric.rubric)
    criterion_types = classify_criteria(criteria)
    n_criteria = len(criteria)

    # Count criteria by type
    n_binary = sum(1 for ct in criterion_types if ct == "binary")
    n_ordinal = sum(1 for ct in criterion_types if ct == "ordinal")
    n_nominal = sum(1 for ct in criterion_types if ct == "nominal")

    # Per-criterion data storage
    # For binary: list[CriterionVerdict]
    # For multi-choice: list[int] (option indices)
    per_criterion_pred: list[list[CriterionVerdict | int]] = [[] for _ in range(n_criteria)]
    per_criterion_true: list[list[CriterionVerdict | int]] = [[] for _ in range(n_criteria)]

    # Overall scores
    all_pred_scores: list[float] = []
    all_true_scores: list[float] = []

    # For ensemble: per-judge data (binary only for now)
    judge_scores: dict[str, list[float]] = {}
    judge_verdicts: dict[str, list[list[CriterionVerdict]]] = {}
    is_ensemble = False

    items_with_ground_truth = 0

    # NA tracking for multi-choice
    total_na_true = 0
    total_na_pred = 0
    total_na_agreement = 0
    total_na_fp = 0
    total_na_fn = 0

    for idx in common_indices:
        item = dataset.items[idx]
        item_result = eval_map[idx]
        report = item_result.report

        if item.ground_truth is None:
            result_warnings.append(f"Item {idx} has no ground truth, skipping")
            continue

        if item_result.error is not None:
            continue

        items_with_ground_truth += 1

        # Extract predictions using type-aware extraction
        pred_all = extract_all_verdicts_from_report(report, criteria)

        # Resolve ground truth (string labels → indices for multi-choice)
        try:
            true_all = resolve_ground_truth(list(item.ground_truth), criteria)
        except ValueError as e:
            result_warnings.append(f"Item {idx}: {e}")
            continue

        # Store per-criterion data
        for c_idx in range(n_criteria):
            pred_val = pred_all[c_idx]
            true_val = true_all[c_idx]

            # Handle None predictions (failed extraction)
            if pred_val is None:
                if criterion_types[c_idx] == "binary":
                    pred_val = CriterionVerdict.UNMET
                else:
                    pred_val = 0  # Default to first option

            per_criterion_pred[c_idx].append(pred_val)
            per_criterion_true[c_idx].append(true_val)

        # Compute scores
        pred_score = report.score if not report.error else 0.0
        # For true score, need to pass the original ground truth format
        # compute_weighted_score expects CriterionVerdict for binary, str for multi-choice
        true_score_verdicts = []
        for c_idx in range(n_criteria):
            if criterion_types[c_idx] == "binary":
                true_score_verdicts.append(true_all[c_idx])
            else:
                # For multi-choice, pass the option label (string)
                criterion = criteria[c_idx]
                opt_idx = true_all[c_idx]
                if isinstance(opt_idx, int) and 0 <= opt_idx < len(criterion.options):
                    true_score_verdicts.append(criterion.options[opt_idx].label)
                else:
                    # Default to first option if index is invalid
                    true_score_verdicts.append(criterion.options[0].label)

        true_score = dataset.compute_weighted_score(true_score_verdicts)

        all_pred_scores.append(pred_score)
        all_true_scores.append(true_score)

        # Check if ensemble and collect per-judge data
        if hasattr(report, "judge_scores") and report.judge_scores:
            is_ensemble = True
            for jid, score in report.judge_scores.items():
                if jid not in judge_scores:
                    judge_scores[jid] = []
                    judge_verdicts[jid] = []
                judge_scores[jid].append(score)

            # Extract per-judge verdicts from EnsembleCriterionReport.votes (binary only)
            if hasattr(report, "report") and report.report:
                for jid in judge_scores.keys():
                    judge_v = []
                    for cr in report.report:
                        if hasattr(cr, "votes"):
                            for vote in cr.votes:
                                if vote.judge_id == jid:
                                    judge_v.append(vote.verdict)
                                    break
                            else:
                                judge_v.append(CriterionVerdict.UNMET)
                        else:
                            judge_v.append(CriterionVerdict.UNMET)
                    if jid in judge_verdicts:
                        judge_verdicts[jid].append(judge_v)

    n_items = items_with_ground_truth

    if n_items == 0:
        raise ValueError("No valid items with ground truth found")

    # Compute per-criterion metrics by type
    per_criterion: list[CriterionMetricsUnion] = []
    criterion_kappas: list[float] = []

    # For binary-only aggregate metrics
    binary_pred_flat: list[int] = []
    binary_true_flat: list[int] = []

    for c_idx in range(n_criteria):
        criterion = criteria[c_idx]
        c_type = criterion_types[c_idx]
        pred_data = per_criterion_pred[c_idx]
        true_data = per_criterion_true[c_idx]

        if c_type == "binary":
            # Binary criterion metrics
            pred_verdicts = [v for v in pred_data if isinstance(v, CriterionVerdict)]
            true_verdicts = [v for v in true_data if isinstance(v, CriterionVerdict)]

            # Filter CANNOT_ASSESS
            pred_filtered = []
            true_filtered = []
            for p, t in zip(pred_verdicts, true_verdicts):
                if cannot_assess == "exclude":
                    if p == CriterionVerdict.CANNOT_ASSESS or t == CriterionVerdict.CANNOT_ASSESS:
                        continue
                pred_filtered.append(_verdict_to_binary(p))
                true_filtered.append(_verdict_to_binary(t))

            # Add to aggregate
            binary_pred_flat.extend(pred_filtered)
            binary_true_flat.extend(true_filtered)

            name = criterion.name or f"Criterion {c_idx + 1}"

            if not pred_filtered:
                per_criterion.append(
                    CriterionMetrics(
                        name=name,
                        index=c_idx,
                        n_samples=0,
                        accuracy=0.0,
                        precision=0.0,
                        recall=0.0,
                        f1=0.0,
                        kappa=0.0,
                        kappa_interpretation="undefined",
                        support_true=0,
                        support_pred=0,
                    )
                )
                continue

            c_acc = accuracy_score(true_filtered, pred_filtered)
            c_prec = precision_score(true_filtered, pred_filtered, zero_division=0)
            c_rec = recall_score(true_filtered, pred_filtered, zero_division=0)
            c_f1 = f1_score(true_filtered, pred_filtered, zero_division=0)

            try:
                c_kappa = cohen_kappa_score(true_filtered, pred_filtered)
            except Exception:
                c_kappa = 0.0

            criterion_kappas.append(c_kappa)

            per_criterion.append(
                CriterionMetrics(
                    name=name,
                    index=c_idx,
                    n_samples=len(pred_filtered),
                    accuracy=float(c_acc),
                    precision=float(c_prec),
                    recall=float(c_rec),
                    f1=float(c_f1),
                    kappa=float(c_kappa),
                    kappa_interpretation=_interpret_kappa(c_kappa),
                    support_true=sum(true_filtered),
                    support_pred=sum(pred_filtered),
                )
            )

        elif c_type == "ordinal":
            # Ordinal multi-choice criterion metrics
            pred_indices = [v for v in pred_data if isinstance(v, int)]
            true_indices = [v for v in true_data if isinstance(v, int)]

            # Filter NA options
            pred_filtered, true_filtered, na_agree, na_fp, na_fn = filter_na_multi_choice(
                pred_indices, true_indices, criterion, mode=na_mode
            )

            # Track NA stats
            total_na_agreement += na_agree
            total_na_fp += na_fp
            total_na_fn += na_fn

            metrics = _compute_ordinal_criterion_metrics(
                pred_filtered, true_filtered, criterion, c_idx
            )
            per_criterion.append(metrics)

            # Use weighted kappa for ordinal in mean calculation
            criterion_kappas.append(metrics.weighted_kappa)

        else:  # nominal
            # Nominal multi-choice criterion metrics
            pred_indices = [v for v in pred_data if isinstance(v, int)]
            true_indices = [v for v in true_data if isinstance(v, int)]

            # Filter NA options
            pred_filtered, true_filtered, na_agree, na_fp, na_fn = filter_na_multi_choice(
                pred_indices, true_indices, criterion, mode=na_mode
            )

            # Track NA stats
            total_na_agreement += na_agree
            total_na_fp += na_fp
            total_na_fn += na_fn

            metrics = _compute_nominal_criterion_metrics(
                pred_filtered, true_filtered, criterion, c_idx
            )
            per_criterion.append(metrics)

            # Use unweighted kappa for nominal
            criterion_kappas.append(metrics.kappa)

    # Aggregate metrics
    mean_kappa = (
        sum(criterion_kappas) / len(criterion_kappas) if criterion_kappas else 0.0
    )

    # Binary-only aggregate metrics (precision/recall/f1 only make sense for binary)
    if binary_pred_flat:
        criterion_accuracy = accuracy_score(binary_true_flat, binary_pred_flat)
        criterion_precision = precision_score(binary_true_flat, binary_pred_flat, zero_division=0)
        criterion_recall = recall_score(binary_true_flat, binary_pred_flat, zero_division=0)
        criterion_f1 = f1_score(binary_true_flat, binary_pred_flat, zero_division=0)
    else:
        # No binary criteria - compute accuracy across all multi-choice
        # For multi-choice, accuracy is exact match
        all_correct = 0
        all_total = 0
        for c_idx in range(n_criteria):
            c_type = criterion_types[c_idx]
            if c_type != "binary":
                pred_data = per_criterion_pred[c_idx]
                true_data = per_criterion_true[c_idx]
                for p, t in zip(pred_data, true_data):
                    if isinstance(p, int) and isinstance(t, int):
                        all_total += 1
                        if p == t:
                            all_correct += 1

        criterion_accuracy = all_correct / all_total if all_total > 0 else 0.0
        # Precision/recall/f1 not meaningful for pure multi-choice rubrics
        criterion_precision = 0.0
        criterion_recall = 0.0
        criterion_f1 = 0.0

    # Score-level metrics
    score_rmse = float(np.sqrt(mean_squared_error(all_true_scores, all_pred_scores)))
    score_mae = float(mean_absolute_error(all_true_scores, all_pred_scores))

    score_spearman = _compute_correlation(all_pred_scores, all_true_scores, "spearman")
    score_kendall = _compute_correlation(all_pred_scores, all_true_scores, "kendall")
    score_pearson = _compute_correlation(all_pred_scores, all_true_scores, "pearson")

    # Bias analysis
    bias = systematic_bias(all_pred_scores, all_true_scores)

    # Bootstrap CIs (optional) - uses binary metrics for backwards compat
    bootstrap_results = None
    if bootstrap and binary_pred_flat:
        bootstrap_results = _compute_bootstrap_ci(
            binary_true_flat,
            binary_pred_flat,
            all_true_scores,
            all_pred_scores,
            n_bootstrap=n_bootstrap,
            confidence_level=confidence_level,
            seed=seed,
        )

    # Per-judge metrics (optional, for ensemble) - binary only for now
    per_judge_metrics = None
    if per_judge and is_ensemble and judge_scores:
        per_judge_metrics = {}
        for jid in judge_scores.keys():
            jv = judge_verdicts.get(jid, [])
            if not jv:
                continue

            # Extract binary verdicts for this judge
            binary_true_verdicts = []
            for true_item in per_criterion_true:
                binary_true_verdicts.append(
                    [v for v in true_item if isinstance(v, CriterionVerdict)]
                )

            per_judge_metrics[jid] = _compute_judge_metrics(
                judge_id=jid,
                judge_scores=judge_scores[jid],
                true_scores=all_true_scores,
                judge_verdicts=jv,
                true_verdicts=binary_true_verdicts[0] if binary_true_verdicts else [],
                cannot_assess=cannot_assess,
            )

    # NA stats (for multi-choice criteria)
    na_stats = None
    if n_ordinal > 0 or n_nominal > 0:
        # Calculate total NA counts
        for c_idx in range(n_criteria):
            if criterion_types[c_idx] != "binary":
                criterion = criteria[c_idx]
                na_indices = {i for i, opt in enumerate(criterion.options) if opt.na}
                if na_indices:
                    pred_data = per_criterion_pred[c_idx]
                    true_data = per_criterion_true[c_idx]
                    for p in pred_data:
                        if isinstance(p, int) and p in na_indices:
                            total_na_pred += 1
                    for t in true_data:
                        if isinstance(t, int) and t in na_indices:
                            total_na_true += 1

        total_na = total_na_true + total_na_pred
        na_stats = NAStats(
            na_count_true=total_na_true,
            na_count_pred=total_na_pred,
            na_agreement=total_na_agreement / max(1, total_na) if total_na > 0 else 0.0,
            na_false_positive=total_na_fp,
            na_false_negative=total_na_fn,
        )

    return MetricsResult(
        criterion_accuracy=float(criterion_accuracy),
        criterion_precision=float(criterion_precision),
        criterion_recall=float(criterion_recall),
        criterion_f1=float(criterion_f1),
        mean_kappa=float(mean_kappa),
        per_criterion=per_criterion,
        score_rmse=score_rmse,
        score_mae=score_mae,
        score_spearman=score_spearman,
        score_kendall=score_kendall,
        score_pearson=score_pearson,
        bias=bias,
        bootstrap=bootstrap_results,
        per_judge=per_judge_metrics,
        n_items=n_items,
        n_criteria=n_criteria,
        n_binary_criteria=n_binary,
        n_ordinal_criteria=n_ordinal,
        n_nominal_criteria=n_nominal,
        na_stats=na_stats,
        warnings=result_warnings,
    )

MetricsResult

Complete metrics result with aggregate and per-criterion breakdowns.

MetricsResult

Bases: BaseModel

Complete metrics result from compute_metrics().

This is the main result type returned by EvalResult.compute_metrics(). It provides a comprehensive view of evaluation quality including:

- Criterion-level agreement metrics
- Score-level correlation and error metrics
- Per-criterion breakdown (supports binary, ordinal, and nominal criteria)
- Optional bootstrap confidence intervals
- Optional per-judge metrics for ensemble evaluations

ATTRIBUTE DESCRIPTION
criterion_accuracy

Overall accuracy across all criteria.

TYPE: float

criterion_precision

Overall precision for MET class (binary criteria only).

TYPE: float

criterion_recall

Overall recall for MET class (binary criteria only).

TYPE: float

criterion_f1

Overall F1 for MET class (binary criteria only).

TYPE: float

mean_kappa

Mean kappa across criteria (weighted for ordinal, unweighted for binary/nominal).

TYPE: float

per_criterion

Per-criterion metrics breakdown (polymorphic union type).

TYPE: list[CriterionMetricsUnion]

score_rmse

RMSE of cumulative scores.

TYPE: float

score_mae

MAE of cumulative scores.

TYPE: float

score_spearman

Spearman correlation result.

TYPE: CorrelationResult

score_kendall

Kendall tau correlation result.

TYPE: CorrelationResult

score_pearson

Pearson correlation result.

TYPE: CorrelationResult

bias

Systematic bias analysis.

TYPE: BiasResult

bootstrap

Optional bootstrap confidence intervals.

TYPE: BootstrapResults | None

per_judge

Optional per-judge metrics for ensemble.

TYPE: dict[str, JudgeMetrics] | None

n_items

Number of items used in computation.

TYPE: int

n_criteria

Number of criteria.

TYPE: int

n_binary_criteria

Number of binary criteria (default 0 for backwards compat).

TYPE: int

n_ordinal_criteria

Number of ordinal multi-choice criteria.

TYPE: int

n_nominal_criteria

Number of nominal multi-choice criteria.

TYPE: int

na_stats

Statistics for NA handling in multi-choice criteria.

TYPE: NAStats | None

warnings

Any warnings generated during computation.

TYPE: list[str]

summary

summary() -> str

Return formatted text summary of metrics.

Source code in src/autorubric/metrics/_types.py
def summary(self) -> str:
    """Return formatted text summary of metrics."""
    lines = []
    lines.append("=" * 60)
    lines.append("METRICS SUMMARY")
    lines.append("=" * 60)

    # Show criteria type breakdown if mixed
    criteria_info = f"Items: {self.n_items}, Criteria: {self.n_criteria}"
    if self.n_ordinal_criteria > 0 or self.n_nominal_criteria > 0:
        type_parts = []
        if self.n_binary_criteria > 0:
            type_parts.append(f"{self.n_binary_criteria} binary")
        if self.n_ordinal_criteria > 0:
            type_parts.append(f"{self.n_ordinal_criteria} ordinal")
        if self.n_nominal_criteria > 0:
            type_parts.append(f"{self.n_nominal_criteria} nominal")
        criteria_info += f" ({', '.join(type_parts)})"
    lines.append(criteria_info)

    if self.warnings:
        lines.append(f"\nWarnings ({len(self.warnings)}):")
        for w in self.warnings:
            lines.append(f"  - {w}")

    lines.append("")
    lines.append("Criterion-Level Metrics:")
    lines.append(f"  Accuracy:   {self.criterion_accuracy:.1%}")
    if self.n_binary_criteria > 0:
        lines.append(f"  Precision:  {self.criterion_precision:.2f}")
        lines.append(f"  Recall:     {self.criterion_recall:.2f}")
        lines.append(f"  F1:         {self.criterion_f1:.2f}")
    lines.append(f"  Mean Kappa: {self.mean_kappa:.3f}")

    lines.append("")
    lines.append("Score-Level Metrics:")
    lines.append(f"  RMSE:     {self.score_rmse:.4f}")
    lines.append(f"  MAE:      {self.score_mae:.4f}")
    lines.append(
        f"  Spearman: {self.score_spearman.coefficient:.4f} "
        f"({self.score_spearman.interpretation})"
    )
    lines.append(
        f"  Kendall:  {self.score_kendall.coefficient:.4f} "
        f"({self.score_kendall.interpretation})"
    )
    lines.append(
        f"  Pearson:  {self.score_pearson.coefficient:.4f} "
        f"({self.score_pearson.interpretation})"
    )

    lines.append("")
    lines.append("Bias Analysis:")
    lines.append(
        f"  Mean Bias:   {self.bias.mean_bias:+.4f} ({self.bias.direction})"
    )
    lines.append(f"  Significant: {'Yes' if self.bias.is_significant else 'No'}")

    # NA stats for multi-choice
    if self.na_stats:
        lines.append("")
        lines.append("NA Handling:")
        lines.append(f"  NA in Ground Truth: {self.na_stats.na_count_true}")
        lines.append(f"  NA in Predictions:  {self.na_stats.na_count_pred}")
        lines.append(f"  NA Agreement:       {self.na_stats.na_agreement:.1%}")
        if self.na_stats.na_false_positive > 0 or self.na_stats.na_false_negative > 0:
            lines.append(
                f"  NA FP/FN:           {self.na_stats.na_false_positive} / "
                f"{self.na_stats.na_false_negative}"
            )

    if self.bootstrap:
        lines.append("")
        lines.append(f"Bootstrap CIs ({self.bootstrap.confidence_level:.0%}):")
        lines.append(
            f"  Accuracy: [{self.bootstrap.accuracy_ci[0]:.1%}, "
            f"{self.bootstrap.accuracy_ci[1]:.1%}]"
        )
        lines.append(
            f"  Kappa:    [{self.bootstrap.kappa_ci[0]:.3f}, "
            f"{self.bootstrap.kappa_ci[1]:.3f}]"
        )
        lines.append(
            f"  RMSE:     [{self.bootstrap.rmse_ci[0]:.4f}, "
            f"{self.bootstrap.rmse_ci[1]:.4f}]"
        )

    if self.per_judge:
        lines.append("")
        lines.append("Per-Judge Metrics:")
        for judge_id, jm in sorted(self.per_judge.items()):
            lines.append(
                f"  {judge_id}: RMSE={jm.score_rmse:.4f}, "
                f"Spearman={jm.score_spearman.coefficient:.4f}"
            )

    lines.append("")
    lines.append("Per-Criterion Breakdown:")

    # Separate display by criterion type
    binary_criteria = [cm for cm in self.per_criterion if cm.criterion_type == "binary"]
    ordinal_criteria = [cm for cm in self.per_criterion if cm.criterion_type == "ordinal"]
    nominal_criteria = [cm for cm in self.per_criterion if cm.criterion_type == "nominal"]

    if binary_criteria:
        if ordinal_criteria or nominal_criteria:
            lines.append("\nBinary Criteria:")
        header = f"{'Criterion':<20} {'Acc':>8} {'Prec':>8} {'Rec':>8} {'F1':>8} {'Kappa':>8}"
        lines.append(header)
        lines.append("-" * len(header))
        for cm in binary_criteria:
            lines.append(
                f"{cm.name:<20} {cm.accuracy:>8.1%} {cm.precision:>8.2f} "
                f"{cm.recall:>8.2f} {cm.f1:>8.2f} {cm.kappa:>8.3f}"
            )

    if ordinal_criteria:
        lines.append("\nOrdinal Criteria:")
        header = f"{'Criterion':<20} {'Exact':>8} {'Adj':>8} {'WKappa':>8} {'Spearman':>10} {'RMSE':>8}"
        lines.append(header)
        lines.append("-" * len(header))
        for cm in ordinal_criteria:
            lines.append(
                f"{cm.name:<20} {cm.exact_accuracy:>8.1%} {cm.adjacent_accuracy:>8.1%} "
                f"{cm.weighted_kappa:>8.3f} {cm.spearman.coefficient:>10.4f} {cm.rmse:>8.4f}"
            )

    if nominal_criteria:
        lines.append("\nNominal Criteria:")
        header = f"{'Criterion':<20} {'Accuracy':>10} {'Kappa':>8} {'Interpretation':<20}"
        lines.append(header)
        lines.append("-" * len(header))
        for cm in nominal_criteria:
            lines.append(
                f"{cm.name:<20} {cm.exact_accuracy:>10.1%} {cm.kappa:>8.3f} "
                f"{cm.kappa_interpretation:<20}"
            )

    return "\n".join(lines)

to_dataframe

to_dataframe() -> DataFrame

Export metrics to pandas DataFrame.

Returns a flat DataFrame with a 'level' column indicating:

- 'aggregate': Overall metrics
- 'criterion': Per-criterion metrics (binary, ordinal, and nominal; the 'criterion_type' column distinguishes them)
- 'judge': Per-judge metrics (if available)

Source code in src/autorubric/metrics/_types.py
def to_dataframe(self) -> "pd.DataFrame":
    """Export metrics to pandas DataFrame.

    Returns a flat DataFrame with a 'level' column indicating:
    - 'aggregate': Overall metrics
    - 'criterion': Per-criterion metrics (binary)
    - 'criterion_ordinal': Per-criterion metrics (ordinal)
    - 'criterion_nominal': Per-criterion metrics (nominal)
    - 'judge': Per-judge metrics (if available)
    """
    import pandas as pd

    rows = []

    # Aggregate row
    rows.append(
        {
            "level": "aggregate",
            "name": "overall",
            "criterion_type": "all",
            "accuracy": self.criterion_accuracy,
            "precision": self.criterion_precision,
            "recall": self.criterion_recall,
            "f1": self.criterion_f1,
            "kappa": self.mean_kappa,
            "rmse": self.score_rmse,
            "mae": self.score_mae,
            "spearman": self.score_spearman.coefficient,
            "kendall": self.score_kendall.coefficient,
            "pearson": self.score_pearson.coefficient,
            "bias": self.bias.mean_bias,
            "adjacent_accuracy": None,
            "weighted_kappa": None,
        }
    )

    # Per-criterion rows (handle different types)
    for cm in self.per_criterion:
        if cm.criterion_type == "binary":
            rows.append(
                {
                    "level": "criterion",
                    "name": cm.name,
                    "criterion_type": "binary",
                    "accuracy": cm.accuracy,
                    "precision": cm.precision,
                    "recall": cm.recall,
                    "f1": cm.f1,
                    "kappa": cm.kappa,
                    "rmse": None,
                    "mae": None,
                    "spearman": None,
                    "kendall": None,
                    "pearson": None,
                    "bias": None,
                    "adjacent_accuracy": None,
                    "weighted_kappa": None,
                }
            )
        elif cm.criterion_type == "ordinal":
            rows.append(
                {
                    "level": "criterion",
                    "name": cm.name,
                    "criterion_type": "ordinal",
                    "accuracy": cm.exact_accuracy,
                    "precision": None,
                    "recall": None,
                    "f1": None,
                    "kappa": cm.weighted_kappa,
                    "rmse": cm.rmse,
                    "mae": cm.mae,
                    "spearman": cm.spearman.coefficient,
                    "kendall": cm.kendall.coefficient,
                    "pearson": None,
                    "bias": None,
                    "adjacent_accuracy": cm.adjacent_accuracy,
                    "weighted_kappa": cm.weighted_kappa,
                }
            )
        else:  # nominal
            rows.append(
                {
                    "level": "criterion",
                    "name": cm.name,
                    "criterion_type": "nominal",
                    "accuracy": cm.exact_accuracy,
                    "precision": None,
                    "recall": None,
                    "f1": None,
                    "kappa": cm.kappa,
                    "rmse": None,
                    "mae": None,
                    "spearman": None,
                    "kendall": None,
                    "pearson": None,
                    "bias": None,
                    "adjacent_accuracy": None,
                    "weighted_kappa": None,
                }
            )

    # Per-judge rows (if available)
    if self.per_judge:
        for judge_id, jm in self.per_judge.items():
            rows.append(
                {
                    "level": "judge",
                    "name": judge_id,
                    "criterion_type": "all",
                    "accuracy": jm.criterion_accuracy,
                    "precision": jm.criterion_precision,
                    "recall": jm.criterion_recall,
                    "f1": jm.criterion_f1,
                    "kappa": jm.mean_kappa,
                    "rmse": jm.score_rmse,
                    "mae": jm.score_mae,
                    "spearman": jm.score_spearman.coefficient,
                    "kendall": jm.score_kendall.coefficient,
                    "pearson": jm.score_pearson.coefficient,
                    "bias": jm.bias.mean_bias,
                    "adjacent_accuracy": None,
                    "weighted_kappa": None,
                }
            )

    return pd.DataFrame(rows)
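
A usage sketch: because every row carries 'level' and 'criterion_type' columns, the exported frame can be sliced without reshaping:

df = metrics.to_dataframe()

overall = df[df["level"] == "aggregate"]        # single overall row
per_criterion = df[df["level"] == "criterion"]  # one row per criterion
print(per_criterion[["name", "criterion_type", "accuracy", "kappa"]])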

to_file

to_file(path: str | Path) -> None

Save metrics to a JSON file.

PARAMETER DESCRIPTION
path

Path to the output JSON file.

TYPE: str | Path

Source code in src/autorubric/metrics/_types.py
def to_file(self, path: str | Path) -> None:
    """Save metrics to a JSON file.

    Args:
        path: Path to the output JSON file.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(self.model_dump_json(indent=2), encoding="utf-8")
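
The file is plain JSON produced by model_dump_json, so saved runs can be reloaded with the standard library if you want to post-process them:

import json
from pathlib import Path

metrics.to_file("metrics.json")
data = json.loads(Path("metrics.json").read_text(encoding="utf-8"))
print(data["criterion_accuracy"], data["mean_kappa"])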

CriterionMetrics

Per-criterion binary metrics.

CriterionMetrics

Bases: BaseModel

Metrics for a single binary criterion.

ATTRIBUTE DESCRIPTION
name

Name of the criterion.

TYPE: str

index

Index of the criterion in the rubric.

TYPE: int

criterion_type

Type of criterion ("binary" for this class).

TYPE: Literal['binary']

n_samples

Number of samples used for this criterion.

TYPE: int

accuracy

Binary accuracy (proportion of exact matches).

TYPE: float

precision

Precision for MET class.

TYPE: float

recall

Recall for MET class.

TYPE: float

f1

F1 score for MET class.

TYPE: float

kappa

Cohen's kappa coefficient.

TYPE: float

kappa_interpretation

Human-readable interpretation of kappa.

TYPE: str

support_true

Count of MET in ground truth.

TYPE: int

support_pred

Count of MET in predictions.

TYPE: int


CorrelationResult

Correlation statistics between predicted and ground truth scores.

CorrelationResult

Bases: BaseModel

Result from correlation calculation (Spearman, Kendall, Pearson).

ATTRIBUTE DESCRIPTION
coefficient

The correlation coefficient (-1 to 1).

TYPE: float

p_value

P-value for testing the null hypothesis of no correlation.

TYPE: float | None

ci

Optional confidence interval for the coefficient.

TYPE: ConfidenceInterval | None

interpretation

Human-readable interpretation.

TYPE: str

n_samples

Number of samples used in calculation.

TYPE: int

method

Correlation method used (e.g., "spearman", "kendall", "pearson").

TYPE: str

interpret_correlation staticmethod

interpret_correlation(r: float) -> str

Return human-readable interpretation of correlation coefficient.

Source code in src/autorubric/metrics/_types.py
@staticmethod
def interpret_correlation(r: float) -> str:
    """Return human-readable interpretation of correlation coefficient."""
    abs_r = abs(r)
    if abs_r >= 0.9:
        strength = "very strong"
    elif abs_r >= 0.7:
        strength = "strong"
    elif abs_r >= 0.5:
        strength = "moderate"
    elif abs_r >= 0.3:
        strength = "weak"
    else:
        strength = "very weak"

    direction = "positive" if r >= 0 else "negative"
    return f"{strength} {direction}"

BootstrapResults

Bootstrap confidence intervals for key metrics.

BootstrapResults

Bases: BaseModel

Bootstrap confidence interval results.

ATTRIBUTE DESCRIPTION
accuracy_ci

95% CI for criterion-level accuracy.

TYPE: tuple[float, float]

kappa_ci

95% CI for mean kappa.

TYPE: tuple[float, float]

rmse_ci

95% CI for score RMSE.

TYPE: tuple[float, float]

n_bootstrap

Number of bootstrap samples used.

TYPE: int

confidence_level

Confidence level (default 0.95).

TYPE: float


BootstrapResult

Single bootstrap result with confidence interval.

BootstrapResult

Bases: BaseModel

Bootstrap confidence interval result.

ATTRIBUTE DESCRIPTION
estimate

Point estimate of the statistic.

TYPE: float

ci

Confidence interval from bootstrap.

TYPE: ConfidenceInterval

standard_error

Bootstrap standard error.

TYPE: float

n_bootstrap

Number of bootstrap samples used.

TYPE: int

bootstrap_distribution

Optional array of bootstrap estimates.

TYPE: list[float] | None


ConfidenceInterval

Confidence interval bounds.

ConfidenceInterval

Bases: BaseModel

Confidence interval for a statistic.

ATTRIBUTE DESCRIPTION
lower

Lower bound of the interval.

TYPE: float

upper

Upper bound of the interval.

TYPE: float

confidence

Confidence level (default 0.95 for 95% CI).

TYPE: float

method

Method used to compute the interval.

TYPE: str

width property

width: float

Width of the confidence interval.


JudgeMetrics

Per-judge metrics for ensemble evaluations.

JudgeMetrics

Bases: BaseModel

Metrics for a single judge in an ensemble.

ATTRIBUTE DESCRIPTION
judge_id

Identifier for this judge.

TYPE: str

criterion_accuracy

Overall criterion-level accuracy.

TYPE: float

criterion_precision

Overall precision for MET class.

TYPE: float

criterion_recall

Overall recall for MET class.

TYPE: float

criterion_f1

Overall F1 for MET class.

TYPE: float

mean_kappa

Mean Cohen's kappa across criteria.

TYPE: float

score_rmse

RMSE of cumulative scores.

TYPE: float

score_mae

MAE of cumulative scores.

TYPE: float

score_spearman

Spearman correlation result.

TYPE: CorrelationResult

score_kendall

Kendall tau correlation result.

TYPE: CorrelationResult

score_pearson

Pearson correlation result.

TYPE: CorrelationResult

bias

Systematic bias analysis result.

TYPE: BiasResult


References

Casabianca, J., McCaffrey, D. F., Johnson, M. S., Alper, N., and Zubenko, V. (2025). Validity Arguments For Constructed Response Scoring Using Generative Artificial Intelligence Applications. arXiv:2501.02334.

He, J., Shi, J., Zhuo, T. Y., Treude, C., Sun, J., Xing, Z., Du, X., and Lo, D. (2025). LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead. arXiv:2510.24367.