Eval Runner¶
High-throughput batch evaluation with checkpointing, resumption, and timing statistics.
Overview¶
EvalRunner and the evaluate() convenience function provide infrastructure for evaluating datasets at scale. Features include parallel execution with rate limiting, progress tracking, automatic checkpointing for long-running jobs, and comprehensive timing/cost statistics.
Research Background
Casabianca et al. (2025) recommend maintaining a "gold set" of human-graded examples and sampling 1-5% of production traffic for continuous validation. EvalRunner provides the infrastructure for systematic evaluation with checkpointing for long-running jobs and cost tracking for budget management.
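As a rough sketch of that workflow (the file names, record schema, and 2% sampling rate below are illustrative assumptions, not part of the autorubric API):

```python
import json
import random

# Sample ~2% of production responses into a continuous-validation set (illustrative).
random.seed(0)
with open("production_responses.json") as f:
    records = json.load(f)
sample = random.sample(records, k=max(1, int(0.02 * len(records))))
with open("validation_sample.json", "w") as f:
    json.dump(sample, f)

# The sampled file can then be graded like any other dataset:
# dataset = RubricDataset.from_file("validation_sample.json")
# result = await evaluate(dataset, grader)
```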
Quick Example¶
```python
from autorubric import RubricDataset, LLMConfig, evaluate
from autorubric.graders import CriterionGrader


async def main():
    dataset = RubricDataset.from_file("essays.json")
    grader = CriterionGrader(
        llm_config=LLMConfig(
            model="openai/gpt-4.1-mini",
            max_parallel_requests=10,
        )
    )
    result = await evaluate(dataset, grader, show_progress=True)
    print(f"Evaluated {result.successful_items}/{result.total_items}")
    print(f"Throughput: {result.timing_stats.items_per_second:.2f} items/s")
    print(f"Total cost: ${result.total_completion_cost:.4f}")
```
Checkpointing and Resumption¶
```python
from autorubric import EvalRunner, EvalConfig, EvalResult

# First run (may be interrupted)
config = EvalConfig(
    experiment_name="my-essay-eval",
    experiments_dir="./experiments",
    show_progress=True,
)
runner = EvalRunner(dataset=dataset, grader=grader, config=config)
result = await runner.run()
# Saves to: experiments/my-essay-eval/manifest.json + items.jsonl

# Resume after crash
runner = EvalRunner(dataset=dataset, grader=grader, config=config)
result = await runner.run()  # Skips already-completed items

# Load results later
result = EvalResult.from_experiment("experiments/my-essay-eval")
```
Rate Limiting¶
```python
from autorubric import LLMConfig
from autorubric.graders import CriterionGrader, JudgeSpec

grader = CriterionGrader(
    judges=[
        JudgeSpec(LLMConfig(model="openai/gpt-4.1", max_parallel_requests=10), "gpt"),
        JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929", max_parallel_requests=5), "claude"),
    ],
    aggregation="majority",
)
```
Rate limiting uses a global per-provider semaphore, so all `openai/*` models share the same limit.
evaluate¶
Convenience function for batch evaluation.
evaluate async¶

```python
evaluate(dataset: RubricDataset, grader: Grader, *, fail_fast: bool = False, show_progress: bool = True, progress_style: Literal['simple', 'detailed'] = 'simple', max_concurrent_items: int | None = None, experiment_name: str | None = None, experiments_dir: Path | str = 'experiments', resume: bool = True) -> EvalResult
```
Evaluate a dataset with a grader.
Convenience wrapper around EvalRunner.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `dataset` | `RubricDataset` | The dataset to evaluate. |
| `grader` | `Grader` | The grader to use. |
| `fail_fast` | `bool` | Stop on first error if True. |
| `show_progress` | `bool` | Display progress bars if True. |
| `progress_style` | `Literal['simple', 'detailed']` | "simple" or "detailed" progress display. |
| `max_concurrent_items` | `int \| None` | Limit concurrent items (None = unlimited). |
| `experiment_name` | `str \| None` | Name for this experiment run. |
| `experiments_dir` | `Path \| str` | Root directory for experiment outputs. |
| `resume` | `bool` | If True and experiment exists, resume from checkpoint. |

| RETURNS | DESCRIPTION |
|---|---|
| `EvalResult` | EvalResult with all results and aggregated statistics. |
Example
```python
from autorubric.eval import evaluate

result = await evaluate(dataset, grader, show_progress=True)
print(f"Evaluated {result.successful_items}/{result.total_items}")
```
EvalRunner¶
Runner class for batch evaluation with checkpointing.
EvalRunner¶

```python
EvalRunner(dataset: RubricDataset, grader: Grader, config: EvalConfig | None = None)
```
Runs batch evaluations with rate limiting and progress tracking.
This class orchestrates the evaluation of a RubricDataset using a grader, handling:

- Concurrent execution with configurable parallelism
- Rate limiting via LLMConfig.max_parallel_requests
- Progress display with rich progress bars
- Checkpointing and resumption from failures
- Result aggregation with timing statistics
Example
```python
from autorubric import LLMConfig, RubricDataset
from autorubric.graders import CriterionGrader
from autorubric.eval import EvalRunner, EvalConfig

dataset = RubricDataset.from_file("data.json")
grader = CriterionGrader(
    llm_config=LLMConfig(
        model="openai/gpt-4",
        max_parallel_requests=10,
    )
)

runner = EvalRunner(dataset=dataset, grader=grader)
result = await runner.run()
print(f"Evaluated {result.successful_items}/{result.total_items}")
```
Initialize the evaluation runner.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `dataset` | `RubricDataset` | The dataset to evaluate. |
| `grader` | `Grader` | The grader to use for evaluation. |
| `config` | `EvalConfig \| None` | Optional configuration. Uses defaults if not provided. |
run async¶

```python
run() -> EvalResult
```
Run the evaluation and return aggregated results.
| RETURNS | DESCRIPTION |
|---|---|
| `EvalResult` | EvalResult with all item results, aggregated usage/cost, and timing statistics. |

| RAISES | DESCRIPTION |
|---|---|
| `RuntimeError` | If fail_fast=True and any item fails. |
EvalConfig¶
Configuration options for evaluation runs.
EvalConfig dataclass¶

```python
EvalConfig(fail_fast: bool = False, show_progress: bool = True, progress_style: Literal['simple', 'detailed'] = 'simple', max_concurrent_items: int | None = None, experiment_name: str | None = None, experiments_dir: Path | str = 'experiments', resume: bool = True)
```
Configuration for evaluation runs.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `fail_fast` | `bool` | If True, stop on first error. Default False continues all items. |
| `show_progress` | `bool` | Whether to display progress bars. Default True. |
| `progress_style` | `Literal['simple', 'detailed']` | Style of progress display. "simple": single overall progress bar; "detailed": shows per-judge progress for ensemble mode. |
| `max_concurrent_items` | `int \| None` | Maximum items to grade concurrently. None = grade all items in parallel (default). Set this to limit memory usage for very large datasets. |
| `experiment_name` | `str \| None` | Name for this experiment run. If None, auto-generates using coolname. |
| `experiments_dir` | `Path \| str` | Root directory for experiment outputs. Default is "./experiments". |
| `resume` | `bool` | If True and experiment exists, resume from checkpoint. Default True. |
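For example, a run that caps concurrency and uses the detailed progress display could be configured like this (a sketch built only from the attributes above; the experiment name and the `dataset`/`grader` variables are assumed from earlier examples):

```python
from autorubric import EvalRunner, EvalConfig

# Cap in-flight items to bound memory on a large dataset; show per-judge progress.
config = EvalConfig(
    experiment_name="large-essay-eval",  # illustrative name
    max_concurrent_items=50,
    progress_style="detailed",
    resume=True,
)
runner = EvalRunner(dataset=dataset, grader=grader, config=config)
result = await runner.run()
```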
EvalResult¶
Results from a completed evaluation run.
EvalResult dataclass¶

```python
EvalResult(item_results: list[ItemResult], total_items: int, successful_items: int, failed_items: int, total_token_usage: TokenUsage | None, total_completion_cost: float | None, timing_stats: EvalTimingStats, started_at: datetime, completed_at: datetime, errors: list[tuple[int, str]] = list(), experiment_name: str | None = None, experiment_dir: Path | None = None)
```
Complete result from an evaluation run.
get_scores¶

get_reports¶

```python
get_reports() -> list[EvaluationReport | EnsembleEvaluationReport]
```

filter_successful¶

```python
filter_successful() -> list[ItemResult]
```

filter_failed¶

```python
filter_failed() -> list[ItemResult]
```
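As a quick sketch of inspecting a finished run (assuming `result` is the EvalResult returned by `evaluate()` or `EvalRunner.run()`):

```python
# Split the run into successes and failures.
ok = result.filter_successful()
bad = result.filter_failed()
print(f"{len(ok)} succeeded, {len(bad)} failed of {result.total_items}")

# `errors` pairs each failed item index with its error message.
for idx, message in result.errors:
    print(f"item {idx}: {message}")

# Per-item reports for downstream analysis.
reports = result.get_reports()
```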
compute_metrics¶

```python
compute_metrics(dataset: RubricDataset, *, bootstrap: bool = False, n_bootstrap: int = 1000, per_judge: bool = False, cannot_assess: Literal['exclude', 'as_unmet'] = 'exclude', na_mode: Literal['exclude', 'as_worst'] = 'exclude', confidence_level: float = 0.95, seed: int | None = None) -> 'MetricsResult'
```
Compute comprehensive evaluation metrics against ground truth.
This method compares predicted verdicts and scores against ground truth from the dataset, computing criterion-level agreement metrics, score correlations, and bias analysis.
If the evaluation result does not contain all items from the dataset, metrics are computed only for the intersection, and a warning is included in the result.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `dataset` | `RubricDataset` | The dataset with ground truth labels. |
| `bootstrap` | `bool` | If True, compute bootstrap confidence intervals (expensive). |
| `n_bootstrap` | `int` | Number of bootstrap samples if bootstrap=True. |
| `per_judge` | `bool` | If True and ensemble, compute per-judge metrics. |
| `cannot_assess` | `Literal['exclude', 'as_unmet']` | How to handle CANNOT_ASSESS verdicts: "exclude" skips pairs where either is CANNOT_ASSESS (default); "as_unmet" treats CANNOT_ASSESS as UNMET. |
| `na_mode` | `Literal['exclude', 'as_worst']` | How to handle NA options in multi-choice criteria: "exclude" skips pairs where either is NA (default); "as_worst" keeps NA in metrics computation. |
| `confidence_level` | `float` | Confidence level for bootstrap CIs (default 0.95). |
| `seed` | `int \| None` | Random seed for bootstrap reproducibility. |

| RETURNS | DESCRIPTION |
|---|---|
| `MetricsResult` | MetricsResult with comprehensive metrics. Use .summary() for formatted output or .to_dataframe() for export. |
Example
```python
result = await evaluate(dataset, grader)
metrics = result.compute_metrics(dataset)
print(metrics.summary())
print(f"Accuracy: {metrics.criterion_accuracy:.1%}")
df = metrics.to_dataframe()
```
from_experiment classmethod¶

```python
from_experiment(experiment_path: Path | str) -> EvalResult
```
Load EvalResult from a completed experiment directory.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `experiment_path` | `Path \| str` | Path to the experiment directory. |

| RETURNS | DESCRIPTION |
|---|---|
| `EvalResult` | EvalResult with loaded item results and statistics. |

| RAISES | DESCRIPTION |
|---|---|
| `FileNotFoundError` | If experiment directory doesn't exist. |
| `ValueError` | If manifest is invalid or experiment is incomplete. |
ItemResult¶
Result for a single evaluated item.
ItemResult dataclass¶

```python
ItemResult(item_idx: int, item: DataItem, report: EvaluationReport | EnsembleEvaluationReport, duration_seconds: float, error: str | None = None)
```
Result for a single evaluated item.
to_dict¶
Serialize to dictionary for JSON storage.
from_dict classmethod¶

```python
from_dict(data: dict[str, Any], item: DataItem) -> ItemResult
```
Deserialize from dictionary.
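A minimal sketch of the round trip, assuming `item_result` is an existing `ItemResult` (the variable name is illustrative):

```python
# Serialize for JSON storage, then rebuild; from_dict needs the original DataItem back.
payload = item_result.to_dict()
restored = ItemResult.from_dict(payload, item_result.item)
assert restored.item_idx == item_result.item_idx
```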
EvalTimingStats¶
Timing statistics for an evaluation run.
EvalTimingStats dataclass¶

```python
EvalTimingStats(total_duration_seconds: float, mean_item_duration_seconds: float, min_item_duration_seconds: float, max_item_duration_seconds: float, p50_item_duration_seconds: float, p95_item_duration_seconds: float, items_per_second: float)
```
Timing statistics for the evaluation run.
from_durations classmethod¶

```python
from_durations(durations: list[float], total_duration: float) -> EvalTimingStats
```
Compute timing stats from a list of item durations.
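For instance, given per-item wall-clock durations and the overall batch duration in seconds (the values below are illustrative, and the import path is assumed to be `autorubric.eval` like the other classes on this page):

```python
from autorubric.eval import EvalTimingStats

# Four item durations graded concurrently within a 3-second batch (illustrative numbers).
durations = [1.2, 0.8, 2.5, 1.1]
stats = EvalTimingStats.from_durations(durations, total_duration=3.0)
print(f"{stats.items_per_second:.2f} items/s, p95 = {stats.p95_item_duration_seconds:.2f}s")
```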
to_dict¶
Serialize to dictionary.
ExperimentManifest¶
Metadata for a saved experiment.
ExperimentManifest dataclass¶

```python
ExperimentManifest(experiment_name: str, created_at: datetime, dataset_name: str | None, dataset_hash: str, total_items: int, status: Literal['running', 'completed', 'failed'], completed_indices: set[int], error: str | None = None, started_at: datetime | None = None, completed_at: datetime | None = None, total_duration_seconds: float | None = None, dataset_path: str | None = None, grader_config: dict[str, Any] | None = None, eval_config: dict[str, Any] | None = None)
```
Manifest for experiment checkpointing.
Contains metadata about an evaluation run for reproducibility and resumption.
to_dict¶
Serialize to dictionary for JSON storage.
from_dict classmethod¶

```python
from_dict(data: dict[str, Any]) -> ExperimentManifest
```
Deserialize from dictionary.
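A hedged sketch of inspecting a checkpoint by hand, assuming the manifest is the `manifest.json` written into the experiment directory (as in the checkpointing example above) and that `ExperimentManifest` is importable from `autorubric.eval`:

```python
import json
from pathlib import Path

from autorubric.eval import ExperimentManifest

# Load the manifest left by a previous run and check its checkpoint state.
manifest_path = Path("experiments/my-essay-eval/manifest.json")
manifest = ExperimentManifest.from_dict(json.loads(manifest_path.read_text()))
print(manifest.status, f"{len(manifest.completed_indices)}/{manifest.total_items} items done")
```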
References¶
Casabianca, J., McCaffrey, D. F., Johnson, M. S., Alper, N., and Zubenko, V. (2025). Validity Arguments For Constructed Response Scoring Using Generative Artificial Intelligence Applications. arXiv:2501.02334.