Eval Runner¶
High-throughput batch evaluation with checkpointing, resumption, and timing statistics.
Overview¶
EvalRunner and the evaluate() convenience function provide infrastructure for evaluating datasets at scale. Features include parallel execution with rate limiting, progress tracking, automatic checkpointing for long-running jobs, and comprehensive timing/cost statistics.
Research Background
Casabianca et al. (2025) recommend maintaining a "gold set" of human-graded examples and sampling 1-5% of production traffic for continuous validation. EvalRunner provides the infrastructure for systematic evaluation with checkpointing for long-running jobs and cost tracking for budget management.
Quick Example¶
```python
import asyncio

from autorubric import RubricDataset, LLMConfig, evaluate
from autorubric.graders import CriterionGrader


async def main():
    dataset = RubricDataset.from_file("essays.json")
    grader = CriterionGrader(
        llm_config=LLMConfig(
            model="openai/gpt-4.1-mini",
            max_parallel_requests=10,
        )
    )
    result = await evaluate(dataset, grader, show_progress=True)
    print(f"Evaluated {result.successful_items}/{result.total_items}")
    print(f"Throughput: {result.timing_stats.items_per_second:.2f} items/s")
    print(f"Total cost: ${result.total_completion_cost:.4f}")


asyncio.run(main())
```
Checkpointing and Resumption¶
```python
from autorubric import EvalRunner, EvalConfig, EvalResult

# First run (may be interrupted)
config = EvalConfig(
    experiment_name="my-essay-eval",
    experiments_dir="./experiments",
    show_progress=True,
)
runner = EvalRunner(dataset=dataset, grader=grader, config=config)
result = await runner.run()
# Saves to: experiments/my-essay-eval/manifest.json + items.jsonl

# Resume after crash
runner = EvalRunner(dataset=dataset, grader=grader, config=config)
result = await runner.run()  # Skips already-completed items

# Load results later
result = EvalResult.from_experiment("experiments/my-essay-eval")
```
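A loaded or freshly completed result exposes aggregate counts, latency percentiles, and cost, so a quick post-run summary needs no extra bookkeeping. A minimal sketch, assuming the experiment above has completed and using only the documented `EvalResult` and `EvalTimingStats` fields:

```python
result = EvalResult.from_experiment("experiments/my-essay-eval")

# Aggregate counts and per-item latency percentiles from the run
print(f"Completed: {result.successful_items}/{result.total_items} "
      f"({result.failed_items} failed)")
print(f"p50 item latency: {result.timing_stats.p50_item_duration_seconds:.1f}s")
print(f"p95 item latency: {result.timing_stats.p95_item_duration_seconds:.1f}s")

# total_completion_cost is typed float | None, so guard before formatting
if result.total_completion_cost is not None:
    print(f"Total cost: ${result.total_completion_cost:.4f}")
```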
Rate Limiting¶
```python
from autorubric import LLMConfig
from autorubric.graders import CriterionGrader, JudgeSpec

grader = CriterionGrader(
    judges=[
        JudgeSpec(LLMConfig(model="openai/gpt-4.1", max_parallel_requests=10), "gpt"),
        JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929", max_parallel_requests=5), "claude"),
    ],
    aggregation="majority",
)
```

Rate limiting uses a global per-provider semaphore, so all `openai/*` models share the same limit.
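The per-provider request limit bounds concurrent LLM calls, while `max_concurrent_items` bounds how many dataset items are graded at once. A minimal sketch combining the two through the `evaluate()` convenience function, reusing the dataset and grader from the example above (the value 50 is illustrative):

```python
# Cap in-flight items independently of the per-provider request limit;
# useful for bounding memory on very large datasets.
result = await evaluate(
    dataset,
    grader,
    max_concurrent_items=50,
    show_progress=True,
)
```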
evaluate¶
Convenience function for batch evaluation.
evaluate async ¶

```python
evaluate(
    dataset: RubricDataset,
    grader: Grader,
    *,
    fail_fast: bool = False,
    show_progress: bool = True,
    progress_style: Literal['simple', 'detailed'] = 'simple',
    max_concurrent_items: int | None = None,
    experiment_name: str | None = None,
    experiments_dir: Path | str = 'experiments',
    resume: bool = True,
) -> EvalResult
```
Evaluate a dataset with a grader.
Convenience wrapper around EvalRunner.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `dataset` | `RubricDataset` | The dataset to evaluate. |
| `grader` | `Grader` | The grader to use. |
| `fail_fast` | `bool` | Stop on first error if True. |
| `show_progress` | `bool` | Display progress bars if True. |
| `progress_style` | `Literal['simple', 'detailed']` | "simple" or "detailed" progress display. |
| `max_concurrent_items` | `int \| None` | Limit concurrent items (None = unlimited). |
| `experiment_name` | `str \| None` | Name for this experiment run. |
| `experiments_dir` | `Path \| str` | Root directory for experiment outputs. |
| `resume` | `bool` | If True and experiment exists, resume from checkpoint. |

| RETURNS | DESCRIPTION |
|---|---|
| `EvalResult` | EvalResult with all results and aggregated statistics. |
Example

```python
from autorubric.eval import evaluate

result = await evaluate(dataset, grader, show_progress=True)
print(f"Evaluated {result.successful_items}/{result.total_items}")
```
EvalRunner¶
Runner class for batch evaluation with checkpointing.
EvalRunner ¶

```python
EvalRunner(dataset: RubricDataset, grader: Grader, config: EvalConfig | None = None)
```
Runs batch evaluations with rate limiting and progress tracking.
This class orchestrates the evaluation of a RubricDataset using a grader, handling:

- Concurrent execution with configurable parallelism
- Rate limiting via LLMConfig.max_parallel_requests
- Progress display with rich progress bars
- Checkpointing and resumption from failures
- Result aggregation with timing statistics
Example

```python
from autorubric import LLMConfig, RubricDataset
from autorubric.graders import CriterionGrader
from autorubric.eval import EvalRunner, EvalConfig

dataset = RubricDataset.from_file("data.json")
grader = CriterionGrader(
    llm_config=LLMConfig(
        model="openai/gpt-4",
        max_parallel_requests=10,
    )
)

runner = EvalRunner(dataset=dataset, grader=grader)
result = await runner.run()
print(f"Evaluated {result.successful_items}/{result.total_items}")
```
Initialize the evaluation runner.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `dataset` | `RubricDataset` | The dataset to evaluate. |
| `grader` | `Grader` | The grader to use for evaluation. |
| `config` | `EvalConfig \| None` | Optional configuration. Uses defaults if not provided. |
run async ¶

```python
run() -> EvalResult
```
Run the evaluation and return aggregated results.
| RETURNS | DESCRIPTION |
|---|---|
| `EvalResult` | EvalResult with all item results, aggregated usage/cost, and timing statistics. |
| RAISES | DESCRIPTION |
|---|---|
| `RuntimeError` | If fail_fast=True and any item fails. |
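With `fail_fast=True`, the first grading error aborts the run with a RuntimeError. A minimal sketch of handling that, assuming the dataset and grader from the earlier examples:

```python
from autorubric import EvalRunner, EvalConfig

runner = EvalRunner(
    dataset=dataset,
    grader=grader,
    config=EvalConfig(fail_fast=True),
)
try:
    result = await runner.run()
except RuntimeError as exc:
    # Raised as soon as any item fails when fail_fast=True
    print(f"Evaluation aborted: {exc}")
```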
EvalConfig¶
Configuration options for evaluation runs.
EvalConfig dataclass ¶

```python
EvalConfig(
    fail_fast: bool = False,
    show_progress: bool = True,
    progress_style: Literal['simple', 'detailed'] = 'simple',
    max_concurrent_items: int | None = None,
    experiment_name: str | None = None,
    experiments_dir: Path | str = 'experiments',
    resume: bool = True,
)
```
Configuration for evaluation runs.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `fail_fast` | `bool` | If True, stop on first error. Default False continues all items. |
| `show_progress` | `bool` | Whether to display progress bars. Default True. |
| `progress_style` | `Literal['simple', 'detailed']` | Style of progress display: "simple" shows a single overall progress bar; "detailed" shows per-judge progress for ensemble mode. |
| `max_concurrent_items` | `int \| None` | Maximum items to grade concurrently. None = grade all items in parallel (default). Set this to limit memory usage for very large datasets. |
| `experiment_name` | `str \| None` | Name for this experiment run. If None, auto-generates using coolname. |
| `experiments_dir` | `Path \| str` | Root directory for experiment outputs. Default is "./experiments". |
| `resume` | `bool` | If True and experiment exists, resume from checkpoint. Default True. |
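The sketch below configures a run that trades some throughput for bounded memory and richer progress output. All fields are documented above; the specific values and experiment name are illustrative.

```python
from autorubric import EvalConfig, EvalRunner

config = EvalConfig(
    experiment_name="essay-eval-ensemble",  # illustrative name
    progress_style="detailed",              # per-judge bars in ensemble mode
    max_concurrent_items=100,               # bound memory on large datasets
    resume=True,                            # pick up where a crash left off
)
runner = EvalRunner(dataset=dataset, grader=grader, config=config)
result = await runner.run()
```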
EvalResult¶
Results from a completed evaluation run.
EvalResult dataclass ¶

```python
EvalResult(
    item_results: list[ItemResult],
    total_items: int,
    successful_items: int,
    failed_items: int,
    total_token_usage: TokenUsage | None,
    total_completion_cost: float | None,
    timing_stats: EvalTimingStats,
    started_at: datetime,
    completed_at: datetime,
    errors: list[tuple[int, str]] = list(),
    experiment_name: str | None = None,
    experiment_dir: Path | None = None,
)
```
Complete result from an evaluation run.
get_scores ¶

get_reports ¶

```python
get_reports() -> list[EvaluationReport | EnsembleEvaluationReport]
```

filter_successful ¶

```python
filter_successful() -> list[ItemResult]
```

filter_failed ¶

```python
filter_failed() -> list[ItemResult]
```
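A short post-run triage sketch built on the accessors above; it relies only on the documented `ItemResult` fields (`item_idx`, `error`, `duration_seconds`) and the `result` from the quick example.

```python
# Per-item evaluation reports from the run
reports = result.get_reports()

# Inspect failures without re-running the whole evaluation
for item_result in result.filter_failed():
    print(f"Item {item_result.item_idx} failed after "
          f"{item_result.duration_seconds:.1f}s: {item_result.error}")
```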
compute_metrics ¶

```python
compute_metrics(
    dataset: RubricDataset,
    *,
    bootstrap: bool = False,
    n_bootstrap: int = 1000,
    per_judge: bool = False,
    cannot_assess: Literal['exclude', 'as_unmet'] = 'exclude',
    na_mode: Literal['exclude', 'as_worst'] = 'exclude',
    confidence_level: float = 0.95,
    seed: int | None = None,
) -> 'MetricsResult'
```
Compute comprehensive evaluation metrics against ground truth.
This method compares predicted verdicts and scores against ground truth from the dataset, computing criterion-level agreement metrics, score correlations, and bias analysis.
If the evaluation result does not contain all items from the dataset, metrics are computed only for the intersection, and a warning is included in the result.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `dataset` | `RubricDataset` | The dataset with ground truth labels. |
| `bootstrap` | `bool` | If True, compute bootstrap confidence intervals (expensive). |
| `n_bootstrap` | `int` | Number of bootstrap samples if bootstrap=True. |
| `per_judge` | `bool` | If True and ensemble, compute per-judge metrics. |
| `cannot_assess` | `Literal['exclude', 'as_unmet']` | How to handle CANNOT_ASSESS verdicts: "exclude" skips pairs where either is CA (default); "as_unmet" treats CA as UNMET. |
| `na_mode` | `Literal['exclude', 'as_worst']` | How to handle NA options in multi-choice criteria: "exclude" skips pairs where either is NA (default); "as_worst" keeps NA in metrics computation. |
| `confidence_level` | `float` | Confidence level for bootstrap CIs (default 0.95). |
| `seed` | `int \| None` | Random seed for bootstrap reproducibility. |
| RETURNS | DESCRIPTION |
|---|---|
| `MetricsResult` | MetricsResult with comprehensive metrics. Use .summary() for formatted output or .to_dataframe() for export. |
Example

```python
result = await evaluate(dataset, grader)
metrics = result.compute_metrics(dataset)
print(metrics.summary())
print(f"Accuracy: {metrics.criterion_accuracy:.1%}")
df = metrics.to_dataframe()
```
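The same call accepts the bootstrap and per-judge options documented above when confidence intervals or a per-judge breakdown are needed. A sketch with illustrative values, assuming an ensemble grader:

```python
metrics = result.compute_metrics(
    dataset,
    bootstrap=True,         # compute bootstrap confidence intervals (slower)
    n_bootstrap=2000,
    per_judge=True,         # per-judge metrics for ensemble graders
    confidence_level=0.95,
    seed=42,                # reproducible resampling
)
print(metrics.summary())
```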
from_experiment classmethod ¶

```python
from_experiment(experiment_path: Path | str) -> EvalResult
```
Load EvalResult from a completed experiment directory.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `experiment_path` | `Path \| str` | Path to the experiment directory. |
| RETURNS | DESCRIPTION |
|---|---|
| `EvalResult` | EvalResult with loaded item results and statistics. |
| RAISES | DESCRIPTION |
|---|---|
| `FileNotFoundError` | If experiment directory doesn't exist. |
| `ValueError` | If manifest is invalid or experiment is incomplete. |
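A defensive loading sketch built on the documented exceptions; the directory path is the one used in the checkpointing example above.

```python
from pathlib import Path

from autorubric import EvalResult

experiment_dir = Path("experiments/my-essay-eval")
try:
    result = EvalResult.from_experiment(experiment_dir)
except FileNotFoundError:
    print(f"No experiment found at {experiment_dir}")
except ValueError as exc:
    # Manifest is invalid or the run never completed
    print(f"Experiment could not be loaded: {exc}")
```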
ItemResult¶
Result for a single evaluated item.
ItemResult dataclass ¶

```python
ItemResult(
    item_idx: int,
    item: DataItem,
    report: EvaluationReport | EnsembleEvaluationReport,
    duration_seconds: float,
    error: str | None = None,
)
```
Result for a single evaluated item.
to_dict ¶

Serialize to dictionary for JSON storage.
from_dict classmethod ¶

```python
from_dict(data: dict[str, Any], item: DataItem) -> ItemResult
```
Deserialize from dictionary.
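A round-trip sketch using the two methods above. It assumes an `item_result` from a finished run and that `to_dict()` takes no arguments; the matching `DataItem` must be kept around, since `from_dict` needs it to rebuild the object.

```python
import json

# Serialize one item result for JSON storage
payload = json.dumps(item_result.to_dict())

# Later: rebuild the ItemResult, supplying the matching DataItem
restored = ItemResult.from_dict(json.loads(payload), item=item_result.item)
assert restored.item_idx == item_result.item_idx
```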
EvalTimingStats¶
Timing statistics for an evaluation run.
EvalTimingStats dataclass ¶

```python
EvalTimingStats(
    total_duration_seconds: float,
    mean_item_duration_seconds: float,
    min_item_duration_seconds: float,
    max_item_duration_seconds: float,
    p50_item_duration_seconds: float,
    p95_item_duration_seconds: float,
    items_per_second: float,
)
```
Timing statistics for the evaluation run.
from_durations classmethod ¶

```python
from_durations(durations: list[float], total_duration: float) -> EvalTimingStats
```
Compute timing stats from a list of item durations.
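A small sketch with made-up durations, showing how the aggregate statistics are derived from per-item timings; because items run concurrently, the wall-clock total can be shorter than the sum of the durations.

```python
# Per-item grading times in seconds (illustrative values)
durations = [1.8, 2.4, 0.9, 3.1, 2.2]

# Items ran concurrently, so total_duration < sum(durations)
stats = EvalTimingStats.from_durations(durations, total_duration=4.0)
print(f"p95 latency: {stats.p95_item_duration_seconds:.2f}s")
print(f"Throughput: {stats.items_per_second:.2f} items/s")
```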
to_dict ¶

Serialize to dictionary.
ExperimentManifest¶
Metadata for a saved experiment.
ExperimentManifest dataclass ¶

```python
ExperimentManifest(
    experiment_name: str,
    created_at: datetime,
    dataset_name: str | None,
    dataset_hash: str,
    total_items: int,
    status: Literal['running', 'completed', 'failed'],
    completed_indices: set[int],
    error: str | None = None,
    started_at: datetime | None = None,
    completed_at: datetime | None = None,
    total_duration_seconds: float | None = None,
    dataset_path: str | None = None,
    grader_config: dict[str, Any] | None = None,
    eval_config: dict[str, Any] | None = None,
)
```
Manifest for experiment checkpointing.
Contains metadata about an evaluation run for reproducibility and resumption.
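A sketch for checking a checkpoint on disk without loading the full result set. It assumes the manifest.json written by the checkpointing example round-trips through from_dict, and the import location of ExperimentManifest is an assumption; adjust it to wherever the class is exposed in your install.

```python
import json
from pathlib import Path

from autorubric.eval import ExperimentManifest  # import path assumed

manifest_path = Path("experiments/my-essay-eval/manifest.json")
manifest = ExperimentManifest.from_dict(json.loads(manifest_path.read_text()))

print(f"Status: {manifest.status}")
print(f"Progress: {len(manifest.completed_indices)}/{manifest.total_items} items")
```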
to_dict ¶

Serialize to dictionary for JSON storage.
from_dict classmethod ¶

```python
from_dict(data: dict[str, Any]) -> ExperimentManifest
```
Deserialize from dictionary.
References¶
Casabianca, J., McCaffrey, D. F., Johnson, M. S., Alper, N., and Zubenko, V. (2025). Validity Arguments For Constructed Response Scoring Using Generative Artificial Intelligence Applications. arXiv:2501.02334.