Eval Runner
High-throughput batch evaluation with checkpointing, resumption, and timing statistics.
Overview
EvalRunner and the evaluate() convenience function provide infrastructure for evaluating datasets at scale. Features include parallel execution with rate limiting, progress tracking, automatic checkpointing for long-running jobs, and comprehensive timing/cost statistics.
Research Background
Casabianca et al. (2025) recommend maintaining a "gold set" of human-graded examples and sampling 1-5% of production traffic for continuous validation. EvalRunner provides the infrastructure for systematic evaluation with checkpointing for long-running jobs and cost tracking for budget management.
Quick Example
import asyncio

from autorubric import RubricDataset, LLMConfig, evaluate
from autorubric.graders import CriterionGrader


async def main():
    dataset = RubricDataset.from_file("essays.json")
    grader = CriterionGrader(
        llm_config=LLMConfig(
            model="openai/gpt-4.1-mini",
            max_parallel_requests=10,
        )
    )
    result = await evaluate(dataset, grader, show_progress=True)
    print(f"Evaluated {result.successful_items}/{result.total_items}")
    print(f"Throughput: {result.timing_stats.items_per_second:.2f} items/s")
    print(f"Total cost: ${result.total_completion_cost or 0:.4f}")


# Run the coroutine from synchronous code
asyncio.run(main())
Checkpointing and Resumption
from autorubric import EvalRunner, EvalConfig, EvalResult
# Reuses dataset and grader from the Quick Example; run inside an async function
# First run (may be interrupted)
config = EvalConfig(
    experiment_name="my-essay-eval",
    experiments_dir="./experiments",
    show_progress=True,
)
runner = EvalRunner(dataset=dataset, grader=grader, config=config)
result = await runner.run()
# Saves to: experiments/my-essay-eval/manifest.json + items.jsonl
# Resume after crash
runner = EvalRunner(dataset=dataset, grader=grader, config=config)
result = await runner.run() # Skips already-completed items
# Load results later
result = EvalResult.from_experiment("experiments/my-essay-eval")
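The resume behaviour can be pictured with a stdlib-only sketch. This is not autorubric's implementation: the file names follow the comment in the example above, but the helper `run_with_checkpoint`, the `grade` callable, and the JSON fields are assumptions made for illustration.

```python
import json
from pathlib import Path


def run_with_checkpoint(items, grade, experiment_dir):
    """Grade items, skipping indices already recorded in the manifest.

    Hypothetical helper: mimics a manifest.json + items.jsonl layout,
    not the actual autorubric on-disk format.
    """
    exp = Path(experiment_dir)
    exp.mkdir(parents=True, exist_ok=True)
    manifest_path = exp / "manifest.json"
    items_path = exp / "items.jsonl"

    # Load the set of finished indices if a previous run left a manifest.
    done = set()
    if manifest_path.exists():
        done = set(json.loads(manifest_path.read_text())["completed_indices"])

    graded_now = []
    for idx, item in enumerate(items):
        if idx in done:
            continue  # completed in an earlier run
        score = grade(item)
        # Append the result first, then record the index as completed.
        with items_path.open("a") as f:
            f.write(json.dumps({"item_idx": idx, "score": score}) + "\n")
        done.add(idx)
        manifest_path.write_text(json.dumps({"completed_indices": sorted(done)}))
        graded_now.append(idx)
    return graded_now
```

Calling the helper a second time with the same directory grades only the indices that are missing from the manifest, which is the behaviour the "Resume after crash" step relies on.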
Rate Limiting
from autorubric import LLMConfig
from autorubric.graders import CriterionGrader, JudgeSpec
grader = CriterionGrader(
    judges=[
        JudgeSpec(LLMConfig(model="openai/gpt-4.1", max_parallel_requests=10), "gpt"),
        JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929", max_parallel_requests=5), "claude"),
    ],
    aggregation="majority",
)
Rate limiting uses a global per-provider semaphore, so all openai/* models share the same limit.
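A minimal sketch of what a per-provider semaphore could look like, using only asyncio. The registry `_provider_semaphores`, the helpers `provider_of` and `get_semaphore`, and the rule that the first caller fixes the limit are all assumptions for illustration; the library's internals may differ.

```python
import asyncio

# Hypothetical registry: one semaphore per provider prefix.
_provider_semaphores: dict[str, asyncio.Semaphore] = {}


def provider_of(model: str) -> str:
    """Return the prefix before the first slash, e.g. 'openai/gpt-4.1' gives 'openai'."""
    return model.split("/", 1)[0]


def get_semaphore(model: str, max_parallel_requests: int) -> asyncio.Semaphore:
    """Return the semaphore shared by every model of the same provider.

    Assumption: the first caller for a provider fixes the limit.
    """
    provider = provider_of(model)
    if provider not in _provider_semaphores:
        _provider_semaphores[provider] = asyncio.Semaphore(max_parallel_requests)
    return _provider_semaphores[provider]


async def call_model(model: str, limit: int, tracker: dict) -> None:
    """Stand-in for an LLM request; records peak concurrency per provider."""
    provider = provider_of(model)
    async with get_semaphore(model, limit):
        active = tracker["active"].get(provider, 0) + 1
        tracker["active"][provider] = active
        tracker["peak"][provider] = max(tracker["peak"].get(provider, 0), active)
        await asyncio.sleep(0.01)  # simulated network latency
        tracker["active"][provider] -= 1


async def demo() -> dict:
    _provider_semaphores.clear()  # fresh semaphores for this event loop
    tracker = {"active": {}, "peak": {}}
    calls = [call_model("openai/gpt-4.1", 2, tracker) for _ in range(5)]
    calls += [call_model("openai/gpt-4.1-mini", 2, tracker) for _ in range(5)]
    await asyncio.gather(*calls)
    return tracker["peak"]
```

Running `asyncio.run(demo())` returns `{"openai": 2}`: ten requests across two openai models never exceed two in flight, because both models resolve to the same semaphore.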
evaluate
Convenience function for batch evaluation.
evaluate (async)
evaluate(dataset: RubricDataset, grader: Grader, *, fail_fast: bool = False, show_progress: bool = True, progress_style: Literal['simple', 'detailed'] = 'simple', max_concurrent_items: int | None = None, experiment_name: str | None = None, experiments_dir: Path | str = 'experiments', resume: bool = True) -> EvalResult
Evaluate a dataset with a grader.
Convenience wrapper around EvalRunner.
| PARAMETER | DESCRIPTION |
|---|---|
| dataset | The dataset to evaluate. TYPE: RubricDataset |
| grader | The grader to use. TYPE: Grader |
| fail_fast | Stop on first error if True. TYPE: bool |
| show_progress | Display progress bars if True. TYPE: bool |
| progress_style | "simple" or "detailed" progress display. TYPE: Literal['simple', 'detailed'] |
| max_concurrent_items | Limit concurrent items (None = unlimited). TYPE: int \| None |
| experiment_name | Name for this experiment run. TYPE: str \| None |
| experiments_dir | Root directory for experiment outputs. TYPE: Path \| str |
| resume | If True and experiment exists, resume from checkpoint. TYPE: bool |

| RETURNS | DESCRIPTION |
|---|---|
| EvalResult | EvalResult with all results and aggregated statistics. |
Example
from autorubric.eval import evaluate
result = await evaluate(dataset, grader, show_progress=True)
print(f"Evaluated {result.successful_items}/{result.total_items}")
Source code in src/autorubric/eval.py
EvalRunner
Runner class for batch evaluation with checkpointing.
EvalRunner
EvalRunner(dataset: RubricDataset, grader: Grader, config: EvalConfig | None = None)
Runs batch evaluations with rate limiting and progress tracking.
This class orchestrates the evaluation of a RubricDataset using a grader, handling:
- Concurrent execution with configurable parallelism
- Rate limiting via LLMConfig.max_parallel_requests
- Progress display with rich progress bars
- Checkpointing and resumption from failures
- Result aggregation with timing statistics
Example
from autorubric import LLMConfig, RubricDataset
from autorubric.graders import CriterionGrader
from autorubric.eval import EvalRunner, EvalConfig

dataset = RubricDataset.from_file("data.json")
grader = CriterionGrader(
    llm_config=LLMConfig(
        model="openai/gpt-4",
        max_parallel_requests=10,
    )
)

runner = EvalRunner(dataset=dataset, grader=grader)
result = await runner.run()
print(f"Evaluated {result.successful_items}/{result.total_items}")
Initialize the evaluation runner.
| PARAMETER | DESCRIPTION |
|---|---|
| dataset | The dataset to evaluate. TYPE: RubricDataset |
| grader | The grader to use for evaluation. TYPE: Grader |
| config | Optional configuration. Uses defaults if not provided. TYPE: EvalConfig \| None |
Source code in src/autorubric/eval.py
run (async)
run() -> EvalResult
Run the evaluation and return aggregated results.
| RETURNS | DESCRIPTION |
|---|---|
| EvalResult | EvalResult with all item results, aggregated usage/cost, and timing statistics. |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If fail_fast=True and any item fails. |
Source code in src/autorubric/eval.py
EvalConfig
Configuration options for evaluation runs.
EvalConfig (dataclass)
EvalConfig(fail_fast: bool = False, show_progress: bool = True, progress_style: Literal['simple', 'detailed'] = 'simple', max_concurrent_items: int | None = None, experiment_name: str | None = None, experiments_dir: Path | str = 'experiments', resume: bool = True)
Configuration for evaluation runs.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| fail_fast | If True, stop on first error. The default (False) continues through all items. TYPE: bool |
| show_progress | Whether to display progress bars. Default True. TYPE: bool |
| progress_style | Style of progress display. "simple": single overall progress bar. "detailed": shows per-judge progress for ensemble mode. TYPE: Literal['simple', 'detailed'] |
| max_concurrent_items | Maximum items to grade concurrently. None = grade all items in parallel (default). Set this to limit memory usage for very large datasets. TYPE: int \| None |
| experiment_name | Name for this experiment run. If None, a name is auto-generated using coolname. TYPE: str \| None |
| experiments_dir | Root directory for experiment outputs. Default is "./experiments". TYPE: Path \| str |
| resume | If True and experiment exists, resume from checkpoint. Default True. TYPE: bool |
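How max_concurrent_items and fail_fast might interact can be sketched with plain asyncio. The helper `grade_all` and its return shape are assumptions for illustration and do not reproduce EvalRunner; the (index, message) error tuples mirror the errors field of EvalResult.

```python
import asyncio


async def grade_all(items, grade, *, max_concurrent_items=None, fail_fast=False):
    """Grade items concurrently and return (results, errors).

    Hypothetical helper. errors is a list of (item_idx, message) tuples.
    """
    # None means "no cap": size the semaphore to the whole dataset.
    limit = len(items) if max_concurrent_items is None else max_concurrent_items
    semaphore = asyncio.Semaphore(limit)
    results = {}
    errors = []

    async def worker(idx, item):
        async with semaphore:
            try:
                results[idx] = await grade(item)
            except Exception as exc:
                if fail_fast:
                    # Abort the whole run on the first failure.
                    raise RuntimeError(f"item {idx} failed: {exc}") from exc
                # Otherwise record the failure and keep going.
                errors.append((idx, str(exc)))

    await asyncio.gather(*(worker(i, it) for i, it in enumerate(items)))
    return results, errors
```

With fail_fast left at False, a failing item only adds an entry to the error list; with fail_fast=True the first failure surfaces as a RuntimeError, matching the RAISES entry documented for run().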
EvalResult
Results from a completed evaluation run.
EvalResult (dataclass)
EvalResult(item_results: list[ItemResult], total_items: int, successful_items: int, failed_items: int, total_token_usage: TokenUsage | None, total_completion_cost: float | None, timing_stats: EvalTimingStats, started_at: datetime, completed_at: datetime, errors: list[tuple[int, str]] = list(), experiment_name: str | None = None, experiment_dir: Path | None = None)
Complete result from an evaluation run.
get_scores
Extract scores from all successful results.
A grade-FAILURE has no score (report.score is None); such results are
skipped. This subsumes the item-level error filter and also drops a
report-level error that carried no item-level error.
Source code in src/autorubric/eval.py
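The filtering rule described for get_scores can be illustrated with stand-in types. `FakeReport` and `FakeItemResult` are hypothetical and carry only the fields needed here; the real method's signature is not shown on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeReport:
    score: Optional[float]  # None marks a grade failure


@dataclass
class FakeItemResult:
    report: FakeReport
    error: Optional[str] = None


def get_scores(item_results):
    """Keep only scores that exist; failed grades (score is None) are skipped."""
    return [r.report.score for r in item_results if r.report.score is not None]
```

Checking `score is not None` rather than truthiness matters: a legitimate score of 0.0 is kept, while a failed grade is dropped whether or not an item-level error string was recorded.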
get_reports
get_reports() -> list[EvaluationReport | EnsembleEvaluationReport]
filter_successful
filter_successful() -> list[ItemResult]
filter_failed
filter_failed() -> list[ItemResult]
compute_metrics
compute_metrics(dataset: RubricDataset, *, bootstrap: bool = False, n_bootstrap: int = 1000, per_judge: bool = False, cannot_assess: CannotAssessMode = 'exclude', na_mode: NAMode = 'exclude', confidence_level: float = 0.95, seed: int | None = None) -> MetricsResult
Compute comprehensive evaluation metrics against ground truth.
This method compares predicted verdicts and scores against ground truth from the dataset, computing criterion-level agreement metrics, score correlations, and bias analysis.
If this EvalResult does not contain all items from the dataset, metrics are computed only for the intersection, and a warning is included in the result.
| PARAMETER | DESCRIPTION |
|---|---|
| dataset | The dataset with ground truth labels. TYPE: RubricDataset |
| bootstrap | If True, compute bootstrap confidence intervals (expensive). TYPE: bool |
| n_bootstrap | Number of bootstrap samples if bootstrap=True. TYPE: int |
| per_judge | If True and ensemble, compute per-judge metrics. TYPE: bool |
| cannot_assess | How to handle CANNOT_ASSESS verdicts. "exclude": skip pairs where either is CANNOT_ASSESS (default). "as_unmet": treat CANNOT_ASSESS as UNMET. "as_category": keep CANNOT_ASSESS as a distinct third class. TYPE: CannotAssessMode |
| na_mode | How to handle NA options in multi-choice criteria. Mirrors … TYPE: NAMode |
| confidence_level | Confidence level for bootstrap CIs (default 0.95). TYPE: float |
| seed | Random seed for bootstrap reproducibility. TYPE: int \| None |

| RETURNS | DESCRIPTION |
|---|---|
| MetricsResult | MetricsResult with comprehensive metrics. Use .summary() for formatted output or .to_dataframe() for export. |
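The three cannot_assess modes can be sketched on plain verdict strings. The function names `align_verdicts` and `accuracy`, and the "MET" label, are assumptions for illustration; the library works on its own verdict types.

```python
def align_verdicts(predicted, truth, mode="exclude"):
    """Return (predicted, truth) lists after applying a CANNOT_ASSESS policy."""
    if mode not in {"exclude", "as_unmet", "as_category"}:
        raise ValueError(f"unknown mode: {mode}")
    pairs = []
    for p, t in zip(predicted, truth):
        if mode == "exclude" and "CANNOT_ASSESS" in (p, t):
            continue  # skip pairs where either side is CANNOT_ASSESS
        if mode == "as_unmet":
            p = "UNMET" if p == "CANNOT_ASSESS" else p
            t = "UNMET" if t == "CANNOT_ASSESS" else t
        pairs.append((p, t))  # "as_category" keeps the label untouched
    return [p for p, _ in pairs], [t for _, t in pairs]


def accuracy(predicted, truth):
    """Fraction of aligned pairs that agree; NaN when nothing is left."""
    if not predicted:
        return float("nan")
    return sum(p == t for p, t in zip(predicted, truth)) / len(predicted)
```

The choice of mode changes both the denominator and the agreement count, so the same predictions can yield different accuracy figures under each policy.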
Example
result = await evaluate(dataset, grader)
metrics = result.compute_metrics(dataset)
print(metrics.summary())
print(f"Accuracy: {metrics.criterion_accuracy:.1%}")
df = metrics.to_dataframe()
Source code in src/autorubric/eval.py
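For readers unfamiliar with the bootstrap, seed, n_bootstrap, and confidence_level options, here is a minimal percentile-bootstrap sketch over 0/1 agreement outcomes. The percentile method and the helper name `bootstrap_ci` are assumptions; the library may use a different bootstrap variant.

```python
import random


def bootstrap_ci(correct, n_bootstrap=1000, confidence_level=0.95, seed=None):
    """Percentile bootstrap CI for the mean of a list of 0/1 outcomes."""
    if not correct:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)  # a fixed seed makes the interval reproducible
    n = len(correct)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_bootstrap)
    )
    alpha = 1.0 - confidence_level
    lo_idx = int((alpha / 2) * n_bootstrap)
    hi_idx = min(n_bootstrap - 1, int((1 - alpha / 2) * n_bootstrap))
    return means[lo_idx], means[hi_idx]
```

Each call resamples the whole outcome list n_bootstrap times, which is why the parameter table marks bootstrap intervals as expensive.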
from_experiment (classmethod)
from_experiment(experiment_path: Path | str) -> EvalResult
Load EvalResult from a completed experiment directory.
| PARAMETER | DESCRIPTION |
|---|---|
| experiment_path | Path to the experiment directory. TYPE: Path \| str |

| RETURNS | DESCRIPTION |
|---|---|
| EvalResult | EvalResult with loaded item results and statistics. |

| RAISES | DESCRIPTION |
|---|---|
| FileNotFoundError | If experiment directory doesn't exist. |
| ValueError | If manifest is invalid or experiment is incomplete. |
Source code in src/autorubric/eval.py
ItemResult
Result for a single evaluated item.
ItemResult (dataclass)
ItemResult(item_idx: int, item: DataItem, report: EvaluationReport | EnsembleEvaluationReport, duration_seconds: float, error: str | None = None)
Result for a single evaluated item.
to_dict
Serialize to dictionary for JSON storage.
Source code in src/autorubric/eval.py
from_dict (classmethod)
from_dict(data: dict[str, Any], item: DataItem) -> ItemResult
Deserialize from dictionary.
Source code in src/autorubric/eval.py
EvalTimingStats
Timing statistics for an evaluation run.
EvalTimingStats (dataclass)
EvalTimingStats(total_duration_seconds: float, mean_item_duration_seconds: float, min_item_duration_seconds: float, max_item_duration_seconds: float, p50_item_duration_seconds: float, p95_item_duration_seconds: float, items_per_second: float)
Timing statistics for the evaluation run.
from_durations (classmethod)
from_durations(durations: list[float], total_duration: float) -> EvalTimingStats
Compute timing stats from a list of item durations.
Source code in src/autorubric/eval.py
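One plausible way to derive these fields from raw durations is sketched below. `TimingSketch`, `percentile`, and `timing_from_durations` are hypothetical names; the linear-interpolation percentile and the definition of items_per_second as item count divided by wall-clock time are assumptions, not the library's confirmed behaviour.

```python
import statistics
from dataclasses import dataclass


@dataclass
class TimingSketch:
    total_duration_seconds: float
    mean_item_duration_seconds: float
    min_item_duration_seconds: float
    max_item_duration_seconds: float
    p50_item_duration_seconds: float
    p95_item_duration_seconds: float
    items_per_second: float


def percentile(sorted_values, q):
    """Linear-interpolation percentile of a sorted, non-empty list; q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    spread = sorted_values[upper] - sorted_values[lower]
    return sorted_values[lower] + spread * (pos - lower)


def timing_from_durations(durations, total_duration):
    """Summarise per-item durations plus the wall-clock time of the run."""
    if not durations:
        return TimingSketch(total_duration, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
    ordered = sorted(durations)
    return TimingSketch(
        total_duration_seconds=total_duration,
        mean_item_duration_seconds=statistics.fmean(ordered),
        min_item_duration_seconds=ordered[0],
        max_item_duration_seconds=ordered[-1],
        p50_item_duration_seconds=percentile(ordered, 50),
        p95_item_duration_seconds=percentile(ordered, 95),
        # Wall-clock throughput: parallel runs can exceed 1 / mean duration.
        items_per_second=len(ordered) / total_duration if total_duration > 0 else 0.0,
    )
```

Under this definition, a parallel run can report a throughput higher than the reciprocal of the mean item duration, because items overlap in time.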
to_dict
Serialize to dictionary.
Source code in src/autorubric/eval.py
ExperimentManifest
Metadata for a saved experiment.
ExperimentManifest (dataclass)
ExperimentManifest(experiment_name: str, created_at: datetime, dataset_name: str | None, dataset_hash: str, total_items: int, status: Literal['running', 'completed', 'failed'], completed_indices: set[int], error: str | None = None, started_at: datetime | None = None, completed_at: datetime | None = None, total_duration_seconds: float | None = None, dataset_path: str | None = None, grader_config: dict[str, Any] | None = None, eval_config: dict[str, Any] | None = None)
Manifest for experiment checkpointing.
Contains metadata about an evaluation run for reproducibility and resumption.
to_dict
Serialize to dictionary for JSON storage.
Source code in src/autorubric/eval.py
from_dict (classmethod)
from_dict(data: dict[str, Any]) -> ExperimentManifest
Deserialize from dictionary.
Source code in src/autorubric/eval.py
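A manifest holds a datetime and a set of indices, neither of which JSON can store directly. The sketch below shows one way a to_dict/from_dict pair could round-trip them. `ManifestSketch` is a hypothetical, trimmed-down stand-in, and the ISO-8601 and sorted-list encodings are assumptions about the format.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ManifestSketch:
    experiment_name: str
    created_at: datetime
    total_items: int
    status: str
    completed_indices: set[int] = field(default_factory=set)

    def to_dict(self):
        return {
            "experiment_name": self.experiment_name,
            "created_at": self.created_at.isoformat(),  # datetime -> ISO string
            "total_items": self.total_items,
            "status": self.status,
            "completed_indices": sorted(self.completed_indices),  # set -> list
        }

    @classmethod
    def from_dict(cls, data):
        return cls(
            experiment_name=data["experiment_name"],
            created_at=datetime.fromisoformat(data["created_at"]),
            total_items=data["total_items"],
            status=data["status"],
            completed_indices=set(data["completed_indices"]),
        )


manifest = ManifestSketch(
    "my-essay-eval", datetime(2025, 1, 1, tzinfo=timezone.utc), 4, "running", {2, 0}
)
restored = ManifestSketch.from_dict(json.loads(json.dumps(manifest.to_dict())))
```

Sorting the indices before writing keeps the file stable across runs, which makes manifests easier to diff.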
References
Casabianca, J., McCaffrey, D. F., Johnson, M. S., Alper, N., and Zubenko, V. (2025). Validity Arguments For Constructed Response Scoring Using Generative Artificial Intelligence Applications. arXiv:2501.02334.