API Reference¶
Complete API documentation for AutoRubric, organized by functional area.
Overview¶
AutoRubric exports 87 public items across the main module and autorubric.graders. This reference is organized into thematic chapters for easier navigation.
Quick Links¶
| Chapter | Key Exports | Description |
|---|---|---|
| CANNOT_ASSESS Handling | CannotAssessConfig, CannotAssessStrategy | Configure handling of uncertain verdicts |
| Core Grading | Criterion, Rubric, EvaluationReport | Fundamental types for rubric-based evaluation |
| Dataset | DataItem, RubricDataset | Dataset management and serialization |
| Distribution Metrics | earth_movers_distance, ks_test | Statistical distribution comparisons |
| Ensemble | EnsembleEvaluationReport, JudgeVote | Multi-judge aggregation |
| Eval Runner | EvalRunner, evaluate() | Batch evaluation with checkpointing |
| Few-Shot | FewShotConfig, FewShotExample | Calibration with labeled examples |
| Graders | CriterionGrader, Grader, JudgeSpec | Grader implementations |
| Length Penalty | LengthPenalty, compute_length_penalty | Verbosity control |
| LLM Infrastructure | LLMConfig, LLMClient, generate() | LLM client and configuration |
| Meta-Rubric Evaluation | evaluate_rubric_standalone, evaluate_rubric_in_context | Assess rubric quality |
| Metrics | MetricsResult, compute_metrics | Agreement and correlation metrics |
| Multi-Choice | CriterionOption, MultiChoiceVerdict | Ordinal and nominal scales |
| Utilities | aggregate_token_usage, word_count | Helper functions |
Import Patterns¶
Main Module¶
```python
from autorubric import (
    # Core types
    Criterion,
    Rubric,
    CriterionVerdict,
    EvaluationReport,
    # LLM configuration
    LLMConfig,
    LLMClient,
    # Dataset
    DataItem,
    RubricDataset,
    # Evaluation
    EvalRunner,
    evaluate,
    # Metrics
    compute_metrics,
    MetricsResult,
)
```
Graders Module¶
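Grader classes are imported from autorubric.graders. A representative import, using the key exports listed for the Graders chapter above:

```python
from autorubric.graders import (
    Grader,
    CriterionGrader,
    JudgeSpec,
)
```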
Meta Module¶
```python
from autorubric.meta import (
    evaluate_rubric_standalone,
    evaluate_rubric_in_context,
    get_standalone_meta_rubric,
    get_in_context_meta_rubric,
)
```
Type Hierarchy¶
```
Grader (ABC)
└── CriterionGrader

EvaluationReport
└── EnsembleEvaluationReport

CriterionReport
└── EnsembleCriterionReport

MetricsResult
├── CriterionMetrics
├── OrdinalCriterionMetrics
├── NominalCriterionMetrics
└── JudgeMetrics
```
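Because ensemble reports subclass the base report types, code written against EvaluationReport also accepts ensemble results. A minimal sketch (the import path for EnsembleEvaluationReport is an assumption here; see the Ensemble chapter for the authoritative location):

```python
from autorubric import EvaluationReport, EnsembleEvaluationReport  # ensemble import path assumed

def summarize(report: EvaluationReport) -> None:
    # The subclass relationship above means ensemble reports pass this check.
    assert isinstance(report, EvaluationReport)
    if isinstance(report, EnsembleEvaluationReport):
        # Ensemble-specific detail (e.g. per-judge votes) is available here.
        ...
```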
Architecture¶
Grading Flow¶
- Rubric.grade() delegates to the grader's grade() method
- CriterionGrader treats a single LLM as an "ensemble of 1"
- Makes concurrent LLM calls per criterion per judge via asyncio.gather() (sketched below)
- Aggregates votes using a configurable strategy
- Returns EnsembleEvaluationReport (consistent interface)
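The fan-out step can be pictured with plain asyncio. The helper names below are hypothetical and do not reflect AutoRubric's internal API; the sketch only illustrates one call per (criterion, judge) pair gathered concurrently:

```python
import asyncio

async def grade_all(criteria, judges, grade_one):
    # Hypothetical fan-out: grade_one(criterion, judge) stands in for a
    # single LLM call that returns one vote.
    tasks = [
        grade_one(criterion, judge)
        for criterion in criteria
        for judge in judges
    ]
    # gather() runs the calls concurrently and preserves input order, so
    # votes can be grouped back per criterion for aggregation.
    return await asyncio.gather(*tasks)
```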
Score Calculation¶
```python
# Positive criteria: MET earns the criterion's weight, UNMET earns 0
# Negative criteria: MET subtracts the weight, UNMET contributes 0
weighted_sum = sum(verdict_value(criterion) * criterion.weight for criterion in criteria)
score = clamp(weighted_sum / total_positive_weight, 0, 1)  # if normalized
# Length penalty is subtracted after the base calculation
```
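A worked instance of the formula, independent of the library's own classes (the weights and verdicts are made up for illustration):

```python
# Two positive criteria (weights 2.0 and 1.0) and one negative criterion
# (weight -1.0); the judge returns MET, UNMET, MET respectively.
weights  = [2.0, 1.0, -1.0]
verdicts = ["MET", "UNMET", "MET"]

# MET contributes the criterion's weight (positive or negative); UNMET contributes 0.
weighted_sum = sum(w for w, v in zip(weights, verdicts) if v == "MET")  # 2.0 - 1.0 = 1.0
total_positive_weight = sum(w for w in weights if w > 0)                # 3.0
score = max(0.0, min(1.0, weighted_sum / total_positive_weight))        # ≈ 0.33
```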
Conventions¶
- All graders return EnsembleEvaluationReport for a consistent interface
- raw_score is always populated regardless of the normalize setting
- Parse failures use conservative defaults (UNMET for positive weights, MET for negative weights)
- Filter out results where error is not None in training pipelines (see the sketch below)
- Rate limiting via LLMConfig.max_parallel_requests (per-provider semaphore)
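A minimal sketch of the filtering convention, assuming each per-item result exposes an error attribute that is None on success (as stated above):

```python
# results: per-item outputs from a batch evaluation run.
clean = [r for r in results if r.error is None]       # safe to use for training
failed = [r for r in results if r.error is not None]  # inspect or re-run these
```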