Core Grading¶
Fundamental types for rubric-based evaluation: criteria, rubrics, verdicts, and evaluation reports.
Overview¶
The core grading module provides the foundational types for defining evaluation criteria and receiving grading results. A Rubric contains multiple Criterion objects, each with a weight and requirement. Grading produces an EvaluationReport with per-criterion verdicts and explanations.
Quick Example¶
from autorubric import Rubric, Criterion, CriterionVerdict, LLMConfig
from autorubric.graders import CriterionGrader

# Define criteria
rubric = Rubric([
    Criterion(name="accuracy", weight=10.0, requirement="States the correct answer"),
    Criterion(name="clarity", weight=5.0, requirement="Explains reasoning clearly"),
    Criterion(weight=-15.0, requirement="Contains factual errors"),  # name optional
])

# Or from dict/file
rubric = Rubric.from_dict([
    {"weight": 10.0, "requirement": "States the correct answer"},
    {"requirement": "Explains reasoning clearly"},  # weight defaults to 10.0
])
rubric = Rubric.from_file("rubric.yaml")

# Grade
grader = CriterionGrader(llm_config=LLMConfig(model="openai/gpt-4.1-mini"))
result = await rubric.grade(to_grade="...", grader=grader)

# result.score is `float | None` (None if the grade failed); guard before formatting.
print(f"Score: {result.score:.2f}" if result.score is not None else "Score: n/a (grade failed)")
for cr in result.report:
    # `final_verdict` is None on error/multi-choice criteria; guard before printing.
    verdict = cr.final_verdict.value if cr.final_verdict is not None else "n/a"
    print(f" [{verdict}] {cr.criterion.requirement}")
    print(f" Reason: {cr.final_reason}")
Score Calculation¶
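For each criterion \(i\) with weight \(w_i\), the contribution \(c_i\) is:
- If verdict = MET, \(c_i = w_i\)
- If verdict = UNMET, \(c_i = 0\)
Final score: the raw score is the sum of the contributions,
\[
\text{raw\_score} = \sum_i c_i
\]
and, when the grader normalizes, that sum is rescaled to the 0-1 range.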
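The contribution rules above can be checked by hand with a few lines of plain Python. This is a sketch, not the library's implementation: `raw_weighted_sum` is a hypothetical helper, the weights are the three from the Quick Example, and normalization is left out because the exact rescaling is not spelled out here.

```python
def raw_weighted_sum(weights: list[float], met: list[bool]) -> float:
    """Sum w_i for every criterion judged MET; UNMET contributes 0."""
    if len(weights) != len(met):
        raise ValueError("need exactly one verdict per criterion")
    return sum(w for w, is_met in zip(weights, met) if is_met)


# Weights from the Quick Example: accuracy, clarity, factual-error penalty.
weights = [10.0, 5.0, -15.0]

clean_answer = raw_weighted_sum(weights, [True, True, False])  # 10 + 5 + 0
flawed_answer = raw_weighted_sum(weights, [True, True, True])  # 10 + 5 - 15
print(clean_answer, flawed_answer)  # 15.0 0.0
```

Note that a negative-weight criterion lowers the sum when it is MET, which is why the penalty criterion is phrased as the presence of an error.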
Criterion¶
A single evaluation criterion with weight and requirement.
Criterion
¶
Bases: BaseModel
A single evaluation criterion with a weight and requirement description.
Supports both binary (MET/UNMET) and multi-choice criteria. If options is None,
the criterion is binary. If options is provided, the criterion is multi-choice.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| weight | Scoring weight. Positive for desired traits, negative for errors/penalties. Defaults to 10.0 for uniform weighting when not specified. |
| requirement | Description of what the criterion evaluates. |
| name | Optional short identifier for the criterion (e.g., "clarity", "accuracy"). Useful for referencing criteria in reports and debugging. |
| options | List of options for multi-choice criteria. If None, criterion is binary. |
| scale_type | For multi-choice, indicates if options are ordinal (ordered) or nominal (unordered categories). Affects aggregation strategy selection. |
| aggregation | Per-criterion aggregation strategy override. If None, uses grader default. |
Example
Binary criterion (existing behavior)¶
binary = Criterion(
    name="accuracy",
    weight=10.0,
    requirement="The response is factually accurate",
)
Multi-choice ordinal criterion¶
ordinal = Criterion(
    name="satisfaction",
    weight=10.0,
    requirement="How satisfied would you be?",
    options=[
        CriterionOption(label="1", value=0.0),
        CriterionOption(label="2", value=0.33),
        CriterionOption(label="3", value=0.67),
        CriterionOption(label="4", value=1.0),
    ],
    scale_type="ordinal",
)
na_option_index
property
¶
Index of the first NA option, or None if there is none.
Returns None for binary criteria (no options). This is the single
source for the recurring "find the (first) NA option" lookup used by the
grader's error/abstain path and the ensemble aggregation NA-abstain paths.
get_option_value
¶
Get the score value for an option by index.
| PARAMETER | DESCRIPTION |
|---|---|
| index | Zero-based index of the option. |

| RETURNS | DESCRIPTION |
|---|---|
| float | The score value for the option. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If this is a binary criterion or index is out of range. |
Source code in src/autorubric/types.py
find_option_by_label
¶
Find option index by label (case-insensitive, whitespace-normalized).
Used for resolving ground truth labels to indices for metrics computation.
| PARAMETER | DESCRIPTION |
|---|---|
| label | The label to search for. |

| RETURNS | DESCRIPTION |
|---|---|
| int | Zero-based index of the matching option. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If this is a binary criterion or label not found. |
Source code in src/autorubric/types.py
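As an illustration of case-insensitive, whitespace-normalized matching, the sketch below resolves a loosely formatted ground-truth label to an option index. It is a standalone approximation: `normalize_label` and `find_label_index` are hypothetical names, and the exact normalization the library applies (here, collapsing whitespace runs and case-folding) is an assumption.

```python
def normalize_label(label: str) -> str:
    """Collapse runs of whitespace, trim the ends, and ignore case."""
    return " ".join(label.split()).casefold()


def find_label_index(option_labels: list[str], label: str) -> int:
    """Return the zero-based index of the first option matching `label`."""
    wanted = normalize_label(label)
    for index, candidate in enumerate(option_labels):
        if normalize_label(candidate) == wanted:
            return index
    raise ValueError(f"label not found: {label!r}")


labels = ["Too few", "Just right", "Too many"]
print(find_label_index(labels, "  just   RIGHT "))  # 1
```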
worst_option_among
¶
Return the score-minimizing option index among candidate_indices.
Weight-sign aware: for non-negative weight the worst option has the lowest
value; for negative weight it has the highest value (a high value on a
negative-weight criterion subtracts more from the score). Value ties resolve to
the lowest index, independent of the order of candidate_indices.
This is the canonical tie-break shared by ensemble vote aggregation
(mode/weighted_mode count/weight ties and mean/median snap ties,
in criterion_grader.py) and :meth:worst_scored_option, so scoring, the
grader's unknown-error path, and aggregation tie-breaking cannot drift.
| PARAMETER | DESCRIPTION |
|---|---|
| candidate_indices | Indices into |

| RETURNS | DESCRIPTION |
|---|---|
| int | The score-minimizing index (lowest value for weight >= 0, highest value for weight < 0; lowest index on a value tie). |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If this is a binary criterion (no options) or |
Source code in src/autorubric/types.py
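The weight-sign-aware selection and its tie-break can be sketched in a few lines. This version works on a bare list of option values rather than on a Criterion, so the function signature is an assumption made for illustration; only the selection rule follows the description above.

```python
def worst_option_among(values: list[float], weight: float,
                       candidate_indices: list[int]) -> int:
    """Pick the candidate whose value hurts the score most for this weight sign."""
    # Non-negative weight: the lowest value is worst.
    # Negative weight: the highest value is worst, so flip the sign.
    sign = 1.0 if weight >= 0 else -1.0
    # The (signed value, index) key sends value ties to the lowest index,
    # regardless of the order in which the candidates were passed.
    return min(candidate_indices, key=lambda i: (sign * values[i], i))


values = [0.0, 0.33, 0.33, 1.0]
print(worst_option_among(values, 10.0, [3, 2, 1]))  # 1
print(worst_option_among(values, -5.0, [0, 1, 2]))  # 1
```

In the first call options 1 and 2 tie on value and the lower index wins even though 2 was listed first. In the second call the negative weight makes the higher value the worse outcome.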
worst_scored_option
¶
worst_scored_option() -> tuple[int, CriterionOption]
Return (index, option) of the score-minimizing scored (non-NA) option.
Weight-sign aware: for non-negative weight, returns the option with the
lowest value; for negative weight, returns the option with the
highest value (the worst case flips because a high value on a
negative-weight criterion subtracts more from the score). NA options
are excluded — this returns the score-minimizing scored option, the
analog of binary UNMET (for positive weight) or MET (for negative
weight).
Ties resolve to the lowest index (delegates to
:meth:worst_option_among over the non-NA indices).
Shared by the grader's unknown-error worst-case path
(criterion_grader.py) and the metrics' na_mode="as_unmet" remap
(metrics/_helpers.py) so the two layers cannot drift.
| RETURNS | DESCRIPTION |
|---|---|
| tuple[int, CriterionOption] | Tuple of (index, option) for the score-minimizing non-NA option. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If this is a binary criterion or has no non-NA option. The |
Source code in src/autorubric/types.py
with_guaranteed_na_option
¶
with_guaranteed_na_option() -> Criterion
Return a multi-choice criterion guaranteed to expose an NA/abstain option.
Gives the judge a first-class "cannot assess" channel analogous to binary
CriterionVerdict.CANNOT_ASSESS. If the criterion already has an
NA option (author intent), returns self unchanged. Otherwise returns a
copy with a single :data:CANONICAL_NA_OPTION appended at the end
(highest index) so existing option indices 0..N-1 stay stable for
ground-truth alignment, shuffle-order mapping, and
:meth:worst_scored_option.
This is a pure function of the criterion (no RNG, no external state), so the grader and the metrics layer can both reconstruct the identical effective option set without drifting.
| RETURNS | DESCRIPTION |
|---|---|
| Criterion | This criterion unchanged if it already has an NA option; otherwise a copy with the canonical NA option appended. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If this is a binary criterion (no options). |
Source code in src/autorubric/types.py
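The append-at-the-end behavior is easy to see on a toy model. The classes below are hypothetical stand-ins built with dataclasses, not the library's Pydantic models: the `na` flag, the `MultiChoice` class, and the contents of `CANONICAL_NA` are all assumptions chosen to keep the sketch self-contained.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Option:
    label: str
    value: float = 0.0
    na: bool = False  # hypothetical flag marking an NA/abstain option


@dataclass(frozen=True)
class MultiChoice:
    requirement: str
    options: tuple[Option, ...]

    def with_guaranteed_na_option(self) -> "MultiChoice":
        # The author already supplied an NA option: return the same object.
        if any(opt.na for opt in self.options):
            return self
        # Otherwise append at the end so indices 0..N-1 keep their meaning.
        return replace(self, options=self.options + (CANONICAL_NA,))


CANONICAL_NA = Option(label="N/A", na=True)  # placeholder contents

base = MultiChoice("How satisfied would you be?", (Option("1", 0.0), Option("4", 1.0)))
padded = base.with_guaranteed_na_option()
print(len(padded.options), padded.options[:2] == base.options)  # 3 True
print(padded.with_guaranteed_na_option() is padded)             # True
```

Because the method depends only on the criterion itself, calling it twice yields the same option set, which is the property that lets two layers rebuild it independently.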
validate_options
¶
validate_options() -> Criterion
Validate multi-choice options if present.
Source code in src/autorubric/types.py
CriterionVerdict¶
Enum representing the verdict for a criterion.
CriterionVerdict
¶
Bases: str, Enum
Status of a criterion evaluation.
- MET: The criterion is satisfied by the submission
- UNMET: The criterion is not satisfied by the submission
- CANNOT_ASSESS: Insufficient evidence to make a determination
CriterionReport¶
Per-criterion result with verdict and explanation.
CriterionReport
¶
Bases: Criterion
A criterion with its evaluation result.
Supports both binary (MET/UNMET/CANNOT_ASSESS) and multi-choice verdicts.
For binary criteria, use verdict. For multi-choice, use multi_choice_verdict.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| verdict | Binary verdict (MET/UNMET/CANNOT_ASSESS). None for multi-choice criteria. |
| multi_choice_verdict | Multi-choice verdict with selected option. None for binary. |
| reason | The judge's brief, final justification for the verdict. When thinking is enabled this is the concise conclusion the judge distilled from its |
| shuffle_order | Permutation used when presenting multi-choice options to the LLM. Maps shuffled position → original index. None for binary criteria or when shuffle_options is disabled. |
| error | Set when this verdict was synthesized because the judge call failed, rather than produced by a genuine judgment. The string is prefixed with the failure category ( |
| reasoning | The judge's verbose extended-thinking deliberation trace: the chain of thought produced before settling on |
score_value
property
¶
Get the score contribution (0-1) for this criterion.
For binary criteria: 1.0 if MET, 0.0 otherwise. For multi-choice: the value of the selected option.
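A minimal sketch of that rule, written as two free functions rather than a property. `Verdict` is a stand-in for CriterionVerdict whose string values are assumed, and the helper names are hypothetical.

```python
from enum import Enum


class Verdict(str, Enum):  # stand-in for CriterionVerdict; string values are assumed
    MET = "MET"
    UNMET = "UNMET"
    CANNOT_ASSESS = "CANNOT_ASSESS"


def binary_score_value(verdict: Verdict) -> float:
    """1.0 for MET, 0.0 for anything else (UNMET or CANNOT_ASSESS)."""
    return 1.0 if verdict is Verdict.MET else 0.0


def multi_choice_score_value(option_values: list[float], selected_index: int) -> float:
    """The value attached to whichever option the judge selected."""
    return option_values[selected_index]


print(binary_score_value(Verdict.MET), binary_score_value(Verdict.CANNOT_ASSESS))  # 1.0 0.0
print(multi_choice_score_value([0.0, 0.33, 0.67, 1.0], 2))  # 0.67
```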
is_na
property
¶
Check if this criterion was marked NA or CANNOT_ASSESS.
Returns True for:
- Binary criteria with CANNOT_ASSESS verdict
- Multi-choice criteria with NA option selected
is_error
property
¶
Whether this verdict was synthesized due to a judge-call failure.
Use this instead of inspecting reason to distinguish error-induced
verdicts from genuine judgments.
CriterionJudgment¶
Structured output from LLM judge for a single criterion.
CriterionJudgment
¶
Bases: BaseModel
Structured LLM output for single criterion evaluation.
Used with LiteLLM's response_format parameter to ensure type-safe, validated responses from the judge LLM.
Note: This is separate from CriterionReport because:
- CriterionReport includes 'weight' and 'requirement' fields that come from the rubric, not from the LLM
- The LLM only outputs the judgment (status + explanation)
explanation is the judge's brief, final justification. reasoning is the
verbose extended-thinking deliberation trace behind it (populated only when thinking
is enabled): when present, explanation is the concise conclusion the judge
distilled from reasoning.
Rubric¶
Collection of criteria for evaluation.
Rubric
¶
Rubric(rubric: list[Criterion])
A rubric is a list of criteria used to evaluate text outputs.
Each criterion has a weight and requirement. Use the grade() method to evaluate text against this rubric using a grader.
Source code in src/autorubric/rubric.py
grade
async
¶
grade(to_grade: ToGradeInput, grader: Grader, query: str | None = None, reference_submission: str | None = None) -> EvaluationReport
Grade text against this rubric using a grader.
| PARAMETER | DESCRIPTION |
|---|---|
| to_grade | The text to evaluate. Can be either: - A string (optionally with |
| grader | The grader to use. REQUIRED - must be provided. Configure length_penalty and normalize on the grader if needed. |
| query | Optional input/query that prompted the response. |
| reference_submission | Optional exemplar response for grading context. When present, provides calibration for the grader. |

| RAISES | DESCRIPTION |
|---|---|
| TypeError | If grader is not provided. |
Source code in src/autorubric/rubric.py
validate_and_create_criteria
staticmethod
¶
validate_and_create_criteria(data: list[dict[str, Any]] | dict[str, Any]) -> list[Criterion]
Validate and create Criterion objects from raw data.
Supports multiple formats:
- Flat list of criteria
- List of sections with criteria
- Dict with 'sections' key containing list of sections
- Dict with 'rubric' key containing sections
Source code in src/autorubric/rubric.py
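To make the four accepted shapes concrete, the sketch below reduces each of them to a flat list of criterion dicts. It is an approximation under stated assumptions: `flatten_criteria` is a hypothetical helper, and the layout of a section (a dict holding its criteria under a `"criteria"` key) is assumed rather than taken from the library.

```python
from __future__ import annotations

from typing import Any


def flatten_criteria(data: list[dict[str, Any]] | dict[str, Any]) -> list[dict[str, Any]]:
    """Reduce any of the four documented shapes to a flat list of criterion dicts."""
    if isinstance(data, dict):
        # Dict wrapper: the sections live under 'sections' or 'rubric'.
        if "sections" in data:
            data = data["sections"]
        elif "rubric" in data:
            data = data["rubric"]
        else:
            raise ValueError("expected a 'sections' or 'rubric' key")
    flat: list[dict[str, Any]] = []
    for item in data:
        if "criteria" in item:  # assumed section layout: {"criteria": [...]}
            flat.extend(item["criteria"])
        else:  # already a criterion dict
            flat.append(item)
    return flat


flat = [{"requirement": "States the correct answer"}]
sectioned = {"sections": [{"criteria": flat}]}
print(flatten_criteria(flat) == flatten_criteria(sectioned))  # True
```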
from_yaml
classmethod
¶
from_yaml(yaml_string: str) -> Rubric
Parse rubric from a YAML string.
Source code in src/autorubric/rubric.py
from_json
classmethod
¶
from_json(json_string: str) -> Rubric
Parse rubric from a JSON string.
Source code in src/autorubric/rubric.py
from_file
classmethod
¶
from_file(source: str | Any) -> Rubric
Load rubric from a file path or file-like object, auto-detecting format.
Source code in src/autorubric/rubric.py
compute_score
¶
compute_score(verdicts: list[CriterionVerdict | str], normalize: bool = True, cannot_assess_strategy: CannotAssessStrategy = SKIP, partial_credit: float = 0.5) -> float
Compute a weighted score from raw verdicts against this rubric.
Single source of truth for scoring from verdict lists (e.g. ground truth labels). Handles binary (MET/UNMET/CANNOT_ASSESS) and multi-choice (option label strings) criteria.
Parses and validates each verdict into a CriterionReport and delegates
to the shared score_reports core, so this path agrees exactly with the
live grader and RubricDataset.compute_weighted_score across every
CannotAssessStrategy x {binary, multi-choice} x {+/- weight}.
| PARAMETER | DESCRIPTION |
|---|---|
| verdicts | One value per criterion. Binary criteria accept CriterionVerdict or its string form; multi-choice criteria accept an option label string. |
| normalize | If True, normalise to [0, 1]. If False, return the raw weighted sum. |
| cannot_assess_strategy | How to handle CANNOT_ASSESS / NA verdicts. |
| partial_credit | Credit fraction when strategy is PARTIAL. |

| RETURNS | DESCRIPTION |
|---|---|
| float | The computed score. |
Source code in src/autorubric/rubric.py
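The effect of the CANNOT_ASSESS strategy on a raw (unnormalized) sum can be sketched as follows. Several things here are assumptions: `Strategy` models only the two strategies named above, verdicts are passed as the strings "MET", "UNMET" and "CANNOT_ASSESS", SKIP is read as "adds nothing to the raw sum", and PARTIAL is read as crediting `partial_credit` times the weight. The library's `compute_score` may differ in any of these details.

```python
from enum import Enum


class Strategy(Enum):  # stand-in covering only the two strategies named above
    SKIP = "skip"
    PARTIAL = "partial"


def raw_score(weights: list[float], verdicts: list[str],
              strategy: Strategy = Strategy.SKIP,
              partial_credit: float = 0.5) -> float:
    """Raw weighted sum over binary verdicts given as strings."""
    if len(weights) != len(verdicts):
        raise ValueError("need exactly one verdict per criterion")
    total = 0.0
    for weight, verdict in zip(weights, verdicts):
        if verdict == "MET":
            total += weight
        elif verdict == "CANNOT_ASSESS" and strategy is Strategy.PARTIAL:
            total += partial_credit * weight  # assumed reading of "credit fraction"
        # UNMET, or CANNOT_ASSESS under SKIP, adds nothing to the raw sum.
    return total


weights = [10.0, 5.0]
verdicts = ["MET", "CANNOT_ASSESS"]
print(raw_score(weights, verdicts))                    # 10.0
print(raw_score(weights, verdicts, Strategy.PARTIAL))  # 12.5
```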
EvaluationReport¶
Complete grading result with score and per-criterion reports.
EvaluationReport
¶
Bases: BaseModel
Final evaluation result with score and per-criterion reports.
For training use cases, set normalize=False in the grader to get raw weighted sums instead of normalized 0-1 scores.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| score | The final score (0-1 if normalized, raw weighted sum otherwise). |
| raw_score | The unnormalized weighted sum. |
| llm_raw_score | The original score returned by the LLM (same as raw_score). |
| report | Per-criterion breakdown with verdicts and explanations. |
| cannot_assess_count | Number of criteria with CANNOT_ASSESS verdict. |
| error | Optional error message if grading failed (e.g., JSON parse error). When set, score/raw_score are |
| token_usage | Aggregated token usage across all LLM calls made during grading. For CriterionGrader, this is the sum across all criterion evaluations. |
| completion_cost | Total cost in USD for all LLM calls made during grading. Calculated using LiteLLM's completion_cost() function. |
Example
result = await rubric.grade(to_grade=response, grader=grader)
# score is None when grading failed, so guard before formatting.
if result.score is not None:
    print(f"Score: {result.score:.2f}")
if result.cannot_assess_count:
    print(f"Could not assess {result.cannot_assess_count} criteria")
if result.token_usage:
    print(f"Tokens: {result.token_usage.total_tokens}")
if result.completion_cost:
    print(f"Cost: ${result.completion_cost:.6f}")
TokenUsage¶
Token usage tracking for LLM calls.
TokenUsage
dataclass
¶
TokenUsage(prompt_tokens: int = 0, completion_tokens: int = 0, total_tokens: int = 0, cache_creation_input_tokens: int = 0, cache_read_input_tokens: int = 0)
Token usage statistics from LLM API calls.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| prompt_tokens | Number of tokens in the prompt/input. |
| completion_tokens | Number of tokens in the completion/output. |
| total_tokens | Total tokens (prompt + completion). |
| cache_creation_input_tokens | Tokens used to create cache entries (Anthropic). |
| cache_read_input_tokens | Tokens read from cache (Anthropic). |
Example
usage = TokenUsage(prompt_tokens=100, completion_tokens=50, total_tokens=150)
print(f"Total tokens: {usage.total_tokens}")
# Total tokens: 150
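An evaluation report carries usage summed across criterion evaluations. One way such a sum could be formed is sketched below. The dataclass is a local mirror of the documented fields so the snippet runs without the library, and `sum_usage` is a hypothetical helper, not a documented function.

```python
from dataclasses import dataclass, fields


@dataclass
class TokenUsage:  # local mirror of the documented fields, not the library class
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    cache_creation_input_tokens: int = 0
    cache_read_input_tokens: int = 0


def sum_usage(usages: list[TokenUsage]) -> TokenUsage:
    """Add every counter field-by-field across the given calls."""
    return TokenUsage(**{
        f.name: sum(getattr(u, f.name) for u in usages)
        for f in fields(TokenUsage)
    })


per_criterion = [
    TokenUsage(prompt_tokens=100, completion_tokens=50, total_tokens=150),
    TokenUsage(prompt_tokens=80, completion_tokens=20, total_tokens=100),
]
print(sum_usage(per_criterion).total_tokens)  # 250
```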
ToGradeInput¶
Type alias for the input format accepted by rubric.grade().
ToGradeInput
module-attribute
¶
ToGradeInput = str | ThinkingOutputDict
Union type for to_grade parameter.
Accepts either a plain string or a dict with thinking/output keys.
ThinkingOutputDict¶
TypedDict for responses with separate thinking and output sections.
ThinkingOutputDict
¶
Bases: TypedDict
Dict format for submissions with separate thinking and output sections.
Both fields are optional to allow partial submissions or gradual construction. When used with length penalty, missing fields are treated as empty strings.
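The optional-keys behavior can be sketched with a `total=False` TypedDict. `ThinkingOutput` is a stand-in for the library type, `submission_length` is a hypothetical helper, and measuring length in characters is an assumption (the unit used by the length penalty is not stated here).

```python
from typing import TypedDict


class ThinkingOutput(TypedDict, total=False):  # stand-in; both keys optional
    thinking: str
    output: str


def submission_length(sub: "str | ThinkingOutput") -> int:
    """Character count with missing sections treated as empty strings."""
    if isinstance(sub, str):
        return len(sub)
    return len(sub.get("thinking", "")) + len(sub.get("output", ""))


print(submission_length({"output": "42"}))                       # 2
print(submission_length({"thinking": "6 * 7", "output": "42"}))  # 7
print(submission_length("42"))                                   # 2
```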
ScaleType¶
Literal type alias for multi-choice criterion scale types (ordinal, nominal).
ScaleType
module-attribute
¶
Scale type for multi-choice criteria.
- ordinal: Options have inherent order (e.g., 1-4 satisfaction scale)
- nominal: Options are unordered categories (e.g., "too few", "too many", "just right")
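A Literal alias of this kind can also be checked at runtime. The definition below is inferred from the two names listed above rather than copied from the library, and `is_scale_type` is a hypothetical helper.

```python
from typing import Literal, get_args

# Assumed definition, inferred from the two scale names listed above.
ScaleType = Literal["ordinal", "nominal"]


def is_scale_type(value: str) -> bool:
    """True when `value` is one of the strings allowed by the alias."""
    return value in get_args(ScaleType)


print(is_scale_type("ordinal"), is_scale_type("interval"))  # True False
```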