Core Grading¶

Fundamental types for rubric-based evaluation: criteria, rubrics, verdicts, and evaluation reports.

Overview¶

The core grading module provides the foundational types for defining evaluation criteria and receiving grading results. A Rubric contains multiple Criterion objects, each with a weight and requirement. Grading produces an EvaluationReport with per-criterion verdicts and explanations.

Quick Example¶

from autorubric import Rubric, Criterion, CriterionVerdict, LLMConfig
from autorubric.graders import CriterionGrader

# Define criteria
rubric = Rubric([
    Criterion(name="accuracy", weight=10.0, requirement="States the correct answer"),
    Criterion(name="clarity", weight=5.0, requirement="Explains reasoning clearly"),
    Criterion(weight=-15.0, requirement="Contains factual errors"),  # name optional
])

# Or from dict/file
rubric = Rubric.from_dict([
    {"weight": 10.0, "requirement": "States the correct answer"},
    {"requirement": "Explains reasoning clearly"},  # weight defaults to 10.0
])
rubric = Rubric.from_file("rubric.yaml")

# Grade (rubric.grade is async: run this inside a coroutine or a REPL with top-level await)
grader = CriterionGrader(llm_config=LLMConfig(model="openai/gpt-4.1-mini"))
result = await rubric.grade(to_grade="...", grader=grader)

# result.score is `float | None` (None if the grade failed); guard before formatting.
print(f"Score: {result.score:.2f}" if result.score is not None else "Score: n/a (grade failed)")
for cr in result.report or []:  # `report` is typed `list | None`; fall back to an empty list
    # `final_verdict` is None on error/multi-choice criteria; guard before printing.
    verdict = cr.final_verdict.value if cr.final_verdict is not None else "n/a"
    print(f"  [{verdict}] {cr.criterion.requirement}")
    print(f"    Reason: {cr.final_reason}")

Score Calculation¶

For each criterion $i$:

If verdict = MET, contribution = $w_i$
If verdict = UNMET, contribution = 0

Final score:

\[ \text{score} = \max\left(0, \min\left(1, \frac{\sum_{i=1}^{n} \mathbb{1}[\text{verdict}_i = \text{MET}] \cdot w_i}{\sum_{i=1}^{n} \max(0, w_i)}\right)\right) \]
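
The formula can be checked by hand with a small standalone sketch. It does not call autorubric: `normalized_score` is a hypothetical helper that only mirrors the equation above for binary verdicts, assuming verdicts arrive as the strings "MET" and "UNMET". Raising on a zero denominator is also an assumption of this sketch, not documented library behavior.

```python
def normalized_score(weights: list[float], verdicts: list[str]) -> float:
    """Mirror the formula above: MET-weighted sum over positive weight, clamped to [0, 1]."""
    if len(weights) != len(verdicts):
        raise ValueError("one verdict per criterion is required")
    # Numerator: every MET criterion contributes its weight (negative weights subtract).
    earned = sum(w for w, v in zip(weights, verdicts) if v == "MET")
    # Denominator: only positive weights count toward the achievable maximum.
    achievable = sum(max(0.0, w) for w in weights)
    if achievable == 0:
        # Assumption of this sketch: refuse to divide by zero.
        raise ValueError("at least one positive weight is required")
    return max(0.0, min(1.0, earned / achievable))


weights = [10.0, 5.0, -15.0]  # accuracy, clarity, factual-error penalty
print(normalized_score(weights, ["MET", "MET", "UNMET"]))    # 15 / 15 -> 1.0
print(normalized_score(weights, ["MET", "UNMET", "UNMET"]))  # 10 / 15
print(normalized_score(weights, ["MET", "MET", "MET"]))      # (15 - 15) / 15 -> 0.0
print(normalized_score(weights, ["UNMET", "UNMET", "MET"]))  # -15 / 15, clamped to 0.0
```

The last call shows why the outer clamp matters: a triggered penalty with no positive criteria met would otherwise yield a negative score.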

Criterion¶

A single evaluation criterion with weight and requirement.

Criterion ¶

Bases: BaseModel

A single evaluation criterion with a weight and requirement description.

Supports both binary (MET/UNMET) and multi-choice criteria. If options is None, the criterion is binary. If options is provided, the criterion is multi-choice.

ATTRIBUTE	DESCRIPTION
`weight`	Scoring weight. Positive for desired traits, negative for errors/penalties. Defaults to 10.0 for uniform weighting when not specified. TYPE: `float`
`requirement`	Description of what the criterion evaluates. TYPE: `str`
`name`	Optional short identifier for the criterion (e.g., "clarity", "accuracy"). Useful for referencing criteria in reports and debugging. TYPE: `str \| None`
`options`	List of options for multi-choice criteria. If None, criterion is binary. TYPE: `list[CriterionOption] \| None`
`scale_type`	For multi-choice, indicates if options are ordinal (ordered) or nominal (unordered categories). Affects aggregation strategy selection. TYPE: `ScaleType`
`aggregation`	Per-criterion aggregation strategy override. If None, uses grader default. TYPE: `str \| None`

Example

# Binary criterion (existing behavior)
binary = Criterion(
    name="accuracy",
    weight=10.0,
    requirement="The response is factually accurate"
)

# Multi-choice ordinal criterion
ordinal = Criterion(
    name="satisfaction",
    weight=10.0,
    requirement="How satisfied would you be?",
    options=[
        CriterionOption(label="1", value=0.0),
        CriterionOption(label="2", value=0.33),
        CriterionOption(label="3", value=0.67),
        CriterionOption(label="4", value=1.0),
    ],
    scale_type="ordinal",
)

is_binary `property` ¶

is_binary: bool

Check if this is a binary (MET/UNMET) criterion.

is_multi_choice `property` ¶

is_multi_choice: bool

Check if this is a multi-choice criterion.

na_option_index `property` ¶

na_option_index: int | None

Index of the first NA option, or None if there is none.

Returns None for binary criteria (no options). This is the single source for the recurring "find the (first) NA option" lookup used by the grader's error/abstain path and the ensemble aggregation NA-abstain paths.

get_option_value ¶

get_option_value(index: int) -> float

Get the score value for an option by index.

PARAMETER	DESCRIPTION
`index`	Zero-based index of the option. TYPE: `int`

RETURNS	DESCRIPTION
`float`	The score value for the option.

RAISES	DESCRIPTION
`ValueError`	If this is a binary criterion or index is out of range.

Source code in src/autorubric/types.py

def get_option_value(self, index: int) -> float:
    """Get the score value for an option by index.

    Args:
        index: Zero-based index of the option.

    Returns:
        The score value for the option.

    Raises:
        ValueError: If this is a binary criterion or index is out of range.
    """
    if self.options is None:
        raise ValueError("Binary criterion has no options")
    if index < 0 or index >= len(self.options):
        raise ValueError(f"Option index {index} out of range [0, {len(self.options)})")
    return self.options[index].value

find_option_by_label ¶

find_option_by_label(label: str) -> int

Find option index by label (case-insensitive; leading and trailing whitespace is ignored).

Used for resolving ground truth labels to indices for metrics computation.

PARAMETER	DESCRIPTION
`label`	The label to search for. TYPE: `str`

RETURNS	DESCRIPTION
`int`	Zero-based index of the matching option.

RAISES	DESCRIPTION
`ValueError`	If this is a binary criterion or label not found.

Source code in src/autorubric/types.py

def find_option_by_label(self, label: str) -> int:
    """Find option index by label (case-insensitive, whitespace-normalized).

    Used for resolving ground truth labels to indices for metrics computation.

    Args:
        label: The label to search for.

    Returns:
        Zero-based index of the matching option.

    Raises:
        ValueError: If this is a binary criterion or label not found.
    """
    if self.options is None:
        raise ValueError("Binary criterion has no options")
    normalized_label = label.strip().lower()
    for i, opt in enumerate(self.options):
        if opt.label.strip().lower() == normalized_label:
            return i
    available = [opt.label for opt in self.options]
    raise ValueError(f"Label '{label}' not found. Available: {available}")
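
The matching rule is easiest to see in isolation. The following is a standalone sketch, not the library method: `find_label_index` is a hypothetical helper that applies the same `strip()` plus `lower()` comparison to a plain list of labels, and the labels themselves are made up for illustration.

```python
def find_label_index(labels: list[str], label: str) -> int:
    """Return the index of the first label that matches after strip() + lower()."""
    wanted = label.strip().lower()
    for i, candidate in enumerate(labels):
        if candidate.strip().lower() == wanted:
            return i
    raise ValueError(f"Label '{label}' not found. Available: {labels}")


labels = ["Very Satisfied", "Neutral", "Dissatisfied"]
print(find_label_index(labels, "  neutral "))      # 1
print(find_label_index(labels, "VERY SATISFIED"))  # 0

# Internal whitespace is not collapsed, so a doubled space does not match.
try:
    find_label_index(labels, "very  satisfied")
except ValueError as exc:
    print(exc)
```

Because only the ends of the string are trimmed, ground-truth labels with irregular internal spacing need to be cleaned before lookup.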

worst_option_among ¶

worst_option_among(candidate_indices: Iterable[int]) -> int

Return the score-minimizing option index among candidate_indices.

Weight-sign aware: for non-negative weight the worst option has the lowest value; for negative weight it has the highest value (a high value on a negative-weight criterion subtracts more from the score). Value ties resolve to the lowest index, independent of the order of candidate_indices.

This is the canonical tie-break shared by ensemble vote aggregation (mode/weighted_mode count/weight ties and mean/median snap ties, in criterion_grader.py) and `worst_scored_option`, so scoring, the grader's unknown-error path, and aggregation tie-breaking cannot drift.

PARAMETER	DESCRIPTION
`candidate_indices`	Indices into `self.options` to choose among. TYPE: `Iterable[int]`

RETURNS	DESCRIPTION
`int`	The score-minimizing index (lowest `value` for weight ≥ 0, highest for weight < 0; lowest index on a value tie).

RAISES	DESCRIPTION
`ValueError`	If this is a binary criterion (no options) or `candidate_indices` is empty.

Source code in src/autorubric/types.py

def worst_option_among(self, candidate_indices: Iterable[int]) -> int:
    """Return the score-minimizing option index among ``candidate_indices``.

    Weight-sign aware: for non-negative weight the worst option has the lowest
    ``value``; for negative weight it has the highest ``value`` (a high value on a
    negative-weight criterion subtracts more from the score). Value ties resolve to
    the **lowest index**, independent of the order of ``candidate_indices``.

    This is the canonical tie-break shared by ensemble vote aggregation
    (``mode``/``weighted_mode`` count/weight ties and ``mean``/``median`` snap ties,
    in ``criterion_grader.py``) and :meth:`worst_scored_option`, so scoring, the
    grader's ``unknown``-error path, and aggregation tie-breaking cannot drift.

    Args:
        candidate_indices: Indices into ``self.options`` to choose among.

    Returns:
        The score-minimizing index (lowest ``value`` for weight ≥ 0, highest for
        weight < 0; lowest index on a value tie).

    Raises:
        ValueError: If this is a binary criterion (no options) or
            ``candidate_indices`` is empty.
    """
    if self.options is None:
        raise ValueError("Binary criterion has no scored options")
    candidates = list(candidate_indices)
    if not candidates:
        raise ValueError("No candidate options to choose among")
    options = self.options
    if self.weight < 0:
        # Highest value is worst; on a value tie prefer the lowest index (-i maximal).
        return max(candidates, key=lambda i: (options[i].value, -i))
    # Lowest value is worst; on a value tie prefer the lowest index.
    return min(candidates, key=lambda i: (options[i].value, i))
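
The tuple keys are what make the tie-break independent of candidate order. A minimal standalone sketch, assuming option values are held in a plain list (`worst_index` is a hypothetical stand-in for the method, with the weight passed explicitly):

```python
def worst_index(values: list[float], weight: float, candidates: list[int]) -> int:
    """Pick the score-minimizing index, breaking value ties toward the lowest index."""
    if not candidates:
        raise ValueError("No candidate options to choose among")
    if weight < 0:
        # Highest value is worst; negating the index makes the lowest index win ties.
        return max(candidates, key=lambda i: (values[i], -i))
    # Lowest value is worst; the index itself breaks ties toward the lowest index.
    return min(candidates, key=lambda i: (values[i], i))


values = [0.0, 0.5, 0.0, 1.0, 1.0]
print(worst_index(values, 10.0, [2, 0, 1]))  # 0: indices 0 and 2 tie at 0.0
print(worst_index(values, -5.0, [4, 3, 1]))  # 3: indices 3 and 4 tie at 1.0
print(worst_index(values, 10.0, [4, 1]))     # 1: 0.5 is lower than 1.0
```

In the first two calls the candidates are deliberately listed with the higher index first, yet the lower index is still returned.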

worst_scored_option ¶

worst_scored_option() -> tuple[int, CriterionOption]

Return (index, option) of the score-minimizing scored (non-NA) option.

Weight-sign aware: for non-negative weight, returns the option with the lowest value; for negative weight, returns the option with the highest value (the worst case flips because a high value on a negative-weight criterion subtracts more from the score). NA options are excluded — this returns the score-minimizing scored option, the analog of binary UNMET (for positive weight) or MET (for negative weight).

Ties resolve to the lowest index (delegates to `worst_option_among` over the non-NA indices).

Shared by the grader's unknown-error worst-case path (criterion_grader.py) and the metrics' na_mode="as_unmet" remap (metrics/_helpers.py) so the two layers cannot drift.

RETURNS	DESCRIPTION
`tuple[int, CriterionOption]`	Tuple of `(index, CriterionOption)` for the score-minimizing non-NA option.

RAISES	DESCRIPTION
`ValueError`	If this is a binary criterion or has no non-NA option. The `Criterion` validator guarantees ≥2 non-NA options for multi-choice criteria, so the no-non-NA case is defensive.

Source code in src/autorubric/types.py

def worst_scored_option(self) -> tuple[int, CriterionOption]:
    """Return (index, option) of the score-minimizing scored (non-NA) option.

    Weight-sign aware: for non-negative weight, returns the option with the
    lowest ``value``; for negative weight, returns the option with the
    highest ``value`` (the worst case flips because a high ``value`` on a
    negative-weight criterion subtracts more from the score). NA options
    are excluded — this returns the score-minimizing *scored* option, the
    analog of binary UNMET (for positive weight) or MET (for negative
    weight).

    Ties resolve to the lowest index (delegates to
    :meth:`worst_option_among` over the non-NA indices).

    Shared by the grader's ``unknown``-error worst-case path
    (``criterion_grader.py``) and the metrics' ``na_mode="as_unmet"`` remap
    (``metrics/_helpers.py``) so the two layers cannot drift.

    Returns:
        Tuple of ``(index, CriterionOption)`` for the score-minimizing
        non-NA option.

    Raises:
        ValueError: If this is a binary criterion or has no non-NA option.
            The ``Criterion`` validator guarantees ≥2 non-NA options for
            multi-choice criteria, so the no-non-NA case is defensive.
    """
    if self.options is None:
        raise ValueError("Binary criterion has no scored options")
    scored = [i for i, opt in enumerate(self.options) if not opt.na]
    if not scored:
        raise ValueError("Criterion has no non-NA option")
    idx = self.worst_option_among(scored)
    return idx, self.options[idx]

with_guaranteed_na_option ¶

with_guaranteed_na_option() -> Criterion

Return a multi-choice criterion guaranteed to expose an NA/abstain option.

Gives the judge a first-class "cannot assess" channel analogous to binary CriterionVerdict.CANNOT_ASSESS. If the criterion already has an NA option (author intent), returns self unchanged. Otherwise returns a copy with a single `CANONICAL_NA_OPTION` appended at the end (highest index) so existing option indices 0..N-1 stay stable for ground-truth alignment, shuffle-order mapping, and `worst_scored_option`.

This is a pure function of the criterion (no RNG, no external state), so the grader and the metrics layer can both reconstruct the identical effective option set without drifting.

RETURNS	DESCRIPTION
`Criterion`	`self` when an NA option is already present, else a new `Criterion` with the canonical NA option appended.

RAISES	DESCRIPTION
`ValueError`	If this is a binary criterion (no options).

Source code in src/autorubric/types.py

def with_guaranteed_na_option(self) -> "Criterion":
    """Return a multi-choice criterion guaranteed to expose an NA/abstain option.

    Gives the judge a first-class "cannot assess" channel analogous to binary
    ``CriterionVerdict.CANNOT_ASSESS``. If the criterion already has an
    NA option (author intent), returns ``self`` unchanged. Otherwise returns a
    copy with a single :data:`CANONICAL_NA_OPTION` **appended at the end**
    (highest index) so existing option indices ``0..N-1`` stay stable for
    ground-truth alignment, shuffle-order mapping, and
    :meth:`worst_scored_option`.

    This is a pure function of the criterion (no RNG, no external state), so the
    grader and the metrics layer can both reconstruct the identical effective
    option set without drifting.

    Returns:
        ``self`` when an NA option is already present, else a new ``Criterion``
        with the canonical NA option appended.

    Raises:
        ValueError: If this is a binary criterion (no options).
    """
    if self.options is None:
        raise ValueError("Binary criterion has no options")
    if self.na_option_index is not None:
        return self
    return self.model_copy(update={"options": [*self.options, CANONICAL_NA_OPTION]})
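
The append-at-the-end behavior and the "return self unchanged" shortcut can be illustrated without the library. This is a sketch under stated assumptions: `Option`, `NA_PLACEHOLDER`, `first_na_index`, and `ensure_na_option` are hypothetical stand-ins, and the placeholder's label and value are invented because the contents of `CANONICAL_NA_OPTION` are not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Option:
    """Hypothetical stand-in for CriterionOption (label, value, NA flag)."""
    label: str
    value: float
    na: bool = False


# Placeholder only: the real CANONICAL_NA_OPTION's label and value may differ.
NA_PLACEHOLDER = Option(label="N/A", value=0.0, na=True)


def first_na_index(options: list[Option]) -> Optional[int]:
    """Index of the first NA option, or None when there is none."""
    return next((i for i, opt in enumerate(options) if opt.na), None)


def ensure_na_option(options: list[Option]) -> list[Option]:
    """Return the same list if it already has an NA option, else a copy with one appended."""
    if first_na_index(options) is not None:
        return options
    return [*options, NA_PLACEHOLDER]


scale = [Option("poor", 0.0), Option("good", 1.0)]
extended = ensure_na_option(scale)
print([opt.label for opt in extended])         # ['poor', 'good', 'N/A']
print(first_na_index(extended))                # 2
print(ensure_na_option(extended) is extended)  # True: already has an NA option
print(extended[:2] == scale)                   # True: original indices are unchanged
```

Appending rather than inserting is what keeps indices 0..N-1 valid for any ground-truth labels recorded against the original options.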

validate_options ¶

validate_options() -> Criterion

Validate multi-choice options if present.

Source code in src/autorubric/types.py

@model_validator(mode="after")
def validate_options(self) -> "Criterion":
    """Validate multi-choice options if present."""
    if self.options is not None:
        if len(self.options) < 2:
            raise ValueError("Multi-choice criterion must have at least 2 options")
        # Ensure at least 2 non-NA options
        non_na = [o for o in self.options if not o.na]
        if len(non_na) < 2:
            raise ValueError("Must have at least 2 non-NA options")
    return self

CriterionVerdict¶

Enum representing the verdict for a criterion.

CriterionVerdict ¶

Bases: str, Enum

Status of a criterion evaluation.

MET: The criterion is satisfied by the submission
UNMET: The criterion is not satisfied by the submission
CANNOT_ASSESS: Insufficient evidence to make a determination

CriterionReport¶

Per-criterion result with verdict and explanation.

CriterionReport ¶

Bases: Criterion

A criterion with its evaluation result.

Supports both binary (MET/UNMET/CANNOT_ASSESS) and multi-choice verdicts. For binary criteria, use verdict. For multi-choice, use multi_choice_verdict.

ATTRIBUTE	DESCRIPTION
`verdict`	Binary verdict (MET/UNMET/CANNOT_ASSESS). None for multi-choice criteria. TYPE: `CriterionVerdict \| None`
`multi_choice_verdict`	Multi-choice verdict with selected option. None for binary. TYPE: `MultiChoiceVerdict \| AggregatedMultiChoiceVerdict \| None`
`reason`	The judge's brief, final justification for the verdict. When thinking is enabled this is the concise conclusion the judge distilled from its `reasoning` deliberation trace below; when thinking is disabled `reasoning` is None and `reason` stands alone. TYPE: `str`
`shuffle_order`	Permutation used when presenting multi-choice options to the LLM. Maps shuffled position → original index. None for binary criteria or when shuffle_options is disabled. TYPE: `list[int] \| None`
`error`	Set when this verdict was synthesized because the judge call failed, rather than produced by a genuine judgment. The string is prefixed with the failure category (`"infrastructure: ..."`, `"parse: ..."`, or `"unknown: ..."`). None for genuine verdicts. See `is_error`. TYPE: `str \| None`
`reasoning`	The judge's verbose extended-thinking deliberation trace — the chain of thought produced before settling on `verdict`/`reason` (the provider's `reasoning_content` channel). Populated only when thinking is enabled; None otherwise. `reason` is the conclusion distilled from this. TYPE: `str \| None`

score_value `property` ¶

score_value: float

Get the score contribution (0-1) for this criterion.

For binary criteria: 1.0 if MET, 0.0 otherwise. For multi-choice: the value of the selected option.

is_na `property` ¶

is_na: bool

Check if this criterion was marked NA or CANNOT_ASSESS.

Returns True for:

- Binary criteria with CANNOT_ASSESS verdict
- Multi-choice criteria with NA option selected

is_error `property` ¶

is_error: bool

Whether this verdict was synthesized due to a judge-call failure.

Use this instead of inspecting reason to distinguish error-induced verdicts from genuine judgments.
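
When the failure category itself matters (for example, to retry only infrastructure failures), the documented prefix on `error` can be split off. A minimal sketch: `error_category` is a hypothetical helper, the sample error strings are invented, and returning "unrecognized" for an unexpected prefix is an assumption of this sketch.

```python
from typing import Optional

# Categories documented for the `error` attribute of CriterionReport.
KNOWN_CATEGORIES = ("infrastructure", "parse", "unknown")


def error_category(error: Optional[str]) -> Optional[str]:
    """Return the failure category prefix, or None for a genuine verdict."""
    if error is None:
        return None
    prefix, sep, _detail = error.partition(":")
    prefix = prefix.strip()
    if sep and prefix in KNOWN_CATEGORIES:
        return prefix
    # Assumption of this sketch: flag anything without a known prefix.
    return "unrecognized"


print(error_category(None))                              # None
print(error_category("parse: judge returned bad JSON"))  # parse
print(error_category("infrastructure: request timed out"))  # infrastructure
print(error_category("something else"))                  # unrecognized
```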

CriterionJudgment¶

Structured output from LLM judge for a single criterion.

CriterionJudgment ¶

Bases: BaseModel

Structured LLM output for single criterion evaluation.

Used with LiteLLM's response_format parameter to ensure type-safe, validated responses from the judge LLM.

Note: This is separate from CriterionReport because:

- CriterionReport includes 'weight' and 'requirement' fields that come from the rubric, not from the LLM
- The LLM only outputs the judgment (status + explanation)

explanation is the judge's brief, final justification. reasoning is the verbose extended-thinking deliberation trace behind it (populated only when thinking is enabled): when present, explanation is the concise conclusion the judge distilled from reasoning.

Rubric¶

Collection of criteria for evaluation.

Rubric ¶

Rubric(rubric: list[Criterion])

A rubric is a list of criteria used to evaluate text outputs.

Each criterion has a weight and requirement. Use the grade() method to evaluate text against this rubric using a grader.

Source code in src/autorubric/rubric.py

def __init__(self, rubric: list[Criterion]):
    self.rubric = rubric

grade `async` ¶

grade(to_grade: ToGradeInput, grader: Grader, query: str | None = None, reference_submission: str | None = None) -> EvaluationReport

Grade text against this rubric using a grader.

PARAMETER	DESCRIPTION
`to_grade`	The text to evaluate. Can be either a string (optionally with `<thinking>`/`<output>` markers) or a dict with 'thinking' and 'output' keys. TYPE: `ToGradeInput`
`grader`	The grader to use. REQUIRED - must be provided. Configure length_penalty and normalize on the grader if needed. TYPE: `Grader`
`query`	Optional input/query that prompted the response. TYPE: `str \| None` DEFAULT: `None`
`reference_submission`	Optional exemplar response for grading context. When present, provides calibration for the grader. TYPE: `str \| None` DEFAULT: `None`

RAISES	DESCRIPTION
`TypeError`	If grader is not provided.

Source code in src/autorubric/rubric.py

async def grade(
    self,
    to_grade: ToGradeInput,
    grader: Grader,
    query: str | None = None,
    reference_submission: str | None = None,
) -> EvaluationReport:
    """Grade text against this rubric using a grader.

    Args:
        to_grade: The text to evaluate. Can be either:
            - A string (optionally with <thinking>/<output> markers)
            - A dict with 'thinking' and 'output' keys
        grader: The grader to use. REQUIRED - must be provided.
            Configure length_penalty and normalize on the grader if needed.
        query: Optional input/query that prompted the response.
        reference_submission: Optional exemplar response for grading context.
            When present, provides calibration for the grader.

    Raises:
        TypeError: If grader is not provided.
    """
    return await grader.grade(
        to_grade=to_grade,
        rubric=self.rubric,
        query=query,
        reference_submission=reference_submission,
    )

validate_and_create_criteria `staticmethod` ¶

validate_and_create_criteria(data: list[dict[str, Any]] | dict[str, Any]) -> list[Criterion]

Validate and create Criterion objects from raw data.

Supports multiple formats:

- Flat list of criteria
- List of sections with criteria
- Dict with 'sections' key containing list of sections
- Dict with 'rubric' key containing sections

Source code in src/autorubric/rubric.py

@staticmethod
def validate_and_create_criteria(
    data: list[dict[str, Any]] | dict[str, Any],
) -> list[Criterion]:
    """Validate and create Criterion objects from raw data.

    Supports multiple formats:
    - Flat list of criteria
    - List of sections with criteria
    - Dict with 'sections' key containing list of sections
    - Dict with 'rubric' key containing sections
    """
    if isinstance(data, dict):
        if "rubric" in data:
            data = data["rubric"]

        if isinstance(data, dict):
            if "sections" in data:
                sections = data["sections"]
                if not isinstance(sections, list):
                    raise ValueError(
                        f"Invalid rubric format. Expected 'sections' to be a list, "
                        f"got {type(sections).__name__}"
                    )
                data = sections
            else:
                raise ValueError(
                    "Invalid rubric format. Dict must contain either 'sections' or 'rubric' key"
                )

    if not isinstance(data, list):
        raise ValueError(f"Invalid rubric format. Expected a list, got {type(data).__name__}")

    if not data:
        raise ValueError("No criteria found")

    flattened_criteria_data = []
    for idx, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(
                f"Invalid item at index {idx}: expected a dictionary, got {type(item).__name__}"
            )

        if "criteria" in item:
            section_criteria = item["criteria"]
            if not isinstance(section_criteria, list):
                raise ValueError(
                    f"Invalid section at index {idx}: 'criteria' must be a list, "
                    f"got {type(section_criteria).__name__}"
                )
            flattened_criteria_data.extend(section_criteria)
        else:
            flattened_criteria_data.append(item)

    if not flattened_criteria_data:
        raise ValueError("No criteria found")

    criteria = []
    for idx, criterion_data in enumerate(flattened_criteria_data):
        if not isinstance(criterion_data, dict):
            raise ValueError(
                f"Invalid criterion at index {idx}: expected a dictionary, "
                f"got {type(criterion_data).__name__}"
            )

        try:
            criteria.append(Criterion(**criterion_data))  # type: ignore[arg-type]
        except ValidationError as e:
            error_details = []
            for error in e.errors():
                field = ".".join(str(loc) for loc in error["loc"])
                error_details.append(f"{field}: {error['msg']}")

            error_msg = f"Invalid criterion at index {idx}:\n  " + "\n  ".join(error_details)
            raise ValueError(error_msg) from e
        except Exception as e:
            raise ValueError(f"Failed to create criterion at index {idx}: {e}") from e

    return criteria
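
The four accepted shapes all reduce to the same flat list of criterion dicts. The sketch below reproduces only that flattening step, without type validation or `Criterion` construction. `flatten_criteria` is a hypothetical helper, and the section `name` keys are illustrative (the listing above only looks for a `criteria` key in each section).

```python
from typing import Any


def flatten_criteria(data: Any) -> list[dict[str, Any]]:
    """Reduce any of the accepted shapes to a flat list of criterion dicts."""
    if isinstance(data, dict):
        if "rubric" in data:
            data = data["rubric"]  # unwrap the outer 'rubric' key first
        if isinstance(data, dict):
            if "sections" not in data:
                raise ValueError("Dict must contain either 'sections' or 'rubric' key")
            data = data["sections"]
    if not isinstance(data, list) or not data:
        raise ValueError("Expected a non-empty list")
    flat: list[dict[str, Any]] = []
    for item in data:
        if "criteria" in item:
            flat.extend(item["criteria"])  # a section: splice in its criteria
        else:
            flat.append(item)              # a bare criterion
    return flat


a = {"weight": 10.0, "requirement": "States the correct answer"}
b = {"weight": 5.0, "requirement": "Explains reasoning clearly"}

flat_list = [a, b]
section_list = [{"name": "content", "criteria": [a]}, {"name": "style", "criteria": [b]}]
sections_dict = {"sections": section_list}
rubric_dict = {"rubric": {"sections": section_list}}

for shape in (flat_list, section_list, sections_dict, rubric_dict):
    print(flatten_criteria(shape) == [a, b])  # True for every shape
```

Since the `rubric` key is unwrapped before the type check, a dict such as `{"rubric": [a, b]}` follows the flat-list path as well.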

from_yaml `classmethod` ¶

from_yaml(yaml_string: str) -> Rubric

Parse rubric from a YAML string.

Source code in src/autorubric/rubric.py

@classmethod
def from_yaml(cls, yaml_string: str) -> Rubric:
    """Parse rubric from a YAML string."""
    try:
        data = yaml.safe_load(yaml_string)
    except yaml.YAMLError as e:
        raise ValueError(f"Failed to parse YAML string: {e}") from e

    criteria = cls.validate_and_create_criteria(data)
    return cls(criteria)

from_json `classmethod` ¶

from_json(json_string: str) -> Rubric

Parse rubric from a JSON string.

Source code in src/autorubric/rubric.py

@classmethod
def from_json(cls, json_string: str) -> Rubric:
    """Parse rubric from a JSON string."""
    try:
        data = json.loads(json_string)
    except json.JSONDecodeError as e:
        raise ValueError(f"Failed to parse JSON string: {e}") from e

    criteria = cls.validate_and_create_criteria(data)
    return cls(criteria)

from_file `classmethod` ¶

from_file(source: str | Any) -> Rubric

Load rubric from a file path or file-like object, auto-detecting format.

Source code in src/autorubric/rubric.py

@classmethod
def from_file(cls, source: str | Any) -> Rubric:
    """Load rubric from a file path or file-like object, auto-detecting format."""
    if hasattr(source, "read"):
        file_name = getattr(source, "name", "")  # type: ignore[arg-type]
        extension = Path(file_name).suffix.lower() if file_name else ""

        if not extension:
            raise ValueError(
                "Cannot determine file format from file object. "
                "File object must have a 'name' attribute with a file extension."
            )

        try:
            content = source.read()  # type: ignore[misc]
        except Exception as e:
            raise ValueError(f"Failed to read from file object: {e}") from e

        if extension in [".yaml", ".yml"]:
            try:
                data = yaml.safe_load(content)
            except yaml.YAMLError as e:
                raise ValueError(f"Failed to parse YAML from file object: {e}") from e
            criteria = cls.validate_and_create_criteria(data)
            return cls(criteria)
        elif extension == ".json":
            try:
                data = json.loads(content)
            except json.JSONDecodeError as e:
                raise ValueError(f"Failed to parse JSON from file object: {e}") from e
            criteria = cls.validate_and_create_criteria(data)
            return cls(criteria)
        else:
            raise ValueError(
                f"Unsupported file format '{extension}' for file object: {file_name}\n"
                f"Supported formats: .yaml, .yml, .json"
            )

    elif isinstance(source, str):
        path = Path(source)

        if not path.exists():
            raise FileNotFoundError(f"File not found: {source}")

        extension = path.suffix.lower()

        if extension in [".yaml", ".yml"]:
            with open(source, encoding="utf-8") as f:
                try:
                    data = yaml.safe_load(f)
                except yaml.YAMLError as e:
                    raise ValueError(f"Failed to parse YAML file: {e}") from e
            criteria = cls.validate_and_create_criteria(data)
            return cls(criteria)
        elif extension == ".json":
            with open(source, encoding="utf-8") as f:
                try:
                    data = json.load(f)
                except json.JSONDecodeError as e:
                    raise ValueError(f"Failed to parse JSON file: {e}") from e
            criteria = cls.validate_and_create_criteria(data)
            return cls(criteria)
        else:
            raise ValueError(
                f"Unsupported file format '{extension}' for file: {source}\n"
                f"Supported formats: .yaml, .yml, .json"
            )
    else:
        raise ValueError(
            f"Invalid source type: expected str (file path) or file-like object, "
            f"got {type(source).__name__}"
        )
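
A practical consequence of the extension-based detection is that an in-memory buffer is rejected unless it carries a `name`. The following standalone sketch isolates that detection step. `detect_format` is a hypothetical helper that skips the file-existence check and the actual parsing, and the file names are made up.

```python
import io
from pathlib import Path


def detect_format(source) -> str:
    """Return 'yaml' or 'json' from the extension, following the branches listed above."""
    if hasattr(source, "read"):
        name = getattr(source, "name", "")
        extension = Path(name).suffix.lower() if name else ""
        if not extension:
            raise ValueError("File object must have a 'name' attribute with a file extension.")
    elif isinstance(source, str):
        extension = Path(source).suffix.lower()
    else:
        raise ValueError(f"Invalid source type: {type(source).__name__}")
    if extension in (".yaml", ".yml"):
        return "yaml"
    if extension == ".json":
        return "json"
    raise ValueError(f"Unsupported file format '{extension}'")


print(detect_format("rubrics/essay.YML"))  # yaml (the suffix is lowercased)

buffer = io.StringIO('[{"requirement": "States the correct answer"}]')
try:
    detect_format(buffer)  # a fresh StringIO has no `name`
except ValueError as exc:
    print(exc)

buffer.name = "rubric.json"   # attach a name so the extension can be read
print(detect_format(buffer))  # json
```

For content that is already in memory, `from_yaml` or `from_json` avoids the need for a named file object.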

compute_score ¶

compute_score(verdicts: list[CriterionVerdict | str], normalize: bool = True, cannot_assess_strategy: CannotAssessStrategy = SKIP, partial_credit: float = 0.5) -> float

Compute a weighted score from raw verdicts against this rubric.

Single source of truth for scoring from verdict lists (e.g. ground truth labels). Handles binary (MET/UNMET/CANNOT_ASSESS) and multi-choice (option label strings) criteria.

Parses and validates each verdict into a CriterionReport and delegates to the shared score_reports core, so this path agrees exactly with the live grader and RubricDataset.compute_weighted_score across every CannotAssessStrategy x {binary, multi-choice} x {+/- weight}.

PARAMETER	DESCRIPTION
`verdicts`	One value per criterion. Binary criteria accept CriterionVerdict or its string form; multi-choice criteria accept an option label string. TYPE: `list[CriterionVerdict \| str]`
`normalize`	If True, normalise to [0, 1]. If False, return the raw weighted sum. TYPE: `bool` DEFAULT: `True`
`cannot_assess_strategy`	How to handle CANNOT_ASSESS / NA verdicts. TYPE: `CannotAssessStrategy` DEFAULT: `SKIP`
`partial_credit`	Credit fraction when strategy is PARTIAL. TYPE: `float` DEFAULT: `0.5`

RETURNS	DESCRIPTION
`float`	The computed score.

Source code in src/autorubric/rubric.py

def compute_score(
    self,
    verdicts: list[CriterionVerdict | str],
    normalize: bool = True,
    cannot_assess_strategy: CannotAssessStrategy = CannotAssessStrategy.SKIP,
    partial_credit: float = 0.5,
) -> float:
    """Compute a weighted score from raw verdicts against this rubric.

    Single source of truth for scoring from verdict lists (e.g. ground truth
    labels). Handles binary (MET/UNMET/CANNOT_ASSESS) and multi-choice
    (option label strings) criteria.

    Parses and validates each verdict into a ``CriterionReport`` and delegates
    to the shared ``score_reports`` core, so this path agrees exactly with the
    live grader and ``RubricDataset.compute_weighted_score`` across every
    ``CannotAssessStrategy`` x {binary, multi-choice} x {+/- weight}.

    Args:
        verdicts: One value per criterion. Binary criteria accept
            CriterionVerdict or its string form; multi-choice criteria
            accept an option label string.
        normalize: If True, normalise to [0, 1]. If False, return the raw
            weighted sum.
        cannot_assess_strategy: How to handle CANNOT_ASSESS / NA verdicts.
        partial_credit: Credit fraction when strategy is PARTIAL.

    Returns:
        The computed score.
    """
    if len(verdicts) != len(self.rubric):
        raise ValueError(f"Expected {len(self.rubric)} verdicts, got {len(verdicts)}")

    reports: list[CriterionReport] = []
    for criterion, verdict in zip(self.rubric, verdicts):
        if criterion.is_multi_choice:
            if not isinstance(verdict, str):
                raise ValueError(
                    f"Multi-choice criterion '{criterion.name}' requires a "
                    f"label string, got {type(verdict).__name__}"
                )
            idx = criterion.find_option_by_label(verdict)
            opt = criterion.options[idx]  # type: ignore[index]
            reports.append(
                CriterionReport(
                    requirement=criterion.requirement,
                    name=criterion.name,
                    weight=criterion.weight,
                    options=criterion.options,
                    scale_type=criterion.scale_type,
                    aggregation=criterion.aggregation,
                    multi_choice_verdict=MultiChoiceVerdict(
                        selected_index=idx,
                        selected_label=opt.label,
                        value=opt.value,
                        na=opt.na,
                    ),
                    reason="",
                )
            )
        else:
            if isinstance(verdict, str):
                try:
                    verdict = CriterionVerdict(verdict)
                except ValueError:
                    raise ValueError(
                        f"Invalid binary verdict '{verdict}'. "
                        f"Must be 'MET', 'UNMET', or 'CANNOT_ASSESS'."
                    ) from None
            reports.append(
                CriterionReport(
                    requirement=criterion.requirement,
                    name=criterion.name,
                    weight=criterion.weight,
                    verdict=verdict,
                    reason="",
                )
            )

    config = CannotAssessConfig(strategy=cannot_assess_strategy, partial_credit=partial_credit)
    return score_reports(reports, config, normalize)
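
The binary case of this computation can be sketched without the library. The block below is a minimal sketch that follows the formula in the Score Calculation section; `Verdict` and `sketch_score` are hypothetical stand-ins rather than autorubric APIs, and the sketch deliberately leaves out CANNOT_ASSESS handling and multi-choice criteria. Returning 0.0 when no criterion has a positive weight is an assumption made here to avoid a division by zero.

```python
from __future__ import annotations

from enum import Enum


class Verdict(str, Enum):
    """Hypothetical stand-in for CriterionVerdict (binary verdicts only)."""

    MET = "MET"
    UNMET = "UNMET"


def sketch_score(
    weights: list[float],
    verdicts: list[Verdict | str],
    normalize: bool = True,
) -> float:
    """Weighted sum of MET criteria, optionally normalised and clamped to [0, 1]."""
    if len(weights) != len(verdicts):
        raise ValueError(f"Expected {len(weights)} verdicts, got {len(verdicts)}")

    # Accept either the enum or its string form; invalid strings raise ValueError.
    parsed = [Verdict(v) for v in verdicts]

    # Only MET criteria contribute; negative weights act as penalties.
    raw = sum(w for w, v in zip(weights, parsed) if v is Verdict.MET)
    if not normalize:
        return raw

    # The denominator counts positive weights only, as in the formula.
    positive_total = sum(max(0.0, w) for w in weights)
    if positive_total == 0:
        return 0.0  # assumption: no positive weight means nothing to earn
    return max(0.0, min(1.0, raw / positive_total))
```

With weights `[10.0, 5.0, -15.0]`, meeting only the two positive criteria gives 1.0, while meeting the first and the penalty criterion gives a raw sum of -5.0 that is clamped to 0.0 when normalised.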

from_dict `classmethod` ¶

from_dict(data: list[dict[str, Any]] | dict[str, Any]) -> Rubric

Create rubric from a list of dictionaries or a dict with sections.

Source code in src/autorubric/rubric.py

@classmethod
def from_dict(cls, data: list[dict[str, Any]] | dict[str, Any]) -> Rubric:
    """Create rubric from a list of dictionaries or a dict with sections."""
    criteria = cls.validate_and_create_criteria(data)
    return cls(criteria)
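
As a rough illustration of the list form only, the sketch below turns a list of dictionaries into criterion records. `SketchCriterion` and `criteria_from_list` are hypothetical stand-ins, not the library's `validate_and_create_criteria`; the 10.0 default weight is taken from the comment in the Quick Example, and the sectioned dict form is not covered because its layout is not shown on this page.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class SketchCriterion:
    """Hypothetical stand-in for Criterion (binary fields only)."""

    requirement: str
    weight: float = 10.0  # default weight, per the Quick Example comment
    name: str | None = None  # name is optional


def criteria_from_list(data: list[dict[str, Any]]) -> list[SketchCriterion]:
    """Build criteria from a list of dicts, applying the default weight."""
    criteria: list[SketchCriterion] = []
    for i, item in enumerate(data):
        if "requirement" not in item:
            raise ValueError(f"Criterion {i} is missing 'requirement'")
        criteria.append(
            SketchCriterion(
                requirement=item["requirement"],
                weight=float(item.get("weight", 10.0)),
                name=item.get("name"),
            )
        )
    return criteria
```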

EvaluationReport¶

Complete grading result with score and per-criterion reports.

EvaluationReport ¶

Bases: BaseModel

Final evaluation result with score and per-criterion reports.

For training use cases, set normalize=False in the grader to get raw weighted sums instead of normalized 0-1 scores.

ATTRIBUTE	DESCRIPTION
`score`	The final score (0-1 if normalized, raw weighted sum otherwise). `None` only when grading failed (an error report); the normal grading path always computes a real float. Consumers must skip `None` scores (most already skip reports where `error is not None`). TYPE: `float \| None`
`raw_score`	The unnormalized weighted sum. `None` only on a failed/empty report. TYPE: `float \| None`
`llm_raw_score`	The original score returned by the LLM (same as raw_score). TYPE: `float \| None`
`report`	Per-criterion breakdown with verdicts and explanations. TYPE: `list[CriterionReport] \| None`
`cannot_assess_count`	Number of criteria with CANNOT_ASSESS verdict. TYPE: `int`
`error`	Optional error message if grading failed (e.g., JSON parse error). When set, score/raw_score are `None`: a failure has no score, and a fabricated 0.0 would be indistinguishable from a real catastrophic score. Training pipelines should filter these reports out. TYPE: `str \| None`
`token_usage`	Aggregated token usage across all LLM calls made during grading. For CriterionGrader, this is the sum across all criterion evaluations. TYPE: `TokenUsage \| None`
`completion_cost`	Total cost in USD for all LLM calls made during grading. Calculated using LiteLLM's completion_cost() function. TYPE: `float \| None`

Example

result = await rubric.grade(to_grade=response, grader=grader)
# score is None when grading failed, so guard before formatting.
if result.score is not None:
    print(f"Score: {result.score:.2f}")
if result.cannot_assess_count:
    print(f"Could not assess {result.cannot_assess_count} criteria")
if result.token_usage:
    print(f"Tokens: {result.token_usage.total_tokens}")
if result.completion_cost:
    print(f"Cost: ${result.completion_cost:.6f}")
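
The advice to filter failed reports out of training data can be sketched as follows. `SketchReport` is a hypothetical stand-in that mirrors only the `score` and `error` attributes described above; it is not the library's `EvaluationReport`. The point it illustrates is that a genuine 0.0 score is kept while a failed grade is dropped.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SketchReport:
    """Hypothetical stand-in with only the `score` and `error` attributes."""

    score: float | None = None
    error: str | None = None


def usable_scores(reports: list[SketchReport]) -> list[float]:
    """Keep scores from successful grades; drop failed ones (error set or score None)."""
    return [r.score for r in reports if r.error is None and r.score is not None]


reports = [
    SketchReport(score=0.75),
    SketchReport(error="JSON parse error"),  # failed grade: no score
    SketchReport(score=0.0),  # a real zero, which must be kept
]
scores = usable_scores(reports)
mean = sum(scores) / len(scores) if scores else None
print(scores, mean)
```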

TokenUsage¶

Token usage tracking for LLM calls.

TokenUsage `dataclass` ¶

TokenUsage(prompt_tokens: int = 0, completion_tokens: int = 0, total_tokens: int = 0, cache_creation_input_tokens: int = 0, cache_read_input_tokens: int = 0)

Token usage statistics from LLM API calls.

ATTRIBUTE	DESCRIPTION
`prompt_tokens`	Number of tokens in the prompt/input. TYPE: `int`
`completion_tokens`	Number of tokens in the completion/output. TYPE: `int`
`total_tokens`	Total tokens (prompt + completion). TYPE: `int`
`cache_creation_input_tokens`	Tokens used to create cache entries (Anthropic). TYPE: `int`
`cache_read_input_tokens`	Tokens read from cache (Anthropic). TYPE: `int`

Example

usage = TokenUsage(prompt_tokens=100, completion_tokens=50, total_tokens=150)
print(f"Total tokens: {usage.total_tokens}")
# Output: Total tokens: 150
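
`EvaluationReport.token_usage` is described as the sum across all criterion evaluations. A field-by-field sum could look like the sketch below. `SketchTokenUsage` copies the field names from the signature above but is a local stand-in, and `sum_usage` is a hypothetical helper; how the library itself aggregates usage is not shown on this page.

```python
from dataclasses import dataclass, fields


@dataclass
class SketchTokenUsage:
    """Local stand-in with the same fields as the TokenUsage signature."""

    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    cache_creation_input_tokens: int = 0
    cache_read_input_tokens: int = 0


def sum_usage(usages: list[SketchTokenUsage]) -> SketchTokenUsage:
    """Add every counter field across a list of per-call usage records."""
    total = SketchTokenUsage()
    for usage in usages:
        for f in fields(SketchTokenUsage):
            setattr(total, f.name, getattr(total, f.name) + getattr(usage, f.name))
    return total


per_criterion = [
    SketchTokenUsage(prompt_tokens=100, completion_tokens=50, total_tokens=150),
    SketchTokenUsage(prompt_tokens=80, completion_tokens=20, total_tokens=100,
                     cache_read_input_tokens=40),
]
combined = sum_usage(per_criterion)
print(combined.total_tokens)
```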

ToGradeInput¶

Type alias for the input format accepted by rubric.grade().

ToGradeInput `module-attribute` ¶

ToGradeInput = str | ThinkingOutputDict

Union type for to_grade parameter.

Accepts either a plain string or a dict with thinking/output keys.

ThinkingOutputDict¶

TypedDict for responses with separate thinking and output sections.

ThinkingOutputDict ¶

Bases: TypedDict

Dict format for submissions with separate thinking and output sections.

Both fields are optional to allow partial submissions or gradual construction. When used with length penalty, missing fields are treated as empty strings.
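
A sketch of how the two accepted input shapes might be normalised is shown below. `ThinkingOutputSketch` and `split_submission` are hypothetical names; the `thinking` and `output` keys come from the description above, and missing keys fall back to empty strings as described for the length penalty. Treating a plain string as output with no thinking section is an assumption of this sketch.

```python
from __future__ import annotations

from typing import TypedDict


class ThinkingOutputSketch(TypedDict, total=False):
    """Stand-in for ThinkingOutputDict; total=False makes both keys optional."""

    thinking: str
    output: str


def split_submission(to_grade: str | ThinkingOutputSketch) -> tuple[str, str]:
    """Return (thinking, output), using empty strings for missing parts."""
    if isinstance(to_grade, str):
        # Assumption: a plain string is the output, with no thinking section.
        return "", to_grade
    return to_grade.get("thinking", ""), to_grade.get("output", "")
```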

ScaleType¶

Literal type alias for multi-choice criterion scale types (ordinal, nominal).

ScaleType `module-attribute` ¶

ScaleType = Literal['ordinal', 'nominal']

Scale type for multi-choice criteria.

ordinal: Options have inherent order (e.g., 1-4 satisfaction scale)
nominal: Options are unordered categories (e.g., "too few", "too many", "just right")
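
Because a `Literal` alias is only enforced by static type checkers, values loaded from YAML or JSON still need a runtime check. The sketch below shows one way to do that with `typing.get_args`; `ScaleTypeSketch` redefines the alias locally and `parse_scale_type` is a hypothetical helper, not part of the library.

```python
from typing import Literal, get_args

# Local copy of the alias shown above.
ScaleTypeSketch = Literal["ordinal", "nominal"]


def parse_scale_type(value: str) -> str:
    """Validate a scale type string against the Literal's allowed values."""
    allowed = get_args(ScaleTypeSketch)  # ("ordinal", "nominal")
    if value not in allowed:
        raise ValueError(f"Invalid scale type '{value}'. Must be one of {allowed}.")
    return value
```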
