Core Grading

Fundamental types for rubric-based evaluation: criteria, rubrics, verdicts, and evaluation reports.

Overview

The core grading module provides the foundational types for defining evaluation criteria and receiving grading results. A Rubric contains multiple Criterion objects, each with a weight and requirement. Grading produces an EvaluationReport with per-criterion verdicts and explanations.

Quick Example

from autorubric import Rubric, Criterion, CriterionVerdict, LLMConfig
from autorubric.graders import CriterionGrader

# Define criteria
rubric = Rubric([
    Criterion(name="accuracy", weight=10.0, requirement="States the correct answer"),
    Criterion(name="clarity", weight=5.0, requirement="Explains reasoning clearly"),
    Criterion(weight=-15.0, requirement="Contains factual errors"),  # name optional
])

# Or from dict/file
rubric = Rubric.from_dict([
    {"weight": 10.0, "requirement": "States the correct answer"},
    {"requirement": "Explains reasoning clearly"},  # weight defaults to 10.0
])
rubric = Rubric.from_file("rubric.yaml")

# Grade (await inside an async function or via asyncio.run)
grader = CriterionGrader(llm_config=LLMConfig(model="openai/gpt-4.1-mini"))
result = await rubric.grade(to_grade="...", grader=grader)

print(f"Score: {result.score:.2f}")
for cr in result.report:
    print(f"  [{cr.final_verdict}] {cr.criterion.requirement}")

Score Calculation

For each criterion \(i\):

  • If verdict = MET, contribution = \(w_i\)
  • If verdict = UNMET, contribution = 0

Final score:

\[ \text{score} = \max\left(0, \min\left(1, \frac{\sum_{i=1}^{n} \mathbb{1}[\text{verdict}_i = \text{MET}] \cdot w_i}{\sum_{i=1}^{n} \max(0, w_i)}\right)\right) \]
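
To make the formula concrete, here is the arithmetic applied by hand to the three criteria from the Quick Example, assuming the two positive criteria are MET and the penalty criterion is UNMET (a plain-Python sketch, not library code):

weights = [10.0, 5.0, -15.0]   # criterion weights from the Quick Example
met = [True, True, False]      # hypothetical verdicts: penalty criterion UNMET

numerator = sum(w for w, m in zip(weights, met) if m)   # 10.0 + 5.0 = 15.0
denominator = sum(max(0.0, w) for w in weights)         # 10.0 + 5.0 = 15.0
score = max(0.0, min(1.0, numerator / denominator))
print(score)  # 1.0

# If the -15.0 penalty criterion were MET instead, the numerator would drop
# to 0.0 and the clamped score would be 0.0.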

Criterion

A single evaluation criterion with weight and requirement.

Criterion

Bases: BaseModel

A single evaluation criterion with a weight and requirement description.

Supports both binary (MET/UNMET) and multi-choice criteria. If options is None, the criterion is binary. If options is provided, the criterion is multi-choice.

ATTRIBUTE DESCRIPTION
weight

Scoring weight. Positive for desired traits, negative for errors/penalties. Defaults to 10.0 for uniform weighting when not specified.

TYPE: float

requirement

Description of what the criterion evaluates.

TYPE: str

name

Optional short identifier for the criterion (e.g., "clarity", "accuracy"). Useful for referencing criteria in reports and debugging.

TYPE: str | None

options

List of options for multi-choice criteria. If None, criterion is binary.

TYPE: list[CriterionOption] | None

scale_type

For multi-choice, indicates if options are ordinal (ordered) or nominal (unordered categories). Affects aggregation strategy selection.

TYPE: ScaleType

aggregation

Per-criterion aggregation strategy override. If None, uses grader default.

TYPE: str | None

Example

Binary criterion

binary = Criterion(
    name="accuracy",
    weight=10.0,
    requirement="The response is factually accurate",
)

Multi-choice ordinal criterion

ordinal = Criterion(
    name="satisfaction",
    weight=10.0,
    requirement="How satisfied would you be?",
    options=[
        CriterionOption(label="1", value=0.0),
        CriterionOption(label="2", value=0.33),
        CriterionOption(label="3", value=0.67),
        CriterionOption(label="4", value=1.0),
    ],
    scale_type="ordinal",
)

is_binary property

is_binary: bool

Check if this is a binary (MET/UNMET) criterion.

is_multi_choice property

is_multi_choice: bool

Check if this is a multi-choice criterion.

get_option_value

get_option_value(index: int) -> float

Get the score value for an option by index.

PARAMETER DESCRIPTION
index

Zero-based index of the option.

TYPE: int

RETURNS DESCRIPTION
float

The score value for the option.

RAISES DESCRIPTION
ValueError

If this is a binary criterion or index is out of range.

Source code in src/autorubric/types.py
def get_option_value(self, index: int) -> float:
    """Get the score value for an option by index.

    Args:
        index: Zero-based index of the option.

    Returns:
        The score value for the option.

    Raises:
        ValueError: If this is a binary criterion or index is out of range.
    """
    if self.options is None:
        raise ValueError("Binary criterion has no options")
    if index < 0 or index >= len(self.options):
        raise ValueError(
            f"Option index {index} out of range [0, {len(self.options)})"
        )
    return self.options[index].value
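
A short usage sketch for get_option_value, reusing the ordinal satisfaction criterion built in the Example above:

value = ordinal.get_option_value(2)   # third option, labelled "3"
print(value)  # 0.67

# Out-of-range indices and binary criteria raise ValueError:
#   ordinal.get_option_value(7)
#   binary.get_option_value(0)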

find_option_by_label

find_option_by_label(label: str) -> int

Find option index by label (case-insensitive, whitespace-normalized).

Used for resolving ground truth labels to indices for metrics computation.

PARAMETER DESCRIPTION
label

The label to search for.

TYPE: str

RETURNS DESCRIPTION
int

Zero-based index of the matching option.

RAISES DESCRIPTION
ValueError

If this is a binary criterion or label not found.

Source code in src/autorubric/types.py
def find_option_by_label(self, label: str) -> int:
    """Find option index by label (case-insensitive, whitespace-normalized).

    Used for resolving ground truth labels to indices for metrics computation.

    Args:
        label: The label to search for.

    Returns:
        Zero-based index of the matching option.

    Raises:
        ValueError: If this is a binary criterion or label not found.
    """
    if self.options is None:
        raise ValueError("Binary criterion has no options")
    normalized_label = label.strip().lower()
    for i, opt in enumerate(self.options):
        if opt.label.strip().lower() == normalized_label:
            return i
    available = [opt.label for opt in self.options]
    raise ValueError(f"Label '{label}' not found. Available: {available}")

validate_options

validate_options() -> Criterion

Validate multi-choice options if present.

Source code in src/autorubric/types.py
@model_validator(mode="after")
def validate_options(self) -> "Criterion":
    """Validate multi-choice options if present."""
    if self.options is not None:
        if len(self.options) < 2:
            raise ValueError("Multi-choice criterion must have at least 2 options")
        # Ensure at least 2 non-NA options
        non_na = [o for o in self.options if not o.na]
        if len(non_na) < 2:
            raise ValueError("Must have at least 2 non-NA options")
    return self
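
For instance, a criterion whose options include only one scored (non-NA) choice fails this validation; a sketch, assuming CriterionOption accepts an na flag as referenced in the validator above:

try:
    Criterion(
        requirement="Rates the tone of the response",
        options=[
            CriterionOption(label="appropriate", value=1.0),
            CriterionOption(label="n/a", value=0.0, na=True),
        ],
    )
except ValueError as e:
    print(e)  # validation error wrapping "Must have at least 2 non-NA options"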

CriterionVerdict

Enum representing the verdict for a criterion.

CriterionVerdict

Bases: str, Enum

Status of a criterion evaluation.

  • MET: The criterion is satisfied by the submission
  • UNMET: The criterion is not satisfied by the submission
  • CANNOT_ASSESS: Insufficient evidence to make a determination
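
A sketch of filtering a report by verdict, assuming result comes from the Quick Example above and using the per-criterion verdict field documented under CriterionReport below:

met_count = sum(1 for cr in result.report if cr.verdict == CriterionVerdict.MET)
unclear = [cr for cr in result.report if cr.verdict == CriterionVerdict.CANNOT_ASSESS]
print(f"{met_count} criteria met, {len(unclear)} could not be assessed")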

CriterionReport

Per-criterion result with verdict and explanation.

CriterionReport

Bases: Criterion

A criterion with its evaluation result.

Supports both binary (MET/UNMET/CANNOT_ASSESS) and multi-choice verdicts. For binary criteria, use verdict. For multi-choice, use multi_choice_verdict.

ATTRIBUTE DESCRIPTION
verdict

Binary verdict (MET/UNMET/CANNOT_ASSESS). None for multi-choice criteria.

TYPE: CriterionVerdict | None

multi_choice_verdict

Multi-choice verdict with selected option. None for binary.

TYPE: MultiChoiceVerdict | AggregatedMultiChoiceVerdict | None

reason

Explanation for the verdict from the LLM judge.

TYPE: str

score_value property

score_value: float

Get the score contribution (0-1) for this criterion.

For binary criteria: 1.0 if MET, 0.0 otherwise. For multi-choice: the value of the selected option.

is_na property

is_na: bool

Check if this criterion was marked NA or CANNOT_ASSESS.

Returns True for:

  • Binary criteria with a CANNOT_ASSESS verdict
  • Multi-choice criteria with an NA option selected
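
A sketch of how these properties combine when inspecting a report, reusing result from the Quick Example above; the per-criterion contribution mirrors the score formula and is illustrative only:

for cr in result.report:
    contribution = cr.weight * cr.score_value   # 0.0 when UNMET or not assessed
    flag = " (not assessed)" if cr.is_na else ""
    print(f"{cr.name or cr.requirement}: {contribution:+.1f}{flag}")
    print(f"  reason: {cr.reason}")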


CriterionJudgment

Structured output from LLM judge for a single criterion.

CriterionJudgment

Bases: BaseModel

Structured LLM output for single criterion evaluation.

Used with LiteLLM's response_format parameter to ensure type-safe, validated responses from the judge LLM.

Note: This is separate from CriterionReport because:

  • CriterionReport includes the 'weight' and 'requirement' fields, which come from the rubric, not from the LLM
  • The LLM only outputs the judgment (status + explanation)


Rubric

Collection of criteria for evaluation.

Rubric

Rubric(rubric: list[Criterion])

A rubric is a list of criteria used to evaluate text outputs.

Each criterion has a weight and requirement. Use the grade() method to evaluate text against this rubric using a grader.

Source code in src/autorubric/rubric.py
def __init__(self, rubric: list[Criterion]):
    self.rubric = rubric

grade async

grade(to_grade: ToGradeInput, grader: Grader, query: str | None = None, reference_submission: str | None = None) -> EvaluationReport

Grade text against this rubric using a grader.

PARAMETER DESCRIPTION
to_grade

The text to evaluate. Can be either:

  • A string (optionally with <thinking>/<output> markers)
  • A dict with 'thinking' and 'output' keys

TYPE: ToGradeInput

grader

The grader to use. REQUIRED - must be provided. Configure length_penalty and normalize on the grader if needed.

TYPE: Grader

query

Optional input/query that prompted the response.

TYPE: str | None DEFAULT: None

reference_submission

Optional exemplar response for grading context. When present, provides calibration for the grader.

TYPE: str | None DEFAULT: None

RAISES DESCRIPTION
TypeError

If grader is not provided.

Source code in src/autorubric/rubric.py
async def grade(
    self,
    to_grade: ToGradeInput,
    grader: Grader,
    query: str | None = None,
    reference_submission: str | None = None,
) -> EvaluationReport:
    """Grade text against this rubric using a grader.

    Args:
        to_grade: The text to evaluate. Can be either:
            - A string (optionally with <thinking>/<output> markers)
            - A dict with 'thinking' and 'output' keys
        grader: The grader to use. REQUIRED - must be provided.
            Configure length_penalty and normalize on the grader if needed.
        query: Optional input/query that prompted the response.
        reference_submission: Optional exemplar response for grading context.
            When present, provides calibration for the grader.

    Raises:
        TypeError: If grader is not provided.
    """
    return await grader.grade(
        to_grade=to_grade,
        rubric=self.rubric,
        query=query,
        reference_submission=reference_submission,
    )
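
A sketch of grading with the optional context parameters, reusing the grader from the Quick Example; the query and reference_submission strings are illustrative:

result = await rubric.grade(
    to_grade="Paris is the capital of France because ...",
    grader=grader,
    query="What is the capital of France?",
    reference_submission="The capital of France is Paris.",
)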

validate_and_create_criteria staticmethod

validate_and_create_criteria(data: list[dict[str, Any]] | dict[str, Any]) -> list[Criterion]

Validate and create Criterion objects from raw data.

Supports multiple formats:

  • Flat list of criteria
  • List of sections with criteria
  • Dict with a 'sections' key containing a list of sections
  • Dict with a 'rubric' key containing sections

Source code in src/autorubric/rubric.py
@staticmethod
def validate_and_create_criteria(
    data: list[dict[str, Any]] | dict[str, Any],
) -> list[Criterion]:
    """Validate and create Criterion objects from raw data.

    Supports multiple formats:
    - Flat list of criteria
    - List of sections with criteria
    - Dict with 'sections' key containing list of sections
    - Dict with 'rubric' key containing sections
    """
    if isinstance(data, dict):
        if "rubric" in data:
            data = data["rubric"]

        if isinstance(data, dict):
            if "sections" in data:
                sections = data["sections"]
                if not isinstance(sections, list):
                    raise ValueError(
                        f"Invalid rubric format. Expected 'sections' to be a list, "
                        f"got {type(sections).__name__}"
                    )
                data = sections
            else:
                raise ValueError(
                    "Invalid rubric format. Dict must contain either 'sections' or 'rubric' key"
                )

    if not isinstance(data, list):
        raise ValueError(
            f"Invalid rubric format. Expected a list, got {type(data).__name__}"
        )

    if not data:
        raise ValueError("No criteria found")

    flattened_criteria_data = []
    for idx, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(
                f"Invalid item at index {idx}: expected a dictionary, got {type(item).__name__}"
            )

        if "criteria" in item:
            section_criteria = item["criteria"]
            if not isinstance(section_criteria, list):
                raise ValueError(
                    f"Invalid section at index {idx}: 'criteria' must be a list, "
                    f"got {type(section_criteria).__name__}"
                )
            flattened_criteria_data.extend(section_criteria)
        else:
            flattened_criteria_data.append(item)

    if not flattened_criteria_data:
        raise ValueError("No criteria found")

    criteria = []
    for idx, criterion_data in enumerate(flattened_criteria_data):
        if not isinstance(criterion_data, dict):
            raise ValueError(
                f"Invalid criterion at index {idx}: expected a dictionary, "
                f"got {type(criterion_data).__name__}"
            )

        try:
            criteria.append(Criterion(**criterion_data))  # type: ignore[arg-type]
        except ValidationError as e:
            error_details = []
            for error in e.errors():
                field = ".".join(str(loc) for loc in error["loc"])
                error_details.append(f"{field}: {error['msg']}")

            error_msg = f"Invalid criterion at index {idx}:\n  " + "\n  ".join(
                error_details
            )
            raise ValueError(error_msg) from e
        except Exception as e:
            raise ValueError(
                f"Failed to create criterion at index {idx}: {e}"
            ) from e

    return criteria
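
A sketch of the sectioned dict format accepted here (and therefore by from_dict and the file loaders): each section's 'criteria' list is flattened into a single rubric.

rubric = Rubric.from_dict({
    "sections": [
        {
            "criteria": [
                {"weight": 10.0, "requirement": "States the correct answer"},
                {"weight": 5.0, "requirement": "Explains reasoning clearly"},
            ]
        },
        {
            "criteria": [
                {"weight": -15.0, "requirement": "Contains factual errors"},
            ]
        },
    ]
})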

from_yaml classmethod

from_yaml(yaml_string: str) -> Rubric

Parse rubric from a YAML string.

Source code in src/autorubric/rubric.py
@classmethod
def from_yaml(cls, yaml_string: str) -> "Rubric":
    """Parse rubric from a YAML string."""
    try:
        data = yaml.safe_load(yaml_string)
    except yaml.YAMLError as e:
        raise ValueError(f"Failed to parse YAML string: {e}") from e

    criteria = cls.validate_and_create_criteria(data)
    return cls(criteria)
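
A sketch of the flat-list YAML format this parses; keys correspond to the Criterion fields documented above:

rubric = Rubric.from_yaml("""
- weight: 10.0
  requirement: States the correct answer
- weight: 5.0
  requirement: Explains reasoning clearly
- weight: -15.0
  requirement: Contains factual errors
""")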

from_json classmethod

from_json(json_string: str) -> Rubric

Parse rubric from a JSON string.

Source code in src/autorubric/rubric.py
@classmethod
def from_json(cls, json_string: str) -> "Rubric":
    """Parse rubric from a JSON string."""
    try:
        data = json.loads(json_string)
    except json.JSONDecodeError as e:
        raise ValueError(f"Failed to parse JSON string: {e}") from e

    criteria = cls.validate_and_create_criteria(data)
    return cls(criteria)
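
The JSON equivalent of the same flat-list format:

rubric = Rubric.from_json(
    '[{"weight": 10.0, "requirement": "States the correct answer"},'
    ' {"requirement": "Explains reasoning clearly"}]'
)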

from_file classmethod

from_file(source: str | Any) -> Rubric

Load rubric from a file path or file-like object, auto-detecting format.

Source code in src/autorubric/rubric.py
@classmethod
def from_file(cls, source: str | Any) -> "Rubric":
    """Load rubric from a file path or file-like object, auto-detecting format."""
    if hasattr(source, "read"):
        file_name = getattr(source, "name", "")  # type: ignore[arg-type]
        extension = Path(file_name).suffix.lower() if file_name else ""

        if not extension:
            raise ValueError(
                "Cannot determine file format from file object. "
                "File object must have a 'name' attribute with a file extension."
            )

        try:
            content = source.read()  # type: ignore[misc]
        except Exception as e:
            raise ValueError(f"Failed to read from file object: {e}") from e

        if extension in [".yaml", ".yml"]:
            try:
                data = yaml.safe_load(content)
            except yaml.YAMLError as e:
                raise ValueError(
                    f"Failed to parse YAML from file object: {e}"
                ) from e
            criteria = cls.validate_and_create_criteria(data)
            return cls(criteria)
        elif extension == ".json":
            try:
                data = json.loads(content)
            except json.JSONDecodeError as e:
                raise ValueError(
                    f"Failed to parse JSON from file object: {e}"
                ) from e
            criteria = cls.validate_and_create_criteria(data)
            return cls(criteria)
        else:
            raise ValueError(
                f"Unsupported file format '{extension}' for file object: {file_name}\n"
                f"Supported formats: .yaml, .yml, .json"
            )

    elif isinstance(source, str):
        path = Path(source)

        if not path.exists():
            raise FileNotFoundError(f"File not found: {source}")

        extension = path.suffix.lower()

        if extension in [".yaml", ".yml"]:
            with open(source, encoding="utf-8") as f:
                try:
                    data = yaml.safe_load(f)
                except yaml.YAMLError as e:
                    raise ValueError(f"Failed to parse YAML file: {e}") from e
            criteria = cls.validate_and_create_criteria(data)
            return cls(criteria)
        elif extension == ".json":
            with open(source, encoding="utf-8") as f:
                try:
                    data = json.load(f)
                except json.JSONDecodeError as e:
                    raise ValueError(f"Failed to parse JSON file: {e}") from e
            criteria = cls.validate_and_create_criteria(data)
            return cls(criteria)
        else:
            raise ValueError(
                f"Unsupported file format '{extension}' for file: {source}\n"
                f"Supported formats: .yaml, .yml, .json"
            )
    else:
        raise ValueError(
            f"Invalid source type: expected str (file path) or file-like object, "
            f"got {type(source).__name__}"
        )
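
Both file paths and open file objects are accepted, provided the object exposes a name with a .yaml, .yml, or .json extension; a short sketch:

# From a path (format inferred from the extension)
rubric = Rubric.from_file("rubric.yaml")

# From a file-like object; its .name attribute supplies the extension
with open("rubric.json", encoding="utf-8") as f:
    rubric = Rubric.from_file(f)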

from_dict classmethod

from_dict(data: list[dict[str, Any]] | dict[str, Any]) -> Rubric

Create rubric from a list of dictionaries or a dict with sections.

Source code in src/autorubric/rubric.py
@classmethod
def from_dict(cls, data: list[dict[str, Any]] | dict[str, Any]) -> "Rubric":
    """Create rubric from a list of dictionaries or a dict with sections."""
    criteria = cls.validate_and_create_criteria(data)
    return cls(criteria)

EvaluationReport

Complete grading result with score and per-criterion reports.

EvaluationReport

Bases: BaseModel

Final evaluation result with score and per-criterion reports.

For training use cases, set normalize=False in the grader to get raw weighted sums instead of normalized 0-1 scores.

ATTRIBUTE DESCRIPTION
score

The final score (0-1 if normalized, raw weighted sum otherwise).

TYPE: float

raw_score

The unnormalized weighted sum.

TYPE: float | None

llm_raw_score

The original score returned by the LLM (same as raw_score).

TYPE: float | None

report

Per-criterion breakdown with verdicts and explanations.

TYPE: list[CriterionReport] | None

cannot_assess_count

Number of criteria with CANNOT_ASSESS verdict.

TYPE: int

error

Optional error message if grading failed (e.g., JSON parse error). When set, score defaults to 0.0. Training pipelines should filter these out.

TYPE: str | None

token_usage

Aggregated token usage across all LLM calls made during grading. For CriterionGrader, this is the sum across all criterion evaluations.

TYPE: TokenUsage | None

completion_cost

Total cost in USD for all LLM calls made during grading. Calculated using LiteLLM's completion_cost() function.

TYPE: float | None

Example

result = await rubric.grade(to_grade=response, grader=grader)
print(f"Score: {result.score:.2f}")
if result.cannot_assess_count:
    print(f"Could not assess {result.cannot_assess_count} criteria")
if result.token_usage:
    print(f"Tokens: {result.token_usage.total_tokens}")
if result.completion_cost:
    print(f"Cost: ${result.completion_cost:.6f}")


TokenUsage

Token usage tracking for LLM calls.

TokenUsage dataclass

TokenUsage(prompt_tokens: int = 0, completion_tokens: int = 0, total_tokens: int = 0, cache_creation_input_tokens: int = 0, cache_read_input_tokens: int = 0)

Token usage statistics from LLM API calls.

ATTRIBUTE DESCRIPTION
prompt_tokens

Number of tokens in the prompt/input.

TYPE: int

completion_tokens

Number of tokens in the completion/output.

TYPE: int

total_tokens

Total tokens (prompt + completion).

TYPE: int

cache_creation_input_tokens

Tokens used to create cache entries (Anthropic).

TYPE: int

cache_read_input_tokens

Tokens read from cache (Anthropic).

TYPE: int

Example

usage = TokenUsage(prompt_tokens=100, completion_tokens=50, total_tokens=150)
print(f"Total tokens: {usage.total_tokens}")
# Total tokens: 150


ToGradeInput

Type alias for the input format accepted by rubric.grade().

ToGradeInput module-attribute

ToGradeInput = str | ThinkingOutputDict

Union type for to_grade parameter.

Accepts either a plain string or a dict with thinking/output keys.


ThinkingOutputDict

TypedDict for responses with separate thinking and output sections.

ThinkingOutputDict

Bases: TypedDict

Dict format for submissions with separate thinking and output sections.

Both fields are optional to allow partial submissions or gradual construction. When used with length penalty, missing fields are treated as empty strings.
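
A sketch of passing the dict form to grade(), reusing the rubric and grader from the Quick Example; because both keys are optional, an output-only dict is also valid:

submission = {
    "thinking": "Check each claim before answering ...",
    "output": "The capital of France is Paris.",
}
result = await rubric.grade(to_grade=submission, grader=grader)

# Output-only submission; a missing 'thinking' key is treated as an empty
# string when a length penalty is applied.
result = await rubric.grade(to_grade={"output": "Paris."}, grader=grader)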


ScaleType

Type alias for multi-choice criterion scale types: ordinal or nominal.

ScaleType module-attribute

ScaleType = Literal['ordinal', 'nominal']

Scale type for multi-choice criteria.

  • ordinal: Options have inherent order (e.g., 1-4 satisfaction scale)
  • nominal: Options are unordered categories (e.g., "too few", "too many", "just right")
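
A sketch of a nominal multi-choice criterion to complement the ordinal example above; the option labels and values are illustrative assumptions:

coverage = Criterion(
    name="coverage",
    weight=10.0,
    requirement="How much detail does the response include?",
    options=[
        CriterionOption(label="too little", value=0.0),
        CriterionOption(label="just right", value=1.0),
        CriterionOption(label="too much", value=0.5),
    ],
    scale_type="nominal",
)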