Graders¶

Grader implementations for evaluating responses against rubrics.

Overview¶

Graders evaluate responses against rubrics and return structured reports. The main implementation is CriterionGrader, which supports single LLM, ensemble, and few-shot modes. All combinations work orthogonally.

Quick Example¶

from autorubric import LLMConfig, FewShotConfig
from autorubric.graders import CriterionGrader, JudgeSpec, Grader

# Single LLM mode
grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
)

# With custom system prompt
grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
    system_prompt="You are evaluating technical documentation...",
)

# Ensemble mode
grader = CriterionGrader(
    judges=[
        JudgeSpec(LLMConfig(model="gemini/gemini-3-flash-preview"), "gemini", weight=1.0),
        JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929"), "claude", weight=1.2),
    ],
    aggregation="weighted",
)

# Single LLM + few-shot
grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
    training_data=train_data,
    few_shot_config=FewShotConfig(n_examples=3, balance_verdicts=True),
)

# Grade
result = await rubric.grade(to_grade=response, grader=grader)

Grading Options¶

grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),

    # Score normalization
    normalize=True,          # True: 0-1 range, False: raw weighted sum

    # CANNOT_ASSESS handling
    cannot_assess_config=CannotAssessConfig(strategy=CannotAssessStrategy.SKIP),

    # Length penalty
    length_penalty=LengthPenalty(free_budget=6000, max_cap=8000),

    # Position bias mitigation (for multi-choice)
    shuffle_options=True,    # Default: enabled

    # Multi-choice abstain channel
    auto_na_option=True,     # Default: True — guarantees every multi-choice criterion a
                             # first-class NA/abstain option (auto-injected if absent).
                             # Set False for forced-choice (no auto NA option).

    # Reproducibility — pins all non-LLM randomness (shuffles, few-shot selection)
    seed=42,                 # Default: auto-generated
)

With auto_na_option=True, an auto-injected NA option is appended at the end (highest index), so existing option indices are preserved; an author-supplied NA option is never stripped.

CriterionGrader¶

Main grader with support for single LLM, ensemble, and few-shot modes.

CriterionGrader ¶

CriterionGrader(*, llm_config: LLMConfig | None = None, judges: list[JudgeSpec] | None = None, aggregation: AggregationStrategy = 'majority', ordinal_aggregation: OrdinalAggregation = 'mean', nominal_aggregation: NominalAggregation = 'mode', training_data: RubricDataset | None = None, few_shot_config: FewShotConfig | None = None, system_prompt: str | None = None, multi_choice_system_prompt: str | None = None, length_penalty: LengthPenalty | None = None, normalize: bool = True, cannot_assess_config: CannotAssessConfig | None = None, shuffle_options: bool = True, auto_na_option: bool = True, seed: int | None = None, binary_response_format: type[BaseModel] | None = None, multi_choice_response_format: type[BaseModel] | None = None)

Bases: Grader

Unified criterion-based grader with compositional few-shot and ensemble support.

This grader evaluates each criterion independently and supports: - Single LLM mode (via llm_config) - Ensemble mode with multiple judges (via judges) - Few-shot prompting (via training_data + few_shot_config)

All combinations work: single LLM, single + few-shot, ensemble, ensemble + few-shot.

Parameters are orthogonal: - llm_config OR judges: Choose single-LLM or ensemble mode - training_data + few_shot_config: Enable few-shot prompting (applies to all judges)

Example

from autorubric import LLMConfig, FewShotConfig, RubricDataset from autorubric.graders import CriterionGrader, JudgeSpec

Single LLM¶

grader = CriterionGrader(llm_config=LLMConfig(model="gemini/gemini-3-flash-preview"))

Single LLM + few-shot¶

train, test = dataset.split_train_test(n_train=100) grader = CriterionGrader( ... llm_config=LLMConfig(model="gemini/gemini-3-flash-preview"), ... training_data=train, ... few_shot_config=FewShotConfig(n_examples=3), ... )

Ensemble¶

grader = CriterionGrader( ... judges=[ ... JudgeSpec(LLMConfig(model="gemini/gemini-3-flash-preview"), "gemini"), ... JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929"), "claude"), ... ], ... aggregation="majority", ... )

Ensemble + few-shot¶

grader = CriterionGrader( ... judges=[JudgeSpec(...), JudgeSpec(...)], ... aggregation="majority", ... training_data=train, ... few_shot_config=FewShotConfig(n_examples=3), ... )

Initialize the criterion grader.

PARAMETER	DESCRIPTION
`llm_config`	Configuration for single-LLM mode. Mutually exclusive with judges. TYPE: `LLMConfig \| None` DEFAULT: `None`
`judges`	List of JudgeSpec for ensemble mode. Mutually exclusive with llm_config. TYPE: `list[JudgeSpec] \| None` DEFAULT: `None`
`aggregation`	Strategy for aggregating votes in ensemble mode (binary criteria). TYPE: `AggregationStrategy` DEFAULT: `'majority'`
`ordinal_aggregation`	Strategy for aggregating ordinal multi-choice votes. Central tendency: "mean", "median", "weighted_mean", "mode". Conservative/ permissive (analogs of binary unanimous/any): "min" (lowest selected option) / "max" (highest selected option). TYPE: `OrdinalAggregation` DEFAULT: `'mean'`
`nominal_aggregation`	Strategy for aggregating nominal multi-choice votes. Options: "mode", "weighted_mode", "unanimous". "unanimous" abstains via the NA option on disagreement, or falls back to mode + warns if there is no NA option. TYPE: `NominalAggregation` DEFAULT: `'mode'`
`training_data`	Dataset for few-shot examples. If provided, enables few-shot prompting. TYPE: `RubricDataset \| None` DEFAULT: `None`
`few_shot_config`	Configuration for few-shot example selection. TYPE: `FewShotConfig \| None` DEFAULT: `None`
`system_prompt`	Custom system prompt for binary criteria. TYPE: `str \| None` DEFAULT: `None`
`multi_choice_system_prompt`	Custom system prompt for multi-choice criteria. TYPE: `str \| None` DEFAULT: `None`
`length_penalty`	Optional length penalty configuration. TYPE: `LengthPenalty \| None` DEFAULT: `None`
`normalize`	If True, normalize score to [0, 1]. If False, return raw sum. TYPE: `bool` DEFAULT: `True`
`cannot_assess_config`	Configuration for handling CANNOT_ASSESS verdicts. TYPE: `CannotAssessConfig \| None` DEFAULT: `None`
`shuffle_options`	If True (default), randomize the order of multi-choice options presented to the LLM to mitigate position bias. Each judge/call sees a different random order, and responses are mapped back to original indices. Disable for deterministic behavior in tests. TYPE: `bool` DEFAULT: `True`
`auto_na_option`	If True (default), auto-inject a canonical NA / "cannot assess" option into any multi-choice criterion that lacks one, giving the judge a first-class abstain channel analogous to binary CANNOT_ASSESS. The injected option is appended at the end (highest index) so existing option indices are preserved. Set False for forced-choice classification (the judge must pick a scored option). Never strips an author-supplied NA option — author intent wins. TYPE: `bool` DEFAULT: `True`
`seed`	Master seed for all non-LLM randomness (option shuffling, few-shot example selection). Auto-generated when None so that randomness is always pinned and reproducible. Inspect via the `seed` property after construction. TYPE: `int \| None` DEFAULT: `None`
`binary_response_format`	Pydantic model to use as the structured output schema for binary criterion judgments. Must be a subclass of (or compatible with) CriterionJudgment. If the model includes an `affected_criteria` field (list[int]), matching indices are injected as an `[Affects: ...]` tag into the reason string. Defaults to CriterionJudgment. TYPE: `type[BaseModel] \| None` DEFAULT: `None`
`multi_choice_response_format`	Pydantic model to use as the structured output schema for multi-choice criterion judgments. Must be a subclass of (or compatible with) MultiChoiceJudgment. If the model includes an `affected_criteria` field (list[int]), matching indices are injected as an `[Affects: ...]` tag into the reason string (same convention as binary_response_format). Defaults to MultiChoiceJudgment. TYPE: `type[BaseModel] \| None` DEFAULT: `None`

RAISES	DESCRIPTION
`ValueError`	If neither llm_config nor judges is provided, or both are provided.

Source code in src/autorubric/graders/criterion_grader.py

def __init__(
    self,
    *,
    # Single LLM mode
    llm_config: LLMConfig | None = None,
    # Ensemble mode (overrides llm_config)
    judges: list[JudgeSpec] | None = None,
    aggregation: AggregationStrategy = "majority",
    # Multi-choice aggregation strategies
    ordinal_aggregation: OrdinalAggregation = "mean",
    nominal_aggregation: NominalAggregation = "mode",
    # Few-shot mode (orthogonal - applies to all judges)
    training_data: RubricDataset | None = None,
    few_shot_config: FewShotConfig | None = None,
    # Common parameters
    system_prompt: str | None = None,
    multi_choice_system_prompt: str | None = None,
    length_penalty: LengthPenalty | None = None,
    normalize: bool = True,
    cannot_assess_config: CannotAssessConfig | None = None,
    # Position bias mitigation
    shuffle_options: bool = True,
    # Multi-choice abstain channel
    auto_na_option: bool = True,
    # Reproducibility
    seed: int | None = None,
    # Structured output override for binary criteria
    binary_response_format: type[BaseModel] | None = None,
    # Structured output override for multi-choice criteria
    multi_choice_response_format: type[BaseModel] | None = None,
):
    """Initialize the criterion grader.

    Args:
        llm_config: Configuration for single-LLM mode. Mutually exclusive with judges.
        judges: List of JudgeSpec for ensemble mode. Mutually exclusive with llm_config.
        aggregation: Strategy for aggregating votes in ensemble mode (binary criteria).
        ordinal_aggregation: Strategy for aggregating ordinal multi-choice votes.
            Central tendency: "mean", "median", "weighted_mean", "mode". Conservative/
            permissive (analogs of binary unanimous/any): "min" (lowest selected
            option) / "max" (highest selected option).
        nominal_aggregation: Strategy for aggregating nominal multi-choice votes.
            Options: "mode", "weighted_mode", "unanimous". "unanimous" abstains via the
            NA option on disagreement, or falls back to mode + warns if there is no NA
            option.
        training_data: Dataset for few-shot examples. If provided, enables few-shot prompting.
        few_shot_config: Configuration for few-shot example selection.
        system_prompt: Custom system prompt for binary criteria.
        multi_choice_system_prompt: Custom system prompt for multi-choice criteria.
        length_penalty: Optional length penalty configuration.
        normalize: If True, normalize score to [0, 1]. If False, return raw sum.
        cannot_assess_config: Configuration for handling CANNOT_ASSESS verdicts.
        shuffle_options: If True (default), randomize the order of multi-choice options
            presented to the LLM to mitigate position bias. Each judge/call sees a
            different random order, and responses are mapped back to original indices.
            Disable for deterministic behavior in tests.
        auto_na_option: If True (default), auto-inject a canonical NA / "cannot assess"
            option into any multi-choice criterion that lacks one, giving the judge a
            first-class abstain channel analogous to binary CANNOT_ASSESS. The injected
            option is appended at the end (highest index) so existing option indices are
            preserved. Set False for forced-choice classification (the judge must pick a
            scored option). Never strips an author-supplied NA option — author intent wins.
        seed: Master seed for all non-LLM randomness (option shuffling, few-shot
            example selection). Auto-generated when None so that randomness is always
            pinned and reproducible. Inspect via the ``seed`` property after construction.
        binary_response_format: Pydantic model to use as the structured output schema
            for binary criterion judgments. Must be a subclass of (or compatible with)
            CriterionJudgment. If the model includes an ``affected_criteria`` field
            (list[int]), matching indices are injected as an ``[Affects: ...]`` tag
            into the reason string. Defaults to CriterionJudgment.
        multi_choice_response_format: Pydantic model to use as the structured output
            schema for multi-choice criterion judgments. Must be a subclass of (or
            compatible with) MultiChoiceJudgment. If the model includes an
            ``affected_criteria`` field (list[int]), matching indices are injected as
            an ``[Affects: ...]`` tag into the reason string (same convention as
            binary_response_format). Defaults to MultiChoiceJudgment.

    Raises:
        ValueError: If neither llm_config nor judges is provided, or both are provided.
    """
    super().__init__(length_penalty=length_penalty, normalize=normalize)

    # Validate: must have either llm_config or judges, not both, not neither
    if llm_config is None and judges is None:
        raise ValueError("Must provide either llm_config or judges")
    if llm_config is not None and judges is not None:
        raise ValueError("Cannot provide both llm_config and judges")

    # Normalize to ensemble representation (single LLM = ensemble of 1).
    # The validation above guarantees exactly one of llm_config/judges is set, so by
    # this point `judges` in the else branch is a non-None list[JudgeSpec]; the explicit
    # attribute annotation lets the type checker see that without a suppression comment.
    self._judges: list[JudgeSpec]
    if llm_config is not None:
        self._judges = [JudgeSpec(llm_config=llm_config, judge_id="default", weight=1.0)]
    else:
        assert judges is not None
        self._judges = judges

    self._aggregation = aggregation
    self._ordinal_aggregation = ordinal_aggregation
    self._nominal_aggregation = nominal_aggregation
    self._training_data = training_data
    self._cannot_assess_config = cannot_assess_config or CannotAssessConfig()
    self._shuffle_options = shuffle_options
    self._auto_na_option = auto_na_option
    self._seed = seed if seed is not None else random.randint(0, 2**31 - 1)

    # Coordinate few-shot seed with master seed when unset
    fsc = few_shot_config or FewShotConfig()
    if fsc.seed is None and training_data is not None:
        fsc = dataclasses.replace(fsc, seed=self._seed)
    self._few_shot_config = fsc
    self._binary_response_format = binary_response_format or CriterionJudgment
    self._multi_choice_response_format = multi_choice_response_format or MultiChoiceJudgment

    # Build system prompts (separate for binary and multi-choice)
    if system_prompt is None:
        self._system_prompt = GRADER_SYSTEM_PROMPT_DEFAULT
        if training_data is not None:
            self._system_prompt += FEW_SHOT_SYSTEM_PROMPT_ADDITION
    else:
        self._system_prompt = system_prompt

    if multi_choice_system_prompt is None:
        self._multi_choice_system_prompt = MULTI_CHOICE_SYSTEM_PROMPT
        if training_data is not None:
            self._multi_choice_system_prompt += MULTI_CHOICE_FEW_SHOT_ADDITION
    else:
        self._multi_choice_system_prompt = multi_choice_system_prompt

    # Create LLM clients for each judge
    self._clients = {judge.judge_id: LLMClient(judge.llm_config) for judge in self._judges}

    # Pre-compute few-shot examples if training data provided
    # Note: For multi-choice, examples are stored as (submission, selected_index, reason)
    self._criterion_examples: dict[tuple[int, str], list[FewShotExample]] = {}
    self._multi_choice_examples: dict[tuple[int, str], list[tuple[str, int, str | None]]] = {}
    if training_data is not None:
        self._prepare_examples()

is_ensemble `property` ¶

is_ensemble: bool

Whether this grader uses multiple judges.

has_few_shot `property` ¶

has_few_shot: bool

Whether this grader uses few-shot prompting.

seed `property` ¶

seed: int

The master seed governing all non-LLM randomness in this grader.

judge `async` ¶

judge(to_grade: str, rubric: list[Criterion], query: str | None = None, reference_submission: str | None = None) -> list[JudgeCriterionResults]

Judge all criteria with all judges (parallel across judges).

Source code in src/autorubric/graders/criterion_grader.py

async def judge(
    self,
    to_grade: str,
    rubric: list[Criterion],
    query: str | None = None,
    reference_submission: str | None = None,
) -> list[JudgeCriterionResults]:
    """Judge all criteria with all judges (parallel across judges)."""
    # Normalize once to the effective rubric (abstain channel guaranteed for
    # multi-choice under auto_na_option). Same length/order, so criterion_idx — and
    # thus the shuffle RNG key — stays aligned; the user's rubric is never mutated.
    effective_rubric = [self._effective_criterion(c) for c in rubric]
    tasks = [
        self._judge_all_criteria_for_judge(
            judge, effective_rubric, to_grade, query, reference_submission
        )
        for judge in self._judges
    ]
    return list(await asyncio.gather(*tasks))

aggregate `async` ¶

aggregate(judge_results: list[JudgeCriterionResults], *, normalize: bool = True) -> EnsembleEvaluationReport

Aggregate results from all judges into final report.

Handles both binary and multi-choice criteria: - Binary: Uses JudgeVote and _aggregate_votes() - Multi-choice: Uses MultiChoiceJudgeVote and _aggregate_multi_choice_votes()

Source code in src/autorubric/graders/criterion_grader.py

async def aggregate(
    self, judge_results: list[JudgeCriterionResults], *, normalize: bool = True
) -> EnsembleEvaluationReport:
    """Aggregate results from all judges into final report.

    Handles both binary and multi-choice criteria:
    - Binary: Uses JudgeVote and _aggregate_votes()
    - Multi-choice: Uses MultiChoiceJudgeVote and _aggregate_multi_choice_votes()
    """
    if not judge_results:
        # Empty/failed aggregation has no score: emit None, not a fabricated 0.0
        # (which is a valid catastrophic score, indistinguishable from a real zero).
        return EnsembleEvaluationReport(
            score=None,
            raw_score=None,
            llm_raw_score=None,
            error="No judge results to aggregate",
        )

    n_criteria = len(judge_results[0].criterion_results)

    # Build ensemble criterion reports
    ensemble_reports: list[EnsembleCriterionReport] = []
    for criterion_idx in range(n_criteria):
        # Get criterion from first judge's result
        first_cr = judge_results[0].criterion_results[criterion_idx]
        criterion_report = first_cr.report

        if criterion_report.is_multi_choice:
            # Multi-choice: build MultiChoiceJudgeVote list
            mc_votes: list[MultiChoiceJudgeVote] = []
            # Each MultiChoiceJudgeVote carries its own .error (parity with JudgeVote),
            # so the ensemble error is derived from the votes via _aggregate_error,
            # mirroring the binary path. Every judge call synthesizes a verdict, so the
            # `mcv is not None` guard below never drops an errored vote.
            for judge_result in judge_results:
                cr = judge_result.criterion_results[criterion_idx]
                mcv = cr.report.multi_choice_verdict
                if mcv is not None:
                    mc_votes.append(
                        MultiChoiceJudgeVote(
                            judge_id=judge_result.judge_id,
                            selected_index=mcv.selected_index,
                            selected_label=mcv.selected_label,
                            value=mcv.value,
                            reason=cr.report.reason,
                            weight=judge_result.weight,
                            na=mcv.na,
                            shuffle_order=cr.report.shuffle_order,
                            error=cr.report.error,
                            reasoning=cr.report.reasoning,
                        )
                    )

            # Aggregate multi-choice votes
            final_mc_verdict, final_reason = self._aggregate_multi_choice_votes(
                mc_votes, criterion_report
            )

            ensemble_reports.append(
                EnsembleCriterionReport(
                    criterion=Criterion(
                        weight=criterion_report.weight,
                        requirement=criterion_report.requirement,
                        name=criterion_report.name,
                        options=criterion_report.options,
                        scale_type=criterion_report.scale_type,
                        aggregation=criterion_report.aggregation,
                    ),
                    final_verdict=None,  # Binary verdict is None for multi-choice
                    final_reason=final_reason,
                    votes=[],  # Binary votes empty for multi-choice
                    final_multi_choice_verdict=final_mc_verdict,
                    multi_choice_votes=mc_votes,
                    error=_aggregate_error(mc_votes),
                )
            )
        else:
            # Binary: build JudgeVote list
            votes: list[JudgeVote] = []
            for judge_result in judge_results:
                cr = judge_result.criterion_results[criterion_idx]
                votes.append(
                    JudgeVote(
                        judge_id=judge_result.judge_id,
                        verdict=cr.report.verdict,
                        reason=cr.report.reason,
                        weight=judge_result.weight,
                        error=cr.report.error,
                        reasoning=cr.report.reasoning,
                    )
                )

            final_verdict, final_reason = self._aggregate_votes(votes, criterion_report.weight)

            ensemble_reports.append(
                EnsembleCriterionReport(
                    criterion=Criterion(
                        weight=criterion_report.weight,
                        requirement=criterion_report.requirement,
                        name=criterion_report.name,
                    ),
                    final_verdict=final_verdict,
                    final_reason=final_reason,
                    votes=votes,
                    error=_aggregate_error(votes),
                )
            )

    # Calculate per-judge scores
    judge_scores = {}
    for judge_result in judge_results:
        score = self._calculate_score_from_reports(judge_result.reports, normalize)
        judge_scores[judge_result.judge_id] = score

    # Calculate final score from aggregated verdicts
    final_reports = []
    for er in ensemble_reports:
        if er.final_multi_choice_verdict is not None:
            # Multi-choice criterion
            final_reports.append(
                CriterionReport(
                    weight=er.criterion.weight,
                    requirement=er.criterion.requirement,
                    name=er.criterion.name,
                    options=er.criterion.options,
                    scale_type=er.criterion.scale_type,
                    aggregation=er.criterion.aggregation,
                    verdict=None,  # Binary verdict is None
                    multi_choice_verdict=er.final_multi_choice_verdict,
                    reason=er.final_reason,
                )
            )
        else:
            # Binary criterion
            final_reports.append(
                CriterionReport(
                    weight=er.criterion.weight,
                    requirement=er.criterion.requirement,
                    name=er.criterion.name,
                    verdict=er.final_verdict,
                    reason=er.final_reason,
                )
            )
    final_score = self._calculate_score_from_reports(final_reports, normalize)
    raw_score = self._calculate_score_from_reports(final_reports, normalize=False)

    # Calculate agreement
    mean_agreement = (
        sum(er.agreement for er in ensemble_reports) / len(ensemble_reports)
        if ensemble_reports
        else None  # No criteria to agree on -> not measured (never fabricate 1.0)
    )

    # Count CANNOT_ASSESS (binary) and NA (multi-choice)
    cannot_assess_count = sum(
        1
        for er in ensemble_reports
        if (er.final_verdict == CriterionVerdict.CANNOT_ASSESS)
        or (er.final_multi_choice_verdict is not None and er.final_multi_choice_verdict.na)
    )

    # Aggregate token usage and cost
    total_usage = TokenUsage()
    total_cost = 0.0
    for jr in judge_results:
        if jr.total_usage:
            total_usage = total_usage + jr.total_usage
        if jr.total_cost:
            total_cost += jr.total_cost

    return EnsembleEvaluationReport(
        score=final_score,
        raw_score=raw_score,
        llm_raw_score=raw_score,
        report=ensemble_reports,
        judge_scores=judge_scores,
        mean_agreement=mean_agreement,
        cannot_assess_count=cannot_assess_count,
        token_usage=total_usage if total_usage.total_tokens > 0 else None,
        completion_cost=total_cost if total_cost > 0 else None,
    )

Grader¶

Abstract base class for grader implementations.

Grader ¶

Grader(*, length_penalty: LengthPenalty | None = None, normalize: bool = True)

Bases: ABC

Base class for LLM-backed grading implementations.

All graders require an LLMConfig for the LLM client. Subclasses must implement judge() and aggregate() methods.

PARAMETER	DESCRIPTION
`length_penalty`	Optional configuration for penalizing overly long outputs. When provided, a penalty based on the token/word count is subtracted from the final score. TYPE: `LengthPenalty \| None` DEFAULT: `None`
`normalize`	If True (default), scores are normalized to 0-1. If False, raw weighted sums are returned, which is useful for RL training scenarios. TYPE: `bool` DEFAULT: `True`

Source code in src/autorubric/graders/base.py

def __init__(
    self,
    *,
    length_penalty: LengthPenalty | None = None,
    normalize: bool = True,
):
    self.length_penalty: LengthPenalty | None = length_penalty
    self.normalize: bool = normalize

judge `abstractmethod` `async` ¶

judge(to_grade: str, rubric: list[Criterion], query: str | None = None, reference_submission: str | None = None) -> Any

Collect raw judge results for the provided submission.

PARAMETER	DESCRIPTION
`to_grade`	The text to evaluate. TYPE: `str`
`rubric`	List of criteria to evaluate against. TYPE: `list[Criterion]`
`query`	Optional input/query that prompted the response. TYPE: `str \| None` DEFAULT: `None`
`reference_submission`	Optional exemplar response for grading context. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`Any`	Raw judge results (format depends on implementation).

Source code in src/autorubric/graders/base.py

@abstractmethod
async def judge(
    self,
    to_grade: str,
    rubric: list[Criterion],
    query: str | None = None,
    reference_submission: str | None = None,
) -> Any:
    """Collect raw judge results for the provided submission.

    Args:
        to_grade: The text to evaluate.
        rubric: List of criteria to evaluate against.
        query: Optional input/query that prompted the response.
        reference_submission: Optional exemplar response for grading context.

    Returns:
        Raw judge results (format depends on implementation).
    """
    pass

aggregate `abstractmethod` `async` ¶

aggregate(judge_results: Any, *, normalize: bool = True) -> EvaluationReport

Transform judge results into an EvaluationReport.

PARAMETER	DESCRIPTION
`judge_results`	Raw results from judge(). TYPE: `Any`
`normalize`	If True, normalize score to 0-1. If False, return raw weighted sum. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`EvaluationReport`	EvaluationReport with score and optional per-criterion breakdown.

Source code in src/autorubric/graders/base.py

@abstractmethod
async def aggregate(self, judge_results: Any, *, normalize: bool = True) -> EvaluationReport:
    """Transform judge results into an EvaluationReport.

    Args:
        judge_results: Raw results from judge().
        normalize: If True, normalize score to 0-1. If False, return raw weighted sum.

    Returns:
        EvaluationReport with score and optional per-criterion breakdown.
    """
    pass

grade `async` ¶

grade(to_grade: ToGradeInput, rubric: list[Criterion], query: str | None = None, reference_submission: str | None = None) -> EvaluationReport

Grade the submission against the rubric.

This is the main entry point for the grader.

PARAMETER	DESCRIPTION
`to_grade`	The text to evaluate. Can be either: - A string (optionally with / markers) - A dict with 'thinking' and 'output' keys TYPE: `ToGradeInput`
`rubric`	List of criteria to evaluate against. TYPE: `list[Criterion]`
`query`	Optional input/query that prompted the response. TYPE: `str \| None` DEFAULT: `None`
`reference_submission`	Optional exemplar response for grading context. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`EvaluationReport`	EvaluationReport with score and optional per-criterion breakdown.
`EvaluationReport`	If normalize=True (default), score is 0-1. If normalize=False, score is raw
`EvaluationReport`	weighted sum. If length_penalty was configured, the penalty is subtracted from
`EvaluationReport`	the score. The raw_score field contains the unnormalized weighted sum before
`EvaluationReport`	length penalty.

Source code in src/autorubric/graders/base.py

async def grade(
    self,
    to_grade: ToGradeInput,
    rubric: list[Criterion],
    query: str | None = None,
    reference_submission: str | None = None,
) -> EvaluationReport:
    """Grade the submission against the rubric.

    This is the main entry point for the grader.

    Args:
        to_grade: The text to evaluate. Can be either:
            - A string (optionally with <thinking>/<output> markers)
            - A dict with 'thinking' and 'output' keys
        rubric: List of criteria to evaluate against.
        query: Optional input/query that prompted the response.
        reference_submission: Optional exemplar response for grading context.

    Returns:
        EvaluationReport with score and optional per-criterion breakdown.
        If normalize=True (default), score is 0-1. If normalize=False, score is raw
        weighted sum. If length_penalty was configured, the penalty is subtracted from
        the score. The raw_score field contains the unnormalized weighted sum before
        length penalty.
    """
    # Convert to_grade to string for judge() call (maintains compatibility)
    if isinstance(to_grade, str):
        to_grade_str = to_grade
    else:
        # Dict format - reconstruct string with markers for judge()
        thinking = to_grade.get("thinking", "")
        output = to_grade.get("output", "")
        parts = []
        if thinking:
            parts.append(f"<thinking>{thinking}</thinking>")
        if output:
            parts.append(f"<output>{output}</output>")
        to_grade_str = "\n".join(parts) if parts else ""

    # Call judge with string format (maintains compatibility)
    judge_results = await self.judge(to_grade_str, rubric, query, reference_submission)
    report = await self.aggregate(judge_results, normalize=self.normalize)

    if self.length_penalty is not None:
        # A grade-FAILURE has no score (errored/empty report): there is nothing to
        # penalize, so return it unchanged rather than subtracting from None.
        if report.score is None:
            return report

        # Normalize to_grade to dict format for penalty calculation
        to_grade_normalized = normalize_to_grade_input(to_grade)

        # Compute penalty
        penalty = compute_length_penalty(to_grade_normalized, self.length_penalty)

        # Apply penalty (penalty is always non-negative, so we subtract)
        adjusted_score = report.score - penalty
        if self.normalize:
            adjusted_score = max(0.0, adjusted_score)

        # Return the same report type with adjusted score
        if isinstance(report, EnsembleEvaluationReport):
            return EnsembleEvaluationReport(
                score=adjusted_score,
                raw_score=report.raw_score,
                llm_raw_score=report.llm_raw_score,
                report=report.report,
                judge_scores=report.judge_scores,
                mean_agreement=report.mean_agreement,
                cannot_assess_count=report.cannot_assess_count,
                token_usage=report.token_usage,
                completion_cost=report.completion_cost,
                error=report.error,
            )
        else:
            return EvaluationReport(
                score=adjusted_score,
                raw_score=report.raw_score,
                llm_raw_score=report.llm_raw_score,
                report=report.report,
                cannot_assess_count=report.cannot_assess_count,
                error=report.error,
                token_usage=report.token_usage,
                completion_cost=report.completion_cost,
            )

    return report

JudgeSpec¶

Configuration for a single judge in an ensemble.

JudgeSpec `dataclass` ¶

JudgeSpec(llm_config: LLMConfig, judge_id: str, weight: float = 1.0)

Specification for a single judge in an ensemble.

ATTRIBUTE	DESCRIPTION
`llm_config`	Configuration for this judge's LLM. TYPE: `LLMConfig`
`judge_id`	Unique identifier for this judge (e.g., "gpt-4", "claude-sonnet"). TYPE: `str`
`weight`	Voting weight for weighted aggregation (default 1.0). TYPE: `float`

Graders¶

Overview¶

Quick Example¶

Grading Options¶

CriterionGrader¶

CriterionGrader ¶

Single LLM¶

Single LLM + few-shot¶

Ensemble¶

Ensemble + few-shot¶

is_ensemble property ¶

has_few_shot property ¶

seed property ¶

judge async ¶

aggregate async ¶

Grader¶

Grader ¶

judge abstractmethod async ¶

aggregate abstractmethod async ¶

grade async ¶

JudgeSpec¶

JudgeSpec dataclass ¶

is_ensemble `property` ¶

has_few_shot `property` ¶

seed `property` ¶

judge `async` ¶

aggregate `async` ¶

judge `abstractmethod` `async` ¶

aggregate `abstractmethod` `async` ¶

grade `async` ¶

JudgeSpec `dataclass` ¶