
Graders

Grader implementations for evaluating responses against rubrics.

Overview

Graders evaluate responses against rubrics and return structured reports. The main implementation is CriterionGrader, which supports single LLM, ensemble, and few-shot modes; these modes are orthogonal and can be combined freely.

Quick Example

from autorubric import LLMConfig, FewShotConfig
from autorubric.graders import CriterionGrader, JudgeSpec, Grader

# Single LLM mode
grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
)

# With custom system prompt
grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
    system_prompt="You are evaluating technical documentation...",
)

# Ensemble mode
grader = CriterionGrader(
    judges=[
        JudgeSpec(LLMConfig(model="gemini/gemini-3-flash-preview"), "gemini", weight=1.0),
        JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929"), "claude", weight=1.2),
    ],
    aggregation="weighted",
)

# Single LLM + few-shot
grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
    training_data=train_data,
    few_shot_config=FewShotConfig(n_examples=3, balance_verdicts=True),
)

# Grade
result = await rubric.grade(to_grade=response, grader=grader)

Grading Options

grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),

    # Score normalization
    normalize=True,          # True: 0-1 range, False: raw weighted sum

    # CANNOT_ASSESS handling
    cannot_assess_config=CannotAssessConfig(strategy=CannotAssessStrategy.SKIP),

    # Length penalty
    length_penalty=LengthPenalty(free_budget=6000, max_cap=8000),

    # Position bias mitigation (for multi-choice)
    shuffle_options=True,    # Default: enabled
)
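For reinforcement-learning style reward shaping you may want the raw weighted sum instead of a normalized score. A minimal sketch, assuming the rubric and response objects from the Quick Example:

grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
    normalize=False,  # return the raw weighted sum instead of a 0-1 score
)

result = await rubric.grade(to_grade=response, grader=grader)
print(result.score)      # raw weighted sum (length penalty subtracted if configured)
print(result.raw_score)  # unnormalized weighted sum before any length penalty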

CriterionGrader

Main grader with support for single LLM, ensemble, and few-shot modes.

CriterionGrader

CriterionGrader(*, llm_config: LLMConfig | None = None, judges: list[JudgeSpec] | None = None, aggregation: AggregationStrategy = 'majority', ordinal_aggregation: OrdinalAggregation = 'mean', nominal_aggregation: NominalAggregation = 'mode', training_data: RubricDataset | None = None, few_shot_config: FewShotConfig | None = None, system_prompt: str | None = None, multi_choice_system_prompt: str | None = None, length_penalty: LengthPenalty | None = None, normalize: bool = True, cannot_assess_config: CannotAssessConfig | None = None, shuffle_options: bool = True)

Bases: Grader

Unified criterion-based grader with compositional few-shot and ensemble support.

This grader evaluates each criterion independently and supports:

- Single LLM mode (via llm_config)
- Ensemble mode with multiple judges (via judges)
- Few-shot prompting (via training_data + few_shot_config)

All combinations work: single LLM, single + few-shot, ensemble, ensemble + few-shot.

Parameters are orthogonal:

- llm_config OR judges: Choose single-LLM or ensemble mode
- training_data + few_shot_config: Enable few-shot prompting (applies to all judges)

Example

from autorubric import LLMConfig, FewShotConfig, RubricDataset
from autorubric.graders import CriterionGrader, JudgeSpec

Single LLM

grader = CriterionGrader(llm_config=LLMConfig(model="gemini/gemini-3-flash-preview"))

Single LLM + few-shot

train, test = dataset.split_train_test(n_train=100)
grader = CriterionGrader(
    llm_config=LLMConfig(model="gemini/gemini-3-flash-preview"),
    training_data=train,
    few_shot_config=FewShotConfig(n_examples=3),
)

Ensemble

grader = CriterionGrader(
    judges=[
        JudgeSpec(LLMConfig(model="gemini/gemini-3-flash-preview"), "gemini"),
        JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929"), "claude"),
    ],
    aggregation="majority",
)

Ensemble + few-shot

grader = CriterionGrader(
    judges=[JudgeSpec(...), JudgeSpec(...)],
    aggregation="majority",
    training_data=train,
    few_shot_config=FewShotConfig(n_examples=3),
)

Initialize the criterion grader.

PARAMETER DESCRIPTION
llm_config

Configuration for single-LLM mode. Mutually exclusive with judges.

TYPE: LLMConfig | None DEFAULT: None

judges

List of JudgeSpec for ensemble mode. Mutually exclusive with llm_config.

TYPE: list[JudgeSpec] | None DEFAULT: None

aggregation

Strategy for aggregating votes in ensemble mode (binary criteria).

TYPE: AggregationStrategy DEFAULT: 'majority'

ordinal_aggregation

Strategy for aggregating ordinal multi-choice votes. Options: "mean", "median", "weighted_mean", "mode".

TYPE: OrdinalAggregation DEFAULT: 'mean'

nominal_aggregation

Strategy for aggregating nominal multi-choice votes. Options: "mode", "weighted_mode", "unanimous".

TYPE: NominalAggregation DEFAULT: 'mode'

training_data

Dataset for few-shot examples. If provided, enables few-shot prompting.

TYPE: RubricDataset | None DEFAULT: None

few_shot_config

Configuration for few-shot example selection.

TYPE: FewShotConfig | None DEFAULT: None

system_prompt

Custom system prompt for binary criteria.

TYPE: str | None DEFAULT: None

multi_choice_system_prompt

Custom system prompt for multi-choice criteria.

TYPE: str | None DEFAULT: None

length_penalty

Optional length penalty configuration.

TYPE: LengthPenalty | None DEFAULT: None

normalize

If True, normalize score to [0, 1]. If False, return raw sum.

TYPE: bool DEFAULT: True

cannot_assess_config

Configuration for handling CANNOT_ASSESS verdicts.

TYPE: CannotAssessConfig | None DEFAULT: None

shuffle_options

If True (default), randomize the order of multi-choice options presented to the LLM to mitigate position bias. Each judge/call sees a different random order, and responses are mapped back to original indices. Disable for deterministic behavior in tests.

TYPE: bool DEFAULT: True

RAISES DESCRIPTION
ValueError

If neither llm_config nor judges is provided, or both are provided.

Source code in src/autorubric/graders/criterion_grader.py
def __init__(
    self,
    *,
    # Single LLM mode
    llm_config: LLMConfig | None = None,
    # Ensemble mode (overrides llm_config)
    judges: list[JudgeSpec] | None = None,
    aggregation: AggregationStrategy = "majority",
    # Multi-choice aggregation strategies
    ordinal_aggregation: OrdinalAggregation = "mean",
    nominal_aggregation: NominalAggregation = "mode",
    # Few-shot mode (orthogonal - applies to all judges)
    training_data: RubricDataset | None = None,
    few_shot_config: FewShotConfig | None = None,
    # Common parameters
    system_prompt: str | None = None,
    multi_choice_system_prompt: str | None = None,
    length_penalty: LengthPenalty | None = None,
    normalize: bool = True,
    cannot_assess_config: CannotAssessConfig | None = None,
    # Position bias mitigation
    shuffle_options: bool = True,
):
    """Initialize the criterion grader.

    Args:
        llm_config: Configuration for single-LLM mode. Mutually exclusive with judges.
        judges: List of JudgeSpec for ensemble mode. Mutually exclusive with llm_config.
        aggregation: Strategy for aggregating votes in ensemble mode (binary criteria).
        ordinal_aggregation: Strategy for aggregating ordinal multi-choice votes.
            Options: "mean", "median", "weighted_mean", "mode".
        nominal_aggregation: Strategy for aggregating nominal multi-choice votes.
            Options: "mode", "weighted_mode", "unanimous".
        training_data: Dataset for few-shot examples. If provided, enables few-shot prompting.
        few_shot_config: Configuration for few-shot example selection.
        system_prompt: Custom system prompt for binary criteria.
        multi_choice_system_prompt: Custom system prompt for multi-choice criteria.
        length_penalty: Optional length penalty configuration.
        normalize: If True, normalize score to [0, 1]. If False, return raw sum.
        cannot_assess_config: Configuration for handling CANNOT_ASSESS verdicts.
        shuffle_options: If True (default), randomize the order of multi-choice options
            presented to the LLM to mitigate position bias. Each judge/call sees a
            different random order, and responses are mapped back to original indices.
            Disable for deterministic behavior in tests.

    Raises:
        ValueError: If neither llm_config nor judges is provided, or both are provided.
    """
    super().__init__(length_penalty=length_penalty, normalize=normalize)

    # Validate: must have either llm_config or judges, not both, not neither
    if llm_config is None and judges is None:
        raise ValueError("Must provide either llm_config or judges")
    if llm_config is not None and judges is not None:
        raise ValueError("Cannot provide both llm_config and judges")

    # Normalize to ensemble representation (single LLM = ensemble of 1)
    if llm_config is not None:
        self._judges = [JudgeSpec(llm_config=llm_config, judge_id="default", weight=1.0)]
    else:
        self._judges = judges  # type: ignore

    self._aggregation = aggregation
    self._ordinal_aggregation = ordinal_aggregation
    self._nominal_aggregation = nominal_aggregation
    self._training_data = training_data
    self._few_shot_config = few_shot_config or FewShotConfig()
    self._cannot_assess_config = cannot_assess_config or CannotAssessConfig()
    self._shuffle_options = shuffle_options

    # Build system prompts (separate for binary and multi-choice)
    if system_prompt is None:
        self._system_prompt = GRADER_SYSTEM_PROMPT_DEFAULT
        if training_data is not None:
            self._system_prompt += FEW_SHOT_SYSTEM_PROMPT_ADDITION
    else:
        self._system_prompt = system_prompt

    if multi_choice_system_prompt is None:
        self._multi_choice_system_prompt = MULTI_CHOICE_SYSTEM_PROMPT
        if training_data is not None:
            self._multi_choice_system_prompt += MULTI_CHOICE_FEW_SHOT_ADDITION
    else:
        self._multi_choice_system_prompt = multi_choice_system_prompt

    # Create LLM clients for each judge
    self._clients = {
        judge.judge_id: LLMClient(judge.llm_config)
        for judge in self._judges
    }

    # Pre-compute few-shot examples if training data provided
    # Note: For multi-choice, examples are stored as (submission, selected_index, reason)
    self._criterion_examples: dict[int, list[FewShotExample]] = {}
    self._multi_choice_examples: dict[int, list[tuple[str, int, str | None]]] = {}
    if training_data is not None:
        self._prepare_examples()

is_ensemble property

is_ensemble: bool

Whether this grader uses multiple judges.

has_few_shot property

has_few_shot: bool

Whether this grader uses few-shot prompting.
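Both properties are read-only and can be used to introspect a configured grader, for example:

grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
    training_data=train_data,
    few_shot_config=FewShotConfig(n_examples=3),
)
print(grader.is_ensemble)   # False: only a single judge is configured
print(grader.has_few_shot)  # True: training_data was supplied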

judge async

judge(to_grade: str, rubric: list[Criterion], query: str | None = None, reference_submission: str | None = None) -> list[JudgeCriterionResults]

Judge all criteria with all judges (parallel across judges).

Source code in src/autorubric/graders/criterion_grader.py
async def judge(
    self,
    to_grade: str,
    rubric: list[Criterion],
    query: str | None = None,
    reference_submission: str | None = None,
) -> list[JudgeCriterionResults]:
    """Judge all criteria with all judges (parallel across judges)."""
    tasks = [
        self._judge_all_criteria_for_judge(judge, rubric, to_grade, query, reference_submission)
        for judge in self._judges
    ]
    return list(await asyncio.gather(*tasks))
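judge() is normally called for you by grade(), but it can also be invoked directly to inspect raw per-judge results before aggregation. A sketch, assuming criteria is a list[Criterion] and response is the text to grade:

judge_results = await grader.judge(
    to_grade=response,
    rubric=criteria,
    query=prompt,  # optional: the input that produced `response`
)
for jr in judge_results:  # one JudgeCriterionResults per judge
    print(jr.judge_id, len(jr.criterion_results))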

aggregate async

aggregate(judge_results: list[JudgeCriterionResults], *, normalize: bool = True) -> EnsembleEvaluationReport

Aggregate results from all judges into final report.

Handles both binary and multi-choice criteria:

- Binary: Uses JudgeVote and _aggregate_votes()
- Multi-choice: Uses MultiChoiceJudgeVote and _aggregate_multi_choice_votes()

Source code in src/autorubric/graders/criterion_grader.py
async def aggregate(
    self, judge_results: list[JudgeCriterionResults], *, normalize: bool = True
) -> EnsembleEvaluationReport:
    """Aggregate results from all judges into final report.

    Handles both binary and multi-choice criteria:
    - Binary: Uses JudgeVote and _aggregate_votes()
    - Multi-choice: Uses MultiChoiceJudgeVote and _aggregate_multi_choice_votes()
    """
    if not judge_results:
        return EnsembleEvaluationReport(
            score=0.0,
            raw_score=0.0,
            llm_raw_score=0.0,
            error="No judge results to aggregate",
        )

    n_criteria = len(judge_results[0].criterion_results)

    # Build ensemble criterion reports
    ensemble_reports: list[EnsembleCriterionReport] = []
    for criterion_idx in range(n_criteria):
        # Get criterion from first judge's result
        first_cr = judge_results[0].criterion_results[criterion_idx]
        criterion_report = first_cr.report

        if criterion_report.is_multi_choice:
            # Multi-choice: build MultiChoiceJudgeVote list
            mc_votes: list[MultiChoiceJudgeVote] = []
            for judge_result in judge_results:
                cr = judge_result.criterion_results[criterion_idx]
                mcv = cr.report.multi_choice_verdict
                if mcv is not None:
                    mc_votes.append(
                        MultiChoiceJudgeVote(
                            judge_id=judge_result.judge_id,
                            selected_index=mcv.selected_index,
                            selected_label=mcv.selected_label,
                            value=mcv.value,
                            reason=cr.report.reason,
                            weight=judge_result.weight,
                            na=mcv.na,
                        )
                    )

            # Aggregate multi-choice votes
            final_mc_verdict, final_reason = self._aggregate_multi_choice_votes(
                mc_votes, criterion_report
            )

            ensemble_reports.append(
                EnsembleCriterionReport(
                    criterion=Criterion(
                        weight=criterion_report.weight,
                        requirement=criterion_report.requirement,
                        name=criterion_report.name,
                        options=criterion_report.options,
                        scale_type=criterion_report.scale_type,
                        aggregation=criterion_report.aggregation,
                    ),
                    final_verdict=None,  # Binary verdict is None for multi-choice
                    final_reason=final_reason,
                    votes=[],  # Binary votes empty for multi-choice
                    final_multi_choice_verdict=final_mc_verdict,
                    multi_choice_votes=mc_votes,
                )
            )
        else:
            # Binary: build JudgeVote list
            votes: list[JudgeVote] = []
            for judge_result in judge_results:
                cr = judge_result.criterion_results[criterion_idx]
                votes.append(
                    JudgeVote(
                        judge_id=judge_result.judge_id,
                        verdict=cr.report.verdict,
                        reason=cr.report.reason,
                        weight=judge_result.weight,
                    )
                )

            final_verdict, final_reason = self._aggregate_votes(votes)

            ensemble_reports.append(
                EnsembleCriterionReport(
                    criterion=Criterion(
                        weight=criterion_report.weight,
                        requirement=criterion_report.requirement,
                        name=criterion_report.name,
                    ),
                    final_verdict=final_verdict,
                    final_reason=final_reason,
                    votes=votes,
                )
            )

    # Calculate per-judge scores
    judge_scores = {}
    for judge_result in judge_results:
        score = self._calculate_score_from_reports(judge_result.reports, normalize)
        judge_scores[judge_result.judge_id] = score

    # Calculate final score from aggregated verdicts
    final_reports = []
    for er in ensemble_reports:
        if er.final_multi_choice_verdict is not None:
            # Multi-choice criterion
            final_reports.append(
                CriterionReport(
                    weight=er.criterion.weight,
                    requirement=er.criterion.requirement,
                    name=er.criterion.name,
                    options=er.criterion.options,
                    scale_type=er.criterion.scale_type,
                    aggregation=er.criterion.aggregation,
                    verdict=None,  # Binary verdict is None
                    multi_choice_verdict=er.final_multi_choice_verdict,
                    reason=er.final_reason,
                )
            )
        else:
            # Binary criterion
            final_reports.append(
                CriterionReport(
                    weight=er.criterion.weight,
                    requirement=er.criterion.requirement,
                    name=er.criterion.name,
                    verdict=er.final_verdict,
                    reason=er.final_reason,
                )
            )
    final_score = self._calculate_score_from_reports(final_reports, normalize)
    raw_score = self._calculate_score_from_reports(final_reports, normalize=False)

    # Calculate agreement
    mean_agreement = (
        sum(er.agreement for er in ensemble_reports) / len(ensemble_reports)
        if ensemble_reports
        else 1.0
    )

    # Count CANNOT_ASSESS (binary) and NA (multi-choice)
    cannot_assess_count = sum(
        1
        for er in ensemble_reports
        if (er.final_verdict == CriterionVerdict.CANNOT_ASSESS)
        or (er.final_multi_choice_verdict is not None and er.final_multi_choice_verdict.na)
    )

    # Aggregate token usage and cost
    total_usage = TokenUsage()
    total_cost = 0.0
    for jr in judge_results:
        if jr.total_usage:
            total_usage = total_usage + jr.total_usage
        if jr.total_cost:
            total_cost += jr.total_cost

    return EnsembleEvaluationReport(
        score=final_score,
        raw_score=raw_score,
        llm_raw_score=raw_score,
        report=ensemble_reports,
        judge_scores=judge_scores,
        mean_agreement=mean_agreement,
        cannot_assess_count=cannot_assess_count,
        token_usage=total_usage if total_usage.total_tokens > 0 else None,
        completion_cost=total_cost if total_cost > 0 else None,
    )
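Continuing the sketch above, the per-judge results can then be aggregated and the ensemble-specific fields inspected:

report = await grader.aggregate(judge_results, normalize=True)
print(report.score)                # aggregated, normalized score
print(report.judge_scores)         # per-judge scores keyed by judge_id
print(report.mean_agreement)       # mean cross-judge agreement across criteria
print(report.cannot_assess_count)  # criteria resolved as CANNOT_ASSESS / NA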

Grader

Abstract base class for grader implementations.

Grader

Grader(*, length_penalty: LengthPenalty | None = None, normalize: bool = True)

Bases: ABC

Base class for LLM-backed grading implementations.

All graders require an LLMConfig for the LLM client. Subclasses must implement judge() and aggregate() methods.

PARAMETER DESCRIPTION
length_penalty

Optional configuration for penalizing overly long outputs. When provided, a penalty based on the token/word count is subtracted from the final score.

TYPE: LengthPenalty | None DEFAULT: None

normalize

If True (default), scores are normalized to 0-1. If False, raw weighted sums are returned, which is useful for RL training scenarios.

TYPE: bool DEFAULT: True

Source code in src/autorubric/graders/base.py
def __init__(
    self,
    *,
    length_penalty: LengthPenalty | None = None,
    normalize: bool = True,
):
    self.length_penalty: LengthPenalty | None = length_penalty
    self.normalize: bool = normalize

judge abstractmethod async

judge(to_grade: str, rubric: list[Criterion], query: str | None = None, reference_submission: str | None = None) -> Any

Collect raw judge results for the provided submission.

PARAMETER DESCRIPTION
to_grade

The text to evaluate.

TYPE: str

rubric

List of criteria to evaluate against.

TYPE: list[Criterion]

query

Optional input/query that prompted the response.

TYPE: str | None DEFAULT: None

reference_submission

Optional exemplar response for grading context.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
Any

Raw judge results (format depends on implementation).

Source code in src/autorubric/graders/base.py
@abstractmethod
async def judge(
    self,
    to_grade: str,
    rubric: list[Criterion],
    query: str | None = None,
    reference_submission: str | None = None,
) -> Any:
    """Collect raw judge results for the provided submission.

    Args:
        to_grade: The text to evaluate.
        rubric: List of criteria to evaluate against.
        query: Optional input/query that prompted the response.
        reference_submission: Optional exemplar response for grading context.

    Returns:
        Raw judge results (format depends on implementation).
    """
    pass

aggregate abstractmethod async

aggregate(judge_results: Any, *, normalize: bool = True) -> EvaluationReport

Transform judge results into an EvaluationReport.

PARAMETER DESCRIPTION
judge_results

Raw results from judge().

TYPE: Any

normalize

If True, normalize score to 0-1. If False, return raw weighted sum.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
EvaluationReport

EvaluationReport with score and optional per-criterion breakdown.

Source code in src/autorubric/graders/base.py
@abstractmethod
async def aggregate(
    self, judge_results: Any, *, normalize: bool = True
) -> EvaluationReport:
    """Transform judge results into an EvaluationReport.

    Args:
        judge_results: Raw results from judge().
        normalize: If True, normalize score to 0-1. If False, return raw weighted sum.

    Returns:
        EvaluationReport with score and optional per-criterion breakdown.
    """
    pass
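A custom grader only needs to implement these two methods. The following is an illustrative sketch, not part of the library; the import locations and the CriterionVerdict.MET member are assumptions:

from typing import Any

# Illustrative sketch: import locations are assumed.
from autorubric import Criterion, CriterionReport, CriterionVerdict, EvaluationReport
from autorubric.graders import Grader

class EverythingMetGrader(Grader):
    """Toy grader that marks every criterion as met, without calling an LLM."""

    async def judge(
        self,
        to_grade: str,
        rubric: list[Criterion],
        query: str | None = None,
        reference_submission: str | None = None,
    ) -> Any:
        # Raw judge results may take any shape; here the rubric is passed through.
        return rubric

    async def aggregate(self, judge_results: Any, *, normalize: bool = True) -> EvaluationReport:
        reports = [
            CriterionReport(
                weight=c.weight,
                requirement=c.requirement,
                name=c.name,
                verdict=CriterionVerdict.MET,  # assumed enum member
                reason="stub",
            )
            for c in judge_results
        ]
        raw = sum(c.weight for c in judge_results)
        return EvaluationReport(
            score=1.0 if normalize else raw,
            raw_score=raw,
            llm_raw_score=raw,
            report=reports,
        )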

grade async

grade(to_grade: ToGradeInput, rubric: list[Criterion], query: str | None = None, reference_submission: str | None = None) -> EvaluationReport

Grade the submission against the rubric.

This is the main entry point for the grader.

PARAMETER DESCRIPTION
to_grade

The text to evaluate. Can be either:

- A string (optionally with <thinking>/<output> markers)
- A dict with 'thinking' and 'output' keys

TYPE: ToGradeInput

rubric

List of criteria to evaluate against.

TYPE: list[Criterion]

query

Optional input/query that prompted the response.

TYPE: str | None DEFAULT: None

reference_submission

Optional exemplar response for grading context.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
EvaluationReport

EvaluationReport with score and optional per-criterion breakdown. If normalize=True (default), score is 0-1; if normalize=False, score is the raw weighted sum. If length_penalty was configured, the penalty is subtracted from the score. The raw_score field contains the unnormalized weighted sum before the length penalty.

Source code in src/autorubric/graders/base.py
async def grade(
    self,
    to_grade: ToGradeInput,
    rubric: list[Criterion],
    query: str | None = None,
    reference_submission: str | None = None,
) -> EvaluationReport:
    """Grade the submission against the rubric.

    This is the main entry point for the grader.

    Args:
        to_grade: The text to evaluate. Can be either:
            - A string (optionally with <thinking>/<output> markers)
            - A dict with 'thinking' and 'output' keys
        rubric: List of criteria to evaluate against.
        query: Optional input/query that prompted the response.
        reference_submission: Optional exemplar response for grading context.

    Returns:
        EvaluationReport with score and optional per-criterion breakdown.
        If normalize=True (default), score is 0-1. If normalize=False, score is raw
        weighted sum. If length_penalty was configured, the penalty is subtracted from
        the score. The raw_score field contains the unnormalized weighted sum before
        length penalty.
    """
    # Convert to_grade to string for judge() call (maintains compatibility)
    if isinstance(to_grade, str):
        to_grade_str = to_grade
    else:
        # Dict format - reconstruct string with markers for judge()
        thinking = to_grade.get("thinking", "")
        output = to_grade.get("output", "")
        parts = []
        if thinking:
            parts.append(f"<thinking>{thinking}</thinking>")
        if output:
            parts.append(f"<output>{output}</output>")
        to_grade_str = "\n".join(parts) if parts else ""

    # Call judge with string format (maintains compatibility)
    judge_results = await self.judge(to_grade_str, rubric, query, reference_submission)
    report = await self.aggregate(judge_results, normalize=self.normalize)

    if self.length_penalty is not None:
        # Normalize to_grade to dict format for penalty calculation
        to_grade_normalized = normalize_to_grade_input(to_grade)

        # Compute penalty
        penalty = compute_length_penalty(to_grade_normalized, self.length_penalty)

        # Apply penalty (penalty is always non-negative, so we subtract)
        adjusted_score = report.score - penalty
        if self.normalize:
            adjusted_score = max(0.0, adjusted_score)

        # Return the same report type with adjusted score
        if isinstance(report, EnsembleEvaluationReport):
            return EnsembleEvaluationReport(
                score=adjusted_score,
                raw_score=report.raw_score,
                llm_raw_score=report.llm_raw_score,
                report=report.report,
                judge_scores=report.judge_scores,
                mean_agreement=report.mean_agreement,
                cannot_assess_count=report.cannot_assess_count,
                token_usage=report.token_usage,
                completion_cost=report.completion_cost,
                error=report.error,
            )
        else:
            return EvaluationReport(
                score=adjusted_score,
                raw_score=report.raw_score,
                llm_raw_score=report.llm_raw_score,
                report=report.report,
                cannot_assess_count=report.cannot_assess_count,
                error=report.error,
                token_usage=report.token_usage,
                completion_cost=report.completion_cost,
            )

    return report
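grade() also accepts the dict form of ToGradeInput, which separates chain-of-thought from the final output. A sketch, assuming criteria is a list[Criterion]:

report = await grader.grade(
    to_grade={
        "thinking": "First check the units, then the rounding...",
        "output": "The answer is 42.",
    },
    rubric=criteria,
)
print(report.score)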

JudgeSpec

Configuration for a single judge in an ensemble.

JudgeSpec dataclass

JudgeSpec(llm_config: LLMConfig, judge_id: str, weight: float = 1.0)

Specification for a single judge in an ensemble.

ATTRIBUTE DESCRIPTION
llm_config

Configuration for this judge's LLM.

TYPE: LLMConfig

judge_id

Unique identifier for this judge (e.g., "gpt-4", "claude-sonnet").

TYPE: str

weight

Voting weight for weighted aggregation (default 1.0).

TYPE: float