
Graders

Grader implementations for evaluating responses against rubrics.

Overview

Graders evaluate responses against rubrics and return structured reports. The main implementation is CriterionGrader, which supports single LLM, ensemble, and few-shot modes; these modes are orthogonal and can be combined freely.

Quick Example

from autorubric import LLMConfig, FewShotConfig
from autorubric.graders import CriterionGrader, JudgeSpec, Grader

# Single LLM mode
grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
)

# With custom system prompt
grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
    system_prompt="You are evaluating technical documentation...",
)

# Ensemble mode
grader = CriterionGrader(
    judges=[
        JudgeSpec(LLMConfig(model="gemini/gemini-3-flash-preview"), "gemini", weight=1.0),
        JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929"), "claude", weight=1.2),
    ],
    aggregation="weighted",
)

# Single LLM + few-shot
grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
    training_data=train_data,
    few_shot_config=FewShotConfig(n_examples=3, balance_verdicts=True),
)

# Grade
result = await rubric.grade(to_grade=response, grader=grader)

Grading Options

grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),

    # Score normalization
    normalize=True,          # True: 0-1 range, False: raw weighted sum

    # CANNOT_ASSESS handling
    cannot_assess_config=CannotAssessConfig(strategy=CannotAssessStrategy.SKIP),

    # Length penalty
    length_penalty=LengthPenalty(free_budget=6000, max_cap=8000),

    # Position bias mitigation (for multi-choice)
    shuffle_options=True,    # Default: enabled
)
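For reinforcement-learning style reward shaping you may want the raw weighted sum instead of a normalized score. A minimal sketch, assuming the rubric and response objects from the Quick Example:

grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
    normalize=False,  # return the raw weighted sum instead of a 0-1 score
)

result = await rubric.grade(to_grade=response, grader=grader)
print(result.score)      # raw weighted sum (length penalty subtracted if configured)
print(result.raw_score)  # unnormalized weighted sum before any length penalty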

CriterionGrader

Main grader with support for single LLM, ensemble, and few-shot modes.

CriterionGrader

CriterionGrader(*, llm_config: LLMConfig | None = None, judges: list[JudgeSpec] | None = None, aggregation: AggregationStrategy = 'majority', ordinal_aggregation: OrdinalAggregation = 'mean', nominal_aggregation: NominalAggregation = 'mode', training_data: RubricDataset | None = None, few_shot_config: FewShotConfig | None = None, system_prompt: str | None = None, multi_choice_system_prompt: str | None = None, length_penalty: LengthPenalty | None = None, normalize: bool = True, cannot_assess_config: CannotAssessConfig | None = None, shuffle_options: bool = True)

Bases: Grader

Unified criterion-based grader with compositional few-shot and ensemble support.

This grader evaluates each criterion independently and supports:

- Single LLM mode (via llm_config)
- Ensemble mode with multiple judges (via judges)
- Few-shot prompting (via training_data + few_shot_config)

All combinations work: single LLM, single + few-shot, ensemble, ensemble + few-shot.

Parameters are orthogonal:

- llm_config OR judges: Choose single-LLM or ensemble mode
- training_data + few_shot_config: Enable few-shot prompting (applies to all judges)

Example

from autorubric import LLMConfig, FewShotConfig, RubricDataset
from autorubric.graders import CriterionGrader, JudgeSpec

Single LLM

grader = CriterionGrader(llm_config=LLMConfig(model="gemini/gemini-3-flash-preview"))

Single LLM + few-shot

train, test = dataset.split_train_test(n_train=100)
grader = CriterionGrader(
    llm_config=LLMConfig(model="gemini/gemini-3-flash-preview"),
    training_data=train,
    few_shot_config=FewShotConfig(n_examples=3),
)

Ensemble

grader = CriterionGrader(
    judges=[
        JudgeSpec(LLMConfig(model="gemini/gemini-3-flash-preview"), "gemini"),
        JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929"), "claude"),
    ],
    aggregation="majority",
)

Ensemble + few-shot

grader = CriterionGrader(
    judges=[JudgeSpec(...), JudgeSpec(...)],
    aggregation="majority",
    training_data=train,
    few_shot_config=FewShotConfig(n_examples=3),
)

Initialize the criterion grader.

PARAMETER DESCRIPTION
llm_config

Configuration for single-LLM mode. Mutually exclusive with judges.

TYPE: LLMConfig | None DEFAULT: None

judges

List of JudgeSpec for ensemble mode. Mutually exclusive with llm_config.

TYPE: list[JudgeSpec] | None DEFAULT: None

aggregation

Strategy for aggregating votes in ensemble mode (binary criteria).

TYPE: AggregationStrategy DEFAULT: 'majority'

ordinal_aggregation

Strategy for aggregating ordinal multi-choice votes. Options: "mean", "median", "weighted_mean", "mode".

TYPE: OrdinalAggregation DEFAULT: 'mean'

nominal_aggregation

Strategy for aggregating nominal multi-choice votes. Options: "mode", "weighted_mode", "unanimous".

TYPE: NominalAggregation DEFAULT: 'mode'

training_data

Dataset for few-shot examples. If provided, enables few-shot prompting.

TYPE: RubricDataset | None DEFAULT: None

few_shot_config

Configuration for few-shot example selection.

TYPE: FewShotConfig | None DEFAULT: None

system_prompt

Custom system prompt for binary criteria.

TYPE: str | None DEFAULT: None

multi_choice_system_prompt

Custom system prompt for multi-choice criteria.

TYPE: str | None DEFAULT: None

length_penalty

Optional length penalty configuration.

TYPE: LengthPenalty | None DEFAULT: None

normalize

If True, normalize score to [0, 1]. If False, return raw sum.

TYPE: bool DEFAULT: True

cannot_assess_config

Configuration for handling CANNOT_ASSESS verdicts.

TYPE: CannotAssessConfig | None DEFAULT: None

shuffle_options

If True (default), randomize the order of multi-choice options presented to the LLM to mitigate position bias. Each judge/call sees a different random order, and responses are mapped back to original indices. Disable for deterministic behavior in tests.

TYPE: bool DEFAULT: True

RAISES DESCRIPTION
ValueError

If neither llm_config nor judges is provided, or both are provided.

Source code in src/autorubric/graders/criterion_grader.py
def __init__(
    self,
    *,
    # Single LLM mode
    llm_config: LLMConfig | None = None,
    # Ensemble mode (overrides llm_config)
    judges: list[JudgeSpec] | None = None,
    aggregation: AggregationStrategy = "majority",
    # Multi-choice aggregation strategies
    ordinal_aggregation: OrdinalAggregation = "mean",
    nominal_aggregation: NominalAggregation = "mode",
    # Few-shot mode (orthogonal - applies to all judges)
    training_data: RubricDataset | None = None,
    few_shot_config: FewShotConfig | None = None,
    # Common parameters
    system_prompt: str | None = None,
    multi_choice_system_prompt: str | None = None,
    length_penalty: LengthPenalty | None = None,
    normalize: bool = True,
    cannot_assess_config: CannotAssessConfig | None = None,
    # Position bias mitigation
    shuffle_options: bool = True,
):
    """Initialize the criterion grader.

    Args:
        llm_config: Configuration for single-LLM mode. Mutually exclusive with judges.
        judges: List of JudgeSpec for ensemble mode. Mutually exclusive with llm_config.
        aggregation: Strategy for aggregating votes in ensemble mode (binary criteria).
        ordinal_aggregation: Strategy for aggregating ordinal multi-choice votes.
            Options: "mean", "median", "weighted_mean", "mode".
        nominal_aggregation: Strategy for aggregating nominal multi-choice votes.
            Options: "mode", "weighted_mode", "unanimous".
        training_data: Dataset for few-shot examples. If provided, enables few-shot prompting.
        few_shot_config: Configuration for few-shot example selection.
        system_prompt: Custom system prompt for binary criteria.
        multi_choice_system_prompt: Custom system prompt for multi-choice criteria.
        length_penalty: Optional length penalty configuration.
        normalize: If True, normalize score to [0, 1]. If False, return raw sum.
        cannot_assess_config: Configuration for handling CANNOT_ASSESS verdicts.
        shuffle_options: If True (default), randomize the order of multi-choice options
            presented to the LLM to mitigate position bias. Each judge/call sees a
            different random order, and responses are mapped back to original indices.
            Disable for deterministic behavior in tests.

    Raises:
        ValueError: If neither llm_config nor judges is provided, or both are provided.
    """
    super().__init__(length_penalty=length_penalty, normalize=normalize)

    # Validate: must have either llm_config or judges, not both, not neither
    if llm_config is None and judges is None:
        raise ValueError("Must provide either llm_config or judges")
    if llm_config is not None and judges is not None:
        raise ValueError("Cannot provide both llm_config and judges")

    # Normalize to ensemble representation (single LLM = ensemble of 1)
    if llm_config is not None:
        self._judges = [JudgeSpec(llm_config=llm_config, judge_id="default", weight=1.0)]
    else:
        self._judges = judges  # type: ignore

    self._aggregation = aggregation
    self._ordinal_aggregation = ordinal_aggregation
    self._nominal_aggregation = nominal_aggregation
    self._training_data = training_data
    self._few_shot_config = few_shot_config or FewShotConfig()
    self._cannot_assess_config = cannot_assess_config or CannotAssessConfig()
    self._shuffle_options = shuffle_options

    # Build system prompts (separate for binary and multi-choice)
    if system_prompt is None:
        self._system_prompt = GRADER_SYSTEM_PROMPT_DEFAULT
        if training_data is not None:
            self._system_prompt += FEW_SHOT_SYSTEM_PROMPT_ADDITION
    else:
        self._system_prompt = system_prompt

    if multi_choice_system_prompt is None:
        self._multi_choice_system_prompt = MULTI_CHOICE_SYSTEM_PROMPT
        if training_data is not None:
            self._multi_choice_system_prompt += MULTI_CHOICE_FEW_SHOT_ADDITION
    else:
        self._multi_choice_system_prompt = multi_choice_system_prompt

    # Create LLM clients for each judge
    self._clients = {
        judge.judge_id: LLMClient(judge.llm_config)
        for judge in self._judges
    }

    # Pre-compute few-shot examples if training data provided
    # Note: For multi-choice, examples are stored as (submission, selected_index, reason)
    self._criterion_examples: dict[int, list[FewShotExample]] = {}
    self._multi_choice_examples: dict[int, list[tuple[str, int, str | None]]] = {}
    if training_data is not None:
        self._prepare_examples()

is_ensemble property

is_ensemble: bool

Whether this grader uses multiple judges.

has_few_shot property

has_few_shot: bool

Whether this grader uses few-shot prompting.
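Both properties are read-only and can be used to introspect a configured grader, for example:

grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
    training_data=train_data,
    few_shot_config=FewShotConfig(n_examples=3),
)
print(grader.is_ensemble)   # False: only a single judge is configured
print(grader.has_few_shot)  # True: training_data was supplied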

judge async

judge(to_grade: str, rubric: list[Criterion], query: str | None = None, reference_submission: str | None = None) -> list[JudgeCriterionResults]

Judge all criteria with all judges (parallel across judges).

Source code in src/autorubric/graders/criterion_grader.py
async def judge(
    self,
    to_grade: str,
    rubric: list[Criterion],
    query: str | None = None,
    reference_submission: str | None = None,
) -> list[JudgeCriterionResults]:
    """Judge all criteria with all judges (parallel across judges)."""
    tasks = [
        self._judge_all_criteria_for_judge(judge, rubric, to_grade, query, reference_submission)
        for judge in self._judges
    ]
    return list(await asyncio.gather(*tasks))
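judge() is normally called for you by grade(), but it can also be invoked directly to inspect raw per-judge results before aggregation. A sketch, assuming criteria is a list[Criterion] and response is the text to grade:

judge_results = await grader.judge(
    to_grade=response,
    rubric=criteria,
    query=prompt,  # optional: the input that produced `response`
)
for jr in judge_results:  # one JudgeCriterionResults per judge
    print(jr.judge_id, len(jr.criterion_results))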

aggregate async

aggregate(judge_results: list[JudgeCriterionResults], *, normalize: bool = True) -> EnsembleEvaluationReport

Aggregate results from all judges into final report.

Handles both binary and multi-choice criteria:

- Binary: Uses JudgeVote and _aggregate_votes()
- Multi-choice: Uses MultiChoiceJudgeVote and _aggregate_multi_choice_votes()

Source code in src/autorubric/graders/criterion_grader.py
async def aggregate(
    self, judge_results: list[JudgeCriterionResults], *, normalize: bool = True
) -> EnsembleEvaluationReport:
    """Aggregate results from all judges into final report.

    Handles both binary and multi-choice criteria:
    - Binary: Uses JudgeVote and _aggregate_votes()
    - Multi-choice: Uses MultiChoiceJudgeVote and _aggregate_multi_choice_votes()
    """
    if not judge_results:
        return EnsembleEvaluationReport(
            score=0.0,
            raw_score=0.0,
            llm_raw_score=0.0,
            error="No judge results to aggregate",
        )

    n_criteria = len(judge_results[0].criterion_results)

    # Build ensemble criterion reports
    ensemble_reports: list[EnsembleCriterionReport] = []
    for criterion_idx in range(n_criteria):
        # Get criterion from first judge's result
        first_cr = judge_results[0].criterion_results[criterion_idx]
        criterion_report = first_cr.report

        if criterion_report.is_multi_choice:
            # Multi-choice: build MultiChoiceJudgeVote list
            mc_votes: list[MultiChoiceJudgeVote] = []
            for judge_result in judge_results:
                cr = judge_result.criterion_results[criterion_idx]
                mcv = cr.report.multi_choice_verdict
                if mcv is not None:
                    mc_votes.append(
                        MultiChoiceJudgeVote(
                            judge_id=judge_result.judge_id,
                            selected_index=mcv.selected_index,
                            selected_label=mcv.selected_label,
                            value=mcv.value,
                            reason=cr.report.reason,
                            weight=judge_result.weight,
                            na=mcv.na,
                        )
                    )

            # Aggregate multi-choice votes
            final_mc_verdict, final_reason = self._aggregate_multi_choice_votes(
                mc_votes, criterion_report
            )

            ensemble_reports.append(
                EnsembleCriterionReport(
                    criterion=Criterion(
                        weight=criterion_report.weight,
                        requirement=criterion_report.requirement,
                        name=criterion_report.name,
                        options=criterion_report.options,
                        scale_type=criterion_report.scale_type,
                        aggregation=criterion_report.aggregation,
                    ),
                    final_verdict=None,  # Binary verdict is None for multi-choice
                    final_reason=final_reason,
                    votes=[],  # Binary votes empty for multi-choice
                    final_multi_choice_verdict=final_mc_verdict,
                    multi_choice_votes=mc_votes,
                )
            )
        else:
            # Binary: build JudgeVote list
            votes: list[JudgeVote] = []
            for judge_result in judge_results:
                cr = judge_result.criterion_results[criterion_idx]
                votes.append(
                    JudgeVote(
                        judge_id=judge_result.judge_id,
                        verdict=cr.report.verdict,
                        reason=cr.report.reason,
                        weight=judge_result.weight,
                    )
                )

            final_verdict, final_reason = self._aggregate_votes(votes)

            ensemble_reports.append(
                EnsembleCriterionReport(
                    criterion=Criterion(
                        weight=criterion_report.weight,
                        requirement=criterion_report.requirement,
                        name=criterion_report.name,
                    ),
                    final_verdict=final_verdict,
                    final_reason=final_reason,
                    votes=votes,
                )
            )

    # Calculate per-judge scores
    judge_scores = {}
    for judge_result in judge_results:
        score = self._calculate_score_from_reports(judge_result.reports, normalize)
        judge_scores[judge_result.judge_id] = score

    # Calculate final score from aggregated verdicts
    final_reports = []
    for er in ensemble_reports:
        if er.final_multi_choice_verdict is not None:
            # Multi-choice criterion
            final_reports.append(
                CriterionReport(
                    weight=er.criterion.weight,
                    requirement=er.criterion.requirement,
                    name=er.criterion.name,
                    options=er.criterion.options,
                    scale_type=er.criterion.scale_type,
                    aggregation=er.criterion.aggregation,
                    verdict=None,  # Binary verdict is None
                    multi_choice_verdict=er.final_multi_choice_verdict,
                    reason=er.final_reason,
                )
            )
        else:
            # Binary criterion
            final_reports.append(
                CriterionReport(
                    weight=er.criterion.weight,
                    requirement=er.criterion.requirement,
                    name=er.criterion.name,
                    verdict=er.final_verdict,
                    reason=er.final_reason,
                )
            )
    final_score = self._calculate_score_from_reports(final_reports, normalize)
    raw_score = self._calculate_score_from_reports(final_reports, normalize=False)

    # Calculate agreement
    mean_agreement = (
        sum(er.agreement for er in ensemble_reports) / len(ensemble_reports)
        if ensemble_reports
        else 1.0
    )

    # Count CANNOT_ASSESS (binary) and NA (multi-choice)
    cannot_assess_count = sum(
        1
        for er in ensemble_reports
        if (er.final_verdict == CriterionVerdict.CANNOT_ASSESS)
        or (er.final_multi_choice_verdict is not None and er.final_multi_choice_verdict.na)
    )

    # Aggregate token usage and cost
    total_usage = TokenUsage()
    total_cost = 0.0
    for jr in judge_results:
        if jr.total_usage:
            total_usage = total_usage + jr.total_usage
        if jr.total_cost:
            total_cost += jr.total_cost

    return EnsembleEvaluationReport(
        score=final_score,
        raw_score=raw_score,
        llm_raw_score=raw_score,
        report=ensemble_reports,
        judge_scores=judge_scores,
        mean_agreement=mean_agreement,
        cannot_assess_count=cannot_assess_count,
        token_usage=total_usage if total_usage.total_tokens > 0 else None,
        completion_cost=total_cost if total_cost > 0 else None,
    )
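Continuing the sketch above, the per-judge results can then be aggregated and the ensemble-specific fields inspected:

report = await grader.aggregate(judge_results, normalize=True)
print(report.score)                # aggregated, normalized score
print(report.judge_scores)         # per-judge scores keyed by judge_id
print(report.mean_agreement)       # mean cross-judge agreement across criteria
print(report.cannot_assess_count)  # criteria resolved as CANNOT_ASSESS / NA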

Grader

Abstract base class for grader implementations.

Grader

Grader(*, length_penalty: LengthPenalty | None = None, normalize: bool = True)

Bases: ABC

Base class for LLM-backed grading implementations.

All graders require an LLMConfig for the LLM client. Subclasses must implement judge() and aggregate() methods.

PARAMETER DESCRIPTION
length_penalty

Optional configuration for penalizing overly long outputs. When provided, a penalty based on the token/word count is subtracted from the final score.

TYPE: LengthPenalty | None DEFAULT: None

normalize

If True (default), scores are normalized to 0-1. If False, raw weighted sums are returned, which is useful for RL training scenarios.

TYPE: bool DEFAULT: True

Source code in src/autorubric/graders/base.py
def __init__(
    self,
    *,
    length_penalty: LengthPenalty | None = None,
    normalize: bool = True,
):
    self.length_penalty: LengthPenalty | None = length_penalty
    self.normalize: bool = normalize

judge abstractmethod async

judge(to_grade: str, rubric: list[Criterion], query: str | None = None, reference_submission: str | None = None) -> Any

Collect raw judge results for the provided submission.

PARAMETER DESCRIPTION
to_grade

The text to evaluate.

TYPE: str

rubric

List of criteria to evaluate against.

TYPE: list[Criterion]

query

Optional input/query that prompted the response.

TYPE: str | None DEFAULT: None

reference_submission

Optional exemplar response for grading context.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
Any

Raw judge results (format depends on implementation).

Source code in src/autorubric/graders/base.py
@abstractmethod
async def judge(
    self,
    to_grade: str,
    rubric: list[Criterion],
    query: str | None = None,
    reference_submission: str | None = None,
) -> Any:
    """Collect raw judge results for the provided submission.

    Args:
        to_grade: The text to evaluate.
        rubric: List of criteria to evaluate against.
        query: Optional input/query that prompted the response.
        reference_submission: Optional exemplar response for grading context.

    Returns:
        Raw judge results (format depends on implementation).
    """
    pass

aggregate abstractmethod async

aggregate(judge_results: Any, *, normalize: bool = True) -> EvaluationReport

Transform judge results into an EvaluationReport.

PARAMETER DESCRIPTION
judge_results

Raw results from judge().

TYPE: Any

normalize

If True, normalize score to 0-1. If False, return raw weighted sum.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
EvaluationReport

EvaluationReport with score and optional per-criterion breakdown.

Source code in src/autorubric/graders/base.py
@abstractmethod
async def aggregate(
    self, judge_results: Any, *, normalize: bool = True
) -> EvaluationReport:
    """Transform judge results into an EvaluationReport.

    Args:
        judge_results: Raw results from judge().
        normalize: If True, normalize score to 0-1. If False, return raw weighted sum.

    Returns:
        EvaluationReport with score and optional per-criterion breakdown.
    """
    pass
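A custom grader only needs to implement these two methods. The following is an illustrative sketch, not part of the library; the import locations and the CriterionVerdict.MET member are assumptions:

from typing import Any

# Illustrative sketch: import locations are assumed.
from autorubric import Criterion, CriterionReport, CriterionVerdict, EvaluationReport
from autorubric.graders import Grader

class EverythingMetGrader(Grader):
    """Toy grader that marks every criterion as met, without calling an LLM."""

    async def judge(
        self,
        to_grade: str,
        rubric: list[Criterion],
        query: str | None = None,
        reference_submission: str | None = None,
    ) -> Any:
        # Raw judge results may take any shape; here the rubric is passed through.
        return rubric

    async def aggregate(self, judge_results: Any, *, normalize: bool = True) -> EvaluationReport:
        reports = [
            CriterionReport(
                weight=c.weight,
                requirement=c.requirement,
                name=c.name,
                verdict=CriterionVerdict.MET,  # assumed enum member
                reason="stub",
            )
            for c in judge_results
        ]
        raw = sum(c.weight for c in judge_results)
        return EvaluationReport(
            score=1.0 if normalize else raw,
            raw_score=raw,
            llm_raw_score=raw,
            report=reports,
        )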

grade async

grade(to_grade: ToGradeInput, rubric: list[Criterion], query: str | None = None, reference_submission: str | None = None) -> EvaluationReport

Grade the submission against the rubric.

This is the main entry point for the grader.

PARAMETER DESCRIPTION
to_grade

The text to evaluate. Can be either:

- A string (optionally with <thinking>/<output> markers)
- A dict with 'thinking' and 'output' keys

TYPE: ToGradeInput

rubric

List of criteria to evaluate against.

TYPE: list[Criterion]

query

Optional input/query that prompted the response.

TYPE: str | None DEFAULT: None

reference_submission

Optional exemplar response for grading context.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
EvaluationReport

EvaluationReport with score and optional per-criterion breakdown. If normalize=True (default), score is 0-1; if normalize=False, score is the raw weighted sum. If length_penalty was configured, the penalty is subtracted from the score. The raw_score field contains the unnormalized weighted sum before the length penalty.

Source code in src/autorubric/graders/base.py
async def grade(
    self,
    to_grade: ToGradeInput,
    rubric: list[Criterion],
    query: str | None = None,
    reference_submission: str | None = None,
) -> EvaluationReport:
    """Grade the submission against the rubric.

    This is the main entry point for the grader.

    Args:
        to_grade: The text to evaluate. Can be either:
            - A string (optionally with <thinking>/<output> markers)
            - A dict with 'thinking' and 'output' keys
        rubric: List of criteria to evaluate against.
        query: Optional input/query that prompted the response.
        reference_submission: Optional exemplar response for grading context.

    Returns:
        EvaluationReport with score and optional per-criterion breakdown.
        If normalize=True (default), score is 0-1. If normalize=False, score is raw
        weighted sum. If length_penalty was configured, the penalty is subtracted from
        the score. The raw_score field contains the unnormalized weighted sum before
        length penalty.
    """
    # Convert to_grade to string for judge() call (maintains compatibility)
    if isinstance(to_grade, str):
        to_grade_str = to_grade
    else:
        # Dict format - reconstruct string with markers for judge()
        thinking = to_grade.get("thinking", "")
        output = to_grade.get("output", "")
        parts = []
        if thinking:
            parts.append(f"<thinking>{thinking}</thinking>")
        if output:
            parts.append(f"<output>{output}</output>")
        to_grade_str = "\n".join(parts) if parts else ""

    # Call judge with string format (maintains compatibility)
    judge_results = await self.judge(to_grade_str, rubric, query, reference_submission)
    report = await self.aggregate(judge_results, normalize=self.normalize)

    if self.length_penalty is not None:
        # Normalize to_grade to dict format for penalty calculation
        to_grade_normalized = normalize_to_grade_input(to_grade)

        # Compute penalty
        penalty = compute_length_penalty(to_grade_normalized, self.length_penalty)

        # Apply penalty (penalty is always non-negative, so we subtract)
        adjusted_score = report.score - penalty
        if self.normalize:
            adjusted_score = max(0.0, adjusted_score)

        # Return the same report type with adjusted score
        if isinstance(report, EnsembleEvaluationReport):
            return EnsembleEvaluationReport(
                score=adjusted_score,
                raw_score=report.raw_score,
                llm_raw_score=report.llm_raw_score,
                report=report.report,
                judge_scores=report.judge_scores,
                mean_agreement=report.mean_agreement,
                cannot_assess_count=report.cannot_assess_count,
                token_usage=report.token_usage,
                completion_cost=report.completion_cost,
                error=report.error,
            )
        else:
            return EvaluationReport(
                score=adjusted_score,
                raw_score=report.raw_score,
                llm_raw_score=report.llm_raw_score,
                report=report.report,
                cannot_assess_count=report.cannot_assess_count,
                error=report.error,
                token_usage=report.token_usage,
                completion_cost=report.completion_cost,
            )

    return report
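grade() also accepts the dict form of ToGradeInput, which separates chain-of-thought from the final output. A sketch, assuming criteria is a list[Criterion]:

report = await grader.grade(
    to_grade={
        "thinking": "First check the units, then the rounding...",
        "output": "The answer is 42.",
    },
    rubric=criteria,
)
print(report.score)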

JudgeSpec

Configuration for a single judge in an ensemble.

JudgeSpec dataclass

JudgeSpec(llm_config: LLMConfig, judge_id: str, weight: float = 1.0)

Specification for a single judge in an ensemble.

ATTRIBUTE DESCRIPTION
llm_config

Configuration for this judge's LLM.

TYPE: LLMConfig

judge_id

Unique identifier for this judge (e.g., "gpt-4", "claude-sonnet").

TYPE: str

weight

Voting weight for weighted aggregation (default 1.0).

TYPE: float