Ensemble¶

Multi-judge evaluation with configurable aggregation strategies.

Overview¶

Ensemble judging combines verdicts from multiple LLM judges to improve robustness and reduce individual model biases. All graders return an EnsembleEvaluationReport, so the interface stays consistent: a single LLM is treated as an "ensemble of 1".

Research Background

Verga et al. (2024) demonstrate in "Replacing Judges with Juries" that aggregating independent judgments from diverse models reduces systematic errors. Cross-family judging (using models from different providers) is particularly effective at mitigating self-preference bias documented by He et al. (2025).

Quick Example¶

from autorubric import LLMConfig
from autorubric.graders import CriterionGrader, JudgeSpec

# Ensemble with multiple judges
grader = CriterionGrader(
    judges=[
        JudgeSpec(LLMConfig(model="gemini/gemini-3-flash-preview"), "gemini", weight=1.0),
        JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929"), "claude", weight=1.2),
        JudgeSpec(LLMConfig(model="openai/gpt-4.1-mini"), "gpt", weight=1.0),
    ],
    aggregation="weighted",
)

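# `rubric` (the rubric to grade against) and `response` (the text being graded)
# are assumed to be defined already; `await` must run inside an async function.
result = await rubric.grade(to_grade=response, grader=grader)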

# Ensemble-specific fields. score / mean_agreement are `float | None`
# (None on a failed grade or an empty rubric); guard before formatting.
print(f"Score: {result.score:.3f}" if result.score is not None else "Score: n/a (grade failed)")
print(
    f"Mean Agreement: {result.mean_agreement:.1%}"
    if result.mean_agreement is not None
    else "Mean Agreement: n/a"
)
print(f"Judge Scores: {result.judge_scores}")

# Per-criterion vote breakdown
for cr in result.report:
    print(f"{cr.criterion.requirement}")
    for vote in cr.votes:
        print(f"  {vote.judge_id}: {vote.verdict} ({vote.reason[:50]}...)")

Aggregation Strategies¶

| Strategy | Description |
| --- | --- |
| `majority` | More than 50% of judges must vote MET |
| `weighted` | Weighted vote using judge weights |
| `unanimous` | All judges must vote MET |
| `any` | Any judge voting MET results in MET |

These strategies apply to binary criteria only and are independent of the multi-choice aggregation settings (ordinal_aggregation / nominal_aggregation). Conceptually, if MET is 1 and UNMET is 0, unanimous is the min over the judges' votes and any is the max; the ordinal analogs are the min / max strategies (see the multi-choice cookbook).
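The min / max reading of unanimous and any can be sketched in a few lines. This is an illustration only, not the library's implementation: verdicts are modeled as plain booleans (True for MET), and the function names `unanimous` and `any_met` are hypothetical.

```python
# Illustrative sketch: True stands for MET, False for UNMET.
# These helpers are hypothetical and are not part of autorubric.

def unanimous(votes: list[bool]) -> bool:
    """MET only if every judge voted MET: the min over {0, 1} values."""
    return bool(min(int(v) for v in votes))


def any_met(votes: list[bool]) -> bool:
    """MET if at least one judge voted MET: the max over {0, 1} values."""
    return bool(max(int(v) for v in votes))


# Assumes at least one vote; min() and max() raise ValueError on empty input.
votes = [True, True, False]
print(unanimous(votes))  # False
print(any_met(votes))    # True
```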

AggregationStrategy¶

Type alias (a Literal of string values) for binary verdict aggregation strategies.

AggregationStrategy `module-attribute` ¶

AggregationStrategy = Literal['majority', 'weighted', 'unanimous', 'any']

Strategy for aggregating votes from multiple judges (binary criteria).

- majority: Simple majority vote (more than 50% of judges must agree)
- weighted: Weighted vote based on judge weights
- unanimous: All judges must agree for MET
- any: Any judge voting MET results in MET

Tie-breaking: a tie is an even head count under majority, or equal total judge weight on each side under weighted. Ties resolve to the score-minimizing verdict, determined by the sign of the criterion's weight: UNMET for weight ≥ 0 (the criterion earns 0) and MET for weight < 0 (the full penalty applies). This is the binary analog of Criterion.worst_scored_option. unanimous and any are thresholds and cannot tie.
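The tie-breaking rule is easier to follow with a small sketch. The code below is a simplified model, not autorubric's implementation. It assumes that verdicts are booleans (True for MET), that weighted compares the total judge weight behind each verdict, and that the weight deciding a tie is the criterion's weight. CANNOT_ASSESS votes are ignored, and all function names are hypothetical.

```python
import math


def break_tie(criterion_weight: float) -> bool:
    """Return the score-minimizing verdict for a tie (True = MET)."""
    # weight >= 0: UNMET earns 0; weight < 0: MET applies the full penalty.
    return criterion_weight < 0


def majority(votes: list[bool], criterion_weight: float = 1.0) -> bool:
    """Head-count vote; an even split falls through to break_tie."""
    met = sum(votes)
    unmet = len(votes) - met
    if met == unmet:
        return break_tie(criterion_weight)
    return met > unmet


def weighted(votes: list[tuple[bool, float]], criterion_weight: float = 1.0) -> bool:
    """Compare the total judge weight behind MET and behind UNMET."""
    met = sum(w for verdict, w in votes if verdict)
    unmet = sum(w for verdict, w in votes if not verdict)
    if math.isclose(met, unmet):
        return break_tie(criterion_weight)
    return met > unmet


# Judge weights from the quick example: two judges at 1.0 outvote one at 1.2.
print(weighted([(True, 1.0), (False, 1.2), (True, 1.0)]))  # True (MET)
print(majority([True, False], criterion_weight=1.0))       # False (UNMET earns 0)
print(majority([True, False], criterion_weight=-1.0))      # True (MET applies the penalty)
```

Resolving ties toward the lower score keeps a split panel from awarding credit, or waiving a penalty, that no clear majority supported.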

EnsembleEvaluationReport¶

Evaluation result from ensemble grading with per-judge breakdown.

EnsembleEvaluationReport ¶

Bases: BaseModel

Evaluation report with ensemble voting details.

Extends EvaluationReport with per-judge breakdown and agreement metrics.

| ATTRIBUTE | DESCRIPTION |
| --- | --- |
| `score` | The final aggregated score (0-1 if normalized). `None` only when grading failed (an error report, e.g. no judge results); the normal grading path always computes a real float. TYPE: `float \| None` |
| `raw_score` | The unnormalized weighted sum. `None` only on a failed/empty report. TYPE: `float \| None` |
| `llm_raw_score` | Same as raw_score (for compatibility with EvaluationReport). TYPE: `float \| None` |
| `report` | Per-criterion breakdown with ensemble voting details. TYPE: `list[EnsembleCriterionReport] \| None` |
| `judge_scores` | Individual scores from each judge. TYPE: `dict[str, float]` |
| `mean_agreement` | Average agreement across all criteria, or None when there are no criteria to agree on (empty rubric) or agreement was not measured. TYPE: `float \| None` |
| `cannot_assess_count` | Number of criteria with CANNOT_ASSESS final verdict. TYPE: `int` |
| `token_usage` | Total token usage across all judges. TYPE: `TokenUsage \| None` |
| `completion_cost` | Total cost across all judges. TYPE: `float \| None` |
| `error` | Error message if grading failed. TYPE: `str \| None` |
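Because several of these fields can be None, downstream code usually needs to check them before formatting or arithmetic. The sketch below shows one way to do that. `FakeReport` is a hypothetical stand-in that copies only a few attribute names from the table above so the example runs without the library, and `summarize` is an illustrative helper, not part of autorubric.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeReport:
    """Hypothetical stand-in exposing a few EnsembleEvaluationReport fields."""
    score: Optional[float] = None
    mean_agreement: Optional[float] = None
    judge_scores: dict = field(default_factory=dict)
    error: Optional[str] = None


def summarize(report) -> str:
    """Build a one-line summary, checking every optional field first."""
    if report.error is not None:
        return f"failed: {report.error}"
    if report.score is None:
        return "no score"
    agreement = "n/a" if report.mean_agreement is None else f"{report.mean_agreement:.1%}"
    # Spread between the most and least generous judge (0 if no judge scores).
    scores = list(report.judge_scores.values())
    spread = max(scores) - min(scores) if scores else 0.0
    return f"score={report.score:.3f} agreement={agreement} judge_spread={spread:.3f}"


ok = FakeReport(score=0.75, mean_agreement=0.9, judge_scores={"gemini": 0.7, "gpt": 0.8})
print(summarize(ok))
print(summarize(FakeReport(error="no judge results")))
```

A large judge spread alongside a low mean agreement can be a useful signal that a rubric criterion is ambiguous.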

EnsembleCriterionReport¶

Per-criterion result with individual judge votes.

EnsembleCriterionReport ¶

Bases: BaseModel

A criterion report with ensemble voting details.

Supports both binary and multi-choice criteria:

- Binary: Use final_verdict and votes (list of JudgeVote)
- Multi-choice: Use final_multi_choice_verdict and multi_choice_votes

| ATTRIBUTE | DESCRIPTION |
| --- | --- |
| `criterion` | The criterion being evaluated. TYPE: `Criterion` |
| `final_verdict` | Aggregated binary verdict from all judges. None for multi-choice. TYPE: `CriterionVerdict \| None` |
| `final_reason` | Combined reasoning from judges. TYPE: `str` |
| `votes` | Individual binary votes from each judge. Empty for multi-choice. TYPE: `list[JudgeVote]` |
| `agreement` | Proportion of judges agreeing with final verdict (0-1). TYPE: `float` |
| `final_multi_choice_verdict` | Aggregated multi-choice verdict. None for binary. TYPE: `AggregatedMultiChoiceVerdict \| None` |
| `multi_choice_votes` | Individual multi-choice votes. Empty for binary. TYPE: `list[MultiChoiceJudgeVote]` |
| `error` | Set (with a category prefix) when the final verdict was driven entirely by judge-call failures (every contributing vote errored). None when at least one genuine judgment was available. See `is_error`. TYPE: `str \| None` |
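To make the `agreement` field concrete, here is a minimal sketch of how a per-criterion agreement and its average across criteria could be computed. It assumes an unweighted head count and models verdicts as plain strings; whether the library weights judges when measuring agreement is not stated here, so treat that as an assumption. Both function names are hypothetical.

```python
from typing import Optional


def agreement(votes: list[str], final_verdict: str) -> float:
    """Fraction of judges whose verdict matches the final verdict."""
    if not votes:
        return 0.0
    return sum(v == final_verdict for v in votes) / len(votes)


def mean_agreement(per_criterion: list[float]) -> Optional[float]:
    """Average agreement across criteria, or None when there are none."""
    if not per_criterion:
        return None
    return sum(per_criterion) / len(per_criterion)


# Two of three judges match the final verdict on the first criterion;
# all three match on the second.
first = agreement(["MET", "MET", "UNMET"], "MET")
second = agreement(["UNMET", "UNMET", "UNMET"], "UNMET")
print(mean_agreement([first, second]))
print(mean_agreement([]))  # None
```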

score_value `property` ¶

score_value: float

Get the score contribution (0-1) for this criterion.

is_na `property` ¶

is_na: bool

Check if this criterion was marked NA or CANNOT_ASSESS.

is_error `property` ¶

is_error: bool

Whether the final verdict was driven entirely by judge-call failures.

JudgeVote¶

Individual judge's verdict for a criterion.

JudgeVote ¶

Bases: BaseModel

A single judge's vote on a criterion.

| ATTRIBUTE | DESCRIPTION |
| --- | --- |
| `judge_id` | Identifier for the judge (e.g., "gpt-4", "claude-sonnet"). TYPE: `str` |
| `verdict` | The judge's verdict (MET/UNMET). TYPE: `CriterionVerdict` |
| `reason` | The judge's brief justification for the verdict (the conclusion distilled from `reasoning` when thinking is enabled). TYPE: `str` |
| `weight` | Judge's voting weight (default 1.0). TYPE: `float` |
| `error` | Set (with a category prefix) when this vote's verdict was synthesized because the judge call failed. None for genuine votes. TYPE: `str \| None` |
| `reasoning` | The judge's verbose extended-thinking deliberation trace (populated only when thinking is enabled; None otherwise). `reason` is the conclusion distilled from it. Carried from this judge's `CriterionReport.reasoning`. TYPE: `str \| None` |

is_error `property` ¶

is_error: bool

Whether this vote's verdict was synthesized due to a judge-call failure.

Use this instead of inspecting reason to distinguish error-induced votes from genuine judgments.
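As an illustration of that advice, the sketch below separates genuine votes from error-induced ones using `is_error` alone. `FakeVote` is a hypothetical stand-in for JudgeVote; the assumption that `is_error` is equivalent to `error is not None` follows the attribute descriptions above but is not confirmed against the library source. The error string is also made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeVote:
    """Hypothetical stand-in mirroring a few JudgeVote attributes."""
    judge_id: str
    verdict: str
    reason: str
    error: Optional[str] = None

    @property
    def is_error(self) -> bool:
        # Assumption: a vote is error-induced exactly when `error` is set.
        return self.error is not None


def split_votes(votes):
    """Return (genuine, failed) votes without inspecting `reason` text."""
    genuine = [v for v in votes if not v.is_error]
    failed = [v for v in votes if v.is_error]
    return genuine, failed


votes = [
    FakeVote("gemini", "MET", "Covers all required steps."),
    FakeVote("gpt", "UNMET", "Judge call failed.", error="timeout: request timed out"),
]
genuine, failed = split_votes(votes)
print([v.judge_id for v in genuine])  # ['gemini']
print([v.judge_id for v in failed])   # ['gpt']
```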

References¶

He, J., Shi, J., Zhuo, T. Y., Treude, C., Sun, J., Xing, Z., Du, X., and Lo, D. (2025). LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead. arXiv:2510.24367.

Verga, P., Hofstätter, S., Althammer, S., Su, Y., Piktus, A., Arkhangorodsky, A., Xu, M., White, N., and Lewis, P. (2024). Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. arXiv:2404.18796.
