Ensemble

Multi-judge evaluation with configurable aggregation strategies.

Overview

Ensemble judging combines verdicts from multiple LLM judges to improve robustness and reduce individual model biases. All graders return EnsembleEvaluationReport for a consistent interface (a single LLM is treated as an ensemble of one).

Research Background

Verga et al. (2024) demonstrate in "Replacing Judges with Juries" that aggregating independent judgments from diverse models reduces systematic errors. Cross-family judging (using models from different providers) is particularly effective at mitigating self-preference bias documented by He et al. (2025).

Quick Example

from autorubric import LLMConfig
from autorubric.graders import CriterionGrader, JudgeSpec

# Ensemble with multiple judges
grader = CriterionGrader(
    judges=[
        JudgeSpec(LLMConfig(model="gemini/gemini-3-flash-preview"), "gemini", weight=1.0),
        JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929"), "claude", weight=1.2),
        JudgeSpec(LLMConfig(model="openai/gpt-4.1-mini"), "gpt", weight=1.0),
    ],
    aggregation="weighted",
)

# `rubric` is an existing Rubric instance and `response` the output under evaluation
result = await rubric.grade(to_grade=response, grader=grader)

# Ensemble-specific fields
print(f"Score: {result.score:.3f}")
print(f"Mean Agreement: {result.mean_agreement:.1%}")
print(f"Judge Scores: {result.judge_scores}")

# Per-criterion vote breakdown
for cr in result.report:
    print(f"{cr.criterion.requirement}")
    for vote in cr.votes:
        print(f"  {vote.judge_id}: {vote.verdict} ({vote.reason[:50]}...)")

Aggregation Strategies

Strategy    Description
majority    > 50% of judges must vote MET
weighted    Weighted vote using judge weights
unanimous   All judges must vote MET
any         Any judge voting MET results in MET

AggregationStrategy

Type alias for binary verdict aggregation strategies.

AggregationStrategy module-attribute

AggregationStrategy = Literal['majority', 'weighted', 'unanimous', 'any']

Strategy for aggregating votes from multiple judges.

  • majority: Simple majority vote (> 50% must agree)
  • weighted: Weighted vote based on judge weights
  • unanimous: All judges must agree for MET
  • any: Any judge voting MET results in MET
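As an illustrative sketch only (not autorubric's internal implementation), the four strategies can be expressed over a list of (verdict, weight) pairs, with a boolean True standing in for MET:

```python
# Illustrative sketch of the four aggregation strategies.
# Verdicts are modeled as booleans (True = MET); this is not
# autorubric's internal implementation.

def aggregate(votes: list[tuple[bool, float]], strategy: str) -> bool:
    """Collapse binary judge votes into a final MET/UNMET verdict."""
    if strategy == "majority":
        # Strict majority of judge count; weights are ignored.
        met = sum(1 for verdict, _ in votes if verdict)
        return met > len(votes) / 2
    if strategy == "weighted":
        # MET wins if the MET weight exceeds half the total weight.
        total = sum(weight for _, weight in votes)
        met_weight = sum(weight for verdict, weight in votes if verdict)
        return met_weight > total / 2
    if strategy == "unanimous":
        return all(verdict for verdict, _ in votes)
    if strategy == "any":
        return any(verdict for verdict, _ in votes)
    raise ValueError(f"unknown strategy: {strategy}")

# Using the weights from the Quick Example (1.0, 1.2, 1.0):
votes = [(True, 1.0), (False, 1.2), (True, 1.0)]
aggregate(votes, "majority")   # True: 2 of 3 judges voted MET
aggregate(votes, "weighted")   # True: 2.0 of 3.2 total weight is MET
aggregate(votes, "unanimous")  # False: one judge voted UNMET
```

Note that when all weights are equal, the weighted strategy reduces to a simple majority vote.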

EnsembleEvaluationReport

Evaluation result from ensemble grading with per-judge breakdown.

EnsembleEvaluationReport

Bases: BaseModel

Evaluation report with ensemble voting details.

Extends EvaluationReport with per-judge breakdown and agreement metrics.

ATTRIBUTE DESCRIPTION
score

The final aggregated score (0-1 if normalized).

TYPE: float

raw_score

The unnormalized weighted sum.

TYPE: float | None

llm_raw_score

Same as raw_score (for compatibility with EvaluationReport).

TYPE: float | None

report

Per-criterion breakdown with ensemble voting details.

TYPE: list[EnsembleCriterionReport] | None

judge_scores

Individual scores from each judge.

TYPE: dict[str, float]

mean_agreement

Average agreement across all criteria.

TYPE: float

cannot_assess_count

Number of criteria with CANNOT_ASSESS final verdict.

TYPE: int

token_usage

Total token usage across all judges.

TYPE: TokenUsage | None

completion_cost

Total cost across all judges.

TYPE: float | None

error

Error message if grading failed.

TYPE: str | None


EnsembleCriterionReport

Per-criterion result with individual judge votes.

EnsembleCriterionReport dataclass

EnsembleCriterionReport(criterion: Criterion, final_verdict: CriterionVerdict | None, final_reason: str, votes: list[JudgeVote] = list(), agreement: float = 0.0, final_multi_choice_verdict: AggregatedMultiChoiceVerdict | None = None, multi_choice_votes: list[MultiChoiceJudgeVote] = list())

A criterion report with ensemble voting details.

Supports both binary and multi-choice criteria:

  • Binary: use final_verdict and votes (list of JudgeVote)
  • Multi-choice: use final_multi_choice_verdict and multi_choice_votes

ATTRIBUTE DESCRIPTION
criterion

The criterion being evaluated.

TYPE: Criterion

final_verdict

Aggregated binary verdict from all judges. None for multi-choice.

TYPE: CriterionVerdict | None

final_reason

Combined reasoning from judges.

TYPE: str

votes

Individual binary votes from each judge. Empty for multi-choice.

TYPE: list[JudgeVote]

agreement

Proportion of judges agreeing with final verdict (0-1).

TYPE: float

final_multi_choice_verdict

Aggregated multi-choice verdict. None for binary.

TYPE: AggregatedMultiChoiceVerdict | None

multi_choice_votes

Individual multi-choice votes. Empty for binary.

TYPE: list[MultiChoiceJudgeVote]

score_value property

score_value: float

Get the score contribution (0-1) for this criterion.

is_na property

is_na: bool

Check if this criterion was marked NA or CANNOT_ASSESS.
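The agreement field can be read as the fraction of judges whose vote matches the aggregated final verdict; mean_agreement on the report then averages this value across criteria. A minimal sketch, with booleans standing in for MET/UNMET (illustrative only, not library code):

```python
# Sketch of per-criterion agreement: the fraction of judge
# verdicts matching the aggregated final verdict. Booleans
# stand in for MET/UNMET; this is not autorubric's own code.

def agreement(judge_verdicts: list[bool], final_verdict: bool) -> float:
    if not judge_verdicts:
        return 0.0
    matches = sum(1 for v in judge_verdicts if v == final_verdict)
    return matches / len(judge_verdicts)

agreement([True, True, False], final_verdict=True)  # 2/3: one dissenting judge
```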


JudgeVote

Individual judge's verdict for a criterion.

JudgeVote dataclass

JudgeVote(judge_id: str, verdict: CriterionVerdict, reason: str, weight: float = 1.0)

A single judge's vote on a criterion.

ATTRIBUTE DESCRIPTION
judge_id

Identifier for the judge (e.g., "gpt-4", "claude-sonnet").

TYPE: str

verdict

The judge's verdict (MET/UNMET).

TYPE: CriterionVerdict

reason

The judge's explanation for the verdict.

TYPE: str

weight

Judge's voting weight (default 1.0).

TYPE: float


References

He, J., Shi, J., Zhuo, T. Y., Treude, C., Sun, J., Xing, Z., Du, X., and Lo, D. (2025). LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead. arXiv:2510.24367.

Verga, P., Hofstätter, S., Althammer, S., Su, Y., Piktus, A., Arkhangorodsky, A., Xu, M., White, N., and Lewis, P. (2024). Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. arXiv:2404.18796.