Ensemble
Multi-judge evaluation with configurable aggregation strategies.
Overview
Ensemble judging combines verdicts from multiple LLM judges to improve robustness and reduce individual model biases. All graders return an EnsembleEvaluationReport, so the interface is the same whether one judge or several are used; a single LLM is treated as an "ensemble of 1".
Research Background
Verga et al. (2024) demonstrate in "Replacing Judges with Juries" that aggregating independent judgments from diverse models reduces systematic errors. Cross-family judging (using models from different providers) is particularly effective at mitigating self-preference bias documented by He et al. (2025).
Quick Example
```python
from autorubric import LLMConfig
from autorubric.graders import CriterionGrader, JudgeSpec

# Ensemble with multiple judges
grader = CriterionGrader(
    judges=[
        JudgeSpec(LLMConfig(model="gemini/gemini-3-flash-preview"), "gemini", weight=1.0),
        JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929"), "claude", weight=1.2),
        JudgeSpec(LLMConfig(model="openai/gpt-4.1-mini"), "gpt", weight=1.0),
    ],
    aggregation="weighted",
)

# `rubric` and `response` are assumed to be defined already;
# `await` requires an async context (e.g. inside an `async def`).
result = await rubric.grade(to_grade=response, grader=grader)

# Ensemble-specific fields
print(f"Score: {result.score:.3f}")
print(f"Mean Agreement: {result.mean_agreement:.1%}")
print(f"Judge Scores: {result.judge_scores}")

# Per-criterion vote breakdown
for cr in result.report:
    print(f"{cr.criterion.requirement}")
    for vote in cr.votes:
        print(f"  {vote.judge_id}: {vote.verdict} ({vote.reason[:50]}...)")
```
Aggregation Strategies
| Strategy | Description |
|---|---|
| majority | More than 50% of judges must vote MET |
| weighted | Weighted vote using judge weights |
| unanimous | All judges must vote MET |
| any | Any judge voting MET results in MET |
AggregationStrategy
Enum for binary verdict aggregation strategies.
AggregationStrategy (module attribute)
Strategy for aggregating votes from multiple judges.
- majority: Simple majority vote (> 50% must agree)
- weighted: Weighted vote based on judge weights
- unanimous: All judges must agree for MET
- any: Any judge voting MET results in MET
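The four strategies can be read as simple decision rules over the judges' votes. The sketch below is illustrative only and is not autorubric's implementation: `aggregate_binary` is a hypothetical helper, votes are plain `(verdict, weight)` tuples, and the `weighted` threshold (MET weight strictly above half of the total weight) is an assumption. How the library treats ties or CANNOT_ASSESS votes is not covered here.

```python
from typing import Literal

Strategy = Literal["majority", "weighted", "unanimous", "any"]


def aggregate_binary(votes: list[tuple[str, float]], strategy: Strategy) -> str:
    """Hypothetical helper: reduce (verdict, weight) votes to "MET" or "UNMET"."""
    if not votes:
        raise ValueError("at least one vote is required")
    met_weights = [weight for verdict, weight in votes if verdict == "MET"]
    if strategy == "majority":
        # Strictly more than half of the judges voted MET.
        passed = len(met_weights) > len(votes) / 2
    elif strategy == "weighted":
        # Assumed rule: MET weight must exceed half of the total weight.
        total = sum(weight for _, weight in votes)
        passed = sum(met_weights) > total / 2
    elif strategy == "unanimous":
        passed = len(met_weights) == len(votes)
    elif strategy == "any":
        passed = len(met_weights) > 0
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return "MET" if passed else "UNMET"


# Two of three judges vote MET; the dissenting judge carries weight 1.2.
votes = [("MET", 1.0), ("UNMET", 1.2), ("MET", 1.0)]
results = {s: aggregate_binary(votes, s) for s in ("majority", "weighted", "unanimous", "any")}
print(results)
```

Under these assumptions, raising the dissenting judge's weight to 3.0 flips the `weighted` outcome to UNMET while `majority` still returns MET, which is the practical difference between the two strategies.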
EnsembleEvaluationReport
Evaluation result from ensemble grading with per-judge breakdown.
EnsembleEvaluationReport
Bases: BaseModel
Evaluation report with ensemble voting details.
Extends EvaluationReport with per-judge breakdown and agreement metrics.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| score | The final aggregated score (0-1 if normalized). |
| raw_score | The unnormalized weighted sum. |
| llm_raw_score | Same as raw_score (for compatibility with EvaluationReport). |
| report | Per-criterion breakdown with ensemble voting details. |
| judge_scores | Individual scores from each judge. |
| mean_agreement | Average agreement across all criteria. |
| cannot_assess_count | Number of criteria with CANNOT_ASSESS final verdict. |
| token_usage | Total token usage across all judges. |
| completion_cost | Total cost across all judges. |
| error | Error message if grading failed. |
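A caller will typically check `error` before trusting the other fields. The following is a minimal sketch of that pattern, not part of the library: `summarize_report` is a hypothetical helper, it assumes `error` is `None` (or empty) when grading succeeded, and it uses a `SimpleNamespace` stand-in where real code would pass the report returned by grading.

```python
from types import SimpleNamespace


def summarize_report(report) -> str:
    """Hypothetical helper: one-line summary of a report-like object."""
    # Assumption: `error` is None or empty when grading succeeded.
    if report.error:
        return f"grading failed: {report.error}"
    parts = [
        f"score={report.score:.3f}",
        f"agreement={report.mean_agreement:.1%}",
    ]
    # Surface criteria the ensemble could not assess, if any.
    if report.cannot_assess_count:
        parts.append(f"cannot_assess={report.cannot_assess_count}")
    return ", ".join(parts)


# Stand-in objects for illustration only.
ok = SimpleNamespace(score=0.75, mean_agreement=0.8333, cannot_assess_count=1, error=None)
failed = SimpleNamespace(score=0.0, mean_agreement=0.0, cannot_assess_count=0, error="timeout")

print(summarize_report(ok))
print(summarize_report(failed))
```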
EnsembleCriterionReport
Per-criterion result with individual judge votes.
EnsembleCriterionReport (dataclass)
EnsembleCriterionReport(criterion: Criterion, final_verdict: CriterionVerdict | None, final_reason: str, votes: list[JudgeVote] = list(), agreement: float = 0.0, final_multi_choice_verdict: AggregatedMultiChoiceVerdict | None = None, multi_choice_votes: list[MultiChoiceJudgeVote] = list())
A criterion report with ensemble voting details.
Supports both binary and multi-choice criteria:
- Binary: Use final_verdict and votes (list of JudgeVote)
- Multi-choice: Use final_multi_choice_verdict and multi_choice_votes
| ATTRIBUTE | DESCRIPTION |
|---|---|
| criterion | The criterion being evaluated. |
| final_verdict | Aggregated binary verdict from all judges. None for multi-choice. |
| final_reason | Combined reasoning from judges. |
| votes | Individual binary votes from each judge. Empty for multi-choice. |
| agreement | Proportion of judges agreeing with final verdict (0-1). |
| final_multi_choice_verdict | Aggregated multi-choice verdict. None for binary. |
| multi_choice_votes | Individual multi-choice votes. Empty for binary. |
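The `agreement` field is described as the proportion of judges whose vote matches the final verdict, and `mean_agreement` as its average across criteria. A minimal sketch of that arithmetic follows. The helper names are hypothetical, the proportion is assumed to be unweighted, and returning 0.0 for an empty input is an assumption chosen to mirror the `agreement` default in the signature above.

```python
def agreement(verdicts: list[str], final_verdict: str) -> float:
    """Hypothetical helper: share of verdicts equal to the final verdict."""
    if not verdicts:
        return 0.0  # assumed fallback when no votes were recorded
    matching = sum(1 for verdict in verdicts if verdict == final_verdict)
    return matching / len(verdicts)


def mean_agreement(per_criterion: list[float]) -> float:
    """Hypothetical helper: unweighted mean of per-criterion agreement."""
    if not per_criterion:
        return 0.0
    return sum(per_criterion) / len(per_criterion)


# Criterion 1: two of three judges match the final verdict.
# Criterion 2: all three judges match.
scores = [
    agreement(["MET", "MET", "UNMET"], "MET"),
    agreement(["UNMET", "UNMET", "UNMET"], "UNMET"),
]
print(f"{mean_agreement(scores):.1%}")
```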
JudgeVote
Individual judge's verdict for a criterion.
JudgeVote (dataclass)
JudgeVote(judge_id: str, verdict: CriterionVerdict, reason: str, weight: float = 1.0)
A single judge's vote on a criterion.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| judge_id | Identifier for the judge (e.g., "gpt-4", "claude-sonnet"). |
| verdict | The judge's verdict (MET/UNMET). |
| reason | The judge's explanation for the verdict. |
| weight | Judge's voting weight (default 1.0). |
References
He, J., Shi, J., Zhuo, T. Y., Treude, C., Sun, J., Xing, Z., Du, X., and Lo, D. (2025). LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead. arXiv:2510.24367.
Verga, P., Hofstätter, S., Althammer, S., Su, Y., Piktus, A., Arkhangorodsky, A., Xu, M., White, N., and Lewis, P. (2024). Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. arXiv:2404.18796.