Multi-Choice¶
Ordinal and nominal scales beyond binary MET/UNMET verdicts.
Overview¶
Multi-choice criteria support evaluation beyond binary verdicts:
- Ordinal scales: Satisfaction ratings, quality levels with ordered values
- Nominal scales: Categorical judgments where options may share values
- NA options: Options excluded from scoring
Research Background
Multiple sources recommend low-precision ordinal scales (0-3 or 1-5) over high-precision numeric scales such as 1-10: wide scales invite central-tendency and anchoring biases, while multi-choice criteria with explicit option values give judges clear behavioral anchors (Kim et al., 2024; Zheng et al., 2023).
Ordinal Scale Example¶
```yaml
# rubric.yaml
- name: satisfaction
  requirement: "How satisfied would you be with this response?"
  weight: 10.0
  scale_type: ordinal
  options:
    - label: "1"
      value: 0.0
    - label: "2"
      value: 0.33
    - label: "3"
      value: 0.67
    - label: "4"
      value: 1.0
```
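The same scale can be built programmatically with `CriterionOption` (documented below). A minimal sketch, assuming `CriterionOption` is importable from `autorubric.types` (the path the "Source code" note below points at):

```python
from autorubric.types import CriterionOption  # import path assumed from the source note below

# Four-point ordinal satisfaction scale with evenly spaced anchor values.
satisfaction_options = [
    CriterionOption(label="1", value=0.0),
    CriterionOption(label="2", value=0.33),
    CriterionOption(label="3", value=0.67),
    CriterionOption(label="4", value=1.0),
]
```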
Nominal Scale Example¶
```yaml
- name: efficiency
  requirement: "Is the number of exchange turns appropriate?"
  weight: 5.0
  scale_type: nominal
  options:
    - label: "Too few interactions"
      value: 0.0
    - label: "Too many interactions"
      value: 0.0
    - label: "Just right"
      value: 1.0
```
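Note that nominal options may share values: both failure modes above score 0.0, and only "Just right" earns credit. As a hedged illustration (plain arithmetic, not a library API), a criterion's contribution under standard weighted scoring would be its weight times the selected option's value:

```python
# Illustration only: weighted contribution of a nominal criterion.
weight = 5.0                             # criterion weight from the rubric
selected_value = 1.0                     # "Just right"
contribution = weight * selected_value   # 5.0; either wrong-count option yields 0.0
```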
NA Options¶
Mark an option with `na: true` to exclude it from scoring (the multi-choice analogue of `CANNOT_ASSESS` for binary criteria):
```yaml
options:
  - label: "None"
    value: 0.0
  - label: "All claims"
    value: 1.0
  - label: "NA - No references provided"
    value: 0.0
    na: true
```
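A rough sketch of how NA exclusion affects a weighted score (plain Python for illustration; the library's actual scoring internals may differ). NA verdicts drop out of both numerator and denominator, exactly as CANNOT_ASSESS does for binary criteria:

```python
# (weight, selected value, na) per criterion; the NA criterion is skipped entirely.
verdicts = [(10.0, 0.67, False), (5.0, 1.0, False), (3.0, 1.0, True)]
scored = [(w, v) for w, v, na in verdicts if not na]
score = sum(w * v for w, v in scored) / sum(w for w, _ in scored)  # 11.7 / 15.0 = 0.78
```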
Ensemble Aggregation¶
```python
from autorubric.graders import CriterionGrader

grader = CriterionGrader(
    judges=[...],
    aggregation="majority",          # for binary criteria
    ordinal_aggregation="mean",      # "mean", "median", "weighted_mean", "mode"
    nominal_aggregation="mode",      # "mode", "weighted_mode", "unanimous"
)
```
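For ordinal scales, mean/median aggregation produces a continuous value that is then snapped to the nearest option (see `AggregatedMultiChoiceVerdict` below). A hedged sketch of that arithmetic, not the library's internals:

```python
# Three judges score on the 4-point satisfaction scale above.
option_values = [0.0, 0.33, 0.67, 1.0]
judge_values = [0.33, 0.67, 0.67]
aggregated = sum(judge_values) / len(judge_values)   # ~0.557
snapped_index = min(range(len(option_values)),
                    key=lambda i: abs(option_values[i] - aggregated))  # 2, i.e. value 0.67
```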
Position Bias Mitigation¶
LLM judges exhibit position bias in multi-choice settings. AutoRubric shuffles options by default:
```python
# Default: shuffling enabled
grader = CriterionGrader(llm_config=config)

# Disable for deterministic option order
grader = CriterionGrader(llm_config=config, shuffle_options=False)
```
Ground Truth Format¶
```python
dataset.add_item(
    submission="Response text...",
    description="Good response",
    ground_truth=[
        CriterionVerdict.MET,  # binary criterion
        "4",                   # multi-choice ordinal (option label)
        "Just right",          # multi-choice nominal (option label)
    ],
)
```
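Ground-truth entries are positional: each entry aligns with the criterion at the same index in the rubric. Binary criteria take `CriterionVerdict` members, while multi-choice criteria take the selected option's label as a string, exactly as shown above.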
CriterionOption¶
Single option for multi-choice criteria.
CriterionOption¶
Bases: BaseModel
A single option in a multi-choice criterion.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `label` | `str` | Display text shown to the LLM judge. |
| `value` | `float` | Score value (0.0-1.0) when this option is selected. Required. |
| `na` | `bool` | If `True`, this option indicates "not applicable" and is treated like `CANNOT_ASSESS` (excluded from scoring). |
Example
Ordinal scale with explicit values¶

```python
options = [
    CriterionOption(label="Very dissatisfied", value=0.0),
    CriterionOption(label="Dissatisfied", value=0.33),
    CriterionOption(label="Satisfied", value=0.67),
    CriterionOption(label="Very satisfied", value=1.0),
]
```
Option with NA¶
```python
na_option = CriterionOption(label="N/A - No claims made", value=0.0, na=True)
```
validate_value_range¶

```python
validate_value_range() -> CriterionOption
```
Validate that non-NA options have values in [0, 1].
Source code in src/autorubric/types.py
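Since `CriterionOption` is a Pydantic model, an out-of-range value on a non-NA option should raise a `ValidationError`. A minimal sketch assuming standard Pydantic validation behavior:

```python
from pydantic import ValidationError

from autorubric.types import CriterionOption  # import path assumed per the source note above

try:
    CriterionOption(label="Out of range", value=1.5)  # non-NA option with value > 1
except ValidationError as exc:
    print(exc)  # validate_value_range rejects values outside [0, 1]
```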
MultiChoiceVerdict¶
Verdict for a multi-choice criterion.
MultiChoiceVerdict¶
Bases: BaseModel
Verdict for a multi-choice criterion evaluation.
Stores both index (stable, for metrics computation) and label (readable, for reports). This design enables future metrics like kappa, accuracy, and confusion matrices.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `selected_index` | `int` | Zero-based index of the selected option. Stable for metrics. |
| `selected_label` | `str` | Label text of the selected option. Readable for reports. |
| `value` | `float` | Score contribution of the selected option (0.0-1.0). |
| `na` | `bool` | `True` if the selected option is marked as NA (not applicable). |
Example
```python
verdict = MultiChoiceVerdict(
    selected_index=2,
    selected_label="Satisfied",
    value=0.67,
    na=False,
)
```
AggregatedMultiChoiceVerdict¶
Aggregated verdict from ensemble for multi-choice criteria.
AggregatedMultiChoiceVerdict¶
Bases: MultiChoiceVerdict
Extended verdict for ensemble aggregation results.
Stores both discrete (snapped to the nearest option) and continuous (actual mean/median) results to support different metrics:

- Discrete (`selected_index`, `value`): for exact accuracy, kappa
- Continuous (`aggregated_value`): for RMSE, MAE on scores
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `aggregated_value` | `float` | Continuous aggregated value before snapping to the nearest option. For ordinal scales with mean/median aggregation, this may differ from the snapped option's `value`. |
Example
Mean of [0.0, 0.33, 0.67] ≈ 0.33, snapped to option 1¶

```python
verdict = AggregatedMultiChoiceVerdict(
    selected_index=1,
    selected_label="Dissatisfied",
    value=0.33,             # value of the snapped option
    aggregated_value=0.33,  # actual mean (0.333..., rounded here)
    na=False,
)
```
MultiChoiceJudgment¶
LLM judgment for a multi-choice criterion.
MultiChoiceJudgment¶
Bases: BaseModel
Structured LLM output for multi-choice criterion evaluation.
Used with LiteLLM's response_format parameter to ensure type-safe, validated responses from the judge LLM for multi-choice criteria.
Note: The LLM uses 1-indexed option numbers for human readability. The grader converts to 0-indexed internally.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `selected_option` | `int` | 1-indexed number of the selected option (1, 2, 3, etc.). |
| `explanation` | `str` | Brief explanation of why this option was selected. |
| `reasoning` | `str \| None` | Extended thinking/reasoning trace (populated when thinking is enabled). |
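A small sketch of the 1-indexed to 0-indexed conversion described in the note above. Assumptions: `MultiChoiceJudgment` can be constructed directly, and `reasoning` defaults to `None` when thinking is disabled.

```python
judgment = MultiChoiceJudgment(
    selected_option=3,  # the LLM picks the third option (1-indexed)
    explanation="Matches the rubric anchor for 'Satisfied'.",
)
selected_index = judgment.selected_option - 1  # 0-indexed, as stored in MultiChoiceVerdict
```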
MultiChoiceJudgeVote¶
Individual judge vote for multi-choice criteria in ensemble.
MultiChoiceJudgeVote dataclass¶

```python
MultiChoiceJudgeVote(judge_id: str, selected_index: int, selected_label: str, value: float, reason: str, weight: float = 1.0, na: bool = False)
```
Individual judge's vote for a multi-choice criterion (ensemble mode).
Preserves full vote details for per-judge metrics and inter-judge agreement analysis (e.g., Fleiss' kappa, per-judge accuracy).
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `judge_id` | `str` | Identifier for the judge (e.g., "gpt-4", "claude-sonnet"). |
| `selected_index` | `int` | Zero-based index of the selected option. Stable for metrics. |
| `selected_label` | `str` | Label of the selected option. Readable for reports. |
| `value` | `float` | Score value of the selected option. |
| `reason` | `str` | Judge's explanation for the selection. |
| `weight` | `float` | Judge's voting weight (default 1.0). |
| `na` | `bool` | `True` if the selected option is NA. |
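Because each vote keeps its `judge_id` and selection, simple inter-judge agreement statistics fall out directly. A hedged illustration (construction follows the dataclass signature above; the agreement arithmetic is ours, not a library API):

```python
from collections import Counter

votes = [
    MultiChoiceJudgeVote("gpt-4", 2, "3", 0.67, "Helpful and mostly complete."),
    MultiChoiceJudgeVote("claude-sonnet", 2, "3", 0.67, "Meets the anchor for '3'."),
    MultiChoiceJudgeVote("judge-3", 3, "4", 1.0, "Fully satisfying answer."),
]

# Fraction of judges agreeing with the modal selection (2/3 here).
top_index, top_count = Counter(v.selected_index for v in votes).most_common(1)[0]
agreement = top_count / len(votes)
```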
OrdinalAggregation¶
Aggregation strategy for ordinal multi-choice criteria in ensemble.
OrdinalAggregation module-attribute¶
Aggregation strategy for ordinal multi-choice criteria.
- mean: Average of score values across judges
- median: Median of score values
- weighted_mean: Weighted average by judge weight
- mode: Most common selection
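As a concrete illustration of `weighted_mean` (plain arithmetic, not library internals), judge weights scale each vote's value before averaging:

```python
values = [0.33, 0.67, 1.0]  # selected option values from three judges
weights = [1.0, 1.0, 2.0]   # the third judge counts double
weighted_mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)  # 3.0 / 4.0 = 0.75
```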
NominalAggregation¶
Aggregation strategy for nominal multi-choice criteria in ensemble.
NominalAggregation module-attribute¶
Aggregation strategy for nominal multi-choice criteria.
- mode: Most common selection (majority vote)
- weighted_mode: Weight votes by judge weight
- unanimous: All judges must agree
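Similarly, a hedged sketch of `weighted_mode`: sum judge weights per label and take the heaviest (tie-breaking and the `unanimous` strategy are omitted here):

```python
from collections import defaultdict

votes = [("Just right", 1.0), ("Too many interactions", 1.0), ("Just right", 2.0)]
totals: dict[str, float] = defaultdict(float)
for label, weight in votes:
    totals[label] += weight
winner = max(totals, key=totals.get)  # "Just right" with total weight 3.0
```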
References¶
Kim, S., Suk, J., Longpre, S., Lin, B. Y., Shin, J., Welleck, S., Neubig, G., Lee, M., Lee, K., and Seo, M. (2024). Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4334–4353.
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.