Multi-Choice¶
Ordinal and nominal scales beyond binary MET/UNMET verdicts.
Overview¶
Multi-choice criteria support evaluation beyond binary verdicts:
- Ordinal scales: Satisfaction ratings, quality levels with ordered values
- Nominal scales: Categorical judgments where options may share values
- NA options: Options excluded from scoring
Research Background
Multiple sources recommend low-precision ordinal scales (0-3 or 1-5) rather than high-precision numeric scales (1-10). Broad scales invite central-tendency and anchoring problems. Multi-choice criteria with explicit option values provide clear behavioral anchors (Kim et al., 2024; Zheng et al., 2023).
Ordinal Scale Example¶
# rubric.yaml
- name: satisfaction
  requirement: "How satisfied would you be with this response?"
  weight: 10.0
  scale_type: ordinal
  options:
    - label: "1"
      value: 0.0
    - label: "2"
      value: 0.33
    - label: "3"
      value: 0.67
    - label: "4"
      value: 1.0
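The option values in the ordinal example are spaced evenly over [0, 1]. A minimal sketch of how such values could be derived for a scale of any length follows; `evenly_spaced_values` is a hypothetical helper for illustration, not part of AutoRubric, and the rounding to two decimals is an assumption chosen to match the example.

```python
def evenly_spaced_values(n_options: int) -> list[float]:
    """Return n_options score values spread evenly over [0, 1].

    Hypothetical helper: values are rounded to two decimals so that a
    four-point scale matches the 0.0 / 0.33 / 0.67 / 1.0 pattern.
    """
    if n_options < 2:
        raise ValueError("an ordinal scale needs at least two options")
    return [round(i / (n_options - 1), 2) for i in range(n_options)]


four_point = evenly_spaced_values(4)
print(four_point)  # [0.0, 0.33, 0.67, 1.0]
```

Evenly spaced values are only one choice; explicit values also allow uneven spacing when the gap between adjacent levels is not uniform.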
Nominal Scale Example¶
- name: efficiency
  requirement: "Is the number of exchange turns appropriate?"
  weight: 5.0
  scale_type: nominal
  options:
    - label: "Too few interactions"
      value: 0.0
    - label: "Too many interactions"
      value: 0.0
    - label: "Just right"
      value: 1.0
NA Options¶
Exclude options from scoring (like CANNOT_ASSESS for binary):
options:
  - label: "None"
    value: 0.0
  - label: "All claims"
    value: 1.0
  - label: "NA - No references provided"
    na: true
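To make "excluded from scoring" concrete, here is a minimal sketch that assumes the overall score is a weight-normalized average of the selected option values. The formula, the `Selection` class, and `weighted_score` are illustrative assumptions; the library's actual scoring rule is not shown in this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Selection:
    """Hypothetical stand-in for one criterion's selected option."""
    weight: float      # criterion weight from the rubric
    value: float       # value of the selected option
    na: bool = False   # True when the selected option is an NA option


def weighted_score(selections: list[Selection]) -> Optional[float]:
    """Assumed scoring rule: weight-normalized mean over non-NA selections."""
    scored = [s for s in selections if not s.na]
    total_weight = sum(s.weight for s in scored)
    if total_weight == 0:
        return None  # nothing left to score
    return sum(s.weight * s.value for s in scored) / total_weight


score = weighted_score([
    Selection(weight=10.0, value=0.67),          # satisfaction: "3"
    Selection(weight=5.0, value=1.0),            # efficiency: "Just right"
    Selection(weight=5.0, value=0.0, na=True),   # NA option: dropped
])
```

Under this assumption the NA selection affects neither the numerator nor the denominator, so an unanswerable criterion does not drag the score down.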
Ensemble Aggregation¶
from autorubric.graders import CriterionGrader
grader = CriterionGrader(
    judges=[...],
    aggregation="majority",        # For binary criteria
    ordinal_aggregation="mean",    # "mean", "median", "weighted_mean", "mode", "min", "max"
    nominal_aggregation="mode",    # "mode", "weighted_mode", "unanimous"
)
The three knobs are independent. For ordinal criteria, min/max are the conservative/
permissive analogs of binary unanimous/any (lowest/highest option any judge selected).
For nominal criteria, unanimous abstains via the NA option on disagreement (falling back
to mode and warning if there is no NA option).
Position Bias Mitigation¶
LLM judges exhibit position bias in multi-choice settings. AutoRubric shuffles options by default:
# Default: shuffling enabled, seed auto-generated
grader = CriterionGrader(llm_config=config)
# Pin the seed for reproducible shuffles
grader = CriterionGrader(llm_config=config, seed=42)
# Disable shuffling entirely
grader = CriterionGrader(llm_config=config, shuffle_options=False)
The shuffle order for each criterion is recorded in CriterionReport.shuffle_order and persisted in experiment checkpoints.
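The idea behind seeded shuffling can be sketched as follows: present the options in a permuted order, record the permutation, and map the judge's 1-indexed pick back to the original zero-based index. `shuffle_options` and `to_original_index` are hypothetical names for illustration, and the use of `random.Random` is an assumption, not a description of AutoRubric's internals.

```python
import random


def shuffle_options(labels: list[str], seed: int) -> tuple[list[str], list[int]]:
    """Return labels in presentation order plus the permutation used.

    order[k] is the original index of the option shown at position k.
    """
    order = list(range(len(labels)))
    random.Random(seed).shuffle(order)  # seeded, so the shuffle is reproducible
    return [labels[i] for i in order], order


def to_original_index(selected_option: int, order: list[int]) -> int:
    """Map a 1-indexed pick in presentation order to the original 0-based index."""
    if not 1 <= selected_option <= len(order):
        raise ValueError(f"selected_option must be in 1..{len(order)}")
    return order[selected_option - 1]


labels = ["Too few interactions", "Too many interactions", "Just right"]
shown, order = shuffle_options(labels, seed=42)
picked = to_original_index(1, order)  # judge chose the first option as shown
```

Storing the permutation alongside the verdict is what makes a shuffled run auditable after the fact.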
Ground Truth Format¶
dataset.add_item(
    submission="Response text...",
    description="Good response",
    ground_truth=[
        CriterionVerdict.MET,  # Binary criterion
        "4",                   # Multi-choice ordinal
        "Just right",          # Multi-choice nominal
    ]
)
CriterionOption¶
Single option for multi-choice criteria.
Bases: BaseModel
A single option in a multi-choice criterion.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| label | Display text shown to the LLM judge. |
| value | Score value (0.0-1.0) when this option is selected. REQUIRED. |
| na | If True, this option indicates "not applicable" and is treated like CANNOT_ASSESS (excluded from scoring). |
Example
# Ordinal scale with explicit values
options = [
    CriterionOption(label="Very dissatisfied", value=0.0),
    CriterionOption(label="Dissatisfied", value=0.33),
    CriterionOption(label="Satisfied", value=0.67),
    CriterionOption(label="Very satisfied", value=1.0),
]

# Option with NA
na_option = CriterionOption(label="N/A - No claims made", value=0.0, na=True)
validate_value_range¶
validate_value_range() -> CriterionOption
Validate that non-NA options have values in [0, 1].
Source code in src/autorubric/types.py
MultiChoiceVerdict¶
Verdict for a multi-choice criterion.
Bases: BaseModel
Verdict for a multi-choice criterion evaluation.
Stores both index (stable, for metrics computation) and label (readable, for reports). This design enables future metrics like kappa, accuracy, and confusion matrices.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| selected_index | Zero-based index of the selected option. STABLE for metrics. |
| selected_label | Label text of the selected option. READABLE for reports. |
| value | Score contribution of the selected option (0.0-1.0). |
| na | True if the selected option is marked as NA (not applicable). |
Example
verdict = MultiChoiceVerdict(
    selected_index=2,
    selected_label="Satisfied",
    value=0.67,
    na=False
)
AggregatedMultiChoiceVerdict¶
Aggregated verdict from ensemble for multi-choice criteria.
Bases: MultiChoiceVerdict
Extended verdict for ensemble aggregation results.
Stores both discrete (snapped to nearest option) and continuous (actual mean/median) results to support different metrics:
- Discrete (selected_index, value): for exact accuracy, kappa
- Continuous (aggregated_value): for RMSE, MAE on scores
| ATTRIBUTE | DESCRIPTION |
|---|---|
| aggregated_value | Continuous aggregated value before snapping to nearest option. For ordinal scales with mean/median aggregation, this may differ from |
Example
# Mean of [0.0, 0.33, 0.67] = 0.33, snapped to option 1
verdict = AggregatedMultiChoiceVerdict(
    selected_index=1,
    selected_label="Dissatisfied",
    value=0.33,             # Value of snapped option
    aggregated_value=0.33,  # Actual mean
    na=False
)
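The split between discrete and continuous results matters when computing metrics. The sketch below shows one plausible pairing: exact accuracy on `selected_index` and RMSE on `aggregated_value`. Both functions are illustrative helpers written for this page, not AutoRubric's metrics API.

```python
import math


def exact_accuracy(predicted: list[int], truth: list[int]) -> float:
    """Fraction of items whose snapped option index matches the ground truth."""
    matches = sum(p == t for p, t in zip(predicted, truth))
    return matches / len(truth)


def rmse(predicted: list[float], truth: list[float]) -> float:
    """Root mean squared error between continuous scores and ground-truth values."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, truth)]
    return math.sqrt(sum(squared) / len(squared))


# Discrete view: option indices after snapping.
acc = exact_accuracy([1, 2, 3], [1, 2, 0])
# Continuous view: aggregated values before snapping.
err = rmse([0.5, 1.0], [0.0, 1.0])
```

The discrete view penalizes any mismatch equally, while the continuous view rewards an ensemble that lands close to the true value even when snapping picks a neighboring option.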
MultiChoiceJudgment¶
LLM judgment for a multi-choice criterion.
Bases: BaseModel
Structured LLM output for multi-choice criterion evaluation.
Used with LiteLLM's response_format parameter to ensure type-safe, validated responses from the judge LLM for multi-choice criteria.
Note: The LLM uses 1-indexed option numbers for human readability. The grader converts to 0-indexed internally.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| selected_option | 1-indexed number of the selected option (1, 2, 3, etc.) |
| explanation | Brief explanation of why this option was selected. When thinking is enabled this is the concise conclusion distilled from |
| reasoning | Verbose extended-thinking deliberation trace (populated only when thinking is enabled; None otherwise). |
MultiChoiceJudgeVote¶
Individual judge vote for multi-choice criteria in ensemble.
Bases: BaseModel
Individual judge's vote for a multi-choice criterion (ensemble mode).
Preserves full vote details for per-judge metrics and inter-judge agreement analysis (Krippendorff's alpha, Fleiss' kappa, per-judge accuracy).
| ATTRIBUTE | DESCRIPTION |
|---|---|
| judge_id | Identifier for the judge (e.g., "gpt-4", "claude-sonnet"). |
| selected_index | Zero-based index of selected option. STABLE for metrics. |
| selected_label | Label of selected option. READABLE for reports. |
| value | Score value of selected option. |
| reason | Judge's brief justification for the selection (the conclusion distilled from |
| weight | Judge's voting weight (default 1.0). |
| na | True if selected option is NA. |
| shuffle_order | Permutation used when presenting options to the judge. |
| error | Set (with a category prefix) when this vote's verdict was synthesized because the judge call failed. None for genuine votes. Mirrors |
| reasoning | The judge's verbose extended-thinking deliberation trace (populated only when thinking is enabled; None otherwise). |
is_error (property)¶
Whether this vote's verdict was synthesized due to a judge-call failure.
Use this instead of inspecting reason to distinguish error-induced
votes from genuine judgments.
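A per-judge analysis would typically skip synthesized votes. The sketch below uses a hypothetical `Vote` dataclass as a stand-in for MultiChoiceJudgeVote, and it assumes that `is_error` is equivalent to `error is not None`; that equivalence is inferred from the attribute descriptions, not confirmed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vote:
    """Hypothetical stand-in carrying only the fields this sketch needs."""
    judge_id: str
    selected_index: int
    error: Optional[str] = None

    @property
    def is_error(self) -> bool:
        # Assumption: a vote is synthesized exactly when `error` is set.
        return self.error is not None


def per_judge_accuracy(votes_by_item: list[list[Vote]], truth: list[int]) -> dict[str, float]:
    """Accuracy per judge over genuine votes only; error votes are skipped."""
    hits: dict[str, int] = {}
    counts: dict[str, int] = {}
    for votes, expected in zip(votes_by_item, truth):
        for vote in votes:
            if vote.is_error:
                continue
            counts[vote.judge_id] = counts.get(vote.judge_id, 0) + 1
            hits[vote.judge_id] = hits.get(vote.judge_id, 0) + int(vote.selected_index == expected)
    return {judge: hits[judge] / counts[judge] for judge in counts}


votes_by_item = [
    [Vote("judge-a", 2), Vote("judge-b", 1)],
    [Vote("judge-a", 0, error="timeout: judge call failed"), Vote("judge-b", 0)],
]
accuracy = per_judge_accuracy(votes_by_item, truth=[2, 0])
```

Here the failed call from judge-a on the second item does not count as a correct answer, even though its synthesized index happens to match the ground truth.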
OrdinalAggregation¶
Aggregation strategy for ordinal multi-choice criteria in ensemble.
OrdinalAggregation (module attribute)¶
Aggregation strategy for ordinal multi-choice criteria.
Central tendency:
- mean: Average of score values across judges, snapped to the nearest option
- median: Median of score values, snapped to the nearest option
- weighted_mean: Weighted average by judge weight, snapped to the nearest option
- mode: Most common selection
Conservative / permissive (the ordinal analogs of binary unanimous / any):
- min: The option with the lowest value any judge selected (conservative).
- max: The option with the highest value any judge selected (permissive).
min/max are gentle robust extremes: they return the lowest/highest option a
judge actually selected, not a "reset to worst on any dissent". Example: selections
{0.67, 0.67, 1.0} give min -> the 0.67 option, max -> the 1.0 option.
Tie-breaking: a mode count tie and a mean/median/weighted_mean snap tie
(value equidistant from two options) resolve to the score-minimizing tied option by
weight sign (lowest value for weight ≥ 0, highest for weight < 0; lowest index on a
value tie) via Criterion.worst_option_among — deterministic, independent of judge
order. min/max value ties already resolve to the lowest index.
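The snapping and tie-breaking rules can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: it covers only mean, median, min, and max, and it handles only the non-negative-weight case (ties go to the lowest value, then the lowest index). `snap` and `aggregate_ordinal` are hypothetical names.

```python
import statistics


def snap(target: float, option_values: list[float]) -> int:
    """Index of the option nearest to target.

    Ties resolve to the lower value, then the lower index
    (the non-negative-weight case only).
    """
    return min(
        range(len(option_values)),
        key=lambda i: (abs(option_values[i] - target), option_values[i], i),
    )


def aggregate_ordinal(
    selected: list[int], option_values: list[float], strategy: str = "mean"
) -> tuple[int, float]:
    """Return (snapped option index, continuous aggregated value)."""
    values = [option_values[i] for i in selected]
    if strategy == "mean":
        continuous = statistics.fmean(values)
    elif strategy == "median":
        continuous = statistics.median(values)
    elif strategy == "min":
        continuous = min(values)
    elif strategy == "max":
        continuous = max(values)
    else:
        raise ValueError(f"unsupported strategy in this sketch: {strategy}")
    return snap(continuous, option_values), continuous


option_values = [0.0, 0.33, 0.67, 1.0]
# Judges selected the 0.67, 0.67, and 1.0 options (indices 2, 2, 3).
lowest = aggregate_ordinal([2, 2, 3], option_values, "min")   # (2, 0.67)
highest = aggregate_ordinal([2, 2, 3], option_values, "max")  # (3, 1.0)
```

Returning both the snapped index and the continuous value mirrors the split between `selected_index` and `aggregated_value` on the aggregated verdict.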
NominalAggregation¶
Aggregation strategy for nominal multi-choice criteria in ensemble.
NominalAggregation (module attribute)¶
Aggregation strategy for nominal multi-choice criteria.
- mode: Most common selection (majority vote)
- weighted_mode: Weight votes by judge weight
- unanimous: All judges must select the same option. On disagreement, abstain by selecting the criterion's NA option (verdict na=True, excluded from scoring under the SKIP strategy); if the criterion has no NA option, fall back to mode and emit a warning.
Tie-breaking: a mode count tie or a weighted_mode equal-weight tie resolves to
the score-minimizing tied option by weight sign (lowest value for weight ≥ 0,
highest for weight < 0; lowest index on a value tie) via
Criterion.worst_option_among — deterministic, independent of judge order.
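The mode and unanimous strategies, including the NA fallback, can be sketched as follows. The function names and the `na_index` parameter are hypothetical, and the tie-break shown covers only the non-negative-weight case (lowest value, then lowest index); weighted_mode is omitted.

```python
import warnings
from collections import Counter
from typing import Optional


def mode_index(selected: list[int], option_values: list[float]) -> int:
    """Most common selection; count ties go to the lowest value, then lowest index."""
    counts = Counter(selected)
    top = max(counts.values())
    tied = [index for index, count in counts.items() if count == top]
    return min(tied, key=lambda i: (option_values[i], i))


def unanimous_index(
    selected: list[int], option_values: list[float], na_index: Optional[int]
) -> int:
    """Return the shared selection, or abstain via the NA option on disagreement."""
    if len(set(selected)) == 1:
        return selected[0]
    if na_index is not None:
        return na_index  # abstain: the verdict is excluded from scoring
    warnings.warn("judges disagreed and no NA option exists; falling back to mode")
    return mode_index(selected, option_values)


# Efficiency criterion: two zero-valued options and "Just right".
efficiency_values = [0.0, 0.0, 1.0]
majority = mode_index([2, 2, 0], efficiency_values)  # 2
tie = mode_index([0, 2], efficiency_values)          # 0
```

Because the tie-break depends only on option values and indices, the result does not change when the judges are listed in a different order.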
References¶
Kim, S., Suk, J., Longpre, S., Lin, B. Y., Shin, J., Welleck, S., Neubig, G., Lee, M., Lee, K., and Seo, M. (2024). Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4334–4353.
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.