Multi-Choice¶

Ordinal and nominal scales beyond binary MET/UNMET verdicts.

Overview¶

Multi-choice criteria support evaluation beyond binary verdicts:

Ordinal scales: Satisfaction ratings, quality levels with ordered values
Nominal scales: Categorical judgments where options may share values
NA options: Options excluded from scoring

Research Background

Multiple sources recommend low-precision ordinal scales (0-3 or 1-5) rather than high-precision numeric scales (1-10). Broad scales invite central-tendency and anchoring problems. Multi-choice criteria with explicit option values provide clear behavioral anchors (Kim et al., 2024; Zheng et al., 2023).

Ordinal Scale Example¶

# rubric.yaml
- name: satisfaction
  requirement: "How satisfied would you be with this response?"
  weight: 10.0
  scale_type: ordinal
  options:
    - label: "1"
      value: 0.0
    - label: "2"
      value: 0.33
    - label: "3"
      value: 0.67
    - label: "4"
      value: 1.0
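
The option values in this example are spaced evenly over [0, 1]. As a minimal sketch, assuming you want the same spacing for a scale of any length, the values can be generated rather than typed by hand. `evenly_spaced_values` is a hypothetical helper for illustration, not part of AutoRubric.

```python
def evenly_spaced_values(n_options: int, ndigits: int = 2) -> list[float]:
    """Return n_options values spread evenly over [0, 1], rounded for readability.

    Hypothetical helper: AutoRubric only requires that each value lies in [0, 1].
    """
    if n_options < 2:
        raise ValueError("an ordinal scale needs at least two options")
    return [round(i / (n_options - 1), ndigits) for i in range(n_options)]


print(evenly_spaced_values(4))  # [0.0, 0.33, 0.67, 1.0]
```

Even spacing is a convention, not a requirement; unevenly spaced values are equally valid when the steps between levels are not meant to be equal.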

Nominal Scale Example¶

- name: efficiency
  requirement: "Is the number of exchange turns appropriate?"
  weight: 5.0
  scale_type: nominal
  options:
    - label: "Too few interactions"
      value: 0.0
    - label: "Too many interactions"
      value: 0.0
    - label: "Just right"
      value: 1.0

NA Options¶

Mark an option with na: true to exclude it from scoring when selected, as CANNOT_ASSESS does for binary criteria:

options:
  - label: "None"
    value: 0.0
  - label: "All claims"
    value: 1.0
  - label: "NA - No references provided"
    na: true
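
One plausible reading of "excluded from scoring" is that a criterion whose selected option is NA drops out of both the numerator and the denominator of a weight-normalized score. The sketch below illustrates that reading only. The `Selected` class and `weighted_score` function are hypothetical, and the normalization formula is an assumption rather than AutoRubric's documented scoring rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Selected:
    """Hypothetical stand-in for one criterion's selected option."""
    weight: float      # criterion weight
    value: float       # value of the selected option
    na: bool = False   # True when the NA option was selected


def weighted_score(selections: list[Selected]) -> Optional[float]:
    """Weight-normalized score over the criteria that were not answered NA."""
    scored = [s for s in selections if not s.na]
    total_weight = sum(s.weight for s in scored)
    if total_weight == 0:
        return None  # nothing left to score
    return sum(s.weight * s.value for s in scored) / total_weight


selections = [
    Selected(weight=10.0, value=0.67),
    Selected(weight=5.0, value=1.0),
    Selected(weight=5.0, value=0.0, na=True),  # skipped, not counted as a zero
]
print(weighted_score(selections))  # about 0.78; counting NA as zero would give about 0.585
```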

Ensemble Aggregation¶

from autorubric.graders import CriterionGrader

grader = CriterionGrader(
    judges=[...],
    aggregation="majority",           # For binary criteria
    ordinal_aggregation="mean",       # "mean", "median", "weighted_mean", "mode", "min", "max"
    nominal_aggregation="mode",       # "mode", "weighted_mode", "unanimous"
)

The three knobs are independent. For ordinal criteria, min and max are the conservative and permissive analogs of binary unanimous and any: they return the lowest or highest option that any judge selected. For nominal criteria, unanimous abstains by selecting the NA option when judges disagree; if the criterion has no NA option, it falls back to mode and emits a warning.

Position Bias Mitigation¶

LLM judges exhibit position bias in multi-choice settings. AutoRubric shuffles options by default:

# Default: shuffling enabled, seed auto-generated
grader = CriterionGrader(llm_config=config)

# Pin the seed for reproducible shuffles
grader = CriterionGrader(llm_config=config, seed=42)

# Disable shuffling entirely
grader = CriterionGrader(llm_config=config, shuffle_options=False)

The shuffle order for each criterion is recorded in CriterionReport.shuffle_order and persisted in experiment checkpoints.
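
To illustrate why the shuffle order is recorded, the sketch below shuffles options with a seeded generator and maps the judge's pick back to the original option. It is a stdlib-only illustration: `make_shuffle_order` and `to_original_index` are hypothetical names, and the direction of the permutation (presented position to original index) is an assumption about how `shuffle_order` is encoded.

```python
import random


def make_shuffle_order(n_options: int, seed: int) -> list[int]:
    """Return a permutation where order[p] is the original index shown at position p."""
    order = list(range(n_options))
    random.Random(seed).shuffle(order)  # a private generator keeps the shuffle reproducible
    return order


def to_original_index(presented_position: int, order: list[int]) -> int:
    """Map a zero-based position in the shuffled prompt back to the original option index."""
    return order[presented_position]


labels = ["Too few interactions", "Too many interactions", "Just right"]
order = make_shuffle_order(len(labels), seed=42)
presented = [labels[i] for i in order]

# Whichever position the judge picks, the recorded order recovers the original option.
for position, shown_label in enumerate(presented):
    original = to_original_index(position, order)
    print(position, shown_label, "->", original)
```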

Ground Truth Format¶

dataset.add_item(
    submission="Response text...",
    description="Good response",
    ground_truth=[
        CriterionVerdict.MET,  # Binary criterion
        "4",                    # Multi-choice ordinal
        "Just right",           # Multi-choice nominal
    ]
)

CriterionOption¶

Single option for multi-choice criteria.

CriterionOption ¶

Bases: BaseModel

A single option in a multi-choice criterion.

ATTRIBUTE	DESCRIPTION
`label`	Display text shown to the LLM judge. TYPE: `str`
`value`	Score value (0.0-1.0) when this option is selected. REQUIRED. TYPE: `float`
`na`	If True, this option indicates "not applicable" and is treated like CANNOT_ASSESS (excluded from scoring). TYPE: `bool`

Example

# Ordinal scale with explicit values
options = [
    CriterionOption(label="Very dissatisfied", value=0.0),
    CriterionOption(label="Dissatisfied", value=0.33),
    CriterionOption(label="Satisfied", value=0.67),
    CriterionOption(label="Very satisfied", value=1.0),
]

# Option with NA
na_option = CriterionOption(label="N/A - No claims made", value=0.0, na=True)

validate_value_range ¶

validate_value_range() -> CriterionOption

Validate that non-NA options have values in [0, 1].

Source code in src/autorubric/types.py

@model_validator(mode="after")
def validate_value_range(self) -> "CriterionOption":
    """Validate that non-NA options have values in [0, 1]."""
    if not self.na and not (0.0 <= self.value <= 1.0):
        raise ValueError(f"Option value must be in [0, 1], got {self.value}")
    return self
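
The condition in the validator means the range check applies only to non-NA options. The sketch below reproduces that condition in a plain dataclass so the behavior can be tried without installing anything. `OptionSketch` and `is_valid` are hypothetical stand-ins, not the real Pydantic model.

```python
from dataclasses import dataclass


@dataclass
class OptionSketch:
    """Hypothetical stand-in that mirrors the validate_value_range condition."""
    label: str
    value: float
    na: bool = False

    def __post_init__(self) -> None:
        # Same condition as the validator above: NA options skip the range check.
        if not self.na and not (0.0 <= self.value <= 1.0):
            raise ValueError(f"Option value must be in [0, 1], got {self.value}")


def is_valid(label: str, value: float, na: bool = False) -> bool:
    """Return True if the option can be constructed without a ValueError."""
    try:
        OptionSketch(label, value, na)
    except ValueError:
        return False
    return True


print(is_valid("Satisfied", 0.67))        # True
print(is_valid("Out of range", 1.5))      # False
print(is_valid("N/A", 1.5, na=True))      # True: the value of an NA option is not range-checked
```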

MultiChoiceVerdict¶

Verdict for a multi-choice criterion.

MultiChoiceVerdict ¶

Bases: BaseModel

Verdict for a multi-choice criterion evaluation.

Stores both index (stable, for metrics computation) and label (readable, for reports). This design enables future metrics like kappa, accuracy, and confusion matrices.

ATTRIBUTE	DESCRIPTION
`selected_index`	Zero-based index of the selected option. STABLE for metrics. `None` when the verdict is a genuine abstain synthesized after an infrastructure or parse failure on a criterion with no NA option (forced-choice, `auto_na_option=False`). In that case the verdict has `na=True` but no real option was selected, so `na=True` never points at a scored option. TYPE: `int \| None`
`selected_label`	Label text of the selected option. READABLE for reports. `None` in the same no-option-selected abstain case as `selected_index`. TYPE: `str \| None`
`value`	Score contribution of the selected option (0.0-1.0). TYPE: `float`
`na`	True if the selected option is marked as NA (not applicable). TYPE: `bool`

Example

verdict = MultiChoiceVerdict(
    selected_index=2,
    selected_label="Satisfied",
    value=0.67,
    na=False,
)

AggregatedMultiChoiceVerdict¶

Aggregated verdict from ensemble for multi-choice criteria.

AggregatedMultiChoiceVerdict ¶

Bases: MultiChoiceVerdict

Extended verdict for ensemble aggregation results.

Stores both discrete (snapped to nearest option) and continuous (actual mean/median) results to support different metrics:

Discrete (selected_index, value): for exact accuracy, kappa
Continuous (aggregated_value): for RMSE, MAE on scores

ATTRIBUTE	DESCRIPTION
`aggregated_value`	Continuous aggregated value before snapping to nearest option. For ordinal scales with mean/median aggregation, this may differ from `value`. For nominal scales with mode aggregation, this equals `value`. TYPE: `float`

Example

# Mean of [0.0, 0.33, 0.67] = 0.33, snapped to option 1
verdict = AggregatedMultiChoiceVerdict(
    selected_index=1,
    selected_label="Dissatisfied",
    value=0.33,             # Value of snapped option
    aggregated_value=0.33,  # Actual mean
    na=False,
)
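
The discrete and continuous results diverge when the mean falls between two options. The sketch below, using only the standard library, shows one such case. `snap_to_nearest` is a hypothetical helper, and it breaks distance ties by taking the first option, which is a simplification of the library's tie-breaking.

```python
from statistics import mean


def snap_to_nearest(aggregated: float, option_values: list[float]) -> int:
    """Index of the option whose value is closest to `aggregated`.

    Simplification: on an exact distance tie the first (lowest-index) option wins.
    """
    return min(range(len(option_values)), key=lambda i: abs(option_values[i] - aggregated))


option_values = [0.0, 0.33, 0.67, 1.0]
judge_values = [0.0, 0.33, 1.0]  # values of the options three judges selected

aggregated_value = mean(judge_values)                      # continuous result, about 0.443
selected_index = snap_to_nearest(aggregated_value, option_values)
value = option_values[selected_index]                      # discrete result

print(selected_index, value, round(aggregated_value, 3))   # 1 0.33 0.443
```

Here `value` (0.33) would feed exact-match metrics, while `aggregated_value` (about 0.443) would feed error metrics such as MAE.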

MultiChoiceJudgment¶

LLM judgment for a multi-choice criterion.

MultiChoiceJudgment ¶

Bases: BaseModel

Structured LLM output for multi-choice criterion evaluation.

Used with LiteLLM's response_format parameter to ensure type-safe, validated responses from the judge LLM for multi-choice criteria.

Note: The LLM uses 1-indexed option numbers for human readability. The grader converts to 0-indexed internally.

ATTRIBUTE	DESCRIPTION
`selected_option`	1-indexed number of the selected option (1, 2, 3, etc.) TYPE: `int`
`explanation`	Brief explanation of why this option was selected. When thinking is enabled this is the concise conclusion distilled from `reasoning`. TYPE: `str`
`reasoning`	Verbose extended-thinking deliberation trace (populated only when thinking is enabled; None otherwise). `explanation` is distilled from it. TYPE: `str \| None`
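
The conversion from the judge's 1-indexed answer to a 0-indexed option can be sketched as follows. `to_zero_based` is a hypothetical helper, and the bounds check is an assumption about how an out-of-range answer might be rejected; the actual grader may handle it differently.

```python
def to_zero_based(selected_option: int, n_options: int) -> int:
    """Convert a 1-indexed judge answer to a 0-indexed option position."""
    if not 1 <= selected_option <= n_options:
        raise ValueError(
            f"selected_option must be between 1 and {n_options}, got {selected_option}"
        )
    return selected_option - 1


print(to_zero_based(1, n_options=4))  # 0
print(to_zero_based(4, n_options=4))  # 3
```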

MultiChoiceJudgeVote¶

Individual judge vote for multi-choice criteria in ensemble.

MultiChoiceJudgeVote ¶

Bases: BaseModel

Individual judge's vote for a multi-choice criterion (ensemble mode).

Preserves full vote details for per-judge metrics and inter-judge agreement analysis (Krippendorff's alpha, Fleiss' kappa, per-judge accuracy).

ATTRIBUTE	DESCRIPTION
`judge_id`	Identifier for the judge (e.g., "gpt-4", "claude-sonnet"). TYPE: `str`
`selected_index`	Zero-based index of selected option. STABLE for metrics. `None` for a genuine abstain synthesized on a judge-call failure when the criterion has no NA option (forced-choice); otherwise identifies the option. TYPE: `int \| None`
`selected_label`	Label of selected option. READABLE for reports. `None` in the same no-option-selected abstain case as `selected_index`. TYPE: `str \| None`
`value`	Score value of selected option. TYPE: `float`
`reason`	Judge's brief justification for the selection (the conclusion distilled from `reasoning` when thinking is enabled). TYPE: `str`
`weight`	Judge's voting weight (default 1.0). TYPE: `float`
`na`	True if selected option is NA. TYPE: `bool`
`shuffle_order`	Permutation used when presenting options to the judge. TYPE: `list[int] \| None`
`error`	Set (with a category prefix) when this vote's verdict was synthesized because the judge call failed. None for genuine votes. Mirrors `JudgeVote.error` for multi-choice criteria. TYPE: `str \| None`
`reasoning`	The judge's verbose extended-thinking deliberation trace (populated only when thinking is enabled; None otherwise). `reason` is the conclusion distilled from it. Mirrors `JudgeVote.reasoning` for multi-choice criteria. TYPE: `str \| None`

is_error `property` ¶

is_error: bool

Whether this vote's verdict was synthesized due to a judge-call failure.

Use this instead of inspecting reason to distinguish error-induced votes from genuine judgments.
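
A typical use is to separate genuine votes from synthesized ones before computing per-judge metrics. The sketch below uses a hypothetical `VoteSketch` dataclass in place of the real model and assumes that `is_error` is true exactly when `error` is set. The error string shown is invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoteSketch:
    """Hypothetical stand-in carrying only the fields this example needs."""
    judge_id: str
    value: float
    error: Optional[str] = None

    @property
    def is_error(self) -> bool:
        # Assumption: a vote is error-induced exactly when `error` is set.
        return self.error is not None


votes = [
    VoteSketch("judge-a", 0.67),
    VoteSketch("judge-b", 0.0, error="timeout: judge call failed"),  # invented message
    VoteSketch("judge-c", 1.0),
]

genuine = [v for v in votes if not v.is_error]
failed = [v.judge_id for v in votes if v.is_error]

print([v.judge_id for v in genuine])  # ['judge-a', 'judge-c']
print(failed)                         # ['judge-b']
```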

OrdinalAggregation¶

Aggregation strategy for ordinal multi-choice criteria in ensemble.

OrdinalAggregation `module-attribute` ¶

OrdinalAggregation = Literal['mean', 'median', 'weighted_mean', 'mode', 'min', 'max']

Aggregation strategy for ordinal multi-choice criteria.

Central tendency:

mean: Average of score values across judges, snapped to the nearest option
median: Median of score values, snapped to the nearest option
weighted_mean: Weighted average by judge weight, snapped to the nearest option
mode: Most common selection

Conservative / permissive (the ordinal analogs of binary unanimous / any):

min: The option with the lowest value any judge selected (conservative).
max: The option with the highest value any judge selected (permissive).

min/max are gentle robust extremes: they return the lowest/highest option a judge actually selected, not a "reset to worst on any dissent". Example: selections {0.67, 0.67, 1.0} give min -> the 0.67 option, max -> the 1.0 option.
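
The example above can be reproduced with a short sketch. `aggregate_extreme` is a hypothetical function that works on zero-based option indices; it resolves value ties to the lowest index, as the tie-breaking note in this section describes for min/max.

```python
def aggregate_extreme(selected: list[int], option_values: list[float], strategy: str) -> int:
    """Return the selected option index with the lowest ("min") or highest ("max") value."""
    if strategy == "min":
        return min(selected, key=lambda i: (option_values[i], i))
    if strategy == "max":
        # Negating the index makes the lowest index win among equal values.
        return max(selected, key=lambda i: (option_values[i], -i))
    raise ValueError(f"unknown strategy: {strategy!r}")


option_values = [0.0, 0.33, 0.67, 1.0]
selected = [2, 2, 3]  # judges picked the 0.67, 0.67 and 1.0 options

print(aggregate_extreme(selected, option_values, "min"))  # 2 (the 0.67 option)
print(aggregate_extreme(selected, option_values, "max"))  # 3 (the 1.0 option)
```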

Tie-breaking: a mode count tie and a mean/median/weighted_mean snap tie (aggregated value equidistant from two options) both resolve, via Criterion.worst_option_among, to the tied option that minimizes the score given the sign of the weight: the lowest value for weight ≥ 0, the highest value for weight < 0, and the lowest index when values tie. The result is deterministic and independent of judge order. min/max value ties already resolve to the lowest index.
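
The tie-breaking rule can be sketched as a small function applied to a mode count tie. This is an illustration of the rule as described, not the source of Criterion.worst_option_among; the standalone function signature and the reading of `weight` as the criterion's weight are assumptions.

```python
from collections import Counter


def worst_option_among(tied: list[int], option_values: list[float], weight: float) -> int:
    """Pick the tied option that minimizes the score for the given weight sign."""
    if weight >= 0:
        # Lowest value hurts most; lowest index on a value tie.
        return min(tied, key=lambda i: (option_values[i], i))
    # For a negative weight the highest value hurts most; lowest index on a value tie.
    return min(tied, key=lambda i: (-option_values[i], i))


def mode_with_tie_break(selected: list[int], option_values: list[float], weight: float) -> int:
    """Most common selection, with count ties resolved by worst_option_among."""
    counts = Counter(selected)
    top = max(counts.values())
    tied = [i for i, count in counts.items() if count == top]
    return worst_option_among(tied, option_values, weight)


option_values = [0.0, 0.33, 0.67, 1.0]

# Two judges chose option 1 and two chose option 2: a count tie.
print(mode_with_tie_break([2, 1, 1, 2], option_values, weight=10.0))  # 1
print(mode_with_tie_break([1, 2, 2, 1], option_values, weight=10.0))  # 1 (judge order does not matter)
print(mode_with_tie_break([2, 1, 1, 2], option_values, weight=-5.0))  # 2
```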

NominalAggregation¶

Aggregation strategy for nominal multi-choice criteria in ensemble.

NominalAggregation `module-attribute` ¶

NominalAggregation = Literal['mode', 'weighted_mode', 'unanimous']

Aggregation strategy for nominal multi-choice criteria.

mode: Most common selection (majority vote)
weighted_mode: Weight votes by judge weight
unanimous: All judges must select the same option. On disagreement, abstain by selecting the criterion's NA option (verdict na=True, excluded from scoring under the SKIP strategy); if the criterion has no NA option, fall back to mode and emit a warning.

Tie-breaking: a mode count tie or a weighted_mode equal-weight tie resolves, via Criterion.worst_option_among, to the tied option that minimizes the score given the sign of the weight: the lowest value for weight ≥ 0, the highest value for weight < 0, and the lowest index when values tie. The result is deterministic and independent of judge order.
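
The unanimous strategy described above can be sketched with the standard library. `aggregate_unanimous` is a hypothetical function operating on zero-based option indices, with `na_index` set to `None` when the criterion has no NA option. The mode fallback here uses `Counter.most_common` and omits the library's tie-breaking for brevity.

```python
import warnings
from collections import Counter
from typing import Optional


def aggregate_unanimous(selected: list[int], na_index: Optional[int]) -> int:
    """Return the shared selection, or abstain via the NA option on disagreement."""
    if len(set(selected)) == 1:
        return selected[0]  # every judge picked the same option
    if na_index is not None:
        return na_index  # disagreement: abstain by selecting the NA option
    # No NA option to abstain with: fall back to mode and warn.
    warnings.warn("unanimous: judges disagree and no NA option exists; falling back to mode")
    return Counter(selected).most_common(1)[0][0]


print(aggregate_unanimous([2, 2, 2], na_index=3))  # 2
print(aggregate_unanimous([2, 0, 2], na_index=3))  # 3 (the NA option)
```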

References¶

Kim, S., Suk, J., Longpre, S., Lin, B. Y., Shin, J., Welleck, S., Neubig, G., Lee, M., Lee, K., and Seo, M. (2024). Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4334–4353.

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.
