Multi-Choice

Ordinal and nominal scales beyond binary MET/UNMET verdicts.

Overview

Multi-choice criteria support evaluation beyond binary verdicts:

  • Ordinal scales: Satisfaction ratings, quality levels with ordered values
  • Nominal scales: Categorical judgments where options may share values
  • NA options: Options excluded from scoring

Research Background

Multiple sources recommend low-precision ordinal scales (0-3 or 1-5) rather than high-precision numeric scales (1-10). Broad scales invite central-tendency and anchoring problems. Multi-choice criteria with explicit option values provide clear behavioral anchors (Kim et al., 2024; Zheng et al., 2023).

Ordinal Scale Example

# rubric.yaml
- name: satisfaction
  requirement: "How satisfied would you be with this response?"
  weight: 10.0
  scale_type: ordinal
  options:
    - label: "1"
      value: 0.0
    - label: "2"
      value: 0.33
    - label: "3"
      value: 0.67
    - label: "4"
      value: 1.0
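The values in this example are spread evenly over [0, 1]. As a minimal sketch, assuming you want the same spacing for a scale of any length, a helper like the one below generates them. The function name `evenly_spaced_values` is hypothetical and not part of AutoRubric.

```python
def evenly_spaced_values(n_options: int) -> list[float]:
    """Return n_options values spread evenly over [0, 1], rounded to 2 decimals."""
    if n_options < 2:
        raise ValueError("an ordinal scale needs at least two options")
    # Option i of n gets i / (n - 1), so the first is 0.0 and the last is 1.0.
    return [round(i / (n_options - 1), 2) for i in range(n_options)]


print(evenly_spaced_values(4))  # [0.0, 0.33, 0.67, 1.0]
```

Even spacing is only one choice; explicit values let you make the gap between two adjacent options larger or smaller when the scale is not linear.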

Nominal Scale Example

- name: efficiency
  requirement: "Is the number of exchange turns appropriate?"
  weight: 5.0
  scale_type: nominal
  options:
    - label: "Too few interactions"
      value: 0.0
    - label: "Too many interactions"
      value: 0.0
    - label: "Just right"
      value: 1.0

NA Options

Mark an option with na: true to exclude it from scoring (like CANNOT_ASSESS for binary criteria):

options:
  - label: "None"
    value: 0.0
  - label: "All claims"
    value: 1.0
  - label: "NA - No references provided"
    na: true

Ensemble Aggregation

from autorubric.graders import CriterionGrader

grader = CriterionGrader(
    judges=[...],
    aggregation="majority",           # For binary criteria
    ordinal_aggregation="mean",       # "mean", "median", "weighted_mean", "mode", "min", "max"
    nominal_aggregation="mode",       # "mode", "weighted_mode", "unanimous"
)

The three knobs are independent. For ordinal criteria, min and max are the conservative and permissive analogs of binary unanimous and any: they return the lowest or highest option that any judge selected. For nominal criteria, unanimous abstains by selecting the NA option when judges disagree; if the criterion has no NA option, it falls back to mode and emits a warning.

Position Bias Mitigation

LLM judges exhibit position bias in multi-choice settings. AutoRubric shuffles options by default:

# Default: shuffling enabled, seed auto-generated
grader = CriterionGrader(llm_config=config)

# Pin the seed for reproducible shuffles
grader = CriterionGrader(llm_config=config, seed=42)

# Disable shuffling entirely
grader = CriterionGrader(llm_config=config, shuffle_options=False)

The shuffle order for each criterion is recorded in CriterionReport.shuffle_order and persisted in experiment checkpoints.
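The following sketch shows one way a seeded shuffle and the mapping back to the original option index could work. It is illustrative only: `shuffle_options` and `to_original_index` are hypothetical helpers, and the convention that `shuffle_order[i]` holds the original index shown at position `i` is an assumption about how the recorded permutation is laid out.

```python
import random


def shuffle_options(labels: list[str], seed: int) -> tuple[list[int], list[str]]:
    """Return (shuffle_order, presented_labels) for a seeded shuffle."""
    order = list(range(len(labels)))
    random.Random(seed).shuffle(order)  # a private RNG keeps global state untouched
    return order, [labels[i] for i in order]


def to_original_index(selected_option: int, shuffle_order: list[int]) -> int:
    """Map the judge's 1-indexed pick (in presented order) to a 0-based original index."""
    return shuffle_order[selected_option - 1]


labels = ["Too few interactions", "Too many interactions", "Just right"]
order, presented = shuffle_options(labels, seed=42)
picked = to_original_index(1, order)  # the judge chose the first option it saw
print(presented[0] == labels[picked])  # True
```

Recording the permutation alongside the verdict is what makes the stored index comparable across runs, whatever order the judge saw.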

Ground Truth Format

dataset.add_item(
    submission="Response text...",
    description="Good response",
    ground_truth=[
        CriterionVerdict.MET,  # Binary criterion
        "4",                    # Multi-choice ordinal
        "Just right",           # Multi-choice nominal
    ]
)

CriterionOption

Single option for multi-choice criteria.

CriterionOption

Bases: BaseModel

A single option in a multi-choice criterion.

ATTRIBUTE DESCRIPTION
label

Display text shown to the LLM judge.

TYPE: str

value

Score value (0.0-1.0) when this option is selected. REQUIRED.

TYPE: float

na

If True, this option indicates "not applicable" and is treated like CANNOT_ASSESS (excluded from scoring).

TYPE: bool

Example

Ordinal scale with explicit values

options = [
    CriterionOption(label="Very dissatisfied", value=0.0),
    CriterionOption(label="Dissatisfied", value=0.33),
    CriterionOption(label="Satisfied", value=0.67),
    CriterionOption(label="Very satisfied", value=1.0),
]

Option with NA

na_option = CriterionOption(label="N/A - No claims made", value=0.0, na=True)

validate_value_range

validate_value_range() -> CriterionOption

Validate that non-NA options have values in [0, 1].

Source code in src/autorubric/types.py
@model_validator(mode="after")
def validate_value_range(self) -> "CriterionOption":
    """Validate that non-NA options have values in [0, 1]."""
    if not self.na and not (0.0 <= self.value <= 1.0):
        raise ValueError(f"Option value must be in [0, 1], got {self.value}")
    return self

MultiChoiceVerdict

Verdict for a multi-choice criterion.

MultiChoiceVerdict

Bases: BaseModel

Verdict for a multi-choice criterion evaluation.

Stores both index (stable, for metrics computation) and label (readable, for reports). This design enables future metrics like kappa, accuracy, and confusion matrices.

ATTRIBUTE DESCRIPTION
selected_index

Zero-based index of the selected option. STABLE for metrics. None when the verdict is an abstain synthesized after an infrastructure or parse failure on a criterion with no NA option (forced-choice, auto_na_option=False). In that case the verdict has na=True but no real option was selected, so na=True never points at a scored option.

TYPE: int | None

selected_label

Label text of the selected option. READABLE for reports. None in the same no-option-selected abstain case as selected_index.

TYPE: str | None

value

Score contribution of the selected option (0.0-1.0).

TYPE: float

na

True if the selected option is marked as NA (not applicable).

TYPE: bool

Example

verdict = MultiChoiceVerdict(
    selected_index=2,
    selected_label="Satisfied",
    value=0.67,
    na=False
)


AggregatedMultiChoiceVerdict

Aggregated verdict from ensemble for multi-choice criteria.

AggregatedMultiChoiceVerdict

Bases: MultiChoiceVerdict

Extended verdict for ensemble aggregation results.

Stores both discrete (snapped to nearest option) and continuous (actual mean/median) results to support different metrics:

  • Discrete (selected_index, value): for exact accuracy, kappa
  • Continuous (aggregated_value): for RMSE, MAE on scores

ATTRIBUTE DESCRIPTION
aggregated_value

Continuous aggregated value before snapping to nearest option. For ordinal scales with mean/median aggregation, this may differ from value. For nominal scales with mode aggregation, this equals value.

TYPE: float

Example

Mean of [0.0, 0.33, 0.67] = 0.33, snapped to option 1

verdict = AggregatedMultiChoiceVerdict(
    selected_index=1,
    selected_label="Dissatisfied",
    value=0.33,  # Value of snapped option
    aggregated_value=0.33,  # Actual mean
    na=False
)
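The discrete and continuous results can differ. The sketch below, which is an illustration and not AutoRubric's implementation, averages judge values and snaps the mean to the nearest option; `mean_and_snap` is a hypothetical helper name.

```python
from statistics import mean

OPTION_VALUES = [0.0, 0.33, 0.67, 1.0]


def mean_and_snap(votes: list[float], option_values: list[float]) -> tuple[int, float, float]:
    """Return (selected_index, value, aggregated_value) for mean aggregation."""
    aggregated = mean(votes)
    # min() keeps the first minimum, so an exact distance tie goes to the lowest index.
    index = min(range(len(option_values)), key=lambda i: abs(option_values[i] - aggregated))
    return index, option_values[index], aggregated


index, value, aggregated = mean_and_snap([0.0, 0.33, 1.0], OPTION_VALUES)
print(index, value, round(aggregated, 4))  # 1 0.33 0.4433
```

Here the snapped value (0.33) suits exact-match metrics, while the continuous mean (about 0.4433) is the better input for RMSE or MAE.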


MultiChoiceJudgment

LLM judgment for a multi-choice criterion.

MultiChoiceJudgment

Bases: BaseModel

Structured LLM output for multi-choice criterion evaluation.

Used with LiteLLM's response_format parameter to ensure type-safe, validated responses from the judge LLM for multi-choice criteria.

Note: The LLM uses 1-indexed option numbers for human readability. The grader converts to 0-indexed internally.

ATTRIBUTE DESCRIPTION
selected_option

1-indexed number of the selected option (1, 2, 3, etc.)

TYPE: int

explanation

Brief explanation of why this option was selected. When thinking is enabled this is the concise conclusion distilled from reasoning.

TYPE: str

reasoning

Verbose extended-thinking deliberation trace (populated only when thinking is enabled; None otherwise). explanation is distilled from it.

TYPE: str | None


MultiChoiceJudgeVote

Individual judge vote for multi-choice criteria in ensemble.

MultiChoiceJudgeVote

Bases: BaseModel

Individual judge's vote for a multi-choice criterion (ensemble mode).

Preserves full vote details for per-judge metrics and inter-judge agreement analysis (Krippendorff's alpha, Fleiss' kappa, per-judge accuracy).

ATTRIBUTE DESCRIPTION
judge_id

Identifier for the judge (e.g., "gpt-4", "claude-sonnet").

TYPE: str

selected_index

Zero-based index of selected option. STABLE for metrics. None for a genuine abstain synthesized on a judge-call failure when the criterion has no NA option (forced-choice); otherwise identifies the option.

TYPE: int | None

selected_label

Label of selected option. READABLE for reports. None in the same no-option-selected abstain case as selected_index.

TYPE: str | None

value

Score value of selected option.

TYPE: float

reason

Judge's brief justification for the selection (the conclusion distilled from reasoning when thinking is enabled).

TYPE: str

weight

Judge's voting weight (default 1.0).

TYPE: float

na

True if selected option is NA.

TYPE: bool

shuffle_order

Permutation used when presenting options to the judge.

TYPE: list[int] | None

error

Set (with a category prefix) when this vote's verdict was synthesized because the judge call failed. None for genuine votes. Mirrors JudgeVote.error for multi-choice criteria.

TYPE: str | None

reasoning

The judge's verbose extended-thinking deliberation trace (populated only when thinking is enabled; None otherwise). reason is the conclusion distilled from it. Mirrors JudgeVote.reasoning for multi-choice criteria.

TYPE: str | None

is_error property

is_error: bool

Whether this vote's verdict was synthesized due to a judge-call failure.

Use this instead of inspecting reason to distinguish error-induced votes from genuine judgments.


OrdinalAggregation

Aggregation strategy for ordinal multi-choice criteria in ensemble.

OrdinalAggregation module-attribute

OrdinalAggregation = Literal['mean', 'median', 'weighted_mean', 'mode', 'min', 'max']

Aggregation strategy for ordinal multi-choice criteria.

Central tendency:

  • mean: Average of score values across judges, snapped to the nearest option
  • median: Median of score values, snapped to the nearest option
  • weighted_mean: Weighted average by judge weight, snapped to the nearest option
  • mode: Most common selection

Conservative / permissive (the ordinal analogs of binary unanimous / any):

  • min: The option with the lowest value any judge selected (conservative).
  • max: The option with the highest value any judge selected (permissive).

min/max are gentle robust extremes: they return the lowest/highest option a judge actually selected, not a "reset to worst on any dissent". Example: selections {0.67, 0.67, 1.0} give min -> the 0.67 option, max -> the 1.0 option.

Tie-breaking: a count tie under mode, and a snap tie under mean/median/weighted_mean (the aggregated value is equidistant from two options), both resolve via Criterion.worst_option_among to the tied option that minimizes the score given the sign of the weight: the lowest value for weight ≥ 0, the highest value for weight < 0, and the lowest index if the values tie. The result is deterministic and independent of judge order. For min/max, value ties already resolve to the lowest index.
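The min/max and tie-breaking rules can be sketched in a few lines. This is a standalone re-implementation of the behavior described above for illustration, not the library's source; the free function `worst_option_among` only mimics the documented rule of `Criterion.worst_option_among`, and its signature is an assumption.

```python
from collections import Counter

VALUES = [0.0, 0.33, 0.67, 1.0]  # option values, indexed by option


def worst_option_among(tied: list[int], values: list[float], weight: float) -> int:
    """Pick the score-minimizing tied option: lowest value if weight >= 0, else highest."""
    sign = 1 if weight >= 0 else -1
    return min(tied, key=lambda i: (sign * values[i], i))  # lowest index on a value tie


def mode_with_tiebreak(selected: list[int], values: list[float], weight: float) -> int:
    counts = Counter(selected)
    top = max(counts.values())
    tied = [i for i, c in counts.items() if c == top]
    return worst_option_among(tied, values, weight)


def ordinal_min(selected: list[int], values: list[float]) -> int:
    return min(selected, key=lambda i: (values[i], i))


def ordinal_max(selected: list[int], values: list[float]) -> int:
    return min(selected, key=lambda i: (-values[i], i))


votes = [2, 2, 3]  # judges picked the 0.67, 0.67, and 1.0 options
print(ordinal_min(votes, VALUES), ordinal_max(votes, VALUES))  # 2 3
print(mode_with_tiebreak([1, 3], VALUES, weight=10.0))  # 1
print(mode_with_tiebreak([1, 3], VALUES, weight=-5.0))  # 3
```

Sorting on a (signed value, index) key is what makes the outcome independent of the order in which judge votes arrive.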


NominalAggregation

Aggregation strategy for nominal multi-choice criteria in ensemble.

NominalAggregation module-attribute

NominalAggregation = Literal['mode', 'weighted_mode', 'unanimous']

Aggregation strategy for nominal multi-choice criteria.

  • mode: Most common selection (majority vote)
  • weighted_mode: Weight votes by judge weight
  • unanimous: All judges must select the same option. On disagreement, abstain by selecting the criterion's NA option (verdict na=True, excluded from scoring under the SKIP strategy); if the criterion has no NA option, fall back to mode and emit a warning.

Tie-breaking: a mode count tie or a weighted_mode equal-weight tie resolves to the score-minimizing tied option by weight sign (lowest value for weight ≥ 0, highest for weight < 0; lowest index on a value tie) via Criterion.worst_option_among — deterministic, independent of judge order.


References

Kim, S., Suk, J., Longpre, S., Lin, B. Y., Shin, J., Welleck, S., Neubig, G., Lee, M., Lee, K., and Seo, M. (2024). Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4334–4353.

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.