Multi-Choice

Ordinal and nominal scales beyond binary MET/UNMET verdicts.

Overview

Multi-choice criteria support evaluation beyond binary verdicts:

  • Ordinal scales: Satisfaction ratings, quality levels with ordered values
  • Nominal scales: Categorical judgments where options may share values
  • NA options: Options excluded from scoring

Research Background

Multiple sources recommend low-precision ordinal scales (0-3 or 1-5) over high-precision numeric scales (1-10): fine-grained scales invite central-tendency and anchoring problems, whereas multi-choice criteria with explicit option values give judges clear behavioral anchors (Kim et al., 2024; Zheng et al., 2023).

Ordinal Scale Example

# rubric.yaml
- name: satisfaction
  requirement: "How satisfied would you be with this response?"
  weight: 10.0
  scale_type: ordinal
  options:
    - label: "1"
      value: 0.0
    - label: "2"
      value: 0.33
    - label: "3"
      value: 0.67
    - label: "4"
      value: 1.0
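
The same options can be defined programmatically via CriterionOption (documented in the API reference below). A minimal sketch mirroring the YAML above, assuming CriterionOption is importable from autorubric.types as suggested by the source path shown later:

from autorubric.types import CriterionOption

# Four-point ordinal scale, equivalent to the rubric.yaml options above
satisfaction_options = [
    CriterionOption(label="1", value=0.0),
    CriterionOption(label="2", value=0.33),
    CriterionOption(label="3", value=0.67),
    CriterionOption(label="4", value=1.0),
]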

Nominal Scale Example

- name: efficiency
  requirement: "Is the number of exchange turns appropriate?"
  weight: 5.0
  scale_type: nominal
  options:
    - label: "Too few interactions"
      value: 0.0
    - label: "Too many interactions"
      value: 0.0
    - label: "Just right"
      value: 1.0

NA Options

Options marked na: true are excluded from scoring, analogous to CANNOT_ASSESS for binary criteria:

options:
  - label: "None"
    value: 0.0
  - label: "All claims"
    value: 1.0
  - label: "NA - No references provided"
    na: true
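
As a rough illustration of what "excluded from scoring" means, the sketch below drops NA verdicts from both the weighted sum and the normalizing weight. The weighted-sum formula here is an assumption for illustration, not AutoRubric's actual scoring code:

# Hypothetical (criterion weight, selected option value, is_na) triples
verdicts = [
    (10.0, 0.67, False),  # ordinal criterion, option "3" selected
    (5.0, 1.0, False),    # nominal criterion, "Just right" selected
    (3.0, 0.0, True),     # NA option selected -> excluded entirely
]

# Assumed scoring: NA criteria contribute neither score nor weight
scored = [(w, v) for w, v, na in verdicts if not na]
score = sum(w * v for w, v in scored) / sum(w for w, _ in scored)
print(round(score, 3))  # 0.78 under this assumed formula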

Ensemble Aggregation

from autorubric.graders import CriterionGrader

grader = CriterionGrader(
    judges=[...],
    aggregation="majority",           # For binary criteria
    ordinal_aggregation="mean",       # "mean", "median", "weighted_mean", "mode"
    nominal_aggregation="mode",       # "mode", "weighted_mode", "unanimous"
)

Position Bias Mitigation

LLM judges exhibit position bias in multi-choice settings. AutoRubric shuffles options by default:

# Default: shuffling enabled
grader = CriterionGrader(llm_config=config)

# Disable for deterministic behavior
grader = CriterionGrader(llm_config=config, shuffle_options=False)

Ground Truth Format

dataset.add_item(
    submission="Response text...",
    description="Good response",
    ground_truth=[
        CriterionVerdict.MET,  # Binary criterion
        "4",                    # Multi-choice ordinal
        "Just right",           # Multi-choice nominal
    ]
)

CriterionOption

Single option for multi-choice criteria.

CriterionOption

Bases: BaseModel

A single option in a multi-choice criterion.

ATTRIBUTE DESCRIPTION
label

Display text shown to the LLM judge.

TYPE: str

value

Score value (0.0-1.0) when this option is selected. REQUIRED.

TYPE: float

na

If True, this option indicates "not applicable" and is treated like CANNOT_ASSESS (excluded from scoring).

TYPE: bool

Example

Ordinal scale with explicit values

options = [
    CriterionOption(label="Very dissatisfied", value=0.0),
    CriterionOption(label="Dissatisfied", value=0.33),
    CriterionOption(label="Satisfied", value=0.67),
    CriterionOption(label="Very satisfied", value=1.0),
]

Option with NA

na_option = CriterionOption(label="N/A - No claims made", value=0.0, na=True)

validate_value_range

validate_value_range() -> CriterionOption

Validate that non-NA options have values in [0, 1].

Source code in src/autorubric/types.py
@model_validator(mode="after")
def validate_value_range(self) -> "CriterionOption":
    """Validate that non-NA options have values in [0, 1]."""
    if not self.na and not (0.0 <= self.value <= 1.0):
        raise ValueError(f"Option value must be in [0, 1], got {self.value}")
    return self
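
For example, an out-of-range value fails at construction time, while NA options skip the range check (importing from autorubric.types, per the source path above):

from pydantic import ValidationError
from autorubric.types import CriterionOption

try:
    CriterionOption(label="Bad", value=1.5)
except ValidationError as err:
    print(err)  # Option value must be in [0, 1], got 1.5

# NA options are not range-checked by this validator
CriterionOption(label="N/A - Not applicable", value=0.0, na=True)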

MultiChoiceVerdict

Verdict for a multi-choice criterion.

MultiChoiceVerdict

Bases: BaseModel

Verdict for a multi-choice criterion evaluation.

Stores both index (stable, for metrics computation) and label (readable, for reports). This design enables future metrics like kappa, accuracy, and confusion matrices.

ATTRIBUTE DESCRIPTION
selected_index

Zero-based index of the selected option. STABLE for metrics.

TYPE: int

selected_label

Label text of the selected option. READABLE for reports.

TYPE: str

value

Score contribution of the selected option (0.0-1.0).

TYPE: float

na

True if the selected option is marked as NA (not applicable).

TYPE: bool

Example

verdict = MultiChoiceVerdict(
    selected_index=2,
    selected_label="Satisfied",
    value=0.67,
    na=False,
)


AggregatedMultiChoiceVerdict

Aggregated verdict from ensemble for multi-choice criteria.

AggregatedMultiChoiceVerdict

Bases: MultiChoiceVerdict

Extended verdict for ensemble aggregation results.

Stores both discrete (snapped to nearest option) and continuous (actual mean/median) results to support different metrics:

  • Discrete (selected_index, value): for exact accuracy, kappa
  • Continuous (aggregated_value): for RMSE, MAE on scores

ATTRIBUTE DESCRIPTION
aggregated_value

Continuous aggregated value before snapping to nearest option. For ordinal scales with mean/median aggregation, this may differ from value. For nominal scales with mode aggregation, this equals value.

TYPE: float

Example

Mean of [0.0, 0.33, 0.67] = 0.33, snapped to option 1

verdict = AggregatedMultiChoiceVerdict(
    selected_index=1,
    selected_label="Dissatisfied",
    value=0.33,             # Value of snapped option
    aggregated_value=0.33,  # Actual mean
    na=False,
)
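
A minimal sketch of the snapping step this example describes: average the judges' option values, then select the option whose value is closest to that mean. This is an illustrative reimplementation under assumed semantics, not the library's internal code:

# Four-point satisfaction scale from the CriterionOption example above
option_labels = ["Very dissatisfied", "Dissatisfied", "Satisfied", "Very satisfied"]
option_values = [0.0, 0.33, 0.67, 1.0]

judge_values = [0.0, 0.33, 0.67]                          # three judges' selected option values
aggregated_value = sum(judge_values) / len(judge_values)  # 0.333... (continuous)

# Snap to the nearest option to obtain the discrete verdict
snapped = min(range(len(option_values)), key=lambda i: abs(option_values[i] - aggregated_value))
print(snapped, option_labels[snapped], option_values[snapped])  # 1 Dissatisfied 0.33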


MultiChoiceJudgment

LLM judgment for a multi-choice criterion.

MultiChoiceJudgment

Bases: BaseModel

Structured LLM output for multi-choice criterion evaluation.

Used with LiteLLM's response_format parameter to ensure type-safe, validated responses from the judge LLM for multi-choice criteria.

Note: The LLM uses 1-indexed option numbers for human readability. The grader converts to 0-indexed internally.

ATTRIBUTE DESCRIPTION
selected_option

1-indexed number of the selected option (1, 2, 3, etc.)

TYPE: int

explanation

Brief explanation of why this option was selected.

TYPE: str

reasoning

Extended thinking/reasoning trace (populated when thinking enabled).

TYPE: str | None
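
The index conversion is simple but easy to get wrong when post-processing raw judgments; a sketch of the documented behavior, using a plain integer in place of an actual MultiChoiceJudgment instance:

# The LLM reports a 1-indexed choice for readability,
# e.g. MultiChoiceJudgment(selected_option=3, explanation="...")
selected_option = 3

option_labels = ["Very dissatisfied", "Dissatisfied", "Satisfied", "Very satisfied"]

# The grader converts to a 0-based index before looking up the option
selected_index = selected_option - 1
print(selected_index, option_labels[selected_index])  # 2 Satisfied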


MultiChoiceJudgeVote

Individual judge vote for multi-choice criteria in ensemble.

MultiChoiceJudgeVote dataclass

MultiChoiceJudgeVote(judge_id: str, selected_index: int, selected_label: str, value: float, reason: str, weight: float = 1.0, na: bool = False)

Individual judge's vote for a multi-choice criterion (ensemble mode).

Preserves full vote details for per-judge metrics and inter-judge agreement analysis (e.g., Fleiss' kappa, per-judge accuracy).

ATTRIBUTE DESCRIPTION
judge_id

Identifier for the judge (e.g., "gpt-4", "claude-sonnet").

TYPE: str

selected_index

Zero-based index of selected option. STABLE for metrics.

TYPE: int

selected_label

Label of selected option. READABLE for reports.

TYPE: str

value

Score value of selected option.

TYPE: float

reason

Judge's explanation for the selection.

TYPE: str

weight

Judge's voting weight (default 1.0).

TYPE: float

na

True if selected option is NA.

TYPE: bool
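
Because each vote keeps its judge_id and selected_index, inter-judge agreement can be computed directly from the votes. A sketch using the dataclass as documented above; the autorubric.types import path is an assumption:

from itertools import combinations
from autorubric.types import MultiChoiceJudgeVote  # assumed import location

votes = [
    MultiChoiceJudgeVote(judge_id="gpt-4", selected_index=2, selected_label="Satisfied",
                         value=0.67, reason="Mostly helpful"),
    MultiChoiceJudgeVote(judge_id="claude-sonnet", selected_index=2, selected_label="Satisfied",
                         value=0.67, reason="Clear and relevant"),
    MultiChoiceJudgeVote(judge_id="judge-3", selected_index=3, selected_label="Very satisfied",
                         value=1.0, reason="Exceeds expectations"),
]

# Pairwise exact-agreement rate on the selected option index
pairs = list(combinations(votes, 2))
agreement = sum(a.selected_index == b.selected_index for a, b in pairs) / len(pairs)
print(round(agreement, 2))  # 0.33 -- only the first two judges agree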


OrdinalAggregation

Aggregation strategy for ordinal multi-choice criteria in ensemble.

OrdinalAggregation module-attribute

OrdinalAggregation = Literal['mean', 'median', 'weighted_mean', 'mode']

Aggregation strategy for ordinal multi-choice criteria.

  • mean: Average of score values across judges
  • median: Median of score values
  • weighted_mean: Weighted average by judge weight
  • mode: Most common selection
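
A compact sketch of how these strategies could combine per-judge values and weights; the semantics are assumed for illustration, not taken from the library's internals. Mean, median, and weighted_mean results would then be snapped to the nearest option, as described for AggregatedMultiChoiceVerdict above:

from statistics import median, mode

# (selected option value, judge weight) per judge
votes = [(0.33, 1.0), (0.67, 1.0), (0.67, 2.0)]
values = [v for v, _ in votes]

mean_value = sum(values) / len(values)                                   # "mean"          -> 0.557
median_value = median(values)                                            # "median"        -> 0.67
weighted_mean = sum(v * w for v, w in votes) / sum(w for _, w in votes)  # "weighted_mean" -> 0.585
mode_value = mode(values)                                                # "mode"          -> 0.67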

NominalAggregation

Aggregation strategy for nominal multi-choice criteria in ensemble.

NominalAggregation module-attribute

NominalAggregation = Literal['mode', 'weighted_mode', 'unanimous']

Aggregation strategy for nominal multi-choice criteria.

  • mode: Most common selection (majority vote)
  • weighted_mode: Weight votes by judge weight
  • unanimous: All judges must agree
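
A similar sketch for nominal criteria, where aggregation works on selected labels rather than numeric values; again an assumed illustration rather than the library's internal code:

from collections import Counter

# (selected label, judge weight) per judge
votes = [("Just right", 1.0), ("Just right", 1.0), ("Too many interactions", 3.0)]
labels = [label for label, _ in votes]

mode_label = Counter(labels).most_common(1)[0][0]      # "mode"          -> "Just right"

weighted = Counter()
for label, weight in votes:
    weighted[label] += weight
weighted_mode = weighted.most_common(1)[0][0]          # "weighted_mode" -> "Too many interactions"

unanimous = len(set(labels)) == 1                      # "unanimous"     -> False (judges disagree)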

References

Kim, S., Suk, J., Longpre, S., Lin, B. Y., Shin, J., Welleck, S., Neubig, G., Lee, M., Lee, K., and Seo, M. (2024). Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4334–4353.

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.