Multi-Choice

Ordinal and nominal scales beyond binary MET/UNMET verdicts.

Overview

Multi-choice criteria support evaluation beyond binary verdicts:

  • Ordinal scales: Satisfaction ratings, quality levels with ordered values
  • Nominal scales: Categorical judgments where options may share values
  • NA options: Options excluded from scoring

Research Background

Multiple sources recommend low-precision ordinal scales (0-3 or 1-5) over high-precision numeric scales (1-10): fine-grained scales invite central-tendency and anchoring problems, whereas multi-choice criteria with explicit option values give judges clear behavioral anchors (Kim et al., 2024; Zheng et al., 2023).

Ordinal Scale Example

# rubric.yaml
- name: satisfaction
  requirement: "How satisfied would you be with this response?"
  weight: 10.0
  scale_type: ordinal
  options:
    - label: "1"
      value: 0.0
    - label: "2"
      value: 0.33
    - label: "3"
      value: 0.67
    - label: "4"
      value: 1.0
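
The same options can be defined programmatically via CriterionOption (documented in the API reference below). A minimal sketch mirroring the YAML above, assuming CriterionOption is importable from autorubric.types as suggested by the source path shown later:

from autorubric.types import CriterionOption

# Four-point ordinal scale, equivalent to the rubric.yaml options above
satisfaction_options = [
    CriterionOption(label="1", value=0.0),
    CriterionOption(label="2", value=0.33),
    CriterionOption(label="3", value=0.67),
    CriterionOption(label="4", value=1.0),
]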

Nominal Scale Example

- name: efficiency
  requirement: "Is the number of exchange turns appropriate?"
  weight: 5.0
  scale_type: nominal
  options:
    - label: "Too few interactions"
      value: 0.0
    - label: "Too many interactions"
      value: 0.0
    - label: "Just right"
      value: 1.0

NA Options

Options marked na: true are excluded from scoring, analogous to CANNOT_ASSESS for binary criteria:

options:
  - label: "None"
    value: 0.0
  - label: "All claims"
    value: 1.0
  - label: "NA - No references provided"
    na: true
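
As a rough illustration of what "excluded from scoring" means, the sketch below drops NA verdicts from both the weighted sum and the normalizing weight. The weighted-sum formula here is an assumption for illustration, not AutoRubric's actual scoring code:

# Hypothetical (criterion weight, selected option value, is_na) triples
verdicts = [
    (10.0, 0.67, False),  # ordinal criterion, option "3" selected
    (5.0, 1.0, False),    # nominal criterion, "Just right" selected
    (3.0, 0.0, True),     # NA option selected -> excluded entirely
]

# Assumed scoring: NA criteria contribute neither score nor weight
scored = [(w, v) for w, v, na in verdicts if not na]
score = sum(w * v for w, v in scored) / sum(w for w, _ in scored)
print(round(score, 3))  # 0.78 under this assumed formula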

Ensemble Aggregation

from autorubric.graders import CriterionGrader

grader = CriterionGrader(
    judges=[...],
    aggregation="majority",           # For binary criteria
    ordinal_aggregation="mean",       # "mean", "median", "weighted_mean", "mode"
    nominal_aggregation="mode",       # "mode", "weighted_mode", "unanimous"
)

Position Bias Mitigation

LLM judges exhibit position bias in multi-choice settings. AutoRubric shuffles options by default:

# Default: shuffling enabled
grader = CriterionGrader(llm_config=config)

# Disable for deterministic behavior
grader = CriterionGrader(llm_config=config, shuffle_options=False)

Ground Truth Format

dataset.add_item(
    submission="Response text...",
    description="Good response",
    ground_truth=[
        CriterionVerdict.MET,  # Binary criterion
        "4",                    # Multi-choice ordinal
        "Just right",           # Multi-choice nominal
    ]
)

CriterionOption

Single option for multi-choice criteria.

CriterionOption

Bases: BaseModel

A single option in a multi-choice criterion.

ATTRIBUTE DESCRIPTION
label

Display text shown to the LLM judge.

TYPE: str

value

Score value (0.0-1.0) when this option is selected. REQUIRED.

TYPE: float

na

If True, this option indicates "not applicable" and is treated like CANNOT_ASSESS (excluded from scoring).

TYPE: bool

Example

Ordinal scale with explicit values

options = [
    CriterionOption(label="Very dissatisfied", value=0.0),
    CriterionOption(label="Dissatisfied", value=0.33),
    CriterionOption(label="Satisfied", value=0.67),
    CriterionOption(label="Very satisfied", value=1.0),
]

Option with NA

na_option = CriterionOption(label="N/A - No claims made", value=0.0, na=True)

validate_value_range

validate_value_range() -> CriterionOption

Validate that non-NA options have values in [0, 1].

Source code in src/autorubric/types.py
@model_validator(mode="after")
def validate_value_range(self) -> "CriterionOption":
    """Validate that non-NA options have values in [0, 1]."""
    if not self.na and not (0.0 <= self.value <= 1.0):
        raise ValueError(f"Option value must be in [0, 1], got {self.value}")
    return self
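
For example, an out-of-range value fails at construction time, while NA options skip the range check (importing from autorubric.types, per the source path above):

from pydantic import ValidationError
from autorubric.types import CriterionOption

try:
    CriterionOption(label="Bad", value=1.5)
except ValidationError as err:
    print(err)  # Option value must be in [0, 1], got 1.5

# NA options are not range-checked by this validator
CriterionOption(label="N/A - Not applicable", value=0.0, na=True)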

MultiChoiceVerdict

Verdict for a multi-choice criterion.

MultiChoiceVerdict

Bases: BaseModel

Verdict for a multi-choice criterion evaluation.

Stores both index (stable, for metrics computation) and label (readable, for reports). This design enables future metrics like kappa, accuracy, and confusion matrices.

ATTRIBUTE DESCRIPTION
selected_index

Zero-based index of the selected option. STABLE for metrics.

TYPE: int

selected_label

Label text of the selected option. READABLE for reports.

TYPE: str

value

Score contribution of the selected option (0.0-1.0).

TYPE: float

na

True if the selected option is marked as NA (not applicable).

TYPE: bool

Example

verdict = MultiChoiceVerdict(
    selected_index=2,
    selected_label="Satisfied",
    value=0.67,
    na=False,
)


AggregatedMultiChoiceVerdict

Aggregated verdict from ensemble for multi-choice criteria.

AggregatedMultiChoiceVerdict

Bases: MultiChoiceVerdict

Extended verdict for ensemble aggregation results.

Stores both discrete (snapped to nearest option) and continuous (actual mean/median) results to support different metrics:

  • Discrete (selected_index, value): for exact accuracy, kappa
  • Continuous (aggregated_value): for RMSE, MAE on scores

ATTRIBUTE DESCRIPTION
aggregated_value

Continuous aggregated value before snapping to nearest option. For ordinal scales with mean/median aggregation, this may differ from value. For nominal scales with mode aggregation, this equals value.

TYPE: float

Example

Mean of [0.0, 0.33, 0.67] = 0.33, snapped to option 1

verdict = AggregatedMultiChoiceVerdict(
    selected_index=1,
    selected_label="Dissatisfied",
    value=0.33,             # Value of snapped option
    aggregated_value=0.33,  # Actual mean
    na=False,
)
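
A minimal sketch of the snapping step this example describes: average the judges' option values, then select the option whose value is closest to that mean. This is an illustrative reimplementation under assumed semantics, not the library's internal code:

# Four-point satisfaction scale from the CriterionOption example above
option_labels = ["Very dissatisfied", "Dissatisfied", "Satisfied", "Very satisfied"]
option_values = [0.0, 0.33, 0.67, 1.0]

judge_values = [0.0, 0.33, 0.67]                          # three judges' selected option values
aggregated_value = sum(judge_values) / len(judge_values)  # 0.333... (continuous)

# Snap to the nearest option to obtain the discrete verdict
snapped = min(range(len(option_values)), key=lambda i: abs(option_values[i] - aggregated_value))
print(snapped, option_labels[snapped], option_values[snapped])  # 1 Dissatisfied 0.33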


MultiChoiceJudgment

LLM judgment for a multi-choice criterion.

MultiChoiceJudgment

Bases: BaseModel

Structured LLM output for multi-choice criterion evaluation.

Used with LiteLLM's response_format parameter to ensure type-safe, validated responses from the judge LLM for multi-choice criteria.

Note: The LLM uses 1-indexed option numbers for human readability. The grader converts to 0-indexed internally.

ATTRIBUTE DESCRIPTION
selected_option

1-indexed number of the selected option (1, 2, 3, etc.)

TYPE: int

explanation

Brief explanation of why this option was selected.

TYPE: str

reasoning

Extended thinking/reasoning trace (populated when thinking enabled).

TYPE: str | None
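
The index conversion is simple but easy to get wrong when post-processing raw judgments; a sketch of the documented behavior, using a plain integer in place of an actual MultiChoiceJudgment instance:

# The LLM reports a 1-indexed choice for readability,
# e.g. MultiChoiceJudgment(selected_option=3, explanation="...")
selected_option = 3

option_labels = ["Very dissatisfied", "Dissatisfied", "Satisfied", "Very satisfied"]

# The grader converts to a 0-based index before looking up the option
selected_index = selected_option - 1
print(selected_index, option_labels[selected_index])  # 2 Satisfied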


MultiChoiceJudgeVote

Individual judge vote for multi-choice criteria in ensemble.

MultiChoiceJudgeVote dataclass

MultiChoiceJudgeVote(judge_id: str, selected_index: int, selected_label: str, value: float, reason: str, weight: float = 1.0, na: bool = False)

Individual judge's vote for a multi-choice criterion (ensemble mode).

Preserves full vote details for per-judge metrics and inter-judge agreement analysis (e.g., Fleiss' kappa, per-judge accuracy).

ATTRIBUTE DESCRIPTION
judge_id

Identifier for the judge (e.g., "gpt-4", "claude-sonnet").

TYPE: str

selected_index

Zero-based index of selected option. STABLE for metrics.

TYPE: int

selected_label

Label of selected option. READABLE for reports.

TYPE: str

value

Score value of selected option.

TYPE: float

reason

Judge's explanation for the selection.

TYPE: str

weight

Judge's voting weight (default 1.0).

TYPE: float

na

True if selected option is NA.

TYPE: bool
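
Because each vote keeps its judge_id and selected_index, inter-judge agreement can be computed directly from the votes. A sketch using the dataclass as documented above; the autorubric.types import path is an assumption:

from itertools import combinations
from autorubric.types import MultiChoiceJudgeVote  # assumed import location

votes = [
    MultiChoiceJudgeVote(judge_id="gpt-4", selected_index=2, selected_label="Satisfied",
                         value=0.67, reason="Mostly helpful"),
    MultiChoiceJudgeVote(judge_id="claude-sonnet", selected_index=2, selected_label="Satisfied",
                         value=0.67, reason="Clear and relevant"),
    MultiChoiceJudgeVote(judge_id="judge-3", selected_index=3, selected_label="Very satisfied",
                         value=1.0, reason="Exceeds expectations"),
]

# Pairwise exact-agreement rate on the selected option index
pairs = list(combinations(votes, 2))
agreement = sum(a.selected_index == b.selected_index for a, b in pairs) / len(pairs)
print(round(agreement, 2))  # 0.33 -- only the first two judges agree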


OrdinalAggregation

Aggregation strategy for ordinal multi-choice criteria in ensemble.

OrdinalAggregation module-attribute

OrdinalAggregation = Literal['mean', 'median', 'weighted_mean', 'mode']

Aggregation strategy for ordinal multi-choice criteria.

  • mean: Average of score values across judges
  • median: Median of score values
  • weighted_mean: Weighted average by judge weight
  • mode: Most common selection
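
A compact sketch of how these strategies could combine per-judge values and weights; the semantics are assumed for illustration, not taken from the library's internals. Mean, median, and weighted_mean results would then be snapped to the nearest option, as described for AggregatedMultiChoiceVerdict above:

from statistics import median, mode

# (selected option value, judge weight) per judge
votes = [(0.33, 1.0), (0.67, 1.0), (0.67, 2.0)]
values = [v for v, _ in votes]

mean_value = sum(values) / len(values)                                   # "mean"          -> 0.557
median_value = median(values)                                            # "median"        -> 0.67
weighted_mean = sum(v * w for v, w in votes) / sum(w for _, w in votes)  # "weighted_mean" -> 0.585
mode_value = mode(values)                                                # "mode"          -> 0.67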

NominalAggregation

Aggregation strategy for nominal multi-choice criteria in ensemble.

NominalAggregation module-attribute

NominalAggregation = Literal['mode', 'weighted_mode', 'unanimous']

Aggregation strategy for nominal multi-choice criteria.

  • mode: Most common selection (majority vote)
  • weighted_mode: Weight votes by judge weight
  • unanimous: All judges must agree
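
A similar sketch for nominal criteria, where aggregation works on selected labels rather than numeric values; again an assumed illustration rather than the library's internal code:

from collections import Counter

# (selected label, judge weight) per judge
votes = [("Just right", 1.0), ("Just right", 1.0), ("Too many interactions", 3.0)]
labels = [label for label, _ in votes]

mode_label = Counter(labels).most_common(1)[0][0]      # "mode"          -> "Just right"

weighted = Counter()
for label, weight in votes:
    weighted[label] += weight
weighted_mode = weighted.most_common(1)[0][0]          # "weighted_mode" -> "Too many interactions"

unanimous = len(set(labels)) == 1                      # "unanimous"     -> False (judges disagree)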

References

Kim, S., Suk, J., Longpre, S., Lin, B. Y., Shin, J., Welleck, S., Neubig, G., Lee, M., Lee, K., and Seo, M. (2024). Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4334–4353.

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.