Multi-Choice Rubrics for Nuanced Assessment¶

Go beyond binary MET/UNMET with ordinal and nominal scales for granular evaluation.

The Scenario¶

You're evaluating restaurant reviews for a review platform. Simple "good/bad" doesn't capture the nuance—you need Likert scales for quality dimensions and categorical ratings for specific aspects. Some reviews might be excellent on detail but neutral on helpfulness.

What You'll Learn¶

Creating multi-choice criteria with CriterionOption
Ordinal vs nominal scales with scale_type
Handling NA options for inapplicable criteria
Position bias mitigation with shuffle_options
Interpreting MultiChoiceVerdict results

The Solution¶

flowchart LR
    C[Criterion with Options] --> J[Judge selects option]
    J --> V["option.value (0-1)"]
    V --> W[Weighted by criterion weight]
    W --> A[Aggregate into final score]

    subgraph "Binary (MET/UNMET)"
        B1[Criterion] --> B2{MET or UNMET?}
        B2 -->|MET| B3["1.0"]
        B2 -->|UNMET| B4["0.0"]
    end

    subgraph "Multi-Choice"
        M1[Criterion] --> M2{Select from N options}
        M2 --> M3["option.value (any 0-1)"]
    end

Step 1: Define Ordinal (Likert) Criteria¶

For criteria with ordered options (1-5 scales, agreement levels):

from autorubric import Rubric, Criterion, CriterionOption

# Ordinal criterion: options have inherent order
detail_criterion = Criterion(
    name="detail_level",
    weight=10.0,
    requirement="How detailed and informative is this review?",
    scale_type="ordinal",
    options=[
        CriterionOption(label="Very Poor - No useful details", value=0.0),
        CriterionOption(label="Poor - Minimal details", value=0.25),
        CriterionOption(label="Average - Some useful details", value=0.5),
        CriterionOption(label="Good - Detailed and helpful", value=0.75),
        CriterionOption(label="Excellent - Comprehensive with specifics", value=1.0),
    ]
)

Value Assignment

Assign each option's value explicitly (0.0 to 1.0) rather than deriving it from list position. This keeps scoring intentional and independent of the order in which options are declared or presented.
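If you want evenly spaced values for an n-point scale without typing them by hand, a small helper can generate them while still keeping every value explicit. This is a minimal sketch: `evenly_spaced_values` is a hypothetical helper written for illustration, not part of autorubric.

```python
def evenly_spaced_values(n_options: int) -> list[float]:
    """Return n explicit values evenly spaced from 0.0 to 1.0 (hypothetical helper)."""
    if n_options < 2:
        raise ValueError("a scale needs at least two options")
    return [i / (n_options - 1) for i in range(n_options)]


labels = ["Very Poor", "Poor", "Average", "Good", "Excellent"]

# Pair each label with its value so the mapping is visible and reviewable.
scale = list(zip(labels, evenly_spaced_values(len(labels))))
print(scale)
```

Each `(label, value)` pair can then be passed to `CriterionOption(label=..., value=...)` as in Step 1. A non-uniform scale is simply a different list of numbers.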

Step 2: Define Nominal (Categorical) Criteria¶

For criteria with unordered categories:

# Nominal criterion: categories without inherent order
tone_criterion = Criterion(
    name="review_tone",
    weight=5.0,
    requirement="What is the overall tone of this review?",
    scale_type="nominal",
    options=[
        CriterionOption(label="Positive", value=1.0),
        CriterionOption(label="Neutral", value=0.5),
        CriterionOption(label="Negative", value=0.0),
        CriterionOption(label="Mixed", value=0.5),
    ]
)

Step 3: Add NA Options for Inapplicable Cases¶

Some reviews may not cover certain aspects:

# Criterion with NA option
service_criterion = Criterion(
    name="service_rating",
    weight=8.0,
    requirement="How does the reviewer rate the service quality?",
    scale_type="ordinal",
    options=[
        CriterionOption(label="Very Poor service", value=0.0),
        CriterionOption(label="Poor service", value=0.25),
        CriterionOption(label="Average service", value=0.5),
        CriterionOption(label="Good service", value=0.75),
        CriterionOption(label="Excellent service", value=1.0),
        CriterionOption(label="N/A - Service not mentioned", value=0.0, na=True),
    ]
)

NA Handling

Options with na=True are treated like CANNOT_ASSESS for binary criteria. They're excluded from scoring when the SKIP strategy is used (default).
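To see why skipping matters, the sketch below compares a score that drops NA criteria with one that counts them as zero. It assumes the score is the weighted sum of option values divided by the total positive weight, clamped to [0, 1]. That formula is an assumption for illustration, and `Graded` and `skip_na_score` are hypothetical names, not autorubric APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Graded:
    """One graded criterion: weight, selected option value, and NA flag (hypothetical)."""
    weight: float
    value: float
    na: bool = False


def skip_na_score(graded: list[Graded]) -> Optional[float]:
    """Weighted score that removes NA criteria from numerator and denominator."""
    kept = [g for g in graded if not g.na]
    total_weight = sum(g.weight for g in kept if g.weight > 0)
    if total_weight == 0:
        return None  # nothing left to score
    raw = sum(g.weight * g.value for g in kept) / total_weight
    return max(0.0, min(1.0, raw))


graded = [
    Graded(weight=10.0, value=0.75),          # detail_level: "Good"
    Graded(weight=8.0, value=0.0, na=True),   # service_rating: "N/A - Service not mentioned"
]

print(skip_na_score(graded))                  # NA criterion excluded
# For contrast: treating the NA selection as an ordinary 0.0 lowers the score.
print(skip_na_score([Graded(g.weight, g.value) for g in graded]))
```

Under this assumption, a review that never mentions service is not penalized for it when the NA option is skipped.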

Step 4: Build the Complete Rubric¶

Combine multi-choice and binary criteria:

rubric = Rubric([
    # Multi-choice ordinal
    Criterion(
        name="detail_level",
        weight=10.0,
        requirement="How detailed and informative is this review?",
        scale_type="ordinal",
        options=[
            CriterionOption(label="Very Poor", value=0.0),
            CriterionOption(label="Poor", value=0.25),
            CriterionOption(label="Average", value=0.5),
            CriterionOption(label="Good", value=0.75),
            CriterionOption(label="Excellent", value=1.0),
        ]
    ),
    # Multi-choice nominal
    Criterion(
        name="review_tone",
        weight=5.0,
        requirement="What is the overall tone of this review?",
        scale_type="nominal",
        options=[
            CriterionOption(label="Positive", value=1.0),
            CriterionOption(label="Neutral", value=0.5),
            CriterionOption(label="Negative", value=0.0),
            CriterionOption(label="Mixed", value=0.5),
        ]
    ),
    # Multi-choice with NA
    Criterion(
        name="food_quality",
        weight=12.0,
        requirement="How does the reviewer rate the food quality?",
        scale_type="ordinal",
        options=[
            CriterionOption(label="Terrible", value=0.0),
            CriterionOption(label="Below Average", value=0.25),
            CriterionOption(label="Average", value=0.5),
            CriterionOption(label="Above Average", value=0.75),
            CriterionOption(label="Outstanding", value=1.0),
            CriterionOption(label="Not discussed", value=0.0, na=True),
        ]
    ),
    # Binary criterion (still supported)
    Criterion(
        name="spam_content",
        weight=-10.0,
        requirement="Review contains spam or promotional content"
    ),
])

Step 5: Configure the Grader¶

Enable option shuffling to mitigate position bias:

from autorubric import LLMConfig
from autorubric.graders import CriterionGrader

grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
    shuffle_options=True,  # Randomize option order (default)
)

Position Bias

LLM judges can favor options based on where they appear in the list, often the earlier ones. shuffle_options=True (default) randomizes the presentation order and maps each response back to the original option index, mitigating this bias.
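The shuffle-and-map-back idea can be sketched in a few lines of plain Python. This illustrates the technique only; the function names are hypothetical and the library's internal implementation may differ.

```python
import random


def shuffle_with_mapping(labels, rng):
    """Return the labels in random order plus the original index of each presented slot."""
    order = list(range(len(labels)))
    rng.shuffle(order)
    presented = [labels[i] for i in order]
    return presented, order


def to_original_index(presented_index, order):
    """Map the judge's pick (an index into the shuffled list) back to the original index."""
    return order[presented_index]


labels = ["Very Poor", "Poor", "Average", "Good", "Excellent"]
presented, order = shuffle_with_mapping(labels, random.Random(7))

# Suppose the judge picks the second option it was shown.
picked = 1
original = to_original_index(picked, order)
print(presented[picked], "->", labels[original])
```

Because the verdict is recorded against the original index, downstream scoring and agreement metrics are unaffected by the presentation order.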

Step 6: Grade and Interpret Results¶

import asyncio

review = """
Last night's dinner at Bistro Luna was exceptional. The pan-seared salmon
was perfectly cooked with a crispy skin and buttery interior. The accompanying
risotto was creamy without being heavy. Service was attentive but not intrusive -
our server knew the wine list well and made excellent pairing suggestions.
The ambiance was romantic with soft lighting and well-spaced tables.

Only minor issue: the dessert menu was limited. We opted for tiramisu which
was good but not memorable. Still, at $85 per person including wine, it's
excellent value for the quality.
"""

async def main():
    result = await rubric.grade(
        to_grade=review,
        grader=grader,
        query="Evaluate this restaurant review."
    )
    return result

result = asyncio.run(main())

# Print results. result.score is `float | None` (None if the grade failed).
print(f"Overall Score: {result.score:.2f}\n" if result.score is not None else "Overall Score: n/a\n")

for cr in result.report:
    name = cr.criterion.name

    if cr.final_verdict is not None:
        # Binary criterion
        print(f"[{cr.final_verdict.value}] {name}")
    else:
        # Multi-choice criterion
        mc = cr.final_multi_choice_verdict
        # selected_label is str | None (None on a no-option-selected abstain).
        label = mc.selected_label if mc is not None and mc.selected_label is not None else "N/A"
        print(f"[{label}] {name}")
        if mc is not None:
            print(f"  Value: {mc.value:.2f}")
            if mc.na:
                print("  (Not applicable)")

    print(f"  Reason: {cr.final_reason}\n")

Sample output:

Overall Score: 0.89

[Excellent] detail_level
  Value: 1.00
  Reason: Review provides specific details about dishes, service, ambiance, and pricing.

[Positive] review_tone
  Value: 1.00
  Reason: Overall positive with only minor criticism of dessert menu.

[Above Average] food_quality
  Value: 0.75
  Reason: Reviewer praises the salmon and risotto highly, mild criticism of dessert.

[UNMET] spam_content
  Reason: No promotional or spam content detected.
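
The overall score in the sample output can be checked by hand. This assumes the score is the weighted sum of selected option values divided by the total positive weight, with an UNMET negative-weight criterion contributing nothing. That formula is consistent with the numbers shown, but treat it as an assumption rather than a statement of the library's exact implementation:

```latex
\text{score}
  = \frac{10 \cdot 1.00 + 5 \cdot 1.00 + 12 \cdot 0.75 + (-10) \cdot 0}{10 + 5 + 12}
  = \frac{24}{27}
  \approx 0.89
```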

Step 7: Ensemble Aggregation for Multi-Choice¶

With ensembles, multi-choice votes are aggregated differently based on scale type:

grader = CriterionGrader(
    judges=[...],
    ordinal_aggregation="mean",  # Mean of values, snap to nearest option
    nominal_aggregation="mode",  # Most common selection
)

Ordinal aggregation strategies:

mean: Average values, snap to nearest option
median: Median value, snap to nearest
weighted_mean: Weighted by judge weights
mode: Most common selection
min: Lowest-value option any judge selected (conservative; ordinal analog of binary unanimous)
max: Highest-value option any judge selected (permissive; ordinal analog of binary any)
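
A minimal sketch of the "aggregate, then snap to the nearest option" idea for a few of these strategies follows. The function names and the tie-breaking rule (lowest index wins) are assumptions for illustration, not the library's documented behavior.

```python
from statistics import mean, median


def snap_to_nearest(target, option_values):
    """Index of the option whose value is closest to target (ties go to the lower index)."""
    return min(range(len(option_values)), key=lambda i: abs(option_values[i] - target))


def aggregate_ordinal(votes, option_values, strategy="mean"):
    """Combine judges' selected option indices into one option index."""
    values = [option_values[v] for v in votes]
    if strategy == "mean":
        target = mean(values)
    elif strategy == "median":
        target = median(values)
    elif strategy == "min":
        target = min(values)
    elif strategy == "max":
        target = max(values)
    else:
        raise ValueError(f"unsupported strategy in this sketch: {strategy}")
    return snap_to_nearest(target, option_values)


option_values = [0.0, 0.25, 0.5, 0.75, 1.0]
votes = [1, 4, 4]  # one judge says "Poor", two say "Excellent"

for strategy in ("mean", "median", "min", "max"):
    print(strategy, aggregate_ordinal(votes, option_values, strategy))
```

With these votes the mean lands on a middle option that no judge selected, while the median follows the majority. That difference is the main thing to weigh when choosing between the two.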

Nominal aggregation strategies:

mode: Most common selection (majority)
weighted_mode: Weighted by judge weights
unanimous: All must select the same option; on disagreement, abstain via the NA option (verdict na=True) — or fall back to mode and warn if the criterion has no NA option
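
The nominal strategies can be sketched the same way. This is an illustrative implementation based on the descriptions above. The function name and the `na_label` parameter are hypothetical, and the real library also emits a warning on the mode fallback, which this sketch omits.

```python
from collections import Counter


def aggregate_nominal(votes, strategy="mode", weights=None, na_label=None):
    """Combine judges' selected labels into one label."""
    if strategy == "mode":
        return Counter(votes).most_common(1)[0][0]
    if strategy == "weighted_mode":
        weights = weights or [1.0] * len(votes)
        totals = {}
        for vote, weight in zip(votes, weights):
            totals[vote] = totals.get(vote, 0.0) + weight
        return max(totals, key=totals.get)
    if strategy == "unanimous":
        if len(set(votes)) == 1:
            return votes[0]
        if na_label is not None:
            return na_label  # abstain via the NA option on disagreement
        return Counter(votes).most_common(1)[0][0]  # no NA option: fall back to mode
    raise ValueError(f"unsupported strategy in this sketch: {strategy}")


votes = ["Positive", "Mixed", "Positive"]

print(aggregate_nominal(votes, "mode"))
print(aggregate_nominal(votes, "weighted_mode", weights=[1.0, 3.0, 1.0]))
print(aggregate_nominal(votes, "unanimous", na_label="N/A"))
print(aggregate_nominal(votes, "unanimous"))
```

Note that no strategy here averages values: for unordered categories, only counts (optionally weighted) are meaningful.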

Property	Ordinal	Nominal
Option ordering	Matters - options have inherent rank	Does not matter - categories are unordered
Agreement metric	Weighted kappa (penalizes distant disagreements more)	Unweighted kappa (all disagreements equal)
Ensemble aggregation	Mean, weighted mean, median, mode, min, or max of option values	Mode or weighted mode of selections; `unanimous` abstains on disagreement
When to use	Quality ratings, Likert scales, satisfaction levels	Sentiment type, content category, tone classification
Example	"Very Poor / Poor / Average / Good / Excellent"	"Positive / Neutral / Negative / Mixed"

Consensus posture across criterion types¶

The three aggregation knobs (aggregation for binary, ordinal_aggregation, nominal_aggregation) are independent — setting binary aggregation does not change how multi-choice criteria aggregate. Conceptually they share a central / conservative / permissive axis (binary unanimous ≡ taking the min over the {0,1} option values, and binary any ≡ the max):

Concept	Binary (`aggregation`)	Ordinal (`ordinal_aggregation`)	Nominal (`nominal_aggregation`)
Central	`majority`, `weighted`	`mean`, `median`, `weighted_mean`, `mode`	`mode`, `weighted_mode`
Conservative	`unanimous` (≡ min over {0,1})	`min` (lowest selected option)	`unanimous` (abstain via NA on disagreement)
Permissive	`any` (≡ max over {0,1})	`max` (highest selected option)	— (unordered ⇒ no permissive analog)

Step 8: Bootstrap Confidence Intervals for Multi-Choice Criteria¶

Judge Validation shows compute_metrics(bootstrap=True) for a binary rubric. The same item-level resample covers ordinal and nominal criteria too — you don't opt in separately. Every CI lives on metrics.bootstrap (a BootstrapResults), and each one generalizes across the criterion types your rubric mixes:

accuracy_ci — CI for criterion_accuracy, which for multi-choice criteria is exact-match accuracy (the selected option index equals the ground-truth index).
kappa_ci — CI for mean_kappa, the mean of the per-criterion kappas. Each criterion type contributes its own chance-corrected agreement statistic into that mean: an ordinal criterion contributes its quadratic-weighted kappa (OrdinalCriterionMetrics.weighted_kappa, which penalizes far-apart disagreements more) and a nominal criterion contributes its unweighted kappa (NominalCriterionMetrics.kappa, where every disagreement counts equally).
rmse_ci — CI for score_rmse on the per-item weighted score, which on multi-choice rubrics is driven by each option's value.

One aggregate CI per statistic, not one per criterion

The bootstrap CIs are aggregate scalars: there is a single kappa_ci covering mean_kappa, not a separate per-criterion weighted_kappa_ci/kappa_ci. The per-criterion weighted_kappa (ordinal) and kappa (nominal) on each entry of metrics.per_criterion remain point estimates without their own interval.
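
To make the difference between the two kappa variants concrete, here is a self-contained sketch of Cohen's kappa with either unweighted or quadratic disagreement penalties. It follows the standard textbook definition and is not autorubric's code; `cohen_kappa` is a hypothetical name.

```python
from typing import Optional


def cohen_kappa(truth, pred, n_options, quadratic=False) -> Optional[float]:
    """Cohen's kappa over option indices; returns None when it is undefined."""
    n, k = len(truth), n_options
    if n == 0 or k < 2:
        return None
    # Joint distribution of (ground-truth index, predicted index).
    observed = [[0.0] * k for _ in range(k)]
    for t, p in zip(truth, pred):
        observed[t][p] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def penalty(i, j):
        if quadratic:
            return ((i - j) / (k - 1)) ** 2  # far-apart picks cost more
        return 0.0 if i == j else 1.0        # every disagreement costs the same

    d_obs = sum(penalty(i, j) * observed[i][j] for i in range(k) for j in range(k))
    d_exp = sum(penalty(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if d_exp == 0:
        return None  # everything collapsed onto one option
    return 1.0 - d_obs / d_exp


truth = [0, 1, 2, 3, 4, 4]
near_miss = [0, 1, 2, 3, 4, 3]  # last item off by one step
far_miss = [0, 1, 2, 3, 4, 0]   # last item off by four steps

for name, pred in (("near", near_miss), ("far", far_miss)):
    print(name, cohen_kappa(truth, pred, 5), cohen_kappa(truth, pred, 5, quadratic=True))
```

In this example the unweighted kappa is the same for both predictions, while the quadratic-weighted kappa is clearly higher for the near miss. That is the behavior you want for ordinal criteria and the reason nominal criteria use the unweighted form.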

import asyncio

from autorubric import RubricDataset, evaluate

# Bootstrap CIs need ground truth, so evaluate a *labeled* dataset: each item's
# ground_truth carries the reference option (index or label) for every criterion.
# compute_metrics() lives on the EvalResult that evaluate() returns, not on the
# single report a one-off rubric.grade() produces.
dataset = RubricDataset.from_file("multi_choice_labeled.json")
# evaluate() is a coroutine, so a plain script drives it with asyncio.run()
# (inside an already-running event loop, use `await evaluate(dataset, grader)` instead).
result = asyncio.run(evaluate(dataset, grader))  # `grader` from the earlier steps

metrics = result.compute_metrics(
    dataset,
    bootstrap=True,
    n_bootstrap=1000,
    confidence_level=0.95,
    seed=42,
)

# Each bootstrap CI is `tuple[float, float] | None` — None on an empty axis or when every
# resample was degenerate (e.g. a multi-choice criterion that collapsed onto one option,
# leaving kappa undefined). Always guard before subscripting.
acc_ci = metrics.bootstrap.accuracy_ci
kappa_ci = metrics.bootstrap.kappa_ci
rmse_ci = metrics.bootstrap.rmse_ci

print(
    f"Exact-match accuracy 95% CI: [{acc_ci[0]:.1%}, {acc_ci[1]:.1%}]"
    if acc_ci is not None else "Exact-match accuracy 95% CI: n/a"
)
# Ordinal contributes weighted kappa, nominal contributes unweighted kappa, into mean_kappa.
print(
    f"Mean kappa 95% CI:           [{kappa_ci[0]:.3f}, {kappa_ci[1]:.3f}]"
    if kappa_ci is not None else "Mean kappa 95% CI:           n/a"
)
print(
    f"Score RMSE 95% CI:           [{rmse_ci[0]:.4f}, {rmse_ci[1]:.4f}]"
    if rmse_ci is not None else "Score RMSE 95% CI:           n/a"
)

Sparse multi-choice data widens (or voids) the kappa CI

A multi-choice criterion needs at least two distinct ground-truth options among the resampled items for its kappa to be defined. With few labeled items per criterion, many resamples collapse onto a single option and contribute nothing — so kappa_ci may come back wider than the binary case, or None entirely. Treat a None CI as a signal to label more items rather than a metric failure.
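
The mechanics behind a `None` interval can be illustrated with a small item-level percentile bootstrap. This is a sketch under stated assumptions: the percentile method, the function names, and the stand-in statistic are all hypothetical and may not match how autorubric computes its intervals.

```python
import random


def bootstrap_ci(items, statistic, n_bootstrap=1000, confidence_level=0.95, seed=42):
    """Percentile bootstrap CI; resamples whose statistic is undefined (None) are dropped."""
    if not items:
        return None
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_bootstrap):
        sample = [items[rng.randrange(len(items))] for _ in items]
        estimate = statistic(sample)
        if estimate is not None:
            estimates.append(estimate)
    if not estimates:
        return None  # every resample was degenerate
    estimates.sort()
    alpha = 1.0 - confidence_level
    low = estimates[int((alpha / 2) * (len(estimates) - 1))]
    high = estimates[int((1 - alpha / 2) * (len(estimates) - 1))]
    return low, high


def exact_match_accuracy(sample):
    """Fraction of (truth, predicted) pairs whose option indices match."""
    return sum(t == p for t, p in sample) / len(sample)


def accuracy_if_varied(sample):
    """Stand-in for a kappa-like statistic: undefined with fewer than two truth options."""
    if len({t for t, _ in sample}) < 2:
        return None
    return exact_match_accuracy(sample)


items = [(0, 0), (1, 1), (2, 1), (3, 3), (4, 4), (4, 3)]
print(bootstrap_ci(items, exact_match_accuracy))
print(bootstrap_ci([(2, 2)] * 5, accuracy_if_varied))  # single truth option everywhere
```

When every labeled item shares one ground-truth option, each resample is degenerate and the interval is `None`, which mirrors the guard shown in the printing code above.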

Key Takeaways¶

Ordinal scales for ordered options (satisfaction, quality ratings)
Nominal scales for unordered categories (sentiment, type)
Explicit values (0.0-1.0) keep scoring intentional and independent of option order
NA options handle inapplicable criteria gracefully
shuffle_options=True mitigates LLM position bias
Multi-choice and binary can coexist in the same rubric

Going Further¶

Judge Validation - Weighted kappa for ordinal agreement
Ensemble Judging - Aggregating multi-choice votes
API Reference: Multi-Choice - Full documentation

Appendix: Complete Code¶

"""Multi-Choice Rubrics - Restaurant Review Evaluation"""

import asyncio
from autorubric import Rubric, Criterion, CriterionOption, LLMConfig
from autorubric.graders import CriterionGrader


# Sample restaurant reviews
REVIEWS = [
    {
        "text": """
Last night's dinner at Bistro Luna was exceptional. The pan-seared salmon
was perfectly cooked with a crispy skin and buttery interior. The accompanying
risotto was creamy without being heavy. Service was attentive but not intrusive -
our server knew the wine list well and made excellent pairing suggestions.
The ambiance was romantic with soft lighting and well-spaced tables.

Only minor issue: the dessert menu was limited. We opted for tiramisu which
was good but not memorable. Still, at $85 per person including wine, it's
excellent value for the quality.
""",
        "description": "Detailed positive review with minor criticism"
    },
    {
        "text": """
Meh. Food was okay I guess. Nothing special. Wouldn't go out of my way to
come back but wouldn't avoid it either.
""",
        "description": "Vague neutral review"
    },
    {
        "text": """
DO NOT EAT HERE!!! Waited 45 minutes for cold pasta. The waiter was rude
when I complained. Manager didn't care. $60 wasted. ZERO STARS.
""",
        "description": "Angry negative review"
    },
    {
        "text": """
The new tasting menu is a culinary journey worth taking. Chef Maria's
interpretation of classic French techniques with local ingredients creates
unexpected harmonies. The course progression - from delicate amuse-bouche
through robust mains to ethereal desserts - demonstrates masterful pacing.

Standouts: the deconstructed bouillabaisse, and the 36-hour braised short
rib. Wine pairings ($75 supplement) are thoughtfully curated.

Note: Vegetarian tasting menu available with advance notice.
""",
        "description": "Sophisticated positive review"
    },
    {
        "text": """
Great spot for brunch! The avocado toast was Instagram-worthy and actually
tasty. Bloody Marys are strong. Gets crowded on weekends so arrive early.
Parking is tricky - use the lot behind the building.
""",
        "description": "Casual helpful review"
    },
    {
        "text": """
I had high hopes based on reviews but was disappointed. The $40 steak was
overcooked despite ordering medium-rare. However, the appetizers (especially
the crab cakes) were excellent, and the cocktails creative. Mixed bag overall.
""",
        "description": "Mixed review with specific details"
    },
    {
        "text": """
Perfect for business dinners. Private rooms available, excellent wine list,
professional service. Food is solid upscale American - nothing risky but
reliably good. Expense account friendly.
""",
        "description": "Practical business-focused review"
    },
    {
        "text": """
GET 50% OFF YOUR FIRST ORDER WITH CODE FOODIE50! This restaurant is
AMAZING and you should definitely try their new app for exclusive deals!
Download now at...
""",
        "description": "Spam/promotional content"
    }
]


async def main():
    # Build multi-choice rubric
    rubric = Rubric([
        Criterion(
            name="detail_level",
            weight=10.0,
            requirement="How detailed and informative is this review?",
            scale_type="ordinal",
            options=[
                CriterionOption(label="Very Poor - No useful details", value=0.0),
                CriterionOption(label="Poor - Minimal details", value=0.25),
                CriterionOption(label="Average - Some useful information", value=0.5),
                CriterionOption(label="Good - Detailed and helpful", value=0.75),
                CriterionOption(label="Excellent - Comprehensive with specifics", value=1.0),
            ]
        ),
        Criterion(
            name="review_tone",
            weight=5.0,
            requirement="What is the overall tone of this review?",
            scale_type="nominal",
            options=[
                CriterionOption(label="Positive", value=1.0),
                CriterionOption(label="Neutral", value=0.5),
                CriterionOption(label="Negative", value=0.0),
                CriterionOption(label="Mixed", value=0.5),
            ]
        ),
        Criterion(
            name="food_rating",
            weight=12.0,
            requirement="How does the reviewer rate the food quality?",
            scale_type="ordinal",
            options=[
                CriterionOption(label="Terrible", value=0.0),
                CriterionOption(label="Below Average", value=0.25),
                CriterionOption(label="Average", value=0.5),
                CriterionOption(label="Above Average", value=0.75),
                CriterionOption(label="Outstanding", value=1.0),
                CriterionOption(label="Not mentioned", value=0.0, na=True),
            ]
        ),
        Criterion(
            name="service_rating",
            weight=8.0,
            requirement="How does the reviewer rate the service?",
            scale_type="ordinal",
            options=[
                CriterionOption(label="Terrible", value=0.0),
                CriterionOption(label="Below Average", value=0.25),
                CriterionOption(label="Average", value=0.5),
                CriterionOption(label="Above Average", value=0.75),
                CriterionOption(label="Outstanding", value=1.0),
                CriterionOption(label="Not mentioned", value=0.0, na=True),
            ]
        ),
        Criterion(
            name="actionable_info",
            weight=6.0,
            requirement="Does the review provide actionable information (prices, tips, recommendations)?",
            scale_type="ordinal",
            options=[
                CriterionOption(label="None", value=0.0),
                CriterionOption(label="Minimal", value=0.33),
                CriterionOption(label="Moderate", value=0.67),
                CriterionOption(label="Extensive", value=1.0),
            ]
        ),
        Criterion(
            name="spam_content",
            weight=-15.0,
            requirement="Contains spam, promotional content, or fake review indicators"
        ),
    ])

    # Configure grader
    grader = CriterionGrader(
        llm_config=LLMConfig(model="openai/gpt-4.1-mini", temperature=0.0),
        shuffle_options=True,
    )

    print("=" * 70)
    print("RESTAURANT REVIEW QUALITY ASSESSMENT")
    print("=" * 70)

    for i, review in enumerate(REVIEWS, 1):
        result = await rubric.grade(
            to_grade=review["text"],
            grader=grader,
            query="Evaluate this restaurant review for quality and helpfulness."
        )

        print(f"\n{'─' * 70}")
        print(f"Review {i}: {review['description']}")
        # result.score is `float | None` (None if the grade failed).
        print(f"Score: {result.score:.2f}" if result.score is not None else "Score: n/a")
        print(f"{'─' * 70}")

        for cr in result.report:
            if cr.final_verdict is not None:
                # Binary
                verdict_str = f"[{cr.final_verdict.value}]"
            else:
                # Multi-choice
                mc = cr.final_multi_choice_verdict
                na_marker = " (N/A)" if mc is not None and mc.na else ""
                # selected_label is str | None (None on a no-option-selected abstain).
                label = (
                    mc.selected_label if mc is not None and mc.selected_label is not None else "N/A"
                )
                verdict_str = f"[{label}]{na_marker}"

            print(f"  {verdict_str} {cr.criterion.name}")


if __name__ == "__main__":
    asyncio.run(main())