Multi-Choice Rubrics for Nuanced Assessment
Go beyond binary MET/UNMET with ordinal and nominal scales for granular evaluation.
The Scenario
You're evaluating restaurant reviews for a review platform. Simple "good/bad" doesn't capture the nuance—you need Likert scales for quality dimensions and categorical ratings for specific aspects. Some reviews might be excellent on detail but neutral on helpfulness.
What You'll Learn
- Creating multi-choice criteria with CriterionOption
- Ordinal vs nominal scales with scale_type
- Handling NA options for inapplicable criteria
- Position bias mitigation with shuffle_options
- Interpreting MultiChoiceVerdict results
The Solution
flowchart LR
C[Criterion with Options] --> J[Judge selects option]
J --> V["option.value (0-1)"]
V --> W[Weighted by criterion weight]
W --> A[Aggregate into final score]
subgraph "Binary (MET/UNMET)"
B1[Criterion] --> B2{MET or UNMET?}
B2 -->|MET| B3["1.0"]
B2 -->|UNMET| B4["0.0"]
end
subgraph "Multi-Choice"
M1[Criterion] --> M2{Select from N options}
M2 --> M3["option.value (any 0-1)"]
end
Step 1: Define Ordinal (Likert) Criteria
For criteria with ordered options (1-5 scales, agreement levels):
from autorubric import Rubric, Criterion, CriterionOption

# Ordinal criterion: options have inherent order
detail_criterion = Criterion(
    name="detail_level",
    weight=10.0,
    requirement="How detailed and informative is this review?",
    scale_type="ordinal",
    options=[
        CriterionOption(label="Very Poor - No useful details", value=0.0),
        CriterionOption(label="Poor - Minimal details", value=0.25),
        CriterionOption(label="Average - Some useful details", value=0.5),
        CriterionOption(label="Good - Detailed and helpful", value=0.75),
        CriterionOption(label="Excellent - Comprehensive with specifics", value=1.0),
    ]
)
Value Assignment
Assign each option's value explicitly (0.0 to 1.0) rather than relying on its position in the list.
Scores then stay the same when options are reordered or shuffled, and the spacing between options is a deliberate choice.
Step 2: Define Nominal (Categorical) Criteria
For criteria with unordered categories:
# Nominal criterion: categories without inherent order
tone_criterion = Criterion(
    name="review_tone",
    weight=5.0,
    requirement="What is the overall tone of this review?",
    scale_type="nominal",
    options=[
        CriterionOption(label="Positive", value=1.0),
        CriterionOption(label="Neutral", value=0.5),
        CriterionOption(label="Negative", value=0.0),
        CriterionOption(label="Mixed", value=0.5),
    ]
)
Step 3: Add NA Options for Inapplicable Cases
Some reviews may not cover certain aspects:
# Criterion with NA option
service_criterion = Criterion(
    name="service_rating",
    weight=8.0,
    requirement="How does the reviewer rate the service quality?",
    scale_type="ordinal",
    options=[
        CriterionOption(label="Very Poor service", value=0.0),
        CriterionOption(label="Poor service", value=0.25),
        CriterionOption(label="Average service", value=0.5),
        CriterionOption(label="Good service", value=0.75),
        CriterionOption(label="Excellent service", value=1.0),
        CriterionOption(label="N/A - Service not mentioned", value=0.0, na=True),
    ]
)
NA Handling
Options with na=True are treated like CANNOT_ASSESS for binary criteria.
They're excluded from scoring when the SKIP strategy is used (default).
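To make the effect of skipping concrete, here is a minimal sketch in plain Python. It assumes the overall score is a weight-normalized average of option values and that SKIP drops an NA criterion from both the numerator and the denominator. Both are assumptions for illustration rather than autorubric's confirmed formula, and Selection and score_skipping_na are hypothetical names that are not part of the library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Selection:
    """One graded criterion: its weight and the option the judge selected."""
    name: str
    weight: float
    value: float
    na: bool = False


def score_skipping_na(selections: list) -> Optional[float]:
    """Weighted average of option values that ignores NA selections.

    ASSUMPTION: an NA criterion is removed from the numerator and the
    denominator, so it neither raises nor lowers the score.
    """
    kept = [s for s in selections if not s.na]
    total_weight = sum(s.weight for s in kept)
    if total_weight == 0:
        return None  # nothing assessable
    return sum(s.weight * s.value for s in kept) / total_weight


selections = [
    Selection("detail_level", weight=10.0, value=0.75),
    # The review never mentions service, so the judge picks the NA option.
    Selection("service_rating", weight=8.0, value=0.0, na=True),
]

skip_score = score_skipping_na(selections)
# For contrast: what happens if the NA option's value=0.0 is counted.
naive_score = sum(s.weight * s.value for s in selections) / sum(
    s.weight for s in selections
)
print(f"skip NA: {skip_score:.2f}, count NA as 0.0: {naive_score:.2f}")
```

Under these assumptions the placeholder value=0.0 on an NA option never reaches the score, which is why it is safe to give NA options any value.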
Step 4: Build the Complete Rubric
Combine multi-choice and binary criteria:
rubric = Rubric([
    # Multi-choice ordinal
    Criterion(
        name="detail_level",
        weight=10.0,
        requirement="How detailed and informative is this review?",
        scale_type="ordinal",
        options=[
            CriterionOption(label="Very Poor", value=0.0),
            CriterionOption(label="Poor", value=0.25),
            CriterionOption(label="Average", value=0.5),
            CriterionOption(label="Good", value=0.75),
            CriterionOption(label="Excellent", value=1.0),
        ]
    ),
    # Multi-choice nominal
    Criterion(
        name="review_tone",
        weight=5.0,
        requirement="What is the overall tone of this review?",
        scale_type="nominal",
        options=[
            CriterionOption(label="Positive", value=1.0),
            CriterionOption(label="Neutral", value=0.5),
            CriterionOption(label="Negative", value=0.0),
            CriterionOption(label="Mixed", value=0.5),
        ]
    ),
    # Multi-choice with NA
    Criterion(
        name="food_quality",
        weight=12.0,
        requirement="How does the reviewer rate the food quality?",
        scale_type="ordinal",
        options=[
            CriterionOption(label="Terrible", value=0.0),
            CriterionOption(label="Below Average", value=0.25),
            CriterionOption(label="Average", value=0.5),
            CriterionOption(label="Above Average", value=0.75),
            CriterionOption(label="Outstanding", value=1.0),
            CriterionOption(label="Not discussed", value=0.0, na=True),
        ]
    ),
    # Binary criterion (still supported)
    Criterion(
        name="spam_content",
        weight=-10.0,
        requirement="Review contains spam or promotional content"
    ),
])
Step 5: Configure the Grader
Enable option shuffling to mitigate position bias:
from autorubric import LLMConfig
from autorubric.graders import CriterionGrader

grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
    shuffle_options=True,  # Randomize option order (default)
)
Position Bias
LLMs tend to favor options presented earlier in the list.
shuffle_options=True (default) randomizes presentation order and maps
responses back to original indices, mitigating this bias.
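The shuffle-and-map-back idea can be sketched in a few lines of plain Python. This is an illustration of the technique only, not autorubric's internal code: shuffle_for_prompt and to_original_index are hypothetical helpers, and the judge is simulated by looking up a label.

```python
import random


def shuffle_for_prompt(labels, rng):
    """Return (shown_labels, order), where order[k] is the original index
    of the option displayed at position k."""
    order = list(range(len(labels)))
    rng.shuffle(order)
    return [labels[i] for i in order], order


def to_original_index(shown_position, order):
    """Translate the judge's pick (a position in the shuffled list)
    back to an index into the criterion's original options."""
    return order[shown_position]


labels = ["Very Poor", "Poor", "Average", "Good", "Excellent"]
rng = random.Random(0)  # seeded only to make the sketch repeatable

shown, order = shuffle_for_prompt(labels, rng)

# Simulated judge: it sees `shown` and picks whichever position holds "Good".
picked_position = shown.index("Good")
original_index = to_original_index(picked_position, order)

print(f"shown order: {shown}")
print(f"judge picked position {picked_position} -> original option "
      f"{original_index} ({labels[original_index]})")
```

Because the value is looked up from the original index, the score attached to a selection does not depend on where the option happened to appear in the prompt.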
Step 6: Grade and Interpret Results
import asyncio
review = """
Last night's dinner at Bistro Luna was exceptional. The pan-seared salmon
was perfectly cooked with a crispy skin and buttery interior. The accompanying
risotto was creamy without being heavy. Service was attentive but not intrusive -
our server knew the wine list well and made excellent pairing suggestions.
The ambiance was romantic with soft lighting and well-spaced tables.
Only minor issue: the dessert menu was limited. We opted for tiramisu which
was good but not memorable. Still, at $85 per person including wine, it's
excellent value for the quality.
"""
async def main():
    result = await rubric.grade(
        to_grade=review,
        grader=grader,
        query="Evaluate this restaurant review."
    )
    return result

result = asyncio.run(main())

# Print results. result.score is `float | None` (None if the grade failed).
print(f"Overall Score: {result.score:.2f}\n" if result.score is not None else "Overall Score: n/a\n")
for cr in result.report:
    name = cr.criterion.name
    if cr.final_verdict is not None:
        # Binary criterion
        print(f"[{cr.final_verdict.value}] {name}")
    else:
        # Multi-choice criterion
        mc = cr.final_multi_choice_verdict
        # selected_label is str | None (None on a no-option-selected abstain).
        label = mc.selected_label if mc is not None and mc.selected_label is not None else "N/A"
        print(f"[{label}] {name}")
        if mc is not None:
            print(f" Value: {mc.value:.2f}")
            if mc.na:
                print(" (Not applicable)")
    print(f" Reason: {cr.final_reason}\n")
Sample output:
Overall Score: 0.89
[Excellent] detail_level
Value: 1.00
Reason: Review provides specific details about dishes, service, ambiance, and pricing.
[Positive] review_tone
Value: 1.00
Reason: Overall positive with only minor criticism of dessert menu.
[Above Average] food_quality
Value: 0.75
Reason: Reviewer praises the salmon and risotto highly, mild criticism of dessert.
[UNMET] spam_content
Reason: No promotional or spam content detected.
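The 0.89 in the sample output can be reproduced by hand under one plausible scoring formula: sum weight × value over all criteria, divide by the total positive weight, and clamp to [0, 1]. That formula is an assumption that happens to match the sample, not a statement of autorubric's exact rule, and weighted_score is a hypothetical helper.

```python
def weighted_score(results):
    """results: (weight, value) pairs with value in [0, 1].

    ASSUMPTION: positive weights are normalized by their sum; a
    negative-weight criterion subtracts |weight| * value when it is MET
    (value 1.0) and contributes nothing when UNMET (value 0.0).
    """
    positive_total = sum(w for w, _ in results if w > 0)
    if positive_total == 0:
        return None
    raw = sum(w * v for w, v in results)
    return max(0.0, min(1.0, raw / positive_total))


sample = [
    (10.0, 1.0),    # detail_level: Excellent
    (5.0, 1.0),     # review_tone: Positive
    (12.0, 0.75),   # food_quality: Above Average
    (-10.0, 0.0),   # spam_content: UNMET, so no penalty
]

score = weighted_score(sample)
print(f"Overall Score: {score:.2f}")  # prints "Overall Score: 0.89"
```

The arithmetic is (10 + 5 + 9) / 27 ≈ 0.889. Had the spam criterion been MET, the same formula would give (24 - 10) / 27 ≈ 0.52.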
Step 7: Ensemble Aggregation for Multi-Choice
With ensembles, multi-choice votes are aggregated differently based on scale type:
grader = CriterionGrader(
    judges=[...],
    ordinal_aggregation="mean",  # Mean of values, snap to nearest option
    nominal_aggregation="mode",  # Most common selection
)
Ordinal aggregation strategies:
- mean: Average values, snap to nearest option
- median: Median value, snap to nearest
- weighted_mean: Weighted by judge weights
- mode: Most common selection
- min: Lowest-value option any judge selected (conservative; ordinal analog of binary unanimous)
- max: Highest-value option any judge selected (permissive; ordinal analog of binary any)
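The ordinal strategies listed above can be sketched in plain Python to show what "snap to nearest option" means. This is an illustrative reimplementation, not the library's code: aggregate_ordinal and snap are hypothetical helpers, and the tie-breaking (lower index wins on an exact tie) is an assumption.

```python
from collections import Counter
from statistics import mean, median

OPTION_VALUES = [0.0, 0.25, 0.5, 0.75, 1.0]  # the five-point scale from Step 1


def snap(target, option_values):
    """Index of the option whose value is closest to target.
    ASSUMPTION: on an exact tie the lower index wins."""
    return min(range(len(option_values)),
               key=lambda i: abs(option_values[i] - target))


def aggregate_ordinal(votes, option_values, strategy="mean", weights=None):
    """votes: the option index each judge selected. Returns one option index."""
    values = [option_values[i] for i in votes]
    if strategy == "mean":
        return snap(mean(values), option_values)
    if strategy == "median":
        return snap(median(values), option_values)
    if strategy == "weighted_mean":
        weights = weights or [1.0] * len(votes)
        total = sum(w * v for w, v in zip(weights, values))
        return snap(total / sum(weights), option_values)
    if strategy == "mode":
        return Counter(votes).most_common(1)[0][0]
    if strategy == "min":
        return min(votes, key=lambda i: option_values[i])
    if strategy == "max":
        return max(votes, key=lambda i: option_values[i])
    raise ValueError(f"unknown strategy: {strategy}")


votes = [4, 4, 1]  # two judges say Excellent, one says Poor
for name in ("mean", "median", "mode", "min", "max"):
    print(name, "->", aggregate_ordinal(votes, OPTION_VALUES, name))
```

With these votes the mean of the values is 0.75, which snaps to the "Good" option even though no judge selected it. That behavior is specific to mean-style strategies, whereas mode, min, and max always return an option some judge actually chose.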
Nominal aggregation strategies:
- mode: Most common selection (majority)
- weighted_mode: Weighted by judge weights
- unanimous: All must select the same option; on disagreement, abstain via the NA option (verdict na=True), or fall back to mode and warn if the criterion has no NA option
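The nominal strategies listed above reduce to counting votes. The sketch below is a plain-Python illustration under stated assumptions: aggregate_nominal is a hypothetical helper, na_index stands for the position of the criterion's NA option (None when it has none), and ties are broken by first occurrence.

```python
import warnings
from collections import Counter


def aggregate_nominal(votes, strategy="mode", weights=None, na_index=None):
    """votes: the option index each judge selected. Returns one option index."""
    if strategy == "mode":
        return Counter(votes).most_common(1)[0][0]
    if strategy == "weighted_mode":
        weights = weights or [1.0] * len(votes)
        totals = Counter()
        for vote, weight in zip(votes, weights):
            totals[vote] += weight
        return totals.most_common(1)[0][0]
    if strategy == "unanimous":
        if len(set(votes)) == 1:
            return votes[0]
        if na_index is not None:
            return na_index  # abstain via the NA option
        warnings.warn("no NA option to abstain with; falling back to mode")
        return Counter(votes).most_common(1)[0][0]
    raise ValueError(f"unknown strategy: {strategy}")


# Options: 0=Positive, 1=Neutral, 2=Negative, 3=Mixed, 4=hypothetical NA option
votes = [0, 0, 3]
print(aggregate_nominal(votes, "mode"))                               # 0
print(aggregate_nominal(votes, "weighted_mode", weights=[1, 1, 3]))  # 3
print(aggregate_nominal(votes, "unanimous", na_index=4))              # 4
```

Note that no strategy here averages option values: for unordered categories an average such as "halfway between Positive and Mixed" has no meaning, which is why nominal aggregation works on selections rather than values.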
| Property | Ordinal | Nominal |
|---|---|---|
| Option ordering | Matters - options have inherent rank | Does not matter - categories are unordered |
| Agreement metric | Weighted kappa (penalizes distant disagreements more) | Unweighted kappa (all disagreements equal) |
| Ensemble aggregation | Mean, median, mode, min, or max of option values | Mode or weighted mode of selections; unanimous abstains on disagreement |
| When to use | Quality ratings, Likert scales, satisfaction levels | Sentiment type, content category, tone classification |
| Example | "Very Poor / Poor / Average / Good / Excellent" | "Positive / Neutral / Negative / Mixed" |
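The agreement-metric row is easier to see with numbers. Below is a minimal sketch of Cohen's kappa with an optional quadratic penalty, written from the textbook definition (1 minus observed disagreement over chance-expected disagreement). cohen_kappa is a hypothetical helper for illustration, not the function autorubric uses.

```python
def cohen_kappa(judge, truth, n_options, quadratic=False):
    """Chance-corrected agreement between two lists of option indices.

    quadratic=False: every disagreement costs 1 (nominal criteria).
    quadratic=True:  cost grows with squared distance (ordinal criteria).
    Returns None when expected disagreement is zero (kappa undefined).
    """
    n = len(judge)
    observed = [[0.0] * n_options for _ in range(n_options)]
    for j, t in zip(judge, truth):
        observed[j][t] += 1.0 / n
    judge_marginal = [sum(row) for row in observed]
    truth_marginal = [sum(observed[i][k] for i in range(n_options))
                      for k in range(n_options)]

    def penalty(i, k):
        if quadratic:
            return ((i - k) / (n_options - 1)) ** 2
        return 0.0 if i == k else 1.0

    pairs = [(i, k) for i in range(n_options) for k in range(n_options)]
    observed_cost = sum(penalty(i, k) * observed[i][k] for i, k in pairs)
    expected_cost = sum(penalty(i, k) * judge_marginal[i] * truth_marginal[k]
                        for i, k in pairs)
    if expected_cost == 0:
        return None
    return 1.0 - observed_cost / expected_cost


truth = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
near_miss = [1, 1, 2, 3, 4, 0, 1, 2, 3, 4]  # one item off by one step
far_miss = [4, 1, 2, 3, 4, 0, 1, 2, 3, 4]   # one item off by four steps

for name, judge in (("near", near_miss), ("far", far_miss)):
    plain = cohen_kappa(judge, truth, 5)
    weighted = cohen_kappa(judge, truth, 5, quadratic=True)
    print(f"{name}: unweighted={plain:.3f} weighted={weighted:.3f}")
```

Both judges make exactly one mistake, so the unweighted kappa is the same for each, while the quadratic-weighted kappa drops much further for the four-step miss.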
Consensus posture across criterion types
The three aggregation knobs (aggregation for binary, ordinal_aggregation,
nominal_aggregation) are independent — setting binary aggregation does not change
how multi-choice criteria aggregate. Conceptually they share a central / conservative /
permissive axis (binary unanimous ≡ taking the min over the {0,1} option values, and
binary any ≡ the max):
| Concept | Binary (aggregation) | Ordinal (ordinal_aggregation) | Nominal (nominal_aggregation) |
|---|---|---|---|
| Central | majority, weighted | mean, median, weighted_mean, mode | mode, weighted_mode |
| Conservative | unanimous (≡ min over {0,1}) | min (lowest selected option) | unanimous (abstain via NA on disagreement) |
| Permissive | any (≡ max over {0,1}) | max (highest selected option) | none (unordered ⇒ no permissive analog) |
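The equivalences in the binary column can be checked exhaustively. The sketch below maps MET to 1.0 and UNMET to 0.0 and confirms, for every combination of three judges, that unanimous matches min and any matches max. The helper names are hypothetical and used only for this check.

```python
from itertools import product


def unanimous(votes):
    """MET only if every judge voted MET."""
    return all(votes)


def any_met(votes):
    """MET if at least one judge voted MET."""
    return any(votes)


checked = 0
for votes in product([False, True], repeat=3):  # all 8 three-judge outcomes
    values = [1.0 if met else 0.0 for met in votes]
    assert unanimous(votes) == (min(values) == 1.0)
    assert any_met(votes) == (max(values) == 1.0)
    checked += 1

print(f"verified {checked} vote combinations")
```

This is why ordinal min and max are natural generalizations: they apply the same rule to option values other than 0 and 1.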
Step 8: Bootstrap Confidence Intervals for Multi-Choice Criteria
Judge Validation shows compute_metrics(bootstrap=True) for a binary
rubric. The same item-level resample covers ordinal and nominal criteria too — you don't
opt in separately. Every CI lives on metrics.bootstrap (a BootstrapResults), and each one
generalizes across the criterion types your rubric mixes:
- accuracy_ci: CI for criterion_accuracy, which for multi-choice criteria is exact-match accuracy (the selected option index equals the ground-truth index).
- kappa_ci: CI for mean_kappa, the mean of the per-criterion kappas. Each criterion type contributes its own chance-corrected agreement statistic into that mean: an ordinal criterion contributes its quadratic-weighted kappa (OrdinalCriterionMetrics.weighted_kappa, which penalizes far-apart disagreements more) and a nominal criterion contributes its unweighted kappa (NominalCriterionMetrics.kappa, where every disagreement counts equally).
- rmse_ci: CI for score_rmse on the per-item weighted score, which on multi-choice rubrics is driven by each option's value.
One aggregate CI per statistic, not one per criterion
The bootstrap CIs are aggregate scalars: there is a single kappa_ci covering
mean_kappa, not a separate per-criterion weighted_kappa_ci/kappa_ci. The
per-criterion weighted_kappa (ordinal) and kappa (nominal) on each entry of
metrics.per_criterion remain point estimates without their own interval.
from autorubric import RubricDataset, evaluate
# Bootstrap CIs need ground truth, so evaluate a *labeled* dataset — each item's
# ground_truth carries the reference option (index or label) for every criterion.
# compute_metrics() lives on the EvalResult that evaluate() returns, not on the
# single report a one-off rubric.grade() produces.
dataset = RubricDataset.from_file("multi_choice_labeled.json")
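# `grader` from the earlier steps. A bare `await` only works where top-level await
# is allowed (a notebook or async REPL); in a script, run this inside an async function.
result = await evaluate(dataset, grader)
metrics = result.compute_metrics(
    dataset,
    bootstrap=True,
    n_bootstrap=1000,
    confidence_level=0.95,
    seed=42,
)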
# Each bootstrap CI is `tuple[float, float] | None` — None on an empty axis or when every
# resample was degenerate (e.g. a multi-choice criterion that collapsed onto one option,
# leaving kappa undefined). Always guard before subscripting.
acc_ci = metrics.bootstrap.accuracy_ci
kappa_ci = metrics.bootstrap.kappa_ci
rmse_ci = metrics.bootstrap.rmse_ci
print(
    f"Exact-match accuracy 95% CI: [{acc_ci[0]:.1%}, {acc_ci[1]:.1%}]"
    if acc_ci is not None else "Exact-match accuracy 95% CI: n/a"
)
# Ordinal contributes weighted kappa, nominal contributes unweighted kappa, into mean_kappa.
print(
    f"Mean kappa 95% CI: [{kappa_ci[0]:.3f}, {kappa_ci[1]:.3f}]"
    if kappa_ci is not None else "Mean kappa 95% CI: n/a"
)
print(
    f"Score RMSE 95% CI: [{rmse_ci[0]:.4f}, {rmse_ci[1]:.4f}]"
    if rmse_ci is not None else "Score RMSE 95% CI: n/a"
)
Sparse multi-choice data widens (or voids) the kappa CI
A multi-choice criterion needs at least two distinct ground-truth options among the
resampled items for its kappa to be defined. With few labeled items per criterion, many
resamples collapse onto a single option and contribute nothing — so kappa_ci may come
back wider than the binary case, or None entirely. Treat a None CI as a signal to
label more items rather than a metric failure.
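To see where a None interval comes from, here is a minimal sketch of an item-level percentile bootstrap that drops undefined resamples. It is a generic illustration and not autorubric's implementation: percentile_ci is a hypothetical helper, and accuracy_if_two_truth_options is a simplified stand-in that only mimics kappa's "needs two distinct ground-truth options" rule.

```python
import random


def exact_match_accuracy(pairs):
    """pairs: (judge option index, ground-truth option index) per item."""
    return sum(j == t for j, t in pairs) / len(pairs)


def accuracy_if_two_truth_options(pairs):
    """Stand-in for a statistic that is undefined (None) when the resample
    contains fewer than two distinct ground-truth options, as kappa is."""
    if len({t for _, t in pairs}) < 2:
        return None
    return exact_match_accuracy(pairs)


def percentile_ci(items, statistic, n_bootstrap=1000,
                  confidence_level=0.95, seed=42):
    """Resample items with replacement, keep defined estimates, and return
    the (lower, upper) percentile bounds, or None if nothing was defined."""
    if not items:
        return None
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_bootstrap):
        resample = [rng.choice(items) for _ in items]
        value = statistic(resample)
        if value is not None:
            estimates.append(value)
    if not estimates:
        return None  # every resample was degenerate
    estimates.sort()
    tail = (1.0 - confidence_level) / 2.0
    last = len(estimates) - 1
    return estimates[round(tail * last)], estimates[round((1.0 - tail) * last)]


labeled = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 3), (2, 2), (1, 0), (4, 4)]
one_option_only = [(2, 2), (2, 2), (1, 2)]  # ground truth never varies

acc_ci = percentile_ci(labeled, exact_match_accuracy)
void_ci = percentile_ci(one_option_only, accuracy_if_two_truth_options)
print("accuracy CI:", acc_ci)
print("degenerate CI:", void_ci)
```

The second dataset can never produce a resample with two distinct ground-truth options, so every estimate is discarded and the interval is None, the same outcome the note above describes for sparse criteria.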
Key Takeaways
- Ordinal scales for ordered options (satisfaction, quality ratings)
- Nominal scales for unordered categories (sentiment, type)
- Explicit values (0.0-1.0) keep scoring independent of option order
- NA options handle inapplicable criteria gracefully
- shuffle_options=True mitigates LLM position bias
- Multi-choice and binary can coexist in the same rubric
Going Further
- Judge Validation - Weighted kappa for ordinal agreement
- Ensemble Judging - Aggregating multi-choice votes
- API Reference: Multi-Choice - Full documentation
Appendix: Complete Code
"""Multi-Choice Rubrics - Restaurant Review Evaluation"""
import asyncio
from autorubric import Rubric, Criterion, CriterionOption, LLMConfig
from autorubric.graders import CriterionGrader
# Sample restaurant reviews
REVIEWS = [
{
"text": """
Last night's dinner at Bistro Luna was exceptional. The pan-seared salmon
was perfectly cooked with a crispy skin and buttery interior. The accompanying
risotto was creamy without being heavy. Service was attentive but not intrusive -
our server knew the wine list well and made excellent pairing suggestions.
The ambiance was romantic with soft lighting and well-spaced tables.
Only minor issue: the dessert menu was limited. We opted for tiramisu which
was good but not memorable. Still, at $85 per person including wine, it's
excellent value for the quality.
""",
"description": "Detailed positive review with minor criticism"
},
{
"text": """
Meh. Food was okay I guess. Nothing special. Wouldn't go out of my way to
come back but wouldn't avoid it either.
""",
"description": "Vague neutral review"
},
{
"text": """
DO NOT EAT HERE!!! Waited 45 minutes for cold pasta. The waiter was rude
when I complained. Manager didn't care. $60 wasted. ZERO STARS.
""",
"description": "Angry negative review"
},
{
"text": """
The new tasting menu is a culinary journey worth taking. Chef Maria's
interpretation of classic French techniques with local ingredients creates
unexpected harmonies. The course progression - from delicate amuse-bouche
through robust mains to ethereal desserts - demonstrates masterful pacing.
Standouts: the deconstructed bouillabaisse, and the 36-hour braised short
rib. Wine pairings ($75 supplement) are thoughtfully curated.
Note: Vegetarian tasting menu available with advance notice.
""",
"description": "Sophisticated positive review"
},
{
"text": """
Great spot for brunch! The avocado toast was Instagram-worthy and actually
tasty. Bloody Marys are strong. Gets crowded on weekends so arrive early.
Parking is tricky - use the lot behind the building.
""",
"description": "Casual helpful review"
},
{
"text": """
I had high hopes based on reviews but was disappointed. The $40 steak was
overcooked despite ordering medium-rare. However, the appetizers (especially
the crab cakes) were excellent, and the cocktails creative. Mixed bag overall.
""",
"description": "Mixed review with specific details"
},
{
"text": """
Perfect for business dinners. Private rooms available, excellent wine list,
professional service. Food is solid upscale American - nothing risky but
reliably good. Expense account friendly.
""",
"description": "Practical business-focused review"
},
{
"text": """
GET 50% OFF YOUR FIRST ORDER WITH CODE FOODIE50! This restaurant is
AMAZING and you should definitely try their new app for exclusive deals!
Download now at...
""",
"description": "Spam/promotional content"
}
]
async def main():
    # Build multi-choice rubric
    rubric = Rubric([
        Criterion(
            name="detail_level",
            weight=10.0,
            requirement="How detailed and informative is this review?",
            scale_type="ordinal",
            options=[
                CriterionOption(label="Very Poor - No useful details", value=0.0),
                CriterionOption(label="Poor - Minimal details", value=0.25),
                CriterionOption(label="Average - Some useful information", value=0.5),
                CriterionOption(label="Good - Detailed and helpful", value=0.75),
                CriterionOption(label="Excellent - Comprehensive with specifics", value=1.0),
            ]
        ),
        Criterion(
            name="review_tone",
            weight=5.0,
            requirement="What is the overall tone of this review?",
            scale_type="nominal",
            options=[
                CriterionOption(label="Positive", value=1.0),
                CriterionOption(label="Neutral", value=0.5),
                CriterionOption(label="Negative", value=0.0),
                CriterionOption(label="Mixed", value=0.5),
            ]
        ),
        Criterion(
            name="food_rating",
            weight=12.0,
            requirement="How does the reviewer rate the food quality?",
            scale_type="ordinal",
            options=[
                CriterionOption(label="Terrible", value=0.0),
                CriterionOption(label="Below Average", value=0.25),
                CriterionOption(label="Average", value=0.5),
                CriterionOption(label="Above Average", value=0.75),
                CriterionOption(label="Outstanding", value=1.0),
                CriterionOption(label="Not mentioned", value=0.0, na=True),
            ]
        ),
        Criterion(
            name="service_rating",
            weight=8.0,
            requirement="How does the reviewer rate the service?",
            scale_type="ordinal",
            options=[
                CriterionOption(label="Terrible", value=0.0),
                CriterionOption(label="Below Average", value=0.25),
                CriterionOption(label="Average", value=0.5),
                CriterionOption(label="Above Average", value=0.75),
                CriterionOption(label="Outstanding", value=1.0),
                CriterionOption(label="Not mentioned", value=0.0, na=True),
            ]
        ),
        Criterion(
            name="actionable_info",
            weight=6.0,
            requirement="Does the review provide actionable information (prices, tips, recommendations)?",
            scale_type="ordinal",
            options=[
                CriterionOption(label="None", value=0.0),
                CriterionOption(label="Minimal", value=0.33),
                CriterionOption(label="Moderate", value=0.67),
                CriterionOption(label="Extensive", value=1.0),
            ]
        ),
        Criterion(
            name="spam_content",
            weight=-15.0,
            requirement="Contains spam, promotional content, or fake review indicators"
        ),
    ])

    # Configure grader
    grader = CriterionGrader(
        llm_config=LLMConfig(model="openai/gpt-4.1-mini", temperature=0.0),
        shuffle_options=True,
    )

    print("=" * 70)
    print("RESTAURANT REVIEW QUALITY ASSESSMENT")
    print("=" * 70)

    for i, review in enumerate(REVIEWS, 1):
        result = await rubric.grade(
            to_grade=review["text"],
            grader=grader,
            query="Evaluate this restaurant review for quality and helpfulness."
        )
        print(f"\n{'─' * 70}")
        print(f"Review {i}: {review['description']}")
        # result.score is `float | None` (None if the grade failed).
        print(f"Score: {result.score:.2f}" if result.score is not None else "Score: n/a")
        print(f"{'─' * 70}")
        for cr in result.report:
            if cr.final_verdict is not None:
                # Binary
                verdict_str = f"[{cr.final_verdict.value}]"
            else:
                # Multi-choice
                mc = cr.final_multi_choice_verdict
                na_marker = " (N/A)" if mc is not None and mc.na else ""
                # selected_label is str | None (None on a no-option-selected abstain).
                label = (
                    mc.selected_label if mc is not None and mc.selected_label is not None else "N/A"
                )
                verdict_str = f"[{label}]{na_marker}"
            print(f" {verdict_str} {cr.criterion.name}")


if __name__ == "__main__":
    asyncio.run(main())