Working with Grading Explanations¶
Learn how to access, display, and use per-criterion explanations from rubric grading.
The Scenario¶
You're building an automated essay feedback system. Students submit essays, and you want to provide not just a score, but per-criterion feedback explaining why each requirement was met or not met. AutoRubric's grading produces these explanations automatically — you just need to access them.
What You'll Learn¶
- Accessing reason from grading results
- Formatting explanations for student feedback
- Working with ensemble explanations (combined judge reasons)
- Filtering and categorizing reasons programmatically
The Solution¶
flowchart LR
A[Submission] --> B[CriterionGrader]
B --> C{Mode}
C -->|Single Judge| D[One Reason per Criterion]
C -->|Ensemble| E[Multiple Judge Reasons]
E --> F[Aggregated final_reason]
D --> G[CriterionReport]
F --> H[EnsembleCriterionReport]
Step 1: Grade and Access Explanations¶
Every grading result contains a report — a list of EnsembleCriterionReport objects, each with a final_reason field (grade() always returns an ensemble report, even for a single judge):
import asyncio

from autorubric import Rubric, LLMConfig
from autorubric.graders import CriterionGrader

rubric = Rubric.from_dict([
    {"name": "causes", "weight": 30.0, "requirement": "Identifies at least 2 major causes of the Industrial Revolution"},
    {"name": "effects", "weight": 30.0, "requirement": "Describes at least 2 major effects of the Industrial Revolution"},
    {"name": "structure", "weight": 12.0, "requirement": "Clear essay structure with introduction and logical flow"},
    {"name": "errors", "weight": -15.0, "requirement": "Contains significant factual errors"},
])

grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini", temperature=0.0)
)

async def main():
    result = await rubric.grade(
        to_grade="The Industrial Revolution began in Britain around 1760...",
        grader=grader,
        query="Explain the causes and effects of the Industrial Revolution.",
    )
    for cr in result.report:
        verdict = cr.final_verdict.value if cr.final_verdict else "N/A"
        name = cr.criterion.name or "unnamed"
        print(f"[{verdict}] {name}: {cr.final_reason}")

asyncio.run(main())
Single vs. Ensemble Explanations
grade() always returns an EnsembleEvaluationReport (a single LLM is treated as an
"ensemble of 1"), so you always read explanations via cr.final_reason. With a single
judge, final_reason is simply that one judge's explanation. With multiple judges,
final_reason concatenates all judges' reasons with a pipe separator, and individual
verdicts are accessible through cr.votes. Choose multiple judges when you need
multiple perspectives or higher reliability.
Step 2: Format as Student Feedback¶
Structure the explanations into a readable feedback report:
def format_feedback(result):
    # result.score is `float | None` (None if the grade failed); render None as "n/a".
    score_line = f"Overall Score: {result.score:.0%}\n" if result.score is not None else "Overall Score: n/a\n"
    lines = [score_line]
    met = [cr for cr in result.report if cr.final_verdict and cr.final_verdict.value == "MET" and cr.criterion.weight > 0]
    unmet = [cr for cr in result.report if cr.final_verdict and cr.final_verdict.value == "UNMET" and cr.criterion.weight > 0]
    errors = [cr for cr in result.report if cr.final_verdict and cr.final_verdict.value == "MET" and cr.criterion.weight < 0]
    if met:
        lines.append("Strengths:")
        for cr in met:
            lines.append(f" + {cr.criterion.name}: {cr.final_reason}")
    if unmet:
        lines.append("\nAreas for Improvement:")
        for cr in unmet:
            lines.append(f" - {cr.criterion.name}: {cr.final_reason}")
    if errors:
        lines.append("\nErrors Found:")
        for cr in errors:
            lines.append(f" ! {cr.criterion.name}: {cr.final_reason}")
    return "\n".join(lines)
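The three list comprehensions in format_feedback encode one rule: the verdict and the sign of the weight together decide which section a criterion lands in. A minimal sketch of that rule as a single function follows. It runs on stand-in objects rather than real AutoRubric reports, so the classes FakeCriterion and FakeReport, the function name categorize, and the category labels are all hypothetical; only the attribute names (final_verdict.value, criterion.weight) and the "MET"/"UNMET" values come from the code above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    # Stand-in for the library's verdict enum; only the two values used above.
    MET = "MET"
    UNMET = "UNMET"


@dataclass
class FakeCriterion:
    # Hypothetical stand-in exposing the two attributes the filters read.
    name: str
    weight: float


@dataclass
class FakeReport:
    # Hypothetical stand-in for one entry of result.report.
    criterion: FakeCriterion
    final_verdict: Optional[Verdict]
    final_reason: str


def categorize(cr) -> str:
    """Map one criterion report to a feedback section (labels are illustrative)."""
    if cr.final_verdict is None:
        return "ungraded"
    verdict = cr.final_verdict.value
    weight = cr.criterion.weight
    if weight > 0 and verdict == "MET":
        return "strength"
    if weight > 0 and verdict == "UNMET":
        return "improvement"
    if weight < 0 and verdict == "MET":
        return "error"
    # Everything else, e.g. a negative-weight criterion that was UNMET
    # (the problem was not found), is left out of the three sections.
    return "other"


reports = [
    FakeReport(FakeCriterion("causes", 30.0), Verdict.MET, "Names two causes."),
    FakeReport(FakeCriterion("structure", 12.0), Verdict.UNMET, "No introduction."),
    FakeReport(FakeCriterion("errors", -15.0), Verdict.MET, "Wrong start date."),
    FakeReport(FakeCriterion("errors", -15.0), Verdict.UNMET, "No errors found."),
    FakeReport(FakeCriterion("effects", 30.0), None, ""),
]
categories = [categorize(cr) for cr in reports]
print(categories)
```

Keeping the rule in one place makes the cases that format_feedback silently skips visible: criteria without a verdict and negative-weight criteria that were not triggered.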
Step 3: Ensemble Explanations¶
When using ensemble judging, final_reason combines all judges' explanations with a pipe (|) separator:
from autorubric import LLMConfig
from autorubric.graders import CriterionGrader, JudgeSpec

grader = CriterionGrader(
    judges=[
        JudgeSpec(LLMConfig(model="openai/gpt-4.1-mini"), "gpt"),
        JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929"), "claude"),
    ],
    aggregation="majority",
)

# `await` must run inside a coroutine. The function name is illustrative;
# `rubric` is the Rubric defined in Step 1.
async def grade_with_ensemble(essay, prompt):
    result = await rubric.grade(to_grade=essay, grader=grader, query=prompt)
    for cr in result.report:
        # final_verdict can be None, so guard it as in Step 1
        verdict = cr.final_verdict.value if cr.final_verdict else "N/A"
        # Individual judge reasons are pipe-separated
        judge_reasons = cr.final_reason.split(" | ")
        print(f"[{verdict}] {cr.criterion.name}")
        for i, reason in enumerate(judge_reasons, start=1):
            print(f" Judge {i}: {reason}")
        # Individual votes are also available
        for vote in cr.votes:
            print(f" {vote.judge_id}: {vote.verdict.value} - {vote.reason}")
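Splitting final_reason on " | " assumes that no judge's reason contains that separator itself. Reading cr.votes avoids the assumption, and it also makes it easy to flag criteria where the judges disagreed. The sketch below uses stand-in dataclasses in place of real reports: FakeVote, FakeEnsembleReport, reasons_by_judge, and has_disagreement are hypothetical names, and only the attributes used in the loop above (votes, judge_id, verdict.value, reason) are taken from this page.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    # Stand-in for the library's verdict enum.
    MET = "MET"
    UNMET = "UNMET"


@dataclass
class FakeVote:
    # Hypothetical stand-in for one judge's vote.
    judge_id: str
    verdict: Verdict
    reason: str


@dataclass
class FakeEnsembleReport:
    # Hypothetical stand-in exposing only the votes list.
    votes: list = field(default_factory=list)


def reasons_by_judge(cr) -> dict:
    """Return {judge_id: reason} without parsing the joined string."""
    return {vote.judge_id: vote.reason for vote in cr.votes}


def has_disagreement(cr) -> bool:
    """True when the judges did not all return the same verdict."""
    return len({vote.verdict.value for vote in cr.votes}) > 1


cr = FakeEnsembleReport(votes=[
    # This reason contains " | ", which would confuse a naive split.
    FakeVote("gpt", Verdict.MET, "Cites steam power | coal as causes."),
    FakeVote("claude", Verdict.UNMET, "Only one cause is developed."),
])
per_judge = reasons_by_judge(cr)
split_vote = has_disagreement(cr)
print(per_judge)
print(split_vote)
```

A criterion with a split vote is a natural candidate for human review, since the aggregated verdict hides the fact that one judge saw it differently.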
Step 4: Programmatic Filtering¶
Extract specific explanations for downstream use:
def get_unmet_feedback(result):
    """Extract reasons for criteria that were not met."""
    return {
        cr.criterion.name: cr.final_reason
        for cr in result.report
        if cr.final_verdict and cr.final_verdict.value == "UNMET" and cr.criterion.weight > 0
    }

def get_error_explanations(result):
    """Extract explanations for detected errors (negative-weight criteria that were MET)."""
    return {
        cr.criterion.name: cr.final_reason
        for cr in result.report
        if cr.final_verdict and cr.final_verdict.value == "MET" and cr.criterion.weight < 0
    }
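Because both helpers return plain dictionaries of criterion name to reason string, their output can be serialized for storage or an API response. A minimal sketch, assuming dictionaries shaped like the helpers' return values; build_payload, the key names in the payload, and the sample data are hypothetical and not part of AutoRubric:

```python
import json


def build_payload(unmet: dict, errors: dict, score) -> str:
    """Bundle filtered reasons into a JSON string (illustrative schema)."""
    payload = {
        "score": score,  # may be None if grading failed
        "needs_improvement": unmet,
        "errors": errors,
    }
    return json.dumps(payload, indent=2, sort_keys=True)


# Sample values standing in for get_unmet_feedback(result) and
# get_error_explanations(result).
unmet = {"structure": "The essay has no clear introduction."}
errors = {"errors": "States that the revolution began in France."}
text = build_payload(unmet, errors, 0.72)
print(text)
```

Step 1 treats criterion names as optional (cr.criterion.name or "unnamed"). If names can be missing, substitute a unique placeholder before building the dictionaries, since several unnamed criteria would otherwise collapse into a single key.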
Negative-Weight Criteria and MET Verdicts
For negative-weight criteria like errors, a MET verdict means the undesirable behavior
was detected: the submission contains the problem described in the requirement. The
reason then explains what the error is, not what was done well. Filter these
separately when building feedback reports.
Key Takeaways¶
grade() always returns an EnsembleEvaluationReport, so the right-hand column is what
you use in practice (with one judge, the "ensemble" simply wraps that single judge). The
left column documents the single-report CriterionReport type you encounter elsewhere
(e.g., per-judge EvaluationReports).
| Concept | CriterionReport (single report) | EnsembleCriterionReport (from grade()) |
|---|---|---|
| Reason | cr.reason: the judge's direct explanation | cr.final_reason: all judges' reasons joined with \| (the single judge's reason when there is one) |
| Individual votes | (no votes; it is the single report) | cr.votes: one JudgeVote per judge |
| Verdict | cr.verdict: the judge's verdict | cr.final_verdict: aggregated verdict (e.g., majority vote) |
| Criterion access | cr.name, cr.weight | cr.criterion.name, cr.criterion.weight |
| Negative-weight MET | Reason explains the detected problem | Each judge's reason for detecting the problem |
| Access pattern | cr.reason directly | Split cr.final_reason on \| or iterate cr.votes |
Going Further¶
- Ensemble Judging — Get multiple perspectives on each criterion
- Extended Thinking — Enable deeper reasoning for complex evaluations
- API Reference: Core Grading — Full CriterionReport and EnsembleCriterionReport docs
Appendix: Complete Code¶
"""Working with Grading Explanations - Essay Feedback System"""

import asyncio
from pathlib import Path

from autorubric import LLMConfig
from autorubric.dataset import RubricDataset
from autorubric.graders import CriterionGrader

DATASET_PATH = Path(__file__).parent / "examples" / "data" / "essay_grading_dataset.json"


def format_feedback(result):
    """Format grading result as student-readable feedback."""
    # result.score is `float | None` (None if the grade failed); render None as "n/a".
    score_line = f"Overall Score: {result.score:.0%}\n" if result.score is not None else "Overall Score: n/a\n"
    lines = [score_line]
    met = [cr for cr in result.report if cr.final_verdict and cr.final_verdict.value == "MET" and cr.criterion.weight > 0]
    unmet = [cr for cr in result.report if cr.final_verdict and cr.final_verdict.value == "UNMET" and cr.criterion.weight > 0]
    errors = [cr for cr in result.report if cr.final_verdict and cr.final_verdict.value == "MET" and cr.criterion.weight < 0]
    if met:
        lines.append("Strengths:")
        for cr in met:
            lines.append(f" + {cr.criterion.name}: {cr.final_reason}")
    if unmet:
        lines.append("\nAreas for Improvement:")
        for cr in unmet:
            lines.append(f" - {cr.criterion.name}: {cr.final_reason}")
    if errors:
        lines.append("\nErrors Found:")
        for cr in errors:
            lines.append(f" ! {cr.criterion.name}: {cr.final_reason}")
    return "\n".join(lines)


async def main():
    dataset = RubricDataset.from_file(DATASET_PATH)
    grader = CriterionGrader(
        llm_config=LLMConfig(model="openai/gpt-4.1-mini", temperature=0.0)
    )
    item = dataset.items[0]
    rubric = dataset.get_item_rubric(0)
    prompt = dataset.get_item_prompt(0)
    print(f"Prompt: {prompt}")
    print(f"Submission: {item.description}")
    print("=" * 70)
    result = await rubric.grade(
        to_grade=item.submission,
        grader=grader,
        query=prompt,
    )
    print(format_feedback(result))


if __name__ == "__main__":
    asyncio.run(main())