Working with Grading Explanations

Learn how to access, display, and use per-criterion explanations from rubric grading.

The Scenario

You're building an automated essay feedback system. Students submit essays, and you want to return not just a score but per-criterion feedback explaining why each requirement was or was not met. AutoRubric's grading produces these explanations automatically; you only need to read them from the grading result.

What You'll Learn

Accessing the reason field from grading results
Formatting explanations for student feedback
Working with ensemble explanations (combined judge reasons)
Filtering and categorizing reasons programmatically

The Solution

flowchart LR
    A[Submission] --> B[CriterionGrader]
    B --> C{Mode}
    C -->|Single Judge| D[One Reason per Criterion]
    C -->|Ensemble| E[Multiple Judge Reasons]
    E --> F[Aggregated final_reason]
    D --> G[CriterionReport]
    F --> H[EnsembleCriterionReport]

Step 1: Grade and Access Explanations

Every grading result contains a report — a list of CriterionReport objects, each with a reason field:

import asyncio
from autorubric import Rubric, LLMConfig
from autorubric.graders import CriterionGrader

rubric = Rubric.from_dict([
    {"name": "causes", "weight": 30.0, "requirement": "Identifies at least 2 major causes of the Industrial Revolution"},
    {"name": "effects", "weight": 30.0, "requirement": "Describes at least 2 major effects of the Industrial Revolution"},
    {"name": "structure", "weight": 12.0, "requirement": "Clear essay structure with introduction and logical flow"},
    {"name": "errors", "weight": -15.0, "requirement": "Contains significant factual errors"},
])

grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini", temperature=0.0)
)

async def main():
    result = await rubric.grade(
        to_grade="The Industrial Revolution began in Britain around 1760...",
        grader=grader,
        query="Explain the causes and effects of the Industrial Revolution.",
    )

    for cr in result.report:
        verdict = cr.verdict.value if cr.verdict else "N/A"
        name = cr.name or "unnamed"
        print(f"[{verdict}] {name}: {cr.reason}")

asyncio.run(main())

Single vs. Ensemble Explanations

With a single judge, reason is that judge's direct explanation. With an ensemble, final_reason concatenates every judge's reason using " | " as the separator, and each judge's own verdict and reason remain available through cr.votes. Choose an ensemble when you need multiple perspectives or higher reliability.
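If the same feedback code has to handle both modes, one option is a small accessor that hides the attribute differences. The sketch below assumes only the attribute names used in this guide (name, verdict, reason for a single judge; criterion.name, final_verdict, final_reason for an ensemble). The explain helper and the SimpleNamespace stand-ins are hypothetical illustrations, not part of AutoRubric.

```python
from types import SimpleNamespace


def explain(cr):
    """Return (name, verdict_label, reason) for either report shape.

    Assumption: ensemble reports expose final_reason/final_verdict/criterion,
    single-judge reports expose reason/verdict/name, as described in this guide.
    """
    if hasattr(cr, "final_reason"):  # ensemble-style report
        name, verdict, reason = cr.criterion.name, cr.final_verdict, cr.final_reason
    else:  # single-judge report
        name, verdict, reason = cr.name, cr.verdict, cr.reason
    label = verdict.value if verdict else "N/A"
    return name or "unnamed", label, reason


# Stand-in objects for a quick local check (hypothetical, not AutoRubric classes).
met = SimpleNamespace(value="MET")
single = SimpleNamespace(name="causes", verdict=met, reason="Names two causes.")
ensemble = SimpleNamespace(
    criterion=SimpleNamespace(name="causes"),
    final_verdict=met,
    final_reason="Names two causes. | Lists coal and capital.",
)

print(explain(single))
print(explain(ensemble))
```

Checking for final_reason rather than importing the report classes keeps the helper independent of the library's class hierarchy, at the cost of relying on attribute names staying stable.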

Step 2: Format as Student Feedback

Structure the explanations into a readable feedback report:

def format_feedback(result):
    lines = [f"Overall Score: {result.score:.0%}\n"]

    met = [cr for cr in result.report if cr.verdict and cr.verdict.value == "MET" and cr.weight > 0]
    unmet = [cr for cr in result.report if cr.verdict and cr.verdict.value == "UNMET" and cr.weight > 0]
    errors = [cr for cr in result.report if cr.verdict and cr.verdict.value == "MET" and cr.weight < 0]

    if met:
        lines.append("Strengths:")
        for cr in met:
            lines.append(f"  + {cr.name}: {cr.reason}")

    if unmet:
        lines.append("\nAreas for Improvement:")
        for cr in unmet:
            lines.append(f"  - {cr.name}: {cr.reason}")

    if errors:
        lines.append("\nErrors Found:")
        for cr in errors:
            lines.append(f"  ! {cr.name}: {cr.reason}")

    return "\n".join(lines)

Step 3: Ensemble Explanations

When using ensemble judging, final_reason combines all judges' explanations with a pipe (|) separator:

from autorubric import LLMConfig
from autorubric.graders import CriterionGrader, JudgeSpec

grader = CriterionGrader(
    judges=[
        JudgeSpec(LLMConfig(model="openai/gpt-4.1-mini"), "gpt"),
        JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929"), "claude"),
    ],
    aggregation="majority",
)

async def grade_with_ensemble(rubric, essay, prompt):
    # rubric is the Rubric from Step 1; essay and prompt are the submission text and task prompt.
    # Run with: asyncio.run(grade_with_ensemble(rubric, essay, prompt))
    result = await rubric.grade(to_grade=essay, grader=grader, query=prompt)

    for cr in result.report:
        # final_reason joins the judges' reasons with " | "
        judge_reasons = cr.final_reason.split(" | ")
        print(f"[{cr.final_verdict.value}] {cr.criterion.name}")
        for i, reason in enumerate(judge_reasons, start=1):
            print(f"  Judge {i}: {reason}")

        # Each vote keeps its judge_id, verdict, and reason together
        for vote in cr.votes:
            print(f"  {vote.judge_id}: {vote.verdict.value} - {vote.reason}")

    return result
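Splitting final_reason on " | " attributes text to the wrong judge if any reason happens to contain that sequence, so iterating the votes is the safer route when attribution matters. The following is a minimal sketch that assumes each vote exposes judge_id, verdict.value, and reason as in the loop above. The summarize_votes helper and the stand-in vote objects are hypothetical.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_votes(votes):
    """Map judge_id -> (verdict, reason), tally verdicts, and flag disagreement."""
    by_judge = {v.judge_id: (v.verdict.value, v.reason) for v in votes}
    tally = Counter(verdict for verdict, _ in by_judge.values())
    unanimous = len(tally) <= 1  # zero or one distinct verdict
    return by_judge, dict(tally), unanimous


# Stand-in votes (hypothetical). The first reason deliberately contains " | ".
votes = [
    SimpleNamespace(
        judge_id="gpt",
        verdict=SimpleNamespace(value="MET"),
        reason="Cites steam power | and enclosure.",
    ),
    SimpleNamespace(
        judge_id="claude",
        verdict=SimpleNamespace(value="UNMET"),
        reason="Only one cause is developed.",
    ),
]

by_judge, tally, unanimous = summarize_votes(votes)
for judge_id, (verdict, reason) in by_judge.items():
    print(f"{judge_id}: {verdict}: {reason}")
print("unanimous:", unanimous)
```

Criteria where unanimous is false are natural candidates for human review, since the aggregated verdict hides the disagreement.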

Step 4: Programmatic Filtering

Extract specific explanations for downstream use:

def get_unmet_feedback(result):
    """Extract reasons for criteria that were not met."""
    return {
        cr.name: cr.reason
        for cr in result.report
        if cr.verdict and cr.verdict.value == "UNMET" and cr.weight > 0
    }

def get_error_explanations(result):
    """Extract explanations for detected errors (negative-weight criteria that were MET)."""
    return {
        cr.name: cr.reason
        for cr in result.report
        if cr.verdict and cr.verdict.value == "MET" and cr.weight < 0
    }
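One downstream use of these filters is spotting which requirements a whole class struggles with. The sketch below applies the same UNMET and positive-weight condition across several results and ranks criteria by how often they were missed. The unmet_summary and fake_cr helpers and the sample data are hypothetical, and the result objects are stand-ins that only mimic the report, name, verdict, weight, and reason attributes used in this guide.

```python
from collections import Counter, defaultdict
from types import SimpleNamespace


def unmet_summary(results):
    """Count unmet positive-weight criteria across results and collect their reasons."""
    counts = Counter()
    reasons = defaultdict(list)
    for result in results:
        for cr in result.report:
            if cr.verdict and cr.verdict.value == "UNMET" and cr.weight > 0:
                counts[cr.name] += 1
                reasons[cr.name].append(cr.reason)
    return counts.most_common(), dict(reasons)


def fake_cr(name, verdict, weight, reason):
    """Hypothetical stand-in for a criterion report (not an AutoRubric class)."""
    return SimpleNamespace(
        name=name, verdict=SimpleNamespace(value=verdict), weight=weight, reason=reason
    )


results = [
    SimpleNamespace(report=[
        fake_cr("causes", "MET", 30.0, "Names coal and capital."),
        fake_cr("effects", "UNMET", 30.0, "Mentions only urbanization."),
    ]),
    SimpleNamespace(report=[
        fake_cr("causes", "UNMET", 30.0, "No causes identified."),
        fake_cr("effects", "UNMET", 30.0, "Effects are not discussed."),
        fake_cr("errors", "UNMET", -15.0, "No factual errors found."),
    ]),
]

ranked, reasons = unmet_summary(results)
print(ranked)
```

The negative-weight errors criterion is excluded even though its verdict is UNMET, because an unmet error criterion means nothing went wrong.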

Negative-Weight Criteria and MET Verdicts

For negative-weight criteria like errors, a MET verdict means the undesirable behavior was detected: the submission contains the problem described in the requirement. The reason then explains what the error is, not what was done well. An UNMET verdict on such a criterion means no problem was found. Filter these separately when building feedback reports.
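The sign of the weight and the verdict together give four cases, which can be made explicit with a single labeling function instead of repeated list comprehensions. This is a sketch under the assumptions of this guide (verdict values "MET" and "UNMET", a numeric weight). The categorize helper and its label names are hypothetical, and any other or missing verdict is reported as "ungraded".

```python
from types import SimpleNamespace


def categorize(cr):
    """Label a criterion report by combining weight sign and verdict."""
    value = cr.verdict.value if cr.verdict else None
    if value not in ("MET", "UNMET"):
        return "ungraded"
    if cr.weight < 0:
        # Negative weight: MET means the problem was detected.
        return "error_found" if value == "MET" else "error_absent"
    # Non-negative weight: MET means the requirement was satisfied.
    return "strength" if value == "MET" else "needs_work"


def fake_cr(verdict, weight):
    """Hypothetical stand-in for a criterion report (not an AutoRubric class)."""
    v = SimpleNamespace(value=verdict) if verdict else None
    return SimpleNamespace(verdict=v, weight=weight)


for verdict, weight in [("MET", 30.0), ("UNMET", 30.0), ("MET", -15.0), ("UNMET", -15.0)]:
    print(verdict, weight, "->", categorize(fake_cr(verdict, weight)))
```

Only "needs_work" and "error_found" usually call for action from the student, which is why the filters in Step 4 target exactly those two cases.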

Key Takeaways

Concept	Single Judge	Ensemble
Reason	`cr.reason`: the judge's direct explanation	`cr.final_reason`: all judges' reasons joined with ` | `
Individual votes	One verdict in `cr.votes`	Multiple verdicts in `cr.votes`, one per judge
Verdict	`cr.verdict`: the judge's verdict	`cr.final_verdict`: aggregated verdict (e.g., majority vote)
Criterion access	`cr.name`, `cr.weight`	`cr.criterion.name`, `cr.criterion.weight`
Negative-weight MET	Reason explains the detected problem	Each judge's reason for detecting the problem
Access pattern	`cr.reason` directly	Split `cr.final_reason` on ` | ` or iterate `cr.votes`

Going Further

Ensemble Judging — Get multiple perspectives on each criterion
Extended Thinking — Enable deeper reasoning for complex evaluations
API Reference: Core Grading — Full CriterionReport and EnsembleCriterionReport docs

Appendix: Complete Code

"""Working with Grading Explanations - Essay Feedback System"""

import asyncio
from pathlib import Path

from autorubric import LLMConfig
from autorubric.dataset import RubricDataset
from autorubric.graders import CriterionGrader

DATASET_PATH = Path(__file__).parent / "examples" / "data" / "essay_grading_dataset.json"


def format_feedback(result):
    """Format grading result as student-readable feedback."""
    lines = [f"Overall Score: {result.score:.0%}\n"]

    met = [cr for cr in result.report if cr.verdict and cr.verdict.value == "MET" and cr.weight > 0]
    unmet = [cr for cr in result.report if cr.verdict and cr.verdict.value == "UNMET" and cr.weight > 0]
    errors = [cr for cr in result.report if cr.verdict and cr.verdict.value == "MET" and cr.weight < 0]

    if met:
        lines.append("Strengths:")
        for cr in met:
            lines.append(f"  + {cr.name}: {cr.reason}")

    if unmet:
        lines.append("\nAreas for Improvement:")
        for cr in unmet:
            lines.append(f"  - {cr.name}: {cr.reason}")

    if errors:
        lines.append("\nErrors Found:")
        for cr in errors:
            lines.append(f"  ! {cr.name}: {cr.reason}")

    return "\n".join(lines)


async def main():
    dataset = RubricDataset.from_file(DATASET_PATH)

    grader = CriterionGrader(
        llm_config=LLMConfig(model="openai/gpt-4.1-mini", temperature=0.0)
    )

    item = dataset.items[0]
    rubric = dataset.get_item_rubric(0)
    prompt = dataset.get_item_prompt(0)

    print(f"Prompt: {prompt}")
    print(f"Submission: {item.description}")
    print("=" * 70)

    result = await rubric.grade(
        to_grade=item.submission,
        grader=grader,
        query=prompt,
    )

    print(format_feedback(result))


if __name__ == "__main__":
    asyncio.run(main())