Structured evaluation for LLM-generated content

AutoRubric brings together best practices from rubric science and LLM-as-a-Judge research. Define criteria, validate against humans, and iterate on your rubrics with built-in meta-evaluation.

import asyncio
from autorubric import Rubric, LLMConfig
from autorubric.graders import CriterionGrader

async def main():
    # A single LLM judge that grades each criterion independently
    grader = CriterionGrader(llm_config=LLMConfig(model="openai/gpt-5.1-mini"))

    # Positive weights reward required content; negative weights penalize errors
    rubric = Rubric.from_dict([
        {"weight": 10.0, "requirement": "States NMC cell-level energy density in the 250-300 Wh/kg range"},
        {"weight": 8.0, "requirement": "Identifies LFP thermal runaway threshold (~270°C) as higher than NMC (~210°C)"},
        {"weight": 6.0, "requirement": "States LFP cycle life advantage (2000-5000 cycles vs 1000-2000 for NMC)"},
        {"weight": -15.0, "requirement": "Incorrectly claims LFP has higher gravimetric energy density than NMC"}
    ])

    result = await rubric.grade(
        to_grade="""NMC cathodes (LiNixMnyCozO2) achieve 250-280 Wh/kg at the cell level,
        while LFP (LiFePO4) typically reaches 150-205 Wh/kg. However, LFP offers superior
        thermal stability with decomposition onset at ~270°C compared to ~210°C for NMC,
        and delivers 2000-5000 charge cycles versus 1000-2000 for NMC.""",
        grader=grader,
        query="Compare NMC and LFP cathode materials for EV battery applications.",
    )

    print(f"Score: {result.score:.2f}")
    for criterion in result.report:
        print(f"  [{criterion.final_verdict}] {criterion.criterion.requirement}")

asyncio.run(main())

Why AutoRubric?

LLM judges are increasingly used to evaluate LLM outputs, but ad-hoc prompting leads to inconsistent, biased results. Research shows that LLM judges suffer from position bias, self-preference, and verbosity preference. Vague scoring criteria produce unreliable assessments.

AutoRubric codifies findings from educational measurement theory and LLM evaluation research: analytic rubrics with atomic criteria, multi-judge panels, explicit uncertainty handling, and validation against human judgments. It also provides meta-rubric evaluation to assess and improve your rubrics themselves.

What AutoRubric Offers

Core

Weighted Criteria

Define rubrics with positive and negative weights. Binary verdicts or multi-choice scales (ordinal, nominal). Per-criterion explanations.
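To illustrate the idea behind weighted criteria, here is a minimal sketch of one common way per-criterion verdicts combine into a normalized score: satisfied positive criteria add their weight, triggered negative criteria subtract theirs, and the sum is normalized by the total positive weight. This is an illustrative formula, not necessarily AutoRubric's exact aggregation.

```python
def weighted_score(verdicts):
    """Combine (weight, met) pairs into a score in [0, 1].

    Positive weights reward satisfied criteria; negative weights
    penalize triggered pitfalls. Illustrative formula only.
    """
    total_positive = sum(w for w, _ in verdicts if w > 0)
    earned = sum(w for w, met in verdicts if met)
    return max(0.0, earned / total_positive) if total_positive else 0.0

# All positive criteria met, penalty not triggered -> perfect score
print(weighted_score([(10.0, True), (8.0, True), (6.0, True), (-15.0, False)]))  # 1.0
```

With this formula, triggering the -15.0 penalty above while missing one positive criterion can drive the score close to zero, which is exactly the point of negative weights.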

Robustness

Ensemble Judging

Combine multiple LLM judges with voting strategies. Mitigate self-preference bias with cross-provider panels.
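The core of ensemble judging is a voting strategy over per-judge verdicts. As a sketch of the simplest strategy (plurality vote, falling back to an explicit uncertainty verdict on ties; not AutoRubric's exact internals):

```python
from collections import Counter

def majority_vote(verdicts):
    """Resolve a panel of judge verdicts by plurality; ties -> CANNOT_ASSESS."""
    counts = Counter(verdicts).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "CANNOT_ASSESS"
    return counts[0][0]

print(majority_vote(["PASS", "PASS", "FAIL"]))  # PASS
print(majority_vote(["PASS", "FAIL"]))          # CANNOT_ASSESS
```

Running the judges on different providers before voting is what breaks self-preference bias: no single model family gets to outvote the panel about its own output.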

Robustness

Few-Shot Calibration

Calibrate judges with labeled examples. Balanced sampling prevents verdict bias. Works with single or ensemble judges.
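Balanced sampling means the few-shot examples shown to the judge cover each verdict label as evenly as possible, so the judge isn't nudged toward whichever label dominates the labeled pool. A generic sketch of that sampling step (the function name and shape are illustrative, not AutoRubric's API):

```python
import random
from collections import defaultdict

def balanced_sample(examples, k, seed=0):
    """Pick k few-shot (example, label) pairs, round-robin across labels."""
    by_label = defaultdict(list)
    for ex, label in examples:
        by_label[label].append((ex, label))
    rng = random.Random(seed)
    for group in by_label.values():
        rng.shuffle(group)
    picked, labels, i = [], list(by_label), 0
    # Take one example from each label in turn until k are collected
    while len(picked) < k and any(by_label.values()):
        group = by_label[labels[i % len(labels)]]
        if group:
            picked.append(group.pop())
        i += 1
    return picked
```

Sampling 4 examples from a pool of 4 PASS and 4 FAIL labels yields 2 of each, regardless of how lopsided the pool is relative to `k`.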

Robustness

Bias Mitigation

Position bias mitigation via option shuffling. CANNOT_ASSESS verdict for explicit uncertainty. Length penalties for verbosity.
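Option shuffling works by randomizing the order in which multi-choice options are shown to the judge, then mapping the chosen letter back to the canonical option. A minimal sketch of that mechanic (illustrative helper, not AutoRubric's API):

```python
import random

def shuffled_options(options, seed=None):
    """Return options in random presentation order plus a letter -> option map
    so the judge's shuffled choice can be un-permuted afterwards."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    presented = [options[i] for i in order]
    letters = "ABCDEFGH"
    back_map = {letters[pos]: options[i] for pos, i in enumerate(order)}
    return presented, back_map
```

Averaging verdicts over several shuffles washes out any preference the judge has for, say, the first-listed option.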

Validation

Agreement Metrics

Cohen's kappa, accuracy, precision, recall, F1. Spearman, Kendall, and Pearson correlations. Bootstrap confidence intervals.
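Cohen's kappa is the key metric here because raw accuracy rewards a judge that always says PASS on an imbalanced set; kappa corrects for chance agreement. A self-contained sketch of the standard formula (kappa = (p_o - p_e) / (1 - p_e)):

```python
from collections import Counter

def cohens_kappa(judge, human):
    """Chance-corrected agreement between two raters over nominal labels."""
    n = len(judge)
    p_o = sum(a == b for a, b in zip(judge, human)) / n          # observed agreement
    jc, hc = Counter(judge), Counter(human)
    p_e = sum(jc[l] * hc[l] for l in set(judge) | set(human)) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

print(cohens_kappa(["PASS", "FAIL", "PASS"], ["PASS", "FAIL", "PASS"]))  # 1.0
```

A kappa of 0 means the judge agrees with humans no more than chance would predict, even if its raw accuracy looks respectable.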

Validation

Distribution Analysis

Earth Mover's Distance and KS tests for score distributions. Systematic bias detection. Per-judge breakdowns.
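For 1-D score samples of equal size, Earth Mover's Distance reduces to the mean absolute difference between the sorted samples (empirical quantile matching), which makes systematic bias easy to see. A dependency-free sketch of that special case:

```python
def emd_1d(a, b):
    """1-D Earth Mover's Distance for equal-size samples:
    mean absolute difference between sorted samples."""
    assert len(a) == len(b), "this shortcut requires equal sample sizes"
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

# Judge scores uniformly 0.1 above human scores -> EMD ~ 0.1
print(emd_1d([0.6, 0.8, 0.9], [0.5, 0.7, 0.8]))
```

A judge that is consistently generous by 0.1 produces an EMD of 0.1 even when rank correlations look perfect, which is why distribution metrics complement the agreement metrics above.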

Meta-Evaluation

Meta-Rubric Feedback

Evaluate your rubrics for clarity, structure, and LLM-friendliness. Get actionable suggestions to improve criteria.

Operations

Batch Evaluation

High-throughput processing with rate limiting. Checkpointing and resumption. Timing stats and cost tracking.
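The standard asyncio pattern for this kind of throughput control is a semaphore capping in-flight requests while everything else queues. A generic sketch (where `grade_fn` stands in for any async grading call; this is the pattern, not AutoRubric's batch API):

```python
import asyncio

async def grade_batch(items, grade_fn, max_concurrent=8):
    """Grade items concurrently while capping in-flight requests."""
    sem = asyncio.Semaphore(max_concurrent)

    async def one(item):
        async with sem:           # blocks when max_concurrent are in flight
            return await grade_fn(item)

    # gather preserves input order in its results
    return await asyncio.gather(*(one(i) for i in items))
```

Checkpointing then amounts to persisting completed results keyed by item ID so an interrupted run resumes from where it stopped rather than re-spending tokens.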

Operations

100+ LLM Providers

OpenAI, Anthropic, Google, Azure, Groq, Ollama via LiteLLM. Extended thinking support. Response caching.

This research was developed with funding from the Defense Advanced Research Projects Agency's (DARPA) SciFy program (Agreement No. HR00112520300). The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government.