Structured evaluation for LLM-generated content
AutoRubric brings together best practices from rubric science and LLM-as-a-Judge research. Define criteria, validate against humans, and iterate on your rubrics with built-in meta-evaluation.
```python
import asyncio

from autorubric import Rubric, LLMConfig
from autorubric.graders import CriterionGrader


async def main():
    grader = CriterionGrader(llm_config=LLMConfig(model="openai/gpt-5.1-mini"))
    rubric = Rubric.from_dict([
        {"weight": 10.0, "requirement": "States NMC cell-level energy density in the 250-300 Wh/kg range"},
        {"weight": 8.0, "requirement": "Identifies LFP thermal runaway threshold (~270°C) as higher than NMC (~210°C)"},
        {"weight": 6.0, "requirement": "States LFP cycle life advantage (2000-5000 cycles vs 1000-2000 for NMC)"},
        {"weight": -15.0, "requirement": "Incorrectly claims LFP has higher gravimetric energy density than NMC"},
    ])
    result = await rubric.grade(
        to_grade="""NMC cathodes (LiNixMnyCozO2) achieve 250-280 Wh/kg at the cell level, \
while LFP (LiFePO4) typically reaches 150-205 Wh/kg. However, LFP offers superior thermal \
stability with decomposition onset at ~270°C compared to ~210°C for NMC, and delivers \
2000-5000 charge cycles versus 1000-2000 for NMC.""",
        grader=grader,
        query="Compare NMC and LFP cathode materials for EV battery applications.",
    )
    print(f"Score: {result.score:.2f}")
    for criterion in result.report:
        print(f"  [{criterion.final_verdict}] {criterion.criterion.requirement}")


asyncio.run(main())
```
Why AutoRubric?
LLM judges are increasingly used to evaluate LLM outputs, but ad-hoc prompting leads to inconsistent, biased results. Research shows that LLM judges suffer from position bias, self-preference, and verbosity preference. Vague scoring criteria produce unreliable assessments.
AutoRubric codifies findings from educational measurement theory and LLM evaluation research: analytic rubrics with atomic criteria, multi-judge panels, explicit uncertainty handling, and validation against human judgments. It also provides meta-rubric evaluation to assess and improve your rubrics themselves.
What AutoRubric Offers
Weighted Criteria
Define rubrics with positive and negative weights. Binary verdicts or multi-choice scales (ordinal, nominal). Per-criterion explanations.
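To make the weighting concrete, here is a minimal sketch of how binary verdicts over weighted criteria could be aggregated into a single score. The normalization used here (earned weight over total positive weight, clamped to [0, 1]) is an illustrative assumption, not necessarily AutoRubric's exact formula:

```python
def weighted_score(criteria: list[dict], verdicts: list[bool]) -> float:
    """Aggregate binary verdicts into a normalized score.

    Illustrative assumption: met criteria contribute their weight
    (negative weights act as penalties), the total is divided by the
    sum of positive weights, and the result is clamped to [0, 1].
    """
    earned = sum(c["weight"] for c, met in zip(criteria, verdicts) if met)
    max_positive = sum(c["weight"] for c in criteria if c["weight"] > 0)
    return max(0.0, min(1.0, earned / max_positive))


criteria = [
    {"weight": 10.0, "requirement": "States energy density range"},
    {"weight": 6.0, "requirement": "States cycle life advantage"},
    {"weight": -15.0, "requirement": "Claims LFP beats NMC on energy density"},
]
print(weighted_score(criteria, [True, True, False]))  # 1.0: both positives met, no penalty
print(weighted_score(criteria, [True, False, True]))  # 0.0: the -15 penalty outweighs earned credit
```

Negative weights let a single factual error dominate the score, which is often the desired behavior for safety-critical claims.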
Ensemble Judging
Combine multiple LLM judges with voting strategies. Mitigate self-preference bias with cross-provider panels.
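The core voting idea can be sketched in a few lines. The verdict labels and the tie-breaking policy here (falling back to an explicit "cannot assess" label on ties) are illustrative choices, not necessarily the library's defaults:

```python
from collections import Counter


def majority_vote(verdicts: list[str]) -> str:
    """Combine per-judge verdicts for one criterion by simple majority.

    Illustrative policy: a tie resolves to CANNOT_ASSESS rather than
    guessing, surfacing disagreement instead of hiding it.
    """
    counts = Counter(verdicts).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "CANNOT_ASSESS"
    return counts[0][0]


# A cross-provider panel (e.g. OpenAI + Anthropic + Google judges)
print(majority_vote(["MET", "MET", "UNMET"]))  # MET
print(majority_vote(["MET", "UNMET"]))         # CANNOT_ASSESS (tie)
```

Using judges from different providers means no single model is grading its own style of output, which is the lever against self-preference bias.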
Few-Shot Calibration
Calibrate judges with labeled examples. Balanced sampling prevents verdict bias. Works with single or ensemble judges.
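A sketch of balanced sampling: drawing the same number of few-shot examples per verdict so a skewed pool of labeled data doesn't prime the judge toward the majority label. The field names here are hypothetical, not AutoRubric's schema:

```python
import random
from collections import Counter


def balanced_examples(labeled: list[dict], per_verdict: int, seed: int = 0) -> list[dict]:
    """Sample an equal number of few-shot examples for each verdict.

    `labeled` is a list of {"text": ..., "verdict": ...} dicts
    (hypothetical field names for illustration).
    """
    rng = random.Random(seed)
    by_verdict: dict[str, list[dict]] = {}
    for ex in labeled:
        by_verdict.setdefault(ex["verdict"], []).append(ex)
    picked = []
    for _verdict, pool in sorted(by_verdict.items()):
        picked.extend(rng.sample(pool, min(per_verdict, len(pool))))
    return picked


labeled = (
    [{"text": f"good answer {i}", "verdict": "MET"} for i in range(8)]
    + [{"text": f"bad answer {i}", "verdict": "UNMET"} for i in range(2)]
)
shots = balanced_examples(labeled, per_verdict=2)
print(Counter(ex["verdict"] for ex in shots))  # 2 of each, despite the 8:2 skew
```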
Bias Mitigation
Position bias mitigation via option shuffling. CANNOT_ASSESS verdict for explicit uncertainty. Length penalties for verbosity.
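Option shuffling in miniature: present options in a random order, then map the judge's choice back to the original index, so that a judge's preference for (say) the first-listed option doesn't systematically favor any particular answer. This sketch shows the idea only, not the library's implementation:

```python
import random


def shuffled_options(options: list[str], seed: int) -> tuple[list[str], list[int]]:
    """Return options in a shuffled presentation order, plus the
    permutation needed to map a choice back to the original index."""
    order = list(range(len(options)))
    random.Random(seed).shuffle(order)
    return [options[i] for i in order], order


options = ["Response A", "Response B", "Response C"]
shown, order = shuffled_options(options, seed=7)

chosen_shown_index = 1                  # judge picks the 2nd option *as presented*
original_index = order[chosen_shown_index]
print(options[original_index])          # the same pick, in original indexing
```

Repeating the judgment over several shuffles and aggregating the de-shuffled verdicts averages out any residual position preference.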
Agreement Metrics
Cohen's kappa, accuracy, precision, recall, F1. Spearman, Kendall, and Pearson correlations. Bootstrap confidence intervals.
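As a reference point for what these metrics measure, here is Cohen's kappa computed from scratch for two raters over the same items (chance-corrected agreement; this is the standard formula, not AutoRubric's code):

```python
from collections import Counter


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is agreement expected by chance from each
    rater's marginal label frequencies."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)


human = ["MET", "MET", "UNMET", "MET", "UNMET", "UNMET"]
judge = ["MET", "MET", "UNMET", "UNMET", "UNMET", "MET"]
print(round(cohens_kappa(human, judge), 3))  # 0.333: 4/6 raw agreement, corrected for chance
```

Raw accuracy here is 0.667, but kappa drops to 0.333 once chance agreement on a balanced label set is accounted for, which is exactly why kappa is reported alongside accuracy.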
Distribution Analysis
Earth Mover's Distance and KS tests for score distributions. Systematic bias detection. Per-judge breakdowns.
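For intuition, the 1-D Earth Mover's Distance between two equal-size score samples reduces to the mean absolute difference of their order statistics. This sketch shows the metric itself, not the library's implementation:

```python
def emd_1d(a: list[float], b: list[float]) -> float:
    """Earth Mover's (Wasserstein-1) distance between two equal-size
    1-D samples: mean absolute difference of the sorted values."""
    assert len(a) == len(b)
    sa, sb = sorted(a), sorted(b)
    return sum(abs(x - y) for x, y in zip(sa, sb)) / len(a)


human_scores = [0.8, 0.6, 0.9, 0.7]
judge_scores = [0.9, 0.7, 1.0, 0.8]   # judge runs 0.1 "hot" on every item
print(round(emd_1d(human_scores, judge_scores), 3))  # 0.1
```

A judge that is systematically generous shifts the whole distribution, so the EMD equals the offset even when rank correlations look perfect — which is why distribution distances complement the correlation metrics above.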
Meta-Rubric Feedback
Evaluate your rubrics for clarity, structure, and LLM-friendliness. Get actionable suggestions to improve criteria.
Batch Evaluation
High-throughput processing with rate limiting. Checkpointing and resumption. Timing stats and cost tracking.
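The throttling pattern underneath high-throughput grading can be sketched with an `asyncio.Semaphore`. This is a generic concurrency-bounding idiom, not AutoRubric's actual batch API:

```python
import asyncio


async def grade_all(items, grade_one, max_concurrency: int = 8):
    """Run grade_one over all items with at most max_concurrency
    calls in flight at once; gather preserves input order."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(item):
        async with sem:
            return await grade_one(item)

    return await asyncio.gather(*(bounded(i) for i in items))


async def fake_grade(item):
    await asyncio.sleep(0)            # stand-in for a real LLM call
    return {"item": item, "score": 1.0}


results = asyncio.run(grade_all(["a", "b", "c"], fake_grade, max_concurrency=2))
print([r["item"] for r in results])   # results come back in input order
```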
100+ LLM Providers
OpenAI, Anthropic, Google, Azure, Groq, Ollama via LiteLLM. Extended thinking support. Response caching.
This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA) SciFy program (Agreement No. HR00112520300). The views expressed are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government.