Evaluate text with LLM-as-a-Judge

A Python library for evaluating text outputs against weighted criteria. Define rubrics, run evaluations, and measure quality at scale.

import asyncio

from autorubric import Rubric, LLMConfig
from autorubric.graders import CriterionGrader

rubric = Rubric.from_dict([
    {"name": "accuracy", "weight": 10, "requirement": "Response is factually correct"},
    {"name": "clarity", "weight": 8, "requirement": "Explanation is clear and concise"},
])

grader = CriterionGrader(llm_config=LLMConfig(model="openai/gpt-4.1-mini"))

async def main() -> None:
    submission = "Water boils at 100 °C at sea level."  # the text you want to evaluate
    result = await rubric.grade(submission, grader=grader)
    print(f"Score: {result.score:.0%}")

asyncio.run(main())

Weighted Criteria

Define rubrics with positive and negative weights. Penalize errors, reward quality, and compute normalized scores.
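For instance, a rubric can mix rewarded and penalized criteria using the same Rubric.from_dict format as the quickstart above; the negative-weight entry below is an illustration, and the exact normalization behavior is an assumption rather than a documented guarantee.

from autorubric import Rubric

# Positive weights reward desirable qualities; the negative weight is assumed
# to penalize the normalized score when that criterion is judged as present.
rubric = Rubric.from_dict([
    {"name": "accuracy", "weight": 10, "requirement": "Response is factually correct"},
    {"name": "cites_sources", "weight": 5, "requirement": "Claims are backed by citations"},
    {"name": "hallucination", "weight": -10, "requirement": "Response invents facts or references"},
])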

Ensemble Judging

Combine multiple LLM judges with voting strategies to improve reliability on high-stakes evaluations.
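As a rough sketch of the idea (plain Python on top of the quickstart API, not the library's built-in ensemble interface, which isn't shown here), the same rubric can be graded by several judges and their scores pooled; the extra model identifiers are placeholders.

import asyncio
import statistics

from autorubric import LLMConfig
from autorubric.graders import CriterionGrader

# Placeholder model identifiers; use whichever providers you have access to.
graders = [
    CriterionGrader(llm_config=LLMConfig(model=name))
    for name in ("openai/gpt-4.1-mini", "anthropic/claude-sonnet-4", "gemini/gemini-2.0-flash")
]

async def ensemble_score(rubric, submission: str) -> float:
    # Grade the same submission with every judge concurrently,
    # then pool the scores with a median as a simple "vote".
    results = await asyncio.gather(*(rubric.grade(submission, grader=g) for g in graders))
    return statistics.median(r.score for r in results)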

Few-Shot Calibration

Calibrate judges with labeled examples: balance the few-shot verdicts and improve agreement with your ground truth.
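The calibration call itself isn't shown here, so the sketch below only illustrates the input: a small labeled set whose MET/UNMET verdicts are roughly balanced. Field names are placeholders, not autorubric's documented schema.

from collections import Counter

# Illustrative labeled examples for an "accuracy" criterion.
calibration_examples = [
    {"submission": "The Eiffel Tower is in Paris.", "verdict": "MET"},
    {"submission": "The Eiffel Tower is in Rome.", "verdict": "UNMET"},
    {"submission": "Water boils at 100 °C at sea level.", "verdict": "MET"},
    {"submission": "Water boils at 50 °C at sea level.", "verdict": "UNMET"},
]

# Keep the verdicts balanced so the judge isn't biased toward one label.
print(Counter(e["verdict"] for e in calibration_examples))  # MET: 2, UNMET: 2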

Comprehensive Metrics

Compute accuracy, Cohen's kappa, precision, recall, and correlations against human ground truth.
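The snippet below is independent of autorubric's own metrics helpers (whose interface isn't shown here); it computes the same standard metrics with scikit-learn on toy judge-versus-human verdicts to make the definitions concrete.

from sklearn.metrics import accuracy_score, cohen_kappa_score, precision_score, recall_score

# Binary verdicts: 1 = MET, 0 = UNMET
human = [1, 0, 1, 1, 0, 1, 0, 0]
judge = [1, 0, 1, 0, 0, 1, 1, 0]

print("accuracy ", accuracy_score(human, judge))
print("kappa    ", cohen_kappa_score(human, judge))
print("precision", precision_score(human, judge))
print("recall   ", recall_score(human, judge))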

Multi-Choice Scales

Support ordinal and nominal scales with Likert-style ratings, not just binary MET/UNMET verdicts.
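autorubric's scale configuration isn't shown here; as a plain-Python illustration of the idea, an ordinal Likert-style verdict can be mapped onto a numeric score instead of a binary MET/UNMET decision.

# Map an ordinal (Likert-style) verdict onto a numeric score in [0, 1].
LIKERT = {
    "strongly disagree": 0.0,
    "disagree": 0.25,
    "neutral": 0.5,
    "agree": 0.75,
    "strongly agree": 1.0,
}

def criterion_score(verdict: str) -> float:
    return LIKERT[verdict.lower()]

print(criterion_score("Agree"))  # 0.75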

Multi-Provider

Works with OpenAI, Anthropic, Google, and any OpenAI-compatible API out of the box.
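Assuming the provider-prefixed model string from the quickstart generalizes to other providers, switching judges might look like the sketch below; the exact model identifiers are placeholders, not documented names.

from autorubric import LLMConfig
from autorubric.graders import CriterionGrader

# Provider-prefixed model strings, following the "openai/gpt-4.1-mini"
# pattern from the quickstart; identifiers below are placeholders.
openai_grader = CriterionGrader(llm_config=LLMConfig(model="openai/gpt-4.1-mini"))
anthropic_grader = CriterionGrader(llm_config=LLMConfig(model="anthropic/claude-sonnet-4"))
gemini_grader = CriterionGrader(llm_config=LLMConfig(model="gemini/gemini-2.0-flash"))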