Quickstart¶
This guide covers installation, basic configuration, and your first evaluation with AutoRubric.
Installation¶
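Install the package with pip (this guide assumes AutoRubric is published on PyPI under the name autorubric):
pip install autorubric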
API Key Setup¶
AutoRubric uses LiteLLM under the hood, providing access to 100+ LLM providers. Set up your API key for your chosen provider:
# OpenAI
export OPENAI_API_KEY=your_key_here
# Anthropic
export ANTHROPIC_API_KEY=your_key_here
# Google
export GEMINI_API_KEY=your_key_here
AutoRubric automatically loads environment variables from .env files.
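For example, a .env file (typically in your project root) containing the same variables is picked up without extra setup:
OPENAI_API_KEY=your_key_here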
Supported Providers¶
| Provider | Model Format | Environment Variable |
|---|---|---|
| OpenAI | openai/gpt-4.1, openai/gpt-4.1-mini | OPENAI_API_KEY |
| Anthropic | anthropic/claude-sonnet-4-5-20250929 | ANTHROPIC_API_KEY |
| Google | gemini/gemini-2.5-flash, gemini/gemini-2.5-pro | GEMINI_API_KEY |
| Azure OpenAI | azure/openai/gpt-4.1 | AZURE_API_KEY, AZURE_API_BASE |
| Groq | groq/llama-3.1-70b-versatile | GROQ_API_KEY |
| Ollama | ollama/qwen3:14b, ollama/llama3 | (local, no key needed) |
See the LiteLLM Provider Documentation for the full list.
Your First Evaluation¶
import asyncio
from autorubric import Rubric, LLMConfig
from autorubric.graders import CriterionGrader
async def main():
    # 1. Configure the LLM judge
    grader = CriterionGrader(llm_config=LLMConfig(model="openai/gpt-4.1-mini"))

    # 2. Define your evaluation rubric
    rubric = Rubric.from_dict([
        {"weight": 10.0, "requirement": "States NMC cell-level energy density in the 250-300 Wh/kg range"},
        {"weight": 8.0, "requirement": "Identifies LFP thermal runaway threshold (~270°C) as higher than NMC (~210°C)"},
        {"weight": 6.0, "requirement": "States LFP cycle life advantage (2000-5000 cycles vs 1000-2000 for NMC)"},
        {"weight": -15.0, "requirement": "Incorrectly claims LFP has higher gravimetric energy density than NMC"},
    ])

    # 3. Grade a response
    result = await rubric.grade(
        to_grade="""NMC cathodes (LiNixMnyCozO2) achieve 250-280 Wh/kg at the cell level,
        while LFP (LiFePO4) typically reaches 150-205 Wh/kg. However, LFP offers superior
        thermal stability with decomposition onset at ~270°C compared to ~210°C for NMC,
        and delivers 2000-5000 charge cycles versus 1000-2000 for NMC.""",
        grader=grader,
        query="Compare NMC and LFP cathode materials for EV battery applications.",
    )

    # 4. Review results
    print(f"Score: {result.score:.2f}")  # Score is 0.0-1.0
    for criterion in result.report:
        print(f" [{criterion.final_verdict}] {criterion.criterion.requirement}")
        print(f" -> {criterion.final_reason}")

asyncio.run(main())
Core Concepts¶
Rubrics¶
A rubric is a list of criteria that define what you're evaluating. Each criterion has:
- requirement: What the response should (or shouldn't) contain
- weight: How important this criterion is (positive = good, negative = bad)
- name (optional): Identifier for the criterion
from autorubric import Rubric, Criterion
# Direct construction
rubric = Rubric([
    Criterion(name="accuracy", weight=10.0, requirement="States the correct answer"),
    Criterion(name="clarity", weight=5.0, requirement="Explains reasoning clearly"),
    Criterion(name="errors", weight=-15.0, requirement="Contains factual errors"),
])

# From dictionaries (name and weight are optional)
rubric = Rubric.from_dict([
    {"weight": 10.0, "requirement": "States the correct answer"},
    {"requirement": "Explains reasoning clearly"},  # weight defaults to 10.0
])

# From files
rubric = Rubric.from_file("rubric.yaml")
rubric = Rubric.from_file("rubric.json")
Graders¶
Graders evaluate responses against rubrics. The CriterionGrader is the main grader with support for:
- Single LLM: One judge model
- Ensemble: Multiple judges with aggregation
- Few-shot: Calibration with labeled examples
from autorubric import LLMConfig
from autorubric.graders import CriterionGrader, JudgeSpec
# Single LLM
grader = CriterionGrader(llm_config=LLMConfig(model="openai/gpt-4.1-mini"))
# Ensemble with multiple judges
grader = CriterionGrader(
    judges=[
        JudgeSpec(LLMConfig(model="openai/gpt-4.1-mini"), "gpt"),
        JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929"), "claude"),
    ],
    aggregation="majority",
)
Verdicts¶
Each criterion receives one of three verdicts:
| Verdict | Meaning |
|---|---|
| MET | The requirement is satisfied |
| UNMET | The requirement is not satisfied |
| CANNOT_ASSESS | Insufficient evidence to determine |
Scoring¶
The final score is calculated from weighted verdicts:
- Positive criteria: MET earns the weight; UNMET earns 0
- Negative criteria: MET applies the penalty (the negative weight is added to the total, lowering the score); UNMET contributes 0
- The score is normalized to the 0-1 range by default, with totals below 0 clamped to 0
\[
\text{score} = \max\left(0, \min\left(1, \frac{\sum_{i=1}^{n} \mathbb{1}[\text{verdict}_i = \text{MET}] \cdot w_i}{\sum_{i=1}^{n} \max(0, w_i)}\right)\right)
\]
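To make the arithmetic concrete, here is a standalone sketch of the formula above (an illustration only, not AutoRubric's internal implementation), applied to the weights from the quickstart rubric:
# Illustration of the scoring formula above (not AutoRubric's internal code)
def normalized_score(weights, met):
    earned = sum(w for w, m in zip(weights, met) if m)  # sum of w_i where verdict_i = MET
    max_positive = sum(w for w in weights if w > 0)     # denominator: positive weights only
    return max(0.0, min(1.0, earned / max_positive))

weights = [10.0, 8.0, 6.0, -15.0]  # weights from the quickstart rubric
print(normalized_score(weights, [True, True, True, False]))  # 1.0: all positives MET, penalty UNMET
print(normalized_score(weights, [True, True, False, True]))  # 0.125: (10 + 8 - 15) / 24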
LLMConfig Options¶
LLMConfig controls LLM behavior:
from autorubric import LLMConfig
config = LLMConfig(
    # Required
    model="openai/gpt-4.1-mini",
    # Sampling
    temperature=0.0,  # 0.0 = deterministic (default)
    max_tokens=1024,  # Maximum response tokens
    # Rate limiting
    max_parallel_requests=10,  # Concurrent requests per provider
    # Caching
    cache_enabled=True,  # Enable response caching
    cache_dir=".autorubric_cache",
    cache_ttl=3600,  # Cache TTL in seconds
    # Extended thinking (for complex evaluations)
    thinking="high",  # "low", "medium", "high", or token budget
)
Loading Rubrics from YAML¶
# rubric.yaml
- name: accuracy
  weight: 10.0
  requirement: "States the correct answer"
- name: clarity
  weight: 5.0
  requirement: "Explains reasoning clearly"
- name: errors
  weight: -15.0
  requirement: "Contains factual errors"
Batch Evaluation¶
For evaluating multiple responses, use EvalRunner or the evaluate() function:
import asyncio

from autorubric import RubricDataset, LLMConfig, evaluate
from autorubric.graders import CriterionGrader

async def batch_eval():
    # Load dataset
    dataset = RubricDataset.from_file("data.json")

    # Configure grader with rate limiting
    grader = CriterionGrader(
        llm_config=LLMConfig(
            model="openai/gpt-4.1-mini",
            max_parallel_requests=10,
        )
    )

    # Run evaluation
    result = await evaluate(dataset, grader, show_progress=True)
    print(f"Evaluated {result.successful_items}/{result.total_items}")
    print(f"Total cost: ${result.total_completion_cost:.4f}")

asyncio.run(batch_eval())
Next Steps¶
- API Reference: Complete documentation of all classes and functions
- Cookbook: Practical examples and recipes
- Ensemble Judging: Reduce bias with multiple judges
- Metrics: Measure agreement with ground truth