LLM Infrastructure¶
LLM client configuration, caching, and generation utilities.
Overview¶
AutoRubric uses LiteLLM for multi-provider LLM support. The LLMConfig class provides centralized configuration, while LLMClient handles request execution with caching, rate limiting, and retry logic.
Quick Example¶
```python
from autorubric import LLMConfig, LLMClient, generate

# Configuration
config = LLMConfig(
    model="openai/gpt-4.1-mini",
    temperature=0.0,
    max_tokens=1024,
    cache_enabled=True,
    max_parallel_requests=10,
)

# One-off generation with the convenience function
result = await generate(
    system_prompt="You are a helpful assistant.",
    user_prompt="Explain quantum computing.",
    model="openai/gpt-4.1-mini",
)
print(result)

# Or use the client for repeated calls
client = LLMClient(config)
result = await client.generate(
    system_prompt="You are a helpful assistant.",
    user_prompt="Explain quantum computing.",
)
print(result)
```
Provider Configuration¶
| Provider | Model Format | Environment Variable |
|---|---|---|
| OpenAI | openai/gpt-4.1, openai/gpt-4.1-mini | OPENAI_API_KEY |
| Anthropic | anthropic/claude-sonnet-4-5-20250929 | ANTHROPIC_API_KEY |
| Gemini | gemini/gemini-2.5-flash | GEMINI_API_KEY |
| Azure | azure/openai/gpt-4.1 | AZURE_API_KEY, AZURE_API_BASE |
| Groq | groq/llama-3.1-70b-versatile | GROQ_API_KEY |
| Ollama | ollama/qwen3:14b | (local, no key needed) |
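A short sketch of selecting providers with LLMConfig; API keys are read from the environment variables listed above, and the explicit Azure override values are illustrative placeholders:

```python
from autorubric import LLMConfig

# Keys are read from the environment (e.g. ANTHROPIC_API_KEY) by default
anthropic_config = LLMConfig(model="anthropic/claude-sonnet-4-5-20250929")

# Local Ollama model: no API key required
local_config = LLMConfig(model="ollama/qwen3:14b")

# Explicit overrides instead of environment variables (placeholder values)
azure_config = LLMConfig(
    model="azure/openai/gpt-4.1",
    api_key="<your-azure-api-key>",
    api_base="<your-azure-endpoint>",
)
```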
YAML Configuration¶
```yaml
# llm_config.yaml
model: openai/gpt-4.1
temperature: 0.0
max_tokens: 1024
cache_enabled: true
cache_ttl: 3600
```
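Such a file can be loaded back with the LLMConfig.from_yaml classmethod documented below:

```python
from autorubric import LLMConfig

# Load the settings shown above from disk
config = LLMConfig.from_yaml("llm_config.yaml")
```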
Extended Thinking¶
Enable step-by-step reasoning for complex evaluations:
```python
# Level-based (cross-provider)
config = LLMConfig(
    model="anthropic/claude-sonnet-4-5-20250929",
    thinking="high",  # "low", "medium", "high", or "none"
)

# Token budget
config = LLMConfig(
    model="anthropic/claude-opus-4-5-20251101",
    thinking=32000,  # Explicit token budget
)
```
Supported providers: Anthropic, OpenAI (o-series), Gemini (2.5+), DeepSeek.
Response Caching¶
```python
config = LLMConfig(
    model="openai/gpt-4.1-mini",
    cache_enabled=True,
    cache_dir=".autorubric_cache",
    cache_ttl=3600,  # 1 hour
)

client = LLMClient(config)
client.clear_cache()
stats = client.cache_stats()
# {'size': 1024, 'count': 10, 'directory': '.autorubric_cache'}
```
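Caching can also be toggled per request via the use_cache argument to LLMClient.generate (see the generate reference below); a brief sketch:

```python
# Bypass the cache for a single call even though cache_enabled=True
fresh = await client.generate(
    system_prompt="You are a helpful assistant.",
    user_prompt="Explain quantum computing.",
    use_cache=False,
)
```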
Prompt Caching (Anthropic)¶
Prompt caching reduces latency and cost on repeated calls and is enabled by default.
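A minimal sketch using the prompt_caching field documented under LLMConfig:

```python
# Prompt caching is applied automatically for supported models;
# set prompt_caching=False to opt out entirely.
config = LLMConfig(
    model="anthropic/claude-sonnet-4-5-20250929",
    prompt_caching=True,  # default
)
```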
LLMConfig¶
Central configuration class for LLM calls.
LLMConfig
dataclass
¶
LLMConfig(model: str, temperature: float = 0.0, max_tokens: int | None = None, top_p: float | None = None, timeout: float = 60.0, max_retries: int = 3, retry_min_wait: float = 1.0, retry_max_wait: float = 60.0, max_parallel_requests: int | None = None, cache_enabled: bool = False, cache_dir: str | Path = '.autorubric_cache', cache_ttl: int | None = None, api_key: str | None = None, api_base: str | None = None, thinking: ThinkingParam = None, prompt_caching: bool = True, seed: int | None = None, extra_headers: dict[str, str] = dict(), extra_params: dict[str, Any] = dict())
Configuration for LLM calls.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| model | str | Model identifier in LiteLLM format (e.g., "openai/gpt-5.2", "anthropic/claude-sonnet-4-5-20250929", "gemini/gemini-3-pro-preview", "ollama/qwen3:14b"). REQUIRED, no default. See the LiteLLM docs for the full list of supported models. |
| temperature | float | Sampling temperature (0.0 = deterministic). |
| max_tokens | int \| None | Maximum tokens in the response. |
| top_p | float \| None | Nucleus sampling parameter. |
| timeout | float | Request timeout in seconds. |
| max_retries | int | Maximum retry attempts for transient failures. |
| retry_min_wait | float | Minimum wait between retries (seconds). |
| retry_max_wait | float | Maximum wait between retries (seconds). |
| max_parallel_requests | int \| None | Maximum concurrent requests to this model's provider. When set, a global per-provider semaphore limits parallel requests. None (default) means unlimited parallel requests. |
| cache_enabled | bool | Default caching behavior (can be overridden per request). |
| cache_dir | str \| Path | Directory for the response cache. |
| cache_ttl | int \| None | Cache time-to-live in seconds (None = no expiration). |
| api_key | str \| None | Optional API key override (otherwise read from environment variables). |
| api_base | str \| None | Optional API base URL override. |
| thinking | ThinkingParam | Enable thinking/reasoning mode (unified across providers). Accepts a ThinkingLevel enum (e.g. ThinkingLevel.HIGH), a string ("low", "medium", "high", "none"), an int token budget (e.g. 32000), a ThinkingConfig with level and/or budget_tokens, or None to disable thinking (default). Provider support: Anthropic extended thinking (claude-sonnet-4-5, claude-opus-4-5+), OpenAI reasoning (o-series and GPT-5 models), Gemini thinking mode (2.5+, 3.0+ models), DeepSeek reasoning content. |
| prompt_caching | bool | Enable prompt caching for supported models (default: True). When enabled, automatically detects whether the model supports caching via litellm.supports_prompt_caching() and applies provider-specific config: Anthropic adds cache_control to system messages plus a beta header; OpenAI/DeepSeek cache automatically for prompts ≥ 1024 tokens (no extra config); Bedrock is supported for all models. Set to False to disable prompt caching entirely. |
| seed | int \| None | Random seed for reproducible outputs (OpenAI and some other providers). |
| extra_headers | dict[str, str] | Additional HTTP headers for provider-specific features. |
| extra_params | dict[str, Any] | Additional provider-specific parameters passed to LiteLLM. |
Examples:
```python
# Basic usage without thinking
config = LLMConfig(model="openai/gpt-5.2")

# Enable thinking with a level
config = LLMConfig(model="anthropic/claude-sonnet-4-5-20250929", thinking="high")
config = LLMConfig(model="openai/responses/gpt-5-mini", thinking=ThinkingLevel.HIGH)

# Enable thinking with explicit token budget
config = LLMConfig(model="anthropic/claude-opus-4-5-20251101", thinking=32000)

# Full control with ThinkingConfig
config = LLMConfig(
    model="gemini/gemini-2.5-pro",
    thinking=ThinkingConfig(level=ThinkingLevel.HIGH, budget_tokens=50000),
)
```
get_thinking_config
¶
get_thinking_config() -> ThinkingConfig | None
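A usage sketch, assuming this method resolves the configured thinking parameter into a ThinkingConfig and returns None when thinking is disabled:

```python
config = LLMConfig(model="anthropic/claude-sonnet-4-5-20250929", thinking="high")
thinking_config = config.get_thinking_config()  # ThinkingConfig instance (assumed)

config = LLMConfig(model="openai/gpt-5.2")  # thinking disabled by default
print(config.get_thinking_config())  # None (assumed)
```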
from_yaml
classmethod
¶
from_yaml(path: str | Path) -> LLMConfig
Load LLMConfig from a YAML file.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| path | str \| Path | Path to YAML configuration file. |

| RETURNS | DESCRIPTION |
|---|---|
| LLMConfig | LLMConfig instance with values from the YAML file. |

| RAISES | DESCRIPTION |
|---|---|
| FileNotFoundError | If the file doesn't exist. |
| ValueError | If required fields are missing or invalid. |

Example YAML file (llm_config.yaml):

```yaml
model: openai/gpt-5.2
temperature: 0.0
max_tokens: 1024
cache_enabled: true
cache_ttl: 3600
```
to_yaml
¶
Save LLMConfig to a YAML file.
| PARAMETER | DESCRIPTION |
|---|---|
| path | Path to write YAML configuration file. |
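A round-trip sketch with to_yaml and from_yaml (the file name is illustrative):

```python
from autorubric import LLMConfig

config = LLMConfig(model="openai/gpt-4.1-mini", cache_enabled=True)
config.to_yaml("llm_config.yaml")                   # write settings to disk
restored = LLMConfig.from_yaml("llm_config.yaml")   # load them back
```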
LLMClient¶
Async client for LLM generation with caching and rate limiting.
LLMClient
¶
LLMClient(config: LLMConfig)
Unified LLM client with retries, caching, and structured output support.
Uses diskcache for efficient, thread-safe response caching.
Initialize LLM client.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| config | LLMConfig | LLMConfig instance. The model field is required. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If config.model is not specified. |
generate
async
¶
generate(system_prompt: str, user_prompt: str, response_format: type[T] | None = None, use_cache: bool | None = None, return_thinking: bool = False, return_result: bool = False, **kwargs: Any) -> str | T | GenerateResult
Generate LLM response with optional structured output.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| system_prompt | str | System message for the LLM. |
| user_prompt | str | User message for the LLM. |
| response_format | type[T] \| None | Optional Pydantic model class for structured output. When provided, LiteLLM uses the model's JSON schema to constrain the LLM output and returns a validated Pydantic instance. |
| use_cache | bool \| None | Whether to use caching for this request. None (default): use the config.cache_enabled setting. True: force cache usage (initializes the cache if needed). False: skip the cache for this request. |
| return_thinking | bool | If True and thinking is enabled, return a GenerateResult with both content and thinking. If False (default), only return content. Note: when response_format is provided, thinking is injected into the 'reasoning' field if it exists, regardless of this setting. |
| return_result | bool | If True, always return a GenerateResult with full details including usage statistics and completion cost. Useful when you need to track token usage. When True, takes precedence over the default return behavior. |
| **kwargs | Any | Override any LLMConfig parameters for this call. |

| RETURNS | DESCRIPTION |
|---|---|
| str \| T \| GenerateResult | If return_result=True or return_thinking=True: a GenerateResult with content, thinking, usage, cost, and parsed (if response_format was provided). If response_format is None: the string response from the LLM. If response_format is provided: a validated Pydantic model instance; if thinking is enabled and the response_format has a 'reasoning' field, it is populated with the model's thinking trace. |

| RAISES | DESCRIPTION |
|---|---|
| APIError | If all retries fail. |
| ValidationError | If the response doesn't match the schema. |
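A sketch combining structured output and usage tracking; the Verdict model is illustrative:

```python
from pydantic import BaseModel

from autorubric import LLMClient, LLMConfig


class Verdict(BaseModel):
    score: int
    reasoning: str


client = LLMClient(LLMConfig(model="openai/gpt-4.1-mini"))

# return_result=True yields a GenerateResult with usage and cost details
result = await client.generate(
    system_prompt="You are a strict grader.",
    user_prompt="Score this answer from 1 to 5: 'The sky is green.'",
    response_format=Verdict,
    return_result=True,
)
print(result.parsed.score)        # validated Verdict instance
print(result.usage, result.cost)  # token usage and cost in USD
```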
clear_cache
¶
Clear all cached responses.
| RETURNS | DESCRIPTION |
|---|---|
| int | Number of entries cleared. |
cache_stats
¶
Get cache statistics.
| RETURNS | DESCRIPTION |
|---|---|
| dict[str, Any] | Dict with 'size', 'count', and 'directory' keys. |
generate¶
Convenience function for one-off LLM generation.
generate
async
¶
generate(system_prompt: str, user_prompt: str, model: str, response_format: type[T] | None = None, **kwargs: Any) -> str | T
Simple one-shot generation function.
For repeated calls, prefer creating an LLMClient instance.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| system_prompt | str | System message for the LLM. |
| user_prompt | str | User message for the LLM. |
| model | str | Model identifier (REQUIRED). |
| response_format | type[T] \| None | Optional Pydantic model for structured output. |
| **kwargs | Any | Additional LLMConfig parameters. |
Example
```python
# Simple string response
response = await generate(
    "You are a helpful assistant.",
    "What is 2+2?",
    model="openai/gpt-5.2-mini",
)

# Structured output
from pydantic import BaseModel

class MathAnswer(BaseModel):
    result: int
    explanation: str

answer = await generate(
    "You are a math tutor.",
    "What is 2+2?",
    model="openai/gpt-5.2-mini",
    response_format=MathAnswer,
)
print(answer.result)  # 4
```
GenerateResult¶
Result from LLM generation.
GenerateResult
dataclass
¶
GenerateResult(content: str, thinking: str | None = None, raw_response: Any = None, usage: 'TokenUsage | None' = None, cost: float | None = None, parsed: Any = None)
Result from LLM generation including content, thinking, and usage statistics.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| content | str | The main response content from the LLM (raw string). |
| thinking | str \| None | The thinking/reasoning trace if thinking was enabled. None if thinking was not enabled or the provider doesn't support it. |
| raw_response | Any | The raw LiteLLM response object for advanced use cases. |
| usage | TokenUsage \| None | Token usage statistics from this LLM call. |
| cost | float \| None | Completion cost in USD for this LLM call, calculated using LiteLLM's completion_cost() function. None if cost calculation fails. |
| parsed | Any | The parsed Pydantic model instance when response_format was provided. None if no response_format was used or parsing failed. |
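A short sketch of reading these fields from a call made with return_result=True (reusing the client from the quick example):

```python
result = await client.generate(
    system_prompt="You are a helpful assistant.",
    user_prompt="Explain quantum computing.",
    return_result=True,
)
print(result.content)    # raw response text
print(result.thinking)   # None unless thinking is enabled and supported
print(result.usage)      # token usage statistics, if available
print(result.cost)       # completion cost in USD, or None
```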
ThinkingConfig¶
Configuration for extended thinking/reasoning.
ThinkingConfig
dataclass
¶
ThinkingConfig(level: ThinkingLevel | ThinkingLevelLiteral = MEDIUM, budget_tokens: int | None = None)
Detailed configuration for LLM thinking/reasoning.
Provides a uniform interface across providers:

- Anthropic: Extended thinking (claude-sonnet-4-5, claude-opus-4-5+)
- OpenAI: Reasoning (o-series, GPT-5 models via openai/responses/ prefix)
- Gemini: Thinking mode (2.5+, 3.0+ models)
- DeepSeek: Reasoning content
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| level | ThinkingLevel \| ThinkingLevelLiteral | High-level thinking effort. Used when budget_tokens is not set. Defaults to MEDIUM for a good balance of quality and latency. |
| budget_tokens | int \| None | Explicit token budget for thinking (provider-specific). When set, overrides level. Recommended: 10000-50000 for complex tasks. Providers that don't support explicit budgets will map this to the nearest level. |
Examples:
```python
# Simple: use a thinking level
ThinkingConfig(level=ThinkingLevel.HIGH)

# Fine-grained: specify exact token budget
ThinkingConfig(budget_tokens=32000)  # For complex reasoning tasks

# Disable thinking
ThinkingConfig(level=ThinkingLevel.NONE)
```
get_effective_budget
¶
Get the effective token budget based on level or explicit budget.
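A usage sketch; per the budget_tokens description above, an explicit budget takes precedence over the level, so it is assumed to be returned as-is:

```python
tc = ThinkingConfig(level=ThinkingLevel.HIGH, budget_tokens=32000)
print(tc.get_effective_budget())  # 32000 (explicit budget overrides level)
```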
ThinkingLevel¶
Enum for thinking level presets.
ThinkingLevel
¶
Bases: str, Enum
Standardized thinking/reasoning effort levels across LLM providers.
LiteLLM translates these to provider-specific parameters:

- Anthropic: thinking={type, budget_tokens} with low→1024, medium→2048, high→4096
- OpenAI: reasoning_effort parameter for o-series and GPT-5 models
- Gemini: thinking configuration with similar token budgets
- DeepSeek: Standardized via LiteLLM's reasoning_effort
Higher levels mean more "thinking" tokens/steps, trading latency for quality.
ThinkingLevelLiteral¶
Type alias for thinking level strings.
ThinkingLevelLiteral
module-attribute
¶
ThinkingParam¶
Type alias for thinking parameter (level or budget).
ThinkingParam
module-attribute
¶
ThinkingParam = ThinkingConfig | ThinkingLevel | ThinkingLevelLiteral | int | None
Type for the thinking parameter in LLMConfig.
Accepts:

- ThinkingConfig: Full configuration object
- ThinkingLevel: Enum value (e.g., ThinkingLevel.HIGH)
- str: Level as string ("low", "medium", "high", "none")
- int: Direct token budget (e.g., 32000)
- None: Disable thinking