LLM Infrastructure¶
LLM client configuration, caching, and generation utilities.
Overview¶
AutoRubric uses LiteLLM for multi-provider LLM support. The LLMConfig class provides centralized configuration, while LLMClient handles request execution with caching, rate limiting, and retry logic.
Quick Example¶
```python
from autorubric import LLMConfig, LLMClient, generate

# Configuration
config = LLMConfig(
    model="openai/gpt-4.1-mini",
    temperature=0.0,
    max_tokens=1024,
    cache_enabled=True,
    max_parallel_requests=10,
)

# One-off generation with the convenience function
result = await generate(
    system_prompt="You are a helpful assistant.",
    user_prompt="Explain quantum computing.",
    model="openai/gpt-4.1-mini",
)
print(result)

# Or use the client for repeated calls
client = LLMClient(config)
result = await client.generate(
    system_prompt="You are a helpful assistant.",
    user_prompt="Explain quantum computing.",
)
print(result)
```
Provider Configuration¶
| Provider | Model Format | Environment Variable |
|---|---|---|
| OpenAI | openai/gpt-4.1, openai/gpt-4.1-mini | OPENAI_API_KEY |
| Anthropic | anthropic/claude-sonnet-4-5-20250929 | ANTHROPIC_API_KEY |
| Gemini | gemini/gemini-2.5-flash | GEMINI_API_KEY |
| Azure | azure/openai/gpt-4.1 | AZURE_API_KEY, AZURE_API_BASE |
| Groq | groq/llama-3.1-70b-versatile | GROQ_API_KEY |
| Ollama | ollama/qwen3:14b | (local, no key needed) |
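A short sketch of selecting providers with LLMConfig; API keys are read from the environment variables listed above, and the explicit Azure override values are illustrative placeholders:

```python
from autorubric import LLMConfig

# Keys are read from the environment (e.g. ANTHROPIC_API_KEY) by default
anthropic_config = LLMConfig(model="anthropic/claude-sonnet-4-5-20250929")

# Local Ollama model: no API key required
local_config = LLMConfig(model="ollama/qwen3:14b")

# Explicit overrides instead of environment variables (placeholder values)
azure_config = LLMConfig(
    model="azure/openai/gpt-4.1",
    api_key="<your-azure-api-key>",
    api_base="<your-azure-endpoint>",
)
```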
YAML Configuration¶
```yaml
# llm_config.yaml
model: openai/gpt-4.1
temperature: 0.0
max_tokens: 1024
cache_enabled: true
cache_ttl: 3600
```
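Such a file can be loaded back with the LLMConfig.from_yaml classmethod documented below:

```python
from autorubric import LLMConfig

# Load the settings shown above from disk
config = LLMConfig.from_yaml("llm_config.yaml")
```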
Extended Thinking¶
Enable step-by-step reasoning for complex evaluations:
```python
# Level-based (cross-provider)
config = LLMConfig(
    model="anthropic/claude-sonnet-4-5-20250929",
    thinking="high",  # "low", "medium", "high", or "none"
)

# Token budget
config = LLMConfig(
    model="anthropic/claude-opus-4-5-20251101",
    thinking=32000,  # Explicit token budget
)
```
Supported providers: Anthropic, OpenAI (o-series), Gemini (2.5+), DeepSeek.
Response Caching¶
```python
config = LLMConfig(
    model="openai/gpt-4.1-mini",
    cache_enabled=True,
    cache_dir=".autorubric_cache",
    cache_ttl=3600,  # 1 hour
)

client = LLMClient(config)
client.clear_cache()
stats = client.cache_stats()
# {'size': 1024, 'count': 10, 'directory': '.autorubric_cache'}
```
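Caching can also be toggled per request via the use_cache argument to LLMClient.generate (see the generate reference below); a brief sketch:

```python
# Bypass the cache for a single call even though cache_enabled=True
fresh = await client.generate(
    system_prompt="You are a helpful assistant.",
    user_prompt="Explain quantum computing.",
    use_cache=False,
)
```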
Prompt Caching (Anthropic)¶
Prompt caching reduces latency and cost on repeated calls and is enabled by default.
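A minimal sketch using the prompt_caching field documented under LLMConfig:

```python
# Prompt caching is applied automatically for supported models;
# set prompt_caching=False to opt out entirely.
config = LLMConfig(
    model="anthropic/claude-sonnet-4-5-20250929",
    prompt_caching=True,  # default
)
```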
LLMConfig¶
Central configuration class for LLM calls.
LLMConfig
dataclass
¶
LLMConfig(model: str, temperature: float = 0.0, max_tokens: int | None = None, top_p: float | None = None, timeout: float = 60.0, max_retries: int = 3, retry_min_wait: float = 1.0, retry_max_wait: float = 60.0, max_parallel_requests: int | None = None, cache_enabled: bool = False, cache_dir: str | Path = '.autorubric_cache', cache_ttl: int | None = None, api_key: str | None = None, api_base: str | None = None, thinking: ThinkingParam = None, prompt_caching: bool = True, seed: int | None = None, extra_headers: dict[str, str] = dict(), extra_params: dict[str, Any] = dict())
Configuration for LLM calls.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| model | str | Model identifier in LiteLLM format (e.g., "openai/gpt-5.2", "anthropic/claude-sonnet-4-5-20250929", "gemini/gemini-3-pro-preview", "ollama/qwen3:14b"). REQUIRED, no default. See the LiteLLM docs for the full list of supported models. |
| temperature | float | Sampling temperature (0.0 = deterministic). |
| max_tokens | int \| None | Maximum tokens in the response. |
| top_p | float \| None | Nucleus sampling parameter. |
| timeout | float | Request timeout in seconds. |
| max_retries | int | Maximum retry attempts for transient failures. |
| retry_min_wait | float | Minimum wait between retries (seconds). |
| retry_max_wait | float | Maximum wait between retries (seconds). |
| max_parallel_requests | int \| None | Maximum concurrent requests to this model's provider. When set, a global per-provider semaphore limits parallel requests. None (default) means unlimited parallel requests. |
| cache_enabled | bool | Default caching behavior (can be overridden per request). |
| cache_dir | str \| Path | Directory for the response cache. |
| cache_ttl | int \| None | Cache time-to-live in seconds (None = no expiration). |
| api_key | str \| None | Optional API key override (otherwise read from environment variables). |
| api_base | str \| None | Optional API base URL override. |
| thinking | ThinkingParam | Enable thinking/reasoning mode (unified across providers). Accepts a ThinkingLevel enum (e.g. ThinkingLevel.HIGH), a string ("low", "medium", "high", "none"), an int token budget (e.g. 32000), a ThinkingConfig with level and/or budget_tokens, or None to disable thinking (default). Provider support: Anthropic extended thinking (claude-sonnet-4-5, claude-opus-4-5+), OpenAI reasoning (o-series and GPT-5 models), Gemini thinking mode (2.5+, 3.0+ models), DeepSeek reasoning content. |
| prompt_caching | bool | Enable prompt caching for supported models (default: True). When enabled, automatically detects whether the model supports caching via litellm.supports_prompt_caching() and applies provider-specific config: Anthropic adds cache_control to system messages plus a beta header; OpenAI/DeepSeek cache automatically for prompts ≥ 1024 tokens (no extra config); Bedrock is supported for all models. Set to False to disable prompt caching entirely. |
| seed | int \| None | Random seed for reproducible outputs (OpenAI and some other providers). |
| extra_headers | dict[str, str] | Additional HTTP headers for provider-specific features. |
| extra_params | dict[str, Any] | Additional provider-specific parameters passed to LiteLLM. |
Examples:
```python
# Basic usage without thinking
config = LLMConfig(model="openai/gpt-5.2")

# Enable thinking with a level
config = LLMConfig(model="anthropic/claude-sonnet-4-5-20250929", thinking="high")
config = LLMConfig(model="openai/responses/gpt-5-mini", thinking=ThinkingLevel.HIGH)

# Enable thinking with explicit token budget
config = LLMConfig(model="anthropic/claude-opus-4-5-20251101", thinking=32000)

# Full control with ThinkingConfig
config = LLMConfig(
    model="gemini/gemini-2.5-pro",
    thinking=ThinkingConfig(level=ThinkingLevel.HIGH, budget_tokens=50000),
)
```
get_thinking_config
¶
get_thinking_config() -> ThinkingConfig | None
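A usage sketch, assuming this method resolves the configured thinking parameter into a ThinkingConfig and returns None when thinking is disabled:

```python
config = LLMConfig(model="anthropic/claude-sonnet-4-5-20250929", thinking="high")
thinking_config = config.get_thinking_config()  # ThinkingConfig instance (assumed)

config = LLMConfig(model="openai/gpt-5.2")  # thinking disabled by default
print(config.get_thinking_config())  # None (assumed)
```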
from_yaml
classmethod
¶
from_yaml(path: str | Path) -> LLMConfig
Load LLMConfig from a YAML file.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| path | str \| Path | Path to YAML configuration file. |

| RETURNS | DESCRIPTION |
|---|---|
| LLMConfig | LLMConfig instance with values from the YAML file. |

| RAISES | DESCRIPTION |
|---|---|
| FileNotFoundError | If the file doesn't exist. |
| ValueError | If required fields are missing or invalid. |

Example YAML file (llm_config.yaml):

```yaml
model: openai/gpt-5.2
temperature: 0.0
max_tokens: 1024
cache_enabled: true
cache_ttl: 3600
```
to_yaml
¶
Save LLMConfig to a YAML file.
| PARAMETER | DESCRIPTION |
|---|---|
| path | Path to write YAML configuration file. |
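A round-trip sketch with to_yaml and from_yaml (the file name is illustrative):

```python
from autorubric import LLMConfig

config = LLMConfig(model="openai/gpt-4.1-mini", cache_enabled=True)
config.to_yaml("llm_config.yaml")                   # write settings to disk
restored = LLMConfig.from_yaml("llm_config.yaml")   # load them back
```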
LLMClient¶
Async client for LLM generation with caching and rate limiting.
LLMClient
¶
LLMClient(config: LLMConfig)
Unified LLM client with retries, caching, and structured output support.
Uses diskcache for efficient, thread-safe response caching.
Initialize LLM client.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| config | LLMConfig | LLMConfig instance. The model field is required. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If config.model is not specified. |
generate
async
¶
generate(system_prompt: str, user_prompt: str, response_format: type[T] | None = None, use_cache: bool | None = None, return_thinking: bool = False, return_result: bool = False, **kwargs: Any) -> str | T | GenerateResult
Generate LLM response with optional structured output.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| system_prompt | str | System message for the LLM. |
| user_prompt | str | User message for the LLM. |
| response_format | type[T] \| None | Optional Pydantic model class for structured output. When provided, LiteLLM uses the model's JSON schema to constrain the LLM output and returns a validated Pydantic instance. |
| use_cache | bool \| None | Whether to use caching for this request. None (default): use the config.cache_enabled setting. True: force cache usage (initializes the cache if needed). False: skip the cache for this request. |
| return_thinking | bool | If True and thinking is enabled, return a GenerateResult with both content and thinking. If False (default), only return content. Note: when response_format is provided, thinking is injected into the 'reasoning' field if it exists, regardless of this setting. |
| return_result | bool | If True, always return a GenerateResult with full details including usage statistics and completion cost. Useful when you need to track token usage. When True, takes precedence over the default return behavior. |
| **kwargs | Any | Override any LLMConfig parameters for this call. |

| RETURNS | DESCRIPTION |
|---|---|
| str \| T \| GenerateResult | If return_result=True or return_thinking=True: a GenerateResult with content, thinking, usage, cost, and parsed (if response_format was provided). If response_format is None: the string response from the LLM. If response_format is provided: a validated Pydantic model instance; if thinking is enabled and the response_format has a 'reasoning' field, it is populated with the model's thinking trace. |

| RAISES | DESCRIPTION |
|---|---|
| APIError | If all retries fail. |
| ValidationError | If the response doesn't match the schema. |
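A sketch combining structured output and usage tracking; the Verdict model is illustrative:

```python
from pydantic import BaseModel

from autorubric import LLMClient, LLMConfig


class Verdict(BaseModel):
    score: int
    reasoning: str


client = LLMClient(LLMConfig(model="openai/gpt-4.1-mini"))

# return_result=True yields a GenerateResult with usage and cost details
result = await client.generate(
    system_prompt="You are a strict grader.",
    user_prompt="Score this answer from 1 to 5: 'The sky is green.'",
    response_format=Verdict,
    return_result=True,
)
print(result.parsed.score)        # validated Verdict instance
print(result.usage, result.cost)  # token usage and cost in USD
```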
clear_cache
¶
Clear all cached responses.
| RETURNS | DESCRIPTION |
|---|---|
| int | Number of entries cleared. |
cache_stats
¶
Get cache statistics.
| RETURNS | DESCRIPTION |
|---|---|
| dict[str, Any] | Dict with 'size', 'count', and 'directory' keys. |
generate¶
Convenience function for one-off LLM generation.
generate
async
¶
generate(system_prompt: str, user_prompt: str, model: str, response_format: type[T] | None = None, **kwargs: Any) -> str | T
Simple one-shot generation function.
For repeated calls, prefer creating an LLMClient instance.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| system_prompt | str | System message for the LLM. |
| user_prompt | str | User message for the LLM. |
| model | str | Model identifier (REQUIRED). |
| response_format | type[T] \| None | Optional Pydantic model for structured output. |
| **kwargs | Any | Additional LLMConfig parameters. |
Example
```python
# Simple string response
response = await generate(
    "You are a helpful assistant.",
    "What is 2+2?",
    model="openai/gpt-5.2-mini",
)

# Structured output
from pydantic import BaseModel

class MathAnswer(BaseModel):
    result: int
    explanation: str

answer = await generate(
    "You are a math tutor.",
    "What is 2+2?",
    model="openai/gpt-5.2-mini",
    response_format=MathAnswer,
)
print(answer.result)  # 4
```
GenerateResult¶
Result from LLM generation.
GenerateResult
dataclass
¶
GenerateResult(content: str, thinking: str | None = None, raw_response: Any = None, usage: 'TokenUsage | None' = None, cost: float | None = None, parsed: Any = None)
Result from LLM generation including content, thinking, and usage statistics.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| content | str | The main response content from the LLM (raw string). |
| thinking | str \| None | The thinking/reasoning trace if thinking was enabled. None if thinking was not enabled or the provider doesn't support it. |
| raw_response | Any | The raw LiteLLM response object for advanced use cases. |
| usage | TokenUsage \| None | Token usage statistics from this LLM call. |
| cost | float \| None | Completion cost in USD for this LLM call, calculated using LiteLLM's completion_cost() function. None if cost calculation fails. |
| parsed | Any | The parsed Pydantic model instance when response_format was provided. None if no response_format was used or parsing failed. |
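A short sketch of reading these fields from a call made with return_result=True (reusing the client from the quick example):

```python
result = await client.generate(
    system_prompt="You are a helpful assistant.",
    user_prompt="Explain quantum computing.",
    return_result=True,
)
print(result.content)    # raw response text
print(result.thinking)   # None unless thinking is enabled and supported
print(result.usage)      # token usage statistics, if available
print(result.cost)       # completion cost in USD, or None
```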
ThinkingConfig¶
Configuration for extended thinking/reasoning.
ThinkingConfig
dataclass
¶
ThinkingConfig(level: ThinkingLevel | ThinkingLevelLiteral = MEDIUM, budget_tokens: int | None = None)
Detailed configuration for LLM thinking/reasoning.
Provides a uniform interface across providers:

- Anthropic: Extended thinking (claude-sonnet-4-5, claude-opus-4-5+)
- OpenAI: Reasoning (o-series, GPT-5 models via openai/responses/ prefix)
- Gemini: Thinking mode (2.5+, 3.0+ models)
- DeepSeek: Reasoning content
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| level | ThinkingLevel \| ThinkingLevelLiteral | High-level thinking effort. Used when budget_tokens is not set. Defaults to MEDIUM for a good balance of quality and latency. |
| budget_tokens | int \| None | Explicit token budget for thinking (provider-specific). When set, overrides level. Recommended: 10000-50000 for complex tasks. Providers that don't support explicit budgets will map this to the nearest level. |
Examples:
```python
# Simple: use a thinking level
ThinkingConfig(level=ThinkingLevel.HIGH)

# Fine-grained: specify exact token budget
ThinkingConfig(budget_tokens=32000)  # For complex reasoning tasks

# Disable thinking
ThinkingConfig(level=ThinkingLevel.NONE)
```
get_effective_budget
¶
Get the effective token budget based on level or explicit budget.
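A usage sketch; per the budget_tokens description above, an explicit budget takes precedence over the level, so it is assumed to be returned as-is:

```python
tc = ThinkingConfig(level=ThinkingLevel.HIGH, budget_tokens=32000)
print(tc.get_effective_budget())  # 32000 (explicit budget overrides level)
```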
ThinkingLevel¶
Enum for thinking level presets.
ThinkingLevel
¶
Bases: str, Enum
Standardized thinking/reasoning effort levels across LLM providers.
LiteLLM translates these to provider-specific parameters:

- Anthropic: thinking={type, budget_tokens} with low→1024, medium→2048, high→4096
- OpenAI: reasoning_effort parameter for o-series and GPT-5 models
- Gemini: thinking configuration with similar token budgets
- DeepSeek: Standardized via LiteLLM's reasoning_effort
Higher levels mean more "thinking" tokens/steps, trading latency for quality.
ThinkingLevelLiteral¶
Type alias for thinking level strings.
ThinkingLevelLiteral
module-attribute
¶
ThinkingParam¶
Type alias for thinking parameter (level or budget).
ThinkingParam
module-attribute
¶
ThinkingParam = ThinkingConfig | ThinkingLevel | ThinkingLevelLiteral | int | None
Type for the thinking parameter in LLMConfig.
Accepts:

- ThinkingConfig: Full configuration object
- ThinkingLevel: Enum value (e.g., ThinkingLevel.HIGH)
- str: Level as string ("low", "medium", "high", "none")
- int: Direct token budget (e.g., 32000)
- None: Disable thinking