LLM Infrastructure

LLM client configuration, caching, and generation utilities.

Overview

AutoRubric uses LiteLLM for multi-provider LLM support. The LLMConfig class provides centralized configuration, while LLMClient handles request execution with caching, rate limiting, and retry logic.

Quick Example

from autorubric import LLMConfig, LLMClient, generate

# Configuration
config = LLMConfig(
    model="openai/gpt-4.1-mini",
    temperature=0.0,
    max_tokens=1024,
    cache_enabled=True,
    max_parallel_requests=10,
)

# Direct one-off generation (builds its own config from the given kwargs)
result = await generate(
    system_prompt="You are a helpful assistant.",
    user_prompt="Explain quantum computing.",
    model="openai/gpt-4.1-mini",
)
print(result)

# Or use the client
client = LLMClient(config)
result = await client.generate(
    system_prompt="You are a helpful assistant.",
    user_prompt="Explain quantum computing.",
)

Provider Configuration

| Provider  | Model Format                            | Environment Variable           |
| --------- | --------------------------------------- | ------------------------------ |
| OpenAI    | openai/gpt-4.1, openai/gpt-4.1-mini     | OPENAI_API_KEY                 |
| Anthropic | anthropic/claude-sonnet-4-5-20250929    | ANTHROPIC_API_KEY              |
| Google    | gemini/gemini-2.5-flash                 | GEMINI_API_KEY                 |
| Azure     | azure/openai/gpt-4.1                    | AZURE_API_KEY, AZURE_API_BASE  |
| Groq      | groq/llama-3.1-70b-versatile            | GROQ_API_KEY                   |
| Ollama    | ollama/qwen3:14b                        | None (local, no key needed)    |
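
Switching providers only requires changing the model prefix and exporting the matching key from the table above. A minimal sketch (the key value is a placeholder; the local URL is Ollama's default and only needed if your server differs):

# In your shell, export the key for the provider you use, e.g.:
#   export GEMINI_API_KEY="<your-key>"
from autorubric import LLMClient, LLMConfig

config = LLMConfig(model="gemini/gemini-2.5-flash", temperature=0.0)
client = LLMClient(config)

# Local Ollama needs no key; api_base is optional if the server runs on
# the default http://localhost:11434.
local_config = LLMConfig(model="ollama/qwen3:14b", api_base="http://localhost:11434")
local_client = LLMClient(local_config)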

YAML Configuration

# llm_config.yaml
model: openai/gpt-4.1
temperature: 0.0
max_tokens: 1024
cache_enabled: true
cache_ttl: 3600

# Load the config in Python, or write one back out
config = LLMConfig.from_yaml("llm_config.yaml")
config.to_yaml("llm_config_backup.yaml")

Extended Thinking

Enable step-by-step reasoning for complex evaluations:

# Level-based (cross-provider)
config = LLMConfig(
    model="anthropic/claude-sonnet-4-5-20250929",
    thinking="high",  # "low", "medium", "high", or "none"
)

# Token budget
config = LLMConfig(
    model="anthropic/claude-opus-4-5-20251101",
    thinking=32000,  # Explicit token budget
)

Supported providers: Anthropic, OpenAI (o-series), Gemini (2.5+), DeepSeek.
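
To inspect the reasoning trace itself, request a GenerateResult via return_thinking. A sketch (prompts are illustrative; thinking may be None if the provider returns no trace):

config = LLMConfig(
    model="anthropic/claude-sonnet-4-5-20250929",
    thinking="medium",
)
client = LLMClient(config)

result = await client.generate(
    system_prompt="You are a careful grader.",
    user_prompt="Does the answer satisfy the rubric criterion?",
    return_thinking=True,
)
print(result.content)   # final answer
print(result.thinking)  # reasoning trace, or None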

Response Caching

config = LLMConfig(
    model="openai/gpt-4.1-mini",
    cache_enabled=True,
    cache_dir=".autorubric_cache",
    cache_ttl=3600,  # 1 hour
)

client = LLMClient(config)
client.clear_cache()
stats = client.cache_stats()
# {'size': 1024, 'count': 10, 'directory': '.autorubric_cache'}
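
The use_cache argument to generate overrides the config default for a single call. A quick sketch reusing the client above:

# Skip the cache for one request even though cache_enabled=True
fresh = await client.generate(
    system_prompt="You are a helpful assistant.",
    user_prompt="Explain quantum computing.",
    use_cache=False,
)

# Force caching for one request even when cache_enabled=False
cached = await client.generate(
    system_prompt="You are a helpful assistant.",
    user_prompt="Explain quantum computing.",
    use_cache=True,
)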

Prompt Caching (Anthropic)

Reduce latency and cost on repeated calls. Enabled by default:

config = LLMConfig(
    model="anthropic/claude-sonnet-4-5-20250929",
    prompt_caching=True,  # Default
)
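
OpenAI and DeepSeek models need no extra configuration (their caching applies automatically to long prompts). To opt out entirely, disable the flag; a minimal sketch:

config = LLMConfig(
    model="anthropic/claude-sonnet-4-5-20250929",
    prompt_caching=False,  # no cache_control blocks or beta header are added
)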

LLMConfig

Central configuration class for LLM calls.

LLMConfig dataclass

LLMConfig(model: str, temperature: float = 0.0, max_tokens: int | None = None, top_p: float | None = None, timeout: float = 60.0, max_retries: int = 3, retry_min_wait: float = 1.0, retry_max_wait: float = 60.0, max_parallel_requests: int | None = None, cache_enabled: bool = False, cache_dir: str | Path = '.autorubric_cache', cache_ttl: int | None = None, api_key: str | None = None, api_base: str | None = None, thinking: ThinkingParam = None, prompt_caching: bool = True, seed: int | None = None, extra_headers: dict[str, str] = dict(), extra_params: dict[str, Any] = dict())

Configuration for LLM calls.

ATTRIBUTE DESCRIPTION
model

Model identifier in LiteLLM format (e.g., "openai/gpt-5.2", "anthropic/claude-sonnet-4-5-20250929", "gemini/gemini-3-pro-preview", "ollama/qwen3:14b"). REQUIRED - no default. See LiteLLM docs for full list of supported models.

TYPE: str

temperature

Sampling temperature (0.0 = deterministic).

TYPE: float

max_tokens

Maximum tokens in response.

TYPE: int | None

top_p

Nucleus sampling parameter.

TYPE: float | None

timeout

Request timeout in seconds.

TYPE: float

max_retries

Maximum retry attempts for transient failures.

TYPE: int

retry_min_wait

Minimum wait between retries (seconds).

TYPE: float

retry_max_wait

Maximum wait between retries (seconds).

TYPE: float

max_parallel_requests

Maximum concurrent requests to this model's provider. When set, a global per-provider semaphore limits parallel requests. None (default) means unlimited parallel requests.

TYPE: int | None

cache_enabled

Default caching behavior (can be overridden per-request).

TYPE: bool

cache_dir

Directory for response cache.

TYPE: str | Path

cache_ttl

Cache time-to-live in seconds (None = no expiration).

TYPE: int | None

api_key

Optional API key override (otherwise uses environment variables).

TYPE: str | None

api_base

Optional API base URL override.

TYPE: str | None

thinking

Enable thinking/reasoning mode (unified across providers). Accepts multiple formats:

- ThinkingLevel enum: ThinkingLevel.HIGH, ThinkingLevel.MEDIUM, etc.
- String: "low", "medium", "high", "none"
- Int: Direct token budget (e.g., 32000)
- ThinkingConfig: Full configuration with level and/or budget_tokens
- None: Disable thinking (default)

Provider support:

- Anthropic: Extended thinking (claude-sonnet-4-5, claude-opus-4-5+)
- OpenAI: Reasoning for o-series and GPT-5 models
- Gemini: Thinking mode (2.5+, 3.0+ models)
- DeepSeek: Reasoning content

TYPE: ThinkingParam

prompt_caching

Enable prompt caching for supported models (default: True). When enabled, automatically detects whether the model supports caching via litellm.supports_prompt_caching() and applies provider-specific config:

- Anthropic: Adds cache_control to system messages + beta header
- OpenAI/DeepSeek: Automatic for prompts ≥1024 tokens (no extra config)
- Bedrock: Supported for all models

Set to False to disable prompt caching entirely.

TYPE: bool

seed

Random seed for reproducible outputs (OpenAI, some other providers).

TYPE: int | None

extra_headers

Additional HTTP headers for provider-specific features.

TYPE: dict[str, str]

extra_params

Additional provider-specific parameters passed to LiteLLM.

TYPE: dict[str, Any]

Examples:

Basic usage without thinking

config = LLMConfig(model="openai/gpt-5.2")

Enable thinking with a level

config = LLMConfig(model="anthropic/claude-sonnet-4-5-20250929", thinking="high")
config = LLMConfig(model="openai/responses/gpt-5-mini", thinking=ThinkingLevel.HIGH)

Enable thinking with explicit token budget

config = LLMConfig(model="anthropic/claude-opus-4-5-20251101", thinking=32000)

Full control with ThinkingConfig

config = LLMConfig(
    model="gemini/gemini-2.5-pro",
    thinking=ThinkingConfig(level=ThinkingLevel.HIGH, budget_tokens=50000),
)

get_thinking_config

get_thinking_config() -> ThinkingConfig | None

Get normalized thinking configuration.

Source code in src/autorubric/llm.py
def get_thinking_config(self) -> ThinkingConfig | None:
    """Get normalized thinking configuration."""
    return _normalize_thinking_param(self.thinking)
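
Whatever form thinking was given in, get_thinking_config() returns a normalized ThinkingConfig, or None when thinking is disabled. A sketch, assuming string levels normalize to the matching ThinkingLevel:

config = LLMConfig(model="anthropic/claude-sonnet-4-5-20250929", thinking="high")
tc = config.get_thinking_config()
print(tc.get_reasoning_effort())  # expected: "high"

config = LLMConfig(model="openai/gpt-4.1-mini")  # thinking defaults to None
print(config.get_thinking_config())  # None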

from_yaml classmethod

from_yaml(path: str | Path) -> LLMConfig

Load LLMConfig from a YAML file.

PARAMETER DESCRIPTION
path

Path to YAML configuration file.

TYPE: str | Path

RETURNS DESCRIPTION
LLMConfig

LLMConfig instance with values from the YAML file.

RAISES DESCRIPTION
FileNotFoundError

If the file doesn't exist.

ValueError

If required fields are missing or invalid.

Example YAML file (llm_config.yaml):

    model: openai/gpt-5.2
    temperature: 0.0
    max_tokens: 1024
    cache_enabled: true
    cache_ttl: 3600

Source code in src/autorubric/llm.py
@classmethod
def from_yaml(cls, path: str | Path) -> LLMConfig:
    """Load LLMConfig from a YAML file.

    Args:
        path: Path to YAML configuration file.

    Returns:
        LLMConfig instance with values from the YAML file.

    Raises:
        FileNotFoundError: If the file doesn't exist.
        ValueError: If required fields are missing or invalid.

    Example YAML file (llm_config.yaml):
        model: openai/gpt-5.2
        temperature: 0.0
        max_tokens: 1024
        cache_enabled: true
        cache_ttl: 3600
    """
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"LLM config file not found: {path}")

    with open(path, encoding="utf-8") as f:
        data = yaml.safe_load(f)

    if not isinstance(data, dict):
        raise ValueError(f"Invalid YAML config: expected dict, got {type(data).__name__}")

    if "model" not in data:
        raise ValueError("LLM config YAML must specify 'model' field")

    # Handle extra_params specially - any unknown keys go there
    known_fields = {
        "model",
        "temperature",
        "max_tokens",
        "top_p",
        "timeout",
        "max_retries",
        "retry_min_wait",
        "retry_max_wait",
        "cache_enabled",
        "cache_dir",
        "cache_ttl",
        "api_key",
        "api_base",
        # Thinking/Reasoning
        "thinking",
        # Other provider-specific features
        "prompt_caching",
        "seed",
        "extra_headers",
        "extra_params",
    }
    extra = {k: v for k, v in data.items() if k not in known_fields}
    if extra:
        data.setdefault("extra_params", {}).update(extra)
        for k in extra:
            del data[k]

    return cls(**data)
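
Keys in the YAML that are not LLMConfig fields are collected into extra_params and forwarded to LiteLLM. A sketch (frequency_penalty stands in for any provider-specific parameter):

from pathlib import Path

Path("llm_config.yaml").write_text(
    "model: openai/gpt-4.1-mini\n"
    "temperature: 0.0\n"
    "frequency_penalty: 0.2\n"  # not an LLMConfig field
)

config = LLMConfig.from_yaml("llm_config.yaml")
print(config.extra_params)  # {'frequency_penalty': 0.2}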

to_yaml

to_yaml(path: str | Path) -> None

Save LLMConfig to a YAML file.

PARAMETER DESCRIPTION
path

Path to write YAML configuration file.

TYPE: str | Path

Source code in src/autorubric/llm.py
def to_yaml(self, path: str | Path) -> None:
    """Save LLMConfig to a YAML file.

    Args:
        path: Path to write YAML configuration file.
    """
    path = Path(path)
    data = asdict(self)

    # Convert Path to string for YAML serialization
    if isinstance(data.get("cache_dir"), Path):
        data["cache_dir"] = str(data["cache_dir"])

    # Remove None values and empty dicts for cleaner YAML
    data = {k: v for k, v in data.items() if v is not None and v != {}}

    with open(path, "w", encoding="utf-8") as f:
        yaml.safe_dump(data, f, default_flow_style=False, sort_keys=False)

LLMClient

Async client for LLM generation with caching and rate limiting.

LLMClient

LLMClient(config: LLMConfig)

Unified LLM client with retries, caching, and structured output support.

Uses diskcache for efficient, thread-safe response caching.

Initialize LLM client.

PARAMETER DESCRIPTION
config

LLMConfig instance. The model field is required.

TYPE: LLMConfig

RAISES DESCRIPTION
ValueError

If config.model is not specified.

Source code in src/autorubric/llm.py
def __init__(self, config: LLMConfig):
    """Initialize LLM client.

    Args:
        config: LLMConfig instance. The model field is required.

    Raises:
        ValueError: If config.model is not specified.
    """
    if not config.model:
        raise ValueError("LLMConfig.model is required and cannot be empty")

    self.config = config
    self._cache: diskcache.Cache | None = None

    if self.config.cache_enabled:
        self._init_cache()
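
When max_parallel_requests is set, concurrent generate calls to the same provider share a semaphore, so fan-out is safe. A sketch using asyncio.gather (prompts and counts are illustrative):

import asyncio

from autorubric import LLMClient, LLMConfig

config = LLMConfig(model="openai/gpt-4.1-mini", max_parallel_requests=5)
client = LLMClient(config)

async def grade(answer: str) -> str:
    return await client.generate(
        system_prompt="You are a strict grader. Reply PASS or FAIL.",
        user_prompt=answer,
    )

async def main() -> None:
    answers = [f"Candidate answer {i}" for i in range(20)]
    # At most 5 requests are in flight at once; the rest wait on the semaphore.
    results = await asyncio.gather(*(grade(a) for a in answers))
    print(results[:3])

asyncio.run(main())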

generate async

generate(system_prompt: str, user_prompt: str, response_format: type[T] | None = None, use_cache: bool | None = None, return_thinking: bool = False, return_result: bool = False, **kwargs: Any) -> str | T | GenerateResult

Generate LLM response with optional structured output.

PARAMETER DESCRIPTION
system_prompt

System message for the LLM.

TYPE: str

user_prompt

User message for the LLM.

TYPE: str

response_format

Optional Pydantic model class for structured output. When provided, LiteLLM uses the model's JSON schema to constrain the LLM output and returns a validated Pydantic instance.

TYPE: type[T] | None DEFAULT: None

use_cache

Whether to use caching for this request.

- None (default): Use config.cache_enabled setting
- True: Force cache usage (initializes cache if needed)
- False: Skip cache for this request

TYPE: bool | None DEFAULT: None

return_thinking

If True and thinking is enabled, return a GenerateResult with both content and thinking. If False (default), only return content. Note: When response_format is provided, thinking is injected into the 'reasoning' field if it exists, regardless of this setting.

TYPE: bool DEFAULT: False

return_result

If True, always return a GenerateResult with full details including usage statistics and completion cost. This is useful when you need to track token usage. When True, takes precedence over the default return behavior.

TYPE: bool DEFAULT: False

**kwargs

Override any LLMConfig parameters for this call.

TYPE: Any DEFAULT: {}

RETURNS DESCRIPTION
str | T | GenerateResult

If return_result=True or return_thinking=True: GenerateResult with content, thinking, usage, cost, and parsed (if response_format was provided).

str | T | GenerateResult

If response_format is None: String response from the LLM.

str | T | GenerateResult

If response_format is provided: Validated Pydantic model instance. If thinking is enabled and the response_format has a 'reasoning' field, it will be populated with the model's thinking trace.

RAISES DESCRIPTION
APIError

If all retries fail

ValidationError

If response doesn't match schema
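
A sketch of structured output combined with thinking (the CriterionVerdict model is hypothetical; its reasoning field is only populated when the provider returns a trace):

from pydantic import BaseModel

class CriterionVerdict(BaseModel):  # hypothetical schema
    met: bool
    reasoning: str = ""

config = LLMConfig(
    model="anthropic/claude-sonnet-4-5-20250929",
    thinking="medium",
)
client = LLMClient(config)

verdict = await client.generate(
    system_prompt="You are a rubric grader.",
    user_prompt="Criterion: cites a source. Answer: 'See Smith 2021.'",
    response_format=CriterionVerdict,
)
print(verdict.met)
print(verdict.reasoning)  # thinking trace, when available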

Source code in src/autorubric/llm.py
async def generate(
    self,
    system_prompt: str,
    user_prompt: str,
    response_format: type[T] | None = None,
    use_cache: bool | None = None,
    return_thinking: bool = False,
    return_result: bool = False,
    **kwargs: Any,
) -> str | T | GenerateResult:
    """Generate LLM response with optional structured output.

    Args:
        system_prompt: System message for the LLM.
        user_prompt: User message for the LLM.
        response_format: Optional Pydantic model class for structured output.
            When provided, LiteLLM uses the model's JSON schema to constrain
            the LLM output and returns a validated Pydantic instance.
        use_cache: Whether to use caching for this request.
            - None (default): Use config.cache_enabled setting
            - True: Force cache usage (initializes cache if needed)
            - False: Skip cache for this request
        return_thinking: If True and thinking is enabled, return a GenerateResult
            with both content and thinking. If False (default), only return content.
            Note: When response_format is provided, thinking is injected into the
            'reasoning' field if it exists, regardless of this setting.
        return_result: If True, always return a GenerateResult with full details
            including usage statistics and completion cost. This is useful when
            you need to track token usage. When True, takes precedence over the
            default return behavior.
        **kwargs: Override any LLMConfig parameters for this call.

    Returns:
        If return_result=True or return_thinking=True: GenerateResult with content,
            thinking, usage, cost, and parsed (if response_format was provided).
        If response_format is None: String response from the LLM.
        If response_format is provided: Validated Pydantic model instance.
            If thinking is enabled and the response_format has a 'reasoning'
            field, it will be populated with the model's thinking trace.

    Raises:
        litellm.APIError: If all retries fail
        pydantic.ValidationError: If response doesn't match schema
    """
    # Determine caching behavior for this request
    should_cache = use_cache if use_cache is not None else self.config.cache_enabled

    # Check cache first
    cache_key: str | None = None
    if should_cache:
        cache = self._ensure_cache()
        cache_key = self._cache_key(
            self.config.model, system_prompt, user_prompt, response_format
        )
        cached = cache.get(cache_key)
        if cached is not None:
            logger.debug(f"Cache hit for {cache_key[:8]}...")
            return cached  # type: ignore[return-value]

    # Build request parameters
    model = kwargs.get("model", self.config.model)

    # Check if this is an Anthropic model that supports prompt caching
    # Anthropic requires cache_control on message content; other providers
    # handle caching automatically (OpenAI, Deepseek) or don't support it
    is_anthropic = model.startswith("anthropic/") or model.startswith("claude")
    use_prompt_caching = self.config.prompt_caching and is_anthropic
    if use_prompt_caching:
        # Anthropic requires cache_control on message content
        messages = [
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": system_prompt,
                        "cache_control": {"type": "ephemeral"},
                    }
                ],
            },
            {"role": "user", "content": user_prompt},
        ]
    else:
        # Standard message format for other providers
        # OpenAI/Deepseek: Caching is automatic for prompts ≥1024 tokens
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ]

    params: dict[str, Any] = {
        "model": model,
        "messages": messages,
        "temperature": kwargs.get("temperature", self.config.temperature),
        "timeout": kwargs.get("timeout", self.config.timeout),
        **self.config.extra_params,
    }

    if self.config.max_tokens:
        params["max_tokens"] = kwargs.get("max_tokens", self.config.max_tokens)
    if self.config.top_p:
        params["top_p"] = kwargs.get("top_p", self.config.top_p)
    if self.config.api_key:
        params["api_key"] = self.config.api_key
    if self.config.api_base:
        params["api_base"] = self.config.api_base

    # Thinking/Reasoning configuration (unified across providers)
    thinking_config = self.config.get_thinking_config()
    if thinking_config is not None:
        # Determine whether to use reasoning_effort or explicit thinking dict
        # Use explicit budget_tokens when specified for fine-grained control
        # Otherwise use reasoning_effort for better cross-provider compatibility
        if thinking_config.budget_tokens is not None:
            # Explicit token budget - use thinking dict (Anthropic/Gemini style)
            params["thinking"] = {
                "type": "enabled",
                "budget_tokens": thinking_config.budget_tokens,
            }
        else:
            # Level-based - use reasoning_effort for cross-provider support
            # LiteLLM translates this to provider-specific parameters
            params["reasoning_effort"] = thinking_config.get_reasoning_effort()

    # Extra headers configuration
    extra_headers = dict(self.config.extra_headers)
    if use_prompt_caching:
        # Anthropic prompt caching requires beta header
        extra_headers["anthropic-beta"] = "prompt-caching-2024-07-31"
    if extra_headers:
        params["extra_headers"] = extra_headers

    if self.config.seed is not None:
        params["seed"] = self.config.seed

    # Enable structured output if Pydantic model provided
    if response_format is not None:
        params["response_format"] = response_format

    # Make request with retries
    retry_decorator = self._get_retry_decorator()
    thinking_content: str | None = None
    raw_response: Any = None

    @retry_decorator
    async def _call() -> str:
        nonlocal thinking_content, raw_response
        response = await litellm.acompletion(**params)
        raw_response = response

        message = response.choices[0].message

        # Extract thinking/reasoning content (standardized across providers)
        # LiteLLM provides unified `reasoning_content` field
        thinking_content = _extract_thinking_content(message)

        return message.content  # type: ignore[return-value]

    # Apply rate limiting if configured
    semaphore = await RateLimitPool.get_instance().get_semaphore(
        model, self.config.max_parallel_requests
    )
    if semaphore is not None:
        async with semaphore:
            response_content = await _call()
    else:
        response_content = await _call()

    # Extract usage and cost from the raw response
    usage = _extract_usage_from_response(raw_response)
    cost = _calculate_completion_cost(raw_response)

    # Parse structured output if requested
    parsed_response: T | None = None
    if response_format is not None:
        # LiteLLM returns JSON string when response_format is set
        # Parse it into the Pydantic model
        data = json.loads(response_content)

        # Inject thinking content into the reasoning field if available
        if thinking_content and "reasoning" in response_format.model_fields:
            data["reasoning"] = thinking_content

        parsed_response = response_format.model_validate(data)

    # Determine what to return
    result: str | T | GenerateResult
    if return_result or return_thinking:
        # Return full GenerateResult with all details
        result = GenerateResult(
            content=response_content,
            thinking=thinking_content,
            raw_response=raw_response,
            usage=usage,
            cost=cost,
            parsed=parsed_response,
        )
    elif response_format is not None:
        # Return just the parsed Pydantic model
        result = parsed_response  # type: ignore[assignment]
    else:
        # Return just the string content
        result = response_content

    # Cache the response (cache the parsed object for structured outputs)
    if should_cache and cache_key:
        cache = self._ensure_cache()
        cache.set(
            cache_key,
            result,
            expire=self.config.cache_ttl,
        )
        logger.debug(f"Cached response for {cache_key[:8]}...")

    return result

clear_cache

clear_cache() -> int

Clear all cached responses.

RETURNS DESCRIPTION
int

Number of entries cleared.

Source code in src/autorubric/llm.py
def clear_cache(self) -> int:
    """Clear all cached responses.

    Returns:
        Number of entries cleared.
    """
    if self._cache is None:
        return 0
    count = len(self._cache)
    self._cache.clear()
    return count

cache_stats

cache_stats() -> dict[str, Any]

Get cache statistics.

RETURNS DESCRIPTION
dict[str, Any]

Dict with 'size', 'count', and 'directory' keys.

Source code in src/autorubric/llm.py
def cache_stats(self) -> dict[str, Any]:
    """Get cache statistics.

    Returns:
        Dict with 'size', 'count', and 'directory' keys.
    """
    if self._cache is None:
        return {"size": 0, "count": 0, "directory": None}
    return {
        "size": self._cache.volume(),
        "count": len(self._cache),
        "directory": str(self._cache.directory),
    }

generate

Convenience function for one-off LLM generation.

generate async

generate(system_prompt: str, user_prompt: str, model: str, response_format: type[T] | None = None, **kwargs: Any) -> str | T

Simple one-shot generation function.

For repeated calls, prefer creating an LLMClient instance.

PARAMETER DESCRIPTION
system_prompt

System message for the LLM.

TYPE: str

user_prompt

User message for the LLM.

TYPE: str

model

Model identifier (REQUIRED).

TYPE: str

response_format

Optional Pydantic model for structured output.

TYPE: type[T] | None DEFAULT: None

**kwargs

Additional LLMConfig parameters.

TYPE: Any DEFAULT: {}

Example

Simple string response

response = await generate(
    "You are a helpful assistant.",
    "What is 2+2?",
    model="openai/gpt-5.2-mini",
)

Structured output

from pydantic import BaseModel

class MathAnswer(BaseModel):
    result: int
    explanation: str

answer = await generate(
    "You are a math tutor.",
    "What is 2+2?",
    model="openai/gpt-5.2-mini",
    response_format=MathAnswer,
)
print(answer.result)  # 4

Source code in src/autorubric/llm.py
async def generate(
    system_prompt: str,
    user_prompt: str,
    model: str,
    response_format: type[T] | None = None,
    **kwargs: Any,
) -> str | T:
    """Simple one-shot generation function.

    For repeated calls, prefer creating an LLMClient instance.

    Args:
        system_prompt: System message for the LLM.
        user_prompt: User message for the LLM.
        model: Model identifier (REQUIRED).
        response_format: Optional Pydantic model for structured output.
        **kwargs: Additional LLMConfig parameters.

    Example:
        # Simple string response
        response = await generate(
            "You are a helpful assistant.",
            "What is 2+2?",
            model="openai/gpt-5.2-mini"
        )

        # Structured output
        from pydantic import BaseModel

        class MathAnswer(BaseModel):
            result: int
            explanation: str

        answer = await generate(
            "You are a math tutor.",
            "What is 2+2?",
            model="openai/gpt-5.2-mini",
            response_format=MathAnswer
        )
        print(answer.result)  # 4
    """
    config = LLMConfig(model=model, **kwargs)
    client = LLMClient(config)
    return await client.generate(system_prompt, user_prompt, response_format=response_format)

GenerateResult

Result from LLM generation.

GenerateResult dataclass

GenerateResult(content: str, thinking: str | None = None, raw_response: Any = None, usage: 'TokenUsage | None' = None, cost: float | None = None, parsed: Any = None)

Result from LLM generation including content, thinking, and usage statistics.

ATTRIBUTE DESCRIPTION
content

The main response content from the LLM (raw string).

TYPE: str

thinking

The thinking/reasoning trace if thinking was enabled. None if thinking was not enabled or the provider doesn't support it.

TYPE: str | None

raw_response

The raw LiteLLM response object for advanced use cases.

TYPE: Any

usage

Token usage statistics from this LLM call.

TYPE: 'TokenUsage | None'

cost

Completion cost in USD for this LLM call, calculated using LiteLLM's completion_cost() function. None if cost calculation fails.

TYPE: float | None

parsed

The parsed Pydantic model instance when response_format was provided. None if no response_format was used or parsing failed.

TYPE: Any
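
A sketch of tracking tokens and spend with return_result=True (exact TokenUsage fields depend on the library version, so the example only prints the objects):

config = LLMConfig(model="openai/gpt-4.1-mini")
client = LLMClient(config)

result = await client.generate(
    system_prompt="You are a helpful assistant.",
    user_prompt="Summarize the rubric in one sentence.",
    return_result=True,
)
print(result.content)
print(result.usage)  # TokenUsage for this call, or None
print(result.cost)   # estimated USD cost via LiteLLM, or None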


ThinkingConfig

Configuration for extended thinking/reasoning.

ThinkingConfig dataclass

ThinkingConfig(level: ThinkingLevel | ThinkingLevelLiteral = MEDIUM, budget_tokens: int | None = None)

Detailed configuration for LLM thinking/reasoning.

Provides a uniform interface across providers:

- Anthropic: Extended thinking (claude-sonnet-4-5, claude-opus-4-5+)
- OpenAI: Reasoning (o-series, GPT-5 models via openai/responses/ prefix)
- Gemini: Thinking mode (2.5+, 3.0+ models)
- DeepSeek: Reasoning content

ATTRIBUTE DESCRIPTION
level

High-level thinking effort. Used when budget_tokens is not set. Defaults to MEDIUM for a good balance of quality and latency.

TYPE: ThinkingLevel | ThinkingLevelLiteral

budget_tokens

Explicit token budget for thinking (provider-specific). When set, overrides level. Recommended: 10000-50000 for complex tasks. Providers that don't support explicit budgets will map this to the nearest level.

TYPE: int | None

Examples:

Simple: use a thinking level

ThinkingConfig(level=ThinkingLevel.HIGH)

Fine-grained: specify exact token budget

ThinkingConfig(budget_tokens=32000) # For complex reasoning tasks

Disable thinking

ThinkingConfig(level=ThinkingLevel.NONE)

get_effective_budget

get_effective_budget() -> int

Get the effective token budget based on level or explicit budget.

Source code in src/autorubric/llm.py
def get_effective_budget(self) -> int:
    """Get the effective token budget based on level or explicit budget."""
    if self.budget_tokens is not None:
        return self.budget_tokens
    return THINKING_LEVEL_BUDGETS.get(self.level.value, 2048)

get_reasoning_effort

get_reasoning_effort() -> str

Get the reasoning_effort string for LiteLLM.

Source code in src/autorubric/llm.py
def get_reasoning_effort(self) -> str:
    """Get the reasoning_effort string for LiteLLM."""
    return self.level.value
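
A quick sketch of the two accessors (the 4096-token figure assumes the level-to-budget mapping described under ThinkingLevel below):

tc = ThinkingConfig(level=ThinkingLevel.HIGH)
print(tc.get_reasoning_effort())  # "high"
print(tc.get_effective_budget())  # e.g. 4096 under the default level budgets

tc = ThinkingConfig(budget_tokens=32000)
print(tc.get_effective_budget())  # 32000; an explicit budget wins over the level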

ThinkingLevel

Enum for thinking level presets.

ThinkingLevel

Bases: str, Enum

Standardized thinking/reasoning effort levels across LLM providers.

LiteLLM translates these to provider-specific parameters:

- Anthropic: thinking={type, budget_tokens} with low→1024, medium→2048, high→4096
- OpenAI: reasoning_effort parameter for o-series and GPT-5 models
- Gemini: thinking configuration with similar token budgets
- DeepSeek: Standardized via LiteLLM's reasoning_effort

Higher levels mean more "thinking" tokens/steps, trading latency for quality.


ThinkingLevelLiteral

Type alias for thinking level strings.

ThinkingLevelLiteral module-attribute

ThinkingLevelLiteral = Literal['none', 'low', 'medium', 'high']

ThinkingParam

Type alias for thinking parameter (level or budget).

ThinkingParam module-attribute

ThinkingParam = ThinkingConfig | ThinkingLevel | ThinkingLevelLiteral | int | None

Type for the thinking parameter in LLMConfig.

Accepts:

- ThinkingConfig: Full configuration object
- ThinkingLevel: Enum value (e.g., ThinkingLevel.HIGH)
- str: Level as string ("low", "medium", "high", "none")
- int: Direct token budget (e.g., 32000)
- None: Disable thinking
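
The forms are interchangeable when constructing an LLMConfig; for example, the first three lines below request the same high effort, and the last disables thinking:

LLMConfig(model="anthropic/claude-sonnet-4-5-20250929", thinking="high")
LLMConfig(model="anthropic/claude-sonnet-4-5-20250929", thinking=ThinkingLevel.HIGH)
LLMConfig(model="anthropic/claude-sonnet-4-5-20250929", thinking=ThinkingConfig(level=ThinkingLevel.HIGH))
LLMConfig(model="anthropic/claude-sonnet-4-5-20250929", thinking=None)  # disabled (the default)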