
Length Penalty

Control verbosity bias by penalizing excessively long responses.

Overview

LLM judges often prefer longer answers, a phenomenon known as verbosity bias. The length penalty feature provides a configurable mechanism to penalize excessively verbose outputs without requiring changes to the rubric itself.

Research Background

Dubois et al. (2024) document verbosity bias extensively in their length-controlled AlpacaEval work. Length penalty helps reduce verbosity-driven score inflation by adding conciseness as an implicit scoring dimension.

Quick Example

from autorubric import Rubric, LLMConfig, LengthPenalty
from autorubric.graders import CriterionGrader

grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
    length_penalty=LengthPenalty(
        free_budget=6000,        # No penalty at or below this count
        max_cap=8000,            # Maximum penalty at/above this count
        penalty_at_cap=0.5,      # Max penalty to subtract from score
        exponent=1.6,            # Curve steepness
        penalty_type="ALL",      # "ALL", "OUTPUT_ONLY", "THINKING_ONLY"
    ),
)

# `rubric` is a previously constructed Rubric; `response` is the text to grade
result = await rubric.grade(to_grade=response, grader=grader)

Penalty Formula

if count <= free_budget:
    penalty = 0
elif count >= max_cap:
    penalty = penalty_at_cap
else:
    frac = (count - free_budget) / (max_cap - free_budget)
    penalty = penalty_at_cap * (frac ** exponent)

final_score = max(0.0, base_score - penalty)
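
For example, with the Quick Example settings above (free_budget=6000, max_cap=8000, penalty_at_cap=0.5, exponent=1.6), a count of 7000 lands halfway through the penalized range:

frac = (7000 - 6000) / (8000 - 6000)  # 0.5
penalty = 0.5 * (0.5 ** 1.6)          # ≈ 0.165
final_score = max(0.0, base_score - 0.165)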

Custom Count Functions
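
By default, LengthPenalty counts whitespace-separated words; to count model tokens instead, supply a tokenizer-backed count_fn: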

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
    length_penalty=LengthPenalty(
        free_budget=8000,
        max_cap=10000,
        count_fn=lambda t: len(tokenizer.encode(t)),  # Token count
    ),
)
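
Any callable mapping str to int works here. As an illustration (tiktoken is not an autorubric dependency), a tiktoken-based counter for OpenAI-style tokenization follows the same pattern:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
penalty = LengthPenalty(
    free_budget=8000,
    max_cap=10000,
    count_fn=lambda text: len(enc.encode(text)),
)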

Thinking/Output Separation

For models with separate thinking and output sections:

# Dict format
await rubric.grade(
    to_grade={
        "thinking": "Let me reason through this...",
        "output": "The final answer is 42"
    },
    grader=grader
)

# String with markers
await rubric.grade(
    to_grade="<thinking>My reasoning...</thinking><output>Final answer</output>",
    grader=grader
)

# Penalty type selection
penalty = LengthPenalty(
    free_budget=8000,
    max_cap=10000,
    penalty_at_cap=0.5,
    penalty_type="OUTPUT_ONLY"  # Only count output, not thinking
)
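
To sanity-check which sections a configuration counts, compute_length_penalty (documented below) can be called directly on a thinking/output dict. The autorubric.utils import path here is inferred from the source listing further down:

from autorubric import LengthPenalty
from autorubric.utils import compute_length_penalty

config = LengthPenalty(free_budget=10, max_cap=20, penalty_at_cap=0.5, penalty_type="OUTPUT_ONLY")

# Long thinking, short output: OUTPUT_ONLY counts only the 4 output words,
# which is under free_budget, so no penalty applies
value = compute_length_penalty(
    {"thinking": "step one ... " * 50, "output": "the answer is 42"},
    config=config,
)
assert value == 0.0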

Training/RL Use Cases

For reinforcement learning, use unnormalized scores with absolute penalties:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
    normalize=False,  # Raw weighted sums
    length_penalty=LengthPenalty(
        free_budget=8000,
        max_cap=10000,
        penalty_at_cap=50.0,  # Absolute penalty
        exponent=1.6,
        count_fn=lambda text: len(tokenizer.encode(text, add_special_tokens=False))
    ),
)
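
Because normalize=False yields raw weighted sums, penalty_at_cap must be on the same scale as those sums: up to 50 points here, rather than a 0-1 fraction.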

LengthPenalty

Configuration for length-based score penalty.

LengthPenalty

Bases: BaseModel

Configuration for applying length-based penalties during grading.

The penalty is computed as:

- 0 if count <= free_budget
- penalty_at_cap if count >= max_cap
- penalty_at_cap * ((count - free_budget) / (max_cap - free_budget)) ** exponent otherwise

By default, the penalty is subtracted from the final score (which is normalized to 0-1). For training use cases with raw scores, use absolute penalty values (e.g., 50.0).

ATTRIBUTES

- free_budget (int): Number of tokens/words allowed before any penalty applies.

- max_cap (int): Number of tokens/words at which the maximum penalty is applied.

- penalty_at_cap (float): Maximum penalty value (always subtracted from the score). For normalized scores, use fractional values like 0.5 (lose up to 50% of the score). For training with raw scores, use absolute values like 50.0 (subtract up to 50 points).

- exponent (float): Controls the penalty curve steepness. Higher values are more lenient near free_budget.

- count_fn (CountFn | None): Function to count tokens/words in text. If None, whitespace word count is used. For accurate token counting, pass a tokenizer-based function like: lambda text: len(tokenizer.encode(text))

- penalty_type (PenaltyType): Which text to count for the penalty calculation: "ALL" counts both thinking and output tokens (default), "OUTPUT_ONLY" counts only output tokens (useful for RL training), "THINKING_ONLY" counts only thinking tokens.

Example

Default: word-based counting with sensible defaults for normalized scores

penalty = LengthPenalty()

For training with raw (unnormalized) scores, use absolute penalty values:

penalty = LengthPenalty(
    free_budget=8000,
    max_cap=10000,
    penalty_at_cap=50.0,  # Subtract up to 50 points from raw score
    exponent=1.6,
)

Custom tokenizer-based counting (e.g., with HuggingFace)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
penalty = LengthPenalty(
    free_budget=8000,
    max_cap=10000,
    count_fn=lambda text: len(tokenizer.encode(text)),
)

Only penalize output tokens (allow long thinking)

penalty = LengthPenalty(
    free_budget=8000,
    penalty_type="OUTPUT_ONLY",
)


compute_length_penalty

Compute the penalty value for a given text.

compute_length_penalty

compute_length_penalty(text: str | ThinkingOutputDict, config: LengthPenalty) -> float

Compute the length penalty for the given text based on the config.

The penalty follows an exponential curve:

- Returns 0 if the word/token count is at or below free_budget
- Returns penalty_at_cap if the count is at or above max_cap
- Returns an interpolated value between those bounds using the exponent

PARAMETERS

- text (str | ThinkingOutputDict): Either a string (backwards compatible) or a dict with 'thinking' and 'output' keys. When a string is provided, it is treated as all output (no thinking section).

- config (LengthPenalty): LengthPenalty configuration specifying thresholds, penalty, and which sections to count based on penalty_type.

RETURNS

- float: A penalty value between 0 and penalty_at_cap to subtract from the score.

Source code in src/autorubric/utils.py
def compute_length_penalty(text: str | ThinkingOutputDict, config: LengthPenalty) -> float:
    """Compute the length penalty for the given text based on the config.

    The penalty follows an exponential curve:
    - Returns 0 if word/token count is at or below free_budget
    - Returns penalty_at_cap if count is at or above max_cap
    - Returns an interpolated value between those bounds using the exponent

    Args:
        text: Either a string (backwards compatible) or a dict with 'thinking'
            and 'output' keys. When a string is provided, it's treated as
            all output (no thinking section).
        config: LengthPenalty configuration specifying thresholds, penalty,
            and which sections to count based on penalty_type.

    Returns:
        A penalty value between 0 and penalty_at_cap to subtract from the score.
    """
    # Normalize input to dict format
    if isinstance(text, str):
        # Backwards compatibility: treat string as output only
        text_dict = ThinkingOutputDict(thinking="", output=text)
    else:
        text_dict = text

    # Select which text to count based on penalty_type
    if config.penalty_type == "ALL":
        # Concatenate both sections (with space to avoid word merging)
        text_to_count = text_dict.get("thinking", "") + " " + text_dict.get("output", "")
    elif config.penalty_type == "OUTPUT_ONLY":
        text_to_count = text_dict.get("output", "")
    elif config.penalty_type == "THINKING_ONLY":
        text_to_count = text_dict.get("thinking", "")
    else:
        raise ValueError(
            f"Invalid penalty_type: {config.penalty_type}. "
            f"Must be 'ALL', 'OUTPUT_ONLY', or 'THINKING_ONLY'."
        )

    # Count tokens/words
    count_fn = config.count_fn if config.count_fn is not None else word_count
    count = count_fn(text_to_count)

    # Apply penalty curve
    if count <= config.free_budget:
        return 0.0
    if count >= config.max_cap:
        return config.penalty_at_cap

    frac = (count - config.free_budget) / float(config.max_cap - config.free_budget)
    return config.penalty_at_cap * (frac**config.exponent)

PenaltyType

Literal type specifying which sections to count for the length penalty.

PenaltyType module-attribute

PenaltyType = Literal['ALL', 'OUTPUT_ONLY', 'THINKING_ONLY']

Type for penalty_type field: specifies which sections to count for length penalty.


CountFn

Type alias for custom counting functions.

CountFn module-attribute

CountFn = Callable[[str], int]
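
Any str -> int callable satisfies this alias. As a sketch, a hypothetical character-based counter (illustrative only; budgets would need rescaling to character units):

def char_count(text: str) -> int:
    # Count raw characters rather than words or tokens
    return len(text)

penalty = LengthPenalty(
    free_budget=20000,  # character budget, not a word budget
    max_cap=30000,
    count_fn=char_count,
)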

word_count

Default word counting function.

word_count

word_count(text: str) -> int

Count the number of whitespace-separated words in text.

This is the default counting function used by LengthPenalty. For more accurate token counting with a specific model, provide a custom count_fn that uses a tokenizer.

Source code in src/autorubric/utils.py
def word_count(text: str) -> int:
    """Count the number of whitespace-separated words in text.

    This is the default counting function used by LengthPenalty.
    For more accurate token counting with a specific model, provide a custom
    count_fn that uses a tokenizer.
    """
    return len(text.split())

References

Dubois, Y., Galambosi, B., Liang, P., and Hashimoto, T. B. (2024). Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. arXiv:2404.04475.