Length Penalty¶
Control verbosity bias by penalizing excessively long responses.
Overview¶
LLM judges often prefer longer answers, a phenomenon known as verbosity bias. The length penalty feature provides a configurable mechanism to penalize excessively verbose outputs without requiring changes to the rubric itself.
Research Background
Dubois et al. (2024) document verbosity bias extensively in their length-controlled AlpacaEval work. Length penalty helps reduce verbosity-driven score inflation by adding conciseness as an implicit scoring dimension.
Quick Example¶
from autorubric import Rubric, LLMConfig, LengthPenalty
from autorubric.graders import CriterionGrader
grader = CriterionGrader(
llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
length_penalty=LengthPenalty(
free_budget=6000, # No penalty below this count
max_cap=8000, # Maximum penalty at/above this count
penalty_at_cap=0.5, # Max penalty to subtract from score
exponent=1.6, # Curve steepness
penalty_type="ALL", # "ALL", "OUTPUT_ONLY", "THINKING_ONLY"
),
)
result = await rubric.grade(to_grade=response, grader=grader)
Penalty Formula¶
if count <= free_budget:
penalty = 0
elif count >= max_cap:
penalty = penalty_at_cap
else:
frac = (count - free_budget) / (max_cap - free_budget)
penalty = penalty_at_cap * (frac ** exponent)
final_score = max(0.0, base_score - penalty)
Custom Count Functions¶
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
grader = CriterionGrader(
llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
length_penalty=LengthPenalty(
free_budget=8000,
max_cap=10000,
count_fn=lambda t: len(tokenizer.encode(t)), # Token count
),
)
Thinking/Output Separation¶
For models with separate thinking and output sections:
# Dict format
await rubric.grade(
to_grade={
"thinking": "Let me reason through this...",
"output": "The final answer is 42"
},
grader=grader
)
# String with markers
await rubric.grade(
to_grade="<thinking>My reasoning...</thinking><output>Final answer</output>",
grader=grader
)
# Penalty type selection
penalty = LengthPenalty(
free_budget=8000,
max_cap=10000,
penalty_at_cap=0.5,
penalty_type="OUTPUT_ONLY" # Only count output, not thinking
)
Training/RL Use Cases¶
For reinforcement learning, use unnormalized scores with absolute penalties:
grader = CriterionGrader(
llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
normalize=False, # Raw weighted sums
length_penalty=LengthPenalty(
free_budget=8000,
max_cap=10000,
penalty_at_cap=50.0, # Absolute penalty
exponent=1.6,
count_fn=lambda text: len(tokenizer.encode(text, add_special_tokens=False))
),
)
LengthPenalty¶
Configuration for length-based score penalty.
LengthPenalty
¶
Bases: BaseModel
Configuration for applying length-based penalties during grading.
The penalty is computed as: - 0 if count <= free_budget - penalty_at_cap if count >= max_cap - penalty_at_cap * ((count - free_budget) / (max_cap - free_budget)) ** exponent otherwise
By default, the penalty is subtracted from the final score (which is normalized to 0-1). For training use cases with raw scores, use absolute penalty values (e.g., 50.0).
| ATTRIBUTE | DESCRIPTION |
|---|---|
free_budget |
Number of tokens/words allowed before any penalty applies.
TYPE:
|
max_cap |
Number of tokens/words at which the maximum penalty is applied.
TYPE:
|
penalty_at_cap |
Maximum penalty value (always subtracted from score). For normalized scores, use fractional values like 0.5 (lose up to 50% of score). For training with raw scores, use absolute values like 50.0 (subtract up to 50 points).
TYPE:
|
exponent |
Controls the penalty curve steepness. Higher = more lenient near free_budget.
TYPE:
|
count_fn |
Function to count tokens/words in text. If None, uses whitespace word count.
For accurate token counting, pass a tokenizer-based function like:
TYPE:
|
penalty_type |
Which text to count for penalty calculation: - "ALL": Count both thinking and output tokens (default) - "OUTPUT_ONLY": Count only output tokens (useful for RL training) - "THINKING_ONLY": Count only thinking tokens
TYPE:
|
Example
Default: word-based counting with sensible defaults for normalized scores¶
penalty = LengthPenalty()
For training with raw (unnormalized) scores - absolute penalty values¶
penalty = LengthPenalty( ... free_budget=8000, ... max_cap=10000, ... penalty_at_cap=50.0, # Subtract up to 50 points from raw score ... exponent=1.6, ... )
Custom tokenizer-based counting (e.g., with HuggingFace)¶
from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("gpt2") penalty = LengthPenalty( ... free_budget=8000, ... max_cap=10000, ... count_fn=lambda text: len(tokenizer.encode(text)) ... )
Only penalize output tokens (allow long thinking)¶
penalty = LengthPenalty( ... free_budget=8000, ... penalty_type="OUTPUT_ONLY", ... )
compute_length_penalty¶
Compute the penalty value for a given text.
compute_length_penalty
¶
compute_length_penalty(text: str | ThinkingOutputDict, config: LengthPenalty) -> float
Compute the length penalty for the given text based on the config.
The penalty follows an exponential curve: - Returns 0 if word/token count is at or below free_budget - Returns penalty_at_cap if count is at or above max_cap - Returns an interpolated value between those bounds using the exponent
| PARAMETER | DESCRIPTION |
|---|---|
text
|
Either a string (backwards compatible) or a dict with 'thinking' and 'output' keys. When a string is provided, it's treated as all output (no thinking section).
TYPE:
|
config
|
LengthPenalty configuration specifying thresholds, penalty, and which sections to count based on penalty_type.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
float
|
A penalty value between 0 and penalty_at_cap to subtract from the score. |
Source code in src/autorubric/utils.py
PenaltyType¶
Enum for which sections to count for length penalty.
PenaltyType
module-attribute
¶
Type for penalty_type field: specifies which sections to count for length penalty.
CountFn¶
Type alias for custom counting functions.
word_count¶
Default word counting function.
word_count
¶
Count the number of whitespace-separated words in text.
This is the default counting function used by LengthPenalty. For more accurate token counting with a specific model, provide a custom count_fn that uses a tokenizer.
Source code in src/autorubric/utils.py
References¶
Dubois, Y., Galambosi, B., Liang, P., and Hashimoto, T. B. (2024). Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. arXiv:2404.04475.