Utilities¶
Helper functions for token aggregation, text processing, and data manipulation.
Overview¶
Utility functions for common operations like aggregating token usage across multiple evaluations, parsing thinking/output responses, and generating synthetic ground truth labels.
Token Aggregation¶
from autorubric import (
aggregate_token_usage,
aggregate_completion_cost,
aggregate_evaluation_usage,
)
# After batch grading
results = await asyncio.gather(*[rubric.grade(r, grader) for r in responses])
# Aggregate usage and cost
total_usage, total_cost = aggregate_evaluation_usage(results)
# Or aggregate manually
usages = [r.token_usage for r in results]
costs = [r.completion_cost for r in results]
total_usage = aggregate_token_usage(usages)
total_cost = aggregate_completion_cost(costs)
if total_usage:
print(f"Total tokens: {total_usage.total_tokens}")
if total_cost:
print(f"Total cost: ${total_cost:.4f}")
Thinking Output Parsing¶
from autorubric import parse_thinking_output, normalize_to_grade_input
# Parse string with markers
text = "<thinking>Reasoning here</thinking><output>Final answer</output>"
parsed = parse_thinking_output(text)
# {'thinking': 'Reasoning here', 'output': 'Final answer'}
# Normalize any input format
input1 = "plain text"
input2 = {"thinking": "...", "output": "..."}
input3 = "<thinking>...</thinking><output>...</output>"
normalized = normalize_to_grade_input(input1) # {'thinking': None, 'output': 'plain text'}
normalized = normalize_to_grade_input(input2) # passes through
normalized = normalize_to_grade_input(input3) # parses markers
Synthetic Ground Truth¶
from autorubric import RubricDataset, LLMConfig
from autorubric.graders import CriterionGrader
from autorubric import fill_ground_truth
async def generate_labels():
dataset = RubricDataset.from_file("unlabeled.json")
# Use strong model for ground truth
grader = CriterionGrader(
llm_config=LLMConfig(
model="anthropic/claude-sonnet-4-5-20250929",
max_parallel_requests=10,
)
)
labeled = await fill_ground_truth(
dataset,
grader,
force=False, # Only label items without ground_truth
show_progress=True,
)
labeled.to_file("labeled.json")
Verdict Helpers¶
from autorubric import (
extract_verdicts_from_report,
filter_cannot_assess,
verdict_to_binary,
verdict_to_string,
)
# Extract verdicts from evaluation report
verdicts = extract_verdicts_from_report(result.report)
# Filter out CANNOT_ASSESS
filtered = filter_cannot_assess(verdicts)
# Convert to binary (for metrics)
binary = verdict_to_binary(CriterionVerdict.MET) # 1
binary = verdict_to_binary(CriterionVerdict.UNMET) # 0
# Convert to string
string = verdict_to_string(CriterionVerdict.MET) # "MET"
aggregate_token_usage¶
Aggregate token usage from multiple evaluations.
aggregate_token_usage
¶
aggregate_token_usage(usages: list[TokenUsage | None]) -> TokenUsage | None
Aggregate multiple TokenUsage objects into a single total.
Useful for combining usage from multiple LLM calls or multiple grading operations.
| PARAMETER | DESCRIPTION |
|---|---|
usages
|
List of TokenUsage objects (None values are filtered out).
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TokenUsage | None
|
A single TokenUsage with summed values, or None if all inputs are None. |
Example
from autorubric import TokenUsage usage1 = TokenUsage(prompt_tokens=100, completion_tokens=50, total_tokens=150) usage2 = TokenUsage(prompt_tokens=200, completion_tokens=100, total_tokens=300) total = aggregate_token_usage([usage1, usage2]) print(f"Total tokens: {total.total_tokens}") Total tokens: 450
Source code in src/autorubric/utils.py
aggregate_completion_cost¶
Aggregate completion costs from multiple evaluations.
aggregate_completion_cost
¶
Aggregate multiple completion costs into a single total.
Useful for combining costs from multiple LLM calls or multiple grading operations.
| PARAMETER | DESCRIPTION |
|---|---|
costs
|
List of cost values in USD (None values are filtered out).
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
float | None
|
Total cost in USD, or None if all inputs are None. |
Example
costs = [0.001, 0.002, None, 0.003] total = aggregate_completion_cost(costs) print(f"Total cost: ${total:.4f}") Total cost: $0.0060
Source code in src/autorubric/utils.py
aggregate_evaluation_usage¶
Aggregate both usage and cost from evaluation reports.
aggregate_evaluation_usage
¶
aggregate_evaluation_usage(reports: list['EvaluationReport']) -> tuple[TokenUsage | None, float | None]
Aggregate usage and cost from multiple EvaluationReports.
Useful for batch grading operations where you want to track total resource usage.
| PARAMETER | DESCRIPTION |
|---|---|
reports
|
List of EvaluationReport objects from grading operations.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
TokenUsage | None
|
Tuple of (total_token_usage, total_completion_cost). |
float | None
|
Either value may be None if no usage data was available. |
Example
After batch grading¶
results = await asyncio.gather(*[rubric.grade(...) for item in items]) total_usage, total_cost = aggregate_evaluation_usage(results) if total_usage: ... print(f"Total tokens used: {total_usage.total_tokens}") if total_cost: ... print(f"Total cost: ${total_cost:.4f}")
Source code in src/autorubric/utils.py
fill_ground_truth¶
Generate synthetic ground truth labels for unlabeled datasets.
fill_ground_truth
async
¶
fill_ground_truth(dataset: 'RubricDataset', grader: 'Grader', *, force: bool = False, show_progress: bool = True, max_concurrent_items: int | None = None) -> 'RubricDataset'
Generate ground truth labels for dataset items using an LLM grader.
Uses the provided grader to evaluate each item and extracts the verdicts to populate ground_truth. This is useful for creating synthetic ground truth labels when manual annotation is impractical.
| PARAMETER | DESCRIPTION |
|---|---|
dataset
|
The dataset to fill ground truth for.
TYPE:
|
grader
|
The grader to use for generating verdicts.
TYPE:
|
force
|
If True, re-grade all items. If False (default), only grade items where ground_truth is None.
TYPE:
|
show_progress
|
Whether to display progress bars. Default True.
TYPE:
|
max_concurrent_items
|
Maximum items to grade concurrently. None = grade all items in parallel (default).
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
'RubricDataset'
|
A new RubricDataset with ground_truth filled in. Items that fail to |
'RubricDataset'
|
grade are excluded from the returned dataset. Items with existing |
'RubricDataset'
|
ground_truth (when force=False) are included unchanged. |
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If dataset has no items. |
Example
from autorubric import RubricDataset, LLMConfig from autorubric.graders import CriterionGrader from autorubric.utils import fill_ground_truth
dataset = RubricDataset.from_file("unlabeled.json") grader = CriterionGrader(llm_config=LLMConfig(model="openai/gpt-4o")) labeled = await fill_ground_truth(dataset, grader) labeled.to_file("labeled.json")
Source code in src/autorubric/utils.py
332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 | |
parse_thinking_output¶
Parse text with thinking/output markers.
parse_thinking_output
¶
parse_thinking_output(text: str) -> ThinkingOutputDict
Parse thinking and output sections from text with XML-style markers.
Looks for
| PARAMETER | DESCRIPTION |
|---|---|
text
|
Text potentially containing thinking/output markers.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
ThinkingOutputDict
|
Dict with 'thinking' and 'output' keys. Empty strings if sections not found. |
Examples:
>>> parse_thinking_output("<thinking>ABC</thinking><output>DEF</output>")
{'thinking': 'ABC', 'output': 'DEF'}
Source code in src/autorubric/utils.py
normalize_to_grade_input¶
Normalize any input format to ThinkingOutputDict.
normalize_to_grade_input
¶
normalize_to_grade_input(to_grade: ToGradeInput) -> ThinkingOutputDict
Normalize to_grade input to dict format.
| PARAMETER | DESCRIPTION |
|---|---|
to_grade
|
Either a string (with optional markers) or a dict.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
ThinkingOutputDict
|
Dict with 'thinking' and 'output' keys. |
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If dict format is invalid (missing keys, wrong types). |
Source code in src/autorubric/utils.py
word_count¶
Count words in text (default length penalty function).
word_count
¶
Count the number of whitespace-separated words in text.
This is the default counting function used by LengthPenalty. For more accurate token counting with a specific model, provide a custom count_fn that uses a tokenizer.
Source code in src/autorubric/utils.py
extract_verdicts_from_report¶
Extract verdicts from criterion reports.
extract_verdicts_from_report
¶
extract_verdicts_from_report(report: EvaluationReport | EnsembleEvaluationReport, num_criteria: int) -> list[CriterionVerdict]
Extract verdicts from an EvaluationReport.
| PARAMETER | DESCRIPTION |
|---|---|
report
|
The evaluation report. |
num_criteria
|
Expected number of criteria.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list[CriterionVerdict]
|
List of CriterionVerdict values. |
Source code in src/autorubric/metrics/_helpers.py
filter_cannot_assess¶
Filter out CANNOT_ASSESS verdicts.
filter_cannot_assess
¶
filter_cannot_assess(pred_verdicts: list[CriterionVerdict], true_verdicts: list[CriterionVerdict], mode: CannotAssessMode = 'exclude') -> tuple[list[CriterionVerdict], list[CriterionVerdict]]
Filter or transform CANNOT_ASSESS verdicts based on mode.
| PARAMETER | DESCRIPTION |
|---|---|
pred_verdicts
|
Predicted verdicts.
TYPE:
|
true_verdicts
|
Ground truth verdicts.
TYPE:
|
mode
|
How to handle CANNOT_ASSESS: - "exclude": Remove pairs where either is CA - "as_unmet": Convert CA to UNMET - "as_category": Keep CA as-is (3-class)
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
tuple[list[CriterionVerdict], list[CriterionVerdict]]
|
Tuple of (filtered_pred, filtered_true). |
Source code in src/autorubric/metrics/_helpers.py
verdict_to_binary¶
Convert verdict to binary value.
verdict_to_binary
¶
verdict_to_binary(verdicts: Sequence[CriterionVerdict]) -> list[int]
Convert verdicts to binary (MET=1, UNMET/CA=0).
| PARAMETER | DESCRIPTION |
|---|---|
verdicts
|
List of CriterionVerdict values.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list[int]
|
List of 0/1 values. |
Source code in src/autorubric/metrics/_helpers.py
verdict_to_string¶
Convert verdict to string representation.
verdict_to_string
¶
verdict_to_string(verdicts: Sequence[CriterionVerdict]) -> list[str]
Convert verdicts to string values.
| PARAMETER | DESCRIPTION |
|---|---|
verdicts
|
List of CriterionVerdict values.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list[str]
|
List of string values. |