Dataset
Dataset management classes for organizing evaluation data with optional ground truth labels.
Overview
The RubricDataset class provides structured storage for evaluation datasets, including submissions, optional ground truth verdicts, per-item rubrics, and reference submissions. Datasets can be serialized to JSON/YAML for sharing and reproducibility.
Quick Example
from autorubric import Rubric, Criterion, CriterionVerdict, DataItem, RubricDataset

# Create a rubric
rubric = Rubric([
    Criterion(name="accuracy", weight=10.0, requirement="Factually correct"),
    Criterion(name="clarity", weight=5.0, requirement="Clear and concise"),
])

# Create a dataset
dataset = RubricDataset(
    name="photosynthesis-eval",
    prompt="Explain photosynthesis",
    rubric=rubric,
)

# Add items with ground truth
dataset.add_item(
    submission="Photosynthesis is the process by which plants convert sunlight...",
    description="Good response",
    ground_truth=[CriterionVerdict.MET, CriterionVerdict.MET]
)

# Serialize
dataset.to_file("dataset.json")

# Load
loaded = RubricDataset.from_file("dataset.json")
Per-Item Rubrics
For datasets where each item requires a unique rubric (e.g., question-specific evaluation):
item = DataItem(
    submission="Answer to question 1...",
    description="Q1",
    rubric=Rubric([
        Criterion(weight=1.0, requirement="Correct answer for Q1"),
    ])
)

dataset = RubricDataset(
    prompt="Answer the question",
    rubric=None,  # No global rubric
    items=[item],
)

# Get effective rubric for an item
rubric = dataset.get_item_rubric(0)  # Returns item's rubric
Reference Submissions
Provide exemplar responses for judge calibration:
# Global reference for all items
dataset = RubricDataset(
    prompt="Explain photosynthesis",
    rubric=rubric,
    reference_submission="Detailed explanation of photosynthesis...",
)

# Per-item reference (overrides global)
dataset.add_item(
    submission="Student answer...",
    description="Q1",
    reference_submission="Custom reference for this item",
)

# Get effective reference
ref = dataset.get_item_reference_submission(0)
Train/Test Split
train_data, test_data = dataset.split_train_test(
    n_train=100,
    stratify=True,  # Balance by ground truth verdicts
    seed=42,
)
DataItem
A single item in an evaluation dataset.
DataItem (dataclass)
DataItem(submission: str, description: str, ground_truth: list[CriterionVerdict | str] | None = None, rubric: Rubric | None = None, reference_submission: str | None = None)
A single item to be graded, optionally with ground truth verdicts.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| submission | The content to be evaluated. Can be plain text or a JSON-serialized string for structured data (e.g., dialogues, multi-part responses). TYPE: str |
| description | A brief description of this item (e.g., "High quality response"). TYPE: str |
| ground_truth | Optional list of ground truth values, one per criterion. For binary criteria: CriterionVerdict (MET, UNMET, CANNOT_ASSESS); for multi-choice criteria: str (option label). Used for computing evaluation metrics against LLM predictions. TYPE: list[CriterionVerdict \| str] \| None |
| rubric | Optional per-item rubric. If provided, this rubric is used for grading instead of the dataset-level rubric. Useful for datasets where each item has unique evaluation criteria (e.g., ResearcherBench). TYPE: Rubric \| None |
| reference_submission | Optional exemplar response for grading context. When present, helps calibrate the grader's expectations. Item-level takes precedence over dataset-level reference. TYPE: str \| None |
Example
Binary criteria only

item = DataItem(
    submission="The Industrial Revolution began in Britain around 1760...",
    description="Excellent essay covering all criteria",
    ground_truth=[CriterionVerdict.MET, CriterionVerdict.MET, CriterionVerdict.UNMET]
)
Mixed binary and multi-choice

item = DataItem(
    submission="The assistant responded helpfully...",
    description="Good dialogue",
    ground_truth=[CriterionVerdict.MET, "Very satisfied", "Yes - reasonable"]
)
With per-item rubric

from autorubric import Rubric, Criterion

item = DataItem(
    submission="Response to a specific question...",
    description="Question-specific grading",
    rubric=Rubric([Criterion(name="Relevance", weight=1.0, requirement="...")])
)
RubricDataset
Container for evaluation datasets with optional ground truth.
RubricDataset (dataclass)
RubricDataset(prompt: str, rubric: Rubric | None = None, items: list[DataItem] = list(), name: str | None = None, reference_submission: str | None = None)
A collection of DataItems tied to a specific prompt and rubric.
The RubricDataset encapsulates:

- The prompt that generated the responses
- The rubric used for evaluation (global or per-item)
- A collection of DataItems with optional ground truth labels

This is useful for:

- Evaluating LLM grader accuracy against human judgments
- Training reward models with labeled data
- Benchmarking different grading strategies
| ATTRIBUTE | DESCRIPTION |
|---|---|
| prompt | The prompt/question that items are responses to. TYPE: str |
| rubric | Optional global Rubric used to evaluate items. Can be None if all items have their own rubrics. TYPE: Rubric \| None |
| items | List of DataItem instances to evaluate. TYPE: list[DataItem] |
| name | Optional name for the dataset (e.g., "essay-grading-v1"). TYPE: str \| None |
| reference_submission | Optional global exemplar response for grading context. When present, provides calibration for the grader. Item-level reference takes precedence over this dataset-level reference. TYPE: str \| None |
Example
from autorubric import Rubric, Criterion, CriterionVerdict

rubric = Rubric([
    Criterion(name="Accuracy", weight=10.0, requirement="Factually correct"),
    Criterion(name="Clarity", weight=5.0, requirement="Clear and concise"),
])
dataset = RubricDataset(
    prompt="Explain photosynthesis",
    rubric=rubric,
)
dataset.add_item(
    submission="Photosynthesis is the process...",
    description="Good response",
    ground_truth=[CriterionVerdict.MET, CriterionVerdict.MET]
)
criterion_names (property)
Get criterion names from global rubric.
| RAISES | DESCRIPTION |
|---|---|
| ValueError | If no global rubric is set. |
num_criteria (property)
Number of criteria in the global rubric.
| RAISES | DESCRIPTION |
|---|---|
| ValueError | If no global rubric is set. |
total_positive_weight (property)
Sum of all positive criterion weights in global rubric.
| RAISES | DESCRIPTION |
|---|---|
| ValueError | If no global rubric is set. |
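A quick sketch of these three properties, assuming the two-criterion dataset built in the Quick Example (weights 10.0 and 5.0); all three raise ValueError if the dataset has no global rubric, and the exact container returned by criterion_names is assumed here to be a list.

# Expected values follow from the Quick Example rubric above.
print(dataset.criterion_names)        # e.g. ["accuracy", "clarity"]
print(dataset.num_criteria)           # 2
print(dataset.total_positive_weight)  # 15.0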
get_item_rubric
get_item_rubric(idx: int) -> Rubric
Get the effective rubric for an item (per-item or global fallback).
| PARAMETER | DESCRIPTION |
|---|---|
| idx | Index of the item. TYPE: int |

| RETURNS | DESCRIPTION |
|---|---|
| Rubric | The item's rubric if set, otherwise the dataset's global rubric. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If neither item nor dataset has a rubric. |
Source code in src/autorubric/dataset.py
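A minimal sketch of the fallback behavior, reusing the dataset from the Quick Example (item 0 defines no per-item rubric):

# Item 0 has no rubric of its own, so the dataset's global rubric is returned.
# If neither the item nor the dataset defined one, this call would raise ValueError.
effective_rubric = dataset.get_item_rubric(0)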
get_item_reference_submission
Get the effective reference submission for an item.
Item-level reference takes precedence over dataset-level reference.
| PARAMETER | DESCRIPTION |
|---|---|
| idx | Index of the item. TYPE: int |

| RETURNS | DESCRIPTION |
|---|---|
| str \| None | The item's reference_submission if set, otherwise the dataset's global reference_submission. May be None if neither is set. |
Source code in src/autorubric/dataset.py
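A brief sketch of the precedence rules, assuming the dataset built in the Reference Submissions section above; unlike get_item_rubric, this method returns None rather than raising when no reference exists.

ref = dataset.get_item_reference_submission(0)
if ref is None:
    print("No reference submission available for item 0")
else:
    print(f"Calibrating against reference: {ref[:40]}...")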
compute_weighted_score
compute_weighted_score(verdicts: list[CriterionVerdict | str], normalize: bool = True, rubric: Rubric | None = None) -> float
Compute weighted score from verdicts (binary or multi-choice).
| PARAMETER | DESCRIPTION |
|---|---|
| verdicts | List of verdict values, one per criterion. For binary criteria: CriterionVerdict (MET=1.0, UNMET=0.0); for multi-choice criteria: str (option label, resolved to value). TYPE: list[CriterionVerdict \| str] |
| normalize | If True, normalize score to [0, 1]. If False, return raw sum. TYPE: bool |
| rubric | Optional rubric to use for scoring. If None, uses global rubric. TYPE: Rubric \| None |

| RETURNS | DESCRIPTION |
|---|---|
| float | Weighted score based on criterion weights and verdicts. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If a multi-choice label doesn't match any option, or if rubric is None and no global rubric is set. |
Source code in src/autorubric/dataset.py
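A hedged example of the scoring arithmetic, again assuming the two-criterion rubric from the Quick Example; the commented values follow from the documented MET=1.0 / UNMET=0.0 mapping, and normalization is assumed to divide by total_positive_weight.

from autorubric import CriterionVerdict

verdicts = [CriterionVerdict.MET, CriterionVerdict.UNMET]  # accuracy met, clarity unmet

raw = dataset.compute_weighted_score(verdicts, normalize=False)  # 10.0 (only accuracy's weight counts)
normalized = dataset.compute_weighted_score(verdicts)            # ~0.667 if normalized by total_positive_weight (15.0)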
add_item
add_item(submission: str, description: str, ground_truth: list[CriterionVerdict | str] | None = None, rubric: Rubric | None = None, reference_submission: str | None = None) -> None
Add a new item to the dataset.
| PARAMETER | DESCRIPTION |
|---|---|
| submission | The content to be evaluated. TYPE: str |
| description | A brief description of this item. TYPE: str |
| ground_truth | Optional list of ground truth values. For binary criteria: CriterionVerdict (MET, UNMET, CANNOT_ASSESS); for multi-choice criteria: str (option label). TYPE: list[CriterionVerdict \| str] \| None |
| rubric | Optional per-item rubric. If None, uses global rubric. TYPE: Rubric \| None |
| reference_submission | Optional exemplar response for grading context. TYPE: str \| None |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If ground_truth length doesn't match effective rubric criteria count, or if neither per-item nor global rubric is available. |
Source code in src/autorubric/dataset.py
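In addition to the Quick Example usage, add_item accepts a per-item rubric and reference; the criterion text and descriptions below are illustrative only.

# Illustrative: this item is graded against its own single-criterion rubric,
# so a ground_truth list (if provided) would need exactly one entry.
dataset.add_item(
    submission="Answer to a follow-up question...",
    description="Follow-up question",
    rubric=Rubric([
        Criterion(name="relevance", weight=1.0, requirement="Addresses the follow-up question"),
    ]),
    reference_submission="Exemplar answer to the follow-up question...",
)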
to_json
Serialize the dataset to a JSON string.
| PARAMETER | DESCRIPTION |
|---|---|
| indent | Number of spaces for indentation. None for compact output. |

| RETURNS | DESCRIPTION |
|---|---|
| str | JSON string representation of the dataset. |
Source code in src/autorubric/dataset.py
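A short usage sketch; the full signature is not reproduced on this page, so passing indent by keyword is an assumption.

pretty = dataset.to_json(indent=2)      # human-readable output
compact = dataset.to_json(indent=None)  # compact, single-line output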
to_file
Save dataset to a JSON file.
| PARAMETER | DESCRIPTION |
|---|---|
| path | Path to write the JSON file. |
from_json (classmethod)
from_json(json_string: str) -> RubricDataset
Deserialize a dataset from a JSON string.
| PARAMETER | DESCRIPTION |
|---|---|
| json_string | JSON string representation of the dataset. TYPE: str |

| RETURNS | DESCRIPTION |
|---|---|
| RubricDataset | RubricDataset instance. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If the JSON is invalid, missing required fields, or if an item has no rubric when no global rubric is set. |
Source code in src/autorubric/dataset.py
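A round-trip sketch pairing from_json with to_json above; calling len() on a dataset follows the split_train_test example at the bottom of this page.

json_str = dataset.to_json(indent=2)
restored = RubricDataset.from_json(json_str)
print(len(restored))  # same number of items as the original dataset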
from_file (classmethod)
from_file(path: str | Path) -> RubricDataset
Load dataset from a JSON file.
| PARAMETER | DESCRIPTION |
|---|---|
| path | Path to the JSON file. TYPE: str \| Path |

| RETURNS | DESCRIPTION |
|---|---|
| RubricDataset | RubricDataset instance. |

| RAISES | DESCRIPTION |
|---|---|
| FileNotFoundError | If the file doesn't exist. |
| ValueError | If the JSON is invalid. |
Source code in src/autorubric/dataset.py
split_train_test
split_train_test(n_train: int, *, stratify: bool = True, seed: int | None = None) -> tuple[RubricDataset, RubricDataset]
Split dataset into training and test sets.
The training set can be used to provide few-shot examples for grading, while the test set is used for evaluation.
| PARAMETER | DESCRIPTION |
|---|---|
| n_train | Exact number of items for the training set. TYPE: int |
| stratify | If True, stratify by per-criterion verdict distribution. This ensures each split has a similar proportion of MET/UNMET/CANNOT_ASSESS for each criterion position. Requires all items to have ground_truth. TYPE: bool |
| seed | Random seed for reproducible splits. TYPE: int \| None |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[RubricDataset, RubricDataset] | Tuple of (train_dataset, test_dataset). |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If n_train is invalid or stratify=True but items lack ground_truth. |
Example
dataset = RubricDataset.from_file("data.json")
train, test = dataset.split_train_test(n_train=100, stratify=True, seed=42)
print(f"Train: {len(train)}, Test: {len(test)}")