Dataset

Dataset management classes for organizing evaluation data with optional ground truth labels.

Overview

The RubricDataset class provides structured storage for evaluation datasets, including submissions, optional ground truth verdicts, per-item rubrics, and reference submissions. Datasets can be serialized to JSON/YAML for sharing and reproducibility.

Quick Example

from autorubric import Rubric, Criterion, CriterionVerdict, DataItem, RubricDataset

# Create a rubric
rubric = Rubric([
    Criterion(name="accuracy", weight=10.0, requirement="Factually correct"),
    Criterion(name="clarity", weight=5.0, requirement="Clear and concise"),
])

# Create a dataset
dataset = RubricDataset(
    name="photosynthesis-eval",
    prompt="Explain photosynthesis",
    rubric=rubric,
)

# Add items with ground truth
dataset.add_item(
    submission="Photosynthesis is the process by which plants convert sunlight...",
    description="Good response",
    ground_truth=[CriterionVerdict.MET, CriterionVerdict.MET]
)

# Serialize
dataset.to_file("dataset.json")

# Load
loaded = RubricDataset.from_file("dataset.json")

Per-Item Rubrics

For datasets where each item requires a unique rubric (e.g., question-specific evaluation):

item = DataItem(
    submission="Answer to question 1...",
    description="Q1",
    rubric=Rubric([
        Criterion(weight=1.0, requirement="Correct answer for Q1"),
    ])
)

dataset = RubricDataset(
    prompt="Answer the question",
    rubric=None,  # No global rubric
    items=[item],
)

# Get effective rubric for an item
rubric = dataset.get_item_rubric(0)  # Returns item's rubric

Reference Submissions

Provide exemplar responses for judge calibration:

# Global reference for all items
dataset = RubricDataset(
    prompt="Explain photosynthesis",
    rubric=rubric,
    reference_submission="Detailed explanation of photosynthesis...",
)

# Per-item reference (overrides global)
dataset.add_item(
    submission="Student answer...",
    description="Q1",
    reference_submission="Custom reference for this item",
)

# Get effective reference
ref = dataset.get_item_reference_submission(0)

Train/Test Split

train_data, test_data = dataset.split_train_test(
    n_train=100,
    stratify=True,  # Balance by ground truth verdicts
    seed=42,
)

DataItem

A single item in an evaluation dataset.

DataItem dataclass

DataItem(submission: str, description: str, ground_truth: list[CriterionVerdict | str] | None = None, rubric: Rubric | None = None, reference_submission: str | None = None)

A single item to be graded, optionally with ground truth verdicts.

ATTRIBUTE DESCRIPTION
submission

The content to be evaluated. Can be plain text or a JSON-serialized string for structured data (e.g., dialogues, multi-part responses).

TYPE: str

description

A brief description of this item (e.g., "High quality response").

TYPE: str

ground_truth

Optional list of ground truth values, one per criterion.

- For binary criteria: CriterionVerdict (MET, UNMET, CANNOT_ASSESS)
- For multi-choice criteria: str (option label)

Used for computing evaluation metrics against LLM predictions.

TYPE: list[CriterionVerdict | str] | None

rubric

Optional per-item rubric. If provided, this rubric is used for grading instead of the dataset-level rubric. Useful for datasets where each item has unique evaluation criteria (e.g., ResearcherBench).

TYPE: Rubric | None

reference_submission

Optional exemplar response for grading context. When present, helps calibrate the grader's expectations. Item-level takes precedence over dataset-level reference.

TYPE: str | None

Example

Binary criteria only

item = DataItem( ... submission="The Industrial Revolution began in Britain around 1760...", ... description="Excellent essay covering all criteria", ... ground_truth=[CriterionVerdict.MET, CriterionVerdict.MET, CriterionVerdict.UNMET] ... )

Mixed binary and multi-choice

item = DataItem( ... submission="The assistant responded helpfully...", ... description="Good dialogue", ... ground_truth=[CriterionVerdict.MET, "Very satisfied", "Yes - reasonable"] ... )

With per-item rubric

from autorubric import Rubric, Criterion

item = DataItem(
    submission="Response to a specific question...",
    description="Question-specific grading",
    rubric=Rubric([Criterion(name="Relevance", weight=1.0, requirement="...")])
)


RubricDataset

Container for evaluation datasets with optional ground truth.

RubricDataset dataclass

RubricDataset(prompt: str, rubric: Rubric | None = None, items: list[DataItem] = list(), name: str | None = None, reference_submission: str | None = None)

A collection of DataItems tied to a specific prompt and rubric.

The RubricDataset encapsulates:

- The prompt that generated the responses
- The rubric used for evaluation (global or per-item)
- A collection of DataItems with optional ground truth labels

This is useful for:

- Evaluating LLM grader accuracy against human judgments
- Training reward models with labeled data
- Benchmarking different grading strategies

ATTRIBUTE DESCRIPTION
prompt

The prompt/question that items are responses to.

TYPE: str

rubric

Optional global Rubric used to evaluate items. Can be None if all items have their own rubrics.

TYPE: Rubric | None

items

List of DataItem instances to evaluate.

TYPE: list[DataItem]

name

Optional name for the dataset (e.g., "essay-grading-v1").

TYPE: str | None

reference_submission

Optional global exemplar response for grading context. When present, provides calibration for the grader. Item-level reference takes precedence over this dataset-level reference.

TYPE: str | None

Example

from autorubric import Rubric, Criterion, CriterionVerdict

rubric = Rubric([
    Criterion(name="Accuracy", weight=10.0, requirement="Factually correct"),
    Criterion(name="Clarity", weight=5.0, requirement="Clear and concise"),
])

dataset = RubricDataset(
    prompt="Explain photosynthesis",
    rubric=rubric,
)

dataset.add_item(
    submission="Photosynthesis is the process...",
    description="Good response",
    ground_truth=[CriterionVerdict.MET, CriterionVerdict.MET]
)

criterion_names property

criterion_names: list[str]

Get criterion names from global rubric.

RAISES DESCRIPTION
ValueError

If no global rubric is set.

num_criteria property

num_criteria: int

Number of criteria in the global rubric.

RAISES DESCRIPTION
ValueError

If no global rubric is set.

total_positive_weight property

total_positive_weight: float

Sum of all positive criterion weights in global rubric.

RAISES DESCRIPTION
ValueError

If no global rubric is set.
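
These properties read only the global rubric. A minimal sketch of what they resolve to, assuming the two-criterion rubric from the Quick Example above (weights 10.0 and 5.0):

dataset.criterion_names        # ["accuracy", "clarity"]
dataset.num_criteria           # 2
dataset.total_positive_weight  # 15.0

# With rubric=None (per-item rubrics only), each property raises ValueError.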

get_item_rubric

get_item_rubric(idx: int) -> Rubric

Get the effective rubric for an item (per-item or global fallback).

PARAMETER DESCRIPTION
idx

Index of the item.

TYPE: int

RETURNS DESCRIPTION
Rubric

The item's rubric if set, otherwise the dataset's global rubric.

RAISES DESCRIPTION
ValueError

If neither item nor dataset has a rubric.

Source code in src/autorubric/dataset.py
def get_item_rubric(self, idx: int) -> Rubric:
    """Get the effective rubric for an item (per-item or global fallback).

    Args:
        idx: Index of the item.

    Returns:
        The item's rubric if set, otherwise the dataset's global rubric.

    Raises:
        ValueError: If neither item nor dataset has a rubric.
    """
    item = self.items[idx]
    if item.rubric is not None:
        return item.rubric
    if self.rubric is not None:
        return self.rubric
    raise ValueError(
        f"Item {idx} has no rubric and dataset has no global rubric"
    )
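
A brief usage sketch of the fallback order, assuming a hypothetical dataset whose global rubric is set and whose second item carries its own per-item rubric:

dataset.get_item_rubric(0)  # item 0 has no rubric -> returns the global dataset.rubric
dataset.get_item_rubric(1)  # item 1 defines a rubric -> returns that per-item rubric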

get_item_reference_submission

get_item_reference_submission(idx: int) -> str | None

Get the effective reference submission for an item.

Item-level reference takes precedence over dataset-level reference.

PARAMETER DESCRIPTION
idx

Index of the item.

TYPE: int

RETURNS DESCRIPTION
str | None

The item's reference_submission if set, otherwise the dataset's global reference_submission. May be None if neither is set.

Source code in src/autorubric/dataset.py
def get_item_reference_submission(self, idx: int) -> str | None:
    """Get the effective reference submission for an item.

    Item-level reference takes precedence over dataset-level reference.

    Args:
        idx: Index of the item.

    Returns:
        The item's reference_submission if set, otherwise the dataset's
        global reference_submission. May be None if neither is set.
    """
    item = self.items[idx]
    if item.reference_submission is not None:
        return item.reference_submission
    return self.reference_submission

compute_weighted_score

compute_weighted_score(verdicts: list[CriterionVerdict | str], normalize: bool = True, rubric: Rubric | None = None) -> float

Compute weighted score from verdicts (binary or multi-choice).

PARAMETER DESCRIPTION
verdicts

List of verdict values, one per criterion.

- For binary criteria: CriterionVerdict (MET=1.0, UNMET=0.0)
- For multi-choice criteria: str (option label, resolved to value)

TYPE: list[CriterionVerdict | str]

normalize

If True, normalize score to [0, 1]. If False, return raw sum.

TYPE: bool DEFAULT: True

rubric

Optional rubric to use for scoring. If None, uses global rubric.

TYPE: Rubric | None DEFAULT: None

RETURNS DESCRIPTION
float

Weighted score based on criterion weights and verdicts.

RAISES DESCRIPTION
ValueError

If a multi-choice label doesn't match any option, or if rubric is None and no global rubric is set.

Source code in src/autorubric/dataset.py
def compute_weighted_score(
    self,
    verdicts: list[CriterionVerdict | str],
    normalize: bool = True,
    rubric: Rubric | None = None,
) -> float:
    """Compute weighted score from verdicts (binary or multi-choice).

    Args:
        verdicts: List of verdict values, one per criterion.
            - For binary criteria: CriterionVerdict (MET=1.0, UNMET=0.0)
            - For multi-choice criteria: str (option label, resolved to value)
        normalize: If True, normalize score to [0, 1]. If False, return raw sum.
        rubric: Optional rubric to use for scoring. If None, uses global rubric.

    Returns:
        Weighted score based on criterion weights and verdicts.

    Raises:
        ValueError: If a multi-choice label doesn't match any option, or if
            rubric is None and no global rubric is set.
    """
    effective_rubric = rubric if rubric is not None else self.rubric
    if effective_rubric is None:
        raise ValueError(
            "Cannot compute score: no rubric provided and no global rubric set"
        )

    score = 0.0
    total_positive = 0.0

    for i, verdict in enumerate(verdicts):
        criterion = effective_rubric.rubric[i]
        weight = criterion.weight

        if criterion.is_multi_choice:
            # Multi-choice: resolve label to value
            if isinstance(verdict, str):
                idx = criterion.find_option_by_label(verdict)
                opt = criterion.options[idx]  # type: ignore
                if opt.na:
                    # NA options don't contribute
                    continue
                score += opt.value * weight
                if weight > 0:
                    total_positive += weight
            else:
                raise ValueError(
                    f"Criterion {i} is multi-choice but got CriterionVerdict; "
                    f"expected option label string"
                )
        else:
            # Binary: MET=1.0, UNMET=0.0, CANNOT_ASSESS skipped
            if isinstance(verdict, str):
                # Try to parse as CriterionVerdict
                try:
                    verdict = CriterionVerdict(verdict)
                except ValueError:
                    raise ValueError(
                        f"Criterion {i} is binary but got invalid verdict '{verdict}'. "
                        f"Must be 'MET', 'UNMET', or 'CANNOT_ASSESS'."
                    ) from None

            if verdict == CriterionVerdict.CANNOT_ASSESS:
                continue  # Skip
            if verdict == CriterionVerdict.MET:
                score += weight
            if weight > 0:
                total_positive += weight

    if normalize:
        if total_positive > 0:
            return max(0.0, min(1.0, score / total_positive))
        return 0.0
    return score
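
A usage sketch, assuming the two-criterion rubric from the Quick Example (weights 10.0 and 5.0); the normalized score is the met positive weight divided by the total positive weight:

verdicts = [CriterionVerdict.MET, CriterionVerdict.UNMET]

dataset.compute_weighted_score(verdicts)                   # 10.0 / 15.0 ≈ 0.667
dataset.compute_weighted_score(verdicts, normalize=False)  # 10.0 (raw weighted sum)

# CANNOT_ASSESS removes that criterion from both numerator and denominator:
dataset.compute_weighted_score([CriterionVerdict.MET, CriterionVerdict.CANNOT_ASSESS])  # 1.0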

add_item

add_item(submission: str, description: str, ground_truth: list[CriterionVerdict | str] | None = None, rubric: Rubric | None = None, reference_submission: str | None = None) -> None

Add a new item to the dataset.

PARAMETER DESCRIPTION
submission

The content to be evaluated.

TYPE: str

description

A brief description of this item.

TYPE: str

ground_truth

Optional list of ground truth values.

- For binary criteria: CriterionVerdict (MET, UNMET, CANNOT_ASSESS)
- For multi-choice criteria: str (option label)

TYPE: list[CriterionVerdict | str] | None DEFAULT: None

rubric

Optional per-item rubric. If None, uses global rubric.

TYPE: Rubric | None DEFAULT: None

reference_submission

Optional exemplar response for grading context.

TYPE: str | None DEFAULT: None

RAISES DESCRIPTION
ValueError

If ground_truth length doesn't match effective rubric criteria count, or if neither per-item nor global rubric is available.

Source code in src/autorubric/dataset.py
def add_item(
    self,
    submission: str,
    description: str,
    ground_truth: list[CriterionVerdict | str] | None = None,
    rubric: Rubric | None = None,
    reference_submission: str | None = None,
) -> None:
    """Add a new item to the dataset.

    Args:
        submission: The content to be evaluated.
        description: A brief description of this item.
        ground_truth: Optional list of ground truth values.
            - For binary criteria: CriterionVerdict (MET, UNMET, CANNOT_ASSESS)
            - For multi-choice criteria: str (option label)
        rubric: Optional per-item rubric. If None, uses global rubric.
        reference_submission: Optional exemplar response for grading context.

    Raises:
        ValueError: If ground_truth length doesn't match effective rubric criteria count,
            or if neither per-item nor global rubric is available.
    """
    item = DataItem(
        submission=submission,
        description=description,
        ground_truth=ground_truth,
        rubric=rubric,
        reference_submission=reference_submission,
    )
    effective_rubric = item.rubric if item.rubric is not None else self.rubric
    if effective_rubric is None:
        raise ValueError(
            "Cannot add item: no per-item rubric provided and no global rubric set"
        )
    if item.ground_truth is not None and len(item.ground_truth) != len(
        effective_rubric.rubric
    ):
        raise ValueError(
            f"Ground truth has {len(item.ground_truth)} values, "
            f"but rubric has {len(effective_rubric.rubric)} criteria"
        )
    self.items.append(item)
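
For completeness, a sketch combining the optional arguments; the rubric and reference strings here are illustrative only:

dataset.add_item(
    submission="Answer to question 2...",
    description="Q2",
    ground_truth=[CriterionVerdict.UNMET],
    rubric=Rubric([Criterion(weight=1.0, requirement="Correct answer for Q2")]),
    reference_submission="Model answer for Q2",
)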

to_json

to_json(indent: int | None = 2) -> str

Serialize the dataset to a JSON string.

PARAMETER DESCRIPTION
indent

Number of spaces for indentation. None for compact output.

TYPE: int | None DEFAULT: 2

RETURNS DESCRIPTION
str

JSON string representation of the dataset.

Source code in src/autorubric/dataset.py
def to_json(self, indent: int | None = 2) -> str:
    """Serialize the dataset to a JSON string.

    Args:
        indent: Number of spaces for indentation. None for compact output.

    Returns:
        JSON string representation of the dataset.
    """
    data: dict[str, Any] = {}
    if self.name is not None:
        data["name"] = self.name
    data["prompt"] = self.prompt

    # Serialize global rubric (can be None)
    if self.rubric is not None:
        data["rubric"] = self._serialize_rubric(self.rubric)
    else:
        data["rubric"] = None

    # Serialize global reference_submission if present
    if self.reference_submission is not None:
        data["reference_submission"] = self.reference_submission

    # Serialize items with ground truth and per-item rubrics
    items_data = []
    for item in self.items:
        item_data: dict[str, Any] = {
            "submission": item.submission,
            "description": item.description,
        }
        if item.ground_truth is not None:
            # Serialize ground truth: CriterionVerdict -> str, str stays str
            gt_values = []
            for v in item.ground_truth:
                if isinstance(v, CriterionVerdict):
                    gt_values.append(v.value)
                else:
                    gt_values.append(v)  # Already a string (option label)
            item_data["ground_truth"] = gt_values
        else:
            item_data["ground_truth"] = None
        # Serialize per-item rubric if present
        if item.rubric is not None:
            item_data["rubric"] = self._serialize_rubric(item.rubric)
        # Serialize per-item reference_submission if present
        if item.reference_submission is not None:
            item_data["reference_submission"] = item.reference_submission
        items_data.append(item_data)
    data["items"] = items_data

    return json.dumps(data, indent=indent)
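
A quick sketch of the two output modes:

json_str = dataset.to_json()             # pretty-printed with 2-space indent
compact = dataset.to_json(indent=None)   # single-line compact output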

to_file

to_file(path: str | Path) -> None

Save dataset to a JSON file.

PARAMETER DESCRIPTION
path

Path to write the JSON file.

TYPE: str | Path

Source code in src/autorubric/dataset.py
def to_file(self, path: str | Path) -> None:
    """Save dataset to a JSON file.

    Args:
        path: Path to write the JSON file.
    """
    from pathlib import Path

    Path(path).write_text(self.to_json(), encoding="utf-8")

from_json classmethod

from_json(json_string: str) -> RubricDataset

Deserialize a dataset from a JSON string.

PARAMETER DESCRIPTION
json_string

JSON string representation of the dataset.

TYPE: str

RETURNS DESCRIPTION
RubricDataset

RubricDataset instance.

RAISES DESCRIPTION
ValueError

If the JSON is invalid, missing required fields, or if an item has no rubric when no global rubric is set.

Source code in src/autorubric/dataset.py
@classmethod
def from_json(cls, json_string: str) -> RubricDataset:
    """Deserialize a dataset from a JSON string.

    Args:
        json_string: JSON string representation of the dataset.

    Returns:
        RubricDataset instance.

    Raises:
        ValueError: If the JSON is invalid, missing required fields, or if
            an item has no rubric when no global rubric is set.
    """
    try:
        data = json.loads(json_string)
    except json.JSONDecodeError as e:
        raise ValueError(f"Failed to parse JSON: {e}") from e

    if not isinstance(data, dict):
        raise ValueError(f"Expected JSON object, got {type(data).__name__}")

    # Validate required fields
    if "prompt" not in data:
        raise ValueError("Missing required field: 'prompt'")
    if "rubric" not in data:
        raise ValueError("Missing required field: 'rubric'")

    # Parse global rubric (can be None/null)
    rubric_data = data["rubric"]
    rubric: Rubric | None = None
    if rubric_data is not None:
        rubric = Rubric.from_dict(rubric_data)

    # Parse items
    items: list[DataItem] = []
    for i, item_data in enumerate(data.get("items", [])):
        if not isinstance(item_data, dict):
            raise ValueError(
                f"Item {i} must be a dict, got {type(item_data).__name__}"
            )

        submission = item_data.get("submission")
        description = item_data.get("description")

        if submission is None:
            raise ValueError(f"Item {i} missing required field: 'submission'")
        if description is None:
            raise ValueError(f"Item {i} missing required field: 'description'")

        # Parse per-item rubric if present
        item_rubric_data = item_data.get("rubric")
        item_rubric: Rubric | None = None
        if item_rubric_data is not None:
            item_rubric = Rubric.from_dict(item_rubric_data)

        # Validate that item has access to a rubric
        effective_rubric = item_rubric if item_rubric is not None else rubric
        if effective_rubric is None:
            raise ValueError(
                f"Item {i} has no rubric and dataset has no global rubric"
            )

        # Parse ground truth against the effective rubric
        ground_truth_raw = item_data.get("ground_truth")
        ground_truth: list[CriterionVerdict | str] | None = None
        if ground_truth_raw is not None:
            ground_truth = []
            for j, v in enumerate(ground_truth_raw):
                criterion = (
                    effective_rubric.rubric[j]
                    if j < len(effective_rubric.rubric)
                    else None
                )

                if criterion is not None and criterion.is_multi_choice:
                    # Multi-choice: keep as string (option label)
                    if not isinstance(v, str):
                        raise ValueError(
                            f"Item {i}, ground_truth[{j}]: multi-choice criterion "
                            f"expects option label string, got {type(v).__name__}"
                        )
                    # Validate that the label exists
                    try:
                        criterion.find_option_by_label(v)
                    except ValueError as e:
                        raise ValueError(
                            f"Item {i}, ground_truth[{j}]: {e}"
                        ) from None
                    ground_truth.append(v)
                else:
                    # Binary: parse as CriterionVerdict
                    try:
                        ground_truth.append(CriterionVerdict(v))
                    except ValueError:
                        raise ValueError(
                            f"Item {i}, ground_truth[{j}]: invalid verdict '{v}'. "
                            f"Must be 'MET', 'UNMET', or 'CANNOT_ASSESS'."
                        ) from None

        # Parse per-item reference_submission if present
        item_reference = item_data.get("reference_submission")

        items.append(
            DataItem(
                submission=submission,
                description=description,
                ground_truth=ground_truth,
                rubric=item_rubric,
                reference_submission=item_reference,
            )
        )

    return cls(
        prompt=data["prompt"],
        rubric=rubric,
        items=items,
        name=data.get("name"),
        reference_submission=data.get("reference_submission"),
    )
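
A round-trip sketch; the field layout matches what to_json emits above:

json_str = dataset.to_json()
restored = RubricDataset.from_json(json_str)
assert len(restored.items) == len(dataset.items)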

from_file classmethod

from_file(path: str | Path) -> RubricDataset

Load dataset from a JSON file.

PARAMETER DESCRIPTION
path

Path to the JSON file.

TYPE: str | Path

RETURNS DESCRIPTION
RubricDataset

RubricDataset instance.

RAISES DESCRIPTION
FileNotFoundError

If the file doesn't exist.

ValueError

If the JSON is invalid.

Source code in src/autorubric/dataset.py
@classmethod
def from_file(cls, path: str | Path) -> RubricDataset:
    """Load dataset from a JSON file.

    Args:
        path: Path to the JSON file.

    Returns:
        RubricDataset instance.

    Raises:
        FileNotFoundError: If the file doesn't exist.
        ValueError: If the JSON is invalid.
    """
    from pathlib import Path

    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"Dataset file not found: {path}")

    return cls.from_json(path.read_text(encoding="utf-8"))

split_train_test

split_train_test(n_train: int, *, stratify: bool = True, seed: int | None = None) -> tuple[RubricDataset, RubricDataset]

Split dataset into training and test sets.

The training set can be used to provide few-shot examples for grading, while the test set is used for evaluation.

PARAMETER DESCRIPTION
n_train

Exact number of items for training set.

TYPE: int

stratify

If True, stratify by per-criterion verdict distribution. This ensures each split has similar proportions of MET/UNMET/CANNOT_ASSESS for each criterion position. Requires all items to have ground_truth.

TYPE: bool DEFAULT: True

seed

Random seed for reproducible splits.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
tuple[RubricDataset, RubricDataset]

Tuple of (train_dataset, test_dataset).

RAISES DESCRIPTION
ValueError

If n_train is invalid or stratify=True but items lack ground_truth.

Example

dataset = RubricDataset.from_file("data.json")
train, test = dataset.split_train_test(n_train=100, stratify=True, seed=42)
print(f"Train: {len(train)}, Test: {len(test)}")

Source code in src/autorubric/dataset.py
def split_train_test(
    self,
    n_train: int,
    *,
    stratify: bool = True,
    seed: int | None = None,
) -> tuple[RubricDataset, RubricDataset]:
    """Split dataset into training and test sets.

    The training set can be used to provide few-shot examples for grading,
    while the test set is used for evaluation.

    Args:
        n_train: Exact number of items for training set.
        stratify: If True, stratify by per-criterion verdict distribution.
            This ensures each split has similar proportion of MET/UNMET/CANNOT_ASSESS
            for each criterion position. Requires all items to have ground_truth.
        seed: Random seed for reproducible splits.

    Returns:
        Tuple of (train_dataset, test_dataset).

    Raises:
        ValueError: If n_train is invalid or stratify=True but items lack ground_truth.

    Example:
        >>> dataset = RubricDataset.from_file("data.json")
        >>> train, test = dataset.split_train_test(n_train=100, stratify=True, seed=42)
        >>> print(f"Train: {len(train)}, Test: {len(test)}")
    """
    import random

    if n_train < 0:
        raise ValueError(f"n_train must be non-negative, got {n_train}")
    if n_train > len(self.items):
        raise ValueError(
            f"n_train ({n_train}) exceeds dataset size ({len(self.items)})"
        )

    rng = random.Random(seed)

    if stratify:
        train_items, test_items = self._stratified_split(n_train, rng)
    else:
        indices = list(range(len(self.items)))
        rng.shuffle(indices)
        train_items = [self.items[i] for i in indices[:n_train]]
        test_items = [self.items[i] for i in indices[n_train:]]

    train_dataset = RubricDataset(
        prompt=self.prompt,
        rubric=self.rubric,
        items=train_items,
        name=self.name,
        reference_submission=self.reference_submission,
    )
    test_dataset = RubricDataset(
        prompt=self.prompt,
        rubric=self.rubric,
        items=test_items,
        reference_submission=self.reference_submission,
        name=self.name,
    )

    return train_dataset, test_dataset