Dataset

Dataset management classes for organizing evaluation data with optional ground truth labels.

Overview

The RubricDataset class provides structured storage for evaluation datasets, including submissions, optional ground truth verdicts, per-item rubrics, and reference submissions. Datasets can be serialized to JSON/YAML for sharing and reproducibility.

Quick Example

from autorubric import Rubric, Criterion, CriterionVerdict, DataItem, RubricDataset

# Create a rubric
rubric = Rubric([
    Criterion(name="accuracy", weight=10.0, requirement="Factually correct"),
    Criterion(name="clarity", weight=5.0, requirement="Clear and concise"),
])

# Create a dataset
dataset = RubricDataset(
    name="photosynthesis-eval",
    prompt="Explain photosynthesis",
    rubric=rubric,
)

# Add items with ground truth
dataset.add_item(
    submission="Photosynthesis is the process by which plants convert sunlight...",
    description="Good response",
    ground_truth=[CriterionVerdict.MET, CriterionVerdict.MET]
)

# Serialize
dataset.to_file("dataset.json")

# Load
loaded = RubricDataset.from_file("dataset.json")

Per-Item Rubrics

For datasets where each item requires a unique rubric (e.g., question-specific evaluation):

item = DataItem(
    submission="Answer to question 1...",
    description="Q1",
    rubric=Rubric([
        Criterion(weight=1.0, requirement="Correct answer for Q1"),
    ])
)

dataset = RubricDataset(
    prompt="Answer the question",
    rubric=None,  # No global rubric
    items=[item],
)

# Get effective rubric for an item
rubric = dataset.get_item_rubric(0)  # Returns item's rubric

Reference Submissions

Provide exemplar responses for judge calibration:

# Global reference for all items
dataset = RubricDataset(
    prompt="Explain photosynthesis",
    rubric=rubric,
    reference_submission="Detailed explanation of photosynthesis...",
)

# Per-item reference (overrides global)
dataset.add_item(
    submission="Student answer...",
    description="Q1",
    reference_submission="Custom reference for this item",
)

# Get effective reference
ref = dataset.get_item_reference_submission(0)

Train/Test Split

train_data, test_data = dataset.split_train_test(
    n_train=100,
    stratify=True,  # Balance by ground truth verdicts
    seed=42,
)

DataItem

A single item in an evaluation dataset.

DataItem dataclass

DataItem(submission: str, description: str, ground_truth: list[CriterionVerdict | str] | None = None, rubric: Rubric | None = None, reference_submission: str | None = None)

A single item to be graded, optionally with ground truth verdicts.

ATTRIBUTE DESCRIPTION
submission

The content to be evaluated. Can be plain text or a JSON-serialized string for structured data (e.g., dialogues, multi-part responses).

TYPE: str

description

A brief description of this item (e.g., "High quality response").

TYPE: str

ground_truth

Optional list of ground truth values, one per criterion.

- For binary criteria: CriterionVerdict (MET, UNMET, CANNOT_ASSESS)
- For multi-choice criteria: str (option label)

Used for computing evaluation metrics against LLM predictions.

TYPE: list[CriterionVerdict | str] | None

rubric

Optional per-item rubric. If provided, this rubric is used for grading instead of the dataset-level rubric. Useful for datasets where each item has unique evaluation criteria (e.g., ResearcherBench).

TYPE: Rubric | None

reference_submission

Optional exemplar response for grading context. When present, helps calibrate the grader's expectations. Item-level takes precedence over dataset-level reference.

TYPE: str | None

Example

Binary criteria only

item = DataItem( ... submission="The Industrial Revolution began in Britain around 1760...", ... description="Excellent essay covering all criteria", ... ground_truth=[CriterionVerdict.MET, CriterionVerdict.MET, CriterionVerdict.UNMET] ... )

Mixed binary and multi-choice

item = DataItem( ... submission="The assistant responded helpfully...", ... description="Good dialogue", ... ground_truth=[CriterionVerdict.MET, "Very satisfied", "Yes - reasonable"] ... )

With per-item rubric

from autorubric import Rubric, Criterion

item = DataItem(
    submission="Response to a specific question...",
    description="Question-specific grading",
    rubric=Rubric([Criterion(name="Relevance", weight=1.0, requirement="...")])
)


RubricDataset

Container for evaluation datasets with optional ground truth.

RubricDataset dataclass

RubricDataset(prompt: str, rubric: Rubric | None = None, items: list[DataItem] = list(), name: str | None = None, reference_submission: str | None = None)

A collection of DataItems tied to a specific prompt and rubric.

The RubricDataset encapsulates:

- The prompt that generated the responses
- The rubric used for evaluation (global or per-item)
- A collection of DataItems with optional ground truth labels

This is useful for:

- Evaluating LLM grader accuracy against human judgments
- Training reward models with labeled data
- Benchmarking different grading strategies

ATTRIBUTE DESCRIPTION
prompt

The prompt/question that items are responses to.

TYPE: str

rubric

Optional global Rubric used to evaluate items. Can be None if all items have their own rubrics.

TYPE: Rubric | None

items

List of DataItem instances to evaluate.

TYPE: list[DataItem]

name

Optional name for the dataset (e.g., "essay-grading-v1").

TYPE: str | None

reference_submission

Optional global exemplar response for grading context. When present, provides calibration for the grader. Item-level reference takes precedence over this dataset-level reference.

TYPE: str | None

Example

from autorubric import Rubric, Criterion, CriterionVerdict

rubric = Rubric([
    Criterion(name="Accuracy", weight=10.0, requirement="Factually correct"),
    Criterion(name="Clarity", weight=5.0, requirement="Clear and concise"),
])

dataset = RubricDataset(
    prompt="Explain photosynthesis",
    rubric=rubric,
)

dataset.add_item(
    submission="Photosynthesis is the process...",
    description="Good response",
    ground_truth=[CriterionVerdict.MET, CriterionVerdict.MET]
)

criterion_names property

criterion_names: list[str]

Get criterion names from global rubric.

RAISES DESCRIPTION
ValueError

If no global rubric is set.

num_criteria property

num_criteria: int

Number of criteria in the global rubric.

RAISES DESCRIPTION
ValueError

If no global rubric is set.

total_positive_weight property

total_positive_weight: float

Sum of all positive criterion weights in global rubric.

RAISES DESCRIPTION
ValueError

If no global rubric is set.
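
These properties read only the global rubric. A minimal sketch of what they resolve to, assuming the two-criterion rubric from the Quick Example above (weights 10.0 and 5.0):

dataset.criterion_names        # ["accuracy", "clarity"]
dataset.num_criteria           # 2
dataset.total_positive_weight  # 15.0

# With rubric=None (per-item rubrics only), each property raises ValueError.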

get_item_rubric

get_item_rubric(idx: int) -> Rubric

Get the effective rubric for an item (per-item or global fallback).

PARAMETER DESCRIPTION
idx

Index of the item.

TYPE: int

RETURNS DESCRIPTION
Rubric

The item's rubric if set, otherwise the dataset's global rubric.

RAISES DESCRIPTION
ValueError

If neither item nor dataset has a rubric.

Source code in src/autorubric/dataset.py
def get_item_rubric(self, idx: int) -> Rubric:
    """Get the effective rubric for an item (per-item or global fallback).

    Args:
        idx: Index of the item.

    Returns:
        The item's rubric if set, otherwise the dataset's global rubric.

    Raises:
        ValueError: If neither item nor dataset has a rubric.
    """
    item = self.items[idx]
    if item.rubric is not None:
        return item.rubric
    if self.rubric is not None:
        return self.rubric
    raise ValueError(
        f"Item {idx} has no rubric and dataset has no global rubric"
    )
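
A brief usage sketch of the fallback order, assuming a hypothetical dataset whose global rubric is set and whose second item carries its own per-item rubric:

dataset.get_item_rubric(0)  # item 0 has no rubric -> returns the global dataset.rubric
dataset.get_item_rubric(1)  # item 1 defines a rubric -> returns that per-item rubric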

get_item_reference_submission

get_item_reference_submission(idx: int) -> str | None

Get the effective reference submission for an item.

Item-level reference takes precedence over dataset-level reference.

PARAMETER DESCRIPTION
idx

Index of the item.

TYPE: int

RETURNS DESCRIPTION
str | None

The item's reference_submission if set, otherwise the dataset's global reference_submission. May be None if neither is set.

Source code in src/autorubric/dataset.py
def get_item_reference_submission(self, idx: int) -> str | None:
    """Get the effective reference submission for an item.

    Item-level reference takes precedence over dataset-level reference.

    Args:
        idx: Index of the item.

    Returns:
        The item's reference_submission if set, otherwise the dataset's
        global reference_submission. May be None if neither is set.
    """
    item = self.items[idx]
    if item.reference_submission is not None:
        return item.reference_submission
    return self.reference_submission

compute_weighted_score

compute_weighted_score(verdicts: list[CriterionVerdict | str], normalize: bool = True, rubric: Rubric | None = None) -> float

Compute weighted score from verdicts (binary or multi-choice).

PARAMETER DESCRIPTION
verdicts

List of verdict values, one per criterion.

- For binary criteria: CriterionVerdict (MET=1.0, UNMET=0.0)
- For multi-choice criteria: str (option label, resolved to value)

TYPE: list[CriterionVerdict | str]

normalize

If True, normalize score to [0, 1]. If False, return raw sum.

TYPE: bool DEFAULT: True

rubric

Optional rubric to use for scoring. If None, uses global rubric.

TYPE: Rubric | None DEFAULT: None

RETURNS DESCRIPTION
float

Weighted score based on criterion weights and verdicts.

RAISES DESCRIPTION
ValueError

If a multi-choice label doesn't match any option, or if rubric is None and no global rubric is set.

Source code in src/autorubric/dataset.py
def compute_weighted_score(
    self,
    verdicts: list[CriterionVerdict | str],
    normalize: bool = True,
    rubric: Rubric | None = None,
) -> float:
    """Compute weighted score from verdicts (binary or multi-choice).

    Args:
        verdicts: List of verdict values, one per criterion.
            - For binary criteria: CriterionVerdict (MET=1.0, UNMET=0.0)
            - For multi-choice criteria: str (option label, resolved to value)
        normalize: If True, normalize score to [0, 1]. If False, return raw sum.
        rubric: Optional rubric to use for scoring. If None, uses global rubric.

    Returns:
        Weighted score based on criterion weights and verdicts.

    Raises:
        ValueError: If a multi-choice label doesn't match any option, or if
            rubric is None and no global rubric is set.
    """
    effective_rubric = rubric if rubric is not None else self.rubric
    if effective_rubric is None:
        raise ValueError(
            "Cannot compute score: no rubric provided and no global rubric set"
        )

    score = 0.0
    total_positive = 0.0

    for i, verdict in enumerate(verdicts):
        criterion = effective_rubric.rubric[i]
        weight = criterion.weight

        if criterion.is_multi_choice:
            # Multi-choice: resolve label to value
            if isinstance(verdict, str):
                idx = criterion.find_option_by_label(verdict)
                opt = criterion.options[idx]  # type: ignore
                if opt.na:
                    # NA options don't contribute
                    continue
                score += opt.value * weight
                if weight > 0:
                    total_positive += weight
            else:
                raise ValueError(
                    f"Criterion {i} is multi-choice but got CriterionVerdict; "
                    f"expected option label string"
                )
        else:
            # Binary: MET=1.0, UNMET=0.0, CANNOT_ASSESS skipped
            if isinstance(verdict, str):
                # Try to parse as CriterionVerdict
                try:
                    verdict = CriterionVerdict(verdict)
                except ValueError:
                    raise ValueError(
                        f"Criterion {i} is binary but got invalid verdict '{verdict}'. "
                        f"Must be 'MET', 'UNMET', or 'CANNOT_ASSESS'."
                    ) from None

            if verdict == CriterionVerdict.CANNOT_ASSESS:
                continue  # Skip
            if verdict == CriterionVerdict.MET:
                score += weight
            if weight > 0:
                total_positive += weight

    if normalize:
        if total_positive > 0:
            return max(0.0, min(1.0, score / total_positive))
        return 0.0
    return score
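
A usage sketch, assuming the two-criterion rubric from the Quick Example (weights 10.0 and 5.0); the normalized score is the met positive weight divided by the total positive weight:

verdicts = [CriterionVerdict.MET, CriterionVerdict.UNMET]

dataset.compute_weighted_score(verdicts)                   # 10.0 / 15.0 ≈ 0.667
dataset.compute_weighted_score(verdicts, normalize=False)  # 10.0 (raw weighted sum)

# CANNOT_ASSESS removes that criterion from both numerator and denominator:
dataset.compute_weighted_score([CriterionVerdict.MET, CriterionVerdict.CANNOT_ASSESS])  # 1.0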

add_item

add_item(submission: str, description: str, ground_truth: list[CriterionVerdict | str] | None = None, rubric: Rubric | None = None, reference_submission: str | None = None) -> None

Add a new item to the dataset.

PARAMETER DESCRIPTION
submission

The content to be evaluated.

TYPE: str

description

A brief description of this item.

TYPE: str

ground_truth

Optional list of ground truth values.

- For binary criteria: CriterionVerdict (MET, UNMET, CANNOT_ASSESS)
- For multi-choice criteria: str (option label)

TYPE: list[CriterionVerdict | str] | None DEFAULT: None

rubric

Optional per-item rubric. If None, uses global rubric.

TYPE: Rubric | None DEFAULT: None

reference_submission

Optional exemplar response for grading context.

TYPE: str | None DEFAULT: None

RAISES DESCRIPTION
ValueError

If ground_truth length doesn't match effective rubric criteria count, or if neither per-item nor global rubric is available.

Source code in src/autorubric/dataset.py
def add_item(
    self,
    submission: str,
    description: str,
    ground_truth: list[CriterionVerdict | str] | None = None,
    rubric: Rubric | None = None,
    reference_submission: str | None = None,
) -> None:
    """Add a new item to the dataset.

    Args:
        submission: The content to be evaluated.
        description: A brief description of this item.
        ground_truth: Optional list of ground truth values.
            - For binary criteria: CriterionVerdict (MET, UNMET, CANNOT_ASSESS)
            - For multi-choice criteria: str (option label)
        rubric: Optional per-item rubric. If None, uses global rubric.
        reference_submission: Optional exemplar response for grading context.

    Raises:
        ValueError: If ground_truth length doesn't match effective rubric criteria count,
            or if neither per-item nor global rubric is available.
    """
    item = DataItem(
        submission=submission,
        description=description,
        ground_truth=ground_truth,
        rubric=rubric,
        reference_submission=reference_submission,
    )
    effective_rubric = item.rubric if item.rubric is not None else self.rubric
    if effective_rubric is None:
        raise ValueError(
            "Cannot add item: no per-item rubric provided and no global rubric set"
        )
    if item.ground_truth is not None and len(item.ground_truth) != len(
        effective_rubric.rubric
    ):
        raise ValueError(
            f"Ground truth has {len(item.ground_truth)} values, "
            f"but rubric has {len(effective_rubric.rubric)} criteria"
        )
    self.items.append(item)
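
For completeness, a sketch combining the optional arguments; the rubric and reference strings here are illustrative only:

dataset.add_item(
    submission="Answer to question 2...",
    description="Q2",
    ground_truth=[CriterionVerdict.UNMET],
    rubric=Rubric([Criterion(weight=1.0, requirement="Correct answer for Q2")]),
    reference_submission="Model answer for Q2",
)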

to_json

to_json(indent: int | None = 2) -> str

Serialize the dataset to a JSON string.

PARAMETER DESCRIPTION
indent

Number of spaces for indentation. None for compact output.

TYPE: int | None DEFAULT: 2

RETURNS DESCRIPTION
str

JSON string representation of the dataset.

Source code in src/autorubric/dataset.py
def to_json(self, indent: int | None = 2) -> str:
    """Serialize the dataset to a JSON string.

    Args:
        indent: Number of spaces for indentation. None for compact output.

    Returns:
        JSON string representation of the dataset.
    """
    data: dict[str, Any] = {}
    if self.name is not None:
        data["name"] = self.name
    data["prompt"] = self.prompt

    # Serialize global rubric (can be None)
    if self.rubric is not None:
        data["rubric"] = self._serialize_rubric(self.rubric)
    else:
        data["rubric"] = None

    # Serialize global reference_submission if present
    if self.reference_submission is not None:
        data["reference_submission"] = self.reference_submission

    # Serialize items with ground truth and per-item rubrics
    items_data = []
    for item in self.items:
        item_data: dict[str, Any] = {
            "submission": item.submission,
            "description": item.description,
        }
        if item.ground_truth is not None:
            # Serialize ground truth: CriterionVerdict -> str, str stays str
            gt_values = []
            for v in item.ground_truth:
                if isinstance(v, CriterionVerdict):
                    gt_values.append(v.value)
                else:
                    gt_values.append(v)  # Already a string (option label)
            item_data["ground_truth"] = gt_values
        else:
            item_data["ground_truth"] = None
        # Serialize per-item rubric if present
        if item.rubric is not None:
            item_data["rubric"] = self._serialize_rubric(item.rubric)
        # Serialize per-item reference_submission if present
        if item.reference_submission is not None:
            item_data["reference_submission"] = item.reference_submission
        items_data.append(item_data)
    data["items"] = items_data

    return json.dumps(data, indent=indent)
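
A quick sketch of the two output modes:

json_str = dataset.to_json()             # pretty-printed with 2-space indent
compact = dataset.to_json(indent=None)   # single-line compact output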

to_file

to_file(path: str | Path) -> None

Save dataset to a JSON file.

PARAMETER DESCRIPTION
path

Path to write the JSON file.

TYPE: str | Path

Source code in src/autorubric/dataset.py
def to_file(self, path: str | Path) -> None:
    """Save dataset to a JSON file.

    Args:
        path: Path to write the JSON file.
    """
    from pathlib import Path

    Path(path).write_text(self.to_json(), encoding="utf-8")

from_json classmethod

from_json(json_string: str) -> RubricDataset

Deserialize a dataset from a JSON string.

PARAMETER DESCRIPTION
json_string

JSON string representation of the dataset.

TYPE: str

RETURNS DESCRIPTION
RubricDataset

RubricDataset instance.

RAISES DESCRIPTION
ValueError

If the JSON is invalid, missing required fields, or if an item has no rubric when no global rubric is set.

Source code in src/autorubric/dataset.py
@classmethod
def from_json(cls, json_string: str) -> RubricDataset:
    """Deserialize a dataset from a JSON string.

    Args:
        json_string: JSON string representation of the dataset.

    Returns:
        RubricDataset instance.

    Raises:
        ValueError: If the JSON is invalid, missing required fields, or if
            an item has no rubric when no global rubric is set.
    """
    try:
        data = json.loads(json_string)
    except json.JSONDecodeError as e:
        raise ValueError(f"Failed to parse JSON: {e}") from e

    if not isinstance(data, dict):
        raise ValueError(f"Expected JSON object, got {type(data).__name__}")

    # Validate required fields
    if "prompt" not in data:
        raise ValueError("Missing required field: 'prompt'")
    if "rubric" not in data:
        raise ValueError("Missing required field: 'rubric'")

    # Parse global rubric (can be None/null)
    rubric_data = data["rubric"]
    rubric: Rubric | None = None
    if rubric_data is not None:
        rubric = Rubric.from_dict(rubric_data)

    # Parse items
    items: list[DataItem] = []
    for i, item_data in enumerate(data.get("items", [])):
        if not isinstance(item_data, dict):
            raise ValueError(
                f"Item {i} must be a dict, got {type(item_data).__name__}"
            )

        submission = item_data.get("submission")
        description = item_data.get("description")

        if submission is None:
            raise ValueError(f"Item {i} missing required field: 'submission'")
        if description is None:
            raise ValueError(f"Item {i} missing required field: 'description'")

        # Parse per-item rubric if present
        item_rubric_data = item_data.get("rubric")
        item_rubric: Rubric | None = None
        if item_rubric_data is not None:
            item_rubric = Rubric.from_dict(item_rubric_data)

        # Validate that item has access to a rubric
        effective_rubric = item_rubric if item_rubric is not None else rubric
        if effective_rubric is None:
            raise ValueError(
                f"Item {i} has no rubric and dataset has no global rubric"
            )

        # Parse ground truth against the effective rubric
        ground_truth_raw = item_data.get("ground_truth")
        ground_truth: list[CriterionVerdict | str] | None = None
        if ground_truth_raw is not None:
            ground_truth = []
            for j, v in enumerate(ground_truth_raw):
                criterion = (
                    effective_rubric.rubric[j]
                    if j < len(effective_rubric.rubric)
                    else None
                )

                if criterion is not None and criterion.is_multi_choice:
                    # Multi-choice: keep as string (option label)
                    if not isinstance(v, str):
                        raise ValueError(
                            f"Item {i}, ground_truth[{j}]: multi-choice criterion "
                            f"expects option label string, got {type(v).__name__}"
                        )
                    # Validate that the label exists
                    try:
                        criterion.find_option_by_label(v)
                    except ValueError as e:
                        raise ValueError(
                            f"Item {i}, ground_truth[{j}]: {e}"
                        ) from None
                    ground_truth.append(v)
                else:
                    # Binary: parse as CriterionVerdict
                    try:
                        ground_truth.append(CriterionVerdict(v))
                    except ValueError:
                        raise ValueError(
                            f"Item {i}, ground_truth[{j}]: invalid verdict '{v}'. "
                            f"Must be 'MET', 'UNMET', or 'CANNOT_ASSESS'."
                        ) from None

        # Parse per-item reference_submission if present
        item_reference = item_data.get("reference_submission")

        items.append(
            DataItem(
                submission=submission,
                description=description,
                ground_truth=ground_truth,
                rubric=item_rubric,
                reference_submission=item_reference,
            )
        )

    return cls(
        prompt=data["prompt"],
        rubric=rubric,
        items=items,
        name=data.get("name"),
        reference_submission=data.get("reference_submission"),
    )
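
A round-trip sketch; the field layout matches what to_json emits above:

json_str = dataset.to_json()
restored = RubricDataset.from_json(json_str)
assert len(restored.items) == len(dataset.items)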

from_file classmethod

from_file(path: str | Path) -> RubricDataset

Load dataset from a JSON file.

PARAMETER DESCRIPTION
path

Path to the JSON file.

TYPE: str | Path

RETURNS DESCRIPTION
RubricDataset

RubricDataset instance.

RAISES DESCRIPTION
FileNotFoundError

If the file doesn't exist.

ValueError

If the JSON is invalid.

Source code in src/autorubric/dataset.py
@classmethod
def from_file(cls, path: str | Path) -> RubricDataset:
    """Load dataset from a JSON file.

    Args:
        path: Path to the JSON file.

    Returns:
        RubricDataset instance.

    Raises:
        FileNotFoundError: If the file doesn't exist.
        ValueError: If the JSON is invalid.
    """
    from pathlib import Path

    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"Dataset file not found: {path}")

    return cls.from_json(path.read_text(encoding="utf-8"))

split_train_test

split_train_test(n_train: int, *, stratify: bool = True, seed: int | None = None) -> tuple[RubricDataset, RubricDataset]

Split dataset into training and test sets.

The training set can be used to provide few-shot examples for grading, while the test set is used for evaluation.

PARAMETER DESCRIPTION
n_train

Exact number of items for training set.

TYPE: int

stratify

If True, stratify by per-criterion verdict distribution. This ensures each split has similar proportions of MET/UNMET/CANNOT_ASSESS for each criterion position. Requires all items to have ground_truth.

TYPE: bool DEFAULT: True

seed

Random seed for reproducible splits.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
tuple[RubricDataset, RubricDataset]

Tuple of (train_dataset, test_dataset).

RAISES DESCRIPTION
ValueError

If n_train is invalid or stratify=True but items lack ground_truth.

Example

dataset = RubricDataset.from_file("data.json")
train, test = dataset.split_train_test(n_train=100, stratify=True, seed=42)
print(f"Train: {len(train)}, Test: {len(test)}")

Source code in src/autorubric/dataset.py
def split_train_test(
    self,
    n_train: int,
    *,
    stratify: bool = True,
    seed: int | None = None,
) -> tuple[RubricDataset, RubricDataset]:
    """Split dataset into training and test sets.

    The training set can be used to provide few-shot examples for grading,
    while the test set is used for evaluation.

    Args:
        n_train: Exact number of items for training set.
        stratify: If True, stratify by per-criterion verdict distribution.
            This ensures each split has similar proportion of MET/UNMET/CANNOT_ASSESS
            for each criterion position. Requires all items to have ground_truth.
        seed: Random seed for reproducible splits.

    Returns:
        Tuple of (train_dataset, test_dataset).

    Raises:
        ValueError: If n_train is invalid or stratify=True but items lack ground_truth.

    Example:
        >>> dataset = RubricDataset.from_file("data.json")
        >>> train, test = dataset.split_train_test(n_train=100, stratify=True, seed=42)
        >>> print(f"Train: {len(train)}, Test: {len(test)}")
    """
    import random

    if n_train < 0:
        raise ValueError(f"n_train must be non-negative, got {n_train}")
    if n_train > len(self.items):
        raise ValueError(
            f"n_train ({n_train}) exceeds dataset size ({len(self.items)})"
        )

    rng = random.Random(seed)

    if stratify:
        train_items, test_items = self._stratified_split(n_train, rng)
    else:
        indices = list(range(len(self.items)))
        rng.shuffle(indices)
        train_items = [self.items[i] for i in indices[:n_train]]
        test_items = [self.items[i] for i in indices[n_train:]]

    train_dataset = RubricDataset(
        prompt=self.prompt,
        rubric=self.rubric,
        items=train_items,
        name=self.name,
        reference_submission=self.reference_submission,
    )
    test_dataset = RubricDataset(
        prompt=self.prompt,
        rubric=self.rubric,
        items=test_items,
        reference_submission=self.reference_submission,
        name=self.name,
    )

    return train_dataset, test_dataset