Rubric Improvement

Iterative rubric improvement engine that optimizes for meta-rubric quality (validity) and validation reliability.

Overview

The improvement API provides a two-tier interface for iteratively refining rubrics:

| Level | Entry Point | Use Case |
| --- | --- | --- |
| Convenience | improve_rubric() | Quick start with keyword arguments |
| Full Control | ImprovementRunner | Custom convergence, callbacks, fine-grained config |

Two validation modes are supported:

| Mode | Trigger | Metric |
| --- | --- | --- |
| Ground-truth | validation_data items have ground_truth | Spearman rank correlation (ρ) between rubric scores and expected scores |
| Multi-judge | validation_data items lack ground_truth and eval_llm is list[JudgeSpec] | Mean inter-judge agreement |

A Pareto constraint rejects revisions that improve quality but decrease validation reliability.

Quick Example

Using improve_rubric()

import asyncio
from autorubric import LLMConfig, Rubric
from autorubric.dataset import RubricDataset
from autorubric.meta import improve_rubric

async def main():
    eval_llm = LLMConfig(model="openai/gpt-4.1", temperature=0.0)
    revision_llm = LLMConfig(model="openai/gpt-4.1", temperature=0.3)

    rubric = Rubric.from_file("my_rubric.json")
    validation_data = RubricDataset.from_file("validation_data.json")

    result = await improve_rubric(
        rubric,
        "Your task prompt here",
        eval_llm=eval_llm,
        revision_llm=revision_llm,
        validation_data=validation_data,
        artifacts_dir="experiments/my_improvement",
        display="stdout",
    )

    print(f"Quality: {result.iterations[-1].quality_score:.0%}")
    print(f"Convergence: {result.convergence_reason}")
    result.final_rubric.to_file("improved_rubric.json")

asyncio.run(main())

Using ImprovementRunner

from autorubric.meta import ImprovementRunner, ImprovementConfig

config = ImprovementConfig(
    eval_llm=eval_llm,
    revision_llm=revision_llm,
    validation_data=validation_data,
    max_iterations=15,
    min_quality_score=0.95,
    show_progress=True,
)
runner = ImprovementRunner(rubric, "Your task prompt", config=config)
result = await runner.run()

Validation Modes

Ground-Truth Mode

When validation_data items have ground_truth verdicts, the loop computes expected scores from the rubric weights and measures Spearman ρ against the actual graded scores.

# Items with ground_truth → ground-truth mode
dataset = RubricDataset.from_file("labeled_data.json")
result = await improve_rubric(
    rubric, prompt,
    eval_llm=LLMConfig(model="openai/gpt-4.1"),
    revision_llm=LLMConfig(model="openai/gpt-4.1"),
    validation_data=dataset,
)
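The ρ computation itself is standard rank correlation. A self-contained sketch using the classic no-ties formula (illustrative; the library's actual implementation is internal and may differ):

```python
def spearman_rho(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation via the no-ties formula:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    def ranks(vs: list[float]) -> list[float]:
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Identical rank ordering scores rho = 1.0 even when absolute values differ.
print(spearman_rho([0.2, 0.5, 0.9, 0.7], [0.1, 0.4, 1.0, 0.6]))  # 1.0
```

Because ρ compares rankings rather than absolute values, a rubric can score systematically high or low and still achieve ρ = 1.0 as long as it orders submissions correctly.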

Multi-Judge Mode

When items lack ground_truth, provide an ensemble of judges to measure inter-judge agreement:

from autorubric.graders import JudgeSpec

judges = [
    JudgeSpec(LLMConfig(model="openai/gpt-4.1"), "gpt"),
    JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929"), "claude"),
]
result = await improve_rubric(
    rubric, prompt,
    eval_llm=judges,
    revision_llm=LLMConfig(model="openai/gpt-4.1"),
    validation_data=dataset,  # items without ground_truth
)
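The agreement metric can be pictured as pairwise verdict overlap. A minimal sketch, under the assumption that agreement is the fraction of matching per-criterion verdicts averaged over all judge pairs (the library's exact definition may differ):

```python
from itertools import combinations

def mean_pairwise_agreement(verdicts_by_judge: dict[str, list[bool]]) -> float:
    """Assumed metric: per-pair fraction of matching verdicts,
    averaged over every pair of judges."""
    pairs = list(combinations(verdicts_by_judge.values(), 2))
    scores = []
    for a, b in pairs:
        matches = sum(va == vb for va, vb in zip(a, b))
        scores.append(matches / len(a))
    return sum(scores) / len(scores)

# Two judges agreeing on 3 of 4 criteria -> 0.75
print(mean_pairwise_agreement({
    "gpt": [True, True, False, True],
    "claude": [True, False, False, True],
}))  # 0.75
```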

Improvement Strategies

The improvement loop supports two strategies for guiding rubric revision:

| Strategy | Description | Metric |
| --- | --- | --- |
| meta_rubric (default) | Revise based on meta-rubric quality issues | Meta-rubric quality score |
| held_out | Revise based on per-criterion grading errors on held-out data | Mean per-criterion accuracy |

Held-Out Strategy

The held_out strategy optimizes the rubric against grading errors on held-out data. Instead of using a meta-rubric to identify structural issues, it grades the validation items, compares per-criterion verdicts against ground truth, and uses the resulting error analysis (false positives, false negatives, disagreement exemplars) to guide revision. This requires validation_data with ground_truth verdicts.

result = await improve_rubric(
    rubric, prompt,
    eval_llm=LLMConfig(model="openai/gpt-4.1"),
    revision_llm=LLMConfig(model="openai/gpt-4.1"),
    validation_data=dataset,  # must have ground_truth
    strategy="held_out",
)

Strategies can be chained — for example, first optimize against held-out errors, then polish with meta-rubric evaluation:

# Phase 1: fix grading errors
result1 = await improve_rubric(
    rubric, prompt,
    eval_llm=eval_llm, revision_llm=revision_llm,
    validation_data=dataset,
    strategy="held_out",
    max_iterations=5,
)

# Phase 2: polish with meta-rubric
result2 = await improve_rubric(
    result1.final_rubric, prompt,
    eval_llm=eval_llm, revision_llm=revision_llm,
    validation_data=dataset,
    strategy="meta_rubric",
    max_iterations=5,
)

Artifact Persistence

When save_artifacts=True and artifacts_dir is set, the improvement loop writes:

| File | Contents |
| --- | --- |
| rubric-iter-{NN}.json | Criteria array per iteration |
| eval-iter-{NN}.html | Meta-rubric eval report (always generated) |
| iter-{NN}.json | Rich per-iteration JSON (quality report, issues, validation samples, revision prompts/response) |
| improvement_report.html | Consolidated report (always generated) |
| summary.json | Full run metadata, config snapshot, and per-iteration summary |

Custom Convergence

Replace the built-in convergence logic with a custom function:

from autorubric.meta import ConvergenceFn, IterationResult

def my_convergence(current: IterationResult, history: list[IterationResult]) -> str | None:
    if current.quality_score > 0.9 and len(current.issues) == 0:
        return "perfect quality with no issues"
    if len(history) >= 5:
        return "max iterations reached"
    return None  # continue

config = ImprovementConfig(
    eval_llm=eval_llm,
    revision_llm=revision_llm,
    convergence_fn=my_convergence,
)

improve_rubric

Convenience wrapper for iterative rubric improvement.

improve_rubric async

improve_rubric(rubric: Rubric, task_prompt: str | None = None, *, config: ImprovementConfig | None = None, eval_llm: LLMConfig | list[JudgeSpec] | None = None, revision_llm: LLMConfig | None = None, validation_data: RubricDataset | None = None, max_iterations: int | None = None, display: Literal['stdout', 'html', None] = None, artifacts_dir: Path | str | None = None, save_artifacts: bool | None = None, show_progress: bool | None = None, mode: Literal['standalone', 'in_context'] | None = None, max_total_cost: float | None = None, strategy: Literal['meta_rubric', 'held_out'] | None = None) -> ImprovementResult

Iteratively improve a rubric using meta-rubric evaluation and validation.

Convenience wrapper around ImprovementRunner. Two strategies available:

  • meta_rubric (default): Optimize against structural meta-rubric quality.
  • held_out: Optimize against grading errors on held-out data with ground truth.

Strategies are composable: feed result.best_rubric from one run into the next (e.g., held_out -> meta_rubric).

PARAMETER DESCRIPTION
rubric

The rubric to improve.

TYPE: Rubric

task_prompt

The task the rubric evaluates (required for in_context mode).

TYPE: str | None DEFAULT: None

config

Full configuration. If provided, keyword shortcuts override its fields (non-None values only).

TYPE: ImprovementConfig | None DEFAULT: None

eval_llm

LLM for meta-rubric evaluation and validation. Can be a single LLMConfig or list[JudgeSpec] for ensemble.

TYPE: LLMConfig | list[JudgeSpec] | None DEFAULT: None

revision_llm

LLM for rubric revision.

TYPE: LLMConfig | None DEFAULT: None

validation_data

Dataset for validation (ground-truth or multi-judge mode).

TYPE: RubricDataset | None DEFAULT: None

max_iterations

Maximum number of iterations.

TYPE: int | None DEFAULT: None

display

Output mode - "stdout", "html", or None.

TYPE: Literal['stdout', 'html', None] DEFAULT: None

artifacts_dir

Directory for saved artifacts.

TYPE: Path | str | None DEFAULT: None

save_artifacts

Whether to save rubric JSONs and reports to disk.

TYPE: bool | None DEFAULT: None

show_progress

Whether to show Rich progress indicators.

TYPE: bool | None DEFAULT: None

mode

Evaluation mode - "standalone" or "in_context".

TYPE: Literal['standalone', 'in_context'] | None DEFAULT: None

max_total_cost

Stop if total cost exceeds this amount (USD).

TYPE: float | None DEFAULT: None

strategy

Improvement strategy - "meta_rubric" or "held_out".

TYPE: Literal['meta_rubric', 'held_out'] | None DEFAULT: None

RETURNS DESCRIPTION
ImprovementResult

ImprovementResult with the original, final, and best rubrics plus all iteration details.

RAISES DESCRIPTION
ValueError

If neither config nor eval_llm/revision_llm are provided.


ImprovementRunner

Full-control runner class following the EvalRunner pattern.

ImprovementRunner

ImprovementRunner(rubric: Rubric, task_prompt: str | None = None, *, config: ImprovementConfig | None = None)

Runs iterative rubric improvement with progress tracking.

Following the EvalRunner pattern, this class orchestrates the full improvement loop: evaluate quality, test agreement, check convergence, revise rubric, repeat.

Example

from autorubric import LLMConfig, Rubric
from autorubric.meta import ImprovementRunner, ImprovementConfig

config = ImprovementConfig(
    eval_llm=LLMConfig(model="gpt-4o"),
    revision_llm=LLMConfig(model="gpt-4o"),
)
runner = ImprovementRunner(rubric, "Your task prompt", config=config)
result = await runner.run()

run async

Run the improvement loop and return the result.

Dispatches to the appropriate strategy: _run_meta_rubric() for meta-rubric-based optimization, or _run_held_out() for held-out grading error optimization.

RETURNS DESCRIPTION
ImprovementResult

ImprovementResult with the original, final, and best rubrics plus all iteration details.

RAISES DESCRIPTION
ValueError

If task_prompt is required but not provided.


ImprovementConfig

Configuration for the rubric improvement process. The strategy field selects the revision approach: "meta_rubric" (default) or "held_out".

ImprovementConfig dataclass

ImprovementConfig(eval_llm: LLMConfig | list[JudgeSpec], revision_llm: LLMConfig, mode: Literal['standalone', 'in_context'] = 'in_context', strategy: Literal['meta_rubric', 'held_out'] = 'meta_rubric', validation_data: RubricDataset | None = None, max_iterations: int = 10, min_quality_score: float = 0.95, min_agreement: float = 0.85, score_plateau_threshold: float = 0.02, plateau_patience: int = 2, max_exemplars_per_criterion: int = 3, held_out_min_accuracy: float = 0.9, history_window: int = 3, reject_agreement_regression: bool = True, save_artifacts: bool = True, artifacts_dir: Path | str | None = None, display: Literal['stdout', 'html', None] = None, show_progress: bool = True, max_total_cost: float | None = None, convergence_fn: ConvergenceFn | None = None, revision_system_prompt: str | None = None, revision_user_prompt_template: str | None = None)

Configuration for the rubric improvement process.

ATTRIBUTE DESCRIPTION
eval_llm

LLM configuration for meta-rubric evaluation and validation.

  • LLMConfig: single judge (used in both ground-truth mode and meta-rubric evaluation).
  • list[JudgeSpec]: ensemble (required for multi-judge mode; meta-rubric evaluation uses the first judge's config).

TYPE: LLMConfig | list[JudgeSpec]

revision_llm

LLM configuration for rubric revision.

TYPE: LLMConfig

mode

Evaluation mode - "standalone" or "in_context".

TYPE: Literal['standalone', 'in_context']

strategy

Improvement strategy - "meta_rubric" (default) optimizes against structural meta-rubric quality; "held_out" optimizes against grading errors on held-out data with ground truth.

TYPE: Literal['meta_rubric', 'held_out']

validation_data

Optional dataset for validation. When items have ground_truth, uses ground-truth mode (Spearman ρ). When items lack ground_truth, requires eval_llm as list[JudgeSpec] for multi-judge agreement mode. Required for held_out strategy.

TYPE: RubricDataset | None

max_iterations

Maximum number of improvement iterations.

TYPE: int

min_quality_score

Stop if quality score reaches this threshold.

TYPE: float

min_agreement

Stop if agreement/correlation reaches this threshold.

TYPE: float

score_plateau_threshold

Minimum score improvement to avoid plateau detection.

TYPE: float

plateau_patience

Number of iterations with no improvement before stopping.

TYPE: int

max_exemplars_per_criterion

Max disagreement exemplars per criterion in the held-out revision prompt.

TYPE: int

held_out_min_accuracy

Convergence threshold for held_out strategy.

TYPE: float

history_window

Number of recent iterations to include in revision prompt.

TYPE: int

reject_agreement_regression

Whether to reject revisions that decrease validation reliability.

TYPE: bool

save_artifacts

Whether to save rubric JSONs and reports to disk.

TYPE: bool

artifacts_dir

Directory for saved artifacts. Auto-generated if None.

TYPE: Path | str | None

display

Output mode - "stdout", "html", or None.

TYPE: Literal['stdout', 'html', None]

show_progress

Whether to show Rich progress indicators.

TYPE: bool

max_total_cost

Stop if total cost exceeds this amount (USD).

TYPE: float | None

convergence_fn

Custom convergence function. When provided, replaces the built-in convergence logic entirely. Called after each iteration with (current_result, all_results). Returns a reason string to stop, or None to continue.

TYPE: ConvergenceFn | None

revision_system_prompt

Custom system prompt for rubric revision LLM calls. Falls back to the default from prompts.py if None.

TYPE: str | None

revision_user_prompt_template

Custom user prompt template for revision. For meta_rubric: must contain {task_prompt}, {original_criteria}, {issues_text}, {validation_text}, {history_text} placeholders. For held_out: must contain {task_prompt}, {original_criteria}, {diagnostics_text}, {history_text}, {num_criteria} placeholders. Falls back to the default from prompts.py if None.

TYPE: str | None


ImprovementResult

Final result from the rubric improvement process.

ImprovementResult dataclass

ImprovementResult(original_rubric: Rubric, final_rubric: Rubric, iterations: list[IterationResult], best_rubric: Rubric, best_iteration: int, convergence_reason: str, total_completion_cost: float | None)

Final result from the rubric improvement process.

ATTRIBUTE DESCRIPTION
original_rubric

The rubric before any improvements.

TYPE: Rubric

final_rubric

The rubric after the last accepted iteration.

TYPE: Rubric

iterations

All iteration results (including rejected ones).

TYPE: list[IterationResult]

best_rubric

The rubric with the best combined quality+agreement.

TYPE: Rubric

best_iteration

Index of the best iteration.

TYPE: int

convergence_reason

Why the improvement loop stopped.

TYPE: str

total_completion_cost

Total cost across all iterations.

TYPE: float | None


IterationResult

Result from a single improvement iteration.

IterationResult dataclass

IterationResult(iteration: int, rubric: Rubric, quality_score: float, agreement: float | None, per_criterion_agreement: dict[str, float] | None, issues: list[IssueDetail], issues_fixed: list[str], issues_introduced: list[str], accepted: bool, rejection_reason: str | None, quality_report: EnsembleEvaluationReport | None, token_usage: TokenUsage | None, completion_cost: float | None, held_out_diagnostics: HeldOutValidationResult | None = None)

Result from a single improvement iteration.

ATTRIBUTE DESCRIPTION
iteration

Zero-based iteration number.

TYPE: int

rubric

The rubric at this iteration.

TYPE: Rubric

quality_score

Meta-rubric quality score (0-1), or mean accuracy for held_out.

TYPE: float

agreement

Mean inter-judge agreement (0-1), or None if not tested.

TYPE: float | None

per_criterion_agreement

Per-criterion agreement scores, or None.

TYPE: dict[str, float] | None

issues

List of issues identified in this iteration.

TYPE: list[IssueDetail]

issues_fixed

Names of issues present in previous iteration but not this one.

TYPE: list[str]

issues_introduced

Names of issues not in previous iteration but present now.

TYPE: list[str]

accepted

Whether this revision was accepted (Pareto check passed).

TYPE: bool

rejection_reason

Why the revision was rejected, if applicable.

TYPE: str | None

quality_report

Full meta-rubric evaluation report. None in held_out mode.

TYPE: EnsembleEvaluationReport | None

token_usage

Token usage for this iteration.

TYPE: TokenUsage | None

completion_cost

Cost in USD for this iteration.

TYPE: float | None

held_out_diagnostics

Per-criterion error analysis from held-out validation. Populated only in held_out mode.

TYPE: HeldOutValidationResult | None


IssueDetail

A single issue identified in a rubric by meta-rubric evaluation.

IssueDetail dataclass

IssueDetail(criterion_name: str, requirement: str, weight: float, is_antipattern: bool, feedback: str)

A single issue identified in a rubric by meta-rubric evaluation.

ATTRIBUTE DESCRIPTION
criterion_name

Name of the meta-rubric criterion that flagged this issue.

TYPE: str

requirement

The meta-rubric criterion's requirement text.

TYPE: str

weight

Weight of the meta-rubric criterion.

TYPE: float

is_antipattern

True if this is a negative criterion (anti-pattern detected).

TYPE: bool

feedback

The judge's explanation for why this issue was flagged.

TYPE: str


CriterionExemplar

A single grading case for a criterion, capturing the LLM verdict, ground-truth verdict, and whether they disagree.

CriterionExemplar dataclass

CriterionExemplar(item_index: int, submission_snippet: str, llm_verdict: CriterionVerdict, ground_truth_verdict: CriterionVerdict, llm_reason: str, is_disagreement: bool)

A single grading case for a criterion.


CriterionErrorReport

Per-criterion error analysis from held-out grading, including accuracy, false positive/negative rates, and exemplars.

CriterionErrorReport dataclass

CriterionErrorReport(criterion_index: int, criterion_name: str, n_samples: int, accuracy: float, false_positive_rate: float, false_negative_rate: float, disagreement_exemplars: list[CriterionExemplar], agreement_exemplars: list[CriterionExemplar])

Per-criterion error analysis from held-out grading.


HeldOutValidationResult

Result from held-out validation with per-criterion diagnostics and overall accuracy.

HeldOutValidationResult dataclass

HeldOutValidationResult(mean_accuracy: float, per_criterion: list[CriterionErrorReport], total_cost: float | None, item_reports: list[EnsembleEvaluationReport])

Result from held-out validation with per-criterion diagnostics.


ConvergenceFn

Custom convergence function type alias.

ConvergenceFn module-attribute

ConvergenceFn = Callable[['IterationResult', list['IterationResult']], str | None]

Custom convergence function type.

Called after each iteration with (current_result, all_results). Returns a convergence reason string to stop, or None to continue. When provided in ImprovementConfig, replaces the built-in convergence logic.


ImprovementProgressDisplay

Rich-based progress display for the improvement loop.

ImprovementProgressDisplay

ImprovementProgressDisplay()

Rich-based progress display for the improvement loop.

Shows a progress bar during evaluation/agreement phases, prints one-line iteration summaries with issues tables and rubric panels, and renders a Rich Table at the end.

begin_iteration

begin_iteration(iteration: int, max_iterations: int, total_steps: int) -> None

Start a progress bar for one iteration.

advance

advance(phase_name: str | None = None) -> None

Advance by 1 step, optionally updating the phase label.

end_iteration

end_iteration() -> None

Stop the progress bar (disappears due to transient=True).

phase

phase(iteration: int, max_iterations: int, phase_name: str)

Spinner-only context for single-step atomic phases (e.g. revision).

log_iteration

log_iteration(result: IterationResult) -> None

Print a one-line iteration summary.

log_issues_table

log_issues_table(issues: list['IssueDetail'], *, rubric: 'Rubric | None' = None) -> None

Print a Rich table of issues found in this iteration.

log_rubric

log_rubric(rubric: 'Rubric', iteration: int) -> None

Print a Rich panel showing the rubric criteria.

log_held_out_iteration

log_held_out_iteration(result: IterationResult) -> None

Print a one-line iteration summary for held-out mode.

print_held_out_summary

print_held_out_summary(iterations: list['IterationResult'], convergence_reason: str, total_cost: float, artifacts_dir: 'Path | None') -> None

Print a summary table for held-out improvement.

log_rubric_diff

log_rubric_diff(prev_rubric: 'Rubric', curr_rubric: 'Rubric', iteration: int) -> None

Print a paired before/after diff of rubric criteria between iterations.

Changed lines are shown as adjacent old/new pairs with character-level highlighting of the exact changes.

log_convergence

log_convergence(reason: str) -> None

Print convergence reason.

print_summary

print_summary(iterations: list[IterationResult], convergence_reason: str, total_cost: float, artifacts_dir: Path | None) -> None

Print a Rich Table summary and final statistics.


Building Blocks

These functions can be used independently to compose custom improvement loops.

extract_issues

Extract actionable issues from a meta-rubric evaluation report.

extract_issues

extract_issues(report: EnsembleEvaluationReport) -> list[IssueDetail]

Extract actionable issues from a meta-rubric evaluation report.

An issue is either a positive criterion that is UNMET (quality gap) or a negative criterion that is MET (anti-pattern detected).
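The predicate is simple enough to state directly. A sketch, with MET/UNMET collapsed to a boolean for illustration:

```python
def is_issue(is_negative_criterion: bool, met: bool) -> bool:
    # Positive criterion left UNMET -> quality gap.
    # Negative criterion MET -> anti-pattern detected.
    return met if is_negative_criterion else not met

print(is_issue(False, False))  # True: positive criterion unmet
print(is_issue(True, True))    # True: anti-pattern fired
```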


diff_issues

Track fixed and introduced issues between iterations.

diff_issues

diff_issues(prev_issues: list[IssueDetail], curr_issues: list[IssueDetail]) -> tuple[list[str], list[str]]

Compare issue sets to track which were fixed and which were introduced.

RETURNS DESCRIPTION
tuple[list[str], list[str]]

Tuple of (issues_fixed, issues_introduced) as lists of criterion names.
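The fixed/introduced split is a set difference over criterion names. A sketch on plain name lists (the actual function takes IssueDetail objects):

```python
def diff_issue_names(prev: list[str], curr: list[str]) -> tuple[list[str], list[str]]:
    prev_set, curr_set = set(prev), set(curr)
    fixed = sorted(prev_set - curr_set)       # present before, gone now
    introduced = sorted(curr_set - prev_set)  # new this iteration
    return fixed, introduced

print(diff_issue_names(["vague_wording", "overlap"], ["overlap", "missing_weight"]))
# (['vague_wording'], ['missing_weight'])
```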


format_issues_for_prompt

Format issues into text for the revision prompt.

format_issues_for_prompt

format_issues_for_prompt(issues: list[IssueDetail]) -> str

Format issues into text for the revision prompt.


format_agreement_for_prompt

Format per-criterion agreement data as a self-contained prompt section.

format_agreement_for_prompt

format_agreement_for_prompt(per_criterion_agreement: dict[str, float] | None) -> str

Format per-criterion agreement data as a self-contained prompt section.


format_ground_truth_for_prompt

Format ground-truth validation results as a prompt section.

format_ground_truth_for_prompt

format_ground_truth_for_prompt(correlation: float, per_item: list[tuple[float, float]], *, item_reports: list[EnsembleEvaluationReport] | None = None, n_diagnostic: int = 3) -> str

Format ground-truth validation results as a self-contained prompt section.

PARAMETER DESCRIPTION
correlation

Spearman ρ or 1-MAE metric.

TYPE: float

per_item

List of (rubric_score, expected_score) pairs.

TYPE: list[tuple[float, float]]

item_reports

Per-item grading reports from validate_ground_truth. When provided, a diagnostics section is appended showing per-criterion reasons for the items with the largest scoring gaps.

TYPE: list[EnsembleEvaluationReport] | None DEFAULT: None

n_diagnostic

Number of items per direction (over/under-scored) to include in the diagnostics section.

TYPE: int DEFAULT: 3

RETURNS DESCRIPTION
str

Formatted section string with header, data, and instructions.


build_revision_history

Format recent iteration history for the revision prompt.

build_revision_history

build_revision_history(iterations: list[IterationResult], window: int) -> str

Format recent iteration history for the revision prompt.


validate_agreement

Test inter-judge agreement on validation data.

validate_agreement async

validate_agreement(rubric: Rubric, samples: list[str], judges: list[JudgeSpec], task_prompt: str | None = None, *, on_sample_complete: Callable[[], None] | None = None, _capture: list | None = None) -> tuple[float, dict[str, float], float | None]

Test inter-judge agreement by grading samples with an ensemble.

PARAMETER DESCRIPTION
rubric

Rubric to test.

TYPE: Rubric

samples

Sample submissions to grade.

TYPE: list[str]

judges

Judge specifications for the ensemble.

TYPE: list[JudgeSpec]

task_prompt

Optional task prompt for context.

TYPE: str | None DEFAULT: None

on_sample_complete

Optional callback invoked after each sample is graded.

TYPE: Callable[[], None] | None DEFAULT: None

_capture

When provided, per-sample ensemble reports are appended as serialized dicts for artifact persistence.

TYPE: list | None DEFAULT: None

RETURNS DESCRIPTION
tuple[float, dict[str, float], float | None]

Tuple of (mean_agreement, per_criterion_agreement, total_cost).


validate_ground_truth

Grade validation items and compute Spearman ρ against expected scores.

validate_ground_truth async

validate_ground_truth(rubric: Rubric, validation_data: RubricDataset, expected_scores: list[float], grader: CriterionGrader, task_prompt: str | None = None, *, on_item_complete: Callable[[], None] | None = None, _capture: list | None = None, _item_reports: list | None = None) -> tuple[float, list[tuple[float, float]], float | None]

Grade validation items with the current rubric and compare against expected scores.

Uses Spearman rank correlation when n >= 3, falls back to 1 - MAE when n < 3.
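The n < 3 fallback is worth spelling out, since 1 - MAE behaves quite differently from rank correlation. A sketch of the fallback branch (illustrative; the actual implementation is internal):

```python
def fallback_metric(rubric_scores: list[float], expected_scores: list[float]) -> float:
    # With fewer than 3 items, rank correlation is degenerate, so the loop
    # falls back to 1 - mean absolute error between score pairs.
    n = len(rubric_scores)
    mae = sum(abs(a - b) for a, b in zip(rubric_scores, expected_scores)) / n
    return 1 - mae

print(round(fallback_metric([0.8, 0.5], [0.9, 0.5]), 4))  # 0.95
```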

PARAMETER DESCRIPTION
rubric

Current rubric to evaluate.

TYPE: Rubric

validation_data

Dataset with ground-truth verdicts.

TYPE: RubricDataset

expected_scores

Pre-computed expected scores from compute_expected_scores.

TYPE: list[float]

grader

Grader configured from eval_llm.

TYPE: CriterionGrader

task_prompt

Optional task prompt for grading context.

TYPE: str | None DEFAULT: None

on_item_complete

Callback invoked after each item is graded.

TYPE: Callable[[], None] | None DEFAULT: None

_capture

When provided, per-item results are appended for artifact persistence.

TYPE: list | None DEFAULT: None

_item_reports

When provided, each item's EnsembleEvaluationReport is appended for downstream diagnostics (e.g. grading reasons).

TYPE: list | None DEFAULT: None

RETURNS DESCRIPTION
tuple[float, list[tuple[float, float]], float | None]

Tuple of (correlation_metric, per_item_pairs, total_cost), where per_item_pairs is a list of (rubric_score, expected_score) tuples.


compute_expected_scores

Compute expected scores from ground-truth verdicts and rubric weights.

compute_expected_scores

compute_expected_scores(validation_data: RubricDataset) -> list[float]

Compute expected scores from ground-truth verdicts and the rubric weights.

PARAMETER DESCRIPTION
validation_data

Dataset whose items all have ground_truth.

TYPE: RubricDataset

RETURNS DESCRIPTION
list[float]

List of expected scores, one per item.
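Conceptually, an item's expected score is the weighted fraction of criteria its ground truth marks as met. A sketch under that assumed scoring rule (the library's actual weighting scheme may handle negative criteria and normalization differently):

```python
def expected_score(weights: list[float], met: list[bool]) -> float:
    # Weighted fraction of criteria marked MET in the ground truth.
    total = sum(weights)
    achieved = sum(w for w, m in zip(weights, met) if m)
    return achieved / total

print(expected_score([2.0, 1.0, 1.0], [True, False, True]))  # 0.75
```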


pareto_accept

Check revision acceptance under the Pareto constraint.

pareto_accept

pareto_accept(curr_agreement: float | None, prev_agreement: float | None, reject_regression: bool, consecutive_rejections: int, epsilon: float = 0.03) -> tuple[bool, str | None]

Check if a revision should be accepted under the Pareto constraint.

A revision is accepted if agreement >= prev_agreement - epsilon. After 2 consecutive rejections, the constraint is relaxed.

RETURNS DESCRIPTION
tuple[bool, str | None]

Tuple of (accepted, rejection_reason).


validate_held_out

Grade held-out items and compare per-criterion verdicts against ground truth.

validate_held_out async

validate_held_out(rubric: Rubric, validation_data: RubricDataset, grader: CriterionGrader, task_prompt: str | None = None, *, max_exemplars_per_criterion: int = 3, on_item_complete: Callable[[], None] | None = None, _capture: list | None = None) -> HeldOutValidationResult

Grade held-out items and compare per-criterion verdicts against ground truth.

PARAMETER DESCRIPTION
rubric

Current rubric to evaluate.

TYPE: Rubric

validation_data

Dataset where all items have ground_truth.

TYPE: RubricDataset

grader

Grader configured from eval_llm.

TYPE: CriterionGrader

task_prompt

Optional task prompt for grading context.

TYPE: str | None DEFAULT: None

max_exemplars_per_criterion

Max disagreement exemplars per criterion.

TYPE: int DEFAULT: 3

on_item_complete

Callback invoked after each item is graded.

TYPE: Callable[[], None] | None DEFAULT: None

_capture

When provided, per-item results are appended for artifact persistence.

TYPE: list | None DEFAULT: None

RETURNS DESCRIPTION
HeldOutValidationResult

HeldOutValidationResult with per-criterion error analysis.


format_held_out_for_prompt

Format held-out validation result into revision prompt text.

format_held_out_for_prompt

format_held_out_for_prompt(result: HeldOutValidationResult, *, max_exemplars_per_criterion: int = 3) -> str

Format HeldOutValidationResult into text for the revision prompt.

PARAMETER DESCRIPTION
result

The held-out validation result.

TYPE: HeldOutValidationResult

max_exemplars_per_criterion

Max exemplars shown per criterion.

TYPE: int DEFAULT: 3

RETURNS DESCRIPTION
str

Formatted diagnostics text with per-criterion analysis.


validate_criteria_structure

Post-revision check that criteria count and order were preserved.

validate_criteria_structure

validate_criteria_structure(original: Rubric, revised: Rubric) -> tuple[bool, str | None]

Check that criteria count and order were preserved after revision.

PARAMETER DESCRIPTION
original

The rubric before revision.

TYPE: Rubric

revised

The rubric after revision.

TYPE: Rubric

RETURNS DESCRIPTION
tuple[bool, str | None]

(valid, error_message) — True and None if valid, False and reason if not.


revise_rubric_held_out

Revise a rubric using held-out-specific prompt templates.

revise_rubric_held_out async

revise_rubric_held_out(rubric: Rubric, task_prompt: str | None, diagnostics_text: str, history_text: str, config: ImprovementConfig, *, system_prompt: str | None = None, user_prompt_template: str | None = None, _capture: dict | None = None) -> tuple[Rubric, float | None]

Revise rubric based on held-out grading diagnostics.

Uses held-out-specific prompt templates that enforce structural constraints (same number of criteria in same order).

PARAMETER DESCRIPTION
rubric

Current rubric to revise.

TYPE: Rubric

task_prompt

The task the rubric evaluates.

TYPE: str | None

diagnostics_text

Formatted held-out diagnostics from format_held_out_for_prompt.

TYPE: str

history_text

Formatted revision history.

TYPE: str

config

Improvement configuration (provides revision_llm).

TYPE: ImprovementConfig

system_prompt

Override system prompt.

TYPE: str | None DEFAULT: None

user_prompt_template

Override user prompt template.

TYPE: str | None DEFAULT: None

_capture

When provided, populated with prompts and response for artifact persistence.

TYPE: dict | None DEFAULT: None

RETURNS DESCRIPTION
tuple[Rubric, float | None]

Tuple of (revised Rubric, completion cost or None).


revise_rubric

Revise a rubric via LLM based on identified issues.

revise_rubric async

revise_rubric(rubric: Rubric, task_prompt: str | None, issues: list[IssueDetail], validation_text: str, history_text: str, config: ImprovementConfig, *, system_prompt: str | None = None, user_prompt_template: str | None = None, _capture: dict | None = None) -> tuple[Rubric, float | None]

Use an LLM to revise the rubric based on evaluation feedback and validation data.

PARAMETER DESCRIPTION
rubric

Current rubric to revise.

TYPE: Rubric

task_prompt

The task the rubric evaluates.

TYPE: str | None

issues

Issues identified by meta-rubric evaluation.

TYPE: list[IssueDetail]

validation_text

Formatted validation data (ground-truth or agreement).

TYPE: str

history_text

Formatted revision history.

TYPE: str

config

Improvement configuration (provides revision_llm).

TYPE: ImprovementConfig

system_prompt

Override system prompt. Falls back to config.revision_system_prompt, then the default from prompts.py.

TYPE: str | None DEFAULT: None

user_prompt_template

Override user prompt template. Falls back to config.revision_user_prompt_template, then the default from prompts.py.

TYPE: str | None DEFAULT: None

_capture

When provided, populated with the system prompt, user prompt, and LLM response text for artifact persistence.

TYPE: dict | None DEFAULT: None

RETURNS DESCRIPTION
tuple[Rubric, float | None]

Tuple of (revised Rubric, completion cost or None).