Rubric Improvement

Iterative rubric improvement engine that optimizes for meta-rubric quality (validity) and validation reliability.

Overview

The improvement API provides a two-tier interface for iteratively refining rubrics:

| Level | Entry Point | Use Case |
| --- | --- | --- |
| Convenience | improve_rubric() | Quick start with keyword arguments |
| Full Control | ImprovementRunner | Custom convergence, callbacks, fine-grained config |

Two validation modes are supported:

| Mode | Trigger | Metric |
| --- | --- | --- |
| Ground-truth | validation_data items have ground_truth | Spearman rank correlation (ρ) between rubric scores and expected scores |
| Multi-judge | validation_data items lack ground_truth and eval_llm is list[JudgeSpec] | Mean inter-judge agreement |

A Pareto constraint rejects revisions that improve quality but decrease validation reliability.

Quick Example

Using improve_rubric()

import asyncio
from autorubric import LLMConfig, Rubric
from autorubric.dataset import RubricDataset
from autorubric.meta import improve_rubric

async def main():
    eval_llm = LLMConfig(model="openai/gpt-4.1", temperature=0.0)
    revision_llm = LLMConfig(model="openai/gpt-4.1", temperature=0.3)

    rubric = Rubric.from_file("my_rubric.json")
    validation_data = RubricDataset.from_file("validation_data.json")

    result = await improve_rubric(
        rubric,
        "Your task prompt here",
        eval_llm=eval_llm,
        revision_llm=revision_llm,
        validation_data=validation_data,
        artifacts_dir="experiments/my_improvement",
        display="stdout",
    )

    print(f"Quality: {result.iterations[-1].quality_score:.0%}")
    print(f"Convergence: {result.convergence_reason}")
    result.final_rubric.to_file("improved_rubric.json")

asyncio.run(main())

Using ImprovementRunner

from autorubric.meta import ImprovementRunner, ImprovementConfig

config = ImprovementConfig(
    eval_llm=eval_llm,
    revision_llm=revision_llm,
    validation_data=validation_data,
    max_iterations=15,
    min_quality_score=0.95,
    show_progress=True,
)
runner = ImprovementRunner(rubric, "Your task prompt", config=config)
result = await runner.run()

Validation Modes

Ground-Truth Mode

When validation_data items have ground_truth verdicts, the loop computes expected scores from the rubric weights and measures Spearman ρ against the actual graded scores.

# Items with ground_truth → ground-truth mode
dataset = RubricDataset.from_file("labeled_data.json")
result = await improve_rubric(
    rubric, prompt,
    eval_llm=LLMConfig(model="openai/gpt-4.1"),
    revision_llm=LLMConfig(model="openai/gpt-4.1"),
    validation_data=dataset,
)
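The ρ computation itself is standard rank correlation. A self-contained sketch using the classic no-ties formula (illustrative; the library's actual implementation is internal and may differ):

```python
def spearman_rho(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation via the no-ties formula:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    def ranks(vs: list[float]) -> list[float]:
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Identical rank ordering scores rho = 1.0 even when absolute values differ.
print(spearman_rho([0.2, 0.5, 0.9, 0.7], [0.1, 0.4, 1.0, 0.6]))  # 1.0
```

Because ρ compares rankings rather than absolute values, a rubric can score systematically high or low and still achieve ρ = 1.0 as long as it orders submissions correctly.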

Multi-Judge Mode

When items lack ground_truth, provide an ensemble of judges to measure inter-judge agreement:

from autorubric.graders import JudgeSpec

judges = [
    JudgeSpec(LLMConfig(model="openai/gpt-4.1"), "gpt"),
    JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929"), "claude"),
]
result = await improve_rubric(
    rubric, prompt,
    eval_llm=judges,
    revision_llm=LLMConfig(model="openai/gpt-4.1"),
    validation_data=dataset,  # items without ground_truth
)
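The agreement metric can be pictured as pairwise verdict overlap. A minimal sketch, under the assumption that agreement is the fraction of matching per-criterion verdicts averaged over all judge pairs (the library's exact definition may differ):

```python
from itertools import combinations

def mean_pairwise_agreement(verdicts_by_judge: dict[str, list[bool]]) -> float:
    """Assumed metric: per-pair fraction of matching verdicts,
    averaged over every pair of judges."""
    pairs = list(combinations(verdicts_by_judge.values(), 2))
    scores = []
    for a, b in pairs:
        matches = sum(va == vb for va, vb in zip(a, b))
        scores.append(matches / len(a))
    return sum(scores) / len(scores)

# Two judges agreeing on 3 of 4 criteria -> 0.75
print(mean_pairwise_agreement({
    "gpt": [True, True, False, True],
    "claude": [True, False, False, True],
}))  # 0.75
```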

Improvement Strategies

The improvement loop supports two strategies for guiding rubric revision:

| Strategy | Description | Metric |
| --- | --- | --- |
| meta_rubric (default) | Revise based on meta-rubric quality issues | Meta-rubric quality score |
| held_out | Revise based on per-criterion grading errors on held-out data | Mean per-criterion accuracy |

Held-Out Strategy

The held_out strategy optimizes the rubric against grading errors on held-out data. Instead of using a meta-rubric to identify structural issues, it grades the validation items, compares per-criterion verdicts against ground truth, and uses the resulting error analysis (false positives, false negatives, disagreement exemplars) to guide revision. This requires validation_data with ground_truth verdicts.

result = await improve_rubric(
    rubric, prompt,
    eval_llm=LLMConfig(model="openai/gpt-4.1"),
    revision_llm=LLMConfig(model="openai/gpt-4.1"),
    validation_data=dataset,  # must have ground_truth
    strategy="held_out",
)

Strategies can be chained — for example, first optimize against held-out errors, then polish with meta-rubric evaluation:

# Phase 1: fix grading errors
result1 = await improve_rubric(
    rubric, prompt,
    eval_llm=eval_llm, revision_llm=revision_llm,
    validation_data=dataset,
    strategy="held_out",
    max_iterations=5,
)

# Phase 2: polish with meta-rubric
result2 = await improve_rubric(
    result1.final_rubric, prompt,
    eval_llm=eval_llm, revision_llm=revision_llm,
    validation_data=dataset,
    strategy="meta_rubric",
    max_iterations=5,
)

Artifact Persistence

When save_artifacts=True and artifacts_dir is set, the improvement loop writes:

| File | Contents |
| --- | --- |
| rubric-iter-{NN}.json | Criteria array per iteration |
| eval-iter-{NN}.html | Meta-rubric eval report (always generated) |
| iter-{NN}.json | Rich per-iteration JSON (quality report, issues, validation samples, revision prompts/response) |
| improvement_report.html | Consolidated report (always generated) |
| summary.json | Full run metadata, config snapshot, and per-iteration summary |

Custom Convergence

Replace the built-in convergence logic with a custom function:

from autorubric.meta import ConvergenceFn, IterationResult

def my_convergence(current: IterationResult, history: list[IterationResult]) -> str | None:
    if current.quality_score > 0.9 and len(current.issues) == 0:
        return "perfect quality with no issues"
    if len(history) >= 5:
        return "max iterations reached"
    return None  # continue

config = ImprovementConfig(
    eval_llm=eval_llm,
    revision_llm=revision_llm,
    convergence_fn=my_convergence,
)

improve_rubric

Convenience wrapper for iterative rubric improvement.

improve_rubric async

improve_rubric(rubric: Rubric, task_prompt: str | None = None, *, config: ImprovementConfig | None = None, eval_llm: LLMConfig | list[JudgeSpec] | None = None, revision_llm: LLMConfig | None = None, validation_data: RubricDataset | None = None, max_iterations: int | None = None, display: Literal['stdout', 'html', None] = None, artifacts_dir: Path | str | None = None, save_artifacts: bool | None = None, show_progress: bool | None = None, mode: Literal['standalone', 'in_context'] | None = None, max_total_cost: float | None = None, strategy: Literal['meta_rubric', 'held_out'] | None = None) -> ImprovementResult

Iteratively improve a rubric using meta-rubric evaluation and validation.

Convenience wrapper around ImprovementRunner. Two strategies available:

  • meta_rubric (default): Optimize against structural meta-rubric quality.
  • held_out: Optimize against grading errors on held-out data with ground truth.

Strategies are composable: feed result.best_rubric from one run into the next (e.g., held_out -> meta_rubric).

PARAMETER DESCRIPTION
rubric

The rubric to improve.

TYPE: Rubric

task_prompt

The task the rubric evaluates (required for in_context mode).

TYPE: str | None DEFAULT: None

config

Full configuration. If provided, keyword shortcuts override its fields (non-None values only).

TYPE: ImprovementConfig | None DEFAULT: None

eval_llm

LLM for meta-rubric evaluation and validation. Can be a single LLMConfig or list[JudgeSpec] for ensemble.

TYPE: LLMConfig | list[JudgeSpec] | None DEFAULT: None

revision_llm

LLM for rubric revision.

TYPE: LLMConfig | None DEFAULT: None

validation_data

Dataset for validation (ground-truth or multi-judge mode).

TYPE: RubricDataset | None DEFAULT: None

max_iterations

Maximum number of iterations.

TYPE: int | None DEFAULT: None

display

Output mode - "stdout", "html", or None.

TYPE: Literal['stdout', 'html', None] DEFAULT: None

artifacts_dir

Directory for saved artifacts.

TYPE: Path | str | None DEFAULT: None

save_artifacts

Whether to save rubric JSONs and reports to disk.

TYPE: bool | None DEFAULT: None

show_progress

Whether to show Rich progress indicators.

TYPE: bool | None DEFAULT: None

mode

Evaluation mode - "standalone" or "in_context".

TYPE: Literal['standalone', 'in_context'] | None DEFAULT: None

max_total_cost

Stop if total cost exceeds this amount (USD).

TYPE: float | None DEFAULT: None

strategy

Improvement strategy - "meta_rubric" or "held_out".

TYPE: Literal['meta_rubric', 'held_out'] | None DEFAULT: None

RETURNS DESCRIPTION
ImprovementResult

ImprovementResult with the original, final, and best rubrics plus all iteration details.

RAISES DESCRIPTION
ValueError

If neither config nor eval_llm/revision_llm are provided.


ImprovementRunner

Full-control runner class following the EvalRunner pattern.

ImprovementRunner

ImprovementRunner(rubric: Rubric, task_prompt: str | None = None, *, config: ImprovementConfig | None = None)

Runs iterative rubric improvement with progress tracking.

Following the EvalRunner pattern, this class orchestrates the full improvement loop: evaluate quality, test agreement, check convergence, revise rubric, repeat.

Example

from autorubric import LLMConfig, Rubric
from autorubric.meta import ImprovementRunner, ImprovementConfig

config = ImprovementConfig(
    eval_llm=LLMConfig(model="gpt-4o"),
    revision_llm=LLMConfig(model="gpt-4o"),
)
runner = ImprovementRunner(rubric, "Your task prompt", config=config)
result = await runner.run()

run async

Run the improvement loop and return the result.

Dispatches to the appropriate strategy: _run_meta_rubric() for meta-rubric-based optimization, or _run_held_out() for held-out grading error optimization.

RETURNS DESCRIPTION
ImprovementResult

ImprovementResult with the original, final, and best rubrics plus all iteration details.

RAISES DESCRIPTION
ValueError

If task_prompt is required but not provided.


ImprovementConfig

Configuration for the rubric improvement process. The strategy field selects the revision approach: "meta_rubric" (default) or "held_out".

ImprovementConfig dataclass

ImprovementConfig(eval_llm: LLMConfig | list[JudgeSpec], revision_llm: LLMConfig, mode: Literal['standalone', 'in_context'] = 'in_context', strategy: Literal['meta_rubric', 'held_out'] = 'meta_rubric', validation_data: RubricDataset | None = None, max_iterations: int = 10, min_quality_score: float = 0.95, min_agreement: float = 0.85, score_plateau_threshold: float = 0.02, plateau_patience: int = 2, max_exemplars_per_criterion: int = 3, held_out_min_accuracy: float = 0.9, history_window: int = 3, reject_agreement_regression: bool = True, save_artifacts: bool = True, artifacts_dir: Path | str | None = None, display: Literal['stdout', 'html', None] = None, show_progress: bool = True, max_total_cost: float | None = None, convergence_fn: ConvergenceFn | None = None, revision_system_prompt: str | None = None, revision_user_prompt_template: str | None = None)

Configuration for the rubric improvement process.

ATTRIBUTE DESCRIPTION
eval_llm

LLM configuration for meta-rubric evaluation and validation.

  • LLMConfig: single judge (used in both ground-truth mode and meta-rubric evaluation).
  • list[JudgeSpec]: ensemble (required for multi-judge mode; meta-rubric evaluation uses the first judge's config).

TYPE: LLMConfig | list[JudgeSpec]

revision_llm

LLM configuration for rubric revision.

TYPE: LLMConfig

mode

Evaluation mode - "standalone" or "in_context".

TYPE: Literal['standalone', 'in_context']

strategy

Improvement strategy - "meta_rubric" (default) optimizes against structural meta-rubric quality; "held_out" optimizes against grading errors on held-out data with ground truth.

TYPE: Literal['meta_rubric', 'held_out']

validation_data

Optional dataset for validation. When items have ground_truth, uses ground-truth mode (Spearman ρ). When items lack ground_truth, requires eval_llm as list[JudgeSpec] for multi-judge agreement mode. Required for held_out strategy.

TYPE: RubricDataset | None

max_iterations

Maximum number of improvement iterations.

TYPE: int

min_quality_score

Stop if quality score reaches this threshold.

TYPE: float

min_agreement

Stop if agreement/correlation reaches this threshold.

TYPE: float

score_plateau_threshold

Minimum score improvement to avoid plateau detection.

TYPE: float

plateau_patience

Number of iterations with no improvement before stopping.

TYPE: int

max_exemplars_per_criterion

Max disagreement exemplars per criterion in the held-out revision prompt.

TYPE: int

held_out_min_accuracy

Convergence threshold for held_out strategy.

TYPE: float

history_window

Number of recent iterations to include in revision prompt.

TYPE: int

reject_agreement_regression

Whether to reject revisions that decrease validation reliability.

TYPE: bool

save_artifacts

Whether to save rubric JSONs and reports to disk.

TYPE: bool

artifacts_dir

Directory for saved artifacts. Auto-generated if None.

TYPE: Path | str | None

display

Output mode - "stdout", "html", or None.

TYPE: Literal['stdout', 'html', None]

show_progress

Whether to show Rich progress indicators.

TYPE: bool

max_total_cost

Stop if total cost exceeds this amount (USD).

TYPE: float | None

convergence_fn

Custom convergence function. When provided, replaces the built-in convergence logic entirely. Called after each iteration with (current_result, all_results). Returns a reason string to stop, or None to continue.

TYPE: ConvergenceFn | None

revision_system_prompt

Custom system prompt for rubric revision LLM calls. Falls back to the default from prompts.py if None.

TYPE: str | None

revision_user_prompt_template

Custom user prompt template for revision. For meta_rubric: must contain {task_prompt}, {original_criteria}, {issues_text}, {validation_text}, {history_text} placeholders. For held_out: must contain {task_prompt}, {original_criteria}, {diagnostics_text}, {history_text}, {num_criteria} placeholders. Falls back to the default from prompts.py if None.

TYPE: str | None


ImprovementResult

Final result from the rubric improvement process.

ImprovementResult dataclass

ImprovementResult(original_rubric: Rubric, final_rubric: Rubric, iterations: list[IterationResult], best_rubric: Rubric, best_iteration: int, convergence_reason: str, total_completion_cost: float | None)

Final result from the rubric improvement process.

ATTRIBUTE DESCRIPTION
original_rubric

The rubric before any improvements.

TYPE: Rubric

final_rubric

The rubric after the last accepted iteration.

TYPE: Rubric

iterations

All iteration results (including rejected ones).

TYPE: list[IterationResult]

best_rubric

The rubric with the best combined quality+agreement.

TYPE: Rubric

best_iteration

Index of the best iteration.

TYPE: int

convergence_reason

Why the improvement loop stopped.

TYPE: str

total_completion_cost

Total cost across all iterations.

TYPE: float | None


IterationResult

Result from a single improvement iteration.

IterationResult dataclass

IterationResult(iteration: int, rubric: Rubric, quality_score: float, agreement: float | None, per_criterion_agreement: dict[str, float] | None, issues: list[IssueDetail], issues_fixed: list[str], issues_introduced: list[str], accepted: bool, rejection_reason: str | None, quality_report: EnsembleEvaluationReport | None, token_usage: TokenUsage | None, completion_cost: float | None, held_out_diagnostics: HeldOutValidationResult | None = None)

Result from a single improvement iteration.

ATTRIBUTE DESCRIPTION
iteration

Zero-based iteration number.

TYPE: int

rubric

The rubric at this iteration.

TYPE: Rubric

quality_score

Meta-rubric quality score (0-1), or mean accuracy for held_out.

TYPE: float

agreement

Mean inter-judge agreement (0-1), or None if not tested.

TYPE: float | None

per_criterion_agreement

Per-criterion agreement scores, or None.

TYPE: dict[str, float] | None

issues

List of issues identified in this iteration.

TYPE: list[IssueDetail]

issues_fixed

Names of issues present in previous iteration but not this one.

TYPE: list[str]

issues_introduced

Names of issues not in previous iteration but present now.

TYPE: list[str]

accepted

Whether this revision was accepted (Pareto check passed).

TYPE: bool

rejection_reason

Why the revision was rejected, if applicable.

TYPE: str | None

quality_report

Full meta-rubric evaluation report. None in held_out mode.

TYPE: EnsembleEvaluationReport | None

token_usage

Token usage for this iteration.

TYPE: TokenUsage | None

completion_cost

Cost in USD for this iteration.

TYPE: float | None

held_out_diagnostics

Per-criterion error analysis from held-out validation. Populated only in held_out mode.

TYPE: HeldOutValidationResult | None


IssueDetail

A single issue identified in a rubric by meta-rubric evaluation.

IssueDetail dataclass

IssueDetail(criterion_name: str, requirement: str, weight: float, is_antipattern: bool, feedback: str)

A single issue identified in a rubric by meta-rubric evaluation.

ATTRIBUTE DESCRIPTION
criterion_name

Name of the meta-rubric criterion that flagged this issue.

TYPE: str

requirement

The meta-rubric criterion's requirement text.

TYPE: str

weight

Weight of the meta-rubric criterion.

TYPE: float

is_antipattern

True if this is a negative criterion (anti-pattern detected).

TYPE: bool

feedback

The judge's explanation for why this issue was flagged.

TYPE: str


CriterionExemplar

A single grading case for a criterion, capturing the LLM verdict, ground-truth verdict, and whether they disagree.

CriterionExemplar dataclass

CriterionExemplar(item_index: int, submission_snippet: str, llm_verdict: CriterionVerdict, ground_truth_verdict: CriterionVerdict, llm_reason: str, is_disagreement: bool)

A single grading case for a criterion.


CriterionErrorReport

Per-criterion error analysis from held-out grading, including accuracy, false positive/negative rates, and exemplars.

CriterionErrorReport dataclass

CriterionErrorReport(criterion_index: int, criterion_name: str, n_samples: int, accuracy: float, false_positive_rate: float, false_negative_rate: float, disagreement_exemplars: list[CriterionExemplar], agreement_exemplars: list[CriterionExemplar])

Per-criterion error analysis from held-out grading.


HeldOutValidationResult

Result from held-out validation with per-criterion diagnostics and overall accuracy.

HeldOutValidationResult dataclass

HeldOutValidationResult(mean_accuracy: float, per_criterion: list[CriterionErrorReport], total_cost: float | None, item_reports: list[EnsembleEvaluationReport])

Result from held-out validation with per-criterion diagnostics.


ConvergenceFn

Custom convergence function type alias.

ConvergenceFn module-attribute

ConvergenceFn = Callable[['IterationResult', list['IterationResult']], str | None]

Custom convergence function type.

Called after each iteration with (current_result, all_results). Returns a convergence reason string to stop, or None to continue. When provided in ImprovementConfig, replaces the built-in convergence logic.


ImprovementProgressDisplay

Rich-based progress display for the improvement loop.

ImprovementProgressDisplay

ImprovementProgressDisplay()

Rich-based progress display for the improvement loop.

Shows a progress bar during evaluation/agreement phases, prints one-line iteration summaries with issues tables and rubric panels, and renders a Rich Table at the end.

begin_iteration

begin_iteration(iteration: int, max_iterations: int, total_steps: int) -> None

Start a progress bar for one iteration.

advance

advance(phase_name: str | None = None) -> None

Advance by 1 step, optionally updating the phase label.

end_iteration

end_iteration() -> None

Stop the progress bar (disappears due to transient=True).

phase

phase(iteration: int, max_iterations: int, phase_name: str)

Spinner-only context for single-step atomic phases (e.g. revision).

log_iteration

log_iteration(result: IterationResult) -> None

Print a one-line iteration summary.

log_issues_table

log_issues_table(issues: list['IssueDetail'], *, rubric: 'Rubric | None' = None) -> None

Print a Rich table of issues found in this iteration.

log_rubric

log_rubric(rubric: 'Rubric', iteration: int) -> None

Print a Rich panel showing the rubric criteria.

log_held_out_iteration

log_held_out_iteration(result: IterationResult) -> None

Print a one-line iteration summary for held-out mode.

print_held_out_summary

print_held_out_summary(iterations: list['IterationResult'], convergence_reason: str, total_cost: float, artifacts_dir: 'Path | None') -> None

Print a summary table for held-out improvement.

log_rubric_diff

log_rubric_diff(prev_rubric: 'Rubric', curr_rubric: 'Rubric', iteration: int) -> None

Print a paired before/after diff of rubric criteria between iterations.

Changed lines are shown as adjacent old/new pairs with character-level highlighting of the exact changes.

log_convergence

log_convergence(reason: str) -> None

Print convergence reason.

print_summary

print_summary(iterations: list[IterationResult], convergence_reason: str, total_cost: float, artifacts_dir: Path | None) -> None

Print a Rich Table summary and final statistics.


Building Blocks

These functions can be used independently to compose custom improvement loops.

extract_issues

Extract actionable issues from a meta-rubric evaluation report.

extract_issues

extract_issues(report: EnsembleEvaluationReport) -> list[IssueDetail]

Extract actionable issues from a meta-rubric evaluation report.

An issue is either a positive criterion that is UNMET (quality gap) or a negative criterion that is MET (anti-pattern detected).
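The predicate is simple enough to state directly. A sketch, with MET/UNMET collapsed to a boolean for illustration:

```python
def is_issue(is_negative_criterion: bool, met: bool) -> bool:
    # Positive criterion left UNMET -> quality gap.
    # Negative criterion MET -> anti-pattern detected.
    return met if is_negative_criterion else not met

print(is_issue(False, False))  # True: positive criterion unmet
print(is_issue(True, True))    # True: anti-pattern fired
```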


diff_issues

Track fixed and introduced issues between iterations.

diff_issues

diff_issues(prev_issues: list[IssueDetail], curr_issues: list[IssueDetail]) -> tuple[list[str], list[str]]

Compare issue sets to track which were fixed and which were introduced.

RETURNS DESCRIPTION
tuple[list[str], list[str]]

Tuple of (issues_fixed, issues_introduced) as lists of criterion names.
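The fixed/introduced split is a set difference over criterion names. A sketch on plain name lists (the actual function takes IssueDetail objects):

```python
def diff_issue_names(prev: list[str], curr: list[str]) -> tuple[list[str], list[str]]:
    prev_set, curr_set = set(prev), set(curr)
    fixed = sorted(prev_set - curr_set)       # present before, gone now
    introduced = sorted(curr_set - prev_set)  # new this iteration
    return fixed, introduced

print(diff_issue_names(["vague_wording", "overlap"], ["overlap", "missing_weight"]))
# (['vague_wording'], ['missing_weight'])
```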


format_issues_for_prompt

Format issues into text for the revision prompt.

format_issues_for_prompt

format_issues_for_prompt(issues: list[IssueDetail]) -> str

Format issues into text for the revision prompt.


format_agreement_for_prompt

Format per-criterion agreement data as a self-contained prompt section.

format_agreement_for_prompt

format_agreement_for_prompt(per_criterion_agreement: dict[str, float] | None) -> str

Format per-criterion agreement data as a self-contained prompt section.


format_ground_truth_for_prompt

Format ground-truth validation results as a prompt section.

format_ground_truth_for_prompt

format_ground_truth_for_prompt(correlation: float, per_item: list[tuple[float, float]], *, item_reports: list[EnsembleEvaluationReport] | None = None, n_diagnostic: int = 3) -> str

Format ground-truth validation results as a self-contained prompt section.

PARAMETER DESCRIPTION
correlation

Spearman ρ or 1-MAE metric.

TYPE: float

per_item

List of (rubric_score, expected_score) pairs.

TYPE: list[tuple[float, float]]

item_reports

Per-item grading reports from validate_ground_truth. When provided, a diagnostics section is appended showing per-criterion reasons for the items with the largest scoring gaps.

TYPE: list[EnsembleEvaluationReport] | None DEFAULT: None

n_diagnostic

Number of items per direction (over/under-scored) to include in the diagnostics section.

TYPE: int DEFAULT: 3

RETURNS DESCRIPTION
str

Formatted section string with header, data, and instructions.


build_revision_history

Format recent iteration history for the revision prompt.

build_revision_history

build_revision_history(iterations: list[IterationResult], window: int) -> str

Format recent iteration history for the revision prompt.


validate_agreement

Test inter-judge agreement on validation data.

validate_agreement async

validate_agreement(rubric: Rubric, samples: list[str], judges: list[JudgeSpec], task_prompt: str | None = None, *, on_sample_complete: Callable[[], None] | None = None, _capture: list | None = None) -> tuple[float, dict[str, float], float | None]

Test inter-judge agreement by grading samples with an ensemble.

PARAMETER DESCRIPTION
rubric

Rubric to test.

TYPE: Rubric

samples

Sample submissions to grade.

TYPE: list[str]

judges

Judge specifications for the ensemble.

TYPE: list[JudgeSpec]

task_prompt

Optional task prompt for context.

TYPE: str | None DEFAULT: None

on_sample_complete

Optional callback invoked after each sample is graded.

TYPE: Callable[[], None] | None DEFAULT: None

_capture

When provided, per-sample ensemble reports are appended as serialized dicts for artifact persistence.

TYPE: list | None DEFAULT: None

RETURNS DESCRIPTION
tuple[float, dict[str, float], float | None]

Tuple of (mean_agreement, per_criterion_agreement, total_cost).


validate_ground_truth

Grade validation items and compute Spearman ρ against expected scores.

validate_ground_truth async

validate_ground_truth(rubric: Rubric, validation_data: RubricDataset, expected_scores: list[float], grader: CriterionGrader, task_prompt: str | None = None, *, on_item_complete: Callable[[], None] | None = None, _capture: list | None = None, _item_reports: list | None = None) -> tuple[float, list[tuple[float, float]], float | None]

Grade validation items with the current rubric and compare against expected scores.

Uses Spearman rank correlation when n >= 3, falls back to 1 - MAE when n < 3.
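The n < 3 fallback is worth spelling out, since 1 - MAE behaves quite differently from rank correlation. A sketch of the fallback branch (illustrative; the actual implementation is internal):

```python
def fallback_metric(rubric_scores: list[float], expected_scores: list[float]) -> float:
    # With fewer than 3 items, rank correlation is degenerate, so the loop
    # falls back to 1 - mean absolute error between score pairs.
    n = len(rubric_scores)
    mae = sum(abs(a - b) for a, b in zip(rubric_scores, expected_scores)) / n
    return 1 - mae

print(round(fallback_metric([0.8, 0.5], [0.9, 0.5]), 4))  # 0.95
```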

PARAMETER DESCRIPTION
rubric

Current rubric to evaluate.

TYPE: Rubric

validation_data

Dataset with ground-truth verdicts.

TYPE: RubricDataset

expected_scores

Pre-computed expected scores from compute_expected_scores.

TYPE: list[float]

grader

Grader configured from eval_llm.

TYPE: CriterionGrader

task_prompt

Optional task prompt for grading context.

TYPE: str | None DEFAULT: None

on_item_complete

Callback invoked after each item is graded.

TYPE: Callable[[], None] | None DEFAULT: None

_capture

When provided, per-item results are appended for artifact persistence.

TYPE: list | None DEFAULT: None

_item_reports

When provided, each item's EnsembleEvaluationReport is appended for downstream diagnostics (e.g. grading reasons).

TYPE: list | None DEFAULT: None

RETURNS DESCRIPTION
tuple[float, list[tuple[float, float]], float | None]

Tuple of (correlation_metric, per_item_pairs, total_cost), where per_item_pairs is a list of (rubric_score, expected_score) tuples.


compute_expected_scores

Compute expected scores from ground-truth verdicts and rubric weights.

compute_expected_scores

compute_expected_scores(validation_data: RubricDataset) -> list[float]

Compute expected scores from ground-truth verdicts and the rubric weights.

PARAMETER DESCRIPTION
validation_data

Dataset whose items all have ground_truth.

TYPE: RubricDataset

RETURNS DESCRIPTION
list[float]

List of expected scores, one per item.
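Conceptually, an item's expected score is the weighted fraction of criteria its ground truth marks as met. A sketch under that assumed scoring rule (the library's actual weighting scheme may handle negative criteria and normalization differently):

```python
def expected_score(weights: list[float], met: list[bool]) -> float:
    # Weighted fraction of criteria marked MET in the ground truth.
    total = sum(weights)
    achieved = sum(w for w, m in zip(weights, met) if m)
    return achieved / total

print(expected_score([2.0, 1.0, 1.0], [True, False, True]))  # 0.75
```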


pareto_accept

Check revision acceptance under the Pareto constraint.

pareto_accept

pareto_accept(curr_agreement: float | None, prev_agreement: float | None, reject_regression: bool, consecutive_rejections: int, epsilon: float = 0.03) -> tuple[bool, str | None]

Check if a revision should be accepted under the Pareto constraint.

A revision is accepted if agreement >= prev_agreement - epsilon. After 2 consecutive rejections, the constraint is relaxed.

RETURNS DESCRIPTION
tuple[bool, str | None]

Tuple of (accepted, rejection_reason).


validate_held_out

Grade held-out items and compare per-criterion verdicts against ground truth.

validate_held_out async

validate_held_out(rubric: Rubric, validation_data: RubricDataset, grader: CriterionGrader, task_prompt: str | None = None, *, max_exemplars_per_criterion: int = 3, on_item_complete: Callable[[], None] | None = None, _capture: list | None = None) -> HeldOutValidationResult

Grade held-out items and compare per-criterion verdicts against ground truth.

PARAMETER DESCRIPTION
rubric

Current rubric to evaluate.

TYPE: Rubric

validation_data

Dataset where all items have ground_truth.

TYPE: RubricDataset

grader

Grader configured from eval_llm.

TYPE: CriterionGrader

task_prompt

Optional task prompt for grading context.

TYPE: str | None DEFAULT: None

max_exemplars_per_criterion

Max disagreement exemplars per criterion.

TYPE: int DEFAULT: 3

on_item_complete

Callback invoked after each item is graded.

TYPE: Callable[[], None] | None DEFAULT: None

_capture

When provided, per-item results are appended for artifact persistence.

TYPE: list | None DEFAULT: None

RETURNS DESCRIPTION
HeldOutValidationResult

HeldOutValidationResult with per-criterion error analysis.


format_held_out_for_prompt

Format held-out validation result into revision prompt text.

format_held_out_for_prompt

format_held_out_for_prompt(result: HeldOutValidationResult, *, max_exemplars_per_criterion: int = 3) -> str

Format HeldOutValidationResult into text for the revision prompt.

PARAMETER DESCRIPTION
result

The held-out validation result.

TYPE: HeldOutValidationResult

max_exemplars_per_criterion

Max exemplars shown per criterion.

TYPE: int DEFAULT: 3

RETURNS DESCRIPTION
str

Formatted diagnostics text with per-criterion analysis.


validate_criteria_structure

Post-revision check that criteria count and order were preserved.

validate_criteria_structure

validate_criteria_structure(original: Rubric, revised: Rubric) -> tuple[bool, str | None]

Check that criteria count and order were preserved after revision.

PARAMETER DESCRIPTION
original

The rubric before revision.

TYPE: Rubric

revised

The rubric after revision.

TYPE: Rubric

RETURNS DESCRIPTION
tuple[bool, str | None]

(valid, error_message) — True and None if valid, False and reason if not.


revise_rubric_held_out

Revise a rubric using held-out-specific prompt templates.

revise_rubric_held_out async

revise_rubric_held_out(rubric: Rubric, task_prompt: str | None, diagnostics_text: str, history_text: str, config: ImprovementConfig, *, system_prompt: str | None = None, user_prompt_template: str | None = None, _capture: dict | None = None) -> tuple[Rubric, float | None]

Revise rubric based on held-out grading diagnostics.

Uses held-out-specific prompt templates that enforce structural constraints (same number of criteria in same order).

PARAMETER DESCRIPTION
rubric

Current rubric to revise.

TYPE: Rubric

task_prompt

The task the rubric evaluates.

TYPE: str | None

diagnostics_text

Formatted held-out diagnostics from format_held_out_for_prompt.

TYPE: str

history_text

Formatted revision history.

TYPE: str

config

Improvement configuration (provides revision_llm).

TYPE: ImprovementConfig

system_prompt

Override system prompt.

TYPE: str | None DEFAULT: None

user_prompt_template

Override user prompt template.

TYPE: str | None DEFAULT: None

_capture

When provided, populated with prompts and response for artifact persistence.

TYPE: dict | None DEFAULT: None

RETURNS DESCRIPTION
tuple[Rubric, float | None]

Tuple of (revised Rubric, completion cost or None).


revise_rubric

Revise a rubric via LLM based on identified issues.

revise_rubric async

revise_rubric(rubric: Rubric, task_prompt: str | None, issues: list[IssueDetail], validation_text: str, history_text: str, config: ImprovementConfig, *, system_prompt: str | None = None, user_prompt_template: str | None = None, _capture: dict | None = None) -> tuple[Rubric, float | None]

Use an LLM to revise the rubric based on evaluation feedback and validation data.

PARAMETER DESCRIPTION
rubric

Current rubric to revise.

TYPE: Rubric

task_prompt

The task the rubric evaluates.

TYPE: str | None

issues

Issues identified by meta-rubric evaluation.

TYPE: list[IssueDetail]

validation_text

Formatted validation data (ground-truth or agreement).

TYPE: str

history_text

Formatted revision history.

TYPE: str

config

Improvement configuration (provides revision_llm).

TYPE: ImprovementConfig

system_prompt

Override system prompt. Falls back to config.revision_system_prompt, then the default from prompts.py.

TYPE: str | None DEFAULT: None

user_prompt_template

Override user prompt template. Falls back to config.revision_user_prompt_template, then the default from prompts.py.

TYPE: str | None DEFAULT: None

_capture

When provided, populated with the system prompt, user prompt, and LLM response text for artifact persistence.

TYPE: dict | None DEFAULT: None

RETURNS DESCRIPTION
tuple[Rubric, float | None]

Tuple of (revised Rubric, completion cost or None).