
Meta-Rubric Improvement

Iterative rubric improvement engine that optimizes for meta-rubric quality (validity) and validation reliability.

Overview

The improvement API provides a two-tier interface for iteratively refining rubrics:

Level         Entry Point          Use Case
Convenience   improve_rubric()     Quick start with keyword arguments
Full Control  ImprovementRunner    Custom convergence, callbacks, fine-grained config

Two validation modes are supported:

Mode          Trigger                                                                  Metric
Ground-truth  validation_data items have ground_truth                                  Spearman rank correlation (ρ) between rubric scores and expected scores
Multi-judge   validation_data items lack ground_truth and eval_llm is list[JudgeSpec]  Mean inter-judge agreement

A Pareto constraint rejects revisions that improve quality but decrease validation reliability.

Quick Example

Using improve_rubric()

import asyncio
from autorubric import LLMConfig, Rubric
from autorubric.dataset import RubricDataset
from autorubric.meta import improve_rubric

async def main():
    eval_llm = LLMConfig(model="openai/gpt-4.1", temperature=0.0)
    revision_llm = LLMConfig(model="openai/gpt-4.1", temperature=0.3)

    rubric = Rubric.from_file("my_rubric.json")
    validation_data = RubricDataset.from_file("validation_data.json")

    result = await improve_rubric(
        rubric,
        "Your task prompt here",
        eval_llm=eval_llm,
        revision_llm=revision_llm,
        validation_data=validation_data,
        artifacts_dir="experiments/my_improvement",
        display="stdout",
    )

    print(f"Quality: {result.iterations[-1].quality_score:.0%}")
    print(f"Convergence: {result.convergence_reason}")
    result.final_rubric.to_file("improved_rubric.json")

asyncio.run(main())

Using ImprovementRunner

from autorubric.meta import ImprovementRunner, ImprovementConfig

config = ImprovementConfig(
    eval_llm=eval_llm,
    revision_llm=revision_llm,
    validation_data=validation_data,
    max_iterations=15,
    min_quality_score=0.95,
    show_progress=True,
)
runner = ImprovementRunner(rubric, "Your task prompt", config=config)
result = await runner.run()

Validation Modes

Ground-Truth Mode

When validation_data items have ground_truth verdicts, the loop computes expected scores from those verdicts and the rubric weights, then measures Spearman ρ between the actual graded scores and the expected scores.

# Items with ground_truth → ground-truth mode
dataset = RubricDataset.from_file("labeled_data.json")
result = await improve_rubric(
    rubric, prompt,
    eval_llm=LLMConfig(model="openai/gpt-4.1"),
    revision_llm=LLMConfig(model="openai/gpt-4.1"),
    validation_data=dataset,
)
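
Under the hood, the correlation metric can be sketched as follows (illustrative only, assuming scipy is available; per validate_ground_truth below, Spearman ρ is used when n >= 3, with a 1 - MAE fallback for smaller sets):

from scipy.stats import spearmanr

def correlation_metric(rubric_scores: list[float], expected: list[float]) -> float:
    # Spearman ranks need at least 3 points to be meaningful; below that the
    # docs describe falling back to 1 - mean absolute error.
    if len(rubric_scores) >= 3:
        rho, _p = spearmanr(rubric_scores, expected)
        return float(rho)
    mae = sum(abs(r - e) for r, e in zip(rubric_scores, expected)) / len(rubric_scores)
    return 1.0 - mae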

Multi-Judge Mode

When items lack ground_truth, provide an ensemble of judges to measure inter-judge agreement:

from autorubric.graders import JudgeSpec

judges = [
    JudgeSpec(LLMConfig(model="openai/gpt-4.1"), "gpt"),
    JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929"), "claude"),
]
result = await improve_rubric(
    rubric, prompt,
    eval_llm=judges,
    revision_llm=LLMConfig(model="openai/gpt-4.1"),
    validation_data=dataset,  # items without ground_truth
)

Artifact Persistence

When save_artifacts=True (the default), the improvement loop writes the following files to artifacts_dir (auto-generated when unset):

File                      Contents
rubric-iter-{NN}.json     Criteria array per iteration
eval-iter-{NN}.html       Meta-rubric evaluation report (always generated)
iter-{NN}.json            Rich per-iteration JSON (quality report, issues, validation samples, revision prompts/response)
improvement_report.html   Consolidated report (always generated)
summary.json              Full run metadata, config snapshot, and per-iteration summary
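
For example, the per-iteration rubrics can be reloaded later (a minimal sketch reusing the artifacts_dir and Rubric.from_file from the quick example; filenames follow the table above):

from pathlib import Path
from autorubric import Rubric

artifacts = Path("experiments/my_improvement")
for path in sorted(artifacts.glob("rubric-iter-*.json")):
    rubric = Rubric.from_file(str(path))  # one rubric snapshot per iteration
    print(path.name)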

Custom Convergence

Replace the built-in convergence logic with a custom function:

from autorubric.meta import ConvergenceFn, IterationResult

def my_convergence(current: IterationResult, history: list[IterationResult]) -> str | None:
    if current.quality_score > 0.9 and len(current.issues) == 0:
        return "perfect quality with no issues"
    if len(history) >= 5:
        return "max iterations reached"
    return None  # continue

config = ImprovementConfig(
    eval_llm=eval_llm,
    revision_llm=revision_llm,
    convergence_fn=my_convergence,
)

improve_rubric

Convenience wrapper for iterative rubric improvement.

improve_rubric async

improve_rubric(rubric: Rubric, task_prompt: str | None = None, *, config: ImprovementConfig | None = None, eval_llm: LLMConfig | list[JudgeSpec] | None = None, revision_llm: LLMConfig | None = None, validation_data: RubricDataset | None = None, max_iterations: int | None = None, display: Literal['stdout', 'html', None] = None, artifacts_dir: Path | str | None = None, save_artifacts: bool | None = None, show_progress: bool | None = None, mode: Literal['standalone', 'in_context'] | None = None, max_total_cost: float | None = None) -> ImprovementResult

Iteratively improve a rubric using meta-rubric evaluation and validation.

Convenience wrapper around ImprovementRunner. Optimizes for both quality (meta-rubric score) and validation reliability. Uses a Pareto constraint to reject revisions that improve quality but decrease reliability.

PARAMETER DESCRIPTION
rubric

The rubric to improve.

TYPE: Rubric

task_prompt

The task the rubric evaluates (required for in_context mode).

TYPE: str | None DEFAULT: None

config

Full configuration. If provided, keyword shortcuts override its fields (non-None values only).

TYPE: ImprovementConfig | None DEFAULT: None

eval_llm

LLM for meta-rubric evaluation and validation. Can be a single LLMConfig or list[JudgeSpec] for ensemble.

TYPE: LLMConfig | list[JudgeSpec] | None DEFAULT: None

revision_llm

LLM for rubric revision.

TYPE: LLMConfig | None DEFAULT: None

validation_data

Dataset for validation (ground-truth or multi-judge mode).

TYPE: RubricDataset | None DEFAULT: None

max_iterations

Maximum number of iterations.

TYPE: int | None DEFAULT: None

display

Output mode - "stdout", "html", or None.

TYPE: Literal['stdout', 'html', None] DEFAULT: None

artifacts_dir

Directory for saved artifacts.

TYPE: Path | str | None DEFAULT: None

save_artifacts

Whether to save rubric JSONs and reports to disk.

TYPE: bool | None DEFAULT: None

show_progress

Whether to show Rich progress indicators.

TYPE: bool | None DEFAULT: None

mode

Evaluation mode - "standalone" or "in_context".

TYPE: Literal['standalone', 'in_context'] | None DEFAULT: None

max_total_cost

Stop if total cost exceeds this amount (USD).

TYPE: float | None DEFAULT: None

RETURNS DESCRIPTION
ImprovementResult

ImprovementResult with the original, final, and best rubrics plus all iteration details.

RAISES DESCRIPTION
ValueError

If neither config nor eval_llm/revision_llm are provided.


ImprovementRunner

Full-control runner class following the EvalRunner pattern.

ImprovementRunner

ImprovementRunner(rubric: Rubric, task_prompt: str | None = None, *, config: ImprovementConfig | None = None)

Runs iterative rubric improvement with progress tracking.

Following the EvalRunner pattern, this class orchestrates the full improvement loop: evaluate quality, test agreement, check convergence, revise rubric, repeat.

Example

from autorubric import LLMConfig, Rubric
from autorubric.meta import ImprovementRunner, ImprovementConfig

config = ImprovementConfig(
    eval_llm=LLMConfig(model="gpt-4o"),
    revision_llm=LLMConfig(model="gpt-4o"),
)
runner = ImprovementRunner(rubric, "Your task prompt", config=config)
result = await runner.run()

run async

Run the improvement loop and return the result.

RETURNS DESCRIPTION
ImprovementResult

ImprovementResult with the original, final, and best rubrics plus all iteration details.

RAISES DESCRIPTION
ValueError

If task_prompt is required but not provided.


ImprovementConfig

Configuration for the rubric improvement process.

ImprovementConfig dataclass

ImprovementConfig(eval_llm: LLMConfig | list[JudgeSpec], revision_llm: LLMConfig, mode: Literal['standalone', 'in_context'] = 'in_context', validation_data: RubricDataset | None = None, max_iterations: int = 10, min_quality_score: float = 0.95, min_agreement: float = 0.85, score_plateau_threshold: float = 0.02, plateau_patience: int = 2, history_window: int = 3, reject_agreement_regression: bool = True, save_artifacts: bool = True, artifacts_dir: Path | str | None = None, display: Literal['stdout', 'html', None] = None, show_progress: bool = True, max_total_cost: float | None = None, convergence_fn: ConvergenceFn | None = None, revision_system_prompt: str | None = None, revision_user_prompt_template: str | None = None)

Configuration for the rubric improvement process.

ATTRIBUTE DESCRIPTION
eval_llm

LLM configuration for meta-rubric evaluation and validation.
- LLMConfig: single judge (used in both ground-truth mode and meta-rubric evaluation).
- list[JudgeSpec]: ensemble (required for multi-judge mode; meta-rubric evaluation uses the first judge's config).

TYPE: LLMConfig | list[JudgeSpec]

revision_llm

LLM configuration for rubric revision.

TYPE: LLMConfig

mode

Evaluation mode - "standalone" or "in_context".

TYPE: Literal['standalone', 'in_context']

validation_data

Optional dataset for validation. When items have ground_truth, uses ground-truth mode (Spearman ρ). When items lack ground_truth, requires eval_llm as list[JudgeSpec] for multi-judge agreement mode.

TYPE: RubricDataset | None

max_iterations

Maximum number of improvement iterations.

TYPE: int

min_quality_score

Stop if quality score reaches this threshold.

TYPE: float

min_agreement

Stop if agreement/correlation reaches this threshold.

TYPE: float

score_plateau_threshold

Minimum score improvement to avoid plateau detection.

TYPE: float

plateau_patience

Number of iterations with no improvement before stopping.

TYPE: int

history_window

Number of recent iterations to include in revision prompt.

TYPE: int

reject_agreement_regression

Whether to reject revisions that decrease validation reliability.

TYPE: bool

save_artifacts

Whether to save rubric JSONs and reports to disk.

TYPE: bool

artifacts_dir

Directory for saved artifacts. Auto-generated if None.

TYPE: Path | str | None

display

Output mode - "stdout", "html", or None.

TYPE: Literal['stdout', 'html', None]

show_progress

Whether to show Rich progress indicators.

TYPE: bool

max_total_cost

Stop if total cost exceeds this amount (USD).

TYPE: float | None

convergence_fn

Custom convergence function. When provided, replaces the built-in convergence logic entirely. Called after each iteration with (current_result, all_results). Returns a reason string to stop, or None to continue.

TYPE: ConvergenceFn | None

revision_system_prompt

Custom system prompt for rubric revision LLM calls. Falls back to the default from prompts.py if None.

TYPE: str | None

revision_user_prompt_template

Custom user prompt template for revision. Must contain {task_prompt}, {original_criteria}, {issues_text}, {validation_text}, {history_text} placeholders. Falls back to the default from prompts.py if None.

TYPE: str | None
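
For instance, a custom revision template only needs to contain the five required placeholders; everything else is up to you (a minimal sketch, with eval_llm and revision_llm as defined in the quick example):

CUSTOM_TEMPLATE = """Task being evaluated:
{task_prompt}

Current criteria:
{original_criteria}

Known issues:
{issues_text}

Validation signal:
{validation_text}

Recent revision history:
{history_text}

Rewrite the criteria to address the issues without regressing validation."""

config = ImprovementConfig(
    eval_llm=eval_llm,
    revision_llm=revision_llm,
    revision_user_prompt_template=CUSTOM_TEMPLATE,
)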


ImprovementResult

Final result from the rubric improvement process.

ImprovementResult dataclass

ImprovementResult(original_rubric: Rubric, final_rubric: Rubric, iterations: list[IterationResult], best_rubric: Rubric, best_iteration: int, convergence_reason: str, total_completion_cost: float | None)

Final result from the rubric improvement process.

ATTRIBUTE DESCRIPTION
original_rubric

The rubric before any improvements.

TYPE: Rubric

final_rubric

The rubric after the last accepted iteration.

TYPE: Rubric

iterations

All iteration results (including rejected ones).

TYPE: list[IterationResult]

best_rubric

The rubric with the best combined quality+agreement.

TYPE: Rubric

best_iteration

Index of the best iteration.

TYPE: int

convergence_reason

Why the improvement loop stopped.

TYPE: str

total_completion_cost

Total cost across all iterations.

TYPE: float | None
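
Putting these fields together, a post-run inspection might look like this (a sketch using only the attributes documented above; result is the ImprovementResult from the quick example):

print(f"Stopped because: {result.convergence_reason}")
print(f"Best iteration: {result.best_iteration}")
for it in result.iterations:
    status = "accepted" if it.accepted else f"rejected: {it.rejection_reason}"
    print(f"iter {it.iteration}: quality={it.quality_score:.0%} ({status})")
result.best_rubric.to_file("best_rubric.json")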


IterationResult

Result from a single improvement iteration.

IterationResult dataclass

IterationResult(iteration: int, rubric: Rubric, quality_score: float, agreement: float | None, per_criterion_agreement: dict[str, float] | None, issues: list[IssueDetail], issues_fixed: list[str], issues_introduced: list[str], accepted: bool, rejection_reason: str | None, quality_report: EnsembleEvaluationReport, token_usage: TokenUsage | None, completion_cost: float | None)

Result from a single improvement iteration.

ATTRIBUTE DESCRIPTION
iteration

Zero-based iteration number.

TYPE: int

rubric

The rubric at this iteration.

TYPE: Rubric

quality_score

Meta-rubric quality score (0-1).

TYPE: float

agreement

Mean inter-judge agreement (0-1), or None if not tested.

TYPE: float | None

per_criterion_agreement

Per-criterion agreement scores, or None.

TYPE: dict[str, float] | None

issues

List of issues identified in this iteration.

TYPE: list[IssueDetail]

issues_fixed

Names of issues present in previous iteration but not this one.

TYPE: list[str]

issues_introduced

Names of issues not in previous iteration but present now.

TYPE: list[str]

accepted

Whether this revision was accepted (Pareto check passed).

TYPE: bool

rejection_reason

Why the revision was rejected, if applicable.

TYPE: str | None

quality_report

Full meta-rubric evaluation report.

TYPE: EnsembleEvaluationReport

token_usage

Token usage for this iteration.

TYPE: TokenUsage | None

completion_cost

Cost in USD for this iteration.

TYPE: float | None


IssueDetail

A single issue identified in a rubric by meta-rubric evaluation.

IssueDetail dataclass

IssueDetail(criterion_name: str, requirement: str, weight: float, is_antipattern: bool, feedback: str)

A single issue identified in a rubric by meta-rubric evaluation.

ATTRIBUTE DESCRIPTION
criterion_name

Name of the meta-rubric criterion that flagged this issue.

TYPE: str

requirement

The meta-rubric criterion's requirement text.

TYPE: str

weight

Weight of the meta-rubric criterion.

TYPE: float

is_antipattern

True if this is a negative criterion (anti-pattern detected).

TYPE: bool

feedback

The judge's explanation for why this issue was flagged.

TYPE: str


ConvergenceFn

Custom convergence function type alias.

ConvergenceFn module-attribute

ConvergenceFn = Callable[['IterationResult', list['IterationResult']], str | None]

Custom convergence function type.

Called after each iteration with (current_result, all_results). Returns a convergence reason string to stop, or None to continue. When provided in ImprovementConfig, replaces the built-in convergence logic.


ImprovementProgressDisplay

Rich-based progress display for the improvement loop.

ImprovementProgressDisplay

ImprovementProgressDisplay()

Rich-based progress display for the improvement loop.

Shows a progress bar during evaluation/agreement phases, prints one-line iteration summaries with issues tables and rubric panels, and renders a Rich Table at the end.

begin_iteration

begin_iteration(iteration: int, max_iterations: int, total_steps: int) -> None

Start a progress bar for one iteration.

advance

advance(phase_name: str | None = None) -> None

Advance by 1 step, optionally updating the phase label.

end_iteration

end_iteration() -> None

Stop the progress bar (disappears due to transient=True).

phase

phase(iteration: int, max_iterations: int, phase_name: str)

Spinner-only context for single-step atomic phases (e.g. revision).

log_iteration

log_iteration(result: IterationResult) -> None

Print a one-line iteration summary.

log_issues_table

log_issues_table(issues: list['IssueDetail'], *, rubric: 'Rubric | None' = None) -> None

Print a Rich table of issues found in this iteration.

log_rubric

log_rubric(rubric: 'Rubric', iteration: int) -> None

Print a Rich panel showing the rubric criteria.

log_rubric_diff

log_rubric_diff(prev_rubric: 'Rubric', curr_rubric: 'Rubric', iteration: int) -> None

Print a paired before/after diff of rubric criteria between iterations.

Changed lines are shown as adjacent old/new pairs with character-level highlighting of the exact changes.

log_convergence

log_convergence(reason: str) -> None

Print convergence reason.

print_summary

print_summary(iterations: list[IterationResult], convergence_reason: str, total_cost: float, artifacts_dir: Path | None) -> None

Print a Rich Table summary and final statistics.
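
The runner drives this display itself; you rarely call it directly. Still, the documented methods compose roughly like this (an illustrative sketch):

display = ImprovementProgressDisplay()
display.begin_iteration(iteration=0, max_iterations=10, total_steps=2)
display.advance("quality eval")   # step 1 of 2
display.advance("agreement")      # step 2 of 2
display.end_iteration()           # transient bar disappears
with display.phase(0, 10, "revision"):  # spinner-only atomic phase
    ...                                 # e.g. a single revision LLM call
display.log_convergence("quality threshold reached")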


Building Blocks

These functions can be used independently to compose custom improvement loops.
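
As a hedged sketch of how they compose on the ground-truth path (import locations are assumed to match the other autorubric.meta names on this page; rubric, task_prompt, validation_data, grader, and config are set up as in the examples above; evaluate_quality is a hypothetical stand-in for your meta-rubric evaluation step; rejected revisions are not rolled back here for brevity):

from autorubric.meta import (
    compute_expected_scores,
    diff_issues,
    extract_issues,
    format_ground_truth_for_prompt,
    pareto_accept,
    revise_rubric,
    validate_ground_truth,
)

expected = compute_expected_scores(validation_data)
prev_issues, prev_corr, rejections = [], None, 0

for _ in range(config.max_iterations):
    report = await evaluate_quality(rubric)  # hypothetical: yields an EnsembleEvaluationReport
    issues = extract_issues(report)
    corr, pairs, _cost = await validate_ground_truth(
        rubric, validation_data, expected, grader, task_prompt
    )
    accepted, why = pareto_accept(corr, prev_corr, True, rejections)
    if accepted:
        prev_corr, rejections = corr, 0
    else:
        rejections += 1  # after 2 rejections pareto_accept relaxes the constraint
    fixed, introduced = diff_issues(prev_issues, issues)
    print(f"fixed={fixed} introduced={introduced} accepted={accepted} ({why})")
    prev_issues = issues
    validation_text = format_ground_truth_for_prompt(corr, pairs)
    rubric, _ = await revise_rubric(
        rubric, task_prompt, issues, validation_text, "", config
    )  # "" = no revision history in this sketch; see build_revision_history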

extract_issues

Extract actionable issues from a meta-rubric evaluation report.

extract_issues

extract_issues(report: EnsembleEvaluationReport) -> list[IssueDetail]

Extract actionable issues from a meta-rubric evaluation report.

An issue is either a positive criterion that is UNMET (quality gap) or a negative criterion that is MET (anti-pattern detected).
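
A minimal consumption sketch, using the IssueDetail fields documented later on this page:

issues = extract_issues(report)  # report: an EnsembleEvaluationReport
for issue in issues:
    kind = "anti-pattern" if issue.is_antipattern else "quality gap"
    print(f"[{kind}] {issue.criterion_name}: {issue.feedback}")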


diff_issues

Track fixed and introduced issues between iterations.

diff_issues

diff_issues(prev_issues: list[IssueDetail], curr_issues: list[IssueDetail]) -> tuple[list[str], list[str]]

Compare issue sets to track which were fixed and which were introduced.

RETURNS DESCRIPTION
tuple[list[str], list[str]]

Tuple of (issues_fixed, issues_introduced) as lists of criterion names.


format_issues_for_prompt

Format issues into text for the revision prompt.

format_issues_for_prompt

format_issues_for_prompt(issues: list[IssueDetail]) -> str

Format issues into text for the revision prompt.


format_agreement_for_prompt

Format per-criterion agreement data as a self-contained prompt section.

format_agreement_for_prompt

format_agreement_for_prompt(per_criterion_agreement: dict[str, float] | None) -> str

Format per-criterion agreement data as a self-contained prompt section.


format_ground_truth_for_prompt

Format ground-truth validation results as a prompt section.

format_ground_truth_for_prompt

format_ground_truth_for_prompt(correlation: float, per_item: list[tuple[float, float]], *, item_reports: list[EnsembleEvaluationReport] | None = None, n_diagnostic: int = 3) -> str

Format ground-truth validation results as a self-contained prompt section.

PARAMETER DESCRIPTION
correlation

Spearman ρ or 1-MAE metric.

TYPE: float

per_item

List of (rubric_score, expected_score) pairs.

TYPE: list[tuple[float, float]]

item_reports

Per-item grading reports from validate_ground_truth. When provided, a diagnostics section is appended showing per-criterion reasons for the items with the largest scoring gaps.

TYPE: list[EnsembleEvaluationReport] | None DEFAULT: None

n_diagnostic

Number of items per direction (over/under-scored) to include in the diagnostics section.

TYPE: int DEFAULT: 3

RETURNS DESCRIPTION
str

Formatted section string with header, data, and instructions.


build_revision_history

Format recent iteration history for the revision prompt.

build_revision_history

build_revision_history(iterations: list[IterationResult], window: int) -> str

Format recent iteration history for the revision prompt.


validate_agreement

Test inter-judge agreement on validation data.

validate_agreement async

validate_agreement(rubric: Rubric, samples: list[str], judges: list[JudgeSpec], task_prompt: str | None = None, *, on_sample_complete: Callable[[], None] | None = None, _capture: list | None = None) -> tuple[float, dict[str, float], float | None]

Test inter-judge agreement by grading samples with an ensemble.

PARAMETER DESCRIPTION
rubric

Rubric to test.

TYPE: Rubric

samples

Sample submissions to grade.

TYPE: list[str]

judges

Judge specifications for the ensemble.

TYPE: list[JudgeSpec]

task_prompt

Optional task prompt for context.

TYPE: str | None DEFAULT: None

on_sample_complete

Optional callback invoked after each sample is graded.

TYPE: Callable[[], None] | None DEFAULT: None

_capture

When provided, per-sample ensemble reports are appended as serialized dicts for artifact persistence.

TYPE: list | None DEFAULT: None

RETURNS DESCRIPTION
tuple[float, dict[str, float], float | None]

Tuple of (mean_agreement, per_criterion_agreement, total_cost).
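
For example, reusing the judges ensemble from the multi-judge section above (samples is a list of raw submission strings):

mean_agr, per_criterion, cost = await validate_agreement(
    rubric, samples, judges, "Your task prompt here"
)
print(f"mean agreement: {mean_agr:.0%}")
for criterion, score in sorted(per_criterion.items(), key=lambda kv: kv[1]):
    print(f"  {criterion}: {score:.0%}")  # lowest-agreement criteria first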


validate_ground_truth

Grade validation items and compute Spearman ρ against expected scores.

validate_ground_truth async

validate_ground_truth(rubric: Rubric, validation_data: RubricDataset, expected_scores: list[float], grader: CriterionGrader, task_prompt: str | None = None, *, on_item_complete: Callable[[], None] | None = None, _capture: list | None = None, _item_reports: list | None = None) -> tuple[float, list[tuple[float, float]], float | None]

Grade validation items with the current rubric and compare against expected scores.

Uses Spearman rank correlation when n >= 3, falls back to 1 - MAE when n < 3.

PARAMETER DESCRIPTION
rubric

Current rubric to evaluate.

TYPE: Rubric

validation_data

Dataset with ground-truth verdicts.

TYPE: RubricDataset

expected_scores

Pre-computed expected scores from compute_expected_scores.

TYPE: list[float]

grader

Grader configured from eval_llm.

TYPE: CriterionGrader

task_prompt

Optional task prompt for grading context.

TYPE: str | None DEFAULT: None

on_item_complete

Callback invoked after each item is graded.

TYPE: Callable[[], None] | None DEFAULT: None

_capture

When provided, per-item results are appended for artifact persistence.

TYPE: list | None DEFAULT: None

_item_reports

When provided, each item's EnsembleEvaluationReport is appended for downstream diagnostics (e.g. grading reasons).

TYPE: list | None DEFAULT: None

RETURNS DESCRIPTION
tuple[float, list[tuple[float, float]], float | None]

Tuple of (correlation_metric, per_item_pairs, total_cost), where per_item_pairs is a list of (rubric_score, expected_score) pairs.


compute_expected_scores

Compute expected scores from ground-truth verdicts and rubric weights.

compute_expected_scores

compute_expected_scores(validation_data: RubricDataset) -> list[float]

Compute expected scores from ground-truth verdicts and the rubric weights.

PARAMETER DESCRIPTION
validation_data

Dataset whose items all have ground_truth.

TYPE: RubricDataset

RETURNS DESCRIPTION
list[float]

List of expected scores, one per item.
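
As a rough illustration of what an expected score is — the weighted fraction of criteria whose ground-truth verdict is met — here is a sketch (an assumed weighting scheme for illustration only, not confirmed library internals):

def expected_score(verdicts: dict[str, bool], weights: dict[str, float]) -> float:
    # Hypothetical: weight of met criteria over total weight. The real
    # computation lives in compute_expected_scores and may differ in detail.
    total = sum(weights.values())
    met = sum(w for name, w in weights.items() if verdicts.get(name))
    return met / total if total else 0.0

print(expected_score({"cited": True, "concise": False}, {"cited": 2.0, "concise": 1.0}))
# -> 0.666...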


pareto_accept

Check revision acceptance under the Pareto constraint.

pareto_accept

pareto_accept(curr_agreement: float | None, prev_agreement: float | None, reject_regression: bool, consecutive_rejections: int, epsilon: float = 0.03) -> tuple[bool, str | None]

Check if a revision should be accepted under the Pareto constraint.

A revision is accepted if agreement >= prev_agreement - epsilon. After 2 consecutive rejections, the constraint is relaxed.

RETURNS DESCRIPTION
tuple[bool, str | None]

Tuple of (accepted, rejection_reason).
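
Read as code, the documented acceptance rule looks roughly like this (an illustrative restatement of the behavior described above, not the library source):

def pareto_accept_sketch(
    curr: float | None,
    prev: float | None,
    reject_regression: bool,
    consecutive_rejections: int,
    epsilon: float = 0.03,
) -> tuple[bool, str | None]:
    # Nothing to compare against, regression checking disabled, or constraint
    # relaxed after 2 consecutive rejections: accept.
    if prev is None or not reject_regression or consecutive_rejections >= 2:
        return True, None
    if curr is not None and curr >= prev - epsilon:
        return True, None
    return False, f"agreement regressed below {prev - epsilon:.3f}"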


revise_rubric

Revise a rubric via LLM based on identified issues.

revise_rubric async

revise_rubric(rubric: Rubric, task_prompt: str | None, issues: list[IssueDetail], validation_text: str, history_text: str, config: ImprovementConfig, *, system_prompt: str | None = None, user_prompt_template: str | None = None, _capture: dict | None = None) -> tuple[Rubric, float | None]

Use an LLM to revise the rubric based on evaluation feedback and validation data.

PARAMETER DESCRIPTION
rubric

Current rubric to revise.

TYPE: Rubric

task_prompt

The task the rubric evaluates.

TYPE: str | None

issues

Issues identified by meta-rubric evaluation.

TYPE: list[IssueDetail]

validation_text

Formatted validation data (ground-truth or agreement).

TYPE: str

history_text

Formatted revision history.

TYPE: str

config

Improvement configuration (provides revision_llm).

TYPE: ImprovementConfig

system_prompt

Override system prompt. Falls back to config.revision_system_prompt, then the default from prompts.py.

TYPE: str | None DEFAULT: None

user_prompt_template

Override user prompt template. Falls back to config.revision_user_prompt_template, then the default from prompts.py.

TYPE: str | None DEFAULT: None

_capture

When provided, populated with the system prompt, user prompt, and LLM response text for artifact persistence.

TYPE: dict | None DEFAULT: None

RETURNS DESCRIPTION
tuple[Rubric, float | None]

Tuple of (revised Rubric, completion cost or None).