Meta-Rubric Improvement¶
Iterative rubric improvement engine that optimizes for meta-rubric quality (validity) and validation reliability.
Overview¶
The improvement API provides a two-tier interface for iteratively refining rubrics:
| Level | Entry Point | Use Case |
|---|---|---|
| Convenience | `improve_rubric()` | Quick start with keyword arguments |
| Full Control | `ImprovementRunner` | Custom convergence, callbacks, fine-grained config |
Two validation modes are supported:
| Mode | Trigger | Metric |
|---|---|---|
| Ground-truth | `validation_data` items have `ground_truth` | Spearman rank correlation (ρ) between rubric scores and expected scores |
| Multi-judge | `validation_data` items lack `ground_truth` + `eval_llm` is `list[JudgeSpec]` | Mean inter-judge agreement |
A Pareto constraint rejects revisions that improve quality but decrease validation reliability.
Quick Example¶
Using improve_rubric()¶
```python
import asyncio

from autorubric import LLMConfig, Rubric
from autorubric.dataset import RubricDataset
from autorubric.meta import improve_rubric


async def main():
    eval_llm = LLMConfig(model="openai/gpt-4.1", temperature=0.0)
    revision_llm = LLMConfig(model="openai/gpt-4.1", temperature=0.3)

    rubric = Rubric.from_file("my_rubric.json")
    validation_data = RubricDataset.from_file("validation_data.json")

    result = await improve_rubric(
        rubric,
        "Your task prompt here",
        eval_llm=eval_llm,
        revision_llm=revision_llm,
        validation_data=validation_data,
        artifacts_dir="experiments/my_improvement",
        display="stdout",
    )

    print(f"Quality: {result.iterations[-1].quality_score:.0%}")
    print(f"Convergence: {result.convergence_reason}")
    result.final_rubric.to_file("improved_rubric.json")


asyncio.run(main())
```
Using ImprovementRunner¶
```python
from autorubric.meta import ImprovementRunner, ImprovementConfig

config = ImprovementConfig(
    eval_llm=eval_llm,
    revision_llm=revision_llm,
    validation_data=validation_data,
    max_iterations=15,
    min_quality_score=0.95,
    show_progress=True,
)

runner = ImprovementRunner(rubric, "Your task prompt", config=config)
result = await runner.run()
```
Validation Modes¶
Ground-Truth Mode¶
When validation_data items have ground_truth verdicts, the loop computes expected scores from the rubric weights and measures Spearman ρ against the actual graded scores.
```python
# Items with ground_truth → ground-truth mode
dataset = RubricDataset.from_file("labeled_data.json")

result = await improve_rubric(
    rubric, prompt,
    eval_llm=LLMConfig(model="openai/gpt-4.1"),
    revision_llm=LLMConfig(model="openai/gpt-4.1"),
    validation_data=dataset,
)
```
Multi-Judge Mode¶
When items lack ground_truth, provide an ensemble of judges to measure inter-judge agreement:
```python
from autorubric.graders import JudgeSpec

judges = [
    JudgeSpec(LLMConfig(model="openai/gpt-4.1"), "gpt"),
    JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929"), "claude"),
]

result = await improve_rubric(
    rubric, prompt,
    eval_llm=judges,
    revision_llm=LLMConfig(model="openai/gpt-4.1"),
    validation_data=dataset,  # items without ground_truth
)
```
Artifact Persistence¶
When save_artifacts=True and artifacts_dir is set, the improvement loop writes:
| File | Contents |
|---|---|
| `rubric-iter-{NN}.json` | Criteria array per iteration |
| `eval-iter-{NN}.html` | Meta-rubric eval report (always generated) |
| `iter-{NN}.json` | Rich per-iteration JSON (quality report, issues, validation samples, revision prompts/response) |
| `improvement_report.html` | Consolidated report (always generated) |
| `summary.json` | Full run metadata, config snapshot, and per-iteration summary |
Custom Convergence¶
Replace the built-in convergence logic with a custom function:
```python
from autorubric.meta import ConvergenceFn, IterationResult

def my_convergence(current: IterationResult, history: list[IterationResult]) -> str | None:
    if current.quality_score > 0.9 and len(current.issues) == 0:
        return "perfect quality with no issues"
    if len(history) >= 5:
        return "max iterations reached"
    return None  # continue

config = ImprovementConfig(
    eval_llm=eval_llm,
    revision_llm=revision_llm,
    convergence_fn=my_convergence,
)
```
improve_rubric¶
Convenience wrapper for iterative rubric improvement.
improve_rubric async ¶

```python
improve_rubric(
    rubric: Rubric,
    task_prompt: str | None = None,
    *,
    config: ImprovementConfig | None = None,
    eval_llm: LLMConfig | list[JudgeSpec] | None = None,
    revision_llm: LLMConfig | None = None,
    validation_data: RubricDataset | None = None,
    max_iterations: int | None = None,
    display: Literal['stdout', 'html', None] = None,
    artifacts_dir: Path | str | None = None,
    save_artifacts: bool | None = None,
    show_progress: bool | None = None,
    mode: Literal['standalone', 'in_context'] | None = None,
    max_total_cost: float | None = None,
) -> ImprovementResult
```
Iteratively improve a rubric using meta-rubric evaluation and validation.
Convenience wrapper around ImprovementRunner. Optimizes for both quality (meta-rubric score) and validation reliability. Uses a Pareto constraint to reject revisions that improve quality but decrease reliability.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `rubric` | `Rubric` | The rubric to improve. |
| `task_prompt` | `str \| None` | The task the rubric evaluates (required for `in_context` mode). |
| `config` | `ImprovementConfig \| None` | Full configuration. If provided, keyword shortcuts override its fields (non-None values only). |
| `eval_llm` | `LLMConfig \| list[JudgeSpec] \| None` | LLM for meta-rubric evaluation and validation. Can be a single `LLMConfig` or a list of `JudgeSpec` for multi-judge mode. |
| `revision_llm` | `LLMConfig \| None` | LLM for rubric revision. |
| `validation_data` | `RubricDataset \| None` | Dataset for validation (ground-truth or multi-judge mode). |
| `max_iterations` | `int \| None` | Maximum number of iterations. |
| `display` | `Literal['stdout', 'html', None]` | Output mode: `"stdout"`, `"html"`, or `None`. |
| `artifacts_dir` | `Path \| str \| None` | Directory for saved artifacts. |
| `save_artifacts` | `bool \| None` | Whether to save rubric JSONs and reports to disk. |
| `show_progress` | `bool \| None` | Whether to show Rich progress indicators. |
| `mode` | `Literal['standalone', 'in_context'] \| None` | Evaluation mode: `"standalone"` or `"in_context"`. |
| `max_total_cost` | `float \| None` | Stop if total cost exceeds this amount (USD). |
| RETURNS | DESCRIPTION |
|---|---|
| `ImprovementResult` | ImprovementResult with the original, final, and best rubrics plus all iteration details. |
| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If neither `config` nor `eval_llm`/`revision_llm` are provided. |
ImprovementRunner¶
Full-control runner class following the EvalRunner pattern.
ImprovementRunner ¶

```python
ImprovementRunner(rubric: Rubric, task_prompt: str | None = None, *, config: ImprovementConfig | None = None)
```
Runs iterative rubric improvement with progress tracking.
Following the EvalRunner pattern, this class orchestrates the full improvement loop: evaluate quality, test agreement, check convergence, revise rubric, repeat.
Example:

```python
from autorubric import LLMConfig, Rubric
from autorubric.meta import ImprovementRunner, ImprovementConfig

config = ImprovementConfig(
    eval_llm=LLMConfig(model="gpt-4o"),
    revision_llm=LLMConfig(model="gpt-4o"),
)
runner = ImprovementRunner(rubric, "Your task prompt", config=config)
result = await runner.run()
```
run async ¶

```python
run() -> ImprovementResult
```
Run the improvement loop and return the result.
| RETURNS | DESCRIPTION |
|---|---|
| `ImprovementResult` | ImprovementResult with the original, final, and best rubrics plus all iteration details. |
| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If `task_prompt` is required but not provided. |
ImprovementConfig¶
Configuration for the rubric improvement process.
ImprovementConfig dataclass ¶

```python
ImprovementConfig(
    eval_llm: LLMConfig | list[JudgeSpec],
    revision_llm: LLMConfig,
    mode: Literal['standalone', 'in_context'] = 'in_context',
    validation_data: RubricDataset | None = None,
    max_iterations: int = 10,
    min_quality_score: float = 0.95,
    min_agreement: float = 0.85,
    score_plateau_threshold: float = 0.02,
    plateau_patience: int = 2,
    history_window: int = 3,
    reject_agreement_regression: bool = True,
    save_artifacts: bool = True,
    artifacts_dir: Path | str | None = None,
    display: Literal['stdout', 'html', None] = None,
    show_progress: bool = True,
    max_total_cost: float | None = None,
    convergence_fn: ConvergenceFn | None = None,
    revision_system_prompt: str | None = None,
    revision_user_prompt_template: str | None = None,
)
```
Configuration for the rubric improvement process.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `eval_llm` | `LLMConfig \| list[JudgeSpec]` | LLM configuration for meta-rubric evaluation and validation. A single `LLMConfig`, or a list of `JudgeSpec` for multi-judge mode. |
| `revision_llm` | `LLMConfig` | LLM configuration for rubric revision. |
| `mode` | `Literal['standalone', 'in_context']` | Evaluation mode: `"standalone"` or `"in_context"`. |
| `validation_data` | `RubricDataset \| None` | Optional dataset for validation. When items have `ground_truth`, ground-truth mode is used; otherwise multi-judge mode. |
| `max_iterations` | `int` | Maximum number of improvement iterations. |
| `min_quality_score` | `float` | Stop if quality score reaches this threshold. |
| `min_agreement` | `float` | Stop if agreement/correlation reaches this threshold. |
| `score_plateau_threshold` | `float` | Minimum score improvement to avoid plateau detection. |
| `plateau_patience` | `int` | Number of iterations with no improvement before stopping. |
| `history_window` | `int` | Number of recent iterations to include in the revision prompt. |
| `reject_agreement_regression` | `bool` | Whether to reject revisions that decrease validation reliability. |
| `save_artifacts` | `bool` | Whether to save rubric JSONs and reports to disk. |
| `artifacts_dir` | `Path \| str \| None` | Directory for saved artifacts. Auto-generated if None. |
| `display` | `Literal['stdout', 'html', None]` | Output mode: `"stdout"`, `"html"`, or `None`. |
| `show_progress` | `bool` | Whether to show Rich progress indicators. |
| `max_total_cost` | `float \| None` | Stop if total cost exceeds this amount (USD). |
| `convergence_fn` | `ConvergenceFn \| None` | Custom convergence function. When provided, replaces the built-in convergence logic entirely. Called after each iteration with `(current_result, all_results)`. Returns a reason string to stop, or None to continue. |
| `revision_system_prompt` | `str \| None` | Custom system prompt for rubric revision LLM calls. Falls back to the default from `prompts.py` if None. |
| `revision_user_prompt_template` | `str \| None` | Custom user prompt template for revision. Must contain `{task_prompt}`, `{original_criteria}`, `{issues_text}`, `{validation_text}`, `{history_text}` placeholders. Falls back to the default from `prompts.py` if None. |
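The interplay of `score_plateau_threshold` and `plateau_patience` can be illustrated with a small sketch. This is a hypothetical reimplementation of the documented semantics, not the library's actual code:

```python
def is_plateaued(
    quality_scores: list[float],
    threshold: float = 0.02,   # score_plateau_threshold default
    patience: int = 2,         # plateau_patience default
) -> bool:
    """Return True if each of the last `patience` iterations improved
    by less than `threshold` over the preceding score."""
    if len(quality_scores) <= patience:
        return False
    recent = quality_scores[-(patience + 1):]
    return all(
        later - earlier < threshold
        for earlier, later in zip(recent, recent[1:])
    )
```

Under this reading, two consecutive iterations that each gain less than 0.02 quality would trigger the plateau stop.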
ImprovementResult¶
Final result from the rubric improvement process.
ImprovementResult dataclass ¶

```python
ImprovementResult(
    original_rubric: Rubric,
    final_rubric: Rubric,
    iterations: list[IterationResult],
    best_rubric: Rubric,
    best_iteration: int,
    convergence_reason: str,
    total_completion_cost: float | None,
)
```
Final result from the rubric improvement process.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `original_rubric` | `Rubric` | The rubric before any improvements. |
| `final_rubric` | `Rubric` | The rubric after the last accepted iteration. |
| `iterations` | `list[IterationResult]` | All iteration results (including rejected ones). |
| `best_rubric` | `Rubric` | The rubric with the best combined quality+agreement. |
| `best_iteration` | `int` | Index of the best iteration. |
| `convergence_reason` | `str` | Why the improvement loop stopped. |
| `total_completion_cost` | `float \| None` | Total cost across all iterations. |
IterationResult¶
Result from a single improvement iteration.
IterationResult dataclass ¶

```python
IterationResult(
    iteration: int,
    rubric: Rubric,
    quality_score: float,
    agreement: float | None,
    per_criterion_agreement: dict[str, float] | None,
    issues: list[IssueDetail],
    issues_fixed: list[str],
    issues_introduced: list[str],
    accepted: bool,
    rejection_reason: str | None,
    quality_report: EnsembleEvaluationReport,
    token_usage: TokenUsage | None,
    completion_cost: float | None,
)
```
Result from a single improvement iteration.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `iteration` | `int` | Zero-based iteration number. |
| `rubric` | `Rubric` | The rubric at this iteration. |
| `quality_score` | `float` | Meta-rubric quality score (0-1). |
| `agreement` | `float \| None` | Mean inter-judge agreement (0-1), or None if not tested. |
| `per_criterion_agreement` | `dict[str, float] \| None` | Per-criterion agreement scores, or None. |
| `issues` | `list[IssueDetail]` | List of issues identified in this iteration. |
| `issues_fixed` | `list[str]` | Names of issues present in the previous iteration but not this one. |
| `issues_introduced` | `list[str]` | Names of issues not in the previous iteration but present now. |
| `accepted` | `bool` | Whether this revision was accepted (Pareto check passed). |
| `rejection_reason` | `str \| None` | Why the revision was rejected, if applicable. |
| `quality_report` | `EnsembleEvaluationReport` | Full meta-rubric evaluation report. |
| `token_usage` | `TokenUsage \| None` | Token usage for this iteration. |
| `completion_cost` | `float \| None` | Cost in USD for this iteration. |
IssueDetail¶
A single issue identified in a rubric by meta-rubric evaluation.
IssueDetail dataclass ¶

```python
IssueDetail(criterion_name: str, requirement: str, weight: float, is_antipattern: bool, feedback: str)
```
A single issue identified in a rubric by meta-rubric evaluation.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `criterion_name` | `str` | Name of the meta-rubric criterion that flagged this issue. |
| `requirement` | `str` | The meta-rubric criterion's requirement text. |
| `weight` | `float` | Weight of the meta-rubric criterion. |
| `is_antipattern` | `bool` | True if this is a negative criterion (anti-pattern detected). |
| `feedback` | `str` | The judge's explanation for why this issue was flagged. |
ConvergenceFn¶
Custom convergence function type alias.
ConvergenceFn module-attribute ¶
Custom convergence function type.
Called after each iteration with (current_result, all_results). Returns a convergence reason string to stop, or None to continue. When provided in ImprovementConfig, replaces the built-in convergence logic.
ImprovementProgressDisplay¶
Rich-based progress display for the improvement loop.
ImprovementProgressDisplay ¶
Rich-based progress display for the improvement loop.
Shows a progress bar during evaluation/agreement phases, prints one-line iteration summaries with issues tables and rubric panels, and renders a Rich Table at the end.
begin_iteration ¶

Start a progress bar for one iteration.

advance ¶

Advance by 1 step, optionally updating the phase label.

phase ¶

Spinner-only context for single-step atomic phases (e.g. revision).

log_issues_table ¶

Print a Rich table of issues found in this iteration.

log_rubric ¶

Print a Rich panel showing the rubric criteria.

log_rubric_diff ¶

Print a paired before/after diff of rubric criteria between iterations.

Changed lines are shown as adjacent old/new pairs with character-level highlighting of the exact changes.
print_summary ¶

```python
print_summary(iterations: list[IterationResult], convergence_reason: str, total_cost: float, artifacts_dir: Path | None) -> None
```
Print a Rich Table summary and final statistics.
Building Blocks¶
These functions can be used independently to compose custom improvement loops.
extract_issues¶
Extract actionable issues from a meta-rubric evaluation report.
extract_issues ¶

```python
extract_issues(report: EnsembleEvaluationReport) -> list[IssueDetail]
```
Extract actionable issues from a meta-rubric evaluation report.
An issue is either a positive criterion that is UNMET (quality gap) or a negative criterion that is MET (anti-pattern detected).
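The selection rule above can be sketched in isolation. The types below are illustrative stand-ins for the real report structures in autorubric; only the rule itself comes from the documentation:

```python
from dataclasses import dataclass

@dataclass
class CriterionOutcome:
    """Hypothetical stand-in for a graded meta-rubric criterion."""
    name: str
    is_negative: bool  # True for anti-pattern criteria
    met: bool
    feedback: str

def extract_issues_sketch(outcomes: list[CriterionOutcome]) -> list[CriterionOutcome]:
    """An actionable issue is a positive criterion that is UNMET
    (quality gap) or a negative criterion that is MET (anti-pattern)."""
    return [
        o for o in outcomes
        if (not o.is_negative and not o.met) or (o.is_negative and o.met)
    ]
```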
diff_issues¶
Track fixed and introduced issues between iterations.
diff_issues ¶

```python
diff_issues(prev_issues: list[IssueDetail], curr_issues: list[IssueDetail]) -> tuple[list[str], list[str]]
```
Compare issue sets to track which were fixed and which were introduced.
| RETURNS | DESCRIPTION |
|---|---|
| `tuple[list[str], list[str]]` | Tuple of (issues_fixed, issues_introduced) as lists of criterion names. |
format_issues_for_prompt¶
Format issues into text for the revision prompt.
format_issues_for_prompt ¶

```python
format_issues_for_prompt(issues: list[IssueDetail]) -> str
```
Format issues into text for the revision prompt.
format_agreement_for_prompt¶
Format per-criterion agreement data as a self-contained prompt section.
format_agreement_for_prompt ¶
Format per-criterion agreement data as a self-contained prompt section.
format_ground_truth_for_prompt¶
Format ground-truth validation results as a prompt section.
format_ground_truth_for_prompt ¶

```python
format_ground_truth_for_prompt(
    correlation: float,
    per_item: list[tuple[float, float]],
    *,
    item_reports: list[EnsembleEvaluationReport] | None = None,
    n_diagnostic: int = 3,
) -> str
```
Format ground-truth validation results as a self-contained prompt section.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `correlation` | `float` | Spearman ρ or 1-MAE metric. |
| `per_item` | `list[tuple[float, float]]` | List of (rubric_score, expected_score) pairs. |
| `item_reports` | `list[EnsembleEvaluationReport] \| None` | Optional per-item grading reports used to build the diagnostics section. |
| `n_diagnostic` | `int` | Number of items per direction (over/under-scored) to include in the diagnostics section. |
| RETURNS | DESCRIPTION |
|---|---|
| `str` | Formatted section string with header, data, and instructions. |
build_revision_history¶
Format recent iteration history for the revision prompt.
build_revision_history ¶

```python
build_revision_history(iterations: list[IterationResult], window: int) -> str
```
Format recent iteration history for the revision prompt.
validate_agreement¶
Test inter-judge agreement on validation data.
validate_agreement async ¶

```python
validate_agreement(
    rubric: Rubric,
    samples: list[str],
    judges: list[JudgeSpec],
    task_prompt: str | None = None,
    *,
    on_sample_complete: Callable[[], None] | None = None,
    _capture: list | None = None,
) -> tuple[float, dict[str, float], float | None]
```
Test inter-judge agreement by grading samples with an ensemble.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `rubric` | `Rubric` | Rubric to test. |
| `samples` | `list[str]` | Sample submissions to grade. |
| `judges` | `list[JudgeSpec]` | Judge specifications for the ensemble. |
| `task_prompt` | `str \| None` | Optional task prompt for context. |
| `on_sample_complete` | `Callable[[], None] \| None` | Optional callback invoked after each sample is graded. |
| `_capture` | `list \| None` | When provided, per-sample ensemble reports are appended as serialized dicts for artifact persistence. |
| RETURNS | DESCRIPTION |
|---|---|
| `tuple[float, dict[str, float], float \| None]` | Tuple of (mean_agreement, per_criterion_agreement, total_cost). |
validate_ground_truth¶
Grade validation items and compute Spearman ρ against expected scores.
validate_ground_truth async ¶

```python
validate_ground_truth(
    rubric: Rubric,
    validation_data: RubricDataset,
    expected_scores: list[float],
    grader: CriterionGrader,
    task_prompt: str | None = None,
    *,
    on_item_complete: Callable[[], None] | None = None,
    _capture: list | None = None,
    _item_reports: list | None = None,
) -> tuple[float, list[tuple[float, float]], float | None]
```
Grade validation items with the current rubric and compare against expected scores.
Uses Spearman rank correlation when n >= 3; falls back to 1 - MAE when n < 3.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `rubric` | `Rubric` | Current rubric to evaluate. |
| `validation_data` | `RubricDataset` | Dataset with ground-truth verdicts. |
| `expected_scores` | `list[float]` | Pre-computed expected scores from `compute_expected_scores()`. |
| `grader` | `CriterionGrader` | Grader configured from the evaluation LLM. |
| `task_prompt` | `str \| None` | Optional task prompt for grading context. |
| `on_item_complete` | `Callable[[], None] \| None` | Callback invoked after each item is graded. |
| `_capture` | `list \| None` | When provided, per-item results are appended for artifact persistence. |
| `_item_reports` | `list \| None` | When provided, each item's grading report is appended. |
| RETURNS | DESCRIPTION |
|---|---|
| `tuple[float, list[tuple[float, float]], float \| None]` | Tuple of (correlation_metric, per_item_pairs, total_cost), where per_item_pairs is a list of (rubric_score, expected_score) tuples. |
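The correlation metric can be sketched in plain Python. This is an illustrative reimplementation of the documented behavior (Spearman ρ for n >= 3, 1 - MAE otherwise), not the library's code:

```python
def rank(values: list[float]) -> list[float]:
    """1-based average ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def correlation_metric(scores: list[float], expected: list[float]) -> float:
    """Spearman rho when n >= 3, else 1 - mean absolute error."""
    n = len(scores)
    if n < 3:
        return 1 - sum(abs(s - e) for s, e in zip(scores, expected)) / n
    rs, re = rank(scores), rank(expected)
    mean_rs, mean_re = sum(rs) / n, sum(re) / n
    cov = sum((a - mean_rs) * (b - mean_re) for a, b in zip(rs, re))
    sd_s = sum((a - mean_rs) ** 2 for a in rs) ** 0.5
    sd_e = sum((b - mean_re) ** 2 for b in re) ** 0.5
    return cov / (sd_s * sd_e) if sd_s and sd_e else 0.0
```

In production code `scipy.stats.spearmanr` covers the n >= 3 branch; the sketch just makes the fallback explicit.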
compute_expected_scores¶
Compute expected scores from ground-truth verdicts and rubric weights.
compute_expected_scores ¶

```python
compute_expected_scores(validation_data: RubricDataset) -> list[float]
```
Compute expected scores from ground-truth verdicts and the rubric weights.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `validation_data` | `RubricDataset` | Dataset whose items all have `ground_truth` verdicts. |
| RETURNS | DESCRIPTION |
|---|---|
| `list[float]` | List of expected scores, one per item. |
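One plausible way to derive an expected score from verdicts and weights is the weighted fraction of criteria met. This sketch is an assumption for illustration (it ignores negative-weight anti-pattern criteria, which the real scoring may handle differently):

```python
def expected_score_sketch(
    verdicts: dict[str, bool],    # criterion name -> ground-truth verdict
    weights: dict[str, float],    # criterion name -> positive weight
) -> float:
    """Weighted fraction of criteria met, assuming all weights are positive."""
    total = sum(weights.values())
    met = sum(w for name, w in weights.items() if verdicts.get(name, False))
    return met / total if total else 0.0
```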
pareto_accept¶
Check revision acceptance under the Pareto constraint.
pareto_accept ¶

```python
pareto_accept(
    curr_agreement: float | None,
    prev_agreement: float | None,
    reject_regression: bool,
    consecutive_rejections: int,
    epsilon: float = 0.03,
) -> tuple[bool, str | None]
```
Check if a revision should be accepted under the Pareto constraint.
A revision is accepted if agreement >= prev_agreement - epsilon. After 2 consecutive rejections, the constraint is relaxed.
| RETURNS | DESCRIPTION |
|---|---|
| `tuple[bool, str \| None]` | Tuple of (accepted, rejection_reason). |
revise_rubric¶
Revise a rubric via LLM based on identified issues.
revise_rubric async ¶

```python
revise_rubric(
    rubric: Rubric,
    task_prompt: str | None,
    issues: list[IssueDetail],
    validation_text: str,
    history_text: str,
    config: ImprovementConfig,
    *,
    system_prompt: str | None = None,
    user_prompt_template: str | None = None,
    _capture: dict | None = None,
) -> tuple[Rubric, float | None]
```
Use an LLM to revise the rubric based on evaluation feedback and validation data.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `rubric` | `Rubric` | Current rubric to revise. |
| `task_prompt` | `str \| None` | The task the rubric evaluates. |
| `issues` | `list[IssueDetail]` | Issues identified by meta-rubric evaluation. |
| `validation_text` | `str` | Formatted validation data (ground-truth or agreement). |
| `history_text` | `str` | Formatted revision history. |
| `config` | `ImprovementConfig` | Improvement configuration (provides `revision_llm`). |
| `system_prompt` | `str \| None` | Override system prompt. Falls back to `config.revision_system_prompt`, then the default from `prompts.py`. |
| `user_prompt_template` | `str \| None` | Override user prompt template. Falls back to `config.revision_user_prompt_template`, then the default from `prompts.py`. |
| `_capture` | `dict \| None` | When provided, populated with the system prompt, user prompt, and LLM response text for artifact persistence. |
| RETURNS | DESCRIPTION |
|---|---|
| `tuple[Rubric, float \| None]` | Tuple of (revised Rubric, completion cost or None). |