
API Reference

Complete API documentation for AutoRubric, organized by functional area.

Overview

AutoRubric exports 87 public items across the main module and the autorubric.graders and autorubric.meta submodules. This reference is organized into thematic chapters for easier navigation.

| Chapter | Key Exports | Description |
| --- | --- | --- |
| CANNOT_ASSESS Handling | CannotAssessConfig, CannotAssessStrategy | Configure handling of uncertain verdicts |
| Core Grading | Criterion, Rubric, EvaluationReport | Fundamental types for rubric-based evaluation |
| Dataset | DataItem, RubricDataset | Dataset management and serialization |
| Distribution Metrics | earth_movers_distance, ks_test | Statistical distribution comparisons |
| Ensemble | EnsembleEvaluationReport, JudgeVote | Multi-judge aggregation |
| Eval Runner | EvalRunner, evaluate() | Batch evaluation with checkpointing |
| Few-Shot | FewShotConfig, FewShotExample | Calibration with labeled examples |
| Graders | CriterionGrader, Grader, JudgeSpec | Grader implementations |
| Length Penalty | LengthPenalty, compute_length_penalty | Verbosity control |
| LLM Infrastructure | LLMConfig, LLMClient, generate() | LLM client and configuration |
| Meta-Rubric Evaluation | evaluate_rubric_standalone, evaluate_rubric_in_context | Assess rubric quality |
| Metrics | MetricsResult, compute_metrics | Agreement and correlation metrics |
| Multi-Choice | CriterionOption, MultiChoiceVerdict | Ordinal and nominal scales |
| Utilities | aggregate_token_usage, word_count | Helper functions |

Import Patterns

Main Module

from autorubric import (
    # Core types
    Criterion,
    Rubric,
    CriterionVerdict,
    EvaluationReport,

    # LLM configuration
    LLMConfig,
    LLMClient,

    # Dataset
    DataItem,
    RubricDataset,

    # Evaluation
    EvalRunner,
    evaluate,

    # Metrics
    compute_metrics,
    MetricsResult,
)
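
These core types compose into a short end-to-end workflow. The following is a minimal sketch, not verbatim library usage: the Criterion and Rubric keyword arguments (name, weight, criteria), the LLMConfig model field, and the shape of the grade() call are illustrative assumptions rather than confirmed signatures.

from autorubric import Criterion, LLMConfig, Rubric

# All keyword arguments below are illustrative assumptions, not confirmed API.
rubric = Rubric(
    criteria=[
        Criterion(name="cites_sources", weight=1.0),          # positive criterion
        Criterion(name="contains_speculation", weight=-0.5),  # negative criterion
    ]
)
config = LLMConfig(model="gpt-4o-mini")

# Rubric.grade() delegates to the configured grader (see Grading Flow below)
# and, per the Conventions section, yields an EnsembleEvaluationReport.
report = rubric.grade("candidate response text", llm_config=config)
print(report.raw_score)  # raw_score is always populated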

Graders Module

from autorubric.graders import (
    CriterionGrader,
    Grader,
    JudgeSpec,
)
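
A hedged sketch of the grader exports: the judges parameter and the JudgeSpec model field are assumptions used for illustration, not confirmed constructor signatures.

from autorubric.graders import CriterionGrader, JudgeSpec

# Assumed parameters, for illustration only.
grader = CriterionGrader(
    judges=[
        JudgeSpec(model="gpt-4o-mini"),
        JudgeSpec(model="claude-sonnet-4-5"),
    ]
)
# A single judge is treated as an "ensemble of 1", so the report type is
# EnsembleEvaluationReport either way (see Grading Flow and Conventions below).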

Meta Module

from autorubric.meta import (
    evaluate_rubric_standalone,
    evaluate_rubric_in_context,
    get_standalone_meta_rubric,
    get_in_context_meta_rubric,
)
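
A rough sketch of the meta-rubric calls; everything beyond the imported names (the rubric construction and the llm_config keyword) is an assumption.

from autorubric import Criterion, LLMConfig, Rubric
from autorubric.meta import evaluate_rubric_standalone, get_standalone_meta_rubric

rubric = Rubric(criteria=[Criterion(name="cites_sources", weight=1.0)])
config = LLMConfig(model="gpt-4o-mini")

# Inspect the built-in meta-rubric used for standalone rubric assessment.
meta_rubric = get_standalone_meta_rubric()

# Assumed call shape: grade the rubric itself rather than a model response.
quality_report = evaluate_rubric_standalone(rubric, llm_config=config)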

Type Hierarchy

Grader (ABC)
└── CriterionGrader

EvaluationReport
└── EnsembleEvaluationReport

CriterionReport
└── EnsembleCriterionReport

MetricsResult
├── CriterionMetrics
├── OrdinalCriterionMetrics
├── NominalCriterionMetrics
└── JudgeMetrics
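
Because the concrete classes subclass the bases shown above, code can target the base types. The check below assumes EnsembleEvaluationReport is importable from the main module alongside EvaluationReport.

from autorubric import EnsembleEvaluationReport, EvaluationReport
from autorubric.graders import CriterionGrader, Grader

# Both relationships follow from the hierarchy above, so functions can accept
# Grader / EvaluationReport and still work with the ensemble subclasses.
assert issubclass(CriterionGrader, Grader)
assert issubclass(EnsembleEvaluationReport, EvaluationReport)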

Architecture

Grading Flow

  1. Rubric.grade() delegates to the grader's grade() method
  2. CriterionGrader treats a single LLM as an "ensemble of 1"
  3. The grader makes concurrent LLM calls per criterion per judge via asyncio.gather()
  4. Votes are aggregated using a configurable strategy (see the sketch after this list)
  5. An EnsembleEvaluationReport is returned, giving a consistent interface
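
The gather-then-aggregate shape of steps 3 and 4 can be sketched generically. This is not the library's internal code; the judge names and the majority-vote strategy are illustrative stand-ins.

import asyncio
from collections import Counter

async def ask_judge(judge: str, criterion: str) -> str:
    """Stand-in for one LLM call; returns a verdict string."""
    await asyncio.sleep(0)  # placeholder for network latency
    return "MET"

async def grade_criterion(criterion: str, judges: list[str]) -> str:
    # Step 3: one concurrent call per judge for this criterion.
    votes = await asyncio.gather(*(ask_judge(j, criterion) for j in judges))
    # Step 4: aggregate votes; majority vote is one possible strategy.
    return Counter(votes).most_common(1)[0][0]

async def grade(criteria: list[str], judges: list[str]) -> dict[str, str]:
    # Concurrent calls per criterion per judge, then aggregation per criterion.
    verdicts = await asyncio.gather(*(grade_criterion(c, judges) for c in criteria))
    return dict(zip(criteria, verdicts))

print(asyncio.run(grade(["cites_sources"], ["judge-a", "judge-b", "judge-c"])))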

Score Calculation

# Positive criteria: MET earns the criterion's weight, UNMET earns 0
# Negative criteria: MET subtracts the (negative) weight, UNMET contributes 0
# verdict_value(c) is 1.0 for MET and 0.0 for UNMET
weighted_sum = sum(verdict_value(c) * c.weight for c in criteria)
total_positive_weight = sum(c.weight for c in criteria if c.weight > 0)
score = min(max(weighted_sum / total_positive_weight, 0.0), 1.0)  # if normalized
# Any length penalty is subtracted after this base calculation
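
A worked example under the formula above: two positive criteria with weight 1.0 (only one MET), plus one negative criterion with weight -0.5 that is MET.

# Weights and verdicts: +1.0 (MET), +1.0 (UNMET), -0.5 (MET)
weighted_sum = 1.0 * 1.0 + 0.0 * 1.0 + 1.0 * -0.5  # = 0.5
total_positive_weight = 1.0 + 1.0                   # = 2.0
score = min(max(weighted_sum / total_positive_weight, 0.0), 1.0)
print(score)  # 0.25, before any length penalty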

Conventions

  • All graders return EnsembleEvaluationReport for consistent interface
  • raw_score is always populated regardless of normalize setting
  • Parse failures fall back to conservative defaults (UNMET for positively weighted criteria, MET for negatively weighted ones)
  • In training pipelines, filter out results where error is not None
  • Rate limiting via LLMConfig.max_parallel_requests (per-provider semaphore)
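
Two of these conventions sketched in code, under the assumptions that results expose an error attribute and that max_parallel_requests can be passed to the LLMConfig constructor:

from autorubric import LLMConfig

# Per-provider rate limiting; the constructor keyword is an assumption, but the
# max_parallel_requests setting itself is named in the conventions above.
config = LLMConfig(model="gpt-4o-mini", max_parallel_requests=8)

def keep_for_training(results):
    # Drop items whose grading errored rather than training on conservative defaults.
    return [r for r in results if r.error is None]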