Few-Shot¶
Calibrate LLM judges with labeled examples for improved grading consistency.
Overview¶
Few-shot learning provides the judge with graded examples before evaluation, helping calibrate its understanding of the rubric criteria. This is particularly effective for subjective criteria or domain-specific evaluation.
Research Background
Casabianca et al. (2025) and Ashktorab et al. (2025) recommend graded exemplars ("gold anchors"), including negative examples of common failure modes, for calibrating both human raters and LLM judges. Few-shot examples reduce rater error and improve agreement metrics.
Quick Example¶
from autorubric import LLMConfig, FewShotConfig, RubricDataset
from autorubric.graders import CriterionGrader
# Load dataset with ground truth
dataset = RubricDataset.from_file("labeled_data.json")
# Split into training (for few-shot) and test
train_data, test_data = dataset.split_train_test(n_train=100, stratify=True, seed=42)
# Configure few-shot grader
grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
    training_data=train_data,
    few_shot_config=FewShotConfig(
        n_examples=3,
        balance_verdicts=True,  # Include both MET and UNMET examples
        include_reason=True,
        seed=42,
    ),
)
# Grade with few-shot calibration (assumes an existing rubric and a response to grade)
result = await rubric.grade(to_grade=response, grader=grader)
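The held-out split can then be used to spot-check how well the calibrated grader agrees with the ground-truth labels. The loop below is only a sketch: the item fields (item.submission, item.verdicts) and the result field (result.verdicts) are assumed names for illustration, not documented autorubric attributes.
# Sketch: measure agreement between the few-shot grader and ground-truth labels.
async def agreement(grader, rubric, test_data) -> float:
    """Fraction of criterion verdicts where the judge matches the ground truth."""
    matches, total = 0, 0
    for item in test_data:  # assumes the dataset split is iterable over labeled items
        result = await rubric.grade(to_grade=item.submission, grader=grader)
        for criterion_id, verdict in result.verdicts.items():  # assumed result shape
            matches += int(verdict == item.verdicts[criterion_id])  # assumed label field
            total += 1
    return matches / total if total else 0.0

accuracy = await agreement(grader, rubric, test_data)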
Ensemble + Few-Shot¶
Few-shot calibration is orthogonal to ensemble mode, so the two can be combined. All judges receive the same examples:
from autorubric.graders import JudgeSpec
grader = CriterionGrader(
    judges=[
        JudgeSpec(LLMConfig(model="gemini/gemini-3-flash-preview"), "gemini"),
        JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929"), "claude"),
    ],
    aggregation="majority",
    training_data=train_data,
    few_shot_config=FewShotConfig(n_examples=3),
)
FewShotConfig¶
FewShotConfig dataclass ¶
FewShotConfig(n_examples: int = 3, balance_verdicts: bool = True, include_reason: bool = False, seed: int | None = None)
Configuration for few-shot example selection.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| n_examples | int | Total number of examples to include per criterion. |
| balance_verdicts | bool | If True, attempt to balance MET/UNMET/CANNOT_ASSESS. If False, randomly sample without balancing. |
| include_reason | bool | If True, include the reason/explanation in examples. Note: ground truth datasets typically don't have reasons. |
| seed | int \| None | Random seed for reproducible sampling. |
Example
config = FewShotConfig(
    n_examples=3,
    balance_verdicts=True,
    seed=42,
)
FewShotExample¶
A single few-shot example with submission and ground truth verdict.
FewShotExample dataclass ¶
FewShotExample(submission: str, verdict: CriterionVerdict, reason: str | None = None)
A single few-shot example for criterion evaluation.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| submission | str | The content that was evaluated. Can be plain text or JSON-serialized. |
| verdict | CriterionVerdict | The ground truth verdict for this criterion. |
| reason | str \| None | Optional explanation for why the verdict was assigned. |
Example
example = FewShotExample(
    submission="The Industrial Revolution began in Britain...",
    verdict=CriterionVerdict.MET,
    reason="Correctly identifies Britain as origin",
)
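To mirror the "gold anchors" recommendation from the research background above, a hand-curated set would typically also contain negative and not-assessable anchors. A brief sketch, assuming CriterionVerdict exposes the UNMET and CANNOT_ASSESS members referenced by balance_verdicts; the submissions and reasons are invented for illustration:
negative_anchor = FewShotExample(
    submission="The Industrial Revolution began in the United States.",
    verdict=CriterionVerdict.UNMET,  # common failure mode: wrong country of origin
    reason="Names the wrong country of origin",
)
unassessable_anchor = FewShotExample(
    submission="[the response ends before discussing the Industrial Revolution]",
    verdict=CriterionVerdict.CANNOT_ASSESS,  # nothing relevant to judge
    reason="The submission never addresses the criterion",
)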
References¶
Ashktorab, Z., Daly, E. M., Miehling, E., Geyer, W., Santillán Cooper, M., Pedapati, T., Desmond, M., Pan, Q., and Do, H. J. (2025). EvalAssist: A Human-Centered Tool for LLM-as-a-Judge. arXiv:2507.02186.
Casabianca, J., McCaffrey, D. F., Johnson, M. S., Alper, N., and Zubenko, V. (2025). Validity Arguments For Constructed Response Scoring Using Generative Artificial Intelligence Applications. arXiv:2501.02334.