Few-Shot

Calibrate LLM judges with labeled examples for improved grading consistency.

Overview

Few-shot learning provides the judge with graded examples before evaluation, helping calibrate its understanding of the rubric criteria. This is particularly effective for subjective criteria or domain-specific evaluation.

Research Background

Casabianca et al. (2025) and Ashktorab et al. (2025) recommend graded exemplars ("gold anchors"), including negative examples of common failure modes, for calibrating both human and LLM judges. In their findings, few-shot examples reduce rater error and improve agreement metrics.

Quick Example

from autorubric import LLMConfig, FewShotConfig, RubricDataset
from autorubric.graders import CriterionGrader

# Load dataset with ground truth
dataset = RubricDataset.from_file("labeled_data.json")

# Split into training (for few-shot) and test
train_data, test_data = dataset.split_train_test(n_train=100, stratify=True, seed=42)

# Configure few-shot grader
grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
    training_data=train_data,
    few_shot_config=FewShotConfig(
        n_examples=3,
        balance_verdicts=True,  # Include both MET and UNMET examples
        include_reason=True,
        seed=42,
    ),
)

# Grade with few-shot calibration. Assumes an async context, with `rubric`
# (the Rubric being applied) and `response` (the submission under test)
# defined elsewhere.
result = await rubric.grade(to_grade=response, grader=grader)
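
To sanity-check the calibration, you can grade the held-out split and compare the judge's verdicts against the labels. The sketch below is illustrative only: the iteration pattern and the item.to_grade, item.verdicts, and result.verdicts field names are assumptions about the dataset and result objects, not documented autorubric API.

# Hypothetical sketch: the field names below are assumptions, not
# documented autorubric API.
matches = total = 0
for item in test_data:  # assumes RubricDataset iterates over labeled items
    result = await rubric.grade(to_grade=item.to_grade, grader=grader)
    for predicted, expected in zip(result.verdicts, item.verdicts):
        matches += int(predicted == expected)
        total += 1
print(f"Agreement with ground truth: {matches / total:.1%}")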

Ensemble + Few-Shot

Few-shot calibration is orthogonal to ensemble mode, so the two can be combined; all judges receive the same examples:

from autorubric.graders import JudgeSpec

grader = CriterionGrader(
    judges=[
        JudgeSpec(LLMConfig(model="gemini/gemini-3-flash-preview"), "gemini"),
        JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929"), "claude"),
    ],
    aggregation="majority",
    training_data=train_data,
    few_shot_config=FewShotConfig(n_examples=3),
)
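
With aggregation="majority", each judge grades every criterion independently using the same few-shot examples in its prompt, and the final verdict per criterion is the majority vote across judges.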

FewShotConfig

Configuration for few-shot example selection.

FewShotConfig dataclass

FewShotConfig(n_examples: int = 3, balance_verdicts: bool = True, include_reason: bool = False, seed: int | None = None)

ATTRIBUTES

n_examples (int)
    Total number of examples to include per criterion.

balance_verdicts (bool)
    If True, attempt to balance MET/UNMET/CANNOT_ASSESS verdicts; if False, sample randomly without balancing. (See the sketch below.)

include_reason (bool)
    If True, include the reason/explanation in examples. Note: ground truth datasets typically don't have reasons.

seed (int | None)
    Random seed for reproducible sampling.

Example

config = FewShotConfig(
    n_examples=3,
    balance_verdicts=True,
    seed=42,
)
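
For intuition about balance_verdicts, the sketch below shows one way balanced sampling could work: group the example pool by verdict and draw round-robin until n_examples are selected. This is purely illustrative; it is not autorubric's actual sampler.

import random
from collections import defaultdict

def sample_balanced(pool, n_examples, seed=None):
    # Illustrative only: round-robin across verdict groups so each verdict
    # (MET/UNMET/CANNOT_ASSESS) is represented when possible.
    rng = random.Random(seed)
    by_verdict = defaultdict(list)
    for ex in pool:
        by_verdict[ex.verdict].append(ex)
    groups = [rng.sample(group, len(group)) for group in by_verdict.values()]
    picked = []
    while len(picked) < n_examples and any(groups):
        for group in groups:
            if group and len(picked) < n_examples:
                picked.append(group.pop())
    return picked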


FewShotExample

A single few-shot example with submission and ground truth verdict.

FewShotExample dataclass

FewShotExample(submission: str, verdict: CriterionVerdict, reason: str | None = None)

A single few-shot example for criterion evaluation.

ATTRIBUTES

submission (str)
    The content that was evaluated. Can be plain text or JSON-serialized.

verdict (CriterionVerdict)
    The ground truth verdict for this criterion.

reason (str | None)
    Optional explanation for why the verdict was assigned.

Example

example = FewShotExample(
    submission="The Industrial Revolution began in Britain...",
    verdict=CriterionVerdict.MET,
    reason="Correctly identifies Britain as origin",
)
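
How these fields end up in the judge's prompt is internal to autorubric. One plausible rendering, shown only as an illustration, would be:

def render_example(ex) -> str:
    # Illustrative formatting only; the library's actual prompt template may
    # differ. Assumes CriterionVerdict is an enum with a string .value.
    lines = [f"Submission: {ex.submission}", f"Verdict: {ex.verdict.value}"]
    if ex.reason is not None:
        lines.append(f"Reason: {ex.reason}")
    return "\n".join(lines)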


References

Ashktorab, Z., Daly, E. M., Miehling, E., Geyer, W., Santillán Cooper, M., Pedapati, T., Desmond, M., Pan, Q., and Do, H. J. (2025). EvalAssist: A Human-Centered Tool for LLM-as-a-Judge. arXiv:2507.02186.

Casabianca, J., McCaffrey, D. F., Johnson, M. S., Alper, N., and Zubenko, V. (2025). Validity Arguments For Constructed Response Scoring Using Generative Artificial Intelligence Applications. arXiv:2501.02334.