Few-Shot¶
Calibrate LLM judges with labeled examples for improved grading consistency.
Overview¶
Few-shot learning provides the judge with graded examples before evaluation, helping calibrate its understanding of the rubric criteria. This is particularly effective for subjective criteria or domain-specific evaluation.
Research Background
Casabianca et al. (2025) and Ashktorab et al. (2025) recommend graded exemplars ("gold anchors"), including negative examples of common failure modes, for calibrating both human raters and LLM judges. Few-shot examples reduce rater error and improve agreement metrics.
Quick Example¶
from autorubric import LLMConfig, FewShotConfig, RubricDataset
from autorubric.graders import CriterionGrader
# Load dataset with ground truth
dataset = RubricDataset.from_file("labeled_data.json")
# Split into training (for few-shot) and test
train_data, test_data = dataset.split_train_test(n_train=100, stratify=True, seed=42)
# Configure few-shot grader
grader = CriterionGrader(
    llm_config=LLMConfig(model="openai/gpt-4.1-mini"),
    training_data=train_data,
    few_shot_config=FewShotConfig(
        n_examples=3,
        balance_verdicts=True,  # Include both MET and UNMET examples
        include_reason=True,
        seed=42,
    ),
)
# Grade with few-shot calibration (assumes an existing rubric and a response to grade)
result = await rubric.grade(to_grade=response, grader=grader)
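The held-out split can then be used to spot-check how well the calibrated grader agrees with the ground-truth labels. The loop below is only a sketch: the item fields (item.submission, item.verdicts) and the result field (result.verdicts) are assumed names for illustration, not documented autorubric attributes.
# Sketch: measure agreement between the few-shot grader and ground-truth labels.
async def agreement(grader, rubric, test_data) -> float:
    """Fraction of criterion verdicts where the judge matches the ground truth."""
    matches, total = 0, 0
    for item in test_data:  # assumes the dataset split is iterable over labeled items
        result = await rubric.grade(to_grade=item.submission, grader=grader)
        for criterion_id, verdict in result.verdicts.items():  # assumed result shape
            matches += int(verdict == item.verdicts[criterion_id])  # assumed label field
            total += 1
    return matches / total if total else 0.0

accuracy = await agreement(grader, rubric, test_data)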
Ensemble + Few-Shot¶
Few-shot calibration is orthogonal to ensemble mode, so the two can be combined. All judges receive the same examples:
from autorubric.graders import JudgeSpec
grader = CriterionGrader(
    judges=[
        JudgeSpec(LLMConfig(model="gemini/gemini-3-flash-preview"), "gemini"),
        JudgeSpec(LLMConfig(model="anthropic/claude-sonnet-4-5-20250929"), "claude"),
    ],
    aggregation="majority",
    training_data=train_data,
    few_shot_config=FewShotConfig(n_examples=3),
)
FewShotConfig¶
FewShotConfig dataclass ¶
FewShotConfig(n_examples: int = 3, balance_verdicts: bool = True, include_reason: bool = False, seed: int | None = None)
Configuration for few-shot example selection.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| n_examples | int | Total number of examples to include per criterion. |
| balance_verdicts | bool | If True, attempt to balance MET/UNMET/CANNOT_ASSESS. If False, randomly sample without balancing. |
| include_reason | bool | If True, include the reason/explanation in examples. Note: ground truth datasets typically don't have reasons. |
| seed | int \| None | Random seed for reproducible sampling. |
Example
config = FewShotConfig(
    n_examples=3,
    balance_verdicts=True,
    seed=42,
)
FewShotExample¶
A single few-shot example with submission and ground truth verdict.
FewShotExample dataclass ¶
FewShotExample(submission: str, verdict: CriterionVerdict, reason: str | None = None)
A single few-shot example for criterion evaluation.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| submission | str | The content that was evaluated. Can be plain text or JSON-serialized. |
| verdict | CriterionVerdict | The ground truth verdict for this criterion. |
| reason | str \| None | Optional explanation for why the verdict was assigned. |
Example
example = FewShotExample(
    submission="The Industrial Revolution began in Britain...",
    verdict=CriterionVerdict.MET,
    reason="Correctly identifies Britain as origin",
)
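To mirror the "gold anchors" recommendation from the research background above, a hand-curated set would typically also contain negative and not-assessable anchors. A brief sketch, assuming CriterionVerdict exposes the UNMET and CANNOT_ASSESS members referenced by balance_verdicts; the submissions and reasons are invented for illustration:
negative_anchor = FewShotExample(
    submission="The Industrial Revolution began in the United States.",
    verdict=CriterionVerdict.UNMET,  # common failure mode: wrong country of origin
    reason="Names the wrong country of origin",
)
unassessable_anchor = FewShotExample(
    submission="[the response ends before discussing the Industrial Revolution]",
    verdict=CriterionVerdict.CANNOT_ASSESS,  # nothing relevant to judge
    reason="The submission never addresses the criterion",
)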
References¶
Ashktorab, Z., Daly, E. M., Miehling, E., Geyer, W., Santillán Cooper, M., Pedapati, T., Desmond, M., Pan, Q., and Do, H. J. (2025). EvalAssist: A Human-Centered Tool for LLM-as-a-Judge. arXiv:2507.02186.
Casabianca, J., McCaffrey, D. F., Johnson, M. S., Alper, N., and Zubenko, V. (2025). Validity Arguments For Constructed Response Scoring Using Generative Artificial Intelligence Applications. arXiv:2501.02334.