
AutoRubric

A Python library for evaluating text outputs against weighted criteria using LLM-as-a-judge.

What is AutoRubric?

AutoRubric provides a structured, research-backed approach to evaluating LLM outputs using rubric-based grading with LLM judges. Instead of relying on vague quality assessments, AutoRubric enables you to define explicit, weighted criteria and receive detailed per-criterion verdicts with explanations.

Key Features

  • Rubric-based evaluation: Define weighted criteria with explicit requirements
  • Multi-provider support: Works with OpenAI, Anthropic, Google, Azure, Groq, Ollama, and 100+ providers via LiteLLM (see the configuration sketch after this list)
  • Ensemble judging: Combine multiple LLM judges to reduce bias and improve robustness
  • Few-shot learning: Calibrate judges with labeled examples
  • Multi-choice criteria: Support for ordinal and nominal scales beyond binary verdicts
  • Structured outputs: Type-safe responses with detailed per-criterion reports
  • Batch evaluation: High-throughput processing with checkpointing and resumption
  • Metrics & validation: Comprehensive agreement metrics and bootstrap confidence intervals
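
As a taste of the multi-provider support, the sketch below reuses the LLMConfig and CriterionGrader classes shown in the Quick Example below with LiteLLM-style "provider/model" identifiers. The specific model names are illustrative assumptions; any model string supported by LiteLLM and your credentials should work.

from autorubric import LLMConfig
from autorubric.graders import CriterionGrader

# LiteLLM routes requests by "provider/model" strings; the names below are
# examples, not an exhaustive or guaranteed list.
openai_grader = CriterionGrader(llm_config=LLMConfig(model="openai/gpt-4.1-mini"))
anthropic_grader = CriterionGrader(llm_config=LLMConfig(model="anthropic/claude-3-5-sonnet-latest"))
local_grader = CriterionGrader(llm_config=LLMConfig(model="ollama/llama3"))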

Installation

# with uv
uv add autorubric

# or with pip
pip install autorubric

Quick Example

import asyncio
from autorubric import Rubric, LLMConfig
from autorubric.graders import CriterionGrader

async def main():
    # Configure LLM judge
    grader = CriterionGrader(llm_config=LLMConfig(model="openai/gpt-4.1-mini"))

    # Define evaluation rubric
    rubric = Rubric.from_dict([
        {"weight": 10.0, "requirement": "States the correct answer"},
        {"weight": 5.0, "requirement": "Provides clear explanation"},
        {"weight": -10.0, "requirement": "Contains factual errors"}
    ])

    # Grade a response
    result = await rubric.grade(
        to_grade="The answer is 42. This is derived from...",
        grader=grader,
        query="What is the answer to life, the universe, and everything?",
    )

    print(f"Score: {result.score:.2f}")
    for criterion in result.report:
        print(f"  [{criterion.final_verdict}] {criterion.criterion.requirement}")

asyncio.run(main())
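
The printed score aggregates the per-criterion verdicts according to their weights. The exact aggregation is defined by the library; the standalone sketch below shows one plausible scheme (earned weight over total positive weight, with negative-weight criteria acting as penalties) using made-up verdicts, and is not necessarily AutoRubric's exact formula.

# Standalone illustration of weighted-rubric scoring. This is one plausible
# normalization, not necessarily the formula AutoRubric uses internally.
criteria = [
    {"weight": 10.0, "met": True},    # states the correct answer
    {"weight": 5.0, "met": True},     # provides a clear explanation
    {"weight": -10.0, "met": False},  # factual-error penalty, not triggered here
]

earned = sum(c["weight"] for c in criteria if c["met"])
max_positive = sum(c["weight"] for c in criteria if c["weight"] > 0)
score = earned / max_positive  # 15.0 / 15.0 = 1.00 for these verdicts
print(f"Score: {score:.2f}")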

Why AutoRubric?

AutoRubric is built on research findings about effective LLM-as-a-judge evaluation:

  • Analytic rubrics: Multiple explicit criteria increase interpretability and help diagnose failure modes (Casabianca et al., 2025; Ye et al., 2023)
  • Ensemble judging: Multi-LLM panels reduce systematic bias and self-preference (Verga et al., 2024; He et al., 2025)
  • Structured evaluation: Form-filling with per-criterion verdicts improves repeatability (Kim et al., 2024)
  • Position bias mitigation: Randomized option order reduces positional effects (Wang et al., 2023)
  • CANNOT_ASSESS handling: Explicit uncertainty option prevents low-confidence guessing (Min et al., 2023)

Documentation

  • Quickstart: Get up and running with installation, configuration, and your first evaluation
  • Cookbook: Practical examples and recipes for common evaluation scenarios
  • API Reference: Complete API documentation with all classes, functions, and types

Requirements

  • Python 3.11+
  • LiteLLM for multi-provider LLM support
  • Pydantic for structured outputs

References

Casabianca, J., McCaffrey, D. F., Johnson, M. S., Alper, N., and Zubenko, V. (2025). Validity Arguments For Constructed Response Scoring Using Generative Artificial Intelligence Applications. arXiv:2501.02334.

He, J., Shi, J., Zhuo, T. Y., Treude, C., Sun, J., Xing, Z., Du, X., and Lo, D. (2025). LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead. arXiv:2510.24367.

Kim, S., Suk, J., Longpre, S., Lin, B. Y., Shin, J., Welleck, S., Neubig, G., Lee, M., Lee, K., and Seo, M. (2024). Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4334–4353.

Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W., Koh, P. W., Iyyer, M., Zettlemoyer, L., and Hajishirzi, H. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 12076–12100.

Verga, P., Hofstätter, S., Althammer, S., Su, Y., Piktus, A., Arkhangorodsky, A., Xu, M., White, N., and Lewis, P. (2024). Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. arXiv:2404.18796.

Wang, P., Li, L., Chen, L., Cai, Z., Zhu, D., Lin, B., Cao, Y., Liu, Q., Liu, T., and Sui, Z. (2023). Large Language Models are not Fair Evaluators. arXiv:2305.17926.

Ye, S., Kim, D., Kim, S., Hwang, H., Kim, S., Jo, Y., Thorne, J., Kim, J., and Seo, M. (2023). FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets. In Proceedings of the 2024 International Conference on Learning Representations (ICLR).

License

MIT License - see LICENSE for details.