Distribution Metrics¶
Statistical functions for comparing score distributions between predicted and ground truth.
Overview¶
These metrics go beyond point estimates (like accuracy) to compare the full distribution of scores. This is important because high correlation can mask systematic biases in the judge's behavior.
Research Background
He et al. (2025) emphasize that correlation alone can mask systematic bias. Distribution-aware comparisons like Earth Mover's Distance reveal systematic deviations that point metrics miss.
Quick Example¶
```python
from autorubric import earth_movers_distance, ks_test, score_distribution, systematic_bias

predicted_scores = [0.8, 0.7, 0.9, 0.6, 0.85]
ground_truth_scores = [0.75, 0.72, 0.88, 0.65, 0.80]

# Earth Mover's Distance (lower = more similar distributions)
emd = earth_movers_distance(predicted_scores, ground_truth_scores)
print(f"EMD: {emd.emd:.4f}")

# Kolmogorov-Smirnov test
ks = ks_test(predicted_scores, ground_truth_scores)
print(f"KS statistic: {ks.statistic:.4f}, p-value: {ks.p_value:.4f}")

# Score distribution statistics
pred_dist = score_distribution(predicted_scores)
print(f"Mean: {pred_dist.mean:.3f}, Std: {pred_dist.std:.3f}")

# Systematic bias
bias = systematic_bias(predicted_scores, ground_truth_scores)
direction = "higher" if bias.mean_bias > 0 else "lower"
print(f"Bias: {bias.mean_bias:+.4f} (predicted tends to be {direction})")
```
earth_movers_distance¶
Compute Earth Mover's Distance (Wasserstein-1) between two score distributions.

```python
earth_movers_distance(dist1: ArrayLike, dist2: ArrayLike, *, normalize: bool = True) -> EMDResult
```
EMD measures the minimum "work" required to transform one distribution into another. Unlike correlation, it captures both shift (systematic bias) and shape differences (variance, skew).
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `dist1` | `ArrayLike` | First set of values (e.g., LLM scores). |
| `dist2` | `ArrayLike` | Second set of values (e.g., human scores). |
| `normalize` | `bool` | If True, normalize both distributions to [0, 1] before computing EMD. This makes EMD comparable across different scales. |

| RETURNS | DESCRIPTION |
|---|---|
| `EMDResult` | EMDResult with the EMD value and interpretive statistics. |
Interpretation
- EMD = 0: Identical distributions
- EMD < 0.05: Very similar distributions
- EMD 0.05-0.10: Minor distributional differences
- EMD 0.10-0.20: Moderate differences (may need attention)
- EMD > 0.20: Substantial differences (likely systematic bias)
Example

```python
>>> result = earth_movers_distance([0.8, 0.7, 0.9], [0.7, 0.6, 0.8])
>>> result.emd
0.1
```
Source code in src/autorubric/metrics/distribution.py
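To see why EMD complements correlation, consider a judge that tracks human scores perfectly but adds a constant offset. A standalone sketch using `scipy` (which autorubric may or may not use internally; that backend is an assumption):

```python
import numpy as np
from scipy.stats import pearsonr, wasserstein_distance

true = np.array([0.2, 0.4, 0.6, 0.8])
pred = true + 0.1  # constant offset: perfectly correlated, but systematically biased

r, _ = pearsonr(pred, true)
emd = wasserstein_distance(pred, true)
print(f"Pearson r: {r:.2f}")  # ~1.00: correlation is blind to the shift
print(f"EMD: {emd:.2f}")      # ~0.10: EMD exposes the systematic offset
```

This is exactly the failure mode described above: a point metric reports perfect agreement while the distributions are visibly shifted.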
wasserstein_distance¶
Alias for earth_movers_distance.

```python
wasserstein_distance(dist1: ArrayLike, dist2: ArrayLike, *, normalize: bool = True) -> float
```

Compute Wasserstein distance (alias for EMD). This convenience function returns just the distance value.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `dist1` | `ArrayLike` | First set of values. |
| `dist2` | `ArrayLike` | Second set of values. |
| `normalize` | `bool` | If True, normalize both distributions to [0, 1]. |

| RETURNS | DESCRIPTION |
|---|---|
| `float` | Wasserstein distance value. |
Source code in src/autorubric/metrics/distribution.py
ks_test¶
Kolmogorov-Smirnov test comparing two samples (or one sample to a normal distribution).

```python
ks_test(sample1: ArrayLike, sample2: ArrayLike | None = None) -> KSTestResult
```
The KS test measures the maximum difference between cumulative distribution functions. It tests whether two samples come from the same distribution.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `sample1` | `ArrayLike` | First sample of values. |
| `sample2` | `ArrayLike \| None` | Second sample. If None, tests against a normal distribution. |

| RETURNS | DESCRIPTION |
|---|---|
| `KSTestResult` | KSTestResult with test statistic and p-value. |
Example

```python
>>> result = ks_test([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
>>> result.is_significant
False
```
Source code in src/autorubric/metrics/distribution.py
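The KS statistic is the largest vertical gap between the two empirical CDFs. A self-contained illustration with `scipy.stats.ks_2samp` on synthetic data (autorubric's `ks_test` presumably delegates to something similar, but that is an assumption):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
judge = rng.normal(0.70, 0.05, size=200)  # synthetic judge scores
human = rng.normal(0.60, 0.05, size=200)  # synthetic human scores, shifted lower

stat, p_value = ks_2samp(judge, human)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.2e}")
# A large statistic with a tiny p-value indicates the two samples are
# unlikely to come from the same distribution.
```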
score_distribution¶
Compute descriptive statistics for a score distribution.

```python
score_distribution(scores: ArrayLike, *, bins: int | Sequence[float] = 10, include_histogram: bool = True) -> DistributionResult
```
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `scores` | `ArrayLike` | Sequence of scores to analyze. |
| `bins` | `int \| Sequence[float]` | Number of bins or explicit bin edges for the histogram. |
| `include_histogram` | `bool` | If True, include histogram counts and edges. |

| RETURNS | DESCRIPTION |
|---|---|
| `DistributionResult` | DistributionResult with summary statistics and optional histogram. |
Example

```python
>>> result = score_distribution([0.1, 0.5, 0.8, 0.9])
>>> 0.5 < result.mean < 0.6
True
```
Source code in src/autorubric/metrics/distribution.py
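The summary fields of `DistributionResult` can be reproduced with numpy/scipy. A sketch of the underlying computations; the exact estimators autorubric uses (e.g. sample vs. population std, percentile interpolation, Fisher vs. Pearson kurtosis) are assumptions:

```python
import numpy as np
from scipy import stats

scores = np.array([0.1, 0.5, 0.8, 0.9])

mean, std = scores.mean(), scores.std(ddof=1)          # sample std (ddof=1) is an assumption
q25, median, q75 = np.percentile(scores, [25, 50, 75])
skewness = stats.skew(scores)
kurtosis = stats.kurtosis(scores)                      # excess (Fisher) kurtosis
counts, edges = np.histogram(scores, bins=10)          # histogram as (counts, bin_edges)

print(f"mean={mean:.3f}, median={median:.3f}, IQR={q75 - q25:.3f}")
```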
systematic_bias¶
Detect and quantify systematic bias between predictions and ground truth.

```python
systematic_bias(y_pred: ArrayLike, y_true: ArrayLike, *, paired: bool = True, confidence: float = 0.95) -> BiasResult
```
Systematic bias occurs when one set consistently scores higher or lower than another, independent of the item being rated.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `y_pred` | `ArrayLike` | Predicted values (e.g., LLM scores). |
| `y_true` | `ArrayLike` | Ground truth values (e.g., human scores). |
| `paired` | `bool` | If True, assumes values are paired (same items) and uses a paired t-test. If False, uses an independent t-test. |
| `confidence` | `float` | Confidence level for interval estimation. |

| RETURNS | DESCRIPTION |
|---|---|
| `BiasResult` | BiasResult with bias magnitude, direction, and statistical tests. |
Example

```python
>>> result = systematic_bias([0.8, 0.7, 0.9], [0.7, 0.6, 0.8])
>>> result.mean_bias
0.1
>>> result.direction
'positive'
```
Source code in src/autorubric/metrics/distribution.py
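For paired data, the bias analysis reduces to a paired t-test on the per-item differences plus an effect size. A minimal sketch of that computation; the exact Cohen's d variant autorubric uses is an assumption:

```python
import numpy as np
from scipy import stats

pred = np.array([0.8, 0.7, 0.9, 0.6, 0.85])
true = np.array([0.75, 0.72, 0.88, 0.65, 0.80])

diff = pred - true                              # per-item differences (same items)
mean_bias = diff.mean()
t_stat, p_value = stats.ttest_rel(pred, true)   # paired t-test on the same items
cohens_d = mean_bias / diff.std(ddof=1)         # d for paired samples (assumed variant)

print(f"bias={mean_bias:+.4f}, p={p_value:.3f}, d={cohens_d:.2f}")
```

With these values the bias is small and the p-value is well above 0.05, so a `BiasResult` for this data would report a non-significant bias.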
Result Types¶
EMDResult¶
Bases: BaseModel
Result of Earth Mover's Distance computation.
EMD measures the minimum "work" required to transform one distribution into another. Unlike correlation, it captures both shift (systematic bias) and shape differences (variance, skew).
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `emd` | Earth Mover's Distance (0 to ~1 if normalized). |
| `mean_diff` | Difference in means (dist2 - dist1). |
| `std_diff` | Difference in standard deviations. |
| `bias_direction` | Whether dist1 tends higher, lower, or same. |
| `bias_magnitude` | Absolute mean difference. |
| `interpretation` | Human-readable interpretation. |
interpret_emd staticmethod¶
Human-readable interpretation of an EMD value.
Source code in src/autorubric/metrics/_types.py
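The thresholds behind `interpret_emd` are not shown here; a plausible mapping consistent with the interpretation bands listed earlier in this page (the exact return strings are assumptions, not the library's actual wording):

```python
def interpret_emd(emd: float) -> str:
    """Map an EMD value to the documented interpretation bands (labels illustrative)."""
    if emd < 0.05:
        return "very similar distributions"
    if emd < 0.10:
        return "minor distributional differences"
    if emd < 0.20:
        return "moderate differences"
    return "substantial differences"

print(interpret_emd(0.12))  # → "moderate differences"
```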
KSTestResult¶
Bases: BaseModel
Kolmogorov-Smirnov test result.
The KS test compares two distributions and tests whether they come from the same underlying distribution.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `statistic` | KS test statistic. |
| `p_value` | P-value for the test. |
| `is_significant` | Whether the difference is significant (p < 0.05). |
DistributionResult¶
Bases: BaseModel
Score distribution statistics.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `n` | Number of samples. |
| `mean` | Mean score. |
| `std` | Standard deviation. |
| `variance` | Variance. |
| `min` | Minimum score. |
| `max` | Maximum score. |
| `median` | Median score. |
| `q25` | 25th percentile. |
| `q75` | 75th percentile. |
| `iqr` | Interquartile range. |
| `skewness` | Skewness (measure of asymmetry). |
| `kurtosis` | Kurtosis (measure of tail heaviness). |
| `histogram` | Tuple of (counts, bin_edges). |
BiasResult¶
Bases: BaseModel
Result from systematic bias analysis.
Systematic bias occurs when one rater consistently scores higher or lower than another, independent of the item being rated.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| `mean_bias` | Mean difference (predictions - actuals). |
| `std_bias` | Standard deviation of differences. |
| `is_significant` | Whether the bias is statistically significant (p < 0.05). |
| `p_value` | P-value from t-test. |
| `direction` | Direction of bias ("positive" if predictions > actuals). |
| `effect_size` | Cohen's d effect size. |
| `ci` | Confidence interval for mean bias. |
| `n_samples` | Number of samples. |
interpret_effect_size staticmethod¶
Interpret effect size using Cohen's guidelines.
Source code in src/autorubric/metrics/_types.py
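Cohen's conventional guidelines place cutoffs at roughly 0.2, 0.5, and 0.8. A sketch of such an interpreter; the exact thresholds and labels autorubric returns are assumptions:

```python
def interpret_effect_size(d: float) -> str:
    """Classify |d| using Cohen's conventional thresholds (labels illustrative)."""
    d = abs(d)  # sign only encodes direction, not magnitude
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"

print(interpret_effect_size(-0.6))  # → "medium" (direction is ignored)
```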
References¶
He, J., Shi, J., Zhuo, T. Y., Treude, C., Sun, J., Xing, Z., Du, X., and Lo, D. (2025). LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead. arXiv:2510.24367.