Distribution Metrics¶
Statistical functions for comparing the distributions of predicted and ground-truth scores.
Overview¶
These metrics go beyond single-number summaries (like accuracy or correlation) and compare the full distribution of scores. This matters because a high correlation can hide systematic bias in the judge's behavior.
Research Background
He et al. (2025) emphasize that correlation alone can mask systematic bias. Distribution-aware comparisons like Earth Mover's Distance reveal systematic deviations that point metrics miss.
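The point about correlation can be illustrated with a minimal stdlib-only sketch. The data and the `pearson` helper below are made up for illustration and are not part of autorubric: a hypothetical judge ranks items exactly as the humans do but scores every item 0.1 higher, so correlation is perfect while the shift goes unnoticed.

```python
from math import sqrt
from statistics import fmean


def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation; assumes equal length and non-constant inputs."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


human = [0.2, 0.4, 0.5, 0.7, 0.9]
# Hypothetical judge: same ordering as the humans, every score 0.1 higher.
judge = [h + 0.1 for h in human]

r = pearson(judge, human)
mean_shift = fmean(judge) - fmean(human)
print(f"Pearson r: {r:.3f}")             # Pearson r: 1.000
print(f"Mean shift: {mean_shift:+.3f}")  # Mean shift: +0.100
```

A correlation-only report would rate this judge as ideal; a distribution-aware comparison surfaces the constant offset.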
Quick Example¶
from autorubric import earth_movers_distance, ks_test, score_distribution, systematic_bias
predicted_scores = [0.8, 0.7, 0.9, 0.6, 0.85]
ground_truth_scores = [0.75, 0.72, 0.88, 0.65, 0.80]
# Earth Mover's Distance (lower = more similar distributions).
# EMDResult.emd is `float | None` (None for an empty distribution); guard before formatting.
emd_result = earth_movers_distance(predicted_scores, ground_truth_scores)
print(f"EMD: {emd_result.emd:.4f}" if emd_result.emd is not None else "EMD: n/a")
# Kolmogorov-Smirnov test (statistic / p_value are always floats)
ks = ks_test(predicted_scores, ground_truth_scores)
print(f"KS statistic: {ks.statistic:.4f}, p-value: {ks.p_value:.4f}")
# Score distribution statistics. DistributionResult.mean is `float | None` (None at n=0)
# and .std is `float | None` (None at n < 2); guard before formatting.
pred_dist = score_distribution(predicted_scores)
mean_str = f"{pred_dist.mean:.3f}" if pred_dist.mean is not None else "n/a"
std_str = f"{pred_dist.std:.3f}" if pred_dist.std is not None else "n/a"
print(f"Mean: {mean_str}, Std: {std_str}")
# Systematic bias. bias.mean_bias is `float | None` (None at n=0); guard before
# formatting and comparison.
bias = systematic_bias(predicted_scores, ground_truth_scores)
if bias.mean_bias is not None:
    direction = "higher" if bias.mean_bias > 0 else "lower"
    print(f"Bias: {bias.mean_bias:+.4f} (predicted tends to be {direction})")
else:
    print("Bias: n/a (need at least 1 paired sample)")
earth_movers_distance¶
Compute Earth Mover's Distance (Wasserstein-1) between two score distributions.
earth_movers_distance(dist1: ArrayLike, dist2: ArrayLike, *, normalize: bool = True) -> EMDResult
Compute Earth Mover's Distance (Wasserstein distance) between two distributions.
EMD measures the minimum "work" required to transform one distribution into another. Unlike correlation, it captures both shift (systematic bias) and shape differences (variance, skew).
| PARAMETER | DESCRIPTION |
|---|---|
| dist1 | First set of values (e.g., LLM scores). TYPE: ArrayLike |
| dist2 | Second set of values (e.g., human scores). TYPE: ArrayLike |
| normalize | If True, normalize both distributions to [0, 1] before computing EMD. This makes EMD comparable across different scales. TYPE: bool DEFAULT: True |

| RETURNS | DESCRIPTION |
|---|---|
| EMDResult | EMDResult with EMD value and interpretive statistics. |
Interpretation
- EMD = 0: Identical distributions
- EMD < 0.05: Very similar distributions
- EMD 0.05-0.10: Minor distributional differences
- EMD 0.10-0.20: Moderate differences (may need attention)
- EMD > 0.20: Substantial differences (likely systematic bias)
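The bands above can be sketched as a small lookup. `describe_emd` is a hypothetical helper written for this page, not the library's `interpret_emd`, and the handling of the exact boundary values (0.05, 0.10, 0.20) is an assumption, since the list does not say which band they belong to.

```python
from __future__ import annotations


def describe_emd(emd: float | None) -> str:
    """Map an EMD value to the interpretation bands listed above."""
    if emd is None:
        # No data to transport, so there is nothing to interpret.
        return "insufficient data"
    if emd == 0:
        return "identical distributions"
    if emd < 0.05:
        return "very similar distributions"
    if emd < 0.10:  # assumption: 0.05 itself counts as "minor"
        return "minor distributional differences"
    if emd <= 0.20:  # assumption: 0.10 and 0.20 both count as "moderate"
        return "moderate differences (may need attention)"
    return "substantial differences (likely systematic bias)"


for value in (0.0, 0.03, 0.07, 0.15, 0.30, None):
    print(value, "->", describe_emd(value))
```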
Example
>>> result = earth_movers_distance([0.8, 0.7, 0.9], [0.7, 0.6, 0.8])
>>> result.emd
0.1
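To make the "work" intuition concrete, here is a minimal sketch of Wasserstein-1 for the special case of two equal-size samples, where the optimal transport plan simply pairs sorted values. It is not the library's implementation: `emd_equal_size` is a hypothetical name, and the sketch applies no normalization, so its values need not match `earth_movers_distance` with `normalize=True`.

```python
def emd_equal_size(a: list[float], b: list[float]) -> float:
    """Wasserstein-1 distance between two equal-size samples, no normalization."""
    if len(a) != len(b) or not a:
        raise ValueError("this sketch needs two non-empty samples of equal length")
    # In one dimension the cheapest transport pairs the i-th smallest value
    # of one sample with the i-th smallest value of the other.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


shifted = emd_equal_size([1, 2, 3], [2, 3, 4])  # pure shift by 1
spread = emd_equal_size([2, 2, 2], [1, 2, 3])   # same mean, different spread
print(shifted)  # 1.0
print(round(spread, 4))  # 0.6667
```

The second case has identical means, yet the distance is nonzero, which is the shape sensitivity described above.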
Source code in src/autorubric/metrics/distribution.py
wasserstein_distance¶
Convenience wrapper around earth_movers_distance that returns only the distance value.
Compute Wasserstein distance (alias for EMD).
This is a convenience function that returns just the distance value.
| PARAMETER | DESCRIPTION |
|---|---|
| dist1 | First set of values. |
| dist2 | Second set of values. |
| normalize | If True, normalize both distributions to [0, 1]. |
| RETURNS | DESCRIPTION |
|---|---|
| float \| None | Wasserstein distance value, or None when either distribution is empty (the distance is genuinely undefined with no data to transport). |
Source code in src/autorubric/metrics/distribution.py
ks_test¶
Perform Kolmogorov-Smirnov test comparing two distributions.
ks_test(sample1: ArrayLike, sample2: ArrayLike | None = None) -> KSTestResult
Kolmogorov-Smirnov test comparing two samples (or one sample to normal).
The KS test measures the maximum difference between cumulative distribution functions. It tests whether two samples come from the same distribution.
| PARAMETER | DESCRIPTION |
|---|---|
| sample1 | First sample of values. TYPE: ArrayLike |
| sample2 | Second sample. If None, tests against normal distribution. TYPE: ArrayLike \| None DEFAULT: None |

| RETURNS | DESCRIPTION |
|---|---|
| KSTestResult | KSTestResult with test statistic and p-value. |
Example
>>> result = ks_test([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
>>> result.is_significant
False
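The "maximum difference between cumulative distribution functions" can be sketched in a few lines of stdlib Python. `ks_statistic` is a hypothetical helper for illustration only; it computes the two-sample statistic and no p-value, and it is not how the library is known to be implemented.

```python
from bisect import bisect_right


def ks_statistic(sample1: list[float], sample2: list[float]) -> float:
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    if not sample1 or not sample2:
        raise ValueError("both samples must be non-empty")
    s1, s2 = sorted(sample1), sorted(sample2)
    gap = 0.0
    # Empirical CDFs only change at observed values, so checking those suffices.
    for x in s1 + s2:
        cdf1 = bisect_right(s1, x) / len(s1)  # fraction of sample1 <= x
        cdf2 = bisect_right(s2, x) / len(s2)  # fraction of sample2 <= x
        gap = max(gap, abs(cdf1 - cdf2))
    return gap


print(ks_statistic([0.1, 0.2, 0.3], [0.4, 0.5, 0.6]))  # 1.0
```

For the samples in the example above the statistic is at its maximum of 1.0 because the samples do not overlap, yet the result is reported as not significant: with only three points per sample, even complete separation is weak evidence.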
Source code in src/autorubric/metrics/distribution.py
score_distribution¶
Compute distribution statistics for a set of scores.
score_distribution(scores: ArrayLike, *, bins: int | Sequence[float] = 10, include_histogram: bool = True) -> DistributionResult
Compute descriptive statistics for a score distribution.
| PARAMETER | DESCRIPTION |
|---|---|
| scores | Sequence of scores to analyze. TYPE: ArrayLike |
| bins | Number of bins or explicit bin edges for histogram. TYPE: int \| Sequence[float] DEFAULT: 10 |
| include_histogram | If True, include histogram counts and edges. TYPE: bool DEFAULT: True |

| RETURNS | DESCRIPTION |
|---|---|
| DistributionResult | DistributionResult with summary statistics and optional histogram. |
Example
>>> result = score_distribution([0.1, 0.5, 0.8, 0.9])
>>> 0.5 < result.mean < 0.6
True
Source code in src/autorubric/metrics/distribution.py
systematic_bias¶
Analyze systematic bias between predicted and ground truth scores.
systematic_bias(y_pred: ArrayLike, y_true: ArrayLike, *, paired: bool = True, confidence: float = 0.95) -> BiasResult
Detect and quantify systematic bias between predictions and ground truth.
Systematic bias occurs when one set consistently scores higher or lower than another, independent of the item being rated.
| PARAMETER | DESCRIPTION |
|---|---|
| y_pred | Predicted values (e.g., LLM scores). TYPE: ArrayLike |
| y_true | Ground truth values (e.g., human scores). TYPE: ArrayLike |
| paired | If True, assumes values are paired (same items). Uses paired t-test. If False, uses independent t-test. TYPE: bool DEFAULT: True |
| confidence | Confidence level for interval estimation. TYPE: float DEFAULT: 0.95 |

| RETURNS | DESCRIPTION |
|---|---|
| BiasResult | BiasResult with bias magnitude, direction, and statistical tests. |
Example
>>> result = systematic_bias([0.8, 0.7, 0.9], [0.7, 0.6, 0.8])
>>> result.mean_bias
0.1
>>> result.direction
'positive'
Source code in src/autorubric/metrics/distribution.py
Result Types¶
None means genuinely undefined, never a fabricated 0.0
The numeric fields on these result types are typed float | None. A field is
None when the statistic is genuinely undefined for the data at hand (it is
never silently reported as a fake 0.0). In particular, BiasResult.mean_bias
is None at n=0 and BiasResult.std_bias is None for n < 2; the per-criterion
CorrelationResult.coefficient (Pearson/Spearman/Kendall) is None for a constant
array or fewer than 3 samples. Guard before formatting (e.g.
f"{x:+.4f}" if x is not None else "n/a").
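When many fields need the same guard, the inline expression can be wrapped once. The `fmt` helper below is hypothetical and not part of autorubric; it is only a sketch of the pattern. Note that it tests `is not None` rather than truthiness, so a genuine 0.0 is still printed as a number.

```python
from __future__ import annotations


def fmt(value: float | None, spec: str = ".4f", missing: str = "n/a") -> str:
    """Format a possibly-None statistic, falling back to a placeholder."""
    return format(value, spec) if value is not None else missing


print(fmt(0.0123, "+.4f"))  # +0.0123
print(fmt(None))            # n/a
print(fmt(0.0))             # 0.0000
```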
EMDResult¶
Bases: BaseModel
Result of Earth Mover's Distance computation.
EMD measures the minimum "work" required to transform one distribution into another. Unlike correlation, it captures both shift (systematic bias) and shape differences (variance, skew).
A statistic is None when it is genuinely undefined, never a fake 0.0. With an
empty distribution on either side (no data to transport), emd, mean_diff,
std_diff, and bias_magnitude are all None; bias_direction stays "none"
and interpretation is "insufficient data".
| ATTRIBUTE | DESCRIPTION |
|---|---|
| emd | Earth Mover's Distance (0 to ~1 if normalized). |
| mean_diff | Difference in means (dist2 - dist1). |
| std_diff | Difference in standard deviations. |
| bias_direction | Whether dist1 tends higher, lower, or same. |
| bias_magnitude | Absolute mean difference. |
| interpretation | Human-readable interpretation. |
interpret_emd (staticmethod)¶
Human-readable interpretation of EMD value.
Source code in src/autorubric/metrics/_types.py
KSTestResult¶
Bases: BaseModel
Kolmogorov-Smirnov test result.
The KS test compares two distributions and tests whether they come from the same underlying distribution.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| statistic | KS test statistic. |
| p_value | P-value for the test. |
| is_significant | Whether the difference is significant (p < 0.05). |
DistributionResult¶
Bases: BaseModel
Score distribution statistics.
A statistic is None when it is genuinely undefined for the sample size, never a
fake 0.0. At n=0 every stat is None. A single point still has a defined
mean/min/max/median/q25/q75 (and iqr = q75 − q25 = 0.0, the true IQR of one point);
std/variance need ≥2 points, skewness ≥3, kurtosis ≥4 — each None
below its threshold. n is always the real count.
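The sample-size thresholds above can be sketched with the standard library. This is an illustration of the None-below-threshold rule, not the library's code: `summarize` is a hypothetical name, only a few of the statistics are covered, and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```python
from __future__ import annotations

from statistics import fmean, median, stdev


def summarize(scores: list[float]) -> dict[str, float | int | None]:
    """Return a few descriptive statistics, using None where undefined."""
    n = len(scores)
    return {
        "n": n,  # always the real count
        "mean": fmean(scores) if n >= 1 else None,
        "median": median(scores) if n >= 1 else None,
        "min": min(scores) if n >= 1 else None,
        "max": max(scores) if n >= 1 else None,
        # Assumption: sample standard deviation, which needs at least 2 points.
        "std": stdev(scores) if n >= 2 else None,
    }


print(summarize([]))     # every statistic is None, n is 0
print(summarize([0.5]))  # mean/median/min/max are 0.5, std is None
```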
| ATTRIBUTE | DESCRIPTION |
|---|---|
| n | Number of samples. |
| mean | Mean score. |
| std | Standard deviation. |
| variance | Variance. |
| min | Minimum score. |
| max | Maximum score. |
| median | Median score. |
| q25 | 25th percentile. |
| q75 | 75th percentile. |
| iqr | Interquartile range. |
| skewness | Skewness (measure of asymmetry). |
| kurtosis | Kurtosis (measure of tail heaviness). |
| histogram | Tuple of (counts, bin_edges). |
BiasResult¶
Bases: BaseModel
Result from systematic bias analysis.
Systematic bias occurs when one rater consistently scores higher or lower than another, independent of the item being rated.
A statistic is None when it is genuinely undefined for the sample size, never a
fake 0.0. mean_bias is the single pred−true difference at n=1 (computable) and
is None only at n=0. std_bias is None when undefined (n<2). effect_size
(Cohen's d) is None when std_bias is 0 or undefined.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| mean_bias | Mean difference (predictions - actuals). |
| std_bias | Standard deviation of differences. |
| is_significant | Whether the bias is statistically significant (p < 0.05). |
| p_value | P-value from t-test. |
| direction | Direction of bias ("positive" if predictions > actuals). |
| effect_size | Cohen's d effect size. |
| ci | Confidence interval for mean bias. |
| n_samples | Number of samples. |
interpret_effect_size (staticmethod)¶
Interpret effect size using Cohen's guidelines.
Source code in src/autorubric/metrics/_types.py
References¶
He, J., Shi, J., Zhuo, T. Y., Treude, C., Sun, J., Xing, Z., Du, X., and Lo, D. (2025). LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead. arXiv:2510.24367.