Distribution Metrics

Statistical functions for comparing predicted and ground-truth score distributions.

Overview

These metrics go beyond point estimates (like accuracy) to compare the full distribution of scores. This is important because high correlation can mask systematic biases in the judge's behavior.

Research Background

He et al. (2025) emphasize that correlation alone can mask systematic bias. Distribution-aware comparisons like Earth Mover's Distance reveal systematic deviations that point metrics miss.

Quick Example

from autorubric import earth_movers_distance, ks_test, score_distribution, systematic_bias

predicted_scores = [0.8, 0.7, 0.9, 0.6, 0.85]
ground_truth_scores = [0.75, 0.72, 0.88, 0.65, 0.80]

# Earth Mover's Distance (lower = more similar distributions).
# EMDResult.emd is `float | None` (None for an empty distribution); guard before formatting.
emd_result = earth_movers_distance(predicted_scores, ground_truth_scores)
print(f"EMD: {emd_result.emd:.4f}" if emd_result.emd is not None else "EMD: n/a")

# Kolmogorov-Smirnov test (statistic / p_value are always floats)
ks = ks_test(predicted_scores, ground_truth_scores)
print(f"KS statistic: {ks.statistic:.4f}, p-value: {ks.p_value:.4f}")

# Score distribution statistics. DistributionResult.mean is `float | None` (None at n=0)
# and .std is `float | None` (None at n < 2); guard before formatting.
pred_dist = score_distribution(predicted_scores)
mean_str = f"{pred_dist.mean:.3f}" if pred_dist.mean is not None else "n/a"
std_str = f"{pred_dist.std:.3f}" if pred_dist.std is not None else "n/a"
print(f"Mean: {mean_str}, Std: {std_str}")

# Systematic bias. bias.mean_bias is `float | None` (None at n=0); guard before
# formatting and comparison.
bias = systematic_bias(predicted_scores, ground_truth_scores)
if bias.mean_bias is not None:
    direction = "higher" if bias.mean_bias > 0 else "lower"
    print(f"Bias: {bias.mean_bias:+.4f} (predicted tends to be {direction})")
else:
    print("Bias: n/a (need at least 1 paired sample)")

earth_movers_distance

Compute Earth Mover's Distance (Wasserstein-1) between two score distributions.

earth_movers_distance

earth_movers_distance(dist1: ArrayLike, dist2: ArrayLike, *, normalize: bool = True) -> EMDResult

Compute Earth Mover's Distance (Wasserstein distance) between two distributions.

EMD measures the minimum "work" required to transform one distribution into another. Unlike correlation, it captures both shift (systematic bias) and shape differences (variance, skew).

PARAMETER DESCRIPTION
dist1

First set of values (e.g., LLM scores).

TYPE: ArrayLike

dist2

Second set of values (e.g., human scores).

TYPE: ArrayLike

normalize

If True, rescale both distributions to [0, 1] using the minimum and maximum of the pooled values before computing EMD. This makes EMD comparable across different scales.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
EMDResult

EMDResult with EMD value and interpretive statistics.

Interpretation
  • EMD = 0: Identical distributions
  • EMD < 0.05: Very similar distributions
  • EMD 0.05-0.10: Minor distributional differences
  • EMD 0.10-0.20: Moderate differences (may need attention)
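  • EMD ≥ 0.20: Substantial differences (likely systematic bias)

Example

>>> result = earth_movers_distance([0.8, 0.7, 0.9], [0.7, 0.6, 0.8], normalize=False)
>>> round(result.emd, 4)
0.1

With the default normalize=True, both inputs are first rescaled by the pooled range (0.6 to 0.9 here), and the same call returns an EMD of about 0.3333.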
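The "work" interpretation can be sketched in pure Python: in one dimension, the Wasserstein-1 distance equals the area between the two empirical CDFs. The helper below is a hypothetical illustration, not the library's implementation (which delegates to scipy.stats.wasserstein_distance). The name emd_1d and the plain-list inputs are assumptions.

```python
from bisect import bisect_right


def emd_1d(xs, ys, normalize=True):
    """Area between the empirical CDFs of xs and ys (hypothetical helper)."""
    if not xs or not ys:
        return None  # no data to transport
    xs = sorted(float(x) for x in xs)
    ys = sorted(float(y) for y in ys)
    if normalize:
        # Pooled min-max rescaling, so both samples share one [0, 1] scale.
        lo, hi = min(xs[0], ys[0]), max(xs[-1], ys[-1])
        if hi > lo:  # skip rescaling when every value is identical
            xs = [(x - lo) / (hi - lo) for x in xs]
            ys = [(y - lo) / (hi - lo) for y in ys]
    grid = sorted(set(xs) | set(ys))
    area = 0.0
    for left, right in zip(grid, grid[1:]):
        # Both CDFs are constant on [left, right), so the area is a rectangle.
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        area += abs(cdf_x - cdf_y) * (right - left)
    return area


raw = emd_1d([0.8, 0.7, 0.9], [0.7, 0.6, 0.8], normalize=False)
scaled = emd_1d([0.8, 0.7, 0.9], [0.7, 0.6, 0.8])
print(f"raw: {raw:.4f}, normalized: {scaled:.4f}")
```

Because the rescaling divides by the pooled range, the normalized value is the raw value divided by that range. This is why the interpretation thresholds only make sense on a common scale.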

Source code in src/autorubric/metrics/distribution.py
def earth_movers_distance(
    dist1: ArrayLike,
    dist2: ArrayLike,
    *,
    normalize: bool = True,
) -> EMDResult:
    """Compute Earth Mover's Distance (Wasserstein distance) between two distributions.

    EMD measures the minimum "work" required to transform one distribution
    into another. Unlike correlation, it captures both shift (systematic bias)
    and shape differences (variance, skew).

    Args:
        dist1: First set of values (e.g., LLM scores).
        dist2: Second set of values (e.g., human scores).
        normalize: If True, normalize both distributions to [0, 1] before
            computing EMD. This makes EMD comparable across different scales.

    Returns:
        EMDResult with EMD value and interpretive statistics.

    Interpretation:
        - EMD = 0: Identical distributions
        - EMD < 0.05: Very similar distributions
        - EMD 0.05-0.10: Minor distributional differences
        - EMD 0.10-0.20: Moderate differences (may need attention)
        - EMD > 0.20: Substantial differences (likely systematic bias)

    Example:
        >>> result = earth_movers_distance([0.8, 0.7, 0.9], [0.7, 0.6, 0.8], normalize=False)
        >>> round(result.emd, 4)
        0.1
    """
    d1 = _to_array(dist1).astype(float)
    d2 = _to_array(dist2).astype(float)

    if len(d1) == 0 or len(d2) == 0:
        # No data to transport ⇒ every distance/diff statistic is genuinely undefined.
        return EMDResult(
            emd=None,
            mean_diff=None,
            std_diff=None,
            bias_direction="none",
            bias_magnitude=None,
            interpretation="insufficient data",
        )

    # Normalize if requested
    if normalize:
        all_vals = np.concatenate([d1, d2])
        min_val = all_vals.min()
        max_val = all_vals.max()
        if max_val > min_val:
            d1 = (d1 - min_val) / (max_val - min_val)
            d2 = (d2 - min_val) / (max_val - min_val)

    # Compute EMD using scipy
    emd = float(stats.wasserstein_distance(d1, d2))

    # Compute statistics
    mean1, mean2 = np.mean(d1), np.mean(d2)
    std1, std2 = np.std(d1), np.std(d2)
    mean_diff = float(mean1 - mean2)
    std_diff = float(std1 - std2)
    bias_magnitude = abs(mean_diff)

    # Determine bias direction
    if mean_diff > 0.01:
        bias_direction: Literal["higher", "lower", "none"] = "higher"
    elif mean_diff < -0.01:
        bias_direction = "lower"
    else:
        bias_direction = "none"

    return EMDResult(
        emd=emd,
        mean_diff=mean_diff,
        std_diff=std_diff,
        bias_direction=bias_direction,
        bias_magnitude=bias_magnitude,
        interpretation=EMDResult.interpret_emd(emd),
    )

wasserstein_distance

Convenience wrapper around earth_movers_distance that returns only the distance value.

wasserstein_distance

wasserstein_distance(dist1: ArrayLike, dist2: ArrayLike, *, normalize: bool = True) -> float | None

Compute Wasserstein distance (alias for EMD).

This is a convenience function that returns just the distance value.

PARAMETER DESCRIPTION
dist1

First set of values.

TYPE: ArrayLike

dist2

Second set of values.

TYPE: ArrayLike

normalize

If True, normalize both distributions to [0, 1].

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
float | None

Wasserstein distance value, or None when either distribution is empty (the distance is genuinely undefined with no data to transport).

Source code in src/autorubric/metrics/distribution.py
def wasserstein_distance(
    dist1: ArrayLike,
    dist2: ArrayLike,
    *,
    normalize: bool = True,
) -> float | None:
    """Compute Wasserstein distance (alias for EMD).

    This is a convenience function that returns just the distance value.

    Args:
        dist1: First set of values.
        dist2: Second set of values.
        normalize: If True, normalize both distributions to [0, 1].

    Returns:
        Wasserstein distance value, or ``None`` when either distribution is empty (the
        distance is genuinely undefined with no data to transport).
    """
    result = earth_movers_distance(dist1, dist2, normalize=normalize)
    return result.emd

ks_test

Perform a Kolmogorov-Smirnov test comparing two samples, or one sample against a fitted normal distribution.

ks_test

ks_test(sample1: ArrayLike, sample2: ArrayLike | None = None) -> KSTestResult

Kolmogorov-Smirnov test comparing two samples (or one sample to normal).

The KS test measures the maximum difference between cumulative distribution functions. It tests whether two samples come from the same distribution.

PARAMETER DESCRIPTION
sample1

First sample of values.

TYPE: ArrayLike

sample2

Second sample. If None, sample1 is tested against a normal distribution fitted with its own mean and standard deviation.

TYPE: ArrayLike | None DEFAULT: None

RETURNS DESCRIPTION
KSTestResult

KSTestResult with test statistic and p-value.

Example

>>> result = ks_test([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
>>> result.is_significant
False
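The statistic itself is simple to compute by hand. The following standard-library sketch shows both variants: the largest gap between two empirical CDFs, and the largest gap between one empirical CDF and a normal distribution fitted to the same sample. The helper names are hypothetical. The sketch covers only the statistic, since the p-values in ks_test come from SciPy.

```python
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev


def ks_two_sample(xs, ys):
    """Largest gap between the two empirical CDFs (hypothetical helper)."""
    xs, ys = sorted(xs), sorted(ys)
    return max(
        abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        for v in xs + ys
    )


def ks_vs_fitted_normal(xs):
    """Largest gap between the empirical CDF and a normal fitted to xs.

    Assumes xs is not constant (a zero standard deviation has no normal CDF).
    """
    xs = sorted(xs)
    n = len(xs)
    # pstdev is the population std, matching np.std's default ddof=0.
    fitted = NormalDist(fmean(xs), pstdev(xs))
    gaps = []
    for i, x in enumerate(xs, start=1):
        cdf = fitted.cdf(x)
        gaps.append(i / n - cdf)        # ECDF just after x vs. the normal CDF
        gaps.append(cdf - (i - 1) / n)  # normal CDF vs. the ECDF just before x
    return max(gaps)


separated = ks_two_sample([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
identical = ks_two_sample([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
one_sample = ks_vs_fitted_normal([-1.0, 0.0, 1.0])
print(separated, identical, round(one_sample, 3))
```

Two fully separated samples reach the maximum statistic of 1.0. With only three points per sample, even that is not significant at 0.05, which is why the example above reports False.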

Source code in src/autorubric/metrics/distribution.py
def ks_test(
    sample1: ArrayLike,
    sample2: ArrayLike | None = None,
) -> KSTestResult:
    """Kolmogorov-Smirnov test comparing two samples (or one sample to normal).

    The KS test measures the maximum difference between cumulative distribution
    functions. It tests whether two samples come from the same distribution.

    Args:
        sample1: First sample of values.
        sample2: Second sample. If None, tests against normal distribution.

    Returns:
        KSTestResult with test statistic and p-value.

    Example:
        >>> result = ks_test([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
        >>> result.is_significant
        False
    """
    s1 = _to_array(sample1).astype(float)

    if sample2 is None:
        # One-sample test against normal distribution
        stat, p_value = stats.kstest(s1, "norm", args=(np.mean(s1), np.std(s1)))
    else:
        # Two-sample test
        s2 = _to_array(sample2).astype(float)
        stat, p_value = stats.ks_2samp(s1, s2)

    return KSTestResult(
        statistic=float(stat),
        p_value=float(p_value),
        is_significant=p_value < 0.05,
    )

score_distribution

Compute distribution statistics for a set of scores.

score_distribution

score_distribution(scores: ArrayLike, *, bins: int | Sequence[float] = 10, include_histogram: bool = True) -> DistributionResult

Compute descriptive statistics for a score distribution.

PARAMETER DESCRIPTION
scores

Sequence of scores to analyze.

TYPE: ArrayLike

bins

Number of bins or explicit bin edges for histogram.

TYPE: int | Sequence[float] DEFAULT: 10

include_histogram

If True, include histogram counts and edges.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
DistributionResult

DistributionResult with summary statistics and optional histogram.

Example

>>> result = score_distribution([0.1, 0.5, 0.8, 0.9])
>>> 0.5 < result.mean < 0.6
True

Source code in src/autorubric/metrics/distribution.py
def score_distribution(
    scores: ArrayLike,
    *,
    bins: int | Sequence[float] = 10,
    include_histogram: bool = True,
) -> DistributionResult:
    """Compute descriptive statistics for a score distribution.

    Args:
        scores: Sequence of scores to analyze.
        bins: Number of bins or explicit bin edges for histogram.
        include_histogram: If True, include histogram counts and edges.

    Returns:
        DistributionResult with summary statistics and optional histogram.

    Example:
        >>> result = score_distribution([0.1, 0.5, 0.8, 0.9])
        >>> 0.5 < result.mean < 0.6
        True
    """
    scores = _to_array(scores).astype(float)

    n = len(scores)
    if n == 0:
        # No data ⇒ every descriptive statistic is genuinely undefined (not 0.0).
        return DistributionResult(
            n=0,
            mean=None,
            std=None,
            variance=None,
            min=None,
            max=None,
            median=None,
            q25=None,
            q75=None,
            iqr=None,
            skewness=None,
            kurtosis=None,
            histogram=None,
        )

    # Basic statistics. A single point still has a defined mean/min/max/median/quartiles
    # (and iqr = q75 - q25 = 0.0, the true IQR of one point). Dispersion- and
    # shape-dependent stats need more points and are None below their thresholds:
    #   std/variance need >=2, skewness needs >=3, kurtosis needs >=4.
    mean = float(np.mean(scores))
    std = float(np.std(scores, ddof=1)) if n > 1 else None
    variance = float(np.var(scores, ddof=1)) if n > 1 else None
    min_val = float(np.min(scores))
    max_val = float(np.max(scores))
    median = float(np.median(scores))
    q25 = float(np.percentile(scores, 25))
    q75 = float(np.percentile(scores, 75))
    iqr = q75 - q25

    # Skewness and kurtosis
    skewness = float(stats.skew(scores)) if n > 2 else None
    kurtosis = float(stats.kurtosis(scores)) if n > 3 else None

    # Histogram
    histogram = None
    if include_histogram:
        counts, edges = np.histogram(scores, bins=bins)
        histogram = (counts.tolist(), edges.tolist())

    return DistributionResult(
        n=n,
        mean=mean,
        std=std,
        variance=variance,
        min=min_val,
        max=max_val,
        median=median,
        q25=q25,
        q75=q75,
        iqr=iqr,
        skewness=skewness,
        kurtosis=kurtosis,
        histogram=histogram,
    )

systematic_bias

Analyze systematic bias between predicted and ground truth scores.

systematic_bias

systematic_bias(y_pred: ArrayLike, y_true: ArrayLike, *, paired: bool = True, confidence: float = 0.95) -> BiasResult

Detect and quantify systematic bias between predictions and ground truth.

Systematic bias occurs when one set consistently scores higher or lower than another, independent of the item being rated.

PARAMETER DESCRIPTION
y_pred

Predicted values (e.g., LLM scores).

TYPE: ArrayLike

y_true

Ground truth values (e.g., human scores).

TYPE: ArrayLike

paired

If True, assumes values are paired (same items). Uses paired t-test. If False, uses independent t-test.

TYPE: bool DEFAULT: True

confidence

Confidence level for interval estimation.

TYPE: float DEFAULT: 0.95

RETURNS DESCRIPTION
BiasResult

BiasResult with bias magnitude, direction, and statistical tests.

Example

>>> result = systematic_bias([0.8, 0.7, 0.9], [0.7, 0.6, 0.8])
>>> result.mean_bias
0.1
>>> result.direction
'positive'
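For the paired case, the core quantities can be sketched with the standard library alone. The helper below is hypothetical, not the library's code. It computes the mean difference, its sample standard deviation, Cohen's d, and the paired t statistic. It omits the p-value and confidence interval, which need the t distribution (the library takes it from scipy.stats).

```python
from math import sqrt
from statistics import fmean, stdev


def paired_bias(y_pred, y_true):
    """Mean paired difference, its sample std, and Cohen's d (hypothetical helper)."""
    if len(y_pred) != len(y_true):
        raise ValueError("paired inputs must have the same length")
    diffs = [p - t for p, t in zip(y_pred, y_true)]
    n = len(diffs)
    if n == 0:
        return {"mean_bias": None, "std_bias": None, "effect_size": None, "t_stat": None}
    mean_bias = fmean(diffs)
    std_bias = stdev(diffs) if n > 1 else None  # sample std (ddof=1) needs 2+ pairs
    defined = std_bias is not None and std_bias > 0
    effect_size = mean_bias / std_bias if defined else None
    t_stat = effect_size * sqrt(n) if defined else None  # t = mean / (std / sqrt(n))
    return {
        "mean_bias": mean_bias,
        "std_bias": std_bias,
        "effect_size": effect_size,
        "t_stat": t_stat,
    }


predicted_scores = [0.8, 0.7, 0.9, 0.6, 0.85]
ground_truth_scores = [0.75, 0.72, 0.88, 0.65, 0.80]
result = paired_bias(predicted_scores, ground_truth_scores)
single = paired_bias([0.5], [0.4])
print(result)
print(single)
```

A single pair still yields a mean bias (the one difference), while every dispersion-dependent field stays None. This mirrors the n == 1 branch described for systematic_bias.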

Source code in src/autorubric/metrics/distribution.py
def systematic_bias(
    y_pred: ArrayLike,
    y_true: ArrayLike,
    *,
    paired: bool = True,
    confidence: float = 0.95,
) -> BiasResult:
    """Detect and quantify systematic bias between predictions and ground truth.

    Systematic bias occurs when one set consistently scores higher or lower
    than another, independent of the item being rated.

    Args:
        y_pred: Predicted values (e.g., LLM scores).
        y_true: Ground truth values (e.g., human scores).
        paired: If True, assumes values are paired (same items).
            Uses paired t-test. If False, uses independent t-test.
        confidence: Confidence level for interval estimation.

    Returns:
        BiasResult with bias magnitude, direction, and statistical tests.

    Example:
        >>> result = systematic_bias([0.8, 0.7, 0.9], [0.7, 0.6, 0.8])
        >>> result.mean_bias
        0.1
        >>> result.direction
        'positive'
    """
    y_pred = _to_array(y_pred).astype(float)
    y_true = _to_array(y_true).astype(float)

    n = len(y_pred)
    if n == 0:
        # Nothing to compare ⇒ mean_bias is genuinely undefined (not 0.0).
        return BiasResult(
            mean_bias=None,
            std_bias=None,
            is_significant=False,
            p_value=None,
            direction="none",
            effect_size=None,
            ci=None,
            n_samples=0,
        )

    if n == 1:
        # mean_bias IS computable from a single pair (the one difference); the
        # dispersion-dependent statistics (std_bias, Cohen's d, CI, p-value) are not.
        if paired:
            _validate_same_length(y_pred, y_true)
            single = float(y_pred[0] - y_true[0])
        else:
            single = float(np.mean(y_pred) - np.mean(y_true))

        single_direction: Literal["positive", "negative", "none"]
        if single > 0.001:
            single_direction = "positive"
        elif single < -0.001:
            single_direction = "negative"
        else:
            single_direction = "none"

        return BiasResult(
            mean_bias=single,
            std_bias=None,
            is_significant=False,
            p_value=None,
            direction=single_direction,
            effect_size=None,
            ci=None,
            n_samples=1,
        )

    if paired:
        _validate_same_length(y_pred, y_true)
        differences = y_pred - y_true
        mean_bias = float(np.mean(differences))
        std_bias = float(np.std(differences, ddof=1))

        # Paired t-test
        t_stat, p_value = stats.ttest_rel(y_pred, y_true)

        # Cohen's d for paired samples (undefined when std_bias == 0).
        effect_size = mean_bias / std_bias if std_bias > 0 else None
    else:
        mean_bias = float(np.mean(y_pred) - np.mean(y_true))

        # Pooled standard deviation
        var_pred = np.var(y_pred, ddof=1)
        var_true = np.var(y_true, ddof=1)
        pooled_std = np.sqrt((var_pred + var_true) / 2)
        std_bias = float(pooled_std)

        # Independent t-test
        t_stat, p_value = stats.ttest_ind(y_pred, y_true)

        # Cohen's d (undefined when pooled std == 0).
        effect_size = mean_bias / std_bias if std_bias > 0 else None

    # Determine direction
    if mean_bias > 0.001:
        direction: Literal["positive", "negative", "none"] = "positive"
    elif mean_bias < -0.001:
        direction = "negative"
    else:
        direction = "none"

    # Confidence interval for mean bias. The t critical value is undefined when the
    # degrees of freedom drop below 1 (e.g. n=2 unpaired → df=0 → stats.t.ppf returns
    # NaN); the interval is then genuinely undefined → ci=None, never a NaN-valued CI.
    df = (n - 1) if paired else (n - 2)
    if df < 1:
        ci = None
    else:
        if paired:
            se = std_bias / np.sqrt(n)
        else:
            se = std_bias * np.sqrt(1 / len(y_pred) + 1 / len(y_true))

        t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df)
        ci = ConfidenceInterval(
            lower=float(mean_bias - t_crit * se),
            upper=float(mean_bias + t_crit * se),
            confidence=confidence,
            method="t",
        )

    return BiasResult(
        mean_bias=mean_bias,
        std_bias=std_bias,
        is_significant=p_value < 0.05,
        p_value=float(p_value),
        direction=direction,
        effect_size=float(effect_size) if effect_size is not None else None,
        ci=ci,
        n_samples=n,
    )

Result Types

None means genuinely undefined, never a fabricated 0.0

The numeric fields on these result types are typed float | None. A field is None when the statistic is genuinely undefined for the data at hand (it is never silently reported as a fake 0.0). In particular, BiasResult.mean_bias is None at n=0 and BiasResult.std_bias is None for n < 2; the per-criterion CorrelationResult.coefficient (Pearson/Spearman/Kendall) is None for a constant array or fewer than 3 samples. Guard before formatting (e.g. f"{x:+.4f}" if x is not None else "n/a").
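When many fields need the same guard, a small formatting helper keeps call sites short. This is a hypothetical convenience function, not something autorubric is known to provide. Its name and defaults are assumptions.

```python
def fmt_optional(value, spec=".4f", missing="n/a"):
    """Format a float | None field without raising on None (hypothetical helper)."""
    # Test identity with None, not truthiness: 0.0 is a real value and must be formatted.
    return missing if value is None else format(value, spec)


signed = fmt_optional(0.1, "+.4f")
absent = fmt_optional(None)
zero = fmt_optional(0.0)
print(signed, absent, zero)
```

The explicit `is None` check matters here: a genuine 0.0 must print as a number, while only an undefined statistic prints as "n/a".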

EMDResult

EMDResult

Bases: BaseModel

Result of Earth Mover's Distance computation.

EMD measures the minimum "work" required to transform one distribution into another. Unlike correlation, it captures both shift (systematic bias) and shape differences (variance, skew).

A statistic is None when it is genuinely undefined, never a fake 0.0. With an empty distribution on either side (no data to transport), emd, mean_diff, std_diff, and bias_magnitude are all None; bias_direction stays "none" and interpretation is "insufficient data".

ATTRIBUTE DESCRIPTION
emd

Earth Mover's Distance (0 to ~1 if normalized). None for empty input.

TYPE: float | None

mean_diff

Difference in means (dist1 - dist2). None for empty input.

TYPE: float | None

std_diff

Difference in standard deviations. None for empty input.

TYPE: float | None

bias_direction

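Whether dist1 tends higher than dist2 (mean_diff > 0.01), lower (mean_diff < -0.01), or neither ("none", which is also the value for empty input).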

TYPE: Literal['higher', 'lower', 'none']

bias_magnitude

Absolute mean difference. None for empty input.

TYPE: float | None

interpretation

Human-readable interpretation.

TYPE: str

interpret_emd staticmethod

interpret_emd(emd: float) -> str

Human-readable interpretation of EMD value.

Source code in src/autorubric/metrics/_types.py
@staticmethod
def interpret_emd(emd: float) -> str:
    """Human-readable interpretation of EMD value."""
    if emd < 0.05:
        return "very similar"
    elif emd < 0.10:
        return "minor differences"
    elif emd < 0.20:
        return "moderate differences"
    else:
        return "substantial differences"

KSTestResult

KSTestResult

Bases: BaseModel

Kolmogorov-Smirnov test result.

The KS test compares two distributions and tests whether they come from the same underlying distribution.

ATTRIBUTE DESCRIPTION
statistic

KS test statistic.

TYPE: float

p_value

P-value for the test.

TYPE: float

is_significant

Whether the difference is significant (p < 0.05).

TYPE: bool

DistributionResult

DistributionResult

Bases: BaseModel

Score distribution statistics.

A statistic is None when it is genuinely undefined for the sample size, never a fake 0.0. At n=0 every stat is None. A single point still has a defined mean/min/max/median/q25/q75 (and iqr = q75 − q25 = 0.0, the true IQR of one point); std/variance need ≥2 points, skewness ≥3, kurtosis ≥4 — each None below its threshold. n is always the real count.

ATTRIBUTE DESCRIPTION
n

Number of samples.

TYPE: int

mean

Mean score. None when undefined (n = 0).

TYPE: float | None

std

Standard deviation. None when undefined (n < 2).

TYPE: float | None

variance

Variance. None when undefined (n < 2).

TYPE: float | None

min

Minimum score. None when undefined (n = 0).

TYPE: float | None

max

Maximum score. None when undefined (n = 0).

TYPE: float | None

median

Median score. None when undefined (n = 0).

TYPE: float | None

q25

25th percentile. None when undefined (n = 0).

TYPE: float | None

q75

75th percentile. None when undefined (n = 0).

TYPE: float | None

iqr

Interquartile range. None when undefined (n = 0); 0.0 for a single point.

TYPE: float | None

skewness

Skewness (measure of asymmetry). None when undefined (n < 3).

TYPE: float | None

kurtosis

Kurtosis (measure of tail heaviness). None when undefined (n < 4).

TYPE: float | None

histogram

Tuple of (counts, bin_edges). None when include_histogram is False or n = 0.

TYPE: tuple[list[float], list[float]] | None
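The sample-size thresholds described for DistributionResult can be illustrated with a standard-library sketch. The helper below is hypothetical and covers only a subset of the fields. Its skewness uses the biased moment estimator m3 / m2^1.5, and returning None for constant data is a choice made for this sketch, not documented library behavior.

```python
import statistics


def describe(scores):
    """Summary statistics with None below each sample-size threshold (hypothetical helper)."""
    xs = [float(s) for s in scores]
    n = len(xs)
    if n == 0:
        return {"n": 0, "mean": None, "std": None, "median": None, "skewness": None}

    mean = statistics.fmean(xs)
    std = statistics.stdev(xs) if n > 1 else None  # sample std (ddof=1) needs 2+ points

    skewness = None
    if n > 2:
        m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
        m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
        # Assumption for this sketch: constant data has no defined skewness.
        skewness = m3 / m2**1.5 if m2 > 0 else None

    return {
        "n": n,
        "mean": mean,
        "std": std,
        "median": statistics.median(xs),
        "skewness": skewness,
    }


empty = describe([])
single = describe([0.5])
full = describe([0.1, 0.5, 0.8, 0.9])
symmetric = describe([1, 2, 3])
print(empty, single, full, symmetric, sep="\n")
```

Each statistic becomes available at its own threshold, so callers should check fields individually and not assume that a non-empty sample defines all of them.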

BiasResult

BiasResult

Bases: BaseModel

Result from systematic bias analysis.

Systematic bias occurs when one rater consistently scores higher or lower than another, independent of the item being rated.

A statistic is None when it is genuinely undefined for the sample size, never a fake 0.0. mean_bias is the single pred−true difference at n=1 (computable) and is None only at n=0. std_bias is None when undefined (n<2). effect_size (Cohen's d) is None when std_bias is 0 or undefined.

ATTRIBUTE DESCRIPTION
mean_bias

Mean difference (predictions - actuals). None only at n=0.

TYPE: float | None

std_bias

Standard deviation of differences. None when undefined (n < 2).

TYPE: float | None

is_significant

Whether the bias is statistically significant (p < 0.05). Always False when p_value is None.

TYPE: bool

p_value

P-value from t-test. None when undefined (n < 2).

TYPE: float | None

direction

Direction of bias ("positive" if predictions > actuals).

TYPE: Literal['positive', 'negative', 'none']

effect_size

Cohen's d effect size. None when undefined (std_bias 0 or undefined).

TYPE: float | None

ci

Confidence interval for mean bias. None when undefined (n < 2, or fewer than one degree of freedom).

TYPE: ConfidenceInterval | None

n_samples

Number of samples.

TYPE: int

interpret_effect_size staticmethod

interpret_effect_size(d: float) -> str

Interpret effect size using Cohen's guidelines.

Source code in src/autorubric/metrics/_types.py
@staticmethod
def interpret_effect_size(d: float) -> str:
    """Interpret effect size using Cohen's guidelines."""
    abs_d = abs(d)
    if abs_d < 0.2:
        return "negligible"
    elif abs_d < 0.5:
        return "small"
    elif abs_d < 0.8:
        return "medium"
    else:
        return "large"

References

He, J., Shi, J., Zhuo, T. Y., Treude, C., Sun, J., Xing, Z., Du, X., and Lo, D. (2025). LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead. arXiv:2510.24367.