Distribution Metrics

Statistical functions for comparing the distributions of predicted and ground-truth scores.

Overview

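These metrics go beyond single-number summaries (like accuracy or correlation) to compare the full distribution of scores. This matters because high correlation can mask systematic biases in the judge's behavior: a judge that scores every item 0.1 above the human rating correlates perfectly with the human scores yet is consistently too generous.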

Research Background

He et al. (2025) emphasize that correlation alone can mask systematic bias. Distribution-aware comparisons like Earth Mover's Distance reveal systematic deviations that point metrics miss.
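To make this concrete, here is a minimal sketch using only the standard library. The scores are made up for illustration, and `pearson` is a hypothetical helper, not part of autorubric. A judge that rates every item 0.1 above the human score has a perfect Pearson correlation, yet the mean offset is clearly nonzero.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical human scores and a judge that is always 0.1 higher.
human = [0.2, 0.4, 0.5, 0.7, 0.8]
judge = [h + 0.1 for h in human]

r = pearson(judge, human)            # agreement in ranking and spacing
offset = mean(judge) - mean(human)   # systematic shift that r cannot see

print(f"Pearson r: {r:.3f}")
print(f"Mean offset: {offset:+.3f}")
```

Correlation is invariant to adding a constant, which is why the shift only shows up in a distribution-aware comparison.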

Quick Example

from autorubric import earth_movers_distance, ks_test, score_distribution, systematic_bias

predicted_scores = [0.8, 0.7, 0.9, 0.6, 0.85]
ground_truth_scores = [0.75, 0.72, 0.88, 0.65, 0.80]

# Earth Mover's Distance (lower = more similar distributions)
emd_result = earth_movers_distance(predicted_scores, ground_truth_scores)
print(f"EMD: {emd_result.emd:.4f}")

# Kolmogorov-Smirnov test
ks = ks_test(predicted_scores, ground_truth_scores)
print(f"KS statistic: {ks.statistic:.4f}, p-value: {ks.p_value:.4f}")

# Score distribution statistics
pred_dist = score_distribution(predicted_scores)
print(f"Mean: {pred_dist.mean:.3f}, Std: {pred_dist.std:.3f}")

# Systematic bias
bias = systematic_bias(predicted_scores, ground_truth_scores)
print(f"Bias: {bias.mean_bias:+.4f} (predicted tends to be {'higher' if bias.mean_bias > 0 else 'lower'})")

earth_movers_distance

Compute Earth Mover's Distance (Wasserstein-1) between two score distributions.

earth_movers_distance(dist1: ArrayLike, dist2: ArrayLike, *, normalize: bool = True) -> EMDResult

Compute Earth Mover's Distance (Wasserstein distance) between two distributions.

EMD measures the minimum "work" required to transform one distribution into another. Unlike correlation, it captures both shift (systematic bias) and shape differences (variance, skew).
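For two samples of equal size, the one-dimensional Wasserstein-1 distance reduces to the mean absolute difference between the sorted values. The sketch below uses that special case to show that EMD responds both to a pure shift and to a pure change in spread. The helper name `emd_equal_size` is hypothetical and not part of autorubric; according to the source listing on this page, the library calls `scipy.stats.wasserstein_distance`, which also handles unequal sample sizes.

```python
def emd_equal_size(a, b):
    """Wasserstein-1 distance for two non-empty, equal-length 1-D samples."""
    if len(a) != len(b) or not a:
        raise ValueError("this sketch only covers non-empty, equal-length samples")
    # Optimal transport in 1-D pairs the i-th smallest value of each sample.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


base = [0.2, 0.4, 0.6, 0.8]
shifted = [x + 0.1 for x in base]   # same shape, higher mean
narrow = [0.4, 0.45, 0.55, 0.6]     # same mean (0.5), smaller spread

emd_same = emd_equal_size(base, base)
emd_shift = emd_equal_size(base, shifted)
emd_shape = emd_equal_size(base, narrow)

print(f"identical: {emd_same:.3f}")
print(f"shifted:   {emd_shift:.3f}")
print(f"narrower:  {emd_shape:.3f}")
```

The third comparison has no mean difference at all, so a bias check alone would report nothing, while the distance is still clearly above zero.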

PARAMETER DESCRIPTION
dist1

First set of values (e.g., LLM scores).

TYPE: ArrayLike

dist2

Second set of values (e.g., human scores).

TYPE: ArrayLike

normalize

If True, rescale both distributions to [0, 1] using the minimum and maximum of the pooled values before computing EMD. This makes EMD comparable across different scales.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
EMDResult

EMDResult with EMD value and interpretive statistics.
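The effect of normalize can be surprising when the scores occupy a narrow range, because the rescaling uses the pooled minimum and maximum of both inputs rather than the nominal score scale. The following sketch mirrors that step as it appears in the source listing on this page. The helper names are hypothetical, and the equal-size shortcut stands in for `scipy.stats.wasserstein_distance`.

```python
def pooled_min_max(d1, d2):
    """Rescale both samples with the min and max of their pooled values."""
    pooled = list(d1) + list(d2)
    lo, hi = min(pooled), max(pooled)
    if hi == lo:
        # All values identical: nothing to rescale.
        return list(d1), list(d2)
    scale = hi - lo
    return [(x - lo) / scale for x in d1], [(x - lo) / scale for x in d2]


def emd_equal_size(a, b):
    """Wasserstein-1 distance for two equal-length 1-D samples."""
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


llm = [0.8, 0.7, 0.9]
human = [0.7, 0.6, 0.8]

raw = emd_equal_size(llm, human)
scaled = emd_equal_size(*pooled_min_max(llm, human))

print(f"raw: {raw:.3f}, normalized: {scaled:.3f}")
# prints "raw: 0.100, normalized: 0.333"
```

A constant 0.1 offset therefore reads as 0.1 on the raw scale but as roughly 0.33 after normalization, because the pooled values only span 0.6 to 0.9. Pass normalize=False when the scores are already on a shared [0, 1] scale and the raw distance is what you want to interpret.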

Interpretation
  • EMD = 0: Identical distributions
  • EMD < 0.05: Very similar distributions
  • EMD 0.05-0.10: Minor distributional differences
  • EMD 0.10-0.20: Moderate differences (may need attention)
  • EMD > 0.20: Substantial differences (likely systematic bias)
Example

>>> result = earth_movers_distance([0.8, 0.7, 0.9], [0.7, 0.6, 0.8], normalize=False)
>>> round(result.emd, 3)
0.1

Source code in src/autorubric/metrics/distribution.py
def earth_movers_distance(
    dist1: ArrayLike,
    dist2: ArrayLike,
    *,
    normalize: bool = True,
) -> EMDResult:
    """Compute Earth Mover's Distance (Wasserstein distance) between two distributions.

    EMD measures the minimum "work" required to transform one distribution
    into another. Unlike correlation, it captures both shift (systematic bias)
    and shape differences (variance, skew).

    Args:
        dist1: First set of values (e.g., LLM scores).
        dist2: Second set of values (e.g., human scores).
        normalize: If True, normalize both distributions to [0, 1] before
            computing EMD. This makes EMD comparable across different scales.

    Returns:
        EMDResult with EMD value and interpretive statistics.

    Interpretation:
        - EMD = 0: Identical distributions
        - EMD < 0.05: Very similar distributions
        - EMD 0.05-0.10: Minor distributional differences
        - EMD 0.10-0.20: Moderate differences (may need attention)
        - EMD > 0.20: Substantial differences (likely systematic bias)

    Example:
        >>> result = earth_movers_distance([0.8, 0.7, 0.9], [0.7, 0.6, 0.8], normalize=False)
        >>> round(result.emd, 3)
        0.1
    """
    d1 = _to_array(dist1).astype(float)
    d2 = _to_array(dist2).astype(float)

    if len(d1) == 0 or len(d2) == 0:
        return EMDResult(
            emd=0.0,
            mean_diff=0.0,
            std_diff=0.0,
            bias_direction="none",
            bias_magnitude=0.0,
            interpretation="insufficient data",
        )

    # Normalize if requested
    if normalize:
        all_vals = np.concatenate([d1, d2])
        min_val = all_vals.min()
        max_val = all_vals.max()
        if max_val > min_val:
            d1 = (d1 - min_val) / (max_val - min_val)
            d2 = (d2 - min_val) / (max_val - min_val)

    # Compute EMD using scipy
    emd = float(stats.wasserstein_distance(d1, d2))

    # Compute statistics
    mean1, mean2 = np.mean(d1), np.mean(d2)
    std1, std2 = np.std(d1), np.std(d2)
    mean_diff = float(mean1 - mean2)
    std_diff = float(std1 - std2)
    bias_magnitude = abs(mean_diff)

    # Determine bias direction
    if mean_diff > 0.01:
        bias_direction: Literal["higher", "lower", "none"] = "higher"
    elif mean_diff < -0.01:
        bias_direction = "lower"
    else:
        bias_direction = "none"

    return EMDResult(
        emd=emd,
        mean_diff=mean_diff,
        std_diff=std_diff,
        bias_direction=bias_direction,
        bias_magnitude=bias_magnitude,
        interpretation=EMDResult.interpret_emd(emd),
    )

wasserstein_distance

Convenience wrapper around earth_movers_distance that returns only the distance value as a float.

wasserstein_distance(dist1: ArrayLike, dist2: ArrayLike, *, normalize: bool = True) -> float

Compute Wasserstein distance (alias for EMD).

This is a convenience function that returns just the distance value.

PARAMETER DESCRIPTION
dist1

First set of values.

TYPE: ArrayLike

dist2

Second set of values.

TYPE: ArrayLike

normalize

If True, normalize both distributions to [0, 1].

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
float

Wasserstein distance value.

Source code in src/autorubric/metrics/distribution.py
def wasserstein_distance(
    dist1: ArrayLike,
    dist2: ArrayLike,
    *,
    normalize: bool = True,
) -> float:
    """Compute Wasserstein distance (alias for EMD).

    This is a convenience function that returns just the distance value.

    Args:
        dist1: First set of values.
        dist2: Second set of values.
        normalize: If True, normalize both distributions to [0, 1].

    Returns:
        Wasserstein distance value.
    """
    result = earth_movers_distance(dist1, dist2, normalize=normalize)
    return result.emd

ks_test

Perform Kolmogorov-Smirnov test comparing two distributions.

ks_test(sample1: ArrayLike, sample2: ArrayLike | None = None) -> KSTestResult

Kolmogorov-Smirnov test comparing two samples (or one sample to normal).

The KS test measures the maximum difference between cumulative distribution functions. It tests whether two samples come from the same distribution.
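As an illustration of the statistic itself, the sketch below computes the largest gap between two empirical CDFs with the standard library. `ks_statistic` is a hypothetical helper, not part of autorubric, and it does not compute a p-value; the library delegates both to SciPy.

```python
from bisect import bisect_right


def ks_statistic(sample1, sample2):
    """Largest absolute gap between the empirical CDFs of two samples."""
    s1, s2 = sorted(sample1), sorted(sample2)

    def ecdf(sorted_vals, x):
        # Fraction of observations less than or equal to x.
        return bisect_right(sorted_vals, x) / len(sorted_vals)

    # The gap can only change at observed values, so checking those suffices.
    return max(abs(ecdf(s1, x) - ecdf(s2, x)) for x in s1 + s2)


separated = ks_statistic([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
identical = ks_statistic([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])

print(separated)  # 1.0: the samples do not overlap at all
print(identical)  # 0.0: the empirical CDFs coincide
```

With only three values per sample, even complete separation is weak evidence: just 2 of the 20 possible ways to split six values into two groups of three produce a statistic of 1.0, which corresponds to an exact two-sided p-value of 0.1. Small samples can therefore show the maximum statistic without reaching significance at the 0.05 level.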

PARAMETER DESCRIPTION
sample1

First sample of values.

TYPE: ArrayLike

sample2

Second sample. If None, sample1 is tested against a normal distribution parameterized by its own mean and standard deviation.

TYPE: ArrayLike | None DEFAULT: None

RETURNS DESCRIPTION
KSTestResult

KSTestResult with test statistic and p-value.

Example

>>> result = ks_test([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
>>> result.is_significant
False

Source code in src/autorubric/metrics/distribution.py
def ks_test(
    sample1: ArrayLike,
    sample2: ArrayLike | None = None,
) -> KSTestResult:
    """Kolmogorov-Smirnov test comparing two samples (or one sample to normal).

    The KS test measures the maximum difference between cumulative distribution
    functions. It tests whether two samples come from the same distribution.

    Args:
        sample1: First sample of values.
        sample2: Second sample. If None, tests against normal distribution.

    Returns:
        KSTestResult with test statistic and p-value.

    Example:
        >>> result = ks_test([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
        >>> result.is_significant
        False
    """
    s1 = _to_array(sample1).astype(float)

    if sample2 is None:
        # One-sample test against normal distribution
        stat, p_value = stats.kstest(s1, "norm", args=(np.mean(s1), np.std(s1)))
    else:
        # Two-sample test
        s2 = _to_array(sample2).astype(float)
        stat, p_value = stats.ks_2samp(s1, s2)

    return KSTestResult(
        statistic=float(stat),
        p_value=float(p_value),
        is_significant=p_value < 0.05,
    )

score_distribution

Compute distribution statistics for a set of scores.

score_distribution(scores: ArrayLike, *, bins: int | Sequence[float] = 10, include_histogram: bool = True) -> DistributionResult

Compute descriptive statistics for a score distribution.

PARAMETER DESCRIPTION
scores

Sequence of scores to analyze.

TYPE: ArrayLike

bins

Number of bins or explicit bin edges for histogram.

TYPE: int | Sequence[float] DEFAULT: 10

include_histogram

If True, include histogram counts and edges.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
DistributionResult

DistributionResult with summary statistics and optional histogram.
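Most of the summary fields can be cross-checked with Python's `statistics` module. The sketch below is an independent approximation, not the library's implementation: it assumes that the sample standard deviation (ddof=1) is what is reported, as the source listing on this page suggests, and that `method="inclusive"` matches NumPy's default linear percentile interpolation. Skewness, kurtosis, and the histogram are omitted.

```python
from statistics import mean, median, quantiles, stdev, variance

scores = [0.1, 0.5, 0.8, 0.9]

# Quartiles with linear interpolation between the closest ranks.
q25, _, q75 = quantiles(scores, n=4, method="inclusive")

summary = {
    "n": len(scores),
    "mean": mean(scores),
    "std": stdev(scores),          # sample standard deviation (divides by n - 1)
    "variance": variance(scores),  # sample variance (divides by n - 1)
    "min": min(scores),
    "max": max(scores),
    "median": median(scores),
    "q25": q25,
    "q75": q75,
    "iqr": q75 - q25,
}

for name, value in summary.items():
    print(f"{name:>8}: {value:.4f}")
```

`stdev` and `variance` require at least two values, which corresponds to the n > 1 guard in the listing.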

Example

>>> result = score_distribution([0.1, 0.5, 0.8, 0.9])
>>> 0.5 < result.mean < 0.6
True

Source code in src/autorubric/metrics/distribution.py
def score_distribution(
    scores: ArrayLike,
    *,
    bins: int | Sequence[float] = 10,
    include_histogram: bool = True,
) -> DistributionResult:
    """Compute descriptive statistics for a score distribution.

    Args:
        scores: Sequence of scores to analyze.
        bins: Number of bins or explicit bin edges for histogram.
        include_histogram: If True, include histogram counts and edges.

    Returns:
        DistributionResult with summary statistics and optional histogram.

    Example:
        >>> result = score_distribution([0.1, 0.5, 0.8, 0.9])
        >>> 0.5 < result.mean < 0.6
        True
    """
    scores = _to_array(scores).astype(float)

    n = len(scores)
    if n == 0:
        return DistributionResult(
            n=0,
            mean=0.0,
            std=0.0,
            variance=0.0,
            min=0.0,
            max=0.0,
            median=0.0,
            q25=0.0,
            q75=0.0,
            iqr=0.0,
            skewness=0.0,
            kurtosis=0.0,
            histogram=None,
        )

    # Basic statistics
    mean = float(np.mean(scores))
    std = float(np.std(scores, ddof=1)) if n > 1 else 0.0
    variance = float(np.var(scores, ddof=1)) if n > 1 else 0.0
    min_val = float(np.min(scores))
    max_val = float(np.max(scores))
    median = float(np.median(scores))
    q25 = float(np.percentile(scores, 25))
    q75 = float(np.percentile(scores, 75))
    iqr = q75 - q25

    # Skewness and kurtosis
    skewness = float(stats.skew(scores)) if n > 2 else 0.0
    kurtosis = float(stats.kurtosis(scores)) if n > 3 else 0.0

    # Histogram
    histogram = None
    if include_histogram:
        counts, edges = np.histogram(scores, bins=bins)
        histogram = (counts.tolist(), edges.tolist())

    return DistributionResult(
        n=n,
        mean=mean,
        std=std,
        variance=variance,
        min=min_val,
        max=max_val,
        median=median,
        q25=q25,
        q75=q75,
        iqr=iqr,
        skewness=skewness,
        kurtosis=kurtosis,
        histogram=histogram,
    )

systematic_bias

Analyze systematic bias between predicted and ground truth scores.

systematic_bias(y_pred: ArrayLike, y_true: ArrayLike, *, paired: bool = True, confidence: float = 0.95) -> BiasResult

Detect and quantify systematic bias between predictions and ground truth.

Systematic bias occurs when one set consistently scores higher or lower than another, independent of the item being rated.

PARAMETER DESCRIPTION
y_pred

Predicted values (e.g., LLM scores).

TYPE: ArrayLike

y_true

Ground truth values (e.g., human scores).

TYPE: ArrayLike

paired

If True, assumes values are paired (same items). Uses paired t-test. If False, uses independent t-test.

TYPE: bool DEFAULT: True

confidence

Confidence level for interval estimation.

TYPE: float DEFAULT: 0.95

RETURNS DESCRIPTION
BiasResult

BiasResult with bias magnitude, direction, and statistical tests.
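For the paired case, the core quantities are simple functions of the per-item differences. The sketch below recomputes them with the standard library for the scores used in the Quick Example. It is an illustration, not the library's code, and it stops at the t statistic because the p-value and confidence interval need the t distribution, which the library obtains from SciPy.

```python
from math import sqrt
from statistics import mean, stdev

predicted = [0.8, 0.7, 0.9, 0.6, 0.85]
ground_truth = [0.75, 0.72, 0.88, 0.65, 0.80]

# Per-item differences: positive means the prediction is higher.
differences = [p - t for p, t in zip(predicted, ground_truth)]
n = len(differences)

mean_bias = mean(differences)
std_bias = stdev(differences)  # sample standard deviation (divides by n - 1)

# Cohen's d for paired samples: mean difference in units of its spread.
effect_size = mean_bias / std_bias if std_bias > 0 else 0.0

# Paired t statistic: mean difference divided by its standard error.
t_stat = mean_bias / (std_bias / sqrt(n)) if std_bias > 0 else 0.0

print(f"mean_bias:   {mean_bias:+.4f}")
print(f"std_bias:    {std_bias:.4f}")
print(f"effect_size: {effect_size:.3f}")
print(f"t statistic: {t_stat:.3f}")
```

Here the individual differences range from -0.05 to +0.05 and nearly cancel, so the mean bias is only about +0.01 and the effect size is small.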

Example

>>> result = systematic_bias([0.8, 0.7, 0.9], [0.7, 0.6, 0.8])
>>> round(result.mean_bias, 3)
0.1
>>> result.direction
'positive'

Source code in src/autorubric/metrics/distribution.py
def systematic_bias(
    y_pred: ArrayLike,
    y_true: ArrayLike,
    *,
    paired: bool = True,
    confidence: float = 0.95,
) -> BiasResult:
    """Detect and quantify systematic bias between predictions and ground truth.

    Systematic bias occurs when one set consistently scores higher or lower
    than another, independent of the item being rated.

    Args:
        y_pred: Predicted values (e.g., LLM scores).
        y_true: Ground truth values (e.g., human scores).
        paired: If True, assumes values are paired (same items).
            Uses paired t-test. If False, uses independent t-test.
        confidence: Confidence level for interval estimation.

    Returns:
        BiasResult with bias magnitude, direction, and statistical tests.

    Example:
        >>> result = systematic_bias([0.8, 0.7, 0.9], [0.7, 0.6, 0.8])
        >>> round(result.mean_bias, 3)
        0.1
        >>> result.direction
        'positive'
    """
    y_pred = _to_array(y_pred).astype(float)
    y_true = _to_array(y_true).astype(float)

    n = len(y_pred)
    if n < 2:
        return BiasResult(
            mean_bias=0.0,
            std_bias=0.0,
            is_significant=False,
            p_value=None,
            direction="none",
            effect_size=None,
            ci=None,
            n_samples=n,
        )

    if paired:
        _validate_same_length(y_pred, y_true)
        differences = y_pred - y_true
        mean_bias = float(np.mean(differences))
        std_bias = float(np.std(differences, ddof=1))

        # Paired t-test
        t_stat, p_value = stats.ttest_rel(y_pred, y_true)

        # Cohen's d for paired samples
        effect_size = mean_bias / std_bias if std_bias > 0 else 0.0
    else:
        mean_bias = float(np.mean(y_pred) - np.mean(y_true))

        # Pooled standard deviation
        var_pred = np.var(y_pred, ddof=1)
        var_true = np.var(y_true, ddof=1)
        pooled_std = np.sqrt((var_pred + var_true) / 2)
        std_bias = float(pooled_std)

        # Independent t-test
        t_stat, p_value = stats.ttest_ind(y_pred, y_true)

        # Cohen's d
        effect_size = mean_bias / std_bias if std_bias > 0 else 0.0

    # Determine direction
    if mean_bias > 0.001:
        direction: Literal["positive", "negative", "none"] = "positive"
    elif mean_bias < -0.001:
        direction = "negative"
    else:
        direction = "none"

    # Confidence interval for mean bias
    if paired:
        se = std_bias / np.sqrt(n)
    else:
        se = std_bias * np.sqrt(1 / len(y_pred) + 1 / len(y_true))

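    # Degrees of freedom: n - 1 for paired data, n1 + n2 - 2 for independent samples
    df = n - 1 if paired else len(y_pred) + len(y_true) - 2
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df)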
    ci = ConfidenceInterval(
        lower=float(mean_bias - t_crit * se),
        upper=float(mean_bias + t_crit * se),
        confidence=confidence,
        method="t",
    )

    return BiasResult(
        mean_bias=mean_bias,
        std_bias=std_bias,
        is_significant=p_value < 0.05,
        p_value=float(p_value),
        direction=direction,
        effect_size=float(effect_size),
        ci=ci,
        n_samples=n,
    )

Result Types

EMDResult

Bases: BaseModel

Result of Earth Mover's Distance computation.

EMD measures the minimum "work" required to transform one distribution into another. Unlike correlation, it captures both shift (systematic bias) and shape differences (variance, skew).

ATTRIBUTE DESCRIPTION
emd

Earth Mover's Distance (0 to ~1 if normalized).

TYPE: float

mean_diff

Difference in means (dist1 - dist2), computed after normalization when normalize is True.

TYPE: float

std_diff

Difference in standard deviations.

TYPE: float

bias_direction

Whether dist1 tends to score higher or lower than dist2; "none" when the means differ by at most 0.01.

TYPE: Literal['higher', 'lower', 'none']

bias_magnitude

Absolute mean difference.

TYPE: float

interpretation

Human-readable interpretation.

TYPE: str

interpret_emd staticmethod

interpret_emd(emd: float) -> str

Human-readable interpretation of EMD value.

Source code in src/autorubric/metrics/_types.py
@staticmethod
def interpret_emd(emd: float) -> str:
    """Human-readable interpretation of EMD value."""
    if emd < 0.05:
        return "very similar"
    elif emd < 0.10:
        return "minor differences"
    elif emd < 0.20:
        return "moderate differences"
    else:
        return "substantial differences"

KSTestResult

Bases: BaseModel

Kolmogorov-Smirnov test result.

The KS test compares two distributions and tests whether they come from the same underlying distribution.

ATTRIBUTE DESCRIPTION
statistic

KS test statistic.

TYPE: float

p_value

P-value for the test.

TYPE: float

is_significant

Whether the difference is significant (p < 0.05).

TYPE: bool

DistributionResult

Bases: BaseModel

Score distribution statistics.

ATTRIBUTE DESCRIPTION
n

Number of samples.

TYPE: int

mean

Mean score.

TYPE: float

std

Standard deviation.

TYPE: float

variance

Variance.

TYPE: float

min

Minimum score.

TYPE: float

max

Maximum score.

TYPE: float

median

Median score.

TYPE: float

q25

25th percentile.

TYPE: float

q75

75th percentile.

TYPE: float

iqr

Interquartile range.

TYPE: float

skewness

Skewness (measure of asymmetry).

TYPE: float

kurtosis

Kurtosis (measure of tail heaviness).

TYPE: float

histogram

Tuple of (counts, bin_edges).

TYPE: tuple[list[float], list[float]] | None

BiasResult

Bases: BaseModel

Result from systematic bias analysis.

Systematic bias occurs when one rater consistently scores higher or lower than another, independent of the item being rated.

ATTRIBUTE DESCRIPTION
mean_bias

Mean difference (predictions - actuals).

TYPE: float

std_bias

Standard deviation of the paired differences (pooled standard deviation of the two samples when paired=False).

TYPE: float

is_significant

Whether the bias is statistically significant (p < 0.05).

TYPE: bool

p_value

P-value from t-test.

TYPE: float | None

direction

Direction of bias ("positive" if predictions > actuals).

TYPE: Literal['positive', 'negative', 'none']

effect_size

Cohen's d effect size.

TYPE: float | None

ci

Confidence interval for mean bias.

TYPE: ConfidenceInterval | None

n_samples

Number of samples.

TYPE: int

interpret_effect_size staticmethod

interpret_effect_size(d: float) -> str

Interpret effect size using Cohen's guidelines.

Source code in src/autorubric/metrics/_types.py
@staticmethod
def interpret_effect_size(d: float) -> str:
    """Interpret effect size using Cohen's guidelines."""
    abs_d = abs(d)
    if abs_d < 0.2:
        return "negligible"
    elif abs_d < 0.5:
        return "small"
    elif abs_d < 0.8:
        return "medium"
    else:
        return "large"

References

He, J., Shi, J., Zhuo, T. Y., Treude, C., Sun, J., Xing, Z., Du, X., and Lo, D. (2025). LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead. arXiv:2510.24367.