Distribution Metrics¶

Statistical functions for comparing the distributions of predicted and ground-truth scores.

Overview¶

These metrics go beyond single-number summaries such as accuracy or correlation and compare the full distribution of scores. This matters because a judge can correlate strongly with the ground truth while still scoring systematically higher or lower.

Research Background

He et al. (2025) emphasize that correlation alone can mask systematic bias. Distribution-aware comparisons like Earth Mover's Distance reveal systematic deviations that point metrics miss.
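To make this concrete, here is a small standard-library sketch with made-up scores. The helper `pearson` and the `human`/`judge` lists are illustrative and not part of autorubric. The judge ranks every item exactly as the humans do but scores each one 0.2 higher:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores: identical ranking, constant +0.2 offset.
human = [0.2, 0.4, 0.5, 0.6, 0.7]
judge = [h + 0.2 for h in human]

r = pearson(judge, human)
mean_shift = sum(j - h for j, h in zip(judge, human)) / len(human)

print(f"Pearson r:  {r:.3f}")
print(f"Mean shift: {mean_shift:+.3f}")
```

The correlation is essentially perfect even though every judge score is too high, which is the kind of deviation the distribution metrics on this page are meant to surface.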

Quick Example¶

from autorubric import earth_movers_distance, ks_test, score_distribution, systematic_bias

predicted_scores = [0.8, 0.7, 0.9, 0.6, 0.85]
ground_truth_scores = [0.75, 0.72, 0.88, 0.65, 0.80]

# Earth Mover's Distance (lower = more similar distributions)
emd = earth_movers_distance(predicted_scores, ground_truth_scores)
print(f"EMD: {emd.emd:.4f}")

# Kolmogorov-Smirnov test
ks = ks_test(predicted_scores, ground_truth_scores)
print(f"KS statistic: {ks.statistic:.4f}, p-value: {ks.p_value:.4f}")

# Score distribution statistics
pred_dist = score_distribution(predicted_scores)
print(f"Mean: {pred_dist.mean:.3f}, Std: {pred_dist.std:.3f}")

# Systematic bias
bias = systematic_bias(predicted_scores, ground_truth_scores)
print(f"Bias: {bias.mean_bias:+.4f} (predicted tends to be {'higher' if bias.mean_bias > 0 else 'lower'})")

earth_movers_distance¶

Compute Earth Mover's Distance (Wasserstein-1) between two score distributions.

earth_movers_distance ¶

earth_movers_distance(dist1: ArrayLike, dist2: ArrayLike, *, normalize: bool = True) -> EMDResult

Compute Earth Mover's Distance (Wasserstein distance) between two distributions.

EMD measures the minimum "work" required to transform one distribution into another. Unlike correlation, it captures both shift (systematic bias) and shape differences (variance, skew).
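For intuition, the one-dimensional Wasserstein-1 distance between two samples of the same size reduces to the mean absolute difference between their sorted values. The sketch below is a dependency-free illustration of that special case, not the library's implementation (which delegates to SciPy and also handles unequal sizes); `emd_equal_size` and the sample lists are hypothetical names.

```python
def emd_equal_size(xs, ys):
    """Wasserstein-1 distance between two equal-size samples.

    For samples of the same size, the optimal transport plan pairs the
    i-th smallest value of one sample with the i-th smallest of the other.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles equal-size samples")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)

base = [0.2, 0.4, 0.6, 0.8]
shifted = [v + 0.1 for v in base]   # same shape, every value 0.1 higher
wider = [0.0, 0.3, 0.7, 1.0]        # same mean as base, larger spread

print(round(emd_equal_size(base, base), 3))     # identical samples
print(round(emd_equal_size(base, shifted), 3))  # pure shift
print(round(emd_equal_size(base, wider), 3))    # pure shape change
```

The second comparison is nonzero because of a shift and the third because of a change in spread, even though `base` and `wider` have the same mean.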

PARAMETER	DESCRIPTION
`dist1`	First set of values (e.g., LLM scores). TYPE: `ArrayLike`
`dist2`	Second set of values (e.g., human scores). TYPE: `ArrayLike`
`normalize`	If True, rescale both distributions to [0, 1] using the minimum and maximum of their combined values before computing EMD. This makes EMD comparable across different scoring scales. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`EMDResult`	EMDResult with EMD value and interpretive statistics.
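The `normalize` option rescales both inputs with a single minimum and maximum taken over their combined values, as the source listing further down shows. A minimal sketch of that step, using the hypothetical helper `joint_min_max` and made-up ratings, suggests why the same judgments expressed on different scales end up comparable:

```python
def joint_min_max(xs, ys):
    """Rescale two samples to [0, 1] using the min and max of their union."""
    lo = min(min(xs), min(ys))
    hi = max(max(xs), max(ys))
    if hi == lo:  # all values identical: nothing to rescale
        return list(xs), list(ys)
    scale = hi - lo
    return [(x - lo) / scale for x in xs], [(y - lo) / scale for y in ys]

# Hypothetical ratings of the same items on a 1-5 scale and a 0-100 scale.
judge_1_to_5, human_1_to_5 = [2, 3, 5], [1, 3, 4]
judge_0_to_100, human_0_to_100 = [25, 50, 100], [0, 50, 75]

a = joint_min_max(judge_1_to_5, human_1_to_5)
b = joint_min_max(judge_0_to_100, human_0_to_100)
print(a)
print(b)
```

Because one shared range is used, a constant offset between the two samples survives the rescaling. Normalizing each sample separately would erase it.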

Interpretation

EMD = 0: Identical distributions
EMD < 0.05: Very similar distributions
EMD 0.05-0.10: Minor distributional differences
EMD 0.10-0.20: Moderate differences (may need attention)
EMD > 0.20: Substantial differences (likely systematic bias)

Example

>>> result = earth_movers_distance([0.8, 0.7, 0.9], [0.7, 0.6, 0.8])
>>> round(result.emd, 3)
0.333

Source code in src/autorubric/metrics/distribution.py

def earth_movers_distance(
    dist1: ArrayLike,
    dist2: ArrayLike,
    *,
    normalize: bool = True,
) -> EMDResult:
    """Compute Earth Mover's Distance (Wasserstein distance) between two distributions.

    EMD measures the minimum "work" required to transform one distribution
    into another. Unlike correlation, it captures both shift (systematic bias)
    and shape differences (variance, skew).

    Args:
        dist1: First set of values (e.g., LLM scores).
        dist2: Second set of values (e.g., human scores).
        normalize: If True, normalize both distributions to [0, 1] before
            computing EMD. This makes EMD comparable across different scales.

    Returns:
        EMDResult with EMD value and interpretive statistics.

    Interpretation:
        - EMD = 0: Identical distributions
        - EMD < 0.05: Very similar distributions
        - EMD 0.05-0.10: Minor distributional differences
        - EMD 0.10-0.20: Moderate differences (may need attention)
        - EMD > 0.20: Substantial differences (likely systematic bias)

    Example:
        >>> result = earth_movers_distance([0.8, 0.7, 0.9], [0.7, 0.6, 0.8])
        >>> round(result.emd, 3)
        0.333
    """
    d1 = _to_array(dist1).astype(float)
    d2 = _to_array(dist2).astype(float)

    if len(d1) == 0 or len(d2) == 0:
        return EMDResult(
            emd=0.0,
            mean_diff=0.0,
            std_diff=0.0,
            bias_direction="none",
            bias_magnitude=0.0,
            interpretation="insufficient data",
        )

    # Normalize if requested
    if normalize:
        all_vals = np.concatenate([d1, d2])
        min_val = all_vals.min()
        max_val = all_vals.max()
        if max_val > min_val:
            d1 = (d1 - min_val) / (max_val - min_val)
            d2 = (d2 - min_val) / (max_val - min_val)

    # Compute EMD using scipy
    emd = float(stats.wasserstein_distance(d1, d2))

    # Compute statistics
    mean1, mean2 = np.mean(d1), np.mean(d2)
    std1, std2 = np.std(d1), np.std(d2)
    mean_diff = float(mean1 - mean2)
    std_diff = float(std1 - std2)
    bias_magnitude = abs(mean_diff)

    # Determine bias direction
    if mean_diff > 0.01:
        bias_direction: Literal["higher", "lower", "none"] = "higher"
    elif mean_diff < -0.01:
        bias_direction = "lower"
    else:
        bias_direction = "none"

    return EMDResult(
        emd=emd,
        mean_diff=mean_diff,
        std_diff=std_diff,
        bias_direction=bias_direction,
        bias_magnitude=bias_magnitude,
        interpretation=EMDResult.interpret_emd(emd),
    )

wasserstein_distance¶

Convenience wrapper around earth_movers_distance that returns only the distance as a float instead of a full EMDResult.

wasserstein_distance ¶

wasserstein_distance(dist1: ArrayLike, dist2: ArrayLike, *, normalize: bool = True) -> float

Compute Wasserstein distance (alias for EMD).

This is a convenience function that returns just the distance value.

PARAMETER	DESCRIPTION
`dist1`	First set of values. TYPE: `ArrayLike`
`dist2`	Second set of values. TYPE: `ArrayLike`
`normalize`	If True, normalize both distributions to [0, 1]. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`float`	Wasserstein distance value.

Source code in src/autorubric/metrics/distribution.py

def wasserstein_distance(
    dist1: ArrayLike,
    dist2: ArrayLike,
    *,
    normalize: bool = True,
) -> float:
    """Compute Wasserstein distance (alias for EMD).

    This is a convenience function that returns just the distance value.

    Args:
        dist1: First set of values.
        dist2: Second set of values.
        normalize: If True, normalize both distributions to [0, 1].

    Returns:
        Wasserstein distance value.
    """
    result = earth_movers_distance(dist1, dist2, normalize=normalize)
    return result.emd

ks_test¶

Perform Kolmogorov-Smirnov test comparing two distributions.

ks_test ¶

ks_test(sample1: ArrayLike, sample2: ArrayLike | None = None) -> KSTestResult

Kolmogorov-Smirnov test comparing two samples (or one sample to normal).

The KS test measures the maximum difference between cumulative distribution functions. It tests whether two samples come from the same distribution.

PARAMETER	DESCRIPTION
`sample1`	First sample of values. TYPE: `ArrayLike`
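`sample2`	Second sample. If None, `sample1` is tested against a normal distribution whose mean and standard deviation are estimated from `sample1` itself; because the parameters are fitted to the data, treat that p-value as approximate. TYPE: `ArrayLike \| None` DEFAULT: `None`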

RETURNS	DESCRIPTION
`KSTestResult`	KSTestResult with test statistic and p-value.
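The two-sample statistic is the largest vertical gap between the two empirical CDFs. The following is a dependency-free sketch of that quantity only. It does not compute a p-value, and `ks_statistic` is a hypothetical helper rather than part of the library.

```python
def ks_statistic(xs, ys):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    def ecdf(sample, t):
        # Fraction of the sample that is <= t.
        return sum(v <= t for v in sample) / len(sample)

    # The gap can only change at observed values, so checking those suffices.
    points = sorted(set(xs) | set(ys))
    return max(abs(ecdf(xs, t) - ecdf(ys, t)) for t in points)

separated = ks_statistic([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
interleaved = ks_statistic([0.1, 0.4, 0.7], [0.2, 0.5, 0.8])
print(separated)
print(round(interleaved, 3))
```

Fully separated samples reach the maximum statistic of 1.0, while interleaved samples stay close together. With only three values per sample, even the maximum statistic is weak evidence, so small samples should be interpreted cautiously.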

Example

>>> result = ks_test([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
>>> result.is_significant
False

Source code in src/autorubric/metrics/distribution.py

def ks_test(
    sample1: ArrayLike,
    sample2: ArrayLike | None = None,
) -> KSTestResult:
    """Kolmogorov-Smirnov test comparing two samples (or one sample to normal).

    The KS test measures the maximum difference between cumulative distribution
    functions. It tests whether two samples come from the same distribution.

    Args:
        sample1: First sample of values.
        sample2: Second sample. If None, tests against normal distribution.

    Returns:
        KSTestResult with test statistic and p-value.

    Example:
        >>> result = ks_test([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
        >>> result.is_significant
        False
    """
    s1 = _to_array(sample1).astype(float)

    if sample2 is None:
        # One-sample test against normal distribution
        stat, p_value = stats.kstest(s1, "norm", args=(np.mean(s1), np.std(s1)))
    else:
        # Two-sample test
        s2 = _to_array(sample2).astype(float)
        stat, p_value = stats.ks_2samp(s1, s2)

    return KSTestResult(
        statistic=float(stat),
        p_value=float(p_value),
        is_significant=p_value < 0.05,
    )

score_distribution¶

Compute distribution statistics for a set of scores.

score_distribution ¶

score_distribution(scores: ArrayLike, *, bins: int | Sequence[float] = 10, include_histogram: bool = True) -> DistributionResult

Compute descriptive statistics for a score distribution.

PARAMETER	DESCRIPTION
`scores`	Sequence of scores to analyze. TYPE: `ArrayLike`
`bins`	Number of bins or explicit bin edges for histogram. TYPE: `int \| Sequence[float]` DEFAULT: `10`
`include_histogram`	If True, include histogram counts and edges. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`DistributionResult`	DistributionResult with summary statistics and optional histogram.
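The `skewness` and `kurtosis` fields come from `scipy.stats.skew` and `scipy.stats.kurtosis` called with their defaults, which use population central moments and report excess kurtosis. A standard-library sketch of those definitions follows; the helper names and data are illustrative only.

```python
def central_moment(values, k):
    """k-th central moment (population form, dividing by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """Biased Fisher-Pearson skewness: m3 / m2**1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Excess (Fisher) kurtosis: m4 / m2**2 - 3, so a normal distribution gives 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]  # one large value pulls the tail to the right

print(round(skewness(symmetric), 3), round(excess_kurtosis(symmetric), 3))
print(round(skewness(right_tailed), 3))
```

A symmetric sample has zero skewness, a long right tail gives a positive value, and a flat sample has negative excess kurtosis.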

Example

>>> result = score_distribution([0.1, 0.5, 0.8, 0.9])
>>> 0.5 < result.mean < 0.6
True

Source code in src/autorubric/metrics/distribution.py

def score_distribution(
    scores: ArrayLike,
    *,
    bins: int | Sequence[float] = 10,
    include_histogram: bool = True,
) -> DistributionResult:
    """Compute descriptive statistics for a score distribution.

    Args:
        scores: Sequence of scores to analyze.
        bins: Number of bins or explicit bin edges for histogram.
        include_histogram: If True, include histogram counts and edges.

    Returns:
        DistributionResult with summary statistics and optional histogram.

    Example:
        >>> result = score_distribution([0.1, 0.5, 0.8, 0.9])
        >>> 0.5 < result.mean < 0.6
        True
    """
    scores = _to_array(scores).astype(float)

    n = len(scores)
    if n == 0:
        return DistributionResult(
            n=0,
            mean=0.0,
            std=0.0,
            variance=0.0,
            min=0.0,
            max=0.0,
            median=0.0,
            q25=0.0,
            q75=0.0,
            iqr=0.0,
            skewness=0.0,
            kurtosis=0.0,
            histogram=None,
        )

    # Basic statistics
    mean = float(np.mean(scores))
    std = float(np.std(scores, ddof=1)) if n > 1 else 0.0
    variance = float(np.var(scores, ddof=1)) if n > 1 else 0.0
    min_val = float(np.min(scores))
    max_val = float(np.max(scores))
    median = float(np.median(scores))
    q25 = float(np.percentile(scores, 25))
    q75 = float(np.percentile(scores, 75))
    iqr = q75 - q25

    # Skewness and kurtosis
    skewness = float(stats.skew(scores)) if n > 2 else 0.0
    kurtosis = float(stats.kurtosis(scores)) if n > 3 else 0.0

    # Histogram
    histogram = None
    if include_histogram:
        counts, edges = np.histogram(scores, bins=bins)
        histogram = (counts.tolist(), edges.tolist())

    return DistributionResult(
        n=n,
        mean=mean,
        std=std,
        variance=variance,
        min=min_val,
        max=max_val,
        median=median,
        q25=q25,
        q75=q75,
        iqr=iqr,
        skewness=skewness,
        kurtosis=kurtosis,
        histogram=histogram,
    )

systematic_bias¶

Analyze systematic bias between predicted and ground truth scores.

systematic_bias ¶

systematic_bias(y_pred: ArrayLike, y_true: ArrayLike, *, paired: bool = True, confidence: float = 0.95) -> BiasResult

Detect and quantify systematic bias between predictions and ground truth.

Systematic bias occurs when one set consistently scores higher or lower than another, independent of the item being rated.

PARAMETER	DESCRIPTION
`y_pred`	Predicted values (e.g., LLM scores). TYPE: `ArrayLike`
`y_true`	Ground truth values (e.g., human scores). TYPE: `ArrayLike`
`paired`	If True, assumes values are paired (same items). Uses paired t-test. If False, uses independent t-test. TYPE: `bool` DEFAULT: `True`
`confidence`	Confidence level for interval estimation. TYPE: `float` DEFAULT: `0.95`

RETURNS	DESCRIPTION
`BiasResult`	BiasResult with bias magnitude, direction, and statistical tests.
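For the paired case, the core quantities can be reproduced with the standard library alone. This is a sketch under the assumption of paired data; it omits the p-value and confidence interval, which need a t distribution, and `paired_bias` plus the score lists are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_bias(y_pred, y_true):
    """Mean difference, its sample std, Cohen's d, and the paired t statistic."""
    if len(y_pred) != len(y_true):
        raise ValueError("paired analysis needs one prediction per ground-truth value")
    diffs = [p - t for p, t in zip(y_pred, y_true)]
    n = len(diffs)
    mean_bias = mean(diffs)
    std_bias = stdev(diffs)  # sample std (n - 1 denominator)
    cohens_d = mean_bias / std_bias if std_bias > 0 else 0.0
    t_stat = mean_bias / (std_bias / sqrt(n)) if std_bias > 0 else 0.0
    return mean_bias, std_bias, cohens_d, t_stat

# Hypothetical scores: the predictions run 0.05 to 0.10 above the ground truth.
predicted = [0.9, 0.7, 0.8, 0.6, 0.75]
ground_truth = [0.8, 0.65, 0.7, 0.55, 0.7]

mean_bias, std_bias, cohens_d, t_stat = paired_bias(predicted, ground_truth)
print(f"mean bias {mean_bias:+.3f}, std {std_bias:.3f}, d {cohens_d:.2f}, t {t_stat:.2f}")
```

For paired data the t statistic equals Cohen's d multiplied by the square root of n, so a consistent offset becomes easier to detect as more items are scored.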

Example

>>> result = systematic_bias([0.8, 0.7, 0.9], [0.7, 0.6, 0.8])
>>> round(result.mean_bias, 3)
0.1
>>> result.direction
'positive'

Source code in src/autorubric/metrics/distribution.py

def systematic_bias(
    y_pred: ArrayLike,
    y_true: ArrayLike,
    *,
    paired: bool = True,
    confidence: float = 0.95,
) -> BiasResult:
    """Detect and quantify systematic bias between predictions and ground truth.

    Systematic bias occurs when one set consistently scores higher or lower
    than another, independent of the item being rated.

    Args:
        y_pred: Predicted values (e.g., LLM scores).
        y_true: Ground truth values (e.g., human scores).
        paired: If True, assumes values are paired (same items).
            Uses paired t-test. If False, uses independent t-test.
        confidence: Confidence level for interval estimation.

    Returns:
        BiasResult with bias magnitude, direction, and statistical tests.

    Example:
        >>> result = systematic_bias([0.8, 0.7, 0.9], [0.7, 0.6, 0.8])
        >>> round(result.mean_bias, 3)
        0.1
        >>> result.direction
        'positive'
    """
    y_pred = _to_array(y_pred).astype(float)
    y_true = _to_array(y_true).astype(float)

    n = len(y_pred)
    if n < 2:
        return BiasResult(
            mean_bias=0.0,
            std_bias=0.0,
            is_significant=False,
            p_value=None,
            direction="none",
            effect_size=None,
            ci=None,
            n_samples=n,
        )

    if paired:
        _validate_same_length(y_pred, y_true)
        differences = y_pred - y_true
        mean_bias = float(np.mean(differences))
        std_bias = float(np.std(differences, ddof=1))

        # Paired t-test
        t_stat, p_value = stats.ttest_rel(y_pred, y_true)

        # Cohen's d for paired samples
        effect_size = mean_bias / std_bias if std_bias > 0 else 0.0
    else:
        mean_bias = float(np.mean(y_pred) - np.mean(y_true))

        # Pooled standard deviation
        var_pred = np.var(y_pred, ddof=1)
        var_true = np.var(y_true, ddof=1)
        pooled_std = np.sqrt((var_pred + var_true) / 2)
        std_bias = float(pooled_std)

        # Independent t-test
        t_stat, p_value = stats.ttest_ind(y_pred, y_true)

        # Cohen's d
        effect_size = mean_bias / std_bias if std_bias > 0 else 0.0

    # Determine direction
    if mean_bias > 0.001:
        direction: Literal["positive", "negative", "none"] = "positive"
    elif mean_bias < -0.001:
        direction = "negative"
    else:
        direction = "none"

    # Confidence interval for mean bias
    if paired:
        se = std_bias / np.sqrt(n)
    else:
        se = std_bias * np.sqrt(1 / len(y_pred) + 1 / len(y_true))

    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, n - 1 if paired else n - 2)
    ci = ConfidenceInterval(
        lower=float(mean_bias - t_crit * se),
        upper=float(mean_bias + t_crit * se),
        confidence=confidence,
        method="t",
    )

    return BiasResult(
        mean_bias=mean_bias,
        std_bias=std_bias,
        is_significant=p_value < 0.05,
        p_value=float(p_value),
        direction=direction,
        effect_size=float(effect_size),
        ci=ci,
        n_samples=n,
    )

Result Types¶

EMDResult¶

EMDResult ¶

Bases: BaseModel

Result of Earth Mover's Distance computation.

EMD measures the minimum "work" required to transform one distribution into another. Unlike correlation, it captures both shift (systematic bias) and shape differences (variance, skew).

ATTRIBUTE	DESCRIPTION
`emd`	Earth Mover's Distance (0 to ~1 if normalized). TYPE: `float`
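`mean_diff`	Difference in means (dist1 - dist2), computed on the normalized values when `normalize=True`. TYPE: `float`
`std_diff`	Difference in standard deviations (dist1 - dist2). TYPE: `float`
`bias_direction`	Whether dist1 tends higher or lower than dist2, or "none" when the mean difference is within ±0.01. TYPE: `Literal['higher', 'lower', 'none']`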
`bias_magnitude`	Absolute mean difference. TYPE: `float`
`interpretation`	Human-readable interpretation. TYPE: `str`

interpret_emd `staticmethod` ¶

interpret_emd(emd: float) -> str

Human-readable interpretation of EMD value.

Source code in src/autorubric/metrics/_types.py

@staticmethod
def interpret_emd(emd: float) -> str:
    """Human-readable interpretation of EMD value."""
    if emd < 0.05:
        return "very similar"
    elif emd < 0.10:
        return "minor differences"
    elif emd < 0.20:
        return "moderate differences"
    else:
        return "substantial differences"

KSTestResult¶

KSTestResult ¶

Bases: BaseModel

Kolmogorov-Smirnov test result.

The KS test compares two distributions and tests whether they come from the same underlying distribution.

ATTRIBUTE	DESCRIPTION
`statistic`	KS test statistic. TYPE: `float`
`p_value`	P-value for the test. TYPE: `float`
`is_significant`	Whether the difference is significant (p < 0.05). TYPE: `bool`

DistributionResult¶

DistributionResult ¶

Bases: BaseModel

Score distribution statistics.

ATTRIBUTE	DESCRIPTION
`n`	Number of samples. TYPE: `int`
`mean`	Mean score. TYPE: `float`
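`std`	Sample standard deviation (ddof=1); 0.0 when there are fewer than two scores. TYPE: `float`
`variance`	Sample variance (ddof=1); 0.0 when there are fewer than two scores. TYPE: `float`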
`min`	Minimum score. TYPE: `float`
`max`	Maximum score. TYPE: `float`
`median`	Median score. TYPE: `float`
`q25`	25th percentile. TYPE: `float`
`q75`	75th percentile. TYPE: `float`
`iqr`	Interquartile range. TYPE: `float`
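`skewness`	Skewness (measure of asymmetry); 0.0 when there are fewer than three scores. TYPE: `float`
`kurtosis`	Excess kurtosis (measure of tail heaviness; 0 for a normal distribution); 0.0 when there are fewer than four scores. TYPE: `float`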
`histogram`	Tuple of (counts, bin_edges). TYPE: `tuple[list[float], list[float]] \| None`

BiasResult¶

BiasResult ¶

Bases: BaseModel

Result from systematic bias analysis.

Systematic bias occurs when one rater consistently scores higher or lower than another, independent of the item being rated.

ATTRIBUTE	DESCRIPTION
`mean_bias`	Mean difference (predictions - actuals). TYPE: `float`
`std_bias`	Standard deviation of the paired differences, or the pooled standard deviation of the two samples when `paired=False`. TYPE: `float`
`is_significant`	Whether the bias is statistically significant (p < 0.05). TYPE: `bool`
`p_value`	P-value from t-test. TYPE: `float \| None`
`direction`	Direction of bias ("positive" if predictions > actuals). TYPE: `Literal['positive', 'negative', 'none']`
`effect_size`	Cohen's d effect size. TYPE: `float \| None`
`ci`	Confidence interval for mean bias. TYPE: `ConfidenceInterval \| None`
`n_samples`	Number of samples. TYPE: `int`

interpret_effect_size `staticmethod` ¶

interpret_effect_size(d: float) -> str

Interpret effect size using Cohen's guidelines.

Source code in src/autorubric/metrics/_types.py

@staticmethod
def interpret_effect_size(d: float) -> str:
    """Interpret effect size using Cohen's guidelines."""
    abs_d = abs(d)
    if abs_d < 0.2:
        return "negligible"
    elif abs_d < 0.5:
        return "small"
    elif abs_d < 0.8:
        return "medium"
    else:
        return "large"

References¶

He, J., Shi, J., Zhuo, T. Y., Treude, C., Sun, J., Xing, Z., Du, X., and Lo, D. (2025). LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead. arXiv:2510.24367.
