Distribution Metrics

Statistical functions for comparing the distribution of predicted scores against the distribution of ground-truth scores.

Overview

These metrics go beyond single-number summaries (such as accuracy or correlation) and compare the full distribution of scores. This matters because a high correlation can mask systematic bias in the judge's behavior.

Research Background

He et al. (2025) emphasize that correlation alone can mask systematic bias. Distribution-aware comparisons like Earth Mover's Distance reveal systematic deviations that point metrics miss.
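As a small illustration of that point (a sketch with made-up scores, not data from the cited paper), adding a constant offset to every score leaves the Pearson correlation at 1.0 while the mean difference exposes the shift. The helper name `pearson` is hypothetical and the snippet uses only the standard library.

```python
from math import sqrt
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up scores: the judge is always 0.2 more generous than the human rater.
human = [0.2, 0.4, 0.5, 0.7]
judge = [h + 0.2 for h in human]

r = pearson(human, judge)                    # perfect linear agreement
mean_shift = fmean(judge) - fmean(human)     # but a constant offset remains
print(f"Pearson r: {r:.3f}, mean shift: {mean_shift:+.3f}")
```

Correlation only measures how well the two sets of scores move together, so it cannot see the offset; a distribution-aware comparison can.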

Quick Example

from autorubric import earth_movers_distance, ks_test, score_distribution, systematic_bias

predicted_scores = [0.8, 0.7, 0.9, 0.6, 0.85]
ground_truth_scores = [0.75, 0.72, 0.88, 0.65, 0.80]

# Earth Mover's Distance (lower = more similar distributions).
# EMDResult.emd is `float | None` (None for an empty distribution); guard before formatting.
emd_result = earth_movers_distance(predicted_scores, ground_truth_scores)
print(f"EMD: {emd_result.emd:.4f}" if emd_result.emd is not None else "EMD: n/a")

# Kolmogorov-Smirnov test (statistic / p_value are always floats)
ks = ks_test(predicted_scores, ground_truth_scores)
print(f"KS statistic: {ks.statistic:.4f}, p-value: {ks.p_value:.4f}")

# Score distribution statistics. DistributionResult.mean is `float | None` (None at n=0)
# and .std is `float | None` (None at n < 2); guard before formatting.
pred_dist = score_distribution(predicted_scores)
mean_str = f"{pred_dist.mean:.3f}" if pred_dist.mean is not None else "n/a"
std_str = f"{pred_dist.std:.3f}" if pred_dist.std is not None else "n/a"
print(f"Mean: {mean_str}, Std: {std_str}")

# Systematic bias. bias.mean_bias is `float | None` (None at n=0); guard before
# formatting and comparison.
bias = systematic_bias(predicted_scores, ground_truth_scores)
if bias.mean_bias is not None:
    direction = "higher" if bias.mean_bias > 0 else "lower"
    print(f"Bias: {bias.mean_bias:+.4f} (predicted tends to be {direction})")
else:
    print("Bias: n/a (need at least 1 paired sample)")

earth_movers_distance

earth_movers_distance(dist1: ArrayLike, dist2: ArrayLike, *, normalize: bool = True) -> EMDResult

Compute Earth Mover's Distance (Wasserstein-1 distance) between two score distributions.

EMD measures the minimum "work" required to transform one distribution into another. Unlike correlation, it captures both shift (systematic bias) and shape differences (variance, skew).
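For intuition, the one-dimensional distance can be sketched in pure Python as the area between the two empirical CDFs. This is an illustrative sketch only: the function name `emd_1d` is hypothetical, it assumes both samples are non-empty, and it is not the library's implementation, which delegates to scipy.

```python
from bisect import bisect_right

def emd_1d(u, v):
    """Wasserstein-1 distance between two non-empty 1-D samples."""
    u_sorted, v_sorted = sorted(u), sorted(v)
    points = sorted(u_sorted + v_sorted)
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Both empirical CDFs are constant on [left, right).
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total

# A pure shift: every value in the first sample is 0.1 higher.
shift = emd_1d([0.8, 0.7, 0.9], [0.7, 0.6, 0.8])

# Same mean, different spread: still a non-zero distance.
spread = emd_1d([0.4, 0.5, 0.6], [0.3, 0.5, 0.7])

print(f"shift: {shift:.4f}, spread: {spread:.4f}")
```

The second case has identical means, so a mean-difference check would report nothing, yet the distance is non-zero because the shapes differ.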

PARAMETER	DESCRIPTION
`dist1`	First set of values (e.g., LLM scores). TYPE: `ArrayLike`
`dist2`	Second set of values (e.g., human scores). TYPE: `ArrayLike`
`normalize`	If True, normalize both distributions to [0, 1] before computing EMD. This makes EMD comparable across different scales. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`EMDResult`	EMDResult with EMD value and interpretive statistics.

Interpretation

EMD = 0: Identical distributions
EMD < 0.05: Very similar distributions
EMD 0.05-0.10: Minor distributional differences
EMD 0.10-0.20: Moderate differences (may need attention)
EMD > 0.20: Substantial differences (likely systematic bias)
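These thresholds are easiest to read on a [0, 1] scale, which is what normalize=True aims for. Based on the source listing further down, the rescaling uses the minimum and maximum of the pooled values from both inputs, not the nominal range of the rubric. The sketch below mirrors that step in pure Python; the function name `minmax_jointly` and the 1-to-5 scores are hypothetical.

```python
def minmax_jointly(d1, d2):
    """Rescale both samples with the min and max of their pooled values."""
    pooled = list(d1) + list(d2)
    lo, hi = min(pooled), max(pooled)
    if hi == lo:
        # All values identical: nothing to rescale.
        return list(d1), list(d2)
    span = hi - lo
    return [(x - lo) / span for x in d1], [(x - lo) / span for x in d2]

# Hypothetical scores on a 1-5 rubric; the observed range is only 2..5.
raw_1, raw_2 = [4.0, 3.0, 5.0], [3.0, 2.0, 4.0]
norm_1, norm_2 = minmax_jointly(raw_1, raw_2)

raw_shift = sum(raw_1) / len(raw_1) - sum(raw_2) / len(raw_2)
norm_shift = sum(norm_1) / len(norm_1) - sum(norm_2) / len(norm_2)
print(f"raw shift: {raw_shift:.3f}, normalized shift: {norm_shift:.3f}")
```

Because the observed range drives the rescaling, a one-point shift here becomes one third of the normalized scale, which is worth keeping in mind when the scores cover only part of the rubric.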

Example

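>>> result = earth_movers_distance([0.8, 0.7, 0.9], [0.7, 0.6, 0.8], normalize=False)
>>> round(result.emd, 4)
0.1

With the default normalize=True, the same inputs are first rescaled by their pooled range (0.6 to 0.9), so the distance is about 0.3333 instead.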

Source code in src/autorubric/metrics/distribution.py

def earth_movers_distance(
    dist1: ArrayLike,
    dist2: ArrayLike,
    *,
    normalize: bool = True,
) -> EMDResult:
    """Compute Earth Mover's Distance (Wasserstein distance) between two distributions.

    EMD measures the minimum "work" required to transform one distribution
    into another. Unlike correlation, it captures both shift (systematic bias)
    and shape differences (variance, skew).

    Args:
        dist1: First set of values (e.g., LLM scores).
        dist2: Second set of values (e.g., human scores).
        normalize: If True, normalize both distributions to [0, 1] before
            computing EMD. This makes EMD comparable across different scales.

    Returns:
        EMDResult with EMD value and interpretive statistics.

    Interpretation:
        - EMD = 0: Identical distributions
        - EMD < 0.05: Very similar distributions
        - EMD 0.05-0.10: Minor distributional differences
        - EMD 0.10-0.20: Moderate differences (may need attention)
        - EMD > 0.20: Substantial differences (likely systematic bias)

    Example:
        >>> result = earth_movers_distance([0.8, 0.7, 0.9], [0.7, 0.6, 0.8])
        >>> result.emd
        0.1
    """
    d1 = _to_array(dist1).astype(float)
    d2 = _to_array(dist2).astype(float)

    if len(d1) == 0 or len(d2) == 0:
        # No data to transport ⇒ every distance/diff statistic is genuinely undefined.
        return EMDResult(
            emd=None,
            mean_diff=None,
            std_diff=None,
            bias_direction="none",
            bias_magnitude=None,
            interpretation="insufficient data",
        )

    # Normalize if requested
    if normalize:
        all_vals = np.concatenate([d1, d2])
        min_val = all_vals.min()
        max_val = all_vals.max()
        if max_val > min_val:
            d1 = (d1 - min_val) / (max_val - min_val)
            d2 = (d2 - min_val) / (max_val - min_val)

    # Compute EMD using scipy
    emd = float(stats.wasserstein_distance(d1, d2))

    # Compute statistics
    mean1, mean2 = np.mean(d1), np.mean(d2)
    std1, std2 = np.std(d1), np.std(d2)
    mean_diff = float(mean1 - mean2)
    std_diff = float(std1 - std2)
    bias_magnitude = abs(mean_diff)

    # Determine bias direction
    if mean_diff > 0.01:
        bias_direction: Literal["higher", "lower", "none"] = "higher"
    elif mean_diff < -0.01:
        bias_direction = "lower"
    else:
        bias_direction = "none"

    return EMDResult(
        emd=emd,
        mean_diff=mean_diff,
        std_diff=std_diff,
        bias_direction=bias_direction,
        bias_magnitude=bias_magnitude,
        interpretation=EMDResult.interpret_emd(emd),
    )

wasserstein_distance

wasserstein_distance(dist1: ArrayLike, dist2: ArrayLike, *, normalize: bool = True) -> float | None

Convenience wrapper around earth_movers_distance that returns only the distance value (the emd field of the EMDResult) rather than the full result object.

PARAMETER	DESCRIPTION
`dist1`	First set of values. TYPE: `ArrayLike`
`dist2`	Second set of values. TYPE: `ArrayLike`
`normalize`	If True, normalize both distributions to [0, 1]. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`float \| None`	Wasserstein distance value, or `None` when either distribution is empty (the distance is genuinely undefined with no data to transport).

Source code in src/autorubric/metrics/distribution.py

def wasserstein_distance(
    dist1: ArrayLike,
    dist2: ArrayLike,
    *,
    normalize: bool = True,
) -> float | None:
    """Compute Wasserstein distance (alias for EMD).

    This is a convenience function that returns just the distance value.

    Args:
        dist1: First set of values.
        dist2: Second set of values.
        normalize: If True, normalize both distributions to [0, 1].

    Returns:
        Wasserstein distance value, or ``None`` when either distribution is empty (the
        distance is genuinely undefined with no data to transport).
    """
    result = earth_movers_distance(dist1, dist2, normalize=normalize)
    return result.emd

ks_test

ks_test(sample1: ArrayLike, sample2: ArrayLike | None = None) -> KSTestResult

Kolmogorov-Smirnov test comparing two samples, or one sample against a normal distribution.

The KS statistic is the maximum difference between two cumulative distribution functions. The two-sample form tests whether both samples come from the same distribution. When sample2 is None, sample1 is instead compared against a normal distribution with the sample's own mean and standard deviation.

PARAMETER	DESCRIPTION
`sample1`	First sample of values. TYPE: `ArrayLike`
`sample2`	Second sample. If None, tests against normal distribution. TYPE: `ArrayLike \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`KSTestResult`	KSTestResult with test statistic and p-value.
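To make the statistic concrete, the sketch below computes the two-sample KS statistic in pure Python and pairs it with an exhaustive permutation p-value. Both function names are hypothetical, and the permutation approach is swapped in for illustration: the library calls scipy, whose p-value method may differ, and enumerating every split is only practical for very small samples.

```python
from bisect import bisect_right
from itertools import combinations

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two non-empty samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in a_sorted + b_sorted
    )

def permutation_p_value(a, b):
    """Share of all relabelings whose statistic is at least the observed one."""
    pooled = list(a) + list(b)
    observed = ks_statistic(a, b)
    indices = range(len(pooled))
    hits = total = 0
    for chosen in combinations(indices, len(a)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point ties.
        if ks_statistic(group_a, group_b) >= observed - 1e-12:
            hits += 1
    return hits / total

stat = ks_statistic([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
p = permutation_p_value([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
print(f"statistic: {stat:.2f}, permutation p-value: {p:.2f}")
```

Even with completely separated samples, three points per side allow only 20 possible splits, so the smallest achievable p-value is well above 0.05. Small samples rarely reach significance regardless of how different they look.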

Example

>>> result = ks_test([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
>>> result.is_significant
False

Source code in src/autorubric/metrics/distribution.py

def ks_test(
    sample1: ArrayLike,
    sample2: ArrayLike | None = None,
) -> KSTestResult:
    """Kolmogorov-Smirnov test comparing two samples (or one sample to normal).

    The KS test measures the maximum difference between cumulative distribution
    functions. It tests whether two samples come from the same distribution.

    Args:
        sample1: First sample of values.
        sample2: Second sample. If None, tests against normal distribution.

    Returns:
        KSTestResult with test statistic and p-value.

    Example:
        >>> result = ks_test([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
        >>> result.is_significant
        False
    """
    s1 = _to_array(sample1).astype(float)

    if sample2 is None:
        # One-sample test against normal distribution
        stat, p_value = stats.kstest(s1, "norm", args=(np.mean(s1), np.std(s1)))
    else:
        # Two-sample test
        s2 = _to_array(sample2).astype(float)
        stat, p_value = stats.ks_2samp(s1, s2)

    return KSTestResult(
        statistic=float(stat),
        p_value=float(p_value),
        is_significant=p_value < 0.05,
    )

score_distribution

score_distribution(scores: ArrayLike, *, bins: int | Sequence[float] = 10, include_histogram: bool = True) -> DistributionResult

Compute descriptive statistics for a score distribution.

PARAMETER	DESCRIPTION
`scores`	Sequence of scores to analyze. TYPE: `ArrayLike`
`bins`	Number of bins or explicit bin edges for histogram. TYPE: `int \| Sequence[float]` DEFAULT: `10`
`include_histogram`	If True, include histogram counts and edges. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`DistributionResult`	DistributionResult with summary statistics and optional histogram.

Example

>>> result = score_distribution([0.1, 0.5, 0.8, 0.9])
>>> 0.5 < result.mean < 0.6
True

Source code in src/autorubric/metrics/distribution.py

def score_distribution(
    scores: ArrayLike,
    *,
    bins: int | Sequence[float] = 10,
    include_histogram: bool = True,
) -> DistributionResult:
    """Compute descriptive statistics for a score distribution.

    Args:
        scores: Sequence of scores to analyze.
        bins: Number of bins or explicit bin edges for histogram.
        include_histogram: If True, include histogram counts and edges.

    Returns:
        DistributionResult with summary statistics and optional histogram.

    Example:
        >>> result = score_distribution([0.1, 0.5, 0.8, 0.9])
        >>> 0.5 < result.mean < 0.6
        True
    """
    scores = _to_array(scores).astype(float)

    n = len(scores)
    if n == 0:
        # No data ⇒ every descriptive statistic is genuinely undefined (not 0.0).
        return DistributionResult(
            n=0,
            mean=None,
            std=None,
            variance=None,
            min=None,
            max=None,
            median=None,
            q25=None,
            q75=None,
            iqr=None,
            skewness=None,
            kurtosis=None,
            histogram=None,
        )

    # Basic statistics. A single point still has a defined mean/min/max/median/quartiles
    # (and iqr = q75 - q25 = 0.0, the true IQR of one point). Dispersion- and
    # shape-dependent stats need more points and are None below their thresholds:
    #   std/variance need >=2, skewness needs >=3, kurtosis needs >=4.
    mean = float(np.mean(scores))
    std = float(np.std(scores, ddof=1)) if n > 1 else None
    variance = float(np.var(scores, ddof=1)) if n > 1 else None
    min_val = float(np.min(scores))
    max_val = float(np.max(scores))
    median = float(np.median(scores))
    q25 = float(np.percentile(scores, 25))
    q75 = float(np.percentile(scores, 75))
    iqr = q75 - q25

    # Skewness and kurtosis
    skewness = float(stats.skew(scores)) if n > 2 else None
    kurtosis = float(stats.kurtosis(scores)) if n > 3 else None

    # Histogram
    histogram = None
    if include_histogram:
        counts, edges = np.histogram(scores, bins=bins)
        histogram = (counts.tolist(), edges.tolist())

    return DistributionResult(
        n=n,
        mean=mean,
        std=std,
        variance=variance,
        min=min_val,
        max=max_val,
        median=median,
        q25=q25,
        q75=q75,
        iqr=iqr,
        skewness=skewness,
        kurtosis=kurtosis,
        histogram=histogram,
    )

systematic_bias

systematic_bias(y_pred: ArrayLike, y_true: ArrayLike, *, paired: bool = True, confidence: float = 0.95) -> BiasResult

Detect and quantify systematic bias between predictions and ground truth.

Systematic bias occurs when one set of scores is consistently higher or lower than the other, independent of the item being rated.

PARAMETER	DESCRIPTION
`y_pred`	Predicted values (e.g., LLM scores). TYPE: `ArrayLike`
`y_true`	Ground truth values (e.g., human scores). TYPE: `ArrayLike`
`paired`	If True, assumes values are paired (same items). Uses paired t-test. If False, uses independent t-test. TYPE: `bool` DEFAULT: `True`
`confidence`	Confidence level for interval estimation. TYPE: `float` DEFAULT: `0.95`

RETURNS	DESCRIPTION
`BiasResult`	BiasResult with bias magnitude, direction, and statistical tests.
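The paired case can be sketched with the standard library alone: take the per-item differences, then report their mean, their sample standard deviation, and Cohen's d. This is a simplified illustration under stated assumptions: `paired_bias` is a hypothetical helper, the 0.001 direction tolerance is copied from the source listing below, and the t-test and confidence interval are left out because they need a t-distribution.

```python
from statistics import fmean, stdev

def paired_bias(y_pred, y_true, tol=0.001):
    """Mean paired difference, its sample std, Cohen's d, and a direction label."""
    if len(y_pred) != len(y_true):
        raise ValueError("paired inputs must have the same length")
    diffs = [p - t for p, t in zip(y_pred, y_true)]
    if not diffs:
        return {"mean_bias": None, "std_bias": None,
                "effect_size": None, "direction": "none"}
    mean_bias = fmean(diffs)
    # Sample standard deviation needs at least two differences.
    std_bias = stdev(diffs) if len(diffs) > 1 else None
    # Cohen's d is undefined when the spread is zero or unavailable.
    effect_size = mean_bias / std_bias if std_bias else None
    if mean_bias > tol:
        direction = "positive"
    elif mean_bias < -tol:
        direction = "negative"
    else:
        direction = "none"
    return {"mean_bias": mean_bias, "std_bias": std_bias,
            "effect_size": effect_size, "direction": direction}

summary = paired_bias([0.8, 0.7, 0.9, 0.6, 0.85], [0.75, 0.72, 0.88, 0.65, 0.80])
print(summary["direction"], round(summary["mean_bias"], 4))
```

A positive mean difference that is small relative to the spread of the differences yields a small effect size, which is why the direction label alone should not be read as evidence of meaningful bias.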

Example

>>> result = systematic_bias([0.8, 0.7, 0.9], [0.7, 0.6, 0.8])
>>> round(result.mean_bias, 4)
0.1
>>> result.direction
'positive'

Source code in src/autorubric/metrics/distribution.py

def systematic_bias(
    y_pred: ArrayLike,
    y_true: ArrayLike,
    *,
    paired: bool = True,
    confidence: float = 0.95,
) -> BiasResult:
    """Detect and quantify systematic bias between predictions and ground truth.

    Systematic bias occurs when one set consistently scores higher or lower
    than another, independent of the item being rated.

    Args:
        y_pred: Predicted values (e.g., LLM scores).
        y_true: Ground truth values (e.g., human scores).
        paired: If True, assumes values are paired (same items).
            Uses paired t-test. If False, uses independent t-test.
        confidence: Confidence level for interval estimation.

    Returns:
        BiasResult with bias magnitude, direction, and statistical tests.

    Example:
        >>> result = systematic_bias([0.8, 0.7, 0.9], [0.7, 0.6, 0.8])
        >>> result.mean_bias
        0.1
        >>> result.direction
        'positive'
    """
    y_pred = _to_array(y_pred).astype(float)
    y_true = _to_array(y_true).astype(float)

    n = len(y_pred)
    if n == 0:
        # Nothing to compare ⇒ mean_bias is genuinely undefined (not 0.0).
        return BiasResult(
            mean_bias=None,
            std_bias=None,
            is_significant=False,
            p_value=None,
            direction="none",
            effect_size=None,
            ci=None,
            n_samples=0,
        )

    if n == 1:
        # mean_bias IS computable from a single pair (the one difference); the
        # dispersion-dependent statistics (std_bias, Cohen's d, CI, p-value) are not.
        if paired:
            _validate_same_length(y_pred, y_true)
            single = float(y_pred[0] - y_true[0])
        else:
            single = float(np.mean(y_pred) - np.mean(y_true))

        single_direction: Literal["positive", "negative", "none"]
        if single > 0.001:
            single_direction = "positive"
        elif single < -0.001:
            single_direction = "negative"
        else:
            single_direction = "none"

        return BiasResult(
            mean_bias=single,
            std_bias=None,
            is_significant=False,
            p_value=None,
            direction=single_direction,
            effect_size=None,
            ci=None,
            n_samples=1,
        )

    if paired:
        _validate_same_length(y_pred, y_true)
        differences = y_pred - y_true
        mean_bias = float(np.mean(differences))
        std_bias = float(np.std(differences, ddof=1))

        # Paired t-test
        t_stat, p_value = stats.ttest_rel(y_pred, y_true)

        # Cohen's d for paired samples (undefined when std_bias == 0).
        effect_size = mean_bias / std_bias if std_bias > 0 else None
    else:
        mean_bias = float(np.mean(y_pred) - np.mean(y_true))

        # Pooled standard deviation
        var_pred = np.var(y_pred, ddof=1)
        var_true = np.var(y_true, ddof=1)
        pooled_std = np.sqrt((var_pred + var_true) / 2)
        std_bias = float(pooled_std)

        # Independent t-test
        t_stat, p_value = stats.ttest_ind(y_pred, y_true)

        # Cohen's d (undefined when pooled std == 0).
        effect_size = mean_bias / std_bias if std_bias > 0 else None

    # Determine direction
    if mean_bias > 0.001:
        direction: Literal["positive", "negative", "none"] = "positive"
    elif mean_bias < -0.001:
        direction = "negative"
    else:
        direction = "none"

    # Confidence interval for mean bias. The t critical value is undefined when the
    # degrees of freedom drop below 1 (e.g. n=2 unpaired → df=0 → stats.t.ppf returns
    # NaN); the interval is then genuinely undefined → ci=None, never a NaN-valued CI.
    df = (n - 1) if paired else (n - 2)
    if df < 1:
        ci = None
    else:
        if paired:
            se = std_bias / np.sqrt(n)
        else:
            se = std_bias * np.sqrt(1 / len(y_pred) + 1 / len(y_true))

        t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df)
        ci = ConfidenceInterval(
            lower=float(mean_bias - t_crit * se),
            upper=float(mean_bias + t_crit * se),
            confidence=confidence,
            method="t",
        )

    return BiasResult(
        mean_bias=mean_bias,
        std_bias=std_bias,
        is_significant=p_value < 0.05,
        p_value=float(p_value),
        direction=direction,
        effect_size=float(effect_size) if effect_size is not None else None,
        ci=ci,
        n_samples=n,
    )

Result Types

None means genuinely undefined, never a fabricated 0.0

The numeric fields on these result types are typed float | None. A field is None when the statistic is genuinely undefined for the data at hand (it is never silently reported as a fake 0.0). In particular, BiasResult.mean_bias is None at n=0 and BiasResult.std_bias is None for n < 2; the per-criterion CorrelationResult.coefficient (Pearson/Spearman/Kendall) is None for a constant array or fewer than 3 samples. Guard before formatting (e.g. f"{x:+.4f}" if x is not None else "n/a").
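One way to apply that guard consistently is a small formatting helper. The name `format_optional` and its defaults are hypothetical and not part of the library.

```python
def format_optional(value, spec="+.4f", missing="n/a"):
    """Format a float | None field, falling back to a placeholder for None."""
    return missing if value is None else format(value, spec)

# Works the same way for any of the float | None fields on the result types.
print(format_optional(0.01))          # a defined statistic
print(format_optional(None))          # an undefined statistic
print(format_optional(0.575, ".3f"))  # custom format spec
```

Centralizing the check keeps report code from crashing with a TypeError when a statistic is undefined for small samples.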

EMDResult

Bases: BaseModel

Result of Earth Mover's Distance computation.

EMD measures the minimum "work" required to transform one distribution into another. Unlike correlation, it captures both shift (systematic bias) and shape differences (variance, skew).

A statistic is None when it is genuinely undefined, never a fake 0.0. With an empty distribution on either side (no data to transport), emd, mean_diff, std_diff, and bias_magnitude are all None; bias_direction stays "none" and interpretation is "insufficient data".

ATTRIBUTE	DESCRIPTION
`emd`	Earth Mover's Distance (0 to ~1 if normalized). `None` for empty input. TYPE: `float \| None`
`mean_diff`	Difference in means (dist1 - dist2). `None` for empty input. TYPE: `float \| None`
`std_diff`	Difference in standard deviations. `None` for empty input. TYPE: `float \| None`
`bias_direction`	Whether dist1 tends higher, lower, or same. TYPE: `Literal['higher', 'lower', 'none']`
`bias_magnitude`	Absolute mean difference. `None` for empty input. TYPE: `float \| None`
`interpretation`	Human-readable interpretation. TYPE: `str`

interpret_emd `staticmethod`

interpret_emd(emd: float) -> str

Human-readable interpretation of EMD value.

Source code in src/autorubric/metrics/_types.py

@staticmethod
def interpret_emd(emd: float) -> str:
    """Human-readable interpretation of EMD value."""
    if emd < 0.05:
        return "very similar"
    elif emd < 0.10:
        return "minor differences"
    elif emd < 0.20:
        return "moderate differences"
    else:
        return "substantial differences"

KSTestResult

Bases: BaseModel

Kolmogorov-Smirnov test result.

The KS test compares two distributions and tests whether they come from the same underlying distribution.

ATTRIBUTE	DESCRIPTION
`statistic`	KS test statistic. TYPE: `float`
`p_value`	P-value for the test. TYPE: `float`
`is_significant`	Whether the difference is significant (p < 0.05). TYPE: `bool`

DistributionResult

Bases: BaseModel

Score distribution statistics.

A statistic is None when it is genuinely undefined for the sample size, never a fake 0.0. At n=0 every stat is None. A single point still has a defined mean/min/max/median/q25/q75 (and iqr = q75 − q25 = 0.0, the true IQR of one point); std/variance need ≥2 points, skewness ≥3, kurtosis ≥4 — each None below its threshold. n is always the real count.

ATTRIBUTE	DESCRIPTION
`n`	Number of samples. TYPE: `int`
`mean`	Mean score. `None` when undefined (n = 0). TYPE: `float \| None`
`std`	Standard deviation. `None` when undefined (n < 2). TYPE: `float \| None`
`variance`	Variance. `None` when undefined (n < 2). TYPE: `float \| None`
`min`	Minimum score. `None` when undefined (n = 0). TYPE: `float \| None`
`max`	Maximum score. `None` when undefined (n = 0). TYPE: `float \| None`
`median`	Median score. `None` when undefined (n = 0). TYPE: `float \| None`
`q25`	25th percentile. `None` when undefined (n = 0). TYPE: `float \| None`
`q75`	75th percentile. `None` when undefined (n = 0). TYPE: `float \| None`
`iqr`	Interquartile range. `None` when undefined (n = 0); 0.0 for a single point. TYPE: `float \| None`
`skewness`	Skewness (measure of asymmetry). `None` when undefined (n < 3). TYPE: `float \| None`
`kurtosis`	Kurtosis (measure of tail heaviness). `None` when undefined (n < 4). TYPE: `float \| None`
`histogram`	Tuple of (counts, bin_edges). TYPE: `tuple[list[float], list[float]] \| None`
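The sample-size thresholds for mean, std, and variance can be sketched with the standard library, whose stdev and variance are the sample (n - 1) versions, matching the ddof=1 used in the source listing above. The function name `summarize` is hypothetical; skewness, kurtosis, and quartiles are omitted to keep the sketch short.

```python
from statistics import fmean, stdev, variance

def summarize(scores):
    """Return n, mean, std, and variance, with None where a value is undefined."""
    n = len(scores)
    return {
        "n": n,
        # The mean needs at least one point.
        "mean": fmean(scores) if n > 0 else None,
        # Sample std and variance divide by n - 1, so they need two points.
        "std": stdev(scores) if n > 1 else None,
        "variance": variance(scores) if n > 1 else None,
    }

for sample in ([], [0.5], [0.1, 0.5, 0.8, 0.9]):
    print(summarize(sample))
```

Returning None instead of 0.0 keeps "no spread" (a real measurement) distinguishable from "spread cannot be measured".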

BiasResult

Bases: BaseModel

Result from systematic bias analysis.

Systematic bias occurs when one rater consistently scores higher or lower than another, independent of the item being rated.

A statistic is None when it is genuinely undefined for the sample size, never a fake 0.0. mean_bias is the single pred−true difference at n=1 (computable) and is None only at n=0. std_bias is None when undefined (n<2). effect_size (Cohen's d) is None when std_bias is 0 or undefined.

ATTRIBUTE	DESCRIPTION
`mean_bias`	Mean difference (predictions - actuals). `None` only at n=0. TYPE: `float \| None`
`std_bias`	Standard deviation of differences. `None` when undefined (n < 2). TYPE: `float \| None`
`is_significant`	Whether the bias is statistically significant (p < 0.05). TYPE: `bool`
`p_value`	P-value from the t-test. `None` when undefined (n < 2). TYPE: `float \| None`
`direction`	Direction of bias ("positive" if predictions > actuals). TYPE: `Literal['positive', 'negative', 'none']`
`effect_size`	Cohen's d effect size. `None` when undefined (std_bias 0 or undefined). TYPE: `float \| None`
`ci`	Confidence interval for mean bias. `None` when undefined (n < 2, or fewer than 1 degree of freedom). TYPE: `ConfidenceInterval \| None`
`n_samples`	Number of samples. TYPE: `int`

interpret_effect_size `staticmethod`

interpret_effect_size(d: float) -> str

Interpret effect size using Cohen's guidelines.

Source code in src/autorubric/metrics/_types.py

@staticmethod
def interpret_effect_size(d: float) -> str:
    """Interpret effect size using Cohen's guidelines."""
    abs_d = abs(d)
    if abs_d < 0.2:
        return "negligible"
    elif abs_d < 0.5:
        return "small"
    elif abs_d < 0.8:
        return "medium"
    else:
        return "large"

References

He, J., Shi, J., Zhuo, T. Y., Treude, C., Sun, J., Xing, Z., Du, X., and Lo, D. (2025). LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead. arXiv:2510.24367.
