Statistical Methodology Specification
- Status:
Stable
- Version:
1.0
- Module:
flameiq.engine.statistics
- Spec file:
specs/statistical-methodology.rst
This document is the authoritative mathematical specification for all
statistical algorithms used in FlameIQ. Any change to an algorithm
described here requires a formal RFC. See RFC_PROCESS.md.
Note
All algorithms described here are deterministic: given identical inputs, they always produce identical outputs. No random seeds. No sampling. No time-dependent logic.
Overview
FlameIQ provides two regression detection modes:
Threshold-based (default) — direct percentage comparison against configured thresholds. See Threshold Algorithm Specification.
Statistical mode (optional) — adds the Mann-Whitney U test for significance testing. Use when benchmark noise is high enough that threshold crossings alone are unreliable.
Statistical mode is enabled via flameiq.yaml:
statistics:
  enabled: true
  confidence: 0.95
Mann-Whitney U Test
Background
The Mann-Whitney U test is a non-parametric, rank-based significance test that does not assume the data follow any particular distribution (for example, the normal). It is particularly well-suited for latency data, which is typically:
Right-skewed (long tail of slow requests)
Non-normal (bimodal distributions are common)
Sensitive to outliers
- Reference:
Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18(1), 50–60.
Hypothesis
- Null hypothesis H₀:
The baseline and current distributions are identical.
- Alternative hypothesis H₁:
The current distribution tends to produce larger values than the baseline distribution (one-tailed).
Using a one-tailed test reflects the engineering question: “Is this metric getting worse?” A two-tailed test would also flag improvements as significant, which is not useful for regression detection.
Significance decision
A regression is declared statistically significant if:
\[ p < \alpha \]
where:
\[ \alpha = 1 - \text{confidence\_level} \]
With the default confidence_level = 0.95, this gives
\(\alpha = 0.05\).
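The decision rule can be sketched in a few lines. This is a minimal illustration, assuming the comparison is the strict inequality p < α with α = 1 − confidence_level. The function name `is_significant` and the rounding of α are assumptions made for this example, not part of FlameIQ's API.

```python
def is_significant(p_value: float, confidence_level: float = 0.95) -> bool:
    """Return True when p_value is strictly below alpha = 1 - confidence_level.

    Hypothetical helper for illustration only.
    """
    if not 0.0 < confidence_level < 1.0:
        raise ValueError("confidence_level must lie strictly between 0 and 1")
    # Assumption: round alpha so that binary floating-point residue in
    # 1.0 - 0.95 (slightly above 0.05) does not let p == 0.05 pass.
    alpha = round(1.0 - confidence_level, 12)
    return p_value < alpha


print(is_significant(0.012))  # below alpha = 0.05
print(is_significant(0.200))  # above alpha = 0.05
```

Raising confidence_level lowers α, so fewer comparisons are flagged as significant.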
Implementation
FlameIQ uses scipy.stats.mannwhitneyu with alternative="greater":
import scipy.stats

_, p_value = scipy.stats.mannwhitneyu(
    current_samples,   # first argument = "greater" group
    baseline_samples,
    alternative="greater",
)
The p-value is cast to Python float (from numpy scalar) and rounded
to 6 decimal places for stable serialisation.
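To make the meaning of the one-tailed p-value concrete, the sketch below computes it exactly by enumeration for small samples without ties, using only the standard library. It illustrates the quantity being tested and is not FlameIQ's implementation, which delegates to scipy. The helper names `u_statistic` and `exact_p_greater` are hypothetical.

```python
from itertools import combinations
from math import comb


def u_statistic(first, second):
    """Count pairs (a, b) with a > b; tied pairs count one half."""
    return sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in first
        for b in second
    )


def exact_p_greater(current, baseline):
    """Exact one-sided p-value P(U >= U_observed) under H0.

    Under H0 every way of choosing which pooled values form the
    current group is equally likely. Only practical for small samples.
    """
    pooled = list(current) + list(baseline)
    if len(set(pooled)) != len(pooled):
        raise ValueError("exact enumeration assumes no tied values")
    observed = u_statistic(current, baseline)
    n_cur = len(current)
    indices = range(len(pooled))
    at_least = 0
    for chosen in combinations(indices, n_cur):
        chosen_set = set(chosen)
        group = [pooled[i] for i in chosen]
        rest = [pooled[i] for i in indices if i not in chosen_set]
        if u_statistic(group, rest) >= observed:
            at_least += 1
    p_value = at_least / comb(len(pooled), n_cur)
    return round(float(p_value), 6)  # 6 decimal places, as in this section


baseline = [10.0, 11.0, 12.0, 13.0]
current = [14.0, 15.0, 16.0, 17.0]
print(exact_p_greater(current, baseline))  # 1/70, printed as 0.014286
```

Here every current sample exceeds every baseline sample, so only one of the 70 possible groupings is at least as extreme as the observed one.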
Minimum samples
The test requires at least 3 samples per group (MINIMUM_SAMPLES = 3).
If either group has fewer samples, InsufficientSamplesError
is raised.
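A guard of this kind might look like the following sketch. The exception class here is a stand-in defined for the example, its `ValueError` base class is an assumption, and `check_sample_counts` is a hypothetical helper name.

```python
MINIMUM_SAMPLES = 3  # value stated in this specification


class InsufficientSamplesError(ValueError):
    """Stand-in for FlameIQ's exception of the same name (base class assumed)."""


def check_sample_counts(baseline, current, minimum=MINIMUM_SAMPLES):
    """Raise InsufficientSamplesError if either group has fewer than `minimum` samples."""
    for label, samples in (("baseline", baseline), ("current", current)):
        if len(samples) < minimum:
            raise InsufficientSamplesError(
                f"{label} has {len(samples)} samples; at least {minimum} required"
            )


check_sample_counts([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])  # passes silently
```

Checking before calling the test keeps the failure explicit instead of returning a p-value computed from too little data.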
Effect Size — Cohen’s d
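Cohen’s d quantifies the magnitude of the difference in units of pooled standard deviation, largely independent of sample size. It complements the p-value, which indicates only whether a difference is statistically detectable, not how large it is.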
Formula
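\[ d = \frac{\bar{x}_2 - \bar{x}_1}{s_p} \]
where \(\bar{x}_1\) and \(\bar{x}_2\) are the sample means of the baseline and current groups respectively, and \(s_p\) is the pooled standard deviation:
\[ s_p = \sqrt{\frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}} \]
Here \(n_1, n_2\) are the group sizes and \(s_1^2, s_2^2\) are the sample variances of the baseline and current groups.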
Sign convention: Positive d means current > baseline (a potential regression for higher-is-worse metrics).
If \(s_p = 0\) (all measurements identical), Cohen’s d is defined
as 0.0.
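The calculation can be sketched with the standard library alone. This example assumes the conventional pooled standard deviation (sample variances with the n − 1 denominator, weighted by degrees of freedom) and the sign convention stated above. The function name `cohens_d` is chosen for illustration and may not match FlameIQ's.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(baseline, current):
    """Cohen's d with pooled standard deviation; positive means current > baseline."""
    n1, n2 = len(baseline), len(current)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least 2 samples for a variance")
    # Sample variances use the n - 1 denominator.
    s1_sq, s2_sq = variance(baseline), variance(current)
    pooled_var = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    pooled_sd = sqrt(pooled_var)
    if pooled_sd == 0.0:
        return 0.0  # zero pooled deviation: d is defined as 0.0
    return (mean(current) - mean(baseline)) / pooled_sd


# Means 2.0 and 4.0, both variances 1.0, so d = (4 - 2) / 1 = 2.0
print(cohens_d([1.0, 2.0, 3.0], [3.0, 4.0, 5.0]))
```

Swapping the arguments flips the sign, which is why the argument order matters for regression detection.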
Verbal labels (Cohen 1988 conventions)
| \(\lvert d \rvert\) | Label |
|---|---|
| \(< 0.2\) | Negligible: practically meaningless |
| \(0.2 \leq \lvert d \rvert < 0.5\) | Small: noticeable in controlled experiments |
| \(0.5 \leq \lvert d \rvert < 0.8\) | Medium: practically significant |
| \(\geq 0.8\) | Large: practically very significant |
- Reference:
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
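The label thresholds translate directly into a lookup on the absolute value of d. The function name `effect_size_label` and the exact returned strings are assumptions for this sketch.

```python
def effect_size_label(d: float) -> str:
    """Map |d| to the verbal label from the Cohen (1988) conventions above."""
    magnitude = abs(d)  # labels depend on magnitude, not direction
    if magnitude < 0.2:
        return "Negligible"
    if magnitude < 0.5:
        return "Small"
    if magnitude < 0.8:
        return "Medium"
    return "Large"


for value in (0.1, -0.3, 0.5, 1.4):
    print(value, effect_size_label(value))
```

Each lower bound is inclusive, so a value of exactly 0.5 is labelled Medium and exactly 0.8 is labelled Large.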
Noise-resistant Median
Formula
Given n samples \(x_1, x_2, \ldots, x_n\):
Optionally discard the first k samples (warmup runs): \(x_{k+1}, x_{k+2}, \ldots, x_n\)
Sort the remaining samples: \(x_{(1)} \leq x_{(2)} \leq \ldots \leq x_{(n-k)}\)
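Compute the median:
\[ \tilde{x} = \begin{cases} x_{(m+1)} & \text{if } n-k \text{ is odd} \\ \tfrac{1}{2}\left(x_{(m)} + x_{(m+1)}\right) & \text{if } n-k \text{ is even} \end{cases} \]
where \(m = \lfloor (n-k) / 2 \rfloor\).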
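The three steps can be sketched as follows, assuming the conventional median definition (the middle value for an odd count, the mean of the two middle values for an even count). The name `median_after_warmup` and its `warmup` parameter are hypothetical; the signature of FlameIQ's own `noise_filter_median` is not given in this document.

```python
def median_after_warmup(samples, warmup=0):
    """Median of `samples` after discarding the first `warmup` values."""
    if warmup < 0:
        raise ValueError("warmup must be non-negative")
    kept = sorted(samples[warmup:])  # discard warmup runs, then sort
    if not kept:
        raise ValueError("no samples remain after discarding warmup runs")
    m = len(kept) // 2  # m = floor((n - k) / 2)
    if len(kept) % 2 == 1:
        return float(kept[m])  # odd count: 0-based kept[m] is x_(m+1)
    return (kept[m - 1] + kept[m]) / 2.0  # even count: mean of x_(m), x_(m+1)


# The first run is a cold-start outlier and is dropped as warmup.
print(median_after_warmup([100.0, 5.0, 3.0, 4.0], warmup=1))  # 4.0
```

Sorting a copy leaves the caller's list untouched, and the result depends only on the input values, in line with the determinism requirement.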
Use cases
The median is used by the rolling_median baseline strategy:
baseline_p95 = noise_filter_median(
    [snap.metrics.latency.p95 for snap in last_N_snapshots]
)
The median is preferred over the mean for benchmark data because it is:
Robust to outliers (a single spike does not distort it)
Stable under bimodal distributions
More representative of typical performance than the mean
Determinism guarantees
All algorithms in this specification satisfy:
| Property | Guarantee |
|---|---|
| No random state | No |
| No time-dependent logic | Timestamps are arguments, never |
| Explicit floating-point policy | |
| scipy determinism | |
| Sorted median | |
The FlameIQ test suite verifies determinism with 100-repetition runs:
results = [mann_whitney_compare(baseline, current) for _ in range(100)]
assert all(r.p_value == results[0].p_value for r in results)