flameiq.engine — Statistical Engine

Statistics

FlameIQ statistical comparison engine.

Provides optional statistical significance testing as a complement to the threshold-based comparator. Use it when benchmark noise is high and you need confidence that a detected regression is real rather than an artifact of that noise.

Supported methods (v1.0)

Mann-Whitney U test — non-parametric, distribution-free. Preferred for latency distributions, which are typically right-skewed.
Median-based noise filter — warmup-aware, stable central tendency.

All methods are deterministic for fixed inputs; none of them depends on a random seed.

Mathematical specification

See the Statistical Methodology Specification for full details.

flameiq.engine.statistics.MINIMUM_SAMPLES: int = 3: Minimum samples required for any statistical test.

class flameiq.engine.statistics.StatisticalResult(is_significant, p_value, effect_size, test_name, confidence_level)[source]

Bases: object

Result of a statistical significance test.

Parameters:

is_significant (bool)
p_value (float)
effect_size (float)
test_name (str)
confidence_level (float)

is_significant: bool: True if the difference is statistically significant.

p_value: float: The test p-value. Lower values indicate stronger evidence against the null hypothesis.

effect_size: float: Cohen’s d effect size. Positive = current > baseline.

test_name: str: The test used (e.g. "Mann-Whitney U").

confidence_level: float: The confidence level (default 0.95).

__init__(is_significant, p_value, effect_size, test_name, confidence_level)

Parameters:

is_significant (bool)
p_value (float)
effect_size (float)
test_name (str)
confidence_level (float)

Return type:

None

property alpha: float: Significance threshold α = 1 − confidence_level.

property effect_label: str: Cohen (1988) verbal label for the effect size magnitude.

flameiq.engine.statistics.mann_whitney_compare(baseline_samples, current_samples, confidence=0.95, minimum_samples=3)[source]

Compare two sample sets using the Mann-Whitney U test.

Tests the one-tailed hypothesis that the current distribution tends to produce larger values than the baseline distribution.

This is the preferred test for latency distributions, which are typically right-skewed and non-normal.

Parameters:

baseline_samples (list[float]) – Measurements from the baseline run.
current_samples (list[float]) – Measurements from the current run.
confidence (float) – Required confidence level. Default 0.95 (95%).
minimum_samples (int) – Minimum samples required in each group.

Returns:

A StatisticalResult with significance, p-value, and effect size.

Raises:

InsufficientSamplesError – If either sample set has fewer than minimum_samples entries.

Return type:

StatisticalResult

References

Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18(1), 50–60.
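To make the one-tailed hypothesis concrete, the following is a minimal pure-Python sketch of a Mann-Whitney U test for "current tends to be larger than baseline". It is not FlameIQ's implementation: the function name mann_whitney_one_tailed is hypothetical, the p-value uses the normal approximation with a continuity correction and no tie correction (which is rough for very small samples), and ValueError stands in for InsufficientSamplesError.

```python
import math


def mann_whitney_one_tailed(baseline, current, confidence=0.95, minimum_samples=3):
    """Return (U, p_value, is_significant) for H1: current > baseline."""
    n_b, n_c = len(baseline), len(current)
    if n_b < minimum_samples or n_c < minimum_samples:
        # FlameIQ documents InsufficientSamplesError; ValueError stands in here.
        raise ValueError("each group needs at least minimum_samples values")

    # U counts the (current, baseline) pairs where the current value is
    # larger; tied pairs contribute one half.
    u = 0.0
    for c in current:
        for b in baseline:
            if c > b:
                u += 1.0
            elif c == b:
                u += 0.5

    # Mean and standard deviation of U under the null hypothesis
    # (no tie correction in this sketch).
    mean_u = n_b * n_c / 2.0
    sd_u = math.sqrt(n_b * n_c * (n_b + n_c + 1) / 12.0)

    # Continuity-corrected z score, then the upper-tail normal probability.
    z = (u - mean_u - 0.5) / sd_u
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return u, p_value, p_value < (1.0 - confidence)


baseline = [10.0, 11.0, 12.0, 13.0, 14.0]
current = [20.0, 21.0, 22.0, 23.0, 24.0]
# Every current value exceeds every baseline value, so U is 5 * 5 = 25.
u, p_value, significant = mann_whitney_one_tailed(baseline, current)
print(u, p_value, significant)
```

The test is deterministic because it depends only on the relative ordering of the two sample sets, which is also why it makes no assumption about the shape of the latency distribution.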

flameiq.engine.statistics.noise_filter_median(samples, warmup=0)[source]

Compute a noise-resistant median, optionally discarding warmup runs.

Parameters:

samples (list[float]) – Raw measurement samples. Order matters only when warmup > 0, because the leading entries are the ones discarded.
warmup (int) – Number of leading samples to discard as warmup. Default 0.

Returns:

Median of the remaining samples.

Raises:

ValueError – If no samples remain after the warmup discard.

Return type:

float

Notes

The median is more robust than the mean for noisy benchmark data with occasional outlier spikes.

Examples:

noise_filter_median([1.0, 3.0, 5.0])             # → 3.0
noise_filter_median([99.0, 1.0, 3.0], warmup=1)  # → 2.0
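The documented behaviour can be reproduced in a few lines. This is a sketch under the assumption that the warmup discard is a simple slice of the leading entries; the name noise_filter_median_sketch is hypothetical and the real function may validate its arguments differently.

```python
import statistics


def noise_filter_median_sketch(samples, warmup=0):
    """Median of `samples` after dropping the first `warmup` entries."""
    remaining = samples[warmup:]
    if not remaining:
        raise ValueError("no samples remain after discarding warmup runs")
    return statistics.median(remaining)


# One outlier spike moves the mean a long way but barely moves the median.
noisy = [10.0, 10.2, 9.9, 10.1, 50.0]
print(statistics.mean(noisy))             # pulled upward by the 50.0 spike
print(noise_filter_median_sketch(noisy))  # stays at 10.1
```

For an even number of remaining samples, statistics.median returns the average of the two middle values, which matches the 2.0 result in the second example above.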

Baseline Strategies

FlameIQ baseline selection strategies.

A baseline strategy determines which historical snapshot is used as the reference point for a comparison run.

v1.0 supports three strategies:

last_successful: Use the most recently stored snapshot. Simple and predictable.
rolling_median: Compute median values over the last N snapshots, where N is rolling_window. More resistant to noise from a single outlier run.
tagged: Use a snapshot explicitly tagged with a release label (e.g. "v1.0.0"). Useful for comparing against a known-good release.

Configuration in flameiq.yaml:

baseline:
  strategy: rolling_median
  rolling_window: 5

class flameiq.engine.baseline.BaselineStrategy(*values)[source]

Bases: str, Enum

Supported baseline selection strategies.

LAST_SUCCESSFUL = 'last_successful'

ROLLING_MEDIAN = 'rolling_median'

TAGGED = 'tagged'

flameiq.engine.baseline.select_baseline(history, strategy=BaselineStrategy.LAST_SUCCESSFUL, rolling_window=5, tag=None)[source]

Select a baseline from the history using the configured strategy.

Parameters:

history (list[PerformanceSnapshot]) – List of stored snapshots, oldest first.
strategy (BaselineStrategy) – Which selection strategy to apply.
rolling_window (int) – Window size for ROLLING_MEDIAN strategy.
tag (str | None) – Required when strategy is TAGGED.

Returns:

The selected (or synthesised) baseline snapshot.

Raises:

BaselineError – If history is empty or a tagged snapshot cannot be found.

Return type:

PerformanceSnapshot
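The three strategies can be sketched as follows. This is an illustration under stated assumptions, not the FlameIQ source: Snapshot is a hypothetical stand-in for PerformanceSnapshot whose metrics and tag fields are invented for this example, LookupError stands in for BaselineError, and the choices to prefer the newest matching tag and to take medians only over metrics present in every windowed snapshot are this sketch's own.

```python
import statistics
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Strategy(str, Enum):
    LAST_SUCCESSFUL = "last_successful"
    ROLLING_MEDIAN = "rolling_median"
    TAGGED = "tagged"


@dataclass
class Snapshot:
    # Hypothetical stand-in for PerformanceSnapshot; field names are assumed.
    metrics: dict
    tag: Optional[str] = None


def select_baseline_sketch(history, strategy=Strategy.LAST_SUCCESSFUL,
                           rolling_window=5, tag=None):
    """Pick or synthesise a baseline from `history` (oldest first)."""
    strategy = Strategy(strategy)  # also accepts plain config strings
    if not history:
        raise LookupError("history is empty")  # BaselineError in FlameIQ

    if strategy is Strategy.LAST_SUCCESSFUL:
        return history[-1]

    if strategy is Strategy.TAGGED:
        if tag is None:
            raise LookupError("tag is required for the tagged strategy")
        # Search newest first so the latest snapshot with the tag wins.
        for snapshot in reversed(history):
            if snapshot.tag == tag:
                return snapshot
        raise LookupError(f"no snapshot tagged {tag!r}")

    # ROLLING_MEDIAN: synthesise a snapshot from per-metric medians over
    # the last `rolling_window` snapshots (assumed to be >= 1).
    window = history[-rolling_window:]
    names = set.intersection(*(set(s.metrics) for s in window))
    medians = {
        name: statistics.median(s.metrics[name] for s in window)
        for name in names
    }
    return Snapshot(metrics=medians)


history = [
    Snapshot({"latency_ms": 10.0}),
    Snapshot({"latency_ms": 50.0}, tag="v1.0.0"),  # one noisy run
    Snapshot({"latency_ms": 12.0}),
]
rolling = select_baseline_sketch(history, "rolling_median", rolling_window=3)
print(rolling.metrics)
```

Because the enum derives from str, the strategy value read from flameiq.yaml can be passed straight to the enum constructor, as the first line of the function does.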