flameiq.engine — Statistical Engine
Statistics
FlameIQ statistical comparison engine.
Provides optional statistical significance testing as a complement to the threshold-based comparator. Use when benchmark noise is high and you need confidence that a detected regression is real rather than noise.
Supported methods (v1.0)
Mann-Whitney U test — non-parametric, distribution-free. Preferred for latency distributions, which are typically right-skewed.
Median-based noise filter — warmup-aware, stable central tendency.
All methods are deterministic given fixed inputs. No random seeds.
Mathematical specification
See Statistical Methodology Specification for the full specification.
- flameiq.engine.statistics.MINIMUM_SAMPLES: int = 3
Minimum samples required for any statistical test.
- class flameiq.engine.statistics.StatisticalResult(is_significant, p_value, effect_size, test_name, confidence_level)[source]
Bases: object
Result of a statistical significance test.
- Parameters:
- __init__(is_significant, p_value, effect_size, test_name, confidence_level)
- flameiq.engine.statistics.mann_whitney_compare(baseline_samples, current_samples, confidence=0.95, minimum_samples=3)[source]
Compare two sample sets using the Mann-Whitney U test.
Tests the one-tailed hypothesis that the current distribution tends to produce larger values than the baseline distribution.
This is the preferred test for latency distributions, which are typically right-skewed and non-normal.
- Parameters:
- Returns:
A StatisticalResult with significance, p-value, and effect size.
- Raises:
InsufficientSamplesError – If either sample set has fewer than minimum_samples entries.
- Return type:
References
Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18(1), 50–60.
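The one-tailed test described above can be illustrated with a minimal standard-library sketch. This is not FlameIQ's implementation: the function names `average_ranks` and `mann_whitney_one_tailed` are hypothetical, the p-value uses the normal approximation without tie or continuity correction, and the effect size shown (U divided by the number of pairs) is an assumed choice, since this page does not say which effect size the engine reports.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_one_tailed(baseline, current, confidence=0.95):
    """Test whether `current` tends to produce larger values than `baseline`.

    Returns (is_significant, p_value, effect_size). Hypothetical helper,
    normal approximation only, so it is rough for very small samples.
    """
    n1, n2 = len(baseline), len(current)
    ranks = average_ranks(list(baseline) + list(current))
    # U for the current group: pairs where current > baseline, ties count 0.5.
    u_current = sum(ranks[n1:]) - n2 * (n2 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_current - mean_u) / sd_u
    p_value = 1.0 - NormalDist().cdf(z)  # upper tail: "current is larger"
    effect_size = u_current / (n1 * n2)  # 0.5 means no shift, 1.0 full separation
    return p_value < (1.0 - confidence), p_value, effect_size
```

Because the computation depends only on the ranks of the inputs, the result is deterministic for fixed samples, and a single extreme latency spike moves the statistic no more than any other largest value would.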
- flameiq.engine.statistics.noise_filter_median(samples, warmup=0)[source]
Compute a noise-resistant median, optionally discarding warmup runs.
- Parameters:
- Returns:
Median of the remaining samples.
- Raises:
ValueError – If no samples remain after the warmup discard.
- Return type:
Notes
The median is more robust than the mean for noisy benchmark data with occasional outlier spikes.
Examples:
noise_filter_median([1.0, 3.0, 5.0]) # → 3.0
noise_filter_median([99.0, 1.0, 3.0], warmup=1) # → 2.0
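The behaviour shown in these examples can be reproduced with a short sketch. The name `median_after_warmup` is hypothetical, and the sketch assumes that `warmup` counts runs discarded from the start of the sample list, which is consistent with the second example above but is not stated explicitly on this page.

```python
from statistics import mean, median


def median_after_warmup(samples, warmup=0):
    """Drop the first `warmup` samples, then return the median of the rest."""
    remaining = list(samples)[warmup:]
    if not remaining:
        raise ValueError("no samples remain after the warmup discard")
    return median(remaining)


# One outlier spike shifts the mean far more than the median.
noisy = [10.0, 10.0, 11.0, 10.0, 100.0]
spike_mean = mean(noisy)                   # pulled up by the 100.0 spike
spike_median = median_after_warmup(noisy)  # stays at a typical value
```

For the `noisy` list the mean is 28.2 while the median is 10.0, which is the robustness property the note above refers to.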
Baseline Strategies
FlameIQ baseline selection strategies.
A baseline strategy determines which historical snapshot is used as the reference point for a comparison run.
v1.0 supports three strategies:
last_successful – Use the most recently stored snapshot. Simple and predictable.
rolling_median – Compute median values over the last N snapshots. More resistant to noise from a single outlier run.
tagged – Use a snapshot explicitly tagged with a release label (e.g. "v1.0.0"). Useful for comparing against a known-good release.
Configuration in flameiq.yaml:
baseline:
  strategy: rolling_median
  rolling_window: 5
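To make the three strategies concrete, here is a minimal sketch of how selection over a history list might work. Everything in it is an assumption for illustration: `Snapshot` is a hypothetical stand-in for `PerformanceSnapshot` (whose fields are not shown on this page), `pick_baseline` is not the library's `select_baseline`, `ValueError` stands in for `BaselineError`, and the rolling median is taken only over metrics present in every snapshot in the window.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Snapshot:
    """Hypothetical stand-in for PerformanceSnapshot."""
    metrics: dict
    tag: Optional[str] = None


def pick_baseline(history, strategy="last_successful", rolling_window=5, tag=None):
    """Pick a baseline from `history`, which is ordered oldest first."""
    if not history:
        raise ValueError("history is empty")
    if strategy == "last_successful":
        return history[-1]
    if strategy == "rolling_median":
        if rolling_window < 1:
            raise ValueError("rolling_window must be at least 1")
        window = history[-rolling_window:]
        # Keep only metrics that every snapshot in the window reports.
        names = set(window[0].metrics)
        for snap in window[1:]:
            names &= set(snap.metrics)
        # Synthesise a snapshot holding the per-metric median.
        return Snapshot({n: median(s.metrics[n] for s in window) for n in sorted(names)})
    if strategy == "tagged":
        if tag is None:
            raise ValueError("tag is required for the tagged strategy")
        for snap in reversed(history):  # newest match wins
            if snap.tag == tag:
                return snap
        raise ValueError(f"no snapshot tagged {tag!r}")
    raise ValueError(f"unknown strategy {strategy!r}")
```

With a history whose latencies are 10.0, 50.0 and 12.0, a rolling window of 3 yields 12.0, so the single 50.0 outlier run does not distort the baseline.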
- class flameiq.engine.baseline.BaselineStrategy(*values)[source]
Supported baseline selection strategies.
- LAST_SUCCESSFUL = 'last_successful'
- ROLLING_MEDIAN = 'rolling_median'
- TAGGED = 'tagged'
- flameiq.engine.baseline.select_baseline(history, strategy=BaselineStrategy.LAST_SUCCESSFUL, rolling_window=5, tag=None)[source]
Select a baseline from the history using the configured strategy.
- Parameters:
history (list[PerformanceSnapshot]) – List of stored snapshots, oldest first.
strategy (BaselineStrategy) – Which selection strategy to apply.
rolling_window (int) – Window size for the ROLLING_MEDIAN strategy.
tag (str | None) – Required when strategy is TAGGED.
- Returns:
The selected (or synthesised) baseline snapshot.
- Raises:
BaselineError – If history is empty or a tagged snapshot cannot be found.
- Return type: