Testing Standards

FlameIQ has high testing standards. The core comparison engine is mathematically critical, and bugs here cause missed regressions in production systems at scale.

Test organisation

tests/
├── unit/
│   ├── test_core/          # comparator, thresholds, models
│   ├── test_engine/        # statistics, baseline strategies
│   ├── test_schema/        # schema models, serialisation
│   ├── test_storage/       # baseline store
│   └── test_providers/     # json, pytest-benchmark providers
├── statistical/            # Algorithm correctness on known distributions
├── integration/            # Full end-to-end pipeline tests
├── e2e/                    # CLI invocation tests
└── fixtures/               # Shared test data and snapshot builders

Coverage requirements

Module	Minimum coverage	Notes
`core/`	95%	Critical path. All branches must be tested.
`schema/`	100%	Immutable contract. Every field validated.
`engine/`	90%	All statistical paths tested with known inputs.
`storage/`	85%	Including corruption recovery and empty history.
`providers/`	80%	Valid, invalid, missing file for each provider.
`cli/`	75%	Integration tests count toward CLI coverage.

Determinism tests

Any function in core/ or engine/ that is advertised as deterministic must have a determinism test:

@pytest.mark.determinism
def test_compare_snapshots_is_deterministic():
    results = [
        compare_snapshots(BASELINE, CURRENT)
        for _ in range(100)
    ]
    first = results[0]
    for r in results[1:]:
        assert r.status == first.status
        assert r.exit_code == first.exit_code
        for d1, d2 in zip(r.diffs, first.diffs):
            assert d1.change_percent == d2.change_percent

Statistical tests

Statistical tests use analytically known distributions — never random data:

@pytest.mark.statistical
@pytest.mark.parametrize("p95_current,threshold,expected", [
    (105.0, "10%", RegressionStatus.PASS),       # +5.0% < threshold
    (110.0, "10%", RegressionStatus.PASS),       # +10.0% = threshold (pass)
    (110.1, "10%", RegressionStatus.REGRESSION), # +10.1% > threshold
    (115.0, "10%", RegressionStatus.REGRESSION), # +15.0% > threshold
])
def test_latency_threshold(p95_current, threshold, expected):
    baseline = make_snapshot(p95=100.0)
    current  = make_snapshot(p95=p95_current)
    result   = compare_snapshots(
        baseline, current,
        threshold_config={"latency.p95": threshold},
    )
    assert result.status == expected

Fixture conventions

Use tmp_path (pytest built-in) for all filesystem tests — never write to the actual project directory
Use make_snapshot() from tests/fixtures/ for all snapshot construction — never construct snapshots inline in tests
Mark all tests with the appropriate marker: unit, integration, statistical, determinism