Flaky Test Management

Flaky tests — tests that sometimes pass and sometimes fail without code changes — create noise in CI. DAGZ detects and manages them automatically.

Test Selection and Flakiness

Flaky test handling is a useful side benefit of test selection. If a code change doesn't affect a flaky test, the test isn't selected, so it doesn't run and can't clutter the job's results.

If a code change did indeed cause a flaky test to run, the committer is more likely to have context to investigate and fix it — instead of just ignoring it as "flaky" noise.

DAGZ considers a test flaky when it fails on the first attempt but passes on a retry within the same job.

Retry Mechanism

When a test fails, DAGZ retries it in two phases:

  1. Immediate retry — the test is re-run right away, without teardown. This catches transient failures (timing issues, network blips) with minimal overhead.
  2. Deferred retry — at the end of the session, failed tests are retried again with full setup/teardown. This catches failures caused by leaked state from other tests.
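
The two phases above can be sketched roughly as follows. This is a minimal illustration, not DAGZ's actual implementation: `run_with_retries` and `make_outcomes` are hypothetical names, each test is modeled as a callable that returns True on a pass, and the sketch ignores any limits on retries.

```python
from typing import Callable, Dict, List


def run_with_retries(tests: Dict[str, Callable[[], bool]]) -> Dict[str, List[bool]]:
    """Run each test, retrying failures immediately and then once at session end.

    Returns the attempt outcomes (True = pass) per test name, in order.
    """
    attempts: Dict[str, List[bool]] = {}
    deferred: List[str] = []

    for name, test in tests.items():
        attempts[name] = [test()]          # first attempt
        if attempts[name][-1]:
            continue
        attempts[name].append(test())      # phase 1: immediate retry, no teardown
        if not attempts[name][-1]:
            deferred.append(name)          # still failing: queue for phase 2

    for name in deferred:
        # Phase 2: deferred retry at the end of the session. A real runner
        # would perform full setup/teardown before this call.
        attempts[name].append(tests[name]())

    return attempts


def make_outcomes(outcomes: List[bool]) -> Callable[[], bool]:
    """Hypothetical helper: a fake test that yields scripted outcomes in order."""
    remaining = iter(outcomes)
    return lambda: next(remaining)


results = run_with_retries({
    "stable": make_outcomes([True]),
    "transient": make_outcomes([False, True]),
    "leaky_state": make_outcomes([False, False, True]),
    "broken": make_outcomes([False, False, False]),
})
```

In this model, `transient` recovers on the immediate retry, while `leaky_state` only recovers on the deferred retry.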

Safety limits

  • Max 10% of total tests — if more than 10% of tests are failing, retries stop. The failures are probably real, not flaky.
  • Skip long-running tests — tests that take a disproportionate amount of time are not retried. The cost isn't worth it.
  • Up to 2 retries per test — one immediate, one deferred.
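
As an illustration of how these limits combine, here is a small sketch of a retry gate. The function name and parameters are hypothetical. The 10% cap and the two-retry cap come from the list above, but the `slow_factor` threshold is purely an assumption, since the text does not say what counts as a disproportionate amount of time.

```python
def should_retry(
    failed_count: int,
    total_tests: int,
    duration_s: float,
    median_duration_s: float,
    retries_so_far: int,
    max_failed_percent: int = 10,   # from the text: more than 10% failing stops retries
    slow_factor: float = 10.0,      # ASSUMPTION: "disproportionate" = 10x the median
    max_retries: int = 2,           # from the text: one immediate, one deferred
) -> bool:
    """Return True if a failed test is still eligible for another retry."""
    # Too many failures overall: they are probably real, so stop retrying.
    if failed_count * 100 > max_failed_percent * total_tests:
        return False
    # Long-running tests are not worth the cost of a retry.
    if duration_s > slow_factor * median_duration_s:
        return False
    # Each test gets at most max_retries retries.
    return retries_so_far < max_retries
```

Integer arithmetic is used for the percentage check so that exactly 10% failing does not trip the limit because of floating-point rounding.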

Result Classification

Result      Meaning                         Counts as
OK          Passed on first attempt         Passed
OK_FLAKY    Failed, then passed on retry    Passed (flagged as flaky)
FAILED      Failed on all attempts          Failed

OK_FLAKY tests don't block the pipeline, but they're tracked separately.
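
The classification rules can be expressed as a short function over a test's attempt history. This is a sketch under the assumption that each attempt is recorded as a boolean (True = pass). The `Result` enum mirrors the result names above, while `classify` and `blocks_pipeline` are hypothetical names.

```python
from enum import Enum
from typing import Sequence


class Result(Enum):
    OK = "OK"
    OK_FLAKY = "OK_FLAKY"
    FAILED = "FAILED"


def classify(attempts: Sequence[bool]) -> Result:
    """Map a test's ordered attempt outcomes to a result."""
    if not attempts:
        raise ValueError("a test needs at least one attempt to be classified")
    if attempts[0]:
        return Result.OK            # passed on the first attempt
    if any(attempts[1:]):
        return Result.OK_FLAKY      # failed first, then passed on a retry
    return Result.FAILED            # failed on every attempt


def blocks_pipeline(result: Result) -> bool:
    """Only FAILED blocks the pipeline; OK_FLAKY counts as passed."""
    return result is Result.FAILED
```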

Dashboard

The dashboard provides:

  • Top flaky tests — ranked by flaky rate, the tests that fail-then-pass most often
  • Historical execution view — full logs and failure trend for each test over time
  • Per-job flaky count — each job shows how many tests were flaky
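
To make the ranking concrete, here is one plausible way to compute a flaky rate from historical results. The definition used here (OK_FLAKY runs divided by total runs of that test) is an assumption, as are the function name and the `(test name, result)` input shape. The dashboard's actual formula is not documented in this section.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_flaky(history: Iterable[Tuple[str, str]], limit: int = 10) -> List[Tuple[str, float]]:
    """Rank tests by flaky rate, highest first.

    ASSUMPTION: flaky rate = OK_FLAKY runs / total runs of that test.
    """
    runs: Counter = Counter()
    flaky: Counter = Counter()
    for name, result in history:
        runs[name] += 1
        if result == "OK_FLAKY":
            flaky[name] += 1

    # Tests that were never flaky are left out of the ranking.
    rates = {name: flaky[name] / runs[name] for name in runs if flaky[name]}
    # Sort by rate descending, then by name for a stable order on ties.
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))[:limit]


history = [
    ("test_a", "OK"), ("test_a", "OK_FLAKY"), ("test_a", "OK"), ("test_a", "OK"),
    ("test_b", "OK_FLAKY"), ("test_b", "OK_FLAKY"), ("test_b", "FAILED"),
    ("test_c", "OK"),
]
ranking = top_flaky(history)
```

With this sample history, `test_b` (flaky in 2 of 3 runs) ranks above `test_a` (flaky in 1 of 4 runs), and `test_c` does not appear.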

pytest Integration

DAGZ respects pytest markers:

  • @pytest.mark.xfail — expected failures are tracked correctly (XFAIL when they fail as expected, XPASS when they unexpectedly pass)
  • XPASS results are usually grouped with flaky tests: they indicate tests that are expected to fail but are currently passing, which can be a sign of flakiness or a fix in progress.
  • Retried tests are assigned to any available worker, so they get a clean process — which often resolves state-related flakiness
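
For reference, the XFAIL and XPASS outcomes mentioned above come from standard pytest behavior. The following test module is a minimal sketch: `parse_port` and both tests are hypothetical, and only `pytest.mark.xfail` and `pytest.raises` are real pytest APIs. How DAGZ surfaces these outcomes is not shown here.

```python
import pytest


def parse_port(value: str) -> int:
    """Hypothetical function under test: parse a TCP port number."""
    return int(value)


@pytest.mark.xfail(reason="hypothetical known bug: no range check yet")
def test_parse_port_rejects_out_of_range():
    # parse_port does not raise for 70000, so this test fails as expected
    # and pytest reports it as XFAIL.
    with pytest.raises(ValueError):
        parse_port("70000")


@pytest.mark.xfail(reason="hypothetical stale marker left after a fix")
def test_parse_port_accepts_plain_number():
    # This test passes despite the marker, so pytest reports it as XPASS.
    assert parse_port("8080") == 8080
```

With pytest's default non-strict setting, the XPASS result does not fail the run. Passing `strict=True` to the marker turns an unexpected pass into a failure.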