Flaky Tests Management

DAGZ and Flaky Tests

Flaky tests pass on some runs and fail on others, typically because of timing, external services, or state left behind by earlier tests.

DAGZ affects flaky tests as follows:

Less noise. If a code change doesn't affect a flaky test, it won't run and won't clutter the results.
Automatic retries. When a test fails, DAGZ retries it automatically, subject to the safety limits described below.
Rich context for investigation. DAGZ provides previous runs and their logs, so humans and agents can investigate and fix failures quickly.
Discovery runs. DAGZ can run an exhaustive "discovery mode" which uncovers hidden dependencies between tests.

DAGZ also surfaces non-obvious issues:

Dynamic test ordering. DAGZ uses the default test order as a starting point, but tests may be dynamically scheduled on different workers.
Hidden dependencies. If a test is affected by tests that run before it, DAGZ will surface the problem more often.

Smart retries, rich context and discovery runs all help with efficient investigation and fixing of these issues.

Formally, DAGZ flags a test as flaky when it fails on the first attempt but passes on a retry within the same job:

| Result | Meaning | Impact |
| --- | --- | --- |
| `OK` | Passed on first attempt | - |
| `FAILED` | Failed on all attempts | Fails the job |
| `OK_FLAKY` | Failed, then passed on retry | - |
| `XFAIL` | `pytest.mark.xfail` test which failed | - |
| `XPASS` | `pytest.mark.xfail` test which passed. Considered flaky in UI and reports. | - |
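
The mapping in the table can be sketched as a small function. This is an illustrative sketch only, not DAGZ's implementation: the names `classify`, `is_flaky`, and `fails_job` are hypothetical, and the handling of `xfail` tests across retries is an assumption (the sketch looks only at their first attempt).

```python
from typing import Sequence


def classify(attempts: Sequence[bool], xfail: bool = False) -> str:
    """Map per-attempt outcomes (True = passed) to a result label.

    `attempts[0]` is the first run; later entries are retries.
    """
    if not attempts:
        raise ValueError("at least one attempt is required")
    if xfail:
        # ASSUMPTION: xfail tests are judged on their first attempt only.
        return "XPASS" if attempts[0] else "XFAIL"
    if attempts[0]:
        return "OK"
    if any(attempts[1:]):
        # Failed first, then passed on some retry.
        return "OK_FLAKY"
    return "FAILED"


def is_flaky(result: str) -> bool:
    """OK_FLAKY is flaky by definition; XPASS is treated as flaky in UI and reports."""
    return result in {"OK_FLAKY", "XPASS"}


def fails_job(result: str) -> bool:
    """Only FAILED fails the job, per the Impact column."""
    return result == "FAILED"
```

For example, `classify([False, True])` yields `OK_FLAKY`, which counts as flaky but does not fail the job.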

When a test fails, DAGZ retries it in two phases:

Immediate retry. The test is re-run immediately, without teardown. This catches transient failures (timing issues, network jitter) with minimal overhead.
Tail run. At the end of the session, failed tests are retried again with full setup/teardown. This catches failures caused by leaked state from other tests.
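
The two phases can be pictured with a toy scheduler. This is a simplified sketch under stated assumptions, not DAGZ's scheduler: `run_session` and `scripted` are hypothetical names, each test is modeled as a callable returning pass/fail, and setup/teardown is not modeled at all.

```python
from typing import Callable, Dict, List, Tuple

Attempt = Tuple[str, bool]  # (phase, passed)


def run_session(tests: Dict[str, Callable[[], bool]]) -> Dict[str, List[Attempt]]:
    """Run each test once, retry failures immediately, then retry the rest at the end."""
    log: Dict[str, List[Attempt]] = {name: [] for name in tests}
    tail: List[str] = []

    for name, run in tests.items():
        passed = run()
        log[name].append(("first", passed))
        if passed:
            continue
        # Phase 1: immediate retry (a real runner would skip teardown here).
        passed = run()
        log[name].append(("immediate", passed))
        if not passed:
            tail.append(name)

    # Phase 2: tail run at the end of the session
    # (a real runner would do full setup/teardown here).
    for name in tail:
        log[name].append(("tail_run", tests[name]()))
    return log


def scripted(*outcomes: bool) -> Callable[[], bool]:
    """Build a fake test that returns the given outcomes in order."""
    it = iter(outcomes)
    return lambda: next(it)
```

A test scripted as `scripted(False, False, True)` fails its first run and its immediate retry, then passes in the tail run.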

Retries are subject to these safety limits:

Max 10% of total tests. If more than 10% of tests are failing, retries stop. The failures are probably real, not flaky.
Skip long-running tests. Tests that take a disproportionate amount of time are not retried. The cost isn't worth it.
Up to 2 retries per test. The first is immediate; the second is deferred to the tail run.
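
One way to read these limits is as a single retry decision. The sketch below is illustrative and not DAGZ's code: `RetryPolicy` and `should_retry` are hypothetical names, and the `slow_test_factor` threshold is an assumption, since the text does not define what counts as a "disproportionate" duration.

```python
from dataclasses import dataclass


@dataclass
class RetryPolicy:
    max_failing_fraction: float = 0.10  # from the text: stop above 10% failing
    max_retries_per_test: int = 2       # from the text: immediate + tail run
    slow_test_factor: float = 10.0      # ASSUMPTION: "disproportionate" = 10x median


def should_retry(
    policy: RetryPolicy,
    *,
    failed_count: int,
    total_tests: int,
    retries_so_far: int,
    duration_s: float,
    median_duration_s: float,
) -> bool:
    """Return True if a failed test is still eligible for another retry."""
    if total_tests <= 0:
        return False
    # Too many failures: they are probably real, so stop retrying.
    if failed_count / total_tests > policy.max_failing_fraction:
        return False
    # Per-test retry budget is exhausted.
    if retries_so_far >= policy.max_retries_per_test:
        return False
    # Long-running tests are not worth the retry cost.
    if duration_s > policy.slow_test_factor * median_duration_s:
        return False
    return True
```

With the defaults above, a test that failed in a session where 3 of 100 tests are failing is retried, while the same test in a session with 11 of 100 failing is not.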

You can change DAGZ's default retry behavior with these config options:

In `.dagz/config.yaml`:
- `flaky_tail_run` (default `true`): whether to retry failed tests at the end of the session.

Pytest markers:
- `pytest.dagz_flaky(tail_run=True/False)`: overrides the default tail-run setting for a specific test.
- `pytest.mark.flaky(reruns=N)`: DAGZ respects the flaky marker from pytest-rerunfailures and retries a failed test up to N times.

DAGZ provides rich context for fixing a test. You can generate it with zb span-logs or download it from the dashboard.

zb span-logs <test_span_id> [--remote/-r]   # Use --remote/-r to query DAGZ's project server instead of the local daemon.

In the dashboard, click on a test to see its historical execution view, which includes logs and failure trends over time.

The dashboard provides: