Flaky Tests Management
DAGZ and Flaky Tests
Flaky tests are tests that sometimes fail and sometimes pass, affected by factors like timing, external services or previous tests.
DAGZ affects flaky tests as follows:
- Less noise. If a code change doesn't affect a flaky test, it won't run and won't clutter the results.
- Automatic retries. When a test fails, DAGZ retries it automatically, within the safety limits described under Safety Limits below.
- Rich context for investigation. DAGZ keeps previous runs and their logs, so humans and agents can quickly investigate and fix a flaky test.
- Discovery runs. DAGZ can run an exhaustive "discovery mode" which uncovers hidden dependencies between tests.
DAGZ also surfaces non-obvious issues:
- Dynamic test ordering. DAGZ uses the default test order as a starting point. Tests may be dynamically scheduled on different workers.
- Hidden dependencies. If a test is affected by the tests that run before it, DAGZ surfaces the problem more often, because the set and order of preceding tests can change from run to run.
Smart retries, rich context and discovery runs all help with efficient investigation and fixing of these issues.
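To make "hidden dependency" concrete, here is a minimal, hypothetical pair of tests. The names `_cache`, `test_warm_cache`, `test_reads_cache`, and `run_in_order` are illustrative and not part of DAGZ. The second test passes only when the first has already run in the same process, so it can start failing once tests are reordered or scheduled on another worker.

```python
# Hypothetical example: module-level state leaks from one test to the next.
_cache = {}


def test_warm_cache():
    _cache["user"] = "alice"
    assert _cache["user"] == "alice"


def test_reads_cache():
    # Implicitly relies on test_warm_cache having run first.
    assert _cache.get("user") == "alice"


def run_in_order(tests):
    """Run tests against a fresh cache and return {test name: passed}."""
    _cache.clear()
    results = {}
    for test in tests:
        try:
            test()
            results[test.__name__] = True
        except AssertionError:
            results[test.__name__] = False
    return results
```

Running the pair in the default order passes both tests, while swapping the order makes `test_reads_cache` fail. The usual fix is to build the required state inside the test or in a fixture rather than relying on an earlier test.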
Result Types
Formally, DAGZ flags a test as flaky when it fails on the first attempt but passes on a retry within the same job:
| Result | Meaning | Impact |
|---|---|---|
| OK | Passed on first attempt | - |
| FAILED | Failed on all attempts | Fails the job |
| OK_FLAKY | Failed, then passed on retry | - |
| XFAIL | pytest.mark.xfail test which failed | - |
| XPASS | pytest.mark.xfail test which passed. Considered flaky in UI and reports. | - |
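The table can be read as a small decision procedure. The sketch below is an illustration of that reading, not DAGZ's implementation: the function names `classify` and `fails_job` are hypothetical, and it assumes that an xfail-marked test is judged on its first attempt only.

```python
def classify(attempts, xfail=False):
    """Map a test's attempt outcomes (True = passed) to a result label.

    `attempts` is ordered: the first run, then any retries in the same job.
    """
    if not attempts:
        raise ValueError("at least one attempt is required")
    if xfail:
        # Assumption: xfail-marked tests are judged on the first attempt.
        return "XPASS" if attempts[0] else "XFAIL"
    if attempts[0]:
        return "OK"
    if any(attempts[1:]):
        return "OK_FLAKY"  # failed first, passed on some retry
    return "FAILED"


def fails_job(label):
    """Per the table, only FAILED fails the job."""
    return label == "FAILED"
```

For example, `classify([False, True])` yields `"OK_FLAKY"`, which does not fail the job.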
Retry Mechanism
When a test fails, DAGZ retries it in two phases:
- Immediate retry. The test is re-run immediately, without teardown. This catches transient failures (timing issues, network jitter) with minimal overhead.
- Tail run. At the end of the session, failed tests are retried again with full setup/teardown. This catches failures caused by leaked state from other tests.
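The two phases can be sketched as a small simulation. This is a simplified model under stated assumptions, not DAGZ's scheduler: `run_session` and `scripted` are hypothetical names, each test is modeled as a callable returning True on pass, and the full setup/teardown of the tail run is only indicated by a comment.

```python
def run_session(tests):
    """Run each test, retrying failures in two phases.

    `tests` maps a test name to a zero-argument callable that returns
    True on pass and False on failure.
    """
    results = {}
    deferred = []  # failed both the first attempt and the immediate retry

    for name, run in tests.items():
        if run():
            results[name] = "OK"
        elif run():  # phase 1: immediate retry, no teardown in between
            results[name] = "OK_FLAKY"
        else:
            deferred.append(name)

    # Phase 2: tail run at the end of the session. A real runner would redo
    # full setup/teardown here; this sketch simply calls the test again.
    for name in deferred:
        results[name] = "OK_FLAKY" if tests[name]() else "FAILED"

    return results


def scripted(outcomes):
    """Demo helper: a fake test that returns the given outcomes in order."""
    remaining = iter(outcomes)
    return lambda: next(remaining)
```

A test scripted as `[False, False, True]` fails its first attempt and the immediate retry, then passes in the tail run and is reported as `OK_FLAKY`.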
Safety Limits
- Max 10% of total tests. If more than 10% of tests are failing, retries stop. The failures are probably real, not flaky.
- Skip long-running tests. Tests that take a disproportionate amount of time are not retried. The cost isn't worth it.
- Up to 2 retries per test: first immediate, then deferred (tail run).
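The three limits above can be combined into a single check. The sketch below is hypothetical: `should_retry` and its parameters are not DAGZ names, and the `slow_factor` threshold in particular is an assumption, since the text only says "disproportionate" without giving a number.

```python
def should_retry(failed_count, total_tests, duration_s, median_duration_s,
                 retries_so_far, max_failed_fraction=0.10,
                 slow_factor=10.0, max_retries=2):
    """Return True if a failed test is still eligible for a retry."""
    if total_tests <= 0:
        return False
    # Limit 1: too many failures suggests real breakage, not flakiness.
    if failed_count / total_tests > max_failed_fraction:
        return False
    # Limit 2: skip tests that are much slower than a typical test.
    # Assumption: "disproportionate" is modeled as slow_factor x the median.
    if duration_s > slow_factor * median_duration_s:
        return False
    # Limit 3: at most two retries (immediate, then tail run).
    return retries_so_far < max_retries
```

With these defaults, a 1-second test in a session where 5 of 100 tests failed is retried, while the same test is not retried once 11 of 100 have failed.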
Configuration
You can change DAGZ's default retry behavior with these config options:
- In .dagz/config.yaml: flaky_tail_run (default true): whether to retry failed tests at the end of the session.
- pytest.dagz_flaky(tail_run=True/False) pytest marker: override the default tail-run for a specific test.
- pytest.mark.flaky(reruns=N) pytest marker: DAGZ respects the flaky marker from pytest-rerunfailures, retrying a failed test up to N times.
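As an illustration, disabling the tail run for a whole project might look like the fragment below. It assumes that flaky_tail_run is a top-level key in .dagz/config.yaml; the text above only names the option and its default, so check the actual file layout before copying it.

```yaml
# .dagz/config.yaml
# Assumption: flaky_tail_run sits at the top level of the file.
flaky_tail_run: false  # skip the end-of-session retry (default: true)
```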
Fixing Flaky Tests
Generate Agent Context
DAGZ provides rich context for fixing a test. You can generate it with zb span-logs or download it from the dashboard.
zb span-logs <test_span_id> [--project/-P] # Use --project/-P to query DAGZ's project server instead of the local daemon.
Download Agent Context
In the dashboard, click on a test to see its historical execution view, which includes logs and failure trends over time.
Dashboard Views
The dashboard provides:
- Top flaky tests: ranked by flaky rate, the tests that fail-then-pass most often
- Historical execution view: full logs and failure trend for each test over time
- Per-job flaky count: each job shows how many tests were flaky
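For readers who want to reproduce a "top flaky tests" ranking from exported results, here is a rough sketch. It is not the dashboard's implementation: `top_flaky` is a hypothetical name, the input is assumed to be a list of (test name, result) pairs, and "flaky rate" is assumed to mean the share of a test's runs that ended as OK_FLAKY.

```python
from collections import Counter


def top_flaky(history, limit=10):
    """Rank tests by flaky rate.

    `history` is an iterable of (test_name, result) pairs, where result is
    one of the labels from the Result Types table.
    Assumption: flaky rate = OK_FLAKY runs / total runs for that test.
    """
    runs = Counter()
    flaky = Counter()
    for name, result in history:
        runs[name] += 1
        if result == "OK_FLAKY":
            flaky[name] += 1

    rates = {name: flaky[name] / runs[name] for name in runs}
    # Highest rate first; ties broken alphabetically for stable output.
    ranked = sorted(rates.items(), key=lambda item: (-item[1], item[0]))
    return [(name, rate) for name, rate in ranked if rate > 0][:limit]
```

Tests that were never flaky are left out of the ranking, so a test that always fails does not appear here.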