Flaky Tests Management

DAGZ and Flaky Tests

Flaky tests are tests that sometimes fail and sometimes pass, affected by factors like timing, external services or previous tests.

DAGZ affects flaky tests as follows:

  • Less noise. If a code change doesn't affect a flaky test, it won't run and won't clutter the results.
  • Automatic retries. When a test fails, DAGZ retries it automatically, within some safety limits. See Retry Mechanism and Safety Limits below.
  • Rich context for investigation. DAGZ provides previous runs and their logs, so humans and agents can investigate and fix failures quickly.
  • Discovery runs. DAGZ can run an exhaustive "discovery mode" that uncovers hidden dependencies between tests.

DAGZ also makes some non-obvious issues more visible:

  • Dynamic test ordering. DAGZ uses the default test order only as a starting point. Tests may be dynamically scheduled on different workers, so a test cannot rely on running in a fixed position.
  • Hidden dependencies. If a test is affected by tests that run before it, DAGZ will surface the problem more often.

Smart retries, rich context and discovery runs all help with efficient investigation and fixing of these issues.
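
As a minimal illustration of a hidden dependency, the sketch below uses a module-level dictionary as a stand-in for any shared state (database rows, environment variables, temporary files). The names `_cache`, `test_writes_cache`, `test_reads_cache` and `run_in_order` are hypothetical and are not part of DAGZ.

```python
# Shared state that outlives a single test (illustrative stand-in).
_cache = {}


def test_writes_cache():
    _cache["user"] = "alice"
    assert _cache["user"] == "alice"


def test_reads_cache():
    # Passes only if test_writes_cache already ran against the same state.
    assert _cache.get("user") == "alice"


def run_in_order(tests):
    """Run tests in the given order against fresh state; return name -> passed."""
    _cache.clear()
    results = {}
    for test in tests:
        try:
            test()
            results[test.__name__] = True
        except AssertionError:
            results[test.__name__] = False
    return results


# Default order: both tests pass.
print(run_in_order([test_writes_cache, test_reads_cache]))
# Reordered: test_reads_cache fails because nothing populated the cache yet.
print(run_in_order([test_reads_cache, test_writes_cache]))
```

The second call shows why a test like `test_reads_cache` can look stable for a long time and start failing only once its position in the run changes.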

Result Types

Formally, DAGZ flags a test as flaky when it fails on the first attempt but passes on a retry within the same job:

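| Result   | Meaning                                                             | Impact        |
|----------|---------------------------------------------------------------------|---------------|
| OK       | Passed on first attempt                                             | -             |
| FAILED   | Failed on all attempts                                              | Fails the job |
| OK_FLAKY | Failed, then passed on retry                                        | -             |
| XFAIL    | pytest.mark.xfail test which failed                                 | -             |
| XPASS    | pytest.mark.xfail which passed. Considered flaky in UI and reports. | -             |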
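
The mapping from attempt outcomes to result types can be sketched as a small function. This is only an illustration of the table above, not DAGZ's implementation: `classify` is a hypothetical helper, and the handling of xfail-marked tests (looking only at the last attempt) is an assumption.

```python
def classify(attempts, xfail=False):
    """Map a test's attempt outcomes (True = passed) to a result label.

    Hypothetical helper that mirrors the result table; not DAGZ's code.
    """
    if not attempts:
        raise ValueError("at least one attempt is required")
    if xfail:
        # Assumption: for xfail-marked tests only the last attempt matters.
        return "XPASS" if attempts[-1] else "XFAIL"
    if attempts[0]:
        return "OK"
    if any(attempts[1:]):
        # Failed first, then passed on some retry.
        return "OK_FLAKY"
    return "FAILED"


print(classify([True]))                 # OK
print(classify([False, True]))          # OK_FLAKY
print(classify([False, False, False]))  # FAILED
```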

Retry Mechanism

When a test fails, DAGZ retries it in two phases:

  1. Immediate retry. The test is re-run immediately, without teardown. This catches transient failures (timing issues, network jitter) with minimal overhead.
  2. Tail run. At the end of the session, failed tests are retried again with full setup/teardown. This catches failures caused by leaked state from other tests.
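
The two phases can be pictured with the following simplified sketch. It models each test as a callable that returns True on success and does not model setup or teardown; `run_session` and `scripted` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List


def scripted(*outcomes: bool) -> Callable[[], bool]:
    """Build a fake test that returns the given outcomes on successive calls."""
    it = iter(outcomes)
    return lambda: next(it)


def run_session(tests: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run each test once, retry failures immediately, then again in a tail run."""
    results: Dict[str, str] = {}
    tail: List[str] = []
    for name, test in tests.items():
        if test():
            results[name] = "OK"
        elif test():  # phase 1: immediate retry
            results[name] = "OK_FLAKY"
        else:
            tail.append(name)  # defer to the end of the session
    for name in tail:  # phase 2: tail run
        results[name] = "OK_FLAKY" if tests[name]() else "FAILED"
    return results


session = {
    "stable": lambda: True,
    "jittery": scripted(False, True),       # recovers on the immediate retry
    "leaky": scripted(False, False, True),  # recovers only in the tail run
    "broken": lambda: False,                # fails every attempt
}
print(run_session(session))
# {'stable': 'OK', 'jittery': 'OK_FLAKY', 'leaky': 'OK_FLAKY', 'broken': 'FAILED'}
```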

Safety Limits

  • Max 10% of total tests. If more than 10% of tests are failing, retries stop. The failures are probably real, not flaky.
  • Skip long-running tests. Tests that take a disproportionate amount of time are not retried. The cost isn't worth it.
  • Up to 2 retries per test: first immediate, then deferred (tail run).
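
One way to read these limits is as a single predicate that is checked before each retry. The sketch below is an assumption-laden illustration: `should_retry` and its parameters are hypothetical, and the 60-second cutoff for "long-running" is invented for the example because no threshold is stated here.

```python
def should_retry(failed_count, total_tests, duration_s, retries_done,
                 max_failed_percent=10, long_test_s=60.0, max_retries=2):
    """Return True if a failed test is still eligible for another retry.

    long_test_s is a made-up threshold; the real cutoff is not documented here.
    """
    # More than 10% of tests failing: failures are probably real, stop retrying.
    if failed_count * 100 > max_failed_percent * total_tests:
        return False
    # Long-running tests are not worth the cost of a retry.
    if duration_s > long_test_s:
        return False
    # At most two retries: one immediate, one in the tail run.
    return retries_done < max_retries


print(should_retry(failed_count=5, total_tests=100, duration_s=1.2, retries_done=0))   # True
print(should_retry(failed_count=11, total_tests=100, duration_s=1.2, retries_done=0))  # False
```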

Configuration

You can change DAGZ's default retry behavior with these config options:

  • In .dagz/config.yaml:
    • flaky_tail_run (default true): whether to retry failed tests at the end of the session.
  • pytest.dagz_flaky(tail_run=True/False) pytest marker: override the default tail-run for a specific test.
  • pytest.mark.flaky(reruns=N) pytest marker: DAGZ respects the flaky marker from pytest-rerunfailures, retrying a failed test up to N times.
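
For example, a test could opt in to extra reruns with the pytest-rerunfailures marker as sketched below. The test name and body are placeholders; only the `pytest.mark.flaky(reruns=...)` marker comes from the options listed above.

```python
import pytest


# Marker provided by pytest-rerunfailures; per the list above, DAGZ respects it
# and retries the test up to 2 times on failure.
@pytest.mark.flaky(reruns=2)
def test_fetch_profile():
    # Placeholder body; a real test would exercise the unreliable dependency.
    profile = {"id": 1, "name": "alice"}
    assert profile["id"] == 1
```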

Fixing Flaky Tests

Generate Agent Context

DAGZ provides rich context for fixing a test. You can generate it with zb span-logs or download it from the dashboard.

zb span-logs <test_span_id> [--project/-P] # Use --project/-P to use DAGZ's project server instead of the local daemon.
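
If the context needs to be collected from a script, for instance to hand it to an agent, a thin wrapper around the command might look like the sketch below. It assumes that --project is a boolean flag that takes no value and that the logs are written to standard output; neither is confirmed here, and both function names are hypothetical.

```python
import subprocess


def span_logs_command(test_span_id, use_project_server=False):
    """Build the argument list for `zb span-logs` (flag semantics are assumed)."""
    cmd = ["zb", "span-logs", test_span_id]
    if use_project_server:
        # Assumption: --project is a plain switch with no argument.
        cmd.append("--project")
    return cmd


def fetch_span_logs(test_span_id, use_project_server=False):
    """Run the command and return its stdout (assumes logs are printed there)."""
    completed = subprocess.run(
        span_logs_command(test_span_id, use_project_server),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout
```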

Download Agent Context

In the dashboard, click on a test to see its historical execution view, which includes logs and failure trends over time.

Dashboard Views

The dashboard provides:

  • Top flaky tests: the tests that fail and then pass most often, ranked by flaky rate
  • Historical execution view: full logs and failure trend for each test over time
  • Per-job flaky count: each job shows how many tests were flaky
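
The ranking in the first view can be approximated offline from exported results. The sketch below assumes that the flaky rate is the share of a test's runs that ended as OK_FLAKY; the exact formula the dashboard uses, and whether XPASS results count toward it, is not stated here. `top_flaky` is a hypothetical helper.

```python
from collections import Counter


def top_flaky(runs, limit=10):
    """Rank tests by assumed flaky rate (OK_FLAKY runs / total runs).

    runs: iterable of (test_name, result) pairs.
    Returns a list of (test_name, rate), highest rate first.
    """
    totals, flaky = Counter(), Counter()
    for name, result in runs:
        totals[name] += 1
        if result == "OK_FLAKY":
            flaky[name] += 1
    rates = {name: flaky[name] / totals[name] for name in totals}
    # Sort by rate descending, then by name for a stable order.
    ranked = sorted(rates.items(), key=lambda item: (-item[1], item[0]))
    return [(name, rate) for name, rate in ranked if rate > 0][:limit]


history = [
    ("test_login", "OK"),
    ("test_login", "OK_FLAKY"),
    ("test_upload", "OK_FLAKY"),
    ("test_upload", "OK_FLAKY"),
    ("test_search", "OK"),
]
print(top_flaky(history))  # [('test_upload', 1.0), ('test_login', 0.5)]
```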