Runtime Behavior

Things DAGZ does on its own at runtime, with no configuration involved. Knowing what to expect makes the output easier to read and the rare anomaly easier to spot.

Multi-node coordination

A "job" can be a single pytest --dagz process or many of them, spread across nodes. DAGZ stitches them into one job automatically.

GitLab CI parallel jobs are detected from CI_NODE_INDEX, CI_NODE_TOTAL, and CI_PIPELINE_ID. All parallel: slots in the same pipeline join the same DAGZ job. The first node to come up becomes the scheduler; the rest are nodes that run a vassal (which manages local workers and forwards results).
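
In pseudocode, the detection amounts to reading three environment variables. A sketch; the function name and return shape are illustrative, not DAGZ's API:

  import os

  def detect_topology():
      """Read the GitLab CI variables DAGZ keys on. Outside GitLab CI
      the defaults collapse to a single-node run."""
      index = int(os.environ.get("CI_NODE_INDEX", "1"))
      total = int(os.environ.get("CI_NODE_TOTAL", "1"))
      pipeline = os.environ.get("CI_PIPELINE_ID")
      return {
          "node_index": index,
          "node_total": total,
          # All parallel: slots sharing a pipeline ID join one DAGZ job.
          # Scheduler election is first-come-first-served, not index-based.
          "job_key": pipeline or "local",
      }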

Single-node runs are the degenerate case of the same model: scheduler and vassal in one process, with N workers underneath.

Connectivity. Vassals try a direct TCP connection to the scheduler first. When direct routing is not possible (NAT, isolated CI runners, no shared network), they fall back to a tunnel through the DAGZ service. The CI configuration is the same in either case: no ports to open, no transport to choose.
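
The fallback is the usual direct-then-relay pattern. A sketch, assuming a hypothetical tunnel helper; the real transport is not shown:

  import socket

  def open_service_tunnel(host, port):
      # Hypothetical stand-in for the relay through the DAGZ service.
      raise NotImplementedError("tunnel transport not shown")

  def connect_to_scheduler(host, port, timeout=5.0):
      """Direct TCP first; tunnel through the DAGZ service when there
      is no route (NAT, isolated runners, no shared network)."""
      try:
          return socket.create_connection((host, port), timeout=timeout)
      except OSError:
          return open_service_tunnel(host, port)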

Network defaults

The local daemon binds only to loopback (127.0.0.1 and ::1) by default. It is never reachable from the network unless you opt in (see Running on Mac with Docker for the listen-address knobs). The web UI has no auth and no TLS; that is fine on loopback, and exposing it anywhere else is a deliberate, opt-in decision.

pytest --dagz finds the daemon via DAGZ_URL if set, otherwise the local env config under ~/.dagz/local.env/, otherwise localhost:29111.
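
The lookup order, as a sketch. The read_local_env hook is hypothetical, since the layout of local.env/ is not documented here:

  import os
  from pathlib import Path

  def resolve_daemon_url(read_local_env=None):
      """Lookup order: DAGZ_URL, then the local env config under
      ~/.dagz/local.env/, then the default of localhost:29111."""
      if url := os.environ.get("DAGZ_URL"):
          return url
      env_dir = Path.home() / ".dagz" / "local.env"
      if env_dir.is_dir() and read_local_env is not None:
          return read_local_env(env_dir)  # hypothetical hook
      return "localhost:29111"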

Subprocess instrumentation

Tests that spawn child processes (subprocess.run, multiprocessing.Pool, CLI tools, embedded servers, ...) are instrumented automatically. The child gets the same DAGZ probes as the parent: coverage, dependency tracking, and DB rerouting all extend to it. Coverage from the child merges back into the parent test's span.

This is the same machinery that makes end-to-end tests usable for change-aware selection. There's nothing to enable.
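
For example, a test like this needs no markers or plugins (the mytool module is hypothetical):

  import subprocess
  import sys

  def test_cli_reports_version():
      # The child interpreter below is instrumented automatically:
      # coverage, dependency tracking, and DB rerouting follow it, and
      # its coverage merges back into this test's span.
      result = subprocess.run(
          [sys.executable, "-m", "mytool", "--version"],  # hypothetical CLI
          capture_output=True, text=True, check=True,
      )
      assert result.stdout.strip()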

Worker recycling

On Linux, workers are forked processes. If a worker exceeds its memory budget, DAGZ kills it between tests and restarts a fresh one. Already-finished test results are kept; the in-flight test (if any) is rescheduled, possibly onto another worker. A test that leaks memory does not crash the run.

If a worker exits unexpectedly (segfault, OOM kill, os._exit), DAGZ logs the cause and restarts. Tests that completed on that worker before the crash are recorded; the test that was running gets marked as a failure with the crash details and is not retried.
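
Both rules fit in one supervision loop. A sketch; the worker and results objects are hypothetical:

  def supervise(worker, queue, results, memory_budget):
      """Recycle on memory pressure between tests; record a crash as a
      non-retried failure. Illustrative only."""
      while queue:
          test = queue.pop()
          outcome = worker.run(test)
          if outcome.crashed:
              # Unexpected exit (segfault, OOM kill, os._exit): record
              # the in-flight test as failed with crash details, no retry.
              results.record_failure(test, outcome.crash_details)
              worker = worker.restart()
              continue
          results.record(test, outcome)
          if worker.rss() > memory_budget:
              # Over budget between tests: recycle. Finished results are
              # kept; anything still assigned to this worker would be
              # rescheduled, possibly onto another worker.
              worker = worker.restart()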

Retry behavior

Failures get up to two extra attempts before being recorded as final:

  1. Immediate retry. Same worker, no teardown. Catches transient flakes (timing, network jitter) without paying for fixture setup.
  2. Deferred retry. End of session, fresh worker, full setup. Catches state pollution from earlier tests.

Retries are skipped when:

  • The test takes longer than the retry budget (a fraction of the slowest planned task, with a minimum floor; see the sketch after this list). Retrying a 20-minute test is rarely worth it.
  • More than ~10% of the suite has already failed. Past that threshold, the failures are probably real.
  • The test is part of a --dagz-fork-tests batch, where retries are handled per-test by the worker itself.
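
As a sketch, with placeholder constants (the exact fraction and floor are internal; only the ~10% cap appears above):

  def should_retry(duration, slowest_planned, failed, total, in_fork_batch,
                   fraction=0.25, floor=30.0):
      """fraction and floor are illustrative placeholders."""
      retry_budget = max(floor, fraction * slowest_planned)
      if duration > retry_budget:
          return False   # retrying a long test is rarely worth it
      if failed > 0.10 * total:
          return False   # past ~10%, the failures are probably real
      if in_fork_batch:
          return False   # --dagz-fork-tests retries per test in the worker
      return True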

A test that passes on retry is marked OK_FLAKY: it counts as passed but is tracked separately in the dashboard. See Flaky Test Management for the result classification.

Progress logs

While tests run, DAGZ prints one line per finished test, with throttling so a fast suite does not flood the terminal (a sketch of the throttle follows the list):

  • Failures, errors, and unexpected results print immediately, in full.
  • Passes (OK) and skips are batched. At most one line per second, summarizing what completed since the last line (+47 OK, +3 skipped, ...).
  • Slow tests that have been running for a long time get an interim "still running" log so a hung test is visible.
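
A minimal version of the throttle, as a sketch (not the real logger):

  import time

  class ProgressPrinter:
      """Failures print immediately; OK/skip results are batched to at
      most one summary line per second."""

      def __init__(self, interval=1.0):
          self.interval = interval
          self.last_flush = time.monotonic()
          self.ok = 0
          self.skipped = 0

      def report(self, test, status):
          if status not in ("OK", "SKIP"):
              print(f"{status}: {test}")   # full detail, right away
              return
          if status == "OK":
              self.ok += 1
          else:
              self.skipped += 1
          now = time.monotonic()
          if now - self.last_flush >= self.interval:
              print(f"+{self.ok} OK, +{self.skipped} skipped")
              self.ok = self.skipped = 0
              self.last_flush = now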

If the scheduler notices that no tests have finished for a long stretch, it prints a stuck-detection report listing each worker's current task and queue depth. Useful for diagnosing a deadlock without attaching a debugger.
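
A sketch of that watchdog; the threshold and worker attributes are assumptions:

  import time

  def stuck_report(workers, last_finish, stall_after=120.0):
      """Dump each worker's current task and queue depth once no test
      has finished for stall_after seconds."""
      if time.monotonic() - last_finish < stall_after:
          return
      print("no tests have finished recently; per-worker state:")
      for w in workers:
          print(f"  worker {w.id}: running {w.current_task!r}, "
                f"queue depth {w.queue_depth}")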

When the run ends, DAGZ writes a summary line: planned vs. actual time, CPU time, selected/skipped/redundant counts, flaky count, and a baseline ID you can pass back via --dagz-baseline for reproduction.

Job IDs

Every job gets a short, human-typable ID: j10 for the 10th job today, j1201.1 for December 1st's first job. These IDs are stable across the dashboard, zb logs, and zb spans, so you can copy-paste between them.

A full UUID is assigned too, but you rarely need it: every CLI argument that takes a job accepts the short form.

Failure isolation

A failure in DAGZ's own machinery (instrumentation, coverage probes, scheduler) is reported as a DAGZ error, not a test failure. The test in question runs anyway, with selection effectively turned off for that path. The goal is that DAGZ never silently miscounts your test results: if something goes wrong, you'll see it labeled as such.