CI Runs

Parallel execution

DAGZ executes test suites in parallel, spreading the work across multiple nodes (machines) and multiple workers on each node.

DAGZ's test scheduler optimizes test distribution using:

Previous run durations, taken from the baselines generated for DAGZ test selection.
Fixture sharing - tests that need the same setup tend to land on the same worker.
Work stealing - idle workers steal tests from busy ones when the plan drifts from reality.
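The three inputs above can be illustrated with a small planner. This is a minimal sketch, not DAGZ's scheduler: it assumes a greedy longest-group-first heuristic, and the names `plan_workers`, `steal_one`, `durations`, and `fixture_of` are hypothetical. Tests that share a fixture are kept together, groups go to the least-loaded worker, and an idle worker takes one test from the tail of the busiest queue.

```python
import heapq
from collections import defaultdict


def plan_workers(durations, fixture_of, n_workers):
    """Greedy plan: keep fixture groups together, balance by past duration."""
    # Tests that share a fixture form one group; others are a group of one.
    groups = defaultdict(list)
    for test in durations:
        fixture = fixture_of.get(test)
        key = ("fixture", fixture) if fixture else ("test", test)
        groups[key].append(test)

    def cost(group):
        return sum(durations[t] for t in group)

    queues = {w: [] for w in range(n_workers)}
    loads = [(0.0, w) for w in range(n_workers)]  # (planned seconds, worker)
    heapq.heapify(loads)
    # Place the most expensive groups first, each on the least-loaded worker.
    for group in sorted(groups.values(), key=cost, reverse=True):
        load, worker = heapq.heappop(loads)
        queues[worker].extend(group)
        heapq.heappush(loads, (load + cost(group), worker))
    return queues


def steal_one(queues, durations, idle_worker):
    """Move one test from the tail of the busiest queue to an idle worker."""
    victim = max(queues, key=lambda w: sum(durations[t] for t in queues[w]))
    if victim == idle_worker or not queues[victim]:
        return None  # nothing worth stealing
    test = queues[victim].pop()  # the work the victim would reach last
    queues[idle_worker].append(test)
    return test


# Hypothetical baseline durations in seconds; "b" and "c" share a "db" fixture.
durations = {"a": 5, "b": 3, "c": 3, "d": 1}
fixture_of = {"b": "db", "c": "db"}
planned = plan_workers(durations, fixture_of, n_workers=2)

# Simulate worker 1 finishing early while worker 0 still has its whole queue.
queues = {w: list(q) for w, q in planned.items()}
queues[1].clear()
stolen = steal_one(queues, durations, idle_worker=1)
```

Stealing from the tail leaves the victim's next tests untouched, so a fixture it has already set up is more likely to stay in use.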

For installing the plugin on CI workers and pointing it at your team deployment, see CI Integration.

For the hardware-side reasons that actual durations drift from the plan (CPU throttling, hybrid cores, power limits), see Scheduling.

Process roles

DAGZ manages four types of test processes:

Scheduler/Master: coordinates the entire run, plans work distribution, and collects results.
Vassal: manages the worker pool on a single machine.
Workers: run tests sequentially, collect code signals, and report results back to the scheduler.
Sub-processes: spawned by tests, automatically associated with the calling test.

On Linux, workers and vassals are forked processes. They share memory and can be recycled if they exceed memory limits (see Worker recycling below).

CI Job Joining

When multiple pytest --dagz processes start in the same parallel CI step, they automatically join into a single job. The first node to come up becomes the scheduler; the other nodes become vassals and connect back to the scheduler.
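The "first node up becomes the scheduler" rule can be pictured as an atomic claim. The source does not say how DAGZ coordinates nodes, so the following is only a sketch under a loud assumption: a claim file on a filesystem that all nodes can see stands in for whatever coordination service is actually used. `join_job` and `claim_path` are hypothetical names.

```python
import os
import tempfile


def join_job(claim_path, node_name):
    """Return ("scheduler", own name) for the first caller,
    ("vassal", scheduler name) for every later caller."""
    try:
        # O_EXCL makes creation atomic: exactly one caller succeeds.
        fd = os.open(claim_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Someone else claimed the job first; read who to connect back to.
        # (A real implementation would retry if the name is not written yet.)
        with open(claim_path) as f:
            return "vassal", f.read().strip()
    with os.fdopen(fd, "w") as f:
        f.write(node_name)
    return "scheduler", node_name


# Two hypothetical nodes starting in the same parallel CI step.
claim = os.path.join(tempfile.mkdtemp(), "job.claim")
first = join_job(claim, "node-a")
second = join_job(claim, "node-b")
```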

Job Reruns

CI platforms expose a "rerun" button on jobs. DAGZ treats a rerun differently depending on how the previous attempt ended:

Rerun of a failed job: only the tests that failed in the previous attempt are rerun. The rest are skipped as redundant. Useful for flake hunting and for re-verifying a fix without paying for the full suite.
Rerun of a passing job: the entire suite reruns without selection. The previous job already covered the affected tests; a deliberate rerun signals you want a fresh full pass.

Inspecting results

Progress logs

DAGZ prints a progress log line for every test as it finishes, with the test's result and duration.

Status logs

While the run is in progress, the scheduler prints the workers' status every 30 seconds.

01:30.668  master.t2275195 root  I dagz_scheduler::scheduler Worker #0 [Active] progress=7908+73846/81754 actual_delta=-0.5s last_finished=1.0s q=Regular batch #1 time_left=237.1s stolen=0,0finished reconnects=0  pandas/tests/arithmetic/test_datetime64.py::TestDatetime64DateOffsetArithmetic::test_dt64arr_add_sub_DateOffsets[Series-us-US/Central-5-True-CBMonthEnd]
01:30.668  master.t2275195 root  I dagz_scheduler::scheduler Worker #1 [Active] progress=15369+37374/52743 actual_delta=+2.4s last_finished=0.7s q=Regular batch #1 time_left=240.1s stolen=0,0finished reconnects=0  pandas/tests/extension/test_masked.py::TestMaskedArrays::test_loc_series[BooleanDtype]
01:30.668  master.t2275195 root  I dagz_scheduler::scheduler Worker #2 [Active] progress=7275+40253/47528 actual_delta=-0.1s last_finished=0.1s q=Regular batch #1 time_left=235.7s stolen=0,0finished reconnects=0  pandas/tests/groupby/test_raises.py::test_groupby_raises_category[by7-True-std-method]
01:30.668  master.t2275195 root  I dagz_scheduler::scheduler Worker #5 [Active] progress=52+2414/2466 actual_delta=-2.4s last_finished=6.9s q=Regular batch #1 time_left=245.1s stolen=0,0finished reconnects=0  pandas/tests/window/test_numba.py::TestTableMethod::test_table_method_rolling_methods[False-True-arithmetic_numba_supported_operators2-1]

One line per worker, in worker order. Fields after the worker number:

Field	Meaning
`[State]`	Worker state: `Active`, `Idle`, `WorkSent`, `Disconnected`, `Failed`, `NotConnected`.
`progress=A+B/C`	A tests finished, B remaining, C total assigned to this worker.
`actual_delta=±X.Xs`	Cumulative deviation from the plan. Positive = slower than planned.
`last_finished=Y.Ys`	Seconds since the last test on this worker finished. A growing number means the current test is taking long.
`q=NAME batch #N`	Active queue and batch number.
`time_left=Z.Zs`	Estimated time to drain all queues on this worker.
`stolen=L,Mfinished`	L tasks were stolen from other workers; M of those have finished.
`reconnects=K`	Vassal-to-scheduler reconnect count for this worker.

The trailing text is the test currently running on the worker.
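For scripting against CI logs, the fields above can be pulled out with a regular expression. This is a sketch written against the sample lines shown in this section only; the log format may contain variations not covered here, and `parse_status_line` is a hypothetical helper, not part of DAGZ.

```python
import re

STATUS_RE = re.compile(
    r"Worker #(?P<worker>\d+) \[(?P<state>\w+)\] "
    r"progress=(?P<finished>\d+)\+(?P<remaining>\d+)/(?P<total>\d+) "
    r"actual_delta=(?P<actual_delta>[+-]?\d+(?:\.\d+)?)s "
    r"last_finished=(?P<last_finished>\d+(?:\.\d+)?)s "
    r"q=(?P<queue>.+?) batch #(?P<batch>\d+) "
    r"time_left=(?P<time_left>\d+(?:\.\d+)?)s "
    r"stolen=(?P<stolen>\d+),(?P<stolen_finished>\d+)finished "
    r"reconnects=(?P<reconnects>\d+)\s*(?P<current_test>.*)"
)

INT_FIELDS = ("worker", "finished", "remaining", "total", "batch",
              "stolen", "stolen_finished", "reconnects")
FLOAT_FIELDS = ("actual_delta", "last_finished", "time_left")


def parse_status_line(line):
    """Return the status fields as a dict, or None if the line does not match."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    for name in INT_FIELDS:
        fields[name] = int(fields[name])
    for name in FLOAT_FIELDS:
        fields[name] = float(fields[name])
    return fields


# The Worker #5 line from the sample output above.
sample = (
    "01:30.668  master.t2275195 root  I dagz_scheduler::scheduler "
    "Worker #5 [Active] progress=52+2414/2466 actual_delta=-2.4s "
    "last_finished=6.9s q=Regular batch #1 time_left=245.1s "
    "stolen=0,0finished reconnects=0  "
    "pandas/tests/window/test_numba.py::TestTableMethod::"
    "test_table_method_rolling_methods"
    "[False-True-arithmetic_numba_supported_operators2-1]"
)
status = parse_status_line(sample)
```

A simple consistency check on the result is that `finished + remaining == total`, which holds for every sample line above.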

Job report

At the end of the session, every process (scheduler and each worker) prints the same report. In a parallel CI step this means the report lands in every parallel job's logs. Any one of them gets you a link to the dashboard without having to find the scheduler's log.

When the run passes, the report prints in green:

*** DAGZ SESSION END: 2026-05-17 09:35:34 Asia/Jerusalem
***
*** See j0517.50 | https://dagz.example.com/jobs/43732fd6-a9d6-4a92-8960-ba30b8b3fe3f
***
*** Summary of all 6 workers on 1 nodes (master=1/1)
*** 239941/239941 PASSED, 1593 xfail
*** 7146 SKIPPED
***

When any test fails, the report prints in red and lists the failures above the summary:

***
  FAILED: pandas/tests/io/test_http_headers.py::test_request_headers[json] | AssertionError: expected 200, got 502
  FAILED: pandas/tests/groupby/test_raises.py::test_groupby_raises_category[by7-True-std-method] | TimeoutError: ...
  FAILED: pandas/tests/window/test_numba.py::TestTableMethod::test_table_method_rolling_methods | ValueError: ...

*** DAGZ SESSION END: 2026-05-17 09:35:34 Asia/Jerusalem
***
*** See j0517.50 | https://dagz.example.com/jobs/43732fd6-a9d6-4a92-8960-ba30b8b3fe3f?spanTypes=failures
***
*** Summary of all 6 workers on 1 nodes (master=1/1)
*** 3/239941 FAILED
*** 232792 PASSED, 1593 xfail
*** 7146 SKIPPED
***

On failure, the dashboard URL deep-links to the failures view (?spanTypes=failures); on success it links to the job overview.

j0517.50 is the short job ID: the 50th job on May 17.
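A CI step that wants to surface the dashboard link can extract it from the report's `See` line. The sketch below assumes the short ID always has the shape `jMMDD.N`, which is inferred from the single example above, and `parse_see_line` is a hypothetical helper.

```python
import re
from urllib.parse import parse_qs, urlparse

SEE_RE = re.compile(
    r"See (?P<short_id>j(?P<month>\d{2})(?P<day>\d{2})\.(?P<seq>\d+))"
    r" \| (?P<url>\S+)"
)


def parse_see_line(line):
    """Split a report 'See' line into short job ID parts and dashboard URL."""
    m = SEE_RE.search(line)
    if m is None:
        return None
    query = parse_qs(urlparse(m["url"]).query)
    return {
        "short_id": m["short_id"],
        "month": int(m["month"]),
        "day": int(m["day"]),
        "sequence": int(m["seq"]),  # Nth job of that day
        "url": m["url"],
        # True when the link goes straight to the failures view.
        "failures_view": query.get("spanTypes") == ["failures"],
    }


base = "https://dagz.example.com/jobs/43732fd6-a9d6-4a92-8960-ba30b8b3fe3f"
passed = parse_see_line("*** See j0517.50 | " + base)
failed = parse_see_line("*** See j0517.50 | " + base + "?spanTypes=failures")
```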

Links between CI and DAGZ

DAGZ prints the dashboard URL inline in the job report. CI platforms render it as a clickable link, so any CI log that includes the report has a one-click path to the job in the dashboard.

The dashboard reverses the link: each job page shows the originating CI run, so navigating from a failure in DAGZ back to the CI logs is a single click.
