Concepts

Jobs

A job is one execution of your test suite.

Jobs can combine multiple pytest --dagz invocations in a parallel CI step (e.g., GitLab parallel jobs). In this case, the first process to come up becomes the scheduler for the entire parallel step.
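
For instance, a GitLab CI step that fans the suite out across four parallel nodes could look like this (an illustrative sketch; the job name and image are assumptions, and only the parallel/script shape matters):

```yaml
# Hypothetical .gitlab-ci.yml fragment: all four nodes run the same
# command, and DAGZ treats the first process to come up as the
# scheduler for the entire parallel step.
tests:
  parallel: 4
  script:
    - pytest --dagz
```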

Each job gets a unique ID: a short ID like j10 for the 10th job today, or a dated ID like j1201.1 for December 1st's first job.

Jobs are the primary unit in the dashboard. You can drill into any job to see its selected tests, skipped tests, failures, timing, and selection traces.

Jobs are the equivalent of OpenTelemetry's "traces", though they're typically much longer.

Spans

A span is a single execution flow. Spans are nested in a tree that also includes spans from sub-processes or external services. They are largely equivalent to OpenTelemetry's "spans". Unlike OpenTelemetry, DAGZ's spans have a free-text type field.

Span types include:

  • test: A test case execution — see below.
  • load: A load-time phase.
  • py.import: A module's module-level code execution.
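
Conceptually, a span is a node with a free-text type and nested children. A minimal sketch (the Span class and its fields are illustrative, not DAGZ's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    type: str       # free-text type, e.g. "test", "load", "py.import"
    name: str
    children: list = field(default_factory=list)

# A load-time phase that executed one module's module-level code:
load = Span("load", "collect tests")
load.children.append(Span("py.import", "myapp.models"))

print([(c.type, c.name) for c in load.children])
```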

Tests

A test is a span of type "test". It corresponds to a single test case in your codebase. Tests are the primary unit of selection and reporting.

Test spans can have nested:

  • fixture spans for any fixtures they use
  • cache spans for any cache lookups they trigger
  • app spans for any subprocesses they launch
  • test.call spans for the actual test body, excluding setup/teardown
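
A single test's span tree might therefore nest like this (a sketch with hypothetical test and fixture names; only the nesting mirrors the list above):

```python
# Hypothetical span tree for one test, as (type, name, children) tuples.
test_span = ("test", "tests/test_checkout.py::test_refund", [
    ("fixture", "db", []),
    ("cache", "schema lookup", []),
    ("app", "payment-mock server", []),
    ("test.call", "test_refund body", []),
])

def render(span, depth=0):
    # Flatten the tree into indented "type: name" lines.
    kind, name, children = span
    lines = ["  " * depth + f"{kind}: {name}"]
    for child in children:
        lines.extend(render(child, depth + 1))
    return lines

print("\n".join(render(test_span)))
```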

Baselines

A baseline is a snapshot from a previous job: the dependency graph plus content fingerprints for every code element observed during that run. Baselines are compressed and stored efficiently.

Baselines can be built from previous baselines plus the dependencies collected during the current run. This means that even if you run a single test, you get a full baseline for the entire codebase.
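
The composition step can be pictured as a dictionary merge (illustrative only; real baselines are compressed graph snapshots, and the element names and fingerprints here are made up):

```python
# Previous baseline: a fingerprint for every code element seen in that run.
previous = {"app/models.py::User": "f3a1", "app/views.py::index": "9c0e"}

# Dependencies observed while running a single test in the current job.
observed = {"app/models.py::User": "b7d2"}  # User changed since last run

# The new baseline keeps old fingerprints and overwrites the re-observed
# ones, so even a one-test run yields a baseline for the whole codebase.
new_baseline = {**previous, **observed}
print(new_baseline)
```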

Multiple baselines

DAGZ automatically selects both your branch's latest baseline and main's latest baseline, using both to minimize the selected set. Because the optimized coverage collector regenerates baselines on every job at negligible cost, you always have up-to-date baselines to compare against.
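
One way to picture how two baselines minimize the selected set (a hedged sketch; the actual selection algorithm is not documented here): a test can be skipped if at least one baseline proves it unaffected, so only the tests impacted relative to every baseline must run:

```python
# Tests impacted since each baseline's run (hypothetical test IDs).
impacted_vs_branch = {"test_a", "test_b", "test_c"}
impacted_vs_main = {"test_b", "test_c", "test_d"}

# Only tests that no baseline can vouch for need to be selected.
selected = impacted_vs_branch & impacted_vs_main
print(sorted(selected))
```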

Process Roles

DAGZ runs tests across multiple processes for speed and isolation.

Scheduler

The scheduler is the main process — the one that pytest starts in. It:

  • Loads the baseline and computes test selection
  • Plans work distribution across workers based on historical test durations
  • Launches a local vassal to manage workers on the node — the vassal conducts most of the loading phase, including test collection
  • Assigns test batches to workers
  • Collects results and writes them to the service
  • Handles work stealing when workers finish early
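
The work stealing in the last bullet can be sketched like this (illustrative pseudologic, not DAGZ's scheduler): when a worker drains its queue, the scheduler moves pending tests over from the worker with the most remaining work:

```python
from collections import deque

# Remaining test queues per worker (hypothetical state).
queues = {"w1": deque(), "w2": deque(["t4", "t5", "t6", "t7"])}

def steal(idle, queues):
    # Pick the busiest other worker and move half its remaining tests.
    donor = max((w for w in queues if w != idle), key=lambda w: len(queues[w]))
    for _ in range(len(queues[donor]) // 2):
        queues[idle].append(queues[donor].pop())

steal("w1", queues)
print(sorted(queues["w1"]), sorted(queues["w2"]))
```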

Vassal

The vassal is a lightweight coordinator process that manages the worker pool on a single machine. In single-node setups there's one vassal. In multi-node CI (e.g., GitLab parallel jobs), each node runs its own vassal that connects back to the scheduler.

Vassals connect to the scheduler directly when possible. When nodes cannot communicate directly with each other, they can connect through the DAGZ service as a relay. This allows DAGZ to work in any CI environment, even those with strict network policies.

Workers

Workers are the processes that actually run tests. Each worker:

  • Receives a batch of tests from the scheduler
  • Executes them in order, collecting coverage for each
  • Reports results back
  • Restarts automatically if it exits unexpectedly (e.g., due to OOM) — intermediate results are never lost
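
The restart behavior in the last bullet amounts to a supervision loop. A minimal sketch (illustrative only; the real supervisor recycles OS processes, and already-reported results survive upstream rather than in this loop):

```python
def run_worker(batch, _crash_once={"pending": True}):
    # Simulate a worker that is OOM-killed on its first attempt.
    if _crash_once["pending"]:
        _crash_once["pending"] = False
        return None  # abnormal exit, no results from this attempt
    return [f"{test}: passed" for test in batch]

def supervise(batch, max_restarts=3):
    # Re-launch the worker until it finishes or the restart budget runs out.
    for _ in range(max_restarts + 1):
        results = run_worker(batch)
        if results is not None:
            return results
    raise RuntimeError("worker kept crashing")

print(supervise(["t1", "t2"]))
```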

On Linux, workers run in fork mode — each is an isolated process that can be recycled if it exceeds memory limits. This allows substantial RAM savings due to copy-on-write memory sharing.

Sub-processes (App spans)

In integration and end-to-end tests, your tests may spawn child processes (via subprocess, multiprocessing, etc.). DAGZ automatically instruments these child processes and merges their coverage back into the parent test's span.

This is critical for end-to-end tests that launch servers, CLI tools, or worker pools.
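
The merge step can be pictured like this (a sketch of combining per-file line coverage; how DAGZ actually propagates instrumentation into child processes is not shown):

```python
# Coverage observed by the parent test and by a child process it spawned
# (hypothetical paths and line numbers).
parent_cov = {"app/cli.py": {1, 2, 3}}
child_cov = {"app/server.py": {10, 11}, "app/cli.py": {4}}

# Merge the child's coverage back into the parent test's span.
for path, lines in child_cov.items():
    parent_cov.setdefault(path, set()).update(lines)

print({path: sorted(lines) for path, lines in parent_cov.items()})
```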