AcquireProvisionerJob can grab canceled pending jobs
Problem
The AcquireProvisionerJob query in coderd/database/queries/provisionerjobs.sql only checks started_at IS NULL to determine if a job is acquirable:
WHERE potential_job.started_at IS NULL AND potential_job.organization_id = @organization_id AND potential_job.provisioner = ANY(@types :: provisioner_type [ ]) AND provisioner_tagset_contains(@provisioner_tags :: jsonb, potential_job.tags :: jsonb)
This means a job that has been canceled while pending can still be acquired by a provisioner daemon.
How canceled pending jobs are created
In coderd/workspacebuilds.go (lines 736-747), when canceling a build:
err = db.UpdateProvisionerJobWithCancelByID(ctx, database.UpdateProvisionerJobWithCancelByIDParams{ ID: job.ID, CanceledAt: sql.NullTime{ Time: dbtime.Now(), Valid: true, }, CompletedAt: sql.NullTime{ Time: dbtime.Now(), // If the job is running, don't mark it completed! Valid: !job.WorkerID.Valid, }, })
For a pending job (WorkerID is NULL), this sets:
canceled_at= NOWcompleted_at= NOWstarted_at= NULL (unchanged)
Since started_at remains NULL, the job satisfies AcquireProvisionerJob's WHERE clause.
Impact
A provisioner daemon could theoretically acquire a canceled pending job. In practice this may be rare, but it creates an inconsistent state.
Proposed fixes
1. Fix AcquireProvisionerJob query
Add AND potential_job.completed_at IS NULL to the WHERE clause:
WHERE potential_job.started_at IS NULL AND potential_job.completed_at IS NULL -- Add this AND potential_job.organization_id = @organization_id ...
2. Fix dbfake test helper
In coderd/database/dbfake/dbfake.go, when completing jobs (both success and cancel paths), use UpdateProvisionerJobWithCompleteWithStartedAtByID instead of UpdateProvisionerJobWithCompleteByID to also set started_at. This prevents test provisioner daemons from acquiring fake-completed jobs.
This was the root cause of the flaky test Test_TaskLogs_Golden/SnapshotWithLogs_Table (see #1322).
Related
- Flaky test issue: flake: Test_TaskLogs_Golden/SnapshotWithLogs_Table #1322
- PR that disabled the test: test(cli): remove IncludeProvisionerDaemon from task snapshot tests coder#21850