AcquireProvisionerJob can grab canceled pending jobs

Problem

The AcquireProvisionerJob query in coderd/database/queries/provisionerjobs.sql only checks started_at IS NULL to determine if a job is acquirable:

WHERE
    potential_job.started_at IS NULL
    AND potential_job.organization_id = @organization_id
    AND potential_job.provisioner = ANY(@types :: provisioner_type [ ])
    AND provisioner_tagset_contains(@provisioner_tags :: jsonb, potential_job.tags :: jsonb)

This means a job that has been canceled while pending can still be acquired by a provisioner daemon.

How canceled pending jobs are created

In coderd/workspacebuilds.go (lines 736-747), when canceling a build:

err = db.UpdateProvisionerJobWithCancelByID(ctx, database.UpdateProvisionerJobWithCancelByIDParams{
    ID: job.ID,
    CanceledAt: sql.NullTime{
        Time:  dbtime.Now(),
        Valid: true,
    },
    CompletedAt: sql.NullTime{
        Time: dbtime.Now(),
        // If the job is running, don't mark it completed!
        Valid: !job.WorkerID.Valid,
    },
})

For a pending job (WorkerID is NULL), this sets:

  • canceled_at = NOW
  • completed_at = NOW
  • started_at = NULL (unchanged)

Since started_at remains NULL, the job satisfies AcquireProvisionerJob's WHERE clause.

Impact

A provisioner daemon could theoretically acquire a canceled pending job. In practice this may be rare, but it creates an inconsistent state.

Proposed fixes

1. Fix AcquireProvisionerJob query

Add AND potential_job.completed_at IS NULL to the WHERE clause:

WHERE
    potential_job.started_at IS NULL
    AND potential_job.completed_at IS NULL  -- Add this
    AND potential_job.organization_id = @organization_id
    ...

2. Fix dbfake test helper

In coderd/database/dbfake/dbfake.go, when completing jobs (both success and cancel paths), use UpdateProvisionerJobWithCompleteWithStartedAtByID instead of UpdateProvisionerJobWithCompleteByID to also set started_at. This prevents test provisioner daemons from acquiring fake-completed jobs.

This was the root cause of the flaky test Test_TaskLogs_Golden/SnapshotWithLogs_Table (see #1322).

Related