fix: make boundary usage telemetry collection atomic by zedkipp · Pull Request #21907 · coder/coder

@zedkipp

Replace separate GetBoundaryUsageSummary and ResetBoundaryUsageStats
calls with a single atomic GetAndResetBoundaryUsageSummary query. This
uses DELETE...RETURNING in a common table expression to ensure the rows
we sum are exactly the rows we delete, eliminating the race condition
where a replica could flush stats between the read and reset operations.

The race could lose up to one flush interval (1 minute) of usage data:
if a replica flushed between GetBoundaryUsageSummary and
ResetBoundaryUsageStats, the reset deleted the newly written rows before
any telemetry collection had counted them.
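
The lost-interval scenario can be replayed with a small in-memory model.
This is only a sketch: usageStore, its methods, the replica names, and
the counts are hypothetical stand-ins for the stats table and the queries
named above, and the model assumes a flush adds to a per-replica row. It
does not reproduce the actual SQL or schema.

```go
package main

import "sync"

// usageStore is a hypothetical in-memory stand-in for the boundary usage
// stats table: one counter row per replica.
type usageStore struct {
	mu   sync.Mutex
	rows map[string]int64 // replica ID -> requests recorded this period
}

func newUsageStore() *usageStore {
	return &usageStore{rows: make(map[string]int64)}
}

// upsert models a replica flushing its counters (assumed to add to its row).
func (s *usageStore) upsert(replica string, n int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.rows[replica] += n
}

// summary models the old read step: sum every row, delete nothing.
func (s *usageStore) summary() int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	var total int64
	for _, n := range s.rows {
		total += n
	}
	return total
}

// reset models the old reset step: delete every row, including rows that
// were written after the preceding summary call.
func (s *usageStore) reset() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.rows = make(map[string]int64)
}

// getAndReset models the combined query: the rows that are summed are
// exactly the rows that are deleted, in one critical section.
func (s *usageStore) getAndReset() int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	var total int64
	for _, n := range s.rows {
		total += n
	}
	s.rows = make(map[string]int64)
	return total
}

// reportedTwoStep replays the race with separate read and reset calls and
// returns the total reported over two collections.
func reportedTwoStep() int64 {
	s := newUsageStore()
	s.upsert("replica-a", 10)
	first := s.summary()
	s.upsert("replica-b", 5) // flush lands between read and reset
	s.reset()                // deletes replica-b's row, never counted
	second := s.summary()
	s.reset()
	return first + second
}

// reportedAtomic runs the same schedule with the combined operation.
func reportedAtomic() int64 {
	s := newUsageStore()
	s.upsert("replica-a", 10)
	first := s.getAndReset()
	s.upsert("replica-b", 5) // survives until the next collection
	second := s.getAndReset()
	return first + second
}
```

With 15 requests written in total, reportedTwoStep reports 10 (the 5
flushed between read and reset are lost), while reportedAtomic reports
all 15.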

zedkipp marked this pull request as ready for review February 4, 2026 20:52

@zedkipp

UpsertBoundaryUsageStats (INSERT...ON CONFLICT DO UPDATE) and
GetAndResetBoundaryUsageSummary (DELETE...RETURNING) can race during
telemetry period cutover. Without serialization, an upsert concurrent
with the delete could lose data (deleted right after being written) or
commit after the delete (miscounted in the next period). Both operations
now acquire LockIDBoundaryUsageStats within a transaction to ensure a
clean cutover.
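
As an in-process analogue of that serialization, the sketch below uses a
sync.Mutex in place of LockIDBoundaryUsageStats: writers and the
collector take the same lock, so every flush is counted in exactly one
period. The periodStats type, simulateCutover, and the loop counts are
hypothetical. The real change takes the lock inside a database
transaction, which this model does not attempt to show.

```go
package main

import "sync"

// periodStats is a hypothetical in-memory stand-in for the stats table.
// Its mutex plays the role the commit gives to LockIDBoundaryUsageStats:
// both the writers and the collector must hold it.
type periodStats struct {
	lock  sync.Mutex
	total int64
}

// upsert models a replica's upsert running while holding the shared lock.
func (p *periodStats) upsert(n int64) {
	p.lock.Lock()
	defer p.lock.Unlock()
	p.total += n
}

// getAndReset models the collector's read-and-delete under the same lock.
func (p *periodStats) getAndReset() int64 {
	p.lock.Lock()
	defer p.lock.Unlock()
	out := p.total
	p.total = 0
	return out
}

// simulateCutover runs concurrent writers against repeated collections and
// returns how much was written and how much was reported across all periods.
func simulateCutover(writers, flushesPerWriter int) (written, reported int64) {
	var p periodStats
	var wg sync.WaitGroup
	for w := 0; w < writers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < flushesPerWriter; i++ {
				p.upsert(1)
			}
		}()
	}
	// Each call is one period cutover racing with in-flight flushes.
	for i := 0; i < 100; i++ {
		reported += p.getAndReset()
	}
	wg.Wait()
	// A final collection picks up whatever landed after the last cutover.
	reported += p.getAndReset()
	written = int64(writers * flushesPerWriter)
	return written, reported
}
```

Because each upsert lands entirely before or entirely after a given
cutover, the reported total equals the written total regardless of how
the goroutines interleave.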

hugodutka

zedkipp deleted the zedkipp/snapshot-race branch February 6, 2026 16:52