fix: make boundary usage telemetry collection atomic by zedkipp · Pull Request #21907 · coder/coder
Replace the separate GetBoundaryUsageSummary and ResetBoundaryUsageStats calls with a single atomic GetAndResetBoundaryUsageSummary query. The new query uses DELETE...RETURNING in a common table expression so that the rows we sum are exactly the rows we delete, which eliminates the race where a replica could flush stats between the read and the reset.

Previously, if a replica flushed between GetBoundaryUsageSummary and ResetBoundaryUsageStats, the newly written data was deleted before the next telemetry collection could read it, so up to one flush interval (1 minute) of usage data could be lost.
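To make the "sum exactly what you delete" idea concrete, here is a minimal sketch, not the code from this PR. The SQL constant shows the general shape of a DELETE...RETURNING common table expression; the column names (allowed_requests, denied_requests) are assumptions for illustration. The in-memory store below it models the same guarantee with a mutex, and it assumes that a flush accumulates into the replica's row, which the PR does not state.

```go
package main

import "sync"

// getAndResetSQL sketches the DELETE...RETURNING pattern described above.
// The column names are hypothetical; only the table's purpose comes from the PR.
const getAndResetSQL = `
WITH deleted AS (
    DELETE FROM boundary_usage_stats
    RETURNING allowed_requests, denied_requests
)
SELECT
    COALESCE(SUM(allowed_requests), 0) AS allowed_requests,
    COALESCE(SUM(denied_requests), 0)  AS denied_requests
FROM deleted;`

// usage is a hypothetical per-replica counter row.
type usage struct {
	Allowed int64
	Denied  int64
}

// store is an in-memory stand-in for the stats table, keyed by replica ID.
type store struct {
	mu   sync.Mutex
	rows map[string]usage
}

func newStore() *store {
	return &store{rows: make(map[string]usage)}
}

// Flush models a replica writing its stats (assumed here to accumulate).
func (s *store) Flush(replica string, u usage) {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur := s.rows[replica]
	cur.Allowed += u.Allowed
	cur.Denied += u.Denied
	s.rows[replica] = cur
}

// GetAndReset sums and clears the rows in one critical section, so a flush
// can land before or after it, but never between the read and the reset.
func (s *store) GetAndReset() usage {
	s.mu.Lock()
	defer s.mu.Unlock()
	var total usage
	for _, r := range s.rows {
		total.Allowed += r.Allowed
		total.Denied += r.Denied
	}
	s.rows = make(map[string]usage)
	return total
}
```

With two separate calls (read, then reset), a Flush that lands between them would be cleared without ever being summed. Folding both steps into one operation removes that window.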
zedkipp marked this pull request as ready for review
UpsertBoundaryUsageStats (INSERT...ON CONFLICT DO UPDATE) and GetAndResetBoundaryUsageSummary (DELETE...RETURNING) can race during telemetry period cutover. Without serialization, an upsert concurrent with the delete could lose data (deleted right after being written) or commit after the delete (miscounted in the next period). Both operations now acquire LockIDBoundaryUsageStats within a transaction to ensure a clean cutover.
zedkipp deleted the zedkipp/snapshot-race branch