Fix heap size double-subtraction causing server crash under low-memory object operations by badrishc · Pull Request #1663 · microsoft/garnet
AI review requested due to automatic review settings
April 2, 2026 01:02…ap objects Under lowMemory settings (LogMemorySize=2k, PageSize=512) with heavy object operations, the server would livelock in TryAllocateRetryNow because: 1. IsBeyondSizeLimitAndCanEvict(addingPage:true) returned true for soft memory limits (heap objects >> target), blocking page allocations via NeedToWaitForClose. Evicting pages cannot reduce heap memory from objects that are immediately cloned back via CopyUpdate, creating an infinite RETRY_NOW spin. 2. IssueShiftAddress skipped HeadAddress advancement when IsBeyondSizeLimit was true, even at the hard MaxAllocatedPageCount limit. This prevented the TryAllocateRetryNow spin from resolving because no thread advanced HeadAddress. 3. TryAllocateRetryNow had no backoff, consuming 70%+ CPU in a tight spin loop when allocation kept returning RETRY_NOW. Fixes: - IsBeyondSizeLimitAndCanEvict: when addingPage=true, only check the hard page limit (MaxAllocatedPageCount). Soft limits are handled asynchronously by the resizer task, signaled via NeedToWaitForClose. - NeedToWaitForClose: separate hard limit (block) from soft limit (signal only). - IssueShiftAddress: always advance HeadAddress at MaxAllocatedPageCount, regardless of IsBeyondSizeLimit state. - TryAllocateRetryNow: progressive backoff (Thread.Yield -> Thread.Sleep(1)) to reduce CPU usage during unavoidable spins. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CircularDiskReadBuffer.OnBeginRecord compared recordFilePosition.word against bufferFilePosition.word using the full ulong word, but this word includes the piggybacked ObjectSizeHighByte field (bits 56-63) which differs between records and is unrelated to file position. This caused spurious assertion failures in Debug mode under heavy object eviction with storage tiering enabled. Additionally, use saturating subtraction via CurrentAddress (segment + offset only) to compute the increment, avoiding ulong underflow when positions have small misalignments from sector-alignment padding at partial flush boundaries. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pdate Root cause: CacheSizeTracker used SetLogSizeTracker() which only sets the logSizeTracker field but does NOT subscribe the tracker as onEvictionObserver. Without the eviction observer, OnNext is never called when pages are evicted. Instead, ResizeIfNeeded pre-computes heapTrimmedSize by scanning records BEFORE eviction, then subtracts it AFTER ShiftAddresses completes. But between the scan and subtraction, concurrent CopyUpdate operations on other threads can clear the source record's value object AND decrement heapSize via AddHeapSize(sizeAdjustment) — causing the same heap memory to be subtracted twice, driving heapSize negative (e.g. -80) and triggering a DebugAssertException that crashes the server session. Fix: Use SubscribeEvictions() instead of SetLogSizeTracker(). This sets both onEvictionObserver AND logSizeTracker, enabling the OnNext callback to subtract each record's CURRENT HeapMemorySize at actual eviction time. Records already cleared by concurrent CopyUpdate will have HeapMemorySize=0, avoiding the double-subtraction. Remove the stale pre-computed heapSize.Increment(-heapTrimmedSize) from ResizeIfNeeded since OnNext now handles it correctly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
badrishc
changed the title
Fix ListPushPopStressTest
Fix heap size double-subtraction causing server crash under low-memory object operations
…heapTrimmedSize) Remove unnecessary changes that were workarounds for symptoms of the heap accounting bug, not separate issues: - AllocatorBase.cs: revert NeedToWaitForClose, IssueShiftAddress, and TryAllocateRetryNow changes (the livelock was caused by heapSize growing unbounded due to double-subtraction, not by the soft limit logic itself) - CircularDiskReadBuffer.cs: revert OnBeginRecord assertion change (the assertion only fired when heapSize was wrong, causing incorrect eviction decisions that led to corrupted object log positions) - LogSizeTracker.cs: revert IsBeyondSizeLimitAndCanEvict and IncrementSize clamping changes (no longer needed with correct accounting) The actual fix is two lines in two files: 1. CacheSizeTracker.cs: SubscribeEvictions() instead of SetLogSizeTracker() 2. LogSizeTracker.cs: remove heapSize.Increment(-heapTrimmedSize) from ResizeIfNeeded (OnNext callback now handles it at eviction time) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters