Fix heap size double-subtraction causing server crash under low-memory object operations by badrishc · Pull Request #1663 · microsoft/garnet

@badrishc

AI review requested due to automatic review settings

April 2, 2026 01:02
…ap objects

Under lowMemory settings (LogMemorySize=2k, PageSize=512) with heavy object
operations, the server would livelock in TryAllocateRetryNow because:

1. IsBeyondSizeLimitAndCanEvict(addingPage:true) returned true for soft memory
   limits (heap objects >> target), blocking page allocations via NeedToWaitForClose.
   Evicting pages cannot reduce heap memory from objects that are immediately
   cloned back via CopyUpdate, creating an infinite RETRY_NOW spin.

2. IssueShiftAddress skipped HeadAddress advancement when IsBeyondSizeLimit was
   true, even at the hard MaxAllocatedPageCount limit. This prevented the
   TryAllocateRetryNow spin from resolving because no thread advanced HeadAddress.

3. TryAllocateRetryNow had no backoff, consuming 70%+ CPU in a tight spin loop
   when allocation kept returning RETRY_NOW.

Fixes:
- IsBeyondSizeLimitAndCanEvict: when addingPage=true, only check the hard page
  limit (MaxAllocatedPageCount). Soft limits are handled asynchronously by the
  resizer task, signaled via NeedToWaitForClose.
- NeedToWaitForClose: separate hard limit (block) from soft limit (signal only).
- IssueShiftAddress: always advance HeadAddress at MaxAllocatedPageCount,
  regardless of IsBeyondSizeLimit state.
- TryAllocateRetryNow: progressive backoff (Thread.Yield -> Thread.Sleep(1))
  to reduce CPU usage during unavoidable spins.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CircularDiskReadBuffer.OnBeginRecord compared recordFilePosition.word against
bufferFilePosition.word using the full ulong word, but this word includes the
piggybacked ObjectSizeHighByte field (bits 56-63) which differs between records
and is unrelated to file position. This caused spurious assertion failures in
Debug mode under heavy object eviction with storage tiering enabled.

Additionally, use saturating subtraction via CurrentAddress (segment + offset
only) to compute the increment, avoiding ulong underflow when positions have
small misalignments from sector-alignment padding at partial flush boundaries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@badrishc

…pdate

Root cause: CacheSizeTracker used SetLogSizeTracker() which only sets the
logSizeTracker field but does NOT subscribe the tracker as onEvictionObserver.
Without the eviction observer, OnNext is never called when pages are evicted.
Instead, ResizeIfNeeded pre-computes heapTrimmedSize by scanning records BEFORE
eviction, then subtracts it AFTER ShiftAddresses completes. But between the
scan and subtraction, concurrent CopyUpdate operations on other threads can
clear the source record's value object AND decrement heapSize via
AddHeapSize(sizeAdjustment) — causing the same heap memory to be subtracted
twice, driving heapSize negative (e.g. -80) and triggering a DebugAssertException
that crashes the server session.

Fix: Use SubscribeEvictions() instead of SetLogSizeTracker(). This sets both
onEvictionObserver AND logSizeTracker, enabling the OnNext callback to subtract
each record's CURRENT HeapMemorySize at actual eviction time. Records already
cleared by concurrent CopyUpdate will have HeapMemorySize=0, avoiding the
double-subtraction. Remove the stale pre-computed heapSize.Increment(-heapTrimmedSize)
from ResizeIfNeeded since OnNext now handles it correctly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@badrishc badrishc changed the title Fix ListPushPopStressTest Fix heap size double-subtraction causing server crash under low-memory object operations

Apr 3, 2026

@badrishc

…heapTrimmedSize)

Remove unnecessary changes that were workarounds for symptoms of the heap
accounting bug, not separate issues:
- AllocatorBase.cs: revert NeedToWaitForClose, IssueShiftAddress, and
  TryAllocateRetryNow changes (the livelock was caused by heapSize growing
  unbounded due to double-subtraction, not by the soft limit logic itself)
- CircularDiskReadBuffer.cs: revert OnBeginRecord assertion change (the
  assertion only fired when heapSize was wrong, causing incorrect eviction
  decisions that led to corrupted object log positions)
- LogSizeTracker.cs: revert IsBeyondSizeLimitAndCanEvict and IncrementSize
  clamping changes (no longer needed with correct accounting)

The actual fix is two lines in two files:
1. CacheSizeTracker.cs: SubscribeEvictions() instead of SetLogSizeTracker()
2. LogSizeTracker.cs: remove heapSize.Increment(-heapTrimmedSize) from
   ResizeIfNeeded (OnNext callback now handles it at eviction time)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>