[grid] Preventing potential deadlock in Distributor by VietND96 · Pull Request #17022 · SeleniumHQ/selenium

General
Improve test to correctly verify lock release

Improve the purgeDeadNodesShouldFireEventsOutsideLock test by attempting to
acquire a write lock instead of a read lock in the event listener. This provides
a more accurate verification that the lock is released before the event is
fired, as the current read lock check can pass even if the write lock is held.

java/test/org/openqa/selenium/grid/distributor/local/LocalGridModelDeadlockTest.java [183-196]

 // When event fires, try to call another model method that needs the lock
 events.addListener(
     NodeRemovedEvent.listener(
         status -> {
           try {
-            // This needs the read lock - if event is fired inside write lock,
-            // this would work due to lock downgrading, but we verify the pattern
-            model.getSnapshot();
+            // This needs the write lock. If the event is fired inside the
+            // purgeDeadNodes's write lock, this will block and time out.
+            // We use a dummy node ID since we just need to acquire the lock.
+            model.setAvailability(new NodeId(UUID.randomUUID()), UP);
             canAcquireLock.set(true);
           } catch (Exception e) {
             canAcquireLock.set(false);
           }
           eventFired.countDown();
         }));
  • Apply / Chat
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a flaw in the test logic where lock downgrading in ReentrantReadWriteLock could lead to a false positive, and proposes a more robust check by attempting to acquire a write lock, significantly improving the test's reliability.

Medium
Catch exceptions during removal events

Wrap the events.fire(new NodeRemovedEvent(node)) call within a try-catch block
inside the loop. This prevents a single failing event listener from stopping the
notification process for other removed nodes.

java/src/org/openqa/selenium/grid/distributor/local/LocalGridModel.java [290]

-removedNodes.forEach(node -> events.fire(new NodeRemovedEvent(node)));
+for (NodeStatus node : removedNodes) {
+  try {
+    events.fire(new NodeRemovedEvent(node));
+  } catch (RuntimeException e) {
+    LOG.warning(String.format(
+      "Error firing NodeRemovedEvent for node %s: %s",
+      node.getNodeId(), e.getMessage()));
+  }
+}
  • Apply / Chat
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly points out that an exception in an event listener could halt the processing of subsequent node removal events, and proposes a robust solution by wrapping each event fire in a try-catch block.

Medium
Preserve event firing order

Change removedNodes from a HashSet to an ArrayList to ensure that
NodeRemovedEvents are fired in a deterministic order.

java/src/org/openqa/selenium/grid/distributor/local/LocalGridModel.java [231]

-Set<NodeStatus> removedNodes = new HashSet<>();
+List<NodeStatus> removedNodes = new ArrayList<>();
  • Apply / Chat
Suggestion importance[1-10]: 2

__

Why: The suggestion to use a List for deterministic event order is valid, but the current implementation's use of a Set is also correct and has no functional downside, as the order of NodeRemovedEvents is not critical.

Low Learned
best practice
Don’t ignore thread failures

Capture any thrown exceptions and restore the interrupt flag on interruption,
then assert no exceptions occurred so the test cannot pass while threads
silently failed.

java/test/org/openqa/selenium/grid/distributor/local/LocalGridModelDeadlockTest.java [115-131]

+AtomicReference<Throwable> threadFailure = new AtomicReference<>();
+
 executor.submit(
     () -> {
       try {
         thread1Started.countDown();
         thread2Started.await(1, TimeUnit.SECONDS);
 
         synchronized (externalLock) {
-          // This simulates LocalNodeRegistry.updateNodeAvailability()
-          // which holds its lock and calls model.setAvailability()
           model.setAvailability(nodeId, UP);
         }
-      } catch (Exception e) {
-        // Ignore interruption
+      } catch (InterruptedException e) {
+        Thread.currentThread().interrupt();
+        threadFailure.compareAndSet(null, e);
+      } catch (Throwable t) {
+        threadFailure.compareAndSet(null, t);
       } finally {
         testComplete.countDown();
       }
     });
  • Apply / Chat
Suggestion importance[1-10]: 6

__

Why:
Relevant best practice - Do not swallow exceptions/interrupts in concurrency tests; propagate failures and restore interrupt status to avoid false positives.

Low
Clean up executors reliably

Move shutdown into a finally block and await termination (best-effort) so
threads are reliably cleaned up even when assertions fail or time out.

java/test/org/openqa/selenium/grid/distributor/local/LocalGridModelDeadlockTest.java [149-154]

-boolean completed = testComplete.await(5, TimeUnit.SECONDS);
-executor.shutdownNow();
+boolean completed;
+try {
+  completed = testComplete.await(5, TimeUnit.SECONDS);
+} finally {
+  executor.shutdownNow();
+  executor.awaitTermination(5, TimeUnit.SECONDS);
+}
 
 if (!completed) {
   deadlockDetected.set(true);
 }
  • Apply / Chat
Suggestion importance[1-10]: 5

__

Why:
Relevant best practice - Ensure executor lifecycle is safely managed in tests (shutdown in finally and await termination) to prevent thread leaks and flakiness.

Low
  • Update