Improve detection of connect/handshake failures and the retry-policy by mgravell · Pull Request #3038

Improve detection of connect/handshake failures and the retry-policy by mgravell · Pull Request #3038 · StackExchange/StackExchange.Redis

This PR investigates and resolves #3036, but also addresses some other related failure modes.

The fundamental issue in ^^^ was:

in PhysicalBridge, when invoking the retry-policy, we pass a count of how many retries we've already done, which is used by the policy, for example to implement exponential decay
this counter is reset when the shouldResetConnectionRetryCount flag is true
the code was incorrectly setting this flag to true even in "establishing" cases, due to an otherwise-intentional case fall-thru

So the simple part of this fix is simply: move the assignment of shouldResetConnectionRetryCount so that we only reset the retry count after a successful handshake (ConnectionEstablished).

This fix was tested using the in-process-server infrastructure, with failure modes:

actively (fault) reject connections
passively (no completion) reject connections
accept connections but never return responses
accept connections but return garbage responses

This testing highlit a second failure mode, whereby the "never return responses" scenario failed to reconnect. This turned out to be because our message timeout logic only considered "async" scenarios, since "sync" timeout is active from the perspective of the caller (Monitor.Wait). This means that if the handshake "critical tracer" (the echo or ping that we use with a payload that gets validated) never responds... well, nothing happens. The handshake is neither sync nor async, because it is F+F without a result-box.

We therefore extend the timeout logic to explicitly identify this handshake scenario (via a new pseudo-CommandFlags value), and burn the connection accordingly.