Improve detection of connect/handshake failures and the retry-policy by mgravell · Pull Request #3038 · StackExchange/StackExchange.Redis
This PR investigates and resolves #3036, but also addresses some other related failure modes.
The fundamental issue in ^^^ was:
- in
PhysicalBridge, when invoking the retry-policy, we pass a count of how many retries we've already done, which is used by the policy, for example to implement exponential decay - this counter is reset when the
shouldResetConnectionRetryCountflag istrue - the code was incorrectly setting this flag to
trueeven in "establishing" cases, due to an otherwise-intentionalcasefall-thru
So the simple part of this fix is simply: move the assignment of shouldResetConnectionRetryCount so that we only reset the retry count after a successful handshake (ConnectionEstablished).
This fix was tested using the in-process-server infrastructure, with failure modes:
- actively (fault) reject connections
- passively (no completion) reject connections
- accept connections but never return responses
- accept connections but return garbage responses
This testing highlit a second failure mode, whereby the "never return responses" scenario failed to reconnect. This turned out to be because our message timeout logic only considered "async" scenarios, since "sync" timeout is active from the perspective of the caller (Monitor.Wait). This means that if the handshake "critical tracer" (the echo or ping that we use with a payload that gets validated) never responds... well, nothing happens. The handshake is neither sync nor async, because it is F+F without a result-box.
We therefore extend the timeout logic to explicitly identify this handshake scenario (via a new pseudo-CommandFlags value), and burn the connection accordingly.