v2.5 work: BacklogPolicy by NickCraver · Pull Request #1912 · StackExchange/StackExchange.Redis

@NickCraver

We're working on pub/sub - breaking it out explicitly from #1912. This relates to several issues and, in general, to handling resubscriptions on reconnect.

Issues: #1110, #1586, #1830, #1835

There are a few things in play we're investigating:
- [x] Subscription heartbeat not going over the subscription connection (due to `PING` and `GetBridge`)
- [x] Subscriptions not reconnecting at all (or potentially doing so and then unsubscribing, according to some issue reports)
- [x] Subscriptions always going to a single cluster node (due to `default(RedisKey)`)

Overall this set of changes:
- Completely restructures how RedisSubscriber works
  - No more `PendingSubscriptionState` (`Subscription` has the needed bits to reconnect)
  - Cleaner method topology (in `RedisSubscriber`, rather than `Subscriber`, `RedisSubscriber`, and `ConnectionMultiplexer`)
    - By placing these on `RedisSubscriber`, we can cleanly use `ExecuteSync/Async` bits, get proper profiling, etc.
  - Proper sync/async split (rather than `Wait()` in sync paths)
- Changes how subscriptions work
  - The `Subscription` object is added to the `ConnectionMultiplexer` tracking immediately, but the command itself makes a round trip to the server (unless `FireAndForget` is specified) before the call returns, for proper ordering like other commands.
  - No more `Task.Run()` loop - we now ensure reconnects as part of the handshake
  - Subscriptions are marked as not having a server the moment a disconnect is fired
    - Question: Should we have a throttle around this for massive numbers of connections, or async it?
- Changes how connecting works
  - The connection completion handler now fires when the _second_ bridge/connection completes. This means we won't have `interactive` connected while `subscription` is in an unknown state: both are connected before we fire the handler, so the moment we come back from connect, subscriptions are in business.
- Moves subscription tracking to a `ConcurrentDictionary`, since we only need limited locking around it and there's only one instance per multiplexer.
  - TODO: This needs eyes - we could shift it back; the implementation changed along the way, so this is no longer a critical detail.
- Fixes the `TrackSubscriptionsProcessor` - this was never setting the result, but no one noticed for 8 years because downstream code never cared.
  - Note: each `Subscription` has a processor instance (with minimal state) because when the subscription command comes back, _then_ we need to decide whether it successfully registered (if it didn't, we need to maintain that it has no successful server)
- `ConnectionMultiplexer` grew a `DefaultSubscriber` for running some commands without lots of method duplication, e.g. ensuring servers are connected.
- Overrides `GetHashSlot` on `CommandChannelBase` with the new `RedisChannel`-based methods so that it operates correctly
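As a rough, language-agnostic sketch of the disconnect/reconnect flow described above (the `SubscriptionRegistry` name and method shapes are illustrative only, not the actual C# types):

```python
import threading

class Subscription:
    """Stand-in for the library's Subscription: it carries enough state
    (channel + handler) to resubscribe on its own after a reconnect."""
    def __init__(self, channel, handler):
        self.channel = channel
        self.handler = handler
        self.current_server = None  # None => not known to be registered anywhere

class SubscriptionRegistry:
    """Sketch of the flow: disconnects clear current_server, and the
    handshake re-ensures any subscription that has no server."""
    def __init__(self):
        self._subs = {}  # channel -> Subscription
        self._lock = threading.Lock()

    def subscribe(self, channel, handler, server):
        sub = Subscription(channel, handler)
        with self._lock:
            self._subs[channel] = sub  # tracked immediately
        sub.current_server = server    # set once the command round-trips
        return sub

    def on_disconnect(self, server):
        # the moment a disconnect fires, affected subscriptions lose their server
        with self._lock:
            for sub in self._subs.values():
                if sub.current_server == server:
                    sub.current_server = None

    def ensure_subscriptions(self, server):
        # runs as part of the connection handshake (no Task.Run loop)
        with self._lock:
            pending = [s for s in self._subs.values() if s.current_server is None]
        for sub in pending:
            sub.current_server = server  # i.e. re-issue SUBSCRIBE here
```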

Not directly related changes which helped here:
- Better profiler helpers for tests and profiler logging in them
- Re-enables a few `PubSub` tests that were unreliable before...but, as it turns out, correctly so.

TODO: I'd like to add a few more test scenarios here:
- [x] Simple Subscribe/Publish/await Until/check pattern to ensure back-to-back subscribe/publish works well
- [x] Cluster connection failure and subscriptions moving to another node

To consider:
- [x] Subscription await loop from EnsureSubscriptionsAsync and connection impact on large reconnect situations
   - In a reconnect case, this runs in the background and only the affected nodes incur any latency...but still.
- [ ] TODOs in code around variadic commands, e.g. re-subscribing with far fewer commands by using `SUBSCRIBE <key1> <key2>...`
   - In cluster, we'd have to batch per slot...or just go for the first available node
   - ...but if we go for the first available node, the semantics of `IsConnected` are slightly off in the not-connected (`CurrentServer is null`) case: we'd report being connected to wherever the command _would_ go, even though that would be non-deterministic without hash-slot batching. I think this is really minor and shouldn't affect our decision.
- [x] `ConcurrentDictionary` vs. returning to locks around a `Dictionary`
   - ...but if we have to lock on firing consumption of handlers anyway, concurrency overhead is probably a wash.
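For the variadic/cluster point above, a minimal sketch of per-slot batching, assuming Redis Cluster's standard CRC16 (XMODEM) key-slot function over 16384 slots (the helper names are mine, and `{hash tag}` handling is omitted for brevity):

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 (XMODEM/CCITT: poly 0x1021, init 0), as used by Redis Cluster."""
    crc = 0
    for b in data:
        crc ^= b << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if (crc & 0x8000) else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(channel: str) -> int:
    # Redis Cluster maps keys/channels to one of 16384 slots
    # (note: a real implementation must honor {hash tags} first)
    return crc16_xmodem(channel.encode()) % 16384

def batch_by_slot(channels):
    """Group channel names per hash slot so one variadic
    SUBSCRIBE <ch1> <ch2> ... could be sent per slot."""
    batches = {}
    for ch in channels:
        batches.setdefault(key_slot(ch), []).append(ch)
    return batches
```

The CRC-16/XMODEM check value for `"123456789"` is `0x31C3`, which matches the value given in the Redis Cluster specification.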