fix: handle http2 goaway race conditions by bayoumi17m · Pull Request #1054 · encode/httpcore
Fix HTTP/2 GOAWAY Race Conditions
⚠️GenAI was used in the development of this PR. It was all reviewed and tested manually by a human⚠️
Problem
When using HTTPX with HTTP/2 under high concurrency, users encounter errors where connections are closed for streams beyond the last_stream_id. This forces applications to implement retry logic for what should be transparent library behavior.
The root cause is a set of race conditions between h2's state machine and httpcore's event processing:
- State machine race: h2 transitions to CLOSED before httpcore processes the GOAWAY event
- Disconnect without context: Server disconnects immediately after GOAWAY, raising generic
RemoteProtocolError("Server disconnected")with no retry opportunity - Insufficient exception context:
ConnectionNotAvailabledidn't convey why the connection was unavailable, preventing informed retry decisions
A user proposed changing > to >= in the stream ID comparison (discussion #1041), but this would violate RFC 7540 Section 6.8:
"The last stream identifier in the GOAWAY frame contains the highest-numbered stream identifier for which the sender of the GOAWAY frame might have taken some action on..."
A stream matching last_stream_id may have already been processed, making automatic retry unsafe for non-idempotent operations.
Solution
This PR introduces proper GOAWAY handling that maintains RFC compliance while enabling safe automatic retries:
1. New ConnectionGoingAway Exception
A dedicated exception (subclass of ConnectionNotAvailable) that carries GOAWAY context:
last_stream_id/request_stream_idfor RFC 7540 stream comparisonerror_codeto distinguish graceful shutdown (NO_ERROR) from errorsheaders_sent/body_sentto track request phase at time of GOAWAY- Helper properties:
is_safe_to_retry,is_graceful_shutdown,may_have_side_effects
This exception is exported publicly, allowing applications to implement idempotency-aware retry logic when needed.
2. New DRAINING Connection State
Connections receiving a graceful GOAWAY (NO_ERROR) transition to DRAINING rather than CLOSED:
- Allows in-flight streams to complete normally
- Rejects new requests (connection reports
is_available() = False) - Connection expires once all streams complete
This fits naturally into the existing HTTPConnectionState enum pattern and makes state transitions easier to reason about.
3. Request Phase Tracking
Each stream tracks headers_sent and body_sent to determine potential side effects:
- If GOAWAY arrives before headers are sent → definitely no side effects
- If headers/body were sent → server may have processed the request
This information flows through ConnectionGoingAway.may_have_side_effects for retry decisions.
4. Connection Pool Retry Logic
The connection pool now handles ConnectionGoingAway with RFC-compliant retry semantics:
| Condition | Behavior |
|---|---|
stream_id > last_stream_id |
Auto-retry (guaranteed unprocessed per RFC 7540) |
| Graceful shutdown + no side effects | Auto-retry (likely safe) |
| Otherwise | Raise RemoteProtocolError with context for application decision |
Behavior Summary
| Scenario | Before | After |
|---|---|---|
| Stream ID > last_stream_id | ConnectionNotAvailable (retry worked but was implicit) |
Auto-retry with explicit RFC justification |
| Graceful shutdown, headers not sent | RemoteProtocolError or ConnectionNotAvailable |
Auto-retry |
| Potentially processed request | Generic error | RemoteProtocolError with cause chain to ConnectionGoingAway |
| h2 CLOSED race condition | Generic protocol error | Proper ConnectionGoingAway with conservative assumptions |
| Server disconnect after GOAWAY | RemoteProtocolError("Server disconnected") |
ConnectionGoingAway with retry context |
Design Decisions
Why a new connection state vs. a boolean flag?
The DRAINING state fits the existing HTTPConnectionState pattern and makes state transitions explicit. A boolean would work but adds cognitive overhead when reasoning about valid state combinations.
Why track headers and body separately?
Both are relevant for side-effect determination. Applications can use this granular information - for example, a read-only GET that sent headers but no body might be safer to retry than a POST with body sent.
Why raise RemoteProtocolError instead of ConnectionGoingAway for potentially-processed requests?
Balances API simplicity with power. Most users catch RemoteProtocolError; advanced users can inspect __cause__ to access the full ConnectionGoingAway context. Since ConnectionGoingAway extends ConnectionNotAvailable, it's also catchable directly for those who need it.
Why conservative assumptions when h2 is CLOSED without GOAWAY event?
When the connection is closed but we haven't processed a GOAWAY event, we can't know the true last_stream_id. Assuming the current stream may have been processed is the safe default - better to surface an error than silently retry a potentially-processed request.
Related Issues
- https://github.com/encode/httpx/issues/3324
- ConnectionTerminated error_code:ErrorCodes.NO_ERROR #45
- Graceful handling for HTTP/2 GoAway frames. #730
- HTTP2 GoAway frames don't seem to be handled gracefully in keepalive connections #1041
Test Plan
- Unit tests for
ConnectionGoingAwayexception properties - Unit tests for DRAINING state transitions
- Integration tests simulating GOAWAY during request
- Integration tests for race condition scenarios (h2 CLOSED before event processing)
- Verify existing tests still pass
Checklist
- I understand that this PR may be closed in case there was no previous discussion. (This doesn't apply to typos!)
- I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
- I've updated the documentation accordingly.