fix: handle http2 goaway race conditions by bayoumi17m · Pull Request #1054

fix: handle http2 goaway race conditions by bayoumi17m · Pull Request #1054 · encode/httpcore

Fix HTTP/2 GOAWAY Race Conditions

⚠️GenAI was used in the development of this PR. It was all reviewed and tested manually by a human⚠️

Problem

When using HTTPX with HTTP/2 under high concurrency, users encounter errors where connections are closed for streams beyond the last_stream_id. This forces applications to implement retry logic for what should be transparent library behavior.

The root cause is a set of race conditions between h2's state machine and httpcore's event processing:

State machine race: h2 transitions to CLOSED before httpcore processes the GOAWAY event
Disconnect without context: Server disconnects immediately after GOAWAY, raising generic RemoteProtocolError("Server disconnected") with no retry opportunity
Insufficient exception context: ConnectionNotAvailable didn't convey why the connection was unavailable, preventing informed retry decisions

A user proposed changing > to >= in the stream ID comparison (discussion #1041), but this would violate RFC 7540 Section 6.8:

"The last stream identifier in the GOAWAY frame contains the highest-numbered stream identifier for which the sender of the GOAWAY frame might have taken some action on..."

A stream matching last_stream_id may have already been processed, making automatic retry unsafe for non-idempotent operations.

Solution

This PR introduces proper GOAWAY handling that maintains RFC compliance while enabling safe automatic retries:

1. New `ConnectionGoingAway` Exception

A dedicated exception (subclass of ConnectionNotAvailable) that carries GOAWAY context:

last_stream_id / request_stream_id for RFC 7540 stream comparison
error_code to distinguish graceful shutdown (NO_ERROR) from errors
headers_sent / body_sent to track request phase at time of GOAWAY
Helper properties: is_safe_to_retry, is_graceful_shutdown, may_have_side_effects

This exception is exported publicly, allowing applications to implement idempotency-aware retry logic when needed.

2. New `DRAINING` Connection State

Connections receiving a graceful GOAWAY (NO_ERROR) transition to DRAINING rather than CLOSED:

Allows in-flight streams to complete normally
Rejects new requests (connection reports is_available() = False)
Connection expires once all streams complete

This fits naturally into the existing HTTPConnectionState enum pattern and makes state transitions easier to reason about.

3. Request Phase Tracking

Each stream tracks headers_sent and body_sent to determine potential side effects:

If GOAWAY arrives before headers are sent → definitely no side effects
If headers/body were sent → server may have processed the request

This information flows through ConnectionGoingAway.may_have_side_effects for retry decisions.

4. Connection Pool Retry Logic

The connection pool now handles ConnectionGoingAway with RFC-compliant retry semantics:

Condition	Behavior
`stream_id > last_stream_id`	Auto-retry (guaranteed unprocessed per RFC 7540)
Graceful shutdown + no side effects	Auto-retry (likely safe)
Otherwise	Raise `RemoteProtocolError` with context for application decision

Behavior Summary

Scenario	Before	After
Stream ID > last_stream_id	`ConnectionNotAvailable` (retry worked but was implicit)	Auto-retry with explicit RFC justification
Graceful shutdown, headers not sent	`RemoteProtocolError` or `ConnectionNotAvailable`	Auto-retry
Potentially processed request	Generic error	`RemoteProtocolError` with cause chain to `ConnectionGoingAway`
h2 CLOSED race condition	Generic protocol error	Proper `ConnectionGoingAway` with conservative assumptions
Server disconnect after GOAWAY	`RemoteProtocolError("Server disconnected")`	`ConnectionGoingAway` with retry context

Design Decisions

Why a new connection state vs. a boolean flag?
The DRAINING state fits the existing HTTPConnectionState pattern and makes state transitions explicit. A boolean would work but adds cognitive overhead when reasoning about valid state combinations.

Why track headers and body separately?
Both are relevant for side-effect determination. Applications can use this granular information - for example, a read-only GET that sent headers but no body might be safer to retry than a POST with body sent.

Why raise RemoteProtocolError instead of ConnectionGoingAway for potentially-processed requests?
Balances API simplicity with power. Most users catch RemoteProtocolError; advanced users can inspect __cause__ to access the full ConnectionGoingAway context. Since ConnectionGoingAway extends ConnectionNotAvailable, it's also catchable directly for those who need it.

Why conservative assumptions when h2 is CLOSED without GOAWAY event?
When the connection is closed but we haven't processed a GOAWAY event, we can't know the true last_stream_id. Assuming the current stream may have been processed is the safe default - better to surface an error than silently retry a potentially-processed request.

Related Issues

Test Plan

Unit tests for ConnectionGoingAway exception properties
Unit tests for DRAINING state transitions
Integration tests simulating GOAWAY during request
Integration tests for race condition scenarios (h2 CLOSED before event processing)
Verify existing tests still pass

Checklist

I understand that this PR may be closed in case there was no previous discussion. (This doesn't apply to typos!)
I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
I've updated the documentation accordingly.