perf: add multiplexing performance tests for AsyncMultiRangeDownloader by zhixiangli · Pull Request #16501 · googleapis/google-cloud-python
Overview
This PR introduces new microbenchmarks to measure and expose the performance bottleneck caused by lock contention in the AsyncMultiRangeDownloader. It provides a concrete way to compare the previous serialized implementation against the new multiplexed architecture.
Before vs. After: The Performance Gap
Before (Serialized via Lock)
In the previous implementation, download_ranges used a shared lock to prevent concurrent access to the bidi-gRPC stream. This meant that even with multiple coroutines, only one could "own" the stream at a time. The entire download cycle (Send -> Receive All) had to complete before another task could start.
Execution Flow:
sequenceDiagram
participant C1 as Coroutine 1
participant C2 as Coroutine 2
participant S as gRPC Stream
C1->>C1: Acquire Lock
C1->>S: Send Requests
S-->>C1: Receive Data (Streaming...)
S-->>C1: End of Range
C1->>C1: Release Lock
Note over C2: Waiting for Lock...
C2->>C2: Acquire Lock
C2->>S: Send Requests
S-->>C2: Receive Data (Streaming...)
S-->>C2: End of Range
C2->>C2: Release Lock
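
The serialized flow above can be reproduced in a few lines of asyncio. This is a minimal sketch, not the downloader's code: `fake_download` is a hypothetical stand-in for a lock-guarded `download_ranges` call, and the sleep stands in for streaming data off the gRPC stream.

```python
import asyncio


async def fake_download(name, lock, log):
    # Hypothetical stand-in for a lock-guarded download_ranges call.
    async with lock:
        log.append(f"{name}:send")
        await asyncio.sleep(0.01)  # simulated "Receive Data (Streaming...)"
        log.append(f"{name}:done")


async def run_serialized():
    lock = asyncio.Lock()
    log = []
    # Both coroutines are started together, but the lock lets only one
    # of them own the (simulated) stream at a time.
    await asyncio.gather(
        fake_download("C1", lock, log),
        fake_download("C2", lock, log),
    )
    return log


if __name__ == "__main__":
    print(asyncio.run(run_serialized()))
    # ['C1:send', 'C1:done', 'C2:send', 'C2:done']
```

C2 cannot send anything until C1 has finished its whole Send -> Receive All cycle, which is the contention the benchmark is meant to expose.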
After (Multiplexed Concurrent)
With the introduction of the _StreamMultiplexer, multiple coroutines can now share the same stream concurrently. Requests are interleaved, and a background receiver loop routes incoming data to the correct task using read_id.
Execution Flow:
sequenceDiagram
participant C1 as Coroutine 1
participant C2 as Coroutine 2
participant M as Multiplexer
participant S as gRPC Stream
C1->>M: Send Requests
M->>S: Forward Req 1
C2->>M: Send Requests
M->>S: Forward Req 2
Note over C1,C2: Tasks wait on their own queues
S-->>M: Data for C1
M-->>C1: Route to Q1
S-->>M: Data for C2
M-->>C2: Route to Q2
S-->>M: Data for C1
M-->>C1: Route to Q1
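
The routing step can be sketched as follows. `ToyMultiplexer` is a hypothetical illustration of the idea, not the actual _StreamMultiplexer: the bidi stream is modeled as an `asyncio.Queue` of `(read_id, chunk)` pairs, and `None` is an assumed end-of-stream marker.

```python
import asyncio


class ToyMultiplexer:
    """Hypothetical illustration only; not the library's _StreamMultiplexer."""

    def __init__(self, stream):
        self._stream = stream  # asyncio.Queue standing in for the gRPC stream
        self._queues = {}      # read_id -> per-task queue

    def register(self, read_id):
        queue = asyncio.Queue()
        self._queues[read_id] = queue
        return queue

    async def receive_loop(self):
        # Background receiver: route each chunk to its owner by read_id.
        while True:
            item = await self._stream.get()
            if item is None:  # assumed end-of-stream marker
                for queue in self._queues.values():
                    queue.put_nowait(None)
                return
            read_id, chunk = item
            self._queues[read_id].put_nowait(chunk)


async def consume(queue):
    # Each task waits only on its own queue.
    chunks = []
    while (chunk := await queue.get()) is not None:
        chunks.append(chunk)
    return b"".join(chunks)


async def demo():
    stream = asyncio.Queue()
    mux = ToyMultiplexer(stream)
    q1, q2 = mux.register(1), mux.register(2)
    # Interleaved responses, as in the diagram: C1, C2, C1.
    for item in [(1, b"a"), (2, b"x"), (1, b"b"), None]:
        stream.put_nowait(item)
    receiver = asyncio.create_task(mux.receive_loop())
    results = await asyncio.gather(consume(q1), consume(q2))
    await receiver
    return tuple(results)


if __name__ == "__main__":
    print(asyncio.run(demo()))  # (b'ab', b'x')
```

No task holds the stream exclusively; the only shared piece is the receiver loop, which does a dictionary lookup per message.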
How the Benchmark Works
This PR adds a read_rand_multi_coro workload that:
- Spawns multiple asynchronous tasks (coroutines).
- Shares a single AsyncMultiRangeDownloader instance across all tasks.
- Simulates the old serialized behavior by explicitly passing a shared_lock to download_ranges.
- Measures total throughput (MiB/s) and resource utilization.
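
A rough sketch of the workload shape described above, under stated assumptions: `FakeDownloader` is a hypothetical stand-in for AsyncMultiRangeDownloader, its `download_ranges` signature (a list of `(offset, length)` pairs plus an optional `shared_lock`) is assumed for illustration, and a fixed sleep replaces real network I/O. The actual benchmark in test_reads.py may be structured differently.

```python
import asyncio
import time

MIB = 1024 * 1024


class FakeDownloader:
    """Hypothetical stand-in for AsyncMultiRangeDownloader."""

    async def _fetch(self, ranges):
        await asyncio.sleep(0.01)  # simulated network latency
        return sum(length for _offset, length in ranges)

    async def download_ranges(self, ranges, shared_lock=None):
        # With a shared lock, calls are serialized (old behavior);
        # without one, they overlap (multiplexed behavior).
        if shared_lock is None:
            return await self._fetch(ranges)
        async with shared_lock:
            return await self._fetch(ranges)


async def run_workload(num_coros, use_lock):
    downloader = FakeDownloader()  # one instance shared by all tasks
    lock = asyncio.Lock() if use_lock else None
    ranges = [(0, MIB)]
    start = time.perf_counter()
    sizes = await asyncio.gather(
        *(downloader.download_ranges(ranges, shared_lock=lock)
          for _ in range(num_coros))
    )
    elapsed = time.perf_counter() - start
    total_bytes = sum(sizes)
    return total_bytes, total_bytes / MIB / elapsed  # bytes, MiB/s


if __name__ == "__main__":
    for use_lock in (True, False):
        total, mib_per_s = asyncio.run(run_workload(16, use_lock))
        print(f"lock={use_lock}: {total // MIB} MiB at {mib_per_s:.0f} MiB/s")
```

The throughput figures printed by this toy reflect only the simulated sleep, so they say nothing about real download speeds; the point is that the same workload can be run with and without the lock and compared.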
Key Changes
- test_reads.py: Refactored to support launching concurrent coroutines within a single worker process.
- config.yaml: Added read_rand_multi_coro with 1 and 16 coroutines to stress the downloader.
- config.py: Updated naming convention to include coroutine count (e.g., 16c) in reports for easier differentiation.