perf: add multiplexing performance tests for AsyncMultiRangeDownloader by zhixiangli · Pull Request #16501 · googleapis/google-cloud-python

Overview

This PR introduces new microbenchmarks to measure and expose the performance bottleneck caused by lock contention in the AsyncMultiRangeDownloader. It provides a concrete way to compare the previous serialized implementation against the new multiplexed architecture.

Before vs. After: The Performance Gap

Before (Serialized via Lock)

In the previous implementation, download_ranges used a shared lock to prevent concurrent access to the bidirectional gRPC stream. Even with multiple coroutines, only one could "own" the stream at a time: the entire download cycle (Send -> Receive All) had to complete before another task could start.

Execution Flow:

sequenceDiagram
    participant C1 as Coroutine 1
    participant C2 as Coroutine 2
    participant S as gRPC Stream

    C1->>C1: Acquire Lock
    C1->>S: Send Requests
    S-->>C1: Receive Data (Streaming...)
    S-->>C1: End of Range
    C1->>C1: Release Lock
    
    Note over C2: Waiting for Lock...
    
    C2->>C2: Acquire Lock
    C2->>S: Send Requests
    S-->>C2: Receive Data (Streaming...)
    S-->>C2: End of Range
    C2->>C2: Release Lock
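
The serialized flow above can be approximated with a plain asyncio.Lock. This is only a minimal sketch: fake_download is a hypothetical stand-in for download_ranges, and the asyncio.sleep(0) calls stand in for awaiting data on the stream. It shows that the second coroutine cannot send anything until the first has finished its whole send-and-receive cycle.

```python
import asyncio


async def fake_download(name, lock, log):
    # Hypothetical stand-in for download_ranges: the lock is held for the
    # entire send -> receive-all cycle, as in the serialized flow.
    async with lock:
        log.append(f"{name}:send")
        for _ in range(2):
            await asyncio.sleep(0)  # yield to the event loop, like a stream read
            log.append(f"{name}:recv")


async def run_serialized():
    lock = asyncio.Lock()
    log = []
    await asyncio.gather(
        fake_download("C1", lock, log),
        fake_download("C2", lock, log),
    )
    return log


serialized_log = asyncio.run(run_serialized())
print(serialized_log)
```

Even though C1 yields to the event loop between reads, every C1 entry in the log precedes every C2 entry, because C2 is parked on the lock the whole time.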

After (Multiplexed Concurrent)

With the introduction of the _StreamMultiplexer, multiple coroutines can now share the same stream concurrently. Requests are interleaved, and a background receiver loop routes incoming data to the correct task using read_id.

Execution Flow:

sequenceDiagram
    participant C1 as Coroutine 1
    participant C2 as Coroutine 2
    participant M as Multiplexer
    participant S as gRPC Stream

    C1->>M: Send Requests
    M->>S: Forward Req 1
    C2->>M: Send Requests
    M->>S: Forward Req 2
    
    Note over C1,C2: Tasks wait on their own queues
    
    S-->>M: Data for C1
    M-->>C1: Route to Q1
    S-->>M: Data for C2
    M-->>C2: Route to Q2
    S-->>M: Data for C1
    M-->>C1: Route to Q1
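
The routing idea in the diagram above can be sketched as follows. This is not the actual _StreamMultiplexer; TinyMultiplexer, fake_stream, and consume are hypothetical names, and the stream is modeled as an async generator of (read_id, payload) pairs. The sketch assumes one asyncio.Queue per read_id and a None sentinel to signal that the stream has closed.

```python
import asyncio


class TinyMultiplexer:
    """Hypothetical illustration of read_id routing, not the real class."""

    def __init__(self, stream):
        self._stream = stream  # async iterator yielding (read_id, payload)
        self._queues = {}      # read_id -> asyncio.Queue

    def register(self, read_id):
        queue = asyncio.Queue()
        self._queues[read_id] = queue
        return queue

    async def receive_loop(self):
        # Background receiver: route each response to the queue of the
        # task that registered its read_id.
        async for read_id, payload in self._stream:
            await self._queues[read_id].put(payload)
        for queue in self._queues.values():
            await queue.put(None)  # sentinel: stream closed


async def fake_stream():
    # Interleaved responses for two read_ids, as in the diagram.
    for read_id, payload in [(1, b"a"), (2, b"x"), (1, b"b"), (2, b"y")]:
        await asyncio.sleep(0)
        yield read_id, payload


async def consume(queue):
    chunks = []
    while (chunk := await queue.get()) is not None:
        chunks.append(chunk)
    return b"".join(chunks)


async def run_multiplexed():
    mux = TinyMultiplexer(fake_stream())
    q1, q2 = mux.register(1), mux.register(2)
    receiver = asyncio.create_task(mux.receive_loop())
    results = await asyncio.gather(consume(q1), consume(q2))
    await receiver
    return results


multiplexed_results = asyncio.run(run_multiplexed())
print(multiplexed_results)
```

Each consumer waits only on its own queue, so neither task blocks the other while responses arrive interleaved on the shared stream.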

How the Benchmark Works

This PR adds a read_rand_multi_coro workload that:

Spawns multiple asynchronous tasks (coroutines).
Shares a single AsyncMultiRangeDownloader instance across all tasks.
Simulates the old serialized behavior by explicitly passing a shared_lock to download_ranges.
Measures total throughput (MiB/s) and resource utilization.
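
A rough sketch of the workload shape described above, under stated assumptions: FakeDownloader is a hypothetical stand-in for the shared AsyncMultiRangeDownloader, its download_ranges signature (a byte count plus an optional lock) is invented for illustration, and a 10 ms sleep replaces real network I/O. Only throughput is measured here; resource utilization is out of scope for the sketch.

```python
import asyncio
import time

MIB = 1024 * 1024


class FakeDownloader:
    """Hypothetical stand-in for a shared AsyncMultiRangeDownloader."""

    async def download_ranges(self, num_bytes, lock=None):
        # With a lock, the simulated download is serialized across tasks;
        # without one, all tasks overlap their waits.
        if lock is not None:
            async with lock:
                await asyncio.sleep(0.01)
        else:
            await asyncio.sleep(0.01)
        return num_bytes


async def run_workload(num_coros, serialize):
    downloader = FakeDownloader()  # one instance shared by all tasks
    shared_lock = asyncio.Lock() if serialize else None
    start = time.perf_counter()
    sizes = await asyncio.gather(
        *(downloader.download_ranges(MIB, shared_lock) for _ in range(num_coros))
    )
    elapsed = time.perf_counter() - start
    return sum(sizes) / MIB / elapsed  # throughput in MiB/s


serialized_mibps = asyncio.run(run_workload(16, serialize=True))
concurrent_mibps = asyncio.run(run_workload(16, serialize=False))
print(f"serialized: {serialized_mibps:.0f} MiB/s, concurrent: {concurrent_mibps:.0f} MiB/s")
```

The absolute numbers are meaningless since the I/O is simulated; the point is that the same 16-coroutine workload reports lower throughput when every call contends for one lock.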

Key Changes

test_reads.py: Refactored to support launching concurrent coroutines within a single worker process.
config.yaml: Added read_rand_multi_coro with 1 and 16 coroutines to stress the downloader.
config.py: Updated naming convention to include coroutine count (e.g., 16c) in reports for easier differentiation.
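
For illustration only, a naming helper along these lines would produce the coroutine-count suffix mentioned above. The function name report_name and the underscore-separated format are assumptions; the actual format used in config.py may differ.

```python
def report_name(workload: str, num_coros: int) -> str:
    # Hypothetical format: append the coroutine count with a "c" suffix,
    # e.g. "16c", so runs with different concurrency are easy to tell apart.
    return f"{workload}_{num_coros}c"


names = [report_name("read_rand_multi_coro", n) for n in (1, 16)]
print(names)
```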