AzureServiceBusAPMTests flakiness fix. by NachoEchevarria · Pull Request #8050 · DataDog/dd-trace-dotnet

Summary of changes

Fixes the flaky TestReceiveMessagesAsyncIntegration test in the Azure Service Bus APM integration tests by ensuring scheduled messages are delivered before queue cleanup.

Reason for change

The TestReceiveMessagesAsyncIntegration test was intermittently failing in CI with the error:

Failed Datadog.Trace.ClrProfiler.IntegrationTests.Azure.AzureServiceBusAPMTests.TestReceiveMessagesAsyncIntegration(packageVersion: "7.18.4", metadataSchemaVersion: "v1") [3 s]
 Error Message:
  Expected linkedSendSpan not to be <null> because Receive span 8970363549073655187 has link to span 14625431652207073697 in trace 7096167803062670649, but corresponding send span not found.
  Expected linkedSendSpan not to be <null> because Receive span 3961647348784637861 has link to span 14625431652207073697 in trace 7096167803062670649, but corresponding send span not found.

=== Receive Messages Test ===
Sent test message for receive with ID: 8ddfb05e-9d2b-4c4d-ad7a-34e991e34bfd
Attempting to receive message...
Received message ID: 03e0bf63934346329a9c336cbdd5ebc2, Body: Scheduled Message 0 from ScheduleMessages test
Message completed successfully
Purging existing messages from queue...
Purged 2 existing messages from queue
Resources handled successfully
Azure Service Bus APM Test Sample completed successfully

This occurred because the test was receiving messages scheduled by previous TestScheduleMessagesAsyncIntegration test runs, but the corresponding send spans were not available (they were from a different test execution).

The race condition occurred due to the timing of scheduled message delivery:

  1. TestScheduleMessagesAsync schedules messages for 1 second in the future (DateTimeOffset.Now.AddSeconds(1))
  2. The test completes immediately and calls PurgeQueue
  3. PurgeQueue waits 2 seconds trying to receive messages
  4. Critical issue: the Azure Service Bus emulator doesn't guarantee exact delivery timing
    - If delivery is delayed beyond the 2-second PurgeQueue window, messages escape cleanup
    - These orphaned scheduled messages get received by subsequent TestReceiveMessagesAsync test runs
    - The test fails because it can't find the corresponding send spans (they were in a different test run)
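
The timeline above can be sketched as a small timing model. This is only an illustration: PurgeRaceModel, EscapesPurge and all parameter names are hypothetical and do not appear in the PR, and the model assumes PurgeQueue keeps trying to receive for its whole window, which simplifies the real behavior.

```csharp
using System;

// Hypothetical model of the race described above; not code from the PR.
public static class PurgeRaceModel
{
    // Returns true when a scheduled message becomes visible only after
    // the purge window has already closed, so it survives cleanup.
    // All times are measured from the moment the message is scheduled.
    public static bool EscapesPurge(
        TimeSpan scheduleOffset,   // how far ahead the message is scheduled (1 s in the sample)
        TimeSpan deliveryDelay,    // extra emulator latency past the scheduled time (assumed value)
        TimeSpan waitBeforePurge,  // delay before PurgeQueue starts (zero before the fix)
        TimeSpan purgeWindow)      // how long PurgeQueue tries to receive (2 s)
    {
        TimeSpan visibleAt = scheduleOffset + deliveryDelay;
        TimeSpan purgeEndsAt = waitBeforePurge + purgeWindow;
        return visibleAt > purgeEndsAt;
    }
}
```

Under these assumptions, a message scheduled 1 second ahead with 0.5 seconds of emulator delay is visible at 1.5 seconds and is purged, while the same message with 1.5 seconds of delay is visible at 2.5 seconds and escapes the 2-second window. Adding a 2-second wait before the purge moves the end of the window to 4 seconds, so the delayed message is caught.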

The test usually passed because:

  • Most of the time: Messages were delivered within the 2-second PurgeQueue window and got cleaned up
  • Test execution order: Random test shuffling meant ReceiveMessages didn't always run immediately after ScheduleMessages
  • Test gaps: Other tests running in between provided extra time for delivery and cleanup

It failed when:

  • Message delivery was delayed beyond 2 seconds (emulator timing variance due to CPU load, network, etc.)
  • Multiple ScheduleMessages tests ran before ReceiveMessages
  • Random test ordering placed ReceiveMessages right after ScheduleMessages tests

Implementation details

Modified TestScheduleMessagesAsync in Samples.AzureServiceBus.APM/Program.cs to wait for scheduled messages to be delivered before returning:

  // Calculate remaining time and ensure we wait at least 2 seconds total
  var waitTime = scheduleTime - DateTimeOffset.Now;
  var totalWaitSeconds = Math.Max(2.0, waitTime.TotalSeconds + 1.0);
  await Task.Delay(TimeSpan.FromSeconds(totalWaitSeconds));

This ensures:

  • Scheduled messages are actually delivered before PurgeQueue runs
  • The shared Azure Service Bus emulator queue is clean before the next test starts
  • No interference between test runs regardless of execution order
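
To make the wait-time arithmetic concrete, the expression from the snippet can be isolated in a pure helper. ScheduleWait and Compute are hypothetical names introduced here for illustration; only the Math.Max(2.0, remaining + 1.0) expression comes from the PR.

```csharp
using System;

// Hypothetical helper wrapping the wait-time expression from the PR.
public static class ScheduleWait
{
    // Waits one second past the scheduled time, but never less than two seconds.
    public static TimeSpan Compute(DateTimeOffset scheduleTime, DateTimeOffset now)
    {
        TimeSpan remaining = scheduleTime - now;
        double totalSeconds = Math.Max(2.0, remaining.TotalSeconds + 1.0);
        return TimeSpan.FromSeconds(totalSeconds);
    }
}
```

With the sample's 1-second schedule offset, the remaining time is at most 1 second, so the result is the 2-second floor. If the scheduled time were 5 seconds away the wait would be 6 seconds, and if the scheduled time had already passed the floor of 2 seconds would still apply.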

Test coverage

This change fixes the existing integration test rather than adding new tests. The fix addresses the race condition at its source, the delivery timing of scheduled messages, rather than masking it with retries.

Other details