Comparing v89...v90 · DataDog/datadog-lambda-extension
## TL;DR

- Support deduping spans by span_id, which fixes the trace stats over-count issue for Node.js

## Problem

The Node.js tracer sometimes sends duplicate traces to the extension, causing an over-count of trace stats. In our tests, the over-count is usually 1–4 for 50, 500 or 5000 invocations.

This only happens when the "default" flushing strategy (which uses the "continuous" strategy) is used, and doesn't happen when the "end" strategy is used.

## Cause of problem

I think it's similar to a known problem of continuous flushing.

- The known problem is that when the Lambda runtime is frozen, the connection between the **extension** and the **DD endpoint** can time out, causing a data flush failure.
- In the trace stats case, the problem is that when the Lambda runtime is frozen, the connection between the **tracer** and the **extension** can time out. The extension receives the request from the tracer, then freezes before sending a response. The tracer's request times out, so the tracer resends the trace.

This doesn't happen with the END flush strategy because in that case, after the Lambda handler finishes, the extension still needs to flush the data, so it doesn't freeze as quickly and has enough time to respond to the tracer.

## Testing

### Steps

- Build a test extension layer
- Run e2e tests, including:
  - Install it on various Lambda functions
  - Invoke these functions with various traffic patterns
  - Check the trace stats result

### Result

**Before:**
- Over-count happens for (1) Node.js runtime + (2) "Sampling" test, which uses the default flush strategy
- (vs expected 50)

<img width="1336" height="277" alt="Screenshot 2025-11-20 at 9 30 00 PM" src="https://github.com/user-attachments/assets/3bf33c26-8996-43e4-8df5-7313247926b3" />

**After:**
- No over-count (vs expected 5000)

<img width="1447" height="421" alt="Screenshot 2025-11-20 at 9 25 32 PM" src="https://github.com/user-attachments/assets/21576483-dbdc-499f-acc8-1801d3927b0b" />

## Options considered

1. **At most once**: Disable retry in tracer, at least for Lambda.
2. **At least once**
   - 2.1 Do nothing. Call this out as a known limitation.
   - 2.2 Treat tracer as VIP. Before calling /next, make sure tracer's requests have been responded to. This may cause a regression in invocation duration, especially when volume is high.
3. **Exactly once**: Implement dedup in extension, by trace_id.

This PR chooses 3 because it's the easiest.

## Notes

Thanks @astuyve @rochdev @purple4reina for discussion. Thanks Cursor for writing most of the code.

The under-count issue will be addressed separately.

Related issue: #688
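
For readers who want a concrete picture of option 3, here is a minimal sketch of a bounded dedup cache. It is not the code in this PR: `Deduper`, `check_and_insert`, `count_unique`, and the `capacity` parameter are all hypothetical names, and the key is treated as an opaque `u64` (the description mentions both span_id and trace_id, so the sketch does not assume either). It assumes a resent payload carries the same ID as the original and arrives while that ID is still remembered.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical bounded cache of recently seen IDs (illustration only).
struct Deduper {
    capacity: usize,
    seen: HashSet<u64>,   // fast membership check
    order: VecDeque<u64>, // insertion order, used for FIFO eviction
}

impl Deduper {
    fn new(capacity: usize) -> Self {
        Deduper {
            capacity,
            seen: HashSet::with_capacity(capacity),
            order: VecDeque::with_capacity(capacity),
        }
    }

    /// Returns true if `id` has not been seen recently (count it),
    /// false if it is a duplicate (skip it).
    fn check_and_insert(&mut self, id: u64) -> bool {
        if self.seen.contains(&id) {
            return false;
        }
        if self.capacity == 0 {
            return true; // nothing can be remembered, so nothing is deduped
        }
        // Evict the oldest ID once the cache is full, so memory stays bounded.
        if self.order.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.seen.remove(&oldest);
            }
        }
        self.seen.insert(id);
        self.order.push_back(id);
        true
    }
}

/// Counts how many incoming IDs would be kept after dedup.
fn count_unique(ids: &[u64], capacity: usize) -> usize {
    let mut dedup = Deduper::new(capacity);
    ids.iter().filter(|id| dedup.check_and_insert(**id)).count()
}
```

With this sketch, a retry such as `[7, 7, 8, 7]` is counted twice rather than four times when the cache is large enough to hold both IDs. The bounded size is a design trade-off: a duplicate that arrives after its ID has been evicted would be counted again, so the capacity would need to cover the window in which tracer retries can occur.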