feat: continuous flushing strategy for high throughput functions by astuyve · Pull Request #684 · DataDog/datadog-lambda-extension

added 27 commits

April 29, 2025 15:34
…sions. Initial algorithm to determine if we should flush continuously. Refactor main loop to adhere to the FlushDecision, and redrive attempts at shutdown.

@astuyve

duncanista

@astuyve astuyve deleted the aj/ship-no-await branch

June 3, 2025 10:46

astuyve pushed a commit that referenced this pull request

Jul 16, 2025

duncanpharvey pushed a commit that referenced this pull request

Mar 10, 2026
This is a heavy refactor and new feature.
- Introduces FlushDecision and separates it from FlushStrategy
- Cleans up FlushControl logic and methods
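
One way to picture the split between a configured strategy and a per-cycle decision is sketched below. This is only an illustration: the variants, the `decide` function and its parameters are hypothetical and are not taken from the extension's code.

```rust
// Hypothetical sketch: the variants and `decide` are illustrative only and
// do not mirror the extension's real FlushStrategy/FlushDecision types.

/// What is configured up front; it does not change per invocation.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FlushStrategy {
    End,
    Periodic { interval_ms: u64 },
    Continuous,
}

/// What the main loop should do on this particular cycle.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FlushDecision {
    /// Flush and wait for the flush to complete.
    Blocking,
    /// Start a flush but do not wait for it.
    NonBlocking,
    /// Do nothing this cycle.
    Skip,
}

/// Map the configured strategy plus current state to a single decision.
pub fn decide(
    strategy: FlushStrategy,
    ms_since_last_flush: u64,
    last_flush_failed: bool,
) -> FlushDecision {
    match strategy {
        FlushStrategy::End => FlushDecision::Blocking,
        FlushStrategy::Periodic { interval_ms } => {
            if ms_since_last_flush >= interval_ms {
                FlushDecision::Blocking
            } else {
                FlushDecision::Skip
            }
        }
        // Assumption: after a failed flush we wait for the next one to finish.
        FlushStrategy::Continuous => {
            if last_flush_failed {
                FlushDecision::Blocking
            } else {
                FlushDecision::NonBlocking
            }
        }
    }
}
```

With this shape the main loop only matches on `FlushDecision`, so new strategies can be added without touching the loop itself.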

It also adds the ability to flush telemetry across multiple serial
Lambda invocations, using the new `continuous` strategy.

This is a huge win for busy functions, as seen in our test fleet: the
p99/max latency drops precipitously, which also pulls the average down
sharply. It also reduces the number of cold starts encountered during
scale-up events, which further reduces latency along with costs:

![image](https://github.com/user-attachments/assets/14851e22-327d-43b0-8246-5780cfbf6ef7)

Technical implementation:
We spawn the flush tasks and collect their handles. On the next flush
cycle, the two periodic strategies check those handles for errors or
unresolved futures. If any are found, we switch to the `periodic`
strategy to ensure flushing completes successfully.
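
A minimal sketch of the spawn-and-check pattern, under stated assumptions: it uses `std::thread` handles in place of async tasks so it runs with the standard library alone, and `Flusher`, `Strategy`, `spawn_flush` and `check_previous_cycle` are hypothetical names, not the extension's API.

```rust
use std::thread::{self, JoinHandle};

// Hypothetical types for illustration; not the extension's real API.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Strategy {
    Continuous,
    Periodic,
}

pub struct Flusher {
    pub strategy: Strategy,
    pending: Vec<JoinHandle<Result<(), String>>>,
}

impl Flusher {
    pub fn new() -> Self {
        Flusher { strategy: Strategy::Continuous, pending: Vec::new() }
    }

    /// Start a flush in the background and keep its handle instead of waiting.
    pub fn spawn_flush<F>(&mut self, send: F)
    where
        F: FnOnce() -> Result<(), String> + Send + 'static,
    {
        self.pending.push(thread::spawn(send));
    }

    /// True if any flush from the previous cycle has not finished yet.
    pub fn has_unresolved(&self) -> bool {
        self.pending.iter().any(|h| !h.is_finished())
    }

    /// Called at the start of the next flush cycle: inspect the old handles
    /// and fall back to `Periodic` if any flush failed or was still running.
    pub fn check_previous_cycle(&mut self) {
        let mut healthy = !self.has_unresolved();
        for handle in self.pending.drain(..) {
            // `join` waits for stragglers so no telemetry is dropped.
            match handle.join() {
                Ok(Ok(())) => {}
                _ => healthy = false, // flush error or panicked worker
            }
        }
        if !healthy {
            self.strategy = Strategy::Periodic;
        }
    }
}
```

In this sketch a single failed or slow flush is enough to leave the continuous strategy; a real implementation might tolerate some failures before switching.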

We don't adapt to the periodic strategy unless the last 20 invocations
occurred within the `config.flush_timeout` value, which has been
increased by default. This is a naive implementation. A better one would
be to calculate the first derivative of the invocation periodicity. If
the rate is increasing, we can adapt to the continuous strategy. If the
rate slows, we should fall back to the periodic strategy.
<img width="807" alt="image"
src="https://github.com/user-attachments/assets/d3c25419-f1da-4774-975f-0e254047b9b7"
/>
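
The lookback check and the suggested rate-based refinement could be sketched roughly as follows. Assumptions are labeled in the comments: `InvocationWindow`, `is_busy` and `rate_is_increasing` are hypothetical names, timestamps are in milliseconds, and "occurred within `config.flush_timeout`" is read here as "the oldest of the last 20 invocations is no older than the timeout".

```rust
use std::collections::VecDeque;

/// Number of recent invocations considered (20, per the description above).
const LOOKBACK: usize = 20;

// Hypothetical helper for illustration; not the extension's real type.
pub struct InvocationWindow {
    timestamps_ms: VecDeque<u64>,
}

impl InvocationWindow {
    pub fn new() -> Self {
        Self { timestamps_ms: VecDeque::with_capacity(LOOKBACK) }
    }

    /// Record an invocation, keeping only the most recent `LOOKBACK` entries.
    pub fn record(&mut self, now_ms: u64) {
        if self.timestamps_ms.len() == LOOKBACK {
            self.timestamps_ms.pop_front();
        }
        self.timestamps_ms.push_back(now_ms);
    }

    /// Naive check (assumed interpretation): a full window has been seen and
    /// its oldest entry is no older than `flush_timeout_ms`.
    pub fn is_busy(&self, now_ms: u64, flush_timeout_ms: u64) -> bool {
        if self.timestamps_ms.len() < LOOKBACK {
            return false;
        }
        match self.timestamps_ms.front() {
            Some(&oldest) => now_ms.saturating_sub(oldest) <= flush_timeout_ms,
            None => false,
        }
    }

    /// Rough stand-in for the "first derivative" idea: the invocation rate is
    /// rising if the mean gap in the newer half is smaller than in the older.
    pub fn rate_is_increasing(&self) -> bool {
        let t: Vec<u64> = self.timestamps_ms.iter().copied().collect();
        if t.len() < 4 {
            return false;
        }
        let mid = t.len() / 2;
        let last = t.len() - 1;
        let older_gap = t[mid].saturating_sub(t[0]) as f64 / mid as f64;
        let newer_gap = t[last].saturating_sub(t[mid]) as f64 / (last - mid) as f64;
        newer_gap < older_gap
    }
}
```

Comparing two halves of the window is only one crude way to estimate a trend; a smoothed rate estimate would be less sensitive to a single outlier gap.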

The existing implementation is cautious in that we could definitely
adapt sooner but don't.


Todo: add a feature flag for continuous flushing?