feat(apm): aggregate and flush data accordingly for DSM, Profiling, LLMObs, and Live Debugger by duncanista · Pull Request #737 · DataDog/datadog-lambda-extension

Conversation

@duncanista

What?

Creates an aggregator and flusher for APM proxy data.

Motivation

With the introduction of #684, the Extension is way faster to continue invocations, I think this could have affected how data is aggregated for this various APM products in which we were just sending the request to intake as soon as it arrives.

Now, we aggregate it and respect the flushing pattern and algorithm.

Testing

  1. Profiling:

Screenshot 2025-07-11 at 11 13 55 AM

  1. LLMObs:

Screenshot 2025-07-15 at 3 12 01 PM

  1. DSM:
    tbd

  2. Live Debugger:

image

duncanista

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't like how simple this aggregator is, its just a wrapper for a queue, but I think so far we should respect the way we are following this pattern, maybe we can create a trait and consolidate how we create aggregators

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a buffering layer for APM proxy requests in the Trace Agent, introducing an in-memory aggregator and a flusher that batches and retries HTTP calls for DSM, Profiling, LLMObs, and Live Debugger.

  • Introduce proxy_aggregator and proxy_flusher modules to buffer and periodically flush requests.
  • Update TraceAgent endpoints to enqueue proxy requests instead of forwarding immediately.
  • Adjust get_client signature to take a reference and propagate that change to related flushers.

Reviewed Changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
bottlecap/src/traces/trace_agent.rs Switched proxy endpoints to enqueue into proxy_aggregator
bottlecap/src/traces/proxy_aggregator.rs Added ProxyRequest struct and Aggregator for request buffering
bottlecap/src/traces/proxy_flusher.rs Added Flusher to batch, send, and retry buffered proxy requests
bottlecap/src/traces/mod.rs Registered new modules and added DD_ADDITIONAL_TAGS_HEADER const
bottlecap/src/logs/flusher.rs Updated get_client call to pass a reference
bottlecap/src/http.rs Changed get_client and build_client to accept &Arc<Config>
Comments suppressed due to low confidence (1)

bottlecap/src/traces/proxy_flusher.rs:26

  • [nitpick] There are no unit tests for the new Flusher behavior. Please add tests covering header merging, batch flushing, retry logic, and failure scenarios.

lym953

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Feel free to request additional review from others if needed.


pub async fn flush(
&self,
retry_requests: Option<Vec<reqwest::RequestBuilder>>,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my learning: is there a limit on the number of retries? I assumed the retry count in send() is a different things.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry count in the send method is mainly for the amount of times we try to retry a payload during a flush.

I think there's no limit on the number of failed payloads we try to retry – this is mainly bc we could have a network error which we need to handle.

For 400/500 errors we don't retry, maybe we should at some point, but we want to ensure we don't timeout the lambda or fill the buffer with data that will never arrive

astuyve

no need to pass a whoole clone here
only acknowledges data and lets the flusher take care of the rest

@astuyve astuyve deleted the jordan.gonzalez/trace-agent/aggregation-for-proxy branch

July 16, 2025 19:47