feat: For OOM metric, also use error type in PlatformRuntimeDone by lym953 · Pull Request #776 · DataDog/datadog-lambda-extension

Background

To compute the aws.lambda.enhanced.out_of_memory metric, we currently rely only on runtime-specific error messages, such as "Runtime exited with error: signal: killed" for Node.

Problem

A customer reported that their Lambda OOMed but the metric was not logged.
After testing, I found that the existing approach only works for:

  • .NET
  • Node
  • Java

but doesn't work for:

  • Go
  • Ruby
  • Python

(These results are based only on my experiments. There are likely other cases I didn't cover.)

This PR

In addition to the current approach, also increment the metric if the PlatformRuntimeDone event contains {status: 'Status::Error', error_type: 'Runtime.OutOfMemory'}.
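For illustration, the new check can be sketched roughly as below. This is a minimal sketch, not the code in this PR: the `Status` enum, the `RuntimeDone` struct, and both function names are hypothetical stand-ins for the extension's real telemetry types. It also does not handle de-duplication against the existing log-based check.

```rust
// Hypothetical, simplified stand-ins for the extension's telemetry types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Success,
    Error,
}

// Only the two fields relevant to the OOM check are modeled here.
#[derive(Debug, Clone)]
struct RuntimeDone {
    status: Status,
    error_type: Option<String>,
}

// Returns true when the event reports an error whose type is Runtime.OutOfMemory.
fn is_out_of_memory(event: &RuntimeDone) -> bool {
    event.status == Status::Error
        && event.error_type.as_deref() == Some("Runtime.OutOfMemory")
}

// Counts how many events in a batch would increment the OOM metric.
fn count_out_of_memory(events: &[RuntimeDone]) -> usize {
    events.iter().filter(|e| is_out_of_memory(e)).count()
}
```

Both conditions are checked together so that an error status with a different error type is not counted as an OOM.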

Test

Summary: the metric now works for Ruby 3.3 and Python 3.13.

.NET 6: still works

Node 22: still works

Java 11: still mostly works

Ruby 3.3: works (new)

Python 3.13: mostly works (new)

Go 1: still doesn't work. This will be addressed in a future PR.

Notes

  1. More details:
    https://datadoghq.atlassian.net/wiki/spaces/SLS/pages/5371986568/out_of_memory+metric

  2. Like the PlatformRuntimeDone event, the PlatformReport event can also contain {status: 'Status::Error', error_type: 'Runtime.OutOfMemory'}. Why not use that event?
    If the extension shuts down due to OOM, the PlatformReport event is processed on the next invocation, when the extension restarts. If there is no next invocation, the PlatformReport event is never processed and we miss the OOM. PlatformRuntimeDone is therefore the better choice, though it can also miss an OOM if the extension hard crashes.

Jira: https://datadoghq.atlassian.net/browse/SVLS-7319