feat: For OOM metric, also use error type in PlatformRuntimeDone by lym953 · Pull Request #776 · DataDog/datadog-lambda-extension
Background
To compute the aws.lambda.enhanced.out_of_memory metric, we currently rely only on runtime-specific error messages, such as Runtime exited with error: signal: killed for Node.
Problem
A customer reported that their Lambda OOMed but the metric was not logged.
After testing, I found that the existing approach only works for:
- .NET
- Node
- Java
but doesn't work for:
- Go
- Ruby
- Python
(These results are based only on my experiments. There are likely other cases I didn't cover.)
This PR
In addition to the current approach, also increment the metric if the PlatformRuntimeDone event contains {status: 'Status::Error', error_type: 'Runtime.OutOfMemory'}.
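A minimal sketch of the new check, for illustration only. The type and function names here (Status, RuntimeDone, is_oom_event) are hypothetical and are not taken from the extension's code; only the status and error_type values come from the description above.

```rust
// Hypothetical stand-ins for the extension's telemetry types.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum Status {
    Success,
    Error,
}

// Assumed shape of a PlatformRuntimeDone event: a status plus an optional error type.
struct RuntimeDone {
    status: Status,
    error_type: Option<String>,
}

/// Returns true when the event reports an error whose type is Runtime.OutOfMemory.
fn is_oom_event(event: &RuntimeDone) -> bool {
    event.status == Status::Error
        && event.error_type.as_deref() == Some("Runtime.OutOfMemory")
}
```

In this sketch the caller would increment aws.lambda.enhanced.out_of_memory when is_oom_event returns true, alongside the existing runtime-specific error matching.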
Test
Summary: Ruby 3.3 and Python 3.13 start working.
Python 3.13: mostly works (new)

Go 1: still doesn't work. Will address it in future PRs.

Notes
- More details: https://datadoghq.atlassian.net/wiki/spaces/SLS/pages/5371986568/out_of_memory+metric
- Like the PlatformRuntimeDone event, the PlatformReport event can also contain {status: 'Status::Error', error_type: 'Runtime.OutOfMemory'}. Why not use that event?
  If the extension shuts down due to OOM, the PlatformReport event will be processed at the next invocation, when the extension restarts. However, if there is no next invocation, the PlatformReport event will never be processed, so we will miss this OOM. Therefore, PlatformRuntimeDone is better than PlatformReport. (Though it can sometimes also miss an OOM if the extension hard crashes.)



