feat: For OOM metric, also use error type in PlatformRuntimeDone by lym953 · Pull Request #776 · DataDog/datadog-lambda-extension

Background

To compute the aws.lambda.enhanced.out_of_memory metric, we currently rely only on runtime-specific error messages, such as "Runtime exited with error: signal: killed" for Node.
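As a rough illustration of the message-based approach, the sketch below checks a log line for the Node message quoted above. The constant and function names are hypothetical and are not taken from the extension's code; other runtimes emit different messages, which are not listed here.

```rust
// Hypothetical sketch: the only message shown is the Node one quoted in this PR.
const NODE_OOM_MESSAGE: &str = "Runtime exited with error: signal: killed";

/// Returns true when a log line contains the known Node OOM message.
fn log_line_indicates_oom(line: &str) -> bool {
    line.contains(NODE_OOM_MESSAGE)
}
```

Because this check depends on each runtime printing a recognizable message, it silently misses any runtime whose OOM output does not match a known string.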

Problem

A customer reported that their Lambda OOMed but the metric was not logged.
After testing, I found that the existing approach only works for:

.NET
Node
Java

but doesn't work for:

Go
Ruby
Python

(These results are based only on my own experiments. There are likely other cases I didn't cover.)

This PR

In addition to the current approach, also increment the metric if the PlatformRuntimeDone event contains {status: 'Status::Error', error_type: 'Runtime.OutOfMemory'}.
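A minimal sketch of that check follows, assuming simplified stand-in types. The `Status` enum, the `RuntimeDone` struct, and the `is_oom` function are hypothetical names for illustration; the extension's real event types carry more fields and may be shaped differently.

```rust
// Hypothetical stand-ins for the extension's telemetry types.
#[derive(Debug, PartialEq)]
enum Status {
    Success,
    Error,
}

// Simplified view of a PlatformRuntimeDone event (assumed shape).
struct RuntimeDone {
    status: Status,
    error_type: Option<String>,
}

/// Returns true when the event reports an error whose type is Runtime.OutOfMemory.
fn is_oom(event: &RuntimeDone) -> bool {
    event.status == Status::Error
        && event.error_type.as_deref() == Some("Runtime.OutOfMemory")
}
```

Requiring both the error status and the exact error type keeps other runtime errors from being counted as OOMs.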

Test

Summary: Ruby 3.3 and Python 3.13 now work.

.NET 6: still works

Node 22: still works

Java 11: still mostly works

Ruby 3.3: works (new)

Python 3.13: mostly works (new)

Go 1: still doesn't work. Will address it in future PRs.

Notes

More details:
https://datadoghq.atlassian.net/wiki/spaces/SLS/pages/5371986568/out_of_memory+metric
Like the PlatformRuntimeDone event, the PlatformReport event can also contain {status: 'Status::Error', error_type: 'Runtime.OutOfMemory'}. Why not use that event?
If the extension shuts down due to OOM, the PlatformReport event is processed on the next invocation, when the extension restarts. If there is no next invocation, the PlatformReport event is never processed and we miss the OOM. PlatformRuntimeDone is therefore the better choice, though it can also miss an OOM if the extension hard-crashes.

Jira: https://datadoghq.atlassian.net/browse/SVLS-7319