Fix incorrect use of reentrant locks by suligap · Pull Request #1076 · prometheus/client_python

@suligap

This fixes a correctness bug introduced in
0014e97 resulting in lost updates
during some scenarios.

The code inside the lock is not safe to re-enter. In these situations it
is preferable to deadlock rather than, for example, silently lose updates.

Signed-off-by: Przemysław Suliga <mail@suligap.net>
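A minimal sketch of the failure mode, using a hypothetical Counter rather than the library's actual code: with a reentrant lock, code that re-enters the critical section (as a signal handler firing mid-update might) reads a stale value and silently overwrites the inner update, whereas a plain threading.Lock would deadlock instead.

```python
import threading

class Counter:
    """Read-modify-write counter guarded by a reentrant lock (hypothetical)."""

    def __init__(self):
        self._lock = threading.RLock()  # reentrant: same thread may re-enter
        self.value = 0

    def inc(self, during_update=None):
        with self._lock:
            read = self.value        # read
            if during_update:        # simulate e.g. a signal handler running
                during_update()      # inside the critical section
            self.value = read + 1    # write back a possibly stale value

c = Counter()
# The re-entrant increment succeeds but is then overwritten by the outer one:
c.inc(during_update=c.inc)
print(c.value)  # 1, not 2: one update was silently lost
# With threading.Lock instead of RLock, the inner acquire would block forever,
# turning the silent data loss into an observable deadlock.
```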


csmarchbanks pushed a commit that referenced this pull request

Dec 3, 2024
This fixes a correctness bug introduced in
0014e97 resulting in lost updates
during some scenarios.

The code inside the lock is not safe to re-enter. In these situations it
is preferable to deadlock rather than, for example, silently lose updates.

Signed-off-by: Przemysław Suliga <mail@suligap.net>

suligap added a commit to suligap/client_python that referenced this pull request

Dec 3, 2024
Detect deadlocks caused by library misuse by injecting code into the
critical sections that itself might want to obtain the lock.

A follow up to prometheus#1076.

Signed-off-by: Przemysław Suliga <mail@suligap.net>

suligap added a commit to suligap/client_python that referenced this pull request

Dec 3, 2024
Detect deadlocks caused by library misuse, e.g. by injecting code into
the critical sections that itself might want to obtain the relevant lock.

A follow up to prometheus#1076.

Signed-off-by: Przemysław Suliga <mail@suligap.net>
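One way such misuse could be surfaced instead of silently hanging is sketched below; NonReentrantLock is a hypothetical wrapper, not the library's implementation. It remembers which thread holds the lock and raises on a same-thread re-acquisition.

```python
import threading

class NonReentrantLock:
    """Context-manager lock that raises on same-thread re-entry (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None  # ident of the thread currently holding the lock

    def __enter__(self):
        me = threading.get_ident()
        if self._owner == me:
            # Same thread is re-entering the critical section: a plain Lock
            # would deadlock here, so fail loudly instead.
            raise RuntimeError("deadlock: lock already held by this thread")
        self._lock.acquire()
        self._owner = me
        return self

    def __exit__(self, *exc):
        self._owner = None
        self._lock.release()

lock = NonReentrantLock()
with lock:
    try:
        with lock:  # misuse: re-entering the critical section
            pass
    except RuntimeError as e:
        print(e)  # prints "deadlock: lock already held by this thread"
```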

suligap added a commit to canonical-ols/talisker that referenced this pull request

Dec 4, 2024
We have been affected by deadlocks caused by talisker instrumenting the
gunicorn main process' logging with sentry and related
prometheus metrics. The deadlocks were on a lock in prometheus_client.

This is a single-threaded scenario: the main process was sometimes in
the middle of incrementing a prometheus counter, triggered by sending an
event to sentry, which in turn was triggered by the gunicorn main
process logging an error, e.g. about terminating an unresponsive worker.
While the global multiprocessing mode lock in values.py was held, a
signal handler for SIGCHLD was invoked in that main process in response
to some other worker terminating. During that signal handling the main
process also logged an error message, which caused the sentry event and
a corresponding prometheus counter update -> deadlock.

To be more careful/conservative about what we do during signal handling,
or in this case about what we instrument gunicorn to do, this change
sets up all of the request-related metrics so that they are not emitted
from the process they were created in (which for these metrics is the
gunicorn arbiter if we're running gunicorn).

More details on prometheus client behavior in
prometheus/client_python#1076

suligap added a commit to canonical-ols/talisker that referenced this pull request

Dec 4, 2024
We have been affected by deadlocks caused by talisker instrumenting the
gunicorn main process' logging with sentry and related
prometheus metrics. The deadlocks were on a lock in prometheus_client.

This is a single-threaded scenario: the main process was sometimes in
the middle of incrementing a prometheus counter, triggered by sending an
event to sentry, which in turn was triggered by the gunicorn main
process logging an error, e.g. about terminating an unresponsive worker.
While the global multiprocessing mode prometheus client lock in
values.py was held, a signal handler for SIGCHLD was invoked in that
main process in response to some other worker terminating. During that
signal handling the main process also logged an error message, which
caused the sentry event and a corresponding prometheus counter
update -> deadlock.

To be more careful/conservative about what we do during signal handling,
or in this case about what we instrument gunicorn to do, this change
sets up all of the request-related metrics so that they are not emitted
from the process they were created in (which for these metrics is the
gunicorn arbiter if we're running gunicorn).

More details on prometheus client behavior in
prometheus/client_python#1076
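The mitigation could be sketched as follows; ProcessGuardedCounter and its inc API are hypothetical illustrations, not talisker's actual code. The wrapper records the pid that created the metric and drops updates made in that same process, so the arbiter never touches the shared lock from inside a signal handler.

```python
import os

class PlainCounter:
    """Stand-in for a prometheus counter (hypothetical)."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        self.value += amount

class ProcessGuardedCounter:
    """Drops increments made in the process that created the metric."""
    def __init__(self, counter):
        self._counter = counter
        self._created_in = os.getpid()  # e.g. the gunicorn arbiter's pid

    def inc(self, amount=1):
        if os.getpid() == self._created_in:
            return  # never emit from the creating process
        self._counter.inc(amount)

plain = PlainCounter()
guarded = ProcessGuardedCounter(plain)
guarded.inc()       # no-op: we are still in the creating process
print(plain.value)  # 0
# After fork(), worker processes have a different pid, so their
# increments pass through to the underlying counter.
```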

suligap added a commit to canonical-ols/talisker that referenced this pull request

Dec 4, 2024
At least for now, stop using the TaliskerRequestsTransport.

In practice, for flask apps it was not used in the apps/workers because
FlaskSentry was explicitly setting a None 'transport' in the client
kwargs.

This change makes sure that the same default raven (threaded) transport
is used by logging in the main process. This will mitigate the deadlocks
described in
prometheus/client_python#1076 (comment)
because there will be no prometheus instrumentation for sentry requests.

suligap added a commit to suligap/client_python that referenced this pull request

Dec 8, 2024
Detect and prevent deadlocks caused by library misuse, e.g. by injecting
code into the critical sections that itself might want to obtain the
relevant lock.

A follow up to prometheus#1076.

Signed-off-by: Przemysław Suliga <mail@suligap.net>