Fix incorrect use of reentrant locks by suligap · Pull Request #1076 · prometheus/client_python
This fixes a correctness bug introduced in 0014e97 that resulted in lost updates in some scenarios. The code being locked is not reentrant-safe. In these situations it is preferable to deadlock rather than, for example, silently lose updates. Signed-off-by: Przemysław Suliga <mail@suligap.net>
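To illustrate the bug class: a minimal sketch (the `Value` class and its fields are hypothetical, not the library's actual code) of how a reentrant lock turns a reentered read-modify-write into a silent lost update. With a plain `threading.Lock` the nested call would deadlock instead, making the misuse visible.

```python
import threading

class Value:
    """Hypothetical read-modify-write counter guarded by a lock."""
    def __init__(self, lock):
        self._lock = lock
        self._value = 0.0

    def inc(self, amount=1.0, reenter=None):
        with self._lock:
            current = self._value            # read
            if reenter:                      # simulate a signal handler firing mid-update
                reenter()
            self._value = current + amount   # write back, clobbering any reentrant update

lock = threading.RLock()  # reentrant: the nested inc() is silently allowed
v = Value(lock)
v.inc(1.0, reenter=lambda: v.inc(10.0))
print(v._value)  # 1.0 — the nested inc(10.0) was lost
```

The outer `inc` reads 0.0, the nested `inc` writes 10.0, and then the outer write-back overwrites it with 0.0 + 1.0. Swapping `RLock` for `Lock` makes the nested acquire block forever, which is the behavior this PR prefers.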
csmarchbanks pushed a commit that referenced this pull request
Dec 3, 2024: This fixes a correctness bug introduced in 0014e97 that resulted in lost updates in some scenarios. The code being locked is not reentrant-safe. In these situations it is preferable to deadlock rather than, for example, silently lose updates. Signed-off-by: Przemysław Suliga <mail@suligap.net>
suligap added a commit to suligap/client_python that referenced this pull request
Dec 3, 2024: Detect deadlocks during library misuse by injecting code into the critical sections that might itself want to obtain the lock. A follow-up to prometheus#1076. Signed-off-by: Przemysław Suliga <mail@suligap.net>
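One way such detection can work (a sketch under assumptions — the class name and approach here are illustrative, not necessarily what the follow-up commit implements): a non-reentrant lock that tracks its owning thread and raises immediately when that thread tries to re-acquire it, instead of blocking forever.

```python
import threading

class DeadlockDetectingLock:
    """Hypothetical sketch: raise instead of deadlocking when the owning
    thread re-acquires a non-reentrant lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None

    def __enter__(self):
        me = threading.get_ident()
        if self._owner == me:
            raise RuntimeError("deadlock detected: lock already held by this thread")
        self._lock.acquire()
        self._owner = me
        return self

    def __exit__(self, *exc):
        self._owner = None
        self._lock.release()

lock = DeadlockDetectingLock()
try:
    with lock:
        with lock:       # same thread re-enters the critical section
            pass
    reentered = True
except RuntimeError:
    reentered = False    # misuse surfaced as an exception, not a hang
```

This keeps the non-reentrant semantics the fix relies on while making the misuse fail loudly in tests.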
suligap added a commit to canonical-ols/talisker that referenced this pull request
Dec 4, 2024: We have been affected by deadlocks caused by talisker instrumenting the gunicorn main process's logging with sentry and related prometheus metrics. The deadlocks were on a lock in prometheus_client. This is a single-threaded context: the main process was sometimes in the middle of incrementing a prometheus counter as a result of sending an event to sentry, which in turn resulted from the gunicorn main process logging an error, e.g. about terminating an unresponsive worker. While the global multiprocessing-mode lock in values.py was held, a SIGCHLD signal handler was invoked in that main process in response to some other worker terminating. During that signal handling the main process also logged an error message, which triggered another sentry event and a corresponding prometheus counter update, and hence a deadlock. To be more careful/conservative about what we do during signal handling, or in this case about what we instrument gunicorn to do, this change sets up all of the request-related metrics so they are not emitted from the process they were created in (which for these metrics is the gunicorn arbiter when running gunicorn). More details on prometheus client behavior in prometheus/client_python#1076
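The SIGCHLD scenario above can be reduced to a small simulation (real signals are omitted; the handler is modeled as a call on the same thread, and `acquire(timeout=...)` stands in for what would otherwise hang forever): a signal handler runs on the very thread that already holds a non-reentrant lock, so any metric update inside the handler can never acquire it.

```python
import threading

lock = threading.Lock()  # non-reentrant, like the values.py lock described above

def sigchld_handler():
    # What the handler effectively did: log an error -> sentry event -> counter
    # update needing the same lock. A timeout is used here only so the sketch
    # terminates; the real handler would block forever.
    return lock.acquire(timeout=0.1)

with lock:                       # main process mid-way through a counter increment
    acquired = sigchld_handler() # "signal handler" runs on the same thread
print(acquired)  # False: the handler can never get the lock -> deadlock
```

Since Python runs signal handlers on the main thread between bytecode instructions, any lock-taking work reachable from a handler can collide with the same lock held by interrupted code, which is why moving metric emission out of the arbiter process avoids the hang.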
suligap added a commit to canonical-ols/talisker that referenced this pull request
Dec 4, 2024: At least for now, stop using TaliskerRequestsTransport. In practice, for flask apps it was not used in the apps/workers because FlaskSentry was explicitly setting a None 'transport' in the client kwargs. This change makes sure that the same default raven (threaded) transport is used by logging in the main process. This mitigates the deadlocks described in prometheus/client_python#1076 (comment) because there will be no prometheus instrumentation for sentry requests.
suligap added a commit to suligap/client_python that referenced this pull request
Dec 8, 2024: Detects and prevents deadlocks during library misuse, e.g. by injecting code into the critical sections that might itself want to obtain the relevant lock. A follow-up to prometheus#1076. Signed-off-by: Przemysław Suliga <mail@suligap.net>