importlib lock race issue in deadlock handling code · Issue #91351 · python/cpython
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
| BPO | 47195 |
|---|---|
| Nosy | @rpurdie |
| Files |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
assignee = None closed_at = None created_at = <Date 2022-04-01.20:23:24.167> labels = ['interpreter-core', '3.10'] title = 'importlib lock race issue in deadlock handling code' updated_at = <Date 2022-04-02.06:56:46.108> user = 'https://github.com/rpurdie'
bugs.python.org fields:
activity = <Date 2022-04-02.06:56:46.108> actor = 'rpurdie' assignee = 'none' closed = False closed_date = None closer = None components = ['Interpreter Core'] creation = <Date 2022-04-01.20:23:24.167> creator = 'rpurdie' dependencies = [] files = ['50714'] hgrepos = [] issue_num = 47195 keywords = [] message_count = 2.0 messages = ['416517', '416539'] nosy_count = 1.0 nosy_names = ['rpurdie'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = None url = 'https://bugs.python.org/issue47195' versions = ['Python 3.10']
Linked PRs
Copy link
Mannequin Author
We've seen tracebacks in production like:
File "<frozen importlib._bootstrap>", line 1004, in _find_and_load(name='oe.gpg_sign', import_=<built-in function __import__>)
File "<frozen importlib._bootstrap>", line 158, in _ModuleLockManager.__enter__()
File "<frozen importlib._bootstrap>", line 110, in _ModuleLock.acquire()
KeyError: 139622474778432
and
File "<frozen importlib._bootstrap>", line 1004, in _find_and_load(name='oe.path', import_=<built-in function __import__>)
File "<frozen importlib._bootstrap>", line 158, in _ModuleLockManager.__enter__()
File "<frozen importlib._bootstrap>", line 110, in _ModuleLock.acquire()
KeyError: 140438942700992
I've attached a reproduction script which shows that if an import XXX is in progress and waiting at the wrong point when an interrupt arrives (in this case a signal) and triggers it's own import YYY, _blocking_on[tid] in importlib/_bootstrap.py gets overwritten and lost, triggering the traceback we see above upon exit from the second import.
I'm using a signal handler here as the interrupt, I don't know what our production source is as yet but this reproducer proves it is possible.
Copy link
Mannequin Author
This is a production backtrace after I inserted code to traceback if tid was already in _blocking_on. It is being triggered by a warning about an unclosed asyncio event loop and confirms my theory about nested imports, in the production case I'd guess being triggered by gc given the __del__.
File "/home/pokybuild/yocto-worker/oe-selftest-fedora/build/meta/classes/base.bbclass", line 26, in oe_import
import oe.data
File "<frozen importlib._bootstrap>", line 1024, in _find_and_load
File "<frozen importlib._bootstrap>", line 171, in __enter__
File "/home/pokybuild/yocto-worker/oe-selftest-fedora/build/bitbake/lib/bb/cooker.py", line 168, in acquire
return orig_acquire(self)
File "<frozen importlib._bootstrap>", line 110, in acquire
File "/usr/lib64/python3.10/asyncio/base_events.py", line 685, in __del__
_warn(f"unclosed event loop {self!r}", ResourceWarning, source=self)
File "/usr/lib64/python3.10/warnings.py", line 112, in _showwarnmsg
_showwarnmsg_impl(msg)
File "/usr/lib64/python3.10/warnings.py", line 28, in _showwarnmsg_impl
text = _formatwarnmsg(msg)
File "/usr/lib64/python3.10/warnings.py", line 128, in _formatwarnmsg
return _formatwarnmsg_impl(msg)
File "/usr/lib64/python3.10/warnings.py", line 56, in _formatwarnmsg_impl
import tracemalloc
File "<frozen importlib._bootstrap>", line 1024, in _find_and_load
File "<frozen importlib._bootstrap>", line 171, in __enter__
File "/home/pokybuild/yocto-worker/oe-selftest-fedora/build/bitbake/lib/bb/cooker.py", line 167, in acquire
bb.warn("\n".join(traceback.format_stack()))
I also encountered this case, triggered by the resource warning emitted by an unclosed socket.socket. On Python 3.9.12 and Python 3.9.13 the following reproduces the problem on Windows and Linux (deterministically, if a call to gc.collect() is inserted into acquire immediately after _blocking_on[tid] = self in _bootstrap.py):
import socket, gc
class Cycle:
pass
def main():
c = Cycle()
c.cycle = c
c.s = socket.socket()
del c
# import any module that hasn't been imported yet to trigger the bug
import base64
main()
Here's the _bootstrap.py diff that makes it deterministic:
diff --git a/Lib/importlib/_bootstrap.py b/Lib/importlib/_bootstrap.py
index e00b27ece2..808d7bf501 100644
--- a/Lib/importlib/_bootstrap.py
+++ b/Lib/importlib/_bootstrap.py
@@ -36,6 +36,23 @@ def _new_module(name):
return type(sys)(name)
+def collect():
+ gc = sys.modules.get("gc")
+ collect = getattr(gc, "collect", None)
+ if collect is not None:
+ collect()
+
+_fd = None
+def debug(s):
+ global _fd
+ os = sys.modules.get("os")
+ if _fd is None:
+ if getattr(os, "open", None) is not None:
+ _fd = os.open("C:/bootstrap.log", os.O_CREAT|os.O_APPEND|os.O_WRONLY)
+ if _fd is not None:
+ if getattr(os, "write", None) is not None:
+ os.write(_fd, (s + "\n").encode("utf-8"))
+
# Module-level locking ########################################################
# A dict mapping module names to weakrefs of _ModuleLock instances
@@ -91,7 +108,12 @@ class _ModuleLock:
Otherwise, the lock is always acquired and True is returned.
"""
tid = _thread.get_ident()
+
+ # debug(f"Hello {self.name} {len(_blocking_on)} {tid}") # >>> {self.name} {tid} {_blocking_on} {self}")
+ debug(f"import {self.name}")
_blocking_on[tid] = self
+ if self.name == "base64":
+ collect()
try:
while True:
with self.lock:
@@ -107,6 +129,7 @@ class _ModuleLock:
self.wakeup.acquire()
self.wakeup.release()
finally:
+ # debug(f"Goodbye") # <<< {self.name} {tid} {_blocking_on} {self}")
del _blocking_on[tid]
def release(self):
This basically shows the same thing as the original reproducer attached above but with a patch instead of a monkey-patch and no sleep/signals.
Seeing how these reproducers work, the cause seems pretty straightforward to me.
_ModuleLock.acquireputs the main thread into_blocking_on.- The garbage collector runs and finds something like an asyncio or a socket
- The resource's
__del__callswarnings.warn warnings.warnleads to something being imported - linecache or tracemalloc_ModuleLock.acquireis re-entered and puts the main thread into_blocking_onagain_ModuleLock.acquireis left and removes the main thread from_blocking_on(which is now empty)warnings.warn,__del__and the garbage collector all unwinds_ModuleLock.acquireis left again but on its way out it tries to remove the main thread from_blocking_onagain. However, it has already been removed and we get the KeyError.
I haven't fully understood what this part of _bootstrap.py is trying to accomplish yet so I'm not going to try to suggest what the fix might be yet.
@exarkun opened a PR against 3.9 which is security only -- I've closed though a PR against HEAD would be welcome.
cc also @brettcannon as importlib expert.
A
If the failure is due to KeyError on _blocking_on when trying to delete a key from the dict, then we could catch the KeyError, send out an ImportWarning that stuff got weird, and then simply move on.
My guess is that wouldn't actually break anything as there isn't something to unlock, so it's really just stale caching at that point instead of a semantic failure.
@exarkun which PR? The only one I see referenced in this issue was closed for being targeted at the wrong branch. Do you have a PR for main to look at?
Hi @brettcannon. I meant the PR that was closed for being targeted at the wrong branch, yes. I am happy to fix that process mistake if the code in the branch is useful. If it is not going to be accepted then I'd prefer not to sink more effort into it.
I've looked at the recent history of importlib/_bootstrap.py and I don't think there are any changes that make a difference to this bug or to the changes in the PR so I think the current branch is quite representative of what the changes will be after they are retargeted at main.
Thanks again for your attention.
I took a very quick glance and this is not a small change, so getting it reviewed may take some time (and honestly @pitrou did a lot of the work on the locking mechanism).
Quite so. The cases it is trying to handle are not simple and the implementation strategy taken is not simple and the problems that arise as a result are not simple. :( One thing I'll point out is that the commits in the PR should be sequentially reviewable, with each commit focusing on a different and (I hope) internally consistent part of the change.
Does the simpler "ignore KeyError" solution potentially work?
Unfortunately I don't think so.
It doesn't solve the problem with deadlock detection during a re-entrant import. That problem goes like this:
- Outer import starts and populates _blocking_on
- Inner import starts and stomps on outer import's _blocking_on state
- Inner import completes and clears _blocking_on state for the running thread
- Another import begins and makes an incorrect
has_deadlockdecision based on the corrupted _blocking_on state (the original thread is blocking on something but it is no longer recorded as blocking on that thing) - The original outer import may now get around to clearing _blocking_on and adding the KeyError handling prevents it from raising that exception but a problem may already have been created by (4)
It also doesn't solve the single-threaded re-entrant deadlock. That problem goes like this:
- Outer import starts and populates _blocking_on
- Outer import takes
_ModuleLock.lockfor the module being imported - Inner import for the same module begins
- Inner import takes
_ModuleLock.lockfor that module and deadlocks
In this case the outer import never even gets to cleaning up _blocking_on so there is no KeyError to handle.
Thanks a lot @exarkun for tackling this. I would agree with using a RLock if it makes things simpler (RLock is written in C nowadays so should be usable from the import bootstrap).
Thanks @pitrou. I do think RLock will help.
If I adapt the implementation to that approach then should I also re-target it at main?
If I adapt the implementation to that approach then should I also re-target it at main?
Yes. Backports get cherry-picked from main afterwards.
I submitted a new PR against main and with the RLock simplification - #94504.
This was referenced
Jul 21, 2022brettcannon added a commit that referenced this issue
Jan 21, 2023…H-94504) Co-authored-by: Brett Cannon <brett@python.org>