importlib lock race issue in deadlock handling code · Issue #91351

importlib lock race issue in deadlock handling code · Issue #91351 · python/cpython

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

rpurdie mannequin opened this issue

Apr 1, 2022

· 17 comments · Fixed by #94504

Comments

BPO	47195
Nosy	@rpurdie
Files	testit2.py

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2022-04-01.20:23:24.167>
labels = ['interpreter-core', '3.10']
title = 'importlib lock race issue in deadlock handling code'
updated_at = <Date 2022-04-02.06:56:46.108>
user = 'https://github.com/rpurdie'

bugs.python.org fields:

activity = <Date 2022-04-02.06:56:46.108>
actor = 'rpurdie'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Interpreter Core']
creation = <Date 2022-04-01.20:23:24.167>
creator = 'rpurdie'
dependencies = []
files = ['50714']
hgrepos = []
issue_num = 47195
keywords = []
message_count = 2.0
messages = ['416517', '416539']
nosy_count = 1.0
nosy_names = ['rpurdie']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = None
url = 'https://bugs.python.org/issue47195'
versions = ['Python 3.10']

Linked PRs

gh-91351: Fix some bugs in importlib handling of re-entrant imports #94504

Copy link

Mannequin Author

We've seen tracebacks in production like:

File "<frozen importlib._bootstrap>", line 1004, in _find_and_load(name='oe.gpg_sign', import_=<built-in function __import__>)
File "<frozen importlib._bootstrap>", line 158, in _ModuleLockManager.__enter__()
File "<frozen importlib._bootstrap>", line 110, in _ModuleLock.acquire()
KeyError: 139622474778432

and

File "<frozen importlib._bootstrap>", line 1004, in _find_and_load(name='oe.path', import_=<built-in function __import__>)
File "<frozen importlib._bootstrap>", line 158, in _ModuleLockManager.__enter__()
File "<frozen importlib._bootstrap>", line 110, in _ModuleLock.acquire()
KeyError: 140438942700992

I've attached a reproduction script which shows that if an import XXX is in progress and waiting at the wrong point when an interrupt arrives (in this case a signal) and triggers it's own import YYY, _blocking_on[tid] in importlib/_bootstrap.py gets overwritten and lost, triggering the traceback we see above upon exit from the second import.

I'm using a signal handler here as the interrupt, I don't know what our production source is as yet but this reproducer proves it is possible.

Copy link

Mannequin Author

This is a production backtrace after I inserted code to traceback if tid was already in _blocking_on. It is being triggered by a warning about an unclosed asyncio event loop and confirms my theory about nested imports, in the production case I'd guess being triggered by gc given the __del__.

File "/home/pokybuild/yocto-worker/oe-selftest-fedora/build/meta/classes/base.bbclass", line 26, in oe_import
import oe.data
File "<frozen importlib._bootstrap>", line 1024, in _find_and_load
File "<frozen importlib._bootstrap>", line 171, in __enter__
File "/home/pokybuild/yocto-worker/oe-selftest-fedora/build/bitbake/lib/bb/cooker.py", line 168, in acquire
return orig_acquire(self)
File "<frozen importlib._bootstrap>", line 110, in acquire
File "/usr/lib64/python3.10/asyncio/base_events.py", line 685, in __del__
_warn(f"unclosed event loop {self!r}", ResourceWarning, source=self)
File "/usr/lib64/python3.10/warnings.py", line 112, in _showwarnmsg
_showwarnmsg_impl(msg)
File "/usr/lib64/python3.10/warnings.py", line 28, in _showwarnmsg_impl
text = _formatwarnmsg(msg)
File "/usr/lib64/python3.10/warnings.py", line 128, in _formatwarnmsg
return _formatwarnmsg_impl(msg)
File "/usr/lib64/python3.10/warnings.py", line 56, in _formatwarnmsg_impl
import tracemalloc
File "<frozen importlib._bootstrap>", line 1024, in _find_and_load
File "<frozen importlib._bootstrap>", line 171, in __enter__
File "/home/pokybuild/yocto-worker/oe-selftest-fedora/build/bitbake/lib/bb/cooker.py", line 167, in acquire
bb.warn("\n".join(traceback.format_stack()))

I also encountered this case, triggered by the resource warning emitted by an unclosed socket.socket. On Python 3.9.12 and Python 3.9.13 the following reproduces the problem on Windows and Linux (deterministically, if a call to gc.collect() is inserted into acquire immediately after _blocking_on[tid] = self in _bootstrap.py):

import socket, gc

class Cycle:
    pass

def main():
    c = Cycle()
    c.cycle = c
    c.s = socket.socket()

    del c
    # import any module that hasn't been imported yet to trigger the bug
    import base64

main()

Here's the _bootstrap.py diff that makes it deterministic:

diff --git a/Lib/importlib/_bootstrap.py b/Lib/importlib/_bootstrap.py
index e00b27ece2..808d7bf501 100644
--- a/Lib/importlib/_bootstrap.py
+++ b/Lib/importlib/_bootstrap.py
@@ -36,6 +36,23 @@ def _new_module(name):
     return type(sys)(name)


+def collect():
+    gc = sys.modules.get("gc")
+    collect = getattr(gc, "collect", None)
+    if collect is not None:
+        collect()
+
+_fd = None
+def debug(s):
+    global _fd
+    os = sys.modules.get("os")
+    if _fd is None:
+        if getattr(os, "open", None) is not None:
+            _fd = os.open("C:/bootstrap.log", os.O_CREAT|os.O_APPEND|os.O_WRONLY)
+    if _fd is not None:
+        if getattr(os, "write", None) is not None:
+            os.write(_fd, (s + "\n").encode("utf-8"))
+    
 # Module-level locking ########################################################

 # A dict mapping module names to weakrefs of _ModuleLock instances
@@ -91,7 +108,12 @@ class _ModuleLock:
         Otherwise, the lock is always acquired and True is returned.
         """
         tid = _thread.get_ident()
+        
+        # debug(f"Hello {self.name} {len(_blocking_on)} {tid}") # >>> {self.name} {tid} {_blocking_on} {self}")
+        debug(f"import {self.name}")
         _blocking_on[tid] = self
+        if self.name == "base64":
+            collect()
         try:
             while True:
                 with self.lock:
@@ -107,6 +129,7 @@ class _ModuleLock:
                 self.wakeup.acquire()
                 self.wakeup.release()
         finally:
+            # debug(f"Goodbye") # <<< {self.name} {tid} {_blocking_on} {self}")
             del _blocking_on[tid]

     def release(self):

This basically shows the same thing as the original reproducer attached above but with a patch instead of a monkey-patch and no sleep/signals.

Seeing how these reproducers work, the cause seems pretty straightforward to me.

_ModuleLock.acquire puts the main thread into _blocking_on.
The garbage collector runs and finds something like an asyncio or a socket
The resource's __del__ calls warnings.warn
warnings.warn leads to something being imported - linecache or tracemalloc
_ModuleLock.acquire is re-entered and puts the main thread into _blocking_on again
_ModuleLock.acquire is left and removes the main thread from _blocking_on (which is now empty)
warnings.warn, __del__ and the garbage collector all unwinds
_ModuleLock.acquire is left again but on its way out it tries to remove the main thread from _blocking_on again. However, it has already been removed and we get the KeyError.

I haven't fully understood what this part of _bootstrap.py is trying to accomplish yet so I'm not going to try to suggest what the fix might be yet.

@exarkun opened a PR against 3.9 which is security only -- I've closed though a PR against HEAD would be welcome.

cc also @brettcannon as importlib expert.

If the failure is due to KeyError on _blocking_on when trying to delete a key from the dict, then we could catch the KeyError, send out an ImportWarning that stuff got weird, and then simply move on.

My guess is that wouldn't actually break anything as there isn't something to unlock, so it's really just stale caching at that point instead of a semantic failure.

@exarkun which PR? The only one I see referenced in this issue was closed for being targeted at the wrong branch. Do you have a PR for main to look at?

Hi @brettcannon. I meant the PR that was closed for being targeted at the wrong branch, yes. I am happy to fix that process mistake if the code in the branch is useful. If it is not going to be accepted then I'd prefer not to sink more effort into it.

I've looked at the recent history of importlib/_bootstrap.py and I don't think there are any changes that make a difference to this bug or to the changes in the PR so I think the current branch is quite representative of what the changes will be after they are retargeted at main.

Thanks again for your attention.

(For other's reference, the PR under discussion is #94342 )

I took a very quick glance and this is not a small change, so getting it reviewed may take some time (and honestly @pitrou did a lot of the work on the locking mechanism). Does the simpler "ignore KeyError" solution potentially work?

I took a very quick glance and this is not a small change, so getting it reviewed may take some time (and honestly @pitrou did a lot of the work on the locking mechanism).

Quite so. The cases it is trying to handle are not simple and the implementation strategy taken is not simple and the problems that arise as a result are not simple. :( One thing I'll point out is that the commits in the PR should be sequentially reviewable, with each commit focusing on a different and (I hope) internally consistent part of the change.

Does the simpler "ignore KeyError" solution potentially work?

Unfortunately I don't think so.

It doesn't solve the problem with deadlock detection during a re-entrant import. That problem goes like this:

Outer import starts and populates _blocking_on
Inner import starts and stomps on outer import's _blocking_on state
Inner import completes and clears _blocking_on state for the running thread
Another import begins and makes an incorrect has_deadlock decision based on the corrupted _blocking_on state (the original thread is blocking on something but it is no longer recorded as blocking on that thing)
The original outer import may now get around to clearing _blocking_on and adding the KeyError handling prevents it from raising that exception but a problem may already have been created by (4)

It also doesn't solve the single-threaded re-entrant deadlock. That problem goes like this:

Outer import starts and populates _blocking_on
Outer import takes _ModuleLock.lock for the module being imported
Inner import for the same module begins
Inner import takes _ModuleLock.lock for that module and deadlocks

In this case the outer import never even gets to cleaning up _blocking_on so there is no KeyError to handle.

Thanks a lot @exarkun for tackling this. I would agree with using a RLock if it makes things simpler (RLock is written in C nowadays so should be usable from the import bootstrap).

Thanks @pitrou. I do think RLock will help.

If I adapt the implementation to that approach then should I also re-target it at main?

If I adapt the implementation to that approach then should I also re-target it at main?

Yes. Backports get cherry-picked from main afterwards.

I submitted a new PR against main and with the RLock simplification - #94504.

Yeah, I have it on my TODO list, but not high priority, sorry :-)

This was referenced

Jul 21, 2022

brettcannon added a commit that referenced this issue

Jan 21, 2023

…H-94504)

Co-authored-by: Brett Cannon <brett@python.org>