Issue45274
Created on 2021-09-23 20:13 by vstinner, last changed 2022-04-11 14:59 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| threading_bug.patch | vstinner, 2021-09-23 20:13 | |||
| threading_bug.py | vstinner, 2021-09-23 20:13 | |||
| Pull Requests | |||
|---|---|---|---|
| URL | Status | Linked | Edit |
| PR 28532 | merged | vstinner, 2021-09-23 20:22 | |
| PR 28579 | merged | miss-islington, 2021-09-27 12:21 | |
| PR 28580 | merged | miss-islington, 2021-09-27 12:21 | |
| PR 31290 | open | bjs, 2022-02-12 00:00 | |
| Messages (12) | |||
|---|---|---|---|
| msg402523 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2021-09-23 20:13 | |
Bernát Gábor found an interesting bug on Windows. Sometimes, when a process is interrupted on Windows with CTRL+C, its parent process hangs in thread.join(): https://twitter.com/gjbernat/status/1440949682759426050 Reproducer: * Install https://github.com/gaborbernat/tox/tree/2159 * Make an empty folder and put the above tox.ini in it * Invoke python -m tox and once that command is sleeping press CTRL+C (the app should lock there indefinitely). tox.ini: --- [testenv] skip_install = true commands = python -c 'import os; print(f"start {os.getpid()}"); import time; time.sleep(100); print("end");' --- Source: https://gist.github.com/gaborbernat/f1e1aff0f2ee514b50a92a4d019d4d13 I tried to attach the Python process in Python: there is a single thread, the main thread which is blocked in thread.join(). You can also see it in the faulthandler traceback. I did a long analysis of the _tstate_lock and I checked that thread really completed. Raw debug traces: == thread 6200 exit == thread_run[pid=3984, thread_id=6200]: clear PyThreadState_Clear[pid=3984, thread_id=6200]: on_delete() release_sentinel[pid=3984, thread_id=6200]: enter release_sentinel[pid=3984, thread_id=6200]: release(obj=000001C1122669C0, lock=000001C110BBDA00) release_sentinel[pid=3984, thread_id=6200]: exit PyThreadState_Clear[pid=3984, thread_id=6200]: on_delete()-- == main thread is calling join() but gets a KeyboardInterrupt == [pid=3984, thread_id=8000] Lock<obj=000001C1122669C0>.acquire() -> ACQUIRED Current thread 0x00001f40 (most recent call first): File "C:\vstinner\python\3.10\lib\threading.py", line 1118 in _wait_for_tstate_lock File "C:\vstinner\python\3.10\lib\threading.py", line 1092 in join File "C:\vstinner\env\lib\site-packages\tox\session\cmthread_run[pid=3984, thread_id=6200]: exit d\run\common.py", line 203 in execute File "C:\vstinner\env\lib\site-packages\tox\session\cmd\run\sequential.py", line 20 in run_sequential File "C:\vstinner\env\lib\site-packages\tox\session\cmd\legacy.py", line 104 in legacy File "C:\vstinner\env\lib\site-packages\tox\run.py", line 49 in main File "C:\vstinner\env\lib\site-packages\tox\run.py", line 23 in run File "C:\vstinner\env\lib\site-packages\tox\__main__.py", line 4 in <module> File "C:\vstinner\python\3.10\lib\runpy.py", line 86 in _run_code File "C:\vstinner\python\3.10\lib\runpy.py", line 196 in _run_module_as_main _wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: EXC: KeyboardInterrupt(); acquired? None == main thread calls repr(thread) == ROOT: [3984] KeyboardInterrupt - teardown started _wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: acquire(block=False, timeout=-1): lock obj= 0x1c1122669c0 File "C:\vstinner\python\3.10\lib\runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "C:\vstinner\python\3.10\lib\runpy.py", line 86, in _run_code exec(code, run_globals) File "C:\vstinner\env\lib\site-packages\tox\__main__.py", line 4, in <module> run() File "C:\vstinner\env\lib\site-packages\tox\run.py", line 23, in run result = main(sys.argv[1:] if args is None else args) File "C:\vstinner\env\lib\site-packages\tox\run.py", line 49, in main result = handler(state) File "C:\vstinner\env\lib\site-packages\tox\session\cmd\legacy.py", line 104, in legacy return run_sequential(state) File "C:\vstinner\env\lib\site-packages\tox\session\cmd\run\sequential.py", line 20, in run_sequential return execute(state, max_workers=1, has_spinner=False, live=True) File "C:\vstinner\env\lib\site-packages\tox\session\cmd\run\common.py", line 217, in execute print(f'join {thread}') File "C:\vstinner\python\3.10\lib\threading.py", line 901, in __repr__ self.is_alive() # easy way to get ._is_stopped set when appropriate File "C:\vstinner\python\3.10\lib\threading.py", line 1181, in is_alive self._wait_for_tstate_lock(False) File "C:\vstinner\python\3.10\lib\threading.py", line 1113, in _wait_for_tstate_lock traceback.print_stack() _wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: failed to acquire 0x1c1122669c0 _wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: exit 0x1c1122669c0 join <Thread(tox-interrupt, started 6200)> == main thread calls thread.join()... which hangs == _wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: acquire(block=True, timeout=-1): lock obj= 0x1c1122669c0 File "C:\vstinner\python\3.10\lib\runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "C:\vstinner\python\3.10\lib\runpy.py", line 86, in _run_code exec(code, run_globals) File "C:\vstinner\env\lib\site-packages\tox\__main__.py", line 4, in <module> run() File "C:\vstinner\env\lib\site-packages\tox\run.py", line 23, in run result = main(sys.argv[1:] if args is None else args) File "C:\vstinner\env\lib\site-packages\tox\run.py", line 49, in main result = handler(state) File "C:\vstinner\env\lib\site-packages\tox\session\cmd\legacy.py", line 104, in legacy return run_sequential(state) File "C:\vstinner\env\lib\site-packages\tox\session\cmd\run\sequential.py", line 20, in run_sequential return execute(state, max_workers=1, has_spinner=False, live=True) File "C:\vstinner\env\lib\site-packages\tox\session\cmd\run\common.py", line 218, in execute thread.join() File "C:\vstinner\python\3.10\lib\threading.py", line 1092, in join self._wait_for_tstate_lock() File "C:\vstinner\python\3.10\lib\threading.py", line 1113, in _wait_for_tstate_lock traceback.print_stack() Explanation: * Context: The main thread is waiting on thread.join() * The thread 6200 completes: release_sentinel() is called to to release _tstate_lock * The main thread succeed to acquire _tstate_lock (of the thread 6200) since it was just release * Ooops oops oops, the main thread gets KeyboardInterrupt in Thread._wait_for_tstate_lock() before being able to release the lock. As if the function if interrupted here: def _wait_for_tstate_lock(self, block=True, timeout=-1): lock = self._tstate_lock if lock is None: assert self._is_stopped elif lock.acquire(block, timeout): # -- got KeyboardInterrupt here --- lock.release() self._stop() * (tox does something in the main thread) * (there are only one remaining thread: the main thread) * tox catchs KeyboardInterrupt and calls thread.join() again * thread.join() hangs because the _tstate_lock was already acquire, so lock.acquire() hangs forever => NOT GOOD You can reproduce the issue on Linux with attached patch and script: * Apply threading_bug.patch on Python * Run threading_bug.py * See that the script hangs forever Example: --- $ git apply threading_bug.patch $ ./python threading_bug.py join... join failed with: KeyboardInterrupt() join again... --- I'm now working on a PR to fix the race condition. |
|||
| msg402524 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2021-09-23 20:21 | |
For curious people, see also bpo-44422 "Fix threading.enumerate() reentrant call": another example of race condition recently fixed in the threading module. But it's unrelated to this bug ;-) |
|||
| msg402532 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * ![]() |
Date: 2021-09-23 21:39 | |
I am not sure that it can be solved at Python level.
Possible solutions:
* Add a Lock method (or just a builtin function) which acquires and immediately releases the lock, without possibility to interrupt.
if lock._blink(block, timeout):
self._stop()
* Add a context manager which suppresses keyboard interruption.
with suppress_interrupt():
if not lock._blink(block, timeout):
return
self._stop()
|
|||
| msg402535 - (view) | Author: Pablo Galindo Salgado (pablogsal) * ![]() |
Date: 2021-09-23 21:59 | |
> Add a context manager which suppresses keyboard interruption. That's not enough technically. This can be any signal handler that raises an exception, no? |
|||
| msg402540 - (view) | Author: Eryk Sun (eryksun) * ![]() |
Date: 2021-09-23 23:03 | |
See bpo-21822 from 2014, which I think was the first report of this issue. |
|||
| msg402542 - (view) | Author: gaborjbernat (gaborjbernat) * | Date: 2021-09-24 06:56 | |
I just want to note that my maximal reproducer moved to https://github.com/tox-dev/tox/commit/e5d1a439be0790c8104d4caf943b3d82f23a0039 (that has been merged on the branch, so should be now stable). |
|||
| msg402597 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2021-09-24 23:42 | |
> I am not sure that it can be solved at Python level. Right. In pure Python, we cannot write code which works in all cases. My PR 28532 fix the most common case: application interrupted by a single CTRL+C. > with suppress_interrupt(): (...) It's important to be able to interrupt acquire() which can be called in blocking mode with no timeout: it's exactly what tox does, and users expect to be able to interrupt tox in this case. > * Add a Lock method (or just a builtin function) which acquires and immediately releases the lock, without possibility to interrupt. The acquire()+release() sequence can be made atomic in C, but it doesn't solve the problem of _stop() which can be interrupted by a second exception. This bug is likely as old as Python. I don't think that we should attempt to design a perfect solution. I only propose to make the race condition (way) less likely. |
|||
| msg402704 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2021-09-27 12:20 | |
New changeset a22be4943c119fecf5433d999227ff78fc2e5741 by Victor Stinner in branch 'main': bpo-45274: Fix Thread._wait_for_tstate_lock() race condition (GH-28532) https://github.com/python/cpython/commit/a22be4943c119fecf5433d999227ff78fc2e5741 |
|||
| msg402706 - (view) | Author: miss-islington (miss-islington) | Date: 2021-09-27 12:49 | |
New changeset 1ecb641b887af2feb026a266e2fceedee0815ca8 by Miss Islington (bot) in branch '3.9': bpo-45274: Fix Thread._wait_for_tstate_lock() race condition (GH-28532) https://github.com/python/cpython/commit/1ecb641b887af2feb026a266e2fceedee0815ca8 |
|||
| msg402707 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2021-09-27 12:53 | |
New changeset fae2694bea5e9e5a114af8cb40b60e7131a6340c by Miss Islington (bot) in branch '3.10': bpo-45274: Fix Thread._wait_for_tstate_lock() race condition (GH-28532) (GH-28580) https://github.com/python/cpython/commit/fae2694bea5e9e5a114af8cb40b60e7131a6340c |
|||
| msg402710 - (view) | Author: Eryk Sun (eryksun) * ![]() |
Date: 2021-09-27 13:34 | |
This is the same as bpo-21822, so I suppose it should be closed as well with a reference to this issue. |
|||
| msg402718 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2021-09-27 14:11 | |
> This is the same as bpo-21822, so I suppose it should be closed as well with a reference to this issue. You're right. I marked bpo-21822 as a duplicate of this issue. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:59:50 | admin | set | github: 89437 |
| 2022-03-12 00:31:02 | vstinner | set | nosy:
- vstinner |
| 2022-03-04 18:16:41 | maggyero | set | nosy:
+ maggyero |
| 2022-02-12 00:00:05 | bjs | set | nosy:
+ bjs pull_requests: + pull_request29450 |
| 2021-09-27 14:11:56 | vstinner | set | messages: + msg402718 |
| 2021-09-27 14:11:30 | vstinner | link | issue21822 superseder |
| 2021-09-27 13:34:32 | eryksun | set | messages: + msg402710 |
| 2021-09-27 12:53:47 | vstinner | set | status: open -> closed stage: patch review -> resolved resolution: fixed versions: + Python 3.9, Python 3.10 |
| 2021-09-27 12:53:37 | vstinner | set | messages: + msg402707 |
| 2021-09-27 12:49:39 | miss-islington | set | messages: + msg402706 |
| 2021-09-27 12:21:25 | miss-islington | set | pull_requests: + pull_request26963 |
| 2021-09-27 12:21:21 | miss-islington | set | nosy:
+ miss-islington pull_requests:
+ pull_request26962 |
| 2021-09-27 12:20:35 | vstinner | set | messages: + msg402704 |
| 2021-09-24 23:42:10 | vstinner | set | messages: + msg402597 |
| 2021-09-24 08:16:32 | eryksun | set | nosy:
+ eryksun |
| 2021-09-24 06:56:23 | gaborjbernat | set | nosy:
- eryksun messages: + msg402542 |
| 2021-09-23 23:03:20 | eryksun | set | nosy:
+ eryksun messages: + msg402540 |
| 2021-09-23 22:40:44 | gaborjbernat | set | nosy:
+ gaborjbernat |
| 2021-09-23 21:59:53 | pablogsal | set | messages: + msg402535 |
| 2021-09-23 21:39:41 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka messages: + msg402532 |
| 2021-09-23 20:27:19 | vstinner | set | nosy:
+ pablogsal stage: patch review -> (no value) |
| 2021-09-23 20:22:50 | vstinner | set | stage: patch review pull_requests: + pull_request26920 |
| 2021-09-23 20:21:39 | vstinner | set | messages: + msg402524 |
| 2021-09-23 20:13:16 | vstinner | set | files: + threading_bug.py |
| 2021-09-23 20:13:09 | vstinner | create | |

