fix: prevent duplicate request processing in delayRequest by dearlordylord · Pull Request #3504 · apify/crawlee
Conversation
delayRequest removes request from inProgress immediately, then re-adds via timer. Between those two points another worker can fetch the same request, causing duplicate processing.
Fix: stop manipulating inProgress in delayRequest. The request stays guarded from fetchNextRequest through reclaimRequest's natural 3s cleanup timer.
The inProgress delete/add was introduced in #2045 to "clean up inProgress cache" during delays. But even at that commit, reclaimRequest had an inProgress.has() guard that passed fine without the delete - the manipulation was never needed. It only made inProgressCount appear lower during delays, while opening the race window. Keeping the request in inProgress during the delay is semantically correct: it is pending indeed.
Regression test included (red/green)
refs #3427
Hi @dearlordylord and thanks for your contribution. At a glance, the functionality that you're removing seems intentional, can you investigate why it was added in the first place?
Hi @dearlordylord and thanks for your contribution. At a glance, the functionality that you're removing seems intentional, can you investigate why it was added in the first place?
Hi @janbuchar thank you for checking. Could you please check the last paragraph of my pr description where I have the investigation of the removed functionality? Would that be sufficient?
Hi @dearlordylord and thanks for your contribution. At a glance, the functionality that you're removing seems intentional, can you investigate why it was added in the first place?
Hi @janbuchar thank you for checking. Could you please check the last paragraph of my pr description where I have the investigation of the removed functionality? Would that be sufficient?
Yeah, I missed that, I'm sorry. @B4nan do you remember why that change was needed, 2.5 years ago? It looks like the issue this PR is dealing with has actually appeared almost immediately after PR #2045.
Also, the code snippet in #3427 does not contain sameDomainDelaySecs, so I don't think that the delayed re-insertion of requests should even happen there.
Can't say I remember it exactly, but reading the PR title, I think it was solving some deadlocks caused by stale inProgress cache. Stale requests in there were causing the zero-concurrency bugs back in the day (but that was likely long before you joined, a bug that was so hard to reproduce that it often took many hours of running an actor).
This was surely not about "making things work", this was about "not leaking", and "not causing dead locks".
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters