Refine `wait_timeout` override to be at cut-over only by timvaillancourt ยท Pull Request #1406 ยท github/gh-ost
Related issue: #1407
Description
This PR refines #1401 by overriding the session wait_timeout only where it is needed - at the cut-over time where an idle connection could lead to potentially-long table lock if the gh-ost process (or host running it) "freezes"/"stalls" at the cut-over stage
The change (at cut-over only):
- Before cut-over:
- The server
wait_timeoutis fetched (via an existingselectthat fetchedtime_zone) - The applier session
wait_timeoutis set to be 3 x the lock-wait timeout- This is to ensure the lock is released if
gh-oststalls with a still-active connection here
- This is to ensure the lock is released if
- The server
- The cut-over proceeds as normal
- After cut-over, the original session
wait_timeoutis restored to what it was set to pre-cut-over
The --mysql-wait-timeout flag added in #1401 is removed because it is no longer needed. No release has been cut since #1401, so this isn't necessarily a breaking change
In case this PR introduced Go code changes:
- contributed code is using same conventions as original code
-
script/cibuildreturns with no formatting errors, build errors or unit test errors.
timvaillancourt
changed the title
Refine wait_timeout flag to be cut-over only
Refine wait_timeout override to be at cut-over only
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work and in particular great analysis of this stalemate situation! This looks generally correct, but see inline comments:
- It break
gh-ostfor existing users - It adds variables where I think we can do without
- It adds code complexity which we can delegate to MySQL.
@shlomi-noach I'm curious about your feedback on one consequence of this change
Following this PR, it is possible for the session holding the lock tables to timeout (and unlock tables) before the "magic table" is dropped here
If understand this right, the lock-session hitting wait_timeout will cause the rename tables to succeed. That all sounds better than before, but the "magic table" will be left behind. The impact of this I don't fully understand
-initially-drop-old-table
Drop a possibly existing OLD table (remains from a previous run?) before beginning operation. Default is to panic and abort if such table existsWould gh-ost just-fix this scenario for users with -initially-drop-old-table? Any other race-condition risks you can see the lock-release causing ๐ค? ๐
If understand this right, the lock-session hitting wait_timeout will cause the rename tables to succeed.
No, actually. The RENAME will not succeed, because the magic table is still in place. The RENAME statement attempts to rename original-table into magic-table. But since magic-table is there, the RENAME will fail.
The next cut-over attempt will first, before placing any locks, attempt to DropAtomicCutOverSentryTableIfExists() before re-creating it.
This should be safe.
If understand this right, the lock-session hitting wait_timeout will cause the rename tables to succeed.
No, actually. The
RENAMEwill not succeed, because the magic table is still in place. TheRENAMEstatement attempts to rename original-table into magic-table. But since magic-table is there, theRENAMEwill fail.The next cut-over attempt will first, before placing any locks, attempt to
DropAtomicCutOverSentryTableIfExists()before re-creating it.This should be safe.
@shlomi-noach that makes sense (eventually)! Thanks for the validations and explanations
@shlomi-noach / @meiji163: I believe I've addressed the PR suggestions and this is ready for another review ๐
Merging. @shlomi-noach let me know if I missed something and I'll make a follow-up PR ๐
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters