Resume from checkpoint by meiji163 · Pull Request #1595 · github/gh-ost
Description
This PR introduces a checkpoint mechanism that can be used to resume a migration. In combination with `--gtid`, this allows the user to resume the migration using a different replica. If using file-based coordinates, the migration must be resumed using the same replica. This is a continuation of @shlomi-noach's POC in #343. Closes #205
Usage: run gh-ost normally with the `--checkpoint` flag. If the migration is interrupted/killed, restart gh-ost with the same arguments plus the additional `--resume` flag. By default a checkpoint is taken every 300 seconds, but the interval can be configured with `--checkpoint-seconds`. Also see doc/resume.md.
In case this PR introduced Go code changes:
- contributed code is using same conventions as original code
- `script/cibuild` returns with no formatting errors, build errors or unit test errors
Details
The two main operations of gh-ost are applying DML events from the binlog and copying rows to the ghost table.
A checkpoint saves the state of both:
- the binlog coordinates of the transaction last applied to the gh-ost table (`LastTrxCoords`)
- the range last copied to the gh-ost table (`IterationRangeMin` and `IterationRangeMax`)
It is safe to resume the migration from this state because:
- DML event application is idempotent at the row level. If the binlog streamer resumes at coordinates smaller than or equal to the coordinates last processed by the applier, the final values should be the same even if some DML events are applied twice.
- Copying a row is also idempotent, since a second `INSERT` will fail with a duplicate-key error. The DML applier will then bring the row up to date as usual.
To store the checkpoint we use a new `_ghk` table, which looks like:

```sql
CREATE TABLE _${original_tablename}_ghk (
  `gh_ost_chk_id` bigint auto_increment primary key,
  `gh_ost_chk_timestamp` bigint,
  `gh_ost_chk_coords` varchar(4096),
  `gh_ost_chk_iteration` bigint,
  `gh_ost_rows_copied` bigint,
  `gh_ost_dml_applied` bigint,
  `c1_min`, `c2_min`, ..., `cn_min`,
  `c1_max`, `c2_max`, ..., `cn_max`
);
```
where `(c1_min, c2_min, ..., cn_min)` and `(c1_max, ..., cn_max)` are created with the same types as the shared unique key `(c1, c2, ..., cn)` used by gh-ost.
Testing
Replica Tests
I tested resuming with `--test-on-replica` under a synthetic sysbench OLTP write load of ~2k DML/sec. I created a sysbench table with 300M rows and ran a no-op migration with `--gtid` and `--checkpoint`, set to time out after 10min. 10 seconds after the migration timed out, I started a new gh-ost process with `--resume`. When the migration finished, the ghost and original tables were checksummed, revealing no data discrepancy. ✅
I repeated this test using an initial timeout of 20min and a waiting period of 1hr before resuming. The data integrity check also passed. In addition, the test passed when run on two testing replicas in production clusters.
Switching Replicas
I tested resuming gh-ost using a different replica than the one it was originally attached to:
- Using the same 300M test table and sysbench write load, I started the migration:
  `gh-ost --alter='add index k_2 (c)' --host='replica1' --gtid --checkpoint`
- After 10min, I killed the migration.
- After waiting 5min, I resumed the migration using a second replica:
  `gh-ost --alter='add index k_2 (c)' --host='replica2' --gtid --checkpoint --resume`
- After a few minutes, I killed the sysbench write load (so no DML happens after cutover).
- Once migration completed, I checksummed the original and ghost table to verify data integrity. ✅
Failover Test
Using the same setup, I tested resuming a migration after a master failover triggered by orchestrator. The failover killed the migration, and I resumed it using the same replica. ✅