feat(#1235): add an optimizer switch flag to migration context by cyrinux · Pull Request #1236 · github/gh-ost

For people who stumble upon this later on and want to know the reason from a DBA point of view (I worked with Cyril to put this together)...

The backstory behind this is that we have a stable application that has been running for over 15 years on Mysql. When we upgraded from 5.7 to 8.0, we observed some severe performance impacts on a certain set of queries. It would have been particularly hard to rewrite them (since they are generated by a query builder that no-one is particularly keen to hack on in case the change introduces other side effects).

We worked out that we could recover the original performance (at the expense of marginally slowing a few other types of queries, which was deemed to be an acceptable tradeoff) by setting optimizer_switch='prefer_ordering_index=off'. And then we were able to completely forget about the issue.

Fast forward to a recent gh-ost session where an execute would repeatedly crash with

[mysql] 2022/12/21 16:32:48 packets.go:37: unexpected EOF

Why? Consider first, a source table:

CREATE TABLE thing (
    thing_id int not null auto_increment primary key
);

The table I was trying to ghost was, roughly:

CREATE TABLE thing_ref (
    thing_id int not null,
    ref_id int not null,
    primary key (thing_id, ref_id),
    key (ref_id)
);

In production, the thing table contains 165 million rows and the thing_ref table contains 1.43 billion rows (essentially, each thing can have a set of many refs, and we want to answer efficiently how many things have a particular ref).

When explaining the query that gh-ost issued to find the min/max migration values, it became apparent that it was using the index on ref_id instead of thing_id, ref_id which meant that, instead of being more or less instantaneous, mysql was doing a table scan over 1.43 billion rows to locate the min/max of thing_id, far too long to be acceptable. Reverting our local optimizer_switch hack (hence the original patch) restored the speed of the query back to something that was acceptable, and gh-ost was able to complete the migration of the table (taking 27 hours to do so!)

So yeah tl;dr issuing a force index should result in the desired behavior.