
Conversation

@shlomi-noach
Contributor

resubmission of openark#4 from downstream

This PR introduces --checksum-data, an opt-in checksum verification that runs throughout the migration.

With --checksum-data enabled, each rowcopy (a range of rows copied from the original table to the ghost table) is followed by a checksum on the two tables for that range.
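
For illustration, here is a minimal sketch of what such a per-range checksum could look like, assuming a single integer primary key and a pt-table-checksum-style BIT_XOR(CRC32(...)) aggregate. The table, column, and function names are hypothetical; this is not necessarily how gh-ost implements it internally:

```go
package checksum

import (
	"database/sql"
	"fmt"
)

// rangeChecksum computes an order-independent checksum of all rows whose key
// lies in [minKey, maxKey], folding the listed columns into each row's hash.
func rangeChecksum(db *sql.DB, table, keyCol, cols string, minKey, maxKey int64) (string, error) {
	query := fmt.Sprintf(
		`SELECT COUNT(*),
		        COALESCE(LOWER(CONV(BIT_XOR(CRC32(CONCAT_WS('#', %s))), 10, 16)), '0')
		 FROM %s WHERE %s BETWEEN ? AND ?`,
		cols, table, keyCol,
	)
	var count int64
	var crc string
	if err := db.QueryRow(query, minKey, maxKey).Scan(&count, &crc); err != nil {
		return "", err
	}
	return fmt.Sprintf("%d:%s", count, crc), nil
}

// checksumRangeMatches compares the same key range on the original and ghost
// tables. A mismatch is expected to be transient while binlog events are still
// being applied, so callers should retry rather than fail immediately.
func checksumRangeMatches(db *sql.DB, origTable, ghostTable, keyCol, cols string, minKey, maxKey int64) (bool, error) {
	origSum, err := rangeChecksum(db, origTable, keyCol, cols, minKey, maxKey)
	if err != nil {
		return false, err
	}
	ghostSum, err := rangeChecksum(db, ghostTable, keyCol, cols, minKey, maxKey)
	if err != nil {
		return false, err
	}
	return origSum == ghostSum, nil
}
```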

Checksums are executed concurrently with rowcopy and are the exception to gh-ost's single-threaded model.

A checksum may well fail while the migration is running: since gh-ost uses an asynchronous design, where binlog entries are applied some time after they are generated, it's quite possible that ongoing traffic will cause some checksums to fail.

A failed range's checksum is retried repeatedly until it succeeds.
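
A rough sketch of that concurrent verify-and-retry loop, again with hypothetical names (a range channel fed by rowcopy and a verify callback like the one above), not gh-ost's actual internals:

```go
package checksum

import "time"

// keyRange identifies one copied chunk by its primary key boundaries.
type keyRange struct {
	minKey, maxKey int64
}

// verifyFunc compares a key range on the original and ghost tables and
// reports whether their checksums match.
type verifyFunc func(keyRange) (bool, error)

// runChecksumWorker consumes ranges as rowcopy produces them, outside the
// single applier thread. A range that fails (typically because binlog events
// touching it have not been applied yet) is re-queued for a later attempt.
func runChecksumWorker(ranges chan keyRange, verify verifyFunc, errs chan<- error) {
	for r := range ranges {
		ok, err := verify(r)
		if err != nil {
			errs <- err
			continue
		}
		if !ok {
			// Re-queue after a short pause so a hot range doesn't spin;
			// done in a goroutine to avoid blocking on a full channel.
			go func(r keyRange) {
				time.Sleep(time.Second)
				ranges <- r
			}(r)
		}
	}
}
```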

When --checksum-data is enabled, cut-over does not complete if failed checksums are found. While tables are locked in preparation for cut-over, a grace period is given so that the checksum evaluation can run to completion.
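
A sketch of how that grace period could be enforced while the tables are locked, reusing the hypothetical keyRange and verifyFunc types from the previous sketch; if the deadline passes with mismatches still outstanding, the cut-over attempt is rolled back:

```go
package checksum

import (
	"fmt"
	"time"
)

// waitForChecksums re-verifies all pending ranges during cut-over, while
// writes are blocked. Since the tables are static at this point, every range
// should eventually match; the grace period bounds how long cut-over may be
// held up before the caller rolls back and retries later.
func waitForChecksums(pending []keyRange, verify verifyFunc, grace time.Duration) error {
	deadline := time.Now().Add(grace)
	for len(pending) > 0 {
		if time.Now().After(deadline) {
			return fmt.Errorf("%d ranges still unverified after %s; aborting cut-over", len(pending), grace)
		}
		var stillFailing []keyRange
		for _, r := range pending {
			ok, err := verify(r)
			if err != nil {
				return err
			}
			if !ok {
				stillFailing = append(stillFailing, r)
			}
		}
		pending = stillFailing
	}
	return nil
}
```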

This is experimental.

Risk assessment: risky!

With the flag disabled (the default), behavior does not change and risk is low. With the flag enabled, the following happen (or can happen):

  • More reads directly on the master server: these are the checksum tests, which run on both the original table and the ghost table. It's worth noting that the row-copy operation runs a full scan on the original table anyhow, so the extra reads do not (should not) bring into memory data pages not already brought in by row-copy.

  • Slower migration time due to extra reads

  • Risk at time of cut-over. At this time I have no access to a busy production server, so I have not verified this. The following scenario is possible:

    • migration is ready for cut-over
    • there are many checksums not yet fully verified (because production traffic was busy and kept changing data even while checksums were being calculated)
    • gh-ost begins cut-over, thus locking the tables for writes
    • table data is now static, so in theory all checksums should pass
    • but there are so many checksums left to evaluate that we hit a timeout, thus rolling back the migration
    • repeat.

    To clarify: I haven't seen this happen, but I predict it might show up in production.

I'm presenting this PR upstream for visibility. It's an important change that further validates (or invalidates!) the correctness of migrated data, so it may be of interest. I'd suggest massive experimentation.

