Resume from checkpoint #1595
Pull Request Overview
This PR introduces a checkpoint mechanism to allow gh-ost migrations to be resumed from interruption. Using the --checkpoint flag, gh-ost periodically saves its state (binlog coordinates and chunk copying progress) to a checkpoint table, enabling recovery with the --resume flag.
- Adds checkpoint functionality with periodic state saving to the _<table>_ghk table
- Implements resume capability to restart migrations from last checkpoint
- Modifies event streaming to support resuming from specific binlog coordinates
Reviewed Changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| localtests/test.sh | Updates test configuration with checkpoint flag and adjusted parameters |
| go/sql/types.go | Adds column metadata fields and Clone method for checkpoint support |
| go/sql/builder_test.go | Adds tests for checkpoint query builder functionality |
| go/sql/builder.go | Implements checkpoint table creation and insert query building |
| go/logic/streamer_test.go | Updates streamer tests to handle new BinlogEntry parameter |
| go/logic/streamer.go | Modifies event streaming to support initial coordinates and BinlogEntry |
| go/logic/migrator_test.go | Updates migrator tests for new BinlogEntry structure |
| go/logic/migrator.go | Implements core checkpoint and resume functionality |
| go/logic/inspect.go | Adds column nullability and type inspection for checkpoint table |
| go/logic/checkpoint.go | Defines Checkpoint struct for state storage |
| go/logic/applier_test.go | Adds comprehensive checkpoint read/write tests |
| go/logic/applier.go | Implements checkpoint table operations and state tracking |
| go/cmd/gh-ost/main.go | Adds command-line flags for checkpoint and resume features |
| go/base/context.go | Adds configuration fields for checkpoint functionality |
| doc/resume.md | Provides documentation for using the resume feature |
| doc/command-line-flags.md | Documents new checkpoint and resume command-line flags |
Comments suppressed due to low confidence (1)
localtests/test.sh:1
- The trap statement was removed but should be restored. Without this trap, cleanup won't occur on EXIT or TERM signals, potentially leaving test artifacts behind.
#!/bin/bash
| len(this.applyEventsQueue), cap(this.applyEventsQueue), | ||
| base.PrettifyDurationOutput(elapsedTime), base.PrettifyDurationOutput(this.migrationContext.ElapsedRowCopyTime()), | ||
| currentBinlogCoordinates, | ||
| currentBinlogCoordinates.DisplayString(), |
Copilot
AI
Oct 15, 2025
This assumes currentBinlogCoordinates is not nil, but there's no nil check. If currentBinlogCoordinates is nil, this will cause a panic.
| currentBinlogCoordinates.DisplayString(), | |
| func() string { | |
| if currentBinlogCoordinates != nil { | |
| return currentBinlogCoordinates.DisplayString() | |
| } | |
| return "N/A" | |
| }(), |
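The same guard can be written as a small named helper instead of an inline closure. This is only a sketch: `fileCoords` and `displayOrNA` are hypothetical names, and the real coordinates type in gh-ost may differ (for example, it may be an interface, in which case the nil check needs more care).

```go
package main

import "fmt"

// fileCoords is a hypothetical stand-in for a binlog coordinates type.
type fileCoords struct {
	LogFile string
	LogPos  int64
}

// DisplayString dereferences the receiver, so calling it on a nil pointer panics.
func (c *fileCoords) DisplayString() string {
	return fmt.Sprintf("%s:%d", c.LogFile, c.LogPos)
}

// displayOrNA returns a placeholder when no coordinates are available yet.
func displayOrNA(c *fileCoords) string {
	if c == nil {
		return "N/A"
	}
	return c.DisplayString()
}

func main() {
	fmt.Println(displayOrNA(nil))
	fmt.Println(displayOrNA(&fileCoords{LogFile: "mysql-bin.000042", LogPos: 1234}))
}
```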
| } | ||
| for _, col := range this.migrationContext.UniqueKey.Columns.Columns() { | ||
| if col.MySQLType == "" { | ||
| return fmt.Errorf("CreateCheckpoinTable: column %s has no type information. applyColumnTypes must be called", sql.EscapeName(col.Name)) |
Copilot
AI
Oct 15, 2025
Corrected spelling of 'CreateCheckpoinTable' to 'CreateCheckpointTable'.
| return fmt.Errorf("CreateCheckpoinTable: column %s has no type information. applyColumnTypes must be called", sql.EscapeName(col.Name)) | |
| return fmt.Errorf("CreateCheckpointTable: column %s has no type information. applyColumnTypes must be called", sql.EscapeName(col.Name)) |
awesome
Description
This PR introduces a checkpoint mechanism that can be used to resume a migration. In combination with --gtid, this allows the user to resume the migration using a different replica. With file-based coordinates, the migration must be resumed using the same replica. This is a continuation of @shlomi-noach's POC in #343. Closes #205
Usage: run gh-ost normally with the --checkpoint flag. If the migration is interrupted or killed, restart gh-ost with the same arguments plus the --resume flag. By default a checkpoint is written every 300 seconds; the interval can be configured with --checkpoint-seconds. Also see doc/resume.md.
- script/cibuild returns with no formatting errors, build errors or unit test errors.
Details
The two main operations of gh-ost are applying DML events from the binlog and copying rows to the ghost table.
A checkpoint saves the state of both:
- the binlog coordinates (LastTrxCoords)
- the chunk copying progress (IterationRangeMin and IterationRangeMax)
It is safe to resume the migration from this state because
INSERT will fail with a duplicate key error. Then the DML applier will bring the row up to date as usual.
To store the checkpoint we use a new _ghk table, which looks like
where (c1_min, c2_min, ..., cn_min) and (c1_max, ..., cn_max) are created with the same types as the shared unique key (c1, c2, ..., cn) used by gh-ost.
Testing
Replica Tests
I tested resuming with --test-on-replica under a synthetic sysbench OLTP write load of ~2k DML/sec. I created a sysbench table with 300M rows and ran a no-op migration with --gtid and --checkpoint, set to time out after 10min. 10 seconds after the migration timed out, I started a new gh-ost process with --resume. When the migration finished, the ghost and original tables were checksummed, revealing no data discrepancy. ✅
I repeated this test using an initial timeout of 20min and a waiting period of 1hr before resuming. The data integrity check also passed. In addition, the test passed running on two testing replicas in production clusters.
Switching Replicas
I tested resuming gh-ost using a different replica than the original one it was attached to:
gh-ost --alter='add index k_2 (c)' --host='replica1' --gtid --checkpoint
gh-ost --alter='add index k_2 (c)' --host='replica2' --gtid --checkpoint --resume
Failover Test
Using the same setup, I tested resuming migration after a master failover triggered by orchestrator. The failover kills the migration, and I resumed the migration using the same replica. ✅
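For reference, the checkpoint state described under Details can be sketched roughly as follows. This is an illustration, not the struct or query builder from this PR: the field names come from the description, but the field types, the `checkpointColumns` helper, and the column order are assumptions.

```go
package main

import "fmt"

// Checkpoint is a simplified, hypothetical version of the saved state.
// Field names follow the PR description; the types are assumptions.
type Checkpoint struct {
	// Binlog coordinates to resume streaming from (file-based or GTID).
	LastTrxCoords string
	// Bounds of the saved row-copy chunk, one value per unique key column.
	IterationRangeMin []interface{}
	IterationRangeMax []interface{}
}

// checkpointColumns (hypothetical helper) derives the range column names for
// the shared unique key (c1, ..., cn): all *_min columns, then all *_max columns.
func checkpointColumns(uniqueKey []string) []string {
	cols := make([]string, 0, 2*len(uniqueKey))
	for _, c := range uniqueKey {
		cols = append(cols, c+"_min")
	}
	for _, c := range uniqueKey {
		cols = append(cols, c+"_max")
	}
	return cols
}

func main() {
	cp := Checkpoint{
		LastTrxCoords:     "mysql-bin.000042:1234",
		IterationRangeMin: []interface{}{1000, "2025-01-01"},
		IterationRangeMax: []interface{}{2000, "2025-01-02"},
	}
	fmt.Println(checkpointColumns([]string{"id", "created_at"}))
	fmt.Printf("%+v\n", cp)
}
```

Storing one min and one max value per unique key column is what lets the range columns reuse the key's own column types.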