@meiji163 meiji163 commented Oct 13, 2025

Description

This PR introduces a checkpoint mechanism that can be used to resume a migration. In combination with --gtid, this allows the user to resume the migration using a different replica. With file-based coordinates, the migration must be resumed using the same replica. This is a continuation of @shlomi-noach's POC in #343. Closes #205

Usage: run gh-ost normally with the --checkpoint flag. If the migration is interrupted or killed, restart gh-ost with the same arguments plus the --resume flag. By default a checkpoint is written every 300 seconds; the interval can be configured with --checkpoint-seconds. Also see doc/resume.md.

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

Details

The two main operations of gh-ost are applying DML events from the binlog and copying rows to the ghost table.
A checkpoint saves the state of both:

  1. the binlog coordinates of the transaction last applied to the ghost table (LastTrxCoords)
  2. the range last copied to the ghost table (IterationRangeMin and IterationRangeMax)
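As a rough illustration of the two pieces of state listed above, here is a minimal Go sketch. The field names LastTrxCoords, IterationRangeMin and IterationRangeMax come from the description; the field types and the IsComplete helper are assumptions made for this example and are not taken from go/logic/checkpoint.go.

```go
package main

import "fmt"

// Checkpoint is a simplified, hypothetical snapshot of migration progress.
// The field names follow the PR description; the types are assumptions.
type Checkpoint struct {
	// LastTrxCoords holds the binlog coordinates of the last transaction
	// applied to the ghost table (a GTID set or a file:position pair).
	LastTrxCoords string
	// IterationRangeMin and IterationRangeMax hold the unique-key values
	// bounding the last chunk copied to the ghost table, one per key column.
	IterationRangeMin []string
	IterationRangeMax []string
}

// IsComplete is a hypothetical sanity check: both halves of the state must
// be present, and the two range bounds must cover the same key columns.
func (c Checkpoint) IsComplete() bool {
	return c.LastTrxCoords != "" &&
		len(c.IterationRangeMin) > 0 &&
		len(c.IterationRangeMin) == len(c.IterationRangeMax)
}

func main() {
	chk := Checkpoint{
		LastTrxCoords:     "mysql-bin.000012:4567",
		IterationRangeMin: []string{"1000"},
		IterationRangeMax: []string{"2000"},
	}
	fmt.Println("checkpoint usable for resume:", chk.IsComplete())
}
```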

It is safe to resume the migration from this state because

  1. DML event application is idempotent at the row level. If the binlog streamer resumes at coordinates less than or equal to the coordinates last processed by the applier, the final values should be the same even if some DML events are applied twice.
  2. Copying a row is also idempotent, since the second INSERT will fail with a duplicate key error. The DML applier will then bring the row up to date as usual.
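The first point can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not gh-ost's applier: the Event type, the Apply function and the sameRows helper are hypothetical, rows are modelled as a map keyed by primary key, inserts overwrite the row, updates touch only existing rows, and deletes remove the row.

```go
package main

import "fmt"

// Op is the kind of row-level DML event in this simplified model.
type Op int

const (
	Insert Op = iota
	Update
	Delete
)

// Event is a hypothetical row-level binlog event carrying a full row image.
type Event struct {
	Op    Op
	Key   int
	Value string
}

// Apply replays events against an in-memory table.
// Insert overwrites the row, Update changes it only if it exists,
// and Delete removes it, so replaying an event twice is harmless.
func Apply(table map[int]string, events []Event) {
	for _, e := range events {
		switch e.Op {
		case Insert:
			table[e.Key] = e.Value
		case Update:
			if _, ok := table[e.Key]; ok {
				table[e.Key] = e.Value
			}
		case Delete:
			delete(table, e.Key)
		}
	}
}

// sameRows reports whether two tables hold exactly the same rows.
func sameRows(a, b map[int]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if w, ok := b[k]; !ok || w != v {
			return false
		}
	}
	return true
}

func main() {
	events := []Event{
		{Insert, 1, "a"}, {Update, 1, "b"}, {Insert, 2, "x"},
		{Delete, 2, ""}, {Update, 1, "c"},
	}

	// Uninterrupted run: every event is applied exactly once.
	once := map[int]string{}
	Apply(once, events)

	// Interrupted run: the first four events were applied, but the
	// checkpoint only recorded progress up to the second event, so the
	// resumed process replays from index 2 and applies two events twice.
	resumed := map[int]string{}
	Apply(resumed, events[:4])
	Apply(resumed, events[2:])

	fmt.Println("same final state:", sameRows(once, resumed))
}
```

The model relies on the replay starting at or before the last applied event and running through to the end of the stream, which mirrors the condition stated in point 1.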

To store the checkpoint we use a new _ghk table, which looks like

CREATE TABLE _${original_tablename}_ghk (
    `gh_ost_chk_id` bigint auto_increment primary key,
    `gh_ost_chk_timestamp` bigint,
    `gh_ost_chk_coords` varchar(4096),
    `gh_ost_chk_iteration` bigint,
    `gh_ost_rows_copied` bigint,
    `gh_ost_dml_applied` bigint,
    `c1_min`,`c2_min`, ...,`cn_min`,
    `c1_max`,`c2_max`, ...,`cn_max`
);

where (c1_min, c2_min, ..., cn_min) and (c1_max, c2_max, ..., cn_max) are created with the same types as the shared unique key (c1, c2, ..., cn) used by gh-ost.
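To make the column layout concrete, the following Go sketch generates such a statement from a list of unique-key columns. It is a hypothetical illustration, not the query builder in go/sql/builder.go: the KeyColumn type and the CheckpointColumns and CheckpointTableDDL functions are invented here, and identifiers are assumed not to contain backticks, so no escaping is done.

```go
package main

import (
	"fmt"
	"strings"
)

// KeyColumn is a hypothetical description of one unique-key column.
type KeyColumn struct {
	Name string // column name in the original table
	Type string // MySQL column type, for example "bigint"
}

// CheckpointColumns returns the column definitions of the checkpoint table:
// the six fixed columns, then one _min column per key column, then one
// _max column per key column, each with the key column's own type.
func CheckpointColumns(key []KeyColumn) []string {
	cols := []string{
		"`gh_ost_chk_id` bigint auto_increment primary key",
		"`gh_ost_chk_timestamp` bigint",
		"`gh_ost_chk_coords` varchar(4096)",
		"`gh_ost_chk_iteration` bigint",
		"`gh_ost_rows_copied` bigint",
		"`gh_ost_dml_applied` bigint",
	}
	for _, suffix := range []string{"min", "max"} {
		for _, c := range key {
			cols = append(cols, fmt.Sprintf("`%s_%s` %s", c.Name, suffix, c.Type))
		}
	}
	return cols
}

// CheckpointTableDDL assembles the CREATE TABLE statement for the
// checkpoint table of the given original table.
func CheckpointTableDDL(table string, key []KeyColumn) string {
	return fmt.Sprintf("CREATE TABLE `_%s_ghk` (\n    %s\n)",
		table, strings.Join(CheckpointColumns(key), ",\n    "))
}

func main() {
	key := []KeyColumn{{Name: "c1", Type: "int"}, {Name: "c2", Type: "varchar(32)"}}
	fmt.Println(CheckpointTableDDL("sbtest1", key))
}
```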

Testing

Replica Tests

I tested resuming with --test-on-replica under a synthetic sysbench OLTP write load of ~2k DML/sec. I created a sysbench table with 300M rows and ran a no-op migration with --gtid and --checkpoint, set to time out after 10min. Ten seconds after the migration timed out, I started a new gh-ost process with --resume. When the migration finished, I checksummed the ghost and original tables, which revealed no data discrepancy. ✅

I repeated this test using an initial timeout of 20min and a waiting period of 1hr before resuming. The data integrity check also passed. In addition, the test passed when run on two testing replicas in production clusters.

Switching Replicas

I tested resuming gh-ost using a different replica than the one it was originally attached to:

  1. Using the same 300M test table and sysbench write load, I started the migration gh-ost --alter='add index k_2 (c)' --host='replica1' --gtid --checkpoint
  2. After 10min, I killed the migration
  3. After waiting 5min, I resumed the migration using a second replica: gh-ost --alter='add index k_2 (c)' --host='replica2' --gtid --checkpoint --resume
  4. After a few minutes, I killed the sysbench write load (so no DML happens after cutover).
  5. Once the migration completed, I checksummed the original and ghost tables to verify data integrity. ✅

Failover Test

Using the same setup, I tested resuming the migration after a master failover triggered by orchestrator. The failover killed the migration, and I resumed it using the same replica. ✅

@meiji163 meiji163 changed the title from "[WIP] Resume from checkpoint" to "Resume from checkpoint" Oct 15, 2025
@meiji163 meiji163 marked this pull request as ready for review October 15, 2025 20:28
@meiji163 meiji163 requested review from Copilot and removed request for rashiq and timvaillancourt October 15, 2025 20:28

Copilot AI left a comment


Pull Request Overview

This PR introduces a checkpoint mechanism to allow gh-ost migrations to be resumed from interruption. Using the --checkpoint flag, gh-ost periodically saves its state (binlog coordinates and chunk copying progress) to a checkpoint table, enabling recovery with the --resume flag.

  • Adds checkpoint functionality with periodic state saving to _<table>_ghk table
  • Implements resume capability to restart migrations from last checkpoint
  • Modifies event streaming to support resuming from specific binlog coordinates

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

File | Description
localtests/test.sh | Updates test configuration with checkpoint flag and adjusted parameters
go/sql/types.go | Adds column metadata fields and Clone method for checkpoint support
go/sql/builder_test.go | Adds tests for checkpoint query builder functionality
go/sql/builder.go | Implements checkpoint table creation and insert query building
go/logic/streamer_test.go | Updates streamer tests to handle new BinlogEntry parameter
go/logic/streamer.go | Modifies event streaming to support initial coordinates and BinlogEntry
go/logic/migrator_test.go | Updates migrator tests for new BinlogEntry structure
go/logic/migrator.go | Implements core checkpoint and resume functionality
go/logic/inspect.go | Adds column nullability and type inspection for checkpoint table
go/logic/checkpoint.go | Defines Checkpoint struct for state storage
go/logic/applier_test.go | Adds comprehensive checkpoint read/write tests
go/logic/applier.go | Implements checkpoint table operations and state tracking
go/cmd/gh-ost/main.go | Adds command-line flags for checkpoint and resume features
go/base/context.go | Adds configuration fields for checkpoint functionality
doc/resume.md | Provides documentation for using the resume feature
doc/command-line-flags.md | Documents new checkpoint and resume command-line flags
Comments suppressed due to low confidence (1)

localtests/test.sh:1

  • The trap statement was removed but should be restored. Without this trap, cleanup won't occur on EXIT or TERM signals, potentially leaving test artifacts behind.
#!/bin/bash


len(this.applyEventsQueue), cap(this.applyEventsQueue),
base.PrettifyDurationOutput(elapsedTime), base.PrettifyDurationOutput(this.migrationContext.ElapsedRowCopyTime()),
currentBinlogCoordinates,
currentBinlogCoordinates.DisplayString(),

Copilot AI Oct 15, 2025


This assumes currentBinlogCoordinates is not nil, but there's no nil check. If currentBinlogCoordinates is nil, this will cause a panic.

Suggested change
currentBinlogCoordinates.DisplayString(),
func() string {
	if currentBinlogCoordinates != nil {
		return currentBinlogCoordinates.DisplayString()
	}
	return "N/A"
}(),

}
for _, col := range this.migrationContext.UniqueKey.Columns.Columns() {
if col.MySQLType == "" {
return fmt.Errorf("CreateCheckpoinTable: column %s has no type information. applyColumnTypes must be called", sql.EscapeName(col.Name))

Copilot AI Oct 15, 2025


Corrected spelling of 'CreateCheckpoinTable' to 'CreateCheckpointTable'.

Suggested change
return fmt.Errorf("CreateCheckpoinTable: column %s has no type information. applyColumnTypes must be called", sql.EscapeName(col.Name))
return fmt.Errorf("CreateCheckpointTable: column %s has no type information. applyColumnTypes must be called", sql.EscapeName(col.Name))


@danieljoos danieljoos left a comment


awesome

@meiji163 meiji163 merged commit 1557a95 into master Oct 16, 2025
9 checks passed

Development

Successfully merging this pull request may close these issues.

Allow a migration to resurrect under a new gh-ost process

3 participants