Resume from checkpoint #1595

meiji163 · 2025-10-13T05:05:47Z

Description

This PR introduces a checkpoint mechanism that can be used to resume a migration. In combination with --gtid, this allows the user to resume the migration using a different replica. With file-based coordinates, the migration must be resumed on the same replica. This is a continuation of @shlomi-noach's POC in #343. Closes #205

Usage: run gh-ost normally with the --checkpoint flag. If the migration is interrupted or killed, restart gh-ost with the same arguments plus the additional --resume flag. By default a checkpoint is written every 300 seconds; the interval can be configured with --checkpoint-seconds. Also see doc/resume.md.

In case this PR introduced Go code changes:

contributed code is using same conventions as original code
script/cibuild returns with no formatting errors, build errors or unit test errors.

Details

The two main operations of gh-ost are applying DML events from the binlog and copying rows to the ghost table.
A checkpoint saves the state of both:

the binlog coordinates of the transaction last applied to the gh-ost table (LastTrxCoords)
the range last copied to the gh-ost table (IterationRangeMin and IterationRangeMax)
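As a rough illustration of the state listed above, the following is a minimal sketch rather than the code in go/logic/checkpoint.go. The field names LastTrxCoords, IterationRangeMin and IterationRangeMax come from this description; the ID field, the placeholder types and the latestCheckpoint helper are assumptions made only for the example.

```go
package main

// Checkpoint is a simplified, hypothetical model of one saved state.
// Field names follow the PR description; the types are placeholders.
type Checkpoint struct {
	ID                int64         // assumed to increase with every checkpoint written
	LastTrxCoords     string        // binlog coordinates of the last applied transaction
	IterationRangeMin []interface{} // unique-key values at the start of the last copied range
	IterationRangeMax []interface{} // unique-key values at the end of the last copied range
}

// latestCheckpoint returns the checkpoint with the highest ID, or false when
// none has been written yet, in which case there is nothing to resume from.
func latestCheckpoint(all []Checkpoint) (Checkpoint, bool) {
	if len(all) == 0 {
		return Checkpoint{}, false
	}
	latest := all[0]
	for _, c := range all[1:] {
		if c.ID > latest.ID {
			latest = c
		}
	}
	return latest, true
}
```

On resume, only the most recent checkpoint matters in this sketch: the streamer would restart at its LastTrxCoords and row copy would continue from its iteration range.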

It is safe to resume the migration from this state because

DML event application is idempotent at the row level. If the binlog streamer resumes at coordinates less than or equal to those last processed by the applier, the final values are the same even if some DML events are applied twice.
Copying a row is also idempotent, since the second INSERT will fail with a duplicate key error. The DML applier then brings the row up to date as usual.
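The replay argument can be checked on a toy model. The sketch below is an in-memory simplification, not gh-ost code: the table is a map, each event carries a full row image, and an insert is modelled as an overwrite (an assumption of this sketch). All type and function names are hypothetical.

```go
package main

// Table is a row-level model of the ghost table: key -> row value.
type Table map[int]string

type EventKind int

const (
	Insert EventKind = iota
	Update
	Delete
)

// Event is a simplified row-based binlog event carrying the full row image.
type Event struct {
	Kind  EventKind
	Key   int
	Value string
}

// apply replays events in order. Inserts overwrite any existing row, updates
// only touch rows that exist, and deleting a missing row is a no-op.
func apply(t Table, events []Event) {
	for _, e := range events {
		switch e.Kind {
		case Insert:
			t[e.Key] = e.Value
		case Update:
			if _, ok := t[e.Key]; ok {
				t[e.Key] = e.Value
			}
		case Delete:
			delete(t, e.Key)
		}
	}
}

// equal reports whether two tables hold exactly the same rows.
func equal(a, b Table) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if w, ok := b[k]; !ok || w != v {
			return false
		}
	}
	return true
}

// resumeFrom simulates an interrupted run. events[:applied] reached the table
// before the interruption; the resumed run then replays events[checkpoint:],
// where checkpoint <= applied, so some events are applied twice.
func resumeFrom(events []Event, checkpoint, applied int) Table {
	t := Table{}
	apply(t, events[:applied])
	apply(t, events[checkpoint:])
	return t
}
```

For a valid event history, resumeFrom yields the same table as a single uninterrupted apply for every checkpoint position at or before the interruption point, which is the property the checkpoint relies on.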

To store the checkpoint we use a new _ghk table, which looks like

CREATE TABLE _${original_tablename}_ghk (
    `gh_ost_chk_id` bigint auto_increment primary key,
    `gh_ost_chk_timestamp` bigint,
    `gh_ost_chk_coords` varchar(4096),
    `gh_ost_chk_iteration` bigint,
    `gh_ost_rows_copied` bigint,
    `gh_ost_dml_applied` bigint,
    `c1_min`,`c2_min`, ...,`cn_min`,
    `c1_max`,`c2_max`, ...,`cn_max`
);

where (c1_min, c2_min, ..., cn_min) and (c1_max, ..., cn_max) are created with the same types as the shared unique key (c1, c2, ..., cn) used by gh-ost.
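To make the column layout concrete, here is a minimal sketch of how such a statement could be assembled from the unique key columns. It is not the builder in go/sql/builder.go: KeyColumn, checkpointColumnDefs and checkpointTableDDL are hypothetical names, and the sketch does no identifier escaping or name truncation.

```go
package main

import (
	"fmt"
	"strings"
)

// KeyColumn is a hypothetical description of one column of the shared unique key.
type KeyColumn struct {
	Name      string
	MySQLType string
}

// checkpointColumnDefs returns the fixed bookkeeping columns followed by one
// _min and one _max column per unique key column, each with the key's type.
func checkpointColumnDefs(key []KeyColumn) ([]string, error) {
	defs := []string{
		"`gh_ost_chk_id` bigint auto_increment primary key",
		"`gh_ost_chk_timestamp` bigint",
		"`gh_ost_chk_coords` varchar(4096)",
		"`gh_ost_chk_iteration` bigint",
		"`gh_ost_rows_copied` bigint",
		"`gh_ost_dml_applied` bigint",
	}
	for _, suffix := range []string{"min", "max"} {
		for _, col := range key {
			if col.MySQLType == "" {
				return nil, fmt.Errorf("column %s has no type information", col.Name)
			}
			defs = append(defs, fmt.Sprintf("`%s_%s` %s", col.Name, suffix, col.MySQLType))
		}
	}
	return defs, nil
}

// checkpointTableDDL renders the CREATE TABLE statement for the given table.
func checkpointTableDDL(originalTable string, key []KeyColumn) (string, error) {
	defs, err := checkpointColumnDefs(key)
	if err != nil {
		return "", err
	}
	body := strings.Join(defs, ",\n    ")
	return fmt.Sprintf("CREATE TABLE `_%s_ghk` (\n    %s\n)", originalTable, body), nil
}
```

Reusing the key's own types means the saved range values can be written back and read without conversion, which is why the sketch refuses columns that lack type information.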

Testing

Replica Tests

I tested resuming with --test-on-replica under a synthetic sysbench OLTP write load of ~2k DML/sec. I created a sysbench table with 300M rows and ran a no-op migration with --gtid and --checkpoint, set to time out after 10min. 10 seconds after the migration timed out, I started a new gh-ost process with --resume. When the migration finished, the ghost and original tables were checksummed, revealing no data discrepancy. ✅

I repeated this test using an initial timeout of 20min and a waiting period of 1hr before resuming. The data integrity check also passed. In addition, the test passed when run on two testing replicas in production clusters.

Switching Replicas

I tested resuming gh-ost using a different replica than the one it was originally attached to:

Using the same 300M test table and sysbench write load, I started the migration: gh-ost --alter='add index k_2 (c)' --host='replica1' --gtid --checkpoint
After 10min, I killed the migration.
After waiting 5min, I resumed the migration using a second replica: gh-ost --alter='add index k_2 (c)' --host='replica2' --gtid --checkpoint --resume
After a few minutes, I killed the sysbench write load (so that no DML happens after cutover).
Once the migration completed, I checksummed the original and ghost tables to verify data integrity. ✅

Failover Test

Using the same setup, I tested resuming the migration after a master failover triggered by orchestrator. The failover killed the migration, and I resumed it using the same replica. ✅

Copilot

Pull Request Overview

This PR introduces a checkpoint mechanism to allow gh-ost migrations to be resumed from interruption. Using the --checkpoint flag, gh-ost periodically saves its state (binlog coordinates and chunk copying progress) to a checkpoint table, enabling recovery with the --resume flag.

Adds checkpoint functionality with periodic state saving to _<table>_ghk table
Implements resume capability to restart migrations from last checkpoint
Modifies event streaming to support resuming from specific binlog coordinates

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

File	Description
localtests/test.sh	Updates test configuration with checkpoint flag and adjusted parameters
go/sql/types.go	Adds column metadata fields and Clone method for checkpoint support
go/sql/builder_test.go	Adds tests for checkpoint query builder functionality
go/sql/builder.go	Implements checkpoint table creation and insert query building
go/logic/streamer_test.go	Updates streamer tests to handle new BinlogEntry parameter
go/logic/streamer.go	Modifies event streaming to support initial coordinates and BinlogEntry
go/logic/migrator_test.go	Updates migrator tests for new BinlogEntry structure
go/logic/migrator.go	Implements core checkpoint and resume functionality
go/logic/inspect.go	Adds column nullability and type inspection for checkpoint table
go/logic/checkpoint.go	Defines Checkpoint struct for state storage
go/logic/applier_test.go	Adds comprehensive checkpoint read/write tests
go/logic/applier.go	Implements checkpoint table operations and state tracking
go/cmd/gh-ost/main.go	Adds command-line flags for checkpoint and resume features
go/base/context.go	Adds configuration fields for checkpoint functionality
doc/resume.md	Provides documentation for using the resume feature
doc/command-line-flags.md	Documents new checkpoint and resume command-line flags

Comments suppressed due to low confidence (1)

localtests/test.sh:1

The trap statement was removed but should be restored. Without this trap, cleanup won't occur on EXIT or TERM signals, potentially leaving test artifacts behind.

#!/bin/bash


Copilot · 2025-10-15T20:29:31Z

go/logic/migrator.go

 		len(this.applyEventsQueue), cap(this.applyEventsQueue),
 		base.PrettifyDurationOutput(elapsedTime), base.PrettifyDurationOutput(this.migrationContext.ElapsedRowCopyTime()),
-		currentBinlogCoordinates,
+		currentBinlogCoordinates.DisplayString(),


This assumes currentBinlogCoordinates is not nil, but there's no nil check. If currentBinlogCoordinates is nil, this will cause a panic.

Suggested change

-		currentBinlogCoordinates.DisplayString(),
+		func() string {
+			if currentBinlogCoordinates != nil {
+				return currentBinlogCoordinates.DisplayString()
+			}
+			return "N/A"
+		}(),

Copilot · 2025-10-15T20:29:32Z

go/logic/applier.go

+	}
+	for _, col := range this.migrationContext.UniqueKey.Columns.Columns() {
+		if col.MySQLType == "" {
+			return fmt.Errorf("CreateCheckpoinTable: column %s has no type information. applyColumnTypes must be called", sql.EscapeName(col.Name))


Corrected spelling of 'CreateCheckpoinTable' to 'CreateCheckpointTable'.

Suggested change

-			return fmt.Errorf("CreateCheckpoinTable: column %s has no type information. applyColumnTypes must be called", sql.EscapeName(col.Name))
+			return fmt.Errorf("CreateCheckpointTable: column %s has no type information. applyColumnTypes must be called", sql.EscapeName(col.Name))

danieljoos

awesome

meiji163 added 17 commits October 11, 2025 12:50

add Checkpoint table and read/write funcs

d6ee06d

handle no checkpoints returned

e2f0a20

store min and max range values in checkpoint

7c5fda8

resume from checkpoint

713ce97

add checkpoint file

a292d24

Merge branch 'master' into checkpoint

cec7f8d

fix unique key args

9012df0

update applier coordinates from _ghc heartbeat

c1fae0b

fix test

93953c9

fix linter

02f885f

make checkpoint interval configurable

8c2ad77

write checkpoint iteration number

d4ac082

store rows copied & dml applied

43e0d2c

truncate column name if necessary

4c7ed5f

drop checkpoint table for final cleanup

5284301

add docs

5662d9b

add resume doc

bcd19da

meiji163 changed the title ~~[WIP] Resume from checkpoint~~ Resume from checkpoint Oct 15, 2025

meiji163 marked this pull request as ready for review October 15, 2025 20:28

meiji163 requested review from rashiq and timvaillancourt as code owners October 15, 2025 20:28

meiji163 requested review from Copilot and removed request for rashiq and timvaillancourt October 15, 2025 20:28

Copilot AI reviewed Oct 15, 2025

danieljoos approved these changes Oct 16, 2025

meiji163 merged commit 1557a95 into master Oct 16, 2025
9 checks passed
