feat(failover): Implement Graceful Failover Feature #3295

yuzegao · 2025-12-14T04:20:34Z

Summary

This PR implements graceful failover for Kvrocks cluster, allowing a master node to safely transfer control to a slave node while ensuring data consistency and minimizing service disruption.

Background

Based on GitHub Discussion #3218, this feature enables controlled master-to-slave failover with:

Data consistency guarantee (waits for replication sync)
Write blocking during critical phases
Configurable timeout
State machine-based async execution

Implementation

Architecture

Independent module: ClusterFailover class, parallel to SlotMigrator
Async execution: Dedicated background thread for failover process
State machine: 8 states (none → started → check_slave → pause_write → wait_sync → switching → success/failed)
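
The state progression listed above can be sketched as a small transition function. This is only an illustration: the enum, `StateName`, and `NextState` are hypothetical names, not identifiers taken from the PR, and the real `ClusterFailover` class may structure its transitions differently.

```cpp
#include <string_view>

// Hypothetical enum mirroring the 8 states named in the PR description.
enum class FailoverState {
  kNone, kStarted, kCheckSlave, kPauseWrite, kWaitSync, kSwitching, kSuccess, kFailed
};

// Label as it might appear after `cluster_failover_state:` (assumed spelling).
constexpr std::string_view StateName(FailoverState s) {
  switch (s) {
    case FailoverState::kNone: return "none";
    case FailoverState::kStarted: return "started";
    case FailoverState::kCheckSlave: return "check_slave";
    case FailoverState::kPauseWrite: return "pause_write";
    case FailoverState::kWaitSync: return "wait_sync";
    case FailoverState::kSwitching: return "switching";
    case FailoverState::kSuccess: return "success";
    case FailoverState::kFailed: return "failed";
  }
  return "unknown";
}

// Advance one step. Resting states (none, success, failed) stay put;
// any in-progress step that reports failure moves to kFailed.
constexpr FailoverState NextState(FailoverState s, bool step_ok) {
  switch (s) {
    case FailoverState::kNone:
    case FailoverState::kSuccess:
    case FailoverState::kFailed:
      return s;
    default:
      break;
  }
  if (!step_ok) return FailoverState::kFailed;
  switch (s) {
    case FailoverState::kStarted: return FailoverState::kCheckSlave;
    case FailoverState::kCheckSlave: return FailoverState::kPauseWrite;
    case FailoverState::kPauseWrite: return FailoverState::kWaitSync;
    case FailoverState::kWaitSync: return FailoverState::kSwitching;
    case FailoverState::kSwitching: return FailoverState::kSuccess;
    default: return s;
  }
}
```

A background thread could call `NextState` after each step completes, which keeps the happy path and the single failure exit visible in one place.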

Key Features

Slave Validation: Checks connection status, replication speed, and lag before proceeding
Write Blocking: Blocks write requests during pause_write, wait_sync, and switching states (returns TRYAGAIN)
Replication Sync: Waits for slave to catch up to target sequence number
Takeover: Sends CLUSTERX TAKEOVER command to slave with authentication support
Slot Redirection: Marks all slots as migrated, redirects clients via MOVED errors
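
As a rough illustration of the write-blocking rule above, the check might reduce to a predicate over the failover state. `BlocksWrites`, `CheckWriteAllowed`, and the exact error text are assumptions made for this sketch; the PR only states that blocked writes receive TRYAGAIN.

```cpp
#include <optional>
#include <string>

// Hypothetical enum; see the state list in the Architecture section.
enum class FailoverState {
  kNone, kStarted, kCheckSlave, kPauseWrite, kWaitSync, kSwitching, kSuccess, kFailed
};

// True for the three states the PR description lists as write-blocking.
constexpr bool BlocksWrites(FailoverState s) {
  return s == FailoverState::kPauseWrite || s == FailoverState::kWaitSync ||
         s == FailoverState::kSwitching;
}

// Returns an error message if the command must be rejected, or nullopt
// if it may proceed. Reads are never blocked. The message text is assumed.
std::optional<std::string> CheckWriteAllowed(FailoverState s, bool is_write_cmd) {
  if (is_write_cmd && BlocksWrites(s)) {
    return std::string("TRYAGAIN failover in progress");
  }
  return std::nullopt;
}
```

Keeping the predicate separate from the reply construction makes it easy to reuse the same rule for regular commands and for Lua script writes.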

Commands

CLUSTERX FAILOVER <slave-node-id> [timeout] - Initiate failover (default timeout: 1000ms)
CLUSTER INFO - Now includes cluster_failover_state:<state>
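
The optional timeout argument of `CLUSTERX FAILOVER` could be parsed along these lines. This is a sketch under assumptions: `FailoverArgs` and `ParseFailoverArgs` are hypothetical, and rejecting negative or non-numeric values is a guess at what the "invalid timeout" test case covers, not confirmed behavior.

```cpp
#include <charconv>
#include <optional>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical holder for the parsed arguments.
struct FailoverArgs {
  std::string node_id;
  int timeout_ms = 1000;  // default stated in the PR description
};

// `args` holds the tokens that follow "CLUSTERX FAILOVER".
std::optional<FailoverArgs> ParseFailoverArgs(const std::vector<std::string>& args) {
  if (args.empty() || args.size() > 2) return std::nullopt;
  FailoverArgs out;
  out.node_id = args[0];
  if (args.size() == 2) {
    const std::string& t = args[1];
    int value = 0;
    auto [ptr, ec] = std::from_chars(t.data(), t.data() + t.size(), value);
    // Require the whole token to be a non-negative integer (assumed rule).
    if (ec != std::errc() || ptr != t.data() + t.size() || value < 0) {
      return std::nullopt;
    }
    out.timeout_ms = value;
  }
  return out;
}
```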

Files Changed

New Files:

src/cluster/cluster_failover.h / cluster_failover.cc - Core implementation (325 lines)
tests/gocase/integration/failover/failover_test.go - Test suite (926 lines)
GRACEFUL_FAILOVER_DESIGN.md - Design document

Modified Files:

src/server/server.{h,cc} - Added ClusterFailover member and GetSlaveReplicationOffset()
src/cluster/cluster.{h,cc} - Write blocking check, SetMySlotsMigrated(), OnTakeOver(), state reset
src/commands/cmd_cluster.cc - FAILOVER and TAKEOVER command handlers

Testing

Comprehensive test suite with 20 sub-test cases (100% pass rate):

Normal flow (basic, custom timeout, authentication)
Failure scenarios (non-existent node, non-slave, invalid timeout, lag timeout, auth failure)
Concurrency (cannot start when in progress, restart after failure)
Write blocking (write blocked, read not blocked)
State query and transitions
Integration (data consistency, state reset after SETNODES)

Compatibility

✅ Backward compatible: New feature, no breaking changes. Only active when cluster-enabled=yes. Existing clusters unaffected.


Copilot

Pull request overview

This PR implements a graceful failover mechanism for Kvrocks cluster, enabling controlled master-to-slave role transitions while maintaining data consistency. The implementation follows a state machine pattern with dedicated background thread execution, similar to the existing SlotMigrator module architecture.

Key Changes:

Introduces a new ClusterFailover module with an 8-state state machine for controlled failover execution
Implements write blocking during critical failover phases (pause_write, wait_sync, switching) to ensure data consistency
Adds CLUSTERX FAILOVER and CLUSTERX TAKEOVER commands for failover initiation and slave promotion

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `tests/gocase/go.mod` | Updated Go version directive (contains critical issue - see comments) |
| `tests/gocase/integration/failover/failover_test.go` | Comprehensive test suite with 20 test cases covering normal flow, failure scenarios, concurrency, write blocking, authentication, and data consistency |
| `tests/gocase/integration/failover/TEST_CASES.md` | Chinese-language test documentation describing test coverage goals |
| `src/cluster/cluster_failover.h` | Header defining `ClusterFailover` class with state machine, failover control methods, and thread management |
| `src/cluster/cluster_failover.cc` | Implementation of failover logic including slave validation, lag checking, sync waiting, and takeover command sending |
| `src/cluster/cluster.h` | Added methods for failover support: `OnTakeOver()`, `GetNodeIPPort()`, `SetMySlotsMigrated()`, `IsSlotImported()` |
| `src/cluster/cluster.cc` | Implements cluster-level failover operations, write blocking checks, failover state reset, and slot redirection |
| `src/commands/cmd_cluster.cc` | Command handlers for `FAILOVER` and `TAKEOVER` subcommands with timeout parsing |
| `src/server/server.h` | Declares `ClusterFailover` member and `GetSlaveReplicationOffset()` method |
| `src/server/server.cc` | Initializes `ClusterFailover` and implements slave offset retrieval for sync verification |
| `src/server/redis_connection.cc` | Allows writes to imported slots during failover (before topology update) |
| `src/storage/scripting.cc` | Allows Lua script writes to imported slots during failover |


Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.


Copilot · 2025-12-16T12:19:56Z

src/cluster/cluster_failover.cc

+  if (s_resp.GetValue().substr(0, 3) != "+OK") {
+    return {Status::NotOK, "TAKEOVER failed: " + s_resp.GetValue()};
+  }


Potential resource leak: if SockReadLine fails or returns an unexpected response, the socket fd is closed. However, if GetValue().substr() throws an exception (e.g., if the response is less than 3 characters), the socket will not be closed. Consider using RAII or ensuring close(fd) is called in all paths, including exception paths.

yuzegao · Dec 21, 2025

auto s_resp = util::SockReadLine(fd);
close(fd);

if (!s_resp.IsOK()) {
  return {Status::NotOK, "Failed to read TAKEOVER response: " + s_resp.Msg()};
}

if (s_resp.GetValue().substr(0, 3) != "+OK") {
  return {Status::NotOK, "TAKEOVER failed: " + s_resp.GetValue()};
}

The fd is always closed before any of the if conditions run.

Copilot · 2025-12-16T12:19:57Z

src/commands/cmd_cluster.cc

+    }
+
+    return {Status::RedisParseErr,
+            "CLUSTERX command, CLUSTERX VERSION|MYID|SETNODEID|SETNODES|SETSLOT|MIGRATE|FAILOVER"};


The help message in the error does not include "TAKEOVER" which is a valid subcommand added in this PR. The message should be updated to include TAKEOVER in the list of valid subcommands.

Suggested change

-            "CLUSTERX command, CLUSTERX VERSION|MYID|SETNODEID|SETNODES|SETSLOT|MIGRATE|FAILOVER"};
+            "CLUSTERX command, CLUSTERX VERSION|MYID|SETNODEID|SETNODES|SETSLOT|MIGRATE|FAILOVER|TAKEOVER"};

yuzegao · Dec 21, 2025

TAKEOVER is an internal command, so it does not need to be listed here.


Copilot · 2025-12-16T12:19:58Z

src/cluster/cluster_failover.cc

+    auto s_line = util::SockReadLine(fd);
+    if (!s_line.IsOK() || s_line.GetValue().substr(0, 3) != "+OK") {
+      close(fd);
+      return {Status::NotOK, "AUTH failed"};
+    }


Potential resource leak: if SockReadLine fails or returns an unexpected response, the socket fd is closed. However, if GetValue().substr() throws an exception (e.g., if the response is less than 3 characters), the socket will not be closed. Consider using RAII or ensuring close(fd) is called in all paths, including exception paths.

yuzegao · Dec 21, 2025

There is no resource-leak risk, because every path calls close(fd):

if (!pass.empty()) {
  std::string auth_cmd = redis::ArrayOfBulkStrings({"AUTH", pass});
  auto s_auth = util::SockSend(fd, auth_cmd);
  if (!s_auth.IsOK()) {
    close(fd);
    return {Status::NotOK, "Failed to send AUTH: " + s_auth.Msg()};
  }
  auto s_line = util::SockReadLine(fd);
  if (!s_line.IsOK() || s_line.GetValue().substr(0, 3) != "+OK") {
    close(fd);
    return {Status::NotOK, "AUTH failed"};
  }
}

std::string cmd = redis::ArrayOfBulkStrings({"CLUSTERX", "TAKEOVER"});
auto s_send = util::SockSend(fd, cmd);
if (!s_send.IsOK()) {
  close(fd);
  return {Status::NotOK, "Failed to send TAKEOVER: " + s_send.Msg()};
}

auto s_resp = util::SockReadLine(fd);
close(fd);


Copilot · 2025-12-16T12:19:58Z

src/cluster/cluster.cc

+Status Cluster::OnTakeOver() {
+  info("[Failover] OnTakeOver received myself_: {}", myself_ ? myself_->id : "null");
+  if (!myself_) {
+    return {Status::NotOK, "Cluster is not initialized"};
+  }
+  if (myself_->role == kClusterMaster) {
+    info("[Failover] OnTakeOver myself_ is master, return");
+    return Status::OK();
+  }
+
+  std::string old_master_id = myself_->master_id;
+  if (old_master_id.empty()) {
+    info("[Failover] OnTakeOver no master to takeover, return");
+    return {Status::NotOK, "No master to takeover"};
+  }
+
+  for (int i = 0; i < kClusterSlots; i++) {
+    if (slots_nodes_[i] && slots_nodes_[i]->id == old_master_id) {
+      imported_slots_.insert(i);
+    }
+  }
+  info("[Failover] OnTakeOver Success ");
+  return Status::OK();
+}


The OnTakeOver method modifies imported_slots_ without acquiring any lock or exclusivity guard. This data structure is accessed by IsSlotImported() from request processing threads without synchronization. This creates a potential race condition where imported_slots_ could be modified while being read by other threads, leading to undefined behavior.
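
One generic way to remove the race described here is a reader-writer lock around the set. The wrapper below is a sketch only: `ImportedSlots` is a hypothetical class, and the Kvrocks codebase may prefer a different mechanism for serializing topology changes against request threads.

```cpp
#include <mutex>
#include <set>
#include <shared_mutex>

// Hypothetical wrapper: many readers may query concurrently,
// while an insert takes the lock exclusively.
class ImportedSlots {
 public:
  void Insert(int slot) {
    std::unique_lock<std::shared_mutex> lock(mu_);
    slots_.insert(slot);
  }

  bool Contains(int slot) const {
    std::shared_lock<std::shared_mutex> lock(mu_);
    return slots_.count(slot) > 0;
  }

 private:
  mutable std::shared_mutex mu_;
  std::set<int> slots_;
};
```

A shared lock keeps the per-request read path cheap, since the set is only written during takeover.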

git-hulk

@yuzegao, I'm not sure whether you intended to use the migrated/imported slot state to do the redirection during failover. If so, that is too tricky. You could instead use the failover state to allow or disallow writes on the new/master node, rather than mixing migration behavior into failover.

Another question: what if a migration is ongoing?

git-hulk · 2025-12-16T12:11:02Z

src/cluster/cluster_failover.cc

+}
+
+Status ClusterFailover::checkSlaveLag() {
+  auto start_offset_status = srv_->GetSlaveReplicationOffset(node_ip_port_);


Hasn't the replication offset already been checked at the check status stage?

git-hulk · 2025-12-16T12:15:37Z

src/cluster/cluster_failover.cc

+
+  if (lag == 0) return Status::OK();
+
+  if (speed <= 0.1) {  // Basically 0


Why do we need to test the replication speed? And why does the check fail if the speed is <= 0.1?

yuzegao added 2 commits December 14, 2025 11:45

cluster: implement graceful failover feature

d16f56e

Merge branch 'unstable' into feature/gracefully_failover

38d3089

yuzegao changed the title ~~Implement Graceful Failover Feature~~ feat(failover): Implement Graceful Failover Feature Dec 14, 2025

PragmaTwice reviewed Dec 15, 2025 (tests/gocase/integration/failover/TEST_CASES.md, outdated)

PragmaTwice requested review from caipengbo, Copilot and git-hulk December 15, 2025 02:49

Copilot started reviewing on behalf of PragmaTwice December 15, 2025 02:50

PragmaTwice reviewed Dec 15, 2025 (src/cluster/cluster_failover.h)

PragmaTwice reviewed Dec 15, 2025 (src/cluster/cluster_failover.cc)

Copilot AI reviewed Dec 15, 2025

add License header and fix Copilot report issues.

0dbbc19

yuzegao requested review from PragmaTwice and Copilot December 16, 2025 11:50

Copilot started reviewing on behalf of yuzegao December 16, 2025 12:12

Copilot AI reviewed Dec 16, 2025

git-hulk reviewed Dec 16, 2025
