Skip to content

Conversation

@yuzegao
Copy link

@yuzegao yuzegao commented Dec 14, 2025

Summary

This PR implements graceful failover for Kvrocks cluster, allowing a master node to safely transfer control to a slave node while ensuring data consistency and minimizing service disruption.

Background

Based on GitHub Discussion #3218, this feature enables controlled master-to-slave failover with:

  • Data consistency guarantee (waits for replication sync)
  • Write blocking during critical phases
  • Configurable timeout
  • State machine-based async execution

Implementation

Architecture

  • Independent module: ClusterFailover class, parallel to SlotMigrator
  • Async execution: Dedicated background thread for failover process
  • State machine: 8 states (nonestartedcheck_slavepause_writewait_syncswitchingsuccess/failed)

Key Features

  1. Slave Validation: Checks connection status, replication speed, and lag before proceeding
  2. Write Blocking: Blocks write requests during pause_write, wait_sync, and switching states (returns TRYAGAIN)
  3. Replication Sync: Waits for slave to catch up to target sequence number
  4. Takeover: Sends CLUSTERX TAKEOVER command to slave with authentication support
  5. Slot Redirection: Marks all slots as migrated, redirects clients via MOVED errors

Commands

  • CLUSTERX FAILOVER <slave-node-id> [timeout] - Initiate failover (default timeout: 1000ms)
  • CLUSTER INFO - Now includes cluster_failover_state:<state>

Files Changed

New Files:

  • src/cluster/cluster_failover.h / cluster_failover.cc - Core implementation (325 lines)
  • tests/gocase/integration/failover/failover_test.go - Test suite (926 lines)
  • GRACEFUL_FAILOVER_DESIGN.md - Design document

Modified Files:

  • src/server/server.{h,cc} - Added ClusterFailover member and GetSlaveReplicationOffset()
  • src/cluster/cluster.{h,cc} - Write blocking check, SetMySlotsMigrated(), OnTakeOver(), state reset
  • src/commands/cmd_cluster.cc - FAILOVER and TAKEOVER command handlers

Testing

Comprehensive test suite with 20 sub-test cases (100% pass rate):

  • Normal flow (basic, custom timeout, authentication)
  • Failure scenarios (non-existent node, non-slave, invalid timeout, lag timeout, auth failure)
  • Concurrency (cannot start when in progress, restart after failure)
  • Write blocking (write blocked, read not blocked)
  • State query and transitions
  • Integration (data consistency, state reset after SETNODES)

Compatibility

Backward compatible: New feature, no breaking changes. Only active when cluster-enabled=yes. Existing clusters unaffected.

@yuzegao yuzegao changed the title Implement Graceful Failover Feature feat(failover): Implement Graceful Failover Feature Dec 14, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a graceful failover mechanism for Kvrocks cluster, enabling controlled master-to-slave role transitions while maintaining data consistency. The implementation follows a state machine pattern with dedicated background thread execution, similar to the existing SlotMigrator module architecture.

Key Changes:

  • Introduces a new ClusterFailover module with an 8-state state machine for controlled failover execution
  • Implements write blocking during critical failover phases (pause_write, wait_sync, switching) to ensure data consistency
  • Adds CLUSTERX FAILOVER and CLUSTERX TAKEOVER commands for failover initiation and slave promotion

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/gocase/go.mod Updated Go version directive (contains critical issue - see comments)
tests/gocase/integration/failover/failover_test.go Comprehensive test suite with 20 test cases covering normal flow, failure scenarios, concurrency, write blocking, authentication, and data consistency
tests/gocase/integration/failover/TEST_CASES.md Chinese-language test documentation describing test coverage goals
src/cluster/cluster_failover.h Header defining ClusterFailover class with state machine, failover control methods, and thread management
src/cluster/cluster_failover.cc Implementation of failover logic including slave validation, lag checking, sync waiting, and takeover command sending
src/cluster/cluster.h Added methods for failover support: OnTakeOver(), GetNodeIPPort(), SetMySlotsMigrated(), IsSlotImported()
src/cluster/cluster.cc Implements cluster-level failover operations, write blocking checks, failover state reset, and slot redirection
src/commands/cmd_cluster.cc Command handlers for FAILOVER and TAKEOVER subcommands with timeout parsing
src/server/server.h Declares ClusterFailover member and GetSlaveReplicationOffset() method
src/server/server.cc Initializes ClusterFailover and implements slave offset retrieval for sync verification
src/server/redis_connection.cc Allows writes to imported slots during failover (before topology update)
src/storage/scripting.cc Allows Lua script writes to imported slots during failover

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +252 to +254
if (s_resp.GetValue().substr(0, 3) != "+OK") {
return {Status::NotOK, "TAKEOVER failed: " + s_resp.GetValue()};
}
Copy link

Copilot AI Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential resource leak: if SockReadLine fails or returns an unexpected response, the socket fd is closed. However, if GetValue().substr() throws an exception (e.g., if the response is less than 3 characters), the socket will not be closed. Consider using RAII or ensuring close(fd) is called in all paths, including exception paths.

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto s_resp = util::SockReadLine(fd);
close(fd);

if (!s_resp.IsOK()) {
return {Status::NotOK, "Failed to read TAKEOVER response: " + s_resp.Msg()};
}

if (s_resp.GetValue().substr(0, 3) != "+OK") {
return {Status::NotOK, "TAKEOVER failed: " + s_resp.GetValue()};
}

The fd allways closed before any if condition.

}

return {Status::RedisParseErr,
"CLUSTERX command, CLUSTERX VERSION|MYID|SETNODEID|SETNODES|SETSLOT|MIGRATE|FAILOVER"};
Copy link

Copilot AI Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The help message in the error does not include "TAKEOVER" which is a valid subcommand added in this PR. The message should be updated to include TAKEOVER in the list of valid subcommands.

Suggested change
"CLUSTERX command, CLUSTERX VERSION|MYID|SETNODEID|SETNODES|SETSLOT|MIGRATE|FAILOVER"};
"CLUSTERX command, CLUSTERX VERSION|MYID|SETNODEID|SETNODES|SETSLOT|MIGRATE|FAILOVER|TAKEOVER"};

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TAKEOVER is inner command, don't need to declare here.

Comment on lines +231 to +235
auto s_line = util::SockReadLine(fd);
if (!s_line.IsOK() || s_line.GetValue().substr(0, 3) != "+OK") {
close(fd);
return {Status::NotOK, "AUTH failed"};
}
Copy link

Copilot AI Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential resource leak: if SockReadLine fails or returns an unexpected response, the socket fd is closed. However, if GetValue().substr() throws an exception (e.g., if the response is less than 3 characters), the socket will not be closed. Consider using RAII or ensuring close(fd) is called in all paths, including exception paths.

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without resource leak risk, beacuse all path have close(fd).
if (!pass.empty()) {
std::string auth_cmd = redis::ArrayOfBulkStrings({"AUTH", pass});
auto s_auth = util::SockSend(fd, auth_cmd);
if (!s_auth.IsOK()) {
close(fd);
return {Status::NotOK, "Failed to send AUTH: " + s_auth.Msg()};
}
auto s_line = util::SockReadLine(fd);
if (!s_line.IsOK() || s_line.GetValue().substr(0, 3) != "+OK") {
close(fd);
return {Status::NotOK, "AUTH failed"};
}
}

std::string cmd = redis::ArrayOfBulkStrings({"CLUSTERX", "TAKEOVER"});
auto s_send = util::SockSend(fd, cmd);
if (!s_send.IsOK()) {
close(fd);
return {Status::NotOK, "Failed to send TAKEOVER: " + s_send.Msg()};
}

auto s_resp = util::SockReadLine(fd);
close(fd);

Comment on lines +1002 to +1025
Status Cluster::OnTakeOver() {
info("[Failover] OnTakeOver received myself_: {}", myself_ ? myself_->id : "null");
if (!myself_) {
return {Status::NotOK, "Cluster is not initialized"};
}
if (myself_->role == kClusterMaster) {
info("[Failover] OnTakeOver myself_ is master, return");
return Status::OK();
}

std::string old_master_id = myself_->master_id;
if (old_master_id.empty()) {
info("[Failover] OnTakeOver no master to takeover, return");
return {Status::NotOK, "No master to takeover"};
}

for (int i = 0; i < kClusterSlots; i++) {
if (slots_nodes_[i] && slots_nodes_[i]->id == old_master_id) {
imported_slots_.insert(i);
}
}
info("[Failover] OnTakeOver Success ");
return Status::OK();
}
Copy link

Copilot AI Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OnTakeOver method modifies imported_slots_ without acquiring any lock or exclusivity guard. This data structure is accessed by IsSlotImported() from request processing threads without synchronization. This creates a potential race condition where imported_slots_ could be modified while being read by other threads, leading to undefined behavior.

Copilot uses AI. Check for mistakes.
Copy link
Member

@git-hulk git-hulk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuzegao, I'm not sure if you expected to use the migrated/imported slot to do the redirection while doing the failover. If yes, it's too tricky to do that. You can use the failover state to allow/disallow writing in the new/master node instead of mixing the migration behavior with the failover.

Another question is: What if the migration is ongoing?

}

Status ClusterFailover::checkSlaveLag() {
auto start_offset_status = srv_->GetSlaveReplicationOffset(node_ip_port_);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have checked the replication offset at the check status stage?


if (lag == 0) return Status::OK();

if (speed <= 0.1) { // Basically 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to test the replication speed? and why does it fail if the speed is <= 0.1?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants