-
Notifications
You must be signed in to change notification settings - Fork 590
feat(failover): Implement Graceful Failover Feature #3295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: unstable
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements a graceful failover mechanism for Kvrocks cluster, enabling controlled master-to-slave role transitions while maintaining data consistency. The implementation follows a state machine pattern with dedicated background thread execution, similar to the existing SlotMigrator module architecture.
Key Changes:
- Introduces a new
ClusterFailovermodule with an 8-state state machine for controlled failover execution - Implements write blocking during critical failover phases (
pause_write,wait_sync,switching) to ensure data consistency - Adds
CLUSTERX FAILOVERandCLUSTERX TAKEOVERcommands for failover initiation and slave promotion
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
tests/gocase/go.mod |
Updated Go version directive (contains critical issue - see comments) |
tests/gocase/integration/failover/failover_test.go |
Comprehensive test suite with 20 test cases covering normal flow, failure scenarios, concurrency, write blocking, authentication, and data consistency |
tests/gocase/integration/failover/TEST_CASES.md |
Chinese-language test documentation describing test coverage goals |
src/cluster/cluster_failover.h |
Header defining ClusterFailover class with state machine, failover control methods, and thread management |
src/cluster/cluster_failover.cc |
Implementation of failover logic including slave validation, lag checking, sync waiting, and takeover command sending |
src/cluster/cluster.h |
Added methods for failover support: OnTakeOver(), GetNodeIPPort(), SetMySlotsMigrated(), IsSlotImported() |
src/cluster/cluster.cc |
Implements cluster-level failover operations, write blocking checks, failover state reset, and slot redirection |
src/commands/cmd_cluster.cc |
Command handlers for FAILOVER and TAKEOVER subcommands with timeout parsing |
src/server/server.h |
Declares ClusterFailover member and GetSlaveReplicationOffset() method |
src/server/server.cc |
Initializes ClusterFailover and implements slave offset retrieval for sync verification |
src/server/redis_connection.cc |
Allows writes to imported slots during failover (before topology update) |
src/storage/scripting.cc |
Allows Lua script writes to imported slots during failover |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (s_resp.GetValue().substr(0, 3) != "+OK") { | ||
| return {Status::NotOK, "TAKEOVER failed: " + s_resp.GetValue()}; | ||
| } |
Copilot
AI
Dec 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential resource leak: if SockReadLine fails or returns an unexpected response, the socket fd is closed. However, if GetValue().substr() throws an exception (e.g., if the response is less than 3 characters), the socket will not be closed. Consider using RAII or ensuring close(fd) is called in all paths, including exception paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto s_resp = util::SockReadLine(fd);
close(fd);
if (!s_resp.IsOK()) {
return {Status::NotOK, "Failed to read TAKEOVER response: " + s_resp.Msg()};
}
if (s_resp.GetValue().substr(0, 3) != "+OK") {
return {Status::NotOK, "TAKEOVER failed: " + s_resp.GetValue()};
}
The fd allways closed before any if condition.
| } | ||
|
|
||
| return {Status::RedisParseErr, | ||
| "CLUSTERX command, CLUSTERX VERSION|MYID|SETNODEID|SETNODES|SETSLOT|MIGRATE|FAILOVER"}; |
Copilot
AI
Dec 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The help message in the error does not include "TAKEOVER" which is a valid subcommand added in this PR. The message should be updated to include TAKEOVER in the list of valid subcommands.
| "CLUSTERX command, CLUSTERX VERSION|MYID|SETNODEID|SETNODES|SETSLOT|MIGRATE|FAILOVER"}; | |
| "CLUSTERX command, CLUSTERX VERSION|MYID|SETNODEID|SETNODES|SETSLOT|MIGRATE|FAILOVER|TAKEOVER"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TAKEOVER is inner command, don't need to declare here.
| auto s_line = util::SockReadLine(fd); | ||
| if (!s_line.IsOK() || s_line.GetValue().substr(0, 3) != "+OK") { | ||
| close(fd); | ||
| return {Status::NotOK, "AUTH failed"}; | ||
| } |
Copilot
AI
Dec 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential resource leak: if SockReadLine fails or returns an unexpected response, the socket fd is closed. However, if GetValue().substr() throws an exception (e.g., if the response is less than 3 characters), the socket will not be closed. Consider using RAII or ensuring close(fd) is called in all paths, including exception paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without resource leak risk, beacuse all path have close(fd).
if (!pass.empty()) {
std::string auth_cmd = redis::ArrayOfBulkStrings({"AUTH", pass});
auto s_auth = util::SockSend(fd, auth_cmd);
if (!s_auth.IsOK()) {
close(fd);
return {Status::NotOK, "Failed to send AUTH: " + s_auth.Msg()};
}
auto s_line = util::SockReadLine(fd);
if (!s_line.IsOK() || s_line.GetValue().substr(0, 3) != "+OK") {
close(fd);
return {Status::NotOK, "AUTH failed"};
}
}
std::string cmd = redis::ArrayOfBulkStrings({"CLUSTERX", "TAKEOVER"});
auto s_send = util::SockSend(fd, cmd);
if (!s_send.IsOK()) {
close(fd);
return {Status::NotOK, "Failed to send TAKEOVER: " + s_send.Msg()};
}
auto s_resp = util::SockReadLine(fd);
close(fd);
| Status Cluster::OnTakeOver() { | ||
| info("[Failover] OnTakeOver received myself_: {}", myself_ ? myself_->id : "null"); | ||
| if (!myself_) { | ||
| return {Status::NotOK, "Cluster is not initialized"}; | ||
| } | ||
| if (myself_->role == kClusterMaster) { | ||
| info("[Failover] OnTakeOver myself_ is master, return"); | ||
| return Status::OK(); | ||
| } | ||
|
|
||
| std::string old_master_id = myself_->master_id; | ||
| if (old_master_id.empty()) { | ||
| info("[Failover] OnTakeOver no master to takeover, return"); | ||
| return {Status::NotOK, "No master to takeover"}; | ||
| } | ||
|
|
||
| for (int i = 0; i < kClusterSlots; i++) { | ||
| if (slots_nodes_[i] && slots_nodes_[i]->id == old_master_id) { | ||
| imported_slots_.insert(i); | ||
| } | ||
| } | ||
| info("[Failover] OnTakeOver Success "); | ||
| return Status::OK(); | ||
| } |
Copilot
AI
Dec 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The OnTakeOver method modifies imported_slots_ without acquiring any lock or exclusivity guard. This data structure is accessed by IsSlotImported() from request processing threads without synchronization. This creates a potential race condition where imported_slots_ could be modified while being read by other threads, leading to undefined behavior.
git-hulk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yuzegao, I'm not sure if you expected to use the migrated/imported slot to do the redirection while doing the failover. If yes, it's too tricky to do that. You can use the failover state to allow/disallow writing in the new/master node instead of mixing the migration behavior with the failover.
Another question is: What if the migration is ongoing?
| } | ||
|
|
||
| Status ClusterFailover::checkSlaveLag() { | ||
| auto start_offset_status = srv_->GetSlaveReplicationOffset(node_ip_port_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have checked the replication offset at the check status stage?
|
|
||
| if (lag == 0) return Status::OK(); | ||
|
|
||
| if (speed <= 0.1) { // Basically 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to test the replication speed? and why does it fail if the speed is <= 0.1?
Summary
This PR implements graceful failover for Kvrocks cluster, allowing a master node to safely transfer control to a slave node while ensuring data consistency and minimizing service disruption.
Background
Based on GitHub Discussion #3218, this feature enables controlled master-to-slave failover with:
Implementation
Architecture
ClusterFailoverclass, parallel toSlotMigratornone→started→check_slave→pause_write→wait_sync→switching→success/failed)Key Features
pause_write,wait_sync, andswitchingstates (returnsTRYAGAIN)CLUSTERX TAKEOVERcommand to slave with authentication supportMOVEDerrorsCommands
CLUSTERX FAILOVER <slave-node-id> [timeout]- Initiate failover (default timeout: 1000ms)CLUSTER INFO- Now includescluster_failover_state:<state>Files Changed
New Files:
src/cluster/cluster_failover.h/cluster_failover.cc- Core implementation (325 lines)tests/gocase/integration/failover/failover_test.go- Test suite (926 lines)GRACEFUL_FAILOVER_DESIGN.md- Design documentModified Files:
src/server/server.{h,cc}- AddedClusterFailovermember andGetSlaveReplicationOffset()src/cluster/cluster.{h,cc}- Write blocking check,SetMySlotsMigrated(),OnTakeOver(), state resetsrc/commands/cmd_cluster.cc-FAILOVERandTAKEOVERcommand handlersTesting
Comprehensive test suite with 20 sub-test cases (100% pass rate):
Compatibility
✅ Backward compatible: New feature, no breaking changes. Only active when
cluster-enabled=yes. Existing clusters unaffected.