Skip to content

Conversation

@JacksonYao287
Copy link
Contributor

@JacksonYao287 JacksonYao287 commented Jan 7, 2026

expose nuraft_message to upper layer, so that they can register their own rpc call by calling bind_data_service_request

@JacksonYao287 JacksonYao287 force-pushed the add-interface-to-add-client-data-rpc branch 2 times, most recently from bf220c9 to b069b44 Compare January 8, 2026 00:54
// becoming a leader.

RD_LOGD(NO_TRACE_ID, "become_leader_cb: setting traffic_ready_lsn from {} to {}", current_gate, new_gate);
m_listener->on_become_leader(m_group_id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious about the purpose of on_become_xxx. Should we move it to the beginning of become_xxx_cb to call the upper callback first, similar to how handle_commit does?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on_become_leader and on_become_follower will be used for homeobject scrubbing. when scrubbing is on-going, if leader switch happens, the old leader will become follower and thus on_become_follower will be called on this node, where we stop the scrubbing thread that will request scrub results from the other two members. one of the follower will become leader and thus on_become_leader will be called on this node, where we will read the scrub superblk and start the scrubbing thread to request scrub result from the other two memebers.

Should we move it to the beginning
no need to do this now, what we want is to make sure that we can be notified with the leader change event, it does not matter it is called in the beginning or end.

@xiaoxichen
Copy link
Collaborator

xiaoxichen commented Jan 8, 2026

Can we do a POC regarding the required communication primitives
I am not sure if it is right to break the HS/HO boundary here, or should we add a generic "USER" handler is good enough.

We can to the POC of HO in HS UT (which is easier to implement RepldevListener), that gives team better understanding the requirements

@JacksonYao287
Copy link
Contributor Author

I am now working on the POC code with this change to see if this is workable

@JacksonYao287 JacksonYao287 force-pushed the add-interface-to-add-client-data-rpc branch from b069b44 to 2ee228f Compare January 8, 2026 12:08
@codecov-commenter
Copy link

codecov-commenter commented Jan 8, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 7.54717% with 49 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.30%. Comparing base (1a0cef8) to head (a9d376a).
⚠️ Report is 307 commits behind head on master.

Files with missing lines Patch % Lines
src/lib/replication/repl_dev/raft_repl_dev.cpp 0.00% 42 Missing and 1 partial ⚠️
src/lib/replication/repl_dev/solo_repl_dev.h 0.00% 6 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #846      +/-   ##
==========================================
- Coverage   56.51%   48.30%   -8.22%     
==========================================
  Files         108      110       +2     
  Lines       10300    12809    +2509     
  Branches     1402     6149    +4747     
==========================================
+ Hits         5821     6187     +366     
+ Misses       3894     2537    -1357     
- Partials      585     4085    +3500     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@JacksonYao287 JacksonYao287 force-pushed the add-interface-to-add-client-data-rpc branch 2 times, most recently from d8fa192 to 0915f3e Compare January 15, 2026 07:53
@JacksonYao287 JacksonYao287 force-pushed the add-interface-to-add-client-data-rpc branch from 0915f3e to 5ec0d54 Compare January 15, 2026 08:03
@JacksonYao287
Copy link
Contributor Author

JacksonYao287 commented Jan 15, 2026

this PR add three apis into repl_dev(exposes nuraft_mesg data service to upper layer) :
1 add_data_rpc_service: register a rpc call with a name
2 data_request_unidirectional: send an unidirectional data with a request name through nuraft_mesg data service
3 data_request_bidirectional: send an bidirectional data with a request name through nuraft_mesg data service

I have written a POC to test this in homeobject layer, where the deserialized VSM are transfered through the data channel and then compared with leader . please see
https://github.com/JacksonYao287/HomeObject/blob/a9928f1dbe1dd4695f6154c964ee5052b2a62a72/src/lib/homestore_backend/tests/hs_scrubber_tests.cpp#L25

you can build homestore using this PR to create a local homestore conan repo and then use the following branch to build homeobject , and then run homestore_scrubber_test to see how it works.
https://github.com/JacksonYao287/HomeObject/tree/add-separate-data-rpc

@xiaoxichen @Besroy any suggestion for adding this three api?

data_service_request_handler_t const& request_handler) = 0;

// send a unidirectional data rpc to dest with request_name and cli_buf
virtual nuraft_mesg::NullAsyncResult data_request_unidirectional(nuraft_mesg::destination_t const& dest,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a generic interface we probably cannot use nuraft_mesg::XXXX,
we can use either

replica_member_info

or

replica_id_t

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using destination_t = std::variant< peer_id_t, role_regex, svr_id_t >

destination_t support several type of input params, this is more adaptive,

maybe other upper layer(not homeobject) wants to use svr_id_t or role_regex .

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can re-encapsulate or redefine a type that is fine... but not exposing a nuraft type. it breaks the boundary.
HO doesnt need to care if HS is using nuraft_mesg for data_request_unidirectional, or use its own grpc, or use http, tcp , ftp, whatever.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree and actually I have already tried to expose a clean api, but it is a little complicated.

coming to (http, tcp , ftp, whatever.), when failing to transfer data, we need to tell the upper layer the detailed failure info(why fails). so IMO, no need to do such a high level re-encapsulatation. since we actually use grpc to transfer data in the bottom layer, when grpc error happens, we should return this error to upper layer so that it can have a clear idea why grpc fails.

however, in nuraft_mesg, data_service_request_bidirectional never return a grpc error, and always change a grpc error to nuraft error with grpc_status_to_nuraft_code, and I am confused about the necessity here. As a result, at homestore level , when error happens, we can only get the error of a nuraft type for data_service_request_bidirectional . so if we do encapsulation here, we have three choices:

  1. do not re-encapsulate, return directly as it is.

  2. add a new function to convert nuraft error to grpc error, but this will miss somed detailed information. for example.

    case ::grpc::StatusCode::INVALID_ARGUMENT:
    case ::grpc::StatusCode::UNIMPLEMENTED:
    case ::grpc::StatusCode::UNAUTHENTICATED:
    case ::grpc::StatusCode::PERMISSION_DENIED:
    case ::grpc::StatusCode::RESOURCE_EXHAUSTED:
    case ::grpc::StatusCode::OUT_OF_RANGE:
        return nuraft::cmd_result_code::BAD_REQUEST;

if we want to convert nuraft::cmd_result_code::BAD_REQUEST to grpc error , which grpc should we chose.

3 add a rude convert, which will lump all the nuraft error types under a single umbrella error type with no detailed info. then the upper lay can only know this transfer failure, but have no idea why it fails.

for now, I chose 1, although it looks not that ideal.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well each layer is expected to lost some information , for example when you calling an HTTP service got a 500, how you expect to know the detail error inside the 500, is that GW cannot talk to Nudata? or GW cannot reach DM?

Another example you got EIO when your disk IO fail, the specific device error code is hidden. .

When an underlying error is important , then it is time to create a specific error mapping, for that error. We can add a log line in the error converter if needed.

It is even fine we use same type and directly return what upstream returns, but just , dont directly expose the upstream type .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let met try to re-encapsulate


bool RaftReplDev::add_data_rpc_service(std::string const& request_name,
data_service_request_handler_t const& request_handler) {
return m_msg_mgr.bind_data_service_request(request_name, m_group_id, request_handler);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure, kind of concerning it will be a problem during shutdown, the HO will be deconstructed first then the handler is invalid.

It is not major atm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good point. but actually , we call homestore::shutdown in HSHomeObject::shutdown(), this means Homeobject is not destructed until homestore shutdown return.

@JacksonYao287
Copy link
Contributor Author

@Besroy @xiaoxichen I add some encapsulation so that no nuraft_mesg type will be exposed to upper layer


/// @brief when restart, after all the logs are replayed and before joining raft group, notify the upper layer
virtual void on_log_replay_done(const group_id_t& group_id) {};
virtual void on_log_replay_done(const group_id_t& group_id) = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the reason for this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was designed to be pure virtual originally。when we added this interface, in order to not block the homeobject build, we make it not pure virtual and have a default implementation. now that we have already implement this virtual function in homeobject side, we can change it to pure virtual now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it will break homeblks..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, solo_repl_dev never use ReplDevListener, so any change to ReplDevListener will not break solo_repl_dev, no?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, at least in the SoloReplDevUT there is a ReplDevListener, so I am expecting same pattern, the ReplDevListener is not solely for RaftReplDev.

Personally I wont touch it , but you can ask in the channel

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, let me double confirm this in homestore channel

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remove the changes on this part, let`s keep it for now

using trace_id_t = u_int64_t;

using data_service_request_handler_t = std::function< void(boost::intrusive_ptr< sisl::GenericRpcData >& rpc_data) >;
ENUM(role_regex, uint8_t, LEADER, FOLLOWER, ALL, ANY);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repl_role_regex


using data_service_request_handler_t = std::function< void(boost::intrusive_ptr< sisl::GenericRpcData >& rpc_data) >;
ENUM(role_regex, uint8_t, LEADER, FOLLOWER, ALL, ANY);
using peer_id_t = boost::uuids::uuid;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use replica_id_t directly.

ENUM(role_regex, uint8_t, LEADER, FOLLOWER, ALL, ANY);
using peer_id_t = boost::uuids::uuid;
using svr_id_t = int32_t;
using destination_t = std::variant< peer_id_t, role_regex, svr_id_t >;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using repl_dest_t = std::variant< replica_id_t, repl_dest_role, svr_id_t>;

using svr_id_t = int32_t;
using destination_t = std::variant< peer_id_t, role_regex, svr_id_t >;

ENUM(data_rpc_error_code, uint8_t, SUCCESS, TIMEOUT, SERVER_NOT_FOUND, CANCELLED, SERVER_ALREADY_EXISTS, TERM_MISMATCH,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repl_deta_rpc_error_code

void handle_fetch_data_response(sisl::GenericClientResponse response, std::vector< repl_req_ptr_t > rreqs);
bool is_resync_mode();
data_rpc_error_code nuraft_to_data_rpc_error_code(nuraft::cmd_result_code const& nuraft_err);
nuraft_mesg::destination_t change_to_nuraft_mesg_destination(destination_t dest);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets have a consistent naming for the converter.
nuraft_to_hs_error
hs_to_nuraft_dest

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. I have changed the code according to all the above comments. PTAL

@JacksonYao287 JacksonYao287 force-pushed the add-interface-to-add-client-data-rpc branch from ad004bf to 9242419 Compare January 21, 2026 06:31
@JacksonYao287 JacksonYao287 merged commit fe98606 into eBay:master Jan 22, 2026
21 checks passed
@JacksonYao287 JacksonYao287 deleted the add-interface-to-add-client-data-rpc branch January 22, 2026 01:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants