
Conversation

@chenBright
Contributor

What problem does this PR solve?

Issue Number: resolve #3132

Problem Summary:

What is changed and the side effects?

Changed:

Side effects:

  • Performance effects:

  • Breaking backward compatibility:


Check List:

@chenBright
Contributor Author

We encountered the same problem as #3132. This PR's solution is similar to the one in #3132 (comment) and resolves the problem. PTAL @sunce4t @yanglimingcn

@chenBright chenBright requested a review from Copilot November 9, 2025 08:39

Copilot AI left a comment

Pull Request Overview

This PR refactors the RDMA window management by splitting the single _window_size variable into two separate variables: _remote_rq_window_size (tracking remote receive queue capacity) and _sq_window_size (tracking local send queue capacity). This provides more granular control over flow control in RDMA operations.

  • Split window size tracking into separate remote RQ and local SQ windows
  • Added SendType enum to distinguish between different types of send operations
  • Enhanced logging to differentiate between client and server handshake completion
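
A minimal sketch of the two-window gating described above, using the variable names from the PR (std::atomic stands in for brpc's butil::atomic; the control flow is illustrative, not the exact brpc implementation):

#include <atomic>
#include <cstdint>

struct RdmaEndpointSketch {
    std::atomic<uint16_t> _remote_rq_window_size{0};  // credits for the peer's RQ
    std::atomic<uint16_t> _sq_window_size{0};         // free slots in the local SQ

    // A send WR may be posted only when both windows have room;
    // exhausting either one stalls sending.
    bool CanPostSend() const {
        return _remote_rq_window_size.load(std::memory_order_relaxed) > 0 &&
               _sq_window_size.load(std::memory_order_relaxed) > 0;
    }

    // Consume one credit from each window per posted WR.
    void OnPostSend() {
        _remote_rq_window_size.fetch_sub(1, std::memory_order_relaxed);
        _sq_window_size.fetch_sub(1, std::memory_order_relaxed);
    }
};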

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File — Description

  • src/brpc/rdma/rdma_endpoint.h — Declared two new atomic variables, _remote_rq_window_size and _sq_window_size, to replace the single _window_size variable
  • src/brpc/rdma/rdma_endpoint.cpp — Refactored window management logic to independently track local SQ and remote RQ capacity; added a SendType enum; improved handshake logging; optimized the zero-copy receive path

uint16_t wnd_to_update = _local_window_capacity / 4;
// Replenish the local SQ window by a quarter of its capacity.
_sq_window_size.fetch_add(wnd_to_update, butil::memory_order_relaxed);
// Wake up the writing thread right after every signaled send CQE.
_socket->WakeAsEpollOut();
Contributor

_socket->WakeAsEpollOut is called in both the IBV_WC_SEND and IBV_WC_RECV branches, but each branch checks only one condition. Shouldn't it wake the writer only when both conditions are met?

Contributor Author

An additional condition can be added: _remote_rq_window_size > _local_window_capacity / 8.
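
For illustration, the IBV_WC_SEND branch with that extra condition could look like the fragment below (same style as the snippet above; the 1/4 and 1/8 thresholds are the ones discussed here, not committed code):

uint16_t wnd_to_update = _local_window_capacity / 4;
_sq_window_size.fetch_add(wnd_to_update, butil::memory_order_relaxed);
// Wake the writing thread only when both windows have headroom.
if (_sq_window_size.load(butil::memory_order_relaxed) > _local_window_capacity / 4 &&
    _remote_rq_window_size.load(butil::memory_order_relaxed) > _local_window_capacity / 8) {
    _socket->WakeAsEpollOut();
}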

@yanglimingcn
Contributor

Why is the CQ divided into send_cq and recv_cq?

@chenBright
Contributor Author

We encountered an error: requests were not being sent, causing a large number of client timeouts.

 [E1008]Reached timeout=60000ms @Socket{id=13 fd=1160 addr=xxx:xx} (0x0x7f957c964ec0) rdma info={rdma_state=ON, handshake_state=ESTABLISHED, rdma_remote_rq_window_size=63, rdma_sq_window_size=0, rdma_local_window_capacity=125, rdma_remote_window_capacity=125, rdma_sbuf_head=57, rdma_sbuf_tail=120, rdma_rbuf_head=36, rdma_unacked_rq_wr=0, rdma_received_ack=0, rdma_unsolicited_sent=0, rdma_unsignaled_sq_wr=1, rdma_new_rq_wrs=0, }

From the RDMA connection information, we found that because ibv_req_notify_cq was armed with solicited_only=1, send WCs did not generate completion events; without recv CQEs to trigger polling, send WCs could not be polled, so _sq_window_size remained at 0. This is likely why both the client and the server were unable to send messages.

Using ibv_req_notify_cq with solicited_only=0 could solve this problem, but it would generate too many events. Therefore, we split the CQ into send_cq (solicited_only=0) and recv_cq (solicited_only=1).
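
A sketch of that split using the standard verbs API (the function and its parameters are placeholders; error handling omitted):

#include <infiniband/verbs.h>

// Create separate CQs so send completions get their own notifications.
void CreateSplitCqs(struct ibv_context* ctx, struct ibv_comp_channel* channel,
                    int cq_size, struct ibv_cq** send_cq, struct ibv_cq** recv_cq) {
    *send_cq = ibv_create_cq(ctx, cq_size, /*cq_context=*/NULL, channel, /*comp_vector=*/0);
    *recv_cq = ibv_create_cq(ctx, cq_size, /*cq_context=*/NULL, channel, /*comp_vector=*/0);
    // solicited_only=0: every send CQE raises a completion event, so send
    // completions can always be polled, independent of recv traffic.
    ibv_req_notify_cq(*send_cq, /*solicited_only=*/0);
    // solicited_only=1: the recv CQ keeps the original low-event behavior.
    ibv_req_notify_cq(*recv_cq, /*solicited_only=*/1);
}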

@chenBright chenBright mentioned this pull request Nov 17, 2025
@chenBright chenBright requested a review from Copilot November 17, 2025 07:26
Copilot finished reviewing on behalf of chenBright November 17, 2025 07:29

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.


@yanglimingcn
Contributor

With this modification, will send_cq generate one CQE for every 1/4 of the window?

@chenBright
Contributor Author

With this modification, will send_cq generate one CQE for every 1/4 of the window?

Yes.
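
Concretely, that corresponds to flagging roughly every (_local_window_capacity / 4)-th send WR as IBV_SEND_SIGNALED. A minimal sketch (the helper and counter are hypothetical):

#include <infiniband/verbs.h>
#include <cstdint>

// Signal one in every window_capacity / 4 send WRs so that send_cq
// produces about one CQE per quarter window.
void MaybeSignal(struct ibv_send_wr* wr, uint32_t* unsignaled_sq_wr,
                 uint32_t window_capacity) {
    if (++*unsignaled_sq_wr >= window_capacity / 4) {
        wr->send_flags |= IBV_SEND_SIGNALED;  // this WR will generate a CQE
        *unsignaled_sq_wr = 0;
    }
}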

@legionxiong
Contributor

legionxiong commented Nov 18, 2025

It is unnecessary to split the CQ into send_cq and recv_cq; the reason the sliding window goes wrong is that the precondition it relies on is not guaranteed in a RoCE environment. The sliding-window mechanism presumes that when the local app receives an ack from the remote app, the underlying SQ has already been released. In a lossless IB environment that may be true, but in a RoCE environment it is not guaranteed. The local device releases SQ slots only when it receives the device-level ack for a message from the remote device, and that ack might be lost even though the remote side has already processed the message, i.e. the remote app has received the message and sent an application-layer ack back to the local app. Unfortunately, the local app then receives that application-layer ack and increases the window while the device layer is still waiting for the device-level ack to release the SQ. So the window gets increased wrongly once or twice at a time, until it is completely wrong.

@chenBright
Contributor Author

Unfortunately, the local app then receives that application-layer ack and increases the window while the device layer is still waiting for the device-level ack to release the SQ.

@legionxiong If the device-level ACK is lost, is it impossible to release the SQ?

It is unnecessary to split the CQ into send_cq and recv_cq; the reason the sliding window goes wrong is that the precondition it relies on is not guaranteed in a RoCE environment.

Is there a better solution?

@legionxiong
Contributor

legionxiong commented Nov 18, 2025

the device-level ACK is lost

The local side will resend any message whose ACK is not received within rdma_ack_timeout, and the remote side will do nothing but resend the ACK for a message it has already received. Maybe we need another variable that is updated on IBV_WC_SEND completions: once we poll an IBV_WC_SEND work completion, it is guaranteed that the corresponding SQ slots have been released. And the number of unsignaled SQ WRs could be carried in wr_id.
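
A sketch of that idea (the exact wr_id encoding is an assumption): at post time, a signaled WR sets wr.wr_id = unsignaled_count + 1 and wr.send_flags |= IBV_SEND_SIGNALED; the polling side then knows how many SQ slots one completion releases.

#include <infiniband/verbs.h>
#include <atomic>
#include <cstdint>

// Each IBV_WC_SEND completion releases this WR's slot plus the slots of
// the unsignaled WRs posted before it, as carried in wr_id.
void ReapSendCompletions(struct ibv_cq* send_cq,
                         std::atomic<uint16_t>* sq_window_size) {
    struct ibv_wc wc;
    while (ibv_poll_cq(send_cq, 1, &wc) > 0) {
        if (wc.status == IBV_WC_SUCCESS && wc.opcode == IBV_WC_SEND) {
            sq_window_size->fetch_add((uint16_t)wc.wr_id,
                                      std::memory_order_relaxed);
        }
    }
}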

@chenBright
Contributor Author

chenBright commented Nov 18, 2025

Maybe we need another variable that is updated on IBV_WC_SEND completions: once we poll an IBV_WC_SEND work completion, it is guaranteed that the corresponding SQ slots have been released.

[E1008]Reached timeout=60000ms @Socket{id=13 fd=1160 addr=xxx:xx} (0x0x7f957c964ec0) rdma info={rdma_state=ON, handshake_state=ESTABLISHED, rdma_remote_rq_window_size=63, rdma_sq_window_size=0, rdma_local_window_capacity=125, rdma_remote_window_capacity=125, rdma_sbuf_head=57, rdma_sbuf_tail=120, rdma_rbuf_head=36, rdma_unacked_rq_wr=0, rdma_received_ack=0, rdma_unsolicited_sent=0, rdma_unsignaled_sq_wr=1, rdma_new_rq_wrs=0, }

In this case, it seems we could not poll any IBV_WC_SEND work completion for a long time. With solicited_only=1, if device-layer ACKs arrive after all application-layer ACKs, we may be unable to poll IBV_WC_SEND after polling IBV_WC_RECV because there are no new CQ events.

Maybe we need another variable that is updated on IBV_WC_SEND completions

How can we detect IBV_WC_SEND events?

the number of unsignaled SQ WRs could be carried in wr_id.

This solution for updating the SQ window is better.

@randomkang

@chenBright Hi, I used the patch (commit db3673acf276d9baec8aff95afd07bdbc5811437) in my task, but I still get the error:
"[wk-10] W1124 07:27:34.681490 82718 0 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:952 CutFromIOBufList] Fail to ibv_post_send: Cannot allocate memory, remote_rq_window_size=57, sq_window_size=5, sq_current=5".

My task is model training, and brpc GDR is enabled.

@chenBright
Contributor Author

chenBright commented Nov 25, 2025

Could it be that some unexpected imms are occupying the SQ? Try setting sq_size back to its original value of sq_size * 5 / 4:

resource->qp = AllocateQp(resource->send_cq, resource->recv_cq, sq_size * 5 / 4, rq_size);

@yanglimingcn
Contributor

Does this mean that the 5/4 window still can't be removed with the current modifications?

@randomkang

randomkang commented Nov 27, 2025

I set sq_size back to its original value of sq_size * 5 / 4 and ran the same task twice: one run lasted 1383 minutes and the other 752 minutes. The "Fail to ibv_post_send: Cannot allocate memory" error did not happen. @chenBright @yanglimingcn

@randomkang

Update: one task ran for 1063 minutes and then reported an error: "[wk-7] W1127 16:13:16.662405 67606 158119221015812 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:997 SendImm] Fail to ibv_post_send: Cannot allocate memory".

@randomkang

Update 2: the other task ran for 1902 minutes and also reported an error: "[wk-13] W1127 19:41:55.088111 98359 102791452311833 external/brpc/src/brpc/rdma/rdma_endpoint.cpp:997 SendImm] Fail to ibv_post_send: Cannot allocate memory".

@chenBright
Contributor Author

Latest changes:

  1. The send CQ and recv CQ now share a comp channel (see the sketch below). @yanglimingcn
  2. When ibv_post_send fails, the number of in-flight imms is printed for debugging. @randomkang
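
A sketch of the shared-channel setup with standard verbs calls (names and the dispatch policy are placeholders):

#include <infiniband/verbs.h>

// Both CQs deliver events to one completion channel; ibv_get_cq_event
// reports which CQ fired so the handler can poll and re-arm the right one.
void HandleOneCqEvent(struct ibv_comp_channel* channel, struct ibv_cq* recv_cq) {
    struct ibv_cq* ev_cq = NULL;
    void* ev_ctx = NULL;
    if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx) == 0) {
        ibv_ack_cq_events(ev_cq, 1);
        // Re-arm with the notification policy chosen for each CQ.
        ibv_req_notify_cq(ev_cq, /*solicited_only=*/(ev_cq == recv_cq) ? 1 : 0);
        // ... drain ev_cq with ibv_poll_cq ...
    }
}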

@chenBright chenBright changed the title Bugfix: The failure of ibv_post_send is caused by polling send CQE before recv CQE Bugfix: SQ overflow Dec 3, 2025
@yanglimingcn
Contributor

LGTM. @randomkang, please try the latest PR again; we haven't encountered the problem you mentioned.

@randomkang

The brpc version I used is 1.14.1; does it matter?

@yanglimingcn
Contributor

I don't think it matters; just focus on the modifications to the RDMA section.

Development

Successfully merging this pull request may close these issues.

Sliding window bug: Fail to ibv_post_send: Cannot allocate memory, window=3, sq_current=98
