Conversation


@mfahsold mfahsold commented Jan 19, 2026

Summary

The current RPC implementation crashes the server with GGML_ASSERT when ggml_backend_graph_compute returns a non-success status. This causes distributed inference setups to fail completely when a single worker encounters a temporary error (memory pressure, backend issues, etc.).

Changes

  1. Added response structs for RPC_CMD_GRAPH_COMPUTE and RPC_CMD_GRAPH_RECOMPUTE:

    • rpc_msg_graph_compute_rsp with int32_t status
    • rpc_msg_graph_recompute_rsp with int32_t status
  2. Server-side changes:

    • Replaced GGML_ASSERT(status == GGML_STATUS_SUCCESS) with graceful error logging
    • Server now sends the actual ggml_status back to the client via RPC response
    • Server continues operating instead of crashing on compute failures
  3. Client-side changes:

    • ggml_backend_rpc_graph_compute now receives and returns the actual status from the server
    • Clients can properly handle non-success status codes (retry, failover, etc.)

Before

[RPC] Worker disconnects or compute fails
-> GGML_ASSERT crashes the server
-> All connected clients lose their sessions
-> Manual restart required

After

[RPC] Worker disconnects or compute fails
-> Error is logged: "[rpc_server::graph_compute] graph compute failed with status X"
-> Status propagated to client
-> Client can handle error appropriately
-> Server continues operating

Related Issues

Testing

Tested in a Kubernetes distributed inference setup with 2 RPC workers:

  • Verified that worker disconnection no longer crashes the coordinator
  • Verified that compute errors are properly logged and propagated
  • Verified that inference completes successfully under normal operation


Fixes: ggml-org#11929
Fixes: gpustack/gpustack#1178
@mfahsold mfahsold requested a review from rgerganov as a code owner January 19, 2026 16:12
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jan 19, 2026
Collaborator

@rgerganov rgerganov left a comment

Returning a response from graph_compute requires a network round-trip, which hurts performance. I don't think there is a legitimate use case where one graph_compute fails and then everything gets back to normal.

@mfahsold
Copy link
Author

mfahsold commented Jan 20, 2026 via email
