@afterincomparableyum

This PR implements retry support for createReader failures in the C++ client, matching the behavior of the Java implementation. The implementation includes:

  • Added configuration properties:

    • clientFetchMaxRetriesForEachReplica (default: 3)
    • dataIoRetryWait (default: 5s)
    • clientPushReplicateEnabled (default: false)
  • Added peer location support methods to PartitionLocation:

    • hasPeer() - Check if location has a peer replica
    • getPeer() - Get the peer location
    • hostAndFetchPort() - Get host:port string for logging
  • Implemented retry logic in createReaderWithRetry():

    • Retries up to fetchChunkMaxRetry_ times (doubled if replication is enabled, which is why this PR also adds clientPushReplicateEnabled)
    • Switches to peer location on failure when available
    • Sleeps between retries when both replicas tried or no peer exists
    • Resets retry counter when moving to new location or on success
  • Added unit tests for new functionality
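
The retry behavior listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `Location`, `Reader`, `RetryConf`, `ReaderFactory`, and `createReaderWithRetrySketch` are hypothetical stand-ins, and only the three property names and their defaults come from the description. The rule "sleep after every second failure when a peer exists" is one assumed way to express "both replicas tried". The counter is local here, so it starts at zero for every new location and needs no explicit reset.

```cpp
#include <cassert>
#include <chrono>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-in for PartitionLocation; only what the sketch needs.
struct Location {
  std::string host;
  int fetchPort = 0;
  const Location* peer = nullptr;  // non-owning pointer to the other replica

  bool hasPeer() const { return peer != nullptr; }
  std::string hostAndFetchPort() const {
    return host + ":" + std::to_string(fetchPort);
  }
};

// Hypothetical reader handle; records which location served it.
struct Reader {
  std::string source;
};

// Defaults taken from the PR description; the struct itself is an assumption.
struct RetryConf {
  int clientFetchMaxRetriesForEachReplica = 3;
  std::chrono::milliseconds dataIoRetryWait{5000};
  bool clientPushReplicateEnabled = false;
};

using ReaderFactory = std::function<Reader(const Location&)>;

Reader createReaderWithRetrySketch(
    const Location& start,
    const ReaderFactory& createReader,
    const RetryConf& conf) {
  assert(conf.clientFetchMaxRetriesForEachReplica > 0);
  // With replication there are two replicas, so the retry budget is doubled.
  const int maxRetries = conf.clientPushReplicateEnabled
      ? conf.clientFetchMaxRetriesForEachReplica * 2
      : conf.clientFetchMaxRetriesForEachReplica;
  const Location* current = &start;
  std::exception_ptr lastException;
  int retryCnt = 0;
  while (retryCnt < maxRetries) {
    try {
      return createReader(*current);
    } catch (const std::exception&) {
      lastException = std::current_exception();
      ++retryCnt;
      if (current->hasPeer()) {
        // An even count means both replicas failed in this round: wait first.
        if (retryCnt % 2 == 0) {
          std::this_thread::sleep_for(conf.dataIoRetryWait);
        }
        current = current->peer;
      } else {
        // No peer to fail over to, so wait before retrying the same location.
        std::this_thread::sleep_for(conf.dataIoRetryWait);
      }
    }
  }
  // Budget exhausted: surface the last failure to the caller.
  std::rethrow_exception(lastException);
}
```

A caller would pass the starting location, a callable that wraps the real reader construction, and the configuration values; the original exception propagates once the budget is used up.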

How was this patch tested?

Unit tests and a successful build.

@afterincomparableyum
Author

@HolyLow @SteNicholas @FMX @RexXiong Could you please help review this PR? Appreciate your help in improving this as needed!


int clientFetchMaxRetriesForEachReplica() const;

Timeout dataIoRetryWait() const;
Member

Could this method be named to match CelebornConf#networkIoRetryWaitMs, the I/O wait conf used by all modules?

try {
  VLOG(1) << "Create reader for location " << currentLocation->host << ":"
          << currentLocation->fetchPort;
  auto reader = createReader(*currentLocation);
Member

This should check whether the partition location is excluded, which aligns with the logic of CelebornInputStream#createReaderWithRetry.

  return reader;
} catch (const std::exception& e) {
  lastException = std::current_exception();
  fetchChunkRetryCnt_++;
Member

The shuffle client should exclude the location whose fetch failed.
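
One way to picture the exclusion requested here and in the earlier comment about checking excluded locations is a small, time-limited registry keyed by the host:port string. This is only a sketch under assumptions: `FetchExcludedWorkers`, its method names, and the expiry timeout are hypothetical and are not taken from the C++ client or from the Java code. The `now` parameter exists so the expiry can be exercised without sleeping.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical registry of locations whose fetch recently failed.
class FetchExcludedWorkers {
 public:
  using Clock = std::chrono::steady_clock;

  explicit FetchExcludedWorkers(std::chrono::milliseconds expireTimeout)
      : expireTimeout_(expireTimeout) {
    assert(expireTimeout.count() >= 0);
  }

  // Record a failed location, e.g. from the catch block of the retry loop.
  void exclude(
      const std::string& hostAndFetchPort,
      Clock::time_point now = Clock::now()) {
    std::lock_guard<std::mutex> lock(mutex_);
    excluded_[hostAndFetchPort] = now;
  }

  // True while the location is excluded; expired entries are dropped lazily.
  bool isExcluded(
      const std::string& hostAndFetchPort,
      Clock::time_point now = Clock::now()) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = excluded_.find(hostAndFetchPort);
    if (it == excluded_.end()) {
      return false;
    }
    if (now - it->second > expireTimeout_) {
      excluded_.erase(it);
      return false;
    }
    return true;
  }

 private:
  std::chrono::milliseconds expireTimeout_;
  std::mutex mutex_;
  std::unordered_map<std::string, Clock::time_point> excluded_;
};
```

In a retry loop, `isExcluded` would be consulted before creating a reader for a location, and `exclude` would be called with the failing location's host:port string when reader creation throws.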

    std::this_thread::sleep_for(
        std::chrono::milliseconds(retryWait_.count()));
  }
  LOG(WARNING) << "CreatePartitionReader failed " << fetchChunkRetryCnt_
Member

Does this align with the failure handling of CelebornInputStream#createReaderWithRetry?

@codecov

codecov bot commented Jan 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.04%. Comparing base (2dd1b7a) to head (5d32d94).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3583      +/-   ##
==========================================
- Coverage   67.13%   67.04%   -0.09%     
==========================================
  Files         357      357              
  Lines       21860    21924      +64     
  Branches     1943     1949       +6     
==========================================
+ Hits        14674    14696      +22     
- Misses       6166     6213      +47     
+ Partials     1020     1015       -5     


@afterincomparableyum
Author

Thank you for your comments, @SteNicholas. I will take a look over the next couple of days. I suspect this PR needs some refactoring; I will notify you once it is done.

Copilot AI left a comment

Pull request overview

This PR implements retry support for createReader failures in the C++ client to match the Java implementation's behavior. It adds retry configuration, peer location helper methods, and implements the retry logic with peer failover.

Changes:

  • Added three configuration properties for retry behavior: clientFetchMaxRetriesForEachReplica, dataIoRetryWait, and clientPushReplicateEnabled
  • Added helper methods to PartitionLocation for peer access and formatting: hasPeer(), getPeer(), and hostAndFetchPort()
  • Implemented retry logic in createReaderWithRetry() that switches between primary and peer replicas on failure

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| cpp/celeborn/protocol/tests/PartitionLocationTest.cpp | Added unit tests for new PartitionLocation helper methods |
| cpp/celeborn/protocol/PartitionLocation.h | Declared three new helper methods for peer access and port formatting |
| cpp/celeborn/protocol/PartitionLocation.cpp | Implemented the three new helper methods |
| cpp/celeborn/conf/tests/CelebornConfTest.cpp | Added tests for new configuration properties and their default values |
| cpp/celeborn/conf/CelebornConf.h | Declared three new configuration properties and their accessor methods |
| cpp/celeborn/conf/CelebornConf.cpp | Implemented configuration property definitions and accessor methods |
| cpp/celeborn/client/reader/CelebornInputStream.h | Added member variables for retry tracking and retry wait timeout |
| cpp/celeborn/client/reader/CelebornInputStream.cpp | Implemented retry logic with peer failover and sleep between retries |


VLOG(1) << "Create reader for location " << currentLocation->host << ":"
        << currentLocation->fetchPort;
auto reader = createReader(*currentLocation);
fetchChunkRetryCnt_ = 0;
Copilot AI Jan 27, 2026

Resetting fetchChunkRetryCnt_ to 0 on successful reader creation (line 207) is redundant since it's already reset at line 187 in moveToNextReader() before this function is called. While not harmful, removing this reset would make the code cleaner and align better with the Java implementation which doesn't reset the counter after successful reader creation.

Suggested change
fetchChunkRetryCnt_ = 0;

Comment on lines +234 to +237
CELEBORN_FAIL(
    "createPartitionReader failed after " +
    std::to_string(fetchChunkRetryCnt_) + " retries for location " +
    location.hostAndFetchPort());
Copilot AI Jan 27, 2026

The captured lastException should be rethrown instead of using CELEBORN_FAIL. The Java implementation throws CelebornIOException with the lastException as the cause. In C++, you should use std::rethrow_exception(lastException) to preserve the original exception information, which is critical for debugging. If you want to add context, you could wrap it in a CelebornRuntimeError, similar to the pattern seen in CelebornException.cpp, lines 35-36.
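
The two options in this comment can be illustrated with standard-library facilities alone. The sketch below is an assumption-laden example, not the project's code: `rethrowLast`, `failWithContext`, and `describe` are hypothetical helper names, and `std::runtime_error` stands in for whatever Celeborn exception type the project would actually use. The second option relies on `std::throw_with_nested`, which attaches the exception currently being handled as the nested cause.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Option 1: rethrow the captured exception unchanged.
[[noreturn]] void rethrowLast(const std::exception_ptr& lastException) {
  assert(lastException != nullptr);  // rethrowing a null pointer is undefined
  std::rethrow_exception(lastException);
}

// Option 2: add retry context and keep the original as the nested cause.
[[noreturn]] void failWithContext(
    const std::exception_ptr& lastException,
    const std::string& hostAndFetchPort,
    int retries) {
  assert(lastException != nullptr);
  try {
    std::rethrow_exception(lastException);
  } catch (...) {
    // throw_with_nested captures the exception being handled right now.
    std::throw_with_nested(std::runtime_error(
        "createPartitionReader failed after " + std::to_string(retries) +
        " retries for location " + hostAndFetchPort));
  }
}

// Joins the messages of an exception and its nested causes, outermost first.
std::string describe(const std::exception& e) {
  std::string message = e.what();
  try {
    std::rethrow_if_nested(e);
  } catch (const std::exception& inner) {
    message += "; caused by: " + describe(inner);
  } catch (...) {
    message += "; caused by: unknown exception";
  }
  return message;
}
```

With the second option, a log line built from `describe` carries both the retry context and the original failure message, which roughly mirrors how a Java cause chain reads.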
