Skip to content

Conversation

@erkanonl
Copy link

What does this change do?

This PR adds lightweight tracking for failed commit attempts in BaseCommitService.

Specifically, it:

  • Counts the number of failed commit attempts during rewrite actions
  • Captures the exceptions thrown by failed commits for diagnostic purposes
  • Exposes this information via accessor methods for use by callers and tests

Why is this change needed?

BaseCommitService explicitly supports partial progress, but today failed commits are only logged.
This makes it difficult to:

  • Programmatically detect partial failures
  • Surface useful diagnostics to callers
  • Write tests that assert failure behavior without relying on log inspection

This change improves observability without altering commit behavior or public APIs.

Scope and compatibility

The change is limited to BaseCommitService, which is package-private and internal
No behavioral changes to commit execution or error handling
No impact on existing public APIs

}

public List<Exception> exceptionsOfFailedCommits() {
return Lists.newArrayList(exceptionsOfFailedCommits);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just return exceptionsOfFailedCommits?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a defensive copy which prevents callers from mutating internal state.

private final ConcurrentLinkedQueue<T> completedRewrites;
private final ConcurrentLinkedQueue<String> inProgressCommits;
private final ConcurrentLinkedQueue<T> committedRewrites;
private final List<Exception> exceptionsOfFailedCommits;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might have to be a synchronizedList right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Committer service looks to be executed with a single thread through Executors.newSingleThreadExecutor

However; to be consistent and making it resistant to future changes, I'm making it Collections.synchronizedList(Lists.newArrayList());

public int failedCommits() {
return failedCommits;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

think we should add some tests for this behaviour, but am happy to get buy-in first for this change before we do that.

Copy link
Author

@erkanonl erkanonl Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the tests of TestCommitService. We test the main behaviour but tests don't test minor/simple details like succeededCommits so I think it's fine to leave it as it is because what we do is just incrementing another variable and adding exceptions into a new list.

Copy link
Contributor

@geruh geruh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising this @erkanonl, before we proceed, I'd like to understand the use case better. For instance, what value does surfacing the exceptions to the user add.

On the exceptions list I'm trying to understand what a user would do with this.

  1. For partial progress scenarios where some commits succeed, some fail, the exceptions will almost always be CommitFailedException due to concurrent modifications when retries exhaust. What would a caller do differently based on having N copies of this exception vs just the count?

  2. For any other failures like permissions or auth, I'd expect all commits to fail. There's a rare case where permission failures could happen to a subset of my table, but that seems edge-case.

  3. Even with the exception objects, debugging still requires going to logs to understand which commits failed and why.

  4. if there are many failures, we could accumulate a lot of exception objects with full stack traces.

Ultimately, as a user, what I might mostly care about is, did my operation succeed and if not can I retry. If it's CommitFailedException after retries exhausted, I just re-run compaction later. But if it's any other exception like permission or service down, I wouldn't know from the information today. So what scenario are we trying to enable here?

}

public int failedCommits() {
return failedCommits;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this just be the length() of the failed commits list?

Copy link
Author

@erkanonl erkanonl Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can also be used. No difference. I wanted to differentiate the variables from each other to keep things clear but maybe things are already clear so I'm return the size of the list now 👍

I'll get to answering questions above soon

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For instance, what value does surfacing the exceptions to the user add.
For partial progress scenarios where some commits succeed, some fail, the exceptions will almost always be CommitFailedException due to concurrent modifications when retries exhaust. What would a caller do differently based on having N copies of this exception vs just the count?

In our logs, we saw that there could be exceptions due to different reasons. One particular exception that we can see as logged by BaseCommitService is that X added_records > Y removed_records. This doesn't look like a standard commit conflict exception (We have a pending AI to investigate this error separately). We want to differentiate these kinds of exceptions from regular CommitFailedExceptions which happen due to concurrent modifications.

For any other failures like permissions or auth, I'd expect all commits to fail. There's a rare case where permission failures could happen to a subset of my table, but that seems edge-case.

Hmm, I'm not sure what would happen there but if we have the exception list, we can certainly tell it by analysing it, emitting metrics and logging.

Even with the exception objects, debugging still requires going to logs to understand which commits failed and why.

Yes, it requires going to logs to understand which commits failed but we want to classify errors at different categories to investigate further. We currently only have CommitFailedException category but we saw that CommitFailedException could happen due to other reasons like I mentioned above.

if there are many failures, we could accumulate a lot of exception objects with full stack traces.

Good point, but visibility is also required here and clients should not hundreds of thousands of commit failed exceptions in BaseCommitService. If they have, they should fix the underlying root cause.

@geruh
Copy link
Contributor

geruh commented Dec 19, 2025

Thanks for the context ! I agree with the goal, we should be able to be aware of any failures that happen in this service without searching through the logs. This would help distinguish commit conflicts from other actionable failures so clients can classify failures and investigate.

I don’t think returning a potentially unbounded List<Exception> is the right API for that. It creates memory/perf risk and still doesn’t eliminate the need to check logs for details.

Can we switch this to an optional failure summary instead? For instance, return a bounded list of actionable failure summaries omitting retried exceptions, and each summary would be minimal context about the failure.

I'll also let others chime in to hear their thoughts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants