Expose failed commit count and exceptions in BaseCommitService #14872

erkanonl · 2025-12-17T13:23:18Z

What does this change do?

This PR adds lightweight tracking for failed commit attempts in BaseCommitService.

Specifically, it:

Counts the number of failed commit attempts during rewrite actions
Captures the exceptions thrown by failed commits for diagnostic purposes
Exposes this information via accessor methods for use by callers and tests

Why is this change needed?

BaseCommitService explicitly supports partial progress, but today failed commits are only logged.
This makes it difficult to:

Programmatically detect partial failures
Surface useful diagnostics to callers
Write tests that assert failure behavior without relying on log inspection

This change improves observability without altering commit behavior or public APIs.

Scope and compatibility

The change is limited to BaseCommitService, which is package-private and internal
No behavioral changes to commit execution or error handling
No impact on existing public APIs

stubz151 · 2025-12-17T13:56:40Z

core/src/main/java/org/apache/iceberg/actions/BaseCommitService.java

+  }
+
+  public List<Exception> exceptionsOfFailedCommits() {
+    return Lists.newArrayList(exceptionsOfFailedCommits);


why not just return exceptionsOfFailedCommits?

This is a defensive copy which prevents callers from mutating internal state.
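To make the defensive-copy point concrete, here is a minimal sketch. It assumes a hypothetical stand-in class, FailureTracker, rather than the real BaseCommitService, and uses the JDK's ArrayList copy constructor in place of Lists.newArrayList.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the service; only the accessor pattern matters here.
class FailureTracker {
  private final List<Exception> exceptionsOfFailedCommits = new ArrayList<>();

  // Hypothetical helper used to simulate a failed commit.
  void recordFailure(Exception e) {
    exceptionsOfFailedCommits.add(e);
  }

  // Returns a snapshot: callers can mutate the result without touching internal state.
  List<Exception> exceptionsOfFailedCommits() {
    return new ArrayList<>(exceptionsOfFailedCommits);
  }

  // The failure count can be derived from the list instead of a separate counter.
  int failedCommits() {
    return exceptionsOfFailedCommits.size();
  }
}
```

If the accessor returned the field directly, a caller's clear() or add() would silently change what the service reports afterwards.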

stubz151 · 2025-12-17T13:57:59Z

core/src/main/java/org/apache/iceberg/actions/BaseCommitService.java

  private final ConcurrentLinkedQueue<T> completedRewrites;
  private final ConcurrentLinkedQueue<String> inProgressCommits;
  private final ConcurrentLinkedQueue<T> committedRewrites;
+  private final List<Exception> exceptionsOfFailedCommits;


This might have to be a synchronizedList right?

The committer service appears to run on a single thread via Executors.newSingleThreadExecutor.

However, to be consistent and to make it resilient to future changes, I'm changing it to Collections.synchronizedList(Lists.newArrayList()).
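A rough sketch of what the synchronized wrapper does and does not cover, assuming a hypothetical class named SynchronizedFailureList and the JDK's ArrayList rather than Lists.newArrayList. Individual calls such as add() are thread-safe, while compound operations such as iteration are conventionally guarded by locking on the wrapper itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; not the actual BaseCommitService code.
class SynchronizedFailureList {
  // The wrapper makes single calls such as add() and size() thread-safe.
  private final List<Exception> failures = Collections.synchronizedList(new ArrayList<>());

  void record(Exception e) {
    failures.add(e);
  }

  List<Exception> snapshot() {
    // Hold the wrapper's lock while copying so no add() interleaves with the copy.
    synchronized (failures) {
      return new ArrayList<>(failures);
    }
  }
}
```

With a single-threaded executor the lock is uncontended, so the extra cost should be small.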

stubz151 · 2025-12-17T13:59:01Z

core/src/main/java/org/apache/iceberg/actions/BaseCommitService.java

+  public int failedCommits() {
+    return failedCommits;
+  }
+


I think we should add some tests for this behaviour, but I'm happy to get buy-in for this change first before we do that.

I looked at the tests in TestCommitService. They cover the main behaviour but not minor details such as succeededCommits, so I think it's fine to leave this as it is: all we do is increment another variable and add exceptions to a new list.

geruh

Thanks for raising this, @erkanonl. Before we proceed, I'd like to understand the use case better. For instance, what value does surfacing the exceptions to the user add?

On the exceptions list, I'm trying to understand what a user would do with this.

For partial progress scenarios where some commits succeed and some fail, the exceptions will almost always be CommitFailedException due to concurrent modifications when retries are exhausted. What would a caller do differently based on having N copies of this exception vs just the count?
For any other failures like permissions or auth, I'd expect all commits to fail. There's a rare case where permission failures could happen to a subset of my table, but that seems edge-case.
Even with the exception objects, debugging still requires going to logs to understand which commits failed and why.
If there are many failures, we could accumulate a lot of exception objects with full stack traces.

Ultimately, as a user, what I mostly care about is: did my operation succeed, and if not, can I retry? If it's CommitFailedException after retries are exhausted, I just re-run compaction later. But if it's any other exception, like a permission error or a service outage, I wouldn't know from the information available today. So what scenario are we trying to enable here?

geruh · 2025-12-18T00:53:31Z

core/src/main/java/org/apache/iceberg/actions/BaseCommitService.java

  }

+  public int failedCommits() {
+    return failedCommits;


Can't this just be the size() of the failed commits list?

Yes, that works too; there's no difference. I wanted to keep the variables separate for clarity, but things are probably clear enough already, so I'm now returning the size of the list 👍

I'll answer the questions above soon.

For instance, what value does surfacing the exceptions to the user add?
For partial progress scenarios where some commits succeed and some fail, the exceptions will almost always be CommitFailedException due to concurrent modifications when retries are exhausted. What would a caller do differently based on having N copies of this exception vs just the count?

In our logs, we saw that exceptions can occur for different reasons. One particular exception logged by BaseCommitService is that X added_records > Y removed_records. This doesn't look like a standard commit conflict exception (we have a pending action item to investigate this error separately). We want to differentiate these kinds of exceptions from regular CommitFailedExceptions caused by concurrent modifications.

For any other failures like permissions or auth, I'd expect all commits to fail. There's a rare case where permission failures could happen to a subset of my table, but that seems edge-case.

Hmm, I'm not sure what would happen there, but if we have the exception list, we can certainly tell by analysing it, emitting metrics, and logging.

Even with the exception objects, debugging still requires going to logs to understand which commits failed and why.

Yes, it still requires going to the logs to understand which commits failed, but we want to classify errors into different categories to investigate further. We currently only have the CommitFailedException category, but we saw that CommitFailedException can occur for other reasons, as I mentioned above.

If there are many failures, we could accumulate a lot of exception objects with full stack traces.

Good point, but visibility is also required here, and clients should not have hundreds of thousands of commit failed exceptions in BaseCommitService. If they do, they should fix the underlying root cause.
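As an illustration of the classification idea discussed above, here is a hedged sketch of how a caller might bucket the returned exceptions. Everything in it is hypothetical: FailureClassifier, the Category enum, and the nested CommitFailedException (a local stand-in for Iceberg's exception of the same name so the snippet compiles on its own). Grouping by type is only a first cut, since the thread notes that a single exception type can hide different root causes.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical caller-side helper; not part of the PR.
class FailureClassifier {
  // Local stand-in for Iceberg's CommitFailedException, so the sketch is self-contained.
  static class CommitFailedException extends RuntimeException {
    CommitFailedException(String message) {
      super(message);
    }
  }

  // Illustrative categories; a real caller would likely need finer-grained ones.
  enum Category { COMMIT_CONFLICT, OTHER }

  static Category categorize(Exception e) {
    return e instanceof CommitFailedException ? Category.COMMIT_CONFLICT : Category.OTHER;
  }

  // Counts failures per category, e.g. to emit one metric per bucket.
  static Map<Category, Integer> countByCategory(List<Exception> failures) {
    Map<Category, Integer> counts = new EnumMap<>(Category.class);
    for (Exception e : failures) {
      counts.merge(categorize(e), 1, Integer::sum);
    }
    return counts;
  }
}
```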

geruh · 2025-12-19T18:19:57Z

Thanks for the context! I agree with the goal: we should be aware of any failures that happen in this service without searching through the logs. This would help distinguish commit conflicts from other actionable failures so clients can classify failures and investigate.

I don’t think returning a potentially unbounded List<Exception> is the right API for that. It creates memory/perf risk and still doesn’t eliminate the need to check logs for details.

Can we switch this to an optional failure summary instead? For instance, return a bounded list of actionable failure summaries, omitting retried exceptions, where each summary carries minimal context about the failure.

I'll also let others chime in to hear their thoughts.
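One possible shape for the bounded summary suggested above, purely as a sketch. The names BoundedFailureSummaries, FailureSummary, and maxSummaries are assumptions, not anything proposed in the PR, and the sketch does not model the filtering of retried exceptions. It keeps an exact failure count but stores lightweight summaries only up to a fixed cap, so memory stays bounded and no stack traces are retained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a bounded failure summary; not the PR's API.
class BoundedFailureSummaries {
  // Minimal context about one failure; the fields are illustrative.
  static final class FailureSummary {
    final String exceptionType;
    final String message;

    FailureSummary(String exceptionType, String message) {
      this.exceptionType = exceptionType;
      this.message = message;
    }
  }

  private final int maxSummaries;
  private final List<FailureSummary> summaries = new ArrayList<>();
  private int totalFailures = 0;

  BoundedFailureSummaries(int maxSummaries) {
    this.maxSummaries = maxSummaries;
  }

  // Always counts the failure, but stores a summary only while under the cap.
  synchronized void record(Exception e) {
    totalFailures++;
    if (summaries.size() < maxSummaries) {
      summaries.add(new FailureSummary(e.getClass().getSimpleName(), e.getMessage()));
    }
  }

  synchronized int totalFailures() {
    return totalFailures;
  }

  // Defensive copy so callers cannot mutate internal state.
  synchronized List<FailureSummary> summaries() {
    return new ArrayList<>(summaries);
  }
}
```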

Expose failed commit count and exceptions in BaseCommitService

716d828

github-actions bot added the core label Dec 17, 2025

erkanonl mentioned this pull request Dec 17, 2025

Expose failed commit count and exceptions in BaseCommitService #14871

Closed

stubz151 reviewed Dec 17, 2025

Expose failed commit count and exceptions in BaseCommitService

6e79aac

geruh reviewed Dec 18, 2025

Expose failed commit count and exceptions in BaseCommitService

36c73ec
