-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Expose failed commit count and exceptions in BaseCommitService #14872
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| } | ||
|
|
||
| public List<Exception> exceptionsOfFailedCommits() { | ||
| return Lists.newArrayList(exceptionsOfFailedCommits); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just return exceptionsOfFailedCommits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a defensive copy which prevents callers from mutating internal state.
| private final ConcurrentLinkedQueue<T> completedRewrites; | ||
| private final ConcurrentLinkedQueue<String> inProgressCommits; | ||
| private final ConcurrentLinkedQueue<T> committedRewrites; | ||
| private final List<Exception> exceptionsOfFailedCommits; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might have to be a synchronizedList right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Committer service looks to be executed with a single thread through Executors.newSingleThreadExecutor
However; to be consistent and making it resistant to future changes, I'm making it Collections.synchronizedList(Lists.newArrayList());
| public int failedCommits() { | ||
| return failedCommits; | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
think we should add some tests for this behaviour, but am happy to get buy-in first for this change before we do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at the tests of TestCommitService. We test the main behaviour but tests don't test minor/simple details like succeededCommits so I think it's fine to leave it as it is because what we do is just incrementing another variable and adding exceptions into a new list.
geruh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for raising this @erkanonl, before we proceed, I'd like to understand the use case better. For instance, what value does surfacing the exceptions to the user add.
On the exceptions list I'm trying to understand what a user would do with this.
-
For partial progress scenarios where some commits succeed, some fail, the exceptions will almost always be
CommitFailedExceptiondue to concurrent modifications when retries exhaust. What would a caller do differently based on having N copies of this exception vs just the count? -
For any other failures like permissions or auth, I'd expect all commits to fail. There's a rare case where permission failures could happen to a subset of my table, but that seems edge-case.
-
Even with the exception objects, debugging still requires going to logs to understand which commits failed and why.
-
if there are many failures, we could accumulate a lot of exception objects with full stack traces.
Ultimately, as a user, what I might mostly care about is, did my operation succeed and if not can I retry. If it's CommitFailedException after retries exhausted, I just re-run compaction later. But if it's any other exception like permission or service down, I wouldn't know from the information today. So what scenario are we trying to enable here?
| } | ||
|
|
||
| public int failedCommits() { | ||
| return failedCommits; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't this just be the length() of the failed commits list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it can also be used. No difference. I wanted to differentiate the variables from each other to keep things clear but maybe things are already clear so I'm return the size of the list now 👍
I'll get to answering questions above soon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For instance, what value does surfacing the exceptions to the user add.
For partial progress scenarios where some commits succeed, some fail, the exceptions will almost always be CommitFailedException due to concurrent modifications when retries exhaust. What would a caller do differently based on having N copies of this exception vs just the count?
In our logs, we saw that there could be exceptions due to different reasons. One particular exception that we can see as logged by BaseCommitService is that X added_records > Y removed_records. This doesn't look like a standard commit conflict exception (We have a pending AI to investigate this error separately). We want to differentiate these kinds of exceptions from regular CommitFailedExceptions which happen due to concurrent modifications.
For any other failures like permissions or auth, I'd expect all commits to fail. There's a rare case where permission failures could happen to a subset of my table, but that seems edge-case.
Hmm, I'm not sure what would happen there but if we have the exception list, we can certainly tell it by analysing it, emitting metrics and logging.
Even with the exception objects, debugging still requires going to logs to understand which commits failed and why.
Yes, it requires going to logs to understand which commits failed but we want to classify errors at different categories to investigate further. We currently only have CommitFailedException category but we saw that CommitFailedException could happen due to other reasons like I mentioned above.
if there are many failures, we could accumulate a lot of exception objects with full stack traces.
Good point, but visibility is also required here and clients should not hundreds of thousands of commit failed exceptions in BaseCommitService. If they have, they should fix the underlying root cause.
|
Thanks for the context ! I agree with the goal, we should be able to be aware of any failures that happen in this service without searching through the logs. This would help distinguish commit conflicts from other actionable failures so clients can classify failures and investigate. I don’t think returning a potentially unbounded Can we switch this to an optional failure summary instead? For instance, return a bounded list of actionable failure summaries omitting retried exceptions, and each summary would be minimal context about the failure. I'll also let others chime in to hear their thoughts. |
What does this change do?
This PR adds lightweight tracking for failed commit attempts in BaseCommitService.
Specifically, it:
Why is this change needed?
BaseCommitService explicitly supports partial progress, but today failed commits are only logged.
This makes it difficult to:
This change improves observability without altering commit behavior or public APIs.
Scope and compatibility
The change is limited to BaseCommitService, which is package-private and internal
No behavioral changes to commit execution or error handling
No impact on existing public APIs