Fix a rare data race in revision backend manager #16225
Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #16225      +/-   ##
==========================================
- Coverage   80.05%   80.04%   -0.01%
==========================================
  Files         214      214
  Lines       13313    13317       +4
==========================================
+ Hits        10658    10660       +2
- Misses       2295     2298       +3
+ Partials      360      359       -1

☔ View full report in Codecov by Sentry.
/remove-approve which is there from the release lead role
/cc @Fedosin
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso
/retest

The failure seems unrelated. I think MinScaleTransition needs a larger timeout: after 1 minute only one pod was created and Running, and no other pods had been created yet.
/retest
Proposed Changes
I couldn't reproduce the data race after almost 10k test runs, but I likely found the cause of this issue; the explanation below seems sound to me.
Explanation
The data race warning has the following output (copied from the related issue):
We see that an informer calls endpointsUpdated, which runs the revision watcher in a new goroutine (492) that calls the log statement causing the data race.

There is a shutdown sequence that stops revision watchers:
- newRevisionBackendsManagerWithProbeFrequency waits for the context cancellation and then waits for all watchers to complete (see code below)
- waitForRevisionBackendManager

This makes it very likely that we have a TOCTOU issue where the goroutine run by endpointsUpdated causes the problem:
1. endpointsUpdated checks ctx.Done() and passes
2. endpointsUpdated calls getOrCreateRevisionWatcher
3. meanwhile the shutdown sequence runs and acquires the revisionWatchersMux mutex
4. getOrCreateRevisionWatcher blocks waiting for the mutex
5. getOrCreateRevisionWatcher acquires the lock, can't find the revision watcher and creates a new one

A minimal sketch of this window is shown after the code references below.

Relevant code blocks:
serving/pkg/activator/net/revision_backends.go, lines 567 to 577 in 5f7aa6e
serving/pkg/activator/net/revision_backends.go, lines 501 to 513 in 5f7aa6e
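To make the window concrete, here is a minimal, runnable Go sketch of the pattern described above. The names (backendsManager, getOrCreateWatcher, the done channel) are simplified stand-ins, not the actual revisionBackendsManager code in revision_backends.go; the point is only the ordering of steps 1-5 from the list.

```go
// Minimal sketch of the TOCTOU window: a late informer event can create a
// watcher after the shutdown sequence has already finished.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type backendsManager struct {
	ctx      context.Context
	mux      sync.Mutex // plays the role of revisionWatchersMux
	watchers map[string]context.CancelFunc
	done     chan struct{} // closed once shutdown has stopped all known watchers
}

func newBackendsManager(ctx context.Context) *backendsManager {
	m := &backendsManager{
		ctx:      ctx,
		watchers: map[string]context.CancelFunc{},
		done:     make(chan struct{}),
	}
	// Shutdown sequence: wait for context cancellation, then cancel every
	// watcher currently in the map (step 3 in the list above).
	go func() {
		<-ctx.Done()
		m.mux.Lock()
		defer m.mux.Unlock()
		for _, cancel := range m.watchers {
			cancel()
		}
		close(m.done)
	}()
	return m
}

// endpointsUpdated mirrors the informer callback: a non-blocking ctx check,
// then a lookup that may create a watcher under the mutex.
func (m *backendsManager) endpointsUpdated(rev string) {
	select {
	case <-m.ctx.Done():
		return
	default: // step 1: the check passes because shutdown has not started yet
	}
	m.getOrCreateWatcher(rev) // step 2
}

func (m *backendsManager) getOrCreateWatcher(rev string) {
	m.mux.Lock() // step 4: blocks while the shutdown sequence holds the mutex
	defer m.mux.Unlock()
	if _, ok := m.watchers[rev]; !ok {
		// Step 5: shutdown already finished, yet a brand-new watcher is
		// created here; nothing will ever cancel it, and its goroutine can
		// still log after the caller below believes everything has stopped.
		wctx, cancel := context.WithCancel(context.Background())
		m.watchers[rev] = cancel
		go func() {
			fmt.Println("watcher running for", rev) // stand-in for the racy log call
			<-wctx.Done()
		}()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	m := newBackendsManager(ctx)

	go m.endpointsUpdated("rev-1") // an endpoints event arriving around shutdown
	cancel()                       // shutdown starts
	<-m.done                       // the caller now assumes all watchers are gone

	time.Sleep(100 * time.Millisecond) // the leaked watcher may still log here
}
```

In this sketch, re-checking the context once the mutex is held in getOrCreateWatcher would close the window; whether that mirrors the exact change made here, see the PR diff.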
Release Note
Fixes #16204