Skip to content

Conversation

@linkvt
Copy link
Contributor

@linkvt linkvt commented Nov 7, 2025

Proposed Changes

  • Fix a time-of-check-time-of-use race condition in the revision backend manager where revision watchers were created after its cleanup

I couldn't reproduce the data race after almost 10k test runs but likely found the reason causing this issue, the explanation seems sound to me.

Explanation

The data race warning has the following output (copied from related issue):

WARNING: DATA RACE
Read at 0x00c0006c458b by goroutine 492:
  testing.(*common).destination()
      /root/.gvm/gos/go1.25.3/src/testing/testing.go:1049 +0x96
  testing.(*common).log()
      /root/.gvm/gos/go1.25.3/src/testing/testing.go:1027 +0xbe
  testing.(*common).Logf()
      /root/.gvm/gos/go1.25.3/src/testing/testing.go:1191 +0x8e
  go.uber.org/zap/zaptest.TestingWriter.Write()
      /home/prow/go/src/knative.dev/serving/vendor/go.uber.org/zap/zaptest/logger.go:146 +0x119
  go.uber.org/zap/zaptest.(*TestingWriter).Write()
      <autogenerated>:1 +0x74
  go.uber.org/zap/zapcore.(*ioCore).Write()
      /home/prow/go/src/knative.dev/serving/vendor/go.uber.org/zap/zapcore/core.go:99 +0x18d
  go.uber.org/zap/zapcore.(*CheckedEntry).Write()
      /home/prow/go/src/knative.dev/serving/vendor/go.uber.org/zap/zapcore/entry.go:253 +0x1ec
  go.uber.org/zap.(*SugaredLogger).log()
      /home/prow/go/src/knative.dev/serving/vendor/go.uber.org/zap/sugar.go:355 +0x12c
  go.uber.org/zap.(*SugaredLogger).Debugw()
      /home/prow/go/src/knative.dev/serving/vendor/go.uber.org/zap/sugar.go:251 +0x55e
  knative.dev/serving/pkg/activator/net.(*revisionWatcher).run()
      /home/prow/go/src/knative.dev/serving/pkg/activator/net/revision_backends.go:422 +0x18c
  knative.dev/serving/pkg/activator/net.(*revisionBackendsManager).getOrCreateRevisionWatcher.gowrap2()
      /home/prow/go/src/knative.dev/serving/pkg/activator/net/revision_backends.go:558 +0x44

Previous write at 0x00c0006c458b by goroutine 480:
  testing.tRunner.func1()
      /root/.gvm/gos/go1.25.3/src/testing/testing.go:1921 +0x904
  runtime.deferreturn()
      /root/.gvm/gos/go1.25.3/src/runtime/panic.go:589 +0x5d
  testing.(*T).Run.gowrap1()
      /root/.gvm/gos/go1.25.3/src/testing/testing.go:1997 +0x44

Goroutine 492 (running) created at:
  knative.dev/serving/pkg/activator/net.(*revisionBackendsManager).getOrCreateRevisionWatcher()
      /home/prow/go/src/knative.dev/serving/pkg/activator/net/revision_backends.go:558 +0xf67
  knative.dev/serving/pkg/activator/net.(*revisionBackendsManager).endpointsUpdated()
      /home/prow/go/src/knative.dev/serving/pkg/activator/net/revision_backends.go:577 +0x147
  knative.dev/serving/pkg/activator/net.(*revisionBackendsManager).endpointsUpdated-fm()
      <autogenerated>:1 +0x47
  k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd()
      /home/prow/go/src/knative.dev/serving/vendor/k8s.io/client-go/tools/cache/controller.go:257 +0x63
  k8s.io/client-go/tools/cache.(*ResourceEventHandlerFuncs).OnAdd()
      <autogenerated>:1 +0x1b
  k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnAdd()
      /home/prow/go/src/knative.dev/serving/vendor/k8s.io/client-go/tools/cache/controller.go:320 +0x82
  k8s.io/client-go/tools/cache.(*FilteringResourceEventHandler).OnAdd()
      <autogenerated>:1 +0x75
  k8s.io/client-go/tools/cache.(*processorListener).run.func1()
      /home/prow/go/src/knative.dev/serving/vendor/k8s.io/client-go/tools/cache/shared_informer.go:1077 +0x1e4
  k8s.io/client-go/tools/cache.(*processorListener).run()
      /home/prow/go/src/knative.dev/serving/vendor/k8s.io/client-go/tools/cache/shared_informer.go:1087 +0x5e
  k8s.io/client-go/tools/cache.(*processorListener).run-fm()
      <autogenerated>:1 +0x33
  k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
      /home/prow/go/src/knative.dev/serving/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x82

Goroutine 480 (running) created at:
  testing.(*T).Run()
      /root/.gvm/gos/go1.25.3/src/testing/testing.go:1997 +0x9d2
  knative.dev/serving/pkg/activator/net.TestRevisionBackendManagerAddEndpoint()
      /home/prow/go/src/knative.dev/serving/pkg/activator/net/revision_backends_test.go:967 +0x4df7
  testing.tRunner()
      /root/.gvm/gos/go1.25.3/src/testing/testing.go:1934 +0x21c
  testing.(*T).Run.gowrap1()
      /root/.gvm/gos/go1.25.3/src/testing/testing.go:1997 +0x44

We see that an informer calls endpointsUpdated which runs the revision watcher in a new goroutine (492) that calls the log statement causing the data race.

There is a shutdown sequence that stops revision watchers:

  1. cleanup in defer func of the test triggering context.Done()
  2. cleanup goroutine in newRevisionBackendsManagerWithProbeFrequency waits for the context cancellation and then waits for all watchers to complete (see code below)
  3. defer func of test ensures that everything is shutdown by calling waitForRevisionBackendManager
  4. in theory there is now no possibility for the logging code to run anymore

This makes it very likely that we have TOCTOU issue where the goroutine run by endpointsUpdated causes the problem:

  1. endpointsUpdated checks ctx.Done() and passes
  2. race window
    • thread a: endpointsUpdated calls getOrCreateRevisionWatcher
    • thread b: context is cancelled and revision backend manager acquires revisionWatchersMux mutex
    • thread a: getOrCreateRevisionWatcher blocks waiting for the mutex
    • thread b: revision backend manager shutdown all watchers and releases the mutex
  3. getOrCreateRevisionWatcher acquires the lock, can't find the revision watcher and creates a new one
  4. this revision watcher starts logging while the tests defer func already completed and while the test cleanup is WIP -> data race

Relevant code blocks:

func (rbm *revisionBackendsManager) endpointsUpdated(newObj interface{}) {
// Ignore the updates when we've terminated.
select {
case <-rbm.ctx.Done():
return
default:
}
endpoints := newObj.(*corev1.Endpoints)
revID := types.NamespacedName{Namespace: endpoints.Namespace, Name: endpoints.Labels[serving.RevisionLabelKey]}
rw, err := rbm.getOrCreateRevisionWatcher(revID)

go func() {
// updateCh can only be closed after revisionWatchers are done running
defer close(rbm.updateCh)
// Wait for cancellation
<-rbm.ctx.Done()
// Wait for all revisionWatchers to be done
rbm.revisionWatchersMux.Lock()
defer rbm.revisionWatchersMux.Unlock()
for _, rw := range rbm.revisionWatchers {
<-rw.done
}

Release Note

Fix a rare data race in revision backend manager creating revision watchers during shutdown

Fixes #16204

@knative-prow knative-prow bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 7, 2025
@codecov
Copy link

codecov bot commented Nov 7, 2025

Codecov Report

❌ Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.04%. Comparing base (5f7aa6e) to head (da8a05e).
⚠️ Report is 11 commits behind head on main.

Files with missing lines Patch % Lines
pkg/activator/net/revision_backends.go 50.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16225      +/-   ##
==========================================
- Coverage   80.05%   80.04%   -0.01%     
==========================================
  Files         214      214              
  Lines       13313    13317       +4     
==========================================
+ Hits        10658    10660       +2     
- Misses       2295     2298       +3     
+ Partials      360      359       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@linkvt
Copy link
Contributor Author

linkvt commented Nov 7, 2025

/remove-approve which is there from the release lead role

@knative-prow knative-prow bot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 7, 2025
@linkvt
Copy link
Contributor Author

linkvt commented Nov 11, 2025

/cc @Fedosin
PTAL, thanks!

@linkvt linkvt requested a review from Fedosin November 11, 2025 21:35
@dprotaso
Copy link
Member

/lgtm
/approve

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Nov 17, 2025
@knative-prow
Copy link

knative-prow bot commented Nov 17, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 17, 2025
@dprotaso
Copy link
Member

/retest

The failure seems unrelated.

I think MinScaleTransition needs a larger timeout - after 1 minute only one pod was created and Running. And there weren't any other pods created yet.

@linkvt
Copy link
Contributor Author

linkvt commented Nov 18, 2025

/retest

@knative-prow knative-prow bot merged commit c46db77 into knative:main Nov 18, 2025
92 checks passed
@linkvt linkvt deleted the fix-revision-manager-data-race branch November 18, 2025 10:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[flaky] pkg/activator/net: TestRevisionBackendManagerAddEndpoint

2 participants