
Conversation

@toelke (Collaborator) commented Oct 29, 2025

Implements exponential backoff (1s to 64s max) to address issue #182 where rapid config changes could overwhelm the Kubernetes API server.

The backoff applies to ALL updates, even when config hashes change, preventing the scenario where a buggy controller rapidly updating secrets causes Wave to rapidly update deployments.

Key features:

  • Backoff progression: 1s → 2s → 4s → 8s → 16s → 32s → 64s (max)
  • State tracked in-memory within the operator
  • Resets after period of stability to avoid penalizing isolated changes
  • Thread-safe implementation with mutex protection

Fixes #182

🤖 Generated with Claude Code

```go
state.backoffLevel++
// Cap at level 6 (64s)
if state.backoffLevel > 6 {
	state.backoffLevel = 6
```
Contributor commented:

This is redundant with the MaxBackoff check above. Either way works, but both together do not make it better.

```go
	level = 6
}
backoff := MinBackoff * (1 << level) // 2^level seconds
if backoff > MaxBackoff {
```
Contributor commented:

Either this check or the level check would suffice. Claude tried to be safe and added three checks ;-)

```go
// Record successful update
h.backoffTracker.RecordUpdate(instanceName)
} else {
	// No changes detected - system is stable
```
Contributor commented:

I am not sure this is sound. If Wave is triggered without a hash change (e.g. due to an annotation on the Deployment), that would reset the backoff. I guess we would have to check the backoff here as well.

Collaborator (author) commented:

Even after reading this code and asking the LLM to add a clarifying comment, this is the part where I wanted to do the most testing :-D

I will probably only have time to test this next week.

Contributor commented:

Those models do not have a notion of concurrency, so don't expect them to produce race-free solutions. They rarely do. I have seen too many LLM implementations with races, wrong ordering, or deadlocks ;-). Most of the time those models are not even able to fix the issue if you explain it to them.

Collaborator (author) commented:

I saw a Mutex somewhere, but that also requires thorough review.

Contributor commented:

That should be good. It's isolated to one method with a deferred release.

The race here is really that reconciles can (and will) happen for multiple unrelated reasons. Basically, the backoff check needs to happen in both if branches. I would move it up and simply bail out (maybe with two different log messages).

This definitely needs a test. The tests Cursor generated look fine but arbitrary to me (their focus is a bit unclear). However, they do not test the end-to-end behaviour at all.
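The "move the check up and bail out" restructuring could look like the sketch below. All names here (`backoffTracker`, `Remaining`, the return strings) are illustrative assumptions, not the PR's actual code; the point is only that the backoff is consulted before branching on the hash, so a reconcile without a hash change cannot reset it.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical tracker exposing how long an instance must still wait.
type backoffTracker struct {
	mu    sync.Mutex
	until map[string]time.Time
	delay map[string]time.Duration
}

func newBackoffTracker() *backoffTracker {
	return &backoffTracker{
		until: map[string]time.Time{},
		delay: map[string]time.Duration{},
	}
}

// Remaining reports how much of the backoff window is left, or 0.
func (t *backoffTracker) Remaining(name string) time.Duration {
	t.mu.Lock()
	defer t.mu.Unlock()
	if r := time.Until(t.until[name]); r > 0 {
		return r
	}
	return 0
}

// RecordUpdate doubles the delay (capped at 64s) and starts a new window.
func (t *backoffTracker) RecordUpdate(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	d := t.delay[name]
	if d == 0 {
		d = time.Second
	} else if d < 64*time.Second {
		d *= 2
	}
	t.delay[name] = d
	t.until[name] = time.Now().Add(d)
}

// reconcile checks the backoff before branching on the hash, so a
// reconcile triggered without a hash change cannot reset the backoff.
func reconcile(t *backoffTracker, name string, hashChanged bool) string {
	if wait := t.Remaining(name); wait > 0 {
		if hashChanged {
			return fmt.Sprintf("config changed, but backing off for %s", wait.Round(time.Second))
		}
		return fmt.Sprintf("reconcile during backoff window, %s left", wait.Round(time.Second))
	}
	if !hashChanged {
		return "stable, nothing to do"
	}
	// ... update the Deployment here ...
	t.RecordUpdate(name)
	return "updated"
}

func main() {
	tr := newBackoffTracker()
	fmt.Println(reconcile(tr, "d", true))  // first update goes through
	fmt.Println(reconcile(tr, "d", true))  // hash changed, but within backoff window
	fmt.Println(reconcile(tr, "d", false)) // no hash change: backoff applies, not reset
}
```

The two early returns give the two different log messages suggested above; in a real controller the bail-out would requeue after `wait` rather than return a string.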


Development

Successfully merging this pull request may close these issues.

Wave is very fast to update Deployments and can DDoS the kubernetes API with a lot of ReplicaSets

3 participants