Add exponential backoff to prevent rapid deployment updates #183
Conversation
Implements exponential backoff (1s to 64s max) to address issue #182, where rapid config changes could overwhelm the Kubernetes API server. The backoff applies to ALL updates, even when config hashes change, preventing the scenario where a buggy controller rapidly updating secrets causes Wave to rapidly update deployments.

Key features:
- Backoff progression: 1s → 2s → 4s → 8s → 16s → 32s → 64s (max)
- State tracked in-memory within the operator
- Resets after a period of stability to avoid penalizing isolated changes
- Thread-safe implementation with mutex protection

Fixes #182

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
```go
state.backoffLevel++
// Cap at level 6 (64s)
if state.backoffLevel > 6 {
	state.backoffLevel = 6
```
This is redundant with the MaxBackoff check above. Either way works, but both together do not make it better.
```go
	level = 6
}
backoff := MinBackoff * (1 << level) // 2^level seconds
if backoff > MaxBackoff {
```
Either this check or the level check would do on its own. Claude tried to be safe and added three checks ;-)
```go
// Record successful update
h.backoffTracker.RecordUpdate(instanceName)
} else {
// No changes detected - system is stable
```
I am not sure this is sound. If Wave is triggered without a hash change (e.g. due to an annotation on the Deployment), that would reset the backoff. I guess we would have to check the backoff here as well.
Even after reading this code and asking the LLM to add a clarifying comment, this is the part where I wanted to do the most testing :-D
I will probably only have time to test this next week.
Those models do not have a notion of concurrency, so don't expect them to produce race-free solutions. They rarely do. I have seen too many LLM implementations with races, wrong ordering, or deadlocks ;-). Most of the time those models are not even able to fix it if you explain the issue to them.
I saw a Mutex somewhere, but that also requires thorough review.
That should be good. It's isolated to one method with a deferred release.
The race here is really that reconciles can (and will) happen for multiple unrelated reasons. Basically what needs to change is that the backoff check needs to happen for both if branches. I would move it up and simply bail out (maybe with two different log messages).
This definitely needs a test. The tests Cursor generated look fine but arbitrary to me (their focus is a bit unclear). However, they do not test the e2e behaviour at all.