feat: add Memory Protection Middleware #2646
Conversation
Codecov Report ❌

@@            Coverage Diff             @@
##             main    #2646      +/-   ##
==========================================
- Coverage   79.35%   77.24%    -2.10%
==========================================
  Files         453      457       +4
  Lines       46993    48692    +1699
==========================================
+ Hits        37288    37609     +321
- Misses       6948     8323    +1375
- Partials     2757     2760       +3
Force-pushed from f68d3f0 to 0683e46.
The commit introduces a new memory protection middleware to help prevent out-of-memory conditions in SpiceDB by implementing admission control based on current memory usage. This is not a perfect solution (it doesn't prevent non-traffic-related sources of OOM) and is meant to support other future improvements to resource sharing in a single SpiceDB node. The middleware is installed both in the main API and in dispatch, but at different thresholds. Memory usage is polled in the background, and if in-flight memory usage rises above the threshold, backpressure is applied to incoming requests. The dispatch threshold is higher than the API threshold so that already admitted traffic is preserved as much as possible.
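For illustration, here is a minimal, self-contained sketch of the background-polling idea (the type and function names are made up; the PR's actual MemoryAdmissionMiddleware may be structured differently):

```go
package memoryprotection

import (
	"context"
	"sync/atomic"
	"time"
)

// backgroundSampler is an illustrative sketch of the polling loop described
// above; the real middleware works along these lines, but its types and
// signatures may differ.
type backgroundSampler struct {
	lastUsageBytes atomic.Uint64
	sample         func() uint64 // reads current memory usage, e.g. via runtime/metrics
}

// start launches a goroutine that refreshes the cached usage on a fixed
// interval until ctx is cancelled; request admission only reads the cache and
// never samples memory on the hot path.
func (s *backgroundSampler) start(ctx context.Context, interval time.Duration) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				s.lastUsageBytes.Store(s.sample())
			}
		}
	}()
}
```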
Force-pushed from d65a7c0 to 7afdbc9.
- job_name: "spicedb"
  static_configs:
  - targets: ["spicedb:9090"]
  - targets: ["spicedb-1:9090"]
FYI so that we can verify the new metrics in Grafana
Force-pushed from 7d7fb92 to dd062d7.
FYI: I can move the mock generation to a different PR.
@@ -0,0 +1,4 @@
internal/mocks/*.go linguist-generated=true
FYI: this is so that when these files change, the PR diff view on GitHub flags them as automatically generated and reviewers can simply mark them as "viewed".
Force-pushed from dd062d7 to 9ff3096.
// Start background sampling with context
am.startBackgroundSampling()

log.Info().
FYI
{"level":"info","name":"dispatch-middleware","memory_limit_bytes":15762733056,"threshold_percent":95,"sample_interval_seconds":1,"time":"2025-10-28T18:27:10Z","message":"memory protection middleware initialized with background sampling"}
{"level":"info","name":"unary-middleware","memory_limit_bytes":15762733056,"threshold_percent":90,"sample_interval_seconds":1,"time":"2025-10-28T18:27:10Z","message":"memory protection middleware initialized with background sampling"}
{"level":"info","name":"stream-middleware","memory_limit_bytes":15762733056,"threshold_percent":90,"sample_interval_seconds":1,"time":"2025-10-28T18:27:10Z","message":"memory protection middleware initialized with background sampling"}
// Memory Protection flags
apiFlags.BoolVar(&config.MemoryProtectionEnabled, "memory-protection-enabled", true, "enables a memory-based middleware that rejects requests when memory usage is too high")
apiFlags.IntVar(&config.MemoryProtectionAPIThresholdPercent, "memory-protection-api-threshold", 90, "memory usage threshold percentage for regular API requests (0-100)")
I don't like that we have the default 90 here and also in the struct itself 😭 I'd love a unified approach in the future.
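For example, one possible unified approach (purely illustrative; the PR registers these flags on a pflag FlagSet with a literal 90 in both places) is to declare the default once and reference it from both the struct default and the flag registration:

```go
package main

import (
	"flag"
	"fmt"
)

// Declaring the default once keeps the struct default and the flag default
// from drifting apart. The names below are illustrative, not the PR's.
const defaultAPIMemoryThresholdPercent = 90

type memoryProtectionConfig struct {
	APIThresholdPercent int
}

func main() {
	cfg := memoryProtectionConfig{APIThresholdPercent: defaultAPIMemoryThresholdPercent}
	flag.IntVar(&cfg.APIThresholdPercent, "memory-protection-api-threshold",
		defaultAPIMemoryThresholdPercent,
		"memory usage threshold percentage for regular API requests (0-100)")
	flag.Parse()
	fmt.Println("api threshold:", cfg.APIThresholdPercent)
}
```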
apiFlags.StringVar(&config.MismatchZedTokenBehavior, "mismatch-zed-token-behavior", "full-consistency", "behavior to enforce when an API call receives a zedtoken that was originally intended for a different kind of datastore. One of: full-consistency (treat as a full-consistency call, ignoring the zedtoken), min-latency (treat as a min-latency call, ignoring the zedtoken), error (return an error). defaults to full-consistency for safety.")

// Memory Protection flags
apiFlags.BoolVar(&config.MemoryProtectionEnabled, "memory-protection-enabled", true, "enables a memory-based middleware that rejects requests when memory usage is too high")
I understand the appeal to have this enabled by default (it's for the better!), but playing devil's advocate, this behavior may be surprising for folks as they update to the next release.
WithInterceptor(grpcMetricsUnaryInterceptor).
Done(),

NewUnaryMiddleware().
I understand why we would want to add it here, and I think it's a realistic and practical place for it, but I'd be remiss if I didn't mention that we lack protection while the earlier middleware layers are traversed.
But again, this is not meant to be perfect, just good enough ™️
opt = opt.WithDatastore(nil)

- defaultMw, err := DefaultUnaryMiddleware(opt)
+ defaultMw, err := DefaultUnaryMiddleware(context.Background(), opt)
Nit:

- defaultMw, err := DefaultUnaryMiddleware(context.Background(), opt)
+ defaultMw, err := DefaultUnaryMiddleware(t.Context(), opt)
opt = opt.WithDatastore(nil)

- defaultMw, err := DefaultStreamingMiddleware(opt)
+ defaultMw, err := DefaultStreamingMiddleware(context.Background(), opt)
Same nit as above:

- defaultMw, err := DefaultStreamingMiddleware(context.Background(), opt)
+ defaultMw, err := DefaultStreamingMiddleware(t.Context(), opt)
var (
	// RejectedRequestsCounter tracks requests rejected due to memory pressure
	RejectedRequestsCounter = promauto.NewCounterVec(prometheus.CounterOpts{
I think it would be easier to craft queries and visualizations if the counter were "self-contained", i.e. if it captured the memory admission outcome rather than just "rejected". You could add a second label with the actual outcome, and that would make it super easy to visualize ratios. Otherwise folks have to use another metric to craft those ratios.
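For illustration, a sketch of what an outcome-labeled counter could look like (the metric and label names here are made up, not the ones in this PR):

```go
package memoryprotection

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// admissionDecisionsCounter counts every admission decision, labeled by
// outcome, so a rejection ratio can be derived from this one metric, e.g.:
//   sum(rate(..._admission_decisions_total{outcome="rejected"}[5m]))
//     / sum(rate(..._admission_decisions_total[5m]))
var admissionDecisionsCounter = promauto.NewCounterVec(prometheus.CounterOpts{
	Namespace: "spicedb",
	Subsystem: "memory_protection",
	Name:      "admission_decisions_total",
	Help:      "Total admission decisions made by the memory protection middleware, by outcome.",
}, []string{"middleware", "outcome"})

// recordDecision records a single admission decision for the named middleware.
func recordDecision(middlewareName string, admitted bool) {
	outcome := "admitted"
	if !admitted {
		outcome = "rejected"
	}
	admissionDecisionsCounter.WithLabelValues(middlewareName, outcome).Inc()
}
```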
// Get the current GOMEMLIMIT
memoryLimit := limitProvider.Get()
if memoryLimit < 0 {
Why would we want to allow a memory limit of zero?
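For context, a hypothetical helper showing how the effective GOMEMLIMIT can be read and why non-positive or unset values are worth rejecting (the helper name and exact handling are assumptions, not the PR's code):

```go
package memoryprotection

import (
	"math"
	"runtime/debug"
)

// readMemoryLimit illustrates how the effective GOMEMLIMIT can be obtained:
// debug.SetMemoryLimit with a negative argument leaves the limit unchanged and
// returns the current value. When no limit has been configured, the runtime
// reports math.MaxInt64, so treating non-positive and unset values as "no
// usable limit" avoids computing thresholds against a meaningless number.
func readMemoryLimit() (limitBytes int64, ok bool) {
	limit := debug.SetMemoryLimit(-1)
	if limit <= 0 || limit == math.MaxInt64 {
		return 0, false
	}
	return limit, true
}
```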
// Memory Protection flags
apiFlags.BoolVar(&config.MemoryProtectionEnabled, "memory-protection-enabled", true, "enables a memory-based middleware that rejects requests when memory usage is too high")
apiFlags.IntVar(&config.MemoryProtectionAPIThresholdPercent, "memory-protection-api-threshold", 90, "memory usage threshold percentage for regular API requests (0-100)")
We have some percent-based flags that are defined as [0...1] floats. Worth having a look and deciding which approach to commit to.
// sampleMemory samples the current memory usage and updates the cached value
func (am *MemoryAdmissionMiddleware) sampleMemory() {
	defer func() {
In which scenarios may this happen? Can the process continue to operate reliably? I don't think we typically recover panics in SpiceDB. Does it, in theory, mean that at some point we could stop sampling and have SpiceDB operate on a stale snapshot of memory?
now := time.Now()
metrics.Read(am.metricsSamples)
newUsage := am.metricsSamples[0].Value.Uint64()
Indexed array access: Is this guaranteed? Should we add a length check?
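For illustration, a guarded version of this read might look like the following (the metric name /memory/classes/total:bytes and the helper name are assumptions; the PR may sample a different value):

```go
package memoryprotection

import "runtime/metrics"

// sampleTotalMemory is a hypothetical helper showing the suggested guard
// around the indexed access: if the sample is missing or comes back with an
// unexpected kind (e.g. metrics.KindBad for an unknown name), skip this
// sampling round instead of panicking.
func sampleTotalMemory() (uint64, bool) {
	samples := []metrics.Sample{{Name: "/memory/classes/total:bytes"}}
	metrics.Read(samples)
	if len(samples) == 0 || samples[0].Value.Kind() != metrics.KindUint64 {
		return 0, false
	}
	return samples[0].Value.Uint64(), true
}
```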
am.lastMemorySampleInBytes.Store(newUsage)
am.timestampLastMemorySample.Store(&now)
Is it a concern that these two values are not stored atomically as a whole? There is a point in execution where lastMemorySampleInBytes has been updated but timestampLastMemorySample has not yet been updated.
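One way to address this, sketched below with illustrative names (not the PR's types), is to publish both values behind a single atomically swapped pointer:

```go
package memoryprotection

import (
	"sync/atomic"
	"time"
)

// memorySample pairs a usage reading with its timestamp so both are published
// together; swapping a single pointer makes the pair visible atomically.
type memorySample struct {
	usageBytes uint64
	sampledAt  time.Time
}

type sampleHolder struct {
	last atomic.Pointer[memorySample]
}

// record publishes a new sample; readers never observe a usage value paired
// with a stale timestamp (or vice versa).
func (h *sampleHolder) record(usageBytes uint64) {
	h.last.Store(&memorySample{usageBytes: usageBytes, sampledAt: time.Now()})
}

// latest returns the most recent sample, or ok=false before the first record.
func (h *sampleHolder) latest() (usageBytes uint64, sampledAt time.Time, ok bool) {
	s := h.last.Load()
	if s == nil {
		return 0, time.Time{}, false
	}
	return s.usageBytes, s.sampledAt, true
}
```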
Description
This introduces a new memory protection middleware to help prevent out-of-memory conditions in SpiceDB by implementing admission control based on current memory usage.
This is not a perfect solution (doesn't prevent non-traffic-related sources of OOM) and is meant to support other future improvements to resource sharing in a single SpiceDB node.
The middleware is installed both in the main API and in dispatch, but at different thresholds. Memory usage is polled in the background, and if in-flight memory usage rises above the threshold, backpressure is applied to incoming requests.
The dispatch threshold is higher than the API threshold to preserve already admitted traffic as much as possible.
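For readers unfamiliar with the approach, a minimal sketch of the admission-control idea as a gRPC unary interceptor is shown below (names, signatures, and the exact rejection code are assumptions, not the PR's implementation):

```go
package memoryprotection

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// UnaryServerInterceptor sketches the admission-control idea: consult the
// background-sampled memory usage and reject new requests once usage crosses
// the configured threshold. All names here are illustrative.
func UnaryServerInterceptor(currentUsageBytes func() uint64, limitBytes uint64, thresholdPercent uint64) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req any, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (any, error) {
		// Reject before doing any work if sampled usage exceeds the threshold
		// fraction of the memory limit; otherwise admit the request.
		if limitBytes > 0 &&
			float64(currentUsageBytes()) > float64(limitBytes)*float64(thresholdPercent)/100.0 {
			return nil, status.Error(codes.ResourceExhausted, "memory usage is too high; retry later")
		}
		return handler(ctx, req)
	}
}
```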
Testing
Set mem_limit: "200mb" in the docker-compose file, run docker-compose up --build, and you should see logs such as the ones above,
and this graph in Grafana: