
Conversation

Contributor

@ecordell ecordell commented Oct 23, 2025

Description

This introduces a new memory protection middleware to help prevent out-of-memory conditions in SpiceDB by implementing admission control based on current memory usage.

This is not a perfect solution (doesn't prevent non-traffic-related sources of OOM) and is meant to support other future improvements to resource sharing in a single SpiceDB node.

The middleware is installed both in the main API and in dispatch, but at different thresholds. Memory usage is polled in the background, and if in-flight memory rises above the threshold, backpressure is placed on incoming requests.

The dispatch threshold is higher than the API threshold to preserve already admitted traffic as much as possible.
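
At a high level, the admission control works like the sketch below (illustrative names only, not the exact implementation in this PR): a background goroutine polls memory usage on an interval and caches it, and the gRPC interceptor rejects new requests with ResourceExhausted once the cached usage crosses the configured threshold.

// Sketch of the admission-control idea; names are hypothetical.
package memorysketch

import (
    "context"
    "sync/atomic"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

type admission struct {
    limitBytes       uint64 // assumed > 0 (e.g. derived from GOMEMLIMIT)
    thresholdPercent uint64
    usageBytes       atomic.Uint64 // updated only by the background sampler
}

// sampleLoop polls memory usage in the background so the request path only
// reads a cached value instead of sampling on every call.
func (a *admission) sampleLoop(ctx context.Context, interval time.Duration, read func() uint64) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            a.usageBytes.Store(read())
        }
    }
}

// UnaryInterceptor rejects new requests once cached usage exceeds the threshold.
func (a *admission) UnaryInterceptor() grpc.UnaryServerInterceptor {
    return func(ctx context.Context, req any, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (any, error) {
        usage := a.usageBytes.Load() * 100 / a.limitBytes
        if usage >= a.thresholdPercent {
            return nil, status.Errorf(codes.ResourceExhausted,
                "server is experiencing memory pressure (%d%% usage, threshold: %d%%)", usage, a.thresholdPercent)
        }
        return handler(ctx, req)
    }
}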

Testing

  • Unit tests included
  • Manual E2E test:
    • Modify docker-compose.yaml to set mem_limit: "200mb"
    • Run docker-compose up --build
    • Run this:
zed context set example localhost:50051 foobar --insecure
zed import development/schema.yaml
{
    echo '{"items":['
    for i in $(seq 1 200); do
      d=$(( (RANDOM % 9999) + 1 ))
      echo -n "{\"resource\":{\"objectTy  pe\":\"document\",\"objectId\": \"${d}\"}, \"permission\":\"view\",\"subject\":{ \"object\": {\"objectType\": \"user\", \"objectId\": \"1\"}}}"
      [ $i -lt 200 ] && echo -n ","
    done
    echo "], \"with_tracing\": true}"
} > payload.json
ab -n 100000 -c 200 -T 'application/json' -H 'Authorization: Bearer foobar' -p payload.json http://localhost:8443/v1/permissions/checkbulk

You should see logs such as:

{
  "level": "warn",
  "traceID": "125b7b37c5775af1f2d9ebf253dcf3d1",
  "protocol": "grpc",
  "grpc.component": "server",
  "grpc.service": "authzed.api.v1.PermissionsService",
  "grpc.method": "CheckBulkPermissions",
  "grpc.method_type": "unary",
  "requestID": "d3vurmuoqu8s73bsdom0",
  "peer.address": "127.0.0.1:60376",
  "grpc.start_time": "2025-10-27T22:10:35Z",
  "grpc.code": "ResourceExhausted",
  "grpc.error": "rpc error: code = ResourceExhausted desc = server is experiencing memory pressure (124.7% usage, threshold: 90%)",
  "grpc.time_ms": 0,
  "time": "2025-10-27T22:10:35Z",
  "message": "finished call"
}

and this graph in Grafana:

(image: Grafana graph)

@github-actions github-actions bot added the area/cli, area/dependencies, and area/tooling labels Oct 23, 2025

codecov bot commented Oct 23, 2025

Codecov Report

❌ Patch coverage is 10.67660% with 1439 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.24%. Comparing base (cdef621) to head (9ff3096).
⚠️ Report is 1 commit behind head on main.

Files with missing lines                               Patch %   Lines
internal/mocks/mock_datastore.go                         0.32%   1277 Missing ⚠️
internal/mocks/mock_dispatcher.go                        2.57%   152 Missing ⚠️
...al/middleware/memoryprotection/memoryprotection.go   96.04%   3 Missing and 1 partial ⚠️
pkg/cmd/server/server.go                                90.48%   2 Missing and 2 partials ⚠️
pkg/cmd/testserver/testserver.go                         0.00%   2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2646      +/-   ##
==========================================
- Coverage   79.35%   77.24%   -2.10%     
==========================================
  Files         453      457       +4     
  Lines       46993    48692    +1699     
==========================================
+ Hits        37288    37609     +321     
- Misses       6948     8323    +1375     
- Partials     2757     2760       +3     


@miparnisari miparnisari force-pushed the oomprotect branch 4 times, most recently from f68d3f0 to 0683e46 on October 27, 2025 21:27
The commit introduces a new memory protection middleware to help prevent out-of-memory conditions in SpiceDB by implementing admission control based on current memory usage.

This is not a perfect solution (doesn't prevent non-traffic-related sources of OOM) and is meant to support other future improvements to resource sharing in a single SpiceDB node.

The middleware is installed both in the main api and in dispatch, but at
different thresholds. Memory usage is polled in the background, and if
in-flight memory rises above the threshold, backpressure is placed on
incoming requests.

The dispatch threshold is higher than the API threshold to preserve
already admitted traffic as much as possible.
@miparnisari miparnisari force-pushed the oomprotect branch 2 times, most recently from d65a7c0 to 7afdbc9 on October 27, 2025 23:51
- job_name: "spicedb"
static_configs:
- targets: ["spicedb:9090"]
- targets: ["spicedb-1:9090"]
Contributor

FYI so that we can verify the new metrics in Grafana

@github-actions github-actions bot added the area/dispatch label Oct 27, 2025
@miparnisari miparnisari force-pushed the oomprotect branch 5 times, most recently from 7d7fb92 to dd062d7 on October 28, 2025 01:06
@miparnisari miparnisari marked this pull request as ready for review October 28, 2025 05:01
@miparnisari miparnisari requested a review from a team as a code owner October 28, 2025 05:01
Contributor

FYI I can move the mock generation to a different PR

@@ -0,0 +1,4 @@
internal/mocks/*.go linguist-generated=true
Contributor

FYI so that when changing these, the "View PR" view on GitHub shows the files as automatically generated, and people can just mark them as "viewed"

// Start background sampling with context
am.startBackgroundSampling()

log.Info().
Contributor

FYI

{"level":"info","name":"dispatch-middleware","memory_limit_bytes":15762733056,"threshold_percent":95,"sample_interval_seconds":1,"time":"2025-10-28T18:27:10Z","message":"memory protection middleware initialized with background sampling"}

{"level":"info","name":"unary-middleware","memory_limit_bytes":15762733056,"threshold_percent":90,"sample_interval_seconds":1,"time":"2025-10-28T18:27:10Z","message":"memory protection middleware initialized with background sampling"}

{"level":"info","name":"stream-middleware","memory_limit_bytes":15762733056,"threshold_percent":90,"sample_interval_seconds":1,"time":"2025-10-28T18:27:10Z","message":"memory protection middleware initialized with background sampling"}


// Memory Protection flags
apiFlags.BoolVar(&config.MemoryProtectionEnabled, "memory-protection-enabled", true, "enables a memory-based middleware that rejects requests when memory usage is too high")
apiFlags.IntVar(&config.MemoryProtectionAPIThresholdPercent, "memory-protection-api-threshold", 90, "memory usage threshold percentage for regular API requests (0-100)")
Contributor
@miparnisari miparnisari Oct 28, 2025

I don't like that we have the default 90 here and also in the struct itself 😭 I'd love a unified approach in the future
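
One shape such a unified approach could take (an illustrative sketch with hypothetical names, not existing SpiceDB code) is to declare the default once and reference it from both the struct default and the flag registration:

// Hypothetical single source of truth for the default threshold.
package config

import "github.com/spf13/pflag"

const defaultMemoryProtectionAPIThreshold = 90

type Config struct {
    MemoryProtectionAPIThresholdPercent int
}

func NewConfigWithDefaults() *Config {
    return &Config{MemoryProtectionAPIThresholdPercent: defaultMemoryProtectionAPIThreshold}
}

func RegisterMemoryProtectionFlags(apiFlags *pflag.FlagSet, config *Config) {
    apiFlags.IntVar(&config.MemoryProtectionAPIThresholdPercent,
        "memory-protection-api-threshold",
        defaultMemoryProtectionAPIThreshold,
        "memory usage threshold percentage for regular API requests (0-100)")
}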

apiFlags.StringVar(&config.MismatchZedTokenBehavior, "mismatch-zed-token-behavior", "full-consistency", "behavior to enforce when an API call receives a zedtoken that was originally intended for a different kind of datastore. One of: full-consistency (treat as a full-consistency call, ignoring the zedtoken), min-latency (treat as a min-latency call, ignoring the zedtoken), error (return an error). defaults to full-consistency for safety.")

// Memory Protection flags
apiFlags.BoolVar(&config.MemoryProtectionEnabled, "memory-protection-enabled", true, "enables a memory-based middleware that rejects requests when memory usage is too high")
Contributor

I understand the appeal to have this enabled by default (it's for the better!), but playing devil's advocate, this behavior may be surprising for folks as they update to the next release.

WithInterceptor(grpcMetricsUnaryInterceptor).
Done(),

NewUnaryMiddleware().
Contributor

I understand why we would want to add it here, and I think it's a realistic and practical place for it, but I'd be remiss if I didn't mention that we miss protection while the earlier middleware layers are traversed.

But again, this is not meant to be perfect, but good enough ™️
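
For illustration only (a simplified chain, not SpiceDB's actual middleware stack): interceptors run in registration order, so anything placed before the memory-protection interceptor still executes even for requests that end up rejected, which is the gap described above.

// Simplified illustration of interceptor ordering; names are hypothetical.
server := grpc.NewServer(
    grpc.ChainUnaryInterceptor(
        loggingInterceptor,          // runs even for requests that end up rejected
        metricsInterceptor,          // same: work done before the memory check is unprotected
        memoryProtectionInterceptor, // admission decision happens here
        authInterceptor,             // only reached once the request is admitted
    ),
)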

opt = opt.WithDatastore(nil)

defaultMw, err := DefaultUnaryMiddleware(opt)
defaultMw, err := DefaultUnaryMiddleware(context.Background(), opt)
Contributor

nit

Suggested change
defaultMw, err := DefaultUnaryMiddleware(context.Background(), opt)
defaultMw, err := DefaultUnaryMiddleware(t.Context(), opt)

opt = opt.WithDatastore(nil)

defaultMw, err := DefaultStreamingMiddleware(opt)
defaultMw, err := DefaultStreamingMiddleware(context.Background(), opt)
Contributor

Suggested change
defaultMw, err := DefaultStreamingMiddleware(context.Background(), opt)
defaultMw, err := DefaultStreamingMiddleware(t.Context(), opt)


var (
// RejectedRequestsCounter tracks requests rejected due to memory pressure
RejectedRequestsCounter = promauto.NewCounterVec(prometheus.CounterOpts{
Contributor

I think it would be easier to craft queries and visualizations if the counter was "self-contained", in that it tracked the memory admission outcome rather than just "rejected". You could add a second label with the actual outcome, and that'd make it super easy to visualize ratios. Otherwise folks have to use another metric to craft those ratios.
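
For example (a hedged sketch of the suggested shape, not the metric actually defined in this PR), a single counter with an outcome label lets the rejection ratio be computed from one series:

// Hypothetical admission metric with an explicit outcome label.
var admissionTotal = promauto.NewCounterVec(prometheus.CounterOpts{
    Namespace: "spicedb",
    Subsystem: "memory_protection",
    Name:      "admission_total",
    Help:      "Requests evaluated by the memory protection middleware, by outcome.",
}, []string{"middleware", "outcome"}) // outcome: "admitted" or "rejected"

// Usage:
//   admissionTotal.WithLabelValues("unary-middleware", "rejected").Inc()
// Ratio in PromQL:
//   sum(rate(spicedb_memory_protection_admission_total{outcome="rejected"}[5m]))
//     / sum(rate(spicedb_memory_protection_admission_total[5m]))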


// Get the current GOMEMLIMIT
memoryLimit := limitProvider.Get()
if memoryLimit < 0 {
Contributor

Why would we want to allow a memory limit of zero?
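
For context, a sketch of how a limit provider might read GOMEMLIMIT (assuming it wraps runtime/debug; the PR's actual provider may differ): debug.SetMemoryLimit(-1) reports the current limit without changing it, and an unset limit comes back as math.MaxInt64 rather than zero, so zero would be an unusual value to admit.

import (
    "math"
    "runtime/debug"
)

// currentMemoryLimit reads GOMEMLIMIT without modifying it; a negative input
// to SetMemoryLimit is the documented way to query the current limit.
func currentMemoryLimit() (int64, bool) {
    limit := debug.SetMemoryLimit(-1)
    if limit <= 0 || limit == math.MaxInt64 {
        return 0, false // no usable limit: a usage ratio cannot be computed
    }
    return limit, true
}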


// Memory Protection flags
apiFlags.BoolVar(&config.MemoryProtectionEnabled, "memory-protection-enabled", true, "enables a memory-based middleware that rejects requests when memory usage is too high")
apiFlags.IntVar(&config.MemoryProtectionAPIThresholdPercent, "memory-protection-api-threshold", 90, "memory usage threshold percentage for regular API requests (0-100)")
Contributor

We have some percent-based flags that are defined as [0...1] floats. Worth having a look and deciding which approach to commit to.


// sampleMemory samples the current memory usage and updates the cached value
func (am *MemoryAdmissionMiddleware) sampleMemory() {
defer func() {
Contributor

In which scenarios may this happen? Can the process continue to operate reliably? I don't think we typically recover panics in SpiceDB. Does it, in theory, mean that at some point we could stop sampling and have SpiceDB operate on a stale snapshot of memory?


now := time.Now()
metrics.Read(am.metricsSamples)
newUsage := am.metricsSamples[0].Value.Uint64()
Contributor

Indexed array access: Is this guaranteed? Should we add a length check?
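
A hedged sketch of the kind of guard being asked about, assuming the samples slice comes from runtime/metrics:

import "runtime/metrics"

// readHeapUsage reads a single runtime/metrics sample defensively, checking
// both the slice length and the value kind before using it.
func readHeapUsage(samples []metrics.Sample) (uint64, bool) {
    metrics.Read(samples)
    if len(samples) == 0 || samples[0].Value.Kind() != metrics.KindUint64 {
        return 0, false
    }
    return samples[0].Value.Uint64(), true
}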

Comment on lines +244 to +245
am.lastMemorySampleInBytes.Store(newUsage)
am.timestampLastMemorySample.Store(&now)
Contributor

Is it a concern that these two values are not stored atomically as a pair? There is a point in execution where lastMemorySampleInBytes has been updated but timestampLastMemorySample has not.
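
One way to make the pair consistent (a sketch, not the PR's code) is to publish both values behind a single atomic pointer so readers always observe a matching sample and timestamp:

import (
    "sync/atomic"
    "time"
)

// memorySample bundles the reading with its timestamp so both are published atomically.
type memorySample struct {
    bytes uint64
    at    time.Time
}

var latest atomic.Pointer[memorySample]

func recordSample(bytes uint64) {
    latest.Store(&memorySample{bytes: bytes, at: time.Now()})
}

func loadSample() (uint64, time.Time, bool) {
    s := latest.Load()
    if s == nil {
        return 0, time.Time{}, false
    }
    return s.bytes, s.at, true
}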


Labels

area/cli, area/dependencies, area/dispatch, area/tooling


3 participants