Fast eventcounter etags for geoextracts #1657
const acteeVersion = await Actees.getEventCount(foundDataset.acteeId);
// Weak etag, as the order in the resultset is undefined.
return withEtag(acteeVersion, createResponse, true);
Is it expected that the ETag doesn't change when query parameters change? I think so, just wanted to check.
+1 subscribing to this query
Very much expected, yes. In HTTP caching semantics (with a few caveats, most prominently having to do with the Vary response header), the identity of a resource is the URL, including the query parameters.
If a browser sets an If-None-Match on a request for a resource with identity A to the ETag received in an earlier response for a resource with identity B, then that'd be a serious bug (in the browser) ;-)
The fact that query parameters are part of the resource identity is actually even exploited for certain so-called "cache busting" approaches.
The downside is that the order of parameters in the query matters. So two URLs that effectively deliver the same data, as they have the same meaning for the application (e.g. ?offset=10&limit=20 vs ?limit=20&offset=10), are distinct resources to HTTP caches, and they don't reuse the cache of the one for the other. They fortunately don't, as they absolutely shouldn't, because they don't know the application semantics!
I would be well within my rights to write an application that does something completely different for ?a=1&b=2 vs ?b=2&a=1, and HTTP caching should still work.
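To make the identity rule concrete, here is a toy model of a cache that keys its entries by the full URL string. It is a sketch for illustration, not how any particular browser or proxy is implemented; the URL, the ETag value, and the function names are made up.

```javascript
// Toy model of an HTTP cache: entries are keyed by the full URL string,
// query string included, exactly as it was requested.
const httpCache = new Map();

// Remember the ETag that came back with a response for this URL.
function store(url, etag) {
  httpCache.set(url, etag);
}

// Returns the ETag to send in If-None-Match, or undefined on a cache miss.
function etagFor(url) {
  return httpCache.get(url);
}

// Hypothetical URL and ETag value, for illustration only.
store('https://example.org/features?offset=10&limit=20', 'W/"42"');

console.log(etagFor('https://example.org/features?offset=10&limit=20')); // W/"42"
console.log(etagFor('https://example.org/features?limit=20&offset=10')); // undefined
```

The second lookup misses even though the application would return the same data, because the cache compares URLs and knows nothing about what the parameters mean.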
(quoting myself)
The downside is that the order of parameters in the query matters. So two URLs that effectively deliver the same data, as they have the same meaning for the application (e.g. ?offset=10&limit=20 vs ?limit=20&offset=10), are distinct resources to HTTP caches, and they don't reuse the cache of the one for the other. They fortunately don't, as they absolutely shouldn't, because they don't know the application semantics!
So, to expand on that a bit, corollary:
If you want to share a cache of computation results between requests to ?offset=10&limit=20 and ?limit=20&offset=10, you can't really do that with HTTP caching semantics. Intermediate caching proxies don't want to presume that these are effectively the same to your application, and there's no way to tell them (or indeed the browser cache) otherwise.†
The component best situated to understand the application semantics is... the application! surprise!
So, when there is a desire to share a cache between ?offset=10&limit=20 and ?limit=20&offset=10, one would do (potentially additional) caching at the application. For us that'd mean we would, all "from nodejs", compute the result, come up with a caching key (a component of which in this case would be a normalized form of the query parameters), and then store the result in something like redis or memcached (or even just plain postgresql, or files in the filesystem). That's quite a common setup!
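A minimal sketch of such an application-side cache key, assuming the application treats parameter order as insignificant. The names cacheKey, cached and appCache are hypothetical, the Map stands in for redis or memcached, and including a version (for instance an event count) in the key is an assumption about how one might invalidate entries, not something this PR does.

```javascript
// Build a cache key in which query parameter order does not matter.
// Only valid if the application itself ignores parameter order.
function cacheKey(path, query, version) {
  const params = new URLSearchParams(query);
  params.sort(); // stable sort by parameter name
  return `${path}?${params.toString()}@${version}`;
}

// In-memory stand-in for an external store such as redis or memcached.
const appCache = new Map();

// Return the cached result for this request, computing it on first use.
function cached(path, query, version, compute) {
  const key = cacheKey(path, query, version);
  if (!appCache.has(key)) appCache.set(key, compute());
  return appCache.get(key);
}

// Both parameter orders map to the same key, so the work is done once.
let computations = 0;
const compute = () => { computations += 1; return ['feature-a', 'feature-b']; };
cached('/geo', '?offset=10&limit=20', 7, compute);
cached('/geo', '?limit=20&offset=10', 7, compute);
console.log(computations); // 1
```

Note that this only saves the computation on the server. Browsers and proxies still see two distinct resources and will request each of them separately.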
†) That said, nginx accommodates "bring your own caching key" setups. But that's in a reverse proxy role, where you have knowledge of the application semantics.
Forward caching HTTP proxies such as Squid (largely outmoded because everything has become E2E TLS in the last 10-15 years) may not even be configurable to bend the identity rules.
In HTTP caching semantics, (with a few caveats, most prominently having to do with the Vary response header), the identity of a resource is the URL, including the query parameters.
👍 makes sense to me
Same, this makes sense to me.
I don't think we need to do anything special at this point related to the order of query parameters.
40eca0f to 42856a9
42856a9 to b56d56c
Moving back to draft while considering #1654 (comment)
Towards getodk/central#1439
Tests won't succeed until #1654 is merged.
Information leakage through ETags
Yes, we could hash (or deterministically obfuscate otherwise) the counter value instead of using it verbatim in the ETag.
I chose not to.
It'd be weird to be authorized to see a resource, yet not be authorized to know how many events have taken place affecting it since you last looked. I also don't think that those authorized users can do bad things when they induce or deduce that the etag is counter-derived, because the fact that it's the string representation of a counter is immaterial to how it's processed for revalidation (opaquely, not as a number, and certainly not as a number with counter-semantics).
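To illustrate what "processed opaquely" means for revalidation, here is a sketch of the weak comparison that RFC 9110 prescribes for If-None-Match. It is not the code from this PR; the function names are hypothetical, and splitting the header on commas is a simplification that is fine for counter-like tags but not for arbitrary ETags, which may themselves contain commas.

```javascript
// Strip the weak indicator; what remains is the opaque tag, compared as a string.
const opaqueTag = (etag) => etag.trim().replace(/^W\//, '');

// Weak comparison: two tags match if their opaque tags are identical
// character by character, regardless of any W/ prefix.
function weakMatch(a, b) {
  return opaqueTag(a) === opaqueTag(b);
}

// True if the If-None-Match header value matches the current ETag,
// in which case the server may answer 304 Not Modified.
function isFresh(ifNoneMatch, currentEtag) {
  if (ifNoneMatch.trim() === '*') return true;
  // Simplification: assumes no commas inside the tags themselves.
  return ifNoneMatch.split(',').some((candidate) => weakMatch(candidate, currentEtag));
}

console.log(isFresh('W/"42"', 'W/"42"'));          // true
console.log(isFresh('W/"41"', 'W/"42"'));          // false
console.log(isFresh('W/"40", W/"42"', 'W/"42"'));  // true
console.log(isFresh('W/"042"', 'W/"42"'));         // false
```

The last case shows the point: "042" and "42" are equal as numbers but different as strings, so nothing in the revalidation path depends on the tag being a counter.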
Benchmarks:
hey, concurrency 1, 1000 requests, with the If-None-Match client header set to the etag of the collection. All responses are 304s.
Before, old style: 17 revalidation requests/second
After this PR: 478 revalidation requests/second
So about 28x 🥳
The difference will only grow with a slower DB and/or heavier contention.
Benchmark logs
hey, oldstyle
hey, newstyle
What has been done to verify that this works as intended?
Manual testing.
Why is this the best possible solution? Were any other approaches considered?
For the engineering background, see getodk/central#1439.
How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?
N/A
Does this change require updates to the API documentation? If so, please update docs/api.yaml as part of this PR.
N/A
Before submitting this PR, please make sure you have:
run make test and confirmed all checks still pass OR confirm CircleCI build passes