Skip to content

Conversation

@brontolosone
Copy link
Contributor

@brontolosone brontolosone commented Oct 26, 2025

Towards getodk/central#1439

Tests won't succeed, #1654 needs merging first.

Information leakage through ETags

Yes, we could hash (or deterministically obfuscate otherwise) the counter value instead of using it verbatim in the ETag.

I chose not to.

It'd be weird to be authorized to see a resource, yet not be authorized to know how many events have taken place affecting it since you last looked. I also don't think that those authorized users can do bad things when they induce or deduce that the etag is counter-derived, because the fact that it's the string representation of a counter is immaterial to how it's processed for revalidation (opaquely, not as a number, and certainly not as a number with counter-semantics).

Benchmarks:

  • 10.000 feature geojson body
  • response is 64 KB compressed
  • response is 2223 KB uncompressed (this is what the DB needs to shovel out to Node, if response bodies need to be recomputed in order to do revalidation)
  • benched with hey, concurrency 1, 1000 requests, with the If-None-Match client header set to the etag of the collection. All responses are 302s.

Before, old style: 17 revalidation requests/second
After this PR: 478 revalidation requests/second

So 30x 🥳
The difference will only grow with a slower DB and/or heavier contention.

Benchmark logs

hey, oldstyle
hey -c 1 -n 1000 -m GET -t 0 -T 'application/json' -H 'Authorization: Bearer thetoken' -H 'If-None-Match: W/"22bda5-We1sERCWxUOQw5sNcOfxqZvkO0c"' http://127.0.0.1:8383/v1/projects/1/forms/geotest/submissions.geojson

Summary:
Total: 59.7844 secs
Slowest: 0.1203 secs
Fastest: 0.0560 secs
Average: 0.0598 secs
Requests/sec: 16.7268

Response time histogram:
0.056 [1] |
0.062 [959] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.069 [27] |■
0.075 [0] |
0.082 [2] |
0.088 [3] |
0.095 [6] |
0.101 [0] |
0.107 [0] |
0.114 [0] |
0.120 [2] |

Latency distribution:
10% in 0.0577 secs
25% in 0.0584 secs
50% in 0.0593 secs
75% in 0.0602 secs
90% in 0.0612 secs
95% in 0.0622 secs
99% in 0.0870 secs

Details (average, fastest, slowest):
DNS+dialup: 0.0000 secs, 0.0560 secs, 0.1203 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
req write: 0.0000 secs, 0.0000 secs, 0.0003 secs
resp wait: 0.0597 secs, 0.0559 secs, 0.1197 secs
resp read: 0.0001 secs, 0.0000 secs, 0.0002 secs

Status code distribution:
[304] 1000 responses

hey, newstyle
hey -c 1 -n 1000 -m GET -t 0 -T 'application/json' -H 'Authorization: Bearer thetoken' -H 'If-None-Match: W/"20220"' http://127.0.0.1:8383/v1/projects/1/forms/geotest/submissions.geojson

Summary:
Total: 2.0935 secs
Slowest: 0.0335 secs
Fastest: 0.0012 secs
Average: 0.0021 secs
Requests/sec: 477.6793

Response time histogram:
0.001 [1] |
0.004 [987] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.008 [11] |
0.011 [0] |
0.014 [0] |
0.017 [0] |
0.021 [0] |
0.024 [0] |
0.027 [0] |
0.030 [0] |
0.034 [1] |

Latency distribution:
10% in 0.0013 secs
25% in 0.0013 secs
50% in 0.0018 secs
75% in 0.0026 secs
90% in 0.0031 secs
95% in 0.0034 secs
99% in 0.0045 secs

Details (average, fastest, slowest):
DNS+dialup: 0.0000 secs, 0.0012 secs, 0.0335 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
req write: 0.0000 secs, 0.0000 secs, 0.0003 secs
resp wait: 0.0021 secs, 0.0012 secs, 0.0329 secs
resp read: 0.0000 secs, 0.0000 secs, 0.0001 secs

Status code distribution:
[304] 1000 responses

What has been done to verify that this works as intended?

Manual testing.

Why is this the best possible solution? Were any other approaches considered?

For the engineering background, see getodk/central#1439.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

N/A

Does this change require updates to the API documentation? If so, please update docs/api.yaml as part of this PR.

N/A

Before submitting this PR, please make sure you have:

  • run make test and confirmed all checks still pass OR confirm CircleCI build passes
  • verified that any code from external sources are properly credited in comments or that everything is internally sourced

@brontolosone brontolosone requested a review from alxndrsn October 26, 2025 16:24
@brontolosone brontolosone marked this pull request as ready for review October 26, 2025 16:24

const acteeVersion = await Actees.getEventCount(foundDataset.acteeId);
// Weak etag, as the order in the resultset is undefined.
return withEtag(acteeVersion, createResponse, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it expected that the ETag doesn't change when query parameters change? I think so, just wanted to check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 subscribing to this query

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very much expected, yes. In HTTP caching semantics, (with a few caveats, most prominently having to do with the Vary response header), the identity of a resource is the URL, including the query parameters.

If a browser sets an If-None-Match on a request for resource with identity A to the ETag received in an earlier response for a resource with identity B, then that'd be a serious bug (in the browser) ;-)

The fact that query parameters are part of the resource identity is actually even exploited for certain so called "cache busting" approaches.

The downside is that the order of parameters in the query matters. So two URLs that effectively deliver the same data, as they have the same meaning for the application (eg ?offset=10&limit=20 vs ?limit=20&offset=10) are distinct resources to HTTP caches, and they don't reuse the cache of the one for the other. They fortunately don't as they absolutely shouldn't, because they don't know the application semantics!
I would be well within my rights to write an application that does something completely different for ?a=1&b=2 vs ?b=2&a=1, and HTTP caching should still work.

Copy link
Contributor Author

@brontolosone brontolosone Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(quoting myself)

The downside is that the order of parameters in the query matters. So two URLs that effectively deliver the same data, as they have the same meaning for the application (eg ?offset=10&limit=20 vs ?limit=20&offset=10) are distinct resources to HTTP caches, and they don't reuse the cache of the one for the other. They fortunately don't as they absolutely shouldn't, because they don't know the application semantics!

So, to expand on that a bit, corollary:

If you want share a cache of computation results between requests to ?offset=10&limit=20 and ?limit=20&offset=10, you can't really do that with HTTP caching semantics. Intermediate caching proxies don't want to presume that these are effectively the same to your application, and there's no way to tell them (or indeed the browser cache) otherwise.†
The component best situated to understand the application semantics is... the application! surprise!
So, when there is a desire to share a cache between ?offset=10&limit=20 and ?limit=20&offset=10, one would do (potentially additional) caching at the application. For us that'd mean we would, all "from nodejs", compute the result, come up with a caching key (a component of which in this case would be a normalized form of the query parameters), and then store the result in something like redis or memcached (or even just plain postgresql, or files in the filesystem). That's quite a common setup!

†) Although, nginx accommodates "bring your own caching key" setups. But that's in a reverse proxy role where you have knowledge of the application semantics.
Forward caching HTTP proxies such as Squid (largely outmoded because everything has become E2E TLS in the last 10-15 years) can maybe not even be configured to bend the identity rules.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In HTTP caching semantics, (with a few caveats, most prominently having to do with the Vary response header), the identity of a resource is the URL, including the query parameters.

👍 makes sense to me

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, this makes sense to me.

I don't think we need to do anything special at this point related to the order of query parameters.

@brontolosone brontolosone force-pushed the 1439_fast-eventcounter-etags-for-geoextracts_pr branch from 40eca0f to 42856a9 Compare October 28, 2025 15:38
@brontolosone brontolosone force-pushed the 1439_fast-eventcounter-etags-for-geoextracts_pr branch from 42856a9 to b56d56c Compare November 4, 2025 15:27
@lognaturel lognaturel marked this pull request as draft November 24, 2025 19:46
@lognaturel
Copy link
Member

Moving back to draft while considering #1654 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 🕒 backlog

Development

Successfully merging this pull request may close these issues.

4 participants