DRIVERS-3218 Avoid clearing the connection pool when the server connection rate limiter triggers #1852
base: master
Conversation
I'll get all of the tests passing in mongodb/mongo-python-driver#2598 and then include them in this PR.
- A successful heartbeat does NOT change the state of the pool.
- A failed heartbeat clears the pool.
- A subsequent failed connection will increase the backoff attempt.
- A successful connection will return the Pool to ready state.
Could we add a description of the exponential backoff + jitter for the backoff duration?
Should we define the backoff and jitter policy in one place and link to it? If so, should I add it in this PR and where?
I think it's small and simple enough that it should be defined here alongside where it will be used.
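For reference, a minimal sketch of an exponential backoff + full jitter policy (the base delay, cap, and jitter strategy below are illustrative assumptions, not values defined by the spec):

import random

# Illustrative constants; the spec would define the actual values.
BASE_DELAY_MS = 100
MAX_DELAY_MS = 10_000

def backoff_delay_ms(attempt: int) -> float:
    """Exponential backoff with full jitter: draw a uniform delay in
    [0, min(cap, base * 2**attempt)] so concurrent clients spread out retries."""
    ceiling = min(MAX_DELAY_MS, BASE_DELAY_MS * (2 ** attempt))
    return random.uniform(0, ceiling)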
@@ -1,40 +1,38 @@
version: 1
style: integration
We can't modify spec tests like this anymore, because doing so breaks drivers that track spec tests via submodules and haven't implemented backoff yet; if those drivers then skip these tests, they lose coverage.
Hmm this is a tricky one, because we're changing the behavior of the driver. So I guess we need a new runOnRequirement that is something like supportsPoolBackoff?
Yeah, something like that would work.
Ugh. And we should also do the inverse for the tests we're leaving alone: runOnRequirement of !supportsPoolBackoff. (I just spent some time debugging a failing SDAM test on my branch only to realize it was supposed to fail with my changes).
I added poolBackoff but have not yet updated the existing tests that were changed in this PR.
- poolBackoffEvent: {}
- poolBackoffEvent: {}
- poolBackoffEvent: {}
- poolBackoffEvent: {}
- poolBackoffEvent: {}
- poolBackoffEvent: {}
For this, and the tests below: where are all these pool backoff events coming from? I get zero / one (depending on my implementation, see below) of them in Node.
This is related to my comment in the design. Even if I align my implementation with the design, I only get one backoff event because there are no establishment attempts:
- there are no requests in the wait queue
- minPoolSize is 0, so no background thread population
The issue is that these extra backoff attempts occur because of the new retry logic, which is already implemented in the branch Steve is using to test these changes in Python. Is there a way we can test this without tying the two projects together? Otherwise drivers will need to implement the retry logic first.
How does the retry logic come into play here? I thought we only retried commands with a SystemOverloadError error label. And the insert does have an expectError - so the insert fails on Steve's branch as well.
Similar to how the driver labels connection errors with "RetryableWriteError" for retryable writes, we add the "RetryableError" and "SystemOverloadedError" labels to these errors so that the command can later be retried. We'll need to clarify this rule in this PR.
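As a rough illustration of that labeling flow (the class and function names below are hypothetical, not actual driver internals):

# Illustrative sketch only; names are invented, not mongo-python-driver internals.

RETRYABLE = "RetryableError"
OVERLOADED = "SystemOverloadedError"


class NetworkError(Exception):
    """Stand-in for a network error raised during connection establishment."""

    def __init__(self, message):
        super().__init__(message)
        self.error_labels = set()


def checkout_connection(pool):
    """Establish a connection; label establishment failures so callers can retry."""
    try:
        return pool.establish_connection()  # TCP/TLS handshake + hello
    except NetworkError as exc:
        exc.error_labels.add(RETRYABLE)
        exc.error_labels.add(OVERLOADED)
        raise


def run_command(pool, command, max_attempts=2):
    """Run a command, retrying once if the failure carries the retryable label."""
    for attempt in range(max_attempts):
        try:
            return checkout_connection(pool).run(command)
        except NetworkError as exc:
            if RETRYABLE in exc.error_labels and attempt + 1 < max_attempts:
                continue  # the pool has entered backoff; try the command again
            raise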
Yeah I'll change it to only expect one backoff event, and remove the test for a retry that succeeds.
two follow-ups to the new tests:
- can we clarify in the CMAP spec that drivers are supposed to add the SystemOverloadError label to network errors during connection establishment?
- can we use the errorLabelsContain field in the unified test format to assert on this as well, in the tests you've added? Something like:
- name: insertMany
  object: collection
  arguments:
    documents:
      - _id: 3
      - _id: 4
  expectError:
    isError: true
    errorLabelsContain:
      - 'SystemOverloadError'
Updated: we also need SystemOverloadedError.
connection checkout fails under conditions that indicate server overload. The rules for entering backoff mode are as follows:
- A network error or network timeout during the TCP handshake or the `hello` message for a new connection MUST trigger the backoff state.
- Other pending connections MUST not be canceled.
- In the case of multiple pending connections, the backoff attempt number MUST only be incremented once. This can be done by recording the state prior
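One way to picture the "incremented once" bookkeeping described in the quoted text (a sketch with invented names, not spec pseudocode): each pending connection records the attempt number it observed before establishing, and on failure only bumps the counter if nothing else has bumped it since.

import threading

class Pool:
    """Sketch of backoff bookkeeping; attribute and method names are invented."""

    def __init__(self):
        self._lock = threading.Lock()
        self.backoff_attempt = 0  # 0 means the pool is not in backoff

    def establish(self, connect):
        # Record the attempt number seen before this establishment started.
        with self._lock:
            observed_attempt = self.backoff_attempt
        try:
            conn = connect()  # network error or timeout during TCP handshake / hello
        except OSError:
            with self._lock:
                # If several pending connections fail concurrently, only the
                # first failure for this attempt bumps the counter.
                if self.backoff_attempt == observed_attempt:
                    self.backoff_attempt += 1
            raise
        with self._lock:
            self.backoff_attempt = 0  # a successful connection returns the pool to ready
        return conn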
I was talking with @ShaneHarvey about this (related comments on the drivers ticket). Shane's understanding is that we decided to include all timeout errors during connection establishment, regardless of where they originated. Does that match your understanding, Steve?
And related; the design says:
After a connection establishment failure the pool enters the PoolBackoff state.
We should update the design with whatever the outcome of this thread is.
Yeah that's a good callout, but it can't be from auth, since the auth spec explicitly calls out the timeout behavior. I'm assuming all drivers can distinguish between hello and auth since they are separate commands. I'll update it to say that if the driver can distinguish between TCP connect/DNS and the TLS handshake, then it MUST do so.
Which timeout behavior in the auth spec are you referring to? I searched for timeout and only saw stuff about what timeout values to use, but not how to handle network timeouts. Maybe I'm looking in the wrong place though.
I'm assuming all drivers can distinguish between hello and auth since they are separate commands.
If we decide to omit network errors during authentication, I think that's a fine assumption.
Ah, the specs must have gotten out of sync. I'm referring to:
"The Authentication spec requires that when authentication fails on a server, the driver MUST clear the server's connection pool"