Add STH caching with configurable timeouts and intelligent retry #2583
base: main
Conversation
Signed-off-by: Bob Callaway <[email protected]>
Codecov Report

❌ Patch coverage is

@@           Coverage Diff            @@
##             main    #2583       +/-   ##
===========================================
- Coverage   66.46%   25.40%    -41.06%
===========================================
  Files          92      190        +98
  Lines        9258    24700     +15442
===========================================
+ Hits         6153     6275       +122
- Misses       2359    17634     +15275
- Partials      746      791        +45
// Success - reset backoff for next potential failure
bo.Reset()

if nr == nil {
How can nr be nil here? There should only be one thread running updater(), correct?
Based on the current implementation of WaitForRootUpdate, this can never happen. But in case that were to change: if we don't have a new root, then there's nothing to update.
	Factor: 2.0, // Double each time
	Jitter: true, // Add randomization
}
for {
What's the period that this runs? Is this constant, or did you want a delay between each iteration?
WaitForRootUpdate has its own backoff mechanism, which exponentially backs off if it doesn't see a root update.
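To make the settings in the diff concrete, here is a minimal sketch of how a factor-2.0 exponential backoff plays out across failed attempts. The initial interval is a hypothetical 1s (the PR's actual value isn't shown), and jitter, which would randomize each delay, is omitted so the output is deterministic.

```go
package main

import "fmt"

func main() {
	delay := 1.0 // seconds; hypothetical initial interval
	for attempt := 1; attempt <= 4; attempt++ {
		fmt.Printf("attempt %d: wait %.0fs\n", attempt, delay)
		delay *= 2.0 // Factor: 2.0 doubles the delay after each failure
	}
}
```

With Jitter enabled, each of these delays would instead be drawn from a randomized window around the computed value, which prevents many clients from retrying in lockstep.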
slr := &trillian.SignedLogRoot{LogRoot: lrBytes}

// publish new snapshot and notify waiters
t.mu.Lock()
Why the lock here? The snapshot update is atomic.
Yes, the snapshot update is atomic, but the locking here forces serialized execution between the check in waitForRootAtLeast and the update in updater().
If we didn't hold the lock here, the goroutine scheduler could fire t.cond.Broadcast() after the size comparison but strictly before waitForRootAtLeast had called t.cond.Wait(), which would mean the reader would never see the broadcast notification that the root had changed.
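The missed-wakeup hazard described above is the standard reason sync.Cond broadcasts must happen under the same mutex the waiter holds while checking its condition. A minimal, self-contained sketch of the handshake (names are illustrative, not the PR's actual types):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type cachedRoot struct{ TreeSize uint64 }

type client struct {
	mu       sync.Mutex
	cond     *sync.Cond
	lastRoot atomic.Pointer[cachedRoot]
}

func newClient() *client {
	c := &client{}
	c.cond = sync.NewCond(&c.mu)
	c.lastRoot.Store(&cachedRoot{})
	return c
}

// publish stores the new root atomically, but takes the lock around
// Broadcast so a waiter cannot slip between its size comparison and
// its call to Wait without observing the update.
func (c *client) publish(r *cachedRoot) {
	c.lastRoot.Store(r)
	c.mu.Lock()
	c.cond.Broadcast()
	c.mu.Unlock()
}

// waitForRootAtLeast blocks until the cached tree size reaches size.
func (c *client) waitForRootAtLeast(size uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for c.lastRoot.Load().TreeSize < size {
		c.cond.Wait() // atomically releases mu while sleeping
	}
}

func main() {
	c := newClient()
	done := make(chan struct{})
	go func() {
		c.waitForRootAtLeast(5)
		close(done)
	}()
	c.publish(&cachedRoot{TreeSize: 5})
	<-done
	fmt.Println("woken at size", c.lastRoot.Load().TreeSize)
}
```

Because publish cannot acquire the mutex while the waiter is between its check and its Wait, the broadcast is guaranteed to either arrive after the waiter is asleep or be made unnecessary by the waiter re-reading the already-updated snapshot.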
err := bo.Retry(t.bgCtx, func() error {
	select {
	case <-t.stopCh:
		return fmt.Errorf("client stopped")
Should this be returning nil rather than an error, since this is expected on server shutdown?
the return value here stops the Retry loop from continuing, which is what we'd want in a shutdown scenario.
// calls do not hold the mutex. Only one goroutine performs the initial RPCs,
// others wait on the condition variable until initialization completes. This
// avoids head-of-line blocking while keeping state updates atomic.
func (t *TrillianClient) ensureStarted(ctx context.Context) error {
Could you have this called when initializing the client, rather than from each client function?
As another option for handling concurrency, have you looked into sync.OnceFunc to ensure this runs only once?
I could do it there, but if there is a transient outage, putting it inside of a sync.OnceFunc means we would really have to panic the entire process, since we could never re-initialize it.
// waitForRootAtLeast blocks until t.lastRoot.TreeSize >= size, or context/client closes.
func (t *TrillianClient) waitForRootAtLeast(ctx context.Context, size uint64) error {
	start := time.Now()
	t.mu.Lock()
Can this use a reader lock?
Nope, this requires a full lock given the semantics of sync.Cond.
	metricWaitForRootAtLeast.WithLabelValues(fmt.Sprintf("%d", t.logID), "true").Observe(elapsed)
	return nil
}
t.cond.Wait()
What happens when an error occurs in updater()? Let's say Trillian is down. From what I can tell, a request blocks until Broadcast is called, but Broadcast is only called on successful update of the tree head.
Should an error in updater() trigger a Broadcast so waiters unblock, and then observe an error in the context and return? Or can we have a timeout on waiting?
This PR implements an active Signed Tree Head (STH) caching strategy for the Trillian client that eliminates polling overhead while providing real-time root updates to concurrent callers. The implementation uses a background goroutine to monitor tree changes via WaitForRootUpdate, caches roots in atomic storage for lock-free reads, and notifies blocked operations when the tree advances. This reduces typical GetLatest calls from ~2ms to ~0.1ms and significantly improves AddLeaf performance by avoiding redundant inclusion proof attempts. On my workstation, this improves the runtime of e2e tests from ~60 seconds down to ~10.
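The "lock-free reads" the description refers to can be sketched with an atomic pointer: the background updater stores each new STH, and the hot read path is a single atomic load with no mutex and no RPC. Names below are illustrative, not the PR's actual types.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type logRoot struct{ TreeSize uint64 }

type sthCache struct {
	last atomic.Pointer[logRoot]
}

// store is what the background updater calls each time
// WaitForRootUpdate observes a new signed tree head.
func (c *sthCache) store(r *logRoot) { c.last.Store(r) }

// getLatest is the hot path: one atomic pointer load, so concurrent
// readers never contend with the updater or each other.
func (c *sthCache) getLatest() *logRoot { return c.last.Load() }

func main() {
	c := &sthCache{}
	c.store(&logRoot{TreeSize: 42})
	fmt.Println(c.getLatest().TreeSize)
}
```

Storing a fresh pointer on every update (rather than mutating the cached struct in place) is what keeps the read side safe without synchronization: readers always see either the old snapshot or the new one, never a partial write.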