
Conversation

@bobcallaway (Member)

This PR implements an active Signed Tree Head (STH) caching strategy for the Trillian client that eliminates polling overhead while providing real-time root updates to concurrent callers. The implementation uses a background goroutine to monitor tree changes via WaitForRootUpdate, caches roots in atomic storage for lock-free reads, and notifies blocked operations when the tree advances. This reduces typical GetLatest calls from ~2ms to ~0.1ms and significantly improves AddLeaf performance by avoiding redundant inclusion proof attempts.

On my workstation, this improves the runtime of e2e tests from ~60 seconds down to ~10.
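
For context on the shape of the change, here is a minimal, hypothetical sketch of the read path (illustrative names only, not the PR's actual types): the background updater stores each newly verified root in atomic storage, so a GetLatest-style call returns the cached root without taking a lock or issuing an RPC.

```go
package main

import (
	"fmt"
	"sync/atomic"

	"github.com/google/trillian/types"
)

// sthCache is a stand-in for the cached-root portion of the client.
type sthCache struct {
	latest atomic.Pointer[types.LogRootV1]
}

// store is what the background goroutine would do after
// WaitForRootUpdate observes a newer root.
func (c *sthCache) store(root *types.LogRootV1) { c.latest.Store(root) }

// load is the lock-free read backing a GetLatest-style call.
func (c *sthCache) load() *types.LogRootV1 { return c.latest.Load() }

func main() {
	var c sthCache
	c.store(&types.LogRootV1{TreeSize: 42})
	fmt.Println("cached tree size:", c.load().TreeSize)
}
```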

bobcallaway requested a review from a team as a code owner August 14, 2025 00:49
bobcallaway and others added 2 commits August 13, 2025 20:50
codecov bot commented Aug 14, 2025

Codecov Report

❌ Patch coverage is 76.71233% with 68 lines in your changes missing coverage. Please review.
✅ Project coverage is 25.40%. Comparing base (488eb97) to head (80c2f3b).
⚠️ Report is 490 commits behind head on main.

Files with missing lines Patch % Lines
pkg/trillianclient/trillian_client.go 75.80% 53 Missing and 15 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2583       +/-   ##
===========================================
- Coverage   66.46%   25.40%   -41.06%     
===========================================
  Files          92      190       +98     
  Lines        9258    24700    +15442     
===========================================
+ Hits         6153     6275      +122     
- Misses       2359    17634    +15275     
- Partials      746      791       +45     
Flag Coverage Δ
e2etests 46.93% <64.04%> (-0.63%) ⬇️
unittests 16.76% <59.58%> (-30.92%) ⬇️

Flags with carried forward coverage won't be shown.


Signed-off-by: Bob Callaway <[email protected]>
// Success - reset backoff for next potential failure
bo.Reset()

if nr == nil {

Contributor

How can nr be nil here? There should only be one thread running updater(), correct?

Member Author

Based on the current implementation of WaitForRootUpdate, this can never happen. But if that were ever to change, and we don't have a new root, then there's nothing to update.

Factor: 2.0, // Double each time
Jitter: true, // Add randomization
}
for {

Contributor

What's the period at which this runs? Is it constant, or did you want a delay between each iteration?

Member Author

WaitForRootUpdate has its own backoff mechanism, which backs off exponentially if it doesn't see a root update.
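
For reference, here is a rough sketch of the loop shape under discussion, assuming the WaitForRootUpdate signature from github.com/google/trillian/client and a hypothetical publish callback in place of the PR's cache update (the outer failure backoff from the diff is omitted). There is no fixed sleep in the body because WaitForRootUpdate itself blocks, backing off internally until the root advances or ctx ends.

```go
package sketch

import (
	"context"

	"github.com/google/trillian/client"
	"github.com/google/trillian/types"
)

// runUpdater is a simplified model of the background updater loop.
func runUpdater(ctx context.Context, lc *client.LogClient, publish func(*types.LogRootV1)) error {
	for {
		// Blocks until the tree head advances, backing off internally.
		nr, err := lc.WaitForRootUpdate(ctx)
		if err != nil {
			return err // ctx cancelled, deadline exceeded, or unrecoverable error
		}
		if nr == nil {
			continue // defensive; see the nr == nil discussion above
		}
		publish(nr) // cache the new root and wake any waiters
	}
}
```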

slr := &trillian.SignedLogRoot{LogRoot: lrBytes}

// publish new snapshot and notify waiters
t.mu.Lock()

Contributor

Why the lock here? The snapshot update is atomic.

Member Author

Yes, the snapshot update is atomic, but the locking here forces serialized execution between the check in waitForRootAtLeast and the update in updater().

If we didn't hold the lock here, the goroutine scheduler could fire t.cond.Broadcast() after the size comparison but strictly before waitForRootAtLeast had called t.cond.Wait(), which would mean the reader would never see the notification that the root had changed.
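
A minimal, self-contained model of that ordering constraint (hypothetical names, not the PR's code): the snapshot stays atomic, but Broadcast is issued under the same mutex the waiter holds around its size check, so the wakeup cannot land in the gap between the check and cond.Wait().

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// rootWatcher models the publish/wait interaction described above.
type rootWatcher struct {
	size atomic.Uint64
	mu   sync.Mutex
	cond *sync.Cond
}

func newRootWatcher() *rootWatcher {
	w := &rootWatcher{}
	w.cond = sync.NewCond(&w.mu)
	return w
}

// publish mirrors the updater: store atomically, then notify under the lock.
func (w *rootWatcher) publish(size uint64) {
	w.size.Store(size)
	w.mu.Lock()
	w.cond.Broadcast()
	w.mu.Unlock()
}

// waitAtLeast mirrors waitForRootAtLeast: check and Wait under the same lock.
func (w *rootWatcher) waitAtLeast(size uint64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	for w.size.Load() < size {
		w.cond.Wait() // releases w.mu while blocked, reacquires on wakeup
	}
}

func main() {
	w := newRootWatcher()
	go func() {
		time.Sleep(10 * time.Millisecond)
		w.publish(5)
	}()
	w.waitAtLeast(5)
	fmt.Println("tree reached size", w.size.Load())
}
```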

err := bo.Retry(t.bgCtx, func() error {
select {
case <-t.stopCh:
return fmt.Errorf("client stopped")

Contributor

Should this be returning nil rather than an error, since this is expected on server shutdown?

Member Author

The return value here stops the Retry loop from continuing, which is what we'd want in a shutdown scenario.
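
A tiny, self-contained illustration of the shutdown signal itself (hypothetical names, and deliberately not a model of the Trillian backoff package): once stopCh is closed, the select arm fires on every subsequent call, so the callback returns an error and the surrounding retry loop has a reason to stop rather than keep going.

```go
package main

import (
	"errors"
	"fmt"
)

var errStopped = errors.New("client stopped")

// step stands in for one iteration of the retried callback.
func step(stopCh <-chan struct{}) error {
	select {
	case <-stopCh:
		return errStopped // signal the caller to stop retrying
	default:
		return nil // normal work would happen here
	}
}

func main() {
	stopCh := make(chan struct{})
	fmt.Println(step(stopCh)) // <nil>
	close(stopCh)
	fmt.Println(step(stopCh)) // client stopped
}
```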

// calls do not hold the mutex. Only one goroutine performs the initial RPCs,
// others wait on the condition variable until initialization completes. This
// avoids head-of-line blocking while keeping state updates atomic.
func (t *TrillianClient) ensureStarted(ctx context.Context) error {

Contributor

Could you have this called when initializing the client, rather than from each client function?

Another option for handling concurrency: have you looked into sync.OnceFunc to ensure this runs only once?

Member Author

I could do it there, but if there were a transient outage, putting it inside a sync.OnceFunc would mean we'd really have to panic the entire process, since we could never re-initialize it.
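
A minimal sketch of that tradeoff with hypothetical helpers rather than the PR's code (sync.OnceValue is used here since the initialization returns an error; requires Go 1.21+): a Once-style wrapper caches the first attempt's result forever, so a transient failure is never retried, while a guarded ensureStarted-style helper returns the error and lets a later caller try again.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// attempt fails once and then succeeds, standing in for the initial
// Trillian RPCs during a transient outage.
var calls int

func attempt() error {
	calls++
	if calls == 1 {
		return errors.New("transient outage")
	}
	return nil
}

// onceInit caches the first (failed) result forever: no way to recover.
var onceInit = sync.OnceValue(attempt)

// retryableInit is a simplified stand-in for the ensureStarted approach:
// a failed attempt leaves started false, so a later caller can try again.
var (
	mu      sync.Mutex
	started bool
)

func retryableInit() error {
	mu.Lock()
	defer mu.Unlock()
	if started {
		return nil
	}
	if err := attempt(); err != nil {
		return err
	}
	started = true
	return nil
}

func main() {
	fmt.Println(onceInit()) // transient outage
	fmt.Println(onceInit()) // transient outage: the failure is cached forever

	calls = 0
	fmt.Println(retryableInit()) // transient outage
	fmt.Println(retryableInit()) // <nil>: the retry succeeded
}
```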

// waitForRootAtLeast blocks until t.lastRoot.TreeSize >= size, or context/client closes.
func (t *TrillianClient) waitForRootAtLeast(ctx context.Context, size uint64) error {
start := time.Now()
t.mu.Lock()

Contributor

Can this use a reader lock?

Member Author

Nope, this requires a full lock given the semantics of sync.Cond.

metricWaitForRootAtLeast.WithLabelValues(fmt.Sprintf("%d", t.logID), "true").Observe(elapsed)
return nil
}
t.cond.Wait()

Contributor

What happens when an error occurs in updater()? Let's say Trillian is down. From what I can tell, a request blocks until Broadcast is called, but Broadcast is only called on successful update of the tree head.

Should an error in updater() trigger a Broadcast so waiters unblock, and then observe an error in the context and return? Or can we have a timeout on waiting?
