
Conversation

@bobcallaway (Member)

This PR implements an active Signed Tree Head (STH) caching strategy for the Trillian client that eliminates polling overhead while providing real-time root updates to concurrent callers. The implementation uses a background goroutine to monitor tree changes via WaitForRootUpdate, caches roots in atomic storage for lock-free reads, and notifies blocked operations when the tree advances. This reduces typical GetLatest calls from ~2ms to ~0.1ms and significantly improves AddLeaf performance by avoiding redundant inclusion proof attempts.

On my workstation, this improves the runtime of e2e tests from ~60 seconds down to ~10.
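
For context on the shape of the change, here is a minimal, hypothetical sketch of the read path (illustrative names only, not the PR's actual types): the background updater stores each newly verified root in atomic storage, so a GetLatest-style call returns the cached root without taking a lock or issuing an RPC.

```go
package main

import (
	"fmt"
	"sync/atomic"

	"github.com/google/trillian/types"
)

// sthCache is a stand-in for the cached-root portion of the client.
type sthCache struct {
	latest atomic.Pointer[types.LogRootV1]
}

// store is what the background goroutine would do after
// WaitForRootUpdate observes a newer root.
func (c *sthCache) store(root *types.LogRootV1) { c.latest.Store(root) }

// load is the lock-free read backing a GetLatest-style call.
func (c *sthCache) load() *types.LogRootV1 { return c.latest.Load() }

func main() {
	var c sthCache
	c.store(&types.LogRootV1{TreeSize: 42})
	fmt.Println("cached tree size:", c.load().TreeSize)
}
```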

bobcallaway requested a review from a team as a code owner August 14, 2025 00:49
bobcallaway and others added 2 commits August 13, 2025 20:50
codecov bot commented Aug 14, 2025

Codecov Report

❌ Patch coverage is 76.71233% with 68 lines in your changes missing coverage. Please review.
✅ Project coverage is 25.40%. Comparing base (488eb97) to head (80c2f3b).
⚠️ Report is 490 commits behind head on main.

Files with missing lines Patch % Lines
pkg/trillianclient/trillian_client.go 75.80% 53 Missing and 15 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2583       +/-   ##
===========================================
- Coverage   66.46%   25.40%   -41.06%     
===========================================
  Files          92      190       +98     
  Lines        9258    24700    +15442     
===========================================
+ Hits         6153     6275      +122     
- Misses       2359    17634    +15275     
- Partials      746      791       +45     
Flag Coverage Δ
e2etests 46.93% <64.04%> (-0.63%) ⬇️
unittests 16.76% <59.58%> (-30.92%) ⬇️

Flags with carried forward coverage won't be shown.


Signed-off-by: Bob Callaway <[email protected]>
// Success - reset backoff for next potential failure
bo.Reset()

if nr == nil {

Contributor

How can nr be nil here? There should only be one thread running updater(), correct?

Member Author

Based on the current implementation of WaitForRootUpdate, this can never happen. But if that were ever to change, and we don't have a new root, then there's nothing to update.

Factor: 2.0, // Double each time
Jitter: true, // Add randomization
}
for {

Contributor

What's the period at which this runs? Is it constant, or did you want a delay between each iteration?

Member Author

WaitForRootUpdate has its own backoff mechanism, which backs off exponentially if it doesn't see a root update.
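
For reference, here is a rough sketch of the loop shape under discussion, assuming the WaitForRootUpdate signature from github.com/google/trillian/client and a hypothetical publish callback in place of the PR's cache update (the outer failure backoff from the diff is omitted). There is no fixed sleep in the body because WaitForRootUpdate itself blocks, backing off internally until the root advances or ctx ends.

```go
package sketch

import (
	"context"

	"github.com/google/trillian/client"
	"github.com/google/trillian/types"
)

// runUpdater is a simplified model of the background updater loop.
func runUpdater(ctx context.Context, lc *client.LogClient, publish func(*types.LogRootV1)) error {
	for {
		// Blocks until the tree head advances, backing off internally.
		nr, err := lc.WaitForRootUpdate(ctx)
		if err != nil {
			return err // ctx cancelled, deadline exceeded, or unrecoverable error
		}
		if nr == nil {
			continue // defensive; see the nr == nil discussion above
		}
		publish(nr) // cache the new root and wake any waiters
	}
}
```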

slr := &trillian.SignedLogRoot{LogRoot: lrBytes}

// publish new snapshot and notify waiters
t.mu.Lock()

Contributor

Why the lock here? The snapshot update is atomic.

Member Author

Yes, the snapshot update is atomic, but the locking here forces serialized execution between the check in waitForRootAtLeast and the update in updater().

If we didn't hold the lock here, the goroutine scheduler could fire t.cond.Broadcast() after the size comparison but strictly before waitForRootAtLeast had called t.cond.Wait(), which would mean the reader would never see the notification that the root had changed.
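
A minimal, self-contained model of that ordering constraint (hypothetical names, not the PR's code): the snapshot stays atomic, but Broadcast is issued under the same mutex the waiter holds around its size check, so the wakeup cannot land in the gap between the check and cond.Wait().

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// rootWatcher models the publish/wait interaction described above.
type rootWatcher struct {
	size atomic.Uint64
	mu   sync.Mutex
	cond *sync.Cond
}

func newRootWatcher() *rootWatcher {
	w := &rootWatcher{}
	w.cond = sync.NewCond(&w.mu)
	return w
}

// publish mirrors the updater: store atomically, then notify under the lock.
func (w *rootWatcher) publish(size uint64) {
	w.size.Store(size)
	w.mu.Lock()
	w.cond.Broadcast()
	w.mu.Unlock()
}

// waitAtLeast mirrors waitForRootAtLeast: check and Wait under the same lock.
func (w *rootWatcher) waitAtLeast(size uint64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	for w.size.Load() < size {
		w.cond.Wait() // releases w.mu while blocked, reacquires on wakeup
	}
}

func main() {
	w := newRootWatcher()
	go func() {
		time.Sleep(10 * time.Millisecond)
		w.publish(5)
	}()
	w.waitAtLeast(5)
	fmt.Println("tree reached size", w.size.Load())
}
```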

err := bo.Retry(t.bgCtx, func() error {
select {
case <-t.stopCh:
return fmt.Errorf("client stopped")

Contributor

Should this be returning nil rather than an error, since this is expected on server shutdown?

Member Author

The return value here stops the Retry loop from continuing, which is what we'd want in a shutdown scenario.
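
A tiny, self-contained illustration of the shutdown signal itself (hypothetical names, and deliberately not a model of the Trillian backoff package): once stopCh is closed, the select arm fires on every subsequent call, so the callback returns an error and the surrounding retry loop has a reason to stop rather than keep going.

```go
package main

import (
	"errors"
	"fmt"
)

var errStopped = errors.New("client stopped")

// step stands in for one iteration of the retried callback.
func step(stopCh <-chan struct{}) error {
	select {
	case <-stopCh:
		return errStopped // signal the caller to stop retrying
	default:
		return nil // normal work would happen here
	}
}

func main() {
	stopCh := make(chan struct{})
	fmt.Println(step(stopCh)) // <nil>
	close(stopCh)
	fmt.Println(step(stopCh)) // client stopped
}
```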

// calls do not hold the mutex. Only one goroutine performs the initial RPCs,
// others wait on the condition variable until initialization completes. This
// avoids head-of-line blocking while keeping state updates atomic.
func (t *TrillianClient) ensureStarted(ctx context.Context) error {

Contributor

Could you have this called when initializing the client, rather than from each client function?

Another option for handling concurrency: have you looked into sync.OnceFunc to ensure this runs only once?

Member Author

I could do it there, but if there were a transient outage, putting it inside a sync.OnceFunc would mean we'd really have to panic the entire process, since we could never re-initialize it.
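
A minimal sketch of that tradeoff with hypothetical helpers rather than the PR's code (sync.OnceValue is used here since the initialization returns an error; requires Go 1.21+): a Once-style wrapper caches the first attempt's result forever, so a transient failure is never retried, while a guarded ensureStarted-style helper returns the error and lets a later caller try again.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// attempt fails once and then succeeds, standing in for the initial
// Trillian RPCs during a transient outage.
var calls int

func attempt() error {
	calls++
	if calls == 1 {
		return errors.New("transient outage")
	}
	return nil
}

// onceInit caches the first (failed) result forever: no way to recover.
var onceInit = sync.OnceValue(attempt)

// retryableInit is a simplified stand-in for the ensureStarted approach:
// a failed attempt leaves started false, so a later caller can try again.
var (
	mu      sync.Mutex
	started bool
)

func retryableInit() error {
	mu.Lock()
	defer mu.Unlock()
	if started {
		return nil
	}
	if err := attempt(); err != nil {
		return err
	}
	started = true
	return nil
}

func main() {
	fmt.Println(onceInit()) // transient outage
	fmt.Println(onceInit()) // transient outage: the failure is cached forever

	calls = 0
	fmt.Println(retryableInit()) // transient outage
	fmt.Println(retryableInit()) // <nil>: the retry succeeded
}
```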

// waitForRootAtLeast blocks until t.lastRoot.TreeSize >= size, or context/client closes.
func (t *TrillianClient) waitForRootAtLeast(ctx context.Context, size uint64) error {
start := time.Now()
t.mu.Lock()

Contributor

Can this use a reader lock?

Member Author

Nope, this requires a full lock given the semantics of sync.Cond.

metricWaitForRootAtLeast.WithLabelValues(fmt.Sprintf("%d", t.logID), "true").Observe(elapsed)
return nil
}
t.cond.Wait()

Contributor

What happens when an error occurs in updater()? Let's say Trillian is down. From what I can tell, a request blocks until Broadcast is called, but Broadcast is only called on successful update of the tree head.

Should an error in updater() trigger a Broadcast so waiters unblock, and then observe an error in the context and return? Or can we have a timeout on waiting?
