Conversation


@fridrik01 fridrik01 commented Oct 16, 2025

Description

The PrometheusSink had a race condition when multiple goroutines concurrently created the same metric for the first time. Each goroutine would create its own separate Prometheus counter/gauge/summary instance, increment it, and then store it using Store(). The last write would win, and all other metric instances, along with their accumulated values, would be lost.

This fix uses LoadOrStore() instead of Store() to ensure only one metric instance exists per key, then updates that single instance regardless of which goroutine won the creation race.

Related Issue

N/A (discovered during testing/usage)

How Has This Been Tested?

  • Added TestPrometheusRaceCondition, which spawns multiple goroutines that concurrently increment the same counter and verifies that the final value matches the expected total
  • The test fails on the original code (demonstrating the race) and passes with the fix

Before the fix:

go test -v ./prometheus -run TestPrometheusRaceCondition$ -count=1
=== RUN   TestPrometheusRaceCondition
    prometheus_race_test.go:63: Race condition detected: got 152 want 200
--- FAIL: TestPrometheusRaceCondition (0.00s)
FAIL
FAIL    github.com/hashicorp/go-metrics/prometheus      0.253s
FAIL

With this fix applied, the test always passes.

This test demonstrates a race condition that occurs when PrometheusSink
is used from multiple goroutines concurrently, resulting in missed
counter updates.

The test spawns multiple goroutines that increment the same counter
simultaneously and verifies that the final value matches the expected
total.

Fixes a race condition in PrometheusSink where concurrent
updates to the same metric from multiple goroutines could
result in missed updates or incorrect values.

The previous copy-on-write pattern was racy because multiple
goroutines could read the old value, create separate copies
with their updates, and overwrite each other when storing
back to the sync.Map.

This change adds a sync.RWMutex to each gauge, summary, and
counter struct to protect concurrent access to the underlying
Prometheus metric and the updatedAt timestamp.

The fix also uses LoadOrStore instead of Load+Store to
properly handle creation races when multiple goroutines
attempt to create the same metric simultaneously.
@fridrik01 fridrik01 requested a review from a team as a code owner October 16, 2025 14:46

hashicorp-cla-app bot commented Oct 16, 2025

CLA assistant check
All committers have signed the CLA.

Also updated the prometheus_race_test.go to be in line with other
tests.
