RingBuffer for stats #1353
base: main
Conversation
A ring buffer should be faster for our use case. While not the most common operation, for a broker with a lot of objects (queues, exchanges etc.) we want to minimize any CPU time spent on just keeping track of the stats.

Improvements to the `to_a` method that's used in the UI/API, i.e. any time we refresh the web page with new stats (proposed to be once per second in #1195), should help a bit, even though the full JSON serialization is probably much more expensive. Getting around 9% improvement in the simulated http_benchmark.cr.

See the benchmark/ directory for all micro benchmarks (should be cleaned up before we merge this). Also kept the old Deque based code until we're ready to say this is good.

```
RingBuffer (steady-state)    355.76M ( 2.81ns) (± 5.70%)  0.0B/op        fastest
Deque      (steady-state)    177.83M ( 5.62ns) (± 2.41%)  0.0B/op  2.00× slower
RingBuffer (fill from empty) 285.64k ( 3.50µs) (± 0.76%) 12.1kB/op       fastest
Deque      (fill from empty) 202.75k ( 4.93µs) (± 1.55%) 12.1kB/op 1.41× slower
RingBuffer (to_a)            872.70k ( 1.15µs) (± 0.84%) 12.1kB/op       fastest
Deque      (to_a)            146.77k ( 6.81µs) (± 0.94%) 12.1kB/op 5.95× slower
```

Why RingBuffer is faster:

- Bitwise modulo operations: RingBuffer uses `& @mask` for wrap-around (single CPU instruction); Deque uses conditional branches (`if index >= @capacity`). Bitwise AND avoids branch misprediction penalties.
- No memory shifting: RingBuffer simply overwrites the oldest value when full; Deque calls `shift` to remove the first element, causing memory movement.
- Simpler push logic: RingBuffer does a direct write to the calculated position; Deque checks size, shifts if needed, then pushes (3 operations vs 1).
- CPU cache optimization: RingBuffer has a predictable circular access pattern; Deque's shift operations invalidate cache lines.
- No dynamic memory management: RingBuffer is a fixed allocation that never grows; Deque has resize checks and potential reallocation overhead.
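The push and `to_a` paths described above can be sketched as follows. This is a hypothetical Ruby approximation of the idea (the real implementation is Crystal); class and variable names are illustrative, not the actual source:

```ruby
# Ruby sketch of a power-of-two ring buffer, as described above.
class RingBuffer
  def initialize(capacity)
    # Round capacity up to the next power of two so wrap-around can use
    # a bitwise AND with a mask instead of a conditional branch.
    @capacity = 1
    @capacity <<= 1 while @capacity < capacity
    @mask = @capacity - 1
    @buffer = Array.new(@capacity, 0)
    @index = 0 # next write position
    @size = 0  # number of valid entries
  end

  # Direct write to the calculated position; the oldest value is simply
  # overwritten once the buffer is full (no shift, no reallocation).
  def push(value)
    @buffer[@index] = value
    @index = (@index + 1) & @mask
    @size += 1 if @size < @capacity
  end

  # Returns oldest -> newest as at most two contiguous slices,
  # avoiding a per-element modulo.
  def to_a
    return @buffer[0, @size] if @size < @capacity
    @buffer[@index, @capacity - @index] + @buffer[0, @index]
  end
end
```

For example, with capacity 4, pushing 1 through 6 overwrites the two oldest entries, so `to_a` returns `[3, 4, 5, 6]`.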
Claude finished @jage's task — Code Review

Reviewed PR #1353 for bugs, security vulnerabilities, performance issues, missing error handling, and Crystal anti-patterns.

Issues found:

1. Potential integer overflow: `@buffer = GC.malloc_atomic(@capacity * sizeof(T)).as(Pointer(T))` has no bounds checking on the allocation size. Recommendation: add an overflow check:

```crystal
byte_size = @capacity.to_u64 * sizeof(T)
raise ArgumentError.new("Buffer size too large") if byte_size > Int32::MAX
@buffer = GC.malloc_atomic(byte_size).as(Pointer(T))
```

2. Missing resource cleanup. Recommendation: consider adding a finalizer:

```crystal
def finalize
  GC.free(@buffer) if @buffer
end
```
carlhoerberg left a comment
Nice! Any reason the old Stat is kept?
src/lavinmq/ring_buffer.cr (Outdated)

```crystal
  STDERR.puts "WARNING: RingBuffer capacity #{capacity} rounded up to #{@capacity} (must be power of 2)"
end
@mask = @capacity - 1
@buffer = Pointer(T).malloc(@capacity)
```
Could maybe use GC.malloc_atomic instead (if T is limited to Value), for better GC performance (the memory isn't cleared and doesn't have to be scanned for pointers), but I guess it's a theoretical improvement.
Ah, yeah that sounds interesting here.
Initial micro benchmarks show it's ~20% faster when we allocate a lot of buffers, so it should help in a high churn environment. For channel churn it might be a measurable improvement, but yeah, for connections/exchanges/queues I guess it's more theoretical.
If there's a huge amount of objects I guess the GC scan speedup is nice as well (haven't benchmarked).
Not sure if there are any downsides? Memory won't be cleared, but that's already implicitly assumed by the design, so it should be fine.
Nah, just nice to have the "original" left during development, but I'll add some cleanup commits and then this PR should be squash merged.
Memory isn't cleared and doesn't have to be scanned for pointers. Micro benchmarks show this being ~20% faster in a high churn environment.
For version two of this we can re-architect to use a "ring buffer group", e.g. we let a group of several ring buffers share the same … This is relevant for "large" brokers, e.g. 100k exchanges would have 400k ring buffers (…).
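A minimal sketch of what such a group could look like, assuming the members share one backing array and one write cursor (the sharing scheme and all names here are illustrative assumptions, not the planned design):

```ruby
# Hypothetical "ring buffer group": N fixed-size stat logs share one
# backing array and one write cursor, so e.g. 4 stats per queue cost one
# object instead of four. Purely a sketch of the idea floated above.
class RingBufferGroup
  def initialize(members, capacity)
    @members = members  # number of logical ring buffers in the group
    @capacity = capacity # entries per member
    @buffer = Array.new(members * capacity, 0)
    @index = 0           # shared write position
    @size = 0            # valid entries per member
  end

  # All member stats are assumed to be sampled on the same tick, so one
  # shared index advances once per round of writes.
  def push_all(values)
    values.each_with_index do |v, m|
      @buffer[m * @capacity + @index] = v
    end
    @index = (@index + 1) % @capacity
    @size += 1 if @size < @capacity
  end

  # Oldest -> newest view of one member's log.
  def to_a(member)
    slice = @buffer[member * @capacity, @capacity]
    return slice[0, @size] if @size < @capacity
    slice[@index, @capacity - @index] + slice[0, @index]
  end
end
```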
```crystal
@{{name.id}}_log = RingBuffer(UInt32).new(Config.instance.stats_log_size)
def {{name.id}}_log
  @{{name.id}}_log.to_a
end
```
Not 100% sure about this; if we really need to do this, it should be handled at compile time if some code doesn't handle RingBuffer.
Ah, because of add_logs! - should rewrite that.
You just need to add #next to RingBuffer to make it an Iterator(T) I think. Probably nice.
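In Crystal that would mean including `Iterator(T)` and defining `#next` (returning `Iterator::Stop` when exhausted). A rough Ruby approximation of such a `next`-driven iterator over the buffer contents, with an illustrative stop sentinel standing in for `Iterator::Stop`:

```ruby
# Sketch of iterating a ring buffer oldest -> newest without materializing
# an array via to_a. Names are illustrative, not from the PR.
class RingBufferIterator
  STOP = Object.new # sentinel, stands in for Crystal's Iterator::Stop

  def initialize(buffer, start, size)
    @buffer = buffer # backing array
    @start = start   # position of the oldest element
    @size = size     # number of valid entries
    @pos = 0
  end

  # Yields each element in insertion order, then the stop sentinel.
  def next
    return STOP if @pos >= @size
    value = @buffer[(@start + @pos) % @buffer.length]
    @pos += 1
    value
  end
end
```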
```diff
 {{name.id}}_details: {
   rate: @{{name.id}}_rate,
-  log: @{{name.id}}_log
+  log: @{{name.id}}_log.to_a
```
See above.
Need to take another round on this, missed how …
And maybe it's worth making it a lock-free MT-safe structure.
Or, if we really want to support a massive amount of entities, I think we should do away with the stat logs completely, only have counters, and let the clients poll the counters and calculate their own rates, à la Prometheus. The UI could poll the counters in the background and store them in IndexedDB. The difficulty is of course how to deal with the history of one queue among 100k. But maybe we can start naively, collect all counters from all entities, and see how it performs.
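The client-side rate calculation in that scheme would be trivial: sample a monotonic counter twice and divide the delta by the elapsed time. A small Ruby sketch (the function name and interface are assumptions, not an API from this PR):

```ruby
# Prometheus-style rate: two samples of a monotonically increasing
# counter, rate = delta(count) / delta(time).
def rate(prev_count, prev_time, curr_count, curr_time)
  dt = curr_time - prev_time
  return 0.0 if dt <= 0 # no elapsed time, no meaningful rate
  (curr_count - prev_count) / dt.to_f
end
```

For example, a counter moving from 100 to 160 over 4 seconds gives a rate of 15.0 per second.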
WHAT is this pull request doing?
Speed up internal stats collection; a ring buffer should be faster for our use case.
While not the most common operation, for a broker with a lot of objects (queues, exchanges etc) we want to minimize any CPU time spent on just keeping track of the stats.
HOW can this pull request be tested?
Specs and looking at the stats manually, but we might want to improve specs.