Description
Following a closer inspection of the logs around the times when relay needed a restart (twice on 19/9/25, while running with ~120+ experiments), @dpreid and I have identified a potential mutex deadlock issue I introduced in the core crossbar code, which should be a relatively simple fix.
Quick summary:
a/ relay is still running fine so far as systemd is concerned, but video and data sending is paused, and no errors or panics are logged
b/ CPU and network traffic drop, yet memory usage increases; in one case it grew by about 1.1 GB over roughly 4 min, equivalent to about half the data our fleet of experiments was sending in that time, before we manually reset (asap!)
c/ the memory increase is mostly in user space
d/ run time since last restart, max memory usage and CPU usage show no consistent pattern when the issue happens (uptime could be 3 months, or 30 mins as it was in one rare case), and file handles appear to be within limits so far as we can tell
Taken together, this suggests there may be an issue with message ingress, with the data backing up in memory.
Closer inspection of the core crossbar code highlights a possible edge case in which the distribution of messages hangs, because the broadcast, register and unregister channels are unbuffered and handled within the same select scope:
When multiple cases are ready, select chooses one by uniform pseudo-random selection, so there is no guarantee which channel will be read from first; the edge case is therefore hypothesised to proceed as follows (a minimal sketch of the hub loop in question appears after this list):
- a message has been written (or is about to be written) to the unregister channel, but not yet handled
- a message has been written to the broadcast channel, and one of the topic's clients is unable to accept a message on its send channel
- upon failing to send the message to that client, the broadcast handler tries to unregister the client, but hangs waiting for the unbuffered unregister channel to accept the message: an earlier message is already pending, and the hub goroutine, being stuck inside the broadcast case, can never return to the select to drain the channel
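Here is roughly the shape of the hub loop being described; the package name and struct layouts are assumptions for illustration, only the channel and field names match the snippet in the proposed fix below.

```go
package hub

import "sync"

// Illustrative types; the real structs are assumptions beyond the fields
// used in the snippets in this issue.
type Message struct {
    sender *Client
    data   []byte
}

type Client struct {
    name  string
    topic string
    send  chan Message
}

type Hub struct {
    mu         sync.RWMutex
    clients    map[string]map[*Client]bool // topic -> set of clients
    broadcast  chan Message
    register   chan *Client
    unregister chan *Client
}

// run services all three unbuffered channels from a single select in a
// single goroutine, which is also the only reader of h.unregister.
func (h *Hub) run() {
    for {
        select {
        case client := <-h.register:
            h.mu.Lock()
            if h.clients[client.topic] == nil {
                h.clients[client.topic] = make(map[*Client]bool)
            }
            h.clients[client.topic][client] = true
            h.mu.Unlock()

        case client := <-h.unregister:
            h.mu.Lock()
            if _, ok := h.clients[client.topic][client]; ok {
                delete(h.clients[client.topic], client)
                close(client.send)
            }
            h.mu.Unlock()

        case message := <-h.broadcast:
            // Distribution loop (repeated with annotations in the proposed
            // fix below). A client whose send buffer is full falls through
            // to `h.unregister <- client`; that send can only be received by
            // this same goroutine once it returns to the select, so the hub
            // hangs here and further broadcasts back up in memory.
            h.mu.RLock()
            for client := range h.clients[message.sender.topic] {
                if client.name != message.sender.name {
                    select {
                    case client.send <- message:
                    default:
                        h.unregister <- client // hangs here
                    }
                }
            }
            h.mu.RUnlock()
        }
    }
}
```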
It's not necessary to deregister a client that blocks an attempted send: there is a deferred func for deregistering in the readPump, so we can simply remove this deregister attempt and replace it with a log message. We don't see a spate of reconnections (that we know of), so the case where a client is not accepting messages but remains registered is likely rare.
Proposed fix:
```go
case message := <-h.broadcast:
    h.mu.RLock()
    topic := message.sender.topic
    for client := range h.clients[topic] {
        if client.name != message.sender.name {
            select {
            case client.send <- message:
            default:
                h.unregister <- client // delete this line
                // log the failed send instead
            }
        }
    }
    h.mu.RUnlock()
```
We can log the failed send attempts and identify if a client is not deregistering promptly.
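As a sketch, the amended default branch might then look something like this (using the standard library log package; the message wording is illustrative):

```go
select {
case client.send <- message:
default:
    // Don't touch h.unregister from inside the hub goroutine; the deferred
    // deregistration in readPump will clean the client up eventually.
    // Just record the failed send so we can spot clients that linger.
    log.Printf("crossbar: dropped message on topic %s, send buffer full for client %s",
        topic, client.name)
}
```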
Further actions
We could also consider handling the three channels in separate goroutines so they don't block each other, and let the mutex handle synchronisation; only three goroutines would be accessing the mutex, so it might even be faster than using channels in the select.
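A rough sketch of that option, reusing the illustrative types from the earlier sketch and the standard log package (not the actual implementation):

```go
// Each channel gets its own goroutine; h.mu serialises access to h.clients,
// so a slow broadcast no longer prevents registrations or deregistrations
// from being serviced.
func (h *Hub) runConcurrent() {
    go func() {
        for client := range h.register {
            h.mu.Lock()
            if h.clients[client.topic] == nil {
                h.clients[client.topic] = make(map[*Client]bool)
            }
            h.clients[client.topic][client] = true
            h.mu.Unlock()
        }
    }()
    go func() {
        for client := range h.unregister {
            h.mu.Lock()
            if _, ok := h.clients[client.topic][client]; ok {
                delete(h.clients[client.topic], client)
                close(client.send)
            }
            h.mu.Unlock()
        }
    }()
    go func() {
        for message := range h.broadcast {
            h.mu.RLock()
            for client := range h.clients[message.sender.topic] {
                if client.name != message.sender.name {
                    select {
                    case client.send <- message:
                    default:
                        log.Printf("crossbar: send buffer full for client %s on topic %s",
                            client.name, message.sender.topic)
                    }
                }
            }
            h.mu.RUnlock()
        }
    }()
}
```

Note that client.send is only closed under the write lock, and broadcasts send to it only under the read lock, so a deregistration cannot close a channel mid-send.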
We should also consider making broadcast, register and unregister buffered, because strict time coordination is not required, so long as a send to an about-to-be-unregistered client is handled gracefully. The buffer lengths should be configurable at startup to allow some tuning of memory usage and performance for different sizes of experimental fleet.
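As a sketch of the startup-configurable buffering (the constructor name and parameter are hypothetical, reusing the illustrative Hub type from the earlier sketch):

```go
// NewHub builds a hub with buffered channels. bufSize would be read from a
// startup flag or config value so it can be tuned per fleet size.
func NewHub(bufSize int) *Hub {
    return &Hub{
        clients:    make(map[string]map[*Client]bool),
        broadcast:  make(chan Message, bufSize),
        register:   make(chan *Client, bufSize),
        unregister: make(chan *Client, bufSize),
    }
}
```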
There's no need to support delayed message sends, because an unavailable send channel implies either that the client is stopping (somehow) or that the send buffer size is insufficient (and that is configurable at runtime).
The original broadcast channel was left unbuffered on the basis that each client's send channel is buffered and the rate-limiting step is the distribution of messages through the broadcast channel; buffering the broadcast channel therefore won't affect throughput, just how long a client trying to ingress data has to wait.
Since we are calculating stats on each message but not currently using the frame rate data (just the last-sent time), it might be worth pruning the stats down and instead logging something that indicates the degree of pressure on the system, such as message throughput rate or messages waiting vs messages broadcast.
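Once broadcast is buffered, one cheap pressure indicator would be to log the channel occupancy periodically; a sketch (the interval, wording and function name are illustrative, using the standard time and log packages):

```go
// logPressure periodically reports how full the broadcast buffer is, which
// is a rough proxy for messages waiting vs messages being distributed.
func (h *Hub) logPressure(interval time.Duration) {
    for range time.Tick(interval) {
        log.Printf("crossbar: broadcast pressure %d/%d messages queued",
            len(h.broadcast), cap(h.broadcast))
    }
}
```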
This situation also suggests we should consider:
- monitoring message throughput using a separate canary messenger, to facilitate an alert/restart when messages are delayed or lost
- adding a health endpoint that does not rely on the same core messaging system (stats messages are broadcast via the core broadcast mechanism, so in theory they should stop when there is an issue with broadcast); a rough sketch of such an endpoint follows
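A possible shape for that endpoint, sketched under the assumption that we add a dedicated unbuffered `probe chan struct{}` field to the hub and a matching `case <-h.probe:` to its select loop (the timeout, wording and function name are placeholders; imports are net/http and time):

```go
// healthHandler answers 200 only if the hub goroutine picks up a probe
// within the deadline; it does not go through the broadcast/stats path.
func healthHandler(h *Hub) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        select {
        case h.probe <- struct{}{}:
            // the hub's select loop received the probe, so it is live
            w.WriteHeader(http.StatusOK)
        case <-time.After(2 * time.Second):
            http.Error(w, "hub select loop not responding", http.StatusServiceUnavailable)
        }
    }
}
```

Because the probe can only be received by the hub's select loop, the handler times out and returns 503 precisely when the distribution loop is wedged, which is the failure mode described above.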