Skip to content

Conversation

@jeremyestein commented Nov 18, 2025

Tweak container restart policies to increase the chances of containers coming back if the host restarts or a container crashes.

When our Docker host last rebooted, only cassandra, core and glowroot came back for emap-dev. In particular, the waveform-reader didn't come back, and, unknown to me, there was some data being directed to it that we wanted to keep.

So this change tries to prevent that happening again.

I can't explain why core came back but rabbitmq didn't, since they both had "on-failure" before this change. And more containers depend on rabbitmq than on core, so I don't think dependency ordering is the cause.

I'm aware that some containers need to be able to exit cleanly without being restarted (hl7-reader and hoover), so nothing more aggressive than "on-failure" can be used in that case. This could be a problem because the docs say:

The on-failure policy only prompts a restart if the container exits with a failure. It doesn't restart the container if the daemon restarts.

We certainly want hl7-reader to come back following a docker daemon (or host) restart, so this may require further work.
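
For context, here is a minimal sketch of the kind of compose-level settings being discussed, assuming the services are defined in a Docker Compose file (the service names come from this PR; the actual emap compose layout is not reproduced here):

```yaml
services:
  waveform-reader:
    # "unless-stopped" survives both container crashes and daemon/host
    # restarts, provided the container wasn't explicitly stopped beforehand.
    restart: unless-stopped

  rabbitmq:
    restart: unless-stopped

  hl7-reader:
    # Needs to be able to exit cleanly without being restarted, so
    # "on-failure" is the most aggressive policy that fits here; note the
    # caveat quoted above: it does not restart after a daemon restart.
    restart: on-failure
```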

@jeremyestein changed the base branch from main to develop on November 18, 2025 at 16:01

@stefpiatek left a comment

Yeah, seems reasonable, though for the hl7-reader we had wanted it to stay dead rather than endlessly restarting if it encounters an unexpected error while processing an HL7 message (as it will keep trying to process that same message again and again). That way the informus dashboard will find that the hl7-reader is down and notify us that it's not processing.

Not a too strongly held opinion, as it's very rare now, so happy to see what happens with this.

@jeremyestein (author) commented

Perhaps we should use "on-failure:5" to avoid that particular problem, even though it doesn't solve the reboot problem either way.
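
For reference, a sketch of what that might look like, assuming docker compose accepts the engine's `on-failure[:max-retries]` form for the `restart` key (worth checking against the compose version in use):

```yaml
services:
  hl7-reader:
    # Give up after 5 consecutive failed restarts, so a poison HL7 message
    # can't keep the reader in an endless crash loop.
    restart: "on-failure:5"
```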

@jeremyestein (author) commented

I think in an ideal world we'd use a different restart policy for hl7-reader/hoover depending on whether they're operating in indefinite mode or not. And for monitoring we'd treat a lack of progress, and/or the presence of an error in hl7-reader, as the alert signal rather than the container being down.
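
One hypothetical sketch of how the mode-dependent policy could be wired up, using compose variable substitution (RESTART_POLICY_HL7_READER is an invented name, not something that exists in emap today):

```yaml
services:
  hl7-reader:
    # Fixed-window/batch runs could export RESTART_POLICY_HL7_READER=no so the
    # container is allowed to exit cleanly; indefinite-mode deployments could
    # set it to "always" or "unless-stopped" to survive daemon/host restarts.
    restart: ${RESTART_POLICY_HL7_READER:-on-failure}
```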
