Enhance Dockerfile with health check #29128
Conversation
Added curl installation and health check to Dockerfile.
From #28893 (comment)
Also, adding curl in the prod container just for this is probably not a good idea (increased size, attack surface...). Maybe a simple check that nodejs is responsive?

```dockerfile
HEALTHCHECK --interval=60s --timeout=3s --start-period=60s --retries=3 \
  CMD node -e "process.exit(0)"
```

Or a more comprehensive bash proc check approach (note: written by AI, have to double-check everything 😅):

```sh
#!/bin/sh
set -e
# Get PID of the main node process (child of tini)
NODE_PID=$(pgrep -P 1 node | head -n1)
if [ -z "$NODE_PID" ]; then
echo "Node process not found"
exit 1
fi
# Check if process is in a bad state (Z=zombie, D=uninterruptible sleep for too long)
STATE=$(cat /proc/$NODE_PID/stat | awk '{print $3}')
case "$STATE" in
Z)
echo "Process is zombie"
exit 1
;;
D)
# Check how long it's been in D state by comparing boot time
# If it's been more than 30s in D state, it's likely stuck
BOOT_TIME=$(awk '{print int($1)}' /proc/uptime)
START_TIME=$(awk '{print int($22/100)}' /proc/$NODE_PID/stat)
RUNTIME=$((BOOT_TIME - START_TIME))
if [ "$RUNTIME" -gt 30 ]; then
echo "Process in uninterruptible sleep (likely stalled)"
exit 1
fi
;;
esac
# Check if process can respond to signals (not blocked)
# Use kill -0 which doesn't send a signal but checks if process is alive and accessible
if ! kill -0 $NODE_PID 2>/dev/null; then
echo "Process not responding to signals"
exit 1
fi
# Additional check: ensure MQTT connection timer is working by checking thread count hasn't dropped to 1
# (indicating event loop might be stalled - healthy Node.js should have multiple threads)
THREAD_COUNT=$(cat /proc/$NODE_PID/status | grep '^Threads:' | awk '{print $2}')
if [ "$THREAD_COUNT" -lt 2 ]; then
echo "Process has too few threads ($THREAD_COUNT) - may be stalled"
exit 1
fi
echo "Process healthy (PID: $NODE_PID, State: $STATE, Threads: $THREAD_COUNT)"
exit 0
```

```dockerfile
HEALTHCHECK --interval=60s --timeout=3s --start-period=60s --retries=3 \
  CMD ["/usr/local/bin/healthcheck.sh"]
```
Honestly, since this project doesn't process any PII or run in a high-security environment, I don't think we need to be overly strict about security here. What matters more is reliability and maintainability, and we can achieve that by relying on well-tested, trusted tools like curl.
In this PR the interval is 60s, so the write is fairly infrequent. On contemporary hardware that's unlikely to be a bottleneck, as it is a primitive operation. If we do see pressure on lower-end devices, we can always tune the interval upward or mount the file on an in-memory volume (tmpfs).
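To make that concrete, here is a minimal sketch of what such a file-based check could look like; the marker path, the freshness window, and the assumption that Z2M itself touches the file periodically are illustrative only, not what this PR implements:

```dockerfile
# Sketch only: /app/data/.health and the 5-minute freshness window are
# assumptions for illustration. The idea is that the app touches the file
# periodically and Docker merely verifies that it exists and is recent.
HEALTHCHECK --interval=60s --timeout=3s --start-period=60s --retries=3 \
  CMD test -n "$(find /app/data/.health -mmin -5 2>/dev/null)" || exit 1
```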
That verifies the runtime can start and exit, not that our application is healthy. A health check should validate observable behavior of the service that users depend on (e.g., the HTTP layer). Otherwise we risk reporting “healthy” while the app is up but not doing useful work.
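Since the PR itself takes the curl route, the HTTP-layer version of that idea would look roughly like the sketch below; the port and path are assumptions that depend on whether and how the Z2M frontend is configured:

```dockerfile
# Sketch of an HTTP-layer check; assumes the Z2M frontend is enabled and
# listening on port 8080, which is configuration-dependent.
HEALTHCHECK --interval=60s --timeout=3s --start-period=60s --retries=3 \
  CMD curl -fsS http://localhost:8080/ > /dev/null || exit 1
```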
That’s more complex and heavier than a single read/write, and it’s brittle—keying off specific process states/threads can both miss real failures and create false positives across kernels/setups. It still doesn’t prove the app is accepting and serving requests.
Good point. We probably need to go a bit further to properly test the actual behavior of our application. Writing a file is an interesting idea, but it doesn't really verify that the app can handle user requests. I'm wondering if it would make more sense to have a small Node.js script that sends an MQTT message and checks for a response; that way, we'd be testing it from the real user's perspective.
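A rough sketch of that round-trip idea, written as a shell script. The broker address, lack of authentication, the bridge request/response topics, and the use of the mosquitto CLI clients are all assumptions to double-check; as noted further down, MQTT.js's bundled CLI via npx could do the same without installing extra packages:

```sh
#!/bin/sh
# Sketch of an MQTT round-trip health check. Topics, broker location and the
# mosquitto clients are assumptions for illustration only.
set -e

# Wait (up to 5 s) for a single message on the response topic, in the background.
mosquitto_sub -h localhost -t 'zigbee2mqtt/bridge/response/health_check' -C 1 -W 5 > /tmp/health_response &
SUB_PID=$!
sleep 1

# Ask Z2M to report its health over MQTT.
mosquitto_pub -h localhost -t 'zigbee2mqtt/bridge/request/health_check' -m '{}'

# The subscriber's exit code on timeout differs between mosquitto versions,
# so ignore it and pass only if a response payload actually arrived.
wait "$SUB_PID" || true
test -s /tmp/health_response
```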
Sidenote: Also, a lot of more advanced checks would require loading settings (e.g. proper host, port...), so they go beyond a "simple script".
The goal of a health check is not necessarily to check whether the entire application works in every case, but whether it handles user requests, because this usually means that it will also handle administrative requests correctly. If the application is dead and does not respond to user queries, we have no choice but to kill it and start it again. For example, in the case of PostgreSQL, this might be pg_isready.
I think we can use most of the code that is already in Z2M and it shouldn't be too complicated. This is just the basic, simplest task that can be accomplished using MQTT. It does not require knowledge of Linux or Node implementation details, as in your earlier proposal. Other projects also implement additional commands to have a health check, e.g. pg_isready, redis ping, airflow jobs check, airflow db check. Here is a collection of a few examples of how to implement health checks in Docker: https://github.com/apache/airflow/tree/main/scripts/ci/docker-compose
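For reference, the common shape of those checks is a small purpose-built client command wired directly into the container's health check, roughly like this (paraphrased from memory, not copied from the linked repos):

```dockerfile
# Typical PostgreSQL pattern: pg_isready opens a real client connection,
# so the check exercises the same path a user request would take.
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD pg_isready -h localhost -p 5432 || exit 1
```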
I think the file-based approach is going to create more problems than it will solve:
I think the first thing would be to properly define what the need is here, because from the linked issue there seem to be several points mentioned that a simple curl, file, or node check would not cover. Just thinking out loud here, but could we somehow hook this into the Z2M Health extension?
I think there are two decisions to be made here: first, how the healthcheck works, and second, what the healthcheck checks.

How the healthcheck works:
This is inside the Docker container, so it has nothing to do with what the system allows. It's something we can control.
I'm not worried about the (empty) file writes here. The MQTT-based approach will cause logging on Z2M and the MQTT broker, so it will also cause extra writes. I think the impact of an empty file write is negligible compared to that.

What the healthcheck checks:
Shouldn't. MQTT.js already has a command-line client built in that can be called with npx, as mentioned above.
As long as we don't get the same issues we had when we initially migrated the external JS stuff.
Is that really necessary? Adapters should handle that already and trigger disconnected as needed. Some adapters even have watchdogs built-in (deconz, ezsp). We'd end up with code duplication, potentially more requests to coordinator...
I don't think that's necessary (see below). Z2M already has built-in recovery if MQTT or the coordinator disconnects; we don't want to bypass that (and the built-in watchdog) and restart the whole container. Since this could happen during MQTT updates/maintenance, it could end up bootlooping the Z2M container for no good reason. The same could happen with the coordinator, e.g. when doing router updates for networked coordinators.

From all this, I think maybe just the file write (seems to be the consensus) is a good start. It keeps things simple and easy to debug. The rest should already be mostly handled by built-in logic; we can always add to it later on if needed. Also, more advanced cases can use automation in combination with Health extension data to trigger restarts as needed.

Note: overall, I think we should expect this to cause confusion no matter what, because some users will expect the health check to mean "all devices are fine and responding" (as is already mentioned in the linked issue...).
Agree! @mik-laj is this something you could implement in this PR?
I think typically users don't even know that this healthcheck is there. I think an initial nice benefit of the simple healthcheck is to indicate to Docker that the container started successfully.
Agreed, but probably many use some kind of instance manager that will automatically make use of it once it's there (with whatever "wrapper name" the manager uses). So, probably some kind of notification will pop up or something along those lines.
About that, we should configure it carefully, especially the start period and the first write. Based on the docs, Docker will still execute the check within the start period but won't count a failure there, and if a check succeeds within the start period, the container is considered "started". So we need to be careful about the check, since the file may or may not exist yet (for slow-starting Z2M instances, which I'm sure are still out there). We also have to consider onboarding in the logic. https://www.zigbee2mqtt.io/guide/getting-started/#onboarding
My understanding is that this health check will only be available for Docker-based installations.
That's a good point indeed (for zigbee2mqtt/hassio-zigbee2mqtt#812 it should be fine as
Yes, I meant managers that wrap Docker containers into a more refined admin interface and the like. Not sure if it's okay for the HA add-on with curl during onboarding. Isn't there a chance that during the switch from onboarding to controller, it could fail and trigger an unnecessary unhealthy state?
I'll be honest, I currently don't have the ability to run Z2M locally to contribute anything to the core, so unfortunately I won't be able to work on it. I will order a Zigbee adapter in the future when I do some other shopping, but probably not within the next month. |

I haven't tested it because I don't have an environment that would allow me to test this image locally.
Part-of: #28893
Based on: zigbee2mqtt/hassio-zigbee2mqtt#812