Conversation

mik-laj (Contributor) commented Oct 16, 2025

I haven't tested it because I don't have an environment that would allow me to test this image locally.

Part-of: #28893
Based on: zigbee2mqtt/hassio-zigbee2mqtt#812

Added curl installation and health check to Dockerfile.
Koenkk (Owner) commented Oct 16, 2025

From #28893 (comment)

For the HA-addon it's fine to use curl (as the frontend is always activated), but for the standard Docker image we cannot, as the frontend is optional. I'm wondering if we can e.g. every 30 seconds write a file to the container here and then let the healthcheck check if it has been updated less than 1 minute ago?
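A minimal sketch of this marker-file idea (the marker path and the `touch` stand-in for the app's periodic write are placeholders, not from the PR): the check passes only if the file was modified within the last minute.

```shell
#!/bin/sh
# Hypothetical marker file that z2m would touch periodically (path is an assumption).
MARKER="${MARKER:-/tmp/z2m-healthy}"

touch "$MARKER"   # demo stand-in for the app's periodic write

# `find -mmin -1` prints the path only if it was modified less than 1 minute ago.
if [ -n "$(find "$MARKER" -mmin -1 2>/dev/null)" ]; then
    echo "healthy"
else
    echo "unhealthy"
fi
```

The same one-liner can be used directly as a `HEALTHCHECK CMD`, with Docker's `--retries` absorbing a single missed write.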

Nerivec (Collaborator) commented Oct 17, 2025

Also, adding curl in the prod container just for this is probably not a good idea (increased size, attack surface...).
Writing a file at a short interval is going to impact perf. There are already some slowdown issues on lower-end hardware; if the write happens to coincide with one of those (e.g. a sudden flood of msgs), it's likely to cause even more trouble, and that's far more likely to occur than with the existing (required) "large" write intervals.

Maybe a simple check that nodejs is responsive?

HEALTHCHECK --interval=60s --timeout=3s --start-period=60s --retries=3 \
  CMD node -e "process.exit(0)"

Or a more comprehensive bash proc check approach (note: written by AI, have to double-check everything 😅):

#!/bin/sh
set -e

# Get PID of the main node process (child of tini)
NODE_PID=$(pgrep -P 1 node | head -n1)

if [ -z "$NODE_PID" ]; then
    echo "Node process not found"
    exit 1
fi

# Check if the process is in a bad state (Z = zombie, D = uninterruptible sleep)
STATE=$(awk '{print $3}' "/proc/$NODE_PID/stat")

case "$STATE" in
    Z)
        echo "Process is zombie"
        exit 1
        ;;
    D)
        # Uninterruptible sleep is usually a transient I/O wait; treat it as a
        # soft failure and let the healthcheck's --retries distinguish a brief
        # blip from a genuinely stuck process.
        echo "Process in uninterruptible sleep (possibly stalled)"
        exit 1
        ;;
esac

# Check the process can be signalled: kill -0 sends no signal, it only tests
# that the process exists and is accessible.
if ! kill -0 "$NODE_PID" 2>/dev/null; then
    echo "Process not responding to signals"
    exit 1
fi

# Additional check: a healthy Node.js process runs several threads (libuv
# pool, GC, etc.); a count of 1 suggests the runtime is in a bad state.
THREAD_COUNT=$(awk '/^Threads:/ {print $2}' "/proc/$NODE_PID/status")

if [ "$THREAD_COUNT" -lt 2 ]; then
    echo "Process has too few threads ($THREAD_COUNT) - may be stalled"
    exit 1
fi

echo "Process healthy (PID: $NODE_PID, State: $STATE, Threads: $THREAD_COUNT)"
exit 0

HEALTHCHECK --interval=60s --timeout=3s --start-period=60s --retries=3 \
  CMD ["/usr/local/bin/healthcheck.sh"]

mik-laj (Contributor, Author) commented Oct 17, 2025

Also, adding curl in the prod container just for this is probably not a good idea (increased size, attack surface...).

Honestly, since this project doesn’t process any PII or run in a high-security environment, I don’t think we need to be overly strict about security here. What matters more is reliability and maintainability — and we can achieve that by relying on well-tested, trusted tools like curl.

Writing a file at a short interval is going to impact perf. There are already some slowdown issues on lower-end hardware,

In this PR the interval is 60s, so the write is fairly infrequent. On contemporary hardware that's unlikely to be a bottleneck, as it is a primitive operation. If we do see pressure on lower-end devices, we can always tune the interval upward or mount the file on an in-memory volume.

Maybe a simple check that nodejs is responsive?

That verifies the runtime can start and exit, not that our application is healthy. A health check should validate observable behavior of the service that users depend on (e.g., the HTTP layer). Otherwise we risk reporting “healthy” while the app is up but not doing useful work.

Or a more comprehensive bash proc check approach (note: written by AI, have to double-check everything 😅):

That’s more complex and heavier than a single read/write, and it’s brittle—keying off specific process states/threads can both miss real failures and create false positives across kernels/setups. It still doesn’t prove the app is accepting and serving requests.

every 30 second write a file to the container here and then let the healthcheck check if it has been updated less than 1 minute ago

Good point. We probably need to go a bit further to properly test the actual behavior of our application. Writing a file is an interesting idea, but it doesn’t really verify that the app can handle user requests. I’m wondering if it would make more sense to have a small Node.js script that sends an MQTT message and checks for a response — that way, we’d be testing it from the real user’s perspective.
I’ve just noticed we already have some of this implemented: there’s a health check available through the bridge/{request|response}/health_check topics, so we already have half of the logic in place. Now we just need the code that publishes to the request topic and checks whether we get a response.
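A hedged sketch of that round trip, using the CLI bundled with the mqtt npm package (MQTT.js, mentioned later in this thread). The topic names follow the existing bridge/{request|response}/health_check convention; the broker host variable, the sleep-based subscription wait, and the `"healthy":true` payload field are assumptions:

```shell
#!/bin/sh
# Sketch of an MQTT round-trip healthcheck (broker host and payload shape are
# assumptions; real values would come from the Z2M settings).
mqtt_healthcheck() {
    host="$1"
    out=$(mktemp)
    # Subscribe first so the response cannot be missed, then publish the request.
    ( timeout 10 npx mqtt sub -h "$host" -t 'zigbee2mqtt/bridge/response/health_check' >"$out" ) &
    sub=$!
    sleep 1   # crude wait for the subscription to be established
    npx mqtt pub -h "$host" -t 'zigbee2mqtt/bridge/request/health_check' -m '{}'
    wait "$sub"
    grep -q '"healthy":true' "$out"   # exit status becomes the check result
}

# Only attempt the round trip when a broker address is provided.
if [ -n "${MQTT_HOST:-}" ]; then
    mqtt_healthcheck "$MQTT_HOST"
else
    echo "MQTT_HOST not set; skipping"
fi
```

A real implementation would subscribe with a proper connection callback instead of the sleep, but the subscribe-then-publish ordering is the important part.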

Nerivec (Collaborator) commented Oct 17, 2025

Sidenote:
There is no way to properly check Z2M is fully operational with a "simple" health check. Even curl on the frontend (assuming it's enabled and using the right config) does not properly check that, since it could respond "fine" there while, e.g., the driver is deadlocked. We've already seen several cases like this in the past.
Whatever way we use will have downsides.

Also, a lot of more advanced checks would require loading settings (e.g. proper host, port...), so it goes beyond a "simple script".

mik-laj (Contributor, Author) commented Oct 17, 2025

The goal of a health check is not necessarily to check whether the entire application works in every case, but whether it handles user requests, because this usually means that it will also handle administrative requests correctly. If the application is dead and does not respond to user queries, we have no choice but to kill it and start it again.

For example, in the case of PostgreSQL, this might be SELECT 1, as it's the simplest operation. If the health check can execute SELECT 1, then the administrator can also execute other commands to check the instance's health and to fix it.

Also, a lot of more advanced checks would require loading settings (e.g. proper host, port...), so it goes beyond a "simple script".

I think we can use most of the code that is already in Z2M and it shouldn't be too complicated. This is just the basic, simplest task that can be accomplished using MQTT. It does not require knowledge of Linux or Node implementation details, as in your earlier proposal.

Other projects also implement additional commands for health checks, e.g. pg_isready, redis PING, airflow jobs check, airflow db check. Here are a few examples of how to implement health checks in Docker: https://github.com/apache/airflow/tree/main/scripts/ci/docker-compose

Koenkk (Owner) commented Oct 18, 2025

The healthcheck also serves as a way to indicate the service (z2m in this case) has been started. But given the frontend is optional, curl cannot be used here (for the HA addon Docker image we can, as the frontend is always enabled there).

This is what Copilot suggested to me:

[Screenshot: Copilot's suggested healthcheck]

The other alternative is a socket, but I think this is heavier.

So I would propose to go for the file approach and enable it in Z2M by setting a certain env var (e.g. Z2M_HEALTHCHECK). Once implemented, the same can be used in the HA addon.
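One possible Dockerfile wiring for this proposal (a sketch only: the env var name follows the proposal above, while the marker path and timings are placeholders):

```dockerfile
# Opt z2m into touching the marker file periodically (env var from the proposal).
ENV Z2M_HEALTHCHECK=1

# Fail if the marker has not been refreshed within the last minute.
HEALTHCHECK --interval=60s --timeout=3s --start-period=120s --retries=3 \
  CMD sh -c '[ -n "$(find /tmp/z2m-healthy -mmin -1)" ]'
```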

Nerivec (Collaborator) commented Oct 18, 2025

I think the file based approach is going to create more problems than it will solve:

  • all these systems that don't allow (or don't want) writes (or at least not outside the data path - which requires settings access)
  • Pis: SD card, slowdown issues, etc. - could break existing setups (granted, a proper setup shouldn't nowadays, but we all know that's not reality)

I think the first thing would be to properly define what the need is here, because from the linked issue there seem to be several points mentioned that a simple curl, file, or node check would not cover.
I don't use healthchecks, so I'm not in the best position to answer this, but I can think of several scenarios that would need different approaches, e.g. "can Z2M communicate with MQTT", "can Z2M communicate with the coordinator", "is the frontend running", etc.
With Z2M being a bridge, we don't have a scenario where we can just ask "can the service communicate".


Just thinking out loud here, but could we somehow hook this with the Z2M Health extension?
https://github.com/Koenkk/zigbee2mqtt/blob/master/lib/extension/health.ts#L19
I suppose the interval could be declared as an env var on startup from settings, which would then be used in the HEALTHCHECK directive.
https://github.com/Koenkk/zigbee2mqtt/blob/master/lib/eventBus.ts#L55
By checking the pub/rec stats from the EventBus for changes we could determine if comm is working. That should cover both MQTT comm, and coordinator comm, at least.
Several things would need figuring out: timings (the current interval is 10m; I haven't checked the impact of drastically lowering that), expectations (quiet periods could produce false positives), etc.
We could e.g. pass that data (converted to healthy: true|false) to bridge/{request|response}/health_check, and the HEALTHCHECK directive could use npx mqtt pub/npx mqtt sub maybe (mqttjs via cmd line)?

Koenkk (Owner) commented Oct 19, 2025

I think there are 2 decisions to be made here: first, how the healthcheck works; second, what the healthcheck checks:

How the healthcheck works
We have the following options:

  • HTTP (curl) based: not an option because frontend is optional
  • Socket based: possible, but requires running an additional socket server just for this check. I think the two options below are better.
  • MQTT based: an option, but might be quite heavy compared to the file one: first a message has to be published, received by z2m, and then z2m has to reply; requires extra dependencies + connecting to MQTT on every check (so it goes through multiple containers).
  • File based: most lightweight, no external deps required, and should be super fast compared to the options above (e.g. no new socket connections have to be created)

all these systems that don't allow (or don't want) writes (or at least not outside the data path - which requires settings access)

This is inside the Docker container, so it has nothing to do with what the system allows. It's something we can control.

Pis: SD card, slowdown issues, etc. - could break existing setups (granted, a proper setup shouldn't nowadays, but we all know that's not reality)

I'm not worried about the (empty) file writes here. The MQTT based approach will cause logging on Z2M and the MQTT broker so it will also cause extra writes. I think the impact of an empty file write is negligible compared to state.json and database.db writes.

What the healthcheck checks

  • Minimal: check if the z2m process is running correctly (e.g. if node didn't freeze).
  • MQTT (bonus): check if z2m is connected to the MQTT broker (currently already logged)
  • Adapter (bonus): check if z2m can communicate with the adapter. I'm thinking about adding an extra method to zigbee-herdsman: adapterIsHealthy. If zigbee-herdsman received a message from the adapter in the last 60 seconds it returns true, if not, it's going to ping the adapter.

Nerivec (Collaborator) commented Oct 20, 2025

requires extra dependencies

Shouldn't. MQTT.js already has a command line built in that can be called with npx, as mentioned above.

This is inside the Docker container, so it has nothing to do with what the system allows. It's something we can control

As long as we don't get the same issues we had when we initially migrated the external JS stuff.

I'm thinking about adding an extra method to zigbee-herdsman

Is that really necessary? Adapters should handle that already and trigger disconnected as needed. Some adapters even have watchdogs built in (deconz, ezsp). We'd end up with code duplication, potentially more requests to the coordinator...

check if z2m is connected to the MQTT broker

I don't think that's necessary (see below).

Z2M already has built-in recovery if MQTT or coordinator disconnect, we don't want to bypass that (and the built-in watchdog) and restart the whole container. Since this could happen during MQTT updates/maintenance, it could end up bootlooping the Z2M container for no good reason. Same could happen with coordinator, e.g. when doing router updates for networked coordinators.
We want to avoid restarting the container (hence the coordinator) unless absolutely necessary, as that is more likely to create trouble than solve it (we could massively increase the number of "coordinator didn't restart properly, network is not working properly" issues).
We also have to be careful about edge cases. E.g., a brand-new network with no devices (yet) should not produce a negative healthcheck; the same goes for tiny networks of 1-3 devices, mostly quiet, that could be sending data very far apart.

From all this, I think maybe just the file write (seems to be the consensus) is a good start. Keeps it simple, easy to debug. The rest should already be mostly handled by built-in logic, and we can always add to it later if needed. Also, more advanced cases can use automation in combination with Health extension data to trigger restarts as needed.

Note: overall, I think we should expect this to cause confusion no matter what, because some users will expect the health check to mean "all devices are fine and responding" (as is already mentioned in the linked issue...).

Koenkk (Owner) commented Oct 20, 2025

From all this, I think maybe just the file write (seems to be the consensus) is a good start. Keeps it simple, easy to debug.

Agree! @mik-laj is this something you could implement in this PR?

Note: overall, I think we should expect this to cause confusion no matter what, because some users will expect the health check to mean "all devices are fine and responding"

I think typically users don't even know that this healthcheck is there. I think an initial nice benefit of the simple healthcheck is to indicate to Docker that the container started successfully.

Nerivec (Collaborator) commented Oct 20, 2025

I think typically users don't even know that this healthcheck is there.

Agreed, but probably many use some kind of instance manager that will automatically make use of it once it's there (with whatever "wrapper name" the manager uses). So, probably some kind of notification will pop up or something along that line.

I think an initial nice benefit of the simple healthcheck is to indicate to Docker that the container started successfully.

About that, we should configure it carefully, especially the start period and the first write. Based on the docs, Docker will still execute the check during the start period but won't count a failure there, and a success during the start period marks the container as "started". So we need to be careful with the check, since the file may or may not exist yet (in the case of slow-starting Z2M instances, which I'm sure are still out there).
https://docs.docker.com/reference/dockerfile/#healthcheck

Also have to consider onboarding in the logic. https://www.zigbee2mqtt.io/guide/getting-started/#onboarding

Koenkk (Owner) commented Oct 20, 2025

Agreed, but probably many use some kind of instance manager that will automatically make use of it once it's there (with whatever "wrapper name" the manager uses). So, probably some kind of notification will pop up or something along that line.

My understanding is that this health check will only be available for Docker based installation.

Also have to consider onboarding in the logic. https://www.zigbee2mqtt.io/guide/getting-started/#onboarding

That's a good point indeed (for zigbee2mqtt/hassio-zigbee2mqtt#812 it should be fine as curl will succeed in that case)

Nerivec (Collaborator) commented Oct 20, 2025

My understanding is that this health check will only be available for Docker based installation.

Yes, I meant managers that wrap Docker containers into a more refined admin interface & the likes.

Not sure if it's okay for the HA add-on with curl during onboarding. Isn't there a chance that, during the switch from onboarding to the controller, it could fail and trigger an unnecessary unhealthy state?

mik-laj (Contributor, Author) commented Oct 21, 2025

Agree! @mik-laj is this something you could implement in this PR?

I'll be honest, I currently don't have the ability to run Z2M locally to contribute anything to the core, so unfortunately I won't be able to work on it. I will order a Zigbee adapter in the future when I do some other shopping, but probably not within the next month.
