Description
Describe the bug: a clear and concise description of what the bug is.
The Helm chart mounts the host's /dev to /hostdev in the container. However, the container's internal /dev (not /hostdev) becomes stale when physical disks are hot-swapped.
The issue:
- /hostdev correctly reflects live changes from the host
- But smartctl --scan only scans /dev, not /hostdev (hardcoded)
- The container's internal /dev is a snapshot from pod start time and never updates
When a disk is physically removed from the host:
- Host's /dev/sdn → disappears
- Container's /hostdev/sdn → disappears (live mount)
- Container's /dev/sdn → still exists (stale)
- smartctl scans /dev and finds the stale device
This causes:
- The exporter to continue reporting metrics for non-existent devices
- Persistent Prometheus alerts that never resolve
- Errors like Smartctl open device: /dev/sdn failed: No such device or address
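The mismatch described above can be checked from inside the pod by comparing the two directories. The following is a minimal sketch, not part of the exporter or the chart: the helper names `list_disks` and `stale_disks` are hypothetical, and it assumes disks use `sdX` names (NVMe or other naming schemes would need a different pattern).

```python
import os
import re

# Assumption: whole disks are named sda, sdb, ..., sdaa; partitions such as
# sda1 and non-disk nodes such as null are ignored.
_DISK = re.compile(r"^sd[a-z]+$")


def list_disks(dev_dir):
    """Return the set of sdX entries found directly under dev_dir."""
    return {name for name in os.listdir(dev_dir) if _DISK.match(name)}


def stale_disks(container_dev="/dev", host_dev="/hostdev"):
    """Return disks present in the container's /dev but absent from the
    live host mount, sorted by name."""
    return sorted(list_disks(container_dev) - list_disks(host_dev))
```

Run inside the exporter container, `stale_disks()` would be expected to return `["sdn"]` in the scenario above, since `/dev/sdn` still exists while `/hostdev/sdn` is gone.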
Steps to Reproduce
- Deploy the chart with default configuration
- Wait for the exporter to discover all disks
- Physically remove a disk from the host (e.g., /dev/sdn)
- Verify the mount states:
On host: ls /dev/sdn → No such file or directory
In container /hostdev: ls /hostdev/sdn → No such file or directory (live mount works!)
In container /dev: ls /dev/sdn → Still exists ✗ (stale!)
The exporter continues to find the stale device in /dev:
time=2025-10-08T16:39:50.326Z level=ERROR source=readjson.go:178 msg="Smartctl open device: /dev/sdn failed: No such device or address"
time=2025-10-08T16:39:50.609Z level=WARN source=readjson.go:72 msg="S.M.A.R.T. output reading" err="exit status 64" device="/dev/sdj;auto (sdj)"
time=2025-10-08T16:39:50.609Z level=WARN source=readjson.go:162 msg="The device error log contains records of errors" device="/dev/sdj;auto (sdj)"
time=2025-10-08T16:40:08.928Z level=WARN source=readjson.go:72 msg="S.M.A.R.T. output reading" err="exit status 2" device="/dev/sdn;auto (sdn)"
time=2025-10-08T16:40:08.928Z level=ERROR source=readjson.go:146 msg="Device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode" device="/dev/sdn;auto (sdn)"
- Metrics for the removed device persist indefinitely in Prometheus
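To confirm which devices the exporter is failing on, the log lines above can be reduced to a set of device paths. This is a rough sketch that assumes the `key=value` layout shown in the excerpt; `parse_line` and `devices_with_errors` are hypothetical helpers, not exporter functions.

```python
import re

# key=value pairs where the value is either double-quoted or a bare token.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')
# A /dev path; stops at the first character that is not a letter or digit.
_DEV = re.compile(r"/dev/[A-Za-z0-9]+")


def parse_line(line):
    """Split one log line into a dict of fields, stripping outer quotes."""
    fields = {}
    for key, value in _PAIR.findall(line):
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        fields[key] = value
    return fields


def devices_with_errors(lines):
    """Return the set of /dev paths mentioned on ERROR-level lines."""
    failing = set()
    for line in lines:
        fields = parse_line(line)
        if fields.get("level") != "ERROR":
            continue
        # The path may appear in the device field, the message, or both.
        text = fields.get("device", "") + " " + fields.get("msg", "")
        failing.update(_DEV.findall(text))
    return failing
```

Fed the five log lines from this report, the function picks up only `/dev/sdn`; the `/dev/sdj` entries are WARN-level and are skipped.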
Current Behavior
- The container's internal /dev remains frozen/stale from pod creation time
- /hostdev is correctly live-mounted but smartctl doesn't use it
- The exporter finds and tries to scan stale devices every interval
- Requires manual pod restart to refresh the device list
Expected Behavior
- When a disk is removed from the host, smartctl should not find it
- The exporter should stop attempting to scan non-existent devices
- Metrics should naturally go stale in Prometheus
- No manual intervention required
Root Cause
The chart mounts host /dev to /hostdev, but smartctl only scans /dev:
Current daemonset.yaml:
volumeMounts:
  - mountPath: /hostdev # Host /dev mounted here
    name: dev
volumes:
  - hostPath:
      path: /dev
    name: dev
The problem:
- smartctl exporter runs: smartctl --scan → scans /dev (not /hostdev)
- Container's /dev is a separate mount, not synchronized with host
- Device changes on host don't propagate to container's /dev
Proposed Solution
Mount host /dev directly to container /dev with mountPropagation: HostToContainer:
volumeMounts:
  - mountPath: /dev # Mount to /dev, not /hostdev
    name: dev
    mountPropagation: HostToContainer # Enable live updates
This fixes the issue because:
- smartctl scans /dev ✓
- /dev is now the live host mount ✓
- Device changes propagate immediately ✓
- No stale devices ✓
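For anyone who wants to apply the change to an already rendered manifest rather than the chart template, the edit can be expressed as a small transformation on the container spec. This is only an illustration of the proposed change: `apply_live_dev_mount` is a hypothetical helper, and it assumes the manifest has already been loaded into plain Python dicts and that the volume is named `dev` as in the chart excerpt.

```python
import copy


def apply_live_dev_mount(container, volume_name="dev"):
    """Return a copy of a container spec in which the named volume is
    mounted at /dev with HostToContainer propagation.

    The input dict is left unmodified.
    """
    patched = copy.deepcopy(container)
    for mount in patched.get("volumeMounts", []):
        if mount.get("name") == volume_name:
            mount["mountPath"] = "/dev"  # was /hostdev in the current chart
            mount["mountPropagation"] = "HostToContainer"
    return patched
```

Other volume mounts on the container are passed through untouched, so the helper only changes the one mount this issue is about.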
Why /hostdev Doesn't Work
The device-include config cannot fix this because:
- device-include is a regex that filters device names (e.g., sda), not paths
- You cannot configure smartctl to scan /hostdev instead of /dev
- There's no way to tell the exporter to use /hostdev/sda instead of /dev/sda
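The limitation can be illustrated with a small model of a name-based filter. This is a sketch of the behavior as described in the points above, not the exporter's actual code: `select_devices` is a hypothetical function, and the assumption is that the pattern is matched against the base name of each scanned path.

```python
import re


def select_devices(scanned_paths, include_pattern):
    """Keep the scanned paths whose base name matches include_pattern."""
    include = re.compile(include_pattern)
    selected = []
    for path in scanned_paths:
        name = path.rsplit("/", 1)[-1]  # "/dev/sda" -> "sda"
        if include.search(name):
            # The path is passed through as scanned; the filter can only
            # drop entries, it cannot rewrite /dev into /hostdev.
            selected.append(path)
    return selected
```

Under this model, a pattern such as `^sda$` narrows the list to `/dev/sda`, while a pattern containing `/hostdev/` matches nothing at all, because the filter never sees a directory prefix.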
What's your helm version?
Version:"v3.19.0"
What's your kubectl version?
Client Version: v1.34.1
Which chart?
prometheus-smartctl-exporter
What's the chart version?
v0.16.0