[prometheus-smartctl-exporter] Stale device mounts cause persistent alerts after disk replacement #6223

@Eliesmbr


Description
The Helm chart mounts the host's /dev to /hostdev in the container. However, the container's internal /dev (not /hostdev) becomes stale when physical disks are hot-swapped.

The issue:

  • /hostdev correctly reflects live changes from the host
  • But smartctl --scan only scans /dev, not /hostdev (hardcoded)
  • The container's internal /dev is a snapshot from pod start time and never updates

When a disk is physically removed from the host:

  • Host's /dev/sdn → disappears
  • Container's /hostdev/sdn → disappears (live mount)
  • Container's /dev/sdn → still exists (stale)
  • smartctl scans /dev and finds the stale device

This causes:

  • The exporter to continue reporting metrics for non-existent devices
  • Persistent Prometheus alerts that never resolve
  • Errors like Smartctl open device: /dev/sdn failed: No such device or address
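The stale state described above can be checked from inside the pod by comparing the two directories. The following is a minimal sketch, not part of the chart: `stale_devices` is a hypothetical helper name, and the `sd*` glob assumes SCSI/SATA-style device names (NVMe devices would need a different pattern).

```shell
#!/bin/sh
# Print device names that exist in CONTAINER_DEV but not in HOST_DEV.
# Usage: stale_devices CONTAINER_DEV HOST_DEV
# Assumption: only sd* names are of interest (hypothetical choice for this sketch).
stale_devices() {
    container_dev=$1
    host_dev=$2
    for node in "$container_dev"/sd*; do
        # If the glob matched nothing, the literal pattern comes back; skip it.
        [ -e "$node" ] || continue
        name=${node##*/}
        # A node with no counterpart in the live host mount is stale.
        [ -e "$host_dev/$name" ] || printf '%s\n' "$name"
    done
}
```

Inside the exporter container this would be called as `stale_devices /dev /hostdev`; after removing /dev/sdn on the host, the expectation is that it prints `sdn`.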

Steps to Reproduce

  1. Deploy the chart with default configuration
  2. Wait for the exporter to discover all disks
  3. Physically remove a disk from the host (e.g., /dev/sdn)
  4. Verify the mount states:
     • On host: ls /dev/sdn → No such file or directory
     • In container /hostdev: ls /hostdev/sdn → No such file or directory (live mount works!)
     • In container /dev: ls /dev/sdn → Still exists ✗ (stale!)
  5. The exporter continues to find the stale device in /dev:
time=2025-10-08T16:39:50.326Z level=ERROR source=readjson.go:178 msg="Smartctl open device: /dev/sdn failed: No such device or address"
time=2025-10-08T16:39:50.609Z level=WARN source=readjson.go:72 msg="S.M.A.R.T. output reading" err="exit status 64" device="/dev/sdj;auto (sdj)"
time=2025-10-08T16:39:50.609Z level=WARN source=readjson.go:162 msg="The device error log contains records of errors" device="/dev/sdj;auto (sdj)"
time=2025-10-08T16:40:08.928Z level=WARN source=readjson.go:72 msg="S.M.A.R.T. output reading" err="exit status 2" device="/dev/sdn;auto (sdn)"
time=2025-10-08T16:40:08.928Z level=ERROR source=readjson.go:146 msg="Device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode" device="/dev/sdn;auto (sdn)"
  6. Metrics for the removed device persist indefinitely in Prometheus

Current Behavior

  • The container's internal /dev remains frozen/stale from pod creation time
  • /hostdev is correctly live-mounted but smartctl doesn't use it
  • The exporter finds and tries to scan stale devices every interval
  • Requires manual pod restart to refresh the device list

Expected Behavior

  • When a disk is removed from the host, smartctl should not find it
  • The exporter should stop attempting to scan non-existent devices
  • Metrics should naturally go stale in Prometheus
  • No manual intervention required
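One way to check the first expectation is to take the scanner's own output and flag any device it reports that no longer exists on the host. This is a rough sketch under assumptions: `scanned_but_missing` is a hypothetical helper, and it relies on `smartctl --scan` printing one device per line with the /dev path as the first field.

```shell
#!/bin/sh
# Read `smartctl --scan` style lines on stdin and print every device path
# whose name has no node under HOST_DEV.
# Usage: smartctl --scan | scanned_but_missing /hostdev
scanned_but_missing() {
    host_dev=$1
    while read -r device _rest; do
        case $device in
            /dev/*) ;;        # only consider lines that start with a /dev path
            *) continue ;;
        esac
        name=${device#/dev/}
        # Report devices the scan found but the live host mount does not have.
        [ -e "$host_dev/$name" ] || printf '%s\n' "$device"
    done
}
```

With the expected behavior in place this prints nothing; with the stale /dev described in this issue it would print the removed device's path.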

Root Cause
The chart mounts host /dev to /hostdev, but smartctl only scans /dev.

Current daemonset.yaml:

volumeMounts:
- mountPath: /hostdev  # Host /dev mounted here
  name: dev

volumes:
- hostPath:
    path: /dev
  name: dev

The problem:

  1. smartctl exporter runs: smartctl --scan → scans /dev (not /hostdev)
  2. Container's /dev is a separate mount, not synchronized with host
  3. Device changes on host don't propagate to container's /dev
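To confirm that the container's /dev really is a separate mount, the filesystem type behind each mount point can be read from /proc/self/mountinfo. The sketch below assumes the standard mountinfo layout (mount point in field 5, filesystem type right after the "-" separator); `mount_fstype` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the filesystem type mounted at MOUNT_POINT, or nothing if it is not a mount point.
# Usage: mount_fstype MOUNT_POINT [MOUNTINFO_FILE]
mount_fstype() {
    mount_point=$1
    mountinfo=${2:-/proc/self/mountinfo}
    awk -v mp="$mount_point" '
        $5 == mp {
            # Optional fields start at field 7 and end at a lone "-";
            # the field after the separator is the filesystem type.
            for (i = 7; i <= NF; i++) {
                if ($i == "-") { fstype = $(i + 1); break }
            }
        }
        # If several mounts are stacked on the same path, the last one wins.
        END { if (fstype != "") print fstype }
    ' "$mountinfo"
}
```

Run inside the container, `mount_fstype /dev` would typically be expected to show a runtime-created tmpfs, while `mount_fstype /hostdev` should show the host's devtmpfs; differing types indicate the two paths are not backed by the same mount.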

Proposed Solution
Mount host /dev directly to container /dev with mountPropagation: HostToContainer:

volumeMounts:
- mountPath: /dev  # Mount to /dev, not /hostdev
  name: dev
  mountPropagation: HostToContainer  # Also propagate mounts made later under the host's /dev

This fixes the issue because:

  • smartctl scans /dev ✓
  • /dev is now the live host mount ✓
  • Device changes propagate immediately ✓
  • No stale devices ✓

Why /hostdev Doesn't Work
The device-include config cannot fix this because:

  • device-include is a regex that filters device names (e.g., sda), not paths
  • You cannot configure smartctl to scan /hostdev instead of /dev
  • There's no way to tell the exporter to use /hostdev/sda instead of /dev/sda

What's your helm version?

Version:"v3.19.0"

What's your kubectl version?

Client Version: v1.34.1

Which chart?

prometheus-smartctl-exporter

What's the chart version?

v0.16.0

Labels: bug (Something isn't working)