GPU Operator v1.4.1 Release Notes
The AMD GPU Operator v1.4.1 release extends platform support to OpenShift v4.20 and Debian 12, and introduces the ability to build amdgpu kernel modules directly within air-gapped OpenShift clusters.
Important Notice
- New AMDGPU Driver Versioning Scheme
  - Starting with ROCm 7.1, the AMD GPU driver version numbering has diverged from the ROCm release version. The amdgpu driver now uses an independent versioning scheme (e.g., driver version 30.20 corresponds to ROCm 7.1). When specifying driver versions in the DeviceConfig CR field `spec.driver.version`, reference the amdgpu driver version (e.g., "30.20") for ROCm 7.1 and later releases; for ROCm versions prior to 7.1, continue to use the ROCm version number (e.g., "6.4", "7.0"). Refer to the AMD ROCm documentation for the driver version that corresponds to your desired ROCm release; all published amdgpu driver versions are available in the Radeon Repository. A DeviceConfig sketch follows this notice.
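For example, a minimal DeviceConfig targeting ROCm 7.1 sets the amdgpu driver version rather than the ROCm version. This is a sketch: the resource name and namespace are placeholders, and the `amd.com/v1alpha1` API group is assumed from the operator's CRD.

```yaml
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: amd-gpu-config        # placeholder name
  namespace: kube-amd-gpu     # adjust to your install namespace
spec:
  driver:
    # ROCm 7.1 and later: use the amdgpu driver version.
    version: "30.20"
    # ROCm releases before 7.1: use the ROCm version instead, e.g.
    # version: "6.4"
```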
Release Highlights
- OpenShift Platform Support Enhancements
  - Build Driver Images Directly within Disconnected OpenShift Clusters
    - Starting from v1.4.1, the AMD GPU Operator supports building driver kernel modules directly within disconnected OpenShift clusters.
    - For Red Hat Enterprise Linux CoreOS (the OS used by OpenShift), the operator downloads driver source code and firmware from AMD-provided amdgpu-driver images into the OpenShift Driver Toolkit and builds the kernel modules directly from source, without depending on a large set of RPM packages. A configuration sketch appears after this section.
  - Cluster Monitoring Enablement
    - The v1.4.1 AMD GPU Operator automatically creates the RBAC resources required by the OpenShift Cluster Monitoring stack. This removes a manual configuration step when setting up the OpenShift monitoring stack to scrape metrics from the device metrics exporter; see the sketch after this section.
  - Integration with OpenShift Cluster Observability Operator Accelerator Dashboard
    - Starting with v1.4.1, the AMD GPU Operator automatically creates a `PrometheusRule` that translates key metrics into formats compatible with the OpenShift Cluster Observability Operator's accelerator dashboard, providing an improved out-of-the-box experience. An illustrative rule appears after this section.
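As referenced above, in-cluster driver builds are driven by the same DeviceConfig driver block. The sketch below assumes the `spec.driver.enable` and `spec.driver.image` fields from the operator's CRD, with the image field pointing at a registry reachable from the disconnected cluster; treat the field values as illustrative.

```yaml
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: amd-gpu-config
  namespace: kube-amd-gpu
spec:
  driver:
    enable: true      # let the operator manage (and here, build) the amdgpu driver
    version: "30.20"
    # Registry for the built driver images; must be reachable from
    # inside the disconnected cluster.
    image: registry.internal.example.com/amdgpu/driver-image
```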
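With the RBAC resources created automatically, the remaining monitoring setup is typically just enabling the stack itself. If your cluster scrapes the exporter through OpenShift user-workload monitoring, that is the standard ConfigMap shown below (a sketch; your cluster's monitoring configuration may differ).

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Enable user-workload monitoring so Prometheus can scrape the
    # device metrics exporter.
    enableUserWorkload: true
```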
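The PrometheusRule itself is generated by the operator; a hand-written equivalent would look roughly like the following, where the rule and metric names are purely hypothetical. The operator picks names compatible with the accelerator dashboard.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: amd-gpu-dashboard-rules   # hypothetical name
  namespace: kube-amd-gpu
spec:
  groups:
    - name: amd-gpu.accelerator-dashboard
      rules:
        # Hypothetical recording rule: re-expose an exporter metric
        # under a name the dashboard can consume.
        - record: accelerator:gpu_utilization:ratio
          expr: avg by (node) (gpu_gfx_activity) / 100
```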
- Device-Metrics-Exporter Enhancements
  - Enhanced Pod and Service Annotations
    - Custom annotations can now be applied to exporter pods and services via the DeviceConfig CRD, providing greater flexibility in metadata management; a sketch follows this section.
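A sketch of the new annotation support is shown below. The `podAnnotations` and `serviceAnnotations` field names under `spec.metricsExporter` are assumptions for illustration; consult the v1.4.1 CRD reference for the authoritative schema.

```yaml
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: amd-gpu-config
  namespace: kube-amd-gpu
spec:
  metricsExporter:
    enable: true
    # Assumed field names for the v1.4.1 annotation support:
    podAnnotations:
      example.com/scrape-tier: "gpu"
    serviceAnnotations:
      example.com/owner: "ml-platform"
```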
- Test Runner Enhancements
  - Level-Based and Partitioned GPU Test Recipe Support
    - The test runner now supports level-based test recipes and partitioned GPU test recipes, enabling more granular and flexible GPU testing scenarios.
  - Enhanced Test Result Events
    - Test runner Kubernetes events now include additional information, such as the pod UID and test framework name (e.g., RVS, AGFHC), as event labels, providing more comprehensive test run information for improved tracking and diagnostics; see the example after this section.
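Because the pod UID and framework name are attached as event labels, they can be listed directly with kubectl; the namespace below is a placeholder for wherever your test runner operates.

```bash
# Show test-runner events together with their labels
# (e.g., pod UID and framework name such as RVS or AGFHC).
kubectl get events -n kube-amd-gpu --show-labels
```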
Fixes
- Node Feature Discovery Rule Fix
  - Fixed the PCI device ID for the Virtual Function (VF) of MI308X and MI300X-HF GPUs.
- Helm Chart Default DeviceConfig Fix
  - Fixed an issue where the Helm chart could not render the metrics exporter's pod resource API socket path in the default DeviceConfig when specified via `values.yaml` or the `--set` option.
Known Limitations
- Test Runner
  - RVS-generated `result.json` files may contain redundant brackets at the end for the newly introduced level-based recipes in v1.4.1, resulting in invalid JSON.
- Device Config Manager
  - Memory partition operations may occasionally fail due to leaked device handles that prevent the amdgpu driver from being unloaded when applying a new memory partition profile. This issue has been observed on Debian 12 with MI325X GPUs when using the v1.4.1 Device Config Manager.
  - Workaround: Reboot the affected worker nodes and retry the partitioning operation; a command sketch follows.
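A sketch of the workaround, assuming a standard node drain/reboot flow; the node name is a placeholder, and the reboot mechanism should follow your cluster's usual procedure.

```bash
NODE=worker-0   # placeholder: the affected worker node

# Drain the node so workloads are rescheduled before the reboot.
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data

# Reboot the node (for example over SSH).
ssh "$NODE" sudo systemctl reboot

# Once the node is back, allow scheduling again, then retry the
# memory partitioning operation.
kubectl uncordon "$NODE"
```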