GPU Operator v1.4.1 Release Notes
The AMD GPU Operator v1.4.1 release extends platform support to OpenShift v4.20 and Debian 12, and introduces the ability to build amdgpu kernel modules directly within air-gapped OpenShift clusters.
Important Notice
- New AMDGPU Driver Versioning Scheme
  - Starting with ROCm 7.1, the AMD GPU driver version numbering has diverged from the ROCm release version. The amdgpu driver now uses an independent versioning scheme (e.g., driver version 30.20 corresponds to ROCm 7.1). When specifying driver versions in the DeviceConfig CR field `spec.driver.version`, reference the amdgpu driver version (e.g., "30.20") for ROCm 7.1 and later releases; for ROCm versions prior to 7.1, continue to use the ROCm version number (e.g., "6.4", "7.0"). Refer to the AMD ROCm documentation for the driver version that corresponds to your desired ROCm release; all published amdgpu driver versions are available in the Radeon Repository. A DeviceConfig sketch follows this notice.
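For example, a minimal DeviceConfig targeting ROCm 7.1 sets the amdgpu driver version rather than the ROCm version. This is a sketch: the resource name and namespace are placeholders, and the `amd.com/v1alpha1` API group is assumed from the operator's CRD.

```yaml
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: amd-gpu-config        # placeholder name
  namespace: kube-amd-gpu     # adjust to your install namespace
spec:
  driver:
    # ROCm 7.1 and later: use the amdgpu driver version.
    version: "30.20"
    # ROCm releases before 7.1: use the ROCm version instead, e.g.
    # version: "6.4"
```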
Release Highlights
- OpenShift Platform Support Enhancements
  - Build Driver Images Directly within Disconnected OpenShift Clusters
    - Starting from v1.4.1, the AMD GPU Operator supports building driver kernel modules directly within disconnected OpenShift clusters.
    - For Red Hat Enterprise Linux CoreOS (the OS used by OpenShift), the operator downloads driver source code and firmware from AMD-provided amdgpu-driver images into the OpenShift Driver Toolkit and builds the kernel modules directly from source, without depending on a large set of RPM packages. A configuration sketch appears after this section.
  - Cluster Monitoring Enablement
    - The v1.4.1 AMD GPU Operator automatically creates the RBAC resources required by the OpenShift Cluster Monitoring stack. This removes a manual configuration step when setting up the OpenShift monitoring stack to scrape metrics from the device metrics exporter; see the sketch after this section.
  - Integration with OpenShift Cluster Observability Operator Accelerator Dashboard
    - Starting with v1.4.1, the AMD GPU Operator automatically creates a `PrometheusRule` that translates key metrics into formats compatible with the OpenShift Cluster Observability Operator's accelerator dashboard, providing an improved out-of-the-box experience. An illustrative rule appears after this section.
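As referenced above, in-cluster driver builds are driven by the same DeviceConfig driver block. The sketch below assumes the `spec.driver.enable` and `spec.driver.image` fields from the operator's CRD, with the image field pointing at a registry reachable from the disconnected cluster; treat the field values as illustrative.

```yaml
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: amd-gpu-config
  namespace: kube-amd-gpu
spec:
  driver:
    enable: true      # let the operator manage (and here, build) the amdgpu driver
    version: "30.20"
    # Registry for the built driver images; must be reachable from
    # inside the disconnected cluster.
    image: registry.internal.example.com/amdgpu/driver-image
```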
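With the RBAC resources created automatically, the remaining monitoring setup is typically just enabling the stack itself. If your cluster scrapes the exporter through OpenShift user-workload monitoring, that is the standard ConfigMap shown below (a sketch; your cluster's monitoring configuration may differ).

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Enable user-workload monitoring so Prometheus can scrape the
    # device metrics exporter.
    enableUserWorkload: true
```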
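The PrometheusRule itself is generated by the operator; a hand-written equivalent would look roughly like the following, where the rule and metric names are purely hypothetical. The operator picks names compatible with the accelerator dashboard.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: amd-gpu-dashboard-rules   # hypothetical name
  namespace: kube-amd-gpu
spec:
  groups:
    - name: amd-gpu.accelerator-dashboard
      rules:
        # Hypothetical recording rule: re-expose an exporter metric
        # under a name the dashboard can consume.
        - record: accelerator:gpu_utilization:ratio
          expr: avg by (node) (gpu_gfx_activity) / 100
```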
- Device-Metrics-Exporter Enhancements
  - Enhanced Pod and Service Annotations
    - Custom annotations can now be applied to exporter pods and services via the DeviceConfig CRD, providing greater flexibility in metadata management; a sketch follows this section.
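A sketch of the new annotation support is shown below. The `podAnnotations` and `serviceAnnotations` field names under `spec.metricsExporter` are assumptions for illustration; consult the v1.4.1 CRD reference for the authoritative schema.

```yaml
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: amd-gpu-config
  namespace: kube-amd-gpu
spec:
  metricsExporter:
    enable: true
    # Assumed field names for the v1.4.1 annotation support:
    podAnnotations:
      example.com/scrape-tier: "gpu"
    serviceAnnotations:
      example.com/owner: "ml-platform"
```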
- Test Runner Enhancements
  - Level-Based and Partitioned GPU Test Recipe Support
    - The test runner now supports level-based test recipes and partitioned GPU test recipes, enabling more granular and flexible GPU testing scenarios.
  - Enhanced Test Result Events
    - Test runner Kubernetes events now include additional information, such as the pod UID and test framework name (e.g., RVS, AGFHC), as event labels, providing more comprehensive test run information for improved tracking and diagnostics; see the example after this section.
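Because the pod UID and framework name are attached as event labels, they can be listed directly with kubectl; the namespace below is a placeholder for wherever your test runner operates.

```bash
# Show test-runner events together with their labels
# (e.g., pod UID and framework name such as RVS or AGFHC).
kubectl get events -n kube-amd-gpu --show-labels
```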
Fixes
- Node Feature Discovery Rule Fix
  - Fixed the PCI device ID for the Virtual Function (VF) of MI308X and MI300X-HF GPUs.
- Helm Chart Default DeviceConfig Fix
  - Fixed an issue where the Helm chart could not render the metrics exporter's pod resource API socket path in the default DeviceConfig when specified via `values.yaml` or the `--set` option.
Known Limitations
- Test Runner
  - RVS-generated `result.json` files may contain redundant brackets at the end for the newly introduced level-based recipes in v1.4.1, resulting in invalid JSON.
- Device Config Manager
  - Memory partition operations may occasionally fail due to leaked device handles that prevent the amdgpu driver from being unloaded when applying a new memory partition profile. This issue has been observed on Debian 12 with MI325X GPUs when using the v1.4.1 Device Config Manager.
  - Workaround: Reboot the affected worker nodes and retry the partitioning operation; a command sketch follows.
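A sketch of the workaround, assuming a standard node drain/reboot flow; the node name is a placeholder, and the reboot mechanism should follow your cluster's usual procedure.

```bash
NODE=worker-0   # placeholder: the affected worker node

# Drain the node so workloads are rescheduled before the reboot.
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data

# Reboot the node (for example over SSH).
ssh "$NODE" sudo systemctl reboot

# Once the node is back, allow scheduling again, then retry the
# memory partitioning operation.
kubectl uncordon "$NODE"
```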