Skip to content

Conversation

@ArangoGutierrez
Copy link
Collaborator

Implement automated forward compatibility tests that validate GPU Operator
against the latest published images from NVIDIA component repositories.

Changes:

  • Add forward-compatibility.yaml workflow (weekly + manual trigger)
  • Create get-latest-images.sh to fetch latest commit-based images
  • Extend e2e-tests.yaml with optional component image inputs
  • Update install-operator.sh to support device-plugin and mig-manager overrides
  • Add Slack notifications for test failures
  • Create variables.yaml as reusable workflow for shared CI variables

Components tested:

  • ghcr.io/nvidia/container-toolkit
  • ghcr.io/nvidia/k8s-device-plugin
  • ghcr.io/nvidia/k8s-mig-manager

We could add other operands later, but I wanted to start with the core ones.

Schedule: Every Monday at 2 AM UTC

@github-advanced-security
Copy link

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.

@ArangoGutierrez ArangoGutierrez force-pushed the integration-tests branch 2 times, most recently from d55dd82 to 9dab00e Compare November 10, 2025 17:54
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements automated weekly forward compatibility testing for the GPU Operator against the latest published component images from NVIDIA repositories. The changes refactor the existing monolithic CI workflow into reusable workflow modules and add forward compatibility testing infrastructure.

Key changes:

  • Refactored CI workflows into separate, reusable workflow files (variables, golang-checks, config-checks, image-builds, e2e-tests, release)
  • Added forward-compatibility.yaml workflow with scheduled weekly runs and manual trigger support
  • Created get-latest-images.sh script to fetch latest commit-based images from component repositories
  • Extended install-operator.sh to support device-plugin and mig-manager image overrides
  • Added Slack notifications for test failures

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
.github/workflows/forward-compatibility.yaml New workflow for weekly forward compatibility testing with Slack notifications
.github/workflows/ci.yaml Refactored to use reusable workflows instead of monolithic job definitions
.github/workflows/variables.yaml New reusable workflow for shared CI variables
.github/workflows/e2e-tests.yaml New reusable e2e test workflow with optional component image inputs
.github/workflows/image-builds.yaml New reusable workflow for GPU operator image builds
.github/workflows/release.yaml New reusable workflow for release processes
.github/workflows/golang-checks.yaml New reusable workflow for Go checks and tests
.github/workflows/config-checks.yaml New reusable workflow for configuration validation
.github/workflows/code-scanning.yaml New reusable workflow for CodeQL scanning
.github/scripts/get-latest-images.sh Script to fetch latest component images from NVIDIA repositories
tests/scripts/install-operator.sh Extended to support device-plugin and mig-manager image overrides

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Split monolithic ci.yaml (448 lines) into 6 specialized workflows.
Refactored ci.yaml as orchestrator (118 lines) with centralized variable
calculation. All workflows support workflow_call, workflow_dispatch, and
standalone execution. Eliminates duplicate variable calculations.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Implement automated weekly tests validating GPU Operator against latest
container-toolkit, device-plugin, and mig-manager images from GHCR.

- Add forward-compatibility.yaml workflow with Slack alerts
- Create get-latest-images.sh for fetching latest commit-based tags
- Extend e2e-tests.yaml and install-operator.sh for component overrides
- Add variables.yaml reusable workflow for shared CI variables

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@ArangoGutierrez
Copy link
Collaborator Author

PING @rahulait @rajathagasthya @tariq1890

env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
# Use workflow_dispatch inputs if provided, otherwise fetch latest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: when would someone set these variables as inputs? If the scope of this workflow is to always test the top-of-tree images for all operands, then maybe we can remove the input variables and move the entire run: block into a shell script? Thoughts?

Comment on lines +101 to +103
toolkit_image: ${{ needs.fetch-latest-images.outputs.toolkit_image }}
device_plugin_image: ${{ needs.fetch-latest-images.outputs.device_plugin_image }}
mig_manager_image: ${{ needs.fetch-latest-images.outputs.mig_manager_image }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the current tests/scripts/install-operator.sh has separate input variables for the various operand images, but would it simplify our CI definition if we passed a values override file instead? Just an idea -- the previous step fetch-latest-images can craft the values override file that points to all the latest operand images. This file is passed as input to the run-e2e-tests step which launches our e2e test script with this overrides file. Thoughts? The reason I am suggesting this is because I see a lot of boilerplate in forward-compatibility.yaml and e2e-tests.yaml for declaring all of these input variables.

Comment on lines +22 to +33
toolkit_image:
description: 'Override container-toolkit image'
required: false
type: string
device_plugin_image:
description: 'Override device-plugin image'
required: false
type: string
mig_manager_image:
description: 'Override mig-manager image'
required: false
type: string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do these variables need to be inputs? Aren't we always resolving the images in this workflow itself?

Comment on lines +54 to +56
with:
operator_image: ${{ needs.variables.outputs.operator_image_base }}
operator_version: ${{ needs.variables.outputs.operator_version }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question -- could we potentially move this to the variables workflow itself? As in, make operator_image and operator_version as outputs of the variables worflow?

Comment on lines +63 to +66
with:
commit_short_sha: ${{ needs.variables.outputs.commit_short_sha }}
operator_version: ${{ needs.variables.outputs.operator_version }}
operator_image_base: ${{ needs.variables.outputs.operator_image_base }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above -- is this really needed? Can we move these variables as outputs of the variables workflow itself?

Comment on lines +80 to +85
with:
operator_image: ${{ inputs.operator_image }}
operator_version: ${{ inputs.operator_version }}
toolkit_image: ${{ inputs.toolkit_image }}
device_plugin_image: ${{ inputs.device_plugin_image }}
mig_manager_image: ${{ inputs.mig_manager_image }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale for passing these variables as input to the variables workflow?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants