-
Notifications
You must be signed in to change notification settings - Fork 418
Add weekly forward compatibility testing #1884
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation. |
d55dd82 to
9dab00e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements automated weekly forward compatibility testing for the GPU Operator against the latest published component images from NVIDIA repositories. The changes refactor the existing monolithic CI workflow into reusable workflow modules and add forward compatibility testing infrastructure.
Key changes:
- Refactored CI workflows into separate, reusable workflow files (variables, golang-checks, config-checks, image-builds, e2e-tests, release)
- Added forward-compatibility.yaml workflow with scheduled weekly runs and manual trigger support
- Created get-latest-images.sh script to fetch latest commit-based images from component repositories
- Extended install-operator.sh to support device-plugin and mig-manager image overrides
- Added Slack notifications for test failures
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| .github/workflows/forward-compatibility.yaml | New workflow for weekly forward compatibility testing with Slack notifications |
| .github/workflows/ci.yaml | Refactored to use reusable workflows instead of monolithic job definitions |
| .github/workflows/variables.yaml | New reusable workflow for shared CI variables |
| .github/workflows/e2e-tests.yaml | New reusable e2e test workflow with optional component image inputs |
| .github/workflows/image-builds.yaml | New reusable workflow for GPU operator image builds |
| .github/workflows/release.yaml | New reusable workflow for release processes |
| .github/workflows/golang-checks.yaml | New reusable workflow for Go checks and tests |
| .github/workflows/config-checks.yaml | New reusable workflow for configuration validation |
| .github/workflows/code-scanning.yaml | New reusable workflow for CodeQL scanning |
| .github/scripts/get-latest-images.sh | Script to fetch latest component images from NVIDIA repositories |
| tests/scripts/install-operator.sh | Extended to support device-plugin and mig-manager image overrides |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
9dab00e to
1cad374
Compare
Split monolithic ci.yaml (448 lines) into 6 specialized workflows. Refactored ci.yaml as orchestrator (118 lines) with centralized variable calculation. All workflows support workflow_call, workflow_dispatch, and standalone execution. Eliminates duplicate variable calculations. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Implement automated weekly tests validating GPU Operator against latest container-toolkit, device-plugin, and mig-manager images from GHCR. - Add forward-compatibility.yaml workflow with Slack alerts - Create get-latest-images.sh for fetching latest commit-based tags - Extend e2e-tests.yaml and install-operator.sh for component overrides - Add variables.yaml reusable workflow for shared CI variables Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
1cad374 to
2c49553
Compare
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
84a0c31 to
54d6fc9
Compare
| env: | ||
| GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} | ||
| run: | | ||
| # Use workflow_dispatch inputs if provided, otherwise fetch latest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: when would someone set these variables as inputs? If the scope of this workflow is to always test the top-of-tree images for all operands, then maybe we can remove the input variables and move the entire run: block into a shell script? Thoughts?
| toolkit_image: ${{ needs.fetch-latest-images.outputs.toolkit_image }} | ||
| device_plugin_image: ${{ needs.fetch-latest-images.outputs.device_plugin_image }} | ||
| mig_manager_image: ${{ needs.fetch-latest-images.outputs.mig_manager_image }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the current tests/scripts/install-operator.sh has separate input variables for the various operand images, but would it simplify our CI definition if we passed a values override file instead? Just an idea -- the previous step fetch-latest-images can craft the values override file that points to all the latest operand images. This file is passed as input to the run-e2e-tests step which launches our e2e test script with this overrides file. Thoughts? The reason I am suggesting this is because I see a lot of boilerplate in forward-compatibility.yaml and e2e-tests.yaml for declaring all of these input variables.
| toolkit_image: | ||
| description: 'Override container-toolkit image' | ||
| required: false | ||
| type: string | ||
| device_plugin_image: | ||
| description: 'Override device-plugin image' | ||
| required: false | ||
| type: string | ||
| mig_manager_image: | ||
| description: 'Override mig-manager image' | ||
| required: false | ||
| type: string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do these variables need to be inputs? Aren't we always resolving the images in this workflow itself?
| with: | ||
| operator_image: ${{ needs.variables.outputs.operator_image_base }} | ||
| operator_version: ${{ needs.variables.outputs.operator_version }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question -- could we potentially move this to the variables workflow itself? As in, make operator_image and operator_version as outputs of the variables worflow?
| with: | ||
| commit_short_sha: ${{ needs.variables.outputs.commit_short_sha }} | ||
| operator_version: ${{ needs.variables.outputs.operator_version }} | ||
| operator_image_base: ${{ needs.variables.outputs.operator_image_base }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above -- is this really needed? Can we move these variables as outputs of the variables workflow itself?
| with: | ||
| operator_image: ${{ inputs.operator_image }} | ||
| operator_version: ${{ inputs.operator_version }} | ||
| toolkit_image: ${{ inputs.toolkit_image }} | ||
| device_plugin_image: ${{ inputs.device_plugin_image }} | ||
| mig_manager_image: ${{ inputs.mig_manager_image }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rationale for passing these variables as input to the variables workflow?
Implement automated forward compatibility tests that validate GPU Operator
against the latest published images from NVIDIA component repositories.
Changes:
Components tested:
We could add other operands later, but I wanted to start with the core ones.
Schedule: Every Monday at 2 AM UTC