-
Notifications
You must be signed in to change notification settings - Fork 228
fix: upgrade Intel MPI to 2021.14 and fix CI race condition #757
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
fix: upgrade Intel MPI to 2021.14 and fix CI race condition #757
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
9bc2924 to
c50b4fa
Compare
tenzen-y
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this issue 👍
build/base/entrypoint.sh
Outdated
| # Add a delay to allow worker's sshd to be fully ready in slow CI environments | ||
| sleep 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of this sleep, what about increasing the backoff to 1 or more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced sleep with backoff=1 as suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you use the previous indent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the indentation to match the previous style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the indentation to match the previous style.
c50b4fa to
c62e651
Compare
|
I've applied the feedback: replaced sleep with backoff logic and fixed Dockerfile indentation. Verified locally |
tenzen-y
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's see if the errors have gone away in CI.
build/base/intel-builder.Dockerfile
Outdated
| libstdc++-12-dev binutils procps clang \ | ||
| intel-oneapi-compiler-dpcpp-cpp \ | ||
| intel-oneapi-mpi-devel-2021.14 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that you forgot to push the new commit to this PR?
| libstdc++-12-dev binutils procps clang \ | |
| intel-oneapi-compiler-dpcpp-cpp \ | |
| intel-oneapi-mpi-devel-2021.14 \ | |
| libstdc++-12-dev binutils procps clang \ | |
| intel-oneapi-compiler-dpcpp-cpp \ | |
| intel-oneapi-mpi-devel-2021.14 \ |
build/base/intel.Dockerfile
Outdated
| dnsutils \ | ||
| intel-oneapi-mpi-2021.14 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| dnsutils \ | |
| intel-oneapi-mpi-2021.14 \ | |
| dnsutils \ | |
| intel-oneapi-mpi-2021.14 \ |
3cdc928 to
f4b36f1
Compare
Upgrades Intel MPI version from 2021.13 to 2021.14. Adds a 5-second initialization delay in entrypoint.sh to prevent SSH handshake failures (kex_exchange_identification) observed in CI environments. Fixes Issue kubeflow#678. Signed-off-by: Syed-Suhaan <[email protected]>
f4b36f1 to
fbe4d12
Compare
…etic, and fix Dockerfile indentation Signed-off-by: Syed-Suhaan <[email protected]>
b67fb9b to
c05d2af
Compare
|
Fixed entrypoint.sh race condition by switching to native bash backoff (removing awk dependency), and corrected Dockerfile indentation. |
| max_retry=10 | ||
| counter=0 | ||
| backoff=0.1 | ||
| backoff=1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| backoff=1 | |
| backoff=3 |
@Syed-Suhaan It seems 1 sec is not enough: https://github.com/kubeflow/mpi-operator/actions/runs/20128945073/job/57765870629?pr=757
Let's increase it to 3 sec.
Upgrades Intel MPI version from 2021.13 to 2021.14.
Adds a 5-second initialization delay in entrypoint.sh to prevent SSH handshake failures (kex_exchange_identification) observed in CI environments.
Fixes Issue #678.