@Syed-Suhaan

Upgrades Intel MPI version from 2021.13 to 2021.14.

Adds a 5-second initialization delay in entrypoint.sh to prevent SSH handshake failures (kex_exchange_identification) observed in CI environments.

Fixes Issue #678.

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Syed-Suhaan force-pushed the upgrade-intel-mpi-2021-14 branch from 9bc2924 to c50b4fa on December 8, 2025 14:59
Member

@tenzen-y left a comment

Thank you for working on this issue 👍

Comment on lines 34 to 35
# Add a delay to allow worker's sshd to be fully ready in slow CI environments
sleep 5
Member

Instead of this sleep, what about increasing the backoff to 1 or more?

Author

Replaced sleep with backoff=1 as suggested.

Member

Could you use the previous indent?

Author

Fixed the indentation to match the previous style.

Member

ditto

Author

Fixed the indentation to match the previous style.

@Syed-Suhaan force-pushed the upgrade-intel-mpi-2021-14 branch from c50b4fa to c62e651 on December 11, 2025 07:32
@Syed-Suhaan
Author

I've applied the feedback: replaced sleep with backoff logic and fixed the Dockerfile indentation. Verified locally.

Member

@tenzen-y left a comment

Let's see if the errors have gone away in CI.

Comment on lines 19 to 21
libstdc++-12-dev binutils procps clang \
intel-oneapi-compiler-dpcpp-cpp \
intel-oneapi-mpi-devel-2021.14 \
Member

It seems that you forgot to push the new commit to this PR?

Suggested change (indentation only):
libstdc++-12-dev binutils procps clang \
intel-oneapi-compiler-dpcpp-cpp \
intel-oneapi-mpi-devel-2021.14 \

Comment on lines 21 to 22
dnsutils \
intel-oneapi-mpi-2021.14 \
Member

Suggested change (indentation only):
dnsutils \
intel-oneapi-mpi-2021.14 \

@Syed-Suhaan force-pushed the upgrade-intel-mpi-2021-14 branch 3 times, most recently from 3cdc928 to f4b36f1, on December 11, 2025 08:32
Upgrades Intel MPI version from 2021.13 to 2021.14.

Adds a 5-second initialization delay in entrypoint.sh to prevent SSH handshake failures (kex_exchange_identification) observed in CI environments.

Fixes Issue kubeflow#678.

Signed-off-by: Syed-Suhaan <[email protected]>
@Syed-Suhaan force-pushed the upgrade-intel-mpi-2021-14 branch from f4b36f1 to fbe4d12 on December 11, 2025 08:34
…etic, and fix Dockerfile indentation

Signed-off-by: Syed-Suhaan <[email protected]>
@Syed-Suhaan force-pushed the upgrade-intel-mpi-2021-14 branch from b67fb9b to c05d2af on December 11, 2025 09:51
@Syed-Suhaan
Author

Fixed entrypoint.sh race condition by switching to native bash backoff (removing awk dependency), and corrected Dockerfile indentation.
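
For context, a native-bash backoff loop of the kind described here might look like the sketch below. This is an illustration, not the code from this PR: the function name retry_with_backoff, the doubling policy, and the nslookup call in the usage comment are assumptions. It relies on bash integer arithmetic only, so no awk call is needed, which also means the starting delay has to be a whole number of seconds (1 or 3, not 0.1).

```shell
#!/usr/bin/env bash

# Hypothetical helper (not taken from the PR): run a command until it
# succeeds, sleeping between attempts and doubling the delay each time.
# Usage: retry_with_backoff <max_retry> <initial_backoff_seconds> <command...>
retry_with_backoff() {
  local max_retry="$1"
  local backoff="$2"   # must be an integer; bash arithmetic has no floats
  shift 2
  local counter=0

  until "$@" > /dev/null 2>&1; do
    if [ "$counter" -ge "$max_retry" ]; then
      echo "Giving up after $counter retries: $*" >&2
      return 1
    fi
    echo "Attempt failed; retrying in ${backoff}s" >&2
    sleep "$backoff"
    counter=$((counter + 1))
    backoff=$((backoff * 2))   # assumed policy: double the delay
  done
}

# Example call (assumed, commented out so the sketch has no side effects):
# retry_with_backoff 10 1 nslookup "$host"
```

Starting from an integer delay keeps the loop free of external tools, at the cost of a coarser first retry than the earlier 0.1-second value allowed.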

max_retry=10
counter=0
-backoff=0.1
+backoff=1
Member

Suggested change
-backoff=1
+backoff=3

@Syed-Suhaan It seems 1 sec is not enough: https://github.com/kubeflow/mpi-operator/actions/runs/20128945073/job/57765870629?pr=757

Let's increase it to 3 sec.
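
To gauge what moving from 1 to 3 seconds means for the time spent waiting, the cumulative delay can be computed directly. The sketch below assumes the delay doubles after every failed attempt; this thread does not confirm that policy, and total_backoff_wait is a hypothetical helper, not part of entrypoint.sh.

```shell
#!/usr/bin/env bash

# Hypothetical helper: total seconds slept if the first <retries> attempts
# all fail, assuming the delay starts at <backoff> and doubles each time.
# Usage: total_backoff_wait <initial_backoff_seconds> <retries>
total_backoff_wait() {
  local backoff="$1"
  local retries="$2"
  local total=0
  local i

  for ((i = 0; i < retries; i++)); do
    total=$((total + backoff))
    backoff=$((backoff * 2))
  done
  echo "$total"
}

# Compare the first three retries for the two values discussed above.
total_backoff_wait 1 3
total_backoff_wait 3 3
```

Under that doubling assumption, the first three retries span 7 seconds with backoff=1 and 21 seconds with backoff=3, which gives a slow sshd noticeably more time to come up before the later attempts.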
