@badstreff badstreff commented Oct 11, 2025

Currently, if the pod fails to create (for example, because a resource quota or some other admission controller is blocking it), the controller will not retry.

There is code to handle the case where the pod fails (PR #4059), but it doesn't apply when the pod fails to be created in the first place.

This PR adds to the failure count on the EphemeralRunner resource even when the pod fails to create, so we back off and try again during reconciliation.

I'm also not sure whether it would be better to just emit an event that the pod creation failed, keep the EphemeralRunner in pending, and let the reconciler try again. That seems like a better pattern to me, but the existing code already implements this retry logic, so I think keeping it consistent is best.

@badstreff badstreff changed the title retry ephemeral runner eveen if pod creation fails retry ephemeral runner even if pod creation fails Oct 11, 2025
@badstreff badstreff changed the title retry ephemeral runner even if pod creation fails draft: retry ephemeral runner even if pod creation fails Oct 11, 2025
@badstreff badstreff changed the title draft: retry ephemeral runner even if pod creation fails retry ephemeral runner even if pod creation fails Oct 12, 2025
@badstreff
Author

@nikola-jokic I see there is a 0.13.0 PR - are these changes still relevant? It looks like some of the changes in that PR may fix this (but not 100% sure)

@nikola-jokic
Collaborator

Hey @badstreff, I think so too, at least it would fix some of the issues I was able to reproduce. I'm pretty sure that this case is already covered.

@badstreff
Author

I wasn't sure, so I just tested, and it looks like the issue is still present. I'll try to get a test written today that covers the behavior.

So far I have been testing manually by deploying ARC, creating a namespace with a highly restrictive resource quota, and trying to run jobs. You will see the ephemeral pod enter a failed state and get stuck there even after the resource quota is removed; you need to manually delete the failed ephemeral runner to get the controller to make new pods.

Attached is a screenshot of the error I see:

image

@badstreff
Author

@nikola-jokic I added a test case that covers the scenario I'm attempting to resolve. This test fails on the master branch without these changes.

@badstreff
Author

@nikola-jokic If you have some time, can you review? We ran into this again today.
