retry ephemeral runner even if pod creation fails #4272

badstreff · 2025-10-11T05:07:12Z

Currently, if the pod fails to create (for example, because a resource quota or some other admission controller is blocking it), the controller will not retry.

There is code to handle the case where the pod fails (see PR #4059), but it doesn't work when the pod fails to be created in the first place.

This PR adds a failure count to the EphemeralRunner resource even when the pod fails to create, so we back off and try again during reconciliation.
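To make the idea concrete, here is a minimal, self-contained sketch of counting pod-creation failures and backing off before the next attempt. It is not the code from this PR: the `runner` type, the phase strings, `maxFailures`, `baseDelay`, and the `createPod` callback are all hypothetical stand-ins, and the exponential backoff schedule is an assumption chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	phasePending = "Pending"
	phaseRunning = "Running"
	phaseFailed  = "Failed"

	maxFailures = 5               // assumed retry limit, not taken from the PR
	baseDelay   = 5 * time.Second // assumed initial backoff, not taken from the PR
)

// errQuota imitates the kind of error an admission failure might produce.
var errQuota = errors.New("pods is forbidden: exceeded quota")

// runner is a hypothetical, stripped-down stand-in for the EphemeralRunner resource.
type runner struct {
	Name     string
	Phase    string
	Failures int // number of failed attempts recorded so far
}

// backoff doubles the delay for every recorded failure: 5s, 10s, 20s, ...
func backoff(failures int) time.Duration {
	if failures < 1 {
		return 0
	}
	return baseDelay << uint(failures-1)
}

// reconcile makes one attempt to create the pod. A creation error is recorded
// as a failure, and the returned duration tells the caller when to retry.
// A zero duration means there is nothing left to retry.
func reconcile(r *runner, createPod func(name string) error) time.Duration {
	if r.Phase != phasePending {
		return 0
	}
	if err := createPod(r.Name); err != nil {
		r.Failures++
		if r.Failures >= maxFailures {
			r.Phase = phaseFailed // give up only after the limit is reached
			return 0
		}
		return backoff(r.Failures) // stay pending and retry later
	}
	r.Phase = phaseRunning
	return 0
}

func main() {
	// Simulate a quota that blocks the first two attempts and is then removed.
	attempts := 0
	createPod := func(string) error {
		attempts++
		if attempts <= 2 {
			return errQuota
		}
		return nil
	}

	r := &runner{Name: "runner-a", Phase: phasePending}
	for r.Phase == phasePending {
		delay := reconcile(r, createPod)
		fmt.Printf("phase=%s failures=%d requeueAfter=%s\n", r.Phase, r.Failures, delay)
	}
}
```

The point of the sketch is that a failed create leaves the runner pending with a recorded failure, so a later reconcile can succeed once whatever blocked the pod is gone.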

I'm also not sure whether it would be better to just emit an event that the pod creation failed, keep the EphemeralRunner in pending, and let the reconciler try again. That seems like a better pattern to me, but the existing code already implements this retry logic, so I think keeping it consistent is best.
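For comparison, the alternative pattern mentioned above (emit an event, stay pending, let the reconciler retry) could look roughly like the sketch below. Everything here is hypothetical: `eventLog` stands in for a Kubernetes event recorder, `pendingRunner` for the EphemeralRunner resource, and the `PodCreationFailed` reason string is made up. In a controller-runtime reconciler, returning a non-nil error normally causes the request to be requeued with rate-limited backoff, which is the behavior this sketch leans on.

```go
package main

import (
	"errors"
	"fmt"
)

// warningEvent and eventLog are hypothetical stand-ins for an event recorder.
type warningEvent struct{ Reason, Message string }

type eventLog struct{ Events []warningEvent }

// Warn records a warning event so the failure is visible to operators.
func (l *eventLog) Warn(reason, message string) {
	l.Events = append(l.Events, warningEvent{Reason: reason, Message: message})
}

// pendingRunner is a hypothetical, stripped-down stand-in for the resource.
type pendingRunner struct {
	Name  string
	Phase string
}

// errForbidden imitates the kind of error an admission failure might produce.
var errForbidden = errors.New("pods is forbidden: exceeded quota")

// reconcilePending tries to create the pod. On failure it emits an event,
// leaves the phase untouched, and returns the error so the caller requeues.
func reconcilePending(r *pendingRunner, log *eventLog, createPod func(name string) error) error {
	if r.Phase != "Pending" {
		return nil
	}
	if err := createPod(r.Name); err != nil {
		log.Warn("PodCreationFailed", err.Error())
		return fmt.Errorf("creating pod for %s: %w", r.Name, err)
	}
	r.Phase = "Running"
	return nil
}

func main() {
	// Simulate a quota that is in place for the first pass and removed afterwards.
	quotaInPlace := true
	createPod := func(string) error {
		if quotaInPlace {
			return errForbidden
		}
		return nil
	}

	r := &pendingRunner{Name: "runner-a", Phase: "Pending"}
	log := &eventLog{}

	fmt.Println("first pass:", reconcilePending(r, log, createPod), "phase:", r.Phase)
	quotaInPlace = false
	fmt.Println("second pass:", reconcilePending(r, log, createPod), "phase:", r.Phase)
}
```

The trade-off is that this variant keeps no failure count on the resource, so it never gives up on its own; the failure-count approach bounds the number of attempts.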

badstreff · 2025-10-15T13:23:15Z

@nikola-jokic I see there is a 0.13.0 PR - are these changes still relevant? It looks like some of the changes in that PR may fix this (but not 100% sure)

nikola-jokic · 2025-10-16T10:25:41Z

Hey @badstreff, I think so too, at least it would fix some of the issues I was able to reproduce. I'm pretty sure that this case is already covered.

badstreff · 2025-10-16T14:36:21Z

So I just tested because I wasn't sure, and it looks like the issue is still present. I'll try to get a test written today that covers the behavior.

So far I have been testing manually by deploying ARC, creating a namespace with a highly restrictive resource quota, and trying to run jobs. You will see the ephemeral pod enter a failed state and get stuck there even after the resource quota is removed; you need to manually delete the failed ephemeral runner to get the controller to make new pods.

Attached is a screenshot of the error I see:

badstreff · 2025-10-17T17:10:44Z

@nikola-jokic Added a test case that covers the scenario I'm attempting to resolve. This test fails on the master branch without these changes.


badstreff · 2025-10-23T18:37:10Z

@nikola-jokic If you have some time, can you review? We ran into this again today.

badstreff requested review from a team, mumoshu, nikola-jokic, rentziass and toast-gear as code owners October 11, 2025 05:07

badstreff changed the title ~~retry ephemeral runner eveen if pod creation fails~~ retry ephemeral runner even if pod creation fails Oct 11, 2025

badstreff changed the title ~~retry ephemeral runner even if pod creation fails~~ draft: retry ephemeral runner even if pod creation fails Oct 11, 2025

badstreff changed the title ~~draft: retry ephemeral runner even if pod creation fails~~ retry ephemeral runner even if pod creation fails Oct 12, 2025

badstreff mentioned this pull request Oct 12, 2025

Runner get stuck in "Failed" state for indefinite time ( 0.12.1 ) #4168

Open

4 tasks

badstreff force-pushed the master branch from b75f7ca to 55b8d64 Compare October 17, 2025 17:09

update ephemeral runner controller to retry even if the pod doesn't get created (8428fcd)

badstreff force-pushed the master branch from 55b8d64 to 8428fcd Compare October 17, 2025 17:11
