@lloyd-brown
Collaborator

This PR adds support for launching jobs in parallel with --num-jobs. Previously, one command would launch multiple jobs, but the launches themselves happened serially. Now we use multiple threads on the jobs controller to handle launching.

This may not be the right design. A single job launch is handled by scheduling a request, which normally runs on a long executor, or on a short executor if we're in consolidation mode. This means the executor will spawn multiple threads to handle a single request, which seems to go against our idea of using executors. Another option would be to spawn multiple requests when --num-jobs is specified. However, launching multiple jobs inside the launch function lets us reuse a lot of work (like the DAG), so separate requests would likely take more resources and block other launch requests.
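
As a rough illustration of the threaded approach described above (not the actual code in this PR), here is a minimal sketch. It assumes a hypothetical per-job function `launch_one` that launches one job and returns its ID; `launch_jobs_in_parallel`, `fake_launch`, and `max_workers` are likewise made-up names for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def launch_jobs_in_parallel(
    launch_one: Callable[[int], int],
    num_jobs: int,
    max_workers: int = 8,
) -> List[int]:
    """Call launch_one(index) for each job on a thread pool.

    Results come back in submission order. If any launch raises, the
    exception is re-raised here once the results are collected.
    """
    if num_jobs < 1:
        raise ValueError('num_jobs must be at least 1')
    # Never create more threads than there are jobs to launch.
    workers = min(num_jobs, max_workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves input order in its results.
        return list(pool.map(launch_one, range(num_jobs)))


def fake_launch(index: int) -> int:
    # Hypothetical stand-in for the real per-job launch; returns a fake job ID.
    return 1000 + index


job_ids = launch_jobs_in_parallel(fake_launch, num_jobs=4)
```

In this shape, anything shared across jobs (like the DAG) would be built once by the caller and captured by `launch_one`, which is the reuse argument above. The tradeoff is also visible: a single request fans out into several threads inside one executor.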

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@lloyd-brown lloyd-brown requested a review from cg505 October 25, 2025 00:11
Collaborator

@cg505 cg505 left a comment

I think this is probably okay, but we should move towards a world where this is a single request to the API server and a single gRPC submission call from the API server -> jobs controller, so it's truly constant time up to the point of submission to the jobs controller.
Feel free to undraft and re-request review if you want this to go in.

@lloyd-brown lloyd-brown marked this pull request as ready for review October 27, 2025 17:00
@lloyd-brown
Collaborator Author

> I think this is probably okay, but we should move towards a world where this is a single request to the API server and a single gRPC submission call from the API server -> jobs controller, so it's truly constant time up to the point of submission to the jobs controller. Feel free to undraft and re-request review if you want this to go in.

Sounds good, I agree that makes sense long term. Undrafting!

@lloyd-brown lloyd-brown requested a review from cg505 October 27, 2025 17:37