Skip to content

Conversation

@SeungjinYang
Copy link
Collaborator

@SeungjinYang SeungjinYang commented Nov 3, 2025

Add request_name to Admin Policy's input, allowing Admin Policy to selectively act on certain requests or only act on a subset of requests.

There are some quick benefits that can be realized from this PR. The most obvious is having a certain admin policy not act on certain requests, such as Optimize or Validate.

As an example, when a launch request is submitted through either CLI or SDK, the codepath command actually first calls "Validate" on the DAG and then calls "Optimize" on the DAG before calling "Launch". Now imagine the skypilot admin writes an admin policy like below (taken from one of skypilot example admin policies):

class AddLabelsPolicy(sky.AdminPolicy):
    """Example policy: adds a kubernetes label for skypilot_config."""

    @classmethod
    def validate_and_mutate(
            cls, user_request: sky.UserRequest) -> sky.MutatedUserRequest:
        config = user_request.skypilot_config
        labels = config.get_nested(('kubernetes', 'custom_metadata', 'labels'),
                                   {})
        labels['app'] = 'skypilot'
        config.set_nested(('kubernetes', 'custom_metadata', 'labels'), labels)
        return sky.MutatedUserRequest(user_request.task, config)

This policy doesn't actually need to be applied in either "Validate" or "Optimize" call, just the "Launch" call. With this PR, the admin can write this policy instead:

class AddLabelsConditionalPolicy(sky.AdminPolicy):
    """Example policy: adds a kubernetes label for skypilot_config 
    if the request is a cluster launch request."""

    @classmethod
    def validate_and_mutate(
            cls, user_request: sky.UserRequest) -> sky.MutatedUserRequest:
        if user_request.request_name in [
                sky.AdminPolicyRequestName.VALIDATE,
                sky.AdminPolicyRequestName.OPTIMIZE
        ]:
            return sky.MutatedUserRequest(user_request.task,
                                          user_request.skypilot_config)
        config = user_request.skypilot_config
        labels = config.get_nested(('kubernetes', 'custom_metadata', 'labels'),
                                   {})
        labels['app'] = 'skypilot'
        config.set_nested(('kubernetes', 'custom_metadata', 'labels'), labels)
        return sky.MutatedUserRequest(user_request.task, config)

avoiding unnecessary work.

A better example where having request_name on hand is EnforceAutostopPolicy (also taken from publicly available examples):

class EnforceAutostopPolicy(sky.AdminPolicy):
    """Example policy: enforce autostop for all tasks."""

    @classmethod
    def validate_and_mutate(
            cls, user_request: sky.UserRequest) -> sky.MutatedUserRequest:
        """Enforces autostop for all tasks.

        Note that with this policy enforced, users can still change the autostop
        setting for an existing cluster by using `sky autostop`.

        Since we refresh the cluster status with `sky.status` whenever this
        policy is applied, we should expect a few seconds latency when a user
        run a request.
        """
        request_options = user_request.request_options

        # Request options is None when a task is executed with `jobs launch` or
        # `sky serve up`.
        if request_options is None:
            return sky.MutatedUserRequest(
                task=user_request.task,
                skypilot_config=user_request.skypilot_config)

        # Get the cluster record to operate on.
        cluster_name = request_options.cluster_name
        cluster_records: List[responses.StatusResponse] = []
        if cluster_name is not None:
            try:
                cluster_records = sky.get(
                    sky.status([cluster_name],
                               refresh=common.StatusRefreshMode.AUTO,
                               all_users=True))
            except Exception as e:
                ...

The problem with this admin policy as it exists now is twofold:

  1. We're using some bespoke internal behavior to determine when to run the logic, which can work for skypilot developers but hard for inexperienced users to figure out.
  2. This admin policy still runs in all three of "Validate", "Optimize" and "Launch" requests. Given this admin policy is expensive due to the status call, this policy can add significant unnecessary latency to a launch request.

By adding request name, we solve both of these problems:

class EnforceAutostopPolicy(sky.AdminPolicy):
    """Example policy: enforce autostop for all tasks."""

    @classmethod
    def validate_and_mutate(
            cls, user_request: sky.UserRequest) -> sky.MutatedUserRequest:
        """Enforces autostop for all tasks.

        Note that with this policy enforced, users can still change the autostop
        setting for an existing cluster by using `sky autostop`.

        Since we refresh the cluster status with `sky.status` whenever this
        policy is applied, we should expect a few seconds latency when a user
        run a request.
        """
        if user_request.request_name not in [
            sky.AdminPolicyRequestName.CLUSTER_LAUNCH,
            sky.AdminPolicyRequestName.CLUSTER_EXEC,
        ]:
            return sky.MutatedUserRequest(
                task=user_request.task,
                skypilot_config=user_request.skypilot_config)

        request_options = user_request.request_options
        # Request options is not None when a task is executed with `sky launch`.
        assert request_options is not None
        # Get the cluster record to operate on.
        cluster_name = request_options.cluster_name
        cluster_records: List[responses.StatusResponse] = []
        if cluster_name is not None:
            try:
                cluster_records = sky.get(
                    sky.status([cluster_name],
                               refresh=common.StatusRefreshMode.AUTO,
                               all_users=True))
            except Exception as e:
                ...

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@SeungjinYang
Copy link
Collaborator Author

SeungjinYang commented Nov 4, 2025

/smoke-test --kubernetes https://buildkite.com/skypilot-1/smoke-tests/builds/5185 (passed)

@SeungjinYang SeungjinYang marked this pull request as ready for review November 4, 2025 01:15
@SeungjinYang SeungjinYang requested a review from aylei November 4, 2025 01:15
@SeungjinYang
Copy link
Collaborator Author

SeungjinYang commented Nov 4, 2025

/smoke-test --aws https://buildkite.com/skypilot-1/smoke-tests/builds/5190 (passed)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants