Conversation

andreyvelich (Member) commented Oct 13, 2025

Part of: #46
Depends on: kubeflow/katib#2579

I've added initial support for hyperparameter optimization with OptimizerClient() to the Kubeflow SDK.
This PR also introduces some refactoring to reuse code across TrainerClient() and OptimizerClient().

Working example:

# Imports (paths assumed from this PR's package layout; check the diff for the exact ones):
from kubeflow.optimizer import OptimizerClient, Search
from kubeflow.trainer import CustomTrainer, TrainJobTemplate


def train_func(lr: str, num_epochs: str):
    import time
    import random

    for i in range(10):
        time.sleep(1)
        print(f"Training {i}, lr: {lr}, num_epochs: {num_epochs}")

    print(f"loss={round(random.uniform(0.77, 0.99), 2)}")

OptimizerClient().optimize(
    TrainJobTemplate(
        trainer=CustomTrainer(train_func, num_nodes=2),
    ),
    search_space={
        "lr": Search.loguniform(0.01, 0.05),
        "num_epochs": Search.choice([2, 4, 5]),
    },
)
Generated Katib Experiment:
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: o759d77408d2
  namespace: default
spec:
  algorithm:
    algorithmName: random
  maxTrialCount: 10
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    metricStrategies:
      - name: loss
        value: min
    objectiveMetricName: loss
    type: minimize
  parallelTrialCount: 1
  parameters:
    - feasibleSpace:
        distribution: logUniform
        max: "0.05"
        min: "0.01"
      name: lr
      parameterType: double
    - feasibleSpace:
        distribution: uniform
        list:
          - "2"
          - "4"
          - "5"
      name: num_epochs
      parameterType: categorical
  resumePolicy: Never
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: node
    primaryPodLabels:
      batch.kubernetes.io/job-completion-index: "0"
      jobset.sigs.k8s.io/replicatedjob-name: node
    retain: true
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
      - name: lr
        reference: lr
      - name: num_epochs
        reference: num_epochs
    trialSpec:
      apiVersion: trainer.kubeflow.org/v1alpha1
      kind: TrainJob
      spec:
        runtimeRef:
          name: torch-distributed
        trainer:
          command:
            - bash
            - -c
            - |2-

              read -r -d '' SCRIPT << EOM

              def train_func(lr: str, num_epochs: str):
                  import time
                  import random

                  for i in range(10):
                      time.sleep(1)
                      print(f"Training {i}, lr: {lr}, num_epochs: {num_epochs}")

                  print(f"loss={round(random.uniform(0.77, 0.99), 2)}")

              train_func(**{'lr': '${trialParameters.lr}', 'num_epochs': '${trialParameters.num_epochs}'})

              EOM
              printf "%s" "$SCRIPT" > "test-iceberg.py"
              torchrun "test-iceberg.py"
          numNodes: 2
status:
  completionTime: "2025-10-13T21:48:41Z"
  conditions:
    - lastTransitionTime: "2025-10-13T21:45:37Z"
      lastUpdateTime: "2025-10-13T21:45:37Z"
      message: Experiment is created
      reason: ExperimentCreated
      status: "True"
      type: Created
    - lastTransitionTime: "2025-10-13T21:48:41Z"
      lastUpdateTime: "2025-10-13T21:48:41Z"
      message: Experiment is running
      reason: ExperimentRunning
      status: "False"
      type: Running
    - lastTransitionTime: "2025-10-13T21:48:41Z"
      lastUpdateTime: "2025-10-13T21:48:41Z"
      message: Experiment has succeeded because max trial count has reached
      reason: ExperimentMaxTrialsReached
      status: "True"
      type: Succeeded
  currentOptimalTrial:
    bestTrialName: o759d77408d2-lfcqff79
    observation:
      metrics:
        - latest: "0.85"
          max: "0.98"
          min: "0.77"
          name: loss
    parameterAssignments:
      - name: lr
        value: "0.018571949792818013"
      - name: num_epochs
        value: "5"
  startTime: "2025-10-13T21:45:37Z"
  succeededTrialList:
    - o759d77408d2-lfcqff79
    - o759d77408d2-qwbkwc9n
    - o759d77408d2-jhqgmnm6
    - o759d77408d2-xjk86z66
    - o759d77408d2-g8mr72v7
    - o759d77408d2-5s2mqftm
    - o759d77408d2-86p9bw4r
    - o759d77408d2-28d5gd8f
    - o759d77408d2-m8gq4pcn
    - o759d77408d2-kxg6f45v
  trials: 10
  trialsSucceeded: 10

/assign @kubeflow/kubeflow-sdk-team @akshaychitneni

TODO items:

  • Add get_job() API
  • Add list_jobs() API
  • Add delete_job() API

coveralls commented Oct 13, 2025

Pull Request Test Coverage Report for Build 18959385335

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 32 of 38 (84.21%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+6.2%) to 79.621%

File With Changes Missing Coverage                Covered Lines  Changed/Added Lines  %
kubeflow/trainer/backends/kubernetes/backend.py   32             38                   84.21%

Totals:

  • Change from base Build 18655221655: +6.2%
  • Covered Lines: 168
  • Relevant Lines: 211

💛 - Coveralls

@andreyvelich andreyvelich marked this pull request as draft October 14, 2025 02:14
@andreyvelich andreyvelich marked this pull request as ready for review October 14, 2025 14:54
andreyvelich (Member, Author) commented Oct 14, 2025

I have implemented create_job(), get_job(), list_jobs(), and delete_job() APIs for OptimizerClient().
Please take a look at this PR.
/cc @kubeflow/kubeflow-sdk-team @briangallagher @Fiona-Waters @abhijeet-dhumal @anencore94 @jskswamy @franciscojavierarceo
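
For anyone trying these out, a minimal usage sketch of the new job APIs (the import path and keyword names are assumptions; check the PR diff for the exact signatures):

from kubeflow.optimizer import OptimizerClient  # path assumed

client = OptimizerClient()

# Fetch the OptimizationJob that optimize() created; the name here is the
# auto-generated one from the example above, used purely for illustration.
job = client.get_job(name="o759d77408d2")
print(job.status)

# List all OptimizationJobs in the client's namespace.
for j in client.list_jobs():
    print(j.name, j.status)

# Clean up once results have been collected.
client.delete_job(name="o759d77408d2")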

google-oss-prow commented:

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: kubeflow/kubeflow-sdk-team, Fiona-Waters, abhijeet-dhumal, jskswamy.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

I have implemented create_job(), get_job(), list_jobs(), and delete_job() APIs for OptimizerClient().
Please take a look at this PR.
/cc @kubeflow/kubeflow-sdk-team @briangallagher @Fiona-Waters @abhijeet-dhumal @anencore94 @jskswamy

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andreyvelich andreyvelich changed the title feat(optimizer): Hyperparameter Optimization APIs in Kubeflow SDK feat(api): Hyperparameter Optimization APIs in Kubeflow SDK Oct 14, 2025
@andreyvelich andreyvelich changed the title feat(api): Hyperparameter Optimization APIs in Kubeflow SDK feat: Hyperparameter Optimization APIs in Kubeflow SDK Oct 14, 2025
andreyvelich (Member, Author):

cc @helenxie-bit @mahdikhashan

trial_config: Optional[TrialConfig] = None,
search_space: dict[str, Any],
objectives: Optional[list[Objective]] = None,
algorithm: Optional[RandomSearch] = None,
Contributor:

Should we consider adding options already?

Member Author:

Let's add it in a follow-up PR, since we want to limit the number of APIs users can configure initially for the Experiment CR.

Contributor:

Sounds good!


logger.debug(f"OptimizationJob {self.namespace}/{name} has been deleted")

def __get_optimization_job_from_crd(
Contributor:

Suggested change
def __get_optimization_job_from_crd(
def __get_optimization_job_from_custom_resource(


def __get_optimization_job_from_crd(
self,
optimization_job_crd: models.V1beta1Experiment,
Contributor:

Suggested change
optimization_job_crd: models.V1beta1Experiment,
optimization_job_cr: models.V1beta1Experiment,

from kubeflow.optimizer.types.optimization_types import Objective, OptimizationJob, TrialConfig


class ExecutionBackend(abc.ABC):
Contributor:

Suggested change
class ExecutionBackend(abc.ABC):
class RuntimeBackend(abc.ABC):

Contributor:

Or:

Suggested change
class ExecutionBackend(abc.ABC):
class OptimizerBackend(abc.ABC):

Member Author:

We previously agreed on ExecutionBackend here: #34 (comment) with @kramaranya and @szaher.
Do you prefer a different name for it, @astefanutti?

Contributor:

@andreyvelich that's not a big deal, ExecutionBackend is fine. RuntimeBackend seems more general, as it also covers resources and not only the "execution", such as the job "registry" (etcd for Kubernetes).

Member Author:

That sounds good!
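
For readers following the naming discussion, a minimal sketch of what the abstract backend might look like (method names are assumptions, not the PR's actual interface):

import abc

from kubeflow.optimizer.types.optimization_types import OptimizationJob


class ExecutionBackend(abc.ABC):
    """Abstract backend that OptimizerClient delegates to (sketch)."""

    @abc.abstractmethod
    def optimize(self, trial_template, search_space, **kwargs) -> str:
        """Create an OptimizationJob and return its name."""

    @abc.abstractmethod
    def get_job(self, name: str) -> OptimizationJob:
        """Fetch a single OptimizationJob."""

    @abc.abstractmethod
    def list_jobs(self) -> list[OptimizationJob]:
        """List OptimizationJobs in the configured namespace."""

    @abc.abstractmethod
    def delete_job(self, name: str) -> None:
        """Delete an OptimizationJob."""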

EXPERIMENT_SUCCEEDED = "Succeeded"

# Label to identify Experiment's resources.
EXPERIMENT_LABEL = "katib.kubeflow.org/experiment"
Contributor:

Should we start using optimizer.kubeflow.org?

Member Author:

Since we rely on the Katib Experiment CRD for now, we can't use the new labels yet.
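
As an illustration of how such a label is typically used to locate an Experiment's resources (a sketch using the standard Kubernetes Python client; the namespace and experiment name are taken from the example above):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Find all trial pods belonging to a given Experiment via the label above.
pods = v1.list_namespaced_pod(
    namespace="default",
    label_selector="katib.kubeflow.org/experiment=o759d77408d2",
)
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)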

Member:

So, do we have plans to implement OptimizerRuntime and let OptimizerJob override it in the future?

Member Author:

I don't think we need OptimizerRuntime, since OptimizerJob should natively integrate with TrainingRuntime.

Electronic-Waste (Member) left a comment:

@andreyvelich Thanks for this. I left my initial question. :)

steps: list[Step]
num_nodes: int
status: str = constants.UNKNOWN
creation_timestamp: datetime
Member:

Why do we need creation_timestamp? Shouldn't it be added automatically in the creation phase?

Member Author:

It does. We set this property from the Experiment.metadata.creation_timestamp:

creation_timestamp=optimization_job_cr.metadata.creation_timestamp,
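
In other words, roughly (a simplified sketch; the real type also carries steps, num_nodes, etc., and the timestamp comes from the Experiment custom resource rather than being set by the caller):

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class OptimizationJob:  # simplified sketch of the type discussed above
    name: str
    status: str
    creation_timestamp: datetime


# The backend copies Experiment.metadata.creation_timestamp into this field;
# a stand-in value is used here so the snippet runs on its own.
job = OptimizationJob(
    name="o759d77408d2",
    status="Created",
    creation_timestamp=datetime.now(timezone.utc),
)
print(job.creation_timestamp)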

andreyvelich (Member, Author):

@kramaranya @Electronic-Waste @astefanutti Any additional comments before we move forward with the initial support of HPO in the Kubeflow SDK?

Signed-off-by: Andrey Velichkevich <[email protected]>
kramaranya (Contributor) left a comment:

Thank you so much @andreyvelich for this great work!!
I left a few comments.

pyproject.toml Outdated
"pydantic>=2.10.0",
"kubeflow-trainer-api>=2.0.0",
# TODO (andreyvelich): Switch to kubeflow-katib-api once it is published.
"kubeflow_katib_api@git+https://github.com/kramaranya/katib.git@separate-models-from-sdk#subdirectory=api/python_api",
Contributor:

Since it has been merged, we can update that with the katib ref instead of the fork. Or shall we cut a new Katib release and publish those models to PyPI?

andreyvelich (Member, Author) commented Oct 29, 2025:

Yes, I will replace it once we release 0.19.

initializers.
"""

trainer: CustomTrainer
Contributor:

Why don't we support BuiltinTrainer initially? Is it due to metrics collection?

Member Author:

Yes, since we don't have access to the BuiltinTrainer scripts, we can't guarantee the format in which metrics will be printed. In future iterations, let's discuss how we can integrate BuiltinTrainers as well.

Contributor:

I see, thanks for explaining
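
For context, Katib's StdOut collector by default parses the trial's logs for lines of the form <name>=<value>, which is what the example's train_func relies on, and why arbitrary BuiltinTrainer output can't be scraped reliably:

import random

# With the default StdOut collector, each objective metric must be printed
# as "<name>=<value>" on its own line so Katib can parse it from the logs.
val_loss = round(random.uniform(0.77, 0.99), 2)
print(f"loss={val_loss}")  # matches objectiveMetricName "loss" in the Experiment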

# Import the Kubeflow Trainer types.
from kubeflow.trainer.types.types import TrainJobTemplate

__all__ = [
Contributor:

Shall we add GridSearch here?

Member Author:

Good catch!
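
That is, the package __init__ would export it alongside the others (a sketch; the module path for the algorithm types is an assumption):

# kubeflow/optimizer/__init__.py (sketch)
from kubeflow.optimizer.types.optimization_types import GridSearch, RandomSearch

# Import the Kubeflow Trainer types.
from kubeflow.trainer.types.types import TrainJobTemplate

__all__ = [
    "GridSearch",
    "RandomSearch",
    "TrainJobTemplate",
]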

"""
# Set the default backend config.
if not backend_config:
backend_config = KubernetesBackendConfig()
Contributor:

nit: just for consistency, shall we match trainer and use the same import style:

if not backend_config:
    backend_config = common_types.KubernetesBackendConfig()

Member Author:

Let me go the other way around, though.


logger.debug(f"OptimizationJob {self.namespace}/{name} has been deleted")

def __get_optimization_job_from_custom_resource(
Contributor:

To align with trainer, should we update this?

Suggested change
def __get_optimization_job_from_custom_resource(
def __get_optimization_job_from_cr(


except multiprocessing.TimeoutError as e:
raise TimeoutError(
f"Timeout to list OptimizationJobs in namespace: {self.namespace}"
Contributor:

Can we add OptimizationJob to constants instead?
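
i.e., something along these lines (the constant name is hypothetical):

# kubeflow/optimizer/constants/constants.py (sketch)
OPTIMIZATION_JOB_KIND = "OptimizationJob"

# At the raise site, instead of the hard-coded string:
namespace = "default"
raise TimeoutError(f"Timeout to list {OPTIMIZATION_JOB_KIND}s in namespace: {namespace}")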

# Trainer function arguments for the appropriate substitution.
parameters_spec = []
trial_parameters = []
trial_template.trainer.func_args = {}
Contributor:

Would this not overwrite existing func_args?

Member Author:

It will; good point.
I am curious whether we want to merge them, e.g. we should allow users to do something like this:

func_args={
    "lr": 0.1,
    "num_epochs": 5,
},
search_space={
    "num_epochs": Search.choice([1, 2]),
},

WDYT @kramaranya @astefanutti @Electronic-Waste ?

Contributor:

That sounds good to me

Member Author:

Change it to:

if trial_template.trainer.func_args is None:
    trial_template.trainer.func_args = {}
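
A quick sketch of the merge semantics discussed above (assuming search-space entries take precedence over static func_args with the same name, mirroring the ${trialParameters.*} substitution in the generated Experiment):

# Static arguments the user always passes, plus a search space for the rest.
func_args = {"lr": 0.1, "num_epochs": 5}
search_space_keys = ["num_epochs"]  # stand-in for search_space.keys()

# Search-space parameters are rewritten to Katib trial-parameter placeholders
# and override any static value with the same name.
merged = {**func_args, **{k: f"${{trialParameters.{k}}}" for k in search_space_keys}}
print(merged)  # {'lr': 0.1, 'num_epochs': '${trialParameters.num_epochs}'}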

trial_config: Optional[TrialConfig] = None,
search_space: dict[str, Any],
objectives: Optional[list[Objective]] = None,
algorithm: Optional[RandomSearch] = None,
Contributor:

I wonder whether we should accept a base type instead, so any algorithm works without changing the API in the future?

Suggested change
algorithm: Optional[RandomSearch] = None,
algorithm: Optional[BaseAlgorithm] = None,

Member Author:

Good point!
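
A sketch of what that base type could look like (field names are assumptions, not the PR's actual definitions):

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BaseAlgorithm:
    """Base type that every search algorithm would extend (sketch)."""

    name: str = "random"
    settings: dict[str, str] = field(default_factory=dict)


@dataclass
class RandomSearch(BaseAlgorithm):
    name: str = "random"
    random_state: Optional[int] = None


@dataclass
class GridSearch(BaseAlgorithm):
    name: str = "grid"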

Comment on lines +30 to +31
MAXIMIZE = "maximize"
MINIMIZE = "minimize"
Contributor:

What do you think about adding "max" and "min" aliases?

Contributor:

Sure, sounds good to me. I was just wondering if it makes sense to support both "max" and "maximize" as options, since Ray Tune, for example, uses "max": https://docs.ray.io/en/latest/tune/api/doc/ray.tune.run.html

andreyvelich (Member, Author) commented Oct 31, 2025:

Sounds good, maybe we should do something like this for this enum?

    @classmethod
    def from_str(cls, value: str):
        value = value.lower()
        aliases = {
            "max": "maximize",
            "maximize": "maximize",
            "min": "minimize",
            "minimize": "minimize",
        }
        return cls(aliases[value])

@kramaranya Can you create an issue to track this improvement ?
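
A self-contained version of that sketch (the enum name ObjectiveType is assumed here for illustration):

from enum import Enum


class ObjectiveType(str, Enum):  # name assumed; mirrors the constants above
    MAXIMIZE = "maximize"
    MINIMIZE = "minimize"

    @classmethod
    def from_str(cls, value: str) -> "ObjectiveType":
        aliases = {
            "max": "maximize",
            "maximize": "maximize",
            "min": "minimize",
            "minimize": "minimize",
        }
        return cls(aliases[value.lower()])


assert ObjectiveType.from_str("max") is ObjectiveType.MAXIMIZE
assert ObjectiveType.from_str("MINIMIZE") is ObjectiveType.MINIMIZE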

vrd-cse added a commit to vrd-cse/kube_sdk_vrd that referenced this pull request Oct 29, 2025
This adds comprehensive unit tests for the OptimizerClient methods in
KubernetesBackend, mirroring the existing test structure used for
TrainerClient in backend_test.py.

Tests cover:
- get_runtime (success, timeout, runtime error)
- list_runtimes
- optimize (study creation)
- get_study
- list_studies
- get_study_logs
- delete_study

All tests use pytest fixtures and mock Kubernetes API calls.
closes kubeflow#124
Rename common types

Signed-off-by: Andrey Velichkevich <[email protected]>
kramaranya (Contributor):

I'm wondering whether we should introduce OWNERS files for sub-packages like optimizer and trainer. I think this was mentioned in the KEP. It doesn't need to be in this PR, but I wanted to raise it since we now have two subpackages.

andreyvelich (Member, Author):

Yes, I think eventually we should do that.
I hope we can find more maintainers who can be responsible for the specific sub-projects.

Signed-off-by: Andrey Velichkevich <[email protected]>
andreyvelich (Member, Author):

@kramaranya CI should be working now.

kramaranya (Contributor):

Thank you for this great work, @andreyvelich! 🎉
/lgtm

Signed-off-by: Andrey Velichkevich <[email protected]>
@google-oss-prow google-oss-prow bot removed the lgtm label Oct 31, 2025
google-oss-prow commented:

New changes are detected. LGTM label has been removed.

andreyvelich (Member, Author):

Thanks everyone!
/approve

andreyvelich (Member, Author):

/approve

google-oss-prow commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

  • Approvers can indicate their approval by writing /approve in a comment
  • Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 7d682fd into kubeflow:main Oct 31, 2025
10 checks passed
@google-oss-prow google-oss-prow bot added this to the v0.2 milestone Oct 31, 2025
@andreyvelich andreyvelich deleted the hpo-support branch October 31, 2025 15:32