Conversation

andreyvelich (Member) commented Oct 13, 2025

Part of: #46
Depends on: kubeflow/katib#2579

I've added initial support for hyperparameter optimization with OptimizerClient() to the Kubeflow SDK.
This PR also introduces some refactoring to reuse code across TrainerClient() and OptimizerClient().

Working example:

# Imports (paths assumed from this PR's package layout; check the diff for the exact ones):
from kubeflow.optimizer import OptimizerClient, Search
from kubeflow.trainer import CustomTrainer, TrainJobTemplate


def train_func(lr: str, num_epochs: str):
    import time
    import random

    for i in range(10):
        time.sleep(1)
        print(f"Training {i}, lr: {lr}, num_epochs: {num_epochs}")

    print(f"loss={round(random.uniform(0.77, 0.99), 2)}")

OptimizerClient().optimize(
    TrainJobTemplate(
        trainer=CustomTrainer(train_func, num_nodes=2),
    ),
    search_space={
        "lr": Search.loguniform(0.01, 0.05),
        "num_epochs": Search.choice([2, 4, 5]),
    },
)
Generated Katib Experiment:
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: o759d77408d2
  namespace: default
spec:
  algorithm:
    algorithmName: random
  maxTrialCount: 10
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    metricStrategies:
      - name: loss
        value: min
    objectiveMetricName: loss
    type: minimize
  parallelTrialCount: 1
  parameters:
    - feasibleSpace:
        distribution: logUniform
        max: "0.05"
        min: "0.01"
      name: lr
      parameterType: double
    - feasibleSpace:
        distribution: uniform
        list:
          - "2"
          - "4"
          - "5"
      name: num_epochs
      parameterType: categorical
  resumePolicy: Never
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: node
    primaryPodLabels:
      batch.kubernetes.io/job-completion-index: "0"
      jobset.sigs.k8s.io/replicatedjob-name: node
    retain: true
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
      - name: lr
        reference: lr
      - name: num_epochs
        reference: num_epochs
    trialSpec:
      apiVersion: trainer.kubeflow.org/v1alpha1
      kind: TrainJob
      spec:
        runtimeRef:
          name: torch-distributed
        trainer:
          command:
            - bash
            - -c
            - |2-

              read -r -d '' SCRIPT << EOM

              def train_func(lr: str, num_epochs: str):
                  import time
                  import random

                  for i in range(10):
                      time.sleep(1)
                      print(f"Training {i}, lr: {lr}, num_epochs: {num_epochs}")

                  print(f"loss={round(random.uniform(0.77, 0.99), 2)}")

              train_func(**{'lr': '${trialParameters.lr}', 'num_epochs': '${trialParameters.num_epochs}'})

              EOM
              printf "%s" "$SCRIPT" > "test-iceberg.py"
              torchrun "test-iceberg.py"
          numNodes: 2
status:
  completionTime: "2025-10-13T21:48:41Z"
  conditions:
    - lastTransitionTime: "2025-10-13T21:45:37Z"
      lastUpdateTime: "2025-10-13T21:45:37Z"
      message: Experiment is created
      reason: ExperimentCreated
      status: "True"
      type: Created
    - lastTransitionTime: "2025-10-13T21:48:41Z"
      lastUpdateTime: "2025-10-13T21:48:41Z"
      message: Experiment is running
      reason: ExperimentRunning
      status: "False"
      type: Running
    - lastTransitionTime: "2025-10-13T21:48:41Z"
      lastUpdateTime: "2025-10-13T21:48:41Z"
      message: Experiment has succeeded because max trial count has reached
      reason: ExperimentMaxTrialsReached
      status: "True"
      type: Succeeded
  currentOptimalTrial:
    bestTrialName: o759d77408d2-lfcqff79
    observation:
      metrics:
        - latest: "0.85"
          max: "0.98"
          min: "0.77"
          name: loss
    parameterAssignments:
      - name: lr
        value: "0.018571949792818013"
      - name: num_epochs
        value: "5"
  startTime: "2025-10-13T21:45:37Z"
  succeededTrialList:
    - o759d77408d2-lfcqff79
    - o759d77408d2-qwbkwc9n
    - o759d77408d2-jhqgmnm6
    - o759d77408d2-xjk86z66
    - o759d77408d2-g8mr72v7
    - o759d77408d2-5s2mqftm
    - o759d77408d2-86p9bw4r
    - o759d77408d2-28d5gd8f
    - o759d77408d2-m8gq4pcn
    - o759d77408d2-kxg6f45v
  trials: 10
  trialsSucceeded: 10

/assign @kubeflow/kubeflow-sdk-team @akshaychitneni

TODO items:

  • Add get_job() API
  • Add list_jobs() API
  • Add delete_job() API

coveralls commented Oct 13, 2025

Pull Request Test Coverage Report for Build 18959385335

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 32 of 38 (84.21%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+6.2%) to 79.621%

File With Changes Missing Coverage                Covered Lines  Changed/Added Lines  %
kubeflow/trainer/backends/kubernetes/backend.py   32             38                   84.21%

Totals:

  • Change from base Build 18655221655: +6.2%
  • Covered Lines: 168
  • Relevant Lines: 211

💛 - Coveralls

@andreyvelich andreyvelich marked this pull request as draft October 14, 2025 02:14
@andreyvelich andreyvelich marked this pull request as ready for review October 14, 2025 14:54
andreyvelich (Member, Author) commented Oct 14, 2025

I have implemented create_job(), get_job(), list_jobs(), and delete_job() APIs for OptimizerClient().
Please take a look at this PR.
/cc @kubeflow/kubeflow-sdk-team @briangallagher @Fiona-Waters @abhijeet-dhumal @anencore94 @jskswamy @franciscojavierarceo
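
For anyone trying these out, a minimal usage sketch of the new job APIs (the import path and keyword names are assumptions; check the PR diff for the exact signatures):

from kubeflow.optimizer import OptimizerClient  # path assumed

client = OptimizerClient()

# Fetch the OptimizationJob that optimize() created; the name here is the
# auto-generated one from the example above, used purely for illustration.
job = client.get_job(name="o759d77408d2")
print(job.status)

# List all OptimizationJobs in the client's namespace.
for j in client.list_jobs():
    print(j.name, j.status)

# Clean up once results have been collected.
client.delete_job(name="o759d77408d2")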

google-oss-prow commented:

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: kubeflow/kubeflow-sdk-team, Fiona-Waters, abhijeet-dhumal, jskswamy.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

I have implemented create_job(), get_job(), list_jobs(), and delete_job() APIs for OptimizerClient().
Please take a look at this PR.
/cc @kubeflow/kubeflow-sdk-team @briangallagher @Fiona-Waters @abhijeet-dhumal @anencore94 @jskswamy

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andreyvelich andreyvelich changed the title feat(optimizer): Hyperparameter Optimization APIs in Kubeflow SDK feat(api): Hyperparameter Optimization APIs in Kubeflow SDK Oct 14, 2025
@andreyvelich andreyvelich changed the title feat(api): Hyperparameter Optimization APIs in Kubeflow SDK feat: Hyperparameter Optimization APIs in Kubeflow SDK Oct 14, 2025
andreyvelich (Member, Author):

cc @helenxie-bit @mahdikhashan

trial_config: Optional[TrialConfig] = None,
search_space: dict[str, Any],
objectives: Optional[list[Objective]] = None,
algorithm: Optional[RandomSearch] = None,
Contributor:

Should we consider adding options already?

Member Author:

Let's add it in a follow-up PR, since we want to limit the number of APIs users can configure initially for the Experiment CR.

Contributor:

Sounds good!


logger.debug(f"OptimizationJob {self.namespace}/{name} has been deleted")

def __get_optimization_job_from_crd(
Contributor:

Suggested change
def __get_optimization_job_from_crd(
def __get_optimization_job_from_custom_resource(


def __get_optimization_job_from_crd(
self,
optimization_job_crd: models.V1beta1Experiment,
Contributor:

Suggested change
optimization_job_crd: models.V1beta1Experiment,
optimization_job_cr: models.V1beta1Experiment,

from kubeflow.optimizer.types.optimization_types import Objective, OptimizationJob, TrialConfig


class ExecutionBackend(abc.ABC):
Contributor:

Suggested change
class ExecutionBackend(abc.ABC):
class RuntimeBackend(abc.ABC):

Contributor:

Or:

Suggested change
class ExecutionBackend(abc.ABC):
class OptimizerBackend(abc.ABC):

Member Author:

We previously agreed on ExecutionBackend here: #34 (comment) with @kramaranya and @szaher.
Do you prefer a different name for it, @astefanutti?

Contributor:

@andreyvelich that's not a big deal, ExecutionBackend is fine. RuntimeBackend seems more general, as it also covers resources and not only the "execution", such as the job "registry" (etcd for Kubernetes).

Member Author:

That sounds good!
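
For readers following the naming discussion, a minimal sketch of what the abstract backend might look like (method names are assumptions, not the PR's actual interface):

import abc

from kubeflow.optimizer.types.optimization_types import OptimizationJob


class ExecutionBackend(abc.ABC):
    """Abstract backend that OptimizerClient delegates to (sketch)."""

    @abc.abstractmethod
    def optimize(self, trial_template, search_space, **kwargs) -> str:
        """Create an OptimizationJob and return its name."""

    @abc.abstractmethod
    def get_job(self, name: str) -> OptimizationJob:
        """Fetch a single OptimizationJob."""

    @abc.abstractmethod
    def list_jobs(self) -> list[OptimizationJob]:
        """List OptimizationJobs in the configured namespace."""

    @abc.abstractmethod
    def delete_job(self, name: str) -> None:
        """Delete an OptimizationJob."""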

EXPERIMENT_SUCCEEDED = "Succeeded"

# Label to identify Experiment's resources.
EXPERIMENT_LABEL = "katib.kubeflow.org/experiment"
Contributor:

Should we start using optimizer.kubeflow.org?

Member Author:

Since we rely on the Katib Experiment CRD for now, we can't use the new labels yet.
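
As an illustration of how such a label is typically used to locate an Experiment's resources (a sketch using the standard Kubernetes Python client; the namespace and experiment name are taken from the example above):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Find all trial pods belonging to a given Experiment via the label above.
pods = v1.list_namespaced_pod(
    namespace="default",
    label_selector="katib.kubeflow.org/experiment=o759d77408d2",
)
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)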

Member:

So, do we have plans to implement OptimizerRuntime and let OptimizerJob override it in the future?

Member Author:

I don't think we need OptimizerRuntime, since OptimizerJob should natively integrate with TrainingRuntime.

Electronic-Waste (Member) left a comment:

@andreyvelich Thanks for this. I left my initial question. :)

steps: list[Step]
num_nodes: int
status: str = constants.UNKNOWN
creation_timestamp: datetime
Member:

Why do we need creation_timestamp? Shouldn't it be added automatically in the creation phase?

Member Author:

It does. We set this property from the Experiment.metadata.creation_timestamp:

creation_timestamp=optimization_job_cr.metadata.creation_timestamp,
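
In other words, roughly (a simplified sketch; the real type also carries steps, num_nodes, etc., and the timestamp comes from the Experiment custom resource rather than being set by the caller):

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class OptimizationJob:  # simplified sketch of the type discussed above
    name: str
    status: str
    creation_timestamp: datetime


# The backend copies Experiment.metadata.creation_timestamp into this field;
# a stand-in value is used here so the snippet runs on its own.
job = OptimizationJob(
    name="o759d77408d2",
    status="Created",
    creation_timestamp=datetime.now(timezone.utc),
)
print(job.creation_timestamp)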

andreyvelich (Member, Author):

@kramaranya @Electronic-Waste @astefanutti Any additional comments before we move forward with the initial support of HPO in the Kubeflow SDK?

Signed-off-by: Andrey Velichkevich <[email protected]>
kramaranya (Contributor) left a comment:

Thank you so much @andreyvelich for this great work!!
I left a few comments.

pyproject.toml Outdated
"pydantic>=2.10.0",
"kubeflow-trainer-api>=2.0.0",
# TODO (andreyvelich): Switch to kubeflow-katib-api once it is published.
"kubeflow_katib_api@git+https://github.com/kramaranya/katib.git@separate-models-from-sdk#subdirectory=api/python_api",
Contributor:

Since it has been merged, we can update that with the katib ref instead of the fork. Or shall we cut a new Katib release and publish those models to PyPI?

andreyvelich (Member, Author) commented Oct 29, 2025:

Yes, I will replace it once we release 0.19.

initializers.
"""

trainer: CustomTrainer
Contributor:

Why don't we support BuiltinTrainer initially? Is it due to metrics collection?

Member Author:

Yes, since we don't have access to the BuiltinTrainer scripts, we can't guarantee the format in which metrics will be printed. In future iterations, let's discuss how we can integrate BuiltinTrainers as well.

Contributor:

I see, thanks for explaining
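
For context, Katib's StdOut collector by default parses the trial's logs for lines of the form <name>=<value>, which is what the example's train_func relies on, and why arbitrary BuiltinTrainer output can't be scraped reliably:

import random

# With the default StdOut collector, each objective metric must be printed
# as "<name>=<value>" on its own line so Katib can parse it from the logs.
val_loss = round(random.uniform(0.77, 0.99), 2)
print(f"loss={val_loss}")  # matches objectiveMetricName "loss" in the Experiment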

# Import the Kubeflow Trainer types.
from kubeflow.trainer.types.types import TrainJobTemplate

__all__ = [
Contributor:

Shall we add GridSearch here?

Member Author:

Good catch!
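
That is, the package __init__ would export it alongside the others (a sketch; the module path for the algorithm types is an assumption):

# kubeflow/optimizer/__init__.py (sketch)
from kubeflow.optimizer.types.optimization_types import GridSearch, RandomSearch

# Import the Kubeflow Trainer types.
from kubeflow.trainer.types.types import TrainJobTemplate

__all__ = [
    "GridSearch",
    "RandomSearch",
    "TrainJobTemplate",
]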

"""
# Set the default backend config.
if not backend_config:
backend_config = KubernetesBackendConfig()
Contributor:

nit: just for consistency, shall we match trainer and use the same import style:

if not backend_config:
    backend_config = common_types.KubernetesBackendConfig()

Member Author:

Let me go the other way around, though.


logger.debug(f"OptimizationJob {self.namespace}/{name} has been deleted")

def __get_optimization_job_from_custom_resource(
Contributor:

To align with trainer, should we update this?

Suggested change
def __get_optimization_job_from_custom_resource(
def __get_optimization_job_from_cr(


except multiprocessing.TimeoutError as e:
raise TimeoutError(
f"Timeout to list OptimizationJobs in namespace: {self.namespace}"
Contributor:

Can we add OptimizationJob to constants instead?
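
i.e., something along these lines (the constant name is hypothetical):

# kubeflow/optimizer/constants/constants.py (sketch)
OPTIMIZATION_JOB_KIND = "OptimizationJob"

# At the raise site, instead of the hard-coded string:
namespace = "default"
raise TimeoutError(f"Timeout to list {OPTIMIZATION_JOB_KIND}s in namespace: {namespace}")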

# Trainer function arguments for the appropriate substitution.
parameters_spec = []
trial_parameters = []
trial_template.trainer.func_args = {}
Contributor:

Would this not overwrite existing func_args?

Member Author:

It will; good point.
I am curious whether we want to merge them, e.g. we should allow users to do something like this:

func_args={
    "lr": 0.1,
    "num_epochs": 5,
},
search_space={
    "num_epochs": Search.choice([1, 2]),
},

WDYT @kramaranya @astefanutti @Electronic-Waste ?

Contributor:

That sounds good to me

Member Author:

Change it to:

if trial_template.trainer.func_args is None:
    trial_template.trainer.func_args = {}
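
A quick sketch of the merge semantics discussed above (assuming search-space entries take precedence over static func_args with the same name, mirroring the ${trialParameters.*} substitution in the generated Experiment):

# Static arguments the user always passes, plus a search space for the rest.
func_args = {"lr": 0.1, "num_epochs": 5}
search_space_keys = ["num_epochs"]  # stand-in for search_space.keys()

# Search-space parameters are rewritten to Katib trial-parameter placeholders
# and override any static value with the same name.
merged = {**func_args, **{k: f"${{trialParameters.{k}}}" for k in search_space_keys}}
print(merged)  # {'lr': 0.1, 'num_epochs': '${trialParameters.num_epochs}'}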

trial_config: Optional[TrialConfig] = None,
search_space: dict[str, Any],
objectives: Optional[list[Objective]] = None,
algorithm: Optional[RandomSearch] = None,
Contributor:

I wonder whether we should accept a base type instead, so any algorithm works without changing the API in the future?

Suggested change
algorithm: Optional[RandomSearch] = None,
algorithm: Optional[BaseAlgorithm] = None,

Member Author:

Good point!
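
A sketch of what that base type could look like (field names are assumptions, not the PR's actual definitions):

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BaseAlgorithm:
    """Base type that every search algorithm would extend (sketch)."""

    name: str = "random"
    settings: dict[str, str] = field(default_factory=dict)


@dataclass
class RandomSearch(BaseAlgorithm):
    name: str = "random"
    random_state: Optional[int] = None


@dataclass
class GridSearch(BaseAlgorithm):
    name: str = "grid"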

Comment on lines +30 to +31
MAXIMIZE = "maximize"
MINIMIZE = "minimize"
Contributor:

What do you think about adding "max" and "min" aliases?

Contributor:

Sure, sounds good to me. I was just wondering if it makes sense to support both "max" and "maximize" as options, since Ray Tune, for example, uses "max": https://docs.ray.io/en/latest/tune/api/doc/ray.tune.run.html

andreyvelich (Member, Author) commented Oct 31, 2025:

Sounds good, maybe we should do something like this for this enum?

    @classmethod
    def from_str(cls, value: str):
        value = value.lower()
        aliases = {
            "max": "maximize",
            "maximize": "maximize",
            "min": "minimize",
            "minimize": "minimize",
        }
        return cls(aliases[value])

@kramaranya Can you create an issue to track this improvement ?
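
A self-contained version of that sketch (the enum name ObjectiveType is assumed here for illustration):

from enum import Enum


class ObjectiveType(str, Enum):  # name assumed; mirrors the constants above
    MAXIMIZE = "maximize"
    MINIMIZE = "minimize"

    @classmethod
    def from_str(cls, value: str) -> "ObjectiveType":
        aliases = {
            "max": "maximize",
            "maximize": "maximize",
            "min": "minimize",
            "minimize": "minimize",
        }
        return cls(aliases[value.lower()])


assert ObjectiveType.from_str("max") is ObjectiveType.MAXIMIZE
assert ObjectiveType.from_str("MINIMIZE") is ObjectiveType.MINIMIZE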

vrd-cse added a commit to vrd-cse/kube_sdk_vrd that referenced this pull request Oct 29, 2025
This adds comprehensive unit tests for the OptimizerClient methods in
KubernetesBackend, mirroring the existing test structure used for
TrainerClient in backend_test.py.

Tests cover:
- get_runtime (success, timeout, runtime error)
- list_runtimes
- optimize (study creation)
- get_study
- list_studies
- get_study_logs
- delete_study

All tests use pytest fixtures and mock Kubernetes API calls.
closes kubeflow#124
Rename common types

Signed-off-by: Andrey Velichkevich <[email protected]>
kramaranya (Contributor):

I'm wondering whether we should introduce OWNERS files for sub-packages like optimizer and trainer. I think this was mentioned in the KEP. It doesn't need to be in this PR, but I wanted to raise it since we now have two subpackages.

andreyvelich (Member, Author):

Yes, I think eventually we should do that.
I hope we can find more maintainers who can be responsible for the specific sub-projects.

Signed-off-by: Andrey Velichkevich <[email protected]>
andreyvelich (Member, Author):

@kramaranya CI should be working now.

kramaranya (Contributor):

Thank you for this great work, @andreyvelich! 🎉
/lgtm

Signed-off-by: Andrey Velichkevich <[email protected]>
@google-oss-prow google-oss-prow bot removed the lgtm label Oct 31, 2025
google-oss-prow commented:

New changes are detected. LGTM label has been removed.

andreyvelich (Member, Author):

Thanks everyone!
/approve

andreyvelich (Member, Author):

/approve

google-oss-prow commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

  • Approvers can indicate their approval by writing /approve in a comment
  • Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 7d682fd into kubeflow:main Oct 31, 2025
10 checks passed
@google-oss-prow google-oss-prow bot added this to the v0.2 milestone Oct 31, 2025
@andreyvelich andreyvelich deleted the hpo-support branch October 31, 2025 15:32