
Commit 7d682fd

feat: Hyperparameter Optimization APIs in Kubeflow SDK (#124)
* Init commit
* Create optimize() API
* Set retain=True for Experiment
* Fix location to Trainer utils
* Implement get_job, list_jobs, and delete_job APIs
* Add metrics and parameters to Trial object
* Clarify message for objective
* Move TrainJobTemplate to the Trainer types
* Rename CRD to CR
* Fix serialization of TrainJob
* Rename ExecutionBackend to RuntimeBackend
* Export GridSearch; rename common types
* Add OptimizationJob constant
* Change to BaseAlgorithm
* Keep func_args for Trainer
* Use PyPI package for Katib models
* Update lock file

Signed-off-by: Andrey Velichkevich <[email protected]>
1 parent e878505 commit 7d682fd

File tree

34 files changed: +1474 -294 lines changed

.github/workflows/check-pr-title.yaml

Lines changed: 1 addition & 0 deletions
@@ -33,6 +33,7 @@ jobs:
 ci
 docs
 examples
+optimizer
 scripts
 test
 trainer

Makefile

Lines changed: 9 additions & 7 deletions
@@ -37,8 +37,6 @@ VENV_DIR := $(PROJECT_DIR)/.venv
 help: ## Display this help.
 	@awk 'BEGIN {FS = ":.*##"; printf "\nUsage:\n make \033[36m<target>\033[0m\n"} /^[a-zA-Z_0-9-]+:.*?##/ { printf " \033[36m%-15s\033[0m %s\n", $$1, $$2 } /^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) } ' $(MAKEFILE_LIST)

-#UV := $(shell which uv)
-
 .PHONY: uv
 uv: ## Install UV
 	@command -v uv &> /dev/null || { \
@@ -57,7 +55,7 @@ verify: install-dev ## install all required tools
 	@uv run ruff format --check kubeflow

 .PHONY: uv-venv
-uv-venv:
+uv-venv: ## Create uv virtual environment
 	@if [ ! -d "$(VENV_DIR)" ]; then \
 		echo "Creating uv virtual environment in $(VENV_DIR)..."; \
 		uv venv; \
@@ -75,10 +73,14 @@ release: install-dev

 # make test-python will produce html coverage by default. Run with `make test-python report=xml` to produce xml report.
 .PHONY: test-python
-test-python: uv-venv
+test-python: uv-venv ## Run Python unit tests
 	@uv sync
-	@uv run coverage run --source=kubeflow.trainer.backends.kubernetes.backend,kubeflow.trainer.utils.utils -m pytest ./kubeflow/trainer/backends/kubernetes/backend_test.py ./kubeflow/trainer/utils/utils_test.py
-	@uv run coverage report -m kubeflow/trainer/backends/kubernetes/backend.py kubeflow/trainer/utils/utils.py
+	@uv run coverage run --source=kubeflow.trainer.backends.kubernetes.backend,kubeflow.trainer.utils.utils -m pytest \
+		./kubeflow/trainer/backends/kubernetes/backend_test.py \
+		./kubeflow/trainer/backends/kubernetes/utils_test.py
+	@uv run coverage report -m \
+		kubeflow/trainer/backends/kubernetes/backend.py \
+		kubeflow/trainer/backends/kubernetes/utils.py
 ifeq ($(report),xml)
 	@uv run coverage xml
 else
@@ -87,7 +89,7 @@ endif


 .PHONY: install-dev
-install-dev: uv uv-venv ruff ## Install uv, create .venv, sync deps; DEV=1 to include dev group; EXTRAS=comma,list for extras
+install-dev: uv uv-venv ruff ## Install uv, create .venv, sync deps.
 	@echo "Using virtual environment at: $(VENV_DIR)"
 	@echo "Syncing dependencies with uv..."
 	@uv sync

docs/proposals/2-trainer-local-execution/README.md

Lines changed: 11 additions & 1 deletion
@@ -14,13 +14,15 @@ AI Practitioners often want to experiment locally before scaling their models to
 The proposed local execution mode will allow engineers to quickly test their models in isolated containers or virtualenvs via subprocess, facilitating a faster and more efficient workflow.

 ### Goals
+
 - Allow users to run training jobs on their local machines using container runtimes or subprocess.
 - Rework current Kubeflow Trainer SDK to implement Execution Backends with Kubernetes Backend as default.
 - Implement Local Execution Backends that integrates seamlessly with the Kubeflow SDK, supporting both single-node and multi-node training processes.
 - Provide an implementation that supports PyTorch, with the potential to extend to other ML frameworks or runtimes.
 - Ensure compatibility with existing Kubeflow Trainer SDK features and user interfaces.

 ### Non-Goals
+
 - Full support for distributed training in the first phase of implementation.
 - Support for all ML frameworks or runtime environments in the initial proof-of-concept.
 - Major changes to the Kubeflow Trainer SDK architecture.
@@ -34,18 +36,22 @@ The local execution mode will allow users to run training jobs in container runt
 ### User Stories (Optional)

 #### Story 1
+
 As an AI Practitioner, I want to run my model locally using Podman/Docker containers so that I can test my training job without incurring the costs of running a Kubernetes cluster.

 #### Story 2
+
 As an AI Practitioner, I want to initialize datasets and models within Podman/Docker containers, so that I can streamline my local training environment.

 ### Notes/Constraints/Caveats
+
 - Local execution mode will first support Subprocess, with future plans to explore Podman, Docker, and Apple Container.
 - The subprocess implementation will be restricted to single node.
 - The local execution mode will support only pytorch runtime initially.
 - Resource limitations on memory, cpu and gpu is not fully supported locally and might not be supported if the execution backend doesn't expose apis to support it.

 ### Risks and Mitigations
+
 - **Risk**: Compatibility issues with non-Docker container runtimes.
 - **Mitigation**: Initially restrict support to Podman/Docker and evaluate alternatives for future phases.
 - **Risk**: Potential conflicts between local and Kubernetes execution modes.
@@ -55,7 +61,7 @@ As an AI Practitioner, I want to initialize datasets and models within Podman/Do

 The local execution mode will be implemented using a new `LocalProcessBackend`, `PodmanBackend`, `DockerBackend` which will allow users to execute training jobs using containers and virtual environment isolation. The client will utilize container runtime capabilities to create isolated environments, including volumes and networks, to manage the training lifecycle. It will also allow for easy dataset and model initialization.

-- Different execution backends will need to implement the same interface from the `ExecutionBackend` abstract class so `TrainerClient` can initialize and load the backend.
+- Different execution backends will need to implement the same interface from the `RuntimeBackend` abstract class so `TrainerClient` can initialize and load the backend.
 - The Podman/Docker client will connect to a local container environment, create shared volumes, and initialize datasets and models as needed.
 - The **DockerBackend** will manage Docker containers, networks, and volumes using runtime definitions specified by the user.
 - The **PodmanBackend** will manage Podman containers, networks, and volumes using runtime definitions specified by the user.
@@ -70,16 +76,20 @@ The local execution mode will be implemented using a new `LocalProcessBackend`,
 - **E2E Tests**: Conduct end-to-end tests to validate the local execution mode, ensuring that jobs can be initialized, executed, and tracked correctly within Podman/Docker containers.

 ### Graduation Criteria
+
 - The feature will move to the `beta` stage once it supports multi-node training with pytorch framework as default runtime and works seamlessly with local environments.
 - Full support for multi-worker configurations and additional ML frameworks will be considered for the `stable` release.

 ## Implementation History
+
 - **KEP Creation**: April 2025
 - **Implementation Start**: April 2025
+
 ## Drawbacks

 - The initial implementation will be limited to single-worker training jobs, which may restrict users who need multi-node support.
 - The local execution mode will initially only support Subprocess and may require additional configurations for Podman/Docker container runtimes in the future.

 ## Alternatives
+
 - **Full Kubernetes Execution**: Enable users to always run jobs on Kubernetes clusters, though this comes with higher costs and longer development cycles for ML engineers.
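
To make the shared-interface requirement in the proposal concrete, here is a minimal sketch of the abstract-backend pattern. The method names (`train`, `get_job`) and the `LocalProcessBackend` body are illustrative assumptions; the actual `RuntimeBackend` class in the Trainer SDK is not part of this commit and may differ.

import abc


class RuntimeBackend(abc.ABC):
    """Common interface that every execution backend implements."""

    @abc.abstractmethod
    def train(self, **kwargs) -> str:
        """Create a training job and return its name."""

    @abc.abstractmethod
    def get_job(self, name: str):
        """Return the job with the given name."""


class LocalProcessBackend(RuntimeBackend):
    """Illustrative backend that would run training in a local subprocess."""

    def train(self, **kwargs) -> str:
        return "local-job"  # placeholder job name

    def get_job(self, name: str):
        return {"name": name, "status": "Unknown"}  # placeholder job object

With this pattern, `TrainerClient` only needs the backend config to decide which concrete backend to instantiate, which is the same approach the new `OptimizerClient` below takes for Kubernetes.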

kubeflow/common/__init__.py

Whitespace-only changes.

kubeflow/common/constants.py

Lines changed: 22 additions & 0 deletions
# Copyright 2025 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The default Kubernetes namespace.
DEFAULT_NAMESPACE = "default"

# How long to wait in seconds for requests to the Kubernetes API Server.
DEFAULT_TIMEOUT = 120

# Unknown indicates that the value can't be identified.
UNKNOWN = "Unknown"
File renamed without changes.

kubeflow/common/utils.py

Lines changed: 40 additions & 0 deletions
# Copyright 2025 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
from typing import Optional

from kubernetes import config

from kubeflow.common import constants


def is_running_in_k8s() -> bool:
    return os.path.isdir("/var/run/secrets/kubernetes.io/")


def get_default_target_namespace(context: Optional[str] = None) -> str:
    if not is_running_in_k8s():
        try:
            all_contexts, current_context = config.list_kube_config_contexts()
            # If context is set, we should get namespace from it.
            if context:
                for c in all_contexts:
                    if isinstance(c, dict) and c.get("name") == context:
                        return c["context"]["namespace"]
            # Otherwise, try to get namespace from the current context.
            return current_context["context"]["namespace"]
        except Exception:
            return constants.DEFAULT_NAMESPACE
    with open("/var/run/secrets/kubernetes.io/serviceaccount/namespace") as f:
        return f.readline()
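
A short usage note for the helper above: callers pass an optional kubeconfig context name and receive the namespace that API requests should target, falling back to `constants.DEFAULT_NAMESPACE` when nothing can be resolved. A minimal sketch; the context name "prod" is a hypothetical example.

from kubeflow.common import utils

# Namespace from the current kubeconfig context (or "default" as a fallback).
namespace = utils.get_default_target_namespace()

# Namespace taken from a specific, hypothetical kubeconfig context.
namespace = utils.get_default_target_namespace(context="prod")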

kubeflow/optimizer/__init__.py

Lines changed: 39 additions & 0 deletions
# Copyright 2025 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Import common types.
from kubeflow.common.types import KubernetesBackendConfig

# Import the Kubeflow Optimizer client.
from kubeflow.optimizer.api.optimizer_client import OptimizerClient

# Import the Kubeflow Optimizer types.
from kubeflow.optimizer.types.algorithm_types import GridSearch, RandomSearch
from kubeflow.optimizer.types.optimization_types import Objective, OptimizationJob, TrialConfig
from kubeflow.optimizer.types.search_types import Search

# Import the Kubeflow Trainer types.
from kubeflow.trainer.types.types import TrainJobTemplate

__all__ = [
    "GridSearch",
    "KubernetesBackendConfig",
    "Objective",
    "OptimizationJob",
    "OptimizerClient",
    "RandomSearch",
    "Search",
    "TrainJobTemplate",
    "TrialConfig",
]
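
Because `__all__` re-exports the full public surface, user code can import the optimizer API from the top-level package. A minimal import sketch using only the names listed in `__all__` above:

from kubeflow.optimizer import (
    Objective,
    OptimizerClient,
    RandomSearch,
    Search,
    TrainJobTemplate,
)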

kubeflow/optimizer/api/__init__.py

Whitespace-only changes.
Lines changed: 126 additions & 0 deletions
# Copyright 2025 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging
from typing import Any, Optional

from kubeflow.common.types import KubernetesBackendConfig
from kubeflow.optimizer.backends.kubernetes.backend import KubernetesBackend
from kubeflow.optimizer.types.algorithm_types import BaseAlgorithm
from kubeflow.optimizer.types.optimization_types import Objective, OptimizationJob, TrialConfig
from kubeflow.trainer.types.types import TrainJobTemplate

logger = logging.getLogger(__name__)


class OptimizerClient:
    def __init__(
        self,
        backend_config: Optional[KubernetesBackendConfig] = None,
    ):
        """Initialize a Kubeflow Optimizer client.

        Args:
            backend_config: Backend configuration. Either KubernetesBackendConfig or None to use
                the default config class. Defaults to KubernetesBackendConfig.

        Raises:
            ValueError: Invalid backend configuration.
        """
        # Set the default backend config.
        if not backend_config:
            backend_config = KubernetesBackendConfig()

        if isinstance(backend_config, KubernetesBackendConfig):
            self.backend = KubernetesBackend(backend_config)
        else:
            raise ValueError(f"Invalid backend config '{backend_config}'")

    def optimize(
        self,
        trial_template: TrainJobTemplate,
        *,
        trial_config: Optional[TrialConfig] = None,
        search_space: dict[str, Any],
        objectives: Optional[list[Objective]] = None,
        algorithm: Optional[BaseAlgorithm] = None,
    ) -> str:
        """Create an OptimizationJob for hyperparameter tuning.

        Args:
            trial_template: The TrainJob template defining the training script.
            trial_config: Optional configuration to run Trials.
            search_space: Dictionary mapping parameter names to Search specifications using
                Search.uniform(), Search.loguniform(), Search.choice(), etc.
            objectives: List of objectives to optimize.
            algorithm: The optimization algorithm to use. Defaults to RandomSearch.

        Returns:
            The unique name of the Experiment that has been generated.

        Raises:
            ValueError: Input arguments are invalid.
            TimeoutError: Timeout to create the Experiment.
            RuntimeError: Failed to create the Experiment.
        """
        return self.backend.optimize(
            trial_template=trial_template,
            trial_config=trial_config,
            objectives=objectives,
            search_space=search_space,
            algorithm=algorithm,
        )

    def list_jobs(self) -> list[OptimizationJob]:
        """List the created OptimizationJobs.

        Returns:
            A list of created OptimizationJobs. If no OptimizationJob exists,
            an empty list is returned.

        Raises:
            TimeoutError: Timeout to list OptimizationJobs.
            RuntimeError: Failed to list OptimizationJobs.
        """
        return self.backend.list_jobs()

    def get_job(self, name: str) -> OptimizationJob:
        """Get the OptimizationJob object.

        Args:
            name: Name of the OptimizationJob.

        Returns:
            An OptimizationJob object.

        Raises:
            TimeoutError: Timeout to get an OptimizationJob.
            RuntimeError: Failed to get an OptimizationJob.
        """
        return self.backend.get_job(name=name)

    def delete_job(self, name: str):
        """Delete the OptimizationJob.

        Args:
            name: Name of the OptimizationJob.

        Raises:
            TimeoutError: Timeout to delete the OptimizationJob.
            RuntimeError: Failed to delete the OptimizationJob.
        """
        return self.backend.delete_job(name=name)
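
For orientation, an end-to-end usage sketch of the client above. The client methods and keyword arguments follow the signatures in this file; the `TrainJobTemplate` construction, the `Search` helper arguments, the `Objective` fields, and the no-argument `RandomSearch()` call are illustrative assumptions, since those types are defined elsewhere in the SDK.

from kubeflow.optimizer import Objective, OptimizerClient, RandomSearch, Search, TrainJobTemplate

client = OptimizerClient()  # defaults to KubernetesBackendConfig

# Build the TrainJob template executed for every Trial; the fields are omitted
# here because TrainJobTemplate is defined in kubeflow.trainer, not in this file.
trial_template = TrainJobTemplate(...)

# Create an OptimizationJob; the returned value is the generated Experiment name.
job_name = client.optimize(
    trial_template,
    search_space={
        # Search.uniform()/loguniform()/choice() are named in the docstring above;
        # the exact arguments below are assumptions.
        "lr": Search.loguniform(1e-5, 1e-2),
        "batch_size": Search.choice([16, 32, 64]),
    },
    objectives=[Objective(metric="accuracy")],  # Objective fields are assumed
    algorithm=RandomSearch(),
)

# Inspect, list, and clean up OptimizationJobs.
print(client.get_job(job_name))
for job in client.list_jobs():
    print(job)
client.delete_job(name=job_name)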
