feat: Hyperparameter Optimization APIs in Kubeflow SDK #124
Conversation
Pull Request Test Coverage Report for Build 18959385335

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

💛 - Coveralls
I have implemented
@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: kubeflow/kubeflow-sdk-team, Fiona-Waters, abhijeet-dhumal, jskswamy. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
    trial_config: Optional[TrialConfig] = None,
    search_space: dict[str, Any],
    objectives: Optional[list[Objective]] = None,
    algorithm: Optional[RandomSearch] = None,
Should we consider adding options already?
Let's add it in a follow-up PR, since we want to limit the number of APIs users can configure initially for the Experiment CR.
Sounds good!
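One detail worth noting in the signature above: since search_space has no default while trial_config does, this parameter order only parses if everything after trial_template is keyword-only. A minimal sketch of that shape (placeholder type bodies and a stubbed return value, not the SDK's real implementation):

```python
# Sketch of the keyword-only optimize() signature under review.
# TrialConfig, Objective, and RandomSearch mirror names in this PR,
# but their bodies here are placeholders.
from typing import Any, Optional


class TrialConfig: ...
class Objective: ...
class RandomSearch: ...


def optimize(
    trial_template: Any,
    *,  # required: search_space has no default but follows trial_config
    search_space: dict[str, Any],
    objectives: Optional[list[Objective]] = None,
    algorithm: Optional[RandomSearch] = None,
    trial_config: Optional[TrialConfig] = None,
) -> str:
    """Return the name of the created optimization job (stubbed)."""
    if not search_space:
        raise ValueError("search_space must not be empty")
    return "optimization-job"


print(optimize(None, search_space={"lr": [0.01, 0.1]}))  # optimization-job
```

Keyword-only arguments also keep call sites readable and let the parameter order change later without breaking callers.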
    logger.debug(f"OptimizationJob {self.namespace}/{name} has been deleted")
    def __get_optimization_job_from_crd(
Suggested change:
-    def __get_optimization_job_from_crd(
+    def __get_optimization_job_from_custom_resource(
    def __get_optimization_job_from_crd(
        self,
        optimization_job_crd: models.V1beta1Experiment,
Suggested change:
-        optimization_job_crd: models.V1beta1Experiment,
+        optimization_job_cr: models.V1beta1Experiment,
kubeflow/optimizer/backends/base.py
Outdated
    from kubeflow.optimizer.types.optimization_types import Objective, OptimizationJob, TrialConfig
    class ExecutionBackend(abc.ABC):
Suggested change:
-    class ExecutionBackend(abc.ABC):
+    class RuntimeBackend(abc.ABC):
Or:
Suggested change:
-    class ExecutionBackend(abc.ABC):
+    class OptimizerBackend(abc.ABC):
We previously agreed on the ExecutionBackend here: #34 (comment) with @kramaranya and @szaher.
Do you prefer to find a better name for it, @astefanutti?
@andreyvelich that's not a big deal, ExecutionBackend is fine. RuntimeBackend seems more general as it also covers resources and not only the "execution", like the job "registry" (ETCD for Kubernetes).
That sounds good!
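The ExecutionBackend abstraction the thread settles on can be sketched as a small ABC. The method names below are illustrative, and the in-memory subclass stands in for the real Kubernetes-backed implementation:

```python
# Minimal sketch of a pluggable execution backend (names illustrative,
# not the SDK's final surface).
import abc


class ExecutionBackend(abc.ABC):
    """Backend that creates and queries optimization jobs."""

    @abc.abstractmethod
    def optimize(self, trial_template, search_space: dict) -> str:
        """Create an optimization job and return its name."""

    @abc.abstractmethod
    def get_job(self, name: str) -> dict:
        """Fetch the state of a previously created job."""


class KubernetesBackend(ExecutionBackend):
    """Toy implementation backed by an in-memory dict, not a cluster."""

    def __init__(self):
        self._jobs: dict[str, dict] = {}

    def optimize(self, trial_template, search_space: dict) -> str:
        name = f"job-{len(self._jobs)}"
        self._jobs[name] = {"template": trial_template, "space": search_space}
        return name

    def get_job(self, name: str) -> dict:
        return self._jobs[name]


backend = KubernetesBackend()
print(backend.optimize(None, {"lr": [0.1]}))  # job-0
```

Because the client only depends on the abstract interface, a local or test backend can be swapped in without touching client code.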
Force-pushed from 28d2b5e to 92f34a5 (compare)
    EXPERIMENT_SUCCEEDED = "Succeeded"

    # Label to identify Experiment's resources.
    EXPERIMENT_LABEL = "katib.kubeflow.org/experiment"
Should we start using optimizer.kubeflow.org?
Since we rely on the Katib Experiment CRD for now, we can't use the new labels yet.
So, do we have plans to implement OptimizerRuntime and let OptimizerJob override it in the future?
I don't think we need OptimizerRuntime, since OptimizerJob should natively integrate with TrainingRuntime
@andreyvelich Thanks for this. I left my initial question:)
    steps: list[Step]
    num_nodes: int
    status: str = constants.UNKNOWN
    creation_timestamp: datetime
Why do we need creation_timestamp? Shouldn't it be added automatically in the creation phase?
It does. We set this property from the Experiment.metadata.creation_timestamp:

    creation_timestamp=optimization_job_cr.metadata.creation_timestamp,
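The flow described above can be sketched with a plain dict standing in for models.V1beta1Experiment; the helper name and field subset are illustrative:

```python
# Sketch: the creation timestamp is read from the server-populated
# metadata of the custom resource, never supplied by the caller.
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class OptimizationJob:
    name: str
    status: str
    creation_timestamp: datetime


def job_from_custom_resource(cr: dict) -> OptimizationJob:
    meta = cr["metadata"]
    return OptimizationJob(
        name=meta["name"],
        status=cr.get("status", {}).get("condition", "Unknown"),
        creation_timestamp=meta["creationTimestamp"],
    )


cr = {
    "metadata": {
        "name": "hpo-demo",
        "creationTimestamp": datetime(2025, 1, 1, tzinfo=timezone.utc),
    }
}
print(job_from_custom_resource(cr).status)  # Unknown
```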
@kramaranya @Electronic-Waste @astefanutti Any additional comments before we move forward with the initial support of HPO in Kubeflow SDK?
Signed-off-by: Andrey Velichkevich <[email protected]>
Force-pushed from 9b3700a to 1353fc9 (compare)
Thank you so much @andreyvelich for this great work!!
I left a few comments
pyproject.toml
Outdated
    "pydantic>=2.10.0",
    "kubeflow-trainer-api>=2.0.0",
    # TODO (andreyvelich): Switch to kubeflow-katib-api once it is published.
    "kubeflow_katib_api@git+https://github.com/kramaranya/katib.git@separate-models-from-sdk#subdirectory=api/python_api",
Since it has been merged, we can update that with a katib ref instead of the fork. Or shall we cut a new Katib release and publish those models to PyPI?
Yes, I will replace it once we release 0.19
        initializers.
    """

    trainer: CustomTrainer
Why don't we support BuiltinTrainer initially? Is it due to metrics collection?
Yes, since we don't have access to the BuiltinTrainer scripts, we can't guarantee in which format metrics will be printed. In a future iteration, let's discuss how we can integrate BuiltinTrainers as well.
I see, thanks for explaining
    # Import the Kubeflow Trainer types.
    from kubeflow.trainer.types.types import TrainJobTemplate

    __all__ = [
Shall we add GridSearch here?
Good catch!
    """
    # Set the default backend config.
    if not backend_config:
        backend_config = KubernetesBackendConfig()
nit: just for consistency, shall we match trainer and use the same import style?

    if not backend_config:
        backend_config = common_types.KubernetesBackendConfig()

Let me go the other way around, though.
    logger.debug(f"OptimizationJob {self.namespace}/{name} has been deleted")

    def __get_optimization_job_from_custom_resource(
To align with trainer, should we update this?
Suggested change:
-    def __get_optimization_job_from_custom_resource(
+    def __get_optimization_job_from_cr(
    except multiprocessing.TimeoutError as e:
        raise TimeoutError(
            f"Timeout to list OptimizationJobs in namespace: {self.namespace}"
Can we add OptimizationJob to constants instead?
    # Trainer function arguments for the appropriate substitution.
    parameters_spec = []
    trial_parameters = []
    trial_template.trainer.func_args = {}
Would this not overwrite existing func_args?
It will; good point. I am curious whether we want to merge them, e.g. we should allow users to do something like this:
    func_args={
        "lr": 0.1,
        "num_epochs": 5,
    },
    search_space={
        "num_epochs": Search.choice(1, 2),
    }
That sounds good to me
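The merge behaviour agreed on here could look roughly like this. merge_func_args is a hypothetical helper, and the ${trialParameters.*} placeholder follows Katib's trial-template substitution syntax:

```python
# Illustrative sketch of merging fixed func_args with search-space
# parameters rather than overwriting them (not the SDK's actual code).
def merge_func_args(func_args: "dict | None", search_space: dict) -> dict:
    merged = dict(func_args or {})  # also covers func_args=None
    for param in search_space:
        # Search-space keys win: each becomes a per-trial placeholder
        # that Katib substitutes with a sampled value.
        merged[param] = f"${{trialParameters.{param}}}"
    return merged


print(merge_func_args({"lr": 0.1, "num_epochs": 5}, {"num_epochs": None}))
# {'lr': 0.1, 'num_epochs': '${trialParameters.num_epochs}'}
```

Fixed arguments like "lr" pass through untouched, while any key that also appears in the search space is replaced by a trial parameter.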
Change it to:

    if trial_template.trainer.func_args is None:
        trial_template.trainer.func_args = {}

    trial_config: Optional[TrialConfig] = None,
    search_space: dict[str, Any],
    objectives: Optional[list[Objective]] = None,
    algorithm: Optional[RandomSearch] = None,
I wonder whether we should accept a base type instead, so any algorithm works without changing the API in the future?
Suggested change:
-    algorithm: Optional[RandomSearch] = None,
+    algorithm: Optional[BaseAlgorithm] = None,
Good point!
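The base-type idea can be sketched with a small dataclass hierarchy. The class names follow the suggestion above, while algorithm_spec and the default-to-RandomSearch behaviour are assumptions for illustration:

```python
# Sketch: accept a common base type so new algorithms don't require
# changing the optimize() signature. Names are illustrative.
from dataclasses import dataclass
from typing import Optional


@dataclass
class BaseAlgorithm:
    name: str = "base"


@dataclass
class RandomSearch(BaseAlgorithm):
    name: str = "random"
    seed: Optional[int] = None


@dataclass
class GridSearch(BaseAlgorithm):
    name: str = "grid"


def algorithm_spec(algorithm: Optional[BaseAlgorithm] = None) -> dict:
    # Any subclass is accepted; defaulting to RandomSearch is an
    # assumption here, not confirmed SDK behaviour.
    algorithm = algorithm or RandomSearch()
    return {"algorithmName": algorithm.name}


print(algorithm_spec(GridSearch()))  # {'algorithmName': 'grid'}
```

Adding, say, a Bayesian-optimization subclass later would then require no signature change at all.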
    MAXIMIZE = "maximize"
    MINIMIZE = "minimize"
What do you think about adding "max" and "min" aliases?
I think this keeps us consistent with the Optuna naming: https://optuna.readthedocs.io/en/stable/tutorial/20_recipes/002_multi_objective.html#run-multi-objective-optimization
Sure, sounds good to me. I was just wondering if it makes sense to support both "max" and "maximize" as options, since Ray Tune uses "max" for example https://docs.ray.io/en/latest/tune/api/doc/ray.tune.run.html
Sounds good, maybe we should do something like this for this enum?

    @classmethod
    def from_str(cls, value: str):
        value = value.lower()
        aliases = {
            "max": "maximize",
            "maximize": "maximize",
            "min": "minimize",
            "minimize": "minimize",
        }
        return cls(aliases[value])

@kramaranya Can you create an issue to track this improvement?
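A self-contained version of the alias sketch above, assuming a str-backed enum named Direction (the SDK's actual enum name and error handling may differ):

```python
# "max"/"min" aliases mapped onto the canonical enum values.
from enum import Enum


class Direction(str, Enum):
    MAXIMIZE = "maximize"
    MINIMIZE = "minimize"

    @classmethod
    def from_str(cls, value: str) -> "Direction":
        aliases = {
            "max": "maximize",
            "maximize": "maximize",
            "min": "minimize",
            "minimize": "minimize",
        }
        try:
            return cls(aliases[value.lower()])
        except KeyError:
            raise ValueError(f"unknown direction: {value!r}") from None


print(Direction.from_str("MAX").value)  # maximize
```

Lower-casing the input also makes the parser case-insensitive, so "MAX" and "Maximize" both resolve to the same member.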
This adds comprehensive unit tests for the OptimizerClient methods in KubernetesBackend, mirroring the existing test structure used for TrainerClient in backend_test.py. Tests cover:
- get_runtime (success, timeout, runtime error)
- list_runtimes
- optimize (study creation)
- get_study
- list_studies
- get_study_logs
- delete_study
All tests use pytest fixtures and mock Kubernetes API calls.

Closes kubeflow#124
Rename common types

Signed-off-by: Andrey Velichkevich <[email protected]>
I'm wondering whether we should introduce OWNERS files for sub-packages like optimizer and trainer. I think this was mentioned in the KEP. It doesn't need to be in this PR, but I wanted to raise it since we now have two subpackages.
Yes, I think eventually we should do that.
@kramaranya CI should be working now.
Thank you for this great work, @andreyvelich! 🎉
New changes are detected. LGTM label has been removed.
Thanks everyone!
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: andreyvelich. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Part of: #46
Depends on: kubeflow/katib#2579
I've added initial support for hyperparameter optimization with OptimizerClient() into Kubeflow SDK. This PR also introduces some refactoring to re-use code across TrainerClient() and OptimizerClient().

Working example:
Katib Experiment
/assign @kubeflow/kubeflow-sdk-team @akshaychitneni
TODO items:
- get_job() API
- list_jobs() API
- delete_job() API