feat: Implement Training Options pattern for flexible TrainJob customization #91
Conversation
Force-pushed 67d12d8 to b39b364
Force-pushed 2d7a8e6 to dbba135
Force-pushed 36c0160 to 95155f6
Thanks @abhijeet-dhumal
I believe we need to decide how options will be applied across backends: either we ignore options for the LocalProcess backend, or we make options target a specific backend.
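A minimal sketch of one possible shape (the supported_backends attribute and the apply_options helper below are hypothetical, not part of this PR): each option could declare which backends it targets, and a backend could skip, or reject, options it does not support.

class PodTemplateOverrides:
    """Hypothetical sketch of a Kubernetes-only option."""

    supported_backends = ("kubernetes",)  # hypothetical attribute

    def __call__(self, job_spec: dict, trainer=None):
        # Placeholder body: record pod overrides in the plain job_spec dict.
        job_spec.setdefault("spec", {})["podSpecOverrides"] = []


def apply_options(backend_name: str, job_spec: dict, options=None):
    for option in options or []:
        # LocalProcessBackend would skip (or raise on) Kubernetes-only options.
        if backend_name not in getattr(option, "supported_backends", (backend_name,)):
            continue
        option(job_spec)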
Force-pushed f092845 to 03994e7
Pull Request Test Coverage Report for Build 18870799839 (Details)
💛 - Coveralls
Left a few comments, otherwise looks good to me.
kubeflow/trainer/options/__init__.py (Outdated)
from kubeflow.trainer.options.common import (
    ContainerOverride,
    PodTemplateOverride,
    PodTemplateSpecOverride,
)
from kubeflow.trainer.options.kubernetes import (
    Annotations,
    Labels,
    Name,
    PodTemplateOverrides,
    SpecAnnotations,
Can we start with only the PodTemplateOverrides option and remove the other options for now?
We can introduce them later if users require them.
Suggested change:
- from kubeflow.trainer.options.common import (
-     ContainerOverride,
-     PodTemplateOverride,
-     PodTemplateSpecOverride,
- )
- from kubeflow.trainer.options.kubernetes import (
-     Annotations,
-     Labels,
-     Name,
-     PodTemplateOverrides,
-     SpecAnnotations,
+ from kubeflow.trainer.options.kubernetes import (
+     Annotations,
+     Labels,
+     Name,
+     PodTemplateOverrides,
+     SpecAnnotations,
Actually, if we are planning to implement PodTemplateOverrides,
then we need all three, because they're direct mappings to the TrainJob CRD schema (verified against the latest trainer.kubeflow.org_trainjobs.yaml).
IMHO we can't flatten this; the nested structure is required by the API, and each class serves a distinct purpose: targeting (PodTemplateOverride), pod-level config (PodTemplateSpecOverride), and container-specific overrides (ContainerOverride).
The hierarchy mirrors TrainJob pod structure:
- PodTemplateOverride targets specific job types (e.g., "node" vs "launcher"),
- PodTemplateSpecOverride handles pod-level configs (serviceAccountName, nodeSelector, affinity, tolerations, volumes, imagePullSecrets, schedulingGates),
- ContainerOverride lets users modify specific containers by name (env vars, volume mounts).
Without these, users can't do basic production scenarios like mounting datasets from PVCs, scheduling on GPU nodes, pulling from private registries, or injecting secrets as env vars. The CRD explicitly defines ContainerOverride with name/env/volumeMounts fields, and spec.containers and spec.initContainers are lists of ContainerOverride objects.
These are essential building blocks, not optional nice-to-haves.
# Mount dataset from PVC, add GPU affinity
PodTemplateOverride(
    target_jobs=["node"],
    spec=PodTemplateSpecOverride(
        node_selector={"gpu": "a100"},
        volumes=[{"name": "data", "persistentVolumeClaim": {"claimName": "dataset-pvc"}}],
        containers=[
            ContainerOverride(
                name="trainer",
                env=[{"name": "HF_TOKEN", "valueFrom": {"secretKeyRef": ...}}],
                volume_mounts=[{"name": "data", "mountPath": "/data"}]
            )
        ]
    )
)
WDYT? @andreyvelich @astefanutti
I'd be happy to discuss if I've misunderstood you!
Yes that makes sense, I think that's acceptable. The other alternative would be to use dictionaries, but that looks less elegant.
from kubeflow.trainer.types import types


class TestTrainerClientOptionValidation:
This is still open @abhijeet-dhumal.
def __call__(
    self,
    job_spec: dict,
Should we type it as TrainerV1alpha1TrainJob, or do we keep it flexible?
I think we can keep it flexible here?
job_spec = {}
if options:
    for option in options:
        option(job_spec, trainer)
Each option mutates a plain dict that represents the TrainJob structure. The backend then extracts values from this dict and constructs the actual models.TrainerV1alpha1TrainJob object later, so each backend can handle conversion to its own specific types.
I'd be happy to discuss if I've misunderstood you!
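As a rough sketch of that contract (the Labels class and the field names below are illustrative, not the exact implementation): an option is just a callable that mutates the plain job_spec dict, which the backend later converts into its own types.

from typing import Any, Optional


class Labels:
    """Illustrative option: a callable that mutates the plain job_spec dict."""

    def __init__(self, labels: dict):
        self.labels = labels

    def __call__(self, job_spec: dict, trainer: Optional[Any] = None) -> None:
        job_spec.setdefault("metadata", {}).setdefault("labels", {}).update(self.labels)


job_spec: dict = {}
for option in [Labels({"team": "ml-platform"})]:
    option(job_spec)
# The Kubernetes backend would later read job_spec["metadata"]["labels"]
# when building the actual models.TrainerV1alpha1TrainJob object.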
That sounds good, let's start by being flexible and iterate if needed.
Force-pushed 338ed1c to 938b322
Force-pushed 938b322 to b1ce0f5
@andreyvelich I don't understand this point I'm afraid 😅
/lgtm
Thanks @abhijeet-dhumal for this awesome contribution!
@abhijeet-dhumal Thanks!
/lgtm
/hold for @andreyvelich approval
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: astefanutti
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing …
metadata=models.IoK8sApimachineryPkgApisMetaV1ObjectMeta(
    name=train_job_name, labels=labels, annotations=annotations
),
spec=models.TrainerV1alpha1TrainJobSpec(
@abhijeet-dhumal, awesome, looks good to me now!
@bohdandenysenko: changing LGTM is restricted to collaborators. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/hold
@abhijeet-dhumal As I suggested here: https://github.com/kubeflow/sdk/pull/91/files#r2412261172, can you refactor the unit tests: https://github.com/kubeflow/sdk/blob/d32626e4390cd2f9cc25407b88a82bd1cf09973a/kubeflow/trainer/api/trainer_client_test.py ? The test cases for …
Similar to Kubernetes, we can create unit tests for …
Thank you @abhijeet-dhumal for this!
I left a few comments
| "Name", | ||
| "PodTemplateOverride", | ||
| "PodTemplateOverrides", | ||
| "PodTemplateSpecOverride", | ||
| "Runtime", | ||
| "RuntimeTrainer", | ||
| "TorchTuneConfig", | ||
| "TorchTuneInstructDataset", | ||
| "RuntimeTrainer", | ||
| "TrainerArgs", | ||
| "TrainerClient", | ||
| "TrainerCommand", | ||
| "TrainerImage", | ||
| "TrainerType", | ||
| "LocalProcessBackendConfig", | ||
| "KubernetesBackendConfig", | 
SpecLabels and SpecAnnotations are missing
options: Optional list of configuration options to apply to the TrainJob. Use
    Labels and Annotations for basic metadata configuration.
I'd suggest also mentioning container and pod options
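For example, the docstring could be extended along these lines (wording is only a suggestion):

options: Optional list of configuration options to apply to the TrainJob. Use
    Labels and Annotations for basic metadata configuration, and
    PodTemplateOverrides (with PodTemplateOverride, PodTemplateSpecOverride,
    and ContainerOverride) for pod- and container-level customization such as
    node selectors, volumes, environment variables, and volume mounts.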
# Generate unique name for the TrainJob if not provided
train_job_name = name or (
    random.choice(string.ascii_lowercase)
    + uuid.uuid4().hex[: constants.TRAINJOB_NAME_UUID_LENGTH]
)
What if the name the user provides is not unique?
| f"Timeout to get {constants.CLUSTER_TRAINING_RUNTIME_PLURAL}: " | ||
| f"{self.namespace}/{name}" | ||
| f"Timeout to get {constants.CLUSTER_TRAINING_RUNTIME_PLURAL}: {name}" | 
Since you dropped namespace here, could you also update other instances? For example in list_runtimes https://github.com/kubeflow/sdk/blob/main/kubeflow/trainer/backends/kubernetes/backend.py#L90-L98
def __call__(
    self,
    job_spec: dict[str, Any],
    trainer: BuiltinTrainer | CustomTrainer | None = None,
This requires Python 3.10; since we still support 3.9, could you use Optional/Union instead?
https://peps.python.org/pep-0604/
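For instance, the 3.9-compatible spelling could look like this (the surrounding class is just a placeholder for the sketch; built-in generics like dict[str, Any] are fine on 3.9, it is only the X | Y union syntax that needs 3.10):

from typing import Any, Optional, Union


class ExampleOption:  # placeholder class for the sketch
    def __call__(
        self,
        job_spec: dict[str, Any],
        trainer: Optional[Union["BuiltinTrainer", "CustomTrainer"]] = None,
    ) -> None:
        ...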
env: list[dict] | None = None
volume_mounts: list[dict] | None = None
Same here, please use Optional/Union.
func: Callable
func_args: Optional[dict] = None
packages_to_install: Optional[list[str]] = None
func: Callable | None = None
func_args: dict | None = None
packages_to_install: list[str] | None = None
Same here, please revert these changes where needed
TORCH_TUNE = BuiltinTrainer.__annotations__["config"].__name__.lower().replace("config", "")
TORCH_TUNE = BuiltinTrainer.__annotations__["config"].lower().replace("config", "")
What's the reason behind this change?
| "kubernetes>=27.2.0", | ||
| "pydantic>=2.10.0", | ||
| "kubeflow-trainer-api>=2.0.0", | ||
| "eval-type-backport>=0.2.0; python_version < '3.10'", | 
Why do we need this?
I think it provides backward compatibility for the new union-type syntax (PEP 604) on Python versions older than 3.10.
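A quick illustration of the case it covers (the model below is hypothetical): with deferred annotations, pydantic has to evaluate the string "list[dict] | None" at model-build time, and on Python 3.9 that only works if eval-type-backport is installed.

from __future__ import annotations  # annotations are stored as strings

from pydantic import BaseModel


class ExampleOverride(BaseModel):
    name: str
    # On Python 3.9, evaluating "list[dict] | None" requires eval-type-backport;
    # on 3.10+ the PEP 604 union evaluates natively.
    env: list[dict] | None = None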
@abhijeet-dhumal Did you get a chance to address this comment: #91 (comment), and @kramaranya's suggestions?
What this PR does / why we need it:
Fixes #87, #92, #116
Sample import styles for Options:
Job Metadata
Container runtime customisation:
Pod level overrides:
Local process backend options:
Complete example:
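The original code samples did not survive the copy; as a hedged sketch of the intended usage (import paths follow the modules quoted in the review, while constructor shapes such as PodTemplateOverrides([...]), Name(...), Labels(...), and get_runtime("torch-distributed") are assumptions and may differ from the final API):

from kubeflow.trainer import CustomTrainer, TrainerClient
from kubeflow.trainer.options.common import (
    ContainerOverride,
    PodTemplateOverride,
    PodTemplateSpecOverride,
)
from kubeflow.trainer.options.kubernetes import Annotations, Labels, Name, PodTemplateOverrides


def train_fn():
    print("training...")


client = TrainerClient()
client.train(
    runtime=client.get_runtime("torch-distributed"),
    trainer=CustomTrainer(func=train_fn),
    options=[
        Name("my-train-job"),
        Labels({"team": "ml-platform"}),
        Annotations({"sidecar.istio.io/inject": "false"}),
        PodTemplateOverrides(
            [
                PodTemplateOverride(
                    target_jobs=["node"],
                    spec=PodTemplateSpecOverride(
                        node_selector={"gpu": "a100"},
                        volumes=[{"name": "data", "persistentVolumeClaim": {"claimName": "dataset-pvc"}}],
                        containers=[
                            ContainerOverride(
                                name="trainer",
                                volume_mounts=[{"name": "data", "mountPath": "/data"}],
                            )
                        ],
                    ),
                )
            ]
        ),
    ],
)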
Checklist: