Conversation

@abhijeet-dhumal (Contributor) commented Sep 12, 2025

What this PR does / why we need it:

Fixes #87, #92, #116

Sample import styles for Options:

# Style 1: Top-level imports (Kubernetes + common options only)
from kubeflow.trainer import (
    Labels, Annotations, SpecLabels, SpecAnnotations,
    Name, PodTemplateOverrides, TrainerImage, TrainerCommand, TrainerArgs,
    PodTemplateOverride, PodTemplateSpecOverride, ContainerOverride
)

# Style 2: Import all options - SDK validates at runtime and rejects incompatible ones
from kubeflow.trainer.options import (
    # Kubernetes options
    Labels, Annotations, SpecLabels, SpecAnnotations, Name,
    PodTemplateOverrides, TrainerImage, TrainerCommand, TrainerArgs,
    # Common options (used in PodTemplateOverrides)
    PodTemplateOverride, PodTemplateSpecOverride, ContainerOverride,
    # LocalProcess options
    ProcessTimeout, WorkingDirectory
)

# Style 3: Backend-specific imports
from kubeflow.trainer.options.kubernetes import (
    Labels, Annotations, SpecLabels, SpecAnnotations, Name,
    PodTemplateOverrides, TrainerImage, TrainerCommand, TrainerArgs
)
from kubeflow.trainer.options.common import (
    PodTemplateOverride, PodTemplateSpecOverride, ContainerOverride
)
from kubeflow.trainer.options.localprocess import (
    ProcessTimeout, WorkingDirectory
)
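
For illustration, this is roughly how the runtime validation described under Style 2 would surface to a caller. The exception type (ValueError) is an assumption for the sketch, not taken from this PR:

from kubeflow.trainer import CustomTrainer, KubernetesBackendConfig, TrainerClient
from kubeflow.trainer.options import ProcessTimeout

client = TrainerClient(backend_config=KubernetesBackendConfig())

# ProcessTimeout targets the local-process backend, so a Kubernetes-backed
# client is expected to reject it when train() is called.
try:
    client.train(
        runtime=client.get_runtime("torch-distributed"),
        trainer=CustomTrainer(func=lambda: None, num_nodes=1),
        options=[ProcessTimeout(timeout_seconds=3600)],
    )
except ValueError as e:  # exception type assumed for illustration
    print(f"Rejected incompatible option: {e}")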

Job Metadata

# TrainJob resource metadata (TrainJob-level labels and annotations)
Labels(labels={"team": "ml-platform", "project": "llm"})
Annotations(annotations={"description": "Fine-tuning job", "owner": "user"})
Name(name="my-custom-job-name")

# Derivative JobSet and Jobs metadata (pod-level labels and annotations)
SpecLabels(labels={"app": "training", "version": "v1.0", "env": "prod"})
SpecAnnotations(annotations={"prometheus.io/scrape": "true", "cost-center": "ai"})

Container runtime customization:

TrainerImage(image="custom/pytorch:2.0-cuda11.8")
TrainerCommand(command=["python", "train.py"])
TrainerArgs(args=["--epochs", "100", "--lr", "0.001", "--batch-size", "32"])

Pod-level overrides:

PodTemplateOverrides(pod_template_overrides=[
    PodTemplateOverride(
        target_jobs=["node", "launcher"],
        metadata={  # Optional: pod-level labels/annotations
            "labels": {"tier": "training"},
            "annotations": {"sidecar.istio.io/inject": "false"}
        },
        spec=PodTemplateSpecOverride(
            service_account_name="training-sa",
            node_selector={"gpu": "true", "zone": "us-west1"},
            affinity={
                "nodeAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": {
                        "nodeSelectorTerms": [{
                            "matchExpressions": [{
                                "key": "gpu-type",
                                "operator": "In",
                                "values": ["nvidia-a100"]
                            }]
                        }]
                    }
                }
            },
            tolerations=[
                {"key": "gpu", "operator": "Exists", "effect": "NoSchedule"},
                {"key": "high-priority", "operator": "Equal", "value": "true", "effect": "NoSchedule"}
            ],
            volumes=[
                {"name": "data", "emptyDir": {}},
                {"name": "model", "persistentVolumeClaim": {"claimName": "model-pvc"}}
            ],
            containers=[
                ContainerOverride(
                    name="node",
                    env=[
                        {"name": "NCCL_DEBUG", "value": "INFO"},
                        {"name": "WANDB_API_KEY", "valueFrom": {"secretKeyRef": {
                            "name": "wandb-secret", "key": "api-key"
                        }}}
                    ],
                    volume_mounts=[
                        {"name": "data", "mountPath": "/data"},
                        {"name": "model", "mountPath": "/model"}
                    ]
                )
            ],
            init_containers=[
                ContainerOverride(
                    name="init-setup",
                    env=[{"name": "SETUP_MODE", "value": "fast"}],
                    volume_mounts=[{"name": "data", "mountPath": "/setup"}]
                )
            ],
            scheduling_gates=[{"name": "wait-for-quota"}],
            image_pull_secrets=[{"name": "registry-credentials"}]
        )
    )
])

Local process backend options:

from kubeflow.trainer.options import ProcessTimeout, WorkingDirectory
ProcessTimeout(timeout_seconds=3600)
WorkingDirectory(working_dir="/custom/training/path")
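
As a sketch of how these fit together with the client (the local backend's exact train() signature is assumed here, not taken from this PR):

from kubeflow.trainer import CustomTrainer, LocalProcessBackendConfig, TrainerClient
from kubeflow.trainer.options import ProcessTimeout, WorkingDirectory

client = TrainerClient(backend_config=LocalProcessBackendConfig())
job_name = client.train(
    trainer=CustomTrainer(func=lambda: print("Training locally!")),
    options=[
        ProcessTimeout(timeout_seconds=3600),  # stop the process after one hour
        WorkingDirectory(working_dir="/custom/training/path"),  # run from this directory
    ],
)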

Complete example:

job_name = client.train(
    runtime=client.get_runtime("torch-distributed"),
    trainer=CustomTrainer(
        func=lambda: print("Training LLM!"),
        num_nodes=2,
        resources_per_node={
            "cpu": "8",
            "memory": "32Gi",
            "nvidia.com/gpu": "2"
        },
        packages_to_install=["torch==2.0.0", "transformers", "datasets"],
        env={"WANDB_PROJECT": "llm-training"}
    ),
    options=[
        # TrainJob resource metadata
        Name(name="llm-finetune-001"),
        Labels(labels={"owner": "ml-team", "project": "llm"}),
        Annotations(annotations={"description": "GPT fine-tuning experiment"}),
        
        # Derivative JobSet/Jobs labels (for training pods)
        SpecLabels(labels={"app": "training", "version": "v1.0", "env": "prod"}),
        SpecAnnotations(annotations={"prometheus.io/scrape": "true"}),
        
        # Trainer container customization
        TrainerImage(image="my-registry/pytorch:2.0-cuda11.8"),
        
        # Pod-level customizations
        PodTemplateOverrides(pod_template_overrides=[
            PodTemplateOverride(
                target_jobs=["node"],
                spec=PodTemplateSpecOverride(
                    service_account_name="gpu-training-sa",
                    node_selector={"accelerator": "nvidia-a100", "zone": "us-west1"},
                    tolerations=[
                        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
                    ],
                    volumes=[
                        {"name": "data", "persistentVolumeClaim": {"claimName": "training-data-pvc"}},
                        {"name": "cache", "emptyDir": {"sizeLimit": "50Gi"}}
                    ],
                    containers=[
                        ContainerOverride(
                            name="node",
                            env=[
                                {"name": "NCCL_DEBUG", "value": "INFO"},
                                {"name": "NCCL_IB_DISABLE", "value": "1"},
                                {"name": "WANDB_API_KEY", "valueFrom": {
                                    "secretKeyRef": {"name": "wandb-creds", "key": "api-key"}
                                }}
                            ],
                            volume_mounts=[
                                {"name": "data", "mountPath": "/data"},
                                {"name": "cache", "mountPath": "/cache"}
                            ]
                        )
                    ]
                )
            )
        ])
    ]
)

Checklist:

  • Docs included if any changes are user facing

@abhijeet-dhumal changed the title from "Add labels and annotations support for train client" to "feast: Add labels and annotations support for train client" on Sep 12, 2025
@abhijeet-dhumal changed the title from "feast: Add labels and annotations support for train client" to "feat: Add labels and annotations support for train client" on Sep 15, 2025
@abhijeet-dhumal force-pushed the add-labels-annotations-support#87 branch from 67d12d8 to b39b364 on September 15, 2025 05:22
@abhijeet-dhumal force-pushed the add-labels-annotations-support#87 branch 2 times, most recently from 2d7a8e6 to dbba135 on September 15, 2025 05:44
@abhijeet-dhumal changed the title from "feat: Add labels and annotations support for train client" to "feat: Implement Training Options pattern with WithLabels, WithAnnotations, and WithPodSpecOverrides for flexible TrainJob customization" on Sep 15, 2025
@abhijeet-dhumal changed the title from "feat: Implement Training Options pattern with WithLabels, WithAnnotations, and WithPodSpecOverrides for flexible TrainJob customization" to "feat: Implement Training Options pattern for flexible TrainJob customization" on Sep 15, 2025
@abhijeet-dhumal force-pushed the add-labels-annotations-support#87 branch from 36c0160 to 95155f6 on September 15, 2025 10:41
@abhijeet-dhumal marked this pull request as ready for review September 15, 2025 11:23
@szaher (Member) left a comment


Thanks @abhijeet-dhumal

I believe we need to decide how options will be applied across backends: either we ignore options for localprocess, or we make options targeted at a specific backend.

@abhijeet-dhumal marked this pull request as draft September 17, 2025 13:17
@abhijeet-dhumal force-pushed the add-labels-annotations-support#87 branch 3 times, most recently from f092845 to 03994e7 on September 24, 2025 11:40
@coveralls commented Sep 24, 2025

Pull Request Test Coverage Report for Build 18870799839

Details

  • 37 of 37 (100.0%) changed or added relevant lines in 2 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+2.7%) to 76.11%

Files with coverage reduction (new missed lines, %):

  • kubeflow/trainer/utils/utils.py: 1 missed line (70.09%)

Totals:

  • Change from base Build 18826380320: +2.7%
  • Covered Lines: 360
  • Relevant Lines: 473

💛 - Coveralls

@astefanutti (Contributor) left a comment


Left a few comments, otherwise looks good to me.

Comment on lines 21 to 31

from kubeflow.trainer.options.common import (
    ContainerOverride,
    PodTemplateOverride,
    PodTemplateSpecOverride,
)
from kubeflow.trainer.options.kubernetes import (
    Annotations,
    Labels,
    Name,
    PodTemplateOverrides,
    SpecAnnotations,
A Member commented:

Can we only start with PodTemplateOverrides options and remove other options for now?
We can introduce them later, if users require them.

Suggested change

Before:
from kubeflow.trainer.options.common import (
    ContainerOverride,
    PodTemplateOverride,
    PodTemplateSpecOverride,
)
from kubeflow.trainer.options.kubernetes import (
    Annotations,
    Labels,
    Name,
    PodTemplateOverrides,
    SpecAnnotations,

After:
from kubeflow.trainer.options.kubernetes import (
    Annotations,
    Labels,
    Name,
    PodTemplateOverrides,
    SpecAnnotations,

@abhijeet-dhumal (Author) replied:

Actually, if we're planning to implement PodTemplateOverrides, then we need all three, because they're direct mappings to the TrainJob CRD schema (verified against the latest trainer.kubeflow.org_trainjobs.yaml).
IMHO we can't flatten this. The nested structure is required by the API, and each class serves a distinct purpose: targeting (PodTemplateOverride), pod-level config (PodTemplateSpecOverride), and container-specific overrides (ContainerOverride).

The hierarchy mirrors TrainJob pod structure:

  • PodTemplateOverride targets specific job types (e.g., "node" vs "launcher"),
  • PodTemplateSpecOverride handles pod-level configs (serviceAccountName, nodeSelector, affinity, tolerations, volumes, imagePullSecrets, schedulingGates),
  • ContainerOverride lets users modify specific containers by name (env vars, volume mounts).

Without these, users can't do basic production scenarios like mounting datasets from PVCs, scheduling on GPU nodes, pulling from private registries, or injecting secrets as env vars. The CRD explicitly defines ContainerOverride with name/env/volumeMounts fields, and spec.containers and spec.initContainers are lists of ContainerOverride objects.

These are essential building blocks, not optional nice-to-haves.

# Mount dataset from PVC, add GPU affinity
PodTemplateOverride(
    target_jobs=["node"],
    spec=PodTemplateSpecOverride(
        node_selector={"gpu": "a100"},
        volumes=[{"name": "data", "persistentVolumeClaim": {"claimName": "dataset-pvc"}}],
        containers=[
            ContainerOverride(
                name="trainer",
                env=[{"name": "HF_TOKEN", "valueFrom": {"secretKeyRef": ...}}],
                volume_mounts=[{"name": "data", "mountPath": "/data"}]
            )
        ]
    )
)

WDYT? @andreyvelich @astefanutti

@abhijeet-dhumal (Author) added:

I'd be happy to discuss if I've misunderstood you!

A Contributor replied:

Yes that makes sense, I think that's acceptable. The other alternative would be to use dictionaries, but that looks less elegant.

from kubeflow.trainer.types import types


class TestTrainerClientOptionValidation:
A Member commented:

This is still open @abhijeet-dhumal.


def __call__(
    self,
    job_spec: dict,
A Contributor commented:

Should we type it as TrainerV1alpha1TrainJob, or keep it flexible?

@abhijeet-dhumal (Author) replied:

I think we can keep it flexible here?

job_spec = {}
if options:
    for option in options:
        option(job_spec, trainer)

Each option populates a plain dict that represents the TrainJob structure. The backend then extracts values from this dict and constructs the actual models.TrainerV1alpha1TrainJob object later, so each backend can handle conversion to its own types.
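
To make that contract concrete, a minimal option under this pattern might look like the following. This is an illustrative sketch of the protocol, not this PR's actual implementation:

from typing import Any, Optional


class Labels:
    """Illustrative option: merges labels into job_spec["metadata"]["labels"]."""

    def __init__(self, labels: dict[str, str]):
        self.labels = labels

    def __call__(self, job_spec: dict[str, Any], trainer: Optional[Any] = None) -> None:
        # Mutate the plain dict in place; the backend converts it to
        # models.TrainerV1alpha1TrainJob later.
        job_spec.setdefault("metadata", {}).setdefault("labels", {}).update(self.labels)


job_spec: dict[str, Any] = {}
Labels(labels={"team": "ml-platform"})(job_spec)
# job_spec is now {"metadata": {"labels": {"team": "ml-platform"}}}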

I'd be happy to discuss if I've misunderstood you!

A Contributor replied:

That sounds good, let's start by being flexible and iterate if needed.

@abhijeet-dhumal force-pushed the add-labels-annotations-support#87 branch from 338ed1c to 938b322 on October 27, 2025 15:16
@abhijeet-dhumal force-pushed the add-labels-annotations-support#87 branch from 938b322 to b1ce0f5 on October 27, 2025 15:27
@abhijeet-dhumal (Author) commented:

#91 (comment)

This is still open @abhijeet-dhumal.

@andreyvelich I don't understand this point, I'm afraid 😅
Could you please elaborate on it a bit more?

@astefanutti (Contributor) commented:

/lgtm

Thanks @abhijeet-dhumal for this awesome contribution!

@google-oss-prow google-oss-prow bot removed the lgtm label Oct 28, 2025
@astefanutti (Contributor) commented:

@abhijeet-dhumal Thanks!

/lgtm
/approve

/hold for @andreyvelich approval

@google-oss-prow commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

metadata=models.IoK8sApimachineryPkgApisMetaV1ObjectMeta(
    name=train_job_name, labels=labels, annotations=annotations
),
spec=models.TrainerV1alpha1TrainJobSpec(


@abhijeet-dhumal, awesome, looks good to me now!

@google-oss-prow commented:

@bohdandenysenko: changing LGTM is restricted to collaborators

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kramaranya (Contributor) commented:

/hold
Sorry, will review it today!

@andreyvelich (Member) commented:

@abhijeet-dhumal As I suggested here: https://github.com/kubeflow/sdk/pull/91/files#r2412261172, can you refactor unit tests: https://github.com/kubeflow/sdk/blob/d32626e4390cd2f9cc25407b88a82bd1cf09973a/kubeflow/trainer/api/trainer_client_test.py ?

The test cases for train API in Kubernetes backend can go to:

def test_train(kubernetes_backend, test_case):

Similar to Kubernetes, we can create unit tests for train() API in local subprocess: https://github.com/kubeflow/sdk/blob/d32626e4390cd2f9cc25407b88a82bd1cf09973a/kubeflow/trainer/backends/localprocess/backend_test.py

@kramaranya (Contributor) left a comment


Thank you @abhijeet-dhumal for this!
I left a few comments.

Comment on lines +77 to -65
"Name",
"PodTemplateOverride",
"PodTemplateOverrides",
"PodTemplateSpecOverride",
"Runtime",
"RuntimeTrainer",
"TorchTuneConfig",
"TorchTuneInstructDataset",
"RuntimeTrainer",
"TrainerArgs",
"TrainerClient",
"TrainerCommand",
"TrainerImage",
"TrainerType",
"LocalProcessBackendConfig",
"KubernetesBackendConfig",

SpecLabels and SpecAnnotations are missing

Comment on lines +114 to +115
options: Optional list of configuration options to apply to the TrainJob. Use
Labels and Annotations for basic metadata configuration.

I'd suggest also mentioning container and pod options

Comment on lines +222 to +226
# Generate unique name for the TrainJob if not provided
train_job_name = name or (
random.choice(string.ascii_lowercase)
+ uuid.uuid4().hex[: constants.TRAINJOB_NAME_UUID_LENGTH]
)

What if the name user provides is not unique?

Comment on lines -121 to +125

Before:
f"Timeout to get {constants.CLUSTER_TRAINING_RUNTIME_PLURAL}: "
f"{self.namespace}/{name}"

After:
f"Timeout to get {constants.CLUSTER_TRAINING_RUNTIME_PLURAL}: {name}"

Since you dropped namespace here, could you also update other instances? For example in list_runtimes https://github.com/kubeflow/sdk/blob/main/kubeflow/trainer/backends/kubernetes/backend.py#L90-L98

def __call__(
    self,
    job_spec: dict[str, Any],
    trainer: BuiltinTrainer | CustomTrainer | None = None,

This requires Python 3.10; since we still support 3.9, could you use Optional/Union instead?
https://peps.python.org/pep-0604/
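
For reference, a 3.9-compatible spelling of that signature would be along these lines (the class name SomeOption and the imports are assumed for the sketch):

from typing import Any, Optional, Union

from kubeflow.trainer import BuiltinTrainer, CustomTrainer


class SomeOption:
    def __call__(
        self,
        job_spec: dict[str, Any],
        trainer: Optional[Union[BuiltinTrainer, CustomTrainer]] = None,
    ) -> None:
        ...  # dict[str, Any] itself is fine on 3.9 (PEP 585); only `X | Y` needs 3.10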

Comment on lines +41 to +42
env: list[dict] | None = None
volume_mounts: list[dict] | None = None

Same here, please use Optional/Union

Comment on lines -43 to +63

Before:
func: Callable
func_args: Optional[dict] = None
packages_to_install: Optional[list[str]] = None

After:
func: Callable | None = None
func_args: dict | None = None
packages_to_install: list[str] | None = None

Same here, please revert these changes where needed

Comment on lines -201 to +219

Before:
TORCH_TUNE = BuiltinTrainer.__annotations__["config"].__name__.lower().replace("config", "")

After:
TORCH_TUNE = BuiltinTrainer.__annotations__["config"].lower().replace("config", "")

What's the reason behind this change?

"kubernetes>=27.2.0",
"pydantic>=2.10.0",
"kubeflow-trainer-api>=2.0.0",
"eval-type-backport>=0.2.0; python_version < '3.10'",

Why do we need this?

A Contributor replied:

I think that provides backward compatibility for union types.
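
Concretely, a sketch of what it enables (illustrative, not code from this PR): with postponed annotations on Python 3.9, pydantic can only evaluate PEP 604 unions such as list[dict] | None if eval-type-backport is installed:

# Works natively on Python 3.10+; on 3.9 it needs eval-type-backport,
# because the postponed annotation string "list[dict] | None" must be
# evaluated when the pydantic model is built.
from __future__ import annotations

from pydantic import BaseModel


class ContainerOverride(BaseModel):
    name: str
    env: list[dict] | None = None
    volume_mounts: list[dict] | None = None


ContainerOverride(name="node", env=[{"name": "NCCL_DEBUG", "value": "INFO"}])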

@andreyvelich (Member) commented:

@abhijeet-dhumal Did you get a chance to address this comment: #91 (comment), and @kramaranya suggestions ?


Development

Successfully merging this pull request may close these issues:

  • Expose TrainJob labels and annotations in the SDK