kubeflow
diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 57 additions & 12 deletions
diff --git a/kubeflow/trainer/api/trainer_client.py b/kubeflow/trainer/api/trainer_client.py
Lines changed: 8 additions & 3 deletions
diff --git a/kubeflow/trainer/backends/base.py b/kubeflow/trainer/backends/base.py
Lines changed: 3 additions & 1 deletion
@@ -7,18 +7,19 @@
 ## Agent Behavior Policy
 
 AI agents should:
+
 - Make atomic, minimal, and reversible changes.
 - Prefer local analysis (`uv run`, `make verify`, `pytest`) before proposing commits.
 - NEVER modify configuration, CI/CD, or release automation unless explicitly requested.
 - Avoid non-deterministic code or random seeds without fixtures.
 - Use `AGENTS.md` and `Makefile` as the source of truth for development commands.
 
 Agents must NOT:
+
 - Bypass tests or linters
 - Introduce dependencies without updating `pyproject.toml`
 - Generate or commit large autogenerated files
 
-
 ### Context Awareness
 
 Before writing code, agents should:
@@ -27,7 +28,6 @@ Before writing code, agents should:
 - Match import patterns from neighboring files
 - Preserve existing logging and error-handling conventions
 
-
 ## Repository Map
 
 ```
@@ -53,17 +53,21 @@ Root files: AGENTS.md, README.md, pyproject.toml, Makefile, CI workflows
 ## Quick Start
 
 <!-- BEGIN: AGENT_COMMANDS -->
+
 **Setup**:
+
 ```bash
 make install-dev              # Install uv, create .venv, sync deps
 ```
 
 **Verify (CI parity)**:
+
 ```bash
 make verify                   # Runs ruff check --show-fixes and ruff format --check
 ```
 
 **Testing**:
+
 ```bash
 make test-python              # All unit tests + coverage (HTML by default)
 make test-python report=xml   # XML coverage report
@@ -73,38 +77,45 @@ uv run coverage run -m pytest <path> && uv run coverage report          # Ad-hoc
 ```
 
 **Local lint/format**:
+
 ```bash
 uv run ruff check --fix .     # Fix lint issues
 uv run ruff format kubeflow   # Format code
 ```
 
 **Type checking**:
+
 ```bash
 uv run mypy kubeflow          # Run type checker
 ```
 
 **Pre-commit**:
+
 ```bash
 uv run pre-commit install                    # Install hooks
 uv run pre-commit run --all-files           # Run all hooks
 ```
+
 <!-- END: AGENT_COMMANDS -->
 
 ## Development Workflow for AI Agents
 
 **Preferred commands**: use `uv run ...` to ensure tool consistency and `.venv` usage
 
 **Before making changes**:
+
 1. Read existing code patterns and docstrings for alignment
 2. Follow the Core Development Principles below
 3. Run validation commands before proposing changes
 
 **Validation before proposing changes**:
+
 - Lint/format: `make verify`
 - Tests: `make test-python` or targeted `pytest` invocations
 - Type checking: `uv run mypy kubeflow` (if available)
 
 **Commit/PR hygiene**:
+
 - Follow Conventional Commits in titles and messages
 - Include rationale ("why") in commit messages/PR descriptions
 - Do not push secrets or change git config
@@ -117,19 +128,22 @@ uv run pre-commit run --all-files           # Run all hooks
 **Always attempt to preserve function signatures, argument positions, and names for exported/public methods.**
 
 ❌ **Bad - Breaking Change:**
+
 ```python
 def train_model(id, verbose=False):  # Changed from `model_id`
     pass
 ```
 
 ✅ **Good - Stable Interface:**
+
 ```python
 def train_model(model_id: str, verbose: bool = False) -> TrainingResult:
     """Train model with optional verbose output."""
     pass
 ```
 
 **Before making ANY changes to public APIs:**
+
 - Check if the function/class is exported in `__init__.py`
 - Look for existing usage patterns in tests and examples
 - Use keyword-only arguments for new parameters: `*, new_param: str = "default"`
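
The keyword-only guidance above can be sketched as follows. This is a minimal illustration, not code from the repository: `train_model` mirrors the example signature earlier in this section, while the `priority` parameter and the string return value are hypothetical stand-ins chosen so the snippet runs on its own.

```python
def train_model(model_id: str, verbose: bool = False, *, priority: str = "normal") -> str:
    """Train a model; `priority` is a hypothetical parameter added after the first release.

    Everything after the bare `*` must be passed by keyword, so existing
    positional calls such as `train_model("m1", True)` keep their meaning.
    """
    return f"trained {model_id} (verbose={verbose}, priority={priority})"


# Existing call sites are unaffected by the new parameter.
print(train_model("m1"))
print(train_model("m1", True))

# New callers opt in explicitly by name.
print(train_model("m1", priority="high"))
```

Because the new parameter cannot be supplied positionally, a later reordering or insertion of keyword-only parameters does not silently change the behavior of existing callers.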
@@ -140,27 +154,30 @@ def train_model(model_id: str, verbose: bool = False) -> TrainingResult:
 **All Python code MUST include type hints and return types.**
 
 ❌ **Bad:**
+
 ```python
 def p(u, d):
     return [x for x in u if x not in d]
 ```
 
 ✅ **Good:**
+
 ```python
 def filter_completed_jobs(jobs: list[str], completed: set[str]) -> list[str]:
     """Filter out jobs that are already completed.
-    
+
     Args:
         jobs: List of job identifiers to filter.
         completed: Set of completed job identifiers.
-        
+
     Returns:
         List of jobs that are not yet completed.
     """
     return [job for job in jobs if job not in completed]
 ```
 
 **Style Requirements:**
+
 - Line length 100, Python 3.9 target, double quotes, spaces indent
 - Imports: isort via ruff; first-party is `kubeflow`; prefer absolute imports
 - Naming: pep8-naming; functions/vars `snake_case`, classes `PascalCase`, constants `UPPER_SNAKE_CASE`; prefix private with `_`
@@ -173,18 +190,21 @@ def filter_completed_jobs(jobs: list[str], completed: set[str]) -> list[str]:
 **Every new feature or bugfix MUST be covered by unit tests.**
 
 **Test Organization:**
+
 - Unit tests: `kubeflow/trainer/**/*_test.py` (no network calls allowed)
 - Use `pytest` as the testing framework
 - See `kubeflow/trainer/test/common.py` for fixtures and patterns
 - Unit tests must share a consistent structure (see `kubeflow/trainer/backends/kubernetes/backend_test.py` for reference)
 
 **Test Structure Pattern** (following `backend_test.py`):
+
 - Use `TestCase` dataclass for parametrized tests
 - Include `name`, `expected_status`, `config`, `expected_output/error` fields
 - Print test execution status for debugging
 - Handle both success and exception cases in the same test function
 
 **Test Quality Checklist:**
+
 - [ ] Tests fail when your new logic is broken
 - [ ] Happy path is covered
 - [ ] Edge cases and error conditions are tested
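
The test structure pattern described above might look roughly like the sketch below. It is an assumption-laden illustration rather than a copy of `backend_test.py`: the `SUCCESS`/`FAILED` constants, the exact field types, and the `run_case` helper are hypothetical, and a plain function stands in for the `@pytest.mark.parametrize("test_case", CASES)` decorator so the snippet runs without pytest installed.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Hypothetical status markers; the repository may define its own.
SUCCESS = "success"
FAILED = "failed"


@dataclass
class TestCase:
    """One parametrized scenario: inputs plus the expected outcome."""
    name: str
    expected_status: str
    config: dict[str, Any] = field(default_factory=dict)
    expected_output: Optional[Any] = None
    expected_error: Optional[type[Exception]] = None


def filter_completed_jobs(jobs: list[str], completed: set[str]) -> list[str]:
    """Function under test, taken from the example earlier in this document."""
    return [job for job in jobs if job not in completed]


CASES = [
    TestCase(
        name="filters completed jobs",
        expected_status=SUCCESS,
        config={"jobs": ["job-1", "job-2"], "completed": {"job-1"}},
        expected_output=["job-2"],
    ),
    TestCase(
        name="rejects a missing job list",
        expected_status=FAILED,
        config={"jobs": None, "completed": set()},
        expected_error=TypeError,  # iterating over None raises TypeError
    ),
]


def run_case(test_case: TestCase) -> str:
    """Handle the success and the exception path in one test body."""
    print("Executing test:", test_case.name)
    try:
        result = filter_completed_jobs(**test_case.config)
    except Exception as exc:
        assert test_case.expected_status == FAILED
        assert type(exc) is test_case.expected_error
    else:
        assert test_case.expected_status == SUCCESS
        assert result == test_case.expected_output
    print("Test execution complete")
    return test_case.expected_status


for case in CASES:
    run_case(case)
```

Using `try`/`except`/`else` keeps assertion failures on the success path from being swallowed by the exception handler, which is one way to satisfy the "tests fail when your new logic is broken" item in the checklist.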
@@ -194,19 +214,21 @@ def filter_completed_jobs(jobs: list[str], completed: set[str]) -> list[str]:
 **Test Examples:**
 
 Simple test:
+
 ```python
 def test_filter_completed_jobs():
     """Test filtering completed jobs from a list."""
     jobs = ["job-1", "job-2", "job-3"]
     completed = {"job-1", "job-2"}
-    
+
     result = filter_completed_jobs(jobs, completed)
-    
+
     assert result == ["job-3"]
     assert len(result) == 1
 ```
 
 Parametrized test cases (preferred for multiple scenarios):
+
 ```python
 @pytest.mark.parametrize(
     "test_case",
@@ -234,20 +256,23 @@ def test_filter_jobs_parametrized(test_case):
 ### 4. Security and Risk Assessment
 
 **Security Checklist:**
+
 - [ ] No `eval()`, `exec()`, or `pickle` on user-controlled input
 - [ ] Handle exceptions properly (no bare `except:`) and use descriptive error messages
 - [ ] Remove unreachable/commented code before committing
 - [ ] Ensure proper resource cleanup (file handles, connections)
 - [ ] No secrets in code, logs, or examples
 
 ❌ **Bad:**
+
 ```python
 def load_config(path):
     with open(path) as f:
         return eval(f.read())  # ⚠️ Never eval user input
 ```
 
 ✅ **Good:**
+
 ```python
 import yaml
 
@@ -262,31 +287,34 @@ def load_config(path: str) -> dict:
 **Use Google-style docstrings with Args section for all public functions.**
 
 ❌ **Insufficient Documentation:**
+
 ```python
 def submit_job(name, config):
     """Submit a job."""
 ```
 
 ✅ **Complete Documentation:**
+
 ```python
 def submit_job(name: str, config: dict, *, priority: str = "normal") -> str:
     """Submit a training job with specified configuration.
-    
+
     Args:
         name: The job name identifier.
         config: Job configuration dictionary.
         priority: Job priority level ('low', 'normal', 'high').
-        
+
     Returns:
         Job ID string for tracking the submitted job.
-        
+
     Raises:
         InvalidConfigError: If the configuration is invalid.
         ResourceUnavailableError: If required resources are not available.
     """
 ```
 
 **Documentation Guidelines:**
+
 - Types go in function signatures, NOT in docstrings
 - Focus on "why" rather than "what" in descriptions
 - Document all parameters, return values, and exceptions
@@ -298,6 +326,7 @@ def submit_job(name: str, config: dict, *, priority: str = "normal") -> str:
 **When you encounter code that could be improved, suggest better designs:**
 
 ❌ **Poor Design:**
+
 ```python
 def process_training(data, k8s_client, storage, logger):
     # Function doing too many things
@@ -309,21 +338,22 @@ def process_training(data, k8s_client, storage, logger):
 ```
 
 ✅ **Better Design:**
+
 ```python
 @dataclass
 class TrainingJobResult:
     """Result of training job submission."""
     job_id: str
     status: str
     created_at: datetime
-    
+
 class TrainingJobManager:
     """Handles training job lifecycle operations."""
-    
+
     def __init__(self, k8s_client: KubernetesClient, storage: Storage):
         self.k8s = k8s_client
         self.storage = storage
-        
+
     def submit_job(self, config: TrainingConfig) -> TrainingJobResult:
         """Submit and track a new training job."""
         validated_config = self._validate_config(config)
@@ -343,27 +373,39 @@ class TrainingJobManager:
 **Trainer Types**:
 
 **CustomTrainer** (`kubeflow.trainer.types.CustomTrainer`):
+
 - **Purpose**: For custom, self-contained training functions that you write yourself
 - **Flexibility**: Complete control over the training process
 - **Use case**: "Bring your own training code" - maximum flexibility
 - **Key attributes**: `func` (your training function), `func_args`, `packages_to_install`, `pip_index_urls`, `num_nodes`, `resources_per_node`, `env`
 
+**CustomTrainerContainer** (`kubeflow.trainer.types.CustomTrainerContainer`):
+
+- **Purpose**: For a custom, self-contained container image that you build yourself
+- **Flexibility**: Complete control over the training process
+- **Use case**: "Bring your own training image" - maximum flexibility
+- **Key attributes**: `num_nodes`, `resources_per_node`, `env`
+
 **BuiltinTrainer** (`kubeflow.trainer.types.BuiltinTrainer`):
+
 - **Purpose**: For pre-built training frameworks with existing fine-tuning logic
 - **Convenience**: Just configure parameters, training logic is already implemented
 - **Use case**: "Use our pre-built trainers" - convenience for common scenarios
 - **Key attributes**: `config` (currently only supports `TorchTuneConfig` for LLM fine-tuning with TorchTune)
 
 **Backends**:
+
 - `localprocess`: local execution for fast iteration
 - `kubernetes`: K8s-backed jobs, see `backends/kubernetes`
 
 **Typical flow**:
+
 1. Get runtime, define trainer, submit with `TrainerClient().train(...)`
 2. `wait_for_job_status(...)` then fetch logs with `get_job_logs(...)`
 3. For full example, see README "Run your first PyTorch distributed job"
 
 **Integration patterns**:
+
 - Follow existing patterns in `kubeflow.trainer.backends` for new backends
 - Use `kubeflow.trainer.types` for data models and type definitions
 - Implement proper error handling and resource cleanup
@@ -372,6 +414,7 @@ class TrainingJobManager:
 ## CI & PRs
 
 **PR Requirements**:
+
 - Title must follow Conventional Commits:
   - Types: `chore`, `fix`, `feat`, `revert`
   - Scopes: `ci`, `docs`, `examples`, `scripts`, `test`, `trainer`
@@ -381,9 +424,11 @@ class TrainingJobManager:
 ## Releasing
 
 **Version management**:
+
 ```bash
 make release VERSION=X.Y.Z   # Updates kubeflow/__init__.py and generates changelog
 ```
+
 - Do not commit secrets; verify coverage and lint pass before tagging
 
 ## Troubleshooting
 
@@ -95,21 +95,26 @@ def train(
         self,
         runtime: Optional[types.Runtime] = None,
         initializer: Optional[types.Initializer] = None,
-        trainer: Optional[Union[types.CustomTrainer, types.BuiltinTrainer]] = None,
+        trainer: Optional[
+            Union[types.CustomTrainer, types.CustomTrainerContainer, types.BuiltinTrainer]
+        ] = None,
     ) -> str:
         """Create a TrainJob. You can configure the TrainJob using one of these trainers:
 
         - CustomTrainer: Runs training with a user-defined function that fully encapsulates the
             training process.
+        - CustomTrainerContainer: Runs training with a user-defined image that fully encapsulates
+            the training process.
         - BuiltinTrainer: Uses a predefined trainer with built-in post-training logic, requiring
             only parameter configuration.
 
         Args:
             runtime: Optional reference to one of the existing runtimes. Defaults to the
                 torch-distributed runtime if not provided.
             initializer: Optional configuration for the dataset and model initializers.
-            trainer: Optional configuration for a CustomTrainer or BuiltinTrainer. If not specified,
-                the TrainJob will use the runtime's default values.
+            trainer: Optional configuration for a CustomTrainer, CustomTrainerContainer, or
+                BuiltinTrainer. If not specified, the TrainJob will use the
+                runtime's default values.
 
         Returns:
             The unique name of the TrainJob that has been generated.
 
@@ -38,7 +38,9 @@ def train(
         self,
         runtime: Optional[types.Runtime] = None,
         initializer: Optional[types.Initializer] = None,
-        trainer: Optional[Union[types.CustomTrainer, types.BuiltinTrainer]] = None,
+        trainer: Optional[
+            Union[types.CustomTrainer, types.CustomTrainerContainer, types.BuiltinTrainer]
+        ] = None,
     ) -> str:
         raise NotImplementedError()
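
Widening the `trainer` union means each concrete backend now has one more variant to handle. The diff does not show how backends branch on the trainer type, so the sketch below is only an assumption about what such handling could look like. The three dataclasses are simplified stand-ins for the classes in `kubeflow.trainer.types` (only a few of the attributes listed in `AGENTS.md` are modeled), and `describe_trainer` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Union


@dataclass
class CustomTrainer:
    """Stand-in for types.CustomTrainer (user-defined training function)."""
    func: Callable[..., Any]
    num_nodes: Optional[int] = None


@dataclass
class CustomTrainerContainer:
    """Stand-in for types.CustomTrainerContainer (user-defined image)."""
    num_nodes: Optional[int] = None


@dataclass
class BuiltinTrainer:
    """Stand-in for types.BuiltinTrainer (pre-built training logic)."""
    config: Any = None


Trainer = Union[CustomTrainer, CustomTrainerContainer, BuiltinTrainer]


def describe_trainer(trainer: Optional[Trainer] = None) -> str:
    """Hypothetical helper: pick a code path for each member of the union."""
    if trainer is None:
        return "runtime defaults"
    if isinstance(trainer, CustomTrainer):
        return "user-defined function"
    if isinstance(trainer, CustomTrainerContainer):
        return "user-defined image"
    if isinstance(trainer, BuiltinTrainer):
        return "built-in trainer"
    # Fail loudly so a future union member is not silently ignored.
    raise ValueError(f"Unsupported trainer type: {type(trainer).__name__}")


print(describe_trainer(CustomTrainerContainer(num_nodes=2)))
```

The explicit `ValueError` at the end is a design choice worth keeping in any real implementation: if the union grows again, an unhandled variant surfaces as an error instead of falling through to the runtime defaults.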