From 3af0aed0c75717fa79e5622e241f7e6ed8014742 Mon Sep 17 00:00:00 2001 From: Peter Pan Date: Tue, 7 Apr 2026 10:11:04 +0800 Subject: [PATCH 1/9] docs(proposals): add draft KEP for inject PET envs into trainer init containers Signed-off-by: Peter Pan --- .../3416-pet-env-init-containers/README.md | 87 +++++++++++++++++++ 1 file changed, 87 insertions(+) create mode 100644 docs/proposals/3416-pet-env-init-containers/README.md diff --git a/docs/proposals/3416-pet-env-init-containers/README.md b/docs/proposals/3416-pet-env-init-containers/README.md new file mode 100644 index 0000000000..10b40f6cee --- /dev/null +++ b/docs/proposals/3416-pet-env-init-containers/README.md @@ -0,0 +1,87 @@ +# KEP-3416: Inject Torch Distributed PET_* Envs into Trainer Init Containers + +## Summary + +Torch `EnforceMLPolicy` injects PET_* envs only into trainer main container today. This KEP proposes to inject same PET_* envs into trainer init containers as well. + +## Motivation + +Init containers cannot read distributed topology envs (`PET_NNODES`, `PET_NPROC_PER_NODE`, `PET_NODE_RANK`, `PET_MASTER_ADDR`, `PET_MASTER_PORT`). This blocks preflight distributed checks before expensive training startup. + +Current code facts: + +- `pkg/runtime/framework/plugins/torch/torch.go`: `EnforceMLPolicy` updates only main trainer container path. +- `pkg/runtime/runtime.go`: `PodSet` already stores both `InitContainers` and `Containers`. +- `pkg/runtime/framework/plugins/jobset/jobset.go`: Build sync writes `ps.Containers` back, but not `ps.InitContainers`. + +### Goals + +- Inject PET_* envs to trainer main container and all trainer init containers. +- Keep one deterministic env source for both container types. +- Keep scheduler behavior unchanged. + +### Non-Goals + +- Change CRD or API schema. +- Add new user-facing field. +- Change scheduling semantics. + +## Proposal + +Apply PET_* env set to all containers in trainer `PodSet` (`AncestorTrainer`) with same values. + +## Design Details + +### Runtime helper + +Add helper for init-container lookup by podset ancestor and container name, or generalize existing lookup helper to support both main and init containers. + +### Torch plugin changes + +In `EnforceMLPolicy`, after PET_* values are computed: + +- Keep existing injection to trainer main container. +- Add injection to every trainer init container. +- Keep torchtune command mutation scoped to trainer main container only. + +### JobSet plugin changes + +In `Build`, mirror existing sync logic for `ps.Containers` to `ps.InitContainers`: + +- Sync command, image, env, ports, and volumeMounts where applicable. +- Write updates to `ReplicatedJobs[*].Template.Spec.Template.Spec.InitContainers`. + +### Validation and Safety + +Reserved-env validation currently checks only `spec.trainer.env`. This KEP does not expand API validation scope. + +### Compatibility + +Backward compatible. Jobs without init containers are unchanged. + +## Test Plan + +- [x] I/we understand the owners of involved components may require updates to existing tests before implementation is merged. + +### Unit Tests + +- Add torch plugin unit test with trainer `PodSet` containing init containers. +- Verify PET_* env injection for main and init containers. +- Add or extend JobSet Build test to verify init container sync in final JobSet spec. + +## Implementation History + +- **TBD**: Issue opened [#3416](https://github.com/kubeflow/trainer/issues/3416) +- **2026-04-07**: Initial KEP drafted. + +## Alternatives + +### Alternative 1: Inject only into selected init containers + +- **Pros:** Smaller runtime mutation scope. +- **Cons:** Need selection API/annotation and user education; less predictable behavior. + + +## Open Questions + +Default behavior injects into all trainer init containers. If opt-out is needed later, it can be added in a follow-up KEP (for example, annotation-based control). From d7f04bed0bbcbaca748826e34106103d39dade53 Mon Sep 17 00:00:00 2001 From: Peter Pan Date: Tue, 7 Apr 2026 13:19:37 +0800 Subject: [PATCH 2/9] fix comments Signed-off-by: Peter Pan --- .../proposals/3416-pet-env-init-containers/README.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/proposals/3416-pet-env-init-containers/README.md b/docs/proposals/3416-pet-env-init-containers/README.md index 10b40f6cee..45c47aa529 100644 --- a/docs/proposals/3416-pet-env-init-containers/README.md +++ b/docs/proposals/3416-pet-env-init-containers/README.md @@ -1,8 +1,8 @@ -# KEP-3416: Inject Torch Distributed PET_* Envs into Trainer Init Containers +# KEP-3416: Inject Torch Distributed `PET_*` Envs into Trainer Init Containers ## Summary -Torch `EnforceMLPolicy` injects PET_* envs only into trainer main container today. This KEP proposes to inject same PET_* envs into trainer init containers as well. +Torch `EnforceMLPolicy` injects `PET_*` envs only into trainer main container today. This KEP proposes to inject same `PET_*` envs into trainer init containers as well. ## Motivation @@ -16,7 +16,7 @@ Current code facts: ### Goals -- Inject PET_* envs to trainer main container and all trainer init containers. +- Inject `PET_*` envs to trainer main container and all trainer init containers. - Keep one deterministic env source for both container types. - Keep scheduler behavior unchanged. @@ -28,7 +28,7 @@ Current code facts: ## Proposal -Apply PET_* env set to all containers in trainer `PodSet` (`AncestorTrainer`) with same values. +Apply `PET_*` env set to all containers in trainer `PodSet` (`AncestorTrainer`) with same values. ## Design Details @@ -38,7 +38,7 @@ Add helper for init-container lookup by podset ancestor and container name, or g ### Torch plugin changes -In `EnforceMLPolicy`, after PET_* values are computed: +In `EnforceMLPolicy`, after `PET_*` values are computed: - Keep existing injection to trainer main container. - Add injection to every trainer init container. @@ -66,7 +66,7 @@ Backward compatible. Jobs without init containers are unchanged. ### Unit Tests - Add torch plugin unit test with trainer `PodSet` containing init containers. -- Verify PET_* env injection for main and init containers. +- Verify `PET_*` env injection for main and init containers. - Add or extend JobSet Build test to verify init container sync in final JobSet spec. ## Implementation History From 1362658751069aa4ebb97e379fef9c2c16ed9d1e Mon Sep 17 00:00:00 2001 From: Peter Pan Date: Thu, 16 Apr 2026 14:01:59 +0800 Subject: [PATCH 3/9] address comments Signed-off-by: Peter Pan --- .../3416-pet-env-init-containers/README.md | 38 ++++++++++++++++++- 1 file changed, 37 insertions(+), 1 deletion(-) diff --git a/docs/proposals/3416-pet-env-init-containers/README.md b/docs/proposals/3416-pet-env-init-containers/README.md index 45c47aa529..66be5d4123 100644 --- a/docs/proposals/3416-pet-env-init-containers/README.md +++ b/docs/proposals/3416-pet-env-init-containers/README.md @@ -8,12 +8,20 @@ Torch `EnforceMLPolicy` injects `PET_*` envs only into trainer main container to Init containers cannot read distributed topology envs (`PET_NNODES`, `PET_NPROC_PER_NODE`, `PET_NODE_RANK`, `PET_MASTER_ADDR`, `PET_MASTER_PORT`). This blocks preflight distributed checks before expensive training startup. +Also, this proposal only solves env visibility for init containers. It does not change DNS publishing behavior. If preflight scripts need to resolve `PET_MASTER_ADDR` before Pods become Ready, the runtime network settings must allow publishing pod DNS records for not-ready Pods (for example, `publishNotReadyAddresses: true` in JobSet network config). + Current code facts: - `pkg/runtime/framework/plugins/torch/torch.go`: `EnforceMLPolicy` updates only main trainer container path. - `pkg/runtime/runtime.go`: `PodSet` already stores both `InitContainers` and `Containers`. - `pkg/runtime/framework/plugins/jobset/jobset.go`: Build sync writes `ps.Containers` back, but not `ps.InitContainers`. +Which envs are needed by preflight: + +- Always needed for distributed connection checks: `PET_MASTER_ADDR`, `PET_MASTER_PORT`, `PET_NODE_RANK`. +- Commonly needed by launch logic: `PET_NNODES`, `PET_NPROC_PER_NODE`. +- `PET_NODE_RANK` can be read from Pod metadata (`batch.kubernetes.io/job-completion-index`), but the other values are runtime-derived and must still be injected by the plugin. + ### Goals - Inject `PET_*` envs to trainer main container and all trainer init containers. @@ -25,6 +33,7 @@ Current code facts: - Change CRD or API schema. - Add new user-facing field. - Change scheduling semantics. +- Change JobSet network defaults or DNS behavior. ## Proposal @@ -51,10 +60,23 @@ In `Build`, mirror existing sync logic for `ps.Containers` to `ps.InitContainers - Sync command, image, env, ports, and volumeMounts where applicable. - Write updates to `ReplicatedJobs[*].Template.Spec.Template.Spec.InitContainers`. +### Which `PET_*` values come from where + +- `PET_NODE_RANK`: comes from Pod metadata field `batch.kubernetes.io/job-completion-index`. +- `PET_MASTER_ADDR`: computed by runtime naming convention for the master Pod DNS name. +- `PET_MASTER_PORT`: set from trainer runtime port config. +- `PET_NNODES`, `PET_NPROC_PER_NODE`: derived from TrainJob/runtime policy values. + ### Validation and Safety Reserved-env validation currently checks only `spec.trainer.env`. This KEP does not expand API validation scope. +### Networking prerequisite for preflight + +`PET_MASTER_ADDR` is injected as a DNS name (not a direct Pod IP). Because of that, preflight checks that run before readiness may fail to resolve the address when pod DNS records are not published for not-ready Pods. + +This KEP does not enforce any network setting. Runtime authors and users should ensure the selected runtime template has suitable JobSet network configuration when they depend on early DNS resolution (for example, `publishNotReadyAddresses: true`). + ### Compatibility Backward compatible. Jobs without init containers are unchanged. @@ -81,7 +103,21 @@ Backward compatible. Jobs without init containers are unchanged. - **Pros:** Smaller runtime mutation scope. - **Cons:** Need selection API/annotation and user education; less predictable behavior. +### Alternative 2: Run preflight in main container startup path + +- **Pros:** Works without injecting `PET_*` into init containers. +- **Cons:** Startup probes and entrypoint checks have different failure behavior from init-container gating, and they still depend on DNS/network settings for `PET_MASTER_ADDR` resolution. + ## Open Questions -Default behavior injects into all trainer init containers. If opt-out is needed later, it can be added in a follow-up KEP (for example, annotation-based control). +Should this KEP include an opt-out switch to inject `PET_*` only into the main container? + +- Option A: keep this KEP simple and inject into all init containers by default. +- Option B: add annotation-based control (for example, `trainer.kubeflow.org/pet-init-env-injection: "false"`). + +If annotation-based control is added, rollout and compatibility policy must be explicit: + +- Keep current behavior as default first to avoid surprise changes for existing runtimes. +- Define conflict handling when users already set envs with the same names. +- Document migration steps if default behavior changes in a future release. From 342161ba6a49381a87174b11aba7b3bacf8f90c9 Mon Sep 17 00:00:00 2001 From: Peter Pan Date: Wed, 22 Apr 2026 16:35:00 +0800 Subject: [PATCH 4/9] Add OPT-In annotation Polish more Signed-off-by: Peter Pan --- .../3416-pet-env-init-containers/README.md | 53 +++++++++++-------- 1 file changed, 32 insertions(+), 21 deletions(-) diff --git a/docs/proposals/3416-pet-env-init-containers/README.md b/docs/proposals/3416-pet-env-init-containers/README.md index 66be5d4123..4b566c6633 100644 --- a/docs/proposals/3416-pet-env-init-containers/README.md +++ b/docs/proposals/3416-pet-env-init-containers/README.md @@ -2,7 +2,7 @@ ## Summary -Torch `EnforceMLPolicy` injects `PET_*` envs only into trainer main container today. This KEP proposes to inject same `PET_*` envs into trainer init containers as well. +Torch `EnforceMLPolicy` injects `PET_*` envs only into trainer main container today. This KEP proposes an opt-in way to inject the same `PET_*` envs into trainer init containers. ## Motivation @@ -22,22 +22,32 @@ Which envs are needed by preflight: - Commonly needed by launch logic: `PET_NNODES`, `PET_NPROC_PER_NODE`. - `PET_NODE_RANK` can be read from Pod metadata (`batch.kubernetes.io/job-completion-index`), but the other values are runtime-derived and must still be injected by the plugin. +This KEP does not claim that every preflight check needs all `PET_*` envs. +The goal is to make the same runtime-computed `PET_*` values available to init containers when users choose to use them. + ### Goals -- Inject `PET_*` envs to trainer main container and all trainer init containers. +- Keep `PET_*` env injection to trainer main container unchanged. +- Add opt-in `PET_*` env injection for trainer init containers. - Keep one deterministic env source for both container types. - Keep scheduler behavior unchanged. ### Non-Goals - Change CRD or API schema. -- Add new user-facing field. +- Add new CRD or API fields. - Change scheduling semantics. - Change JobSet network defaults or DNS behavior. ## Proposal -Apply `PET_*` env set to all containers in trainer `PodSet` (`AncestorTrainer`) with same values. +Keep existing behavior by default: inject `PET_*` only into trainer main container. + +Add annotation-based opt-in for init containers. When enabled, apply the same `PET_*` env set to trainer init containers in `PodSet` (`AncestorTrainer`). + +Proposed annotation: + +- `trainer.kubeflow.org/pet-init-env-injection: "enabled"` ## Design Details @@ -50,7 +60,7 @@ Add helper for init-container lookup by podset ancestor and container name, or g In `EnforceMLPolicy`, after `PET_*` values are computed: - Keep existing injection to trainer main container. -- Add injection to every trainer init container. +- Add injection to trainer init containers only when the opt-in annotation is enabled. - Keep torchtune command mutation scoped to trainer main container only. ### JobSet plugin changes @@ -79,7 +89,11 @@ This KEP does not enforce any network setting. Runtime authors and users should ### Compatibility -Backward compatible. Jobs without init containers are unchanged. +Backward compatible by default. + +- Existing jobs keep current behavior (main container injection only). +- Jobs without init containers are unchanged. +- Init-container injection is enabled only for users who opt in. ## Test Plan @@ -88,7 +102,8 @@ Backward compatible. Jobs without init containers are unchanged. ### Unit Tests - Add torch plugin unit test with trainer `PodSet` containing init containers. -- Verify `PET_*` env injection for main and init containers. +- Verify default behavior: `PET_*` env injection only for main container. +- Verify opt-in behavior: `PET_*` env injection for main and init containers. - Add or extend JobSet Build test to verify init container sync in final JobSet spec. ## Implementation History @@ -100,24 +115,20 @@ Backward compatible. Jobs without init containers are unchanged. ### Alternative 1: Inject only into selected init containers -- **Pros:** Smaller runtime mutation scope. -- **Cons:** Need selection API/annotation and user education; less predictable behavior. +- **Pros:** Smaller runtime mutation scope. Also allows mixed sourcing of env values: + - `PET_NODE_RANK` from Pod `fieldRef` (`batch.kubernetes.io/job-completion-index`). + - `PET_NNODES` or `PET_NPROC_PER_NODE` from user-provided overrides when values are fixed. + - `PET_MASTER_ADDR` can be derived from `metadata.annotations['jobset.sigs.k8s.io/jobset-name']` (`$(JOBSET_NAME)-node-0-0.$(JOBSET_NAME)`) + - `PET_MASTER_PORT` can be fixed to `29500`, so it may not need explicit injection in some setups. +- **Cons:** Need clear precedence rules when values are available from multiple sources (plugin injection, metadata-derived values, and user-provided envs). For example, `PET_NNODES` may duplicate `TrainJob.spec.trainer.numNodes`, which can cause configuration drift. -### Alternative 2: Run preflight in main container startup path +### Alternative 2: Run preflight in main container startup path(entrypoint) - **Pros:** Works without injecting `PET_*` into init containers. -- **Cons:** Startup probes and entrypoint checks have different failure behavior from init-container gating, and they still depend on DNS/network settings for `PET_MASTER_ADDR` resolution. +- **Cons:** Startup probes and entrypoint checks have different failure behavior from init-container gating, and they still depend on DNS/network settings for `PET_MASTER_ADDR` resolution after Pod's ready (without explicit `publishNotReadyAddresses: true` ) -## Open Questions - -Should this KEP include an opt-out switch to inject `PET_*` only into the main container? -- Option A: keep this KEP simple and inject into all init containers by default. -- Option B: add annotation-based control (for example, `trainer.kubeflow.org/pet-init-env-injection: "false"`). - -If annotation-based control is added, rollout and compatibility policy must be explicit: +## Open Questions -- Keep current behavior as default first to avoid surprise changes for existing runtimes. -- Define conflict handling when users already set envs with the same names. -- Document migration steps if default behavior changes in a future release. +Should a future KEP change the default from opt-in to opt-out after enough adoption data? From 1345659377b759ae6457d05c7d2701b21801ae47 Mon Sep 17 00:00:00 2001 From: Peter Pan Date: Mon, 27 Apr 2026 20:21:07 +0800 Subject: [PATCH 5/9] add more detailed user story Signed-off-by: Peter Pan --- .../3416-pet-env-init-containers/README.md | 26 +++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/docs/proposals/3416-pet-env-init-containers/README.md b/docs/proposals/3416-pet-env-init-containers/README.md index 4b566c6633..c4b0deaac5 100644 --- a/docs/proposals/3416-pet-env-init-containers/README.md +++ b/docs/proposals/3416-pet-env-init-containers/README.md @@ -25,6 +25,32 @@ Which envs are needed by preflight: This KEP does not claim that every preflight check needs all `PET_*` envs. The goal is to make the same runtime-computed `PET_*` values available to init containers when users choose to use them. +### User Stories + +- As a platform administrator, I want `PET_*` topology environment variables available in distributed trainer init containers so preflight checks can validate distributed readiness before expensive training starts. +- As a job submitter, I want preflight to fail fast with clear, machine-readable reasons (GPU, network, DNS, storage, runtime smoke test) so I can fix issues quickly instead of debugging mid-run failures. +- As an operator, I want preflight outcomes to map to deterministic actions ( Warning , Retry , Reschedule , Stop ) to avoid inconsistent behavior across clusters. +- As a runtime engineer, I want early clarify and detect any of unstable problems like ensure cross-pod DNS resolution for MASTER_ADDR to prevent out-of-band communication failures during training. + + +Moreover: +- Init-container preflight emits structured results ( json ) and normalized exit codes (example: 0=pass , 10=warning , 20=retryable , 30=fatal ). +- Preflight covers at least: GPU health, driver/CUDA compatibility, NCCL connectivity, Kubernetes API reachability, storage accessibility, minimal torchrun smoke test, and repeated DNS resolution for MASTER_ADDR . + +| ID | Real-World Story (Init-Container Context) | Typical Check in Init Container | Recommended Action | Rationale | +|----|------------------------------------------|--------------------------------|--------------------|-----------| +| 1 | GPU missing/unhealthy on a node causes immediate CUDA failures after launch. | `nvidia-smi -L`, DCGM health/diag (or vendor equivalent) | Stop + Reschedule | Node-local hardware issue is unlikely to self-heal in-place. | +| 2 | Driver/CUDA incompatibility causes runtime crashes despite successful pod startup. | Compare host driver (`nvidia-smi`) vs image CUDA compatibility | Stop | Configuration mismatch; rescheduling usually reproduces same failure class. | +| 3 | NCCL path is broken across nodes, leading to all-reduce hang/timeout. | `nccl-tests` (small all-reduce) using PET_* topology | Retry once → Reschedule once → Stop | Transient network issues may recover; persistent failures should fail fast. | +| 4 | API server reachability is intermittent, causing control-plane communication issues. | `curl https://$KUBERNETES_SERVICE_HOST:$PORT/version` | Warning + Retry, then Stop if persistent | Short blips are common; sustained failure is fatal for orchestration. | +| 5 | Required storage path is not writable/readable, causing checkpoint/data IO failures. | Read/write/delete probe on mounted volumes | Stop for required path; Warning + Degrade for optional cache path | Required IO must be hard-gated; optional paths can fall back. | +| 6 | Minimal distributed launch fails although single checks pass. | Tiny `torchrun` smoke test (`--nnodes`, `--nproc_per_node`) | Stop + Reschedule once | End-to-end distributed readiness is the final gate before expensive training. | +| 7 | Cross-pod DNS for `MASTER_ADDR` is unstable; out-of-band runtime communication fails mid-run. | Resolve `MASTER_ADDR` repeatedly (`nslookup` / `getent hosts`) and optional TCP probe | Warning + Degrade if fallback endpoint exists; otherwise Stop | Name resolution instability can silently break runtime coordination later. Ensure `publishNotReadyAddresses=true` when early resolution is required. | + + + + + ### Goals - Keep `PET_*` env injection to trainer main container unchanged. From bbc798330685d0d89dc8082aa33c3ef38dbd8f77 Mon Sep 17 00:00:00 2001 From: Peter Pan Date: Sat, 2 May 2026 22:43:31 +0800 Subject: [PATCH 6/9] change annotation name to more generic Signed-off-by: Peter Pan --- docs/proposals/3416-pet-env-init-containers/README.md | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/docs/proposals/3416-pet-env-init-containers/README.md b/docs/proposals/3416-pet-env-init-containers/README.md index c4b0deaac5..3340a4897f 100644 --- a/docs/proposals/3416-pet-env-init-containers/README.md +++ b/docs/proposals/3416-pet-env-init-containers/README.md @@ -70,10 +70,11 @@ Moreover: Keep existing behavior by default: inject `PET_*` only into trainer main container. Add annotation-based opt-in for init containers. When enabled, apply the same `PET_*` env set to trainer init containers in `PodSet` (`AncestorTrainer`). +In this KEP, the annotation controls only Torch plugin env injection behavior. Proposed annotation: -- `trainer.kubeflow.org/pet-init-env-injection: "enabled"` +- `trainer.kubeflow.org/plugin-env-injection-mode: "init-containers"` ## Design Details @@ -86,7 +87,7 @@ Add helper for init-container lookup by podset ancestor and container name, or g In `EnforceMLPolicy`, after `PET_*` values are computed: - Keep existing injection to trainer main container. -- Add injection to trainer init containers only when the opt-in annotation is enabled. +- Add injection to trainer init containers only when `trainer.kubeflow.org/plugin-env-injection-mode` is set to `init-containers`. - Keep torchtune command mutation scoped to trainer main container only. ### JobSet plugin changes @@ -158,3 +159,9 @@ Backward compatible by default. ## Open Questions Should a future KEP change the default from opt-in to opt-out after enough adoption data? + +Should this annotation-based control move to a dedicated API field in Torch MLPolicy for better type safety and discoverability? + +If a dedicated API field is introduced, should TrainJob override be supported via RuntimePatches (for example by extending `TrainingRuntimeSpecPatch`)? + +Should future scope include finer-grained targets such as specific replicated jobs and container names? From 39a170d0d26990a8263e71dd21a07ce7c5ca197c Mon Sep 17 00:00:00 2001 From: Peter Pan Date: Sat, 2 May 2026 22:49:10 +0800 Subject: [PATCH 7/9] docs(proposals): clarify plugin env injection mode in KEP-3416 Signed-off-by: Peter Pan --- docs/proposals/3416-pet-env-init-containers/README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/docs/proposals/3416-pet-env-init-containers/README.md b/docs/proposals/3416-pet-env-init-containers/README.md index 3340a4897f..9bd1bfdb65 100644 --- a/docs/proposals/3416-pet-env-init-containers/README.md +++ b/docs/proposals/3416-pet-env-init-containers/README.md @@ -76,6 +76,9 @@ Proposed annotation: - `trainer.kubeflow.org/plugin-env-injection-mode: "init-containers"` +In this KEP, only one annotation value is supported: `init-containers`. +Other values are reserved for future extension and are out of scope for this KEP. + ## Design Details ### Runtime helper From d83914b3e6eb3d1855c181d21d80123793576133 Mon Sep 17 00:00:00 2001 From: Peter Pan Date: Thu, 7 May 2026 14:08:34 +0800 Subject: [PATCH 8/9] docs(proposals): switch from annotation to TorchMLPolicy API field for env injection --- .../3416-pet-env-init-containers/README.md | 66 +++++++++++++------ 1 file changed, 47 insertions(+), 19 deletions(-) diff --git a/docs/proposals/3416-pet-env-init-containers/README.md b/docs/proposals/3416-pet-env-init-containers/README.md index 9bd1bfdb65..9700142d16 100644 --- a/docs/proposals/3416-pet-env-init-containers/README.md +++ b/docs/proposals/3416-pet-env-init-containers/README.md @@ -2,7 +2,7 @@ ## Summary -Torch `EnforceMLPolicy` injects `PET_*` envs only into trainer main container today. This KEP proposes an opt-in way to inject the same `PET_*` envs into trainer init containers. +Torch `EnforceMLPolicy` injects `PET_*` envs only into trainer main container today. This KEP proposes adding an API field in `TorchMLPolicy` to opt-in `PET_*` env injection into trainer init containers. ## Motivation @@ -54,14 +54,14 @@ Moreover: ### Goals - Keep `PET_*` env injection to trainer main container unchanged. -- Add opt-in `PET_*` env injection for trainer init containers. +- Add a type-safe API field in `TorchMLPolicy` to opt-in `PET_*` env injection for trainer init containers. - Keep one deterministic env source for both container types. - Keep scheduler behavior unchanged. ### Non-Goals -- Change CRD or API schema. -- Add new CRD or API fields. +- Change `PET_*` env injection scope beyond init containers in this KEP. +- Add user-facing fields to `TrainJob` CRD (deferred to future RuntimePatches discussion). - Change scheduling semantics. - Change JobSet network defaults or DNS behavior. @@ -69,15 +69,45 @@ Moreover: Keep existing behavior by default: inject `PET_*` only into trainer main container. -Add annotation-based opt-in for init containers. When enabled, apply the same `PET_*` env set to trainer init containers in `PodSet` (`AncestorTrainer`). -In this KEP, the annotation controls only Torch plugin env injection behavior. - -Proposed annotation: - -- `trainer.kubeflow.org/plugin-env-injection-mode: "init-containers"` - -In this KEP, only one annotation value is supported: `init-containers`. -Other values are reserved for future extension and are out of scope for this KEP. +Add a new `envInjection` field to `TorchMLPolicySource` in the TrainingRuntime CRD. When configured, apply the same `PET_*` env set to trainer init containers in `PodSet` (`AncestorTrainer`). + +Proposed API: + +```yaml +# TrainingRuntime.spec.mlPolicy.torch +torch: + numNodes: 4 + envInjection: + containerTypes: + - InitContainers +``` + +```go +type TorchMLPolicySource struct { + // envInjection specifies which container types receive PET_* env injection. + // Defaults to empty (main container only). + // +optional + EnvInjection *TorchEnvInjection `json:"envInjection,omitempty"` +} + +type TorchEnvInjection struct { + // containerTypes lists the container types to inject PET_* envs into. + // Supported values: "Containers", "InitContainers". + // Defaults to ["Containers"] (backward compatible). + // +optional + ContainerTypes []ContainerType `json:"containerTypes,omitempty"` +} + +type ContainerType string + +const ( + ContainersContainerType ContainerType = "Containers" + InitContainersContainerType ContainerType = "InitContainers" +) +``` + +In this KEP, only `InitContainers` is implemented for the opt-in path; `Containers` is the +implicit default (unchanged behavior). Future KEPs can extend the list. ## Design Details @@ -90,7 +120,7 @@ Add helper for init-container lookup by podset ancestor and container name, or g In `EnforceMLPolicy`, after `PET_*` values are computed: - Keep existing injection to trainer main container. -- Add injection to trainer init containers only when `trainer.kubeflow.org/plugin-env-injection-mode` is set to `init-containers`. +- Add injection to trainer init containers only when `TorchMLPolicy.EnvInjection` includes `InitContainers`. - Keep torchtune command mutation scoped to trainer main container only. ### JobSet plugin changes @@ -133,7 +163,7 @@ Backward compatible by default. - Add torch plugin unit test with trainer `PodSet` containing init containers. - Verify default behavior: `PET_*` env injection only for main container. -- Verify opt-in behavior: `PET_*` env injection for main and init containers. +- Verify opt-in behavior: `PET_*` env injection for main and init containers when `EnvInjection` includes `InitContainers`. - Add or extend JobSet Build test to verify init container sync in final JobSet spec. ## Implementation History @@ -163,8 +193,6 @@ Backward compatible by default. Should a future KEP change the default from opt-in to opt-out after enough adoption data? -Should this annotation-based control move to a dedicated API field in Torch MLPolicy for better type safety and discoverability? - -If a dedicated API field is introduced, should TrainJob override be supported via RuntimePatches (for example by extending `TrainingRuntimeSpecPatch`)? +Should TrainJob allow overriding `envInjection` via RuntimePatches (for example by extending `TrainingRuntimeSpecPatch`)? -Should future scope include finer-grained targets such as specific replicated jobs and container names? +Should a future KEP extend `TorchEnvInjection` with finer-grained targets such as specific replicated jobs and container names? From 5beb3b36344f7c8bd4de97e119903600f4e753cc Mon Sep 17 00:00:00 2001 From: Peter Pan Date: Thu, 7 May 2026 15:08:55 +0800 Subject: [PATCH 9/9] docs(proposals): clarify envInjection default state semantics --- .../3416-pet-env-init-containers/README.md | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/docs/proposals/3416-pet-env-init-containers/README.md b/docs/proposals/3416-pet-env-init-containers/README.md index 9700142d16..dc77070f03 100644 --- a/docs/proposals/3416-pet-env-init-containers/README.md +++ b/docs/proposals/3416-pet-env-init-containers/README.md @@ -93,7 +93,8 @@ type TorchMLPolicySource struct { type TorchEnvInjection struct { // containerTypes lists the container types to inject PET_* envs into. // Supported values: "Containers", "InitContainers". - // Defaults to ["Containers"] (backward compatible). + // When empty or omitted, the behavior is backward compatible: + // PET_* envs are injected into main containers only. // +optional ContainerTypes []ContainerType `json:"containerTypes,omitempty"` } @@ -106,8 +107,15 @@ const ( ) ``` -In this KEP, only `InitContainers` is implemented for the opt-in path; `Containers` is the -implicit default (unchanged behavior). Future KEPs can extend the list. +The following states are equivalent and all result in the backward-compatible default +(main container injection only): + +- `envInjection` is omitted entirely (`EnvInjection == nil`). +- `envInjection` is present but `containerTypes` is empty or omitted. +- `containerTypes` explicitly lists only `"Containers"`. + +In this KEP, only `"InitContainers"` is the opt-in value that changes behavior. +Future KEPs can extend the list. ## Design Details