feat: KEP for injecting PET envs into init containers (#3417)
panpan0000 wants to merge 7 commits into kubeflow:master from …
Conversation
Pull request overview
Adds a draft KEP documenting the proposed change to inject PyTorch distributed PET_* environment variables into Trainer init containers (in addition to the main trainer container), addressing the feature request in #3416.
Changes:
- Add a new proposal document describing motivation, goals/non-goals, and a high-level implementation plan for PET env injection into init containers.
- Outline required runtime helper, Torch plugin, and JobSet plugin updates plus a basic test plan.
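The intended injection behavior could be sketched roughly as below. This is a minimal, dependency-free Go sketch under stated assumptions: `injectPETEnvs` and the local `Container`/`EnvVar` types are hypothetical stand-ins for the `corev1` types and for whatever helper the runtime package actually adds; the real implementation lives in the Torch/JobSet plugins.

```go
package main

import (
	"fmt"
	"strings"
)

// EnvVar and Container mirror the shape of the corev1 types,
// kept local so this sketch has no Kubernetes dependency.
type EnvVar struct {
	Name  string
	Value string
}

type Container struct {
	Name string
	Env  []EnvVar
}

// injectPETEnvs copies every PET_*-prefixed env var from the trainer
// container into each init container, without overwriting values the
// init container already sets. Hypothetical helper, not the actual
// kubeflow/trainer code.
func injectPETEnvs(trainer Container, initContainers []Container) []Container {
	var petEnvs []EnvVar
	for _, e := range trainer.Env {
		if strings.HasPrefix(e.Name, "PET_") {
			petEnvs = append(petEnvs, e)
		}
	}
	out := make([]Container, len(initContainers))
	for i, ic := range initContainers {
		existing := map[string]bool{}
		for _, e := range ic.Env {
			existing[e.Name] = true
		}
		for _, e := range petEnvs {
			if !existing[e.Name] {
				ic.Env = append(ic.Env, e)
			}
		}
		out[i] = ic
	}
	return out
}

func main() {
	trainer := Container{Name: "node", Env: []EnvVar{
		{Name: "PET_NNODES", Value: "2"},
		{Name: "PET_MASTER_PORT", Value: "29400"},
		{Name: "OTHER", Value: "x"},
	}}
	inits := []Container{{Name: "preflight", Env: []EnvVar{
		{Name: "PET_MASTER_PORT", Value: "1234"}, // user-set value is preserved
	}}}
	res := injectPETEnvs(trainer, inits)
	fmt.Println(len(res[0].Env)) // prints 2: PET_MASTER_PORT kept, PET_NNODES added
}
```

Note the non-overwrite rule: if the init container already sets a `PET_*` variable, the user's value wins, which keeps the behavior consistent with how env overrides usually compose in Kubernetes.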
Hi, Kubeflow users and committee, what do you think?
Sorry for the delay @panpan0000, let me take a look at the KEP!
andreyvelich left a comment
Thanks for this @panpan0000!
I left a few thoughts.
/assign @astefanutti @akshaychitneni @tenzen-y
/ok-to-test
About the comparison: I was mixing up TrainJob and LWS, sorry. Same topic across two repos, my bad. This KEP focuses on runtime-level preflight before the main training starts.
> Add annotation-based opt-in for init containers. When enabled, apply the same `PET_*` env set to trainer init containers in `PodSet` (`AncestorTrainer`).
>
> Proposed annotation:
>
> - `trainer.kubeflow.org/pet-init-env-injection: "enabled"`
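For concreteness, the opt-in from the proposal might look like this on a runtime manifest (the runtime name and everything other than the annotation key are illustrative, not from the KEP):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed   # illustrative name
  annotations:
    trainer.kubeflow.org/pet-init-env-injection: "enabled"
```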
Are we going to enforce this annotation to the Runtime and TrainJob? I would see a use-case when users want all TrainJobs that reference appropriate Runtime get env injected into initContainer.
+1 on @andreyvelich
Based on user stories, we might want to consider a dedicated field in either TrainingRuntime or TrainJob.
If this should be configured by MLOps engineers or cluster admins, we might want to have `.spec.mlPolicy.torch.petEnvInjectionContainerTypes: ["Containers", "InitContainers"]`, which could also allow users to manage PET environment variables even in `Containers`.
Or we might add `.spec.trainer.frameworkEnvInjectionContainerType: ["Containers", "InitContainers"]` when we want to give this capability to the researchers.
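Under the first option, a runtime spec might look like the following. The `petEnvInjectionContainerTypes` field is only a proposal in this thread and does not exist in the API today; the surrounding fields mirror the existing `mlPolicy` shape:

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
spec:
  mlPolicy:
    numNodes: 2
    torch:
      numProcPerNode: auto
      # Proposed field (not yet in the API): which container types
      # receive the PET_* environment variables.
      petEnvInjectionContainerTypes: ["Containers", "InitContainers"]
```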
Good point, maybe we should start with a dedicated API field in the Torch MLPolicy?
In TrainJob, we can also override it via RuntimePatches API, since we can extend the TrainingRuntimeSpecPatch API if needed: https://github.com/kubeflow/trainer/blob/master/pkg/apis/trainer/v1alpha1/trainjob_types.go#L310
WDYT @panpan0000 @robert-bell @astefanutti ?
Also, we might want to consider the use case where users want to inject PET_ envs into other ReplicatedJobs or container names. We could take a more extensible approach for the future:
```yaml
torch:
  envInjection:
    containerName: nccl-check
    targetJob: node
```
@panpan0000 Did you get a chance to update your proposal, so we can move it forward?
Thanks @andreyvelich, updated. Sorry for being late.
- Added user stories per @tenzen-y's suggestion.
- Renamed the annotation to a more generic key as people suggested: `trainer.kubeflow.org/plugin-env-injection-mode: "init-containers"`. (In this KEP we only support `init-containers` for now; other values are reserved for follow-ups.)
- Moved the broader items people discussed (dedicated MLPolicy field, RuntimePatches override, finer-grained targets) to the Open Questions section.

Please take another look, and thank you!
Thanks for the update @panpan0000! Shall we go ahead with API changes instead of the annotation?
That will help us avoid breaking it in future versions.
ACK
+1, this is exactly what we need. Right now we hack it with `TrainJob.spec.podTemplateOverrides` to push envs into the preflight init container, but values like … What we really want: the user drops a preflight init container into the TrainingRuntime, and the operator injects `PET_*` into it the same way it does for the main container. No overrides, no duplication.
Do we have any other possible value for this annotation (besides `enabled`)?
…containers Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Polish more Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
I like it. @tenzen-y @astefanutti, how do you like this annotation name?
I would consider those based on user stories, as I mentioned in #3417 (comment), because the current KEP lacks real-world stories.
I would highly recommend adding a "### User Stories" section under the …
Added user stories @tenzen-y, thank you for your suggestion.
What this PR does / why we need it:
Add KEP to address feature request in #3416
Co-authored with AI
Fixes #3416
Checklist: