feat: KEP for inject PET envs into init-container #3417

Open
panpan0000 wants to merge 7 commits into kubeflow:master from panpan0000:kep3416
@panpan0000 commented Apr 7, 2026

What this PR does / why we need it:

Add KEP to address feature request in #3416

Co-authored by AI

Fixes #3416

Checklist:

  • Docs included if any changes are user facing

Copilot AI review requested due to automatic review settings April 7, 2026 02:23
google-oss-prow bot requested review from jinchihe and kuizhiqing April 7, 2026 02:23
@panpan0000 changed the title from "docs(proposals): draft KEP for inject PET envs into init-container" to "feat: KEP for inject PET envs into init-container" Apr 7, 2026

Copilot AI (Contributor) left a comment

Pull request overview

Adds a draft KEP documenting the proposed change to inject PyTorch distributed PET_* environment variables into Trainer init containers (in addition to the main trainer container), addressing the feature request in #3416.

Changes:

  • Add a new proposal document describing motivation, goals/non-goals, and a high-level implementation plan for PET env injection into init containers.
  • Outline required runtime helper, Torch plugin, and JobSet plugin updates plus a basic test plan.

@panpan0000 (Author)

Hi Kubeflow users and committee, what do you think?

@andreyvelich (Member)

Sorry for the delay @panpan0000, let me take a look at the KEP!

@andreyvelich (Member) left a comment

Thanks for this @panpan0000!
I left a few thoughts.
/assign @astefanutti @akshaychitneni @tenzen-y

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from astefanutti. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot added size/L and removed size/M labels Apr 16, 2026

@andreyvelich (Member)

/ok-to-test

@panpan0000 (Author)

About the comparison with the entrypoint method:

I was mixing up TrainJob and LWS, sorry. It is the same topic in two repos, my bad.

But:

  • Entrypoint checks are workload-level and couple preflight with user command wiring.
  • Init-container checks are runtime-level and keep user training entrypoints clean.
  • Init containers give a cleaner gate: preflight must pass before training starts.
  • Injecting PET_* into init containers makes this reusable across runtimes without wrapping every command.

So the entrypoint and init-container approaches are complementary.

This KEP focuses on runtime-level preflight before main training starts.
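
To make the init-container gate concrete, here is a minimal sketch of the kind of check a preflight init container could run once `PET_*` variables are available to it. The variable names follow torchrun's `PET_` convention; the function name `preflight_problems` and the specific checks are illustrative assumptions, not something the KEP defines.

```python
# Hypothetical preflight check for an init container.
# Assumption: the runtime injects these torchrun-style variables; the exact
# set that reaches init containers is what the KEP is proposing.
REQUIRED = ("PET_NNODES", "PET_NODE_RANK", "PET_MASTER_ADDR", "PET_MASTER_PORT")


def preflight_problems(env):
    """Return a list of human-readable problems; an empty list means the gate passes."""
    problems = [f"missing {name}" for name in REQUIRED if not env.get(name)]
    if problems:
        return problems
    try:
        # Assumes a fixed-size job; elastic "min:max" values are not handled here.
        nnodes = int(env["PET_NNODES"])
        rank = int(env["PET_NODE_RANK"])
        port = int(env["PET_MASTER_PORT"])
    except ValueError as exc:
        return [f"non-integer value: {exc}"]
    if not 0 <= rank < nnodes:
        problems.append(f"PET_NODE_RANK={rank} is outside [0, {nnodes})")
    if not 0 < port < 65536:
        problems.append(f"PET_MASTER_PORT={port} is not a valid TCP port")
    return problems
```

An init container would call this with `os.environ` and exit non-zero when the list is non-empty, which blocks the main training container from starting.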

Comment on lines +45 to +50

Add annotation-based opt-in for init containers. When enabled, apply the same `PET_*` env set to trainer init containers in `PodSet` (`AncestorTrainer`).

Proposed annotation:

- `trainer.kubeflow.org/pet-init-env-injection: "enabled"`

Member

Are we going to enforce this annotation on both the Runtime and the TrainJob? I can see a use case where users want every TrainJob that references a given Runtime to get the envs injected into its init containers.

Member

+1 on @andreyvelich

Based on user stories, we might want to consider a dedicated field in either TrainingRuntime or TrainJob.
If this should be configured by MLOps engineers or cluster admins, we might want to have `.spec.mlPolicy.torch.petEnvInjectionContainerTypes: ["Containers", "InitContainers"]`, which would also allow users to manage PET environment variables even in Containers.

Or, we might be able to add `.spec.trainer.frameworkEnvInjectionContainerType: ["Containers", "InitContainers"]` if we want to give this capability to researchers.
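
As a rough illustration of how a container-types list like the one above could drive injection, the sketch below patches a PodSpec-shaped dict. The function name `inject_env`, the dict-based representation, and the choice to leave user-defined variables untouched are all assumptions made for this example; the actual operator is written in Go and its behavior is not specified in this thread.

```python
import copy


def inject_env(pod_spec, env, container_types):
    """Return a copy of pod_spec with env added to the selected container lists."""
    # Map the proposed (hypothetical) enum values to PodSpec field names.
    fields = {"Containers": "containers", "InitContainers": "initContainers"}
    unknown = [t for t in container_types if t not in fields]
    if unknown:
        raise ValueError(f"unknown container types: {unknown}")
    patched = copy.deepcopy(pod_spec)
    for ctype in container_types:
        for container in patched.get(fields[ctype], []):
            existing = {e["name"] for e in container.get("env", [])}
            # Assumption: variables the user already set are left untouched.
            container.setdefault("env", []).extend(
                copy.deepcopy(e) for e in env if e["name"] not in existing
            )
    return patched
```

With `container_types=["InitContainers"]`, only the init containers receive the variables and the input spec is not mutated.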

Member

Good point, maybe we should start with a dedicated API field in the Torch ML policy?
In TrainJob, we can also override it via the RuntimePatches API, since we can extend the TrainingRuntimeSpecPatch API if needed: https://github.com/kubeflow/trainer/blob/master/pkg/apis/trainer/v1alpha1/trainjob_types.go#L310

WDYT @panpan0000 @robert-bell @astefanutti?

Also, we might want to consider the use case where users want to inject PET_* envs into other ReplicatedJobs or container names. We can consider a more extensible approach for the future:

torch:
  envInjection:
    containerName: nccl-check
    targetJob: node
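
One way to read the selector above is as a lookup over the JobSet's replicated jobs. The sketch below is a hypothetical illustration of that lookup on a JobSet-shaped dict; `select_containers` and the matching rules are assumptions, and the `envInjection` field itself is only a proposal in this thread.

```python
def select_containers(jobset, target_job, container_name):
    """Yield (kind, container) pairs matching a targetJob/containerName selector."""
    for rjob in jobset.get("spec", {}).get("replicatedJobs", []):
        if rjob.get("name") != target_job:
            continue
        # ReplicatedJob -> Job template -> Pod template -> PodSpec
        pod_spec = rjob["template"]["spec"]["template"]["spec"]
        for kind in ("initContainers", "containers"):
            for container in pod_spec.get(kind, []):
                if container.get("name") == container_name:
                    yield kind, container
```

Matching on both the replicated job and the container name keeps a same-named container in another job (for example an initializer job) from being picked up by accident.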

Member

@panpan0000 Did you get a chance to update your proposal, so we can move it forward?

Author

Thanks @andreyvelich, updated. Sorry for the delay.

  • Added user stories per @tenzen-y
  • Renamed the annotation to a more generic key as people suggested: trainer.kubeflow.org/plugin-env-injection-mode: "init-containers". (In this KEP we only support init-containers for now; other values are reserved for follow-ups.)
  • Moved the broader items people discussed (dedicated MLPolicy field, RuntimePatches override, finer-grained targets) to the Open Questions section.

Please take another look, and thank you.

Member

Thanks for the update @panpan0000! Shall we go ahead with API changes instead of an annotation?
That will help us avoid breaking it in future versions.

@tenzen-y (Member)

ACK
InQueue

@drivebyer

+1, this is exactly what we need.

Right now we hack it with TrainJob.spec.podTemplateOverrides to push envs into the preflight init container, but values like PET_NNODES are runtime-derived.

What we really want: user drops a preflight init container into the TrainingRuntime, and the operator injects PET_* into it the same way it does for the main container. No overrides, no duplication.

@panpan0000 (Author)

@andreyvelich

trainer.kubeflow.org/runtime-envs: init-containers
or
trainer.kubeflow.org/plugin-env-injection: init-containers

Do we have any other possible value for this annotation (besides init-containers)?

…containers

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Polish more

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
@andreyvelich (Member)

> @andreyvelich
>
> trainer.kubeflow.org/runtime-envs: init-containers
> or
> trainer.kubeflow.org/plugin-env-injection: init-containers
>
> Do we have any other possible value for this annotation (besides init-containers)?

I like init-containers, @robert-bell proposed the same here, but with -mode suffix: #3417 (comment)

@tenzen-y @astefanutti How do you like this annotation name?

trainer.kubeflow.org/plugin-env-injection: init-containers

@tenzen-y (Member)

> > @andreyvelich
> >
> > trainer.kubeflow.org/runtime-envs: init-containers
> > or
> > trainer.kubeflow.org/plugin-env-injection: init-containers
> >
> > Do we have any other possible value for this annotation (besides init-containers)?
>
> I like init-containers, @robert-bell proposed the same here, but with -mode suffix: #3417 (comment)
>
> @tenzen-y @astefanutti How do you like this annotation name?
>
> trainer.kubeflow.org/plugin-env-injection: init-containers

I would consider those based on user stories, as I mentioned in #3417 (comment), because the current KEP lacks real-world stories.

@tenzen-y (Member)

I would highly recommend adding a "### User Stories" section under the "## Proposal" section so that we can evaluate the proposal.

@panpan0000 (Author)

> I would consider those based on user stories, as I mentioned in #3417 (comment), because the current KEP lacks real-world stories.

Added user stories, @tenzen-y. Thank you for your suggestion.

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
panpan0000 added 2 commits May 2, 2026 22:43
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Development

Successfully merging this pull request may close these issues.

Support injecting Torch PET_* envs into trainer init containers

9 participants