feat: KEP for injecting PET envs into init containers (#3417)
panpan0000 wants to merge 7 commits into kubeflow:master from …
Conversation
Pull request overview
Adds a draft KEP documenting the proposed change to inject PyTorch distributed PET_* environment variables into Trainer init containers (in addition to the main trainer container), addressing the feature request in #3416.
Changes:
- Add a new proposal document describing motivation, goals/non-goals, and a high-level implementation plan for PET env injection into init containers.
- Outline required runtime helper, Torch plugin, and JobSet plugin updates plus a basic test plan.
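The intended injection behavior could be sketched roughly as below. This is a minimal, dependency-free Go sketch under stated assumptions: `injectPETEnvs` and the local `Container`/`EnvVar` types are hypothetical stand-ins for the `corev1` types and for whatever helper the runtime package actually adds; the real implementation lives in the Torch/JobSet plugins.

```go
package main

import (
	"fmt"
	"strings"
)

// EnvVar and Container mirror the shape of the corev1 types,
// kept local so this sketch has no Kubernetes dependency.
type EnvVar struct {
	Name  string
	Value string
}

type Container struct {
	Name string
	Env  []EnvVar
}

// injectPETEnvs copies every PET_*-prefixed env var from the trainer
// container into each init container, without overwriting values the
// init container already sets. Hypothetical helper, not the actual
// kubeflow/trainer code.
func injectPETEnvs(trainer Container, initContainers []Container) []Container {
	var petEnvs []EnvVar
	for _, e := range trainer.Env {
		if strings.HasPrefix(e.Name, "PET_") {
			petEnvs = append(petEnvs, e)
		}
	}
	out := make([]Container, len(initContainers))
	for i, ic := range initContainers {
		existing := map[string]bool{}
		for _, e := range ic.Env {
			existing[e.Name] = true
		}
		for _, e := range petEnvs {
			if !existing[e.Name] {
				ic.Env = append(ic.Env, e)
			}
		}
		out[i] = ic
	}
	return out
}

func main() {
	trainer := Container{Name: "node", Env: []EnvVar{
		{Name: "PET_NNODES", Value: "2"},
		{Name: "PET_MASTER_PORT", Value: "29400"},
		{Name: "OTHER", Value: "x"},
	}}
	inits := []Container{{Name: "preflight", Env: []EnvVar{
		{Name: "PET_MASTER_PORT", Value: "1234"}, // user-set value is preserved
	}}}
	res := injectPETEnvs(trainer, inits)
	fmt.Println(len(res[0].Env)) // prints 2: PET_MASTER_PORT kept, PET_NNODES added
}
```

Note the non-overwrite rule: if the init container already sets a `PET_*` variable, the user's value wins, which keeps the behavior consistent with how env overrides usually compose in Kubernetes.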
Hi, Kubeflow users and committee, what do you think?
Sorry for the delay @panpan0000, let me take a look at the KEP!
andreyvelich left a comment
Thanks for this @panpan0000!
I left a few thoughts.
/assign @astefanutti @akshaychitneni @tenzen-y
/ok-to-test
About the comparison: I was mixing up TrainJob and LWS, sorry. Same topic across two repos, my bad. This KEP focuses on runtime-level preflight before the main training starts.
> Add annotation-based opt-in for init containers. When enabled, apply the same `PET_*` env set to trainer init containers in `PodSet` (`AncestorTrainer`).
>
> Proposed annotation:
>
> - `trainer.kubeflow.org/pet-init-env-injection: "enabled"`
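For concreteness, the opt-in from the proposal might look like this on a runtime manifest (the runtime name and everything other than the annotation key are illustrative, not from the KEP):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed   # illustrative name
  annotations:
    trainer.kubeflow.org/pet-init-env-injection: "enabled"
```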
Are we going to enforce this annotation to the Runtime and TrainJob? I would see a use-case when users want all TrainJobs that reference appropriate Runtime get env injected into initContainer.
+1 on @andreyvelich
Based on user stories, we might want to consider a dedicated field in either TrainingRuntime or TrainJob.
If this should be configured by MLOps engineers or cluster admins, we might want to have `.spec.mlPolicy.torch.petEnvInjectionContainerTypes: ["Containers", "InitContainers"]`, which could also allow users to manage PET environment variables even in `Containers`.
Or we might add `.spec.trainer.frameworkEnvInjectionContainerType: ["Containers", "InitContainers"]` when we want to give this capability to the researchers.
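Under the first option, a runtime spec might look like the following. The `petEnvInjectionContainerTypes` field is only a proposal in this thread and does not exist in the API today; the surrounding fields mirror the existing `mlPolicy` shape:

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
spec:
  mlPolicy:
    numNodes: 2
    torch:
      numProcPerNode: auto
      # Proposed field (not yet in the API): which container types
      # receive the PET_* environment variables.
      petEnvInjectionContainerTypes: ["Containers", "InitContainers"]
```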
Good point, maybe we should start with a dedicated API field in the Torch MLPolicy?
In TrainJob, we can also override it via RuntimePatches API, since we can extend the TrainingRuntimeSpecPatch API if needed: https://github.com/kubeflow/trainer/blob/master/pkg/apis/trainer/v1alpha1/trainjob_types.go#L310
WDYT @panpan0000 @robert-bell @astefanutti ?
Also, we might want to consider the use case where users want to inject PET_ envs into other ReplicatedJobs or container names. We could take a more extensible approach for the future:
```yaml
torch:
  envInjection:
    containerName: nccl-check
    targetJob: node
```
@panpan0000 Did you get a chance to update your proposal, so we can move it forward?
Thanks @andreyvelich, updated. Sorry for being late.
- Added user stories per @tenzen-y's suggestion.
- Renamed the annotation to a more generic key as people suggested: `trainer.kubeflow.org/plugin-env-injection-mode: "init-containers"`. (In this KEP we only support `init-containers` for now; other values are reserved for follow-ups.)
- Moved the broader items people discussed (dedicated MLPolicy field, RuntimePatches override, finer-grained targets) to the Open Questions section.

Please take another look, and thank you!
Thanks for the update @panpan0000! Shall we go ahead with API changes instead of the annotation?
That will help us avoid breaking it in future versions.
ACK
+1, this is exactly what we need. Right now we hack it with `TrainJob.spec.podTemplateOverrides` to push envs into the preflight init container, but values like … What we really want: the user drops a preflight init container into the TrainingRuntime, and the operator injects `PET_*` into it the same way it does for the main container. No overrides, no duplication.
Do we have any other possible value for this annotation (besides `enabled`)?
…containers Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Polish more Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
I like it. @tenzen-y @astefanutti, how do you like this annotation name?
I would consider those based on user stories, as I mentioned in #3417 (comment), because the current KEP lacks real-world stories.
I would highly recommend adding a "### User Stories" section under the …
Added user stories @tenzen-y, thank you for your suggestion.
What this PR does / why we need it:
Add KEP to address feature request in #3416
Co-authored with AI
Fixes #3416
Checklist: