Skip to content

feat: add e2e tests to Helm CI workflow#3253

Merged
google-oss-prow[bot] merged 3 commits intokubeflow:masterfrom
Goku2099:e2e-test-through-helm
Apr 17, 2026
Merged

feat: add e2e tests to Helm CI workflow#3253
google-oss-prow[bot] merged 3 commits intokubeflow:masterfrom
Goku2099:e2e-test-through-helm

Conversation

@Goku2099
Copy link
Copy Markdown
Contributor

This PR extends the existing Helm CI workflow by adding an E2E validation step.

#3230

Copilot AI review requested due to automatic review settings February 24, 2026 04:25
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends the Helm CI workflow to run end-to-end validation by provisioning a Kind cluster and executing the Go E2E test suite, addressing Issue #3230.

Changes:

  • Add a Kind cluster setup step to the Helm CI workflow.
  • Run Go E2E tests (make test-e2e) as part of the Helm workflow.

Comment thread .github/workflows/test-helm.yaml Outdated
Comment thread .github/workflows/test-helm.yaml Outdated
Comment thread .github/workflows/test-helm.yaml Outdated
@google-oss-prow google-oss-prow Bot added size/S and removed size/XS labels Feb 24, 2026
Comment thread .github/workflows/test-helm.yaml Outdated
@Goku2099
Copy link
Copy Markdown
Contributor Author

@andreyvelich The Coveralls step is failing with HTTP 530 (error code 1016).

This appears to be an external Coveralls issue rather than a CI or coverage configuration problem. The coverage file is generated successfully, but the upload fails.

Could you please confirm if we should re-run the job or temporarily ignore this failure?

@Goku2099 Goku2099 force-pushed the e2e-test-through-helm branch from 4982e6f to d2c9aa2 Compare February 27, 2026 15:14
@Goku2099
Copy link
Copy Markdown
Contributor Author

Goku2099 commented Mar 4, 2026

@andreyvelich can you review this one too.

Comment thread .github/workflows/test-helm.yaml Outdated
Comment thread .github/workflows/test-helm.yaml Outdated
Comment thread hack/e2e-setup-cluster.sh Outdated
@Goku2099 Goku2099 force-pushed the e2e-test-through-helm branch 2 times, most recently from 0af8e47 to 337e859 Compare April 15, 2026 07:58
Comment thread .github/workflows/test-helm.yaml Outdated
@Goku2099 Goku2099 force-pushed the e2e-test-through-helm branch from 3bfea9e to dbbc804 Compare April 15, 2026 12:04
Comment thread .github/workflows/test-helm.yaml Outdated
Comment thread Makefile Outdated
@Goku2099 Goku2099 force-pushed the e2e-test-through-helm branch 3 times, most recently from 5d748ed to 148cc30 Compare April 15, 2026 14:42
Comment thread hack/e2e-setup-cluster.sh
Comment thread hack/e2e-setup-cluster.sh Outdated
Comment thread .github/workflows/test-helm.yaml Outdated
@Goku2099 Goku2099 force-pushed the e2e-test-through-helm branch 3 times, most recently from 6a7dd04 to 859c5a4 Compare April 15, 2026 15:46
Comment thread hack/e2e-setup-cluster.sh
@Goku2099 Goku2099 force-pushed the e2e-test-through-helm branch from 859c5a4 to e7ba4b2 Compare April 15, 2026 15:59
@Goku2099 Goku2099 force-pushed the e2e-test-through-helm branch from ba84ecd to bca47c8 Compare April 16, 2026 12:53
@Goku2099
Copy link
Copy Markdown
Contributor Author

Earlier it looked like a config difference because Helm wasn’t enabling the feature gate, but after updating Helm to use manager.config.featureGates, both setups should be aligned now.

With that change, most tests pass, but the DeadlineExceeded case still fails in Helm while kustomize passes. The TrainJob fails as expected, but the underlying JobSet isn’t getting deleted and the test times out waiting for it, which looks similar to #3358.

So it doesn’t seem to be just a config difference anymore there might be some difference in how resources are applied or reconciled between Helm and kustomize. I’ll investigate further.

@Goku2099
Copy link
Copy Markdown
Contributor Author

hey @andreyvelich
I investigated the Helm e2e failure for the DeadlineExceeded case.

The TrainJob is correctly marked as failed with DeadlineExceeded, but the test fails because the underlying JobSet is not getting deleted.

In the controller, the deletion happens here:

if now.After(deadline) {
ctrl.LoggerFrom(ctx).V(2).Info("TrainJob deadline exceeded, marking as failed",
"activeDeadlineSeconds", trainJob.Spec.ActiveDeadlineSeconds,
"startTime", startTime,
"deadline", deadline)
setFailedCondition(trainJob, constants.TrainJobDeadlineExceededMessage, trainer.TrainJobDeadlineExceededReason)
jobSet := &jobsetv1alpha2.JobSet{
ObjectMeta: metav1.ObjectMeta{Name: trainJob.Name, Namespace: trainJob.Namespace},
}
if err := client.IgnoreNotFound(r.client.Delete(ctx, jobSet)); err != nil {

We are deleting the JobSet using a constructed object (only name/namespace), without fetching the existing resource first.

From what I observed:

  • Kustomize e2e passes → JobSet gets deleted
  • Helm e2e fails → JobSet remains and the test times out

So the behavior is inconsistent even though the same test is used.

I’m trying to understand:

  1. What guarantees JobSet deletion in this flow?
  2. Is deleting via a constructed object intentional, or should we fetch the JobSet before deleting it?
  3. Could differences between Helm and kustomize (e.g., reconciliation timing or resource state) affect deletion behavior here?

@andreyvelich
Copy link
Copy Markdown
Member

I am not sure why we have different logic for Helm vs Kustomize.
Can you try to deploy it locally on Kind cluster and verify?

@aniket2405 @jaiakash @astefanutti @abhijeet-dhumal @robert-bell @XploY04 @Krishna-kg732 Any thoughts?

@robert-bell
Copy link
Copy Markdown
Contributor

My guess is there's a difference between the helm and kustomize manifests. @Goku2099 you could try rendering both of them locally and comparing.

@XploY04
Copy link
Copy Markdown
Contributor

XploY04 commented Apr 16, 2026

Kustomize grants create, delete, get, list, patch, update, watch on jobsets.jobset.x-k8s.io. The Helm ClusterRole is missing delete.

At pkg/controller/trainjob_controller.go:194 the controller calls Delete on the JobSet for DeadlineExceeded and swallows the error. Under Helm the API denies with forbidden, so the TrainJob still gets the failed condition but the JobSet lingers and the e2e times out. Under Kustomize the verb is granted, delete succeeds.

Fix is one line: add - delete to the jobsets verbs in the Helm ClusterRole. @Goku2099 please add this and try.

@andreyvelich
Copy link
Copy Markdown
Member

Great catch @XploY04!
@Goku2099 please can we ensure that ClusterRole in Helm charts is aligned with this: https://github.com/kubeflow/trainer/blob/master/manifests/base/rbac/role.yaml#L40

@Goku2099 Goku2099 force-pushed the e2e-test-through-helm branch from 9946299 to bca47c8 Compare April 17, 2026 06:23
@Goku2099
Copy link
Copy Markdown
Contributor Author

@andreyvelich
I updated the RBAC by adding the missing delete verb.
However, now the CI is failing during the image build step with package installation errors. This seems unrelated to the RBAC change, so I’m investigating and fixing the Docker build issues to get CI green.

Comment thread cmd/runtimes/deepspeed/Dockerfile Outdated
@review-notebook-app
Copy link
Copy Markdown

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@XploY04
Copy link
Copy Markdown
Contributor

XploY04 commented Apr 17, 2026

@Goku2099 I have made the fix in a new PR for the failing GPU test, please wait.

@Goku2099
Copy link
Copy Markdown
Contributor Author

@Goku2099 I have made the fix in a new PR for the failing GPU test, please wait.

Thanks for the update! I’ll wait for your PR and rebase/adjust mine accordingly

@andreyvelich
Copy link
Copy Markdown
Member

@Goku2099 Can you rebase your PR please?

@Goku2099 Goku2099 force-pushed the e2e-test-through-helm branch 2 times, most recently from 75ffb8c to e8366a6 Compare April 17, 2026 14:15
Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>
Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>
Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>
@Goku2099 Goku2099 force-pushed the e2e-test-through-helm branch from e8366a6 to b3106ee Compare April 17, 2026 14:26
@Goku2099
Copy link
Copy Markdown
Contributor Author

@XploY04 @robert-bell Thanks for the help and guidance on this
@andreyvelich CI is green now, could you please review

Copy link
Copy Markdown
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this work @Goku2099!
/lgtm
/approve

@google-oss-prow google-oss-prow Bot added the lgtm label Apr 17, 2026
@google-oss-prow
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andreyvelich
Copy link
Copy Markdown
Member

/hold cancel

@google-oss-prow google-oss-prow Bot merged commit d4546f4 into kubeflow:master Apr 17, 2026
28 of 29 checks passed
@google-oss-prow google-oss-prow Bot added this to the v2.3 milestone Apr 17, 2026
Goku2099 added a commit to Goku2099/trainer that referenced this pull request Apr 25, 2026
* ci(helm): add e2e tests to Helm CI workflow

Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>

* added delete in clusterrole

Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>

* Revert Dockerfile base image changes

Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>

---------

Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants