feat: add e2e tests to Helm CI workflow #3253
google-oss-prow[bot] merged 3 commits into kubeflow:master
Pull request overview
This PR extends the Helm CI workflow to run end-to-end validation by provisioning a Kind cluster and executing the Go E2E test suite, addressing Issue #3230.
Changes:
- Add a Kind cluster setup step to the Helm CI workflow.
- Run Go E2E tests (make test-e2e) as part of the Helm workflow.
|
@andreyvelich The Coveralls step is failing with HTTP 530 (error code 1016). This appears to be an external Coveralls issue rather than a CI or coverage configuration problem. The coverage file is generated successfully, but the upload fails. Could you please confirm if we should re-run the job or temporarily ignore this failure? |
|
@andreyvelich could you review this one too? |
|
Earlier it looked like a config difference because Helm wasn’t enabling the feature gate, but that has since been updated on the Helm side. With that change, most tests pass, but the DeadlineExceeded case still fails in Helm while kustomize passes. The TrainJob fails as expected, but the underlying JobSet isn’t getting deleted and the test times out waiting for it, which looks similar to #3358. So it no longer seems to be just a config difference; there might be some difference in how resources are applied or reconciled between Helm and kustomize. I’ll investigate further. |
|
Hey @andreyvelich, the TrainJob is correctly marked as failed with DeadlineExceeded, but the test fails because the underlying JobSet is not getting deleted. In the controller, the deletion happens here: trainer/pkg/controller/trainjob_controller.go, lines 185 to 194 in 9c598fb. We are deleting the JobSet using a constructed object (only name/namespace), without fetching the existing resource first. From what I observed:
So the behavior is inconsistent even though the same test is used. I’m trying to understand:
|
|
I am not sure why we have different logic for Helm vs Kustomize. @aniket2405 @jaiakash @astefanutti @abhijeet-dhumal @robert-bell @XploY04 @Krishna-kg732 Any thoughts? |
|
My guess is there's a difference between the helm and kustomize manifests. @Goku2099 you could try rendering both of them locally and comparing. |
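Once both manifests are rendered, the comparison suggested above comes down to diffing the verbs each ClusterRole grants on the same resource. The following is only a minimal sketch of that diff: missingVerbs is a hypothetical helper, and the two verb lists are illustrative inputs that mirror what is reported later in this thread for jobsets.jobset.x-k8s.io, not output taken from the actual charts.

```go
package main

import (
	"fmt"
	"sort"
)

// missingVerbs returns the verbs present in want but absent from got,
// sorted alphabetically. It is a hypothetical helper for illustration.
func missingVerbs(want, got []string) []string {
	have := make(map[string]bool, len(got))
	for _, v := range got {
		have[v] = true
	}
	var missing []string
	for _, v := range want {
		if !have[v] {
			missing = append(missing, v)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Assumed inputs: verbs on jobsets.jobset.x-k8s.io in each rendered ClusterRole.
	kustomize := []string{"create", "delete", "get", "list", "patch", "update", "watch"}
	helm := []string{"create", "get", "list", "patch", "update", "watch"}

	fmt.Println("missing in Helm:", missingVerbs(kustomize, helm))
}
```

Running the same check in the other direction (Helm as want, Kustomize as got) would show whether Helm grants anything that Kustomize does not.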
|
Kustomize grants create, delete, get, list, patch, update, watch on jobsets.jobset.x-k8s.io. The Helm ClusterRole is missing delete. At pkg/controller/trainjob_controller.go:194 the controller calls Delete on the JobSet for DeadlineExceeded and swallows the error. Under Helm the API denies with forbidden, so the TrainJob still gets the failed condition but the JobSet lingers and the e2e times out. Under Kustomize the verb is granted, delete succeeds. Fix is one line: add - delete to the jobsets verbs in the Helm ClusterRole. @Goku2099 please add this and try. |
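The failure mode described above can be illustrated with a small standalone simulation. This is a sketch under stated assumptions, not the controller's code: fakeAPI, newFakeAPI, deleteJobSet, and onDeadlineExceeded are hypothetical names, and the in-memory map stands in for the API server's RBAC check.

```go
package main

import (
	"errors"
	"fmt"
)

// errForbidden stands in for the API server's "forbidden" response.
var errForbidden = errors.New("forbidden")

// fakeAPI is a hypothetical in-memory stand-in for the API server.
type fakeAPI struct {
	verbs   map[string]bool // verbs granted on jobsets
	jobSets map[string]bool // existing JobSets, keyed by "namespace/name"
}

// newFakeAPI creates a fake API holding one JobSet and the given verbs.
func newFakeAPI(jobSetKey string, verbs ...string) *fakeAPI {
	a := &fakeAPI{
		verbs:   map[string]bool{},
		jobSets: map[string]bool{jobSetKey: true},
	}
	for _, v := range verbs {
		a.verbs[v] = true
	}
	return a
}

// deleteJobSet removes the JobSet only when the delete verb is granted.
func (a *fakeAPI) deleteJobSet(key string) error {
	if !a.verbs["delete"] {
		return errForbidden
	}
	delete(a.jobSets, key)
	return nil
}

// onDeadlineExceeded mimics the pattern described in the thread: the delete
// error is ignored, and the TrainJob is reported as failed either way.
func onDeadlineExceeded(a *fakeAPI, key string) (trainJobFailed bool) {
	_ = a.deleteJobSet(key) // error swallowed
	return true
}

func main() {
	const key = "default/demo" // hypothetical namespace/name
	cases := []struct {
		name string
		api  *fakeAPI
	}{
		{"helm", newFakeAPI(key, "create", "get", "list", "patch", "update", "watch")},
		{"kustomize", newFakeAPI(key, "create", "delete", "get", "list", "patch", "update", "watch")},
	}
	for _, c := range cases {
		failed := onDeadlineExceeded(c.api, key)
		fmt.Printf("%s: trainJobFailed=%v jobSetLingers=%v\n", c.name, failed, c.api.jobSets[key])
	}
}
```

In both cases the TrainJob is marked failed, but only the configuration without the delete verb leaves the JobSet behind, which is what a test waiting for the JobSet to disappear would time out on. Surfacing or logging the delete error instead of discarding it would have made the missing permission visible sooner.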
|
Great catch @XploY04! |
|
@andreyvelich |
|
@Goku2099 I have made the fix in a new PR for the failing GPU test, please wait. |
Thanks for the update! I’ll wait for your PR and rebase/adjust mine accordingly.
|
@Goku2099 Can you rebase your PR please? |
Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>
|
@XploY04 @robert-bell Thanks for the help and guidance on this |
andreyvelich left a comment
Thanks for this work @Goku2099!
/lgtm
/approve
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andreyvelich |
|
/hold cancel |
* ci(helm): add e2e tests to Helm CI workflow
  Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>
* added delete in clusterrole
  Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>
* Revert Dockerfile base image changes
  Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>
---------
Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>
This PR extends the existing Helm CI workflow by adding an E2E validation step.
#3230