Skip to content

Fix pre-existing e2e test failures#549

Open
scotwells wants to merge 2 commits intomainfrom
fix/pre-existing-e2e-test-failures
Open

Fix pre-existing e2e test failures#549
scotwells wants to merge 2 commits intomainfrom
fix/pre-existing-e2e-test-failures

Conversation

@scotwells
Copy link
Copy Markdown
Contributor

Summary

  • Adds control plane reachability wait to notes multicluster tests
  • Adds quota grant for CRM note lifecycle test

Context

Three e2e tests have been failing on main for some time:

  1. note-multicluster-subject and clusternote-multicluster-subject — wrote to the project control plane immediately after Project Ready=true, but the proxy wasn't reachable yet. Fixed by adding a GrantCreationPolicy + wait for ResourceGrant in the project control plane (same pattern as multi-cluster-enforcement).

  2. crm-note-contact-lifecycle — Note creation was rejected with "Insufficient quota resources available" because no ResourceGrant existed. Fixed by adding a grant in the test setup.

Test plan

  • All 3 tests verified against the patterns used by passing tests
  • CI e2e validation

🤖 Generated with Claude Code

@joggrbot
Copy link
Copy Markdown
Contributor

joggrbot bot commented Mar 31, 2026

📝 Documentation Analysis

All docs are up to date! 🎉


✅ Latest commit analyzed: 408204c | Powered by Joggr

scotwells added a commit that referenced this pull request Apr 1, 2026
> [!NOTE]
> CI shows 6 failing e2e tests — these are pre-existing failures
unrelated to this PR. See #549 for the fix.

## Summary

- Project creation could stall for up to 10 minutes while another
project was being deleted. The controller now handles deletions without
blocking other work.
- Deletion progress is visible via a new `ResourceCleanup` status
condition on the project, so operators can see exactly what phase the
cleanup is in.
- The controller now processes up to 4 projects concurrently instead of
1.

## What changed

The project controller previously ran a synchronous cleanup operation
during deletion that could block for up to 10 minutes. During that time,
no other project — including newly created ones — could be reconciled.

The cleanup is now split into two non-blocking steps: issuing delete
commands (fast) and checking whether resources have drained (cheap
poll). The controller tracks progress via a `ResourceCleanup` condition
with three reasons:
- `CleanupStarted` → delete commands are being issued
- `CleanupAwaitingCompletion` → waiting for resources to be removed
- `CleanupComplete` → all resources removed, finalizer can be released

If resources haven't drained after a check, the controller transitions
back to `CleanupStarted` to automatically re-issue delete commands on
the next reconcile.

## Test plan

- [x] Existing unit tests pass (`task test:unit`)
- [ ] Delete a project and verify the `ResourceCleanup` condition
progresses through `CleanupStarted` → `CleanupAwaitingCompletion` →
`CleanupComplete`
- [ ] Create a new project while another project is being deleted —
creation should not be delayed
- [ ] Kill the controller pod mid-deletion and verify cleanup resumes
after restart
- [x] End-to-end tests pass (`task test:end-to-end`) — new
`project-deletion` test passes in 20.7s

🤖 Generated with [Claude Code](https://claude.com/claude-code)
scotwells added a commit that referenced this pull request Apr 2, 2026
## Summary

- Wires the existing finalizer into the UserInvitation controller so
invitation-related PolicyBindings (`getinvitation` and
`acceptinvitation`) are deleted before an invitation is removed
- Removes a redundant manual PolicyBinding deletion in the acceptance
path since the finalizer now handles cleanup of both bindings
- Adds unit tests verifying finalizer registration and PolicyBinding
garbage collection
- Adds an end-to-end Chainsaw test for the full invitation lifecycle

## Context

When a UserInvitation was deleted (accepted, expired, or revoked), the
associated PolicyBindings in `milo-system` were left behind because the
controller's finalizer was registered but never invoked during
reconciliation. This produced 49 orphaned PolicyBindings in production
stuck in `ResourceSelectorValidationFailed` state.

Closes #535

## Test plan

- [x] Unit tests verify finalizer is added on first reconcile
- [x] Unit tests verify both PolicyBindings are deleted on
UserInvitation deletion
- [x] Existing unit tests updated to account for finalizer registration
cycle
- [x] Chainsaw e2e test validates full lifecycle: create invitation →
verify PolicyBindings → delete invitation → verify cleanup
- [x] e2e test passes in CI

> [!NOTE]
> The CI workflow reports a failure due to pre-existing e2e test issues
on `main` (`note-multicluster-subject`,
`clusternote-multicluster-subject`, `crm-note-contact-lifecycle`). These
are unrelated to this PR and are being resolved in #549.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
scotwells added a commit that referenced this pull request Apr 2, 2026
## Summary

- Adds a finalizer to the UserController that deletes all
OrganizationMemberships referencing a user before allowing the User
object to be removed
- Extends the OrganizationMembership validation webhook to allow
last-owner membership deletion when the referenced User is being deleted
(verified via direct API server read, not cache)
- Fixes the Organization webhook and UserInvitation controller to set
correct User ownerReferences on OrganizationMemberships
- Adds a two-pass self-delete in the OrganizationMembership controller
for existing orphaned memberships where the User no longer exists

## Context

When a User was deleted, OrganizationMembership resources referencing
that user were not cleaned up because they had no User ownerReference
(webhook creation path) or had a malformed one (invitation path used
`.Group` instead of `.String()` for APIVersion). The orphaned
memberships left their owned PolicyBindings stuck in
`SubjectValidationFailed` state.

Closes #536

## Test plan

- [x] Webhook tests verify deletion is allowed when User has
DeletionTimestamp
- [x] Webhook tests verify deletion is allowed when User is already gone
- [x] Webhook tests verify last-owner guard still blocks when User is
active
- [x] All existing unit tests pass
- [x] Chainsaw e2e test validates full lifecycle: create user + org →
verify membership → delete user → verify cleanup
- [x] e2e test passes in CI

> [!NOTE]
> The CI workflow reports a failure due to pre-existing e2e test issues
on `main` (`note-multicluster-subject`,
`clusternote-multicluster-subject`, `crm-note-contact-lifecycle`). These
are unrelated to this PR and are being resolved in #549.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
- Add control plane reachability wait to notes multicluster tests
- Add quota grant for CRM note lifecycle test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@scotwells scotwells force-pushed the fix/pre-existing-e2e-test-failures branch from 05f42e6 to c9f2743 Compare April 10, 2026 20:42
The apiserver was being OOMKilled at 512Mi during e2e tests when
multiple chainsaw tests run concurrently, causing cascading failures.

Bump memory limit to 1Gi and CPU limit to 1 core, with requests
raised proportionally to 512Mi memory and 200m CPU.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant