Skip to content

feat: deploy Cerberus in management account, drop delegated-admin model#13

Merged
gregsienkiewicz merged 12 commits into
mainfrom
feat/gregsienkiewicz/major-refactor
May 28, 2026
Merged

feat: deploy Cerberus in management account, drop delegated-admin model#13
gregsienkiewicz merged 12 commits into
mainfrom
feat/gregsienkiewicz/major-refactor

Conversation

@gregsienkiewicz

@gregsienkiewicz gregsienkiewicz commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Update — additional work since initial review

Since the original architectural pivot landed, the branch has received five follow-on commits addressing reviewer feedback and operational quality. Specifics in the original description below have changed where noted.

  • /code-review findings (e9e3e94) — CLAUDE.md staleness, README cleanup, DISABLED-mode defense-in-depth guard at the Lambda.
  • KIRO peer review (687be33) — State machine hardening:
    • Explicit Retry blocks on all four AWS SDK integration Tasks (DescribePermissionSet, DescribeInstance, DescribeUser, DescribeGroup). One throttle without retry previously failed the whole execution.
    • Lambda invoke retry narrowed from blanket States.TaskFailed (5 attempts × 15s, worst case ~25 min on Lambda crashes) to specific transient invocation errors (Lambda.ServiceException, Lambda.AWSLambdaException, Lambda.SdkClientException, Lambda.TooManyRequestsException). Crashes now fail fast.
    • Is User or Group Name Returned? Choice now validates $.DescribeUser.UserName — the field the Lambda actually reads, and the field that is required on Identity Store users (DisplayName is optional and could have silently rejected valid users).
  • CLAUDE.md authoring conventions (fc65b13) — Project memory now codifies the patterns surfaced by the KIRO review: uniform Task retries, Choice/Lambda field alignment, StringEquals over StringMatches.
  • Make-driven workflow (3bb6329) — Root Makefile is the single CI/CD entry point. make check (validate + test) gates PRs; make deploy NOTIFICATION_EMAIL=... MODE=DRY_RUN replaces remembered SAM CLI command sequences. CI=true skips the interactive changeset confirmation that samconfig.toml enables by default. Root README rewritten to point at this workflow.
  • Codex GPT review findings (48b9c04) — Three substantive hardening items:
    • Dropped the operator-supplied ManagementAccountId parameter. The protected-target account is now derived from !Ref AWS::AccountId in the state machine substitution. Eliminates the misconfiguration class where a wrong account ID leaves the real management account vulnerable to self-lockout. The original PR description below still lists ManagementAccountId as a retained parameter — that's no longer accurate.
    • Custom CloudWatch metrics. Lambda emits Cerberus / {Deleted, Skipped, Failed} at each outcome path. CerberusHighDeletionRateAlarm retargeted from AWS/States ExecutionsSucceeded (which counted no-ops, dry-runs, and skipped paths as success) to Cerberus / Deleted — fires on actual deletion volume, not on execution count.
    • Null-safe env reads. All three pattern env vars (PermissionSetNamePattern, PrincipalGroupNamePattern, PrincipalUserNameEmail) now use os.environ.get(name, "") defaults. Fail-closed early return when either regex pattern is empty, since re.compile("") matches every string and an empty pattern under ENFORCE would delete every matching principal.

Note

Deferred to a follow-up PR (Codex #1): Deletion-confirmation polling. The Lambda currently treats DeleteAccountAssignment Status=IN_PROGRESS as "deletion accepted." Confirming actual completion via a Step Functions polling loop on DescribeAccountAssignmentDeletionStatus is an architectural change to the state machine and warrants its own review cycle. The custom-metric work here is forward-compatible: when polling lands, Deleted semantics shift from "API accepted" to "polling confirmed" with no metric-name change.

Test plan refresh

  • make check — SAM template valid + 9/9 unit tests pass (was 8/8 at original open; one new metric-emission assertion per outcome type — Deleted, Skipped, Failed).
  • make deploy (no params) — fails closed with helpful usage hint (NOTIFICATION_EMAIL required; MANAGEMENT_ACCOUNT_ID no longer needed).
  • Post-deploy verification in a sandbox management account remains as outlined in the original test plan below, with one addition: confirm the Cerberus namespace appears in CloudWatch Metrics with Deleted, Skipped, Failed counters incrementing on the relevant paths.

Summary

IAM Identity Center enforces a service-level restriction that is invisible to IAM, SCPs, and the delegation configuration: permission sets whose lifecycle is owned by the management account — every Control Tower default — can only have their assignments removed by a principal in the management account itself. The delegated admin returns AccessDeniedException regardless of IAM permissions. This makes the previous delegated-admin topology unworkable.

This PR collapses Cerberus into a single-account deployment in the management account, with mitigations for the security implications of running application code there.

Architectural change

Before After
Deployment account Delegated admin (member account) Management account
Cross-account event bus Custom cerberus-event-bus + CerberusEventBusPolicy None (removed)
Forwarder template cft-eventbridge-rule.yaml (deployed in mgmt) Deleted
Lambda + state machine Delegated admin Management account
Event source Cross-account events:PutEvents Default event bus (CloudTrail-native)

Security mitigations added

  • Inline permission boundary (AWS::IAM::ManagedPolicy) attached to both the Lambda role and the state machine role. SCPs do not apply to management-account principals; this is the only IAM-layer guardrail. Whitelist: sso:DeleteAccountAssignment, sso:Describe*/Get*/List*, identitystore:Describe*, lambda:InvokeFunction, logs:*, cloudwatch:PutMetricData. Anything else (iam:*, organizations:*, sts:AssumeRole, sso:CreateAccountAssignment, kms:*, etc.) is implicitly denied.
  • Mode parameter (ENFORCE | DRY_RUN | DISABLED). DRY_RUN logs would-delete decisions without calling the SSO API. DISABLED sets the EventBridge rule's State to DISABLED — operational kill switch flippable via sam deploy --parameter-overrides Mode=....
  • Reduced ReservedConcurrentExecutions from 5 → 2.
  • Three new alarms alongside the existing ExecutionsFailed:
    • CerberusFunctionErrorsAlarm — Lambda errors > 0
    • CerberusFunctionThrottlesAlarm — concurrency cap hit
    • CerberusHighDeletionRateAlarm — > 10 successful executions in 5 min (runaway / compromise detector)
  • Self-lockout protection (kept from fix: surface real deletion failures and skip mgmt-account events #12): events targeting the configured ManagementAccountId are skipped at the state machine, with rationale text updated to reflect the new context.

Removed

  • cft-eventbridge-rule.yaml
  • EventBusName parameter
  • CerberusEventBus resource and CerberusEventBusPolicy
  • CerberusEventManagementAccount event rule (cross-account-bus version)
  • EventBusArn output

Added

  • Mode parameter + DRY_RUN early-return in app.py
  • CerberusPermissionsBoundary (AWS::IAM::ManagedPolicy)
  • New unit tests: test_lambda_handler_dry_run_mode, test_lambda_handler_group_principal_match
  • tearDown to clean Mode env across tests
  • README rewrite covering: rationale for management-account deployment, pre-deploy security checklist, all four alarms, the Mode parameter, and a migration runbook

Test plan

  • python -m unittest discover -s cerberus/tests/unit -v — 8/8 pass (was 6)
  • sam validate --lint — valid
  • sam build — substitutions resolve, build succeeds
  • Migration runbook validation (post-deploy):
    • Deploy to management account with Mode=DRY_RUN. Trigger a known CreateAccountAssignment against an account-factory test account. Verify Lambda logs DRY_RUN: would remove ... and delete_account_assignment is not called.
    • Tear down old delegated-admin stack and the old cft-eventbridge-rule.yaml stack.
    • Flip new stack to Mode=ENFORCE. Trigger another known assignment. Verify the assignment is actually removed in the IAM Identity Center console (not only the Lambda's reported result).
    • Force a fault (e.g., temporarily remove sso:DeleteAccountAssignment from the boundary) and confirm CerberusFunctionErrorsAlarm or ExecutionsFailed fires.

Operator-facing breaking changes

  • The delegated-admin deployment is no longer supported. Existing deployments must follow the migration runbook in cerberus/README.md.
  • EventBusName parameter is removed. Operators referencing it in deploy scripts will need to drop it.
  • EventBusArn output is removed. Anything consuming that output must be updated.

Deferred to follow-up PRs

  • Lambda code signing (AWS::Signer::SigningProfile + CodeSigningConfig) — deferred per discussion. Add when deploys move to CI/CD.
  • SsoInstanceId parameter for tighter resource scoping on sso:* actions. Keeps Resource: "*" for OSS friendliness in this PR.
  • Custom metric for DRY_RUN-stuck-mode detection.
  • Regex hardening (re.fullmatch, AllowedPattern/MinLength on regex parameters).
  • State-machine integration test for the management-account skip path (deferred from fix: surface real deletion failures and skip mgmt-account events #12).

Generated with Claude Code

IAM Identity Center enforces a service-level restriction (invisible to
IAM, SCPs, and delegation config): permission sets whose lifecycle is
owned by the management account — every Control Tower default — can
only have their assignments removed by a principal in the management
account itself. The delegated admin returns AccessDeniedException
regardless of IAM permissions.

Confirmed live on 2026-04-30 by attempting deletion of an
AWSAdministratorAccess assignment from both the Lambda role and the
console, while authenticated to the delegated admin account. Console
error: "You can't remove access to permission sets provisioned in the
management account... Contact the owner of the management account
if you need help."

The delegated-admin topology is therefore unworkable. This refactor
collapses Cerberus into a single-account deployment in the management
account, with mitigations for the security implications.

Mitigations added:
- Inline AWS::IAM::ManagedPolicy permission boundary attached to both
  the Lambda role and the state machine role. SCPs do not apply to
  management-account principals; this is the only IAM-layer guardrail.
- Mode parameter (ENFORCE | DRY_RUN | DISABLED). DISABLED toggles the
  EventBridge rule State; DRY_RUN logs what would be deleted without
  calling the SSO API.
- Reduced ReservedConcurrentExecutions from 5 to 2.
- Three new alarms: Lambda Errors, Lambda Throttles, and unusual
  deletion volume (>10 successful executions in 5 min) — runaway
  detector for regex misfire or compromise.
- Self-lockout protection (kept from PR #12): events targeting the
  configured ManagementAccountId are skipped at the state machine.

Removed:
- cft-eventbridge-rule.yaml (cross-account forwarder no longer needed)
- Custom event bus (cerberus-event-bus) and its policy
- Cross-account event rule (CerberusEventManagementAccount)
- EventBusName parameter and EventBusArn output

Added:
- Mode parameter + DRY_RUN early-return in app.py
- CerberusPermissionsBoundary (AWS::IAM::ManagedPolicy)
- DRY_RUN unit test + GROUP-principal unit test
- tearDown to clean Mode env across tests
- README rewrite: rationale, pre-deploy security checklist, Mode
  parameter, all four alarms, migration runbook

Tests: 8 passing (was 6). sam validate --lint clean. sam build green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…BLED guard)

- CLAUDE.md: rewrite Two-Account Deployment section as single-account; remove
  obsolete Critical Code Quirk (line 120 was removed in PR #12); add Mode to
  Primary Tuning Surface; remove cft-eventbridge-rule.yaml references in MCP
  and Plugins sections.
- cerberus/README.md: replace inaccurate "sso:* (read + DeleteAccountAssignment
  only)" boundary description with the explicit list of six allowed sso
  actions, matching the actual policy in template.yaml.
- cerberus/src/cerberus/app.py: defense-in-depth guard for Mode=DISABLED. The
  EventBridge rule is the primary kill switch, but if the Lambda is invoked
  directly (test, manual), the handler now short-circuits before any deletion.
- cerberus/tests/unit/test_cerberus.py: new test_lambda_handler_disabled_mode.

Tests: 9 passing (was 8). sam validate --lint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gregsienkiewicz

Copy link
Copy Markdown
Contributor Author

Code review

/code-review ran on this PR. Five review agents surfaced candidates; six findings met the ≥80 confidence threshold after Haiku scoring. Four addressed inline in e9e3e94 before posting this comment.

Addressed:

  1. README permission boundary description was inaccurate. Said "sso:* (read + DeleteAccountAssignment only)" but the policy lists six specific sso actions (no wildcard). Replaced with the explicit list to match template.yaml CerberusPermissionsBoundary. Fixed in cerberus/README.md.

  2. CLAUDE.md "Two-Account Deployment" section described the deleted topology. PR removes cft-eventbridge-rule.yaml and collapses to single-account, but the project CLAUDE.md at the time of opening still pointed at the old topology. Rewritten to reflect the current architecture in CLAUDE.md.

  3. CLAUDE.md "Primary Tuning Surface" listed only three regex env vars, missing Mode. Mode is the kill switch / dry-run toggle introduced in this PR. Added to the list in CLAUDE.md.

  4. CLAUDE.md "Critical Code Quirk" warned about the line-120 hardcoded SUCCEEDED override, which was removed in PR fix: surface real deletion failures and skip mgmt-account events #12 (e9c7ce1). The warning was stale and would mislead future readers. Section removed in CLAUDE.md.

Defense-in-depth fix bundled in the same commit (scored 70 — below threshold but worth the 5 lines):

  1. Mode=DISABLED was honoured at the EventBridge rule layer only. If the Lambda were invoked directly (console test, Step Functions test, misrouted event), it would still call sso:DeleteAccountAssignment. Added a defense-in-depth guard in app.py plus a unit test.

Considered but deferred:

  • CerberusHighDeletionRateAlarm counts skipped and DRY_RUN executions as ExecutionsSucceeded (score 75). The skip Pass state and the DRY_RUN early-return both terminate as successful executions, so the alarm could fire on legitimate management-account-skip bursts or initial DRY_RUN validation runs. Real concern; the right fix is a custom metric emitted only on actual deletions, which is non-trivial and would expand scope. Tracking as a follow-up.
  • UserName vs DisplayName mismatch (Lambda reads UserName, state machine validates DisplayName presence). Pre-existing from PR Initial Release of Cerberus #2 and confirmed working in production (real DescribeUser responses contain both fields). Out of scope per the review skill rubric.

Generated with Claude Code

Copilot AI review requested due to automatic review settings May 19, 2026 18:32

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Collapses Cerberus from a delegated-admin + management-account topology into a single management-account deployment. Adds an inline permission boundary, an operational Mode parameter (ENFORCE/DRY_RUN/DISABLED), additional CloudWatch alarms, and reduces reserved concurrency from 5 to 2. Documentation and tests updated; cross-account event bus and forwarder template removed.

Changes:

  • Removed cft-eventbridge-rule.yaml, EventBusName parameter, CerberusEventBus/CerberusEventBusPolicy, and EventBusArn output; the EventBridge rule now binds to the default bus.
  • Added Mode parameter (wired to env var + EventBridge rule State), CerberusPermissionsBoundary managed policy applied to both roles, and three new alarms (CerberusFunctionErrorsAlarm, CerberusFunctionThrottlesAlarm, CerberusHighDeletionRateAlarm).
  • Expanded unit tests for DRY_RUN, DISABLED, and group-principal match; rewrote README.md and CLAUDE.md for the new topology.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
cft-eventbridge-rule.yaml Deleted — cross-account forwarder no longer needed.
cerberus/template.yaml Adds Mode param, permissions boundary, new alarms; removes event bus + policy; lowers concurrency.
cerberus/src/cerberus/app.py Reads Mode env var; early-returns for DISABLED and DRY_RUN.
cerberus/statemachine/cerberus.asl.json Updates skip-reason text for management-account targets.
cerberus/tests/unit/test_cerberus.py Adds tests for DISABLED/DRY_RUN/group match; adds tearDown for Mode.
cerberus/README.md Rewrites rationale, security checklist, alarms, Mode, migration runbook.
CLAUDE.md Updates guidance for single-account deployment; removes references to deleted template.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cerberus/template.yaml
Comment thread cerberus/src/cerberus/app.py Outdated
Comment thread cerberus/tests/unit/test_cerberus.py
Comment thread cerberus/src/cerberus/app.py Outdated
Comment thread cerberus/README.md Outdated
Comment thread cerberus/template.yaml Outdated
gregsienkiewicz and others added 9 commits May 19, 2026 14:56
Addresses substantive findings from peer review of cerberus.asl.json:

- Add Retry (3 attempts, 3s/2.0x backoff) to DescribePermissionSet,
  DescribeInstance, DescribeUser, DescribeGroup so transient AWS SDK
  throttling/network errors don't fail the whole execution.
- Narrow the Cerberus Lambda invoke retry from a blanket
  States.TaskFailed (5 attempts × 15s) to Lambda.{Service,AWSLambda,
  SdkClient,TooManyRequests}Exception only. The Lambda already returns
  structured FAILED for every business-logic exception, so blanket
  retries only catch Lambda crashes — which retrying with identical
  input cannot fix, and worst-case wastes ~25 min on repeated 300s
  timeouts. Crashes now fail fast onto CerberusExecutionFailureAlarm.
- Change "Is User or Group Name Returned?" choice from
  $.DescribeUser.DisplayName to $.DescribeUser.UserName so the
  validation field matches what the Lambda actually reads. DisplayName
  is optional on Identity Store DescribeUser; UserName is required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s, null-safe env reads)

Codex GPT review flagged four pre-rollout hardening items; this commit
addresses three (#1 deletion-status polling deferred to a separate PR).

- Drop ManagementAccountId parameter; derive the protected-target
  account from AWS::AccountId in the state machine substitution.
  Eliminates the misconfiguration vector where an operator types the
  wrong account ID and the real management account becomes vulnerable
  to self-lockout. Removes the parameter from template, Makefile, and
  both READMEs.

- Replace AWS/States ExecutionsSucceeded alarm with a Cerberus/Deleted
  custom-metric alarm. The Lambda now emits Deleted on accepted delete,
  Skipped on DISABLED/DRY_RUN/no-action, and Failed on every FAILED
  path. The old alarm counted no-op and skipped executions as success
  and could fire on a flood of non-deletion events; the new one fires
  only on real deletion volume.

- Null-safe os.environ.get for PermissionSetNamePattern,
  PrincipalGroupNamePattern, PrincipalUserNameEmail. Fail-closed early
  return when either pattern is empty, since re.compile("") matches
  everything and an empty pattern under ENFORCE would delete every
  matching principal. Hardens Lambda against non-template invocation
  contexts.

Tests: cloudwatch mocked across all 9 existing tests; metric-emission
assertions added covering Deleted/Skipped/Failed paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, drop StringMatches)

Local Copilot pre-PR review surfaced four items; all addressed in one
pass before requesting CODEOWNERS review.

- Fail closed on unrecognized DeleteAccountAssignment status. AWS
  documents three states (IN_PROGRESS, SUCCEEDED, FAILED) but the
  handler previously emitted Deleted on anything that wasn't literally
  FAILED, including None and unknown strings. Treating an unparseable
  API response as success masks contract drift; the handler now returns
  FAILED and emits the Failed metric for any status outside the
  documented set. Two new tests cover the unknown-string and
  missing-AccountAssignmentDeletionStatus paths.

- Scope lambda:InvokeFunction in the permissions boundary to lambdas in
  this stack (arn:...:function:${AWS::StackName}-*) rather than *. The
  inline grant from the SAM Serverless::Connector is already scoped to
  the CerberusFunction ARN; this tightens the boundary ceiling to match
  the spirit of that grant. !GetAtt CerberusFunction.Arn isn't usable
  here (circular: the function itself has PermissionsBoundary set to
  this managed policy), so the ARN is constructed from the stack name.

- Fix README alarm table — CerberusHighDeletionRateAlarm row still
  said AWS/States ExecutionsSucceeded, but the previous commit had
  already retargeted it to the Cerberus/Deleted custom metric.

- Replace remaining StringMatches with StringEquals for exact-string
  Choice predicates ("CreateAccountAssignment", "USER", "GROUP"). The
  state machine authoring conventions added to CLAUDE.md in this same
  PR call this out specifically; shipping the convention while
  violating it three lines away was inconsistent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… before event parsing, CloudTrail docs)

- Move Mode parse + DISABLED check above all event destructuring so the
  kill switch survives stripped-down direct invocations (the previous
  position let an AttributeError fire before DISABLED was honoured).
- Treat any Mode value other than ENFORCE / DRY_RUN / DISABLED as
  DISABLED. CFN AllowedValues blocks deploy-time typos, but a direct env
  var override or tooling bug could land an unknown value — fail closed
  rather than silently fall through to ENFORCE.
- Reword README CloudTrail bullet: data events capture Invoke only, not
  code-change activity (that's a default management event).
- Add regression tests for both: DISABLED on a stripped {} event, and
  unknown Mode treated as DISABLED.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gregsienkiewicz gregsienkiewicz requested review from a team as code owners May 28, 2026 14:17
@gregsienkiewicz gregsienkiewicz merged commit b1046e1 into main May 28, 2026
4 checks passed
@gregsienkiewicz gregsienkiewicz deleted the feat/gregsienkiewicz/major-refactor branch May 28, 2026 14:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants