feat: deploy Cerberus in management account, drop delegated-admin model#13
Conversation
IAM Identity Center enforces a service-level restriction (invisible to IAM, SCPs, and delegation config): permission sets whose lifecycle is owned by the management account — every Control Tower default — can only have their assignments removed by a principal in the management account itself. The delegated admin returns AccessDeniedException regardless of IAM permissions. Confirmed live on 2026-04-30 by attempting deletion of an AWSAdministratorAccess assignment from both the Lambda role and the console, while authenticated to the delegated admin account. Console error: "You can't remove access to permission sets provisioned in the management account... Contact the owner of the management account if you need help." The delegated-admin topology is therefore unworkable. This refactor collapses Cerberus into a single-account deployment in the management account, with mitigations for the security implications. Mitigations added: - Inline AWS::IAM::ManagedPolicy permission boundary attached to both the Lambda role and the state machine role. SCPs do not apply to management-account principals; this is the only IAM-layer guardrail. - Mode parameter (ENFORCE | DRY_RUN | DISABLED). DISABLED toggles the EventBridge rule State; DRY_RUN logs what would be deleted without calling the SSO API. - Reduced ReservedConcurrentExecutions from 5 to 2. - Three new alarms: Lambda Errors, Lambda Throttles, and unusual deletion volume (>10 successful executions in 5 min) — runaway detector for regex misfire or compromise. - Self-lockout protection (kept from PR #12): events targeting the configured ManagementAccountId are skipped at the state machine. Removed: - cft-eventbridge-rule.yaml (cross-account forwarder no longer needed) - Custom event bus (cerberus-event-bus) and its policy - Cross-account event rule (CerberusEventManagementAccount) - EventBusName parameter and EventBusArn output Added: - Mode parameter + DRY_RUN early-return in app.py - CerberusPermissionsBoundary (AWS::IAM::ManagedPolicy) - DRY_RUN unit test + GROUP-principal unit test - tearDown to clean Mode env across tests - README rewrite: rationale, pre-deploy security checklist, Mode parameter, all four alarms, migration runbook Tests: 8 passing (was 6). sam validate --lint clean. sam build green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…BLED guard) - CLAUDE.md: rewrite Two-Account Deployment section as single-account; remove obsolete Critical Code Quirk (line 120 was removed in PR #12); add Mode to Primary Tuning Surface; remove cft-eventbridge-rule.yaml references in MCP and Plugins sections. - cerberus/README.md: replace inaccurate "sso:* (read + DeleteAccountAssignment only)" boundary description with the explicit list of six allowed sso actions, matching the actual policy in template.yaml. - cerberus/src/cerberus/app.py: defense-in-depth guard for Mode=DISABLED. The EventBridge rule is the primary kill switch, but if the Lambda is invoked directly (test, manual), the handler now short-circuits before any deletion. - cerberus/tests/unit/test_cerberus.py: new test_lambda_handler_disabled_mode. Tests: 9 passing (was 8). sam validate --lint clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code review
Addressed:
Defense-in-depth fix bundled in the same commit (scored 70 — below threshold but worth the 5 lines):
Considered but deferred:
Generated with Claude Code |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Collapses Cerberus from a delegated-admin + management-account topology into a single management-account deployment. Adds an inline permission boundary, an operational Mode parameter (ENFORCE/DRY_RUN/DISABLED), additional CloudWatch alarms, and reduces reserved concurrency from 5 to 2. Documentation and tests updated; cross-account event bus and forwarder template removed.
Changes:
- Removed
cft-eventbridge-rule.yaml,EventBusNameparameter,CerberusEventBus/CerberusEventBusPolicy, andEventBusArnoutput; the EventBridge rule now binds to the default bus. - Added
Modeparameter (wired to env var + EventBridge ruleState),CerberusPermissionsBoundarymanaged policy applied to both roles, and three new alarms (CerberusFunctionErrorsAlarm,CerberusFunctionThrottlesAlarm,CerberusHighDeletionRateAlarm). - Expanded unit tests for
DRY_RUN,DISABLED, and group-principal match; rewroteREADME.mdandCLAUDE.mdfor the new topology.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
cft-eventbridge-rule.yaml |
Deleted — cross-account forwarder no longer needed. |
cerberus/template.yaml |
Adds Mode param, permissions boundary, new alarms; removes event bus + policy; lowers concurrency. |
cerberus/src/cerberus/app.py |
Reads Mode env var; early-returns for DISABLED and DRY_RUN. |
cerberus/statemachine/cerberus.asl.json |
Updates skip-reason text for management-account targets. |
cerberus/tests/unit/test_cerberus.py |
Adds tests for DISABLED/DRY_RUN/group match; adds tearDown for Mode. |
cerberus/README.md |
Rewrites rationale, security checklist, alarms, Mode, migration runbook. |
CLAUDE.md |
Updates guidance for single-account deployment; removes references to deleted template. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Addresses substantive findings from peer review of cerberus.asl.json:
- Add Retry (3 attempts, 3s/2.0x backoff) to DescribePermissionSet,
DescribeInstance, DescribeUser, DescribeGroup so transient AWS SDK
throttling/network errors don't fail the whole execution.
- Narrow the Cerberus Lambda invoke retry from a blanket
States.TaskFailed (5 attempts × 15s) to Lambda.{Service,AWSLambda,
SdkClient,TooManyRequests}Exception only. The Lambda already returns
structured FAILED for every business-logic exception, so blanket
retries only catch Lambda crashes — which retrying with identical
input cannot fix, and worst-case wastes ~25 min on repeated 300s
timeouts. Crashes now fail fast onto CerberusExecutionFailureAlarm.
- Change "Is User or Group Name Returned?" choice from
$.DescribeUser.DisplayName to $.DescribeUser.UserName so the
validation field matches what the Lambda actually reads. DisplayName
is optional on Identity Store DescribeUser; UserName is required.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s, null-safe env reads) Codex GPT review flagged four pre-rollout hardening items; this commit addresses three (#1 deletion-status polling deferred to a separate PR). - Drop ManagementAccountId parameter; derive the protected-target account from AWS::AccountId in the state machine substitution. Eliminates the misconfiguration vector where an operator types the wrong account ID and the real management account becomes vulnerable to self-lockout. Removes the parameter from template, Makefile, and both READMEs. - Replace AWS/States ExecutionsSucceeded alarm with a Cerberus/Deleted custom-metric alarm. The Lambda now emits Deleted on accepted delete, Skipped on DISABLED/DRY_RUN/no-action, and Failed on every FAILED path. The old alarm counted no-op and skipped executions as success and could fire on a flood of non-deletion events; the new one fires only on real deletion volume. - Null-safe os.environ.get for PermissionSetNamePattern, PrincipalGroupNamePattern, PrincipalUserNameEmail. Fail-closed early return when either pattern is empty, since re.compile("") matches everything and an empty pattern under ENFORCE would delete every matching principal. Hardens Lambda against non-template invocation contexts. Tests: cloudwatch mocked across all 9 existing tests; metric-emission assertions added covering Deleted/Skipped/Failed paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, drop StringMatches)
Local Copilot pre-PR review surfaced four items; all addressed in one
pass before requesting CODEOWNERS review.
- Fail closed on unrecognized DeleteAccountAssignment status. AWS
documents three states (IN_PROGRESS, SUCCEEDED, FAILED) but the
handler previously emitted Deleted on anything that wasn't literally
FAILED, including None and unknown strings. Treating an unparseable
API response as success masks contract drift; the handler now returns
FAILED and emits the Failed metric for any status outside the
documented set. Two new tests cover the unknown-string and
missing-AccountAssignmentDeletionStatus paths.
- Scope lambda:InvokeFunction in the permissions boundary to lambdas in
this stack (arn:...:function:${AWS::StackName}-*) rather than *. The
inline grant from the SAM Serverless::Connector is already scoped to
the CerberusFunction ARN; this tightens the boundary ceiling to match
the spirit of that grant. !GetAtt CerberusFunction.Arn isn't usable
here (circular: the function itself has PermissionsBoundary set to
this managed policy), so the ARN is constructed from the stack name.
- Fix README alarm table — CerberusHighDeletionRateAlarm row still
said AWS/States ExecutionsSucceeded, but the previous commit had
already retargeted it to the Cerberus/Deleted custom metric.
- Replace remaining StringMatches with StringEquals for exact-string
Choice predicates ("CreateAccountAssignment", "USER", "GROUP"). The
state machine authoring conventions added to CLAUDE.md in this same
PR call this out specifically; shipping the convention while
violating it three lines away was inconsistent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… before event parsing, CloudTrail docs)
- Move Mode parse + DISABLED check above all event destructuring so the
kill switch survives stripped-down direct invocations (the previous
position let an AttributeError fire before DISABLED was honoured).
- Treat any Mode value other than ENFORCE / DRY_RUN / DISABLED as
DISABLED. CFN AllowedValues blocks deploy-time typos, but a direct env
var override or tooling bug could land an unknown value — fail closed
rather than silently fall through to ENFORCE.
- Reword README CloudTrail bullet: data events capture Invoke only, not
code-change activity (that's a default management event).
- Add regression tests for both: DISABLED on a stripped {} event, and
unknown Mode treated as DISABLED.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update — additional work since initial review
Since the original architectural pivot landed, the branch has received five follow-on commits addressing reviewer feedback and operational quality. Specifics in the original description below have changed where noted.
/code-reviewfindings (e9e3e94) — CLAUDE.md staleness, README cleanup, DISABLED-mode defense-in-depth guard at the Lambda.687be33) — State machine hardening:Retryblocks on all four AWS SDK integration Tasks (DescribePermissionSet,DescribeInstance,DescribeUser,DescribeGroup). One throttle without retry previously failed the whole execution.States.TaskFailed(5 attempts × 15s, worst case ~25 min on Lambda crashes) to specific transient invocation errors (Lambda.ServiceException,Lambda.AWSLambdaException,Lambda.SdkClientException,Lambda.TooManyRequestsException). Crashes now fail fast.Is User or Group Name Returned?Choice now validates$.DescribeUser.UserName— the field the Lambda actually reads, and the field that is required on Identity Store users (DisplayNameis optional and could have silently rejected valid users).fc65b13) — Project memory now codifies the patterns surfaced by the KIRO review: uniform Task retries, Choice/Lambda field alignment,StringEqualsoverStringMatches.3bb6329) — RootMakefileis the single CI/CD entry point.make check(validate + test) gates PRs;make deploy NOTIFICATION_EMAIL=... MODE=DRY_RUNreplaces remembered SAM CLI command sequences.CI=trueskips the interactive changeset confirmation thatsamconfig.tomlenables by default. Root README rewritten to point at this workflow.48b9c04) — Three substantive hardening items:ManagementAccountIdparameter. The protected-target account is now derived from!Ref AWS::AccountIdin the state machine substitution. Eliminates the misconfiguration class where a wrong account ID leaves the real management account vulnerable to self-lockout. The original PR description below still listsManagementAccountIdas a retained parameter — that's no longer accurate.Cerberus / {Deleted, Skipped, Failed}at each outcome path.CerberusHighDeletionRateAlarmretargeted fromAWS/States ExecutionsSucceeded(which counted no-ops, dry-runs, and skipped paths as success) toCerberus / Deleted— fires on actual deletion volume, not on execution count.PermissionSetNamePattern,PrincipalGroupNamePattern,PrincipalUserNameEmail) now useos.environ.get(name, "")defaults. Fail-closed early return when either regex pattern is empty, sincere.compile("")matches every string and an empty pattern underENFORCEwould delete every matching principal.Note
Deferred to a follow-up PR (Codex #1): Deletion-confirmation polling. The Lambda currently treats
DeleteAccountAssignment Status=IN_PROGRESSas "deletion accepted." Confirming actual completion via a Step Functions polling loop onDescribeAccountAssignmentDeletionStatusis an architectural change to the state machine and warrants its own review cycle. The custom-metric work here is forward-compatible: when polling lands,Deletedsemantics shift from "API accepted" to "polling confirmed" with no metric-name change.Test plan refresh
make check— SAM template valid + 9/9 unit tests pass (was 8/8 at original open; one new metric-emission assertion per outcome type —Deleted,Skipped,Failed).make deploy(no params) — fails closed with helpful usage hint (NOTIFICATION_EMAILrequired;MANAGEMENT_ACCOUNT_IDno longer needed).Cerberusnamespace appears in CloudWatch Metrics withDeleted,Skipped,Failedcounters incrementing on the relevant paths.Summary
IAM Identity Center enforces a service-level restriction that is invisible to IAM, SCPs, and the delegation configuration: permission sets whose lifecycle is owned by the management account — every Control Tower default — can only have their assignments removed by a principal in the management account itself. The delegated admin returns
AccessDeniedExceptionregardless of IAM permissions. This makes the previous delegated-admin topology unworkable.This PR collapses Cerberus into a single-account deployment in the management account, with mitigations for the security implications of running application code there.
Architectural change
cerberus-event-bus+CerberusEventBusPolicycft-eventbridge-rule.yaml(deployed in mgmt)events:PutEventsSecurity mitigations added
AWS::IAM::ManagedPolicy) attached to both the Lambda role and the state machine role. SCPs do not apply to management-account principals; this is the only IAM-layer guardrail. Whitelist:sso:DeleteAccountAssignment,sso:Describe*/Get*/List*,identitystore:Describe*,lambda:InvokeFunction,logs:*,cloudwatch:PutMetricData. Anything else (iam:*,organizations:*,sts:AssumeRole,sso:CreateAccountAssignment,kms:*, etc.) is implicitly denied.Modeparameter (ENFORCE|DRY_RUN|DISABLED).DRY_RUNlogs would-delete decisions without calling the SSO API.DISABLEDsets the EventBridge rule'sStatetoDISABLED— operational kill switch flippable viasam deploy --parameter-overrides Mode=....ReservedConcurrentExecutionsfrom 5 → 2.ExecutionsFailed:CerberusFunctionErrorsAlarm— Lambda errors > 0CerberusFunctionThrottlesAlarm— concurrency cap hitCerberusHighDeletionRateAlarm— > 10 successful executions in 5 min (runaway / compromise detector)ManagementAccountIdare skipped at the state machine, with rationale text updated to reflect the new context.Removed
cft-eventbridge-rule.yamlEventBusNameparameterCerberusEventBusresource andCerberusEventBusPolicyCerberusEventManagementAccountevent rule (cross-account-bus version)EventBusArnoutputAdded
Modeparameter + DRY_RUN early-return inapp.pyCerberusPermissionsBoundary(AWS::IAM::ManagedPolicy)test_lambda_handler_dry_run_mode,test_lambda_handler_group_principal_matchtearDownto cleanModeenv across testsModeparameter, and a migration runbookTest plan
python -m unittest discover -s cerberus/tests/unit -v— 8/8 pass (was 6)sam validate --lint— validsam build— substitutions resolve, build succeedsMode=DRY_RUN. Trigger a knownCreateAccountAssignmentagainst an account-factory test account. Verify Lambda logsDRY_RUN: would remove ...anddelete_account_assignmentis not called.cft-eventbridge-rule.yamlstack.Mode=ENFORCE. Trigger another known assignment. Verify the assignment is actually removed in the IAM Identity Center console (not only the Lambda's reported result).sso:DeleteAccountAssignmentfrom the boundary) and confirmCerberusFunctionErrorsAlarmorExecutionsFailedfires.Operator-facing breaking changes
cerberus/README.md.EventBusNameparameter is removed. Operators referencing it in deploy scripts will need to drop it.EventBusArnoutput is removed. Anything consuming that output must be updated.Deferred to follow-up PRs
AWS::Signer::SigningProfile+CodeSigningConfig) — deferred per discussion. Add when deploys move to CI/CD.SsoInstanceIdparameter for tighter resource scoping onsso:*actions. KeepsResource: "*"for OSS friendliness in this PR.DRY_RUN-stuck-modedetection.re.fullmatch,AllowedPattern/MinLengthon regex parameters).Generated with Claude Code