diff --git a/CLAUDE.md b/CLAUDE.md index e1bfd61..01c0859 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -6,42 +6,48 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co Cerberus is an AWS SAM application that automatically removes unwanted default AWS Control Tower IAM Identity Center permission set assignments. It intercepts `CreateAccountAssignment` CloudTrail events and deletes the assignment when it matches configured regex patterns. -## Two-Account Deployment +## Single-Account Deployment in the Management Account -This app spans two AWS accounts: +Cerberus must be deployed in the AWS Organization management account. IAM Identity Center enforces a service-level restriction (invisible to IAM, SCPs, and the delegated-admin configuration): permission sets whose lifecycle is owned by the management account — every Control Tower default — can only have their assignments removed by a principal in the management account itself. A delegated admin returns `AccessDeniedException` regardless of IAM permissions, which is why Cerberus does not run in a delegated-admin account. -- **Management account**: `cft-eventbridge-rule.yaml` is a standalone CloudFormation template (not SAM) that forwards `sso:CreateAccountAssignment` events cross-account to the custom event bus in the delegated admin account. -- **Delegated admin account**: `cerberus/template.yaml` is the SAM app. Deploy here. - -Never conflate these two templates. `sam build` / `sam deploy` only touch `cerberus/`. - -## Critical Code Quirk - -`cerberus/src/cerberus/app.py` around line 120 unconditionally overwrites the real `sso:DeleteAccountAssignment` API response with a hardcoded `{"AccountAssignmentDeletionStatus": {"Status": "SUCCEEDED"}}`. This means the function always reports success regardless of what the API actually returned. Verify intent with the team before modifying this function or adding response-based branching logic. +`cerberus/template.yaml` is the SAM app. 
Deploy it in the management account. There are no other CloudFormation templates in the repository. ## Primary Tuning Surface -The three Lambda environment variables below are the main way to control what gets deleted. They are regex patterns set in `cerberus/template.yaml`: +Lambda environment variables (set in `cerberus/template.yaml`): -- `PermissionSetNamePattern` -- `PrincipalGroupNamePattern` -- `PrincipalUserNameEmail` +- `PermissionSetNamePattern` — regex matched against the permission set name (case-insensitive). +- `PrincipalGroupNamePattern` — regex matched against the principal name when `principalType=GROUP`. +- `PrincipalUserNameEmail` — exact email match against the principal name when `principalType=USER`. +- `Mode` — `ENFORCE` | `DRY_RUN` | `DISABLED`. `DRY_RUN` logs would-delete decisions without calling the SSO API; `DISABLED` turns off the EventBridge rule and short-circuits the Lambda. Operational kill switch + dry-run capability. ## Testing Tests use stdlib `unittest`, not pytest. Do not add pytest dependencies or use pytest-style fixtures. Test file: `cerberus/tests/unit/test_cerberus.py`. +## State Machine & Lambda Authoring Conventions + +When editing `cerberus/statemachine/cerberus.asl.json` or `cerberus/src/cerberus/app.py`, apply these defaults at write time — don't wait for a reviewer to add them. + +- **Every Task state needs an explicit `Retry`, applied uniformly.** A defensive pattern on one Task and not its siblings is worse than no pattern — it signals "we thought about this" while leaving peers exposed. AWS SDK integration tasks (`arn:aws:states:::aws-sdk:*`) are subject to throttling and transient network errors; one throttle without retry fails the whole execution. + - SDK integration default: 3 attempts on `States.TaskFailed`, 3s interval, 2.0× backoff. + - Lambda invoke default: scope `ErrorEquals` to `Lambda.ServiceException`, `Lambda.AWSLambdaException`, `Lambda.SdkClientException`, `Lambda.TooManyRequestsException`. 
Do **not** retry blanket `States.TaskFailed` for Lambda — the Cerberus Lambda wraps every business-logic exception into a structured `{"result": "FAILED"}` return, so `States.TaskFailed` only fires on crashes (timeout/OOM/runtime), which retrying with identical input cannot fix and will burn ~5×timeout-seconds of wall time before surfacing. + +- **Choice-state validations must match what the Lambda actually reads.** When the ASL feeds data to the Lambda, every `Is X Returned?` Choice must check the exact JSONPath the Lambda consumes — not a sibling field. Trace `event.get(...)` calls in `cerberus/src/cerberus/app.py` and align `Variable` paths in the ASL to match. Sibling-field validation can fail-closed on valid input when the sibling is optional (`DisplayName` on Identity Store users is the canonical case: `UserName` is required, `DisplayName` is not). + +- **Prefer `StringEquals` over `StringMatches` for exact strings.** `StringMatches` allows wildcards — use it only when you mean it. + ## MCP Servers Two MCP servers are configured in `.mcp.json` at the repo root. Use them proactively — don't guess at AWS API shapes or dig through logs manually. -**`awslabs.aws-documentation-mcp-server`** — AWS official docs, resource schemas, IAM policy references, API signatures. Reach for this whenever you're working on `cerberus/template.yaml`, `cerberus/statemachine/cerberus.asl.json`, or `cft-eventbridge-rule.yaml`, or any time you need to verify an AWS API call, IAM action name, or resource attribute. +**`awslabs.aws-documentation-mcp-server`** — AWS official docs, resource schemas, IAM policy references, API signatures. Reach for this whenever you're working on `cerberus/template.yaml` or `cerberus/statemachine/cerberus.asl.json`, or any time you need to verify an AWS API call, IAM action name, or resource attribute. **`awslabs-cloudwatch-mcp-server`** — Live CloudWatch access to the deployed Cerberus stack. 
Use this to debug Step Functions execution failures, inspect Lambda errors, or trace an event end-to-end. The default log group is `/cerberus` (parameterized at deploy time). The server is pre-configured with `AWS_PROFILE=cerberus` and `AWS_REGION=ca-central-1`; the `cerberus` profile must exist locally with CloudWatch read-only access (see `cerberus/README.md` for profile setup). ## Plugins -The **`aws-serverless` plugin** (`aws-serverless@claude-plugins-official`) is enabled at project scope in `.claude/settings.json`. It provides SAM-aware skills and serverless-specific context for working with `cerberus/template.yaml`, `cerberus/statemachine/cerberus.asl.json`, and `cft-eventbridge-rule.yaml`. +The **`aws-serverless` plugin** (`aws-serverless@claude-plugins-official`) is enabled at project scope in `.claude/settings.json`. It provides SAM-aware skills and serverless-specific context for working with `cerberus/template.yaml` and `cerberus/statemachine/cerberus.asl.json`. The **`code-review` plugin** (`code-review@claude-plugins-official`) is also enabled. See PR Requirements below. diff --git a/Makefile b/Makefile new file mode 100644 index 0000000..bf2db5b --- /dev/null +++ b/Makefile @@ -0,0 +1,72 @@ +# Cerberus — wrapper around SAM CLI + unit tests. +# Single entry point for CI/CD so deploys and tests don't depend on remembered +# command sequences. Targets are self-documenting; run `make help`. + +SAM_DIR := cerberus +PYTHON ?= python3 +VENV := $(SAM_DIR)/.venv +VENV_PY := $(VENV)/bin/python +VENV_MARKER := $(VENV)/.installed +REQUIREMENTS := $(SAM_DIR)/tests/requirements.txt +AWS_REGION ?= ca-central-1 + +# Required for `make deploy`. No defaults — failing closed is intentional. +NOTIFICATION_EMAIL ?= + +# Optional parameter overrides for `make deploy`. Unset => template defaults apply. 
+MODE ?= +PERMISSION_SET_PATTERN ?= +PRINCIPAL_GROUP_PATTERN ?= +PRINCIPAL_USER_EMAIL ?= +LOG_GROUP_NAME ?= +LOG_GROUP_RETENTION ?= + +.DEFAULT_GOAL := help +.PHONY: help install validate test check build deploy clean _check-deploy-params + +help: ## Show available targets + @awk 'BEGIN {FS = ":.*?## "} /^[a-zA-Z_-]+:.*?## / {printf " \033[36m%-16s\033[0m %s\n", $$1, $$2}' $(MAKEFILE_LIST) + +$(VENV_MARKER): $(REQUIREMENTS) + $(PYTHON) -m venv $(VENV) + $(VENV)/bin/pip install --quiet --upgrade pip + $(VENV)/bin/pip install --quiet -r $(REQUIREMENTS) + @touch $(VENV_MARKER) + +install: $(VENV_MARKER) ## Set up local venv with test dependencies + +validate: ## Lint and validate the SAM template + cd $(SAM_DIR) && sam validate --lint + +test: $(VENV_MARKER) ## Run unit tests + AWS_DEFAULT_REGION=$(AWS_REGION) $(VENV_PY) -m unittest discover -s $(SAM_DIR)/tests/unit -t . -v + +check: validate test ## CI gate — validate + test + +build: ## Build deployment artifacts (sam build) + cd $(SAM_DIR) && sam build + +deploy: _check-deploy-params build ## Deploy stack (requires NOTIFICATION_EMAIL) + cd $(SAM_DIR) && sam deploy \ + $(if $(CI),--no-confirm-changeset) \ + --parameter-overrides \ + NotificationEmail=$(NOTIFICATION_EMAIL) \ + $(if $(MODE),Mode=$(MODE)) \ + $(if $(PERMISSION_SET_PATTERN),"PermissionSetNamePattern=$(PERMISSION_SET_PATTERN)") \ + $(if $(PRINCIPAL_GROUP_PATTERN),"PrincipalGroupNamePattern=$(PRINCIPAL_GROUP_PATTERN)") \ + $(if $(PRINCIPAL_USER_EMAIL),PrincipalUserNameEmail=$(PRINCIPAL_USER_EMAIL)) \ + $(if $(LOG_GROUP_NAME),LogGroupName=$(LOG_GROUP_NAME)) \ + $(if $(LOG_GROUP_RETENTION),LogGroupRetentionDays=$(LOG_GROUP_RETENTION)) + +clean: ## Remove build artifacts and venv + rm -rf $(SAM_DIR)/.aws-sam $(VENV) + +_check-deploy-params: + @if [ -z "$(NOTIFICATION_EMAIL)" ]; then \ + echo "ERROR: NOTIFICATION_EMAIL is required"; \ + echo ""; \ + echo "Usage: make deploy NOTIFICATION_EMAIL=ops@example.com"; \ + echo "Optional: 
MODE={ENFORCE|DRY_RUN|DISABLED} PERMISSION_SET_PATTERN='...' PRINCIPAL_GROUP_PATTERN='...' PRINCIPAL_USER_EMAIL='...' LOG_GROUP_NAME='/cerberus' LOG_GROUP_RETENTION=14"; \ + echo "In CI: set CI=true to skip the interactive changeset confirmation."; \ + exit 1; \ + fi diff --git a/README.md b/README.md index f3b0996..63cfb69 100644 --- a/README.md +++ b/README.md @@ -8,15 +8,23 @@ The default **IAM Identity Center Groups for AWS Control Tower** are rather perm We have created [Cerberus](https://www.britannica.com/topic/Cerberus) to monitor events from the `sso.amazonaws.com` service. Cerberus, often referred to as the hound of Hades, is a multi-headed dog that guards the gates of the underworld to prevent the dead from leaving, or in this case, prevent `CreateAccountAssignment` of unauthorized (unwanted) default permission sets to AWS Control Tower managed accounts. -# AWS Serverless Application Model (SAM) +## Deployment -Instruction on how to deploy the application, [Cerberus AWS SAM App](cerberus/README.md). +Cerberus is a single [AWS SAM](https://docs.aws.amazon.com/serverless-application-model/) stack that must be deployed in the AWS Organization **management account**. IAM Identity Center enforces a service-level restriction that prevents a delegated administrator from removing assignments owned by the management account — see [cerberus/README.md](cerberus/README.md#why-this-runs-in-the-aws-organization-management-account) for the full explanation, pre-deploy security checklist, parameter reference, and migration path from the older delegated-admin topology. -Deployment steps: +The repository ships a top-level `Makefile` as the single entry point for build, test, and deploy — no remembered SAM CLI command sequences required. -1. Deploy the [Cerberus AWS SAM App](cerberus/template.yaml) in the Management or delegated administrator IAM Identity Center account -2. 
Deploy the [EventBrdige Rule](cft-eventbridge-rule.yaml) in the Management account - - Reference the Output `EventBusArn` from the **Cerberus AWS SAM App** deployed stack for `CerberusEventBusArn` parameter +```bash +make help # List all available targets +make check # Validate template + run unit tests +make deploy \ + NOTIFICATION_EMAIL=oncall@example.com \ + MODE=DRY_RUN # First-time deploy in DRY_RUN +``` + +After observing `DRY_RUN: would remove ...` lines in the `/cerberus` log group for real `CreateAccountAssignment` events, re-run with `MODE=ENFORCE` (or omit — `ENFORCE` is the template default). + +In CI, set `CI=true` to skip the interactive changeset confirmation that `cerberus/samconfig.toml` enables by default. ## Contributing @@ -24,8 +32,9 @@ Contributions are welcome! Please follow these steps: 1. Fork the repository. 2. Create a feature branch. -3. Commit your changes. -4. Submit a pull request. +3. Run `make check` locally — must pass before opening a PR. +4. Commit your changes. +5. Submit a pull request. ## Code Formatting diff --git a/cerberus/README.md b/cerberus/README.md index 5bcb4d0..c06b1d3 100644 --- a/cerberus/README.md +++ b/cerberus/README.md @@ -1,148 +1,197 @@ # Cerberus -Cerberus is a [AWS Serverless Application Model](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-getting-started.html) serverless application for managing AWS resources with the SAM CLI. - -AWS SAM Amazon States Language (ASL) diagram of the Cerberus state machine. +Cerberus is an [AWS Serverless Application Model](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-getting-started.html) (SAM) application that automatically removes unwanted AWS Control Tower default IAM Identity Center permission set assignments. 
It listens for `sso:CreateAccountAssignment` events on CloudTrail and, when an assignment matches the configured regex patterns, calls `sso-admin:DeleteAccountAssignment` to remove it. ![Cerberus SAM ASL](../static/stepfunctions_graph.png) +## Why this runs in the AWS Organization management account + +Cerberus must be deployed in the AWS Organization **management account**, not in a delegated administrator account. + +[AWS IAM Identity Center](https://docs.aws.amazon.com/singlesignon/latest/userguide/delegated-admin.html) supports delegating administration to a member account, and that's the recommended pattern for most identity-management workloads. However, IAM Identity Center enforces a service-level restriction that is invisible to IAM policies, SCPs, and the delegation configuration: + +> Permission sets whose lifecycle is owned by the management account — including all the AWS-managed permission sets that AWS Control Tower provisions (`AWSAdministratorAccess`, `AWSReadOnlyAccess`, `AWSOrganizationsFullAccess`, etc.) — can only have their assignments removed by a principal in the management account itself. + +Attempting to delete these assignments from a delegated administrator account returns `AccessDeniedException` regardless of IAM permissions. Since Cerberus's purpose is precisely to clean up these AWS-Control-Tower-managed default assignments, it must run in the management account. + +This is a deliberate architectural compromise. The mitigations below (permission boundary, kill switch, expanded alarms, reduced concurrency) are designed to limit the blast radius of a compromised Cerberus deployment. 
+ +The "protected target" account (the one whose assignments the state machine always skips before invoking the Lambda) is derived automatically from `AWS::AccountId` at deploy time — there is no operator-supplied account ID parameter, which eliminates the misconfiguration vector where the wrong account is marked protected and the real management account becomes vulnerable to self-lockout. + +## Pre-deploy security checklist + +Before deploying Cerberus to the management account, confirm: + +- [ ] **CloudTrail** is enabled at the organization level and ingests `sso.amazonaws.com` events. +- [ ] **CloudTrail data events** are enabled on the Cerberus Lambda function ARN — this captures every `Invoke` call against the function, giving an audit trail of who/what triggered each deletion attempt. Configure post-deploy. (Code-change activity — `UpdateFunctionCode`, `UpdateFunctionConfiguration`, etc. — is captured by default CloudTrail management events on `lambda.amazonaws.com`; no extra config needed.) +- [ ] **GuardDuty Lambda Protection** is enabled (verify org-level coverage). +- [ ] **Branch protection on `main`** in the source repository: required reviewers, required status checks, signed commits required, no force-push, no admin bypass. +- [ ] **CODEOWNERS** routes Cerberus changes through your security and platform teams. +- [ ] **Deploying principal** is a dedicated `CerberusDeployer` role in the management account, not `AdministratorAccess`. The role itself should have a permission boundary. +- [ ] **NotificationEmail** is wired to a real on-call destination (PagerDuty, monitored shared inbox), not a personal email. + ## CloudFormation Template Parameters -The following parameters are defined in the `template.yaml` file and can be customized during deployment: +Defined in `template.yaml`. Parameters without a `Default` are required at deploy time. 
+ +### `Mode` (optional, default `ENFORCE`) + +Operational mode of the deletion pipeline: + +| Value | Behavior | +|---|---| +| `ENFORCE` | Default. Cerberus deletes matching assignments. | +| `DRY_RUN` | Lambda evaluates the regex and logs what *would* be deleted, but does not call the SSO API. Use for the first 24 hours after deploying or after changing a regex pattern. | +| `DISABLED` | EventBridge rule's `State` is set to `DISABLED`. No events reach the state machine. Operational kill switch. | + +Flip via `sam deploy --parameter-overrides Mode=DRY_RUN` (or `=DISABLED`, `=ENFORCE`) without changing code. + +### `PermissionSetNamePattern` (optional) -### EventBusName +Regex matched (case-insensitive) against the permission set name. Default matches the AWS Control Tower default permission sets (`AWSOrganizationsFullAccess`, `AWSReadOnlyAccess`, `AWSServiceCatalogEndUserAccess`, `AWSServiceCatalogAdminFullAccess`, `AWSPowerUserAccess`, `AWSAdministratorAccess`). -- **Type**: String -- **Default**: `cerberus-event-bus` -- **Description**: The name of the custom EventBridge event bus for Cerberus. +### `PrincipalGroupNamePattern` (optional) -### ManagementAccountId +Regex matched (case-insensitive) against the principal name when `principalType=GROUP`. Default matches the AWS Control Tower default groups. -- **Type**: String -- **Description**: The Management AWS account ID that will send events to the Cerberus event bus. -- **Allowed Pattern**: `^[0-9]{12}$` -- **Constraint Description**: Must be a valid 12-digit AWS account ID. +### `PrincipalUserNameEmail` (optional, default empty) -### LogGroupName +Exact email address (lowercase) matched against the principal name when `principalType=USER`. Used to clean up the default Account Factory admin user assignment created during account provisioning. Leave empty to disable user-email matching. 
-- **Type**: String -- **Default**: `/cerberus` -- **Description**: The name of the CloudWatch Log Group for the Cerberus State Machine. -- **Allowed Pattern**: `^[.\\-_/#A-Za-z0-9]{1,512}\\Z` -- **Min Length**: 1 -- **Max Length**: 512 -- **Constraint Description**: Log group name must be 1-512 characters long and can include letters, numbers, and the following characters: `.-_/#`. +### `NotificationEmail` (required) -### LogGroupRetentionDays +Email address subscribed to the SNS topic that receives all alarm notifications. -- **Type**: Number -- **Default**: 14 -- **Description**: The retention period in days for the CloudWatch Log Group. -- **Allowed Values**: 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1096, 1827, 2192, 2557, 2922, 3288, 3653. +### `LogGroupName` (optional, default `/cerberus`) -### PermissionSetNamePattern +Name of the CloudWatch Log Group for the Cerberus state machine and Lambda. -- **Type**: String -- **Default**: `^AWS(?:OrganizationsFullAccess|ReadOnlyAccess|ServiceCatalogEndUserAccess|ServiceCatalogAdminFullAccess|PowerUserAccess|AdministratorAccess)$` -- **Description**: Regex pattern for matching AWS Control Tower default permission set names, such as `OrganizationsFullAccess` and `AdministratorAccess`. +### `LogGroupRetentionDays` (optional, default 14) -### PrincipalGroupNamePattern +CloudWatch Log retention period. -- **Type**: String -- **Default**: `^AWS(?:LogArchiveViewers|LogArchiveAdmins|ControlTowerAdmins|AccountFactory|AuditAccountAdmins|SecurityAuditors|ServiceCatalogAdmins|SecurityAuditPowerUsers)$` -- **Description**: Regex pattern for matching AWS Control Tower default group principal names, such as `LogArchiveAdmins` and `ControlTowerAdmins`. 
+## Monitoring and alerts -### PrincipalUserNameEmail +Cerberus publishes four CloudWatch Alarms, all wired to the same SNS topic: -- **Type**: String -- **Default**: (empty) -- **Description**: Valid email addresses used by AWS Control Tower account factory enrollment. Leave empty to disable validation. -- **Allowed Pattern**: `^$|^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$` -- **Constraint Description**: Must be a valid email address or left empty. +| Alarm | Source metric | Threshold | What it means | +|---|---|---|---| +| `CerberusExecutionFailureAlarm` | `AWS/States ExecutionsFailed` | > 0 in 1 min | A state machine execution failed. Real deletion failure or upstream issue. | +| `CerberusFunctionErrorsAlarm` | `AWS/Lambda Errors` | > 0 in 1 min | Lambda raised an unhandled error. Code-level issue. | +| `CerberusFunctionThrottlesAlarm` | `AWS/Lambda Throttles` | > 0 in 1 min | Reserved-concurrency cap hit. Investigate event burst. | +| `CerberusHighDeletionRateAlarm` | `Cerberus Deleted` (custom) | > 10 in 5 min | Cerberus performed an unusual volume of actual deletions. Possible regex misfire or compromise. The metric counts only real `DeleteAccountAssignment` calls — no-action, dry-run, and skipped-mgmt-account paths emit `Skipped` instead, so this alarm has no false positives from non-deletion executions. | -### NotificationEmail +Subscribe `NotificationEmail` to a real on-call destination — a noisy alarm to a personal inbox is worse than no alarm. -- **Type**: String -- **Description**: Email address to receive notifications when the Cerberus state machine execution fails. -- **Allowed Pattern**: `^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$` -- **Constraint Description**: Must be a valid email address. +## Permission boundary -## Monitoring and Alerts +The template ships an inline `AWS::IAM::ManagedPolicy` (`CerberusPermissionsBoundary`) attached to both the Lambda execution role and the state machine role. 
It allows only the specific actions required for IAM Identity Center cleanup: -Cerberus includes built-in monitoring capabilities: +- `sso:DeleteAccountAssignment` +- `sso:DescribePermissionSet`, `sso:DescribeInstance`, `sso:ListPermissionSets`, `sso:GetPermissionSet`, `sso:DescribePermissionSetProvisioningStatus` +- `identitystore:DescribeUser`, `identitystore:DescribeGroup` +- `lambda:InvokeFunction` +- `logs:CreateLogStream`, `logs:PutLogEvents`, `logs:DescribeLogGroups`, `logs:DescribeLogStreams` +- `cloudwatch:PutMetricData` -- **CloudWatch Alarm**: Automatically monitors the state machine for execution failures -- **SNS Notifications**: Sends email notifications to the specified address when failures occur -- **Failure Detection**: Triggers alerts when any state machine execution fails +Anything else (`iam:*`, `organizations:*`, `sts:AssumeRole`, `sso:CreateAccountAssignment`, `kms:*`, etc.) is implicitly denied at the boundary regardless of inline policy grants. -The monitoring system helps ensure quick response to any issues with the Cerberus state machine execution. +Service Control Policies do **not** apply to management-account principals. The permission boundary is the only IAM-layer guardrail and is intentionally tight. ## Build and Deploy ### Build -Use the following command to build the application: - ```bash -sam build --use-container +sam build ``` -### Deploy - -⚠️ IMPORTANT PARAMETERS ⚠️ +Container builds are configured by default in `samconfig.toml`. The build runs inside `public.ecr.aws/sam/build-python3.13` so the artifact matches the Lambda runtime exactly. -#### `ManagementAccountId` and `EventBusName` +### Deploy (first time) -[AWS IAM Identity Center Documentation, Delegated administration](https://docs.aws.amazon.com/singlesignon/latest/userguide/delegated-admin.html). 
IAM Identity Center instance must always reside in the management account, they can be configured to delegate administration of IAM Identity Center to a member account in AWS Organizations, thereby extending the ability to manage IAM Identity Center from outside the management account. +The recommended pattern is to deploy in `DRY_RUN` mode first, observe, then flip to `ENFORCE`. -This parmeter enables support for environments following the best-practice of delegating the access via another AWS account. These parameters enable the integration with the [cft-eventbridge-rule.yaml](../cft-eventbridge-rule.yaml) template to deploy the Event Bridge Rule in the Management account. +```bash +sam deploy --region \ + --parameter-overrides \ + NotificationEmail= \ + Mode=DRY_RUN \ + PrincipalUserNameEmail= +``` -#### `PrincipalUserNameEmail` +Watch the `/cerberus` log group for `DRY_RUN: would remove ...` lines on the next `CreateAccountAssignment` event. Confirm the matches are correct. -[AWS Control Tower Documentation, Provision accounts with AWS Service Catalog Account Factory](https://docs.aws.amazon.com/controltower/latest/userguide/provision-as-end-user.html). When creating or updating an Account Factory enrolled account, the `SSOUserEmail` prompt can be a new email address, or the email address associated with an existing IAM Identity Center user. Whichever choen, this user will have administrative access to the account you're provisioning. +Then flip to `ENFORCE`: -This parameter enables removal of the default User assignment that will have administrative access. The pattern requires that a common email address be used when performing changes to accounts via AWS Control Tower Account Factory. Example `aws-control-tower@company.xyz`. +```bash +sam deploy --parameter-overrides Mode=ENFORCE [...other params...] 
+``` -Deploy the application with the following command: +To override the regex defaults: ```bash -sam deploy --region us-east-1 --parameter-overrides ManagementAccountId=012345678901 LogGroupName=/cerberus NotificationEmail=your-email@company.com +sam deploy --region \ + --parameter-overrides \ + NotificationEmail= \ + PermissionSetNamePattern='^AWS(?:OrganizationsFullAccess|ReadOnlyAccess|...)$' \ + PrincipalGroupNamePattern='^AWS(?:LogArchiveAdmins|ControlTowerAdmins|...)$' \ + PrincipalUserNameEmail='' ``` -To include RegEx patterns for permissions and principals, use: +### Operational kill switch ```bash -sam deploy --region us-east-1 --parameter-overrides ManagementAccountId=012345678901 LogGroupName=/cerberus PermissionSetNamePattern='^AWS(?:OrganizationsFullAccess|ReadOnlyAccess|ServiceCatalogEndUserAccess|ServiceCatalogAdminFullAccess|PowerUserAccess|AdministratorAccess)$' PrincipalNamePattern='^AWS(?:LogArchiveViewers|LogArchiveAdmins|ControlTowerAdmins|AccountFactory|AuditAccountAdmins|SecurityAuditors|ServiceCatalogAdmins|SecurityAuditPowerUsers)$' PrincipalUserNameEmail='devops+control-tower-account-factory@company.xyz' NotificationEmail=your-email@company.com +# Stop processing events without deleting the stack +sam deploy --parameter-overrides Mode=DISABLED [...other params...] + +# Resume +sam deploy --parameter-overrides Mode=ENFORCE [...other params...] ``` -## Testing +`DISABLED` sets the EventBridge rule's `State` to `DISABLED`. No events reach the state machine. -### Unit Tests +## Migration from the delegated-admin model -Run unit tests using the following commands: +Earlier versions of Cerberus deployed the state machine and Lambda in a delegated administrator account, with a separate `cft-eventbridge-rule.yaml` template forwarding events from the management account. That topology is no longer supported — see [Why this runs in the management account](#why-this-runs-in-the-aws-organization-management-account). + +Migration sequence: + +1. 
Deploy this stack to the management account in `Mode=DRY_RUN`. +2. Validate the new stack against a test `CreateAccountAssignment`. Confirm the Lambda logs `DRY_RUN: would remove ...` and no actual deletion occurs. +3. Delete the old delegated-admin stack (`sam delete --stack-name cerberus` from the delegated admin account profile). +4. Delete the old management-account forwarder stack (the one created from `cft-eventbridge-rule.yaml`). +5. Flip the new stack to `Mode=ENFORCE`. +6. Validate the end-to-end path against another test `CreateAccountAssignment`. Confirm the assignment is actually removed in the IAM Identity Center console — do not rely on the Lambda's reported result alone. + +The `DRY_RUN` overlap (steps 1–3) is what makes the cutover safe: both stacks are subscribed to the same event source, but neither modifies state during the overlap window. + +## Testing ```bash python3 -m venv venv source venv/bin/activate pip install -r tests/requirements.txt -python3 -m unittest discover -v +AWS_DEFAULT_REGION= python3 -m unittest discover -v ``` -## Cleanup +`AWS_DEFAULT_REGION` is required because `app.py` initialises a `boto3.client("sso-admin")` at import time. -To delete the deployed stack, use: +## Cleanup ```bash sam delete --stack-name "cerberus" ``` +This removes the Lambda, state machine, EventBridge rule, log group, alarms, SNS topic, and permission boundary policy. CloudTrail data event configuration (if any) is org-level and must be removed separately. + ## CloudWatch MCP Server The repo includes a pre-configured CloudWatch MCP server (`.mcp.json`) that gives Claude Code live read access to the deployed Cerberus stack — Step Functions execution history, Lambda logs, and CloudWatch metrics — without leaving the editor. ### AWS Profile Setup -The server expects a local AWS CLI profile named **`cerberus`** with read-only CloudWatch access. 
Create it using IAM Identity Center or a dedicated IAM user/role: +The server expects a local AWS CLI profile named **`cerberus`** with read-only CloudWatch access **in the management account** (where this stack is now deployed). Create it using IAM Identity Center or a dedicated IAM user/role: ```bash # Option A — SSO profile (recommended) @@ -152,11 +201,11 @@ aws configure sso --profile cerberus aws configure --profile cerberus ``` -Attach the AWS managed policy **`arn:aws:iam::aws:policy/CloudWatchLogsReadOnlyAccess`** to whichever principal the profile authenticates as. The MCP server only needs read access; do not grant write or broader permissions. +Attach the AWS managed policy `arn:aws:iam::aws:policy/CloudWatchLogsReadOnlyAccess` to whichever principal the profile authenticates as. The MCP server only needs read access; do not grant write or broader permissions. The server targets `ca-central-1` by default (set in `.mcp.json`). If your stack is in a different region, update `AWS_REGION` in `.mcp.json` accordingly. -### What It Gives You +### What it gives you Once the profile is configured, Claude Code can query the `/cerberus` log group directly to: diff --git a/cerberus/src/cerberus/app.py b/cerberus/src/cerberus/app.py index c6fb580..1d94fd3 100644 --- a/cerberus/src/cerberus/app.py +++ b/cerberus/src/cerberus/app.py @@ -5,6 +5,22 @@ logger = logging.getLogger() client = boto3.client("sso-admin") +cloudwatch = boto3.client("cloudwatch") + + +def _emit_metric(name: str) -> None: + """Emit a Cerberus operational metric (Deleted | Skipped | Failed). + + Observability must never block the deletion pipeline, so any failure here + is logged and swallowed. 
+ """ + try: + cloudwatch.put_metric_data( + Namespace="Cerberus", + MetricData=[{"MetricName": name, "Value": 1, "Unit": "Count"}], + ) + except Exception as e: + logger.warning("Failed to emit Cerberus.%s metric: %s", name, e) def lambda_handler(event, context): @@ -25,6 +41,26 @@ def lambda_handler(event, context): logger.debug("Lambda function invoked with event: %s", event) logger.debug("Lambda function context: %s", context) + # Mode is parsed and DISABLED is enforced before any event field access, so the + # kill switch works even on stripped-down or malformed payloads (defense-in-depth + # against direct invocation; EventBridge is the primary gate). Unknown values + # fail closed — anything other than ENFORCE or DRY_RUN is treated as DISABLED. + mode = os.environ.get("Mode", "ENFORCE").strip().upper() + if mode not in {"ENFORCE", "DRY_RUN", "DISABLED"}: + logger.warning( + "Unknown Mode value %r — failing closed (treating as DISABLED).", mode + ) + mode = "DISABLED" + + if mode == "DISABLED": + logger.info("Cerberus is in DISABLED mode; ignoring invocation.") + _emit_metric("Skipped") + return { + "result": "SUCCESS", + "message": "DISABLED: invocation ignored.", + "details": {"mode": "DISABLED"}, + } + instanceArn = event.get("DescribeInstance").get("InstanceArn") targetId = event.get("RequestParameters").get("targetId") targetType = event.get("RequestParameters").get("targetType", "AWS_ACCOUNT") @@ -48,6 +84,7 @@ def lambda_handler(event, context): ] ): logger.error("Missing required parameters in the event: {}".format(event)) + _emit_metric("Failed") return { "result": "FAILED", "message": "Missing required parameters in the event.", @@ -56,6 +93,7 @@ def lambda_handler(event, context): if principalType not in ["USER", "GROUP"]: logger.error("Invalid principal type: {}".format(principalType)) + _emit_metric("Failed") return { "result": "FAILED", "message": f"Invalid principal type: {principalType}. 
Expected 'USER' or 'GROUP'.", @@ -71,16 +109,27 @@ def lambda_handler(event, context): ) try: + permissionSetNamePattern = os.environ.get("PermissionSetNamePattern", "") + principalGroupNamePattern = os.environ.get("PrincipalGroupNamePattern", "") + principalUserNameEmail = ( + os.environ.get("PrincipalUserNameEmail", "").strip().lower() + ) + + if not permissionSetNamePattern or not principalGroupNamePattern: + logger.error( + "Required regex patterns missing from environment " + "(PermissionSetNamePattern and PrincipalGroupNamePattern must be set)." + ) + _emit_metric("Failed") + return { + "result": "FAILED", + "message": "Required regex patterns missing from environment.", + } - permissionSetNamePattern = os.environ.get("PermissionSetNamePattern") permissionSetNamePatternRegex = re.compile( permissionSetNamePattern, re.IGNORECASE ) - principalGroupNamePattern = os.environ.get("PrincipalGroupNamePattern") principalGroupNameRegex = re.compile(principalGroupNamePattern, re.IGNORECASE) - principalUserNameEmail = ( - os.environ.get("PrincipalUserNameEmail").strip().lower() - ) logger.info( "Using regex for principal name: {}".format(principalGroupNameRegex.pattern) @@ -102,6 +151,21 @@ def lambda_handler(event, context): re.match(principalGroupNameRegex, principalName) or principalUserNameEmail == principalName ): + if mode == "DRY_RUN": + logger.info( + "DRY_RUN: would remove Control Tower provisioned '%s' access for principal '%s' on permission set '%s' targeting account '%s'.", + principalType, + principalName, + permissionSetName, + targetId, + ) + _emit_metric("Skipped") + return { + "result": "SUCCESS", + "message": "DRY_RUN: deletion skipped.", + "details": {"mode": "DRY_RUN"}, + } + logger.info( "Removing Control Tower provisioned '{}' access for principal '{}'.".format( principalType, principalName @@ -135,12 +199,34 @@ def lambda_handler(event, context): logger.error( "Account assignment deletion failed at API: %s", failure_reason ) + 
_emit_metric("Failed") return { "result": "FAILED", "message": "Account assignment deletion failed.", "details": response, } + # AWS documents three valid statuses for DeleteAccountAssignment: + # IN_PROGRESS, SUCCEEDED, FAILED. Anything else (None, an unrecognized + # string, a missing AccountAssignmentDeletionStatus) means the API + # contract changed underneath us or the response is malformed — fail + # closed rather than emit Deleted on a state we can't reason about. + if status not in {"IN_PROGRESS", "SUCCEEDED"}: + logger.error( + "Unexpected deletion status from API: %r (full response: %s)", + status, + response, + ) + _emit_metric("Failed") + return { + "result": "FAILED", + "message": "Unexpected deletion status from API: {!r}.".format( + status + ), + "details": response, + } + + _emit_metric("Deleted") return { "result": "SUCCESS", "message": "Account assignment deletion request accepted (status={}).".format( @@ -154,6 +240,7 @@ def lambda_handler(event, context): principalName, principalType ) ) + _emit_metric("Skipped") return { "result": "SUCCESS", "message": "No action taken for principal: {}".format(principalName), @@ -161,6 +248,7 @@ def lambda_handler(event, context): except re.PatternError as e: logger.error("Invalid regex pattern: %s", e) + _emit_metric("Failed") return { "result": "FAILED", "message": "Invalid regex pattern.", @@ -170,6 +258,7 @@ def lambda_handler(event, context): except client.exceptions.ConflictException as e: logger.error("ConflictException occurred: %s", e) + _emit_metric("Failed") return { "result": "FAILED", "message": "Conflict occurred while deleting account assignment.", @@ -179,6 +268,7 @@ def lambda_handler(event, context): except client.exceptions.ResourceNotFoundException as e: logger.error("ResourceNotFoundException occurred: %s", e) + _emit_metric("Failed") return { "result": "FAILED", "message": "Resource not found while deleting account assignment.", @@ -188,6 +278,7 @@ def lambda_handler(event, context): 
except client.exceptions.AccessDeniedException as e: logger.error("AccessDeniedException occurred: %s", e) + _emit_metric("Failed") return { "result": "FAILED", "message": "Access denied while deleting account assignment.", @@ -197,6 +288,7 @@ def lambda_handler(event, context): except client.exceptions.ValidationException as e: logger.error("ValidationException occurred: %s", e) + _emit_metric("Failed") return { "result": "FAILED", "message": "Validation error occurred while deleting account assignment.", @@ -206,6 +298,7 @@ def lambda_handler(event, context): except Exception as e: logger.error("An error occurred: %s", e) + _emit_metric("Failed") return { "result": "FAILED", "message": "An error occurred while processing the request.", diff --git a/cerberus/statemachine/cerberus.asl.json b/cerberus/statemachine/cerberus.asl.json index 8c58425..5069c4b 100644 --- a/cerberus/statemachine/cerberus.asl.json +++ b/cerberus/statemachine/cerberus.asl.json @@ -38,7 +38,7 @@ "Type": "Pass", "Result": { "result": "SKIPPED", - "reason": "Target is the Management account; delegated admin cannot modify Management-account assignments." + "reason": "Target is the Management account; Cerberus skips it to prevent accidental self-lockout if a regex misfires against the management account's own admin assignments." }, "ResultPath": "$.SkipReason", "End": true @@ -51,7 +51,7 @@ "And": [ { "Variable": "$.EventName", - "StringMatches": "CreateAccountAssignment" + "StringEquals": "CreateAccountAssignment" }, { "Variable": "$.RequestParameters.permissionSetArn", @@ -73,6 +73,14 @@ "PermissionSetArn.$": "$.RequestParameters.permissionSetArn" }, "Resource": "arn:aws:states:::aws-sdk:ssoadmin:describePermissionSet", + "Retry": [ + { + "ErrorEquals": ["States.TaskFailed"], + "IntervalSeconds": 3, + "MaxAttempts": 3, + "BackoffRate": 2.0 + } + ], "ResultPath": "$.DescribePermissionSet", "Next": "Is Permission Set Name Returned?" 
}, @@ -96,6 +104,14 @@ "InstanceArn.$": "$.RequestParameters.instanceArn" }, "Resource": "arn:aws:states:::aws-sdk:ssoadmin:describeInstance", + "Retry": [ + { + "ErrorEquals": ["States.TaskFailed"], + "IntervalSeconds": 3, + "MaxAttempts": 3, + "BackoffRate": 2.0 + } + ], "Next": "Is Identity Store Id Returned?", "ResultPath": "$.DescribeInstance" }, @@ -118,13 +134,13 @@ "Choices": [ { "Variable": "$.RequestParameters.principalType", - "StringMatches": "USER", + "StringEquals": "USER", "Next": "DescribeUser" }, { "Next": "DescribeGroup", "Variable": "$.RequestParameters.principalType", - "StringMatches": "GROUP" + "StringEquals": "GROUP" } ], "Default": "Fail Unsupported Principal Type" @@ -136,6 +152,14 @@ "UserId.$": "$.RequestParameters.principalId" }, "Resource": "arn:aws:states:::aws-sdk:identitystore:describeUser", + "Retry": [ + { + "ErrorEquals": ["States.TaskFailed"], + "IntervalSeconds": 3, + "MaxAttempts": 3, + "BackoffRate": 2.0 + } + ], "Next": "Is User or Group Name Returned?", "ResultPath": "$.DescribeUser" }, @@ -146,7 +170,7 @@ "Next": "Cerberus Lambda Invoke", "Or": [ { - "Variable": "$.DescribeUser.DisplayName", + "Variable": "$.DescribeUser.UserName", "IsPresent": true }, { @@ -168,6 +192,14 @@ "GroupId.$": "$.RequestParameters.principalId" }, "Resource": "arn:aws:states:::aws-sdk:identitystore:describeGroup", + "Retry": [ + { + "ErrorEquals": ["States.TaskFailed"], + "IntervalSeconds": 3, + "MaxAttempts": 3, + "BackoffRate": 2.0 + } + ], "Next": "Is User or Group Name Returned?", "ResultPath": "$.DescribeGroup" }, @@ -176,10 +208,15 @@ "Resource": "${CerberusFunctionArn}", "Retry": [ { - "ErrorEquals": ["States.TaskFailed"], - "IntervalSeconds": 15, - "MaxAttempts": 5, - "BackoffRate": 1.5 + "ErrorEquals": [ + "Lambda.ServiceException", + "Lambda.AWSLambdaException", + "Lambda.SdkClientException", + "Lambda.TooManyRequestsException" + ], + "IntervalSeconds": 2, + "MaxAttempts": 3, + "BackoffRate": 2.0 } ], "InputPath": "$", diff --git 
a/cerberus/template.yaml b/cerberus/template.yaml index 74f691a..3c1d41c 100644 --- a/cerberus/template.yaml +++ b/cerberus/template.yaml @@ -4,17 +4,6 @@ Description: > Cerberus, a state machine that removes AWS Control Tower default permission set associations. Parameters: - EventBusName: - Type: String - Default: "cerberus-event-bus" - Description: "The name of the custom EventBridge event bus for Cerberus" - - ManagementAccountId: - Type: String - Description: "The Management AWS account ID that will send events to the Cerberus event bus" - AllowedPattern: "^[0-9]{12}$" - ConstraintDescription: "Must be a valid 12-digit AWS account ID" - LogGroupName: Type: String Default: "/cerberus" @@ -71,10 +60,22 @@ Parameters: NotificationEmail: Type: String - Description: "Email address to receive notifications when the Cerberus state machine execution fails" + Description: "Email address to receive notifications when the Cerberus state machine execution fails or another alarm fires" AllowedPattern: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$" ConstraintDescription: "Must be a valid email address" + Mode: + Type: String + Default: "ENFORCE" + AllowedValues: + - "ENFORCE" + - "DRY_RUN" + - "DISABLED" + Description: "ENFORCE: actually delete matching assignments. DRY_RUN: log what would be deleted, take no action. DISABLED: turn off the EventBridge rule entirely." + +Conditions: + IsDisabled: !Equals [!Ref Mode, "DISABLED"] + Resources: SfnToCerberusFunctionConnector: Type: AWS::Serverless::Connector @@ -86,16 +87,52 @@ Resources: Permissions: - Write + CerberusPermissionsBoundary: + Type: AWS::IAM::ManagedPolicy + Properties: + Description: "Permission boundary limiting Cerberus roles to the namespaces required for IAM Identity Center cleanup. Mgmt-account principals are exempt from SCPs; this is the IAM-layer guardrail." 
+ PolicyDocument: + Version: "2012-10-17" + Statement: + - Sid: AllowedNamespaces + Effect: Allow + Action: + - sso:DeleteAccountAssignment + - sso:DescribePermissionSet + - sso:DescribeInstance + - sso:ListPermissionSets + - sso:GetPermissionSet + - sso:DescribePermissionSetProvisioningStatus + - identitystore:DescribeUser + - identitystore:DescribeGroup + - logs:CreateLogStream + - logs:PutLogEvents + - logs:DescribeLogGroups + - logs:DescribeLogStreams + - cloudwatch:PutMetricData + Resource: "*" + # Boundary ceiling on lambda:InvokeFunction is scoped to Lambda functions + # in this stack. The actual grant comes from the SAM Serverless::Connector + # above, which produces a tightly-scoped policy on CerberusFunction.Arn. + # We can't !GetAtt the function ARN here (circular: function uses this boundary), + # so we constrain via the StackName-prefixed naming pattern that SAM generates + # by default: StackName-LogicalId-RandomSuffix. + - Sid: AllowInvokeStackLambdas + Effect: Allow + Action: lambda:InvokeFunction + Resource: + - !Sub "arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:${AWS::StackName}-*" + CerberusStateMachine: Type: AWS::Serverless::StateMachine DependsOn: - - CerberusEventBus - CerberusLogGroup Properties: DefinitionUri: statemachine/cerberus.asl.json DefinitionSubstitutions: CerberusFunctionArn: !GetAtt CerberusFunction.Arn - ManagementAccountId: !Ref ManagementAccountId + ManagementAccountId: !Ref AWS::AccountId + PermissionsBoundary: !Ref CerberusPermissionsBoundary Policies: - CloudWatchPutMetricPolicy: {} - Statement: @@ -114,20 +151,7 @@ Resources: CerberusEvent: Type: EventBridgeRule Properties: - Pattern: - source: - - "aws.sso" - detail-type: - - "AWS API Call via CloudTrail" - detail: - eventSource: - - "sso.amazonaws.com" - eventName: - - "CreateAccountAssignment" - CerberusEventManagementAccount: - Type: EventBridgeRule - Properties: - EventBusName: !Ref CerberusEventBus + State: !If [IsDisabled, "DISABLED", "ENABLED"] Pattern: source: - "aws.sso" @@ 
-149,11 +173,13 @@ Resources: LogGroup: !Ref CerberusLogGroup ApplicationLogLevel: INFO LogFormat: JSON + PermissionsBoundary: !Ref CerberusPermissionsBoundary Environment: Variables: PermissionSetNamePattern: !Ref PermissionSetNamePattern PrincipalGroupNamePattern: !Ref PrincipalGroupNamePattern PrincipalUserNameEmail: !Ref PrincipalUserNameEmail + Mode: !Ref Mode Policies: - CloudWatchPutMetricPolicy: {} - Statement: @@ -171,24 +197,10 @@ Resources: Resource: - !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:${LogGroupName}:*" - !GetAtt CerberusLogGroup.Arn - ReservedConcurrentExecutions: 5 + ReservedConcurrentExecutions: 2 MemorySize: 128 Timeout: 300 - CerberusEventBus: - Type: AWS::Events::EventBus - Properties: - Name: !Ref EventBusName - Description: "Event bus for IAM Identity Center events from Management Account (delegated admin)" - - CerberusEventBusPolicy: - Type: AWS::Events::EventBusPolicy - Properties: - StatementId: "CerberusEventBusPolicy" - Principal: !Ref ManagementAccountId - Action: "events:PutEvents" - EventBusName: !Ref CerberusEventBus - CerberusLogGroup: Type: AWS::Logs::LogGroup Properties: @@ -206,12 +218,64 @@ Resources: Period: 60 Statistic: Sum Threshold: 0 + TreatMissingData: notBreaching Dimensions: - Name: StateMachineArn Value: !Ref CerberusStateMachine AlarmActions: - !Ref CerberusFailureNotificationTopic + CerberusFunctionErrorsAlarm: + Type: AWS::CloudWatch::Alarm + Properties: + AlarmDescription: "Cerberus Lambda raised an unhandled error" + ComparisonOperator: GreaterThanThreshold + EvaluationPeriods: 1 + MetricName: Errors + Namespace: AWS/Lambda + Period: 60 + Statistic: Sum + Threshold: 0 + TreatMissingData: notBreaching + Dimensions: + - Name: FunctionName + Value: !Ref CerberusFunction + AlarmActions: + - !Ref CerberusFailureNotificationTopic + + CerberusFunctionThrottlesAlarm: + Type: AWS::CloudWatch::Alarm + Properties: + AlarmDescription: "Cerberus Lambda throttled (concurrency cap hit) — investigate 
event burst or compromise" + ComparisonOperator: GreaterThanThreshold + EvaluationPeriods: 1 + MetricName: Throttles + Namespace: AWS/Lambda + Period: 60 + Statistic: Sum + Threshold: 0 + TreatMissingData: notBreaching + Dimensions: + - Name: FunctionName + Value: !Ref CerberusFunction + AlarmActions: + - !Ref CerberusFailureNotificationTopic + + CerberusHighDeletionRateAlarm: + Type: AWS::CloudWatch::Alarm + Properties: + AlarmDescription: "Cerberus performed >10 actual deletions in 5 min — possible regex misfire or compromise" + ComparisonOperator: GreaterThanThreshold + EvaluationPeriods: 1 + MetricName: Deleted + Namespace: Cerberus + Period: 300 + Statistic: Sum + Threshold: 10 + TreatMissingData: notBreaching + AlarmActions: + - !Ref CerberusFailureNotificationTopic + CerberusFailureNotificationTopic: Type: AWS::SNS::Topic Properties: @@ -220,8 +284,3 @@ Resources: Subscription: - Endpoint: !Ref NotificationEmail Protocol: email - -Outputs: - EventBusArn: - Description: "The ARN of the custom EventBridge event bus for Cerberus" - Value: !GetAtt CerberusEventBus.Arn diff --git a/cerberus/tests/unit/test_cerberus.py b/cerberus/tests/unit/test_cerberus.py index b603a2d..a8b0936 100644 --- a/cerberus/tests/unit/test_cerberus.py +++ b/cerberus/tests/unit/test_cerberus.py @@ -6,9 +6,18 @@ class TestLambdaHandler(unittest.TestCase): + def tearDown(self): + # Mode is set by individual tests when they need DRY_RUN; remove it + # afterwards so it doesn't leak into the next test (which expects ENFORCE + # behavior, the os.environ.get default). 
+ os.environ.pop("Mode", None) + + @patch("cerberus.src.cerberus.app.cloudwatch", new_callable=MagicMock) @patch("cerberus.src.cerberus.app.logger") @patch("cerberus.src.cerberus.app.client", new_callable=MagicMock) - def test_lambda_handler_successful_deletion(self, mock_client, mock_logger): + def test_lambda_handler_successful_deletion( + self, mock_client, mock_logger, mock_cloudwatch + ): event = { "DescribeInstance": { "InstanceArn": "arn:aws:sso:::instance/sso-instance-id" @@ -37,10 +46,17 @@ def test_lambda_handler_successful_deletion(self, mock_client, mock_logger): self.assertEqual(result["result"], "SUCCESS") self.assertIn("SUCCEEDED", result["message"]) self.assertIn("AccountAssignmentDeletionStatus", result["details"]) + mock_cloudwatch.put_metric_data.assert_called_once_with( + Namespace="Cerberus", + MetricData=[{"MetricName": "Deleted", "Value": 1, "Unit": "Count"}], + ) + @patch("cerberus.src.cerberus.app.cloudwatch", new_callable=MagicMock) @patch("cerberus.src.cerberus.app.logger") @patch("cerberus.src.cerberus.app.client", new_callable=MagicMock) - def test_lambda_handler_no_action_taken(self, mock_client, mock_logger): + def test_lambda_handler_no_action_taken( + self, mock_client, mock_logger, mock_cloudwatch + ): event = { "DescribeInstance": { "InstanceArn": "arn:aws:sso:::instance/sso-instance-id" @@ -68,10 +84,17 @@ def test_lambda_handler_no_action_taken(self, mock_client, mock_logger): result = lambda_handler(event, None) self.assertEqual(result["result"], "SUCCESS") self.assertIn("No action taken for principal", result["message"]) + mock_cloudwatch.put_metric_data.assert_called_once_with( + Namespace="Cerberus", + MetricData=[{"MetricName": "Skipped", "Value": 1, "Unit": "Count"}], + ) + @patch("cerberus.src.cerberus.app.cloudwatch", new_callable=MagicMock) @patch("cerberus.src.cerberus.app.logger") @patch("cerberus.src.cerberus.app.client", new_callable=MagicMock) - def test_lambda_handler_regex_pattern_error(self, mock_client, 
mock_logger): + def test_lambda_handler_regex_pattern_error( + self, mock_client, mock_logger, mock_cloudwatch + ): event = { "DescribeInstance": { "InstanceArn": "arn:aws:sso:::instance/sso-instance-id" @@ -100,9 +123,12 @@ def test_lambda_handler_regex_pattern_error(self, mock_client, mock_logger): self.assertEqual(result["result"], "FAILED") self.assertIn("Invalid regex pattern", result["message"]) + @patch("cerberus.src.cerberus.app.cloudwatch", new_callable=MagicMock) @patch("cerberus.src.cerberus.app.logger") @patch("cerberus.src.cerberus.app.client", new_callable=MagicMock) - def test_lambda_handler_in_progress_status(self, mock_client, mock_logger): + def test_lambda_handler_in_progress_status( + self, mock_client, mock_logger, mock_cloudwatch + ): event = { "DescribeInstance": { "InstanceArn": "arn:aws:sso:::instance/sso-instance-id" @@ -136,9 +162,12 @@ def test_lambda_handler_in_progress_status(self, mock_client, mock_logger): self.assertIn("AccountAssignmentDeletionStatus", result["details"]) mock_client.delete_account_assignment.assert_called_once() + @patch("cerberus.src.cerberus.app.cloudwatch", new_callable=MagicMock) @patch("cerberus.src.cerberus.app.logger") @patch("cerberus.src.cerberus.app.client", new_callable=MagicMock) - def test_lambda_handler_status_failed(self, mock_client, mock_logger): + def test_lambda_handler_status_failed( + self, mock_client, mock_logger, mock_cloudwatch + ): event = { "DescribeInstance": { "InstanceArn": "arn:aws:sso:::instance/sso-instance-id" @@ -172,9 +201,89 @@ def test_lambda_handler_status_failed(self, mock_client, mock_logger): self.assertIn("Account assignment deletion failed", result["message"]) self.assertIn("AccountAssignmentDeletionStatus", result["details"]) + @patch("cerberus.src.cerberus.app.cloudwatch", new_callable=MagicMock) @patch("cerberus.src.cerberus.app.logger") @patch("cerberus.src.cerberus.app.client", new_callable=MagicMock) - def test_lambda_handler_access_denied(self, mock_client, 
mock_logger): + def test_lambda_handler_unknown_status_fails_closed( + self, mock_client, mock_logger, mock_cloudwatch + ): + event = { + "DescribeInstance": { + "InstanceArn": "arn:aws:sso:::instance/sso-instance-id" + }, + "RequestParameters": { + "targetId": "target-id", + "targetType": "AWS_ACCOUNT", + "principalType": "USER", + "principalId": "user-id", + }, + "DescribePermissionSet": { + "PermissionSet": { + "PermissionSetArn": "arn:aws:sso:::permissionSet/sso-instance-id/permission-set-id", + "Name": "MatchingPermissionSetName", + } + }, + "DescribeUser": {"UserName": "matchinguser@example.com"}, + } + os.environ["PermissionSetNamePattern"] = "^MatchingPermissionSetName$" + os.environ["PrincipalGroupNamePattern"] = "^MatchingGroupName$" + os.environ["PrincipalUserNameEmail"] = "matchinguser@example.com" + # AWS returns a status outside the documented {IN_PROGRESS, SUCCEEDED, FAILED} set — + # could mean API contract drift or a malformed response. Must fail closed, not emit Deleted. 
+ mock_client.delete_account_assignment.return_value = { + "AccountAssignmentDeletionStatus": {"Status": "UNKNOWN_NEW_STATUS"} + } + result = lambda_handler(event, None) + self.assertEqual(result["result"], "FAILED") + self.assertIn("Unexpected deletion status", result["message"]) + mock_cloudwatch.put_metric_data.assert_called_once_with( + Namespace="Cerberus", + MetricData=[{"MetricName": "Failed", "Value": 1, "Unit": "Count"}], + ) + + @patch("cerberus.src.cerberus.app.cloudwatch", new_callable=MagicMock) + @patch("cerberus.src.cerberus.app.logger") + @patch("cerberus.src.cerberus.app.client", new_callable=MagicMock) + def test_lambda_handler_missing_status_fails_closed( + self, mock_client, mock_logger, mock_cloudwatch + ): + event = { + "DescribeInstance": { + "InstanceArn": "arn:aws:sso:::instance/sso-instance-id" + }, + "RequestParameters": { + "targetId": "target-id", + "targetType": "AWS_ACCOUNT", + "principalType": "USER", + "principalId": "user-id", + }, + "DescribePermissionSet": { + "PermissionSet": { + "PermissionSetArn": "arn:aws:sso:::permissionSet/sso-instance-id/permission-set-id", + "Name": "MatchingPermissionSetName", + } + }, + "DescribeUser": {"UserName": "matchinguser@example.com"}, + } + os.environ["PermissionSetNamePattern"] = "^MatchingPermissionSetName$" + os.environ["PrincipalGroupNamePattern"] = "^MatchingGroupName$" + os.environ["PrincipalUserNameEmail"] = "matchinguser@example.com" + # No AccountAssignmentDeletionStatus at all — Status resolves to None. 
+ mock_client.delete_account_assignment.return_value = {} + result = lambda_handler(event, None) + self.assertEqual(result["result"], "FAILED") + self.assertIn("Unexpected deletion status", result["message"]) + mock_cloudwatch.put_metric_data.assert_called_once_with( + Namespace="Cerberus", + MetricData=[{"MetricName": "Failed", "Value": 1, "Unit": "Count"}], + ) + + @patch("cerberus.src.cerberus.app.cloudwatch", new_callable=MagicMock) + @patch("cerberus.src.cerberus.app.logger") + @patch("cerberus.src.cerberus.app.client", new_callable=MagicMock) + def test_lambda_handler_access_denied( + self, mock_client, mock_logger, mock_cloudwatch + ): event = { "DescribeInstance": { "InstanceArn": "arn:aws:sso:::instance/sso-instance-id" @@ -216,3 +325,150 @@ def test_lambda_handler_access_denied(self, mock_client, mock_logger): self.assertEqual(result["result"], "FAILED") self.assertEqual(result["errorName"], "AccessDeniedException") self.assertIn("Access denied", result["message"]) + mock_cloudwatch.put_metric_data.assert_called_once_with( + Namespace="Cerberus", + MetricData=[{"MetricName": "Failed", "Value": 1, "Unit": "Count"}], + ) + + @patch("cerberus.src.cerberus.app.cloudwatch", new_callable=MagicMock) + @patch("cerberus.src.cerberus.app.logger") + @patch("cerberus.src.cerberus.app.client", new_callable=MagicMock) + def test_lambda_handler_disabled_mode( + self, mock_client, mock_logger, mock_cloudwatch + ): + event = { + "DescribeInstance": { + "InstanceArn": "arn:aws:sso:::instance/sso-instance-id" + }, + "RequestParameters": { + "targetId": "target-id", + "targetType": "AWS_ACCOUNT", + "principalType": "USER", + "principalId": "user-id", + }, + "DescribePermissionSet": { + "PermissionSet": { + "PermissionSetArn": "arn:aws:sso:::permissionSet/sso-instance-id/permission-set-id", + "Name": "MatchingPermissionSetName", + } + }, + "DescribeUser": {"UserName": "matchinguser@example.com"}, + } + os.environ["PermissionSetNamePattern"] = "^MatchingPermissionSetName$" + 
os.environ["PrincipalGroupNamePattern"] = "^MatchingGroupName$" + os.environ["PrincipalUserNameEmail"] = "matchinguser@example.com" + os.environ["Mode"] = "DISABLED" + + result = lambda_handler(event, None) + self.assertEqual(result["result"], "SUCCESS") + self.assertIn("DISABLED", result["message"]) + self.assertEqual(result["details"], {"mode": "DISABLED"}) + mock_client.delete_account_assignment.assert_not_called() + + @patch("cerberus.src.cerberus.app.cloudwatch", new_callable=MagicMock) + @patch("cerberus.src.cerberus.app.logger") + @patch("cerberus.src.cerberus.app.client", new_callable=MagicMock) + def test_lambda_handler_disabled_mode_short_circuits_before_event_parsing( + self, mock_client, mock_logger, mock_cloudwatch + ): + # Defense-in-depth: DISABLED must be honoured on any payload, including + # stripped-down direct invocations used to test the kill switch. The check + # runs before any event field destructuring, so an empty event is fine. + os.environ["Mode"] = "DISABLED" + + result = lambda_handler({}, None) + self.assertEqual(result["result"], "SUCCESS") + self.assertIn("DISABLED", result["message"]) + self.assertEqual(result["details"], {"mode": "DISABLED"}) + mock_client.delete_account_assignment.assert_not_called() + + @patch("cerberus.src.cerberus.app.cloudwatch", new_callable=MagicMock) + @patch("cerberus.src.cerberus.app.logger") + @patch("cerberus.src.cerberus.app.client", new_callable=MagicMock) + def test_lambda_handler_unknown_mode_fails_closed( + self, mock_client, mock_logger, mock_cloudwatch + ): + # CloudFormation AllowedValues blocks typos at deploy time, but a direct + # env-var override (or tooling bug) could land an unknown value. Must be + # treated as DISABLED, not silently fall through to ENFORCE. 
+ os.environ["Mode"] = "ENORCE" + + result = lambda_handler({}, None) + self.assertEqual(result["result"], "SUCCESS") + self.assertIn("DISABLED", result["message"]) + self.assertEqual(result["details"], {"mode": "DISABLED"}) + mock_client.delete_account_assignment.assert_not_called() + + @patch("cerberus.src.cerberus.app.cloudwatch", new_callable=MagicMock) + @patch("cerberus.src.cerberus.app.logger") + @patch("cerberus.src.cerberus.app.client", new_callable=MagicMock) + def test_lambda_handler_dry_run_mode( + self, mock_client, mock_logger, mock_cloudwatch + ): + event = { + "DescribeInstance": { + "InstanceArn": "arn:aws:sso:::instance/sso-instance-id" + }, + "RequestParameters": { + "targetId": "target-id", + "targetType": "AWS_ACCOUNT", + "principalType": "USER", + "principalId": "user-id", + }, + "DescribePermissionSet": { + "PermissionSet": { + "PermissionSetArn": "arn:aws:sso:::permissionSet/sso-instance-id/permission-set-id", + "Name": "MatchingPermissionSetName", + } + }, + "DescribeUser": {"UserName": "matchinguser@example.com"}, + } + os.environ["PermissionSetNamePattern"] = "^MatchingPermissionSetName$" + os.environ["PrincipalGroupNamePattern"] = "^MatchingGroupName$" + os.environ["PrincipalUserNameEmail"] = "matchinguser@example.com" + os.environ["Mode"] = "DRY_RUN" + + result = lambda_handler(event, None) + self.assertEqual(result["result"], "SUCCESS") + self.assertIn("DRY_RUN", result["message"]) + self.assertEqual(result["details"], {"mode": "DRY_RUN"}) + mock_client.delete_account_assignment.assert_not_called() + + @patch("cerberus.src.cerberus.app.cloudwatch", new_callable=MagicMock) + @patch("cerberus.src.cerberus.app.logger") + @patch("cerberus.src.cerberus.app.client", new_callable=MagicMock) + def test_lambda_handler_group_principal_match( + self, mock_client, mock_logger, mock_cloudwatch + ): + event = { + "DescribeInstance": { + "InstanceArn": "arn:aws:sso:::instance/sso-instance-id" + }, + "RequestParameters": { + "targetId": "target-id", + 
"targetType": "AWS_ACCOUNT", + "principalType": "GROUP", + "principalId": "group-id", + }, + "DescribePermissionSet": { + "PermissionSet": { + "PermissionSetArn": "arn:aws:sso:::permissionSet/sso-instance-id/permission-set-id", + "Name": "MatchingPermissionSetName", + } + }, + "DescribeGroup": {"DisplayName": "MatchingGroupName"}, + } + os.environ["PermissionSetNamePattern"] = "^MatchingPermissionSetName$" + os.environ["PrincipalGroupNamePattern"] = "^MatchingGroupName$" + os.environ["PrincipalUserNameEmail"] = "" + mock_client.delete_account_assignment.return_value = { + "AccountAssignmentDeletionStatus": { + "Status": "SUCCEEDED", + "RequestId": "22222222-3333-4444-5555-666666666666", + } + } + result = lambda_handler(event, None) + self.assertEqual(result["result"], "SUCCESS") + self.assertIn("SUCCEEDED", result["message"]) + self.assertIn("AccountAssignmentDeletionStatus", result["details"]) + mock_client.delete_account_assignment.assert_called_once() diff --git a/cft-eventbridge-rule.yaml b/cft-eventbridge-rule.yaml deleted file mode 100644 index 1d99683..0000000 --- a/cft-eventbridge-rule.yaml +++ /dev/null @@ -1,62 +0,0 @@ -AWSTemplateFormatVersion: "2010-09-09" -Description: CloudFormation template for an EventBridge rule triggered by AWS SSO CreateAccountAssignment events. - -Parameters: - EventBridgeRuleName: - Type: String - Description: The name of the EventBridge rule. - Default: Cerberus - - CerberusEventBusArn: - Type: String - Description: The ARN of the Cerberus EventBridge event bus in the target account. - AllowedPattern: "^arn:aws:events:[a-z0-9-]+:\\d{12}:event-bus/[a-zA-Z0-9-_]+$" - ConstraintDescription: Must be a valid EventBridge event bus ARN. 
- -Resources: - EventBridgeRule: - Type: AWS::Events::Rule - Properties: - Name: !Ref EventBridgeRuleName - EventPattern: - source: - - "aws.sso" - detail-type: - - "AWS API Call via CloudTrail" - detail: - eventSource: - - "sso.amazonaws.com" - eventName: - - "CreateAccountAssignment" - State: ENABLED - Targets: - - Arn: !Ref CerberusEventBusArn - Id: "EventBusTarget" - RoleArn: !GetAtt EventBridgeTargetRole.Arn - - EventBridgeTargetRole: - Type: AWS::IAM::Role - Properties: - AssumeRolePolicyDocument: - Version: "2012-10-17" - Statement: - - Effect: Allow - Principal: - Service: "events.amazonaws.com" - Action: "sts:AssumeRole" - Policies: - - PolicyName: "EventBridgeTargetPolicy" - PolicyDocument: - Version: "2012-10-17" - Statement: - - Effect: Allow - Action: "events:PutEvents" - Resource: !Ref CerberusEventBusArn - -Outputs: - EventBridgeRuleArn: - Description: The ARN of the EventBridge rule. - Value: !GetAtt EventBridgeRule.Arn - EventBridgeTargetRoleArn: - Description: The ARN of the IAM role for the EventBridge rule. - Value: !GetAtt EventBridgeTargetRole.Arn
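
Reviewer note: a minimal sketch of how the `Retry` blocks this diff adds to `cerberus/statemachine/cerberus.asl.json` translate into wait times. It assumes the documented Step Functions semantics: `IntervalSeconds` is the wait before the first retry, each later wait is multiplied by `BackoffRate`, and `MaxAttempts` counts retries rather than total tries. The helper name `retry_delays` is hypothetical and not part of the repository. Jitter and `MaxDelaySeconds` are ignored because the diff sets neither.

```python
def retry_delays(interval_seconds: float, max_attempts: int, backoff_rate: float) -> list[float]:
    """Return the wait (in seconds) before each retry attempt.

    Hypothetical helper for illustration only; it models the Retry fields
    used in this diff and nothing else (no jitter, no MaxDelaySeconds).
    """
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


# SDK integration tasks in this diff: IntervalSeconds=3, MaxAttempts=3, BackoffRate=2.0
sdk_delays = retry_delays(3, 3, 2.0)

# Lambda invoke task after this diff: IntervalSeconds=2, MaxAttempts=3, BackoffRate=2.0
new_lambda_delays = retry_delays(2, 3, 2.0)

# Lambda invoke task before this diff: IntervalSeconds=15, MaxAttempts=5, BackoffRate=1.5
old_lambda_delays = retry_delays(15, 5, 1.5)

print("SDK tasks:", sdk_delays, "total", sum(sdk_delays))
print("Lambda (new):", new_lambda_delays, "total", sum(new_lambda_delays))
print("Lambda (old):", old_lambda_delays, "total", sum(old_lambda_delays))
```

Under these assumptions the new Lambda retry policy waits 14 seconds in total across its retries, compared with roughly 198 seconds for the previous 15s / 5 attempts / 1.5× policy, and each SDK integration task waits 21 seconds in total before the execution fails.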