feat: containerization, Terraform (AWS), and CD pipeline by div0rce · Pull Request #14 · div0rce/sentinel

div0rce · 2026-05-29T06:29:07Z

Milestone

M10 — Containerization, Terraform (AWS), and CD pipeline

Summary

Production Dockerfiles for backend (uvicorn + structlog + request-id middleware,
non-root, multi-stage) and frontend (nginx serving the Vite SPA, envsubst-driven
backend URL); Terraform under infra/ provisioning a cost-minimal us-east-1
demo stack (VPC, public subnets, ECR, RDS Postgres 16 with pgvector, ECS Fargate
behind an ALB, SSM Parameter Store secrets, GitHub Actions OIDC role); a
manual-dispatch CD workflow that builds + pushes images and force-redeploys ECS;
and infra/README.md documenting the cost posture, security invariants, and
the apply/destroy recipe.

Hard constraint honoured: no terraform apply was run, no AWS resources
were created, no costs incurred. The PR ships infra-as-code only. The user (the
operator) runs terraform plan and apply against their own AWS account when
ready, captures demo screenshots, and terraform destroy immediately after.

Definition of Done

All DoD items from MILESTONES.md addressed (one is operator-action, see below)
make check passes (ruff + ruff-format + mypy strict + 195 backend pytest + 7 frontend Vitest)
Tests added/updated for new logic — 8 new request-id-middleware tests
PROGRESS.md updated: M10 marked complete on branch; M9 row updated to ☑ merged with a link to PR #12 (feat: evaluation harness and benchmark results); backlog issue #13 (eval: record real-provider benchmark numbers, M9 follow-up) noted
No secrets committed; sample data is synthetic; SSM SecureString placeholders use lifecycle.ignore_changes = [value]
Guardrails intact (citation-or-refuse, PII redaction, confidence gating, audit logging) — unchanged

M10 DoD verification (from MILESTONES.md)

terraform plan is clean; apply provisions the stack. Pending operator action. The user explicitly forbade running terraform plan or apply in this session. The local environment also has no terraform binary, so fmt and validate were not run locally either. Those checks are wired into CI (a new terraform job that needs no AWS credentials and runs fmt -check, init -backend=false, and validate), so the regression surface is covered without any AWS calls.
CD workflow builds and deploys on manual dispatch. .github/workflows/cd.yml is workflow_dispatch-only — no push: or pull_request: triggers, by design. Steps: assume the OIDC role (AWS_ROLE_ARN secret), ECR login, build + push backend (context = repo root, -f backend/Dockerfile) and/or frontend (context = ./frontend) tagged with the git SHA + latest, force ECS service redeploy. Choice input lets the operator deploy backend / frontend / both per dispatch.
App is reachable at a URL — infra-as-code complete. The Terraform output alb_dns_name is the URL once terraform apply succeeds. Capturing screenshots and the demo flow are M11 deliverables; teardown via terraform destroy is the operator's immediate next step.

Locked design (per user constraints)

Code only this session. No terraform apply. No AWS API calls. No costs. No terraform plan unless AWS credentials are configured and the user explicitly approves (the user did not — so plan didn't run).
Cost posture: public-subnet / no-NAT. Avoids the ~$32/month idle NAT Gateway. ECS tasks live in public subnets with assign_public_ip = true so they can reach ECR / Anthropic / OpenAI / CloudWatch. Security groups are what enforce "internal-only" for RDS (next bullet).
Hard invariant: RDS not publicly accessible. aws_db_instance.publicly_accessible = false and the rds security group ingress is keyed only to the backend task SG. Even though RDS lives in the same public subnets as the tasks (no private subnets in the no-NAT design), the SG prevents internet reach.
Trigger gate: workflow_dispatch is the cost-control gate for the CD workflow. Nothing else (no push:, no pull_request:).
Region: us-east-1. Pinned via var.region default.
Secrets: SSM Parameter Store SecureString. API keys are placeholders with lifecycle.ignore_changes = [value] so an out-of-band aws ssm put-parameter --overwrite is not clobbered on re-apply. CI identity uses GitHub OIDC, not long-lived access keys.
Demo-only. infra/README.md documents cost (~$45/month idle floor), every tradeoff (single-AZ (no Multi-AZ), no auto-scaling, no remote state, plain HTTP on the ALB), and an unambiguous "terraform destroy immediately after screenshots" instruction.
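
The trigger gate above could be protected by a small guard test. The following is a minimal sketch, not part of this PR: the function names and the sample workflow text are hypothetical, and the parser is deliberately text-based and only handles the common shapes of a top-level `on:` key.

```python
import re


def workflow_triggers(workflow_text: str) -> set[str]:
    """Return the event names under the top-level `on:` key of a workflow file.

    Text-based on purpose: some YAML loaders read a bare `on` key as the
    boolean True, which makes a parsed lookup easy to get wrong.
    """
    lines = workflow_text.splitlines()
    for index, line in enumerate(lines):
        match = re.match(r"^on:\s*(.*)$", line)
        if not match:
            continue
        inline = match.group(1).split("#", 1)[0].strip()
        if inline:
            # Inline forms: `on: push` or `on: [push, pull_request]`.
            return {n.strip() for n in inline.strip("[]").split(",") if n.strip()}
        triggers: set[str] = set()
        child_indent = None
        for child in lines[index + 1:]:
            if not child.strip() or child.lstrip().startswith("#"):
                continue
            indent = len(child) - len(child.lstrip())
            if indent == 0:
                break  # the next top-level key ends the `on:` block
            if child_indent is None:
                child_indent = indent
            if indent == child_indent:
                # Handles both mapping entries (`push:`) and list entries (`- push`).
                triggers.add(child.strip().lstrip("- ").split(":", 1)[0].strip())
        return triggers
    return set()


def is_manual_only(workflow_text: str) -> bool:
    """True when workflow_dispatch is the one and only trigger."""
    return workflow_triggers(workflow_text) == {"workflow_dispatch"}


# Hypothetical excerpt in the shape the PR describes for cd.yml.
SAMPLE_CD = """\
name: cd
on:
  workflow_dispatch:
    inputs:
      target:
        type: choice
jobs:
  deploy:
    runs-on: ubuntu-latest
"""

print(is_manual_only(SAMPLE_CD))
```

Run as a unit test against the real .github/workflows/cd.yml, a check like this would fail the build if someone later added a push: or pull_request: trigger.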

What ships

Backend

backend/app/observability.py — configure_logging() wires structlog for
JSON output (SENTINEL_LOG_FORMAT=console for local dev). RequestIdMiddleware
assigns a stable id per request (sanitised inbound X-Request-Id allowlist,
generated uuid4().hex otherwise), binds it to the structlog contextvars
for the request scope, surfaces it on the response.
backend/app/main.py — calls configure_logging() at import, adds the
middleware before the routers.
backend/Dockerfile — multi-stage uv build → slim runtime. Non-root
sentinel user (uid 1000), HEALTHCHECK on /health, honours $PORT,
SENTINEL_LOG_FORMAT=json default.
.dockerignore (repo root) — keeps the backend build context lean and
free of secrets / tests / frontend / eval / scripts / IDE state.
structlog>=24.4 added to runtime deps (resolved 25.5.0).
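
The request-id rule can be illustrated without any web framework. This is a sketch under stated assumptions, not the code in backend/app/observability.py: the function name `resolve_request_id` is hypothetical, and the allowlist (ASCII letters, digits, `-` and `_`, at most 64 characters) follows the commit notes for this PR.

```python
import re
import uuid
from typing import Optional

# Assumed allowlist: ASCII letters, digits, "-" and "_", 1 to 64 characters.
_SAFE_REQUEST_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")


def resolve_request_id(inbound: Optional[str]) -> str:
    """Echo a safe caller-supplied id, otherwise mint a fresh uuid4 hex."""
    if inbound is not None and _SAFE_REQUEST_ID.fullmatch(inbound):
        return inbound
    # Anything missing, empty, too long, or containing other characters is
    # replaced so caller-controlled bytes never reach the log pipeline.
    return uuid.uuid4().hex


print(resolve_request_id("trace-abc_123"))  # echoed unchanged
print(len(resolve_request_id("has space")))  # replaced by a 32-character hex id
```

In the middleware, the returned value would be the one bound to the structlog contextvars and written to the X-Request-Id response header.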

Frontend

frontend/Dockerfile — node:20-alpine builder runs npm ci && npm run build (which runs tsc -b first, so any type error fails the image build).
nginx:1.27-alpine runtime serves /usr/share/nginx/html and reverse-proxies
same-origin API paths to ${BACKEND_URL} (envsubst-substituted on container
start by the official entrypoint).
frontend/nginx.conf.template — SPA try_files fallback for React Router
routes; reverse-proxy for /query, /extract, /review, /dashboard,
/health with X-Forwarded-* headers and request-id pass-through; 1y cache
on hashed Vite assets.
frontend/.dockerignore — keeps node_modules, dist, tests, .git, IDE
state, and *.tsbuildinfo out of the build context.
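
The envsubst step matters because nginx configs are full of their own `$variables`. The sketch below models the behaviour in Python with `string.Template.safe_substitute`, on the assumption that the official image's entrypoint only substitutes variables that are defined in the container environment. The template excerpt is hypothetical, not the contents of frontend/nginx.conf.template.

```python
from string import Template

# Hypothetical excerpt in the style of an nginx template.
NGINX_TEMPLATE = """\
location /query {
    proxy_pass ${BACKEND_URL};
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
location / {
    try_files $uri $uri/ /index.html;
}
"""


def render(template_text: str, env: dict) -> str:
    """Replace only the placeholders named in env; leave every other $name alone."""
    return Template(template_text).safe_substitute(env)


rendered = render(NGINX_TEMPLATE, {"BACKEND_URL": "http://backend:8000"})
print(rendered)
```

Only `${BACKEND_URL}` changes in the rendered output; `$uri` and `$proxy_add_x_forwarded_for` pass through for nginx to evaluate at request time, which is why one image can be pointed at different backends per environment.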

Terraform (`infra/`)

infra/
├── versions.tf       terraform >= 1.6, aws ~> 5.70
├── variables.tf      project_name, region, db creds (sensitive, >=16 char), image tags, github_repository
├── main.tf           wires the modules
├── outputs.tf        alb_dns_name, ecr URLs, ecs names, rds_endpoint, ci_role_arn
├── README.md         cost & security posture, apply/destroy recipe, validation steps
└── modules/
    ├── network/      VPC + 2 public subnets + IGW + public RT + 4 SGs
    ├── ecr/          backend + frontend repos with image-scan + lifecycle
    ├── secrets/      SSM SecureString for API keys + composed DATABASE_URL
    ├── rds/          Postgres 16.4 db.t4g.micro single-AZ, publicly_accessible=false
    ├── ecs/          cluster + ALB + listener + target groups + task defs + services + IAM + log groups
    └── ci_oidc/      GitHub Actions OIDC provider + role scoped to ECR push + ECS update

Reachability graph encoded in the four security groups owned by
modules/network/:

internet ──→ alb_sg          (80, 443)
alb_sg   ──→ frontend_sg     (80)         ALB → nginx
alb_sg   ──→ backend_sg      (8000)       ALB → FastAPI (path-prefix rule)
backend_sg ──→ rds_sg        (5432)       FastAPI → Postgres

OIDC trust policy is scoped to one repo via the repo:OWNER/NAME:* subject
claim. CI permissions: ecr:GetAuthorizationToken account-wide, push to the
two project ECR repos, ecs:UpdateService on the two project services.
iam:PassRole is restricted to the project task roles, only to
ecs-tasks.amazonaws.com.
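
To make the subject-claim scoping concrete, here is a sketch of the kind of trust policy such a role carries, built as a Python dictionary. The account id is a placeholder, the `aud` condition is an assumption (it is the usual companion to the `sub` condition but is not stated in this PR), and `fnmatchcase` only approximates IAM's StringLike wildcard matching.

```python
import json
from fnmatch import fnmatchcase

OIDC_HOST = "token.actions.githubusercontent.com"


def trust_policy(account_id: str, repository: str) -> dict:
    """Trust policy letting GitHub Actions runs from one repo assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{OIDC_HOST}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    # Assumed audience check; not confirmed by the PR text.
                    "StringEquals": {f"{OIDC_HOST}:aud": "sts.amazonaws.com"},
                    "StringLike": {f"{OIDC_HOST}:sub": f"repo:{repository}:*"},
                },
            }
        ],
    }


def subject_allowed(policy: dict, subject: str) -> bool:
    """Approximate the StringLike check on the token's `sub` claim."""
    condition = policy["Statement"][0]["Condition"]["StringLike"]
    return fnmatchcase(subject, condition[f"{OIDC_HOST}:sub"])


# 123456789012 is a placeholder account id.
policy = trust_policy("123456789012", "div0rce/sentinel")
print(json.dumps(policy, indent=2))
print(subject_allowed(policy, "repo:div0rce/sentinel:ref:refs/heads/main"))
```

The trailing `:*` admits any branch, tag, or environment in the one repository; a tighter pattern such as `repo:OWNER/NAME:ref:refs/heads/main` would restrict the role to runs from main.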

CD (`.github/workflows/cd.yml`)

workflow_dispatch only. Choice input: backend / frontend / both.
OIDC via aws-actions/configure-aws-credentials@v4 (role ARN from
secrets.AWS_ROLE_ARN, written from terraform output ci_role_arn).
Builds with --platform linux/amd64, tags with the git SHA + latest,
pushes to ECR.
aws ecs update-service --force-new-deployment for each requested service.
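
The per-dispatch behaviour can be summarised as a list of commands derived from the choice input. The sketch below is a model of the workflow, not a copy of cd.yml: the function name, the `sentinel-backend` / `sentinel-frontend` repository and service names, and the cluster name are hypothetical. The build contexts, platform flag, and tags follow the description above.

```python
# Build arguments per image, as described in this PR: the backend builds from
# the repo root with -f backend/Dockerfile, the frontend from ./frontend.
BUILD_CONTEXTS = {
    "backend": ["-f", "backend/Dockerfile", "."],
    "frontend": ["./frontend"],
}


def deploy_commands(target: str, registry: str, sha: str, cluster: str) -> list[list[str]]:
    """Return the argv lists one dispatch would run for the chosen target."""
    if target not in ("backend", "frontend", "both"):
        raise ValueError(f"unknown target: {target!r}")
    names = ["backend", "frontend"] if target == "both" else [target]
    commands: list[list[str]] = []
    for name in names:
        image = f"{registry}/sentinel-{name}"  # hypothetical repository name
        commands.append(
            ["docker", "build", "--platform", "linux/amd64",
             "-t", f"{image}:{sha}", "-t", f"{image}:latest",
             *BUILD_CONTEXTS[name]]
        )
        commands.append(["docker", "push", f"{image}:{sha}"])
        commands.append(["docker", "push", f"{image}:latest"])
        commands.append(
            ["aws", "ecs", "update-service", "--cluster", cluster,
             "--service", f"sentinel-{name}",  # hypothetical service name
             "--force-new-deployment"]
        )
    return commands


for argv in deploy_commands("backend", "123456789012.dkr.ecr.us-east-1.amazonaws.com",
                            "f1e9789", "sentinel"):
    print(" ".join(argv))
```

Tagging with both the git SHA and latest keeps an immutable reference for rollback while letting the task definition keep pointing at latest; the forced redeploy is what makes ECS pull the new image.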

CI (`.github/workflows/ci.yml`)

New terraform job (no AWS credentials needed) running terraform fmt -recursive -check, terraform init -backend=false, and terraform validate
on every PR — so a syntax/wiring regression is caught without any AWS calls.
Backend and frontend jobs unchanged.

Verification

Local (this session):

ruff/format/mypy: clean (eval/ + new observability.py + main.py)
backend pytest: 195 passed (was 187 in M9 + 8 new request-id middleware tests)
frontend tests: 7 passed (unchanged)
terraform fmt/validate: NOT run locally — no terraform binary. CI job covers it.
docker build: NOT run locally — no Docker daemon. CI/operator covers it.

The infra is wired so the only step that costs money is an explicit
terraform apply run by the operator after reviewing terraform plan.

Schema/migration concerns

None for the application schema. Infra-as-code only.

Reminder

Please squash-merge this PR. Then follow infra/README.md for the
operator workflow: terraform plan → terraform apply → aws ssm put-parameter for the API keys → Run workflow on the CD action → demo
screenshots → terraform destroy immediately after. M11 (docs/demo.md,
architecture diagram, README polish) is the natural next milestone for
turning a working stack into a portfolio artefact.

…n Dockerfile

backend/app/observability.py:
- configure_logging() wires structlog for JSON output (CloudWatch-friendly) with a SENTINEL_LOG_FORMAT=console escape hatch for local dev. Idempotent so CLIs (make seed, make eval) produce the same shape of log as the API.
- RequestIdMiddleware assigns a stable id per request, binds it to the structlog contextvars (so any structlog call inside a handler picks it up), exposes it on request.state.request_id, and surfaces it on the response as X-Request-Id. Caller-supplied X-Request-Id headers are accepted only when short and printable ([alnum]+[-_], <= 64 chars); anything else is replaced with a fresh uuid4 hex to keep attacker-controlled bytes out of the log pipeline.

backend/app/main.py: configure_logging() at import time; middleware added before routers.

backend/tests/test_request_id.py (8 tests): generated id is uuid4 hex; safe inbound id is echoed; rogue inbound ids (too long, whitespace, control chars, punctuation, empty) are replaced; consecutive requests get distinct ids.

backend/Dockerfile: multi-stage (uv-based dependency resolution, slim runtime), non-root sentinel user (uid 1000), HEALTHCHECK against /health, PORT=8000 default but honours $PORT for ECS service-port flexibility, SENTINEL_LOG_FORMAT defaults to 'json' in the image. Source layer copied last so code-only changes don't invalidate the deps layer.

backend/.dockerignore prunes tests, frontend, eval, scripts, .git, IDE state, and local Postgres data so the image stays small and free of secrets.

structlog>=24.4 added as a runtime dep (resolved 25.5.0).

…x serve)

frontend/Dockerfile is a two-stage image:
1. node:20-alpine builder runs 'npm ci && npm run build' (which transitively runs 'tsc -b' so any type error fails the build, matching the CI lint step).
2. nginx:1.27-alpine runtime serves /usr/share/nginx/html (the Vite dist) and reverse-proxies same-origin paths to the backend.

The nginx config template substitutes ${BACKEND_URL} via the official image's envsubst entrypoint on container start, so the same image is portable across environments. ECS task def sets BACKEND_URL to the backend service-discovery DNS name (default in-image: http://backend:8000 for local docker compose).

The proxy passes /query, /extract, /review, /dashboard, /health straight through with X-Forwarded-* headers and forwards request headers (so the M10 X-Request-Id stays correlated end to end). Hashed Vite assets get a 1-year cache; everything else is uncached. SPA fallback ('try_files $uri $uri/ /index.html') keeps React Router routes working on hard reload.

frontend/.dockerignore prunes node_modules/dist/test trees, IDE state, and *.tsbuildinfo so the build context stays small.

…st-1, demo)

infra/ provisions the M10 demo stack on AWS:
- modules/network: VPC (10.0.0.0/16), two public /24 subnets in two AZs, IGW, public route table. Owns the four security groups (alb, frontend, backend, rds) so the rds ingress rule can reference the backend SG without creating an ecs <-> rds module-level dependency cycle. Reachability graph encoded in the SGs: internet -> alb -> {frontend on 80, backend on 8000} -> rds on 5432. Egress open on tasks (ECR/Anthropic/OpenAI/CloudWatch); RDS has none.
- modules/ecr: two repos (backend, frontend) with image-scan-on-push, a 7-day untagged-image expiry, and a 20-image cap. force_delete=true so terraform destroy doesn't hang on lingering tags.
- modules/secrets: SSM SecureString parameters for ANTHROPIC_API_KEY, OPENAI_API_KEY (placeholders, lifecycle.ignore_changes=[value] so the real out-of-band 'aws ssm put-parameter' values aren't clobbered on re-apply), and DATABASE_URL composed from rds outputs.
- modules/rds: Postgres 16.4 db.t4g.micro single-AZ, gp3 storage, encrypted at rest, publicly_accessible=false invariant, parameter group (log_statement=ddl). pgvector loads via the application's CREATE EXTENSION migration; no shared_preload_libraries needed.
- modules/ecs: cluster, ALB with HTTP listener (frontend default; path-prefix rule routes /query|/extract|/review|/dashboard|/health to the backend target group), service discovery in <project>.local for nginx -> backend, two task defs (256 cpu / 512 mem), two services with assign_public_ip=true (no-NAT topology). Task execution role has scoped ssm:GetParameter on the three secret ARNs. CloudWatch log groups with 7-day retention.
- modules/ci_oidc: GitHub Actions OIDC provider + role scoped to the configured repo via 'repo:OWNER/NAME:*' subject claim. Permissions: ecr push to the two project repos, ecr:GetAuthorizationToken account-wide, ecs:UpdateService on the two project services. PassRole limited to the project task roles, only to ecs-tasks.amazonaws.com. count=0 when var.github_repository is empty.

Root: versions.tf (terraform >=1.6, aws ~>5.70), variables.tf (project_name, region us-east-1 default, db creds with sensitive=true and >=16 char password validation, image tags, github_repository), main.tf wires everything, outputs expose ALB DNS, ECR URLs, ECS names, RDS endpoint, CI role ARN.

No remote state. Local-only is fine for a single-operator demo; convert to S3 + DynamoDB before any second user.
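
For readers unfamiliar with ECR lifecycle policies, the two retention rules described for modules/ecr could be expressed as the JSON document below, built here in Python. This is a sketch of the policy format, not the module's actual policy: the rule descriptions are hypothetical, and applying the 20-image cap to images of any tag status is an assumption.

```python
import json


def lifecycle_policy(untagged_days: int = 7, max_images: int = 20) -> dict:
    """ECR lifecycle policy: expire old untagged images, then cap the total."""
    return {
        "rules": [
            {
                "rulePriority": 1,
                "description": "Expire untagged images after a few days",
                "selection": {
                    "tagStatus": "untagged",
                    "countType": "sinceImagePushed",
                    "countUnit": "days",
                    "countNumber": untagged_days,
                },
                "action": {"type": "expire"},
            },
            {
                # A rule with tagStatus "any" must carry the highest priority
                # number, so it is evaluated last.
                "rulePriority": 2,
                "description": "Keep only the most recent images",
                "selection": {
                    "tagStatus": "any",
                    "countType": "imageCountMoreThan",
                    "countNumber": max_images,
                },
                "action": {"type": "expire"},
            },
        ]
    }


print(json.dumps(lifecycle_policy(), indent=2))
```

In Terraform the same document would typically be passed through jsonencode to an aws_ecr_lifecycle_policy resource, one per repository.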

…e .dockerignore

.github/workflows/cd.yml: workflow_dispatch only (no push:, no pull_request:). The trigger gate is the cost-control mechanism for M10; additional triggers must not be added. Steps: assume the OIDC role (AWS_ROLE_ARN secret), ECR login, build+push backend (context = repo root, -f backend/Dockerfile) and/or frontend (context = ./frontend) tagged with the git SHA + 'latest', force ECS service redeploy. Choice input lets the operator deploy backend / frontend / both per dispatch.

ci.yml: new 'terraform' job (no AWS creds) running terraform fmt -check, terraform init -backend=false, terraform validate. Catches a Terraform syntax/wiring regression on every PR without touching AWS.

.dockerignore: moved from backend/ to repo root so Docker picks it up. The backend build context is the repo root (Dockerfile copies pyproject.toml, uv.lock, alembic.ini from there). frontend/.dockerignore stays where it is because the frontend build context is ./frontend.

div0rce · 2026-05-29T13:39:10Z

@codex review

Review PR #14 strictly as M10 — Containerization, Terraform (AWS), and CD pipeline.

Focus on:

production Dockerfiles
frontend nginx config
backend runtime config
structured logging
request-id middleware
Terraform module correctness
AWS ECS/Fargate design
ECR setup
RDS/Postgres security
security group boundaries
public-subnet/no-NAT cost posture
SSM/GitHub secrets handling
manual workflow_dispatch CD only
no accidental AWS apply/deploy triggers
no committed secrets/state
least-privilege IAM/OIDC assumptions
health checks
ALB routing
environment variable wiring
cost-control documentation
teardown instructions
whether the PR satisfies M10 without actually provisioning resources

Known constraint:

Terraform and Docker may not have been validated locally because Terraform was unavailable and Docker daemon was not running.
Do not treat lack of apply/deploy as a blocker unless the code itself is unsafe.
Do flag anything that would create cost unintentionally or expose RDS/public services incorrectly.

Check especially:

No terraform apply or deployment runs automatically on push/PR.
CD workflow is manual-only via workflow_dispatch.
RDS is not publicly accessible.
Backend is only reachable through intended ALB/security-group paths.
Secrets are never committed and are sourced from SSM/GitHub secrets.
Terraform state, tfvars, credentials, and generated artifacts are gitignored.
Docker images run as non-root where practical.
Health checks match actual app endpoints.
Public-subnet/no-NAT tradeoff is documented honestly.
terraform destroy / teardown instructions are clear.

Output only:

BLOCKING
NON-BLOCKING
CLEANUP
FINAL VERDICT

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7f5f35cbf4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

div0rce · 2026-05-29T14:06:09Z

@codex review

Re-review PR #14 after commit f1e9789 fix: separate deployed API routes and harden frontend container.

Focus only on the two previously reported P2 findings and the M10 deployment safety surface.

Previously reported findings:

SPA routes /review and /dashboard conflicted with deployed backend API routing.
Frontend nginx runtime container ran as root.

Verify:

SPA routes /, /review, and /dashboard are served by the frontend target group / SPA fallback.
API traffic is separated under /api/* in deployed frontend behavior.
nginx proxies /api/* to FastAPI and strips /api correctly.
ALB backend direct routing is limited to backend health checks where appropriate.
/health health check behavior still works for backend/ALB.
Frontend API client uses the deployed VITE_API_BASE=/api path correctly.
Local/dev API override behavior is not broken.
nginx runtime runs as non-root.
nginx listens on unprivileged port 8080.
ECS frontend container port, target group port, service wiring, and security-group rules all align with port 8080.
writable nginx pid/cache/log paths are correctly handled for USER nginx.
No schema/migration changes were introduced.
No M11+ scope was introduced.
No terraform apply, deployment trigger, AWS resource creation, secrets, .tfstate, or credentials were added.
CD workflow remains manual workflow_dispatch only.
Terraform remains cost-controlled and consistent with the documented public-subnet/no-NAT posture.

Also verify the reported local checks are sufficient:

make check passed
frontend lint/test/build passed
frontend Docker build passed
Docker runtime smoke ran as uid=101 user=nginx
terraform -chdir=infra fmt -recursive -check passed
terraform -chdir=infra init -backend=false passed
terraform -chdir=infra validate passed
no terraform plan or terraform apply was run

Output only:

BLOCKING
NON-BLOCKING
CLEANUP
FINAL VERDICT

chatgpt-codex-connector · 2026-05-29T14:08:50Z

Codex Review: Didn't find any major issues. Hooray!


* docs(progress): M10 merged (PR #14, b18112d); M11 in progress

* docs(architecture): full write-up + Mermaid source + rendered PNG

  docs/architecture.md replaces the M0 placeholder. Covers:
  - High-level component diagram (frontend, backend, governance, data, providers).
  - Per-component source-of-truth file list, including invariant cross-references (citation-or-refuse, append-only audit, redaction, FSM idempotency).
  - Sequence diagrams for /query, /extract, and human review.
  - ER diagram for documents → chunks → extractions → workflow_items → audit_events.
  - M10 deployment shape: VPC/SG reachability graph, ECS+ALB+RDS+SSM topology, cost posture, CD posture (workflow_dispatch only).

  docs/architecture.mmd is the standalone source for the headline component diagram. Render with mmdc:

      npx -y --package=@mermaid-js/mermaid-cli mmdc \
        -i docs/architecture.mmd -o docs/architecture.png \
        --backgroundColor white --width 1600 --scale 2

  docs/architecture.png is the committed rendering (3168x2234) so a reviewer landing on the README sees the picture without needing to run mmdc.

* docs(demo): 7-step demo script (clone -> compose -> seed -> query/refusal -> extract -> review -> dashboard)

  Replaces the M0 placeholder. Each step has a copy-pasteable command, an expected response shape (no fabricated metric values; real LLM output is documented as 'phrase may differ but the citation invariants are deterministic'), screenshot placeholders rooted at docs/screenshots/, and explicit invariant call-outs (citation-or-refuse, append-only audit verifiable in psql, requires_review routing). Final 'Optional - AWS' section repeats the demo against the M10 Terraform stack with a teardown reminder, but never runs apply for the reader; that remains a manual operator action documented in infra/README.md.

* docs(readme): top-level portfolio README: problem, architecture, features, quickstart, eval, governance, deployment, limitations, roadmap

  Single-page entry point. Embeds the architecture PNG, links every sub-doc (architecture.md, demo.md, evaluation.md, guardrails.md, workflow.md, audit-and-review.md, infra/README.md), and lists every limitation honestly: synthetic data only, small eval set, eval/RESULTS.md still pending real numbers (issue #13), demo-only deployment posture, self-reported confidence is a routing signal not a calibrated probability, citation-validity is an in-context check. CI badge points at .github/workflows/ci.yml on main. License badge points at the new LICENSE file.

* docs: add MIT LICENSE
* docs(progress): mark M11 complete on branch with DoD verification
* docs: update architecture diagram
* docs: align demo examples with current schema
* docs: avoid hardcoded workflow item id in demo
* fix: route successful extractions into workflow

div0rce added 7 commits May 29, 2026 02:01

docs(progress): record M9 merged (PR #12) and mark M10 in progress

02995b6

docs(progress): mark M10 complete on branch with DoD verification

97b0263

style(infra): terraform fmt — fix attribute alignment in rds module

7f5f35c

chatgpt-codex-connector Bot reviewed May 29, 2026


Comment thread infra/modules/ecs/main.tf Outdated

Comment thread frontend/Dockerfile

fix: separate deployed API routes and harden frontend container

f1e9789

div0rce merged commit b18112d into main May 29, 2026
3 checks passed

div0rce deleted the feat/m10-deploy branch May 29, 2026 14:09

div0rce mentioned this pull request May 29, 2026

docs: README, architecture diagram, and demo #15

Merged

3 tasks

div0rce merged 8 commits into main from feat/m10-deploy
