feat: containerization, Terraform (AWS), and CD pipeline #14

Merged
div0rce merged 8 commits into main from feat/m10-deploy on May 29, 2026
@div0rce div0rce commented May 29, 2026

Owner

Milestone

M10 — Containerization, Terraform (AWS), and CD pipeline

Summary

Production Dockerfiles for backend (uvicorn + structlog + request-id middleware,
non-root, multi-stage) and frontend (nginx serving the Vite SPA, envsubst-driven
backend URL); Terraform under infra/ provisioning a cost-minimal us-east-1
demo stack (VPC, public subnets, ECR, RDS Postgres 16 with pgvector, ECS Fargate
behind an ALB, SSM Parameter Store secrets, GitHub Actions OIDC role); a
manual-dispatch CD workflow that builds + pushes images and force-redeploys ECS;
and infra/README.md documenting the cost posture, security invariants, and
the apply/destroy recipe.

Hard constraint honoured: no terraform apply was run, no AWS resources
were created, no costs incurred. The PR ships infra-as-code only. The user (the
operator) runs terraform plan and apply against their own AWS account when
ready, captures demo screenshots, and terraform destroy immediately after.

Definition of Done

  • All DoD items from MILESTONES.md addressed (one is operator-action, see below)
  • make check passes (ruff + ruff-format + mypy strict + 195 backend pytest + 7 frontend Vitest)
  • Tests added/updated for new logic — 8 new request-id-middleware tests
  • PROGRESS.md updated — M10 marked complete on branch; M9 row updated to ☑ merged with a link to PR #12 (feat: evaluation harness and benchmark results); backlog issue #13 (eval: record real-provider benchmark numbers, M9 follow-up) noted
  • No secrets committed; sample data is synthetic; SSM SecureString placeholders use lifecycle.ignore_changes = [value]
  • Guardrails intact (citation-or-refuse, PII redaction, confidence gating, audit logging) — unchanged

M10 DoD verification (from MILESTONES.md)

  • terraform plan is clean; apply provisions the stack. Pending operator action. The user explicitly forbade running terraform plan or apply in this session. The local environment also has no terraform binary, so fmt and validate were not run locally either. Those checks are wired into CI (a new terraform job that needs no AWS credentials and runs fmt -check, init -backend=false, and validate), so syntax and wiring regressions are caught without any AWS calls.
  • CD workflow builds and deploys on manual dispatch. .github/workflows/cd.yml is workflow_dispatch-only — no push: or pull_request: triggers, by design. Steps: assume the OIDC role (AWS_ROLE_ARN secret), ECR login, build + push backend (context = repo root, -f backend/Dockerfile) and/or frontend (context = ./frontend) tagged with the git SHA + latest, force ECS service redeploy. Choice input lets the operator deploy backend / frontend / both per dispatch.
  • App is reachable at a URL: infra-as-code complete. The Terraform output alb_dns_name is the URL once terraform apply succeeds. Capturing screenshots and the demo flow are M11 deliverables; teardown via terraform destroy is the operator's immediate next step.

Locked design (per user constraints)

  • Code only this session. No terraform apply. No AWS API calls. No costs. No terraform plan unless AWS credentials are configured and the user explicitly approves (the user did not — so plan didn't run).
  • Cost posture: public-subnet / no-NAT. Avoids the ~$32/month idle NAT Gateway. ECS tasks live in public subnets with assign_public_ip = true so they can reach ECR / Anthropic / OpenAI / CloudWatch. Security groups are what enforce "internal-only" for RDS (next bullet).
  • Hard invariant: RDS not publicly accessible. aws_db_instance.publicly_accessible = false and the rds security group ingress is keyed only to the backend task SG. Even though RDS lives in the same public subnets as the tasks (no private subnets in the no-NAT design), the SG prevents internet reach.
  • Trigger gate: workflow_dispatch is the cost-control gate for the CD workflow. Nothing else (no push:, no pull_request:).
  • Region: us-east-1. Pinned via var.region default.
  • Secrets: SSM Parameter Store SecureString. API keys are placeholders with lifecycle.ignore_changes = [value] so an out-of-band aws ssm put-parameter --overwrite is not clobbered on re-apply. CI identity uses GitHub OIDC, not long-lived access keys.
  • Demo-only. infra/README.md documents cost (~$45/month idle floor), every tradeoff (single-AZ, no Multi-AZ, no auto-scaling, no remote state, plain HTTP on the ALB), and an unambiguous "terraform destroy immediately after screenshots" instruction.
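
The NAT Gateway figure in the cost-posture bullet can be reproduced with back-of-envelope arithmetic. The sketch below assumes an hourly NAT Gateway rate of $0.045 in us-east-1 and a 730-hour month; neither number comes from this PR, and data-processing charges are ignored, so treat the result as an estimate to check against current AWS pricing.

```python
# Assumed inputs (not stated in this PR); verify against current AWS pricing.
NAT_GATEWAY_USD_PER_HOUR = 0.045
HOURS_PER_MONTH = 730  # common approximation for an average month


def idle_monthly_cost(usd_per_hour: float, hours: int = HOURS_PER_MONTH) -> float:
    """Return the monthly cost of a resource that bills per hour while idle."""
    return round(usd_per_hour * hours, 2)


nat_idle = idle_monthly_cost(NAT_GATEWAY_USD_PER_HOUR)
print(f"Estimated idle NAT Gateway cost: ${nat_idle}/month")
```

Under those assumptions the estimate lands in the same range as the ~$32/month quoted above, which is the cost the public-subnet design avoids.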

What ships

Backend

  • backend/app/observability.py: configure_logging() wires structlog for
    JSON output (SENTINEL_LOG_FORMAT=console for local dev). RequestIdMiddleware
    assigns a stable id per request (an inbound X-Request-Id is kept only if it
    passes the allowlist; otherwise a uuid4().hex is generated), binds it to the
    structlog contextvars for the request scope, and surfaces it on the response.
  • backend/app/main.py — calls configure_logging() at import, adds the
    middleware before the routers.
  • backend/Dockerfile — multi-stage uv build → slim runtime. Non-root
    sentinel user (uid 1000), HEALTHCHECK on /health, honours $PORT,
    SENTINEL_LOG_FORMAT=json default.
  • .dockerignore (repo root) — keeps the backend build context lean and
    free of secrets / tests / frontend / eval / scripts / IDE state.
  • structlog>=24.4 added to runtime deps (resolved 25.5.0).

Frontend

  • frontend/Dockerfile: node:20-alpine builder runs npm ci && npm run build (which runs tsc -b first, so any type error fails the image build).
    nginx:1.27-alpine runtime serves /usr/share/nginx/html and reverse-proxies
    same-origin API paths to ${BACKEND_URL} (envsubst-substituted on container
    start by the official entrypoint).
  • frontend/nginx.conf.template — SPA try_files fallback for React Router
    routes; reverse-proxy for /query, /extract, /review, /dashboard,
    /health with X-Forwarded-* headers and request-id pass-through; 1y cache
    on hashed Vite assets.
  • frontend/.dockerignore — keeps node_modules, dist, tests, .git, IDE
    state, and *.tsbuildinfo out of the build context.

Terraform (infra/)

infra/
├── versions.tf       terraform >= 1.6, aws ~> 5.70
├── variables.tf      project_name, region, db creds (sensitive, >=16 char), image tags, github_repository
├── main.tf           wires the modules
├── outputs.tf        alb_dns_name, ecr URLs, ecs names, rds_endpoint, ci_role_arn
├── README.md         cost & security posture, apply/destroy recipe, validation steps
└── modules/
    ├── network/      VPC + 2 public subnets + IGW + public RT + 4 SGs
    ├── ecr/          backend + frontend repos with image-scan + lifecycle
    ├── secrets/      SSM SecureString for API keys + composed DATABASE_URL
    ├── rds/          Postgres 16.4 db.t4g.micro single-AZ, publicly_accessible=false
    ├── ecs/          cluster + ALB + listener + target groups + task defs + services + IAM + log groups
    └── ci_oidc/      GitHub Actions OIDC provider + role scoped to ECR push + ECS update

Reachability graph encoded in the four security groups owned by
modules/network/:

internet ──→ alb_sg          (80, 443)
alb_sg   ──→ frontend_sg     (80)         ALB → nginx
alb_sg   ──→ backend_sg      (8000)       ALB → FastAPI (path-prefix rule)
backend_sg ──→ rds_sg        (5432)       FastAPI → Postgres
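
As a way to reason about the graph above, the four rules can be written down as a default-deny lookup table. This is only an illustrative model of the security-group intent, not the Terraform itself; the function names are hypothetical.

```python
# Edges mirror the reachability graph: (source, destination) -> allowed ports.
ALLOWED = {
    ("internet", "alb_sg"): {80, 443},
    ("alb_sg", "frontend_sg"): {80},
    ("alb_sg", "backend_sg"): {8000},
    ("backend_sg", "rds_sg"): {5432},
}


def can_connect(src: str, dst: str, port: int) -> bool:
    """True only if a rule explicitly allows src -> dst on this port (default deny)."""
    return port in ALLOWED.get((src, dst), set())


def sources_for(dst: str) -> set:
    """Every source that has at least one ingress rule into dst."""
    return {s for (s, d) in ALLOWED if d == dst}


print(can_connect("internet", "rds_sg", 5432))
print(sources_for("rds_sg"))
```

The point of the model is that rds_sg has exactly one ingress source, backend_sg, so there is no direct internet-to-RDS rule even though RDS sits in a public subnet.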

OIDC trust policy is scoped to one repo via the repo:OWNER/NAME:* subject
claim. CI permissions: ecr:GetAuthorizationToken account-wide, push to the
two project ECR repos, ecs:UpdateService on the two project services.
iam:PassRole is restricted to the project task roles, only to
ecs-tasks.amazonaws.com.
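
For readers unfamiliar with GitHub OIDC trust policies, the sketch below builds the general shape of such a policy document in Python. It is an assumption about what the ci_oidc module renders, based on the standard GitHub Actions OIDC setup; the account id and repository are placeholders, and the helper name is hypothetical.

```python
import json


def github_oidc_trust_policy(account_id: str, repository: str) -> dict:
    """Build an IAM trust policy that lets one GitHub repo assume a role via OIDC."""
    provider = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    # Audience used by aws-actions/configure-aws-credentials.
                    "StringEquals": {f"{provider}:aud": "sts.amazonaws.com"},
                    # Subject claim scoped to a single repository, any ref.
                    "StringLike": {f"{provider}:sub": f"repo:{repository}:*"},
                },
            }
        ],
    }


# Placeholder values, for illustration only.
policy = github_oidc_trust_policy("123456789012", "OWNER/NAME")
print(json.dumps(policy, indent=2))
```

The trailing `:*` accepts any branch, tag, or environment in the repository; a tighter subject (for example a single branch) would narrow who can assume the role.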

CD (.github/workflows/cd.yml)

  • workflow_dispatch only. Choice input: backend / frontend / both.
  • OIDC via aws-actions/configure-aws-credentials@v4 (role ARN from
    secrets.AWS_ROLE_ARN, written from terraform output ci_role_arn).
  • Builds with --platform linux/amd64, tags with the git SHA + latest,
    pushes to ECR.
  • aws ecs update-service --force-new-deployment for each requested service.
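
The dispatch logic above can be sketched as a function that turns the choice input into the docker and aws commands to run. This is a rough model rather than the contents of cd.yml: the repository URLs, cluster name, and service names are hypothetical parameters, and the real workflow may order or group the steps differently.

```python
from typing import Dict, List

# Build context and Dockerfile per image, as described in the PR.
BUILDS = {
    "backend": (".", "backend/Dockerfile"),
    "frontend": ("./frontend", "frontend/Dockerfile"),
}


def cd_commands(target: str, sha: str, repos: Dict[str, str],
                cluster: str, services: Dict[str, str]) -> List[List[str]]:
    """Return argv lists for one dispatch of 'backend', 'frontend', or 'both'."""
    if target not in ("backend", "frontend", "both"):
        raise ValueError(f"unknown target: {target}")
    selected = list(BUILDS) if target == "both" else [target]
    commands: List[List[str]] = []
    for name in selected:
        context, dockerfile = BUILDS[name]
        repo = repos[name]
        # Build once, tag with both the git SHA and 'latest'.
        commands.append([
            "docker", "build", "--platform", "linux/amd64",
            "-f", dockerfile,
            "-t", f"{repo}:{sha}", "-t", f"{repo}:latest",
            context,
        ])
        commands.append(["docker", "push", f"{repo}:{sha}"])
        commands.append(["docker", "push", f"{repo}:latest"])
        # Force ECS to pull the freshly pushed image.
        commands.append([
            "aws", "ecs", "update-service",
            "--cluster", cluster, "--service", services[name],
            "--force-new-deployment",
        ])
    return commands
```

Returning argv lists instead of executing them keeps the sketch free of side effects; a caller could pass each list to `subprocess.run(cmd, check=True)`.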

CI (.github/workflows/ci.yml)

New terraform job (no AWS credentials needed) running terraform fmt -recursive -check, terraform init -backend=false, and terraform validate
on every PR — so a syntax/wiring regression is caught without any AWS calls.
Backend and frontend jobs unchanged.

Verification

Local (this session):

ruff/format/mypy: clean (eval/ + new observability.py + main.py)
backend pytest: 195 passed (was 187 in M9 + 8 new request-id middleware tests)
frontend tests: 7 passed (unchanged)
terraform fmt/validate: NOT run locally — no terraform binary. CI job covers it.
docker build: NOT run locally — no Docker daemon. CI/operator covers it.

The infra is wired so the only step that costs money is an explicit
terraform apply run by the operator after reviewing terraform plan.

Schema/migration concerns

None for the application schema. Infra-as-code only.

Reminder

Please squash-merge this PR. Then follow infra/README.md for the
operator workflow: terraform plan → terraform apply → aws ssm put-parameter for the API keys → Run workflow on the CD action → demo
screenshots → terraform destroy immediately after. M11 (docs/demo.md,
architecture diagram, README polish) is the natural next milestone for
turning a working stack into a portfolio artefact.

div0rce added 7 commits May 29, 2026 02:01
…n Dockerfile

backend/app/observability.py:
- configure_logging() wires structlog for JSON output (CloudWatch-friendly) with
  a SENTINEL_LOG_FORMAT=console escape hatch for local dev. Idempotent so CLIs
  (make seed, make eval) produce the same shape of log as the API.
- RequestIdMiddleware assigns a stable id per request, binds it to the structlog
  contextvars (so any structlog call inside a handler picks it up), exposes it on
  request.state.request_id, and surfaces it on the response as X-Request-Id.
  Caller-supplied X-Request-Id headers are accepted only when short and printable
  ([alnum]+[-_], <= 64 chars); anything else is replaced with a fresh uuid4 hex
  to keep attacker-controlled bytes out of the log pipeline.
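
The allowlist rule described above might look roughly like the following. It is a standalone sketch using only the standard library, not the code in observability.py; the regular expression is one reading of "[alnum]+[-_], <= 64 chars", and the function name is hypothetical.

```python
import re
import uuid
from typing import Optional

# Assumed allowlist: ASCII alphanumerics plus '-' and '_', 1 to 64 characters.
_SAFE_REQUEST_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")


def resolve_request_id(inbound: Optional[str]) -> str:
    """Echo a safe caller-supplied id; otherwise mint a fresh uuid4 hex."""
    if inbound is not None and _SAFE_REQUEST_ID.fullmatch(inbound):
        return inbound
    return uuid.uuid4().hex


print(resolve_request_id("trace-42_a"))
print(resolve_request_id("bad id\n"))
```

Using `fullmatch` rather than `match` matters here: it rejects values that start with safe characters but carry whitespace or control characters later in the string.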

backend/app/main.py: configure_logging() at import time; middleware added before
routers.

backend/tests/test_request_id.py (8 tests): generated id is uuid4 hex; safe
inbound id is echoed; rogue inbound ids (too long, whitespace, control chars,
punctuation, empty) are replaced; consecutive requests get distinct ids.

backend/Dockerfile: multi-stage (uv-based dependency resolution, slim runtime),
non-root sentinel user (uid 1000), HEALTHCHECK against /health, PORT=8000
default but honours $PORT for ECS service-port flexibility, SENTINEL_LOG_FORMAT
defaults to 'json' in the image. Source layer copied last so code-only changes
don't invalidate the deps layer.

backend/.dockerignore prunes tests, frontend, eval, scripts, .git, IDE state,
and local Postgres data so the image stays small and free of secrets.

structlog>=24.4 added as a runtime dep (resolved 25.5.0).
…x serve)

frontend/Dockerfile is a two-stage image:

1. node:20-alpine builder runs 'npm ci && npm run build' (which transitively
   runs 'tsc -b' so any type error fails the build, matching the CI lint step).
2. nginx:1.27-alpine runtime serves /usr/share/nginx/html (the Vite dist) and
   reverse-proxies same-origin paths to the backend.

The nginx config template substitutes ${BACKEND_URL} via the official image's
envsubst entrypoint on container start, so the same image is portable across
environments. ECS task def sets BACKEND_URL to the backend service-discovery
DNS name (default in-image: http://backend:8000 for local docker compose).
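
The substitution step can be approximated in Python to show why nginx's own variables survive it. This sketch assumes the entrypoint only replaces placeholders for variables that are actually defined in the environment, and it handles only the `${NAME}` form; the template text and function name are illustrative, not the real nginx.conf.template.

```python
import re
from typing import Dict

_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def render_template(template: str, env: Dict[str, str]) -> str:
    """Replace ${NAME} only when NAME is defined in env; leave everything else untouched."""
    def substitute(match: "re.Match") -> str:
        name = match.group(1)
        return env.get(name, match.group(0))

    return _PLACEHOLDER.sub(substitute, template)


# Illustrative template: one placeholder plus nginx runtime variables.
template = (
    "location /query { proxy_pass ${BACKEND_URL}; }\n"
    "location / { try_files $uri $uri/ /index.html; }\n"
)
rendered = render_template(template, {"BACKEND_URL": "http://backend:8000"})
print(rendered)
```

Because `$uri` is not a defined environment variable, it passes through unchanged, which is what lets the same template carry both deploy-time placeholders and nginx runtime variables.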

The proxy passes /query, /extract, /review, /dashboard, /health straight
through with X-Forwarded-* headers and forwards request headers (so the M10
X-Request-Id stays correlated end to end). Hashed Vite assets get a 1-year
cache; everything else is uncached. SPA fallback ('try_files $uri $uri/
/index.html') keeps React Router routes working on hard reload.
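
The routing split described in this commit message can be modelled as a prefix check followed by an SPA fallback. The sketch mimics plain prefix matching and is not derived from the actual nginx.conf.template; the function name is hypothetical.

```python
# Paths the proxy forwards to the backend, per the commit message.
BACKEND_PREFIXES = ("/query", "/extract", "/review", "/dashboard", "/health")


def route(path: str) -> str:
    """Return 'backend' for proxied API prefixes, 'spa' for everything else."""
    if path.startswith(BACKEND_PREFIXES):
        return "backend"
    # Anything else is served as a static file or falls back to /index.html.
    return "spa"


for path in ("/query", "/assets/index-abc123.js", "/review"):
    print(path, "->", route(path))
```

Under this rule a hard reload of a browser route such as /review is sent to the backend instead of index.html, which is the route conflict raised later in this review thread.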

frontend/.dockerignore prunes node_modules/dist/test trees, IDE state, and
*.tsbuildinfo so the build context stays small.
…st-1, demo)

infra/ provisions the M10 demo stack on AWS:

- modules/network: VPC (10.0.0.0/16), two public /24 subnets in two AZs, IGW,
  public route table. Owns the four security groups (alb, frontend, backend,
  rds) so the rds ingress rule can reference the backend SG without creating
  an ecs <-> rds module-level dependency cycle. Reachability graph encoded in
  the SGs: internet -> alb -> {frontend on 80, backend on 8000} -> rds on 5432.
  Egress open on tasks (ECR/Anthropic/OpenAI/CloudWatch); RDS has none.
- modules/ecr: two repos (backend, frontend) with image-scan-on-push, a 7-day
  untagged-image expiry, and a 20-image cap. force_delete=true so terraform
  destroy doesn't hang on lingering tags.
- modules/secrets: SSM SecureString parameters for ANTHROPIC_API_KEY,
  OPENAI_API_KEY (placeholders, lifecycle.ignore_changes=[value] so the real
  out-of-band 'aws ssm put-parameter' values aren't clobbered on re-apply),
  and DATABASE_URL composed from rds outputs.
- modules/rds: Postgres 16.4 db.t4g.micro single-AZ, gp3 storage, encrypted at
  rest, publicly_accessible=false invariant, parameter group (log_statement=ddl).
  pgvector loads via the application's CREATE EXTENSION migration; no
  shared_preload_libraries needed.
- modules/ecs: cluster, ALB with HTTP listener (frontend default; path-prefix
  rule routes /query|/extract|/review|/dashboard|/health to the backend target
  group), service discovery in <project>.local for nginx -> backend, two task
  defs (256 cpu / 512 mem), two services with assign_public_ip=true (no-NAT
  topology). Task execution role has scoped ssm:GetParameter on the three
  secret ARNs. CloudWatch log groups with 7-day retention.
- modules/ci_oidc: GitHub Actions OIDC provider + role scoped to the configured
  repo via 'repo:OWNER/NAME:*' subject claim. Permissions: ecr push to the two
  project repos, ecr:GetAuthorizationToken account-wide, ecs:UpdateService on
  the two project services. PassRole limited to the project task roles, only
  to ecs-tasks.amazonaws.com. count=0 when var.github_repository is empty.
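
The ECR retention rules listed for modules/ecr correspond to a lifecycle policy document along these lines. The sketch generates the JSON in Python to make the two rules explicit; it is an assumption about the policy the module renders, and the descriptions and helper name are illustrative.

```python
import json


def ecr_lifecycle_policy(untagged_days: int = 7, max_images: int = 20) -> str:
    """Return an ECR lifecycle policy: expire old untagged images, then cap the total."""
    rules = [
        {
            "rulePriority": 1,
            "description": f"Expire untagged images after {untagged_days} days",
            "selection": {
                "tagStatus": "untagged",
                "countType": "sinceImagePushed",
                "countUnit": "days",
                "countNumber": untagged_days,
            },
            "action": {"type": "expire"},
        },
        {
            # A rule with tagStatus 'any' must carry the highest priority number.
            "rulePriority": 2,
            "description": f"Keep at most {max_images} images",
            "selection": {
                "tagStatus": "any",
                "countType": "imageCountMoreThan",
                "countNumber": max_images,
            },
            "action": {"type": "expire"},
        },
    ]
    return json.dumps({"rules": rules}, indent=2)


print(ecr_lifecycle_policy())
```

In Terraform this JSON would typically be passed to an aws_ecr_lifecycle_policy resource, for example via jsonencode.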

Root: versions.tf (terraform >=1.6, aws ~>5.70), variables.tf (project_name,
region us-east-1 default, db creds with sensitive=true and >=16 char password
validation, image tags, github_repository), main.tf wires everything, outputs
expose ALB DNS, ECR URLs, ECS names, RDS endpoint, CI role ARN.

No remote state. Local-only is fine for a single-operator demo; convert to
S3 + DynamoDB before any second user.
…e .dockerignore

.github/workflows/cd.yml: workflow_dispatch only (no push:, no pull_request:).
The trigger gate is the cost-control mechanism for M10 — additional triggers
must not be added.

Steps: assume the OIDC role (AWS_ROLE_ARN secret), ECR login, build+push
backend (context = repo root, -f backend/Dockerfile) and/or frontend (context =
./frontend) tagged with the git SHA + 'latest', force ECS service redeploy.
Choice input lets the operator deploy backend / frontend / both per dispatch.

ci.yml: new 'terraform' job (no AWS creds) running terraform fmt -check,
terraform init -backend=false, terraform validate. Catches a Terraform
syntax/wiring regression on every PR without touching AWS.

.dockerignore: moved from backend/ to repo root so Docker picks it up — the
backend build context is the repo root (Dockerfile copies pyproject.toml,
uv.lock, alembic.ini from there). frontend/.dockerignore stays where it is
because the frontend build context is ./frontend.
div0rce commented May 29, 2026

Owner Author

@codex review

Review PR #14 strictly as M10 — Containerization, Terraform (AWS), and CD pipeline.

Focus on:

  • production Dockerfiles
  • frontend nginx config
  • backend runtime config
  • structured logging
  • request-id middleware
  • Terraform module correctness
  • AWS ECS/Fargate design
  • ECR setup
  • RDS/Postgres security
  • security group boundaries
  • public-subnet/no-NAT cost posture
  • SSM/GitHub secrets handling
  • manual workflow_dispatch CD only
  • no accidental AWS apply/deploy triggers
  • no committed secrets/state
  • least-privilege IAM/OIDC assumptions
  • health checks
  • ALB routing
  • environment variable wiring
  • cost-control documentation
  • teardown instructions
  • whether the PR satisfies M10 without actually provisioning resources

Known constraint:

  • Terraform and Docker may not have been validated locally because Terraform was unavailable and Docker daemon was not running.
  • Do not treat lack of apply/deploy as a blocker unless the code itself is unsafe.
  • Do flag anything that would create cost unintentionally or expose RDS/public services incorrectly.

Check especially:

  1. No terraform apply or deployment runs automatically on push/PR.
  2. CD workflow is manual-only via workflow_dispatch.
  3. RDS is not publicly accessible.
  4. Backend is only reachable through intended ALB/security-group paths.
  5. Secrets are never committed and are sourced from SSM/GitHub secrets.
  6. Terraform state, tfvars, credentials, and generated artifacts are gitignored.
  7. Docker images run as non-root where practical.
  8. Health checks match actual app endpoints.
  9. Public-subnet/no-NAT tradeoff is documented honestly.
  10. terraform destroy / teardown instructions are clear.

Output only:

  • BLOCKING
  • NON-BLOCKING
  • CLEANUP
  • FINAL VERDICT

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7f5f35cbf4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread infra/modules/ecs/main.tf Outdated
Comment thread frontend/Dockerfile
div0rce commented May 29, 2026

Owner Author

@codex review

Re-review PR #14 after commit f1e9789 fix: separate deployed API routes and harden frontend container.

Focus only on the two previously reported P2 findings and the M10 deployment safety surface.

Previously reported findings:

  1. SPA routes /review and /dashboard conflicted with deployed backend API routing.
  2. Frontend nginx runtime container ran as root.

Verify:

  • SPA routes /, /review, and /dashboard are served by the frontend target group / SPA fallback.
  • API traffic is separated under /api/* in deployed frontend behavior.
  • nginx proxies /api/* to FastAPI and strips /api correctly.
  • ALB backend direct routing is limited to backend health checks where appropriate.
  • /health health check behavior still works for backend/ALB.
  • Frontend API client uses the deployed VITE_API_BASE=/api path correctly.
  • Local/dev API override behavior is not broken.
  • nginx runtime runs as non-root.
  • nginx listens on unprivileged port 8080.
  • ECS frontend container port, target group port, service wiring, and security-group rules all align with port 8080.
  • writable nginx pid/cache/log paths are correctly handled for USER nginx.
  • No schema/migration changes were introduced.
  • No M11+ scope was introduced.
  • No terraform apply, deployment trigger, AWS resource creation, secrets, .tfstate, or credentials were added.
  • CD workflow remains manual workflow_dispatch only.
  • Terraform remains cost-controlled and consistent with the documented public-subnet/no-NAT posture.

Also verify the reported local checks are sufficient:

  • make check passed
  • frontend lint/test/build passed
  • frontend Docker build passed
  • Docker runtime smoke ran as uid=101 user=nginx
  • terraform -chdir=infra fmt -recursive -check passed
  • terraform -chdir=infra init -backend=false passed
  • terraform -chdir=infra validate passed
  • no terraform plan or terraform apply was run

Output only:

  • BLOCKING
  • NON-BLOCKING
  • CLEANUP
  • FINAL VERDICT

@chatgpt-codex-connector


Codex Review: Didn't find any major issues. Hooray!


@div0rce div0rce merged commit b18112d into main May 29, 2026
3 checks passed
@div0rce div0rce deleted the feat/m10-deploy branch May 29, 2026 14:09
div0rce added a commit that referenced this pull request May 29, 2026
* docs(progress): M10 merged (PR #14, b18112d); M11 in progress

* docs(architecture): full write-up + Mermaid source + rendered PNG

docs/architecture.md replaces the M0 placeholder. Covers:

- High-level component diagram (frontend, backend, governance, data, providers).
- Per-component source-of-truth file list, including invariant cross-references
  (citation-or-refuse, append-only audit, redaction, FSM idempotency).
- Sequence diagrams for /query, /extract, and human review.
- ER diagram for documents → chunks → extractions → workflow_items → audit_events.
- M10 deployment shape: VPC/SG reachability graph, ECS+ALB+RDS+SSM topology,
  cost posture, CD posture (workflow_dispatch only).

docs/architecture.mmd is the standalone source for the headline component
diagram. Render with mmdc:

    npx -y --package=@mermaid-js/mermaid-cli mmdc \
        -i docs/architecture.mmd -o docs/architecture.png \
        --backgroundColor white --width 1600 --scale 2

docs/architecture.png is the committed rendering (3168x2234) so a reviewer
landing on the README sees the picture without needing to run mmdc.

* docs(demo): 7-step demo script (clone -> compose -> seed -> query/refusal -> extract -> review -> dashboard)

Replaces the M0 placeholder. Each step has a copy-pasteable command, an
expected response shape (no fabricated metric values; real LLM output is
documented as 'phrase may differ but the citation invariants are
deterministic'), screenshot placeholders rooted at docs/screenshots/, and
explicit invariant call-outs (citation-or-refuse, append-only audit
verifiable in psql, requires_review routing).

Final 'Optional - AWS' section repeats the demo against the M10 Terraform
stack with a teardown reminder, but never runs apply for the reader; that
remains a manual operator action documented in infra/README.md.

* docs(readme): top-level portfolio README — problem, architecture, features, quickstart, eval, governance, deployment, limitations, roadmap

Single-page entry point. Embeds the architecture PNG, links every sub-doc
(architecture.md, demo.md, evaluation.md, guardrails.md, workflow.md,
audit-and-review.md, infra/README.md), and lists every limitation honestly:
synthetic data only, small eval set, eval/RESULTS.md still pending real
numbers (issue #13), demo-only deployment posture, self-reported confidence
is a routing signal not a calibrated probability, citation-validity is an
in-context check.

CI badge points at .github/workflows/ci.yml on main. License badge points at
the new LICENSE file.

* docs: add MIT LICENSE

* docs(progress): mark M11 complete on branch with DoD verification

* docs: update architecture diagram

* docs: align demo examples with current schema

* docs: avoid hardcoded workflow item id in demo

* fix: route successful extractions into workflow