feat: containerization, Terraform (AWS), and CD pipeline #14
…n Dockerfile
backend/app/observability.py:
- configure_logging() wires structlog for JSON output (CloudWatch-friendly) with a SENTINEL_LOG_FORMAT=console escape hatch for local dev. Idempotent so CLIs (make seed, make eval) produce the same shape of log as the API.
- RequestIdMiddleware assigns a stable id per request, binds it to the structlog contextvars (so any structlog call inside a handler picks it up), exposes it on request.state.request_id, and surfaces it on the response as X-Request-Id. Caller-supplied X-Request-Id headers are accepted only when short and printable (alphanumerics plus '-' and '_', <= 64 chars); anything else is replaced with a fresh uuid4 hex to keep attacker-controlled bytes out of the log pipeline.
backend/app/main.py: configure_logging() at import time; middleware added before routers.
backend/tests/test_request_id.py (8 tests): generated id is uuid4 hex; safe inbound id is echoed; rogue inbound ids (too long, whitespace, control chars, punctuation, empty) are replaced; consecutive requests get distinct ids.
backend/Dockerfile: multi-stage (uv-based dependency resolution, slim runtime), non-root sentinel user (uid 1000), HEALTHCHECK against /health, PORT=8000 default but honours $PORT for ECS service-port flexibility, SENTINEL_LOG_FORMAT defaults to 'json' in the image. Source layer copied last so code-only changes don't invalidate the deps layer.
backend/.dockerignore prunes tests, frontend, eval, scripts, .git, IDE state, and local Postgres data so the image stays small and free of secrets.
structlog>=24.4 added as a runtime dep (resolved 25.5.0).
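The inbound X-Request-Id allowlist described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in backend/app/observability.py: the function name resolve_request_id and the exact regular expression are assumptions derived from the stated rule (alphanumerics plus '-' and '_', at most 64 characters, otherwise a fresh uuid4 hex).

```python
import re
import uuid
from typing import Optional

# Assumed allowlist: alphanumerics plus '-' and '_', 1 to 64 characters.
# The real pattern in observability.py is not shown in this PR text.
_SAFE_REQUEST_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")


def resolve_request_id(inbound: Optional[str]) -> str:
    """Echo a caller-supplied id only if it passes the allowlist.

    Anything else (missing, empty, too long, whitespace, control
    characters, other punctuation) is replaced with a fresh uuid4 hex.
    """
    # fullmatch() requires the whole string to match, so a trailing
    # newline or embedded control character cannot slip through.
    if inbound is not None and _SAFE_REQUEST_ID.fullmatch(inbound):
        return inbound
    return uuid.uuid4().hex
```

Using fullmatch() rather than match() with a `$` anchor is a deliberate choice in this sketch, because `$` also matches just before a trailing newline.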
…x serve)
frontend/Dockerfile is a two-stage image:
1. node:20-alpine builder runs 'npm ci && npm run build' (which transitively
runs 'tsc -b' so any type error fails the build, matching the CI lint step).
2. nginx:1.27-alpine runtime serves /usr/share/nginx/html (the Vite dist) and
reverse-proxies same-origin paths to the backend.
The nginx config template substitutes ${BACKEND_URL} via the official image's
envsubst entrypoint on container start, so the same image is portable across
environments. ECS task def sets BACKEND_URL to the backend service-discovery
DNS name (default in-image: http://backend:8000 for local docker compose).
The proxy passes /query, /extract, /review, /dashboard, /health straight
through with X-Forwarded-* headers and forwards request headers (so the M10
X-Request-Id stays correlated end to end). Hashed Vite assets get a 1-year
cache; everything else is uncached. SPA fallback ('try_files $uri $uri/
/index.html') keeps React Router routes working on hard reload.
frontend/.dockerignore prunes node_modules/dist/test trees, IDE state, and
*.tsbuildinfo so the build context stays small.
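The ${BACKEND_URL} substitution can be approximated in a few lines to show why nginx's own variables such as $uri survive templating. This is only an analogy written in Python: the template fragment below is hypothetical (the real frontend/nginx.conf.template is not reproduced in this PR text), and the assumption is that the official image's entrypoint substitutes only variables that are defined in the container environment.

```python
from string import Template
from typing import Mapping

# Hypothetical template fragment, for illustration only.
NGINX_TEMPLATE = """\
location /query {
    proxy_pass ${BACKEND_URL};
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
location / {
    try_files $uri $uri/ /index.html;
}
"""


def render(template: str, env: Mapping[str, str]) -> str:
    """Substitute only the variables present in env.

    safe_substitute() leaves unknown placeholders untouched, which mirrors
    (as an assumption) how the entrypoint leaves nginx runtime variables alone.
    """
    return Template(template).safe_substitute(env)


if __name__ == "__main__":
    # Default in-image value from the commit message.
    print(render(NGINX_TEMPLATE, {"BACKEND_URL": "http://backend:8000"}))
```

The same image can then be pointed at a different backend by changing only the environment value passed to render().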
…st-1, demo)
infra/ provisions the M10 demo stack on AWS:
- modules/network: VPC (10.0.0.0/16), two public /24 subnets in two AZs, IGW,
public route table. Owns the four security groups (alb, frontend, backend,
rds) so the rds ingress rule can reference the backend SG without creating
an ecs <-> rds module-level dependency cycle. Reachability graph encoded in
the SGs: internet -> alb -> {frontend on 80, backend on 8000} -> rds on 5432.
Egress open on tasks (ECR/Anthropic/OpenAI/CloudWatch); RDS has none.
- modules/ecr: two repos (backend, frontend) with image-scan-on-push, a 7-day
untagged-image expiry, and a 20-image cap. force_delete=true so terraform
destroy doesn't hang on lingering tags.
- modules/secrets: SSM SecureString parameters for ANTHROPIC_API_KEY,
OPENAI_API_KEY (placeholders, lifecycle.ignore_changes=[value] so the real
out-of-band 'aws ssm put-parameter' values aren't clobbered on re-apply),
and DATABASE_URL composed from rds outputs.
- modules/rds: Postgres 16.4 db.t4g.micro single-AZ, gp3 storage, encrypted at
rest, publicly_accessible=false invariant, parameter group (log_statement=ddl).
pgvector loads via the application's CREATE EXTENSION migration; no
shared_preload_libraries needed.
- modules/ecs: cluster, ALB with HTTP listener (frontend default; path-prefix
rule routes /query|/extract|/review|/dashboard|/health to the backend target
group), service discovery in <project>.local for nginx -> backend, two task
defs (256 cpu / 512 mem), two services with assign_public_ip=true (no-NAT
topology). Task execution role has scoped ssm:GetParameter on the three
secret ARNs. CloudWatch log groups with 7-day retention.
- modules/ci_oidc: GitHub Actions OIDC provider + role scoped to the configured
repo via 'repo:OWNER/NAME:*' subject claim. Permissions: ecr push to the two
project repos, ecr:GetAuthorizationToken account-wide, ecs:UpdateService on
the two project services. PassRole limited to the project task roles, only
to ecs-tasks.amazonaws.com. count=0 when var.github_repository is empty.
Root: versions.tf (terraform >=1.6, aws ~>5.70), variables.tf (project_name,
region us-east-1 default, db creds with sensitive=true and >=16 char password
validation, image tags, github_repository), main.tf wires everything, outputs
expose ALB DNS, ECR URLs, ECS names, RDS endpoint, CI role ARN.
No remote state. Local-only is fine for a single-operator demo; convert to
S3 + DynamoDB before any second user.
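The security-group reachability graph from modules/network can be written down as data and checked, which is a convenient way to reason about the "RDS is internal-only" invariant. The sketch below is an assumption-laden model, not Terraform: the node names follow the four security groups, port 80 on the ALB is assumed from the HTTP listener, and the rds rule is modelled as backend-only, as the PR description states.

```python
from typing import Set, Tuple

# Direct ingress rules as (source, destination, port).
# Port 80 on the ALB is an assumption based on the plain-HTTP listener.
INGRESS_RULES: Set[Tuple[str, str, int]] = {
    ("internet", "alb", 80),
    ("alb", "frontend", 80),
    ("alb", "backend", 8000),
    ("backend", "rds", 5432),
}


def allows(source: str, destination: str, port: int) -> bool:
    """True if some rule admits traffic from source to destination on port."""
    return (source, destination, port) in INGRESS_RULES


def sources_for(destination: str) -> Set[str]:
    """All sources that have any ingress rule into destination."""
    return {src for (src, dst, _port) in INGRESS_RULES if dst == destination}


if __name__ == "__main__":
    # Only the backend tasks should be able to open a connection to Postgres.
    print(sources_for("rds"))
    print(allows("internet", "rds", 5432))
```

Any rule the nginx-to-backend service-discovery path may need is not listed in the commit message and is therefore left out of this model.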
…e .dockerignore
.github/workflows/cd.yml: workflow_dispatch only (no push:, no pull_request:). The trigger gate is the cost-control mechanism for M10; additional triggers must not be added. Steps: assume the OIDC role (AWS_ROLE_ARN secret), ECR login, build+push backend (context = repo root, -f backend/Dockerfile) and/or frontend (context = ./frontend) tagged with the git SHA + 'latest', force ECS service redeploy. Choice input lets the operator deploy backend / frontend / both per dispatch.
ci.yml: new 'terraform' job (no AWS creds) running terraform fmt -check, terraform init -backend=false, terraform validate. Catches a Terraform syntax/wiring regression on every PR without touching AWS.
.dockerignore: moved from backend/ to repo root so Docker picks it up; the backend build context is the repo root (Dockerfile copies pyproject.toml, uv.lock, alembic.ini from there). frontend/.dockerignore stays where it is because the frontend build context is ./frontend.
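The cd.yml build step can be pictured as a small mapping from the dispatch choice to build commands. The sketch below is illustrative only and is not the workflow itself: the function names, the registry value, and the `<registry>/<service>` repository naming are hypothetical, while the build contexts, Dockerfile paths, `--platform linux/amd64`, and the SHA + latest tagging come from the PR text.

```python
from typing import Dict, List

# Build contexts and Dockerfiles as described for cd.yml.
BUILD_SPECS: Dict[str, Dict[str, str]] = {
    "backend": {"context": ".", "dockerfile": "backend/Dockerfile"},
    "frontend": {"context": "./frontend", "dockerfile": "frontend/Dockerfile"},
}


def targets_for(choice: str) -> List[str]:
    """Map the workflow_dispatch choice input to the services to deploy."""
    if choice == "both":
        return ["backend", "frontend"]
    if choice in BUILD_SPECS:
        return [choice]
    raise ValueError(f"unknown deploy choice: {choice!r}")


def build_command(service: str, registry: str, git_sha: str) -> List[str]:
    """Compose a docker build invocation tagged with the git SHA and 'latest'."""
    spec = BUILD_SPECS[service]
    repo = f"{registry}/{service}"  # repository naming is an assumption
    return [
        "docker", "build",
        "--platform", "linux/amd64",
        "-f", spec["dockerfile"],
        "-t", f"{repo}:{git_sha}",
        "-t", f"{repo}:latest",
        spec["context"],
    ]


if __name__ == "__main__":
    for svc in targets_for("both"):
        print(" ".join(build_command(svc, "registry.example", "abc1234")))
```

Tagging every image with the immutable git SHA alongside the moving latest tag keeps a deployed image traceable to a commit.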
@codex review Review PR #13 strictly as M10 — Containerization, Terraform (AWS), and CD pipeline. Focus on:
Known constraint:
Check especially:
Output only:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7f5f35cbf4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
@codex review Re-review PR #13 after commit Focus only on the two previously reported P2 findings and the M10 deployment safety surface. Previously reported findings:
Verify:
Also verify the reported local checks are sufficient:
Output only:
Codex Review: Didn't find any major issues. Hooray!
* docs(progress): M10 merged (PR #14, b18112d); M11 in progress
* docs(architecture): full write-up + Mermaid source + rendered PNG
  docs/architecture.md replaces the M0 placeholder. Covers:
  - High-level component diagram (frontend, backend, governance, data, providers).
  - Per-component source-of-truth file list, including invariant cross-references (citation-or-refuse, append-only audit, redaction, FSM idempotency).
  - Sequence diagrams for /query, /extract, and human review.
  - ER diagram for documents → chunks → extractions → workflow_items → audit_events.
  - M10 deployment shape: VPC/SG reachability graph, ECS+ALB+RDS+SSM topology, cost posture, CD posture (workflow_dispatch only).
  docs/architecture.mmd is the standalone source for the headline component diagram. Render with mmdc:
      npx -y --package=@mermaid-js/mermaid-cli mmdc \
        -i docs/architecture.mmd -o docs/architecture.png \
        --backgroundColor white --width 1600 --scale 2
  docs/architecture.png is the committed rendering (3168x2234) so a reviewer landing on the README sees the picture without needing to run mmdc.
* docs(demo): 7-step demo script (clone -> compose -> seed -> query/refusal -> extract -> review -> dashboard)
  Replaces the M0 placeholder. Each step has a copy-pasteable command, an expected response shape (no fabricated metric values; real LLM output is documented as 'phrase may differ but the citation invariants are deterministic'), screenshot placeholders rooted at docs/screenshots/, and explicit invariant call-outs (citation-or-refuse, append-only audit verifiable in psql, requires_review routing). Final 'Optional - AWS' section repeats the demo against the M10 Terraform stack with a teardown reminder, but never runs apply for the reader; that remains a manual operator action documented in infra/README.md.
* docs(readme): top-level portfolio README - problem, architecture, features, quickstart, eval, governance, deployment, limitations, roadmap
  Single-page entry point. Embeds the architecture PNG, links every sub-doc (architecture.md, demo.md, evaluation.md, guardrails.md, workflow.md, audit-and-review.md, infra/README.md), and lists every limitation honestly: synthetic data only, small eval set, eval/RESULTS.md still pending real numbers (issue #13), demo-only deployment posture, self-reported confidence is a routing signal not a calibrated probability, citation-validity is an in-context check. CI badge points at .github/workflows/ci.yml on main. License badge points at the new LICENSE file.
* docs: add MIT LICENSE
* docs(progress): mark M11 complete on branch with DoD verification
* docs: update architecture diagram
* docs: align demo examples with current schema
* docs: avoid hardcoded workflow item id in demo
* fix: route successful extractions into workflow
Milestone
M10 — Containerization, Terraform (AWS), and CD pipeline
Summary
Production Dockerfiles for backend (uvicorn + structlog + request-id middleware,
non-root, multi-stage) and frontend (nginx serving the Vite SPA, envsubst-driven
backend URL); Terraform under infra/ provisioning a cost-minimal us-east-1 demo
stack (VPC, public subnets, ECR, RDS Postgres 16 with pgvector, ECS Fargate
behind an ALB, SSM Parameter Store secrets, GitHub Actions OIDC role); a
manual-dispatch CD workflow that builds + pushes images and force-redeploys ECS;
and infra/README.md documenting the cost posture, security invariants, and the
apply/destroy recipe.
Hard constraint honoured: no terraform apply was run, no AWS resources were
created, no costs incurred. The PR ships infra-as-code only. The user (the
operator) runs terraform plan and apply against their own AWS account when
ready, captures demo screenshots, and runs terraform destroy immediately after.
Definition of Done
- make check passes (ruff + ruff-format + mypy strict + 195 backend pytest + 7 frontend Vitest)
- … lifecycle.ignore_changes = [value]
M10 DoD verification (from MILESTONES.md)
- terraform plan is clean; apply provisions the stack. Pending operator action. The user explicitly forbade running terraform plan or apply in this session. The local environment also has no terraform binary, so even fmt / validate ran zero times locally; those checks are wired into CI (a new no-AWS-creds terraform job that runs fmt -check, init -backend=false, and validate) so the regression surface is covered without any AWS calls.
- .github/workflows/cd.yml is workflow_dispatch-only: no push: or pull_request: triggers, by design. Steps: assume the OIDC role (AWS_ROLE_ARN secret), ECR login, build + push backend (context = repo root, -f backend/Dockerfile) and/or frontend (context = ./frontend) tagged with the git SHA + latest, force ECS service redeploy. Choice input lets the operator deploy backend / frontend / both per dispatch.
- alb_dns_name is the URL once terraform apply succeeds. Capturing screenshots and the demo flow are M11 deliverables; teardown via terraform destroy is the operator's immediate next step.
Locked design (per user constraints)
- … terraform apply. No AWS API calls. No costs. No terraform plan unless AWS credentials are configured and the user explicitly approves (the user did not, so plan didn't run).
- … assign_public_ip = true so they can reach ECR / Anthropic / OpenAI / CloudWatch. Security groups are what enforce "internal-only" for RDS (next bullet).
- aws_db_instance.publicly_accessible = false and the rds security group ingress is keyed only to the backend task SG. Even though RDS lives in the same public subnets as the tasks (no private subnets in the no-NAT design), the SG prevents internet reach.
- workflow_dispatch is the cost-control gate for the CD workflow. Nothing else (no push:, no pull_request:).
- us-east-1. Pinned via var.region default.
- … lifecycle.ignore_changes = [value] so an out-of-band aws ssm put-parameter --overwrite is not clobbered on re-apply. CI identity uses GitHub OIDC, not long-lived access keys.
- infra/README.md documents cost (~$45/month idle floor), every tradeoff (single-AZ, no Multi-AZ, no auto-scaling, no remote state, plain HTTP on the ALB), and an unambiguous "terraform destroy immediately after screenshots" instruction.
What ships
Backend
- backend/app/observability.py: configure_logging() wires structlog for JSON output (SENTINEL_LOG_FORMAT=console for local dev). RequestIdMiddleware assigns a stable id per request (sanitised inbound X-Request-Id allowlist, generated uuid4().hex otherwise), binds it to the structlog contextvars for the request scope, surfaces it on the response.
- backend/app/main.py: calls configure_logging() at import, adds the middleware before the routers.
- backend/Dockerfile: multi-stage uv build → slim runtime. Non-root sentinel user (uid 1000), HEALTHCHECK on /health, honours $PORT, SENTINEL_LOG_FORMAT=json default.
- .dockerignore (repo root): keeps the backend build context lean and free of secrets / tests / frontend / eval / scripts / IDE state.
- structlog>=24.4 added to runtime deps (resolved 25.5.0).
Frontend
- frontend/Dockerfile: node:20-alpine builder runs npm ci && npm run build (which runs tsc -b first, so any type error fails the image build). nginx:1.27-alpine runtime serves /usr/share/nginx/html and reverse-proxies same-origin API paths to ${BACKEND_URL} (envsubst-substituted on container start by the official entrypoint).
- frontend/nginx.conf.template: SPA try_files fallback for React Router routes; reverse-proxy for /query, /extract, /review, /dashboard, /health with X-Forwarded-* headers and request-id pass-through; 1y cache on hashed Vite assets.
- frontend/.dockerignore: keeps node_modules, dist, tests, .git, IDE state, and *.tsbuildinfo out of the build context.
Terraform (infra/)
Reachability graph encoded in the four security groups owned by modules/network/: internet -> alb -> {frontend on 80, backend on 8000} -> rds on 5432.
OIDC trust policy is scoped to one repo via the repo:OWNER/NAME:* subject claim. CI permissions: ecr:GetAuthorizationToken account-wide, push to the two project ECR repos, ecs:UpdateService on the two project services. iam:PassRole is restricted to the project task roles, only to ecs-tasks.amazonaws.com.
CD (.github/workflows/cd.yml)
- workflow_dispatch only. Choice input: backend / frontend / both.
- aws-actions/configure-aws-credentials@v4 (role ARN from secrets.AWS_ROLE_ARN, written from terraform output ci_role_arn).
- … --platform linux/amd64, tags with the git SHA + latest, pushes to ECR.
- aws ecs update-service --force-new-deployment for each requested service.
CI (.github/workflows/ci.yml)
New terraform job (no AWS credentials needed) running terraform fmt -recursive -check, terraform init -backend=false, and terraform validate on every PR, so a syntax/wiring regression is caught without any AWS calls.
Backend and frontend jobs unchanged.
Verification
Local (this session):
The infra is wired so the only step that costs money is an explicit
terraform apply run by the operator after reviewing terraform plan.
Schema/migration concerns
None for the application schema. Infra-as-code only.
Reminder
Please squash-merge this PR. Then follow infra/README.md for the operator
workflow: terraform plan → terraform apply → aws ssm put-parameter for the API
keys → Run workflow on the CD action → demo screenshots → terraform destroy
immediately after. M11 (docs/demo.md, architecture diagram, README polish) is
the natural next milestone for turning a working stack into a portfolio artefact.