DB Foundation: greenfield two-database rebuild (F1/F2 done, F3/F4 in progress)#36

Draft
whereisanzi wants to merge 39 commits into main from db-foundation

Conversation

@whereisanzi

Contributor

DB Foundation — greenfield two-database rebuild

Supersedes the in-place UUID retrofit (PR #35, closed). Rebuilds the database as two UUID-keyed,
English-named databases fronted by pgbouncer, salvaging the int→uuid code typing from the retrofit.

Draft: not mergeable yet. The working tree is intentionally mid-cutover (some query modules
already target the new civic schema, so legacy code paths break against the old DB). The full API +
frontend rewrite and the cutover come in later commits. Detailed plans live in local *.local.md
(gitignored).

Done

  • F1 — schema. terreiro/schema/{app,civic}/ — 2 databases, 11 schemas, 29 tables, all UUID PKs,
    explicit constraint/index names, HNSW + FTS, static seeds. pt→en column map in schema/MAPPING.md.
    Validated against pgvector pg16.
  • F2 — local stack + plumbing. docker-compose.foundation.yml (db-app + db-civic + one pgbouncer
    per DB, blast-radius isolation, session pooling). Replica-ready two-pool plumbing in config.py /
    database.py (<DOMAIN>_WRITE/READ_DB_URL, read defaults to write). make foundation-* targets.
  • F0.5 — embedding eval harness (kept BGE-M3).

In progress (F3/F4)

  • Query modules migrated so far: raw_ingestao, entes, parlamentares; Baque pool → civic.
    Proven end-to-end with real Câmara data. Remaining: the other query modules + routers + Pydantic
    schemas + agent tools + prompts + the Cortejo frontend, all to English.

Conventions adopted

  • en-US for all identifiers and the technical contract (db, backend, API fields, frontend);
    user-facing copy stays pt-BR.
  • One pgbouncer per database; replica-ready but no replica runs on the single host.

🤖 Generated with Claude Code

Retrofit every integer-PK table to `id UUID PRIMARY KEY DEFAULT gen_random_uuid()`
with no data loss, and update all dependent code to match.

Foundation (P0): adopt the real yoyo-migrations runner. scripts/migrate.py is now
a thin wrapper over `yoyo apply --batch` (state in _yoyo_migration, applies only
unapplied migrations, each transactional with a .rollback.sql companion). A
`--mark-only` mode (make db-migrate-mark) records the legacy migrations as applied
without running them, guarded to refuse if 0018+ retrofit files are present so the
conversion can never be silently skipped (override: --force-mark-all).
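A minimal sketch of the mark-only guard described above. The name `has_post_legacy_migrations` comes from the runner test mentioned later in this PR; its signature, the `NNNN_name.sql` file naming, the `check_mark_only` helper, and the assumption that 0017 is the last legacy migration are all illustrative, not the actual scripts/migrate.py code.

```python
import re
from pathlib import Path

# Assumption: legacy migrations are numbered up to 0017; the retrofit starts at 0018.
LAST_LEGACY = 17


def has_post_legacy_migrations(migrations_dir: Path, last_legacy: int = LAST_LEGACY) -> bool:
    """True if any numbered migration file is newer than the legacy range."""
    for path in migrations_dir.iterdir():
        match = re.match(r"(\d{4})\D", path.name)
        if match and int(match.group(1)) > last_legacy:
            return True
    return False


def check_mark_only(migrations_dir: Path, force_mark_all: bool = False) -> None:
    """Hypothetical guard: refuse to mark when unapplied retrofit files are present."""
    if has_post_legacy_migrations(migrations_dir) and not force_mark_all:
        raise SystemExit(
            "refusing --mark-only: post-legacy migrations found "
            "(pass --force-mark-all to override)"
        )
```

The point of the guard is that marking everything as applied would otherwise record the destructive conversion as done without ever running it.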

Schema (P1-P8): 13 migrations (0018-0030) convert 25 integer-PK tables plus
orgaos_federais. Mechanics keep the old int id and a new id_uuid side by side,
backfill children (and the polymorphic embeddings/feed_eventos refs) by joining on
the old int id, then cut over: drop int PK/FK/UNIQUE, swap to UUID, re-point FKs,
rebuild the votos/orientacoes composite UNIQUEs. Pre-existing constraint drops
resolve the real name from the catalog (DO blocks) instead of trusting Postgres
default names, so a mid-window DROP cannot abort on a name mismatch.
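The catalog lookup mentioned above might look roughly like the following. This is a sketch only: `drop_primary_key_sql` is a hypothetical helper that renders a PL/pgSQL `DO` block, and the actual migrations may structure the lookup differently. It reads the real primary-key name from `pg_constraint` rather than assuming the default `<table>_pkey`.

```python
def drop_primary_key_sql(table: str) -> str:
    """Render a DO block that drops a table's PK using its real catalog name.

    Hypothetical helper; dependent FKs must already be dropped before it runs.
    """
    # Only plain (optionally schema-qualified) identifiers are interpolated.
    if not table.replace("_", "").replace(".", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    return f"""
DO $$
DECLARE
    pk_name text;
BEGIN
    -- contype = 'p' selects the primary-key constraint of this table.
    SELECT conname INTO pk_name
    FROM pg_constraint
    WHERE conrelid = '{table}'::regclass AND contype = 'p';

    IF pk_name IS NOT NULL THEN
        EXECUTE format('ALTER TABLE {table} DROP CONSTRAINT %I', pk_name);
    END IF;
END
$$;
""".strip()


print(drop_primary_key_sql("votos"))
```

Because the name is resolved at run time, a constraint that was renamed at some point cannot abort the drop halfway through the maintenance window.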

Code cutover (P9): queries, routers, Pydantic schemas, agent tools, classifiers,
pipeline FK-value resolution, and the frontend all move from int ids to UUID. The
Calunga tool JSON contract is preserved. Tests are UUID-aware; added
test_migrate_runner.py for the yoyo wrapper and its mark guard.

Destructive migrations remain manual, applied in a maintenance window against a
verified backup (not run by deploy.yml). Gates: ruff clean, 121 pytest pass, tsc clean.
The in-place int->uuid retrofit (0018-0030) is superseded by the greenfield
two-database rebuild. The reusable int->uuid code typing is kept; only the
in-place ALTER migrations are dropped.
UUID PKs everywhere, English snake_case identifiers, schema-qualified, explicit
constraint/index names. App DB (auth, chat) and civic DB (9 schemas, 24 tables)
with HNSW + FTS. Static reference seeds and the pt->en column mapping included.
Additive compose with db-app/db-civic and one pgbouncer per database for
blast-radius isolation (session pooling). make foundation-{up,schema,down,reset}
targets and the DB env vars documented in .env.example.
config.db_dsns() resolves the app/civic write/read DSNs (<DOMAIN>_WRITE/READ_DB_URL;
read defaults to write, and write falls back to the legacy DATABASE_URL). database.py
builds one pool per unique DSN, shared across roles and exposed through
get_app_pool/get_civic_pool(readonly=).
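
The fallback chain and the pool dedup can be sketched as below. The environment variable names follow the `<DOMAIN>_WRITE/READ_DB_URL` pattern from this PR; the return shape of `db_dsns`, the `pools_by_role` helper, and the `make_pool` factory are assumptions for illustration, not the actual config.py / database.py code.

```python
import os
from typing import Callable, Mapping


def db_dsns(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Resolve one DSN per role: read -> write -> legacy DATABASE_URL."""
    legacy = env.get("DATABASE_URL", "")
    dsns: dict[str, str] = {}
    for domain in ("APP", "CIVIC"):
        write = env.get(f"{domain}_WRITE_DB_URL") or legacy
        read = env.get(f"{domain}_READ_DB_URL") or write
        dsns[f"{domain.lower()}_write"] = write
        dsns[f"{domain.lower()}_read"] = read
    return dsns


def pools_by_role(dsns: Mapping[str, str], make_pool: Callable[[str], object]) -> dict[str, object]:
    """Create one pool per unique DSN and share it between roles."""
    pools_by_dsn: dict[str, object] = {}
    roles: dict[str, object] = {}
    for role, dsn in dsns.items():
        if dsn not in pools_by_dsn:
            pools_by_dsn[dsn] = make_pool(dsn)
        roles[role] = pools_by_dsn[dsn]
    return roles
```

With no replica configured, the read and write roles of a domain resolve to the same DSN and therefore the same pool, so the single-host setup does not open duplicate connections.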
…vic schema

First query slices rewritten to the English civic schema (ingestion.raw_ingestion,
legislative.government_entities, legislative.legislators); Baque _create_pool now
targets CIVIC_WRITE_DB_URL.
Relative known-item retrieval benchmark on the real corpus to pick the embedding
model for the DB foundation. Result: BGE-M3 is kept.
ParlamentarResponse fields and the deputados/senadores list+get endpoints now
speak the English civic schema and read via get_civic_pool. The /despesas
endpoints stay on the legacy path until the despesas slice. Fix the migrate-runner
test to assert has_post_legacy_migrations against a synthetic dir (the retrofit
migrations are gone).
despesas.py queries -> spending.ceap_expenses (join legislative.legislators),
English output aliases/keys; DespesaResponse, the deputados/senadores /despesas
endpoints, and the exportacao despesas/parlamentares CSV all move to English +
civic pool. tools.py despesa functions are left for the dedicated tools+prompts
sweep (still read the old keys; covered only by mocked tests).
upsert_* for legislative (bills, voting_sessions, legislator_votes, party_guidances)
and spending (corporate_card, budget, contracts, procurements, trips, amendments,
fiscal_data), plus companies reads, now target the English civic schema. Input data
dict keys stay pt (pipeline-supplied); only SQL identifiers changed.
suspeitas query module and router target analysis.anomalies joined to
spending.ceap_expenses / legislative.legislators; request/response fields and
output keys English; civic pool. The Gonguê classifier writers + the suspeitas
agent tool still read legacy keys (next slices).
The 6 classifiers read spending.ceap_expenses joined to legislative.legislators /
companies and check analysis.anomalies; tasks/ingestao.py persists to
analysis.anomalies and its pool targets CIVIC_WRITE_DB_URL. EN columns aliased back
to the names the classifier code / detalhes JSONB still use (alias cleanup deferred).
auth/conversas/feedback queries target auth.* / chat.* schemas; routers use
get_app_pool (read/write split); API contract English (conversation_id, title,
messages, last_message, feedback_type, category, comment, retry_after_seconds,
X-Conversation-Id). Tests updated to the English contract. End-to-end verified
against the foundation.
feed queries (publicar_evento, listar_feed, contar_feed, get_evento_por_id) target
feed.feed_events with English columns; router output keys and base event fields
English; civic pool. The rich 'data' enrichment payload keeps its pt contract via
SQL aliases (frontend + feed_enrichment slice deferred).
busca_hibrida and busca_universal target search.embeddings + spending/legislative/
companies/feed civic tables with English columns; embedding row aliased to the keys
the code reads; detail reads use English columns. The item output keys stay pt
(consumed by the agent tools, migrated in the tools sweep).
empresas/sancoes/candidatos_tse ingestion, the analysis + OCR persists, embeddings
generation (9 types), glossary embeddings, and the dagster feed publishers now target
the English civic schema. sanctions get a computed natural_key (R1). Enrichment SQL
aliases columns back to the keys the asset code reads.
All 16 tools: inline SQL migrated to the civic schema (federal_agencies, ceap_expenses,
voting_sessions, contracts, etc.), data reads aligned to the English query modules, and
the tool contract translated to English (mode/source/filters + per-tool output keys;
mode values list/ranking/item/summary/empty/error/legislators_by_vote/legislator_votes).
prompts.py field-name examples updated; test_tools.py asserts the English contract.
…migration

All agent tools read via get_civic_pool (buscar_empresa writes via the civic write pool
for the BrasilAPI cache); exportacao suspeitas CSV and the BrasilAPI company upsert target
the civic schema. conftest patches get_civic_pool/get_app_pool for the mocked tests.
Smoke-tested the tools end-to-end against the foundation.
Feed (events, event_type, title, description, source, data, reference_*), chat
(conversation_id, X-Conversation-Id, title, messages), auth error detail, and the
feedback body (feedback_type/category/comment) all use the English contract. Rich
feed 'data' payload (ator/acao/objeto) stays pt to match the deferred backend
enrichment slice. tsc clean.
BrasilAPI company dict keys match the DB path (legal_name, trade_name, ...);
busca_universal result items use English keys (type/summary/relevance + per-type).
All 16 tools' Pydantic Input schemas + function signatures + bodies + filter echoes use
English arg names (name/year/month/category/agency/group_by/bill_type/vote/...). Enum-like
arg VALUES stay pt (camara, Sim, pix, orgao). prompts.py arg references and test_tools.py
call kwargs updated. Smoke-tested arg binding via LangChain.
Pydantic models (Ator/Acao/Objeto/Evidencia/Contexto/DadosFeedRico) + feed_enrichment
+ all 3 builders (suspeitas asset, dagster feed, buscar_empresa, get_evento_por_id) +
frontend types and components use English field names (actor/action/object/evidence/
context/severity/contract_version; name/role/verb/amount/...). Free-form details bag and
enum string values stay pt. Fixed a stale row['relevancia'] read. tsc + 121 tests green.
Move embedding generation out of the API into a dedicated service: app.embeddings_server
exposes a TEI-compatible POST /embed (BGE-M3, 1024-dim). app.services.embeddings is now a
thin httpx client (same async API, graceful None on failure). Added to the foundation
compose (reuses the terreiro image; CPU local/arm64, EMBEDDINGS_DEVICE=cuda on a GPU host;
hf_cache volume). EMBEDDINGS_URL setting. Prod can swap to a TEI GPU image (same contract).
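
A rough sketch of the "thin client, graceful None on failure" idea. Note the swaps: it uses the standard library's `urllib` and is synchronous, whereas the PR's client is async and built on httpx. The function name `embed`, the injectable `post` parameter, and the default URL are hypothetical; the `{"inputs": [...]}` request body and list-of-vectors response follow the TEI `/embed` contract.

```python
import json
import urllib.request
from typing import Callable, Optional

EMBEDDINGS_URL = "http://127.0.0.1:8080"  # assumption: value of the EMBEDDINGS_URL setting


def _http_post_json(url: str, payload: dict, timeout: float):
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def embed(
    text: str,
    base_url: str = EMBEDDINGS_URL,
    post: Callable[[str, dict, float], object] = _http_post_json,
    timeout: float = 5.0,
    expected_dim: int = 1024,  # BGE-M3 dense vectors are 1024-dim
) -> Optional[list[float]]:
    """Return the embedding for `text`, or None if the service is unavailable."""
    try:
        body = post(f"{base_url}/embed", {"inputs": [text]}, timeout)
        vector = body[0]
        if len(vector) != expected_dim:
            return None
        return [float(x) for x in vector]
    except (OSError, ValueError, LookupError, TypeError):
        # Network errors, bad JSON, or an unexpected shape all degrade to None.
        return None
```

Returning `None` instead of raising lets callers skip the vector side of a search when the embeddings service is down, rather than failing the whole request.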
Embeddings now live in their own top-level component (mineiro/: own pyproject, uv.lock,
Dockerfile with the BGE-M3 weights baked in). Terreiro drops sentence-transformers/torch
(~4GB lighter image) and is now purely an HTTP client. Mineiro resolves device auto/cuda/
mps/cpu: RTX 3060 via Docker NVIDIA runtime (EMBEDDINGS_DEVICE=cuda + GPU reservation),
Apple GPU via 'make mineiro-dev' (native MPS, since Docker-on-Mac has no Metal passthrough),
cpu otherwise. Foundation compose builds ./mineiro; TEI-compatible /embed unchanged.
- auth/verify: move verification + navigation into useEffect (was calling setState/
  router.push during render, triggering the React 'setState in render' warning).
- agent stream: map provider exceptions to a pt-BR user message instead of leaking the
  raw error (e.g. Gemini 429/billing text); the real error is still logged server-side.
- chat error bubble: render the (now friendly) message without the technical prefix.
Pin the default model to gemini-2.5-flash-lite and disable the auto-promotion to Pro
(route_model now returns the configured default). OCR follows settings.default_model
instead of a hardcoded flash. fallback_model still degrades an explicit Pro to the cheap
tier. config default + .env.example updated; router tests adjusted.
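
The routing change can be pictured as follows. The function names `route_model` and `fallback_model` come from this commit, but their signatures are guesses, and `gemini-2.5-pro` as the Pro model identifier plus the "ends with -pro" check are assumptions made for the sketch.

```python
DEFAULT_MODEL = "gemini-2.5-flash-lite"
PRO_MODEL = "gemini-2.5-pro"  # assumption: identifier of the Pro tier


def route_model(question: str, default_model: str = DEFAULT_MODEL) -> str:
    """No auto-promotion: every question gets the configured default model."""
    return default_model


def fallback_model(model: str, default_model: str = DEFAULT_MODEL) -> str:
    """An explicitly requested Pro model degrades to the cheap tier."""
    return default_model if model.endswith("-pro") else model
```

Under these assumptions, `route_model("any question")` returns `gemini-2.5-flash-lite` regardless of the question, and only an explicit Pro request ever reaches `fallback_model`'s downgrade branch.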
ChatShell called refreshChats() (which calls startTransition) directly in the render body
on pathname change, triggering 'Cannot call startTransition while rendering'. Moved the
pathname-change detection into useEffect.
Routes standardized to en-US: deputados->deputies, senadores->senators, suspeitas->
anomalies (+/statistics), conversas/chats->conversations, mensagens->messages, exportar->
exports (expenses/anomalies/legislators.csv), despesas->expenses; path params and the cache
service's conversation_id updated to match. Frontend (actions, chat-api, chat proxy route)
points at the new paths. feed/auth/share/metrics were already English. Validated end to end
against the local foundation; 111 tests + tsc green.
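
For checking stale client paths after the rename, the segment mapping above can be written down as a small table. This is an illustrative helper only (`ROUTE_RENAMES` and `to_english_path` are hypothetical names, not part of the PR); it rewrites whole path segments and does not cover the renamed export file names or the added /statistics route.

```python
# Segment renames listed in this commit (pt-BR -> en-US).
ROUTE_RENAMES = {
    "deputados": "deputies",
    "senadores": "senators",
    "suspeitas": "anomalies",
    "conversas": "conversations",
    "chats": "conversations",
    "mensagens": "messages",
    "exportar": "exports",
    "despesas": "expenses",
}


def to_english_path(path: str) -> str:
    """Rewrite each legacy path segment to its en-US name; leave others as-is."""
    return "/".join(ROUTE_RENAMES.get(segment, segment) for segment in path.split("/"))


print(to_english_path("/deputados/123/despesas"))  # /deputies/123/expenses
```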
ChatIdPageInner called fetchChat (which calls setState and navigates) directly in the render body,
triggering 'Cannot update a component while rendering'. Moved the once-only load into
useEffect. This was the last render-time side-effect (verify + chat-shell already fixed).
Charts replayed their enter animation whenever the assistant streamed more text, because
chat-message re-parses the chart config (new object) each render and Recharts animates by
default. Memoize ChartRenderer by config value and disable Recharts animation, so a chart
stays stable while the rest of the message streams. Theme changes still update it (context
re-renders bypass memo).
The native API needs Redis (cache, rate limit, token quota, Celery broker); the foundation
compose only had the two databases + pgbouncers + Mineiro. Adds redis on 127.0.0.1:6380 so
'make foundation-up' brings up a complete local stack for end-to-end testing.
Adds db-app, db-civic, pgbouncer-app, pgbouncer-civic (env-driven scram auth, no committed
userlist) and the Mineiro embeddings service (GPU reservation, EMBEDDINGS_DEVICE=cuda) to
docker-compose.yml. api + dagster get APP_*/CIVIC_* DSNs + EMBEDDINGS_URL from .env and
depend on the new services; DEFAULT_MODEL default is now flash-lite. The legacy single db
stays as the rollback target (fall back via DATABASE_URL by unsetting the new DSNs).
Validated with 'docker compose config'.
…nd en-US REST routes

- Add Mineiro to the component glossary (BGE-M3 embeddings service with a
  TEI-compatible /embed)
- Describe the split into maracatu_app and maracatu_civic, each behind its
  own PgBouncer, with replica-ready DB URLs
- Update the AI agent row to Gemini 2.5 Flash-Lite (cheapest tier, no auto
  promotion to Pro)
- Redraw the production topology to show db-app/db-civic, both PgBouncers,
  embeddings and dagster
- Add schema/, mineiro/ and docker-compose.foundation.yml to the repo tree
- Add foundation-up, foundation-schema and mineiro-dev make targets
- Note that the BGE-M3 weights now live in the Mineiro image
Drops the alerts/suspeitas feed and the Gonguê classifiers to focus the product on chat +
LLM + RAG; the feed is being rethought. Removed: app/classifiers, feed_enrichment, ocr,
services/feed, queries/feed+suspeitas, routers/feed+suspeitas, schemas/feed, civic schema
009_feed + 010_analysis, the anomalies.csv export, the analise_suspeitas celery task, the
Dagster assets suspeitas/analise_recibos/feed_eventos_dagster (+ their job/beat entries),
and the agent tools buscar_suspeitas + consultar_recibo (16 -> 14 tools). Frontend: feed
pages/components, feed-api, FeedEvento types, landing feed + classifier sections. Chat
message feedback (like/dislike) stays. Full design preserved in docs/feed-and-anomalies.md
for a future rebuild. 98 backend tests + tsc green.
Service independence: each service gets its own .env.example (terreiro/cortejo/mineiro)
documenting only its vars, so each is extractable to its own repo. Terreiro Dockerfile drops
dev extras (pytest/ruff) and the tests/ copy from the runtime image (CI runs them separately)
for a slimmer production image. Cortejo (node-alpine + standalone) and Mineiro (uv multi-stage,
model baked) were already slim.
Service independence + cleanup: deleted feed/stale scripts (analisar_*, backfill_feed_ricos,
enviar_alerta_semanal, gerar_embeddings, ingestao_cnpj/sancoes/tse). Moved the live ones
(seed, seed_senado, migrate, apply_foundation_schema, embeddings_eval) into terreiro/scripts/;
they're baked into the terreiro image so the ./scripts volume mount is gone. Makefile +
migrate.py + apply_foundation_schema.sh + test_migrate_runner paths updated. 98 tests green.
Add versioned docs/architecture.md (overview, active components, two-database
data plane, RAG flow, stack, repo layout, why two compose files, en-US
conventions) and docs/cutover.md (DB-foundation cutover runbook).

Realign README to the chat+RAG product scope: drop Gonguê/classifiers,
anomaly feed, scikit-learn and the 16-tools claim; reflect 14 tools,
flash-lite, Mineiro, two databases, and terreiro/scripts.
The feed/analysis tables were already removed; this drops the now-empty CREATE SCHEMA for
them from the civic bootstrap, and deletes the migration-scaffolding MAPPING.md (the en-US
schema files are the source of truth now).
…ompose)

Consolidate to one docker-compose.yml that runs everywhere (embeddings default to CPU,
PgBouncers/embeddings expose loopback host ports for local native dev). A small
docker-compose.gpu.yml overlay adds the NVIDIA reservation + EMBEDDINGS_DEVICE=cuda on the GPU
host; deploy layers it via COMPOSE_FILE. Removes docker-compose.foundation.yml and
infra/pgbouncer/ (env-driven edoburu pgbouncer needs no mounted userlist). Makefile
foundation-* targets + apply_foundation_schema.sh point at the base compose; docs updated.
Validated locally: base infra up on CPU, pgbouncer scram OK, Mineiro CPU healthy.