DB Foundation: greenfield two-database rebuild (F1/F2 done, F3/F4 in progress) #36
Draft
whereisanzi wants to merge 39 commits into
Retrofit every integer-PK table to `id UUID PRIMARY KEY DEFAULT gen_random_uuid()` with no data loss, and update all dependent code to match. Foundation (P0): adopt the real yoyo-migrations runner. scripts/migrate.py is now a thin wrapper over `yoyo apply --batch` (state in _yoyo_migration, applies only unapplied migrations, each transactional with a .rollback.sql companion). A `--mark-only` mode (make db-migrate-mark) records the legacy migrations as applied without running them, guarded to refuse if 0018+ retrofit files are present so the conversion can never be silently skipped (override: --force-mark-all). Schema (P1-P8): 13 migrations (0018-0030) convert 25 integer-PK tables plus orgaos_federais. Mechanics keep the old int id and a new id_uuid side by side, backfill children (and the polymorphic embeddings/feed_eventos refs) by joining on the old int id, then cut over: drop int PK/FK/UNIQUE, swap to UUID, re-point FKs, rebuild the votos/orientacoes composite UNIQUEs. Pre-existing constraint drops resolve the real name from the catalog (DO blocks) instead of trusting Postgres default names, so a mid-window DROP cannot abort on a name mismatch. Code cutover (P9): queries, routers, Pydantic schemas, agent tools, classifiers, pipeline FK-value resolution, and the frontend all move from int ids to UUID. The Calunga tool JSON contract is preserved. Tests are UUID-aware; added test_migrate_runner.py for the yoyo wrapper and its mark guard. Destructive migrations remain manual, applied in a maintenance window against a verified backup (not run by deploy.yml). Gates: ruff clean, 121 pytest pass, tsc clean.
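The `--mark-only` guard described above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/migrate.py: the name `has_post_legacy_migrations` is taken from a later commit message, but its signature, the `guard_mark_only` helper, and the assumption that legacy migrations are numbered up to 0017 are all hypothetical.

```python
import re
from pathlib import Path

# Assumption: legacy migrations are numbered 0001-0017; 0018+ are the retrofit.
LEGACY_MAX = 17


def has_post_legacy_migrations(migrations_dir: Path, legacy_max: int = LEGACY_MAX) -> bool:
    """Return True if any file in the directory has a numeric prefix above legacy_max."""
    for path in migrations_dir.iterdir():
        match = re.match(r"(\d+)", path.name)
        if match and int(match.group(1)) > legacy_max:
            return True
    return False


def guard_mark_only(migrations_dir: Path, force: bool = False) -> None:
    """Refuse to mark migrations as applied when retrofit files are present.

    Marking 0018+ as applied without running them would silently skip the
    conversion, so the caller has to opt in explicitly with force=True
    (the hypothetical equivalent of --force-mark-all).
    """
    if has_post_legacy_migrations(migrations_dir) and not force:
        raise SystemExit(
            "refusing --mark-only: post-legacy migrations found; "
            "pass --force-mark-all to override"
        )
```

Keeping the detection in a pure function that takes a directory makes it easy to test against a synthetic migrations folder.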
The in-place int->uuid retrofit (0018-0030) is superseded by the greenfield two-database rebuild. The reusable int->uuid code typing is kept; only the in-place ALTER migrations are dropped.
UUID PKs everywhere, English snake_case identifiers, schema-qualified, explicit constraint/index names. App DB (auth, chat) and civic DB (9 schemas, 24 tables) with HNSW + FTS. Static reference seeds and the pt->en column mapping included.
Additive compose with db-app/db-civic and one pgbouncer per database for
blast-radius isolation (session pooling). make foundation-{up,schema,down,reset}
targets and the DB env vars documented in .env.example.
config.db_dsns() resolves app/civic write/read DSNs (<DOMAIN>_WRITE/READ_DB_URL, read defaults to write, write to legacy DATABASE_URL). database.py builds the roles deduped to one pool per unique DSN; get_app_pool/get_civic_pool(readonly=).
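The fallback chain described above (read defaults to write, write defaults to the legacy DATABASE_URL, one pool per unique DSN) might look like the sketch below. It only illustrates the resolution order; the function signatures and the returned dict shape are assumptions, not the actual config.py/database.py code.

```python
from typing import Mapping

DOMAINS = ("APP", "CIVIC")


def db_dsns(env: Mapping[str, str]) -> dict[str, dict[str, str]]:
    """Resolve write/read DSNs per domain with the documented fallbacks."""
    legacy = env.get("DATABASE_URL", "")
    resolved: dict[str, dict[str, str]] = {}
    for domain in DOMAINS:
        # Write falls back to the legacy single-database URL.
        write = env.get(f"{domain}_WRITE_DB_URL") or legacy
        # Read falls back to the write DSN (no replica configured).
        read = env.get(f"{domain}_READ_DB_URL") or write
        resolved[domain.lower()] = {"write": write, "read": read}
    return resolved


def unique_dsns(dsns: Mapping[str, Mapping[str, str]]) -> list[str]:
    """List each distinct DSN once, so a caller can open one pool per DSN."""
    seen: list[str] = []
    for roles in dsns.values():
        for dsn in roles.values():
            if dsn and dsn not in seen:
                seen.append(dsn)
    return seen
```

With only DATABASE_URL set, all four roles collapse to a single DSN and therefore a single pool, which is what keeps the legacy setup working as a rollback target.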
…vic schema First query slices rewritten to the English civic schema (ingestion.raw_ingestion, legislative.government_entities, legislative.legislators); Baque _create_pool now targets CIVIC_WRITE_DB_URL.
Relative known-item retrieval benchmark on the real corpus to pick the embedding model for the DB foundation; result kept BGE-M3.
ParlamentarResponse fields and the deputados/senadores list+get endpoints now speak the English civic schema and read via get_civic_pool. The /despesas endpoints stay on the legacy path until the despesas slice. Fix the migrate-runner test to assert has_post_legacy_migrations against a synthetic dir (the retrofit migrations are gone).
despesas.py queries -> spending.ceap_expenses (join legislative.legislators), English output aliases/keys; DespesaResponse, the deputados/senadores /despesas endpoints, and the exportacao despesas/parlamentares CSV all move to English + civic pool. tools.py despesa functions are left for the dedicated tools+prompts sweep (still read the old keys; covered only by mocked tests).
upsert_* for legislative (bills, voting_sessions, legislator_votes, party_guidances) and spending (corporate_card, budget, contracts, procurements, trips, amendments, fiscal_data), plus companies reads, now target the English civic schema. Input data dict keys stay pt (pipeline-supplied); only SQL identifiers changed.
suspeitas query module and router target analysis.anomalies joined to spending.ceap_expenses / legislative.legislators; request/response fields and output keys English; civic pool. The Gonguê classifier writers + the suspeitas agent tool still read legacy keys (next slices).
The 6 classifiers read spending.ceap_expenses joined to legislative.legislators / companies and check analysis.anomalies; tasks/ingestao.py persists to analysis.anomalies and its pool targets CIVIC_WRITE_DB_URL. EN columns aliased back to the names the classifier code / detalhes JSONB still use (alias cleanup deferred).
auth/conversas/feedback queries target auth.* / chat.* schemas; routers use get_app_pool (read/write split); API contract English (conversation_id, title, messages, last_message, feedback_type, category, comment, retry_after_seconds, X-Conversation-Id). Tests updated to the English contract. End-to-end verified against the foundation.
feed queries (publicar_evento, listar_feed, contar_feed, get_evento_por_id) target feed.feed_events with English columns; router output keys and base event fields English; civic pool. The rich 'data' enrichment payload keeps its pt contract via SQL aliases (frontend + feed_enrichment slice deferred).
busca_hibrida and busca_universal target search.embeddings + spending/legislative/companies/feed civic tables with English columns; embedding row aliased to the keys the code reads; detail reads use English columns. The item output keys stay pt (consumed by the agent tools, migrated in the tools sweep).
empresas/sancoes/candidatos_tse ingestion, the analysis + OCR persists, embeddings generation (9 types), glossary embeddings, and the dagster feed publishers now target the English civic schema. sanctions get a computed natural_key (R1). Enrichment SQL aliases columns back to the keys the asset code reads.
All 16 tools: inline SQL migrated to the civic schema (federal_agencies, ceap_expenses, voting_sessions, contracts, etc.), data reads aligned to the English query modules, and the tool contract translated to English (mode/source/filters + per-tool output keys; mode values list/ranking/item/summary/empty/error/legislators_by_vote/legislator_votes). prompts.py field-name examples updated; test_tools.py asserts the English contract.
…migration All agent tools read via get_civic_pool (buscar_empresa writes via the civic write pool for the BrasilAPI cache); exportacao suspeitas CSV and the BrasilAPI company upsert target the civic schema. conftest patches get_civic_pool/get_app_pool for the mocked tests. Smoke-tested the tools end-to-end against the foundation.
Feed (events, event_type, title, description, source, data, reference_*), chat (conversation_id, X-Conversation-Id, title, messages), auth error detail, and the feedback body (feedback_type/category/comment) all use the English contract. Rich feed 'data' payload (ator/acao/objeto) stays pt to match the deferred backend enrichment slice. tsc clean.
BrasilAPI company dict keys match the DB path (legal_name, trade_name, ...); busca_universal result items use English keys (type/summary/relevance + per-type).
All 16 tools' Pydantic Input schemas + function signatures + bodies + filter echoes use English arg names (name/year/month/category/agency/group_by/bill_type/vote/...). Enum-like arg VALUES stay pt (camara, Sim, pix, orgao). prompts.py arg references and test_tools.py call kwargs updated. Smoke-tested arg binding via LangChain.
Pydantic models (Ator/Acao/Objeto/Evidencia/Contexto/DadosFeedRico) + feed_enrichment + all 3 builders (suspeitas asset, dagster feed, buscar_empresa, get_evento_por_id) + frontend types and components use English field names (actor/action/object/evidence/context/severity/contract_version; name/role/verb/amount/...). Free-form details bag and enum string values stay pt. Fixed a stale row['relevancia'] read. tsc + 121 tests green.
Move embedding generation out of the API into a dedicated service: app.embeddings_server exposes a TEI-compatible POST /embed (BGE-M3, 1024-dim). app.services.embeddings is now a thin httpx client (same async API, graceful None on failure). Added to the foundation compose (reuses the terreiro image; CPU local/arm64, EMBEDDINGS_DEVICE=cuda on a GPU host; hf_cache volume). EMBEDDINGS_URL setting. Prod can swap to a TEI GPU image (same contract).
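A thin client with "graceful None on failure" could be sketched as below. To stay dependency-free the sketch swaps httpx for the standard library's urllib, run in a worker thread, so it is not the actual app.services.embeddings code. The function names, the default URL, and the injectable `post` parameter are assumptions; the `{"inputs": [...]}` request body follows the TEI /embed contract.

```python
import asyncio
import json
import urllib.request
from typing import Any, Callable, Optional

# Assumption: placeholder default; the real value comes from the EMBEDDINGS_URL setting.
EMBEDDINGS_URL = "http://127.0.0.1:8080"


def _post_json(url: str, payload: dict, timeout: float = 10.0) -> Any:
    """Blocking JSON POST using only the standard library."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


async def embed(
    texts: list[str],
    base_url: str = EMBEDDINGS_URL,
    post: Callable[[str, dict], Any] = _post_json,
) -> Optional[list[list[float]]]:
    """Return one vector per input text, or None if the service call fails."""
    try:
        # Run the blocking call in a thread so the event loop stays free.
        vectors = await asyncio.to_thread(post, f"{base_url}/embed", {"inputs": texts})
    except Exception:
        return None  # graceful degradation: callers treat None as "no embedding"
    if not isinstance(vectors, list) or len(vectors) != len(texts):
        return None
    return vectors
```

Returning None instead of raising lets callers skip the vector part of a search when the embeddings service is down, rather than failing the whole request.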
Embeddings now live in their own top-level component (mineiro/: own pyproject, uv.lock, Dockerfile with the BGE-M3 weights baked in). Terreiro drops sentence-transformers/torch (~4GB lighter image) and is now purely an HTTP client. Mineiro resolves device auto/cuda/mps/cpu: RTX 3060 via Docker NVIDIA runtime (EMBEDDINGS_DEVICE=cuda + GPU reservation), Apple GPU via 'make mineiro-dev' (native MPS, since Docker-on-Mac has no Metal passthrough), cpu otherwise. Foundation compose builds ./mineiro; TEI-compatible /embed unchanged.
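The auto/cuda/mps/cpu resolution mentioned above can be illustrated with a pure function. This is a sketch under assumptions: the function name, its parameters, and the choice to fall back to CPU when an explicitly requested accelerator is missing are hypothetical. In a real service the two availability flags would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
VALID_DEVICES = {"auto", "cuda", "mps", "cpu"}


def resolve_device(requested: str, cuda_available: bool, mps_available: bool) -> str:
    """Map an EMBEDDINGS_DEVICE-style setting to a concrete device name."""
    choice = requested.strip().lower()
    if choice not in VALID_DEVICES:
        raise ValueError(f"unknown device setting: {requested!r}")
    if choice == "auto":
        # Prefer an NVIDIA GPU, then the Apple GPU, then the CPU.
        if cuda_available:
            return "cuda"
        if mps_available:
            return "mps"
        return "cpu"
    # Assumption: an explicit accelerator that is not present degrades to CPU.
    if choice == "cuda" and not cuda_available:
        return "cpu"
    if choice == "mps" and not mps_available:
        return "cpu"
    return choice
```

Passing the availability flags in as arguments keeps the decision logic testable on machines without any GPU.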
- auth/verify: move verification + navigation into useEffect (was calling setState/router.push during render, triggering the React 'setState in render' warning).
- agent stream: map provider exceptions to a pt-BR user message instead of leaking the raw error (e.g. Gemini 429/billing text); the real error is still logged server-side.
- chat error bubble: render the (now friendly) message without the technical prefix.
Pin the default model to gemini-2.5-flash-lite and disable the auto-promotion to Pro (route_model now returns the configured default). OCR follows settings.default_model instead of a hardcoded flash. fallback_model still degrades an explicit Pro to the cheap tier. config default + .env.example updated; router tests adjusted.
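The routing behaviour described above could be summarized as in the sketch below. The names `route_model` and `fallback_model` come from the commit message, but their signatures, the Pro model identifier, and the substring check used to detect the Pro tier are assumptions made for illustration.

```python
DEFAULT_MODEL = "gemini-2.5-flash-lite"
PRO_MODEL = "gemini-2.5-pro"  # assumption: illustrative identifier for the Pro tier


def route_model(message: str, default_model: str = DEFAULT_MODEL) -> str:
    """Pick the model for a request.

    With auto-promotion disabled, the message content no longer influences
    the choice: the configured default is always returned.
    """
    return default_model


def fallback_model(model: str, default_model: str = DEFAULT_MODEL) -> str:
    """Degrade an explicitly requested Pro model to the cheap tier."""
    if "-pro" in model.lower():
        return default_model
    return model
```

Under this sketch, a long or complex prompt still routes to the default model, and only an explicit Pro request ever reaches `fallback_model`.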
ChatShell called refreshChats() (which calls startTransition) directly in the render body on pathname change, triggering 'Cannot call startTransition while rendering'. Moved the pathname-change detection into useEffect.
Routes standardized to en-US: deputados->deputies, senadores->senators, suspeitas->anomalies (+/statistics), conversas/chats->conversations, mensagens->messages, exportar->exports (expenses/anomalies/legislators.csv), despesas->expenses; path params and the cache service's conversation_id updated to match. Frontend (actions, chat-api, chat proxy route) points at the new paths. feed/auth/share/metrics were already English. Validated end to end against the local foundation; 111 tests + tsc green.
ChatIdPageInner called fetchChat (which setState/navigates) directly in the render body, triggering 'Cannot update a component while rendering'. Moved the once-only load into useEffect. This was the last render-time side-effect (verify + chat-shell already fixed).
Charts replayed their enter animation whenever the assistant streamed more text, because chat-message re-parses the chart config (new object) each render and Recharts animates by default. Memoize ChartRenderer by config value and disable Recharts animation, so a chart stays stable while the rest of the message streams. Theme changes still update it (context re-renders bypass memo).
The native API needs Redis (cache, rate limit, token quota, Celery broker); the foundation compose only had the two databases + pgbouncers + Mineiro. Adds redis on 127.0.0.1:6380 so 'make foundation-up' brings up a complete local stack for end-to-end testing.
Adds db-app, db-civic, pgbouncer-app, pgbouncer-civic (env-driven scram auth, no committed userlist) and the Mineiro embeddings service (GPU reservation, EMBEDDINGS_DEVICE=cuda) to docker-compose.yml. api + dagster get APP_*/CIVIC_* DSNs + EMBEDDINGS_URL from .env and depend on the new services; DEFAULT_MODEL default is now flash-lite. The legacy single db stays as the rollback target (fall back via DATABASE_URL by unsetting the new DSNs). Validated with 'docker compose config'.
…nd en-US REST routes
- Add Mineiro to the component glossary (BGE-M3 embeddings service with a TEI-compatible /embed)
- Describe the split into maracatu_app and maracatu_civic, each behind its own PgBouncer, with replica-ready DB URLs
- Update the AI agent row to Gemini 2.5 Flash-Lite (cheapest tier, no auto promotion to Pro)
- Redraw the production topology to show db-app/db-civic, both PgBouncers, embeddings and dagster
- Add schema/, mineiro/ and docker-compose.foundation.yml to the repo tree
- Add foundation-up, foundation-schema and mineiro-dev make targets
- Note that the BGE-M3 weights now live in the Mineiro image
Drops the alerts/suspeitas feed and the Gonguê classifiers to focus the product on chat + LLM + RAG; the feed is being rethought. Removed: app/classifiers, feed_enrichment, ocr, services/feed, queries/feed+suspeitas, routers/feed+suspeitas, schemas/feed, civic schema 009_feed + 010_analysis, the anomalies.csv export, the analise_suspeitas celery task, the Dagster assets suspeitas/analise_recibos/feed_eventos_dagster (+ their job/beat entries), and the agent tools buscar_suspeitas + consultar_recibo (16 -> 14 tools). Frontend: feed pages/components, feed-api, FeedEvento types, landing feed + classifier sections. Chat message feedback (like/dislike) stays. Full design preserved in docs/feed-and-anomalies.md for a future rebuild. 98 backend tests + tsc green.
Service independence: each service gets its own .env.example (terreiro/cortejo/mineiro) documenting only its vars, so each is extractable to its own repo. Terreiro Dockerfile drops dev extras (pytest/ruff) and the tests/ copy from the runtime image (CI runs them separately) for a slimmer production image. Cortejo (node-alpine + standalone) and Mineiro (uv multi-stage, model baked) were already slim.
Service independence + cleanup: deleted feed/stale scripts (analisar_*, backfill_feed_ricos, enviar_alerta_semanal, gerar_embeddings, ingestao_cnpj/sancoes/tse). Moved the live ones (seed, seed_senado, migrate, apply_foundation_schema, embeddings_eval) into terreiro/scripts/; they're baked into the terreiro image so the ./scripts volume mount is gone. Makefile + migrate.py + apply_foundation_schema.sh + test_migrate_runner paths updated. 98 tests green.
Add versioned docs/architecture.md (overview, active components, two-database data plane, RAG flow, stack, repo layout, why two compose files, en-US conventions) and docs/cutover.md (DB-foundation cutover runbook). Realign README to the chat+RAG product scope: drop Gonguê/classifiers, anomaly feed, scikit-learn and the 16-tools claim; reflect 14 tools, flash-lite, Mineiro, two databases, and terreiro/scripts.
The feed/analysis tables were already removed; this drops the now-empty CREATE SCHEMA for them from the civic bootstrap, and deletes the migration-scaffolding MAPPING.md (the en-US schema files are the source of truth now).
…ompose) Consolidate to one docker-compose.yml that runs everywhere (embeddings default to CPU, PgBouncers/embeddings expose loopback host ports for local native dev). A small docker-compose.gpu.yml overlay adds the NVIDIA reservation + EMBEDDINGS_DEVICE=cuda on the host; deploy layers it via COMPOSE_FILE. Removes docker-compose.foundation.yml and infra/pgbouncer/ (env-driven edoburu pgbouncer needs no mounted userlist). Makefile foundation-* targets + apply_foundation_schema.sh point at the base compose; docs updated. Validated locally: base infra up on CPU, pgbouncer scram OK, Mineiro CPU healthy.
DB Foundation — greenfield two-database rebuild
Supersedes the in-place UUID retrofit (PR #35, closed). Rebuilds the database as two UUID-keyed,
English-named databases fronted by pgbouncer, salvaging the int→uuid code typing from the retrofit.
Draft: not mergeable yet. The working tree is intentionally mid-cutover (some query modules
already target the new civic schema, so legacy code paths break against the old DB). The full API +
frontend rewrite and the cutover come in later commits. Detailed plans live in local
*.local.md (gitignored).
Done
- terreiro/schema/{app,civic}/: 2 databases, 11 schemas, 29 tables, all UUID PKs, explicit constraint/index names, HNSW + FTS, static seeds. pt→en column map in schema/MAPPING.md. Validated against pgvector pg16.
- docker-compose.foundation.yml (db-app + db-civic + one pgbouncer per DB, blast-radius isolation, session pooling). Replica-ready two-pool plumbing in config.py/database.py (<DOMAIN>_WRITE/READ_DB_URL, read defaults to write). make foundation-* targets.
In progress (F3/F4)
- raw_ingestao, entes, parlamentares; Baque pool → civic. Proven end-to-end with real Câmara data. Remaining: the other query modules + routers + Pydantic schemas + agent tools + prompts + the Cortejo frontend, all to English.
Conventions adopted
user-facing copy stays pt-BR.
🤖 Generated with Claude Code