Summary
Add first-class bill text excerpts as retrievable and citable evidence in chat answers, while keeping existing transcript utterance citations working unchanged. Use precomputed embeddings for bill excerpts for speed and stable citations.
Problem
Currently:
- Bills are scraped and stored in the DB with source_text available (schema/init.sql:129)
- kg_hybrid_graph_rag only returns transcript utterance citations from the sentences table - no bill-document citations
- Chat sources are utterance-centric (utterance_id, YouTube timestamp) - no bill evidence
- TranscriptIngestor creates bill rows from transcript legislation, but source_text is polluted with "audio"/"visual" modality strings instead of actual bill text
Goal
- Make bill text excerpts retrievable and citable as first-class evidence
- Keep existing transcript utterance citations working unchanged
- Use precomputed embeddings for bill excerpts (no on-demand embedding at query time)
Success Criteria
- Query about a bill returns at least one bill excerpt source when available
- Chat can cite both transcript utterances and bill excerpts in one answer
- Existing chat clients do not break if they only understand utterance sources
- End-to-end latency remains acceptable
Implementation Plan
Phase 1: Data Model + Migration
- Add new table bill_excerpts:
  - id TEXT PRIMARY KEY (stable ID: bex_<bill_id>_<chunk_index>)
  - bill_id TEXT NOT NULL FK -> bills(id)
  - chunk_index INTEGER NOT NULL
  - text TEXT NOT NULL
  - char_start INTEGER, char_end INTEGER
  - embedding vector(768) (precomputed)
  - tsv tsvector
  - source_url TEXT
  - created_at, updated_at
  - unique (bill_id, chunk_index)
- Indexes: ivfflat on embedding, GIN on tsv, btree on bill_id
- Trigger: bill_excerpts_tsv_trigger() to auto-populate tsv from text
Phase 2: Chunking + Embedding Pipeline
- Create chunker module: lib/bills/excerpt_chunker.py
  - Deterministic chunking (for stable IDs)
  - Default: split by paragraph, merge/split to ~900 chars, 150 char overlap
  - Skip tiny/noisy chunks, preserve offsets
- Extend BillIngestor in lib/processors/bill_ingestor.py:
  - After bill upsert, build chunks from source_text (fallback to description)
  - Batch-generate embeddings, upsert bill_excerpts
  - Safe re-run: upsert by (bill_id, chunk_index)
- Fix transcript-derived bill writes in lib/transcripts/ingestor.py:
  - Stop setting source_text to "audio"/"visual" modality strings
  - Set source_text only when real textual content exists
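The chunking rules above can be read in more than one way; the following is a minimal sketch of one reading, not the planned implementation. It slides a window of up to 900 characters over the text, ends each window at the last paragraph boundary it contains when possible, and steps back 150 characters for overlap. The names Excerpt, excerpt_id, chunk_text, and the min_chars threshold are assumptions made for illustration, and filtering of "noisy" chunks is left out because the plan does not define it.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Excerpt:
    chunk_index: int
    text: str
    char_start: int  # offset into the source text (inclusive)
    char_end: int    # offset into the source text (exclusive)


def excerpt_id(bill_id: str, chunk_index: int) -> str:
    """Stable ID in the bex_<bill_id>_<chunk_index> form from the data model."""
    return f"bex_{bill_id}_{chunk_index}"


def chunk_text(source: str, target: int = 900, overlap: int = 150,
               min_chars: int = 40) -> List[Excerpt]:
    """Deterministically split text into overlapping, offset-preserving chunks."""
    if not 0 <= overlap < target:
        raise ValueError("overlap must be non-negative and smaller than target")
    # Offsets where a paragraph ends (a blank line follows), plus end of text.
    boundaries = [m.start() for m in re.finditer(r"\n\s*\n", source)]
    boundaries.append(len(source))

    chunks: List[Excerpt] = []
    start = 0
    while start < len(source):
        hard_end = min(start + target, len(source))
        # Prefer the last paragraph boundary inside the window. Requiring it to
        # lie beyond start + overlap guarantees the next window moves forward.
        usable = [b for b in boundaries if start + overlap < b <= hard_end]
        end = usable[-1] if usable else hard_end
        raw = source[start:end]
        text = raw.strip()
        if len(text) >= min_chars:  # skip tiny chunks (assumed threshold)
            char_start = start + (len(raw) - len(raw.lstrip()))
            chunks.append(
                Excerpt(len(chunks), text, char_start, char_start + len(text))
            )
        if end >= len(source):
            break
        start = end - overlap
    return chunks
```

Because the function depends only on its inputs, re-running it on unchanged source_text yields the same chunk_index values and offsets, which is what keeps the bex_ IDs and the (bill_id, chunk_index) upsert stable across re-runs.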
Phase 3: Backfill Existing Bills
Add script: scripts/backfill_bill_excerpts.py
- Scan bills where source_text or description has usable content
- Chunk, embed, upsert
- Flags: --max-bills, --rebuild, --skip-embeddings, --only-missing
Phase 4: Retrieval Integration (Hybrid Graph-RAG)
- Extend lib/kg_hybrid_graph_rag.py:
  - Add _retrieve_bill_excerpts(...): vector similarity + BM25/FTS
  - Optional boost for seed legislation nodes
  - Add bill_citations to tool output with: citation_id, bill_id, bill_number, bill_title, excerpt, source_url, score
- Add knobs: max_bill_citations (default 8)
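To make the ranking step concrete, here is a hedged sketch of how vector and full-text scores might be merged once both candidate lists have been fetched. The plan does not specify a fusion method; the min-max normalization, the 0.6/0.4 weights, the seed_boost value, and the names Candidate and rank_bill_excerpts are all assumptions chosen for illustration. Only max_bill_citations and its default of 8 come from the plan.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    excerpt_id: str
    bill_id: str
    vector_score: float  # higher is better, e.g. cosine similarity
    text_score: float    # higher is better, e.g. a full-text rank


def _normalize(values: Sequence[float]) -> List[float]:
    """Min-max scale to [0, 1]; a constant list maps to all ones."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def rank_bill_excerpts(
    candidates: Sequence[Candidate],
    seed_bill_ids: FrozenSet[str] = frozenset(),
    max_bill_citations: int = 8,
    vector_weight: float = 0.6,   # assumed weight
    text_weight: float = 0.4,     # assumed weight
    seed_boost: float = 0.15,     # assumed boost for seed legislation nodes
) -> List[Tuple[float, Candidate]]:
    """Return the top candidates as (score, candidate), best first."""
    if not candidates:
        return []
    vec = _normalize([c.vector_score for c in candidates])
    txt = _normalize([c.text_score for c in candidates])
    scored = []
    for cand, v, t in zip(candidates, vec, txt):
        score = vector_weight * v + text_weight * t
        if cand.bill_id in seed_bill_ids:
            score += seed_boost
        scored.append((score, cand))
    # Break ties on excerpt_id so the ordering is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1].excerpt_id))
    return scored[:max_bill_citations]
```

The weights and boost would need tuning against real queries; the point of the sketch is only that both signals and the seed boost feed a single score before the max_bill_citations cutoff is applied.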
Phase 5: Chat Source/Citation Model Upgrade
- Update lib/chat_agent_v2.py:
  - Add source_kind enum: utterance | bill_excerpt
  - Add bill fields to source model
  - Support #src:bill:<bill_id>:<chunk_index> citation IDs
  - Merge transcript + bill citations in _sources_from_retrieval
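As an illustration of the citation ID format, the sketch below formats and parses #src:bill:<bill_id>:<chunk_index> IDs. The helper names and the SourceKind and BillCitation types are hypothetical. It assumes chunk_index is a non-negative integer and treats everything between the prefix and the last colon as the bill ID, so bill IDs that themselves contain colons still round-trip. Anything that does not match returns None, leaving the existing utterance citation handling untouched.

```python
import re
from enum import Enum
from typing import NamedTuple, Optional


class SourceKind(str, Enum):
    UTTERANCE = "utterance"
    BILL_EXCERPT = "bill_excerpt"


class BillCitation(NamedTuple):
    bill_id: str
    chunk_index: int


# Greedy bill_id so that the final ":<digits>" is always the chunk index.
_BILL_CITATION = re.compile(r"#src:bill:(?P<bill_id>.+):(?P<chunk_index>[0-9]+)")


def format_bill_citation(bill_id: str, chunk_index: int) -> str:
    return f"#src:bill:{bill_id}:{chunk_index}"


def parse_bill_citation(citation_id: str) -> Optional[BillCitation]:
    """Return the parsed bill citation, or None if this is not a bill citation ID."""
    match = _BILL_CITATION.fullmatch(citation_id)
    if match is None:
        return None
    return BillCitation(match.group("bill_id"), int(match.group("chunk_index")))
```

A caller in _sources_from_retrieval could try parse_bill_citation first and fall back to the current utterance path when it returns None.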
Phase 6: Agent Prompt + Tool Contract
- Update lib/kg_agent_loop.py tool schema with max_bill_citations
- Update system instructions to encourage bill-excerpt citations for bill-content questions
Phase 7: API + Frontend Compatibility
- Update api/search_api.py ChatSource model with optional bill fields + source_kind
- Frontend: show source badge, bill card with title + excerpt + link
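One way to keep utterance-only clients working is to make every new field optional and default source_kind to utterance. The sketch below uses a plain dataclass to show the idea; the actual ChatSource model in api/search_api.py may be defined differently, and the field names timestamp_seconds and to_payload, as well as the choice to omit unset fields from the payload, are assumptions.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class ChatSource:
    # Defaulting to "utterance" keeps existing producers valid unchanged.
    source_kind: str = "utterance"
    # Existing utterance fields (timestamp_seconds is an assumed name).
    utterance_id: Optional[str] = None
    timestamp_seconds: Optional[float] = None
    # New optional bill fields, mirroring the bill_citations tool output.
    bill_id: Optional[str] = None
    bill_number: Optional[str] = None
    bill_title: Optional[str] = None
    excerpt: Optional[str] = None
    source_url: Optional[str] = None


def to_payload(source: ChatSource) -> Dict[str, Any]:
    """Serialize a source, leaving out fields that are not set."""
    return {key: value for key, value in asdict(source).items() if value is not None}
```

With this shape, an utterance source serializes with no bill keys at all, while a bill excerpt source carries source_kind set to bill_excerpt so the frontend can choose the bill card.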
Phase 8: Tests
- Unit: chunker, upsert idempotency, retrieval ranking, citation parsing, mixed source serialization
- Integration: seed bill, query, verify bill_citations returned
- Regression: utterance-only flows unchanged
Phase 9: Rollout
- Feature flag: ENABLE_BILL_EVIDENCE (default off)
- Deploy sequence:
  - schema migration
  - ingestion + retrieval code
  - backfill excerpts
  - enable flag in staging, validate
  - enable in prod
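A minimal sketch of the flag check, assuming ENABLE_BILL_EVIDENCE is read from the process environment. The helper name bill_evidence_enabled and the set of accepted truthy values are assumptions; the only behavior taken from the plan is that the flag is off unless explicitly enabled.

```python
import os
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}  # assumed accepted values


def bill_evidence_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when ENABLE_BILL_EVIDENCE is explicitly switched on."""
    env = os.environ if env is None else env
    return env.get("ENABLE_BILL_EVIDENCE", "").strip().lower() in _TRUTHY
```

Gating the bill excerpt retrieval call on this check means the code can ship and the backfill can run before any bill citation reaches a chat answer.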
Files Likely Touched
- schema/init.sql
- schema/migrations/<new>_bill_excerpts.sql
- lib/processors/bill_ingestor.py
- lib/transcripts/ingestor.py
- lib/kg_hybrid_graph_rag.py
- lib/kg_agent_loop.py
- lib/chat_agent_v2.py
- api/search_api.py
- frontend/src/App.tsx
- new: lib/bills/excerpt_chunker.py
- new: scripts/backfill_bill_excerpts.py
- tests under tests/