Implement bill excerpts as citable evidence in chat

## Summary

Add first-class bill text excerpts as retrievable and citable evidence in chat answers, while keeping existing transcript utterance citations working unchanged. Use precomputed embeddings for bill excerpts for speed and stable citations.

## Problem

Currently:
- Bills are scraped and stored in DB with `source_text` available (`schema/init.sql:129`)
- `kg_hybrid_graph_rag` only returns transcript utterance citations from `sentences` table - no bill-document citations
- Chat sources are utterance-centric (`utterance_id`, youtube timestamp) - no bill evidence
- `TranscriptIngestor` creates bill rows for legislation referenced in transcripts, but `source_text` is polluted with `"audio"/"visual"` modality strings instead of actual bill text

## Goal

- Make bill text excerpts retrievable and citable as first-class evidence
- Keep existing transcript `utterance` citations working unchanged
- Use precomputed embeddings for bill excerpts (no on-demand embedding at query time)

## Success Criteria

- Query about a bill returns at least one bill excerpt source when available
- Chat can cite both transcript utterances and bill excerpts in one answer
- Existing chat clients do not break if they only understand utterance sources
- End-to-end chat latency shows no meaningful regression against the current utterance-only baseline (exact budget to be agreed)

---

## Implementation Plan

### Phase 1: Data Model + Migration

1. Add new table `bill_excerpts`:
   - `id TEXT PRIMARY KEY` (stable ID: `bex_<bill_id>_<chunk_index>`)
   - `bill_id TEXT NOT NULL` FK -> `bills(id)`
   - `chunk_index INTEGER NOT NULL`
   - `text TEXT NOT NULL`
   - `char_start INTEGER`, `char_end INTEGER`
   - `embedding vector(768)` (precomputed)
   - `tsv tsvector`
   - `source_url TEXT`
   - `created_at`, `updated_at`
   - unique `(bill_id, chunk_index)`

2. Indexes: ivfflat on `embedding`, GIN on `tsv`, btree on `bill_id`

3. Trigger: `bill_excerpts_tsv_trigger()` to auto-populate `tsv` from `text`

### Phase 2: Chunking + Embedding Pipeline

1. Create chunker module: `lib/bills/excerpt_chunker.py`
   - Deterministic chunking (for stable IDs)
   - Default: split by paragraph, merge/split to ~900 chars, 150 char overlap
   - Skip tiny/noisy chunks, preserve offsets

2. Extend `BillIngestor` in `lib/processors/bill_ingestor.py`:
   - After bill upsert, build chunks from `source_text` (fallback to `description`)
   - Batch-generate embeddings, upsert `bill_excerpts`
   - Safe re-run: upsert by `(bill_id, chunk_index)`

3. Fix transcript-derived bill writes in `lib/transcripts/ingestor.py`:
   - Stop setting `source_text` to `"audio"/"visual"` modality strings
   - Set `source_text` only when real textual content exists
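
As a rough illustration of the chunking rules in step 1, here is a minimal sketch of a deterministic, offset-preserving chunker. The names `Chunk`, `chunk_text`, and the `min_chars` threshold are hypothetical, and the sketch assumes paragraphs are separated by blank lines. It prefers to cut at the last paragraph break inside each ~900-character window and falls back to a hard cut when a single paragraph is longer than the window. Overlap is measured in raw characters, so a chunk may start mid-paragraph; the real module may handle this differently.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_index: int
    text: str
    char_start: int  # inclusive offset into the source text
    char_end: int    # exclusive offset into the source text


# Assumption: paragraphs are separated by one or more blank lines.
_PARA_BREAK = re.compile(r"\n\s*\n")


def chunk_text(source_text: str, target_chars: int = 900,
               overlap_chars: int = 150, min_chars: int = 40) -> list[Chunk]:
    """Split text into overlapping chunks; same input always gives same output."""
    if not 0 <= overlap_chars < target_chars:
        raise ValueError("overlap_chars must be in [0, target_chars)")

    n = len(source_text)
    # Candidate cut points: the start of each paragraph break, plus end of text.
    breaks = [m.start() for m in _PARA_BREAK.finditer(source_text)] + [n]

    chunks: list[Chunk] = []
    start = 0
    while start < n:
        hard_end = min(start + target_chars, n)
        # Prefer the last paragraph break inside the window, else cut hard.
        inside = [b for b in breaks if start < b <= hard_end]
        end = inside[-1] if inside else hard_end

        piece = source_text[start:end]
        if len(piece.strip()) >= min_chars:  # skip tiny or whitespace-only chunks
            chunks.append(Chunk(len(chunks), piece, start, end))

        if end >= n:
            break
        # Step back by the overlap, but always make forward progress.
        next_start = end - overlap_chars
        start = next_start if next_start > start else end
    return chunks
```

Because `chunk_index` is assigned in order over the kept chunks and the function has no hidden state, re-running it on unchanged `source_text` yields the same `(bill_id, chunk_index)` keys, which is what the upsert in step 2 relies on.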

### Phase 3: Backfill Existing Bills

Add script: `scripts/backfill_bill_excerpts.py`
- Scan `bills` where `source_text` or `description` has usable content
- Chunk, embed, upsert
- Flags: `--max-bills`, `--rebuild`, `--skip-embeddings`, `--only-missing`
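
A minimal sketch of how the flags above could be wired with `argparse`. The flag names come from this issue; the help text, the `build_parser` name, and the choice to make `--rebuild` and `--only-missing` mutually exclusive are assumptions, not settled behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="backfill_bill_excerpts.py",
        description="Chunk, embed, and upsert excerpts for existing bills.",
    )
    parser.add_argument("--max-bills", type=int, default=None,
                        help="Stop after processing this many bills.")
    parser.add_argument("--skip-embeddings", action="store_true",
                        help="Write excerpt rows without computing embeddings.")

    # Assumption: rebuilding everything and filling only gaps are
    # conflicting modes, so at most one of them may be passed.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--rebuild", action="store_true",
                      help="Regenerate excerpts even if they already exist.")
    mode.add_argument("--only-missing", action="store_true",
                      help="Process only bills that have no excerpts yet.")
    return parser
```

For example, `build_parser().parse_args(["--max-bills", "50", "--only-missing"])` would limit a run to 50 bills that have no excerpts yet.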

### Phase 4: Retrieval Integration (Hybrid Graph-RAG)

1. Extend `lib/kg_hybrid_graph_rag.py`:
   - Add `_retrieve_bill_excerpts(...)`: vector similarity + BM25/FTS
   - Optional boost for seed legislation nodes
   - Add `bill_citations` to tool output with: `citation_id`, `bill_id`, `bill_number`, `bill_title`, `excerpt`, `source_url`, `score`

2. Add a tuning knob: `max_bill_citations` (default 8)
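
This issue does not specify how the vector and full-text result lists should be merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than the chosen design. The function name and the constant `k = 60` are illustrative; `limit` plays the role of `max_bill_citations`.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60,
                           limit: int = 8) -> list[tuple[str, float]]:
    """Merge several ranked lists of excerpt IDs into one scored list.

    Each list contributes 1 / (k + rank) per item, with rank starting at 1,
    so items ranked highly by several retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, excerpt_id in enumerate(ranking, start=1):
            scores[excerpt_id] += 1.0 / (k + rank)

    # Sort by score descending; break ties by ID so output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:limit]
```

RRF only needs ranks, so it avoids having to normalize cosine distances against full-text scores. A seed-legislation boost could be added as a further additive term for excerpts whose `bill_id` matches a seed node.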

### Phase 5: Chat Source/Citation Model Upgrade

1. Update `lib/chat_agent_v2.py`:
   - Add `source_kind` enum: `utterance` | `bill_excerpt`
   - Add bill fields to source model
   - Support `#src:bill:<bill_id>:<chunk_index>` citation IDs
   - Merge transcript + bill citations in `_sources_from_retrieval`
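
A small sketch of parsing the `#src:bill:<bill_id>:<chunk_index>` citation format. The names `BillCitationRef` and `parse_bill_citation` are hypothetical. The sketch assumes a `bill_id` may itself contain colons, so the chunk index is taken as the final colon-separated field; anything that does not match is treated as "not a bill citation" and left for the existing utterance handling.

```python
import re
from typing import NamedTuple, Optional


class BillCitationRef(NamedTuple):
    bill_id: str
    chunk_index: int


# The greedy bill_id group lets the ID contain ':'; the index is the
# last field and must be plain ASCII digits.
_BILL_CITATION_RE = re.compile(r"#src:bill:(?P<bill_id>.+):(?P<chunk_index>[0-9]+)")


def parse_bill_citation(citation_id: str) -> Optional[BillCitationRef]:
    """Return the bill reference, or None if this is not a bill citation."""
    match = _BILL_CITATION_RE.fullmatch(citation_id)
    if match is None:
        return None
    return BillCitationRef(match.group("bill_id"), int(match.group("chunk_index")))
```

Returning `None` instead of raising keeps the utterance path untouched: callers can try the bill parser first and fall through to whatever utterance parsing already exists.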

### Phase 6: Agent Prompt + Tool Contract

1. Update `lib/kg_agent_loop.py` tool schema with `max_bill_citations`

2. Update system instructions to encourage bill-excerpt citations for bill-content questions

### Phase 7: API + Frontend Compatibility

1. Update `api/search_api.py` `ChatSource` model with optional bill fields + `source_kind`

2. Frontend: show source badge, bill card with title + excerpt + link
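
To illustrate the backward-compatibility goal, here is a sketch of a source model where every bill field is optional and `source_kind` defaults to `utterance`. It uses standard-library dataclasses purely for illustration; the real `ChatSource` in `api/search_api.py` may be defined differently, and the helper `to_payload` is hypothetical. Field names follow the `bill_citations` fields listed in Phase 4.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ChatSource:
    # Defaulting to "utterance" means existing producers need no changes.
    source_kind: str = "utterance"  # "utterance" | "bill_excerpt"
    # Existing utterance-centric field (other existing fields omitted here).
    utterance_id: Optional[str] = None
    # New, optional bill fields.
    bill_id: Optional[str] = None
    bill_number: Optional[str] = None
    bill_title: Optional[str] = None
    excerpt: Optional[str] = None
    source_url: Optional[str] = None


def to_payload(source: ChatSource) -> dict:
    """Serialize, dropping unset fields so utterance payloads stay small."""
    return {key: value for key, value in asdict(source).items() if value is not None}
```

With this shape, an utterance source gains only the `source_kind` key, so a client that ignores unknown keys keeps working, while a newer client can branch on `source_kind` to render a bill card.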

### Phase 8: Tests

- Unit: chunker, upsert idempotency, retrieval ranking, citation parsing, mixed source serialization
- Integration: seed bill, query, verify bill_citations returned
- Regression: utterance-only flows unchanged

### Phase 9: Rollout

1. Feature flag: `ENABLE_BILL_EVIDENCE` (default off)

2. Deploy sequence:
   - schema migration
   - ingestion + retrieval code
   - backfill excerpts
   - enable flag in staging, validate
   - enable in prod
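
A minimal sketch of reading the flag, assuming `ENABLE_BILL_EVIDENCE` is exposed as an environment variable (this issue does not say where the flag lives). The function name and the accepted truthy values are assumptions.

```python
import os
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}


def bill_evidence_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when the flag is explicitly switched on.

    Unset, empty, or unrecognized values all mean "off", which matches
    the default-off rollout described above.
    """
    env = os.environ if environ is None else environ
    return env.get("ENABLE_BILL_EVIDENCE", "").strip().lower() in _TRUTHY
```

Accepting an explicit mapping makes the check easy to exercise in tests without mutating the process environment.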

---

## Files Likely Touched

- `schema/init.sql`
- `schema/migrations/<new>_bill_excerpts.sql`
- `lib/processors/bill_ingestor.py`
- `lib/transcripts/ingestor.py`
- `lib/kg_hybrid_graph_rag.py`
- `lib/kg_agent_loop.py`
- `lib/chat_agent_v2.py`
- `api/search_api.py`
- `frontend/src/App.tsx`
- new: `lib/bills/excerpt_chunker.py`
- new: `scripts/backfill_bill_excerpts.py`
- tests under `tests/`
