Notes / reference-text store: deterministic filing cabinet on the document/RAG layer #19
… (v61)
Fixes conv-809: exact reference text (bios, pitches) filed under a label now retrieves rank-1 instead of being buried by semantic recall. Adds a column-weighted FTS5/BM25 lexical channel (its own candidate set, fused with semantic + ordered/contiguous phrase bonus), single-chunk notes with stable-id edit, orphan-free indexed delete, a document_manage LLM tool (save/list/delete with a two-step user-approved deletion), and a WebUI Notes tab. Splits the auth_db migration ladder into auth_db_migrations.c first. Schema v61 (document_chunks_fts + note_doc_id). Bundles a memory get-by-ID action.
Test: tests/test_document_search_bm25.c (the 809 repro) + 74/74 CI green; full 5-agent review applied; live-verified end-to-end (save/overwrite/list/two-step delete/save_text + WebUI).
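The ordered/contiguous phrase bonus mentioned above can be pictured with a small sketch. This is not the project's code: the function names, the 0/1/2 levels, and the fixed token limits are assumptions made for illustration, and the real channel also involves stemming, which is left out here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_TOKENS 64
#define SKETCH_MAX_TOKEN_LEN 32

/* Hypothetical helper: split `text` into lowercase alphanumeric tokens.
 * Tokens beyond max_tokens are dropped; long tokens are truncated. */
size_t sketch_tokenize(const char *text, char tokens[][SKETCH_MAX_TOKEN_LEN],
                       size_t max_tokens)
{
    size_t count = 0, len = 0;
    assert(text != NULL && tokens != NULL);
    for (const char *p = text;; p++) {
        unsigned char ch = (unsigned char)*p;
        if (isalnum(ch)) {
            if (count < max_tokens && len < SKETCH_MAX_TOKEN_LEN - 1) {
                tokens[count][len++] = (char)tolower(ch);
            }
        } else {
            if (len > 0) { /* a separator closes the current token */
                tokens[count][len] = '\0';
                count++;
                len = 0;
            }
            if (ch == '\0') {
                break;
            }
        }
    }
    return count;
}

/* Hypothetical scoring levels (an assumption, not the project's values):
 * 2 = query tokens appear contiguously and in order,
 * 1 = query tokens appear in order but with gaps,
 * 0 = neither. */
int sketch_phrase_level(const char *query, const char *text)
{
    char q[SKETCH_MAX_TOKENS][SKETCH_MAX_TOKEN_LEN];
    char t[SKETCH_MAX_TOKENS][SKETCH_MAX_TOKEN_LEN];
    size_t nq = sketch_tokenize(query, q, SKETCH_MAX_TOKENS);
    size_t nt = sketch_tokenize(text, t, SKETCH_MAX_TOKENS);
    if (nq == 0 || nq > nt) {
        return 0;
    }
    /* Contiguous: the query is one unbroken run inside the text. */
    for (size_t start = 0; start + nq <= nt; start++) {
        size_t k = 0;
        while (k < nq && strcmp(q[k], t[start + k]) == 0) {
            k++;
        }
        if (k == nq) {
            return 2;
        }
    }
    /* Ordered: the query is a subsequence of the text tokens. */
    size_t matched = 0;
    for (size_t i = 0; i < nt && matched < nq; i++) {
        if (strcmp(q[matched], t[i]) == 0) {
            matched++;
        }
    }
    return matched == nq ? 1 : 0;
}
```

A caller would map the returned level to whatever bonus the configured search weights assign; that mapping is deliberately left open here.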
…tool hardening
Phase 9 of the notes/reference-text store, plus pre-existing doc/tool-loop bugs the live test surfaced. No schema change (note_doc_id + FTS shipped in v61).
Extraction guard (memory_note_guard): keeps note-filed reference text, and its document_read/document_search tool-result echoes, out of session-end fact extraction, closing a live-proven leak where filed bios were re-mined into facts and drifted stale. Both provider tool-call shapes; copy-on-redact; config [memory] note_extraction_guard (default on).
Bridge (memory_note_bridge): one self-directing gloss fact per note, note_doc_id-linked, so a fuzzy "what's my bio" routes to the verbatim note. Glosses stay in the embedding cache (retrievable) but are skipped by paraphrase-dedup/find-duplicates via an s_cache flag, and exempt from prune/decay/find_similar/pattern-delete. set_note_doc_id ownership-validated; gloss label injection-filtered.
Admin: dawn-admin memory backfill-note-glosses.
Doc/tool-loop fixes: save buffer 4 KB→16 KB (DOCMGMT_SAVE_TEXT_MAX), save_text overwrite-by-label, dup-tool-call guard scoped to the current turn, tool_call_t.args_truncated refuses partial-arg execution, document_read/delete by id.
Tests: test_memory_note_guard (8), test_memory_note_bridge (6), test_llm_dup_check (4). 77/77 CI, format clean. Big-three reviewed (1 critical cache-flag init + 2 security-medium fixed). Live-verified end-to-end.
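The two tool-loop safeguards named in this commit (refusing truncated arguments, and scoping the duplicate-call check to the current turn) might look roughly like the sketch below. Only the `args_truncated` flag comes from the commit message; the struct layout, the `turn` counter, the verdict enum, and the function name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a recorded tool call; only args_truncated is
 * named in the commit message, the other fields are assumptions. */
typedef struct {
    const char *name;     /* tool name, e.g. "document_manage" */
    const char *args;     /* raw JSON argument string */
    bool args_truncated;  /* set when streamed args overflowed the buffer */
    int turn;             /* hypothetical per-conversation turn counter */
} sketch_tool_call_t;

typedef enum {
    SKETCH_RUN,
    SKETCH_REFUSE_TRUNCATED,
    SKETCH_SKIP_DUPLICATE
} sketch_verdict_t;

/* Decide whether `call` may execute, given the calls recorded so far. */
sketch_verdict_t sketch_check_tool_call(const sketch_tool_call_t *history,
                                        size_t history_len,
                                        const sketch_tool_call_t *call)
{
    assert(call != NULL && (history != NULL || history_len == 0));

    /* Never execute on partial arguments: a cut-off save would store
     * a truncated note body. */
    if (call->args_truncated) {
        return SKETCH_REFUSE_TRUNCATED;
    }
    for (size_t i = 0; i < history_len; i++) {
        if (history[i].turn != call->turn) {
            continue; /* an identical call in an earlier turn is legitimate */
        }
        if (strcmp(history[i].name, call->name) == 0 &&
            strcmp(history[i].args, call->args) == 0) {
            return SKETCH_SKIP_DUPLICATE;
        }
    }
    return SKETCH_RUN;
}
```

Scoping the comparison to one turn means a user can ask for the same list or read twice in a conversation without the second request being swallowed.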
…mory, search-weight settings
Document Library follow-ups (notes/reference store backlog A1–A3 + A5):
- Remember active tab across opens (persist state.scope to localStorage).
- Click a note name to open an inline read-only viewer (Close/Edit/Delete), keyboard-accessible, nested-Esc aware.
- Sticky panel: drop outside-click auto-close (mirrors the music panel) so an in-panel action spawning the confirm modal no longer closes the panel. The top-right trio stays mutually exclusive: scheduler now closes doc-library on open, matching memory. Drops to z-index 997 (like music) so the Settings panel slides out over it.
- Surface the 6 [documents] hybrid-search weights in WebUI settings (advanced). Wired the missing backend round-trip: webui_config.c set path, config_env.c JSON getter + TOML writer (parser + defaults already existed).
Test: debug build clean (no warnings); format --check --changed clean; the 3 JS files pass node --check; test_config_validate 18/18.
Add `dawn-admin memory rebuild-document-fts` — rebuilds document_chunks_fts
from scratch: recovery path for a partial v61 migration backfill or for FTS
orphans left by delete_indexed's OOM plain-delete fallback.
document_db_rebuild_fts() clears the contentless index ('delete-all', so no
original stems needed) then keyset-re-indexes every live chunk one per
lock-cycle (stem outside the leaf lock, reuse document_db_chunk_index_fts).
Global op (the FTS index spans users); admin opcode 0x92, no payload.
Test: 2 new cases in test_document_search_bm25 (from-scratch reindex +
orphan-scoping); 77/77 CI; build + format clean. Live-verified end-to-end:
wiped the FTS index to 0, rebuild restored all 285 chunks (~0.55s).
…uidance (B2)
document_manage gains 'edit' (find/replace) and 'append' so the LLM updates a
note by sending only a diff instead of resaving the whole text (the cause of the
earlier truncation incident). edit carries {find,replace} as a JSON-object tail
param (closes the ::-in-find silent-corruption vector); unique-match contract
(0/>1 occurrences refuse, never guess); in-place via document_note_update
(stable doc_id/FTS/gloss). Pure splice logic split into document_edit_ops.c +
unit-tested (13 cases).
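As an illustration of the unique-match contract described above, here is a minimal sketch of a find/replace-once splice. It is not the code in document_edit_ops.c: the function name is hypothetical, and the return convention (a count capped at 2, with an allocation failure signalled by a NULL output) is an assumption modelled on the contract as stated in this PR.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not the project's actual function.
 * Returns the number of occurrences of `find` in `text`, capped at 2:
 *   0 = not found, 1 = exactly one match, 2 = ambiguous (two or more).
 * Only on a return of 1 is *out set to a newly malloc'd string with the
 * single occurrence replaced; if that allocation fails, *out stays NULL.
 * In every other case *out is NULL and nothing is changed. */
int sketch_find_replace_once(const char *text, const char *find,
                             const char *replace, char **out)
{
    assert(text != NULL && find != NULL && replace != NULL && out != NULL);
    *out = NULL;

    size_t find_len = strlen(find);
    if (find_len == 0) {
        return 0; /* an empty needle can never be a unique match */
    }

    const char *first = strstr(text, find);
    if (first == NULL) {
        return 0;
    }
    /* Search again from one byte later so overlapping matches also
     * count as ambiguous. */
    if (strstr(first + 1, find) != NULL) {
        return 2;
    }

    size_t prefix_len = (size_t)(first - text);
    size_t replace_len = strlen(replace);
    size_t suffix_len = strlen(first + find_len);

    char *result = malloc(prefix_len + replace_len + suffix_len + 1);
    if (result == NULL) {
        return 1; /* one match, but the caller must check *out for OOM */
    }
    memcpy(result, text, prefix_len);
    memcpy(result + prefix_len, replace, replace_len);
    /* suffix_len + 1 also copies the terminating NUL */
    memcpy(result + prefix_len + replace_len, first + find_len, suffix_len + 1);

    *out = result;
    return 1;
}
```

Refusing on both 0 and 2 keeps the edit deterministic: the caller either gets exactly the splice it asked for or an error it can report back to the LLM, never a guess about which occurrence was meant.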
Extraction guard extended to redact edit/append content from history (collector
+ structural redactor) so dictated note text isn't re-mined into memory — live-
verified: 4 bodies redacted, zero content leaked to memory_facts. B2: one-home-
per-datum guidance in document_manage + memory remember descriptions. Also fixes
a pre-existing param_count off-by-one in the tool metadata.
Test: 78/78 CI; live-verified edit/append/not-found/B2 end-to-end on Sonnet 4.6.
…diting (B1b, v63)
B3 versioning: every destructive change (overwrite/edit/append/delete) snapshots the prior content to document_versions first, atomic with the mutation. Bounded by age (version_retention_days, default 14, swept in auth_maintenance) and a per-doc cap (version_keep_per_doc, default 10), both config. Surfaces: WebUI note/doc viewer "History" + Restore, a "Recently deleted" list, and the LLM document_manage actions list_deleted + recover. `recover <label>` is a full UNDO: it restores an existing item's previous version in place (toggles undo/redo), or re-creates a deleted one from its snapshot.
B1b multi-chunk editing: document_full_text stores canonical text on ingest (not notes, not globals); edit/append now accept multi-chunk docs and document_doc_update re-chunks + re-embeds outside the auth_db leaf lock, then swaps all chunks + FTS + full_text in one transaction (doc_id stable). Pre-v63 uploads prompt a re-save; no lossy backfill.
Schema v61→v63 (two idempotent gate-flagged migrations). Big-three reviewed (0 critical; per-doc cap→config, full_text skipped for globals, id guards, owner scoping). Versioning surfaced in tool descriptions; save_text steering sharpened.
Test: 78/78 CI (10 new doc/version cases incl. the destructive swap + undo round-trip). Live-verified end-to-end on real data incl. multi-chunk doc undo.
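The two retention bounds (age and per-document cap) can be sketched as a single pass over one document's snapshots. This is an illustration only: the struct and function names are hypothetical, and applying both bounds together in one function is an assumption; the commit only states that the age sweep runs in auth_maintenance and that both limits are configurable.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical in-memory stand-in for one row of document_versions. */
typedef struct {
    long version_id;
    time_t created_at;
} sketch_version_t;

/* Given one document's snapshots sorted newest-first, return how many to
 * keep. Everything from the returned index onward would be pruned.
 * retention_days and keep_per_doc mirror the version_retention_days and
 * version_keep_per_doc settings named in the commit message. */
size_t sketch_versions_to_keep(const sketch_version_t *versions, size_t count,
                               time_t now, int retention_days,
                               size_t keep_per_doc)
{
    assert(versions != NULL || count == 0);
    const double max_age_sec = (double)retention_days * 24.0 * 60.0 * 60.0;
    size_t keep = 0;

    /* Because the list is newest-first, the first snapshot that is too old
     * (or that exceeds the cap) ends the kept prefix. */
    while (keep < count && keep < keep_per_doc &&
           difftime(now, versions[keep].created_at) <= max_age_sec) {
        keep++;
    }
    return keep;
}
```

With the stated defaults (14 days, 10 per document), a note edited three times in the last two weeks keeps all three snapshots, while a snapshot from 20 days ago is dropped regardless of the cap.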
Holistic master-code-review of the notes_reference_store branch:
- XSS: escape quotes in title/aria-label attrs (LLM/cross-user labels)
- delete_indexed: ROLLBACK on failed delete (no orphan FTS/version rows)
- save_text overwrite goes in place (stable doc_id → undo/version chain)
- WebUI deleted-restore: document_index_text fallback for multi-chunk
- recover re-points version rows onto the new doc (no deleted-ghost/dup)
- guard: NULL-check JSON-null "type" (extraction-thread crash)
- skip empty/junk version snapshot; correct num_chunks on partial embed
- note_extraction_guard: wire missing config round-trip (WebUI save dropped it)
- config parity (version_* example, guard in settings schema)
Tests: +2 cases (version_reattach, set_num_chunks); 20/20 + 78/78 CI green.
Code Review by Qodo
Context used: ✅ Compliance rules (platform): 27 rules
1. docmgmt_find_replace_once() returns 1 on OOM
Pull request overview
Adds a deterministic notes / reference-text store on top of the existing document/RAG system so users can retrieve exactly what they saved under a label, backed by hybrid BM25 lexical + embedding search, in-place editing, and version history/restore. This also introduces a memory↔note bridge (gloss facts pointing to notes) and a note-extraction guard to prevent verbatim filed reference text from being re-mined into semantic memory.
Changes:
- Implement hybrid document search (BM25 candidate set + semantic fusion + phrase bonus) and expose tuning controls in config + WebUI.
- Add note save/update + document/note version history (undo/recover) with admin recovery commands (FTS rebuild, gloss backfill).
- Add Phase 9 memory integration: gloss facts linked by note_doc_id, extraction-time redaction guard, and improved duplicate-tool-call scoping + tool-arg truncation safety.
Reviewed changes
Copilot reviewed 66 out of 68 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| www/js/ui/settings/schema.js | Adds UI schema fields for note-extraction guard and hybrid-search/versioning tuning. |
| www/js/ui/scheduler-queue.js | Closes doc-library panel when scheduler popover opens to avoid slot conflicts. |
| www/js/dawn.js | Adds WebSocket message handlers for note save/update, version list/restore, deleted list. |
| www/index.html | Extends Document Library UI with search, tabs (Documents/Notes), note viewer/editor, deleted list toggle. |
| www/css/components/doc-library.css | Styles notes UI, adds scoped .hidden toggles, adjusts z-index layering. |
| www/css/base/components.css | Introduces shared .dawn-input styling and warning hint color. |
| tests/test_memory_provenance.c | Updates test DDL for note_doc_id bridge column. |
| tests/test_memory_note_guard.c | New tests for extraction redaction of filed note bodies (both provider shapes). |
| tests/test_memory_note_bridge.c | New lifecycle tests for memory→note gloss creation/rename/delete and owner scoping. |
| tests/test_llm_dup_check.c | New tests for turn-scoped duplicate-tool-call detection (OpenAI + Claude). |
| tests/test_llm_dup_check_stub.c | Link-time stubs to isolate llm_tools duplicate-check logic for testing. |
| tests/test_document_search_bm25.c | New tests asserting exact-label BM25 rank-1 behavior + versioning/full-text edit flows. |
| tests/test_document_manage.c | New tests for document_manage edit/append pure string ops contract. |
| tests/test_document_db_stub.c | Adds g_config stub for document_db versioning-dependent paths in tests. |
| tests/CMakeLists.txt | Wires new unit tests and adds required sources for BM25/stemming paths. |
| src/webui/webui_message_dispatch.c | Dispatches new doc-library note/version/deleted message types. |
| src/webui/webui_doc_library.c | Adds scoped list/search, note save/update endpoints, version list/restore, deleted list; uses delete_indexed and gloss deletion. |
| src/webui/webui_config.c | Parses/round-trips new memory/documents config keys from WebUI config JSON. |
| src/tools/tools_init.c | Registers the new document_manage tool. |
| src/tools/memory_tool.c | Adds memory tool action get and guidance to use notes for verbatim reference text. |
| src/tools/document_search.c | Reworks document_search into hybrid lexical+semantic with phrase bonus and configurable thresholds. |
| src/tools/document_read.c | Adds optional id parameter for deterministic read targeting; makes document optional. |
| src/tools/document_index_pipeline.c | Adds FTS indexing on ingest, note save/update, partial-embed num_chunks correction, full-text storage, and multi-chunk doc replace pipeline. |
| src/tools/document_edit_ops.c | New isolated translation unit for edit/append pure string operations. |
| src/memory/memory_note_bridge.c | New bridge to create/delete “gloss” facts linked to notes via note_doc_id. |
| src/memory/memory_extraction.c | Integrates note-extraction guard redaction into extraction input assembly. |
| src/memory/memory_embeddings.c | Extends embedding cache with note_doc_id and excludes glosses from dedup clustering/nearest-fact merge targets. |
| src/memory/memory_db.c | Exempts gloss facts from decay and low-confidence pruning. |
| src/memory/memory_db_facts.c | Factors BM25 MATCH expr builder out, adds note_doc_id setters/finders, excludes glosses from pattern bulk delete, returns note_doc_id in embeddings query. |
| src/memory/memory_callback.c | Adds memory.get action and shared numeric-ID list parser. |
| src/memory/memory_bm25.c | Adds shared memory_bm25_build_match_expr helper for FTS5 consumers. |
| src/llm/llm_tools.c | Blocks execution on truncated tool args; makes duplicate-tool-call detection turn-scoped. |
| src/llm/llm_streaming.c | Tracks tool-argument overflow during streaming and marks calls as truncated. |
| src/config/config_parser.c | Adds TOML parsing + key validation for new documents/memory settings. |
| src/config/config_env.c | Adds JSON/TOML round-trip support for new documents/memory settings. |
| src/config/config_defaults.c | Sets defaults for hybrid search, versioning, and note-extraction guard. |
| src/auth/auth_maintenance.c | Adds retention pruning sweep for document/note version history. |
| src/auth/auth_db_statements.c | Prepares new FTS/index/search statements and extends embeddings select with note_doc_id. |
| src/auth/admin_socket.c | Adds admin opcodes for gloss backfill and document FTS rebuild. |
| src/auth/admin_socket_memory.c | Implements gloss backfill and document FTS rebuild admin handlers. |
| include/webui/webui_doc_library.h | Declares new doc-library WebSocket handlers. |
| include/tools/document_manage.h | Declares document_manage tool register and edit/append helpers. |
| include/tools/document_index_pipeline.h | Declares note index/update and multi-chunk doc update APIs. |
| include/tools/document_db.h | Adds BM25 hit structs and versioning/full-text/edit APIs and constants. |
| include/memory/memory_note_guard.h | Declares extraction redaction guard API. |
| include/memory/memory_note_bridge.h | Declares memory→note bridge API. |
| include/memory/memory_db.h | Declares note_doc_id setter/finder for bridge. |
| include/memory/memory_db_embeddings.h | Extends embeddings retrieval API to include note_doc_id. |
| include/memory/memory_bm25.h | Declares memory_bm25_build_match_expr. |
| include/llm/llm_tools.h | Adds args_truncated marker to tool_call_t. |
| include/llm/llm_streaming.h | Adds streaming overflow tracking for tool args. |
| include/config/dawn_config.h | Adds documents hybrid-search/versioning config and memory note-extraction guard. |
| include/auth/auth_db_migrations.h | Declares migration ladder entry point (schema split). |
| include/auth/auth_db_internal.h | Bumps schema version and adds prepared statement slots for doc FTS + note edit. |
| include/auth/admin_socket.h | Adds admin opcodes for gloss backfill and FTS rebuild. |
| include/auth/admin_socket_internal.h | Declares new admin handler prototypes. |
| dawn.toml.example | Documents new config options for hybrid search, versioning, and extraction guard. |
| dawn-admin/socket_client.h | Adds client APIs for gloss backfill and FTS rebuild. |
| dawn-admin/socket_client.c | Implements new admin socket client calls. |
| dawn-admin/main.c | Adds CLI commands for gloss backfill and document FTS rebuild. |
| CMakeLists.txt | Adds new sources (migrations, guard, bridge) to the build. |
| cmake/DawnTools.cmake | Includes document_manage tool sources when document search tool is enabled. |
PR Summary by Qodo
Notes / reference-text store: deterministic filing cabinet on the document/RAG layer
Walkthroughs
Description
• Adds label-keyed **Notes** as single-chunk documents, retrievable deterministically; fixes conv-809.
• Implements **hybrid BM25 + semantic document search** with phrase bonus via new FTS5 index.
• Adds **document_manage** tool + WebUI Notes tab with editing, version history, and recovery.
Diagram
graph TD
User(["User / LLM"])
subgraph Tools["LLM Tools"]
DM["document_manage"]
DS["document_search"]
DR["document_read"]
MT["memory (+get)"]
end
subgraph Doc["Document / RAG"]
DIP["index pipeline"] --> DDB[("document_db")]
DDB --> FTS[("document_chunks_fts")]
DDB --> VER[("document_versions")]
DDB --> FTX[("document_full_text")]
end
subgraph Mem["Memory"]
BR["note bridge"] --> MF[("memory_facts\n(note_doc_id)")]
EXT["extraction"] --> NG["note guard"]
end
WebUI["WebUI Doc Library"] --> DDB
Admin["dawn-admin"] --> DDB
User --> DM --> DIP
User --> DS --> DDB
User --> DR --> DDB
DM --> BR
EXT --> NG
subgraph Legend
direction LR
_api(["API/Tool"]) ~~~ _db[("DB/Index")] ~~~ _mod["Module"]
end
High-Level Assessment
The following are alternative approaches to this PR:
1. Dedicated notes table separate from documents
2. Pure semantic retrieval with a label boost heuristic
Recommendation: The PR’s unified approach (notes as documents + an independent BM25 candidate set fused with semantic + phrase bonus) is the right strategy for deterministic “exact label” retrieval while reusing the existing document infrastructure. A separate notes table is the only substantial alternative, but the added subsystem duplication likely outweighs the modest complexity of the current filetype/invariant checks.
File Changes
Enhancement (43)
Bug fix (2)
Refactor (4)
Documentation (1)
Other (18)
- webui doc search: pull authoritative metadata (is_global, created_at) from the document row instead of the BM25 hit; force show_all off on a query so the user-scoped search can't emit bogus owner fields
- migrations: ROLLBACK on COMMIT failure in the v61 FTS backfill (+ the identical pre-existing v48 site) so a failed commit can't leave an open transaction during startup
- admin gloss backfill: NULL the freed notes buffer
- doxygen: complete @param/@return on document_manage.h + document_db.h
Skipped (documented intentional): find_replace_once {0,1,2} count contract (OOM via *out==NULL); auth_db_migrations.c append-only ladder size exception.
78/78 CI green, format clean.
Review dispositions (Copilot + Qodo) — applied in Most findings had inline threads (replied there). One Qodo summary-only item had no inline anchor:
Summary: 6 fixed, 3 skipped with reason.
Summary
Adds a notes / reference-text store — a deterministic "filing cabinet" on top
of the existing document/RAG store — so DAWN can give back exactly what the user
filed under a label, instead of a fuzzy top-K neighbor. Closes the conversation-809
gap where three canonical bios filed under distinct labels were buried under their
own semantic near-twins and couldn't be retrieved cleanly.
A note is just a single-chunk document whose filename is the user's label. The
capability is delivered by a column-weighted BM25 lexical channel (its own
candidate set, fused with semantic) plus surgical in-place editing and full version
history with undo/recover — all folded into the existing document tools, library
UI, and config rather than a new subsystem.
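To make the fusion idea concrete, the sketch below combines a semantic score, a normalized lexical score, and a phrase bonus into one ranking. The weighted-sum formula, the weight values, and every name in it are assumptions for illustration; the PR exposes the actual weights through the [documents] config section but does not spell out the formula here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical candidate record: one chunk surfaced by either channel. */
typedef struct {
    long chunk_id;
    double cosine;     /* semantic similarity in [0,1]; 0 if not a semantic hit */
    double bm25_norm;  /* lexical score normalized to [0,1]; 0 if not an FTS hit */
    int phrase_match;  /* 1 if query terms appear in order and contiguous */
    double fused;      /* filled in by sketch_fuse() */
} sketch_candidate_t;

/* Assumed weights, standing in for the configurable search weights. */
typedef struct {
    double w_semantic;
    double w_lexical;
    double phrase_bonus;
} sketch_weights_t;

/* qsort comparator: highest fused score first. */
static int cmp_fused_desc(const void *a, const void *b)
{
    double fa = ((const sketch_candidate_t *)a)->fused;
    double fb = ((const sketch_candidate_t *)b)->fused;
    return (fa < fb) - (fa > fb);
}

/* Score the union of both candidate sets and sort it best-first. */
void sketch_fuse(sketch_candidate_t *cands, size_t n, const sketch_weights_t *w)
{
    assert(w != NULL && (cands != NULL || n == 0));
    if (n == 0) {
        return;
    }
    for (size_t i = 0; i < n; i++) {
        cands[i].fused = w->w_semantic * cands[i].cosine +
                         w->w_lexical * cands[i].bm25_norm +
                         (cands[i].phrase_match ? w->phrase_bonus : 0.0);
    }
    qsort(cands, n, sizeof(cands[0]), cmp_fused_desc);
}
```

Because the lexical channel contributes its own candidates, a note whose label matches the query exactly can enter the ranking and win even when its embedding similarity is lower than that of its semantic near-twins.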
What's included
- Hybrid search: an FTS5 index (`document_chunks_fts`) with separately-weighted label/body columns; BM25 runs as an independent candidate set fused with cosine + an ordered/contiguous phrase bonus, so an exact-label note ranks first even with weak embedding similarity.
- Single-chunk notes (tokens ≤ max), `num_chunks == 1` invariant by construction on both the WebUI and tool paths.
- In-place editing: `edit` and `append` via a JSON-object `change` param; multi-chunk documents store canonical full text (`document_full_text`) and edit in place (re-chunk/re-embed → atomic swap, stable doc_id).
- Version history: every destructive change archives the prior content (`document_versions`); `recover` undoes the last change or brings a deleted item back. Retention by age + per-doc cap (both config).
- Memory bridge and extraction guard: a gloss fact redirects fuzzy "what's my bio?" to the exact note; the canonical body is kept out of `memory_facts`, and the extraction guard redacts filed bodies from session-end extraction so reference text isn't duplicated and re-mined.
- WebUI Document Library: read-only viewer, inline note editor, version History/Restore, "Recently deleted" recovery, and search-weight settings.
- `dawn-admin` FTS rebuild command (v61 recovery path).
Schema migrations
`v61` (hybrid FTS index + `note_doc_id` bridge column) → `v62` (`document_versions`) → `v63` (`document_full_text`). All idempotent, gate-flagged, with fresh-install `SCHEMA_SQL` parity.
`auth_db_schema.c` was split first (per the standing TODO): per-version migration blocks extracted into `auth_db_migrations.c`.
Review & hardening
The full branch went through per-commit big-three reviews plus a holistic master-code-review before merge. The final commit (`ef9ee51`) applies that pass:
- `save_text` overwrites in place so the version chain + undo survive; `recover` re-points version rows onto the re-created doc (no "Recently deleted" ghost or duplicate); skip empty/junk snapshots; correct `num_chunks` on partial embeds.
- NULL-check for a JSON-null `type` in the extraction guard.
- Wired the `note_extraction_guard` round-trip that a WebUI save was silently dropping; config-example + settings-schema parity.
Owner-scoping verified on every new SQL surface (no IDOR); auth_db leaf-lock
discipline held across all stem-outside-lock sites.
Testing
- `test_document_search_bm25`, `test_document_manage`, `test_memory_note_guard`, `test_memory_note_bridge` (exact-label rank-1, stable-id edit, version archive/survive-delete/owner-scope/cap, full-text round-trip, atomic doc-replace, recover/undo, version reattach, num_chunks correction, guard coverage across both provider shapes + all actions).
- Live-verified: doc edit/restore, undo-a-delete, and the memory→note redirect.
Notes
68 files, +11,226 / −2,724. Design doc archived at `atlas/dawn/archive/NOTES_REFERENCE_STORE_DESIGN.md`. Only the B4 structured-records follow-up is shelved (B1 surgical edit absorbed most of its motivation).