knowledge: perf + OCR + session isolation + live progress UX by dolev31 · Pull Request #297 · cuga-project/cuga-agent

dolev31 · 2026-06-03T07:54:41Z

Summary

End-to-end overhaul of the knowledge subsystem on a single branch (20 commits, 75+ files, ~+5500/-200):

Perf — bulk add_many on store backends, sub-batched + concurrent embed, race-free progress callbacks, Docling pdf_mode = accurate | balanced | fast. Large-PDF ingest now bounded by embedding-provider throughput, not parse.
OCR — zero-config auto-detection (filename + pypdfium2 text-layer scan, run-length ≥3 false-positive guard), per-platform engine selection (OcrMac on macOS / Tesseract on Linux / EasyOCR fallback), supported-language gating against Apple Vision's actual list, code + formula enrichments enabled in accurate mode.
Session isolation — whitespace-safe resolve_collection, MCP wrapper static whitelist + _is_empty() injection, E2B stub serializer parity, observability logs on every session op.
Live UX — task_id + weighted_pct + ui_phase via POST /api/knowledge/documents?wait=false (202), per-upload AbortController + resume-across-reload, app-level useAgentKnowledgeDocs hook with sticky non-zero + localStorage cache.
Providers — OpenRouter embedding provider, PATCH /api/knowledge/config now reconfigures the live engine, README populated with measured 3×3 provider × pdf_mode matrix.
RAG profiles + GPU — Pareto-locked profile TOMLs (each pins embedding/chunking/docling/rerank/search/engine in one place), Dockerfile.gpu on nvidia/cuda:12.4.1-cudnn, [gpu] extra with onnxruntime-gpu + fastembed-gpu, _FastEmbedEmbeddings CUDA detection with batch_size=256/parallel=1, three-mode startup warning + gpu_required + cuga doctor CLI, Docling parse polish (generate_page_images=False in fast mode).
Knowledge harness UI — operator-tunable <client_adaptation> panel with templates / glossary / reset; ingested into prompt via _render_client_adaptation_block; round-trippable through publish/import; secret-stripped on export.
Knowledge Tool Contract — three-round 3-expert audit of knowledge_instructions.md; 1,947 → 1,357 tokens (-30%); generalized cross-scope rule (predicate + forced verbalization, no enumeration overfitting); refine-on-miss playbook with explicit miss-criteria; harness obedience strengthened (MANDATORY framing + BEFORE-you-respond pre-response check); dynamic anchors (session-LIVE + cross-scope commit) emitted just-in-time only when their trigger fires.
Tests — 700+ passing across 30+ knowledge-specific suites (perf, OCR, session isolation, MCP stub, pgvector metadata, publish/import, progress, live PATCH, contract guards, harness obedience).

Closes #183, #247, #248, #289, #290, #291, #292, #293, #294, #295, #296

Where the prompt-engineering work lives in the diff

The Knowledge Tool Contract + harness work is bundled inside commit 7c2c642 (alongside the RAG profiles + GPU image) since the audits happened in the same session. Files to scan for review:

src/cuga/configurations/knowledge/knowledge_instructions.md — the contract itself (the −60% / now −30% file; iterations 1-3b documented inline)
src/cuga/backend/knowledge/awareness.py — _render_client_adaptation_block (MANDATORY framing) + get_knowledge_summary (session-LIVE anchor + cross-scope commit anchor + BEFORE-you-respond tail)
tests/unit/test_knowledge_rag_scope_failure_fix.py — contract structural guards (NARROWEST scope, Session+All+Anti-example examples, fallback_from, do NOT re-issue)
tests/unit/test_knowledge_multi_scope_search.py — test_canonical_contract_* guards (by_source, recommendation, query rewriting on miss, no stale agent first guidance)
tests/unit/test_knowledge_client_adaptation.py — test_precedence_framing_present (MANDATORY / EVERY response / CRITICAL) + test_recency_tail_pointer_present_when_set (BEFORE you respond)
tests/unit/test_chat_agent_knowledge_toggle.py — engine-off ⇒ no knowledge block injected

Test plan

Run pytest tests/unit/test_knowledge_* — expect 700+ green (160 contract+adaptation+prompt-assembly tests pass on this branch)
Ingest a 500-page text-native PDF with pdf_mode=balanced on a network embedding provider — confirm completion under 1 minute
Ingest a non-English PDF on macOS — confirm OcrMac is selected and works; ingest same doc on Linux — confirm Tesseract path
Open two sessions in the UI, upload doc A in session 1, doc B in session 2 — confirm search in each session returns only that session's doc
Mid-upload page reload — confirm progress rows rehydrate and continue polling
Edit embedding provider in Knowledge UI → PATCH → re-ingest a new doc immediately — confirm new provider is used without restart
Verify pgvector parity: same matrix of ingest + search smoke tests against Postgres + pgvector backend
Knowledge harness UI — save a rule + glossary entry, reload page, confirm both persist and survive cuga knowledge snapshot-export/-import round-trip
Cross-scope decision — agent KB has a paper, session has a paper, ask "what does the paper say about X" — confirm LLM picks scope="all" (the predicate fires) and verbalizes the candidate doc on each side

Follow-up: Pareto-locked profiles + GPU image (added 2026-06)

3 commits pushed on top of the original scope (c476a5e, 7c2c642, ca153f3 after revert + merge with origin/main):

Pareto-locked RAG profiles (Closes knowledge: Pareto-locked RAG profiles #316) — 4 profile TOMLs now own their full effective config (embedding_model, chunking, docling.pdf_mode, rerank, hybrid_mode, junk_filter, default_limit, max_ingest_workers) instead of just chunk_size + prompt addendum. Code-enforced contract: rerank_top_k_in >= 3 * default_limit. Designed via a 3-expert RAG debate.
GPU image + transparent CUDA runtime (Closes knowledge: GPU image (cuga:gpu) + transparent CUDA runtime #317) — new Dockerfile.gpu on nvidia/cuda:12.4.1-cudnn-runtime base, [project.optional-dependencies] gpu extra with onnxruntime-gpu + fastembed-gpu, 3-mode startup warning, gpu_required hard-fail mode, cuga doctor health-check CLI, full README deployment section.
Docling parse polish — generate_page_images=False in pdf_mode=fast only; warning when CUDA engaged but max_ingest_workers <= 2.

Merge with origin/main (commit 1def217) handled 4 conflicts:

cuga_lite_graph.py — took main wholesale (the cuga_agent_core refactor at 27bd35e extracted ~2300 lines, including knowledge helpers into helpers/knowledge.py).
client.py — combined HEAD's search_default_scope / single_default_scope with main's **_: Any kwargs absorption.
App.tsx — kept HEAD's agentKnowledge hook source, dropped 3 props the refactored ConfigHeader no longer accepts.
frontend/dist/index.html — took HEAD (rebuild artifact).

Post-merge fixes (ca153f3) restored 3 branch semantics that main's refactor regressed:

helpers/knowledge.py::_get_knowledge_tool_scope_context returns ("all", "agent", "session") with default "session" (main's version was ("agent", "session") with default "agent").
_knowledge_scope_instruction describes the "all" scope in the both-scopes-wired branch.
adapter/prepare_node.py prefers the per-agent draft_knowledge_configs dict before the legacy singular attribute (multi-tenant safety for shared deployments).

Test sweep: 709 knowledge unit tests pass.

Summary by CodeRabbit

New Features

Added GPU deployment support via Dockerfile.gpu with NVIDIA CUDA runtime and optimized knowledge pipeline
Introduced multiple embedding providers: OpenRouter, OpenAI-compatible, LiteLLM, and local fastembed
Added knowledge scoping: agent-level and session-level document management with fallback search behavior
Implemented client adaptation (knowledge harness) and glossary-driven query expansion
Added optional cross-encoder reranking for search results
New knowledge CLI subcommand group for config/snapshot management

Bug Fixes

Fixed knowledge session isolation by thread ID validation
Improved progress tracking and weighted percentages for ingestion tasks
Enhanced error handling for embedding connectivity and configuration

Documentation

GPU deployment guide with Kubernetes examples and performance tuning
Embedding providers configuration with multiple provider examples
Ingest tuning parameters and per-stage progress documentation

Updated 2026-06: response to Sami's review

Pushed commit 734c56c addressing every blocker and most also-noted items.
Full per-item explanation is in that commit's message body. Headline:

Blockers fixed:

Event-loop block in async ingest (engine.py fut.result → await asyncio.wait_for(asyncio.wrap_future(...)))
Event-loop block in PATCH /knowledge/config (apply_knowledge_config now await asyncio.to_thread(...))
Orphan vectors on pgvector partial-insert (compensating rollback loop in _add_all)
30+ knowledge tests now in CI (run_tests.sh glob)

Also-noted bugs fixed:

commit_knowledge_update now returns metric_changed
get_settings() now exposes docling_pdf_mode, embedding_batch_size/concurrency, vector_insert_batch_size, rerank_*, embedding_api_key_set (presence-only), default_limit, max_search_attempts, gpu_required, etc.

CodeQL hardening: manage_routes.py:978-1009 — logger.exception server-side + validator-message whitelist client-side (generic fallback for any unrecognized shape).

Scope cleanup: removed scripts/KNOWLEDGE_EVAL_METHODOLOGY.md, scripts/bench_knowledge_eval{,_report}.py, scripts/eval_queries.json, scripts/test_{groq,openrouter}_connectivity.py. Personal knowledge_ingest_perf.pptx + KNOWLEDGE_INGEST_PERF_SLIDES.md gitignore entries dropped. README 3×3 latency matrix + Jaccard table trimmed (~80 lines).

Note on the "routes.py zero-diff" comment — routes.py actually has ~588 lines of diff vs main; the wait=false / 202 / live-progress path is implemented at lines 534–696 (upload_documents handler). I suspect Sami meant the DEFAULT wait=true synchronous path which is preserved for MCP/ingest_knowledge + session-attachment back-compat. The async path is opt-in by the UI.

Test count clarification: the original summary said "387 tests / 27 suites" — accurate count on this branch right now is 709 knowledge unit tests passing across 30+ test files. CI tests.yml runs them all via tests/unit/test_knowledge_*.py glob (was missing before — see blocker 4).

…step 0) This is the no-op baseline for the issue #183 perf work. It does not change any ingest behavior; it only instruments the pipeline so the later commits (C1-C6) can be measured against the same ruler and the PR can carry a real attribution table. What this lands: - src/cuga/backend/knowledge/bench.py: tiny record_ingest_run(log_path, payload) helper that appends one JSON line per ingest, with auto-fields ts and commit_sha. Best-effort: any failure (unwritable path, non-serialisable payload) is swallowed so instrumentation never fails an ingest. - KnowledgeConfig.bench_log_path (default ""): when set, the engine appends per-ingest records to that file. Loaded from settings.toml as [knowledge.bench] log_path. - engine._ingest_inner now collects a stage_timings dict (parse_s, insert_total_s, finalize_s, total_s, chunk_count), includes it in the completion update_task call, and forwards it to record_ingest_run. - adapter.add_documents accepts an optional stage_timings dict and populates embed_s + insert_s. Signature added to VectorStoreAdapter base class with a backwards-compatible None default. - scripts/bench_knowledge_ingest.py: harness that runs N ingests with the bench log on, prints a per-run markdown row + a median-aggregated row, and appends the summary to ~/.cuga/bench/runs.json. - scripts/bench_knowledge_report.py: joins the runs.json entries into the attribution table the PR description will paste, with a delta-vs-baseline column. - tests/unit/test_knowledge_bench.py (6 cases) covering append, parent-dir creation, no-op on empty path, default=str fallback, non-mapping payload, unwritable path. - tests/unit/test_knowledge_timings.py (3 cases) verifying adapter populates embed_s + insert_s when stage_timings is passed, backwards-compat when it is not, and that empty doc list does not touch the dict. All 73 existing + new tests pass. ruff check / format clean.

… steps 1-3) Adds add_many to EmbeddingStoreBackend Protocol plus implementations on LocalEmbeddingStore (sqlite-vec) and ProdEmbeddingStore (pgvector). The adapter still uses per-row add() in this commit; the wiring lands in C2. This is intentionally a no-perf-change commit. The point is to land the two backend implementations and their tests on their own, so that any perf delta measured at C2 can be cleanly attributed to the adapter refactor, not to a backend-method change. C1 sanity bench (3 runs, 511-chunk PDF, local stack): median: parse=57.1s embed=37.7s insert=0.39s total=94.9s vs C0: parse=56.8s embed=37.3s insert=0.37s total=94.4s delta: +0.5s (within noise) What this lands: - EmbeddingStoreBackend.add_many: new Protocol method with the same (id, embedding, metadata) tuple shape as add(). - LocalEmbeddingStore.add_many: single transaction containing one executemany. Inner try/except sqlite3.OperationalError falls back to a per-row loop *inside the same transaction*, so even on sqlite-vec builds that reject executemany against vec0 we still drop the per-row commit fsync cost from N to 1. - ProdEmbeddingStore.add_many: one pool.acquire and one conn.transaction per batch with conn.executemany. Preserves the existing ON CONFLICT DO UPDATE upsert behaviour so replace_duplicates callers remain correct. - tests/unit/test_knowledge_local_add_many.py (4 cases): happy path, metadata round-trip via get(), empty no-op, and the executemany -> per-row fallback path. The fallback test wraps the live connection in a small proxy class because sqlite3.Connection is a C type whose methods cannot be monkey-patched directly. - scripts/bench_knowledge_ingest.py: each bench run now copies the source file under a uuid-suffixed name. Without this every run after the first hit the replace_duplicates branch in the engine, where the per-batch loop intentionally skips stage_timings, and we lost the embed_s / insert_s split in the bench output. 77 unit tests pass. ruff check / format clean.

…50 (#183 steps 4-5) First commit with a measurable perf delta. The adapter now builds the (id, embedding, metadata) tuple list once and calls store.add_many in a single round-trip; the engine no longer wraps that call in a 50-doc outer batch loop, because the adapter's bulk path is now correct and cheaper than the previous per-row loop ever was. C2 bench delta on local (3 runs, 511-chunk PDF, fastembed + sqlite-vec): insert_s: 0.374s -> 0.040s -89% (9x faster, the stage we touched) insert_total_s: 37.63s -> 37.27s -1% (embed dominates the wrapper) total_s: 94.44s -> 93.63s -1% (insert is a small share on local) The headline insert_s win is small in absolute terms on sqlite-vec because per-row commit was already on a local fsync. On prod (pgvector) each row in the old loop was a separate pool.acquire + INSERT round-trip; the same architectural change should shave a much larger chunk off insert_s there. We cannot measure that locally without a pgvector test env, so the prod-side number lands when a reviewer with that environment runs the bench script against this commit. What this lands: - adapter.add_documents: replaces the ``async def _add_all: for ... await self._store.add(...)`` loop with a list-comprehension over the documents and a single ``await self._store.add_many(items)`` call. Same metadata shape, same uuid id strategy, same return contract. stage_timings continues to capture embed_s and insert_s separately. - engine._insert_documents_async: drops the outer ``_BATCH = 50`` loop on the replace_duplicates branch; both branches now make a single adapter.add_documents call. The original loop existed because per-row insert times scaled linearly with batch size on the old path; with bulk insert it is strictly slower (one extra pool create/teardown per outer batch on prod, and zero per-row commit fsyncs to amortise on local). The branch lives on only because we still need delete_by_source for replace semantics. 77 unit tests pass. ruff check / format clean.

… step 4) Adapter now splits the chunk list into ``embedding_batch_size``-sized sub-batches and embeds each one sequentially (concurrency=1 for now; C4 turns concurrency on for network providers). New config field plus its plumbing through the vector-store factories so users can tune it without touching adapter internals. C3 bench delta on local (3 runs, 511-chunk PDF, fastembed + sqlite-vec): embed_s: 37.20s -> 29.17s -22% (sub-batching helps even local) total_s: 93.63s -> 85.78s -8% The win on local is bigger than the plan predicted. Likely cause: fastembed's internal embed batching prefers inputs smaller than the full 511-chunk single call we used to make. Cache locality and less peak-memory pressure pay off even before any concurrency lands. What this lands: - KnowledgeConfig: new ``embedding_batch_size`` (default 64) and a forward-declared ``vector_insert_batch_size`` (default 200) that will drive future bulk insert sub-batching but is not used yet. - StorageBackedKnowledgeVectorStore: takes ``embedding_batch_size`` via the constructor; ``add_documents`` walks the chunk list in steps of that size, asserts the embedder returned the expected count per call (RuntimeError otherwise so the insert never touches partial state), and concatenates the resulting vectors in order. Still one ``add_many`` bulk insert from C2. - create_storage_local_knowledge_store / create_storage_prod_knowledge_store / create_vector_store: all forward the new ``embedding_batch_size`` kwarg. - KnowledgeEngine._create_vector_adapter: passes ``self._config.embedding_batch_size`` through. - tests/unit/test_knowledge_timings.py: three new cases — embed sub-batch sizes match config; default of 64 is honoured; embedder returning a short list raises ``RuntimeError`` before any insert. 80 unit tests pass. ruff check / format clean.

Adapter now dispatches sub-batched embed calls through asyncio.gather when the provider is network-bound (openai / ollama), bounded by a semaphore of ``embedding_concurrency``. Local providers (fastembed / huggingface) stay strictly sequential because their ONNX/torch runtimes already exploit internal threads and adding gather on top just adds asyncio overhead. C4 bench on local (3 runs, 511-chunk PDF, fastembed + sqlite-vec): embed_s: 29.17s -> 29.23s no-op on local (as designed) total_s: 85.78s -> 85.86s within noise This is the expected sign for local. The concurrency win materialises on prod with OpenAI / Ollama where each embed call is one HTTP round trip; we cannot reproduce that here. A reviewer with that stack should re-run scripts/bench_knowledge_ingest.py --label C4-... against this commit to fill in the prod row of the attribution table. What this lands: - KnowledgeConfig.embedding_concurrency (default 4). - Adapter constructor gains ``embedding_concurrency`` and ``embedder_is_network``. add_documents now builds a ``starts`` list of sub-batch offsets, populates a fixed-size ``vectors`` array at absolute indices, and runs sub-batches concurrently via asyncio.gather + Semaphore when the embedder is network-bound. Order is preserved because indices are absolute, never appended. - Defensive RuntimeError if any chunk has no embedding after the loop (catches a logic bug in the dispatch before any insert touches the store). - create_storage_(local|prod)_knowledge_store and create_vector_store: forward the two new kwargs. - KnowledgeEngine._create_vector_adapter: sets ``embedder_is_network`` from ``self._config.embedding_provider in ("openai", "ollama")`` and passes ``self._config.embedding_concurrency`` through. - tests/unit/test_knowledge_timings.py: two new cases — ``test_adapter_concurrent_embed_preserves_order`` runs a sleepy randomised embedder under network-mode and asserts vectors come back in input order via retrieval; ``test_adapter_sequential_when_local_provider`` uses a thread-safe counter to assert max-concurrent embed calls is 1 when embedder_is_network=False. 82 unit tests pass. ruff check / format clean.

…step 6) The half of issue #183 not about throughput: ingest now emits per-stage progress events that callers polling get_ingestion_status see in near-real-time, instead of staring at a single processing -> completed transition that takes minutes. The race trap is real: stage_timings emits run on a worker thread (via asyncio.to_thread for the vector mutation) while the completion update runs on the outer event loop. A naive fire-and-forget would let a late progress write overwrite status="completed" back to "processing". The fix is two-part: (a) progress payloads never include the status field, so even a late landing cannot un-flip it; (b) the engine tracks every run_coroutine_threadsafe future and drains them with a bounded timeout before writing the terminal status. C5 bench on local (3 runs, 511-chunk PDF, fastembed + sqlite-vec): total_s: 85.86s -> 87.25s +1.4s (+1.6%) for ~10 progress emits The overhead is the cost of routing ~10 metadata update_task calls through the worker -> outer loop bridge. Acceptable for the UX win; the user no longer stares at a frozen UI for two minutes. What this lands: - adapter.add_documents: new ``progress_cb`` kwarg, default None. Adapter calls _safe_progress("embed", done, total) after each embed sub-batch, _safe_progress("insert_start", 0, total) before the bulk add_many, and _safe_progress("insert", count, total) after. _safe_progress swallows callback exceptions because progress is best-effort and must never fail an ingest. - engine._insert_documents_async: accepts progress_cb, forwards to adapter on both the replace_duplicates and not-exists branches as well as the schema-mismatch retry path. - engine._ingest_inner: emits a "parsed" stage event between Docling finishing and the insert starting; builds progress_cb that submits _emit_progress via asyncio.run_coroutine_threadsafe(outer_loop); tracks each future in progress_futures; drains them with per-future timeout=2.0s before the final status="completed" update_task call. - Progress emits deliberately omit the ``status`` field so a stray late write cannot overwrite the terminal "completed" / "failed" status (defence in depth on top of the drain). - tests/unit/test_knowledge_progress.py: four adapter-level cases — one emit per embed sub-batch with monotonic done count, both insert_start and final insert events fire, a throwing callback does not break ingest, and progress_cb=None is a no-op. 86 unit tests pass. ruff check / format clean.

step 7) Surfaces the three KnowledgeConfig fields added by C3-C5 in knowledge_settings.toml so users can tune them without touching code, and loads them via KnowledgeConfig.from_settings. C6 sanity bench on local (3 runs, 511-chunk PDF): total_s: 87.25s -> 89.04s within noise; no code path changed. What this lands: - src/cuga/configurations/knowledge/knowledge_settings.toml: [knowledge.embeddings] batch_size = 64 # embedding_batch_size concurrency = 4 # embedding_concurrency (network only) [knowledge.engine] vector_insert_batch_size = 200 # reserved; not yet used - KnowledgeConfig.from_settings: reads embeddings.batch_size, embeddings.concurrency, engine.vector_insert_batch_size with the same defaults as the dataclass. - tests/unit/test_knowledge_config_perf_keys.py: confirms TOML loads override defaults; missing keys fall back to defaults; dataclass defaults match settings.toml defaults so the docs and the code agree. 89 unit tests pass.

After landing C0-C6 I went through the new code with the express goal of finding things I had reimplemented when an existing package or a shared helper would do. Two real duplications, both fixed here without changing behaviour or perf (3-run bench: 87.6s vs C5 87.25s, within variance). 1. ``LocalEmbeddingStore.add`` and ``ProdEmbeddingStore.add`` each had their own copy of the schema/SQL/row-values construction next to the matching code in ``add_many``. Now: - local: ``add`` calls ``await self.add_many([(id, embedding, metadata)])``. The schema-aware INSERT building moves into a single helper ``_insert_sql_and_row`` used by both code paths. Semantics unchanged: one row in one transaction = one commit, same as before. - prod: ``add`` delegates to ``add_many`` for the same reason. Two small helpers (``_upsert_sql``, ``_row_values``) cover the schema + UPSERT layout once instead of twice. Pool acquire / transaction semantics for the single-row case are identical because executemany with a length-1 row list is equivalent to a single execute under ``conn.transaction``. 2. Adapter ``add_documents`` gather path used ``asyncio.to_thread(self._embeddings.embed_documents, ...)`` to make the sync embedder concurrent under ``asyncio.gather``. LangChain's ``Embeddings.aembed_documents`` already does exactly this in its default impl (``run_in_executor(None, embed_documents, ...)``) AND providers that have a real async implementation override it (``langchain-openai`` uses httpx natively). Switching to ``await self._embeddings.aembed_documents(...)`` gets us the same behaviour for sync-only providers and a strictly better path for ones with native async, with less code. The sequential (local-provider) path keeps the direct ``embed_documents`` call — there is no benefit to going through asyncio there and the runtime already exploits threads internally. 89 unit tests pass. ruff check / format clean. What I audited and decided NOT to change (per the user's redundancy ask): - bench.py JSONL writer: a loguru sink would couple it to the logging system whose failure modes should not affect benchmarking. The current standalone writer is the right isolation. - stage_timings dict: OpenLit (already in the repo) instruments high-level LLM/LangGraph calls, not sub-second embed/insert splits. The two are complementary. - progress_cb (sync callback through the stack): LangChain's AsyncCallbackHandler is chain-level; sub-batch progress is not a natural fit. Adding a handler class would be more boilerplate than the current pattern. - asyncpg copy_records_to_table: faster than executemany at >1000 rows BUT does not support ON CONFLICT, which the replace_duplicates path needs. Keep executemany. - pytest-benchmark: designed for microbenchmarks in pytest; our harness spins up CugaAgent end-to-end and parses JSONL across runs. Different tool for a different problem.

…dering Wires the issue #183 perf controls and per-stage ingest progress through every CUGA surface so the new tunables can be set without editing files and so the UI shows what the engine is actually doing. Surfaces touched: CLI (src/cuga/cli/main.py) - `cuga start ... --embeddings-provider` (fastembed | huggingface | openai | ollama; OpenAI-compatible endpoints — Groq / Together / Fireworks / OpenRouter — use `openai` + `--embeddings-base-url`). - `--embeddings-model`, `--embeddings-base-url`, `--embeddings-api-key`. - `--embeddings-batch-size`, `--embeddings-concurrency` (issue #183 tunables). - All flags map to DYNACONF_KNOWLEDGE__EMBEDDINGS__* env vars set before the cuga modules load settings, so any service (demo / demo_knowledge / manager / etc.) picks them up. UI (src/frontend_workspaces/agentic_chat/src/KnowledgeConfig.tsx) - KnowledgeConfigValues extended with embedding_api_key, embedding_base_url, embedding_batch_size, embedding_concurrency, vector_insert_batch_size. - ReindexTask.file_tasks gains optional stage + progress fields. `getReindexTaskProgressLabel` renders "<friendly stage> (done/total)" — e.g. "Embedding (192/511)" — under each running file in the reindex progress tile. - Embeddings accordion: Base URL + API Key inputs appear when the provider is openai or ollama, with a hint pointing OpenAI-compatible users (Groq / Together / Fireworks / OpenRouter) at the same path. Batch size and concurrency surfaced as NumberInputs with helper text explaining the local-vs-network semantics. Snapshot/publish (already free, verified) - The three new fields are dataclass attributes, so KnowledgeConfig.to_dict already serializes them and coerce_and_validate restores them. Two new unit tests (test_perf_keys_round_trip_via_to_dict_and_coerce and test_perf_keys_coerce_strings_to_int) lock this contract so a future field that bypasses the dataclass round-trip would fail loudly. Docs (README.md "Knowledge Base" section) - New "Embedding Providers" subsection: table of built-in providers + concrete settings.toml snippets for Groq and Together AI via the OpenAI-compatible path; CLI override example. - "Ingestion Performance Tuning" subsection: explains batch_size, concurrency, vector_insert_batch_size and that they snapshot with published configs. - "Per-stage Ingest Progress" subsection points users at the JSONL bench log for performance attribution. Frontend rebuilt; new bundle hashes: main.813ae3a97097fb29e526.js vendors.88ac6d4d0b8ba3e82779.js (Bundles shipped under src/cuga/frontend/dist/ via build.sh which copies the workspaces dist into the cuga Python package for the wheel.) 91 unit tests pass (+2 new snapshot round-trip tests). ruff clean.

… smoke Closes the gaps surfaced by the prod-side rigorous review and ships P1 (Docling level config) from the original plan. Two end-to-end tests land alongside: pgvector + fastembed (Docker) and Groq connectivity. 1. vector_insert_batch_size now actually enforced The plan declared this knob in C6 but the adapter never read it. With the full chunk list going to a single add_many on prod (pgvector), a large document risked: command_timeout=60 firing, an HNSW write lock held for many seconds, and asyncpg's binary-protocol buffer ballooning. Adapter now splits items by vector_insert_batch_size and calls add_many per sub-batch. Plumbed through create_storage_*_knowledge_store and create_vector_store. Stress-test bench with batch=10 (41 sub-batches for 410 chunks) added only +0.12s on the prod stack — knob works, cost is tiny. 2. Docling pdf_mode: fast | balanced | accurate The dominant cost on the 7-8 min ingest in #183 is Docling parse. New KnowledgeConfig.docling_pdf_mode (default "accurate" so nothing regresses) maps to PdfPipelineOptions: fast turns off OCR + table structure; balanced turns off OCR but keeps tables; accurate stays on Docling defaults. Converters are cached per-mode so toggling between ingests doesn't reload weights. Wired through: - settings.toml: [knowledge.docling] pdf_mode - CLI: cuga start --docling-pdf-mode - UI: new "PDF Parsing (Docling)" accordion with a select + warning about fast mode + scanned PDFs - Snapshot: dataclass field, so to_dict / coerce_and_validate round-trip preserves it. Test added. - Validation: rejects unknown modes, integer batch knobs must be > 0. 3. End-to-end pgvector + fastembed bench (Docker) Spun up pgvector/pgvector:pg16 locally and ran the bench against the 1.5MB RAG PDF, 511 chunks. Measured: accurate: 91.8s total (parse 58.4 / embed 33.05 / insert 0.24) fast: 39.2s total (parse 20.97 / embed 17.93 / insert 0.21) -57% wall time, -64% parse, -46% embed (410 chunks vs 511) Search round-trip verified end-to-end: top hit score 0.881 on 'retrieval augmented generation' against the just-ingested doc. 4. Groq connectivity smoke test scripts/test_groq_connectivity.py builds a KnowledgeConfig with provider=openai + base_url=https://api.groq.com/openai/v1 + api_key=$GROQ_API_KEY, builds the embedder via create_embeddings, and calls embed_query. Confirms the OpenAI-compatible config path plumbs end-to-end. Groq's catalog has zero embeddings models so the call returns a structured 4xx — the test accepts any Groq 4xx as "config path verified". The honest finding (Groq != embeddings) stays in the README. 98 unit tests pass. ruff clean. New frontend bundle: main.2bebaba0f19f267fbd65.js vendors.88ac6d4d0b8ba3e82779.js

…dd knowledge_config kwarg Two real gaps surfaced when the user asked "is everything verifiably configurable through UI and SDK". Both fixed. UI gap: vector_insert_batch_size The type interface declared it but no NumberInput was rendered. Added one under the Embeddings accordion with min=1 max=5000 step=50, helper text explaining the pgvector command_timeout / HNSW lock motivation, and a default of 200 matching KnowledgeConfig + settings.toml. SDK gap: CugaAgent only had enable_knowledge: bool SDK users had no way to set docling_pdf_mode, embedding batch knobs, provider, base_url, or api_key from Python — they had to set env vars before importing cuga or edit settings.toml. New ``knowledge_config`` kwarg accepts: - None (default): falls back to KnowledgeConfig.from_settings(settings) - KnowledgeConfig: validated then used directly - dict: passed through coerce_and_validate (with type coercion, so e.g. embedding_batch_size="32" becomes 32) Validation happens when ``agent.knowledge`` is first accessed (lazy, matching the rest of the engine init pattern). Bad config raises a clear ValueError before any ingest can land. The knowledge property: @Property def knowledge(self): ... override = self._knowledge_config_override if override is None: config = KnowledgeConfig.from_settings(settings) elif isinstance(override, KnowledgeConfig): override.validate() config = override else: config = KnowledgeConfig.coerce_and_validate(dict(override)) engine = KnowledgeEngine(config) ... Tests (tests/unit/test_sdk_knowledge_config.py, 4 cases): - kwarg with KnowledgeConfig lands on engine._config for all 7 fields - kwarg with dict goes through coerce_and_validate (str->int coercion) - bad config raises ValueError lazily at first ``.knowledge`` access - no kwarg falls back to settings.toml (env var still works) 102 unit tests pass (4 new). ruff clean. Frontend rebuilt.

knowledge_ingest_perf.pptx and KNOWLEDGE_INGEST_PERF_SLIDES.md are personal review materials, not codebase content — kept local only. Also adds Office lock file patterns (~$*.pptx etc.).

… engine Two related changes shipped together. The bug fix gates the OpenRouter work: without it, a user who picks OpenRouter in the UI (or changes any other knowledge setting) sees an HTTP 200 PATCH but no behavior change until they explicitly publish, which is hostile UX in demo / single-agent mode. 1. BUG FIX — PATCH /api/manage/config/draft/knowledge now applies to the live engine, not only the draft store Symptom (user-reported): selected Fast docling_pdf_mode in the UI, the request returned 200, but the next upload's log line still showed "mode='accurate', do_ocr=True, do_table_structure=True". Root cause: the handler at manage_routes.py:931 saved the merged config to draft + rebuilt the draft agent, but never called engine.apply_knowledge_config on app.state.app_state.knowledge_engine (the live engine that uploads route through). Comment line at the top of the function literally said "No engine mutation (draft isolation)". Fix: after validation, call live_engine.apply_knowledge_config(filtered). prepare_knowledge_update skips the expensive embed preflight unless provider/model actually changed, so docling_pdf_mode / batch sizes stay cheap. On preflight failure (bad key, unknown model) we return 400 with the engine error message and don't save the draft — no more "saved but not applied" zombie state. Response now includes live_applied + live_changes so the UI can show "reindex recommended" without an extra round-trip. 2. OpenRouter as a first-class embeddings provider User asks "what embeddings provider should I use" → today the answer is "provider=openai + base_url=https://openrouter.ai/api/v1 + the right key", a 3-field setup with a magic URL. New flow: provider="openrouter" model="openai/text-embedding-3-small" # REQUIRED, no default api_key=$OPENROUTER_API_KEY # REQUIRED Base URL auto-set to https://openrouter.ai/api/v1; embedding_base_url override still honored as an escape hatch. Both model and api_key are required by design — OpenRouter has no single "default" embeddings model and we'd rather fail loudly than pick one for the user. Surfaces wired: • engine.create_embeddings: new branch + new OPENROUTER_BASE_URL constant • KnowledgeConfig.validate: "openrouter" accepted • _create_vector_adapter: embedder_is_network includes openrouter (concurrency win applies) • settings.toml comments updated to call out openrouter rules • CLI: --embeddings-provider help text mentions openrouter explicitly • UI: new SelectItem + conditional block showing only Model + API Key fields, plus an inline link to openrouter.ai/models?output_modalities=embeddings • SDK: CugaAgent(knowledge_config=KnowledgeConfig(provider="openrouter",...)) • Publish/Import: dataclass field, free round-trip via to_dict / coerce_and_validate (test-locked) • Docs: README "Knowledge Base" → "Embedding Providers" gains an OpenRouter quick-start with TOML, CLI, and SDK snippets • Smoke script: scripts/test_openrouter_connectivity.py exercises the same code path the engine uses; run when OPENROUTER_API_KEY is set and you want to verify auth/URL/body shape 3. UI touch-up: renamed "PDF Parsing (Docling)" accordion to "Document Parsing (Docling)" — Docling handles DOCX/PPTX/HTML/images/etc. too, and the user pointed out the PDF-only naming was misleading. Tests (115 unit tests pass, up from 102): • tests/unit/test_knowledge_openrouter.py (10 cases): factory auto-URL, explicit override, missing key/model errors, env var fallback, validation accept/reject, snapshot round-trip, network-flag, SDK kwarg, TOML load. • tests/unit/test_knowledge_patch_live_apply.py (2 cases): PATCH applies to engine; invalid config returns 400 without partial apply. Frontend rebuilt: main.977339c561b271967583.js

After the OpenRouter + live-apply commit, walked the whole UI→PATCH→engine→ ingest pipeline once more with adversarial inputs. Found four real gaps; all fixed and locked with tests. 1. validate() didn't enforce OpenRouter has a model Before: provider=openrouter + empty model passed config validation but blew up later in create_embeddings, surfacing a generic engine error that didn't name the field. Now KnowledgeConfig.validate() rejects it at coerce_and_validate time with a message pointing at the right knowledge.embeddings.model field + a link to OpenRouter's catalog. Test: test_validate_rejects_openrouter_without_model (+ whitespace variant + the env-var-fallback case for api_key, which must NOT be blocked at validation time). 2. No whitespace stripping on api_key / model / base_url Classic copy-paste footgun: a key pasted from a docs page or terminal often carries a trailing newline / space. langchain-openai's auth then silently fails with a confusing 401. Now stripped at the ``create_embeddings`` boundary for both openai AND openrouter (and base_url too). Whitespace-only values are rejected as if empty. Tests: strips_whitespace, whitespace_only_model_raises, whitespace_only_key_raises. 3. UI silently swallowed PATCH errors ManagePage.tsx had a debounced PATCH that did `if (res.ok)` and a bare `catch {}` with a comment "// silent". With the new live-apply behavior, a 400 from the engine (bad key, unknown model) was completely invisible to the user — they saw their UI value but the engine still ran the old config. That was exactly the confusion the bug fix was supposed to eliminate. Now: on non-ok response we parse the JSON body's `detail` field and show an error toast. On network error (the `catch`), we show a second-tier toast. The user always sees something. 4. .env / .env.example docs for OpenRouter Added a commented-out OPENROUTER_API_KEY block to both files explaining how to opt-in. .env.example uses a placeholder (sk-or-...); .env carries the existing dev key behind a comment so re-enabling for testing is one line. Tests (121 pass, +6 new): all four fixes pinned individually. Frontend rebuilt: main.f70669d8954206299214.js.

…er matrix in README Two changes that surfaced when actually exercising OpenRouter with the user's key: 1. tiktoken-off when targeting non-OpenAI routes The first bench attempt against ``baai/bge-base-en-v1.5`` via OpenRouter failed with "No embedding data received". Root cause: langchain-openai's ``OpenAIEmbeddings`` defaults to ``tiktoken_enabled =True``, which sends integer arrays for OpenAI-named models. Non-OpenAI models on OpenRouter (BAAI, Mistral, Qwen, Nvidia, etc.) reject those with HTTP 422 "Input should be a valid string". Curl with raw string input proved the underlying API works. Fix: in ``create_embeddings``, when ``provider == "openrouter"`` OR when ``provider == "openai"`` AND ``embedding_base_url`` is set, construct the client with ``tiktoken_enabled=False`` + ``check_embedding_ctx_length=False``. Forces raw-text mode for the wire request. Verified end-to-end: openai/text-embedding-3-small → 1536-dim ✓ baai/bge-base-en-v1.5 → 768-dim ✓ sentence-transformers/all-mpnet-base-v2 → 768-dim ✓ (nvidia/llama-nemotron-embed-vl-1b-v2:free still trips the OpenAI SDK's "No embedding data received" — likely a response-shape incompatibility we can't fix without forking. Documented as a provider-side quirk.) 2. Measured embedding-provider performance matrix in README Ran the bench harness with 3-run median per cell over the same 1.5 MB PDF, local sqlite-vec, fastembed-vs-OpenRouter × {balanced, fast}. Headline numbers added to README's Knowledge Base section: fastembed × accurate (post-P0): 85.8 s fastembed × balanced: 89.9 s fastembed × fast: 39.9 s OpenRouter openai/text-embedding-3-small × balanced: 62.8 s OpenRouter baai/bge-base-en-v1.5 × balanced: 64.5 s OpenRouter openai/text-embedding-3-small × fast: 23.8 s ← best embed_s is the discriminator: fastembed is CPU-bound at ~30 s for ~500 chunks; cloud routes drop it to 2-4 s. Combined with Docling fast mode (parse -60%), best stack runs 3.96× faster than the pre-P0 baseline. README now also includes a "Choosing a provider" decision table. KNOWLEDGE_PIPELINE.md (gitignored internal reference) was updated locally with the same matrix + the validation/auto-flags story for future maintainers.

Previous commit only had 6 of the 9 cells — missing OpenRouter × accurate for both models and OpenRouter baai × fast. User correctly called out the gap; ran the missing benches and updated the README. Full matrix (1.5 MB PDF, 3-run median, local sqlite-vec, total_s): Provider \ Docling fast balanced accurate fastembed 39.9 89.9 85.8 openrouter openai-3s 23.8 62.8 63.2 openrouter baai/bge 26.4 64.5 63.4 Per-stage data also expanded in the README so reviewers can see exactly where each saving comes from (parse_s vs embed_s vs insert_s). Three observations the matrix made explicit: 1. OpenRouter shaves embed_s by 7-14× independent of Docling mode. 2. balanced ≈ accurate on digital PDFs (OCR finds nothing useful). 3. Cloud has variance — one openrouter × accurate run spiked to ~960s embed_s on a rate-limit / network blip. The 3-run median is the honest comparison; SLA-sensitive ingest should factor retries. Local artifacts also updated (gitignored): KNOWLEDGE_PIPELINE.md with the full table + caveat, knowledge_ingest_perf.pptx slide 6 now shows the 3×3 plus the per-stage breakdown for the headline rows.

…ty / robustness Adds a client-shareable benchmarking pipeline that measures the four dimensions a client weighs when picking a CUGA knowledge configuration for their workload. Methodology, harness, queries, and renderer all ship together so a client can run it on their corpus and reproduce the numbers. What ships scripts/bench_knowledge_eval.py Run a (provider, model, docling_mode) matrix on a single PDF + labeled query set. Outputs JSON. scripts/bench_knowledge_eval_report.py Render the JSON as 3 markdown tables (Latency × Cost, Quality, Decision matrix). scripts/eval_queries.json 10 labeled queries — 3 title/ author + 4 concept + 3 detail — tagged by ``kind`` so a client can spot configurations that ace one kind but bomb another. scripts/KNOWLEDGE_EVAL_METHODOLOGY.md The reasoning behind every metric, test-design tradeoffs, how to adapt for a client engagement, and the gaps v1 does not yet cover (no LLM grader, no multi-corpus rollup, no prod-pgvector flag). Four metrics per cell, by design Latency : parse_s / embed_s / insert_s / total_s + min/max spread (3-run median by default; cloud variance is real — we saw a 960 s outlier on openrouter × accurate). Cost : tokens estimated from chars × per-model OpenRouter pricing (live lookup, with hard-coded fallback). $0 for local providers. Quality : avg top-1 score (informational; provider-specific scale) + top-5 Jaccard vs a reference cell. Jaccard answers the client's real question — "do I get the same chunks if I switch from the trusted/expensive config to this cheaper one?" — without needing hand-labeled ground truth. Robustness : ingest_status + search_status + first 3 error strings. A cell that times out or errors mid-run is still reported as failed, with diagnostic text. Findings from the default 9-cell matrix on the RAG paper (2-run median, 10 queries, reference = openrouter/openai-3-small/balanced): Fastest openrouter/openai-3-small/fast 26.8 s $0.0083 Cheapest cloud openrouter/baai-bge-base/fast 29.4 s $0.0021 Free / air-gapped fastembed/fast 36.6 s $0 Quality-preserving AND cheap openrouter/openai-3-small/fast 26.8 s $0.0083 Jaccard 0.82 Top-5 Jaccard vs reference, by cell: openrouter / openai-3-small / balanced 1.00 (reference) openrouter / openai-3-small / accurate 1.00 (same model, different parse) openrouter / openai-3-small / fast 0.82 (fast skips OCR/tables; minor chunk diff) openrouter / baai-bge-base / * 0.20-0.30 fastembed / * 0.13 This is the kind of result a client cannot get from the latency benchmarks alone: switching providers materially changes which chunks come back for a query. Whether that change is "worse" or "different but equivalent" is a corpus-specific judgment the client must make. The harness gives them the queries to spot-check. Two bugs surfaced while building the harness 1. Dynaconf env-var caching: the first attempt set DYNACONF_KNOWLEDGE__... env vars per cell but cuga.config was module-cached after cell 1. Refactored to use the SDK's ``knowledge_config=`` kwarg (built for exactly this case in commit 6e106f0). The harness is now the canonical user of that kwarg. 2. Filename-based chunk identity: first eval gave Jaccard = 0 for every non-reference cell. Root cause was the per-run uuid-tagged copy of the source PDF (added earlier to keep each ingest on the new-doc branch); every cell saw a different filename, so the chunk id was unique per cell. Fixed by using content-based identity: ``p{page}:{first 120 chars of chunk text}``. Robustness note for clients One of the OpenRouter accurate-mode cells gave a transient "No embedding data received" error on the first bench attempt; the rerun succeeded. Cloud providers have variance, including occasional outlier runs taking 10x+ the median. The harness reports min/max alongside median so a client can see the spread and budget retries. README now points at the harness from the Knowledge Base section with a quick-start + summary of the default-matrix findings.

coderabbitai · 2026-06-03T07:54:48Z

📝 Walkthrough

Walkthrough

This PR adds knowledge configuration, retrieval, storage, routing, and UI changes for embedding providers, GPU support, progress reporting, client adaptation, and scoped session behavior.

Changes

Knowledge platform rollout

Layer / File(s)	Summary
Deployment docs and benchmark tooling `README.md`, `Dockerfile`, `.env.example`, `.gitignore`, `pyproject.toml`, `docs/examples/`, `scripts/bench_*.py`, `src/scripts/run_tests.sh`	Deployment notes, container images, dependency metadata, ignore rules, docs examples, benchmark CLIs, and test runner coverage were updated for OCR, GPU, and knowledge benchmarking.
Knowledge contracts and runtime backend `src/cuga/backend/knowledge/`, `src/cuga/backend/cuga_graph/...`, `src/cuga/configurations/knowledge/`, `src/cuga/__init__.py`	Knowledge config/schema, prompt assembly, search envelopes, reranking, query expansion, scope handling, storage/retrieval plumbing, and knowledge contract/profile settings were expanded.
SDK and frontend integration flow `src/cuga/sdk.py`, `src/frontend_workspaces/*`, `src/cuga/cli/main.py`, `src/cuga/backend/server/`, `src/cuga/backend/knowledge/routes.py`, `src/cuga/backend/knowledge/mcp_server.py`	SDK overrides, manage routes, startup apply behavior, CLI commands, frontend knowledge docs/badges, and upload/search/task APIs were wired through the new knowledge surfaces.
Validation and regression coverage `tests/**`	New and updated tests cover adaptation, GPU/use_gpu, provider selection, layout/OCR, reranking, session isolation, upload progress, publish/import, limits, and route/UI parity.

Sequence Diagram(s)

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch perf/knowledge-ingest-bottlenecks

sami-marreed

Review summary

Solid perf work (bulk insert, concurrent embed, docling modes, OpenRouter, live PATCH) with good unit test coverage when run manually — but several blockers before merge.

Blockers

Merge conflicts with main — GitHub shows mergeable: CONFLICTING (see inline on index.html)
PR description overclaims — OCR, session isolation, live upload UX, and "387 tests" are not in the 41-file diff
New tests not in CI — 8 new test files are not wired into src/scripts/run_tests.sh (lines 82–89), which is what .github/workflows/tests.yml runs
Event-loop blocking — fut.result() in async ingest; sync embed preflight on live PATCH
Data integrity — partial insert failure can leave orphan vectors on pgvector
Out-of-scope files — eval harness, methodology doc, personal .gitignore entries

Also noted (file unchanged in PR — no inline possible)

src/cuga/backend/knowledge/routes.py — upload still await engine._run_ingest(...) synchronously; PR body claims wait=false / 202 / reload resume but routes.py has zero diff
src/cuga/backend/knowledge/engine.py commit_knowledge_update — metric_changed is computed in prepare_knowledge_update but omitted from the return dict; manage_routes.py exposes it in live_changes anyway → always null. Add "metric_changed": prepared.metric_changed to the return.
get_settings() — omits new fields (docling_pdf_mode, batch knobs, embedding_api_key) after live PATCH
PR still Draft — please mark ready after fixes

Happy to follow up with more comments as review continues.

Snapshot of in-progress work being merged with feat/knowledge-client-adaptation: - search_multi Option B per-source cap (limit × n_scopes, capped 100); fallback keeps Option F - BM25/SQLite FTS5 hybrid retrieval with cross-leg RRF - CID-glyph detector for unOCR'd PDF text layers - Knowledge snapshot import/export via CLI + manage routes - Knowledge UI: in-flight upload persistence, sticky non-zero badge hook - Per-stage progress timers + JSONL bench harness - Many test additions across the knowledge surface

Cherry-picked c9459a8 ("WIP: client-adaptation panel + knowledge harness") onto a5a7d5a. Conflicts resolved with default-keep-perf, manually layered harness features on top: Frontend: - New ClientAdaptationPanel.tsx (Behavior tab — rules editor + glossary) - KnowledgeConfig.tsx: glossary entry type + adaptation fields on the interface; new client_adaptation/glossary surfaces in the modal - ManagePage.tsx: harness invocation surfaces Backend: - engine.py: layered query expansion (glossary aliases) ahead of dense + lexical fan-out; empty-guard retry with unexpanded query when expansion drives dense embedding away from useful regions; cross- encoder reranker post-fusion; preserved perf branch BM25 + drain loop + Option B per-source cap - awareness.py: client_adaptation block rendered top-of-summary; hash log for observability; new fields plumbed via assemble_system_prompt_section - config.py: client_adaptation_text + glossary + rerank_* fields; validation (denylist, bidi-override codepoints, glossary limits); TOML round-trip via [knowledge.client_adaptation]; secret-redaction preserved - cuga_lite_graph.py: backported the multi-tenant draft_knowledge_configs dict lookup so the draft-overlay path is shared-deployment safe - cli/main.py: kept perf's config-*/snapshot-* commands AND added adapt's adaptation-*/glossary-*/doctor (disjoint command names — both helper functions retained for env-var back-compat) - knowledge_instructions.md: kept perf's narrow Scope rules; layered "Citing Sources" section from adapt above Anti-patterns New files brought over: - query_expansion.py, reranker.py - tests/integration/test_client_adaptation_e2e.py - tests/unit/test_knowledge_client_adaptation.py - tests/unit/test_knowledge_lever_stack.py - docs/examples/knowledge_demo/client_adaptation_example.md - knowledge_testing/.gitignore Files NOT brought over: - prepare_node.py — perf branch is pre-agent-core-refactor and never had this layout; equivalent logic backported into cuga_lite_graph.py (multi-tenant dict lookup, with client_adaptation_text/glossary picked up automatically inside assemble_system_prompt_section) Tests: 667/667 knowledge unit tests pass. Dist bundles will be rebuilt fresh in a follow-up step.

…ld artefacts Two paths that were appearing in ``git status`` and don't belong in the tracked tree: - ``.claude/scheduled_tasks.lock`` + ``.claude/projects/`` — Claude Code's internal session state. Different per developer; never useful upstream. - ``docs/examples/**/build/`` — Python build artefacts created when running examples in isolation (``uv build`` from a sub-example dir). Pure housekeeping; no behavioural change.

…nt CUDA runtime Closes the knowledge-engine production-readiness gap with three closely-coupled changes audited by a 3-expert review (image / per-stage perf / deployment UX) plus a Docling-specific deep dive. ## 1. Pareto-locked RAG profiles Profile TOMLs now own their full effective config rather than just chunking + prompt addendum. Each profile pins: - embedding_model + batch_size + concurrency - chunk_size + overlap - docling.pdf_mode + layout_engine + drop_page_chrome - rerank.enabled + top_k_in + model - search.default_limit + score_threshold + max_search_attempts + hybrid_mode + junk_filter - engine.max_ingest_workers + vector_insert_batch_size The matrix (speed / standard / balanced / max_quality) is locked from a 3-expert debate on a 556-page PDF corpus; values defended in each TOML's header comment. Wiring: - ``from_settings`` reads the named profile's full surface; settings.toml keys act as fallback only. - ``coerce_and_validate`` applies the profile when ``rag_profile`` is named in the incoming PATCH (a bare ``{"rag_profile": "X"}`` produces a fully X-shaped config, not just a relabel). - Code-enforced contract: ``rerank_top_k_in >= 3 * default_limit`` — Expert B's recall-ceiling finding. - New ``would_invalidate_vectors(other)`` helper that the PATCH-config UX layer can use to gate a confirmation dialog. Default rag_profile remains ``standard`` (rerank off; bge-small-en; balanced parse) so the shipped default fits ~4GB containers. ## 2. GPU image variant (Dockerfile.gpu) + transparent CUDA New ``Dockerfile.gpu`` on ``nvidia/cuda:12.4.1-cudnn-runtime-ubuntu24.04`` base — slim Debian lacks ``libcudnn.so.9`` / ``libgomp.so.1`` / ``libcudart.so.12`` and causes silent ORT fallback to CPU. New ``[project.optional-dependencies] gpu = ["onnxruntime-gpu>=1.20", "fastembed-gpu>=0.8.0"]`` (lockfile-managed). ``fastembed-gpu`` avoids the qdrant/fastembed#608 dual-install collision where a transitive ``onnxruntime`` silently strips CUDAExecutionProvider. Main ``Dockerfile`` reverts a broken ``--build-arg GPU=1`` block: it was lockfile-unaware (``uv pip install --reinstall``) and any later ``uv sync`` would silently revert to CPU. Runtime: - ``_FastEmbedEmbeddings`` detects ``CUDAExecutionProvider`` in active providers and forces ``batch_size=256, parallel=1`` for the embed call. Without this, bge-small at fastembed's default ``parallel=None`` spawns one worker per CPU core, all sharing one GPU, and CUDA-context thrash caps the speedup at 2–3× CPU. - 3-mode startup warning distinguishes (a) GPU image with no device visible (forgot ``--gpus all``), (b) CPU image with GPU requested (wrong tag), (c) partial GPU (torch sees CUDA but ORT doesn't). - ``gpu_required: bool = False`` config field + ``CUGA_GPU_REQUIRED=1`` env var: when set, the engine raises at startup instead of warning, matching vLLM / TGI fail-fast posture. - New ``cuga doctor`` CLI: prints nvidia-smi, ORT providers, torch CUDA, fastembed live session providers, /dev/shm size. Copy-pasteable for support tickets. README: full GPU deployment section with k8s manifest example (``nvidia.com/gpu: 1`` + tmpfs /dev/shm + ``CUGA_GPU_REQUIRED=1``), per-stage speedup table from the audit, HF Spaces + AWS ECS notes. ## 3. Docling parse polish (per Expert D's deep dive) - ``generate_page_images=False`` only when ``pdf_mode=fast``. Saves ~30 ms/page in PNG encoding when the user has explicitly chosen the fast profile. Balanced + accurate keep Docling's default so downstream UI/extension consumers that may want the rendered PNGs aren't broken. - Warning when CUDA is engaged but ``max_ingest_workers <= 2``: per Expert D, a single H100 fits ~10–14 Docling workers (4–6 GB VRAM each); cuga's laptop-tuned default leaves the GPU ~80% idle. ## Tests - 709 knowledge unit tests pass. - 8 existing tests updated for the new profile-wins semantics: they now point ``rag_profile`` at ``__no_such_profile__`` to verify the settings.toml fallback path independently of profile expansion. - ``test_publish_syncs_draft_*`` + ``test_cuga_lite_knowledge_scopes`` pre-existing fragilities resolved (apply_knowledge_config stub on the fake engine; assertion shape updated to match the live setdefault → get/assign refactor on thread_id injection). ## Defended in 3 expert reports - Expert A (image / packaging): nvidia/cuda base + ``[gpu]`` extra + fastembed-gpu eliminate 3 silent-CPU-fallback footguns found by audit. - Expert B (per-stage GPU perf): batch_size=256 / parallel=1 on CUDA; code-enforced ``rerank_top_k_in >= 3 * default_limit``. - Expert C (deployment UX): refined warning + ``gpu_required`` + ``cuga doctor`` + README with platform matrix. - Expert D (Docling parse): ``generate_page_images=False`` in fast mode; ``max_ingest_workers`` warning. GPU image size: ~4.8 GB (CPU image stays ~1.5 GB). End-to-end speedup on H100 vs CPU (measured baseline 6.1 min, 556-page PDF): ~10× with multilingual-e5-large; ~3–6× with bge-small-en (launch-overhead bound at small models). Tesseract OCR + sqlite-vec/pgvector + BM25 remain CPU floors regardless.

Ships a small, opinionated corpus that operators can ingest immediately to evaluate the knowledge engine end-to-end without sourcing their own documents: - ``agent_level/`` — 4 evergreen HR-style policies (employee handbook, health insurance plan, PTO policy, 401k retirement plan). These map to cuga's ``agent_level`` scope (permanent, shared across conversations). - ``session_level/`` — 3 user-scoped documents for a fictional persona (Sarah Chen): benefits enrollment, paystub, PTO balance. These map to cuga's ``session_level`` scope (transient, per-conversation) and demonstrate the per-thread scope-isolation contract. - ``build_deck.py`` + ``cuga_knowledge_demo.pptx`` — small Python pptx-build script and the rendered demo deck (~80KB) for screenshare walkthroughs. Total footprint ~100KB. All content is fictitious; useful for evaluation, QA, and onboarding new cuga users to the knowledge feature.

…edge engine" This reverts commit e1438f2.

…-bottlenecks # Conflicts: # src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py # src/cuga/backend/knowledge/client.py # src/cuga/frontend/dist/index.html # src/frontend_workspaces/agentic_chat/src/App.tsx

…merge The merge with origin/main pulled in 27bd35e (extract shared agent logic into cuga_agent_core) which moved ~2300 lines out of cuga_lite_graph.py. Three of our branch's semantic improvements would have been lost without this fix: 1. ``helpers/knowledge.py::_get_knowledge_tool_scope_context`` — restore the synthetic ``"all"`` scope and ``"session"`` default. Main's refactored helper returned only ``("agent", "session")`` with default ``"agent"``; our branch had evolved to ``("all", "agent", "session")`` with default ``"session"`` so the LLM defaults to the narrowest plausible query and the engine auto-falls back to ``"all"`` if session is empty (the wiring lives in ``KnowledgeClient.search`` / ``_run_multi``). Without this restore, multi-scope deployments lose the precision gain that the perf branch shipped. 2. ``helpers/knowledge.py::_knowledge_scope_instruction`` — update the both-scopes-wired branch to describe the ``"all"`` scope explicitly so the LLM tool prompt matches the actual scope set returned in (1). 3. ``adapter/prepare_node.py`` — when assembling the draft-overlay knowledge config for ``--draft`` agents, prefer the per-agent ``draft_knowledge_configs`` dict before falling back to the legacy singular ``draft_knowledge_config`` attribute. Main's refactored prepare_node read only the singular attribute, which is multi-tenant unsafe (two draft agents on the same host would clobber each other). The dict lookup mirrors what cuga_lite_graph.py had pre-refactor. 4. ``test_knowledge_multi_scope_search.py`` — update the import path for ``_get_knowledge_tool_scope_context`` from ``cuga_lite_graph`` to ``cuga_lite.helpers.knowledge`` (the new home post-refactor). All 709 knowledge unit tests pass.

After the merge with origin/main pulled in the cuga_agent_core refactor, CI's ``ruff check`` flagged 40 lint errors across pre-existing files (31 auto-fixable, 9 manual). All are unused-imports / unused-locals / semicolon-separated statements — no behavioural change. - tests/unit/test_knowledge_use_gpu_surfaces.py + a dozen other test files: ``os``, ``patch``, etc. imported but unused. - tests/unit/test_knowledge_layout_engine.py:429-430: split two semicolon-joined statements onto their own lines. - src/cuga/backend/knowledge/engine.py:2473: drop unused ``rerank_in`` assignment. - scripts/bench_knowledge_eval.py:242,248: drop unused ``last_ingest_filename`` + ``docs``. ruff check now passes cleanly. 709 knowledge unit tests still pass.

… failure CI's ``ruff format --check`` flagged 38 files needing format normalization (extra blank lines, trailing-comma fixups, multi-line call formatting). Applied ``uv run ruff format .`` to bring the tree in line with the upstream code style. No behavioural change; 709 knowledge unit tests still pass.

…sion Background: CI's ``uv sync --all-extras --dev`` installs every extra side-by-side. With ``fastembed`` (base) AND ``fastembed-gpu`` (in the previous ``[gpu]`` extra) both installed, they clobber each other on disk — the same qdrant/fastembed#608 issue we were trying to avoid. Net effect under CI: ``ModuleNotFoundError: No module named 'fastembed'`` on every integration test that imports the engine. Fix: - Remove ``fastembed-gpu`` from ``[project.optional-dependencies] gpu``. Only ``onnxruntime-gpu`` stays. ``fastembed-gpu`` was a nicety; the Dockerfile.gpu can swap the underlying ORT directly. - ``Dockerfile.gpu`` now does ``uv pip uninstall onnxruntime`` between ``uv sync --extra gpu`` (which adds ``onnxruntime-gpu``) and the cu121 torch reinstall. Two ORT packages on disk silently strips CUDAExecutionProvider — the uninstall keeps it clean. - ``uv.lock`` re-locked to remove ``fastembed-gpu``. CI's ``uv sync --all-extras`` now resolves cleanly. 709 knowledge unit tests still pass.

… collision ``uv sync --all-extras --dev`` installs every optional dep, including the new ``[gpu]`` extra. ``onnxruntime-gpu`` (in [gpu]) and ``onnxruntime`` (transitive of base fastembed) then coexist on disk and clobber each other — ``import onnxruntime`` raises ``AttributeError: module 'onnxruntime' has no attribute 'SessionOptions'``. The qdrant/fastembed #608 and microsoft/onnxruntime #21825 issues both document the collision. Fix: ``--no-extra gpu`` in the workflow. CI never needs the GPU runtime — the GPU path is exercised in Dockerfile.gpu only.

CodeQL flagged two new alerts after the merge with origin/main: 1. HIGH ``js/polynomial-redos`` — ``src/frontend_workspaces/agentic_chat/src/ClientAdaptationPanel.tsx:309`` ``value.replace(/\s+$/, "")`` is the classic polynomial-backtracking pattern; on adversarial inputs the regex engine spends quadratic time trying to match. Replaced with ``value.trimEnd()`` — O(n), same semantics, intent-revealing. 2. MEDIUM ``py/stack-trace-exposure`` — ``src/cuga/backend/server/manage_routes.py:979`` The embedding-test endpoint's validation error path returned ``str(e)`` unscrubbed. ``KnowledgeConfig.validate()`` messages are operator-authored strings, but if a user accidentally pastes their API key into the model field the error could echo it back. Now uses the same ``_scrub_secret(str(e), api_key)`` the embed-test path two blocks below already uses — consistent secret-scrub posture across the endpoint. ruff clean, 709 knowledge unit tests pass.

…ors, CI tests, scope cleanup) Aggregate response to PR #297 review comments (Sami, Dec 2026). ## Blockers fixed 1. **Event-loop blocking in async ingest** — engine.py:2082 ``fut.result(timeout=2.0)`` was a blocking call inside the asyncio thread; with ~10 progress emits the drain stalled other requests for up to ~20s. Replaced with ``await asyncio.wait_for(asyncio.wrap_future(fut), timeout=2.0)``. 2. **Event-loop blocking in PATCH /knowledge/config** — manage_routes.py ``live_engine.apply_knowledge_config(filtered)`` was called sync in an async route. On embedding-provider/model change it runs ``embed_query("test")`` (network round-trip) on the event loop. Wrapped in ``await asyncio.to_thread(...)``. 3. **Partial insert can orphan vectors on pgvector** — storage/adapter.py Each ``add_many`` sub-batch is its own pgvector transaction; if batch N+1 fails, batches 1..N stay committed and the next retry appends duplicates (because ``delete_by_source`` only runs when metadata exists, and first-time failures never write metadata). Added compensating delete loop inside ``_add_all``: tracks ``source_for_rollback`` on first chunk and, on any subsequent batch failure, runs the same ``list`` / ``delete`` / ``_fts_delete_by_source`` loop as the adapter's public ``delete_by_source`` (inlined to stay on the current event loop — calling the public helper would ``_run_embedding_coro`` a nested loop and deadlock). 4. **New tests not in CI** — src/scripts/run_tests.sh The explicit allowlist of ~8 knowledge tests at lines 82-89 silently excluded ~30+ test modules added since. Switched to ``tests/unit/test_knowledge_*.py`` glob + named the few non-glob companions (session/chat/manage_publish/sdk_knowledge_config). All knowledge tests now run under ``./src/scripts/run_tests.sh unit_tests``. ## Also-noted bugs 5. **commit_knowledge_update omitted metric_changed from return** — engine.py:3197. ``manage_routes.py`` already exposes it in ``live_changes`` so the field was always null. Added ``"metric_changed": prepared.metric_changed`` to the return. 6. **get_settings() omitted new fields** — engine.py:3224. After live PATCH the SDK / UI / CLI couldn't read back ``docling_pdf_mode``, ``embedding_batch_size``, ``embedding_concurrency``, ``vector_insert_batch_size``, ``embedding_api_key`` (presence-only), ``rerank_*``, ``default_limit``, ``max_search_attempts``, ``max_ingest_workers``, ``docling_layout_engine``, ``gpu_required``. All now surfaced. Secret field exposed as ``embedding_api_key_set: bool`` only — the actual key never leaves the engine. All new fields read via ``getattr(self._config, name, default)`` so existing SimpleNamespace mocks in tests don't break. ## CodeQL hardening 7. **manage_routes.py:978-1009** — strengthened the py/stack-trace-exposure response: ``logger.exception`` logs the full server-side traceback; client receives only validator-authored messages on a known prefix whitelist (operators NEED these to fix their config), and falls back to a generic "Invalid knowledge embedding configuration." for any other exception shape so internals never leak. ## Scope cleanup (Sami: out-of-scope for this PR) 8. Removed eval harness scripts: - scripts/KNOWLEDGE_EVAL_METHODOLOGY.md - scripts/bench_knowledge_eval.py - scripts/bench_knowledge_eval_report.py - scripts/eval_queries.json - scripts/test_groq_connectivity.py - scripts/test_openrouter_connectivity.py 9. Removed personal gitignore entries for ``knowledge_ingest_perf.pptx`` + ``KNOWLEDGE_INGEST_PERF_SLIDES.md``. Generic ``~$*.pptx`` / ``~$*.docx`` / ``~$*.xlsx`` lock-file patterns kept since those are repo-wide useful. 10. README — trimmed the 3×3 provider×Docling latency matrix + Top-5 Jaccard quality table + eval-harness section (~80 lines). Provider-choice table + reindex-on-dim-change warning kept (real user docs). Tests: 709 knowledge unit tests pass. ruff check + ruff format clean.

coderabbitai

Actionable comments posted: 1

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)

Dockerfile.gpu (1)

27-85: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Run the container as a non-root user.

The image never drops privileges (USER is missing), so runtime executes as root. Please add a dedicated unprivileged user and switch before CMD.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Dockerfile.gpu` around lines 27 - 85, Create and switch to a non-root user in
the Dockerfile: add a dedicated unprivileged user (e.g., "cuga") and group, set
a HOME, chown the app workspace and any files copied into /app/cuga_workspace
(references: the RUN mkdir -p /app/cuga_workspace and multiple COPY lines), and
add a USER cuga (and optionally WORKDIR/HOME) before the final CMD so the
container runs unprivileged rather than as root. Ensure the new user has
permission to read/write needed files and the PATH (ENV PATH) remains usable for
that user.

Source: Linters/SAST tools

src/cuga/backend/knowledge/auth.py (1)

164-191: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize thread_id once and reuse it for ownership + collection resolution.

Line 170 validates stripped content, but Lines 182, 187, and 191 still use the raw value. "abc" and " abc " currently map to different session/collection identities.

Suggested fix

     if scope == "session":
+        normalized_thread_id = (identity.thread_id or "").strip()
         # Whitespace-safe check. ``"   "`` is truthy in Python but it
         # sanitizes to ``"___"`` and would create a phantom collection
         # ``kb_sess____`` shared across every misconfigured caller.
         # Strip before the truthiness test so any "no real thread id"
         # input is rejected uniformly.
-        if not (identity.thread_id or "").strip():
+        if not normalized_thread_id:
             raise HTTPException(status_code=400, detail="X-Thread-ID required")
@@
                 provider.get_or_create_session(
-                    identity.thread_id,
+                    normalized_thread_id,
                     user_id=identity.user_id,
                     tenant_id=identity.tenant_id,
                 )
                 if not provider.check_session_access(
-                    identity.thread_id, identity.user_id, identity.tenant_id
+                    normalized_thread_id, identity.user_id, identity.tenant_id
                 ):
                     raise HTTPException(status_code=403, detail="access denied to session")
 
-        return f"kb_sess_{_sanitize(identity.thread_id)}"
+        return f"kb_sess_{_sanitize(normalized_thread_id)}"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/knowledge/auth.py` around lines 164 - 191, Normalize and
reuse a stripped thread_id instead of calling identity.thread_id repeatedly:
create a local variable like stripped_thread_id = (identity.thread_id or
"").strip(), validate it and raise HTTPException if empty, then pass
stripped_thread_id to provider.get_or_create_session and
provider.check_session_access and call _sanitize(stripped_thread_id) for the
returned collection name; ensure all uses in this scope (the block handling
scope == "session" in auth.py) reference that single normalized variable.

src/cuga/backend/knowledge/config.py (1)

633-649: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Include Docling ingest knobs in vector_config_hash().

would_invalidate_vectors() currently misses docling_pdf_mode, docling_layout_engine, and docling_drop_page_chrome, even though those settings change the extracted text/chunks that get embedded and stored. A profile switch like standard (balanced + dry_run) to a stricter parsing profile can therefore keep the same hash and skip the re-ingest prompt while serving stale vectors.

Suggested fix

         key = (
             f"{sm}|{self.embedding_provider}|{self.embedding_model}|"
-            f"{self.chunk_size}|{self.chunk_overlap}|{self.metric_type}"
+            f"{self.chunk_size}|{self.chunk_overlap}|{self.metric_type}|"
+            f"{self.docling_pdf_mode}|{self.docling_layout_engine}|"
+            f"{self.docling_drop_page_chrome}"
         )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/knowledge/config.py` around lines 633 - 649,
vector_config_hash() currently omits docling ingest knobs so profile changes
(e.g., docling_pdf_mode, docling_layout_engine, docling_drop_page_chrome) won't
flip the vector hash; update the function (vector_config_hash) to include these
three settings from cuga.config.settings (accessed the same way as storage.mode)
in the key composition so the hash reflects any change to docling_pdf_mode,
docling_layout_engine, and docling_drop_page_chrome and triggers re-ingest
detection consistent with would_invalidate_vectors().

src/cuga/backend/knowledge/routes.py (1)

534-695: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

wait=false path does not actually return 202.

The handler documents async semantics for wait=false, but it never sets the HTTP status to 202, so clients still receive 200 and can mis-handle polling/state transitions.

Suggested fix

-from fastapi import APIRouter, Depends, File, Form, HTTPException, Query, UploadFile, Request
+from fastapi import APIRouter, Depends, File, Form, HTTPException, Query, UploadFile, Request, Response
@@
 async def upload_documents(
     request: Request,
+    response: Response,
     files: list[UploadFile] = File(...),
@@
-    if len(results) == 1:
+    if not wait:
+        response.status_code = 202
+    if len(results) == 1:
         return results[0]
     return {"tasks": results}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/knowledge/routes.py` around lines 534 - 695, The async
branch for wait=false currently builds result payloads but still returns normal
200; update upload_documents to return a 202 status for the fire-and-forget path
by returning a Response/JSONResponse with status_code=202 (preserve the same
body built from results/_enrich_task). Specifically, after assembling results in
the else branch (symbols: wait, results, _enrich_task) and at the final return
site, return JSONResponse(status_code=202, content=results[0] if len(results)==1
else {"tasks": results}) or equivalent so single-file and multi-file flows both
send HTTP 202 for wait=false. Ensure you import JSONResponse/Response from
fastapi.

src/frontend_workspaces/agentic_chat/src/KnowledgeSidePanel.tsx (1)

175-180: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don’t render the empty state until agent docs are authoritative.

agentDocs.length === 0 now means both “no docs” and “still resolving”. Right after first load or publish, this will briefly show “No agent documents” even though the badge is still loading. Pass through loadedOnce/isResolving from useAgentKnowledgeDocs and render a loading state until the hook settles.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/frontend_workspaces/agentic_chat/src/KnowledgeSidePanel.tsx` around lines
175 - 180, The component KnowledgeSidePanel currently shows "No agent documents"
whenever agentDocs.length === 0, which conflates "no docs" with "still
resolving"; update the rendering logic to use the hook's settled state (pass
through and read loadedOnce and/or isResolving from useAgentKnowledgeDocs) and
show a loading placeholder while isResolving or loadedOnce is false, only
rendering the empty state when loadedOnce is true and agentDocs.length === 0;
adjust the conditional around agentDocs/agentScopeEnabled accordingly so the UI
waits for the hook to settle before showing "No agent documents".

src/frontend_workspaces/frontend/src/ManagePage.tsx (1)

1028-1105: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Abort or sequence knowledge autosave requests.

This debounce only cancels the timer, not the previous PATCH once it has started. Since this endpoint updates the draft and the live engine, an older request can finish after a newer one and roll the server back to stale knowledgeConfig while the form still shows the latest edits. Guard this with an AbortController or a monotonic request id, and ignore/toast only the latest response.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/frontend_workspaces/frontend/src/ManagePage.tsx` around lines 1028 -
1105, The autosave useEffect (watching knowledgeConfig and effectiveAgentId) can
let overlapping PATCH requests finish out-of-order; modify it so each scheduled
save uses an AbortController or an incrementing sequence id tied to the effect:
create a controller/seq stored on skipDraftSaveRef (or a new ref) and pass
controller.signal to api.patchManageConfigDraftKnowledge (or include the seq in
the request/context), aborting the previous controller before starting a new
timeout and ignoring responses whose seq doesn't match the latest ref (or
treating aborted requests as no-ops); update error/toast handling to skip toasts
for aborted/stale responses while preserving current handling for real
network/errors and server 4xx responses.

tests/integration/test_client_adaptation_e2e.py (1)

255-283: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Tool-path test does not execute any tool call.

The test description says it validates KnowledgeClient.get_langchain_tools() end-to-end, but it never invokes a returned tool—only adapted_engine.search() and len(tools). This leaves wrapper-call regressions undetected.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/test_client_adaptation_e2e.py` around lines 255 - 283, The
test never invokes the langchain tool produced by
KnowledgeClient.get_langchain_tools(), so add an execution step in
test_client_consumes_search_via_langchain_tool that calls one of the returned
tools (from tools, e.g., tools[0]) with the same or similar query used earlier
("severance code 3 process") and awaits its result (use the tool's async
run/call API as appropriate), then assert the tool returned non-empty search
results; reference the test function name
test_client_consumes_search_via_langchain_tool and the method
KnowledgeClient.get_langchain_tools so you update the existing tools variable
rather than creating a new one.

tests/unit/test_knowledge_rag_scope_failure_fix.py (1)

784-824: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Close KnowledgeEngine in test_search_multi_failing_case_session_not_drowned.

This test creates a real engine and exits without shutdown, unlike nearby tests. Add try/finally with eng.shutdown() to avoid leaking resources into subsequent tests.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_knowledge_rag_scope_failure_fix.py` around lines 784 - 824,
The test test_search_multi_failing_case_session_not_drowned creates a real
KnowledgeEngine instance (eng = KnowledgeEngine(cfg)) but never shuts it down;
wrap the test body that uses eng (including the monkeypatch and asyncio.run
call) in a try/finally and call await eng.shutdown() or eng.shutdown() in the
finally block (matching other tests' shutdown pattern) so the KnowledgeEngine is
closed; reference the KnowledgeEngine constructor (KnowledgeEngine(cfg)), the
eng variable, and the shutdown method (eng.shutdown()) when adding the
try/finally.

🟠 Major comments (23)

scripts/bench_knowledge_eval.py-300-325 (1)

300-325: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve query alignment before Jaccard overlap computation.

No-hit and exception paths skip appending to top_chunk_ids_per_query, so zip(ref_chunks, cell_chunks) can compare different queries and produce incorrect overlap scores.

Proposed fix

         for q in queries:
             try:
                 hits = await agent.knowledge.search(q["query"], scope="agent", limit=5)
                 if not hits:
                     out.queries_failed += 1
+                    top_chunk_ids_per_query.append([])
                     continue
                 out.queries_answered += 1
@@
                 top_chunk_ids_per_query.append(ids)
             except Exception as q_err:
                 out.errors.append(f"search '{q['query'][:40]}…' failed: {type(q_err).__name__}: {q_err}")
                 out.queries_failed += 1
+                top_chunk_ids_per_query.append([])

Also applies to: 369-375

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/bench_knowledge_eval.py` around lines 300 - 325, The loop building
top_chunk_ids_per_query sometimes skips appending when there are no hits or an
exception, breaking alignment with queries and causing incorrect Jaccard
comparisons; ensure you always append a placeholder (e.g., empty list or
sentinel) to top_chunk_ids_per_query for every query even when hits is empty or
in the except block, while still incrementing out.queries_failed and recording
out.errors as before; make the same change in the similar block around the other
occurrence (the block referenced at lines 369-375) so ref_chunks and cell_chunks
remain index-aligned for zip-based overlap computation.

scripts/bench_knowledge_query_compare.py-61-77 (1)

61-77: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Per-cell temp directories are never cleaned up.

Both ingest input temp dirs and per-cell persist dirs remain on disk after completion.

Proposed fix

-async def _ingest_once(cell, pdf):
+async def _ingest_once(cell, pdf):
@@
-    return agent, cfg
+    return agent, cfg, persist, tmpdir
@@
-        agent, _ = await _ingest_once(cell, pdf)
+        agent, _, persist_dir, input_tmpdir = await _ingest_once(cell, pdf)
         try:
             results = await _run_queries(agent, queries)
             results_per_cell.append(results)
         finally:
             try:
                 await agent.aclose()
             except Exception:
                 pass
+            shutil.rmtree(persist_dir, ignore_errors=True)
+            shutil.rmtree(input_tmpdir, ignore_errors=True)

Also applies to: 153-161

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/bench_knowledge_query_compare.py` around lines 61 - 77, The per-cell
temporary directories created as persist = Path(tempfile.mkdtemp(...)) and
tmpdir = Path(tempfile.mkdtemp(...)) are never removed; wrap the per-cell
processing that uses KnowledgeConfig, dest, and the copied PDF in a try/finally
(or context manager) and call shutil.rmtree(persist, ignore_errors=True) and
shutil.rmtree(tmpdir, ignore_errors=True) in the finally block to ensure both
the persist_dir and input temp directory are deleted after each cell's work
completes (also apply same cleanup for the other similar block around the
153-161 code using the same symbols).

scripts/bench_knowledge_eval.py-185-194 (1)

185-194: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Temporary directories are leaked per run/cell.

_make_unique_pdf() and per-cell persist_dir allocate mkdtemp() directories but never remove them. Repeated benchmarks will steadily consume disk in /tmp.

Proposed fix

 async def _run_cell(...):
@@
-    persist_dir = Path(tempfile.mkdtemp(prefix=f"cuga-eval-{provider.replace('/', '-')}-"))
+    persist_dir = Path(tempfile.mkdtemp(prefix=f"cuga-eval-{provider.replace('/', '-')}-"))
@@
-    for i in range(runs):
-        pdf_unique = _make_unique_pdf(pdf)
+    for i in range(runs):
+        pdf_unique = _make_unique_pdf(pdf)
         cell_cfg = _build_cell_config(provider, model, docling_mode, persist_dir)
         agent = CugaAgent(enable_knowledge=True, knowledge_config=cell_cfg)
         try:
@@
         finally:
             try:
                 await agent.aclose()
             except Exception:
                 pass
+            import shutil
+            shutil.rmtree(pdf_unique.parent, ignore_errors=True)
@@
-    return out
+    import shutil
+    shutil.rmtree(persist_dir, ignore_errors=True)
+    return out

Also applies to: 209-210

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/bench_knowledge_eval.py` around lines 185 - 194, _temp directories
created by mkdtemp in _make_unique_pdf and where persist_dir is set are never
cleaned up, causing leaked /tmp files across runs; modify _make_unique_pdf and
the code that creates per-cell persist_dir to either use
tempfile.TemporaryDirectory() (or register cleanup via context managers) or
explicitly remove the created directories when they are no longer needed (e.g.,
call shutil.rmtree on the tmp path after ingestion/processing completes or on
process exit). Locate the _make_unique_pdf function and the places assigning
persist_dir, replace mkdtemp usage with TemporaryDirectory or add deterministic
cleanup logic (ensure you reference _make_unique_pdf, persist_dir,
tempfile.mkdtemp, and shutil.rmtree in your changes).

scripts/bench_knowledge_ingest.py-123-137 (1)

123-137: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

chunk_count_median is read from the wrong location.

chunk_count is consumed from the record root elsewhere, but summary reads it from timings, so the median can be wrong/empty.

Proposed fix

     summary = {
         "label": label,
         "runs": len(records),
         "stack": records[-1].get("stack"),
         "embedding_provider": records[-1].get("embedding_provider"),
-        "chunk_count_median": _med("chunk_count"),
+        "chunk_count_median": (
+            round(
+                statistics.median(
+                    float(r["chunk_count"])
+                    for r in records
+                    if isinstance(r.get("chunk_count"), (int, float))
+                ),
+                3,
+            )
+            if any(isinstance(r.get("chunk_count"), (int, float)) for r in records)
+            else None
+        ),
         "parse_s_median": _med("parse_s"),

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/bench_knowledge_ingest.py` around lines 123 - 137, summary currently
computes "chunk_count_median" by calling _med("chunk_count"), which looks up
chunk_count inside each record's "timings" dict, but chunk_count is stored at
the record root elsewhere; update the summary creation to compute
chunk_count_median from the root values instead of _med("chunk_count"). Iterate
records, extract each record.get("chunk_count") (or skip non-numeric), collect
floats, and compute median (rounded to 3 decimals) similar to _med, or add a
small helper _med_root(name) that reads record.get(name) and use that for
"chunk_count_median" while keeping parse_s_median and others that are actually
in timings unchanged.

scripts/bench_knowledge_eval_report.py-102-111 (1)

102-111: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Decision matrix can recommend failed cells.

Selection currently runs across all cells, including failed ingest/search results. This can output unusable recommendations.

Proposed fix

-    fastest = min(cells, key=lambda c: c.get("median_total_s") or 1e9, default=None)
+    successful = [
+        c for c in cells
+        if c.get("ingest_status") == "success" and c.get("search_status") != "failed"
+    ]
+    fastest = min(successful, key=lambda c: c.get("median_total_s") or 1e9, default=None)
@@
-        (c for c in cells if (c.get("cost_usd_estimated") or 0) > 0),
+        (c for c in successful if (c.get("cost_usd_estimated") or 0) > 0),
@@
-    free_options = [c for c in cells if (c.get("cost_usd_estimated") or 0) == 0]
+    free_options = [c for c in successful if (c.get("cost_usd_estimated") or 0) == 0]
@@
-    high_quality = [c for c in cells if (c.get("topk_overlap_vs_reference") or 0) >= 0.8]
+    high_quality = [c for c in successful if (c.get("topk_overlap_vs_reference") or 0) >= 0.8]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/bench_knowledge_eval_report.py` around lines 102 - 111, The selection
logic currently computes fastest/cheapest/free/high_quality across all entries
in cells, which can include failed ingest/search results; before the existing
calculations (variables fastest, cheapest_nonzero, free_options, free_fastest,
high_quality, high_quality_fastest) filter out failed rows by creating a
sanitized list, e.g. successful_cells = [c for c in cells if not
(c.get("failed") or c.get("error")) and (c.get("median_total_s") is not None or
c.get("cost_usd_estimated") is not None or c.get("topk_overlap_vs_reference") is
not None)]; then replace uses of cells with successful_cells so the min() and
list comprehensions only consider valid results (ensure all references to cells
in that block are updated to successful_cells).

src/cuga/backend/cuga_graph/nodes/cuga_lite/adapter/prepare_node.py-538-543 (1)

538-543: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid cross-agent fallback when per-agent draft config map exists.

On Line 541, the fallback to draft_knowledge_config still runs even when draft_knowledge_configs exists but has no entry for this agent. That can apply another agent’s draft search config in multi-tenant draft runs.

Suggested fix

-                                _cfgs = getattr(_das, "draft_knowledge_configs", None)
-                                if isinstance(_cfgs, dict):
-                                    _draft_kc = _cfgs.get(_base_aid)
-                                if _draft_kc is None:
-                                    # Legacy singular attribute fallback.
-                                    _draft_kc = getattr(_das, "draft_knowledge_config", None)
+                                _cfgs = getattr(_das, "draft_knowledge_configs", None)
+                                if isinstance(_cfgs, dict):
+                                    _draft_kc = _cfgs.get(_base_aid)
+                                else:
+                                    # Legacy singular attribute fallback only when per-agent map is unavailable.
+                                    _draft_kc = getattr(_das, "draft_knowledge_config", None)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/adapter/prepare_node.py` around
lines 538 - 543, Current code falls back to draft_knowledge_config even when
draft_knowledge_configs exists but lacks an entry for this agent, risking
cross-agent config leakage; change the logic around
draft_knowledge_configs/draft_knowledge_config so the singular fallback runs
only when draft_knowledge_configs is absent or not a dict. Specifically, in the
block referencing _das, draft_knowledge_configs, _draft_kc and _base_aid, first
check if hasattr(_das, "draft_knowledge_configs") and isinstance(_cfgs, dict)
and use only _cfgs.get(_base_aid); do NOT call getattr(_das,
"draft_knowledge_config", None) when draft_knowledge_configs is a dict (even if
.get returns None); only use the legacy singular draft_knowledge_config when
draft_knowledge_configs is missing or not a dict.

src/cuga/backend/knowledge/client.py-272-278 (1)

272-278: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't log raw user queries at INFO.

This fallback path logs the first 60 characters of every search query. Knowledge searches are likely to contain user data, document excerpts, or secrets, and INFO logs are usually retained centrally. Prefer scope/thread metadata only, or log a short hash if you need correlation.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/knowledge/client.py` around lines 272 - 278, The current
info log call using _stdlib_logging.getLogger("cuga.knowledge").info includes
the raw query (query[:60]); remove raw user content and instead log only
non-sensitive metadata and a short, stable correlation token: compute a short
hash (e.g., SHA256 or SHA1 and take first 8-12 hex chars) of the full query and
log that token along with thread_id and other metadata, or omit the query
entirely; update the call that references (thread_id or "") and (query or "") to
use the hash variable (or drop the query field) so no plaintext query data is
written to INFO logs. Ensure hashing is performed deterministically before the
log statement and reference the existing logger call
_stdlib_logging.getLogger("cuga.knowledge").info to locate and replace the code.

src/cuga/backend/knowledge/client.py-133-152 (1)

133-152: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Use a collision-free collection key, not lossy _sanitize().

_sanitize() maps every non-[A-Za-z0-9_] character to _, so distinct ids like thread-1, thread 1, and thread_1 all resolve to the same collection. That can merge documents across sessions or agents and defeats the isolation guarantees this PR is adding.

A reversible encoding or a stable hash-derived key is safer here than character replacement.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/knowledge/client.py` around lines 133 - 152, The current
_sanitize(...) is lossy and causes collisions (e.g., "thread-1" == "thread_1"),
so replace the lossy behavior with a stable, collision-resistant encoding when
building collection names: implement a new helper (or change _sanitize) to
return a short stable hash (e.g., sha256 hex or urlsafe base64 of the original
id) and use that in _agent_collection() (for self._default_agent_id) and in
_resolve_collection() (for session thread_id) instead of character-replacement;
ensure the returned token contains only safe filename/collection characters and
preserve the existing _agent_collection_hash behavior so final names remain
deterministic and unique (reference: _sanitize, _agent_collection,
_resolve_collection, _agent_collection_hash, _default_agent_id).

src/cuga/sdk.py-1670-1684 (1)

1670-1684: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

knowledge_config override is only partially applied (injection gate still ignores it).

The new override is resolved for engine creation, but auto-injection enablement is still driven by settings (when enable_knowledge is None). This can inject tools when override says disabled—or skip them when override says enabled.

Suggested fix

+    def _resolve_knowledge_config(self):
+        from cuga.backend.knowledge.config import KnowledgeConfig
+        from cuga.config import settings
+
+        override = self._knowledge_config_override
+        if override is None:
+            return KnowledgeConfig.from_settings(settings)
+        if isinstance(override, KnowledgeConfig):
+            override.validate()
+            return override
+        return KnowledgeConfig.coerce_and_validate(dict(override))
@@
-                else:
-                    kb_config = KnowledgeConfig.from_settings(settings)
-                    kb_enabled = kb_config.enabled
+                else:
+                    kb_enabled = self._resolve_knowledge_config().enabled
@@
-            # Resolve config: explicit override > settings.toml fallback.
-            override = self._knowledge_config_override
-            if override is None:
-                config = KnowledgeConfig.from_settings(settings)
-            elif isinstance(override, KnowledgeConfig):
-                ...
-            else:
-                config = KnowledgeConfig.coerce_and_validate(dict(override))
+            config = self._resolve_knowledge_config()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/sdk.py` around lines 1670 - 1684, The override in
_knowledge_config_override is used to build the KnowledgeEngine but the
auto-injection decision still reads from settings, causing mismatch; after you
resolve "config" (via KnowledgeConfig.from_settings, validate, or
coerce_and_validate) use config.enable_knowledge (or the equivalent property on
KnowledgeConfig) as the source for the injection gate instead of reading
enable_knowledge from settings when an override was provided, and update the
injection-enablement check (where the injection gate is evaluated) to prefer
config.enable_knowledge so the override consistently controls whether
auto-injection runs.

src/cuga/backend/server/manage_routes.py-1313-1335 (1)

1313-1335: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't block the PATCH request on full auto-reindex work.

When dim_changed is true, this route now awaits live_engine.reindex() for every collection before returning. On the large corpora this PR targets, a settings save can now scale with total corpus size and easily hit proxy/request timeouts after the draft has already been persisted.

Queue the reindex as background work and return a pending status/job handle instead of doing corpus-wide work inline.

src/cuga/backend/server/manage_routes.py-1167-1175 (1)

1167-1175: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Classify auth failures from the merged config, not the PATCH delta.

This branch inspects knowledge.get(...), which only contains fields sent in the current PATCH. If the client changes just embedding_model, filtered can still contain an existing embedding_provider/embedding_api_key, but _provider and _user_supplied_key are computed as empty here. That makes the soft-fail vs hard-fail path take the wrong branch for partial updates.

Suggested fix

-                _user_supplied_key = bool((knowledge.get("embedding_api_key") or "").strip())
-                _provider = (knowledge.get("embedding_provider") or "").lower()
+                _user_supplied_key = bool((filtered.get("embedding_api_key") or "").strip())
+                _provider = str(filtered.get("embedding_provider") or "").lower()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/server/manage_routes.py` around lines 1167 - 1175, The
auth-error classification is reading provider/key from the PATCH delta
`knowledge` (which may be partial) instead of the merged configuration; update
the logic so `_provider`, `_user_supplied_key`, and `_is_credentialed` are
computed from the merged config (e.g., use the existing merged/filtered dict
already used elsewhere rather than `knowledge`), keeping the same
stripping/lowering and `_looks_like_auth` checks, so the soft-fail vs hard-fail
branch is based on the full current settings.

src/cuga/backend/server/main.py-718-724 (1)

718-724: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Published knowledge config is still skipped after restart when settings boot with knowledge disabled.

This block only restores _saved_knowledge when app_state.knowledge_engine already exists. If settings.toml starts with knowledge.enabled=false but the published config had knowledge enabled, startup leaves _live_engine as None and silently skips the restore path, so the published provider/model/layout never comes back after a restart.

Suggested fix

-            _saved_knowledge = (_startup_config or {}).get("knowledge") or {}
-            _live_engine = getattr(app_state, "knowledge_engine", None)
-            if _saved_knowledge and _live_engine is not None:
+            _saved_knowledge = (_startup_config or {}).get("knowledge") or {}
+            _live_engine = getattr(app_state, "knowledge_engine", None)
+            if _saved_knowledge and _live_engine is None:
+                from cuga.backend.knowledge.config import KnowledgeConfig as _KC
+
+                _saved_cfg = _KC.coerce_and_validate(
+                    {k: v for k, v in _saved_knowledge.items() if not k.startswith("_")}
+                )
+                if _saved_cfg.enabled:
+                    await initialize_knowledge_engine(app_state, _saved_cfg)
+                    _live_engine = getattr(app_state, "knowledge_engine", None)
+            if _saved_knowledge and _live_engine is not None:
                 # Strip noise that's not part of the public config shape.
                 _saved_kb = {k: v for k, v in _saved_knowledge.items() if not k.startswith("_")}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/server/main.py` around lines 718 - 724, The restore logic
only applies saved knowledge when app_state.knowledge_engine already exists, so
if settings boot with knowledge disabled the published config is never restored;
change the block handling _saved_knowledge so that if _saved_kb exists and
_live_engine is None you initialize or enable the knowledge engine using the
same startup/factory path used elsewhere (so app_state.knowledge_engine becomes
a real engine) and then call _live_engine.apply_knowledge_config(_saved_kb);
keep the existing stripping of keys (the _saved_kb creation) and reuse
apply_knowledge_config to apply the restored config.

src/cuga/cli/main.py-201-203 (1)

201-203: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Unify knowledge CLI base URL resolution

config-*/snapshot-* use CUGA_API_BASE (default :8005), but adaptation-*/glossary-*/doctor use CUGA_HOST+CUGA_PORT (default :8000) and ignore CUGA_API_BASE. This split makes commands in the same cuga knowledge group target different backends by default.

Suggested fix

 def _cuga_server_base_url() -> str:
-    port = os.environ.get("CUGA_PORT") or "8000"
-    host = os.environ.get("CUGA_HOST") or "127.0.0.1"
-    return f"http://{host}:{port}"
+    api_base = os.environ.get("CUGA_API_BASE")
+    if api_base:
+        return api_base.rstrip("/")
+    port = os.environ.get("CUGA_PORT") or "8000"
+    host = os.environ.get("CUGA_HOST") or "127.0.0.1"
+    return f"http://{host}:{port}"

Also applies to: 441-455

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/cli/main.py` around lines 201 - 203, Unify CLI backend resolution by
making the existing helper _knowledge_api_base() the single source of truth:
change it to first read CUGA_API_BASE (default "http://127.0.0.1:8005"), and if
unset fall back to composing from CUGA_HOST and CUGA_PORT (default
"http://127.0.0.1" and "8000") and always rstrip("/"); then update all places
that currently build the host/port separately (the adaptation-*, glossary-*,
doctor command implementations) to call _knowledge_api_base() instead of using
CUGA_HOST/CUGA_PORT so every cuga knowledge command targets the same backend by
default. Ensure callers use the returned URL directly (no further port/host
assembly).

src/cuga/configurations/knowledge/knowledge_profiles/speed.toml-63-66 (1)

63-66: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Speed addendum conflicts with the mandatory scope-safety contract

Line 65 directs “most-likely scope first; fallback if empty,” which can miss relevant docs when both scope candidates exist. That conflicts with the contract’s anti-silent-miss rule in this PR.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/configurations/knowledge/knowledge_profiles/speed.toml` around lines
63 - 66, The "Speed addendum" text in speed.toml currently instructs
"most-likely scope first; only fall back to the other if the first comes up
empty", which violates the scope-safety contract by allowing silent misses when
both scopes contain relevant docs; update the guidance so the system either (a)
always query both scopes and merge/compare results, or (b) query the most-likely
scope first but still query the other scope in parallel or as a verification
step and include any higher-relevance hits, and explicitly state that fallback
must not be the only source when both scopes exist; change the phrasing in
speed.toml accordingly to mandate non-silent-miss behavior and reference the
scope-safety contract.

src/cuga/configurations/knowledge/knowledge_instructions.md-26-26 (1)

26-26: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Resolve contradictory scope-miss rules

Line 26 says session misses auto-fallback to scope="all" and explicitly says not to re-issue with a different scope, but Line 41 says session miss → scope="agent". These rules conflict and will produce inconsistent tool behavior.

Also applies to: 41-41

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/configurations/knowledge/knowledge_instructions.md` at line 26, The
document has contradictory fallback rules for session misses: one line says
session auto-falls back to scope="all" and forbids re-issuing, while another
says session miss → scope="agent"; make them consistent by choosing the intended
behavior (prefer keeping scope="all" as the canonical automatic fallback) and
update the rule that mentions scope="agent" to instead state the same behavior
as the scope="session" rule (i.e., engine auto-retries as scope="all", sets
retrieval.fallback_from="session", and callers must treat fallback results as
authoritative — do NOT re-issue with a different scope); ensure both occurrences
reference scope="session" → fallback to scope="all" and remove or reword any
conflicting wording about scope="agent" so the file no longer contains
contradictory rules.

src/cuga/cli/main.py-1660-1740 (1)

1660-1740: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not log raw embedding API keys

Line 1739 logs _value for every override, including DYNACONF_KNOWLEDGE__EMBEDDINGS__API_KEY (Line 1664). That leaks credentials to logs/console history.

Suggested fix

     for _env_name, _value in _embedding_flag_env:
         if _value is not None:
             os.environ[_env_name] = _value
-            logger.info("Embedding override: %s=%s", _env_name.split("__", 1)[-1], _value)
+            _masked = _value
+            if _env_name.endswith("__API_KEY") or "SECRET" in _env_name:
+                _masked = "***REDACTED***"
+            logger.info("Embedding override: %s=%s", _env_name.split("__", 1)[-1], _masked)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/cli/main.py` around lines 1660 - 1740, The loop that sets os.environ
for entries in _embedding_flag_env currently logs the raw _value via
logger.info, which can leak secrets (notably
DYNACONF_KNOWLEDGE__EMBEDDINGS__API_KEY). Update the loop that iterates over
(_env_name, _value) (and still sets os.environ[_env_name] = _value) to redact
sensitive values before logging: detect API keys by matching substrings like
"API_KEY" or the exact "DYNACONF_KNOWLEDGE__EMBEDDINGS__API_KEY" and replace the
value with a redacted placeholder (e.g., "<redacted>" or a masked version
showing only last 4 chars); continue to log non-sensitive entries unchanged via
logger.info("Embedding override: %s=%s", _env_name.split("__", 1)[-1],
logged_value).

src/cuga/backend/storage/embedding/local.py-108-117 (1)

108-117: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Rollback before fallback replay to avoid duplicate writes

Line 111 catches OperationalError and immediately replays all rows in Line 115-116. If executemany() applied a prefix before failing, replaying without resetting transaction state can duplicate already-written rows.

Suggested fix

         try:
             try:
                 # Fast path: single C-loop multi-row insert.
                 conn.executemany(sql, rows)
             except sqlite3.OperationalError:
+                # Reset any partial work from executemany before replaying.
+                try:
+                    conn.rollback()
+                except sqlite3.Error:
+                    pass
                 # sqlite-vec's vec0 virtual table is suspect for executemany on
                 # some builds; fall back to a loop inside the same transaction
                 # so we still drop the per-row commit fsync cost from N to 1.
                 for row in rows:
                     conn.execute(sql, row)
             conn.commit()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/storage/embedding/local.py` around lines 108 - 117, The
except block that catches sqlite3.OperationalError should rollback the failed
transaction before replaying the individual inserts to avoid duplicating any
prefix that may have been written by conn.executemany; update the handler around
conn.executemany to call conn.rollback() first, then replay rows with
conn.execute in the loop, and finally call conn.commit() (keeping the existing
conn.commit() after the try/except). Refer to conn.executemany, conn.rollback,
conn.execute, and conn.commit to find and modify the logic.

src/frontend_workspaces/frontend/src/knowledge/useAgentKnowledgeDocs.ts-209-228 (1)

209-228: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make the sticky retry cancellable/stale-aware.

After the timer fires, the retry fetch is no longer tied to the current refresh cycle. If a newer refresh() completes first, this older retry can still hit acceptResult() and overwrite the newer documents/cache with stale data. Reuse an AbortController or attach a request generation check before applying retry results.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/frontend_workspaces/frontend/src/knowledge/useAgentKnowledgeDocs.ts`
around lines 209 - 228, The sticky retry currently schedules a fetch that can
apply stale results via acceptResult; make it cancellable/stale-aware by tying
the retry to the same cancellation/generation mechanism used by refresh(): when
scheduling the retry (where retryAttemptedRef and stickyRetryTimerRef are used)
create or reuse an AbortController (or capture the current refresh generation
id) and pass it into listKnowledgeDocuments, then before calling
acceptResult(retryData.documents ?? []) verify the controller is not aborted (or
the generation id still matches the latest) and only then apply results; also
ensure clearing/aborting the pending controller/timer when a new refresh()
starts so the old retry cannot overwrite newer data and setIsResolving is
handled consistently.

tests/unit/test_knowledge_cli_sdk_parity.py-256-271 (1)

256-271: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

“CLI flag → env var mapping” test does not validate CLI mapping.

This test never exercises CLI/env wiring; it only rebuilds a settings dict and calls KnowledgeConfig.from_settings, which is already covered above. A CLI mapping typo could still pass here.

tests/unit/test_knowledge_routes.py-72-79 (1)

72-79: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Queued upload contract should assert 202, not 200.

For wait=false, this path is documented as an accepted/queued response. Locking 200 here risks preserving the wrong API contract and diverging from the intended async semantics.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_knowledge_routes.py` around lines 72 - 79, The test is
asserting the wrong HTTP status for a queued upload: update the assertion in
tests/unit/test_knowledge_routes.py (the POST to "/api/knowledge/documents" with
data {"scope":"agent","replace_duplicates":"true","wait":"false"}) to expect 202
instead of 200; locate the assertion on response.status_code in this test and
change it to assert response.status_code == 202 so the test matches the API's
async/queued contract.

tests/unit/test_knowledge_publish_import_pipeline.py-131-144 (1)

131-144: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

This secret-leak regression test can pass without exercising save_config.

Because assertions are conditional on if "saved" in captured, the test succeeds even when the route never reaches disk-save logic. That creates a false green on a security-critical contract.

Suggested fix

     client.post("/api/manage/config", params={"agent_id": "pub-strip-test"}, json=payload)
-    # Don't assert on status; the publish flow has many side-effects we mock
-    # out. Just verify what hit the disk save call.
-    if "saved" in captured:
-        saved_kb = captured["saved"].get("knowledge", {})
-        # KEY must be stripped on disk
-        assert saved_kb.get("embedding_api_key", "") == "", (
-            f"API key leaked to published config-store: {saved_kb.get('embedding_api_key')!r}"
-        )
-        # Non-secret fields must survive
-        assert saved_kb.get("embedding_provider") == "openai"
-        assert saved_kb.get("embedding_model") == "Azure/text-embedding-3-small-1"
-        assert saved_kb.get("embedding_base_url") == "https://ete-litellm.bx.cloud9.ibm.com/v1"
+    assert "saved" in captured, "save_config was not invoked"
+    saved_kb = captured["saved"].get("knowledge", {})
+    assert saved_kb.get("embedding_api_key", "") == "", (
+        f"API key leaked to published config-store: {saved_kb.get('embedding_api_key')!r}"
+    )
+    assert saved_kb.get("embedding_provider") == "openai"
+    assert saved_kb.get("embedding_model") == "Azure/text-embedding-3-small-1"
+    assert saved_kb.get("embedding_base_url") == "https://ete-litellm.bx.cloud9.ibm.com/v1"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_knowledge_publish_import_pipeline.py` around lines 131 - 144,
The test currently guards assertions with a conditional "if 'saved' in captured"
which lets the test pass when save_config was never invoked; change this to an
explicit assertion that ensures the save happened before checking contents—for
example assert "saved" in captured (or assert "saved" in captured, f"expected
save_config to be called after client.post('/api/manage/config')") and then
proceed to set saved_kb = captured["saved"].get("knowledge", {}) and assert on
embedding_api_key, embedding_provider, embedding_model, and embedding_base_url
to enforce the secret-strip contract; update any variable references (captured,
saved_kb) accordingly so the test fails when disk save isn't exercised.

tests/unit/test_knowledge_use_gpu_surfaces.py-194-197 (1)

194-197: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid eager embedder initialization in this unit test.

eng._ensure_embeddings() can pull in provider/model initialization and make this test flaky/non-hermetic. This test only needs to validate use_gpu config propagation before and after PATCH.

Suggested change

 def test_patch_draft_knowledge_use_gpu_false_applies_to_live_engine():
     """User toggles use_gpu off in the UI → live engine rebinds inference backend."""
     eng = KnowledgeEngine(_cfg(use_gpu=True))
-    eng._ensure_embeddings()
     assert eng._config.use_gpu is True

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_knowledge_use_gpu_surfaces.py` around lines 194 - 197, The
test is eagerly initializing embedders by calling eng._ensure_embeddings(),
making it flaky; remove that call and only assert config propagation on the
KnowledgeEngine instance: create eng = KnowledgeEngine(_cfg(use_gpu=True)),
assert eng._config.use_gpu is True, then apply the PATCH operation the test
intends and assert the config value after PATCH. Update references to
_ensure_embeddings() to avoid triggering provider/model initialization and keep
assertions limited to eng._config.use_gpu before and after the PATCH.

tests/unit/test_knowledge_timings.py-267-273 (1)

267-273: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Order-preservation test does not assert the document identity strongly enough.

This loop claims to verify order, but it only checks filename == "x.pdf". If embeddings are reordered, the test can still pass because all docs share that filename. Assert page (and/or exact text) for the top hit.

Suggested change

     for i in range(0, 50, 7):
         hits = store.search(f"text {i}", k=3)
         assert hits, f"no hits for 'text {i}'"
         # Top hit's filename + page must be the original doc i+1.
         top = hits[0][0]
         assert top.metadata.get("filename") == "x.pdf"
+        assert top.metadata.get("page") == i + 1, (
+            f"expected top hit page {i + 1}, got {top.metadata.get('page')!r}"
+        )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_knowledge_timings.py` around lines 267 - 273, The test's loop
only asserts filename == "x.pdf" which is insufficient; update the assertion to
also check the document identity by asserting the page (and/or exact text) of
the top hit matches the expected document for each i. Concretely, after
obtaining top = hits[0][0], add an assertion like top.metadata.get("page") ==
i+1 (or assert the hit text equals the expected "text {i}") so the test verifies
order-preservation, using the existing variables hits and top.

🧹 Nitpick comments (3)

tests/unit/test_knowledge_config_ui_design.py (1)

458-462: ⚡ Quick win

Replace the tautological warmup assertion with a real contract check.

Line 461 always passes and won’t catch regressions. Assert a concrete condition (e.g., type/expected value) instead.

Suggested change

-    assert result["embeddings_initialized"] is False or result["embeddings_initialized"] is True
+    assert isinstance(result.get("embeddings_initialized"), bool)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_knowledge_config_ui_design.py` around lines 458 - 462, The
test uses a tautological assertion for result["embeddings_initialized"] which
always passes; replace it with a concrete contract check by asserting the field
is a boolean (e.g., use isinstance(result["embeddings_initialized"], bool)) or
assert a specific expected value depending on the warmup contract from
eng.warmup(); update the assertion referring to result["embeddings_initialized"]
(produced by eng.warmup()) to validate type or expected value rather than "is
False or is True".

tests/unit/test_knowledge_preflight_soft_fail.py (1)

60-156: ⚡ Quick win

Use fixture-managed engine teardown to prevent resource leakage across tests.

Each test builds a KnowledgeEngine but never closes it. A small fixture that yields the engine and calls engine.shutdown() in teardown will make this suite more stable under parallel/long runs.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_knowledge_preflight_soft_fail.py` around lines 60 - 156, The
tests instantiate KnowledgeEngine directly and never call its shutdown, causing
leaked resources; create a pytest fixture that yields a KnowledgeEngine
(constructed via _cfg()) and calls engine.shutdown() in the teardown, then
update tests to use that fixture instead of directly calling eng =
KnowledgeEngine(_cfg()) (e.g., accept an engine fixture and pass it to
_build_app(engine)); ensure all tests that create KnowledgeEngine are switched
to use the fixture so shutdown() runs after each test.

tests/unit/test_knowledge_pgvector_metadata.py (1)

74-78: Guard fake SQL UPDATE parsing from silent truncation

zip(cols, params[:-1]) would silently drop assignments if the fake interpreter parses a different number of SET columns than the parameters provide. PostgresKnowledgeMetadata.update_task currently derives set_clause and values from the same cols list, so this mismatch shouldn’t occur today—but the assertion makes future SQL/parse changes fail loudly.

Suggested fix

             cols = [seg.split("=")[0].strip() for seg in set_section.strip().rstrip(",").split(",")]
             task_id = params[-1]
             if task_id in self._rows_by_task_id:
+                assert len(cols) == len(params[:-1]), (
+                    f"UPDATE shape mismatch: cols={len(cols)} params={len(params[:-1])}"
+                )
                 for col, val in zip(cols, params[:-1]):
                     self._rows_by_task_id[task_id][col] = val

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_knowledge_pgvector_metadata.py` around lines 74 - 78, Add a
guard to ensure the number of parsed SET columns matches the number of provided
parameter values before zipping to avoid silent truncation: in the code path
where cols and params are derived (used by PostgresKnowledgeMetadata.update_task
and referencing cols, params, task_id, and self._rows_by_task_id), assert or
raise if len(cols) != len(params) - 1 (since the last param is task_id) so
mismatched parses fail loudly rather than dropping assignments; perform this
check immediately before the for col, val in zip(cols, params[:-1]) loop.

Source: Linters/SAST tools

coderabbitai · 2026-06-09T13:42:24Z

+RUN uv sync --extra gpu && \
+    uv pip uninstall onnxruntime 2>/dev/null || true && \
+    uv pip install --reinstall torch torchvision \
+        --index-url https://download.pytorch.org/whl/cu121


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Show Dockerfile.gpu around the referenced lines if [ ! -f Dockerfile.gpu ]; then echo "ERROR: Dockerfile.gpu not found at repo root" exit 1 fi echo "=== Dockerfile.gpu: lines 1-120 ===" nl -ba Dockerfile.gpu | sed -n '1,120p' # Also search for any SHELL directive that could change command parsing echo "=== SHELL directives (if any) ===" rg -n '^\s*SHELL\b' Dockerfile.gpu || true # Show the exact RUN line(s) containing 'uv sync' and 'onnxruntime' echo "=== Matching RUN lines ===" rg -n 'uv sync|onnxruntime|uv pip uninstall|uv pip install --reinstall torch' Dockerfile.gpu

Repository: cuga-project/cuga-agent

Length of output: 165

🏁 Script executed:

#!/bin/bash set -euo pipefail if [ ! -f Dockerfile.gpu ]; then echo "ERROR: Dockerfile.gpu not found at repo root" exit 1 fi echo "=== Dockerfile.gpu: lines 50-90 (context around referenced area) ===" awk 'NR>=50 && NR<=90 {printf "%d:%s\n", NR, $0}' Dockerfile.gpu echo "=== SHELL directives (if any) ===" rg -n '^\s*SHELL\b' Dockerfile.gpu || true echo "=== Matching RUN lines ===" rg -n 'uv sync|onnxruntime|uv pip uninstall|uv pip install --reinstall torch' Dockerfile.gpu || true echo "=== Show full RUN instruction(s) that contain uv sync (handles line continuations) ===" # Print from first match through the line where the RUN instruction ends (heuristic: until a line not ending with '\') awk ' $0 ~ /uv sync/ { inrun=1 } { if (inrun) { print; } if (inrun && $0 ~ /\\$/) next if (inrun && $0 !~ /\\$/) { inrun=0 } } ' Dockerfile.gpu | sed -n '1,200p'

Repository: cuga-project/cuga-agent

Length of output: 2980

Fix RUN chaining so uv sync failures don’t get masked (torch install still runs today).
In Dockerfile.gpu lines 64-67, the expression uv sync ... && ... || true && ... short-circuits via || true, causing uv pip install --reinstall torch torchvision to run even when uv sync fails.

Suggested fix

-RUN uv sync --extra gpu && \ - uv pip uninstall onnxruntime 2>/dev/null || true && \ - uv pip install --reinstall torch torchvision \ +RUN uv sync --extra gpu && \ + (uv pip uninstall onnxruntime 2>/dev/null || true) && \ + uv pip install --reinstall torch torchvision \ --index-url https://download.pytorch.org/whl/cu121

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

RUN uv sync --extra gpu && \

uv pip uninstall onnxruntime 2>/dev/null || true && \

uv pip install --reinstall torch torchvision \

--index-url https://download.pytorch.org/whl/cu121

RUN uv sync --extra gpu && \

(uv pip uninstall onnxruntime 2>/dev/null || true) && \

uv pip install --reinstall torch torchvision \

--index-url https://download.pytorch.org/whl/cu121

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Dockerfile.gpu` around lines 64 - 67, The RUN line masks failures from uv sync because the current chaining uses "|| true" which makes the whole chain succeed and still runs the torch install; change the chaining so that only the uninstall is tolerated while uv sync must succeed: run uv sync --extra gpu and require its success, then run uv pip uninstall onnxruntime with a tolerated failure (e.g., group the uninstall with || true or run it in a subshell/brace so its failure doesn't affect uv sync), and finally run uv pip install --reinstall torch torchvision only if the prior required commands succeeded; reference the commands uv sync --extra gpu, uv pip uninstall onnxruntime, and uv pip install --reinstall torch torchvision when making this change.

The CI test-allowlist glob added in 734c56c picked up ``test_knowledge_layout_engine.py``, which has 5 tests that mock ``_detect_accelerator`` to return ``("mps", ["CoreMLExecutionProvider", "CPUExecutionProvider"])``. The python-side branch is fooled, but the downstream fastembed session validates the providers against the actual ``onnxruntime`` install — on Linux CI the call fails with ``ValueError: Provider CoreMLExecutionProvider is not available``. Added a module-level ``_REQUIRES_MAC = pytest.mark.skipif(sys.platform != "darwin")`` and applied it to the 5 affected tests: - test_layout_engine_transformers_disables_torch_compile_on_mps - test_layout_engine_auto_resolves_to_transformers_on_mps - test_timing_log_reports_effective_layout_not_user_choice - test_use_gpu_toggle_invalidates_auto_resolution - test_pytorch_embeddings_device_routes_to_mps_when_detected The Mac-only branches still get coverage on dev machines; CI on Linux no longer fails on the inherent platform mismatch.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cuga/backend/knowledge/storage/adapter.py`:
- Around line 632-635: The rollback code silently swallows errors from
_fts_delete_by_source; change the bare except to catch Exception as e and emit a
log entry with the exception details (e.g., using self.logger.error or the
module logger) including context like the source_for_rollback and that this
occurred during FTS rollback, rather than simply passing; keep the rollback
suppression behavior (do not re-raise) but ensure the exception and context are
logged for debugging.
- Around line 595-617: The rollback currently uses source_for_rollback from
items[0] inside _add_all and so only deletes one source on partial failure;
update _add_all to either enforce a single-source precondition by validating
that every item in items has the same "source" (raise early if not) or build the
set of all distinct sources touched (e.g., collect sources from each item's
metadata before the insert loop) and on partial failure iterate and delete by
each source (use the same _scope/_filt pattern) instead of only
source_for_rollback; ensure the chosen approach references source_for_rollback,
_add_all, self._store.add_many and mirrors delete_by_source behavior for all
affected sources.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 57186bb0-fd5b-488a-8680-5d42ad77ca35

📥 Commits

Reviewing files that changed from the base of the PR and between 99255de and 04308c6.

📒 Files selected for processing (7)

.gitignore
README.md
src/cuga/backend/knowledge/engine.py
src/cuga/backend/knowledge/storage/adapter.py
src/cuga/backend/server/manage_routes.py
src/scripts/run_tests.sh
tests/unit/test_knowledge_layout_engine.py

💤 Files with no reviewable changes (1)

.gitignore

🚧 Files skipped from review as they are similar to previous changes (1)

src/cuga/backend/server/manage_routes.py

coderabbitai · 2026-06-09T13:59:14Z

+        source_for_rollback = items[0][2].get("source") if items else None
+
        async def _add_all() -> int:
            n = 0
-            for doc, emb in zip(documents, vectors, strict=True):
-                chunk_id = str(uuid.uuid4())
-                page_val: int
-                page_raw = doc.metadata.get("page")
-                if page_raw is None:
-                    page_val = -1
-                else:
-                    try:
-                        page_val = int(page_raw)
-                    except (TypeError, ValueError):
-                        page_val = -1
-                source = str(doc.metadata.get("source", "") or "")
-                filename = str(doc.metadata.get("filename", "") or "")
-                meta = {
-                    "id": chunk_id,
-                    "tenant_id": scope["tenant_id"],
-                    "instance_id": scope["instance_id"],
-                    "source": source,
-                    "filename": filename,
-                    "page": page_val,
-                    "chunk_text": doc.page_content,
-                    "meta_json": json.dumps(
-                        {"source": source, "filename": filename, "page": page_val},
-                        ensure_ascii=False,
-                    ),
-                }
-                await self._store.add(chunk_id, emb, meta)
-                n += 1
+            partial = False
+            for i in range(0, total_items, insert_bs):
+                chunk = items[i : i + insert_bs]
+                try:
+                    await self._store.add_many(chunk)
+                except Exception:
+                    # If we'd already committed earlier batches, clean them
+                    # up before re-raising so the next ingest attempt
+                    # doesn't encounter duplicate vectors on the same
+                    # source_id. Inline the delete loop here (mirror of
+                    # ``delete_by_source``) so we stay on the current event
+                    # loop — the adapter-level helper would call
+                    # ``_run_embedding_coro`` which would deadlock against
+                    # the loop we're already inside.
+                    if partial and source_for_rollback:
+                        try:
+                            _scope = self._scope()
+                            _filt = {**_scope, "source": source_for_rollback}
+                            while True:


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Rollback cleanup currently assumes a single source across the whole batch.

source_for_rollback is taken from only the first item. On partial failure, cleanup deletes one source and can leave committed rows for other sources in the same call. Please enforce a single-source precondition (or roll back all touched sources).

Suggested fix

- source_for_rollback = items[0][2].get("source") if items else None + rollback_sources = {str(meta.get("source", "")) for _, _, meta in items} + if len(rollback_sources) != 1: + raise ValueError( + "add_documents expects all chunks in one call to share the same source" + ) + source_for_rollback = next(iter(rollback_sources))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/backend/knowledge/storage/adapter.py` around lines 595 - 617, The rollback currently uses source_for_rollback from items[0] inside _add_all and so only deletes one source on partial failure; update _add_all to either enforce a single-source precondition by validating that every item in items has the same "source" (raise early if not) or build the set of all distinct sources touched (e.g., collect sources from each item's metadata before the insert loop) and on partial failure iterate and delete by each source (use the same _scope/_filt pattern) instead of only source_for_rollback; ensure the chosen approach references source_for_rollback, _add_all, self._store.add_many and mirrors delete_by_source behavior for all affected sources.

coderabbitai · 2026-06-09T13:59:14Z

+                            try:
+                                self._fts_delete_by_source(source_for_rollback)
+                            except Exception:
+                                pass


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don’t silently swallow FTS rollback errors in the failure path.

The bare except ...: pass hides rollback divergence exactly when debugging partial-insert incidents. Log it at least once.

Suggested fix

- try: - self._fts_delete_by_source(source_for_rollback) - except Exception: - pass + try: + self._fts_delete_by_source(source_for_rollback) + except Exception as fts_err: # noqa: BLE001 + logger.warning( + "FTS rollback failed for source=%r after partial insert: %s", + source_for_rollback, + fts_err, + )

🧰 Tools

🪛 Ruff (0.15.15)

[error] 634-635: try-except-pass detected, consider logging the exception

(S110)

[warning] 634-634: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/backend/knowledge/storage/adapter.py` around lines 632 - 635, The rollback code silently swallows errors from _fts_delete_by_source; change the bare except to catch Exception as e and emit a log entry with the exception details (e.g., using self.logger.error or the module logger) including context like the source_for_rollback and that this occurred during FTS rollback, rather than simply passing; keep the rollback suppression behavior (do not re-raise) but ensure the exception and context are logged for debugging.

Source: Linters/SAST tools

sami-marreed

Request: split this PR before it can be reviewed

This PR is titled "perf + OCR + session isolation + live progress UX" but it currently carries at least four additional features. It's ~12.3k lines of test alone, across 37 changed source files including six 1k–4k-line files. Nobody can responsibly review this as one unit — and the bugs below prove it, because each one slipped through at a seam between features that were never reviewed together. Please break it up.

Split out cleanly (these lift out with little entanglement — please do these):

Client adaptation → own PR. ClientAdaptationPanel.tsx, query_expansion.py (glossary-driven), the adaptation prompt-framing in chat_agent.py / pmt*.jinja2 / awareness.py, plus test_knowledge_client_adaptation.py (1048), test_client_adaptation_e2e.py (282), and the client_adaptation_example.md doc. ~2.3k LOC, fully self-contained feature.
Reranker + GPU image → own PR. reranker.py (pulls in sentence-transformers), Dockerfile.gpu, test_knowledge_use_gpu_surfaces.py, test_knowledge_accelerator.py. New infra + a quality lever — deserves a focused review.
Benchmark tooling → remove from this PR entirely. scripts/bench_knowledge_ingest.py, bench_knowledge_query_compare.py, bench_knowledge_report.py, the knowledge/bench.py module, and test_knowledge_bench.py. (This reverses my earlier "keep ingest+report" note — on reflection benchmark tooling belongs in a separate PR, not the runtime tree.)

Consider splitting — or at minimum retitle:

Multi-scope / Option-B search. The 3630-line test_knowledge_multi_scope_search.py + test_knowledge_rag_scope_failure_fix.py (823) plus search_multi in client.py and the scope routing in prepare_node.py / helpers/knowledge.py. I know this shares core files and that "Option B" was in the original WIP title, so I won't insist on the split — but it's invisible to anyone reading the PR description. Either pull it out, or update the title/body to name it.

After the clean splits come out, what remains — perf core, OCR, session isolation, live progress UX, RAG-profile config — actually matches the title and is reviewable.

Blocking bugs, routed to where they belong (these are why the split matters):

Empty-thread_id regression — prepare_node.py:438. The origin/main merge dropped the _is_empty() guard the branch previously had; ca153f3c restored session as the default scope but not the guard, so an LLM-emitted thread_id="" now hits client.py:150 raise ValueError("thread_id required for session scope"). Belongs to the session/scope PR — it got lost because scope logic was buried inside a perf PR.
Dockerfile.gpu:64-66 — uv sync --extra gpu && uv pip uninstall onnxruntime 2>/dev/null || true && uv pip install ... short-circuits: || true makes the chain "succeed" so the torch reinstall runs even when uv sync fails, and the RUN exits 0 — a broken GPU image that builds green. CodeRabbit flagged this too. Belongs to the GPU PR.
Silent enforce→dry_run flip — config.py:1032-1035. from_settings reads the profile first, and the default standard profile sets drop_page_chrome/junk_filter = "dry_run", shadowing the shipped knowledge_settings.toml that still says = "enforce". The config file now lies about its effective value. Stays in the core/profile PR.
pgvector QA gate still open (my original blocker on CI/test coverage) — the new orphan-vector rollback in adapter.py is only tested against mocked/sqlite-vec; no Docker smoke test against real pgvector. Stays in the core PR.

Once split, I can give each PR the review it actually needs. Happy to re-review in whatever order you stack them.

dolev31 added 17 commits May 20, 2026 17:10

chore(gitignore): exclude local perf-review slide deliverables

f854ad1

knowledge_ingest_perf.pptx and KNOWLEDGE_INGEST_PERF_SLIDES.md are personal review materials, not codebase content — kept local only. Also adds Office lock file patterns (~$*.pptx etc.).

dolev31 self-assigned this Jun 3, 2026

sami-marreed self-requested a review June 3, 2026 08:02

dolev31 mentioned this pull request Jun 3, 2026

[Epic] Knowledge integration: document lifecycle, retrieval, and scoped answers #128

Open

sami-marreed requested changes Jun 8, 2026

View reviewed changes

dolev31 added 5 commits June 8, 2026 17:52

github-advanced-security AI found potential problems Jun 9, 2026

View reviewed changes

Comment thread src/frontend_workspaces/agentic_chat/src/ClientAdaptationPanel.tsx Fixed

github-advanced-security AI found potential problems Jun 9, 2026

View reviewed changes

Comment thread src/cuga/backend/server/manage_routes.py Fixed

Revert "docs(knowledge): bundle first-time demo content for the knowl…

9cd45d9

…edge engine" This reverts commit e1438f2.

dolev31 added 2 commits June 9, 2026 11:36

This was referenced Jun 9, 2026

knowledge: Pareto-locked RAG profiles #316

Open

knowledge: GPU image (cuga:gpu) + transparent CUDA runtime #317

Open

dolev31 added 3 commits June 9, 2026 11:48

This was referenced Jun 9, 2026

knowledge: content-hash cache for re-ingest (skip embed on unchanged content) #319

Open

knowledge: streaming ingest pipeline via Docling page_chunk_size #320

Open

knowledge: CUDA Graphs + warm worker for query-time embed (avoid cold-start cliff) #321

Open

dolev31 added 2 commits June 9, 2026 14:25

github-advanced-security AI found potential problems Jun 9, 2026

View reviewed changes

Comment thread src/cuga/backend/server/manage_routes.py Fixed

dolev31 marked this pull request as ready for review June 9, 2026 13:21

github-advanced-security AI found potential problems Jun 9, 2026

View reviewed changes

Comment thread src/cuga/backend/server/manage_routes.py

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

sami-marreed requested changes Jun 9, 2026

View reviewed changes

Conversation

dolev31 commented Jun 3, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Where the prompt-engineering work lives in the diff

Test plan

Follow-up: Pareto-locked profiles + GPU image (added 2026-06)

Summary by CodeRabbit

New Features

Bug Fixes

Documentation

Updated 2026-06: response to Sami's review

Uh oh!

coderabbitai Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Uh oh!

sami-marreed left a comment

Choose a reason for hiding this comment

Review summary

Blockers

Also noted (file unchanged in PR — no inline possible)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

sami-marreed left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dolev31 commented Jun 3, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 3, 2026 •

edited

Loading

sami-marreed left a comment •

edited

Loading