knowledge: perf + OCR + session isolation + live progress UX#297
knowledge: perf + OCR + session isolation + live progress UX#297dolev31 wants to merge 32 commits into
Conversation
…step 0) This is the no-op baseline for the issue #183 perf work. It does not change any ingest behavior; it only instruments the pipeline so the later commits (C1-C6) can be measured against the same ruler and the PR can carry a real attribution table. What this lands: - src/cuga/backend/knowledge/bench.py: tiny record_ingest_run(log_path, payload) helper that appends one JSON line per ingest, with auto-fields ts and commit_sha. Best-effort: any failure (unwritable path, non-serialisable payload) is swallowed so instrumentation never fails an ingest. - KnowledgeConfig.bench_log_path (default ""): when set, the engine appends per-ingest records to that file. Loaded from settings.toml as [knowledge.bench] log_path. - engine._ingest_inner now collects a stage_timings dict (parse_s, insert_total_s, finalize_s, total_s, chunk_count), includes it in the completion update_task call, and forwards it to record_ingest_run. - adapter.add_documents accepts an optional stage_timings dict and populates embed_s + insert_s. Signature added to VectorStoreAdapter base class with a backwards-compatible None default. - scripts/bench_knowledge_ingest.py: harness that runs N ingests with the bench log on, prints a per-run markdown row + a median-aggregated row, and appends the summary to ~/.cuga/bench/runs.json. - scripts/bench_knowledge_report.py: joins the runs.json entries into the attribution table the PR description will paste, with a delta-vs-baseline column. - tests/unit/test_knowledge_bench.py (6 cases) covering append, parent-dir creation, no-op on empty path, default=str fallback, non-mapping payload, unwritable path. - tests/unit/test_knowledge_timings.py (3 cases) verifying adapter populates embed_s + insert_s when stage_timings is passed, backwards-compat when it is not, and that empty doc list does not touch the dict. All 73 existing + new tests pass. ruff check / format clean.
… steps 1-3) Adds add_many to EmbeddingStoreBackend Protocol plus implementations on LocalEmbeddingStore (sqlite-vec) and ProdEmbeddingStore (pgvector). The adapter still uses per-row add() in this commit; the wiring lands in C2. This is intentionally a no-perf-change commit. The point is to land the two backend implementations and their tests on their own, so that any perf delta measured at C2 can be cleanly attributed to the adapter refactor, not to a backend-method change. C1 sanity bench (3 runs, 511-chunk PDF, local stack): median: parse=57.1s embed=37.7s insert=0.39s total=94.9s vs C0: parse=56.8s embed=37.3s insert=0.37s total=94.4s delta: +0.5s (within noise) What this lands: - EmbeddingStoreBackend.add_many: new Protocol method with the same (id, embedding, metadata) tuple shape as add(). - LocalEmbeddingStore.add_many: single transaction containing one executemany. Inner try/except sqlite3.OperationalError falls back to a per-row loop *inside the same transaction*, so even on sqlite-vec builds that reject executemany against vec0 we still drop the per-row commit fsync cost from N to 1. - ProdEmbeddingStore.add_many: one pool.acquire and one conn.transaction per batch with conn.executemany. Preserves the existing ON CONFLICT DO UPDATE upsert behaviour so replace_duplicates callers remain correct. - tests/unit/test_knowledge_local_add_many.py (4 cases): happy path, metadata round-trip via get(), empty no-op, and the executemany -> per-row fallback path. The fallback test wraps the live connection in a small proxy class because sqlite3.Connection is a C type whose methods cannot be monkey-patched directly. - scripts/bench_knowledge_ingest.py: each bench run now copies the source file under a uuid-suffixed name. Without this every run after the first hit the replace_duplicates branch in the engine, where the per-batch loop intentionally skips stage_timings, and we lost the embed_s / insert_s split in the bench output. 77 unit tests pass. ruff check / format clean.
…50 (#183 steps 4-5) First commit with a measurable perf delta. The adapter now builds the (id, embedding, metadata) tuple list once and calls store.add_many in a single round-trip; the engine no longer wraps that call in a 50-doc outer batch loop, because the adapter's bulk path is now correct and cheaper than the previous per-row loop ever was. C2 bench delta on local (3 runs, 511-chunk PDF, fastembed + sqlite-vec): insert_s: 0.374s -> 0.040s -89% (9x faster, the stage we touched) insert_total_s: 37.63s -> 37.27s -1% (embed dominates the wrapper) total_s: 94.44s -> 93.63s -1% (insert is a small share on local) The headline insert_s win is small in absolute terms on sqlite-vec because per-row commit was already on a local fsync. On prod (pgvector) each row in the old loop was a separate pool.acquire + INSERT round-trip; the same architectural change should shave a much larger chunk off insert_s there. We cannot measure that locally without a pgvector test env, so the prod-side number lands when a reviewer with that environment runs the bench script against this commit. What this lands: - adapter.add_documents: replaces the ``async def _add_all: for ... await self._store.add(...)`` loop with a list-comprehension over the documents and a single ``await self._store.add_many(items)`` call. Same metadata shape, same uuid id strategy, same return contract. stage_timings continues to capture embed_s and insert_s separately. - engine._insert_documents_async: drops the outer ``_BATCH = 50`` loop on the replace_duplicates branch; both branches now make a single adapter.add_documents call. The original loop existed because per-row insert times scaled linearly with batch size on the old path; with bulk insert it is strictly slower (one extra pool create/teardown per outer batch on prod, and zero per-row commit fsyncs to amortise on local). The branch lives on only because we still need delete_by_source for replace semantics. 77 unit tests pass. ruff check / format clean.
… step 4) Adapter now splits the chunk list into ``embedding_batch_size``-sized sub-batches and embeds each one sequentially (concurrency=1 for now; C4 turns concurrency on for network providers). New config field plus its plumbing through the vector-store factories so users can tune it without touching adapter internals. C3 bench delta on local (3 runs, 511-chunk PDF, fastembed + sqlite-vec): embed_s: 37.20s -> 29.17s -22% (sub-batching helps even local) total_s: 93.63s -> 85.78s -8% The win on local is bigger than the plan predicted. Likely cause: fastembed's internal embed batching prefers inputs smaller than the full 511-chunk single call we used to make. Cache locality and less peak-memory pressure pay off even before any concurrency lands. What this lands: - KnowledgeConfig: new ``embedding_batch_size`` (default 64) and a forward-declared ``vector_insert_batch_size`` (default 200) that will drive future bulk insert sub-batching but is not used yet. - StorageBackedKnowledgeVectorStore: takes ``embedding_batch_size`` via the constructor; ``add_documents`` walks the chunk list in steps of that size, asserts the embedder returned the expected count per call (RuntimeError otherwise so the insert never touches partial state), and concatenates the resulting vectors in order. Still one ``add_many`` bulk insert from C2. - create_storage_local_knowledge_store / create_storage_prod_knowledge_store / create_vector_store: all forward the new ``embedding_batch_size`` kwarg. - KnowledgeEngine._create_vector_adapter: passes ``self._config.embedding_batch_size`` through. - tests/unit/test_knowledge_timings.py: three new cases — embed sub-batch sizes match config; default of 64 is honoured; embedder returning a short list raises ``RuntimeError`` before any insert. 80 unit tests pass. ruff check / format clean.
Adapter now dispatches sub-batched embed calls through asyncio.gather
when the provider is network-bound (openai / ollama), bounded by a
semaphore of ``embedding_concurrency``. Local providers (fastembed /
huggingface) stay strictly sequential because their ONNX/torch
runtimes already exploit internal threads and adding gather on top
just adds asyncio overhead.
C4 bench on local (3 runs, 511-chunk PDF, fastembed + sqlite-vec):
embed_s: 29.17s -> 29.23s no-op on local (as designed)
total_s: 85.78s -> 85.86s within noise
This is the expected sign for local. The concurrency win materialises
on prod with OpenAI / Ollama where each embed call is one HTTP round
trip; we cannot reproduce that here. A reviewer with that stack should
re-run scripts/bench_knowledge_ingest.py --label C4-... against this
commit to fill in the prod row of the attribution table.
What this lands:
- KnowledgeConfig.embedding_concurrency (default 4).
- Adapter constructor gains ``embedding_concurrency`` and
``embedder_is_network``. add_documents now builds a ``starts`` list of
sub-batch offsets, populates a fixed-size ``vectors`` array at
absolute indices, and runs sub-batches concurrently via
asyncio.gather + Semaphore when the embedder is network-bound. Order
is preserved because indices are absolute, never appended.
- Defensive RuntimeError if any chunk has no embedding after the loop
(catches a logic bug in the dispatch before any insert touches the
store).
- create_storage_(local|prod)_knowledge_store and create_vector_store:
forward the two new kwargs.
- KnowledgeEngine._create_vector_adapter: sets
``embedder_is_network`` from
``self._config.embedding_provider in ("openai", "ollama")`` and
passes ``self._config.embedding_concurrency`` through.
- tests/unit/test_knowledge_timings.py: two new cases —
``test_adapter_concurrent_embed_preserves_order`` runs a sleepy
randomised embedder under network-mode and asserts vectors come back
in input order via retrieval; ``test_adapter_sequential_when_local_provider``
uses a thread-safe counter to assert max-concurrent embed calls is 1
when embedder_is_network=False.
82 unit tests pass. ruff check / format clean.
…step 6) The half of issue #183 not about throughput: ingest now emits per-stage progress events that callers polling get_ingestion_status see in near-real-time, instead of staring at a single processing -> completed transition that takes minutes. The race trap is real: stage_timings emits run on a worker thread (via asyncio.to_thread for the vector mutation) while the completion update runs on the outer event loop. A naive fire-and-forget would let a late progress write overwrite status="completed" back to "processing". The fix is two-part: (a) progress payloads never include the status field, so even a late landing cannot un-flip it; (b) the engine tracks every run_coroutine_threadsafe future and drains them with a bounded timeout before writing the terminal status. C5 bench on local (3 runs, 511-chunk PDF, fastembed + sqlite-vec): total_s: 85.86s -> 87.25s +1.4s (+1.6%) for ~10 progress emits The overhead is the cost of routing ~10 metadata update_task calls through the worker -> outer loop bridge. Acceptable for the UX win; the user no longer stares at a frozen UI for two minutes. What this lands: - adapter.add_documents: new ``progress_cb`` kwarg, default None. Adapter calls _safe_progress("embed", done, total) after each embed sub-batch, _safe_progress("insert_start", 0, total) before the bulk add_many, and _safe_progress("insert", count, total) after. _safe_progress swallows callback exceptions because progress is best-effort and must never fail an ingest. - engine._insert_documents_async: accepts progress_cb, forwards to adapter on both the replace_duplicates and not-exists branches as well as the schema-mismatch retry path. - engine._ingest_inner: emits a "parsed" stage event between Docling finishing and the insert starting; builds progress_cb that submits _emit_progress via asyncio.run_coroutine_threadsafe(outer_loop); tracks each future in progress_futures; drains them with per-future timeout=2.0s before the final status="completed" update_task call. - Progress emits deliberately omit the ``status`` field so a stray late write cannot overwrite the terminal "completed" / "failed" status (defence in depth on top of the drain). - tests/unit/test_knowledge_progress.py: four adapter-level cases — one emit per embed sub-batch with monotonic done count, both insert_start and final insert events fire, a throwing callback does not break ingest, and progress_cb=None is a no-op. 86 unit tests pass. ruff check / format clean.
step 7) Surfaces the three KnowledgeConfig fields added by C3-C5 in knowledge_settings.toml so users can tune them without touching code, and loads them via KnowledgeConfig.from_settings. C6 sanity bench on local (3 runs, 511-chunk PDF): total_s: 87.25s -> 89.04s within noise; no code path changed. What this lands: - src/cuga/configurations/knowledge/knowledge_settings.toml: [knowledge.embeddings] batch_size = 64 # embedding_batch_size concurrency = 4 # embedding_concurrency (network only) [knowledge.engine] vector_insert_batch_size = 200 # reserved; not yet used - KnowledgeConfig.from_settings: reads embeddings.batch_size, embeddings.concurrency, engine.vector_insert_batch_size with the same defaults as the dataclass. - tests/unit/test_knowledge_config_perf_keys.py: confirms TOML loads override defaults; missing keys fall back to defaults; dataclass defaults match settings.toml defaults so the docs and the code agree. 89 unit tests pass.
After landing C0-C6 I went through the new code with the express goal of
finding things I had reimplemented when an existing package or a shared
helper would do. Two real duplications, both fixed here without changing
behaviour or perf (3-run bench: 87.6s vs C5 87.25s, within variance).
1. ``LocalEmbeddingStore.add`` and ``ProdEmbeddingStore.add`` each had
their own copy of the schema/SQL/row-values construction next to the
matching code in ``add_many``. Now:
- local: ``add`` calls ``await self.add_many([(id, embedding, metadata)])``.
The schema-aware INSERT building moves into a single helper
``_insert_sql_and_row`` used by both code paths. Semantics
unchanged: one row in one transaction = one commit, same as before.
- prod: ``add`` delegates to ``add_many`` for the same reason. Two
small helpers (``_upsert_sql``, ``_row_values``) cover the schema +
UPSERT layout once instead of twice. Pool acquire / transaction
semantics for the single-row case are identical because executemany
with a length-1 row list is equivalent to a single execute under
``conn.transaction``.
2. Adapter ``add_documents`` gather path used
``asyncio.to_thread(self._embeddings.embed_documents, ...)`` to make
the sync embedder concurrent under ``asyncio.gather``. LangChain's
``Embeddings.aembed_documents`` already does exactly this in its
default impl (``run_in_executor(None, embed_documents, ...)``) AND
providers that have a real async implementation override it
(``langchain-openai`` uses httpx natively). Switching to
``await self._embeddings.aembed_documents(...)`` gets us the same
behaviour for sync-only providers and a strictly better path for
ones with native async, with less code.
The sequential (local-provider) path keeps the direct
``embed_documents`` call — there is no benefit to going through
asyncio there and the runtime already exploits threads internally.
89 unit tests pass. ruff check / format clean.
What I audited and decided NOT to change (per the user's redundancy
ask):
- bench.py JSONL writer: a loguru sink would couple it to the logging
system whose failure modes should not affect benchmarking. The
current standalone writer is the right isolation.
- stage_timings dict: OpenLit (already in the repo) instruments
high-level LLM/LangGraph calls, not sub-second embed/insert splits.
The two are complementary.
- progress_cb (sync callback through the stack): LangChain's
AsyncCallbackHandler is chain-level; sub-batch progress is not a
natural fit. Adding a handler class would be more boilerplate than
the current pattern.
- asyncpg copy_records_to_table: faster than executemany at >1000
rows BUT does not support ON CONFLICT, which the replace_duplicates
path needs. Keep executemany.
- pytest-benchmark: designed for microbenchmarks in pytest; our
harness spins up CugaAgent end-to-end and parses JSONL across runs.
Different tool for a different problem.
…dering Wires the issue #183 perf controls and per-stage ingest progress through every CUGA surface so the new tunables can be set without editing files and so the UI shows what the engine is actually doing. Surfaces touched: CLI (src/cuga/cli/main.py) - `cuga start ... --embeddings-provider` (fastembed | huggingface | openai | ollama; OpenAI-compatible endpoints — Groq / Together / Fireworks / OpenRouter — use `openai` + `--embeddings-base-url`). - `--embeddings-model`, `--embeddings-base-url`, `--embeddings-api-key`. - `--embeddings-batch-size`, `--embeddings-concurrency` (issue #183 tunables). - All flags map to DYNACONF_KNOWLEDGE__EMBEDDINGS__* env vars set before the cuga modules load settings, so any service (demo / demo_knowledge / manager / etc.) picks them up. UI (src/frontend_workspaces/agentic_chat/src/KnowledgeConfig.tsx) - KnowledgeConfigValues extended with embedding_api_key, embedding_base_url, embedding_batch_size, embedding_concurrency, vector_insert_batch_size. - ReindexTask.file_tasks gains optional stage + progress fields. `getReindexTaskProgressLabel` renders "<friendly stage> (done/total)" — e.g. "Embedding (192/511)" — under each running file in the reindex progress tile. - Embeddings accordion: Base URL + API Key inputs appear when the provider is openai or ollama, with a hint pointing OpenAI-compatible users (Groq / Together / Fireworks / OpenRouter) at the same path. Batch size and concurrency surfaced as NumberInputs with helper text explaining the local-vs-network semantics. Snapshot/publish (already free, verified) - The three new fields are dataclass attributes, so KnowledgeConfig.to_dict already serializes them and coerce_and_validate restores them. Two new unit tests (test_perf_keys_round_trip_via_to_dict_and_coerce and test_perf_keys_coerce_strings_to_int) lock this contract so a future field that bypasses the dataclass round-trip would fail loudly. Docs (README.md "Knowledge Base" section) - New "Embedding Providers" subsection: table of built-in providers + concrete settings.toml snippets for Groq and Together AI via the OpenAI-compatible path; CLI override example. - "Ingestion Performance Tuning" subsection: explains batch_size, concurrency, vector_insert_batch_size and that they snapshot with published configs. - "Per-stage Ingest Progress" subsection points users at the JSONL bench log for performance attribution. Frontend rebuilt; new bundle hashes: main.813ae3a97097fb29e526.js vendors.88ac6d4d0b8ba3e82779.js (Bundles shipped under src/cuga/frontend/dist/ via build.sh which copies the workspaces dist into the cuga Python package for the wheel.) 91 unit tests pass (+2 new snapshot round-trip tests). ruff clean.
… smoke Closes the gaps surfaced by the prod-side rigorous review and ships P1 (Docling level config) from the original plan. Two end-to-end tests land alongside: pgvector + fastembed (Docker) and Groq connectivity. 1. vector_insert_batch_size now actually enforced The plan declared this knob in C6 but the adapter never read it. With the full chunk list going to a single add_many on prod (pgvector), a large document risked: command_timeout=60 firing, an HNSW write lock held for many seconds, and asyncpg's binary-protocol buffer ballooning. Adapter now splits items by vector_insert_batch_size and calls add_many per sub-batch. Plumbed through create_storage_*_knowledge_store and create_vector_store. Stress-test bench with batch=10 (41 sub-batches for 410 chunks) added only +0.12s on the prod stack — knob works, cost is tiny. 2. Docling pdf_mode: fast | balanced | accurate The dominant cost on the 7-8 min ingest in #183 is Docling parse. New KnowledgeConfig.docling_pdf_mode (default "accurate" so nothing regresses) maps to PdfPipelineOptions: fast turns off OCR + table structure; balanced turns off OCR but keeps tables; accurate stays on Docling defaults. Converters are cached per-mode so toggling between ingests doesn't reload weights. Wired through: - settings.toml: [knowledge.docling] pdf_mode - CLI: cuga start --docling-pdf-mode - UI: new "PDF Parsing (Docling)" accordion with a select + warning about fast mode + scanned PDFs - Snapshot: dataclass field, so to_dict / coerce_and_validate round-trip preserves it. Test added. - Validation: rejects unknown modes, integer batch knobs must be > 0. 3. End-to-end pgvector + fastembed bench (Docker) Spun up pgvector/pgvector:pg16 locally and ran the bench against the 1.5MB RAG PDF, 511 chunks. Measured: accurate: 91.8s total (parse 58.4 / embed 33.05 / insert 0.24) fast: 39.2s total (parse 20.97 / embed 17.93 / insert 0.21) -57% wall time, -64% parse, -46% embed (410 chunks vs 511) Search round-trip verified end-to-end: top hit score 0.881 on 'retrieval augmented generation' against the just-ingested doc. 4. Groq connectivity smoke test scripts/test_groq_connectivity.py builds a KnowledgeConfig with provider=openai + base_url=https://api.groq.com/openai/v1 + api_key=$GROQ_API_KEY, builds the embedder via create_embeddings, and calls embed_query. Confirms the OpenAI-compatible config path plumbs end-to-end. Groq's catalog has zero embeddings models so the call returns a structured 4xx — the test accepts any Groq 4xx as "config path verified". The honest finding (Groq != embeddings) stays in the README. 98 unit tests pass. ruff clean. New frontend bundle: main.2bebaba0f19f267fbd65.js vendors.88ac6d4d0b8ba3e82779.js
…dd knowledge_config kwarg
Two real gaps surfaced when the user asked "is everything verifiably
configurable through UI and SDK". Both fixed.
UI gap: vector_insert_batch_size
The type interface declared it but no NumberInput was rendered. Added
one under the Embeddings accordion with min=1 max=5000 step=50, helper
text explaining the pgvector command_timeout / HNSW lock motivation,
and a default of 200 matching KnowledgeConfig + settings.toml.
SDK gap: CugaAgent only had enable_knowledge: bool
SDK users had no way to set docling_pdf_mode, embedding batch knobs,
provider, base_url, or api_key from Python — they had to set env vars
before importing cuga or edit settings.toml. New ``knowledge_config``
kwarg accepts:
- None (default): falls back to KnowledgeConfig.from_settings(settings)
- KnowledgeConfig: validated then used directly
- dict: passed through coerce_and_validate (with type coercion,
so e.g. embedding_batch_size="32" becomes 32)
Validation happens when ``agent.knowledge`` is first accessed (lazy,
matching the rest of the engine init pattern). Bad config raises a
clear ValueError before any ingest can land.
The knowledge property:
@Property
def knowledge(self):
...
override = self._knowledge_config_override
if override is None:
config = KnowledgeConfig.from_settings(settings)
elif isinstance(override, KnowledgeConfig):
override.validate()
config = override
else:
config = KnowledgeConfig.coerce_and_validate(dict(override))
engine = KnowledgeEngine(config)
...
Tests (tests/unit/test_sdk_knowledge_config.py, 4 cases):
- kwarg with KnowledgeConfig lands on engine._config for all 7 fields
- kwarg with dict goes through coerce_and_validate (str->int coercion)
- bad config raises ValueError lazily at first ``.knowledge`` access
- no kwarg falls back to settings.toml (env var still works)
102 unit tests pass (4 new). ruff clean. Frontend rebuilt.
knowledge_ingest_perf.pptx and KNOWLEDGE_INGEST_PERF_SLIDES.md are personal review materials, not codebase content — kept local only. Also adds Office lock file patterns (~$*.pptx etc.).
… engine Two related changes shipped together. The bug fix gates the OpenRouter work: without it, a user who picks OpenRouter in the UI (or changes any other knowledge setting) sees an HTTP 200 PATCH but no behavior change until they explicitly publish, which is hostile UX in demo / single-agent mode. 1. BUG FIX — PATCH /api/manage/config/draft/knowledge now applies to the live engine, not only the draft store Symptom (user-reported): selected Fast docling_pdf_mode in the UI, the request returned 200, but the next upload's log line still showed "mode='accurate', do_ocr=True, do_table_structure=True". Root cause: the handler at manage_routes.py:931 saved the merged config to draft + rebuilt the draft agent, but never called engine.apply_knowledge_config on app.state.app_state.knowledge_engine (the live engine that uploads route through). Comment line at the top of the function literally said "No engine mutation (draft isolation)". Fix: after validation, call live_engine.apply_knowledge_config(filtered). prepare_knowledge_update skips the expensive embed preflight unless provider/model actually changed, so docling_pdf_mode / batch sizes stay cheap. On preflight failure (bad key, unknown model) we return 400 with the engine error message and don't save the draft — no more "saved but not applied" zombie state. Response now includes live_applied + live_changes so the UI can show "reindex recommended" without an extra round-trip. 2. OpenRouter as a first-class embeddings provider User asks "what embeddings provider should I use" → today the answer is "provider=openai + base_url=https://openrouter.ai/api/v1 + the right key", a 3-field setup with a magic URL. New flow: provider="openrouter" model="openai/text-embedding-3-small" # REQUIRED, no default api_key=$OPENROUTER_API_KEY # REQUIRED Base URL auto-set to https://openrouter.ai/api/v1; embedding_base_url override still honored as an escape hatch. Both model and api_key are required by design — OpenRouter has no single "default" embeddings model and we'd rather fail loudly than pick one for the user. Surfaces wired: • engine.create_embeddings: new branch + new OPENROUTER_BASE_URL constant • KnowledgeConfig.validate: "openrouter" accepted • _create_vector_adapter: embedder_is_network includes openrouter (concurrency win applies) • settings.toml comments updated to call out openrouter rules • CLI: --embeddings-provider help text mentions openrouter explicitly • UI: new SelectItem + conditional block showing only Model + API Key fields, plus an inline link to openrouter.ai/models?output_modalities=embeddings • SDK: CugaAgent(knowledge_config=KnowledgeConfig(provider="openrouter",...)) • Publish/Import: dataclass field, free round-trip via to_dict / coerce_and_validate (test-locked) • Docs: README "Knowledge Base" → "Embedding Providers" gains an OpenRouter quick-start with TOML, CLI, and SDK snippets • Smoke script: scripts/test_openrouter_connectivity.py exercises the same code path the engine uses; run when OPENROUTER_API_KEY is set and you want to verify auth/URL/body shape 3. UI touch-up: renamed "PDF Parsing (Docling)" accordion to "Document Parsing (Docling)" — Docling handles DOCX/PPTX/HTML/images/etc. too, and the user pointed out the PDF-only naming was misleading. Tests (115 unit tests pass, up from 102): • tests/unit/test_knowledge_openrouter.py (10 cases): factory auto-URL, explicit override, missing key/model errors, env var fallback, validation accept/reject, snapshot round-trip, network-flag, SDK kwarg, TOML load. • tests/unit/test_knowledge_patch_live_apply.py (2 cases): PATCH applies to engine; invalid config returns 400 without partial apply. Frontend rebuilt: main.977339c561b271967583.js
After the OpenRouter + live-apply commit, walked the whole UI→PATCH→engine→
ingest pipeline once more with adversarial inputs. Found four real gaps;
all fixed and locked with tests.
1. validate() didn't enforce OpenRouter has a model
Before: provider=openrouter + empty model passed config validation but
blew up later in create_embeddings, surfacing a generic engine error
that didn't name the field. Now KnowledgeConfig.validate() rejects it
at coerce_and_validate time with a message pointing at the right
knowledge.embeddings.model field + a link to OpenRouter's catalog.
Test: test_validate_rejects_openrouter_without_model (+ whitespace
variant + the env-var-fallback case for api_key, which must NOT be
blocked at validation time).
2. No whitespace stripping on api_key / model / base_url
Classic copy-paste footgun: a key pasted from a docs page or terminal
often carries a trailing newline / space. langchain-openai's auth
then silently fails with a confusing 401. Now stripped at the
``create_embeddings`` boundary for both openai AND openrouter (and
base_url too). Whitespace-only values are rejected as if empty.
Tests: strips_whitespace, whitespace_only_model_raises,
whitespace_only_key_raises.
3. UI silently swallowed PATCH errors
ManagePage.tsx had a debounced PATCH that did `if (res.ok)` and a
bare `catch {}` with a comment "// silent". With the new live-apply
behavior, a 400 from the engine (bad key, unknown model) was
completely invisible to the user — they saw their UI value but the
engine still ran the old config. That was exactly the confusion the
bug fix was supposed to eliminate.
Now: on non-ok response we parse the JSON body's `detail` field and
show an error toast. On network error (the `catch`), we show a
second-tier toast. The user always sees something.
4. .env / .env.example docs for OpenRouter
Added a commented-out OPENROUTER_API_KEY block to both files
explaining how to opt-in. .env.example uses a placeholder
(sk-or-...); .env carries the existing dev key behind a comment so
re-enabling for testing is one line.
Tests (121 pass, +6 new): all four fixes pinned individually.
Frontend rebuilt: main.f70669d8954206299214.js.
…er matrix in README
Two changes that surfaced when actually exercising OpenRouter with the
user's key:
1. tiktoken-off when targeting non-OpenAI routes
The first bench attempt against ``baai/bge-base-en-v1.5`` via
OpenRouter failed with "No embedding data received". Root cause:
langchain-openai's ``OpenAIEmbeddings`` defaults to ``tiktoken_enabled
=True``, which sends integer arrays for OpenAI-named models. Non-OpenAI
models on OpenRouter (BAAI, Mistral, Qwen, Nvidia, etc.) reject those
with HTTP 422 "Input should be a valid string". Curl with raw string
input proved the underlying API works.
Fix: in ``create_embeddings``, when ``provider == "openrouter"`` OR
when ``provider == "openai"`` AND ``embedding_base_url`` is set,
construct the client with ``tiktoken_enabled=False`` +
``check_embedding_ctx_length=False``. Forces raw-text mode for the
wire request. Verified end-to-end:
openai/text-embedding-3-small → 1536-dim ✓
baai/bge-base-en-v1.5 → 768-dim ✓
sentence-transformers/all-mpnet-base-v2 → 768-dim ✓
(nvidia/llama-nemotron-embed-vl-1b-v2:free still trips the OpenAI
SDK's "No embedding data received" — likely a response-shape
incompatibility we can't fix without forking. Documented as a
provider-side quirk.)
2. Measured embedding-provider performance matrix in README
Ran the bench harness with 3-run median per cell over the same 1.5 MB
PDF, local sqlite-vec, fastembed-vs-OpenRouter × {balanced, fast}.
Headline numbers added to README's Knowledge Base section:
fastembed × accurate (post-P0): 85.8 s
fastembed × balanced: 89.9 s
fastembed × fast: 39.9 s
OpenRouter openai/text-embedding-3-small × balanced: 62.8 s
OpenRouter baai/bge-base-en-v1.5 × balanced: 64.5 s
OpenRouter openai/text-embedding-3-small × fast: 23.8 s ← best
embed_s is the discriminator: fastembed is CPU-bound at ~30 s for
~500 chunks; cloud routes drop it to 2-4 s. Combined with Docling
fast mode (parse -60%), best stack runs 3.96× faster than the
pre-P0 baseline.
README now also includes a "Choosing a provider" decision table.
KNOWLEDGE_PIPELINE.md (gitignored internal reference) was updated
locally with the same matrix + the validation/auto-flags story for
future maintainers.
Previous commit only had 6 of the 9 cells — missing OpenRouter × accurate
for both models and OpenRouter baai × fast. User correctly called out the
gap; ran the missing benches and updated the README.
Full matrix (1.5 MB PDF, 3-run median, local sqlite-vec, total_s):
Provider \ Docling fast balanced accurate
fastembed 39.9 89.9 85.8
openrouter openai-3s 23.8 62.8 63.2
openrouter baai/bge 26.4 64.5 63.4
Per-stage data also expanded in the README so reviewers can see exactly
where each saving comes from (parse_s vs embed_s vs insert_s).
Three observations the matrix made explicit:
1. OpenRouter shaves embed_s by 7-14× independent of Docling mode.
2. balanced ≈ accurate on digital PDFs (OCR finds nothing useful).
3. Cloud has variance — one openrouter × accurate run spiked to ~960s
embed_s on a rate-limit / network blip. The 3-run median is the
honest comparison; SLA-sensitive ingest should factor retries.
Local artifacts also updated (gitignored): KNOWLEDGE_PIPELINE.md with the
full table + caveat, knowledge_ingest_perf.pptx slide 6 now shows the
3×3 plus the per-stage breakdown for the headline rows.
…ty / robustness
Adds a client-shareable benchmarking pipeline that measures the four
dimensions a client weighs when picking a CUGA knowledge configuration
for their workload. Methodology, harness, queries, and renderer all
ship together so a client can run it on their corpus and reproduce the
numbers.
What ships
scripts/bench_knowledge_eval.py Run a (provider, model, docling_mode)
matrix on a single PDF + labeled
query set. Outputs JSON.
scripts/bench_knowledge_eval_report.py Render the JSON as 3 markdown
tables (Latency × Cost, Quality,
Decision matrix).
scripts/eval_queries.json 10 labeled queries — 3 title/
author + 4 concept + 3 detail —
tagged by ``kind`` so a client can
spot configurations that ace one
kind but bomb another.
scripts/KNOWLEDGE_EVAL_METHODOLOGY.md The reasoning behind every metric,
test-design tradeoffs, how to
adapt for a client engagement, and
the gaps v1 does not yet cover
(no LLM grader, no multi-corpus
rollup, no prod-pgvector flag).
Four metrics per cell, by design
Latency : parse_s / embed_s / insert_s / total_s + min/max spread
(3-run median by default; cloud variance is real — we
saw a 960 s outlier on openrouter × accurate).
Cost : tokens estimated from chars × per-model OpenRouter
pricing (live lookup, with hard-coded fallback).
$0 for local providers.
Quality : avg top-1 score (informational; provider-specific scale)
+ top-5 Jaccard vs a reference cell. Jaccard answers
the client's real question — "do I get the same chunks
if I switch from the trusted/expensive config to this
cheaper one?" — without needing hand-labeled ground
truth.
Robustness : ingest_status + search_status + first 3 error strings.
A cell that times out or errors mid-run is still
reported as failed, with diagnostic text.
Findings from the default 9-cell matrix on the RAG paper (2-run median,
10 queries, reference = openrouter/openai-3-small/balanced):
Fastest openrouter/openai-3-small/fast 26.8 s $0.0083
Cheapest cloud openrouter/baai-bge-base/fast 29.4 s $0.0021
Free / air-gapped fastembed/fast 36.6 s $0
Quality-preserving AND cheap openrouter/openai-3-small/fast 26.8 s $0.0083 Jaccard 0.82
Top-5 Jaccard vs reference, by cell:
openrouter / openai-3-small / balanced 1.00 (reference)
openrouter / openai-3-small / accurate 1.00 (same model, different parse)
openrouter / openai-3-small / fast 0.82 (fast skips OCR/tables; minor chunk diff)
openrouter / baai-bge-base / * 0.20-0.30
fastembed / * 0.13
This is the kind of result a client cannot get from the latency
benchmarks alone: switching providers materially changes which chunks
come back for a query. Whether that change is "worse" or "different but
equivalent" is a corpus-specific judgment the client must make. The
harness gives them the queries to spot-check.
Two bugs surfaced while building the harness
1. Dynaconf env-var caching: the first attempt set
DYNACONF_KNOWLEDGE__... env vars per cell but cuga.config was
module-cached after cell 1. Refactored to use the SDK's
``knowledge_config=`` kwarg (built for exactly this case in commit
6e106f0). The harness is now the canonical user of that kwarg.
2. Filename-based chunk identity: first eval gave Jaccard = 0 for
every non-reference cell. Root cause was the per-run uuid-tagged
copy of the source PDF (added earlier to keep each ingest on the
new-doc branch); every cell saw a different filename, so the chunk
id was unique per cell. Fixed by using content-based identity:
``p{page}:{first 120 chars of chunk text}``.
Robustness note for clients
One of the OpenRouter accurate-mode cells gave a transient "No
embedding data received" error on the first bench attempt; the rerun
succeeded. Cloud providers have variance, including occasional outlier
runs taking 10x+ the median. The harness reports min/max alongside
median so a client can see the spread and budget retries.
README now points at the harness from the Knowledge Base section with a
quick-start + summary of the default-matrix findings.
📝 WalkthroughWalkthroughThis PR adds knowledge configuration, retrieval, storage, routing, and UI changes for embedding providers, GPU support, progress reporting, client adaptation, and scoped session behavior. ChangesKnowledge platform rollout
Sequence Diagram(s)✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
|
sami-marreed
left a comment
There was a problem hiding this comment.
Review summary
Solid perf work (bulk insert, concurrent embed, docling modes, OpenRouter, live PATCH) with good unit test coverage when run manually — but several blockers before merge.
Blockers
- Merge conflicts with
main— GitHub showsmergeable: CONFLICTING(see inline onindex.html) - PR description overclaims — OCR, session isolation, live upload UX, and "387 tests" are not in the 41-file diff
- New tests not in CI — 8 new test files are not wired into
src/scripts/run_tests.sh(lines 82–89), which is what.github/workflows/tests.ymlruns - Event-loop blocking —
fut.result()in async ingest; sync embed preflight on live PATCH - Data integrity — partial insert failure can leave orphan vectors on pgvector
- Out-of-scope files — eval harness, methodology doc, personal
.gitignoreentries
Also noted (file unchanged in PR — no inline possible)
src/cuga/backend/knowledge/routes.py— upload stillawait engine._run_ingest(...)synchronously; PR body claimswait=false/ 202 / reload resume butroutes.pyhas zero diffsrc/cuga/backend/knowledge/engine.pycommit_knowledge_update—metric_changedis computed inprepare_knowledge_updatebut omitted from the return dict;manage_routes.pyexposes it inlive_changesanyway → alwaysnull. Add"metric_changed": prepared.metric_changedto the return.get_settings()— omits new fields (docling_pdf_mode, batch knobs,embedding_api_key) after live PATCH- PR still Draft — please mark ready after fixes
Happy to follow up with more comments as review continues.
Snapshot of in-progress work being merged with feat/knowledge-client-adaptation: - search_multi Option B per-source cap (limit × n_scopes, capped 100); fallback keeps Option F - BM25/SQLite FTS5 hybrid retrieval with cross-leg RRF - CID-glyph detector for unOCR'd PDF text layers - Knowledge snapshot import/export via CLI + manage routes - Knowledge UI: in-flight upload persistence, sticky non-zero badge hook - Per-stage progress timers + JSONL bench harness - Many test additions across the knowledge surface
Cherry-picked c9459a8 ("WIP: client-adaptation panel + knowledge
harness") onto a5a7d5a. Conflicts resolved with default-keep-perf,
manually layered harness features on top:
Frontend:
- New ClientAdaptationPanel.tsx (Behavior tab — rules editor + glossary)
- KnowledgeConfig.tsx: glossary entry type + adaptation fields on the
interface; new client_adaptation/glossary surfaces in the modal
- ManagePage.tsx: harness invocation surfaces
Backend:
- engine.py: layered query expansion (glossary aliases) ahead of dense
+ lexical fan-out; empty-guard retry with unexpanded query when
expansion drives dense embedding away from useful regions; cross-
encoder reranker post-fusion; preserved perf branch BM25 + drain
loop + Option B per-source cap
- awareness.py: client_adaptation block rendered top-of-summary; hash
log for observability; new fields plumbed via assemble_system_prompt_section
- config.py: client_adaptation_text + glossary + rerank_* fields;
validation (denylist, bidi-override codepoints, glossary limits);
TOML round-trip via [knowledge.client_adaptation]; secret-redaction
preserved
- cuga_lite_graph.py: backported the multi-tenant draft_knowledge_configs
dict lookup so the draft-overlay path is shared-deployment safe
- cli/main.py: kept perf's config-*/snapshot-* commands AND added
adapt's adaptation-*/glossary-*/doctor (disjoint command names —
both helper functions retained for env-var back-compat)
- knowledge_instructions.md: kept perf's narrow Scope rules; layered
"Citing Sources" section from adapt above Anti-patterns
New files brought over:
- query_expansion.py, reranker.py
- tests/integration/test_client_adaptation_e2e.py
- tests/unit/test_knowledge_client_adaptation.py
- tests/unit/test_knowledge_lever_stack.py
- docs/examples/knowledge_demo/client_adaptation_example.md
- knowledge_testing/.gitignore
Files NOT brought over:
- prepare_node.py — perf branch is pre-agent-core-refactor and never
had this layout; equivalent logic backported into cuga_lite_graph.py
(multi-tenant dict lookup, with client_adaptation_text/glossary
picked up automatically inside assemble_system_prompt_section)
Tests: 667/667 knowledge unit tests pass.
Dist bundles will be rebuilt fresh in a follow-up step.
…ld artefacts Two paths that were appearing in ``git status`` and don't belong in the tracked tree: - ``.claude/scheduled_tasks.lock`` + ``.claude/projects/`` — Claude Code's internal session state. Different per developer; never useful upstream. - ``docs/examples/**/build/`` — Python build artefacts created when running examples in isolation (``uv build`` from a sub-example dir). Pure housekeeping; no behavioural change.
…nt CUDA runtime
Closes the knowledge-engine production-readiness gap with three closely-coupled
changes audited by a 3-expert review (image / per-stage perf / deployment UX)
plus a Docling-specific deep dive.
## 1. Pareto-locked RAG profiles
Profile TOMLs now own their full effective config rather than just chunking +
prompt addendum. Each profile pins:
- embedding_model + batch_size + concurrency
- chunk_size + overlap
- docling.pdf_mode + layout_engine + drop_page_chrome
- rerank.enabled + top_k_in + model
- search.default_limit + score_threshold + max_search_attempts +
hybrid_mode + junk_filter
- engine.max_ingest_workers + vector_insert_batch_size
The matrix (speed / standard / balanced / max_quality) is locked from a 3-expert
debate on a 556-page PDF corpus; values defended in each TOML's header comment.
Wiring:
- ``from_settings`` reads the named profile's full surface; settings.toml
keys act as fallback only.
- ``coerce_and_validate`` applies the profile when ``rag_profile`` is
named in the incoming PATCH (a bare ``{"rag_profile": "X"}`` produces
a fully X-shaped config, not just a relabel).
- Code-enforced contract: ``rerank_top_k_in >= 3 * default_limit`` —
Expert B's recall-ceiling finding.
- New ``would_invalidate_vectors(other)`` helper that the PATCH-config
UX layer can use to gate a confirmation dialog.
Default rag_profile remains ``standard`` (rerank off; bge-small-en;
balanced parse) so the shipped default fits ~4GB containers.
## 2. GPU image variant (Dockerfile.gpu) + transparent CUDA
New ``Dockerfile.gpu`` on ``nvidia/cuda:12.4.1-cudnn-runtime-ubuntu24.04``
base — slim Debian lacks ``libcudnn.so.9`` / ``libgomp.so.1`` / ``libcudart.so.12``
and causes silent ORT fallback to CPU.
New ``[project.optional-dependencies] gpu = ["onnxruntime-gpu>=1.20",
"fastembed-gpu>=0.8.0"]`` (lockfile-managed). ``fastembed-gpu`` avoids the
qdrant/fastembed#608 dual-install collision where a transitive ``onnxruntime``
silently strips CUDAExecutionProvider.
Main ``Dockerfile`` reverts a broken ``--build-arg GPU=1`` block: it was
lockfile-unaware (``uv pip install --reinstall``) and any later ``uv sync``
would silently revert to CPU.
Runtime:
- ``_FastEmbedEmbeddings`` detects ``CUDAExecutionProvider`` in active
providers and forces ``batch_size=256, parallel=1`` for the embed call.
Without this, bge-small at fastembed's default ``parallel=None``
spawns one worker per CPU core, all sharing one GPU, and CUDA-context
thrash caps the speedup at 2–3× CPU.
- 3-mode startup warning distinguishes (a) GPU image with no device
visible (forgot ``--gpus all``), (b) CPU image with GPU requested
(wrong tag), (c) partial GPU (torch sees CUDA but ORT doesn't).
- ``gpu_required: bool = False`` config field + ``CUGA_GPU_REQUIRED=1``
env var: when set, the engine raises at startup instead of warning,
matching vLLM / TGI fail-fast posture.
- New ``cuga doctor`` CLI: prints nvidia-smi, ORT providers, torch
CUDA, fastembed live session providers, /dev/shm size. Copy-pasteable
for support tickets.
README: full GPU deployment section with k8s manifest example
(``nvidia.com/gpu: 1`` + tmpfs /dev/shm + ``CUGA_GPU_REQUIRED=1``),
per-stage speedup table from the audit, HF Spaces + AWS ECS notes.
## 3. Docling parse polish (per Expert D's deep dive)
- ``generate_page_images=False`` only when ``pdf_mode=fast``. Saves
~30 ms/page in PNG encoding when the user has explicitly chosen the
fast profile. Balanced + accurate keep Docling's default so downstream
UI/extension consumers that may want the rendered PNGs aren't broken.
- Warning when CUDA is engaged but ``max_ingest_workers <= 2``: per
Expert D, a single H100 fits ~10–14 Docling workers (4–6 GB VRAM each);
cuga's laptop-tuned default leaves the GPU ~80% idle.
## Tests
- 709 knowledge unit tests pass.
- 8 existing tests updated for the new profile-wins semantics: they now
point ``rag_profile`` at ``__no_such_profile__`` to verify the
settings.toml fallback path independently of profile expansion.
- ``test_publish_syncs_draft_*`` + ``test_cuga_lite_knowledge_scopes``
pre-existing fragilities resolved (apply_knowledge_config stub on the
fake engine; assertion shape updated to match the live setdefault →
get/assign refactor on thread_id injection).
## Defended in 3 expert reports
- Expert A (image / packaging): nvidia/cuda base + ``[gpu]`` extra +
fastembed-gpu eliminate 3 silent-CPU-fallback footguns found by audit.
- Expert B (per-stage GPU perf): batch_size=256 / parallel=1 on CUDA;
code-enforced ``rerank_top_k_in >= 3 * default_limit``.
- Expert C (deployment UX): refined warning + ``gpu_required`` +
``cuga doctor`` + README with platform matrix.
- Expert D (Docling parse): ``generate_page_images=False`` in fast mode;
``max_ingest_workers`` warning.
GPU image size: ~4.8 GB (CPU image stays ~1.5 GB).
End-to-end speedup on H100 vs CPU (measured baseline 6.1 min, 556-page PDF):
~10× with multilingual-e5-large; ~3–6× with bge-small-en (launch-overhead
bound at small models). Tesseract OCR + sqlite-vec/pgvector + BM25 remain
CPU floors regardless.
Ships a small, opinionated corpus that operators can ingest immediately
to evaluate the knowledge engine end-to-end without sourcing their own
documents:
- ``agent_level/`` — 4 evergreen HR-style policies (employee handbook,
health insurance plan, PTO policy, 401k retirement plan). These map to
cuga's ``agent_level`` scope (permanent, shared across conversations).
- ``session_level/`` — 3 user-scoped documents for a fictional persona
(Sarah Chen): benefits enrollment, paystub, PTO balance. These map to
cuga's ``session_level`` scope (transient, per-conversation) and
demonstrate the per-thread scope-isolation contract.
- ``build_deck.py`` + ``cuga_knowledge_demo.pptx`` — small Python
pptx-build script and the rendered demo deck (~80KB) for screenshare
walkthroughs.
Total footprint ~100KB. All content is fictitious; useful for evaluation,
QA, and onboarding new cuga users to the knowledge feature.
…edge engine" This reverts commit e1438f2.
…-bottlenecks # Conflicts: # src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py # src/cuga/backend/knowledge/client.py # src/cuga/frontend/dist/index.html # src/frontend_workspaces/agentic_chat/src/App.tsx
…merge The merge with origin/main pulled in 27bd35e (extract shared agent logic into cuga_agent_core) which moved ~2300 lines out of cuga_lite_graph.py. Three of our branch's semantic improvements would have been lost without this fix: 1. ``helpers/knowledge.py::_get_knowledge_tool_scope_context`` — restore the synthetic ``"all"`` scope and ``"session"`` default. Main's refactored helper returned only ``("agent", "session")`` with default ``"agent"``; our branch had evolved to ``("all", "agent", "session")`` with default ``"session"`` so the LLM defaults to the narrowest plausible query and the engine auto-falls back to ``"all"`` if session is empty (the wiring lives in ``KnowledgeClient.search`` / ``_run_multi``). Without this restore, multi-scope deployments lose the precision gain that the perf branch shipped. 2. ``helpers/knowledge.py::_knowledge_scope_instruction`` — update the both-scopes-wired branch to describe the ``"all"`` scope explicitly so the LLM tool prompt matches the actual scope set returned in (1). 3. ``adapter/prepare_node.py`` — when assembling the draft-overlay knowledge config for ``--draft`` agents, prefer the per-agent ``draft_knowledge_configs`` dict before falling back to the legacy singular ``draft_knowledge_config`` attribute. Main's refactored prepare_node read only the singular attribute, which is multi-tenant unsafe (two draft agents on the same host would clobber each other). The dict lookup mirrors what cuga_lite_graph.py had pre-refactor. 4. ``test_knowledge_multi_scope_search.py`` — update the import path for ``_get_knowledge_tool_scope_context`` from ``cuga_lite_graph`` to ``cuga_lite.helpers.knowledge`` (the new home post-refactor). All 709 knowledge unit tests pass.
After the merge with origin/main pulled in the cuga_agent_core refactor,
CI's ``ruff check`` flagged 40 lint errors across pre-existing files
(31 auto-fixable, 9 manual). All are unused-imports / unused-locals /
semicolon-separated statements — no behavioural change.
- tests/unit/test_knowledge_use_gpu_surfaces.py + a dozen other test
files: ``os``, ``patch``, etc. imported but unused.
- tests/unit/test_knowledge_layout_engine.py:429-430: split two
semicolon-joined statements onto their own lines.
- src/cuga/backend/knowledge/engine.py:2473: drop unused
``rerank_in`` assignment.
- scripts/bench_knowledge_eval.py:242,248: drop unused
``last_ingest_filename`` + ``docs``.
ruff check now passes cleanly. 709 knowledge unit tests still pass.
… failure CI's ``ruff format --check`` flagged 38 files needing format normalization (extra blank lines, trailing-comma fixups, multi-line call formatting). Applied ``uv run ruff format .`` to bring the tree in line with the upstream code style. No behavioural change; 709 knowledge unit tests still pass.
…sion Background: CI's ``uv sync --all-extras --dev`` installs every extra side-by-side. With ``fastembed`` (base) AND ``fastembed-gpu`` (in the previous ``[gpu]`` extra) both installed, they clobber each other on disk — the same qdrant/fastembed#608 issue we were trying to avoid. Net effect under CI: ``ModuleNotFoundError: No module named 'fastembed'`` on every integration test that imports the engine. Fix: - Remove ``fastembed-gpu`` from ``[project.optional-dependencies] gpu``. Only ``onnxruntime-gpu`` stays. ``fastembed-gpu`` was a nicety; the Dockerfile.gpu can swap the underlying ORT directly. - ``Dockerfile.gpu`` now does ``uv pip uninstall onnxruntime`` between ``uv sync --extra gpu`` (which adds ``onnxruntime-gpu``) and the cu121 torch reinstall. Two ORT packages on disk silently strips CUDAExecutionProvider — the uninstall keeps it clean. - ``uv.lock`` re-locked to remove ``fastembed-gpu``. CI's ``uv sync --all-extras`` now resolves cleanly. 709 knowledge unit tests still pass.
… collision ``uv sync --all-extras --dev`` installs every optional dep, including the new ``[gpu]`` extra. ``onnxruntime-gpu`` (in [gpu]) and ``onnxruntime`` (transitive of base fastembed) then coexist on disk and clobber each other — ``import onnxruntime`` raises ``AttributeError: module 'onnxruntime' has no attribute 'SessionOptions'``. The qdrant/fastembed #608 and microsoft/onnxruntime #21825 issues both document the collision. Fix: ``--no-extra gpu`` in the workflow. CI never needs the GPU runtime — the GPU path is exercised in Dockerfile.gpu only.
CodeQL flagged two new alerts after the merge with origin/main: 1. HIGH ``js/polynomial-redos`` — ``src/frontend_workspaces/agentic_chat/src/ClientAdaptationPanel.tsx:309`` ``value.replace(/\s+$/, "")`` is the classic polynomial-backtracking pattern; on adversarial inputs the regex engine spends quadratic time trying to match. Replaced with ``value.trimEnd()`` — O(n), same semantics, intent-revealing. 2. MEDIUM ``py/stack-trace-exposure`` — ``src/cuga/backend/server/manage_routes.py:979`` The embedding-test endpoint's validation error path returned ``str(e)`` unscrubbed. ``KnowledgeConfig.validate()`` messages are operator-authored strings, but if a user accidentally pastes their API key into the model field the error could echo it back. Now uses the same ``_scrub_secret(str(e), api_key)`` the embed-test path two blocks below already uses — consistent secret-scrub posture across the endpoint. ruff clean, 709 knowledge unit tests pass.
…ors, CI tests, scope cleanup) Aggregate response to PR #297 review comments (Sami, Dec 2026). ## Blockers fixed 1. **Event-loop blocking in async ingest** — engine.py:2082 ``fut.result(timeout=2.0)`` was a blocking call inside the asyncio thread; with ~10 progress emits the drain stalled other requests for up to ~20s. Replaced with ``await asyncio.wait_for(asyncio.wrap_future(fut), timeout=2.0)``. 2. **Event-loop blocking in PATCH /knowledge/config** — manage_routes.py ``live_engine.apply_knowledge_config(filtered)`` was called sync in an async route. On embedding-provider/model change it runs ``embed_query("test")`` (network round-trip) on the event loop. Wrapped in ``await asyncio.to_thread(...)``. 3. **Partial insert can orphan vectors on pgvector** — storage/adapter.py Each ``add_many`` sub-batch is its own pgvector transaction; if batch N+1 fails, batches 1..N stay committed and the next retry appends duplicates (because ``delete_by_source`` only runs when metadata exists, and first-time failures never write metadata). Added compensating delete loop inside ``_add_all``: tracks ``source_for_rollback`` on first chunk and, on any subsequent batch failure, runs the same ``list`` / ``delete`` / ``_fts_delete_by_source`` loop as the adapter's public ``delete_by_source`` (inlined to stay on the current event loop — calling the public helper would ``_run_embedding_coro`` a nested loop and deadlock). 4. **New tests not in CI** — src/scripts/run_tests.sh The explicit allowlist of ~8 knowledge tests at lines 82-89 silently excluded ~30+ test modules added since. Switched to ``tests/unit/test_knowledge_*.py`` glob + named the few non-glob companions (session/chat/manage_publish/sdk_knowledge_config). All knowledge tests now run under ``./src/scripts/run_tests.sh unit_tests``. ## Also-noted bugs 5. **commit_knowledge_update omitted metric_changed from return** — engine.py:3197. ``manage_routes.py`` already exposes it in ``live_changes`` so the field was always null. Added ``"metric_changed": prepared.metric_changed`` to the return. 6. **get_settings() omitted new fields** — engine.py:3224. After live PATCH the SDK / UI / CLI couldn't read back ``docling_pdf_mode``, ``embedding_batch_size``, ``embedding_concurrency``, ``vector_insert_batch_size``, ``embedding_api_key`` (presence-only), ``rerank_*``, ``default_limit``, ``max_search_attempts``, ``max_ingest_workers``, ``docling_layout_engine``, ``gpu_required``. All now surfaced. Secret field exposed as ``embedding_api_key_set: bool`` only — the actual key never leaves the engine. All new fields read via ``getattr(self._config, name, default)`` so existing SimpleNamespace mocks in tests don't break. ## CodeQL hardening 7. **manage_routes.py:978-1009** — strengthened the py/stack-trace-exposure response: ``logger.exception`` logs the full server-side traceback; client receives only validator-authored messages on a known prefix whitelist (operators NEED these to fix their config), and falls back to a generic "Invalid knowledge embedding configuration." for any other exception shape so internals never leak. ## Scope cleanup (Sami: out-of-scope for this PR) 8. Removed eval harness scripts: - scripts/KNOWLEDGE_EVAL_METHODOLOGY.md - scripts/bench_knowledge_eval.py - scripts/bench_knowledge_eval_report.py - scripts/eval_queries.json - scripts/test_groq_connectivity.py - scripts/test_openrouter_connectivity.py 9. Removed personal gitignore entries for ``knowledge_ingest_perf.pptx`` + ``KNOWLEDGE_INGEST_PERF_SLIDES.md``. Generic ``~$*.pptx`` / ``~$*.docx`` / ``~$*.xlsx`` lock-file patterns kept since those are repo-wide useful. 10. README — trimmed the 3×3 provider×Docling latency matrix + Top-5 Jaccard quality table + eval-harness section (~80 lines). Provider-choice table + reindex-on-dim-change warning kept (real user docs). Tests: 709 knowledge unit tests pass. ruff check + ruff format clean.
There was a problem hiding this comment.
Actionable comments posted: 1
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (8)
Dockerfile.gpu (1)
27-85:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRun the container as a non-root user.
The image never drops privileges (
USERis missing), so runtime executes as root. Please add a dedicated unprivileged user and switch beforeCMD.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Dockerfile.gpu` around lines 27 - 85, Create and switch to a non-root user in the Dockerfile: add a dedicated unprivileged user (e.g., "cuga") and group, set a HOME, chown the app workspace and any files copied into /app/cuga_workspace (references: the RUN mkdir -p /app/cuga_workspace and multiple COPY lines), and add a USER cuga (and optionally WORKDIR/HOME) before the final CMD so the container runs unprivileged rather than as root. Ensure the new user has permission to read/write needed files and the PATH (ENV PATH) remains usable for that user.Source: Linters/SAST tools
src/cuga/backend/knowledge/auth.py (1)
164-191:⚠️ Potential issue | 🟠 Major | ⚡ Quick winNormalize
thread_idonce and reuse it for ownership + collection resolution.Line 170 validates stripped content, but Lines 182, 187, and 191 still use the raw value.
"abc"and" abc "currently map to different session/collection identities.Suggested fix
if scope == "session": + normalized_thread_id = (identity.thread_id or "").strip() # Whitespace-safe check. ``" "`` is truthy in Python but it # sanitizes to ``"___"`` and would create a phantom collection # ``kb_sess____`` shared across every misconfigured caller. # Strip before the truthiness test so any "no real thread id" # input is rejected uniformly. - if not (identity.thread_id or "").strip(): + if not normalized_thread_id: raise HTTPException(status_code=400, detail="X-Thread-ID required") @@ provider.get_or_create_session( - identity.thread_id, + normalized_thread_id, user_id=identity.user_id, tenant_id=identity.tenant_id, ) if not provider.check_session_access( - identity.thread_id, identity.user_id, identity.tenant_id + normalized_thread_id, identity.user_id, identity.tenant_id ): raise HTTPException(status_code=403, detail="access denied to session") - return f"kb_sess_{_sanitize(identity.thread_id)}" + return f"kb_sess_{_sanitize(normalized_thread_id)}"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/backend/knowledge/auth.py` around lines 164 - 191, Normalize and reuse a stripped thread_id instead of calling identity.thread_id repeatedly: create a local variable like stripped_thread_id = (identity.thread_id or "").strip(), validate it and raise HTTPException if empty, then pass stripped_thread_id to provider.get_or_create_session and provider.check_session_access and call _sanitize(stripped_thread_id) for the returned collection name; ensure all uses in this scope (the block handling scope == "session" in auth.py) reference that single normalized variable.src/cuga/backend/knowledge/config.py (1)
633-649:⚠️ Potential issue | 🟠 Major | ⚡ Quick winInclude Docling ingest knobs in
vector_config_hash().
would_invalidate_vectors()currently missesdocling_pdf_mode,docling_layout_engine, anddocling_drop_page_chrome, even though those settings change the extracted text/chunks that get embedded and stored. A profile switch likestandard(balanced+dry_run) to a stricter parsing profile can therefore keep the same hash and skip the re-ingest prompt while serving stale vectors.Suggested fix
key = ( f"{sm}|{self.embedding_provider}|{self.embedding_model}|" - f"{self.chunk_size}|{self.chunk_overlap}|{self.metric_type}" + f"{self.chunk_size}|{self.chunk_overlap}|{self.metric_type}|" + f"{self.docling_pdf_mode}|{self.docling_layout_engine}|" + f"{self.docling_drop_page_chrome}" )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/backend/knowledge/config.py` around lines 633 - 649, vector_config_hash() currently omits docling ingest knobs so profile changes (e.g., docling_pdf_mode, docling_layout_engine, docling_drop_page_chrome) won't flip the vector hash; update the function (vector_config_hash) to include these three settings from cuga.config.settings (accessed the same way as storage.mode) in the key composition so the hash reflects any change to docling_pdf_mode, docling_layout_engine, and docling_drop_page_chrome and triggers re-ingest detection consistent with would_invalidate_vectors().src/cuga/backend/knowledge/routes.py (1)
534-695:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
wait=falsepath does not actually return 202.The handler documents async semantics for
wait=false, but it never sets the HTTP status to 202, so clients still receive 200 and can mis-handle polling/state transitions.Suggested fix
-from fastapi import APIRouter, Depends, File, Form, HTTPException, Query, UploadFile, Request +from fastapi import APIRouter, Depends, File, Form, HTTPException, Query, UploadFile, Request, Response @@ async def upload_documents( request: Request, + response: Response, files: list[UploadFile] = File(...), @@ - if len(results) == 1: + if not wait: + response.status_code = 202 + if len(results) == 1: return results[0] return {"tasks": results}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/backend/knowledge/routes.py` around lines 534 - 695, The async branch for wait=false currently builds result payloads but still returns normal 200; update upload_documents to return a 202 status for the fire-and-forget path by returning a Response/JSONResponse with status_code=202 (preserve the same body built from results/_enrich_task). Specifically, after assembling results in the else branch (symbols: wait, results, _enrich_task) and at the final return site, return JSONResponse(status_code=202, content=results[0] if len(results)==1 else {"tasks": results}) or equivalent so single-file and multi-file flows both send HTTP 202 for wait=false. Ensure you import JSONResponse/Response from fastapi.src/frontend_workspaces/agentic_chat/src/KnowledgeSidePanel.tsx (1)
175-180:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winDon’t render the empty state until agent docs are authoritative.
agentDocs.length === 0now means both “no docs” and “still resolving”. Right after first load or publish, this will briefly show “No agent documents” even though the badge is still loading. Pass throughloadedOnce/isResolvingfromuseAgentKnowledgeDocsand render a loading state until the hook settles.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/frontend_workspaces/agentic_chat/src/KnowledgeSidePanel.tsx` around lines 175 - 180, The component KnowledgeSidePanel currently shows "No agent documents" whenever agentDocs.length === 0, which conflates "no docs" with "still resolving"; update the rendering logic to use the hook's settled state (pass through and read loadedOnce and/or isResolving from useAgentKnowledgeDocs) and show a loading placeholder while isResolving or loadedOnce is false, only rendering the empty state when loadedOnce is true and agentDocs.length === 0; adjust the conditional around agentDocs/agentScopeEnabled accordingly so the UI waits for the hook to settle before showing "No agent documents".src/frontend_workspaces/frontend/src/ManagePage.tsx (1)
1028-1105:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAbort or sequence knowledge autosave requests.
This debounce only cancels the timer, not the previous PATCH once it has started. Since this endpoint updates the draft and the live engine, an older request can finish after a newer one and roll the server back to stale
knowledgeConfigwhile the form still shows the latest edits. Guard this with anAbortControlleror a monotonic request id, and ignore/toast only the latest response.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/frontend_workspaces/frontend/src/ManagePage.tsx` around lines 1028 - 1105, The autosave useEffect (watching knowledgeConfig and effectiveAgentId) can let overlapping PATCH requests finish out-of-order; modify it so each scheduled save uses an AbortController or an incrementing sequence id tied to the effect: create a controller/seq stored on skipDraftSaveRef (or a new ref) and pass controller.signal to api.patchManageConfigDraftKnowledge (or include the seq in the request/context), aborting the previous controller before starting a new timeout and ignoring responses whose seq doesn't match the latest ref (or treating aborted requests as no-ops); update error/toast handling to skip toasts for aborted/stale responses while preserving current handling for real network/errors and server 4xx responses.tests/integration/test_client_adaptation_e2e.py (1)
255-283:⚠️ Potential issue | 🟠 Major | ⚡ Quick winTool-path test does not execute any tool call.
The test description says it validates
KnowledgeClient.get_langchain_tools()end-to-end, but it never invokes a returned tool—onlyadapted_engine.search()andlen(tools). This leaves wrapper-call regressions undetected.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/test_client_adaptation_e2e.py` around lines 255 - 283, The test never invokes the langchain tool produced by KnowledgeClient.get_langchain_tools(), so add an execution step in test_client_consumes_search_via_langchain_tool that calls one of the returned tools (from tools, e.g., tools[0]) with the same or similar query used earlier ("severance code 3 process") and awaits its result (use the tool's async run/call API as appropriate), then assert the tool returned non-empty search results; reference the test function name test_client_consumes_search_via_langchain_tool and the method KnowledgeClient.get_langchain_tools so you update the existing tools variable rather than creating a new one.tests/unit/test_knowledge_rag_scope_failure_fix.py (1)
784-824:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winClose
KnowledgeEngineintest_search_multi_failing_case_session_not_drowned.This test creates a real engine and exits without shutdown, unlike nearby tests. Add
try/finallywitheng.shutdown()to avoid leaking resources into subsequent tests.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/test_knowledge_rag_scope_failure_fix.py` around lines 784 - 824, The test test_search_multi_failing_case_session_not_drowned creates a real KnowledgeEngine instance (eng = KnowledgeEngine(cfg)) but never shuts it down; wrap the test body that uses eng (including the monkeypatch and asyncio.run call) in a try/finally and call await eng.shutdown() or eng.shutdown() in the finally block (matching other tests' shutdown pattern) so the KnowledgeEngine is closed; reference the KnowledgeEngine constructor (KnowledgeEngine(cfg)), the eng variable, and the shutdown method (eng.shutdown()) when adding the try/finally.
🟠 Major comments (23)
scripts/bench_knowledge_eval.py-300-325 (1)
300-325:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPreserve query alignment before Jaccard overlap computation.
No-hit and exception paths skip appending to
top_chunk_ids_per_query, sozip(ref_chunks, cell_chunks)can compare different queries and produce incorrect overlap scores.Proposed fix
for q in queries: try: hits = await agent.knowledge.search(q["query"], scope="agent", limit=5) if not hits: out.queries_failed += 1 + top_chunk_ids_per_query.append([]) continue out.queries_answered += 1 @@ top_chunk_ids_per_query.append(ids) except Exception as q_err: out.errors.append(f"search '{q['query'][:40]}…' failed: {type(q_err).__name__}: {q_err}") out.queries_failed += 1 + top_chunk_ids_per_query.append([])Also applies to: 369-375
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/bench_knowledge_eval.py` around lines 300 - 325, The loop building top_chunk_ids_per_query sometimes skips appending when there are no hits or an exception, breaking alignment with queries and causing incorrect Jaccard comparisons; ensure you always append a placeholder (e.g., empty list or sentinel) to top_chunk_ids_per_query for every query even when hits is empty or in the except block, while still incrementing out.queries_failed and recording out.errors as before; make the same change in the similar block around the other occurrence (the block referenced at lines 369-375) so ref_chunks and cell_chunks remain index-aligned for zip-based overlap computation.scripts/bench_knowledge_query_compare.py-61-77 (1)
61-77:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPer-cell temp directories are never cleaned up.
Both ingest input temp dirs and per-cell persist dirs remain on disk after completion.
Proposed fix
-async def _ingest_once(cell, pdf): +async def _ingest_once(cell, pdf): @@ - return agent, cfg + return agent, cfg, persist, tmpdir @@ - agent, _ = await _ingest_once(cell, pdf) + agent, _, persist_dir, input_tmpdir = await _ingest_once(cell, pdf) try: results = await _run_queries(agent, queries) results_per_cell.append(results) finally: try: await agent.aclose() except Exception: pass + shutil.rmtree(persist_dir, ignore_errors=True) + shutil.rmtree(input_tmpdir, ignore_errors=True)Also applies to: 153-161
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/bench_knowledge_query_compare.py` around lines 61 - 77, The per-cell temporary directories created as persist = Path(tempfile.mkdtemp(...)) and tmpdir = Path(tempfile.mkdtemp(...)) are never removed; wrap the per-cell processing that uses KnowledgeConfig, dest, and the copied PDF in a try/finally (or context manager) and call shutil.rmtree(persist, ignore_errors=True) and shutil.rmtree(tmpdir, ignore_errors=True) in the finally block to ensure both the persist_dir and input temp directory are deleted after each cell's work completes (also apply same cleanup for the other similar block around the 153-161 code using the same symbols).scripts/bench_knowledge_eval.py-185-194 (1)
185-194:⚠️ Potential issue | 🟠 Major | ⚡ Quick winTemporary directories are leaked per run/cell.
_make_unique_pdf()and per-cellpersist_dirallocatemkdtemp()directories but never remove them. Repeated benchmarks will steadily consume disk in/tmp.Proposed fix
async def _run_cell(...): @@ - persist_dir = Path(tempfile.mkdtemp(prefix=f"cuga-eval-{provider.replace('/', '-')}-")) + persist_dir = Path(tempfile.mkdtemp(prefix=f"cuga-eval-{provider.replace('/', '-')}-")) @@ - for i in range(runs): - pdf_unique = _make_unique_pdf(pdf) + for i in range(runs): + pdf_unique = _make_unique_pdf(pdf) cell_cfg = _build_cell_config(provider, model, docling_mode, persist_dir) agent = CugaAgent(enable_knowledge=True, knowledge_config=cell_cfg) try: @@ finally: try: await agent.aclose() except Exception: pass + import shutil + shutil.rmtree(pdf_unique.parent, ignore_errors=True) @@ - return out + import shutil + shutil.rmtree(persist_dir, ignore_errors=True) + return outAlso applies to: 209-210
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/bench_knowledge_eval.py` around lines 185 - 194, _temp directories created by mkdtemp in _make_unique_pdf and where persist_dir is set are never cleaned up, causing leaked /tmp files across runs; modify _make_unique_pdf and the code that creates per-cell persist_dir to either use tempfile.TemporaryDirectory() (or register cleanup via context managers) or explicitly remove the created directories when they are no longer needed (e.g., call shutil.rmtree on the tmp path after ingestion/processing completes or on process exit). Locate the _make_unique_pdf function and the places assigning persist_dir, replace mkdtemp usage with TemporaryDirectory or add deterministic cleanup logic (ensure you reference _make_unique_pdf, persist_dir, tempfile.mkdtemp, and shutil.rmtree in your changes).scripts/bench_knowledge_ingest.py-123-137 (1)
123-137:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
chunk_count_medianis read from the wrong location.
chunk_countis consumed from the record root elsewhere, but summary reads it fromtimings, so the median can be wrong/empty.Proposed fix
summary = { "label": label, "runs": len(records), "stack": records[-1].get("stack"), "embedding_provider": records[-1].get("embedding_provider"), - "chunk_count_median": _med("chunk_count"), + "chunk_count_median": ( + round( + statistics.median( + float(r["chunk_count"]) + for r in records + if isinstance(r.get("chunk_count"), (int, float)) + ), + 3, + ) + if any(isinstance(r.get("chunk_count"), (int, float)) for r in records) + else None + ), "parse_s_median": _med("parse_s"),🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/bench_knowledge_ingest.py` around lines 123 - 137, summary currently computes "chunk_count_median" by calling _med("chunk_count"), which looks up chunk_count inside each record's "timings" dict, but chunk_count is stored at the record root elsewhere; update the summary creation to compute chunk_count_median from the root values instead of _med("chunk_count"). Iterate records, extract each record.get("chunk_count") (or skip non-numeric), collect floats, and compute median (rounded to 3 decimals) similar to _med, or add a small helper _med_root(name) that reads record.get(name) and use that for "chunk_count_median" while keeping parse_s_median and others that are actually in timings unchanged.scripts/bench_knowledge_eval_report.py-102-111 (1)
102-111:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDecision matrix can recommend failed cells.
Selection currently runs across all cells, including failed ingest/search results. This can output unusable recommendations.
Proposed fix
- fastest = min(cells, key=lambda c: c.get("median_total_s") or 1e9, default=None) + successful = [ + c for c in cells + if c.get("ingest_status") == "success" and c.get("search_status") != "failed" + ] + fastest = min(successful, key=lambda c: c.get("median_total_s") or 1e9, default=None) @@ - (c for c in cells if (c.get("cost_usd_estimated") or 0) > 0), + (c for c in successful if (c.get("cost_usd_estimated") or 0) > 0), @@ - free_options = [c for c in cells if (c.get("cost_usd_estimated") or 0) == 0] + free_options = [c for c in successful if (c.get("cost_usd_estimated") or 0) == 0] @@ - high_quality = [c for c in cells if (c.get("topk_overlap_vs_reference") or 0) >= 0.8] + high_quality = [c for c in successful if (c.get("topk_overlap_vs_reference") or 0) >= 0.8]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/bench_knowledge_eval_report.py` around lines 102 - 111, The selection logic currently computes fastest/cheapest/free/high_quality across all entries in cells, which can include failed ingest/search results; before the existing calculations (variables fastest, cheapest_nonzero, free_options, free_fastest, high_quality, high_quality_fastest) filter out failed rows by creating a sanitized list, e.g. successful_cells = [c for c in cells if not (c.get("failed") or c.get("error")) and (c.get("median_total_s") is not None or c.get("cost_usd_estimated") is not None or c.get("topk_overlap_vs_reference") is not None)]; then replace uses of cells with successful_cells so the min() and list comprehensions only consider valid results (ensure all references to cells in that block are updated to successful_cells).src/cuga/backend/cuga_graph/nodes/cuga_lite/adapter/prepare_node.py-538-543 (1)
538-543:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid cross-agent fallback when per-agent draft config map exists.
On Line 541, the fallback to
draft_knowledge_configstill runs even whendraft_knowledge_configsexists but has no entry for this agent. That can apply another agent’s draft search config in multi-tenant draft runs.Suggested fix
- _cfgs = getattr(_das, "draft_knowledge_configs", None) - if isinstance(_cfgs, dict): - _draft_kc = _cfgs.get(_base_aid) - if _draft_kc is None: - # Legacy singular attribute fallback. - _draft_kc = getattr(_das, "draft_knowledge_config", None) + _cfgs = getattr(_das, "draft_knowledge_configs", None) + if isinstance(_cfgs, dict): + _draft_kc = _cfgs.get(_base_aid) + else: + # Legacy singular attribute fallback only when per-agent map is unavailable. + _draft_kc = getattr(_das, "draft_knowledge_config", None)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/adapter/prepare_node.py` around lines 538 - 543, Current code falls back to draft_knowledge_config even when draft_knowledge_configs exists but lacks an entry for this agent, risking cross-agent config leakage; change the logic around draft_knowledge_configs/draft_knowledge_config so the singular fallback runs only when draft_knowledge_configs is absent or not a dict. Specifically, in the block referencing _das, draft_knowledge_configs, _draft_kc and _base_aid, first check if hasattr(_das, "draft_knowledge_configs") and isinstance(_cfgs, dict) and use only _cfgs.get(_base_aid); do NOT call getattr(_das, "draft_knowledge_config", None) when draft_knowledge_configs is a dict (even if .get returns None); only use the legacy singular draft_knowledge_config when draft_knowledge_configs is missing or not a dict.src/cuga/backend/knowledge/client.py-272-278 (1)
272-278:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't log raw user queries at INFO.
This fallback path logs the first 60 characters of every search query. Knowledge searches are likely to contain user data, document excerpts, or secrets, and INFO logs are usually retained centrally. Prefer scope/thread metadata only, or log a short hash if you need correlation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/backend/knowledge/client.py` around lines 272 - 278, The current info log call using _stdlib_logging.getLogger("cuga.knowledge").info includes the raw query (query[:60]); remove raw user content and instead log only non-sensitive metadata and a short, stable correlation token: compute a short hash (e.g., SHA256 or SHA1 and take first 8-12 hex chars) of the full query and log that token along with thread_id and other metadata, or omit the query entirely; update the call that references (thread_id or "") and (query or "") to use the hash variable (or drop the query field) so no plaintext query data is written to INFO logs. Ensure hashing is performed deterministically before the log statement and reference the existing logger call _stdlib_logging.getLogger("cuga.knowledge").info to locate and replace the code.src/cuga/backend/knowledge/client.py-133-152 (1)
133-152:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftUse a collision-free collection key, not lossy
_sanitize().
_sanitize()maps every non-[A-Za-z0-9_]character to_, so distinct ids likethread-1,thread 1, andthread_1all resolve to the same collection. That can merge documents across sessions or agents and defeats the isolation guarantees this PR is adding.A reversible encoding or a stable hash-derived key is safer here than character replacement.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/backend/knowledge/client.py` around lines 133 - 152, The current _sanitize(...) is lossy and causes collisions (e.g., "thread-1" == "thread_1"), so replace the lossy behavior with a stable, collision-resistant encoding when building collection names: implement a new helper (or change _sanitize) to return a short stable hash (e.g., sha256 hex or urlsafe base64 of the original id) and use that in _agent_collection() (for self._default_agent_id) and in _resolve_collection() (for session thread_id) instead of character-replacement; ensure the returned token contains only safe filename/collection characters and preserve the existing _agent_collection_hash behavior so final names remain deterministic and unique (reference: _sanitize, _agent_collection, _resolve_collection, _agent_collection_hash, _default_agent_id).src/cuga/sdk.py-1670-1684 (1)
1670-1684:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
knowledge_configoverride is only partially applied (injection gate still ignores it).The new override is resolved for engine creation, but auto-injection enablement is still driven by settings (when
enable_knowledgeisNone). This can inject tools when override says disabled—or skip them when override says enabled.Suggested fix
+ def _resolve_knowledge_config(self): + from cuga.backend.knowledge.config import KnowledgeConfig + from cuga.config import settings + + override = self._knowledge_config_override + if override is None: + return KnowledgeConfig.from_settings(settings) + if isinstance(override, KnowledgeConfig): + override.validate() + return override + return KnowledgeConfig.coerce_and_validate(dict(override)) @@ - else: - kb_config = KnowledgeConfig.from_settings(settings) - kb_enabled = kb_config.enabled + else: + kb_enabled = self._resolve_knowledge_config().enabled @@ - # Resolve config: explicit override > settings.toml fallback. - override = self._knowledge_config_override - if override is None: - config = KnowledgeConfig.from_settings(settings) - elif isinstance(override, KnowledgeConfig): - ... - else: - config = KnowledgeConfig.coerce_and_validate(dict(override)) + config = self._resolve_knowledge_config()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/sdk.py` around lines 1670 - 1684, The override in _knowledge_config_override is used to build the KnowledgeEngine but the auto-injection decision still reads from settings, causing mismatch; after you resolve "config" (via KnowledgeConfig.from_settings, validate, or coerce_and_validate) use config.enable_knowledge (or the equivalent property on KnowledgeConfig) as the source for the injection gate instead of reading enable_knowledge from settings when an override was provided, and update the injection-enablement check (where the injection gate is evaluated) to prefer config.enable_knowledge so the override consistently controls whether auto-injection runs.src/cuga/backend/server/manage_routes.py-1313-1335 (1)
1313-1335:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftDon't block the PATCH request on full auto-reindex work.
When
dim_changedis true, this route now awaitslive_engine.reindex()for every collection before returning. On the large corpora this PR targets, a settings save can now scale with total corpus size and easily hit proxy/request timeouts after the draft has already been persisted.Queue the reindex as background work and return a pending status/job handle instead of doing corpus-wide work inline.
src/cuga/backend/server/manage_routes.py-1167-1175 (1)
1167-1175:⚠️ Potential issue | 🟠 Major | ⚡ Quick winClassify auth failures from the merged config, not the PATCH delta.
This branch inspects
knowledge.get(...), which only contains fields sent in the current PATCH. If the client changes justembedding_model,filteredcan still contain an existingembedding_provider/embedding_api_key, but_providerand_user_supplied_keyare computed as empty here. That makes the soft-fail vs hard-fail path take the wrong branch for partial updates.Suggested fix
- _user_supplied_key = bool((knowledge.get("embedding_api_key") or "").strip()) - _provider = (knowledge.get("embedding_provider") or "").lower() + _user_supplied_key = bool((filtered.get("embedding_api_key") or "").strip()) + _provider = str(filtered.get("embedding_provider") or "").lower()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/backend/server/manage_routes.py` around lines 1167 - 1175, The auth-error classification is reading provider/key from the PATCH delta `knowledge` (which may be partial) instead of the merged configuration; update the logic so `_provider`, `_user_supplied_key`, and `_is_credentialed` are computed from the merged config (e.g., use the existing merged/filtered dict already used elsewhere rather than `knowledge`), keeping the same stripping/lowering and `_looks_like_auth` checks, so the soft-fail vs hard-fail branch is based on the full current settings.src/cuga/backend/server/main.py-718-724 (1)
718-724:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPublished knowledge config is still skipped after restart when settings boot with knowledge disabled.
This block only restores
_saved_knowledgewhenapp_state.knowledge_enginealready exists. Ifsettings.tomlstarts withknowledge.enabled=falsebut the published config had knowledge enabled, startup leaves_live_engineasNoneand silently skips the restore path, so the published provider/model/layout never comes back after a restart.Suggested fix
- _saved_knowledge = (_startup_config or {}).get("knowledge") or {} - _live_engine = getattr(app_state, "knowledge_engine", None) - if _saved_knowledge and _live_engine is not None: + _saved_knowledge = (_startup_config or {}).get("knowledge") or {} + _live_engine = getattr(app_state, "knowledge_engine", None) + if _saved_knowledge and _live_engine is None: + from cuga.backend.knowledge.config import KnowledgeConfig as _KC + + _saved_cfg = _KC.coerce_and_validate( + {k: v for k, v in _saved_knowledge.items() if not k.startswith("_")} + ) + if _saved_cfg.enabled: + await initialize_knowledge_engine(app_state, _saved_cfg) + _live_engine = getattr(app_state, "knowledge_engine", None) + if _saved_knowledge and _live_engine is not None: # Strip noise that's not part of the public config shape. _saved_kb = {k: v for k, v in _saved_knowledge.items() if not k.startswith("_")}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/backend/server/main.py` around lines 718 - 724, The restore logic only applies saved knowledge when app_state.knowledge_engine already exists, so if settings boot with knowledge disabled the published config is never restored; change the block handling _saved_knowledge so that if _saved_kb exists and _live_engine is None you initialize or enable the knowledge engine using the same startup/factory path used elsewhere (so app_state.knowledge_engine becomes a real engine) and then call _live_engine.apply_knowledge_config(_saved_kb); keep the existing stripping of keys (the _saved_kb creation) and reuse apply_knowledge_config to apply the restored config.src/cuga/cli/main.py-201-203 (1)
201-203:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUnify knowledge CLI base URL resolution
config-*/snapshot-*useCUGA_API_BASE(default:8005), butadaptation-*/glossary-*/doctoruseCUGA_HOST+CUGA_PORT(default:8000) and ignoreCUGA_API_BASE. This split makes commands in the samecuga knowledgegroup target different backends by default.Suggested fix
def _cuga_server_base_url() -> str: - port = os.environ.get("CUGA_PORT") or "8000" - host = os.environ.get("CUGA_HOST") or "127.0.0.1" - return f"http://{host}:{port}" + api_base = os.environ.get("CUGA_API_BASE") + if api_base: + return api_base.rstrip("/") + port = os.environ.get("CUGA_PORT") or "8000" + host = os.environ.get("CUGA_HOST") or "127.0.0.1" + return f"http://{host}:{port}"Also applies to: 441-455
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/cli/main.py` around lines 201 - 203, Unify CLI backend resolution by making the existing helper _knowledge_api_base() the single source of truth: change it to first read CUGA_API_BASE (default "http://127.0.0.1:8005"), and if unset fall back to composing from CUGA_HOST and CUGA_PORT (default "http://127.0.0.1" and "8000") and always rstrip("/"); then update all places that currently build the host/port separately (the adaptation-*, glossary-*, doctor command implementations) to call _knowledge_api_base() instead of using CUGA_HOST/CUGA_PORT so every cuga knowledge command targets the same backend by default. Ensure callers use the returned URL directly (no further port/host assembly).src/cuga/configurations/knowledge/knowledge_profiles/speed.toml-63-66 (1)
63-66:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSpeed addendum conflicts with the mandatory scope-safety contract
Line 65 directs “most-likely scope first; fallback if empty,” which can miss relevant docs when both scope candidates exist. That conflicts with the contract’s anti-silent-miss rule in this PR.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/configurations/knowledge/knowledge_profiles/speed.toml` around lines 63 - 66, The "Speed addendum" text in speed.toml currently instructs "most-likely scope first; only fall back to the other if the first comes up empty", which violates the scope-safety contract by allowing silent misses when both scopes contain relevant docs; update the guidance so the system either (a) always query both scopes and merge/compare results, or (b) query the most-likely scope first but still query the other scope in parallel or as a verification step and include any higher-relevance hits, and explicitly state that fallback must not be the only source when both scopes exist; change the phrasing in speed.toml accordingly to mandate non-silent-miss behavior and reference the scope-safety contract.src/cuga/configurations/knowledge/knowledge_instructions.md-26-26 (1)
26-26:⚠️ Potential issue | 🟠 Major | ⚡ Quick winResolve contradictory scope-miss rules
Line 26 says session misses auto-fallback to
scope="all"and explicitly says not to re-issue with a different scope, but Line 41 says session miss →scope="agent". These rules conflict and will produce inconsistent tool behavior.Also applies to: 41-41
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/configurations/knowledge/knowledge_instructions.md` at line 26, The document has contradictory fallback rules for session misses: one line says session auto-falls back to scope="all" and forbids re-issuing, while another says session miss → scope="agent"; make them consistent by choosing the intended behavior (prefer keeping scope="all" as the canonical automatic fallback) and update the rule that mentions scope="agent" to instead state the same behavior as the scope="session" rule (i.e., engine auto-retries as scope="all", sets retrieval.fallback_from="session", and callers must treat fallback results as authoritative — do NOT re-issue with a different scope); ensure both occurrences reference scope="session" → fallback to scope="all" and remove or reword any conflicting wording about scope="agent" so the file no longer contains contradictory rules.src/cuga/cli/main.py-1660-1740 (1)
1660-1740:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDo not log raw embedding API keys
Line 1739 logs
_valuefor every override, includingDYNACONF_KNOWLEDGE__EMBEDDINGS__API_KEY(Line 1664). That leaks credentials to logs/console history.Suggested fix
for _env_name, _value in _embedding_flag_env: if _value is not None: os.environ[_env_name] = _value - logger.info("Embedding override: %s=%s", _env_name.split("__", 1)[-1], _value) + _masked = _value + if _env_name.endswith("__API_KEY") or "SECRET" in _env_name: + _masked = "***REDACTED***" + logger.info("Embedding override: %s=%s", _env_name.split("__", 1)[-1], _masked)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/cli/main.py` around lines 1660 - 1740, The loop that sets os.environ for entries in _embedding_flag_env currently logs the raw _value via logger.info, which can leak secrets (notably DYNACONF_KNOWLEDGE__EMBEDDINGS__API_KEY). Update the loop that iterates over (_env_name, _value) (and still sets os.environ[_env_name] = _value) to redact sensitive values before logging: detect API keys by matching substrings like "API_KEY" or the exact "DYNACONF_KNOWLEDGE__EMBEDDINGS__API_KEY" and replace the value with a redacted placeholder (e.g., "<redacted>" or a masked version showing only last 4 chars); continue to log non-sensitive entries unchanged via logger.info("Embedding override: %s=%s", _env_name.split("__", 1)[-1], logged_value).src/cuga/backend/storage/embedding/local.py-108-117 (1)
108-117:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRollback before fallback replay to avoid duplicate writes
Line 111 catches
OperationalErrorand immediately replays all rows in Line 115-116. Ifexecutemany()applied a prefix before failing, replaying without resetting transaction state can duplicate already-written rows.Suggested fix
try: try: # Fast path: single C-loop multi-row insert. conn.executemany(sql, rows) except sqlite3.OperationalError: + # Reset any partial work from executemany before replaying. + try: + conn.rollback() + except sqlite3.Error: + pass # sqlite-vec's vec0 virtual table is suspect for executemany on # some builds; fall back to a loop inside the same transaction # so we still drop the per-row commit fsync cost from N to 1. for row in rows: conn.execute(sql, row) conn.commit()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/backend/storage/embedding/local.py` around lines 108 - 117, The except block that catches sqlite3.OperationalError should rollback the failed transaction before replaying the individual inserts to avoid duplicating any prefix that may have been written by conn.executemany; update the handler around conn.executemany to call conn.rollback() first, then replay rows with conn.execute in the loop, and finally call conn.commit() (keeping the existing conn.commit() after the try/except). Refer to conn.executemany, conn.rollback, conn.execute, and conn.commit to find and modify the logic.src/frontend_workspaces/frontend/src/knowledge/useAgentKnowledgeDocs.ts-209-228 (1)
209-228:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMake the sticky retry cancellable/stale-aware.
After the timer fires, the retry fetch is no longer tied to the current refresh cycle. If a newer
refresh()completes first, this older retry can still hitacceptResult()and overwrite the newer documents/cache with stale data. Reuse anAbortControlleror attach a request generation check before applying retry results.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/frontend_workspaces/frontend/src/knowledge/useAgentKnowledgeDocs.ts` around lines 209 - 228, The sticky retry currently schedules a fetch that can apply stale results via acceptResult; make it cancellable/stale-aware by tying the retry to the same cancellation/generation mechanism used by refresh(): when scheduling the retry (where retryAttemptedRef and stickyRetryTimerRef are used) create or reuse an AbortController (or capture the current refresh generation id) and pass it into listKnowledgeDocuments, then before calling acceptResult(retryData.documents ?? []) verify the controller is not aborted (or the generation id still matches the latest) and only then apply results; also ensure clearing/aborting the pending controller/timer when a new refresh() starts so the old retry cannot overwrite newer data and setIsResolving is handled consistently.tests/unit/test_knowledge_cli_sdk_parity.py-256-271 (1)
256-271:⚠️ Potential issue | 🟠 Major | ⚡ Quick win“CLI flag → env var mapping” test does not validate CLI mapping.
This test never exercises CLI/env wiring; it only rebuilds a settings dict and calls
KnowledgeConfig.from_settings, which is already covered above. A CLI mapping typo could still pass here.tests/unit/test_knowledge_routes.py-72-79 (1)
72-79:⚠️ Potential issue | 🟠 Major | ⚡ Quick winQueued upload contract should assert
202, not200.For
wait=false, this path is documented as an accepted/queued response. Locking200here risks preserving the wrong API contract and diverging from the intended async semantics.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/test_knowledge_routes.py` around lines 72 - 79, The test is asserting the wrong HTTP status for a queued upload: update the assertion in tests/unit/test_knowledge_routes.py (the POST to "/api/knowledge/documents" with data {"scope":"agent","replace_duplicates":"true","wait":"false"}) to expect 202 instead of 200; locate the assertion on response.status_code in this test and change it to assert response.status_code == 202 so the test matches the API's async/queued contract.tests/unit/test_knowledge_publish_import_pipeline.py-131-144 (1)
131-144:⚠️ Potential issue | 🟠 Major | ⚡ Quick winThis secret-leak regression test can pass without exercising
save_config.Because assertions are conditional on
if "saved" in captured, the test succeeds even when the route never reaches disk-save logic. That creates a false green on a security-critical contract.Suggested fix
client.post("/api/manage/config", params={"agent_id": "pub-strip-test"}, json=payload) - # Don't assert on status; the publish flow has many side-effects we mock - # out. Just verify what hit the disk save call. - if "saved" in captured: - saved_kb = captured["saved"].get("knowledge", {}) - # KEY must be stripped on disk - assert saved_kb.get("embedding_api_key", "") == "", ( - f"API key leaked to published config-store: {saved_kb.get('embedding_api_key')!r}" - ) - # Non-secret fields must survive - assert saved_kb.get("embedding_provider") == "openai" - assert saved_kb.get("embedding_model") == "Azure/text-embedding-3-small-1" - assert saved_kb.get("embedding_base_url") == "https://ete-litellm.bx.cloud9.ibm.com/v1" + assert "saved" in captured, "save_config was not invoked" + saved_kb = captured["saved"].get("knowledge", {}) + assert saved_kb.get("embedding_api_key", "") == "", ( + f"API key leaked to published config-store: {saved_kb.get('embedding_api_key')!r}" + ) + assert saved_kb.get("embedding_provider") == "openai" + assert saved_kb.get("embedding_model") == "Azure/text-embedding-3-small-1" + assert saved_kb.get("embedding_base_url") == "https://ete-litellm.bx.cloud9.ibm.com/v1"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/test_knowledge_publish_import_pipeline.py` around lines 131 - 144, The test currently guards assertions with a conditional "if 'saved' in captured" which lets the test pass when save_config was never invoked; change this to an explicit assertion that ensures the save happened before checking contents—for example assert "saved" in captured (or assert "saved" in captured, f"expected save_config to be called after client.post('/api/manage/config')") and then proceed to set saved_kb = captured["saved"].get("knowledge", {}) and assert on embedding_api_key, embedding_provider, embedding_model, and embedding_base_url to enforce the secret-strip contract; update any variable references (captured, saved_kb) accordingly so the test fails when disk save isn't exercised.tests/unit/test_knowledge_use_gpu_surfaces.py-194-197 (1)
194-197:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid eager embedder initialization in this unit test.
eng._ensure_embeddings()can pull in provider/model initialization and make this test flaky/non-hermetic. This test only needs to validateuse_gpuconfig propagation before and after PATCH.Suggested change
def test_patch_draft_knowledge_use_gpu_false_applies_to_live_engine(): """User toggles use_gpu off in the UI → live engine rebinds inference backend.""" eng = KnowledgeEngine(_cfg(use_gpu=True)) - eng._ensure_embeddings() assert eng._config.use_gpu is True🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/test_knowledge_use_gpu_surfaces.py` around lines 194 - 197, The test is eagerly initializing embedders by calling eng._ensure_embeddings(), making it flaky; remove that call and only assert config propagation on the KnowledgeEngine instance: create eng = KnowledgeEngine(_cfg(use_gpu=True)), assert eng._config.use_gpu is True, then apply the PATCH operation the test intends and assert the config value after PATCH. Update references to _ensure_embeddings() to avoid triggering provider/model initialization and keep assertions limited to eng._config.use_gpu before and after the PATCH.tests/unit/test_knowledge_timings.py-267-273 (1)
267-273:⚠️ Potential issue | 🟠 Major | ⚡ Quick winOrder-preservation test does not assert the document identity strongly enough.
This loop claims to verify order, but it only checks
filename == "x.pdf". If embeddings are reordered, the test can still pass because all docs share that filename. Assertpage(and/or exact text) for the top hit.Suggested change
for i in range(0, 50, 7): hits = store.search(f"text {i}", k=3) assert hits, f"no hits for 'text {i}'" # Top hit's filename + page must be the original doc i+1. top = hits[0][0] assert top.metadata.get("filename") == "x.pdf" + assert top.metadata.get("page") == i + 1, ( + f"expected top hit page {i + 1}, got {top.metadata.get('page')!r}" + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/test_knowledge_timings.py` around lines 267 - 273, The test's loop only asserts filename == "x.pdf" which is insufficient; update the assertion to also check the document identity by asserting the page (and/or exact text) of the top hit matches the expected document for each i. Concretely, after obtaining top = hits[0][0], add an assertion like top.metadata.get("page") == i+1 (or assert the hit text equals the expected "text {i}") so the test verifies order-preservation, using the existing variables hits and top.
🧹 Nitpick comments (3)
tests/unit/test_knowledge_config_ui_design.py (1)
458-462: ⚡ Quick winReplace the tautological warmup assertion with a real contract check.
Line 461 always passes and won’t catch regressions. Assert a concrete condition (e.g., type/expected value) instead.
Suggested change
- assert result["embeddings_initialized"] is False or result["embeddings_initialized"] is True + assert isinstance(result.get("embeddings_initialized"), bool)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/test_knowledge_config_ui_design.py` around lines 458 - 462, The test uses a tautological assertion for result["embeddings_initialized"] which always passes; replace it with a concrete contract check by asserting the field is a boolean (e.g., use isinstance(result["embeddings_initialized"], bool)) or assert a specific expected value depending on the warmup contract from eng.warmup(); update the assertion referring to result["embeddings_initialized"] (produced by eng.warmup()) to validate type or expected value rather than "is False or is True".tests/unit/test_knowledge_preflight_soft_fail.py (1)
60-156: ⚡ Quick winUse fixture-managed engine teardown to prevent resource leakage across tests.
Each test builds a
KnowledgeEnginebut never closes it. A small fixture thatyields the engine and callsengine.shutdown()in teardown will make this suite more stable under parallel/long runs.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/test_knowledge_preflight_soft_fail.py` around lines 60 - 156, The tests instantiate KnowledgeEngine directly and never call its shutdown, causing leaked resources; create a pytest fixture that yields a KnowledgeEngine (constructed via _cfg()) and calls engine.shutdown() in the teardown, then update tests to use that fixture instead of directly calling eng = KnowledgeEngine(_cfg()) (e.g., accept an engine fixture and pass it to _build_app(engine)); ensure all tests that create KnowledgeEngine are switched to use the fixture so shutdown() runs after each test.tests/unit/test_knowledge_pgvector_metadata.py (1)
74-78: Guard fake SQL UPDATE parsing from silent truncation
zip(cols, params[:-1])would silently drop assignments if the fake interpreter parses a different number ofSETcolumns than the parameters provide.PostgresKnowledgeMetadata.update_taskcurrently derivesset_clauseandvaluesfrom the samecolslist, so this mismatch shouldn’t occur today—but the assertion makes future SQL/parse changes fail loudly.Suggested fix
cols = [seg.split("=")[0].strip() for seg in set_section.strip().rstrip(",").split(",")] task_id = params[-1] if task_id in self._rows_by_task_id: + assert len(cols) == len(params[:-1]), ( + f"UPDATE shape mismatch: cols={len(cols)} params={len(params[:-1])}" + ) for col, val in zip(cols, params[:-1]): self._rows_by_task_id[task_id][col] = val🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/test_knowledge_pgvector_metadata.py` around lines 74 - 78, Add a guard to ensure the number of parsed SET columns matches the number of provided parameter values before zipping to avoid silent truncation: in the code path where cols and params are derived (used by PostgresKnowledgeMetadata.update_task and referencing cols, params, task_id, and self._rows_by_task_id), assert or raise if len(cols) != len(params) - 1 (since the last param is task_id) so mismatched parses fail loudly rather than dropping assignments; perform this check immediately before the for col, val in zip(cols, params[:-1]) loop.Source: Linters/SAST tools
| RUN uv sync --extra gpu && \ | ||
| uv pip uninstall onnxruntime 2>/dev/null || true && \ | ||
| uv pip install --reinstall torch torchvision \ | ||
| --index-url https://download.pytorch.org/whl/cu121 |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show Dockerfile.gpu around the referenced lines
if [ ! -f Dockerfile.gpu ]; then
echo "ERROR: Dockerfile.gpu not found at repo root"
exit 1
fi
echo "=== Dockerfile.gpu: lines 1-120 ==="
nl -ba Dockerfile.gpu | sed -n '1,120p'
# Also search for any SHELL directive that could change command parsing
echo "=== SHELL directives (if any) ==="
rg -n '^\s*SHELL\b' Dockerfile.gpu || true
# Show the exact RUN line(s) containing 'uv sync' and 'onnxruntime'
echo "=== Matching RUN lines ==="
rg -n 'uv sync|onnxruntime|uv pip uninstall|uv pip install --reinstall torch' Dockerfile.gpuRepository: cuga-project/cuga-agent
Length of output: 165
🏁 Script executed:
#!/bin/bash
set -euo pipefail
if [ ! -f Dockerfile.gpu ]; then
echo "ERROR: Dockerfile.gpu not found at repo root"
exit 1
fi
echo "=== Dockerfile.gpu: lines 50-90 (context around referenced area) ==="
awk 'NR>=50 && NR<=90 {printf "%d:%s\n", NR, $0}' Dockerfile.gpu
echo "=== SHELL directives (if any) ==="
rg -n '^\s*SHELL\b' Dockerfile.gpu || true
echo "=== Matching RUN lines ==="
rg -n 'uv sync|onnxruntime|uv pip uninstall|uv pip install --reinstall torch' Dockerfile.gpu || true
echo "=== Show full RUN instruction(s) that contain uv sync (handles line continuations) ==="
# Print from first match through the line where the RUN instruction ends (heuristic: until a line not ending with '\')
awk '
$0 ~ /uv sync/ { inrun=1 }
{
if (inrun) { print; }
if (inrun && $0 ~ /\\$/) next
if (inrun && $0 !~ /\\$/) { inrun=0 }
}
' Dockerfile.gpu | sed -n '1,200p'Repository: cuga-project/cuga-agent
Length of output: 2980
Fix RUN chaining so uv sync failures don’t get masked (torch install still runs today).
In Dockerfile.gpu lines 64-67, the expression uv sync ... && ... || true && ... short-circuits via || true, causing uv pip install --reinstall torch torchvision to run even when uv sync fails.
Suggested fix
-RUN uv sync --extra gpu && \
- uv pip uninstall onnxruntime 2>/dev/null || true && \
- uv pip install --reinstall torch torchvision \
+RUN uv sync --extra gpu && \
+ (uv pip uninstall onnxruntime 2>/dev/null || true) && \
+ uv pip install --reinstall torch torchvision \
--index-url https://download.pytorch.org/whl/cu121📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| RUN uv sync --extra gpu && \ | |
| uv pip uninstall onnxruntime 2>/dev/null || true && \ | |
| uv pip install --reinstall torch torchvision \ | |
| --index-url https://download.pytorch.org/whl/cu121 | |
| RUN uv sync --extra gpu && \ | |
| (uv pip uninstall onnxruntime 2>/dev/null || true) && \ | |
| uv pip install --reinstall torch torchvision \ | |
| --index-url https://download.pytorch.org/whl/cu121 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Dockerfile.gpu` around lines 64 - 67, The RUN line masks failures from uv
sync because the current chaining uses "|| true" which makes the whole chain
succeed and still runs the torch install; change the chaining so that only the
uninstall is tolerated while uv sync must succeed: run uv sync --extra gpu and
require its success, then run uv pip uninstall onnxruntime with a tolerated
failure (e.g., group the uninstall with || true or run it in a subshell/brace so
its failure doesn't affect uv sync), and finally run uv pip install --reinstall
torch torchvision only if the prior required commands succeeded; reference the
commands uv sync --extra gpu, uv pip uninstall onnxruntime, and uv pip install
--reinstall torch torchvision when making this change.
The CI test-allowlist glob added in 734c56c picked up ``test_knowledge_layout_engine.py``, which has 5 tests that mock ``_detect_accelerator`` to return ``("mps", ["CoreMLExecutionProvider", "CPUExecutionProvider"])``. The python-side branch is fooled, but the downstream fastembed session validates the providers against the actual ``onnxruntime`` install — on Linux CI the call fails with ``ValueError: Provider CoreMLExecutionProvider is not available``. Added a module-level ``_REQUIRES_MAC = pytest.mark.skipif(sys.platform != "darwin")`` and applied it to the 5 affected tests: - test_layout_engine_transformers_disables_torch_compile_on_mps - test_layout_engine_auto_resolves_to_transformers_on_mps - test_timing_log_reports_effective_layout_not_user_choice - test_use_gpu_toggle_invalidates_auto_resolution - test_pytorch_embeddings_device_routes_to_mps_when_detected The Mac-only branches still get coverage on dev machines; CI on Linux no longer fails on the inherent platform mismatch.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cuga/backend/knowledge/storage/adapter.py`:
- Around line 632-635: The rollback code silently swallows errors from
_fts_delete_by_source; change the bare except to catch Exception as e and emit a
log entry with the exception details (e.g., using self.logger.error or the
module logger) including context like the source_for_rollback and that this
occurred during FTS rollback, rather than simply passing; keep the rollback
suppression behavior (do not re-raise) but ensure the exception and context are
logged for debugging.
- Around line 595-617: The rollback currently uses source_for_rollback from
items[0] inside _add_all and so only deletes one source on partial failure;
update _add_all to either enforce a single-source precondition by validating
that every item in items has the same "source" (raise early if not) or build the
set of all distinct sources touched (e.g., collect sources from each item's
metadata before the insert loop) and on partial failure iterate and delete by
each source (use the same _scope/_filt pattern) instead of only
source_for_rollback; ensure the chosen approach references source_for_rollback,
_add_all, self._store.add_many and mirrors delete_by_source behavior for all
affected sources.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 57186bb0-fd5b-488a-8680-5d42ad77ca35
📒 Files selected for processing (7)
.gitignoreREADME.mdsrc/cuga/backend/knowledge/engine.pysrc/cuga/backend/knowledge/storage/adapter.pysrc/cuga/backend/server/manage_routes.pysrc/scripts/run_tests.shtests/unit/test_knowledge_layout_engine.py
💤 Files with no reviewable changes (1)
- .gitignore
🚧 Files skipped from review as they are similar to previous changes (1)
- src/cuga/backend/server/manage_routes.py
| source_for_rollback = items[0][2].get("source") if items else None | ||
|
|
||
| async def _add_all() -> int: | ||
| n = 0 | ||
| for doc, emb in zip(documents, vectors, strict=True): | ||
| chunk_id = str(uuid.uuid4()) | ||
| page_val: int | ||
| page_raw = doc.metadata.get("page") | ||
| if page_raw is None: | ||
| page_val = -1 | ||
| else: | ||
| try: | ||
| page_val = int(page_raw) | ||
| except (TypeError, ValueError): | ||
| page_val = -1 | ||
| source = str(doc.metadata.get("source", "") or "") | ||
| filename = str(doc.metadata.get("filename", "") or "") | ||
| meta = { | ||
| "id": chunk_id, | ||
| "tenant_id": scope["tenant_id"], | ||
| "instance_id": scope["instance_id"], | ||
| "source": source, | ||
| "filename": filename, | ||
| "page": page_val, | ||
| "chunk_text": doc.page_content, | ||
| "meta_json": json.dumps( | ||
| {"source": source, "filename": filename, "page": page_val}, | ||
| ensure_ascii=False, | ||
| ), | ||
| } | ||
| await self._store.add(chunk_id, emb, meta) | ||
| n += 1 | ||
| partial = False | ||
| for i in range(0, total_items, insert_bs): | ||
| chunk = items[i : i + insert_bs] | ||
| try: | ||
| await self._store.add_many(chunk) | ||
| except Exception: | ||
| # If we'd already committed earlier batches, clean them | ||
| # up before re-raising so the next ingest attempt | ||
| # doesn't encounter duplicate vectors on the same | ||
| # source_id. Inline the delete loop here (mirror of | ||
| # ``delete_by_source``) so we stay on the current event | ||
| # loop — the adapter-level helper would call | ||
| # ``_run_embedding_coro`` which would deadlock against | ||
| # the loop we're already inside. | ||
| if partial and source_for_rollback: | ||
| try: | ||
| _scope = self._scope() | ||
| _filt = {**_scope, "source": source_for_rollback} | ||
| while True: |
There was a problem hiding this comment.
Rollback cleanup currently assumes a single source across the whole batch.
source_for_rollback is taken from only the first item. On partial failure, cleanup deletes one source and can leave committed rows for other sources in the same call. Please enforce a single-source precondition (or roll back all touched sources).
Suggested fix
- source_for_rollback = items[0][2].get("source") if items else None
+ rollback_sources = {str(meta.get("source", "")) for _, _, meta in items}
+ if len(rollback_sources) != 1:
+ raise ValueError(
+ "add_documents expects all chunks in one call to share the same source"
+ )
+ source_for_rollback = next(iter(rollback_sources))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cuga/backend/knowledge/storage/adapter.py` around lines 595 - 617, The
rollback currently uses source_for_rollback from items[0] inside _add_all and so
only deletes one source on partial failure; update _add_all to either enforce a
single-source precondition by validating that every item in items has the same
"source" (raise early if not) or build the set of all distinct sources touched
(e.g., collect sources from each item's metadata before the insert loop) and on
partial failure iterate and delete by each source (use the same _scope/_filt
pattern) instead of only source_for_rollback; ensure the chosen approach
references source_for_rollback, _add_all, self._store.add_many and mirrors
delete_by_source behavior for all affected sources.
| try: | ||
| self._fts_delete_by_source(source_for_rollback) | ||
| except Exception: | ||
| pass |
There was a problem hiding this comment.
Don’t silently swallow FTS rollback errors in the failure path.
The bare except ...: pass hides rollback divergence exactly when debugging partial-insert incidents. Log it at least once.
Suggested fix
- try:
- self._fts_delete_by_source(source_for_rollback)
- except Exception:
- pass
+ try:
+ self._fts_delete_by_source(source_for_rollback)
+ except Exception as fts_err: # noqa: BLE001
+ logger.warning(
+ "FTS rollback failed for source=%r after partial insert: %s",
+ source_for_rollback,
+ fts_err,
+ )🧰 Tools
🪛 Ruff (0.15.15)
[error] 634-635: try-except-pass detected, consider logging the exception
(S110)
[warning] 634-634: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cuga/backend/knowledge/storage/adapter.py` around lines 632 - 635, The
rollback code silently swallows errors from _fts_delete_by_source; change the
bare except to catch Exception as e and emit a log entry with the exception
details (e.g., using self.logger.error or the module logger) including context
like the source_for_rollback and that this occurred during FTS rollback, rather
than simply passing; keep the rollback suppression behavior (do not re-raise)
but ensure the exception and context are logged for debugging.
Source: Linters/SAST tools
There was a problem hiding this comment.
Request: split this PR before it can be reviewed
This PR is titled "perf + OCR + session isolation + live progress UX" but it currently carries at least four additional features. It's ~12.3k lines of test alone, across 37 changed source files including six 1k–4k-line files. Nobody can responsibly review this as one unit — and the bugs below prove it, because each one slipped through at a seam between features that were never reviewed together. Please break it up.
Split out cleanly (these lift out with little entanglement — please do these):
- Client adaptation → own PR.
ClientAdaptationPanel.tsx,query_expansion.py(glossary-driven), the adaptation prompt-framing inchat_agent.py/pmt*.jinja2/awareness.py, plustest_knowledge_client_adaptation.py(1048),test_client_adaptation_e2e.py(282), and theclient_adaptation_example.mddoc. ~2.3k LOC, fully self-contained feature. - Reranker + GPU image → own PR.
reranker.py(pulls insentence-transformers),Dockerfile.gpu,test_knowledge_use_gpu_surfaces.py,test_knowledge_accelerator.py. New infra + a quality lever — deserves a focused review. - Benchmark tooling → remove from this PR entirely.
scripts/bench_knowledge_ingest.py,bench_knowledge_query_compare.py,bench_knowledge_report.py, theknowledge/bench.pymodule, andtest_knowledge_bench.py. (This reverses my earlier "keep ingest+report" note — on reflection benchmark tooling belongs in a separate PR, not the runtime tree.)
Consider splitting — or at minimum retitle:
- Multi-scope / Option-B search. The 3630-line
test_knowledge_multi_scope_search.py+test_knowledge_rag_scope_failure_fix.py(823) plussearch_multiinclient.pyand the scope routing inprepare_node.py/helpers/knowledge.py. I know this shares core files and that "Option B" was in the original WIP title, so I won't insist on the split — but it's invisible to anyone reading the PR description. Either pull it out, or update the title/body to name it.
After the clean splits come out, what remains — perf core, OCR, session isolation, live progress UX, RAG-profile config — actually matches the title and is reviewable.
Blocking bugs, routed to where they belong (these are why the split matters):
- Empty-
thread_idregression —prepare_node.py:438. Theorigin/mainmerge dropped the_is_empty()guard the branch previously had;ca153f3crestored session as the default scope but not the guard, so an LLM-emittedthread_id=""now hitsclient.py:150 raise ValueError("thread_id required for session scope"). Belongs to the session/scope PR — it got lost because scope logic was buried inside a perf PR. Dockerfile.gpu:64-66—uv sync --extra gpu && uv pip uninstall onnxruntime 2>/dev/null || true && uv pip install ...short-circuits:|| truemakes the chain "succeed" so the torch reinstall runs even whenuv syncfails, and theRUNexits 0 — a broken GPU image that builds green. CodeRabbit flagged this too. Belongs to the GPU PR.- Silent
enforce→dry_runflip —config.py:1032-1035.from_settingsreads the profile first, and the defaultstandardprofile setsdrop_page_chrome/junk_filter = "dry_run", shadowing the shippedknowledge_settings.tomlthat still says= "enforce". The config file now lies about its effective value. Stays in the core/profile PR. - pgvector QA gate still open (my original blocker on CI/test coverage) — the new orphan-vector rollback in
adapter.pyis only tested against mocked/sqlite-vec; no Docker smoke test against real pgvector. Stays in the core PR.
Once split, I can give each PR the review it actually needs. Happy to re-review in whatever order you stack them.
Summary
End-to-end overhaul of the knowledge subsystem on a single branch (20 commits, 75+ files, ~+5500/-200):
add_manyon store backends, sub-batched + concurrent embed, race-free progress callbacks, Doclingpdf_mode = accurate | balanced | fast. Large-PDF ingest now bounded by embedding-provider throughput, not parse.accuratemode.resolve_collection, MCP wrapper static whitelist +_is_empty()injection, E2B stub serializer parity, observability logs on every session op.task_id+weighted_pct+ui_phaseviaPOST /api/knowledge/documents?wait=false(202), per-upload AbortController + resume-across-reload, app-leveluseAgentKnowledgeDocshook with sticky non-zero + localStorage cache.PATCH /api/knowledge/confignow reconfigures the live engine, README populated with measured 3×3 provider ×pdf_modematrix.Dockerfile.gpuonnvidia/cuda:12.4.1-cudnn,[gpu]extra withonnxruntime-gpu+fastembed-gpu,_FastEmbedEmbeddingsCUDA detection withbatch_size=256/parallel=1, three-mode startup warning +gpu_required+cuga doctorCLI, Docling parse polish (generate_page_images=Falsein fast mode).<client_adaptation>panel with templates / glossary / reset; ingested into prompt via_render_client_adaptation_block; round-trippable through publish/import; secret-stripped on export.knowledge_instructions.md; 1,947 → 1,357 tokens (-30%); generalized cross-scope rule (predicate + forced verbalization, no enumeration overfitting); refine-on-miss playbook with explicit miss-criteria; harness obedience strengthened (MANDATORY framing + BEFORE-you-respond pre-response check); dynamic anchors (session-LIVE + cross-scope commit) emitted just-in-time only when their trigger fires.Closes #183, #247, #248, #289, #290, #291, #292, #293, #294, #295, #296
Where the prompt-engineering work lives in the diff
The Knowledge Tool Contract + harness work is bundled inside commit
7c2c642(alongside the RAG profiles + GPU image) since the audits happened in the same session. Files to scan for review:src/cuga/configurations/knowledge/knowledge_instructions.md— the contract itself (the −60% / now −30% file; iterations 1-3b documented inline)src/cuga/backend/knowledge/awareness.py—_render_client_adaptation_block(MANDATORY framing) +get_knowledge_summary(session-LIVE anchor + cross-scope commit anchor + BEFORE-you-respond tail)tests/unit/test_knowledge_rag_scope_failure_fix.py— contract structural guards (NARROWEST scope, Session+All+Anti-example examples,fallback_from,do NOT re-issue)tests/unit/test_knowledge_multi_scope_search.py—test_canonical_contract_*guards (by_source,recommendation, query rewriting on miss, no staleagent firstguidance)tests/unit/test_knowledge_client_adaptation.py—test_precedence_framing_present(MANDATORY / EVERY response / CRITICAL) +test_recency_tail_pointer_present_when_set(BEFORE you respond)tests/unit/test_chat_agent_knowledge_toggle.py— engine-off ⇒ no knowledge block injectedTest plan
pytest tests/unit/test_knowledge_*— expect 700+ green (160 contract+adaptation+prompt-assembly tests pass on this branch)pdf_mode=balancedon a network embedding provider — confirm completion under 1 minutecuga knowledge snapshot-export/-importround-tripscope="all"(the predicate fires) and verbalizes the candidate doc on each sideFollow-up: Pareto-locked profiles + GPU image (added 2026-06)
3 commits pushed on top of the original scope (
c476a5e,7c2c642,ca153f3after revert + merge withorigin/main):chunk_size+ prompt addendum. Code-enforced contract:rerank_top_k_in >= 3 * default_limit. Designed via a 3-expert RAG debate.Dockerfile.gpuonnvidia/cuda:12.4.1-cudnn-runtimebase,[project.optional-dependencies] gpuextra withonnxruntime-gpu+fastembed-gpu, 3-mode startup warning,gpu_requiredhard-fail mode,cuga doctorhealth-check CLI, full README deployment section.generate_page_images=Falseinpdf_mode=fastonly; warning when CUDA engaged butmax_ingest_workers <= 2.Merge with
origin/main(commit1def217) handled 4 conflicts:cuga_lite_graph.py— took main wholesale (thecuga_agent_corerefactor at 27bd35e extracted ~2300 lines, including knowledge helpers intohelpers/knowledge.py).client.py— combined HEAD'ssearch_default_scope/single_default_scopewith main's**_: Anykwargs absorption.App.tsx— kept HEAD'sagentKnowledgehook source, dropped 3 props the refactoredConfigHeaderno longer accepts.frontend/dist/index.html— took HEAD (rebuild artifact).Post-merge fixes (
ca153f3) restored 3 branch semantics that main's refactor regressed:helpers/knowledge.py::_get_knowledge_tool_scope_contextreturns("all", "agent", "session")with default"session"(main's version was("agent", "session")with default"agent")._knowledge_scope_instructiondescribes the"all"scope in the both-scopes-wired branch.adapter/prepare_node.pyprefers the per-agentdraft_knowledge_configsdict before the legacy singular attribute (multi-tenant safety for shared deployments).Test sweep: 709 knowledge unit tests pass.
Summary by CodeRabbit
New Features
Dockerfile.gpuwith NVIDIA CUDA runtime and optimized knowledge pipelineBug Fixes
Documentation
Updated 2026-06: response to Sami's review
Pushed commit
734c56caddressing every blocker and most also-noted items.Full per-item explanation is in that commit's message body. Headline:
Blockers fixed:
engine.pyfut.result→await asyncio.wait_for(asyncio.wrap_future(...)))/knowledge/config(apply_knowledge_confignowawait asyncio.to_thread(...))_add_all)run_tests.shglob)Also-noted bugs fixed:
commit_knowledge_updatenow returnsmetric_changedget_settings()now exposesdocling_pdf_mode,embedding_batch_size/concurrency,vector_insert_batch_size,rerank_*,embedding_api_key_set(presence-only),default_limit,max_search_attempts,gpu_required, etc.CodeQL hardening:
manage_routes.py:978-1009—logger.exceptionserver-side + validator-message whitelist client-side (generic fallback for any unrecognized shape).Scope cleanup: removed
scripts/KNOWLEDGE_EVAL_METHODOLOGY.md,scripts/bench_knowledge_eval{,_report}.py,scripts/eval_queries.json,scripts/test_{groq,openrouter}_connectivity.py. Personalknowledge_ingest_perf.pptx+KNOWLEDGE_INGEST_PERF_SLIDES.mdgitignore entries dropped. README 3×3 latency matrix + Jaccard table trimmed (~80 lines).Note on the "routes.py zero-diff" comment —
routes.pyactually has ~588 lines of diff vs main; thewait=false/ 202 / live-progress path is implemented at lines 534–696 (upload_documentshandler). I suspect Sami meant the DEFAULTwait=truesynchronous path which is preserved for MCP/ingest_knowledge+ session-attachment back-compat. The async path is opt-in by the UI.Test count clarification: the original summary said "387 tests / 27 suites" — accurate count on this branch right now is 709 knowledge unit tests passing across 30+ test files. CI tests.yml runs them all via
tests/unit/test_knowledge_*.pyglob (was missing before — see blocker 4).