Skip to content

knowledge: perf + OCR + session isolation + live progress UX#297

Open
dolev31 wants to merge 32 commits into
mainfrom
perf/knowledge-ingest-bottlenecks
Open

knowledge: perf + OCR + session isolation + live progress UX#297
dolev31 wants to merge 32 commits into
mainfrom
perf/knowledge-ingest-bottlenecks

Conversation

@dolev31

@dolev31 dolev31 commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Summary

End-to-end overhaul of the knowledge subsystem on a single branch (20 commits, 75+ files, ~+5500/-200):

  • Perf — bulk add_many on store backends, sub-batched + concurrent embed, race-free progress callbacks, Docling pdf_mode = accurate | balanced | fast. Large-PDF ingest now bounded by embedding-provider throughput, not parse.
  • OCR — zero-config auto-detection (filename + pypdfium2 text-layer scan, run-length ≥3 false-positive guard), per-platform engine selection (OcrMac on macOS / Tesseract on Linux / EasyOCR fallback), supported-language gating against Apple Vision's actual list, code + formula enrichments enabled in accurate mode.
  • Session isolation — whitespace-safe resolve_collection, MCP wrapper static whitelist + _is_empty() injection, E2B stub serializer parity, observability logs on every session op.
  • Live UXtask_id + weighted_pct + ui_phase via POST /api/knowledge/documents?wait=false (202), per-upload AbortController + resume-across-reload, app-level useAgentKnowledgeDocs hook with sticky non-zero + localStorage cache.
  • Providers — OpenRouter embedding provider, PATCH /api/knowledge/config now reconfigures the live engine, README populated with measured 3×3 provider × pdf_mode matrix.
  • RAG profiles + GPU — Pareto-locked profile TOMLs (each pins embedding/chunking/docling/rerank/search/engine in one place), Dockerfile.gpu on nvidia/cuda:12.4.1-cudnn, [gpu] extra with onnxruntime-gpu + fastembed-gpu, _FastEmbedEmbeddings CUDA detection with batch_size=256/parallel=1, three-mode startup warning + gpu_required + cuga doctor CLI, Docling parse polish (generate_page_images=False in fast mode).
  • Knowledge harness UI — operator-tunable <client_adaptation> panel with templates / glossary / reset; ingested into prompt via _render_client_adaptation_block; round-trippable through publish/import; secret-stripped on export.
  • Knowledge Tool Contract — three-round 3-expert audit of knowledge_instructions.md; 1,947 → 1,357 tokens (-30%); generalized cross-scope rule (predicate + forced verbalization, no enumeration overfitting); refine-on-miss playbook with explicit miss-criteria; harness obedience strengthened (MANDATORY framing + BEFORE-you-respond pre-response check); dynamic anchors (session-LIVE + cross-scope commit) emitted just-in-time only when their trigger fires.
  • Tests — 700+ passing across 30+ knowledge-specific suites (perf, OCR, session isolation, MCP stub, pgvector metadata, publish/import, progress, live PATCH, contract guards, harness obedience).

Closes #183, #247, #248, #289, #290, #291, #292, #293, #294, #295, #296

Where the prompt-engineering work lives in the diff

The Knowledge Tool Contract + harness work is bundled inside commit 7c2c642 (alongside the RAG profiles + GPU image) since the audits happened in the same session. Files to scan for review:

  • src/cuga/configurations/knowledge/knowledge_instructions.md — the contract itself (the −60% / now −30% file; iterations 1-3b documented inline)
  • src/cuga/backend/knowledge/awareness.py_render_client_adaptation_block (MANDATORY framing) + get_knowledge_summary (session-LIVE anchor + cross-scope commit anchor + BEFORE-you-respond tail)
  • tests/unit/test_knowledge_rag_scope_failure_fix.py — contract structural guards (NARROWEST scope, Session+All+Anti-example examples, fallback_from, do NOT re-issue)
  • tests/unit/test_knowledge_multi_scope_search.pytest_canonical_contract_* guards (by_source, recommendation, query rewriting on miss, no stale agent first guidance)
  • tests/unit/test_knowledge_client_adaptation.pytest_precedence_framing_present (MANDATORY / EVERY response / CRITICAL) + test_recency_tail_pointer_present_when_set (BEFORE you respond)
  • tests/unit/test_chat_agent_knowledge_toggle.py — engine-off ⇒ no knowledge block injected

Test plan

  • Run pytest tests/unit/test_knowledge_* — expect 700+ green (160 contract+adaptation+prompt-assembly tests pass on this branch)
  • Ingest a 500-page text-native PDF with pdf_mode=balanced on a network embedding provider — confirm completion under 1 minute
  • Ingest a non-English PDF on macOS — confirm OcrMac is selected and works; ingest same doc on Linux — confirm Tesseract path
  • Open two sessions in the UI, upload doc A in session 1, doc B in session 2 — confirm search in each session returns only that session's doc
  • Mid-upload page reload — confirm progress rows rehydrate and continue polling
  • Edit embedding provider in Knowledge UI → PATCH → re-ingest a new doc immediately — confirm new provider is used without restart
  • Verify pgvector parity: same matrix of ingest + search smoke tests against Postgres + pgvector backend
  • Knowledge harness UI — save a rule + glossary entry, reload page, confirm both persist and survive cuga knowledge snapshot-export/-import round-trip
  • Cross-scope decision — agent KB has a paper, session has a paper, ask "what does the paper say about X" — confirm LLM picks scope="all" (the predicate fires) and verbalizes the candidate doc on each side

Follow-up: Pareto-locked profiles + GPU image (added 2026-06)

3 commits pushed on top of the original scope (c476a5e, 7c2c642, ca153f3 after revert + merge with origin/main):

  • Pareto-locked RAG profiles (Closes knowledge: Pareto-locked RAG profiles #316) — 4 profile TOMLs now own their full effective config (embedding_model, chunking, docling.pdf_mode, rerank, hybrid_mode, junk_filter, default_limit, max_ingest_workers) instead of just chunk_size + prompt addendum. Code-enforced contract: rerank_top_k_in >= 3 * default_limit. Designed via a 3-expert RAG debate.
  • GPU image + transparent CUDA runtime (Closes knowledge: GPU image (cuga:gpu) + transparent CUDA runtime #317) — new Dockerfile.gpu on nvidia/cuda:12.4.1-cudnn-runtime base, [project.optional-dependencies] gpu extra with onnxruntime-gpu + fastembed-gpu, 3-mode startup warning, gpu_required hard-fail mode, cuga doctor health-check CLI, full README deployment section.
  • Docling parse polishgenerate_page_images=False in pdf_mode=fast only; warning when CUDA engaged but max_ingest_workers <= 2.

Merge with origin/main (commit 1def217) handled 4 conflicts:

  • cuga_lite_graph.py — took main wholesale (the cuga_agent_core refactor at 27bd35e extracted ~2300 lines, including knowledge helpers into helpers/knowledge.py).
  • client.py — combined HEAD's search_default_scope / single_default_scope with main's **_: Any kwargs absorption.
  • App.tsx — kept HEAD's agentKnowledge hook source, dropped 3 props the refactored ConfigHeader no longer accepts.
  • frontend/dist/index.html — took HEAD (rebuild artifact).

Post-merge fixes (ca153f3) restored 3 branch semantics that main's refactor regressed:

  1. helpers/knowledge.py::_get_knowledge_tool_scope_context returns ("all", "agent", "session") with default "session" (main's version was ("agent", "session") with default "agent").
  2. _knowledge_scope_instruction describes the "all" scope in the both-scopes-wired branch.
  3. adapter/prepare_node.py prefers the per-agent draft_knowledge_configs dict before the legacy singular attribute (multi-tenant safety for shared deployments).

Test sweep: 709 knowledge unit tests pass.

Summary by CodeRabbit

New Features

  • Added GPU deployment support via Dockerfile.gpu with NVIDIA CUDA runtime and optimized knowledge pipeline
  • Introduced multiple embedding providers: OpenRouter, OpenAI-compatible, LiteLLM, and local fastembed
  • Added knowledge scoping: agent-level and session-level document management with fallback search behavior
  • Implemented client adaptation (knowledge harness) and glossary-driven query expansion
  • Added optional cross-encoder reranking for search results
  • New knowledge CLI subcommand group for config/snapshot management

Bug Fixes

  • Fixed knowledge session isolation by thread ID validation
  • Improved progress tracking and weighted percentages for ingestion tasks
  • Enhanced error handling for embedding connectivity and configuration

Documentation

  • GPU deployment guide with Kubernetes examples and performance tuning
  • Embedding providers configuration with multiple provider examples
  • Ingest tuning parameters and per-stage progress documentation

Updated 2026-06: response to Sami's review

Pushed commit 734c56c addressing every blocker and most also-noted items.
Full per-item explanation is in that commit's message body. Headline:

Blockers fixed:

  • Event-loop block in async ingest (engine.py fut.resultawait asyncio.wait_for(asyncio.wrap_future(...)))
  • Event-loop block in PATCH /knowledge/config (apply_knowledge_config now await asyncio.to_thread(...))
  • Orphan vectors on pgvector partial-insert (compensating rollback loop in _add_all)
  • 30+ knowledge tests now in CI (run_tests.sh glob)

Also-noted bugs fixed:

  • commit_knowledge_update now returns metric_changed
  • get_settings() now exposes docling_pdf_mode, embedding_batch_size/concurrency, vector_insert_batch_size, rerank_*, embedding_api_key_set (presence-only), default_limit, max_search_attempts, gpu_required, etc.

CodeQL hardening: manage_routes.py:978-1009logger.exception server-side + validator-message whitelist client-side (generic fallback for any unrecognized shape).

Scope cleanup: removed scripts/KNOWLEDGE_EVAL_METHODOLOGY.md, scripts/bench_knowledge_eval{,_report}.py, scripts/eval_queries.json, scripts/test_{groq,openrouter}_connectivity.py. Personal knowledge_ingest_perf.pptx + KNOWLEDGE_INGEST_PERF_SLIDES.md gitignore entries dropped. README 3×3 latency matrix + Jaccard table trimmed (~80 lines).

Note on the "routes.py zero-diff" commentroutes.py actually has ~588 lines of diff vs main; the wait=false / 202 / live-progress path is implemented at lines 534–696 (upload_documents handler). I suspect Sami meant the DEFAULT wait=true synchronous path which is preserved for MCP/ingest_knowledge + session-attachment back-compat. The async path is opt-in by the UI.

Test count clarification: the original summary said "387 tests / 27 suites" — accurate count on this branch right now is 709 knowledge unit tests passing across 30+ test files. CI tests.yml runs them all via tests/unit/test_knowledge_*.py glob (was missing before — see blocker 4).

dolev31 added 17 commits May 20, 2026 17:10
…step 0)

This is the no-op baseline for the issue #183 perf work. It does not change
any ingest behavior; it only instruments the pipeline so the later commits
(C1-C6) can be measured against the same ruler and the PR can carry a real
attribution table.

What this lands:

- src/cuga/backend/knowledge/bench.py: tiny record_ingest_run(log_path, payload)
  helper that appends one JSON line per ingest, with auto-fields ts and
  commit_sha. Best-effort: any failure (unwritable path, non-serialisable
  payload) is swallowed so instrumentation never fails an ingest.

- KnowledgeConfig.bench_log_path (default ""): when set, the engine appends
  per-ingest records to that file. Loaded from settings.toml as
  [knowledge.bench] log_path.

- engine._ingest_inner now collects a stage_timings dict (parse_s,
  insert_total_s, finalize_s, total_s, chunk_count), includes it in the
  completion update_task call, and forwards it to record_ingest_run.

- adapter.add_documents accepts an optional stage_timings dict and populates
  embed_s + insert_s. Signature added to VectorStoreAdapter base class with
  a backwards-compatible None default.

- scripts/bench_knowledge_ingest.py: harness that runs N ingests with the
  bench log on, prints a per-run markdown row + a median-aggregated row, and
  appends the summary to ~/.cuga/bench/runs.json.

- scripts/bench_knowledge_report.py: joins the runs.json entries into the
  attribution table the PR description will paste, with a delta-vs-baseline
  column.

- tests/unit/test_knowledge_bench.py (6 cases) covering append, parent-dir
  creation, no-op on empty path, default=str fallback, non-mapping payload,
  unwritable path.

- tests/unit/test_knowledge_timings.py (3 cases) verifying adapter populates
  embed_s + insert_s when stage_timings is passed, backwards-compat when it
  is not, and that empty doc list does not touch the dict.

All 73 existing + new tests pass. ruff check / format clean.
… steps 1-3)

Adds add_many to EmbeddingStoreBackend Protocol plus implementations on
LocalEmbeddingStore (sqlite-vec) and ProdEmbeddingStore (pgvector). The
adapter still uses per-row add() in this commit; the wiring lands in C2.

This is intentionally a no-perf-change commit. The point is to land the
two backend implementations and their tests on their own, so that any
perf delta measured at C2 can be cleanly attributed to the adapter
refactor, not to a backend-method change.

C1 sanity bench (3 runs, 511-chunk PDF, local stack):

  median: parse=57.1s embed=37.7s insert=0.39s total=94.9s
  vs C0:  parse=56.8s embed=37.3s insert=0.37s total=94.4s
  delta:                                              +0.5s (within noise)

What this lands:

- EmbeddingStoreBackend.add_many: new Protocol method with the same
  (id, embedding, metadata) tuple shape as add().

- LocalEmbeddingStore.add_many: single transaction containing one
  executemany. Inner try/except sqlite3.OperationalError falls back to a
  per-row loop *inside the same transaction*, so even on sqlite-vec
  builds that reject executemany against vec0 we still drop the per-row
  commit fsync cost from N to 1.

- ProdEmbeddingStore.add_many: one pool.acquire and one
  conn.transaction per batch with conn.executemany. Preserves the
  existing ON CONFLICT DO UPDATE upsert behaviour so replace_duplicates
  callers remain correct.

- tests/unit/test_knowledge_local_add_many.py (4 cases): happy path,
  metadata round-trip via get(), empty no-op, and the executemany ->
  per-row fallback path. The fallback test wraps the live connection in
  a small proxy class because sqlite3.Connection is a C type whose
  methods cannot be monkey-patched directly.

- scripts/bench_knowledge_ingest.py: each bench run now copies the
  source file under a uuid-suffixed name. Without this every run after
  the first hit the replace_duplicates branch in the engine, where the
  per-batch loop intentionally skips stage_timings, and we lost the
  embed_s / insert_s split in the bench output.

77 unit tests pass. ruff check / format clean.
…50 (#183 steps 4-5)

First commit with a measurable perf delta. The adapter now builds the
(id, embedding, metadata) tuple list once and calls store.add_many in a
single round-trip; the engine no longer wraps that call in a 50-doc
outer batch loop, because the adapter's bulk path is now correct and
cheaper than the previous per-row loop ever was.

C2 bench delta on local (3 runs, 511-chunk PDF, fastembed + sqlite-vec):

  insert_s:        0.374s -> 0.040s   -89% (9x faster, the stage we touched)
  insert_total_s:  37.63s -> 37.27s   -1% (embed dominates the wrapper)
  total_s:         94.44s -> 93.63s   -1% (insert is a small share on local)

The headline insert_s win is small in absolute terms on sqlite-vec
because per-row commit was already on a local fsync. On prod (pgvector)
each row in the old loop was a separate pool.acquire + INSERT
round-trip; the same architectural change should shave a much larger
chunk off insert_s there. We cannot measure that locally without a
pgvector test env, so the prod-side number lands when a reviewer with
that environment runs the bench script against this commit.

What this lands:

- adapter.add_documents: replaces the ``async def _add_all: for ... await
  self._store.add(...)`` loop with a list-comprehension over the
  documents and a single ``await self._store.add_many(items)`` call.
  Same metadata shape, same uuid id strategy, same return contract.
  stage_timings continues to capture embed_s and insert_s separately.

- engine._insert_documents_async: drops the outer ``_BATCH = 50`` loop
  on the replace_duplicates branch; both branches now make a single
  adapter.add_documents call. The original loop existed because per-row
  insert times scaled linearly with batch size on the old path; with
  bulk insert it is strictly slower (one extra pool create/teardown per
  outer batch on prod, and zero per-row commit fsyncs to amortise on
  local). The branch lives on only because we still need delete_by_source
  for replace semantics.

77 unit tests pass. ruff check / format clean.
… step 4)

Adapter now splits the chunk list into ``embedding_batch_size``-sized
sub-batches and embeds each one sequentially (concurrency=1 for now;
C4 turns concurrency on for network providers). New config field plus
its plumbing through the vector-store factories so users can tune it
without touching adapter internals.

C3 bench delta on local (3 runs, 511-chunk PDF, fastembed + sqlite-vec):

  embed_s:   37.20s -> 29.17s   -22% (sub-batching helps even local)
  total_s:   93.63s -> 85.78s   -8%

The win on local is bigger than the plan predicted. Likely cause:
fastembed's internal embed batching prefers inputs smaller than the
full 511-chunk single call we used to make. Cache locality and less
peak-memory pressure pay off even before any concurrency lands.

What this lands:

- KnowledgeConfig: new ``embedding_batch_size`` (default 64) and a
  forward-declared ``vector_insert_batch_size`` (default 200) that will
  drive future bulk insert sub-batching but is not used yet.

- StorageBackedKnowledgeVectorStore: takes ``embedding_batch_size`` via
  the constructor; ``add_documents`` walks the chunk list in steps of
  that size, asserts the embedder returned the expected count per call
  (RuntimeError otherwise so the insert never touches partial state),
  and concatenates the resulting vectors in order. Still one
  ``add_many`` bulk insert from C2.

- create_storage_local_knowledge_store /
  create_storage_prod_knowledge_store / create_vector_store: all forward
  the new ``embedding_batch_size`` kwarg.

- KnowledgeEngine._create_vector_adapter: passes
  ``self._config.embedding_batch_size`` through.

- tests/unit/test_knowledge_timings.py: three new cases — embed sub-batch
  sizes match config; default of 64 is honoured; embedder returning a
  short list raises ``RuntimeError`` before any insert.

80 unit tests pass. ruff check / format clean.
Adapter now dispatches sub-batched embed calls through asyncio.gather
when the provider is network-bound (openai / ollama), bounded by a
semaphore of ``embedding_concurrency``. Local providers (fastembed /
huggingface) stay strictly sequential because their ONNX/torch
runtimes already exploit internal threads and adding gather on top
just adds asyncio overhead.

C4 bench on local (3 runs, 511-chunk PDF, fastembed + sqlite-vec):

  embed_s:  29.17s -> 29.23s   no-op on local (as designed)
  total_s:  85.78s -> 85.86s   within noise

This is the expected sign for local. The concurrency win materialises
on prod with OpenAI / Ollama where each embed call is one HTTP round
trip; we cannot reproduce that here. A reviewer with that stack should
re-run scripts/bench_knowledge_ingest.py --label C4-... against this
commit to fill in the prod row of the attribution table.

What this lands:

- KnowledgeConfig.embedding_concurrency (default 4).

- Adapter constructor gains ``embedding_concurrency`` and
  ``embedder_is_network``. add_documents now builds a ``starts`` list of
  sub-batch offsets, populates a fixed-size ``vectors`` array at
  absolute indices, and runs sub-batches concurrently via
  asyncio.gather + Semaphore when the embedder is network-bound. Order
  is preserved because indices are absolute, never appended.

- Defensive RuntimeError if any chunk has no embedding after the loop
  (catches a logic bug in the dispatch before any insert touches the
  store).

- create_storage_(local|prod)_knowledge_store and create_vector_store:
  forward the two new kwargs.

- KnowledgeEngine._create_vector_adapter: sets
  ``embedder_is_network`` from
  ``self._config.embedding_provider in ("openai", "ollama")`` and
  passes ``self._config.embedding_concurrency`` through.

- tests/unit/test_knowledge_timings.py: two new cases —
  ``test_adapter_concurrent_embed_preserves_order`` runs a sleepy
  randomised embedder under network-mode and asserts vectors come back
  in input order via retrieval; ``test_adapter_sequential_when_local_provider``
  uses a thread-safe counter to assert max-concurrent embed calls is 1
  when embedder_is_network=False.

82 unit tests pass. ruff check / format clean.
…step 6)

The half of issue #183 not about throughput: ingest now emits per-stage
progress events that callers polling get_ingestion_status see in
near-real-time, instead of staring at a single processing -> completed
transition that takes minutes.

The race trap is real: stage_timings emits run on a worker thread (via
asyncio.to_thread for the vector mutation) while the completion update
runs on the outer event loop. A naive fire-and-forget would let a late
progress write overwrite status="completed" back to "processing". The
fix is two-part: (a) progress payloads never include the status field,
so even a late landing cannot un-flip it; (b) the engine tracks every
run_coroutine_threadsafe future and drains them with a bounded timeout
before writing the terminal status.

C5 bench on local (3 runs, 511-chunk PDF, fastembed + sqlite-vec):

  total_s:  85.86s -> 87.25s   +1.4s (+1.6%) for ~10 progress emits

The overhead is the cost of routing ~10 metadata update_task calls
through the worker -> outer loop bridge. Acceptable for the UX win;
the user no longer stares at a frozen UI for two minutes.

What this lands:

- adapter.add_documents: new ``progress_cb`` kwarg, default None.
  Adapter calls _safe_progress("embed", done, total) after each
  embed sub-batch, _safe_progress("insert_start", 0, total) before
  the bulk add_many, and _safe_progress("insert", count, total)
  after. _safe_progress swallows callback exceptions because progress
  is best-effort and must never fail an ingest.

- engine._insert_documents_async: accepts progress_cb, forwards to
  adapter on both the replace_duplicates and not-exists branches as
  well as the schema-mismatch retry path.

- engine._ingest_inner: emits a "parsed" stage event between Docling
  finishing and the insert starting; builds progress_cb that submits
  _emit_progress via asyncio.run_coroutine_threadsafe(outer_loop);
  tracks each future in progress_futures; drains them with per-future
  timeout=2.0s before the final status="completed" update_task call.

- Progress emits deliberately omit the ``status`` field so a stray
  late write cannot overwrite the terminal "completed" / "failed"
  status (defence in depth on top of the drain).

- tests/unit/test_knowledge_progress.py: four adapter-level cases —
  one emit per embed sub-batch with monotonic done count, both
  insert_start and final insert events fire, a throwing callback does
  not break ingest, and progress_cb=None is a no-op.

86 unit tests pass. ruff check / format clean.
 step 7)

Surfaces the three KnowledgeConfig fields added by C3-C5 in
knowledge_settings.toml so users can tune them without touching code,
and loads them via KnowledgeConfig.from_settings.

C6 sanity bench on local (3 runs, 511-chunk PDF):

  total_s:  87.25s -> 89.04s   within noise; no code path changed.

What this lands:

- src/cuga/configurations/knowledge/knowledge_settings.toml:
    [knowledge.embeddings]
      batch_size  = 64   # embedding_batch_size
      concurrency = 4    # embedding_concurrency (network only)
    [knowledge.engine]
      vector_insert_batch_size = 200  # reserved; not yet used

- KnowledgeConfig.from_settings: reads embeddings.batch_size,
  embeddings.concurrency, engine.vector_insert_batch_size with the same
  defaults as the dataclass.

- tests/unit/test_knowledge_config_perf_keys.py: confirms TOML loads
  override defaults; missing keys fall back to defaults; dataclass
  defaults match settings.toml defaults so the docs and the code agree.

89 unit tests pass.
After landing C0-C6 I went through the new code with the express goal of
finding things I had reimplemented when an existing package or a shared
helper would do. Two real duplications, both fixed here without changing
behaviour or perf (3-run bench: 87.6s vs C5 87.25s, within variance).

1. ``LocalEmbeddingStore.add`` and ``ProdEmbeddingStore.add`` each had
   their own copy of the schema/SQL/row-values construction next to the
   matching code in ``add_many``. Now:

   - local: ``add`` calls ``await self.add_many([(id, embedding, metadata)])``.
     The schema-aware INSERT building moves into a single helper
     ``_insert_sql_and_row`` used by both code paths. Semantics
     unchanged: one row in one transaction = one commit, same as before.

   - prod: ``add`` delegates to ``add_many`` for the same reason. Two
     small helpers (``_upsert_sql``, ``_row_values``) cover the schema +
     UPSERT layout once instead of twice. Pool acquire / transaction
     semantics for the single-row case are identical because executemany
     with a length-1 row list is equivalent to a single execute under
     ``conn.transaction``.

2. Adapter ``add_documents`` gather path used
   ``asyncio.to_thread(self._embeddings.embed_documents, ...)`` to make
   the sync embedder concurrent under ``asyncio.gather``. LangChain's
   ``Embeddings.aembed_documents`` already does exactly this in its
   default impl (``run_in_executor(None, embed_documents, ...)``) AND
   providers that have a real async implementation override it
   (``langchain-openai`` uses httpx natively). Switching to
   ``await self._embeddings.aembed_documents(...)`` gets us the same
   behaviour for sync-only providers and a strictly better path for
   ones with native async, with less code.

   The sequential (local-provider) path keeps the direct
   ``embed_documents`` call — there is no benefit to going through
   asyncio there and the runtime already exploits threads internally.

89 unit tests pass. ruff check / format clean.

What I audited and decided NOT to change (per the user's redundancy
ask):

- bench.py JSONL writer: a loguru sink would couple it to the logging
  system whose failure modes should not affect benchmarking. The
  current standalone writer is the right isolation.

- stage_timings dict: OpenLit (already in the repo) instruments
  high-level LLM/LangGraph calls, not sub-second embed/insert splits.
  The two are complementary.

- progress_cb (sync callback through the stack): LangChain's
  AsyncCallbackHandler is chain-level; sub-batch progress is not a
  natural fit. Adding a handler class would be more boilerplate than
  the current pattern.

- asyncpg copy_records_to_table: faster than executemany at >1000
  rows BUT does not support ON CONFLICT, which the replace_duplicates
  path needs. Keep executemany.

- pytest-benchmark: designed for microbenchmarks in pytest; our
  harness spins up CugaAgent end-to-end and parses JSONL across runs.
  Different tool for a different problem.
…dering

Wires the issue #183 perf controls and per-stage ingest progress through
every CUGA surface so the new tunables can be set without editing files
and so the UI shows what the engine is actually doing.

Surfaces touched:

CLI (src/cuga/cli/main.py)
  - `cuga start ... --embeddings-provider` (fastembed | huggingface | openai | ollama;
    OpenAI-compatible endpoints — Groq / Together / Fireworks / OpenRouter —
    use `openai` + `--embeddings-base-url`).
  - `--embeddings-model`, `--embeddings-base-url`, `--embeddings-api-key`.
  - `--embeddings-batch-size`, `--embeddings-concurrency` (issue #183 tunables).
  - All flags map to DYNACONF_KNOWLEDGE__EMBEDDINGS__* env vars set before
    the cuga modules load settings, so any service (demo / demo_knowledge /
    manager / etc.) picks them up.

UI (src/frontend_workspaces/agentic_chat/src/KnowledgeConfig.tsx)
  - KnowledgeConfigValues extended with embedding_api_key,
    embedding_base_url, embedding_batch_size, embedding_concurrency,
    vector_insert_batch_size.
  - ReindexTask.file_tasks gains optional stage + progress fields.
    `getReindexTaskProgressLabel` renders "<friendly stage> (done/total)" —
    e.g. "Embedding (192/511)" — under each running file in the reindex
    progress tile.
  - Embeddings accordion: Base URL + API Key inputs appear when the
    provider is openai or ollama, with a hint pointing OpenAI-compatible
    users (Groq / Together / Fireworks / OpenRouter) at the same path.
    Batch size and concurrency surfaced as NumberInputs with helper text
    explaining the local-vs-network semantics.

Snapshot/publish (already free, verified)
  - The three new fields are dataclass attributes, so KnowledgeConfig.to_dict
    already serializes them and coerce_and_validate restores them. Two
    new unit tests (test_perf_keys_round_trip_via_to_dict_and_coerce and
    test_perf_keys_coerce_strings_to_int) lock this contract so a future
    field that bypasses the dataclass round-trip would fail loudly.

Docs (README.md "Knowledge Base" section)
  - New "Embedding Providers" subsection: table of built-in providers +
    concrete settings.toml snippets for Groq and Together AI via the
    OpenAI-compatible path; CLI override example.
  - "Ingestion Performance Tuning" subsection: explains batch_size,
    concurrency, vector_insert_batch_size and that they snapshot with
    published configs.
  - "Per-stage Ingest Progress" subsection points users at the JSONL
    bench log for performance attribution.

Frontend rebuilt; new bundle hashes:
  main.813ae3a97097fb29e526.js
  vendors.88ac6d4d0b8ba3e82779.js
(Bundles shipped under src/cuga/frontend/dist/ via build.sh which copies
the workspaces dist into the cuga Python package for the wheel.)

91 unit tests pass (+2 new snapshot round-trip tests). ruff clean.
… smoke

Closes the gaps surfaced by the prod-side rigorous review and ships P1
(Docling level config) from the original plan. Two end-to-end tests
land alongside: pgvector + fastembed (Docker) and Groq connectivity.

1. vector_insert_batch_size now actually enforced
   The plan declared this knob in C6 but the adapter never read it. With
   the full chunk list going to a single add_many on prod (pgvector), a
   large document risked: command_timeout=60 firing, an HNSW write lock
   held for many seconds, and asyncpg's binary-protocol buffer ballooning.
   Adapter now splits items by vector_insert_batch_size and calls add_many
   per sub-batch. Plumbed through create_storage_*_knowledge_store and
   create_vector_store. Stress-test bench with batch=10 (41 sub-batches
   for 410 chunks) added only +0.12s on the prod stack — knob works,
   cost is tiny.

2. Docling pdf_mode: fast | balanced | accurate
   The dominant cost on the 7-8 min ingest in #183 is Docling parse.
   New KnowledgeConfig.docling_pdf_mode (default "accurate" so nothing
   regresses) maps to PdfPipelineOptions: fast turns off OCR + table
   structure; balanced turns off OCR but keeps tables; accurate stays
   on Docling defaults. Converters are cached per-mode so toggling
   between ingests doesn't reload weights. Wired through:
     - settings.toml: [knowledge.docling] pdf_mode
     - CLI: cuga start --docling-pdf-mode
     - UI: new "PDF Parsing (Docling)" accordion with a select + warning
       about fast mode + scanned PDFs
     - Snapshot: dataclass field, so to_dict / coerce_and_validate
       round-trip preserves it. Test added.
     - Validation: rejects unknown modes, integer batch knobs must be > 0.

3. End-to-end pgvector + fastembed bench (Docker)
   Spun up pgvector/pgvector:pg16 locally and ran the bench against the
   1.5MB RAG PDF, 511 chunks. Measured:
     accurate: 91.8s total (parse 58.4 / embed 33.05 / insert 0.24)
     fast:     39.2s total (parse 20.97 / embed 17.93 / insert 0.21)
              -57% wall time, -64% parse, -46% embed (410 chunks vs 511)
   Search round-trip verified end-to-end: top hit score 0.881 on
   'retrieval augmented generation' against the just-ingested doc.

4. Groq connectivity smoke test
   scripts/test_groq_connectivity.py builds a KnowledgeConfig with
   provider=openai + base_url=https://api.groq.com/openai/v1 +
   api_key=$GROQ_API_KEY, builds the embedder via create_embeddings,
   and calls embed_query. Confirms the OpenAI-compatible config path
   plumbs end-to-end. Groq's catalog has zero embeddings models so the
   call returns a structured 4xx — the test accepts any Groq 4xx as
   "config path verified". The honest finding (Groq != embeddings)
   stays in the README.

98 unit tests pass. ruff clean. New frontend bundle:
  main.2bebaba0f19f267fbd65.js
  vendors.88ac6d4d0b8ba3e82779.js
…dd knowledge_config kwarg

Two real gaps surfaced when the user asked "is everything verifiably
configurable through UI and SDK". Both fixed.

UI gap: vector_insert_batch_size
  The type interface declared it but no NumberInput was rendered. Added
  one under the Embeddings accordion with min=1 max=5000 step=50, helper
  text explaining the pgvector command_timeout / HNSW lock motivation,
  and a default of 200 matching KnowledgeConfig + settings.toml.

SDK gap: CugaAgent only had enable_knowledge: bool
  SDK users had no way to set docling_pdf_mode, embedding batch knobs,
  provider, base_url, or api_key from Python — they had to set env vars
  before importing cuga or edit settings.toml. New ``knowledge_config``
  kwarg accepts:
    - None (default): falls back to KnowledgeConfig.from_settings(settings)
    - KnowledgeConfig: validated then used directly
    - dict: passed through coerce_and_validate (with type coercion,
      so e.g. embedding_batch_size="32" becomes 32)

  Validation happens when ``agent.knowledge`` is first accessed (lazy,
  matching the rest of the engine init pattern). Bad config raises a
  clear ValueError before any ingest can land.

The knowledge property:

    @Property
    def knowledge(self):
        ...
        override = self._knowledge_config_override
        if override is None:
            config = KnowledgeConfig.from_settings(settings)
        elif isinstance(override, KnowledgeConfig):
            override.validate()
            config = override
        else:
            config = KnowledgeConfig.coerce_and_validate(dict(override))
        engine = KnowledgeEngine(config)
        ...

Tests (tests/unit/test_sdk_knowledge_config.py, 4 cases):
  - kwarg with KnowledgeConfig lands on engine._config for all 7 fields
  - kwarg with dict goes through coerce_and_validate (str->int coercion)
  - bad config raises ValueError lazily at first ``.knowledge`` access
  - no kwarg falls back to settings.toml (env var still works)

102 unit tests pass (4 new). ruff clean. Frontend rebuilt.
knowledge_ingest_perf.pptx and KNOWLEDGE_INGEST_PERF_SLIDES.md are personal
review materials, not codebase content — kept local only.
Also adds Office lock file patterns (~$*.pptx etc.).
… engine

Two related changes shipped together. The bug fix gates the OpenRouter
work: without it, a user who picks OpenRouter in the UI (or changes any
other knowledge setting) sees an HTTP 200 PATCH but no behavior change
until they explicitly publish, which is hostile UX in demo / single-agent
mode.

1. BUG FIX — PATCH /api/manage/config/draft/knowledge now applies to the
   live engine, not only the draft store
   Symptom (user-reported): selected Fast docling_pdf_mode in the UI, the
   request returned 200, but the next upload's log line still showed
   "mode='accurate', do_ocr=True, do_table_structure=True".
   Root cause: the handler at manage_routes.py:931 saved the merged
   config to draft + rebuilt the draft agent, but never called
   engine.apply_knowledge_config on app.state.app_state.knowledge_engine
   (the live engine that uploads route through). Comment line at the top
   of the function literally said "No engine mutation (draft isolation)".
   Fix: after validation, call live_engine.apply_knowledge_config(filtered).
   prepare_knowledge_update skips the expensive embed preflight unless
   provider/model actually changed, so docling_pdf_mode / batch sizes
   stay cheap. On preflight failure (bad key, unknown model) we return
   400 with the engine error message and don't save the draft — no more
   "saved but not applied" zombie state.
   Response now includes live_applied + live_changes so the UI can show
   "reindex recommended" without an extra round-trip.

2. OpenRouter as a first-class embeddings provider
   User asks "what embeddings provider should I use" → today the answer
   is "provider=openai + base_url=https://openrouter.ai/api/v1 + the right
   key", a 3-field setup with a magic URL. New flow:
     provider="openrouter"
     model="openai/text-embedding-3-small"   # REQUIRED, no default
     api_key=$OPENROUTER_API_KEY              # REQUIRED
   Base URL auto-set to https://openrouter.ai/api/v1; embedding_base_url
   override still honored as an escape hatch. Both model and api_key
   are required by design — OpenRouter has no single "default"
   embeddings model and we'd rather fail loudly than pick one for the
   user.

   Surfaces wired:
     • engine.create_embeddings: new branch + new OPENROUTER_BASE_URL constant
     • KnowledgeConfig.validate: "openrouter" accepted
     • _create_vector_adapter: embedder_is_network includes openrouter
       (concurrency win applies)
     • settings.toml comments updated to call out openrouter rules
     • CLI: --embeddings-provider help text mentions openrouter explicitly
     • UI: new SelectItem + conditional block showing only Model + API Key
       fields, plus an inline link to
       openrouter.ai/models?output_modalities=embeddings
     • SDK: CugaAgent(knowledge_config=KnowledgeConfig(provider="openrouter",...))
     • Publish/Import: dataclass field, free round-trip via to_dict /
       coerce_and_validate (test-locked)
     • Docs: README "Knowledge Base" → "Embedding Providers" gains an
       OpenRouter quick-start with TOML, CLI, and SDK snippets
     • Smoke script: scripts/test_openrouter_connectivity.py exercises
       the same code path the engine uses; run when OPENROUTER_API_KEY
       is set and you want to verify auth/URL/body shape

3. UI touch-up: renamed "PDF Parsing (Docling)" accordion to "Document
   Parsing (Docling)" — Docling handles DOCX/PPTX/HTML/images/etc. too,
   and the user pointed out the PDF-only naming was misleading.

Tests (115 unit tests pass, up from 102):
  • tests/unit/test_knowledge_openrouter.py (10 cases): factory auto-URL,
    explicit override, missing key/model errors, env var fallback,
    validation accept/reject, snapshot round-trip, network-flag, SDK
    kwarg, TOML load.
  • tests/unit/test_knowledge_patch_live_apply.py (2 cases): PATCH
    applies to engine; invalid config returns 400 without partial apply.

Frontend rebuilt: main.977339c561b271967583.js
After the OpenRouter + live-apply commit, walked the whole UI→PATCH→engine→
ingest pipeline once more with adversarial inputs. Found four real gaps;
all fixed and locked with tests.

1. validate() didn't enforce OpenRouter has a model
   Before: provider=openrouter + empty model passed config validation but
   blew up later in create_embeddings, surfacing a generic engine error
   that didn't name the field. Now KnowledgeConfig.validate() rejects it
   at coerce_and_validate time with a message pointing at the right
   knowledge.embeddings.model field + a link to OpenRouter's catalog.
   Test: test_validate_rejects_openrouter_without_model (+ whitespace
   variant + the env-var-fallback case for api_key, which must NOT be
   blocked at validation time).

2. No whitespace stripping on api_key / model / base_url
   Classic copy-paste footgun: a key pasted from a docs page or terminal
   often carries a trailing newline / space. langchain-openai's auth
   then silently fails with a confusing 401. Now stripped at the
   ``create_embeddings`` boundary for both openai AND openrouter (and
   base_url too). Whitespace-only values are rejected as if empty.
   Tests: strips_whitespace, whitespace_only_model_raises,
   whitespace_only_key_raises.

3. UI silently swallowed PATCH errors
   ManagePage.tsx had a debounced PATCH that did `if (res.ok)` and a
   bare `catch {}` with a comment "// silent". With the new live-apply
   behavior, a 400 from the engine (bad key, unknown model) was
   completely invisible to the user — they saw their UI value but the
   engine still ran the old config. That was exactly the confusion the
   bug fix was supposed to eliminate.
   Now: on non-ok response we parse the JSON body's `detail` field and
   show an error toast. On network error (the `catch`), we show a
   second-tier toast. The user always sees something.

4. .env / .env.example docs for OpenRouter
   Added a commented-out OPENROUTER_API_KEY block to both files
   explaining how to opt-in. .env.example uses a placeholder
   (sk-or-...); .env carries the existing dev key behind a comment so
   re-enabling for testing is one line.

Tests (121 pass, +6 new): all four fixes pinned individually.
Frontend rebuilt: main.f70669d8954206299214.js.
…er matrix in README

Two changes that surfaced when actually exercising OpenRouter with the
user's key:

1. tiktoken-off when targeting non-OpenAI routes
   The first bench attempt against ``baai/bge-base-en-v1.5`` via
   OpenRouter failed with "No embedding data received". Root cause:
   langchain-openai's ``OpenAIEmbeddings`` defaults to ``tiktoken_enabled
   =True``, which sends integer arrays for OpenAI-named models. Non-OpenAI
   models on OpenRouter (BAAI, Mistral, Qwen, Nvidia, etc.) reject those
   with HTTP 422 "Input should be a valid string". Curl with raw string
   input proved the underlying API works.

   Fix: in ``create_embeddings``, when ``provider == "openrouter"`` OR
   when ``provider == "openai"`` AND ``embedding_base_url`` is set,
   construct the client with ``tiktoken_enabled=False`` +
   ``check_embedding_ctx_length=False``. Forces raw-text mode for the
   wire request. Verified end-to-end:
     openai/text-embedding-3-small   → 1536-dim ✓
     baai/bge-base-en-v1.5            → 768-dim ✓
     sentence-transformers/all-mpnet-base-v2 → 768-dim ✓

   (nvidia/llama-nemotron-embed-vl-1b-v2:free still trips the OpenAI
   SDK's "No embedding data received" — likely a response-shape
   incompatibility we can't fix without forking. Documented as a
   provider-side quirk.)

2. Measured embedding-provider performance matrix in README
   Ran the bench harness with 3-run median per cell over the same 1.5 MB
   PDF, local sqlite-vec, fastembed-vs-OpenRouter × {balanced, fast}.
   Headline numbers added to README's Knowledge Base section:
     fastembed × accurate (post-P0):              85.8 s
     fastembed × balanced:                        89.9 s
     fastembed × fast:                            39.9 s
     OpenRouter openai/text-embedding-3-small × balanced: 62.8 s
     OpenRouter baai/bge-base-en-v1.5 × balanced:         64.5 s
     OpenRouter openai/text-embedding-3-small × fast:     23.8 s  ← best
   embed_s is the discriminator: fastembed is CPU-bound at ~30 s for
   ~500 chunks; cloud routes drop it to 2-4 s. Combined with Docling
   fast mode (parse -60%), best stack runs 3.96× faster than the
   pre-P0 baseline.

README now also includes a "Choosing a provider" decision table.

KNOWLEDGE_PIPELINE.md (gitignored internal reference) was updated
locally with the same matrix + the validation/auto-flags story for
future maintainers.
Previous commit only had 6 of the 9 cells — missing OpenRouter × accurate
for both models and OpenRouter baai × fast. User correctly called out the
gap; ran the missing benches and updated the README.

Full matrix (1.5 MB PDF, 3-run median, local sqlite-vec, total_s):

  Provider \ Docling     fast    balanced  accurate
  fastembed              39.9    89.9      85.8
  openrouter openai-3s   23.8    62.8      63.2
  openrouter baai/bge    26.4    64.5      63.4

Per-stage data also expanded in the README so reviewers can see exactly
where each saving comes from (parse_s vs embed_s vs insert_s).

Three observations the matrix made explicit:
  1. OpenRouter shaves embed_s by 7-14× independent of Docling mode.
  2. balanced ≈ accurate on digital PDFs (OCR finds nothing useful).
  3. Cloud has variance — one openrouter × accurate run spiked to ~960s
     embed_s on a rate-limit / network blip. The 3-run median is the
     honest comparison; SLA-sensitive ingest should factor retries.

Local artifacts also updated (gitignored): KNOWLEDGE_PIPELINE.md with the
full table + caveat, knowledge_ingest_perf.pptx slide 6 now shows the
3×3 plus the per-stage breakdown for the headline rows.
…ty / robustness

Adds a client-shareable benchmarking pipeline that measures the four
dimensions a client weighs when picking a CUGA knowledge configuration
for their workload. Methodology, harness, queries, and renderer all
ship together so a client can run it on their corpus and reproduce the
numbers.

What ships

  scripts/bench_knowledge_eval.py           Run a (provider, model, docling_mode)
                                            matrix on a single PDF + labeled
                                            query set. Outputs JSON.
  scripts/bench_knowledge_eval_report.py    Render the JSON as 3 markdown
                                            tables (Latency × Cost, Quality,
                                            Decision matrix).
  scripts/eval_queries.json                 10 labeled queries — 3 title/
                                            author + 4 concept + 3 detail —
                                            tagged by ``kind`` so a client can
                                            spot configurations that ace one
                                            kind but bomb another.
  scripts/KNOWLEDGE_EVAL_METHODOLOGY.md     The reasoning behind every metric,
                                            test-design tradeoffs, how to
                                            adapt for a client engagement, and
                                            the gaps v1 does not yet cover
                                            (no LLM grader, no multi-corpus
                                            rollup, no prod-pgvector flag).

Four metrics per cell, by design

  Latency        : parse_s / embed_s / insert_s / total_s + min/max spread
                   (3-run median by default; cloud variance is real — we
                   saw a 960 s outlier on openrouter × accurate).
  Cost           : tokens estimated from chars × per-model OpenRouter
                   pricing (live lookup, with hard-coded fallback).
                   $0 for local providers.
  Quality        : avg top-1 score (informational; provider-specific scale)
                   + top-5 Jaccard vs a reference cell. Jaccard answers
                   the client's real question — "do I get the same chunks
                   if I switch from the trusted/expensive config to this
                   cheaper one?" — without needing hand-labeled ground
                   truth.
  Robustness     : ingest_status + search_status + first 3 error strings.
                   A cell that times out or errors mid-run is still
                   reported as failed, with diagnostic text.

Findings from the default 9-cell matrix on the RAG paper (2-run median,
10 queries, reference = openrouter/openai-3-small/balanced):

  Fastest                        openrouter/openai-3-small/fast        26.8 s   $0.0083
  Cheapest cloud                 openrouter/baai-bge-base/fast         29.4 s   $0.0021
  Free / air-gapped              fastembed/fast                        36.6 s   $0
  Quality-preserving AND cheap   openrouter/openai-3-small/fast        26.8 s   $0.0083   Jaccard 0.82

Top-5 Jaccard vs reference, by cell:

  openrouter / openai-3-small / balanced    1.00   (reference)
  openrouter / openai-3-small / accurate    1.00   (same model, different parse)
  openrouter / openai-3-small / fast        0.82   (fast skips OCR/tables; minor chunk diff)
  openrouter / baai-bge-base / *            0.20-0.30
  fastembed / *                             0.13

This is the kind of result a client cannot get from the latency
benchmarks alone: switching providers materially changes which chunks
come back for a query. Whether that change is "worse" or "different but
equivalent" is a corpus-specific judgment the client must make. The
harness gives them the queries to spot-check.

Two bugs surfaced while building the harness

  1. Dynaconf env-var caching: the first attempt set
     DYNACONF_KNOWLEDGE__... env vars per cell but cuga.config was
     module-cached after cell 1. Refactored to use the SDK's
     ``knowledge_config=`` kwarg (built for exactly this case in commit
     6e106f0). The harness is now the canonical user of that kwarg.

  2. Filename-based chunk identity: first eval gave Jaccard = 0 for
     every non-reference cell. Root cause was the per-run uuid-tagged
     copy of the source PDF (added earlier to keep each ingest on the
     new-doc branch); every cell saw a different filename, so the chunk
     id was unique per cell. Fixed by using content-based identity:
     ``p{page}:{first 120 chars of chunk text}``.

Robustness note for clients

  One of the OpenRouter accurate-mode cells gave a transient "No
  embedding data received" error on the first bench attempt; the rerun
  succeeded. Cloud providers have variance, including occasional outlier
  runs taking 10x+ the median. The harness reports min/max alongside
  median so a client can see the spread and budget retries.

README now points at the harness from the Knowledge Base section with a
quick-start + summary of the default-matrix findings.
@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds knowledge configuration, retrieval, storage, routing, and UI changes for embedding providers, GPU support, progress reporting, client adaptation, and scoped session behavior.

Changes

Knowledge platform rollout

Layer / File(s) Summary
Deployment docs and benchmark tooling
README.md, Dockerfile*, .env.example, .gitignore, pyproject.toml, docs/examples/*, scripts/bench_*.py, src/scripts/run_tests.sh
Deployment notes, container images, dependency metadata, ignore rules, docs examples, benchmark CLIs, and test runner coverage were updated for OCR, GPU, and knowledge benchmarking.
Knowledge contracts and runtime backend
src/cuga/backend/knowledge/*, src/cuga/backend/cuga_graph/..., src/cuga/configurations/knowledge/*, src/cuga/__init__.py
Knowledge config/schema, prompt assembly, search envelopes, reranking, query expansion, scope handling, storage/retrieval plumbing, and knowledge contract/profile settings were expanded.
SDK and frontend integration flow
src/cuga/sdk.py, src/frontend_workspaces/**, src/cuga/cli/main.py, src/cuga/backend/server/*, src/cuga/backend/knowledge/routes.py, src/cuga/backend/knowledge/mcp_server.py
SDK overrides, manage routes, startup apply behavior, CLI commands, frontend knowledge docs/badges, and upload/search/task APIs were wired through the new knowledge surfaces.
Validation and regression coverage
tests/**
New and updated tests cover adaptation, GPU/use_gpu, provider selection, layout/OCR, reranking, session isolation, upload progress, publish/import, limits, and route/UI parity.

Sequence Diagram(s)

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/knowledge-ingest-bottlenecks

@sami-marreed sami-marreed left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review summary

Solid perf work (bulk insert, concurrent embed, docling modes, OpenRouter, live PATCH) with good unit test coverage when run manually — but several blockers before merge.

Blockers

  1. Merge conflicts with main — GitHub shows mergeable: CONFLICTING (see inline on index.html)
  2. PR description overclaims — OCR, session isolation, live upload UX, and "387 tests" are not in the 41-file diff
  3. New tests not in CI — 8 new test files are not wired into src/scripts/run_tests.sh (lines 82–89), which is what .github/workflows/tests.yml runs
  4. Event-loop blockingfut.result() in async ingest; sync embed preflight on live PATCH
  5. Data integrity — partial insert failure can leave orphan vectors on pgvector
  6. Out-of-scope files — eval harness, methodology doc, personal .gitignore entries

Also noted (file unchanged in PR — no inline possible)

  • src/cuga/backend/knowledge/routes.py — upload still await engine._run_ingest(...) synchronously; PR body claims wait=false / 202 / reload resume but routes.py has zero diff
  • src/cuga/backend/knowledge/engine.py commit_knowledge_updatemetric_changed is computed in prepare_knowledge_update but omitted from the return dict; manage_routes.py exposes it in live_changes anyway → always null. Add "metric_changed": prepared.metric_changed to the return.
  • get_settings() — omits new fields (docling_pdf_mode, batch knobs, embedding_api_key) after live PATCH
  • PR still Draft — please mark ready after fixes

Happy to follow up with more comments as review continues.

Comment thread src/cuga/frontend/dist/index.html Outdated
Comment thread src/cuga/backend/knowledge/engine.py Outdated
Comment thread src/cuga/backend/server/manage_routes.py Outdated
Comment thread src/cuga/backend/knowledge/storage/adapter.py
Comment thread .gitignore Outdated
Comment thread scripts/KNOWLEDGE_EVAL_METHODOLOGY.md Outdated
Comment thread scripts/bench_knowledge_eval.py Outdated
Comment thread README.md
Comment thread tests/unit/test_knowledge_timings.py
Comment thread src/cuga/backend/knowledge/engine.py
dolev31 added 5 commits June 8, 2026 17:52
Snapshot of in-progress work being merged with feat/knowledge-client-adaptation:
- search_multi Option B per-source cap (limit × n_scopes, capped 100); fallback keeps Option F
- BM25/SQLite FTS5 hybrid retrieval with cross-leg RRF
- CID-glyph detector for unOCR'd PDF text layers
- Knowledge snapshot import/export via CLI + manage routes
- Knowledge UI: in-flight upload persistence, sticky non-zero badge hook
- Per-stage progress timers + JSONL bench harness
- Many test additions across the knowledge surface
Cherry-picked c9459a8 ("WIP: client-adaptation panel + knowledge
harness") onto a5a7d5a. Conflicts resolved with default-keep-perf,
manually layered harness features on top:

Frontend:
- New ClientAdaptationPanel.tsx (Behavior tab — rules editor + glossary)
- KnowledgeConfig.tsx: glossary entry type + adaptation fields on the
  interface; new client_adaptation/glossary surfaces in the modal
- ManagePage.tsx: harness invocation surfaces

Backend:
- engine.py: layered query expansion (glossary aliases) ahead of dense
  + lexical fan-out; empty-guard retry with unexpanded query when
  expansion drives dense embedding away from useful regions; cross-
  encoder reranker post-fusion; preserved perf branch BM25 + drain
  loop + Option B per-source cap
- awareness.py: client_adaptation block rendered top-of-summary; hash
  log for observability; new fields plumbed via assemble_system_prompt_section
- config.py: client_adaptation_text + glossary + rerank_* fields;
  validation (denylist, bidi-override codepoints, glossary limits);
  TOML round-trip via [knowledge.client_adaptation]; secret-redaction
  preserved
- cuga_lite_graph.py: backported the multi-tenant draft_knowledge_configs
  dict lookup so the draft-overlay path is shared-deployment safe
- cli/main.py: kept perf's config-*/snapshot-* commands AND added
  adapt's adaptation-*/glossary-*/doctor (disjoint command names —
  both helper functions retained for env-var back-compat)
- knowledge_instructions.md: kept perf's narrow Scope rules; layered
  "Citing Sources" section from adapt above Anti-patterns

New files brought over:
- query_expansion.py, reranker.py
- tests/integration/test_client_adaptation_e2e.py
- tests/unit/test_knowledge_client_adaptation.py
- tests/unit/test_knowledge_lever_stack.py
- docs/examples/knowledge_demo/client_adaptation_example.md
- knowledge_testing/.gitignore

Files NOT brought over:
- prepare_node.py — perf branch is pre-agent-core-refactor and never
  had this layout; equivalent logic backported into cuga_lite_graph.py
  (multi-tenant dict lookup, with client_adaptation_text/glossary
  picked up automatically inside assemble_system_prompt_section)

Tests: 667/667 knowledge unit tests pass.

Dist bundles will be rebuilt fresh in a follow-up step.
…ld artefacts

Two paths that were appearing in ``git status`` and don't belong in the
tracked tree:

- ``.claude/scheduled_tasks.lock`` + ``.claude/projects/`` — Claude Code's
  internal session state. Different per developer; never useful upstream.
- ``docs/examples/**/build/`` — Python build artefacts created when
  running examples in isolation (``uv build`` from a sub-example dir).

Pure housekeeping; no behavioural change.
…nt CUDA runtime

Closes the knowledge-engine production-readiness gap with three closely-coupled
changes audited by a 3-expert review (image / per-stage perf / deployment UX)
plus a Docling-specific deep dive.

## 1. Pareto-locked RAG profiles

Profile TOMLs now own their full effective config rather than just chunking +
prompt addendum. Each profile pins:

  - embedding_model + batch_size + concurrency
  - chunk_size + overlap
  - docling.pdf_mode + layout_engine + drop_page_chrome
  - rerank.enabled + top_k_in + model
  - search.default_limit + score_threshold + max_search_attempts +
    hybrid_mode + junk_filter
  - engine.max_ingest_workers + vector_insert_batch_size

The matrix (speed / standard / balanced / max_quality) is locked from a 3-expert
debate on a 556-page PDF corpus; values defended in each TOML's header comment.

Wiring:
  - ``from_settings`` reads the named profile's full surface; settings.toml
    keys act as fallback only.
  - ``coerce_and_validate`` applies the profile when ``rag_profile`` is
    named in the incoming PATCH (a bare ``{"rag_profile": "X"}`` produces
    a fully X-shaped config, not just a relabel).
  - Code-enforced contract: ``rerank_top_k_in >= 3 * default_limit`` —
    Expert B's recall-ceiling finding.
  - New ``would_invalidate_vectors(other)`` helper that the PATCH-config
    UX layer can use to gate a confirmation dialog.

Default rag_profile remains ``standard`` (rerank off; bge-small-en;
balanced parse) so the shipped default fits ~4GB containers.

## 2. GPU image variant (Dockerfile.gpu) + transparent CUDA

New ``Dockerfile.gpu`` on ``nvidia/cuda:12.4.1-cudnn-runtime-ubuntu24.04``
base — slim Debian lacks ``libcudnn.so.9`` / ``libgomp.so.1`` / ``libcudart.so.12``
and causes silent ORT fallback to CPU.

New ``[project.optional-dependencies] gpu = ["onnxruntime-gpu>=1.20",
"fastembed-gpu>=0.8.0"]`` (lockfile-managed). ``fastembed-gpu`` avoids the
qdrant/fastembed#608 dual-install collision where a transitive ``onnxruntime``
silently strips CUDAExecutionProvider.

Main ``Dockerfile`` reverts a broken ``--build-arg GPU=1`` block: it was
lockfile-unaware (``uv pip install --reinstall``) and any later ``uv sync``
would silently revert to CPU.

Runtime:
  - ``_FastEmbedEmbeddings`` detects ``CUDAExecutionProvider`` in active
    providers and forces ``batch_size=256, parallel=1`` for the embed call.
    Without this, bge-small at fastembed's default ``parallel=None``
    spawns one worker per CPU core, all sharing one GPU, and CUDA-context
    thrash caps the speedup at 2–3× CPU.
  - 3-mode startup warning distinguishes (a) GPU image with no device
    visible (forgot ``--gpus all``), (b) CPU image with GPU requested
    (wrong tag), (c) partial GPU (torch sees CUDA but ORT doesn't).
  - ``gpu_required: bool = False`` config field + ``CUGA_GPU_REQUIRED=1``
    env var: when set, the engine raises at startup instead of warning,
    matching vLLM / TGI fail-fast posture.
  - New ``cuga doctor`` CLI: prints nvidia-smi, ORT providers, torch
    CUDA, fastembed live session providers, /dev/shm size. Copy-pasteable
    for support tickets.

README: full GPU deployment section with k8s manifest example
(``nvidia.com/gpu: 1`` + tmpfs /dev/shm + ``CUGA_GPU_REQUIRED=1``),
per-stage speedup table from the audit, HF Spaces + AWS ECS notes.

## 3. Docling parse polish (per Expert D's deep dive)

  - ``generate_page_images=False`` only when ``pdf_mode=fast``. Saves
    ~30 ms/page in PNG encoding when the user has explicitly chosen the
    fast profile. Balanced + accurate keep Docling's default so downstream
    UI/extension consumers that may want the rendered PNGs aren't broken.
  - Warning when CUDA is engaged but ``max_ingest_workers <= 2``: per
    Expert D, a single H100 fits ~10–14 Docling workers (4–6 GB VRAM each);
    cuga's laptop-tuned default leaves the GPU ~80% idle.

## Tests

  - 709 knowledge unit tests pass.
  - 8 existing tests updated for the new profile-wins semantics: they now
    point ``rag_profile`` at ``__no_such_profile__`` to verify the
    settings.toml fallback path independently of profile expansion.
  - ``test_publish_syncs_draft_*`` + ``test_cuga_lite_knowledge_scopes``
    pre-existing fragilities resolved (apply_knowledge_config stub on the
    fake engine; assertion shape updated to match the live setdefault →
    get/assign refactor on thread_id injection).

## Defended in 3 expert reports

  - Expert A (image / packaging): nvidia/cuda base + ``[gpu]`` extra +
    fastembed-gpu eliminate 3 silent-CPU-fallback footguns found by audit.
  - Expert B (per-stage GPU perf): batch_size=256 / parallel=1 on CUDA;
    code-enforced ``rerank_top_k_in >= 3 * default_limit``.
  - Expert C (deployment UX): refined warning + ``gpu_required`` +
    ``cuga doctor`` + README with platform matrix.
  - Expert D (Docling parse): ``generate_page_images=False`` in fast mode;
    ``max_ingest_workers`` warning.

GPU image size: ~4.8 GB (CPU image stays ~1.5 GB).
End-to-end speedup on H100 vs CPU (measured baseline 6.1 min, 556-page PDF):
~10× with multilingual-e5-large; ~3–6× with bge-small-en (launch-overhead
bound at small models). Tesseract OCR + sqlite-vec/pgvector + BM25 remain
CPU floors regardless.
Ships a small, opinionated corpus that operators can ingest immediately
to evaluate the knowledge engine end-to-end without sourcing their own
documents:

  - ``agent_level/`` — 4 evergreen HR-style policies (employee handbook,
    health insurance plan, PTO policy, 401k retirement plan). These map to
    cuga's ``agent_level`` scope (permanent, shared across conversations).

  - ``session_level/`` — 3 user-scoped documents for a fictional persona
    (Sarah Chen): benefits enrollment, paystub, PTO balance. These map to
    cuga's ``session_level`` scope (transient, per-conversation) and
    demonstrate the per-thread scope-isolation contract.

  - ``build_deck.py`` + ``cuga_knowledge_demo.pptx`` — small Python
    pptx-build script and the rendered demo deck (~80KB) for screenshare
    walkthroughs.

Total footprint ~100KB. All content is fictitious; useful for evaluation,
QA, and onboarding new cuga users to the knowledge feature.
Comment thread src/frontend_workspaces/agentic_chat/src/ClientAdaptationPanel.tsx Fixed
Comment thread src/cuga/backend/server/manage_routes.py Fixed
dolev31 added 2 commits June 9, 2026 11:36
…-bottlenecks

# Conflicts:
#	src/cuga/backend/cuga_graph/nodes/cuga_lite/cuga_lite_graph.py
#	src/cuga/backend/knowledge/client.py
#	src/cuga/frontend/dist/index.html
#	src/frontend_workspaces/agentic_chat/src/App.tsx
…merge

The merge with origin/main pulled in 27bd35e (extract shared agent logic
into cuga_agent_core) which moved ~2300 lines out of cuga_lite_graph.py.
Three of our branch's semantic improvements would have been lost without
this fix:

1. ``helpers/knowledge.py::_get_knowledge_tool_scope_context`` — restore
   the synthetic ``"all"`` scope and ``"session"`` default. Main's
   refactored helper returned only ``("agent", "session")`` with default
   ``"agent"``; our branch had evolved to ``("all", "agent", "session")``
   with default ``"session"`` so the LLM defaults to the narrowest plausible
   query and the engine auto-falls back to ``"all"`` if session is empty
   (the wiring lives in ``KnowledgeClient.search`` / ``_run_multi``).
   Without this restore, multi-scope deployments lose the precision gain
   that the perf branch shipped.

2. ``helpers/knowledge.py::_knowledge_scope_instruction`` — update the
   both-scopes-wired branch to describe the ``"all"`` scope explicitly so
   the LLM tool prompt matches the actual scope set returned in (1).

3. ``adapter/prepare_node.py`` — when assembling the draft-overlay knowledge
   config for ``--draft`` agents, prefer the per-agent
   ``draft_knowledge_configs`` dict before falling back to the legacy
   singular ``draft_knowledge_config`` attribute. Main's refactored
   prepare_node read only the singular attribute, which is multi-tenant
   unsafe (two draft agents on the same host would clobber each other).
   The dict lookup mirrors what cuga_lite_graph.py had pre-refactor.

4. ``test_knowledge_multi_scope_search.py`` — update the import path
   for ``_get_knowledge_tool_scope_context`` from ``cuga_lite_graph`` to
   ``cuga_lite.helpers.knowledge`` (the new home post-refactor).

All 709 knowledge unit tests pass.
dolev31 added 3 commits June 9, 2026 11:48
After the merge with origin/main pulled in the cuga_agent_core refactor,
CI's ``ruff check`` flagged 40 lint errors across pre-existing files
(31 auto-fixable, 9 manual). All are unused-imports / unused-locals /
semicolon-separated statements — no behavioural change.

  - tests/unit/test_knowledge_use_gpu_surfaces.py + a dozen other test
    files: ``os``, ``patch``, etc. imported but unused.
  - tests/unit/test_knowledge_layout_engine.py:429-430: split two
    semicolon-joined statements onto their own lines.
  - src/cuga/backend/knowledge/engine.py:2473: drop unused
    ``rerank_in`` assignment.
  - scripts/bench_knowledge_eval.py:242,248: drop unused
    ``last_ingest_filename`` + ``docs``.

ruff check now passes cleanly. 709 knowledge unit tests still pass.
… failure

CI's ``ruff format --check`` flagged 38 files needing format normalization
(extra blank lines, trailing-comma fixups, multi-line call formatting).
Applied ``uv run ruff format .`` to bring the tree in line with the
upstream code style. No behavioural change; 709 knowledge unit tests
still pass.
…sion

Background: CI's ``uv sync --all-extras --dev`` installs every extra
side-by-side. With ``fastembed`` (base) AND ``fastembed-gpu`` (in the
previous ``[gpu]`` extra) both installed, they clobber each other on
disk — the same qdrant/fastembed#608 issue we were trying to avoid.
Net effect under CI: ``ModuleNotFoundError: No module named 'fastembed'``
on every integration test that imports the engine.

Fix:
  - Remove ``fastembed-gpu`` from ``[project.optional-dependencies] gpu``.
    Only ``onnxruntime-gpu`` stays. ``fastembed-gpu`` was a nicety; the
    Dockerfile.gpu can swap the underlying ORT directly.
  - ``Dockerfile.gpu`` now does ``uv pip uninstall onnxruntime`` between
    ``uv sync --extra gpu`` (which adds ``onnxruntime-gpu``) and the
    cu121 torch reinstall. Two ORT packages on disk silently strips
    CUDAExecutionProvider — the uninstall keeps it clean.
  - ``uv.lock`` re-locked to remove ``fastembed-gpu``.

CI's ``uv sync --all-extras`` now resolves cleanly. 709 knowledge unit
tests still pass.
dolev31 added 2 commits June 9, 2026 14:25
… collision

``uv sync --all-extras --dev`` installs every optional dep, including the
new ``[gpu]`` extra. ``onnxruntime-gpu`` (in [gpu]) and ``onnxruntime``
(transitive of base fastembed) then coexist on disk and clobber each
other — ``import onnxruntime`` raises ``AttributeError: module
'onnxruntime' has no attribute 'SessionOptions'``. The qdrant/fastembed
#608 and microsoft/onnxruntime #21825 issues both document the collision.

Fix: ``--no-extra gpu`` in the workflow. CI never needs the GPU
runtime — the GPU path is exercised in Dockerfile.gpu only.
CodeQL flagged two new alerts after the merge with origin/main:

1. HIGH ``js/polynomial-redos`` —
   ``src/frontend_workspaces/agentic_chat/src/ClientAdaptationPanel.tsx:309``
   ``value.replace(/\s+$/, "")`` is the classic polynomial-backtracking
   pattern; on adversarial inputs the regex engine spends quadratic time
   trying to match. Replaced with ``value.trimEnd()`` — O(n), same
   semantics, intent-revealing.

2. MEDIUM ``py/stack-trace-exposure`` —
   ``src/cuga/backend/server/manage_routes.py:979``
   The embedding-test endpoint's validation error path returned
   ``str(e)`` unscrubbed. ``KnowledgeConfig.validate()`` messages are
   operator-authored strings, but if a user accidentally pastes their
   API key into the model field the error could echo it back. Now uses
   the same ``_scrub_secret(str(e), api_key)`` the embed-test path two
   blocks below already uses — consistent secret-scrub posture across
   the endpoint.

ruff clean, 709 knowledge unit tests pass.
Comment thread src/cuga/backend/server/manage_routes.py Fixed
@dolev31 dolev31 marked this pull request as ready for review June 9, 2026 13:21
…ors, CI tests, scope cleanup)

Aggregate response to PR #297 review comments (Sami, Dec 2026).

## Blockers fixed

1. **Event-loop blocking in async ingest** — engine.py:2082
   ``fut.result(timeout=2.0)`` was a blocking call inside the asyncio
   thread; with ~10 progress emits the drain stalled other requests for
   up to ~20s. Replaced with
   ``await asyncio.wait_for(asyncio.wrap_future(fut), timeout=2.0)``.

2. **Event-loop blocking in PATCH /knowledge/config** — manage_routes.py
   ``live_engine.apply_knowledge_config(filtered)`` was called sync in
   an async route. On embedding-provider/model change it runs
   ``embed_query("test")`` (network round-trip) on the event loop.
   Wrapped in ``await asyncio.to_thread(...)``.

3. **Partial insert can orphan vectors on pgvector** — storage/adapter.py
   Each ``add_many`` sub-batch is its own pgvector transaction; if
   batch N+1 fails, batches 1..N stay committed and the next retry
   appends duplicates (because ``delete_by_source`` only runs when
   metadata exists, and first-time failures never write metadata).
   Added compensating delete loop inside ``_add_all``: tracks
   ``source_for_rollback`` on first chunk and, on any subsequent
   batch failure, runs the same ``list`` / ``delete`` / ``_fts_delete_by_source``
   loop as the adapter's public ``delete_by_source`` (inlined to stay
   on the current event loop — calling the public helper would
   ``_run_embedding_coro`` a nested loop and deadlock).

4. **New tests not in CI** — src/scripts/run_tests.sh
   The explicit allowlist of ~8 knowledge tests at lines 82-89 silently
   excluded ~30+ test modules added since. Switched to
   ``tests/unit/test_knowledge_*.py`` glob + named the few non-glob
   companions (session/chat/manage_publish/sdk_knowledge_config). All
   knowledge tests now run under ``./src/scripts/run_tests.sh
   unit_tests``.

## Also-noted bugs

5. **commit_knowledge_update omitted metric_changed from return** —
   engine.py:3197. ``manage_routes.py`` already exposes it in
   ``live_changes`` so the field was always null. Added
   ``"metric_changed": prepared.metric_changed`` to the return.

6. **get_settings() omitted new fields** — engine.py:3224. After live
   PATCH the SDK / UI / CLI couldn't read back ``docling_pdf_mode``,
   ``embedding_batch_size``, ``embedding_concurrency``,
   ``vector_insert_batch_size``, ``embedding_api_key`` (presence-only),
   ``rerank_*``, ``default_limit``, ``max_search_attempts``,
   ``max_ingest_workers``, ``docling_layout_engine``, ``gpu_required``.
   All now surfaced. Secret field exposed as ``embedding_api_key_set:
   bool`` only — the actual key never leaves the engine. All new
   fields read via ``getattr(self._config, name, default)`` so existing
   SimpleNamespace mocks in tests don't break.

## CodeQL hardening

7. **manage_routes.py:978-1009** — strengthened the
   py/stack-trace-exposure response: ``logger.exception`` logs the
   full server-side traceback; client receives only validator-authored
   messages on a known prefix whitelist (operators NEED these to fix
   their config), and falls back to a generic
   "Invalid knowledge embedding configuration." for any other
   exception shape so internals never leak.

## Scope cleanup (Sami: out-of-scope for this PR)

8. Removed eval harness scripts:
   - scripts/KNOWLEDGE_EVAL_METHODOLOGY.md
   - scripts/bench_knowledge_eval.py
   - scripts/bench_knowledge_eval_report.py
   - scripts/eval_queries.json
   - scripts/test_groq_connectivity.py
   - scripts/test_openrouter_connectivity.py
9. Removed personal gitignore entries for ``knowledge_ingest_perf.pptx``
   + ``KNOWLEDGE_INGEST_PERF_SLIDES.md``. Generic ``~$*.pptx`` /
   ``~$*.docx`` / ``~$*.xlsx`` lock-file patterns kept since those
   are repo-wide useful.
10. README — trimmed the 3×3 provider×Docling latency matrix +
    Top-5 Jaccard quality table + eval-harness section (~80 lines).
    Provider-choice table + reindex-on-dim-change warning kept (real
    user docs).

Tests: 709 knowledge unit tests pass. ruff check + ruff format clean.
Comment thread src/cuga/backend/server/manage_routes.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)
Dockerfile.gpu (1)

27-85: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Run the container as a non-root user.

The image never drops privileges (USER is missing), so runtime executes as root. Please add a dedicated unprivileged user and switch before CMD.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Dockerfile.gpu` around lines 27 - 85, Create and switch to a non-root user in
the Dockerfile: add a dedicated unprivileged user (e.g., "cuga") and group, set
a HOME, chown the app workspace and any files copied into /app/cuga_workspace
(references: the RUN mkdir -p /app/cuga_workspace and multiple COPY lines), and
add a USER cuga (and optionally WORKDIR/HOME) before the final CMD so the
container runs unprivileged rather than as root. Ensure the new user has
permission to read/write needed files and the PATH (ENV PATH) remains usable for
that user.

Source: Linters/SAST tools

src/cuga/backend/knowledge/auth.py (1)

164-191: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize thread_id once and reuse it for ownership + collection resolution.

Line 170 validates stripped content, but Lines 182, 187, and 191 still use the raw value. "abc" and " abc " currently map to different session/collection identities.

Suggested fix
     if scope == "session":
+        normalized_thread_id = (identity.thread_id or "").strip()
         # Whitespace-safe check. ``"   "`` is truthy in Python but it
         # sanitizes to ``"___"`` and would create a phantom collection
         # ``kb_sess____`` shared across every misconfigured caller.
         # Strip before the truthiness test so any "no real thread id"
         # input is rejected uniformly.
-        if not (identity.thread_id or "").strip():
+        if not normalized_thread_id:
             raise HTTPException(status_code=400, detail="X-Thread-ID required")
@@
                 provider.get_or_create_session(
-                    identity.thread_id,
+                    normalized_thread_id,
                     user_id=identity.user_id,
                     tenant_id=identity.tenant_id,
                 )
                 if not provider.check_session_access(
-                    identity.thread_id, identity.user_id, identity.tenant_id
+                    normalized_thread_id, identity.user_id, identity.tenant_id
                 ):
                     raise HTTPException(status_code=403, detail="access denied to session")
 
-        return f"kb_sess_{_sanitize(identity.thread_id)}"
+        return f"kb_sess_{_sanitize(normalized_thread_id)}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/knowledge/auth.py` around lines 164 - 191, Normalize and
reuse a stripped thread_id instead of calling identity.thread_id repeatedly:
create a local variable like stripped_thread_id = (identity.thread_id or
"").strip(), validate it and raise HTTPException if empty, then pass
stripped_thread_id to provider.get_or_create_session and
provider.check_session_access and call _sanitize(stripped_thread_id) for the
returned collection name; ensure all uses in this scope (the block handling
scope == "session" in auth.py) reference that single normalized variable.
src/cuga/backend/knowledge/config.py (1)

633-649: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Include Docling ingest knobs in vector_config_hash().

would_invalidate_vectors() currently misses docling_pdf_mode, docling_layout_engine, and docling_drop_page_chrome, even though those settings change the extracted text/chunks that get embedded and stored. A profile switch like standard (balanced + dry_run) to a stricter parsing profile can therefore keep the same hash and skip the re-ingest prompt while serving stale vectors.

Suggested fix
         key = (
             f"{sm}|{self.embedding_provider}|{self.embedding_model}|"
-            f"{self.chunk_size}|{self.chunk_overlap}|{self.metric_type}"
+            f"{self.chunk_size}|{self.chunk_overlap}|{self.metric_type}|"
+            f"{self.docling_pdf_mode}|{self.docling_layout_engine}|"
+            f"{self.docling_drop_page_chrome}"
         )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/knowledge/config.py` around lines 633 - 649,
vector_config_hash() currently omits docling ingest knobs so profile changes
(e.g., docling_pdf_mode, docling_layout_engine, docling_drop_page_chrome) won't
flip the vector hash; update the function (vector_config_hash) to include these
three settings from cuga.config.settings (accessed the same way as storage.mode)
in the key composition so the hash reflects any change to docling_pdf_mode,
docling_layout_engine, and docling_drop_page_chrome and triggers re-ingest
detection consistent with would_invalidate_vectors().
src/cuga/backend/knowledge/routes.py (1)

534-695: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

wait=false path does not actually return 202.

The handler documents async semantics for wait=false, but it never sets the HTTP status to 202, so clients still receive 200 and can mis-handle polling/state transitions.

Suggested fix
-from fastapi import APIRouter, Depends, File, Form, HTTPException, Query, UploadFile, Request
+from fastapi import APIRouter, Depends, File, Form, HTTPException, Query, UploadFile, Request, Response
@@
 async def upload_documents(
     request: Request,
+    response: Response,
     files: list[UploadFile] = File(...),
@@
-    if len(results) == 1:
+    if not wait:
+        response.status_code = 202
+    if len(results) == 1:
         return results[0]
     return {"tasks": results}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/knowledge/routes.py` around lines 534 - 695, The async
branch for wait=false currently builds result payloads but still returns normal
200; update upload_documents to return a 202 status for the fire-and-forget path
by returning a Response/JSONResponse with status_code=202 (preserve the same
body built from results/_enrich_task). Specifically, after assembling results in
the else branch (symbols: wait, results, _enrich_task) and at the final return
site, return JSONResponse(status_code=202, content=results[0] if len(results)==1
else {"tasks": results}) or equivalent so single-file and multi-file flows both
send HTTP 202 for wait=false. Ensure you import JSONResponse/Response from
fastapi.
src/frontend_workspaces/agentic_chat/src/KnowledgeSidePanel.tsx (1)

175-180: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don’t render the empty state until agent docs are authoritative.

agentDocs.length === 0 now means both “no docs” and “still resolving”. Right after first load or publish, this will briefly show “No agent documents” even though the badge is still loading. Pass through loadedOnce/isResolving from useAgentKnowledgeDocs and render a loading state until the hook settles.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/frontend_workspaces/agentic_chat/src/KnowledgeSidePanel.tsx` around lines
175 - 180, The component KnowledgeSidePanel currently shows "No agent documents"
whenever agentDocs.length === 0, which conflates "no docs" with "still
resolving"; update the rendering logic to use the hook's settled state (pass
through and read loadedOnce and/or isResolving from useAgentKnowledgeDocs) and
show a loading placeholder while isResolving or loadedOnce is false, only
rendering the empty state when loadedOnce is true and agentDocs.length === 0;
adjust the conditional around agentDocs/agentScopeEnabled accordingly so the UI
waits for the hook to settle before showing "No agent documents".
src/frontend_workspaces/frontend/src/ManagePage.tsx (1)

1028-1105: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Abort or sequence knowledge autosave requests.

This debounce only cancels the timer, not the previous PATCH once it has started. Since this endpoint updates the draft and the live engine, an older request can finish after a newer one and roll the server back to stale knowledgeConfig while the form still shows the latest edits. Guard this with an AbortController or a monotonic request id, and ignore/toast only the latest response.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/frontend_workspaces/frontend/src/ManagePage.tsx` around lines 1028 -
1105, The autosave useEffect (watching knowledgeConfig and effectiveAgentId) can
let overlapping PATCH requests finish out-of-order; modify it so each scheduled
save uses an AbortController or an incrementing sequence id tied to the effect:
create a controller/seq stored on skipDraftSaveRef (or a new ref) and pass
controller.signal to api.patchManageConfigDraftKnowledge (or include the seq in
the request/context), aborting the previous controller before starting a new
timeout and ignoring responses whose seq doesn't match the latest ref (or
treating aborted requests as no-ops); update error/toast handling to skip toasts
for aborted/stale responses while preserving current handling for real
network/errors and server 4xx responses.
tests/integration/test_client_adaptation_e2e.py (1)

255-283: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Tool-path test does not execute any tool call.

The test description says it validates KnowledgeClient.get_langchain_tools() end-to-end, but it never invokes a returned tool—only adapted_engine.search() and len(tools). This leaves wrapper-call regressions undetected.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/test_client_adaptation_e2e.py` around lines 255 - 283, The
test never invokes the langchain tool produced by
KnowledgeClient.get_langchain_tools(), so add an execution step in
test_client_consumes_search_via_langchain_tool that calls one of the returned
tools (from tools, e.g., tools[0]) with the same or similar query used earlier
("severance code 3 process") and awaits its result (use the tool's async
run/call API as appropriate), then assert the tool returned non-empty search
results; reference the test function name
test_client_consumes_search_via_langchain_tool and the method
KnowledgeClient.get_langchain_tools so you update the existing tools variable
rather than creating a new one.
tests/unit/test_knowledge_rag_scope_failure_fix.py (1)

784-824: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Close KnowledgeEngine in test_search_multi_failing_case_session_not_drowned.

This test creates a real engine and exits without shutdown, unlike nearby tests. Add try/finally with eng.shutdown() to avoid leaking resources into subsequent tests.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_knowledge_rag_scope_failure_fix.py` around lines 784 - 824,
The test test_search_multi_failing_case_session_not_drowned creates a real
KnowledgeEngine instance (eng = KnowledgeEngine(cfg)) but never shuts it down;
wrap the test body that uses eng (including the monkeypatch and asyncio.run
call) in a try/finally and call await eng.shutdown() or eng.shutdown() in the
finally block (matching other tests' shutdown pattern) so the KnowledgeEngine is
closed; reference the KnowledgeEngine constructor (KnowledgeEngine(cfg)), the
eng variable, and the shutdown method (eng.shutdown()) when adding the
try/finally.
🟠 Major comments (23)
scripts/bench_knowledge_eval.py-300-325 (1)

300-325: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve query alignment before Jaccard overlap computation.

No-hit and exception paths skip appending to top_chunk_ids_per_query, so zip(ref_chunks, cell_chunks) can compare different queries and produce incorrect overlap scores.

Proposed fix
         for q in queries:
             try:
                 hits = await agent.knowledge.search(q["query"], scope="agent", limit=5)
                 if not hits:
                     out.queries_failed += 1
+                    top_chunk_ids_per_query.append([])
                     continue
                 out.queries_answered += 1
@@
                 top_chunk_ids_per_query.append(ids)
             except Exception as q_err:
                 out.errors.append(f"search '{q['query'][:40]}…' failed: {type(q_err).__name__}: {q_err}")
                 out.queries_failed += 1
+                top_chunk_ids_per_query.append([])

Also applies to: 369-375

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/bench_knowledge_eval.py` around lines 300 - 325, The loop building
top_chunk_ids_per_query sometimes skips appending when there are no hits or an
exception, breaking alignment with queries and causing incorrect Jaccard
comparisons; ensure you always append a placeholder (e.g., empty list or
sentinel) to top_chunk_ids_per_query for every query even when hits is empty or
in the except block, while still incrementing out.queries_failed and recording
out.errors as before; make the same change in the similar block around the other
occurrence (the block referenced at lines 369-375) so ref_chunks and cell_chunks
remain index-aligned for zip-based overlap computation.
scripts/bench_knowledge_query_compare.py-61-77 (1)

61-77: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Per-cell temp directories are never cleaned up.

Both ingest input temp dirs and per-cell persist dirs remain on disk after completion.

Proposed fix
-async def _ingest_once(cell, pdf):
+async def _ingest_once(cell, pdf):
@@
-    return agent, cfg
+    return agent, cfg, persist, tmpdir
@@
-        agent, _ = await _ingest_once(cell, pdf)
+        agent, _, persist_dir, input_tmpdir = await _ingest_once(cell, pdf)
         try:
             results = await _run_queries(agent, queries)
             results_per_cell.append(results)
         finally:
             try:
                 await agent.aclose()
             except Exception:
                 pass
+            shutil.rmtree(persist_dir, ignore_errors=True)
+            shutil.rmtree(input_tmpdir, ignore_errors=True)

Also applies to: 153-161

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/bench_knowledge_query_compare.py` around lines 61 - 77, The per-cell
temporary directories created as persist = Path(tempfile.mkdtemp(...)) and
tmpdir = Path(tempfile.mkdtemp(...)) are never removed; wrap the per-cell
processing that uses KnowledgeConfig, dest, and the copied PDF in a try/finally
(or context manager) and call shutil.rmtree(persist, ignore_errors=True) and
shutil.rmtree(tmpdir, ignore_errors=True) in the finally block to ensure both
the persist_dir and input temp directory are deleted after each cell's work
completes (also apply same cleanup for the other similar block around the
153-161 code using the same symbols).
scripts/bench_knowledge_eval.py-185-194 (1)

185-194: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Temporary directories are leaked per run/cell.

_make_unique_pdf() and per-cell persist_dir allocate mkdtemp() directories but never remove them. Repeated benchmarks will steadily consume disk in /tmp.

Proposed fix
 async def _run_cell(...):
@@
-    persist_dir = Path(tempfile.mkdtemp(prefix=f"cuga-eval-{provider.replace('/', '-')}-"))
+    persist_dir = Path(tempfile.mkdtemp(prefix=f"cuga-eval-{provider.replace('/', '-')}-"))
@@
-    for i in range(runs):
-        pdf_unique = _make_unique_pdf(pdf)
+    for i in range(runs):
+        pdf_unique = _make_unique_pdf(pdf)
         cell_cfg = _build_cell_config(provider, model, docling_mode, persist_dir)
         agent = CugaAgent(enable_knowledge=True, knowledge_config=cell_cfg)
         try:
@@
         finally:
             try:
                 await agent.aclose()
             except Exception:
                 pass
+            import shutil
+            shutil.rmtree(pdf_unique.parent, ignore_errors=True)
@@
-    return out
+    import shutil
+    shutil.rmtree(persist_dir, ignore_errors=True)
+    return out

Also applies to: 209-210

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/bench_knowledge_eval.py` around lines 185 - 194, _temp directories
created by mkdtemp in _make_unique_pdf and where persist_dir is set are never
cleaned up, causing leaked /tmp files across runs; modify _make_unique_pdf and
the code that creates per-cell persist_dir to either use
tempfile.TemporaryDirectory() (or register cleanup via context managers) or
explicitly remove the created directories when they are no longer needed (e.g.,
call shutil.rmtree on the tmp path after ingestion/processing completes or on
process exit). Locate the _make_unique_pdf function and the places assigning
persist_dir, replace mkdtemp usage with TemporaryDirectory or add deterministic
cleanup logic (ensure you reference _make_unique_pdf, persist_dir,
tempfile.mkdtemp, and shutil.rmtree in your changes).
scripts/bench_knowledge_ingest.py-123-137 (1)

123-137: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

chunk_count_median is read from the wrong location.

chunk_count is consumed from the record root elsewhere, but summary reads it from timings, so the median can be wrong/empty.

Proposed fix
     summary = {
         "label": label,
         "runs": len(records),
         "stack": records[-1].get("stack"),
         "embedding_provider": records[-1].get("embedding_provider"),
-        "chunk_count_median": _med("chunk_count"),
+        "chunk_count_median": (
+            round(
+                statistics.median(
+                    float(r["chunk_count"])
+                    for r in records
+                    if isinstance(r.get("chunk_count"), (int, float))
+                ),
+                3,
+            )
+            if any(isinstance(r.get("chunk_count"), (int, float)) for r in records)
+            else None
+        ),
         "parse_s_median": _med("parse_s"),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/bench_knowledge_ingest.py` around lines 123 - 137, summary currently
computes "chunk_count_median" by calling _med("chunk_count"), which looks up
chunk_count inside each record's "timings" dict, but chunk_count is stored at
the record root elsewhere; update the summary creation to compute
chunk_count_median from the root values instead of _med("chunk_count"). Iterate
records, extract each record.get("chunk_count") (or skip non-numeric), collect
floats, and compute median (rounded to 3 decimals) similar to _med, or add a
small helper _med_root(name) that reads record.get(name) and use that for
"chunk_count_median" while keeping parse_s_median and others that are actually
in timings unchanged.
scripts/bench_knowledge_eval_report.py-102-111 (1)

102-111: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Decision matrix can recommend failed cells.

Selection currently runs across all cells, including failed ingest/search results. This can output unusable recommendations.

Proposed fix
-    fastest = min(cells, key=lambda c: c.get("median_total_s") or 1e9, default=None)
+    successful = [
+        c for c in cells
+        if c.get("ingest_status") == "success" and c.get("search_status") != "failed"
+    ]
+    fastest = min(successful, key=lambda c: c.get("median_total_s") or 1e9, default=None)
@@
-        (c for c in cells if (c.get("cost_usd_estimated") or 0) > 0),
+        (c for c in successful if (c.get("cost_usd_estimated") or 0) > 0),
@@
-    free_options = [c for c in cells if (c.get("cost_usd_estimated") or 0) == 0]
+    free_options = [c for c in successful if (c.get("cost_usd_estimated") or 0) == 0]
@@
-    high_quality = [c for c in cells if (c.get("topk_overlap_vs_reference") or 0) >= 0.8]
+    high_quality = [c for c in successful if (c.get("topk_overlap_vs_reference") or 0) >= 0.8]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/bench_knowledge_eval_report.py` around lines 102 - 111, The selection
logic currently computes fastest/cheapest/free/high_quality across all entries
in cells, which can include failed ingest/search results; before the existing
calculations (variables fastest, cheapest_nonzero, free_options, free_fastest,
high_quality, high_quality_fastest) filter out failed rows by creating a
sanitized list, e.g. successful_cells = [c for c in cells if not
(c.get("failed") or c.get("error")) and (c.get("median_total_s") is not None or
c.get("cost_usd_estimated") is not None or c.get("topk_overlap_vs_reference") is
not None)]; then replace uses of cells with successful_cells so the min() and
list comprehensions only consider valid results (ensure all references to cells
in that block are updated to successful_cells).
src/cuga/backend/cuga_graph/nodes/cuga_lite/adapter/prepare_node.py-538-543 (1)

538-543: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid cross-agent fallback when per-agent draft config map exists.

On Line 541, the fallback to draft_knowledge_config still runs even when draft_knowledge_configs exists but has no entry for this agent. That can apply another agent’s draft search config in multi-tenant draft runs.

Suggested fix
-                                _cfgs = getattr(_das, "draft_knowledge_configs", None)
-                                if isinstance(_cfgs, dict):
-                                    _draft_kc = _cfgs.get(_base_aid)
-                                if _draft_kc is None:
-                                    # Legacy singular attribute fallback.
-                                    _draft_kc = getattr(_das, "draft_knowledge_config", None)
+                                _cfgs = getattr(_das, "draft_knowledge_configs", None)
+                                if isinstance(_cfgs, dict):
+                                    _draft_kc = _cfgs.get(_base_aid)
+                                else:
+                                    # Legacy singular attribute fallback only when per-agent map is unavailable.
+                                    _draft_kc = getattr(_das, "draft_knowledge_config", None)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/cuga_graph/nodes/cuga_lite/adapter/prepare_node.py` around
lines 538 - 543, Current code falls back to draft_knowledge_config even when
draft_knowledge_configs exists but lacks an entry for this agent, risking
cross-agent config leakage; change the logic around
draft_knowledge_configs/draft_knowledge_config so the singular fallback runs
only when draft_knowledge_configs is absent or not a dict. Specifically, in the
block referencing _das, draft_knowledge_configs, _draft_kc and _base_aid, first
check if hasattr(_das, "draft_knowledge_configs") and isinstance(_cfgs, dict)
and use only _cfgs.get(_base_aid); do NOT call getattr(_das,
"draft_knowledge_config", None) when draft_knowledge_configs is a dict (even if
.get returns None); only use the legacy singular draft_knowledge_config when
draft_knowledge_configs is missing or not a dict.
src/cuga/backend/knowledge/client.py-272-278 (1)

272-278: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't log raw user queries at INFO.

This fallback path logs the first 60 characters of every search query. Knowledge searches are likely to contain user data, document excerpts, or secrets, and INFO logs are usually retained centrally. Prefer scope/thread metadata only, or log a short hash if you need correlation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/knowledge/client.py` around lines 272 - 278, The current
info log call using _stdlib_logging.getLogger("cuga.knowledge").info includes
the raw query (query[:60]); remove raw user content and instead log only
non-sensitive metadata and a short, stable correlation token: compute a short
hash (e.g., SHA256 or SHA1 and take first 8-12 hex chars) of the full query and
log that token along with thread_id and other metadata, or omit the query
entirely; update the call that references (thread_id or "") and (query or "") to
use the hash variable (or drop the query field) so no plaintext query data is
written to INFO logs. Ensure hashing is performed deterministically before the
log statement and reference the existing logger call
_stdlib_logging.getLogger("cuga.knowledge").info to locate and replace the code.
src/cuga/backend/knowledge/client.py-133-152 (1)

133-152: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Use a collision-free collection key, not lossy _sanitize().

_sanitize() maps every non-[A-Za-z0-9_] character to _, so distinct ids like thread-1, thread 1, and thread_1 all resolve to the same collection. That can merge documents across sessions or agents and defeats the isolation guarantees this PR is adding.

A reversible encoding or a stable hash-derived key is safer here than character replacement.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/knowledge/client.py` around lines 133 - 152, The current
_sanitize(...) is lossy and causes collisions (e.g., "thread-1" == "thread_1"),
so replace the lossy behavior with a stable, collision-resistant encoding when
building collection names: implement a new helper (or change _sanitize) to
return a short stable hash (e.g., sha256 hex or urlsafe base64 of the original
id) and use that in _agent_collection() (for self._default_agent_id) and in
_resolve_collection() (for session thread_id) instead of character-replacement;
ensure the returned token contains only safe filename/collection characters and
preserve the existing _agent_collection_hash behavior so final names remain
deterministic and unique (reference: _sanitize, _agent_collection,
_resolve_collection, _agent_collection_hash, _default_agent_id).
src/cuga/sdk.py-1670-1684 (1)

1670-1684: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

knowledge_config override is only partially applied (injection gate still ignores it).

The new override is resolved for engine creation, but auto-injection enablement is still driven by settings (when enable_knowledge is None). This can inject tools when override says disabled—or skip them when override says enabled.

Suggested fix
+    def _resolve_knowledge_config(self):
+        from cuga.backend.knowledge.config import KnowledgeConfig
+        from cuga.config import settings
+
+        override = self._knowledge_config_override
+        if override is None:
+            return KnowledgeConfig.from_settings(settings)
+        if isinstance(override, KnowledgeConfig):
+            override.validate()
+            return override
+        return KnowledgeConfig.coerce_and_validate(dict(override))
@@
-                else:
-                    kb_config = KnowledgeConfig.from_settings(settings)
-                    kb_enabled = kb_config.enabled
+                else:
+                    kb_enabled = self._resolve_knowledge_config().enabled
@@
-            # Resolve config: explicit override > settings.toml fallback.
-            override = self._knowledge_config_override
-            if override is None:
-                config = KnowledgeConfig.from_settings(settings)
-            elif isinstance(override, KnowledgeConfig):
-                ...
-            else:
-                config = KnowledgeConfig.coerce_and_validate(dict(override))
+            config = self._resolve_knowledge_config()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/sdk.py` around lines 1670 - 1684, The override in
_knowledge_config_override is used to build the KnowledgeEngine but the
auto-injection decision still reads from settings, causing mismatch; after you
resolve "config" (via KnowledgeConfig.from_settings, validate, or
coerce_and_validate) use config.enable_knowledge (or the equivalent property on
KnowledgeConfig) as the source for the injection gate instead of reading
enable_knowledge from settings when an override was provided, and update the
injection-enablement check (where the injection gate is evaluated) to prefer
config.enable_knowledge so the override consistently controls whether
auto-injection runs.
src/cuga/backend/server/manage_routes.py-1313-1335 (1)

1313-1335: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't block the PATCH request on full auto-reindex work.

When dim_changed is true, this route now awaits live_engine.reindex() for every collection before returning. On the large corpora this PR targets, a settings save can now scale with total corpus size and easily hit proxy/request timeouts after the draft has already been persisted.

Queue the reindex as background work and return a pending status/job handle instead of doing corpus-wide work inline.

src/cuga/backend/server/manage_routes.py-1167-1175 (1)

1167-1175: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Classify auth failures from the merged config, not the PATCH delta.

This branch inspects knowledge.get(...), which only contains fields sent in the current PATCH. If the client changes just embedding_model, filtered can still contain an existing embedding_provider/embedding_api_key, but _provider and _user_supplied_key are computed as empty here. That makes the soft-fail vs hard-fail path take the wrong branch for partial updates.

Suggested fix
-                _user_supplied_key = bool((knowledge.get("embedding_api_key") or "").strip())
-                _provider = (knowledge.get("embedding_provider") or "").lower()
+                _user_supplied_key = bool((filtered.get("embedding_api_key") or "").strip())
+                _provider = str(filtered.get("embedding_provider") or "").lower()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/server/manage_routes.py` around lines 1167 - 1175, The
auth-error classification is reading provider/key from the PATCH delta
`knowledge` (which may be partial) instead of the merged configuration; update
the logic so `_provider`, `_user_supplied_key`, and `_is_credentialed` are
computed from the merged config (e.g., use the existing merged/filtered dict
already used elsewhere rather than `knowledge`), keeping the same
stripping/lowering and `_looks_like_auth` checks, so the soft-fail vs hard-fail
branch is based on the full current settings.
src/cuga/backend/server/main.py-718-724 (1)

718-724: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Published knowledge config is still skipped after restart when settings boot with knowledge disabled.

This block only restores _saved_knowledge when app_state.knowledge_engine already exists. If settings.toml starts with knowledge.enabled=false but the published config had knowledge enabled, startup leaves _live_engine as None and silently skips the restore path, so the published provider/model/layout never comes back after a restart.

Suggested fix
-            _saved_knowledge = (_startup_config or {}).get("knowledge") or {}
-            _live_engine = getattr(app_state, "knowledge_engine", None)
-            if _saved_knowledge and _live_engine is not None:
+            _saved_knowledge = (_startup_config or {}).get("knowledge") or {}
+            _live_engine = getattr(app_state, "knowledge_engine", None)
+            if _saved_knowledge and _live_engine is None:
+                from cuga.backend.knowledge.config import KnowledgeConfig as _KC
+
+                _saved_cfg = _KC.coerce_and_validate(
+                    {k: v for k, v in _saved_knowledge.items() if not k.startswith("_")}
+                )
+                if _saved_cfg.enabled:
+                    await initialize_knowledge_engine(app_state, _saved_cfg)
+                    _live_engine = getattr(app_state, "knowledge_engine", None)
+            if _saved_knowledge and _live_engine is not None:
                 # Strip noise that's not part of the public config shape.
                 _saved_kb = {k: v for k, v in _saved_knowledge.items() if not k.startswith("_")}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/server/main.py` around lines 718 - 724, The restore logic
only applies saved knowledge when app_state.knowledge_engine already exists, so
if settings boot with knowledge disabled the published config is never restored;
change the block handling _saved_knowledge so that if _saved_kb exists and
_live_engine is None you initialize or enable the knowledge engine using the
same startup/factory path used elsewhere (so app_state.knowledge_engine becomes
a real engine) and then call _live_engine.apply_knowledge_config(_saved_kb);
keep the existing stripping of keys (the _saved_kb creation) and reuse
apply_knowledge_config to apply the restored config.
src/cuga/cli/main.py-201-203 (1)

201-203: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Unify knowledge CLI base URL resolution

config-*/snapshot-* use CUGA_API_BASE (default :8005), but adaptation-*/glossary-*/doctor use CUGA_HOST+CUGA_PORT (default :8000) and ignore CUGA_API_BASE. This split makes commands in the same cuga knowledge group target different backends by default.

Suggested fix
 def _cuga_server_base_url() -> str:
-    port = os.environ.get("CUGA_PORT") or "8000"
-    host = os.environ.get("CUGA_HOST") or "127.0.0.1"
-    return f"http://{host}:{port}"
+    api_base = os.environ.get("CUGA_API_BASE")
+    if api_base:
+        return api_base.rstrip("/")
+    port = os.environ.get("CUGA_PORT") or "8000"
+    host = os.environ.get("CUGA_HOST") or "127.0.0.1"
+    return f"http://{host}:{port}"

Also applies to: 441-455

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/cli/main.py` around lines 201 - 203, Unify CLI backend resolution by
making the existing helper _knowledge_api_base() the single source of truth:
change it to first read CUGA_API_BASE (default "http://127.0.0.1:8005"), and if
unset fall back to composing from CUGA_HOST and CUGA_PORT (default
"http://127.0.0.1" and "8000") and always rstrip("/"); then update all places
that currently build the host/port separately (the adaptation-*, glossary-*,
doctor command implementations) to call _knowledge_api_base() instead of using
CUGA_HOST/CUGA_PORT so every cuga knowledge command targets the same backend by
default. Ensure callers use the returned URL directly (no further port/host
assembly).
src/cuga/configurations/knowledge/knowledge_profiles/speed.toml-63-66 (1)

63-66: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Speed addendum conflicts with the mandatory scope-safety contract

Line 65 directs “most-likely scope first; fallback if empty,” which can miss relevant docs when both scope candidates exist. That conflicts with the contract’s anti-silent-miss rule in this PR.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/configurations/knowledge/knowledge_profiles/speed.toml` around lines
63 - 66, The "Speed addendum" text in speed.toml currently instructs
"most-likely scope first; only fall back to the other if the first comes up
empty", which violates the scope-safety contract by allowing silent misses when
both scopes contain relevant docs; update the guidance so the system either (a)
always query both scopes and merge/compare results, or (b) query the most-likely
scope first but still query the other scope in parallel or as a verification
step and include any higher-relevance hits, and explicitly state that fallback
must not be the only source when both scopes exist; change the phrasing in
speed.toml accordingly to mandate non-silent-miss behavior and reference the
scope-safety contract.
src/cuga/configurations/knowledge/knowledge_instructions.md-26-26 (1)

26-26: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Resolve contradictory scope-miss rules

Line 26 says session misses auto-fallback to scope="all" and explicitly says not to re-issue with a different scope, but Line 41 says session miss → scope="agent". These rules conflict and will produce inconsistent tool behavior.

Also applies to: 41-41

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/configurations/knowledge/knowledge_instructions.md` at line 26, The
document has contradictory fallback rules for session misses: one line says
session auto-falls back to scope="all" and forbids re-issuing, while another
says session miss → scope="agent"; make them consistent by choosing the intended
behavior (prefer keeping scope="all" as the canonical automatic fallback) and
update the rule that mentions scope="agent" to instead state the same behavior
as the scope="session" rule (i.e., engine auto-retries as scope="all", sets
retrieval.fallback_from="session", and callers must treat fallback results as
authoritative — do NOT re-issue with a different scope); ensure both occurrences
reference scope="session" → fallback to scope="all" and remove or reword any
conflicting wording about scope="agent" so the file no longer contains
contradictory rules.
src/cuga/cli/main.py-1660-1740 (1)

1660-1740: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not log raw embedding API keys

Line 1739 logs _value for every override, including DYNACONF_KNOWLEDGE__EMBEDDINGS__API_KEY (Line 1664). That leaks credentials to logs/console history.

Suggested fix
     for _env_name, _value in _embedding_flag_env:
         if _value is not None:
             os.environ[_env_name] = _value
-            logger.info("Embedding override: %s=%s", _env_name.split("__", 1)[-1], _value)
+            _masked = _value
+            if _env_name.endswith("__API_KEY") or "SECRET" in _env_name:
+                _masked = "***REDACTED***"
+            logger.info("Embedding override: %s=%s", _env_name.split("__", 1)[-1], _masked)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/cli/main.py` around lines 1660 - 1740, The loop that sets os.environ
for entries in _embedding_flag_env currently logs the raw _value via
logger.info, which can leak secrets (notably
DYNACONF_KNOWLEDGE__EMBEDDINGS__API_KEY). Update the loop that iterates over
(_env_name, _value) (and still sets os.environ[_env_name] = _value) to redact
sensitive values before logging: detect API keys by matching substrings like
"API_KEY" or the exact "DYNACONF_KNOWLEDGE__EMBEDDINGS__API_KEY" and replace the
value with a redacted placeholder (e.g., "<redacted>" or a masked version
showing only last 4 chars); continue to log non-sensitive entries unchanged via
logger.info("Embedding override: %s=%s", _env_name.split("__", 1)[-1],
logged_value).
src/cuga/backend/storage/embedding/local.py-108-117 (1)

108-117: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Rollback before fallback replay to avoid duplicate writes

Line 111 catches OperationalError and immediately replays all rows in Line 115-116. If executemany() applied a prefix before failing, replaying without resetting transaction state can duplicate already-written rows.

Suggested fix
         try:
             try:
                 # Fast path: single C-loop multi-row insert.
                 conn.executemany(sql, rows)
             except sqlite3.OperationalError:
+                # Reset any partial work from executemany before replaying.
+                try:
+                    conn.rollback()
+                except sqlite3.Error:
+                    pass
                 # sqlite-vec's vec0 virtual table is suspect for executemany on
                 # some builds; fall back to a loop inside the same transaction
                 # so we still drop the per-row commit fsync cost from N to 1.
                 for row in rows:
                     conn.execute(sql, row)
             conn.commit()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/storage/embedding/local.py` around lines 108 - 117, The
except block that catches sqlite3.OperationalError should rollback the failed
transaction before replaying the individual inserts to avoid duplicating any
prefix that may have been written by conn.executemany; update the handler around
conn.executemany to call conn.rollback() first, then replay rows with
conn.execute in the loop, and finally call conn.commit() (keeping the existing
conn.commit() after the try/except). Refer to conn.executemany, conn.rollback,
conn.execute, and conn.commit to find and modify the logic.
src/frontend_workspaces/frontend/src/knowledge/useAgentKnowledgeDocs.ts-209-228 (1)

209-228: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make the sticky retry cancellable/stale-aware.

After the timer fires, the retry fetch is no longer tied to the current refresh cycle. If a newer refresh() completes first, this older retry can still hit acceptResult() and overwrite the newer documents/cache with stale data. Reuse an AbortController or attach a request generation check before applying retry results.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/frontend_workspaces/frontend/src/knowledge/useAgentKnowledgeDocs.ts`
around lines 209 - 228, The sticky retry currently schedules a fetch that can
apply stale results via acceptResult; make it cancellable/stale-aware by tying
the retry to the same cancellation/generation mechanism used by refresh(): when
scheduling the retry (where retryAttemptedRef and stickyRetryTimerRef are used)
create or reuse an AbortController (or capture the current refresh generation
id) and pass it into listKnowledgeDocuments, then before calling
acceptResult(retryData.documents ?? []) verify the controller is not aborted (or
the generation id still matches the latest) and only then apply results; also
ensure clearing/aborting the pending controller/timer when a new refresh()
starts so the old retry cannot overwrite newer data and setIsResolving is
handled consistently.
tests/unit/test_knowledge_cli_sdk_parity.py-256-271 (1)

256-271: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

“CLI flag → env var mapping” test does not validate CLI mapping.

This test never exercises CLI/env wiring; it only rebuilds a settings dict and calls KnowledgeConfig.from_settings, which is already covered above. A CLI mapping typo could still pass here.

tests/unit/test_knowledge_routes.py-72-79 (1)

72-79: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Queued upload contract should assert 202, not 200.

For wait=false, this path is documented as an accepted/queued response. Locking 200 here risks preserving the wrong API contract and diverging from the intended async semantics.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_knowledge_routes.py` around lines 72 - 79, The test is
asserting the wrong HTTP status for a queued upload: update the assertion in
tests/unit/test_knowledge_routes.py (the POST to "/api/knowledge/documents" with
data {"scope":"agent","replace_duplicates":"true","wait":"false"}) to expect 202
instead of 200; locate the assertion on response.status_code in this test and
change it to assert response.status_code == 202 so the test matches the API's
async/queued contract.
tests/unit/test_knowledge_publish_import_pipeline.py-131-144 (1)

131-144: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

This secret-leak regression test can pass without exercising save_config.

Because assertions are conditional on if "saved" in captured, the test succeeds even when the route never reaches disk-save logic. That creates a false green on a security-critical contract.

Suggested fix
     client.post("/api/manage/config", params={"agent_id": "pub-strip-test"}, json=payload)
-    # Don't assert on status; the publish flow has many side-effects we mock
-    # out. Just verify what hit the disk save call.
-    if "saved" in captured:
-        saved_kb = captured["saved"].get("knowledge", {})
-        # KEY must be stripped on disk
-        assert saved_kb.get("embedding_api_key", "") == "", (
-            f"API key leaked to published config-store: {saved_kb.get('embedding_api_key')!r}"
-        )
-        # Non-secret fields must survive
-        assert saved_kb.get("embedding_provider") == "openai"
-        assert saved_kb.get("embedding_model") == "Azure/text-embedding-3-small-1"
-        assert saved_kb.get("embedding_base_url") == "https://ete-litellm.bx.cloud9.ibm.com/v1"
+    assert "saved" in captured, "save_config was not invoked"
+    saved_kb = captured["saved"].get("knowledge", {})
+    assert saved_kb.get("embedding_api_key", "") == "", (
+        f"API key leaked to published config-store: {saved_kb.get('embedding_api_key')!r}"
+    )
+    assert saved_kb.get("embedding_provider") == "openai"
+    assert saved_kb.get("embedding_model") == "Azure/text-embedding-3-small-1"
+    assert saved_kb.get("embedding_base_url") == "https://ete-litellm.bx.cloud9.ibm.com/v1"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_knowledge_publish_import_pipeline.py` around lines 131 - 144,
The test currently guards assertions with a conditional "if 'saved' in captured"
which lets the test pass when save_config was never invoked; change this to an
explicit assertion that ensures the save happened before checking contents—for
example assert "saved" in captured (or assert "saved" in captured, f"expected
save_config to be called after client.post('/api/manage/config')") and then
proceed to set saved_kb = captured["saved"].get("knowledge", {}) and assert on
embedding_api_key, embedding_provider, embedding_model, and embedding_base_url
to enforce the secret-strip contract; update any variable references (captured,
saved_kb) accordingly so the test fails when disk save isn't exercised.
tests/unit/test_knowledge_use_gpu_surfaces.py-194-197 (1)

194-197: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid eager embedder initialization in this unit test.

eng._ensure_embeddings() can pull in provider/model initialization and make this test flaky/non-hermetic. This test only needs to validate use_gpu config propagation before and after PATCH.

Suggested change
 def test_patch_draft_knowledge_use_gpu_false_applies_to_live_engine():
     """User toggles use_gpu off in the UI → live engine rebinds inference backend."""
     eng = KnowledgeEngine(_cfg(use_gpu=True))
-    eng._ensure_embeddings()
     assert eng._config.use_gpu is True
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_knowledge_use_gpu_surfaces.py` around lines 194 - 197, The
test is eagerly initializing embedders by calling eng._ensure_embeddings(),
making it flaky; remove that call and only assert config propagation on the
KnowledgeEngine instance: create eng = KnowledgeEngine(_cfg(use_gpu=True)),
assert eng._config.use_gpu is True, then apply the PATCH operation the test
intends and assert the config value after PATCH. Update references to
_ensure_embeddings() to avoid triggering provider/model initialization and keep
assertions limited to eng._config.use_gpu before and after the PATCH.
tests/unit/test_knowledge_timings.py-267-273 (1)

267-273: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Order-preservation test does not assert the document identity strongly enough.

This loop claims to verify order, but it only checks filename == "x.pdf". If embeddings are reordered, the test can still pass because all docs share that filename. Assert page (and/or exact text) for the top hit.

Suggested change
     for i in range(0, 50, 7):
         hits = store.search(f"text {i}", k=3)
         assert hits, f"no hits for 'text {i}'"
         # Top hit's filename + page must be the original doc i+1.
         top = hits[0][0]
         assert top.metadata.get("filename") == "x.pdf"
+        assert top.metadata.get("page") == i + 1, (
+            f"expected top hit page {i + 1}, got {top.metadata.get('page')!r}"
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_knowledge_timings.py` around lines 267 - 273, The test's loop
only asserts filename == "x.pdf" which is insufficient; update the assertion to
also check the document identity by asserting the page (and/or exact text) of
the top hit matches the expected document for each i. Concretely, after
obtaining top = hits[0][0], add an assertion like top.metadata.get("page") ==
i+1 (or assert the hit text equals the expected "text {i}") so the test verifies
order-preservation, using the existing variables hits and top.
🧹 Nitpick comments (3)
tests/unit/test_knowledge_config_ui_design.py (1)

458-462: ⚡ Quick win

Replace the tautological warmup assertion with a real contract check.

Line 461 always passes and won’t catch regressions. Assert a concrete condition (e.g., type/expected value) instead.

Suggested change
-    assert result["embeddings_initialized"] is False or result["embeddings_initialized"] is True
+    assert isinstance(result.get("embeddings_initialized"), bool)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_knowledge_config_ui_design.py` around lines 458 - 462, The
test uses a tautological assertion for result["embeddings_initialized"] which
always passes; replace it with a concrete contract check by asserting the field
is a boolean (e.g., use isinstance(result["embeddings_initialized"], bool)) or
assert a specific expected value depending on the warmup contract from
eng.warmup(); update the assertion referring to result["embeddings_initialized"]
(produced by eng.warmup()) to validate type or expected value rather than "is
False or is True".
tests/unit/test_knowledge_preflight_soft_fail.py (1)

60-156: ⚡ Quick win

Use fixture-managed engine teardown to prevent resource leakage across tests.

Each test builds a KnowledgeEngine but never closes it. A small fixture that yields the engine and calls engine.shutdown() in teardown will make this suite more stable under parallel/long runs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_knowledge_preflight_soft_fail.py` around lines 60 - 156, The
tests instantiate KnowledgeEngine directly and never call its shutdown, causing
leaked resources; create a pytest fixture that yields a KnowledgeEngine
(constructed via _cfg()) and calls engine.shutdown() in the teardown, then
update tests to use that fixture instead of directly calling eng =
KnowledgeEngine(_cfg()) (e.g., accept an engine fixture and pass it to
_build_app(engine)); ensure all tests that create KnowledgeEngine are switched
to use the fixture so shutdown() runs after each test.
tests/unit/test_knowledge_pgvector_metadata.py (1)

74-78: Guard fake SQL UPDATE parsing from silent truncation

zip(cols, params[:-1]) would silently drop assignments if the fake interpreter parses a different number of SET columns than the parameters provide. PostgresKnowledgeMetadata.update_task currently derives set_clause and values from the same cols list, so this mismatch shouldn’t occur today—but the assertion makes future SQL/parse changes fail loudly.

Suggested fix
             cols = [seg.split("=")[0].strip() for seg in set_section.strip().rstrip(",").split(",")]
             task_id = params[-1]
             if task_id in self._rows_by_task_id:
+                assert len(cols) == len(params[:-1]), (
+                    f"UPDATE shape mismatch: cols={len(cols)} params={len(params[:-1])}"
+                )
                 for col, val in zip(cols, params[:-1]):
                     self._rows_by_task_id[task_id][col] = val
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_knowledge_pgvector_metadata.py` around lines 74 - 78, Add a
guard to ensure the number of parsed SET columns matches the number of provided
parameter values before zipping to avoid silent truncation: in the code path
where cols and params are derived (used by PostgresKnowledgeMetadata.update_task
and referencing cols, params, task_id, and self._rows_by_task_id), assert or
raise if len(cols) != len(params) - 1 (since the last param is task_id) so
mismatched parses fail loudly rather than dropping assignments; perform this
check immediately before the for col, val in zip(cols, params[:-1]) loop.

Source: Linters/SAST tools

Comment thread Dockerfile.gpu
Comment on lines +64 to +67
RUN uv sync --extra gpu && \
uv pip uninstall onnxruntime 2>/dev/null || true && \
uv pip install --reinstall torch torchvision \
--index-url https://download.pytorch.org/whl/cu121

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show Dockerfile.gpu around the referenced lines
if [ ! -f Dockerfile.gpu ]; then
  echo "ERROR: Dockerfile.gpu not found at repo root"
  exit 1
fi

echo "=== Dockerfile.gpu: lines 1-120 ==="
nl -ba Dockerfile.gpu | sed -n '1,120p'

# Also search for any SHELL directive that could change command parsing
echo "=== SHELL directives (if any) ==="
rg -n '^\s*SHELL\b' Dockerfile.gpu || true

# Show the exact RUN line(s) containing 'uv sync' and 'onnxruntime'
echo "=== Matching RUN lines ==="
rg -n 'uv sync|onnxruntime|uv pip uninstall|uv pip install --reinstall torch' Dockerfile.gpu

Repository: cuga-project/cuga-agent

Length of output: 165


🏁 Script executed:

#!/bin/bash
set -euo pipefail

if [ ! -f Dockerfile.gpu ]; then
  echo "ERROR: Dockerfile.gpu not found at repo root"
  exit 1
fi

echo "=== Dockerfile.gpu: lines 50-90 (context around referenced area) ==="
awk 'NR>=50 && NR<=90 {printf "%d:%s\n", NR, $0}' Dockerfile.gpu

echo "=== SHELL directives (if any) ==="
rg -n '^\s*SHELL\b' Dockerfile.gpu || true

echo "=== Matching RUN lines ==="
rg -n 'uv sync|onnxruntime|uv pip uninstall|uv pip install --reinstall torch' Dockerfile.gpu || true

echo "=== Show full RUN instruction(s) that contain uv sync (handles line continuations) ==="
# Print from first match through the line where the RUN instruction ends (heuristic: until a line not ending with '\')
awk '
  $0 ~ /uv sync/ { inrun=1 }
  {
    if (inrun) { print; }
    if (inrun && $0 ~ /\\$/) next
    if (inrun && $0 !~ /\\$/) { inrun=0 }
  }
' Dockerfile.gpu | sed -n '1,200p'

Repository: cuga-project/cuga-agent

Length of output: 2980


Fix RUN chaining so uv sync failures don’t get masked (torch install still runs today).
In Dockerfile.gpu lines 64-67, the expression uv sync ... && ... || true && ... short-circuits via || true, causing uv pip install --reinstall torch torchvision to run even when uv sync fails.

Suggested fix
-RUN uv sync --extra gpu && \
-    uv pip uninstall onnxruntime 2>/dev/null || true && \
-    uv pip install --reinstall torch torchvision \
+RUN uv sync --extra gpu && \
+    (uv pip uninstall onnxruntime 2>/dev/null || true) && \
+    uv pip install --reinstall torch torchvision \
         --index-url https://download.pytorch.org/whl/cu121
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
RUN uv sync --extra gpu && \
uv pip uninstall onnxruntime 2>/dev/null || true && \
uv pip install --reinstall torch torchvision \
--index-url https://download.pytorch.org/whl/cu121
RUN uv sync --extra gpu && \
(uv pip uninstall onnxruntime 2>/dev/null || true) && \
uv pip install --reinstall torch torchvision \
--index-url https://download.pytorch.org/whl/cu121
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Dockerfile.gpu` around lines 64 - 67, The RUN line masks failures from uv
sync because the current chaining uses "|| true" which makes the whole chain
succeed and still runs the torch install; change the chaining so that only the
uninstall is tolerated while uv sync must succeed: run uv sync --extra gpu and
require its success, then run uv pip uninstall onnxruntime with a tolerated
failure (e.g., group the uninstall with || true or run it in a subshell/brace so
its failure doesn't affect uv sync), and finally run uv pip install --reinstall
torch torchvision only if the prior required commands succeeded; reference the
commands uv sync --extra gpu, uv pip uninstall onnxruntime, and uv pip install
--reinstall torch torchvision when making this change.

The CI test-allowlist glob added in 734c56c picked up
``test_knowledge_layout_engine.py``, which has 5 tests that mock
``_detect_accelerator`` to return ``("mps", ["CoreMLExecutionProvider",
"CPUExecutionProvider"])``. The python-side branch is fooled, but the
downstream fastembed session validates the providers against the actual
``onnxruntime`` install — on Linux CI the call fails with
``ValueError: Provider CoreMLExecutionProvider is not available``.

Added a module-level ``_REQUIRES_MAC = pytest.mark.skipif(sys.platform
!= "darwin")`` and applied it to the 5 affected tests:

  - test_layout_engine_transformers_disables_torch_compile_on_mps
  - test_layout_engine_auto_resolves_to_transformers_on_mps
  - test_timing_log_reports_effective_layout_not_user_choice
  - test_use_gpu_toggle_invalidates_auto_resolution
  - test_pytorch_embeddings_device_routes_to_mps_when_detected

The Mac-only branches still get coverage on dev machines; CI on Linux
no longer fails on the inherent platform mismatch.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cuga/backend/knowledge/storage/adapter.py`:
- Around line 632-635: The rollback code silently swallows errors from
_fts_delete_by_source; change the bare except to catch Exception as e and emit a
log entry with the exception details (e.g., using self.logger.error or the
module logger) including context like the source_for_rollback and that this
occurred during FTS rollback, rather than simply passing; keep the rollback
suppression behavior (do not re-raise) but ensure the exception and context are
logged for debugging.
- Around line 595-617: The rollback currently uses source_for_rollback from
items[0] inside _add_all and so only deletes one source on partial failure;
update _add_all to either enforce a single-source precondition by validating
that every item in items has the same "source" (raise early if not) or build the
set of all distinct sources touched (e.g., collect sources from each item's
metadata before the insert loop) and on partial failure iterate and delete by
each source (use the same _scope/_filt pattern) instead of only
source_for_rollback; ensure the chosen approach references source_for_rollback,
_add_all, self._store.add_many and mirrors delete_by_source behavior for all
affected sources.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 57186bb0-fd5b-488a-8680-5d42ad77ca35

📥 Commits

Reviewing files that changed from the base of the PR and between 99255de and 04308c6.

📒 Files selected for processing (7)
  • .gitignore
  • README.md
  • src/cuga/backend/knowledge/engine.py
  • src/cuga/backend/knowledge/storage/adapter.py
  • src/cuga/backend/server/manage_routes.py
  • src/scripts/run_tests.sh
  • tests/unit/test_knowledge_layout_engine.py
💤 Files with no reviewable changes (1)
  • .gitignore
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/cuga/backend/server/manage_routes.py

Comment on lines +595 to +617
source_for_rollback = items[0][2].get("source") if items else None

async def _add_all() -> int:
n = 0
for doc, emb in zip(documents, vectors, strict=True):
chunk_id = str(uuid.uuid4())
page_val: int
page_raw = doc.metadata.get("page")
if page_raw is None:
page_val = -1
else:
try:
page_val = int(page_raw)
except (TypeError, ValueError):
page_val = -1
source = str(doc.metadata.get("source", "") or "")
filename = str(doc.metadata.get("filename", "") or "")
meta = {
"id": chunk_id,
"tenant_id": scope["tenant_id"],
"instance_id": scope["instance_id"],
"source": source,
"filename": filename,
"page": page_val,
"chunk_text": doc.page_content,
"meta_json": json.dumps(
{"source": source, "filename": filename, "page": page_val},
ensure_ascii=False,
),
}
await self._store.add(chunk_id, emb, meta)
n += 1
partial = False
for i in range(0, total_items, insert_bs):
chunk = items[i : i + insert_bs]
try:
await self._store.add_many(chunk)
except Exception:
# If we'd already committed earlier batches, clean them
# up before re-raising so the next ingest attempt
# doesn't encounter duplicate vectors on the same
# source_id. Inline the delete loop here (mirror of
# ``delete_by_source``) so we stay on the current event
# loop — the adapter-level helper would call
# ``_run_embedding_coro`` which would deadlock against
# the loop we're already inside.
if partial and source_for_rollback:
try:
_scope = self._scope()
_filt = {**_scope, "source": source_for_rollback}
while True:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Rollback cleanup currently assumes a single source across the whole batch.

source_for_rollback is taken from only the first item. On partial failure, cleanup deletes one source and can leave committed rows for other sources in the same call. Please enforce a single-source precondition (or roll back all touched sources).

Suggested fix
-        source_for_rollback = items[0][2].get("source") if items else None
+        rollback_sources = {str(meta.get("source", "")) for _, _, meta in items}
+        if len(rollback_sources) != 1:
+            raise ValueError(
+                "add_documents expects all chunks in one call to share the same source"
+            )
+        source_for_rollback = next(iter(rollback_sources))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/knowledge/storage/adapter.py` around lines 595 - 617, The
rollback currently uses source_for_rollback from items[0] inside _add_all and so
only deletes one source on partial failure; update _add_all to either enforce a
single-source precondition by validating that every item in items has the same
"source" (raise early if not) or build the set of all distinct sources touched
(e.g., collect sources from each item's metadata before the insert loop) and on
partial failure iterate and delete by each source (use the same _scope/_filt
pattern) instead of only source_for_rollback; ensure the chosen approach
references source_for_rollback, _add_all, self._store.add_many and mirrors
delete_by_source behavior for all affected sources.

Comment on lines +632 to +635
try:
self._fts_delete_by_source(source_for_rollback)
except Exception:
pass

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don’t silently swallow FTS rollback errors in the failure path.

The bare except ...: pass hides rollback divergence exactly when debugging partial-insert incidents. Log it at least once.

Suggested fix
-                            try:
-                                self._fts_delete_by_source(source_for_rollback)
-                            except Exception:
-                                pass
+                            try:
+                                self._fts_delete_by_source(source_for_rollback)
+                            except Exception as fts_err:  # noqa: BLE001
+                                logger.warning(
+                                    "FTS rollback failed for source=%r after partial insert: %s",
+                                    source_for_rollback,
+                                    fts_err,
+                                )
🧰 Tools
🪛 Ruff (0.15.15)

[error] 634-635: try-except-pass detected, consider logging the exception

(S110)


[warning] 634-634: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cuga/backend/knowledge/storage/adapter.py` around lines 632 - 635, The
rollback code silently swallows errors from _fts_delete_by_source; change the
bare except to catch Exception as e and emit a log entry with the exception
details (e.g., using self.logger.error or the module logger) including context
like the source_for_rollback and that this occurred during FTS rollback, rather
than simply passing; keep the rollback suppression behavior (do not re-raise)
but ensure the exception and context are logged for debugging.

Source: Linters/SAST tools

@sami-marreed sami-marreed left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Request: split this PR before it can be reviewed

This PR is titled "perf + OCR + session isolation + live progress UX" but it currently carries at least four additional features. It's ~12.3k lines of test alone, across 37 changed source files including six 1k–4k-line files. Nobody can responsibly review this as one unit — and the bugs below prove it, because each one slipped through at a seam between features that were never reviewed together. Please break it up.

Split out cleanly (these lift out with little entanglement — please do these):

  • Client adaptation → own PR. ClientAdaptationPanel.tsx, query_expansion.py (glossary-driven), the adaptation prompt-framing in chat_agent.py / pmt*.jinja2 / awareness.py, plus test_knowledge_client_adaptation.py (1048), test_client_adaptation_e2e.py (282), and the client_adaptation_example.md doc. ~2.3k LOC, fully self-contained feature.
  • Reranker + GPU image → own PR. reranker.py (pulls in sentence-transformers), Dockerfile.gpu, test_knowledge_use_gpu_surfaces.py, test_knowledge_accelerator.py. New infra + a quality lever — deserves a focused review.
  • Benchmark tooling → remove from this PR entirely. scripts/bench_knowledge_ingest.py, bench_knowledge_query_compare.py, bench_knowledge_report.py, the knowledge/bench.py module, and test_knowledge_bench.py. (This reverses my earlier "keep ingest+report" note — on reflection benchmark tooling belongs in a separate PR, not the runtime tree.)

Consider splitting — or at minimum retitle:

  • Multi-scope / Option-B search. The 3630-line test_knowledge_multi_scope_search.py + test_knowledge_rag_scope_failure_fix.py (823) plus search_multi in client.py and the scope routing in prepare_node.py / helpers/knowledge.py. I know this shares core files and that "Option B" was in the original WIP title, so I won't insist on the split — but it's invisible to anyone reading the PR description. Either pull it out, or update the title/body to name it.

After the clean splits come out, what remains — perf core, OCR, session isolation, live progress UX, RAG-profile config — actually matches the title and is reviewable.

Blocking bugs, routed to where they belong (these are why the split matters):

  1. Empty-thread_id regressionprepare_node.py:438. The origin/main merge dropped the _is_empty() guard the branch previously had; ca153f3c restored session as the default scope but not the guard, so an LLM-emitted thread_id="" now hits client.py:150 raise ValueError("thread_id required for session scope"). Belongs to the session/scope PR — it got lost because scope logic was buried inside a perf PR.
  2. Dockerfile.gpu:64-66uv sync --extra gpu && uv pip uninstall onnxruntime 2>/dev/null || true && uv pip install ... short-circuits: || true makes the chain "succeed" so the torch reinstall runs even when uv sync fails, and the RUN exits 0 — a broken GPU image that builds green. CodeRabbit flagged this too. Belongs to the GPU PR.
  3. Silent enforcedry_run flipconfig.py:1032-1035. from_settings reads the profile first, and the default standard profile sets drop_page_chrome/junk_filter = "dry_run", shadowing the shipped knowledge_settings.toml that still says = "enforce". The config file now lies about its effective value. Stays in the core/profile PR.
  4. pgvector QA gate still open (my original blocker on CI/test coverage) — the new orphan-vector rollback in adapter.py is only tested against mocked/sqlite-vec; no Docker smoke test against real pgvector. Stays in the core PR.

Once split, I can give each PR the review it actually needs. Happy to re-review in whatever order you stack them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

3 participants