
Coding harness: let the assistant reason about your code #20

Merged: KerseyFabrications merged 17 commits into main from coding-harness-test on Jun 15, 2026
KerseyFabrications commented Jun 15, 2026

Contributor

Adds an opt-in coding harness — DAWN can index Git repositories into a code graph and answer questions about them ("what calls foo?", "trace from main", "what's on this branch?"), via the external cbm (codebase-memory-mcp) code-graph server. Off by default; zero impact on existing builds.

What's included

  • MCP bridge (src/tools/mcp_*): HTTP+SSE client that connects to operator-launched MCP servers, translates their tool schemas into native DAWN tools, with per-user access gating and a dangerous-tool denylist. DAWN never launches the server (no-subprocess invariant — CI-enforced).
  • Code projects subsystem (src/tools/code_project_*): import a GitHub URL (in-process libgit2 clone) or link an existing local checkout; per-project status, branch tracking, and a name-translation boundary that keeps the on-disk slug/paths out of the LLM's view.
  • Branch + reindex ops: import/track/switch a branch; refresh (fetch + incremental) vs rebuild (drop graph + full reindex); startup reconciliation for jobs interrupted mid-index.
  • Surfaces: WebUI "Coding" popover (import/link tabs, branch field, per-row refresh/rebuild/set-branch/delete); dawn-admin code-project …; admin-socket opcodes.
  • Docs: docs/CODING_PROJECTS.md (user/operator guide), referenced from README + GETTING_STARTED.

How it works

A voice/text query routes through the LLM, which calls the bridged cbm query tools (search_code, trace_path, get_architecture, …). DAWN auto-fills the active project, translates clean identifiers ↔ cbm's path-derived graph slug, and scrubs slugs/paths back out of results. Project management (import/link/branch/rebuild) is operator-facing (WebUI + dawn-admin); the LLM only queries.

Merge safety

  • Not enabled by any preset: it requires explicit -DDAWN_ENABLE_CODE_PROJECTS=ON (and -DDAWN_ENABLE_MCP_BRIDGE_TOOL=ON); the option defaults to OFF and hard-errors if the MCP bridge is not also enabled. Existing/preset builds (full, local, server, CI) compile it out entirely, so there is no new behavior, code, or runtime cost.
  • Schema migrations are additive (code_projects table + branch/kind/graph_name columns, through v66); idempotent, gated, with a re-run test.
  • New dependency libgit2 ≥ 1.6 is required only when the feature is enabled (build-from-source via INSTALL_LIBGIT2=true ./scripts/install.sh; see DEPENDENCIES.md).
  • Operator must run the cbm-mcp service for the feature to do anything (services/cbm-mcp/).

Security

  • Link-local is admin-only, gated to [code_projects] allowed_local_roots, and exposes indexed file contents to the LLM — so it's never shared globally and the cbm service is sandboxed (ProtectHome=tmpfs + BindReadOnlyPaths). cbm's discovery skips symlinks, so a symlink can't leak out-of-root content.
  • Clone path keeps the existing SSRF guard (host allowlist, redirects disabled, size/file/depth caps, symlink stripping); branch names are validated against libgit2 ref rules.
  • Reviewed by a five-agent pass (architecture / efficiency / security / UI / standards) — 0 critical; findings fixed.

Testing

  • Unit: test_code_project_db (8/8 — CRUD, v66 migration idempotency, schema-index stability), test_code_project_git (5/5 — clone, open_validate NO_SEARCH, fetch+branch-switch), test_mcp_bridge (6/6 — registration/dispatch/auth/denylist + namemap).
  • Full suite green (87/87 runnable); debug build clean (0 warnings); format_code.sh --check clean.
  • Live-validated end-to-end (import/branch-switch/refresh/rebuild/link-local/delete) against real repos.

Follow-ups (not in this PR)

  • Phase 9: surface allowed_local_roots in the WebUI settings panel, automate the cbm-mcp sandbox grant in the installer, and wire the feature into the full preset / scripts/install.sh (so it's buildable without manual -D flags) once that setup is automated. Today it's explicit-opt-in.

… can't kill the daemon

A peer-closed socket during the Telegram long-poll (curl on the listener thread)
raised SIGPIPE inside OpenSSL's write(), and the default disposition terminated
the daemon. Ignore SIGPIPE process-wide at startup (covers curl on every thread,
libwebsockets, mosquitto) and add CURLOPT_NOSIGNAL to the shared
curl_apply_dawn_defaults() preamble (required for multi-threaded libcurl; covers
all messaging drivers + web/oauth/image curl users).
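
The SIGPIPE half of this fix can be illustrated with a minimal sketch. It uses a plain pipe rather than a TLS socket, and the helper names (ignore_sigpipe, write_to_closed_pipe) are hypothetical, not DAWN functions. With SIGPIPE ignored, a write to a peer-closed descriptor fails with EPIPE instead of terminating the process. libcurl's CURLOPT_NOSIGNAL is a separate per-handle option and is left out so the sketch needs no external library.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Ignore SIGPIPE process-wide so a write to a peer-closed socket or pipe
 * returns -1 with errno == EPIPE instead of killing the process.
 * Returns 0 on success, -1 on failure. (Hypothetical helper name.) */
static int ignore_sigpipe(void) {
   struct sigaction sa;
   sa.sa_handler = SIG_IGN;
   sigemptyset(&sa.sa_mask);
   sa.sa_flags = 0;
   return sigaction(SIGPIPE, &sa, NULL);
}

/* Demo helper: write one byte to a pipe whose read end is already closed.
 * Returns the errno seen by write(), or 0 if the write succeeded. */
static int write_to_closed_pipe(void) {
   int fds[2];
   if (pipe(fds) != 0) {
      return errno;
   }
   close(fds[0]); /* simulate the peer closing its end */
   ssize_t n = write(fds[1], "x", 1);
   int err = (n < 0) ? errno : 0;
   close(fds[1]);
   return err;
}
```

Calling ignore_sigpipe() once at startup, before any worker threads exist, means every thread inherits the ignored disposition; callers then handle EPIPE as an ordinary write error.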
…ve the WS thread

A long synchronous local-ONNX embed burst (hundreds of chunks, ~90s) starved the
single WebSocket service thread enough to lapse a connected satellite's app-level
keepalive, bouncing it mid-index. usleep briefly every 8 chunks (no lock held) to
give latency-sensitive threads a scheduling window — ~200ms total across a
multi-second index.
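
A rough sketch of the yield pattern described above. The every-8-chunks cadence comes from the commit message; the sleep length, the callback type, and the function names are assumptions for illustration. The sketch uses nanosleep() where the commit mentions usleep, since usleep is marked obsolescent in newer POSIX editions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define YIELD_EVERY_N_CHUNKS 8  /* cadence stated in the commit message */
#define YIELD_SLEEP_NS 1000000L /* 1 ms; assumed value, not from the source */

/* Hypothetical per-chunk work callback (stands in for the ONNX embed call). */
typedef void (*embed_chunk_fn_t)(size_t chunk_index, void *user);

/* Demo callback: counts how many chunks were processed. */
static void count_chunk(size_t chunk_index, void *user) {
   (void)chunk_index;
   (*(size_t *)user)++;
}

/* Process every chunk, sleeping briefly after each group of 8 so that
 * latency-sensitive threads get a scheduling window. No lock is held while
 * sleeping. Returns the number of yields performed. */
static size_t embed_all_chunks(size_t chunk_count, embed_chunk_fn_t embed, void *user) {
   assert(embed != NULL);
   size_t yields = 0;
   for (size_t i = 0; i < chunk_count; i++) {
      embed(i, user);
      /* Skip the yield after the final chunk: nothing is left to starve. */
      if ((i + 1) % YIELD_EVERY_N_CHUNKS == 0 && (i + 1) < chunk_count) {
         struct timespec ts = { 0, YIELD_SLEEP_NS };
         nanosleep(&ts, NULL);
         yields++;
      }
   }
   return yields;
}
```

The total added latency is the number of yields times the sleep length, so the cost grows linearly with the chunk count and stays small next to the embed time itself.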
…djacent context

Semantic+BM25 search fuzzes exact strings (IDs, field values like "birthday:
'1965") and returns isolated mid-record fragments. Add a deterministic literal-
grep tool: every match leads with its matched LINE (always shown, never budget-
truncated) plus optional surrounding chunks.

New gap-safe DB primitives: chunk_read_range (windows by chunk_index value, not
OFFSET) and chunk_grep (permission-scoped JOIN; LIKE/instr; paginated). Tool:
query + context(0-2) + case_sensitive + offset; range-union dedup, token budget,
pagination footer (page size 50). is_available gated on the DB, not the embedder.

Tests cover gap-safe range, cross-user scoping, case, wildcard-as-literal, and
pagination. Five-agent reviewed (params .field_name, snprintf-overflow guard,
clamps). Live-verified on a 536-record YAML; 79/79 CI.
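
The "wildcard-as-literal" behavior can be sketched as follows, assuming the LIKE path binds its pattern with an ESCAPE clause (for example `LIKE ? ESCAPE '\'`). The function name and buffer contract are hypothetical; the actual chunk_grep implementation may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Build a "%<query>%" LIKE pattern in which the LIKE metacharacters '%' and
 * '_' (and the escape character itself) are backslash-escaped, so the user's
 * query matches only as a literal substring.
 * Returns 0 on success, -1 if out is too small. (Hypothetical helper.) */
static int like_escape_literal(const char *query, char *out, size_t out_size) {
   assert(query != NULL && out != NULL);
   if (out_size < 3) {
      return -1; /* need room for "%%" plus the terminator */
   }
   size_t o = 0;
   out[o++] = '%';
   for (const char *p = query; *p != '\0'; p++) {
      int needs_escape = (*p == '%' || *p == '_' || *p == '\\');
      size_t need = needs_escape ? 2 : 1;
      /* Reserve 2 more bytes for the trailing '%' and the terminator. */
      if (o + need + 2 > out_size) {
         return -1;
      }
      if (needs_escape) {
         out[o++] = '\\';
      }
      out[o++] = *p;
   }
   out[o++] = '%';
   out[o] = '\0';
   return 0;
}
```

Binding the escaped pattern as a parameter keeps the query text out of the SQL string entirely, so a search for "100%_done" cannot widen into a wildcard match.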
…r chunk)

Size-based chunking split structured docs mid-record, so a retrieval/grep hit
landed on a fragment with surrounding fields in other chunks. Content-sniff (not
filetype — a .yaml URL can be stored as .txt) for top-level YAML sequences and
CSV tables and split per record (CSV chunks carry the header so each is self-
describing). Conservative thresholds; prose falls through unchanged. Oversized
records split at line boundaries.

Tests: YAML one-record-per-chunk, CSV header+row, prose-not-mis-split guard.
Live-verified: re-indexed legislators YAML chunks at record boundaries (536
chunks / ~538 records); grep returns clean per-person records. 79/79 CI.
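
As an illustration of per-record splitting, here is a minimal sketch that locates the record boundaries of a top-level YAML sequence by content alone. It assumes a record starts on any line whose first column is "- "; the function names are hypothetical, and the real chunker's sniffing thresholds and oversized-record handling are not reproduced.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A top-level sequence item starts with '-' in column 0 followed by a space
 * or end of line. A "---" document marker does not match. */
static int is_record_start(const char *line) {
   return line[0] == '-' && (line[1] == ' ' || line[1] == '\n' || line[1] == '\0');
}

/* Scan text line by line and record the byte offset of each top-level
 * sequence item. Stores at most max_starts offsets but always returns the
 * total number of records found. (Hypothetical helper.) */
static size_t find_yaml_records(const char *text, size_t *starts, size_t max_starts) {
   assert(text != NULL);
   size_t count = 0;
   const char *line = text;
   while (*line != '\0') {
      if (is_record_start(line)) {
         if (count < max_starts) {
            starts[count] = (size_t)(line - text);
         }
         count++;
      }
      const char *nl = strchr(line, '\n');
      if (nl == NULL) {
         break;
      }
      line = nl + 1;
   }
   return count;
}
```

A caller would emit one chunk per [starts[i], starts[i+1]) span and fall back to size-based chunking when the record count is too low to be confident the document is a sequence, which is how prose avoids being mis-split.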
…unless dangerous)

Add optional [llm.tools] local_disabled/remote_disabled blocklists alongside the
legacy enable whitelists. Per surface: disable-list set → default-on except listed;
enable-list only → whitelist (legacy, unchanged); neither → all-on. TOOL_CAP_DANGEROUS
tools always require explicit enable-list opt-in. Lets a newly-added tool work
without editing every deployment's allowlist, while existing whitelist configs are
byte-identical after upgrade (back-compat by key, not rename).

apply_config takes the config struct (caller -18 lines); WebUI tool-toggle
persistence is blocklist-aware (writes *_disabled, keeps dangerous opt-ins in
*_enabled) so toggles don't silently revert. Security-reviewed: dangerous-tool
auto-enable invariant airtight under all combinations. 79/79 CI.
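
The per-surface decision rule above can be sketched as a small pure function. The struct layout and names are hypothetical, a non-empty list stands in for "key is set", and the handling of a dangerous tool that appears in both lists (denied here) is an assumption the commit message does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One surface (local or remote). Hypothetical layout for illustration. */
typedef struct {
   const char *const *enabled;  /* legacy whitelist (*_enabled) */
   size_t enabled_count;
   const char *const *disabled; /* new blocklist (*_disabled) */
   size_t disabled_count;
} surface_lists_t;

static bool list_contains(const char *const *list, size_t n, const char *name) {
   for (size_t i = 0; i < n; i++) {
      if (strcmp(list[i], name) == 0) {
         return true;
      }
   }
   return false;
}

/* Decide whether a tool is enabled on a surface:
 *  - dangerous tools always need an explicit enable-list entry;
 *  - blocklist set: on unless listed;
 *  - whitelist only: on only if listed (legacy behavior);
 *  - neither: on. */
static bool tool_enabled_for_surface(const surface_lists_t *s, const char *name, bool dangerous) {
   assert(s != NULL && name != NULL);
   bool in_enabled = list_contains(s->enabled, s->enabled_count, name);
   bool in_disabled = list_contains(s->disabled, s->disabled_count, name);
   if (dangerous) {
      return in_enabled && !in_disabled; /* assumption: blocklist still wins */
   }
   if (s->disabled_count > 0) {
      return !in_disabled;
   }
   if (s->enabled_count > 0) {
      return in_enabled;
   }
   return true;
}
```

Checking the dangerous flag first is what keeps the opt-in invariant independent of which list model a deployment uses.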
…nfig

The file-local static shared the name of the global dawn_config_t g_config
from dawn_config.h. Harmless only because this TU didn't include that header;
the tool-blocklist change surfaced it when a new llm_tools.h include pulled
dawn_config.h in transitively (conflicting types for g_config). Rename to the
s_ static convention so a future header reorg can't reintroduce the collision.
Pure rename, no behavior change.

CI no-process-mgmt grep, HTTP+SSE transport, JSON-RPC client+FSM,
hardened JSON-Schema->treg_param translator, schema v55 mcp_user_access
+ auth_db_mcp allowlist, and bridge registration/dispatch (trampolines,
per-call auth, dangerous-tool admin denylist). 4 new test binaries; full
CI suite 69/69. Bridge tool/options default OFF; not yet wired to config.
…rdening

Builds on 3905c44 (MCP bridge foundation); completes the daemon/CLI side of
Phase 1. WebUI (Steps 17-19) still pending.

- config: [mcp]/[[mcp.server]] + [code_projects] parse/validate/defaults;
  executor raw-args hook so typed JSON survives (action,value) dispatch
- bridge: config-driven mcp_bridge_init, per-call fail-closed auth, cbm
  project auto-fill; admin 0xB0-0xB8 + dawn-admin mcp/code-project CLI
- code projects: schema v56 table, code_project_db CRUD/visibility, libgit2
  in-process clone (size/file/depth caps, symlink sweep, SSRF+allowlist,
  redirect-off), nice-10 orchestrator worker, cbm code_graph provider,
  native code_project tool, per-session active project
- libgit2 1.8.1 build in scripts/lib/libs.sh (opt-in); CMake options
  DAWN_ENABLE_MCP_BRIDGE_TOOL / DAWN_ENABLE_CODE_PROJECTS (OFF default)
- check_no_process_mgmt.sh CI grep enforces the no-subprocess invariant
- review hardening (9-agent pass): fail-closed dispatch, libgit2 redirect/
  userinfo/allowlist/redact, server-table locking, post-clone size tally,
  libgit2 init lifecycle, shutdown wiring, standards/Doxygen/null-checks

Verified: dawn + dawn-admin + tests-ci build 0 warnings; 71/71 CI; format +
no-process-mgmt grep clean. Migrations v55/v56 unconditional (main still v54).

Settings panel + a Coding header popover for the code-projects subsystem,
on top of the daemon/CLI side already on this branch. Frontend not yet
browser-tested; backend is compile-verified (86/86 CI, 0 warnings).

Settings:
- config_to_json (config_env.c) serializes [mcp] (+ servers[] read-only)
  and [code_projects]; webui_config.c apply-parses the editable scalars
  (servers stay TOML-managed); schema.js adds MCP Bridge + Code Projects
  sections under a new Coding category.

WS backend (the WebUI uses WS dispatch, not REST):
- webui_code_projects.{c,h}: list/import/refresh/delete handlers, scoped to
  conn->auth_user_id; import honors import_user_required=admin and only
  admins set global or act on others' projects. Dispatched in
  webui_message_dispatch.c (#ifdef DAWN_ENABLE_CODE_PROJECTS); built via
  DawnTools.cmake under code-projects + ENABLE_WEBUI.
- Strong override of code_project_broadcast_status_changed in
  webui_broadcasts.c pushes code_project_status_changed for live re-fetch.

Frontend:
- code-projects.js (DawnCodeProjects), code-projects.css (@imported in
  main.css), #coding-btn + #code-projects-popover in index.html, dawn.js
  message cases + init + auth-gated reveal. XSS-escaped rendering.
- ui-design-architect review applied: real showConfirmModal API + focus
  trap/return, four-way popover mutual-close, aria-labels + focus rings,
  themed badge tokens, elevation/backdrop, narrow-viewport bottom sheet.

Verified: dawn + dawn-admin + tests-ci build 0 warnings; 86/86 CI; C
format clean; all JS passes node --check. Browser test pending.

Stabilization fixes from live WebUI testing of the code-projects panel:

- Import validates the repo exists before creating a DB row. The remote
  probe (in-process libgit2 ref negotiation, redirects off, bounded by
  server timeouts) runs on the worker thread, not the audio-carrying lws
  service thread. Nonexistent/unreachable URL -> no row + failure toast;
  repo exists but clone later fails (size/depth caps) -> error row kept
  (refreshable settings issue). cp_job_t now carries import params.
- valid_name: allow uppercase (isalnum) so e.g. Hello-World imports.
- Coding popover button hidden unless [code_projects].enabled (JS gate via
  get_config_response + #coding-btn.hidden CSS specificity override).
- [mcp]/[code_projects] now written by config_write_toml, not just
  config_to_json, so the WebUI "Enable Code Projects" toggle persists
  across restart (also fixes latent wipe of a hand-added [mcp]).

Build clean, format clean, JS syntax checked. Live-verified: bad URL
toasts failure with no phantom row/clone dir; valid URL imports.
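
A minimal sketch of the relaxed name check. Only the use of isalnum (so uppercase passes) and the Hello-World example come from the commit message; the extra allowed characters ('-' and '_') and the length cap are assumptions, and the function is an illustration rather than the project's valid_name.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

#define PROJECT_NAME_MAX_LEN 64 /* assumed cap, not from the source */

/* Accept names made of ASCII letters (either case), digits, '-' and '_'.
 * Rejecting '.', '/' and everything else keeps path-like input out. */
static bool valid_name_sketch(const char *name) {
   if (name == NULL || name[0] == '\0') {
      return false;
   }
   size_t len = 0;
   for (const unsigned char *p = (const unsigned char *)name; *p != '\0'; p++) {
      /* isalnum() requires an unsigned char value, hence the cast above. */
      if (!isalnum(*p) && *p != '-' && *p != '_') {
         return false;
      }
      if (++len > PROJECT_NAME_MAX_LEN) {
         return false;
      }
   }
   return true;
}
```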

A clone with cbm-mcp absent reported a bare "indexing failed" that reads
like a bug rather than a missing backend. worker_do_index now pre-checks
backend availability and reports "clone ready, but no code server
connected — start cbm-mcp, then re-index".

Added via the provider abstraction (no layering break): new optional
is_available() on the code_graph_provider vtable, backed by a new
mcp_bridge_server_connected() that reports MCP_STATE_CONNECTED without
triggering a call or reconnect.

Build clean, format clean.
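
The optional-vtable-entry pattern can be sketched like this. The struct, enum, and function names are hypothetical stand-ins for the code_graph_provider vtable; the point shown is that a NULL is_available is treated as "available", so providers that predate the hook keep working.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down provider vtable. */
typedef struct {
   const char *name;
   bool (*is_available)(void); /* optional: NULL means "assume available" */
} code_graph_provider_sketch_t;

typedef enum {
   PREFLIGHT_OK = 0,
   PREFLIGHT_NO_BACKEND /* clone is ready, but no code server is connected */
} preflight_result_t;

/* Demo stubs standing in for a connection-state query. */
static bool backend_up(void) {
   return true;
}
static bool backend_down(void) {
   return false;
}

/* Pre-check run before indexing: reports a missing backend as its own state
 * instead of letting the index call fail with a generic error. */
static preflight_result_t preflight_index(const code_graph_provider_sketch_t *p) {
   assert(p != NULL);
   if (p->is_available != NULL && !p->is_available()) {
      return PREFLIGHT_NO_BACKEND;
   }
   return PREFLIGHT_OK;
}
```

Because the hook only reads state, the pre-check can run on the worker thread without triggering a call or reconnect.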

Operator service that runs codebase-memory-mcp (stdio) behind mcp-proxy,
re-exposing it over SSE for DAWN's MCP bridge. Unit + EnvironmentFile +
logrotate + install.sh + README, modeled on services/llama-server. Runs as
the dawn user (reads /var/lib/dawn/source, writes its graph cache). cbm
built with libgit2 disabled (fallback to git log).

Two cbm-bridge fixes (entangled in mcp_bridge_tool.c, so committed together):

Admin-grant bootstrap: auth_db_mcp_grant_all_admins ran inside
mcp_bridge_init (during tools_register_all), ~460 lines before
auth_db_init — it hit a closed DB and no-op'd, leaving mcp_user_access
empty so even admin was denied every cbm_* tool. Moved the grant to dawn.c
after auth_db_init, keyed on every configured [[mcp.server]] alias.

Name-translation boundary: cbm names projects by slugifying the absolute
repo path and prefixes qualified_name/file_path with it, leaking the
filesystem layout to the LLM and baking the path into conversation history.
New code_project_namemap translates the LLM's clean identifiers to cbm's
namespace outbound and strips the slug + source_root paths from results
inbound; the prefix is captured from cbm's own list_projects (no schema
change). LLM now sees only clean names + project-relative paths; stored
conversations survive a directory move + reindex.

Also: link code_project_namemap.c into test_mcp_bridge. Build clean,
86/86 CI tests pass.
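
The inbound half of the name-translation boundary can be sketched as a prefix scrub. The slug format shown in the usage note is invented for illustration (the commit only says it is derived from the absolute repo path), and the function is a hypothetical helper, not code_project_namemap itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy src to dst with every occurrence of prefix removed, so identifiers
 * reach the LLM without the path-derived slug.
 * Returns 0 on success, -1 on empty prefix or insufficient space. */
static int strip_prefix_occurrences(const char *src, const char *prefix, char *dst, size_t dst_size) {
   assert(src != NULL && prefix != NULL && dst != NULL);
   size_t plen = strlen(prefix);
   if (plen == 0 || dst_size == 0) {
      return -1;
   }
   size_t o = 0;
   while (*src != '\0') {
      if (strncmp(src, prefix, plen) == 0) {
         src += plen; /* drop the slug, keep what follows */
         continue;
      }
      if (o + 1 >= dst_size) {
         return -1; /* keep room for the terminator */
      }
      dst[o++] = *src++;
   }
   dst[o] = '\0';
   return 0;
}
```

With a made-up slug such as "var-lib-dawn-source-demo.", the result text "var-lib-dawn-source-demo.src.main calls var-lib-dawn-source-demo.src.util" is scrubbed to "src.main calls src.util". The outbound direction would prepend the same per-project prefix before the call is forwarded.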

cbm's list_projects duplicated the native code_project list tool (which has
clean names + per-user visibility), and the two returned different formats —
a DX wart Friday flagged during testing. register_server_tools now skips
registering cbm/list_projects as an LLM-facing tool (gated on
DAWN_ENABLE_CODE_PROJECTS so it only hides when code_project exists). The
namemap capture is unaffected: it calls list_projects via
mcp_bridge_call_tool, which bypasses the registry.

Build clean, 86/86 CI tests pass.

Extends the code-projects harness across schema/git/namemap/service/surfaces:

- branch: import/track/switch a branch (libgit2 fetch+checkout); schema v66 adds
  branch/kind/graph_name to code_projects via an idempotent (PRAGMA-probed) ALTER.
- link-local: register an existing local checkout (kind=local, admin-only, gated
  by [code_projects] allowed_local_roots) — never cloned or removed. cbm-mcp.service
  gains an opt-in ProtectHome=tmpfs + BindReadOnlyPaths sandbox block.
- refresh (fetch + incremental) vs rebuild (delete graph + reindex); startup
  reconciliation heals rows interrupted mid-index.
- fix: cbm delete sent the wrong arg key ("project_name") and the clean name, so
  the on-disk graph was never removed; now resolves the persisted path-derived
  slug. namemap reworked from a single source_root prefix to a per-project map
  (multi-repo + shared-cbm safe).
- surfaces: WebUI (Import/Link tabs, branch field, rebuild/set-branch, full-width
  URLs, conversation-style hover actions); admin opcodes 0xD0-0xD6; dawn-admin
  link/rebuild/set-branch + import --branch; admins now see all projects.
- docs: CODING_PROJECTS.md (user/operator guide)

Reviewed by 5 agents (0 critical); findings fixed. Debug build clean (0 warnings);
test_code_project_db 8/8, test_code_project_git 5/5; format clean.

- README: Code Projects rows in the Optional Features + Documentation tables.
- GETTING_STARTED: Code Projects pointer under Optional Components.
- CODING_PROJECTS.md §5: point the cbm-sharing deep-dive at the atlas archive;
  CODING_HARNESS_CBM_SHARING.md stays untracked (moves to atlas after the PR).

qodo-code-review Bot commented Jun 15, 2026


Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (1)

Context used
✅ Compliance rules (platform): 27 rules



Action required

1. transfer_progress_cb returns literal 0 ✓ Resolved 📘 Rule violation ≡ Correctness
Description
New code returns raw literal 0 from functions used as status returns, instead of returning
SUCCESS. This violates the requirement to use standardized SUCCESS/FAILURE constants for
status returns.
Code

src/tools/code_project_git.c[R65-86]

+static int transfer_progress_cb(const git_indexer_progress *stats, void *payload) {
+   git_cb_ctx_t *ctx = (git_cb_ctx_t *)payload;
+   const code_git_clone_opts_t *o = ctx->opts;
+   /* Best-effort in-flight caps. total_objects is server-advertised and can be
+    * understated; the authoritative on-disk tally runs in sweep_cb (sec-S5). */
+   if (o->max_file_count > 0 && stats->total_objects > o->max_file_count) {
+      OLOG_ERROR("code_git: object count %u exceeds cap %u", stats->total_objects,
+                 o->max_file_count);
+      return FAILURE; /* any nonzero return aborts the fetch (libgit2 contract) */
+   }
+   if (o->max_size_bytes > 0 && stats->received_bytes > o->max_size_bytes) {
+      OLOG_ERROR("code_git: received bytes exceed cap %zu", o->max_size_bytes);
+      return FAILURE;
+   }
+   if (o->progress_cb != NULL && stats->total_objects > 0) {
+      int pct = (int)(((uint64_t)stats->received_objects * 100) / stats->total_objects);
+      if (pct != ctx->last_transfer_pct) {
+         ctx->last_transfer_pct = pct;
+         o->progress_cb(o->progress_user, pct, "cloning");
+      }
+   }
+   return 0;
Evidence
PR Compliance ID 278936 forbids returning raw 0/1 literals for status returns and requires using
SUCCESS/FAILURE. The new callbacks return 0 directly (e.g., transfer_progress_cb,
sweep_cb, remove_cb) even though this file includes dawn_error.h and already uses
SUCCESS/FAILURE elsewhere.

Rule 278936: Use standardized SUCCESS/FAILURE constants for function status returns
src/tools/code_project_git.c[65-86]
src/tools/code_project_git.c[110-138]
src/tools/code_project_git.c[200-205]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Project standards require returning `SUCCESS`/`FAILURE` constants instead of raw `0`/`1` literals when using integer status returns.

## Issue Context
These callbacks currently return `0` to continue per external library contracts; `SUCCESS` is equivalent but keeps the code compliant and consistent.

## Fix Focus Areas
- src/tools/code_project_git.c[65-86]
- src/tools/code_project_git.c[110-138]
- src/tools/code_project_git.c[200-205]



2. mcp_client_create Doxygen incomplete ✓ Resolved 📘 Rule violation ✧ Quality
Description
New public API declarations in include/tools/mcp_client.h have Doxygen comments that lack required
@param entries for parameters. This breaks the compliance requirement for complete Doxygen
documentation on public APIs.
Code

include/tools/mcp_client.h[R83-90]

+/**
+ * @brief Create a client and its transport (idle; call mcp_client_connect()).
+ * @return Heap handle, or NULL on error. Free with mcp_client_destroy().
+ */
+mcp_client_t *mcp_client_create(const mcp_client_opts_t *opts);
+
+/** @brief Shut down (if needed), destroy the transport, and free the client. */
+void mcp_client_destroy(mcp_client_t *c);
Evidence
PR Compliance ID 278940 requires Doxygen blocks for public API declarations to include @param for
each parameter. The added mcp_client_create(const mcp_client_opts_t *opts) block has
@brief/@return but no @param opts, and nearby public APIs follow the same incomplete pattern.

Rule 278940: Require Doxygen-style comments for all public API functions
include/tools/mcp_client.h[83-110]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Public API function declarations must have Doxygen-style comments with `@param` entries for every parameter (and `@return` for non-void). Several new declarations have incomplete Doxygen blocks (missing one or more `@param`).

## Issue Context
These functions are declared in a public header and are part of the exported MCP client API.

## Fix Focus Areas
- include/tools/mcp_client.h[83-110]



3. mcp_progress_fn missing _t 📘 Rule violation ✧ Quality
Description
New typedef aliases like mcp_progress_fn and mcp_transport_on_message_fn do not end with the
required _t suffix. This violates the project typedef naming convention and can cause inconsistent
type naming across the codebase.
Code

include/tools/mcp_client.h[R55-59]

+/** Per-request progress callback (driven by `notifications/progress`). */
+typedef void (*mcp_progress_fn)(void *user, int percent, const char *message);
+
+/** Factory for the underlying transport (keeps the client transport-agnostic). */
+typedef mcp_transport_t *(*mcp_transport_factory_fn)(const mcp_transport_opts_t *opts);
Evidence
PR Compliance ID 278923 requires all new/modified typedef alias names to end in _t. The diff
introduces new typedef aliases mcp_progress_fn, mcp_transport_factory_fn,
mcp_transport_on_message_fn, and mcp_transport_on_state_fn which do not end in _t.

Rule 278923: C/C++ typedef names must use _t suffix
include/tools/mcp_client.h[55-59]
include/tools/mcp_transport.h[59-66]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
New C typedef aliases must end with the literal suffix `_t`, but several new function-pointer typedefs end with `_fn`.

## Issue Context
This repository’s coding standards require `_t` for all typedef aliases for consistency.

## Fix Focus Areas
- include/tools/mcp_client.h[55-59]
- include/tools/mcp_transport.h[59-66]



4. code_project_db_* Doxygen incomplete ✓ Resolved 📘 Rule violation ✧ Quality
Description
New public API declarations in include/tools/code_project_db.h use brief Doxygen comments but omit
required @param entries (and/or @return) for the declared functions. This violates the rule
requiring complete Doxygen documentation for public APIs.
Code

include/tools/code_project_db.h[R67-87]

+/** @brief Insert a new project row. @param id_out If non-NULL, set to the new row id. */
+int code_project_db_create(const code_project_t *p, int64_t *id_out);
+
+/** @brief Update a project's status string and optional status message. */
+int code_project_db_update_status(int64_t id, const char *status, const char *msg);
+
+/** @brief Stamp the last-indexed time on a project. */
+int code_project_db_set_indexed_at(int64_t id, time_t when);
+
+/** @brief Set a project's tracked branch (clone kind). Empty/NULL clears it. */
+int code_project_db_set_branch(int64_t id, const char *branch);
+
+/** @brief Set a project's persisted cbm graph slug. Empty/NULL clears it. */
+int code_project_db_set_graph_name(int64_t id, const char *graph_name);
+
+/** @brief Fetch a project by id into @p out. */
+int code_project_db_get(int64_t id, code_project_t *out);
+
+/** @brief Fetch a project by unique name into @p out. */
+int code_project_db_get_by_name(const char *name, code_project_t *out);
+
Evidence
PR Compliance ID 278940 requires complete Doxygen comment blocks for public API declarations. The
new code_project_db_* functions are public header declarations, but their comments are brief and
do not provide @param entries for all parameters and/or do not provide @return tags for non-void
return types.

Rule 278940: Require Doxygen-style comments for all public API functions
include/tools/code_project_db.h[67-87]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Public API function declarations in headers must have complete Doxygen comments, including `@param` tags for every parameter and `@return` for non-void returns.

## Issue Context
`include/tools/code_project_db.h` introduces a new public API surface for code-project CRUD operations.

## Fix Focus Areas
- include/tools/code_project_db.h[67-116]



5. SSE endpoint SSRF risk ✓ Resolved 🐞 Bug ⛨ Security
Description
resolve_endpoint() accepts a server-provided endpoint value that can be an absolute URL
(changing host/scheme), and the transport then POSTs JSON-RPC (including the Authorization bearer
header) to that unvalidated URL. This enables token exfiltration and SSRF-style outbound requests if
the MCP server is malicious/compromised (or if TLS verification is disabled and an active network
attacker injects the SSE event).
Code

src/tools/mcp_transport_http_sse.c[R99-115]

+static char *resolve_endpoint(const char *base_url, const char *endpoint_field) {
+   CURLU *h = curl_url();
+   if (h == NULL) {
+      return NULL;
+   }
+
+   char *result = NULL;
+   if (curl_url_set(h, CURLUPART_URL, base_url, 0) == CURLUE_OK &&
+       curl_url_set(h, CURLUPART_URL, endpoint_field, 0) == CURLUE_OK) {
+      char *full = NULL;
+      if (curl_url_get(h, CURLUPART_URL, &full, 0) == CURLUE_OK) {
+         result = strdup(full);
+         curl_free(full);
+      }
+   }
+   curl_url_cleanup(h);
+   return result;
Evidence
The transport resolves the server-provided endpoint by setting CURLUPART_URL twice; the second set
can fully replace the URL, not just apply a relative path. The resolved endpoint is then used as
CURLOPT_URL for POST requests, and the transport attaches the stored Authorization header to that
POST.

src/tools/mcp_transport_http_sse.c[94-115]
src/tools/mcp_transport_http_sse.c[129-164]
src/tools/mcp_transport_http_sse.c[264-304]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The HTTP+SSE transport trusts the SSE `endpoint` event and resolves it using `curl_url_set(..., CURLUPART_URL, endpoint_field, ...)`, which can accept absolute URLs. The resolved endpoint is later used as the POST destination while still attaching the bearer Authorization header.

## Issue Context
This transport is used by the MCP bridge to send JSON-RPC requests. The endpoint should be constrained to the operator-configured base URL (same origin) to prevent a server-controlled event from redirecting authenticated POST traffic elsewhere.

## Fix Focus Areas
- Ensure endpoint is same-origin (scheme/host/port) as the configured base URL, or restrict the `endpoint` event to relative-path-only values (e.g., must start with `/` and must not contain a scheme/host).
- Reject endpoints containing userinfo/credentials.
- On rejection, transition to ERROR and avoid setting `connected`.

### Code pointers
- src/tools/mcp_transport_http_sse.c[99-115]
- src/tools/mcp_transport_http_sse.c[129-164]
- src/tools/mcp_transport_http_sse.c[264-304]

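
One of the options named in the fix focus areas, restricting the `endpoint` event to relative-path-only values, could look roughly like the sketch below. It is a conservative string check with a hypothetical name, not the change that was merged; a same-origin comparison using libcurl's URL API would be the alternative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Accept only an absolute-path reference such as "/messages?session_id=...".
 * Rejects anything that could change scheme or host when resolved against
 * the configured base URL. Deliberately stricter than RFC 3986 requires. */
static bool endpoint_is_relative_path(const char *endpoint) {
   if (endpoint == NULL || endpoint[0] != '/') {
      return false; /* must start with a single '/' */
   }
   if (endpoint[1] == '/') {
      return false; /* "//host/..." is a network-path reference */
   }
   if (strstr(endpoint, "://") != NULL) {
      return false; /* embedded scheme; rejected to stay conservative */
   }
   for (const unsigned char *p = (const unsigned char *)endpoint; *p != '\0'; p++) {
      if (*p < 0x20 || *p == 0x7f || *p == '\\') {
         return false; /* control bytes and backslashes are never expected */
      }
   }
   return true;
}
```

A transport using this check would call it on the SSE event payload before resolving the URL and, on rejection, move to its error state without marking the connection as established, so the bearer header is never attached to a POST aimed at another origin.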



Remediation recommended

6. Git sweep ignores walk errors 🐞 Bug ☼ Reliability
Description
The post-checkout nftw() sweep callback ignores typeflag and always reads sb->st_mode, and the
sweep runner ignores the nftw() return value, so traversal/stat errors can silently skip symlink
stripping and size/depth/file-count enforcement. This can leave a clone in a state that violates the
intended hardening invariants without being detected.
Code

src/tools/code_project_git.c[R110-157]

+static int sweep_cb(const char *path, const struct stat *sb, int typeflag, struct FTW *ftw) {
+   (void)typeflag;
+   if (S_ISLNK(sb->st_mode)) {
+      OLOG_WARNING("code_git: removing symlink from clone: %s", path);
+      if (unlink(path) != 0) {
+         s_sweep_failed = 1;
+      }
+      return 0;
+   }
+   if (s_sweep_max_depth > 0 && ftw->level > (int)s_sweep_max_depth) {
+      OLOG_ERROR("code_git: path depth %d exceeds cap %u: %s", ftw->level, s_sweep_max_depth, path);
+      s_sweep_failed = 1;
+   }
+   /* Authoritative on-disk size/count tally (sec-S5): transfer_progress_cb's
+    * caps trust the server-advertised object count, which a hostile server can
+    * understate. Enforce the real working-tree totals here. */
+   if (S_ISREG(sb->st_mode)) {
+      s_sweep_files++;
+      s_sweep_bytes += (uint64_t)sb->st_size;
+      if (s_sweep_max_files > 0 && s_sweep_files > s_sweep_max_files) {
+         OLOG_ERROR("code_git: file count %u exceeds cap %u", s_sweep_files, s_sweep_max_files);
+         s_sweep_failed = 1;
+      }
+      if (s_sweep_max_bytes > 0 && s_sweep_bytes > s_sweep_max_bytes) {
+         OLOG_ERROR("code_git: working-tree size exceeds cap %zu", s_sweep_max_bytes);
+         s_sweep_failed = 1;
+      }
+   }
+   return 0;
+}
+
+/* Run the post-checkout sweep over @p path: strip symlinks (containment) and
+ * tally the authoritative on-disk size / file-count / depth against the caps
+ * (sec-S5). Shared by clone and fetch (fetch adds files too). Not reentrant —
+ * uses the file-scope sweep state; callers serialize on the one worker thread.
+ * @return SUCCESS or FAILURE. */
+static int run_post_checkout_sweep(const char *path,
+                                   uint8_t max_depth,
+                                   size_t max_bytes,
+                                   uint32_t max_files) {
+   s_sweep_max_depth = max_depth;
+   s_sweep_max_bytes = max_bytes;
+   s_sweep_max_files = max_files;
+   s_sweep_bytes = 0;
+   s_sweep_files = 0;
+   s_sweep_failed = 0;
+   nftw(path, sweep_cb, GIT_SWEEP_MAX_FDS, FTW_PHYS);
+   return s_sweep_failed ? FAILURE : SUCCESS;
Evidence
The sweep callback explicitly discards typeflag and immediately dereferences sb->st_mode, and
the sweep runner does not check whether nftw() itself failed—only whether s_sweep_failed was set
by the callback—so traversal/stat errors can bypass the intended enforcement path.

src/tools/code_project_git.c[110-158]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The `nftw()`-based post-checkout sweep is intended to enforce caps and strip symlinks, but it does not explicitly handle traversal/stat failures and does not propagate `nftw()` failing.

## Issue Context
`run_post_checkout_sweep()` is used after clone and after fetch+checkout; it is part of the security/robustness envelope for imported repositories.

## Fix Focus Areas
- In `sweep_cb`, handle error/special cases via `typeflag` (e.g., when stat failed) before reading fields from `*sb`; mark the sweep failed and optionally log.
- In `run_post_checkout_sweep`, check the return value of `nftw()` and treat nonzero/-1 as FAILURE (also consider capturing `errno` for logs).

### Code pointers
- src/tools/code_project_git.c[110-158]

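
The two suggested changes can be sketched in isolation. This is a stripped-down walker with hypothetical names (sweep_demo_cb, run_sweep_demo) and an assumed descriptor limit; the symlink stripping and the size, depth, and count caps from the real sweep are omitted so the error handling stands out.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <ftw.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

#define SWEEP_OK 0
#define SWEEP_FAIL 1
#define SWEEP_DEMO_MAX_FDS 16 /* assumed descriptor budget */

static int s_demo_failed;
static unsigned s_demo_files;

static int sweep_demo_cb(const char *path, const struct stat *sb, int typeflag, struct FTW *ftw) {
   (void)ftw;
   /* FTW_NS: stat failed, so *sb is undefined and must not be read.
    * FTW_DNR: directory unreadable, so its children are silently skipped. */
   if (typeflag == FTW_NS || typeflag == FTW_DNR) {
      fprintf(stderr, "sweep: cannot inspect %s\n", path);
      s_demo_failed = 1;
      return 0; /* keep walking so every problem gets logged */
   }
   if (typeflag == FTW_F && S_ISREG(sb->st_mode)) {
      s_demo_files++;
   }
   return 0;
}

/* Walk root and fail closed: a traversal error from nftw() itself counts as
 * a failure, not only flags raised inside the callback. */
static int run_sweep_demo(const char *root) {
   assert(root != NULL);
   s_demo_failed = 0;
   s_demo_files = 0;
   errno = 0;
   if (nftw(root, sweep_demo_cb, SWEEP_DEMO_MAX_FDS, FTW_PHYS) != 0) {
      fprintf(stderr, "sweep: nftw(%s) failed: %s\n", root, strerror(errno));
      return SWEEP_FAIL;
   }
   return s_demo_failed ? SWEEP_FAIL : SWEEP_OK;
}
```

Like the original, the sketch relies on file-scope state and is therefore not reentrant; callers would still need to serialize on one thread.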



Copilot AI left a comment


Pull request overview

Adds an opt-in “coding harness” that lets DAWN index Git repositories into a code graph via an operator-run cbm MCP server, and exposes UI/admin/LLM surfaces to manage/query those projects (feature-gated and off by default).

Changes:

  • Introduces MCP bridge client (HTTP+SSE) with per-user access gating and admin tooling.
  • Adds code-projects subsystem (DB schema + libgit2 clone/link + indexing orchestration) plus WebUI/admin-socket/dawn-admin surfaces.
  • Improves document tooling (structured chunking + literal grep + indexing yield) and extends tool enable/disable config (blocklist + whitelist models).

Reviewed changes

Copilot reviewed 96 out of 96 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
www/js/ui/settings/schema.js Adds Settings panels for MCP Bridge and Code Projects + “Coding” category.
www/js/ui/scheduler-queue.js Ensures header popovers close Code Projects when opening Scheduler Queue.
www/js/ui/memory.js Ensures Memory popover closes Code Projects if open.
www/js/ui/doc-library.js Ensures Doc Library popover closes Code Projects if open.
www/js/dawn.js Adds WebSocket dispatch + visibility gating for Code Projects UI.
www/index.html Adds Coding header button + Code Projects popover markup and script include.
www/css/main.css Adds Code Projects CSS component import.
tests/test_document_chunker.c Adds unit tests for structure-aware YAML/CSV chunking.
tests/test_code_project_git.c Adds libgit2-based clone/fetch/checkout/link validation tests.
tests/test_auth_db_mcp.c Adds tests for per-user MCP access (grant/revoke/check/admin).
tests/smoke_test_harness.sh Adds optional end-to-end smoke test harness for import/index flow.
tests/CMakeLists.txt Wires new unit tests; gates libgit2 test behind DAWN_ENABLE_CODE_PROJECTS.
src/webui/webui_tools.c Fixes persisted tool config round-tripping for blocklist vs whitelist setups.
src/webui/webui_message_dispatch.c Adds WebSocket handlers for Code Projects messages (feature-gated).
src/webui/webui_config.c Applies [mcp] and [code_projects] settings from WebUI JSON payloads.
src/webui/webui_broadcasts.c Adds WebUI broadcasts for code-project status/import failures.
src/tools/tools_init.c Registers MCP bridge + code-project service/tool; adds document_grep tool registration.
src/tools/search_summarizer.c Renames module-static config variable for clarity.
src/tools/document_index_pipeline.c Adds periodic CPU yield during embed loop to reduce thread starvation.
src/tools/document_db.c Adds chunk-range read + literal grep query helpers.
src/tools/document_chunker.c Adds conservative structure-aware YAML/CSV record chunking.
src/tools/code_project_tool.c Adds native code_project LLM tool (list/set_active/status).
src/tools/code_graph_provider_cbm.c Adds cbm-backed code-graph provider using MCP bridge tools.
src/llm/llm_tools.c Adds blocklist/whitelist resolution; exposes thread-local raw tool-call JSON args.
src/llm/llm_interface.c Applies new llm.tools config model (enable/disable lists).
src/dawn.c Ignores SIGPIPE; adds MCP admin bootstrap grants; adds orderly shutdown of new subsystems.
src/config/config_validate.c Validates MCP server config and code-project limits/regex.
src/config/config_parser.c Parses MCP and code-projects config sections + new tool disable lists.
src/config/config_env.c Serializes MCP/code-projects settings to JSON; round-trips MCP/code-projects + new disable lists to TOML.
src/config/config_defaults.c Adds secure-by-default code-projects defaults.
src/auth/auth_db_statements.c Adds prepared statements for chunk-range reads and literal grep.
src/auth/auth_db_migrations.c Adds v64–v66 migrations to global schema ladder.
src/auth/auth_db_migrations_v66.c Adds idempotent ALTERs for code_projects branch/kind/graph_name columns.
src/auth/auth_db_migrations_v65.c Adds idempotent code_projects table creation.
src/auth/auth_db_migrations_v64.c Adds idempotent mcp_user_access table creation.
src/auth/auth_db_mcp.c Implements per-user MCP access allowlist CRUD.
src/auth/admin_socket.c Dispatches new ADMIN_MSG_MCP_* and ADMIN_MSG_CODE_PROJ_* opcodes (feature-gated).
src/auth/admin_socket_mcp.c Implements admin-socket MCP list/status/grant/revoke/reset handlers.
services/cbm-mcp/README.md Documents running cbm behind mcp-proxy as a systemd service.
services/cbm-mcp/install.sh Adds installer for cbm-mcp systemd service + dependencies.
services/cbm-mcp/cbm-mcp.service Adds hardened systemd unit for mcp-proxy + cbm.
services/cbm-mcp/cbm-mcp.conf Adds EnvironmentFile for cbm-mcp deployment.
services/cbm-mcp/cbm-mcp-logrotate Adds logrotate config for cbm-mcp logs.
scripts/lib/libs.sh Adds opt-in libgit2 build-from-source helper.
scripts/check_no_process_mgmt.sh Adds CI invariant to forbid process-management calls in harness code.
README.md Links new Code Projects documentation.
include/webui/webui_code_projects.h Declares WebUI Code Projects WebSocket handlers.
include/tools/mcp_transport.h Adds MCP transport abstraction interface.
include/tools/mcp_transport_http_sse.h Declares HTTP+SSE transport factory.
include/tools/mcp_client.h Declares MCP JSON-RPC client/FSM and call API.
include/tools/mcp_bridge.h Declares MCP bridge API for tool registration/dispatch/status.
include/tools/mcp_bridge_schema.h Declares MCP JSON Schema translation and description hardening helpers.
include/tools/document_grep.h Declares document_grep tool registration entrypoint.
include/tools/document_db.h Adds types/APIs for grep hits and chunk-range reads.
include/tools/code_project_tool.h Declares native code_project tool registration.
include/tools/code_project_service.h Declares code-projects orchestrator API and broadcast hooks.
include/tools/code_project_namemap.h Declares cbm slug/name translation boundary API.
include/tools/code_project_git.h Declares libgit2 clone/fetch/validate wrapper API.
include/tools/code_project_db.h Declares code_projects DB CRUD/visibility API.
include/tools/code_graph_provider.h Declares code-graph provider vtable (cbm-backed in Phase 1).
include/llm/llm_tools.h Updates llm.tools config API and adds raw-args accessor.
include/core/session_manager.h Adds per-session active code-project fields.
include/core/curl_buffer.h Sets CURLOPT_NOSIGNAL in shared curl defaults for thread-safety.
include/config/dawn_config.h Adds MCP/code-projects config structs and tool disable lists.
include/auth/auth_db_mcp.h Declares MCP allowlist DB APIs.
include/auth/auth_db_internal.h Bumps schema version to 66 and adds new stmt/migration declarations.
include/auth/admin_socket.h Adds admin socket opcodes for MCP and code projects.
include/auth/admin_socket_internal.h Declares MCP/code-project admin handler prototypes.
GETTING_STARTED.md Adds “Code Projects (Coding Harness)” section linking to docs.
docs/CODING_PROJECTS.md Adds user/operator guide for code projects feature.
DEPENDENCIES.md Documents libgit2 dependency gating/installation for code projects.
dawn.toml.example Documents MCP bridge and code-projects TOML configuration + new tool blocklist model.
dawn-admin/socket_client.h Declares dawn-admin client helpers for MCP and code-project opcodes.
dawn-admin/main.c Adds dawn-admin mcp … and dawn-admin code-project … commands.
CMakeLists.txt Adds libgit2 detection/version gate; adds global schema helper sources; adds no-process-mgmt check target.
cmake/DawnTools.cmake Adds DAWN_ENABLE_MCP_BRIDGE_TOOL and DAWN_ENABLE_CODE_PROJECTS build options and sources.


Comment thread src/dawn.c
Comment thread scripts/check_no_process_mgmt.sh
Comment thread include/tools/code_project_db.h Outdated
Comment thread docs/CODING_PROJECTS.md
Comment thread include/tools/mcp_client.h
Comment thread include/tools/mcp_client.h
Comment thread include/tools/code_project_db.h Outdated
Comment thread src/tools/code_project_git.c Outdated
Comment thread src/tools/mcp_transport_http_sse.c
KerseyFabrications added a commit that referenced this pull request Jun 15, 2026
- mcp transport (SSRF): reject a cross-origin or credentialed SSE `endpoint`
  event — it could redirect authenticated POSTs (bearer header) to another host.
  resolve_endpoint now requires the resolved URL to stay same-origin as the
  configured base URL.
- dawn.c: skip disabled / empty-alias servers when bootstrapping admin MCP
  access (was inserting empty-alias grant rows).
- check_no_process_mgmt.sh: fix the no-subprocess CI invariant's harness globs —
  webui_projects.* typo → webui_code_projects.*; drop dead v55/v56 + dawn_admin_*
  globs; add v64-v66 migrations and the dawn-admin client (main.*/socket_client.*);
  harden comment-stripping to skip multi-line block comments (so doc-comment prose
  like "daemon (...)" can't false-positive). Now scans 33 files — was missing the
  WebUI and dawn-admin harness handlers.
- code_project_git.c: return SUCCESS instead of literal 0 from the libgit2/nftw
  callbacks (named-constant convention; FAILURE already used alongside).
- docs/code_project_db.h: schema comment v56 → v65/v66; add @param Doxygen to the
  code_project_db_* and mcp_client_* public APIs; CODING_PROJECTS.md: cbm link is
  HTTP+SSE on localhost, not "a local socket".

Skipped (false positive): `_fn` typedef-suffix flag — `_fn` is the codebase
convention for function-pointer typedefs (19 existing uses).

Build clean (0 warnings); test_code_project_db 8/8, test_code_project_git 5/5,
test_mcp_bridge 6/6; check_no_process_mgmt passes; format clean.
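
The same-origin rule from the first bullet of this commit message can be illustrated with a small check. This is a sketch under stated assumptions, not the actual `resolve_endpoint`: the helper names `origin_len` and `endpoint_is_same_origin` are hypothetical, the input is assumed to be an already-resolved absolute URL, and default-port normalization is deliberately omitted (so `http://host` and `http://host:80` compare as different origins, which fails closed).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>

/* Length of the "scheme://authority" prefix, or 0 if the URL is not absolute. */
static size_t origin_len(const char *url) {
   const char *sep = strstr(url, "://");
   if (!sep || sep == url)
      return 0;
   const char *authority = sep + 3;
   size_t n = strcspn(authority, "/?#"); /* authority ends at path, query or fragment */
   if (n == 0)
      return 0;
   return (size_t)(authority - url) + n;
}

/* True only if resolved_url has the same scheme, host and port as base_url
 * and carries no "user:pass@" credentials. */
bool endpoint_is_same_origin(const char *base_url, const char *resolved_url) {
   assert(base_url != NULL && resolved_url != NULL);
   size_t base_len = origin_len(base_url);
   size_t cand_len = origin_len(resolved_url);
   if (base_len == 0 || cand_len == 0 || base_len != cand_len)
      return false;
   /* Any '@' inside the authority means embedded credentials: reject. */
   if (memchr(resolved_url, '@', cand_len) != NULL)
      return false;
   /* Scheme and host are case-insensitive; port digits compare literally. */
   return strncasecmp(base_url, resolved_url, base_len) == 0;
}
```

A production implementation would more likely lean on a real URL parser; the point here is only that the comparison covers the whole origin before any authenticated POST is sent.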
@KerseyFabrications

Contributor Author

@qodo-code-review

Copilot AI left a comment


Pull request overview

Copilot reviewed 96 out of 96 changed files in this pull request and generated 5 comments.

Comment on lines +389 to +392
/* ---- YAML sequence: many top-level "- " records, mostly structured lines ---- */
if (top_dash >= STRUCT_YAML_MIN_ITEMS &&
    (double)(top_dash + indented) >= (double)nonblank * STRUCT_YAML_INDENT_FRAC) {
   if (result_init(out) != SUCCESS)
Comment on lines +429 to +446
p = (hdr_end < end) ? hdr_end + 1 : end;
char buf[8192];
while (p < end) {
   const char *le = line_end(p, end);
   if (!line_blank(p, le)) {
      int row_len = (int)(le - p);
      /* chunk = "header\nrow" so each record is self-describing. */
      int n = snprintf(buf, sizeof(buf), "%.*s\n%.*s", hdr_len, hdr, row_len, p);
      if (n > 0) {
         if (result_add(out, buf, n < (int)sizeof(buf) ? n : (int)sizeof(buf) - 1) !=
             SUCCESS) {
            chunk_result_free(out);
            return false;
         }
      }
   }
   p = (le < end) ? le + 1 : end;
}
Comment thread src/tools/document_db.c
Comment on lines +463 to +467
for (size_t i = 0; i < nlen && esc_len + 2 < sizeof(escaped); i++) {
   if (needle[i] == '%' || needle[i] == '_' || needle[i] == '\\')
      escaped[esc_len++] = '\\';
   escaped[esc_len++] = needle[i];
}
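
The loop above stops silently once `escaped` fills up, which turns an over-long needle into a prefix match. One way to fail closed instead is sketched below. It assumes the query uses backslash as the `LIKE` escape character (as the loop implies), and `like_escape` is a hypothetical helper name, not a function from `document_db.c`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Escape %, _ and \ so the needle matches literally under LIKE with a
 * backslash escape character. Returns false, rather than truncating,
 * when the escaped needle does not fit in out. */
bool like_escape(const char *needle, char *out, size_t out_size) {
   assert(needle != NULL && out != NULL);
   if (out_size == 0)
      return false;
   size_t o = 0;
   for (size_t i = 0; needle[i] != '\0'; i++) {
      bool special = (needle[i] == '%' || needle[i] == '_' || needle[i] == '\\');
      size_t need = special ? 2 : 1;
      if (o + need + 1 > out_size) /* +1 keeps room for the terminating NUL */
         return false;
      if (special)
         out[o++] = '\\';
      out[o++] = needle[i];
   }
   out[o] = '\0';
   return true;
}
```

The caller can then report an error for an oversized needle instead of running a query that matches more than the user asked for.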
Comment on lines +351 to +352
if (piece_end == p)
   piece_end = (p + max_chars < stop) ? p + max_chars : stop; /* one giant line */
Comment thread src/tools/code_project_service.c Outdated
Comment on lines +215 to +228
char anchored[256];
snprintf(anchored, sizeof(anchored), "^(%s)$", allowed_host_pattern);
regex_t re;
int crc = regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB);
if (crc == 0) {
   ok = (regexec(&re, host, 0, NULL, 0) == 0);
   regfree(&re);
} else {
   /* Fail closed (ok stays false), but make the misconfiguration
    * diagnosable — otherwise every import silently "fails the allowlist". */
   OLOG_ERROR("code_project: allowed_host_pattern failed to compile (rc=%d) — "
              "rejecting all imports until fixed",
              crc);
}
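
To show why the `^(...)$` wrapping matters, here is a standalone sketch of the same anchored match using POSIX `regcomp`/`regexec`. The function name `host_allowed` and the example patterns are hypothetical. The explicit `snprintf` truncation check is an addition made for this sketch and is not claimed to be in the reviewed code.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stdio.h>

/* Returns true only if the WHOLE host matches the operator-supplied pattern.
 * Any compile or formatting problem fails closed. */
bool host_allowed(const char *pattern, const char *host) {
   assert(pattern != NULL && host != NULL);
   char anchored[256];
   int n = snprintf(anchored, sizeof(anchored), "^(%s)$", pattern);
   if (n < 0 || (size_t)n >= sizeof(anchored))
      return false; /* pattern too long for the buffer */
   regex_t re;
   if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
      return false; /* malformed pattern */
   bool ok = (regexec(&re, host, 0, NULL, 0) == 0);
   regfree(&re);
   return ok;
}
```

Without the anchors, a pattern such as `github\.com` would also match `github.com.evil.example`, because an unanchored `regexec` succeeds on any substring.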
@KerseyFabrications KerseyFabrications merged commit f9a2da6 into main Jun 15, 2026
5 checks passed
@KerseyFabrications KerseyFabrications deleted the coding-harness-test branch June 15, 2026 06:16
KerseyFabrications added a commit that referenced this pull request Jun 15, 2026
Copilot flagged four edge cases on PR #20; an architecture + efficiency
review then caught that two first-pass fixes targeted the wrong layer:

- CSV path ignored max_chars and emitted records that read-back truncates
  (char text[DOC_CHUNK_TEXT_MAX]); now respects max_chars and falls through
  to prose for an oversized atomic row (parity with the YAML path).
- struct_emit's "giant line" branch was unreachable, so a single line over
  max_chars was emitted whole and byte-truncated mid-UTF-8 on read-back;
  reworked to hard-split at max_chars with UTF-8 lead-byte back-off.
- chunk_structured counted any indented line as a YAML signal; now requires
  a ':' mapping key or a nested '- ' item so indented prose can't trip it.
- the over-long-needle truncation was at the grep tool layer (silent prefix
  match); now fails closed there, with the DB-layer guard as a backstop.

New fixtures for the CSV-fallback and UTF-8 oversized-line paths.
test_document_chunker 16/16, test_document_db 16/16, 0 warnings.
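
The "hard-split with UTF-8 lead-byte back-off" mentioned above can be sketched as a small helper that picks a split point. This is an illustration only: `utf8_safe_split` is a hypothetical name, and it assumes the limit is measured in bytes, which the "byte-truncated" wording suggests but the commit does not state outright.

```c
#include <assert.h>
#include <stddef.h>

/* Returns how many bytes of text[0..len) to emit as one piece: at most
 * max_bytes, backed off so the cut never lands inside a UTF-8 sequence. */
size_t utf8_safe_split(const char *text, size_t len, size_t max_bytes) {
   assert(text != NULL && max_bytes > 0);
   if (len <= max_bytes)
      return len;
   size_t n = max_bytes;
   /* text[n] would start the next piece. If it is a continuation byte
    * (10xxxxxx), step back until it is a lead byte or ASCII. */
   while (n > 0 && ((unsigned char)text[n] & 0xC0) == 0x80)
      n--;
   /* Malformed input with only continuation bytes: cut at the limit anyway
    * so the caller always makes progress. */
   return (n > 0) ? n : max_bytes;
}
```

A caller would emit `utf8_safe_split(p, stop - p, max_chars)` bytes, advance `p` by that amount, and repeat until the oversized line is consumed.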
2 participants