feat(extraction): add Clojure/ClojureScript language support#701

Open
stigi wants to merge 3 commits into colbymchenry:main from stigi:add-clojure-support

@stigi stigi commented Jun 5, 2026

Adds Clojure / ClojureScript as an indexed language: .clj, .cljs, .cljc, Babashka .bb, and .edn — including statically-connected re-frame flows and UIx/helix components.

What

  • Core language: namespaces become module nodes scoping their defs (my.app.core::process-user); defn (multi-arity, docstrings, ^:private), def/defonce, defprotocol, defrecord/deftype (fields, methods, implements, implicit ->Foo ctors), defmulti/defmethod (overloads); :require/:import with :as/:refer/prefix lists/string requires/reader conditionals. Aliased calls emit full.ns::name and resolve through the existing matchers, with zero resolver changes. Local-scope tracking (let/fn/letfn/destructuring) prevents idiomatic shadowing from fabricating call edges; interop maps to calls/references/instantiates; quoted/discarded/(comment) forms are skipped.
  • re-frame: every reg-* registration with a literal keyword becomes a function node named by its alias-expanded keyword (::subs/items → :my.app.subs/items); dispatch/subscribe/sub sites with literal event vectors emit same-named refs that the exact-name matcher links. Detection is shape-based because real apps wrap re-frame in project facades (status-mobile's utils.re-frame fronts 512 files); precision is structural: an edge needs both ends to carry the same keyword. The registrar call itself keeps its ordinary ref (callers/impact on facades).
  • UIx / helix: defui/defnc → component nodes; the $ element macro → calls edges to the composed component (gated on $ resolving to uix/helix core in the require table).
  • EDN data mode: top-level config keys become property nodes; qualified symbols in values (shadow-cljs :init-fn, clj-kondo :hooks) emit references to the code they name; never any call edges; >64-key maps are datasets and contribute no nodes.

How

The maintained grammar (sogaiu/tree-sitter-clojure, vendored ABI-14 wasm — build recipe in src/extraction/wasm/README.md) is purely lexical, so this is the first extractor running entirely through the visitNode full-takeover hook, interpreting list heads. A generic def-macro heuristic (def* + symbol first arg, plain or qualified) catches library definers (defroutes, deftest, rum/defc, hsx/defc) without knowing each macro library; a ~250-entry clojure.core blocklist suppresses call refs whose target is never in the graph.
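
A minimal sketch of the def-macro heuristic, assuming the extractor has already reduced a list to its head symbol and the kind of its first argument. `ListForm`, `localName`, and `isDefLikeForm` are hypothetical names for illustration, not the extractor's actual API; the rule encoded is only the one stated above (head name starts with `def`, plain or qualified, and the first argument is a symbol).

```typescript
// Hypothetical, simplified view of a parsed list form.
interface ListForm {
  head: string; // e.g. "defroutes" or "rum/defc"
  firstArgKind: "symbol" | "other";
}

// Strip an optional alias/namespace qualifier: "rum/defc" -> "defc".
function localName(symbol: string): string {
  const slash = symbol.lastIndexOf("/");
  // A bare "/" is the division function, not a qualifier.
  return slash > 0 ? symbol.slice(slash + 1) : symbol;
}

// def* head plus a symbol as first argument counts as a definer.
function isDefLikeForm(form: ListForm): boolean {
  return localName(form.head).indexOf("def") === 0 && form.firstArgKind === "symbol";
}
```

Because the check is purely lexical, it needs no per-library knowledge; a blocklist like the one mentioned above would still have to filter core forms that should not be treated as definers.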

fix(explore) — real-world testing on a 2k-file cljs monorepo exposed four retrieval gaps in codegraph_explore's query handling, fixed here because without them the new graph is unreachable by agents: (1) the symbol-token filter rejected every Clojure name (kebab-case/?/!/alias/name/:keyword — the flow builder silently never ran); (2) bare tokens now also resolve namespaces by last segment (the Clojure way to reference a subsystem); (3) ambiguous tokens prefer candidates co-located with the other tokens' locations instead of the biggest same-named def anywhere in the monorepo; (4) colon-less keywords (app/set-page-state) resolve to their registration node. Unit-tested on a fixture reproducing the failing session, including no-hijack guards; Alamofire/metabase control probes unchanged.
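
To illustrate gap (1), here is a sketch of a widened token filter. The exact character set used in the PR is not shown in this description, so `SYMBOL_PART` below is an assumed approximation of the Lisp symbol alphabet, and `isSymbolToken` is a hypothetical name.

```typescript
// Before: identifier-style tokens only, which rejects kebab-case, ?, !, / and leading colons.
const WORD_TOKEN = /^\w+$/;

// After (assumed alphabet): Clojure symbol characters, an optional leading colon,
// and at most one "/" separating the namespace or alias from the name.
const SYMBOL_PART = "[A-Za-z_*+!?<>=.-][A-Za-z0-9_*+!?<>=.'-]*";
const CLOJURE_TOKEN = new RegExp(`^:?${SYMBOL_PART}(?:/${SYMBOL_PART})?$`);

function isSymbolToken(token: string): boolean {
  return WORD_TOKEN.test(token) || CLOJURE_TOKEN.test(token);
}
```

With this shape, `on-route-change+`, `valid?`, `dashboard/set-state`, and `:profile/logout` all pass, while free text containing spaces or more than one `/` is still rejected.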

fix(tests) — vitest workers now run with --liftoff-only like every production launch path (#293/#298); without it the turboshaft Zone OOM kills a worker mid-file on arm64/Node 24 and ~90 tests silently never run (reproducible on main).
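
A sketch of what such a worker configuration could look like. It assumes a Vitest version that exposes `poolOptions.threads.execArgv` and `poolOptions.forks.execArgv`; the repo's real config file may be organised differently.

```typescript
// vitest.config.ts (illustrative only)
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    poolOptions: {
      // Pass the V8 flag to workers in both pools so WASM is compiled by Liftoff only.
      threads: { execArgv: ["--liftoff-only"] },
      forks: { execArgv: ["--liftoff-only"] },
    },
  },
});
```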

Validation

| Repo | Tier | Files | Nodes | Edges | Notes |
|---|---|---|---|---|---|
| ring | S | 84 | 1,144 | 2,468 | |
| re-frame (lib+todomvc) | S | 70 | 1,699 | | 269 kw regs / 194 kw edges |
| athens | M | 189 | 3,991 | | 292 / 552 |
| logseq | M | 1,312 | 29,560 | 90,842 | hsx components; 99 .edn |
| status-mobile | L | 2,050 | 28,206 | | 1,635 / 2,323 via facade |
| metabase | L | 15,374 | 195,979 | 623,318 | |
| uix | S | | | | 131 components |

Node counts stable on re-index; grammar health-checked 20/20 in a multi-grammar runtime. codegraph_node :profile/logout (status-mobile) returns the handler + all 13 dispatch sites in one call.

Agent A/B (canonical flow question, headless opus): logseq 4 calls / 0 Read / 0 Grep / 69s / $0.48 vs 155 calls / ~40 Read / 248s / $1.23 · metabase 14 calls / 6 Read / 113s vs 24 / 16 / 197s · status-mobile (n=2) 12–14 / 3–6 Read / 72–88s vs 25–33 / 10–15 / ~122s · ring (S, n=2) parity — the known small-repo pattern. Both arms always answered correctly.

52 new tests; full suite 60/60. Coverage-matrix entry in docs/design/dynamic-dispatch-coverage-playbook.md.

Known gaps (deliberate v1)

  • extend-protocol/extend-type: bodies walked, no implements/method nodes.
  • re-frame: variable event vectors, rf/defn {:events [...]} macros, fx-map keys, handler→sub app-db data-flow.
  • integrant ig/init-key ↔ system.edn keys, multimethod dispatch-value edges — natural follow-ups on the keyword mechanism.
  • Bare-symbol HOF args link only when defined earlier in the same file; nested EDN keys aren't nodes.

🤖 Generated with Claude Code

@stigi stigi marked this pull request as draft June 5, 2026 11:50
stigi added a commit to stigi/codegraph that referenced this pull request Jun 5, 2026
…n & provenance

Review findings on colbymchenry#701, all addressed:

- Local-scope tracking (the Medium): binding names from let/loop/for/
  doseq/binding vecs, fn/defn/defmethod/method-impl params, letfn names,
  and as->/catch bindings now join a locals frame stack; bare symbols
  matching a frame emit nothing, and a locally-shadowed HEAD call
  ((let [helper (mk)] (helper 1))) emits nothing either — shadowing a
  same-file fn name is idiomatic Clojure and produced false calls edges
  (re-indexing ring removed 207 of them, ~8% of its call edges; node
  counts unchanged).
- (.-value el) ClojureScript property reads now emit references, not
  calls; (.method obj) stays calls.
- definline removed from the core-forms blocklist — the def-macro
  heuristic now extracts it as a function.
- Same-file HOF lookup is a lazily-rebuilt name Set instead of a linear
  ctx.nodes scan per bare symbol (removes the O(symbols × nodes)
  worst case on god-files).
- Vendored-wasm provenance: src/extraction/wasm/README.md records the
  exact grammar commit (sogaiu/tree-sitter-clojure e43eff8), tree-sitter
  CLI version (0.26.9), and build command; grammars.ts points to it.
- Comments documenting accepted trade-offs: .bb BitBake collision,
  foreign-multimethod defmethod naming, second-ns-form nesting.
- 11 new regression tests (shadowing incl. head position and scope-end,
  for/doseq modifiers, as->/catch bindings, .-property kind, definline,
  ^:private metadata, string requires, prefix lists, letfn, .bb content).
- Playbook coverage matrix gains the Clojure row with the S/M/L bench
  numbers, incl. a second ring A/B run (n=2) per the validation
  methodology.
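
The locals frame stack from the first finding can be pictured with the sketch below. `LocalScopes` and `shouldEmitCall` are hypothetical names; the real extractor's data structures are not shown in this message, so treat this as one possible shape under that assumption.

```typescript
// Hypothetical locals tracker: one Set of bound names per binding form.
class LocalScopes {
  private frames: Set<string>[] = [];

  // Enter a binding form (let, fn params, letfn, ...) with the names it binds.
  push(names: string[]): void {
    this.frames.push(new Set(names));
  }

  // Leave the binding form; its names stop shadowing.
  pop(): void {
    this.frames.pop();
  }

  // True when any enclosing frame binds the name.
  isShadowed(name: string): boolean {
    return this.frames.some((frame) => frame.has(name));
  }
}

// A head symbol that is a local binding produces no call reference.
function shouldEmitCall(head: string, scopes: LocalScopes): boolean {
  return !scopes.isShadowed(head);
}
```

For `(let [helper (mk)] (helper 1))`, the walker would push `["helper"]` on entering the `let`, see `helper` shadowed at the head position, and pop the frame at the closing paren so later uses of a same-file `helper` fn resolve normally.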

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@stigi stigi force-pushed the add-clojure-support branch 9 times, most recently from 858ff96 to a019b3b Compare June 5, 2026 14:53
Index .clj/.cljs/.cljc/.bb and .edn files: namespaces, defn/def
(multi-arity, function-valued, private), defprotocol, defrecord/deftype
(fields, methods, implements, implicit ->Ctor fns), defmulti/defmethod,
and :require/:import clauses.

The maintained grammar (sogaiu/tree-sitter-clojure, vendored ABI-14
wasm — build recipe in src/extraction/wasm/README.md) is purely
lexical: `(defn foo [x] ...)` is just a list_lit. So extraction runs
entirely through the visitNode full-takeover hook (the Pascal
precedent), interpreting list heads. A generic def-macro heuristic also
catches library definers (defroutes, deftest, definline, rum/defc,
hsx/defc) so e.g. logseq's UI components are indexed.

Calls through :as aliases and :refer'd symbols emit `full.ns::name`
references that resolve via the existing qualified-name matcher, and
requires name-match the target namespace's module node — zero resolver
changes. Local-scope tracking (let/loop/for bindings, fn params, letfn,
as->/catch) suppresses shadowed names so idiomatic shadowing of a
same-file fn never fabricates call edges; quoted/discarded/comment
forms are skipped; reader conditionals descend both branches;
(.-prop x) reads emit references, (.method x) calls.
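
As an illustration of the alias and :refer handling, the sketch below turns a call head into a `full.ns::name` reference. `RequireTable` and `qualify` are hypothetical; only the `::` separator and the alias/refer behaviour come from the text above.

```typescript
// Hypothetical require table for one namespace.
interface RequireTable {
  aliases: Map<string, string>; // :as alias -> full namespace
  refers: Map<string, string>;  // :refer'd symbol -> defining namespace
}

// Returns the qualified reference name, or undefined when the symbol
// cannot be attributed to a required namespace.
function qualify(symbol: string, table: RequireTable): string | undefined {
  const slash = symbol.indexOf("/");
  if (slash > 0) {
    const ns = table.aliases.get(symbol.slice(0, slash));
    return ns === undefined ? undefined : `${ns}::${symbol.slice(slash + 1)}`;
  }
  const referredFrom = table.refers.get(symbol);
  return referredFrom === undefined ? undefined : `${referredFrom}::${symbol}`;
}
```

Given `(:require [clojure.string :as str])`, the head `str/join` would become `clojure.string::join`, a name an existing qualified-name matcher can link without resolver changes.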

UIx and helix React components are first-class: defui/defnc produce
`component` nodes and the `$` element macro emits calls edges to the
composed component (`($ ui/button ...)` → button), gated on `$`
resolving to uix.core/helix.core in the require table. pitch-io/uix:
131 components with composition edges.
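
The gating on `$` might be sketched as follows. `isElementMacro` and the two map parameters are hypothetical shapes for the require table; the only facts taken from the text are the two accepted namespaces.

```typescript
const ELEMENT_NAMESPACES = ["uix.core", "helix.core"];

// refers: :refer'd symbol -> namespace; aliases: :as alias -> namespace.
function isElementMacro(
  head: string,
  refers: Map<string, string>,
  aliases: Map<string, string>,
): boolean {
  let ns: string | undefined;
  if (head === "$") {
    ns = refers.get("$"); // bare $, must have been :refer'd
  } else if (head.length > 2 && head.slice(-2) === "/$") {
    ns = aliases.get(head.slice(0, -2)); // alias-qualified, e.g. uix/$
  }
  return ns !== undefined && ELEMENT_NAMESPACES.indexOf(ns) >= 0;
}
```

A `$` that comes from any other library, or that is not in the require table at all, yields no composition edge under this rule.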

re-frame's keyword-keyed dispatch connects statically: every reg-*
registration with a literal keyword becomes a function node NAMED by
its alias-expanded keyword (::subs/items → :my.app.subs/items), and
dispatch/dispatch-sync/subscribe/sub sites with a literal event vector
emit same-named calls refs the exact-name matcher links. Detection is
shape-based because real apps front re-frame with project facades
(status-mobile's utils.re-frame covers 512 files with custom
registrars); precision is structural — an edge needs both ends to
carry the same keyword. The registrar call itself keeps its ordinary
call ref, so callers/impact on a facade still see every registration
site (status-mobile: 3,654 calls into utils.re-frame's defs).
status-mobile: 1,635 registrations, 2,323 keyword edges;
codegraph_node :profile/logout returns the handler plus all 13 dispatch
sites in one call.
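
The alias expansion that names both ends of a keyword edge can be sketched like this. `expandKeyword` is a hypothetical helper, and returning `undefined` for an unknown alias is an assumption made for the sketch, not documented behaviour.

```typescript
// Expand an auto-resolved keyword to its fully qualified form.
//   ::subs/items -> :my.app.subs/items  (alias looked up in the require table)
//   ::items      -> :<current-ns>/items
// Plain keywords (:profile/logout, :items) are returned unchanged.
function expandKeyword(
  keyword: string,
  currentNs: string,
  aliases: Map<string, string>,
): string | undefined {
  if (keyword.indexOf("::") !== 0) return keyword;
  const body = keyword.slice(2);
  const slash = body.indexOf("/");
  if (slash < 0) return `:${currentNs}/${body}`;
  const ns = aliases.get(body.slice(0, slash));
  // Assumption: an unknown alias produces nothing rather than a guessed name.
  return ns === undefined ? undefined : `:${ns}/${body.slice(slash + 1)}`;
}
```

Since a registration in `my.app.subs` written as `::items` and a subscribe site elsewhere written as `::subs/items` both expand to `:my.app.subs/items`, an exact-name match is enough to connect them.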

.edn files extract in data mode: top-level map keys become property
nodes, qualified symbols in values (shadow-cljs :init-fn, integrant
handlers, clj-kondo :hooks) emit references to the code they name, and
no call edges are ever emitted. Maps with >64 top-level keys are
datasets (locale dicts), not config — no nodes (refs still scanned).
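
A rough sketch of the two data-mode decisions, with hypothetical names (`planEdnNodes`, `isQualifiedSymbol`) and an assumed symbol pattern. Only the 64-key threshold and the "qualified symbols become references" rule are taken from the text.

```typescript
type EdnNodePlan =
  | { kind: "dataset" } // too many keys: no property nodes
  | { kind: "config"; propertyKeys: string[] };

// Threshold from the description: more than 64 top-level keys means dataset.
const MAX_CONFIG_KEYS = 64;

function planEdnNodes(topLevelKeys: string[]): EdnNodePlan {
  if (topLevelKeys.length > MAX_CONFIG_KEYS) return { kind: "dataset" };
  return { kind: "config", propertyKeys: topLevelKeys };
}

// Qualified symbol such as my.app.core/init. Keywords (leading ":") and
// unqualified names are not candidates for code references.
function isQualifiedSymbol(valueText: string): boolean {
  return /^[A-Za-z_][\w.*+!?<>=-]*\/[A-Za-z_*+!?<>=-][\w*+!?<>='-]*$/.test(valueText);
}
```

Values are scanned for references in both cases; the threshold only decides whether the keys themselves become nodes.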

Validated on ring (84 files / 2.5k edges), logseq (1,312 / 91k incl.
99 .edn), metabase (15,374 / 623k), re-frame todomvc, athens, and
status-mobile (2,050 / 1,635 keyword registrations): extraction PASS,
node counts stable on re-sync. Agent A/B: logseq's canonical flow
155→4 tool calls, 0 Read/0 Grep, 3.6× faster, 2.5× cheaper;
status-mobile's logout flow (n=2) half the calls and Reads at ~1.7×
speed. 47 extraction tests; coverage matrix entry in
docs/design/dynamic-dispatch-coverage-playbook.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@stigi stigi force-pushed the add-clojure-support branch from a019b3b to 0a0005b Compare June 5, 2026 15:04
stigi and others added 2 commits June 5, 2026 20:20
…tokens

Four layers, all surfaced by one real-world session on a 2k-file cljs
monorepo (the agent's two explores returned wrong-subsystem noise and
it fell back to 6 Reads):

1. Token charset: the symbol-token filter only accepted \w identifier
   chars, so NO Clojure symbol survived — not kebab-case
   (on-route-change+), not predicates (valid?), not alias-qualified
   set-state/dashboard or keyword :profile/logout forms. The flow
   builder and named-seed injection silently never ran on Clojure
   repos. Widened to the Lisp symbol alphabet with / qualifiers and
   optional leading colon; clj extensions added to the
   strip-file-extension list.

2. Module tokens: a bare token can name a NAMESPACE by its last
   segment — the Clojure norm ("the deactivate stage" = ns
   app.page.lifecycle.deactivate, whose fns are named per page type).
   Callable-only resolution made those tokens contribute nothing, or
   latch onto an unrelated same-named fn in another subsystem (the
   only function literally named `deactivate` was the SCIM backend's).
   Exact last-segment module matches now inject as file pointers and
   serve as location anchors.

3. Co-location: an ambiguous bare token used to resolve independently
   to its most-substantive def anywhere in the monorepo. The agent's
   bag of names describes ONE flow, so tokens are spatially coherent:
   ambiguous candidates now prefer max path-proximity to the anchor
   dirs (specific tokens' + module matches' locations), taking all
   ties up to 3 — when the per-stage overloads of one name all live
   beside the anchors (lifecycle activate/set-state/deactivate
   `dashboard`), they are all the answer.

4. Colon-less keywords: agents write the re-frame event
   `:app/set-page-state` as `app/set-page-state` (and then grep for it
   when the lookup misses). Namespaced slash tokens without a leading
   colon now resolve to the colon-prefixed registration node, in both
   findAllSymbols and exact matching — gated on the `/` so plain names
   can't be hijacked by same-named unqualified keywords.
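
The co-location rule in layer 3 could be approximated as below. This is a sketch under assumptions: proximity is modelled as the number of shared leading directory segments, and `Candidate`, `pickColocated`, and the helper names are hypothetical. The real scoring may differ.

```typescript
interface Candidate {
  name: string;
  filePath: string; // e.g. "src/app/page/lifecycle/activate.cljs"
}

// Directory segments of a file path (file name dropped).
function dirSegments(path: string): string[] {
  const parts = path.split("/");
  return parts.slice(0, parts.length - 1);
}

// Proximity = number of leading directory segments shared with an anchor dir.
function sharedPrefixLength(a: string[], b: string[]): number {
  let n = 0;
  while (n < a.length && n < b.length && a[n] === b[n]) n++;
  return n;
}

function pickColocated(candidates: Candidate[], anchorDirs: string[], maxTies = 3): Candidate[] {
  const anchors = anchorDirs.map((dir) => dir.split("/"));
  const scored = candidates.map((candidate) => {
    const dirs = dirSegments(candidate.filePath);
    const score = anchors.reduce((best, anchor) => Math.max(best, sharedPrefixLength(dirs, anchor)), 0);
    return { candidate, score };
  });
  const top = scored.reduce((best, s) => Math.max(best, s.score), 0);
  // Keep every candidate tied for the best score, capped at maxTies.
  return scored.filter((s) => s.score === top).slice(0, maxTies).map((s) => s.candidate);
}
```

With anchors in `src/app/page/lifecycle`, the per-stage `dashboard` overloads beside the anchors outrank a same-named def in an unrelated directory, and all of them are returned rather than one arbitrary winner.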

All four layers are covered by unit tests on a miniature monorepo
fixture reproducing the failing session's shapes
(__tests__/explore-clojure-tokens.test.ts), including the negative
guards: a bare name is never hijacked by a same-named unqualified
keyword (gated on the `/` in BOTH findAllSymbols and matchesSymbol),
and the co-location pick is exercised on a genuinely ambiguous name
with a bigger-bodied wrong-subsystem decoy.
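
The no-hijack guard for colon-less keywords reduces to a small gate, sketched here with a hypothetical `resolveKeywordToken` over a set of node names. The actual checks live in findAllSymbols and matchesSymbol, whose code is not shown in this message.

```typescript
// nodeNames holds registration node names as indexed, e.g. ":app/set-page-state".
function resolveKeywordToken(token: string, nodeNames: Set<string>): string | undefined {
  if (nodeNames.has(token)) return token; // exact match first
  const isNamespaced = token.indexOf("/") > 0;
  const hasColon = token.charAt(0) === ":";
  // Only slash-qualified tokens may gain a colon; a plain name such as
  // "logout" must never resolve to an unqualified keyword ":logout".
  if (isNamespaced && !hasColon && nodeNames.has(":" + token)) return ":" + token;
  return undefined;
}
```

Under this gate, `app/set-page-state` finds `:app/set-page-state`, while the bare token `logout` is left to ordinary symbol resolution even if a `:logout` keyword node exists.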

The failing session's exact query now renders all three lifecycle
stage files with their call inventories in one explore payload instead
of SCIM + playground noise. Controls intact: Alamofire's god-file
multi-phase invariant (Request spine + Validation.validate + Session
in one ~12K payload) and metabase's TS-side dashboard probe unchanged;
full suite green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Every production launch path already pins V8's WASM compilation to the
Liftoff baseline tier (src/extraction/wasm-runtime-flags.ts, colbymchenry#293/colbymchenry#298)
— but vitest workers didn't, and suites that load every grammar in
beforeAll (extraction.test.ts) could hit the turboshaft Zone OOM during
background grammar compilation. Reproduced reliably on main on an arm64
Mac with Node 24.16: the tinypool worker dies mid-file ("Worker exited
unexpectedly", 1 unhandled error) and ~90 tests at the end of the file
silently never run.

With execArgv: ['--liftoff-only'] on both pools the full suite runs
clean: 59/59 files, 0 unhandled errors, and the previously-vanishing
tests execute.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@stigi stigi force-pushed the add-clojure-support branch from 0a0005b to 3b52163 Compare June 5, 2026 18:20
@stigi stigi marked this pull request as ready for review June 5, 2026 18:22