feat(extraction): add Clojure/ClojureScript language support #701
Open
stigi wants to merge 3 commits into
stigi added a commit to stigi/codegraph that referenced this pull request Jun 5, 2026
…n & provenance

Review findings on colbymchenry#701, all addressed:
- Local-scope tracking (the Medium): binding names from let/loop/for/doseq/binding vecs, fn/defn/defmethod/method-impl params, letfn names, and as->/catch bindings now join a locals frame stack; bare symbols matching a frame emit nothing, and a locally-shadowed HEAD call ((let [helper (mk)] (helper 1))) emits nothing either. Shadowing a same-file fn name is idiomatic Clojure and produced false calls edges (re-indexing ring removed 207 of them, ~8% of its call edges; node counts unchanged).
- (.-value el) ClojureScript property reads now emit references, not calls; (.method obj) stays calls.
- definline removed from the core-forms blocklist; the def-macro heuristic now extracts it as a function.
- Same-file HOF lookup is a lazily-rebuilt name Set instead of a linear ctx.nodes scan per bare symbol (removes the O(symbols × nodes) worst case on god-files).
- Vendored-wasm provenance: src/extraction/wasm/README.md records the exact grammar commit (sogaiu/tree-sitter-clojure e43eff8), tree-sitter CLI version (0.26.9), and build command; grammars.ts points to it.
- Comments documenting accepted trade-offs: .bb BitBake collision, foreign-multimethod defmethod naming, second-ns-form nesting.
- 11 new regression tests (shadowing incl. head position and scope-end, for/doseq modifiers, as->/catch bindings, .-property kind, definline, ^:private metadata, string requires, prefix lists, letfn, .bb content).
- Playbook coverage matrix gains the Clojure row with the S/M/L bench numbers, incl. a second ring A/B run (n=2) per the validation methodology.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
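The locals frame stack described in the first finding can be pictured roughly as follows. This is a minimal sketch, not the PR's code: `LocalScopes` and `shouldEmitCall` are hypothetical names, and the same-file function set stands in for whatever the extractor actually consults.

```typescript
// Hypothetical sketch of a locals frame stack; names are illustrative only.
class LocalScopes {
  private frames: Set<string>[] = [];

  // Entering a binding form (let, fn params, letfn, ...) pushes its names.
  push(names: Iterable<string>): void {
    this.frames.push(new Set(names));
  }

  // Leaving the form pops the frame, so shadowing ends with the scope.
  pop(): void {
    this.frames.pop();
  }

  // A symbol is local if any enclosing frame binds it.
  isLocal(name: string): boolean {
    return this.frames.some((frame) => frame.has(name));
  }
}

// A list head only yields a call edge when it is not locally shadowed
// and names a function defined in the same file (assumed lookup).
function shouldEmitCall(
  head: string,
  scopes: LocalScopes,
  sameFileFns: Set<string>,
): boolean {
  if (scopes.isLocal(head)) return false;
  return sameFileFns.has(head);
}

const scopes = new LocalScopes();
const sameFileFns = new Set(["helper"]);

const beforeLet = shouldEmitCall("helper", scopes, sameFileFns);
scopes.push(["helper"]); // entering (let [helper (mk)] (helper 1))
const insideLet = shouldEmitCall("helper", scopes, sameFileFns);
scopes.pop(); // leaving the let body
const afterLet = shouldEmitCall("helper", scopes, sameFileFns);
```

Popping the frame at scope end is what the "scope-end" regression test in the list above would exercise: the same name must produce an edge again once the binding form closes.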
858ff96 to
a019b3b
Index .clj/.cljs/.cljc/.bb and .edn files: namespaces, defn/def (multi-arity, function-valued, private), defprotocol, defrecord/deftype (fields, methods, implements, implicit ->Ctor fns), defmulti/defmethod, and :require/:import clauses.

The maintained grammar (sogaiu/tree-sitter-clojure, vendored ABI-14 wasm; build recipe in src/extraction/wasm/README.md) is purely lexical: `(defn foo [x] ...)` is just a list_lit. So extraction runs entirely through the visitNode full-takeover hook (the Pascal precedent), interpreting list heads. A generic def-macro heuristic also catches library definers (defroutes, deftest, definline, rum/defc, hsx/defc) so e.g. logseq's UI components are indexed.

Calls through :as aliases and :refer'd symbols emit `full.ns::name` references that resolve via the existing qualified-name matcher, and requires name-match the target namespace's module node, with zero resolver changes.

Local-scope tracking (let/loop/for bindings, fn params, letfn, as->/catch) suppresses shadowed names so idiomatic shadowing of a same-file fn never fabricates call edges; quoted/discarded/comment forms are skipped; reader conditionals descend both branches; (.-prop x) reads emit references, (.method x) calls.

UIx and helix React components are first-class: defui/defnc produce `component` nodes and the `$` element macro emits calls edges to the composed component (`($ ui/button ...)` → button), gated on `$` resolving to uix.core/helix.core in the require table. pitch-io/uix: 131 components with composition edges.
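To make the `full.ns::name` reference shape concrete, here is a small sketch of how an aliased or referred symbol could be qualified against a require table. The `RequireTable` type and `qualify` function are assumptions for illustration; only the `full.ns::name` output format comes from the commit message.

```typescript
// Hypothetical require table for one namespace form:
// (:require [my.app.ui :as ui] [my.app.util :refer [slugify]])
interface RequireTable {
  aliases: Map<string, string>; // alias -> full namespace
  referred: Map<string, string>; // referred symbol -> full namespace
}

// Returns a `full.ns::name` reference, or null when the symbol is not
// covered by the require table (local, core, or unknown).
function qualify(symbol: string, req: RequireTable): string | null {
  const slash = symbol.indexOf("/");
  if (slash > 0) {
    // alias/name form: expand the alias to its full namespace.
    const ns = req.aliases.get(symbol.slice(0, slash));
    return ns ? `${ns}::${symbol.slice(slash + 1)}` : null;
  }
  // Bare symbol: only qualifies if it was :refer'd.
  const ns = req.referred.get(symbol);
  return ns ? `${ns}::${symbol}` : null;
}

const req: RequireTable = {
  aliases: new Map([["ui", "my.app.ui"]]),
  referred: new Map([["slugify", "my.app.util"]]),
};

const aliased = qualify("ui/button", req);
const referred = qualify("slugify", req);
const unknown = qualify("other/thing", req);
```

Because the output is an ordinary qualified name, an existing qualified-name matcher can link it without Clojure-specific resolver logic, which is the point the commit message makes.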
re-frame's keyword-keyed dispatch connects statically: every reg-* registration with a literal keyword becomes a function node NAMED by its alias-expanded keyword (::subs/items → :my.app.subs/items), and dispatch/dispatch-sync/subscribe/sub sites with a literal event vector emit same-named calls refs the exact-name matcher links. Detection is shape-based because real apps front re-frame with project facades (status-mobile's utils.re-frame covers 512 files with custom registrars); precision is structural: an edge needs both ends to carry the same keyword. The registrar call itself keeps its ordinary call ref, so callers/impact on a facade still see every registration site (status-mobile: 3,654 calls into utils.re-frame's defs). status-mobile: 1,635 registrations, 2,323 keyword edges; codegraph_node :profile/logout returns the handler plus all 13 dispatch sites in one call.

.edn files extract in data mode: top-level map keys become property nodes, qualified symbols in values (shadow-cljs :init-fn, integrant handlers, clj-kondo :hooks) emit references to the code they name, and no call edges are ever emitted. Maps with >64 top-level keys are datasets (locale dicts), not config, so they produce no nodes (refs still scanned).

Validated on ring (84 files / 2.5k edges), logseq (1,312 / 91k incl. 99 .edn), metabase (15,374 / 623k), re-frame todomvc, athens, and status-mobile (2,050 / 1,635 keyword registrations): extraction PASS, node counts stable on re-sync. Agent A/B: logseq's canonical flow 155→4 tool calls, 0 Read/0 Grep, 3.6× faster, 2.5× cheaper; status-mobile's logout flow (n=2) half the calls and Reads at ~1.7× speed.

47 extraction tests; coverage matrix entry in docs/design/dynamic-dispatch-coverage-playbook.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
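The alias expansion that names registration nodes (`::subs/items` → `:my.app.subs/items`) follows Clojure's auto-resolved keyword rules. A minimal sketch, assuming a hypothetical `expandKeyword` helper and an alias map taken from the file's `:require` clauses:

```typescript
// Hypothetical helper: expand an auto-resolved keyword to its full form.
// `currentNs` is the namespace of the file being extracted; `aliases`
// maps :as aliases to full namespaces.
function expandKeyword(
  kw: string,
  currentNs: string,
  aliases: Map<string, string>,
): string | null {
  // Plain keywords (:profile/logout, :items) are already fully spelled.
  if (!kw.startsWith("::")) return kw;

  const body = kw.slice(2);
  const slash = body.indexOf("/");

  // ::items resolves against the current namespace.
  if (slash === -1) return `:${currentNs}/${body}`;

  // ::subs/items resolves the alias; an unknown alias cannot be named.
  const ns = aliases.get(body.slice(0, slash));
  return ns ? `:${ns}/${body.slice(slash + 1)}` : null;
}

const aliases = new Map([["subs", "my.app.subs"]]);

const viaAlias = expandKeyword("::subs/items", "my.app.views", aliases);
const viaCurrentNs = expandKeyword("::items", "my.app.views", aliases);
const alreadyFull = expandKeyword(":profile/logout", "my.app.views", aliases);
```

Since both the registration and the dispatch site go through the same expansion, an edge only forms when the two ends spell the same keyword, which is the structural precision the message describes.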
a019b3b to
0a0005b
…tokens
Four layers, all surfaced by one real-world session on a 2k-file cljs
monorepo (the agent's two explores returned wrong-subsystem noise and
it fell back to 6 Reads):
1. Token charset: the symbol-token filter only accepted \w identifier
chars, so NO Clojure symbol survived — not kebab-case
(on-route-change+), not predicates (valid?), not alias-qualified
set-state/dashboard or keyword :profile/logout forms. The flow
builder and named-seed injection silently never ran on Clojure
repos. Widened to the Lisp symbol alphabet with / qualifiers and
optional leading colon; clj extensions added to the
strip-file-extension list.
2. Module tokens: a bare token can name a NAMESPACE by its last
segment — the Clojure norm ("the deactivate stage" = ns
app.page.lifecycle.deactivate, whose fns are named per page type).
Callable-only resolution made those tokens contribute nothing, or
latch onto an unrelated same-named fn in another subsystem (the
only function literally named `deactivate` was the SCIM backend's).
Exact last-segment module matches now inject as file pointers and
serve as location anchors.
3. Co-location: an ambiguous bare token used to resolve independently
to its most-substantive def anywhere in the monorepo. The agent's
bag of names describes ONE flow, so tokens are spatially coherent:
ambiguous candidates now prefer max path-proximity to the anchor
dirs (specific tokens' + module matches' locations), taking all
ties up to 3 — when the per-stage overloads of one name all live
beside the anchors (lifecycle activate/set-state/deactivate
`dashboard`), they are all the answer.
4. Colon-less keywords: agents write the re-frame event
`:app/set-page-state` as `app/set-page-state` (and then grep for it
when the lookup misses). Namespaced slash tokens without a leading
colon now resolve to the colon-prefixed registration node, in both
findAllSymbols and exact matching — gated on the `/` so plain names
can't be hijacked by same-named unqualified keywords.
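Layers 1 and 4 can be sketched together as below. The regular expressions and the `lookupNames` helper are assumptions written for illustration, not the filter this commit ships; they only show why a `\w`-only filter drops Clojure symbols and how the `/` gate keeps plain names from matching keywords.

```typescript
// Old behaviour (as described in layer 1): identifier chars only.
const WORD_ONLY = /^\w+$/;

// Hypothetical widened filter: Lisp symbol alphabet, optional leading
// colon, and at most one `/` namespace qualifier.
const CLOJURE_TOKEN =
  /^:?[A-Za-z_*+!?<>=-][\w*+!?<>=.'-]*(?:\/[A-Za-z_*+!?<>=-][\w*+!?<>=.'-]*)?$/;

// Layer 4: a colon-less namespaced token may also name a keyword
// registration node. Gated on the `/`, so a bare name never expands.
function lookupNames(token: string): string[] {
  if (!token.startsWith(":") && token.includes("/")) {
    return [token, `:${token}`];
  }
  return [token];
}

const samples = [
  "on-route-change+",
  "valid?",
  "set-state/dashboard",
  ":profile/logout",
];
const survivedOld = samples.filter((s) => WORD_ONLY.test(s));
const survivedNew = samples.filter((s) => CLOJURE_TOKEN.test(s));

const namespaced = lookupNames("app/set-page-state");
const bare = lookupNames("logout");
```

With the `\w`-only pattern none of the four sample tokens survive, which matches the "NO Clojure symbol survived" observation in layer 1.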
All four layers are covered by unit tests on a miniature monorepo
fixture reproducing the failing session's shapes
(__tests__/explore-clojure-tokens.test.ts), including the negative
guards: a bare name is never hijacked by a same-named unqualified
keyword (gated on the `/` in BOTH findAllSymbols and matchesSymbol),
and the co-location pick is exercised on a genuinely ambiguous name
with a bigger-bodied wrong-subsystem decoy.
The failing session's exact query now renders all three lifecycle
stage files with their call inventories in one explore payload instead
of SCIM + playground noise. Controls intact: Alamofire's god-file
multi-phase invariant (Request spine + Validation.validate + Session
in one ~12K payload) and metabase's TS-side dashboard probe unchanged;
full suite green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
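The co-location rule from layer 3 (prefer maximum path proximity to the anchor directories, keep all ties up to 3) might look like the following. This is a sketch under assumptions: proximity is taken here to be the number of shared leading directory segments, and the file paths are invented to mirror the lifecycle example.

```typescript
// Number of leading path segments two directories share.
function sharedPrefixDepth(a: string, b: string): number {
  const as = a.split("/");
  const bs = b.split("/");
  let n = 0;
  while (n < as.length && n < bs.length && as[n] === bs[n]) n++;
  return n;
}

// Proximity of a candidate file to the closest anchor directory.
function proximity(file: string, anchorDirs: string[]): number {
  const dir = file.includes("/") ? file.slice(0, file.lastIndexOf("/")) : "";
  return anchorDirs.reduce(
    (best, anchor) => Math.max(best, sharedPrefixDepth(dir, anchor)),
    0,
  );
}

// Keep every candidate tied for the best proximity, capped at maxTies.
function pickColocated(
  candidates: string[],
  anchorDirs: string[],
  maxTies = 3,
): string[] {
  const scored = candidates.map((file) => ({
    file,
    score: proximity(file, anchorDirs),
  }));
  const best = Math.max(...scored.map((s) => s.score));
  return scored
    .filter((s) => s.score === best)
    .slice(0, maxTies)
    .map((s) => s.file);
}

// Invented paths: three per-stage overloads beside the anchor, one decoy.
const picked = pickColocated(
  [
    "src/app/page/lifecycle/activate.cljs",
    "src/app/page/lifecycle/set_state.cljs",
    "src/app/page/lifecycle/deactivate.cljs",
    "src/backend/scim/handlers.clj",
  ],
  ["src/app/page/lifecycle"],
);
```

Returning all ties rather than a single winner is what lets the per-stage overloads of one name come back together when they all sit beside the anchors.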
Every production launch path already pins V8's WASM compilation to the Liftoff baseline tier (src/extraction/wasm-runtime-flags.ts, colbymchenry#293/colbymchenry#298), but vitest workers didn't, and suites that load every grammar in beforeAll (extraction.test.ts) could hit the turboshaft Zone OOM during background grammar compilation.

Reproduced reliably on main on an arm64 Mac with Node 24.16: the tinypool worker dies mid-file ("Worker exited unexpectedly", 1 unhandled error) and ~90 tests at the end of the file silently never run. With execArgv: ['--liftoff-only'] on both pools the full suite runs clean: 59/59 files, 0 unhandled errors, and the previously-vanishing tests execute.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
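For orientation, the "execArgv on both pools" change might correspond to a test-options object shaped like the one below. This is an assumption about the config layout (Vitest's `poolOptions.threads` / `poolOptions.forks` as I understand them), written as a plain object so it stands alone; the PR's actual vitest config is not shown in this page and may differ.

```typescript
// Assumed shape: the object one would pass as `test` in a vitest config.
// Only the '--liftoff-only' flag itself is taken from the commit message.
const liftoffArgs: string[] = ["--liftoff-only"];

const testOptions = {
  poolOptions: {
    // Worker-thread pool: pin V8's WASM compilation to Liftoff.
    threads: { execArgv: liftoffArgs },
    // Child-process pool: same flag, so both pools match production.
    forks: { execArgv: liftoffArgs },
  },
};
```

Setting the flag on both pools matters because whichever pool runs the grammar-loading suite is the one that would otherwise hit the optimizing-tier compilation described above.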
0a0005b to
3b52163
Adds Clojure / ClojureScript as an indexed language: .clj, .cljs, .cljc, Babashka .bb, and .edn, including statically-connected re-frame flows and UIx/helix components.

What
- Namespaces → `module` nodes scoping their defs (`my.app.core::process-user`); `defn` (multi-arity, docstrings, `^:private`), `def`/`defonce`, `defprotocol`, `defrecord`/`deftype` (fields, methods, `implements`, implicit `->Foo` ctors), `defmulti`/`defmethod` (overloads); `:require`/`:import` with `:as`/`:refer`/prefix lists/string requires/reader conditionals. Aliased calls emit `full.ns::name` and resolve through the existing matchers, with zero resolver changes. Local-scope tracking (let/fn/letfn/destructuring) prevents idiomatic shadowing from fabricating call edges; interop maps to `calls`/`references`/`instantiates`; quoted/discarded/`(comment)` forms are skipped.
- Every `reg-*` registration with a literal keyword becomes a function node named by its alias-expanded keyword (`::subs/items` → `:my.app.subs/items`); `dispatch`/`subscribe`/`sub` sites with literal event vectors emit same-named refs the exact-name matcher links. Detection is shape-based because real apps wrap re-frame in project facades (status-mobile's `utils.re-frame` fronts 512 files); precision is structural: an edge needs both ends to carry the same keyword. The registrar call itself keeps its ordinary ref (callers/impact on facades).
- `defui`/`defnc` → `component` nodes; the `$` element macro → `calls` edges to the composed component (gated on `$` resolving to uix/helix core in the require table).
- `.edn` top-level map keys → `property` nodes; qualified symbols in values (shadow-cljs `:init-fn`, clj-kondo `:hooks`) emit references to the code they name; never any call edges; >64-key maps are datasets and contribute no nodes.

How
The maintained grammar (sogaiu/tree-sitter-clojure, vendored ABI-14 wasm; build recipe in src/extraction/wasm/README.md) is purely lexical, so this is the first extractor running entirely through the `visitNode` full-takeover hook, interpreting list heads. A generic def-macro heuristic (`def*` + symbol first arg, plain or qualified) catches library definers (`defroutes`, `deftest`, `rum/defc`, `hsx/defc`) without knowing each macro library; a ~250-entry clojure.core blocklist suppresses call refs whose target is never in the graph.

fix(explore): real-world testing on a 2k-file cljs monorepo exposed four retrieval gaps in `codegraph_explore`'s query handling, fixed here because without them the new graph is unreachable by agents: (1) the symbol-token filter rejected every Clojure name (kebab-case, `?`/`!`, `alias/name`, `:keyword`; the flow builder silently never ran); (2) bare tokens now also resolve namespaces by last segment (the Clojure way to reference a subsystem); (3) ambiguous tokens prefer candidates co-located with the other tokens' locations instead of the biggest same-named def anywhere in the monorepo; (4) colon-less keywords (`app/set-page-state`) resolve to their registration node. Unit-tested on a fixture reproducing the failing session, including no-hijack guards; Alamofire/metabase control probes unchanged.

fix(tests): vitest workers now run with `--liftoff-only` like every production launch path (#293/#298); without it the turboshaft Zone OOM kills a worker mid-file on arm64/Node 24 and ~90 tests silently never run (reproducible on main).

Validation
Node counts stable on re-index; grammar health-checked 20/20 in a multi-grammar runtime.

`codegraph_node :profile/logout` (status-mobile) returns the handler + all 13 dispatch sites in one call.

Agent A/B (canonical flow question, headless opus):
- logseq: 4 calls / 0 Read / 0 Grep / 69s / $0.48 vs 155 calls / ~40 Read / 248s / $1.23
- metabase: 14 calls / 6 Read / 113s vs 24 / 16 / 197s
- status-mobile (n=2): 12–14 / 3–6 Read / 72–88s vs 25–33 / 10–15 / ~122s
- ring (S, n=2): parity, the known small-repo pattern

Both arms always answered correctly.

52 new tests; full suite 60/60. Coverage-matrix entry in docs/design/dynamic-dispatch-coverage-playbook.md.

Known gaps (deliberate v1)
- `extend-protocol`/`extend-type`: bodies walked, no `implements`/method nodes.
- `rf/defn {:events [...]}` macros, fx-map keys, handler→sub app-db data-flow.
- `ig/init-key` ↔ system.edn keys, multimethod dispatch-value edges: natural follow-ups on the keyword mechanism.

🤖 Generated with Claude Code
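As an illustration of the def-macro heuristic named in the description (a `def*` head plus a symbol first argument, plain or qualified), here is a minimal sketch. The function name and the symbol pattern are hypothetical, and it ignores the blocklist and any core-form handling the real extractor performs first.

```typescript
// Hypothetical symbol test: rejects keywords, strings, numbers, and
// collection literals, which cannot be the name being defined.
const SYMBOL = /^[A-Za-z_*+!?<>=-][\w*+!?<>=.'\/-]*$/;

// True when a list looks like a library definer such as
// (defroutes app-routes ...) or (rum/defc button ...).
function looksLikeDefiner(head: string, firstArg?: string): boolean {
  // Strip an alias qualifier: rum/defc -> defc.
  const name = head.slice(head.lastIndexOf("/") + 1);
  if (!name.startsWith("def")) return false;
  return firstArg !== undefined && SYMBOL.test(firstArg);
}

const routes = looksLikeDefiner("defroutes", "app-routes");
const rumComponent = looksLikeDefiner("rum/defc", "button");
const keywordArg = looksLikeDefiner("default", ":x");
const notDef = looksLikeDefiner("map", "inc");
```

A prefix test like this will also match non-definers whose names merely begin with `def` when they take a symbol argument, so in practice it has to sit behind explicit handling of known forms; how the PR orders those checks is not stated here.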