Summary
On real-world Perl codebases, the call graph is dominated by false-positive CALLS edges. Perl files are extracted (call sites emitted), and any call the textual resolver can't place falls back to a generic short-name matcher that has no language or call-kind awareness. It happily wires Perl builtins, framework method calls, and even mis-parsed config strings to unrelated project subs that merely share a name.
Evidence
Measured on a large real Perl monorepo (~1,200 .pm/.pl files plus 352 .cgi endpoint files). For .cgi callers, the generic resolver produced 4,940 suffix_match edges; the three noise classes below alone account for roughly 4,014 of them (about 81%):
| Noise class | .cgi edges | Example callees |
|---|---|---|
| CPAN/framework method calls ($obj->m()) matched to unrelated project subs | ~3,064 | log (Log::Log4perl), param/header (CGI.pm), connect/commit/rollback/execute (DBI), encode (JSON) |
| Perl builtins matched to project subs | ~645 | shift, push, keys, sprintf |
| Config strings mis-extracted as call targets | ~305 | log4perl.appender.File.utf8 |
These targets are not project calls at all — the classes belong to CPAN/framework modules not in the graph, the builtins are language primitives, and the config string is a literal. The generic resolver fabricates edges purely on short-name collision.
Root causes
- Callee over-extraction (Perl): extract_scripting_callee (internal/cbm/extract_calls.c, CBM_LANG_PERL branch) returns the raw first-child text of a call node, so non-identifier tokens (dotted config strings, literals) become callees.
- No builtin guard: the generic resolver (src/pipeline/registry.c) matches builtin-named calls (shift/push/…) to project subs of the same name.
- No function-vs-method distinction: CBMCall carries no call-kind, so a method call $obj->commit() with an unknown receiver is treated as a bare function and short-name-matched to a project commit sub.
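To make the collision concrete, here is a minimal sketch of a matcher that compares only the last name segment. It is not the code in src/pipeline/registry.c; the function names, the "::" qualified-name format, and the Billing::Txn and Queue packages are all assumptions made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: returns the last "::"-separated segment of a
 * qualified sub name, e.g. "Billing::Txn::commit" -> "commit". */
static const char *short_name(const char *qualified) {
    assert(qualified != NULL);
    const char *last = qualified;
    for (const char *p = strstr(qualified, "::"); p != NULL;
         p = strstr(p + 2, "::")) {
        last = p + 2;
    }
    return last;
}

/* Hypothetical matcher with no language or call-kind awareness: any call
 * whose name equals the short name of a project sub is "resolved" to it. */
static bool naive_short_name_match(const char *callee, const char *project_sub) {
    assert(callee != NULL);
    return strcmp(callee, short_name(project_sub)) == 0;
}
```

Under this rule, the method name from $dbh->commit() and the builtin push both match any project sub that happens to end in ::commit or ::push, which is the kind of fabricated edge counted in the table above.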
Proposed fix (language-gated; non-Perl behavior unchanged)
- Extraction hygiene: in the Perl callee path, extract the real method/function name token and reject non-identifier callees (containing ., quotes, whitespace, etc.) so config strings/literals never become call targets.
- Builtin guard: add a Perl builtin set; when an unresolved Perl call's name is a builtin, suppress the generic edge (real same-file subs are already resolved by earlier stages before the generic fallback).
- Method-vs-function: add an is_method flag to CBMCall, set it during Perl extraction for arrow/method calls, thread it into the resolver, and suppress generic short-name matching for Perl method calls with an unknown receiver (precise method resolution is the LSP's job; bare short-name matching is almost always wrong).
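The first two items could look roughly like the sketch below. Both helper names are hypothetical, the identifier rule (ASCII segments separated by "::") is an assumption about what should count as a valid callee, and the builtin table is a small illustrative subset rather than the full perlfunc list.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: true if `name` looks like a Perl sub name, optionally
 * package-qualified (Foo::Bar::baz). Dots, quotes, whitespace, sigils, and
 * empty segments are rejected. */
static bool perl_is_identifier_callee(const char *name) {
    assert(name != NULL);
    bool at_segment_start = true;
    for (const char *p = name; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;
        if (at_segment_start) {
            if (!(isalpha(c) || c == '_')) return false;
            at_segment_start = false;
        } else if (c == ':' && p[1] == ':') {
            p++;                      /* skip the second ':' */
            at_segment_start = true;  /* next char must start a segment */
        } else if (!(isalnum(c) || c == '_')) {
            return false;
        }
    }
    return !at_segment_start;         /* rejects "" and a trailing "::" */
}

/* Illustrative subset only. Must stay sorted (strcmp order) for bsearch. */
static const char *const PERL_BUILTINS[] = {
    "defined", "die", "exists", "join", "keys",
    "map", "print", "push", "shift", "sprintf",
};

static int cmp_builtin(const void *key, const void *elem) {
    return strcmp((const char *)key, *(const char *const *)elem);
}

/* Hypothetical helper: true if `name` is in the builtin table above. */
static bool perl_is_builtin(const char *name) {
    assert(name != NULL);
    return bsearch(name, PERL_BUILTINS,
                   sizeof PERL_BUILTINS / sizeof PERL_BUILTINS[0],
                   sizeof PERL_BUILTINS[0], cmp_builtin) != NULL;
}
```

A sorted static table with bsearch keeps the lookup allocation-free; a generated perfect hash would also work if the full builtin list makes the table large.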
Every change is gated on CBM_LANG_PERL — the generic resolver and CBMCall are shared by all languages, so the other nine families remain byte-identical.
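The gating can be sketched as a single predicate consulted just before the generic fallback. SketchLang and SketchCall are hypothetical stand-ins for the real language enum and CBMCall, and is_builtin is assumed to be filled in by a builtin lookup performed earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins: the real enum and CBMCall have more members. */
typedef enum { SKETCH_LANG_OTHER, SKETCH_LANG_PERL } SketchLang;

typedef struct {
    const char *callee;
    SketchLang  lang;
    bool        is_method;   /* the proposed flag: set for $obj->m() calls */
    bool        is_builtin;  /* assumed result of an earlier builtin lookup */
} SketchCall;

/* Returns true if the generic short-name fallback may run for this call. */
static bool allow_generic_match(const SketchCall *call) {
    assert(call != NULL);
    if (call->lang != SKETCH_LANG_PERL) {
        return true;   /* non-Perl calls take the existing path unchanged */
    }
    if (call->is_builtin) {
        return false;  /* language primitive, never a project sub */
    }
    if (call->is_method) {
        return false;  /* unknown receiver: leave to precise resolution */
    }
    return true;       /* bare Perl function call: fallback still applies */
}
```

Putting the language check first means the Perl-specific fields are never read for other languages, so their behavior cannot change even if those fields are left uninitialized there.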
Acceptance criteria
- A Perl builtin call (push @x, 1) does not produce a CALLS edge to a project sub named push.
- A method call on an unknown receiver ($dbh->commit()) does not generic-match to a project sub named commit.
- A config string (log4perl.appender.File.utf8) is not extracted as a call target.
Validation (on the proposed branch)
Re-indexing the same repo with the fix: .cgi suffix_match edges drop from 4,940 to 655 (−87%); builtin, CPAN-method, and config-string noise on .cgi callers goes to zero; project-wide CALLS edges drop by ~13,400 (noise removal: fewer, more-correct edges); scripts/build.sh and scripts/test.sh stay green; and the cross-language breadth check confirms all other languages still resolve.