feat(aegis-core): Layer 1 per-file fact cache + SEC010 multi-language dispatch by wei9072 · Pull Request #10 · wei9072/aegis

wei9072 · 2026-05-06T03:10:08Z

Summary

Refactors per-file fact extraction (currently just imports) into
Layer 1: ParsedFile now owns a lazy OnceCell<Vec<Import>> cache
populated by a single shared extract_imports() call. SEC010 is then
rebuilt on top of this cache so per-language matchers can resolve
the receiver's import and disambiguate cases like Go's
math/rand vs crypto/rand (same rand.Read literal, different
module).

Why

Two threads of feedback drove this:

- PR #9 (fix(security): SEC010 FP on secrets.choice / os.urandom) added an
  allowlist that was reactive: it only patched the secrets.choice
  FP we tripped over. The architectural problem was that the SEC010
  last == "choice" matcher was Python-centric and did not know
  which module the receiver came from.
- Layer 1 is the right home for per-file derivations
  (imports / public symbols / etc.). Currently every consumer
  re-runs Query::new(lang, adapter.import_query()) + qc.matches(...)
  independently: signals/imports_local.rs and
  workspace.rs::summarize_file both did the work, and security
  would have been the third copy.

Commits

1. feat(aegis-core): ast::imports — per-file Import struct + extractor:
   new module wraps the common extraction. Pure function; caching
   added in next commit.
2. feat(aegis-core): ParsedFile.imports() lazy cache + resolve_receiver():
   OnceCell<Vec<Import>> on ParsedFile, plus a best-effort
   resolve_receiver(name) lookup (last-segment match for paths
   like math/rand → rand).
3. refactor(aegis-core): migrate consumers to ParsedFile.imports() cache:
   imports_local.rs and workspace.rs's summarize_file both shrink
   to a single iterator over parsed.imports(). Behaviour
   preserved; HashSet/Vec semantics handled at consumer level.
4. feat(security): SEC010 language-aware dispatch over Layer 1 imports:
   per-language matchers (Python/JS/Go/Java/C#/PHP/Rust) replace
   the polyglot needle list. Drops the safe_module_prefixes
   allowlist hack from PR #9 (fix(security): SEC010 FP on
   secrets.choice / os.urandom), which is now redundant. Threads
   &ParsedFile through walk(). Existing 142 tests pass without
   modification.
5. test(security): SEC010 per-language regression coverage:
   10 new positive/negative tests across Python / JS / Go / Java /
   C# / PHP. Also fixes Java call_name extraction (composes
   object.name for method_invocation) and extends
   enclosing_token_context to recognize Go / Java / Rust
   assignment node kinds.

Coverage matrix

Language	Was caught	Now caught	Notes
Python `random.X`	✅	✅	Plus `random.SystemRandom().X` excluded
Python `secrets.X` (FP risk)	❌ FP	✅ silent	Per-language matcher requires `first == "random"`
Python `np.random.choice` (FP risk)	❌ FP	✅ silent	Same
JS `Math.random`	✅	✅
Go `math/rand.Intn`	❌ silent	✅ fires	Layer 1 import resolution
Go `crypto/rand.Read` (FP risk)	❌ FP-prone	✅ silent	Same literal text, resolved via imports
Java `new Random().nextInt`	❌ silent	✅ fires
Java `new SecureRandom().nextInt` (FP)	❌ FP-prone	✅ silent	Receiver-path filter
Java `Random r; r.nextInt`	❌ silent	❌ FN (accepted)	Needs dataflow — out of SEC layer scope
C# `new Random().Next`	❌ silent	✅ fires
PHP `rand()`	❌ silent	✅ fires
PHP `random_int()` (FP risk)	❌ FP-prone	✅ silent	Not in matcher list
Rust `rand::thread_rng().gen()`	❌ silent	✅ fires (when in token context)

Test results

cargo test --workspace — 153 / 153 pass (was 132; +21 covering
new dispatch + new Layer 1 cache tests + import-extraction unit
tests).

Live MCP sanity-checked on JS / Go / Java samples; per-language
matchers fire / stay silent as designed.

Test plan

- cargo test --workspace: 153 / 153 pass
- cargo install --path crates/aegis-mcp --force succeeds
- Live MCP fires on Go import "math/rand"; rand.Intn test
- Live MCP silent on Go import "crypto/rand"; rand.Read test
- PR #9 (fix(security): SEC010 FP on secrets.choice / os.urandom) regression tests still pass (secrets.choice /
  os.urandom stay silent)
- All prior SEC010 tests still pass (Python URL shortener,
  random.choice for token, etc.)
- CI green on push

🤖 Generated with Claude Code

Introduces the first per-file Layer 1 fact-extraction module. Every language adapter declares an `import_query` capturing module names as `@import`; until now signals/imports_local.rs and workspace.rs each ran that query independently. This pulls the common pattern into one place so future per-file fact derivation (security receiver resolution, public symbols, etc.) has the same home.

`extract_imports(parsed) -> Vec<Import>` runs the adapter query once and returns normalized module strings with line numbers. Pure function; caching is the next commit's job (`ParsedFile.imports()` lazy cache).

5 tests cover Python plain/from-imports, Go quoted module paths, JS/TS quote stripping, line numbering (1-indexed), and the contract that callers must hold a ParsedFile (unparseable extensions can't reach this code path).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds an `OnceCell<Vec<Import>>` to ParsedFile so the import query runs at most once per file. Single-threaded by design: ParsedFile is owned by gather_findings and not shared across threads, so std::cell::OnceCell suffices.

New API:

- `ParsedFile::imports() -> &[Import]`: lazy cache populated on first call.
- `ParsedFile::resolve_receiver(name) -> Option<&Import>`: best-effort receiver-to-import lookup. Today: last-segment of module path or full module name. Alias-aware resolution (Go `import myrand "math/rand"`) is intentionally deferred because it needs language-specific AST shapes.

5 new tests cover cache population, cache stability across calls (identical slice pointer), Python module-name resolution, Go last-segment path resolution (`math/rand` → `rand`), and the empty-string boundary.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
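
For readers unfamiliar with the pattern, the lazy cache and last-segment lookup described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `source` field, the line-based `extract_imports` stand-in (the real extractor runs a tree-sitter query), and returning `None` for an empty name are all assumptions made for the example.

```rust
use std::cell::OnceCell;

/// One import found in a file: normalized module path plus 1-indexed line.
#[derive(Debug, Clone, PartialEq)]
pub struct Import {
    pub module: String,
    pub line: usize,
}

/// Simplified stand-in for ParsedFile; the real type holds a parsed tree.
pub struct ParsedFile {
    source: String, // assumption: raw text only, for illustration
    imports: OnceCell<Vec<Import>>,
}

impl ParsedFile {
    pub fn new(source: &str) -> Self {
        Self { source: source.to_string(), imports: OnceCell::new() }
    }

    /// Lazy cache: the extractor runs on the first call only, later calls
    /// return the same slice.
    pub fn imports(&self) -> &[Import] {
        self.imports.get_or_init(|| extract_imports(&self.source))
    }

    /// Best-effort lookup: full module name or last path segment.
    /// Returning None for an empty name is an assumption of this sketch.
    pub fn resolve_receiver(&self, name: &str) -> Option<&Import> {
        if name.is_empty() {
            return None;
        }
        self.imports()
            .iter()
            .find(|imp| imp.module == name || imp.module.rsplit('/').next() == Some(name))
    }
}

/// Hypothetical extractor: only understands single-line `import "path"`.
fn extract_imports(source: &str) -> Vec<Import> {
    source
        .lines()
        .enumerate()
        .filter_map(|(idx, line)| {
            let rest = line.trim().strip_prefix("import ")?;
            let module = rest.trim().trim_matches('"').to_string();
            Some(Import { module, line: idx + 1 })
        })
        .collect()
}
```

Because `std::cell::OnceCell` is not `Sync`, this shape only works while a `ParsedFile` stays on one thread, which matches the single-threaded ownership the commit message describes.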

Both signals/imports_local.rs and workspace.rs::summarize_file used to run their own copy of:

    let query = Query::new(lang, adapter.import_query()).ok()?;
    let mut qc = QueryCursor::new();
    for m in qc.matches(&query, parsed.root_node(), src) {
        for cap in m.captures { ... adapter.normalize_import(text) ... }
    }

That work now lives in ParsedFile.imports() and runs at most once per file. Both call sites collapse to a single iterator over `parsed.imports()`.

Behavioural delta: none. Both consumers were already de-duplicating on the module string (HashSet in workspace.rs, line-counting in imports_local.rs); the cache returns a `Vec<Import>` with each distinct (module, line) pair, so consumer behaviour is preserved.

Drops three unused tree-sitter type imports from workspace.rs (Parser, Query, QueryCursor); they were only used by the now-gone inline import-extraction. The pre-existing unused `ParsedFile` import in findings.rs is separate scope; left alone.

142 / 142 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
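
To make "de-duplicating at the consumer level" concrete, here is a small sketch of a consumer that reads a cached import slice and collapses it to distinct module names. The `distinct_modules` name and the two-field `Import` struct are illustrative assumptions; the actual consumer code in workspace.rs is not shown in this PR text.

```rust
use std::collections::HashSet;

/// Simplified import record: module path plus 1-indexed line.
#[derive(Debug, Clone)]
pub struct Import {
    pub module: String,
    pub line: usize,
}

/// Hypothetical consumer: the cache may contain the same module on several
/// lines, so the consumer de-duplicates on the module string itself.
pub fn distinct_modules(imports: &[Import]) -> HashSet<&str> {
    imports.iter().map(|imp| imp.module.as_str()).collect()
}
```

Keeping the cache as a plain `Vec<Import>` and leaving set semantics to each caller means a consumer that cares about line numbers can still see every occurrence.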

Replaces the polyglot needle list (`choice` | `randint` | ... | `Math.random`) with per-language matchers. Each language's matcher hard-codes the receiver path, so safe APIs that share function names (`secrets.choice` shares `choice`, `SecureRandom.nextInt` shares `nextInt`, `crypto/rand.Read` shares `Read`) never match.

Per-language coverage:

- **Python**: `random.X` only (`X` ∈ choice/choices/randint/sample/uniform/randrange/shuffle/random). Excludes `secrets.X`, `os.urandom`, `np.random.X`, `random.SystemRandom().X`.
- **JS / TS**: `Math.random` only.
- **Go**: `math/rand` method set, with import-resolution via `parsed.resolve_receiver()` to disambiguate from `crypto/rand`. `import "math/rand"` → `rand.Intn` flags; `import "crypto/rand"` → `rand.Read` doesn't (and `Intn`/`Int31`/etc. don't even exist there). Falls back to method-name conservative default when no matching import is in the file.
- **Java / Kotlin**: `nextInt`/`nextLong`/etc. on `Random` (or unqualified). `SecureRandom`-prefixed receivers excluded by receiver-path filter.
- **C#**: `Next`/`NextDouble`/`NextBytes`. Excludes `RandomNumberGenerator` and any `Cryptography`-namespace receiver.
- **PHP**: global `rand`/`mt_rand`/`array_rand`. `random_int` and `random_bytes` are CSPRNG and intentionally absent.
- **Rust**: `gen`/`gen_range`/`gen_bool` from the rand crate. `OsRng` excluded by receiver-path filter.

Drops PR #9's `safe_module_prefixes` whitelist hack, which is now redundant. Each language's matcher hard-codes the receiver path, so `secrets.choice` never reaches the SEC010 path in the first place because `is_python_weak_rng` requires `first == "random"`.

Threads `&ParsedFile` down through `walk()` so language identity and import facts are available at the SEC010 call site. Other SEC checks unchanged.

Existing 142 tests pass without modification: the dispatch preserves semantics for everything that was already flagging. Per-language regression tests added in the next commit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
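
The dispatch idea can be illustrated with a reduced sketch covering three of the languages. Everything here is a simplification under stated assumptions: the `Lang` enum, the `is_weak_rng` entry point, the dotted-string call name, and the exact Go method list are invented for the example; only `is_python_weak_rng` and the `first == "random"` rule are named in the commit message.

```rust
/// Languages covered by this sketch (the PR covers more).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lang {
    Python,
    JavaScript,
    Go,
}

/// `call_name` is a dotted call path such as "random.choice".
/// `resolved_module` is what the receiver resolved to via the file's imports.
pub fn is_weak_rng(lang: Lang, call_name: &str, resolved_module: Option<&str>) -> bool {
    let segments: Vec<&str> = call_name.split('.').collect();
    match lang {
        Lang::Python => is_python_weak_rng(&segments),
        Lang::JavaScript => call_name == "Math.random",
        Lang::Go => is_go_weak_rng(&segments, resolved_module),
    }
}

/// Receiver-anchored: exactly `random.X`, so `secrets.choice`,
/// `np.random.choice` and `random.SystemRandom().choice` do not match.
fn is_python_weak_rng(segments: &[&str]) -> bool {
    const WEAK: [&str; 8] = [
        "choice", "choices", "randint", "sample",
        "uniform", "randrange", "shuffle", "random",
    ];
    segments.len() == 2 && segments[0] == "random" && WEAK.contains(&segments[1])
}

/// Method list is an assumption; the source only says "math/rand method set".
fn is_go_weak_rng(segments: &[&str], resolved_module: Option<&str>) -> bool {
    const METHODS: [&str; 7] = ["Intn", "Int", "Int31", "Int63", "Float64", "Perm", "Read"];
    if segments.len() != 2 || !METHODS.contains(&segments[1]) {
        return false;
    }
    match resolved_module {
        Some("math/rand") => true,
        Some(_) => false, // e.g. crypto/rand: same literal text, safe module
        None => true,     // conservative method-name fallback
    }
}
```

The Go arm is the only one that needs import facts, which is why the language identity and the resolved import have to be available at the call site.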

10 new tests covering the dispatch matrix from PR #10 step 4:

Python:
- random.SystemRandom() CSPRNG-backed → does NOT fire
- numpy np.random.choice (statistical sampling) → does NOT fire

JS / TS:
- Math.random for token (chained .toString[2] form) → fires
- window.crypto.getRandomValues → does NOT fire

Go (with Layer 1 import resolution):
- import "math/rand" + rand.Intn → fires
- import "crypto/rand" + rand.Read → does NOT fire (same literal `rand.Read` text, different module per import resolution)

Java:
- new Random().nextInt(...) literal → fires
- new SecureRandom().nextInt(...) literal → does NOT fire

Note: variable-stored (`Random r = new Random(); r.nextInt`) is a known FN: without dataflow we can't recover the class identity from the receiver `r`. Worth less than the FP risk on `SecureRandom r; r.nextInt`.

C# / PHP:
- new Random().Next(...) → fires
- PHP rand($min, $max) → fires
- PHP random_int($min, $max) (CSPRNG) → does NOT fire

Also fixes call-name extraction for Java's `method_invocation`, which lacks a `function` field: it composes `object.name` from the two child fields so the receiver path is visible to the matcher. And extends `enclosing_token_context` to recognize Go (var_spec / short_var_declaration), Java (local_variable_declaration), and JS augmented assignment kinds, so the token-context detection works across the languages now in scope.

153 / 153 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
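
The Java behaviour listed above (literal `new Random()` fires, `SecureRandom` stays silent, a variable receiver is an accepted false negative) can be sketched as a receiver-path filter over the composed call name. The function names and the method list are assumptions for illustration; the real code works on tree-sitter nodes rather than strings.

```rust
/// Compose `object.name` the way the commit describes for Java's
/// `method_invocation`; an absent object yields the bare method name.
pub fn compose_call_name(object: Option<&str>, name: &str) -> String {
    match object {
        Some(obj) => format!("{obj}.{name}"),
        None => name.to_string(),
    }
}

/// Hypothetical Java matcher. The method list is an assumption; the source
/// only says `nextInt`/`nextLong`/etc.
pub fn is_java_weak_rng(call_name: &str) -> bool {
    const METHODS: [&str; 7] = [
        "nextInt", "nextLong", "nextDouble", "nextFloat",
        "nextBoolean", "nextBytes", "nextGaussian",
    ];
    match call_name.rsplit_once('.') {
        // Unqualified call: fall back to the method name alone.
        None => METHODS.contains(&call_name),
        Some((receiver, method)) => {
            METHODS.contains(&method)
                && !receiver.contains("SecureRandom") // receiver-path filter
                && receiver.contains("Random") // variable receivers like `r` stay silent
        }
    }
}
```

Treating a plain variable receiver as silent is the accepted false negative from the coverage matrix: without dataflow, `r.nextInt` gives no hint whether `r` is a `Random` or a `SecureRandom`.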

Round 9 validation surfaced that SEC009's `last == "md5" | "sha1"` matcher only fired on Python `hashlib.md5(...)` and never on Go, Java, C#, PHP, or Node: exactly the same multi-language coverage gap PR #10 fixed for SEC010.

Per-language matchers, mirroring PR #10's architecture:

- **Python**: `hashlib.md5` / `hashlib.sha1`. Receiver-anchored, excludes `Crypto.Hash.MD5.new` from PyCryptodome (out of scope for now).
- **Node / JS**: `crypto.createHash('md5'|'sha1'|'sha-1')`. The algorithm lives in the first string arg; new `first_arg_is_weak_alg_string()` helper inspects the literal.
- **Go**: `md5.Sum` / `md5.New` / `sha1.Sum` / `sha1.New` with Layer 1 import resolution against `crypto/md5` and `crypto/sha1`. Round 9's `h := md5.Sum([]byte(password))` case.
- **Java / Kotlin**: `MessageDigest.getInstance("MD5")` (string arg), Apache Commons `DigestUtils.md5Hex` / `sha1Hex`. Round 9's `MessageDigest.getInstance("MD5")` case.
- **C#**: `MD5.Create()` / `SHA1.Create()` / `MD5CryptoServiceProvider` / `MD5Managed` and SHA1 equivalents. Receiver-anchored to avoid matching `SHA1024` / `SomeMD5Field`.
- **PHP**: global `md5(...)` / `sha1(...)`, plus `hash('md5', ...)` / `hash('sha1', ...)` (string arg). `hash('sha256', ...)` stays silent.

Also fixes `enclosing_security_context` with the same improvements PR #11 applied to `enclosing_token_context`:

- Multi-language assignment node kinds (Go `short_var_declaration`, Java `local_variable_declaration`, etc.).
- Function-name needle check at function-shape level. Round 9 case: `func HashPassword(password string)` calls `md5.Sum` via local `h :=`; the function name carries the `password` needle even though the local assignment doesn't.
- Walks past inner blocks (don't break at for / if body, only at function shape).

Tests: 8 new. Node createHash md5 (positive) / sha256 (negative), Go md5.Sum + crypto/md5 import, Java MessageDigest.getInstance MD5 (positive) / SHA-256 (negative), PHP md5 (positive) / hash sha256 (negative), C# MD5.Create.

164 / 164 tests pass.

Live MCP confirmation: Round 9's starting-go/auth.go and starting-java/Auth.java now both fire all three planted findings (SEC009, SEC010, SEC012) instead of 2 / 1 respectively.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
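
A string-argument check like the `first_arg_is_weak_alg_string()` helper named above might look roughly like the sketch below. The signature is an assumption: this version takes the raw source text of the first argument, whereas the real helper presumably inspects an AST node. Case-insensitive matching is also an assumption, added so that Java's `"MD5"` and Node's `'md5'` are both covered.

```rust
/// Returns true when `first_arg` is a quoted string literal naming a weak
/// hash algorithm. Identifiers and non-literal expressions return false,
/// because the algorithm cannot be read off them statically.
pub fn first_arg_is_weak_alg_string(first_arg: &str) -> bool {
    let trimmed = first_arg.trim();
    // Accept single-quoted, double-quoted, or backtick-quoted literals.
    let inner = ['\'', '"', '`']
        .iter()
        .find_map(|&q| trimmed.strip_prefix(q)?.strip_suffix(q));
    match inner {
        Some(alg) => matches!(
            alg.to_ascii_lowercase().as_str(),
            "md5" | "sha1" | "sha-1"
        ),
        None => false,
    }
}
```

Requiring a literal keeps the check precise: `createHash(algName)` stays silent rather than guessing what the variable holds.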

…he (#16)

Continues PR #10's pattern: any per-file fact whose extraction was duplicated across consumers belongs in `ast::*` with a `OnceCell` cache on `ParsedFile`.

Both extractors used to live as private helpers inside `workspace.rs::summarize_file`. They moved to a new `ast::symbols` module:

- `extract_public_symbols(parsed) -> HashSet<String>`: top-level function / class / type / trait / variable names this file exposes. Skips function bodies (nested local helpers are not the file's public API). Rust requires `pub` modifier.
- `extract_imported_symbols(parsed) -> HashSet<String>`: names pulled in via `from X import Y` (Python) or `import { a, b } from 'x'` (TS / JS).
- `walk_export()`: preserved verbatim for `export default` / `export { a, b as c }` shapes.
- `is_public_name()` / `is_likely_public()`: moved alongside.

`ParsedFile` grows two more lazy caches:

- `public_symbols() -> &HashSet<String>`
- `imported_symbols() -> &HashSet<String>`

`workspace.rs::summarize_file` now reads both caches instead of walking the tree itself. Net: -158 / +257 (the +257 includes 100 lines of moved code + 50 lines of new tests inside ast::symbols + 21 lines of new ParsedFile API).

3 new unit tests in `ast::symbols`: Python public-fn / Python private-skip, Rust pub-only filter, Python `from … import …` including `aliased_import`. `workspace.rs` unchanged behaviour-wise; same 11 existing workspace tests still pass. Total: 164 → 167 tests pass.

Architectural rationale: this is the third per-file fact that moved into Layer 1 (after `Import` in PR #10 and Layer 1 was already provider for `ParsedFile`). The pattern is now stable enough that any future "per-file derived fact" (call_sites, function definitions, etc.) just adds another `OnceCell` field + `extract_X` function.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

Closes the last gap from PR #10's Go SEC010 dispatch. Previously `import myrand "math/rand"` parsed as a single import with module "math/rand" and no alias; resolve_receiver("myrand") missed because last-segment of "math/rand" is "rand", not "myrand". SEC010 then silently passed `myrand.Intn(...)` even though it's the same weak RNG.

Two changes:

1. **`Import.alias: Option<String>`**: new field on the Import struct in `ast::imports`. Populated when the captured path's parent `import_spec` node has a `name` field (Go's syntax for import renaming). Other languages return None for now. Filters out Go's `_` (blank import for side-effects) and `.` (dot import) so resolve_receiver doesn't try to match those.
2. **`ParsedFile::resolve_receiver` two-pass lookup**:
   - Pass 1: explicit alias match. Aliased imports always win (`myrand` resolves to `math/rand`).
   - Pass 2: last-segment / module-name match, but **skips aliased imports** so `rand` no longer falsely resolves to a `math/rand` that was imported as `myrand`.

Tests: 5 new. Go aliased import captures alias, Go unaliased returns None, blank/dot imports skipped, alias preferred over last-segment, mixed aliased+unaliased in same file. Plus 2 SEC010 end-to-end tests:

- `myrand.Intn` (aliased math/rand) → fires
- `crand.Read` (aliased crypto/rand) → does NOT fire

Total: 167 → 174 tests pass.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
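
The two-pass lookup described above can be sketched as a free function over a slice of imports. This is an illustration under assumptions: the real method lives on `ParsedFile`, and the empty-name guard is carried over from the earlier sketch rather than confirmed by the source.

```rust
/// Import record with the optional alias field added by this change.
#[derive(Debug, Clone, PartialEq)]
pub struct Import {
    pub module: String,
    pub alias: Option<String>,
    pub line: usize,
}

/// Two-pass receiver resolution.
pub fn resolve_receiver<'a>(imports: &'a [Import], name: &str) -> Option<&'a Import> {
    if name.is_empty() {
        return None; // assumption: empty receivers never resolve
    }
    // Pass 1: an explicit alias always wins.
    if let Some(hit) = imports.iter().find(|imp| imp.alias.as_deref() == Some(name)) {
        return Some(hit);
    }
    // Pass 2: full module name or last path segment, skipping aliased
    // imports, since their original last segment is not in scope.
    imports
        .iter()
        .filter(|imp| imp.alias.is_none())
        .find(|imp| imp.module == name || imp.module.rsplit('/').next() == Some(name))
}
```

Skipping aliased imports in the second pass matters for files that import `math/rand` as `myrand` and `crypto/rand` unaliased: `rand` must resolve to `crypto/rand`, not to the renamed weak RNG.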

Closes the audit gaps identified after PRs #10 / #12. The audit across SEC003-008:

| Rule | Coverage before | Action this PR |
|---|---|---|
| SEC003 TLS off | text-level Python/Node/Go/.NET | unchanged (decent already) |
| SEC004 shell | Python `shell=True`+interp only | **language-aware dispatch** |
| SEC005 SQL concat | text+string-literal Python/Java | unchanged (decent) |
| SEC006 CORS | text-level cross-language | unchanged |
| SEC007 JWT | `name.contains("jwt")` Python only | **language-aware dispatch** |
| SEC008 deser | Python/Node/Java idioms | unchanged (decent) |

## SEC004 expansion

Per-language shell-running idioms; requires interpolation in arg.

- **Python**: subprocess.run/Popen with `shell=True` + interp
- **Node.js**: `child_process.exec` / `execSync` with interp (always shells out, no `shell:true` gate; `execFile` is the safe one)
- **PHP**: global `shell_exec` / `exec` / `passthru` / `system` / `proc_open` with interp
- **Java**: `Runtime.getRuntime().exec(String)` overload with concat; the String[] overload is safe and excluded
- **Go**: `exec.Command("sh"|"bash"|"/bin/sh"|"/bin/bash", "-c", ...)` with interp. Bare `exec.Command("ls", arg)` (argv-style) excluded: no shell metachar interpretation

`text_has_interp` extended with PHP `.` concat (gated on `$` to avoid floating-point literals).

## SEC007 expansion

Per-language JWT decode without verification:

- **Python**: `jwt.decode(...)` without algorithms/key/verify kwarg (existing behaviour)
- **Node.js**: `jsonwebtoken.decode()` always returns unverified claims, so it is flagged unconditionally; `verify(token, secret, opts)` is the safe API. `verify()` with `verify: false` opt also flagged.
- **Java / Kotlin**: Auth0 lib's `JWT.decode(token)` returns unverified DecodedJWT; safe path is `JWT.require(...).build().verify(token)`.
- **PHP**: firebase/php-jwt's `JWT::decode($token, $key)` requires explicit algorithm list. Flagged unless one of `'HS256'`/`'RS256'`/`'ES256'`/`'EdDSA'` appears in call text.

Algorithm-`none` detection extended with JWT-spec literal `"alg": "none"` shape.

`check_jwt_unsafe` now takes `&ParsedFile` so language identity is available; this prevents PHP `JWT::decode` from being misclassified as Java (the old `name.contains("JWT")` check was language-blind).

## Infrastructure changes

1. **`call_name` extended for PHP scoped/member calls.** Previously only handled Java's `method_invocation`; now also composes `Class::method` from `scoped_call_expression` and `$obj->method` from `member_call_expression`.
2. **`leaf_method_name(name)` helper**: splits on `.` / `::` / `->` so `JWT::decode`'s leaf is `decode`, not the whole string.
3. **walk dispatch** extended with `scoped_call_expression` and `member_call_expression` node kinds.

## Tests

10 new (5 SEC004 multi-lang + 5 SEC007 multi-lang). 174 → **177** total tests passing.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
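
As an illustration of the `leaf_method_name(name)` helper described under "Infrastructure changes", one way to take the segment after the last `.`, `::`, or `->` separator is sketched below. Only the helper's name and the three separators come from the text above; the body is an assumption about how such a split could be done.

```rust
/// Return the part of `name` after the last `.`, `::`, or `->` separator,
/// or the whole string when no separator is present.
pub fn leaf_method_name(name: &str) -> &str {
    ["::", "->", "."]
        .iter()
        // For each separator, find where the text after its last occurrence starts.
        .filter_map(|sep| name.rfind(sep).map(|idx| idx + sep.len()))
        // The right-most start position belongs to the final separator overall.
        .max()
        .map_or(name, |start| &name[start..])
}
```

With this shape, `JWT::decode` and `$obj->decode` both reduce to `decode`, so one per-language matcher can compare leaf names without caring which call syntax produced them.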

wei9072 and others added 5 commits May 6, 2026 02:57

wei9072 merged commit a9ccf3f into main May 6, 2026
1 check passed

wei9072 deleted the feat/layer1-fact-cache branch May 6, 2026 03:11

This was referenced May 6, 2026

fix(security): SEC010 reads function name + walks past inner blocks #11

Merged

feat(security): SEC009 language-aware dispatch (PR #12) #12

Merged

wei9072 mentioned this pull request May 7, 2026

refactor(aegis-core): public_symbols → Layer 1 cache (PR #16) #16

Merged

4 tasks

wei9072 mentioned this pull request May 7, 2026

feat(aegis-core): Layer 2 Go import alias resolution (PR #17) #17

Merged

5 tasks

wei9072 mentioned this pull request May 7, 2026

feat(security): SEC004 + SEC007 multi-language dispatch (PR #18) #18

Merged

6 tasks

wei9072 merged 5 commits into main from feat/layer1-fact-cache
