feat(aegis-core): Layer 1 per-file fact cache + SEC010 multi-language dispatch #10
Merged
Introduces the first per-file Layer 1 fact-extraction module. Every language adapter declares an `import_query` capturing module names as `@import`; until now signals/imports_local.rs and workspace.rs each ran that query independently. This pulls the common pattern into one place so future per-file fact derivation (security receiver resolution, public symbols, etc.) has the same home.

`extract_imports(parsed) -> Vec<Import>` runs the adapter query once and returns normalized module strings with line numbers. Pure function; caching is the next commit's job (`ParsedFile.imports()` lazy cache).

5 tests cover Python plain/from-imports, Go quoted module paths, JS/TS quote stripping, line numbering (1-indexed), and the contract that callers must hold a ParsedFile (unparseable extensions can't reach this code path).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds an `OnceCell<Vec<Import>>` to ParsedFile so the import query runs at most once per file. Single-threaded by design: ParsedFile is owned by gather_findings and not shared across threads, so std::cell::OnceCell suffices.

New API:
- `ParsedFile::imports() -> &[Import]`: lazy cache populated on first call.
- `ParsedFile::resolve_receiver(name) -> Option<&Import>`: best-effort receiver-to-import lookup. Today it matches the last segment of the module path or the full module name. Alias-aware resolution (Go `import myrand "math/rand"`) is intentionally deferred because it needs language-specific AST shapes.

5 new tests cover cache population, cache stability across calls (identical slice pointer), Python module-name resolution, Go last-segment path resolution (`math/rand` → `rand`), and the empty-string boundary.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
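For readers skimming the thread, here is a minimal sketch of the lazy cache and last-segment lookup this commit describes. It is not the aegis-core code. The `raw` field, the `new` constructor, the field layout of `Import`, and returning `None` for an empty receiver name are all assumptions made for illustration; the real `ParsedFile` wraps a tree-sitter parse and fills the cache via `extract_imports`.

```rust
use std::cell::OnceCell;

/// Simplified stand-in for the `Import` struct (field layout is assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct Import {
    pub module: String,
    pub line: usize, // 1-indexed
}

/// Simplified stand-in for `ParsedFile`.
pub struct ParsedFile {
    /// Hypothetical: pre-extracted (module, line) pairs standing in for
    /// the result of running the adapter's import query.
    raw: Vec<(String, usize)>,
    /// Single-threaded lazy cache, filled on the first `imports()` call.
    imports: OnceCell<Vec<Import>>,
}

impl ParsedFile {
    pub fn new(raw: Vec<(String, usize)>) -> Self {
        Self { raw, imports: OnceCell::new() }
    }

    /// Builds the import list once; later calls return the same slice.
    pub fn imports(&self) -> &[Import] {
        self.imports.get_or_init(|| {
            self.raw
                .iter()
                .map(|(module, line)| Import { module: module.clone(), line: *line })
                .collect()
        })
    }

    /// Best-effort lookup: full module name, or last `/`-separated segment.
    pub fn resolve_receiver(&self, name: &str) -> Option<&Import> {
        if name.is_empty() {
            return None; // assumed behaviour for the empty-string boundary
        }
        self.imports().iter().find(|imp| {
            imp.module == name || imp.module.rsplit('/').next() == Some(name)
        })
    }
}
```

Because `get_or_init` hands back a reference into the cell, two calls to `imports()` yield the same slice pointer, which is the property the cache-stability test checks.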
Both signals/imports_local.rs and workspace.rs::summarize_file used
to run their own copy of:
    let query = Query::new(lang, adapter.import_query()).ok()?;
    let mut qc = QueryCursor::new();
    for m in qc.matches(&query, parsed.root_node(), src) {
        for cap in m.captures { ... adapter.normalize_import(text) ... }
    }
That work now lives in ParsedFile.imports() and runs at most once
per file. Both call sites collapse to a single iterator over
`parsed.imports()`.
Behavioural delta: none. Both consumers were already de-duplicating
on the module string (HashSet in workspace.rs, line-counting in
imports_local.rs); the cache returns a `Vec<Import>` with each
distinct (module, line) pair, so consumer behaviour is preserved.
Drops three unused tree-sitter type imports from workspace.rs
(Parser, Query, QueryCursor) — they were only used by the now-gone
inline import-extraction. The pre-existing unused `ParsedFile`
import in findings.rs is separate scope; left alone.
142 / 142 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces the polyglot needle list (`choice` | `randint` | ... | `Math.random`) with per-language matchers. Each language's matcher hard-codes the receiver path, so safe APIs that share function names (`secrets.choice` shares `choice`, `SecureRandom.nextInt` shares `nextInt`, `crypto/rand.Read` shares `Read`) never match.

Per-language coverage:
- **Python**: `random.X` only (`X` ∈ choice/choices/randint/sample/uniform/randrange/shuffle/random). Excludes `secrets.X`, `os.urandom`, `np.random.X`, `random.SystemRandom().X`.
- **JS / TS**: `Math.random` only.
- **Go**: `math/rand` method set, with import resolution via `parsed.resolve_receiver()` to disambiguate from `crypto/rand`. `import "math/rand"` → `rand.Intn` flags; `import "crypto/rand"` → `rand.Read` doesn't (and `Intn`/`Int31`/etc. don't even exist there). Falls back to a conservative method-name default when no matching import is in the file.
- **Java / Kotlin**: `nextInt`/`nextLong`/etc. on `Random` (or unqualified). `SecureRandom`-prefixed receivers excluded by receiver-path filter.
- **C#**: `Next`/`NextDouble`/`NextBytes`. Excludes `RandomNumberGenerator` and any `Cryptography`-namespace receiver.
- **PHP**: global `rand`/`mt_rand`/`array_rand`. `random_int` and `random_bytes` are CSPRNG and intentionally absent.
- **Rust**: `gen`/`gen_range`/`gen_bool` from the rand crate. `OsRng` excluded by receiver-path filter.

Drops PR #9's `safe_module_prefixes` whitelist hack, which is now redundant: `secrets.choice` never reaches the SEC010 path in the first place because `is_python_weak_rng` requires `first == "random"`.

Threads `&ParsedFile` down through `walk()` so language identity and import facts are available at the SEC010 call site. Other SEC checks unchanged.

Existing 142 tests pass without modification; the dispatch preserves semantics for everything that was already flagging. Per-language regression tests are added in the next commit.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
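To make the receiver-anchoring idea concrete, the following is a rough sketch of what a Python matcher along the lines of `is_python_weak_rng` could look like. Only the function name and the `first == "random"` requirement come from the commit message; the string-based signature and the exact-two-segment rule are assumptions, and the real matcher may work on AST nodes instead.

```rust
/// Sketch of a receiver-anchored Python matcher (signature is assumed).
/// `call_name` is the dotted call path as text, e.g. "random.choice".
pub fn is_python_weak_rng(call_name: &str) -> bool {
    // Method set listed in the commit message.
    const WEAK: [&str; 8] = [
        "choice", "choices", "randint", "sample",
        "uniform", "randrange", "shuffle", "random",
    ];

    let mut parts = call_name.split('.');
    let first = parts.next();
    let method = parts.next();
    let extra = parts.next();

    match (first, method, extra) {
        // Exactly `random.<method>`: anything longer, such as
        // `random.SystemRandom().choice`, has a third segment and is skipped.
        (Some("random"), Some(m), None) => WEAK.contains(&m),
        // `secrets.choice`, `np.random.choice`, `os.urandom` all land here.
        _ => false,
    }
}
```

With this shape no safe-module allowlist is needed: `secrets.choice` fails the `first == "random"` test before the method name is ever compared.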
10 new tests covering the dispatch matrix from PR #10 step 4:

Python:
- random.SystemRandom() CSPRNG-backed → does NOT fire
- numpy np.random.choice (statistical sampling) → does NOT fire

JS / TS:
- Math.random for token (chained .toString[2] form) → fires
- window.crypto.getRandomValues → does NOT fire

Go (with Layer 1 import resolution):
- import "math/rand" + rand.Intn → fires
- import "crypto/rand" + rand.Read → does NOT fire (same literal `rand.Read` text, different module per import resolution)

Java:
- new Random().nextInt(...) literal → fires
- new SecureRandom().nextInt(...) literal → does NOT fire

Note: the variable-stored form (`Random r = new Random(); r.nextInt`) is a known FN. Without dataflow we can't recover the class identity from the receiver `r`, and catching it is worth less than the FP risk on `SecureRandom r; r.nextInt`.

C# / PHP:
- new Random().Next(...) → fires
- PHP rand($min, $max) → fires
- PHP random_int($min, $max) (CSPRNG) → does NOT fire

Also fixes call-name extraction for Java's `method_invocation`, which lacks a `function` field: it now composes `object.name` from the two child fields so the receiver path is visible to the matcher. And extends `enclosing_token_context` to recognize Go (var_spec / short_var_declaration), Java (local_variable_declaration), and JS augmented assignment kinds, so the token-context detection works across the languages now in scope.

153 / 153 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 6, 2026
wei9072 added a commit that referenced this pull request on May 6, 2026
Round 9 validation surfaced that SEC009's `last == "md5" | "sha1"` matcher only fired on Python `hashlib.md5(...)` and never on Go, Java, C#, PHP, or Node, which is exactly the same multi-language coverage gap PR #10 fixed for SEC010.

Per-language matchers, mirroring PR #10's architecture:
- **Python**: `hashlib.md5` / `hashlib.sha1`. Receiver-anchored, excludes `Crypto.Hash.MD5.new` from PyCryptodome (out of scope for now).
- **Node / JS**: `crypto.createHash('md5'|'sha1'|'sha-1')`. The algorithm lives in the first string arg; new `first_arg_is_weak_alg_string()` helper inspects the literal.
- **Go**: `md5.Sum` / `md5.New` / `sha1.Sum` / `sha1.New` with Layer 1 import resolution against `crypto/md5` and `crypto/sha1`. Round 9's `h := md5.Sum([]byte(password))` case.
- **Java / Kotlin**: `MessageDigest.getInstance("MD5")` (string arg), Apache Commons `DigestUtils.md5Hex` / `sha1Hex`. Round 9's `MessageDigest.getInstance("MD5")` case.
- **C#**: `MD5.Create()` / `SHA1.Create()` / `MD5CryptoServiceProvider` / `MD5Managed` and SHA1 equivalents. Receiver-anchored to avoid matching `SHA1024` / `SomeMD5Field`.
- **PHP**: global `md5(...)` / `sha1(...)`, plus `hash('md5', ...)` / `hash('sha1', ...)` (string arg). `hash('sha256', ...)` stays silent.

Also fixes `enclosing_security_context` with the same improvements PR #11 applied to `enclosing_token_context`:
- Multi-language assignment node kinds (Go `short_var_declaration`, Java `local_variable_declaration`, etc.).
- Function-name needle check at function-shape level. Round 9 case: `func HashPassword(password string)` calls `md5.Sum` via local `h :=`; the function name carries the `password` needle even though the local assignment doesn't.
- Walks past inner blocks (don't break at for / if body, only at function shape).

Tests: 8 new. Node createHash md5 (positive) / sha256 (negative), Go md5.Sum + crypto/md5 import, Java MessageDigest.getInstance MD5 (positive) / SHA-256 (negative), PHP md5 (positive) / hash sha256 (negative), C# MD5.Create. 164 / 164 tests pass.

Live MCP confirmation: Round 9's starting-go/auth.go and starting-java/Auth.java now both fire all three planted findings (SEC009, SEC010, SEC012) instead of 2 / 1 respectively.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
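As an illustration of the "algorithm lives in the first string arg" check, here is a text-based sketch of a helper in the spirit of `first_arg_is_weak_alg_string()`. Only the helper's name and the `'md5'|'sha1'|'sha-1'` set come from the commit message. The signature (raw call text in, `bool` out), the case-insensitive comparison, and the quote handling are assumptions; the real helper presumably inspects the AST string-literal node rather than slicing text.

```rust
/// Sketch: does the first argument of `call_text` look like a string
/// literal naming a weak hash algorithm? (Text-based approximation.)
pub fn first_arg_is_weak_alg_string(call_text: &str) -> bool {
    // Everything after the first '(' is treated as the argument list.
    let Some(open) = call_text.find('(') else {
        return false;
    };
    // First argument: text up to the first ',' or ')'.
    let first = call_text[open + 1..]
        .split(|c: char| c == ',' || c == ')')
        .next()
        .unwrap_or("")
        .trim();

    // Require a quoted literal; identifiers like `alg` are not inspected.
    let is_quote = |c: char| c == '\'' || c == '"';
    let quoted = first.len() >= 2 && first.starts_with(is_quote) && first.ends_with(is_quote);
    if !quoted {
        return false;
    }

    // Strip the one-byte ASCII quotes and compare case-insensitively (assumed).
    let inner = &first[1..first.len() - 1];
    matches!(inner.to_ascii_lowercase().as_str(), "md5" | "sha1" | "sha-1")
}
```

A non-literal first argument (for example a variable holding the algorithm name) returns `false` here, which matches the conservative, literal-only behaviour the commit describes.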
wei9072 added a commit that referenced this pull request on May 7, 2026
…he (#16)

Continues PR #10's pattern: any per-file fact whose extraction was duplicated across consumers belongs in `ast::*` with a `OnceCell` cache on `ParsedFile`.

Both extractors used to live as private helpers inside `workspace.rs::summarize_file`. They moved to a new `ast::symbols` module:
- `extract_public_symbols(parsed) -> HashSet<String>`: top-level function / class / type / trait / variable names this file exposes. Skips function bodies (nested local helpers are not the file's public API). Rust requires the `pub` modifier.
- `extract_imported_symbols(parsed) -> HashSet<String>`: names pulled in via `from X import Y` (Python) or `import { a, b } from 'x'` (TS / JS).
- `walk_export()`: preserved verbatim for `export default` / `export { a, b as c }` shapes.
- `is_public_name()` / `is_likely_public()`: moved alongside.

`ParsedFile` grows two more lazy caches:
- `public_symbols() -> &HashSet<String>`
- `imported_symbols() -> &HashSet<String>`

`workspace.rs::summarize_file` now reads both caches instead of walking the tree itself. Net: -158 / +257 (the +257 includes 100 lines of moved code + 50 lines of new tests inside ast::symbols + 21 lines of new ParsedFile API).

3 new unit tests in `ast::symbols`: Python public-fn / Python private-skip, Rust pub-only filter, Python `from … import …` including `aliased_import`. `workspace.rs` is unchanged behaviour-wise; the same 11 existing workspace tests still pass. Total: 164 → 167 tests pass.

Architectural rationale: this is the third per-file fact to move into Layer 1 (after `Import` in PR #10; Layer 1 was already the provider of `ParsedFile`). The pattern is now stable enough that any future "per-file derived fact" (call_sites, function definitions, etc.) just adds another `OnceCell` field + `extract_X` function.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
wei9072 added a commit that referenced this pull request on May 7, 2026
Closes the last gap from PR #10's Go SEC010 dispatch. Previously `import myrand "math/rand"` parsed as a single import with module "math/rand" and no alias; resolve_receiver("myrand") missed because the last segment of "math/rand" is "rand", not "myrand". SEC010 then silently passed `myrand.Intn(...)` even though it's the same weak RNG.

Two changes:

1. **`Import.alias: Option<String>`**: new field on the Import struct in `ast::imports`. Populated when the captured path's parent `import_spec` node has a `name` field (Go's syntax for import renaming). Other languages return None for now. Filters out Go's `_` (blank import for side-effects) and `.` (dot import) so resolve_receiver doesn't try to match those.
2. **`ParsedFile::resolve_receiver` two-pass lookup**:
   - Pass 1: explicit alias match. Aliased imports always win (`myrand` resolves to `math/rand`).
   - Pass 2: last-segment / module-name match, but **skips aliased imports** so `rand` no longer falsely resolves to a `math/rand` that was imported as `myrand`.

Tests: 5 new. Go aliased import captures alias, Go unaliased returns None, blank/dot imports skipped, alias preferred over last-segment, mixed aliased+unaliased in same file.

Plus 2 SEC010 end-to-end tests:
- `myrand.Intn` (aliased math/rand) → fires
- `crand.Read` (aliased crypto/rand) → does NOT fire

Total: 167 → 174 tests pass.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
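The two-pass lookup can be sketched roughly as below. This is an illustration under assumptions rather than the merged code: `resolve_receiver` is written as a free function over a slice instead of a `ParsedFile` method, `usable_alias` is a hypothetical helper name for the `_` / `.` filter, and `Import` is reduced to the two fields that matter here.

```rust
/// Reduced stand-in for `Import` (only the fields this sketch needs).
#[derive(Debug, Clone, PartialEq)]
pub struct Import {
    pub module: String,
    pub alias: Option<String>,
}

/// Hypothetical helper: Go's blank (`_`) and dot (`.`) import names are
/// not usable as receivers, so they are stored as "no alias".
pub fn usable_alias(raw: Option<&str>) -> Option<String> {
    match raw {
        Some("_") | Some(".") | None => None,
        Some(name) => Some(name.to_string()),
    }
}

/// Two-pass receiver resolution over a file's imports.
pub fn resolve_receiver<'a>(imports: &'a [Import], name: &str) -> Option<&'a Import> {
    if name.is_empty() {
        return None;
    }
    // Pass 1: an explicit alias always wins.
    if let Some(hit) = imports.iter().find(|imp| imp.alias.as_deref() == Some(name)) {
        return Some(hit);
    }
    // Pass 2: full module name or last path segment, but only for
    // imports without an alias, so `rand` cannot resolve to a
    // `math/rand` that was imported as `myrand`.
    imports
        .iter()
        .filter(|imp| imp.alias.is_none())
        .find(|imp| imp.module == name || imp.module.rsplit('/').next() == Some(name))
}
```

In a file with `import myrand "math/rand"` next to a plain `import "crypto/rand"`, `myrand` resolves to `math/rand` in pass 1 and `rand` resolves to `crypto/rand` in pass 2, which is the mixed aliased+unaliased case the tests cover.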
wei9072 added a commit that referenced this pull request on May 7, 2026
Closes the audit gaps identified after PRs #10 / #12. The audit across SEC003-008:

| Rule | Coverage before | Action this PR |
|---|---|---|
| SEC003 TLS off | text-level Python/Node/Go/.NET | unchanged (decent already) |
| SEC004 shell | Python `shell=True`+interp only | **language-aware dispatch** |
| SEC005 SQL concat | text+string-literal Python/Java | unchanged (decent) |
| SEC006 CORS | text-level cross-language | unchanged |
| SEC007 JWT | `name.contains("jwt")` Python only | **language-aware dispatch** |
| SEC008 deser | Python/Node/Java idioms | unchanged (decent) |

## SEC004 expansion

Per-language shell-running idioms; requires interpolation in arg.

- **Python**: subprocess.run/Popen with `shell=True` + interp
- **Node.js**: `child_process.exec` / `execSync` with interp (always shells out, no `shell:true` gate; `execFile` is the safe one)
- **PHP**: global `shell_exec` / `exec` / `passthru` / `system` / `proc_open` with interp
- **Java**: `Runtime.getRuntime().exec(String)` overload with concat; the String[] overload is safe and excluded
- **Go**: `exec.Command("sh"|"bash"|"/bin/sh"|"/bin/bash", "-c", ...)` with interp. Bare `exec.Command("ls", arg)` (argv-style) is excluded: no shell metachar interpretation

`text_has_interp` extended with PHP `.` concat (gated on `$` to avoid floating-point literals).

## SEC007 expansion

Per-language JWT decode without verification:

- **Python**: `jwt.decode(...)` without algorithms/key/verify kwarg (existing behaviour)
- **Node.js**: `jsonwebtoken.decode()` always returns unverified claims, so it is flagged unconditionally; `verify(token, secret, opts)` is the safe API. `verify()` with `verify: false` opt also flagged.
- **Java / Kotlin**: Auth0 lib's `JWT.decode(token)` returns unverified DecodedJWT; safe path is `JWT.require(...).build().verify(token)`.
- **PHP**: firebase/php-jwt's `JWT::decode($token, $key)` requires explicit algorithm list. Flagged unless one of `'HS256'`/`'RS256'`/`'ES256'`/`'EdDSA'` appears in call text.

Algorithm-`none` detection extended with JWT-spec literal `"alg": "none"` shape.

`check_jwt_unsafe` now takes `&ParsedFile` so language identity is available; this prevents PHP `JWT::decode` from being misclassified as Java (the old `name.contains("JWT")` check was language-blind).

## Infrastructure changes

1. **`call_name` extended for PHP scoped/member calls.** Previously only handled Java's `method_invocation`; now also composes `Class::method` from `scoped_call_expression` and `$obj->method` from `member_call_expression`.
2. **`leaf_method_name(name)` helper**: splits on `.` / `::` / `->` so `JWT::decode`'s leaf is `decode`, not the whole string.
3. **walk dispatch** extended with `scoped_call_expression` and `member_call_expression` node kinds.

## Tests

10 new (5 SEC004 multi-lang + 5 SEC007 multi-lang). 174 → **177** total tests passing.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
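A possible shape for the `leaf_method_name(name)` helper, sketched from the one-line description above. The `&str -> &str` signature and the separator-by-separator strategy are assumptions; only the name and the three separators (`.`, `::`, `->`) come from the commit message.

```rust
/// Sketch: return the trailing method name of a composed call path.
/// Handles `.`, `::` and `->` separators, in any combination.
pub fn leaf_method_name(name: &str) -> &str {
    let mut leaf = name;
    // After each step `leaf` is a suffix that no longer contains that
    // separator, so one pass per separator is enough.
    for sep in ["::", "->", "."] {
        if let Some(idx) = leaf.rfind(sep) {
            leaf = &leaf[idx + sep.len()..];
        }
    }
    leaf
}
```

With this, `JWT::decode` and `jwt.decode` both reduce to `decode`, so the per-language SEC007 matchers can compare the leaf while still seeing the full receiver path in `call_name`.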
Summary
Refactors per-file fact extraction (currently: just imports) into Layer 1: `ParsedFile` now owns a lazy `OnceCell<Vec<Import>>` cache populated by a single shared `extract_imports()` call. SEC010 is then rebuilt on top of this so per-language matchers can resolve the receiver's import and disambiguate cases like Go's `math/rand` vs `crypto/rand` (same `rand.Read` literal, different module).
Why
Two threads of feedback drove this:
- The `secrets.choice` FP we tripped over. The architectural problem was the SEC010 `last == "choice"` matcher being Python-centric and not knowing what module the receiver came from.
- Per-file facts (imports / public symbols / etc.). Currently every consumer re-runs `Query::new(lang, adapter.import_query()) + qc.matches(...)` independently: `signals/imports_local.rs` and `workspace.rs::summarize_file` both did the work, and security would have been the third copy.
Commits
1. `feat(aegis-core): ast::imports — per-file Import struct + extractor`: new module wraps the common extraction. Pure function; caching added in next commit.
2. `feat(aegis-core): ParsedFile.imports() lazy cache + resolve_receiver()`: `OnceCell<Vec<Import>>` on ParsedFile, plus a best-effort `resolve_receiver(name)` lookup (last-segment match for paths like `math/rand` → `rand`).
3. `refactor(aegis-core): migrate consumers to ParsedFile.imports() cache`: imports_local.rs and workspace.rs's summarize_file both shrink to a single iterator over `parsed.imports()`. Behaviour preserved; HashSet/Vec semantics handled at consumer level.
4. `feat(security): SEC010 language-aware dispatch over Layer 1 imports`: per-language matchers (Python/JS/Go/Java/C#/PHP/Rust) replace the polyglot needle list. Drops PR #9's `safe_module_prefixes` allowlist hack, which is now redundant. Threads `&ParsedFile` through `walk()`. Existing 142 tests pass without modification.
5. `test(security): SEC010 per-language regression coverage`: 10 new positive/negative tests across Python / JS / Go / Java / C# / PHP. Also fixes Java `call_name` extraction (composes `object.name` for `method_invocation`) and extends `enclosing_token_context` to recognize Go / Java / Rust assignment node kinds.
Coverage matrix
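| Language | Pattern | Notes |
|---|---|---|
| Python | `random.X` | |
| Python | `random.SystemRandom().X` | excluded |
| Python | `secrets.X` (FP risk) | `first == "random"` |
| Python | `np.random.choice` (FP risk) | |
| JS / TS | `Math.random` | |
| Go | `math/rand.Intn` | |
| Go | `crypto/rand.Read` (FP risk) | |
| Java | `new Random().nextInt` | |
| Java | `new SecureRandom().nextInt` (FP) | |
| Java | `Random r; r.nextInt` | |
| C# | `new Random().Next` | |
| PHP | `rand()` | |
| PHP | `random_int()` (FP risk) | |
| Rust | `rand::thread_rng().gen()` | |

Test results
- `cargo test --workspace`: 153 / 153 pass (was 132; +21 covering new dispatch + new Layer 1 cache tests + import-extraction unit tests).
- Live MCP sanity-checked on JS / Go / Java samples; per-language matchers fire / stay silent as designed.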
Test plan
- `cargo test --workspace`: 153 / 153 pass
- `cargo install --path crates/aegis-mcp --force` succeeds
- `import "math/rand"; rand.Intn` test
- `import "crypto/rand"; rand.Read` test
- (`secrets.choice` / `os.urandom` stay silent)
- (`random.choice` for token, etc.)
🤖 Generated with Claude Code