
feat: implement Phase 2-4 - AST symbols, project metadata, doc-code sync#9

Merged
avifenesh merged 7 commits into main from feat/phase-2-4-implementation
Mar 22, 2026

Conversation

@avifenesh
Contributor

Summary

  • Phase 2 (analyzer-repo-map): tree-sitter AST extraction for Rust, TypeScript, JavaScript, Python, Go, Java. Extracts exports, imports, definitions, struct/enum fields, interface properties, class methods. Computes cyclomatic complexity. Detects naming conventions and test patterns.
  • Phase 3 (analyzer-collectors): Project metadata - README sections, CI detection, license detection, language distribution, package manager.
  • Phase 4 (analyzer-sync-check): Doc-code cross-referencing with pulldown-cmark. Matches inline code refs against symbol table with camelCase-to-snake_case. Detects stale refs and hotspot references.

All phases are integrated into the repo-intel init/update pipeline. 142 tests (up from 77); clippy clean.
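The camelCase-to-snake_case matching that Phase 4 relies on can be sketched as below. This is a minimal illustration, not the PR's actual implementation; `to_snake_case` is a hypothetical helper name:

```rust
// Minimal sketch of camelCase -> snake_case normalization for matching doc
// refs against Rust symbols. `to_snake_case` is a hypothetical helper, not
// necessarily the PR's API; runs of consecutive capitals are not special-cased.
fn to_snake_case(name: &str) -> String {
    let mut out = String::with_capacity(name.len() + 4);
    for (i, ch) in name.chars().enumerate() {
        if ch.is_ascii_uppercase() {
            // Insert an underscore before interior uppercase letters.
            if i > 0 {
                out.push('_');
            }
            out.push(ch.to_ascii_lowercase());
        } else {
            out.push(ch);
        }
    }
    out
}

fn main() {
    // A doc ref like `getUserName` would match a Rust symbol `get_user_name`.
    assert_eq!(to_snake_case("getUserName"), "get_user_name");
    assert_eq!(to_snake_case("parse"), "parse");
    println!("ok");
}
```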

Validation results (16 repos)

| Repo | Lang | Files | Defs | Match% | Hotspot |
|------|------|------:|-----:|-------:|--------:|
| agent-analyzer | Rust | 34 | 796 | 86% | 63 |
| glide-mq | TS | 176 | 1072 | 73% | 565 |
| glidemq-hono | TS | 20 | 103 | 60% | 5 |
| glidemq-fastify | TS | 20 | 117 | 57% | 4 |
| next-task | JS | 102 | 647 | 27% | 0 |
| agentsys | JS | 226 | 911 | 16% | 43 |

JS plugin repos score lower because their source is the shared agent-core library, while their docs reference prompt/skill concepts.

Test plan

  • `cargo test --workspace` - 142 tests pass
  • `cargo clippy --workspace -- -D warnings` - clean
  • Validated on 16 repos across 4 languages
  • Backward compatible - new fields are `Option` with `serde(default)`
  • CI passes (cargo test + clippy + fmt)

Phase 2 (analyzer-repo-map): tree-sitter AST extraction for 6 languages
(Rust, TypeScript, JavaScript, Python, Go, Java). Extracts exports,
imports, definitions, struct/enum fields, interface properties, class
methods, and enum variants. Computes cyclomatic complexity. Detects
naming conventions and test patterns. Builds import graph.

Phase 3 (analyzer-collectors): project metadata gathering - README
sections, CI provider detection, license detection (SPDX + file
pattern), language distribution, package manager detection.
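The language-distribution step can be sketched as counting files per language by extension. This is a minimal stdlib-only illustration; the extension-to-language mapping and function names are illustrative, not the PR's actual table:

```rust
// Sketch of language distribution by file extension (Phase 3 style).
// The extension -> language mapping is illustrative, not the PR's table.
use std::collections::HashMap;
use std::path::Path;

fn language_for(path: &str) -> Option<&'static str> {
    match Path::new(path).extension()?.to_str()? {
        "rs" => Some("Rust"),
        "ts" => Some("TypeScript"),
        "js" => Some("JavaScript"),
        "py" => Some("Python"),
        "go" => Some("Go"),
        "java" => Some("Java"),
        _ => None,
    }
}

fn distribution(files: &[&str]) -> HashMap<&'static str, usize> {
    let mut counts = HashMap::new();
    for f in files {
        if let Some(lang) = language_for(f) {
            *counts.entry(lang).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    let files = ["src/main.rs", "src/lib.rs", "web/app.ts", "README.md"];
    let dist = distribution(&files);
    assert_eq!(dist["Rust"], 2);
    assert_eq!(dist["TypeScript"], 1);
    assert!(!dist.contains_key("Markdown")); // unmapped extensions are skipped
    println!("{:?}", dist);
}
```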

Phase 4 (analyzer-sync-check): doc-code cross-referencing using
pulldown-cmark. Matches inline code refs against AST symbol table with
camelCase-to-snake_case resolution. Detects stale references (deleted,
renamed symbols) and hotspot references. Comprehensive noise filtering
for URLs, file paths, CLI flags, env vars, stdlib globals.
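The kind of noise filtering described above can be sketched as a predicate over inline code spans. The exact heuristics in the PR may differ; this is illustrative only:

```rust
// Sketch of Phase 4-style noise filtering: skip inline code spans that look
// like URLs, file paths, CLI flags, or env vars rather than code symbols.
// The heuristics here are illustrative, not the PR's actual rules.
fn is_noise(code_ref: &str) -> bool {
    code_ref.starts_with("http://")
        || code_ref.starts_with("https://")
        || code_ref.starts_with('-') // CLI flags: -v, --workspace
        || code_ref.contains('/')    // file paths and URLs
        || (code_ref.contains('_')
            && code_ref.chars().all(|c| c.is_ascii_uppercase() || c == '_')) // env vars
}

fn main() {
    assert!(is_noise("--workspace"));
    assert!(is_noise("src/main.rs"));
    assert!(is_noise("CARGO_TARGET_DIR"));
    assert!(!is_noise("parse_rust_use")); // a real symbol passes through
    println!("ok");
}
```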

All phases integrated into repo-intel init/update pipeline. 142 tests,
clippy clean. Validated on 16 repos across Rust (86% match), TypeScript
(57-73% match), Go, and mixed-language projects.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a major expansion of the repo-intel analyzer by implementing Phases 2, 3, and 4. These phases introduce deep code understanding through AST analysis, comprehensive project metadata collection, and intelligent documentation-code synchronization checks. The enhancements provide a more holistic and actionable view of repository health and structure, moving beyond basic Git intelligence to offer detailed insights into code quality, project setup, and documentation accuracy.

Highlights

  • Phase 2: AST Symbol Extraction: Implemented AST (Abstract Syntax Tree) symbol extraction for Rust, TypeScript, JavaScript, Python, Go, and Java using tree-sitter. This includes extracting exports, imports, definitions, struct/enum fields, interface properties, and class methods. It also computes cyclomatic complexity for functions and detects naming conventions and test patterns.
  • Phase 3: Project Metadata Collection: Added functionality to collect comprehensive project metadata, such as README sections, CI provider detection (GitHub Actions, GitLab CI, etc.), license detection (SPDX from manifests and file pattern matching), language distribution by file extension, and package manager identification.
  • Phase 4: Doc-Code Synchronization Check: Introduced doc-code cross-referencing capabilities. This involves parsing Markdown files to extract inline code references, matching these references against the generated symbol table (including camelCase-to-snake_case conversion), and detecting stale references (deleted, renamed, or referencing 'hotspot' files).
  • CLI Integration and Test Expansion: All new phases are integrated into the existing repo-intel init and update commands. The total number of tests has significantly increased from 77 to 142, ensuring robust coverage for the new features.


@gemini-code-assist (bot) left a comment


Code Review

This is a massive and impressive feature implementation, adding phases 2-4 for AST analysis, metadata collection, and doc-code sync. The changes are extensive and well-structured across new and existing crates. The implementation of the new features looks solid, with good use of tree-sitter for AST parsing and pulldown-cmark for documentation analysis. My review focuses on improving robustness, error handling, and maintainability in a few key areas. I've identified a fragile TOML parsing implementation that should be replaced, some inconsistencies in error handling within the CLI commands, and a few places where errors are silently ignored, which could be improved with warnings. Overall, this is a significant step forward for the project.

Comment on lines +87 to +145
```rust
if let Ok(content) = std::fs::read_to_string(repo_path.join("Cargo.toml")) {
    let mut in_workspace_package = false;
    for line in content.lines() {
        let trimmed = line.trim();

        // Track [workspace.package] section for workspace-level license
        if trimmed.starts_with('[') {
            in_workspace_package = trimmed == "[workspace.package]";
        }

        // Match `license = "MIT"` but not `license.workspace = true`
        if trimmed.starts_with("license") && trimmed.contains('=') {
            // Skip workspace delegation (license.workspace = true)
            if trimmed.starts_with("license.workspace") || trimmed.starts_with("license.path") {
                continue;
            }
            let value = trimmed
                .split('=')
                .nth(1)?
                .trim()
                .trim_matches('"')
                .trim_matches('\'');
            if !value.is_empty() && !value.contains('{') && value != "true" && value != "false" {
                return Some(value.to_string());
            }
        }
    }

    // If this is a workspace member with `license.workspace = true`,
    // check if we found it in [workspace.package] section above
    if !in_workspace_package {
        // Re-scan for [workspace.package] license
        let mut in_ws = false;
        for line in content.lines() {
            let trimmed = line.trim();
            if trimmed == "[workspace.package]" {
                in_ws = true;
                continue;
            }
            if in_ws && trimmed.starts_with('[') {
                break;
            }
            if in_ws && trimmed.starts_with("license") && trimmed.contains('=') {
                if trimmed.starts_with("license.") {
                    continue;
                }
                let value = trimmed
                    .split('=')
                    .nth(1)?
                    .trim()
                    .trim_matches('"')
                    .trim_matches('\'');
                if !value.is_empty() && value != "true" && value != "false" {
                    return Some(value.to_string());
                }
            }
        }
    }
}
```

Severity: high

The current implementation for parsing Cargo.toml uses line-by-line string manipulation. This is fragile and can fail with valid TOML syntax like comments, different string quoting, or key ordering. For example, `license = "MIT" # My license` would not be parsed correctly.

To make this more robust and maintainable, I recommend using a dedicated TOML parser like the toml crate. This would be similar to how serde_json is used for package.json in the same function.

Comment on lines +85 to +89
```rust
eprintln!(
    "[ERROR] No symbol data in map file. Run `repo-intel init` first."
);
std::process::exit(1);
}
```

Severity: medium

Using std::process::exit(1) here (and in the Dependents command and sync_check.rs) bypasses the Result-based error handling of the run function. It's better to return an error and let the main function handle printing it and exiting. This improves consistency and makes the function more reusable. You can use anyhow::bail! for this.

Suggested change
```diff
-eprintln!(
-    "[ERROR] No symbol data in map file. Run `repo-intel init` first."
-);
-std::process::exit(1);
-}
+anyhow::bail!("[ERROR] No symbol data in map file. Run `repo-intel init` first.");
```

Comment on lines +120 to +123
```rust
fn load_map(path: &PathBuf) -> Result<RepoIntelData> {
    let json = std::fs::read_to_string(path)?;
    Ok(serde_json::from_str(&json)?)
}
```

Severity: medium

This load_map function is duplicated in crates/analyzer-cli/src/commands/sync_check.rs. The other version provides more helpful error context using anyhow::with_context. To improve error messages and reduce code duplication, you should use the more descriptive version and centralize it. Consider moving this helper function to a shared module like crates/analyzer-cli/src/commands/utils.rs.

Suggested change
```diff
-fn load_map(path: &PathBuf) -> Result<RepoIntelData> {
-    let json = std::fs::read_to_string(path)?;
-    Ok(serde_json::from_str(&json)?)
-}
+fn load_map(path: &PathBuf) -> Result<RepoIntelData> {
+    let json = std::fs::read_to_string(path)
+        .with_context(|| format!("failed to read map file: {}", path.display()))?;
+    serde_json::from_str(&json).context("failed to parse map JSON")
+}
```

Comment on lines +80 to +81
```rust
})
.ok();
```

Severity: medium

The .ok() call here silently swallows any I/O errors that might occur during walk::walk_files. This could lead to incomplete language statistics without any indication that something went wrong. It would be more robust to propagate the error.

I recommend changing detect_languages to return a Result<Vec<LanguageInfo>> and using ? to propagate the error from walk_files. The caller, collect_metadata, already returns a Result, so it can handle the propagated error.

```rust
    .ok();
```

Comment on lines +53 to +56
```rust
match extract_file_symbols(&source, *lang) {
    Ok(syms) => Some((rel.clone(), syms)),
    Err(_) => None, // skip files that fail to parse
}
```

Severity: medium

Files that fail to parse are silently skipped. This can hide issues with the parsers or unexpected file contents from the user. It would be better to log a warning to stderr when a file cannot be parsed, so the user is aware of which files were skipped and why.

```rust
match extract_file_symbols(&source, *lang) {
    Ok(syms) => Some((rel.clone(), syms)),
    Err(e) => {
        eprintln!("[WARN] Failed to parse {}: {}", rel, e);
        None
    }
}
```

Comment on lines +357 to +372
```rust
fn parse_rust_use(node: &tree_sitter::Node, source: &[u8]) -> Option<ImportEntry> {
    let text = node.utf8_text(source).ok()?;
    // Simple parse: "use crate::foo::Bar;" or "use std::collections::HashMap;"
    let text = text.trim().trim_start_matches("use ").trim_end_matches(';');
    let parts: Vec<&str> = text.split("::").collect();
    if parts.len() >= 2 {
        let module = parts[..parts.len() - 1].join("::");
        let name = parts.last().unwrap_or(&"").to_string();
        Some(ImportEntry {
            from: module,
            names: vec![name],
        })
    } else {
        None
    }
}
```

Severity: medium

The parse_rust_use function uses a simple string split to parse use declarations. This is fragile and will not correctly handle more complex but valid use statements, such as grouped imports (use std::collections::{HashMap, HashSet};), aliasing (use std::fmt::Result as FmtResult;), or wildcard imports (use std::io::*;). This could lead to an incomplete or incorrect import graph.

A more robust approach would be to traverse the use_declaration node's syntax tree to extract the path components and imported items accurately. Given the scope of this PR, a full rewrite might not be necessary, but it's a significant limitation to be aware of for future improvements.
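As a stopgap short of the full tree-sitter traversal the reviewer suggests, the grouped and aliased forms could be handled at the string level. This is a hedged illustration only; `parse_use` and `alias_target` are hypothetical helpers, and wildcard imports are still not covered:

```rust
// String-level sketch of handling grouped and aliased `use` forms that the
// simple split misses. A production version would walk the tree-sitter
// `use_declaration` node instead; these helper names are illustrative.
fn parse_use(stmt: &str) -> Option<(String, Vec<String>)> {
    let body = stmt.trim().strip_prefix("use ")?.trim_end_matches(';').trim();
    if let Some((module, group)) = body.split_once("::{") {
        // Grouped import: use std::collections::{HashMap, HashSet};
        let names = group
            .trim_end_matches('}')
            .split(',')
            .map(|n| alias_target(n.trim()))
            .collect();
        return Some((module.to_string(), names));
    }
    // Plain or aliased import: use std::fmt::Result as FmtResult;
    let (module, last) = body.rsplit_once("::")?;
    Some((module.to_string(), vec![alias_target(last)]))
}

// For `Result as FmtResult`, record the local alias name.
fn alias_target(item: &str) -> String {
    item.rsplit(" as ").next().unwrap_or(item).trim().to_string()
}

fn main() {
    assert_eq!(
        parse_use("use std::collections::{HashMap, HashSet};"),
        Some(("std::collections".into(), vec!["HashMap".into(), "HashSet".into()]))
    );
    assert_eq!(
        parse_use("use std::fmt::Result as FmtResult;"),
        Some(("std::fmt".into(), vec!["FmtResult".into()]))
    );
    println!("ok");
}
```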

…ands

New repo-intel query types for Phase 2-4 data:
- symbols <file>: AST symbols from the symbol table
- dependents <symbol>: reverse lookup of symbol importers
- stale-docs: symbol-level doc staleness from sync-check
- project-info: project metadata (languages, CI, license)

Consistent query pattern for all consumers - plugins call
repo-intel query <type> instead of separate commands.
@avifenesh merged commit e125281 into main Mar 22, 2026
4 checks passed
