
feat: implement Phase 2-4 - AST symbols, project metadata, doc-code sync#9

Merged
avifenesh merged 7 commits into main from feat/phase-2-4-implementation
Mar 22, 2026

Conversation

@avifenesh
Contributor

Summary

  • Phase 2 (analyzer-repo-map): tree-sitter AST extraction for Rust, TypeScript, JavaScript, Python, Go, Java. Extracts exports, imports, definitions, struct/enum fields, interface properties, class methods. Computes cyclomatic complexity. Detects naming conventions and test patterns.
  • Phase 3 (analyzer-collectors): Project metadata - README sections, CI detection, license detection, language distribution, package manager.
  • Phase 4 (analyzer-sync-check): Doc-code cross-referencing with pulldown-cmark. Matches inline code refs against symbol table with camelCase-to-snake_case. Detects stale refs and hotspot references.

All phases are integrated into the repo-intel init/update pipeline. 142 tests (up from 77); clippy clean.
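The camelCase-to-snake_case matching that Phase 4 relies on can be sketched as below. This is a minimal illustration, not the PR's actual implementation; `to_snake_case` is a hypothetical helper name:

```rust
// Minimal sketch of camelCase -> snake_case normalization for matching doc
// refs against Rust symbols. `to_snake_case` is a hypothetical helper, not
// necessarily the PR's API; runs of consecutive capitals are not special-cased.
fn to_snake_case(name: &str) -> String {
    let mut out = String::with_capacity(name.len() + 4);
    for (i, ch) in name.chars().enumerate() {
        if ch.is_ascii_uppercase() {
            // Insert an underscore before interior uppercase letters.
            if i > 0 {
                out.push('_');
            }
            out.push(ch.to_ascii_lowercase());
        } else {
            out.push(ch);
        }
    }
    out
}

fn main() {
    // A doc ref like `getUserName` would match a Rust symbol `get_user_name`.
    assert_eq!(to_snake_case("getUserName"), "get_user_name");
    assert_eq!(to_snake_case("parse"), "parse");
    println!("ok");
}
```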

Validation results (16 repos)

| Repo | Lang | Files | Defs | Match% | Hotspot |
|------|------|------:|-----:|-------:|--------:|
| agent-analyzer | Rust | 34 | 796 | 86% | 63 |
| glide-mq | TS | 176 | 1072 | 73% | 565 |
| glidemq-hono | TS | 20 | 103 | 60% | 5 |
| glidemq-fastify | TS | 20 | 117 | 57% | 4 |
| next-task | JS | 102 | 647 | 27% | 0 |
| agentsys | JS | 226 | 911 | 16% | 43 |

JS plugin repos score lower because their source is the shared agent-core library, while their docs reference prompt/skill concepts.

Test plan

  • `cargo test --workspace` - 142 tests pass
  • `cargo clippy --workspace -- -D warnings` - clean
  • Validated on 16 repos across 4 languages
  • Backward compatible - new fields are `Option` with `serde(default)`
  • CI passes (cargo test + clippy + fmt)

Phase 2 (analyzer-repo-map): tree-sitter AST extraction for 6 languages
(Rust, TypeScript, JavaScript, Python, Go, Java). Extracts exports,
imports, definitions, struct/enum fields, interface properties, class
methods, and enum variants. Computes cyclomatic complexity. Detects
naming conventions and test patterns. Builds import graph.

Phase 3 (analyzer-collectors): project metadata gathering - README
sections, CI provider detection, license detection (SPDX + file
pattern), language distribution, package manager detection.
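The language-distribution step can be sketched as counting files per language by extension. This is a minimal stdlib-only illustration; the extension-to-language mapping and function names are illustrative, not the PR's actual table:

```rust
// Sketch of language distribution by file extension (Phase 3 style).
// The extension -> language mapping is illustrative, not the PR's table.
use std::collections::HashMap;
use std::path::Path;

fn language_for(path: &str) -> Option<&'static str> {
    match Path::new(path).extension()?.to_str()? {
        "rs" => Some("Rust"),
        "ts" => Some("TypeScript"),
        "js" => Some("JavaScript"),
        "py" => Some("Python"),
        "go" => Some("Go"),
        "java" => Some("Java"),
        _ => None,
    }
}

fn distribution(files: &[&str]) -> HashMap<&'static str, usize> {
    let mut counts = HashMap::new();
    for f in files {
        if let Some(lang) = language_for(f) {
            *counts.entry(lang).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    let files = ["src/main.rs", "src/lib.rs", "web/app.ts", "README.md"];
    let dist = distribution(&files);
    assert_eq!(dist["Rust"], 2);
    assert_eq!(dist["TypeScript"], 1);
    assert!(!dist.contains_key("Markdown")); // unmapped extensions are skipped
    println!("{:?}", dist);
}
```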

Phase 4 (analyzer-sync-check): doc-code cross-referencing using
pulldown-cmark. Matches inline code refs against AST symbol table with
camelCase-to-snake_case resolution. Detects stale references (deleted,
renamed symbols) and hotspot references. Comprehensive noise filtering
for URLs, file paths, CLI flags, env vars, stdlib globals.
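The kind of noise filtering described above can be sketched as a predicate over inline code spans. The exact heuristics in the PR may differ; this is illustrative only:

```rust
// Sketch of Phase 4-style noise filtering: skip inline code spans that look
// like URLs, file paths, CLI flags, or env vars rather than code symbols.
// The heuristics here are illustrative, not the PR's actual rules.
fn is_noise(code_ref: &str) -> bool {
    code_ref.starts_with("http://")
        || code_ref.starts_with("https://")
        || code_ref.starts_with('-') // CLI flags: -v, --workspace
        || code_ref.contains('/')    // file paths and URLs
        || (code_ref.contains('_')
            && code_ref.chars().all(|c| c.is_ascii_uppercase() || c == '_')) // env vars
}

fn main() {
    assert!(is_noise("--workspace"));
    assert!(is_noise("src/main.rs"));
    assert!(is_noise("CARGO_TARGET_DIR"));
    assert!(!is_noise("parse_rust_use")); // a real symbol passes through
    println!("ok");
}
```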

All phases integrated into repo-intel init/update pipeline. 142 tests,
clippy clean. Validated on 16 repos across Rust (86% match), TypeScript
(57-73% match), Go, and mixed-language projects.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a major expansion of the repo-intel analyzer by implementing Phases 2, 3, and 4. These phases introduce deep code understanding through AST analysis, comprehensive project metadata collection, and intelligent documentation-code synchronization checks. The enhancements provide a more holistic and actionable view of repository health and structure, moving beyond basic Git intelligence to offer detailed insights into code quality, project setup, and documentation accuracy.

Highlights

  • Phase 2: AST Symbol Extraction: Implemented AST (Abstract Syntax Tree) symbol extraction for Rust, TypeScript, JavaScript, Python, Go, and Java using tree-sitter. This includes extracting exports, imports, definitions, struct/enum fields, interface properties, and class methods. It also computes cyclomatic complexity for functions and detects naming conventions and test patterns.
  • Phase 3: Project Metadata Collection: Added functionality to collect comprehensive project metadata, such as README sections, CI provider detection (GitHub Actions, GitLab CI, etc.), license detection (SPDX from manifests and file pattern matching), language distribution by file extension, and package manager identification.
  • Phase 4: Doc-Code Synchronization Check: Introduced doc-code cross-referencing capabilities. This involves parsing Markdown files to extract inline code references, matching these references against the generated symbol table (including camelCase-to-snake_case conversion), and detecting stale references (deleted, renamed, or referencing 'hotspot' files).
  • CLI Integration and Test Expansion: All new phases are integrated into the existing repo-intel init and update commands. The total number of tests has significantly increased from 77 to 142, ensuring robust coverage for the new features.


@gemini-code-assist (bot) left a comment


Code Review

This is a massive and impressive feature implementation, adding phases 2-4 for AST analysis, metadata collection, and doc-code sync. The changes are extensive and well-structured across new and existing crates. The implementation of the new features looks solid, with good use of tree-sitter for AST parsing and pulldown-cmark for documentation analysis. My review focuses on improving robustness, error handling, and maintainability in a few key areas. I've identified a fragile TOML parsing implementation that should be replaced, some inconsistencies in error handling within the CLI commands, and a few places where errors are silently ignored, which could be improved with warnings. Overall, this is a significant step forward for the project.

Comment on lines +87 to +145
```rust
if let Ok(content) = std::fs::read_to_string(repo_path.join("Cargo.toml")) {
    let mut in_workspace_package = false;
    for line in content.lines() {
        let trimmed = line.trim();

        // Track [workspace.package] section for workspace-level license
        if trimmed.starts_with('[') {
            in_workspace_package = trimmed == "[workspace.package]";
        }

        // Match `license = "MIT"` but not `license.workspace = true`
        if trimmed.starts_with("license") && trimmed.contains('=') {
            // Skip workspace delegation (license.workspace = true)
            if trimmed.starts_with("license.workspace") || trimmed.starts_with("license.path") {
                continue;
            }
            let value = trimmed
                .split('=')
                .nth(1)?
                .trim()
                .trim_matches('"')
                .trim_matches('\'');
            if !value.is_empty() && !value.contains('{') && value != "true" && value != "false" {
                return Some(value.to_string());
            }
        }
    }

    // If this is a workspace member with `license.workspace = true`,
    // check if we found it in [workspace.package] section above
    if !in_workspace_package {
        // Re-scan for [workspace.package] license
        let mut in_ws = false;
        for line in content.lines() {
            let trimmed = line.trim();
            if trimmed == "[workspace.package]" {
                in_ws = true;
                continue;
            }
            if in_ws && trimmed.starts_with('[') {
                break;
            }
            if in_ws && trimmed.starts_with("license") && trimmed.contains('=') {
                if trimmed.starts_with("license.") {
                    continue;
                }
                let value = trimmed
                    .split('=')
                    .nth(1)?
                    .trim()
                    .trim_matches('"')
                    .trim_matches('\'');
                if !value.is_empty() && value != "true" && value != "false" {
                    return Some(value.to_string());
                }
            }
        }
    }
}
```

Severity: high

The current implementation for parsing Cargo.toml uses line-by-line string manipulation. This is fragile and can fail with valid TOML syntax like comments, different string quoting, or key ordering. For example, `license = "MIT" # My license` would not be parsed correctly.

To make this more robust and maintainable, I recommend using a dedicated TOML parser like the toml crate. This would be similar to how serde_json is used for package.json in the same function.

Comment on lines +85 to +89
```rust
eprintln!(
    "[ERROR] No symbol data in map file. Run `repo-intel init` first."
);
std::process::exit(1);
}
```

Severity: medium

Using std::process::exit(1) here (and in the Dependents command and sync_check.rs) bypasses the Result-based error handling of the run function. It's better to return an error and let the main function handle printing it and exiting. This improves consistency and makes the function more reusable. You can use anyhow::bail! for this.

Suggested change
```diff
-eprintln!(
-    "[ERROR] No symbol data in map file. Run `repo-intel init` first."
-);
-std::process::exit(1);
-}
+anyhow::bail!("[ERROR] No symbol data in map file. Run `repo-intel init` first.");
```

Comment on lines +120 to +123
```rust
fn load_map(path: &PathBuf) -> Result<RepoIntelData> {
    let json = std::fs::read_to_string(path)?;
    Ok(serde_json::from_str(&json)?)
}
```

Severity: medium

This load_map function is duplicated in crates/analyzer-cli/src/commands/sync_check.rs. The other version provides more helpful error context using anyhow::with_context. To improve error messages and reduce code duplication, you should use the more descriptive version and centralize it. Consider moving this helper function to a shared module like crates/analyzer-cli/src/commands/utils.rs.

Suggested change
```diff
-fn load_map(path: &PathBuf) -> Result<RepoIntelData> {
-    let json = std::fs::read_to_string(path)?;
-    Ok(serde_json::from_str(&json)?)
-}
+fn load_map(path: &PathBuf) -> Result<RepoIntelData> {
+    let json = std::fs::read_to_string(path)
+        .with_context(|| format!("failed to read map file: {}", path.display()))?;
+    serde_json::from_str(&json).context("failed to parse map JSON")
+}
```

Comment on lines +80 to +81
```rust
})
.ok();
```

Severity: medium

The .ok() call here silently swallows any I/O errors that might occur during walk::walk_files. This could lead to incomplete language statistics without any indication that something went wrong. It would be more robust to propagate the error.

I recommend changing detect_languages to return a Result<Vec<LanguageInfo>> and using ? to propagate the error from walk_files. The caller, collect_metadata, already returns a Result, so it can handle the propagated error.

```rust
    .ok();
```

Comment on lines +53 to +56
```rust
match extract_file_symbols(&source, *lang) {
    Ok(syms) => Some((rel.clone(), syms)),
    Err(_) => None, // skip files that fail to parse
}
```

Severity: medium

Files that fail to parse are silently skipped. This can hide issues with the parsers or unexpected file contents from the user. It would be better to log a warning to stderr when a file cannot be parsed, so the user is aware of which files were skipped and why.

```rust
match extract_file_symbols(&source, *lang) {
    Ok(syms) => Some((rel.clone(), syms)),
    Err(e) => {
        eprintln!("[WARN] Failed to parse {}: {}", rel, e);
        None
    }
}
```

Comment on lines +357 to +372
```rust
fn parse_rust_use(node: &tree_sitter::Node, source: &[u8]) -> Option<ImportEntry> {
    let text = node.utf8_text(source).ok()?;
    // Simple parse: "use crate::foo::Bar;" or "use std::collections::HashMap;"
    let text = text.trim().trim_start_matches("use ").trim_end_matches(';');
    let parts: Vec<&str> = text.split("::").collect();
    if parts.len() >= 2 {
        let module = parts[..parts.len() - 1].join("::");
        let name = parts.last().unwrap_or(&"").to_string();
        Some(ImportEntry {
            from: module,
            names: vec![name],
        })
    } else {
        None
    }
}
```

Severity: medium

The parse_rust_use function uses a simple string split to parse use declarations. This is fragile and will not correctly handle more complex but valid use statements, such as grouped imports (use std::collections::{HashMap, HashSet};), aliasing (use std::fmt::Result as FmtResult;), or wildcard imports (use std::io::*;). This could lead to an incomplete or incorrect import graph.

A more robust approach would be to traverse the use_declaration node's syntax tree to extract the path components and imported items accurately. Given the scope of this PR, a full rewrite might not be necessary, but it's a significant limitation to be aware of for future improvements.
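As a stopgap short of the full tree-sitter traversal the reviewer suggests, the grouped and aliased forms could be handled at the string level. This is a hedged illustration only; `parse_use` and `alias_target` are hypothetical helpers, and wildcard imports are still not covered:

```rust
// String-level sketch of handling grouped and aliased `use` forms that the
// simple split misses. A production version would walk the tree-sitter
// `use_declaration` node instead; these helper names are illustrative.
fn parse_use(stmt: &str) -> Option<(String, Vec<String>)> {
    let body = stmt.trim().strip_prefix("use ")?.trim_end_matches(';').trim();
    if let Some((module, group)) = body.split_once("::{") {
        // Grouped import: use std::collections::{HashMap, HashSet};
        let names = group
            .trim_end_matches('}')
            .split(',')
            .map(|n| alias_target(n.trim()))
            .collect();
        return Some((module.to_string(), names));
    }
    // Plain or aliased import: use std::fmt::Result as FmtResult;
    let (module, last) = body.rsplit_once("::")?;
    Some((module.to_string(), vec![alias_target(last)]))
}

// For `Result as FmtResult`, record the local alias name.
fn alias_target(item: &str) -> String {
    item.rsplit(" as ").next().unwrap_or(item).trim().to_string()
}

fn main() {
    assert_eq!(
        parse_use("use std::collections::{HashMap, HashSet};"),
        Some(("std::collections".into(), vec!["HashMap".into(), "HashSet".into()]))
    );
    assert_eq!(
        parse_use("use std::fmt::Result as FmtResult;"),
        Some(("std::fmt".into(), vec!["FmtResult".into()]))
    );
    println!("ok");
}
```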

…ands

New repo-intel query types for Phase 2-4 data:
- symbols <file>: AST symbols from the symbol table
- dependents <symbol>: reverse lookup of symbol importers
- stale-docs: symbol-level doc staleness from sync-check
- project-info: project metadata (languages, CI, license)

Consistent query pattern for all consumers - plugins call
repo-intel query <type> instead of separate commands.
@avifenesh merged commit e125281 into main Mar 22, 2026
4 checks passed
